English Amiga Board - View Single Post

paraj · 24 April 2017, 09:43

Quote:

Originally Posted by a/b

muls is 56 on average
2x18 to fetch LW data + 4x4 to align access = 52, so you are about even without all the other instructions

If you are dealing with limited range, say approx. 256x256 *and* you include >>2 in your table, you could use words instead.
2x14 fetch + 2x4 align = 36

Something like:

Code:

	add.w	d0,d0
	add.w	d1,d1
	move.w	d0,d6
	add.w	d1,d0
	move.w	(a5,d0.w),d0
	sub.w	d1,d6
	sub.w	(a5,d6.w),d0

That's 5x4 + 2x14 = 48 so it could work. In the general case (16-bit input data) you would need a rather large table, so doing a muls instead is a good compromise.

Yeah, that could work, but if the input is limited to $100, won't the worst case for MULS.W be 38+2*8 = 54 cycles (46 on average)? In that case we're back to muls being hard to beat.
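
For anyone who wants to check those figures, here is a small C sketch of the 68000 MULS.W timing rule (38+2n clocks, where n is the number of 01/10 patterns in the 16-bit source with a zero appended as LSB). The function names are made up for illustration, and it assumes the range-limited value is the one used as the <ea> source operand:

```c
#include <stdint.h>

/* 68000 MULS.W execution time in clocks, excluding the <ea> fetch:
   38 + 2n, where n = number of 01/10 bit pairs in the 16-bit source
   with a zero appended below the LSB (17 bits in total). */
unsigned muls_w_cycles(uint16_t src)
{
    uint32_t p = (uint32_t)src << 1;        /* 17-bit pattern, LSB = 0 */
    uint32_t t = (p ^ (p >> 1)) & 0xFFFFu;  /* bit i set if bits i and i+1 differ */
    unsigned n = 0;

    while (t) {                             /* count the differing pairs */
        n += t & 1u;
        t >>= 1;
    }
    return 38u + 2u * n;
}

/* Worst case over source values 0..limit-1 (hypothetical helper). */
unsigned muls_w_worst(unsigned limit)
{
    unsigned worst = 0;

    for (unsigned v = 0; v < limit; v++) {
        unsigned c = muls_w_cycles((uint16_t)v);
        if (c > worst)
            worst = c;
    }
    return worst;
}
```

muls_w_worst(0x100) gives 54, and muls_w_cycles(0x5555) gives the unrestricted worst case of 70.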

I was thinking of using the optimization in a bog-standard vector/matrix multiplication routine, so some of the cost could be amortized, but I still have a hard time gaining anything (register pressure is also increased and the code has to be changed to accommodate the optimization).

I.e. when doing X' = MUL(MXX, VX) + MUL(MXY, VY) + MUL(MXZ, VZ), the M?? entries can be premultiplied by 2 or 4 (for word or longword table entries), and the pointer to the square table can be added in at the same time. And as VX/VY/VZ are each used 3 times, we can save 2*2 scalings for those.
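
To make the idea concrete, here is a minimal C model of the quarter-square multiply, assuming a table of x*x/4 that is indexed from its middle (the way a5 is used in the quoted code). RANGE, sq4, init_sq4, qs_mul and row_dot are names made up for this sketch; C's array indexing does the *4 scaling that the assembly version would do by premultiplying:

```c
#include <stdint.h>

/* Illustrative limit: |a + b| and |a - b| must not exceed RANGE,
   which covers inputs of roughly +/-256 as discussed in the quote. */
#define RANGE 512

static int32_t sq4_storage[2 * RANGE + 1];
/* Points at the middle of the table so negative indices are valid. */
static int32_t *const sq4 = sq4_storage + RANGE;

/* Fill the table with floor(x*x / 4), i.e. the >>2 is folded in. */
void init_sq4(void)
{
    for (int32_t x = -RANGE; x <= RANGE; x++)
        sq4[x] = (x * x) / 4;
}

/* a*b = f(a+b) - f(a-b) with f(x) = floor(x*x / 4).
   a+b and a-b have the same parity, so the rounding cancels. */
int32_t qs_mul(int32_t a, int32_t b)
{
    return sq4[a + b] - sq4[a - b];
}

/* One row of X' = MXX*VX + MXY*VY + MXZ*VZ using three table multiplies. */
int32_t row_dot(const int16_t m[3], const int16_t v[3])
{
    return qs_mul(m[0], v[0]) + qs_mul(m[1], v[1]) + qs_mul(m[2], v[2]);
}
```

Call init_sq4() once before the first multiply. The matrix entries are the natural ones to pre-scale and pre-add to the table pointer, since they stay fixed for every point being transformed.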

As a macro it would look something like this, but again the gain seems marginal at best:

Code:

        ; assume \1.l is premultiplied by 4 (amortized cost=2*4/3)
        ;        \2 is premultiplied by 4 and added with a5 (amortized cost=3*9*4/npoints)
        ;        \3 dest
        ;        a6 scratch
        ;        \1 datareg
        ;        \2 any ea
        ;        \3 should be datareg

        move.l  \2, a6          ;  4(1/0) + ea [e.g. 12 for (d16,An), 0 for registers]
        move.l  (a6,\1.w), \3   ; 18(4/0)
        suba.w  \1, a6          ;  8(1/0)
        sub.l   (a6), \3        ; 14(3/0)
                                ; -------
                                ; 44(9/0) + ea + (8/3 + 108/npoints ~ 3)
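
Plugging numbers into that last line: a small C sketch that takes the 44-cycle core and the two amortized terms straight from the macro comments. table_mul_cycles and its parameter names are hypothetical, and ea_cycles stands for the extra cost of fetching the second macro argument:

```c
/* Amortized 68000 cycles per table multiply, following the
   accounting in the macro comments (not measured on hardware). */
double table_mul_cycles(double ea_cycles, double npoints)
{
    const double core = 4.0 + 18.0 + 8.0 + 14.0;   /* the four instructions: 44 */
    const double vec_scale = 2.0 * 4.0 / 3.0;      /* vector scaling, shared by 3 uses */
    const double mat_setup = 3.0 * 9.0 * 4.0 / npoints; /* matrix setup, once per batch */

    return core + ea_cycles + vec_scale + mat_setup;
}
```

With the second argument already in a register (ea = 0) and 1000 points, this comes to about 46.8 cycles per multiply, so roughly the same as the 46-cycle MULS average mentioned above.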