Quote:
Originally Posted by a/b
muls is 56 on average
2x18 to fetch LW data + 4x4 to align access = 52, so you are about even without all the other instructions
If you are dealing with limited range, say approx. 256x256 *and* you include >>2 in your table, you could use words instead.
2x14 fetch + 2x4 align = 36
Something like:
Code:
add.w d0,d0
add.w d1,d1
move.w d0,d6
add.w d1,d0
move.w (a5,d0.w),d0
sub.w d1,d6
sub.w (a5,d6.w),d0
That's 5x4 + 2x14 = 48, so it could work. In the general case (16-bit input data) you would need a rather large table, so doing a muls instead is a good compromise.
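For reference, here is the identity the routine above relies on. This is a sketch under the assumption that the table at a5 holds word entries q(n) = floor(n^2/4), indexed by the doubled sum and difference; the name q is shorthand introduced here, not something from the quoted post.

```latex
\begin{aligned}
q(n) &= \left\lfloor \tfrac{n^2}{4} \right\rfloor \\
q(a+b) - q(a-b) &= \frac{(a+b)^2 - (a-b)^2}{4} = \frac{4ab}{4} = ab
\end{aligned}
```

The floor does no harm because a+b and a-b always have the same parity, so both squares leave the same remainder mod 4 (0 if even, 1 if odd) and the dropped fractions cancel in the subtraction.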
Yeah, that could work, but if the input is limited to $100, won't the worst case for MULS.W be 38+8*2 = 54 cycles (46 on average)? In that case we're back to muls being hard to beat.
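The arithmetic behind those figures, as given in the 68000 timing tables: MULS.W takes 38+2n cycles plus effective-address time, where n is the number of 01 or 10 bit pairs in the source operand with a zero appended as the LSB. The n = 8 and n = 4 cases below are the worst-case and typical counts assumed above for an 8-bit positive magnitude. They are not measured figures.

```latex
\begin{aligned}
T_{\text{MULS}} &= 38 + 2n \\
n = 8:\quad T &= 38 + 16 = 54 \\
n = 4:\quad T &= 38 + 8 = 46
\end{aligned}
```

Note that n is counted on the source (ea) operand only, so the limited-range value has to be the source for this bound to apply.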
I was thinking of using the optimization in a bog-standard vector/matrix multiplication routine, so some of the cost could be amortized, but I still have a hard time gaining anything (register pressure is also increased and the code has to be changed to accommodate the optimization).
I.e. when doing X' = MUL(MXX, VX) + MUL(MXY, VY) + MUL(MXZ, VZ), the M?? entries can be premultiplied by 2 or 4 (depending on the table entry size), and the pointer to the square table could also be added in. And as VX/VY/VZ are each used 3 times, we can save 2*2 scalings for those.
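Written out, with q(n) = n^2/4 standing in for a square-table lookup (q is just shorthand here, and the table is assumed to hold those values), one output component becomes:

```latex
\begin{aligned}
X' &= M_{XX}V_X + M_{XY}V_Y + M_{XZ}V_Z \\
   &= \sum_{i \in \{X,Y,Z\}} \bigl[\, q(M_{Xi} + V_i) - q(M_{Xi} - V_i) \,\bigr]
\end{aligned}
```

Each product costs two lookups. The scaled V_i are shared by X', Y' and Z', and the nine prepared matrix entries are shared by every point transformed with that matrix, which is where the amortization comes from.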
Something like this, but again the gain seems marginal at best:
Code:
; assume \1.l is premultiplied by 4 (amortized cost=2*4/3)
; \2 is premultiplied by 4 and added with a5 (amortized cost=3*9*4/npoints)
; \3 dest
; a6 scratch
; \1 datareg
; \2 any ea
; \3 should be datareg
move.l \2, a6 ; 4(1/0) + ea [e.g. 12 for (d16,An), 0 for registers]
move.l (a6,\1.w), \3 ; 18(4/0)
suba.w \1, a6 ; 8(1/0)
sub.l (a6), \3 ; 14(3/0)
; -------
; 44(9/0) + ea + (8/3 + 108/npoints ~ 3)
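To show how this might slot into the X' row, here is a rough sketch that wraps the four instructions in a macro and uses it three times. The macro name QMUL, the register assignments and the memory layout (three prepared longwords per matrix row, read through a0) are all hypothetical. It also assumes Devpac/vasm-style macro syntax and a longword table of n^2/4 addressed from its centre.

```asm
; Hypothetical wrapper around the four instructions above.
;   \1 = data register holding 4*V (word)
;   \2 = ea yielding table_centre + 4*M (longword)
;   \3 = destination data register
;   a6 = scratch
QMUL    macro
        move.l  \2,a6
        move.l  (a6,\1.w),\3        ; q(M+V)
        suba.w  \1,a6
        sub.l   (a6),\3             ; minus q(M-V), leaving M*V
        endm

; X' = MXX*VX + MXY*VY + MXZ*VZ
;   d0/d1/d2 = 4*VX / 4*VY / 4*VZ (assumed to be set up once per point)
;   a0 -> three prepared longwords for this matrix row (assumed layout)
        QMUL    d0,(a0)+,d3
        QMUL    d1,(a0)+,d4
        add.l   d4,d3
        QMUL    d2,(a0)+,d4
        add.l   d4,d3               ; d3 = X'
```

The extra add.l per term and the second scratch register (d4) are part of the register pressure mentioned above, so they count against the gain as well.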