Regarding the inner loop itself: if you want to speed that up, note that each ADDX operation eats up two pipeline slots. The ADD/ADDX/ADDX scheme therefore costs as much as 5 ADDs, or as much as 3 ADDs + 2 shifts would.
Here is an example of using 1 ADD + 1 ADDX, combined with some creative packing to reduce the amount of shifting/masking. First, the linear flow, without rolling the loop into itself:
Code:
; d1 (step) & d0 (counter) vvvvvv-- -------- CCCCcccc cc------
; d3 (step) & d2 (counter) UUUUUUuu uuuu---- -------- --VVVVVV
add.l d1,d0
addx.l d3,d2
move.l d2,d4
rol.l #6,d4
and.l d6,d4 ; d6 = $00000fff
move.w d0,d5
lsr.w d7,d5 ; d7 = 16-4 ; d5.w will be in the range $0000 ... $000f
or.b (a0,d4.l),d5
move.b d5,(a1)+
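To make the packing easier to follow, here is a rough C model of one pass through the linear flow. It is only a sketch of how I read the bit layouts in the comments: the helper names (`pack_d0`, `pack_d2`, `linear_step`) and the `Regs` struct are made up for illustration, and I am assuming a 64x64 texture whose texels keep their low 4 bits clear so the shade can be OR'ed in.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers that pack fields into the layouts from the comments. */

/* d0/d1 layout: vvvvvv-- -------- CCCCcccc cc------ */
static uint32_t pack_d0(uint32_t v_frac, uint32_t c_int, uint32_t c_frac)
{
    assert(v_frac < 64 && c_int < 16 && c_frac < 64);
    return (v_frac << 26) | (c_int << 12) | (c_frac << 6);
}

/* d2/d3 layout: UUUUUUuu uuuu---- -------- --VVVVVV */
static uint32_t pack_d2(uint32_t u_int, uint32_t u_frac, uint32_t v_int)
{
    assert(u_int < 64 && u_frac < 64 && v_int < 64);
    return (u_int << 26) | (u_frac << 20) | v_int;
}

/* d0/d2 are the counters, d1/d3 the steps. */
typedef struct { uint32_t d0, d1, d2, d3; } Regs;

/* One pass of the linear flow; returns the byte stored by move.b d5,(a1)+. */
static uint8_t linear_step(Regs *r, const uint8_t *texture /* a0, 4096 bytes */)
{
    /* add.l d1,d0 : the carry out of bit 31 (v fraction overflow) sets X */
    uint64_t sum = (uint64_t)r->d0 + r->d1;
    uint32_t x = (uint32_t)(sum >> 32);
    r->d0 = (uint32_t)sum;

    /* addx.l d3,d2 : X lands in bit 0, i.e. in the integer part of V */
    r->d2 = r->d2 + r->d3 + x;

    /* move.l d2,d4 / rol.l #6,d4 / and.l with $00000fff : d4 = V*64 + U */
    uint32_t d4 = r->d2;
    d4 = (d4 << 6) | (d4 >> 26);
    d4 &= 0x00000fffu;

    /* move.w d0,d5 / lsr.w #12 : d5 = CCCC, range 0..15 */
    uint32_t d5 = (r->d0 & 0xffffu) >> 12;

    /* or.b (a0,d4.l),d5 */
    return (uint8_t)(texture[d4] | d5);
}
```

As a usage example, start from U=3, V=5, v fraction 48/64 and shade 2, with steps of 0.5 in u, 0.5 in v and 0.25 in shade: `Regs r = { pack_d0(48, 2, 0), pack_d0(32, 0, 16), pack_d2(3, 0, 5), pack_d2(0, 32, 0) };`. The first call wraps the v fraction, so X bumps V from 5 to 6 and the texel is fetched from index 6*64+3.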
Then, when rolling the loop into itself, you get a 5-cycle-per-iteration version:
Code:
; d1 (step) & d0 (counter) vvvvvv-- -------- CCCCcccc cc------
; d3 (step) & d2 (counter) UUUUUUuu uuuu---- -------- --VVVVVV
or.b (a0,d4.l),d5
move.l d2,d4
move.b d5,(a1)+
rol.l #6,d4
and.l d6,d4 ; d6 = $00000fff
move.w d0,d5
lsr.w d7,d5 ; d7 = 16-4 ; d5.w will be in the range $0000 ... $000f
add.l d1,d0
addx.l d3,d2
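Because the rolled loop starts with the `or.b` and `move.b` that finish the previous pixel, d4 and d5 have to be primed before the first iteration. The C model below is a sketch of one way to do that priming and checks that it yields the same pixels as the linear flow. The priming sequence is my assumption, not something taken from the code above, and all function names are hypothetical. The model ignores instruction ordering inside an iteration and only tracks which counter values each pixel ends up sampling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* d0/d2 = counters, d1/d3 = steps, d4 = texel index, d5 = shade */
typedef struct { uint32_t d0, d1, d2, d3, d4, d5; } Regs;

/* add.l d1,d0 / addx.l d3,d2 */
static void step(Regs *r)
{
    uint64_t sum = (uint64_t)r->d0 + r->d1;
    r->d0 = (uint32_t)sum;
    r->d2 = r->d2 + r->d3 + (uint32_t)(sum >> 32);   /* + X */
}

/* move.l d2,d4 / rol.l #6,d4 / and.l d6,d4 / move.w d0,d5 / lsr.w d7,d5 */
static void address_and_shade(Regs *r)
{
    r->d4 = ((r->d2 << 6) | (r->d2 >> 26)) & 0x00000fffu;
    r->d5 = (r->d0 & 0xffffu) >> 12;
}

/* Linear flow: step, compute index and shade, fetch and store. */
static void draw_linear(Regs r, const uint8_t *tex, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        step(&r);
        address_and_shade(&r);
        dst[i] = (uint8_t)(tex[r.d4] | r.d5);
    }
}

/* Rolled flow: each iteration stores the pixel prepared one iteration ago. */
static void draw_rolled(Regs r, const uint8_t *tex, uint8_t *dst, size_t n)
{
    /* Assumed setup logic (not shown in the original listing). */
    step(&r);
    address_and_shade(&r);
    step(&r);

    for (size_t i = 0; i < n; i++) {
        dst[i] = (uint8_t)(tex[r.d4] | r.d5);  /* or.b + move.b */
        address_and_shade(&r);                 /* prepare the next pixel */
        step(&r);                              /* add.l + addx.l */
    }
}

/* Usage: both versions should fill a 32-pixel span identically. */
static void check_equivalence(Regs start, const uint8_t *tex)
{
    uint8_t a[32], b[32];
    draw_linear(start, tex, a, sizeof a);
    draw_rolled(start, tex, b, sizeof b);
    assert(memcmp(a, b, sizeof a) == 0);
}
```

Note that the rolled version computes one index/shade pair and one step more than it uses. That is harmless here, but it is part of the extra bookkeeping that comes with this kind of loop.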
Be aware, though, that the more convoluted the loop is, the more setup logic it needs. For a loop that runs for 32 iterations, ~10 cycles of setup logic is equivalent to 10/32 (about 0.3) cycles spent per pixel in the inner loop itself. For 32 iterations that isn't so bad, but for short runs (think of a general-purpose texture mapper for detailed 3D scenes) these super-optimized inner loops are sometimes not worth it because of the setup overhead.
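The trade-off is easy to put into numbers. The sketch below uses the 5 cycles per iteration and ~10 cycles of setup from above; the "simple" loop it compares against (6 cycles per pixel, 2 cycles of setup) is a purely hypothetical figure picked for illustration, as are the function names.

```c
#include <assert.h>

/* Effective cost per pixel for a span of n pixels, setup amortized over it. */
static double cycles_per_pixel(double loop_cycles, double setup_cycles, int n)
{
    assert(n > 0);
    return loop_cycles + setup_cycles / n;
}

/* Smallest span length at which the fast loop is no slower than the simple
 * one, or -1 if that never happens for spans of up to 4096 pixels. */
static int break_even_pixels(double fast_loop, double fast_setup,
                             double simple_loop, double simple_setup)
{
    for (int n = 1; n <= 4096; n++) {
        if (cycles_per_pixel(fast_loop, fast_setup, n) <=
            cycles_per_pixel(simple_loop, simple_setup, n))
            return n;
    }
    return -1;
}
```

With these figures, `cycles_per_pixel(5, 10, 32)` gives 5.3125 and `cycles_per_pixel(5, 10, 4)` gives 7.5, while `break_even_pixels(5, 10, 6, 2)` returns 8: under those assumed numbers the optimized loop only pays off for spans of 8 pixels or more.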