English Amiga Board > Coders > Coders. Asm / Hardware
Old 05 August 2022, 19:16   #21
paraj
Registered User
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
Quote:
Originally Posted by bebbo View Post
unfolded the sign handling and added proper stack handling which reduced the object size by 16 bytes. Should be a tad faster now.
Yep, gained 3 cycles in sdiv64. Thinking it might be worthwhile to change the bitfield insertions/extractions into simpler (but more 060-friendly) operations (e.g. in smul64 they take ~14 cycles [counted by just commenting them out]). Maybe I'll try that unless you beat me to it
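For illustration, a minimal sketch (mine, not from the thread) of the kind of replacement meant: a bfextu that extracts the top n bits of a longword, with n in a data register, can become a move plus shift and avoid the bitfield instruction entirely:
Code:
        ; bfextu  (a7){0:d2},d1      ; extract top d2 bits of (a7), right-justified
        ; can be rewritten as (assuming 1 <= d2 <= 31, d0 free as scratch):
        moveq   #32,d0
        sub.l   d2,d0                ; shift count = 32 - field width
        move.l  (a7),d1
        lsr.l   d0,d1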
On a side note when testing I was worried my 060 was dying, but "luckily" it just turned out to be a nearby building on fire
paraj is offline  
Old 06 August 2022, 12:15   #22
paraj
Registered User
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
Thanks to bebbo for sharing an improved version via PM. After massaging it a bit further, the signed multiply function is now at 60 cycles (including overhead):
Code:
_smul64:
        ; 64-bit signed product of d0.l * d1.l, result in d0:d1 (d0 = high
        ; longword); d2/d3 are preserved (stashed in a0/a1 where needed)
        fmove.l d0,fp0
        move.l  d2,a0
        fmul.l  d1,fp0          ; exact product in extended precision

        fmove.x fp0,-(a7)       ; 96-bit extended: sign/exponent word, pad, 64-bit mantissa
        move.w  (a7),d2         ; sign + biased exponent
        addq.l  #4,a7           ; a7 now points at the mantissa
        bmi.s   .Neg64          ; negative product
        sub.w   #16382+32,d2    ; derive the shift count from the exponent
        ble.s   .L1             ; result fits in a single longword

        move.l  (a7)+,d0        ; mantissa high longword
        move.l  d3,a1
        move.l  d0,d3
        lsl.l   d2,d3
        neg.w   d2
        add.w   #32,d2          ; complementary shift (32 - original d2)
        move.l  (a7)+,d1        ; mantissa low longword
        lsr.l   d2,d0
        lsr.l   d2,d1
        move.l  a0,d2
        or.l    d3,d1
        move.l  a1,d3
        rts

.Neg64:
        sub.w   #16382+$8000+32,d2      ; as above, with the sign bit stripped
        ble.s   .L1neg

        move.l  (a7)+,d0
        move.l  d3,a1
        move.l  d0,d3
        lsl.l   d2,d3
        neg.w   d2
        add.w   #32,d2
        move.l  (a7)+,d1
        lsr.l   d2,d0
        lsr.l   d2,d1
        or.l    d3,d1
        move.l  a1,d3
        neg.l   d1              ; negate the 64-bit result
        negx.l  d0
        move.l  a0,d2
        rts

.L1:
        ; bfextu    (a7){0:d2},d1
        neg.w   d2
        move.l  (a7),d1
        lsr.l   d2,d1

        moveq   #0,d0
        addq.l  #8,a7           ; drop the rest of the extended value
        move.l  a0,d2
        rts

.L1neg:
        ; bfextu    (a7){0:d2},d1
        neg.w   d2
        move.l  (a7),d1
        lsr.l   d2,d1

        moveq   #-1,d0
        addq.l  #8,a7

        neg.l   d1
        move.l  a0,d2
        rts
(Instruction ordering can probably be improved further)

Last edited by paraj; 06 August 2022 at 17:54. Reason: Fixed error spotted by a/b, and useless moveq (Don_Adan)
paraj is offline  
Old 06 August 2022, 14:18   #23
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,039
Shouldn't move.l d2,a0 in .L1 and .L1Neg be move.l a0,d2?
a/b is offline  
Old 06 August 2022, 14:40   #24
paraj
Registered User
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
Quote:
Originally Posted by a/b View Post
Shouldn't move.l d2,a0 in .L1 and .L1Neg be move.l a0,d2?
Thanks, yes. They used to be exg's (which is pOEP-only) and I didn't fix them up properly.
paraj is offline  
Old 06 August 2022, 16:52   #25
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,959
Quote:
Originally Posted by paraj View Post
Thanks to bebbo for sharing an improved version via PM. After massaging it a bit further, the signed multiply function is now at 60 cycles (including overhead):
Code:
_smul64:
        fmove.l d0,fp0
        move.l  d2,a0
        fmul.l  d1,fp0

;        moveq   #0,d2
        fmove.x fp0,-(a7)
        move.w  (a7),d2
        ; [rest of the routine as posted above]
For me the moveq #0,d2 command is useless, or I missed something.
Don_Adan is offline  
Old 06 August 2022, 17:36   #26
paraj
Registered User
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
Quote:
Originally Posted by Don_Adan View Post
For me the moveq #0,d2 command is useless, or I missed something.
It is indeed useless now that all the bitfield operations have been removed, thanks.
Also, for the int64_t/int32_t (sdiv64) case it's faster to just use the FPU to do the float->int conversion (a 30-cycle improvement over bebbo's improved version, which is at 130 cycles). I.e. for d0:d1/d2 -> d0:
Code:
        ; d0>0 case (for d0 < 0, negate d0:d1 before and d0 after division, checking for d0=0 may or may not be worth it)
        tst.l   d1
        bpl.s   .L1
        addq.l  #1,d0 ; add 2**32 (can't overflow if result is <= 2**31)
.L1:
        fmove.l d0,fp0
        fmul.s  #$4f800000,fp0 ; 2**32
        fadd.l  d1,fp0
        fdiv.l  d2,fp0
        fintrz.x fp0
        fmove.l fp0,d0
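For completeness, a sketch of the d0 < 0 handling the comment above alludes to (my assumption, not paraj's actual code; d3 is used as a scratch sign flag here):
Code:
        moveq   #0,d3
        tst.l   d0
        bpl.s   .DoDiv
        moveq   #1,d3           ; remember to negate the quotient afterwards
        neg.l   d1              ; negate the 64-bit dividend d0:d1
        negx.l  d0
.DoDiv:
        ; <the FPU sequence above, leaving the quotient in d0>
        tst.l   d3
        beq.s   .Done
        neg.l   d0
.Done: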
paraj is offline  
Old 08 August 2022, 18:38   #27
paraj
Registered User
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
Where were we? Oh right, optimizing texture mappers for the 060. I've had my eye on getting a one-frame (mostly) fullscreen effect done for some time. Trying to replicate the free-directional tunnel/planes effects in hotstyle takeover (only 20 years late), I came up with this innerloop for use with a 64x64 texture (taking up half of the data cache) and 16 shades for each color (8 seems to give too much banding, and 32 leaves too few colors for the texture). "v" is only 6.6 fixpoint (the rest are .16), which is not great. The whole thing is very imprecise though (due to the 32x16 interpolation grid), so that might not matter too much. AFAIK the demo also used that grid size, but I'm not sure why my version seems to look much worse. I already tried tweaking the calculations a bit to hide the worst artifacts, but haven't spent much time on it. In the demo it's mostly noticeable in the planes part, but looks quite OK (it also goes by quickly).
Code:
        ; d6 = ufrac    | v
        ; d1 = zfrac    | uint
        ; d2 = -------- | zint  (<=15)
        ; a2 = dudxfrac | dvdx
        ; d4 = dzdxfrac | dudxint
        ; d5 = -------- | dzdxint
        ; d7 = 63*64
        ; a0 = output buffer
        ; a1 = texture (only upper bits are set in the texels)
        move.l  d6,d0           ; 1
        and.l   d7,d0

        and.w   #63,d1          ; 2
        move.b  d2,d3

        or.b    d1,d0           ; 3
        add.l   a2,d6

        addx.l  d4,d1           ; 4

        addx.b  d5,d2           ; 5

        or.b    (a1,d0.l),d3    ; 6
        ; -

        move.b  d3,(a0)+        ; 7
        ; - (cmp.l)
For reasons I don't understand, changing "and.l d7,d0" in cycle 1 sOEP to "andi.l #63*64,d0" makes the loop take an extra cycle. I haven't found a way of shaving off any more, but maybe I'm missing something (increasing v-precision while keeping the number of cycles the same would also be good).
I've attached the complete source; it can be compiled with vc +aos68k -o boxi.exe -O3 -cpu=68060 -fpu=68060 -DNDEBUG boxi.c boxiasm.s demosys.c -I$NDK -lm060 -lamiga -lauto (should also work with gcc).
On my Blizzard 1260/50MHz it runs in one frame (at 320x208) with "Inner" reported as taking 141.4ns (7.1 cycles) per pixel. At 320x256 I get 40 FPS. The way c2p/effect overlapping is done might be the bottleneck here - I suspect there aren't enough cycles between the first part of the C2P writing 4 long words to the store buffer and the later handling of the final four (but overlapping that is already a bit complicated, and moving them slightly later didn't give any improvement).

Sorry in advance for the nausea-inducing effect
Attached Files
File Type: zip boxi.zip (25.4 KB, 40 views)
paraj is offline  
Old 08 August 2022, 21:15   #28
ross
Defendit numerus
 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
Nice!

(apart from a slight stomach ache)
ross is offline  
Old 08 August 2022, 22:09   #29
Kalms
Registered User
 
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
Fun stuff!

Quote:
Originally Posted by paraj View Post
For reasons I don't understand changing "and.l d7,d0" in cycle 1 sOEP to "andi.l #63*64,d0" makes the loop take an extra cycle.
Looking at boxiasm.s, _boxi_inner_time, it looks like you are using more I-cache than what is good for you there. Your original loop (with "and.l d7,d0") is 26 bytes in size. When you then unroll that 320 times, you end up with 8320 bytes of linear code that the CPU will plow through when generating a line of output. This probably results in a couple of cachelines' worth of code being ping-ponged in/out of the cache for each iteration through the outer loop.

When you switch to "andi.l #63*64,d0", your innerloop becomes larger - 30 bytes per pixel yields a total of 9600 bytes of innerloop, which ought to result in ~1.4kB being ping-ponged in/out of the cache for each iteration through the outer loop. Perhaps that is the reason why you see a performance degradation?

Generally speaking, it is usually good to unroll a couple of times, say 8x-32x; this can reduce time spent doing loop control. However, even if the unrolled code fits into the cache, it still needs to be fetched once. Therefore, there is a sweet spot past which excessive unrolling makes the whole-frame time slower because the time saved from eliminating loop control instructions is instead spent doing memory reads for filling the I-cache during the first iteration through the loop.
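As a concrete sketch of that middle ground (mine, with illustrative counts), in the rept/dbf style used elsewhere in this thread:
Code:
        ; 16x unroll: ~16*26 = 416 bytes of loop body, a small fraction of the
        ; 8 KB I-cache, and the dbf overhead is paid only once per 16 pixels
        move.w  #320/16-1,d0
.pix16:
        rept    16
        ; <per-pixel work, ~26 bytes>
        endr
        dbf     d0,.pix16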

When I look at _boxi_inner it seems that you are not unrolling 320 times there though. Therefore, I think that the perf difference you saw in your measurement loop was not something you observed in the "real" loop.

I'm not sure what to do to get performance up regarding the C2P parts. One thing you can try is to eliminate any logic that doesn't affect memory accesses and check whether that improves performance; if it does, it is a hint that you could theoretically get to that performance if you positioned your memory accesses better. For example, erase the contents of the C2P_STEP macro and check if that causes a performance difference.
You could also try removing texture interpolation logic, keeping only memory accesses, to get another measure of theoretical max throughput.
Kalms is offline  
Old 08 August 2022, 22:40   #30
Kalms
Registered User
 
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
Regarding the innerloop itself: if you want to speed that up, note that the ADDX operations eat up two pipeline slots each. The ADD/ADDX/ADDX scheme costs as much as 5 ADDs (or 3 ADDs + 2 shifts) would.

Here is an example of using 1 ADD + 1 ADDX, combined with some creative packing to reduce the amount of shifting/masking. First, the linear flow, without rolling the loop into itself:

Code:
; d1 (step) & d0 (counter)	vvvvvv-- -------- CCCCcccc cc------
; d3 (step) & d2 (counter)	UUUUUUuu uuuu---- -------- --VVVVVV

	add.l	d1,d0

	addx.l	d3,d2

	move.l	d2,d4
	rol.l	#6,d4

	and.l	d6,d4		; d6 = $00000fff
	move.w	d0,d5

	lsr.w	d7,d5		; d7 = 16-4			; d5.w will be in the range $0000 ... $000f

	or.b	(a0,d4.l),d5

	move.b	d5,(a1)+
Then, when rolling the loop into itself, you get a 5-cycle-per-iteration version:

Code:
; d1 (step) & d0 (counter)	vvvvvv-- -------- CCCCcccc cc------
; d3 (step) & d2 (counter)	UUUUUUuu uuuu---- -------- --VVVVVV

	or.b	(a0,d4.l),d5
	move.l	d2,d4

	move.b	d5,(a1)+
	rol.l	#6,d4

	and.l	d6,d4		; d6 = $00000fff
	move.w	d0,d5

	lsr.w	d7,d5		; d7 = 16-4			; d5.w will be in the range $0000 ... $000f
	add.l	d1,d0

	addx.l	d3,d2
Be aware, though, that the more convoluted the loop is, the more setup logic is necessary -- and for a loop that will be run for 32 iterations, ~10 cycles of setup logic is equivalent to 10/32 cycles spent per pixel in the innerloop itself. For 32 iterations it isn't so bad, but for short runs (think of a general-purpose texturemapper for detailed 3D scenes) these super-optimized innerloops are sometimes not worth it due to the setup overhead.
Kalms is offline  
Old 09 August 2022, 14:01   #31
paraj
Registered User
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
Wow, thanks a lot, that's amazing! Humbling to see that such a massive improvement is possible. Clearly I still have much to learn. That brings the FPS at 320x256 to 46, and 320x240 is 99.6% one-frame (3-4 frames missed out of 1000). While refactoring I seem to have introduced an issue (or perhaps it's lack of precision) where C (Z) seems to "bleed" into V in some way that I'll have to debug, but this is clearly a massive step forward! Also, great tips on placing memory accesses.

Regarding the "andi.l" thing, I tried both rolled and unrolled and various sizes and it seems pretty consistent. I've done some more measurements, and one thing I forgot about was branch target alignment (noticed this when I timed individual instructions and e.g. add was faster than sub!). I must have missed one or more additional details in MC68060UM because it seems like using immediate operands (for otherwise 0.5 cycle instructions) incurs an extra cost. Perhaps some part of the instruction fetch machinery isn't able to keep up with the execution units.
Instructions were timed like this:
Code:
        cnop    0,8
.loop:
        ; <instruction sequence>
        subq.l  #1,d0
        bne.b   .loop

; unrolled
        cnop    0,8
.loop:
        rept UNROLLX ;50
        ; <instruction sequence>
        endr
        subq.l  #1,d0
        bne.w   .loop

; sample instruction sequence (all are 3 operations)
        add.l   d1,d2
        add.l   d1,d3
        add.l   d1,d4
; mix1
        move.l    d1,d2
        andi.l    #42,d3
        addq.l    #1,d4
; mix2
        move.l    d1,d2
        and.l     d5,d3
        addq.l    #1,d4
Results (numbers are cycles per repetition of the 3-instruction sequence)
Code:
Instruction        Looped   Unrolled
addq                 2.0        1.5
addi                 5.0        4.5
addiw                4.0        3.0
add                  2.0        1.5
subq                 2.0        1.5
subi                 5.0        4.5
sub                  2.0        1.5
andi                 5.0        4.5
andiw                4.0        3.0
and                  2.0        1.5
eori                 5.0        4.5
eoriw                4.0        3.0
eor                  2.0        1.5
addx                 4.0        3.0
mul                  7.0        6.0
muli                 10.0       9.0
movei                5.0        4.5
moveiw               4.0        3.0
move                 2.0        1.5
mix1                 3.0        2.5
mix2                 2.0        1.5
Attached is the test code (warning: disables interrupts+dma while tests are running)
EDIT: Added nofpu version of test.
Attached Files
File Type: zip timing.zip (12.7 KB, 28 views)
File Type: 68k timing_nofpu.68k (42.9 KB, 25 views)

Last edited by paraj; 09 August 2022 at 14:28.
paraj is offline  
Old 09 August 2022, 23:04   #32
Kalms
Registered User
 
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
Quote:
Originally Posted by paraj View Post
While refactoring I seem to have introduced an issue (or perhaps it's lack of precision) where C (Z) seems to "bleed" into V in some way that I'll have to debug, but this is clearly a massive step forward!
If there's bleed (and not just lack of precision), it is either that the V fractional bits are getting clobbered by some other value, or the range of bits between the lowest V fractional bit and the highest C integer bit has a non-zero value in the "step" register. Perhaps the C integer value is getting sign-extended into some/all of those bits?

Regarding andi.l, if it wasn't down to code size, and you are not getting lots of cache misses, then yes, you are probably getting instruction fetch limited.


Instruction fetch works like this:

The I-cache can supply one aligned longword of code bytes each cycle.

The IFU will fetch longword after longword of code bytes, using the branch prediction logic to determine whether to just walk linearly ahead or to skip somewhere else next time. Branch prediction information is associated with the instruction before the branch itself, so if an instruction is marked as "the following instruction is predicted to be a taken branch", then the branch instruction itself will not be fetched by the IFU.

The IFU then feeds fetched results through "Early Decode" (IED), which takes a variable number of words as input, and spits out 6-byte packets. These 6-byte packets go into the Instruction Buffer (IB), a 16-entry FIFO. Instructions smaller than 6 bytes still turn into 6-byte packets. Instructions larger than 6 bytes turn into multiple 6-byte packets.

The Integer Unit, with its two pipelines, attempts to pull two 6-byte packets from the Instruction Buffer each cycle and feed them into pOEP / sOEP if possible.

Now -- if you have a stream of instructions that pair (example: add.l d0,d1 / add.l d2,d3 / unrolled 16x) then the Integer Unit will process 2 instructions / cycle, and thus the IFU needs to enqueue 2 instructions / cycle to keep the Integer Unit busy. The IFU will succeed at this if the instructions are 2 bytes each (since 2 instructions * 2 bytes/instruction = 4 bytes, and the I-cache can sustain 4 bytes/cycle).

However, if these instructions that pair are larger than 2 bytes, you will become instruction fetch limited. The worst example is if you use 32-bit immediates (example: add.l #$12345678,d0 / add.l #$87654321,d1) -- the I-cache will spend 3 cycles fetching this instruction pair, so the IFU will submit on average one instruction pair every 3 cycles, even though the Integer Unit could execute such a pair every cycle. What will happen is that the IFU submits 0,1,1,0,1,1,0,1,1 ... instructions per cycle, and the Integer Unit will snatch an instruction from the FIFO as soon as it is enqueued, running that instruction through the pOEP with the sOEP idle.
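Restating that with the same numbers as a side-by-side sketch (illustrative, not from a real loop):
Code:
; fetch limited: two 6-byte instructions = 12 bytes; at 4 bytes/cycle the
; I-cache needs 3 cycles to deliver what the OEPs could execute in 1
        add.l   #$12345678,d0
        add.l   #$87654321,d1

; not fetch limited: two 2-byte instructions = 4 bytes; the pair is fetched
; and executed in a single cycle (pOEP + sOEP)
        add.l   d2,d0
        add.l   d3,d1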

In your synthetic benchmark, the "x_andi" case is 3x ANDI.L unrolled 50 times. This is 6 bytes per instruction of pairable ops - so that case will be IFU/I-cache limited, and the 1.5 cycle/instruction is effectively a measure of instruction fetch speed from the I-cache.


In a more complex scenario, like the texturemapper, any Integer Unit stall (processing a slow instruction - like a DIVS/DIVU - or a data cache read miss) will pause the Integer Unit's processing of instructions, but the IFU can continue to do instruction fetch, early decode, and fill the FIFO. This buys the IFU time in the future, thereby hiding subsequent IFU throughput problems ... but, if your texturemapper really is having no data cache misses and is using all pairable instructions, then you will eventually become IFU throughput bound: the Integer Unit is processing instructions as quickly as or quicker than the I-cache can deliver them to the IFU.

If we go back to your 7-cycle example, 7 cycles allows the I-cache to deliver 7*4=28 bytes. The original version of your 7-cycle example is 26 bytes in size, but with ANDI.L it becomes 30 bytes. If you really have zero d-cache misses, then the I-cache won't be able to sustain the throughput you need; it needs an extra 0.5 cycles per iteration on average.
Kalms is offline  
Old 10 August 2022, 10:30   #33
paraj
Registered User
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
Quote:
Originally Posted by Kalms View Post
If there's bleed (and not just lack of precision), it is either that the V fractional bits are getting clobbered by some other value, or the range of bits between the lowest V fractional bit and the highest C integer bit has a non-zero value in the "step" register. Perhaps the C integer value is getting sign-extended into some/all of those bits?
Finally figured out the issue. Had a couple of bugs in my setup code and an annoying precision issue that took some time to figure out. There was a line where lz=$1AB0 and rz=$76 (in 16.16 fixpoint). That turned into z=$1ab and dzdx=-14, meaning that the 31st pixel had z=-7, which was very noticeable. It didn't happen if I used more precision than 4.12 (like the 4.16 I used before; 4.13 was also enough). Luckily the solution was simple: make sure to properly round dzdx.
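Worked through with those numbers (my arithmetic, assuming the truncated step is rounded toward minus infinity):
Code:
; lz = $1AB0, rz = $76 (16.16)  ->  4.12: lz = $1AB (427), rz = $7
; truncated:  dzdx = ($7 - $1AB) / 32 = -13.125 -> -14
;             pixel 31: z = $1AB + 31*(-14) = 427 - 434 = -7    (goes negative)
; rounded:    dzdx = round(-13.125) = -13
;             pixel 31: z = $1AB + 31*(-13) = 427 - 403 = 24    (stays in range)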
Quote:
Originally Posted by Kalms View Post
Instruction fetch works like this:
That explains it perfectly. Another thing to consider when creating synthetic/microbenchmarks. It also made me look at the code again, and it turned out that replacing the "andi.l #mask,dN" with "and.l mask(sp),dN" in C2P_STEP was worth around 2fps at 320x256 (bringing it over 48). When I have some more time I'll have to see if I can find the final ~37K cycles/frame to bring it to 50fps
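For reference, the shape of that change (a sketch; the real masks and stack offsets are in the attached source - $55555555 and "mask" here are just stand-ins):
Code:
;       and.l   #$55555555,d0        ; 6 bytes per instruction: fetch limited
        and.l   mask(sp),d0          ; 4 bytes: the constant comes from the stack
                                     ; (a d-cache hit) instead of the I-stream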

Once again thanks a bunch for the detailed information! It was extremely helpful, and there's not much info on the 060 to find on the net (except mostly posts by you and Blueberry).

Attached is the latest version, including an 020/nofpu-compatible version. It's a bit slow on an unexpanded A1200 (0.6 fps).
Attached Files
File Type: zip boxi_v2.zip (46.8 KB, 30 views)
paraj is offline  
Old 12 August 2022, 14:54   #34
paraj
Registered User
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
Cycles found. 52.6 fps at 320x256 (B1260@50MHz), gonna say that's good enough and call it a day. If I were to use it in some way, I'd probably have to spend quite a bit of time tweaking the effect to hide the lack of precision from the huge interpolation grid distance, and find someone with graphic skills to create a decent 64x64 16-color texture.

Final speedup came from interleaving the last 2 chip writes with the inner loop. Did some rough estimation of when there would be room in the store buffer, and that turned out to work. Played around a bit more with the placement of the last 4 chip writes, but didn't see any improvement.
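The shape of that interleaving, as a rough sketch (mine - the actual placement, offsets and which stack slots hold the deferred longwords are in the attached source):
Code:
        rept    8
        ; <inner loop body>
        endr
        move.l  pdata+8(sp),2*ROWBYTES(a5)   ; deferred C2P longword; by now the
                                             ; 4-entry store buffer should have drained
        rept    8
        ; <inner loop body>
        endr
        move.l  pdata+12(sp),3*ROWBYTES(a5)  ; last deferred longword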

EDIT: If you have a different Amiga setup and try it, please post results; the "nofpu" version should run on any AGA machine. 040/060LC results might be interesting.
Attached Files
File Type: zip boxi_final.zip (48.5 KB, 38 views)

Last edited by paraj; 12 August 2022 at 19:22.
paraj is offline  
Old 15 August 2022, 21:09   #35
Bruce Abbott
Registered User
 
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,546
Quote:
Originally Posted by paraj View Post
If you have a different Amiga setup and try it, please post results; the "nofpu" version should run on any AGA machine.
A1200 with Blizzard 1230-IV (50MHz 030+FPU, 60ns RAM) here.

boxi.exe: 6.21 fps.
boxi-nofpu: 4.13 fps.

I also did some experiments in an attempt to see where the bottleneck was (obviously not the FPU).

Allocating bitplanes in FastRAM: 6.8 fps.

So writing to ChipRAM is not the bottleneck.

No c2p code at all (_boxi = rts): 50.08 fps.

Aha!

Remove c2p register manipulations, copying to FastRAM: 27.81 fps

Same but copying to ChipRAM: 22.36 fps.

I haven't figured out exactly what all the code does so I probably left in more than I could have. How would I modify your code to simply do each of the following (not caring that the display would be scrambled)?

1. Render the 'chunky' pixels directly to ChipRAM.

2. Copy the rendered pixels from FastRAM to ChipRAM.

The reason I want to do this is to evaluate the relative speed of planar vs chunky graphics, as if chunky had been implemented in the AGA chipset or could be added via some external hardware.
Bruce Abbott is offline  
Old 15 August 2022, 21:56   #36
paraj
Registered User
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
Quote:
Originally Posted by Bruce Abbott View Post
[...] How would I modify your code to simply do each of the following (not caring that the display would be scrambled)?

1. Render the 'chunky' pixels directly to ChipRAM.

2. Copy the rendered pixels from FastRAM to ChipRAM.
Quick version: The bulk of the work happens in boxi_scanline. I think "_boxi" in itself should be reasonably straightforward - it just interpolates a 32x16 grid cell with (u,v,z) provided for each corner.
The idea is to calculate 32 chunky pixels and do C2P on those while previous stores to chip ram complete. Since the store buffer can only hold 4 longwords, the C2P part after the main loop is structured to finish 4 LWs ASAP and get them on their way (almost at the end of boxi_scanline). This leaves 4 more LWs, which are stored in "pdata(sp)". Except for the final iteration, the first two are stored during the setup for the main loop (move.l (a3)+,a2 ... move.l a2,X*ROWBYTES(a5)) and the last two after 8 and 16 pixels respectively - see the rept 8 INNER part. The code is tuned for what's fast on my 060, so it's not surprising that it's not great on a 030 (though I would have hoped it was faster).

So if I understand you correctly, you should get rid of the chip ram stores in the setup/mainloop (a3+ -> a2/3 -> X(a5)) and the C2P part (after the INNER stuff). That should leave you with a loop that just renders pixels to fast ram (pixels(sp)). If you want to render directly to chip ram (doing byte accesses), change the "lea pixels(sp),a0" to "move.l a6,a0" just before the main loop. If you just want to render to fast RAM and then copy, do that after the main loop instead.

If you want, I can share earlier versions of the code (that were less optimized/060 specific) if that helps (PM me, and I'll just send you a copy of the git repo).
paraj is offline  
 

