English Amiga Board - View Single Post

Kalms · 08 August 2022, 22:40

Regarding the inner loop itself: if you want to speed it up, note that each ADDX operation eats up two pipeline slots. The ADD/ADDX/ADDX scheme therefore costs 1 + 2 + 2 = 5 slots, as much as 5 ADDs, or 3 ADDs + 2 shifts, would.

Here is an example of using 1 ADD + 1 ADDX, combined with some creative packing to reduce the amount of shifting/masking. First, the linear flow, without rolling the loop into itself:

Code:

; d1 (step) & d0 (counter)	vvvvvv-- -------- CCCCcccc cc------
; d3 (step) & d2 (counter)	UUUUUUuu uuuu---- -------- --VVVVVV

	add.l	d1,d0

	addx.l	d3,d2

	move.l	d2,d4
	rol.l	#6,d4

	and.l	d6,d4		; d6 = $00000fff
	move.w	d0,d5

	lsr.w	d7,d5		; d7 = 16-4			; d5.w will be in the range $0000 ... $000f

	or.b	(a0,d4.l),d5

	move.b	d5,(a1)+
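
To make the packing easier to follow, here is a small C model of one pass through the linear flow. It is only a sketch: the struct, the function names, and the 6.6 (u, v) and 4.6 (shade) fixed-point formats are assumptions read off the bit-layout comments, and it only handles non-negative steps.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: field names mirror the 68k data registers. */
typedef struct {
    uint32_t d0, d1;   /* counter, step: vvvvvv-- -------- CCCCcccc cc------ */
    uint32_t d2, d3;   /* counter, step: UUUUUUuu uuuu---- -------- --VVVVVV */
} Regs;

/* Pack u, v (assumed 6.6 fixed point) and shade c (assumed 4.6) into a
   register pair. Non-negative values only. */
void pack(uint32_t u, uint32_t v, uint32_t c, uint32_t *lo, uint32_t *hi)
{
    assert(u < 0x1000 && v < 0x1000 && c < 0x400);
    *lo = ((v & 0x3f) << 26) | (c << 6);   /* v fraction + shade   */
    *hi = (u << 20) | (v >> 6);            /* U.u + integer part V */
}

/* One iteration of the linear flow, minus the texture read and the store. */
void step(Regs *r, uint32_t *index, uint32_t *shade)
{
    uint32_t old = r->d0;
    r->d0 += r->d1;                              /* add.l  d1,d0 */
    uint32_t x = r->d0 < old;                    /* X flag: carry out of v fraction */
    r->d2 += r->d3 + x;                          /* addx.l d3,d2 */
    uint32_t d4 = (r->d2 << 6) | (r->d2 >> 26);  /* move.l d2,d4 / rol.l #6,d4 */
    *index = d4 & 0xfff;                         /* and.l  d6,d4 -> VVVVVVUUUUUU */
    *shade = (r->d0 & 0xffff) >> 12;             /* move.w d0,d5 / lsr.w d7,d5 */
}
```

Starting from zero with steps du = 0.5 ($20), dv = 0.75 ($30) and dc = 0.25 ($10), four calls to step() leave index = $0C2 (V = 3, U = 2) and shade = 1, which matches u = 2.0, v = 3.0, c = 1.0. The single ADDX is enough because the carry out of the v fraction in d0 lands in the integer V at the bottom of d2, while the u fraction carries into U inside d2 itself.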

Then, when rolling the loop into itself, you get a 5-cycle-per-iteration version (each group of instructions below is one cycle):

Code:

; d1 (step) & d0 (counter)	vvvvvv-- -------- CCCCcccc cc------
; d3 (step) & d2 (counter)	UUUUUUuu uuuu---- -------- --VVVVVV

	or.b	(a0,d4.l),d5
	move.l	d2,d4

	move.b	d5,(a1)+
	rol.l	#6,d4

	and.l	d6,d4		; d6 = $00000fff
	move.w	d0,d5

	lsr.w	d7,d5		; d7 = 16-4			; d5.w will be in the range $0000 ... $000f
	add.l	d1,d0

	addx.l	d3,d2
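
The rolled version stores a pixel whose offset and shade were prepared during the previous pass, so something has to prime d4 and d5 before the loop is entered. The C sketch below models both flows to show that they produce the same pixels once such a prologue is in place. The function names and the shape of the prologue are assumptions for illustration, not the actual setup code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: field names mirror the 68k data registers. */
typedef struct { uint32_t d0, d1, d2, d3; } Regs;

static void advance(Regs *r)                  /* add.l d1,d0 / addx.l d3,d2 */
{
    uint32_t old = r->d0;
    r->d0 += r->d1;
    r->d2 += r->d3 + (r->d0 < old);           /* the carry models the X flag */
}

static uint32_t texel_index(const Regs *r)    /* move.l / rol.l #6 / and.l */
{
    return ((r->d2 << 6) | (r->d2 >> 26)) & 0xfff;
}

static uint8_t shade(const Regs *r)           /* move.w / lsr.w by 12 */
{
    return (uint8_t)((r->d0 & 0xffff) >> 12);
}

/* Linear flow: every iteration does all of its own work. */
void draw_linear(Regs r, const uint8_t *tex, uint8_t *dst, size_t n)
{
    assert(tex && dst);
    for (size_t i = 0; i < n; i++) {
        advance(&r);
        dst[i] = tex[texel_index(&r)] | shade(&r);
    }
}

/* Rolled flow: the store at the top uses d4/d5 from the previous pass. */
void draw_rolled(Regs r, const uint8_t *tex, uint8_t *dst, size_t n)
{
    assert(tex && dst);
    advance(&r);                              /* prologue: prime d4 and d5 ... */
    uint32_t d4 = texel_index(&r);
    uint8_t  d5 = shade(&r);
    advance(&r);                              /* ... and step once more */
    for (size_t i = 0; i < n; i++) {
        d5 |= tex[d4];                        /* or.b   (a0,d4.l),d5 */
        dst[i] = d5;                          /* move.b d5,(a1)+ */
        d4 = texel_index(&r);                 /* offset for the next pixel */
        d5 = shade(&r);                       /* shade for the next pixel */
        advance(&r);
    }
}
```

Note that the rolled loop always advances the counters one step further than the linear one, but it never reads a texel it does not store, because the texture read sits at the top of the loop next to the store.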

Be aware, though, that the more convoluted the loop is, the more setup logic is necessary. For a loop that runs for 32 iterations, ~10 cycles of setup logic is equivalent to 10/32 ≈ 0.3 cycles per pixel spent in the inner loop itself. For 32 iterations that isn't so bad, but for short runs (think general-purpose texture mapper for detailed 3D scenes) these super-optimized inner loops are sometimes not worth it because of the setup overhead.
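
The amortization argument can be put into a few lines of C. This is a back-of-the-envelope sketch: the function name is made up, and the cycle counts passed in are whatever you measure or estimate for your own loop.

```c
#include <assert.h>

/* Effective cost per pixel for a run of n pixels:
   (setup + n * per_pixel) / n = setup / n + per_pixel. */
double cycles_per_pixel(double setup, double per_pixel, int n)
{
    assert(n > 0);
    return setup / n + per_pixel;
}
```

With the figures from above, cycles_per_pixel(10, 5, 32) gives 5.3125, while the same 10 cycles of setup spread over a 4-pixel run gives 7.5 cycles per pixel. A simpler loop with less setup can come out ahead at that length even if its per-iteration cost is higher.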
