Old 08 August 2022, 22:40   #30
Kalms
Registered User
 
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
Regarding the inner loop itself: if you want to speed it up, note that each ADDX operation eats up two pipeline slots. The ADD/ADDX/ADDX scheme therefore costs as much as 5 ADDs, or 3 ADDs + 2 shifts, would.

Here is an example of using 1 ADD + 1 ADDX, combined with some creative packing to reduce the amount of shifting/masking. First, the linear flow, without rolling the loop into itself:

Code:
; d1 (step) & d0 (counter)	vvvvvv-- -------- CCCCcccc cc------
; d3 (step) & d2 (counter)	UUUUUUuu uuuu---- -------- --VVVVVV

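	add.l	d1,d0		; step V fraction and C; carry out of the V fraction lands in X

	addx.l	d3,d2		; step U and V integer, adding the V-fraction carry from X

	move.l	d2,d4
	rol.l	#6,d4		; U integer -> bits 0-5, V integer -> bits 6-11

	and.l	d6,d4		; d6 = $00000fff; d4 = V*64 + U
	move.w	d0,d5

	lsr.w	d7,d5		; d7 = 16-4			; d5.w will be in the range $0000 ... $000f

	or.b	(a0,d4.l),d5	; merge the texel with the 4-bit C value

	move.b	d5,(a1)+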
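To make the packing easier to follow, here is a rough C model of the linear loop above. It is only a sketch: the struct, the function names and the pack helpers are made up for illustration, and it assumes the layout in the comments (U as 6.6 fixed point, V as 6 integer bits in d2 plus 6 fraction bits at the top of d0, C as 4.6 fixed point in the low word of d0) and a 64x64 texture whose texels keep their low 4 bits clear.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical model of the registers used by the linear loop.
   d0/d1: vvvvvv-- -------- CCCCcccc cc------
   d2/d3: UUUUUUuu uuuu---- -------- --VVVVVV */
typedef struct {
    uint32_t d0, d1;   /* counter and step: V fraction + C */
    uint32_t d2, d3;   /* counter and step: U + V integer  */
} PackedRegs;

/* Illustrative pack helpers (not from the post). */
uint32_t pack_vc(uint32_t v_frac6, uint32_t c_4_6) { return (v_frac6 << 26) | (c_4_6 << 6); }
uint32_t pack_uv(uint32_t u_6_6, uint32_t v_int6)  { return (u_6_6 << 20) | v_int6; }

static uint32_t rol32(uint32_t x, unsigned n) { return (x << n) | (x >> (32 - n)); }

void draw_span(PackedRegs *r, const uint8_t *texture, uint8_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        /* add.l d1,d0 : the carry out of bit 31 becomes X */
        uint64_t sum = (uint64_t)r->d0 + r->d1;
        r->d0 = (uint32_t)sum;
        uint32_t x = (uint32_t)(sum >> 32);

        /* addx.l d3,d2 : X is added at bit 0, i.e. into the V integer */
        r->d2 = r->d2 + r->d3 + x;

        /* move.l d2,d4 ; rol.l #6,d4 ; and.l d6,d4  ->  V*64 + U */
        uint32_t d4 = rol32(r->d2, 6) & 0x00000fffu;

        /* move.w d0,d5 ; lsr.w d7,d5 (d7 = 12)  ->  integer part of C */
        uint32_t d5 = (r->d0 & 0xffffu) >> 12;

        /* or.b (a0,d4.l),d5 ; move.b d5,(a1)+ */
        *dst++ = (uint8_t)(texture[d4] | d5);
    }
}
```

For example, starting at U=3, V=5, C=7 with steps of 1.0 in U and 0.5 in V, the counters would be `pack_vc(0, 7 << 6)` / `pack_uv(3 << 6, 5)` and the steps `pack_vc(32, 0)` / `pack_uv(1 << 6, 0)`. Note that, as in the listing, the step is applied before the first texel is fetched.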
Then, when rolling the loop into itself, you get a 5-cycle-per-iteration version:

Code:
; d1 (step) & d0 (counter)	vvvvvv-- -------- CCCCcccc cc------
; d3 (step) & d2 (counter)	UUUUUUuu uuuu---- -------- --VVVVVV

	or.b	(a0,d4.l),d5
	move.l	d2,d4

	move.b	d5,(a1)+
	rol.l	#6,d4

	and.l	d6,d4		; d6 = $00000fff
	move.w	d0,d5

	lsr.w	d7,d5		; d7 = 16-4			; d5.w will be in the range $0000 ... $000f
	add.l	d1,d0

	addx.l	d3,d2
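The rolled loop starts by consuming d4/d5 from the previous pass, so something has to prime those registers before the first iteration. The post does not show that setup, so the prologue in this C sketch is an assumption on my part, and the helper names are invented. It is meant only to show where the setup work comes from and that the output matches the linear order.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical model; register names mirror the listing, the rest is illustrative. */
typedef struct { uint32_t d0, d1, d2, d3, d4, d5; } Regs;

/* add.l d1,d0 ; addx.l d3,d2 */
static void step(Regs *r)
{
    uint64_t sum = (uint64_t)r->d0 + r->d1;
    r->d0 = (uint32_t)sum;
    r->d2 += r->d3 + (uint32_t)(sum >> 32);   /* X = carry out of the first add */
}

/* move.l d2,d4 ; rol.l #6,d4 ; and.l d6,d4 ; move.w d0,d5 ; lsr.w d7,d5 */
static void address_and_c(Regs *r)
{
    r->d4 = ((r->d2 << 6) | (r->d2 >> 26)) & 0x00000fffu;  /* d6 = $00000fff */
    r->d5 = (r->d0 & 0xffffu) >> 12;                        /* d7 = 12 */
}

void draw_span_rolled(Regs *r, const uint8_t *texture, uint8_t *dst, size_t count)
{
    /* Assumed prologue: prime d4/d5 for pixel 0, then step on to pixel 1. */
    step(r);
    address_and_c(r);
    step(r);

    while (count--) {
        r->d5 |= texture[r->d4];   /* or.b (a0,d4.l),d5 : data set up by the previous pass */
        *dst++ = (uint8_t)r->d5;   /* move.b d5,(a1)+ */
        address_and_c(r);          /* set up the next pixel */
        step(r);                   /* add.l / addx.l */
    }
}
```

In this model the last pass computes an address that is never used and the counters end up stepped further than in the linear version, which should be harmless as long as they are reloaded for each span.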
Be aware, though, that the more convoluted the loop is, the more setup logic it needs. For a loop that runs for 32 iterations, ~10 cycles of setup logic is equivalent to 10/32 (about 0.3) cycles per pixel spent in the inner loop itself. For 32 iterations that isn't so bad, but for short runs (think a general-purpose texture mapper for detailed 3D scenes) these super-optimized inner loops are sometimes not worth it because of the setup overhead.
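The trade-off can be put into numbers with a tiny helper. This is just the arithmetic from the paragraph above written out in C; the function name is made up.

```c
/* Effective cycles per pixel when a span of `pixels` pays `setup_cycles` once. */
double cycles_per_pixel(double loop_cycles, double setup_cycles, unsigned pixels)
{
    return loop_cycles + setup_cycles / (double)pixels;
}
```

With the figures from the post, `cycles_per_pixel(5, 10, 32)` gives 5.3125, while a 4-pixel span comes out at 7.5. As a purely hypothetical comparison, a simpler 6-cycle loop with only 2 cycles of setup would cost 6.5 cycles per pixel on that 4-pixel span and would be the better choice there.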