Advice to make real-time Bob X-Flipping faster

Brick Nash · 27 November 2022, 11:45

Hi folks!

I've had to start using a real-time sprite/bob flipping routine to save memory, and it works perfectly, but it's just dog-slow at the moment, so I'm looking for some advice on how to maybe speed it up.

The Bobs are all made up of horizontal slices (see image below) which are 16 pixels tall and a varying width, and this routine flips the slices using a lookup table into an Image and Mask buffer which are then blitted to the screen (it sits in part of a wider drawing routine).

All the information for source/screen positions, modulos, etc. has been loaded in an earlier part of the drawing routine from a "sprite table" of pre-calculated values.

I've done two versions of this so far - one which flips all the words sequentially in each bitplane before moving on to the next, and then this one which does one word then the next bitplane of that word and so on (essentially doing it a "tile" at a time). Both are around the same speed.

I'm kind of shocked that there's no Blitter function to do this as it seems really cumbersome to have to do such a commonly occurring graphical task with the CPU.

Regardless, any advice on how to maybe speed this up would be most welcome, as I'm out of ideas.

Many thanks!

Code:

;------------------------------------------------------------------------------
;	XFLIP
;------------------------------------------------------------------------------
; a1 	= SLICE MASK ADDRESS
; a2 	= SLICE IMAGE ADDRESS
; a4 	= LOOKUP TABLE
;---------------------------------------
; d3/d2 = Width counter
; d5 	= BLITSIZE  (Second BYTE is WORD-WIDTH+1)
; d6 	= Lines counter
;---------------------------------------
	
	if 1=1
	
DRAWBOB_XFLIP:	

	;---------------------------------------
	btst 		#0,OBJ_FLAGS(a0)	; Test XFLIP flag: 0 = Face LEFT | 1 = Face RIGHT
	bne.w		DRAWBOB_CLIP
	;---------------------------------------
	
	movem.l		d2-a0/a3-a6,-(sp)	; Back Up Registers
	
	;---------------------------------------	
	lea 		XFLIP_TABLE,a4		; Lookup Table (64K words = 128 KB)
	lea		XFLIP_BUFFER,a5		; Flipped Image Buffer
	lea 		XFLIP_MASK_BUFFER,a6	; Flipped Mask Buffer
	;---------------------------------------
	
FLIP_WIDTH:	

	move.b 		d5,d3			; Get WIDTH Counter
	sub.w		#2,d3			; Sub 2 (Don't need masked word & 0 for counter)	
	move.w		d3,d2			; Backup
	add.w 		d3,d3			; Double for offset value
	
	;---------------------------------------
	
	adda.w 		d3,a5			; Add to Image Flip Buffer
	adda.w 		d3,a6			; Add to Mask Flip Buffer
	
	move.w 		d2,d3			; Refresh WIDTH to word value
	;---------------------------------------
	move.l		a5,a0			; Back up Image Flip Starting Pos
	move.l		a6,a3			; Back up MASK Flip Starting Pos
	
	move.l		a1,d5			; Back up MASK Starting Pos
	move.l		a2,d7			; Back up Image Starting Pos
	;--------------------------------------

	move.w 		#16-1,d6		; LINE Counter (Always 16)
	
	;--------------------------------------
	macro 		FLIPIMAGE
	;--------------------------------------
	move.w		(a2),d0					
	add.l		d0,d0					
	move.w 		(a4,d0.l),(a5)				
	adda.w		#40,a2
	adda.w		#40,a5	
	moveq 		#0,d0
	;--------------------------------------
	endm
	
	;--------------------------------------
	macro 		FLIPMASK
	;--------------------------------------
	move.w		(a1),d0					
	add.l		d0,d0					
	move.w 		(a4,d0.l),(a6)				
	adda.w		#40,a1
	adda.w		#40,a6	
	moveq 		#0,d0
	;--------------------------------------
	endm
	
BOB_FLIPLOOP:

	;---------------------------------------
	; FLIP IMAGE WORDS ON ALL 4 BITPLANES
	;---------------------------------------

	FLIPIMAGE				; Bitplane 1
	FLIPIMAGE				; Bitplane 2
	FLIPIMAGE				; Bitplane 3
	FLIPIMAGE				; Bitplane 4

	;---------------------------------------	
	; FLIP MASK WORDS ON ALL 4 BITPLANES
	;---------------------------------------
	
	FLIPMASK				; Bitplane 1
	FLIPMASK				; Bitplane 2
	FLIPMASK				; Bitplane 3
	FLIPMASK				; Bitplane 4
	
	;---------------------------------------
	
	dbf 		d6,BOB_FLIPLOOP
	move.w 		#16-1,d6		; Refresh Line counter
	
	;---------------------------------------
	
	move.l		a0,a5			; Refresh Positions
	move.l		a3,a6					
	move.l		d5,a1					
	move.l		d7,a2
	
	adda.w 		#2,a1			; Apply next WORD to be flipped
	adda.w 		#2,a2
	suba.w 		#2,a5
	suba.w 		#2,a6
	
	move.l		a5,a0			; Store new Positions
	move.l		a6,a3					
	move.l		a1,d5					
	move.l		a2,d7

	;---------------------------------------
	
	dbf 		d3,BOB_FLIPLOOP		; Dec slice width Counter
			
	;---------------------------------------
	; FLIPPING DONE
	; - LOAD REGISTERS FOR BLITTING
	;---------------------------------------

	lea 		XFLIP_MASK_BUFFER,a1
	lea		XFLIP_BUFFER,a2

	;---------------------------------------	
	movem.l		(sp)+,d2-a0/a3-a6	; Restore Registers
	;---------------------------------------
	endif

a/b · 27 November 2022, 15:53

Let's start with a few simple micro optimizations:

Code:

;	adda.w	#40,ax
	lea	(40,ax),ax

; y = 1 to 8
;	adda.w 	#y,ax
	addq.w 	#y,ax
;	suba.w 	#y,ax
	subq.w 	#y,ax

; y = -128 to 127
;	move.w #y,dx
	moveq 	#y,dx

This will make the code shorter as well, so maybe you could unroll it 16 times. This eliminates dbf overhead and removes lea/adda 40 (you can hardcode a5/a6 offsets, with the first one being 0 and not needed):

Code:

	move.w (a4,d0.l),(<OFFSET>*40,a5)
...
	move.w (a4,d0.l),(<OFFSET>*40,a6)

Next, if you can place the XFLIP_TABLE on a 128 KB boundary (e.g. $60000), you can eliminate index addressing and gain 2 cycles (-6+4 = -2) per lookup, for example:

Code:

init:
	move.l	#XFLIP_TABLE/2,d0

loop:
;	move.w	(a2),d0
;	add.l	d0,d0
;	move.w (a4,d0.l),(a5)
	move.w	(a2),d0
	move.l	d0,a0		; +4
	add.l	a0,a0
	move.w (a0),(a5)	; -6
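
To see why the 128 KB alignment matters, here is a worked example of the address arithmetic, assuming the table really does sit at $60000 and the sprite word happens to be $1234 (both values are only for illustration). The upper word of d0 is set once and is not touched by the move.w:

Code:

	move.l	#$60000/2,d0	; d0 = $00030000 (low word must be 0, hence the alignment)
	move.w	(a2),d0		; sprite word $1234 -> d0 = $00031234
	move.l	d0,a0
	add.l	a0,a0		; a0 = $00062468 = $60000 + 2*$1234
	move.w	(a0),(a5)	; fetch the flipped word straight from the table

Note that the moveq #0,d0 at the end of the original macros would have to go, since it would wipe the preloaded upper word.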

lmimmfn · 27 November 2022, 17:41

Would a LUT of words/bytes help? Where the direction A value is an offset from a base address and the value stored there is the opposite direction B? The LUT would be 256 entries long for bytes, 64K for words, or would that be too slow? (It's a very long time since I did 68k assembly, so I realise my post might be useless lol)

Brick Nash · 27 November 2022, 17:49

Quote:

Originally Posted by a/b

Let's start with a few simple micro optimizations:

Code:

;	adda.w	#40,ax
	lea	(40,ax),ax

; y = 1 to 8
;	adda.w 	#y,ax
	addq.w 	#y,ax
;	suba.w 	#y,ax
	subq.w 	#y,ax

; y = -128 to 127
;	move.w #y,dx
	moveq 	#y,dx

This will make the code shorter as well, so maybe you could unroll it 16 times. This eliminates dbf overhead and removes lea/adda 40 (you can hardcode a5/a6 offsets, with the first one being 0 and not needed):

Code:

	move.w (a4,d0.l),(<OFFSET>*40,a5)
...
	move.w (a4,d0.l),(<OFFSET>*40,a6)

Next, if you can place the XFLIP_TABLE on a 128 KB boundary (e.g. $60000), you can eliminate index addressing and gain 2 cycles (-6+4 = -2) per lookup, for example:

Code:

init:
	move.l	#XFLIP_TABLE/2,d0

loop:
;	move.w	(a2),d0
;	add.l	d0,d0
;	move.w (a4,d0.l),(a5)
	move.w	(a2),d0
	move.l	d0,a0		; +4
	add.l	a0,a0
	move.w (a0),(a5)	; -6

Awesome, thanks! These are great tips!

I'll give them a try, and definitely unroll that dbf (I tried it with the Macros and it did feel a bit smoother).

chb · 27 November 2022, 17:49

I guess there's a small typo: I was wrong, no typo, see a/b's post below.

EDIT: And those adda.w #40 eat up quite some cycles probably... could you organize your data/code differently to avoid that? Like doing mask and image one after the other, that should let you use a separate address register for every plane. I'm not sure if I understand your code correctly, but that part looks a bit wasteful to me.

PS: The absence of an easy blitter flip is annoying, true. You can use the blitter to flip, but it takes four passes (AB->D type) + then drawing the BOB, so it's probably in most cases slower than using the table approach.

a/b · 27 November 2022, 18:51

Quote:

Originally Posted by chb

I guess there's a small typo:
Probably it should be

Code:

    move.w    d0,a0        ; +4

No typo, the idea is to have d0 upper word preloaded, load the lower word with sprite data, and then move the whole 32-bits (address/2) to a0.

chb · 27 November 2022, 19:05

Quote:

Originally Posted by a/b

No typo, the idea is to have d0 upper word preloaded, load the lower word with sprite data, and then move the whole 32-bits (address/2) to a0.

Ah, you're right, of course. I embarrassingly missed the init line and thought a0 was preloaded with $address/2...

Jobbo · 27 November 2022, 19:13

Are these bobs arranged for interleaved blitting, and if so, are those four masks all the same?

If they are the same then you only need to flip the first one and copy that result to the others.

I wonder if it'd be less memory overhead to blit for each plane separately so you only need to store one copy of the mask.

Then you might not need to do all this flipping, or at least do it for fewer bobs.
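
A minimal sketch of the flip-once idea, assuming the four interleaved mask rows are identical, rows are 40 bytes apart as in the original routine, and d1 is free as a scratch register (an assumption, it just isn't used in the posted code):

Code:

	moveq	#0,d0			; upper word must be clear for the .l index
	move.w	(a1),d0			; mask word, first plane only
	add.l	d0,d0			; word index -> byte offset
	move.w	(a4,d0.l),d1		; flipped mask word
	move.w	d1,(a6)			; plane 1
	move.w	d1,40(a6)		; plane 2
	move.w	d1,80(a6)		; plane 3
	move.w	d1,120(a6)		; plane 4
	lea	160(a1),a1		; next line of the source mask
	lea	160(a6),a6		; next line of the flipped mask

That is one table lookup per line instead of four.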

Jobbo · 27 November 2022, 19:19

If you have the registers to spare then it might be worth loading one up with #40 and replacing:

Code:

lea (40,a0),a0

With:

Code:

add.w d0,a0

They both take 8 cycles, but the latter is one word less of code and so it could free up some bandwidth if you have blits going on at the same time.

Brick Nash · 27 November 2022, 20:02

Thanks for all the suggestions folks. This is great stuff!

Quote:

Originally Posted by Jobbo

Are these bobs arranged for interleaved blitting, and if so, are those four masks all the same?

If they are the same then you only need to flip the first one and copy that result to the others.

I wonder if it'd be less memory overhead to blit for each plane separately so you only need to store one copy of the mask.

Then you might not need to do all this flipping, or at least do it for fewer bobs.

That was a splendid suggestion to just flip the mask once and then copy. Scooped out a fair few lines of code there. Thank you!

roondar · 27 November 2022, 20:12

A common trick to speed up X-flipping of bobs is to store bobs with 1/2 of the lines pointing to the left and 1/2 pointing to the right. When blitting, you blit the 1/2 that is pointing in the correct direction as normal and only flip the other 1/2 (well, you output the flipped result, not actually flip the bob data).

In your case that would mean storing each slice with 1/2 the lines pointing to the right and 1/2 to the left

hooverphonique · 28 November 2022, 10:29

Quote:

Originally Posted by roondar

A common trick to speed up X-flipping of bobs is to store bobs with 1/2 of the lines pointing to the left and 1/2 pointing to the right.

You mean for evening out the flipping cost?

roondar · 28 November 2022, 11:34

Yup, so you have no 'cheap' vs 'expensive' frames to consider

Brick Nash · 28 November 2022, 15:17

Quote:

Originally Posted by roondar

A common trick to speed up X-flipping of bobs is to store bobs with 1/2 of the lines pointing to the left and 1/2 pointing to the right. When blitting, you blit the 1/2 that is pointing in the correct direction as normal and only flip the other 1/2 (well, you output the flipped result, not actually flip the bob data).

In your case that would mean storing each slice with 1/2 the lines pointing to the right and 1/2 to the left

Yes, I think I remember reading one of the old dev teams like Probe or Core did this.

I've actually got an old routine which flips the slices in the sheets anyway (from when I still stored both left and right versions), so I could easily modify it to just do half of each slice. I never actually thought of trying it myself, so thanks for the suggestion.

Photon · 28 November 2022, 21:06

I think there was already a thread and the fastest was a 32K words table + roll a carry bit around?

Gist:

Code:

	add.w d0,d0
	move.w (a0,d0.w),d0
	addx d0,d0
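
One reading of the gist is that the table only has to cover 15 bits, because the top bit of the source word is carried around in the X flag. A sketch under these assumptions: REV15_TABLE is a hypothetical 32768-entry word table (64 KB) where entry i holds the low 15 bits of i in reversed order, and a0 points to its midpoint with the two halves stored swapped, since a .w index is sign-extended:

Code:

; a0 = REV15_TABLE+32768 (hypothetical label)
; memory from a0 upwards: entries 0..16383, memory below a0: entries 16384..32767
	move.w	(a2),d0		; source word
	add.w	d0,d0		; bit 15 -> X, d0 = low 15 bits * 2 (byte offset)
	move.w	(a0,d0.w),d0	; 15-bit reversal in bits 14..0, MOVE leaves X alone
	addx.w	d0,d0		; shift left by one, old bit 15 lands in bit 0
	move.w	d0,(a5)		; the extra write that the direct lookup does not need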

a/b · 29 November 2022, 02:39

If you could serialize the reads and writes with movem, then it would be faster than what I posted above (which wouldn't work because movem would wipe the upper words). Otherwise, it's 2+4 cycles slower (4: it can't do a table read + output write in a single move, it needs an extra opcode) and has one more memory access.
But it has advantages if you have memory constraints. And it's a nice trick overall.

buzzybee · 30 November 2022, 16:10

Did not read the entire thread, but just in case no one has posted it before: There was quite an interesting conversation about tile flipping going on here:

https://eab.abime.net/showthread.php...5555555&page=2

Looks like there is no such thing as a single best code. The table-based and logic-based approaches each have their pros and cons.

Brick Nash · 30 November 2022, 17:02

Thanks for all the help folks!

I've been working my way through the suggestions and the flipping is noticeably faster now. Still got a bit to go, but I learned quite a few tricks and tips from this great thread.

Brick Nash · 02 December 2022, 10:50

Just a little additional question while this thread is still relatively warm:

I'm trying to run this instruction:

move.w (a4,d0.l*2),(a5)

But the assembler (VASM) complains that d0.l*2 isn't supported, although I checked and it works in Asm-One.

I'm hoping it's just a case of needing a new module or something, because I think this would save all the (many) lines of "add.l d0,d0" that I currently have.

I've tried a few different exes from a download pack (vasmm68k_madmac/vasmm68k_mod/vasmm68k_std) but none resolve the issue as I don't really know what the versions do.

I'm quite unclear about VASM as I find the documentation very vague, so any advice on what to do/install would be great.

Thanks!

saimo · 02 December 2022, 11:06

That's a 68020+ addressing mode: if you aren't targeting 68000/68010 as well, you need to tell the assembler (for example with a directive or a command line switch) that the code is for (at least) 68020.
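
With vasm that could look like the following, assuming the Motorola-syntax build (vasmm68k_mot); the file names are placeholders, and it is worth checking the option and directive names against the documentation for your vasm version:

Code:

; on the command line:
;	vasmm68k_mot -m68020 -Fhunk -o bob.o bob.s
; or at the top of the source:
	machine	68020

	move.w	(a4,d0.l*2),(a5)	; scaled index, 68020+ only

The suffix in the executable name (mot, std, madmac) selects the source syntax rather than the CPU features. Bear in mind that code using this addressing mode will not work correctly on a plain 68000.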
