English Amiga Board


Old 27 November 2022, 11:45   #1
Brick Nash
Prototron
 
 
Join Date: Mar 2015
Location: Glasgow, Scotland
Posts: 411
Advice to make real-time Bob X-Flipping faster

Hi folks!

I've had to start using a real-time sprite/bob flipping routine to save memory, and it works perfectly, but it's just dog-slow at the moment, so I'm looking for some advice on how to maybe speed it up.

The Bobs are all made up of horizontal slices (see image below) which are 16 pixels tall and a varying width, and this routine flips the slices using a lookup table into an Image and Mask buffer which are then blitted to the screen (it sits in part of a wider drawing routine).

All the information for source/screen positions/modulos etc. has been loaded in an earlier part of the drawing routine from a "sprite table" of pre-calculated values.



I've done two versions of this so far: one which flips all the words sequentially in each bitplane before moving on to the next, and this one, which does one word, then the next bitplane of that word, and so on (essentially doing it a "tile" at a time). Both are around the same speed.

I'm kind of shocked that there's no Blitter function to do this as it seems really cumbersome to have to do such a commonly occurring graphical task with the CPU.

Regardless, any advice on how to maybe speed this up would be most welcome, as I'm out of ideas.

Many thanks!

Code:
;------------------------------------------------------------------------------
;	XFLIP
;------------------------------------------------------------------------------
; a1 	= SLICE MASK ADDRESS
; a2 	= SLICE IMAGE ADDRESS
; a4 	= LOOKUP TABLE
;---------------------------------------
; d3/d2 = Width counter
; d5 	= BLITSIZE  (Second BYTE is WORD-WIDTH+1)
; d6 	= Lines counter
;---------------------------------------
	
	if 1=1
	
DRAWBOB_XFLIP:	

	;---------------------------------------
	btst 		#0,OBJ_FLAGS(a0)	; Test XFLIP flag: 0 = Face LEFT | 1 = Face RIGHT
	bne.w		DRAWBOB_CLIP
	;---------------------------------------
	
	movem.l		d2-a0/a3-a6,-(sp)	; Back Up Registers
	
	;---------------------------------------	
	lea 		XFLIP_TABLE,a4		; Lookup Table (64K word entries = 128 KB)
	lea		XFLIP_BUFFER,a5		; Flipped Image Buffer
	lea 		XFLIP_MASK_BUFFER,a6	; Flipped Mask Buffer
	;---------------------------------------
	
FLIP_WIDTH:	

	move.b 		d5,d3			; Get WIDTH Counter
	sub.w		#2,d3			; Sub 2 (Don't need masked word & 0 for counter)	
	move.w		d3,d2			; Backup
	add.w 		d3,d3			; Double for offset value
	
	;---------------------------------------
	
	adda.w 		d3,a5			; Add to Image Flip Buffer
	adda.w 		d3,a6			; Add to Mask Flip Buffer
	
	move.w 		d2,d3			; Refresh WIDTH to word value
	;---------------------------------------
	move.l		a5,a0			; Back up Image Flip Starting Pos
	move.l		a6,a3			; Back up MASK Flip Starting Pos
	
	move.l		a1,d5			; Back up MASK Starting Pos (a1 = slice mask)
	move.l		a2,d7			; Back up Image Starting Pos (a2 = slice image)
	;--------------------------------------

	move.w 		#16-1,d6		; LINE Counter (Always 16)
	
	;--------------------------------------
	macro 		FLIPIMAGE
	;--------------------------------------
	move.w		(a2),d0					
	add.l		d0,d0					
	move.w 		(a4,d0.l),(a5)				
	adda.w		#40,a2
	adda.w		#40,a5	
	moveq 		#0,d0
	;--------------------------------------
	endm
	
	;--------------------------------------
	macro 		FLIPMASK
	;--------------------------------------
	move.w		(a1),d0					
	add.l		d0,d0					
	move.w 		(a4,d0.l),(a6)				
	adda.w		#40,a1
	adda.w		#40,a6	
	moveq 		#0,d0
	;--------------------------------------
	endm
	
BOB_FLIPLOOP:

	;---------------------------------------
	; FLIP IMAGE WORDS ON ALL 4 BITPLANES
	;---------------------------------------

	FLIPIMAGE				; Bitplane 1
	FLIPIMAGE				; Bitplane 2
	FLIPIMAGE				; Bitplane 3
	FLIPIMAGE				; Bitplane 4

	;---------------------------------------	
	; FLIP MASK WORDS ON ALL 4 BITPLANES
	;---------------------------------------
	
	FLIPMASK				; Bitplane 1
	FLIPMASK				; Bitplane 2
	FLIPMASK				; Bitplane 3
	FLIPMASK				; Bitplane 4
	
	;---------------------------------------
	
	dbf 		d6,BOB_FLIPLOOP
	move.w 		#16-1,d6		; Refresh Line counter
	
	;---------------------------------------
	
	move.l		a0,a5			; Refresh Positions
	move.l		a3,a6					
	move.l		d5,a1					
	move.l		d7,a2
	
	adda.w 		#2,a1			; Apply next WORD to be flipped
	adda.w 		#2,a2
	suba.w 		#2,a5
	suba.w 		#2,a6
	
	move.l		a5,a0			; Store new Positions
	move.l		a6,a3					
	move.l		a1,d5					
	move.l		a2,d7

	;---------------------------------------
	
	dbf 		d3,BOB_FLIPLOOP		; Dec slice width Counter
			
	;---------------------------------------
	; FLIPPING DONE
	; - LOAD REGISTERS FOR BLITTING
	;---------------------------------------

	lea 		XFLIP_MASK_BUFFER,a1
	lea		XFLIP_BUFFER,a2

	;---------------------------------------	
	movem.l		(sp)+,d2-a0/a3-a6	; Restore Registers
	;---------------------------------------
	endif
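
For anyone following along, a table like XFLIP_TABLE can be built once at startup. This is only a minimal sketch, assuming the table is a 65,536-entry word array in which entry N holds N with its 16 bits mirrored. BUILD_XFLIP_TABLE and the register choices are made up for illustration and are not part of the routine above.

```asm
; Hypothetical one-off setup routine (not from the original post).
; Fills XFLIP_TABLE so that entry N = N with bits 15..0 reversed.
; Trashes d0-d3/a0.
BUILD_XFLIP_TABLE:
	lea	XFLIP_TABLE,a0		; a0 = write pointer into the 128 KB table
	moveq	#0,d0			; d0 = current index N (0..65535)
.next:
	move.w	d0,d1			; d1 = working copy of N
	moveq	#16-1,d3		; 16 bits to move across
.bit:
	lsr.w	#1,d1			; lowest bit of d1 -> X flag
	addx.w	d2,d2			; d2 = (d2 << 1) | X
	dbf	d3,.bit			; after 16 passes d2 holds N mirrored
	move.w	d2,(a0)+		; store the entry
	addq.w	#1,d0			; next N
	bne.s	.next			; d0 wraps to 0 after 65,535: done
	rts
```

Whatever was in d2 beforehand is shifted out completely after 16 passes, so it does not need clearing first.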
Old 27 November 2022, 15:53   #2
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,039
Let's start with a few simple micro-optimizations:
Code:
;	adda.w	#40,ax
	lea	(40,ax),ax

; y = 1 to 8
;	adda.w 	#y,ax
	addq.w 	#y,ax
;	suba.w 	#y,ax
	subq.w 	#y,ax

; y = -128 to 127
;	move.w #y,dx
	moveq 	#y,dx
This will make the code shorter as well, so maybe you could unroll it 16 times. This eliminates dbf overhead and removes lea/adda 40 (you can hardcode a5/a6 offsets, with the first one being 0 and not needed):
Code:
	move.w (a4,d0.l),(<OFFSET>*40,a5)
...
	move.w (a4,d0.l),(<OFFSET>*40,a6)
Next, if you can place the XFLIP_TABLE on a 128 KB boundary (e.g., $60000), you can eliminate index addressing and gain 2 cycles per lookup (6 saved by dropping the indexed addressing, 4 spent on the extra move.l, so -6+4 = -2), for example:
Code:
init:
	move.l	#XFLIP_TABLE/2,d0

loop:
;	move.w	(a2),d0
;	add.l	d0,d0
;	move.w (a4,d0.l),(a5)
	move.w	(a2),d0
	move.l	d0,a0		; +4
	add.l	a0,a0
	move.w (a0),(a5)	; -6
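
Put together, the unrolling, the hardcoded offsets and the aligned table might look roughly like the sketch below for one word column of the image. FLIPWORD and ROW are hypothetical names, a0 as a scratch register is an assumption (the original routine uses it for a backup), and 40 is the row stride taken from the original code. It also assumes XFLIP_TABLE is an absolute address on a 128 KB boundary, otherwise the /2 cannot be evaluated by the assembler.

```asm
; \1 = row number inside the slice (0..63: 16 lines x 4 interleaved planes)
	macro	FLIPWORD
	move.w	(\1*40,a2),d0		; fetch source word, upper word of d0 is untouched
	move.l	d0,a0			; a0 = table address / 2 + word
	add.l	a0,a0			; a0 = table address + word * 2
	move.w	(a0),(\1*40,a5)		; store mirrored word at the same row of the buffer
	endm

	move.l	#XFLIP_TABLE/2,d0	; once, before the unrolled block

ROW	set	0
	rept	64
	FLIPWORD ROW
ROW	set	ROW+1
	endr
```

Because the doubling now happens in a0, d0 never overflows into its upper word, so the moveq #0,d0 after every lookup is no longer needed either.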
Old 27 November 2022, 17:41   #3
lmimmfn
Registered User
 
Join Date: May 2018
Location: Ireland
Posts: 672
Would a LUT of words/bytes help? The value facing direction A is used as an offset from a base address, and the value stored there is the same data facing the opposite direction B. The LUT would be 256 entries long for bytes, 64K entries for words. Or would that be too slow? (It's been a very long time since I did 68k assembly, so I realise my post might be useless lol)
Old 27 November 2022, 17:49   #4
Brick Nash
Prototron
 
 
Join Date: Mar 2015
Location: Glasgow, Scotland
Posts: 411
Quote:
Originally Posted by a/b View Post
Let's start with a few simple micro-optimizations:
Awesome, thanks! These are great tips!

I'll give them a try, and definitely unroll that dbf (I tried it with the Macros and it did feel a bit smoother).
Old 27 November 2022, 17:49   #5
chb
Registered User
 
Join Date: Dec 2014
Location: germany
Posts: 439
I guess there's a small typo: I was wrong, no typo, see a/b's post below.

EDIT: And those adda.w #40 eat up quite some cycles probably... could you organize your data/code differently to avoid that? Like doing mask and image one after the other, that should let you use a separate address register for every plane. I'm not sure if I understand your code correctly, but that part looks a bit wasteful to me.

PS: The absence of an easy blitter flip is annoying, true. You can use the blitter to flip, but it takes four passes (AB->D type) + then drawing the BOB, so it's probably in most cases slower than using the table approach.

Last edited by chb; 27 November 2022 at 19:06.
Old 27 November 2022, 18:51   #6
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,039
Quote:
Originally Posted by chb View Post
I guess there's a small typo:
Probably it should be
Code:
    move.w    d0,a0        ; +4
No typo, the idea is to have d0 upper word preloaded, load the lower word with sprite data, and then move the whole 32-bits (address/2) to a0.
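
A worked trace of that idea with concrete numbers may help. It assumes the table sits at $60000 (the example address from earlier in the thread) and that the source word happens to be $1234; the immediate load only stands in for move.w (a2),d0.

```asm
	move.l	#$60000/2,d0		; d0 = $00030000, done once outside the loop
	move.w	#$1234,d0		; d0 = $00031234, only the lower word changes
	move.l	d0,a0			; a0 = $00031234, all 32 bits are copied
	add.l	a0,a0			; a0 = $00062468 = $60000 + 2*$1234
	move.w	(a0),d1			; d1 = table entry for $1234

; With move.w d0,a0 the word would be sign-extended instead:
; a0 = $00001234, and the preloaded table base would be lost.
```

This is also why the 128 KB alignment matters: the table address divided by 2 must have a lower word of zero, so that the data word can be dropped into it without disturbing the base.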
Old 27 November 2022, 19:05   #7
chb
Registered User
 
Join Date: Dec 2014
Location: germany
Posts: 439
Quote:
Originally Posted by a/b View Post
No typo, the idea is to have d0 upper word preloaded, load the lower word with sprite data, and then move the whole 32-bits (address/2) to a0.
Ah, you're right, of course. I embarrassingly missed the init line and thought a0 was preloaded with $address/2...
Old 27 November 2022, 19:13   #8
Jobbo
Registered User
 
 
Join Date: Jun 2020
Location: Druidia
Posts: 387
Are these bobs arranged for interleaved blitting, and if so, are those four masks all the same?

If they are the same then you only need to flip the first one and copy that result to the others.

I wonder if it'd be less memory overhead to blit for each plane separately so you only need to store one copy of the mask.

Then you might not need to do all this flipping, or at least do it for fewer bobs.
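
A rough sketch of the "flip once, copy to the others" idea for one mask word, assuming interleaved data with a 40-byte row stride and four identical mask rows per line, as in the original routine. d1 as a scratch register is an assumption.

```asm
	moveq	#0,d0			; clear upper word before building the index
	move.w	(a1),d0			; mask word for plane 1 of this line
	add.l	d0,d0			; word value -> byte offset into the table
	move.w	(a4,d0.l),d1		; mirror it once
	move.w	d1,(a6)			; plane 1
	move.w	d1,(40,a6)		; plane 2, same mask
	move.w	d1,(80,a6)		; plane 3
	move.w	d1,(120,a6)		; plane 4
	lea	(160,a1),a1		; next line: skip all four source rows
	lea	(160,a6),a6		; next line in the flipped mask buffer
```

Three of the four table lookups per line are replaced by plain register-to-memory stores.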
Old 27 November 2022, 19:19   #9
Jobbo
Registered User
 
 
Join Date: Jun 2020
Location: Druidia
Posts: 387
If you have the registers to spare then it might be worth loading one up with #40 and replacing:

Code:
lea (40,a0),a0
With:

Code:
add.w d0,a0
They both take 8 cycles, but the latter is one word less of code, so it could free up some bandwidth if you have blits going on at the same time.
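
In the context of the original loop that could look something like this. d4 is an assumed free register here, since d0 is already busy as the lookup index in that routine.

```asm
	moveq	#40,d4			; row stride in bytes, loaded once before the loop

	; inside the loop, in place of lea (40,a2),a2 / lea (40,a5),a5:
	adda.w	d4,a2			; next source row, one word of code
	adda.w	d4,a5			; next destination row
```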
Old 27 November 2022, 20:02   #10
Brick Nash
Prototron
 
 
Join Date: Mar 2015
Location: Glasgow, Scotland
Posts: 411
Thanks for all the suggestions folks. This is great stuff!

Quote:
Originally Posted by Jobbo View Post
Are these bobs arranged for interleaved blitting, and if so, are those four masks all the same?

If they are the same then you only need to flip the first one and copy that result to the others.

I wonder if it'd be less memory overhead to blit for each plane separately so you only need to store one copy of the mask.

Then you might not need to do all this flipping, or at least do it for fewer bobs.
That was a splendid suggestion to just flip the mask once and then copy. Scooped out a fair few lines of code there. Thank you!
Old 27 November 2022, 20:12   #11
roondar
Registered User
 
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,408
A common trick to speed up X-flipping of bobs is to store bobs with 1/2 of the lines pointing to the left and 1/2 pointing to the right. When blitting, you blit the 1/2 that is pointing in the correct direction as normal and only flip the other 1/2 (well, you output the flipped result, not actually flip the bob data).

In your case that would mean storing each slice with 1/2 the lines pointing to the right and 1/2 to the left
Old 28 November 2022, 10:29   #12
hooverphonique
ex. demoscener "Bigmama"
 
Join Date: Jun 2012
Location: Fyn / Denmark
Posts: 1,624
Quote:
Originally Posted by roondar View Post
A common trick to speed up X-flipping of bobs is to store bobs with 1/2 of the lines pointing to the left and 1/2 pointing to the right.
You mean for evening out the flipping cost?
Old 28 November 2022, 11:34   #13
roondar
Registered User
 
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,408
Yup, so you have no 'cheap' vs 'expensive' frames to consider
Old 28 November 2022, 15:17   #14
Brick Nash
Prototron
 
 
Join Date: Mar 2015
Location: Glasgow, Scotland
Posts: 411
Quote:
Originally Posted by roondar View Post
A common trick to speed up X-flipping of bobs is to store bobs with 1/2 of the lines pointing to the left and 1/2 pointing to the right. When blitting, you blit the 1/2 that is pointing in the correct direction as normal and only flip the other 1/2 (well, you output the flipped result, not actually flip the bob data).

In your case that would mean storing each slice with 1/2 the lines pointing to the right and 1/2 to the left
Yes, I think I remember reading one of the old dev teams like Probe or Core did this.

I've actually got an old routine which flips the slices in the sheets anyway (from when I still stored both left and right versions), so I could easily modify it to just do half of each slice. I never actually thought of trying it myself, so thanks for the suggestion.
Old 28 November 2022, 21:06   #15
Photon
Moderator
 
 
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,602
I think there was already a thread and the fastest was a 32K words table + roll a carry bit around?

Gist:
Code:
	add.w d0,d0
	move.w (a0,d0.w),d0
	addx d0,d0
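
A sketch of how that gist might sit in a flip loop. FLIP15_TABLE is a hypothetical 64 KB table (32K words) that is addressed by a signed offset from its midpoint, because a word-sized index register is sign-extended. The entry reached with offset 2*N (N = low 15 bits of the source word) is assumed to hold those 15 bits mirrored into bits 14..0. a2 and a5 as source and destination follow the original routine.

```asm
	lea	FLIP15_TABLE+$8000,a0	; point at the middle of the table (once)

	move.w	(a2),d0			; source word
	add.w	d0,d0			; bit 15 -> X flag, d0 = low 15 bits * 2
	move.w	(a0,d0.w),d0		; mirrored low 15 bits (move leaves X alone)
	addx.w	d0,d0			; shift left by one, old bit 15 enters at bit 0
	move.w	d0,(a5)			; store the fully mirrored word
```

The table is half the size of the 64K-entry version, at the price of an extra instruction per word.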
Old 29 November 2022, 02:39   #16
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,039
If you could serialize the reads and writes with movem, then it would be faster than what I posted above (which wouldn't work because movem would wipe the upper words). Otherwise, it's 2+4 cycles slower (4: it can't do a table read + output write in a single move, it needs an extra opcode) and has one more memory access.
But it has advantages if you have memory constraints. And it's a nice trick overall.
Old 30 November 2022, 16:10   #17
buzzybee
Registered User
 
Join Date: Oct 2015
Location: Landsberg / Germany
Posts: 526
Did not read the entire thread, but just in case no one has posted it before: There was quite an interesting conversation about tile flipping going on here:

https://eab.abime.net/showthread.php...5555555&page=2

Looks like there is no single best code. Both the table-based and the logic-based approaches have their pros and cons.
Old 30 November 2022, 17:02   #18
Brick Nash
Prototron
 
 
Join Date: Mar 2015
Location: Glasgow, Scotland
Posts: 411
Thanks for all the help folks!

I've been working my way through the suggestions and the flipping is noticeably faster now. Still got a bit to go, but I learned quite a few tricks and tips from this great thread.
Old 02 December 2022, 10:50   #19
Brick Nash
Prototron
 
 
Join Date: Mar 2015
Location: Glasgow, Scotland
Posts: 411
Just a little additional question while this thread is still relatively warm:

I'm trying to run this instruction:

move.w (a4,d0.l*2),(a5)

But the assembler (VASM) is complaining that the d0.l*2 isn't supported. However, I checked and it works in Asm-One.

I'm hoping it's just a case of needing a new module or something, because I think this would save all the (many) lines of "add.l d0,d0" that I currently have.

I've tried a few different exes from a download pack (vasmm68k_madmac/vasmm68k_mod/vasmm68k_std) but none resolve the issue as I don't really know what the versions do.

I'm quite unclear about VASM as I find the documentation very vague, so any advice on what to do/install would be great.

Thanks!
Old 02 December 2022, 11:06   #20
saimo
Registered User
 
 
Join Date: Aug 2010
Location: Italy
Posts: 787
That's a 68020+ addressing mode: if you aren't targeting 68000/68010 as well, you need to tell the assembler (for example with a directive or a command line switch) that the code is for (at least) 68020.
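
With vasm, for instance, the directive form would look roughly like the sketch below; the command line equivalent is the -m68020 option. The register usage simply mirrors the instruction from the question.

```asm
	machine	68020			; vasm: accept 68020 instructions and addressing modes

	moveq	#0,d0			; upper word must still be clear for the .l index
	move.w	(a2),d0			; source word
	move.w	(a4,d0.l*2),(a5)	; scaled index, replaces the add.l d0,d0
```

Keep in mind that code assembled this way will no longer run on a plain 68000 or 68010 machine.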