English Amiga Board


Old 14 February 2021, 15:57   #21
orangespider
Registered User
 
Join Date: Feb 2021
Location: Becej / Serbia
Posts: 120
Quote:
Originally Posted by ross View Post
Your reasoning applies only to fast 060 (probably 040 also).
It doesn't work for 020/030 and chip ram only systems.

Try it for yourself
Never mind them. Apparently I got infected with the 060.
Old 14 February 2021, 16:25   #22
mcgeezer
Registered User
 
Join Date: Oct 2017
Location: Sunderland, England
Posts: 2,702
OK, so for some odd reason I can't get the movem.w to work. I get wrong values coming in from the lookup table. At first I thought it could be the upper word of d2/d3/d4/d5 not being zero, but I recall these are set anyway when a word operand is used.

Code:
	move.l	d4,a2
	move.l	d5,a3
	
.copy_tile_right:				; Right copy
	movem.w	(a0)+,d2/d3/d4/d5
	move.w	(a5,d2.l*2),(a1)		; Bitplane 1
	move.w	(a5,d3.l*2),8(a1)		; Bitplane 2
	move.w	(a5,d4.l*2),128*16(a1)		; Bitplane 3
	move.w	(a5,d5.l*2),(128*16)+8(a1)	; Bitplane 4
	add.l	d0,a1
	dbf	d1,.copy_tile_right

	move.l	a2,d4
	move.l	a3,d5
It makes no difference anyway because this code is slower than this:

Code:
.copy_tile_right:				; Right copy
	move.w	(a0)+,d2
	beq.s	.1b
	move.w	(a5,d2.l*2),(a1)		; Bitplane 1
	
.1b:	move.w	(a0)+,d2
	beq.s	.2b
	move.w	(a5,d2.l*2),8(a1)		; Bitplane 2
	
.2b:	move.w	(a0)+,d2
	beq.s	.3b
	move.w	(a5,d2.l*2),128*16(a1)		; Bitplane 3
	
.3b:	move.w	(a0)+,d2
	beq.s	.4b
	move.w	(a5,d2.l*2),(128*16)+8(a1)	; Bitplane 4
	
.4b:	add.l	d0,a1
	dbf	d1,.copy_tile_right
Thank you everyone for your input, it was really interesting!

Graeme
Old 14 February 2021, 16:55   #23
orangespider
Registered User
 
Join Date: Feb 2021
Location: Becej / Serbia
Posts: 120
Which of these three is fastest on the 020?
Code:
; case 1
	move.w	(a0)+,d2
	move.w	(a5,d2.l*2),(a1)
; other bitplanes
	move.w	(a0)+,d2
	move.w	(a5,d2.l*2),2(a1)

; case 2
	move.w	(a0),d2
	move.w	(a5,d2.l*2),d3
	swap	d3
	move.w	16(a0),d2
	move.w	(a5,d2.l*2),d3
	move.l	d3,(a1)

; case 3
	move.w	(a0),d2
	move.l	(a5,d2.l*4),d3     ; a5 - pointer to a 256k table
	move.w	16(a0),d2
	move.w	(a5,d2.l*4),d3
	move.l	d3,(a1)
Edit: also, which of these two is faster?
Code:
; case 1
	move.w	(a0)+,d2
	move.l	(a5,d2.l*2),(a0)
	move.w	(a0)+,d2
	move.l	(a5,d2.l*2),8(a0)

; case 2
	move.l	(a0)+,d2
	move.l	(a5,d2.w*2),(a0)
	swap	d2
	move.l	(a5,d2.w*2),8(a0)

Last edited by orangespider; 14 February 2021 at 17:07.
Old 14 February 2021, 17:31   #24
LaBodilsen
Registered User
 
Join Date: Dec 2017
Location: Denmark
Posts: 179
Quote:
Originally Posted by orangespider View Post

Code:
; this part can be set only once for multiple mirrors
	move.l	#%11111111000000001111111100000000, d2

	move.l	(a0)+, d0
	move.l	d0, d1		; d1 = ABCDEFGH.IJKLMNOP.abcdefgh.ijklmnop
	and.l	d2, d0		; d0 = ABCDEFGH.00000000.abcdefgh.00000000
	lsl.l	#8, d1		; d1 = IJKLMNOP.abcdefgh.ijklmnop.00000000
	lsr.l	#8, d0		; d0 = 00000000.ABCDEFGH.00000000.abcdefgh
	and.l	d2, d1		; d1 = IJKLMNOP.00000000.ijklmnop.00000000
	or.l	d1, d0		; d0 = IJKLMNOP.ABCDEFGH.ijklmnop.abcdefgh
Would that not be the same as this (but faster, at least on 68000)?
Code:
	move.l	(a0)+,D0	;d0 = ABCDEFGH.IJKLMNOP.abcdefgh.ijklmnop
	ror.w	#8,D0		;d0 = ABCDEFGH.IJKLMNOP.ijklmnop.abcdefgh
	swap	d0		;d0 = ijklmnop.abcdefgh.ABCDEFGH.IJKLMNOP
	ror.w	#8,d0		;d0 = ijklmnop.abcdefgh.IJKLMNOP.ABCDEFGH
	swap	d0		;d0 = IJKLMNOP.ABCDEFGH.ijklmnop.abcdefgh
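To check that the two sequences agree, here is a minimal C model of both. The function names are made up for illustration, and the code only mirrors the register comments above, not the timing.

```c
#include <assert.h>
#include <stdint.h>

/* Mask-and-shift version: and / lsl / lsr / and / or with the $ff00ff00 mask. */
static uint32_t swap_bytes_mask(uint32_t x)
{
    uint32_t d0 = (x & 0xFF00FF00u) >> 8;   /* upper byte of each word, moved down */
    uint32_t d1 = (x << 8) & 0xFF00FF00u;   /* lower byte of each word, moved up   */
    return d1 | d0;
}

/* ror.w #8 on a 16-bit value: swaps its two bytes. */
static uint16_t ror_w8(uint16_t w)
{
    return (uint16_t)((w >> 8) | (w << 8));
}

/* Rotate version: ror.w #8 / swap / ror.w #8 / swap. */
static uint32_t swap_bytes_rotate(uint32_t x)
{
    uint16_t lo = ror_w8((uint16_t)(x & 0xFFFFu));
    uint16_t hi = ror_w8((uint16_t)(x >> 16));
    return ((uint32_t)hi << 16) | lo;
}

/* Both versions must agree on a handful of sample longwords. */
static void swap_bytes_check(void)
{
    static const uint32_t samples[] = { 0u, 0x12345678u, 0xFFFF0000u, 0x80000001u };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(swap_bytes_mask(samples[i]) == swap_bytes_rotate(samples[i]));
}
```

Calling `swap_bytes_check()` from a test harness is enough to confirm the equivalence for the sample values.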
Old 14 February 2021, 17:39   #25
orangespider
Registered User
 
Join Date: Feb 2021
Location: Becej / Serbia
Posts: 120
Quote:
Originally Posted by LaBodilsen View Post
Would that not be the same as this (but faster, at least on 68000)?
Code:
	move.l	(a0)+,D0	;d0 = ABCDEFGH.IJKLMNOP.abcdefgh.ijklmnop
	ror.w	#8,D0		;d0 = ABCDEFGH.IJKLMNOP.ijklmnop.abcdefgh
	swap	d0		;d0 = ijklmnop.abcdefgh.ABCDEFGH.IJKLMNOP
	ror.w	#8,d0		;d0 = ijklmnop.abcdefgh.IJKLMNOP.ABCDEFGH
	swap	d0		;d0 = IJKLMNOP.ABCDEFGH.ijklmnop.abcdefgh
Yeah, it would, but I just looked at the problem as a c2p and modified my 060 c2p for it; on the 060 the mask-and-shift version would run faster.
Old 14 February 2021, 18:13   #26
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,039
Quote:
Originally Posted by mcgeezer View Post
OK, so for some odd reason I can't get the movem.w to work. I get wrong values coming in from the lookup table. At first I thought it could be the upper word of d2/d3/d4/d5 not being zero, but I recall these are set anyway when a word operand is used.
movem.w sign-extends each word to 32 bits when the destination is a register, so any offset $8000-$ffff becomes a negative 32-bit offset (you are using them right after as .l).
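A small C model of the effect, as a sketch rather than a cycle-accurate description: `movem_w_load` and `table_offset` are hypothetical helper names, and the cast through `int16_t` stands in for the sign extension movem.w performs on each word it loads into a register.

```c
#include <assert.h>
#include <stdint.h>

/* Models movem.w loading one word into a data register:
   the 16-bit value is sign-extended to the full 32 bits. */
static int32_t movem_w_load(uint16_t word)
{
    return (int32_t)(int16_t)word;
}

/* Byte offset that (a5,dN.l*2) adds to a5 for a given register value. */
static int32_t table_offset(int32_t reg)
{
    return reg * 2;
}

static void sign_extend_check(void)
{
    assert(table_offset(movem_w_load(0x7FFF)) == 65534);   /* still inside the table  */
    assert(table_offset(movem_w_load(0x8000)) == -65536);  /* lands 64K BELOW a5      */
    assert(table_offset(movem_w_load(0xFFFF)) == -2);      /* just below a5           */
}
```

So every table index with bit 15 set reads memory in front of the table instead of its upper half.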
Old 14 February 2021, 18:33   #27
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,960
Quote:
Originally Posted by mcgeezer View Post
OK, so for some odd reason I can't get the movem.w to work. I get wrong values coming in from the lookup table. At first I thought it could be the upper word of d2/d3/d4/d5 not being zero, but I recall these are set anyway when a word operand is used.

I think that if you want to reach maximum speed, you can use two versions: the first for flip images with many null word values, and the second for flip images with few null words. That means two short routines and one 128KB table. To access the table after movem.w you must set A5 to A5 + 64K, because the values become negative longwords (sign extension). Anyway, D2.W*2 is enough.
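One way to read the A5 + 64K suggestion, sketched in C. This assumes the table is also stored in signed-index order (entries for $8000-$FFFF first, then $0000-$7FFF), which is my assumption and is not stated in the post; `flip_table`, `build_flip_table` and `lookup_signed` are illustrative names.

```c
#include <assert.h>
#include <stdint.h>

/* Plain bit reversal of a 16-bit word, used to fill the table. */
static uint16_t reverse16(uint16_t v)
{
    uint16_t r = 0;
    for (int i = 0; i < 16; i++) {
        r = (uint16_t)((r << 1) | (v & 1u));
        v >>= 1;
    }
    return r;
}

/* 65536 word entries (128KB) in signed-index order:
   slot 0 holds the entry for $8000 (-32768), slot 32768 the entry for $0000. */
static uint16_t flip_table[65536];

static void build_flip_table(void)
{
    for (uint32_t i = 0; i < 65536; i++) {
        int32_t s = (int32_t)(int16_t)(uint16_t)i;   /* index as movem.w sees it */
        flip_table[s + 32768] = reverse16((uint16_t)i);
    }
}

/* Lookup with a sign-extended index and a base pointing at the middle. */
static uint16_t lookup_signed(uint16_t word)
{
    const uint16_t *a5 = flip_table + 32768;   /* A5 = table start + 64K bytes */
    int32_t d2 = (int32_t)(int16_t)word;       /* register value after movem.w */
    return a5[d2];                             /* (a5,d2.l*2)                  */
}
```

With an unmodified table in unsigned order, the same biased pointer would return entries from the wrong half, so the layout and the bias have to be chosen together.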
Old 14 February 2021, 19:59   #28
mcgeezer
Registered User
 
Join Date: Oct 2017
Location: Sunderland, England
Posts: 2,702
Quote:
Originally Posted by a/b View Post
movem.w sign-extends each word to 32 bits when the destination is a register, so any offset $8000-$ffff becomes a negative 32-bit offset (you are using them right after as .l).
Quote:
Originally Posted by Don_Adan View Post
I think that if you want to reach maximum speed, you can use two versions: the first for flip images with many null word values, and the second for flip images with few null words. That means two short routines and one 128KB table. To access the table after movem.w you must set A5 to A5 + 64K, because the values become negative longwords (sign extension). Anyway, D2.W*2 is enough.
Ahhh, thanks for pointing out the sign extension on movem.w. I should have known that; I incorrectly assumed it was clearing the upper word.
Old 15 February 2021, 17:16   #29
LaBodilsen
Registered User
 
Join Date: Dec 2017
Location: Denmark
Posts: 179
If you are in a pinch for memory, I propose this no-table approach.

Code:
	move.w	(a0)+,D2
	beq.s	.noflip
	moveq	#7-1,D3
	ror.b	#1,D2
.loopshakeflip
	rol.w	#1,D2
	ror.b	#2,D2
	dbf	D3,.loopshakeflip
	rol.w	#1,D2
	ror.b	#1,D2
	move.w	D2,(a1)
.noflip
It came to me in the middle of the night, when I woke up with this idea in my head. I call it a shakeflip, as that is almost what's going on here.

It's optimized for the smallest number of chipmem instruction fetches, and hopefully the "shakeflip" loop will run in cache. The cycle count for the small ror/rol is 6+2 for #1 and 6+4 for #2 (68000 timings), so the rotates cost 7 × 18 + 24 = 150 cycles, not counting the loop and instruction fetches.

It's most likely much slower than a table lookup, but I do like the simplicity of it.
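For anyone who wants to convince themselves that the rotate sequence really reverses all 16 bits, here is a minimal C model. `rol_w` and `ror_b` are hypothetical helpers that imitate the 68k word and byte rotates on the low word of a register; nothing here models timing.

```c
#include <assert.h>
#include <stdint.h>

/* rol.w #n: rotate the whole 16-bit word left, n in 1..15. */
static uint16_t rol_w(uint16_t w, unsigned n)
{
    return (uint16_t)((w << n) | (w >> (16 - n)));
}

/* ror.b #n: rotate only the low byte right, n in 1..7; high byte untouched. */
static uint16_t ror_b(uint16_t w, unsigned n)
{
    uint8_t b = (uint8_t)w;
    b = (uint8_t)((b >> n) | (b << (8 - n)));
    return (uint16_t)((w & 0xFF00u) | b);
}

/* Same instruction order as the loop version above. */
static uint16_t shakeflip(uint16_t d2)
{
    d2 = ror_b(d2, 1);
    for (int i = 0; i < 7; i++) {   /* moveq #7-1,d3 ... dbf d3 */
        d2 = rol_w(d2, 1);
        d2 = ror_b(d2, 2);
    }
    d2 = rol_w(d2, 1);
    d2 = ror_b(d2, 1);
    return d2;
}

/* Straightforward reference reversal. */
static uint16_t reverse16_naive(uint16_t v)
{
    uint16_t r = 0;
    for (int i = 0; i < 16; i++) {
        r = (uint16_t)((r << 1) | (v & 1u));
        v >>= 1;
    }
    return r;
}

/* Exhaustive comparison over all 65536 inputs. */
static void shakeflip_check(void)
{
    for (uint32_t v = 0; v < 65536; v++)
        assert(shakeflip((uint16_t)v) == reverse16_naive((uint16_t)v));
}
```

Each rol.w moves one source bit from the top into the high byte's final position, while the byte rotates keep the low byte's incoming bits in reversed order.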

Last edited by LaBodilsen; 15 February 2021 at 17:52.
Old 15 February 2021, 17:31   #30
mcgeezer
Registered User
 
Join Date: Oct 2017
Location: Sunderland, England
Posts: 2,702
I’ll give it a try and show the results for you.
Old 15 February 2021, 18:17   #31
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Another possibility for the c2p-style approach:
Code:
; data and result in d4

; enter here for 32-bit bit reverse
 ror.b #4,d4
 ror.w #8,d4
 ror.b #4,d4
 swap d4

; enter here for 16-bit only
 ror.b #4,d4
 ror.w #8,d4
 ror.b #4,d4
 move.l d4,d0
 lsr.l #2,d0
 lsl.l #2,d4
 eor.l d4,d0
 and.l #$33333333,d0
 eor.l d0,d4
 move.l d4,d0
 lsr.l #1,d0
 add.l d4,d4
 eor.l d4,d0
 and.l #$55555555,d0
 eor.l d0,d4
Of course the two constants can be preloaded in other registers (and shortened if you use only 16-bit).
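A C sketch of the same sequence, entered at the 32-bit entry point. The helper names are invented for illustration; each helper imitates one 68k instruction acting on a 32-bit register, and `xor_swap` is the lsr/lsl/eor/and/eor group.

```c
#include <assert.h>
#include <stdint.h>

/* ror.b #4: swap the two nibbles of the low byte only. */
static uint32_t ror_b4(uint32_t d)
{
    uint32_t b = d & 0xFFu;
    b = ((b >> 4) | (b << 4)) & 0xFFu;
    return (d & 0xFFFFFF00u) | b;
}

/* ror.w #8: swap the two bytes of the low word only. */
static uint32_t ror_w8(uint32_t d)
{
    uint32_t w = d & 0xFFFFu;
    w = ((w >> 8) | (w << 8)) & 0xFFFFu;
    return (d & 0xFFFF0000u) | w;
}

/* swap: exchange the two 16-bit halves. */
static uint32_t swap_halves(uint32_t d)
{
    return (d >> 16) | (d << 16);
}

/* move.l d4,d0 / lsr.l #n,d0 / lsl.l #n,d4 / eor.l d4,d0 / and.l #mask,d0 / eor.l d0,d4 */
static uint32_t xor_swap(uint32_t d4, unsigned n, uint32_t mask)
{
    uint32_t d0 = d4 >> n;
    d4 <<= n;
    d0 = (d0 ^ d4) & mask;
    return d4 ^ d0;
}

static uint32_t bitrev32(uint32_t d4)
{
    d4 = ror_b4(d4); d4 = ror_w8(d4); d4 = ror_b4(d4);   /* reverse nibbles, low word  */
    d4 = swap_halves(d4);
    d4 = ror_b4(d4); d4 = ror_w8(d4); d4 = ror_b4(d4);   /* reverse nibbles, other word */
    d4 = xor_swap(d4, 2, 0x33333333u);                   /* swap bit pairs in nibbles  */
    d4 = xor_swap(d4, 1, 0x55555555u);                   /* swap adjacent bits         */
    return d4;
}
```

The rotates reverse the order of the eight nibbles, and the two masked XOR steps then reverse the four bits inside each nibble.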
Old 15 February 2021, 18:19   #32
ross
Defendit numerus
 
 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
Oh well, then also check the good old magnitude-progressive group swapping.
Not well suited to 16-bit or the 020, but give it a try.

Code:
    move.w   (a0)+,d2
    beq.s    .noflip
    move.w   #$5555,d3
    and.w    d2,d3
    eor.w    d3,d2
    add.w    d3,d3
    lsr.w    #1,d2
    or.w     d3,d2
    move.w   #$3333,d3
    and.w    d2,d3
    eor.w    d3,d2
    lsl.w    #2,d3
    lsr.w    #2,d2
    or.w     d3,d2
    move.w   #$0f0f,d3
    and.w    d2,d3
    eor.w    d3,d2
    lsl.w    #4,d3
    lsr.w    #4,d2
    or.w     d3,d2
    rol.w    #8,d2
    move.w   d2,(a1)
.noflip
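The group-swapping steps can be modelled in a few lines of C. This is only a sketch of the data flow; `swap_groups` is a made-up helper that stands for one and/eor/shift/shift/or group.

```c
#include <assert.h>
#include <stdint.h>

/* One stage: split d2 into the bits under the mask (d3) and the rest,
   shift them past each other by n, and merge again. */
static uint16_t swap_groups(uint16_t d2, uint16_t mask, unsigned n)
{
    uint16_t d3 = (uint16_t)(d2 & mask);   /* and.w d2,d3 */
    d2 = (uint16_t)(d2 ^ d3);              /* eor.w d3,d2 */
    return (uint16_t)((d3 << n) | (d2 >> n));
}

static uint16_t group_flip16(uint16_t d2)
{
    d2 = swap_groups(d2, 0x5555, 1);            /* adjacent bits  */
    d2 = swap_groups(d2, 0x3333, 2);            /* bit pairs      */
    d2 = swap_groups(d2, 0x0F0F, 4);            /* nibbles        */
    return (uint16_t)((d2 << 8) | (d2 >> 8));   /* rol.w #8: bytes */
}

/* Reference reversal for comparison. */
static uint16_t reverse16_ref(uint16_t v)
{
    uint16_t r = 0;
    for (int i = 0; i < 16; i++) {
        r = (uint16_t)((r << 1) | (v & 1u));
        v >>= 1;
    }
    return r;
}

static void group_flip_check(void)
{
    for (uint32_t v = 0; v < 65536; v++)
        assert(group_flip16((uint16_t)v) == reverse16_ref((uint16_t)v));
}
```

Each stage doubles the size of the groups being exchanged (1, 2, 4, then 8 bits), which is where the name comes from.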
Old 15 February 2021, 19:06   #33
LaBodilsen
Registered User
 
Join Date: Dec 2017
Location: Denmark
Posts: 179
Okay, I did some tests myself for a 16x16 block in 4 bitplanes (with WinUAE though).

It turns out my own loop is slower than just unrolling it.

Code:
	move.w	(a0)+,D2
	beq.s	.noflip
	ror.b	#1,D2
	rol.w	#1,D2
	ror.b	#2,D2
	rol.w	#1,D2
	ror.b	#2,D2
	rol.w	#1,D2
	ror.b	#2,D2
	rol.w	#1,D2
	ror.b	#2,D2
	rol.w	#1,D2
	ror.b	#2,D2
	rol.w	#1,D2
	ror.b	#2,D2
	rol.w	#1,D2
	ror.b	#2,D2
	rol.w	#1,D2
	ror.b	#1,D2
	move.w	D2,(a1)
.noflip
	move.w	(a0),D2
With regard to speed, meynaf's code is the fastest of the three, at about 3 2/3 raster lines when the two constants are preloaded into registers. Ross's and mine are the same at about 4 2/3 raster lines, but my code only uses one data register.

Of course table lookup is the fastest, at about 2 1/3 raster lines.
(Edit: fixed my unrolled code, which made it faster.)

Last edited by LaBodilsen; 15 February 2021 at 19:27.
Old 15 February 2021, 19:18   #34
DanScott
Lemon. / Core Design
 
 
Join Date: Mar 2016
Location: Tier 5
Posts: 1,212
Could you use a sequence of

lsr.w #1,d2	; lsr rather than ror, because ror leaves the X flag untouched
addx.w d3,d3

It uses one more data register.
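A C sketch of the idea, assuming the pair is repeated 16 times and that the bit leaving d2 ends up in the X flag that addx then adds in. The variable names follow the registers; the function name is invented.

```c
#include <assert.h>
#include <stdint.h>

/* Shift bits out of the bottom of d2 and into the bottom of d3. */
static uint16_t flip16_shift(uint16_t d2)
{
    uint16_t d3 = 0;
    for (int i = 0; i < 16; i++) {
        unsigned x = d2 & 1u;              /* bit that falls out into X */
        d2 >>= 1;                          /* shift d2 right by one     */
        d3 = (uint16_t)((d3 << 1) | x);    /* addx.w d3,d3: d3*2 + X    */
    }
    return d3;
}
```

The first bit shifted out of d2 ends up at the top of d3 after 16 steps, so d3 holds the mirrored word.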
Old 15 February 2021, 21:08   #35
orangespider
Registered User
 
Join Date: Feb 2021
Location: Becej / Serbia
Posts: 120
I believe this is the fastest no-table code:

Code:
	move.l	#%11111111000000001111111100000000,d2
	move.l	#%11110000111100001111000011110000,d3
	move.l	#%11001100110011001100110011001100,d4
	move.l	#%10101010101010101010101010101010,d5
.copy_tile_right:
	move.l	(a0)+,d6
	beq.s	.sk0
	move.l	d6,d7		; d7 - ABCDEFGH.IJKLMNOP.abcdefgh.ijklmnop
	and.l	d2,d6		; d6 - ABCDEFGH.00000000.abcdefgh.00000000
	eor.l	d6,d7		; d7 - 00000000.IJKLMNOP.00000000.ijklmnop
	swap	d6		; d6 - abcdefgh.00000000.ABCDEFGH.00000000
	or.l	d7,d6		; d6 - abcdefgh.IJKLMNOP.ABCDEFGH.ijklmnop
	move.l	d6,d7		; d7 - abcdefgh.IJKLMNOP.ABCDEFGH.ijklmnop
	and.l	d3,d6		; d6 - abcd0000.IJKL0000.ABCD0000.ijkl0000
	eor.l	d6,d7		; d7 - 0000efgh.0000MNOP.0000EFGH.0000mnop
	ror.l	#8,d6		; d6 - ijkl0000.abcd0000.IJKL0000.ABCD0000
	or.l	d7,d6		; d6 - ijklefgh.abcdMNOP.IJKLEFGH.ABCDmnop
	move.l	d6,d7		; d7 - ijklefgh.abcdMNOP.IJKLEFGH.ABCDmnop
	and.l	d4,d6		; d6 - ij00ef00.ab00MN00.IJ00EF00.AB00mn00
	eor.l	d6,d7		; d7 - 00kl00gh.00cd00OP.00KL00GH.00CD00op
	ror.l	#4,d6		; d6 - mn00ij00.ef00ab00.MN00IJ00.EF00AB00
	or.l	d7,d6		; d6 - mnklijgh.efcdabOP.MNKLIJGH.EFCDABop
	move.l	d6,d7		; d7 - mnklijgh.efcdabOP.MNKLIJGH.EFCDABop
	and.l	d5,d6		; d6 - m0k0i0g0.e0c0a0O0.M0K0I0G0.E0C0A0o0
	eor.l	d6,d7		; d7 - 0n0l0j0h.0f0d0b0P.0N0L0J0H.0F0D0B0p
	ror.l	#3,d6		; d6 - 0o0m0k0i.0g0e0c0a.0O0M0K0I.0G0E0C0A
	ror.l	#1,d7		; d7 - p0n0l0j0.h0f0d0b0.P0N0L0J0.H0F0D0B0
	or.l	d7,d6		; d6 - ponmlkji.hgfedcba.PONMLKJI.HGFEDCBA
	move.w	d6,(a1)
	swap	d6
	move.w	d6,8(a1)
.sk0:
	move.l	(a0)+,d6
	beq.s	.sk1
	move.l	d6,d7		; d7 - ABCDEFGH.IJKLMNOP.abcdefgh.ijklmnop
	and.l	d2,d6		; d6 - ABCDEFGH.00000000.abcdefgh.00000000
	eor.l	d6,d7		; d7 - 00000000.IJKLMNOP.00000000.ijklmnop
	swap	d6		; d6 - abcdefgh.00000000.ABCDEFGH.00000000
	or.l	d7,d6		; d6 - abcdefgh.IJKLMNOP.ABCDEFGH.ijklmnop
	move.l	d6,d7		; d7 - abcdefgh.IJKLMNOP.ABCDEFGH.ijklmnop
	and.l	d3,d6		; d6 - abcd0000.IJKL0000.ABCD0000.ijkl0000
	eor.l	d6,d7		; d7 - 0000efgh.0000MNOP.0000EFGH.0000mnop
	ror.l	#8,d6		; d6 - ijkl0000.abcd0000.IJKL0000.ABCD0000
	or.l	d7,d6		; d6 - ijklefgh.abcdMNOP.IJKLEFGH.ABCDmnop
	move.l	d6,d7		; d7 - ijklefgh.abcdMNOP.IJKLEFGH.ABCDmnop
	and.l	d4,d6		; d6 - ij00ef00.ab00MN00.IJ00EF00.AB00mn00
	eor.l	d6,d7		; d7 - 00kl00gh.00cd00OP.00KL00GH.00CD00op
	ror.l	#4,d6		; d6 - mn00ij00.ef00ab00.MN00IJ00.EF00AB00
	or.l	d7,d6		; d6 - mnklijgh.efcdabOP.MNKLIJGH.EFCDABop
	move.l	d6,d7		; d7 - mnklijgh.efcdabOP.MNKLIJGH.EFCDABop
	and.l	d5,d6		; d6 - m0k0i0g0.e0c0a0O0.M0K0I0G0.E0C0A0o0
	eor.l	d6,d7		; d7 - 0n0l0j0h.0f0d0b0P.0N0L0J0H.0F0D0B0p
	ror.l	#3,d6		; d6 - 0o0m0k0i.0g0e0c0a.0O0M0K0I.0G0E0C0A
	ror.l	#1,d7		; d7 - p0n0l0j0.h0f0d0b0.P0N0L0J0.H0F0D0B0
	or.l	d7,d6		; d6 - ponmlkji.hgfedcba.PONMLKJI.HGFEDCBA
	move.w	d6,128*16(a1)
	swap	d6
	move.w	d6,128*16+8(a1)
.sk1:
	add.l	d0,a1
	dbf		d1,.copy_tile_right
The register usage is compatible with the table codes, so you can test how well this does. It is definitely faster than LaBodilsen's code because it works on 32 bits at once instead of 16. Btw, this can be even faster if you use 32x32 as a source (or even full 128x128) instead of 16x16.

edit: If I counted the cycles correctly, this should run at 251 cycles per loop iteration (213 cycles for 32x32) and the table approach would run at 200 cycles per iteration. But my cycle counts might be wrong.
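A C model of one half of the loop body, written as a sketch to check the bit bookkeeping in the comments (it says nothing about the cycle counts). The variables are named after the registers, and `ror32` and `flip_pair` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* ror.l #n, n in 1..31 */
static uint32_t ror32(uint32_t v, unsigned n)
{
    return (v >> n) | (v << (32 - n));
}

/* Takes two source words packed in one longword and returns the mirrored
   words with the halves exchanged, ready for move.w / swap / move.w. */
static uint32_t flip_pair(uint32_t d6)
{
    uint32_t d7;

    d7 = d6; d6 &= 0xFF00FF00u; d7 ^= d6;
    d6 = (d6 >> 16) | (d6 << 16);            /* swap d6     */
    d6 |= d7;

    d7 = d6; d6 &= 0xF0F0F0F0u; d7 ^= d6;
    d6 = ror32(d6, 8);                       /* ror.l #8,d6 */
    d6 |= d7;

    d7 = d6; d6 &= 0xCCCCCCCCu; d7 ^= d6;
    d6 = ror32(d6, 4);                       /* ror.l #4,d6 */
    d6 |= d7;

    d7 = d6; d6 &= 0xAAAAAAAAu; d7 ^= d6;
    d6 = ror32(d6, 3);                       /* ror.l #3,d6 */
    d7 = ror32(d7, 1);                       /* ror.l #1,d7 */
    return d6 | d7;
}
```

The low word of the result is the mirror of the first source word, which is why the routine writes the low word to (a1) before the swap.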

Last edited by orangespider; 15 February 2021 at 22:04.
Old 16 February 2021, 17:21   #36
LaBodilsen
Registered User
 
Join Date: Dec 2017
Location: Denmark
Posts: 179
Quote:
Originally Posted by orangespider View Post
I believe this is the fastest no-table code:
It is indeed very fast, but meynaf's version is just a smidge faster (about 1/4 of a raster line) when used as 32-bit.

Code:
	move.l	#$33333333,d5
	move.l	#$55555555,d6
.copy_tile_right:
	move.l	(a0)+,d4
	beq.s	.noflip
	ror.b	#4,d4
	ror.w	#8,d4
	ror.b	#4,d4
	swap	d4
	ror.b	#4,d4
	ror.w	#8,d4
	ror.b	#4,d4
	move.l	d4,d0
	lsr.l	#2,d0
	lsl.l	#2,d4
	eor.l	d4,d0
	and.l	d5,d0
	eor.l	d0,d4
	move.l	d4,d0
	lsr.l	#1,d0
	add.l	d4,d4
	eor.l	d4,d0
	and.l	d6,d0
	eor.l	d0,d4
	move.w	d4,(a1)
	swap	d4
	move.w	d4,8(a1)
.noflip
	move.l	(a0)+,d4
	beq.s	.noflip2
	ror.b	#4,d4
	ror.w	#8,d4
	ror.b	#4,d4
	swap	d4
	ror.b	#4,d4
	ror.w	#8,d4
	ror.b	#4,d4
	move.l	d4,d0
	lsr.l	#2,d0
	lsl.l	#2,d4
	eor.l	d4,d0
	and.l	d5,d0
	eor.l	d0,d4
	move.l	d4,d0
	lsr.l	#1,d0
	add.l	d4,d4
	eor.l	d4,d0
	and.l	d6,d0
	eor.l	d0,d4
	move.w	d4,128*16(a1)
	swap	d4
	move.w	d4,128*16+8(a1)
.noflip2
	add.l	d1,a1
	dbf	d2,.copy_tile_right
This is almost as fast as the table version, only about 1/4 raster line slower for a 16x16 block, although the table version could also be extended to 32-bit. There is also the question of how much DMA will affect the results if the blitter is running in the background.
(Tested in WinUAE, A1200 chip-only, cycle-exact.)

I have a feeling it could be faster, as you don't need to flip the entire longword, only the two word parts, so maybe a swap somewhere can be cut out.

Last edited by LaBodilsen; 16 February 2021 at 17:34.
Old 17 February 2021, 19:16   #37
LaBodilsen
Registered User
 
Join Date: Dec 2017
Location: Denmark
Posts: 179
Quote:
Originally Posted by LaBodilsen View Post
It is indeed very fast, but meynaf's version is just a smidge faster (about 1/4 of a raster line) when used as 32-bit.
This is almost as fast as the table version, only about 1/4 raster line slower for a 16x16 block, although the table version could also be extended to 32-bit. There is also the question of how much DMA will affect the results if the blitter is running in the background.
(Tested in WinUAE, A1200 chip-only, cycle-exact.)
If we take ross's code, preload the constants, and extend it to 32-bit, then it is just as fast as meynaf's version.

Code:
	move.l	#$33333333,d1
	move.l	#$55555555,d4
	move.l	#$0f0f0f0f,d5
.copy_tile_right:
	move.l	(a0)+,d2
	beq.s	.noflip
	move.l	d4,d3
	and.l	d2,d3
	eor.l	d3,d2
	add.l	d3,d3
	lsr.l	#1,d2
	or.l	d3,d2
	move.l	d1,d3
	and.l	d2,d3
	eor.l	d3,d2
	lsl.l	#2,d3
	lsr.l	#2,d2
	or.l	d3,d2
	move.l	d5,d3
	and.l	d2,d3
	eor.l	d3,d2
	lsl.l	#4,d3
	lsr.l	#4,d2
	or.l	d3,d2
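	rol.w	#8,d2		; swap bytes in the low word (rol.l #8 would mix bytes of both words)
	move.w	d2,8(a1)
	swap	d2
	rol.w	#8,d2		; swap bytes in the other word
	move.w	d2,(a1)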
.noflip
	move.l	(a0)+,d2
	beq.s	.noflip2
	move.l	d4,d3
	and.l	d2,d3
	eor.l	d3,d2
	add.l	d3,d3
	lsr.l	#1,d2
	or.l	d3,d2
	move.l	d1,d3
	and.l	d2,d3
	eor.l	d3,d2
	lsl.l	#2,d3
	lsr.l	#2,d2
	or.l	d3,d2
	move.l	d5,d3
	and.l	d2,d3
	eor.l	d3,d2
	lsl.l	#4,d3
	lsr.l	#4,d2
	or.l	d3,d2
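	rol.w	#8,d2		; swap bytes in the low word (rol.l #8 would mix bytes of both words)
	move.w	d2,128*16+8(a1)
	swap	d2
	rol.w	#8,d2		; swap bytes in the other word
	move.w	d2,128*16(a1)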
.noflip2
	add.l	d0,a1
	dbf	d7,.copy_tile_right
Here is the result for a 16x16 block:


Red = my code as 32-bit
Purple = table lookup
Green = meynaf's code as 32-bit
Yellow = ross's code as 32-bit
Turquoise = orangespider's code

PS: the test is done for "always data in" and no zero data, so the .noflip branch is never taken. Data in was always $00020002, and the result was verified as $40004000 for all versions.
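One caveat about this kind of test data, offered as an observation rather than a criticism of the timings: a longword whose two halves are identical cannot reveal mistakes in how bytes or words are reordered at the end of a routine. The C sketch below compares two hypothetical finishing steps after a per-byte bit reversal, a single 32-bit rotate by 8 and a byte swap inside each word followed by a half swap. All function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the bits inside each byte, leaving the bytes where they are. */
static uint32_t reverse_in_bytes(uint32_t v)
{
    v = ((v & 0x55555555u) << 1) | ((v >> 1) & 0x55555555u);
    v = ((v & 0x33333333u) << 2) | ((v >> 2) & 0x33333333u);
    v = ((v & 0x0F0F0F0Fu) << 4) | ((v >> 4) & 0x0F0F0F0Fu);
    return v;
}

/* Finish A: one 32-bit rotate left by 8. */
static uint32_t finish_rotate8(uint32_t v)
{
    return (v << 8) | (v >> 24);
}

/* Finish B: swap bytes inside each word, then swap the halves
   (this completes a full 32-bit bit reversal). */
static uint32_t finish_per_word(uint32_t v)
{
    v = ((v & 0x00FF00FFu) << 8) | ((v >> 8) & 0x00FF00FFu);
    return (v << 16) | (v >> 16);
}

static uint32_t flip_a(uint32_t v) { return finish_rotate8(reverse_in_bytes(v)); }
static uint32_t flip_b(uint32_t v) { return finish_per_word(reverse_in_bytes(v)); }
```

Both variants map $00020002 to $40004000, but they disagree on $01020304, so asymmetric test words (or a sweep of values) are a safer check.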
Attached image: flipbittest.png

Last edited by LaBodilsen; 17 February 2021 at 19:23.
Old 17 February 2021, 19:36   #38
roondar
Registered User
 
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,410
It may be wise to try these routines on a real A1200 as well, as 68020 emulation isn't cycle accurate. My personal experience is that real A1200s tend to be slower on RAM access than WinUAE suggests. Do note this is based on my tests with WinUAE 4.2.0; I haven't tried them since upgrading to 4.4.0.
Old 18 February 2021, 15:26   #39
Thorham
Computer Nerd
 
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,753
Quote:
Originally Posted by mcgeezer View Post
Recently I did a bit of work on sprite flipping for a Street Fighter POC by reconstructing a large 128x128 sprite from 16x16 tiles.
If you can do 32x16 tiles instead, and assuming 64 pixel sprites have 64 contiguous pixels, you could do this perhaps (single bitplane):
Code:
    move.l  #$55555555,d2
    move.l  #$33333333,d3
    move.l  #$0f0f0f0f,d4
.loop
    move.l  (a0)+,d0

    move.l  d0,d1
    and.l   d2,d1
    eor.l   d1,d0
    lsr.l   #1,d0
    add.l   d1,d1
    or.l    d1,d0

    move.l  d0,d1
    and.l   d3,d1
    eor.l   d1,d0
    lsr.l   #2,d0
    lsl.l   #2,d1
    or.l    d1,d0

    move.l  d0,d1
    and.l   d4,d1
    eor.l   d1,d0
    lsr.l   #4,d0
    lsl.l   #4,d1
    or.l    d1,d0

    rol.w   #8,d0
    swap    d0
    rol.w   #8,d0
    
    move.l  d0,(a1)
    add.l   d5,a1

    dbra    d7,.loop
The main thing is getting rid of 16 bit writes to chipmem.
Old 18 February 2021, 19:56   #40
mcgeezer
Registered User
 
Join Date: Oct 2017
Location: Sunderland, England
Posts: 2,702
Quote:
Originally Posted by Thorham View Post
If you can do 32x16 tiles instead, and assuming 64 pixel sprites have 64 contiguous pixels, you could do this perhaps (single bitplane):
Going to 32x16 increases RAM use by a huge amount.


Here's my final effort.

82 scan lines rendering both hardware sprites each frame from built-up 16x16 tiles. Only one face of the tiles is in RAM, and the flip is done on the fly as the sprite is built up.

Any tiles different from the last frame are cleared, as opposed to simply mass-clearing the whole 128x128 sprite.

[Embedded YouTube video]