16x16 CPU tile flip optimisations - Page 2

orangespider · 14 February 2021, 15:57

Quote:

Originally Posted by ross

Your reasoning applies only to fast 060 (probably 040 also).
It doesn't work for 020/030 and chip ram only systems.

Try it for yourself

Nevermind them. Apparently I got infected with the 060.

mcgeezer · 14 February 2021, 16:25

OK, so for some odd reason I can't get the movem.w to work... I get wrong values coming in from the lookup table. At first I thought it could be the upper word of d2/d3/d4/d5 not being zero, but I recall these are set anyway when a word operand is used.

Code:

	move.l	d4,a2
	move.l	d5,a3
	
.copy_tile_right:				; Right copy
	movem.w	(a0)+,d2/d3/d4/d5
	move.w	(a5,d2.l*2),(a1)		; Bitplane 1
	move.w	(a5,d3.l*2),8(a1)		; Bitplane 2
	move.w	(a5,d4.l*2),128*16(a1)		; Bitplane 3
	move.w	(a5,d5.l*2),(128*16)+8(a1)	; Bitplane 4
	add.l	d0,a1
	dbf	d1,.copy_tile_right

	move.l	a2,d4
	move.l	a3,d5

It makes no difference anyway because this code is slower than this:

Code:

.copy_tile_right:				; Right copy
	move.w	(a0)+,d2
	beq.s	.1b
	move.w	(a5,d2.l*2),(a1)		; Bitplane 1
	
.1b:	move.w	(a0)+,d2
	beq.s	.2b
	move.w	(a5,d2.l*2),8(a1)		; Bitplane 2
	
.2b:	move.w	(a0)+,d2
	beq.s	.3b
	move.w	(a5,d2.l*2),128*16(a1)		; Bitplane 3
	
.3b:	move.w	(a0)+,d2
	beq.s	.4b
	move.w	(a5,d2.l*2),(128*16)+8(a1)	; Bitplane 4
	
.4b:	add.l	d0,a1
	dbf	d1,.copy_tile_right

Thank you everyone for their input, it was really interesting!

Graeme

orangespider · 14 February 2021, 16:55

What's fastest on 020 out of these 3:

Code:

; case 1
	move.w	(a0)+,d2
	move.w	(a5,d2.l*2),(a1)
; other bitplanes
	move.w	(a0)+,d2
	move.w	(a5,d2.l*2),2(a1)

; case 2
	move.w	(a0),d2
	move.w	(a5,d2.l*2),d3
	swap	d3
	move.w	16(a0),d2
	move.w	(a5,d2.l*2),d3
	move.l	d3,(a1)

; case 3
	move.w	(a0),d2
	move.l	(a5,d2.l*4),d3     ; a5 - pointer to a 256k table
	move.w	16(a0),d2
	move.w	(a5,d2.l*4),d3
	move.l	d3,(a1)

Edit: also, what's faster out of these 2:

Code:

; case 1
	move.w	(a0)+,d2
	move.l	(a5,d2.l*2),(a0)
	move.w	(a0)+,d2
	move.l	(a5,d2.l*2),8(a0)

; case 2
	move.l	(a0)+,d2
	move.l	(a5,d2.w*2),(a0)
	swap	d2
	move.l	(a5,d2.w*2),8(a0)

LaBodilsen · 14 February 2021, 17:31

Quote:

Originally Posted by orangespider

Code:

; this part can be set only once for multiple mirrors
	move.l	#%11111111000000001111111100000000, d2

	move.l	(a0)+, d0
	move.l	d0, d1		; d1 = ABCDEFGH.IJKLMNOP.abcdefgh.ijklmnop
	and.l	d2, d0		; d0 = ABCDEFGH.00000000.abcdefgh.00000000
	lsl.l	#8, d1		; d1 = IJKLMNOP.abcdefgh.ijklmnop.00000000
	lsr.l	#8, d0		; d0 = 00000000.ABCDEFGH.00000000.abcdefgh
	and.l	d2, d1		; d1 = IJKLMNOP.00000000.ijklmnop.00000000
	or.l	d1, d0		; d0 = IJKLMNOP.ABCDEFGH.ijklmnop.abcdefgh

Would that not be the same as this: (but faster at least on 68000)

Code:

	move.l	(a0)+,D0	;d0 = ABCDEFGH.IJKLMNOP.abcdefgh.ijklmnop
	ror.w	#8,D0		;d0 = ABCDEFGH.IJKLMNOP.ijklmnop.abcdefgh
	swap	d0		;d0 = ijklmnop.abcdefgh.ABCDEFGH.IJKLMNOP
	ror.w	#8,d0		;d0 = ijklmnop.abcdefgh.IJKLMNOP.ABCDEFGH
	swap	d0		;d0 = IJKLMNOP.ABCDEFGH.ijklmnop.abcdefgh

orangespider · 14 February 2021, 17:39

Quote:

Originally Posted by LaBodilsen

Would that not be the same as this: (but faster at least on 68000)

Code:

	move.l	(a0)+,D0	;d0 = ABCDEFGH.IJKLMNOP.abcdefgh.ijklmnop
	ror.w	#8,D0		;d0 = ABCDEFGH.IJKLMNOP.ijklmnop.abcdefgh
	swap	d0		;d0 = ijklmnop.abcdefgh.ABCDEFGH.IJKLMNOP
	ror.w	#8,d0		;d0 = ijklmnop.abcdefgh.IJKLMNOP.ABCDEFGH
	swap	d0		;d0 = IJKLMNOP.ABCDEFGH.ijklmnop.abcdefgh

yeah, it would, but I just looked at the problem as a c2p and modified my 060 c2p for it, where the other code would run faster.

a/b · 14 February 2021, 18:13

Quote:

Originally Posted by mcgeezer

OK, so for some odd reason I can't get the movem.w to work... I get wrong values coming in from the lookup table. At first I thought it could be the upper word of d2/d3/d4/d5 not being zero, but I recall these are set anyway when a word operand is used.

movem.w sign-extends, any offset $8000-$ffff becomes a negative 32-bit offset (you are using them right after as .l).
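
As a rough sketch of the difference (assuming, as in mcgeezer's loop, that a5 points at the start of the table and the index is used as d2.l):

Code:

	; movem.w sign-extends each word it loads, even into data registers:
	;   word $8001 in memory -> d2 = $ffff8001, a negative .l index
	; move.w only replaces the low word, so clearing the register once
	; beforehand keeps the same index positive: d2 = $00008001
	moveq	#0,d2			; clear the upper word once, outside the loop
	move.w	(a0)+,d2		; low word loaded, upper word stays zero
	move.w	(a5,d2.l*2),(a1)	; unsigned index into the table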

Don_Adan · 14 February 2021, 18:33

Quote:

Originally Posted by mcgeezer

OK, so for some odd reason I can't get the movem.w to work... I get wrong values coming in from the lookup table. At first I thought it could be the upper word of d2/d3/d4/d5 not being zero, but I recall these are set anyway when a word operand is used.

Code:

	move.l	d4,a2
	move.l	d5,a3
	
.copy_tile_right:				; Right copy
	movem.w	(a0)+,d2/d3/d4/d5
	move.w	(a5,d2.l*2),(a1)		; Bitplane 1
	move.w	(a5,d3.l*2),8(a1)		; Bitplane 2
	move.w	(a5,d4.l*2),128*16(a1)		; Bitplane 3
	move.w	(a5,d5.l*2),(128*16)+8(a1)	; Bitplane 4
	add.l	d0,a1
	dbf	d1,.copy_tile_right

	move.l	a2,d4
	move.l	a3,d5

It makes no difference anyway because this code is slower than this:

Code:

.copy_tile_right:				; Right copy
	move.w	(a0)+,d2
	beq.s	.1b
	move.w	(a5,d2.l*2),(a1)		; Bitplane 1
	
.1b:	move.w	(a0)+,d2
	beq.s	.2b
	move.w	(a5,d2.l*2),8(a1)		; Bitplane 2
	
.2b:	move.w	(a0)+,d2
	beq.s	.3b
	move.w	(a5,d2.l*2),128*16(a1)		; Bitplane 3
	
.3b:	move.w	(a0)+,d2
	beq.s	.4b
	move.w	(a5,d2.l*2),(128*16)+8(a1)	; Bitplane 4
	
.4b:	add.l	d0,a1
	dbf	d1,.copy_tile_right

Thank you everyone for their input, it was really interesting!

Graeme

I think that if you want to reach maximum speed, you can use two versions: one for flip images with many zero words, and one for flip images with few zero words. That means two short routines and one 128KB table. To access the table after movem.w you must set A5 to A5 + 64K, because the words are sign-extended and the upper half of the indices become negative longword values. Anyway, D2.W*2 is enough.
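
One way to read the A5 + 64K suggestion, as a rough sketch: the table has to be stored in signed-index order, so that a sign-extended index lands on the right entry. The label fliptable (128KB of free memory) and the slow bit loop used to fill it are assumptions for illustration, not code from this thread.

Code:

; Fill the table so that the entry for signed index s sits at fliptable+$10000+s*2
; (indices $8000-$ffff come first, then $0000-$7fff).
	lea	fliptable,a4		; assumed label: 128KB buffer
	move.w	#$8000,d0		; start with the most negative index
.next:	move.w	d0,d2
	moveq	#16-1,d4
.bit:	lsr.w	#1,d2			; lowest bit -> X
	addx.w	d3,d3			; shift it into the result from the right
	dbf	d4,.bit			; after 16 rounds d3.w = d0.w mirrored
	move.w	d3,(a4)+
	addq.w	#1,d0
	cmp.w	#$8000,d0		; back at the start value: all 65536 entries done
	bne.s	.next

	lea	fliptable+$10000,a5	; a5 points at the entry for index 0

; In the copy loop the sign-extended words can then be used directly:
	movem.w	(a0)+,d2/d3/d4/d5
	move.w	(a5,d2.w*2),(a1)		; Bitplane 1
	move.w	(a5,d3.w*2),8(a1)		; Bitplane 2

The remaining two bitplanes follow the same pattern. The fill loop is slow, but it only has to run once at startup.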

mcgeezer · 14 February 2021, 19:59

Quote:

Originally Posted by a/b

movem.w sign-extends, any offset $8000-$ffff becomes a negative 32-bit offset (you are using them right after as .l).

Quote:

Originally Posted by Don_Adan

I think that if you want to reach maximum speed, you can use two versions: one for flip images with many zero words, and one for flip images with few zero words. That means two short routines and one 128KB table. To access the table after movem.w you must set A5 to A5 + 64K, because the words are sign-extended and the upper half of the indices become negative longword values. Anyway, D2.W*2 is enough.

Ahhh thanks for pointing out the sign-extend on movem.w - I should have known that and incorrectly assumed it was clearing the upper word.

LaBodilsen · 15 February 2021, 17:16

If you are in a pinch for memory, I propose this no-table approach.

Code:

	move.w	(a0)+,D2
	beq.s	.noflip
	moveq	#7-1,D3
	ror.b	#1,D2
.loopshakeflip
	rol.w	#1,D2
	ror.b	#2,D2
	dbf	D3,.loopshakeflip
	rol.w	#1,D2
	ror.b	#1,D2
	move.w	D2,(a1)
.noflip

It came to me in the middle of the night, when I woke up with this idea in my head. I call it a shakeflip, as that is almost what's going on here.

It's optimized for the smallest number of chipmem instruction fetches, and hopefully the "shakeflip" loop will run from cache. The cycle count for a small ror/rol is 6+2 for #1 and 6+4 for #2, so 7 times 18 + 24 = 150 cycles, not counting the loop and instruction fetches.

It's most likely much slower than a table lookup, but I do like the simplicity in it.

mcgeezer · 15 February 2021, 17:31

I’ll give it a try and show the results for you.

meynaf · 15 February 2021, 18:17

Another possibility for the c2p-style approach :

Code:

; data and result in d4

; enter here for 32-bit bit reverse
 ror.b #4,d4
 ror.w #8,d4
 ror.b #4,d4
 swap d4

; enter here for 16-bit only
 ror.b #4,d4
 ror.w #8,d4
 ror.b #4,d4
 move.l d4,d0
 lsr.l #2,d0
 lsl.l #2,d4
 eor.l d4,d0
 and.l #$33333333,d0
 eor.l d0,d4
 move.l d4,d0
 lsr.l #1,d0
 add.l d4,d4
 eor.l d4,d0
 and.l #$55555555,d0
 eor.l d0,d4

Of course the two constants can be preloaded in other registers (and shortened if you use only 16-bit).
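
For the 16-bit case, that could look roughly like this (a sketch; the choice of d5/d6 for the constants and d0 as scratch is an assumption):

Code:

	move.w	#$3333,d5		; preload once, outside the loop
	move.w	#$5555,d6

; data and result in d4.w, d0 is scratch
	ror.b	#4,d4			; these three rotates reverse
	ror.w	#8,d4			; the order of the four nibbles
	ror.b	#4,d4
	move.w	d4,d0
	lsr.w	#2,d0
	lsl.w	#2,d4
	eor.w	d4,d0
	and.w	d5,d0
	eor.w	d0,d4			; bit pairs swapped within each nibble
	move.w	d4,d0
	lsr.w	#1,d0
	add.w	d4,d4
	eor.w	d4,d0
	and.w	d6,d0
	eor.w	d0,d4			; neighbouring bits swapped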

ross · 15 February 2021, 18:19

Oh well, then check also the good old magnitude progressive group swapping.
Not much suited for 16 bit or 020, but give it a try

Code:

    move.w   (a0)+,d2
    beq.s    .noflip
    move.w   #$5555,d3
    and.w    d2,d3
    eor.w    d3,d2
    add.w    d3,d3
    lsr.w    #1,d2
    or.w     d3,d2
    move.w   #$3333,d3
    and.w    d2,d3
    eor.w    d3,d2
    lsl.w    #2,d3
    lsr.w    #2,d2
    or.w     d3,d2
    move.w   #$0f0f,d3
    and.w    d2,d3
    eor.w    d3,d2
    lsl.w    #4,d3
    lsr.w    #4,d2
    or.w     d3,d2
    rol.w    #8,d2
    move.w   d2,(a1)
.noflip

LaBodilsen · 15 February 2021, 19:06

OK, I did some tests myself for a 16x16 block in 4 bitplanes (with WinUAE, though).

Turns out my own loop is slower than just unrolling it.

Code:

	move.w	(a0)+,D2
	beq.s	.noflip
	ror.b	#1,D2
	rol.w	#1,D2
	ror.b	#2,D2
	rol.w	#1,D2
	ror.b	#2,D2
	rol.w	#1,D2
	ror.b	#2,D2
	rol.w	#1,D2
	ror.b	#2,D2
	rol.w	#1,D2
	ror.b	#2,D2
	rol.w	#1,D2
	ror.b	#2,D2
	rol.w	#1,D2
	ror.b	#2,D2
	rol.w	#1,D2
	ror.b	#1,D2
	move.w	D2,(a1)
.noflip
	move.w	(a0),D2

With regard to speed, Meynaf's code is the fastest of the 3, at about 3 2/3 raster lines when the 2 constants are preloaded into registers. Ross's and mine are the same at about 4 2/3 raster lines, but my code only uses 1 data register.

Of course, table lookup is the fastest at about 2 1/3 raster lines.
(Edit: fixed my unrolled code, which made it faster.)

DanScott · 15 February 2021, 19:18

could you use a sequence of

ror.w #1,d2
addx.w d3,d3

uses one more data register
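
Spelled out as a loop, that idea might look like the sketch below. One caveat: ror does not update the X flag, so lsr.w is used here to feed addx. The loop counter in d4 is an assumption.

Code:

	moveq	#16-1,d4
.bit:	lsr.w	#1,d2			; bit 0 of d2 -> X
	addx.w	d3,d3			; d3 = d3*2 + X
	dbf	d4,.bit			; after 16 rounds d3.w is d2.w mirrored
	move.w	d3,(a1)

d3 does not need clearing first, since all 16 of its old bits are shifted out of the word.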

orangespider · 15 February 2021, 21:08

I believe this is the fastest no-table code:

Code:

	move.l	#%11111111000000001111111100000000,d2
	move.l	#%11110000111100001111000011110000,d3
	move.l	#%11001100110011001100110011001100,d4
	move.l	#%10101010101010101010101010101010,d5
.copy_tile_right:
	move.l	(a0)+,d6
	beq.s	.sk0
	move.l	d6,d7		; d7 - ABCDEFGH.IJKLMNOP.abcdefgh.ijklmnop
	and.l	d2,d6		; d6 - ABCDEFGH.00000000.abcdefgh.00000000
	eor.l	d6,d7		; d7 - 00000000.IJKLMNOP.00000000.ijklmnop
	swap	d6		; d6 - abcdefgh.00000000.ABCDEFGH.00000000
	or.l	d7,d6		; d6 - abcdefgh.IJKLMNOP.ABCDEFGH.ijklmnop
	move.l	d6,d7		; d7 - abcdefgh.IJKLMNOP.ABCDEFGH.ijklmnop
	and.l	d3,d6		; d6 - abcd0000.IJKL0000.ABCD0000.ijkl0000
	eor.l	d6,d7		; d7 - 0000efgh.0000MNOP.0000EFGH.0000mnop
	ror.l	#8,d6		; d6 - ijkl0000.abcd0000.IJKL0000.ABCD0000
	or.l	d7,d6		; d6 - ijklefgh.abcdMNOP.IJKLEFGH.ABCDmnop
	move.l	d6,d7		; d7 - ijklefgh.abcdMNOP.IJKLEFGH.ABCDmnop
	and.l	d4,d6		; d6 - ij00ef00.ab00MN00.IJ00EF00.AB00mn00
	eor.l	d6,d7		; d7 - 00kl00gh.00cd00OP.00KL00GH.00CD00op
	ror.l	#4,d6		; d6 - mn00ij00.ef00ab00.MN00IJ00.EF00AB00
	or.l	d7,d6		; d6 - mnklijgh.efcdabOP.MNKLIJGH.EFCDABop
	move.l	d6,d7		; d7 - mnklijgh.efcdabOP.MNKLIJGH.EFCDABop
	and.l	d5,d6		; d6 - m0k0i0g0.e0c0a0O0.M0K0I0G0.E0C0A0o0
	eor.l	d6,d7		; d7 - 0n0l0j0h.0f0d0b0P.0N0L0J0H.0F0D0B0p
	ror.l	#3,d6		; d6 - 0o0m0k0i.0g0e0c0a.0O0M0K0I.0G0E0C0A
	ror.l	#1,d7		; d7 - p0n0l0j0.h0f0d0b0.P0N0L0J0.H0F0D0B0
	or.l	d7,d6		; d6 - ponmlkji.hgfedcba.PONMLKJI.HGFEDCBA
	move.w	d6,(a1)
	swap	d6
	move.w	d6,8(a1)
.sk0:
	move.l	(a0)+,d6
	beq.s	.sk1
	move.l	d6,d7		; d7 - ABCDEFGH.IJKLMNOP.abcdefgh.ijklmnop
	and.l	d2,d6		; d6 - ABCDEFGH.00000000.abcdefgh.00000000
	eor.l	d6,d7		; d7 - 00000000.IJKLMNOP.00000000.ijklmnop
	swap	d6		; d6 - abcdefgh.00000000.ABCDEFGH.00000000
	or.l	d7,d6		; d6 - abcdefgh.IJKLMNOP.ABCDEFGH.ijklmnop
	move.l	d6,d7		; d7 - abcdefgh.IJKLMNOP.ABCDEFGH.ijklmnop
	and.l	d3,d6		; d6 - abcd0000.IJKL0000.ABCD0000.ijkl0000
	eor.l	d6,d7		; d7 - 0000efgh.0000MNOP.0000EFGH.0000mnop
	ror.l	#8,d6		; d6 - ijkl0000.abcd0000.IJKL0000.ABCD0000
	or.l	d7,d6		; d6 - ijklefgh.abcdMNOP.IJKLEFGH.ABCDmnop
	move.l	d6,d7		; d7 - ijklefgh.abcdMNOP.IJKLEFGH.ABCDmnop
	and.l	d4,d6		; d6 - ij00ef00.ab00MN00.IJ00EF00.AB00mn00
	eor.l	d6,d7		; d7 - 00kl00gh.00cd00OP.00KL00GH.00CD00op
	ror.l	#4,d6		; d6 - mn00ij00.ef00ab00.MN00IJ00.EF00AB00
	or.l	d7,d6		; d6 - mnklijgh.efcdabOP.MNKLIJGH.EFCDABop
	move.l	d6,d7		; d7 - mnklijgh.efcdabOP.MNKLIJGH.EFCDABop
	and.l	d5,d6		; d6 - m0k0i0g0.e0c0a0O0.M0K0I0G0.E0C0A0o0
	eor.l	d6,d7		; d7 - 0n0l0j0h.0f0d0b0P.0N0L0J0H.0F0D0B0p
	ror.l	#3,d6		; d6 - 0o0m0k0i.0g0e0c0a.0O0M0K0I.0G0E0C0A
	ror.l	#1,d7		; d7 - p0n0l0j0.h0f0d0b0.P0N0L0J0.H0F0D0B0
	or.l	d7,d6		; d6 - ponmlkji.hgfedcba.PONMLKJI.HGFEDCBA
	move.w	d6,128*16(a1)
	swap	d6
	move.w	d6,128*16+8(a1)
.sk1:
	add.l	d0,a1
	dbf		d1,.copy_tile_right

The register usage is compatible with the table codes, so you can test how well this does. It is definitely faster than LaBodilsen's code because it works on 32 bits at once instead of 16. Btw, this can be even faster if you use 32x32 as a source (or even full 128x128) instead of 16x16.

edit: If I counted the cycles correctly, this should run at 251 cycles per loop iteration (213 cycles for 32x32) and the table approach would run at 200 cycles per iteration. But my cycle counts might be wrong.

LaBodilsen · 16 February 2021, 17:21

Quote:

Originally Posted by orangespider

I believe this is the fastest no-table code:

It is indeed very fast, but meynaf's version is just a smidge faster (about 1/4 of a raster line) when used as 32-bit.

Code:

	move.l	#$33333333,d5
	move.l	#$55555555,d6
.copy_tile_right:
	move.l	(a0)+,d4
	beq.s	.noflip
	ror.b	#4,d4
	ror.w	#8,d4
	ror.b	#4,d4
	swap	d4
	ror.b	#4,d4
	ror.w	#8,d4
	ror.b	#4,d4
	move.l	d4,d0
	lsr.l	#2,d0
	lsl.l	#2,d4
	eor.l	d4,d0
	and.l	d5,d0
	eor.l	d0,d4
	move.l	d4,d0
	lsr.l	#1,d0
	add.l	d4,d4
	eor.l	d4,d0
	and.l	d6,d0
	eor.l	d0,d4
	move.w	d4,(a1)
	swap	d4
	move.w	d4,8(a1)
.noflip
	move.l	(a0)+,d4
	beq.s	.noflip2
	ror.b	#4,d4
	ror.w	#8,d4
	ror.b	#4,d4
	swap	d4
	ror.b	#4,d4
	ror.w	#8,d4
	ror.b	#4,d4
	move.l	d4,d0
	lsr.l	#2,d0
	lsl.l	#2,d4
	eor.l	d4,d0
	and.l	d5,d0
	eor.l	d0,d4
	move.l	d4,d0
	lsr.l	#1,d0
	add.l	d4,d4
	eor.l	d4,d0
	and.l	d6,d0
	eor.l	d0,d4
	move.w	d4,128*16(a1)
	swap	d4
	move.w	d4,128*16+8(a1)
.noflip2
	add.l	d1,a1
	dbf	d2,.copy_tile_right

This is almost as fast as the table version, only about 1/4 raster line slower for a 16x16 block, although the table version could also be extended to 32-bit. It also remains to be seen how much DMA will affect the results if the blitter is running in the background.
(Tested in WinUAE, A1200 chip RAM only, cycle exact.)

I have a feeling it could be faster, as you don't need to flip the entire longword, only the two word parts, so maybe a swap somewhere can be cut out.

LaBodilsen · 17 February 2021, 19:16

Quote:

Originally Posted by LaBodilsen

It is indeed very fast, but meynaf's version is just a smidge faster (about 1/4 of a raster line) when used as 32-bit.
This is almost as fast as the table version, only about 1/4 raster line slower for a 16x16 block, although the table version could also be extended to 32-bit. It also remains to be seen how much DMA will affect the results if the blitter is running in the background.
(Tested in WinUAE, A1200 chip RAM only, cycle exact.)

If we take Ross's code, preload the constants, and extend it to 32-bit, then it is just as fast as Meynaf's version.

Code:

	move.l	#$33333333,d1
	move.l	#$55555555,d4
	move.l	#$0f0f0f0f,d5
.copy_tile_right:
	move.l	(a0)+,d2
	beq.s	.noflip
	move.l	d4,d3
	and.l	d2,d3
	eor.l	d3,d2
	add.l	d3,d3
	lsr.l	#1,d2
	or.l	d3,d2
	move.l	d1,d3
	and.l	d2,d3
	eor.l	d3,d2
	lsl.l	#2,d3
	lsr.l	#2,d2
	or.l	d3,d2
	move.l	d5,d3
	and.l	d2,d3
	eor.l	d3,d2
	lsl.l	#4,d3
	lsr.l	#4,d2
	or.l	d3,d2
	rol.w	#8,d2		; swap bytes per word (rol.l #8 is only right when both words match)
	move.w	d2,8(a1)
	swap	d2
	rol.w	#8,d2
	move.w	d2,(a1)
.noflip
	move.l	(a0)+,d2
	beq.s	.noflip2
	move.l	d4,d3
	and.l	d2,d3
	eor.l	d3,d2
	add.l	d3,d3
	lsr.l	#1,d2
	or.l	d3,d2
	move.l	d1,d3
	and.l	d2,d3
	eor.l	d3,d2
	lsl.l	#2,d3
	lsr.l	#2,d2
	or.l	d3,d2
	move.l	d5,d3
	and.l	d2,d3
	eor.l	d3,d2
	lsl.l	#4,d3
	lsr.l	#4,d2
	or.l	d3,d2
	rol.w	#8,d2		; swap bytes per word (rol.l #8 is only right when both words match)
	move.w	d2,128*16+8(a1)
	swap	d2
	rol.w	#8,d2
	move.w	d2,128*16(a1)
.noflip2
	add.l	d0,a1
	dbf	d7,.copy_tile_right

here is the result for a 16x16 block:

Red = My code as 32bit
purple = table lookup
green = Meynaf code as 32bit
Yellow = Ross code as 32bit
Turquoise = orangespider code

PS: the test is done for "always data in" with no zero data, so the .noflip branch is never taken. Data in was always $00020002, and the result was verified as $40004000 for all versions.

roondar · 17 February 2021, 19:36

It may be wise to try these routines on a real A1200 as well, as 68020 emulation isn't cycle accurate. My personal experience is that real A1200s tend to be slower on RAM access than WinUAE suggests. Do note this is based on my tests with WinUAE 4.2.0; I haven't tried them since upgrading to 4.4.0.

Thorham · 18 February 2021, 15:26

Quote:

Originally Posted by mcgeezer

Recently I did a bit of work on sprite flipping for a Street Fighter POC by reconstructing a large 128x128 sprite from 16x16 tiles.

If you can do 32x16 tiles instead, and assuming 64 pixel sprites have 64 contiguous pixels, you could do this perhaps (single bitplane):

Code:

    move.l  #$55555555,d2
    move.l  #$33333333,d3
    move.l  #$0f0f0f0f,d4
.loop
    move.l  (a0)+,d0

    move.l  d0,d1
    and.l   d2,d1
    eor.l   d1,d0
    lsr.l   #1,d0
    add.l   d1,d1
    or.l    d1,d0

    move.l  d0,d1
    and.l   d3,d1
    eor.l   d1,d0
    lsr.l   #2,d0
    lsl.l   #2,d1
    or.l    d1,d0

    move.l  d0,d1
    and.l   d4,d1
    eor.l   d1,d0
    lsr.l   #4,d0
    lsl.l   #4,d1
    or.l    d1,d0

    rol.w   #8,d0
    swap    d0
    rol.w   #8,d0
    
    move.l  d0,(a1)
    add.l   d5,a1

    dbra    d7,.loop

The main thing is getting rid of 16 bit writes to chipmem.

mcgeezer · 18 February 2021, 19:56

Quote:

Originally Posted by Thorham

If you can do 32x16 tiles instead, and assuming 64 pixel sprites have 64 contiguous pixels, you could do this perhaps (single bitplane):

Code:

    move.l  #$55555555,d2
    move.l  #$33333333,d3
    move.l  #$0f0f0f0f,d4
.loop
    move.l  (a0)+,d0

    move.l  d0,d1
    and.l   d2,d1
    eor.l   d1,d0
    lsr.l   #1,d0
    add.l   d1,d1
    or.l    d1,d0

    move.l  d0,d1
    and.l   d3,d1
    eor.l   d1,d0
    lsr.l   #2,d0
    lsl.l   #2,d1
    or.l    d1,d0

    move.l  d0,d1
    and.l   d4,d1
    eor.l   d1,d0
    lsr.l   #4,d0
    lsl.l   #4,d1
    or.l    d1,d0

    rol.w   #8,d0
    swap    d0
    rol.w   #8,d0
    
    move.l  d0,(a1)
    add.l   d5,a1

    dbra    d7,.loop

The main thing is getting rid of 16 bit writes to chipmem.

Going to 32x16 increases RAM usage by a huge amount.

Here's my final effort.

82 scan lines rendering both hardware sprites each frame from built up 16x16 tiles. Only one face of the tiles is in ram and the flip is done on the fly as the sprite is built up.

Any tiles different from the last frame are cleared - as opposed to simply mass clearing the whole 128x128 sprite.

