16x16 CPU tile flip optimisations

mcgeezer · 14 February 2021, 12:34

Recently I did a bit of work on sprite flipping for a Street Fighter POC by reconstructing a large 128x128 sprite from 16x16 tiles.

I've been looking at ways to improve the speed of the routine and I thought I had a way of doing it by using 16bit lookups instead of 32bit.

Here's my current code:

Code:

	move.l	MIRROR(a6),a5		; Start of 128kb Bit mirror
	
	moveq	#16,d0			; Modulo for destination 
	moveq	#16-1,d1		; number of copy lines
	
.copy_tile_right:				; Right copy
	move.w	(a0)+,d2
	move.w	(a5,d2.l*2),(a1)		; Bitplane 1
	add.l	d0,a1
	
	move.w	(a0)+,d2
	move.w	(a5,d2.l*2),(a2)		; Bitplane 2
	add.l	d0,a2
	
	move.w	(a0)+,d2
	move.w	(a5,d2.l*2),(a3)		; Bitplane 3
	add.l	d0,a3
	
	move.w	(a0)+,d2
	move.w	(a5,d2.l*2),(a4)		; Bitplane 4
	add.l	d0,a4
	
	dbf	d1,.copy_tile_right
	
	bra	.exit

I had thought I could index into the middle of the bit mirror (offset by 65536 bytes) and save some cycles by changing the longword index on the move to a word index, e.g. move.w (a5,d2.w*2), but I seem to get wrong values - probably because the index is treated as signed when bit 15 is set.
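One possible explanation, sketched under assumptions: on the 68020 a word-sized index such as d2.w*2 is sign-extended before it is scaled, so values $8000-$FFFF reach below a5 and values $0000-$7FFF reach above it. Moving a5 to the middle of an unchanged table would then read the wrong half; each entry has to sit at its signed offset. The routine below is a minimal sketch of building such a table. The label BuildMirrorSigned and the register choices are hypothetical, and it assumes a5 already points 65536 bytes into a 128 KiB buffer.

```asm
; In:  a5 = middle of a 128 KiB buffer (buffer start + 65536)  [assumption]
; Out: entry for word v stored at a5 + 2*v, with v taken as signed 16-bit
; Trashes d0-d3. 68020+ only (scaled index).
BuildMirrorSigned:
	moveq	#0,d0			; d0 = source word, 0..$FFFF
.next_word:
	move.w	d0,d1			; d1 = working copy of the source word
	moveq	#16-1,d3		; 16 bits to reverse
.next_bit:
	lsr.w	#1,d1			; lowest source bit -> X
	addx.w	d2,d2			; shift it into the low end of the result
	dbf	d3,.next_bit		; after 16 passes d2 = d0 with its bits mirrored
	move.w	d2,(a5,d0.w*2)		; signed index: $8000-$FFFF land below a5
	addq.w	#1,d0
	bne.s	.next_word		; d0 wraps to 0 after $FFFF
	rts
```

With a table built this way the lookup can be written as move.w (a5,d2.w*2),(a1), and the upper word of d2 no longer needs to be clear. Whether that actually saves cycles over the d2.l form on an 020 is something to measure rather than assume.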

Any takers for improving the routine? Target is AGA 020 Chip ram only.

Graeme

ross · 14 February 2021, 14:09

Hi Graeme, the alternative could be a 64KiB flip table coupled with a longword read:

Code:

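.copy_tile_right:               ; Right copy

    move.l  (a0)+,d2            ; high word = bitplane 1, low word = bitplane 2
    add.w   d2,d2               ; bit 15 -> X, low 15 bits become the byte offset
    move.w  (a5,d2.w),d2        ; move leaves X untouched
    addx.w  d2,d2               ; shift result left, old bit 15 becomes bit 0
    move.w  d2,(a2)             ; Bitplane 2 (the low word is processed first)
    add.l   d0,a2

    swap    d2
    add.w   d2,d2
    move.w  (a5,d2.w),d2
    addx.w  d2,d2
    move.w  d2,(a1)             ; Bitplane 1
    add.l   d0,a1
    
    move.l  (a0)+,d2            ; high word = bitplane 3, low word = bitplane 4
    add.w   d2,d2
    move.w  (a5,d2.w),d2
    addx.w  d2,d2
    move.w  d2,(a4)             ; Bitplane 4
    add.l   d0,a4

    swap    d2
    add.w   d2,d2
    move.w  (a5,d2.w),d2
    addx.w  d2,d2
    move.w  d2,(a3)             ; Bitplane 3
    add.l   d0,a3

    dbf d1,.copy_tile_right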

But I guess you've already tried it. Is it slower than your way?
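The layout of the 64KiB table isn't spelled out in the post. One layout that is consistent with an add.w / indexed move.w / addx.w lookup sequence is: a5 points at the middle of the table, and the entry for the low 15 bits of a word holds those 15 bits mirrored into bits 14..0, so the final addx.w shifts them up and drops the original bit 15 into bit 0. The builder below is a sketch under that assumption; the label BuildMirror15 and the register use are hypothetical.

```asm
; In:  a5 = middle of a 64 KiB buffer (buffer start + 32768)  [assumption]
; Trashes d0-d4.
BuildMirror15:
	moveq	#0,d0			; d0 = low 15 bits of the source word, 0..$7FFF
.next_word:
	move.w	d0,d1			; working copy
	moveq	#0,d2			; result starts empty
	moveq	#15-1,d3		; mirror 15 bits only
.next_bit:
	lsr.w	#1,d1			; lowest source bit -> X
	addx.w	d2,d2			; shift it into the result
	dbf	d3,.next_bit		; source bit k ends up in bit 14-k
	move.w	d0,d4
	add.w	d4,d4			; same byte offset the lookup computes
	move.w	d2,(a5,d4.w)		; signed offset: -32768..32766 around a5
	addq.w	#1,d0
	bpl.s	.next_word		; stop once d0 reaches $8000
	rts
```

The lookup then needs no scaled index, so the same idea would also run on a plain 68000.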

roondar · 14 February 2021, 14:14

The only immediate thing I can note is that 16x16 is a bit of a shame for AGA. 32 bit wide reading/writing would be twice as fast (assuming 32 bit alignment, naturally), but it also takes much more memory for the tiles.

Perhaps a rewrite where you read the 16x16 tiles into words, but combine the results of two side-by-side tiles into a longword to write to the destination might still be useful as an optimisation though. That should be something like 33% faster (if my math doesn't fail me!). Edit: again, assuming the destination is 32 bit aligned
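A rough sketch of that idea for a single bitplane, under stated assumptions: the source keeps the four bitplane words of each line together (as the (a0)+ reads in the first post imply), and the register roles below are hypothetical rather than taken from the original routine. Note that after a horizontal flip the tile that was on the right of the source image is the one that lands on the left.

```asm
; Assumptions (hypothetical register use):
;   a0 = source tile that lands in the LEFT word of the destination
;   a2 = source tile that lands in the RIGHT word of the destination
;   a1 = longword-aligned destination, a5 = 128 KiB mirror table
;   d0 = destination modulo, d1 = number of lines - 1
	moveq	#0,d2			; keep the upper word clear for d2.l*2
.copy_pair:
	move.w	(a0),d2			; left tile, bitplane 1 word of this line
	move.w	(a5,d2.l*2),d3		; mirrored left word
	swap	d3			; park it in the upper half
	move.w	(a2),d2			; right tile, bitplane 1 word of this line
	move.w	(a5,d2.l*2),d3		; mirrored right word in the lower half
	move.l	d3,(a1)			; one 32 bit write instead of two 16 bit writes
	addq.l	#8,a0			; step over the 4 bitplane words of this line
	addq.l	#8,a2
	add.l	d0,a1
	dbf	d1,.copy_pair
```

The table reads stay 16 bit, so only the destination writes are halved; the actual gain would need timing on real hardware.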

mcgeezer · 14 February 2021, 14:23

@ross - yeah, I tried a read with a swap and I didn't get a speed increase. One thing I have done is put a beq.s after the move to d2 so it skips the copy if the value is 0. This saved about 8 scan lines but is dependent on the data.

DanScott · 14 February 2021, 14:25

you might even want to movem the tile data in to as many free data registers as possible

ross · 14 February 2021, 14:29

Quote:

Originally Posted by DanScott

you might even want to movem the tile data in to as many free data registers as possible

Yes that's what I originally wrote in the code.
But then I realized I would have used one more register, with probably no speed increase

	movem.l	(a0)+,d2/d3

This could be extended by unrolling the loop, but I don't know if you'd gain much speed.

LaBodilsen · 14 February 2021, 14:34

Don't know if this works as I think it does, but AsmPro doesn't complain.

Code:

	moveq	#0,d0			; Modulo for destination 
	moveq	#16-1,d1		; number of copy lines
	
.copy_tile_right:				; Right copy
	move.w	(a0)+,d2
	move.w	(a5,d2.l*2),(a1,d0.l)		; Bitplane 1
	
	move.w	(a0)+,d2
	move.w	(a5,d2.l*2),(a2,d0.l)		; Bitplane 2
	
	move.w	(a0)+,d2
	move.w	(a5,d2.l*2),(a3,d0.l)		; Bitplane 3
	
	move.w	(a0)+,d2
	move.w	(a5,d2.l*2),(a4,d0.l)		; Bitplane 4
	add.l	#16,d0
	
	dbf	d1,.copy_tile_right

edit: if you have more data registers free, it could be combined with movem.w (a0)+,d2/d3/d4/d5.
Also, if the bitplane data is no more than 127 bytes apart you could do:

Code:

move.w	(a5,d2.l*2),126(a4,d0.l)		; Bitplane 4

mcgeezer · 14 February 2021, 14:44

Quote:

Originally Posted by DanScott

you might even want to movem the tile data in to as many free data registers as possible

This was my first thought but I couldn't get it to improve. Mind you, I was only moving 4 words at a time; 8 might be better if I can get the regs free.

ross · 14 February 2021, 14:51

Quote:

Originally Posted by LaBodilsen

Don't know if this works as I think it does, but AsmPro doesn't complain.

Good idea

(not sure if it gains speed)

This can be extended to save registers and modified by knowing the constant distance between destination bpls:

Code:

.copy_tile_right:				; Right copy
	move.w	(a0)+,d2
	move.w	(a5,d2.l*2),(a1)		; Bitplane 1
	
	move.w	(a0)+,d2
	move.w	(a5,d2.l*2),($xxxx.w,a1)        ; Bitplane 2
	
	move.w	(a0)+,d2
	move.w	(a5,d2.l*2),($xxxx*2.w,a1)	; Bitplane 3
	
	move.w	(a0)+,d2
	move.w	(a5,d2.l*2),($xxxx*3.w,a1)	; Bitplane 4
	adda.l	d0,a1
	
	dbf	d1,.copy_tile_right

EDIT: usually the (offset.w,ax) addressing mode is faster than (xx.b,ax,dx): internally two sums instead of three.
I don't know about the 020 (and running from cache) though, and I'm too lazy to look at the manuals, so you'll have to try it.

LaBodilsen · 14 February 2021, 14:59

Quote:

Originally Posted by ross

Good idea

(not sure if it gains speed)
This can be extended to save registers by knowing the constant distance between destination bplanes.

Code:

.copy_tile_right:				; Right copy
	move.w	(a0)+,d2
	move.w	(a5,d2.l*2),(a1)		; Bitplane 1
	
	move.w	(a0)+,d2
	move.w	(a5,d2.l*2),($xxxx.w,a1)        ; Bitplane 2
	
	move.w	(a0)+,d2
	move.w	(a5,d2.l*2),($xxxx*2.w,a1)	; Bitplane 3
	
	move.w	(a0)+,d2
	move.w	(a5,d2.l*2),($xxxx*3.w,a1)	; Bitplane 4
	adda.l	d0,a1
	
	dbf	d1,.copy_tile_right

Sweet.. would this work?:

Code:

.copy_tile_right:				; Right copy
	movem.w	(a0)+,d2/d3/d4/d5
	move.w	(a5,d2.l*2),(a1)		; Bitplane 1
	move.w	(a5,d3.l*2),($xxxx.w,a1)        ; Bitplane 2
	move.w	(a5,d4.l*2),($xxxx*2.w,a1)	; Bitplane 3
	move.w	(a5,d5.l*2),($xxxx*3.w,a1)	; Bitplane 4
	adda.l	d0,a1
	
	dbf	d1,.copy_tile_right

ross · 14 February 2021, 15:09

Quote:

Originally Posted by LaBodilsen

Sweet.. would this work?:

Code:

.copy_tile_right:				; Right copy
	movem.w	(a0)+,d2/d3/d4/d5
	move.w	(a5,d2.l*2),(a1)		; Bitplane 1
	move.w	(a5,d3.l*2),($xxxx.w,a1)        ; Bitplane 2
	move.w	(a5,d4.l*2),($xxxx*2.w,a1)	; Bitplane 3
	move.w	(a5,d5.l*2),($xxxx*3.w,a1)	; Bitplane 4
	adda.l	d0,a1
	
	dbf	d1,.copy_tile_right

Sure, that's the next step using movem

mcgeezer · 14 February 2021, 15:23

OK, so far this code snippet has yielded the biggest speed improvement.

Code:

	move.l	MIRROR(a6),a5		; Start of 128kb Bit mirror
	
	moveq	#16,d0			; Modulo for destination 
	moveq	#16-1,d1		; number of copy lines
	moveq	#0,d2
	
.copy_tile_right:				; Right copy
	move.w	(a0)+,d2
	beq.s	.1b
	move.w	(a5,d2.l*2),(a1)		; Bitplane 1
	
.1b:	move.w	(a0)+,d2
	beq.s	.2b
	move.w	(a5,d2.l*2),8(a1)		; Bitplane 2
	
.2b:	move.w	(a0)+,d2
	beq.s	.3b
	move.w	(a5,d2.l*2),(a3)		; Bitplane 3
	
.3b:	move.w	(a0)+,d2
	beq.s	.4b
	move.w	(a5,d2.l*2),8(a3)		; Bitplane 4
	
.4b:	add.l	d0,a1
	add.l	d0,a3
	dbf	d1,.copy_tile_right

.exit:	rts

I'll now revisit the movem and see if I can get it faster.

Edit - quick note. a1 and a3 point to different hardware sprite addresses so I can't use one address register as a base, it needs to be two.

ross · 14 February 2021, 15:29

Quote:

Originally Posted by mcgeezer

Edit - quick note. a1 and a3 point to different hardware sprite addresses so I can't use one address register as a base, it needs to be two.

But isn't the sprite data consecutive in memory?

EDIT:

Quote:

Originally Posted by mcgeezer

I'll now revisit the movem and see if I can get it faster.

I have some doubts, because you are forced to do a check per register to use the beq trick... but you will let us know shortly.

mcgeezer · 14 February 2021, 15:33

Quote:

Originally Posted by ross

But isn't the sprite data consecutive in memory?

Maybe I should have mentioned, I'm moving into hardware sprites which are 64 wide by 128 pixels in depth.

Here's how I allocate them.

Code:

SPRITE_BANK_SIZE:	equ	128

	move.l	#(SPRITE_BANK_SIZE*16)*16,d0
	move.l	#MEMF_CHIP,d1
	bsr	agdAllocateResource
	tst.l	d0
	bmi	.error
	
	move.l	d0,a0
	move.l	(a0),d0
	
	move.l	d0,HDL_SPRITE_BUF0BANK0(a6)
	add.l	#(SPRITE_BANK_SIZE*16),d0
	move.l	d0,HDL_SPRITE_BUF0BANK1(a6)
	add.l	#(SPRITE_BANK_SIZE*16),d0
	move.l	d0,HDL_SPRITE_BUF0BANK2(a6)
	add.l	#(SPRITE_BANK_SIZE*16),d0
	move.l	d0,HDL_SPRITE_BUF0BANK3(a6)
	add.l	#(SPRITE_BANK_SIZE*16),d0
	move.l	d0,HDL_SPRITE_BUF0BANK4(a6)
	add.l	#(SPRITE_BANK_SIZE*16),d0
	move.l	d0,HDL_SPRITE_BUF0BANK5(a6)
	add.l	#(SPRITE_BANK_SIZE*16),d0
	move.l	d0,HDL_SPRITE_BUF0BANK6(a6)
	add.l	#(SPRITE_BANK_SIZE*16),d0
	move.l	d0,HDL_SPRITE_BUF0BANK7(a6)
	add.l	#(SPRITE_BANK_SIZE*16),d0
	
	move.l	d0,HDL_SPRITE_BUF1BANK0(a6)
	add.l	#(SPRITE_BANK_SIZE*16),d0
	move.l	d0,HDL_SPRITE_BUF1BANK1(a6)
	add.l	#(SPRITE_BANK_SIZE*16),d0
	move.l	d0,HDL_SPRITE_BUF1BANK2(a6)
	add.l	#(SPRITE_BANK_SIZE*16),d0
	move.l	d0,HDL_SPRITE_BUF1BANK3(a6)
	add.l	#(SPRITE_BANK_SIZE*16),d0
	move.l	d0,HDL_SPRITE_BUF1BANK4(a6)
	add.l	#(SPRITE_BANK_SIZE*16),d0
	move.l	d0,HDL_SPRITE_BUF1BANK5(a6)
	add.l	#(SPRITE_BANK_SIZE*16),d0
	move.l	d0,HDL_SPRITE_BUF1BANK6(a6)
	add.l	#(SPRITE_BANK_SIZE*16),d0
	move.l	d0,HDL_SPRITE_BUF1BANK7(a6)	
	bra.s	.exit

LaBodilsen · 14 February 2021, 15:34

Quote:

Originally Posted by mcgeezer

OK, so far this code snippet has yielded the biggest speed improvement.

I'll now revisit the movem and see if I can get it faster.

Edit - quick note. a1 and a3 point to different hardware sprite addresses so I can't use one address register as a base, it needs to be two.

Although I think the beq.s after the move to d2 is a smart way to avoid flipping zero data, it also won't overwrite any existing data in the destination, so are you sure the destination data is always "null"?

mcgeezer · 14 February 2021, 15:38

Quote:

Originally Posted by LaBodilsen

Although I think the beq.s after the move to d2 is a smart way to avoid flipping zero data, it also won't overwrite any existing data in the destination, so are you sure the destination data is always "null"?

Yes, always null because I clear it every frame.

ross · 14 February 2021, 15:45

Quote:

Originally Posted by mcgeezer

I'm moving into hardware sprites which are 64 wide by 128 pixels in depth.

Code:

.copy_tile_right:				; Right copy
	move.w	(a0)+,d2
	beq.s	.1b
	move.w	(a5,d2.l*2),(a1)		; Bitplane 1
	
.1b:	move.w	(a0)+,d2
	beq.s	.2b
	move.w	(a5,d2.l*2),8(a1)		; Bitplane 2
	
.2b:	move.w	(a0)+,d2
	beq.s	.3b
	move.w	(a5,d2.l*2),128*16(a1)		; Bitplane 3
	
.3b:	move.w	(a0)+,d2
	beq.s	.4b
	move.w	(a5,d2.l*2),128*16+8(a1)		; Bitplane 4
	
.4b:	add.l	d0,a1
	dbf	d1,.copy_tile_right

Don_Adan · 14 February 2021, 15:47

If movem.w is used then the base of the A5 table may need to change, because movem.w sign-extends each word to a longword. Anyway, perhaps something like this could be used too:
movem.w (a0)+,d2/d3/d4/d5/d7/a2/a6

depending on which registers are free.
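To illustrate the sign-extension point, here is a short sketch. It assumes a table laid out for signed indices, i.e. a5 points at the middle of the 128 KiB mirror and the entry for word v sits at a5 + 2*v with v read as a signed 16-bit value. That layout is an assumption for the example, not the one used in the first post.

```asm
; a5 = middle of a mirror table built for signed indices  [assumption]
.copy_tile_right:
	movem.w	(a0)+,d2-d5		; four words, each sign-extended to 32 bits
	move.w	(a5,d2.l*2),(a1)	; Bitplane 1: $8000-$FFFF index below a5
	move.w	(a5,d3.l*2),(a2)	; Bitplane 2
	move.w	(a5,d4.l*2),(a3)	; Bitplane 3
	move.w	(a5,d5.l*2),(a4)	; Bitplane 4
	add.l	d0,a1			; d0 = destination modulo
	add.l	d0,a2
	add.l	d0,a3
	add.l	d0,a4
	dbf	d1,.copy_tile_right
```

Address registers loaded by movem.w are sign-extended in the same way and can also serve as index registers, which is what makes a longer register list usable if the registers can be spared.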

orangespider · 14 February 2021, 15:49

I don't understand why you want to use lookup tables at all? I mean chipram reads and writes are slow as it is. Wouldn't a straight up calculation be faster?

Code:

; this part can be set only once for multiple mirrors
	move.l	#%11111111000000001111111100000000, d2
	move.l	#%11110000111100001111000011110000, d3
	move.l	#%11001100110011001100110011001100, d4
	move.l	#%10101010101010101010101010101010, d5

	moveq	#6, d6		; loop counter for the dbf below
	move.l	(a0)+, d0
	move.l	d0, d1		; d1 = ABCDEFGH.IJKLMNOP.abcdefgh.ijklmnop
	and.l	d2, d0		; d0 = ABCDEFGH.00000000.abcdefgh.00000000
	lsl.l	#8, d1		; d1 = IJKLMNOP.abcdefgh.ijklmnop.00000000
	lsr.l	#8, d0		; d0 = 00000000.ABCDEFGH.00000000.abcdefgh
	and.l	d2, d1		; d1 = IJKLMNOP.00000000.ijklmnop.00000000
	or.l	d1, d0		; d0 = IJKLMNOP.ABCDEFGH.ijklmnop.abcdefgh
	move.l	d0, d1		; d1 = IJKLMNOP.ABCDEFGH.ijklmnop.abcdefgh
	and.l	d3, d0		; d0 = IJKL0000.ABCD0000.ijkl0000.abcd0000
	lsl.l	#4, d1		; d1 = MNOPABCD.EFGHijkl.mnopabcd.efgh0000
	lsr.l	#4, d0		; d0 = 0000IJKL.0000ABCD.0000ijkl.0000abcd
	and.l	d3, d1		; d1 = MNOP0000.EFGH0000.mnop0000.efgh0000
	or.l	d1, d0		; d0 = MNOPIJKL.EFGHABCD.mnopijkl.efghabcd
	move.l	d0, d1		; d1 = MNOPIJKL.EFGHABCD.mnopijkl.efghabcd
	and.l	d4, d0		; d0 = MN00IJ00.EF00AB00.mn00ij00.ef00ab00
	lsl.l	#2, d1		; d1 = OPIJKLEF.GHABCDmn.opijklef.ghabcd00
	lsr.l	#2, d0		; d0 = 00MN00IJ.00EF00AB.00mn00ij.00ef00ab
	and.l	d4, d1		; d1 = OP00KL00.GH00CD00.op00kl00.gh00cd00
	or.l	d1, d0		; d0 = OPMNKLIJ.GHEFCDAB.opmnklij.ghefcdab
	move.l	d0, d1		; d1 = OPMNKLIJ.GHEFCDAB.opmnklij.ghefcdab
	and.l	d5, d0		; d0 = O0M0K0I0.G0E0C0A0.o0m0k0i0.g0e0c0a0
	lsl.l	#1, d1		; d1 = PMNKLIJG.HEFCDABo.pmnklijg.hefcdab0
	lsr.l	#1, d0		; d0 = 0O0M0K0I.0G0E0C0A.0o0m0k0i.0g0e0c0a
	and.l	d5, d1		; d1 = P0N0L0J0.H0F0D0B0.p0n0l0j0.h0f0d0b0
	or.l	d0, d1		; d1 = PONMLKJI.HGFEDCBA.ponmlkji.hgfedcba
.copy_tile_right:
	move.l	(a0)+, d0
	move.l	d1, (a1)+
	move.l	d0, d1		; d1 = ABCDEFGH.IJKLMNOP.abcdefgh.ijklmnop
	and.l	d2, d0		; d0 = ABCDEFGH.00000000.abcdefgh.00000000
	lsl.l	#8, d1		; d1 = IJKLMNOP.abcdefgh.ijklmnop.00000000
	lsr.l	#8, d0		; d0 = 00000000.ABCDEFGH.00000000.abcdefgh
	and.l	d2, d1		; d1 = IJKLMNOP.00000000.ijklmnop.00000000
	or.l	d1, d0		; d0 = IJKLMNOP.ABCDEFGH.ijklmnop.abcdefgh
	move.l	d0, d1		; d1 = IJKLMNOP.ABCDEFGH.ijklmnop.abcdefgh
	and.l	d3, d0		; d0 = IJKL0000.ABCD0000.ijkl0000.abcd0000
	lsl.l	#4, d1		; d1 = MNOPABCD.EFGHijkl.mnopabcd.efgh0000
	lsr.l	#4, d0		; d0 = 0000IJKL.0000ABCD.0000ijkl.0000abcd
	and.l	d3, d1		; d1 = MNOP0000.EFGH0000.mnop0000.efgh0000
	or.l	d1, d0		; d0 = MNOPIJKL.EFGHABCD.mnopijkl.efghabcd
	move.l	d0, d1		; d1 = MNOPIJKL.EFGHABCD.mnopijkl.efghabcd
	and.l	d4, d0		; d0 = MN00IJ00.EF00AB00.mn00ij00.ef00ab00
	lsl.l	#2, d1		; d1 = OPIJKLEF.GHABCDmn.opijklef.ghabcd00
	lsr.l	#2, d0		; d0 = 00MN00IJ.00EF00AB.00mn00ij.00ef00ab
	and.l	d4, d1		; d1 = OP00KL00.GH00CD00.op00kl00.gh00cd00
	or.l	d1, d0		; d0 = OPMNKLIJ.GHEFCDAB.opmnklij.ghefcdab
	move.l	d0, d1		; d1 = OPMNKLIJ.GHEFCDAB.opmnklij.ghefcdab
	and.l	d5, d0		; d0 = O0M0K0I0.G0E0C0A0.o0m0k0i0.g0e0c0a0
	lsl.l	#1, d1		; d1 = PMNKLIJG.HEFCDABo.pmnklijg.hefcdab0
	lsr.l	#1, d0		; d0 = 0O0M0K0I.0G0E0C0A.0o0m0k0i.0g0e0c0a
	and.l	d5, d1		; d1 = P0N0L0J0.H0F0D0B0.p0n0l0j0.h0f0d0b0
	or.l	d0, d1		; d1 = PONMLKJI.HGFEDCBA.ponmlkji.hgfedcba
	dbf	d6, .copy_tile_right
	move.l	d1, (a1)+
	bra	.exit

I haven't optimized for anything other than the 060 for a long time, but to me it feels like having 1 chip ram write and 1 chip ram read should be faster, even with all that code, compared to 2 chip ram writes and 4 chip ram reads doing the same job?

ross · 14 February 2021, 15:53

Quote:

Originally Posted by orangespider

I don't understand why you want to use lookup tables at all? I mean chipram reads and writes are slow as it is. Wouldn't a straight up calculation be faster?
I haven't optimized for anything other than the 060 for a long time, but to me it feels like having 1 chip ram write and 1 chip ram read should be faster, even with all that code, compared to 2 chip ram writes and 4 chip ram reads doing the same job?

Your reasoning applies only to fast 060 (probably 040 also).
It doesn't work for 020/030 and chip ram only systems.

Try it for yourself
