14 February 2021, 12:34 | #1 |
Registered User
Join Date: Oct 2017
Location: Sunderland, England
Posts: 2,702
|
16x16 CPU tile flip optimisations
Recently I did a bit of work on sprite flipping for a Street Fighter POC by reconstructing a large 128x128 sprite from 16x16 tiles.
I've been looking at ways to improve the speed of the routine, and I thought I had a way of doing it by using 16-bit lookups instead of 32-bit. Here's my current code:
Code:
    move.l  MIRROR(a6),a5           ; Start of 128 KB bit-mirror table
    moveq   #16,d0                  ; Modulo for destination
    moveq   #16-1,d1                ; Number of copy lines
.copy_tile_right:                   ; Right copy
    move.w  (a0)+,d2                ; Upper word of d2 must be 0 for the d2.l index
    move.w  (a5,d2.l*2),(a1)        ; Bitplane 1
    add.l   d0,a1
    move.w  (a0)+,d2
    move.w  (a5,d2.l*2),(a2)        ; Bitplane 2
    add.l   d0,a2
    move.w  (a0)+,d2
    move.w  (a5,d2.l*2),(a3)        ; Bitplane 3
    add.l   d0,a3
    move.w  (a0)+,d2
    move.w  (a5,d2.l*2),(a4)        ; Bitplane 4
    add.l   d0,a4
    dbf     d1,.copy_tile_right
    bra     .exit
Any takers for improving the routine? Target is AGA 020, chip RAM only.
Graeme |
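For reference, a table like the one MIRROR points at could be generated roughly as follows. This is only a sketch: the label build_mirror and the register usage are assumptions for illustration, not the code used in the POC. It expects a5 to point at a 128 KB buffer and fills it with the bit-reversed value of every possible 16-bit word.

```asm
; Sketch only: build_mirror and its register usage are hypothetical.
; in:      a5 = start of a 128 KB buffer (65536 words)
; out:     a5 = end of the buffer
; trashes: d0-d3
build_mirror:
    moveq   #0,d0               ; d0 = source word, counts 0..65535
.next_word:
    move.w  d0,d1               ; working copy of the source word
    moveq   #16-1,d3            ; 16 bits to reverse
.next_bit:
    lsr.w   #1,d1               ; lowest bit drops into X
    addx.w  d2,d2               ; d2 = d2*2 + X, so the bit enters from the right
    dbf     d3,.next_bit        ; after 16 passes d2 holds the reversed word
    move.w  d2,(a5)+            ; entry for value d0 sits at byte offset d0*2
    addq.w  #1,d0
    bne.s   .next_word          ; d0 wraps to 0 after 65535
    rts
```

Each entry sits at byte offset value*2, which matches the (a5,d2.l*2) lookups as long as the upper word of d2 is zero.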
14 February 2021, 14:09 | #2 |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
|
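Hi Graeme, the alternative could be a 64 KiB flip table coupled with a longword read:
Code:
.copy_tile_right:                   ; Right copy
    move.l  (a0)+,d2                ; High word = first source word, low word = second
    add.w   d2,d2                   ; Bit 15 -> X, low 15 bits * 2 = table offset
    move.w  (a5,d2.w),d2            ; d2.w is sign-extended; move leaves X untouched
    addx.w  d2,d2                   ; Shift the saved bit back in as bit 0
    move.w  d2,(a2)                 ; Bitplane 2 (second word in memory)
    add.l   d0,a2
    swap    d2                      ; Bring the first source word into the low half
    add.w   d2,d2
    move.w  (a5,d2.w),d2
    addx.w  d2,d2
    move.w  d2,(a1)                 ; Bitplane 1 (first word in memory)
    add.l   d0,a1
    move.l  (a0)+,d2
    add.w   d2,d2
    move.w  (a5,d2.w),d2
    addx.w  d2,d2
    move.w  d2,(a4)                 ; Bitplane 4
    add.l   d0,a4
    swap    d2
    add.w   d2,d2
    move.w  (a5,d2.w),d2
    addx.w  d2,d2
    move.w  d2,(a3)                 ; Bitplane 3
    add.l   d0,a3
    dbf     d1,.copy_tile_right
|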
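One possible way to build the 64 KiB table for this variant is sketched below. It is based on my reading of the lookup code above, not on anything ross posted: the label build_flip15 is hypothetical, and it assumes a5 points at the middle of the 64 KiB buffer, because a d2.w index is sign-extended. Each entry holds the reversed 15-bit value shifted right by one, so the addx.w in the lookup can restore bit 0 from X.

```asm
; Sketch only: build_flip15 and the table layout are assumptions.
; in:      a5 = middle of a 64 KiB buffer (32 KiB either side)
; trashes: d0-d3
build_flip15:
    moveq   #0,d0               ; d0 = 15-bit source value, 0..$7FFF
.next_value:
    move.w  d0,d1
    moveq   #16-1,d3
.next_bit:
    lsr.w   #1,d1               ; lowest bit -> X
    addx.w  d2,d2               ; shift it in from the right
    dbf     d3,.next_bit        ; d2 = 16-bit reversal of d0 (bit 0 is always 0)
    lsr.w   #1,d2               ; store result/2, the lookup's addx.w doubles it
    move.w  d0,d1
    add.w   d1,d1               ; byte offset, negative as a word once d0 >= $4000
    move.w  d2,(a5,d1.w)        ; same signed offset the lookup will produce
    addq.w  #1,d0
    bpl.s   .next_value         ; stop when d0 reaches $8000
    rts
```

The halved entry is what keeps the table at 32768 words instead of 65536: the top bit of the source word never takes part in the index.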
14 February 2021, 14:14 | #3 |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,410
|
The only immediate thing I can note is that 16x16 is a bit of a shame for AGA. 32 bit wide reading/writing would be twice as fast (assuming 32 bit alignment naturally). But also takes much more memory for the tiles.
Perhaps a rewrite where you read the 16x16 tiles into words, but combine the results of two side-by-side tiles into a longword to write to the destination might still be useful as an optimisation though. That should be something like 33% faster (if my math doesn't fail me!). Edit: again, assuming the destination is 32 bit aligned |
14 February 2021, 14:23 | #4 |
Registered User
Join Date: Oct 2017
Location: Sunderland, England
Posts: 2,702
|
@ross - yeah, I tried a read with a swap and I didn't get a speed increase. One thing I have done is put a beq.s after the move to d2 so it skips the copy if the value is 0. This saved about 8 scanlines but is dependent on the data.
|
14 February 2021, 14:25 | #5 |
Lemon. / Core Design
Join Date: Mar 2016
Location: Tier 5
Posts: 1,212
|
you might even want to movem the tile data in to as many free data registers as possible
|
14 February 2021, 14:29 | #6 | |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
|
Quote:
But then I realized I would have used one more register, with probably no speed increase:
    movem.l (a0)+,d2/d3
This could be extended by unrolling the loop, but I don't know if you would gain much speed. |
|
14 February 2021, 14:34 | #7 |
Registered User
Join Date: Dec 2017
Location: Denmark
Posts: 179
|
Don't know if this works as I think it does, but AsmPro doesn't complain.
Code:
    moveq   #0,d0                       ; Running destination offset
    moveq   #16-1,d1                    ; Number of copy lines
.copy_tile_right:                       ; Right copy
    move.w  (a0)+,d2
    move.w  (a5,d2.l*2),(a1,d0.l)       ; Bitplane 1
    move.w  (a0)+,d2
    move.w  (a5,d2.l*2),(a2,d0.l)       ; Bitplane 2
    move.w  (a0)+,d2
    move.w  (a5,d2.l*2),(a3,d0.l)       ; Bitplane 3
    move.w  (a0)+,d2
    move.w  (a5,d2.l*2),(a4,d0.l)       ; Bitplane 4
    add.l   #16,d0                      ; Next destination line
    dbf     d1,.copy_tile_right
Also, if the bitplane data is not more than 127 bytes apart you could do:
Code:
    move.w  (a5,d2.l*2),126(a4,d0.l)    ; Bitplane 4
Last edited by LaBodilsen; 14 February 2021 at 14:48. |
14 February 2021, 14:44 | #8 |
Registered User
Join Date: Oct 2017
Location: Sunderland, England
Posts: 2,702
|
|
14 February 2021, 14:51 | #9 | |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
|
Quote:
This can be extended to save registers and modified if you know the constant distance between destination bitplanes:
Code:
.copy_tile_right:                       ; Right copy
    move.w  (a0)+,d2
    move.w  (a5,d2.l*2),(a1)            ; Bitplane 1
    move.w  (a0)+,d2
    move.w  (a5,d2.l*2),($xxxx.w,a1)    ; Bitplane 2
    move.w  (a0)+,d2
    move.w  (a5,d2.l*2),($xxxx*2.w,a1)  ; Bitplane 3
    move.w  (a0)+,d2
    move.w  (a5,d2.l*2),($xxxx*3.w,a1)  ; Bitplane 4
    adda.l  d0,a1
    dbf     d1,.copy_tile_right
I don't know whether it is faster on the 020 (and from cache), though, and I'm too lazy to look at the manuals, so you'll have to try it.
Last edited by ross; 14 February 2021 at 15:02. |
|
14 February 2021, 14:59 | #10 | |
Registered User
Join Date: Dec 2017
Location: Denmark
Posts: 179
|
Quote:
Sweet... would this work?
Code:
.copy_tile_right:                       ; Right copy
    movem.w (a0)+,d2/d3/d4/d5           ; Note: movem.w sign-extends each word to 32 bits
    move.w  (a5,d2.l*2),(a1)            ; Bitplane 1
    move.w  (a5,d3.l*2),($xxxx.w,a1)    ; Bitplane 2
    move.w  (a5,d4.l*2),($xxxx*2.w,a1)  ; Bitplane 3
    move.w  (a5,d5.l*2),($xxxx*3.w,a1)  ; Bitplane 4
    adda.l  d0,a1
    dbf     d1,.copy_tile_right
|
|
14 February 2021, 15:09 | #11 | |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
|
Quote:
|
|
14 February 2021, 15:23 | #12 |
Registered User
Join Date: Oct 2017
Location: Sunderland, England
Posts: 2,702
|
OK, so far this code snippet has yielded the biggest speed improvement.
Code:
    move.l  MIRROR(a6),a5           ; Start of 128 KB bit-mirror table
    moveq   #16,d0                  ; Modulo for destination
    moveq   #16-1,d1                ; Number of copy lines
    moveq   #0,d2                   ; Clear upper word for the d2.l index
.copy_tile_right:                   ; Right copy
    move.w  (a0)+,d2
    beq.s   .1b                     ; Skip the lookup and write when the word is 0
    move.w  (a5,d2.l*2),(a1)        ; Bitplane 1
.1b:
    move.w  (a0)+,d2
    beq.s   .2b
    move.w  (a5,d2.l*2),8(a1)       ; Bitplane 2
.2b:
    move.w  (a0)+,d2
    beq.s   .3b
    move.w  (a5,d2.l*2),(a3)        ; Bitplane 3
.3b:
    move.w  (a0)+,d2
    beq.s   .4b
    move.w  (a5,d2.l*2),8(a3)       ; Bitplane 4
.4b:
    add.l   d0,a1
    add.l   d0,a3
    dbf     d1,.copy_tile_right
.exit:
    rts
I'll now revisit the movem and see if I can get it faster.
Edit - quick note: a1 and a3 point to different hardware sprite addresses, so I can't use one address register as a base; it needs to be two. |
14 February 2021, 15:29 | #13 | |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
|
Quote:
EDIT: I have some doubts, because you are forced to do a check per register to use the beq trick... but you will let us know shortly.
Last edited by ross; 14 February 2021 at 15:36. |
|
14 February 2021, 15:33 | #14 |
Registered User
Join Date: Oct 2017
Location: Sunderland, England
Posts: 2,702
|
Maybe I should have mentioned: I'm moving the data into hardware sprites, which are 64 pixels wide by 128 pixels high.
Here's how I allocate them.
Code:
SPRITE_BANK_SIZE: equ 128

    move.l  #(SPRITE_BANK_SIZE*16)*16,d0    ; 16 banks of SPRITE_BANK_SIZE*16 bytes
    move.l  #MEMF_CHIP,d1
    bsr     agdAllocateResource
    tst.l   d0
    bmi     .error
    move.l  d0,a0
    move.l  (a0),d0                         ; Address of the first bank
    move.l  d0,HDL_SPRITE_BUF0BANK0(a6)
    add.l   #(SPRITE_BANK_SIZE*16),d0
    move.l  d0,HDL_SPRITE_BUF0BANK1(a6)
    add.l   #(SPRITE_BANK_SIZE*16),d0
    move.l  d0,HDL_SPRITE_BUF0BANK2(a6)
    add.l   #(SPRITE_BANK_SIZE*16),d0
    move.l  d0,HDL_SPRITE_BUF0BANK3(a6)
    add.l   #(SPRITE_BANK_SIZE*16),d0
    move.l  d0,HDL_SPRITE_BUF0BANK4(a6)
    add.l   #(SPRITE_BANK_SIZE*16),d0
    move.l  d0,HDL_SPRITE_BUF0BANK5(a6)
    add.l   #(SPRITE_BANK_SIZE*16),d0
    move.l  d0,HDL_SPRITE_BUF0BANK6(a6)
    add.l   #(SPRITE_BANK_SIZE*16),d0
    move.l  d0,HDL_SPRITE_BUF0BANK7(a6)
    add.l   #(SPRITE_BANK_SIZE*16),d0
    move.l  d0,HDL_SPRITE_BUF1BANK0(a6)
    add.l   #(SPRITE_BANK_SIZE*16),d0
    move.l  d0,HDL_SPRITE_BUF1BANK1(a6)
    add.l   #(SPRITE_BANK_SIZE*16),d0
    move.l  d0,HDL_SPRITE_BUF1BANK2(a6)
    add.l   #(SPRITE_BANK_SIZE*16),d0
    move.l  d0,HDL_SPRITE_BUF1BANK3(a6)
    add.l   #(SPRITE_BANK_SIZE*16),d0
    move.l  d0,HDL_SPRITE_BUF1BANK4(a6)
    add.l   #(SPRITE_BANK_SIZE*16),d0
    move.l  d0,HDL_SPRITE_BUF1BANK5(a6)
    add.l   #(SPRITE_BANK_SIZE*16),d0
    move.l  d0,HDL_SPRITE_BUF1BANK6(a6)
    add.l   #(SPRITE_BANK_SIZE*16),d0
    move.l  d0,HDL_SPRITE_BUF1BANK7(a6)
    bra.s   .exit
|
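If the sixteen HDL_SPRITE_* slots happen to be consecutive longwords in the variable block, in the order BUF0BANK0 through BUF1BANK7, the run of move/add pairs could be folded into a loop. That layout is an assumption and is not confirmed anywhere in the thread; the label .store_bank is made up for this sketch. It picks up at the point where d0 holds the address of the first bank.

```asm
; Sketch only: assumes the 16 handle slots are contiguous longwords,
; ordered BUF0BANK0..BUF0BANK7 then BUF1BANK0..BUF1BANK7.
SPRITE_BANK_SIZE: equ 128

    lea     HDL_SPRITE_BUF0BANK0(a6),a1     ; first handle slot
    moveq   #16-1,d1                        ; 2 buffers * 8 banks
.store_bank:
    move.l  d0,(a1)+                        ; store this bank's address
    add.l   #SPRITE_BANK_SIZE*16,d0         ; step to the next bank
    dbf     d1,.store_bank
```

This is setup code, so the gain is size and readability rather than speed.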
14 February 2021, 15:34 | #15 | |
Registered User
Join Date: Dec 2017
Location: Denmark
Posts: 179
|
Quote:
|
|
14 February 2021, 15:38 | #16 |
Registered User
Join Date: Oct 2017
Location: Sunderland, England
Posts: 2,702
|
|
14 February 2021, 15:45 | #17 | |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
|
Quote:
Code:
.copy_tile_right:                       ; Right copy
    move.w  (a0)+,d2
    beq.s   .1b
    move.w  (a5,d2.l*2),(a1)            ; Bitplane 1
.1b:
    move.w  (a0)+,d2
    beq.s   .2b
    move.w  (a5,d2.l*2),8(a1)           ; Bitplane 2
.2b:
    move.w  (a0)+,d2
    beq.s   .3b
    move.w  (a5,d2.l*2),128*16(a1)      ; Bitplane 3
.3b:
    move.w  (a0)+,d2
    beq.s   .4b
    move.w  (a5,d2.l*2),128*16+8(a1)    ; Bitplane 4
.4b:
    add.l   d0,a1
    dbf     d1,.copy_tile_right
|
|
14 February 2021, 15:47 | #18 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,959
|
If movem.w is used then the base of the A5 table probably has to be changed, because movem.w sign-extends each word to a longword. Anyway, perhaps something like this can be used too:
    movem.w (a0)+,d2/d3/d4/d5/d7/a2/a6
depending on which registers are free. |
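To illustrate the sign-extension point: with movem.w, source words of $8000 and above arrive as negative longwords, so a (a5,dn.l*2) lookup reaches up to 64 KB below a5. One way to cope would be to point a5 at the middle of the 128 KB table and fill it using the same signed index. The routine below is only a sketch under that assumption; the label build_mirror_signed is hypothetical.

```asm
; Sketch only: build_mirror_signed and the table layout are assumptions.
; in:      a5 = middle of a 128 KB buffer (64 KB either side)
; trashes: d0-d3
build_mirror_signed:
    moveq   #0,d0               ; d0 = source word, counts 0..65535
.next_word:
    move.w  d0,d1
    moveq   #16-1,d3
.next_bit:
    lsr.w   #1,d1               ; lowest bit -> X
    addx.w  d2,d2               ; shift it in from the right
    dbf     d3,.next_bit        ; d2 = bit-reversed d0
    move.w  d0,d1
    ext.l   d1                  ; same sign extension movem.w applies
    move.w  d2,(a5,d1.l*2)      ; offsets run from -65536 to +65534
    addq.w  #1,d0
    bne.s   .next_word          ; d0 wraps to 0 after 65535
    rts
```

With the table laid out this way, the lookups in the movem.w version would need no extra masking of the index registers.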
14 February 2021, 15:49 | #19 |
Registered User
Join Date: Feb 2021
Location: Becej / Serbia
Posts: 120
|
I don't understand why you want to use lookup tables at all? I mean chipram reads and writes are slow as it is. Wouldn't a straight up calculation be faster?
Code:
; this part can be set up only once for multiple mirrors
    move.l  #%11111111000000001111111100000000, d2
    move.l  #%11110000111100001111000011110000, d3
    move.l  #%11001100110011001100110011001100, d4
    move.l  #%10101010101010101010101010101010, d5
    moveq   #6, d6              ; loop counter used by dbf d6 below

    move.l  (a0)+, d0
    move.l  d0, d1              ; d1 = ABCDEFGH.IJKLMNOP.abcdefgh.ijklmnop
    and.l   d2, d0              ; d0 = ABCDEFGH.00000000.abcdefgh.00000000
    lsl.l   #8, d1              ; d1 = IJKLMNOP.abcdefgh.ijklmnop.00000000
    lsr.l   #8, d0              ; d0 = 00000000.ABCDEFGH.00000000.abcdefgh
    and.l   d2, d1              ; d1 = IJKLMNOP.00000000.ijklmnop.00000000
    or.l    d1, d0              ; d0 = IJKLMNOP.ABCDEFGH.ijklmnop.abcdefgh
    move.l  d0, d1              ; d1 = IJKLMNOP.ABCDEFGH.ijklmnop.abcdefgh
    and.l   d3, d0              ; d0 = IJKL0000.ABCD0000.ijkl0000.abcd0000
    lsl.l   #4, d1              ; d1 = MNOPABCD.EFGHijkl.mnopabcd.efgh0000
    lsr.l   #4, d0              ; d0 = 0000IJKL.0000ABCD.0000ijkl.0000abcd
    and.l   d3, d1              ; d1 = MNOP0000.EFGH0000.mnop0000.efgh0000
    or.l    d1, d0              ; d0 = MNOPIJKL.EFGHABCD.mnopijkl.efghabcd
    move.l  d0, d1              ; d1 = MNOPIJKL.EFGHABCD.mnopijkl.efghabcd
    and.l   d4, d0              ; d0 = MN00IJ00.EF00AB00.mn00ij00.ef00ab00
    lsl.l   #2, d1              ; d1 = OPIJKLEF.GHABCDmn.opijklef.ghabcd00
    lsr.l   #2, d0              ; d0 = 00MN00IJ.00EF00AB.00mn00ij.00ef00ab
    and.l   d4, d1              ; d1 = OP00KL00.GH00CD00.op00kl00.gh00cd00
    or.l    d1, d0              ; d0 = OPMNKLIJ.GHEFCDAB.opmnklij.ghefcdab
    move.l  d0, d1              ; d1 = OPMNKLIJ.GHEFCDAB.opmnklij.ghefcdab
    and.l   d5, d0              ; d0 = O0M0K0I0.G0E0C0A0.o0m0k0i0.g0e0c0a0
    lsl.l   #1, d1              ; d1 = PMNKLIJG.HEFCDABo.pmnklijg.hefcdab0
    lsr.l   #1, d0              ; d0 = 0O0M0K0I.0G0E0C0A.0o0m0k0i.0g0e0c0a
    and.l   d5, d1              ; d1 = P0N0L0J0.H0F0D0B0.p0n0l0j0.h0f0d0b0
    or.l    d0, d1              ; d1 = PONMLKJI.HGFEDCBA.ponmlkji.hgfedcba
.copy_tile_right:
    move.l  (a0)+, d0           ; fetch the next longword (two 16-bit words)
    move.l  d1, (a1)+           ; store the previous mirrored result
    move.l  d0, d1              ; d1 = ABCDEFGH.IJKLMNOP.abcdefgh.ijklmnop
    and.l   d2, d0              ; d0 = ABCDEFGH.00000000.abcdefgh.00000000
    lsl.l   #8, d1              ; d1 = IJKLMNOP.abcdefgh.ijklmnop.00000000
    lsr.l   #8, d0              ; d0 = 00000000.ABCDEFGH.00000000.abcdefgh
    and.l   d2, d1              ; d1 = IJKLMNOP.00000000.ijklmnop.00000000
    or.l    d1, d0              ; d0 = IJKLMNOP.ABCDEFGH.ijklmnop.abcdefgh
    move.l  d0, d1              ; d1 = IJKLMNOP.ABCDEFGH.ijklmnop.abcdefgh
    and.l   d3, d0              ; d0 = IJKL0000.ABCD0000.ijkl0000.abcd0000
    lsl.l   #4, d1              ; d1 = MNOPABCD.EFGHijkl.mnopabcd.efgh0000
    lsr.l   #4, d0              ; d0 = 0000IJKL.0000ABCD.0000ijkl.0000abcd
    and.l   d3, d1              ; d1 = MNOP0000.EFGH0000.mnop0000.efgh0000
    or.l    d1, d0              ; d0 = MNOPIJKL.EFGHABCD.mnopijkl.efghabcd
    move.l  d0, d1              ; d1 = MNOPIJKL.EFGHABCD.mnopijkl.efghabcd
    and.l   d4, d0              ; d0 = MN00IJ00.EF00AB00.mn00ij00.ef00ab00
    lsl.l   #2, d1              ; d1 = OPIJKLEF.GHABCDmn.opijklef.ghabcd00
    lsr.l   #2, d0              ; d0 = 00MN00IJ.00EF00AB.00mn00ij.00ef00ab
    and.l   d4, d1              ; d1 = OP00KL00.GH00CD00.op00kl00.gh00cd00
    or.l    d1, d0              ; d0 = OPMNKLIJ.GHEFCDAB.opmnklij.ghefcdab
    move.l  d0, d1              ; d1 = OPMNKLIJ.GHEFCDAB.opmnklij.ghefcdab
    and.l   d5, d0              ; d0 = O0M0K0I0.G0E0C0A0.o0m0k0i0.g0e0c0a0
    lsl.l   #1, d1              ; d1 = PMNKLIJG.HEFCDABo.pmnklijg.hefcdab0
    lsr.l   #1, d0              ; d0 = 0O0M0K0I.0G0E0C0A.0o0m0k0i.0g0e0c0a
    and.l   d5, d1              ; d1 = P0N0L0J0.H0F0D0B0.p0n0l0j0.h0f0d0b0
    or.l    d0, d1              ; d1 = PONMLKJI.HGFEDCBA.ponmlkji.hgfedcba
    dbf     d6, .copy_tile_right
    move.l  d1, (a1)+           ; store the final mirrored longword
    bra     .exit
|
14 February 2021, 15:53 | #20 | |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
|
Quote:
It doesn't work for 020/030 and chip ram only systems. Try it for yourself |
|