14 February 2021, 15:57 | #21 |
Registered User
Join Date: Feb 2021
Location: Becej / Serbia
Posts: 120
|
|
14 February 2021, 16:25 | #22 |
Registered User
Join Date: Oct 2017
Location: Sunderland, England
Posts: 2,702
|
OK, so for some odd reason I can't get the movem.w to work... I get wrong values coming in from the lookup table. At first I thought it could be the upper word of d2/d3/d4/d5 not being zero, but I recall these are set anyway when a word operand is used.
Code:
    move.l  d4,a2
    move.l  d5,a3
.copy_tile_right:                       ; Right copy
    movem.w (a0)+,d2/d3/d4/d5
    move.w  (a5,d2.l*2),(a1)            ; Bitplane 1
    move.w  (a5,d3.l*2),8(a1)           ; Bitplane 2
    move.w  (a5,d4.l*2),128*16(a1)      ; Bitplane 3
    move.w  (a5,d5.l*2),(128*16)+8(a1)  ; Bitplane 4
    add.l   d0,a1
    dbf     d1,.copy_tile_right
    move.l  a2,d4
    move.l  a3,d5
Code:
.copy_tile_right:                       ; Right copy
    move.w  (a0)+,d2
    beq.s   .1b
    move.w  (a5,d2.l*2),(a1)            ; Bitplane 1
.1b:
    move.w  (a0)+,d2
    beq.s   .2b
    move.w  (a5,d2.l*2),8(a1)           ; Bitplane 2
.2b:
    move.w  (a0)+,d2
    beq.s   .3b
    move.w  (a5,d2.l*2),128*16(a1)      ; Bitplane 3
.3b:
    move.w  (a0)+,d2
    beq.s   .4b
    move.w  (a5,d2.l*2),(128*16)+8(a1)  ; Bitplane 4
.4b:
    add.l   d0,a1
    dbf     d1,.copy_tile_right
Graeme |
14 February 2021, 16:55 | #23 |
Registered User
Join Date: Feb 2021
Location: Becej / Serbia
Posts: 120
|
What's fastest on 020 out of these 3:
Code:
; case 1
    move.w  (a0)+,d2
    move.w  (a5,d2.l*2),(a1)
    ; other bitplanes
    move.w  (a0)+,d2
    move.w  (a5,d2.l*2),2(a1)

; case 2
    move.w  (a0),d2
    move.w  (a5,d2.l*2),d3
    swap    d3
    move.w  16(a0),d2
    move.w  (a5,d2.l*2),d3
    move.l  d3,(a1)

; case 3
    move.w  (a0),d2
    move.l  (a5,d2.l*4),d3      ; a5 - pointer to a 256k table
    move.w  16(a0),d2
    move.w  (a5,d2.l*4),d3
    move.l  d3,(a1)
Code:
; case 1
    move.w  (a0)+,d2
    move.l  (a5,d2.l*2),(a0)
    move.w  (a0)+,d2
    move.l  (a5,d2.l*2),8(a0)

; case 2
    move.l  (a0)+,d2
    move.l  (a5,d2.w*2),(a0)
    swap    d2
    move.l  (a5,d2.w*2),8(a0)
Last edited by orangespider; 14 February 2021 at 17:07. |
14 February 2021, 17:31 | #24 | |
Registered User
Join Date: Dec 2017
Location: Denmark
Posts: 179
|
Quote:
Code:
    move.l  (a0)+,D0    ;d0 = ABCDEFGH.IJKLMNOP.abcdefgh.ijklmnop
    ror.w   #8,D0       ;d0 = ABCDEFGH.IJKLMNOP.ijklmnop.abcdefgh
    swap    d0          ;d0 = ijklmnop.abcdefgh.ABCDEFGH.IJKLMNOP
    ror.w   #8,d0       ;d0 = ijklmnop.abcdefgh.IJKLMNOP.ABCDEFGH
    swap    d0          ;d0 = IJKLMNOP.ABCDEFGH.ijklmnop.abcdefgh
|
|
14 February 2021, 17:39 | #25 | |
Registered User
Join Date: Feb 2021
Location: Becej / Serbia
Posts: 120
|
Quote:
|
|
14 February 2021, 18:13 | #26 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,039
|
movem.w sign-extends each word to 32 bits when it loads registers, so any offset in $8000-$ffff becomes a negative 32-bit offset (you are using them right after as .l).
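For anyone following along, the sign-extension point can be modelled off-target in a few lines of C. This is only a sketch under stated assumptions: the function names are made up for illustration, and it models the 32-bit register contents and the `(a5,dN.l*2)` displacement, not the 68020 itself.

```c
#include <stdint.h>

/* Hypothetical model: the 32-bit value a register holds after movem.w
   loads `word` into it (the word is sign-extended). */
static int32_t movem_w_load(uint16_t word)
{
    return (word >= 0x8000u) ? (int32_t)word - 0x10000 : (int32_t)word;
}

/* Hypothetical model: the same word zero-extended, which is what the
   table lookup actually wants as an index. */
static int32_t zero_extended(uint16_t word)
{
    return (int32_t)word;
}

/* Byte displacement produced by the (a5,dN.l*2) addressing mode. */
static int32_t scaled_index(int32_t reg)
{
    return reg * 2;
}
```

With an entry of $8000, the sign-extended register points 64 KB below a5 instead of 64 KB above it, which would explain wrong values coming back from the table.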
|
14 February 2021, 18:33 | #27 | |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,960
|
Quote:
|
|
14 February 2021, 19:59 | #28 | ||
Registered User
Join Date: Oct 2017
Location: Sunderland, England
Posts: 2,702
|
Quote:
Quote:
|
||
15 February 2021, 17:16 | #29 |
Registered User
Join Date: Dec 2017
Location: Denmark
Posts: 179
|
If you are in a pinch for memory, I propose this no-table approach.
Code:
    move.w  (a0)+,D2
    beq.s   .noflip
    moveq   #7-1,D3
    ror.b   #1,D2
.loopshakeflip
    rol.w   #1,D2
    ror.b   #2,D2
    dbf     D3,.loopshakeflip
    rol.w   #1,D2
    ror.b   #1,D2
    move.w  D2,(a1)
.noflip
It's optimized for the smallest number of chip-mem instruction fetches, and hopefully the "shakeflip" loop will run from cache. The cycle count for small ror/rol is 6+2 for #1 and 6+4 for #2, so 7 times 18 + 24 = 150 cycles, not counting the loop and instruction fetches. It's most likely much slower than a table lookup, but I do like the simplicity of it. Last edited by LaBodilsen; 15 February 2021 at 17:52. |
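The rotate sequence above can be checked off-target with a small C model. This is a sketch, not the poster's code: `ror_b`, `rol_w`, `shakeflip` and `reverse16_naive` are names invented here, with the first two imitating the 68k `ror.b` and `rol.w` instructions on a 16-bit value.

```c
#include <stdint.h>

/* Model of ror.b #n: rotate the low byte right, leave the high byte alone (n in 1..7). */
static uint16_t ror_b(uint16_t w, unsigned n)
{
    uint8_t lo = (uint8_t)(w & 0xFFu);
    lo = (uint8_t)((lo >> n) | (lo << (8 - n)));
    return (uint16_t)((w & 0xFF00u) | lo);
}

/* Model of rol.w #n: rotate the whole word left (n in 1..15). */
static uint16_t rol_w(uint16_t w, unsigned n)
{
    return (uint16_t)((w << n) | (w >> (16 - n)));
}

/* Same instruction order as the assembly: one ror.b #1, seven rounds of
   rol.w #1 / ror.b #2, then a closing rol.w #1 / ror.b #1. */
static uint16_t shakeflip(uint16_t w)
{
    w = ror_b(w, 1);
    for (int i = 0; i < 7; i++) {   /* moveq #7-1 with dbf runs the body 7 times */
        w = rol_w(w, 1);
        w = ror_b(w, 2);
    }
    w = rol_w(w, 1);
    w = ror_b(w, 1);
    return w;
}

/* Straightforward bit-by-bit reversal, used only as a reference. */
static uint16_t reverse16_naive(uint16_t w)
{
    uint16_t r = 0;
    for (int i = 0; i < 16; i++) {
        r = (uint16_t)((r << 1) | (w & 1u));
        w >>= 1;
    }
    return r;
}
```

Comparing `shakeflip` against `reverse16_naive` over all 65536 inputs is a cheap way to confirm the sequence before spending time on cycle counting.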
15 February 2021, 17:31 | #30 |
Registered User
Join Date: Oct 2017
Location: Sunderland, England
Posts: 2,702
|
I’ll give it a try and show the results for you.
|
15 February 2021, 18:17 | #31 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Another possibility for the c2p-style approach :
Code:
; data and result in d4
; enter here for 32-bit bit reverse
    ror.b   #4,d4
    ror.w   #8,d4
    ror.b   #4,d4
    swap    d4
; enter here for 16-bit only
    ror.b   #4,d4
    ror.w   #8,d4
    ror.b   #4,d4
    move.l  d4,d0
    lsr.l   #2,d0
    lsl.l   #2,d4
    eor.l   d4,d0
    and.l   #$33333333,d0
    eor.l   d0,d4
    move.l  d4,d0
    lsr.l   #1,d0
    add.l   d4,d4
    eor.l   d4,d0
    and.l   #$55555555,d0
    eor.l   d0,d4
|
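For readers who want to see why the eor/and/eor pairs work, here is a rough C equivalent of the 32-bit entry point. It is a sketch with an invented function name: the rotate/swap prologue is replaced by plain mask operations that have the same net effect (reversing the order of the eight nibbles), while the two masked-XOR steps follow the assembly instruction by instruction.

```c
#include <stdint.h>

/* Hypothetical C model of the 32-bit routine above. */
static uint32_t reverse32_delta(uint32_t x)
{
    uint32_t t;

    /* Net effect of the ror.b #4 / ror.w #8 / swap prologue:
       swap the nibbles in every byte, then reverse the byte order. */
    x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);
    x = (x << 24) | ((x & 0x0000FF00u) << 8) | ((x >> 8) & 0x0000FF00u) | (x >> 24);

    /* lsr.l #2 / lsl.l #2 / eor / and #$33333333 / eor:
       where the mask is set the result takes x>>2, elsewhere it keeps x<<2. */
    t = x >> 2;
    x <<= 2;
    t = (t ^ x) & 0x33333333u;
    x ^= t;

    /* Same idea with a shift of 1 and mask $55555555 (add.l d4,d4 is x<<1). */
    t = x >> 1;
    x += x;
    t = (t ^ x) & 0x55555555u;
    x ^= t;

    return x;
}
```

The masked XOR is what lets each stage get away with a single `and`: the bits shifted the wrong way cancel out instead of having to be masked off separately.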
15 February 2021, 18:19 | #32 |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
|
Oh well, then also check the good old magnitude progressive group swapping.
Not much suited for 16 bit or 020, but give it a try.
Code:
    move.w  (a0)+,d2
    beq.s   .noflip
    move.w  #$5555,d3
    and.w   d2,d3
    eor.w   d3,d2
    add.w   d3,d3
    lsr.w   #1,d2
    or.w    d3,d2
    move.w  #$3333,d3
    and.w   d2,d3
    eor.w   d3,d2
    lsl.w   #2,d3
    lsr.w   #2,d2
    or.w    d3,d2
    move.w  #$0f0f,d3
    and.w   d2,d3
    eor.w   d3,d2
    lsl.w   #4,d3
    lsr.w   #4,d2
    or.w    d3,d2
    rol.w   #8,d2
    move.w  d2,(a1)
.noflip
|
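The group-swapping idea reads a little more easily in C. This is a minimal sketch with an invented function name, assuming the same stage order as the assembly: swap adjacent bits, then bit pairs, then nibbles, then the two bytes.

```c
#include <stdint.h>

/* Hypothetical C model of the 16-bit mask-and-shift reversal above. */
static uint16_t reverse16_masks(uint16_t x)
{
    x = (uint16_t)(((x & 0x5555u) << 1) | ((x >> 1) & 0x5555u));  /* swap adjacent bits */
    x = (uint16_t)(((x & 0x3333u) << 2) | ((x >> 2) & 0x3333u));  /* swap bit pairs     */
    x = (uint16_t)(((x & 0x0F0Fu) << 4) | ((x >> 4) & 0x0F0Fu));  /* swap nibbles       */
    x = (uint16_t)((x << 8) | (x >> 8));                          /* rol.w #8           */
    return x;
}
```

Each stage doubles the size of the groups being exchanged, so a 16-bit word needs four stages and a longword needs five.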
15 February 2021, 19:06 | #33 |
Registered User
Join Date: Dec 2017
Location: Denmark
Posts: 179
|
Okay, I did some tests myself for a 16*16 block in 4 bitplanes (with WinUAE though).
It turns out my own loop is slower than just unrolling it.
Code:
    move.w  (a0)+,D2
    beq.s   .noflip
    ror.b   #1,D2
    rol.w   #1,D2
    ror.b   #2,D2
    rol.w   #1,D2
    ror.b   #2,D2
    rol.w   #1,D2
    ror.b   #2,D2
    rol.w   #1,D2
    ror.b   #2,D2
    rol.w   #1,D2
    ror.b   #2,D2
    rol.w   #1,D2
    ror.b   #2,D2
    rol.w   #1,D2
    ror.b   #2,D2
    rol.w   #1,D2
    ror.b   #1,D2
    move.w  D2,(a1)
.noflip
    move.w  (a0),D2
Of course table lookup is the fastest, at about 2 1/3 raster lines.
(edit: fixed my unrolled code, which made it faster) Last edited by LaBodilsen; 15 February 2021 at 19:27. |
15 February 2021, 19:18 | #34 |
Lemon. / Core Design
Join Date: Mar 2016
Location: Tier 5
Posts: 1,212
|
Could you use a sequence of
    lsr.w   #1,d2       ; shift the low bit out into X (ror would leave X unchanged)
    addx.w  d3,d3       ; shift X into the bottom of d3
This uses one more data register. |
15 February 2021, 21:08 | #35 |
Registered User
Join Date: Feb 2021
Location: Becej / Serbia
Posts: 120
|
I believe this is the fastest no-table code:
Code:
    move.l  #%11111111000000001111111100000000,d2
    move.l  #%11110000111100001111000011110000,d3
    move.l  #%11001100110011001100110011001100,d4
    move.l  #%10101010101010101010101010101010,d5
.copy_tile_right:
    move.l  (a0)+,d6
    beq.s   .sk0
    move.l  d6,d7       ; d7 - ABCDEFGH.IJKLMNOP.abcdefgh.ijklmnop
    and.l   d2,d6       ; d6 - ABCDEFGH.00000000.abcdefgh.00000000
    eor.l   d6,d7       ; d7 - 00000000.IJKLMNOP.00000000.ijklmnop
    swap    d6          ; d6 - abcdefgh.00000000.ABCDEFGH.00000000
    or.l    d7,d6       ; d6 - abcdefgh.IJKLMNOP.ABCDEFGH.ijklmnop
    move.l  d6,d7       ; d7 - abcdefgh.IJKLMNOP.ABCDEFGH.ijklmnop
    and.l   d3,d6       ; d6 - abcd0000.IJKL0000.ABCD0000.ijkl0000
    eor.l   d6,d7       ; d7 - 0000efgh.0000MNOP.0000EFGH.0000mnop
    ror.l   #8,d6       ; d6 - ijkl0000.abcd0000.IJKL0000.ABCD0000
    or.l    d7,d6       ; d6 - ijklefgh.abcdMNOP.IJKLEFGH.ABCDmnop
    move.l  d6,d7       ; d7 - ijklefgh.abcdMNOP.IJKLEFGH.ABCDmnop
    and.l   d4,d6       ; d6 - ij00ef00.ab00MN00.IJ00EF00.AB00mn00
    eor.l   d6,d7       ; d7 - 00kl00gh.00cd00OP.00KL00GH.00CD00op
    ror.l   #4,d6       ; d6 - mn00ij00.ef00ab00.MN00IJ00.EF00AB00
    or.l    d7,d6       ; d6 - mnklijgh.efcdabOP.MNKLIJGH.EFCDABop
    move.l  d6,d7       ; d7 - mnklijgh.efcdabOP.MNKLIJGH.EFCDABop
    and.l   d5,d6       ; d6 - m0k0i0g0.e0c0a0O0.M0K0I0G0.E0C0A0o0
    eor.l   d6,d7       ; d7 - 0n0l0j0h.0f0d0b0P.0N0L0J0H.0F0D0B0p
    ror.l   #3,d6       ; d6 - 0o0m0k0i.0g0e0c0a.0O0M0K0I.0G0E0C0A
    ror.l   #1,d7       ; d7 - p0n0l0j0.h0f0d0b0.P0N0L0J0.H0F0D0B0
    or.l    d7,d6       ; d6 - ponmlkji.hgfedcba.PONMLKJI.HGFEDCBA
    move.w  d6,(a1)
    swap    d6
    move.w  d6,8(a1)
.sk0:
    move.l  (a0)+,d6
    beq.s   .sk1
    move.l  d6,d7       ; d7 - ABCDEFGH.IJKLMNOP.abcdefgh.ijklmnop
    and.l   d2,d6       ; d6 - ABCDEFGH.00000000.abcdefgh.00000000
    eor.l   d6,d7       ; d7 - 00000000.IJKLMNOP.00000000.ijklmnop
    swap    d6          ; d6 - abcdefgh.00000000.ABCDEFGH.00000000
    or.l    d7,d6       ; d6 - abcdefgh.IJKLMNOP.ABCDEFGH.ijklmnop
    move.l  d6,d7       ; d7 - abcdefgh.IJKLMNOP.ABCDEFGH.ijklmnop
    and.l   d3,d6       ; d6 - abcd0000.IJKL0000.ABCD0000.ijkl0000
    eor.l   d6,d7       ; d7 - 0000efgh.0000MNOP.0000EFGH.0000mnop
    ror.l   #8,d6       ; d6 - ijkl0000.abcd0000.IJKL0000.ABCD0000
    or.l    d7,d6       ; d6 - ijklefgh.abcdMNOP.IJKLEFGH.ABCDmnop
    move.l  d6,d7       ; d7 - ijklefgh.abcdMNOP.IJKLEFGH.ABCDmnop
    and.l   d4,d6       ; d6 - ij00ef00.ab00MN00.IJ00EF00.AB00mn00
    eor.l   d6,d7       ; d7 - 00kl00gh.00cd00OP.00KL00GH.00CD00op
    ror.l   #4,d6       ; d6 - mn00ij00.ef00ab00.MN00IJ00.EF00AB00
    or.l    d7,d6       ; d6 - mnklijgh.efcdabOP.MNKLIJGH.EFCDABop
    move.l  d6,d7       ; d7 - mnklijgh.efcdabOP.MNKLIJGH.EFCDABop
    and.l   d5,d6       ; d6 - m0k0i0g0.e0c0a0O0.M0K0I0G0.E0C0A0o0
    eor.l   d6,d7       ; d7 - 0n0l0j0h.0f0d0b0P.0N0L0J0H.0F0D0B0p
    ror.l   #3,d6       ; d6 - 0o0m0k0i.0g0e0c0a.0O0M0K0I.0G0E0C0A
    ror.l   #1,d7       ; d7 - p0n0l0j0.h0f0d0b0.P0N0L0J0.H0F0D0B0
    or.l    d7,d6       ; d6 - ponmlkji.hgfedcba.PONMLKJI.HGFEDCBA
    move.w  d6,128*16(a1)
    swap    d6
    move.w  d6,128*16+8(a1)
.sk1:
    add.l   d0,a1
    dbf     d1,.copy_tile_right
edit: If I counted the cycles correctly, this should run at 251 cycles per loop iteration (213 cycles for 32x32) and the table approach would run at 200 cycles per iteration. But my cycle counts might be wrong. Last edited by orangespider; 15 February 2021 at 22:04. |
16 February 2021, 17:21 | #36 |
Registered User
Join Date: Dec 2017
Location: Denmark
Posts: 179
|
It is indeed very fast, but meynaf's version is just a smidge faster (about 1/4 of a raster line) when used as 32-bit.
Code:
    move.l  #$33333333,d5
    move.l  #$55555555,d6
.copy_tile_right:
    move.l  (a0)+,d4
    beq.s   .noflip
    ror.b   #4,d4
    ror.w   #8,d4
    ror.b   #4,d4
    swap    d4
    ror.b   #4,d4
    ror.w   #8,d4
    ror.b   #4,d4
    move.l  d4,d0
    lsr.l   #2,d0
    lsl.l   #2,d4
    eor.l   d4,d0
    and.l   d5,d0
    eor.l   d0,d4
    move.l  d4,d0
    lsr.l   #1,d0
    add.l   d4,d4
    eor.l   d4,d0
    and.l   d6,d0
    eor.l   d0,d4
    move.w  d4,(a1)
    swap    d4
    move.w  d4,8(a1)
.noflip
    move.l  (a0)+,d4
    beq.s   .noflip2
    ror.b   #4,d4
    ror.w   #8,d4
    ror.b   #4,d4
    swap    d4
    ror.b   #4,d4
    ror.w   #8,d4
    ror.b   #4,d4
    move.l  d4,d0
    lsr.l   #2,d0
    lsl.l   #2,d4
    eor.l   d4,d0
    and.l   d5,d0
    eor.l   d0,d4
    move.l  d4,d0
    lsr.l   #1,d0
    add.l   d4,d4
    eor.l   d4,d0
    and.l   d6,d0
    eor.l   d0,d4
    move.w  d4,128*16(a1)
    swap    d4
    move.w  d4,128*16+8(a1)
.noflip2
    add.l   d1,a1
    dbf     d2,.copy_tile_right
(Tested in WinUAE, A1200, chip RAM only, cycle exact.)
I have a feeling it could be faster, as you don't need to flip the entire longword, only the two word parts, so maybe a swap somewhere can be cut out. Last edited by LaBodilsen; 16 February 2021 at 17:34. |
17 February 2021, 19:16 | #37 | |
Registered User
Join Date: Dec 2017
Location: Denmark
Posts: 179
|
Quote:
Code:
    move.l  #$33333333,d1
    move.l  #$55555555,d4
    move.l  #$0f0f0f0f,d5
.copy_tile_right:
    move.l  (a0)+,d2
    beq.s   .noflip
    move.l  d4,d3
    and.l   d2,d3
    eor.l   d3,d2
    add.l   d3,d3
    lsr.l   #1,d2
    or.l    d3,d2
    move.l  d1,d3
    and.l   d2,d3
    eor.l   d3,d2
    lsl.l   #2,d3
    lsr.l   #2,d2
    or.l    d3,d2
    move.l  d5,d3
    and.l   d2,d3
    eor.l   d3,d2
    lsl.l   #4,d3
    lsr.l   #4,d2
    or.l    d3,d2
    rol.l   #8,d2
    move.w  d2,(a1)
    swap    d2
    move.w  d2,8(a1)
.noflip
    move.l  (a0)+,d2
    beq.s   .noflip2
    move.l  d4,d3
    and.l   d2,d3
    eor.l   d3,d2
    add.l   d3,d3
    lsr.l   #1,d2
    or.l    d3,d2
    move.l  d1,d3
    and.l   d2,d3
    eor.l   d3,d2
    lsl.l   #2,d3
    lsr.l   #2,d2
    or.l    d3,d2
    move.l  d5,d3
    and.l   d2,d3
    eor.l   d3,d2
    lsl.l   #4,d3
    lsr.l   #4,d2
    or.l    d3,d2
    rol.l   #8,d2
    move.w  d2,128*16(a1)
    swap    d2
    move.w  d2,128*16+8(a1)
.noflip2
    add.l   d0,a1
    dbf     d7,.copy_tile_right
Red = My code as 32bit
Purple = table lookup
Green = Meynaf code as 32bit
Yellow = Ross code as 32bit
Turquoise = orangespider code
PS. The test is done for "always data in" and no zero data, so the .noflip branch is never taken. Data in was always $00020002, and the result was verified as $40004000 for all versions. Last edited by LaBodilsen; 17 February 2021 at 19:23. |
|
17 February 2021, 19:36 | #38 |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,410
|
It may be wise to try these routines on a real A1200 as well, as 68020 emulation isn't cycle accurate. My personal experience is that real A1200's tend to be slower on RAM access than WinUAE suggests. Do note this is based on my tests with WinUAE 4.2.0, I haven't tried them since upgrading to 4.4.0.
|
18 February 2021, 15:26 | #39 | |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,753
|
Quote:
Code:
    move.l  #$55555555,d2
    move.l  #$33333333,d3
    move.l  #$0f0f0f0f,d4
.loop
    move.l  (a0)+,d0
    move.l  d0,d1
    and.l   d2,d1
    eor.l   d1,d0
    lsr.l   #1,d0
    add.l   d1,d1
    or.l    d1,d0
    move.l  d0,d1
    and.l   d3,d1
    eor.l   d1,d0
    lsr.l   #2,d0
    lsl.l   #2,d1
    or.l    d1,d0
    move.l  d0,d1
    and.l   d4,d1
    eor.l   d1,d0
    lsr.l   #4,d0
    lsl.l   #4,d1
    or.l    d1,d0
    rol.w   #8,d0
    swap    d0
    rol.w   #8,d0
    move.l  d0,(a1)
    add.l   d5,a1
    dbra    d7,.loop
|
|
18 February 2021, 19:56 | #40 | |
Registered User
Join Date: Oct 2017
Location: Sunderland, England
Posts: 2,702
|
Quote:
Here's my final effort. 82 scan lines rendering both hardware sprites each frame from built up 16x16 tiles. Only one face of the tiles is in ram and the flip is done on the fly as the sprite is built up. Any tiles different from the last frame are cleared - as opposed to simply mass clearing the whole 128x128 sprite. |
|