English Amiga Board - Fast tile flipping on CD32

English Amiga Board (https://eab.abime.net/index.php)

- Coders. Asm / Hardware (https://eab.abime.net/forumdisplay.php?f=112)

- - Fast tile flipping on CD32 (https://eab.abime.net/showthread.php?t=92873)

mcgeezer

31 May 2018 23:58

Fast tile flipping on CD32

Hi all,

Does anyone know if there are any system calls I can make on a CD32 platform that will take a graphical tile and flip it on the X axis, the Y axis or both?

I have the CD32 developer documentation but for some reason I can't find anything specific to this Akiko chip and how to access it.

I want to take a 32x32 pixel tile and flip it.

Any help as always is really appreciated.

Geezer

ross

01 June 2018 08:06

Quote:

Originally Posted by mcgeezer (Post 1245833)

Does anyone know if there are any system calls I can make on a CD32 platform that will take a graphical tile and flip it on the X axis, the Y axis or both?

Hi mcgeezer,
I don't know if one exists, and in any case I imagine it would be slow.

Quote:

I want to take a 32x32 pixel tile and flip it.

The only fast way that comes to mind for a flip around the y axis is a lookup table (flip 8 bits at a time through a 256-byte table and then shuffle the byte positions, or flip 16 bits at a time through a 64K-word table and then swap the words). Doing it with the blitter is much more convoluted and slow.
A flip around the x axis is very simple with both the CPU and the blitter (modulo is your friend).

The alternative is to work completely in chunky, forgetting bitplanes and working only in bytes, and then do the conversion with Akiko (which I have no idea how to program) or with one of the many chunky-to-planar routines available.

Obviously the fastest thing is to use double the memory (or triple if you also want the composite flips) and keep different copies of the same tile :D
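
For readers following along, here is a minimal C sketch of the 256-byte table idea described above. It is an illustration only: the names `rev8`, `build_rev8`, `flip_row32` and `flip_plane_vertical` are made up for this example, and it assumes one 32-pixel bitplane row fits in a single 32-bit word. On the CD32 itself this would be 68020 assembly. The flip around the x axis is just the rows copied in reverse order, which is presumably what the modulo remark refers to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* rev8[b] holds byte b with its 8 bits in reverse order (hypothetical name). */
static uint8_t rev8[256];

/* Fill the 256-byte table once at start-up. */
static void build_rev8(void)
{
    for (int b = 0; b < 256; b++) {
        uint8_t r = 0;
        for (int bit = 0; bit < 8; bit++)
            if (b & (1 << bit))
                r |= (uint8_t)(0x80 >> bit);
        rev8[b] = r;
    }
}

/* Mirror one 32-pixel bitplane row: reverse the bits inside each byte,
   then reverse the order of the four bytes. */
static uint32_t flip_row32(uint32_t row)
{
    return ((uint32_t)rev8[row & 0xFF] << 24) |
           ((uint32_t)rev8[(row >> 8) & 0xFF] << 16) |
           ((uint32_t)rev8[(row >> 16) & 0xFF] << 8) |
            (uint32_t)rev8[row >> 24];
}

/* Flip around the x axis: copy the rows in reverse order.
   Not safe in place, so source and destination must differ. */
static void flip_plane_vertical(const uint32_t *src, uint32_t *dst, size_t rows)
{
    assert(src != dst);
    for (size_t y = 0; y < rows; y++)
        dst[y] = src[rows - 1 - y];
}
```

A tile with several bitplanes would run the same row operation once per plane.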

hooverphonique

01 June 2018 09:42

Quote:

Originally Posted by mcgeezer (Post 1245833)

As far as data transformation goes, Akiko only does chunky-to-planar conversion.

Thorham

01 June 2018 16:31

Quote:

Originally Posted by ross (Post 1245867)

Perhaps this:

This code is wrong, see my post below for corrected version.

Code:

   move.l   #$55555555,d1
   eor.l    d0,d1
   eor.l    d1,d0
   add.l    d1,d1
   lsr.l    #1,d0
   or.l     d1,d0

   move.l   #$33333333,d1
   eor.l    d0,d1
   eor.l    d1,d0
   lsl.l    #2,d1
   lsr.l    #2,d0
   or.l     d1,d0

   move.l   #$0f0f0f0f,d1
   eor.l    d0,d1
   eor.l    d1,d0
   lsl.l    #4,d1
   lsr.l    #4,d0
   or.l     d1,d0

   rol.w    #8,d0
   swap     d0
   rol.w    #8,d0

mcgeezer

01 June 2018 16:53

Thanks for the suggestions guys, I guess I'm looking at doing it with the CPU.

There are a couple of reasons I can't use extra memory: one is capacity, the other is the added complication.

On the plus side I only need to do this flip when needed depending on what is in the Side Arms tile map. The other plus is that the scrolling only runs at 25 FPS so I should have plenty of time.

I'll write the scroll routine over the next week or so, the challenge is getting all of the palettes to mesh together during scrolling without having to alter the arcade rom tile map - ugh.

meynaf

01 June 2018 18:07

If there isn't enough memory to keep mirrored copies of every tile, then you might use some kind of graphics cache holding the last few mirrored tiles that were used.

If you want to do that purely dynamic then the 256-byte table seems the best compromise.
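
As a rough illustration of the cache idea, here is a small C sketch. It is not code from this thread, and every name in it (`FlipSlot`, `cache_init`, `get_flipped`, `CACHE_SLOTS`) is hypothetical. It also swaps in a simpler policy than "last few used": a direct-mapped cache where each tile number maps to one fixed slot, for a single bitplane of a 32x32 tile. Tile numbers are assumed to be non-negative.

```c
#include <stdint.h>

#define TILE_ROWS   32   /* rows in one bitplane of a 32x32 tile */
#define CACHE_SLOTS 16   /* hypothetical cache size */

typedef struct {
    int      tile_id;           /* -1 marks an empty slot */
    uint32_t rows[TILE_ROWS];   /* mirrored bitplane rows */
} FlipSlot;

static FlipSlot cache[CACHE_SLOTS];

/* Reverse the 32 bits of x (bits, pairs, nibbles, bytes, words). */
static uint32_t reverse32(uint32_t x)
{
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);
    return (x >> 16) | (x << 16);
}

/* Mark every slot as empty. */
static void cache_init(void)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        cache[i].tile_id = -1;
}

/* Return the mirrored rows of a tile, flipping only on a cache miss.
   A miss overwrites whatever tile previously used the same slot. */
static const uint32_t *get_flipped(int tile_id, const uint32_t *src)
{
    FlipSlot *slot = &cache[(unsigned)tile_id % CACHE_SLOTS];
    if (slot->tile_id != tile_id) {
        for (int y = 0; y < TILE_ROWS; y++)
            slot->rows[y] = reverse32(src[y]);
        slot->tile_id = tile_id;
    }
    return slot->rows;
}
```

The trade-off is the usual one: more slots cost more memory but cause fewer re-flips when the tile map reuses the same mirrored tiles.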

saimon69

01 June 2018 21:08

The good'ol' Side Arms! So underrated but also with some playability problems, would like to see it ported decently and improved from its original incarnation...

mcgeezer

01 June 2018 21:41

Quote:

Originally Posted by saimon69 (Post 1245989)

The good'ol' Side Arms! So underrated but also with some playability problems, would like to see it ported decently and improved from its original incarnation...

I'm just testing feasibility at the moment, and it is likely to fail. There are simply too many colours in the tile sets and sprites to do the game justice.

However I will get a nice 8 way scrolling routine out of it supporting 16 or 32 pixel tile sets that I could use on other projects.

saimon69

01 June 2018 23:14

Powder used a scrolling technique similar to Side Arms and some tricks to run a lot of stuff with 16 colors. I have the source code if you want to take a look.

mcgeezer

01 June 2018 23:20

Quote:

Originally Posted by saimon69 (Post 1246000)

Powder used a scrolling technique similar to Side Arms and some tricks to run a lot of stuff with 16 colors. I have the source code if you want to take a look.

Thanks for the offer, but I need to write my own code as I know what I need to implement.

Thorham

02 June 2018 21:58

Quote:

Originally Posted by meynaf (Post 1245961)

If you want to do that purely dynamic then the 256-byte table seems the best compromise.

I somehow doubt that that's going to be faster than doing it in code. The 4x indexed addressing mode alone seems slower than the code I posted.

ross

02 June 2018 23:07

Quote:

Originally Posted by Thorham (Post 1246108)

I somehow doubt that that's going to be faster than doing it in code. The 4x indexed addressing mode alone seems slower than the code I posted.

Hi Thorham, your routine is not working.

This is a working version:
(I have not thought much about whether it can be optimized)

Code:

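    move.l  d0,d1           ; work on two copies of the input
    move.l  #$55555555,d2   ; mask for the low bit of each pair
    lsr.l   #1,d0           ; high bits move down
    add.l   d1,d1           ; low bits move up (shift left by 1)
    and.l   d2,d0
    add.l   d2,d2           ; mask becomes $aaaaaaaa
    and.l   d2,d1
    or.l    d1,d0           ; adjacent bits swapped

    move.l  d0,d1
    move.l  #$33333333,d2
    lsr.l   #2,d0
    lsl.l   #2,d1
    and.l   d2,d0
    lsl.l   #2,d2           ; mask becomes $cccccccc
    and.l   d2,d1
    or.l    d1,d0           ; adjacent bit pairs swapped

    move.l  d0,d1
    move.l  #$0f0f0f0f,d2
    lsr.l   #4,d0
    lsl.l   #4,d1
    and.l   d2,d0
    lsl.l   #4,d2           ; mask becomes $f0f0f0f0
    and.l   d2,d1
    or.l    d1,d0           ; adjacent nibbles swapped

    rol.w   #8,d0           ; swap the two bytes of the low word
    swap    d0              ; swap the two words
    rol.w   #8,d0           ; swap the two bytes of the other word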

I seriously doubt that it is faster than a LUT version, especially when targeting a CD32 (a chipmem-only 020).

:great

Thorham

03 June 2018 00:09

Quote:

Originally Posted by ross (Post 1246116)

Hi Thorham, your routine is not working.

Thanks for pointing that out :great Some of the eors have to be ands :banghead Remind me to test code before posting it :o

Code:

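   move.l   #$55555555,d1
   and.l    d0,d1           ; d1 = low bit of each pair
   eor.l    d1,d0           ; d0 = high bit of each pair (low bits cleared)
   add.l    d1,d1           ; low bits move up
   lsr.l    #1,d0           ; high bits move down
   or.l     d1,d0           ; adjacent bits swapped

   move.l   #$33333333,d1
   and.l    d0,d1           ; d1 = low pair of each nibble
   eor.l    d1,d0           ; d0 = high pair of each nibble
   lsl.l    #2,d1
   lsr.l    #2,d0
   or.l     d1,d0           ; adjacent bit pairs swapped

   move.l   #$0f0f0f0f,d1
   and.l    d0,d1           ; d1 = low nibble of each byte
   eor.l    d1,d0           ; d0 = high nibble of each byte
   lsl.l    #4,d1
   lsr.l    #4,d0
   or.l     d1,d0           ; adjacent nibbles swapped

   rol.w    #8,d0           ; swap the two bytes of the low word
   swap     d0              ; swap the two words
   rol.w    #8,d0           ; swap the two bytes of the other word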

mcgeezer

03 June 2018 00:18

Quote:

Originally Posted by Thorham (Post 1246123)

Thanks for pointing that out :great Some of the eors have to be ands :banghead Remind me to test code before posting it :o

Code:

   move.l   #$55555555,d1
   and.l    d0,d1
   eor.l    d1,d0
   add.l    d1,d1
   lsr.l    #1,d0
   or.l     d1,d0

   move.l   #$33333333,d1
   and.l    d0,d1
   eor.l    d1,d0
   lsl.l    #2,d1
   lsr.l    #2,d0
   or.l     d1,d0

   move.l   #$0f0f0f0f,d1
   and.l    d0,d1
   eor.l    d1,d0
   lsl.l    #4,d1
   lsr.l    #4,d0
   or.l     d1,d0

   rol.w    #8,d0
   swap     d0
   rol.w    #8,d0

Would it be ok to ask for a little explanation on what this code is doing?

I haven't debugged or tried it yet but a short explanation of source data/dest would be really useful.

Cheers,
Geezer

ross

03 June 2018 00:25

Quote:

Originally Posted by mcgeezer (Post 1246124)

Would it be ok to ask for a little explanation on what this code is doing?

I haven't debugged or tried it yet but a short explanation of source data/dest would be really useful.

Cheers,
Geezer

Well, this is based on swapping neighbouring groups of progressively larger size (first single bits, then pairs, then nibbles, then bytes, then words).
Basically it is like a SIMD approach, because no carry propagates between the groups.
On input, d0 contains 32 bits from a bitplane; on output, d0 holds the same bits in reversed order.
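
The staged swap described above can be sketched in C as follows. This is only a portable model of the idea, not a replacement for the assembly in this thread, and the function name `reverse32` is made up for the example. Each line swaps neighbouring groups of one size, and the masks keep the groups from bleeding into each other.

```c
#include <stdint.h>

/* Reverse the 32 bits of x by swapping neighbouring groups of growing size. */
static uint32_t reverse32(uint32_t x)
{
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);  /* single bits */
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);  /* bit pairs   */
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);  /* nibbles     */
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);  /* bytes       */
    x = (x >> 16) | (x << 16);                                /* words       */
    return x;
}
```

For example, `reverse32(0x00000001)` gives `0x80000000`, and applying the function twice returns the original value.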

Thorham

03 June 2018 00:25

Quote:

Originally Posted by mcgeezer (Post 1246124)

Would it be ok to ask for a little explanation on what this code is doing?

It simply swaps odd and even bits, bit pairs, nibbles, bytes and finally words.

Quote:

Originally Posted by mcgeezer (Post 1246124)

I haven't debugged or tried it yet but a short explanation of source data/dest would be really useful.

D0 is both source and destination.

ross

03 June 2018 00:31

Quote:

Originally Posted by Thorham (Post 1246127)

It simply swaps odd and even bits, bit pairs, nibbles, bytes and finally words.

D0 is both source and destination.

Same time :p

mcgeezer

03 June 2018 00:32

Quote:

Originally Posted by ross (Post 1246126)

Quote:

Originally Posted by Thorham (Post 1246127)

It simply swaps odd and even bits, bit pairs, nibbles, bytes and finally words.

D0 is both source and destination.

Thanks guys.

I like this because I can fit this in 68020 cache so it will go full speed.

Appreciate it.

ross

03 June 2018 01:02

Quote:

Originally Posted by mcgeezer (Post 1246130)

Appreciate it.

Mine is a didactic implementation (the algorithm is explicit).
Thorham's is a more optimized version based on a property of eor (I can't think of a better optimization).

At this point we need to test versus LUT, what will the winner be?

ross

03 June 2018 11:31

Some non-scientific and quick tests.
Pure code seems slightly faster than this lazy bfextu 8bit LUT implementation:

Code:

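_lut8flip:
        lea     _8lut(pc),a0        ; a0 = 256-byte bit-reversal table
        move.l  d0,d1               ; d1 = source longword
        bfextu  d1{8:8},d2          ; d2 = source bits 23..16
        move.b  (a0,d2.w),d0        ; look up the reversed byte
        ror.l   #8,d0               ; rotate it into the top byte
        bfextu  d1{16:8},d2         ; d2 = source bits 15..8
        move.b  (a0,d2.w),d0
        ror.l   #8,d0
        bfextu  d1{24:8},d2         ; d2 = source bits 7..0
        move.b  (a0,d2.w),d0
        ror.l   #8,d0
        bfextu  d1{0:8},d2          ; d2 = source bits 31..24
        move.b  (a0,d2.w),d0        ; reversed top byte lands in bits 7..0
        rts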

But the absolute winner is the 16bit LUT approach (as much as 50% faster).
Simple as:

Code:

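_lut16flip:
        lea     _16lut+65536,a0     ; a0 = middle of the 128KB table (d0.w is a signed index)
        move.w  (a0,d0.w*2),d0      ; low word = reversed low word
        swap    d0                  ; bring the original high word down
        move.w  (a0,d0.w*2),d0      ; reverse it as well
        rts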

The memory cost is debatable, BUT:
suppose you have a lot of big AGA sprites (64x64, 4 planes) and also a lot of tiles (32x32, 4/8 planes) for a grand total of 1MB of data, all to be flipped.
In that case it may be useful ;) (the table overhead becomes proportionally less and less significant, and CPU time is precious on an 020..)

But surely pure code, like Thorham suggested, is a great deal!

[EDIT, PS]
Why non-scientific?
I do not have a CD32, nor an Amiga for that matter :sad
So it's all based on WinUAE's emulation, which for the 020 is not perfectly cycle-exact (or maybe it is for this simple code? well, it's not that important..).
Also, I didn't feel like writing anything other than the bfextu version, and anyway the difference in speed between pure code and the 8bit LUT does not seem significant enough to justify using the LUT exclusively ;)
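
For comparison, a C sketch of the 16bit LUT idea, assuming a 65536-entry table of 16-bit words (128KB). The names `rev16`, `build_rev16` and `flip32_lut16` are invented for this example. Unlike the assembly above, which points a0 at the middle of the table because d0.w is a signed index, this sketch simply uses an unsigned index from the start of the table.

```c
#include <stdint.h>

/* rev16[v] holds v with its 16 bits in reverse order (128KB in total). */
static uint16_t rev16[65536];

/* Fill the table once at start-up. */
static void build_rev16(void)
{
    for (uint32_t v = 0; v < 65536u; v++) {
        uint16_t r = 0;
        for (int bit = 0; bit < 16; bit++)
            if (v & (1u << bit))
                r |= (uint16_t)(0x8000u >> bit);
        rev16[v] = r;
    }
}

/* Mirror 32 bits: reverse each 16-bit half through the table,
   then exchange the two halves. */
static uint32_t flip32_lut16(uint32_t x)
{
    return ((uint32_t)rev16[x & 0xFFFFu] << 16) | rev16[x >> 16];
}
```

Two table reads and a swap per longword is what makes this variant attractive when there is a lot of data to flip, at the price of the table itself.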

All times are GMT +2. The time now is 19:17.
