English Amiga Board


mcgeezer 31 May 2018 23:58

Fast tile flipping on CD32
 
Hi all,

Does anyone know if there are any system calls I can make on a CD32 platform that will take a graphical tile and flip it on the X, Y or both axes?

I have the CD32 developer documentation but for some reason I can't find anything specific to this Akiko chip and how to access it.

I want to take a 32x32 pixel tile and flip it.

Any help as always is really appreciated.

Geezer

ross 01 June 2018 08:06

Quote:

Originally Posted by mcgeezer (Post 1245833)
Does anyone know if there are any system calls I can make on a CD32 platform that will take a graphical tile and flip it on the X, Y or both axes?

Hi mcgeezer,
I don't know whether such a call exists, and in any case I imagine it would be slow.

Quote:

I want to take a 32x32 pixel tile and flip it.
The only fast way I can think of to flip around the Y axis is a lookup table (an 8-bit flip through a 256-byte table followed by reordering the bytes, or a 16-bit flip through a 64K-word table followed by a word swap); doing it with the blitter is much more convoluted and slow.
Flipping around the X axis is very simple with both the CPU and the blitter (modulo is your friend).
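
To make the 256-byte table idea concrete, here is a minimal C sketch rather than tuned 68020 code. It assumes one 32-pixel row of one bitplane is held in a 32-bit word with the leftmost pixel in the top bit; the names `rev8`, `build_rev8`, `mirror_row` and `flip_rows` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical table: rev8[i] is byte i with its eight bits mirrored. */
uint8_t rev8[256];

/* Fill the 256-byte table once at start-up. */
void build_rev8(void)
{
    for (int i = 0; i < 256; i++) {
        uint8_t r = 0;
        for (int b = 0; b < 8; b++)
            if (i & (1 << b))
                r |= (uint8_t)(0x80 >> b);
        rev8[i] = r;
    }
}

/* Flip around the Y axis: mirror the bits inside each byte through the
   table, and reverse the order of the four bytes. */
uint32_t mirror_row(uint32_t row)
{
    return ((uint32_t)rev8[row & 0xFF] << 24) |
           ((uint32_t)rev8[(row >> 8) & 0xFF] << 16) |
           ((uint32_t)rev8[(row >> 16) & 0xFF] << 8) |
            (uint32_t)rev8[row >> 24];
}

/* Flip around the X axis: copy the rows of one 32x32 bitplane in reverse
   order, which is what reading the source with a negative modulo achieves. */
void flip_rows(const uint32_t src[32], uint32_t dst[32])
{
    for (int y = 0; y < 32; y++)
        dst[y] = src[31 - y];
}

/* Usage check: mirroring a row twice must give the original back. */
void check_mirror(void)
{
    build_rev8();
    assert(mirror_row(0x80000000u) == 0x00000001u);
    assert(mirror_row(mirror_row(0x12345678u)) == 0x12345678u);
}
```

A composite flip is just both steps applied to every bitplane of the tile; the order does not matter.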

The alternative is to work completely in chunky, forgetting bitplanes and working only in bytes, and then do the conversion with Akiko (which I have no idea how to program) or with one of the many chunky-to-planar routines available.

Obviously the fastest thing is to use double the memory (or triple it if you also want the composite flips) and keep separate copies of the same tile :D

hooverphonique 01 June 2018 09:42

Quote:

Originally Posted by mcgeezer (Post 1245833)
Hi all,

Does anyone know if there are any system calls I can make on a CD32 platform that will take a graphical tile and flip it on the X, Y or both axes?

I have the CD32 developer documentation but for some reason I can't find anything specific to this Akiko chip and how to access it.

I want to take a 32x32 pixel tile and flip it.

Any help as always is really appreciated.

Geezer

Akiko only does transformation from chunky to planar (wrt data transformation).

Thorham 01 June 2018 16:31

Quote:

Originally Posted by ross (Post 1245867)
The only fast way I can think of to flip around the Y axis is a lookup table (an 8-bit flip through a 256-byte table followed by reordering the bytes, or a 16-bit flip through a 64K-word table followed by a word swap); doing it with the blitter is much more convoluted and slow.

Perhaps this:

This code is wrong; see my post below for the corrected version.

Code:

  move.l  #$55555555,d1
  eor.l    d0,d1
  eor.l    d1,d0
  add.l    d1,d1
  lsr.l    #1,d0
  or.l    d1,d0
 
  move.l  #$33333333,d1
  eor.l    d0,d1
  eor.l    d1,d0
  lsl.l    #2,d1
  lsr.l    #2,d0
  or.l    d1,d0

  move.l  #$0f0f0f0f,d1
  eor.l    d0,d1
  eor.l    d1,d0
  lsl.l    #4,d1
  lsr.l    #4,d0
  or.l    d1,d0

  rol.w    #8,d0
  swap    d0
  rol.w    #8,d0


mcgeezer 01 June 2018 16:53

Thanks for the suggestions guys, I guess I'm looking at doing it with the CPU.

There are a couple of reasons I can't spend memory on this: one is capacity, the other is the added complication.

On the plus side I only need to do this flip when needed depending on what is in the Side Arms tile map. The other plus is that the scrolling only runs at 25 FPS so I should have plenty of time.

I'll write the scroll routine over the next week or so. The challenge is getting all of the palettes to mesh together during scrolling without having to alter the arcade ROM tile map - ugh.

meynaf 01 June 2018 18:07

If there isn't enough memory for keeping mirrored copies of the same tile, then you might use some kind of graphical cache holding the last few ones that were used.

If you want to do it purely dynamically, then the 256-byte table seems the best compromise.
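
The cache idea could look roughly like the following C sketch. Everything here is an assumption made for illustration: the slot count, the direct-mapped replacement policy, the single bitplane per tile, and the names `CacheSlot`, `cache_init`, `get_mirrored` and `mirror_row32`. A real version would hold every bitplane of the tile and probably a smarter eviction rule.

```c
#include <assert.h>
#include <stdint.h>

#define TILE_ROWS   32
#define CACHE_SLOTS 8            /* assumed size; tune to the memory you can spare */

typedef struct {
    int      tile_id;            /* -1 marks an empty slot */
    uint32_t rows[TILE_ROWS];    /* one mirrored bitplane of the tile */
} CacheSlot;

CacheSlot cache[CACHE_SLOTS];
unsigned  flips_done;            /* counts real flips, to show the cache working */

void cache_init(void)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        cache[i].tile_id = -1;
    flips_done = 0;
}

/* Mirror the 32 bits of one row (swap bits, pairs, nibbles, bytes, words). */
uint32_t mirror_row32(uint32_t x)
{
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);
    return (x >> 16) | (x << 16);
}

/* Return the mirrored rows of a tile, flipping only on a cache miss. */
const uint32_t *get_mirrored(int tile_id, const uint32_t src[TILE_ROWS])
{
    CacheSlot *s = &cache[(unsigned)tile_id % CACHE_SLOTS];
    if (s->tile_id != tile_id) {
        for (int y = 0; y < TILE_ROWS; y++)
            s->rows[y] = mirror_row32(src[y]);
        s->tile_id = tile_id;
        flips_done++;
    }
    return s->rows;
}
```

With a scrolling tile map the same few tiles tend to repeat along an edge, so even a handful of slots should avoid most of the flips; how many slots pay off depends on the map.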

saimon69 01 June 2018 21:08

Good ol' Side Arms! So underrated, but also with some playability problems. I would like to see it ported decently and improved over its original incarnation...

mcgeezer 01 June 2018 21:41

Quote:

Originally Posted by saimon69 (Post 1245989)
Good ol' Side Arms! So underrated, but also with some playability problems. I would like to see it ported decently and improved over its original incarnation...

I'm just running a feasibility study at the moment, which is likely to fail. There are simply too many colours in the tile sets and sprites to do the game justice.

However I will get a nice 8 way scrolling routine out of it supporting 16 or 32 pixel tile sets that I could use on other projects.

saimon69 01 June 2018 23:14

Powder used a scrolling technique similar to Side Arms and some tricks to run a lot of stuff with 16 colors. I have the source code if you want to take a look.

mcgeezer 01 June 2018 23:20

Quote:

Originally Posted by saimon69 (Post 1246000)
Powder used a scrolling technique similar to Side Arms and some tricks to run a lot of stuff with 16 colors. I have the source code if you want to take a look.

Thanks for the offer, but I need to write my own code as I know what I need to implement.

Thorham 02 June 2018 21:58

Quote:

Originally Posted by meynaf (Post 1245961)
If you want to do it purely dynamically, then the 256-byte table seems the best compromise.

I somehow doubt that that's going to be faster than doing it in code. The 4x indexed addressing mode alone seems slower than the code I posted.

ross 02 June 2018 23:07

Quote:

Originally Posted by Thorham (Post 1246108)
I somehow doubt that that's going to be faster than doing it in code. The 4x indexed addressing mode alone seems slower than the code I posted.

Hi Thorham, your routine is not working.

Here is a working version
(I haven't thought much about whether it can be optimized):
Code:

    move.l  d0,d1
    move.l  #$55555555,d2
    lsr.l  #1,d0
    add.l  d1,d1
    and.l  d2,d0
    add.l  d2,d2
    and.l  d2,d1
    or.l    d1,d0
   
    move.l  d0,d1
    move.l  #$33333333,d2
    lsr.l  #2,d0
    lsl.l  #2,d1
    and.l  d2,d0
    lsl.l  #2,d2
    and.l  d2,d1
    or.l    d1,d0

    move.l  d0,d1
    move.l  #$0f0f0f0f,d2
    lsr.l  #4,d0
    lsl.l  #4,d1
    and.l  d2,d0
    lsl.l  #4,d2
    and.l  d2,d1
    or.l    d1,d0
   
    rol.w  #8,d0
    swap    d0
    rol.w  #8,d0

I have serious doubts that it is faster than a LUT version, especially on a CD32 (a chip-memory-only 020).

:great

Thorham 03 June 2018 00:09

Quote:

Originally Posted by ross (Post 1246116)
Hi Thorham, your routine is not working.

Thanks for pointing that out :great Some of the eors have to be ands :banghead Remind me to test code before posting it :o
Code:

  move.l  #$55555555,d1
  and.l    d0,d1
  eor.l    d1,d0
  add.l    d1,d1
  lsr.l    #1,d0
  or.l    d1,d0

  move.l  #$33333333,d1
  and.l    d0,d1
  eor.l    d1,d0
  lsl.l    #2,d1
  lsr.l    #2,d0
  or.l    d1,d0

  move.l  #$0f0f0f0f,d1
  and.l    d0,d1
  eor.l    d1,d0
  lsl.l    #4,d1
  lsr.l    #4,d0
  or.l    d1,d0

  rol.w    #8,d0
  swap    d0
  rol.w    #8,d0


mcgeezer 03 June 2018 00:18

Quote:

Originally Posted by Thorham (Post 1246123)
Thanks for pointing that out :great Some of the eors have to be ands :banghead Remind me to test code before posting it :o

Would it be ok to ask for a little explanation on what this code is doing?

I haven't debugged or tried it yet but a short explanation of source data/dest would be really useful.

Cheers,
Geezer

ross 03 June 2018 00:25

Quote:

Originally Posted by mcgeezer (Post 1246124)
Would it be ok to ask for a little explanation on what this code is doing?

I haven't debugged or tried it yet but a short explanation of source data/dest would be really useful.

Cheers,
Geezer

Well, it is based on progressively swapping groups of increasing size (first bits, then pairs, then nibbles, then bytes, then words).
Basically it is like a SIMD approach, because no carry propagates between the groups.
On input, D0 contains the 32 bits from a bitplane; on output, D0 holds the same bits flipped.

Thorham 03 June 2018 00:25

Quote:

Originally Posted by mcgeezer (Post 1246124)
Would it be ok to ask for a little explanation on what this code is doing?

It simply swaps odd and even bits, bit pairs, nibbles, bytes and finally words.

Quote:

Originally Posted by mcgeezer (Post 1246124)
I haven't debugged or tried it yet but a short explanation of source data/dest would be really useful.

D0 is both source and destination.
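
For readers who find the register shuffling hard to follow, the same swap stages can be written as a C sketch. It follows the and/eor variant of the routine step by step, with the variables named after the registers; the function name `mirror32` is made up here, and the final expression stands in for the `rol.w #8` / `swap` / `rol.w #8` sequence.

```c
#include <assert.h>
#include <stdint.h>

/* Mirror the 32 bits of one bitplane row by swapping ever larger groups. */
uint32_t mirror32(uint32_t d0)
{
    uint32_t d1;

    d1 = d0 & 0x55555555u;          /* and.l: even bits go to d1        */
    d0 ^= d1;                       /* eor.l: only odd bits stay in d0  */
    d0 = (d1 << 1) | (d0 >> 1);     /* swap neighbouring bits           */

    d1 = d0 & 0x33333333u;          /* low pair of every nibble         */
    d0 ^= d1;                       /* high pair of every nibble        */
    d0 = (d1 << 2) | (d0 >> 2);     /* swap bit pairs                   */

    d1 = d0 & 0x0F0F0F0Fu;          /* low nibble of every byte         */
    d0 ^= d1;                       /* high nibble of every byte        */
    d0 = (d1 << 4) | (d0 >> 4);     /* swap nibbles                     */

    /* Reverse the order of the four bytes (byte swap, then word swap). */
    return (d0 << 24) | ((d0 & 0x0000FF00u) << 8) |
           ((d0 >> 8) & 0x0000FF00u) | (d0 >> 24);
}
```

Tracing the input $00000001 through it: the bit stage gives $00000002, the pair stage $00000008, the nibble stage $00000080, and the byte reversal $80000000, so the rightmost pixel ends up leftmost.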

ross 03 June 2018 00:31

Quote:

Originally Posted by Thorham (Post 1246127)
It simply swaps odd and even bits, bit pairs, nibbles, bytes and finally words.

D0 is both source and destination.

Same time :p

mcgeezer 03 June 2018 00:32

Quote:

Originally Posted by ross (Post 1246126)
Well, it is based on progressively swapping groups of increasing size (first bits, then pairs, then nibbles, then bytes, then words).
Basically it is like a SIMD approach, because no carry propagates between the groups.
On input, D0 contains the 32 bits from a bitplane; on output, D0 holds the same bits flipped.

Quote:

Originally Posted by Thorham (Post 1246127)
It simply swaps odd and even bits, bit pairs, nibbles, bytes and finally words.

D0 is both source and destination.

Thanks guys.

I like this because it fits in the 68020 instruction cache, so it will run at full speed.

Appreciate it.

ross 03 June 2018 01:02

Quote:

Originally Posted by mcgeezer (Post 1246130)
Appreciate it.

Mine is a didactic implementation (the algorithm is explicit).
Thorham's is a more optimized version based on a property of eor (I can't think of a better optimization).

At this point we need to test versus LUT, what will the winner be?

ross 03 June 2018 11:31

Some non-scientific and quick tests.
Pure code seems slightly faster than this lazy bfextu 8bit LUT implementation:
Code:

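_lut8flip:
        lea     _8lut(pc),a0        ; a0 = 256-byte bit-reverse table
        move.l  d0,d1               ; keep the source in d1
        bfextu  d1{8:8},d2          ; source bits 23-16
        move.b  (a0,d2.w),d0
        ror.l   #8,d0
        bfextu  d1{16:8},d2         ; source bits 15-8
        move.b  (a0,d2.w),d0
        ror.l   #8,d0
        bfextu  d1{24:8},d2         ; source bits 7-0
        move.b  (a0,d2.w),d0
        ror.l   #8,d0               ; reversed low byte is now the top byte
        bfextu  d1{0:8},d2          ; source bits 31-24
        move.b  (a0,d2.w),d0        ; reversed top byte ends up lowest
        rts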

But the absolute winner is the 16bit LUT approach (even 50% faster).
Simple as:
Code:

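_lut16flip:
        lea     _16lut+65536,a0     ; point at the middle: d0.w is sign-extended
        move.w  (a0,d0.w*2),d0      ; low word -> its mirrored value
        swap    d0                  ; bring the source high word down
        move.w  (a0,d0.w*2),d0      ; high word -> its mirrored value
        rts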

The memory cost is debatable, BUT:
suppose you have a lot of big AGA sprites (64x64, 4 planes) and also a lot of tiles (32x32, 4/8 planes), for a total of 1MB of data, all to be flipped.
In that case it may be useful ;) (the waste becomes proportionally less and less significant, and CPU time is precious on an 020.)

But pure code, as Thorham suggested, is certainly a great deal!
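
The 16-bit table approach can be sketched in C as follows, as an illustration only. The names `rev16`, `build_rev16` and `mirror32_lut16` are made up, and the table is indexed with an unsigned value, so it is laid out in plain order from 0 to 65535.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical table: rev16[i] is word i with its 16 bits mirrored (128 KB). */
uint16_t rev16[65536];

/* Fill the 64K-word table once at start-up. */
void build_rev16(void)
{
    for (uint32_t i = 0; i < 65536; i++) {
        uint16_t r = 0;
        for (int b = 0; b < 16; b++)
            if (i & (1u << b))
                r |= (uint16_t)(0x8000u >> b);
        rev16[i] = r;
    }
}

/* Mirror 32 bits with two lookups: each half is mirrored through the
   table and the two halves trade places. */
uint32_t mirror32_lut16(uint32_t x)
{
    return ((uint32_t)rev16[x & 0xFFFFu] << 16) | rev16[x >> 16];
}
```

One difference from the assembler routine is worth keeping in mind: there the index `d0.w` is sign-extended, which is why the base register points 65536 bytes into the table. As far as I can tell that means the assembler table has to store the entries for $8000-$FFFF in its first half and those for $0000-$7FFF in its second half, unlike the plain layout used in this sketch.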


[EDIT, PS]
Why non-scientific?
I do not have a CD32, nor an Amiga for that matter :sad
So it's all based on WinUAE emulation, which is not cycle-exact (CE) for the 020 (or maybe it is for code this simple? Well, it's not that important).
Also, I didn't feel like writing anything other than the bfextu version, and in any case the speed difference between pure code and the 8-bit LUT does not seem significant enough to justify using the LUT exclusively ;)

