Fast tile flipping on CD32

mcgeezer · 31 May 2018, 23:58

Hi all,

Does anyone know if there are any system calls I can make on a CD32 platform that will take a graphical tile and flip it on the X axis, the Y axis, or both?

I have the CD32 developer documentation but for some reason I can't find anything specific to this Akiko chip and how to access it.

I want to take a 32x32 pixel tile and flip it.

Any help as always is really appreciated.

Geezer

ross · 01 June 2018, 08:06

Quote:

Originally Posted by mcgeezer

Does anyone know if there are any system calls I can make on a CD32 platform that will take a graphical tile and flip it on the X, Y or both axis?

Hi mcgeezer,
I do not know if it exists and in any case I imagine it would be slow..

Quote:

I want to take a 32x32 pixel tile and flip it.

The only fast thing that comes to mind for a flip around the Y axis is a lookup table (an 8-bit flip through a 256-byte table and then shuffle positions, or a 16-bit flip through a 64K-word table and then swap), or something much more convoluted and slow with the blitter.
A flip around the X axis is very simple with either the CPU or the blitter (modulo is your friend).

The alternative is to work completely in chunky, forgetting bitplanes and working only in bytes and then make the conversion with Akiko (that I have no idea how to program) or using one of the many chunky to planar routines available.

Obviously the fastest option is to spend double the memory (or triple, if you also want the composite flips) on pre-flipped copies of the same tile.
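To make the 8-bit table idea concrete, here is a minimal C sketch meant only for checking the logic on a host machine before porting it to 68k. The names (`rev8`, `init_rev8`, `mirror_row32`, `flip_rows`) are invented for this sketch, and it assumes one bitplane of a 32x32 tile stored as 32 longwords, one per row.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical 256-byte table: rev8[b] is b with its 8 bits in reverse order. */
static uint8_t rev8[256];

static void init_rev8(void)
{
    for (int i = 0; i < 256; i++) {
        uint8_t r = 0;
        for (int bit = 0; bit < 8; bit++)
            if (i & (1 << bit))
                r |= (uint8_t)(0x80 >> bit);
        rev8[i] = r;
    }
}

/* Flip around the Y axis (left-right mirror) of one 32-pixel row:
   reverse the bits inside each byte, then "shuffle positions" by
   reversing the byte order. */
static uint32_t mirror_row32(uint32_t row)
{
    return ((uint32_t)rev8[row & 0xff] << 24) |
           ((uint32_t)rev8[(row >> 8) & 0xff] << 16) |
           ((uint32_t)rev8[(row >> 16) & 0xff] << 8) |
            (uint32_t)rev8[row >> 24];
}

/* Flip around the X axis (top-bottom): copy the rows in reverse order,
   the CPU equivalent of walking the source with a negative modulo. */
static void flip_rows(const uint32_t *src, uint32_t *dst, size_t rows)
{
    for (size_t y = 0; y < rows; y++)
        dst[y] = src[rows - 1 - y];
}
```

Call `init_rev8()` once at startup; after that a row costs four table reads. Applying both `mirror_row32` and `flip_rows` gives the composite flip.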

hooverphonique · 01 June 2018, 09:42

Quote:

Originally Posted by mcgeezer

Hi all,

Does anyone know if there are any system calls I can make on a CD32 platform that will take a graphical tile and flip it on the X, Y or both axis?

I have the CD32 developer documentation but for some reason I can't find anything specific to this Akiko chip and how to access it.

I want to take a 32x32 pixel tile and flip it.

Any help as always is really appreciated.

Geezer

Akiko only does transformation from chunky to planar (wrt data transformation).

Thorham · 01 June 2018, 16:31

Quote:

Originally Posted by ross

The only fast thing that comes to mind for a flip around the Y axis is a lookup table (an 8-bit flip through a 256-byte table and then shuffle positions, or a 16-bit flip through a 64K-word table and then swap), or something much more convoluted and slow with the blitter.

Perhaps this:

This code is wrong, see my post below for corrected version.

Code:

   move.l   #$55555555,d1
   eor.l    d0,d1
   eor.l    d1,d0
   add.l    d1,d1
   lsr.l    #1,d0
   or.l     d1,d0
   
   move.l   #$33333333,d1
   eor.l    d0,d1
   eor.l    d1,d0
   lsl.l    #2,d1
   lsr.l    #2,d0
   or.l     d1,d0

   move.l   #$0f0f0f0f,d1
   eor.l    d0,d1
   eor.l    d1,d0
   lsl.l    #4,d1
   lsr.l    #4,d0
   or.l     d1,d0

   rol.w    #8,d0
   swap     d0
   rol.w    #8,d0

mcgeezer · 01 June 2018, 16:53

Thanks for the suggestions guys, I guess I'm looking at doing it with the CPU.

There are a couple of reasons I can't spend memory on this: one is capacity, and two is complications.

On the plus side I only need to do this flip when needed depending on what is in the Side Arms tile map. The other plus is that the scrolling only runs at 25 FPS so I should have plenty of time.

I'll write the scroll routine over the next week or so. The challenge is getting all of the palettes to mesh together during scrolling without having to alter the arcade ROM tile map - ugh.

meynaf · 01 June 2018, 18:07

If there isn't enough memory for keeping mirrored copies of the same tile, then you might use some kind of graphical cache holding the last few ones that were used.

If you want to do that purely dynamically, then the 256-byte table seems the best compromise.
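The cache idea could look roughly like the C sketch below. It is a simplification under stated assumptions: a direct-mapped cache keyed by tile number rather than a true "last few used" list, a single bitplane per tile, and made-up names (`FlipSlot`, `flip_cache`, `get_mirrored`, `CACHE_SLOTS`). The slow `mirror_row32` here is only a stand-in for whichever fast routine ends up being used.

```c
#include <stdint.h>

#define TILE_ROWS   32      /* 32x32 tile, one longword per row (one bitplane) */
#define CACHE_SLOTS 8       /* assumed size; tune to the memory budget */

typedef struct {
    int      tile_id;           /* -1 marks an empty slot */
    uint32_t rows[TILE_ROWS];   /* mirrored copy of the tile */
} FlipSlot;

static FlipSlot flip_cache[CACHE_SLOTS];
static unsigned flip_misses;    /* counts how often a real flip was needed */

static void flip_cache_init(void)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        flip_cache[i].tile_id = -1;
    flip_misses = 0;
}

/* Plain bit-by-bit reversal; placeholder for the fast flip routine. */
static uint32_t mirror_row32(uint32_t v)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (v & 1);
        v >>= 1;
    }
    return r;
}

/* Return the mirrored rows of tile `id`, flipping only on a cache miss. */
static const uint32_t *get_mirrored(int id, const uint32_t *src)
{
    FlipSlot *slot = &flip_cache[(unsigned)id % CACHE_SLOTS];
    if (slot->tile_id != id) {
        for (int y = 0; y < TILE_ROWS; y++)
            slot->rows[y] = mirror_row32(src[y]);
        slot->tile_id = id;
        flip_misses++;
    }
    return slot->rows;
}
```

With 8 slots of 128 bytes each this costs 1KB per bitplane, and tiles that stay on screen for several frames are flipped only once.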

saimon69 · 01 June 2018, 21:08

The good'ol' Side Arms! So underrated but also with some playability problems, would like to see it ported decently and improved from its original incarnation...

mcgeezer · 01 June 2018, 21:41

Quote:

Originally Posted by saimon69

The good'ol' Side Arms! So underrated but also with some playability problems, would like to see it ported decently and improved from its original incarnation...

I'm just running a feasibility check at the moment, which is likely to fail. There are simply too many colours in the tile sets and sprites to do the game justice.

However I will get a nice 8 way scrolling routine out of it supporting 16 or 32 pixel tile sets that I could use on other projects.

saimon69 · 01 June 2018, 23:14

Powder was using a scrolling technique similar to Side Arms and some tricks to run a lot of stuff with 16 colors. I have the source code if you want to give it a look.

mcgeezer · 01 June 2018, 23:20

Quote:

Originally Posted by saimon69

Powder was using a scrolling technique similar to Side Arms and some tricks to run a lot of stuff with 16 colors. I have the source code if you want to give it a look.

Thanks for the offer, I need to write my own code as I know what I need to implement.

Thorham · 02 June 2018, 21:58

Quote:

Originally Posted by meynaf

If you want to do that purely dynamically, then the 256-byte table seems the best compromise.

I somehow doubt that that's going to be faster than doing it in code. The 4x indexed addressing mode alone seems slower than the code I posted.

ross · 02 June 2018, 23:07

Quote:

Originally Posted by Thorham

I somehow doubt that that's going to be faster than doing it in code. The 4x indexed addressing mode alone seems slower than the code I posted.

Hi Thorham, your routine is not working.

This is a working version:
(I haven't thought much about whether it can be optimized.)

Code:

    move.l  d0,d1
    move.l  #$55555555,d2
    lsr.l   #1,d0
    add.l   d1,d1
    and.l   d2,d0
    add.l   d2,d2
    and.l   d2,d1
    or.l    d1,d0
    
    move.l  d0,d1
    move.l  #$33333333,d2
    lsr.l   #2,d0
    lsl.l   #2,d1
    and.l   d2,d0
    lsl.l   #2,d2
    and.l   d2,d1
    or.l    d1,d0

    move.l  d0,d1
    move.l  #$0f0f0f0f,d2
    lsr.l   #4,d0
    lsl.l   #4,d1
    and.l   d2,d0
    lsl.l   #4,d2
    and.l   d2,d1
    or.l    d1,d0
    
    rol.w   #8,d0
    swap    d0
    rol.w   #8,d0

I have serious doubts that it would be faster than a LUT version, especially when targeting a CD32 (a chipmem-only 020).

Thorham · 03 June 2018, 00:09

Quote:

Originally Posted by ross

Hi Thorham, your routine is not working.

Thanks for pointing that out

Some of the eors have to be ands

Remind me to test code before posting it

Code:

   move.l   #$55555555,d1
   and.l    d0,d1
   eor.l    d1,d0
   add.l    d1,d1
   lsr.l    #1,d0
   or.l     d1,d0

   move.l   #$33333333,d1
   and.l    d0,d1
   eor.l    d1,d0
   lsl.l    #2,d1
   lsr.l    #2,d0
   or.l     d1,d0

   move.l   #$0f0f0f0f,d1
   and.l    d0,d1
   eor.l    d1,d0
   lsl.l    #4,d1
   lsr.l    #4,d0
   or.l     d1,d0

   rol.w    #8,d0
   swap     d0
   rol.w    #8,d0

mcgeezer · 03 June 2018, 00:18

Quote:

Originally Posted by Thorham

Thanks for pointing that out

Some of the eors have to be ands

Remind me to test code before posting it

Code:

   move.l   #$55555555,d1
   and.l    d0,d1
   eor.l    d1,d0
   add.l    d1,d1
   lsr.l    #1,d0
   or.l     d1,d0

   move.l   #$33333333,d1
   and.l    d0,d1
   eor.l    d1,d0
   lsl.l    #2,d1
   lsr.l    #2,d0
   or.l     d1,d0

   move.l   #$0f0f0f0f,d1
   and.l    d0,d1
   eor.l    d1,d0
   lsl.l    #4,d1
   lsr.l    #4,d0
   or.l     d1,d0

   rol.w    #8,d0
   swap     d0
   rol.w    #8,d0

Would it be ok to ask for a little explanation on what this code is doing?

I haven't debugged or tried it yet but a short explanation of source data/dest would be really useful.

Cheers,
Geezer

ross · 03 June 2018, 00:25

Quote:

Originally Posted by mcgeezer

Would it be ok to ask for a little explanation on what this code is doing?

I haven't debugged or tried it yet but a short explanation of source data/dest would be really useful.

Cheers,
Geezer

Well, this is based on swapping groups of progressively larger size (first single bits, then pairs, then nibbles, then bytes, then words).
Basically it is like a SIMD approach, because there is no carry between operations.
On input, D0 contains 32 bits from a bitplane; on output, D0 holds the same bits in reversed order.
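For readers who find it easier to follow in a high-level language, the same progressive swap can be written as the C sketch below. It mirrors the structure of the posted 68k routines rather than reproducing either one instruction by instruction; `reverse_bits32` is a name chosen for this sketch.

```c
#include <stdint.h>

/* Reverse the order of all 32 bits by swapping ever larger groups. */
static uint32_t reverse_bits32(uint32_t v)
{
    v = ((v >> 1) & 0x55555555u) | ((v & 0x55555555u) << 1);  /* adjacent bits */
    v = ((v >> 2) & 0x33333333u) | ((v & 0x33333333u) << 2);  /* bit pairs */
    v = ((v >> 4) & 0x0f0f0f0fu) | ((v & 0x0f0f0f0fu) << 4);  /* nibbles */
    v = ((v >> 8) & 0x00ff00ffu) | ((v & 0x00ff00ffu) << 8);  /* bytes within each word */
    v = (v >> 16) | (v << 16);                                /* the two words */
    return v;
}
```

Each line touches every group in the longword at once, which is the SIMD-like property: the masks keep neighbouring groups from bleeding into each other. Applying the function twice returns the original value.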

Thorham · 03 June 2018, 00:25

Quote:

Originally Posted by mcgeezer

Would it be ok to ask for a little explanation on what this code is doing?

It simply swaps odd and even bits, bit pairs, nibbles, bytes and finally words.

Quote:

Originally Posted by mcgeezer

I haven't debugged or tried it yet but a short explanation of source data/dest would be really useful.

D0 is both source and destination.

ross · 03 June 2018, 00:31

Quote:

Originally Posted by Thorham

It simply swaps odd and even bits, bit pairs, nibbles, bytes and finally words.

D0 is both source and destination.

Same time

mcgeezer · 03 June 2018, 00:32

Quote:

Originally Posted by ross

Well, this is based on swapping groups of progressively larger size (first single bits, then pairs, then nibbles, then bytes, then words).
Basically it is like a SIMD approach, because there is no carry between operations.
On input, D0 contains 32 bits from a bitplane; on output, D0 holds the same bits in reversed order.

Quote:

Originally Posted by Thorham

It simply swaps odd and even bits, bit pairs, nibbles, bytes and finally words.

D0 is both source and destination.

Thanks guys.

I like this because I can fit this in 68020 cache so it will go full speed.

Appreciate it.

ross · 03 June 2018, 01:02

Quote:

Originally Posted by mcgeezer

Appreciate it.

Mine is a didactic implementation (the algorithm is explicit).
Thorham's is a more optimized version based on a property of eor (I can't think of a better optimization).

At this point we need to test versus LUT, what will the winner be?

ross · 03 June 2018, 11:31

Some non-scientific and quick tests.
Pure code seems slightly faster than this lazy bfextu 8bit LUT implementation:

Code:

_lut8flip:
	lea	_8lut(pc),a0
	move.l	d0,d1
	bfextu  d1{8:8},d2
	move.b	(a0,d2.w),d0
	ror.l	#8,d0
	bfextu  d1{16:8},d2
	move.b	(a0,d2.w),d0
	ror.l	#8,d0
	bfextu  d1{24:8},d2
	move.b	(a0,d2.w),d0
	ror.l	#8,d0
	bfextu  d1{0:8},d2
	move.b	(a0,d2.w),d0
	rts

But the absolute winner is the 16bit LUT approach (even 50% faster).
Simple as:

Code:

_lut16flip:
	lea	_16lut+65536,a0
	move.w	(a0,d0.w*2),d0
	swap	d0
	move.w	(a0,d0.w*2),d0
	rts

The memory cost is debatable, BUT:
suppose you have a lot of big AGA sprites (64x64, 4 planes) and also a lot of tiles (32x32, 4/8 planes), for a grand total of 1MB of data, all to be flipped.
In that case it may be useful.

(The waste becomes proportionally less and less significant, and CPU time is precious on an 020.)

But pure code, like Thorham suggested, is surely a great deal!

[EDIT, PS]
Why non-scientific?
I do not have a CD32, nor an Amiga for that matter.

So it's all based on WinUAE emulation, which for the 020 is not perfectly cycle-exact (or is it, for this simple code? well, it's not that important..).
Also, I had no desire to write anything other than the bfextu version, and anyway the speed difference between pure code and the 8-bit LUT does not seem significant enough to justify using the LUT exclusively.
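The thread does not show how `_16lut` is filled in. One way to build it, sketched in C under the assumption that the table is laid out for a sign-extended word index (which is why the 68k code loads `_16lut+65536` and indexes with `d0.w*2`), might be as follows. The names `lut16`, `mid`, `rev16`, `init_lut16` and `lut16_flip` are invented for this sketch.

```c
#include <stdint.h>

/* 64K words = 128KB. `mid` points at the middle of the table, playing
   the role of a0 in the 68k routine, so it can be indexed from
   -32768 to 32767. */
static uint16_t lut16[65536];
static uint16_t *const mid = lut16 + 32768;

/* Slow bit-by-bit reversal of a 16-bit word, used only to fill the table. */
static uint16_t rev16(uint16_t v)
{
    uint16_t r = 0;
    for (int i = 0; i < 16; i++) {
        r = (uint16_t)((r << 1) | (v & 1));
        v >>= 1;
    }
    return r;
}

static void init_lut16(void)
{
    for (int32_t i = -32768; i <= 32767; i++)
        mid[i] = rev16((uint16_t)i);
}

/* Same steps as the two table reads plus swap: reverse each word,
   then exchange the two words. The int16_t casts assume the usual
   two's-complement wraparound. */
static uint32_t lut16_flip(uint32_t v)
{
    uint32_t lo = mid[(int16_t)(v & 0xffff)];
    uint32_t hi = mid[(int16_t)(v >> 16)];
    return (lo << 16) | hi;
}
```

On the real machine the table would more likely be precomputed and included as binary data, but a generator like this is handy for producing that file.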
