Fast tile flipping on CD32 - Page 2

Thorham · 03 June 2018, 16:06

Quote:

Originally Posted by ross

But surely pure code, like Thorham suggested, is a great deal!

I sure hope so

How about these (NOT tested!!):

Code:

   clr.l    d1
   move.b   d0,d1
   move.b   (a0,d1.w),d0
   rol.w    #8,d0
   move.b   d0,d1
   move.b   (a0,d1.w),d0
   swap     d0
   move.b   d0,d1
   move.b   (a0,d1.w),d0
   rol.w    #8,d0
   move.b   d0,d1
   move.b   (a0,d1.w),d0

Code:

   bfextu   d0{0:8},d2
   move.b   (a0,d2.w),d1
   lsl.l    #8,d1
   bfextu   d0{8:8},d2
   move.b   (a0,d2.w),d1
   lsl.l    #8,d1
   bfextu   d0{16:8},d2
   move.b   (a0,d2.w),d1
   lsl.l    #8,d1
   bfextu   d0{24:8},d2
   move.b   (a0,d2.w),d1
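
For checking routines like these off-target, here is a minimal C reference sketch of what they are meant to compute: every byte of the longword is mirrored through a 256-byte table and the byte order is reversed as well. The names `rev8`, `build_rev8` and `flip32` are made up for this illustration and are not from the thread; the table stands in for the one a0 is assumed to point to.

```c
#include <stdint.h>

/* Hypothetical 256-byte table: rev8[i] is i with its 8 bits mirrored. */
uint8_t rev8[256];

/* Fill the table once before any flipping is done. */
void build_rev8(void)
{
    for (unsigned i = 0; i < 256; i++) {
        unsigned r = 0;
        for (unsigned bit = 0; bit < 8; bit++)
            if (i & (1u << bit))
                r |= 0x80u >> bit;   /* bit 0 goes to bit 7, and so on */
        rev8[i] = (uint8_t)r;
    }
}

/* Mirror a 32-bit longword: mirrored low byte ends up as the high byte. */
uint32_t flip32(uint32_t v)
{
    return ((uint32_t)rev8[v & 0xFF] << 24) |
           ((uint32_t)rev8[(v >> 8) & 0xFF] << 16) |
           ((uint32_t)rev8[(v >> 16) & 0xFF] << 8) |
            (uint32_t)rev8[(v >> 24) & 0xFF];
}
```

Comparing the output of an assembly routine against `flip32` for a handful of inputs should be enough to catch a wrong byte order.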

ross · 03 June 2018, 17:03

Quote:

Originally Posted by Thorham

How about these (NOT tested!!):

They work, but they are in the same league as the pure code (the same speed or slightly slower).

(Actually, the second one needs a small change, but the concept is right.)

However, the difference is too small to tell whether there is any real gain (tests on a real machine are absolutely needed).

Thorham · 03 June 2018, 17:23

Quote:

Originally Posted by ross

Actually, the second one needs a small change, but the concept is right.

What exactly? I never really use those bit field instructions.

Here's another one. Uses four 1kb tables:

Code:

   clr.l    d2                  ; d2 = zero-extended index register
   move.b   d0,d2
   move.l   (a0,d2.w*4),d1
   lsr.w    #8,d0
   or.l     (a1,d0.w*4),d1
   swap     d0
   move.b   d0,d2               ; index through d2 so the partial result in d1 survives
   or.l     (a2,d2.w*4),d1
   lsr.w    #8,d0
   or.l     (a3,d0.w*4),d1

This will most certainly be faster than all code with fastmem.
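
The post does not show how the four tables are filled. One plausible layout, sketched in C under that assumption, is that each table holds the mirrored byte already shifted to its final position, so the lookups only need to be OR-ed together. The names `tab`, `build_tables` and `flip32_4tab` are hypothetical; `tab[0]` to `tab[3]` play the role of the tables at a0 to a3.

```c
#include <stdint.h>

/* Assumed layout: four tables of 256 longwords (1 KB each). */
uint32_t tab[4][256];

/* Mirror the 8 bits of b. */
uint32_t rev8_bits(uint32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 8; i++)
        r |= ((b >> i) & 1u) << (7 - i);
    return r;
}

/* Each table stores the mirrored byte pre-shifted to its destination. */
void build_tables(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t r = rev8_bits(i);
        tab[0][i] = r << 24;   /* source bits 7..0   -> result bits 31..24 */
        tab[1][i] = r << 16;   /* source bits 15..8  -> result bits 23..16 */
        tab[2][i] = r << 8;    /* source bits 23..16 -> result bits 15..8  */
        tab[3][i] = r;         /* source bits 31..24 -> result bits 7..0   */
    }
}

/* Four lookups OR-ed together, no shifting of the results needed. */
uint32_t flip32_4tab(uint32_t v)
{
    return tab[0][v & 0xFF] | tab[1][(v >> 8) & 0xFF] |
           tab[2][(v >> 16) & 0xFF] | tab[3][v >> 24];
}
```

The price is 4 KB of table data instead of 256 bytes, which is why this mainly pays off when the tables sit in fastmem.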

ross · 03 June 2018, 18:45

Quote:

Originally Posted by Thorham

What exactly? I never really use those bit field instructions.

You have the bit positions reversed (bit-field offset 0 is the MSB).
This one is right, but in the end it is practically the same as mine:

Code:

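_lut8flip3:
   lea      _8lut(pc),a0        ; a0 -> 256-byte bit-reversal table
   bfextu   d0{24:8},d2         ; lowest byte first (offset 24 = bits 7..0)
   move.b   (a0,d2.w),d1
   lsl.l    #8,d1
   bfextu   d0{16:8},d2
   move.b   (a0,d2.w),d1
   lsl.l    #8,d1
   bfextu   d0{8:8},d2
   move.b   (a0,d2.w),d1
   lsl.l    #8,d1
   bfextu   d0{0:8},d2          ; highest byte last (offset 0 = MSB)
   move.b   (a0,d2.w),d1
   move.l   d1,d0               ; return the flipped long in d0
   rts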
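
Since the offset numbering is what went wrong, here is a small C model of how BFEXTU counts bits on a data register: offset 0 is bit 31 and the field extends towards the LSB. The function name `bfextu_reg` is made up, and the sketch only covers fields that stay inside the register (offset + width <= 32); the wrap-around case of the real instruction is not modelled.

```c
#include <stdint.h>

/* Model of BFEXTU Dn{offset:width}: offset 0 is the MSB (bit 31).
 * Assumes 1 <= width <= 32 and offset + width <= 32. */
uint32_t bfextu_reg(uint32_t value, unsigned offset, unsigned width)
{
    uint32_t shifted = value >> (32u - offset - width);
    uint32_t mask = (width == 32u) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return shifted & mask;   /* zero-extended field */
}
```

With d0 = $12345678, the field {0:8} gives $12 and {24:8} gives $78, so the routine has to start at offset 24 for the lowest byte to end up in the top byte of the result.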

Quote:

Here's another one. Uses four 1kb tables:

Using four address registers can be a handicap (saving/restoring them on the stack...), but hey, the more the better

Quote:

This will most certainly be faster than all code with fastmem.

Yes, in fact you need a different version depending on the conditions (pure 68k, chipmem only or 16-bit memory only, 020+, real fastmem available, ...).

Thorham · 03 June 2018, 18:50

Quote:

Originally Posted by ross

You have the bit positions reversed (bit-field offset 0 is the MSB).

Thanks

Quote:

Originally Posted by ross

Using four address registers can be a handicap (saving/restoring them on the stack...)

Depends on how many tiles you have to flip.

Quote:

Originally Posted by ross

Yes, in fact you need a different version depending on the conditions (pure 68k, chipmem only or 16-bit memory only, 020+, real fastmem available, ...).

Yes, one size really doesn't fit all in this case. And after all that, there's also the 68060...

Photon · 03 June 2018, 19:28

ross seems capable, so I think he could both code and compare to give us the answer

Here's an untested variant from the obvious LUT. It trades 3 memory writes for 2 decode cycles. It was mostly to find out if one could do the same for the memory reads, but from toying with it a few minutes I don't think that's possible.

Code:

moveq #0,d0
moveq #0,d1
move.b (a0)+,d0
move.b (a0)+,d1
move.w (a2,d1.w),d1
move.b (a2,d0.w),d1
swap d1
move.b (a0)+,d0
move.b (a0)+,d1
move.w (a2,d1.w),d1
move.b (a2,d0.w),d1
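
To make the trick in this variant explicit, here is a C model of one word: the word read at table offset b1 delivers the mirrored b1 in the high byte plus a junk neighbour in the low byte, and the following byte read replaces the junk with the mirrored b0. The names `lut`, `build_lut` and `flip16_overlap` are invented for the sketch. The table gets a 257th pad byte because the word read at index 255 touches one byte past the 256 entries; on the real machine that byte is just junk that gets overwritten, in C it must exist.

```c
#include <stdint.h>

/* 256 mirrored bytes plus one pad byte for the word read at index 255. */
uint8_t lut[257];

void build_lut(void)
{
    for (unsigned i = 0; i < 256; i++) {
        unsigned r = 0;
        for (unsigned bit = 0; bit < 8; bit++)
            if (i & (1u << bit))
                r |= 0x80u >> bit;
        lut[i] = (uint8_t)r;
    }
    lut[256] = 0;   /* pad, value is irrelevant */
}

/* src points at the two bytes b0,b1 of a big-endian source word. */
uint16_t flip16_overlap(const uint8_t *src)
{
    unsigned b0 = src[0], b1 = src[1];
    /* move.w (a2,d1.w),d1: high byte = lut[b1], low byte = junk neighbour */
    uint16_t w = (uint16_t)((lut[b1] << 8) | lut[b1 + 1]);
    /* move.b (a2,d0.w),d1: overwrite the junk with the mirrored b0 */
    w = (uint16_t)((w & 0xFF00u) | lut[b0]);
    return w;
}
```

Note that this mirrors each 16-bit word on its own; the two words of a longword keep their order.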

ross · 03 June 2018, 21:09

Quote:

Originally Posted by Photon

Here's an untested variant from the obvious LUT. It trades 3 memory writes for 2 decode cycles.

Hi Photon, the big problem here is the limits of the 020.
There is no data cache to help with all the (a0)+ accesses.
If the tile data is properly 32-bit aligned in chipmem, a single .l read is a big win.
So having d0.l filled in one go is good!

Another bottleneck in your code is the move.w (a2,d1.w),d1, which can span two memory lines (actually resulting in two separate reads)!
All the rol/lsl and even the register bfextu instructions are relatively cheap compared to chipmem access; this is what makes the pure (instruction-cached) code so fast.

Your code could maybe be adapted for a pure 68k version (avoiding accesses to odd addresses).

Quote:

It was mostly to find out if one could do the same for the memory reads, but from toying with it a few minutes I don't think that's possible.

I think you are right

Anyway not a good idea in this context.

meynaf · 04 June 2018, 09:17

Quote:

Originally Posted by ross

Your code could maybe be adapted for a pure 68k version (avoiding accesses to odd addresses).

Word accesses to odd addresses run at normal speed if they do not cross a longword boundary (I think). That would make only 25% of the accesses slower than normal. But with chipmem and no data cache I'm not sure.
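
The 25% figure can be checked with a few lines of C. The sketch assumes the table starts on a longword boundary and that every byte offset is equally likely; the function names are made up for the example.

```c
/* Returns 1 when a 16-bit read at this byte offset straddles a
 * 32-bit boundary (assuming the table itself is longword aligned). */
int word_read_crosses_long(unsigned offset)
{
    return (offset & 3u) == 3u;
}

/* Count the straddling offsets among the first table_size offsets. */
unsigned count_crossing(unsigned table_size)
{
    unsigned n = 0;
    for (unsigned off = 0; off < table_size; off++)
        n += (unsigned)word_read_crosses_long(off);
    return n;
}
```

For a 256-byte table, 64 of the 256 offsets cross a boundary. If a crossing read costs two bus accesses instead of one, the average is 0.75 x 1 + 0.25 x 2 = 1.25 accesses per lookup.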

Anyway, for a pure 68000 it may (?) be faster to use a few memory operations rather than many register operations (the inner loop here does two 8-bit steps):

Code:

 move.b (a0)+,d3
 move.b -(a2),d4
 move.b (a4,d4.w),(a1)+
 move.b (a4,d3.w),-(a3)

There you have pointers to the start and to the end of the line, for both reading and writing, and a 256-byte table.
It could be the fastest solution if a 64k table is allowed.
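
A C model of the two-pointer idea, as a sketch only: one pointer walks in from the left, one from the right, and each pair of bytes is mirrored and exchanged, so the line can be flipped in place. The names are hypothetical, and a bit loop stands in for the 256-byte table at a4.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirror the 8 bits of one byte (stand-in for the table lookup). */
uint8_t mirror8(uint8_t b)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++)
        if (b & (1u << i))
            r |= (uint8_t)(0x80u >> i);
    return r;
}

/* Flip a line of len bytes horizontally, in place. */
void flip_line_in_place(uint8_t *line, size_t len)
{
    uint8_t *lo = line;         /* role of a0/a1: post-increment side */
    uint8_t *hi = line + len;   /* role of a2/a3: pre-decrement side  */
    while (lo < hi) {
        hi--;
        uint8_t left = *lo;     /* read both bytes before writing */
        uint8_t right = *hi;
        *lo++ = mirror8(right);
        *hi = mirror8(left);
    }
}
```

Both bytes are read before either is written, which is what makes the in-place version safe; for an odd length the middle byte is simply mirrored onto itself.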

chb · 04 June 2018, 23:16

If you optimize for 68k, you could also try to use the blitter. I was theorizing about it some time ago here: http://eab.abime.net/showpost.php?p=...&postcount=248 (didn't know it has a fancy name, lol)
It's basically the same thing Thorham proposed (flipping pairwise). It should be 4x6 = 24 clock cycles (and 12 memory cycles) per word, plus overhead for blitter setup and extra word.
The byte table approach given above should be 8cc/2ma + 10/2 + 18/4 + 18/4, so 54 clock cycles and 12 memory accesses per word, plus a small overhead for the loop (negligible for a sufficiently unrolled loop).

So the blitter approach is probably only useful for bigger tiles and if it can use memory cycles not available to the CPU (e.g. borders).
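
The tally for the byte table loop can be written out as a small C sketch. The per-instruction figures are taken from this post as given (clock cycles / memory accesses for the four MOVE.B instructions of the inner loop); the struct and function names are invented.

```c
/* Cost of one instruction: clock cycles and memory accesses. */
struct cost {
    unsigned cycles;
    unsigned accesses;
};

/* Figures quoted in the post for the four MOVE.B instructions. */
const struct cost inner_loop[4] = {
    { 8, 2},   /* move.b (a0)+,d3         */
    {10, 2},   /* move.b -(a2),d4         */
    {18, 4},   /* move.b (a4,d4.w),(a1)+  */
    {18, 4},   /* move.b (a4,d3.w),-(a3)  */
};

/* Sum the loop body: it handles two bytes, i.e. one word. */
struct cost total_per_word(void)
{
    struct cost sum = {0, 0};
    for (int i = 0; i < 4; i++) {
        sum.cycles += inner_loop[i].cycles;
        sum.accesses += inner_loop[i].accesses;
    }
    return sum;
}
```

That gives 54 cycles and 12 accesses per word for the CPU, against the estimated 4 x 6 = 24 cycles and 12 memory cycles for the blitter, before setup overhead.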

I do not understand btw. the idea behind using pointers to start and end of the line - pre-decrement for source is slightly slower than post-increment (same for destination), so why not flip the line from left to right simply? Or use one 256-byte and one 256-word table to do word writes, like proposed here: http://eab.abime.net/showpost.php?p=...7&postcount=53

EDIT: Ah, think I got it - you can do the flip in-place without a buffer? Very clever!

Photon · 05 June 2018, 00:17

chb - he's not optimizing for 68000, and I don't like LUTs, it's hit and miss if you can get a good gain. Certainly more miss the higher up the Motorola family you travel.

ross, I was thinking you'd just run them and report? (Including chipmem r/w for the data words for all variants.) It might be that the fastest one is the one that can do the largest MOVEM. But the 68020 isn't infinitely faster than the 68000, so obviously instruction count and heft matter.

From this, I end up with

Code:

	move.w (a6,Rn.w*2),Rn
	swap Rn
	move.w (a6,Rn.w*2),Rn

which I see matches one you posted. Probably a replacement for the SWAP for A-regs would still pay off, so that you could do a MOVEM.L d0-a5, but aligning and just doing d0-d7 might be an even better idea.
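
A C model of the three-instruction sequence, as a sketch under the assumption that a6 addresses a table of 65536 mirrored words (128 KB). The names are made up. On the 68020 a .w index is sign-extended, so a6 would presumably have to point at the middle of the table; the C version sidesteps that by using an unsigned index.

```c
#include <stdint.h>

/* Assumed table: rev16_tab[i] is i with its 16 bits mirrored (128 KB). */
uint16_t rev16_tab[65536];

void build_rev16(void)
{
    for (uint32_t i = 0; i < 65536; i++) {
        uint32_t r = 0;
        for (int bit = 0; bit < 16; bit++)
            r |= ((i >> bit) & 1u) << (15 - bit);
        rev16_tab[i] = (uint16_t)r;
    }
}

/* Mirror a longword with two word lookups and one swap. */
uint32_t flip32_word_tab(uint32_t v)
{
    /* move.w (a6,Rn.w*2),Rn : low word replaced by its mirror */
    v = (v & 0xFFFF0000u) | rev16_tab[v & 0xFFFFu];
    /* swap Rn : exchange the two halves */
    v = (v << 16) | (v >> 16);
    /* move.w (a6,Rn.w*2),Rn : mirror the other half */
    v = (v & 0xFFFF0000u) | rev16_tab[v & 0xFFFFu];
    return v;
}
```

The swap does double duty here: it brings the second word into lookup position and at the same time puts the first mirrored word into the upper half, where it belongs.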

Obviously the fastest would be to run the tile conversion in batch long beforehand rather than stream them.

ross · 05 June 2018, 10:41

Quote:

Originally Posted by meynaf

Word accesses to odd addresses run at normal speed if they do not cross a longword boundary (I think). That would make only 25% of the accesses slower than normal.

Yep, on average 25% slower.

Quote:

Originally Posted by meynaf

Anyway, for a pure 68000 it may (?) be faster to use a few memory operations rather than many register operations

You are right; on 68k you can certainly trade some instructions for memory operations.

Quote:

Originally Posted by chb

If you optimize for 68k, you could also try to use the blitter.
---
So the blitter approach is probably only useful for bigger tiles and if it can use memory cycles not available to the CPU (e.g. borders).

Here too we should run some tests, but I do not have much confidence that it would be faster than the CPU, for two reasons:
- on pure 68k the ops that need more cycles (lsx, rox) don't have to compete with video DMA, unlike the blitter;
- on 020+ the instruction cycle counts are reduced, and memory access and the ALU are full 32-bit, so a 16-bit blitter can be a limit...

Quote:

Originally Posted by Photon

ross, I was thinking you'd just run them and report?

uh, sorry

I've had absurdly busy days and could not even turn on my (virtual) Amiga...
But I've simply adapted my http://eab.abime.net/showpost.php?p=1199574&postcount=1 code; it gives steady, rock-solid results.

Quote:

Originally Posted by Photon

It might be that the fastest one is the one that can do the largest MOVEM. But the 68020 isn't infinitely faster than the 68000, so obviously instruction count and heft matter.

This is a good idea.

Quote:

Originally Posted by Photon

From this, I end up with

Code:

	move.w (a6,Rn.w*2),Rn
	swap Rn
	move.w (a6,Rn.w*2),Rn

which I see matches one you posted.

Yes, this is the fastest!

Quote:

Originally Posted by Photon

Probably a replacement for the SWAP for A-regs would still pay off, so that you could do a MOVEM.L d0-a5, but aligning and just doing d0-d7 might be an even better idea.

Yes, the previous code with registers prefilled by MOVEM could be the absolute winner.

Quote:

Originally Posted by Photon

Obviously the fastest would be to run the tile conversion in batch long beforehand rather than stream them.

Sure

chb · 05 June 2018, 14:24

Quote:

Originally Posted by ross

Here too we should run some tests, but I do not have much confidence that it would be faster than the CPU, for two reasons:
- on pure 68k the ops that need more cycles (lsx, rox) don't have to compete with video DMA, unlike the blitter;
- on 020+ the instruction cycle counts are reduced, and memory access and the ALU are full 32-bit, so a 16-bit blitter can be a limit...

Yep, it's probably useful only for a quite restricted number of scenarios: no competing dma (borders), cpu busy with other tasks like muls and shifts that do not require a lot of memory cycles, or if cpu can work in fastmem. And on 020+ without fast maybe not at all. So likely not worth the hassle.

buzzybee · 07 August 2019, 16:42

@Thorham: Thanks for your flipping code. Works great! I'd love to use it in my next game if you don't mind.
