Optimised Akiko Chunky-to-Planar emulation

Mequa · 21 January 2012, 09:32

For Toni, here is the source I worked on all night to optimise the Akiko Chunky-to-Planar emulation for WinUAE and PUAE. In my tests, it runs at least twice as fast as the current procedure.
If it works, this should make for much more efficient CD32 emulation!

C/C++ source code follows, so if you're not a developer, this may look like Greek. Of course this hasn't yet been tested in WinUAE, and I'm not sure if it needs any modifications to work.
It has been tested in GCC and proved compliant with both C89/C90 and C++, producing identical results each time to the WinUAE routine.

As for speed, 100,000,000 iterations completed in 18 seconds, compared with 36 seconds for the current "piece of crap" WinUAE algorithm.

Here goes:

First up, some extra arrays for the header file:

Code:

/* Added by Mequa - for header: */
uae_u32 akiko_precalc_shift[32];
uae_u32 akiko_precalc_bytenum[32][8];

A precalculation routine which must be run before using my C2P routine, with an overhead of just over 1KB:

Code:

/* Precalculations for Akiko Chunky-to-Planar conversion by Mequa */
static void akiko_precalculate_fast (void)
{
	uae_u32 i, j;
	for (i = 0; i < 32; i++)
	{
		akiko_precalc_shift  [(int)i]  		      = 1 << i;
		for (j = 0; j < 8; j++)
		{
			akiko_precalc_bytenum[(int)i][(int)j] = (i >> 3) + ((7 - j) << 2);
		}
	}
}

And finally, the algorithm itself:

Code:

/* Optimised Chunky-to-Planar algorithm by Mequa */
static void akiko_c2p_do_fast (void)
{
	int i;

	for (i = 0; i < 8; i++)
	{
		akiko_result[i] 	   =  (((akiko_buffer[0] & akiko_precalc_shift[i])    != 0) << (akiko_precalc_bytenum[i][0])   )
				    	   |  (((akiko_buffer[1] & akiko_precalc_shift[i])    != 0) << (akiko_precalc_bytenum[i][1])   )
				    	   |  (((akiko_buffer[2] & akiko_precalc_shift[i])    != 0) << (akiko_precalc_bytenum[i][2])   )
				    	   |  (((akiko_buffer[3] & akiko_precalc_shift[i])    != 0) << (akiko_precalc_bytenum[i][3])   )
				    	   |  (((akiko_buffer[4] & akiko_precalc_shift[i])    != 0) << (akiko_precalc_bytenum[i][4])   )
				    	   |  (((akiko_buffer[5] & akiko_precalc_shift[i])    != 0) << (akiko_precalc_bytenum[i][5])   )
				    	   |  (((akiko_buffer[6] & akiko_precalc_shift[i])    != 0) << (akiko_precalc_bytenum[i][6])   )
				    	   |  (((akiko_buffer[7] & akiko_precalc_shift[i])    != 0) << (akiko_precalc_bytenum[i][7])   )
					   |  (((akiko_buffer[0] & akiko_precalc_shift[i+8])  != 0) << (akiko_precalc_bytenum[i+8][0]) )
				    	   |  (((akiko_buffer[1] & akiko_precalc_shift[i+8])  != 0) << (akiko_precalc_bytenum[i+8][1]) )
				    	   |  (((akiko_buffer[2] & akiko_precalc_shift[i+8])  != 0) << (akiko_precalc_bytenum[i+8][2]) )
				    	   |  (((akiko_buffer[3] & akiko_precalc_shift[i+8])  != 0) << (akiko_precalc_bytenum[i+8][3]) )
				    	   |  (((akiko_buffer[4] & akiko_precalc_shift[i+8])  != 0) << (akiko_precalc_bytenum[i+8][4]) )
				    	   |  (((akiko_buffer[5] & akiko_precalc_shift[i+8])  != 0) << (akiko_precalc_bytenum[i+8][5]) )
				    	   |  (((akiko_buffer[6] & akiko_precalc_shift[i+8])  != 0) << (akiko_precalc_bytenum[i+8][6]) )
				    	   |  (((akiko_buffer[7] & akiko_precalc_shift[i+8])  != 0) << (akiko_precalc_bytenum[i+8][7]) )
					   |  (((akiko_buffer[0] & akiko_precalc_shift[i+16]) != 0) << (akiko_precalc_bytenum[i+16][0]))
				    	   |  (((akiko_buffer[1] & akiko_precalc_shift[i+16]) != 0) << (akiko_precalc_bytenum[i+16][1]))
				    	   |  (((akiko_buffer[2] & akiko_precalc_shift[i+16]) != 0) << (akiko_precalc_bytenum[i+16][2]))
				    	   |  (((akiko_buffer[3] & akiko_precalc_shift[i+16]) != 0) << (akiko_precalc_bytenum[i+16][3]))
				    	   |  (((akiko_buffer[4] & akiko_precalc_shift[i+16]) != 0) << (akiko_precalc_bytenum[i+16][4]))
				    	   |  (((akiko_buffer[5] & akiko_precalc_shift[i+16]) != 0) << (akiko_precalc_bytenum[i+16][5]))
				    	   |  (((akiko_buffer[6] & akiko_precalc_shift[i+16]) != 0) << (akiko_precalc_bytenum[i+16][6]))
				    	   |  (((akiko_buffer[7] & akiko_precalc_shift[i+16]) != 0) << (akiko_precalc_bytenum[i+16][7]))
					   |  (((akiko_buffer[0] & akiko_precalc_shift[i+24]) != 0) << (akiko_precalc_bytenum[i+24][0]))
				    	   |  (((akiko_buffer[1] & akiko_precalc_shift[i+24]) != 0) << (akiko_precalc_bytenum[i+24][1]))
				    	   |  (((akiko_buffer[2] & akiko_precalc_shift[i+24]) != 0) << (akiko_precalc_bytenum[i+24][2]))
				    	   |  (((akiko_buffer[3] & akiko_precalc_shift[i+24]) != 0) << (akiko_precalc_bytenum[i+24][3]))
				    	   |  (((akiko_buffer[4] & akiko_precalc_shift[i+24]) != 0) << (akiko_precalc_bytenum[i+24][4]))
				    	   |  (((akiko_buffer[5] & akiko_precalc_shift[i+24]) != 0) << (akiko_precalc_bytenum[i+24][5]))
				    	   |  (((akiko_buffer[6] & akiko_precalc_shift[i+24]) != 0) << (akiko_precalc_bytenum[i+24][6]))
				    	   |  (((akiko_buffer[7] & akiko_precalc_shift[i+24]) != 0) << (akiko_precalc_bytenum[i+24][7]));
	}
}

Have fun, Toni!

And if this works, faster CD32 emulation for all.
(It may be particularly useful for ARM platforms for e.g. PUAE, where CD32 emulation can lag.)

Happy emulating!

Mequa

Mequa · 22 January 2012, 18:48

Toni's response was that the speedup might make little difference, since C2P processing is a small part of emulating CD32 games (although the current source says the current algorithm is too inefficient), and the extra memory overhead for a rarely-used feature is unwanted bloat.

Here's a modified version which only allocates the memory for the table if Akiko emulation is needed:

For the header, "akiko.h":

Code:

/* Added by Mequa - for header: */
uae_u32* akiko_precalc_shift = NULL;
uae_u32** akiko_precalc_bytenum = NULL;

The precalculation routine, for "akiko.cpp":

Code:

/* Precalculations for Akiko Chunky-to-Planar conversion by Mequa */
int akiko_precalculate_fast (void)
{
	uae_u32 i, j;

	if (akiko_precalc_shift == NULL)
	        akiko_precalc_shift   = (uae_u32*)  malloc(sizeof(uae_u32) * 32);
	if (akiko_precalc_shift == NULL)
		return 0;
	if (akiko_precalc_bytenum == NULL)
		akiko_precalc_bytenum = (uae_u32**) malloc(sizeof(uae_u32*) * 32);
	if (akiko_precalc_bytenum == NULL)
		return 0;

	for (i = 0; i < 32; i++) {
		akiko_precalc_bytenum[i] = NULL;
		akiko_precalc_bytenum[i] = (uae_u32*) malloc(sizeof(uae_u32) * 8);
		if (akiko_precalc_bytenum[i] == NULL)
			return 0;
	}
	for (i = 0; i < 32; i++) {
		akiko_precalc_shift [(int)i] = 1 << i;
		for (j = 0; j < 8; j++)
			akiko_precalc_bytenum[(int)i][(int)j] = (i >> 3) + ((7 - j) << 2);
	}

	return 1;
}

Finally, an extra routine to free the memory used by the precalculated table when it is no longer needed:

Code:

/* Used to destroy the precalculated table when no longer needed: */
static void akiko_kill_precalculate_table(void)
{
	int i;

	if (akiko_precalc_shift != NULL) {
		free (akiko_precalc_shift);
		akiko_precalc_shift = NULL;
	}
	if (akiko_precalc_bytenum != NULL) {
		for (i = 0; i < 32; i++) {
			free (akiko_precalc_bytenum[i]);
			akiko_precalc_bytenum[i] = NULL;
		}

		free (akiko_precalc_bytenum);
		akiko_precalc_bytenum = NULL;
	}
}

The akiko_c2p_do_fast() procedure (the actual algorithm) is the same as in my last post.

You can also optionally add the following after its "int i;" line, which will automatically allocate the table when needed with only a tiny performance overhead:

Code:

	if ((akiko_precalc_shift == NULL) || (akiko_precalc_bytenum == NULL))
		akiko_precalculate_fast();

I also have a shorter version of akiko_c2p_do_fast() which uses 32 loop iterations instead of 8 (using a bit more precomputation too), which works with both the last approach and this dynamically-allocated one.
However, this one tested a bit faster.

Toni Wilen · 22 January 2012, 19:14

I didn't mean the space used by lookup table was bad, no one cares about small extra memory usage today.

I only wondered about the algorithm and if lookup table is the right choice when using modern CPUs that run extremely fast as long as data is in caches and not that fast when caches need filling and if performance difference is noticeable when using real world programs.

Emulation has huge amount of running code, even 100x speed up in tiny part of code may only mean 1% total performance increase when whole emulation is running. (Which is quite annoying!)

Mequa · 24 January 2012, 09:13

Every little helps.

Edit:
Real Amiga non-Akiko fast C2P routines, from my knowledge, use 68k assembler with some speed enhancements using blitter and/or copper. I expect lookup tables may be considered too memory-intensive for real unexpanded Amigas, but what better options are there for UAE?

On x86, perhaps MMX instructions could be used for a speedup here, and/or a rewrite of the algorithm in x86 assembler. For ports of e.g. PUAE to ARM, PowerPC etc., however, this version is cross-platform enough. (ARM versions could always do with platform-specific optimisations though.)

As a proof of concept, I even got my static version of this algorithm working in XC on an XMOS XC-1A Development Kit on its XMOS XS1-G4 chip!
(Although still faster than the current UAE C2P, without parallel optimisations it's pretty inefficient on that platform; hence XS1-G4 will never be an ideal platform for UAE, which is not written to exploit parallelism... plus the aforementioned development kit has very little memory and stack space.)

Toni seems to want a better algorithm than this however (although this is clearly an improvement on the current one).
My initial algorithm with a statically-allocated lookup table may also be slightly faster in practice than the dynamic version.

How could this be benchmarked in real-world application? I thought of making a custom-build of WinUAE with both the original algorithm and my optimised version implemented, with timers added to both procedures (with feedback), and both tested running e.g. Microcosm with standard CD32 settings (perhaps on Intel Atom). Even a tiny overall speedup may be better than nothing (see Tesco's slogan).

StingRay · 24 January 2012, 11:42

Quote:

Originally Posted by Mequa

Real Amiga non-Akiko fast C2P routines, from my knowledge, use 68k assembler with some speed enhancements using blitter and/or copper.

Only c2p routines for 68020 and maybe 030 use the blitter, all others (040/060) use CPU only as that's the fastest solution.

Quote:

Originally Posted by Mequa

I expect lookup tables may be considered too memory-intensive for real unexpanded Amigas, but what better options are there for UAE?

That is not the reason, large lookup tables are inefficient (cache) so it's faster to do it "on the fly" with the CPU.

Mequa · 04 February 2012, 21:42

Quote:

Originally Posted by Toni Wilen

- Mequa's optimized Akiko C2P added.

Holy moly.

Which version of it? And is it really faster with the CPU cache issues?

Does this mean my procedure now falls under the GPL?

P.S. Toni, can you get CD32 games to boot with JIT enabled in the latest beta?

Toni Wilen · 04 February 2012, 22:06

I moved above post from beta thread. Not really correct place for this kind of stuff.

Quote:

Originally Posted by Mequa

Holy moly.

Which version of it? And is it really faster with the CPU cache issues?

It is the one listed above. I don't know or care

Quote:

Does this mean my procedure now falls under the GPL?

It only means the included code is under GPL. You can still do whatever you want with above code or re-license it etc.. (and if it isn't, it of course goes away in next beta, I assumed it has proper license because you suggested it for inclusion..)

Quote:

P.S. Toni, can you get CD32 games to boot with JIT enabled in the latest beta?

No but I haven't bothered with this, it isn't that important. (Don't need CD32 mode to enable Akiko C2P hardware)

Mequa · 04 February 2012, 22:19

I meant, did you include the static or dynamic version?

I wrote this myself and give permission, so the included code is fine to license under GPL for inclusion in WinUAE/PUAE etc.

I was just wondering if I could still reuse my own routine above under another license in my own projects (e.g. revised BSD license as for my jAMOS project). Which seems to be the case. (If not I'll leave it just for UAE.)

ceztko · 05 February 2012, 00:54

Quote:

Originally Posted by Mequa

I was just wondering if I could still reuse my own routine above under another license in my own projects (e.g. revised BSD license as for my jAMOS project). Which seems to be the case. (If not I'll leave it just for UAE.)

Sure, you can. That does mean that, depending on the source where the code was copied, it will be GPL or BSD. Or you may want to specify the double license immediately. Or stick to the more permissive one (no problems in having modified BSD or MIT licensed code in GPL projects). Because of copyleft you can't say later that the code wasn't supposed to be GPL/BSD.

Mequa · 05 February 2012, 02:47

Or I could just make this routine 100% public domain and usable without restriction?

Perhaps that would be my best bet. Of course this is already open-sourced, and I want to keep it GPL-compatible.

04 February 2012, 22:19	#8
Mequa Registered User Join Date: Nov 2009 Location: UK Posts: 497	I meant, did you include the static or dynamic version? I wrote this myself and give permission, so the included code is fine to license under GPL for inclusion in WinUAE/PUAE etc. I was just wondering if I could still reuse my own routine above under another license in my own projects (e.g. revised BSD license as for my jAMOS project). Which seems to be the case. (If not I'll leave it just for UAE.) Last edited by Mequa; 04 February 2012 at 22:40.

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
Akiko Chunky-to-Planar conversion	Mequa	support.WinUAE	5	21 January 2012 10:50
Emulation of Graffiti chunky graphics device	pmc	request.UAE Wishlist	29	30 November 2011 18:52
Amiga Bitplanes & Planar vs Chunky - Technical Expert Required	CodyJarrett	project.APoV	4	12 November 2009 11:14
Chunky to planar	pmc	Coders. Tutorials	11	15 September 2009 16:20
Chunky to planar on a500	Alter	Coders. General	28	10 April 2007 02:53

22 January 2012, 19:14	#3
Toni Wilen WinUAE developer Join Date: Aug 2001 Location: Hämeenlinna/Finland Age: 49 Posts: 26,507	I didn't mean the space used by lookup table was bad, no one cares about small extra memory usage today. I only wondered about the algorithm and if lookup table is the right choice when using modern CPUs that run extremely fast as long as data is in caches and not that fast when caches need filling and if performance difference is noticeable when using real world programs. Emulation has huge amount of running code, even 100x speed up in tiny part of code may only mean 1% total performance increase when whole emulation is running. (Which is quite annoying!)

24 January 2012, 09:13	#4
Mequa Registered User Join Date: Nov 2009 Location: UK Posts: 497	Every little helps. Edit: Real Amiga non-Akiko fast C2P routines, from my knowledge, use 68k assembler with some speed enhancements using blitter and/or copper. I expect lookup tables may be considered too memory-intensive for real unexpanded Amigas, but what better options are there for UAE? On x86, perhaps MMX instructions could be used for a speedup here, and/or a rewrite of the algorithm in x86 assembler. For ports of e.g. PUAE to ARM, PowerPC etc., however, this version is cross-platform enough. (ARM versions could always do with platform-specific optimisations though.) As a proof of concept, I even got my static version of this algorithm working in XC on an XMOS XC-1A Development Kit on its XMOS XS1-G4 chip! (Although still faster than the current UAE C2P, without parallel optimisations it's pretty inefficient on that platform; hence XS1-G4 will never be an ideal platform for UAE, which is not written to exploit parallelism... plus the aforementioned development kit has very little memory and stack space.) Toni seems to want a better algorithm than this however (although this is clearly an improvement on the current one). My initial algorithm with a statically-allocated lookup table may also be slightly faster in practice than the dynamic version. How could this be benchmarked in real-world application? I thought of making a custom-build of WinUAE with both the original algorithm and my optimised version implemented, with timers added to both procedures (with feedback), and both tested running e.g. Microcosm with standard CD32 settings (perhaps on Intel Atom). Even a tiny overall speedup may be better than nothing (see Tesco's slogan).

05 February 2012, 02:47	#10
Mequa Registered User Join Date: Nov 2009 Location: UK Posts: 497	Or I could just make this routine 100% public domain and usable without restriction? Perhaps that would be my best bet. Of course this is already open-sourced, and I want to keep it GPL-compatible.

Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)