English Amiga Board


Go Back   English Amiga Board > Support > support.WinUAE

 
 
Thread Tools
Old 21 January 2012, 09:32   #1
Mequa
Registered User
 
Join Date: Nov 2009
Location: UK
Posts: 497
Lightbulb Optimised Akiko Chunky-to-Planar emulation

For Toni, here is the source I worked on all night to optimise the Akiko Chunky-to-Planar emulation for WinUAE and PUAE. In my tests, it runs at least twice as fast as the current procedure.
If it works, this should make for much more efficient CD32 emulation!

C/C++ source code follows, so if you're not a developer, this may look like Greek. Of course this hasn't yet been tested in WinUAE, and I'm not sure if it needs any modifications to work.
It has been tested in GCC and proved compliant with both C89/C90 and C++, producing identical results each time to the WinUAE routine.

As for speed, 100,000,000 iterations completed in 18 seconds, compared with 36 seconds for the current "piece of crap" WinUAE algorithm.


Here goes:

First up, some extra arrays for the header file:

Code:
/* Added by Mequa - for header: */
uae_u32 akiko_precalc_shift[32];
uae_u32 akiko_precalc_bytenum[32][8];
A precalculation routine which must be run before using my C2P routine, with an overhead of just over 1KB:

Code:
/* Precalculations for Akiko Chunky-to-Planar conversion by Mequa */
static void akiko_precalculate_fast (void)
{
	uae_u32 i, j;
	for (i = 0; i < 32; i++)
	{
		akiko_precalc_shift  [(int)i]  		      = 1 << i;
		for (j = 0; j < 8; j++)
		{
			akiko_precalc_bytenum[(int)i][(int)j] = (i >> 3) + ((7 - j) << 2);
		}
	}
}
And finally, the algorithm itself:

Code:
/* Optimised Chunky-to-Planar algorithm by Mequa */
static void akiko_c2p_do_fast (void)
{
	int i;

	for (i = 0; i < 8; i++)
	{
		akiko_result[i] 	   =  (((akiko_buffer[0] & akiko_precalc_shift[i])    != 0) << (akiko_precalc_bytenum[i][0])   )
				    	   |  (((akiko_buffer[1] & akiko_precalc_shift[i])    != 0) << (akiko_precalc_bytenum[i][1])   )
				    	   |  (((akiko_buffer[2] & akiko_precalc_shift[i])    != 0) << (akiko_precalc_bytenum[i][2])   )
				    	   |  (((akiko_buffer[3] & akiko_precalc_shift[i])    != 0) << (akiko_precalc_bytenum[i][3])   )
				    	   |  (((akiko_buffer[4] & akiko_precalc_shift[i])    != 0) << (akiko_precalc_bytenum[i][4])   )
				    	   |  (((akiko_buffer[5] & akiko_precalc_shift[i])    != 0) << (akiko_precalc_bytenum[i][5])   )
				    	   |  (((akiko_buffer[6] & akiko_precalc_shift[i])    != 0) << (akiko_precalc_bytenum[i][6])   )
				    	   |  (((akiko_buffer[7] & akiko_precalc_shift[i])    != 0) << (akiko_precalc_bytenum[i][7])   )
					   |  (((akiko_buffer[0] & akiko_precalc_shift[i+8])  != 0) << (akiko_precalc_bytenum[i+8][0]) )
				    	   |  (((akiko_buffer[1] & akiko_precalc_shift[i+8])  != 0) << (akiko_precalc_bytenum[i+8][1]) )
				    	   |  (((akiko_buffer[2] & akiko_precalc_shift[i+8])  != 0) << (akiko_precalc_bytenum[i+8][2]) )
				    	   |  (((akiko_buffer[3] & akiko_precalc_shift[i+8])  != 0) << (akiko_precalc_bytenum[i+8][3]) )
				    	   |  (((akiko_buffer[4] & akiko_precalc_shift[i+8])  != 0) << (akiko_precalc_bytenum[i+8][4]) )
				    	   |  (((akiko_buffer[5] & akiko_precalc_shift[i+8])  != 0) << (akiko_precalc_bytenum[i+8][5]) )
				    	   |  (((akiko_buffer[6] & akiko_precalc_shift[i+8])  != 0) << (akiko_precalc_bytenum[i+8][6]) )
				    	   |  (((akiko_buffer[7] & akiko_precalc_shift[i+8])  != 0) << (akiko_precalc_bytenum[i+8][7]) )
					   |  (((akiko_buffer[0] & akiko_precalc_shift[i+16]) != 0) << (akiko_precalc_bytenum[i+16][0]))
				    	   |  (((akiko_buffer[1] & akiko_precalc_shift[i+16]) != 0) << (akiko_precalc_bytenum[i+16][1]))
				    	   |  (((akiko_buffer[2] & akiko_precalc_shift[i+16]) != 0) << (akiko_precalc_bytenum[i+16][2]))
				    	   |  (((akiko_buffer[3] & akiko_precalc_shift[i+16]) != 0) << (akiko_precalc_bytenum[i+16][3]))
				    	   |  (((akiko_buffer[4] & akiko_precalc_shift[i+16]) != 0) << (akiko_precalc_bytenum[i+16][4]))
				    	   |  (((akiko_buffer[5] & akiko_precalc_shift[i+16]) != 0) << (akiko_precalc_bytenum[i+16][5]))
				    	   |  (((akiko_buffer[6] & akiko_precalc_shift[i+16]) != 0) << (akiko_precalc_bytenum[i+16][6]))
				    	   |  (((akiko_buffer[7] & akiko_precalc_shift[i+16]) != 0) << (akiko_precalc_bytenum[i+16][7]))
					   |  (((akiko_buffer[0] & akiko_precalc_shift[i+24]) != 0) << (akiko_precalc_bytenum[i+24][0]))
				    	   |  (((akiko_buffer[1] & akiko_precalc_shift[i+24]) != 0) << (akiko_precalc_bytenum[i+24][1]))
				    	   |  (((akiko_buffer[2] & akiko_precalc_shift[i+24]) != 0) << (akiko_precalc_bytenum[i+24][2]))
				    	   |  (((akiko_buffer[3] & akiko_precalc_shift[i+24]) != 0) << (akiko_precalc_bytenum[i+24][3]))
				    	   |  (((akiko_buffer[4] & akiko_precalc_shift[i+24]) != 0) << (akiko_precalc_bytenum[i+24][4]))
				    	   |  (((akiko_buffer[5] & akiko_precalc_shift[i+24]) != 0) << (akiko_precalc_bytenum[i+24][5]))
				    	   |  (((akiko_buffer[6] & akiko_precalc_shift[i+24]) != 0) << (akiko_precalc_bytenum[i+24][6]))
				    	   |  (((akiko_buffer[7] & akiko_precalc_shift[i+24]) != 0) << (akiko_precalc_bytenum[i+24][7]));
	}
}
Have fun, Toni!

And if this works, faster CD32 emulation for all.
(It may be particularly useful for ARM platforms for e.g. PUAE, where CD32 emulation can lag.)


Happy emulating!

Mequa

Last edited by Mequa; 27 January 2012 at 03:12.
Mequa is offline  
Old 22 January 2012, 18:48   #2
Mequa
Registered User
 
Join Date: Nov 2009
Location: UK
Posts: 497
Smile An alternative version to save memory when not using Akiko

Toni's response was that the speedup might make little difference, since C2P processing is a small part of emulating CD32 games (although the current source says the current algorithm is too inefficient), and the extra memory overhead for a rarely-used feature is unwanted bloat.

Here's a modified version which only allocates the memory for the table if Akiko emulation is needed:

For the header, "akiko.h":
Code:
/* Added by Mequa - for header: */
uae_u32* akiko_precalc_shift = NULL;
uae_u32** akiko_precalc_bytenum = NULL;
The precalculation routine, for "akiko.cpp":
Code:
/* Precalculations for Akiko Chunky-to-Planar conversion by Mequa */
int akiko_precalculate_fast (void)
{
	uae_u32 i, j;

	if (akiko_precalc_shift == NULL)
	        akiko_precalc_shift   = (uae_u32*)  malloc(sizeof(uae_u32) * 32);
	if (akiko_precalc_shift == NULL)
		return 0;
	if (akiko_precalc_bytenum == NULL)
		akiko_precalc_bytenum = (uae_u32**) malloc(sizeof(uae_u32*) * 32);
	if (akiko_precalc_bytenum == NULL)
		return 0;

	for (i = 0; i < 32; i++) {
		akiko_precalc_bytenum[i] = NULL;
		akiko_precalc_bytenum[i] = (uae_u32*) malloc(sizeof(uae_u32) * 8);
		if (akiko_precalc_bytenum[i] == NULL)
			return 0;
	}
	for (i = 0; i < 32; i++) {
		akiko_precalc_shift [(int)i] = 1 << i;
		for (j = 0; j < 8; j++)
			akiko_precalc_bytenum[(int)i][(int)j] = (i >> 3) + ((7 - j) << 2);
	}

	return 1;
}
Finally, an extra routine to free the memory used by the precalculated table when it is no longer needed:
Code:
/* Used to destroy the precalculated table when no longer needed: */
static void akiko_kill_precalculate_table(void)
{
	int i;

	if (akiko_precalc_shift != NULL) {
		free (akiko_precalc_shift);
		akiko_precalc_shift = NULL;
	}
	if (akiko_precalc_bytenum != NULL) {
		for (i = 0; i < 32; i++) {
			free (akiko_precalc_bytenum[i]);
			akiko_precalc_bytenum[i] = NULL;
		}

		free (akiko_precalc_bytenum);
		akiko_precalc_bytenum = NULL;
	}
}
The akiko_c2p_do_fast() procedure (the actual algorithm) is the same as in my last post.

You can also optionally add the following after its "int i;" line, which will automatically allocate the table when needed with only a tiny performance overhead:

Code:
	if ((akiko_precalc_shift == NULL) || (akiko_precalc_bytenum == NULL))
		akiko_precalculate_fast();
I also have a shorter version of akiko_c2p_do_fast() which uses 32 loop iterations instead of 8 (using a bit more precomputation too), which works with both the last approach and this dynamically-allocated one.
However, this one tested a bit faster.
Mequa is offline  
Old 22 January 2012, 19:14   #3
Toni Wilen
WinUAE developer
 
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,507
I didn't mean the space used by lookup table was bad, no one cares about small extra memory usage today.

I only wondered about the algorithm and if lookup table is the right choice when using modern CPUs that run extremely fast as long as data is in caches and not that fast when caches need filling and if performance difference is noticeable when using real world programs.

Emulation has huge amount of running code, even 100x speed up in tiny part of code may only mean 1% total performance increase when whole emulation is running. (Which is quite annoying!)
Toni Wilen is offline  
Old 24 January 2012, 09:13   #4
Mequa
Registered User
 
Join Date: Nov 2009
Location: UK
Posts: 497
Every little helps.

Edit:
Real Amiga non-Akiko fast C2P routines, from my knowledge, use 68k assembler with some speed enhancements using blitter and/or copper. I expect lookup tables may be considered too memory-intensive for real unexpanded Amigas, but what better options are there for UAE?

On x86, perhaps MMX instructions could be used for a speedup here, and/or a rewrite of the algorithm in x86 assembler. For ports of e.g. PUAE to ARM, PowerPC etc., however, this version is cross-platform enough. (ARM versions could always do with platform-specific optimisations though.)

As a proof of concept, I even got my static version of this algorithm working in XC on an XMOS XC-1A Development Kit on its XMOS XS1-G4 chip!
(Although still faster than the current UAE C2P, without parallel optimisations it's pretty inefficient on that platform; hence XS1-G4 will never be an ideal platform for UAE, which is not written to exploit parallelism... plus the aforementioned development kit has very little memory and stack space.)

Toni seems to want a better algorithm than this however (although this is clearly an improvement on the current one).
My initial algorithm with a statically-allocated lookup table may also be slightly faster in practice than the dynamic version.

How could this be benchmarked in real-world application? I thought of making a custom-build of WinUAE with both the original algorithm and my optimised version implemented, with timers added to both procedures (with feedback), and both tested running e.g. Microcosm with standard CD32 settings (perhaps on Intel Atom). Even a tiny overall speedup may be better than nothing (see Tesco's slogan).
Mequa is offline  
Old 24 January 2012, 11:42   #5
StingRay
move.l #$c0ff33,throat
 
StingRay's Avatar
 
Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 6,863
Quote:
Originally Posted by Mequa View Post
Real Amiga non-Akiko fast C2P routines, from my knowledge, use 68k assembler with some speed enhancements using blitter and/or copper.
Only c2p routines for 68020 and maybe 030 use the blitter, all others (040/060) use CPU only as that's the fastest solution.

Quote:
Originally Posted by Mequa View Post
I expect lookup tables may be considered too memory-intensive for real unexpanded Amigas, but what better options are there for UAE?
That is not the reason, large lookup tables are inefficient (cache) so it's faster to do it "on the fly" with the CPU.
StingRay is offline  
Old 04 February 2012, 21:42   #6
Mequa
Registered User
 
Join Date: Nov 2009
Location: UK
Posts: 497
Happy

Quote:
Originally Posted by Toni Wilen View Post
- Mequa's optimized Akiko C2P added.
Holy moly. Which version of it? And is it really faster with the CPU cache issues?

Does this mean my procedure now falls under the GPL?


P.S. Toni, can you get CD32 games to boot with JIT enabled in the latest beta?
Mequa is offline  
Old 04 February 2012, 22:06   #7
Toni Wilen
WinUAE developer
 
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,507
I moved above post from beta thread. Not really correct place for this kind of stuff.

Quote:
Originally Posted by Mequa View Post
Holy moly. Which version of it? And is it really faster with the CPU cache issues?
It is the one listed above. I don't know or care

Quote:
Does this mean my procedure now falls under the GPL?
It only means the included code is under GPL. You can still do whatever you want with above code or re-license it etc.. (and if it isn't, it of course goes away in next beta, I assumed it has proper license because you suggested it for inclusion..)

Quote:
P.S. Toni, can you get CD32 games to boot with JIT enabled in the latest beta?
No but I haven't bothered with this, it isn't that important. (Don't need CD32 mode to enable Akiko C2P hardware)
Toni Wilen is offline  
Old 04 February 2012, 22:19   #8
Mequa
Registered User
 
Join Date: Nov 2009
Location: UK
Posts: 497
I meant, did you include the static or dynamic version?

I wrote this myself and give permission, so the included code is fine to license under GPL for inclusion in WinUAE/PUAE etc.

I was just wondering if I could still reuse my own routine above under another license in my own projects (e.g. revised BSD license as for my jAMOS project). Which seems to be the case. (If not I'll leave it just for UAE.)

Last edited by Mequa; 04 February 2012 at 22:40.
Mequa is offline  
Old 05 February 2012, 00:54   #9
ceztko
Registered User
 
Join Date: Aug 2006
Location: Italy
Posts: 109
Quote:
Originally Posted by Mequa View Post
I was just wondering if I could still reuse my own routine above under another license in my own projects (e.g. revised BSD license as for my jAMOS project). Which seems to be the case. (If not I'll leave it just for UAE.)
Sure, you can. That does mean that, depending on the source where the code was copied, it will be GPL or BSD. Or you may want to specify the double license immediately. Or stick to the more permissive one (no problems in having modified BSD or MIT licensed code in GPL projects). Because of copyleft you can't say later that the code wasn't supposed to be GPL/BSD.
ceztko is offline  
Old 05 February 2012, 02:47   #10
Mequa
Registered User
 
Join Date: Nov 2009
Location: UK
Posts: 497
Or I could just make this routine 100% public domain and usable without restriction?

Perhaps that would be my best bet. Of course this is already open-sourced, and I want to keep it GPL-compatible.
Mequa is offline  
 


Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)
 
Thread Tools

Similar Threads
Thread Thread Starter Forum Replies Last Post
Akiko Chunky-to-Planar conversion Mequa support.WinUAE 5 21 January 2012 10:50
Emulation of Graffiti chunky graphics device pmc request.UAE Wishlist 29 30 November 2011 18:52
Amiga Bitplanes & Planar vs Chunky - Technical Expert Required CodyJarrett project.APoV 4 12 November 2009 11:14
Chunky to planar pmc Coders. Tutorials 11 15 September 2009 16:20
Chunky to planar on a500 Alter Coders. General 28 10 April 2007 02:53

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

Forum Jump


All times are GMT +2. The time now is 06:14.

Top

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2024, vBulletin Solutions Inc.
Page generated in 0.08172 seconds with 13 queries