21 January 2012, 09:32 | #1 |
Registered User
Join Date: Nov 2009
Location: UK
Posts: 497
|
Optimised Akiko Chunky-to-Planar emulation
For Toni, here is the source I worked on all night to optimise the Akiko Chunky-to-Planar emulation for WinUAE and PUAE. In my tests, it runs at least twice as fast as the current procedure.
If it works, this should make for much more efficient CD32 emulation! C/C++ source code follows, so if you're not a developer, this may look like Greek. Of course this hasn't yet been tested in WinUAE, and I'm not sure if it needs any modifications to work. It has been tested in GCC and proved compliant with both C89/C90 and C++, producing identical results each time to the WinUAE routine. As for speed, 100,000,000 iterations completed in 18 seconds, compared with 36 seconds for the current "piece of crap" WinUAE algorithm. Here goes: First up, some extra arrays for the header file: Code:
/* Added by Mequa - for header: */ uae_u32 akiko_precalc_shift[32]; uae_u32 akiko_precalc_bytenum[32][8]; Code:
/* Precalculations for Akiko Chunky-to-Planar conversion by Mequa */ static void akiko_precalculate_fast (void) { uae_u32 i, j; for (i = 0; i < 32; i++) { akiko_precalc_shift [(int)i] = 1 << i; for (j = 0; j < 8; j++) { akiko_precalc_bytenum[(int)i][(int)j] = (i >> 3) + ((7 - j) << 2); } } } Code:
/* Optimised Chunky-to-Planar algorithm by Mequa */ static void akiko_c2p_do_fast (void) { int i; for (i = 0; i < 8; i++) { akiko_result[i] = (((akiko_buffer[0] & akiko_precalc_shift[i]) != 0) << (akiko_precalc_bytenum[i][0]) ) | (((akiko_buffer[1] & akiko_precalc_shift[i]) != 0) << (akiko_precalc_bytenum[i][1]) ) | (((akiko_buffer[2] & akiko_precalc_shift[i]) != 0) << (akiko_precalc_bytenum[i][2]) ) | (((akiko_buffer[3] & akiko_precalc_shift[i]) != 0) << (akiko_precalc_bytenum[i][3]) ) | (((akiko_buffer[4] & akiko_precalc_shift[i]) != 0) << (akiko_precalc_bytenum[i][4]) ) | (((akiko_buffer[5] & akiko_precalc_shift[i]) != 0) << (akiko_precalc_bytenum[i][5]) ) | (((akiko_buffer[6] & akiko_precalc_shift[i]) != 0) << (akiko_precalc_bytenum[i][6]) ) | (((akiko_buffer[7] & akiko_precalc_shift[i]) != 0) << (akiko_precalc_bytenum[i][7]) ) | (((akiko_buffer[0] & akiko_precalc_shift[i+8]) != 0) << (akiko_precalc_bytenum[i+8][0]) ) | (((akiko_buffer[1] & akiko_precalc_shift[i+8]) != 0) << (akiko_precalc_bytenum[i+8][1]) ) | (((akiko_buffer[2] & akiko_precalc_shift[i+8]) != 0) << (akiko_precalc_bytenum[i+8][2]) ) | (((akiko_buffer[3] & akiko_precalc_shift[i+8]) != 0) << (akiko_precalc_bytenum[i+8][3]) ) | (((akiko_buffer[4] & akiko_precalc_shift[i+8]) != 0) << (akiko_precalc_bytenum[i+8][4]) ) | (((akiko_buffer[5] & akiko_precalc_shift[i+8]) != 0) << (akiko_precalc_bytenum[i+8][5]) ) | (((akiko_buffer[6] & akiko_precalc_shift[i+8]) != 0) << (akiko_precalc_bytenum[i+8][6]) ) | (((akiko_buffer[7] & akiko_precalc_shift[i+8]) != 0) << (akiko_precalc_bytenum[i+8][7]) ) | (((akiko_buffer[0] & akiko_precalc_shift[i+16]) != 0) << (akiko_precalc_bytenum[i+16][0])) | (((akiko_buffer[1] & akiko_precalc_shift[i+16]) != 0) << (akiko_precalc_bytenum[i+16][1])) | (((akiko_buffer[2] & akiko_precalc_shift[i+16]) != 0) << (akiko_precalc_bytenum[i+16][2])) | (((akiko_buffer[3] & akiko_precalc_shift[i+16]) != 0) << (akiko_precalc_bytenum[i+16][3])) | (((akiko_buffer[4] & akiko_precalc_shift[i+16]) != 0) << (akiko_precalc_bytenum[i+16][4])) | (((akiko_buffer[5] & akiko_precalc_shift[i+16]) != 0) << (akiko_precalc_bytenum[i+16][5])) | (((akiko_buffer[6] & akiko_precalc_shift[i+16]) != 0) << (akiko_precalc_bytenum[i+16][6])) | (((akiko_buffer[7] & akiko_precalc_shift[i+16]) != 0) << (akiko_precalc_bytenum[i+16][7])) | (((akiko_buffer[0] & akiko_precalc_shift[i+24]) != 0) << (akiko_precalc_bytenum[i+24][0])) | (((akiko_buffer[1] & akiko_precalc_shift[i+24]) != 0) << (akiko_precalc_bytenum[i+24][1])) | (((akiko_buffer[2] & akiko_precalc_shift[i+24]) != 0) << (akiko_precalc_bytenum[i+24][2])) | (((akiko_buffer[3] & akiko_precalc_shift[i+24]) != 0) << (akiko_precalc_bytenum[i+24][3])) | (((akiko_buffer[4] & akiko_precalc_shift[i+24]) != 0) << (akiko_precalc_bytenum[i+24][4])) | (((akiko_buffer[5] & akiko_precalc_shift[i+24]) != 0) << (akiko_precalc_bytenum[i+24][5])) | (((akiko_buffer[6] & akiko_precalc_shift[i+24]) != 0) << (akiko_precalc_bytenum[i+24][6])) | (((akiko_buffer[7] & akiko_precalc_shift[i+24]) != 0) << (akiko_precalc_bytenum[i+24][7])); } } And if this works, faster CD32 emulation for all. (It may be particularly useful for ARM platforms for e.g. PUAE, where CD32 emulation can lag.) Happy emulating! Mequa Last edited by Mequa; 27 January 2012 at 03:12. |
22 January 2012, 18:48 | #2 |
Registered User
Join Date: Nov 2009
Location: UK
Posts: 497
|
An alternative version to save memory when not using Akiko
Toni's response was that the speedup might make little difference, since C2P processing is a small part of emulating CD32 games (although the current source says the current algorithm is too inefficient), and the extra memory overhead for a rarely-used feature is unwanted bloat.
Here's a modified version which only allocates the memory for the table if Akiko emulation is needed: For the header, "akiko.h": Code:
/* Added by Mequa - for header: */ uae_u32* akiko_precalc_shift = NULL; uae_u32** akiko_precalc_bytenum = NULL; Code:
/* Precalculations for Akiko Chunky-to-Planar conversion by Mequa */ int akiko_precalculate_fast (void) { uae_u32 i, j; if (akiko_precalc_shift == NULL) akiko_precalc_shift = (uae_u32*) malloc(sizeof(uae_u32) * 32); if (akiko_precalc_shift == NULL) return 0; if (akiko_precalc_bytenum == NULL) akiko_precalc_bytenum = (uae_u32**) malloc(sizeof(uae_u32*) * 32); if (akiko_precalc_bytenum == NULL) return 0; for (i = 0; i < 32; i++) { akiko_precalc_bytenum[i] = NULL; akiko_precalc_bytenum[i] = (uae_u32*) malloc(sizeof(uae_u32) * 8); if (akiko_precalc_bytenum[i] == NULL) return 0; } for (i = 0; i < 32; i++) { akiko_precalc_shift [(int)i] = 1 << i; for (j = 0; j < 8; j++) akiko_precalc_bytenum[(int)i][(int)j] = (i >> 3) + ((7 - j) << 2); } return 1; } Code:
/* Used to destroy the precalculated table when no longer needed: */ static void akiko_kill_precalculate_table(void) { int i; if (akiko_precalc_shift != NULL) { free (akiko_precalc_shift); akiko_precalc_shift = NULL; } if (akiko_precalc_bytenum != NULL) { for (i = 0; i < 32; i++) { free (akiko_precalc_bytenum[i]); akiko_precalc_bytenum[i] = NULL; } free (akiko_precalc_bytenum); akiko_precalc_bytenum = NULL; } } You can also optionally add the following after its "int i;" line, which will automatically allocate the table when needed with only a tiny performance overhead: Code:
if ((akiko_precalc_shift == NULL) || (akiko_precalc_bytenum == NULL)) akiko_precalculate_fast(); However, this one tested a bit faster. |
22 January 2012, 19:14 | #3 |
WinUAE developer
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,583
|
I didn't mean the space used by lookup table was bad, no one cares about small extra memory usage today.
I only wondered about the algorithm and if lookup table is the right choice when using modern CPUs that run extremely fast as long as data is in caches and not that fast when caches need filling and if performance difference is noticeable when using real world programs. Emulation has huge amount of running code, even 100x speed up in tiny part of code may only mean 1% total performance increase when whole emulation is running. (Which is quite annoying!) |
24 January 2012, 09:13 | #4 |
Registered User
Join Date: Nov 2009
Location: UK
Posts: 497
|
Every little helps.
Edit: Real Amiga non-Akiko fast C2P routines, from my knowledge, use 68k assembler with some speed enhancements using blitter and/or copper. I expect lookup tables may be considered too memory-intensive for real unexpanded Amigas, but what better options are there for UAE? On x86, perhaps MMX instructions could be used for a speedup here, and/or a rewrite of the algorithm in x86 assembler. For ports of e.g. PUAE to ARM, PowerPC etc., however, this version is cross-platform enough. (ARM versions could always do with platform-specific optimisations though.) As a proof of concept, I even got my static version of this algorithm working in XC on an XMOS XC-1A Development Kit on its XMOS XS1-G4 chip! (Although still faster than the current UAE C2P, without parallel optimisations it's pretty inefficient on that platform; hence XS1-G4 will never be an ideal platform for UAE, which is not written to exploit parallelism... plus the aforementioned development kit has very little memory and stack space.) Toni seems to want a better algorithm than this however (although this is clearly an improvement on the current one). My initial algorithm with a statically-allocated lookup table may also be slightly faster in practice than the dynamic version. How could this be benchmarked in real-world application? I thought of making a custom-build of WinUAE with both the original algorithm and my optimised version implemented, with timers added to both procedures (with feedback), and both tested running e.g. Microcosm with standard CD32 settings (perhaps on Intel Atom). Even a tiny overall speedup may be better than nothing (see Tesco's slogan). |
24 January 2012, 11:42 | #5 | |
move.l #$c0ff33,throat
Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 6,865
|
Quote:
That is not the reason, large lookup tables are inefficient (cache) so it's faster to do it "on the fly" with the CPU. |
|
04 February 2012, 21:42 | #6 |
Registered User
Join Date: Nov 2009
Location: UK
Posts: 497
|
|
04 February 2012, 22:06 | #7 | |||
WinUAE developer
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,583
|
I moved above post from beta thread. Not really correct place for this kind of stuff.
Quote:
Quote:
Quote:
|
|||
04 February 2012, 22:19 | #8 |
Registered User
Join Date: Nov 2009
Location: UK
Posts: 497
|
I meant, did you include the static or dynamic version?
I wrote this myself and give permission, so the included code is fine to license under GPL for inclusion in WinUAE/PUAE etc. I was just wondering if I could still reuse my own routine above under another license in my own projects (e.g. revised BSD license as for my jAMOS project). Which seems to be the case. (If not I'll leave it just for UAE.) Last edited by Mequa; 04 February 2012 at 22:40. |
05 February 2012, 00:54 | #9 |
Registered User
Join Date: Aug 2006
Location: Italy
Posts: 109
|
Sure, you can. That does mean that, depending on the source where the code was copied, it will be GPL or BSD. Or you may want to specify the double license immediately. Or stick to the more permissive one (no problems in having modified BSD or MIT licensed code in GPL projects). Because of copyleft you can't say later that the code wasn't supposed to be GPL/BSD.
|
05 February 2012, 02:47 | #10 |
Registered User
Join Date: Nov 2009
Location: UK
Posts: 497
|
Or I could just make this routine 100% public domain and usable without restriction?
Perhaps that would be my best bet. Of course this is already open-sourced, and I want to keep it GPL-compatible. |
Currently Active Users Viewing This Thread: 1 (0 members and 1 guests) | |
Thread Tools | |
Similar Threads | ||||
Thread | Thread Starter | Forum | Replies | Last Post |
Akiko Chunky-to-Planar conversion | Mequa | support.WinUAE | 5 | 21 January 2012 10:50 |
Emulation of Graffiti chunky graphics device | pmc | request.UAE Wishlist | 29 | 30 November 2011 18:52 |
Amiga Bitplanes & Planar vs Chunky - Technical Expert Required | CodyJarrett | project.APoV | 4 | 12 November 2009 11:14 |
Chunky to planar | pmc | Coders. Tutorials | 11 | 15 September 2009 16:20 |
Chunky to planar on a500 | Alter | Coders. General | 28 | 10 April 2007 02:53 |
|
|