08 June 2014, 21:59 | #1 |
Registered User
Join Date: Apr 2014
Location: Germany
Posts: 154
|
Which is the fastest software C2P 1x1 routine
Which are the fastest C2P routines?
Does anyone know how many instructions they need per converted pixel? Thanks in advance. |
08 June 2014, 23:22 | #2 |
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
Kalms' c2p code is one of the best and he has made it available:
http://eab.abime.net/showthread.php?t=52125 |
08 June 2014, 23:29 | #3 |
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
|
Converting 32 chunky-pixels to 8 longwords of bitplane-data usually takes:
8 longword reads
14*2*4 + 10*2 = 132 ALU operations
8 longword writes
(you can get rid of about 10 of those ALU operations by being tricky)
There is usually a bit of overhead, let's say 10-20 extra ALU operations (counters, shuffling data into backup registers, advancing pointers, etc.).
The above figures assume that the ALU operations are done on 32-bit registers. Fewer ALU ops are required if you have the data in smaller registers. |
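To make the figures above concrete, here is a minimal Python sketch of what "32 chunky pixels to 8 longwords" means, done the slow bit-by-bit way, followed by the per-pixel cost those numbers imply. The function name and the convention that the leftmost pixel lands in the most significant bit are assumptions for illustration, not taken from Kalms' sources.

```python
def c2p_reference(pixels):
    """Naive chunky-to-planar: 32 8-bit pixels -> 8 32-bit bitplane longwords.

    Assumption: pixel 0 is the leftmost pixel and maps to bit 31 of each plane.
    """
    assert len(pixels) == 32
    planes = [0] * 8
    for i, px in enumerate(pixels):
        for k in range(8):
            if (px >> k) & 1:
                planes[k] |= 1 << (31 - i)
    return planes


# Operation counts quoted in the post, for one block of 32 pixels.
ALU_OPS = 14 * 2 * 4 + 10 * 2            # 132 ALU operations
MEM_OPS = 8 + 8                          # 8 longword reads + 8 longword writes
OPS_PER_PIXEL = (ALU_OPS + MEM_OPS) / 32 # excludes the 10-20 ops of loop overhead
```

So, before loop overhead, the quoted figures work out to a little under 5 operations per converted pixel.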
09 June 2014, 07:53 | #4 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
So you can't design a single CPU instruction that would convert a pixel, if that's what you have in mind. |
|
09 June 2014, 08:03 | #5 |
Registered User
Join Date: Apr 2014
Location: Germany
Posts: 154
|
Hi Kalms,
Edit: Ah, I saw your sources on Google now. Quite a long collection.
Would it help your code if you had a few more data registers?
Last edited by Gunnar; 09 June 2014 at 08:16. |
09 June 2014, 08:14 | #6 |
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
|
In theory you should be able to do 6bpl c2p in,
8 longword reads
~90 ALU ops
6 longword writes
however the routines I have lying around (from http://unitedstatesofamiga.googlecod...P-20100426.zip) seem to be at ~120 ALU ops.

If you want to speed the transform up by introducing new instructions, then a reasonably simple way to do so is to introduce a pair of new instructions that each perform half of a merge,
a = (b & ~e) | ((c & ~e) >> d)
(and the dual of this would be, a = ((b & e) << d) | (c & e))
where b & c are input registers, and d & e are constants. e can be computed from d. The operation above is essentially a bit selection operation.

You would use it like such:
MERGE_FIRST_HALF d0,d1,4,d2
a = d2
b = d0
c = d1
d = 4
e = small_lookup_table[d] = $0f0f0f0f
The above instruction has 2 read-only registers and 1 write-only register operand.

If you prefer to have just two read+write register operands then you can create an instruction that intermingles bits between the two registers:
a0 = (b & ~e) | ((c & ~e) >> d)
a1 = ((b & e) << d) | (c & e)
b = a0
c = a1
and you use it like:
MERGE d0,d1,4
This would do a bit shuffle between d0 & d1. With that instruction you will be down at 20 ALU ops.

This does feel like a bit of a moot point though. The C2P transform itself does not take a lot of time (it takes about 4-5ms on a 50MHz 68060 for 320x256 pixels @ 8bpl); the bulk of the time disappears in fast->chip copying today, so that is where your efforts would make a bigger impact.
Last edited by TCD; 09 June 2014 at 09:16. Reason: Back-to-back posts merged |
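The proposed MERGE can be emulated in a few lines of Python, which also shows where the figure of 20 comes from: five passes of four merges each over eight registers. This is only a sketch. The pass order, the register pairing and the resulting register-to-bitplane order are one workable choice worked out for this example, not necessarily what Kalms' routines use, and the names (`merge`, `c2p_merge`, `PASSES`, `PLANE_OF_REG`) are invented here.

```python
M32 = 0xFFFFFFFF

# small_lookup_table[d] from the post: the mask e for each shift distance d.
MASKS = {16: 0x0000FFFF, 8: 0x00FF00FF, 4: 0x0F0F0F0F,
         2: 0x33333333, 1: 0x55555555}


def merge(b, c, d):
    """Emulate the two-operand MERGE b,c,d: returns the new (b, c)."""
    e = MASKS[d]
    not_e = ~e & M32
    a0 = (b & not_e) | ((c & not_e) >> d)
    a1 = ((b & e) << d) | (c & e)
    return a0, a1


# (shift d, index distance between the two registers of each pair).
# Assumed pass order; other orders work with a different output mapping.
PASSES = [(16, 4), (8, 2), (2, 4), (1, 2), (4, 1)]

# With the pass order above, register r ends up holding this bitplane.
PLANE_OF_REG = [7, 3, 6, 2, 5, 1, 4, 0]


def c2p_merge(pixels):
    """Convert 32 chunky bytes to 8 bitplane longwords using only merges.

    Returns (planes, merge_count); pixel 0 maps to bit 31 of each plane.
    """
    # Big-endian longword reads, as on the 68k.
    regs = [int.from_bytes(pixels[n:n + 4], "big") for n in range(0, 32, 4)]
    merges = 0
    for d, stride in PASSES:
        for lo in range(8):
            if not lo & stride:
                hi = lo | stride
                regs[lo], regs[hi] = merge(regs[lo], regs[hi], d)
                merges += 1
    planes = [0] * 8
    for r, k in enumerate(PLANE_OF_REG):
        planes[k] = regs[r]
    return planes, merges
```

Each merge swaps one bit of the register index with one bit of the bit position, so five passes are enough to move all eight address bits into place; the merge count returned for a 32-pixel block is 5 * 4 = 20.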
09 June 2014, 09:18 | #7 |
Registered User
Join Date: Jan 2012
Location: USA
Posts: 372
|
I wonder if there's an opportunity to use software pipelining/instruction interleaving to get work done while chipram writes are pending, since the bus interface of the 68020 and later runs concurrently with instruction execution -- the idea being to trickle out writes that can overlap with calculations.
|
09 June 2014, 09:34 | #8 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
|
|
09 June 2014, 10:13 | #9 |
Registered User
Join Date: Apr 2014
Location: Germany
Posts: 154
|
Hi Kalms,
Thanks for the answer. Maybe I should explain real quick what silly idea I had. The idea is just for fun and "silly" but useful for existing Vampire600 owners with no truecolor video out. For them it might be a nice hack.

I have some games which use a 15bit GFX mode. I wondered if I could hack together a display conversion which spits out 320x256x64 colors on OCS, and this at 20 FPS or more. The silly idea is to render the game in 15/12bit color all the time, then to do a lookup per pixel with a 12bit colormap to find the best matching color pen, and then to do C2P. This means the color lookup is part of the C2P routine and its time is hidden in the Chipmem write time.

As useful features we currently have:
16 data registers.
We can do a free register-to-register MOVE instruction per cycle. This means MOVE.l D0,D1 OPP.L D2,D1 -- are merged into 1 single-cycle instruction.
A PERM instruction which does a byte permute: 2 source registers, 1 register written, immediate byte selector. I don't know if this helps for the c2p?
Phoenix can do memory reads in parallel without blocking. This means that if you write a C2P or texture mapper cleverly, or slightly interleaved, the memory latency can be reduced to zero.
PIXMERGE, a bilinear interpolate instruction which does bilinear mixing of 2 RGB pixels in a single instruction. Maybe nice for RGB texture operations.

Adding a SELECT instruction which picks bitwise would be simple. This would also help soft-blitting. I know the idea is silly - but I'm curious what we could do with an A600 with it.

Can I also write this as:
SrcA = b
SrcB = c
Mask = e
SrcA <<= d;
Dest = (SrcA & mask) | (SrcB & ~mask) -- This is basically a Blit minterm
Last edited by TCD; 10 June 2014 at 08:59. Reason: Back-to-back posts merged |
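On the question of whether the merge can be written as a shift followed by a blitter-style select: a quick Python check suggests it can, provided the select mask is the complement of the e in the merge formulas (that is, ~e, which equals e << d for the usual masks) rather than e itself. The first half then needs the shift applied to SrcB instead, in the other direction. The function names below are made up for illustration.

```python
M32 = 0xFFFFFFFF


def select(src_a, src_b, mask):
    """Bitwise select: src_a where mask is 1, src_b where mask is 0."""
    return (src_a & mask) | (src_b & ~mask & M32)


def merge_first_half(b, c, d, e):
    """a = (b & ~e) | ((c & ~e) >> d), as given earlier in the thread."""
    return (b & ~e & M32) | ((c & ~e & M32) >> d)


def merge_second_half(b, c, d, e):
    """a = ((b & e) << d) | (c & e), the dual."""
    return ((b & e) << d) | (c & e)


def first_half_via_select(b, c, d, e):
    # Shift SrcB right, then select with mask = ~e.
    return select(b, c >> d, ~e & M32)


def second_half_via_select(b, c, d, e):
    # Shift SrcA left (truncated to 32 bits), then select with mask = ~e.
    return select((b << d) & M32, c, ~e & M32)
```

With b = $12345678, c = $9abcdef0, d = 4 and e = $0f0f0f0f, both formulations give $193b5d7f for the first half and $2a4c6e80 for the second.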
09 June 2014, 12:28 | #10 |
Glastonbridge Software
Join Date: Jan 2012
Location: Edinburgh/Scotland
Posts: 2,243
|
It should be possible to make an instruction that operates on all 8 data registers simultaneously and does C2P on 32 pixels in a single cycle, right?
(not including reading and writing) |
09 June 2014, 13:36 | #11 | |
Registered User
Join Date: Apr 2014
Location: Germany
Posts: 154
|
Quote:
The limitation here is the register file design and its limited number of read ports. In other words, the ALU cannot read 8 registers in parallel. |
|
09 June 2014, 21:05 | #12 |
Glastonbridge Software
Join Date: Jan 2012
Location: Edinburgh/Scotland
Posts: 2,243
|
You don't necessarily need to use the read or write ports. It is a fixed permutation of bits so it could be hard-wired into the register file (in theory) like a set of shift registers. The instruction would only need to pulse the shift clock.
|
10 June 2014, 00:17 | #13 |
Registered User
Join Date: Mar 2012
Location: Norfolk, UK
Posts: 1,153
|
In most FPGA-based CPUs (and I'm assuming the Apollo core is no exception - Gunnar please correct me if I'm wrong), the register file is implemented as one or more of the FPGA's built-in dual-port RAM blocks, which limits what you can do in terms of hard-wiring.
|
10 June 2014, 08:52 | #14 | |
Registered User
Join Date: Apr 2014
Location: Germany
Posts: 154
|
Quote:
But let's get back to the original topic. The idea is to try out a solution for displaying hicolor 15bit screens in the 64-color ECS mode. Last edited by TCD; 10 June 2014 at 08:59. Reason: Back-to-back posts merged |
|
10 June 2014, 08:59 | #15 | |
Thalion Webshrine
Join Date: Jan 2004
Location: Oxford
Posts: 14,354
|
Quote:
|
|
10 June 2014, 09:47 | #16 | |
Registered User
Join Date: Apr 2014
Location: Germany
Posts: 154
|
Quote:
A Cyclone memory block can have a width of 32 bits = Amiga register width. There is no such thing as a "bit write mask", but there are byte enables. Read and write ports can be increased by cloning the register file.
1 memory block gives you 1 read port and 1 write port
2 read ports and 1 write port can be generated with 2 memblocks
2 read ports and 2 write ports can be generated with 4 memblocks
3 read ports and 3 write ports can be generated with 9 memblocks
4 read ports and 4 write ports can be generated with 16 memblocks
8 read ports and 8 write ports can be generated with 64 memblocks
You see the pattern. The whole FPGA has 36 memblocks. And you do not want to use them all for your registers - you may want some for caches too. |
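The pattern in that list is simply read ports times write ports. A tiny Python sketch, with a made-up function name and the 36-block budget taken from the post, makes the limit easy to see:

```python
FPGA_MEMBLOCKS = 36  # total memory blocks in the FPGA, as stated in the post


def memblocks_needed(read_ports, write_ports):
    """Blocks needed when ports are added by cloning the register file."""
    return read_ports * write_ports


def fits(read_ports, write_ports, reserved_for_caches=0):
    """True if the register file fits, leaving some blocks for caches."""
    budget = FPGA_MEMBLOCKS - reserved_for_caches
    return memblocks_needed(read_ports, write_ports) <= budget
```

An 8-read, 8-write register file would need 64 blocks, which is more than the whole device offers, while 4 by 4 takes 16 of the 36.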
|
10 June 2014, 09:48 | #17 |
Registered User
Join Date: Dec 2013
Location: Lake Havasu City, AZ
Posts: 741
|
I made some very fast C2P routines for my Mac emulations. There is no mention of whether you are converting an off-screen bitmap to an Amiga screen. If so, there are some neat tricks you can do to synchronize the stores to chipmem. You can also use the MMU to map a compare page to determine when sections of the screen have changed. With a 68030+ it is easy to use a 4K page (MMU) and then a physical compare page to convert only the data that has changed. Back in the day, I spent months working out the fastest C2P routines for various bit depths (all assembly code of course). I even wrote a test suite to compare C2P code. I will have to locate that stuff.
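The "only convert what changed" idea can be illustrated without an MMU by comparing the current frame against a saved copy in 4K chunks. To be clear, this Python sketch is a software-only stand-in and not the MMU technique described above; the function name and the 4096-byte granularity are assumptions chosen to mirror the 4K page size.

```python
PAGE = 4096  # mirrors the 4K MMU page size mentioned in the post


def changed_pages(current, previous, page=PAGE):
    """Return the byte offsets of pages that differ between two frames.

    Only these pages would need to be run through C2P and copied to chipmem.
    """
    assert len(current) == len(previous)
    return [off for off in range(0, len(current), page)
            if current[off:off + page] != previous[off:off + page]]
```

For a 320x256 8-bit chunky buffer (81920 bytes, 20 pages), changing a single pixel marks just one page for conversion.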
Last edited by JimDrew; 10 June 2014 at 09:54. |
10 June 2014, 10:01 | #18 |
Registered User
Join Date: Apr 2014
Location: Germany
Posts: 154
|
This would be nice.
Maybe we can all work together as a team to improve the raw ideas that I have? I would like to prepare the following demo code:
The main code will render from fast to fast in direct color (15bit/12bit ..)
As a test I would like to add bilinear filtering of texture pixels (smoothing).
In the display routine I would like to look up, for each direct color value, the best match in the colortable (64 colors EHB) and then do C2P.
I hope to get the whole thing running at around 20 FPS. As I have no MMU the code will render and convert the whole screen every time, which is probably also the best approach for video playback or fast games. |
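The colour lookup step could be precomputed as a 4096-entry table that maps every 12-bit RGB value to its nearest pen in a 64-colour EHB palette. The sketch below is Python for illustration only; the base palette is a made-up example, the squared-distance metric is just one simple choice, and the function names are hypothetical.

```python
def ehb_palette(base):
    """Extend 32 base colours (r, g, b each 0..15) with their half-bright copies."""
    assert len(base) == 32
    return list(base) + [(r >> 1, g >> 1, b >> 1) for r, g, b in base]


def build_lut(palette):
    """Map each 12-bit RGB value (0x000..0xFFF) to the index of the nearest pen."""
    lut = bytearray(4096)
    for rgb in range(4096):
        r, g, b = rgb >> 8, (rgb >> 4) & 15, rgb & 15
        lut[rgb] = min(
            range(len(palette)),
            key=lambda p: (palette[p][0] - r) ** 2
                        + (palette[p][1] - g) ** 2
                        + (palette[p][2] - b) ** 2,
        )
    return lut


# Hypothetical base palette: 4 red levels x 4 green levels x 2 blue levels.
BASE = [(((n >> 3) & 3) * 5, ((n >> 1) & 3) * 5, (n & 1) * 15) for n in range(32)]
LUT = build_lut(ehb_palette(BASE))
```

At display time each 12-bit pixel then costs one table read, `pen = LUT[rgb12]`, before the pens are fed to C2P.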
10 June 2014, 11:02 | #19 | |
Glastonbridge Software
Join Date: Jan 2012
Location: Edinburgh/Scotland
Posts: 2,243
|
Quote:
Nevertheless this does not really require "8 port register file", because individual random access is not required, each port only has to access a single register. So it wouldn't need 8*8=64 blocks, only 8 blocks, that is, 8 separate cloned registers with 1 port each. |
|
10 June 2014, 13:18 | #20 | ||
Registered User
Join Date: Mar 2012
Location: Norfolk, UK
Posts: 1,153
|
Quote:
Quote:
|
||