Which is the fastest software C2P 1x1 routine

Gunnar · 08 June 2014, 21:59

Which are the fastest C2P routines?

Does anyone know how many instructions they need per converted pixel?

Thanks in advance

matthey · 08 June 2014, 23:22

Quote:

Originally Posted by Gunnar

Which are the fastest C2P routines?

Kalms' c2p code is one of the best, and he has made it available:

http://eab.abime.net/showthread.php?t=52125

Kalms · 08 June 2014, 23:29

Converting 32 chunky-pixels to 8 longwords of bitplane-data usually takes:

8 longword reads
14*2*4 + 10*2 = 132 ALU operations
8 longword writes

(you can get rid of about 10 of those ALU operations by being tricky)

There is usually a bit of overhead, let's say 10-20 extra ALU operations (counters, shuffling data into backup registers, advancing pointers, etc).

The above figures assume that the ALU operations are done on 32-bit registers. Fewer ALU ops are required if you have the data in smaller registers.
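Dividing those figures by 32 gives a rough answer to the per-pixel question. The sketch below only does that arithmetic; the function name, and the choice to count each longword read or write as one operation, are assumptions made for illustration and are not stated in the thread.

```c
#include <assert.h>

/* Operations per pixel for one 32-pixel block, using the figures quoted
 * above. Counting every longword read/write as a single operation is an
 * assumption made for this illustration. */
double ops_per_pixel(int overhead_ops)
{
    const int reads  = 8;                    /* 8 longword reads  */
    const int alu    = 14 * 2 * 4 + 10 * 2;  /* = 132 ALU ops     */
    const int writes = 8;                    /* 8 longword writes */

    assert(overhead_ops >= 0);
    return (reads + alu + writes + overhead_ops) / 32.0;
}
```

With the 10-20 overhead operations mentioned above, this works out to roughly 5 operations per converted pixel.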

meynaf · 09 June 2014, 07:53

Quote:

Originally Posted by Gunnar

Does anyone know how many instructions they need per converted pixel?

Pixels are not converted individually. You read 32 pixels at once, mix the bits, and then write your pixels back.

So you can't design a single CPU instruction that would convert a pixel, if that's what you have in mind.
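What "mixing the bits" means can be sketched with a deliberately naive reference routine. This is not one of the fast routines discussed in this thread, only a bit-by-bit definition of the transform for one block of 32 pixels; the function name is hypothetical. It uses the usual Amiga bitplane convention that the leftmost pixel sits in the most significant bit.

```c
#include <assert.h>
#include <stdint.h>

/* Naive reference: convert 32 chunky 8-bit pixels into 8 bitplane
 * longwords. Bit p of pixel i ends up in bit (31 - i) of planes[p]. */
void c2p_naive_32px(const uint8_t chunky[32], uint32_t planes[8])
{
    assert(chunky != 0 && planes != 0);

    for (int p = 0; p < 8; p++)
        planes[p] = 0;

    for (int i = 0; i < 32; i++) {
        for (int p = 0; p < 8; p++) {
            if (chunky[i] & (1u << p))
                planes[p] |= UINT32_C(1) << (31 - i);
        }
    }
}
```

Every output longword depends on all 32 input pixels, which is why the work cannot be split up per pixel.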

Gunnar · 09 June 2014, 08:03

Hi Kalms,

Edit:
Ah I saw your sources on Google now. Quite a long collection.
Would it help your code if you had a few more data registers?

Kalms · 09 June 2014, 08:14

In theory you should be able to do 6bpl c2p in,

8 longword reads
~90 ALU ops
6 longword writes

however the routines I have lying around (from http://unitedstatesofamiga.googlecod...P-20100426.zip) seem to be at ~120 ALU ops.

If you want to speed the transform up by introducing new instructions, then a reasonably simple way to do so is to introduce a pair of new instructions that performs half of a merge each,

a = (b & ~e) | ((c & ~e) >> d)
(and the dual of this would be, a = ((b & e) << d) | (c & e))

where b & c are input registers, and d & e are constants. e can be computed from d. The operation above is essentially a bit selection operation. You would use it like this:

MERGE_FIRST_HALF d0,d1,4,d2
a = d2
b = d0
c = d1
d = 4
e = small_lookup_table[d] = $0f0f0f0f

The above instruction has 2 read-only registers and 1 write-only register operand.

If you prefer to have just two read+write register operands then you can create an instruction that intermingles bits between the two registers:

a0 = (b & ~e) | ((c & ~e) >> d)
a1 = ((b & e) << d) | (c & e)
b = a0
c = a1

and you use it like:

MERGE d0,d1,4

This would do a bit shuffle between d0 & d1. With that instruction you will be down at 20 ALU ops.
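For reference, the two-operand MERGE described above can be modelled in plain C as follows. This is only a software model of the formulas in this post. The thread gives just the mask for d = 4 ($0f0f0f0f); the other table entries are an assumption based on the usual c2p merge distances, and the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Plays the role of small_lookup_table[d]. Only the d = 4 entry comes
 * from the post; the remaining entries are assumed. */
uint32_t merge_mask(unsigned d)
{
    switch (d) {
    case 16: return UINT32_C(0x0000ffff);
    case 8:  return UINT32_C(0x00ff00ff);
    case 4:  return UINT32_C(0x0f0f0f0f);
    case 2:  return UINT32_C(0x33333333);
    case 1:  return UINT32_C(0x55555555);
    default: assert(!"unsupported shift distance"); return 0;
    }
}

/* Software model of MERGE b,c,d: the two registers exchange bit groups. */
void merge(uint32_t *b, uint32_t *c, unsigned d)
{
    const uint32_t e  = merge_mask(d);
    const uint32_t a0 = (*b & ~e) | ((*c & ~e) >> d);
    const uint32_t a1 = ((*b & e) << d) | (*c & e);

    *b = a0;
    *c = a1;
}
```

For example, with b = $12345678, c = $9abcdef0 and d = 4, the call leaves b = $193b5d7f and c = $2a4c6e80.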

This does feel like a bit of a moot point though.

The C2P transform itself does not take a lot of time (it takes about 4-5ms on a 50MHz 68060 for 320x256 pixels @ 8bpl); the bulk of the time goes into fast->chip copying today, so that is where your efforts would make a bigger impact.

mc6809e · 09 June 2014, 09:18

Quote:

Originally Posted by Kalms

The C2P transform itself does not take a lot of time (it takes about 4-5ms on a 50MHz 68060 for 320x256 pixels @ 8bpl); the bulk of the time goes into fast->chip copying today, so that is where your efforts would make a bigger impact.

I wonder if there's an opportunity to use software pipelining/instruction interleaving to get work done while chipram writes are pending, since the bus interface of the 68020 and later runs concurrently with instruction execution -- the idea being to trickle out writes that can overlap with calculations.

meynaf · 09 June 2014, 09:34

Quote:

Originally Posted by mc6809e

I wonder if there's an opportunity to use software pipelining/instruction interleaving to get work done while chipram writes are pending, since the bus interface of the 68020 and later runs concurrently with instruction execution -- the idea being to trickle out writes that can overlap with calculations.

This is already done. Any decent 68030 c2p does that, for example. On a 68030, in fact, doing a c2p in fastmem would bring very little benefit, as the instructions are enough to fully "cover" the chipmem waits.

Gunnar · 09 June 2014, 10:13

Hi Kalms,

Thanks for the answer.

Maybe I should explain real quick what silly idea I had.
The idea is just for fun and "silly" but useful for existing Vampire600 owners with no truecolor video out. For them it might be a nice hack.

I have some games which use a 15bit GFX mode.
I wondered if I could hack together a display conversion
which spits out 320x256x64 colors on OCS, and this at 20 FPS or more.

The silly idea is to render the game in 15/12bit color the whole time,
then to do a lookup per pixel in a 12bit colormap to find the best matching color pen, and then to do C2P.

This means the color lookup is part of the C2P routine and its time is hidden in the Chipmem write time.
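The per-pixel lookup could be sketched like this: a 4096-entry table maps every 12-bit color to the nearest palette pen, so the conversion loop only needs one table read per pixel. Everything here is an assumption for illustration: the function names, the 0RRRRRGGGGGBBBBB layout of the 15-bit pixels, and the squared-RGB-distance metric used to pick the "best" pen.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define RGB12_COUNT 4096

/* Reduce a 15-bit pixel (assumed layout 0RRRRRGGGGGBBBBB) to 12-bit
 * RRRRGGGGBBBB by dropping the lowest bit of each channel. */
uint16_t rgb15_to_rgb12(uint16_t p)
{
    unsigned r = (p >> 11) & 0xf;
    unsigned g = (p >> 6) & 0xf;
    unsigned b = (p >> 1) & 0xf;
    return (uint16_t)((r << 8) | (g << 4) | b);
}

/* Fill lut[] so that lut[rgb12] is the pen whose 12-bit palette color
 * has the smallest squared distance to rgb12. */
void build_pen_lut(const uint16_t *palette, int ncolors,
                   uint8_t lut[RGB12_COUNT])
{
    assert(palette != 0 && ncolors > 0 && ncolors <= 64);

    for (int rgb = 0; rgb < RGB12_COUNT; rgb++) {
        int best = 0, best_dist = INT_MAX;
        for (int pen = 0; pen < ncolors; pen++) {
            int dr = ((rgb >> 8) & 0xf) - ((palette[pen] >> 8) & 0xf);
            int dg = ((rgb >> 4) & 0xf) - ((palette[pen] >> 4) & 0xf);
            int db = (rgb & 0xf) - (palette[pen] & 0xf);
            int dist = dr * dr + dg * dg + db * db;
            if (dist < best_dist) {
                best_dist = dist;
                best = pen;
            }
        }
        lut[rgb] = (uint8_t)best;
    }
}
```

The table is built once per palette; per pixel the display routine would then do `pen = lut[rgb15_to_rgb12(pixel)]` before feeding the pens into the C2P.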

As useful features we currently have:

16 Data registers

We get one free register-to-register MOVE instruction per cycle.
This means
MOVE.l D0,D1
OPP.L D2,D1 -- are merged into one single-cycle instruction

PERM instruction which does a byte permute:
2 source registers, 1 register written, immediate byte selector.
I don't know if this helps for the c2p?

Phoenix can do memory reads in parallel without blocking.
This means that if you write a C2P or texture mapper cleverly, or slightly interleaved,
the memory latency can be reduced to zero.

PIXMERGE, a bilinear interpolation instruction which does bilinear mixing of 2 RGB pixels in a single instruction. Maybe nice for RGB texture operations.

Adding a SELECT instruction which picks bitwise would be simple.
This would also help Softblitting.

I know the idea is silly - but I'm curious what we could do with an A600 with it.

Quote:

Originally Posted by Kalms

a0 = (b & ~e) | ((c & ~e) >> d)
a1 = ((b & e) << d) | (c & e)

Can I also write this as:

SrcA =b
SrcB =c
Mask =e
Dst

SrcA <<= d;
Dest = (SrcA & mask) | (SrcB & ~mask) -- This is basically a Blit minterm
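A quick way to check this shift-then-select form against the second merge half is to model both in C. The sketch assumes the usual c2p masks, for which e << d equals ~e; under that assumption the select mask seems to come out as ~e rather than e. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bitwise select: take bits from src_a where mask is 1, otherwise from
 * src_b (the blitter-minterm style operation described above). */
uint32_t bit_select(uint32_t src_a, uint32_t src_b, uint32_t mask)
{
    return (src_a & mask) | (src_b & ~mask);
}

/* a1 = ((b & e) << d) | (c & e), written as shift-then-select.
 * Only valid for masks where (e << d) == ~e, e.g. d = 4, e = $0f0f0f0f. */
uint32_t merge_second_half(uint32_t b, uint32_t c, unsigned d, uint32_t e)
{
    assert(d > 0 && d < 32);
    assert((uint32_t)(e << d) == (uint32_t)~e);
    return bit_select(b << d, c, ~e);
}
```

The first half, a0, can be handled the same way with a right shift of c and the same ~e mask selecting from b.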

Mrs Beanbag · 09 June 2014, 12:28

should be possible to make an instruction that operates on all 8 data registers simultaneously and does C2P on 32 pixels in a single cycle, right?
(not including reading and writing)

Gunnar · 09 June 2014, 13:36

Quote:

Originally Posted by Mrs Beanbag

should be possible to make an instruction that operates on all 8 data registers simultaneously and does C2P on 32 pixels in a single cycle, right?

No, unfortunately not.
The limitation here is the register file design and the limited number of read ports.
In other words, the ALU cannot read 8 registers in parallel.

Mrs Beanbag · 09 June 2014, 21:05

Quote:

Originally Posted by Gunnar

No, unfortunately not.
The limitation here is the register file design and the limited number of read ports.
In other words, the ALU cannot read 8 registers in parallel.

You don't necessarily need to use the read or write ports. It is a fixed permutation of bits so it could be hard-wired into the register file (in theory) like a set of shift registers. The instruction would only need to pulse the shift clock.

robinsonb5 · 10 June 2014, 00:17

Quote:

Originally Posted by Mrs Beanbag

You don't necessarily need to use the read or write ports. It is a fixed permutation of bits so it could be hard-wired into the register file (in theory) like a set of shift registers. The instruction would only need to pulse the shift clock.

In most FPGA-based CPUs (and I'm assuming the Apollo core is no exception - Gunnar please correct me if I'm wrong), the register file is implemented as one or more of the FPGA's built-in dual-port RAM blocks, which limits what you can do in terms of hard-wiring.

Gunnar · 10 June 2014, 08:52

Quote:

Originally Posted by robinsonb5

In most FPGA-based CPUs (and I'm assuming the Apollo core is no exception - Gunnar please correct me if I'm wrong), the register file is implemented as one or more of the FPGA's built-in dual-port RAM blocks, which limits what you can do in terms of hard-wiring.

This is correct.

But let's get back to the original topic.

The idea is to try out a display solution that shows hicolor 15bit screens in a 64-color ECS mode.

alexh · 10 June 2014, 08:59

Quote:

Originally Posted by robinsonb5

In most FPGA-based CPUs (and I'm assuming the Apollo core is no exception - Gunnar please correct me if I'm wrong), the register file is implemented as one or more of the FPGA's built-in dual-port RAM blocks, which limits what you can do in terms of hard-wiring.

But most FPGAs have block RAMs which are designed to be much wider than anything you might ever use, are chainable (you can use more than one to represent a single RAM) and have individual bit write masks. I'm not 100% familiar with the Vampire, but making 8 registers readable in a single cycle should be possible with little overhead.

Gunnar · 10 June 2014, 09:47

Quote:

Originally Posted by alexh

But most FPGAs have block RAMs which are designed to be much wider than anything you might ever use, are chainable (you can use more than one to represent a single RAM) and have individual bit write masks. I'm not 100% familiar with the Vampire, but making 8 registers readable in a single cycle should be possible with little overhead.

No.
A Cyclone memory block can have a width of 32bit = Amiga register width.
There is no such thing as a "bit write mask", but there are byte enables.
Read and write ports can be increased by cloning the register file.
1 read port and 1 write port can be generated with 1 Memblock
2 read ports and 1 write port can be generated with 2 Memblocks
2 read ports and 2 write ports can be generated with 4 Memblocks
3 read ports and 3 write ports can be generated with 9 Memblocks
4 read ports and 4 write ports can be generated with 16 Memblocks
8 read ports and 8 write ports can be generated with 64 Memblocks

You see the pattern.

The whole FPGA has 36 memblocks.
And you do not want to use all for your registers - but maybe want some for Caches too.
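The pattern in the list is simply read ports times write ports. As a small sketch (the function name is made up, and the rule is only inferred from the numbers listed above):

```c
#include <assert.h>

/* Memory blocks needed to build a register file with the given number of
 * read and write ports by cloning, following the list above. */
int memblocks_needed(int read_ports, int write_ports)
{
    assert(read_ports >= 1 && write_ports >= 1);
    return read_ports * write_ports;
}
```

For 8 read and 8 write ports this gives 64 blocks, which is more than the 36 the whole FPGA has.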

JimDrew · 10 June 2014, 09:48

I made some very fast C2P routines for my Mac emulations. You don't mention whether you are converting an off-screen bitmap to an Amiga screen. If so, there are some neat tricks you can do to synchronize the stores to chipmem. You can also use the MMU to map a compare page to determine when sections of the screen have changed. With a 68030+ it is easy to use a 4K page (MMU) and then a physical compare page to convert only the data that has changed. Back in the day, I spent months working out the fastest C2P routines for various bit depths (all assembly code, of course). I even wrote a test suite to compare C2P code. I will have to locate that stuff.
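The "convert only what changed" idea can be sketched without an MMU as a plain software comparison against a shadow copy, one 4K page at a time. This is a simplification of what is described above, not the MMU-based method itself; all names, including the callback type, are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_BYTES 4096

/* Hypothetical callback that converts one page of chunky data (C2P). */
typedef void (*convert_fn)(const uint8_t *chunky, size_t offset, size_t len);

/* Compare the chunky buffer with its shadow copy page by page, convert
 * only the pages that differ, and refresh the shadow copy for them.
 * Returns the number of pages that were converted. */
size_t convert_changed_pages(const uint8_t *chunky, uint8_t *shadow,
                             size_t size, convert_fn convert)
{
    size_t converted = 0;

    assert(size % PAGE_BYTES == 0);

    for (size_t off = 0; off < size; off += PAGE_BYTES) {
        if (memcmp(chunky + off, shadow + off, PAGE_BYTES) != 0) {
            if (convert)
                convert(chunky + off, off, PAGE_BYTES);
            memcpy(shadow + off, chunky + off, PAGE_BYTES);
            converted++;
        }
    }
    return converted;
}
```

A 320x256 8-bit chunky buffer is 81920 bytes, which is exactly 20 such pages. The comparison itself costs a read of both buffers, so this only pays off when large parts of the screen stay unchanged.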

Gunnar · 10 June 2014, 10:01

Quote:

Originally Posted by JimDrew

I will have to locate that stuff.

This would be nice.

Maybe we can all work together as team to improve the raw ideas that I have?

I would like to prepare the following demo code:
The main code will render from fast to fast in direct color (15bit/12bit ..)
As test I would like to add bilinear filtering of texture pixels (smoothing).

In the display routine I would like to look up, for each direct color value, the best match in the color table (64 colors EHB) and then do C2P.

I hope to get the whole thing running at around 20 FPS.

As I have no MMU, the code will render and convert the whole screen every time,
which is probably also the best approach for video playback or fast games.

Mrs Beanbag · 10 June 2014, 11:02

Quote:

Originally Posted by Gunnar

No.
A Cyclone memory block can have a width of 32bit = Amiga register width.
There is no such thing as a "bit write mask", but there are byte enables.
Read and write ports can be increased by cloning the register file.
1 read port and 1 write port can be generated with 1 Memblock
2 read ports and 1 write port can be generated with 2 Memblocks
...

ok i get it, it would be easy to do in actual hardware but obviously FPGAs are wired up in a certain way and only certain things are possible to do.

Nevertheless this does not really require "8 port register file", because individual random access is not required, each port only has to access a single register. So it wouldn't need 8*8=64 blocks, only 8 blocks, that is, 8 separate cloned registers with 1 port each.

robinsonb5 · 10 June 2014, 13:18

Quote:

Originally Posted by Mrs Beanbag

ok i get it, it would be easy to do in actual hardware but obviously FPGAs are wired up in a certain way and only certain things are possible to do.

It's not that the register file *has* to be implemented as internal block RAM, it's just that implementing it any other way uses up logic elements really fast.

Quote:

Nevertheless this does not really require "8 port register file", because individual random access is not required, each port only has to access a single register. So it wouldn't need 8*8=64 blocks, only 8 blocks, that is, 8 separate cloned registers with 1 port each.

True. But to do the cornerturn this way, assuming your memory blocks support 2 ports each, you'd need to devote 4 blocks to the register file, which means a minimum of 16 kilobits used to support 256 bits of register.
