English Amiga Board


Old 08 June 2014, 21:59   #1
Gunnar
Registered User
 
Join Date: Apr 2014
Location: Germany
Posts: 154
Which is the fastest software C2P 1x1 routine

Which are the fastest C2P routines?

Does anyone know how many instructions they need per converted pixel?

Thanks in advance
Old 08 June 2014, 23:22   #2
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Quote:
Originally Posted by Gunnar
Which are the fastest C2P routines?
Kalms' c2p code is one of the best and he has made it available:

http://eab.abime.net/showthread.php?t=52125
Old 08 June 2014, 23:29   #3
Kalms
Registered User
 
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
Converting 32 chunky-pixels to 8 longwords of bitplane-data usually takes:

8 longword reads
14*2*4 + 10*2 = 132 ALU operations
8 longword writes

(you can get rid of about 10 of those ALU operations by being tricky)

There is usually a bit of overhead, let's say 10-20 extra ALU operations (counters, shuffling data into backup registers, advancing pointers, etc).

The above figures assume that the ALU operations are done on 32-bit registers. Fewer ALU ops are required if you have the data in smaller registers.
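Put in per-pixel terms, which is what the original question asked for, the figures above work out roughly as follows. This is only a back-of-the-envelope sketch in Python using the numbers from this post; the variable and function names are made up for illustration.

```python
# Per-pixel cost estimate, using only the figures quoted in the post above.
PIXELS_PER_GROUP = 32

reads = 8                      # longword reads per 32 pixels
alu = 14 * 2 * 4 + 10 * 2      # ALU operations per 32 pixels
writes = 8                     # longword writes per 32 pixels

core_ops = reads + alu + writes
overhead_low, overhead_high = 10, 20   # counters, pointer updates, etc.

def ops_per_pixel(extra):
    """Operations per pixel for one 32-pixel group plus `extra` overhead ops."""
    return (core_ops + extra) / PIXELS_PER_GROUP

print(f"core: {core_ops} ops per group")
print(f"with overhead: {ops_per_pixel(overhead_low)} to "
      f"{ops_per_pixel(overhead_high)} ops per pixel")
```

So on 32-bit registers the estimate lands at roughly five operations per pixel, memory accesses included.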
Old 09 June 2014, 07:53   #4
meynaf
son of 68k
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by Gunnar
Does anyone know how many instructions they need per converted pixel?
Pixels are not converted individually. You read 32 pixels at once, mix the bits, and then write your pixels back.

So you can't design a single CPU instruction that would convert a pixel, if that's what you have in mind.
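To make "mix the bits" concrete, here is a deliberately slow reference version in Python. It is not one of the fast routines discussed in this thread, just a plain statement of what they compute. The function name `c2p_reference` and the convention that pixel 0 lands in the most significant bit of each plane longword are assumptions made for this sketch.

```python
def c2p_reference(pixels):
    """Convert 32 chunky pixels (8 bits each) into 8 bitplane longwords.

    planes[p] collects bit p of every pixel. Pixel 0 ends up in bit 31
    (the leftmost pixel of the group), pixel 31 in bit 0.
    """
    if len(pixels) != 32:
        raise ValueError("expected exactly 32 pixels")
    planes = [0] * 8
    for i, value in enumerate(pixels):
        for p in range(8):
            if (value >> p) & 1:
                planes[p] |= 1 << (31 - i)
    return planes

# Example: only pixel 0 is set, with color index 5 (binary 101),
# so planes 0 and 2 get their top bit set.
planes = c2p_reference([5] + [0] * 31)
print([hex(x) for x in planes])
```

Every output longword depends on all 32 input pixels, which is why the conversion is done per group and not per pixel.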
Old 09 June 2014, 08:03   #5
Gunnar
Registered User
 
Join Date: Apr 2014
Location: Germany
Posts: 154
Hi Kalms,

Edit:
Ah I saw your sources on Google now. Quite a long collection.
Would it help your code if you had a few more data registers?

Last edited by Gunnar; 09 June 2014 at 08:16.
Old 09 June 2014, 08:14   #6
Kalms
Registered User
 
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
In theory you should be able to do 6bpl c2p in,

8 longword reads
~90 ALU ops
6 longword writes

however the routines I have lying around (from http://unitedstatesofamiga.googlecod...P-20100426.zip) seem to be at ~120 ALU ops.

If you want to speed the transform up by introducing new instructions, then a reasonably simple way to do so is to introduce a pair of new instructions that performs half of a merge each,

a = (b & ~e) | ((c & ~e) >> d)
(and the dual of this would be, a = ((b & e) << d) | (c & e))

where b & c are input registers, and d & e are constants. e can be computed from d. The operation above is essentially a bit selection operation. You would use it like this:

MERGE_FIRST_HALF d0,d1,4,d2
a = d2
b = d0
c = d1
d = 4
e = small_lookup_table[d] = $0f0f0f0f

The above instruction has 2 read-only registers and 1 write-only register operand.
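The two half-merge formulas can be modelled in a few lines of Python. This is a sketch of the semantics as written in the post, not of any real instruction encoding. Only the table entry for d = 4 ($0f0f0f0f) comes from the post; the other mask values and the function names are assumptions.

```python
# Mask e for each shift distance d ("small_lookup_table" in the post).
# e selects the bit groups that sit in the lower half of each 2*d-bit field.
SMALL_LOOKUP_TABLE = {
    16: 0x0000FFFF,
    8: 0x00FF00FF,
    4: 0x0F0F0F0F,
    2: 0x33333333,
    1: 0x55555555,
}

def merge_first_half(b, c, d):
    """a = (b & ~e) | ((c & ~e) >> d) on 32-bit values."""
    e = SMALL_LOOKUP_TABLE[d]
    return (b & ~e) | ((c & ~e) >> d)

def merge_second_half(b, c, d):
    """The dual: a = ((b & e) << d) | (c & e)."""
    e = SMALL_LOOKUP_TABLE[d]
    return ((b & e) << d) | (c & e)

d0, d1 = 0x12345678, 0x9ABCDEF0
d2 = merge_first_half(d0, d1, 4)    # high nibbles of d0 and d1, interleaved
d3 = merge_second_half(d0, d1, 4)   # low nibbles of d0 and d1, interleaved
print(hex(d2), hex(d3))
```

With these inputs the first half gives $193b5d7f and the second half $2a4c6e80: each result nibble is picked from one of the two sources, which is the "bit selection" view described above.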


If you prefer to have just two read+write register operands then you can create an instruction that intermingles bits between the two registers:

a0 = (b & ~e) | ((c & ~e) >> d)
a1 = ((b & e) << d) | (c & e)
b = a0
c = a1

and you use it like:

MERGE d0,d1,4

This would do a bit shuffle between d0 & d1. With that instruction you will be down at 20 ALU ops.
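The figure of 20 can be reproduced with a small software model: five passes of four MERGE operations each over eight registers. The sketch below is in Python and rests on several assumptions that are not in the post: the pass order (16x4, 2x4, 8x2, 1x2, 4x1) is one common ordering, the mask table and the final register-to-plane mapping `PLANE_TO_REG` follow from that ordering, and pixels are packed big-endian as a 68k longword read would deliver them.

```python
MASKS = {16: 0x0000FFFF, 8: 0x00FF00FF, 4: 0x0F0F0F0F, 2: 0x33333333, 1: 0x55555555}

def merge(b, c, d):
    """Software model of the proposed MERGE b,c,d; returns the new (b, c)."""
    e = MASKS[d]
    a0 = (b & ~e) | ((c & ~e) >> d)
    a1 = ((b & e) << d) | (c & e)
    return a0, a1

# (shift distance d, register distance) for each of the five passes.
PASSES = ((16, 4), (2, 4), (8, 2), (1, 2), (4, 1))

# After the passes, bitplane p sits in register PLANE_TO_REG[p].
PLANE_TO_REG = (7, 5, 3, 1, 6, 4, 2, 0)

def c2p_merge(pixels):
    """Convert 32 chunky pixels to 8 plane longwords; returns (planes, merge_count)."""
    # Pack four pixels per longword, first pixel in the most significant byte.
    regs = [int.from_bytes(bytes(pixels[4 * r:4 * r + 4]), "big") for r in range(8)]
    merges = 0
    for d, dist in PASSES:
        for lo in range(8):
            if lo & dist == 0:          # pair each register with its partner
                hi = lo + dist
                regs[lo], regs[hi] = merge(regs[lo], regs[hi], d)
                merges += 1
    return [regs[r] for r in PLANE_TO_REG], merges

planes, merges = c2p_merge(list(range(32)))
print(merges, [hex(x) for x in planes])
```

Each pass exchanges one bit of the register index with one bit of the position inside the register, so five passes cover all five position bits. The planes come out in a shuffled register order, which costs nothing in practice because it only changes which register is written to which bitplane pointer.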


This does feel like a bit of a moot point though.

The C2P transform itself does not take a lot of time (it takes about 4-5 ms on a 50 MHz 68060 for 320x256 pixels @ 8bpl); the bulk of the time disappears in fast->chip copying today, so that is where your efforts would make a bigger impact.
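For scale, the timing figure can be turned into cycles per pixel and a share of the frame time. A rough Python sketch using the numbers from the post; the 50 Hz frame rate used for the frame budget is an assumption (a PAL display), as are the function names.

```python
CPU_HZ = 50_000_000          # 50 MHz 68060
WIDTH, HEIGHT = 320, 256
pixels = WIDTH * HEIGHT

def cycles_per_pixel(milliseconds):
    """CPU cycles available per pixel if the whole screen takes this long."""
    cycles = CPU_HZ * milliseconds / 1000
    return cycles / pixels

def share_of_frame(milliseconds, fps):
    """Fraction of one frame's time budget spent in the transform."""
    return milliseconds / (1000 / fps)

for ms in (4, 5):
    print(ms, "ms:", round(cycles_per_pixel(ms), 2), "cycles/pixel,",
          round(100 * share_of_frame(ms, 50), 1), "% of a 50 Hz frame")
```

That is about 2.4 to 3.1 cycles per pixel, or 20 to 25 percent of a 50 Hz frame, which fits the point that the transform is not the main cost.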

Last edited by TCD; 09 June 2014 at 09:16. Reason: Back-to-back posts merged
Old 09 June 2014, 09:18   #7
mc6809e
Registered User
 
Join Date: Jan 2012
Location: USA
Posts: 372
Quote:
Originally Posted by Kalms
The C2P transform itself does not take a lot of time (it takes about 4-5 ms on a 50 MHz 68060 for 320x256 pixels @ 8bpl); the bulk of the time disappears in fast->chip copying today, so that is where your efforts would make a bigger impact.
I wonder if there's an opportunity to use software pipelining/instruction interleaving to get work done while chip RAM writes are pending, since the bus interface of the 68020 and later runs concurrently with instruction execution. The idea is to trickle out writes that can overlap with calculations.
Old 09 June 2014, 09:34   #8
meynaf
son of 68k
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by mc6809e
I wonder if there's an opportunity to use software pipelining/instruction interleaving to get work done while chip RAM writes are pending, since the bus interface of the 68020 and later runs concurrently with instruction execution. The idea is to trickle out writes that can overlap with calculations.
This is already done. Any decent 68030 c2p does that, for example. On a 68030, in fact, doing a c2p in fastmem would bring very little benefit, as the instructions are enough to fully "cover" the chipmem waits.
Old 09 June 2014, 10:13   #9
Gunnar
Registered User
 
Join Date: Apr 2014
Location: Germany
Posts: 154
Hi Kalms,

Thanks for the answer.

Maybe I should explain real quick what silly idea I had.
The idea is just for fun and "silly" but useful for existing Vampire600 owners with no truecolor video out. For them it might be a nice hack.


I have some games which use a 15-bit GFX mode.
I wondered if I could hack together a display conversion
that outputs 320x256 in 64 colors on OCS at 20 FPS or more.

The silly idea is to render the game in 15/12-bit color,
then do a lookup per pixel in a 12-bit colormap to find the best matching color pen, and then do C2P.

This means the color lookup is part of the C2P routine and its time is hidden in the chip memory write time.
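The lookup step could be sketched like this in Python: a 4096-entry table maps every 12-bit color to the closest pen, so the per-pixel work is one shift-and-mask plus one table read. The squared-RGB distance metric, the 15-bit layout (5 bits per channel, red in the top bits), the tiny example palette and all function names are assumptions for illustration; a real EHB palette would have 64 entries.

```python
def rgb15_to_rgb12(c15):
    """Keep the top 4 bits of each 5-bit channel: RRRRRGGGGGBBBBB -> RRRRGGGGBBBB."""
    r = (c15 >> 11) & 0xF
    g = (c15 >> 6) & 0xF
    b = (c15 >> 1) & 0xF
    return (r << 8) | (g << 4) | b

def build_pen_lut(palette12):
    """For every 12-bit color, precompute the index of the closest palette entry."""
    lut = bytearray(4096)
    for color in range(4096):
        r, g, b = color >> 8, (color >> 4) & 0xF, color & 0xF
        best_pen, best_dist = 0, None
        for pen, entry in enumerate(palette12):
            pr, pg, pb = entry >> 8, (entry >> 4) & 0xF, entry & 0xF
            dist = (r - pr) ** 2 + (g - pg) ** 2 + (b - pb) ** 2
            if best_dist is None or dist < best_dist:
                best_pen, best_dist = pen, dist
        lut[color] = best_pen
    return lut

def to_chunky(frame15, lut):
    """Map a list of 15-bit pixels to pen numbers, ready for C2P."""
    return [lut[rgb15_to_rgb12(c)] for c in frame15]

palette = [0x000, 0xF00, 0x0F0, 0x00F, 0xFFF]   # hypothetical 5-pen palette
lut = build_pen_lut(palette)
print(to_chunky([0x7C00, 0x03E0, 0x001F, 0x7FFF, 0x0000], lut))
```

The table is built once per palette, so the expensive nearest-color search stays out of the per-frame loop.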


As useful features we currently have:

16 data registers

One free register-to-register MOVE per cycle.
This means
MOVE.l D0,D1
OPP.L D2,D1 -- are merged into a single 1-cycle instruction

A PERM instruction which does a byte permute:
2 source registers, 1 register written, immediate byte selector.
I don't know if this helps for the c2p?
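One way to look at that question: the 16-bit and 8-bit merge steps of a c2p only move whole bytes, so a byte permute could in principle do each of them in one instruction per output register. The Python sketch below illustrates this. The post gives only the operand shape of PERM, so the selector format used here (four indices 0..7 into the eight source bytes) is purely hypothetical.

```python
def perm(src_a, src_b, selector):
    """Build a 32-bit result from four bytes chosen out of src_a and src_b.

    selector holds four indices, most significant result byte first.
    Index 0..3 picks a byte of src_a, 4..7 a byte of src_b (0 and 4 being
    the most significant byte of each). This encoding is an assumption.
    """
    pool = src_a.to_bytes(4, "big") + src_b.to_bytes(4, "big")
    return int.from_bytes(bytes(pool[i] for i in selector), "big")

b, c = 0x11223344, 0xAABBCCDD

# 16-bit merge step (mask $0000ffff): both halves as byte permutes.
hi_words = perm(b, c, (0, 1, 4, 5))
lo_words = perm(b, c, (2, 3, 6, 7))

# 8-bit merge step (mask $00ff00ff).
even_bytes = perm(b, c, (0, 4, 2, 6))
odd_bytes = perm(b, c, (1, 5, 3, 7))

print(hex(hi_words), hex(lo_words), hex(even_bytes), hex(odd_bytes))
```

The remaining steps (shifts of 4, 2 and 1 bits) work below byte granularity, so a byte permute alone would not cover them.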

Phoenix can do memory reads in parallel without blocking.
This means that if you write a C2P or texture mapper cleverly, or slightly interleaved,
the memory latency can be reduced to zero.

PIXMERGE, a bilinear interpolation instruction which does bilinear mixing of 2 RGB pixels in a single instruction. Maybe nice for RGB texture operations.

Adding a SELECT instruction which picks bitwise would be simple.
This would also help soft blitting.


I know the idea is silly - but I'm curious what we could do with an A600 with it.

Quote:
Originally Posted by Kalms

a0 = (b & ~e) | ((c & ~e) >> d)
a1 = ((b & e) << d) | (c & e)
Can I also write this as:

SrcA = b
SrcB = c
Mask = e
Dest

SrcA <<= d;
Dest = (SrcA & Mask) | (SrcB & ~Mask) -- This is basically a Blit minterm
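A quick way to check this rewrite is to run both forms on the same inputs. In the Python sketch below (function names are made up, and values are truncated to 32 bits as a register would be), the shift-then-select form reproduces the a1 formula when the mask is the complement of Kalms' e, but not when the mask is e itself, because the shift moves the selected bits into the positions that ~e covers.

```python
M32 = 0xFFFFFFFF

def kalms_a1(b, c, d, e):
    """a1 = ((b & e) << d) | (c & e), as quoted above."""
    return ((b & e) << d) | (c & e)

def blit_form(src_a, src_b, d, mask):
    """Shift SrcA first, then select bitwise between SrcA and SrcB."""
    src_a = (src_a << d) & M32          # SrcA <<= d, truncated to 32 bits
    return (src_a & mask) | (src_b & ~mask & M32)

b, c, d, e = 0x12345678, 0x9ABCDEF0, 4, 0x0F0F0F0F

print(blit_form(b, c, d, ~e & M32) == kalms_a1(b, c, d, e))   # mask = ~e
print(blit_form(b, c, d, e) == kalms_a1(b, c, d, e))          # mask = e
```

The a0 half would need the same treatment with SrcB shifted right by d in place of SrcA shifted left.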

Last edited by TCD; 10 June 2014 at 08:59. Reason: Back-to-back posts merged
Old 09 June 2014, 12:28   #10
Mrs Beanbag
Glastonbridge Software
 
Join Date: Jan 2012
Location: Edinburgh/Scotland
Posts: 2,243
should be possible to make an instruction that operates on all 8 data registers simultaneously and does C2P on 32 pixels in a single cycle, right?
(not including reading and writing)
Old 09 June 2014, 13:36   #11
Gunnar
Registered User
 
Join Date: Apr 2014
Location: Germany
Posts: 154
Quote:
Originally Posted by Mrs Beanbag
should be possible to make an instruction that operates on all 8 data registers simultaneously and does C2P on 32 pixels in a single cycle, right?
No, unfortunately not.
The limitation here is the register file design and its limited number of read ports.
In other words, the ALU cannot read 8 registers in parallel.
Old 09 June 2014, 21:05   #12
Mrs Beanbag
Glastonbridge Software
 
Join Date: Jan 2012
Location: Edinburgh/Scotland
Posts: 2,243
Quote:
Originally Posted by Gunnar
No, unfortunately not.
The limitation here is the register file design and its limited number of read ports.
In other words, the ALU cannot read 8 registers in parallel.
You don't necessarily need to use the read or write ports. It is a fixed permutation of bits so it could be hard-wired into the register file (in theory) like a set of shift registers. The instruction would only need to pulse the shift clock.
Old 10 June 2014, 00:17   #13
robinsonb5
Registered User
 
Join Date: Mar 2012
Location: Norfolk, UK
Posts: 1,153
Quote:
Originally Posted by Mrs Beanbag
You don't necessarily need to use the read or write ports. It is a fixed permutation of bits so it could be hard-wired into the register file (in theory) like a set of shift registers. The instruction would only need to pulse the shift clock.
In most FPGA-based CPUs (and I'm assuming the Apollo core is no exception - Gunnar please correct me if I'm wrong), the register file is implemented as one or more of the FPGA's built-in dual-port RAM blocks, which limits what you can do in terms of hard-wiring.
Old 10 June 2014, 08:52   #14
Gunnar
Registered User
 
Join Date: Apr 2014
Location: Germany
Posts: 154
Quote:
Originally Posted by robinsonb5
In most FPGA-based CPUs (and I'm assuming the Apollo core is no exception - Gunnar please correct me if I'm wrong), the register file is implemented as one or more of the FPGA's built-in dual-port RAM blocks, which limits what you can do in terms of hard-wiring.
This is correct.

But let's get back to the original topic.

The idea is to try out a display solution that shows hicolor 15-bit screens in a 64-color ECS mode.

Last edited by TCD; 10 June 2014 at 08:59. Reason: Back-to-back posts merged
Old 10 June 2014, 08:59   #15
alexh
Thalion Webshrine
 
Join Date: Jan 2004
Location: Oxford
Posts: 14,354
Quote:
Originally Posted by robinsonb5
In most FPGA-based CPUs (and I'm assuming the Apollo core is no exception - Gunnar please correct me if I'm wrong), the register file is implemented as one or more of the FPGA's built-in dual-port RAM blocks, which limits what you can do in terms of hard-wiring.
But most FPGAs have block RAMs which are designed to be much wider than anything you might ever use, are chainable (you can use more than one to represent a single RAM) and have individual bit write masks. I'm not 100% familiar with the Vampire, but making 8 registers readable in a single cycle should be possible with little overhead.
Old 10 June 2014, 09:47   #16
Gunnar
Registered User
 
Join Date: Apr 2014
Location: Germany
Posts: 154
Quote:
Originally Posted by alexh
But most FPGAs have block RAMs which are designed to be much wider than anything you might ever use, are chainable (you can use more than one to represent a single RAM) and have individual bit write masks. I'm not 100% familiar with the Vampire, but making 8 registers readable in a single cycle should be possible with little overhead.
No.
A Cyclone memory block can have a width of 32 bits = Amiga register width.
There is no such thing as a "bit write mask", but there are byte enables.
Read and write ports can be increased by cloning the register file.
1 memory block gives you 1 read port and 1 write port
2 read ports and 1 write port can be generated with 2 memblocks
2 read ports and 2 write ports can be generated with 4 memblocks
3 read ports and 3 write ports can be generated with 9 memblocks
4 read ports and 4 write ports can be generated with 16 memblocks
8 read ports and 8 write ports can be generated with 64 memblocks

You see the pattern.


The whole FPGA has 36 memblocks.
And you do not want to use all of them for your registers; you may want some for caches too.
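The pattern in the list is read ports times write ports. A tiny Python check against the 36 blocks available; the function name is made up, and the formula is simply what the listed numbers suggest.

```python
FPGA_MEMBLOCKS = 36   # total quoted in the post

def memblocks_needed(read_ports, write_ports):
    """Block count for a cloned register file, following the listed pattern."""
    return read_ports * write_ports

for ports in ((1, 1), (2, 1), (2, 2), (3, 3), (4, 4), (8, 8)):
    need = memblocks_needed(*ports)
    fits = "fits" if need <= FPGA_MEMBLOCKS else "does not fit"
    print(f"{ports[0]}R/{ports[1]}W: {need} blocks ({fits} in {FPGA_MEMBLOCKS})")
```

An 8-read/8-write register file would need 64 blocks, almost twice what the whole FPGA offers.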
Old 10 June 2014, 09:48   #17
JimDrew
Registered User
 
Join Date: Dec 2013
Location: Lake Havasu City, AZ
Posts: 741
I made some very fast C2P routines for my Mac emulations. You don't mention whether you are converting an off-screen bitmap to an Amiga screen. If so, there are some neat tricks you can do to synchronize the stores to chipmem. You can also use the MMU to map a compare page to determine when sections of the screen have changed. With a 68030+ it is easy to use a 4K page (MMU) and then a physical compare page to convert only the data that has changed. Back in the day, I spent months working out the fastest C2P routines for various bit depths (all assembly code, of course). I even wrote a test suite to compare C2P code. I will have to locate that stuff.
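The "convert only what changed" idea can be sketched without an MMU by keeping a compare copy of the chunky buffer and checking it group by group. This Python sketch is not the routine described in the post: it replaces the MMU page trick with plain comparisons, and the 32-pixel granularity, function names and callback are assumptions.

```python
GROUP = 32  # pixels converted per c2p step

def changed_groups(current, previous):
    """Yield the start offset of every 32-pixel group that differs."""
    for start in range(0, len(current), GROUP):
        if current[start:start + GROUP] != previous[start:start + GROUP]:
            yield start

def update_screen(current, previous, convert_group):
    """Convert only the changed groups, then refresh the compare copy."""
    converted = 0
    for start in changed_groups(current, previous):
        convert_group(start, current[start:start + GROUP])
        previous[start:start + GROUP] = current[start:start + GROUP]
        converted += 1
    return converted

frame = bytearray(320 * 256)
compare = bytearray(frame)
frame[1000] = 7                       # touch one pixel
touched = []
count = update_screen(frame, compare, lambda start, px: touched.append(start))
print(count, touched)
```

The comparison itself costs a read of both buffers, so this only pays off when large parts of the screen stay unchanged between frames; the MMU approach avoids that cost by tracking writes per page.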

Last edited by JimDrew; 10 June 2014 at 09:54.
Old 10 June 2014, 10:01   #18
Gunnar
Registered User
 
Join Date: Apr 2014
Location: Germany
Posts: 154
Quote:
Originally Posted by JimDrew
I will have to locate that stuff.
This would be nice.

Maybe we can all work together as a team to improve the raw ideas that I have?

I would like to prepare the following demo code:
The main code will render from fast to fast in direct color (15-bit/12-bit ...).
As a test I would like to add bilinear filtering of texture pixels (smoothing).

In the display routine I would like to look up, for each direct color value, the best match in the color table (64 colors EHB) and then do C2P.

I hope to get the whole thing running at around 20 FPS.

As I have no MMU, the code will render and convert the whole screen every time,
which is probably also the best approach for video playback or fast games.
Old 10 June 2014, 11:02   #19
Mrs Beanbag
Glastonbridge Software
 
Join Date: Jan 2012
Location: Edinburgh/Scotland
Posts: 2,243
Quote:
Originally Posted by Gunnar
No.
A Cyclone memory block can have a width of 32 bits = Amiga register width.
There is no such thing as a "bit write mask", but there are byte enables.
Read and write ports can be increased by cloning the register file.
1 memory block gives you 1 read port and 1 write port
2 read ports and 1 write port can be generated with 2 memblocks
...
ok i get it, it would be easy to do in actual hardware but obviously FPGAs are wired up in a certain way and only certain things are possible to do.

Nevertheless this does not really require "8 port register file", because individual random access is not required, each port only has to access a single register. So it wouldn't need 8*8=64 blocks, only 8 blocks, that is, 8 separate cloned registers with 1 port each.
Old 10 June 2014, 13:18   #20
robinsonb5
Registered User
 
Join Date: Mar 2012
Location: Norfolk, UK
Posts: 1,153
Quote:
Originally Posted by Mrs Beanbag
ok i get it, it would be easy to do in actual hardware but obviously FPGAs are wired up in a certain way and only certain things are possible to do.
It's not that the register file *has* to be implemented as internal block RAM, it's just that implementing it any other way uses up logic elements really fast.

Quote:
Nevertheless this does not really require "8 port register file", because individual random access is not required, each port only has to access a single register. So it wouldn't need 8*8=64 blocks, only 8 blocks, that is, 8 separate cloned registers with 1 port each.
True. But to do the cornerturn this way, assuming your memory blocks support 2 ports each, you'd need to devote 4 blocks to the register file, which means a minimum of 16 kilobits used to support 256 bits of register.
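The mismatch in those numbers is easy to quantify. A small Python sketch; the 4 kilobit block size is an assumption derived from the post's own figures (16 kilobits spread over 4 blocks).

```python
REGISTERS = 8
REGISTER_BITS = 32
BLOCKS = 4
BITS_PER_BLOCK = 4 * 1024      # assumption: 16 kilobits / 4 blocks

register_bits = REGISTERS * REGISTER_BITS
block_bits = BLOCKS * BITS_PER_BLOCK
print(register_bits, block_bits, f"{100 * register_bits / block_bits:.2f}% used")
```

Under that assumption, less than 2 percent of the allocated block RAM would actually hold register data.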
 

