Performance of Akiko C2P on 030/50 CD32 systems - Page 6

Lunda · 26 May 2024, 16:14

Quote:

Originally Posted by abu_the_monkey

that is more believable.

still almost 9mb/s ain't too shabby.

@Lunda is that with the 030 clock at 70mhz? (5 times the 14mhz cpu clock)

Yes, that's with 70.9MHz CPU clock.

An Akiko C2P write + read should take ~6 cycles at 14MHz, so 9MB/s looks correct.
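For anyone following along who hasn't looked at what is actually being timed: as discussed in this thread, 32 chunky pixels go in as 8 long-word writes and the planar result comes back as 8 long-word reads. Below is a minimal software model of that conversion, only to show the data transformation. The function name is made up for illustration, and the ordering of planes in the result is an assumption (plane 0 first); it says nothing about how the hardware does it internally.

```python
def c2p_32_pixels(chunky):
    """Software model: 32 chunky 8-bit pixels -> 8 planar 32-bit long words."""
    if len(chunky) != 32:
        raise ValueError("expected exactly 32 pixels")
    planes = [0] * 8
    for x, pixel in enumerate(chunky):
        for plane in range(8):
            if (pixel >> plane) & 1:
                # Amiga bitplanes put the leftmost pixel in the most
                # significant bit of the word.
                planes[plane] |= 1 << (31 - x)
    return planes


if __name__ == "__main__":
    # One pixel of colour 5 (binary 101) at the left edge sets the top
    # bit of planes 0 and 2 only.
    for plane, word in enumerate(c2p_32_pixels([5] + [0] * 31)):
        print(f"plane {plane}: {word:08X}")
```

A software C2P routine on the CPU has to produce the same 8 long words per 32 pixels, which is what the cycle comparisons later in the thread are about.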

Karlos · 26 May 2024, 16:14

Quote:

Originally Posted by abu_the_monkey

that is more believable.

still almost 9mb/s ain't too shabby.

@Lunda is that with the 030 clock at 70mhz? (5 times the 14mhz cpu clock)

Technically it's 18 MB/s. It's 9 million bytes converted per second.
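The two figures in this exchange can be reconciled with a bit of arithmetic. This is only a sketch: it assumes the PAL 14.18758 MHz clock and takes the ~6 cycles per long-word write + read pair from Lunda's post; the function name is hypothetical.

```python
PAL_14MHZ_HZ = 14_187_580  # assumed PAL clock; 5x this is the 70.9MHz quoted above


def akiko_rates(cycles_per_pair=6, clock_hz=PAL_14MHZ_HZ):
    """Return (bytes converted per second, bytes moved over the bus per second)."""
    pairs_per_second = clock_hz / cycles_per_pair
    converted = pairs_per_second * 4   # each pair converts one long word (4 pixels)
    bus_traffic = converted * 2        # every long word is written once and read once
    return converted, bus_traffic


if __name__ == "__main__":
    converted, bus_traffic = akiko_rates()
    print(f"converted:   {converted / 1e6:.2f} MB/s")
    print(f"bus traffic: {bus_traffic / 1e6:.2f} MB/s")
```

Under those assumptions the converted rate comes out a little over 9 MB/s and the bus traffic at twice that, which matches both the "9MB/s" and the "18 MB/s" readings.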

abu_the_monkey · 26 May 2024, 16:27

I wonder how running the cpu asynchronous (like most accelerators do) would affect the result.

still such a missed opportunity, not having akiko DMA to chip ram

Lunda · 26 May 2024, 16:57

Quote:

Originally Posted by abu_the_monkey

I wonder how running the cpu asynchronous (like most accelerators do) would affect the result.

still such a missed opportunity, not having akiko DMA to chip ram

DMA would have been great. Less than 30 CPU cycles (adding one write for the address) at 14 MHz to do C2P and chip write for 32 pixels. Yes, the bus will be busy for longer, but the CPU is free.

DMA without fast RAM doesn't improve much though.

DOOM on CD32 would have been good enough. Remember that back then 160 x 200 at 15 fps was considered a great Doom port.
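One possible reading of the "less than 30 cycles" DMA estimate above, as a rough sketch: 8 long-word writes of chunky data plus one extra write for the destination address, at 3 cycles per Akiko access. The DMA scheme is hypothetical (the real Akiko has no such feature), the 3-cycle access is the measured figure discussed in this thread, and the 8-cycle best-case chip write is the number paraj gives later on. All names are made up for illustration.

```python
AKIKO_ACCESS_CYCLES = 3   # 14MHz cycles per Akiko access (measured figure from the thread)
CHIP_WRITE_CYCLES = 8     # best-case chip RAM long-word write (2 CCK), per paraj
LONGWORDS_PER_BLOCK = 8   # 32 chunky pixels = 8 long words


def hypothetical_dma_cycles():
    """CPU cost if Akiko could write the result to chip RAM by itself."""
    # 8 data writes plus 1 write to tell Akiko the destination address.
    return (LONGWORDS_PER_BLOCK + 1) * AKIKO_ACCESS_CYCLES


def cpu_driven_cycles():
    """CPU cost today: write 8, read 8, then store 8 long words to chip RAM."""
    akiko = LONGWORDS_PER_BLOCK * 2 * AKIKO_ACCESS_CYCLES
    chip = LONGWORDS_PER_BLOCK * CHIP_WRITE_CYCLES
    return akiko + chip


if __name__ == "__main__":
    print("hypothetical DMA:", hypothetical_dma_cycles(), "cycles per 32 pixels")
    print("CPU-driven:      ", cpu_driven_cycles(), "cycles per 32 pixels")
```

This ignores any overlap between the CPU and the bus, so treat both totals as rough bounds rather than predictions.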

paraj · 26 May 2024, 18:50

Great! 3 cycles is an odd (no pun intended) number of cycles for the accesses to take though.

Time for phase 2

Does 030 actually benefit from "burst reads", filling complete cache line or something like that? Otherwise it might be better (and easier) to just keep caches disabled during C2P as the cache would be trashed anyway.

pipper · 26 May 2024, 19:40

Does the test also do the necessary chipmem writes? If not, it’s not a realistic scenario.
One opportunity could be to see if Akiko access and chipmem writes can somehow be scheduled in a clever way(?)

Karlos · 26 May 2024, 19:43

Quote:

Originally Posted by pipper

Does the test also do the necessary chipmem writes? If not, it’s not a realistic scenario.
One opportunity could be to see if Akiko access and chipmem writes can somehow be scheduled in a clever way(?)

No, not yet. I was curious about the, pardon the pun, bit by bit breakdown. I'll add that next. My thinking is that maybe there's a hacky, cache-slapping way to improve it.

paraj · 26 May 2024, 19:59

It would be very interesting to get numbers from 020 with and without fast ram (if possible) for this simplified test. I'm almost willing to bet access time is going to be an even number of 14MHz cycles. In principle the chip write (in the best case) is just going to add 8 more 14MHz cycles (2*CCK), so 14 in total per long word with this config.

Seems like (again my math is probably off) Akiko is a win if C2P can't be done in 8*6*(50/14) ~171 cycles (at 50MHz). Lots of effects make it more complicated (including what can and cannot overlap), but you need to read from (maybe fast) RAM and write to chip in either case.
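The break-even estimate above is easy to play with for other clock speeds. A small sketch, using only the numbers from the post (8 long words per 32 pixels, 6 bus cycles per Akiko write + read, a nominal 14MHz bus); the function name is hypothetical.

```python
def break_even_cpu_cycles(cpu_mhz=50.0, bus_mhz=14.0, longwords=8, akiko_cycles=6):
    """CPU cycles a software C2P may spend on 32 pixels before Akiko wins.

    The source read and the chip write are needed either way, so only the
    Akiko write + read time is counted and converted into CPU clock cycles.
    """
    return longwords * akiko_cycles * cpu_mhz / bus_mhz


if __name__ == "__main__":
    for mhz in (28.0, 50.0, 70.0):
        cycles = break_even_cpu_cycles(cpu_mhz=mhz)
        print(f"{mhz:4.0f}MHz CPU: Akiko wins if software C2P needs more than ~{cycles:.0f} cycles")
```

With the defaults this gives the ~171 cycles quoted above. It deliberately ignores overlap between CPU work and pending bus accesses, which the post already flags as a complication.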

Karlos · 26 May 2024, 20:14

I'm doubtful we'll come across some hitherto unknown speedup but it's fun to poke about.

paraj · 26 May 2024, 20:26

Definitely! And seeing real numbers of the low level stuff is very interesting (instead of FPS from various games).

abu_the_monkey · 26 May 2024, 21:56

I wonder if it would be better to use akiko for the c2p but to fast ram and then copy to chip.

Karlos · 26 May 2024, 22:11

Quote:

Originally Posted by abu_the_monkey

I wonder if it would be better to use akiko for the c2p but to fast ram and then copy to chip.

Well we can experiment and find out.

Cyprian · 26 May 2024, 22:19

Quote:

Originally Posted by Lunda

I was wrong. See attached pics.

Clock is 14MHz.

nice investigation

thanks to you we now know that Akiko is much better than we thought, even on accelerated machine.

Quote:

Originally Posted by paraj

It would be very interesting to get numbers from 020 with and without fast ram (if possible) for this simplified test. I'm almost willing to bet access time is going to be an even number of 14MHz cycles. In principle the chip write (in the best case) is just going to add 8 more 14MHz cycles (2*CCK), so 14 in total per long word with this config.

I'm also interested in the result

abu_the_monkey · 26 May 2024, 22:26

Quote:

Originally Posted by Karlos

Well we can experiment and find out.

my thinking is that more of the heavy lifting would be done in the 'fast' domain and only the copy of the full converted bitmap would be in the slower fast->chip domain.

Karlos · 27 May 2024, 14:23

On an 030, if I'm doing a write to chip ram via an address pointer and I want to add an offset to the pointer immediately afterwards (e.g. calculating the next plane to write to), is the cost of that operation fully masked by the pending write? How many cycles should I expect to be able to execute while the write is happening, assuming operations that aren't doing any data memory accesses?

meynaf · 27 May 2024, 14:50

Quote:

Originally Posted by Karlos

On an 030, if I'm doing a write to chip ram via an address pointer and I want to add an offset to the pointer immediately afterwards (e.g. calculating the next plane to write to), is the cost of that operation fully masked by the pending write?

Yes - as long as the value to add doesn't come from memory, of course.
Nearly every register-only instruction seems to 'pipeline' well, except iterative instructions such as mul & div which stall like memory accesses.

Quote:

Originally Posted by Karlos

How many cycles should I expect to be able to execute while the write is happening, assuming operations that aren't doing any data memory accesses?

For a 50MHz 030: at least 24, usually 26. Experiments have shown the exact number isn't easy to predict.
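As a sanity check on those numbers, here is a rough upper bound on how many CPU cycles fit inside one chip RAM write. It assumes the best-case 2 CCK chip write mentioned earlier in the thread and the PAL colour clock of 3.546895 MHz; the function name is hypothetical.

```python
PAL_CCK_HZ = 3_546_895  # PAL colour clock (assumption: PAL machine)


def cpu_cycles_during_chip_write(cpu_mhz=50.0, cck_slots=2):
    """CPU clock cycles that elapse while a chip RAM write of cck_slots CCKs completes."""
    write_seconds = cck_slots / PAL_CCK_HZ
    return write_seconds * cpu_mhz * 1e6


if __name__ == "__main__":
    print(f"{cpu_cycles_during_chip_write():.1f} CPU cycles per chip write at 50MHz")
```

That comes out at roughly 28 cycles, so the measured 24 to 26 "free" cycles sit just under the bound. A plausible explanation (an assumption, not something measured here) is that a few cycles go on issuing the write itself and on synchronising with the chip bus.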

Karlos · 27 May 2024, 15:07

How about using movem to transfer a number of registers worth of data from a source buffer? Or is it better to just use separate moves? Thinking about instruction cache size here.

meynaf · 27 May 2024, 15:37

Quote:

Originally Posted by Karlos

How about using movem to transfer a number of registers worth of data from a source buffer? Or is it better to just use separate moves? Thinking about instruction cache size here.

For source read, why not.
But cache size isn't an issue here as the loop appears to be very small.

Karlos · 27 May 2024, 15:39

Quote:

Originally Posted by meynaf

For source read, why not.
But cache size isn't an issue here as the loop appears to be very small.

I'm thinking of variations, really. Some loops will be bigger and may involve toggling datacache behaviours (direct CACR manipulation).

Karlos · 27 May 2024, 16:16

I have just pushed an update to the branch that contains the most naive implementation possible as a test case. The lha file contains the updated binary.


