English Amiga Board


Old 26 May 2024, 16:14   #101
Lunda
Registered User
 
Join Date: Jul 2023
Location: Domsjö/Sweden
Posts: 56
Quote:
Originally Posted by abu_the_monkey View Post
that is more believable.

still almost 9mb/s ain't too shabby.

@Lunda is that with the 030 clock at 70mhz? (5 times the 14mhz cpu clock)
Yes, that's with 70.9MHz CPU clock.

An Akiko C2P write + read should take ~6 cycles at 14MHz, so 9MB/s looks correct.
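As a quick check of that estimate, here is a minimal sketch of the arithmetic. It assumes the 14MHz clock is exactly one fifth of the 70.9MHz CPU clock and that each write + read pair moves one 32-bit longword through Akiko. The function name is made up for illustration.

```python
def akiko_throughput_mb_s(cpu_clock_hz=70.9e6, divider=5, cycles_per_longword=6):
    """Millions of bytes converted per second, for a write + read pair
    that costs `cycles_per_longword` cycles of the 14MHz clock."""
    bus_clock_hz = cpu_clock_hz / divider            # 70.9MHz / 5 = 14.18MHz
    longwords_per_s = bus_clock_hz / cycles_per_longword
    return longwords_per_s * 4 / 1e6                 # 4 bytes per longword

print(f"{akiko_throughput_mb_s():.2f} MB/s")
```

With those assumptions the result lands a little above 9 MB/s, which matches the measured figure.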
Lunda is offline  
Old 26 May 2024, 16:14   #102
Karlos
Alien Bleed
 
Karlos's Avatar
 
Join Date: Aug 2022
Location: UK
Posts: 4,416
Quote:
Originally Posted by abu_the_monkey View Post
that is more believable.

still almost 9mb/s ain't too shabby.

@Lunda is that with the 030 clock at 70mhz? (5 times the 14mhz cpu clock)
Technically it's 18 MB/s of bus traffic: 9 million bytes converted per second, each written to Akiko and then read back.
Karlos is offline  
Old 26 May 2024, 16:27   #103
abu_the_monkey
Registered User
 
Join Date: Oct 2020
Location: Bicester
Posts: 2,018
I wonder how running the cpu asynchronously (like most accelerators do) would affect the result.

still such a missed opportunity, not having akiko DMA to chip ram
abu_the_monkey is offline  
Old 26 May 2024, 16:57   #104
Lunda
Registered User
 
Join Date: Jul 2023
Location: Domsjö/Sweden
Posts: 56
Quote:
Originally Posted by abu_the_monkey View Post
I wonder how running the cpu asynchronously (like most accelerators do) would affect the result.

still such a missed opportunity, not having akiko DMA to chip ram
DMA would have been great. Less than 30 CPU cycles (adding one write for the address) at 14 MHz to do C2P and chip write for 32 pixels. Yes, the bus will be busy for longer, but the CPU is free.
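A rough sketch of where "less than 30" could come from, assuming each Akiko register write costs 3 cycles at 14 MHz (half of the ~6 cycles for a write + read pair) and that the hypothetical DMA needs one extra write for the destination address. No such hardware exists, so the names and the 320 x 200 frame size are purely illustrative.

```python
CYCLES_PER_WRITE = 3   # assumption: half of the ~6 cycles for write + read
CHUNKY_WRITES = 8      # 8 longwords = 32 eight-bit pixels
ADDRESS_WRITES = 1     # hypothetical: tell Akiko where in chip RAM to DMA

def dma_cpu_cycles_per_block():
    """14 MHz cycles the CPU would spend feeding one 32-pixel block."""
    return (CHUNKY_WRITES + ADDRESS_WRITES) * CYCLES_PER_WRITE

def dma_cpu_cycles_per_frame(width=320, height=200):
    """Same cost summed over a whole frame (frame size is an assumption)."""
    blocks = (width * height) // 32
    return blocks * dma_cpu_cycles_per_block()

print(dma_cpu_cycles_per_block())   # 27
print(dma_cpu_cycles_per_frame())   # 54000
```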

DMA without fast RAM doesn't improve much though.

DOOM on CD32 would have been good enough. Remember that back then 160 x 200 at 15 fps was considered a great DOOM port.
Lunda is offline  
Old 26 May 2024, 18:50   #105
paraj
Registered User
 
paraj's Avatar
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,190
Great! 3 cycles is an odd (no pun intended) number of cycles for the accesses to take though.

Time for phase 2

Does the 030 actually benefit from "burst reads", filling a complete cache line or something like that? Otherwise it might be better (and easier) to just keep caches disabled during C2P, as the cache would be trashed anyway.
paraj is offline  
Old 26 May 2024, 19:40   #106
pipper
Registered User
 
Join Date: Jul 2017
Location: San Jose
Posts: 676
Does the test also do the necessary chipmem writes? If not, it’s not a realistic scenario.
One opportunity could be to see if Akiko access and chipmem writes can somehow be scheduled in a clever way(?)
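For reference, the per-block work being discussed (convert 32 chunky pixels, then write eight planar longwords to chip RAM) can be modelled in software roughly as below. This is a plain Python model, not Akiko's register interface; the function names, the plane ordering and the bytearray "bitplanes" are assumptions made for illustration.

```python
def c2p_32_pixels(chunky):
    """Convert 32 chunky 8-bit pixels into 8 planar longwords.

    Plane p collects bit p of every pixel; the leftmost pixel lands in
    the most significant bit of each longword.
    """
    if len(chunky) != 32:
        raise ValueError("expected exactly 32 pixels")
    planes = []
    for p in range(8):
        word = 0
        for pixel in chunky:
            word = (word << 1) | ((pixel >> p) & 1)
        planes.append(word)
    return planes

def write_planes(bitplanes, offset, planes):
    """Store one big-endian longword per bitplane (the chip write step)."""
    for plane_buf, word in zip(bitplanes, planes):
        plane_buf[offset:offset + 4] = word.to_bytes(4, "big")

# Usage: one white pixel (colour 0xFF) followed by 31 black ones.
bitplanes = [bytearray(4) for _ in range(8)]
write_planes(bitplanes, 0, c2p_32_pixels([0xFF] + [0x00] * 31))
```

A benchmark that only exercises the first function leaves out the eight stores in the second, which is the gap the question above is pointing at.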
pipper is offline  
Old 26 May 2024, 19:43   #107
Karlos
Alien Bleed
 
Karlos's Avatar
 
Join Date: Aug 2022
Location: UK
Posts: 4,416
Quote:
Originally Posted by pipper View Post
Does the test also do the necessary chipmem writes? If not, it’s not a realistic scenario.
One opportunity could be to see if Akiko access and chipmem writes can somehow be scheduled in a clever way(?)
No, not yet. I was curious about the, pardon the pun, bit by bit breakdown. I'll add that next. My thinking is that maybe there's a hacky, cache-slapping way to improve it.
Karlos is offline  
Old 26 May 2024, 19:59   #108
paraj
Registered User
 
paraj's Avatar
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,190
It would be very interesting to get numbers from an 020 with and without fast RAM (if possible) for this simplified test. I'm almost willing to bet access time is going to be an even number of 14MHz cycles. In principle the chip write (in the best case) is just going to add 8 more 14MHz cycles (2*CCK), so 14 in total per longword with this config.

Seems like (again, my math is probably off) Akiko is a win if C2P can't be done in 8*6*(50/14) ~ 171 cycles (at 50MHz). Lots of effects make it more complicated (including what can and cannot overlap), but you need to read from (maybe fast) RAM and write to chip in either case.
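The two estimates above, spelled out as a small sketch. The parameter names are made up; the numbers are the ones from this post (6 cycles at 14MHz per Akiko write + read, 2 CCK of 4 cycles each for a best-case chip write, 8 longwords per 32-pixel block, CPU at 50MHz).

```python
def cycles_14mhz_per_longword(akiko_cycles=6, chip_write_cck=2, clocks_per_cck=4):
    """Akiko write + read plus a best-case chip write, in 14MHz cycles."""
    return akiko_cycles + chip_write_cck * clocks_per_cck

def break_even_cpu_cycles(cpu_mhz=50.0, bus_mhz=14.0, longwords=8, akiko_cycles=6):
    """CPU cycles a software C2P may spend per 32 pixels before Akiko wins."""
    return longwords * akiko_cycles * (cpu_mhz / bus_mhz)

print(cycles_14mhz_per_longword())         # 14
print(round(break_even_cpu_cycles(), 1))   # 171.4
```

This ignores everything that can overlap, so it is only the same back-of-envelope bound, not a prediction.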
paraj is offline  
Old 26 May 2024, 20:14   #109
Karlos
Alien Bleed
 
Karlos's Avatar
 
Join Date: Aug 2022
Location: UK
Posts: 4,416
I'm doubtful we'll come across some hitherto unknown speedup but it's fun to poke about.
Karlos is offline  
Old 26 May 2024, 20:26   #110
paraj
Registered User
 
paraj's Avatar
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,190
Definitely! And seeing real numbers of the low level stuff is very interesting (instead of FPS from various games).
paraj is offline  
Old 26 May 2024, 21:56   #111
abu_the_monkey
Registered User
 
Join Date: Oct 2020
Location: Bicester
Posts: 2,018
I wonder if it would be better to use akiko for the c2p but to fast ram and then copy to chip.
abu_the_monkey is offline  
Old 26 May 2024, 22:11   #112
Karlos
Alien Bleed
 
Karlos's Avatar
 
Join Date: Aug 2022
Location: UK
Posts: 4,416
Quote:
Originally Posted by abu_the_monkey View Post
I wonder if it would be better to use akiko for the c2p but to fast ram and then copy to chip.
Well we can experiment and find out.
Karlos is offline  
Old 26 May 2024, 22:19   #113
Cyprian
Registered User
 
Join Date: Jul 2014
Location: Warsaw/Poland
Posts: 192
Quote:
Originally Posted by Lunda View Post
I was wrong. See attached pics.

Clock is 14MHz.
nice investigation
thanks to you we now know that Akiko is much better than we thought, even on an accelerated machine.




Quote:
Originally Posted by paraj View Post
It would be very interesting to get numbers from an 020 with and without fast RAM (if possible) for this simplified test. I'm almost willing to bet access time is going to be an even number of 14MHz cycles. In principle the chip write (in the best case) is just going to add 8 more 14MHz cycles (2*CCK), so 14 in total per longword with this config.
I'm also interested in the result
Cyprian is offline  
Old 26 May 2024, 22:26   #114
abu_the_monkey
Registered User
 
Join Date: Oct 2020
Location: Bicester
Posts: 2,018
Quote:
Originally Posted by Karlos View Post
Well we can experiment and find out.
my thinking is that more of the heavy lifting would be done in the 'fast' domain and only the copy of the full converted bitmap would be in the slower fast->chip domain.
abu_the_monkey is offline  
Old 27 May 2024, 14:23   #115
Karlos
Alien Bleed
 
Karlos's Avatar
 
Join Date: Aug 2022
Location: UK
Posts: 4,416
On an 030, if I'm doing a write to chip RAM via an address pointer and I want to add an offset to the pointer immediately afterwards (e.g. calculating the next plane to write to), is the cost of that operation fully masked by the pending write? How many cycles should I expect to be able to execute while the write is happening, assuming operations that aren't doing any data memory accesses?
Karlos is offline  
Old 27 May 2024, 14:50   #116
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
Originally Posted by Karlos View Post
On an 030, if I'm doing a write to chip RAM via an address pointer and I want to add an offset to the pointer immediately afterwards (e.g. calculating the next plane to write to), is the cost of that operation fully masked by the pending write?
Yes - as long as the value to add doesn't come from memory, of course.
Nearly every register-only instruction seems to 'pipeline' well, except iterative instructions such as mul & div which stall like memory accesses.


Quote:
Originally Posted by Karlos View Post
How many cycles should I expect to be able to execute while the write is happening, assuming operations that aren't doing any data memory accesses?
For a 50MHz 030: at least 24, usually 26. Experiments have shown the exact number isn't easy to predict.
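For a rough sense of how that compares with the bus timing quoted earlier in the thread, the sketch below converts a best-case chip write of 2 CCK (8 cycles at 14MHz) into 50MHz CPU cycles. It assumes a 14.18MHz bus clock and ignores any synchronisation overhead, so treat the result as a loose upper bound only. The per-instruction cost is a parameter you supply, not a figure from this thread.

```python
def cpu_cycles_during_chip_write(cpu_mhz=50.0, bus_mhz=14.18, write_bus_cycles=8):
    """CPU cycles that elapse while one best-case chip write completes."""
    return write_bus_cycles * cpu_mhz / bus_mhz

def instructions_hidden(free_cycles, cycles_per_instruction):
    """How many register-only instructions of a given cost fit in the gap."""
    return free_cycles // cycles_per_instruction

print(round(cpu_cycles_during_chip_write(), 1))
# Example with a hypothetical 2-cycle register-only instruction:
print(instructions_hidden(24, 2), instructions_hidden(26, 2))
```

The bound comes out at roughly 28 cycles, so the measured 24 to 26 free cycles sit just under it.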
meynaf is online now  
Old 27 May 2024, 15:07   #117
Karlos
Alien Bleed
 
Karlos's Avatar
 
Join Date: Aug 2022
Location: UK
Posts: 4,416
How about using movem to transfer a number of registers' worth of data from a source buffer? Or is it better to just use separate moves? Thinking about instruction cache size here.
Karlos is offline  
Old 27 May 2024, 15:37   #118
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
Originally Posted by Karlos View Post
How about using movem to transfer a number of registers' worth of data from a source buffer? Or is it better to just use separate moves? Thinking about instruction cache size here.
For source read, why not.
But cache size isn't an issue here as the loop appears to be very small.
meynaf is online now  
Old 27 May 2024, 15:39   #119
Karlos
Alien Bleed
 
Karlos's Avatar
 
Join Date: Aug 2022
Location: UK
Posts: 4,416
Quote:
Originally Posted by meynaf View Post
For source read, why not.
But cache size isn't an issue here as the loop appears to be very small.
I'm thinking of variations, really. Some loops will be bigger and may involve toggling data cache behaviours (direct CACR manipulation).
Karlos is offline  
Old 27 May 2024, 16:16   #120
Karlos
Alien Bleed
 
Karlos's Avatar
 
Join Date: Aug 2022
Location: UK
Posts: 4,416
I have just pushed an update to the branch that contains the most naive implementation possible as a test case. The lha file contains the updated binary.
Karlos is offline  
 

