English Amiga Board


Old 23 February 2024, 03:49   #1
lmimmfn
Registered User
 
Join Date: May 2018
Location: Ireland
Posts: 691
How much CPU does C2P consume?

I'm curious how much CPU time the most generic C2P routine (not optimized for edge cases etc.) consumes across the Amiga range and its CPUs. I realise the Chip RAM throughput on 16-bit machines is only half that of 32-bit machines, but does anyone have any benchmarks of CPU performance across the Amiga range and CPUs?

I thought it might be interesting vs top-level Intel performance.
Thanks
Old 23 February 2024, 04:04   #2
NovaCoder
Registered User
 
Join Date: Sep 2007
Location: Melbourne/Australia
Posts: 4,416
Almost nothing for an 060 (esp. overclocked), most of my old ports run at about the same FPS for both AGA and RTG.
Old 23 February 2024, 05:48   #3
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 3,307
Well, C2P takes all the CPU time it can get, of course, though if the target is the chip memory, then typically the bottleneck is the interface to the chip memory - if this is what you are asking. That does not mean that it comes for free compared to a "native chunky" display. Actually, in the latter case the CPU would only render parts of the screen - the objects to animate - but with C2P, it would typically convert the entire frame buffer. Doing C2P for only an object or a part of the frame also costs overhead.

This is not a matter of "Intel vs. 68k" either - it is more a matter of the data organization and the available bandwidths.

Thus - would a native display in chunky be faster, even with the same bottleneck? Yes, but not because the C2P would go away, but because you would not have to touch the entire screen and move less data around.

Would it be faster with a faster chip memory bandwidth? Yes, most definitely, and at that point C2P would be the bottleneck, not the memory bandwidth.

With the bandwidths available, it does not make a difference whether you copy an entire frame planar to planar, or convert an entire frame with C2P from chunky to planar, but that's not because the conversion is "for free", but because the chip memory is so slow.
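For readers who have not seen the conversion itself, here is a minimal sketch of what "chunky to planar" means. It is written for clarity, not speed: real routines (such as the ones linked later in this thread) work on whole 32-bit registers with merge operations rather than bit by bit. The function names `c2p_8pixels` and `c2p` are made up for this illustration.

```python
def c2p_8pixels(chunky, depth=8):
    """Convert 8 chunky pixels (one byte each) into `depth` bitplane bytes.

    Bit 7 of each plane byte is the leftmost pixel, as on Amiga bitplanes.
    """
    if len(chunky) != 8:
        raise ValueError("expected exactly 8 pixels")
    planes = []
    for plane in range(depth):
        byte = 0
        for x, pixel in enumerate(chunky):
            bit = (pixel >> plane) & 1   # bit number `plane` of this pixel
            byte |= bit << (7 - x)       # leftmost pixel -> most significant bit
        planes.append(byte)
    return planes


def c2p(chunky, width, height, depth=8):
    """Naive whole-frame conversion: returns one bytearray per bitplane."""
    if width % 8 != 0 or len(chunky) != width * height:
        raise ValueError("width must be a multiple of 8 and match the buffer")
    planes = [bytearray(width * height // 8) for _ in range(depth)]
    for offset in range(0, width * height, 8):
        group = c2p_8pixels(chunky[offset:offset + 8], depth)
        for plane, byte in enumerate(group):
            planes[plane][offset // 8] = byte
    return planes
```

Every output byte depends on eight input bytes, which is why the whole chunky buffer has to be read and shuffled even when only a few pixels have changed.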
Old 23 February 2024, 07:23   #4
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,039
You can check this old c2p thread and results for NONE (no writes to chip ram).

https://eab.abime.net/showthread.php...475#post967475
Old 23 February 2024, 08:06   #5
meynaf
son of 68k
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
Originally Posted by Thomas Richter View Post
Doing C2P for only an object or a part of the frame also costs overhead.
Not that much, unless there are many independent objects on screen - and in that case, merging them into single c2p might be better.
There is a lot to be gained with partial c2p. Without it, my HOMM2 port would just crawl.


Quote:
Originally Posted by Thomas Richter View Post
Thus - would a native display in chunky be faster, even with the same bottleneck? Yes, but not because the C2P would go away, but because you would not have to touch the entire screen and move less data around.
I don't think so. At least, not always. Provided there is a back-buffer in fastmem, it would be chunky and thus there would not be more data to move around.
Then, be it rendered thru copymem or c2p, it does not matter (on 060, that is).
With a significant amount of graphic operations, using a buffer in fastmem is probably faster than rendering directly to chipmem, btw - even if we had native chunky.
Old 23 February 2024, 08:54   #6
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 3,307
How can rendering things twice (first to fast mem, then c2p to chip) be possibly faster than rendering only once (CPU directly to chip)?
Old 23 February 2024, 09:08   #7
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,062
If you read from the buffer, or write the same pixel multiple times...
Old 23 February 2024, 09:14   #8
meynaf
son of 68k
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
Originally Posted by Thomas Richter View Post
How can rendering things twice (first to fast mem, then c2p to chip) be possibly faster than rendering only once (CPU directly to chip)?
This is because the first rendering pass consists of complex operations done in fast memory (which may involve reading said memory over and over for handling transparency), while the second is more or less just a memory copy (to incredibly slow memory).
For example (taking 60ns fastmem, a 50MHz 68030 and 32-bit chipmem), let's have 3 mem accesses (8+8+8) - that's not much actually - and then a copy to chip (8+26), 58 in total. Now compare to direct rendering (26+26+26), 78 in total. More mem accesses, but faster. And that's a small example, on a rather modest CPU, with few operations and without counting data caches (which won't cache chipmem).
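The same arithmetic as a small sketch, in case anyone wants to try other access counts. The costs 8 and 26 are the figures from the example above (the unit is whatever that example uses); the function names `buffered` and `direct` are made up here, and the model ignores caches and everything else the CPU does.

```python
# Cost per memory access, using the figures from the example above.
FAST = 8    # one fast-memory access
CHIP = 26   # one chip-memory access


def buffered(accesses):
    """Render in fast memory, then copy once: one fast read plus one chip write."""
    return accesses * FAST + (FAST + CHIP)


def direct(accesses):
    """Every rendering access goes straight to chip memory."""
    return accesses * CHIP


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, buffered(n), direct(n))
```

Under these assumptions the back-buffer only loses when each location is touched exactly once; from two accesses upwards it is already ahead.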
Old 25 February 2024, 07:08   #9
neoman
titan sucks!
 
Join Date: Dec 2012
Location: munich/germany
Posts: 54
https://github.com/Kalmalyzer/kalms-...ee/main/normal

At the top of many of the .s files, Kalms documents how much his routines need.
Old 25 February 2024, 10:38   #10
Thorham
Computer Nerd
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,840
Quote:
Originally Posted by Thomas Richter View Post
With the bandwidths available, it does not make a difference whether you copy an entire frame planar to planar, or convert an entire frame with C2P from chunky to planar, but that's not because the conversion is "for free", but because the chip memory is so slow.
You're forgetting that the C2P code runs during the slow chipmem writes. If you don't need C2P, then you can do other work during the copy loop.
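A toy model of that point, assuming the CPU work in each loop iteration fully overlaps with the wait for the pending chip write. The function names and the figure of 20 for the C2P work are hypothetical; 26 is the chip access cost used earlier in the thread.

```python
def loop_cost(chip_write, cpu_work, count):
    """Cost of `count` iterations when CPU work overlaps the pending chip write."""
    return count * max(chip_write, cpu_work)


def idle_time(chip_write, cpu_work, count):
    """CPU time left over during the loop that other work could have used."""
    return count * max(0, chip_write - cpu_work)


if __name__ == "__main__":
    plain_copy = loop_cost(26, 0, 1000)    # copy with no extra work
    with_c2p = loop_cost(26, 20, 1000)     # conversion hidden behind the writes
    print(plain_copy, with_c2p)
    print(idle_time(26, 0, 1000), idle_time(26, 20, 1000))
```

In this model both loops take the same time, but the plain copy leaves far more idle CPU time per iteration for other work, which is the hidden cost of C2P.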
Old 25 February 2024, 12:21   #11
arti040
Piotr
 
Join Date: Jul 2013
Location: Lodz/Poland
Age: 40
Posts: 207
Dumb question - would it be possible to design an accelerator board with dedicated chip and memory to perform c2p externally? Does it make sense? ;-)
Old 25 February 2024, 20:26   #12
pandy71
Registered User
 
Join Date: Jun 2010
Location: PL?
Posts: 2,875
Quote:
Originally Posted by arti040 View Post
Dumb question - would it be possible to design an accelerator board with dedicated chip and memory to perform c2p externally? Does it make sense? ;-)
The Akiko patent is here: https://www.freepatentsonline.com/5461680.html - it would be quite easy to recreate the C2P functionality to provide legacy compatibility. Adding DMA to perform on-the-fly conversion (copying data from FAST to CHIP while doing the C2P conversion) is also possible, so overall it could be sped up.
This may also be interesting: https://eab.abime.net/showthread.php?t=105664 - the RP2040 seems to be a perfect solution for small Amiga improvements - the RP2040 is probably fast enough to deal with the RGA bus (so it could sit on top of Denise and provide functionality comparable to Indivision)
Old 25 February 2024, 22:08   #13
arti040
Piotr
 
Join Date: Jul 2013
Location: Lodz/Poland
Age: 40
Posts: 207
Wow! Thanks for the links
Old 01 March 2024, 23:27   #14
Bruce Abbott
Registered User
 
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,719
Some real-world results on my A1200 with Blizzard 1230IV (50MHz 030) and 60ns RAM, running DoomAttack timedemo at standard window size (2 steps down from full-screen) with various copy routines:-

"c2p_optimized" (normal AGA c2p routine) - 10.1 fps
"fake chunky" (copy FastRAM to ChipRAM) - 10.85 fps, 7% faster.
"fake RTG" (copy FastRAM to FastRAM) - 12.65 fps, 25% faster
"Fake RTG" direct (just rendering to FastRAM) - 13.27 fps, 31% faster

Conclusions:

- On a 50MHz 030 the c2p overhead is minimal. The improvement from having a hardware chunky mode in the AGA chipset would be practically unnoticeable (at least for Doom and similar games).

- Using a 32 bit graphics card on the local CPU bus could potentially increase Doom's frame rate by up to 31%, which is a significant but not amazing improvement. Programs that spend less time calculating would benefit more.
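The four timedemo results above can also be read as time per frame, which makes it easier to see where the milliseconds go. This is only a sketch: it assumes the runs differ in nothing but the final copy step, so that the difference between two neighbouring results can be attributed to that step. The variable names are made up for the illustration.

```python
# Frame rates from the DoomAttack timedemo results above.
fps = {
    "c2p_optimized": 10.10,    # normal AGA c2p routine
    "fake chunky": 10.85,      # copy FastRAM to ChipRAM
    "fake RTG": 12.65,         # copy FastRAM to FastRAM
    "fake RTG direct": 13.27,  # just rendering to FastRAM
}

# Milliseconds per frame for each variant.
ms = {name: 1000.0 / rate for name, rate in fps.items()}

# Speedup over the c2p routine, in whole percent.
base = fps["c2p_optimized"]
speedup = {name: round((rate / base - 1) * 100) for name, rate in fps.items()}

# Per-frame cost attributed to each step (assumption: differences are additive).
c2p_cost = ms["c2p_optimized"] - ms["fake chunky"]   # conversion work itself
chip_cost = ms["fake chunky"] - ms["fake RTG"]       # writing to chip vs fast RAM
copy_cost = ms["fake RTG"] - ms["fake RTG direct"]   # the extra fast-to-fast copy

if __name__ == "__main__":
    print(speedup)
    print(round(c2p_cost, 1), round(chip_cost, 1), round(copy_cost, 1))
```

Read this way, out of roughly 99 ms per frame about 6.8 ms goes to the conversion, about 13.1 ms to the chip RAM writes and about 3.7 ms to the copy, with the rest being the rendering itself.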

As a comparison, here are some selected Doom frame rates on various PCs:-

i386 SX25 WDC VGA ISA - 3.4 fps
Am386 DX40 (MX83) ISA - 8.24 fps
486DX2/66 miro 1H10AD VLB - 10.02 fps
486DX2/66 CL-GD5428 1MB VLB - 10.3 fps
P100 TVGA 8800CS 512KB ISA - 11.82 fps
P100 Stealth II S220 4MB PCI - 15.71 fps
P100 Trident TVGA 8900D ISA - 32.35 fps
P100 Bali 32 1MB PCI - 73.66 fps

And here are some more on various systems:-

NextStation 68040 33MHz 2-bit grayscale - 9.8 fps
SPARCstation IPX MB86903 40MHz Sun GX - 10.9 fps
Pentium-60 Compaq Qvision 2000+ MGA PCI - 11.8 fps
Amiga 1200 68040 40MHz AGA - 13.4 fps
Pentium 75 S3 PCI - 23.2 fps
Amiga 1200 68060 50MHz AGA - 24.6 fps
Pentium-120 Diamond Viper SE PCI - 27.4 fps

On faster PCs the frame rates vary greatly depending on the graphics card and bus settings. The limit for ISA bus cards appears to be 33 fps, though most were around 15 fps and some VLB and PCI machines were even slower despite having a fast 486 or Pentium CPU.
Old 02 March 2024, 11:29   #15
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 3,307
Quote:
Originally Posted by Bruce Abbott View Post
"c2p_optimized" (normal AGA c2p routine) - 10.1 fps
"fake chunky" (copy FastRAM to ChipRAM) - 10.85 fps, 7% faster.
"fake RTG" (copy FastRAM to FastRAM) - 12.65 fps, 25% faster
"Fake RTG" direct (just rendering to FastRAM) - 13.27 fps, 31% faster

Conclusions:

- On a 50MHz 030 the c2p overhead is minimal.
30% is minimal? I don't think so. ChipMem access is slow, of course, but with a properly made wider RAM interface as you find it in graphics cards, you do get some nice gains.



So what you conclude from this "fake comparison" is wrong. It is not "oh, we don't need chunky". It is rather "speed up the damn chipset". What you would need to compare is "planar over a properly made chip ram interface" vs. "chunky over a properly made chip ram interface", and that would measure the overhead of a stupid conversion that could have been avoided if CBM had made some investments into the chips instead of "read my lips - no new chips".


Thus, now go back to measurement and measure "c2p from fast to fast" vs. "direct rendering into fast". That gives you the right numbers for decision making for graphics modes.
Old 02 March 2024, 12:48   #16
HornBeamSoft
Registered User
 
Join Date: Aug 2020
Location: Namestovo/Slovakia
Posts: 17
Quote:
Originally Posted by Bruce Abbott View Post
Some real-world results on my A1200 with Blizzard 1230IV (50MHz 030) and 60ns RAM, running DoomAttack timedemo at standard window size (2 steps down from full-screen) with various copy routines
It is very interesting.
Do you have results for an A1200 with Fast RAM only, and could you publish them?
Old 02 March 2024, 13:18   #17
Thorham
Computer Nerd
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,840
Quote:
Originally Posted by Thomas Richter View Post
30% is minimal? I don't think so.
It is when we're going from 10.1 fps to 13.27 fps. Even double the speed isn't good enough. You really want 30 fps, so you need three times the speed.
Old 02 March 2024, 20:05   #18
Bruce Abbott
Registered User
 
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,719
Quote:
Originally Posted by Thomas Richter View Post
30% is minimal? I don't think so. ChipMem access is slow, of course, but with a properly made wider RAM interface as you find it in graphics cards, you do get some nice gains.
Somebody can't read. The difference between planar with c2p and chunky (if AGA had that) is only 7%. 31% is the theoretical speedup if you had a graphics card connected directly to the 030 local bus with no wait states etc. at 33MB/s write speed. Even Zorro III can't do that.

Quote:
So what you conclude from this "fake comparison" is wrong. It is not "oh, we don't need chunky". It is rather "speed up the damn chipset".
AGA+ would have sped the bus up to 14MHz, which would get maybe a 25% higher frame rate. But we didn't need that. Just put a faster CPU in there and you get the same effect. A 40MHz 040 was enough. A 50MHz 060 got all the speed you would ever need for Doom. Chuck in a PiStorm and...


Quote:
What you would need to compare is "planar over a properly made chip ram interface" vs. "chunky over a properly made chip ram interface", and that would measure the overhead of a stupid conversion that could have been avoided if CBM had made some investments into the chips instead of "read my lips - no new chips".
You clearly don't appreciate the situation. Commodore had 15 engineers working on AAA and still didn't have properly working silicon when crunch time came. AGA was never designed to be the ultimate Amiga chipset, it was just an extension of ECS to put in the A500 replacement while they worked on the high-end chipset. But if you had a big box Amiga then you already had what was needed to get better graphics - bus slots.

AGA did a good job of what it was designed to do. Everybody at Commodore (engineers and management) agreed on that - the only issue being how long it took to get out the door.

AAA was a different story. I think Commodore should have killed it early on and concentrated on the low-end AA (AGA) instead. For the high end they should have just put RTG into the OS and let 3rd party graphics cards fill the gap. But most of the engineers didn't want that - they wanted to go up-market with a 'VGA killer' chipset, rather than down-market where Commodore's strength lay.

Quote:
Thus, now go back to measurement and measure "c2p from fast to fast" vs. "direct rendering into fast". That gives you the right numbers for decision making for graphics modes.
No, it doesn't. There was no way that AGA would work at 33MHz. That's what AAA was supposed to do, but I doubt they could have managed it with the process they had. So they would have to outsource all the chips - and then a year or two later need another upgrade to counter the latest PCI graphics cards. Towards the end the engineers realized this. The next high-end Amiga (if Commodore had survived) would probably have PCI slots.
Old 02 March 2024, 20:22   #19
Bruce Abbott
Registered User
 
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,719
Quote:
Originally Posted by HornBeamSoft View Post
It is very interesting.
Do you have results for an A1200 with Fast RAM only, and could you publish them?
Not sure what you mean by that - I don't think you can have an A1200 'with Fast RAM only'.

If you mean everything being in FastRAM, including Doom code and data, screen memory and ROM, that's the 'Fake RTG direct' test, where it just renders to FastRAM and that's it (no screen copy, no c2p).
Old 02 March 2024, 20:25   #20
AestheticDebris
Registered User
 
Join Date: May 2023
Location: Norwich
Posts: 429
Quote:
Originally Posted by Bruce Abbott View Post
Somebody can't read. The difference between planar with c2p and chunky (if AGA had that) is only 7%. 31% is the theoretical speedup if you had a graphics card connected directly to the 030 local bus with no wait states etc. at 33MB/s write speed. Even Zorro III can't do that.
What those numbers represent is really not clear at all. You've described "fake chunky" as Fast RAM To Chip RAM, but if AGA had chunky mode it wouldn't render into fast RAM at all. Without knowing exactly what is being done and why, the performance measurements aren't really obvious at all.

Quote:
Originally Posted by Bruce Abbott View Post

AGA did a good job of what it was designed to do. Everybody at Commodore (engineers and management) agreed on that - the only issue being how long it took to get out the door.
Everyone at Amstrad thought the same about the GX4000 console. When you're on the inside of a project, tunnel vision can easily blind you to mistakes that are obvious to outside observers.
 

