How much CPU does C2P consume?

lmimmfn · 23 February 2024, 03:49

I'm curious how much CPU the most generic C2P routine (not optimized for edge cases etc.) consumes across the Amiga range and its CPUs. I realise the Chip RAM throughput on 16-bit machines is only half that of 32-bit ones, but does anyone have any benchmarks on CPU performance across the Amiga range and CPUs?

I thought it might be interesting vs top-level Intel performance.
Thanks

NovaCoder · 23 February 2024, 04:04

Almost nothing for an 060 (esp. overclocked), most of my old ports run at about the same FPS for both AGA and RTG.

Thomas Richter · 23 February 2024, 05:48

Well, C2P takes all the CPU time it can get, of course, though if the target is the chip memory, then typically the bottleneck is the interface to the chip memory - if this is what you ask. That does not mean that it comes for free compared to a "native chunky" display. Actually, in the latter case the CPU would only render parts of the screen - the objects to animate - but with C2P, it would typically convert the entire frame buffer. Doing C2P only for an object or a part of the frame also costs overhead.

This is neither a matter of "intel vs. 68k" - it is more a matter of the data organization and the available bandwidths.

Thus - would a native display in chunky be faster, even with the same bottleneck? Yes, but not because the C2P would go away, but because you would not have to touch the entire screen and move less data around.

Would it be faster with a faster chip memory bandwidth? Yes, most definitely, and at that point C2P would be the bottleneck, not the memory bandwidth.

With the bandwidths available, it does not make a difference whether you copy an entire frame planar to planar, or convert an entire frame with C2P from chunky to planar, but that's not because the conversion is "for free", but because the chip memory is so slow.
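
For readers unfamiliar with what the conversion actually has to do, here is a minimal reference sketch of chunky-to-planar in Python. It is not any of the routines discussed in this thread and says nothing about their speed; the function name `c2p_naive` and the 8-bitplane default are assumptions made for illustration only.

```python
def c2p_naive(chunky, depth=8):
    """Convert chunky pixels (one byte per pixel) into `depth` bitplanes.

    Returns a list of `depth` bytearrays. Bit 7 of each output byte is the
    leftmost of 8 pixels, and plane p holds bit p of every pixel value.
    This is an illustrative model, not an optimized 68k routine.
    """
    if len(chunky) % 8:
        raise ValueError("pixel count must be a multiple of 8")
    planes = [bytearray(len(chunky) // 8) for _ in range(depth)]
    for i, pixel in enumerate(chunky):
        byte_index, bit = divmod(i, 8)
        mask = 0x80 >> bit                # leftmost pixel goes to the top bit
        for p in range(depth):
            if (pixel >> p) & 1:          # bit p of the pixel belongs to plane p
                planes[p][byte_index] |= mask
    return planes


# Eight pixels: the first has colour 1, the last has colour 0x80.
planes = c2p_naive(bytes([1, 0, 0, 0, 0, 0, 0, 0x80]))
```

Every output byte depends on eight input pixels and every input pixel can touch up to eight planes, which is the data-organization cost referred to above.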

Don_Adan · 23 February 2024, 07:23

You can check this old c2p thread and results for NONE (no writes to chip ram).

https://eab.abime.net/showthread.php...475#post967475

meynaf · 23 February 2024, 08:06

Quote:

Originally Posted by Thomas Richter

Doing C2P only for an object or a part of the frame also costs overhead.

Not that much, unless there are many independent objects on screen - and in that case, merging them into single c2p might be better.
There is a lot to be gained with partial c2p. Without it, my HOMM2 port would just crawl.
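
A minimal sketch of the "merge them into a single c2p" idea, assuming the changed areas are tracked as rectangles. The function name `merge_dirty_rects` and the 32-pixel horizontal alignment are assumptions for illustration; the alignment a real c2p routine needs depends on that routine.

```python
def merge_dirty_rects(rects, align=32):
    """Merge dirty rectangles into one bounding box for a single partial c2p.

    rects: iterable of (x0, y0, x1, y1) with x1/y1 exclusive.
    The box is widened so both x edges fall on multiples of `align` pixels
    (a hypothetical requirement of the conversion routine).
    Returns None when nothing changed. No clipping to the screen is done.
    """
    rects = list(rects)
    if not rects:
        return None
    x0 = min(r[0] for r in rects)
    y0 = min(r[1] for r in rects)
    x1 = max(r[2] for r in rects)
    y1 = max(r[3] for r in rects)
    x0 = (x0 // align) * align            # round left edge down
    x1 = -(-x1 // align) * align          # round right edge up
    return (x0, y0, x1, y1)


box = merge_dirty_rects([(10, 5, 40, 20), (50, 8, 70, 30)])
```

With many small, widely scattered objects the merged box approaches the full frame, at which point a full-screen c2p is the simpler choice.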

Quote:

Originally Posted by Thomas Richter

Thus - would a native display in chunky be faster, even with the same bottleneck? Yes, but not because the C2P would go away, but because you would not have to touch the entire screen and move less data around.

I don't think so. At least, not always. Provided there is a back-buffer in fastmem, it would be chunky and thus there would not be more data to move around.
Then, be it rendered thru copymem or c2p, it does not matter (on 060, that is).
With a significant amount of graphic operations, using a buffer in fastmem is probably faster than rendering directly to chipmem, btw - even if we had native chunky.

Thomas Richter · 23 February 2024, 08:54

How can rendering things twice (first to fast mem, then c2p to chip) be possibly faster than rendering only once (CPU directly to chip)?

a/b · 23 February 2024, 09:08

If you read from the buffer, or write the same pixel multiple times...

meynaf · 23 February 2024, 09:14

Quote:

Originally Posted by Thomas Richter

How can rendering things twice (first to fast mem, then c2p to chip) be possibly faster than rendering only once (CPU directly to chip)?

This is because the first rendering consists of complex operations done in fast memory (which may involve reading said memory over and over to handle transparency), and the second is more or less just a memory copy (to incredibly slow mem).
For example (taking 60ns fastmem, a 50MHz 68030 and 32-bit chipmem), let's have 3 mem accesses (8+8+8) - that's not much actually - and then a copy to chip (8+26). Now compare to direct rendering (26+26+26). More mem accesses, but faster. And that's a small example, on a rather modest CPU, with few operations and without counting data caches (which won't cache chipmem).
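
The arithmetic in this example can be written out as a small sketch. The two access costs (8 and 26) are the figures from the post, in whatever unit the post intends (presumably CPU clock cycles); the function names are made up for illustration, and the model ignores the CPU work of the c2p itself and any cache effects.

```python
FAST_COST = 8    # cost of one fastmem access, figure taken from the post
CHIP_COST = 26   # cost of one chipmem access, figure taken from the post


def buffered_cost(render_accesses):
    """Render in fastmem, then read the result once and write it to chipmem."""
    return render_accesses * FAST_COST + FAST_COST + CHIP_COST


def direct_cost(render_accesses):
    """Do every rendering access directly against chipmem."""
    return render_accesses * CHIP_COST


# The case from the post: three accesses per pixel while rendering.
buffered = buffered_cost(3)   # 8+8+8 + 8+26
direct = direct_cost(3)       # 26+26+26
```

For three accesses this gives 58 against 78. Under these figures the fastmem buffer already wins from two accesses per pixel upward (50 against 52), while a single write per pixel is cheaper done directly (42 against 26).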

neoman · 25 February 2024, 07:08

https://github.com/Kalmalyzer/kalms-...ee/main/normal

At the top of many .s files, Kalms documents how much his routines need.

Thorham · 25 February 2024, 10:38

Quote:

Originally Posted by Thomas Richter

With the bandwidths available, it does not make a difference whether you copy an entire frame planar to planar, or convert an entire frame with C2P from chunky to planar, but that's not because the conversion is "for free", but because the chip memory is so slow.

You're forgetting that the C2P code runs during the slow chipmem writes. If you don't need C2P, then you can do other work during the copy loop.

arti040 · 25 February 2024, 12:21

Dumb question - would it be possible to design an accelerator board with dedicated chip and memory to perform c2p externally? Does it make sense? ;-)

pandy71 · 25 February 2024, 20:26

Quote:

Originally Posted by arti040

Dumb question - would it be possible to design an accelerator board with dedicated chip and memory to perform c2p externally? Does it make sense? ;-)

The Akiko patent is here: https://www.freepatentsonline.com/5461680.html - it is quite easy to recreate the C2P functionality to provide legacy compatibility. Adding DMA to perform on-the-fly conversion (copying data from FAST to CHIP at the same time as the C2P conversion) is also possible, so overall it can be sped up.
This may also be interesting: https://eab.abime.net/showthread.php?t=105664 - the RP2040 seems to be a perfect solution for small Amiga improvements. It is probably fast enough to deal with the RGA bus (so it could sit on top of Denise and perform some functionality comparable to Indivision).

arti040 · 25 February 2024, 22:08

Wow! Thanks for the links

Bruce Abbott · 01 March 2024, 23:27

Some real-world results on my A1200 with Blizzard 1230IV (50MHz 030) and 60ns RAM, running DoomAttack timedemo at standard window size (2 steps down from full-screen) with various copy routines:-

"c2p_optimized" (normal AGA c2p routine) - 10.1 fps
"fake chunky" (copy FastRAM to ChipRAM) - 10.85 fps, 7% faster.
"fake RTG" (copy FastRAM to FastRAM) - 12.65 fps, 25% faster
"Fake RTG" direct (just rendering to FastRAM) - 13.27 fps, 31% faster

Conclusions:

- On a 50MHz 030 the c2p overhead is minimal. The improvement from having a hardware chunky mode in the AGA chipset would be practically unnoticeable (at least for Doom and similar games).

- Using a 32 bit graphics card on the local CPU bus could potentially increase Doom's frame rate by up to 31%, which is a significant but not amazing improvement. Programs that spend less time calculating would benefit more.
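
The percentages and the implied per-frame costs can be reproduced from the four frame rates above. This is only a sketch: subtracting frame times assumes the stages simply add up with no overlap, which is an assumption and not something the measurements establish. The dictionary keys and helper names are chosen for illustration.

```python
# Frame rates (fps) as posted above.
results = {
    "c2p_optimized": 10.1,     # c2p from FastRAM to ChipRAM
    "fake chunky": 10.85,      # plain copy FastRAM to ChipRAM
    "fake RTG": 12.65,         # plain copy FastRAM to FastRAM
    "fake RTG direct": 13.27,  # render to FastRAM only, no copy
}


def frame_ms(fps):
    """Time spent on one frame, in milliseconds."""
    return 1000.0 / fps


def speedup_pct(fps, baseline):
    """Frame-rate gain over the baseline, in percent."""
    return (fps / baseline - 1.0) * 100.0


base = results["c2p_optimized"]
for name, fps in results.items():
    print(f"{name:16s} {frame_ms(fps):6.2f} ms/frame  {speedup_pct(fps, base):+5.1f}%")

# Differences between adjacent configurations, assuming additive stage costs.
c2p_extra_ms = frame_ms(results["c2p_optimized"]) - frame_ms(results["fake chunky"])
chip_write_extra_ms = frame_ms(results["fake chunky"]) - frame_ms(results["fake RTG"])
copy_ms = frame_ms(results["fake RTG"]) - frame_ms(results["fake RTG direct"])
```

Under that additive assumption the conversion itself accounts for roughly 6.8 ms of a 99 ms frame, writing to ChipRAM instead of FastRAM for roughly 13.1 ms, and the FastRAM copy for roughly 3.7 ms.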

As a comparison, here are some selected Doom frame rates on various PCs:-

i386 SX25 WDC VGA ISA - 3.4 fps
Am386 DX40 (MX83) ISA - 8.24 fps
486DX2/66 miro 1H10AD VLB - 10.02 fps
486DX2/66 CL-GD5428 1MB VLB - 10.3 fps
P100 TVGA 8800CS 512KB ISA - 11.82 fps
P100 Stealth II S220 4MB PCI - 15.71 fps
P100 Trident TVGA 8900D ISA - 32.35 fps
P100 Bali 32 1MB PCI - 73.66 fps

And here are some more on various systems:-

NextStation 68040 33MHz 2-bit grayscale - 9.8 fps
SPARCstation IPX MB86903 40MHz Sun GX - 10.9 fps
Pentium-60 Compaq Qvision 2000+ MGA PCI - 11.8 fps
Amiga1200 68040 40MHz AGA - 13.4 fps
Pentium 75 S3 PCI - 23.2 fps
Amiga 1200 68060 50MHz AGA - 24.6 fps
Pentium-120 Diamond Viper SE PCI - 27.4 fps

On faster PCs the frame rates vary greatly depending on the graphics card and bus settings. The limit for ISA bus cards appears to be 33 fps, though most were around 15 fps and some VLB and PCI machines were even slower despite having a fast 486 or Pentium CPU.

Thomas Richter · 02 March 2024, 11:29

Quote:

Originally Posted by Bruce Abbott

"c2p_optimized" (normal AGA c2p routine) - 10.1 fps
"fake chunky" (copy FastRAM to ChipRAM) - 10.85 fps, 7% faster.
"fake RTG" (copy FastRAM to FastRAM) - 12.65 fps, 25% faster
"Fake RTG" direct (just rendering to FastRAM) - 13.27 fps, 31% faster

Conclusions:

- On a 50MHz 030 the c2p overhead is minimal.

30% is minimal? I don't think so. ChipMem access is slow, of course, but with a properly made wider RAM interface as you find it in graphics cards, you do get some nice gains.

So what you conclude from this "fake comparison" is wrong. It is not "oh, we don't need chunky". It is rather "speed up the damn chipset". What you would need to compare is "planar over a properly made chip ram interface" vs. "chunky over a properly made chip ram interface", and that would measure the overhead of a stupid conversion that could have been avoided if CBM had made some investments into the chips instead of "read my lips - no new chips".

Thus, now go back to measurement and measure "c2p from fast to fast" vs. "direct rendering into fast". That gives you the right numbers for decision making for graphics modes.

HornBeamSoft · 02 March 2024, 12:48

Quote:

Originally Posted by Bruce Abbott

Some real-world results on my A1200 with Blizzard 1230IV (50MHz 030) and 60ns RAM, running DoomAttack timedemo at standard window size (2 steps down from full-screen) with various copy routines

It is very interesting.
Do you have results for an A1200 with Fast RAM only, and could you publish them?

Thorham · 02 March 2024, 13:18

Quote:

Originally Posted by Thomas Richter

30% is minimal? I don't think so.

It is when we're going from 10.1 fps to 13.27 fps. Even double the speed isn't good enough. You really want 30 fps, so you need three times the speed.

Bruce Abbott · 02 March 2024, 20:05

Quote:

Originally Posted by Thomas Richter

30% is minimal? I don't think so. ChipMem access is slow, of course, but with a properly made wider RAM interface as you find it in graphics cards, you do get some nice gains.

Somebody can't read. The difference between planar with c2p and chunky (if AGA had that) is only 7%. 31% is the theoretical speedup if you had a graphics card connected directly to the 030 local bus with no wait states etc. at 33MB/s write speed. Even Zorro III can't do that.

Quote:

So what you conclude from this "fake comparison" is wrong. It is not "oh, we don't need chunky". It is rather "speed up the damn chipset".

AGA+ would have sped the bus up to 14MHz, which would get maybe 25% higher frame rate. But we didn't need that. Just put a faster CPU in there and you get the same effect. A 40MHz 040 was enough. A 50MHz 060 got all the speed you would ever need for Doom. Chuck in a PiStorm and...

Quote:

What you would need to compare is "planar over a properly made chip ram interface" vs. "chunky over a properly made chip ram interface", and that would measure the overhead of a stupid conversion that could have been avoided if CBM had made some investments into the chips instead of "read my lips - no new chips".

You clearly don't appreciate the situation. Commodore had 15 engineers working on AAA and still didn't have properly working silicon when crunch time came. AGA was never designed to be the ultimate Amiga chipset, it was just an extension of ECS to put in the A500 replacement while they worked on the high-end chipset. But if you had a big box Amiga then you already had what was needed to get better graphics - bus slots.

AGA did a good job of what it was designed to do. Everybody at Commodore (engineers and management) agreed on that - the only issue being how long it took to get out the door.

AAA was a different story. I think Commodore should have killed it early on and concentrated on the low-end AA (AGA) instead. For the high end they should have just put RTG into the OS and let 3rd party graphics cards fill the gap. But most of the engineers didn't want that - they wanted to go up-market with a 'VGA killer' chipset, rather than down-market where Commodore's strength lay.

Quote:

Thus, now go back to measurement and measure "c2p from fast to fast" vs. "direct rendering into fast". That gives you the right numbers for decision making for graphics modes.

No, it doesn't. There was no way that AGA would work at 33MHz. That's what AAA was supposed to do, but I doubt they could have managed it with the process they had. So they would have to outsource all the chips - and then a year or two later need another upgrade to counter the latest PCI graphics cards. Towards the end the engineers realized this. The next high-end Amiga (if Commodore had survived) would probably have PCI slots.

Bruce Abbott · 02 March 2024, 20:22

Quote:

Originally Posted by HornBeamSoft

It is very interesting.
Do you have results for an A1200 with Fast RAM only, and could you publish them?

Not sure what you mean by that - I don't think you can have an A1200 'with Fast RAM only'.

If you mean everything being in FastRAM, including Doom code and data, screen memory and ROM, that's the 'Fake RTG direct' test, where it just renders to FastRAM and that's it (no screen copy, no c2p).

AestheticDebris · 02 March 2024, 20:25

Quote:

Originally Posted by Bruce Abbott

Somebody can't read. The difference between planar with c2p and chunky (if AGA had that) is only 7%. 31% is the theoretical speedup if you had a graphics card connected directly to the 030 local bus with no wait states etc. at 33MB/s write speed. Even Zorro III can't do that.

What those numbers represent is really not clear at all. You've described "fake chunky" as Fast RAM To Chip RAM, but if AGA had chunky mode it wouldn't render into fast RAM at all. Without knowing exactly what is being done and why, the performance measurements aren't really obvious at all.

Quote:

Originally Posted by Bruce Abbott

AGA did a good job of what it was designed to do. Everybody at Commodore (engineers and management) agreed on that - the only issue being how long it took to get out the door.

Everyone at Amstrad thought the same about the GX4000 console. When you're on the inside of a project, tunnel vision can easily blind you to mistakes that are obvious to outside observers.
