How much CPU does C2P consume? - Page 3

Thorham · 05 March 2024, 11:38

Quote:

Originally Posted by Thomas Richter

Yes, you can achieve "copy speed", but that's because "copy speed" is slow. The trick is to avoid the copy in the first place, and let the chipset do the work. But that's not possible due to an architecture bound to planar.

This isn't another what-if thread. It just somehow got derailed. This is the original post:

Quote:

Originally Posted by lmimmfn

I'm curious how much CPU the most generic C2P routine (not optimized for edge cases etc.) consumes across the different Amiga range and CPUs. I realise the Chip RAM throughput on 16-bit machines is only half that of 32-bit ones, but does anyone have any benchmarks on CPU performance across the Amiga range and CPUs?

I thought it might be interesting vs top-level Intel performance.
Thanks

Karlos · 05 March 2024, 12:01

Quote:

Originally Posted by Thomas Richter

Yes, you can achieve "copy speed", but that's because "copy speed" is slow. The trick is to avoid the copy in the first place, and let the chipset do the work. But that's not possible due to an architecture bound to planar.

That's a given, which is the reason for the original question.

However, even if the chip ram were chunky, the bandwidth is still weak and you are dealing with uncached memory accesses. Rendering tends to be per pixel, you have a slow bus. Heaven help you if you want to do any transparency or other effects that require reading a pixel and replacing it with a new value.

Thomas Richter · 05 March 2024, 13:35

Quote:

Originally Posted by Karlos

However, even if the chip ram were chunky, the bandwidth is still weak and you are dealing with uncached memory accesses. Rendering tends to be per pixel, you have a slow bus. Heaven help you if you want to do any transparency or other effects that require reading a pixel and replacing it with a new value.

But that's the point - you don't render with the CPU if you don't have to. With C2P, you are burning CPU cycles.

Thorham · 05 March 2024, 14:17

Quote:

Originally Posted by Thomas Richter

But that's the point - you don't render with the CPU if you don't have to. With C2P, you are burning CPU cycles.

What does that have to do with the original question?

alexh · 05 March 2024, 14:28

Mr. Moderator. Can someone clean up this thread, backtrack to where it left the technical discussion about C2P speed on AGA/ECS and put everything after that into any other Wot-if thread? Or the bin? Ta.

Thomas Richter · 05 March 2024, 14:33

Quote:

Originally Posted by Thorham

What does that have to do with the original question?

What would you answer if the original question were "how fast can you copy memory"? The answer is "it depends". Here the answer is "as fast as the available bandwidth allows". The *real* answer is: "avoid it if you can, because it will always cost time."

Karlos · 05 March 2024, 14:47

@ThoR

I think the point that you should not be using the CPU to do what your video hardware could/should is very well understood, but the fact is we are dealing with a specific problem domain where, for whatever reason, we are dealing with CPU rendering. That could be for any reason: Video decode, an oldschool FPS game, an emulator, etc.

If you are calculating a frame, pixel by pixel, potentially drawing things that aren't easily broken down into simple sequential planar writes, you are probably going to end up using C2P. Even when you can do planar writes in spans, just look at TFX. The single biggest bottleneck was trying to render to chip ram. Rendering the same planar data in fast ram and moving said data to chip dramatically improved the speed.

If you are generating chunky pixels, you now have to pay a conversion tax to get them displayed, and the less you have to pay for any given number of pixels, the better.

Back to the original question, it generally consumes 100% CPU (as in you can't get anything else meaningful done in the meantime). The only variable is, how long does it take, per frame?
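For readers who have not seen the conversion spelled out, here is a deliberately naive sketch of what C2P has to do. It is not one of the optimised 68k routines the thread is about, and the function name `c2p` is made up for illustration. It assumes one byte per pixel on the chunky side, a width that is a multiple of 8, and the usual planar layout where bit N of each pixel lands in plane N and the leftmost pixel is the most significant bit of each byte.

```python
def c2p(chunky, width, height, depth=8):
    """Naive chunky-to-planar conversion (illustrative, not optimised).

    chunky: bytes-like, one byte per pixel, row-major.
    Returns a list of `depth` bytearrays, one per bitplane.
    """
    if width % 8:
        raise ValueError("width must be a multiple of 8")
    if len(chunky) != width * height:
        raise ValueError("chunky buffer does not match width * height")

    bytes_per_row = width // 8
    planes = [bytearray(bytes_per_row * height) for _ in range(depth)]

    for y in range(height):
        for x in range(width):
            pixel = chunky[y * width + x]
            byte_index = y * bytes_per_row + x // 8
            bit_mask = 0x80 >> (x % 8)  # leftmost pixel is the top bit
            # Scatter each bit of the pixel into its own plane.
            for plane in range(depth):
                if pixel & (1 << plane):
                    planes[plane][byte_index] |= bit_mask
    return planes


# Eight pixels: colour 1 on the left, colour 255 on the right, 0 between.
planes = c2p(bytes([1, 0, 0, 0, 0, 0, 0, 255]), width=8, height=1)
print([p.hex() for p in planes])
```

Every pixel touches up to eight different planes, which is why real routines work on many pixels at once with merge and rotate tricks instead of looping per bit like this.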

Thorham · 05 March 2024, 16:04

Quote:

Originally Posted by Thomas Richter

What would you answer if the original question were "how fast can you copy memory"?

But that's not the question. The question is about C2P CPU time consumption on various Amiga systems. This can just be measured per system.

Quote:

Originally Posted by Thomas Richter

The *real* answer is: "avoid it if you can, because it will always cost time."

Yes, but why use C2P when you don't need it? Seems a bit obvious.

Photon · 10 March 2024, 02:27

As a rough baseline, a 320x200x8bpp ready buffer takes roughly 1 frame with the CPU, or maybe a bit less on a fast 060. This means that there's little or no time left to render something interesting in 8 bits at full frame rate.

Graphics cards on a bus with no acceleration would be better or worse depending on the bus bandwidth vs. the chip RAM bandwidth available to the CPU on the specific Amiga model's motherboard. Either way, it reverts to copy speed.

Motorola CPU cards with a VDU as North Bridge would bypass the CPU-external bus, and that is the only way to beat bus copy speed. IDK if SCSI modules could be adapted to become VDUs or if they're fast enough, but I estimate 1 in 20 or fewer Blizzard cards have modules. If the TF1260 could be modded this could be more promising, as it has sold in high numbers.

For 3D and calculations, the fastest way is ofc a GPU with everything on it, with the CPU just passing object data and handing off all rendering to that PU. Unfortunately this comes at a great cost (maybe not now?), but the bigger problem is that it must be standardized and devs incentivized to make games and other real-time applications for it that need full framerate. A cheap GPU, say 300 EUR, would probably take 10 years to reach a wide enough audience. Big box Amigas bar the A2000 are very rare, and the question is where the GPU would sit in the much more abundant wedge Amigas.

VladR · 10 March 2024, 03:58

Quote:

Originally Posted by Photon

As a rough baseline, a 320x200x8bpp ready buffer takes roughly 1 frame with the CPU, or maybe a bit less on a fast 060. This means that there's little or no time left to render something interesting in 8 bits at full frame rate.

Yes, but there are two solutions to this problem:
1. Give the user the option of having 50% of the screen covered by a HUD (cockpit view in racing games, flight/space simulators), thus reducing that cost to 50%. That leaves the user with two options: 50/60 fps (but with half the screen spent on the HUD) or 30 fps (full-screen).

2. Just optimize/tweak the 3D scene to a 25/30 fps lock (but no frame-drops!). Yes, this means that most of the time the CPU sits idle, keeping headroom for the rare spikes (which would otherwise drop the framerate below 25/30). But plenty of recent racing games had a 30-fps lock on consoles and that's fine.
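The trade-off between the two options can be put into rough numbers. The sketch below takes the baseline quoted above (a full-screen conversion costs about one display frame) and assumes, as a simplification, that the C2P cost scales linearly with the fraction of the screen that is actually converted. The function name and parameters are hypothetical.

```python
def render_budget_frames(frames_per_update, full_c2p_frames=1.0,
                         converted_fraction=1.0):
    """Display frames left for rendering after C2P has taken its share.

    frames_per_update:  display frames per game update (1 = full rate,
                        2 = half rate, ...).
    full_c2p_frames:    assumed cost of a full-screen C2P, in display frames.
    converted_fraction: share of the screen that goes through C2P
                        (assumed to scale the cost linearly).
    """
    return frames_per_update - full_c2p_frames * converted_fraction


# Option 1: full rate, half the screen is a static HUD.
print(render_budget_frames(1, converted_fraction=0.5))  # 0.5
# Option 2: half rate (25/30 fps lock), full screen converted.
print(render_budget_frames(2))                          # 1.0
```

Under these assumptions the half-rate lock leaves twice as much rendering time per update as the half-screen HUD at full rate.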

VladR · 10 March 2024, 04:20

Quote:

Originally Posted by Thomas Richter

Add real measurements - what about the same engine on an RTG graphics card - with chunky memory and a chunky blitter? That is something CBM could have done, if they just hadn't been asleep. Reinterpreting the planar data from Agnus as chunky is not exactly rocket science.

I do have a question about that. During the last two weekends I've configured my WinUAE dev set-up on my new PC and restarted work on the RTG Benchmark Demo (which runs on my flatshader engine, which I have so far tested only on Vampire V2/V4 - though the code is 040-only ATM).

One of the features is that the benchmark can run without _LVOWritePixelArray (CyberGraphXBase) - meaning it benchmarks raw CPU throughput without any chipram malarkey (you just don't see it on screen, but the frame is fully rendered internally in a loop).

Is the RTG driver doing anything during that call other than C2P?

The benchmark spits out 2 numbers - with and without the RTG call, so it's very easy to instantly find out how long it takes. Obviously, no VSYNC.

I'm just wondering whether _LVOWritePixelArray() does a bunch of other things that would skew the C2P results?

Do cards like the ZZ9000 even do C2P? Don't they bypass C2P / chipram completely with their own [presumably] RTG video-out? Maybe I'm mixing that up with the PiStorm, though...

pipper · 10 March 2024, 05:28

Assuming one manages to transfer around 10 MB/s to the RTG card across the ZIII bus, a 320x200 image costs ca. 6 ms to transfer, leaving ca. 10 ms for the game to render everything at 60 fps.
One thing that could have helped greatly is if the graphics card could do DMA transfers on its own via busmastering. This way you could hide the transfer while the CPU is already rendering the next frame (assuming that the transfer does not saturate the fastmem bandwidth). I’m not aware of any VGA chip of that era that could do this, though (with maybe the exception of the S3 ViRGE?) - this is generally something that came up later with the first 3D cards.
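The transfer estimate above is easy to check. This sketch assumes 8 bits per pixel (the post does not state a depth) and reads "10 MB/s" as 10,000,000 bytes per second. The helper names are illustrative only.

```python
def transfer_ms(width, height, bytes_per_pixel, bandwidth_bytes_per_s):
    """Time to push one frame across the bus, in milliseconds."""
    return width * height * bytes_per_pixel / bandwidth_bytes_per_s * 1000.0


def frame_ms(fps):
    """Duration of one display frame, in milliseconds."""
    return 1000.0 / fps


copy_ms = transfer_ms(320, 200, 1, 10_000_000)  # assumed: 8 bpp, 10 MB/s
left_ms = frame_ms(60) - copy_ms

print(f"transfer: {copy_ms:.1f} ms, left for rendering: {left_ms:.1f} ms")
```

That works out to about 6.4 ms for the copy and roughly 10.3 ms left per 60 Hz frame, in line with the "ca. 6 ms" and "ca. 10 ms" figures.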

In case of C2P, the equivalent would have been a DMA engine that can fetch from Fastmem into chipmem fully autonomously and do the conversion on the fly.

Bruce Abbott · 10 March 2024, 06:06

Quote:

Originally Posted by pipper

In case of C2P, the equivalent would have been a DMA engine that can fetch from Fastmem into chipmem fully autonomously and do the conversion on the fly.

Would be great to have an Akiko style c2p converter that did that. Or it could just have DMA on the output side, then the CPU simply has to stuff it with chunky pixels which get (slowly) written to ChipRAM via DMA while the CPU bus is freed up for other stuff.

But this requires specialized hardware. The way things go these days it would be stupidly expensive and hard to buy, then go out of production due to lack of chips or interest from the designer.

IMO it's better to concentrate on developing code that works with all existing hardware. That way anybody with an Amiga can make use of it. Maybe it's not as efficient as dedicated hardware, but it has the advantages of being 'free' and a lot more inclusive. It is also truer to the retro spirit. We can imagine it being done back in 1992, showing what the Amiga was really capable of!

I suspect that many games using chunky pixels are still not fully optimized. For example Doom does c2p on the whole screen even when running in a smaller game window. What it should be doing is not redrawing the (static) border, and updating the status panel at a lower rate (perhaps using the blitter and bitplane graphics). With these changes I bet the frame rate could be increased significantly.

Bruce Abbott · 10 March 2024, 06:35

Quote:

Originally Posted by VladR

2. Just optimize/tweak the 3D scene to a 25/30 fps lock (but no frame-drops!). Yes, this means that most of the time the CPU sits idle, keeping headroom for the rare spikes (which would otherwise drop the framerate below 25/30). But plenty of recent racing games had a 30-fps lock on consoles and that's fine.

25 fps is plenty high enough for most games IMO. Even 17 fps (three 50Hz frames) is fine for games like Doom.

What's more important is having a consistent frame rate that you can tune your reactions to. I had Quake on my A3000 with 50MHz 060, and what spoiled it was dramatic slowdowns when engaging enemies - just when you didn't need it. Maybe less particle stuff and reduced detail in the enemies would have helped. But of course nobody tried this because it would mean modifying game assets, a much harder job than just increasing hardware performance.

When the frame rate is locked there is more incentive to make the code consistently keep up, rather than just trying to get the fastest speed you can. I suspect the desire for even higher frame rates largely comes from wanting to eliminate annoying slowdowns.

VladR · 10 March 2024, 07:10

Quote:

Originally Posted by Bruce Abbott

25 fps is plenty high enough for most games IMO. Even 17 fps (three 50Hz frames) is fine for games like Doom.

What's more important is having a consistent frame rate that you can tune your reactions to. I had Quake on my A3000 with 50MHz 060, and what spoiled it was dramatic slowdowns when engaging enemies - just when you didn't need it. Maybe less particle stuff and reduced detail in the enemies would have helped. But of course nobody tried this because it would mean modifying game assets, a much harder job than just increasing hardware performance.

When the frame rate is locked there is more incentive to make the code consistently keep up, rather than just trying to get the fastest speed you can. I suspect the desire for even higher frame rates largely comes from wanting to eliminate annoying slowdowns.

Yes, 17 fps in Doom, if it was frame-locked, would be plenty. But the first framedrop always killed it for me, even 30 yrs ago.

The problem isn't necessarily that the code is unoptimized (of course, there's a lot of that).
The problem is that, for a given HW, "playable" was always judged on the least complex 3D scene/room.
Then, like in your example, you add enemies in Quake (or a more complex room), and the framerate obviously drops drastically, as there's no such thing as two equally CPU-intensive 3D-engine frames.

The solution is to run a benchmark on the entire game, pick the slowest room, and lock the framerate to that (or butcher the scene complexity in such rooms).
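As a sketch of that procedure: take the worst frame time measured in each room, find the slowest one, and round it up to a whole number of display refreshes, since a locked frame rate on a fixed-refresh display can only be the refresh rate divided by an integer. The room names and timings below are made-up sample data, not measurements, and `locked_fps` is a hypothetical helper.

```python
import math


def locked_fps(worst_frame_ms_by_room, refresh_hz=50):
    """Pick a frame lock that the slowest room can still sustain.

    worst_frame_ms_by_room: dict of room name -> worst measured frame
                            time in milliseconds (render + C2P).
    Returns (fps, refreshes_per_frame).
    """
    refresh_ms = 1000.0 / refresh_hz
    slowest_ms = max(worst_frame_ms_by_room.values())
    # Round up: a frame that misses a refresh has to wait for the next one.
    refreshes = max(1, math.ceil(slowest_ms / refresh_ms))
    return refresh_hz / refreshes, refreshes


# Invented numbers, for illustration only.
rooms = {"corridor": 18.0, "courtyard": 41.0, "arena": 55.0}
fps, refreshes = locked_fps(rooms)
print(f"lock to {fps:.1f} fps ({refreshes} refreshes per frame)")
```

With these sample numbers the slowest room needs three 50 Hz refreshes, giving the "three 50Hz frames" lock of about 17 fps mentioned earlier in the thread, even though the corridor alone could run at 50 fps.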

But, of course, since other rooms could run at a 50-200% higher framerate, nobody does that and then the framerate is all over the place, resulting in a grossly suboptimal experience for many of us...

Technically, Carmack did a significant optimization with Quake using the BeamTree approach, which halved the framerate in best-case scenario but avoided brutal framedrops (though their definition of "brutal" differs from ours).
Still, they should have raised the min. req., as even on my Pentium it quickly became unplayable...
As a funny anecdote, I really enjoyed Quake 1 completely for the first time only on PS4, as I could play it in FullHD without framedrops. ~Quarter century later, but hey...

dreadnought · 10 March 2024, 08:27

All threads lead to Doom

Quote:

Originally Posted by VladR

As a funny anecdote, I really enjoyed Quake 1 completely for the first time only on PS4, as I could play it in FullHD without framedrops. ~Quarter century later, but hey...

Whatever rocks your boat, but we played Quake 1/2 competitively in the late 90s on whatever hardware/connection was available (mostly low end Pentiums and Celerons) and still had a hell of a time.

Forcing the lowest-denominator frame lock would be a pretty crazy move since hardware setups were different, most people prefer to have it maxed wherever possible, and generally it wouldn't be as bad as you describe. And you could of course control it via console anyway (cl_maxfps if I remember correctly).

TCD · 10 March 2024, 09:30

Quote:

Originally Posted by VladR

Technically, Carmack did a significant optimization with Quake using the BeamTree approach, which halved the framerate in best-case scenario but avoided brutal framedrops (though their definition of "brutal" differs from ours).

Quite an interesting article about that by Michael Abrash: https://www.bluesnews.com/abrash/chap64.shtml

Thomas Richter · 10 March 2024, 09:42

Quote:

Originally Posted by VladR

I'm just wondering whether _LVOWritePixelArray() does a bunch of other things that would skew the C2P results?

I do not know what CGFX does. I can only tell you what P96 does. The corresponding P96 function creates a transient chunky bitmap from your source data and then runs into BltBitMapRastPort, which performs the usual clipping at layer boundaries. Within each rectangle to copy, it runs into BltBitMap(), which at the low level performs a memory copy if the target is chunky, or a C2P conversion if the target is planar.

Quote:

Originally Posted by VladR

Do cards like the ZZ9000 even do C2P? Don't they bypass C2P / chipram completely with their own [presumably] RTG video-out? Maybe I'm mixing that up with the PiStorm, though...

No graphics card does C2P because that operation does not make sense in a chunky world. P96 has a primitive for P2C, and that conversion is offered by many graphics cards, as the Windows API also needs something similar to expand 1-bitplane graphics (for example, for drawing text) into a chunky frame buffer.
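To make the P2C case concrete, here is a minimal sketch of expanding a single bitplane (such as a rendered text glyph) into a chunky buffer with a foreground and background colour. It illustrates the idea only. It is not the P96 primitive or any card's hardware path, and the function name and parameters are invented. It assumes the width is a multiple of 8 and the leftmost pixel is the most significant bit.

```python
def expand_1bit_to_chunky(plane, width, height, fg, bg):
    """Expand a 1-bitplane image into one byte per pixel.

    plane: bytes-like, width // 8 bytes per row, MSB = leftmost pixel.
    fg/bg: colour indices written for set and clear bits respectively.
    """
    if width % 8:
        raise ValueError("width must be a multiple of 8")
    bytes_per_row = width // 8
    if len(plane) != bytes_per_row * height:
        raise ValueError("plane buffer does not match width and height")

    chunky = bytearray(width * height)
    for y in range(height):
        for x in range(width):
            byte = plane[y * bytes_per_row + x // 8]
            bit_set = byte & (0x80 >> (x % 8))
            chunky[y * width + x] = fg if bit_set else bg
    return chunky


# One row, bit pattern 10100000, foreground colour 15 on background 0.
row = expand_1bit_to_chunky(bytes([0b10100000]), 8, 1, fg=15, bg=0)
print(list(row))  # [15, 0, 15, 0, 0, 0, 0, 0]
```

This direction is cheap for hardware because each source bit maps to exactly one destination pixel, whereas C2P has to scatter every pixel across several planes.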

Thomas Richter · 10 March 2024, 09:51

Quote:

Originally Posted by pipper

One thing that could have helped greatly is if the graphics card could do DMA transfers on its own via busmastering. This way you could hide the transfer while the CPU is already rendering the next frame (assuming that the transfer does not saturate the fastmem bandwidth). I’m not aware of any VGA chip of that era that could do this, though (with maybe the exception of the S3 ViRGE?) - this is generally something that came up later with the first 3D cards.

Back then, DMA from the host was a relatively rare feature, and of the VGA chips I have written drivers for, none had support for it. There is a single exception, namely the A2410 card, which could do busmastering and let the TMS34010 access Amiga Zorro RAM through DMA - at least in principle. In practice, the DMA logic on the board is broken and it does not work.

Quote:

Originally Posted by pipper

In case of C2P, the equivalent would have been a DMA engine that can fetch from Fastmem into chipmem fully autonomously and do the conversion on the fly.

What you describe is a blitter mode that does such a conversion, though the Amiga blitter is relatively poorly prepared for that. For a (speedy) C2P conversion, it would need to have 8 destination channels, not only one. Of course you can use the blitter "as is" for C2P, running serially over all bitplanes and masking the right bits out of the source - and I did this a while ago for VideoEasel. The result is not very fast, but it runs in parallel to the CPU, which can do more useful things while the blitter is busy.

What is a lot more practical is to have a chunky mode directly in Denise. It is the same amount of data it would have to fetch, and it would be even simpler as it does not have to interleave the accesses for the individual bitplanes.

Thorham · 10 March 2024, 11:02

Quote:

Originally Posted by dreadnought

All threads lead to Doom

What's this obsession with Doom about anyway?
