24 April 2020, 18:07 | #121 | ||
Registered User
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,295
|
Quote:
however: my assumption is true for 80-90% of the frame (or better field) Quote:
Your calculation is still off - taking all the effects into account the display DMA is maybe down to 20% of all cycles ... |
||
24 April 2020, 18:17 | #122 | ||
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,411
|
Quote:
Quote:
Bitplane DMA cost is exactly what I said it is for 320x256x8 (4x fetch), as is total number of cycles available. Which means that my ~1/7th is accurate. |
||
24 April 2020, 18:38 | #123 | |
Registered User
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,295
|
Quote:
(for PAL we have 256/312=0.82) PS: and no: overscan does certainly not affect the CPU equally |
|
24 April 2020, 22:50 | #124 | |
Registered User
Join Date: Dec 2016
Location: Finland
Posts: 168
|
Quote:
The Blitter's memory write speed still tops out at 3,5 MB/sec (A->D copy), while the CPU can write to chip RAM at 7 MB/sec. Not only that, the CPU operation can be something other than a straight memory copy. And if the Blitter combines all sources (ABC->D), its speed is halved again. A cookie-cut operation from fast RAM on a 68030 can be made several times faster than with the Blitter (not just twice as fast, as the copy speed would indicate). |
|
25 April 2020, 00:50 | #125 | |
Registered User
Join Date: Dec 2014
Location: germany
Posts: 439
|
Quote:
(320*256*50)/(3.58e6*8) = 0.143 |
|
25 April 2020, 01:17 | #126 | |||
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,411
|
Quote:
Now the long answer. Apologies for the wall of text & numbers, but I wanted to be as complete as I could to show what I mean.

Let's start using OCS: each scanline, the Amiga has 226 DMA slots/cycles which can be filled by chipset DMA sources (Blitter, Copper, Sprites, bitplanes, disk, audio, refresh) or CPU*. There are 312 lines in the display, meaning the Amiga has a total of 70512 DMA slots/cycles available per frame.

Of these 226 cycles, bitplane DMA takes 1 cycle per bitplane per word. Or put differently, 16 pixels @ 1 bitplane = 1 cycle, 16 pixels @ 4 bitplanes = 4 cycles, 32 pixels @ 4 bitplanes = 8 cycles, etc. A 320 pixel wide screen @ 1 bitplane will take 20 cycles per line. Likewise, a 320 pixel wide screen @ 4 bitplanes will take 80 cycles per line (both out of 226 cycles available per line**). If we then multiply this number by the height of the display, we get the total number of cycles the display fetch takes. Which means that a 320x256x4 OCS screen will take 80*256=20480 cycles.

Moving to AGA, we can fetch from display memory at 4x the speed, so the same 320x256x4 bitplane screen would now only take 5120 cycles to fetch (20480/4). However, AGA allows for 8 bitplanes so let's assume we run in 8 bitplanes. Which means the display uses ((320/16)*256*8)/4=10240 cycles.

This means the 320x256x8 display uses 40 cycles per line, out of 226. That is ~18% of each scanline it is displaying on. If we then take into account that the display only lasts 256 out of 312 lines, we have to take this 18% and multiply it by 256/312=0,82. Leading to ~14,8%, which is about 1/7th.

Alternatively, we can simply take the number of DMA cycles the screen takes (10240 cycles) and divide this by the total number of DMA cycles the system has (70512-1248 refresh cycles***=69264). This also gives us ~14,8%, which is about 1/7th.

*) My example did not account for anything but bitplanes, Blitter and CPU to keep it simple. Adding in the other sources will generally (but not always) not make a real difference, as the other DMA sources generally do not use large amounts of cycles. There are situations in which they can (OCS style Sprite/Copper based backgrounds for instance), but normally they don't. So I won't add them in this example either.

***) I made a small error in my earlier total, there are 1248 refresh cycles (4 per scanline) but I used 1152 for some reason.
Quote:
8 bitplane DMA fetches "block out" all DMA slots during their fetch moments. This means that neither CPU nor Blitter gets access to any slots during the display fetch moments. Adding overscan simply means there are more scanlines during which this situation applies, which affects both CPU and Blitter the same way. Quote:
To clarify: I am in no way trying to say there are no benefits to using the CPU. Far from it, you are absolutely correct to say that a 68030 using fast memory can do things much faster than the Blitter does. And because it only needs to write out the result to chip memory instead of doing everything in chip memory, this can result in a sizeable speed boost for graphics operations. However, the bandwidth to chip memory of the Blitter and CPU are both still limited to 7MB/sec.

My point was rather that it might be useful to consider using both the Blitter and CPU concurrently, as the Blitter can fill some of the gaps the CPU leaves open on the bus. Whether this will work and to what degree remains to be seen, but that was my idea. Not to suggest the CPU has no advantages over the Blitter.

On a side note: I do want to point out that the CPU can only write to chip memory at 7MB/sec if we assume it has zero-cycle access to fast memory and has zero-cycle algorithms for drawing operations. Neither of these is likely to be true on a 68030 (though the cache and relatively smart bus sequencer can and do help here). Which will mean that copy/cookie-cut/etc actions will usually clock in at somewhere below 7MB/sec writing. Now it will obviously not have to lose a large percentage, but it won't get the full 100% either.

It's actually all fairly interesting and there are some caveats involving the choice between simply copying entire screen buffers or drawing in rectangles à la the Blitter. The former is likely not fast enough for 50FPS updates, the latter ironically is less efficient than you might think because the Blitter usually can get by moving around fewer bytes in memory than the CPU can (due to the 16 vs 32 bit difference). Good stuff to think or ponder about and tinker with.

Last edited by roondar; 25 April 2020 at 01:55. |
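
The slot arithmetic from the long answer above can be checked with a short sketch. It uses only the figures given in the post (226 slots per line, 312 lines, 4 refresh slots per line, one slot per bitplane per 16-pixel word, 4x fetch on AGA); the function and constant names are made up for illustration, and other DMA sources (audio, sprites, disk, Copper) are ignored, just as in the post.

```python
# Sketch of the per-frame DMA slot budget described in the post above.
# All names are hypothetical; the numbers come straight from the post.

SLOTS_PER_LINE = 226      # DMA slots per scanline
LINES_PER_FRAME = 312     # PAL scanlines per frame
REFRESH_PER_LINE = 4      # memory refresh slots per scanline


def bitplane_slots(width_px, height_px, planes, fetch_multiplier=1):
    """Slots used by bitplane DMA: one slot per plane per 16-pixel word,
    divided by the fetch multiplier (4 for the AGA 4x fetch mode)."""
    words_per_line = width_px // 16
    return words_per_line * planes * height_px // fetch_multiplier


total_slots = SLOTS_PER_LINE * LINES_PER_FRAME                    # 70512
usable_slots = total_slots - REFRESH_PER_LINE * LINES_PER_FRAME   # 69264

ocs_320x256x4 = bitplane_slots(320, 256, 4)                       # 20480
aga_320x256x8 = bitplane_slots(320, 256, 8, fetch_multiplier=4)   # 10240

display_share = aga_320x256x8 / usable_slots

print(f"AGA 320x256x8 uses {aga_320x256x8} of {usable_slots} slots "
      f"({display_share:.1%} of the frame)")
```

Changing the width or height passed to `bitplane_slots` gives a rough idea of what overscan would cost under the same simplifying assumptions.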
|||
25 April 2020, 01:38 | #127 | |
Registered User
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,295
|
Quote:
But we DO have display DMA... And now you are confusing display fetch cycles (32-bit, double CAS) with Blitter cycles (16-bit, no double CAS). And besides that: the Blitter would have access to ALL cycles minus (display, sound, sprites, floppy), not only every second one - the CPU has only every second one... Sorry, but what you just wrote makes absolutely no sense in this discussion |
|
25 April 2020, 01:53 | #128 | ||||||
Registered User
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,295
|
Quote:
Every second cycle ... so it is even worse for the Blitter during this time! Quote:
Quote:
Come on... so it is 40 to 80: exactly half. I made that calculation already above and it is like I said there! The only difference is you neglect audio and sprite and disk DMA. Quote:
it will only affect the Blitter, since the display-fetches will not take place during the CPU slots anyways. Quote:
Quote:
Last edited by Gorf; 25 April 2020 at 02:10. |
||||||
25 April 2020, 02:37 | #129 |
Registered User
Join Date: Dec 2019
Location: North Dakota
Posts: 741
|
It just so happens that I'm - right this moment- working on a benchmark build for my V4.
Perhaps we could use this opportunity and test some static 3D mesh of a character. While my game doesn't need to do FK/IK animation, we could still benchmark the actual poly transform+render, which (as I'm sure we can all agree) is the great majority of all computations.

Two requests:
1. Is 16 MB a reasonable amount of RAM (to require) for our scenario here? Or does nobody with an 030 have 16 MB?
2. Can somebody please find a blog post detailing the exact polycount of the Virtua Fighter characters?

EDIT: The reason I am asking for 16 MB is:
1. Large tables for division and multiplication (surely no sane person would intend to execute mul/div?!?)
2. Keyframe animation of characters (no FK/IK) - storing all frames of each animation (as full 3D coordinates) |
25 April 2020, 02:54 | #130 |
Registered User
Join Date: Dec 2019
Location: North Dakota
Posts: 741
|
Some raw brain-dump on KeyFrame animation:
250 [Vertices] per character (estimate)
6 [Bytes] per Vertex (2 Bytes per X,Y,Z) - sufficient precision without major artifacts
1,500 [Bytes] per One Frame of animation (250*6)

Since Polygon Indices are reused across all frames, we can disregard their storage costs.

Now, how many frames do we need? From my experience with keyframe animation, if you're using Lerping (to interpolate between two frames), simple animations can get away with just 6-10 frames (Breathing, minor pose changes). Walking is impossible to do without 12 frames, and even at 12 frames looks like sh*t. More complex movements need 30-50 frames.

Of course, the player can't wait 7 seconds till the poor 030 processes all 50 frames of the animation, so in reality, the number of frames will be dictated by actual HW performance. Well, unless we make it a Turn-Based Fighter - then it would be OK to wait 5-8 seconds to finish each move.

Sarcasm aside, let's assume we could somehow reach 20 fps on 030, and if each move took 1.2 seconds (not sure, never timed it, hell - never even played VF), then we could process even the full 24 frames of the animation.

16 different animations per character
24 frames per animation
384 = 16*24 = Total Frames per character
1,500 * 384 = 576,000 Bytes = 563 KB per character [for vertices]

We would need the same for polygon normals, so another 563 KB
563 + 563 = 1,126 KB - just over 1 MB. Not bad, really...

Even if we doubled the amount of frames and animations, it's still ~4.5 MB. So, even an 8 MB 030 would suffice. Although, if you have 2 different characters on screen, then we gotta double it and we're then at ~9 MB. That's gonna take a while to load from disk... |
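
The storage estimate above is easy to replay with different inputs. The sketch below uses the post's own figures as defaults (250 vertices, 6 bytes per vertex, 16 animations, 24 frames each) and its assumption that normals cost as much as vertices; the function name and parameters are invented for illustration.

```python
# Sketch of the keyframe storage estimate from the post above.
# Defaults are the post's numbers; the names are hypothetical.

def keyframe_bytes(vertices=250, bytes_per_vertex=6,
                   animations=16, frames_per_animation=24,
                   include_normals=True):
    """Bytes needed to store every animation frame as full 3D coordinates."""
    bytes_per_frame = vertices * bytes_per_vertex       # 1,500 with defaults
    total_frames = animations * frames_per_animation    # 384 with defaults
    vertex_bytes = bytes_per_frame * total_frames
    # Assumption taken from the post: normals take the same space as vertices.
    return vertex_bytes * 2 if include_normals else vertex_bytes


one_character = keyframe_bytes()
doubled = keyframe_bytes(animations=32, frames_per_animation=48)

print(f"one character:              {one_character / 1024:.0f} KB")
print(f"doubled frames+animations:  {doubled / 1024:.0f} KB")
print(f"two such characters:        {2 * doubled / 1024:.0f} KB")
```

Doubling both the frame count and the number of animations quadruples the total, which is where the jump from roughly 1 MB to roughly 4.5 MB comes from.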
25 April 2020, 03:23 | #131 |
Total Chaos forever!
Join Date: Aug 2007
Location: Waterville, MN, USA
Age: 49
Posts: 2,187
|
Re:9MB per pair of characters
A CD32 with an accelerator could probably handle it then. Likewise an A1200 with an accelerator and hard drive. If we could just ditch floppies sooner.... |
25 April 2020, 04:41 | #132 | ||||||
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,411
|
Quote:
Quote:
Quote:
My main issue, however, was with horizontal blank/border. As it turns out, you have a different way of looking at what normally happens during that time than I do, which explains a lot. Quote:
The reason I didn't include them is precisely because they normally don't amount to anywhere near their maximum number of cycles. Disk DMA is rarely used. Audio DMA almost never runs 4 channels@max rate. AGA games tend to not use sprite background layers as often as OCS games do (my guess for the reason is that the new Dual Playfield mode makes them less needed), so sprites are likewise not usually displayed on every scanline. Could all these things be active all the time? Well, anything is possible... But then again, this feels a lot like a rather arbitrary limitation of which situation to include and which not to include. Why not include Copper activity in the horizontal blank/border as well along with sprites/audio/disk? Or doesn't that count because doing that as well does lock out the CPU? Quote:
Perhaps a diagram might be helpful here. Code:
Cycle diagram lowres AGA 8BPL/4x fetch vs OCS 4BPL fetch (assuming CPU hits every slot it can)

D = Bitplane DMA fetch
. = free cycle for Blitter/CPU
C = CPU cycle (no two CPU cycles can occur back-to-back)
B = Blitter cycle

OCS 4 BPL
   .D.D.D.D .D.D.D.D .D.D.D.D .D.D.D.D (etc until scanline ends)
=> CDCDCDCD CDCDCDCD CDCDCDCD CDCDCDCD (etc until scanline ends)
or BDBDBDBD BDBDBDBD BDBDBDBD BDBDBDBD (etc until scanline ends)

AGA 8 BPL/4x
   DDDDDDDD ........ ........ ........ (etc until scanline ends)
=> DDDDDDDD C.C.C.C. C.C.C.C. C.C.C.C. (etc until scanline ends)
or DDDDDDDD BBBBBBBB BBBBBBBB BBBBBBBB (etc until scanline ends)
Quote:
|
||||||
25 April 2020, 06:14 | #133 |
Registered User
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,295
|
Ah ok - thank you.
Now the diagram was really helpful. My mistake was indeed to treat the 40 AGA fetches like the 80 ECS ones: as only occurring during the "chip slots".

But in my defense: the value of 1/7 or 14% of all cycles you mentioned, as opposed to my 20-25%, is also absolutely meaningless - it simply does not matter at all in this case! Why? The question was who gets more bandwidth, Blitter or CPU - and to answer this question the number of display DMA fetches has no relevance, since the impact is exactly the same... as is the number of lines.

We can just pretend there are (226-40=)186 slots in each line. We can also say we don't play sound and we don't use sprites... But we can't get rid of the 4 refresh cycles... and these are chip slots.

So we have a maximum of 93*4B = 372B per line for the CPU
And we have a maximum of 182*2B = 364B per line for the Blitter

So even without sound, sprites and disk, the bandwidth for the Blitter is less than for the CPU - in this particular case! |
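
The per-line comparison above works out as follows. This is only a sketch of the model stated in the post: the CPU gets every second free slot at 4 bytes per access, the Blitter gets every free slot minus the 4 refresh slots at 2 bytes per access, and sound, sprites and disk are switched off. The variable names are invented for illustration.

```python
# Sketch of the per-scanline CPU vs Blitter bandwidth model from the post above.
# Names are hypothetical; the slot counts and access widths are the post's.

SLOTS_PER_LINE = 226
DISPLAY_SLOTS = 40        # 320 pixels, 8 bitplanes, 4x fetch
REFRESH_SLOTS = 4

free_slots = SLOTS_PER_LINE - DISPLAY_SLOTS      # 186

# CPU: at most every second free slot, 32-bit (4 byte) access.
cpu_slots = free_slots // 2                      # 93
cpu_bytes = cpu_slots * 4                        # 372

# Blitter: every free slot except refresh, 16-bit (2 byte) access.
blitter_slots = free_slots - REFRESH_SLOTS       # 182
blitter_bytes = blitter_slots * 2                # 364

gap = (cpu_bytes - blitter_bytes) / cpu_bytes

print(f"CPU: {cpu_bytes} B/line, Blitter: {blitter_bytes} B/line, "
      f"difference {gap:.1%}")
```

Under these assumptions the gap is 8 bytes per line, so the two figures are close rather than identical.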
25 April 2020, 07:18 | #134 | |||
Code Kitten
Join Date: Aug 2015
Location: Montreal/Canadia
Age: 52
Posts: 1,178
|
Quote:
Although 16 MB seems quite on the high side. Are you counting animations? On VRally-2 PS2 we had about 400 triangles for the cars at the highest quality LOD. Although it is true that the projection and fill rate needs of a racing game and a fighting game are very different, this still gives an idea of how crude 3D models had to be back then. Mul/div tables are a must, but which format do you use for positional data? PS1/Saturn used fixed point 16bit, so quite low precision. And frankly if we want to project as many vertices as possible, we probably should be looking at this format as well. I would be very surprised if the Saturn version of Virtua Fighter used more than 5000 vertices per character but 1000 vertices would not surprise me at all. I think someone posted some numbers earlier but I do not recall their source. Quote:
This said, I think it is premature to worry about animation before knowing how many projections can be made per frame because in the end this is going to be the initial multiplying factor. Great point roondar, thanks for taking the time to lay this down. I think it is important to know which place the Blitter can take in this whole endeavor because even if it were possible for the CPU to write to Chip RAM at maximum bandwidth, it will never be able to do so for more than a fraction of each frame. Between vertices projections, animation lerping, visibility computations and rasterization span computations, the CPU will be very busy each frame and that will not be happening in Chip RAM. During that non negligible fraction of the frame, the Blitter, even if slow, will be king. Quote:
Moreover, cf. my point in responding to roondar above, the CPU will have many other things to do before it can have a chance to touch Chip RAM. This could easily represent 25%-75% of a frame. During that time, a Copper driven Blitter queue would be able to operate completely asynchronously, filling all the Chip access cycles it can while the CPU is busy multiplying matrices and lerping in Fast RAM. If timed correctly, the Blitter could operate during the most DMA busy section of the screen while the CPU would take over during the vertical blank at max bandwidth. (Obviously this requires double buffering but I doubt anyone expected single buffering to be used.) Also, I have not given it much thought yet but I intuitively doubt that polygon rasterization with the Blitter would require all four channels to be used. (There have been other discussions about polygon filling with the Blitter on the EAB but I do not think they have reached any definitive or practically applicable conclusion alas.) |
|||
25 April 2020, 07:33 | #135 |
Registered User
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,295
|
No question: the Blitter can be put to good use - especially if the system has FastRAM and the CPU can do something else ...
I never wanted to discuss this fact - it is obvious. This was just about the (rather academic) question of theoretical bandwidths and which one is higher. |
25 April 2020, 08:29 | #136 | |
Registered User
Join Date: Dec 2014
Location: germany
Posts: 439
|
Quote:
I think on an 030/50 a muls.w actually may be faster than a table access. Worst case is given with 28 cycles, but that depends on the bit patterns of the operands. Divs.w is much slower (56 cycles max), so it's probably better to use tables here. |
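
One common way to swap a divide for a table read plus a multiply is a reciprocal table. The sketch below is only an illustration of that general idea, not something proposed in the thread: the table size, the 16 fraction bits and all names are assumptions, and it does not model 68030 instruction timing or which operand widths would fit `muls.w`/`mulu.w`.

```python
# Sketch of division via a precomputed reciprocal table (fixed point).
# Table size and precision are assumptions chosen for illustration.

RECIP_SHIFT = 16       # fraction bits of the stored reciprocals (assumed)
MAX_DIVISOR = 1024     # largest divisor covered by the table (assumed)

# RECIP_TABLE[d] holds floor(2**16 / d); index 0 is unused padding.
RECIP_TABLE = [0] + [(1 << RECIP_SHIFT) // d for d in range(1, MAX_DIVISOR + 1)]


def table_divide(numerator, divisor):
    """Approximate numerator // divisor with one lookup and one multiply.

    Intended for 0 <= numerator < 65536 and 1 <= divisor <= MAX_DIVISOR.
    Because the reciprocals are rounded down, the result can come out
    one lower than the exact quotient.
    """
    return (numerator * RECIP_TABLE[divisor]) >> RECIP_SHIFT


print(table_divide(100, 4))   # exact case: power-of-two divisor
print(table_divide(99, 3))    # may be one below the exact quotient
```

Whether the off-by-one error is acceptable depends on what the quotient is used for; perspective projection to screen coordinates is more tolerant than, say, texture stepping.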
|
25 April 2020, 10:53 | #137 | ||
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,411
|
Quote:
Quote:
My original post(s) should probably have said that they were very similar rather than "identical", but my overall idea (namely - Blitter bandwidth to chip memory is not much lower than CPU bandwidth, rather they're very similar) is actually accurate. Remember, your initial starting point was a difference of somewhere between 18% and 25%. It turns out to be closer to 1,8-2% within the constraints we're looking at here (note: obviously, changing the constraints will impact the result). To me a ~2% difference isn't really all that significant and not really the same as what we were talking about. If that 2% had been your starting point, I'd simply have agreed I forgot to take refresh into account and left it at that.

Wouldn't it depend on the speed of memory? I don't have the 68030 cycle counts handy, but the 68020 can do a memory lookup in something on the order of 4 to 9 cycles if it had zero-wait-state memory (assuming here we're not doing the lea required to fetch the table every table access, otherwise it's a couple more). |
||
25 April 2020, 11:41 | #138 | |
Registered User
Join Date: Dec 2014
Location: germany
Posts: 439
|
Quote:
|
|
25 April 2020, 11:52 | #139 |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,411
|
A 10 cycle memory access would indeed be quite slow; at that rate you might actually be right that mulu.w is quicker. Reminds me of address calculation for bobs on OCS/68000 machines. On paper using a lookup is a big win (something like 40-50 cycles saved per lookup). But in reality the stated cycle time for mulu is a worst case scenario that doesn't really apply, which makes the gains much more modest.
|
25 April 2020, 12:25 | #140 | |
Registered User
Join Date: Dec 2019
Location: North Dakota
Posts: 741
|
Quote:
Incidentally, on Jaguar, I never needed to do such tables, as both MUL and DIV were in reality just 1 cycle on the RISC GPU. Div has technically a latency of 9 cycles on RISC GPU, but if you rearrange the code around it, it can be basically computed for free (just 1 cycle). It's also very easy to benchmark it like that and confirm it. Not even Vampire can do DIV in 1 cycle... |
|