Could an Amiga with an 030 do Virtua Fighter? - Page 7

Gorf · 24 April 2020, 18:07

Quote:

Originally Posted by roondar

Maybe this helps?

You are correct that OCS/ECS 4 bitplanes take half of available cycles. In fact, all your numbers on cycles are correct.
However... You are not correct that Amiga bitplane fetches happen all the time. They only happen when they need to for the display, which is only a certain part of the total time in a frame

true: I simplified here ... but we also have audio and disk DMA and sprites or overscan that steal cycles ...

however: my assumption is true for 80-90% of the frame (or better field)

Quote:

The Blitter and CPU are on "even footing" for the part of the frame where no display happens, which is why the total number of cycles the Blitter loses vs the CPU is lower

so for 10-20% of the time ... does not change that much

Your calculation is still off - taking all the effects into account the display DMA is maybe down to 20% of all cycles ...

roondar · 24 April 2020, 18:17

Quote:

Originally Posted by Gorf

true: I simplified here ... but we also have audio and disk DMA and sprites or overscan that steal cycles ...

however: my assumption is true for 80-90% of the frame (or better field)

It's not true for 80% of the frame. It's true for ~59% of the frame, assuming 320x256 resolution. Sprite/disk/audio are small overheads which will not change numbers much. Overscan is not, but affects CPU equally for 8 bitplanes.

Quote:

so for 10-20% of the time ... does not change that much

Your calculation is still off - taking all the effects into account the display DMA is maybe down to 20% of all cycles ...

It is accurate. Read the HRM and do the math for yourself if you don't believe me.
Bitplane DMA cost is exactly what I said it is for 320x256x8 (4x fetch), as is total number of cycles available. Which means that my ~1/7th is accurate.

Gorf · 24 April 2020, 18:38

Quote:

Originally Posted by roondar

It's not true for 80% of the frame. It's true for ~59% of the frame, assuming 320x256 resolution. Sprite/disk/audio are small overheads which will not change numbers much. Overscan is not, but affects CPU equally for 8 bitplanes.

It is accurate. Read the HRM and do the math for yourself if you don't believe me.
Bitplane DMA cost is exactly what I said it is for 320x256x8 (4x fetch), as is total number of cycles available. Which means that my ~1/7th is accurate.

Can you please show me the calculation for this?
(for PAL we have 256/312=0.82)

PS:
and no: overscan does certainly not affect the CPU equally

coder76 · 24 April 2020, 22:50

Quote:

Originally Posted by roondar

It's not really accurate to say that a CPU has more bandwidth to chipmemory than the Blitter. Best case bandwidth for both is identical (~7MB/sec), but many CPU's/turboboard don't manage to get that best case. Real world bandwidth of the CPU is therefore often somewhat lower than the Blitter.

The only reason it seems to be the case that the CPU has higher bandwidth is because it can use both fast and chip memory, while the Blitter is always using just chip memory. But this is a "trick of the mind": the CPU will still never exceed the ~7MB/sec to chip memory even if part of the operation is done using much faster fast memory.

The memory write speed of the blitter is still max 3,5 MB/sec (A->D copy), while the CPU can write into chip ram at 7 MB/sec. And not only that, this operation can be something other than a straight memory copy. And if the blitter combines all sources (ABC->D), blitter speed is further halved. A cookie cut operation from fast ram on a 68030 can be made several times faster than with the blitter (not just twice as fast, as the copy speed would indicate).

chb · 25 April 2020, 00:50

Quote:

Originally Posted by Gorf

Can you please show me the calculation for this?
(for PAL we have 256/312=0.82)

PS:
and no: overscan does certainly not affect the CPU equally

I'm not roondar, but it is actually quite straightforward: The Amiga has 3.58e6 chip ram memory cycles per second (every second clock cycle). Assuming no other DMA and 8 bytes per mem cycle (32 bit/double CAS), you get

(320*256*50)/(3.58e6*8) = 0.143
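
For anyone who wants to play with the numbers, here is the same estimate as a small Python sketch. The variable names are made up for illustration; the 3.58e6 slots per second, the 8 bytes per slot and the "no other DMA" assumption are all taken from the post above.

```python
# Share of chip RAM slots used by bitplane DMA for a 320x256, 8 bitplane screen.
# Figures come from the post above; the names are illustrative only.
WIDTH, HEIGHT = 320, 256
BITPLANES = 8
REFRESH_HZ = 50             # PAL frames per second
SLOTS_PER_SECOND = 3.58e6   # chip RAM memory cycles per second (as quoted)
BYTES_PER_SLOT = 8          # as assumed in the post for AGA display fetches

# 8 bitplanes at 1 bit per pixel each is 1 byte per pixel.
bytes_per_frame = WIDTH * HEIGHT * BITPLANES // 8
bytes_per_second = bytes_per_frame * REFRESH_HZ
share = bytes_per_second / (SLOTS_PER_SECOND * BYTES_PER_SLOT)

print(f"bitplane DMA share: {share:.3f}")
```

Changing BITPLANES or BYTES_PER_SLOT shows how quickly the share grows for deeper screens or slower fetch modes.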

roondar · 25 April 2020, 01:17

Quote:

Originally Posted by Gorf

Can you please show me the calculation for this?
(for PAL we have 256/312=0.82)

Of course. First the short answer: the 82% figure fails to account for the horizontal border/blanking during which no bitplane DMA takes place.

Now the long answer. Apologies for the wall of text & numbers, but I wanted to be as complete as I could to show what I mean.

Let's start using OCS: each scanline, the Amiga has 226 DMA slots/cycles which can be filled by chipset DMA sources (Blitter, Copper, Sprites, bitplanes, disk, audio, refresh) or CPU*. There are 312 lines in the display, meaning the Amiga has a total of 70512 DMA slots/cycles available per frame. Of these 226 cycles, bitplane DMA takes 1 cycle per every bitplane per every word. Or put differently, 16 pixels @ 1 bitplane = 1 cycle, 16 pixels @ 4 bitplanes = 4 cycles, 32 pixels @ 4 bitplanes = 8 cycles, etc.

A 320 pixel wide screen @ 1 bitplane will take 20 cycles per line. Likewise, a 320 pixel wide screen @ 4 bitplanes will take 80 cycles per line (both out of 226 cycles available per line**). If we then multiply this number by the height of the display, we get the total number of cycles the display fetch takes. Which means that a 320x256x4 OCS screen will take 80*256=20480 cycles.

Moving to AGA, we can fetch from display memory at 4x the speed, so the same 320x256x4 bitplane screen would now only take 5120 cycles to fetch (20480/4). However, AGA allows for 8 bitplanes so let's assume we run in 8 bitplanes. Which means the display uses ((320/16)*256*8)/4=10240 cycles. This means the 320x256x8 display uses 40 cycles per line, out of 226. That is ~18% of each scanline it is displaying on.

If we then take into account that the display only lasts 256 out of 312 lines, we have to take this 18% and multiply it by 256/312=0,82. Leading to ~14,8%, which is about 1/7th.

Alternatively, we can simply take the number of DMA cycles the screen takes (10240 cycles) and divide this by total number of DMA cycles the system has (70512-1248 refresh cycles***=69264). This also gives us ~14,8%, which is about 1/7th.

*) My example did not account for anything but bitplanes, Blitter and CPU to keep it simple. Adding in the other sources will generally (but not always) not make a real difference as generally the other DMA sources will not use large amounts of cycles. There are situations in which they can (OCS style Sprite/Copper based backgrounds for instance), but normally they don't. So I won't add them in this example either.

***) I made a small error in my earlier total, there are 1248 refresh cycles (4 per scanline) but I used 1152 for some reason.
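
The slot arithmetic above can be checked with a short Python sketch. The constants are the ones given in this post; the helper name bitplane_slots_per_line and its fetch_multiplier parameter are only illustrative, and other DMA sources are ignored just as in the text.

```python
# DMA slot budget for one PAL frame, using the figures from the post above.
SLOTS_PER_LINE = 226
LINES_PER_FRAME = 312
REFRESH_SLOTS_PER_LINE = 4
DISPLAY_LINES = 256
WIDTH = 320

def bitplane_slots_per_line(width, bitplanes, fetch_multiplier=1):
    """One slot per 16 pixels per bitplane, divided by the AGA fetch multiplier."""
    return (width // 16) * bitplanes // fetch_multiplier

total_slots = SLOTS_PER_LINE * LINES_PER_FRAME             # 70512
refresh_slots = REFRESH_SLOTS_PER_LINE * LINES_PER_FRAME   # 1248
usable_slots = total_slots - refresh_slots                 # 69264

ocs_4bpl = bitplane_slots_per_line(WIDTH, 4) * DISPLAY_LINES      # 20480
aga_8bpl = bitplane_slots_per_line(WIDTH, 8, 4) * DISPLAY_LINES   # 10240

print(f"OCS 4 bitplanes:       {ocs_4bpl} slots per frame")
print(f"AGA 8 bitplanes (4x):  {aga_8bpl} slots per frame")
print(f"share of usable slots: {aga_8bpl / usable_slots:.1%}")
```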

Quote:

PS:
and no: overscan does certainly not affect the CPU equally

Yes, it does for 8 bitplanes. You lose (relatively) the same amount of CPU slots as you do Blitter slots. Note that I'm only talking about chip memory access here, the CPU is naturally not affected in fast memory.

8 bitplane DMA fetches "block out" all DMA slots during their fetch moments. This means that neither CPU or Blitter get access to any slots during the display fetch moments. Adding overscan simply means there are more scanlines during which this situation applies, which affects both CPU and Blitter the same way.

Quote:

Originally Posted by coder76

The memory write speed of the blitter is still max 3,5 MB/sec (A->D copy), while the CPU can write into chip ram at 7 MB/sec. And not only that, this operation can be something other than a straight memory copy. And if the blitter combines all sources (ABC->D), blitter speed is further halved. A cookie cut operation from fast ram on a 68030 can be made several times faster than with the blitter (not just twice as fast, as the copy speed would indicate).

There is a difference between bandwidth and write speed. I was talking about the former, you're talking about what I called "a trick of the mind" - you're looking at graphic operation speed, not chip memory bandwidth.

To clarify: I am in no way trying to say there are no benefits of using the CPU. Far from it, you are absolutely correct to say that a 68030 using fast memory can do things much faster than the Blitter does. And because it only needs to write out the result to chip memory instead of doing everything in chip memory, this can result in a sizeable speed boost for graphics operations. However, the bandwidth to chipmemory of the Blitter and CPU are both still limited to 7MB/sec.

My point was rather that it might be useful to consider using both the Blitter and CPU concurrently, as the Blitter can fill some of the gaps the CPU leaves open on the bus. Whether this will work and to what degree remains to be seen, but that was my idea. Not to suggest the CPU has no advantages over the Blitter.

On a side note: I do want to point out that the CPU can only write to chip memory at 7MB/sec if we assume it has zero-cycle-access to fast memory and has zero-cycle-algorithms for drawing operations. Neither of these are likely to be true on a 68030 (though the cache and relatively smart bus sequencer can and do help here). Which will mean that copy/cookie-cut/etc actions will usually clock in at somewhere below 7MB/sec writing. Now it will obviously not have to lose a large percentage, but it won't get the full 100% either.

It's actually all fairly interesting and there are some caveats involving the choice between simply copying entire screen buffers or drawing in rectangles a-la-the Blitter. The former is likely not fast enough for 50FPS updates, the latter ironically is less efficient than you might think because the Blitter usually can get by moving around fewer bytes in memory than the CPU can (due to the 16 vs 32 bit difference). Good stuff to think or ponder about and tinker with.

Gorf · 25 April 2020, 01:38

Quote:

Originally Posted by chb

I'm not roondar, but it is actually quite straightforward: The Amiga has 3.58e6 chip ram memory cycles per second (every second clock cycle). Assuming no other DMA and 8 bytes per mem cycle (32 bit/double CAS), you get

(320*256*50)/(3.58e6*8) = 0.143

50 what?

But we DO have display DMA...

And now you are confusing display fetch cycles (32bit -double CAS) with Blitter (16bit - no double CAS)

And besides that: the Blitter would have access to ALL cycles minus (display, sound, sprites, floppy), not only every second - the CPU has only every second ....

Sorry, but what you just wrote makes absolutely no sense in this discussion

Gorf · 25 April 2020, 01:53

Quote:

Originally Posted by roondar

Of course. First the short answer: the 82% figure fails to account for the horizontal border/blanking during which no bitplane DMA takes place.

but there we have sound- and sprite-DMA
Every second cycle ... so it is even worse for the Blitter during this time!

Quote:

Now the long answer. ....
There are 312 lines in the display,....

which I did mention above.

Quote:

A 320 pixel wide screen @ 1 bitplane will take 20 cycles per line. Likewise, a 320 pixel wide screen @ 4 bitplanes will take 80 cycles per line (both out of 226 cycles available per line**).
....
This means the 320x256x8 display uses 40 cycles per line, out of 226. That is ~18% of each scanline it is displaying on.

come on...
so it is 40 to 80 ... exactly the half
I made that calculation already above and it is like I said there!

The only difference is you neglect audio and sprite and disk DMA.

Quote:

8 bitplane DMA fetches "block out" all DMA slots during their fetch moments. This means that neither CPU or Blitter get access to any slots during the display fetch moments. Adding overscan simply means there are more scanlines during which this situation applies, which affects both CPU and Blitter the same way.

nope
it will only affect the Blitter, since the display-fetches will not take place during the CPU slots anyways.

Quote:

However, the bandwidth to chipmemory of the Blitter and CPU are both still limited to 7MB/sec.

but the Blitter can't reach that number because of the display (and other) DMA - the CPU can (and does on the A1200).

Quote:

My point was rather that it might be useful to consider using both the Blitter and CPU concurrently...

Of course - that is some other subject and I never doubted that.

VladR · 25 April 2020, 02:37

It just so happens that I'm - right this moment- working on a benchmark build for my V4.

Perhaps we could use this opportunity and test some static 3D mesh of a character. While my game doesn't need to do FK/IK animation, we could still benchmark the actual poly transform+render - which, as I'm sure we can all agree, is the great majority of all computations.

Two requests:
1. Is 16 MB reasonable amount of RAM (to require) for our scenario here ? Or nobody with 030 has 16 MB ?
2. Can somebody please find some blog post detailing the exact polycount of VirtuaFighter characters ?

EDIT: The reason I am asking for 16 MB is:
1. Large tables for division and multiplication (surely no sane person would intend to execute mul/div?!?)
2. Keyframe animation of characters (no FK/IK) - storing all frames of each animation (as full 3D coordinates)

VladR · 25 April 2020, 02:54

Some raw brain-dump on KeyFrame animation:

250 [Vertices] per character (estimate)
6 [Bytes] per Vertex (2 Bytes per X,Y,Z) - sufficient precision without major artifacts
1,500 [Bytes] per One Frame of animation (250*6)

Since Polygon Indices are reused across all frames, we can disregard their storage costs.

Now, how many frames do we need ? From my experience with keyframe animation, if you're using Lerping (to interpolate between two frames), simple animations can get away with just 6-10 frames (Breathing, minor pose changes).
Walking is impossible to do without 12 frames, and even at 12 frames looks like sh*t.
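
To make the lerping idea concrete, here is a minimal Python sketch of interpolating between two stored keyframes. It assumes integer vertex coordinates (in the spirit of the 2 bytes per X,Y,Z above) and a step/steps fraction; the function name lerp_frame and the sample data are invented for illustration and say nothing about how an actual 030 routine would be written.

```python
# Linear interpolation ("lerp") between two keyframes of integer vertices.
# Each frame is a list of (x, y, z) tuples with the same vertex order.
def lerp_frame(frame_a, frame_b, step, steps):
    """Return the pose that is step/steps of the way from frame_a to frame_b."""
    return [
        tuple(a + (b - a) * step // steps for a, b in zip(va, vb))
        for va, vb in zip(frame_a, frame_b)
    ]

# Two made-up keyframes with two vertices each.
key0 = [(0, 0, 0), (100, -200, 400)]
key1 = [(40, 80, -120), (100, 200, 0)]

# Generate the in-between poses, including both end points.
for step in range(5):
    print(step, lerp_frame(key0, key1, step, 4))
```

Only the two keyframes are stored; every in-between pose is computed on the fly, which is why the stored frame count can stay low.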

More complex movements need 30-50 frames.

Of course, player can't wait 7 seconds till the poor 030 processes all 50 frames of the animation, so in reality, the number of frames will be dictated by actual HW performance.

Well, unless we make it a Turn-Based Fighter - then it would be OK to wait 5-8 seconds to finish each move.

Sarcasm aside, let's assume we could somehow reach 20 fps on 030, and if each move took 1.2 second (not sure, never timed it, hell - never even played VF), then we could process even full 24 frames of the animation.

16 different animations per character
24 frames per animation
384 = 16*24 = Total Frames per character

1,500 * 384 = 576,000 Bytes = 563 KB per character [for vertices]
We would need the same for polygon normals, so another 563 KB

563 + 563 = 1,126 KB - just over 1 MB. Not bad, really...
Even if we doubled the amount of frames and animations, it's still ~4.5 MB

So, even 8 MB 030 would suffice. Although, if you have 2 different characters on screen, then we gotta double it and we're then at ~9 MB. That's gonna take a while to load from disk...
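
The estimate above as a small Python sketch, in case someone wants to plug in other vertex or frame counts. The function name character_bytes is made up, and the "same amount again for normals" rule is the assumption from this post, not a measured figure.

```python
# Keyframe storage estimate, using the numbers from the post above.
VERTICES = 250
BYTES_PER_VERTEX = 6       # 2 bytes each for X, Y, Z
ANIMATIONS = 16
FRAMES_PER_ANIMATION = 24

def character_bytes(animations, frames, vertices=VERTICES, with_normals=True):
    """Bytes needed for all keyframes of one character."""
    total = vertices * BYTES_PER_VERTEX * animations * frames
    # Assumption from the post: normals cost as much as the vertices.
    return total * 2 if with_normals else total

MB = 1024 * 1024
base = character_bytes(ANIMATIONS, FRAMES_PER_ANIMATION)
doubled = character_bytes(ANIMATIONS * 2, FRAMES_PER_ANIMATION * 2)

print(f"one character:           {base / MB:.2f} MB")
print(f"frames+animations x2:    {doubled / MB:.2f} MB")
print(f"two such characters:     {2 * doubled / MB:.2f} MB")
```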

Samurai_Crow · 25 April 2020, 03:23

Re:9MB per pair of characters

A CD32 with an accelerator could probably handle it then. Likewise an A1200 with an accelerator and hard drive.

If we could just ditch floppies sooner....

roondar · 25 April 2020, 04:41

Quote:

Originally Posted by Gorf

but there we have sound- and sprite-DMA
Every second cycle ... so it is even worse for the Blitter during this time!

Sound and sprite DMA does not occur every second cycle, there are border/blanking cycles where no Sprite or audio DMA occurs at all.

Quote:

which I did mention above.

I never said otherwise. I was trying to be complete.

Quote:

come on...
so it is 40 to 80 ... exactly the half
I made that calculation already above and it is like I said there!

I acknowledged several posts ago that the bitplane DMA figure you gave is accurate (as in, the percentage is). But it's not a complete picture: the 40 AGA cycles are not spread across the scanline in the same way as the 80 OCS cycles are. The 80 cycles are spread across 20 fetches of 16 pixels, the 40 cycles are spread across 5 fetches of 64 pixels. I included a diagram further down that shows why this is important.

My main issue, however, was with horizontal blank/border. As it turns out, you have a different way of looking at what normally happens during that time than I do, which explains a lot.

Quote:

The only difference is you neglect audio and sprite and disk DMA.

You neglected them as well for the first five or so posts you made on this topic (same with overscan, which also appeared out of nowhere). Had you included what you thought about the other DMA sources from the get go, I would've understood your point earlier. I'd also have been able to point out why the way you want to include them is not very realistic. See, when you did start mentioning them in more detail, you took the rather over the top view that all of those other DMA sources are always running all the time. Which is not a realistic way of looking at them.

The reason I didn't include them is precisely because they normally don't amount to anywhere near their maximum number of cycles. Disk DMA is rarely used. Audio DMA almost never runs 4 channels@max rate. AGA games tend to not use sprite background layers as often as OCS games do (my guess for the reason is that the new Dual Playfield mode makes them less needed), so sprites are likewise not usually displayed on every scanline. Could all these things be active all the time? Well, anything is possible...

But then again, this feels a lot like a rather arbitrary limitation of which situation to include and which not to include. Why not include Copper activity in the horizontal blank/border as well along with sprites/audio/disk? Or doesn't that count because doing that as well does lock out the CPU?

Quote:

nope
it will only affect the Blitter, since the display-fetches will not take place during the CPU slots anyways.

This is not true. I've explained to you several times now that display fetches do occur during CPU slots, given enough bitplanes are active.

Perhaps a diagram might be helpful here.

Code:

Cycle diagram lowres AGA 8BPL/4x fetch vs OCS 4BPL fetch (assuming CPU hits every slot it can)

D = Bitplane DMA fetch
. = free cycle for Blitter/CPU
C = CPU cycle (no two CPU cycles can occur back-to-back)
B = Blitter cycle

 
OCS 4 BPL
.D.D.D.D .D.D.D.D .D.D.D.D .D.D.D.D (etc until scanline ends) =>
CDCDCDCD CDCDCDCD CDCDCDCD CDCDCDCD (etc until scanline ends) or
BDBDBDBD BDBDBDBD BDBDBDBD BDBDBDBD (etc until scanline ends) 
 
AGA 8 BPL/4x
DDDDDDDD ........ ........ ........ (etc until scanline ends) =>
DDDDDDDD C.C.C.C. C.C.C.C. C.C.C.C. (etc until scanline ends) or
DDDDDDDD BBBBBBBB BBBBBBBB BBBBBBBB (etc until scanline ends)

As you can see above, the OCS 4 BPL fetch has a free slot every other cycle, which allows the CPU to interleave "perfectly". The 8 bitplane AGA 4x fetch on the other hand has one "block" where the CPU can't access memory at all, followed by three where it can only access memory every other cycle. This block impacts the Blitter in the same way as it does the CPU. It's this weird setup that AGA uses that is the cause of me saying the Blitter and CPU are impacted in a similar fashion by bitplane DMA for an 8 bitplane low res screen.
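
The diagram can also be turned into a toy Python model that counts usable slots per 32-cycle group. This is only a sketch of the two rules stated in the legend (the CPU cannot take two slots back-to-back, the Blitter can take every free slot); it ignores refresh, other DMA and the real bus arbitration, and the function names are invented for illustration.

```python
# Toy model of the cycle diagram: "D" = bitplane DMA, "." = free slot.
def cpu_slots(pattern):
    """Greedy count of free slots the CPU can use, never two back-to-back."""
    count = 0
    previous_was_cpu = False
    for slot in pattern:
        if slot == "." and not previous_was_cpu:
            count += 1
            previous_was_cpu = True
        else:
            previous_was_cpu = False
    return count

def blitter_slots(pattern):
    """The Blitter can use every free slot."""
    return pattern.count(".")

OCS_4BPL = ".D.D.D.D" * 4              # 32 slots, every other one is free
AGA_8BPL_4X = "DDDDDDDD" + "." * 24    # 32 slots, one blocked group of 8

for name, pattern in (("OCS 4 BPL", OCS_4BPL), ("AGA 8 BPL/4x", AGA_8BPL_4X)):
    print(name, "CPU:", cpu_slots(pattern), "Blitter:", blitter_slots(pattern))
```

In this model the blocked group of 8 slots costs the CPU and the Blitter the same stretch of time, while the free stretch gives the CPU half as many slots as the Blitter.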

Quote:

but the Blitter can't reach that number because of the display (and other) DMA - the CPU can (and does on the A1200).

The CPU in the A1200 does not reach 7MB/sec when it's running in low res 8 bitplane mode either. Because of the display and sometimes (though less often) even due to other DMA. The 68020 in the A1200 also has a pretty poor reading speed from chip memory and big penalties for anything that isn't 32 bits wide plus aligned to 32 bits. I did a whole bunch of experiments with it (for the A1200 CPU+Blitter combined blitting example program I made a few months ago) and it's far slower when accessing memory than I had expected.

Gorf · 25 April 2020, 06:14

Ah ok - thank you.
Now the diagram was really helpful.

Now my mistake was indeed to treat the 40 AGA fetches like the 80 ECS ones: as only occurring during the "chip slots".

But in my defense: now the value of 1/7 or 14% of all cycles you mentioned, as opposed to my 20-25%, is also absolutely meaningless - it simply does not matter at all in this case!

Why? The question was who gets more bandwidth Blitter or CPU - and to answer this question the number of Display DMA fetches has no relevance, since the impact is exactly the same .. as is the number of lines.

We can just pretend there are (226-40=)186 slots in each line.
We can also say we don’t play sound and we don’t use sprites ...

But we can’t get rid of the 4 refresh cycles ... and these are chip slots.

So we have a maximum of 93*4B = 372B per line for the CPU
And we have a maximum of 182*2B=364B per line for the Blitter

So even without sound, sprites and disk, the bandwidth for the Blitter is less than for the CPU - in this particular case!
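
The per-line figures above, written out as a Python sketch. It follows the assumptions in this post exactly (CPU gets every second slot at 4 bytes, Blitter gets every free slot at 2 bytes, the 4 refresh slots only cost the Blitter); the variable names are illustrative.

```python
# Per-line chip RAM bandwidth on a 320 pixel wide, 8 bitplane AGA line.
SLOTS_PER_LINE = 226
BITPLANE_SLOTS = 40
REFRESH_SLOTS = 4
CPU_BYTES_PER_SLOT = 4       # 32-bit access
BLITTER_BYTES_PER_SLOT = 2   # 16-bit access

free_slots = SLOTS_PER_LINE - BITPLANE_SLOTS           # 186
cpu_line_slots = free_slots // 2                       # 93, every second slot
blitter_line_slots = free_slots - REFRESH_SLOTS        # 182

cpu_bytes = cpu_line_slots * CPU_BYTES_PER_SLOT                # 372
blitter_bytes = blitter_line_slots * BLITTER_BYTES_PER_SLOT    # 364

print(f"CPU:     {cpu_bytes} bytes per line")
print(f"Blitter: {blitter_bytes} bytes per line")
print(f"Blitter/CPU ratio: {blitter_bytes / cpu_bytes:.3f}")
```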

ReadOnlyCat · 25 April 2020, 07:18

Quote:

Originally Posted by VladR

Perhaps we could use this opportunity and test some static 3D mesh of a character. While my game doesn't need to do FK/IK animation, we could still benchmark the actual poly transform+render - which, as I'm sure we can all agree, is the great majority of all computations.

Two requests:
1. Is 16 MB reasonable amount of RAM (to require) for our scenario here ? Or nobody with 030 has 16 MB ?
2. Can somebody please find some blog post detailing the exact polycount of VirtuaFighter characters ?

Excellent idea.
Although 16 MB seems quite on the high side. Are you counting animations?

On VRally-2 PS2 we had about 400 triangles for the cars at the highest quality LOD. Although it is true that the projection and fill rate needs of a racing game and a fighting game are very different, this still gives an idea of how crude 3D models had to be back then.

Mul/div tables are a must but which format do you use for positional data?
PS1/Saturn used fixed point 16bit, so quite low precision. And frankly if we want to project as many vertices as possible, we probably should be looking at this format as well.

I would be very surprised if the Saturn version of Virtua Fighter used more than 5000 vertices per character but 1000 vertices would not surprise me at all.
I think someone posted some numbers earlier but I do not recall their source.

Quote:

Originally Posted by VladR

Some raw brain-dump on KeyFrame animation:
[snip]
Now, how many frames do we need ? From my experience with keyframe animation, if you're using Lerping

I think your numbers may be on the low side (vertex count notably) but your calculations look sound overall.
This said, I think it is premature to worry about animation before knowing how many projections can be made per frame because in the end this is going to be the initial multiplying factor.

Quote:

Originally Posted by roondar

Perhaps a diagram might be helpful here.

Great point roondar, thanks for taking the time to lay this down.

I think it is important to know which place the Blitter can take in this whole endeavor because even if it were possible for the CPU to write to Chip RAM at maximum bandwidth, it will never be able to do so for more than a fraction of each frame. Between vertex projections, animation lerping, visibility computations and rasterization span computations, the CPU will be very busy each frame and that will not be happening in Chip RAM.

During that non negligible fraction of the frame, the Blitter, even if slow, will be king.

Quote:

Originally Posted by Gorf

So we have a maximum of 93*4B = 372B per line for the CPU
And we have a maximum of 182*2B=364B per line for the Blitter

So even without sound, sprites and disk, the bandwidth for the Blitter is less than for the CPU - in this particular case!

That advantage is minimal though.

Moreover, cf my point in responding to roondar above, the CPU will have many other things to do before it can have a chance to touch Chip RAM. This could easily represent 25%-75% of a frame.

During that time, a Copper driven Blitter queue would be able to operate completely asynchronously. Filling all the Chip access cycles it can while the CPU is busy multiplying matrices and lerping in Fast RAM. If timed correctly, the Blitter could operate during the most DMA busy section of the screen while the CPU would take over during the vertical blank at max bandwidth.

(Obviously this requires double buffering but I doubt anyone expected single buffering to be used.)

Also, I have not given it much thought yet but I intuitively doubt that polygon rasterization with the Blitter would require all four channels to be used.
(There have been other discussions about polygon filling with the Blitter on the EAB but I do not think they have reached any definitive or practically applicable conclusion alas.)

Gorf · 25 April 2020, 07:33

No question: the Blitter can be put to good use - especially if the system has FastRAM and the CPU can do something else ...
I never wanted to discuss this fact - it is obvious.

This was just about the (rather academic) question of theoretical bandwidths and which one is higher.

chb · 25 April 2020, 08:29

Quote:

Originally Posted by Gorf

50 what?

But we DO have display DMA...

And now you are confusing display fetch cycles (32bit -double CAS) with Blitter (16bit - no double CAS)

And besides that: the Blitter would have access to ALL cycles minus (display, sound, sprites, floppy), not only every second - the CPU has only every second ....

Sorry, but what you just wrote makes absolutely no sense in this discussion

Those 50 are the screen refresh rate. I was just showing that bitplane DMA for a 320*256*8 screen takes ~14% of all available chip memory slots, as roondar claimed. And display fetch is 32 bit, double CAS, but takes one memory slot like a blitter fetch/write. But I think that is clear now.

Quote:

Originally Posted by VladR

1. Large tables for division and multiplication (surely no sane person would intend to execute mul/div?!?)

I think on an 030/50 a muls.w actually may be faster than a table access. Worst case is given with 28 cycles, but that depends on the bit patterns of the operands. Divs.w is much slower (56 cycles max), so it's probably better to use tables here.

roondar · 25 April 2020, 10:53

Quote:

Originally Posted by Gorf

But in my defense: now the value of 1/7 or 14% of all cycles you mentioned, as opposed to my 20-25%, is also absolutely meaningless - it simply does not matter at all in this case!
<...>
So we have a maximum of 93*4B = 372B per line for the CPU
And we have a maximum of 182*2B=364B per line for the Blitter

Well, it's not completely meaningless - the number of bytes per line does increase for areas of the frame where no bitplane DMA takes place. The difference (for bitplane DMA only) isn't enormous thanks to AGA 4x fetches, but it's still there.

Quote:

So even without sound, sprites and disk, the bandwidth for the Blitter is less than for the CPU - in this particular case!

Quite, but in my defense - your original position was that this difference was quite big. This was in fact the whole reason for this discussion. Well, it was from my perspective anyway. As it turns out, it's only a 2496 bytes difference per frame, out of ~141000 available per frame (to be clear, this number is without counting other DMA, if we do add in bitplane DMA the number of available bytes/frame will drop by some 20480).

My original post(s) should probably have said that they were very similar rather than "identical", but my overall idea (namely - Blitter bandwidth to chip memory is not much lower than CPU bandwidth, rather they're very similar) is actually accurate. Remember, your initial starting point was a difference of somewhere between 18% and 25%. It turns out to be closer to 1,8-2% within the constraints we're looking at here (note: obviously, changing the constraints will impact the result).

To me a ~2% difference isn't really all that significant and not really the same as what we were talking about. If that 2% had been your starting point, I'd simply have agreed I forgot to take refresh into account and left it at that
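
For reference, the per-frame figures can be reproduced with a few lines of Python. One reading that matches the ~141000 number is 113 CPU slots per line on a line without any bitplane DMA, summed over all 312 lines; that reading is an assumption made for this sketch, as are the names used.

```python
# Per-frame CPU vs Blitter byte budget, extending the per-line numbers.
LINES = 312
SLOTS = 226
REFRESH = 4

def bytes_per_line(bitplane_slots):
    """Return (cpu_bytes, blitter_bytes) for one line with the given DMA load."""
    free = SLOTS - bitplane_slots
    cpu = (free // 2) * 4               # every second slot, 4 bytes each
    blitter = (free - REFRESH) * 2      # every free slot, 2 bytes each
    return cpu, blitter

cpu_blank, blit_blank = bytes_per_line(0)     # line without bitplane DMA
cpu_disp, blit_disp = bytes_per_line(40)      # 8 bitplane AGA display line

difference_per_frame = (cpu_blank - blit_blank) * LINES
cpu_total_no_bitplanes = cpu_blank * LINES

print(f"difference per frame: {difference_per_frame} bytes")
print(f"CPU total, no bitplanes: {cpu_total_no_bitplanes} bytes")
print(f"relative difference: {difference_per_frame / cpu_total_no_bitplanes:.2%}")
```

The gap is 8 bytes per line on both kinds of line, which is why it stays the same whether or not bitplane DMA is counted.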

Quote:

Originally Posted by chb

I think on an 030/50 a muls.w actually may be faster than a table access. Worst case is given with 28 cycles, but that depends on the bit patterns of the operands. Divs.w is much slower (56 cycles max), so it's probably better to use tables here.

Wouldn't it depend on the speed of memory? I don't have the 68030 cycle counts handy, but the 68020 can do a memory lookup in something on the order of 4 to 9 cycles if it had zero-wait-state memory (assuming here we're not doing the lea required to fetch the table every table access, otherwise it's a couple more).

chb · 25 April 2020, 11:41

Quote:

Originally Posted by roondar

Wouldn't it depend on the speed of memory? I don't have the 68030 cycle counts handy, but the 68020 can do a memory lookup in something on the order of 4 to 9 cycles if it had zero-wait-state memory (assuming here we're not doing the lea required to fetch the table every table access, otherwise it's a couple more).

Yes, of course, you're absolutely right. But AFAIK all 030/50 accelerators had standard 70 ns or maybe 60 ns FPM RAM, which is not particularly fast when doing random access table lookups. IIRC, that translates to about 4-5e6 mem accesses per second if they are random (no bursts possible). Someone quoted here 20 MB/s, so even if you achieve that with a random access pattern (and those table lookups won't be sequential, and you won't get a lot of data cache hits), that's still a minimum of 10 cycles per access, plus address calculation. I did not test it, but I'm not sure if it would win against a muls.w routine.
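
A quick Python sketch of where the "10 cycles per access" comes from. It assumes a 50 MHz clock, 4 bytes moved per access and MB meaning 10^6 bytes; those assumptions and the helper name cycles_per_access are mine, chosen because they reproduce the figure in the post.

```python
# CPU cycles spent per random memory access at a given sustained bandwidth.
CPU_HZ = 50e6            # 030/50
BYTES_PER_ACCESS = 4     # assumption: one 32-bit access per lookup

def cycles_per_access(bytes_per_second):
    """Cycles per access if memory delivers bytes_per_second in 4-byte reads."""
    accesses_per_second = bytes_per_second / BYTES_PER_ACCESS
    return CPU_HZ / accesses_per_second

MULS_W_WORST_CASE = 28   # cycles, as quoted earlier in the thread

for mb_per_second in (16, 20):
    cycles = cycles_per_access(mb_per_second * 1e6)
    print(f"{mb_per_second} MB/s -> {cycles:.1f} cycles per lookup "
          f"(muls.w worst case: {MULS_W_WORST_CASE})")
```

The lookup cost here excludes the address calculation, so the real gap to muls.w is smaller than these numbers alone suggest.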

roondar · 25 April 2020, 11:52

10 cycle memory access would be indeed quite slow, at that rate you might actually be right that mulu.w might even be quicker. Reminds me of address calculation for bobs on OCS/68000 machines. On paper using a lookup is a big win (something like 40-50 cycles saved per lookup). But in reality the stated cycle time for mulu is a worst case scenario that doesn't really apply, which drops the gains to be much more modest.

VladR · 25 April 2020, 12:25

Quote:

Originally Posted by roondar

10 cycle memory access would be indeed quite slow, at that rate you might actually be right that mulu.w might even be quicker. Reminds me of address calculation for bobs on OCS/68000 machines. On paper using a lookup is a big win (something like 40-50 cycles saved per lookup). But in reality the stated cycle time for mulu is a worst case scenario that doesn't really apply, which drops the gains to be much more modest.

This is trivial to benchmark. All that needs to be done is to exchange the Transform method with the one that doesn't use mul/div but rather LUTs. Then just compare the time taken.

Incidentally, on Jaguar, I never needed to do such tables, as both MUL and DIV were in reality just 1 cycle on the RISC GPU.

Div has technically a latency of 9 cycles on RISC GPU, but if you rearrange the code around it, it can be basically computed for free (just 1 cycle). It's also very easy to benchmark it like that and confirm it.

Not even Vampire can do DIV in 1 cycle...

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
Found: Shadow Fighter (Was: Anime Fighter)	LaundroMat	Looking for a game name ?	6	14 June 2017 20:52
DKB Cobra/Viper 030 (Full 030) + FPU + Ram £100	ElectroBlaster	MarketPlace	1	08 March 2013 12:52
DKB Viper 030 + 128mb simm for A500 030 + ram...	ElectroBlaster	Swapshop	0	18 August 2012 19:48
[Found: Virtua Cop] shootie game with a gun	cosmicfrog	Looking for a game name ?	11	05 October 2009 22:11
GVP G-force 030 board for A2000-problem switching between 030 and 68k	Unregistered	support.Hardware	5	19 August 2004 10:04

VladR · 25 April 2020, 02:37

It just so happens that I'm, right this moment, working on a benchmark build for my V4. Perhaps we could use this opportunity and test some static 3D mesh of a character. While my game doesn't need to do FK/IK animation, we could still benchmark the actual poly transform+render, which, as I'm sure we can all agree, is the great majority of all computations.

Two requests:
1. Is 16 MB a reasonable amount of RAM (to require) for our scenario here? Or does nobody with an 030 have 16 MB?
2. Can somebody please find some blog post detailing the exact polycount of Virtua Fighter characters?

EDIT: The reason I am asking for 16 MB is:
1. Large tables for division and multiplication (surely no sane person would intend to execute mul/div?!?)
2. Keyframe animation of characters (no FK/IK): storing all frames of each animation (as full 3D coordinates)

VladR · 25 April 2020, 02:54

Some raw brain-dump on keyframe animation:

250 [Vertices] per character (estimate)
6 [Bytes] per Vertex (2 Bytes per X, Y, Z): sufficient precision without major artifacts
1,500 [Bytes] per one frame of animation (250 * 6)

Since polygon indices are reused across all frames, we can disregard their storage costs.

Now, how many frames do we need? From my experience with keyframe animation, if you're using lerping (to interpolate between two frames), simple animations can get away with just 6-10 frames (breathing, minor pose changes). Walking is impossible to do without 12 frames, and even at 12 frames looks like sh*t. More complex movements need 30-50 frames.

Of course, the player can't wait 7 seconds till the poor 030 processes all 50 frames of the animation, so in reality the number of frames will be dictated by actual HW performance. Well, unless we make it a turn-based fighter; then it would be OK to wait 5-8 seconds to finish each move.

Sarcasm aside, let's assume we could somehow reach 20 fps on the 030. If each move took 1.2 seconds (not sure, never timed it, hell, never even played VF), then we could process a full 24 frames of the animation.

16 different animations per character
24 frames per animation
16 * 24 = 384 total frames per character
1,500 * 384 = 576,000 Bytes = 563 KB per character [for vertices]

We would need the same for polygon normals, so another 563 KB.
563 + 563 = 1,126 KB, so just over 1 MB. Not bad, really...

Even if we doubled the number of frames and animations, it's still ~4.5 MB. So even an 8 MB 030 would suffice. Although, if you have 2 different characters on screen, then we gotta double it and we're then at ~9 MB. That's gonna take a while to load from disk...
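The storage estimate in this post can be re-run with a few lines of Python, which makes it easy to try other vertex counts or frame counts. This is only a sketch of the post's own arithmetic. The function name `keyframe_bytes` and its parameters are made up for illustration, and it follows the post's assumption that normals cost the same number of bytes as vertices.

```python
# Figures taken from the post (all are the poster's estimates).
VERTICES = 250            # vertices per character
BYTES_PER_VERTEX = 6      # 2 bytes each for X, Y, Z
ANIMATIONS = 16           # animations per character
FRAMES_PER_ANIMATION = 24


def keyframe_bytes(vertices=VERTICES,
                   bytes_per_vertex=BYTES_PER_VERTEX,
                   animations=ANIMATIONS,
                   frames_per_animation=FRAMES_PER_ANIMATION,
                   with_normals=True):
    """Bytes needed to store every keyframe of one character."""
    frame_bytes = vertices * bytes_per_vertex
    total_frames = animations * frames_per_animation
    vertex_bytes = frame_bytes * total_frames
    # Assumption from the post: normals take as much space as vertices.
    return vertex_bytes * 2 if with_normals else vertex_bytes


def to_mb(num_bytes):
    """Convert bytes to megabytes (1 MB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)


if __name__ == "__main__":
    base = keyframe_bytes()
    doubled = keyframe_bytes(animations=32, frames_per_animation=48)
    print(f"one character:          {to_mb(base):.2f} MB")
    print(f"doubled frames + anims: {to_mb(doubled):.2f} MB")
    print(f"two such characters:    {to_mb(2 * doubled):.2f} MB")
```

Doubling both the animation count and the frames per animation quadruples the total, which is where the jump from about 1 MB to roughly 4.5 MB per character comes from.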

Samurai_Crow · 25 April 2020, 03:23

Re: 9 MB per pair of characters

A CD32 with an accelerator could probably handle it then. Likewise an A1200 with an accelerator and hard drive. If we could just ditch floppies sooner....

Gorf · 25 April 2020, 06:14

Ah ok, thank you. Now the diagram was really helpful.

My mistake was indeed to treat the 40 AGA fetches like the 80 ECS ones: as only occurring during the "chip slots".

But in my defense: the value of 1/7 or 14% of all cycles you mentioned, as opposed to my 20-25%, is also absolutely meaningless. It simply does not matter at all in this case!

Why? The question was who gets more bandwidth, Blitter or CPU, and to answer this question the number of display DMA fetches has no relevance, since the impact is exactly the same, as is the number of lines.

We can just pretend there are (226 - 40 =) 186 slots in each line. We can also say we don't play sound and we don't use sprites ... But we can't get rid of the 4 refresh cycles ... and these are chip slots.

So we have a maximum of 93 * 4 B = 372 B per line for the CPU.
And we have a maximum of 182 * 2 B = 364 B per line for the Blitter.

So even without sound, sprites and disk, the bandwidth for the Blitter is less than for the CPU, in this particular case!
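The per-line figures in this post can be reproduced as follows. This is a sketch of the post's arithmetic only. The slot model used here is a reading of the post, not a statement from the hardware manual: it assumes the CPU makes one 4-byte access per two free slots, the Blitter makes one 2-byte access per free chip slot, and the 4 refresh slots cost only the Blitter. The function name `per_line_bandwidth` is hypothetical.

```python
# Figures taken from the post.
SLOTS_PER_LINE = 226     # DMA slots per scanline
DISPLAY_FETCHES = 40     # AGA bitplane fetches per line (4x fetch mode)
REFRESH_SLOTS = 4        # memory refresh slots, which cannot be disabled


def per_line_bandwidth(slots=SLOTS_PER_LINE,
                       display=DISPLAY_FETCHES,
                       refresh=REFRESH_SLOTS):
    """Return (cpu_bytes, blitter_bytes) available per scanline."""
    free_slots = slots - display
    # Assumed model: the CPU gets every other slot and moves 4 bytes per access.
    cpu_bytes = (free_slots // 2) * 4
    # Assumed model: the Blitter moves 2 bytes per slot but loses the refresh slots.
    blitter_bytes = (free_slots - refresh) * 2
    return cpu_bytes, blitter_bytes


if __name__ == "__main__":
    cpu, blitter = per_line_bandwidth()
    print(f"CPU: {cpu} B/line, Blitter: {blitter} B/line")
```

Under these assumptions the difference between the two is exactly the 8 bytes that the 4 refresh slots would otherwise have carried.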

Gorf · 25 April 2020, 07:33

No question: the Blitter can be put to good use, especially if the system has FastRAM and the CPU can do something else ... I never wanted to dispute this fact; it is obvious.

This was just about the (rather academic) question of theoretical bandwidths and which one is higher.

