English Amiga Board


Main > Retrogaming General Discussion

15 May 2020, 14:23   #221
roondar
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 1,964
Quote:
Originally Posted by trixster
Quake is taxing not just because of polygons but because of the amount of texture data that needs to be pushed around.
Exactly. The thing about most "high end 3D" for high-end 68K Amigas is that it almost always involves* pushing textured polygons and/or special effects. It seems to me that would be much more expensive than sticking to flat-shaded polygons.

(I know Virtua Fighter does quite a bit more graphically than just pushing some flat-shaded polygons, by the way. It's just a possible starting point, perhaps even for a completely different game, if there's a 3D-minded 68K maestro out there.)

*) Off the top of my head, I can't think of a single game or demo designed for a 68040/68060 Amiga that was based on flat-shaded polygons, let alone one designed to exploit the better bus to chip memory that AGA offers. Perhaps they exist, and I'd be really interested in seeing them, but I can't think of any.

15 May 2020, 16:58   #222
AmigaHope
Registered User
Join Date: Sep 2006
Location: New Sandusky
Posts: 680
Quote:
Originally Posted by eXeler0
A better question might be:
What Amiga spec could run something similar to the 32X version of Virtua Fighter?
A 68060 with a fast 2D graphics card. Zorro III would work, but ideally a card that sits directly on the CPU card, like the CyberVisionPPC.

People keep bringing up MIPS, but Dhrystone MIPS figures are based on a mix of instructions. They don't account for the number of MUL instructions Virtua Fighter requires, or the time spent rendering to display memory.

The SH-2 has fast, low-latency multiply instructions. The 68040 does not, but the 68060 does.

The 32X has a fast (but dumb) framebuffer. AGA is extremely slow, hence the need for a graphics card connected via a fast bus. (Though I speculated earlier that you could maybe get away with a slower bus like Zorro II if you rasterized using the vector draw/fill commands of a card that supports them, like the Picasso IV.)

15 May 2020, 17:49   #223
roondar
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 1,964
The 68060 (and indeed even the 68030) normally has more than enough bandwidth to chip memory to redraw the full screen in well under the time it takes to display a frame: doing so only requires ~4 MB/sec for a 256-colour 320x256 screen at 50 Hz, and AGA offers somewhere around 6.5 MB/sec in 8-bitplane lowres. This means the maximum time AGA can possibly cost you over using something faster is one frame (render in fast memory during one frame, copy it over during the next). So a 1-frame game slows to a 2-frame game, etc.
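A rough sketch of that arithmetic, assuming a 50 Hz PAL display, one byte per pixel for the 8-bitplane 320x256 screen and the ~6.5 MB/sec chip memory figure above. The function names are made up for illustration, and the render time is an input you would have to measure on real hardware.

```python
import math

PAL_HZ = 50                # display frames per second (PAL)
FRAME_BYTES = 320 * 256    # 8 bitplanes = 1 byte per pixel
CHIP_BW = 6.5e6            # approx. CPU->chip RAM bandwidth in bytes/sec (figure from the post)


def required_bandwidth(fps=PAL_HZ, frame_bytes=FRAME_BYTES):
    """Bytes/sec needed to rewrite the whole screen every displayed frame."""
    return frame_bytes * fps


def copy_time_frames(frame_bytes=FRAME_BYTES, bw=CHIP_BW, hz=PAL_HZ):
    """Fraction of one display frame spent copying a finished frame to chip RAM."""
    return frame_bytes / bw * hz


def frames_per_update(render_frames, copy_frames):
    """Display frames per screen update when rendering in fast RAM, then copying."""
    return math.ceil(render_frames + copy_frames)
```

For example, `frames_per_update(1.0, copy_time_frames())` gives 2: a game that needs exactly one frame to render becomes a 2-frame game once the ~0.63-frame copy is added.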

Now, don't get me wrong, that is still a pretty stiff penalty to face. But it is manageable in most cases.

Basically, I'm fairly confident that a 2-frame (or even 3-frame) Virtua Fighter-style game can still be quite playable. It may even be possible to draw fewer frames than the game logic processes (à la SF II Turbo in the arcade) to give the illusion of a smoother game.
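A minimal sketch of that idea, assuming one logic tick per display frame and a renderer that only runs every Nth tick. `update_logic` and `render` are hypothetical placeholders, and this is not a description of how SF II Turbo actually does it.

```python
def run(ticks, render_every=2):
    """Advance game logic every tick, but redraw only every `render_every` ticks.

    Returns (logic_ticks, drawn) where `drawn` lists the logic tick shown on
    each rendered frame.
    """
    state = {"tick": 0}
    drawn = []

    def update_logic(s):
        # Hypothetical placeholder: read joystick, advance moves, check hits.
        s["tick"] += 1

    def render(s):
        # Hypothetical placeholder for the expensive 3D redraw.
        drawn.append(s["tick"])

    for t in range(ticks):
        update_logic(state)          # full-rate input and game logic
        if t % render_every == 0:
            render(state)            # screen refreshed at a lower rate
    return state["tick"], drawn
```

With `run(50, 2)`, all 50 logic ticks are processed (so input and hit detection stay at full rate) while only 25 frames are drawn.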

15 May 2020, 21:01   #224
saimon69
J.M.D - Bedroom Musician
Join Date: Apr 2014
Location: los angeles,ca
Posts: 1,408
Quote:
Originally Posted by roondar

Basically, I'm fairly confident that a 2-frame (or even 3-frame) Virtua Fighter-style game can still be quite playable. It may even be possible to draw fewer frames than the game logic processes (à la SF II Turbo in the arcade) to give the illusion of a smoother game.
There is some kind of fetish for a full 50 FPS that I don't fully understand and that, in my opinion, is undermining lots of projects.

15 May 2020, 21:11   #225
dreadnought
Registered User
Join Date: Dec 2019
Location: Ur, Atlantis
Posts: 356
Quote:
Originally Posted by saimon69
There is some kind of fetish for a full 50 FPS that I don't fully understand and that, in my opinion, is undermining lots of projects.
Some games really need to reach a certain frame rate, not just to be "playable" but for their gameplay to make sense. I believe Virtua Fighter is one of them. It's one of the highest-regarded fighting games thanks to its sophisticated combat system, not just its pioneering use of 3D. Playing it at a flaky frame rate would be pointless, IMO (though it's not about 50 fps; I don't think even the Saturn port managed that. A steady 30 should be OK, I suppose).

16 May 2020, 00:03   #226
roondar
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 1,964
A flaky frame rate would not be good. And it's possible that a 3-frame redraw speed is too low for this particular game. That said, 2 frames should still be doable, IMHO. It's a tad academic, really, as I have no real basis to say that a 68060-based Amiga could actually manage 25 FPS to begin with.

The only thing that makes me consider it a possibility is what I mentioned about keeping the game logic (and thus, hopefully, the feeling of responsiveness) running at a higher frame rate than the screen updates. This worked for SF II Turbo, which apparently felt very different as a result. But it's all just idle guesswork, really.

16 May 2020, 10:36   #227
AmigaHope
Registered User
Join Date: Sep 2006
Location: New Sandusky
Posts: 680
Quote:
Originally Posted by roondar
A flaky frame rate would not be good. And it's possible that a 3-frame redraw speed is too low for this particular game. That said, 2 frames should still be doable, IMHO. It's a tad academic, really, as I have no real basis to say that a 68060-based Amiga could actually manage 25 FPS to begin with.
Original VF1 is only 30fps anyway. It was the Saturn ports that shot for 60 fps.

The 32X was able to achieve this with reduced polygon counts and some dropped frames, using an architecture in the same ballpark as the 68060 (if not a little worse, because the work has to be split over two ~23 MHz CPUs). This was, however, with a local fast framebuffer.

Moreover, I double-checked and I was wrong: the 32X VDP does have one single accelerated feature. It can write a single word value to an arbitrary number of consecutive framebuffer locations, so you could in fact use it to crudely rasterize solid-shaded polygons (though I'm unsure how much speedup you'd get vs. just writing from the CPU). It can't draw vectors, calculate fills, copy data, etc. like a proper blitter; it can just fill an arbitrary range of memory with the same 16-bit value over and over.
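As a rough software model of what that fill buys for flat shading, the sketch below treats the framebuffer as a flat list of 16-bit words and issues one fill per scanline of a polygon. `vdp_fill` and `fill_spans` are hypothetical stand-ins; they do not model the real 32X VDP registers or any length limit the hardware may have.

```python
WIDTH = 320  # pixels per scanline (assumed 320-wide, 16-bit framebuffer)


def vdp_fill(fb, start, length, value):
    """Software stand-in for 'write one 16-bit value to `length` words'."""
    fb[start:start + length] = [value & 0xFFFF] * length


def fill_spans(fb, spans, colour):
    """Fill (y, x_left, x_right) spans of a flat-shaded polygon, x_right exclusive.

    Issues one fill request per non-empty scanline and returns how many were issued.
    """
    requests = 0
    for y, x0, x1 in spans:
        if x1 > x0:
            vdp_fill(fb, y * WIDTH + x0, x1 - x0, colour)
            requests += 1
    return requests
```

The CPU's work per scanline shrinks to computing the span and issuing one request, which is where the "parallel blitter" benefit comes from.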

Anyway I have no doubt that the 68060 can do better than the 32X in terms of geometry, the only devil is getting that output into a displayable state on the Amiga. (While also rendering backgrounds, which the 32X gets for free from the Genesis.)

18 May 2020, 15:43   #228
VladR
Registered User
Join Date: Dec 2019
Location: North Dakota
Posts: 217
Quote:
Originally Posted by AmigaHope
Original VF1 is only 30fps anyway. It was the Saturn ports that shot for 60 fps.

The 32X was able to achieve this with reduced polygon counts and some dropped frames, using an architecture in the same ballpark as the 68060 (if not a little worse, because the work has to be split over two ~23 MHz CPUs). This was, however, with a local fast framebuffer.
I've done a bit of experimenting with Sega's SH-2 dev environment.
I compiled the gcc and SH-4 compilers and was able to get an executable.

Their dev environment is incredible. Targeting both SH-2s is remarkably easy. Everything happens at the C level, and the compilers generate pretty respectable code, to the point that only a very few pipeline stages need rewriting in ASM.

From my perspective, Sega's dev environment, code samples and docs beat Sony's by a large margin.

So, unlike, say, the Jaguar, where I had to spend months debugging weird HW bugs and all 3 processors were completely different (68000, GPU RISC, DSP RISC) with different instruction sets, cycle timings, etc., Sega's environment makes multithreaded programming a complete breeze.

Quote:
Originally Posted by AmigaHope
Moreover, I double-checked and I was wrong: the 32X VDP does have one single accelerated feature. It can write a single word value to an arbitrary number of consecutive framebuffer locations, so you could in fact use it to crudely rasterize solid-shaded polygons (though I'm unsure how much speedup you'd get vs. just writing from the CPU). It can't draw vectors, calculate fills, copy data, etc. like a proper blitter; it can just fill an arbitrary range of memory with the same 16-bit value over and over.
Correct. The VDP's run-length mode is a handy feature, because for flat shading you basically get parallel blitter functionality.

As to the speed-up, I have benchmark data from my 3D engine on the 68080. Of all the pipeline stages, you save the last one: Pixel Fill (a simple dbra loop around move.l d0,(a0)+).
On the 68080, that was 8.7% of frame time for a scene of 1,000 triangles and 55,215 written pixels.
It's easy to compute the cost of this on a 68030:
.PixelFillLoop:
move.b d0,(a0)+ ; 8 cycles
dbra d1,.PixelFillLoop ; 10 cycles



And ~55,000 x 18 = 990,000 cycles, i.e. ~1 frame on a 50 MHz 030 (50,000,000 / 50 = 1,000,000 cycles per PAL frame).
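The same estimate as a small calculation, taking the per-instruction cycle counts from the loop above at face value (they are the poster's figures, not timings measured here) and assuming a 50 Hz PAL frame.

```python
def fill_cycles(pixels, move_cycles=8, dbra_cycles=10):
    """Cycles for a move.b/dbra fill loop, using the per-instruction figures quoted above."""
    return pixels * (move_cycles + dbra_cycles)


def cycles_to_frames(cycles, cpu_hz=50_000_000, display_hz=50):
    """Convert CPU cycles into display frames (50 Hz PAL by default)."""
    return cycles / (cpu_hz / display_hz)
```

Using the exact pixel count from the benchmark, `fill_cycles(55_215)` comes to 993,870 cycles, still just under one frame at 50 MHz.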


Of course, on a 68030 it would be worthwhile to unroll this loop 320 times and jump into it at a computed offset, avoiding the per-pixel dbra altogether.
This would somewhat increase the per-scanline cost, but cut the per-pixel cost by more than half (from 18 to 8 cycles). Since there are ~3,000 scanlines vs. 55,000 pixels, it's worth doing.
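A cost model for that trade-off, reusing the cycle figures quoted above. The 30-cycle per-scanline cost of computing the jump target is a HYPOTHETICAL placeholder, not a measured 68030 number; `break_even_setup` shows how much setup the unrolled version can afford before it stops winning.

```python
def looped_cycles(pixels, per_pixel=8 + 10):
    """move.b + dbra for every pixel (figures as quoted above)."""
    return pixels * per_pixel


def unrolled_cycles(pixels, scanlines, per_pixel=8, per_scanline=30):
    """Unrolled fill entered at a computed offset.

    per_scanline is a HYPOTHETICAL cost for working out the jump target;
    per_pixel keeps only the move.b, since the dbra is gone.
    """
    return scanlines * per_scanline + pixels * per_pixel


def break_even_setup(pixels, scanlines, looped_pp=18, unrolled_pp=8):
    """Per-scanline setup cost at which unrolling stops paying off."""
    return pixels * (looped_pp - unrolled_pp) / scanlines
```

With 55,000 pixels over 3,000 scanlines, the jump computation could cost up to roughly 183 cycles per scanline before unrolling stops paying off, which is why the pixel-to-scanline ratio favours it.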




But on the 68080, since it's merely 8% of frame time, it's simply not worth my time to save ~4% of frame time, so I just left it as is.

Quote:
Originally Posted by AmigaHope
Anyway I have no doubt that the 68060 can do better than the 32X in terms of geometry, the only devil is getting that output into a displayable state on the Amiga. (While also rendering backgrounds, which the 32X gets for free from the Genesis.)
Not sure about this. The 060 is definitely in the ballpark of the 32X, but the SH-2s are ~20 MIPS each. Which 060 are we talking about here, exactly?

Don't forget that on the 32X you don't have to do any bitplane conversion. You merely write one 16-bit word per scanline (length:color) and let the VDP fill the framebuffer in parallel.
I think someone mentioned here that the conversion takes about 1 frame of CPU time? Though I'm not sure if that was for the 030 or the 060...

18 May 2020, 15:50   #229
VladR
Registered User
Join Date: Dec 2019
Location: North Dakota
Posts: 217
Quote:
Originally Posted by britelite
When it comes to measuring 3D throughput, TBL-demos aren't really the best example...
I'm not familiar with TBL's demos, but the few 060 demos I've seen had abysmal frame rates. Plenty of scenes had a super-slow camera, so it felt like 5 fps.

That's 10 frames of CPU time per update on PAL, on an 060,
versus the 2 frames of CPU time we want for this game, on an 030.

That's a 5x difference already, and once we account for the performance difference between the 030 and the 060, it's going to shoot way above 10x...

Not to mention that lots of demos are merely scene players, which is in stark contrast to a real-time fighting game...

18 May 2020, 16:03   #230
hooverphonique
ex. demoscener "Bigmama"
Join Date: Jun 2012
Location: Fyn / Denmark
Posts: 1,102
Quote:
Originally Posted by VladR
.PixelFillLoop:
move.b d0,(a0)+ ; 8 cycles
dbra d1,.PixelFillLoop ; 10 cycles
On an Amiga, you wouldn't use move.b, but move.l for this.

18 May 2020, 16:54   #231
VladR
Registered User
Join Date: Dec 2019
Location: North Dakota
Posts: 217
Quote:
Originally Posted by hooverphonique
On an Amiga, you wouldn't use move.b, but move.l for this.
Not quite sure about that. We're talking about a 256-color RTG mode here, right? Which will then be converted to the respective bitplanes.

Are you perhaps thinking of a LeftEdge / MiddleGroup / RightEdge approach? The MiddleGroup would be a loop of 4-pixel move.l writes, while LeftEdge and RightEdge would handle the respective 1-3 pixel alignment on each side of the scanline.

If that's the case, then this will significantly increase the per-scanline cost of the pipeline.
In our case, the code that breaks each scanline down like this would be executed over 3,000 times.

Also, the scanlines in our use case are quite short, maybe 10-12 px on average? I seriously doubt the savings would offset that, but it's very easy to benchmark: just run both code paths and compare...
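A sketch of the two code paths being compared, modelled on an 8-bit chunky buffer held in a `bytearray`. Both function names are hypothetical; the count of write operations is only a proxy, since the real cost of a byte vs. a longword write depends on the bus.

```python
def fill_span_bytes(fb, x0, x1, colour):
    """Reference: one byte write per pixel (the move.b loop). Returns write count."""
    writes = 0
    for x in range(x0, x1):
        fb[x] = colour
        writes += 1
    return writes


def fill_span_edges(fb, x0, x1, colour):
    """LeftEdge / MiddleGroup / RightEdge fill of pixels x0..x1-1.

    Byte writes until x is 4-aligned, 32-bit writes for the middle, then byte
    writes for the remaining 0-3 pixels. Returns the write count.
    """
    writes = 0
    x = x0
    while x < x1 and x % 4:              # LeftEdge: up to 3 pixels
        fb[x] = colour
        x += 1
        writes += 1
    word = bytes([colour]) * 4           # colour replicated into a longword
    while x + 4 <= x1:                   # MiddleGroup: one write per 4 pixels
        fb[x:x + 4] = word
        x += 4
        writes += 1
    while x < x1:                        # RightEdge: up to 3 pixels
        fb[x] = colour
        x += 1
        writes += 1
    return writes
```

For a 12-pixel span starting at x=3, `fill_span_edges` issues 6 writes (1 + 2 + 3) against 12 for the byte loop; for very short spans the edge handling is most of the work, which is exactly the trade-off being debated.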

18 May 2020, 16:58   #232
roondar
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 1,964
Quote:
Originally Posted by VladR
I think someone mentioned here that the conversion takes about 1 frame of CPU time? Though I'm not sure if that was for the 030 or the 060...
As far as I know, both the 68040 and 68060 are fast enough to do this conversion at "copy speed", i.e. fast enough to saturate the chip memory bus. Of course, that bus only gives you somewhere around ~6.5MB/sec of bandwidth (the rest of the 7MB/sec theoretical bandwidth is lost to the display). As such, it will take you about 60% of a frame to fully convert and copy over a 320x256x8 display to chip memory.

This is a hard limit*: no matter how fast the CPU gets, the copy will never get any faster. I don't know if the 68030 is fast enough to do the same, but the existence of optimised Blitter+CPU C2P methods for 68030s makes me think it probably isn't.

*) Well, the 6.5 MB/sec figure is an approximation, so it may be a bit more or less. But you get the idea.
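To make the conversion itself concrete, here is a deliberately naive chunky-to-planar sketch for one scanline: bit p of every 8-bit pixel goes into bitplane p, with the leftmost pixel in bit 7 of each plane byte. Real 030/040/060 C2P routines use bit-matrix transposition with shifts and masks on 32-bit words rather than this per-bit loop; the sketch only shows the data layout being converted.

```python
def c2p_row(chunky, width, planes=8):
    """Naive chunky-to-planar conversion of one scanline.

    chunky: `width` bytes, one 8-bit pixel each.
    Returns `planes` bytearrays of width // 8 bytes; bit p of each pixel goes to
    plane p, with the leftmost pixel in bit 7 of the first byte (Amiga order).
    """
    if width % 8:
        raise ValueError("width must be a multiple of 8")
    out = [bytearray(width // 8) for _ in range(planes)]
    for x in range(width):
        pix = chunky[x]
        byte_index, bit = divmod(x, 8)
        for p in range(planes):
            if (pix >> p) & 1:
                out[p][byte_index] |= 0x80 >> bit
    return out
```

Every output byte touches 8 input pixels, so a full 320x256x8 screen still ends up as 81,920 bytes that have to cross the chip memory bus, which is where the copy-speed limit comes from.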

18 May 2020, 17:08   #233
VladR
Registered User
Join Date: Dec 2019
Location: North Dakota
Posts: 217
Quote:
Originally Posted by roondar
As far as I know, both the 68040 and 68060 are fast enough to do this conversion at "copy speed", i.e. fast enough to saturate the chip memory bus. Of course, that bus only gives you somewhere around ~6.5MB/sec of bandwidth (the rest of the 7MB/sec theoretical bandwidth is lost to the display). As such, it will take you about 60% of a frame to fully convert and copy over a 320x256x8 display to chip memory.

This is a hard limit*: no matter how fast the CPU gets, the copy will never get any faster. I don't know if the 68030 is fast enough to do the same, but the existence of optimised Blitter+CPU C2P methods for 68030s makes me think it probably isn't.

*) Well, the 6.5 MB/sec figure is an approximation, so it may be a bit more or less. But you get the idea.
Interesting. So what exactly imposes the 7 MB/sec limit here: the RAM chip access speed (e.g. 60 ns)?

I find it a bit hard to believe that the 060 can execute the bitplane conversion faster than the write to RAM (I mean, there's quite a lot of bit shifting involved), but I obviously haven't ever run those benchmarks, so I don't really know. Then again, if the 060 can execute most ops in ~2 cycles and the write takes 16, it probably makes sense...

I'm sure there have been a hundred-plus attempts in the past to write the most efficient C2P possible...

Either way, it's ~60% of frame time on an 060, leaving 1.4 frames of time for a game to run at 25/30 fps.

18 May 2020, 17:12   #234
hooverphonique
ex. demoscener "Bigmama"
Join Date: Jun 2012
Location: Fyn / Denmark
Posts: 1,102
Quote:
Originally Posted by VladR
Not quite sure about that. We're talking about a 256-color RTG mode here, right? Which will then be converted to the respective bitplanes.
I'm not sure. Does an Amiga with an 030 have RTG? Even if it does, doing byte-wise accesses to adjacent addresses over the Zorro bus will slow things down significantly compared to combining the data into wider writes.

18 May 2020, 17:12   #235
VladR
Registered User
Join Date: Dec 2019
Location: North Dakota
Posts: 217
On the Jaguar, at the beginning of each frame I always kicked off a screen clear via the Blitter, so even at 640x240x16-bit (a 300 KB framebuffer) I got an almost free screen clear: my pipeline continued with the 3D transform, and I benchmarked the stages so that I would never have to wait more than ~1% of frame time for the clear to finish.

Can we do something like this on the 060? That is, issue a single Blitter command that clears our 320x256x8-bit framebuffer in parallel (for free) while the 060 runs the 3D pipeline?

And if it can be done, what impact does such bus access have on the 060's RAM access? We would need to read the 3D mesh, compute the transform, and write the data back to RAM, all while the Blitter is accessing RAM...

Because if it can be done on the 060, then we really have 1.4 frames of time for a 3D scene, which is quite a lot (though probably not enough for Virtua Fighter).

18 May 2020, 17:18   #236
roondar
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 1,964
Quote:
Originally Posted by VladR
Interesting. So what exactly imposes the 7 MB/sec limit here: the RAM chip access speed (e.g. 60 ns)?
It's imposed by the AGA DMA controller (Alice) and the chip memory bus speed. The bus runs at about 3.5 MHz and is 32 bits wide. However, due to the design of Alice, the CPU cannot access chip memory two cycles in a row; it is limited to accessing chip memory on one cycle and then has to wait for the next. That gives it 50% of the total cycles, so (3.5 MHz x 4 bytes) / 2 = 7 MB/sec.
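The bandwidth derivation above as a one-liner, with each factor as a named parameter. The defaults are the post's rounded figures; the more exact PAL colour clock of 3.546895 MHz (one chip memory cycle per colour clock) is used only in the check.

```python
def cpu_chip_bandwidth(bus_mhz=3.5, bus_bytes=4, cpu_share=0.5):
    """Peak CPU chip-RAM bandwidth in MB/sec.

    bus_mhz: chip memory cycles per second, in millions.
    bus_bytes: bytes per cycle (4 for AGA's 32-bit chip bus).
    cpu_share: fraction of cycles the CPU may use (every other cycle).
    """
    return bus_mhz * bus_bytes * cpu_share
```

Plugging in 3.546895 MHz gives about 7.09 MB/sec; the ~6.5 MB/sec usable figure quoted earlier is lower because some of those cycles go to display DMA.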
Quote:
I find it a bit hard to believe that the 060 can execute the bitplane conversion faster than the write to RAM (I mean, there's quite a lot of bit shifting involved), but I obviously haven't ever run those benchmarks, so I don't really know. Then again, if the 060 can execute most ops in ~2 cycles and the write takes 16, it probably makes sense...

I'm sure there have been a hundred-plus attempts in the past to write the most efficient C2P possible...
I'm pretty sure it does manage it at copy speed. Perhaps others can weigh in here; I'm not an expert on C2P routines, I merely report what I've heard quite a few times.
Quote:
Either way, it's ~60% of frame time on an 060, leaving 1.4 frames of time for a game to run at 25/30 fps.
Yup, it's not great. But it's what we're stuck with.
---
Quote:
Originally Posted by VladR
Can we do something like this on the 060? That is, issue a single Blitter command that clears our 320x256x8-bit framebuffer in parallel (for free) while the 060 runs the 3D pipeline?
You can indeed have the Blitter clear the screen concurrently with the 060 doing its thing.
Quote:
And if it can be done, what impact does such bus access have on the 060's RAM access? We would need to read the 3D mesh, compute the transform, and write the data back to RAM, all while the Blitter is accessing RAM...
Blitter clears on the Amiga are kind of interesting in that they only use half the available DMA cycles* (due to pipelining). This effectively means that the 060 can simply slot into the half of the chip memory cycles that go unused, which should mean the effect on the CPU's chip memory bandwidth is close to zero.

Also note that the 060 will never be impacted by the Blitter for the parts of its 3D pipeline code that run in Fast RAM.

*) Note that there are some interesting limitations in the Blitter design regarding these idle cycles, which make this trick a lot less useful on OCS/ECS systems, but their effect is much less severe on AGA systems with their enhanced bitplane fetching.

Last edited by roondar; 18 May 2020 at 17:23.

18 May 2020, 17:21   #237
VladR
Registered User
Join Date: Dec 2019
Location: North Dakota
Posts: 217
Quote:
Originally Posted by hooverphonique
I'm not sure. Does an Amiga with an 030 have RTG? Even if it does, doing byte-wise accesses to adjacent addresses over the Zorro bus will slow things down significantly compared to combining the data into wider writes.
Damn, such an access pattern would have to be really, really slow for it to actually be faster to break each short scanline down into 32-bit values (the left and right edges would have to be AND'ed and OR'ed).

Honestly, the last time I did that was on the Atari 800's 6502 (1.79 MHz), which has just 3 registers. I reckon the code should be much shorter on a 68000, with 16 registers and incredibly rich addressing modes, so perhaps it's not such a big deal on this architecture...

It was, however, several pages of code on the 6502, plus 4 lookup tables...
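To show what the AND/OR edge handling amounts to, here is a sketch of a single read-modify-write on an aligned longword of 8-bit chunky pixels, as an alternative to plain byte writes for the 1-3 edge pixels. `write_masked_long` is a hypothetical helper working on a `bytearray`; note that it costs an extra read per edge longword, which matters over a slow bus.

```python
def write_masked_long(fb, addr, first, count, colour):
    """Read-modify-write one aligned longword of 8-bit chunky pixels.

    Sets `count` pixels starting at byte `first` (0-3) of the longword at `addr`
    to `colour`, leaving the other pixels untouched. Big-endian, as on the 68000.
    """
    if addr % 4 or first < 0 or first + count > 4:
        raise ValueError("need an aligned longword and a span inside it")
    mask = int.from_bytes(
        bytes(0xFF if first <= i < first + count else 0 for i in range(4)), "big")
    fill = int.from_bytes(bytes([colour]) * 4, "big")     # colour in all 4 bytes
    old = int.from_bytes(fb[addr:addr + 4], "big")         # the extra read
    new = (old & ~mask & 0xFFFFFFFF) | (fill & mask)       # AND out, OR in
    fb[addr:addr + 4] = new.to_bytes(4, "big")
```

On a 68000-family CPU the mask and the replicated colour would typically come from small lookup tables indexed by the edge offset, much like the tables mentioned above.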

18 May 2020, 17:25   #238
VladR
Registered User
Join Date: Dec 2019
Location: North Dakota
Posts: 217
Quote:
Originally Posted by roondar
It's imposed by the AGA DMA controller (Alice) and the chip memory bus speed. The bus runs at about 3.5 MHz and is 32 bits wide. However, due to the design of Alice, the CPU cannot access chip memory two cycles in a row; it is limited to accessing chip memory on one cycle and then has to wait for the next. That gives it 50% of the total cycles, so (3.5 MHz x 4 bytes) / 2 = 7 MB/sec.
Ouch.

Yeah, now I understand why even the 060 would hit that limit. Thanks for the explanation.

One less thing to benchmark then, eh?

18 May 2020, 17:32   #239
roondar
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 1,964
Ah, one thing to note on Blitter clears that may not have been perfectly clear: due to its idle cycles, a Blitter clear can only reach 50% of the Blitter's normal bandwidth. This means it can only clear at about 3.5 MB/sec, which is not fast enough to clear an entire 320x256x8 screen within a single frame.
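A quick check of that claim, assuming the ~3.5 MB/sec effective Blitter clear rate from the post and a 20 ms PAL frame. The helper names are illustrative only.

```python
PAL_FRAME_MS = 20.0  # one 50 Hz PAL frame


def clear_time_ms(width=320, height=256, depth=8, mb_per_sec=3.5):
    """Milliseconds for a Blitter clear of a planar screen at the given throughput."""
    nbytes = width * height * depth // 8
    return nbytes / (mb_per_sec * 1e6) * 1000


def fits_in_frame(ms, frame_ms=PAL_FRAME_MS):
    """True if an operation of `ms` milliseconds completes within one frame."""
    return ms <= frame_ms
```

The full 8-bitplane screen (81,920 bytes) takes about 23.4 ms, just over one PAL frame; at the same rate a 6-bitplane screen would come to about 17.6 ms.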

18 May 2020, 17:32   #240
VladR
Registered User
Join Date: Dec 2019
Location: North Dakota
Posts: 217
Quote:
Originally Posted by roondar
You can indeed have the Blitter clear the screen concurrently with the 060 doing its thing.

Blitter clears on the Amiga are kind of interesting in that they only use half the available DMA cycles* (due to pipelining). This effectively means that the 060 can simply slot into the half of the chip memory cycles that go unused, which should mean the effect on the CPU's chip memory bandwidth is close to zero.

Also note that the 060 will never be impacted by the Blitter for the parts of its 3D pipeline code that run in Fast RAM.

*) Note that there are some interesting limitations in the Blitter design regarding these idle cycles, which make this trick a lot less useful on OCS/ECS systems, but their effect is much less severe on AGA systems with their enhanced bitplane fetching.
Thanks, that's pretty cool then, as we have 1.4 frames of time fully available.

Now, thinking about it, 60% of frame time (for C2P) on an 060 is an awful lot of instructions. I suspect the 060 is idle for a lot of that time.

Could we, in theory, interleave our C2P code with some other code that would, basically in parallel with C2P, compute something else (say, the 3D transform)?

I mean, it would be a real b*tch to debug, for sure, but in theory it could work, right?

Not sure I'm making sense right now, but I'm assuming here that the 60% of frame time spent on C2P is really a lot of idle time for an 060, time that could be used for something else if somebody were insane enough to interleave it like that...