How did games like Starglider 2 get such a high frame rate? - Page 2

deimos · 10 August 2018, 09:26

This is not something I would have thought of. I'll keep it for the last round of optimisations, after I've done all the algorithmic improvements and standard tweaks.

Quote:

Originally Posted by grond

If you really want to do vector multiplications, you'll run out of registers. Hence, use MULs with constants for the camera vectors (like MULS.w #cam,Dn) and replace the dummy value with your updated camera values for each frame (self-modifying code). In this way you can do point projection without having to juggle with registers.

zero · 10 August 2018, 10:09

How are you doing backface culling? Not calculating normals I guess, just looking at vertex order.

StingRay · 10 August 2018, 10:16

Quote:

Originally Posted by deimos

This is not something I would have thought of. I'll keep it for the last round of optimisations, after I've done all the algorithmic improvements and standard tweaks.

If you want your code to work properly on 68010+ you either have to use 2 different routines (one using SMC, one without) or at least disable the instruction cache as otherwise the SMC will happily funk up on anything else than plain 68000.

deimos · 10 August 2018, 10:28

I'm not calculating the whole normal:

if ((x1 - x0) * (y2 - y0) < (y1 - y0) * (x2 - x0))
continue;

where the face is defined by the three points (x0, y0), (x1, y1), (x2, y2).

Quote:

Originally Posted by zero

How are you doing backface culling? Not calculating normals I guess, just looking at vertex order.

chb · 10 August 2018, 10:33

Quote:

Originally Posted by deimos

I always start with original coordinates in object space. I fill in my rotation matrix and apply this to my coordinates and add the translation for the object's position to get coordinates in world space. The camera is fixed at 0,0,0 for now, otherwise I'd move to 4x4 matrices and combine the move to camera coordinates into one matrix.

I can see advantages in modelling my objects by taking advantage of their symmetry, they'll be easier to design by hand that way, but unless I'm missing something, I don't think I can use it to reduce the computations I need to do.

Hmm, either I fail to make myself clear or my idea is utter nonsense, both of which is totally possible

. Maybe to be clear: The idea is avoiding multiplications. If your multiplications are fast, there's not much gain.

But let's take my example with the cube. To rotate it, you'd need to compute R p_i (p_i edge vectors), for every of the 8 edges. If you'd have have generic vectors (ax,ay,az),(bx,by,bz)...(hx,hy,hz) you need to calculate 9 products for every rotation: r11*ax,r12*ay,r13*az, r21*ax,r22*ay,r23*az etc. , so a total of 72 products (I got a factor 3 wrong before

). If now all coordinates have the same absolute value (like for the cube), you can re-use/precompute those products for the other object vectors, as e.g. r11*ax = r11*bx, r12*ay = -r12*by etc - only nine multiplications have to be carried out, and you have to care for the signs. Fastest way would be to substitute adds by subs.

I'd guess to implement it efficiently, you'd write (rather: generate) an own matrix rotation routine for every object. Does not work for objects that change dynamically, of course (apart from modifications like scaling etc).

deimos · 10 August 2018, 10:51

I'm going to try and work through an example to see if I can understand.

Quote:

Originally Posted by chb

Hmm, either I fail to make myself clear or my idea is utter nonsense, both of which is totally possible

. Maybe to be clear: The idea is avoiding multiplications. If your multiplications are fast, there's not much gain.

But let's take my example with the cube. To rotate it, you'd need to compute R p_i (p_i edge vectors), for every of the 8 edges. If you'd have have generic vectors (ax,ay,az),(bx,by,bz)...(hx,hy,hz) you need to calculate 9 products for every rotation: r11*ax,r12*ay,r13*az, r21*ax,r22*ay,r23*az etc. , so a total of 72 products (I got a factor 3 wrong before

). If now all coordinates have the same absolute value (like for the cube), you can re-use/precompute those products for the other object vectors, as e.g. r11*ax = r11*bx, r12*ay = -r12*by etc - only nine multiplications have to be carried out, and you have to care for the signs. Fastest way would be to substitute adds by subs.

I'd guess to implement it efficiently, you'd write (rather: generate) an own matrix rotation routine for every object. Does not work for objects that change dynamically, of course (apart from modifications like scaling etc).

grond · 10 August 2018, 10:59

Quote:

Originally Posted by StingRay

If you want your code to work properly on 68010+ you either have to use 2 different routines (one using SMC, one without) or at least disable the instruction cache as otherwise the SMC will happily funk up on anything else than plain 68000.

Rather 020+ but disabling caches would be nuts anyway. Just flush the caches after self-modifying the code. There are even OS functions readily available for this purpose.

grond · 10 August 2018, 11:16

You can also use spheric coordinates for the objects and then do a simple scaling using LUTs.

deimos · 10 August 2018, 11:32

Is this what you mean?

So, a matrix m and a vector v, where:

m =
| a b c |
| d e f |
| g h i |

and:

v =
| x |
| y |
| z |

results in:

v' = mv =
| ax + by + cz |
| dx + ey + fz |
| gx + hy + iz |

Assuming that the values of abs x, y, z recurr within this or even other models, then the individual products, ax, by, cz, and so on, will be worth storing for reuse. Or, calculated in bulk at the start of the process.

Quote:

Originally Posted by chb

Hmm, either I fail to make myself clear or my idea is utter nonsense, both of which is totally possible

. Maybe to be clear: The idea is avoiding multiplications. If your multiplications are fast, there's not much gain.

But let's take my example with the cube. To rotate it, you'd need to compute R p_i (p_i edge vectors), for every of the 8 edges. If you'd have have generic vectors (ax,ay,az),(bx,by,bz)...(hx,hy,hz) you need to calculate 9 products for every rotation: r11*ax,r12*ay,r13*az, r21*ax,r22*ay,r23*az etc. , so a total of 72 products (I got a factor 3 wrong before

). If now all coordinates have the same absolute value (like for the cube), you can re-use/precompute those products for the other object vectors, as e.g. r11*ax = r11*bx, r12*ay = -r12*by etc - only nine multiplications have to be carried out, and you have to care for the signs. Fastest way would be to substitute adds by subs.

I'd guess to implement it efficiently, you'd write (rather: generate) an own matrix rotation routine for every object. Does not work for objects that change dynamically, of course (apart from modifications like scaling etc).

chb · 10 August 2018, 12:27

Quote:

Originally Posted by deimos

Is this what you mean?

[...]

Assuming that the values of abs x, y, z recurr within this or even other models, then the individual products, ax, by, cz, and so on, will be worth storing for reuse. Or, calculated in bulk at the start of the process.

Yep, that's what I wanted to say. As symmetries around the origin result in same abs values for the vector elements, they are therefore beneficial for this trick. If you have other models with same coordinate values* that always share the same rotation matrix, you can exploit that too, of course.

I'd try this with your sphere and see what you can gain.

*some more advanced optimization could take into account that partial sums appear more than once (cube example), or that for a vector v' = (2x,y,z) or the like you can reuse the products from v = (x,y,z) with some simple additional adds.

deimos · 10 August 2018, 12:56

I'm working on it. First I have to rewrite the code that generates the model. I'll update once I have it working.

Quote:

Originally Posted by chb

Yep, that's what I wanted to say. As symmetries around the origin result in same abs values for the vector elements, they are therefore beneficial for this trick. If you have other models with same coordinate values* that always share the same rotation matrix, you can exploit that too, of course.

I'd try this with your sphere and see what you can gain.

*some more advanced optimization could take into account that partial sums appear more than once (cube example), or that for a vector v' = (2x,y,z) or the like you can reuse the products from v = (x,y,z) with some simple additional adds.

StingRay · 11 August 2018, 00:10

Quote:

Originally Posted by grond

Rather 020+ but disabling caches would be nuts anyway. Just flush the caches after self-modifying the code. There are even OS functions readily available for this purpose.

010 has large enough cache to cause problems with SMC too! And it is much more useful to disable the instruction cache once instead of flushing the cache each time SMC has been used on such CPU's!

deimos · 11 August 2018, 13:19

With just a simple implementation of this I'm seeing a 10% reduction of the time I spend calculating. I think the reason I'm not seeing more is the overhead of the extra steps needed to get to the coordinate data. I think I can get a lot more, after all the amount of coordinate data, and therefore the multiplications, has dramatically decreased, but it may take me a while to get there.

Quote:

Originally Posted by chb

Yep, that's what I wanted to say. As symmetries around the origin result in same abs values for the vector elements, they are therefore beneficial for this trick. If you have other models with same coordinate values* that always share the same rotation matrix, you can exploit that too, of course.

I'd try this with your sphere and see what you can gain.

*some more advanced optimization could take into account that partial sums appear more than once (cube example), or that for a vector v' = (2x,y,z) or the like you can reuse the products from v = (x,y,z) with some simple additional adds.

chb · 11 August 2018, 17:32

Quote:

Originally Posted by deimos

With just a simple implementation of this I'm seeing a 10% reduction of the time I spend calculating. I think the reason I'm not seeing more is the overhead of the extra steps needed to get to the coordinate data. I think I can get a lot more, after all the amount of coordinate data, and therefore the multiplications, has dramatically decreased, but it may take me a while to get there.

Ah, interesting. As said, the fastest (and most complex and inflexible) way probably would involve writing individual code for every object shape, if you do not plan to dynamically modify them. Which could mean that you do not store original object coordinates in an array/struct, but just as a piece of code that takes the rotation matrix as parameter and returns the rotated coordinates. Makes only sense of course if you are always rotating your object first and do not plan to use the original coordinates for something else.

Anyway, I do not know if those techniques fit in your engine concept, they put restrictions on object shapes to be effective, they can be quite time consuming if you hand-optimize the code for every shape, and writing a decent code generator may be hard... anyway, I guess you're doing this for fun, and optimizing the hell out of something is definitely good fun!

BTW, a cheap optimization for back face culling, again exploiting object symmetry (probably you're doing it already): If for your polygon another polygon exist with normal vector pointing just to the opposite direction (like opposite sides of a cube), you need to test only one of them and know the result for the other one. Same of course if the normal vectors are equal (same plane and direction).

deimos · 11 August 2018, 19:04

Here are the current numbers following the symmetry-based optimisations on the vertex data suggested by chb:

My code takes 52s to draw 100 frames (320x256, 6 bitmap / EHB mode). There are 128 polygons represented by 12 x-coordinates, 5 y-coordinates and 12 z-coordinates (114 vertices pointing into that coordinate data).

100 / 52 gives 1.92 fps.

The CPU time is spent as follow:

35% is spent rotating and moving the ball and doing the 3D calculations.
27% is doing the edge tracking part of the scanline fill algorithm.
38% is spent drawing horizontal lines between active edges to fill the polygons.

The code is mostly C, but I do have some assembly in there for a couple of hotspots and for the fixed point arithmetic.

I think my next step is to rewrite more of it in assembly, but I don't believe I'll get more than a 25% improvement across the board, firstly because it won't be changing the underlying algorithms, and secondly because the code generated by the compiler isn't bad to start with.

grond · 11 August 2018, 22:02

Quote:

Originally Posted by StingRay

010 has large enough cache to cause problems with SMC too! And it is much more useful to disable the instruction cache once instead of flushing the cache each time SMC has been used on such CPU's!

No and no. The 010 has a loop cache of 6 bytes which can only hold a short instruction and a decrement-and-branch instruction. There is no practical way to mess this up with SMC. And disabling the caches because you are using SMC to speed up (!) your code is bonkers. You'd lose most of your CPUs speed to save a few cycles in the point projection code. Flushing the caches once per frame doesn't have a large effect. Have a look at the system functions for this. They allow flushing just the modified range of memory. Oh, and by the way, I actually wrote the SMC point projection code I described above including the cache flushing once per frame and can tell you that it is pretty fast.

StingRay · 11 August 2018, 22:59

Quote:

Originally Posted by grond

No and no. The 010 has a loop cache of 6 bytes which can only hold a short instruction and a decrement-and-branch instruction. There is no practical way to mess this up with SMC. And disabling the caches because you are using SMC to speed up (!) your code is bonkers. You'd lose most of your CPUs speed to save a few cycles in the point projection code. Flushing the caches once per frame doesn't have a large effect. Have a look at the system functions for this. They allow flushing just the modified range of memory. Oh, and by the way, I actually wrote the SMC point projection code I described above including the cache flushing once per frame and can tell you that it is pretty fast.

It only really makes sense to use SMC on 68000 to speed up the code and I was referring to processors >68000! And it sure isn't "bonkers" to write reliable code on 680x0, far too often I had to fix broken demo code which didn't work correctly because of SMC! Besides, it is pretty easy to write fast code without using any SMC at all. And if it really can't be avoided to use SMC there should at least be 2 different routines, one for 68000, one for >68000. And using OS routines when the system is killed isn't exactly great either. Not to forget that the OS funtions to flush the cache require OS2.0+.

grond · 12 August 2018, 08:06

I wrote and tested the code on an 030 and it worked fast and reliably. I never kill the OS in my code, why should I? I believe those few percents lost to the OS still running can be gained more easily in other parts of the code.

If you get full-frame rate on an 68000, yes, then you can disable caches for all faster processors because you will also get full speed. But if you want to make your code as fast as possible on any processor, flushing the caches once per frame is much faster than disabling them altogether.

Photon · 12 August 2018, 20:59

Quote:

Originally Posted by deimos

Here are the current numbers following the symmetry-based optimisations on the vertex data suggested by chb:

My code takes 52s to draw 100 frames (320x256, 6 bitmap / EHB mode). There are 128 polygons represented by 12 x-coordinates, 5 y-coordinates and 12 z-coordinates (114 vertices pointing into that coordinate data).

100 / 52 gives 1.92 fps.

The CPU time is spent as follow:

35% is spent rotating and moving the ball and doing the 3D calculations.
27% is doing the edge tracking part of the scanline fill algorithm.
38% is spent drawing horizontal lines between active edges to fill the polygons.

The code is mostly C, but I do have some assembly in there for a couple of hotspots and for the fixed point arithmetic.

I think my next step is to rewrite more of it in assembly, but I don't believe I'll get more than a 25% improvement across the board, firstly because it won't be changing the underlying algorithms, and secondly because the code generated by the compiler isn't bad to start with.

This is what I was looking for from the question in OP.

Generally when you write a new routine, optimization is a process of iteration and verifying that the code change in fact reduced processing time. (Not for you, you already know this and understand the problem.)

Instinctively,

1) the calculation part is way too heavy. Rendering is heavy enough; calculation should be 10% or so of the rendering time.
2) You're using a scanline algorithm, which is inherently for more powerful computers with chunky buffers (or something similar). You will never get this method fast without simply upgrading the hardware. Basically all poly renderers in OCS games simply render the polys one by one back to front.

Side note that SG2 isn't really "fast"; most games shrunk the screen and then polygons become quite small, which means the CPU can beat the Blitter; not having to use the Amiga hardware made the ST port a no-brainer. But it was written in Assembler, which is roughly 10x faster than C, so this is a big part of the difference in speed that you see.

There's also special-casing. Since none of your polygons partially overlap, you don't have to sort them, and you can fill them all at once with a single blit. This is the fastest method, but if you want your poly renderer to be general-purpose, and since all your polygons are very small indeed, then you could write a CPU poly fill (which would start to chug only when polygons become big). If you want this, and write such a routine, then I would say anything faster than 15 FPS for this object means you have a well-optimized routine.

deimos · 13 August 2018, 11:07

Totally agree with point 1, that’s why I was looking for any cheats to the system that SG2 may have used, apart from the obvious small draw distance.

With regards to point 2, I’m currently using a scanline polygon fill to simply draw polygons one by one, I’m not doing any hidden surface removal or considering anything more than the current polygon - i just have two edge trackers doing an integer bresenhan style run down the sides of the polygons and drawing a line between the two sides. I’m not even doing any polygon sorting at the moment as the back face culling is enough for the single object I’m drawing. The actual line drawing between the edges for the polygon fill is a reasonably smart algorithm that works word by word with lookup tables for what to and / or to the extremities, but my polygons are so small that it’s only the extremities that count.

I will start to rewrite the code in assembly, but I have looked at what the compiler produces and I really don’t expect a 10x improvement. But maybe I’ll be surprised. Biggest obsticle is the 25 years since I last did assembly coding.

Quote:

Originally Posted by Photon

This is what I was looking for from the question in OP.

Generally when you write a new routine, optimization is a process of iteration and verifying that the code change in fact reduced processing time. (Not for you, you already know this and understand the problem.)

Instinctively,

1) the calculation part is way too heavy. Rendering is heavy enough; calculation should be 10% or so of the rendering time.
2) You're using a scanline algorithm, which is inherently for more powerful computers with chunky buffers (or something similar). You will never get this method fast without simply upgrading the hardware. Basically all poly renderers in OCS games simply render the polys one by one back to front.

Side note that SG2 isn't really "fast"; most games shrunk the screen and then polygons become quite small, which means the CPU can beat the Blitter; not having to use the Amiga hardware made the ST port a no-brainer. But it was written in Assembler, which is roughly 10x faster than C, so this is a big part of the difference in speed that you see.

There's also special-casing. Since none of your polygons partially overlap, you don't have to sort them, and you can fill them all at once with a single blit. This is the fastest method, but if you want your poly renderer to be general-purpose, and since all your polygons are very small indeed, then you could write a CPU poly fill (which would start to chug only when polygons become big). If you want this, and write such a routine, then I would say anything faster than 15 FPS for this object means you have a well-optimized routine.

11 August 2018, 19:04	#35
deimos It's coming back! Join Date: Jul 2018 Location: comp.sys.amiga Posts: 762	Here are the current numbers following the symmetry-based optimisations on the vertex data suggested by chb: My code takes 52s to draw 100 frames (320x256, 6 bitmap / EHB mode). There are 128 polygons represented by 12 x-coordinates, 5 y-coordinates and 12 z-coordinates (114 vertices pointing into that coordinate data). 100 / 52 gives 1.92 fps. The CPU time is spent as follow: 35% is spent rotating and moving the ball and doing the 3D calculations. 27% is doing the edge tracking part of the scanline fill algorithm. 38% is spent drawing horizontal lines between active edges to fill the polygons. The code is mostly C, but I do have some assembly in there for a couple of hotspots and for the fixed point arithmetic. I think my next step is to rewrite more of it in assembly, but I don't believe I'll get more than a 25% improvement across the board, firstly because it won't be changing the underlying algorithms, and secondly because the code generated by the compiler isn't bad to start with. Last edited by deimos; 11 August 2018 at 19:17.

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
Games that are Full Frame Rate or Slower - Limitations or Choice?	Foebane	Retrogaming General Discussion	35	08 April 2018 13:22
F1 grand prix frame rate	universale	support.Games	18	13 July 2015 21:45
The First Person Shooter frame rate tolerance poll...	DDNI	Retrogaming General Discussion	41	30 June 2011 03:32
Vsync Fullscreen and Double Buffer, incorrect frame rate?	rsn8887	support.WinUAE	1	07 April 2011 20:43
Propper speed request when recording with "Disable frame rate" turned on.	Ironclaw	request.UAE Wishlist	9	02 August 2006 07:21

10 August 2018, 10:09	#22
zero Registered User Join Date: Jun 2016 Location: UK Posts: 428	How are you doing backface culling? Not calculating normals I guess, just looking at vertex order.

10 August 2018, 11:16	#28
grond Registered User Join Date: Jun 2015 Location: Germany Posts: 1,924	You can also use spheric coordinates for the objects and then do a simple scaling using LUTs.

12 August 2018, 08:06	#38
grond Registered User Join Date: Jun 2015 Location: Germany Posts: 1,924	I wrote and tested the code on an 030 and it worked fast and reliably. I never kill the OS in my code, why should I? I believe those few percents lost to the OS still running can be gained more easily in other parts of the code. If you get full-frame rate on an 68000, yes, then you can disable caches for all faster processors because you will also get full speed. But if you want to make your code as fast as possible on any processor, flushing the caches once per frame is much faster than disabling them altogether.

Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)