English Amiga Board How did games like Starglider 2 get such a high frame rate?
 User Name Remember Me? Password
 Register Amiga FAQ Rules & Help Members List  /  Moderators List Today's Posts Mark Forums Read

10 August 2018, 09:26   #21
deimos
Registered User

Join Date: Jul 2018
Location: Londonish / UK
Posts: 52
This is not something I would have thought of. I'll keep it for the last round of optimisations, after I've done all the algorithmic improvements and standard tweaks.

Quote:
 Originally Posted by grond If you really want to do vector multiplications, you'll run out of registers. Hence, use MULs with constants for the camera vectors (like MULS.w #cam,Dn) and replace the dummy value with your updated camera values for each frame (self-modifying code). In this way you can do point projection without having to juggle with registers.

 AdSense
 10 August 2018, 10:09 #22 zero Registered User   Join Date: Jun 2016 Location: UK Posts: 177 How are you doing backface culling? Not calculating normals I guess, just looking at vertex order.
10 August 2018, 10:16   #23
StingRay
move.l #\$c0ff33,throat

Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 5,959
Quote:
 Originally Posted by deimos This is not something I would have thought of. I'll keep it for the last round of optimisations, after I've done all the algorithmic improvements and standard tweaks.

If you want your code to work properly on 68010+ you either have to use 2 different routines (one using SMC, one without) or at least disable the instruction cache as otherwise the SMC will happily funk up on anything else than plain 68000.

10 August 2018, 10:28   #24
deimos
Registered User

Join Date: Jul 2018
Location: Londonish / UK
Posts: 52
I'm not calculating the whole normal:

if ((x1 - x0) * (y2 - y0) < (y1 - y0) * (x2 - x0))
continue;

where the face is defined by the three points (x0, y0), (x1, y1), (x2, y2).

Quote:
 Originally Posted by zero How are you doing backface culling? Not calculating normals I guess, just looking at vertex order.

10 August 2018, 10:33   #25
chb
Registered User

Join Date: Dec 2014
Location: germany
Posts: 89
Quote:
 Originally Posted by deimos I always start with original coordinates in object space. I fill in my rotation matrix and apply this to my coordinates and add the translation for the object's position to get coordinates in world space. The camera is fixed at 0,0,0 for now, otherwise I'd move to 4x4 matrices and combine the move to camera coordinates into one matrix. I can see advantages in modelling my objects by taking advantage of their symmetry, they'll be easier to design by hand that way, but unless I'm missing something, I don't think I can use it to reduce the computations I need to do.
Hmm, either I fail to make myself clear or my idea is utter nonsense, both of which is totally possible . Maybe to be clear: The idea is avoiding multiplications. If your multiplications are fast, there's not much gain.

But let's take my example with the cube. To rotate it, you'd need to compute R p_i (p_i edge vectors), for every of the 8 edges. If you'd have have generic vectors (ax,ay,az),(bx,by,bz)...(hx,hy,hz) you need to calculate 9 products for every rotation: r11*ax,r12*ay,r13*az, r21*ax,r22*ay,r23*az etc. , so a total of 72 products (I got a factor 3 wrong before ). If now all coordinates have the same absolute value (like for the cube), you can re-use/precompute those products for the other object vectors, as e.g. r11*ax = r11*bx, r12*ay = -r12*by etc - only nine multiplications have to be carried out, and you have to care for the signs. Fastest way would be to substitute adds by subs.

I'd guess to implement it efficiently, you'd write (rather: generate) an own matrix rotation routine for every object. Does not work for objects that change dynamically, of course (apart from modifications like scaling etc).

Last edited by chb; 10 August 2018 at 10:39.

10 August 2018, 10:51   #26
deimos
Registered User

Join Date: Jul 2018
Location: Londonish / UK
Posts: 52
I'm going to try and work through an example to see if I can understand.

Quote:
 Originally Posted by chb Hmm, either I fail to make myself clear or my idea is utter nonsense, both of which is totally possible . Maybe to be clear: The idea is avoiding multiplications. If your multiplications are fast, there's not much gain. But let's take my example with the cube. To rotate it, you'd need to compute R p_i (p_i edge vectors), for every of the 8 edges. If you'd have have generic vectors (ax,ay,az),(bx,by,bz)...(hx,hy,hz) you need to calculate 9 products for every rotation: r11*ax,r12*ay,r13*az, r21*ax,r22*ay,r23*az etc. , so a total of 72 products (I got a factor 3 wrong before ). If now all coordinates have the same absolute value (like for the cube), you can re-use/precompute those products for the other object vectors, as e.g. r11*ax = r11*bx, r12*ay = -r12*by etc - only nine multiplications have to be carried out, and you have to care for the signs. Fastest way would be to substitute adds by subs. I'd guess to implement it efficiently, you'd write (rather: generate) an own matrix rotation routine for every object. Does not work for objects that change dynamically, of course (apart from modifications like scaling etc).

10 August 2018, 10:59   #27
grond
Registered User

Join Date: Jun 2015
Location: Germany
Posts: 436
Quote:
 Originally Posted by StingRay If you want your code to work properly on 68010+ you either have to use 2 different routines (one using SMC, one without) or at least disable the instruction cache as otherwise the SMC will happily funk up on anything else than plain 68000.
Rather 020+ but disabling caches would be nuts anyway. Just flush the caches after self-modifying the code. There are even OS functions readily available for this purpose.

 10 August 2018, 11:16 #28 grond Registered User   Join Date: Jun 2015 Location: Germany Posts: 436 You can also use spheric coordinates for the objects and then do a simple scaling using LUTs.
10 August 2018, 11:32   #29
deimos
Registered User

Join Date: Jul 2018
Location: Londonish / UK
Posts: 52
Is this what you mean?

So, a matrix m and a vector v, where:

m =
| a b c |
| d e f |
| g h i |

and:

v =
| x |
| y |
| z |

results in:

v' = mv =
| ax + by + cz |
| dx + ey + fz |
| gx + hy + iz |

Assuming that the values of abs x, y, z recurr within this or even other models, then the individual products, ax, by, cz, and so on, will be worth storing for reuse. Or, calculated in bulk at the start of the process.

Quote:
 Originally Posted by chb Hmm, either I fail to make myself clear or my idea is utter nonsense, both of which is totally possible . Maybe to be clear: The idea is avoiding multiplications. If your multiplications are fast, there's not much gain. But let's take my example with the cube. To rotate it, you'd need to compute R p_i (p_i edge vectors), for every of the 8 edges. If you'd have have generic vectors (ax,ay,az),(bx,by,bz)...(hx,hy,hz) you need to calculate 9 products for every rotation: r11*ax,r12*ay,r13*az, r21*ax,r22*ay,r23*az etc. , so a total of 72 products (I got a factor 3 wrong before ). If now all coordinates have the same absolute value (like for the cube), you can re-use/precompute those products for the other object vectors, as e.g. r11*ax = r11*bx, r12*ay = -r12*by etc - only nine multiplications have to be carried out, and you have to care for the signs. Fastest way would be to substitute adds by subs. I'd guess to implement it efficiently, you'd write (rather: generate) an own matrix rotation routine for every object. Does not work for objects that change dynamically, of course (apart from modifications like scaling etc).

10 August 2018, 12:27   #30
chb
Registered User

Join Date: Dec 2014
Location: germany
Posts: 89
Quote:
 Originally Posted by deimos Is this what you mean? [...] Assuming that the values of abs x, y, z recurr within this or even other models, then the individual products, ax, by, cz, and so on, will be worth storing for reuse. Or, calculated in bulk at the start of the process.
Yep, that's what I wanted to say. As symmetries around the origin result in same abs values for the vector elements, they are therefore beneficial for this trick. If you have other models with same coordinate values* that always share the same rotation matrix, you can exploit that too, of course.

I'd try this with your sphere and see what you can gain.

*some more advanced optimization could take into account that partial sums appear more than once (cube example), or that for a vector v' = (2x,y,z) or the like you can reuse the products from v = (x,y,z) with some simple additional adds.

10 August 2018, 12:56   #31
deimos
Registered User

Join Date: Jul 2018
Location: Londonish / UK
Posts: 52
I'm working on it. First I have to rewrite the code that generates the model. I'll update once I have it working.

Quote:
 Originally Posted by chb Yep, that's what I wanted to say. As symmetries around the origin result in same abs values for the vector elements, they are therefore beneficial for this trick. If you have other models with same coordinate values* that always share the same rotation matrix, you can exploit that too, of course. I'd try this with your sphere and see what you can gain. *some more advanced optimization could take into account that partial sums appear more than once (cube example), or that for a vector v' = (2x,y,z) or the like you can reuse the products from v = (x,y,z) with some simple additional adds.

11 August 2018, 00:10   #32
StingRay
move.l #\$c0ff33,throat

Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 5,959
Quote:
 Originally Posted by grond Rather 020+ but disabling caches would be nuts anyway. Just flush the caches after self-modifying the code. There are even OS functions readily available for this purpose.

010 has large enough cache to cause problems with SMC too! And it is much more useful to disable the instruction cache once instead of flushing the cache each time SMC has been used on such CPU's!

11 August 2018, 13:19   #33
deimos
Registered User

Join Date: Jul 2018
Location: Londonish / UK
Posts: 52
With just a simple implementation of this I'm seeing a 10% reduction of the time I spend calculating. I think the reason I'm not seeing more is the overhead of the extra steps needed to get to the coordinate data. I think I can get a lot more, after all the amount of coordinate data, and therefore the multiplications, has dramatically decreased, but it may take me a while to get there.

Quote:
 Originally Posted by chb Yep, that's what I wanted to say. As symmetries around the origin result in same abs values for the vector elements, they are therefore beneficial for this trick. If you have other models with same coordinate values* that always share the same rotation matrix, you can exploit that too, of course. I'd try this with your sphere and see what you can gain. *some more advanced optimization could take into account that partial sums appear more than once (cube example), or that for a vector v' = (2x,y,z) or the like you can reuse the products from v = (x,y,z) with some simple additional adds.

11 August 2018, 17:32   #34
chb
Registered User

Join Date: Dec 2014
Location: germany
Posts: 89
Quote:
 Originally Posted by deimos With just a simple implementation of this I'm seeing a 10% reduction of the time I spend calculating. I think the reason I'm not seeing more is the overhead of the extra steps needed to get to the coordinate data. I think I can get a lot more, after all the amount of coordinate data, and therefore the multiplications, has dramatically decreased, but it may take me a while to get there.
Ah, interesting. As said, the fastest (and most complex and inflexible) way probably would involve writing individual code for every object shape, if you do not plan to dynamically modify them. Which could mean that you do not store original object coordinates in an array/struct, but just as a piece of code that takes the rotation matrix as parameter and returns the rotated coordinates. Makes only sense of course if you are always rotating your object first and do not plan to use the original coordinates for something else.

Anyway, I do not know if those techniques fit in your engine concept, they put restrictions on object shapes to be effective, they can be quite time consuming if you hand-optimize the code for every shape, and writing a decent code generator may be hard... anyway, I guess you're doing this for fun, and optimizing the hell out of something is definitely good fun!

BTW, a cheap optimization for back face culling, again exploiting object symmetry (probably you're doing it already): If for your polygon another polygon exist with normal vector pointing just to the opposite direction (like opposite sides of a cube), you need to test only one of them and know the result for the other one. Same of course if the normal vectors are equal (same plane and direction).

 11 August 2018, 19:04 #35 deimos Registered User   Join Date: Jul 2018 Location: Londonish / UK Posts: 52 Here are the current numbers following the symmetry-based optimisations on the vertex data suggested by chb: My code takes 52s to draw 100 frames (320x256, 6 bitmap / EHB mode). There are 128 polygons represented by 12 x-coordinates, 5 y-coordinates and 12 z-coordinates (114 vertices pointing into that coordinate data). 100 / 52 gives 1.92 fps. The CPU time is spent as follow:35% is spent rotating and moving the ball and doing the 3D calculations. 27% is doing the edge tracking part of the scanline fill algorithm. 38% is spent drawing horizontal lines between active edges to fill the polygons. The code is mostly C, but I do have some assembly in there for a couple of hotspots and for the fixed point arithmetic. I think my next step is to rewrite more of it in assembly, but I don't believe I'll get more than a 25% improvement across the board, firstly because it won't be changing the underlying algorithms, and secondly because the code generated by the compiler isn't bad to start with. Last edited by deimos; 11 August 2018 at 19:17.
11 August 2018, 22:02   #36
grond
Registered User

Join Date: Jun 2015
Location: Germany
Posts: 436
Quote:
 Originally Posted by StingRay 010 has large enough cache to cause problems with SMC too! And it is much more useful to disable the instruction cache once instead of flushing the cache each time SMC has been used on such CPU's!
No and no. The 010 has a loop cache of 6 bytes which can only hold a short instruction and a decrement-and-branch instruction. There is no practical way to mess this up with SMC. And disabling the caches because you are using SMC to speed up (!) your code is bonkers. You'd lose most of your CPUs speed to save a few cycles in the point projection code. Flushing the caches once per frame doesn't have a large effect. Have a look at the system functions for this. They allow flushing just the modified range of memory. Oh, and by the way, I actually wrote the SMC point projection code I described above including the cache flushing once per frame and can tell you that it is pretty fast.

11 August 2018, 22:59   #37
StingRay
move.l #\$c0ff33,throat

Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 5,959
Quote:
 Originally Posted by grond No and no. The 010 has a loop cache of 6 bytes which can only hold a short instruction and a decrement-and-branch instruction. There is no practical way to mess this up with SMC. And disabling the caches because you are using SMC to speed up (!) your code is bonkers. You'd lose most of your CPUs speed to save a few cycles in the point projection code. Flushing the caches once per frame doesn't have a large effect. Have a look at the system functions for this. They allow flushing just the modified range of memory. Oh, and by the way, I actually wrote the SMC point projection code I described above including the cache flushing once per frame and can tell you that it is pretty fast.

It only really makes sense to use SMC on 68000 to speed up the code and I was referring to processors >68000! And it sure isn't "bonkers" to write reliable code on 680x0, far too often I had to fix broken demo code which didn't work correctly because of SMC! Besides, it is pretty easy to write fast code without using any SMC at all. And if it really can't be avoided to use SMC there should at least be 2 different routines, one for 68000, one for >68000. And using OS routines when the system is killed isn't exactly great either. Not to forget that the OS funtions to flush the cache require OS2.0+.

 12 August 2018, 08:06 #38 grond Registered User   Join Date: Jun 2015 Location: Germany Posts: 436 I wrote and tested the code on an 030 and it worked fast and reliably. I never kill the OS in my code, why should I? I believe those few percents lost to the OS still running can be gained more easily in other parts of the code. If you get full-frame rate on an 68000, yes, then you can disable caches for all faster processors because you will also get full speed. But if you want to make your code as fast as possible on any processor, flushing the caches once per frame is much faster than disabling them altogether.
12 August 2018, 20:59   #39
Photon
Moderator

Join Date: Nov 2004
Location: Hult / Sweden
Posts: 4,536
Quote:
 Originally Posted by deimos Here are the current numbers following the symmetry-based optimisations on the vertex data suggested by chb: My code takes 52s to draw 100 frames (320x256, 6 bitmap / EHB mode). There are 128 polygons represented by 12 x-coordinates, 5 y-coordinates and 12 z-coordinates (114 vertices pointing into that coordinate data). 100 / 52 gives 1.92 fps. The CPU time is spent as follow:35% is spent rotating and moving the ball and doing the 3D calculations. 27% is doing the edge tracking part of the scanline fill algorithm. 38% is spent drawing horizontal lines between active edges to fill the polygons. The code is mostly C, but I do have some assembly in there for a couple of hotspots and for the fixed point arithmetic. I think my next step is to rewrite more of it in assembly, but I don't believe I'll get more than a 25% improvement across the board, firstly because it won't be changing the underlying algorithms, and secondly because the code generated by the compiler isn't bad to start with.
This is what I was looking for from the question in OP. Generally when you write a new routine, optimization is a process of iteration and verifying that the code change in fact reduced processing time. (Not for you, you already know this and understand the problem.)

Instinctively,

1) the calculation part is way too heavy. Rendering is heavy enough; calculation should be 10% or so of the rendering time.
2) You're using a scanline algorithm, which is inherently for more powerful computers with chunky buffers (or something similar). You will never get this method fast without simply upgrading the hardware. Basically all poly renderers in OCS games simply render the polys one by one back to front.

Side note that SG2 isn't really "fast"; most games shrunk the screen and then polygons become quite small, which means the CPU can beat the Blitter; not having to use the Amiga hardware made the ST port a no-brainer. But it was written in Assembler, which is roughly 10x faster than C, so this is a big part of the difference in speed that you see.

There's also special-casing. Since none of your polygons partially overlap, you don't have to sort them, and you can fill them all at once with a single blit. This is the fastest method, but if you want your poly renderer to be general-purpose, and since all your polygons are very small indeed, then you could write a CPU poly fill (which would start to chug only when polygons become big). If you want this, and write such a routine, then I would say anything faster than 15 FPS for this object means you have a well-optimized routine.

Last edited by Photon; 12 August 2018 at 21:10.

13 August 2018, 11:07   #40
deimos
Registered User

Join Date: Jul 2018
Location: Londonish / UK
Posts: 52
Totally agree with point 1, that’s why I was looking for any cheats to the system that SG2 may have used, apart from the obvious small draw distance.

With regards to point 2, I’m currently using a scanline polygon fill to simply draw polygons one by one, I’m not doing any hidden surface removal or considering anything more than the current polygon - i just have two edge trackers doing an integer bresenhan style run down the sides of the polygons and drawing a line between the two sides. I’m not even doing any polygon sorting at the moment as the back face culling is enough for the single object I’m drawing. The actual line drawing between the edges for the polygon fill is a reasonably smart algorithm that works word by word with lookup tables for what to and / or to the extremities, but my polygons are so small that it’s only the extremities that count.

I will start to rewrite the code in assembly, but I have looked at what the compiler produces and I really don’t expect a 10x improvement. But maybe I’ll be surprised. Biggest obsticle is the 25 years since I last did assembly coding.

Quote:
 Originally Posted by Photon This is what I was looking for from the question in OP. Generally when you write a new routine, optimization is a process of iteration and verifying that the code change in fact reduced processing time. (Not for you, you already know this and understand the problem.) Instinctively, 1) the calculation part is way too heavy. Rendering is heavy enough; calculation should be 10% or so of the rendering time. 2) You're using a scanline algorithm, which is inherently for more powerful computers with chunky buffers (or something similar). You will never get this method fast without simply upgrading the hardware. Basically all poly renderers in OCS games simply render the polys one by one back to front. Side note that SG2 isn't really "fast"; most games shrunk the screen and then polygons become quite small, which means the CPU can beat the Blitter; not having to use the Amiga hardware made the ST port a no-brainer. But it was written in Assembler, which is roughly 10x faster than C, so this is a big part of the difference in speed that you see. There's also special-casing. Since none of your polygons partially overlap, you don't have to sort them, and you can fill them all at once with a single blit. This is the fastest method, but if you want your poly renderer to be general-purpose, and since all your polygons are very small indeed, then you could write a CPU poly fill (which would start to chug only when polygons become big). If you want this, and write such a routine, then I would say anything faster than 15 FPS for this object means you have a well-optimized routine.

Last edited by deimos; 13 August 2018 at 12:57.

 AdSense

 Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)
 Thread Tools

 Similar Threads Thread Thread Starter Forum Replies Last Post Foebane Retrogaming General Discussion 35 08 April 2018 13:22 universale support.Games 18 13 July 2015 21:45 DDNI Retrogaming General Discussion 41 30 June 2011 03:32 rsn8887 support.WinUAE 1 07 April 2011 20:43 Ironclaw request.UAE Wishlist 9 02 August 2006 07:21

 Posting Rules You may not post new threads You may not post replies You may not post attachments You may not edit your posts BB code is On Smilies are On [IMG] code is On HTML code is Off Forum Rules
 Forum Jump User Control Panel Private Messages Subscriptions Who's Online Search Forums Forums Home News Main     Amiga scene     Retrogaming General Discussion     Nostalgia & memories Support     New to Emulation or Amiga scene         Member Introductions     support.WinUAE     support.WinFellow     support.OtherUAE     support.FS-UAE         project.AmigaLive     support.Hardware         Hardware mods         Hardware pics     support.Games     support.Demos     support.Apps     support.Amiga Forever     support.Amix     support.Other Requests     request.UAE Wishlist     request.Old Rare Games     request.Demos     request.Apps     request.Modules     request.Music     request.Other     Looking for a game name ?     Games images which need to be WHDified abime.net - Hall Of Light     HOL news     HOL suggestions and feedback     HOL data problems     HOL contributions abime.net - Amiga Magazine Rack     AMR news     AMR suggestions and feedback     AMR data problems     AMR contributions abime.net - Home Projects     project.Amiga Lore     project.EAB     project.IRC     project.Mods Jukebox     project.Wiki abime.net - Hosted Projects     project.aGTW     project.APoV     project.ClassicWB     project.Jambo!     project.Green Amiga Alien GUIDES     project.Maptapper     project.Sprites     project.WinUAE - Kaillera Other Projects     project.Amiga Demo DVD     project.Amiga Game Factory     project.CARE     project.EAB File Server     project.CD32 Conversion     project.Game Cover Art         GCA.Feedback and Suggestions         GCA.Work in Progress         GCA.Cover Requests         GCA.Usefull Programs         GCA.Helpdesk     project.KGLoad     project.MAGE     project.Missing Full Shareware Games     project.SPS (was CAPS)     project.TOSEC (amiga only)     project.WHDLoad         project.Killergorilla's WHD packs Misc     Amiga websites reviews     MarketPlace         Swapshop     Collections     EAB's competition Coders     Coders. General         Coders. Releases         Coders. Tutorials     Coders. Asm / Hardware     Coders. System         Coders. Scripting         Coders. Nextgen     Coders. Language         Coders. C/C++         Coders. AMOS         Coders. Blitz Basic     Coders. Contest         Coders. Entries Off Topic     OT - General     OT - Technical     OT - Entertainment     OT - Sports     OT - Gaming

All times are GMT +2. The time now is 04:43.

 -- EAB3 skin ---- EAB2 skin ---- Mobile skin Archive - Top

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2018, vBulletin Solutions Inc.
Page generated in 0.08577 seconds with 16 queries