English Amiga Board


Go Back   English Amiga Board > Coders > Coders. General

 
 
Thread Tools
Old 10 August 2018, 10:26   #21
deimos
Registered User

 
Join Date: Jul 2018
Location: Londonish / UK
Posts: 109
This is not something I would have thought of. I'll keep it for the last round of optimisations, after I've done all the algorithmic improvements and standard tweaks.

Quote:
Originally Posted by grond View Post
If you really want to do vector multiplications, you'll run out of registers. Hence, use MULs with constants for the camera vectors (like MULS.w #cam,Dn) and replace the dummy value with your updated camera values for each frame (self-modifying code). In this way you can do point projection without having to juggle with registers.
deimos is offline  
Old 10 August 2018, 11:09   #22
zero
Registered User

 
Join Date: Jun 2016
Location: UK
Posts: 189
How are you doing backface culling? Not calculating normals I guess, just looking at vertex order.
zero is offline  
Old 10 August 2018, 11:16   #23
StingRay
move.l #$c0ff33,throat

StingRay's Avatar
 
Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 6,146
Quote:
Originally Posted by deimos View Post
This is not something I would have thought of. I'll keep it for the last round of optimisations, after I've done all the algorithmic improvements and standard tweaks.

If you want your code to work properly on 68010+ you either have to use 2 different routines (one using SMC, one without) or at least disable the instruction cache as otherwise the SMC will happily funk up on anything else than plain 68000.
StingRay is offline  
Old 10 August 2018, 11:28   #24
deimos
Registered User

 
Join Date: Jul 2018
Location: Londonish / UK
Posts: 109
I'm not calculating the whole normal:

if ((x1 - x0) * (y2 - y0) < (y1 - y0) * (x2 - x0))
continue;

where the face is defined by the three points (x0, y0), (x1, y1), (x2, y2).

Quote:
Originally Posted by zero View Post
How are you doing backface culling? Not calculating normals I guess, just looking at vertex order.
deimos is offline  
Old 10 August 2018, 11:33   #25
chb
Registered User

 
Join Date: Dec 2014
Location: germany
Posts: 99
Quote:
Originally Posted by deimos View Post
I always start with original coordinates in object space. I fill in my rotation matrix and apply this to my coordinates and add the translation for the object's position to get coordinates in world space. The camera is fixed at 0,0,0 for now, otherwise I'd move to 4x4 matrices and combine the move to camera coordinates into one matrix.

I can see advantages in modelling my objects by taking advantage of their symmetry, they'll be easier to design by hand that way, but unless I'm missing something, I don't think I can use it to reduce the computations I need to do.
Hmm, either I fail to make myself clear or my idea is utter nonsense, both of which is totally possible . Maybe to be clear: The idea is avoiding multiplications. If your multiplications are fast, there's not much gain.

But let's take my example with the cube. To rotate it, you'd need to compute R p_i (p_i edge vectors), for every of the 8 edges. If you'd have have generic vectors (ax,ay,az),(bx,by,bz)...(hx,hy,hz) you need to calculate 9 products for every rotation: r11*ax,r12*ay,r13*az, r21*ax,r22*ay,r23*az etc. , so a total of 72 products (I got a factor 3 wrong before ). If now all coordinates have the same absolute value (like for the cube), you can re-use/precompute those products for the other object vectors, as e.g. r11*ax = r11*bx, r12*ay = -r12*by etc - only nine multiplications have to be carried out, and you have to care for the signs. Fastest way would be to substitute adds by subs.

I'd guess to implement it efficiently, you'd write (rather: generate) an own matrix rotation routine for every object. Does not work for objects that change dynamically, of course (apart from modifications like scaling etc).

Last edited by chb; 10 August 2018 at 11:39.
chb is offline  
Old 10 August 2018, 11:51   #26
deimos
Registered User

 
Join Date: Jul 2018
Location: Londonish / UK
Posts: 109
I'm going to try and work through an example to see if I can understand.

Quote:
Originally Posted by chb View Post
Hmm, either I fail to make myself clear or my idea is utter nonsense, both of which is totally possible . Maybe to be clear: The idea is avoiding multiplications. If your multiplications are fast, there's not much gain.

But let's take my example with the cube. To rotate it, you'd need to compute R p_i (p_i edge vectors), for every of the 8 edges. If you'd have have generic vectors (ax,ay,az),(bx,by,bz)...(hx,hy,hz) you need to calculate 9 products for every rotation: r11*ax,r12*ay,r13*az, r21*ax,r22*ay,r23*az etc. , so a total of 72 products (I got a factor 3 wrong before ). If now all coordinates have the same absolute value (like for the cube), you can re-use/precompute those products for the other object vectors, as e.g. r11*ax = r11*bx, r12*ay = -r12*by etc - only nine multiplications have to be carried out, and you have to care for the signs. Fastest way would be to substitute adds by subs.

I'd guess to implement it efficiently, you'd write (rather: generate) an own matrix rotation routine for every object. Does not work for objects that change dynamically, of course (apart from modifications like scaling etc).
deimos is offline  
Old 10 August 2018, 11:59   #27
grond
Registered User

 
Join Date: Jun 2015
Location: Germany
Posts: 480
Quote:
Originally Posted by StingRay View Post
If you want your code to work properly on 68010+ you either have to use 2 different routines (one using SMC, one without) or at least disable the instruction cache as otherwise the SMC will happily funk up on anything else than plain 68000.
Rather 020+ but disabling caches would be nuts anyway. Just flush the caches after self-modifying the code. There are even OS functions readily available for this purpose.
grond is offline  
Old 10 August 2018, 12:16   #28
grond
Registered User

 
Join Date: Jun 2015
Location: Germany
Posts: 480
You can also use spheric coordinates for the objects and then do a simple scaling using LUTs.
grond is offline  
Old 10 August 2018, 12:32   #29
deimos
Registered User

 
Join Date: Jul 2018
Location: Londonish / UK
Posts: 109
Is this what you mean?

So, a matrix m and a vector v, where:

m =
| a b c |
| d e f |
| g h i |

and:

v =
| x |
| y |
| z |

results in:

v' = mv =
| ax + by + cz |
| dx + ey + fz |
| gx + hy + iz |

Assuming that the values of abs x, y, z recurr within this or even other models, then the individual products, ax, by, cz, and so on, will be worth storing for reuse. Or, calculated in bulk at the start of the process.


Quote:
Originally Posted by chb View Post
Hmm, either I fail to make myself clear or my idea is utter nonsense, both of which is totally possible . Maybe to be clear: The idea is avoiding multiplications. If your multiplications are fast, there's not much gain.

But let's take my example with the cube. To rotate it, you'd need to compute R p_i (p_i edge vectors), for every of the 8 edges. If you'd have have generic vectors (ax,ay,az),(bx,by,bz)...(hx,hy,hz) you need to calculate 9 products for every rotation: r11*ax,r12*ay,r13*az, r21*ax,r22*ay,r23*az etc. , so a total of 72 products (I got a factor 3 wrong before ). If now all coordinates have the same absolute value (like for the cube), you can re-use/precompute those products for the other object vectors, as e.g. r11*ax = r11*bx, r12*ay = -r12*by etc - only nine multiplications have to be carried out, and you have to care for the signs. Fastest way would be to substitute adds by subs.

I'd guess to implement it efficiently, you'd write (rather: generate) an own matrix rotation routine for every object. Does not work for objects that change dynamically, of course (apart from modifications like scaling etc).
deimos is offline  
Old 10 August 2018, 13:27   #30
chb
Registered User

 
Join Date: Dec 2014
Location: germany
Posts: 99
Quote:
Originally Posted by deimos View Post
Is this what you mean?

[...]

Assuming that the values of abs x, y, z recurr within this or even other models, then the individual products, ax, by, cz, and so on, will be worth storing for reuse. Or, calculated in bulk at the start of the process.
Yep, that's what I wanted to say. As symmetries around the origin result in same abs values for the vector elements, they are therefore beneficial for this trick. If you have other models with same coordinate values* that always share the same rotation matrix, you can exploit that too, of course.

I'd try this with your sphere and see what you can gain.

*some more advanced optimization could take into account that partial sums appear more than once (cube example), or that for a vector v' = (2x,y,z) or the like you can reuse the products from v = (x,y,z) with some simple additional adds.
chb is offline  
Old 10 August 2018, 13:56   #31
deimos
Registered User

 
Join Date: Jul 2018
Location: Londonish / UK
Posts: 109
I'm working on it. First I have to rewrite the code that generates the model. I'll update once I have it working.

Quote:
Originally Posted by chb View Post
Yep, that's what I wanted to say. As symmetries around the origin result in same abs values for the vector elements, they are therefore beneficial for this trick. If you have other models with same coordinate values* that always share the same rotation matrix, you can exploit that too, of course.

I'd try this with your sphere and see what you can gain.

*some more advanced optimization could take into account that partial sums appear more than once (cube example), or that for a vector v' = (2x,y,z) or the like you can reuse the products from v = (x,y,z) with some simple additional adds.
deimos is offline  
Old 11 August 2018, 01:10   #32
StingRay
move.l #$c0ff33,throat

StingRay's Avatar
 
Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 6,146
Quote:
Originally Posted by grond View Post
Rather 020+ but disabling caches would be nuts anyway. Just flush the caches after self-modifying the code. There are even OS functions readily available for this purpose.

010 has large enough cache to cause problems with SMC too! And it is much more useful to disable the instruction cache once instead of flushing the cache each time SMC has been used on such CPU's!
StingRay is offline  
Old 11 August 2018, 14:19   #33
deimos
Registered User

 
Join Date: Jul 2018
Location: Londonish / UK
Posts: 109
With just a simple implementation of this I'm seeing a 10% reduction of the time I spend calculating. I think the reason I'm not seeing more is the overhead of the extra steps needed to get to the coordinate data. I think I can get a lot more, after all the amount of coordinate data, and therefore the multiplications, has dramatically decreased, but it may take me a while to get there.

Quote:
Originally Posted by chb View Post
Yep, that's what I wanted to say. As symmetries around the origin result in same abs values for the vector elements, they are therefore beneficial for this trick. If you have other models with same coordinate values* that always share the same rotation matrix, you can exploit that too, of course.

I'd try this with your sphere and see what you can gain.

*some more advanced optimization could take into account that partial sums appear more than once (cube example), or that for a vector v' = (2x,y,z) or the like you can reuse the products from v = (x,y,z) with some simple additional adds.
deimos is offline  
Old 11 August 2018, 18:32   #34
chb
Registered User

 
Join Date: Dec 2014
Location: germany
Posts: 99
Quote:
Originally Posted by deimos View Post
With just a simple implementation of this I'm seeing a 10% reduction of the time I spend calculating. I think the reason I'm not seeing more is the overhead of the extra steps needed to get to the coordinate data. I think I can get a lot more, after all the amount of coordinate data, and therefore the multiplications, has dramatically decreased, but it may take me a while to get there.
Ah, interesting. As said, the fastest (and most complex and inflexible) way probably would involve writing individual code for every object shape, if you do not plan to dynamically modify them. Which could mean that you do not store original object coordinates in an array/struct, but just as a piece of code that takes the rotation matrix as parameter and returns the rotated coordinates. Makes only sense of course if you are always rotating your object first and do not plan to use the original coordinates for something else.

Anyway, I do not know if those techniques fit in your engine concept, they put restrictions on object shapes to be effective, they can be quite time consuming if you hand-optimize the code for every shape, and writing a decent code generator may be hard... anyway, I guess you're doing this for fun, and optimizing the hell out of something is definitely good fun!

BTW, a cheap optimization for back face culling, again exploiting object symmetry (probably you're doing it already): If for your polygon another polygon exist with normal vector pointing just to the opposite direction (like opposite sides of a cube), you need to test only one of them and know the result for the other one. Same of course if the normal vectors are equal (same plane and direction).
chb is offline  
Old 11 August 2018, 20:04   #35
deimos
Registered User

 
Join Date: Jul 2018
Location: Londonish / UK
Posts: 109
Here are the current numbers following the symmetry-based optimisations on the vertex data suggested by chb:

My code takes 52s to draw 100 frames (320x256, 6 bitmap / EHB mode). There are 128 polygons represented by 12 x-coordinates, 5 y-coordinates and 12 z-coordinates (114 vertices pointing into that coordinate data).

100 / 52 gives 1.92 fps.

The CPU time is spent as follow:
  • 35% is spent rotating and moving the ball and doing the 3D calculations.
  • 27% is doing the edge tracking part of the scanline fill algorithm.
  • 38% is spent drawing horizontal lines between active edges to fill the polygons.
The code is mostly C, but I do have some assembly in there for a couple of hotspots and for the fixed point arithmetic.

I think my next step is to rewrite more of it in assembly, but I don't believe I'll get more than a 25% improvement across the board, firstly because it won't be changing the underlying algorithms, and secondly because the code generated by the compiler isn't bad to start with.

Last edited by deimos; 11 August 2018 at 20:17.
deimos is offline  
Old 11 August 2018, 23:02   #36
grond
Registered User

 
Join Date: Jun 2015
Location: Germany
Posts: 480
Quote:
Originally Posted by StingRay View Post
010 has large enough cache to cause problems with SMC too! And it is much more useful to disable the instruction cache once instead of flushing the cache each time SMC has been used on such CPU's!
No and no. The 010 has a loop cache of 6 bytes which can only hold a short instruction and a decrement-and-branch instruction. There is no practical way to mess this up with SMC. And disabling the caches because you are using SMC to speed up (!) your code is bonkers. You'd lose most of your CPUs speed to save a few cycles in the point projection code. Flushing the caches once per frame doesn't have a large effect. Have a look at the system functions for this. They allow flushing just the modified range of memory. Oh, and by the way, I actually wrote the SMC point projection code I described above including the cache flushing once per frame and can tell you that it is pretty fast.
grond is offline  
Old 11 August 2018, 23:59   #37
StingRay
move.l #$c0ff33,throat

StingRay's Avatar
 
Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 6,146
Quote:
Originally Posted by grond View Post
No and no. The 010 has a loop cache of 6 bytes which can only hold a short instruction and a decrement-and-branch instruction. There is no practical way to mess this up with SMC. And disabling the caches because you are using SMC to speed up (!) your code is bonkers. You'd lose most of your CPUs speed to save a few cycles in the point projection code. Flushing the caches once per frame doesn't have a large effect. Have a look at the system functions for this. They allow flushing just the modified range of memory. Oh, and by the way, I actually wrote the SMC point projection code I described above including the cache flushing once per frame and can tell you that it is pretty fast.

It only really makes sense to use SMC on 68000 to speed up the code and I was referring to processors >68000! And it sure isn't "bonkers" to write reliable code on 680x0, far too often I had to fix broken demo code which didn't work correctly because of SMC! Besides, it is pretty easy to write fast code without using any SMC at all. And if it really can't be avoided to use SMC there should at least be 2 different routines, one for 68000, one for >68000. And using OS routines when the system is killed isn't exactly great either. Not to forget that the OS funtions to flush the cache require OS2.0+.
StingRay is offline  
Old 12 August 2018, 09:06   #38
grond
Registered User

 
Join Date: Jun 2015
Location: Germany
Posts: 480
I wrote and tested the code on an 030 and it worked fast and reliably. I never kill the OS in my code, why should I? I believe those few percents lost to the OS still running can be gained more easily in other parts of the code.

If you get full-frame rate on an 68000, yes, then you can disable caches for all faster processors because you will also get full speed. But if you want to make your code as fast as possible on any processor, flushing the caches once per frame is much faster than disabling them altogether.
grond is offline  
Old 12 August 2018, 21:59   #39
Photon
Moderator
Photon's Avatar
 
Join Date: Nov 2004
Location: Hult / Sweden
Posts: 4,589
Quote:
Originally Posted by deimos View Post
Here are the current numbers following the symmetry-based optimisations on the vertex data suggested by chb:

My code takes 52s to draw 100 frames (320x256, 6 bitmap / EHB mode). There are 128 polygons represented by 12 x-coordinates, 5 y-coordinates and 12 z-coordinates (114 vertices pointing into that coordinate data).

100 / 52 gives 1.92 fps.

The CPU time is spent as follow:
  • 35% is spent rotating and moving the ball and doing the 3D calculations.
  • 27% is doing the edge tracking part of the scanline fill algorithm.
  • 38% is spent drawing horizontal lines between active edges to fill the polygons.
The code is mostly C, but I do have some assembly in there for a couple of hotspots and for the fixed point arithmetic.

I think my next step is to rewrite more of it in assembly, but I don't believe I'll get more than a 25% improvement across the board, firstly because it won't be changing the underlying algorithms, and secondly because the code generated by the compiler isn't bad to start with.
This is what I was looking for from the question in OP. Generally when you write a new routine, optimization is a process of iteration and verifying that the code change in fact reduced processing time. (Not for you, you already know this and understand the problem.)

Instinctively,

1) the calculation part is way too heavy. Rendering is heavy enough; calculation should be 10% or so of the rendering time.
2) You're using a scanline algorithm, which is inherently for more powerful computers with chunky buffers (or something similar). You will never get this method fast without simply upgrading the hardware. Basically all poly renderers in OCS games simply render the polys one by one back to front.

Side note that SG2 isn't really "fast"; most games shrunk the screen and then polygons become quite small, which means the CPU can beat the Blitter; not having to use the Amiga hardware made the ST port a no-brainer. But it was written in Assembler, which is roughly 10x faster than C, so this is a big part of the difference in speed that you see.

There's also special-casing. Since none of your polygons partially overlap, you don't have to sort them, and you can fill them all at once with a single blit. This is the fastest method, but if you want your poly renderer to be general-purpose, and since all your polygons are very small indeed, then you could write a CPU poly fill (which would start to chug only when polygons become big). If you want this, and write such a routine, then I would say anything faster than 15 FPS for this object means you have a well-optimized routine.

Last edited by Photon; 12 August 2018 at 22:10.
Photon is offline  
Old 13 August 2018, 12:07   #40
deimos
Registered User

 
Join Date: Jul 2018
Location: Londonish / UK
Posts: 109
Totally agree with point 1, that’s why I was looking for any cheats to the system that SG2 may have used, apart from the obvious small draw distance.

With regards to point 2, I’m currently using a scanline polygon fill to simply draw polygons one by one, I’m not doing any hidden surface removal or considering anything more than the current polygon - i just have two edge trackers doing an integer bresenhan style run down the sides of the polygons and drawing a line between the two sides. I’m not even doing any polygon sorting at the moment as the back face culling is enough for the single object I’m drawing. The actual line drawing between the edges for the polygon fill is a reasonably smart algorithm that works word by word with lookup tables for what to and / or to the extremities, but my polygons are so small that it’s only the extremities that count.

I will start to rewrite the code in assembly, but I have looked at what the compiler produces and I really don’t expect a 10x improvement. But maybe I’ll be surprised. Biggest obsticle is the 25 years since I last did assembly coding.

Quote:
Originally Posted by Photon View Post
This is what I was looking for from the question in OP. Generally when you write a new routine, optimization is a process of iteration and verifying that the code change in fact reduced processing time. (Not for you, you already know this and understand the problem.)

Instinctively,

1) the calculation part is way too heavy. Rendering is heavy enough; calculation should be 10% or so of the rendering time.
2) You're using a scanline algorithm, which is inherently for more powerful computers with chunky buffers (or something similar). You will never get this method fast without simply upgrading the hardware. Basically all poly renderers in OCS games simply render the polys one by one back to front.

Side note that SG2 isn't really "fast"; most games shrunk the screen and then polygons become quite small, which means the CPU can beat the Blitter; not having to use the Amiga hardware made the ST port a no-brainer. But it was written in Assembler, which is roughly 10x faster than C, so this is a big part of the difference in speed that you see.

There's also special-casing. Since none of your polygons partially overlap, you don't have to sort them, and you can fill them all at once with a single blit. This is the fastest method, but if you want your poly renderer to be general-purpose, and since all your polygons are very small indeed, then you could write a CPU poly fill (which would start to chug only when polygons become big). If you want this, and write such a routine, then I would say anything faster than 15 FPS for this object means you have a well-optimized routine.

Last edited by deimos; 13 August 2018 at 13:57.
deimos is offline  
 


Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)
 
Thread Tools

Similar Threads
Thread Thread Starter Forum Replies Last Post
Games that are Full Frame Rate or Slower - Limitations or Choice? Foebane Retrogaming General Discussion 35 08 April 2018 14:22
F1 grand prix frame rate universale support.Games 18 13 July 2015 22:45
The First Person Shooter frame rate tolerance poll... DDNI Retrogaming General Discussion 41 30 June 2011 04:32
Vsync Fullscreen and Double Buffer, incorrect frame rate? rsn8887 support.WinUAE 1 07 April 2011 21:43
Propper speed request when recording with "Disable frame rate" turned on. Ironclaw request.UAE Wishlist 9 02 August 2006 08:21

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

Forum Jump


All times are GMT +2. The time now is 03:14.


Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2018, vBulletin Solutions Inc.
Page generated in 0.09231 seconds with 14 queries