10 August 2018, 09:26 | #21 | |
It's coming back!
Join Date: Jul 2018
Location: comp.sys.amiga
Posts: 762
|
This is not something I would have thought of. I'll keep it for the last round of optimisations, after I've done all the algorithmic improvements and standard tweaks.
Quote:
|
|
10 August 2018, 10:09 | #22 |
Registered User
Join Date: Jun 2016
Location: UK
Posts: 428
|
How are you doing backface culling? Not calculating normals I guess, just looking at vertex order.
|
10 August 2018, 10:16 | #23 | |
move.l #$c0ff33,throat
Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 6,863
|
Quote:
If you want your code to work properly on 68010+, you either have to use two different routines (one using SMC, one without) or at least disable the instruction cache, as otherwise the SMC will happily funk up on anything other than a plain 68000. |
|
10 August 2018, 10:28 | #24 |
It's coming back!
Join Date: Jul 2018
Location: comp.sys.amiga
Posts: 762
|
I'm not calculating the whole normal:
    if ((x1 - x0) * (y2 - y0) < (y1 - y0) * (x2 - x0)) continue;
where the face is defined by the three points (x0, y0), (x1, y1), (x2, y2).
|
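For reference, the comparison above is the sign of the z component of the cross product of two edge vectors of the projected face. A minimal C sketch, assuming integer screen coordinates; the `Point2D` type and the `is_culled` name are made up for illustration, and which winding counts as back-facing depends on the vertex order and axis orientation the engine uses:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical type for a projected (screen-space) vertex. */
typedef struct { int32_t x, y; } Point2D;

/* Returns true when the face should be skipped, using the same comparison
 * as the test quoted above. A degenerate (collinear) face is not culled. */
bool is_culled(Point2D p0, Point2D p1, Point2D p2)
{
    int32_t lhs = (p1.x - p0.x) * (p2.y - p0.y);
    int32_t rhs = (p1.y - p0.y) * (p2.x - p0.x);
    return lhs < rhs;
}
```

Swapping any two vertices of a face flips the result, which is why a consistent vertex order across the model is enough and no full normal is needed.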
10 August 2018, 10:33 | #25 | |
Registered User
Join Date: Dec 2014
Location: germany
Posts: 439
|
Quote:
But let's take my example with the cube. To rotate it, you'd need to compute R p_i (p_i being the vertex vectors) for each of the 8 vertices. If you had generic vectors (ax,ay,az),(bx,by,bz)...(hx,hy,hz), you'd need to calculate 9 products for every rotation: r11*ax, r12*ay, r13*az, r21*ax, r22*ay, r23*az etc., so a total of 72 products (I got a factor of 3 wrong before).
If all coordinates have the same absolute value (as for the cube), you can reuse/precompute those products for the other object vectors, e.g. r11*ax = r11*bx, r12*ay = -r12*by etc. Only nine multiplications have to be carried out, and you have to take care of the signs. The fastest way would be to replace adds with subs where the sign is negative.
I'd guess that to implement it efficiently, you'd write (or rather generate) a dedicated matrix rotation routine for every object. This does not work for objects that change dynamically, of course (apart from modifications like scaling etc).
Last edited by chb; 10 August 2018 at 10:39. |
|
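The nine-multiplication idea for the cube could be sketched in C roughly like this. It assumes a cube centred on the origin with corners at (±s, ±s, ±s) and a row-major 3x3 matrix; `Vec3` and `rotate_cube` are hypothetical names, and fixed-point scaling is left out to keep the example short:

```c
#include <stdint.h>

typedef struct { int32_t x, y, z; } Vec3;

/* Rotate the 8 corners (+-s, +-s, +-s) of an origin-centred cube.
 * r is a row-major 3x3 matrix: r[row * 3 + col]. */
void rotate_cube(const int32_t r[9], int32_t s, Vec3 out[8])
{
    int32_t p[9];
    for (int k = 0; k < 9; k++)
        p[k] = r[k] * s;              /* the only nine multiplications */

    for (int i = 0; i < 8; i++) {
        /* bit 0 / 1 / 2 of i set means the x / y / z coordinate is -s */
        int32_t c[3];
        for (int row = 0; row < 3; row++) {
            int32_t sum = (i & 1) ? -p[row * 3 + 0] : p[row * 3 + 0];
            sum += (i & 2) ? -p[row * 3 + 1] : p[row * 3 + 1];
            sum += (i & 4) ? -p[row * 3 + 2] : p[row * 3 + 2];
            c[row] = sum;
        }
        out[i].x = c[0];
        out[i].y = c[1];
        out[i].z = c[2];
    }
}
```

A generated per-object routine would unroll the inner loop into straight adds and subs instead of testing sign bits at run time.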
10 August 2018, 10:51 | #26 | |
It's coming back!
Join Date: Jul 2018
Location: comp.sys.amiga
Posts: 762
|
I'm going to try and work through an example to see if I can understand.
Quote:
|
|
10 August 2018, 10:59 | #27 |
Registered User
Join Date: Jun 2015
Location: Germany
Posts: 1,924
|
Rather 020+ but disabling caches would be nuts anyway. Just flush the caches after self-modifying the code. There are even OS functions readily available for this purpose.
|
10 August 2018, 11:16 | #28 |
Registered User
Join Date: Jun 2015
Location: Germany
Posts: 1,924
|
You can also use spherical coordinates for the objects and then do a simple scaling using LUTs.
|
10 August 2018, 11:32 | #29 | |
It's coming back!
Join Date: Jul 2018
Location: comp.sys.amiga
Posts: 762
|
Is this what you mean?
So, a matrix m and a vector v, where:
    m = | a b c |
        | d e f |
        | g h i |
and:
    v = | x |
        | y |
        | z |
results in:
    v' = mv = | ax + by + cz |
              | dx + ey + fz |
              | gx + hy + iz |
Assuming that the absolute values of x, y and z recur within this or even other models, the individual products ax, by, cz and so on will be worth storing for reuse, or calculating in bulk at the start of the process.
Quote:
|
|
10 August 2018, 12:27 | #30 | |
Registered User
Join Date: Dec 2014
Location: germany
Posts: 439
|
Quote:
I'd try this with your sphere and see what you can gain.
* Some more advanced optimization could take into account that partial sums appear more than once (cube example), or that for a vector v' = (2x,y,z) or the like you can reuse the products from v = (x,y,z) with a few simple additional adds. |
|
10 August 2018, 12:56 | #31 | |
It's coming back!
Join Date: Jul 2018
Location: comp.sys.amiga
Posts: 762
|
I'm working on it. First I have to rewrite the code that generates the model. I'll update once I have it working.
Quote:
|
|
11 August 2018, 00:10 | #32 | |
move.l #$c0ff33,throat
Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 6,863
|
Quote:
The 010 has a large enough cache to cause problems with SMC too! And it is much more useful to disable the instruction cache once instead of flushing the cache each time SMC has been used on such CPUs! |
|
11 August 2018, 13:19 | #33 | |
It's coming back!
Join Date: Jul 2018
Location: comp.sys.amiga
Posts: 762
|
With just a simple implementation of this I'm seeing a 10% reduction in the time I spend calculating. I think the reason I'm not seeing more is the overhead of the extra steps needed to get to the coordinate data. I think I can get a lot more; after all, the amount of coordinate data, and therefore the number of multiplications, has dramatically decreased, but it may take me a while to get there.
Quote:
|
|
11 August 2018, 17:32 | #34 | |
Registered User
Join Date: Dec 2014
Location: germany
Posts: 439
|
Quote:
Anyway, I do not know if those techniques fit your engine concept: they put restrictions on object shapes to be effective, they can be quite time-consuming if you hand-optimize the code for every shape, and writing a decent code generator may be hard... Still, I guess you're doing this for fun, and optimizing the hell out of something is definitely good fun!
BTW, a cheap optimization for back face culling, again exploiting object symmetry (probably you're doing it already): if for your polygon another polygon exists with a normal vector pointing in exactly the opposite direction (like opposite sides of a cube), you need to test only one of them and know the result for the other one. The same applies, of course, if the normal vectors are equal (same plane and direction). |
|
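A rough C sketch of the paired culling test. It assumes each face of a convex object stores the index of the face with the opposite normal (the `mirror` field, like all names here, is hypothetical). One caveat I am adding: with a parallel projection the partner's result follows in both directions, but under perspective both faces of a pair can be back-facing at once (the two side faces of a cube viewed head-on), so this sketch only uses the inference that always holds for a convex object, namely that a front-facing face implies a hidden partner:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { int32_t x, y; } Point2D;

typedef struct {
    int v0, v1, v2;  /* indices of three projected vertices of the face */
    int mirror;      /* index of the face with the opposite normal, or -1 */
} Face;

/* Same screen-space comparison as the culling test earlier in the thread. */
static bool is_back_facing(const Point2D *p, const Face *f)
{
    int32_t lhs = (p[f->v1].x - p[f->v0].x) * (p[f->v2].y - p[f->v0].y);
    int32_t rhs = (p[f->v1].y - p[f->v0].y) * (p[f->v2].x - p[f->v0].x);
    return lhs < rhs;
}

/* Fills state[i] with 1 (draw) or 0 (cull) and returns how many faces
 * actually had to be tested. */
int classify_faces(const Point2D *pts, const Face *faces, int nfaces,
                   int8_t *state)
{
    int tests = 0;
    for (int i = 0; i < nfaces; i++)
        state[i] = -1;                   /* -1 = not decided yet */

    for (int i = 0; i < nfaces; i++) {
        if (state[i] != -1)
            continue;                    /* already decided via its partner */
        state[i] = is_back_facing(pts, &faces[i]) ? 0 : 1;
        tests++;
        int m = faces[i].mirror;
        if (state[i] == 1 && m >= 0 && state[m] == -1)
            state[m] = 0;                /* front-facing => partner is hidden */
    }
    return tests;
}
```

Faces sharing the same plane and direction could reuse each other's result unconditionally by the same mechanism.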
11 August 2018, 19:04 | #35 |
It's coming back!
Join Date: Jul 2018
Location: comp.sys.amiga
Posts: 762
|
Here are the current numbers following the symmetry-based optimisations on the vertex data suggested by chb:
My code takes 52s to draw 100 frames (320x256, 6 bitplanes / EHB mode). There are 128 polygons represented by 12 x-coordinates, 5 y-coordinates and 12 z-coordinates (114 vertices pointing into that coordinate data). 100 / 52 gives 1.92 fps. The CPU time is spent as follows:
I think my next step is to rewrite more of it in assembly, but I don't believe I'll get more than a 25% improvement across the board, firstly because it won't be changing the underlying algorithms, and secondly because the code generated by the compiler isn't bad to start with.
Last edited by deimos; 11 August 2018 at 19:17. |
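The shared-coordinate layout described in this post (vertices pointing into small tables of unique x, y and z values) could be transformed along these lines. This is only a sketch under my own assumptions: the struct names, `transform_indexed` and the `MAX_COORDS` bound are invented, the matrix is row-major, and fixed-point handling is omitted:

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_COORDS 16  /* hypothetical bound; nx, ny, nz must not exceed it */

typedef struct { uint8_t xi, yi, zi; } IndexedVertex;  /* table indices */
typedef struct { int32_t x, y, z; } Vec3;

/* m is row-major: m[0..2] = a b c, m[3..5] = d e f, m[6..8] = g h i */
void transform_indexed(const int32_t m[9],
                       const int32_t *xs, size_t nx,
                       const int32_t *ys, size_t ny,
                       const int32_t *zs, size_t nz,
                       const IndexedVertex *verts, size_t nverts,
                       Vec3 *out)
{
    int32_t px[MAX_COORDS][3], py[MAX_COORDS][3], pz[MAX_COORDS][3];

    /* three products per unique coordinate value, one per matrix row */
    for (int row = 0; row < 3; row++) {
        for (size_t i = 0; i < nx; i++) px[i][row] = m[row * 3 + 0] * xs[i];
        for (size_t i = 0; i < ny; i++) py[i][row] = m[row * 3 + 1] * ys[i];
        for (size_t i = 0; i < nz; i++) pz[i][row] = m[row * 3 + 2] * zs[i];
    }

    /* each vertex is now three table lookups and two adds per component */
    for (size_t v = 0; v < nverts; v++) {
        const int32_t *a = px[verts[v].xi];
        const int32_t *b = py[verts[v].yi];
        const int32_t *c = pz[verts[v].zi];
        out[v].x = a[0] + b[0] + c[0];
        out[v].y = a[1] + b[1] + c[1];
        out[v].z = a[2] + b[2] + c[2];
    }
}
```

With the figures above, that is 3 * (12 + 5 + 12) = 87 multiplications per frame instead of 9 * 114 = 1026 for a plain per-vertex matrix multiply; the price is the extra indirection mentioned earlier in the thread.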
11 August 2018, 22:02 | #36 |
Registered User
Join Date: Jun 2015
Location: Germany
Posts: 1,924
|
No and no. The 010 has a loop cache of 6 bytes, which can only hold a short instruction and a decrement-and-branch instruction. There is no practical way to mess this up with SMC. And disabling the caches because you are using SMC to speed up (!) your code is bonkers. You'd lose most of your CPU's speed to save a few cycles in the point projection code.
Flushing the caches once per frame doesn't have a large effect. Have a look at the system functions for this. They allow flushing just the modified range of memory.
Oh, and by the way, I actually wrote the SMC point projection code I described above, including the cache flushing once per frame, and can tell you that it is pretty fast.
|
11 August 2018, 22:59 | #37 | |
move.l #$c0ff33,throat
Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 6,863
|
Quote:
It only really makes sense to use SMC on the 68000 to speed up the code, and I was referring to processors >68000! And it sure isn't "bonkers" to write reliable code on 680x0; far too often I have had to fix broken demo code which didn't work correctly because of SMC! Besides, it is pretty easy to write fast code without using any SMC at all. And if SMC really can't be avoided, there should at least be two different routines, one for 68000, one for >68000. And using OS routines when the system is killed isn't exactly great either. Not to forget that the OS functions to flush the cache require OS2.0+. |
|
12 August 2018, 08:06 | #38 |
Registered User
Join Date: Jun 2015
Location: Germany
Posts: 1,924
|
I wrote and tested the code on an 030 and it worked fast and reliably. I never kill the OS in my code; why should I? I believe the few percent lost to the OS still running can be gained more easily in other parts of the code.
If you get full frame rate on a 68000, then yes, you can disable caches for all faster processors because you will also get full speed. But if you want to make your code as fast as possible on any processor, flushing the caches once per frame is much faster than disabling them altogether. |
12 August 2018, 20:59 | #39 | |
Moderator
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,655
|
Quote:
Instinctively, 1) the calculation part is way too heavy. Rendering is heavy enough; calculation should be 10% or so of the rendering time. 2) You're using a scanline algorithm, which is inherently for more powerful computers with chunky buffers (or something similar). You will never get this method fast without simply upgrading the hardware. Basically all poly renderers in OCS games simply render the polys one by one, back to front.
Side note: SG2 isn't really "fast"; most games shrank the screen, and then polygons become quite small, which means the CPU can beat the Blitter; not having to use the Amiga hardware made the ST port a no-brainer. But it was written in Assembler, which is roughly 10x faster than C, so this is a big part of the difference in speed that you see.
There's also special-casing. Since none of your polygons partially overlap, you don't have to sort them, and you can fill them all at once with a single blit. This is the fastest method, but if you want your poly renderer to be general-purpose, and since all your polygons are very small indeed, you could write a CPU poly fill (which would start to chug only when polygons become big). If you want this, and write such a routine, then I would say anything faster than 15 FPS for this object means you have a well-optimized routine.
Last edited by Photon; 12 August 2018 at 21:10. |
|
13 August 2018, 11:07 | #40 | |
It's coming back!
Join Date: Jul 2018
Location: comp.sys.amiga
Posts: 762
|
Totally agree with point 1, that’s why I was looking for any cheats to the system that SG2 may have used, apart from the obvious small draw distance.
With regards to point 2, I’m currently using a scanline polygon fill to simply draw polygons one by one. I’m not doing any hidden surface removal or considering anything more than the current polygon; I just have two edge trackers doing an integer Bresenham-style run down the sides of the polygon and drawing a line between the two sides. I’m not even doing any polygon sorting at the moment, as the back face culling is enough for the single object I’m drawing. The actual line drawing between the edges for the polygon fill is a reasonably smart algorithm that works word by word, with lookup tables for what to AND / OR into the extremities, but my polygons are so small that it’s only the extremities that count.
I will start to rewrite the code in assembly, but I have looked at what the compiler produces and I really don’t expect a 10x improvement. But maybe I’ll be surprised. The biggest obstacle is the 25 years since I last did assembly coding.
Quote:
Last edited by deimos; 13 August 2018 at 12:57. |
|
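The word-by-word span fill with lookup tables for the extremities might look something like the following in C. This is a guess at the general technique, not the code from the post: it handles one row of a single bitplane, sets bits only (clearing would AND with the inverted masks), assumes bit 15 of each word is the leftmost pixel, and the names `left_mask`, `right_mask` and `fill_span` are invented:

```c
#include <stdint.h>

/* left_mask[n]: pixels n..15 of a word set (bit 15 is the leftmost pixel) */
static const uint16_t left_mask[16] = {
    0xFFFF, 0x7FFF, 0x3FFF, 0x1FFF, 0x0FFF, 0x07FF, 0x03FF, 0x01FF,
    0x00FF, 0x007F, 0x003F, 0x001F, 0x000F, 0x0007, 0x0003, 0x0001
};

/* right_mask[n]: pixels 0..n of a word set */
static const uint16_t right_mask[16] = {
    0x8000, 0xC000, 0xE000, 0xF000, 0xF800, 0xFC00, 0xFE00, 0xFF00,
    0xFF80, 0xFFC0, 0xFFE0, 0xFFF0, 0xFFF8, 0xFFFC, 0xFFFE, 0xFFFF
};

/* Set pixels x0..x1 (inclusive, 0 <= x0 <= x1) in one bitplane row. */
void fill_span(uint16_t *row, int x0, int x1)
{
    int w0 = x0 >> 4;                     /* word holding the left end */
    int w1 = x1 >> 4;                     /* word holding the right end */
    uint16_t lm = left_mask[x0 & 15];
    uint16_t rm = right_mask[x1 & 15];

    if (w0 == w1) {                       /* span fits inside one word */
        row[w0] |= lm & rm;
        return;
    }
    row[w0] |= lm;                        /* left extremity */
    for (int w = w0 + 1; w < w1; w++)
        row[w] = 0xFFFF;                  /* whole words in the middle */
    row[w1] |= rm;                        /* right extremity */
}
```

For very small polygons most spans take the single-word branch, which matches the observation that only the extremities count.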