English Amiga Board


Old 14 January 2021, 15:59   #1
Ernst Blofeld
<optimized out>
 
Ernst Blofeld's Avatar
 
Join Date: Sep 2020
Location: <optimized out>
Posts: 321
Informed optimisation

I've started trying to optimise my 3D engine a bit.

Step 1: take some measurements:

[Attachment: profiling-1.png, 183.6 KB]

Step 2: ???

Step 3: Profit!

Anyway, what do I do now? It's not the calculations that take up the majority of the time, and it's not painting the polygons either, it's "everything else".

I feel stuck because I don't know how to identify that "everything else" or what its problems could be.

I know not to expect this code to be super fast, because it is doing a lot: two aircraft, a tank, a control tower and runway, plus ground, sky, sun (out of sight), and shadows.

[Attachment: profiling-2.png, 4.4 KB]

And it's nearly all in C. I'm not going to change that yet, but I wonder if it's hiding some inefficiency, such as not being able to use registers effectively, that a restructure or reordering of my code would magically fix.

My models are overly complex and there are no levels of detail, but I figure that only helps to expose the performance problems.

Anyway, that's all. It's good to share.

Last edited by Ernst Blofeld; 14 January 2021 at 16:30.
Ernst Blofeld is offline  
Old 14 January 2021, 16:59   #2
Jobbo
Registered User
 
Jobbo's Avatar
 
Join Date: Jun 2020
Location: Druidia
Posts: 386
In my experience with GCC you're not going to see revolutionary improvements by turning your existing code into assembler. So, you are wise to avoid that until you've taken the C code as far as you can.

Without code to look at it's hard to say how it could be improved. Using 32-bit ints easily adds up, because it's double the amount of memory you're shuffling around and 32-bit instructions are slower. So it pays off to constrain your data to 16 bits everywhere. Shifts are terribly slow, so it helps to keep them to a minimum, or to shift by 16 so the compiler will use a swap instruction.
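For illustration, a minimal sketch of what keeping data in 16 bits buys (the typedef and struct names here are invented, not from the thread):

```c
#include <stdint.h>

typedef int16_t WORD;   /* as in the Amiga's exec/types.h */

/* 16-bit vertex: 6 bytes to move around instead of 12 */
typedef struct { WORD x, y, z; } Point3D16;
typedef struct { int32_t x, y, z; } Point3D32;

/* 16x16 -> 32 multiply: both operands stay 16-bit, so on 68000 a
 * compiler can emit a single MULS rather than a 32-bit multiply. */
static int32_t mul16(WORD a, WORD b) { return (int32_t)a * (int32_t)b; }
```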

Maybe you can share one part of the code so you can get feedback that might lead to ideas you can propagate through the rest of the code.

The only way to make progress is to chip away at each part of the code, making it slightly faster, until all that work starts to add up.

But at the moment you probably don't know what to do with any one piece of code, so pick one to share and maybe people will have good suggestions.
Jobbo is offline  
Old 14 January 2021, 17:10   #3
Ernst Blofeld
<optimized out>
 
Ernst Blofeld's Avatar
 
Join Date: Sep 2020
Location: <optimized out>
Posts: 321
Quote:
Originally Posted by Jobbo View Post
In my experience with GCC you're not going to see revolutionary improvements by turning your existing code into assembler. So, you are wise to avoid that until you've taken the C code as far as you can.

Without code to look at it's hard to say how it could be improved. Using 32-bit ints easily adds up, because it's double the amount of memory you're shuffling around and 32-bit instructions are slower. So it pays off to constrain your data to 16 bits everywhere. Shifts are terribly slow, so it helps to keep them to a minimum, or to shift by 16 so the compiler will use a swap instruction.

Maybe you can share one part of the code so you can get feedback that might lead to ideas you can propagate through the rest of the code.

The only way to make progress is to chip away at each part of the code, making it slightly faster, until all that work starts to add up.

But at the moment you probably don't know what to do with any one piece of code, so pick one to share and maybe people will have good suggestions.
I'm happy to share my code, but I think sharing it all will overwhelm and the help will dry up.

Can you suggest a function to start with? Something that won't turn the crowds away?
Ernst Blofeld is offline  
Old 14 January 2021, 17:13   #4
Jobbo
Registered User
 
Jobbo's Avatar
 
Join Date: Jun 2020
Location: Druidia
Posts: 386
Another generally slow approach that is common in C code is using pointers in data structures. Again, slow because they are 32 bits.

It's more efficient to use indices that will be 16bits. Ideally you want to avoid indirection as much as possible simply because it adds up to more data to read.

Getting into the weeds a little: if you do need indirection, it's also pretty slow to use indices, because the compiler will need to scale them up to whatever you are indexing. On the 68000 the addressing modes only allow for byte indexing. So, where possible, you want to store indices that are pre-multiplied by the size of what they are indexing. The compiler is quite good at dealing with code like this:

Code:
int x = table[preMulIndex / sizeof(int)];
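Expanding that one-liner into a self-contained sketch (the table contents and names are made up for illustration): the index is stored already scaled by the element size, so the division above cancels the scaling the compiler would otherwise have to apply, leaving a raw byte offset for the addressing mode.

```c
#include <stdint.h>

/* Hypothetical table; indices into it are stored pre-multiplied
 * by sizeof(int32_t) instead of as plain element numbers. */
static const int32_t table[4] = {10, 20, 30, 40};

static int32_t lookup(uint16_t preMulIndex) {
    /* the divide by sizeof cancels the implicit scale-up,
     * so the compiler can use the offset as-is */
    return table[preMulIndex / sizeof(int32_t)];
}
```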
Jobbo is offline  
Old 14 January 2021, 17:15   #5
Jobbo
Registered User
 
Jobbo's Avatar
 
Join Date: Jun 2020
Location: Druidia
Posts: 386
Quote:
Originally Posted by Ernst Blofeld View Post
I'm happy to share my code, but I think sharing it all will overwhelm and the help will dry up.

Can you suggest a function to start with? Something that won't turn the crowds away?

Maybe the 3D math stuff, since lots of people will already know how that works in theory.
Jobbo is offline  
Old 14 January 2021, 17:20   #6
Ernst Blofeld
<optimized out>
 
Ernst Blofeld's Avatar
 
Join Date: Sep 2020
Location: <optimized out>
Posts: 321
Quote:
Originally Posted by Jobbo View Post
Maybe the 3D math stuff, since lots of people will already know how that works in theory.
Well, sure, but the calculations are 27% whereas the drawing is 73%.

The biggest chunk of the calculations is _MobileEntity3D_calculate, at 21% of the total.

I'll see what I can do to package that up into something readable.
Ernst Blofeld is offline  
Old 14 January 2021, 17:27   #7
Ernst Blofeld
<optimized out>
 
Ernst Blofeld's Avatar
 
Join Date: Sep 2020
Location: <optimized out>
Posts: 321
Quote:
Originally Posted by Ernst Blofeld View Post
Well, sure, but the calculations are 27% whereas the drawing is 73%.

The biggest chunk of the calculations is _MobileEntity3D_calculate, at 21% of the total.

I'll see what I can do to package that up into something readable.
Code:
void _MobileEntity3D_calculate(Entity * this_Entity) {
    BEGIN_FUNCTION

    Entity3D * this_Entity3D = (Entity3D *) this_Entity;
    MobileEntity3D * this_MobileEntity3D = (MobileEntity3D *) this_Entity;
    
    extern Camera * currentCamera;
    extern const WORD viewingDistance;

    Transformation3D transformation;
    CombineTransformations(&transformation, &currentCamera->transformation3D, &this_Entity->transformation3D);

    Transformation3D shadowTransformation;
    CombineTransformations(&shadowTransformation, &currentCamera->transformation3D, &this_MobileEntity3D->shadowTransformation3D);

    const Model3D * model3D = this_Entity3D->model3D;

    Point3D * vertex = model3D->vertices;
    Calculation3D * calculation = this_Entity->calculations3D;
    Calculation3D * shadowCalculation = this_MobileEntity3D->shadowCalculations3D;
    for (UWORD i = 0; i < model3D->numVertices; i++, vertex++, calculation++, shadowCalculation++) {
        LongPoint3D * transformed = &calculation->transformed;
        LongPoint3D * projected = &calculation->projected;
        ApplyTransformation(transformed, &transformation, vertex);
        if (transformed->x >= viewingDistance)
            *projected = (LongPoint3D) {
                transformed->x,
                transformed->y * viewingDistance / transformed->x,
                transformed->z * viewingDistance / transformed->x
            };
        else
            *projected = *transformed;
            
        LongPoint3D * transformedShadow = &shadowCalculation->transformed;
        LongPoint3D * projectedShadow = &shadowCalculation->projected;
        ApplyTransformation(transformedShadow, &shadowTransformation, vertex);
        if (transformedShadow->x >= viewingDistance) {
            *projectedShadow = (LongPoint3D) {
                transformedShadow->x,
                transformedShadow->y * viewingDistance / transformedShadow->x,
                transformedShadow->z * viewingDistance / transformedShadow->x
            };
        } else
            *projectedShadow = *transformedShadow;
    }

    END_FUNCTION
}
Possibly, this has shown a problem.

Possibly, this may be calculating shadows for one object that doesn't need them.

That may be 3% there.
Ernst Blofeld is offline  
Old 14 January 2021, 17:29   #8
roondar
Registered User
 
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,408
The first thing that I find interesting is that the shadow layer takes a considerable amount of time, nearly as much time as the rest (24 vs 32%). Perhaps you can 'cheat' the shadows somehow by integrating the preparations into the main render code?

The second thing I find noteworthy is that a lot of time is spent in various render (but not painting) routines. It seems to me that these are the ones where optimization might be most useful.

In general, there's a bunch of different things that can be optimized in C/C++ that aren't always clear. In particular, compilers sometimes push local arrays onto the stack instead of using a block of allocated memory and simple pointer references. The type of variables you create can also have a significant impact on performance, volatile vs non-volatile etc can be quite the difference. And indeed, on 68000 it pays to try and keep everything that can fit in 16 bits in 16 bit fields.
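To make the local-array point concrete, a hypothetical sketch (names invented) of trading a per-call stack array for a one-time static buffer:

```c
#include <stdint.h>

/* Static scratch buffer: allocated once at a fixed address, so the
 * compiler addresses it directly rather than reserving and addressing
 * a local array relative to the stack pointer on every call. */
static int16_t scratch[256];

int32_t sumSquares(int16_t n) {
    int32_t acc = 0;
    for (int16_t i = 0; i < n; i++) {
        scratch[i] = (int16_t)(i * i);   /* fits in 16 bits for n <= 181 */
        acc += scratch[i];
    }
    return acc;
}
```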
roondar is offline  
Old 14 January 2021, 17:44   #9
Jobbo
Registered User
 
Jobbo's Avatar
 
Join Date: Jun 2020
Location: Druidia
Posts: 386
Quote:
Originally Posted by Ernst Blofeld View Post
Possibly, this has shown a problem.

Possibly, this may be calculating shadows for one object that doesn't need them.

That may be 3% there.
There are a couple of things that stand out at a high level.

Firstly, you are using divides for the projection; this is going to be slow. It's fairly easy to improve this with a 1/x table, replacing the divides with multiplies.

The other thing is that for each vertex you're branching based on the view distance. I presume this is to avoid projection divides for distant vertices.

It would be better to make that decision for an entire object once and run it through two different code paths, rather than branching for each vertex.

Also you are saving the full 3D vertex for the projected form. Maybe you only need the 2D point after it's projected.

There is also quite a bit of redundant duplicate code; the optimizer can't help as much as you'd like with this stuff. So make sure you write the code out so things like "viewingDistance / transformed->x" are only done once, not twice.

There's a bunch of indirection that might also be causing things to be slower than necessary. You have things in Calculation3D structures when it might be better to pull them out into separate arrays. Struct of arrays rather than array of structs.
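As a hedged sketch of the struct-of-arrays idea (all names and sizes here are invented, not the engine's actual types): the projection pass streams through x, y and z sequentially instead of striding across a larger per-vertex Calculation3D record, and the divisor is read once per vertex.

```c
#include <stdint.h>

enum { MAX_VERTS = 64 };

/* Struct of arrays: each coordinate gets its own contiguous array */
typedef struct {
    int32_t x[MAX_VERTS];
    int32_t y[MAX_VERTS];
    int32_t z[MAX_VERTS];
} TransformedSoA;

static void project_soa(const TransformedSoA *t, int32_t viewingDistance,
                        int n, int32_t *outY, int32_t *outZ) {
    for (int i = 0; i < n; i++) {
        int32_t x = t->x[i];                      /* read the divisor once */
        outY[i] = t->y[i] * viewingDistance / x;
        outZ[i] = t->z[i] * viewingDistance / x;
    }
}
```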

Last edited by Jobbo; 14 January 2021 at 17:51.
Jobbo is offline  
Old 14 January 2021, 17:54   #10
Ernst Blofeld
<optimized out>
 
Ernst Blofeld's Avatar
 
Join Date: Sep 2020
Location: <optimized out>
Posts: 321
Quote:
Originally Posted by roondar View Post
The first thing that I find interesting is that the shadow layer takes a considerable amount of time, nearly as much time as the rest (24 vs 32%). Perhaps you can 'cheat' the shadows somehow by integrating the preparations into the main render code?
Yes, the shadows are way more expensive than they should be. I'm not really sure what to do. The shadows are a "cross section" that I think I'll need in various ways if I want to build some physics into this, so I'd rather fix them than throw them away.

There are different transformations for the object vs its shadow, each of which is the combination of the model transform and camera transform, with the shadow one having an extra bit of flattening to make it a shadow (setting a bunch of stuff to zero). Perhaps the conventional wisdom of combining all the matrices together doesn't apply here?

Quote:
Originally Posted by roondar View Post
The second thing I find noteworthy is that a lot of time is spent in various render (but not painting) routines. It seems to me that these are the ones where optimization might be most useful.
Yes, there's a bunch of "missing" time. I wonder if this is shoveling data around because registers can't be used efficiently, or just plain C inefficiency, and whether just doing things in a different order will fix everything.

Quote:
Originally Posted by roondar View Post
In general, there's a bunch of different things that can be optimized in C/C++ that aren't always clear. In particular, compilers sometimes push local arrays onto the stack instead of using a block of allocated memory and simple pointer references. The type of variables you create can also have a significant impact on performance, volatile vs non-volatile etc can be quite the difference. And indeed, on 68000 it pays to try and keep everything that can fit in 16 bits in 16 bit fields.
I have been trying to make things fit the natural 68000 way, with little bits of inline asm to do my 16x16->32-bit multiplications rather than the C way of promoting everything to 32 bits first. But there's probably a bunch I've not realised were happening.
Ernst Blofeld is offline  
Old 14 January 2021, 17:58   #11
roondar
Registered User
 
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,408
A weird idea, but it can help make your compiler use registers more efficiently: try to use fewer variables in functions. The 68000 has 8 data registers and 8 address registers (though due to the stack usually really only 7). In practice this does not translate to 15 usable variables, as quite a few operations require temporary registers or need both address and data registers.

The upshot is that if you can keep the number of variables in use at any time low, the compiler won't have to swap around registers to stack as much to keep it all working.
roondar is offline  
Old 14 January 2021, 18:07   #12
Ernst Blofeld
<optimized out>
 
Ernst Blofeld's Avatar
 
Join Date: Sep 2020
Location: <optimized out>
Posts: 321
Quote:
Originally Posted by Jobbo View Post
There are a couple of things that stand out at a high level.

Firstly, you are using divides for the projection; this is going to be slow. It's fairly easy to improve this with a 1/x table, replacing the divides with multiplies.
There's definitely a problem here, I'm dividing LONGs, that's never going to be quick. But how can I easily change that to a table?

Quote:
Originally Posted by Jobbo View Post
The other thing is that for each vertex you're branching based on the view distance. I presume this is to avoid projection divides for distant vertices.
It's to avoid the divide by zero and stupid big numbers as things go behind you.

Quote:
Originally Posted by Jobbo View Post
It would be better to make that decision for an entire object once and run it through two different code paths, rather than branching for each vertex.

Also you are saving the full 3D vertex for the projected form. Maybe you only need the 2D point after it's projected.
Maybe I can do better here, I'll look into it. I think it was to give me data to clip against the near plane with.

Quote:
Originally Posted by Jobbo View Post
There is also quite a bit of redundant duplicate code; the optimizer can't help as much as you'd like with this stuff. So make sure you write the code out so things like "viewingDistance / transformed->x" are only done once, not twice.

There's a bunch of indirection that might also be causing things to be slower than necessary. You have things in Calculation3D structures when it might be better to pull them out into separate arrays. Struct of arrays rather than array of structs.
Ernst Blofeld is offline  
Old 14 January 2021, 18:15   #13
Ernst Blofeld
<optimized out>
 
Ernst Blofeld's Avatar
 
Join Date: Sep 2020
Location: <optimized out>
Posts: 321
Quote:
Originally Posted by roondar View Post
A weird idea, but it can help make your compiler use registers more efficiently: try to use fewer variables in functions. The 68000 has 8 data registers and 8 address registers (though due to the stack usually really only 7). In practice this does not translate to 15 usable variables, as quite a few operations require temporary registers or need both address and data registers.

The upshot is that if you can keep the number of variables in use at any time low, the compiler won't have to swap around registers to stack as much to keep it all working.
I do keep a few global pointers, like where my calculations go, to save passing parameters on the stack for this kind of reason, but I'll look out for more opportunities to annoy my computer science professors.
Ernst Blofeld is offline  
Old 14 January 2021, 18:19   #14
Antiriad_UK
OCS forever!
 
Antiriad_UK's Avatar
 
Join Date: Mar 2019
Location: Birmingham, UK
Posts: 418
Quote:
Originally Posted by Ernst Blofeld View Post
There's definitely a problem here, I'm dividing LONGs, that's never going to be quick. But how can I easily change that to a table?
Instead of value / 2 (for example) you can do value * 0.5. So you generate a table of 1/1, 1/2, 1/3, etc. To make it work you have to use fixed point, so you build a table something like 16384/1, 16384/2, etc., and then instead of a divide you can do a multiply.

Code:
	move.w	(a2,d2.w),d2		;16384 / z
	muls	d2,d0  ;x
	muls	d2,d1  ;y

	add.l	d0,d0
	add.l	d0,d0
	swap	d0			;/16384, back to normal scale
Obviously you have to decide how far you support z values.
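A C rendering of the same trick, under an assumed 2.14 scale and a made-up z range: multiply by the table entry 16384/z, then shift the scale back out (>> 14), which is what the muls plus add.l/add.l/swap sequence does in the snippet above.

```c
#include <stdint.h>

enum { ONE = 16384, MAX_Z = 1024 };   /* 2.14 fixed point; z range assumed */

static int16_t recipZ[MAX_Z];

static void buildRecipTable(void) {
    for (int32_t z = 1; z < MAX_Z; z++)
        recipZ[z] = (int16_t)(ONE / z);   /* 16384/z fits in 16 bits */
}

/* project one coordinate: coord * (16384/z), then drop the 2.14 scale */
static int32_t project(int16_t coord, int16_t z) {
    return ((int32_t)coord * recipZ[z]) >> 14;
}
```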
Antiriad_UK is offline  
Old 14 January 2021, 18:38   #15
Ernst Blofeld
<optimized out>
 
Ernst Blofeld's Avatar
 
Join Date: Sep 2020
Location: <optimized out>
Posts: 321
Quote:
Originally Posted by Antiriad_UK View Post
Obviously you have to decide how far you support z values.
Yes.

Problem is I'm dealing with some big things as well as some smaller things.

My runway is 1.5km long, my jet is 16m.

The model I use for my jet is in millimetres.

The model I use for the runway is in metres.

(where there are 1024 millimetres in a metre and a pixel happens to be exactly one millimetre square)

At the moment everything ends up in millimetres in a LONG because after doing all the 2:14 fixed point rotation multiplications I shift right only 4 places for the runway.

It all starts to get complicated.

But, unless I change things, Z is a LONG, so I can't build a table for it.

But, are the Z divides the thing to fixate on, when the whole calculation section is only a quarter of the total?
Ernst Blofeld is offline  
Old 14 January 2021, 19:40   #16
Jobbo
Registered User
 
Jobbo's Avatar
 
Join Date: Jun 2020
Location: Druidia
Posts: 386
It sounds like you are being constrained by the fact that you need the same code to handle large objects such as the runway and relatively small objects such as the plane.

To handle these two cases most efficiently you probably want to split the code into two separate paths.

For example near plane clipping is no doubt important for the runway. But for the plane you might as well just clip the whole plane if it gets too near the camera.

Similarly and related to the requirements for optimizing the projection, you might want two different sets of code dealing with the two different scales.

Ultimately you want to do what is fastest for the data you have, not something generic.

It's a certainty that most games back in the day would treat the objects quite differently than they do the environment.
Jobbo is offline  
Old 14 January 2021, 19:44   #17
Jobbo
Registered User
 
Jobbo's Avatar
 
Join Date: Jun 2020
Location: Druidia
Posts: 386
For the reciprocal z table you want to limit the table size by limiting how far away you are willing to draw objects. And you want to limit the precision by some factor. So you might change the code from this:

a = b / z;

to

a = b * recip[z >> 4];

You're trading away some precision but it's quite manageable if you limit the far plane.
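A sketch with invented limits (far plane, shift amount, 2.14 scale) of how that quantised table could be built and used; z below 16 would need a separate near-plane path, since index 0 is left unused here.

```c
#include <stdint.h>

/* Assumed numbers: far plane at 16384 units, z quantised by >> 4,
 * so the table needs only 1024 entries instead of 16384. */
enum { ONE = 16384, FAR_Z = 16384, Z_SHIFT = 4,
       TABLE_SIZE = FAR_Z >> Z_SHIFT };

static int16_t recip[TABLE_SIZE];

static void buildRecip(void) {
    for (int32_t i = 1; i < TABLE_SIZE; i++)
        recip[i] = (int16_t)(ONE / (i << Z_SHIFT));  /* 1/(quantised z) */
}

/* a = b / z becomes a multiply by the coarse reciprocal */
static int32_t projectCoarse(int32_t b, int32_t z) {
    return (b * recip[z >> Z_SHIFT]) >> 14;
}
```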
Jobbo is offline  
Old 14 January 2021, 20:19   #18
Ernst Blofeld
<optimized out>
 
Ernst Blofeld's Avatar
 
Join Date: Sep 2020
Location: <optimized out>
Posts: 321
OK, let me report back tomorrow. I already have separate calculation functions for Entity2Ds (the runway), FixedEntity3Ds (the tower) and MobileEntity3Ds (tanks and planes), so calculating them in different ways should be achievable.
Ernst Blofeld is offline  
Old 14 January 2021, 23:14   #19
DanScott
Lemon. / Core Design
 
DanScott's Avatar
 
Join Date: Mar 2016
Location: Tier 5
Posts: 1,211
How are you actually rendering your polygons?
DanScott is online now  
Old 15 January 2021, 03:36   #20
Ernst Blofeld
<optimized out>
 
Ernst Blofeld's Avatar
 
Join Date: Sep 2020
Location: <optimized out>
Posts: 321
Quote:
Originally Posted by DanScott View Post
How are you actually rendering your polygons?
The ground is done with the blitter, just one bitplane. The rest are done with the CPU, the asphalt of the runway and the shadows are also just one bitplane, as is the sun although that's a precalculated circle.

To draw the polygons I go round the edges using Bresenham to populate an array of start and end points (anti clockwise so down is a start point and up is an end). I then go through this array, having remembered the y min and max to start and stop at, drawing horizontal lines one long word at a time. That last bit came from elsewhere, but I changed it to write longs rather than words.
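A rough C sketch of the longword span write described above (all names and widths assumed, with MSB-first pixel order as on Amiga bitplanes; this is not the actual engine code): mask the partial end longwords and store whole 32-bit longs in between.

```c
#include <stdint.h>

/* Fill pixels x0..x1 inclusive in one bitplane row of 32-bit longwords.
 * MSB of each longword is the leftmost pixel. */
static void fillSpan(uint32_t *row, int x0, int x1) {
    int l0 = x0 >> 5, l1 = x1 >> 5;                  /* longword indices */
    uint32_t m0 = 0xFFFFFFFFu >> (x0 & 31);          /* left edge mask  */
    uint32_t m1 = 0xFFFFFFFFu << (31 - (x1 & 31));   /* right edge mask */
    if (l0 == l1) { row[l0] |= m0 & m1; return; }    /* span in one long */
    row[l0] |= m0;
    for (int l = l0 + 1; l < l1; l++)
        row[l] = 0xFFFFFFFFu;                        /* whole longs: one write each */
    row[l1] |= m1;
}
```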

Last edited by Ernst Blofeld; 15 January 2021 at 06:01.
Ernst Blofeld is offline  
 

