27 October 2011, 09:31 | #1 |
Registered User
Join Date: Dec 2007
Location: Dark Kingdom
Posts: 213
|
Optimizing question: instruction order
Yesterday I was reviewing some of my routines, and a couple of optimization questions came to mind which may be rather general.
Before you tell me: I know that the best answer to my questions is "try different possibilities and measure the results", but since that is time-consuming I am asking for a rule of thumb, if any reasonable one exists.
The routines perform back-face culling of polygons and compute (with different formulas) the illumination of a surface with respect to a light source whose position varies. So they have MULS and in some cases also DIVS. The routines also have to do a pair of memory accesses that do not depend on the results of the MULS and DIVS, i.e. I can insert the memory access instructions (almost) anywhere inside the routines.
The main target is 68000 + OCS (A500) with fast RAM. Secondary target: any other 68k CPU.
Thinking in terms of my main target, I am making the following assumption: the relative order of the instructions does not change the performance, i.e. it does not matter (for speed on the main target) where I insert the memory access instructions.
Question 1: is that really true?
On the other hand, I believe that for efficiency on 020+ it is better to interleave MULS and DIVS with several other instructions, so I chose to put the memory accesses just before or after a MULS or DIVS.
Question 2: is that (in general) a good idea?
Question 3: is it (in general) better to put the memory access before or after a MULS/DIVS? (My guess is before, because I expect that, at least on 040 and 060, the CPU can start the MULS/DIVS while it waits for the memory access to complete.)
Question 4: should I avoid wasting time thinking about general rules of thumb, which are impossible to give, and stick to a try-and-measure approach?
PS: please forgive my questions if they are dumb. Yesterday I managed to do a one-hour coding session! It was the first time since 2009! What a wonderful experience! |
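For readers who want something concrete, the kind of computation under discussion can be sketched in C. This is only an illustration, assuming a screen-space back-face test based on the sign of the 2D cross product of two polygon edges. The post does not give the actual formulas, so the names `Point2D` and `is_back_facing` and the winding convention are assumptions. The two multiplications play the role of the MULS instructions.

```c
#include <stdint.h>

/* Hypothetical screen-space vertex with 16-bit coordinates. */
typedef struct {
    int16_t x;
    int16_t y;
} Point2D;

/* Returns 1 if the triangle (a, b, c) is back-facing, 0 otherwise.
 * Assumptions: y grows upward and front faces are counter-clockwise,
 * so a front face has a positive cross product. With y growing
 * downward the sign test flips. Edge deltas are assumed to fit in
 * 16 bits, as the operands of a 68000 MULS would have to. */
static int is_back_facing(Point2D a, Point2D b, Point2D c)
{
    int32_t e1x = (int32_t)b.x - a.x;
    int32_t e1y = (int32_t)b.y - a.y;
    int32_t e2x = (int32_t)c.x - a.x;
    int32_t e2y = (int32_t)c.y - a.y;

    /* Two multiplications that do not depend on each other. */
    int32_t cross = e1x * e2y - e2x * e1y;

    /* Degenerate (zero-area) triangles are treated as back-facing. */
    return cross <= 0;
}
```

In assembly, any loads that do not feed these multiplications are the instructions whose placement the questions above are about.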
27 October 2011, 14:40 | #2 |
Banned
Join Date: Jan 2009
Location: U.K.
Posts: 93
|
Question 1: How do you know without test results?
Question 2: How do you know without test results?
Question 3: How do you know without test results?
Question 4: The answer is YES. |
27 October 2011, 18:47 | #3 |
AMOS Extensions Developer
Join Date: Jun 2007
Location: near Cambridge, UK
Age: 44
Posts: 1,924
|
1. Depends. If you are activating the blitter, for example, then it definitely matters. If you are only coding using the CPU, then I don't think so (unless you are trying to code for most of the 680x0 family).
2. No idea. Do some tests!
3. I'd say before is best, but I'd wait for the experts (cue Stingray and Leffmann)
4. Yes. It is very hard (impossible?) to code for all the 680x0 CPU family used in Amigas, as well as all the different hardware configurations. Stick with one machine (perhaps a popular one), then you can optimize it heavily.
Personally, I'd normally think about using tables for MULS and DIVS. However, it sounds like what you are doing would require an awful lot of tables, so you'd probably be better off just using the MULS and DIVS instead.
I'm no expert, still a relative newbie myself, but thought I'd offer my $0.02
Regards,
Lonewolf10 |
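One common form of the table approach mentioned above is quarter-square multiplication, which replaces a multiply with two table lookups and a subtraction, using a*b = floor((a+b)^2 / 4) - floor((a-b)^2 / 4). The sketch below is in C for brevity and is not necessarily what the poster had in mind. The operand range (0..255) and the names `qsq`, `init_qsq` and `table_mul` are assumptions.

```c
#include <stdint.h>

#define MAX_OPERAND 255  /* assumed operand range: 0..255 */

/* qsq[n] = floor(n*n / 4) for n = 0 .. 2*MAX_OPERAND.
 * The largest entry is 510*510/4 = 65025, which fits in 16 bits. */
static uint16_t qsq[2 * MAX_OPERAND + 1];

/* Fill the quarter-square table once at start-up. */
static void init_qsq(void)
{
    for (uint32_t n = 0; n <= 2 * MAX_OPERAND; n++) {
        qsq[n] = (uint16_t)((n * n) / 4);
    }
}

/* Returns a * b for 0 <= a, b <= MAX_OPERAND using two lookups and
 * one subtraction. a+b and |a-b| always have the same parity, so the
 * two floor operations cancel and the result is exact. */
static uint16_t table_mul(uint8_t a, uint8_t b)
{
    uint16_t sum  = (uint16_t)(a + b);
    uint16_t diff = (uint16_t)(a > b ? a - b : b - a);
    return (uint16_t)(qsq[sum] - qsq[diff]);
}
```

The table grows with the operand range, so for full signed 16-bit operands it gets large quickly. Whether it beats MULS on a given CPU would still have to be measured.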
27 October 2011, 20:10 | #4 | ||||
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,762
|
Quote:
For 68060 the order certainly does matter. Reorder instructions so that the next instruction doesn't rely on the results of the previous one.
Quote:
Quote:
Quote:
True, but you can still apply rules of thumb to individual CPUs. It's best to try and see if you can get good performance on your lowest CPU target, finish that code and do a separate version for a higher target. |
||||
28 October 2011, 10:26 | #5 | |
Registered User
Join Date: Dec 2007
Location: Dark Kingdom
Posts: 213
|
Quote:
A clearer formulation is: 1') In a typical A500 setting, 68000 + OCS (optionally + fast RAM), does the relative order of instructions affect performance? While the routine runs, the blitter may be clearing the screen, but it may also have already finished. How would you choose the instruction ordering in this situation? |
|
28 October 2011, 10:32 | #6 | |
Registered User
Join Date: Dec 2007
Location: Dark Kingdom
Posts: 213
|
Quote:
Just to be sure that I understand correctly: in my question the subject of the sentence is "the memory access instruction", while in your sentence the subject seems to be "the MUL/DIV/register work instructions". Is your advice to put the memory access instructions FIRST and THEN the MUL/DIV/register work instructions? |
|
28 October 2011, 10:42 | #7 |
Registered User
Join Date: Dec 2007
Location: Dark Kingdom
Posts: 213
|
@All
Thanks to everyone who answered my questions! Your advice is highly appreciated! :-)
I would like to clarify that I do NOT believe that "rules of thumb" can replace tests. I agree that to get the absolute best optimization one has to do tests. I was asking for "rules of thumb" that may help in two ways:
1) to guide the tests, as Thorham said, i.e. to help exclude some non-optimal reorderings;
2) my main target is 68000, while as secondary target I don't have a specific CPU. Since I suspect the order does not matter for 68000 (or that there are many optimal orderings), I would like to test just on the 68000 to find a set of equally good orderings and then use a "rule of thumb" to select among them one that is not too bad on any 020+.
28 October 2011, 12:53 | #8 |
WinUAE developer
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,515
|
Instruction reordering won't affect speed on a 68000 (it has no caches, write buffers, etc.), but it can affect speed if the code is accessing the Agnus bus while other DMA channels are active and the reordered instructions have different internal idle cycles.
|
29 October 2011, 15:34 | #9 |
Moderator
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,604
|
The only proper answer imo is to get one of each platform you want to support, and code on them. I think you ask because you want to optimize "theoretically for all models" and not have to code on an A500/A600. Whereas in fact, if you just code on an A600, you can set different bg colors at the start of each routine to see where the bottlenecks are. It's dead easy and saves time by giving instant answers to optimization questions. It simply ends all "will this run fast enough on..." doubts, which is ace.
I've ended up with only two choices: A500 512k slowmem (runs fine, speedups are a bonus) and A1200-060 (runs optimally, slowdowns are acceptable for users with lower-end hardware).
The load-use reordering is fine on any machine but won't give any optimization on CPUs without caches. Coding for "backward branch is assumed taken" is also good, nothing wrong with that. But you will optimize much more by eliminating data redundancy, reordering data for sequential access, and reducing shifts, muls, divs and unnecessary memory accesses and instructions. On a higher level, by eliminating unnecessary blits and recalculations.
When the code follows the above, the only optimizations left for 68000 are 1) time DMA start perfectly to somehow interleave DMA with CPU internal cycles (very hard), and 2) support putting [almost all] code+data in fastmem.
Now, the above description fits a coder god. My suggestion is to approach coding as problem-solving:
1. Problem: the program isn't finished. Solution: finish the program.
2. Problem: the program isn't fast enough. Solution: find out why.
3. Problem: routines x and y are taking too long. Solution: optimize the inner loops (ONLY) of routines x and y.
4. Problem: the program is finished and fast enough, but I want to write god-code where every line is perfectly optimized. Solution: the problem is with you. Release the program, then fix you.
The last point is based on introspection, you may not have that problem |
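The background-colour trick described above can be sketched as follows. On the Amiga, the background colour register COLOR00 sits at address $DFF180 in the custom chip register space. Writing a different value at the start of each routine makes the display show, as coloured bands, roughly how much raster time each routine takes. The sketch is in C rather than assembly, and the helper name `mark_routine`, the marker colours and the `reg` parameter are assumptions for illustration (passing the register as a pointer also lets the code be exercised off-target).

```c
#include <stdint.h>

/* COLOR00 (background colour) in the Amiga custom chip register space. */
#define COLOR00_ADDR 0xDFF180UL

/* Hypothetical marker colours in 12-bit RGB (0x0RGB). */
#define MARK_CULLING  0x0F00  /* red: culling routine running    */
#define MARK_LIGHTING 0x00F0  /* green: lighting routine running */
#define MARK_IDLE     0x0000  /* black: frame work finished      */

/* Write a marker colour to the given colour register. */
static void mark_routine(volatile uint16_t *reg, uint16_t rgb)
{
    *reg = rgb;
}
```

On real hardware the call would look like `mark_routine((volatile uint16_t *)COLOR00_ADDR, MARK_CULLING);` at the top of the culling routine, with `MARK_IDLE` written once the frame's work is done.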
29 October 2011, 17:07 | #10 | |
gone
Join Date: Apr 2007
Location: completely gone
Posts: 1,596
|
Quote:
|
|