28 November 2010, 17:01   #7
Photon
The first ARM was pretty simple with its 3-stage pipeline. Nowadays, with deeper pipelines and all kinds of caches and cache sizes, there are too many combinations to try to reach an 'optimum'. It's better to always apply some general techniques instead:

The first thing that will slow you down is memory. So the absolute first priority is to cut down on memory accesses, and that means writing few instructions, storing data efficiently, and calculating from small values rather than fetching data that could be calculated. The built-in method for loading a register with a number is a perfect example of that: an immediate operand travels inside the instruction, whereas a numeric constant stored in memory takes space and costs an extra access. But if you know your caches and you know (or make sure) that a loop and all the data it needs will fit in them, you can disregard memory speed.
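
To make the point about built-in immediates concrete, here is a minimal sketch in C. It assumes the classic 32-bit ARM instruction set, where a data-processing immediate is an 8-bit value rotated right by an even amount; other encodings (Thumb, for example) follow different rules. The names arm_imm_encodable and rotl32 are made up for this illustration. A constant that passes the check can be loaded without touching data memory; one that fails has to be built from several instructions or fetched from memory.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rotate a 32-bit value left by rot bits (rot in 0..31).
   The "& 31u" avoids an undefined shift by 32 when rot is 0. */
static uint32_t rotl32(uint32_t v, unsigned rot)
{
    return (v << rot) | (v >> ((32u - rot) & 31u));
}

/* Assumption: classic ARM immediate = 8-bit value rotated right by an
   even amount (0, 2, ..., 30). Undoing each possible rotation and
   checking whether the result fits in 8 bits tests exactly that. */
bool arm_imm_encodable(uint32_t v)
{
    for (unsigned rot = 0; rot < 32; rot += 2) {
        if (rotl32(v, rot) <= 0xFFu)
            return true;
    }
    return false;
}
```

For example, 0xFF000000 and 0x3FC pass the check, while 0x101 does not, because its two set bits do not fit in any 8-bit window.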

Grouping LDM/STM 'cleverly' and placing them before several internal calculation instructions is vital for cache performance. STMIA performance varies between models, so the best way is to 'saturate' the busy time while STMIA is off storing with as many instructions as you can. If the STMIA finishes before all the internal calculation instructions are done, you can split the STMIA up or, better, move some of the internal instructions somewhere else, after another store or so.

Second is interleaving instructions, but on a system where the CPU is much faster than the memory, a few small stalls affect performance MUCH less than memory accesses do. Once the code does what it should, simply reorder it to put each result-dependent instruction at maximum distance from the instruction that calculates the result. ARM models vary, so you can do no better than that.
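
The same idea can be sketched at the C level, with the caveat that the compiler and the particular core decide what actually happens. In this hypothetical example (function names are invented), the second version keeps two independent accumulators, so each addition sits further away from the multiply that feeds it. Unsigned arithmetic is used so both versions give identical results.

```c
#include <stddef.h>
#include <stdint.h>

/* Every addition depends directly on the multiply just before it. */
uint32_t sum_products_chained(const uint32_t *a, const uint32_t *b, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Two independent chains: both products are calculated first, then
   consumed, which leaves distance between producer and consumer. */
uint32_t sum_products_interleaved(const uint32_t *a, const uint32_t *b, size_t n)
{
    uint32_t acc0 = 0, acc1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        uint32_t p0 = a[i] * b[i];
        uint32_t p1 = a[i + 1] * b[i + 1];
        acc0 += p0;
        acc1 += p1;
    }
    if (i < n)                      /* odd element left over */
        acc0 += a[i] * b[i];
    return acc0 + acc1;
}
```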

For tight loops, the loop may or may not be 'spliceable' (unrollable) at all, i.e. omitting 9 of 10 decrements or checks + branches by repeating the loop body 10 times and dividing the loop counter by 10. Some loops can have their exit condition added to all instructions in the loop (except the load/check instruction) until there's an exit branch, but it's rare.
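
A rough C sketch of that kind of unrolling, using a factor of 4 instead of 10 to keep it short. The function names are hypothetical, and the tail loop is there because the element count is not assumed to be a multiple of the unroll factor.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain loop: one counter check and one branch per element. */
uint32_t sum_simple(const uint32_t *v, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    return acc;
}

/* Unrolled by 4: one counter check and one branch per four elements. */
uint32_t sum_unrolled4(const uint32_t *v, size_t n)
{
    uint32_t acc = 0;
    size_t blocks = n / 4;          /* loop counter divided by the factor */
    size_t rest = n % 4;            /* elements that don't fill a block */

    while (blocks--) {
        acc += v[0];
        acc += v[1];
        acc += v[2];
        acc += v[3];
        v += 4;
    }
    while (rest--)
        acc += *v++;
    return acc;
}
```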

Branches are normally counted as taking 3 cycle slots, non-executed instructions (including branches) as 1. I guess you already know not to 'skip' 1-2 instructions using branches, but to make the skippable instructions conditional instead.
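
The arithmetic behind that rule of thumb, written out as a tiny C cost model. It uses only the counts given above (3 slots for a taken branch, 1 for a non-executed instruction) plus the assumption that an executed instruction also takes 1 slot; real cores will differ. The function names are invented for the illustration.

```c
/* Cycle slots spent when n instructions are guarded by a branch around them.
   skip != 0: the branch is taken and costs 3 slots.
   skip == 0: the branch is not executed (1 slot) and the n instructions run. */
unsigned cycles_with_branch(unsigned n, int skip)
{
    return skip ? 3u : 1u + n;
}

/* Cycle slots spent when the n instructions are made conditional instead.
   Assumption: each one takes 1 slot whether it executes or not. */
unsigned cycles_with_conditionals(unsigned n, int skip)
{
    (void)skip;
    return n;
}
```

Under this model, conditional instructions win whenever n is 1 or 2, tie at 3 in the skipped case, and lose to the branch only when 4 or more instructions are skipped.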

Know your data. "Pre-massage" the data before a big operation on it, if it's worth it. That will help you minimize memory accesses.
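
One possible form of pre-massaging, sketched in C under the assumption that the values are known to fit in 8 bits: narrow them once, then let the big operation read a quarter of the memory. pack_to_bytes and sum_bytes are hypothetical names, and summing merely stands in for whatever the big operation is.

```c
#include <stddef.h>
#include <stdint.h>

/* One-off pass: narrow 32-bit samples to bytes.
   Returns 0 on success, -1 if a value does not fit in 8 bits. */
int pack_to_bytes(const uint32_t *src, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (src[i] > 0xFFu)
            return -1;
        dst[i] = (uint8_t)src[i];
    }
    return 0;
}

/* The "big operation" now touches one byte per element instead of four. */
uint32_t sum_bytes(const uint8_t *v, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    return acc;
}
```

Whether the extra pass pays off depends on how many times the packed data is read afterwards.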