30 November 2010, 21:12   #10

Photon
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 4,821
I meant that in ARM programming classes, the usual rule of thumb is that a taken (executed) branch costs 3 cycles. Cache and prediction affect this, so it's only a rule of thumb. If you know 100% what's in the cache and which branches the data/calculated values will cause to execute, you can work out the real situation more precisely.
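
To put rough numbers on that rule of thumb, here is a minimal sketch in C. The 3 cycles for a taken branch is the figure from above; the 1 cycle for a not-taken branch and the name estimate_loop_cycles are assumptions for illustration only, and the model ignores cache and prediction completely.

```c
#include <assert.h>

/* Rule-of-thumb costs. The not-taken figure is an assumption made for this
   sketch, not a number for any particular ARM core. */
enum { CYCLES_TAKEN_BRANCH = 3, CYCLES_NOT_TAKEN_BRANCH = 1 };

/* Hypothetical helper: estimate the cycles of a counted loop that does
   body_cycles of straight-line work per iteration and ends in one
   loop-closing branch. That branch is taken on every iteration except
   the last, where it falls through. */
unsigned long estimate_loop_cycles(unsigned long iterations,
                                   unsigned long body_cycles)
{
    assert(iterations > 0);
    unsigned long taken = iterations - 1;
    return iterations * body_cycles
         + taken * CYCLES_TAKEN_BRANCH
         + CYCLES_NOT_TAKEN_BRANCH;
}
```

Under these assumptions a 100-iteration loop with 4 cycles of body work comes out at 100*4 + 99*3 + 1 = 698 cycles. Treat it as a starting estimate to compare against a real measurement, nothing more.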

(Some ARM models don't have branch prediction, and I don't think any have "user suggested" branch prediction, but I could be wrong. If there's no way to specify, most if not all CPUs default to predicting backward branches as 'taken' and forward ones as 'not taken'. So one way to help prediction is to make usually-not-taken branches point forward.)
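
A minimal sketch of that default guess, written in C as a model rather than as a description of any real core's predictor. The function names and the addresses in the example are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Static "backward taken, forward not taken" guess: a branch whose target
   lies at a lower address than the branch itself is treated as a loop
   back-edge and predicted taken. */
bool predict_taken(uint32_t branch_addr, uint32_t target_addr)
{
    return target_addr < branch_addr;
}

/* Hypothetical helper: count wrong guesses for one branch that executes
   `runs` times, of which `taken_runs` actually jump. */
unsigned long mispredictions(uint32_t branch_addr, uint32_t target_addr,
                             unsigned long runs, unsigned long taken_runs)
{
    assert(taken_runs <= runs);
    return predict_taken(branch_addr, target_addr)
         ? runs - taken_runs   /* guessed taken: wrong when it fell through */
         : taken_runs;         /* guessed not taken: wrong when it jumped */
}
```

With this model, a rarely taken branch (1 jump in 100 runs) that points forward, say from 0x8004 to 0x8040, is guessed wrong once. The same branch laid out to point backward is guessed wrong 99 times, which is the reason for pointing it forward.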

finkel, as I said, the best approach is to put the result calculation as far ahead of the result use as possible. But if you have a choice to make (multiple dependencies in the loop, how to reorder everything?), the only sure way is to test some combinations and measure the time (paste an equivalent loop into a simple timing harness).
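
Such a timing harness could look roughly like the sketch below. It assumes a hosted C environment where the standard clock() works; on bare metal you would swap in a hardware timer or cycle counter. The names time_best_of, variant_a and variant_b are placeholders I made up; the two variants stand in for your candidate instruction orderings.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef void (*kernel_fn)(void);

/* Volatile so the compiler cannot optimise the placeholder work away. */
volatile unsigned long sink;

/* Placeholders: replace the bodies with calls to the loop orderings
   you actually want to compare. */
void variant_a(void) { sink += 1; }
void variant_b(void) { sink += 1; }

/* Run fn `reps` times per trial, for `trials` trials, and return the best
   (smallest) trial time in seconds. Taking the best of several trials
   reduces noise from interrupts and cold caches. */
double time_best_of(kernel_fn fn, unsigned trials, unsigned long reps)
{
    assert(fn != NULL && trials > 0);
    double best = -1.0;
    for (unsigned t = 0; t < trials; ++t) {
        clock_t start = clock();
        for (unsigned long i = 0; i < reps; ++i)
            fn();
        clock_t end = clock();
        double secs = (double)(end - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || secs < best)
            best = secs;
    }
    return best;
}
```

Call time_best_of(variant_a, 5, 1000000) and time_best_of(variant_b, 5, 1000000) and compare the two results. Pick reps large enough that one trial runs much longer than the clock resolution, otherwise the numbers mean nothing.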

Measuring matters because there are many models of each ARM version, with different cache specs and CPU-to-memory clock ratios. Your CPU could run faster or slower than specified, and memory may have wait states or DMA going on.

If you want some approximation, I'd say at least 3 internal (non-memory) instructions fit after 1 uncached store. Also read up on write cache-flushing: its timing could let you execute many internal instructions for free, or work against you. For reads, you can't do anything with the read value directly afterwards anyway, even if it's cached, so just put any instruction that doesn't need the value there.

Basically what you're doing is the profiling and instruction rearrangement that some compilers do, except you're doing it manually. Maybe someone has dreamt up such a tool for pure asm? It wouldn't be unthinkable, considering ARM is heavily used in embedded applications.

To be optimal you really must study and know your exact CPU and memory interface. Knowing and abusing your cache, rearranging your data and whole rendering engine (or whatever) to time loads and stores perfectly, and knowing at all times what's in the cache, takes a ninja coder god. There are a few such people in the demoscene, I know... it's hard, even on a fixed platform. Doing it for "many ARM setups" is a level harder still.