25 November 2010, 14:35 | #1 |
Registered User
Join Date: Aug 2010
Location: x
Posts: 36
ARM Assembler Optimization
There are probably better places to ask this specific question, but why not give it a try here first?
I need some documentation on how to optimize assembler code for pipeline execution on ARM11 processor. I remember some general rules for some generic RISC CPU from my university days, like branch prediction, instruction interleaving, computational dependencies, etc, but I have only some vague ideas what I should be doing with ARM11. I downloaded some technical references from the net, so at least I know that ARM11 has 8 pipeline stages, early-stage branch prediction, and so on, but some specific answers are hard to find. For example, if I write something like this... tst r0,#1 <other instructions> beq l1 ... how many other (non-dependent) instructions should I put between tst and beq so that pipeline doesn't stall, and flushes minimal number of instructions in case of branch misprediction? Similar question for something like this: add r0,#1 <other instructions> add r1,r0 ;@ write-read dependence of r1 on r0 Has anyone seen any online documentation dealing with this kind of stuff? |
25 November 2010, 14:49 | #2 |
Registered User
|
25 November 2010, 17:34 | #3 |
Thalion Webshrine
Join Date: Jan 2004
Location: Oxford
Posts: 14,337
|
26 November 2010, 11:47 | #4 |
Registered User
Join Date: Aug 2010
Location: x
Posts: 36
26 November 2010, 11:57 | #5 |
Registered User
Join Date: Aug 2010
Location: x
Posts: 36
I suspect those two are responsible for a terrible drop in fps, from 25 down to 15 and less, and I'm trying to optimize that code as much as I can. It's really strange that the ARM doesn't have a DIV instruction, considering that it has ALU & MAC units running in parallel, so a DIV unit could have been added as well. Probably division needs too many cycles, which doesn't make much sense in a RISC CPU, but still... I'll ask on some ARM forums as well. Either way, thanks for the help.
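For what it's worth, the usual workaround for the missing DIV is to turn division by a known constant into a multiply by a precomputed reciprocal via UMULL, which the ARM11 does have. A minimal sketch (register choices are arbitrary; the magic constant here is the standard one compilers use for an unsigned divide by 10, i.e. ceil(2^35/10)):

```asm
    @ sketch: r1 = r0 / 10 (unsigned), without a DIV instruction
    ldr   r2, =0xCCCCCCCD      @ reciprocal constant, placed in a literal pool
    umull r3, r1, r0, r2       @ r1:r3 = r0 * magic; r1 gets the high 32 bits
    mov   r1, r1, lsr #3       @ r1 = high_word >> 3 = r0 / 10
```

The same trick only works when the divisor is fixed at assembly time; a general variable divide still needs a subtract-and-shift loop or a library call.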
26 November 2010, 14:27 | #6 |
Workbitch 1.3
Join Date: Oct 2001
Location: Melbourne, Australia
Age: 46
Posts: 2,084
There are some really talented ARM Assembly guys in the gp32/gp2x scene, you could try asking on gp32x.com or #gp2xdev on efnet.
28 November 2010, 17:01 | #7 |
Moderator
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,602
The first ARM was pretty simple with its 3 pipeline stages. Nowadays, with deeper pipelines and all kinds of caches and cache sizes, there are too many combinations to try to reach the 'optimum'. It's better to always apply some general techniques instead:
The first thing that will slow you down is memory. So the absolute first priority is to cut down on memory accesses, and that means writing few instructions, storing efficiently, and calculating from small values rather than fetching data that could be calculated. The built-in method for loading a register with a number is a perfect example of that; numeric constants take space. But if you know your caches and you know (or make sure) that a loop and all the data it needs will fit in the caches, you can disregard memory speed.

Grouping LDM/STM 'cleverly' and placing them before several internal calculation instructions is vital for cache performance. STMIA performance varies between models, so the best way is to 'saturate' the busy time, while STMIA is off storing, with as many instructions as you can; if the STMIA is done before all the internal calc instructions are, you can split the STMIA up or, better, move the internal instructions somewhere else, after another store or so.

Second is interleaving instructions, but on a system where the CPU is much faster than the memory, some small stalls affect performance MUCH less. Once the code does what it should, simply reorder it to put result-dependent instructions at maximum distance from the result-calculating instructions. ARM models vary, so you can do no better than that.

For tight loops, the loop may or may not be 'spliceable' at all, i.e. omitting 9 of 10 decrements or checks + branches by repeating the body 10 times and dividing the loop counter by 10. Some loops can have their exit condition added to every instruction in the loop (except the load/check instruction) until there's an exit branch, but that's rare.

Branches are normally counted as taking 3 cycle slots, non-executed instructions (including branches) 1. I guess you already know not to 'skip' 1-2 instructions using branches, but to make the skippable instructions conditional instead.

Know your data. "Pre-massage" the data before a big operation on it, if it's worth it.

That will help you minimize memory accesses.
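The loop-splicing idea could look roughly like this, a hypothetical byte-copy loop unrolled 4x so the decrement/branch overhead is paid once per four iterations (assumes the count in r2 is a multiple of 4):

```asm
copy:                          @ sketch: copy r2 bytes from [r0] to [r1]
    ldrb  r3, [r0], #1         @ body repeated 4 times...
    strb  r3, [r1], #1
    ldrb  r3, [r0], #1
    strb  r3, [r1], #1
    ldrb  r3, [r0], #1
    strb  r3, [r1], #1
    ldrb  r3, [r0], #1
    strb  r3, [r1], #1
    subs  r2, r2, #4           @ ...so only one count/branch per 4 bytes
    bne   copy
```

A real loop would of course load and store words with LDM/STM instead of single bytes; this only shows the splicing itself.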
30 November 2010, 10:21 | #8 |
Registered User
Join Date: Aug 2010
Location: x
Posts: 36
I was hoping to gain some speed by using as few memory accesses as possible, but some loops (like the pfield_doline_nx() functions in all Amiga emulators I've seen so far) require more registers than the ARM has available.

Quick question: is there any fast way of loading non-standard constants, like 0x55555, other than storing them in memory buffers and LDR-ing them into registers? One reference manual states that any constant in the range 0-65535 can be used as an imm16 operand, but for some reason "mov r0,#0x5555" won't execute (stepping over it fails in the debugger).
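If I'm not mistaken, that mov fails because a classic ARM data-processing immediate is an 8-bit value rotated right by an even amount, so 0x5555 simply doesn't fit; the imm16 MOVW encoding only appeared with ARMv6T2, which plain ARM11 cores predate. Two common workarounds, sketched with arbitrary registers:

```asm
    @ (a) let the assembler park the constant in a literal pool:
    ldr  r0, =0x55555          @ assembled as a PC-relative load
    @ (b) build it from valid rotated immediates, no memory access:
    mov  r0, #0x55000          @ 0x55 rotated into bits 12-19
    orr  r0, r0, #0x500
    orr  r0, r0, #0x55         @ r0 = 0x55000 + 0x500 + 0x55 = 0x55555
```

Variant (b) costs three instructions but no load, so inside a hot loop it may still win over the literal-pool LDR if the pool line isn't cached.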
How many cycles do memory reads/writes take if a data-cache hit occurs? The buffers I'm using are less than 100 bytes long, so I guess the cache will not be a problem.
Do you know how many independent instructions should be placed between a flag-setting instruction and the branch that depends on its result? From what the pipeline looks like, I assume it should be something like 7 (stage 8, write-back, minus stage 2, branch prediction, plus 1):

1. Fe1 – Address is sent and instruction received
2. Fe2 – Much of the branch prediction goes here
3. De – Decode instruction
4. Iss – Read registers and issue instruction
5. Sh – Perform shift operations
6. ALU – Perform integer operations
7. Sat – Saturate results
8. WB – Write back data to registers

I'm trying to get every little detail right here, since rewriting some of the emulator code in assembler seems to be worth the effort. Even with the unoptimized, but pure assembler, version of the pfield_doline_n6() function I wrote yesterday, there was something like a 15% speed improvement. If I could get that to 30% with optimizations (hard, but plenty of room for trying), that would really be something. 30% here, 30% there, and pretty soon we're talking about serious speedups.
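Just to pin down the shape being asked about (the filler instructions below are arbitrary placeholders, and whether 7 really is the ideal distance on a given ARM11 core is exactly the open question):

```asm
    tst   r0, #1               @ flag-setting instruction hoisted early
    ldr   r4, [r5], #4         @ independent work fills the gap...
    add   r6, r6, r4
    mov   r7, r7, lsl #1
    eor   r8, r8, r6           @ ...none of it touches the flags
    beq   l1                   @ branch tests flags settled long ago
```

Note the fillers must not be flag-setting variants (no S suffix, no CMP/TST) or the condition for beq would be destroyed.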
30 November 2010, 10:23 | #9 |
Registered User
Join Date: Aug 2010
Location: x
Posts: 36
|
30 November 2010, 21:12 | #10 |
Moderator
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,602
I meant that in ARM programming classes, taken branches (executed branches) are normally said to take 3 cycles. Cache and prediction affect this, but it's a rule of thumb. If you know exactly what's in the cache and which branches your data and calculated values will cause to execute, you can judge the real situation better.

(Some ARM models don't have branch prediction, and I don't think any have "user-suggested" branch prediction, but I could be wrong. If there's no way to specify, most if not all CPUs default to predicting backward branches as 'taken' and forward ones as 'not taken'. So one way to help prediction is to make usually-not-taken branches point forward.)

finkel, as I said, the best thing is to put result calculation as far before result use as possible, but if you have a choice to make (multiple dependencies in the loop, how to reorder everything?) the only sure way is to test some combinations and measure the time (paste an equivalent loop into a simple time-measuring harness). Because there are many models of each ARM version, with different cache specs and ratios to memory clock, your CPU could run faster or slower than specified, and memory may have wait states or DMA happening. If you want an approximation, I'd say at least 3 internal instructions fit after 1 uncached store. Also read up on write-buffer/cache flushing; timing that right could let many internal instructions execute for free, or the other way around. For reads, you can't do anything with the read value directly afterwards anyway, even if it's cached, so just put any instruction there.

Basically what you're doing is the profiling and rearranging of instructions that some compilers do, except manually. Maybe someone has dreamt up such a tool for pure asm? It wouldn't be unthinkable, considering how heavily ARM is used in embedded applications.

To be optimal you really must examine things carefully and know your exact CPU and memory interface. Knowing and abusing your cache, rearranging your data and whole rendering engine (or whatever) to time loads and stores perfectly, while all the time knowing what's in the cache, takes a ninja coder god. There are a few such people in the demoscene, I know... it's hard, even on a fixed platform. Doing it for "many ARM setups" is a level harder still.
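As a sketch of that "roughly 3 internal instructions fit after 1 uncached store" rule of thumb (registers and the work done are hypothetical; the exact overlap depends on the core and memory interface):

```asm
    str   r4, [r6], #4         @ store heads out through the write buffer
    add   r7, r7, r8           @ independent ALU work overlaps with it...
    eor   r9, r9, r7, ror #2
    subs  r5, r5, #1           @ ...so these cycles come almost for free
```

The point is only that none of the interleaved instructions touch the bus or depend on the store; anything of that shape will do.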
01 December 2010, 11:56 | #11 |
Registered User
Join Date: Aug 2010
Location: x
Posts: 36
I'll be perfectly happy with a "sub-optimized" version that still runs 30% faster than it does now.

I want the most bang for my buck, so to speak... Anyway, thank you all for your help. Suggesting gp32x.com was an excellent idea. I got some superb advice, and a very useful link to instruction timings. In case anyone else needs the link with the ARM11 timings: http://infocenter.arm.com/help/topic.../Cjaedced.html