25 November 2010, 14:35 | #1 |
Registered User
Join Date: Aug 2010
Location: x
Posts: 36
ARM Assembler Optimization
There are probably better places to ask this specific question, but why not give it a try here first?
I need some documentation on how to optimize assembler code for pipeline execution on ARM11 processor. I remember some general rules for some generic RISC CPU from my university days, like branch prediction, instruction interleaving, computational dependencies, etc, but I have only some vague ideas what I should be doing with ARM11. I downloaded some technical references from the net, so at least I know that ARM11 has 8 pipeline stages, early-stage branch prediction, and so on, but some specific answers are hard to find. For example, if I write something like this... tst r0,#1 <other instructions> beq l1 ... how many other (non-dependent) instructions should I put between tst and beq so that pipeline doesn't stall, and flushes minimal number of instructions in case of branch misprediction? Similar question for something like this: add r0,#1 <other instructions> add r1,r0 ;@ write-read dependence of r1 on r0 Has anyone seen any online documentation dealing with this kind of stuff? |
25 November 2010, 14:49 | #2 |
Registered User
|
25 November 2010, 17:34 | #3 |
Thalion Webshrine
Join Date: Jan 2004
Location: Oxford
Posts: 14,337
|
26 November 2010, 11:47 | #4 |
Registered User
Join Date: Aug 2010
Location: x
Posts: 36
26 November 2010, 11:57 | #5 |
Registered User
Join Date: Aug 2010
Location: x
Posts: 36
I suspect those two are responsible for a terrible drop in fps, from 25 down to 15 and less, and I'm trying to optimize that code as much as I can. It's really strange that the ARM doesn't have a DIV instruction, considering that it has ALU & MAC units running in parallel, so a DIV unit could have been added as well. Probably division needs too many cycles, which doesn't make much sense in a RISC CPU, but still... I'll ask on some ARM forums as well. Either way, thanks for the help.
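For what it's worth, the usual workaround for the missing DIV is to turn division by a known constant into a multiply by a precomputed reciprocal via UMULL, which the ARM11 does have. A minimal sketch (register choices are arbitrary; the magic constant here is the standard one compilers use for an unsigned divide by 10, i.e. ceil(2^35/10)):

```asm
    @ sketch: r1 = r0 / 10 (unsigned), without a DIV instruction
    ldr   r2, =0xCCCCCCCD      @ reciprocal constant, placed in a literal pool
    umull r3, r1, r0, r2       @ r1:r3 = r0 * magic; r1 gets the high 32 bits
    mov   r1, r1, lsr #3       @ r1 = high_word >> 3 = r0 / 10
```

The same trick only works when the divisor is fixed at assembly time; a general variable divide still needs a subtract-and-shift loop or a library call.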
26 November 2010, 14:27 | #6 |
Workbitch 1.3
Join Date: Oct 2001
Location: Melbourne, Australia
Age: 46
Posts: 2,084
There are some really talented ARM Assembly guys in the gp32/gp2x scene, you could try asking on gp32x.com or #gp2xdev on efnet.
28 November 2010, 17:01 | #7 |
Moderator
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,602
The first ARM was pretty simple with its 3 pipeline stages. Nowadays, with deeper pipelines and all kinds of caches and cache sizes, there are too many combinations to try to reach the 'optimum'. It's better to always apply some general techniques instead:
The first thing that will slow you down is memory. So the absolute first priority is to cut down on memory accesses, and that means writing few instructions, storing efficiently, and calculating from small values rather than fetching data that could be calculated. The built-in method for loading a register with a number is a perfect example of that; numeric constants take space. But if you know your caches and you know (or make sure) that a loop and all the data it needs will fit in the caches, you can disregard memory speed.

Grouping LDM/STM 'cleverly' and placing them before several internal calculation instructions is vital for cache performance. STMIA performance varies between models, so the best way is to 'saturate' the busy time, while STMIA is off storing, with as many instructions as you can; if the STMIA is done before all the internal calc instructions are, you can split the STMIA up or, better, move the internal instructions somewhere else, after another store or so.

Second is interleaving instructions, but on a system where the CPU is much faster than the memory, some small stalls affect performance MUCH less. Once the code does what it should, simply reorder it to put result-dependent instructions at maximum distance from the result-calculating instructions. ARM models vary, so you can do no better than that.

For tight loops, the loop may or may not be 'spliceable' at all, i.e. omitting 9 of 10 decrements or checks + branches by repeating the body 10 times and dividing the loop counter by 10. Some loops can have their exit condition added to every instruction in the loop (except the load/check instruction) until there's an exit branch, but that's rare.

Branches are normally counted as taking 3 cycle slots, non-executed instructions (including branches) 1. I guess you already know not to 'skip' 1-2 instructions using branches, but to make the skippable instructions conditional instead.

Know your data. "Pre-massage" the data before a big operation on it, if it's worth it.

That will help you minimize memory accesses.
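The loop-splicing idea could look roughly like this, a hypothetical byte-copy loop unrolled 4x so the decrement/branch overhead is paid once per four iterations (assumes the count in r2 is a multiple of 4):

```asm
copy:                          @ sketch: copy r2 bytes from [r0] to [r1]
    ldrb  r3, [r0], #1         @ body repeated 4 times...
    strb  r3, [r1], #1
    ldrb  r3, [r0], #1
    strb  r3, [r1], #1
    ldrb  r3, [r0], #1
    strb  r3, [r1], #1
    ldrb  r3, [r0], #1
    strb  r3, [r1], #1
    subs  r2, r2, #4           @ ...so only one count/branch per 4 bytes
    bne   copy
```

A real loop would of course load and store words with LDM/STM instead of single bytes; this only shows the splicing itself.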
30 November 2010, 10:21 | #8 |
Registered User
Join Date: Aug 2010
Location: x
Posts: 36
I was hoping to gain some speed by using as few memory accesses as possible, but some loops (like the pfield_doline_nx() functions in all Amiga emulators I've seen so far) require more registers than the ARM has available.

Quick question: is there any fast way of loading non-standard constants, like 0x55555, other than storing them in memory buffers and LDR-ing them into registers? One reference manual states that any constant in the range 0-65535 can be used as an imm16 operand, but for some reason "mov r0,#0x5555" won't execute (stepping over it fails in the debugger).
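If I'm not mistaken, that mov fails because a classic ARM data-processing immediate is an 8-bit value rotated right by an even amount, so 0x5555 simply doesn't fit; the imm16 MOVW encoding only appeared with ARMv6T2, which plain ARM11 cores predate. Two common workarounds, sketched with arbitrary registers:

```asm
    @ (a) let the assembler park the constant in a literal pool:
    ldr  r0, =0x55555          @ assembled as a PC-relative load
    @ (b) build it from valid rotated immediates, no memory access:
    mov  r0, #0x55000          @ 0x55 rotated into bits 12-19
    orr  r0, r0, #0x500
    orr  r0, r0, #0x55         @ r0 = 0x55000 + 0x500 + 0x55 = 0x55555
```

Variant (b) costs three instructions but no load, so inside a hot loop it may still win over the literal-pool LDR if the pool line isn't cached.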
How many cycles do memory reads/writes take if a data-cache hit occurs? The buffers I'm using are less than 100 bytes long, so I guess the cache will not be a problem.
Do you know how many independent instructions should be placed between a flag-setting instruction and the branch that depends on its result? From what the pipeline looks like, I assume it should be something like 7 (stage 8, write-back, minus stage 2, branch prediction, plus 1):

1. Fe1 – Address is sent and instruction received
2. Fe2 – Much of the branch prediction goes here
3. De – Decode instruction
4. Iss – Read registers and issue instruction
5. Sh – Perform shift operations
6. ALU – Perform integer operations
7. Sat – Saturate results
8. WB – Write back data to registers

I'm trying to get every little detail right here, since rewriting some of the emulator code in assembler seems to be worth the effort. Even with the unoptimized, but pure assembler, version of the pfield_doline_n6() function I wrote yesterday, there was something like a 15% speed improvement. If I could get that to 30% with optimizations (hard, but plenty of room for trying), that would really be something. 30% here, 30% there, and pretty soon we're talking about serious speedups.
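Just to pin down the shape being asked about (the filler instructions below are arbitrary placeholders, and whether 7 really is the ideal distance on a given ARM11 core is exactly the open question):

```asm
    tst   r0, #1               @ flag-setting instruction hoisted early
    ldr   r4, [r5], #4         @ independent work fills the gap...
    add   r6, r6, r4
    mov   r7, r7, lsl #1
    eor   r8, r8, r6           @ ...none of it touches the flags
    beq   l1                   @ branch tests flags settled long ago
```

Note the fillers must not be flag-setting variants (no S suffix, no CMP/TST) or the condition for beq would be destroyed.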
30 November 2010, 10:23 | #9 |
Registered User
Join Date: Aug 2010
Location: x
Posts: 36
|
30 November 2010, 21:12 | #10 |
Moderator
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,602
I meant that in ARM programming classes, taken branches (executed branches) are normally said to take 3 cycles. Cache and prediction affect this, but it's a rule of thumb. If you know exactly what's in the cache and which branches your data and calculated values will cause to execute, you can judge the real situation better.

(Some ARM models don't have branch prediction, and I don't think any have "user-suggested" branch prediction, but I could be wrong. If there's no way to specify, most if not all CPUs default to predicting backward branches as 'taken' and forward ones as 'not taken'. So one way to help prediction is to make usually-not-taken branches point forward.)

finkel, as I said, the best thing is to put result calculation as far before result use as possible, but if you have a choice to make (multiple dependencies in the loop, how to reorder everything?) the only sure way is to test some combinations and measure the time (paste an equivalent loop into a simple time-measuring harness). Because there are many models of each ARM version, with different cache specs and ratios to memory clock, your CPU could run faster or slower than specified, and memory may have wait states or DMA happening. If you want an approximation, I'd say at least 3 internal instructions fit after 1 uncached store. Also read up on write-buffer/cache flushing; timing that right could let many internal instructions execute for free, or the other way around. For reads, you can't do anything with the read value directly afterwards anyway, even if it's cached, so just put any instruction there.

Basically what you're doing is the profiling and rearranging of instructions that some compilers do, except manually. Maybe someone has dreamt up such a tool for pure asm? It wouldn't be unthinkable, considering how heavily ARM is used in embedded applications.

To be optimal you really must examine things carefully and know your exact CPU and memory interface. Knowing and abusing your cache, rearranging your data and whole rendering engine (or whatever) to time loads and stores perfectly, while all the time knowing what's in the cache, takes a ninja coder god. There are a few such people in the demoscene, I know... it's hard, even on a fixed platform. Doing it for "many ARM setups" is a level harder still.
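As a sketch of that "roughly 3 internal instructions fit after 1 uncached store" rule of thumb (registers and the work done are hypothetical; the exact overlap depends on the core and memory interface):

```asm
    str   r4, [r6], #4         @ store heads out through the write buffer
    add   r7, r7, r8           @ independent ALU work overlaps with it...
    eor   r9, r9, r7, ror #2
    subs  r5, r5, #1           @ ...so these cycles come almost for free
```

The point is only that none of the interleaved instructions touch the bus or depend on the store; anything of that shape will do.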
01 December 2010, 11:56 | #11 |
Registered User
Join Date: Aug 2010
Location: x
Posts: 36
I'll be perfectly happy with a "sub-optimized" version that still runs 30% faster than it does now.

I want the most bang for my buck, so to speak... Anyway, thank you all for your help. Suggesting gp32x.com was an excellent idea. I got some superb advice, and a very useful link to instruction timings. In case anyone else needs the link with the ARM11 timings: http://infocenter.arm.com/help/topic.../Cjaedced.html