English Amiga Board

English Amiga Board (http://eab.abime.net/index.php)
-   Coders. General (http://eab.abime.net/forumdisplay.php?f=37)
-   -   ARM Assembler Optimization (http://eab.abime.net/showthread.php?t=56514)

finkel 25 November 2010 14:35

ARM Assembler Optimization
 
There are probably better places to ask this specific question, but why not give it a try here first? :)

I need some documentation on how to optimize assembler code for pipelined execution on the ARM11 processor.

I remember some general rules for a generic RISC CPU from my university days (branch prediction, instruction interleaving, computational dependencies, etc.), but I have only a vague idea of what I should be doing on the ARM11.

I downloaded some technical references from the net, so at least I know that ARM11 has 8 pipeline stages, early-stage branch prediction, and so on, but some specific answers are hard to find.

For example, if I write something like this...

tst r0,#1
<other instructions>
beq l1

... how many other (non-dependent) instructions should I put between tst and beq so that the pipeline doesn't stall, and so that it flushes the minimal number of instructions in case of a branch misprediction?

Similar question for something like this:

add r0,r0,#1
<other instructions>
add r1,r1,r0 @ read-after-write: this add needs the r0 result from the first one

Has anyone seen any online documentation dealing with this kind of stuff?

Foul 25 November 2010 14:49

Would be better to ask here, I suppose:

http://eab.abime.net/forumdisplay.php?f=37

;)

alexh 25 November 2010 17:34

http://notaz.gp2x.de/cyclone.php

finkel 26 November 2010 11:47

Quote:

Originally Posted by Foul (Post 718042)
would be better to ask here i suppose :

http://eab.abime.net/forumdisplay.php?f=37

;)

Good idea. :) Can threads be moved, or do I have to post another one there?

finkel 26 November 2010 11:57

Quote:

Originally Posted by alexh (Post 718070)
http://notaz.gp2x.de/cyclone.php

Ah, I know about the Cyclone core. Actually, I intended to add 68020 support to it, but after seeing how much time it took me to add that support to the FAME C code in UAE4ALL, I fell back to writing only the emulation of the MULL and DIVL (32/64-bit) 68020 instructions in assembler.

I suspect those two are behind a terrible drop in fps, from 25 down to 15 and less, and I'm trying to optimize that code as much as I can. It's really strange that ARM doesn't have a DIV instruction, considering that it has ALU and MAC units running in parallel, so a DIV unit could have been added as well. Probably division would need too many cycles, which doesn't make much sense in a RISC CPU, but still...

I'll ask on some ARM forums as well. Either way, thanks for the help.

T_hairy_bootson 26 November 2010 14:27

There are some really talented ARM Assembly guys in the gp32/gp2x scene, you could try asking on gp32x.com or #gp2xdev on efnet.

Photon 28 November 2010 17:01

The first ARM was pretty simple with its 3 pipeline stages. Nowadays, with deeper pipelines and all kinds of caches and cache sizes, there are too many combinations to try to reach the 'optimum'. It's better to always apply some general techniques instead:

The first thing that will slow you down is memory. So the absolute first priority is to cut down on memory accesses, and that means writing few instructions, storing efficiently, and calculating from small values rather than fetching data that could be calculated. The built-in method for loading a register with a number is a perfect example of that; numeric constants take space. But if you know your caches, and you know (or make sure) that a loop and all the data it needs will fit in the caches, you can disregard memory speed.
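
For example, something like this (just a sketch, GNU as syntax assumed):

@ a constant the barrel shifter can't express has to come from memory:
ldr r0,=0x12345678 @ assembler parks the value in a literal pool, so this is a load

@ a value that fits '8 bits rotated by an even amount' stays inside the CPU:
mov r0,#0x1F00 @ no memory access at all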

Grouping LDM/STM 'cleverly' and placing them before several internal calculation instructions is vital for cache performance. STMIA performance varies between models, so the best approach is to 'saturate' the time the STMIA is busy storing with as many instructions as you can; if the STMIA finishes before all the internal calc instructions are done, you can split the STMIA up or, better, move the internal instructions somewhere else, after another store or so.
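
Roughly this kind of shape (just a sketch):

stmia r12!,{r0-r3} @ kick off the burst store
add r4,r4,r5 @ independent ALU work fills the time the store unit is busy
eor r6,r6,r4,lsl #2
sub r7,r7,#1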

Second is interleaving instructions, but on a system where the CPU is much faster than the memory, small stalls hurt performance MUCH less. Once the code does what it should, simply reorder it to put result-dependent instructions at the maximum distance from the result-calculating instruction. ARM models vary, so you can do no better than that.
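
E.g. (a quick sketch):

@ dependent pair back to back: the add has to wait for the load
ldr r1,[r0]
add r2,r2,r1

@ same work reordered: independent instructions hide the load-use delay
ldr r1,[r0]
add r3,r3,#4
eor r4,r4,r5
add r2,r2,r1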

For tight loops, the loop may or may not be 'spliceable' at all, i.e. omitting 9 of 10 decrements or checks + branches by repeating the body 10 times and dividing the loop counter by 10. Some loops can have their exit condition attached as a condition code to every instruction in the loop (except the load/check instruction) until the exit branch, but that's rare.
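
Something like this, unrolled by 4 to keep the example short (sketch; assumes the count in r2 is a multiple of 4):

copy4:
ldrb r3,[r0],#1
strb r3,[r1],#1
ldrb r3,[r0],#1
strb r3,[r1],#1
ldrb r3,[r0],#1
strb r3,[r1],#1
ldrb r3,[r0],#1
strb r3,[r1],#1
subs r2,r2,#4 @ one decrement + branch per 4 bytes instead of per byte
bne copy4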

Branches are normally counted as taking 3 cycle slots, non-executed instructions (including branches) 1. I guess you already know not to 'skip' 1-2 instructions using branches, but to make the skippable instructions conditional instead.
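
Roughly (a sketch):

@ instead of branching around two instructions...
cmp r0,#0
beq skip
add r1,r1,#1
add r2,r2,r1
skip:

@ ...make them conditional; they cost 1 slot each when not executed
cmp r0,#0
addne r1,r1,#1
addne r2,r2,r1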

Know your data. "Pre-massage" the data before a big operation on it, if it's worth it. That will help you minimize memory accesses.

finkel 30 November 2010 10:21

Quote:

Originally Posted by Photon (Post 718649)
The built-in method for loading a register with a number is a perfect example of that; numeric constants take space.

All very useful advice.

I was hoping to gain some speed by using as few memory access instructions as possible, but some loops (like the pfield_doline_nx() functions in all the Amiga emulators I've seen so far) require more registers than ARM has available. :)

Quick question - is there any fast way of loading non-standard constants, like 0x55555, other than storing them in memory buffers and LDR-ing them into registers?

One reference manual states that any constant in the range 0-65535 can be used as an imm16 operand, but for some reason "mov r0,#0x5555" won't execute (step over fails in the debugger).
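
The best workarounds I can think of so far look roughly like this (untested sketch, GNU as syntax); as far as I can tell, the imm16 form (MOVW) only appeared with ARMv6T2/ARMv7, which would explain why it fails on an ARM11:

ldr r0,=0x55555 @ assembler puts 0x55555 in a nearby literal pool, so this is one load
@ or build it from 8-bit-rotated immediates, no memory access:
mov r0,#0x55000
orr r0,r0,#0x500
orr r0,r0,#0x55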

Quote:

Originally Posted by Photon (Post 718649)
Grouping LDM/STM 'cleverly' and placing them before several internal calculation instructions is vital for cache performance.

I'm trying to do that. I'm moving some independent instructions in between an LDR and the instructions that use the loaded registers (and likewise after STR/STM), but I don't know how many I should put in between.

How many cycles does a memory read/write take if a data cache hit occurs? The buffers I'm using are less than 100 bytes long, so I guess the cache will not be a problem.

Quote:

Originally Posted by Photon (Post 718649)
Branches are normally counted as taking 3 cycle slots, nonexecuted instructions (including branches) 1.

Do mispredicted branches lose just 3 cycles, or did you mean that taken branches take 3 cycles?

Do you know how many independent instructions should be placed between a flags-changing instruction and the branch instruction that depends on that result? From what the pipeline looks like, I assume it should be something like 7 (stage 8, write-back, minus stage 2, branch prediction, plus 1).

1. Fe1 – Address is sent and instruction received
2. Fe2 – Much of the branch prediction goes here
3. De – Decode instruction
4. Iss – Read registers and issue instruction
5. Sh – Perform shift operations
6. ALU – Perform integer operations
7. Sat – Saturate results
8. WB – Write back data to registers


I'm trying to get every little detail here, since rewriting some of the emulator stuff in assembler seems to be worth the effort. Even with the unoptimized but pure assembler version of the pfield_doline_n6() function I wrote yesterday, there was something like a 15% speed improvement. If I could get that to 30% with optimizations (hard, but plenty of room for trying), that would really be something. 30% here, 30% there, and pretty soon we're talking about serious speedups. :)

finkel 30 November 2010 10:23

Quote:

Originally Posted by T_hairy_bootson (Post 718296)
There are some really talented ARM Assembly guys in the gp32/gp2x scene, you could try asking on gp32x.com or #gp2xdev on efnet.

Will definitely try there as well. :)

Photon 30 November 2010 21:12

I meant that in ARM programming classes you normally say that taken branches (executed branches) take 3 cycles. Cache and prediction affect this, but it's a rule of thumb. If you know 100% what's in the cache and which branches the data/calculated values will cause to execute, you can get a better picture of the real situation.

(Some ARM models don't have branch prediction, and I don't think any have 'user-suggested' branch prediction, but I could be wrong. If there's no way to specify it, most if not all CPUs default to predicting backward branches as 'taken' and forward ones as 'not taken'. So one way to help prediction is to make usually-non-taken branches point forward.)
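
So, roughly (a sketch):

loop:
ldrb r3,[r0],#1 @ fetch the next byte
cmp r3,#0x20
blo fixup @ rare case points forward: statically predicted not-taken
resume:
strb r3,[r1],#1
subs r2,r2,#1
bne loop @ hot loop branch points backward: statically predicted taken
bx lr
fixup:
mov r3,#0x20 @ clamp the rare value, then rejoin the common path
b resume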

finkel, as I said, the best thing is to put the result calculation as far before the result use as possible, but if you have a choice to make (multiple dependencies in the loop, how to reorder everything?), the only sure way is to test some combinations and measure the time (paste the equivalent loop into some simple time-measuring code).

Because there are many models of each ARM version, with different cache specs and different ratios to the memory clock, your CPU could run faster or slower than specified, and the memory may have wait states or DMA going on.

If you want an approximation, I'd say at least 3 internal instructions fit after 1 uncached store. Also read up on write cache flushing; timing that well could let you execute many internal instructions for free, or the other way around. For reads, you can't do anything with the loaded value directly afterwards anyway, even if it's cached, so just put any instruction there.

Basically what you're doing is the profiling and instruction rearrangement that some compilers do, except you're doing it manually. Maybe someone has dreamt up such a tool for pure asm? It wouldn't be unthinkable, considering how heavily ARM is used in embedded applications.

To be optimal you really must examine things carefully and know your exact CPU and memory interface. Knowing and abusing your cache, rearranging your data and your whole rendering engine (or whatever) to time loads and stores perfectly, while knowing what's in the cache at all times, takes a ninja coder god. :) There are a few such people in the demoscene, I know... it's hard, even on a fixed platform. Doing it for "many ARM setups" is a level harder still.

finkel 01 December 2010 11:56

Quote:

Originally Posted by Photon (Post 719231)
the only for sure way is to test some combinations and measure the time (paste equivalent loop in a simple time-measurer code).

That's exactly what I'm trying to avoid. Testing 101 different versions of the code is not something I'd be looking for.

I'll be perfectly happy with a "sub-optimized" version that still runs 30% faster than it does now. :)

Quote:

Originally Posted by Photon (Post 719231)
Because there are many models of each ARM version, with different cache specs and ratios to memory clock, your cpu could run faster or slower than specified, memory may have wait-states or DMA happening.

I'm not going to optimize for different ARM versions, mainly because I have only one ARM CPU to work with, and that one is in my S60 phone. Since I paid plenty of bucks for it, I'm certainly going to squeeze out every bit of processing power that I can.

I want the most bang for my buck, so to speak... :)

Anyway, thank you all for your help. Suggesting gp32x.com was an excellent idea. I got some superb advice, and a very useful link to instruction timings.

In case anyone else needs the link with ARM11 timings:
http://infocenter.arm.com/help/topic.../Cjaedced.html

