Aligning code, does it matter?

oRBIT · 07 August 2023, 19:04

Browsing through my old code, I found quite a few
CNOP 0,8
CNOP 0,16
(or something similar, I think). I heard back then that it was related to aligning code somehow to make it run slightly faster...
Does anyone know more about this? Does it matter (and how much)?

paraj · 07 August 2023, 19:43

Large alignment values are sort of dubious since normally you're not getting better than 8-byte alignment from LoadSeg (normal executable loader).
This obviously only matters for 020+ (000 only cares that it's word aligned). For 020 4-byte alignment will matter, and could be the difference between a loop just fitting in 256-byte I$ or not. For 060 cachelines are 16-byte so there could be some benefit from such an alignment, but I've never measured any difference. 030/040's may have different tradeoffs that I'm not aware of.

So hot take from me: Anything more than CNOP 0,4 is useless. Now please someone else do the "someone is wrong on the internet" thing so I can be schooled

meynaf · 07 August 2023, 20:10

Quote:

Originally Posted by paraj

Now please someone else do the "someone is wrong on the internet" thing so I can be schooled

Yeah, I'll happily do so.

I would advise never, ever doing this. Really.
But then again, I've disassembled quite a lot of code and this gets in the way.

Anyhow, my tests for 68030 have shown that it does not matter at all if a loop is 240 bytes or less.
As long as your cache lines are 16 bytes or more, it won't help, because allocated memory is only 8-byte aligned.
It may even backfire, as these padding words will take up space in the cache.

Don_Adan · 07 August 2023, 21:04

If I remember right from my very old tests of a c2p routine on 68040 and 68060, aligning code via CNOP (4, 8, 16) makes no sense. But placing the input data for the c2p routine at an address like $1200008 or $1200018 does make sense.

To be exact, a buffer 8 bytes longer must be allocated.
Example code for what I mean:

Code:

 move.l A0,D0 ; A0 = start of allocated buffer (8-byte aligned)
 lsr.l #4,D0
 lsl.l #4,D0 ; round down to a 16-byte boundary
 addq.l #8,D0 ; step to an address ending in 8
 move.l D0,A0 ; address of input buffer

And if I remember right, someone told me that the A600 version of scsi.device was faster on 68020/68030 than the A1200 version. It was strange, because the 68000 version is 2 instructions longer than the 68020 one, due to an odd address check. I didn't check whether it was a buffer alignment or a code alignment problem.
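
A minimal sketch of how the 8 spare bytes and the adjustment above could fit together, assuming the buffer comes from exec's AllocMem (which hands out 8-byte aligned blocks). The names AllocC2PInput, BUFSIZE and RawPtr are made up for illustration and are not from the original routine. The raw pointer is kept because FreeMem needs the unadjusted address and the full size.

```asm
_LVOAllocMem    equ     -198            ; exec.library AllocMem offset
MEMF_ANY        equ     0
BUFSIZE         equ     320*256         ; hypothetical c2p input size

                section code,code

; Returns A0 = buffer at an address ending in 8, or 0 on failure.
AllocC2PInput:
                move.l  a6,-(sp)
                move.l  #BUFSIZE+8,d0   ; 8 spare bytes for the adjustment
                moveq   #MEMF_ANY,d1
                movea.l 4.w,a6          ; ExecBase
                jsr     _LVOAllocMem(a6)
                move.l  d0,RawPtr       ; keep for FreeMem(RawPtr,BUFSIZE+8)
                beq.s   .done           ; allocation failed, return 0
                and.w   #$fff0,d0       ; round down to a 16-byte boundary
                addq.l  #8,d0           ; then step to the address ending in 8
.done:          movea.l d0,a0
                movea.l (sp)+,a6
                rts

                section bss,bss
RawPtr:         ds.l    1
```

Since the raw address is a multiple of 8, rounding down and adding 8 gives either the raw address itself or raw+8, so the adjusted buffer always stays inside the allocation.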

Thorham · 08 August 2023, 11:46

Quote:

Originally Posted by meynaf

Anyhow, my tests for 68030 have shown that it does not matter at all if a loop is 240 bytes or less.

Shouldn't it be possible to use all of the 256 bytes if your code is 16 or perhaps even 256 byte aligned? Quite unclear why it's 240 bytes. Just some overhead?

meynaf · 08 August 2023, 12:21

Quote:

Originally Posted by Thorham

Shouldn't it be possible to use all of the 256 bytes if your code is 16 or perhaps even 256 byte aligned? Quite unclear why it's 240 bytes. Just some overhead?

I suppose this is the prefetch for what's next, at the end of the loop.

ross · 08 August 2023, 12:26

Quote:

Originally Posted by meynaf

I suppose this is the prefetch for what's next, at the end of the loop.

Yes, this.
"Prefetch requests are simultaneously submitted to the cache holding register, the instruction cache, and the bus controller"

oRBIT · 08 August 2023, 13:22

So, I'm not sure how to interpret the answers...
Aligning only makes sense if there are loops involved (and aligning to 16-byte boundaries would make it work best with 060 and lower)?
Does this only make sense for loops?
For example, would a simple subroutine benefit from this when there are no loops involved?
Makesomething:
nop
rts

SpeedGeek · 08 August 2023, 15:27

Alignment does help with performance in certain cases with 68020+, otherwise Motorola would not have recommended it. There are best case and worst case performance examples but Motorola was expecting the average case to be faster.

I mostly use Devpac and for some reason CNOP 0,4 doesn't always result in longword alignment. I suppose the internal defaults could have a higher priority than assembler directives in some cases but the Devpac manual does not give much detail about this.

meynaf · 08 August 2023, 15:58

Quote:

Originally Posted by SpeedGeek

Alignment does help with performance in certain cases with 68020+, otherwise Motorola would not have recommended it.

Where did Motorola recommend it ? I don't remember having seen that in the user manual.

SpeedGeek · 08 August 2023, 17:58

Quote:

Originally Posted by meynaf

Where did Motorola recommend it ? I don't remember having seen that in the user manual.

Code:

11.1 PERFORMANCE TRADEOFFS
The MC68030 maximizes average performance at the expense of worst case performance.
The time spent executing one instruction can vary from zero to over 100 clocks. Factors
affecting the execution time are the preceding and following instructions, the instruction
stream alignment, residency of operands and instruction words in the caches, residency of
address translations in the address translation cache, and operand alignment.
To increase the average performance of the MC68030, certain tradeoffs were made to
increase best case performance and to decrease the occurrence of worst case behavior. For
example, burst filling increases performance by prefetching data for later accesses, but it
commits the external bus controller and a cache for a longer period.

Code:

8.4 RETURN FROM EXCEPTIONS
After the processor has completed executing the exception handlers for all pending
exceptions, the processor resumes normal instruction execution at the address in the
processor’s vector table for the last exception processed. Once the exception handler has
completed execution, if possible the processor must return the system context as it was
prior to the exception using the RTE instruction. (If the internal data of the exception stack
frames are manipulated, M68040 may enter into an undefined state; this applies
specifically to the SSW on the access error stack frame.)
When the processor executes an RTE instruction, it examines the stack frame on top of
the active supervisor stack to determine if it is a valid frame and what type of context
restoration it requires. If during restoration, a stack frame has an odd address PC and an
SR that indicates user trace mode enabled, then an address error is taken. The SR
stacked for the address error has the SR S-bit set. For previous members of the M68000
family the S-bit is clear. When the M68040 writes or reads a stack frame, it uses longword
operand transfers wherever possible. Using a long-word-aligned stack pointer
greatly enhances exception processing performance. The processor does not necessarily
read or write the stack frame data in sequential order. The system software should not
depend on a particular exception generating a particular stack frame. For compatibility
with future devices, the software should be able to handle any format of stack frame for
any type of exception. The following paragraphs discuss in detail each stack frame format.

Code:

7.3 MISALIGNED OPERANDS
All M68040 data formats can be located in memory on any byte boundary. A byte operand
is properly aligned at any address; a word operand is misaligned at an odd address; and a
long word is misaligned at an address that is not evenly divisible by 4. However, since
operands can reside at any byte boundary, they can be misaligned. Although the M68040
does not enforce any alignment restrictions for data operands (including PC relative data
addressing), some performance degradation occurs when additional bus cycles are
required for long-word or word operands that are misaligned. For maximum performance,
data items should be aligned on their natural boundaries. All instruction words and
extension words must reside on word boundaries. Attempting to prefetch an instruction
word at an odd address causes an address error exception. Refer to Section 8 Exception
Processing for details on address error exceptions.

paraj · 08 August 2023, 18:28

You want to use explicit alignment when it increases speed and cache utilization, and these are somewhat interlinked. Say you have a small function: moveq #0,d0 / rts (i.e. 4 bytes). That function will greatly benefit from not straddling a cache line on all 68k processors with an I$ since you'll need at most one cache refill.
With regard to utilization, having the aforementioned function aligned will also help, since you won't have "useless" instructions in the cache. Or, like I mentioned earlier, proper alignment can mean the difference between a critical code path (not necessarily only a loop) fitting completely into the cache or bouncing in and out of it.
So when should you use explicit alignment? Like meynaf says, probably never, and if you do be sure to measure it and think about why and how it interacts with other parts of the code.
Unless they're part of the same "block" of code, aligning function starts (and perhaps very "hot" internal labels) to 4 bytes is probably decent. For loops, I don't see any direct benefit unless that's what will keep you under a critical level.

For 68020 only 4-byte alignment matters (since the cache lines are longword sized). It's probably more important to keep speed critical code together if possible.
For 060 (and probably 040) it matters a lot less overall since the I$ is comparatively large. Sure, there would be some benefit in doing clever 16-byte alignment stuff, but it would be really far down my list of things to try (certainly throwing in "cnop 0,16" at "random" is likely to make things worse, not better).

Some empirical results from stock A1200 (so EC020 @14MHz) for different code sizes and loop alignments. Loop is just moveq #0,d1 with dbf d0,start (x10000). Time is in ns per moveq. As expected there is zero benefit until we get to the critical point at which point it matters slightly. For 060/50 the same data is completely flat at 10ns/moveq.
Probably most interesting architecture in this regard is 030 with its tiny I$ and large cache line size, but meynaf already covered that.

Code:

Size    2    4    8   16
 220  146  146  146  146
 222  146  146  146  146
 224  146  146  146  146
 226  146  146  146  146
 228  146  146  146  146
 230  146  146  146  146
 232  145  146  146  146
 234  146  145  145  145
 236  145  146  146  146
 238  146  145  145  145
 240  145  146  146  146
 242  146  145  145  145
 244  145  146  146  146
 246  146  145  145  145
 248  145  146  146  146
 250  146  145  145  145
 252  145  146  146  146
 254  146  145  145  145
 256  153  146  146  146
 258  156  152  152  152
 260  159  155  155  155
 262  163  161  160  160
 264  168  165  165  165
 266  171  169  169  169
 268  176  172  172  172
 270  179  177  177  177
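
A loop of the kind described above might look roughly like this. It is only a sketch, not the actual test code: LOOPSIZE, LOOPALIGN and TimedLoop are names made up here, the size is assumed to include the dbf, and the timing code around the call is left out.

```asm
LOOPSIZE        equ     240             ; loop size in bytes (assumed to include the 4-byte dbf)
LOOPALIGN       equ     4               ; alignment of the loop start: 4, 8 or 16

                section code,code

TimedLoop:
                move.w  #10000-1,d0     ; dbf counts down to -1, so this gives 10000 passes
                bra.w   start           ; jump over any padding so it is never executed
                cnop    0,LOOPALIGN     ; use "cnop 2,4" for a word-only aligned start
start:
                rept    (LOOPSIZE-4)/2  ; one moveq per word of loop body
                moveq   #0,d1
                endr
                dbf     d0,start
                rts
```

Dividing the measured time by the number of moveq instructions executed would then give a ns-per-moveq figure comparable to the table.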

meynaf · 08 August 2023, 18:37

Quote:

Originally Posted by SpeedGeek

the instruction stream alignment

That it can have some importance, does not turn code aligning into a recommendation.

Quote:

Originally Posted by SpeedGeek

and operand alignment.

(...)

Using a long-word-aligned stack pointer greatly enhances exception processing performance.

(...)

some performance degradation occurs when additional bus cycles are required for long-word or word operands that are misaligned.

This is data. Nothing to do with code alignment.

So no, sorry but they don't recommend it.

chb · 08 August 2023, 18:52

AFAIR, the 68020 manual mentioned another, non-cache related benefit of alignment: The 68020 always prefetches 32 bits, and does not need an additional bus cycle if the next (word-length) instruction was already fetched as part of that long word. Similarly for multi-word instructions: a longword-length instruction may need one or two bus cycles, depending on alignment. For longer instruction sequences the difference is probably negligible, but code in memory with a lot of branching around that resets the prefetch buffer may IMHO benefit from longword-aligning the labels, especially on systems with bad memory performance (e.g. stock A1200).
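
As a rough illustration of longword-aligning labels that are only reached by branches, here is a sketch in Devpac/vasm-style syntax. The routine names are made up, and it assumes the section itself starts on a longword boundary, which is the case for code loaded by LoadSeg.

```asm
                section code,code

Main:
                bsr.w   HandlerA
                bsr.w   HandlerB
                rts

                cnop    0,4             ; pad up to the next longword boundary
HandlerA:                               ; 4 bytes: fits in one aligned longword,
                moveq   #1,d0           ; so a single 32-bit fetch can bring in
                rts                     ; both instructions

                cnop    0,4             ; adds nothing if already aligned
HandlerB:
                moveq   #2,d0
                rts
```

The padding sits after an rts each time, so it is never executed and only costs space. Without the cnop, a 4-byte handler starting at a word-only aligned address would span two longwords and need two fetches.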

SpeedGeek · 08 August 2023, 19:13

Quote:

Originally Posted by meynaf

That it can have some importance, does not turn code aligning into a recommendation.

This is data. Nothing to do with code alignment.

So no, sorry but they don't recommend it.

Operands can be instructions or data. Even for loops which benefit most from the instruction cache the operands first have to be read from memory before they can be placed in the cache. Eventually, some or all of the cache will have to be flushed. So, the operands must be read from memory again and again.

Also, it's possible the operands are located in 32 bit Chip RAM or Zorro3 bus driver ROM's and the cache disable logic or MMU table doesn't care if it's an instruction or data operand.

Regarding what's recommended, anything can be implied. But I didn't imply instruction operand alignment would result in a performance benefit in all cases.

paraj · 08 August 2023, 19:21

Quote:

Originally Posted by SpeedGeek

Operands can be instructions or data. Even for loops which benefit most from the instruction cache the operands first have to be read from memory before they can be placed in the cache. Eventually, some or all of the cache will have to be flushed. So, the operands must be read from memory again and again.

Also, it's possible the operands are located in 32 bit Chip RAM or Zorro3 bus driver ROM's and the cache disable logic or MMU table doesn't care if it's an instruction or data operand.

Regarding what's recommended, anything can be implied. But I didn't imply alignment would result in a performance benefit in all cases.

Instruction and data caches are always separate on 68k though, so operands don't need to be considered for code alignment. They should definitely be longword aligned (if possible) for max performance, no disagreement there.

meynaf · 08 August 2023, 19:24

Quote:

Originally Posted by SpeedGeek

Operands can be instructions or data. Even for loops which benefit most from the instruction cache the operands first have to be read from memory before they can be placed in the cache. Eventually, some or all of the cache will have to be flushed. So, the operands must be read from memory again and again.

Well, see paraj's reply.

Quote:

Originally Posted by SpeedGeek

Also, it's possible the operands are located in 32 bit Chip RAM or Zorro3 bus driver ROM's and the cache disable logic or MMU table doesn't care if it's an instruction or data operand.

No. Code located in chipmem, or in ROM, is cached.

SpeedGeek · 08 August 2023, 19:49

Quote:

Originally Posted by paraj

Instruction and data caches are always separate on 68k though, so operands don't need to be considered for code alignment. They should definitely be longword aligned (if possible) for max performance, no disagreement there.

The caches don't determine how the operands (instruction or data) are aligned in memory. So how do you conclude operands don't need to be considered and at the same time they should be longword aligned for max performance? Are these operands somehow magically aligning themselves?

Quote:

Originally Posted by meynaf

No. Code located in chipmem, or in ROM, is cached.

Wrong. The C= A2630 and GVP G-Force 030 cache instructions in Chipmem but not in Zorro2 I/O space. I assume the A3000 Fat Gary Chip is functionally the same as the A2630. On the A2000 with a 16 bit data bus, longword alignment won't help much here but the A3000 & A4000 with a 32 bit data bus it could help quite a lot.

paraj · 08 August 2023, 20:09

Quote:

Originally Posted by SpeedGeek

The caches don't determine how the operands (instruction or data) are aligned in memory. So how do you conclude operands don't need to be considered and at the same time they should be longword aligned for max performance? Are these operands somehow magically aligning themselves?

OP (as far as I understand) is asking whether aligning *code* in memory makes it run faster. AFAIK, but please correct me if I'm wrong, the alignment of the instructions has no direct interaction with the *data* operands in this regard. I.e. an aligned *code* loop accessing some data, assuming it fits in cache, will not run faster than an unaligned one accessing the same data, whether or not the accessed data is aligned or not.

If the *data* is aligned both will (potentially) run faster.

SpeedGeek · 08 August 2023, 20:51

Quote:

Originally Posted by paraj

OP (as far as I understand) is asking whether aligning *code* in memory makes it run faster. AFAIK, but please correct me if I'm wrong, the alignment of the instructions has no direct interaction with the *data* operands in this regard. I.e. an aligned *code* loop accessing some data, assuming it fits in cache, will not run faster than an unaligned one accessing the same data, whether or not the accessed data is aligned or not.

If the *data* is aligned both will (potentially) run faster.

Your answer considers the execution time of loop code which fits in the instruction cache and was already placed in the cache. My answer considers the time it takes to read from memory to fill the cache when the code is too large to fit or eventually gets flushed from the cache, requiring the code to be read again and again from memory. It also considers code read from cache disabled memory.

BTW, I'm not suggesting every instruction operand should be longword aligned, just at the start of the code block.
