English Amiga Board


Old 07 August 2023, 19:04   #1
oRBIT
Zone Friend
 
Join Date: Apr 2006
Location: Gothenburg/Sweden
Age: 48
Posts: 339
Aligning code, does it matter?

Browsing through my old code and found quite a few
CNOP 0,8
CNOP 0,16
(or something similar, I think). It was related to aligning code somehow to make it run slightly faster, or so I heard back then...
Does anyone know more about this? Does it matter (and how much)?
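
For reference, `CNOP offset,align` pads the code until the position within the current section equals `offset` modulo `align`, so `CNOP 0,8` pads up to the next multiple of 8. A minimal C sketch of that arithmetic follows; the helper name `cnop_padding` is made up for illustration, and it assumes a power-of-two alignment.

```c
#include <assert.h>
#include <stdint.h>

/* Number of padding bytes that "CNOP offset,align" would insert when the
   current position inside the section is pc.
   Hypothetical helper, not part of any assembler; align must be a power of two. */
uint32_t cnop_padding(uint32_t pc, uint32_t offset, uint32_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    assert(offset < align);
    /* Unsigned wrap-around makes this the distance to the next address
       that is congruent to offset modulo align. */
    return (offset - pc) & (align - 1);
}
```

For example, at section offset 10 both `CNOP 0,8` and `CNOP 0,16` insert 6 bytes, while at offset 16 neither inserts anything.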
Old 07 August 2023, 19:43   #2
paraj
Registered User
 
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,105
Large alignment values are sort of dubious since normally you're not getting better than 8-byte alignment from LoadSeg (normal executable loader).
This obviously only matters for 020+ (000 only cares that it's word aligned). For 020, 4-byte alignment will matter, and could be the difference between a loop just fitting in the 256-byte I$ or not. For 060, cache lines are 16 bytes, so there could be some benefit from such an alignment, but I've never measured any difference. 030/040 may have different tradeoffs that I'm not aware of.
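
The "just fitting" case on the 68020 can be illustrated with a toy model of its instruction cache, which is direct mapped with 64 longword entries (256 bytes). The sketch below is a simplification under stated assumptions: every word of the loop is executed in order, and prefetch past the end of the loop is ignored. The function name `steady_state_misses` is invented for this example.

```c
#include <assert.h>
#include <stdint.h>

enum { ENTRIES = 64, ENTRY_BYTES = 4 };  /* 68020 I-cache: 64 longwords, direct mapped */

/* Cache misses per loop iteration once the loop has been running for a while.
   start = address of the first loop byte, size = loop length in bytes.
   Toy model only: straight-line code, no prefetch beyond the loop end. */
unsigned steady_state_misses(uint32_t start, uint32_t size)
{
    uint32_t tag[ENTRIES] = {0};
    int valid[ENTRIES] = {0};
    unsigned misses = 0;

    assert(size > 0 && start % 2 == 0 && size % 2 == 0);

    uint32_t first = start / ENTRY_BYTES;
    uint32_t last = (start + size - 1) / ENTRY_BYTES;

    /* Run a few passes; the count from the last pass is the steady state. */
    for (int pass = 0; pass < 3; pass++) {
        misses = 0;
        for (uint32_t lw = first; lw <= last; lw++) {
            uint32_t idx = lw % ENTRIES;
            if (!valid[idx] || tag[idx] != lw) {
                misses++;
                valid[idx] = 1;
                tag[idx] = lw;
            }
        }
    }
    return misses;
}
```

In this model a 256-byte loop starting on a longword boundary has no steady-state misses, while the same loop starting at an address that is 2 modulo 4 touches 65 longwords and misses twice per iteration, because the first and last longword compete for the same cache entry.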

So hot take from me: Anything more than CNOP 0,4 is useless. Now please someone else do the "someone is wrong on the internet" thing so I can be schooled
Old 07 August 2023, 20:10   #3
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by paraj View Post
Now please someone else do the "someone is wrong on the internet" thing so I can be schooled
Yeah, I'll happily do so.

I would advise never, ever doing this. Really.
But then again, I've disassembled quite a lot of code and this gets in the way.

Anyhow, my tests for 68030 have shown that it does not matter at all if a loop is 240 bytes or less.
As long as your cache lines are 16 bytes or more, it won't help, because allocated memory is only 8-byte aligned.
It may even backfire, as these padding words will take up space in the cache.
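
The 8-byte limit can be made concrete with a small sketch. The assembler only aligns relative to the start of the section, so the absolute alignment of a label also depends on where the loader places that section. Both function names below are hypothetical, and both alignments are assumed to be powers of two.

```c
#include <assert.h>
#include <stdint.h>

/* Largest alignment a label is certain to have in memory, given the
   alignment the loader guarantees for the section start (base_align) and
   the alignment requested with CNOP inside the section (cnop_align).
   Hypothetical helper; both values must be powers of two. */
uint32_t guaranteed_alignment(uint32_t base_align, uint32_t cnop_align)
{
    assert(base_align != 0 && (base_align & (base_align - 1)) == 0);
    assert(cnop_align != 0 && (cnop_align & (cnop_align - 1)) == 0);
    return base_align < cnop_align ? base_align : cnop_align;
}

/* Where a label really ends up inside a cache line of line_size bytes. */
uint32_t offset_in_line(uint32_t section_base, uint32_t label_offset,
                        uint32_t line_size)
{
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    return (section_base + label_offset) & (line_size - 1);
}
```

With a section base of $1200008, a label placed at section offset 32 by `CNOP 0,16` sits 8 bytes into a 16-byte line, so the padding bought nothing beyond 8-byte alignment.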
Old 07 August 2023, 21:04   #4
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,975
If I remember right from my very old tests of a c2p routine on 68040 and 68060, aligning code via CNOP (4, 8, 16) makes no sense. But placing the input data for the c2p routine at an address like e.g. $1200008 or $1200018 does make sense.

To be exact, the buffer must be allocated 8 bytes longer.
Example code showing what I mean:
Code:
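 move.l A0,D0      ; A0 = start of the allocated buffer (8 bytes longer than needed)
 lsr.l #4,D0       ; shift right, then left, by 4 bits:
 lsl.l #4,D0       ; clears the low 4 bits (round down to a 16-byte boundary)
 addq.l #8,D0      ; move to 8 bytes past that boundary
 move.l D0,A0      ; A0 = address of input buffer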
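
The same steps written as a C sketch, mainly to show why 8 spare bytes are enough. It assumes the incoming address is already 8-byte aligned, as mentioned earlier in the thread for allocated memory; the function name `to_8_mod_16` is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Mirror of the assembly above: clear the low 4 bits, then add 8.
   Assumes addr is 8-byte aligned, so the result is never below addr
   and never more than 8 bytes above it. */
uint32_t to_8_mod_16(uint32_t addr)
{
    assert((addr & 7) == 0);
    return ((addr >> 4) << 4) + 8;
}
```

An address ending in $0 is moved up by 8, and an address ending in $8 is left where it is.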

And if I remember right, someone told me that the A600 version of scsi.device was faster on 68020/68030 than the A1200 version. That was strange, because the 68000 version is 2 instructions longer than the 68020 one, due to an odd-address check. I didn't check whether this was a buffer alignment or a code alignment problem.
Old 08 August 2023, 11:46   #5
Thorham
Computer Nerd
 
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,764
Quote:
Originally Posted by meynaf View Post
Anyhow, my tests for 68030 have shown that it does not matter at all if a loop is 240 bytes or less.
Shouldn't it be possible to use all of the 256 bytes if your code is 16 or perhaps even 256 byte aligned? Quite unclear why it's 240 bytes. Just some overhead?
Old 08 August 2023, 12:21   #6
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by Thorham View Post
Shouldn't it be possible to use all of the 256 bytes if your code is 16 or perhaps even 256 byte aligned? Quite unclear why it's 240 bytes. Just some overhead?
I suppose this is the prefetch for what's next, at the end of the loop.
Old 08 August 2023, 12:26   #7
ross
Defendit numerus
 
 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,474
Quote:
Originally Posted by meynaf View Post
I suppose this is the prefetch for what's next, at the end of the loop.
Yes, this.
"Prefetch requests are simultaneously submitted to the cache holding register, the instruction cache, and the bus controller"
Old 08 August 2023, 13:22   #8
oRBIT
Zone Friend
 
Join Date: Apr 2006
Location: Gothenburg/Sweden
Age: 48
Posts: 339
So, I'm not sure how to interpret the answers...
Does aligning only make sense if there are loops involved (with 16-byte boundaries working best for 060 and lower)?
For example, would a simple subroutine benefit from this when there are no loops involved?
Makesomething:
nop
rts
Old 08 August 2023, 15:27   #9
SpeedGeek
Moderator
 
 
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 841
Alignment does help with performance in certain cases with 68020+, otherwise Motorola would not have recommended it. There are best case and worst case performance examples but Motorola was expecting the average case to be faster.

I mostly use Devpac and for some reason CNOP 0,4 doesn't always result in longword alignment. I suppose the internal defaults could have a higher priority than assembler directives in some cases but the Devpac manual does not give much detail about this.
Old 08 August 2023, 15:58   #10
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by SpeedGeek View Post
Alignment does help with performance in certain cases with 68020+, otherwise Motorola would not have recommended it.
Where did Motorola recommend it ? I don't remember having seen that in the user manual.
Old 08 August 2023, 17:58   #11
SpeedGeek
Moderator
 
 
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 841
Quote:
Originally Posted by meynaf View Post
Where did Motorola recommend it ? I don't remember having seen that in the user manual.
Code:
11.1 PERFORMANCE TRADEOFFS
The MC68030 maximizes average performance at the expense of worst case performance.
The time spent executing one instruction can vary from zero to over 100 clocks. Factors
affecting the execution time are the preceding and following instructions, the instruction
stream alignment, residency of operands and instruction words in the caches, residency of
address translations in the address translation cache, and operand alignment.
To increase the average performance of the MC68030, certain tradeoffs were made to
increase best case performance and to decrease the occurrence of worst case behavior. For
example, burst filling increases performance by prefetching data for later accesses, but it
commits the external bus controller and a cache for a longer period.
Code:
8.4 RETURN FROM EXCEPTIONS
After the processor has completed executing the exception handlers for all pending
exceptions, the processor resumes normal instruction execution at the address in the
processor’s vector table for the last exception processed. Once the exception handler has
completed execution, if possible the processor must return the system context as it was
prior to the exception using the RTE instruction. (If the internal data of the exception stack
frames are manipulated, M68040 may enter into an undefined state; this applies
specifically to the SSW on the access error stack frame.)
When the processor executes an RTE instruction, it examines the stack frame on top of
the active supervisor stack to determine if it is a valid frame and what type of context
restoration it requires. If during restoration, a stack frame has an odd address PC and an
SR that indicates user trace mode enabled, then an address error is taken. The SR
stacked for the address error has the SR S-bit set. For previous members of the M68000
family the S-bit is clear. When the M68040 writes or reads a stack frame, it uses longword
operand transfers wherever possible. Using a long-word-aligned stack pointer
greatly enhances exception processing performance. The processor does not necessarily
read or write the stack frame data in sequential order. The system software should not
depend on a particular exception generating a particular stack frame. For compatibility
with future devices, the software should be able to handle any format of stack frame for
any type of exception. The following paragraphs discuss in detail each stack frame format.
Code:
7.3 MISALIGNED OPERANDS
All M68040 data formats can be located in memory on any byte boundary. A byte operand
is properly aligned at any address; a word operand is misaligned at an odd address; and a
long word is misaligned at an address that is not evenly divisible by 4. However, since
operands can reside at any byte boundary, they can be misaligned. Although the M68040
does not enforce any alignment restrictions for data operands (including PC relative data
addressing), some performance degradation occurs when additional bus cycles are
required for long-word or word operands that are misaligned. For maximum performance,
data items should be aligned on their natural boundaries. All instruction words and
extension words must reside on word boundaries. Attempting to prefetch an instruction
word at an odd address causes an address error exception. Refer to Section 8 Exception
Processing for details on address error exceptions.
Old 08 August 2023, 18:28   #12
paraj
Registered User
 
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,105
You want to use explicit alignment when it increases speed and cache utilization, and these are somewhat interlinked. Say you have a small function: moveq #0,d0 / rts (i.e. 4 bytes). That function will greatly benefit from not straddling a cache line on all 68k processors with an I$ since you'll need at most one cache refill.
With regard to utilization, having the aforementioned function aligned will also help, since you won't have "useless" instructions in the cache. Or, like I mentioned earlier, proper alignment can mean the difference between a critical code path (not necessarily only a loop) fitting completely into the cache or bouncing in and out of it.
So when should you use explicit alignment? Like meynaf says, probably never, and if you do be sure to measure it and think about why and how it interacts with other parts of the code.
Unless they're part of the same "block" of code, aligning function starts (and perhaps very "hot" internal labels) to 4 bytes is probably decent. For loops, I don't see any direct benefit unless that's what will keep you under a critical level.
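
The point about a small function straddling a cache line comes down to simple arithmetic, sketched here in C. The line size is a parameter (4 bytes for the 68020 and 16 bytes for the 060, going by the figures in this thread), and the function name `cache_lines_touched` is invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Number of cache lines covered by size bytes of code starting at addr.
   line is the cache line size in bytes and must be a power of two. */
uint32_t cache_lines_touched(uint32_t addr, uint32_t size, uint32_t line)
{
    assert(size > 0);
    assert(line != 0 && (line & (line - 1)) == 0);
    return (addr + size - 1) / line - addr / line + 1;
}
```

The 4-byte moveq/rts function needs one 16-byte line when it starts at offset 12 within a line but two when it starts at offset 14; with 4-byte lines it needs one entry at a longword boundary and two otherwise.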

For 68020 only 4-byte alignment matters (since the cache lines are longword sized). It's probably more important to keep speed critical code together if possible.
For 060 (and probably 040) it matters a lot less overall since the I$ is comparatively large. Sure, there would be some benefit in doing clever 16-byte alignment stuff, but it would be really far down my list of things to try (certainly throwing in "cnop 0,16" at "random" is likely to make things worse, not better).

Some empirical results from a stock A1200 (so EC020 @14MHz) for different code sizes and loop alignments; in the table below, rows are code size and columns are loop alignment, both in bytes. Loop is just moveq #0,d1 with dbf d0,start (x10000). Time is in ns per moveq. As expected there is zero benefit until we get to the critical point, at which point it matters slightly. For 060/50 the same data is completely flat at 10ns/moveq.
Probably most interesting architecture in this regard is 030 with its tiny I$ and large cache line size, but meynaf already covered that.

Code:
Size    2    4    8   16
 220  146  146  146  146
 222  146  146  146  146
 224  146  146  146  146
 226  146  146  146  146
 228  146  146  146  146
 230  146  146  146  146
 232  145  146  146  146
 234  146  145  145  145
 236  145  146  146  146
 238  146  145  145  145
 240  145  146  146  146
 242  146  145  145  145
 244  145  146  146  146
 246  146  145  145  145
 248  145  146  146  146
 250  146  145  145  145
 252  145  146  146  146
 254  146  145  145  145
 256  153  146  146  146
 258  156  152  152  152
 260  159  155  155  155
 262  163  161  160  160
 264  168  165  165  165
 266  171  169  169  169
 268  176  172  172  172
 270  179  177  177  177
Attached Files
File Type: zip timing.zip (6.5 KB, 20 views)
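
To read the table in clock cycles rather than nanoseconds, the conversion is a one-liner. The exact clock values used below are assumptions, not from the post: 14.18758 MHz for a PAL A1200 (the post only says 14MHz) and 50 MHz for the 060. The helper name `ns_to_cycles` is made up.

```c
#include <assert.h>

/* Convert a time in nanoseconds to CPU clock cycles.
   clock_mhz is an assumption supplied by the caller. */
double ns_to_cycles(double ns, double clock_mhz)
{
    assert(ns >= 0.0 && clock_mhz > 0.0);
    return ns * clock_mhz / 1000.0;
}
```

With those assumed clocks, 146 ns works out to roughly 2.07 cycles per moveq on the EC020, and the flat 10 ns on the 060 is 0.5 cycles per moveq.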
Old 08 August 2023, 18:37   #13
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by SpeedGeek View Post
the instruction stream alignment
That it can have some importance, does not turn code aligning into a recommendation.


Quote:
Originally Posted by SpeedGeek View Post
and operand alignment.

(...)

Using a long-word-aligned stack pointer greatly enhances exception processing performance.

(...)

some performance degradation occurs when additional bus cycles are required for long-word or word operands that are misaligned.
This is data. Nothing to do with code alignment.

So no, sorry but they don't recommend it.
Old 08 August 2023, 18:52   #14
chb
Registered User
 
Join Date: Dec 2014
Location: germany
Posts: 439
AFAIR, the 68020 manual mentioned another, non-cache related benefit of alignment: The 68020 always prefetches 32 bits, and does not need an additional bus cycle if the next (word-length) instruction was already fetched as part of that long word. Similarly for multi-word instructions, a longword-length instruction may need one or two bus cycles, depending on alignment. For longer instruction sequences the difference is probably negligible, but code in memory with a lot of branching around, which resets the prefetch buffer, may IMHO benefit from longword-aligning the labels, especially on systems with bad memory performance (e.g. stock A1200).
Old 08 August 2023, 19:13   #15
SpeedGeek
Moderator
 
SpeedGeek's Avatar
 
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 841
Quote:
Originally Posted by meynaf View Post
That it can have some importance, does not turn code aligning into a recommendation.

This is data. Nothing to do with code alignment.

So no, sorry but they don't recommend it.
Operands can be instructions or data. Even for loops which benefit most from the instruction cache the operands first have to be read from memory before they can be placed in the cache. Eventually, some or all of the cache will have to be flushed. So, the operands must be read from memory again and again.

Also, it's possible the operands are located in 32 bit Chip RAM or Zorro3 bus driver ROM's and the cache disable logic or MMU table doesn't care if it's an instruction or data operand.

Regarding what's recommended, anything can be implied. But I didn't imply instruction operand alignment would result in a performance benefit in all cases.

Last edited by SpeedGeek; 08 August 2023 at 19:25.
Old 08 August 2023, 19:21   #16
paraj
Registered User
 
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,105
Quote:
Originally Posted by SpeedGeek View Post
Operands can be instructions or data. Even for loops which benefit most from the instruction cache the operands first have to be read from memory before they can be placed in the cache. Eventually, some or all of the cache will have to be flushed. So, the operands must be read from memory again and again.

Also, it's possible the operands are located in 32 bit Chip RAM or Zorro3 bus driver ROM's and the cache disable logic or MMU table doesn't care if it's an instruction or data operand.

Regarding what's recommended, anything can be implied. But I didn't imply alignment would result in a performance benefit in all cases.
Instruction and data caches are always separate on 68k though, so operands don't need to be considered for code alignment. They should definitely be longword aligned (if possible) for max performance, no disagreement there.
Old 08 August 2023, 19:24   #17
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by SpeedGeek View Post
Operands can be instructions or data. Even for loops which benefit most from the instruction cache the operands first have to be read from memory before they can be placed in the cache. Eventually, some or all of the cache will have to be flushed. So, the operands must be read from memory again and again.
Well, see paraj's reply.


Quote:
Originally Posted by SpeedGeek View Post
Also, it's possible the operands are located in 32 bit Chip RAM or Zorro3 bus driver ROM's and the cache disable logic or MMU table doesn't care if it's an instruction or data operand.
No. Code located in chipmem, or in ROM, is cached.
Old 08 August 2023, 19:49   #18
SpeedGeek
Moderator
 
 
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 841
Quote:
Originally Posted by paraj View Post
Instruction and data caches are always separate on 68k though, so operands don't need to be considered for code alignment. They should definitely be longword aligned (if possible) for max performance, no disagreement there.

The caches don't determine how the operands (instruction or data) are aligned in memory. So how do you conclude operands don't need to be considered and at the same time they should be longword aligned for max performance? Are these operands somehow magically aligning themselves?

Quote:
Originally Posted by meynaf View Post

No. Code located in chipmem, or in ROM, is cached.
Wrong. The C= A2630 and GVP G-Force 030 cache instructions in Chipmem but not in Zorro2 I/O space. I assume the A3000 Fat Gary Chip is functionally the same as the A2630. On the A2000 with a 16 bit data bus, longword alignment won't help much here but the A3000 & A4000 with a 32 bit data bus it could help quite a lot.

Last edited by SpeedGeek; 08 August 2023 at 20:00.
Old 08 August 2023, 20:09   #19
paraj
Registered User
 
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,105
Quote:
Originally Posted by SpeedGeek View Post
The caches don't determine how the operands (instruction or data) are aligned in memory. So how do you conclude operands don't need to be considered and at the same time they should be longword aligned for max performance? Are these operands somehow magically aligning themselves?
OP (as far as I understand) is asking whether aligning *code* in memory makes it run faster. AFAIK, but please correct me if I'm wrong, the alignment of the instructions has no direct interaction with the *data* operands in this regard. I.e. an aligned *code* loop accessing some data, assuming it fits in cache, will not run faster than an unaligned one accessing the same data, whether or not the accessed data is aligned.


If the *data* is aligned both will (potentially) run faster.
Old 08 August 2023, 20:51   #20
SpeedGeek
Moderator
 
 
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 841
Quote:
Originally Posted by paraj View Post
OP (as far as I understand) is asking whether aligning *code* in memory makes it run faster. AFAIK, but please correct me if I'm wrong, the alignment of the instructions has no direct interaction with the *data* operands in this regard. I.e. an aligned *code* loop accessing some data, assuming it fits in cache, will not run faster than an unaligned one accessing the same data, whether or not the accessed data is aligned.

If the *data* is aligned both will (potentially) run faster.
Your answer considers the execution time of loop code which fits in the instruction cache and has already been placed in the cache. My answer considers the time it takes to read from memory to fill the cache when the code is too large to fit or eventually gets flushed from the cache, requiring the code to be read again and again from memory. It also considers code read from cache-disabled memory.

BTW, I'm not suggesting every instruction operand should be longword aligned, just at the start of the code block.

Last edited by SpeedGeek; 09 August 2023 at 00:16.