07 August 2023, 19:04 | #1 |
Zone Friend
Join Date: Apr 2006
Location: Gothenburg/Sweden
Age: 48
Posts: 339
|
Aligning code, does it matter?
Browsing through my old code, I found quite a few
CNOP 0,8 and CNOP 0,16 directives (or something similar, I think). They were related to aligning code somehow to make it run slightly faster, or so I heard back then... Does anyone know more about this? Does it matter (and how much)? |
07 August 2023, 19:43 | #2 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,105
|
Large alignment values are sort of dubious, since normally you're not getting better than 8-byte alignment from LoadSeg (the normal executable loader).
This obviously only matters for 020+ (000 only cares that code is word aligned). For 020, 4-byte alignment will matter, and could be the difference between a loop just fitting in the 256-byte I$ or not. For 060, cache lines are 16 bytes, so there could be some benefit from such an alignment, but I've never measured any difference. 030/040 may have different tradeoffs that I'm not aware of. So hot take from me: anything more than CNOP 0,4 is useless. Now please someone else do the "someone is wrong on the internet" thing so I can be schooled |
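A rough way to see why large CNOP values are dubious, sketched in Python rather than 68k assembly so it can be run anywhere. It assumes that CNOP offset,alignment pads relative to the start of the section, and that the loader places that section at an address that is only 8-byte aligned, as described above. The function names, the label offset and the load address are all made up for illustration.

```python
def cnop_padding(section_offset: int, offset: int, alignment: int) -> int:
    """Padding bytes that CNOP offset,alignment would insert at section_offset."""
    return (offset - section_offset) % alignment


def aligned_offset(section_offset: int, offset: int, alignment: int) -> int:
    """Section-relative position of the label that follows the CNOP."""
    return section_offset + cnop_padding(section_offset, offset, alignment)


# Hypothetical values: a label at offset $36 inside a section that the
# loader happened to place at an 8-byte (but not 16-byte) aligned address.
LOAD_BASE = 0x1200008
LABEL_OFFSET = 0x36

for alignment in (4, 8, 16):
    rel = aligned_offset(LABEL_OFFSET, 0, alignment)
    absolute = LOAD_BASE + rel
    print(f"CNOP 0,{alignment}: offset ${rel:X}, address ${absolute:X}, "
          f"address mod {alignment} = {absolute % alignment}")
```

With these numbers the CNOP 0,4 and CNOP 0,8 labels end up aligned in memory, while the CNOP 0,16 label lands 8 bytes off a 16-byte boundary even though it is 16-byte aligned within the section.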
07 August 2023, 20:10 | #3 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
I would advise to never ever do this. Really. But then again, I've disassembled quite a lot of code, and this gets in the way. Anyhow, my tests on 68030 have shown that it does not matter at all if a loop is 240 bytes or less. As long as your cache lines are 16 bytes or more, it won't help, because allocated memory is only 8-byte aligned. It may even backfire, as these padding words will take up space in the cache. |
|
07 August 2023, 21:04 | #4 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,975
|
If I remember right from my very old tests of a c2p routine on 68040 and 68060, aligning code via CNOP (4, 8, 16) makes no sense. But placing the input data for the c2p routine at an address like $1200008 or $1200018 does make sense.
To be exact, a buffer 8 bytes longer must be allocated. Example code for what I mean: Code:
 move.l A0,D0 ; A0 = address of input buffer
 lsr.l #4,D0
 lsl.l #4,D0 ; round down to a 16-byte boundary
 addq.l #8,D0 ; then move 8 bytes past it
 move.l D0,A0 ; A0 = adjusted address of input buffer
And if I remember right, someone told me that the A600 version of scsi.device was faster on 68020/68030 than the A1200 version. That was strange, because the 68000 version is 2 instructions longer than the 68020 one, due to an odd address check. I didn't check whether this was a buffer alignment or a code alignment problem. |
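The address arithmetic in that snippet can be checked with a small Python model. This is only a sketch: the function name is hypothetical, and it assumes the buffer start is 8-byte aligned, which is what the thread says allocated memory guarantees.

```python
def shift_to_8_mod_16(addr: int) -> int:
    """Same arithmetic as the lsr.l #4 / lsl.l #4 / addq.l #8 sequence."""
    return ((addr >> 4) << 4) + 8


# Hypothetical 8-byte aligned buffer addresses.
for addr in (0x1200000, 0x1200008, 0x1200010):
    new = shift_to_8_mod_16(addr)
    print(f"${addr:X} -> ${new:X} (skips {new - addr} bytes)")
```

For an 8-byte aligned start the result is never more than 8 bytes past the original address, which is why 8 extra bytes are enough. For a start address that is not 8-byte aligned the result could fall before the buffer, so the sketch relies on that assumption.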
08 August 2023, 11:46 | #5 |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,764
|
|
08 August 2023, 12:21 | #6 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
|
08 August 2023, 12:26 | #7 |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,474
|
|
08 August 2023, 13:22 | #8 |
Zone Friend
Join Date: Apr 2006
Location: Gothenburg/Sweden
Age: 48
Posts: 339
|
So, not sure how I should interpret the answers..
Aligning only makes sense if there are loops involved (and aligning to 16-byte boundaries would make it work best with the 060 and lower)? Does this only make sense for loops? For example, would a simple subroutine benefit from this when there are no loops involved?
Makesomething:
 nop
 rts |
08 August 2023, 15:27 | #9 |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 841
|
Alignment does help with performance in certain cases with 68020+, otherwise Motorola would not have recommended it. There are best case and worst case performance examples but Motorola was expecting the average case to be faster.
I mostly use Devpac and for some reason CNOP 0,4 doesn't always result in longword alignment. I suppose the internal defaults could have a higher priority than assembler directives in some cases but the Devpac manual does not give much detail about this. |
08 August 2023, 15:58 | #10 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
|
08 August 2023, 17:58 | #11 | |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 841
|
Quote:
Code:
11.1 PERFORMANCE TRADEOFFS
The MC68030 maximizes average performance at the expense of worst case performance. The time spent executing one instruction can vary from zero to over 100 clocks. Factors affecting the execution time are the preceding and following instructions, the instruction stream alignment, residency of operands and instruction words in the caches, residency of address translations in the address translation cache, and operand alignment. To increase the average performance of the MC68030, certain tradeoffs were made to increase best case performance and to decrease the occurrence of worst case behavior. For example, burst filling increases performance by prefetching data for later accesses, but it commits the external bus controller and a cache for a longer period.
Code:
8.4 RETURN FROM EXCEPTIONS
After the processor has completed executing the exception handlers for all pending exceptions, the processor resumes normal instruction execution at the address in the processor’s vector table for the last exception processed. Once the exception handler has completed execution, if possible the processor must return the system context as it was prior to the exception using the RTE instruction. (If the internal data of the exception stack frames are manipulated, M68040 may enter into an undefined state; this applies specifically to the SSW on the access error stack frame.) When the processor executes an RTE instruction, it examines the stack frame on top of the active supervisor stack to determine if it is a valid frame and what type of context restoration it requires. If during restoration, a stack frame has an odd address PC and an SR that indicates user trace mode enabled, then an address error is taken. The SR stacked for the address error has the SR S-bit set. For previous members of the M68000 family the S-bit is clear. When the M68040 writes or reads a stack frame, it uses longword operand transfers wherever possible. Using a long-word-aligned stack pointer greatly enhances exception processing performance. The processor does not necessarily read or write the stack frame data in sequential order. The system software should not depend on a particular exception generating a particular stack frame. For compatibility with future devices, the software should be able to handle any format of stack frame for any type of exception. The following paragraphs discuss in detail each stack frame format.
Code:
7.3 MISALIGNED OPERANDS
All M68040 data formats can be located in memory on any byte boundary. A byte operand is properly aligned at any address; a word operand is misaligned at an odd address; and a long word is misaligned at an address that is not evenly divisible by 4. However, since operands can reside at any byte boundary, they can be misaligned. Although the M68040 does not enforce any alignment restrictions for data operands (including PC relative data addressing), some performance degradation occurs when additional bus cycles are required for long-word or word operands that are misaligned. For maximum performance, data items should be aligned on their natural boundaries. All instruction words and extension words must reside on word boundaries. Attempting to prefetch an instruction word at an odd address causes an address error exception. Refer to Section 8 Exception Processing for details on address error exceptions. |
|
08 August 2023, 18:28 | #12 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,105
|
You want to use explicit alignment when it increases speed and cache utilization, and these are somewhat interlinked. Say you have a small function: moveq #0,d0 / rts (i.e. 4 bytes). That function will greatly benefit from not straddling a cache line on all 68k processors with an I$ since you'll need at most one cache refill.
With regards to utilization, having the aforementioned function aligned will also help, since you won't have "useless" instructions in the cache. Or, like I mentioned earlier, proper alignment can mean the difference between a critical code path (not necessarily only a loop) fitting completely into the cache or bouncing in and out of it.
So when should you use explicit alignment? Like meynaf says, probably never, and if you do, be sure to measure it and think about why and how it interacts with other parts of the code. Unless they're part of the same "block" of code, aligning function starts (and perhaps very "hot" internal labels) to 4 bytes is probably decent. For loops, I don't see any direct benefit unless that's what will keep you under a critical level.
For 68020 only 4-byte alignment matters (since the cache lines are longword sized). It's probably more important to keep speed critical code together if possible. For 060 (and probably 040) it matters a lot less overall since the I$ is comparatively large. Sure, there would be some benefit in doing clever 16-byte alignment stuff, but it would be really far down my list of things to try (certainly throwing in "cnop 0,16" at "random" is likely to make things worse, not better).
Some empirical results from a stock A1200 (so EC020 @14MHz) for different code sizes and loop alignments. The loop is just moveq #0,d1 with dbf d0,start (x10000). Time is in ns per moveq. As expected, there is zero benefit until we get to the critical point, at which point it matters slightly. For 060/50 the same data is completely flat at 10ns/moveq. Probably the most interesting architecture in this regard is the 030 with its tiny I$ and large cache line size, but meynaf already covered that. Code:
Size    2    4    8   16
 220  146  146  146  146
 222  146  146  146  146
 224  146  146  146  146
 226  146  146  146  146
 228  146  146  146  146
 230  146  146  146  146
 232  145  146  146  146
 234  146  145  145  145
 236  145  146  146  146
 238  146  145  145  145
 240  145  146  146  146
 242  146  145  145  145
 244  145  146  146  146
 246  146  145  145  145
 248  145  146  146  146
 250  146  145  145  145
 252  145  146  146  146
 254  146  145  145  145
 256  153  146  146  146
 258  156  152  152  152
 260  159  155  155  155
 262  163  161  160  160
 264  168  165  165  165
 266  171  169  169  169
 268  176  172  172  172
 270  179  177  177  177 |
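The straddling and "just fitting" arguments can be modelled with a few lines of Python. This is a simplified sketch, not a cycle-accurate model: it assumes a 256-byte instruction cache made of 64 longword lines, as described for the 68020 in this thread, treats the code as one contiguous block, and uses made-up addresses and function names.

```python
LINE_BYTES = 4     # one longword per line, per the 68020 description above
CACHE_LINES = 64   # 256-byte I$ / 4 bytes per line


def lines_touched(start: int, size: int, line_bytes: int = LINE_BYTES) -> int:
    """Number of cache lines covered by `size` bytes of code at `start`."""
    first = start // line_bytes
    last = (start + size - 1) // line_bytes
    return last - first + 1


def fits_in_cache(start: int, size: int) -> bool:
    """True if a contiguous block needs no more lines than the cache has."""
    return lines_touched(start, size) <= CACHE_LINES


# The 4-byte moveq/rts function at hypothetical addresses:
print(lines_touched(0x100, 4))   # longword aligned
print(lines_touched(0x102, 4))   # word aligned only, straddles two lines

# Contiguous blocks around the 256-byte limit, aligned vs. 2 bytes off:
for size in (254, 256, 258):
    print(size, fits_in_cache(0x100, size), fits_in_cache(0x102, size))
```

In this model a 256-byte block fits when it starts on a longword boundary but needs a 65th line when it starts 2 bytes off, and a 258-byte block does not fit either way. If the Size column is read as the loop's size in bytes and the "2" column as a start that is word aligned but not longword aligned, that seems consistent with the 256 and 258 rows of the table.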
08 August 2023, 18:37 | #13 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
That it can have some importance, does not turn code aligning into a recommendation.
Quote:
So no, sorry but they don't recommend it. |
|
08 August 2023, 18:52 | #14 |
Registered User
Join Date: Dec 2014
Location: germany
Posts: 439
|
AFAIR, the 68020 manual mentioned another, non-cache-related benefit of alignment: the 68020 always prefetches 32 bits, and does not need an additional bus cycle if the next (word-length) instruction was already fetched as part of that longword. Similarly for multi-word instructions: a longword-length instruction may need one or two bus cycles, depending on alignment. For longer instruction sequences the difference is probably negligible, but code in memory with a lot of branching around, which resets the prefetch buffer, may IMHO benefit from longword-aligning the labels, especially on systems with bad memory performance (e.g. stock A1200).
|
08 August 2023, 19:13 | #15 | |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 841
|
Quote:
Also, it's possible the operands are located in 32 bit Chip RAM or Zorro3 bus driver ROMs, and the cache disable logic or MMU table doesn't care if it's an instruction or data operand. Regarding what's recommended, anything can be implied. But I didn't imply instruction operand alignment would result in a performance benefit in all cases. Last edited by SpeedGeek; 08 August 2023 at 19:25. |
|
08 August 2023, 19:21 | #16 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,105
|
Quote:
|
|
08 August 2023, 19:24 | #17 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
No. Code located in chipmem, or in ROM, is cached. |
|
08 August 2023, 19:49 | #18 | |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 841
|
Quote:
The caches don't determine how the operands (instruction or data) are aligned in memory. So how do you conclude operands don't need to be considered and at the same time they should be longword aligned for max performance? Are these operands somehow magically aligning themselves? Wrong. The C= A2630 and GVP G-Force 030 cache instructions in Chipmem but not in Zorro2 I/O space. I assume the A3000 Fat Gary Chip is functionally the same as the A2630. On the A2000 with a 16 bit data bus, longword alignment won't help much, but on the A3000 & A4000 with a 32 bit data bus it could help quite a lot. Last edited by SpeedGeek; 08 August 2023 at 20:00. |
|
08 August 2023, 20:09 | #19 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,105
|
Quote:
If the *data* is aligned both will (potentially) run faster. |
|
08 August 2023, 20:51 | #20 | |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 841
|
Quote:
BTW, I'm not suggesting every instruction operand should be longword aligned, just at the start of the code block. Last edited by SpeedGeek; 09 August 2023 at 00:16. |
|