24 May 2019, 20:03 | #1 |
Natteravn
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,496
Abs to PC-rel optimization on 68040
I'm currently discussing a new code generation feature with Volker for vbccm68k, whenever there are function calls with pointers passed in data registers (e.g. dos.library/Write).
Let's assume you want to pass a string-constant pointer in d2. Currently the code would always be "move.l #strconst,d2", although strconst resides in the same (code) section, since it is constant. Better would be: Code:
lea strconst(pc),a0
move.l a0,d2
I remember the 68040 had some strange cases where absolute addressing is fastest. Can anybody confirm?
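To make the trade-off concrete, here is a sketch of the two alternatives with their encoded sizes (sizes from the standard 68k encodings; strconst is assumed to be a label in the same code section): Code:
; absolute: 6 bytes, needs a 32-bit relocation entry
	move.l	#strconst,d2

; pc-relative: 4 + 2 = 6 bytes, no relocation, but it
; trashes a0 and only reaches +/-32KB from the pc
	lea	strconst(pc),a0
	move.l	a0,d2
Both sequences occupy six bytes of code; what the pc-relative form saves is the relocation entry in the executable and the work of applying it at load time.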
24 May 2019, 23:31 | #2 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,039
Yeah, d(pc) is slower than addr(.w/l) on an 040. EA calcs are pipelined and done in parallel with execution, and pc relative EA calc takes more time than straight absolute (thus a higher chance to stall the pipeline).
For example, these numbers are measured on a real 040 (first number is raster time after 1000 repetitions, second is approximated cycles): Code:
lea (d.w,a0),a1        ; 2391  2
lea (d.w,pc),a0        ; 4954  4
lea ($1234).w,a0       ; 1112  1
lea ($12345678),a0     ; 1112  1
move. #,d0             ; 1112  1
movea. #,a0            ; 1112  1
move. ($12345678),d0   ; 1159  1
movea.w ($12345678),a0 ; 3551  3
movea.l ($12345678),a0 ; 2264  2
25 May 2019, 03:26 | #3 |
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
It is worse on 060 as well. The move.l #imm,an would go into one of the pipelines, taking a single instruction slot, and pair with anything else; the suggested replacement instruction pair will take two instruction slots.
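A sketch of that pairing argument, using the pOEP/sOEP pipeline names from the 68060 User's Manual (the actual pairing rules have more conditions than shown here): Code:
; single instruction: occupies one issue slot and can
; dual-issue with an independent neighbouring instruction
	move.l	#strconst,d2

; two instructions: the move.l needs the a0 result of the
; lea, so the pair cannot issue in the same cycle
	lea	strconst(pc),a0
	move.l	a0,d2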
25 May 2019, 12:11 | #4 |
Natteravn
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,496
Thanks a lot for this valuable information! So we will disable that kind of optimization for 68040 and 68060.
Also interesting to see that (d,An) is faster than (d,PC) on the 040. That means it also doesn't make much sense to move constant data into the code section when small data (A4 base-relative) is enabled.
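For comparison, a sketch of the small-data form (the _strconst label is made up; vbcc's small-data mode is assumed to address the merged data section via a4): Code:
; small data: (d16,a4) instead of (d16,pc) --
; (d,An) is the faster EA calc on the 040
	lea	_strconst(a4),a0
	move.l	a0,d2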
05 June 2019, 23:45 | #5 |
Moderator
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,602
Editing comment, cos I read too quickly. I thought this was about memory accesses at first but it isn't, it's about pipelining.
Isn't the reason the pipeline stall from store -> use, though? This should definitely be the fastest way to do it from 68000 to 68030. Make sure to do a cache clear before measuring on all CPUs.
Last edited by Photon; 05 June 2019 at 23:53.
06 June 2019, 14:00 | #6 |
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,546
Even if it's not the fastest, it still saves 2 bytes of object code and 6 bytes on disk, which might be more important than being slightly slower (if it is). Those 2 bytes saved might just make the difference when trying to fit a loop in the cache too, especially if main memory is slow.

Of course if you need fully relocatable code then it is the only option, but that is rarely a requirement on the Amiga.

For me the biggest advantage of lea over #xxxx is that it makes the code easier to follow, which reduces errors and is just nicer. One reason I like programming the Amiga in assembler is the aesthetics of 68K assembly language. I would rather have nice-looking code than the ultimate speed, and torturing the ISA to wring the last drop out of a particular processor is not my style.

With lea you can see at a glance that this is an address, not immediate data. I like to keep addresses in address registers and data in data registers wherever practicable, because it makes sense and is how they were supposed to be used. I hate CPUs that just have 'registers' whose function you can't tell by looking at them (even worse when some have a special purpose and so can't be used for general stuff).

I can't think of any specific case where the speed of this operation would be critical. The OP's example "Let's assume you want to pass a string-constant pointer in d2" implies that a lot of time may be spent processing that data. Why would you be putting the address into D2? Probably for a DOS function which will execute thousands or millions of instructions. The few cycles you save in your code by using #xxxx are insignificant.
06 June 2019, 19:10 | #7 | |||
Natteravn
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,496
Although, the person who requested this feature argued similarly. Still, I don't like the idea of generating slower code just to save a relocation.
07 June 2019, 02:55 | #8 | ||
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,546
I don't worry about such minutiae unless I have a critical piece of code that needs to run as fast as possible, and even then it can often be improved more by changing higher-level algorithms.

As an example of how chasing the fastest code can be counterproductive, I saw a program that included a complicated 'copymemquick' function designed to copy non-aligned memory etc. as fast as possible, then only used it once to copy a few bytes! The code was several times larger than the data it had to copy.
07 June 2019, 15:46 | #9 | |||
Natteravn
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,496
Loading time might be long or short, but in any case a few relocation entries more or less are always absolutely insignificant. And the program will be loaded only once. On the other hand, you don't know how often your program will call such a function. It might be millions of times, or forever, when running as a server.

Nevertheless, I'm not saying this modification would generally be bad. vbcc also has the -size option, to optimise for size. We could put it there.
08 June 2019, 06:45 | #10 | ||||
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,546
If you know that a particular function will be called a lot and its internal operations are quick, optimizing the call might just be worth it. But if you really want to save cycles then perhaps you shouldn't be using it at all!
Most of my code is not optimized, but I don't care. It's more important to get it working properly and make it seem fast and responsive to the user. I would never optimize for 68040 because the slower machines need it much more, and I certainly don't want to emulate the PC way of doing things (yes, we know our code is dog slow on current machines, but when the next models come out...).
08 June 2019, 10:28 | #11 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
I would be strongly against adding this kind of optimization at the assembler level, because it trashes A0. I use a macro for lea to Dn in the rare cases it's needed.
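Such a macro might look like this minimal sketch (the macro name is made up, a0 as the scratch register is an assumption, and the syntax targets a Devpac/PhxAss-style assembler): Code:
LEAD	MACRO		; \1 = label, \2 = destination data register
	lea	\1(pc),a0	; a0 is trashed -- the macro makes that explicit
	move.l	a0,\2
	ENDM

; usage:
	LEAD	strconst,d2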
But for the code generation of a compiler, it makes sense.
08 June 2019, 13:56 | #12 |
Natteravn
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,496
I was always talking about vbcc's m68k code generator. The assembler would never do such optimizations.
09 June 2019, 11:25 | #13 | ||
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,546
I don't think using a '__reg("d0")' attribute is a great idea because it reduces portability for very little gain, but the 'lea' optimization reduces file size in all cases and has an insignificant effect on 040 execution speed, so applying it all the time is fine.
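For context, a sketch of what such register-parameter declarations look like in vbcc's C dialect (this mirrors dos.library/Write's d1/d2/d3 register convention; in practice these prototypes are normally generated from the SDK rather than written by hand): Code:
LONG Write(__reg("d1") BPTR file, __reg("d2") APTR buffer, __reg("d3") LONG length);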
09 June 2019, 18:40 | #14 | |||
Natteravn
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,496
09 June 2019, 18:52 | #15 |
Registered User
Join Date: May 2013
Location: Grimstad / Norway
Posts: 839
BTW, I think BAsm lets you set temp/trash registers for its optimizer IIRC. Just saying.
(And again, does anyone know what happened to the ca. 1990 GNU Superoptimizer?)
11 June 2019, 11:33 | #16 | |
Natteravn
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,496
Back to topic:
There also seems to be no difference on the 030, so the only CPU at a disadvantage from this optimization would be the 040.
11 June 2019, 15:19 | #17 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,039
Can't speak from experience on the 060 (e.g. how it behaves with properly optimized/pipelined code, etc.).
Looking at the docs, chapter 10 (Instruction Timing) of the 060 User's Manual, compared to the 040: for the 040 you can see that the EA calc for d(pc)/d(pc,rx) is up to 4 cycles slower than d(ax)/d(ax,ry), depending on the instruction/scenario, and that's what I get (the numbers I posted earlier are not from properly pipelined code and stalls happen, but that's the point). For the 060, after a quick glance, I couldn't see any differences in timings except for PEA (1 cycle faster; I thought maybe it was a typo, but there was nothing about it in the errata).
12 June 2019, 13:27 | #18 | |
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
12 June 2019, 13:31 | #19 |
Natteravn
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,496
Ok. Makes sense. Thanks for the explanation!