English Amiga Board


24 May 2019, 20:03   #1
phx
Natteravn
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,496
Abs to PC-rel optimization on 68040

I'm currently discussing a new code generation feature for vbccm68k with Volker. It applies to function calls where pointers are passed in data registers (e.g. dos.library/Write).

Let's assume you want to pass a string-constant pointer in d2. Currently the code would always be "move.l #strconst,d2", although strconst resides in the same (code) section, as it is constant. Better would be:
Code:
        lea     strconst(pc),a0
        move.l  a0,d2
which saves the relocation entry, and shouldn't be any slower - at least on a 68000. Is there a CPU where this is slower?

I remember the 68040 had some strange cases, where absolute addressing is fastest. Can anybody confirm?

24 May 2019, 23:31   #2
a/b
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,039
Yeah, d(pc) is slower than addr(.w/l) on an 040. EA calcs are pipelined and done in parallel with execution, and pc relative EA calc takes more time than straight absolute (thus a higher chance to stall the pipeline).
For example, these numbers were measured on a real 040 (first number is raster time after 1000 repetitions, second is approximate cycles):
lea (d.w,a0),a1         ; 2391 2
lea (d.w,pc),a0         ; 4954 4
lea ($1234).w,a0        ; 1112 1
lea ($12345678),a0      ; 1112 1
move. #,d0              ; 1112 1
movea. #,a0             ; 1112 1
move. ($12345678),d0    ; 1159 1
movea.w ($12345678),a0  ; 3551 3
movea.l ($12345678),a0  ; 2264 2
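
A minimal sketch of how the second column could be derived from the first. It assumes that the 1112 raster-time reading corresponds to exactly one cycle per instruction and that the other readings are rounded to the nearest multiple of it; the function name and the baseline constant are illustrative, not taken from the measurement code.

```c
/* Assumption: 1112 is the raster time of 1000 repetitions of an
   instruction that takes exactly one cycle on the 040. */
#define BASELINE_RASTER 1112

/* Round raster_time / BASELINE_RASTER to the nearest whole cycle. */
int approx_cycles(int raster_time)
{
    return (raster_time + BASELINE_RASTER / 2) / BASELINE_RASTER;
}
```

With this rounding, 4954 for lea (d.w,pc),a0 maps to 4 cycles and 2391 for lea (d.w,a0),a1 maps to 2, matching the table.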

25 May 2019, 03:26   #3
Kalms
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
It is worse on 060 as well. The move.l #imm,an would go into one of the pipelines, taking a single instruction slot, and pair with anything else; the suggested replacement instruction pair will take two instruction slots.

25 May 2019, 12:11   #4
phx
Thanks a lot for this valuable information! So we will disable that kind of optimization for 68040 and 68060.

Also interesting to see that (d,An) is faster than (d,PC) on the 040. Which means it also doesn't make much sense to move constant data to the code section, when small-data (A4 base-relative) is enabled.

05 June 2019, 23:45   #5
Photon
Moderator
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,602
Editing comment, cos I read too quickly. I thought this was about memory accesses at first but it isn't, it's about pipelining.

Isn't the reason the pipeline stall from store -> use, though?

This should definitely be the fastest way to do it from 68000 to 68030. Make sure to do a cache clear before measuring on all CPUs.

Last edited by Photon; 05 June 2019 at 23:53.

06 June 2019, 14:00   #6
Bruce Abbott
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,546
Even if it's not the fastest it still saves 2 bytes of object code and 6 bytes on disk, which might be more important than being slightly slower (if it is). Those 2 bytes saved might just make the difference when trying to fit a loop in the cache too, especially if main memory is slow.

Of course if you need fully relocatable code then it is the only option, but that is rarely a requirement on the Amiga.

But for me the biggest advantage of lea over #xxxx is that it makes the code easier to follow, which reduces errors and is just nicer. One reason I like programming the Amiga in assembler is the aesthetics of 68K assembly language. I would rather have nice looking code than the ultimate speed, and torturing the ISA to wring the last drop out of a particular processor is not my style.

The advantage of using lea is that you can see at a glance that this is an address, not immediate data. I like to keep addresses in address registers and data in data registers wherever practicable, because it makes sense and is how they were supposed to be used. I hate CPUs that just have 'registers' whose function you can't tell by looking at them (even worse when some have a special purpose and so can't be used for general stuff).

I can't think of any specific case where the speed of this operation would be critical. The OP's example "Let's assume you want to pass a string-constant pointer in d2" implies that a lot of time may be spent processing that data. Why would you be putting the address into D2? Probably for a DOS function which will execute thousands or millions of instructions. The few cycles you save in your code by using #xxxx are insignificant.

06 June 2019, 19:10   #7
phx
Quote:
Originally Posted by Bruce Abbott
Even if it's not the fastest it still saves 2 bytes of object code and 6 bytes on disk,
No. Unfortunately not. The code size will be the same (don't forget the additional MOVE instruction). You just save the relocation.
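
A small sketch of the size arithmetic behind that statement. The instruction sizes are the standard 68k encodings; the 4 bytes per relocation is an assumption (one 32-bit offset in the executable's relocation table, ignoring per-table overhead), and the struct and function names are made up for illustration.

```c
/* Encoded sizes in bytes of the instructions under discussion. */
enum {
    SIZE_MOVE_L_IMM_DN = 6, /* move.l #imm32,Dn : opcode word + 32-bit immediate   */
    SIZE_LEA_D16_PC_AN = 4, /* lea (d16,PC),An  : opcode word + 16-bit displacement */
    SIZE_MOVE_L_AN_DN  = 2  /* move.l An,Dn     : opcode word only                  */
};

/* Assumption: one relocation costs a single 32-bit offset on disk. */
enum { RELOC32_ENTRY_BYTES = 4 };

struct cost {
    int code_bytes;  /* bytes in the code section          */
    int reloc_bytes; /* bytes of relocation data on disk   */
};

/* move.l #strconst,d2 : absolute address, needs one relocation. */
struct cost cost_absolute(void)
{
    struct cost c = { SIZE_MOVE_L_IMM_DN, RELOC32_ENTRY_BYTES };
    return c;
}

/* lea strconst(pc),a0 + move.l a0,d2 : PC-relative, no relocation. */
struct cost cost_pc_relative(void)
{
    struct cost c = { SIZE_LEA_D16_PC_AN + SIZE_MOVE_L_AN_DN, 0 };
    return c;
}
```

Both variants occupy 6 bytes of code; under the stated assumption the only difference is the relocation data.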

Quote:
One reason I like programming the Amiga in assembler is the aesthetics of 68K assembly language.
We have something in common.

Quote:
The OP's example "Let's assume you want to pass a string-constant pointer in d2" implies that a lot of time may be spent processing that data.
True, in this case. But not in every case. And the compiler doesn't always know what the called function is doing.

Although, the person who requested this feature argued similarly. Still, I don't like the idea of generating slower code to save a relocation.

07 June 2019, 02:55   #8
Bruce Abbott
Quote:
Originally Posted by phx
No. Unfortunately not. The code size will be the same (don't forget the additional MOVE instruction). You just save the relocation.
I understand now, you are considering this only for function calls that have pointers passed in data registers (ie. DOS library). In that case it is pointless worrying about execution time.

Quote:
Still I don't like the idea to generate slower code for saving a relocation.
The code may not be slower if loading time is significant. That reloc has to be read from a storage device into memory, then processed and poked into the code. All those extra cycles just to save a minuscule amount at run-time - how many DOS calls before you are ahead?

I don't worry about such minutiae unless I have a critical piece of code that needs to run as fast as possible, and even then it can often be improved more by changing higher level algorithms. As an example of how trying to create the fastest code can be counterproductive, I saw a program that included a complicated 'copymemquick' function designed to copy non-aligned memory etc. as fast as possible, then only used it once to copy a few bytes! The code was several times larger than the data it had to copy.

07 June 2019, 15:46   #9
phx
Quote:
Originally Posted by Bruce Abbott
I understand now, you are considering this only for function calls that have pointers passed in data registers (ie. DOS library).
Yes.

Quote:
In that case it is pointless worrying about execution time.
No. Because these functions will not automatically be DOS calls, or even OS-calls. You could also define your own function and tell it to pass pointers in data registers (__reg("d0") attribute).
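
A minimal sketch of such a user-defined function. __reg("d0") is the vbcc syntax mentioned above; the REG() wrapper, the check for a predefined __VBCC__ macro, and both function names are assumptions added so the example also builds with other compilers, where the register request is simply dropped.

```c
#include <stddef.h>

/* Assumption: vbcc predefines __VBCC__. Elsewhere REG() expands to nothing. */
#ifdef __VBCC__
#define REG(r) __reg(r)
#else
#define REG(r)
#endif

/* Hypothetical function that receives its pointer argument in d0. */
size_t count_chars(REG("d0") const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}

/* Caller: the address of the string constant has to end up in d0,
   which is exactly the case the proposed code generation affects. */
size_t greeting_length(void)
{
    return count_chars("hello");
}
```

Nothing here is an OS call, yet the compiler still has to load a constant's address into a data register at every call site.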

Quote:
The code may not be slower if loading time is significant. That reloc has to be read from a storage device into memory, then processed and poked into the code. All those extra cycles just to save a minuscule amount at run-time - how many DOS calls before you are ahead?
You are serious?
Loading time might be long or short, but in any case a few relocation entries more or less are always absolutely insignificant. And the program will be loaded only once. On the other hand, you don't know how often your program will call such a function. Might be millions of times. Or forever, when running as a server.

Nevertheless, I don't say this modification would generally be bad. vbcc also has the -size option, to optimise for size. We could put it there.

08 June 2019, 06:45   #10
Bruce Abbott
Quote:
Originally Posted by phx
You could also define your own function and tell it to pass pointers in data registers (__reg("d0") attribute).
That would be silly, since the first thing the function would do is put it back into an address register. DOS library uses data registers because it is (was) written in BCPL, but nothing else I have seen does it.

Quote:
You are serious?
Yes. Programmers should concentrate on improving the overall efficiency and functionality of their code, not waste time trying to shave a few cycles off stuff that isn't used that often.

Quote:
Loading time might be long or short, but in any case a few relocation entries more or less are always absolutely insignificant. And the program will be loaded only once.
That's not necessarily true. Actually loading and relocating can take significant time, and a program may be loaded many times if invoked by a script etc. When loaded from disk or over a network, the total amount of data manipulation and code executed to just do that one reloc is staggering. I stand by my conclusion that in many cases the amount of time lost will never be made up for during the program's lifetime.

If you know that a particular function will be called a lot and its internal operations are quick, optimizing the call might just be worth it. But if you really want to save cycles then perhaps you shouldn't be using it at all!

Quote:
Nevertheless, I don't say this modification would generally be bad. vbcc also has the -size option, to optimise for size. We could put it there.
I suggest not doing any optimizations of this type, and letting the programmer decide what instructions to use. The more the assembler does behind your back the harder it is to keep track of what the code is really doing. Most of these little tweaks don't make much difference anyway.

Most of my code is not optimized, but I don't care. It's more important to get it working properly and make it seem fast and responsive to the user. I would never optimize for 68040 because the slower machines need it much more, and I certainly don't want to emulate the PC way of doing things (yes, we know our code is dog slow on current machines, but when the next models come out...).

08 June 2019, 10:28   #11
meynaf
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
I would be strongly against adding this kind of optimization at the assembler level, because it trashes A0. I use a macro for lea to Dn in the rare cases it's needed.
But for the code generation of a compiler, it makes sense.

08 June 2019, 13:56   #12
phx
I was always talking about vbcc's m68k code generator. The assembler would never do such optimizations.

09 June 2019, 11:25   #13
Bruce Abbott
Quote:
Originally Posted by phx
I was always talking about vbcc's m68k code generator. The assembler would never do such optimizations.
We might be getting a bit off-track here. Vbcc compiles to asm, then uses vasm to create the binary, right? So as long as optimizations are only done by the compiler and not when the assembler is used directly I have no problem with it.

I don't think using a '__reg("d0")' attribute is a great idea because it reduces portability for very little gain, but the 'lea' optimization reduces file size in all cases and has an insignificant effect on 040 execution speed, so applying it all the time is fine.


Quote:
Originally Posted by meynaf
I would be strongly against adding this kind of optimization at the assembler level, because it trashes A0.
When calling DOS functions (the only practical use case) A0 is trashed anyway. But it's still nice to know that what you wrote is what you get, even if just to avoid confusion when debugging.

09 June 2019, 18:40   #14
phx
Quote:
Originally Posted by Bruce Abbott
Vbcc compiles to asm, then uses vasm to create the binary, right?
Right.

Quote:
So as long as optimizations are only done by the compiler and not when the assembler is used directly I have no problem with it.
That makes me very happy!

Quote:
I don't think using a '__reg("d0")' attribute is a great idea because it reduces portability for very little gain
Right. It rarely makes sense for your own C functions. But you need to specify register arguments when doing OS-calls, or interfacing with assembler routines.

09 June 2019, 18:52   #15
NorthWay
Registered User
Join Date: May 2013
Location: Grimstad / Norway
Posts: 839
BTW, I think BAsm lets you set temp/trash registers for its optimizer IIRC. Just saying.
(And again, anyone know what happened to the ca 1990 GNU Superoptimizer?)

11 June 2019, 11:33   #16
phx
Back to topic:
Quote:
Originally Posted by Kalms
It is worse on 060 as well. The move.l #imm,an would go into one of the pipelines, taking a single instruction slot, and pair with anything else; the suggested replacement instruction pair will take two instruction slots.
Tests on real hardware (CSPPC/060) have shown that this is not the case for 060. The 040 is much faster with move-immediate, but there was no difference with the 060.

There also seems to be no difference for the 030, so the only CPU with a disadvantage out of this optimization would be the 040.

11 June 2019, 15:19   #17
a/b
Can't speak from experience on 060 (e.g. how it behaves when you have a properly optimized/pipelined code, etc.).
Looking at the docs (chapter 10, Instruction Timing, in the 060 User's Manual) and comparing with the 040: for the 040 you can see that EA calc for d(pc)/d(pc,rx) is up to 4 cycles slower than d(ax)/d(ax,ry), depending on instruction/scenario, and that's what I get (the numbers I posted earlier are not properly pipelined code and stalls happen, but that's the point). For the 060, after a quick glance, I couldn't see any differences in timings except for PEA (1 cycle faster; I thought maybe it was a typo, but there was nothing about it in the errata).

12 June 2019, 13:27   #18
Kalms
Quote:
Originally Posted by phx
Back to topic:

Tests on real hardware (CSPPC/060) have shown that this is not the case for 060. The 040 is much faster with move-immediate, but there was no difference with the 060.

There also seems to be no difference for the 030, so the only CPU with a disadvantage out of this optimization would be the 040.
Given the nature of the difference (one "instruction slot" more consumed by the code), in some situations it will have the same runtime speed on 060; in some situations it will be 1 cycle slower. Still - as you note - the difference is minuscule in practice.

12 June 2019, 13:31   #19
phx
Ok. Makes sense. Thanks for the explanation!