24 May 2019, 20:03 | #1 |
Natteravn
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,496
Abs to PC-rel optimization on 68040
I'm currently discussing a new code generation feature with Volker for vbccm68k, whenever there are function calls with pointers passed in data registers (e.g. dos.library/Write).
Let's assume you want to pass a string-constant pointer in d2. Currently the code would always be "move.l #strconst,d2", although strconst resides in the same (code) section, since it is constant. Better would be: Code:
lea strconst(pc),a0
move.l a0,d2
I remember the 68040 had some strange cases where absolute addressing is fastest. Can anybody confirm?
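To make the trade-off concrete, here is a sketch of the two alternatives with their encoded sizes (sizes from the standard 68k encodings; strconst is assumed to be a label in the same code section): Code:
; absolute: 6 bytes, needs a 32-bit relocation entry
	move.l	#strconst,d2

; pc-relative: 4 + 2 = 6 bytes, no relocation, but it
; trashes a0 and only reaches +/-32KB from the pc
	lea	strconst(pc),a0
	move.l	a0,d2
Both sequences occupy six bytes of code; what the pc-relative form saves is the relocation entry in the executable and the work of applying it at load time.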
24 May 2019, 23:31 | #2 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,039
Yeah, d(pc) is slower than addr(.w/l) on an 040. EA calcs are pipelined and done in parallel with execution, and pc relative EA calc takes more time than straight absolute (thus a higher chance to stall the pipeline).
For example, these numbers are measured on a real 040 (first number is raster time after 1000 repetitions, second is approximated cycles): Code:
lea (d.w,a0),a1        ; 2391  2
lea (d.w,pc),a0        ; 4954  4
lea ($1234).w,a0       ; 1112  1
lea ($12345678),a0     ; 1112  1
move. #,d0             ; 1112  1
movea. #,a0            ; 1112  1
move. ($12345678),d0   ; 1159  1
movea.w ($12345678),a0 ; 3551  3
movea.l ($12345678),a0 ; 2264  2
25 May 2019, 03:26 | #3 |
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
It is worse on 060 as well. The move.l #imm,an would go into one of the pipelines, taking a single instruction slot, and pair with anything else; the suggested replacement instruction pair will take two instruction slots.
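A sketch of that pairing argument, using the pOEP/sOEP pipeline names from the 68060 User's Manual (the actual pairing rules have more conditions than shown here): Code:
; single instruction: occupies one issue slot and can
; dual-issue with an independent neighbouring instruction
	move.l	#strconst,d2

; two instructions: the move.l needs the a0 result of the
; lea, so the pair cannot issue in the same cycle
	lea	strconst(pc),a0
	move.l	a0,d2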
25 May 2019, 12:11 | #4 |
Natteravn
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,496
Thanks a lot for this valuable information! So we will disable that kind of optimization for 68040 and 68060.
Also interesting to see that (d,An) is faster than (d,PC) on the 040. That means it also doesn't make much sense to move constant data into the code section when small data (A4 base-relative) is enabled.
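For comparison, a sketch of the small-data form (the _strconst label is made up; vbcc's small-data mode is assumed to address the merged data section via a4): Code:
; small data: (d16,a4) instead of (d16,pc) --
; (d,An) is the faster EA calc on the 040
	lea	_strconst(a4),a0
	move.l	a0,d2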
05 June 2019, 23:45 | #5 |
Moderator
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,602
Editing comment, cos I read too quickly. I thought this was about memory accesses at first but it isn't, it's about pipelining.
Isn't the reason the pipeline stall from store -> use, though? This should definitely be the fastest way to do it from 68000 to 68030. Make sure to do a cache clear before measuring on all CPUs.
Last edited by Photon; 05 June 2019 at 23:53.
06 June 2019, 14:00 | #6 |
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,546
Even if it's not the fastest, it still saves 2 bytes of object code and 6 bytes on disk, which might be more important than being slightly slower (if it is). Those 2 bytes saved might just make the difference when trying to fit a loop in the cache too, especially if main memory is slow.

Of course if you need fully relocatable code then it is the only option, but that is rarely a requirement on the Amiga.

For me the biggest advantage of lea over #xxxx is that it makes the code easier to follow, which reduces errors and is just nicer. One reason I like programming the Amiga in assembler is the aesthetics of 68K assembly language. I would rather have nice-looking code than the ultimate speed, and torturing the ISA to wring the last drop out of a particular processor is not my style.

With lea you can see at a glance that this is an address, not immediate data. I like to keep addresses in address registers and data in data registers wherever practicable, because it makes sense and is how they were supposed to be used. I hate CPUs that just have 'registers' whose function you can't tell by looking at them (even worse when some have a special purpose and so can't be used for general stuff).

I can't think of any specific case where the speed of this operation would be critical. The OP's example "Let's assume you want to pass a string-constant pointer in d2" implies that a lot of time may be spent processing that data. Why would you be putting the address into D2? Probably for a DOS function which will execute thousands or millions of instructions. The few cycles you save in your code by using #xxxx are insignificant.
06 June 2019, 19:10 | #7 | |||
Natteravn
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,496
Although, the person who requested this feature argued similarly. Still, I don't like the idea of generating slower code just to save a relocation.
07 June 2019, 02:55 | #8 | ||
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,546
I don't worry about such minutiae unless I have a critical piece of code that needs to run as fast as possible, and even then it can often be improved more by changing higher-level algorithms.

As an example of how chasing the fastest code can be counterproductive, I saw a program that included a complicated 'copymemquick' function designed to copy non-aligned memory etc. as fast as possible, then only used it once to copy a few bytes! The code was several times larger than the data it had to copy.
07 June 2019, 15:46 | #9 | |||
Natteravn
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,496
Loading time might be long or short, but in any case a few relocation entries more or less are always absolutely insignificant. And the program will be loaded only once. On the other hand, you don't know how often your program will call such a function. It might be millions of times, or forever, when running as a server.

Nevertheless, I'm not saying this modification would generally be bad. vbcc also has the -size option, to optimise for size. We could put it there.
08 June 2019, 06:45 | #10 | ||||
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,546
If you know that a particular function will be called a lot and its internal operations are quick, optimizing the call might just be worth it. But if you really want to save cycles then perhaps you shouldn't be using it at all!
Most of my code is not optimized, but I don't care. It's more important to get it working properly and make it seem fast and responsive to the user. I would never optimize for 68040 because the slower machines need it much more, and I certainly don't want to emulate the PC way of doing things (yes, we know our code is dog slow on current machines, but when the next models come out...).
08 June 2019, 10:28 | #11 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
I would be strongly against adding this kind of optimization at the assembler level, because it trashes A0. I use a macro for lea to Dn in the rare cases it's needed.
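Such a macro might look like this minimal sketch (the macro name is made up, a0 as the scratch register is an assumption, and the syntax targets a Devpac/PhxAss-style assembler): Code:
LEAD	MACRO		; \1 = label, \2 = destination data register
	lea	\1(pc),a0	; a0 is trashed -- the macro makes that explicit
	move.l	a0,\2
	ENDM

; usage:
	LEAD	strconst,d2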
But for the code generation of a compiler, it makes sense.
08 June 2019, 13:56 | #12 |
Natteravn
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,496
I was always talking about vbcc's m68k code generator. The assembler would never do such optimizations.
09 June 2019, 11:25 | #13 | ||
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,546
I don't think using a '__reg("d0")' attribute is a great idea because it reduces portability for very little gain, but the 'lea' optimization reduces file size in all cases and has an insignificant effect on 040 execution speed, so applying it all the time is fine.
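For context, a sketch of what such register-parameter declarations look like in vbcc's C dialect (this mirrors dos.library/Write's d1/d2/d3 register convention; in practice these prototypes are normally generated from the SDK rather than written by hand): Code:
LONG Write(__reg("d1") BPTR file, __reg("d2") APTR buffer, __reg("d3") LONG length);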
09 June 2019, 18:40 | #14 | |||
Natteravn
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,496
09 June 2019, 18:52 | #15 |
Registered User
Join Date: May 2013
Location: Grimstad / Norway
Posts: 839
BTW, I think BAsm lets you set temp/trash registers for its optimizer IIRC. Just saying.
(And again, does anyone know what happened to the ca. 1990 GNU Superoptimizer?)
11 June 2019, 11:33 | #16 | |
Natteravn
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,496
Back to topic:
There also seems to be no difference on the 030, so the only CPU at a disadvantage from this optimization would be the 040.
11 June 2019, 15:19 | #17 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,039
Can't speak from experience on the 060 (e.g. how it behaves with properly optimized/pipelined code, etc.).
Looking at the docs, chapter 10 (Instruction Timing) of the 060 User's Manual, compared to the 040: for the 040 you can see that the EA calc for d(pc)/d(pc,rx) is up to 4 cycles slower than d(ax)/d(ax,ry), depending on the instruction/scenario, and that's what I get (the numbers I posted earlier are not from properly pipelined code and stalls happen, but that's the point). For the 060, after a quick glance, I couldn't see any differences in timings except for PEA (1 cycle faster; I thought maybe it was a typo, but there was nothing about it in the errata).
12 June 2019, 13:27 | #18 | |
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 237
12 June 2019, 13:31 | #19 |
Natteravn
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,496
Ok. Makes sense. Thanks for the explanation!