adda / suba Vs. lea - Page 2

Karlos · 21 September 2022, 20:46

At the risk of being a filthy necromancer, is there a definitive answer for 68020+ where the offset size can be represented as a 16 bit immediate? Assume code is in fast memory and instruction cache enabled.

smack · 27 September 2022, 00:03

Quote:

Originally Posted by Karlos

At the risk of being a filthy necromancer, is there a definitive answer for 68020+ where the offset size can be represented as a 16 bit immediate? Assume code is in fast memory and instruction cache enabled.

Looking at section 8 of this document https://www.nxp.com/docs/en/data-sheet/MC68020UM.pdf

adda / suba EA,An
0 / 2 / 3 cycles (best / cache / worst case) 8.2.8 Arithmetic/Logical Instructions
0 / 2 / 3 cycles for EA=#<data>.W 8.2.1 Fetch Effective Address
= 0 / 4 / 6 cycles

lea
2 / 2 / 3 cycles 8.2.16 Control Instructions
2 / 2 / 3 cycles for EA=(d16,An) 8.2.3 Calculate Effective Address
= 4 / 4 / 6 cycles

There is no clear winner, but I would use adda/suba because it has a chance of a faster best-case and the code is more expressive IMO.
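The totals above are simply the per-column sums of the two table entries for each instruction. A minimal Python sketch of that addition follows; the function name and the tuple layout are made up for illustration, and the figures are the ones quoted above from the 68020 manual.

```python
# Per-component timings as (best, cache, worst) cycles, as quoted above
# from section 8 of the MC68020 User's Manual.
ADDA_OP = (0, 2, 3)      # adda/suba EA,An       (8.2.8)
FEA_IMM_W = (0, 2, 3)    # fetch EA, #<data>.W   (8.2.1)
LEA_OP = (2, 2, 3)       # lea                   (8.2.16)
CEA_D16 = (2, 2, 3)      # calculate EA, (d16,An) (8.2.3)


def total(*parts):
    """Add timing tuples column by column (best, cache, worst)."""
    return tuple(sum(column) for column in zip(*parts))


adda_total = total(ADDA_OP, FEA_IMM_W)
lea_total = total(LEA_OP, CEA_D16)
print("adda.w #imm,An :", adda_total)
print("lea (d16,An),An:", lea_total)
```

Only the best-case column differs between the two, which is where the "chance of a faster best-case" remark comes from.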

If you wonder what these best / cache / worst cases are, well, it's complicated:

Quote:

8.1 TIMING ESTIMATION FACTORS
The advanced architecture of the MC68020/EC020 makes exact instruction timing
calculations difficult due to the effects of:
1. An On-Chip Instruction Cache and Instruction Prefetch
2. Operand Misalignment
3. Bus Controller/Sequence Concurrency
4. Instruction Execution Overlap
These factors make MC68020/EC020 instruction set timing difficult to calculate on a single
instruction basis since instructions vary in execution time from one context to another.

I expect the instruction timing to be similarly complex for the 040 and 060 processors...

phx · 27 September 2022, 12:14

What makes LEA attractive is that you can store the result of the addition to a different register.

Bruce Abbott · 27 September 2022, 21:06

Quote:

Originally Posted by Karlos

At the risk of being a filthy necromancer, is there a definitive answer for 68020+ where the offset size can be represented as a 16 bit immediate? Assume code is in fast memory and instruction cache enabled.

Real-world test on 50MHz 030 (Blizzard 1230-IV with 60ns DRAM):-

In cache:-
addq.w #x,An 2 clocks
addq.l #x,An 2 clocks
lea xx(An),An 4 clocks
lea x(An,Rn),An 6 clocks
adda.w #xx,An 6 clocks
adda.l #xxxx,An 6 clocks

Not cached:-
addq.w #x,An 4 clocks
addq.l #x,An 4 clocks
lea xx(An),An 7 clocks
lea x(An,Rn),An 7 clocks
adda.w #xx,An 7 clocks
adda.l #xxxx,An 11 clocks

This shows a 2 clock advantage of lea over adda.w when cached, but no advantage when not cached. Also note that adda.l is the same speed as adda.w when cached, but 4 clocks slower when not cached!

It also shows that 60ns DRAM is not fast enough to keep up with the 50MHz 030 on the Blizzard 1230-IV. Getting that code into the cache can make it up to twice as fast!
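To put the measurements above into perspective, here is a small Python sketch that converts the clock counts into nanoseconds at 50 MHz and works out the cached vs. not-cached ratio for each instruction. The dictionary layout and helper name are only illustrative; the clock counts are the ones listed above.

```python
CLOCK_MHZ = 50
NS_PER_CLOCK = 1000 / CLOCK_MHZ   # 20 ns per clock at 50 MHz

# (cached, not cached) clocks per instruction, from the lists above
measured = {
    "addq.w #x,An": (2, 4),
    "addq.l #x,An": (2, 4),
    "lea xx(An),An": (4, 7),
    "lea x(An,Rn),An": (6, 7),
    "adda.w #xx,An": (6, 7),
    "adda.l #xxxx,An": (6, 11),
}


def speedup(cached, uncached):
    """How many times faster the cached case is."""
    return uncached / cached


for name, (cached, uncached) in measured.items():
    print(f"{name:16} {cached * NS_PER_CLOCK:5.0f} ns cached, "
          f"{uncached * NS_PER_CLOCK:5.0f} ns not cached, "
          f"x{speedup(cached, uncached):.2f}")

# Cached advantage of lea over adda.w, in clocks
lea_advantage = measured["adda.w #xx,An"][0] - measured["lea xx(An),An"][0]
```

The largest ratio is the addq case (4 clocks vs 2), which matches the "up to twice as fast" remark.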

Karlos · 27 September 2022, 22:39

Great responses, thanks!

Any info on 040 / 060 ?

smack · 28 September 2022, 01:04

68040, see section 10 of this document https://www.nxp.com/docs/en/referenc.../MC68040UM.pdf

instruction and data caches, execution pipeline (more complex than the 68020)
adda #<xxx>,An = 1 cycle (1 <ea> + 1 execute)
suba #<xxx>,An = 2 cycles (1 <ea> + 2 execute) <- is that real? suba slower than adda?
lea (d16,An),An = 2 cycles (2 <ea> + 2 execute)

68060, see section 10 of this document https://www.nxp.com/docs/en/data-sheet/MC68060UM.pdf

instruction and data caches, superscalar pipeline and dual execution units (more complex than the 68040)
adda #<xxx>,An = 1 cycle, pOEP | sOEP
suba #<xxx>,An = 1 cycle, pOEP | sOEP
lea (d16,An),An = 1 cycle, pOEP | sOEP

Of course, real world performance depends on the specific sequence of instructions executed, because of how they interact (overlap or stall) in the execution pipelines.

paraj · 28 September 2022, 08:36

For 060 I think lea is slightly better (all other things being equal) since it can sometimes avoid a 2 cycle change/use register stall (10.2.3).

The instruction timings for 040 look a bit odd for adda/suba:
adda Dn,Am ;1+2
adda An,Am ;1+1
adda #<xxx>,Am ;1+1
suba Dn,Am ;1+1
suba An,Am ;1+2
suba #<xxx>,Am ;1+2

So suba Dn,Am is faster than adda Dn,Am and suba An,Am?

a/b · 28 September 2022, 10:22

68040 real hardware (1000 iterations of 8x repeated):

Code:

	adda.	d0,a0		; 3528	3
	adda.	a0,a1		; 1112	1
	adda.	(a0),a1		; 2396	2
	adda.	#,a0		; 1117	1

	suba.	d0,a0		; 3528	3
	suba.	(a0),a1		; 2396	2
	suba.	#,a0		; 1117	1

	lea	(a0),a1		; 1112	1
	lea	(d.w,a0),a1	; 2391	2
	lea	(d.w,pc),a0	; 4954	4

No size means the same speed for both .w and .l. Not really a "best case" scenario, but close enough I guess (everything is in cache).

saimo · 28 September 2022, 17:17

Quote:

Originally Posted by Bruce Abbott

Real-world test on 50MHz 030 (Blizzard 1230-IV with 60ns DRAM):-

In cache:-
addq.w #x,An 2 clocks
addq.l #x,An 2 clocks
lea xx(An),An 4 clocks
lea x(An,Rn),An 6 clocks
adda.w #xx,An 6 clocks
adda.l #xxxx,An 6 clocks

Not cached:-
addq.w #x,An 4 clocks
addq.l #x,An 4 clocks
lea xx(An),An 7 clocks
lea x(An,Rn),An 7 clocks
adda.w #xx,An 7 clocks
adda.l #xxxx,An 11 clocks

This shows a 2 clock advantage of lea over adda.w when cached, but no advantage when not cached. Also note that adda.l is the same speed as adda.w when cached, but 4 clocks slower when not cached!

It also shows that 60ns DRAM is not fast enough to keep up with the 50MHz 030 on the Blizzard 1230-IV. Getting that code into the cache can make it up to twice as fast!

I have a same-specced machine, but a number of tests showed that adda and lea execute at the same speed in all cases.
EDIT: I didn't consider that assembler optimizations might kick in, so my previous results were faked by the fact that the assembler actually replaced adda with lea. Stupid mistake, apologies. Now that I have encoded the instructions manually, my tests return the same results (i.e. adda 6 cycles, lea 4 cycles, both 7 cycles when cache is off).

Updated results:

Code:

--------------------------------------------------------------------------------
ADDA.W #X,AN VS LEA.L (D16,AN),AN COMPARISON

NOTES
 * The tests have been run on an Amiga 1200 equipped with a Blizzard 1230-IV
   with an MC68030 clocked at 50 MHz and 60 ns FAST RAM.
 * The tests for the unexpanded Amiga 1200 cases have been run on the same
   machine with the accelerator board disabled.
 * The tests have been run with DMA and interrupts off, and with PAL settings.
 * The tests have been run 3 or 4 times each, and the best results have been
   taken; most of the tests always returned the same result (only the 68030,
   cache off, CHIP RAM cases showed fluctuations of some tens/hundreds of
   cycles, which were insignificant anyway).
 * The tests execute the core code 10000 times.
 * The duration of the tests is measured by means of color clocks (CCKs).
 * The test programs require ECS or AGA (but have been tested on AGA only).
 * The test programs can be run from both shell and Workbench.
 * The test programs take no argument.
 * The test programs shut the OS off, take over the machine entirely and access
   the hardware directly; although on exit they restore the system with the
   utmost care, no guarantee is given - USE AT YOUR OWN RISK.
 * Included test programs:
    * ADCL = Adda, cache Disabled, CHIP RAM, Longword alignment
    * ADCW = Adda, cache Disabled, CHIP RAM, Word alignment
    * ADFL = Adda, cache Disabled, FAST RAM, Longword alignment
    * ADFW = Adda, cache Disabled, FAST RAM, Word alignment
    * AECL = Adda, cache Enabled, CHIP RAM, Longword alignment
    * AECW = Adda, cache Enabled, CHIP RAM, Word alignment
    * AEFL = Adda, cache Enabled, FAST RAM, Longword alignment
    * AEFW = Adda, cache Enabled, FAST RAM, Word alignment
    * LDCL = Lea, cache Disabled, CHIP RAM, Longword alignment
    * LDCW = Lea, cache Disabled, CHIP RAM, Word alignment
    * LDFL = Lea, cache Disabled, FAST RAM, Longword alignment
    * LDFW = Lea, cache Disabled, FAST RAM, Word alignment
    * LECL = Lea, cache Enabled, CHIP RAM, Longword alignment
    * LECW = Lea, cache Enabled, CHIP RAM, Word alignment
    * LEFL = Lea, cache Enabled, FAST RAM, Longword alignment
    * LEFW = Lea, cache Enabled, FAST RAM, Word alignment


--------------------------------------------------------------------------------
68020 @ 14.18758 MHz / CHIP RAM

OFFICIAL SPECIFICATIONS (best/cache/worst)
 * adda: FEA: 0/2/3, operation: 0/2/3 -> 0/4/6
 * lea:  CEA: 2/2/3, operation: 2/2/3 -> 4/4/6
 * dbf:  branch taken: 3/6/9, branch not taken: 7/10/10

CORE CODE

 .l rept   10
    adda.w #-32768,a0     ;0/4/6
    endr
    dbf    d0,.l          ;3/6/9 or 7/10/10

 .l rept   10
    lea.l  (-32768,a0),a0 ;4/4/6
    endr
    dbf    d0,.l          ;3/6/9 or 7/10/10

THEORETICAL CYCLES
 * total / cache: (6*10+9) + (4*10+6)*9998 + (4*10+10) = 460027
 * loop  / cache: 460027/10000 = ~46
 * total / worst: (6*10+9)*9999 + (6*10+10) = 690001
 * loop  / worst: 690001/10000 = ~69

RESULTS FOR LONGWORD-ALIGNED CODE

       |          |       cache on      |      cache off
  ins. |     unit +----------+----------+----------+----------
       |          | FAST RAM | CHIP RAM | FAST RAM | CHIP RAM
 ------+----------+----------+----------+----------+----------
  adda |   CCKs/T |          |   167784 |          |  342832
       | cycles/T |          |   671136 |          | 1371328
       | cycles/L |          |    67.11 |          |  137.13
 ------+----------+----------+----------+----------+----------
   lea |   CCKs/T |          |   117936 |          |  342832
       | cycles/T |          |   471744 |          | 1371328
       | cycles/L |          |    47.17 |          |  137.13

RESULTS FOR WORD-ALIGNED CODE

       |          |       cache on      |      cache off
  ins. |     unit +----------+----------+----------+----------
       |          | FAST RAM | CHIP RAM | FAST RAM | CHIP RAM
 ------+----------+----------+----------+----------+----------
  adda |   CCKs/T |          |   165280 |          |  353033
       | cycles/T |          |   661120 |          | 1412132
       | cycles/L |          |    66.11 |          |  141.21
 ------+----------+----------+----------+----------+----------
   lea |   CCKs/T |          |   115432 |          |  342832
       | cycles/T |          |   461728 |          | 1371328
       | cycles/L |          |    46.17 |          |  137.13

CALCULATION OF CPU CYCLES
 * total = CCKs * 4
 * loop  = (CCKs * 4) / 10000 = CCKs / 2500

NOTES
 * The lea + cache on + longword alignment case loop takes 1+ cycles more than
   expected (47+ VS 46); by executing a loop of a single lea 65536 times, the
   measured time is 180514 CCKs = 11.02 cycles per iteration, which is closer
   to the expected value (10 cycles), but still ~1 cycle slower; it seems that,
   at least in this context, dbf actually takes 7 cycles.
 * The adda + cache on + longword alignment case loop takes 21+ cycles more than
   expected (67 VS 46); given that dbf takes 7 cycles, adda takes (67-7)/10 = 6
   cycles, i.e. 2 cycles more than expected (6 VS 4).
 * The cache off case loops are about 2 times slower than the theoretical value
   (137.13 and 141.21 are close to 69*2 = 138), which is not surprising given
   the CHIP RAM access timings.
 * Word alignment affects performance as follows:
    * adda + cache off case: lowers it slightly;
    * adda + cache on case: improves it slightly;
    * lea + cache on case: improves it slightly;
    * lea + cache off case: has no effect.
   In particular, it eliminates the extra cycle taken by dbf in the cache on
   cases.


--------------------------------------------------------------------------------
68030 @ 50 MHz / CHIP RAM / FAST RAM 60 ns

OFFICIAL SPECIFICATIONS (CACHE / NO CACHE)
 * adda: FEA: 2/2, operation 4/4 -> 6/6
 * lea:  CEA: 2/2, operation: 2/2 -> 4/4
 * dbf:  branch taken: 6/8, branch not taken: 10/13

CORE CODE

 .l rept   10
    adda.w #-32768,a0     ;6/6
    endr
    dbf    d0,.l          ;6/8 or 10/13

 .l rept   10
    lea.l  (-32768,a0),a0 ;4/4
    endr
    dbf    d0,.l          ;6/8 or 10/13

THEORETICAL CYCLES
 * adda / total / cache: (6*10+8) + (6*10+6)*9998 + (6*10+10) = 660006
 * adda / loop  / cache: 660006/10000 = ~66
 * adda / total / no cache: (6*10+8)*9999 + (6*10+13) = 680005
 * adda / loop  / no cache: 680005/10000 = ~68
 * lea / total / cache: (4*10+8) + (4*10+6)*9998 + (4*10+10) = 460006
 * lea / loop  / cache: 460006/10000 = ~46
 * lea / total / no cache: (4*10+8)*9999 + (4*10+13) = 480005
 * lea / loop  / no cache: 480005/10000 = ~48

RESULTS FOR LONGWORD-ALIGNED CODE

       |          |        cache on       |       cache off
  ins. |     unit +-----------+-----------+-----------+------------
       |          |  FAST RAM |  CHIP RAM |  FAST RAM |   CHIP RAM
 ------+----------+-----------+-----------+-----------+------------
  adda |   CCKs/T |     46895 |     47141 |     59046 |     290438
       | cycles/T | 661071.16 | 664538.98 | 832361.83 | 4094257.09
       | cycles/L |     66.11 |     66.45 |     83.24 |     409.43
 ------+----------+-----------+-----------+-----------+------------
   lea |   CCKs/T |     32914 |     32934 |     59046 |     290297
       | cycles/T | 463983.29 | 464265.22 | 832361.83 | 4092269.44
       | cycles/L |     46.40 |     46.43 |     83.24 |     409.23

RESULTS FOR WORD-ALIGNED CODE

       |          |        cache on       |        cache off
  ins. |     unit +-----------+-----------+------------+------------
       |          |  FAST RAM |  CHIP RAM |   FAST RAM |   CHIP RAM
 ------+----------+-----------+-----------+------------+------------
  adda |   CCKs/T |     47121 |     47147 |      74670 |     342832
       | cycles/T | 664257.05 | 664623.57 | 1052610.81 | 4832846.76
       | cycles/L |     66.43 |     66.46 |     105.26 |     483.29
 ------+----------+-----------+-----------+------------+------------
   lea |   CCKs/T |     32914 |     32934 |      60256 |     296603
       | cycles/T | 463983.29 | 464265.22 |  849419.00 | 4181164.09
       | cycles/L |     46.40 |     46.43 |      84.94 |     418.12

CALCULATION OF CPU CYCLES
 * total = (CCKs * 50000000) / 3546895 = (CCKs * 10000000) / 709379
 * loop  = ((CCKs * 50000000) / 3546895) / 10000 = (CCKs * 1000) / 709379

NOTES
 * The cache on case loops take 0.4+ cycles more than expected (46.4+ VS 46 and
   66.4+ VS 66); by executing a loop of a single lea 65536 times, the measured
   time is 46792 CCKs = 10.07 cycles per iteration, which is closer to the
   expected value (10 cycles); quite weird.
 * The cache off + FAST RAM case loops are much slower than the theoretical
   value (83+ VS 48 cycles in the best case), which is unexpected given that the
   accelerator board is said to have a zero-wait-state design and that the RAM
   access timing (60 ns) is an exact multiple of the CPU clock timing (20 ns);
   looking at the numbers, it seems that dbf takes 10 cycles, i.e. 2 cycles more
   than expected (10 VS 8), and that adda/lea take (83-10)/10 = 7 cycles, i.e.
   1 cycle more than expected for adda (7 VS 6) and 3 cycles more than expected
   for lea (7 VS 4).
 * Word alignment affects performance as follows:
    * adda + cache off case: lowers it a lot;
    * adda + cache on case: lowers it slightly;
    * lea + cache off case: lowers it slightly;
    * lea + cache on case: has no effect.

Attached is an (EDIT: updated) archive with the test programs, so that anyone who feels like it can run more tests (and also measure the 68040 and 68060 performance).

I totally forgot to check whether word/longword alignment made any difference; I did mean to, but then I just forgot.

(Now this is fixed: the updated results and test programs also deal with alignment.)
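For anyone who wants to re-check the arithmetic in the results above, here is a minimal Python sketch of the two calculations: the theoretical loop totals and the CCK-to-CPU-cycle conversion. The function and parameter names are invented for illustration. It follows the formulas as written in the results, so the first pass through the loop is costed with the not-yet-cached figures.

```python
CCK_HZ = 3_546_895           # PAL colour clock rate used in the formulas above
CPU_68020_HZ = 4 * CCK_HZ    # 14.18758 MHz
CPU_68030_HZ = 50_000_000


def theoretical_total(ins_first, dbf_first, ins_rest, dbf_taken,
                      dbf_not_taken, reps=10, passes=10_000):
    """First pass + (passes - 2) middle passes + final pass (dbf falls through)."""
    first = ins_first * reps + dbf_first
    middle = (ins_rest * reps + dbf_taken) * (passes - 2)
    last = ins_rest * reps + dbf_not_taken
    return first + middle + last


def measured_cycles(ccks, cpu_hz, passes=10_000):
    """Convert a CCK count to (total CPU cycles, CPU cycles per loop pass)."""
    total = ccks * cpu_hz / CCK_HZ
    return total, total / passes


# 68030, cache on, FAST RAM, longword-aligned: theory vs. measurement
lea_theory = theoretical_total(4, 8, 4, 6, 10)
adda_theory = theoretical_total(6, 8, 6, 6, 10)
lea_total, lea_loop = measured_cycles(32914, CPU_68030_HZ)
adda_total, adda_loop = measured_cycles(46895, CPU_68030_HZ)

print(f"lea : theory {lea_theory / 10_000:.2f}, measured {lea_loop:.2f} cycles/loop")
print(f"adda: theory {adda_theory / 10_000:.2f}, measured {adda_loop:.2f} cycles/loop")
```

The same helper covers the 68020 figures by passing `CPU_68020_HZ`, and the single-lea loops by passing `passes=65536`.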

PeterK · 28 September 2022, 17:36

Quote:

ADDA.W #X,AN VS LEA.L (D16,AN),AN COMPARISON

Is that just a Copy&Paste bug?

Code:

lea.l  #$8000,a0

or where is the LEA.L (D16,AN),AN ?

saimo · 28 September 2022, 17:53

Quote:

Originally Posted by PeterK

Is that just a Copy&Paste bug?

Code:

lea.l  #$8000,a0

or where is the LEA.L (D16,AN),AN ?

Yeah, horrible copy&paste bug

Thanks for pointing it out! I'll fix the text and re-upload the archive.

Thankfully, the actual code is correct, though:

Code:

.l

	ifeq	INSTRUCTION

	rept	10
	dc.w	$d0fc,$8000	;adda.w #-32768,a0
	endr

	else

	rept	10
	dc.w	$41e8,$8000	;lea.l (-32768,a0),a0
	endr

	endif

	dbf	d0,.l

EDIT: it just occurred to me that I also forgot to check whether the assembler changed the instructions! I'm writing the instructions by hand now to make sure. Sorry.
EDIT2: I just disassembled the adda test programs, and it turned out that the assembler did change adda into lea! No surprise the results were identical! Epic fail. I'm fixing everything and re-running the tests now.
EDIT3: everything fixed & updated.
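The hand-encoded words in the dc.w lines above can be cross-checked against the standard 68k instruction formats. Below is a small Python sketch of such a check. The helper names are hypothetical, and it only covers the two forms used in the test (immediate word source for adda, and the (d16,An) mode for lea).

```python
def adda_w_imm(areg, imm16):
    """Encode adda.w #imm16,An as (opcode word, extension word)."""
    # ADDA: 1101 rrr ooo mmm eee; opmode 011 = word, EA 111/100 = immediate
    opcode = 0xD000 | (areg << 9) | (0b011 << 6) | (0b111 << 3) | 0b100
    return opcode, imm16 & 0xFFFF


def lea_d16(src_areg, dst_areg, disp16):
    """Encode lea (d16,As),Ad as (opcode word, extension word)."""
    # LEA: 0100 rrr 111 mmm eee; EA mode 101 = (d16,An)
    opcode = 0x4000 | (dst_areg << 9) | (0b111 << 6) | (0b101 << 3) | src_areg
    return opcode, disp16 & 0xFFFF


def is_lea(opcode):
    """True if the opcode word is some form of LEA."""
    return (opcode & 0xF1C0) == 0x41C0


adda_words = adda_w_imm(0, -32768)
lea_words = lea_d16(0, 0, -32768)
print("adda.w #-32768,a0   :", [f"${w:04x}" for w in adda_words])
print("lea.l (-32768,a0),a0:", [f"${w:04x}" for w in lea_words])
```

Running `is_lea` over the first word of each instruction in a disassembled test program is one quick way to spot an assembler that has silently turned adda into lea.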

paraj · 28 September 2022, 18:11

Just to be sure I checked on my 060, and lea X(An),An, adda.w #X,An and adda.l #X,An are all equally fast (1 cycle) as long as instruction fetch can keep up (otherwise adda.l is worse of course, and none of them can reach 0.5 cycles in isolation). Also

Code:

        lea    X(a0),a0
        move.l  (a0),d0

is faster than adda.w as the stall is indeed avoided.

PeterK · 28 September 2022, 18:13

Thanks saimo! I've tried a few of your benchmarks on WinUAE, but unfortunately my Windows has too many background processes running and the timing results are more or less random, so an exact comparison seems to be impossible. I would really like to know whether ADDA or LEA performs better, since this is one of the optimizations which can be enabled in PhxAss, and I usually have it switched on.

Quote:

EDIT2: I just disassembled the adda test programs, and it turned out that the assembler did change them into leas! No surprise the results were identical! Epic fail. I'm fixing everything and re-running the tests now.

The opposite happened to me some weeks ago. I forgot to switch on the PhxAss optimization and then my icon.library self-protection didn't accept the different code checksum.

saimo · 28 September 2022, 19:42

Quote:

Originally Posted by PeterK

Thanks saimo! I've tried a few of your benchmarks on WinUAE, but unfortunately my Windows has too many background processes running and the timing results are more or less random, so an exact comparison seems to be impossible.

Emulators are definitely not reliable test platforms, indeed.

Quote:

I would really like to know whether ADDA or LEA performs better, since this is one of the optimizations which can be enabled in PhxAss, and I usually have it switched on.

I have updated my previous post (both results and archive with test programs). Lea is indeed faster in some cases on 68020 and 68030 (and I do actually use lea in my code, but for some reason this time around I got hit by a doubt and just had to double check, while I was supposed to work on something else - as usual, rushing things never helps).

Quote:

The opposite happened to me some weeks ago. I forgot to switch on the PhxAss optimization and then my icon.library self-protection didn't accept the different code checksum.

By coincidence, I used PhxAss as well

meynaf · 28 September 2022, 20:11

Quote:

Originally Posted by saimo

By coincidence, I used PhxAss as well

Then you don't need to encode the instructions manually. Just use opt 0 to turn all optimizations off, either at the command line or in the source itself.

smack · 28 September 2022, 21:20

for reference, the 68030 User's Manual
https://www.nxp.com/docs/en/referenc...68030UM-P1.pdf
https://www.nxp.com/docs/en/referenc...68030UM-P2.pdf

SECTION 11 INSTRUCTION EXECUTION TIMING
can be found in part 2

Code:

			Head		Tail	I-Cache Case	No-Cache Case
adda.w	#-32768,a0
; (fea) #<data>.W	2		0	2(0/0/0)	2(0/1/0)
; ADDA.W EA,An		0		0	4(0/0/0)	4(0/1/0)
; total = 6 cycles

lea (-32768,a0),a0
; (cea) (d16,An)	2 + op head	0	2(0/0/0)	2(0/1/0)
; LEA			2		0	2(0/0/0)	2(0/1/0)
; total = 4 cycles

so the 68030 document and test results agree.

the 68020 document seems to have incomplete timing data for ADDA / SUBA: it doesn't distinguish between the faster ADDA.L (4 cycles) and the slower ADDA.W (6 cycles).
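As a rough illustration of how the head/tail columns are used, here is a Python sketch that assumes the manual's rule that two consecutive instructions overlap by min(tail of the first, head of the second). The combined head and cache-case values per instruction are my reading of the table rows above, and the names are invented for the example.

```python
from collections import namedtuple

# head / tail / cache-case cycles per complete instruction (EA + operation)
Timing = namedtuple("Timing", "name head tail cache_case")

ADDA_W_IMM = Timing("adda.w #imm,An", head=2, tail=0, cache_case=2 + 4)
LEA_D16 = Timing("lea (d16,An),An", head=2 + 2, tail=0, cache_case=2 + 2)


def sequence_cycles(instructions):
    """Sum cache-case times, subtracting min(previous tail, head) overlaps."""
    total = 0
    previous_tail = 0
    for ins in instructions:
        total += ins.cache_case - min(previous_tail, ins.head)
        previous_tail = ins.tail
    return total


adda_run = sequence_cycles([ADDA_W_IMM] * 10)
lea_run = sequence_cycles([LEA_D16] * 10)
print("10 x adda.w:", adda_run, "cycles")
print("10 x lea   :", lea_run, "cycles")
```

Because both instructions have a tail of 0, nothing overlaps in a run of identical instructions, so ten of them cost ten times the single-instruction total.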

saimo · 28 September 2022, 23:08

@meynaf

Quote:

Originally Posted by meynaf

Then you don't need to encode the instructions manually. Just use opt 0 to turn all optimizations off, either at the command line or in the source itself.

Yep, but that wouldn't have been as fool-proof. And here I'm proving I can be a real fool (see also reply below)

@smack

Quote:

Originally Posted by smack

for reference, the 68030 User's Manual
https://www.nxp.com/docs/en/referenc...68030UM-P1.pdf
https://www.nxp.com/docs/en/referenc...68030UM-P2.pdf

SECTION 11 INSTRUCTION EXECUTION TIMING
can be found in part 2

Code:

            Head        Tail    I-Cache Case    No-Cache Case
adda.w    #-32768,a0
; (fea) #<data>.W    2        0    2(0/0/0)    2(0/1/0)
; ADDA.W EA,An        0        0    4(0/0/0)    4(0/1/0)
; total = 6 cycles

lea (-32768,a0),a0
; (cea) (d16,An)    2 + op head    0    2(0/0/0)    2(0/1/0)
; LEA            2        0    2(0/0/0)    2(0/1/0)
; total = 4 cycles

so the 68030 document and test results agree.

You're totally right, thank you. I won't even try to describe which broken mental processes prevented me from adding the FEA to the calculations. Just one more epic fail...

Quote:

the 68020 document seems to have incomplete timing data for ADDA / SUBA: it doesn't distinguish between the faster ADDA.L (4 cycles) and the slower ADDA.W (6 cycles).

Indeed. And 68020 and 68030 are generally so similar when it comes to timings that one would expect that the UM isn't quite right there. Though, there are surprises sometimes (for example, see notes below about alignment).

@all

Updated the results and the archive in post #29. Code alignment is now taken into account as well.

On 68020, word alignment affects performance as follows:
* adda + cache off case: lowers it slightly;
* adda + cache on case: improves it slightly;
* lea + cache on case: improves it slightly;
* lea + cache off case: has no effect.
In particular, it eliminates the extra cycle taken by dbf in the cache on cases, so the measured times are very close to the theoretical ones - i.e. word alignment is better than longword alignment in this specific case.

On 68030, word alignment affects performance as follows:
* adda + cache off case: lowers it a lot;
* adda + cache on case: lowers it slightly;
* lea + cache off case: lowers it slightly;
* lea + cache on case: has no effect.

saimo · 28 September 2022, 23:21

Quote:

Originally Posted by phx

What makes LEA attractive is that you can store the result of the addition to a different register.

And on 68020+ you can also get quick multiplications by 3, 5 and 9 (plus displacement). That's very advantageous if the source operand happens to be in an address register already and it's OK to have the destination in the same / another address register - OK, it's a very specific case, but still it might be useful. The same goes for additions of any two operands, one of which has to be pre-multiplied by 2, 4 or 8.
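To make the arithmetic concrete, here is a Python model of the address that a 68020+ lea with a scaled index, i.e. something like lea (d8,An,Xn.l*scale),Am, would compute. This is only a sketch of the address calculation, not of the instruction's timing; the function name and the sample values are invented.

```python
MASK32 = 0xFFFFFFFF


def lea_indexed(base, index, scale=1, disp8=0):
    """Model the result of lea (d8,An,Xn.l*scale),Am on a 68020+."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    if not -128 <= disp8 <= 127:
        raise ValueError("brief-format displacement is a signed 8-bit value")
    return (base + index * scale + disp8) & MASK32


a0 = 1000
times3 = lea_indexed(a0, a0, 2)               # like lea (a0,a0.l*2),a1
times5 = lea_indexed(a0, a0, 4)               # like lea (a0,a0.l*4),a1
times9_plus7 = lea_indexed(a0, a0, 8, 7)      # like lea (7,a0,a0.l*8),a1
sum_scaled = lea_indexed(100, 6, 4)           # base + index*4
print(times3, times5, times9_plus7, sum_scaled)
```

Using the same register as base and index gives the x3, x5 and x9 cases; using two different registers gives the "one operand pre-multiplied by 2, 4 or 8" addition, and the destination can be any address register.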

PeterK · 29 September 2022, 13:01

Quote:

Originally Posted by saimo

And on 68020+ you can also get quick multiplications by 3, 5 and 9 (plus displacement).

Great idea!

That's a good trick which I can use at least in one case, although the impact on the speed of my library won't be noticeable.

Bruce Abbott · 30 September 2022, 03:26

Quote:

Originally Posted by saimo

I have a same-specced machine, but a number of tests showed that adda and lea execute at the same speed in all cases.
EDIT: I didn't consider that assembler optimizations might kick in, so my previous results were faked by the fact that the assembler actually replaced adda with lea.

Stupid mistake, apologies. Now that I have encoded the instructions manually, my tests return the same results (i.e. adda 6 cycles, lea 4 cycles, both 7 cycles when cache is off).

Thanks for the confirmation. Verifying that the tests are done correctly can be tricky, even after disassembling the code to make sure it is assembled correctly.

Quote:

I totally forgot to check whether word/longword alignment made any difference; I did mean to, but then I just forgot (Now this is fixed: the updated results and test programs also deal with alignment.)

This can also be tricky to get right. To reduce loop overhead my test code repeats the instruction 25 or 50 times. If an instruction has an odd number of words then the alignment changes from one instruction to the next, and the result is an average of different alignments.
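The alignment effect described above is easy to see by tracking the start address of each repeated instruction. A minimal Python sketch follows; the helper name is invented, and it assumes the first copy starts on a longword boundary.

```python
def alignments(words_per_instruction, repeats, start_address=0):
    """Return 'long' or 'word' alignment for each repeated instruction."""
    result = []
    address = start_address
    for _ in range(repeats):
        result.append("long" if address % 4 == 0 else "word")
        address += 2 * words_per_instruction   # one 68k word = 2 bytes
    return result


# adda.w #xx,An is 2 words (opcode + 16-bit immediate)
even_case = alignments(2, 4)
# adda.l #xxxx,An is 3 words (opcode + 32-bit immediate)
odd_case = alignments(3, 4)
print("2-word instruction:", even_case)
print("3-word instruction:", odd_case)
```

With an odd number of words the copies alternate between longword and word alignment, so the measured time is an average of the two cases.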
