English Amiga Board


Old 21 September 2022, 20:46   #21
Karlos
Alien Bleed
 
 
Join Date: Aug 2022
Location: UK
Posts: 4,122
At the risk of being a filthy necromancer, is there a definitive answer for 68020+ where the offset size can be represented as a 16 bit immediate? Assume code is in fast memory and instruction cache enabled.
Old 27 September 2022, 00:03   #22
smack
Registered User
 
Join Date: May 2020
Location: Germany
Posts: 20
Quote:
Originally Posted by Karlos View Post
At the risk of being a filthy necromancer, is there a definitive answer for 68020+ where the offset size can be represented as a 16 bit immediate? Assume code is in fast memory and instruction cache enabled.
Looking at section 8 of this document https://www.nxp.com/docs/en/data-sheet/MC68020UM.pdf
  • adda / suba EA,An
      • 0 / 2 / 3 cycles (best / cache / worst case): 8.2.8 Arithmetic/Logical Instructions
      • 0 / 2 / 3 cycles for EA=#<data>.W: 8.2.1 Fetch Effective Address
      • total = 0 / 4 / 6 cycles
  • lea
      • 2 / 2 / 3 cycles: 8.2.16 Control Instructions
      • 2 / 2 / 3 cycles for EA=(d16,An): 8.2.3 Calculate Effective Address
      • total = 4 / 4 / 6 cycles

There is no clear winner, but I would use adda/suba because it has a chance of a faster best-case and the code is more expressive IMO.
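For anyone who wants to redo the sums, here is a minimal sketch in Python (used purely as a calculator). The cycle counts are the ones quoted from the MC68020 manual in this post; the constant and function names are made up for illustration.

```python
# Cycle counts quoted in this post from the MC68020 User's Manual, section 8,
# written as (best, cache, worst) tuples. The names are illustrative only.
FETCH_EA_IMM_W = (0, 2, 3)   # 8.2.1 fetch effective address, #<data>.W
ADDA_OP = (0, 2, 3)          # 8.2.8 adda / suba EA,An
CALC_EA_D16 = (2, 2, 3)      # 8.2.3 calculate effective address, (d16,An)
LEA_OP = (2, 2, 3)           # 8.2.16 lea


def total(ea, op):
    """Add effective-address and operation cycles case by case."""
    return tuple(e + o for e, o in zip(ea, op))


adda_total = total(FETCH_EA_IMM_W, ADDA_OP)
lea_total = total(CALC_EA_D16, LEA_OP)
print("adda.w #imm,An  best/cache/worst:", adda_total)
print("lea (d16,An),An best/cache/worst:", lea_total)
```

The two totals differ only in the best case, which is why the manual alone gives no clear winner.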


If you wonder what these best / cache / worst cases are: well, it's complicated.
Quote:
8.1 TIMING ESTIMATION FACTORS
The advanced architecture of the MC68020/EC020 makes exact instruction timing
calculations difficult due to the effects of:
1. An On-Chip Instruction Cache and Instruction Prefetch
2. Operand Misalignment
3. Bus Controller/Sequence Concurrency
4. Instruction Execution Overlap
These factors make MC68020/EC020 instruction set timing difficult to calculate on a single
instruction basis since instructions vary in execution time from one context to another.
I expect the instruction timing to be similarly complex for the 040 and 060 processors...
Old 27 September 2022, 12:14   #23
phx
Natteravn
 
 
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,496
What makes LEA attractive is that you can store the result of the addition to a different register.
Old 27 September 2022, 21:06   #24
Bruce Abbott
Registered User
 
 
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,546
Quote:
Originally Posted by Karlos View Post
At the risk of being a filthy necromancer, is there a definitive answer for 68020+ where the offset size can be represented as a 16 bit immediate? Assume code is in fast memory and instruction cache enabled.
Real-world test on 50MHz 030 (Blizzard 1230-IV with 60ns DRAM):-

In cache:-
addq.w #x,An 2 clocks
addq.l #x,An 2 clocks
lea xx(An),An 4 clocks
lea x(An,Rn),An 6 clocks
adda.w #xx,An 6 clocks
adda.l #xxxx,An 6 clocks

Not cached:-
addq.w #x,An 4 clocks
addq.l #x,An 4 clocks
lea xx(An),An 7 clocks
lea x(An,Rn),An 7 clocks
adda.w #xx,An 7 clocks
adda.l #xxxx,An 11 clocks

This shows a 2 clock advantage of lea over adda.w when cached, but no advantage when not cached. Also note that adda.l is the same speed as adda.w when cached, but 4 clocks slower when not cached!

It also shows that 60ns DRAM is not fast enough to keep up with the 50MHz 030 on the Blizzard 1230-IV. Getting that code into the cache can make it up to twice as fast!
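A small sketch of the cached vs. uncached comparison, using only the clock counts listed above (Python is used as a calculator; the variable names are illustrative):

```python
# Clocks per instruction as listed in this post (50 MHz 68030, Blizzard 1230-IV).
cached = {
    "addq.w #x,An": 2,
    "addq.l #x,An": 2,
    "lea xx(An),An": 4,
    "lea x(An,Rn),An": 6,
    "adda.w #xx,An": 6,
    "adda.l #xxxx,An": 6,
}
uncached = {
    "addq.w #x,An": 4,
    "addq.l #x,An": 4,
    "lea xx(An),An": 7,
    "lea x(An,Rn),An": 7,
    "adda.w #xx,An": 7,
    "adda.l #xxxx,An": 11,
}

# How much faster each instruction runs once it is in the instruction cache.
speedup = {ins: uncached[ins] / cached[ins] for ins in cached}
for ins, ratio in speedup.items():
    print(f"{ins:<16} {uncached[ins]:>2} -> {cached[ins]} clocks  (x{ratio:.2f})")

# Advantage of lea over adda.w in each case, in clocks.
lea_gain_cached = cached["adda.w #xx,An"] - cached["lea xx(An),An"]
lea_gain_uncached = uncached["adda.w #xx,An"] - uncached["lea xx(An),An"]
print("lea advantage (cached / not cached):", lea_gain_cached, "/", lea_gain_uncached)
```

The largest ratio is the addq case (4 clocks down to 2), which is where the "up to twice as fast" figure comes from.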
Old 27 September 2022, 22:39   #25
Karlos
Alien Bleed
 
 
Join Date: Aug 2022
Location: UK
Posts: 4,122
Great responses, thanks!

Any info on 040 / 060 ?
Old 28 September 2022, 01:04   #26
smack
Registered User
 
Join Date: May 2020
Location: Germany
Posts: 20
68040, see section 10 of this document https://www.nxp.com/docs/en/referenc.../MC68040UM.pdf
  • instruction and data caches, execution pipeline (more complex than the 68020)
  • adda #<xxx>,An = 1 cycle (1 <ea> + 1 execute)
  • suba #<xxx>,An = 2 cycles (1 <ea> + 2 execute) <- is that real? suba slower than adda?
  • lea (d16,An),An = 2 cycles (2 <ea> + 2 execute)

68060, see section 10 of this document https://www.nxp.com/docs/en/data-sheet/MC68060UM.pdf
  • instruction and data caches, superscalar pipeline and dual execution units (more complex than the 68040)
  • adda #<xxx>,An = 1 cycle, pOEP | sOEP
  • suba #<xxx>,An = 1 cycle, pOEP | sOEP
  • lea (d16,An),An = 1 cycle, pOEP | sOEP

Of course, real world performance depends on the specific sequence of instructions executed, because of how they interact (overlap or stall) in the execution pipelines.
Old 28 September 2022, 08:36   #27
paraj
Registered User
 
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
For 060 I think lea is slightly better (all other things being equal) since it can sometimes avoid a 2 cycle change/use register stall (10.2.3).

The instruction timings for 040 look a bit odd for adda/suba:
adda Dn,Am ;1+2
adda An,Am ;1+1
adda #<xxx>,Am ;1+1
suba Dn,Am ;1+1
suba An,Am ;1+2
suba #<xxx>,Am ;1+2

So suba Dn,Am is faster than adda Dn,Am and suba An,Am?
Old 28 September 2022, 10:22   #28
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,039
68040 real hardware (1000 iterations, each instruction repeated 8x per iteration):
Code:
	adda.	d0,a0		; 3528	3
	adda.	a0,a1		; 1112	1
	adda.	(a0),a1		; 2396	2
	adda.	#,a0		; 1117	1

	suba.	d0,a0		; 3528	3
	suba.	(a0),a1		; 2396	2
	suba.	#,a0		; 1117	1

	lea	(a0),a1		; 1112	1
	lea	(d.w,a0),a1	; 2391	2
	lea	(d.w,pc),a0	; 4954	4
No size means the same speed for both .w and .l. Not really "best case" scenario, but close enough I guess (everything is in cache).
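One possible way to read the two columns, sketched in Python: the post does not say what unit the raw counts are in or how the second column was obtained, so the normalisation below (divide by the fastest result and round) is only an assumption that happens to reproduce the posted figures.

```python
# Raw counts from the listing above (unit not stated in the post).
raw = {
    "adda d0,a0": 3528,
    "adda a0,a1": 1112,
    "adda (a0),a1": 2396,
    "adda #,a0": 1117,
    "suba d0,a0": 3528,
    "suba (a0),a1": 2396,
    "suba #,a0": 1117,
    "lea (a0),a1": 1112,
    "lea (d.w,a0),a1": 2391,
    "lea (d.w,pc),a0": 4954,
}

# Assumption: treat the fastest result as "1 cycle" and scale everything else.
baseline = min(raw.values())
relative = {ins: count / baseline for ins, count in raw.items()}

for ins, ratio in relative.items():
    print(f"{ins:<16} {raw[ins]:>5}  {ratio:.2f}  ~{round(ratio)}")
```

Under that reading, adda #,a0 and lea (d.w,a0),a1 come out at roughly 1 and 2 units respectively, in line with the second column.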
Old 28 September 2022, 17:17   #29
saimo
Registered User
 
 
Join Date: Aug 2010
Location: Italy
Posts: 787
Quote:
Originally Posted by Bruce Abbott View Post
Real-world test on 50MHz 030 (Blizzard 1230-IV with 60ns DRAM):-

In cache:-
addq.w #x,An 2 clocks
addq.l #x,An 2 clocks
lea xx(An),An 4 clocks
lea x(An,Rn),An 6 clocks
adda.w #xx,An 6 clocks
adda.l #xxxx,An 6 clocks

Not cached:-
addq.w #x,An 4 clocks
addq.l #x,An 4 clocks
lea xx(An),An 7 clocks
lea x(An,Rn),An 7 clocks
adda.w #xx,An 7 clocks
adda.l #xxxx,An 11 clocks

This shows a 2 clock advantage of lea over adda.w when cached, but no advantage when not cached. Also note that adda.l is the same speed as adda.w when cached, but 4 clocks slower when not cached!

It also shows that 60ns DRAM is not fast enough to keep up with the 50MHz 030 on the Blizzard 1230-IV. Getting that code into the cache can make it up to twice as fast!
I have a same-specced machine, but a number of tests showed that adda and lea execute at the same speed in all cases.
EDIT: I didn't consider that assembler optimizations might kick in, so my previous results were faked by the fact that the assembler actually replaced adda with lea. Stupid mistake, apologies. Now that I have encoded the instructions manually, my tests return the same results (i.e. adda 6 cycles, lea 4 cycles, both 7 cycles when cache is off).

Updated results:
Code:
--------------------------------------------------------------------------------
ADDA.W #X,AN VS LEA.L (D16,AN),AN COMPARISON

NOTES
 * The tests have been run on an Amiga 1200 equipped with a Blizzard 1230-IV
   with an MC68030 clocked at 50 MHz and 60 ns FAST RAM.
 * The tests for the unexpanded Amiga 1200 cases have been run on the same
   machine with the accelerator board disabled.
 * The tests have been run with DMA and interrupts off, and with PAL settings.
 * The tests have been run 3 or 4 times each, and the best results have been
   taken; most of the tests always returned the same result (only the 68030,
   cache off, CHIP RAM cases showed fluctuations of some tens/hundreds of cycles,
   which were insignificant anyway).
 * The tests execute the core code 10000 times.
 * The duration of the tests is measured by means of color clocks (CCKs).
 * The test programs require ECS or AGA (but have been tested on AGA only).
 * The test programs can be run from both shell and Workbench.
 * The test programs take no argument.
 * The test programs shut the OS off, take over the machine entirely and access
   the hardware directly; although on exit they restore the system with the
   utmost care, no guarantee is given - USE AT YOUR OWN RISK.
 * Included test programs:
    * ADCL = Adda, cache Disabled, CHIP RAM, Longword alignment
    * ADCW = Adda, cache Disabled, CHIP RAM, Word alignment
    * ADFL = Adda, cache Disabled, FAST RAM, Longword alignment
    * ADFW = Adda, cache Disabled, FAST RAM, Word alignment
    * AECL = Adda, cache Enabled, CHIP RAM, Longword alignment
    * AECW = Adda, cache Enabled, CHIP RAM, Word alignment
    * AEFL = Adda, cache Enabled, FAST RAM, Longword alignment
    * AEFW = Adda, cache Enabled, FAST RAM, Word alignment
    * LDCL = Lea, cache Disabled, CHIP RAM, Longword alignment
    * LDCW = Lea, cache Disabled, CHIP RAM, Word alignment
    * LDFL = Lea, cache Disabled, FAST RAM, Longword alignment
    * LDFW = Lea, cache Disabled, FAST RAM, Word alignment
    * LECL = Lea, cache Enabled, CHIP RAM, Longword alignment
    * LECW = Lea, cache Enabled, CHIP RAM, Word alignment
    * LEFL = Lea, cache Enabled, FAST RAM, Longword alignment
    * LEFW = Lea, cache Enabled, FAST RAM, Word alignment


--------------------------------------------------------------------------------
68020 @ 14.18758 MHz / CHIP RAM

OFFICIAL SPECIFICATIONS (best/cache/worst)
 * adda: FEA: 0/2/3, operation: 0/2/3 -> 0/4/6
 * lea:  CEA: 2/2/3, operation: 2/2/3 -> 4/4/6
 * dbf:  branch taken: 3/6/9, branch not taken: 7/10/10

CORE CODE

 .l rept   10
    adda.w #-32768,a0     ;0/4/6
    endr
    dbf    d0,.l          ;3/6/9 or 7/10/10

 .l rept   10
    lea.l  (-32768,a0),a0 ;4/4/6
    endr
    dbf    d0,.l          ;3/6/9 or 7/10/10

THEORETICAL CYCLES
 * total / cache: (6*10+9) + (4*10+6)*9998 + (4*10+10) = 460027
 * loop  / cache: 460027/10000 = ~46
 * total / worst: (6*10+9)*9999 + (6*10+10) = 690001
 * loop  / worst: 690001/10000 = ~69

RESULTS FOR LONGWORD-ALIGNED CODE

       |          |       cache on      |      cache off
  ins. |     unit +----------+----------+----------+----------
       |          | FAST RAM | CHIP RAM | FAST RAM | CHIP RAM
 ------+----------+----------+----------+----------+----------
  adda |   CCKs/T |          |   167784 |          |  342832
       | cycles/T |          |   671136 |          | 1371328
       | cycles/L |          |    67.11 |          |  137.13
 ------+----------+----------+----------+----------+----------
   lea |   CCKs/T |          |   117936 |          |  342832
       | cycles/T |          |   471744 |          | 1371328
       | cycles/L |          |    47.17 |          |  137.13

RESULTS FOR WORD-ALIGNED CODE

       |          |       cache on      |      cache off
  ins. |     unit +----------+----------+----------+----------
       |          | FAST RAM | CHIP RAM | FAST RAM | CHIP RAM
 ------+----------+----------+----------+----------+----------
  adda |   CCKs/T |          |   165280 |          |  353033
       | cycles/T |          |   661120 |          | 1412132
       | cycles/L |          |    66.11 |          |  141.21
 ------+----------+----------+----------+----------+----------
   lea |   CCKs/T |          |   115432 |          |  342832
       | cycles/T |          |   461728 |          | 1371328
       | cycles/L |          |    46.17 |          |  137.13

CALCULATION OF CPU CYCLES
 * total = CCKs * 4
 * loop  = (CCKs * 4) / 10000 = CCKs / 2500

NOTES
 * The lea + cache on + longword alignment case loop takes 1+ cycles more than
   expected (47+ VS 46); by executing a loop of a single lea 65536 times, the
   measured time is 180514 CCKs = 11.02 cycles, which is closer to the expected
   value (10 cycles), but still ~1 cycle slower; it seems that, at least in this
   context, dbf actually takes 7 cycles.
 * The adda + cache on + longword alignment case loop takes 21+ cycles more than
   expected (67 VS 46); given that dbf takes 7 cycles, adda takes (67-7)/10 = 6
   cycles, i.e. 2 cycles more than expected (6 VS 4).
 * The cache off case loops are about 2 times slower than the theoretical value
   (137.13 and 141.21 are close to 69*2 = 138), which is not surprising given
   the CHIP RAM access timings.
 * Word alignment affects performance as follows:
    * adda + cache off case: lowers it slightly;
    * adda + cache on case: improves it slightly;
    * lea + cache on case: improves it slightly;
    * lea + cache off case: has no effect.
   In particular, it eliminates the extra cycle taken by dbf in the cache on
   cases.


--------------------------------------------------------------------------------
68030 @ 50 MHz / CHIP RAM / FAST RAM 60 ns

OFFICIAL SPECIFICATIONS (CACHE / NO CACHE)
 * adda: FEA: 2/2, operation 4/4 -> 6/6
 * lea:  CEA: 2/2, operation: 2/2 -> 4/4
 * dbf:  branch taken: 6/8, branch not taken: 10/13

CORE CODE

 .l rept   10
    adda.w #-32768,a0     ;6/6
    endr
    dbf    d0,.l          ;6/8 or 10/13

 .l rept   10
    lea.l  (-32768,a0),a0 ;4/4
    endr
    dbf    d0,.l          ;6/8 or 10/13

THEORETICAL CYCLES
 * adda / total / cache: (6*10+8) + (6*10+6)*9998 + (6*10+10) = 660006
 * adda / loop  / cache: 660006/10000 = ~66
 * adda / total / no cache: (6*10+8)*9999 + (6*10+13) = 680005
 * adda / loop  / no cache: 680005/10000 = ~68
 * lea / total / cache: (4*10+8) + (4*10+6)*9998 + (4*10+10) = 460006
 * lea / loop  / cache: 460006/10000 = ~46
 * lea / total / no cache: (4*10+8)*9999 + (4*10+13) = 480005
 * lea / loop  / no cache: 480005/10000 = ~48

RESULTS FOR LONGWORD-ALIGNED CODE

       |          |        cache on       |       cache off
  ins. |     unit +-----------+-----------+-----------+------------
       |          |  FAST RAM |  CHIP RAM |  FAST RAM |   CHIP RAM
 ------+----------+-----------+-----------+-----------+------------
  adda |   CCKs/T |     46895 |     47141 |     59046 |     290438
       | cycles/T | 661071.16 | 664538.98 | 832361.83 | 4094257.09
       | cycles/L |     66.11 |     66.45 |     83.24 |     409.43
 ------+----------+-----------+-----------+-----------+------------
   lea |   CCKs/T |     32914 |     32934 |     59046 |     290297
       | cycles/T | 463983.29 | 464265.22 | 832361.83 | 4092269.44
       | cycles/L |     46.40 |     46.43 |     83.24 |     409.23

RESULTS FOR WORD-ALIGNED CODE

       |          |        cache on       |        cache off
  ins. |     unit +-----------+-----------+------------+------------
       |          |  FAST RAM |  CHIP RAM |   FAST RAM |   CHIP RAM
 ------+----------+-----------+-----------+------------+------------
  adda |   CCKs/T |     47121 |     47147 |      74670 |     342832
       | cycles/T | 664257.05 | 664623.57 | 1052610.81 | 4832846.76
       | cycles/L |     66.43 |     66.46 |     105.26 |     483.29
 ------+----------+-----------+-----------+------------+------------
   lea |   CCKs/T |     32914 |     32934 |      60256 |     296603
       | cycles/T | 463983.29 | 464265.22 |  849419.00 | 4181164.09
       | cycles/L |     46.40 |     46.43 |      84.94 |     418.12

CALCULATION OF CPU CYCLES
 * total = (CCKs * 50000000) / 3546895 = (CCKs * 10000000) / 709379
 * loop  = ((CCKs * 50000000) / 3546895) / 10000 = (CCKs * 1000) / 709379

NOTES
 * The cache on case loops take 0.4+ cycles more than expected (46.4+ VS 46 and
   66.4+ VS 66); by executing a loop of a single lea 65536 times, the measured
   time is 46792 CCKs = 10.07 cycles, which is closer to the expected value (10
   cycles); quite weird.
 * The cache off + FAST RAM case loops are much slower than the theoretical
   value (83+ VS 48 cycles in the best case), which is unexpected given that the
   accelerator board is said to have a zero-wait-state design and that the RAM
   access timing (60 ns) is an exact multiple of the CPU clock timing (20 ns);
   looking at the numbers, it seems that dbf takes 10 cycles, i.e. 2 cycles more
   than expected (10 VS 8), and that adda/lea take (83-10)/10 = ~7 cycles, i.e. 3
   cycles more than expected for lea (7 VS 4) and 1 more for adda (7 VS 6).
 * Word alignment affects performance as follows:
    * adda + cache off case: lowers it a lot;
    * adda + cache on case: lowers it slightly;
    * lea + cache off case: lowers it slightly;
    * lea + cache on case: has no effect.
Attached is an (EDIT: updated) archive with the test programs, so that anyone who feels like it can run more tests (and also measure the 68040 and 68060 performance).

I totally forgot to check whether word/longword alignment made any difference; I did mean to, but then I just forgot. (This is now fixed: the updated results and test programs also deal with alignment.)
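The CCK-to-cycle conversions from the report can be reproduced with a short Python sketch. The CCK frequency (3546895 Hz) and the CPU clocks are the ones used in the report's own formulas; the function and constant names are illustrative.

```python
CCK_HZ = 3_546_895            # PAL colour clock, as used in the report's formulas
CPU_68020_HZ = 4 * CCK_HZ     # 14.18758 MHz, i.e. exactly 4 CPU cycles per CCK
CPU_68030_HZ = 50_000_000     # 50 MHz accelerator clock


def cck_to_cycles(ccks, cpu_hz):
    """Convert a duration in colour clocks to CPU cycles."""
    return ccks * cpu_hz / CCK_HZ


def cycles_per_loop(ccks, cpu_hz, iterations=10_000):
    """Average CPU cycles per loop iteration (the cycles/L rows)."""
    return cck_to_cycles(ccks, cpu_hz) / iterations


# 68020, adda, cache on, CHIP RAM, longword-aligned
print(cck_to_cycles(167784, CPU_68020_HZ), cycles_per_loop(167784, CPU_68020_HZ))
# 68030, adda, cache on, FAST RAM, longword-aligned
print(cck_to_cycles(46895, CPU_68030_HZ), cycles_per_loop(46895, CPU_68030_HZ))
# 68030, lea, cache on, FAST RAM, longword-aligned
print(cck_to_cycles(32914, CPU_68030_HZ), cycles_per_loop(32914, CPU_68030_HZ))
```

On the 68020 the conversion is an exact multiplication by 4; on the 68030 the 50 MHz clock is not a multiple of the CCK rate, hence the fractional cycle counts in the tables.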
Attached Files
File Type: lha addaVSlea.lha (45.4 KB, 37 views)

Last edited by saimo; 28 September 2022 at 23:00.
Old 28 September 2022, 17:36   #30
PeterK
Registered User
 
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,365
Quote:
ADDA.W #X,AN VS LEA.L (D16,AN),AN COMPARISON
Is that just a Copy&Paste bug?
Code:
lea.l  #$8000,a0
or where is the LEA.L (D16,AN),AN ?
Old 28 September 2022, 17:53   #31
saimo
Registered User
 
 
Join Date: Aug 2010
Location: Italy
Posts: 787
Quote:
Originally Posted by PeterK View Post
Is that just a Copy&Paste bug?
Code:
lea.l  #$8000,a0
or where is the LEA.L (D16,AN),AN ?
Yeah, horrible copy&paste bug
Thanks for pointing it out! I'll fix the text and re-upload the archive.

Thankfully, the actual code is correct, though:
Code:
.l

	ifeq	INSTRUCTION

	rept	10
	dc.w	$d0fc,$8000	;adda.w #-32768,a0
	endr

	else

	rept	10
	dc.w	$41e8,$8000	;lea.l (-32768,a0),a0
	endr

	endif

	dbf	d0,.l
EDIT: it just occurred to me that I also forgot to check whether the assembler changed the instructions! I'm writing the instructions by hand now to make sure. Sorry.
EDIT2: I just disassembled the adda test programs, and it turned out that the assembler did change adda into lea! No surprise the results were identical! Epic fail. I'm fixing everything and re-running the tests now.
EDIT3: everything fixed & updated.
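For anyone who wants to double-check the hand-encoded words, here is a small Python sketch that builds the two opcodes from the standard 68k bit fields (ADDA: 1101 rrr ooo eeeeee, LEA: 0100 rrr 111 eeeeee). The function names are made up for illustration; this is not part of the test programs.

```python
def encode_adda_w_imm(imm16, an):
    """ADDA.W #imm16,An -> [opcode word, immediate extension word]."""
    opmode_word = 0b011        # ADDA opmode 011 = word source
    ea_immediate = 0b111100    # EA mode 111, register 100 = #<data>
    opcode = 0xD000 | (an << 9) | (opmode_word << 6) | ea_immediate
    return [opcode, imm16 & 0xFFFF]


def encode_lea_d16(d16, src_an, dst_an):
    """LEA (d16,As),Ad -> [opcode word, displacement extension word]."""
    ea_d16_an = (0b101 << 3) | src_an   # EA mode 101 = (d16,An)
    opcode = 0x4000 | (dst_an << 9) | (0b111 << 6) | ea_d16_an
    return [opcode, d16 & 0xFFFF]


adda_words = encode_adda_w_imm(-32768, 0)   # adda.w #-32768,a0
lea_words = encode_lea_d16(-32768, 0, 0)    # lea (-32768,a0),a0
print("adda.w:", ", ".join(f"${w:04x}" for w in adda_words))
print("lea   :", ", ".join(f"${w:04x}" for w in lea_words))
```

Both instructions are two words long, so swapping one for the other does not change code size or alignment.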

Last edited by saimo; 28 September 2022 at 23:02.
Old 28 September 2022, 18:11   #32
paraj
Registered User
 
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
Just to be sure, I checked on my 060: lea X(An),An, adda.w #X,An and adda.l #X,An are all equally fast (1 cycle) as long as instruction fetch can keep up (otherwise adda.l is worse, of course, and none of them can reach 0.5 cycles in isolation). Also
Code:
        lea    X(a0),a0
        move.l  (a0),d0
is faster than the adda.w equivalent, as the change/use stall is indeed avoided.
Old 28 September 2022, 18:13   #33
PeterK
Registered User
 
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,365
Thanks saimo! I've tried a few of your benchmarks on WinUAE, but unfortunately my Windows has too many background processes running and the timing results are more or less random, so an exact comparison seems to be impossible. I would really like to know whether ADDA or LEA performs better, since this is one of the optimizations which can be enabled in PhxAss, and I usually have it switched on.

Quote:
EDIT2: I just disassembled the adda test programs, and it turned out that the assembler did change them into leas! No surprise the results were identical! Epic fail. I'm fixing everything and re-running the tests now.
The opposite happened to me some weeks ago. I forgot to switch on the PhxAss optimization and then my icon.library self protection didn't accept the different code checksum.

Last edited by PeterK; 28 September 2022 at 18:23.
Old 28 September 2022, 19:42   #34
saimo
Registered User
 
 
Join Date: Aug 2010
Location: Italy
Posts: 787
Quote:
Originally Posted by PeterK View Post
Thanks saimo! I've tried a few of your benchmarks on WinUAE, but unfortunately my Windows has too many background processes running and the timing results are more or less random, so an exact comparison seems to be impossible.
Emulators are definitely not reliable test platforms, indeed.

Quote:
I would really like to know whether ADDA or LEA is performing better, since this is one of the optimizations which can be enabled in PhxAss, and usually I have switched it on.
I have updated my previous post (both results and archive with test programs). Lea is indeed faster in some cases on 68020 and 68030 (and I do actually use lea in my code, but for some reason this time around I got hit by a doubt and just had to double check, while I was supposed to work on something else - as usual, rushing things never helps).

Quote:
The opposite happened to me some weeks ago. I forgot to switch on the PhxAss optimization and then my icon.library self protection didn't accept the different code checksum.
By coincidence, I used PhxAss as well
Old 28 September 2022, 20:11   #35
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by saimo View Post
By coincidence, I used PhxAss as well
Then you don't need to encode the instructions manually. Just use opt 0 to turn all optimizations off, either at the command line or in the source itself.
Old 28 September 2022, 21:20   #36
smack
Registered User
 
Join Date: May 2020
Location: Germany
Posts: 20
for reference, the 68030 User's Manual
https://www.nxp.com/docs/en/referenc...68030UM-P1.pdf
https://www.nxp.com/docs/en/referenc...68030UM-P2.pdf

SECTION 11 INSTRUCTION EXECUTION TIMING
can be found in part 2

Code:
			Head		Tail	I-Cache Case	No-Cache Case
adda.w	#-32768,a0
; (fea) #<data>.W	2		0	2(0/0/0)	2(0/1/0)
; ADDA.W EA,An		0		0	4(0/0/0)	4(0/1/0)
; total = 6 cycles

lea (-32768,a0),a0
; (cea) (d16,An)	2 + op head	0	2(0/0/0)	2(0/1/0)
; LEA			2		0	2(0/0/0)	2(0/1/0)
; total = 4 cycles
so the 68030 document and test results agree.
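As a rough cross-check, the expected loop time can be rebuilt from these manual figures and compared with the measurements in post #29. This is a Python sketch with illustrative names; the dbf figure (6 cycles, branch taken, cache case) and the measured cycles/L values are taken from that post.

```python
def loop_cycles(ins_cycles, repeats, dbf_taken):
    """Cycles for one pass of the test loop: N copies of the instruction + dbf."""
    return ins_cycles * repeats + dbf_taken


# 68030, I-cache case: totals from the table above, dbf taken = 6 (post #29).
expected = {
    "adda.w": loop_cycles(6, 10, 6),
    "lea": loop_cycles(4, 10, 6),
}

# Measured cycles per loop from post #29 (cache on, FAST RAM, longword-aligned).
measured = {"adda.w": 66.11, "lea": 46.40}

for ins in expected:
    diff = measured[ins] - expected[ins]
    print(f"{ins:<7} expected {expected[ins]}  measured {measured[ins]:.2f}  (+{diff:.2f})")
```

Both measurements land within half a cycle of the prediction.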

the 68020 document seems to have incomplete timing data for ADDA / SUBA: it doesn't distinguish between the faster ADDA.L (4 cycles) and the slower ADDA.W (6 cycles).

Last edited by smack; 28 September 2022 at 22:26. Reason: added instruction timings
Old 28 September 2022, 23:08   #37
saimo
Registered User
 
 
Join Date: Aug 2010
Location: Italy
Posts: 787
@meynaf

Quote:
Originally Posted by meynaf View Post
Then you don't need to encode the instructions manually. Just use opt 0 to turn all optimizations off, either at the command line or in the source itself.
Yep, but that wouldn't have been as fool-proof. And here I'm proving I can be a real fool (see also reply below)


@smack

Quote:
Originally Posted by smack View Post
for reference, the 68030 User's Manual
https://www.nxp.com/docs/en/referenc...68030UM-P1.pdf
https://www.nxp.com/docs/en/referenc...68030UM-P2.pdf

SECTION 11 INSTRUCTION EXECUTION TIMING
can be found in part 2

Code:
            Head        Tail    I-Cache Case    No-Cache Case
adda.w    #-32768,a0
; (fea) #<data>.W    2        0    2(0/0/0)    2(0/1/0)
; ADDA.W EA,An        0        0    4(0/0/0)    4(0/1/0)
; total = 6 cycles

lea (-32768,a0),a0
; (cea) (d16,An)    2 + op head    0    2(0/0/0)    2(0/1/0)
; LEA            2        0    2(0/0/0)    2(0/1/0)
; total = 4 cycles
so the 68030 document and test results agree.
You're totally right, thank you. I won't even try to describe which broken mental processes prevented me from adding the FEA to the calculations. Just one more epic fail...

Quote:
the 68020 document seems to have incomplete timing data for ADDA / SUBA: it doesn't distinguish between the faster ADDA.L (4 cycles) and the slower ADDA.W (6 cycles).
Indeed. And 68020 and 68030 are generally so similar when it comes to timings that one would expect that the UM isn't quite right there. Though, there are surprises sometimes (for example, see notes below about alignment).


@all

Updated the results and the archive in post #29. Now also code alignment is taken into account.

On 68020, word alignment affects performance as follows:
* adda + cache off case: lowers it slightly;
* adda + cache on case: improves it slightly;
* lea + cache on case: improves it slightly;
* lea + cache off case: has no effect.
In particular, it eliminates the extra cycle taken by dbf in the cache on cases, so the measured times are very close to the theoretical ones - i.e. word alignment is better than longword alignment in this specific case.

On 68030, word alignment affects performance as follows:
* adda + cache off case: lowers it a lot;
* adda + cache on case: lowers it slightly;
* lea + cache off case: lowers it slightly;
* lea + cache on case: has no effect.
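To put numbers on "slightly" and "a lot", here is a Python sketch that computes the relative change from longword to word alignment, using the CCK totals from the tables in post #29 (68020 from CHIP RAM, 68030 from FAST RAM). The dictionary layout and names are illustrative.

```python
# CCKs per test run from post #29: (longword-aligned, word-aligned).
ccks = {
    ("68020", "adda", "cache off"): (342832, 353033),
    ("68020", "adda", "cache on"): (167784, 165280),
    ("68020", "lea", "cache on"): (117936, 115432),
    ("68020", "lea", "cache off"): (342832, 342832),
    ("68030", "adda", "cache off"): (59046, 74670),
    ("68030", "adda", "cache on"): (46895, 47121),
    ("68030", "lea", "cache off"): (59046, 60256),
    ("68030", "lea", "cache on"): (32914, 32914),
}


def percent_change(long_aligned, word_aligned):
    """Positive = word alignment is slower, negative = word alignment is faster."""
    return (word_aligned - long_aligned) / long_aligned * 100.0


change = {case: percent_change(*pair) for case, pair in ccks.items()}
for (cpu, ins, cache), pct in change.items():
    print(f"{cpu} {ins:<4} {cache:<9} {pct:+6.2f}%")
```

The 68030 adda cache-off case stands out at roughly a quarter slower; every other case stays within about 3%.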

Last edited by saimo; 28 September 2022 at 23:15.
Old 28 September 2022, 23:21   #38
saimo
Registered User
 
 
Join Date: Aug 2010
Location: Italy
Posts: 787
Quote:
Originally Posted by phx View Post
What makes LEA attractive is that you can store the result of the addition to a different register.
And on 68020+ you can also get quick multiplications by 3, 5 and 9 (plus displacement). That's very advantageous if the source operand happens to be in an address register already and it's OK to have the destination in the same / another address register - OK, it's a very specific case, but still it might be useful. The same goes for additions of any two operands, one of which has to be pre-multiplied by 2, 4 or 8.
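A tiny Python model of the address arithmetic behind this trick, for illustration only: it mimics what lea with a scaled index computes (base + index*scale + displacement, 32-bit wrap-around). The function name is made up, and the assembly lines in the comments show the corresponding 68020+ instructions.

```python
MASK32 = 0xFFFFFFFF


def lea_scaled(base, index, scale, disp=0):
    """Model of lea (disp,base,index*scale),dst on 68020+."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("68020+ scale factor must be 1, 2, 4 or 8")
    return (base + index * scale + disp) & MASK32


x = 100
times3 = lea_scaled(x, x, 2)             # lea (a0,a0.l*2),a1   -> 3*x
times5 = lea_scaled(x, x, 4)             # lea (a0,a0.l*4),a1   -> 5*x
times9_plus7 = lea_scaled(x, x, 8, 7)    # lea (7,a0,a0.l*8),a1 -> 9*x + 7

# Sum of two operands, one pre-multiplied by 4:
mixed = lea_scaled(10, 3, 4)             # lea (a0,d0.l*4),a1   -> a0 + d0*4
print(times3, times5, times9_plus7, mixed)
```

Using the same register as base and index is what yields the odd factors 3, 5 and 9 (1 + 2, 1 + 4, 1 + 8).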
saimo is offline  
Old 29 September 2022, 13:01   #39
PeterK
Registered User
 
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,365
Quote:
Originally Posted by saimo View Post
And on 68020+ you can also get quick multiplications by 3, 5 and 9 (plus displacement).
Great idea! That's a good trick which I can use at least in one case, although the impact on the speed of my library won't be noticeable.
Old 30 September 2022, 03:26   #40
Bruce Abbott
Registered User
 
 
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,546
Quote:
Originally Posted by saimo View Post
I have a same-specced machine, but a number of tests showed that adda and lea execute at the same speed in all cases.
EDIT: I didn't consider that assembler optimizations might kick in, so my previous results were faked by the fact that the assembler actually replaced adda with lea.

Stupid mistake, apologies. Now that I have encoded the instructions manually, my tests return the same results (i.e. adda 6 cycles, lea 4 cycles, both 7 cycles when cache is off).
Thanks for the confirmation. Verifying that the tests are done correctly can be tricky, even after disassembling the code to make sure it is assembled correctly.

Quote:
I totally forgot to check whether word/longword alignment made any difference; I did mean to, but then I just forgot (Now this is fixed: the updated results and test programs also deal with alignment.)
This can also be tricky to get right. To reduce loop overhead my test code repeats the instruction 25 or 50 times. If an instruction has an odd number of words then the alignment changes from one instruction to the next, and the result is an average of different alignments.
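The alignment effect described here can be illustrated with a short Python sketch that lays out repeated copies of an instruction and counts how many start on a longword boundary. The function name and the start address are hypothetical; instruction lengths are given in 16-bit words.

```python
def longword_aligned_starts(words_per_instruction, repeats, start_address=0):
    """Count how many copies of an instruction start on a longword boundary."""
    size = 2 * words_per_instruction  # 68k instruction words are 2 bytes each
    return sum(
        1 for i in range(repeats) if (start_address + i * size) % 4 == 0
    )


# 50 back-to-back copies, first one longword-aligned:
one_word = longword_aligned_starts(1, 50)     # e.g. addq.l #x,An (1 word)
two_words = longword_aligned_starts(2, 50)    # e.g. lea xx(An),An (2 words)
three_words = longword_aligned_starts(3, 50)  # e.g. adda.l #xxxx,An (3 words)
print(one_word, two_words, three_words)
```

With an odd word count only every other copy is longword-aligned, so a measurement over the whole run averages the two alignments; with an even word count every copy shares the alignment of the first.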