English Amiga Board adda / suba Vs. lea

21 September 2022, 21:46   #21
Karlos
Registered User

Join Date: Aug 2022
Location: UK
Posts: 367
At the risk of being a filthy necromancer, is there a definitive answer for 68020+ where the offset size can be represented as a 16 bit immediate? Assume code is in fast memory and instruction cache enabled.

27 September 2022, 01:03   #22
smack
Registered User

Join Date: May 2020
Location: Germany
Posts: 10
Quote:
 Originally Posted by Karlos At the risk of being a filthy necromancer, is there a definitive answer for 68020+ where the offset size can be represented as a 16 bit immediate? Assume code is in fast memory and instruction cache enabled.
Looking at section 8 of this document: https://www.nxp.com/docs/en/data-sheet/MC68020UM.pdf
• adda / suba EA,An
  • 0 / 2 / 3 cycles (best / cache / worst case), 8.2.8 Arithmetic/Logical Instructions
  • 0 / 2 / 3 cycles for EA=#<data>.W, 8.2.1 Fetch Effective Address
  • = 0 / 4 / 6 cycles
• lea
  • 2 / 2 / 3 cycles, 8.2.16 Control Instructions
  • 2 / 2 / 3 cycles for EA=(d16,An), 8.2.3 Calculate Effective Address
  • = 4 / 4 / 6 cycles

There is no clear winner, but I would use adda/suba because it has a chance of a faster best-case and the code is more expressive IMO.
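
As a quick check of the sums above, here is a minimal Python sketch that adds the per-stage figures quoted from the manual. The `(best, cache, worst)` tuple layout and all names are my own, and plain addition ignores the overlap effects the manual warns about, so treat the totals as estimates only.

```python
# Per-stage timings (best, cache, worst) as quoted above from the MC68020 manual.
FETCH_EA_IMM_W = (0, 2, 3)  # 8.2.1 Fetch Effective Address, #<data>.W
ADDA_OP = (0, 2, 3)         # 8.2.8 adda / suba EA,An
CALC_EA_D16 = (2, 2, 3)     # 8.2.3 Calculate Effective Address, (d16,An)
LEA_OP = (2, 2, 3)          # 8.2.16 lea


def total(*stages):
    """Add the stage timings column by column (no overlap is modelled)."""
    return tuple(sum(column) for column in zip(*stages))


adda_total = total(FETCH_EA_IMM_W, ADDA_OP)
lea_total = total(CALC_EA_D16, LEA_OP)

print("adda.w #imm,An  :", adda_total)
print("lea (d16,An),An :", lea_total)
```

The two totals differ only in the best-case column, which is why the manual figures alone do not pick a winner.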

If you wonder what these best / cache / worst cases are, well, it's complicated:
Quote:
 8.1 TIMING ESTIMATION FACTORS The advanced architecture of the MC68020/EC020 makes exact instruction timing calculations difficult due to the effects of: 1. An On-Chip Instruction Cache and Instruction Prefetch 2. Operand Misalignment 3. Bus Controller/Sequence Concurrency 4. Instruction Execution Overlap These factors make MC68020/EC020 instruction set timing difficult to calculate on a single instruction basis since instructions vary in execution time from one context to another.
I expect the instruction timing to be similarly complex for the 040 and 060 processors...

27 September 2022, 13:14   #23
phx
Natteravn

Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,239
What makes LEA attractive is that you can store the result of the addition to a different register.

27 September 2022, 22:06   #24
Bruce Abbott
Registered User

Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 1,244
Quote:
 Originally Posted by Karlos At the risk of being a filthy necromancer, is there a definitive answer for 68020+ where the offset size can be represented as a 16 bit immediate? Assume code is in fast memory and instruction cache enabled.
Real-world test on 50MHz 030 (Blizzard 1230-IV with 60ns DRAM):-

In cache:-
addq.w #x,An 2 clocks
addq.l #x,An 2 clocks
lea xx(An),An 4 clocks
lea x(An,Rn),An 6 clocks
adda.w #xx,An 6 clocks
adda.l #xxxx,An 6 clocks

Not cached:-
addq.w #x,An 4 clocks
addq.l #x,An 4 clocks
lea xx(An),An 7 clocks
lea x(An,Rn),An 7 clocks
adda.w #xx,An 7 clocks
adda.l #xxxx,An 11 clocks

This shows a 2 clock advantage of lea over adda.w when cached, but no advantage when not cached. Also note that adda.l is the same speed as adda.w when cached, but 4 clocks slower when not cached!

It also shows that 60ns DRAM is not fast enough to keep up with the 50MHz 030 on the Blizzard 1230-IV. Getting that code into the cache can make it up to twice as fast!
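
To put numbers on the "up to twice as fast" remark, here is a small Python sketch that turns the clock counts from the two lists above into nanoseconds at 50 MHz and into cached/uncached ratios. The dictionary and function names are mine; the clock counts are copied from the lists.

```python
# (cached clocks, uncached clocks) as measured on the 50 MHz 68030 above.
measured = {
    "addq.w #x,An": (2, 4),
    "addq.l #x,An": (2, 4),
    "lea xx(An),An": (4, 7),
    "lea x(An,Rn),An": (6, 7),
    "adda.w #xx,An": (6, 7),
    "adda.l #xxxx,An": (6, 11),
}

CLOCK_NS = 1000 / 50  # one clock at 50 MHz lasts 20 ns


def slowdown(cached, uncached):
    """How many times slower the instruction runs when it is not cached."""
    return uncached / cached


ratios = {ins: slowdown(c, u) for ins, (c, u) in measured.items()}

for ins, (c, u) in measured.items():
    print(f"{ins:16} {c * CLOCK_NS:4.0f} ns cached, "
          f"{u * CLOCK_NS:4.0f} ns uncached, x{ratios[ins]:.2f}")
```

Only the addq forms reach the full factor of two; the others land between roughly 1.2 and 1.8.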

27 September 2022, 23:39   #25
Karlos
Registered User

Join Date: Aug 2022
Location: UK
Posts: 367
Great responses, thanks! Any info on 040 / 060 ?

28 September 2022, 02:04   #26
smack
Registered User

Join Date: May 2020
Location: Germany
Posts: 10
68040, see section 10 of this document https://www.nxp.com/docs/en/referenc.../MC68040UM.pdf
instruction and data caches, execution pipeline (more complex than the 68020)
• adda #,An = 1 cycle (1 + 1 execute)
• suba #,An = 2 cycles (1 + 2 execute) <- is that real? suba slower than adda?
• lea (d16,An),An = 2 cycles (2 + 2 execute)

68060, see section 10 of this document https://www.nxp.com/docs/en/data-sheet/MC68060UM.pdf
instruction and data caches, superscalar pipeline and dual execution units (more complex than the 68040)
• adda #,An = 1 cycle, pOEP | sOEP
• suba #,An = 1 cycle, pOEP | sOEP
• lea (d16,An),An = 1 cycle, pOEP | sOEP

Of course, real world performance depends on the specific sequence of instructions executed, because of how they interact (overlap or stall) in the execution pipelines.

28 September 2022, 09:36   #27
paraj
Registered User

Join Date: Feb 2017
Location: Denmark
Posts: 425
For 060 I think lea is slightly better (all other things being equal) since it can sometimes avoid a 2 cycle change/use register stall (10.2.3).

The instruction timings for 040 look a bit odd for adda/suba:
adda Dn,Am ;1+2
adda An,Am ;1+1
adda #,Am ;1+1
suba Dn,Am ;1+1
suba An,Am ;1+2
suba #,Am ;1+2
So suba Dn,Am is faster than adda Dn,Am and suba An,Am?

28 September 2022, 11:22   #28
a/b
Registered User

Join Date: Jun 2016
Location: europe
Posts: 737
68040 real hardware (1000 iterations of 8x repeated):
Code:
```
adda.   d0,a0       ; 3528 3
adda.   a0,a1       ; 1112 1
adda.   (a0),a1     ; 2396 2
adda.   #,a0        ; 1117 1
suba.   d0,a0       ; 3528 3
suba.   (a0),a1     ; 2396 2
suba.   #,a0        ; 1117 1
lea     (a0),a1     ; 1112 1
lea     (d.w,a0),a1 ; 2391 2
lea     (d.w,pc),a0 ; 4954 4
```
No size means the same speed for both .w and .l.
Not really "best case" scenario, but close enough I guess (everything is in cache).

28 September 2022, 18:17   #29
saimo
Registered User

Join Date: Aug 2010
Location: Italy
Posts: 555
Quote:
I have a same-specced machine, but a number of tests showed that adda and lea execute at the same speed in all cases.
EDIT: I didn't consider that assembler optimizations might kick in, so my previous results were faked by the fact that the assembler actually replaced adda with lea. Stupid mistake, apologies. Now that I have encoded the instructions manually, my tests return the same results (i.e. adda 6 cycles, lea 4 cycles, both 7 cycles when cache is off).

Updated results:
Code:
```--------------------------------------------------------------------------------
ADDA.W #X,AN VS LEA.L (D16,AN),AN COMPARISON

NOTES
* The tests have been run on an Amiga 1200 equipped with a Blizzard 1230-IV
with an MC68030 clocked at 50 MHz and 60 ns FAST RAM.
* The tests for the unexpanded Amiga 1200 cases have been run on the same
machine with the accelerator board disabled.
* The tests have been run with DMA and interrupts off, and with PAL settings.
* The tests have been run 3 or 4 times each, and the best results have been
taken; most of the tests always returned the same result (only the 68030,
cache off, CHIP RAM cases showed fluctuations of some tens/hundreds cycles,
which were insignificant anyway).
* The tests execute the core code 10000 times.
* The duration of the tests is measured by means of color clocks (CCKs).
* The test programs require ECS or AGA (but have been tested on AGA only).
* The test programs can be run from both shell and Workbench.
* The test programs take no argument.
* The test programs shut the OS off, take over the machine entirely and access
the hardware directly; although on exit they restore the system with the
utmost care, no guarantee is given - USE AT YOUR OWN RISK.
* Included test programs:
* ADCL = Adda, cache Disabled, CHIP RAM, Longword alignment
* ADCW = Adda, cache Disabled, CHIP RAM, Word alignment
* ADFL = Adda, cache Disabled, FAST RAM, Longword alignment
* ADFW = Adda, cache Disabled, FAST RAM, Word alignment
* AECL = Adda, cache Enabled, CHIP RAM, Longword alignment
* AECW = Adda, cache Enabled, CHIP RAM, Word alignment
* AEFL = Adda, cache Enabled, FAST RAM, Longword alignment
* AEFW = Adda, cache Enabled, FAST RAM, Word alignment
* LDCL = Lea, cache Disabled, CHIP RAM, Longword alignment
* LDCW = Lea, cache Disabled, CHIP RAM, Word alignment
* LDFL = Lea, cache Disabled, FAST RAM, Longword alignment
* LDFW = Lea, cache Disabled, FAST RAM, Word alignment
* LECL = Lea, cache Enabled, CHIP RAM, Longword alignment
* LECW = Lea, cache Enabled, CHIP RAM, Word alignment
* LEFL = Lea, cache Enabled, FAST RAM, Longword alignment
* LEFW = Lea, cache Enabled, FAST RAM, Word alignment

--------------------------------------------------------------------------------
68020 @ 14.18758 MHz / CHIP RAM

OFFICIAL SPECIFICATIONS (best/cache/worst)
* adda: FEA: 0/2/3, operation: 0/2/3 -> 0/4/6
* lea:  CEA: 2/2/3, operation: 2/2/3 -> 4/4/6
* dbf:  branch taken: 3/6/9, branch not taken: 7/10/10

CORE CODE

.l      rept   10
        adda.w #-32768,a0     ;0/4/6
        endr
        dbf    d0,.l          ;3/6/9 or 7/10/10

.l      rept   10
        lea.l  (-32768,a0),a0 ;4/4/6
        endr
        dbf    d0,.l          ;3/6/9 or 7/10/10

THEORETICAL CYCLES
* total / cache: (6*10+9) + (4*10+6)*9998 + (4*10+10) = 460027
* loop  / cache: 460027/10000 = ~46
* total / worst: (6*10+9)*9999 + (6*10+10) = 690001
* loop  / worst: 690001/10000 = ~69

RESULTS FOR LONGWORD-ALIGNED CODE

      |          |       cache on      |      cache off
 ins. |     unit +----------+----------+----------+----------
      |          | FAST RAM | CHIP RAM | FAST RAM | CHIP RAM
------+----------+----------+----------+----------+----------
 adda |   CCKs/T |          |   167784 |          |   342832
      | cycles/T |          |   671136 |          |  1371328
      | cycles/L |          |    67.11 |          |   137.13
------+----------+----------+----------+----------+----------
  lea |   CCKs/T |          |   117936 |          |   342832
      | cycles/T |          |   471744 |          |  1371328
      | cycles/L |          |    47.17 |          |   137.13

RESULTS FOR WORD-ALIGNED CODE

      |          |       cache on      |      cache off
 ins. |     unit +----------+----------+----------+----------
      |          | FAST RAM | CHIP RAM | FAST RAM | CHIP RAM
------+----------+----------+----------+----------+----------
 adda |   CCKs/T |          |   165280 |          |   353033
      | cycles/T |          |   661120 |          |  1412132
      | cycles/L |          |    66.11 |          |   141.21
------+----------+----------+----------+----------+----------
  lea |   CCKs/T |          |   115432 |          |   342832
      | cycles/T |          |   461728 |          |  1371328
      | cycles/L |          |    46.17 |          |   137.13

CALCULATION OF CPU CYCLES
* total = CCKs * 4
* loop  = (CCKs * 4) / 10000 = CCKs / 2500

NOTES
* The lea + cache on + longword alignment case loop takes 1+ cycles more than
expected (47+ VS 46); by executing a loop of a single lea 65536 times, the
measured time is 180514 CCKs = 11.02 cycles, which is closer to the expected
value (10 cycles), but still ~1 cycle slower; it seems that, at least in this
context, dbf actually takes 7 cycles.
* The adda + cache on + longword alignment case loop takes 21+ cycles more than
expected (67 VS 46); given that dbf takes 7 cycles, adda takes (67-7)/10 = 6
cycles, i.e. 2 cycles more than expected (6 VS 4).
* The cache off case loops are about 2 times slower than the theoretical value
(137.13 and 141.21 are close to 69*2 = 138), which is not surprising given
the CHIP RAM access timings.
* Word alignment affects performance as follows:
* adda + cache off case: lowers it slightly;
* adda + cache on case: improves it slightly;
* lea + cache on case: improves it slightly;
* lea + cache off case: has no effect.
In particular, it eliminates the extra cycle taken by dbf in the cache on
cases.

--------------------------------------------------------------------------------
68030 @ 50 MHz / CHIP RAM / FAST RAM 60 ns

OFFICIAL SPECIFICATIONS (CACHE / NO CACHE)
* adda: FEA: 2/2, operation 4/4 -> 6/6
* lea:  CEA: 2/2, operation: 2/2 -> 4/4
* dbf:  branch taken: 6/8, branch not taken: 10/13

CORE CODE

.l      rept   10
        adda.w #-32768,a0     ;6/6
        endr
        dbf    d0,.l          ;6/8 or 10/13

.l      rept   10
        lea.l  (-32768,a0),a0 ;4/4
        endr
        dbf    d0,.l          ;6/8 or 10/13

THEORETICAL CYCLES
* adda / total / cache: (6*10+8) + (6*10+6)*9998 + (6*10+10) = 660006
* adda / loop  / cache: 660006/10000 = ~66
* adda / total / no cache: (6*10+8)*9999 + (6*10+13) = 680005
* adda / loop  / no cache: 680005/10000 = ~68
* lea / total / cache: (4*10+8) + (4*10+6)*9998 + (4*10+10) = 460006
* lea / loop  / cache: 460006/10000 = ~46
* lea / total / no cache: (4*10+8)*9999 + (4*10+13) = 480005
* lea / loop  / no cache: 480005/10000 = ~48

RESULTS FOR LONGWORD-ALIGNED CODE

      |          |        cache on       |       cache off
 ins. |     unit +-----------+-----------+-----------+------------
      |          |  FAST RAM |  CHIP RAM |  FAST RAM |   CHIP RAM
------+----------+-----------+-----------+-----------+------------
 adda |   CCKs/T |     46895 |     47141 |     59046 |     290438
      | cycles/T | 661071.16 | 664538.98 | 832361.83 | 4094257.09
      | cycles/L |     66.11 |     66.45 |     83.24 |     409.43
------+----------+-----------+-----------+-----------+------------
  lea |   CCKs/T |     32914 |     32934 |     59046 |     290297
      | cycles/T | 463983.29 | 464265.22 | 832361.83 | 4092269.44
      | cycles/L |     46.40 |     46.43 |     83.24 |     409.23

RESULTS FOR WORD-ALIGNED CODE

      |          |        cache on       |        cache off
 ins. |     unit +-----------+-----------+------------+------------
      |          |  FAST RAM |  CHIP RAM |   FAST RAM |   CHIP RAM
------+----------+-----------+-----------+------------+------------
 adda |   CCKs/T |     47121 |     47147 |      74670 |     342832
      | cycles/T | 664257.05 | 664623.57 | 1052610.81 | 4832846.76
      | cycles/L |     66.43 |     66.46 |     105.26 |     483.28
------+----------+-----------+-----------+------------+------------
  lea |   CCKs/T |     32914 |     32934 |      60256 |     296603
      | cycles/T | 463983.29 | 464265.22 |  849419.00 | 4181164.09
      | cycles/L |     46.40 |     46.43 |      84.94 |     418.12

CALCULATION OF CPU CYCLES
* total = (CCKs * 50000000) / 3546895 = (CCKs * 10000000) / 709379
* loop  = ((CCKs * 50000000) / 3546895) / 10000 = (CCKs * 1000) / 709379

NOTES
* The cache on case loops take 0.4+ cycles more than expected (46.4+ VS 46 and
66.4+ VS 66); by executing a loop of a single lea 65536 times, the measured
time is 46792 CCKs = 10.07 cycles, which is closer to the expected value (10
cycles); quite weird.
* The cache off + FAST RAM case loops are much slower than the theoretical
value (83+ VS 48 cycles in the best case), which is unexpected given that the
accelerator board is said to have a zero-wait-state design and that the RAM
access timing (60 ns) is an exact multiple of the CPU clock timing (20 ns);
looking at the numbers, it seems that dbf takes 10 cycles, i.e. 2 cycles more
than expected (10 VS 8), and that adda/lea take (83-10)/10 = 7 cycles, i.e. 3
cycles more than expected (7 VS 4).
* Word alignment affects performance as follows:
* adda + cache off case: lowers it a lot;
* adda + cache on case: lowers it slightly;
* lea + cache off case: lowers it slightly;
* lea + cache on case: has no effect.```
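
For anyone re-running the tests, the CCK-to-cycle conversions from the "CALCULATION OF CPU CYCLES" sections can be reproduced with the short Python sketch below. The function names are mine; the constants (3546895 Hz PAL color clock, 50 MHz 68030, 68020 at four times the color clock, 10000 loop passes) are the ones given in the results above.

```python
from fractions import Fraction

CCK_HZ = 3546895             # PAL color clock frequency used in the results
CPU_68020_HZ = 4 * CCK_HZ    # 14.18758 MHz, so total = CCKs * 4
CPU_68030_HZ = 50_000_000    # Blizzard 1230-IV
ITERATIONS = 10000           # the tests execute the core code 10000 times


def cycles_total(ccks, cpu_hz):
    """CPU cycles elapsed during the given number of color clocks."""
    return Fraction(ccks * cpu_hz, CCK_HZ)


def cycles_per_loop(ccks, cpu_hz, iterations=ITERATIONS):
    """Average CPU cycles spent in one pass of the test loop."""
    return cycles_total(ccks, cpu_hz) / iterations


# 68030, adda, cache on, FAST RAM, longword-aligned code: 46895 CCKs.
print(f"{float(cycles_total(46895, CPU_68030_HZ)):.2f} cycles in total")
print(f"{float(cycles_per_loop(46895, CPU_68030_HZ)):.2f} cycles per loop")

# 68020, adda, cache on, CHIP RAM, longword-aligned code: 167784 CCKs.
print(f"{float(cycles_per_loop(167784, CPU_68020_HZ)):.2f} cycles per loop")
```

`Fraction` keeps the intermediate values exact, so rounding happens only when a figure is printed.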
Attached is an (EDIT: updated) archive with the test programs, so that anyone who feels like it can run more tests (and also measure the 68040 and 68060 performance).

I totally forgot to check whether word/longword alignment made any difference; I did mean to, but then I just forgot. (Now this is fixed: the updated results and test programs also deal with alignment.)
Attached Files
 addaVSlea.lha (45.4 KB, 9 views)

Last edited by saimo; 29 September 2022 at 00:00.

28 September 2022, 18:36   #30
PeterK
Registered User

Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,151
Quote:
 ADDA.W #X,AN VS LEA.L (D16,AN),AN COMPARISON
Is that just a Copy&Paste bug?
Code:
`lea.l  #$8000,a0`
or where is the LEA.L (D16,AN),AN ?

28 September 2022, 18:53   #31
saimo
Registered User

Join Date: Aug 2010
Location: Italy
Posts: 555
Quote:
 Originally Posted by PeterK Is that just a Copy&Paste bug? Code: `lea.l #$8000,a0` or where is the LEA.L (D16,AN),AN ?
Yeah, horrible copy&paste bug
Thanks for pointing it out! I'll fix the text and re-upload the archive.

Thankfully, the actual code is correct, though:
Code:
```
.l

	ifeq	INSTRUCTION

	rept	10
	dc.w	$d0fc,$8000	;adda.w #-32768,a0
	endr

	else

	rept	10
	dc.w	$41e8,$8000	;lea.l (-32768,a0),a0
	endr

	endif

	dbf	d0,.l
```
EDIT: it just occurred to me that I also forgot to check whether the assembler changed the instructions! I'm writing the instructions by hand now to make sure. Sorry.
EDIT2: I just disassembled the adda test programs, and it turned out that the assembler did change adda into lea! No surprise the results were identical! Epic fail. I'm fixing everything and re-running the tests now.
EDIT3: everything fixed & updated.

Last edited by saimo; 29 September 2022 at 00:02.

28 September 2022, 19:11   #32
paraj
Registered User

Join Date: Feb 2017
Location: Denmark
Posts: 425
Just to be sure I checked on my 060, and lea X(An),An, adda.w #X,An and adda.l #X,An are all equally fast (1 cycle) as long as instruction fetch can keep up (otherwise adda.l is worse of course, and none of them can reach 0.5 cycles in isolation). Also
Code:
```
lea     X(a0),a0
move.l  (a0),d0
```
is faster than adda.w as the stall is indeed avoided.

28 September 2022, 19:13   #33
PeterK
Registered User

Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,151
Thanks saimo! I've tried a few of your benchmarks on WinUAE, but unfortunately my Windows has too many background processes running and the timing results are more or less random, so an exact comparison seems to be impossible. I would really like to know whether ADDA or LEA performs better, since this is one of the optimizations which can be enabled in PhxAss, and usually I have switched it on.

Quote:
 EDIT2: I just disassembled the adda test programs, and it turned out that the assembler did change them into leas! No surprise the results were identical! Epic fail. I'm fixing everything and re-running the tests now.
The opposite has happened to me some weeks ago. I forgot to switch on the PhxAss optimization and then my icon.library self protection didn't accept the different code checksum.

Last edited by PeterK; 28 September 2022 at 19:23.

28 September 2022, 20:42   #34
saimo
Registered User

Join Date: Aug 2010
Location: Italy
Posts: 555
Quote:
 Originally Posted by PeterK Thanks saimo! I've tried a few of your benchmarks on WinUAE, but unfortunately my Windows has too many background processes running and the timing results are more or less randomly, an exact comparison seems to be impossible.
Emulators are definitely not reliable test platforms, indeed.

Quote:
 I would really like to know whether ADDA or LEA is performing better, since this is one of the optimizations which can be enabled in PhxAss, and usually I have switched it on.
I have updated my previous post (both results and archive with test programs). Lea is indeed faster in some cases on 68020 and 68030 (and I do actually use lea in my code, but for some reason this time around I got hit by a doubt and just had to double check, while I was supposed to work on something else - as usual, rushing things never helps).

Quote:
 The opposite has happened to me some weeks ago. I forgot to switch on the PhxAss optimization and then my icon.library self protection didn't accept the different code checksum.
By coincidence, I used PhxAss as well

28 September 2022, 21:11   #35
meynaf
son of 68k

Join Date: Nov 2007
Location: Lyon / France
Age: 49
Posts: 4,629
Quote:
 Originally Posted by saimo By coincidence, I used PhxAss as well
Then you don't need to encode the instructions manually. Just use opt 0 to turn all optimizations off, either at the command line or in the source itself.

28 September 2022, 22:20   #36
smack
Registered User

Join Date: May 2020
Location: Germany
Posts: 10
For reference, the 68030 User's Manual:
https://www.nxp.com/docs/en/referenc...68030UM-P1.pdf
https://www.nxp.com/docs/en/referenc...68030UM-P2.pdf
SECTION 11 INSTRUCTION EXECUTION TIMING can be found in part 2.
Code:
```
                       Head         Tail  I-Cache Case  No-Cache Case
adda.w #-32768,a0
; (fea) #.W            2            0     2(0/0/0)      2(0/1/0)
; ADDA.W EA,An         0            0     4(0/0/0)      4(0/1/0)
; total = 6 cycles
lea (-32768,a0),a0
; (cea) (d16,An)       2 + op head  0     2(0/0/0)      2(0/1/0)
; LEA                  2            0     2(0/0/0)      2(0/1/0)
; total = 4 cycles
```
So the 68030 document and test results agree.
The 68020 document seems to have incomplete timing data for ADDA / SUBA: it doesn't distinguish between the faster ADDA.L (4 cycles) and the slower ADDA.W (6 cycles).

Last edited by smack; 28 September 2022 at 23:26. Reason: added instruction timings

29 September 2022, 00:08   #37
saimo
Registered User

Join Date: Aug 2010
Location: Italy
Posts: 555
@meynaf

Quote:
 Originally Posted by meynaf Then you don't need to encode the instructions manually. Just use opt 0 to turn all optimizations off, either at the command line or in the source itself.
Yep, but that wouldn't have been as fool-proof. And here I'm proving I can be a real fool (see also reply below)

@smack

Quote:
 Originally Posted by smack for reference, the 68030 User's Manual https://www.nxp.com/docs/en/referenc...68030UM-P1.pdf https://www.nxp.com/docs/en/referenc...68030UM-P2.pdf SECTION 11 INSTRUCTION EXECUTION TIMING can be found in part 2
Code:
```
                       Head         Tail  I-Cache Case  No-Cache Case
adda.w #-32768,a0
; (fea) #.W            2            0     2(0/0/0)      2(0/1/0)
; ADDA.W EA,An         0            0     4(0/0/0)      4(0/1/0)
; total = 6 cycles
lea (-32768,a0),a0
; (cea) (d16,An)       2 + op head  0     2(0/0/0)      2(0/1/0)
; LEA                  2            0     2(0/0/0)      2(0/1/0)
; total = 4 cycles
```
so the 68030 document and test results agree.
You're totally right, thank you. I won't even try to describe which broken mental processes prevented me from adding the FEA to the calculations. Just one more epic fail...
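
For the record, this is roughly how the "THEORETICAL CYCLES" figures for the 68030 in post #29 come out once the FEA is included. It is a minimal Python sketch of that model only (first pass uncached, last pass with dbf not taken); the function names are made up for illustration.

```python
REPS = 10           # instructions per pass (rept 10)
ITERATIONS = 10000  # passes through the loop

# 68030 figures from the manual excerpt: cache and no-cache cases are equal.
ADDA_W = 2 + 4      # FEA for #.W + ADDA.W EA,An
LEA_D16 = 2 + 2     # CEA for (d16,An) + LEA

# dbf timings as (cache, no cache).
DBF_TAKEN = (6, 8)
DBF_NOT_TAKEN = (10, 13)


def total_cache_on(ins):
    """First pass fetched from memory, the rest from cache, last dbf falls through."""
    first = ins * REPS + DBF_TAKEN[1]
    middle = (ins * REPS + DBF_TAKEN[0]) * (ITERATIONS - 2)
    last = ins * REPS + DBF_NOT_TAKEN[0]
    return first + middle + last


def total_cache_off(ins):
    """Every pass fetched from memory, last dbf falls through."""
    return (ins * REPS + DBF_TAKEN[1]) * (ITERATIONS - 1) + ins * REPS + DBF_NOT_TAKEN[1]


for name, ins in (("adda", ADDA_W), ("lea", LEA_D16)):
    on, off = total_cache_on(ins), total_cache_off(ins)
    print(f"{name:4} cache on: {on} (~{on / ITERATIONS:.0f}/loop), "
          f"cache off: {off} (~{off / ITERATIONS:.0f}/loop)")
```

The model adds instruction times with no overlap, so it is only a baseline for comparing against the measured loops.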

Quote:
 the 68020 document seems to have incomplete timing data for ADDA / SUBA: it doesn't distinguish between the faster ADDA.L (4 cycles) and the slower ADDA.W (6 cycles).
Indeed. And 68020 and 68030 are generally so similar when it comes to timings that one would expect that the UM isn't quite right there. Though, there are surprises sometimes (for example, see notes below about alignment).

@all

Updated the results and the archive in post #29. Now also code alignment is taken into account.

On 68020, word alignment affects performance as follows:
* adda + cache off case: lowers it slightly;
* adda + cache on case: improves it slightly;
* lea + cache on case: improves it slightly;
* lea + cache off case: has no effect.
In particular, it eliminates the extra cycle taken by dbf in the cache on cases, so the measured times are very close to the theoretical ones - i.e. word alignment is better than longword alignment in this specific case.

On 68030, alignment affects performance as follows:
* adda + cache off case: lowers it a lot;
* adda + cache on case: lowers it slightly;
* lea + cache off case: lowers it slightly;
* lea + cache on case: has no effect.

Last edited by saimo; 29 September 2022 at 00:15.

29 September 2022, 00:21   #38
saimo
Registered User

Join Date: Aug 2010
Location: Italy
Posts: 555
Quote:
 Originally Posted by phx What makes LEA attractive is that you can store the result of the addition to a different register.
And on 68020+ you can also get quick multiplications by 3, 5 and 9 (plus displacement). That's very advantageous if the source operand happens to be in an address register already and it's OK to have the destination in the same / another address register - OK, it's a very specific case, but still it might be useful. The same goes for additions of any two operands, one of which has to be pre-multiplied by 2, 4 or 8.
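
To make the arithmetic explicit, here is a small Python model of what a scaled-index lea computes, e.g. `lea (a0,a0.l*2),a1` for a0*3. It models only the address calculation with 32-bit wrap-around; the function names are mine and nothing here says anything about timing.

```python
MASK32 = 0xFFFFFFFF


def lea_scaled(base, index, scale, disp=0):
    """Model of lea (disp,base,index.l*scale),An: base + index*scale + disp."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("68020+ scale factors are 1, 2, 4 or 8")
    return (base + index * scale + disp) & MASK32


def times(n, x, disp=0):
    """x*n (+disp) for n in 3, 5, 9, using x as both base and index."""
    if n not in (3, 5, 9):
        raise ValueError("only 3, 5 and 9 map to scale factors 2, 4 and 8")
    return lea_scaled(x, x, n - 1, disp)


print(times(3, 7))               # lea (a0,a0.l*2),a1
print(times(5, 7))               # lea (a0,a0.l*4),a1
print(times(9, 7, disp=4))       # lea (4,a0,a0.l*8),a1
print(lea_scaled(100, 6, 4))     # lea (a0,d0.l*4),a1 with a0=100, d0=6
```

The last call is the "addition of two operands, one pre-multiplied" case: the index register only has to hold the operand that gets scaled.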

29 September 2022, 14:01   #39
PeterK
Registered User

Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,151
Quote:
 Originally Posted by saimo And on 68020+ you can also get quick multiplications by 3, 5 and 9 (plus displacement).
Great idea! That's a good trick which I can use at least in one case, although the impact on the speed of my library won't be noticeable.

30 September 2022, 04:26   #40
Bruce Abbott
Registered User

Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 1,244
Quote:
 Originally Posted by saimo I have a same-specced machine, but a number of tests showed that adda and lea execute at the same speed in all cases. EDIT: I didn't consider that assembler optimizations might kick in, so my previous results were faked by the fact that the assembler actually replaced adda with lea. Stupid mistake, apologies. Now that I have encoded the instructions manually, my tests return the same results (i.e. adda 6 cycles, lea 4 cycles, both 7 cycles when cache is off).
Thanks for the confirmation. Verifying that the tests are done correctly can be tricky, even after disassembling the code to make sure it is assembled correctly.

Quote:
 I totally forgot to check whether word/longword alignment made any difference; I did mean to, but then I just forgot (Now this is fixed: the updated results and test programs also deal with alignment.)
This can also be tricky to get right. To reduce loop overhead my test code repeats the instruction 25 or 50 times. If an instruction has an odd number of words then the alignment changes from one instruction to the next, and the result is an average of different alignments.
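
A small Python sketch of that effect: it lists where each repeated instruction starts relative to a longword boundary. The helper name is made up; the instruction lengths in words are the usual encodings for the forms tested earlier in the thread (addq 1 word, lea xx(An) and adda.w #xx 2 words, adda.l #xxxx 3 words).

```python
def start_offsets(words_per_instruction, repeats, first_offset=0):
    """Byte offset within a longword (0 or 2) at which each copy starts."""
    size = 2 * words_per_instruction  # 68k instruction words are 2 bytes
    return [(first_offset + i * size) % 4 for i in range(repeats)]


for name, words in (("addq.w #x,An", 1),
                    ("lea xx(An),An", 2),
                    ("adda.w #xx,An", 2),
                    ("adda.l #xxxx,An", 3)):
    print(f"{name:16} {start_offsets(words, 6)}")
```

The one-word and three-word instructions alternate between the two alignments, so a long run of them measures an average of both, while the two-word forms keep whatever alignment the first copy had.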
