21 September 2022, 20:46 | #21 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,122
|
At the risk of being a filthy necromancer, is there a definitive answer for 68020+ where the offset size can be represented as a 16-bit immediate? Assume the code is in fast memory and the instruction cache is enabled.
|
27 September 2022, 00:03 | #22 | ||
Registered User
Join Date: May 2020
Location: Germany
Posts: 20
|
Quote:
There is no clear winner, but I would use adda/suba because it has a chance of a faster best case and the code is more expressive, IMO. If you wonder what these best / cache / worst cases are, well, it's complicated.
Quote:
|
||
27 September 2022, 12:14 | #23 |
Natteravn
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,496
|
What makes LEA attractive is that you can store the result of the addition to a different register.
|
27 September 2022, 21:06 | #24 | |
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,546
|
Quote:
In cache:-
addq.w #x,An        2 clocks
addq.l #x,An        2 clocks
lea xx(An),An       4 clocks
lea x(An,Rn),An     6 clocks
adda.w #xx,An       6 clocks
adda.l #xxxx,An     6 clocks

Not cached:-
addq.w #x,An        4 clocks
addq.l #x,An        4 clocks
lea xx(An),An       7 clocks
lea x(An,Rn),An     7 clocks
adda.w #xx,An       7 clocks
adda.l #xxxx,An    11 clocks

This shows a 2 clock advantage of lea over adda.w when cached, but no advantage when not cached. Also note that adda.l is the same speed as adda.w when cached, but 4 clocks slower when not cached!
It also shows that 60ns DRAM is not fast enough to keep up with the 50MHz 030 on the Blizzard 1230-IV. Getting that code into the cache can make it up to twice as fast!
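The figures in this post can be turned into ratios with a few lines of Python. This is only a sketch: the clock counts are copied from the list above, while the dictionary layout and the function names are made up for illustration.

```python
# Clock counts copied from the post above (68030 @ 50 MHz, Blizzard 1230-IV).
# Tuple layout is an assumption of this sketch: (cached, not cached).
TIMINGS = {
    "addq.w #x,An":    (2, 4),
    "addq.l #x,An":    (2, 4),
    "lea xx(An),An":   (4, 7),
    "lea x(An,Rn),An": (6, 7),
    "adda.w #xx,An":   (6, 7),
    "adda.l #xxxx,An": (6, 11),
}


def cache_speedup(instruction):
    """Return how many times faster the instruction runs from the cache."""
    cached, uncached = TIMINGS[instruction]
    return uncached / cached


def lea_advantage(cached=True):
    """Return the clocks saved by lea xx(An),An over adda.w #xx,An."""
    column = 0 if cached else 1
    return TIMINGS["adda.w #xx,An"][column] - TIMINGS["lea xx(An),An"][column]


if __name__ == "__main__":
    for name in TIMINGS:
        print(f"{name:<16} x{cache_speedup(name):.2f} faster when cached")
    print("lea advantage, cached:    ", lea_advantage(True), "clocks")
    print("lea advantage, not cached:", lea_advantage(False), "clocks")
```

The largest ratio in the table belongs to addq, which is where the "up to twice as fast" remark comes from.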
|
27 September 2022, 22:39 | #25 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,122
|
Great responses, thanks!
Any info on 040 / 060?
28 September 2022, 01:04 | #26 |
Registered User
Join Date: May 2020
Location: Germany
Posts: 20
|
68040, see section 10 of this document https://www.nxp.com/docs/en/referenc.../MC68040UM.pdf
68060, see section 10 of this document https://www.nxp.com/docs/en/data-sheet/MC68060UM.pdf
Of course, real world performance depends on the specific sequence of instructions executed, because of how they interact (overlap or stall) in the execution pipelines. |
28 September 2022, 08:36 | #27 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
|
For 060 I think lea is slightly better (all other things being equal) since it can sometimes avoid a 2 cycle change/use register stall (10.2.3).
The instruction timings for 040 look a bit odd for adda/suba:
adda Dn,Am      ;1+2
adda An,Am      ;1+1
adda #<xxx>,Am  ;1+1
suba Dn,Am      ;1+1
suba An,Am      ;1+2
suba #<xxx>,Am  ;1+2
So suba Dn,Am is faster than adda Dn,Am and suba An,Am?
28 September 2022, 10:22 | #28 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,039
|
68040 real hardware (1000 iterations of 8x repeated):
Code:
adda. d0,a0      ; 3528 3
adda. a0,a1      ; 1112 1
adda. (a0),a1    ; 2396 2
adda. #,a0       ; 1117 1
suba. d0,a0      ; 3528 3
suba. (a0),a1    ; 2396 2
suba. #,a0       ; 1117 1
lea (a0),a1      ; 1112 1
lea (d.w,a0),a1  ; 2391 2
lea (d.w,pc),a0  ; 4954 4
28 September 2022, 17:17 | #29 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 787
|
Quote:
EDIT: I didn't consider that assembler optimizations might kick in, so my previous results were invalidated by the fact that the assembler actually replaced adda with lea. Stupid mistake, apologies. Now that I have encoded the instructions manually, my tests return the same results (i.e. adda 6 cycles, lea 4 cycles, both 7 cycles when cache is off).
Updated results:
Code:
--------------------------------------------------------------------------------
ADDA.W #X,AN VS LEA.L (D16,AN),AN COMPARISON

NOTES
* The tests have been run on an Amiga 1200 equipped with a Blizzard 1230-IV with an MC68030 clocked at 50 MHz and 60 ns FAST RAM.
* The tests for the unexpanded Amiga 1200 cases have been run on the same machine with the accelerator board disabled.
* The tests have been run with DMA and interrupts off, and with PAL settings.
* The tests have been run 3 or 4 times each, and the best results have been taken; most of the tests always returned the same result (only the 68030, cache off, CHIP RAM cases showed fluctuations of some tens/hundreds of cycles, which were insignificant anyway).
* The tests execute the core code 10000 times.
* The duration of the tests is measured by means of color clocks (CCKs).
* The test programs require ECS or AGA (but have been tested on AGA only).
* The test programs can be run from both shell and Workbench.
* The test programs take no argument.
* The test programs shut the OS off, take over the machine entirely and access the hardware directly; although on exit they restore the system with the utmost care, no guarantee is given - USE AT YOUR OWN RISK.
* Included test programs:
  * ADCL = Adda, cache Disabled, CHIP RAM, Longword alignment
  * ADCW = Adda, cache Disabled, CHIP RAM, Word alignment
  * ADFL = Adda, cache Disabled, FAST RAM, Longword alignment
  * ADFW = Adda, cache Disabled, FAST RAM, Word alignment
  * AECL = Adda, cache Enabled, CHIP RAM, Longword alignment
  * AECW = Adda, cache Enabled, CHIP RAM, Word alignment
  * AEFL = Adda, cache Enabled, FAST RAM, Longword alignment
  * AEFW = Adda, cache Enabled, FAST RAM, Word alignment
  * LDCL = Lea, cache Disabled, CHIP RAM, Longword alignment
  * LDCW = Lea, cache Disabled, CHIP RAM, Word alignment
  * LDFL = Lea, cache Disabled, FAST RAM, Longword alignment
  * LDFW = Lea, cache Disabled, FAST RAM, Word alignment
  * LECL = Lea, cache Enabled, CHIP RAM, Longword alignment
  * LECW = Lea, cache Enabled, CHIP RAM, Word alignment
  * LEFL = Lea, cache Enabled, FAST RAM, Longword alignment
  * LEFW = Lea, cache Enabled, FAST RAM, Word alignment

--------------------------------------------------------------------------------
68020 @ 14.18758 MHz / CHIP RAM

OFFICIAL SPECIFICATIONS (best/cache/worst)
* adda: FEA: 0/2/3, operation: 0/2/3 -> 0/4/6
* lea: CEA: 2/2/3, operation: 2/2/3 -> 4/4/6
* dbf: branch taken: 3/6/9, branch not taken: 7/10/10

CORE CODE
.l   rept 10
     adda.w #-32768,a0      ;0/4/6
     endr
     dbf d0,.l              ;3/6/9 or 7/10/10

.l   rept 10
     lea.l (-32768,a0),a0   ;4/4/6
     endr
     dbf d0,.l              ;3/6/9 or 7/10/10

THEORETICAL CYCLES
* total / cache: (6*10+9) + (4*10+6)*9998 + (4*10+10) = 460027
* loop / cache: 460027/10000 = ~46
* total / worst: (6*10+9)*9999 + (6*10+10) = 690001
* loop / worst: 690001/10000 = ~69

RESULTS FOR LONGWORD-ALIGNED CODE
      |          |       cache on      |      cache off
ins.  | unit     +----------+----------+----------+----------
      |          | FAST RAM | CHIP RAM | FAST RAM | CHIP RAM
------+----------+----------+----------+----------+----------
adda  | CCKs/T   |          |   167784 |          |   342832
      | cycles/T |          |   671136 |          |  1371328
      | cycles/L |          |    67.11 |          |   137.13
------+----------+----------+----------+----------+----------
lea   | CCKs/T   |          |   117936 |          |   342832
      | cycles/T |          |   471744 |          |  1371328
      | cycles/L |          |    47.17 |          |   137.13

RESULTS FOR WORD-ALIGNED CODE
      |          |       cache on      |      cache off
ins.  | unit     +----------+----------+----------+----------
      |          | FAST RAM | CHIP RAM | FAST RAM | CHIP RAM
------+----------+----------+----------+----------+----------
adda  | CCKs/T   |          |   165280 |          |   353033
      | cycles/T |          |   661120 |          |  1412132
      | cycles/L |          |    66.11 |          |   141.21
------+----------+----------+----------+----------+----------
lea   | CCKs/T   |          |   115432 |          |   342832
      | cycles/T |          |   461728 |          |  1371328
      | cycles/L |          |    46.17 |          |   137.13

CALCULATION OF CPU CYCLES
* total = CCKs * 4
* loop = (CCKs * 4) / 10000 = CCKs / 2500

NOTES
* The lea + cache on + longword alignment case loop takes 1+ cycles more than expected (47+ VS 46); by executing a loop of a single lea 65536 times, the measured time is 180514 CCKs = 11.02 cycles, which is closer to the expected value (10 cycles), but still ~1 cycle slower; it seems that, at least in this context, dbf actually takes 7 cycles.
* The adda + cache on + longword alignment case loop takes 21+ cycles more than expected (67 VS 46); given that dbf takes 7 cycles, adda takes (67-7)/10 = 6 cycles, i.e. 2 cycles more than expected (6 VS 4).
* The cache off case loops are about 2 times slower than the theoretical value (137.13 and 141.21 are close to 69*2 = 138), which is not surprising given the CHIP RAM access timings.
* Word alignment affects performance as follows:
  * adda + cache off case: lowers it slightly;
  * adda + cache on case: improves it slightly;
  * lea + cache on case: improves it slightly;
  * lea + cache off case: has no effect.
  In particular, it eliminates the extra cycle taken by dbf in the cache on cases.

--------------------------------------------------------------------------------
68030 @ 50 MHz / CHIP RAM / FAST RAM 60 ns

OFFICIAL SPECIFICATIONS (CACHE / NO CACHE)
* adda: FEA: 2/2, operation 4/4 -> 6/6
* lea: CEA: 2/2, operation: 2/2 -> 4/4
* dbf: branch taken: 6/8, branch not taken: 10/13

CORE CODE
.l   rept 10
     adda.w #-32768,a0      ;6/6
     endr
     dbf d0,.l              ;6/8 or 10/13

.l   rept 10
     lea.l (-32768,a0),a0   ;4/4
     endr
     dbf d0,.l              ;6/8 or 10/13

THEORETICAL CYCLES
* adda / total / cache: (6*10+8) + (6*10+6)*9998 + (6*10+10) = 660006
* adda / loop / cache: 660006/10000 = ~66
* adda / total / no cache: (6*10+8)*9999 + (6*10+13) = 680005
* adda / loop / no cache: 680005/10000 = ~68
* lea / total / cache: (4*10+8) + (4*10+6)*9998 + (4*10+10) = 460006
* lea / loop / cache: 460006/10000 = ~46
* lea / total / no cache: (4*10+8)*9999 + (4*10+13) = 480005
* lea / loop / no cache: 480005/10000 = ~48

RESULTS FOR LONGWORD-ALIGNED CODE
      |          |        cache on       |        cache off
ins.  | unit     +-----------+-----------+-----------+------------
      |          | FAST RAM  | CHIP RAM  | FAST RAM  | CHIP RAM
------+----------+-----------+-----------+-----------+------------
adda  | CCKs/T   |     46895 |     47141 |     59046 |     290438
      | cycles/T | 661071.16 | 664538.98 | 832361.83 | 4094257.09
      | cycles/L |     66.11 |     66.45 |     83.24 |     409.43
------+----------+-----------+-----------+-----------+------------
lea   | CCKs/T   |     32914 |     32934 |     59046 |     290297
      | cycles/T | 463983.29 | 464265.22 | 832361.83 | 4092269.44
      | cycles/L |     46.40 |     46.43 |     83.24 |     409.23

RESULTS FOR WORD-ALIGNED CODE
      |          |        cache on       |        cache off
ins.  | unit     +-----------+-----------+------------+------------
      |          | FAST RAM  | CHIP RAM  | FAST RAM   | CHIP RAM
------+----------+-----------+-----------+------------+------------
adda  | CCKs/T   |     47121 |     47147 |      74670 |     342832
      | cycles/T | 664257.05 | 664623.57 | 1052610.81 | 4832846.76
      | cycles/L |     66.43 |     66.46 |     105.53 |     483.29
------+----------+-----------+-----------+------------+------------
lea   | CCKs/T   |     32914 |     32934 |      60256 |     296603
      | cycles/T | 463983.29 | 464265.22 |  849419.00 | 4181164.09
      | cycles/L |     46.40 |     46.43 |      84.94 |     418.12

CALCULATION OF CPU CYCLES
* total = (CCKs * 50000000) / 3546895 = (CCKs * 10000000) / 709379
* loop = ((CCKs * 50000000) / 3546895) / 10000 = (CCKs * 1000) / 709379

NOTES
* The cache on case loops take 0.4+ cycles more than expected (46.4+ VS 46 and 66.4+ VS 66); by executing a loop of a single lea 65536 times, the measured time is 46792 CCKs = 10.07 cycles, which is closer to the expected value (10 cycles); quite weird.
* The cache off + FAST RAM case loops are much slower than the theoretical value (83+ VS 48 cycles in the best case), which is unexpected given that the accelerator board is said to have a zero-wait-state design and that the RAM access timing (60 ns) is an exact multiple of the CPU clock timing (20 ns); looking at the numbers, it seems that dbf takes 10 cycles, i.e. 2 cycles more than expected (10 VS 8), and that adda/lea take (83-10)/10 = 7 cycles, i.e. 3 cycles more than expected (7 VS 4).
* Word alignment affects performance as follows:
  * adda + cache off case: lowers it a lot;
  * adda + cache on case: lowers it slightly;
  * lea + cache off case: lowers it slightly;
  * lea + cache on case: has no effect.

I totally forgot to check whether word/longword alignment made any difference; I did mean to, but then I just forgot (Now this is fixed: the updated results and test programs also deal with alignment.)
Last edited by saimo; 28 September 2022 at 23:00.
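The CCK-to-cycle conversion and the theoretical 68030 totals in the post above can be checked with a short Python sketch. The constants come from the post's own formulas (3546895 Hz PAL colour clock, 50 MHz 68030, 14.18758 MHz 68020 = 4 x CCK); the function names are hypothetical, and the 8/6/10 dbf figures simply mirror the "THEORETICAL CYCLES" formula for the cached 68030 case.

```python
CCK_HZ = 3_546_895        # PAL colour clock frequency used in the formulas above
CPU_030_HZ = 50_000_000   # Blizzard 1230-IV clock
CPU_020_HZ = 4 * CCK_HZ   # 14.18758 MHz, hence "total = CCKs * 4"
ITERATIONS = 10_000       # the tests execute the core code 10000 times
UNROLL = 10               # "rept 10" in the core code


def ccks_to_cycles(ccks, cpu_hz):
    """Convert a duration measured in colour clocks to CPU cycles."""
    return ccks * cpu_hz / CCK_HZ


def cycles_per_loop(ccks, cpu_hz, iterations=ITERATIONS):
    """Return the average CPU cycles spent per loop iteration."""
    return ccks_to_cycles(ccks, cpu_hz) / iterations


def theoretical_total_030_cached(ins_cycles, iterations=ITERATIONS):
    """Rebuild the post's 68030 cache-on total for a given instruction cost."""
    first = ins_cycles * UNROLL + 8    # first pass, dbf taken (no-cache figure)
    middle = ins_cycles * UNROLL + 6   # remaining passes, dbf taken (cache figure)
    last = ins_cycles * UNROLL + 10    # final pass, dbf not taken
    return first + middle * (iterations - 2) + last


if __name__ == "__main__":
    # 68030, longword-aligned adda, cache on, FAST RAM: 46895 CCKs
    print(round(ccks_to_cycles(46895, CPU_030_HZ), 2))
    print(round(cycles_per_loop(46895, CPU_030_HZ), 2))
    # 68020, longword-aligned adda, cache on, CHIP RAM: 167784 CCKs
    print(ccks_to_cycles(167784, CPU_020_HZ))
    print(theoretical_total_030_cached(6), theoretical_total_030_cached(4))
```

Comparing `cycles_per_loop` against `theoretical_total_030_cached(...) / ITERATIONS` gives the per-loop gap that the notes discuss.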
|
28 September 2022, 17:36 | #30 | |
Registered User
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,365
|
Quote:
Code:
lea.l #$8000,a0 |
|
28 September 2022, 17:53 | #31 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 787
|
Quote:
Thanks for pointing it out! I'll fix the text and re-upload the archive. Thankfully, the actual code is correct, though: Code:
.l   ifeq INSTRUCTION
     rept 10
     dc.w $d0fc,$8000   ;adda.w #-32768,a0
     endr
     else
     rept 10
     dc.w $41e8,$8000   ;lea.l (-32768,a0),a0
     endr
     endif
     dbf d0,.l

EDIT2: I just disassembled the adda test programs, and it turned out that the assembler did change adda into lea! No surprise the results were identical! Epic fail. I'm fixing everything and re-running the tests now.
EDIT3: everything fixed & updated.
Last edited by saimo; 28 September 2022 at 23:02.
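For anyone who wants to hand-encode the two instructions as above, so that the assembler cannot swap one for the other, the `dc.w` words can be derived from the standard 68000 opcode layouts. The Python below is only an illustrative sketch with hypothetical helper names; it covers just the two forms used in the test (ADDA.W #imm,An and LEA (d16,An),An).

```python
def _signed16(value):
    """Return a signed 16-bit value as an unsigned extension word."""
    if not -0x8000 <= value <= 0x7FFF:
        raise ValueError("value does not fit in a signed 16-bit word")
    return value & 0xFFFF


def _areg(number):
    """Validate an address register number (0..7 for a0..a7)."""
    if not 0 <= number <= 7:
        raise ValueError("address register must be 0..7")
    return number


def encode_adda_w_imm(imm, an):
    """Encode adda.w #imm,An: 1101 rrr 011 111100, then the immediate word."""
    opcode = 0xD000 | (_areg(an) << 9) | (0b011 << 6) | 0b111100
    return opcode, _signed16(imm)


def encode_lea_d16(disp, src_an, dst_an):
    """Encode lea (disp,As),Ad: 0100 ddd 111 101 sss, then the displacement."""
    opcode = 0x4000 | (_areg(dst_an) << 9) | (0b111 << 6) | (0b101 << 3) | _areg(src_an)
    return opcode, _signed16(disp)


if __name__ == "__main__":
    for words in (encode_adda_w_imm(-32768, 0), encode_lea_d16(-32768, 0, 0)):
        print("dc.w " + ",".join(f"${w:04x}" for w in words))
```

The two calls in the `__main__` section use the same operands as the test loop (#-32768 and a0), so their output can be compared directly with the `dc.w` lines in the post.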
|
28 September 2022, 18:11 | #32 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
|
Just to be sure, I checked on my 060: lea X(An),An, adda.w #X,An and adda.l #X,An are all equally fast (1 cycle) as long as instruction fetch can keep up (otherwise adda.l is worse, of course, and none of them can reach 0.5 cycles in isolation). Also
Code:
lea X(a0),a0
move.l (a0),d0
28 September 2022, 18:13 | #33 | |
Registered User
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,365
|
Thanks saimo! I've tried a few of your benchmarks on WinUAE, but unfortunately my Windows has too many background processes running and the timing results are more or less random, so an exact comparison seems to be impossible. I would really like to know whether ADDA or LEA performs better, since this is one of the optimizations which can be enabled in PhxAss, and I have usually switched it on.
Quote:
Last edited by PeterK; 28 September 2022 at 18:23. |
|
28 September 2022, 19:42 | #34 | |||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 787
|
Quote:
Quote:
Quote:
|
|||
28 September 2022, 20:11 | #35 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
|
28 September 2022, 21:20 | #36 |
Registered User
Join Date: May 2020
Location: Germany
Posts: 20
|
For reference, the 68030 User's Manual:
https://www.nxp.com/docs/en/referenc...68030UM-P1.pdf
https://www.nxp.com/docs/en/referenc...68030UM-P2.pdf
SECTION 11 INSTRUCTION EXECUTION TIMING can be found in part 2.
Code:
                        Head          Tail   I-Cache Case   No-Cache Case
adda.w #-32768,a0
; (fea) #<data>.W       2             0      2(0/0/0)       2(0/1/0)
; ADDA.W EA,An          0             0      4(0/0/0)       4(0/1/0)
; total = 6 cycles

lea (-32768,a0),a0
; (cea) (d16,An)        2 + op head   0      2(0/0/0)       2(0/1/0)
; LEA                   2             0      2(0/0/0)       2(0/1/0)
; total = 4 cycles

The 68020 document seems to have incomplete timing data for ADDA / SUBA: it doesn't distinguish between the faster ADDA.L (4 cycles) and the slower ADDA.W (6 cycles).
Last edited by smack; 28 September 2022 at 22:26. Reason: added instruction timings
28 September 2022, 23:08 | #37 | |||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 787
|
@meynaf
Quote:
@smack
Quote:
Quote:
@all
Updated the results and the archive in post #29. Now also code alignment is taken into account.

On 68020, word alignment affects performance as follows:
* adda + cache off case: lowers it slightly;
* adda + cache on case: improves it slightly;
* lea + cache on case: improves it slightly;
* lea + cache off case: has no effect.
In particular, it eliminates the extra cycle taken by dbf in the cache on cases, so the measured times are very close to the theoretical ones - i.e. word alignment is better than longword alignment in this specific case.

On 68030, alignment affects performance as follows:
* adda + cache off case: lowers it a lot;
* adda + cache on case: lowers it slightly;
* lea + cache off case: lowers it slightly;
* lea + cache on case: has no effect.
Last edited by saimo; 28 September 2022 at 23:15.
|||
28 September 2022, 23:21 | #38 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 787
|
And on 68020+ you can also get quick multiplications by 3, 5 and 9 (plus displacement). That's very advantageous if the source operand happens to be in an address register already and it's OK to have the destination in the same / another address register - OK, it's a very specific case, but still it might be useful. The same goes for additions of any two operands, one of which has to be pre-multiplied by 2, 4 or 8.
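To make the arithmetic concrete: with the 68020+ scaled index mode, an instruction such as `lea (a0,a0.l*2),a1` computes a0 + a0*2, i.e. a0*3. The Python below is just a model of that address calculation, not of the instruction's timing; `lea_scaled` is a hypothetical helper, and it does not enforce the displacement size limits of the real encodings.

```python
MASK32 = 0xFFFFFFFF  # address registers are 32 bits wide


def lea_scaled(base, index, scale=1, disp=0):
    """Model lea (disp,base,index.l*scale),dest and return the 32-bit result."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("68020+ scale factors are 1, 2, 4 and 8")
    return (base + index * scale + disp) & MASK32


def times3(x):
    """lea (a0,a0.l*2),a1 -> a1 = a0 * 3"""
    return lea_scaled(x, x, 2)


def times5(x):
    """lea (a0,a0.l*4),a1 -> a1 = a0 * 5"""
    return lea_scaled(x, x, 4)


def times9(x, disp=0):
    """lea (disp,a0,a0.l*8),a1 -> a1 = a0 * 9 + disp"""
    return lea_scaled(x, x, 8, disp)


if __name__ == "__main__":
    print(times3(7), times5(7), times9(7, disp=10))
    # Adding two different operands, one pre-multiplied by 4:
    print(lea_scaled(100, 6, 4))
```

Using the same register as base and index gives the x3, x5 and x9 cases; using two different registers gives the "addition with one operand pre-multiplied" case mentioned in the post.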
|
29 September 2022, 13:01 | #39 |
Registered User
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,365
|
|
30 September 2022, 03:26 | #40 | ||
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,546
|
Quote:
Quote:
|
||