26 April 2021, 19:17 | #1 |
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
Optimizing the 68020+ 32-bit math
I have a project, the number pi calculator. I have found out recently that my 68020 code is actually slower than my 68000 code. I was misguided that emulators (FS-UAE, Hatari) show that on the contary that the 68020 code is faster. It seems that DIVUL takes much less cycles in emus than in hardware. So I am trying to make the 68020 code which will be faster than the 68000 code.
The code for the 68020/30 is quite short: Code:
divul.l d4,d7:d6 move d7,(a3) Code:
moveq.l #0,d7 divu d4,d6 bvc .div32no\@ swap d6 move d6,d7 divu d4,d7 swap d7 move d7,d6 swap d6 divu d4,d6 .div32no\@ move d6,d7 clr d6 swap d6 move d6,(a3) My whole project is available on github https://github.com/litwr2/rosetta-pi...e/master/amiga Maybe someone can help me to make better code for the 68020/30? Any hints are welcome. Thank you in advance. It is interesting to make code for the 68020/30 faster than the 68000 code. However I have only been able to make slightly shorter code for the 68020/30. Code:
moveq.l #0,d7 divu d4,d6 bvc .div32no\@ divul d4,d7:d6 move d7,(a3) ;r[i] <- d%b bra .div32f\@ .div32no\@ move d6,d7 clr d6 swap d6 move d6,(a3) ;r[i] <- d%b .div32f\@ |
27 April 2021, 10:43 | #2 |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,313
|
That is NOT equivalent code. A DIVU.L dx,da:db is a 64 by 32 bit division. That's not what your code is doing. To emulate that on a machine without a 32-bit quotient, you need something like Algorithm D. And yes, that would be slower than a DIVU.L.
|
28 April 2021, 02:32 | #3 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,064
|
Well, he's not using divu.L, it's divuL(.L) which is 32/32 -> 32:32.
I didn't look too closely at this because... Don't really want to start any flame wars but that code nearly gave me cancer. *Every* single instruction that should have a size specifier (because of multiple valid options), and there's almost a dozen of them is that short clip, doesn't have it. And moveq, which is *always* a longword operation and doesn't ever need it, has it. When I see that I just check out. Having to decipher M68K code because it's butchered so badly is not my idea of fun. I mean, it's OK when it's some fresh guy who's still learning the basics and does all kind of stuff, we all did that at some point, but this... Nothing personal, I just don't understand why would anyone write so ambiguous code: it's more error prone, less portable, others have to waste time figuring out what you wanted to do, ... |
28 April 2021, 10:54 | #4 |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,313
|
Well, it doesn't really matter too much. The major problem is here is the quotient size of 32 bit, and this requires a more complicated algorithm than a cascaded divu with has a quotient of 16 bit. A 32/16 full division with 32 remainder and 32 bit quotient is easy with divu, but a full 32/32 division requires "algorithm D" (or some other division algorithm) and is more complex than what is shown here.
|
29 April 2021, 08:36 | #5 | ||
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
Quote:
Quote:
68020/30: Code:
divul.l d4,d7:d6 move.w d7,(a3) Code:
moveq.l #0,d7 divu.w d4,d6 bvc .div32no swap d6 move.w d6,d7 divu.w d4,d7 swap d7 move.w d7,d6 swap d6 divu.w d4,d6 .div32no move.w d6,d7 clr.w d6 swap d6 move.w d6,(a3) Code:
moveq.l #0,d7 divu.w d4,d6 bvc .div32no divul.l d4,d7:d6 move.w d7,(a3) ;r[i] <- d%b bra .div32f .div32no move.w d6,d7 clr.w d6 swap d6 move.w d6,(a3) ;r[i] <- d%b .div32f However I almost sure that it is impossible to optimize this 68020/30 division better than I show in my improved code. It is interesting that I could optimize the 80386 code for the same case. The initial 80386 code Code:
div esi mov [si+ra+1],dx Code:
rol eax,16 cmp ax,si jnc .lx mov dx,ax shr eax,16 div si .lxc: mov [si+ra+1],dx Code:
.lx: rol eax,16 div esi jmp .lxc |
||
29 April 2021, 12:20 | #6 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,064
|
Why didn't you use "the same" algorithm, something like (code not tested)?
Code:
move.l d4,d0 swap d0 cmp.l d0,d6 ; divident >= (divisor<<16)? bhs.b .32bit .16bit divu.w d4,d6 swap d6 move.w d6,(a3) .32bit divul.l d4,d7:d6 move.w d7,(a3) Last edited by a/b; 29 April 2021 at 13:15. |
29 April 2021, 14:09 | #7 |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,313
|
Precisely. It depends most likely on the distribution of the input which version is faster.
|
29 April 2021, 14:56 | #8 | |
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
Quote:
Code:
moveq.l #0,d7 swap d6 cmp.w d4,d6 bcs .div32no swap d6 divul.l d4,d7:d6 move.w d7,(a3) bra .div32f .div32no swap d6 divu.w d4,d6 move.w d6,d7 clr.w d6 swap d6 move.w d6,(a3) .div32f Last edited by litwr; 29 April 2021 at 15:02. |
|
29 April 2021, 15:02 | #9 |
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
|
29 April 2021, 16:00 | #10 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,064
|
OK, I didn't take that into account, keeping data in d6/d7.
Here are my suggestions: 1. maybe invert the bcs condition: if the 16-bit case is executed a lot more frequently it should be as branch not taken *if* you can adjust your code to avoid a bra at the end 2. if you have a spare register, use move+swap+cmp sequence instead of swap+cmp+2*swap, it's the same speed but 2 bytes shorter, so potentially very slightly faster because you can squeeze 2 more bytes into icache (not that large if 020/030) 3. moveq #0,d7 should be moved to after .div32no (only the 16-bit case neeeds it), because 32-bit div will set all 32 bits anyway, or.... 4. moveq #0,d7 should be executed only once before the loop (code implies that the remainder is always 16-bit, and setting d7 bits 16-31 only once will suffice) Last edited by a/b; 29 April 2021 at 16:10. |
29 April 2021, 20:03 | #11 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
Code:
move.l d6,d7 swap.w d6 cmp.w d4,d6 bcs.b .div32no divul.l d4,d6:d7 move.w d6,(a3) exg.l d6,d7 bra.b .div32f .div32no divu.w d4,d7 clr.l d6 move.w d7,d6 clr.w d7 swap.w d7 move.w d7,(a3) .div32f * 32-bit quotient in d6; * 32-bit remainder in d7, with upper word set to 0; * 16-bit remainder written to (a3). Also, it executes some stuff in parallel, saving cycles. If you don't care about the upper word of d7 being 0: Code:
move.l d6,d7 swap.w d6 cmp.w d4,d6 bcs.b .div32no divul.l d4,d6:d7 move.w d6,(a3) exg.l d6,d7 bra.b .div32f .div32no divu.w d4,d7 clr.l d6 move.w d7,d6 swap.w d7 move.w d7,(a3) .div32f Code:
move.l d6,d7 swap.w d6 cmp.w d4,d6 bcs.b .div32no divul.l d4,d6:d7 move.w d6,(a3) bra.b .div32f .div32no divu.w d4,d7 clr.l d6 move.w d7,d6 swap.w d7 move.w d7,(a3) .div32f Code:
move.l d6,d7 swap.w d6 cmp.w d4,d6 bcs.b .div32no divul.l d4,d6:d7 move.w d6,(a3) bra.b .div32f .div32no divu.w d4,d7 clr.l d6 move.l d7,(a3) move.w d7,d6 swap.w d7 .div32f Code:
move.l d6,d7 swap.w d6 cmp.w d4,d6 bcs.b .div32no divul.l d4,d6:d7 move.w d6,(a3) bra.b .div32f .div32no divu.w d4,d7 clr.l d6 move.l d7,(a3) move.w d7,d6 .div32f |
|
29 April 2021, 21:57 | #12 |
move.l #$c0ff33,throat
Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 6,863
|
|
29 April 2021, 23:06 | #13 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,064
|
|
29 April 2021, 23:15 | #14 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
The code proposed by litwr doesn't have increments, and from the context it looks like (a3) is a variable rather than an item in an array/buffer, so it seems it should be possible to long-align it. |
|
30 April 2021, 14:28 | #15 |
old bearded fool
Join Date: Jan 2010
Location: Bangkok
Age: 56
Posts: 779
|
I compiled pi-amiga.asm with 'vasm pi-amiga.asm -Fhunkexe -nosym -o pi-amiga' and ran it on my stock A1200 with fastram, got the following result:
Code:
pi-amiga number ? calculator v8 (68000) number of digits (up to 9252)? 800 31415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865132823066470938446095505822317253594081284811174502841027019385211055596446229489549303819644288109756659334461284756482337867831652712019091456485669234603486104543266482133936072602491412737245870066063155881748815209209628292540917153643678925903600113305305488204665213841469519415116094330572703657595919530921861173819326117931051185480744623799627495673518857527248912279381830119491298336733624406566430860213949463952247371907021798609437027705392171762931767523846748184676694051320005681271452635608277857713427577896091736371787214684409012249534301465495853710507922796892589235420199561121290219608640344181598136297747713099605187072113499999983729780499510597317328160963185 3.24 With "m68020" and "MULUopt = 1" I got this result: Code:
pi-amiga number ? calculator v8 (68020) number of digits (up to 9252)? 800 31415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865132823066470938446095505822317253594081284811174502841027019385211055596446229489549303819644288109756659334461284756482337867831652712019091456485669234603486104543266482133936072602491412737245870066063155881748815209209628292540917153643678925903600113305305488204665213841469519415116094330572703657595919530921861173819326117931051185480744623799627495673518857527248912279381830119491298336733624406566430860213949463952247371907021798609437027705392171762931767523846748184676694051320005681271452635608277857713427577896091736371787214684409012249534301465495853710507922796892589235420199561121290219608640344181598136297747713099605187072113499999983729780499510597317328160963185 3.14 Also tried the C example as mentioned in the comments, compiled with SAS/C and 'SCOPTIONS' as follows: Code:
DEBUG=SYMBOLFLUSH MATH=STANDARD CPU=68020 ERRORREXX OPTIMIZE LINK DATAMEMORY=FAST STRIPDEBUG Code:
/* https://crypto.stanford.edu/pbc/notes/pi/code.html */ #include <stdio.h> int main() { int r[2800 + 1]; int i, k; int b, d; int c = 0; for (i = 0; i < 2800; i++) { r[i] = 2000; } for (k = 2800; k > 0; k -= 14) { d = 0; i = k; for (;;) { d += r[i] * 10000; b = 2 * i - 1; r[i] = d % b; d /= b; i--; if (i == 0) break; d *= i; } printf("%.4d", c + d / 10000); c = d % 10000; } printf("\n"); return 0; } Result: Code:
> timeit pi 31415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865132823066470938446095505822317253594081284811174502841027019385211055596446229489549303819644288109756659334461284756482337867831652712019091456485669234603486104543266482133936072602491412737245870066063155881748815209209628292540917153643678925903600113305305488204665213841469519415116094330572703657595919530921861173819326117931051185480744623799627495673518857527248912279381830119491298336733624406566430860213949463952247371907021798609437027705392171762931767523846748184676694051320005681271452635608277857713427577896091736371787214684409012249534301465495853710507922796892589235420199561121290219608640344181598136297747713099605187072113499999983729780499510597317328160963185 Elapsed: 7.25s Also tried vbcc 'vc pi.c -o vc_pi -O2 -DCPU=68020' and the time it took was roughly 14 seconds, "-DCPU=68020" option didn't seem to have any affect, same time result with 68000 although a few bytes difference in size. When compiling for 68000 with SAS/C the time it took was roughly 13 seconds, in other words almost twice the performance for 68020. I haven't disassembled the resulting binaries (yet), but might do that later out of curiosity. Last edited by modrobert; 30 April 2021 at 16:45. |
30 April 2021, 16:00 | #16 | |
Registered User
Join Date: Jan 2021
Location: Germany
Posts: 18
|
Quote:
|
|
30 April 2021, 16:33 | #17 | |
old bearded fool
Join Date: Jan 2010
Location: Bangkok
Age: 56
Posts: 779
|
Quote:
Code:
> timeit vc68020_pi 31415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865132823066470938446095505822317253594081284811174502841027019385211055596446229489549303819644288109756659334461284756482337867831652712019091456485669234603486104543266482133936072602491412737245870066063155881748815209209628292540917153643678925903600113305305488204665213841469519415116094330572703657595919530921861173819326117931051185480744623799627495673518857527248912279381830119491298336733624406566430860213949463952247371907021798609437027705392171762931767523846748184676694051320005681271452635608277857713427577896091736371787214684409012249534301465495853710507922796892589235420199561121290219608640344181598136297747713099605187072113499999983729780499510597317328160963185 Elapsed: 6.78s |
|
30 April 2021, 21:06 | #18 | ||||||
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
Quote:
2. This gives too little. All main loop size is less 100 bytes, so it can't help for better cache usage. Indeed it is good to make the code a bit shorter but you chose the slower code with CMP before the first division... 3. You are right. I found out this myself too. But it doesn't affect the performance because it only makes faster the code which is executed very rarely. 4. I don't understand this your point. Quote:
No, it is important. Actually, the algo just subtracts D6 and D7 from D5 after the division. So we may exchange D6 and D7 but D5 is a long value. This code corrupts D7. Quote:
IMHO it is rather a matter of tastes. Some people may prefer more detailed code, some people prefer the briefness. BTW VASM always uses .W by default if the size is omitted. Quote:
Quote:
Quote:
Good emulators are quite accurate for the 68000 and they definitely show that MULUopt=1 makes the code slower. However your results show that MULUopt=1 can be useful at least for the 68020. I can also think that the 68030 must also be accelerated by MULUopt=1. Would you like please to run pi-amiga and pi-amiga1200 on your hardware for 100, 1000, and 3000 digits? A screenshot would be a nice addition for me to insert it in the table. Last edited by BippyM; 01 June 2021 at 18:24. |
||||||
30 April 2021, 22:42 | #19 | ||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
Quote:
Anyway, on to the divu-first code... Leaving aside the bvs optimization (that depends on the structure of your code), there's still one thing you can do to avoid the moveq at the beginning of the code, thus saving a little time in the case of the bvs branch: Code:
divu.w d4,d6 bvc.b .div32no divul.l d4,d7:d6 move.w d7,(a3) bra.b .div32f .div32no move.l d6,d7 clr.w d6 eor.l d6,d7 swap.w d6 move.w d6,(a3) .div32f |
||
01 May 2021, 08:50 | #20 | ||||
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
Quote:
Quote:
Quote:
Your code again corrupts D7. Quote:
Code:
divu.w d4,d6 bvs.s .longdiv moveq.l #0,d7 move.w d6,d7 clr.w d6 swap d6 move.w d6,(a3) |
||||
Currently Active Users Viewing This Thread: 1 (0 members and 1 guests) | |
Thread Tools | |
Similar Threads | ||||
Thread | Thread Starter | Forum | Replies | Last Post |
68020 Bit Field Instructions | mcgeezer | Coders. Asm / Hardware | 9 | 27 October 2023 23:21 |
68060 64-bit integer math | BSzili | Coders. Asm / Hardware | 7 | 25 January 2021 21:18 |
Discovery: Math | Audio Snow | request.Old Rare Games | 30 | 20 August 2018 12:17 |
Math apps | mtb | support.Apps | 1 | 08 September 2002 18:59 |
|
|