English Amiga Board Optimizing the 68020+ 32-bit math
 Register Amiga FAQ Rules & Help Members List  /  Moderators List Today's Posts Mark Forums Read

 26 April 2021, 19:17 #1 litwr Registered User   Join Date: Mar 2016 Location: Ozherele Posts: 229 Optimizing the 68020+ 32-bit math I have a project, the number pi calculator. I have found out recently that my 68020 code is actually slower than my 68000 code. I was misguided that emulators (FS-UAE, Hatari) show that on the contary that the 68020 code is faster. It seems that DIVUL takes much less cycles in emus than in hardware. So I am trying to make the 68020 code which will be faster than the 68000 code. The code for the 68020/30 is quite short: Code: ``` divul.l d4,d7:d6 move d7,(a3)``` The equivalent code for the 68000 is longer (but faster!): Code: ``` moveq.l #0,d7 divu d4,d6 bvc .div32no\@ swap d6 move d6,d7 divu d4,d7 swap d7 move d7,d6 swap d6 divu d4,d6 .div32no\@ move d6,d7 clr d6 swap d6 move d6,(a3)``` The code for the 68000 is faster because the branch to .div32no\@ is taken almost always in the pi-spigot algo. My whole project is available on github https://github.com/litwr2/rosetta-pi...e/master/amiga Maybe someone can help me to make better code for the 68020/30? Any hints are welcome. Thank you in advance. It is interesting to make code for the 68020/30 faster than the 68000 code. However I have only been able to make slightly shorter code for the 68020/30. Code: ``` moveq.l #0,d7 divu d4,d6 bvc .div32no\@ divul d4,d7:d6 move d7,(a3) ;r[i] <- d%b bra .div32f\@ .div32no\@ move d6,d7 clr d6 swap d6 move d6,(a3) ;r[i] <- d%b .div32f\@``` You can notice that D6 and D7 are exchanged - it is ok.
27 April 2021, 10:43   #2
Thomas Richter
Registered User

Join Date: Jan 2019
Location: Germany
Posts: 1,149
Quote:
 Originally Posted by litwr [/CODE]The equivalent code for the 68000 is longer (but faster!): Code: ``` moveq.l #0,d7 divu d4,d6 bvc .div32no\@ swap d6 move d6,d7 divu d4,d7 swap d7 move d7,d6 swap d6 divu d4,d6 .div32no\@ move d6,d7 clr d6 swap d6 move d6,(a3)```
That is NOT equivalent code. A DIVU.L dx,da:db is a 64 by 32 bit division. That's not what your code is doing. To emulate that on a machine without a 32-bit quotient, you need something like Algorithm D. And yes, that would be slower than a DIVU.L.

 28 April 2021, 02:32 #3 a/b Registered User   Join Date: Jun 2016 Location: europe Posts: 386 Well, he's not using divu.L, it's divuL(.L) which is 32/32 -> 32:32. I didn't look too closely at this because... Don't really want to start any flame wars but that code nearly gave me cancer. *Every* single instruction that should have a size specifier (because of multiple valid options), and there's almost a dozen of them is that short clip, doesn't have it. And moveq, which is *always* a longword operation and doesn't ever need it, has it. When I see that I just check out. Having to decipher M68K code because it's butchered so badly is not my idea of fun. I mean, it's OK when it's some fresh guy who's still learning the basics and does all kind of stuff, we all did that at some point, but this... Nothing personal, I just don't understand why would anyone write so ambiguous code: it's more error prone, less portable, others have to waste time figuring out what you wanted to do, ...
 28 April 2021, 10:54 #4 Thomas Richter Registered User   Join Date: Jan 2019 Location: Germany Posts: 1,149 Well, it doesn't really matter too much. The major problem is here is the quotient size of 32 bit, and this requires a more complicated algorithm than a cascaded divu with has a quotient of 16 bit. A 32/16 full division with 32 remainder and 32 bit quotient is easy with divu, but a full 32/32 division requires "algorithm D" (or some other division algorithm) and is more complex than what is shown here.
29 April 2021, 08:36   #5
litwr
Registered User

Join Date: Mar 2016
Location: Ozherele
Posts: 229
Quote:
 Originally Posted by Thomas Richter That is NOT equivalent code. A DIVU.L dx,da:db is a 64 by 32 bit division. That's not what your code is doing. To emulate that on a machine without a 32-bit quotient, you need something like Algorithm D. And yes, that would be slower than a DIVU.L. Well, it doesn't really matter too much. The major problem is here is the quotient size of 32 bit, and this requires a more complicated algorithm than a cascaded divu with has a quotient of 16 bit. A 32/16 full division with 32 remainder and 32 bit quotient is easy with divu, but a full 32/32 division requires "algorithm D" (or some other division algorithm) and is more complex than what is shown here.
Sorry I missed to show an important detail. It is all about 32/16 division with 32q:16r result. So the 68000 code is actually equivalent to the shorter 68020/30 code for this case. Indeed, as it was noted by a/b the DIVUL instruction is used, so, anyway, we have 32/32 division instruction, not 64/32.
Quote:
 Originally Posted by a/b Well, he's not using divu.L, it's divuL(.L) which is 32/32 -> 32:32. I didn't look too closely at this because... Don't really want to start any flame wars but that code nearly gave me cancer. *Every* single instruction that should have a size specifier (because of multiple valid options), and there's almost a dozen of them is that short clip, doesn't have it. And moveq, which is *always* a longword operation and doesn't ever need it, has it. When I see that I just check out. Having to decipher M68K code because it's butchered so badly is not my idea of fun. I mean, it's OK when it's some fresh guy who's still learning the basics and does all kind of stuff, we all did that at some point, but this... Nothing personal, I just don't understand why would anyone write so ambiguous code: it's more error prone, less portable, others have to waste time figuring out what you wanted to do, ...
It is impossible to please everyone. It is quite common to skip .w qualifier. You can find proofs for this in many books. However I can try to make the code specifically for your tastes.
68020/30:
Code:
```     divul.l d4,d7:d6
move.w d7,(a3)```
68000:
Code:
```     moveq.l #0,d7
divu.w d4,d6
bvc .div32no

swap d6
move.w d6,d7
divu.w d4,d7
swap d7
move.w d7,d6
swap d6
divu.w d4,d6
.div32no
move.w d6,d7
clr.w d6
swap d6
move.w d6,(a3)```
68020/30 (improved):
Code:
```     moveq.l #0,d7
divu.w d4,d6
bvc .div32no

divul.l d4,d7:d6
move.w d7,(a3)     ;r[i] <- d%b
bra .div32f

.div32no
move.w d6,d7
clr.w d6
swap d6
move.w d6,(a3)     ;r[i] <- d%b
.div32f```
I hope it is ok for you now. And I can assure you that if your text was less poisonous the result would be the same.
However I almost sure that it is impossible to optimize this 68020/30 division better than I show in my improved code.
It is interesting that I could optimize the 80386 code for the same case.
The initial 80386 code
Code:
```         div esi
mov [si+ra+1],dx```
The optimized 80386 code (it gives even a larger gain on the 80486)
Code:
```         rol eax,16
cmp ax,si
jnc .lx

mov dx,ax
shr eax,16
div si
.lxc:    mov [si+ra+1],dx```
Code:
```.lx:     rol eax,16
div esi
jmp .lxc```
Indeed it is rather a complex optimization but I had a hope that the Amiga experts can know some 68k tricks which I missed.

 29 April 2021, 12:20 #6 a/b Registered User   Join Date: Jun 2016 Location: europe Posts: 386 Why didn't you use "the same" algorithm, something like (code not tested)? Code: ``` move.l d4,d0 swap d0 cmp.l d0,d6 ; divident >= (divisor<<16)? bhs.b .32bit .16bit divu.w d4,d6 swap d6 move.w d6,(a3) .32bit divul.l d4,d7:d6 move.w d7,(a3)``` edit: To keep it simple: do a faster (32/16) div whenever possible *without* penalty of failing (it's still a slooow div), extra check is compensated for. Last edited by a/b; 29 April 2021 at 13:15.
 29 April 2021, 14:09 #7 Thomas Richter Registered User   Join Date: Jan 2019 Location: Germany Posts: 1,149 Precisely. It depends most likely on the distribution of the input which version is faster.
29 April 2021, 14:56   #8
litwr
Registered User

Join Date: Mar 2016
Location: Ozherele
Posts: 229
Quote:
 Originally Posted by a/b Why didn't you use "the same" algorithm, something like (code not tested)? Code: ``` move.l d4,d0 swap d0 cmp.l d0,d6 ; divident >= (divisor<<16)? bhs.b .32bit .16bit divu.w d4,d6 swap d6 move.w d6,(a3) .32bit divul.l d4,d7:d6 move.w d7,(a3)``` edit: To keep it simple: do a faster (32/16) div whenever possible *without* penalty of failing (it's still a slooow div), extra check is compensated for.
Thank you! However it is not that easy because we need d6 and d7 which must keep quotient and remainder in their 32-bit. So actually, we need a sequence of MOVE, CLR, BSWAP before .32bit - my code which is equivalent to the 386 code is the next:

Code:
```     moveq.l #0,d7
swap d6
cmp.w d4,d6
bcs .div32no

swap d6
divul.l d4,d7:d6
move.w d7,(a3)
bra .div32f

.div32no
swap d6
divu.w d4,d6
move.w d6,d7
clr.w d6
swap d6
move.w d6,(a3)
.div32f```
This makes 2 extra SWAPs. Optimization for the 80386 gives only 2 or 3 saved cycles, for the 486 - 4 or 5. So it is really very complex. It is sad that neither 80386 nor 68020 are cycle exact in popular emulators. Moreover the emulators are very inaccurate especially for DIVU.L and DIVUL.

Last edited by litwr; 29 April 2021 at 15:02.

29 April 2021, 15:02   #9
litwr
Registered User

Join Date: Mar 2016
Location: Ozherele
Posts: 229
Quote:
 Originally Posted by Thomas Richter Precisely. It depends most likely on the distribution of the input which version is faster.
We can just ignore the branch with DIVUL - it is executed very rarely.

 29 April 2021, 16:00 #10 a/b Registered User   Join Date: Jun 2016 Location: europe Posts: 386 OK, I didn't take that into account, keeping data in d6/d7. Here are my suggestions: 1. maybe invert the bcs condition: if the 16-bit case is executed a lot more frequently it should be as branch not taken *if* you can adjust your code to avoid a bra at the end 2. if you have a spare register, use move+swap+cmp sequence instead of swap+cmp+2*swap, it's the same speed but 2 bytes shorter, so potentially very slightly faster because you can squeeze 2 more bytes into icache (not that large if 020/030) 3. moveq #0,d7 should be moved to after .div32no (only the 16-bit case neeeds it), because 32-bit div will set all 32 bits anyway, or.... 4. moveq #0,d7 should be executed only once before the loop (code implies that the remainder is always 16-bit, and setting d7 bits 16-31 only once will suffice) Last edited by a/b; 29 April 2021 at 16:10.
29 April 2021, 20:03   #11
saimo
Registered User

Join Date: Aug 2010
Location: Italy
Posts: 407
Quote:
 Originally Posted by litwr We can just ignore the branch with DIVUL - it is executed very rarely.
Then I'd go for this:
Code:
```     move.l     d6,d7
swap.w     d6
cmp.w      d4,d6
bcs.b      .div32no

divul.l    d4,d6:d7
move.w     d6,(a3)
exg.l      d6,d7
bra.b      .div32f

.div32no
divu.w     d4,d7
clr.l      d6
move.w     d7,d6
clr.w      d7
swap.w     d7
move.w     d7,(a3)

.div32f```
This code gives:
* 32-bit quotient in d6;
* 32-bit remainder in d7, with upper word set to 0;
* 16-bit remainder written to (a3).
Also, it executes some stuff in parallel, saving cycles.

If you don't care about the upper word of d7 being 0:
Code:
```     move.l     d6,d7
swap.w     d6
cmp.w      d4,d6
bcs.b      .div32no

divul.l    d4,d6:d7
move.w     d6,(a3)
exg.l      d6,d7
bra.b      .div32f

.div32no
divu.w     d4,d7
clr.l      d6
move.w     d7,d6
swap.w     d7
move.w     d7,(a3)

.div32f```
If you don't care about the registers being exchanged (as you mentioned in a post):
Code:
```     move.l     d6,d7
swap.w     d6
cmp.w      d4,d6
bcs.b      .div32no

divul.l    d4,d6:d7
move.w     d6,(a3)
bra.b      .div32f

.div32no
divu.w     d4,d7
clr.l      d6
move.w     d7,d6
swap.w     d7
move.w     d7,(a3)

.div32f```
If you can afford to trash the word at (2,a3):
Code:
```     move.l     d6,d7
swap.w     d6
cmp.w      d4,d6
bcs.b      .div32no

divul.l    d4,d6:d7
move.w     d6,(a3)
bra.b      .div32f

.div32no
divu.w     d4,d7
clr.l      d6
move.l     d7,(a3)
move.w     d7,d6
swap.w     d7

.div32f```
And if you don't care about the remainder in d7:
Code:
```     move.l     d6,d7
swap.w     d6
cmp.w      d4,d6
bcs.b      .div32no

divul.l    d4,d6:d7
move.w     d6,(a3)
bra.b      .div32f

.div32no
divu.w     d4,d7
clr.l      d6
move.l     d7,(a3)
move.w     d7,d6

.div32f```

29 April 2021, 21:57   #12
StingRay
move.l #\$c0ff33,throat

Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 6,651
Quote:
 Originally Posted by litwr It is quite common to skip .w qualifier. You can find proofs for this in many books.

You should read better books then! There are assemblers which default to .l which in turn breaks such code. And rightfully so as the size specifiers are there for a reason.

29 April 2021, 23:06   #13
a/b
Registered User

Join Date: Jun 2016
Location: europe
Posts: 386
Quote:
 Originally Posted by saimo If you can afford to trash the word at (2,a3):
Keep in mind that misaligned longword access is slower. No (a3)+ so we don't know for how much is a3 being incremented, but it might be ok though.

29 April 2021, 23:15   #14
saimo
Registered User

Join Date: Aug 2010
Location: Italy
Posts: 407
Quote:
 Originally Posted by a/b Keep in mind that misaligned longword access is slower. No (a3)+ so we don't know for how much is a3 being incremented, but it might be ok though.
Ah, yes, alignment is crucial there to actually benefit from the long write. Mentioning that didn't even cross my mind as it's a given. Thanks for pointing it out, though.
The code proposed by litwr doesn't have increments, and from the context it looks like (a3) is a variable rather than an item in an array/buffer, so it seems it should be possible to long-align it.

 30 April 2021, 14:28 #15 modrobert old bearded fool   Join Date: Jan 2010 Location: Bangkok Age: 53 Posts: 572 I compiled pi-amiga.asm with 'vasm pi-amiga.asm -Fhunkexe -nosym -o pi-amiga' and ran it on my stock A1200 with fastram, got the following result: Code: ```pi-amiga number ? calculator v8 (68000) number of digits (up to 9252)? 800 31415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865132823066470938446095505822317253594081284811174502841027019385211055596446229489549303819644288109756659334461284756482337867831652712019091456485669234603486104543266482133936072602491412737245870066063155881748815209209628292540917153643678925903600113305305488204665213841469519415116094330572703657595919530921861173819326117931051185480744623799627495673518857527248912279381830119491298336733624406566430860213949463952247371907021798609437027705392171762931767523846748184676694051320005681271452635608277857713427577896091736371787214684409012249534301465495853710507922796892589235420199561121290219608640344181598136297747713099605187072113499999983729780499510597317328160963185 3.24``` Assuming I understood the last output there, it took 3.24 seconds? With "m68020" and "MULUopt = 1" I got this result: Code: ```pi-amiga number ? calculator v8 (68020) number of digits (up to 9252)? 800 31415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865132823066470938446095505822317253594081284811174502841027019385211055596446229489549303819644288109756659334461284756482337867831652712019091456485669234603486104543266482133936072602491412737245870066063155881748815209209628292540917153643678925903600113305305488204665213841469519415116094330572703657595919530921861173819326117931051185480744623799627495673518857527248912279381830119491298336733624406566430860213949463952247371907021798609437027705392171762931767523846748184676694051320005681271452635608277857713427577896091736371787214684409012249534301465495853710507922796892589235420199561121290219608640344181598136297747713099605187072113499999983729780499510597317328160963185 3.14``` Took 3.14 seconds. Also tried the C example as mentioned in the comments, compiled with SAS/C and 'SCOPTIONS' as follows: Code: ```DEBUG=SYMBOLFLUSH MATH=STANDARD CPU=68020 ERRORREXX OPTIMIZE LINK DATAMEMORY=FAST STRIPDEBUG``` Source: Code: ```/* https://crypto.stanford.edu/pbc/notes/pi/code.html */ #include int main() { int r[2800 + 1]; int i, k; int b, d; int c = 0; for (i = 0; i < 2800; i++) { r[i] = 2000; } for (k = 2800; k > 0; k -= 14) { d = 0; i = k; for (;;) { d += r[i] * 10000; b = 2 * i - 1; r[i] = d % b; d /= b; i--; if (i == 0) break; d *= i; } printf("%.4d", c + d / 10000); c = d % 10000; } printf("\n"); return 0; }``` Result: Code: ```> timeit pi 31415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865132823066470938446095505822317253594081284811174502841027019385211055596446229489549303819644288109756659334461284756482337867831652712019091456485669234603486104543266482133936072602491412737245870066063155881748815209209628292540917153643678925903600113305305488204665213841469519415116094330572703657595919530921861173819326117931051185480744623799627495673518857527248912279381830119491298336733624406566430860213949463952247371907021798609437027705392171762931767523846748184676694051320005681271452635608277857713427577896091736371787214684409012249534301465495853710507922796892589235420199561121290219608640344181598136297747713099605187072113499999983729780499510597317328160963185 Elapsed: 7.25s``` Also tried vbcc 'vc pi.c -o vc_pi -O2 -DCPU=68020' and the time it took was roughly 14 seconds, "-DCPU=68020" option didn't seem to have any affect, same time result with 68000 although a few bytes difference in size. When compiling for 68000 with SAS/C the time it took was roughly 13 seconds, in other words almost twice the performance for 68020. I haven't disassembled the resulting binaries (yet), but might do that later out of curiosity. Last edited by modrobert; 30 April 2021 at 16:45.
30 April 2021, 16:00   #16
vbc
Registered User

Join Date: Jan 2021
Location: Germany
Posts: 2
Quote:
 Originally Posted by modrobert Also tried vbcc 'vc pi.c -o vc_pi -O2 -DCPU=68020' and the time it took was roughly 14 seconds, "-DCPU=68020" option didn't seem to have any affect, same time result with 68000 although a few bytes difference in size. When compiling for 68000 with SAS/C the time it took was roughly 13 seconds, in other words almost twice the performance for 68020. I haven't disassembled the resulting binaries (yet), but might do that later out of curiosity.
-DCPU=68020 is the same as putting "#define CPU 68020" in your code. The option to generate code for 68020 is -cpu=68020.

30 April 2021, 16:33   #17
modrobert
old bearded fool

Join Date: Jan 2010
Location: Bangkok
Age: 53
Posts: 572
Quote:
 Originally Posted by vbc -DCPU=68020 is the same as putting "#define CPU 68020" in your code. The option to generate code for 68020 is -cpu=68020.
Aha, thanks. I got this result for vbcc after compiling with 'vc pi.c -o vc68020_pi -O2 -DCPU=68020 -cpu=68020':

Code:
```> timeit vc68020_pi
31415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865132823066470938446095505822317253594081284811174502841027019385211055596446229489549303819644288109756659334461284756482337867831652712019091456485669234603486104543266482133936072602491412737245870066063155881748815209209628292540917153643678925903600113305305488204665213841469519415116094330572703657595919530921861173819326117931051185480744623799627495673518857527248912279381830119491298336733624406566430860213949463952247371907021798609437027705392171762931767523846748184676694051320005681271452635608277857713427577896091736371787214684409012249534301465495853710507922796892589235420199561121290219608640344181598136297747713099605187072113499999983729780499510597317328160963185
Elapsed: 6.78s```

30 April 2021, 21:06   #18
litwr
Registered User

Join Date: Mar 2016
Location: Ozherele
Posts: 229
Quote:
 Originally Posted by a/b Here are my suggestions: 1. maybe invert the bcs condition: if the 16-bit case is executed a lot more frequently it should be as branch not taken *if* you can adjust your code to avoid a bra at the end 2. if you have a spare register, use move+swap+cmp sequence instead of swap+cmp+2*swap, it's the same speed but 2 bytes shorter, so potentially very slightly faster because you can squeeze 2 more bytes into icache (not that large if 020/030) 3. moveq #0,d7 should be moved to after .div32no (only the 16-bit case neeeds it), because 32-bit div will set all 32 bits anyway, or.... 4. moveq #0,d7 should be executed only once before the loop (code implies that the remainder is always 16-bit, and setting d7 bits 16-31 only once will suffice)
1. Eureka! How could I omit this variant?! Maybe I wanted the macro code in one piece? But I split it for the 80386... So it is a crazy moment when a man discovers that he lost his pen when it is in his hand. Thank you very much. Maybe I could also reluctant to do this optimization because Bcc timings are very unusual for the 68k. For instance, for the Bcc.W case branch is taken is longer than not taken. I attach my improved programs to this post. Emulators (which are accurate for the 68000) show more than 1% speed boost - it has been a largest gain for years. However I need help with hardware to check the gain for the 68020 case.
2. This gives too little. All main loop size is less 100 bytes, so it can't help for better cache usage. Indeed it is good to make the code a bit shorter but you chose the slower code with CMP before the first division...
3. You are right. I found out this myself too. But it doesn't affect the performance because it only makes faster the code which is executed very rarely.
4. I don't understand this your point.

Quote:
 Originally Posted by saimo Then I'd go for this: Code: ``` move.l d6,d7 swap.w d6 cmp.w d4,d6 bcs.b .div32no divul.l d4,d6:d7 move.w d6,(a3) exg.l d6,d7 bra.b .div32f .div32no divu.w d4,d7 clr.l d6 move.w d7,d6 clr.w d7 swap.w d7 move.w d7,(a3) .div32f``` This code gives: * 32-bit quotient in d6; * 32-bit remainder in d7, with upper word set to 0; * 16-bit remainder written to (a3). Also, it executes some stuff in parallel, saving cycles.
Thank you. But you also chose the slower code with CMP before the first division. The 68k has an advantage over the x86: the DIVU instructions set V-flag. Why don't use this advantage? Finally, your code just replaces MOVEQ and SWAP with MOVE.L and CLR.L - it hardly makes any speed boost.

Quote:
 Originally Posted by a/b If you don't care about the upper word of d7 being 0:
No, it is important. Actually, the algo just subtracts D6 and D7 from D5 after the division. So we may exchange D6 and D7 but D5 is a long value.

Quote:
 Originally Posted by a/b If you can afford to trash the word at (2,a3):
This code corrupts D7.

Quote:
 Originally Posted by StingRay You should read better books then! There are assemblers which default to .l which in turn breaks such code. And rightfully so as the size specifiers are there for a reason.
Maybe. I can just show you a good book where they use just MOVE instead of MOVE.W - Amiga Machine Language by Stefan Dittrich (1989). In a newer book Total! Amiga assembler by Paul Overaa (1995) they wrote "If the object size is not specified then a word size (16 bit) is assumed".
IMHO it is rather a matter of tastes. Some people may prefer more detailed code, some people prefer the briefness.
BTW VASM always uses .W by default if the size is omitted.

Quote:
 Originally Posted by a/b Keep in mind that misaligned longword access is slower. No (a3)+ so we don't know for how much is a3 being incremented, but it might be ok though.
The code has the move.w -(a3),d0 instruction in its main loop.

Quote:
 Originally Posted by modrobert Code: ```pi-amiga number ? calculator v8 (68000) number of digits (up to 9252)? 800 3141592653589793... 3.24``` Assuming I understood the last output there, it took 3.24 seconds?
Exactly! However would you like to provide us with results for 1000 or/and 3000 digits? It allows me to compare your results with previous. It is better to show only timing because the algo is well tested and digits must be correct.

Quote:
 Originally Posted by StingRay With "m68020" and "MULUopt = 1" I got this result: Code: ```pi-amiga number ? calculator v8 (68020) number of digits (up to 9252)? 800 3141592653989793... 3.14```
Thank you very much!!! Your result show that MULUopt speeds up the 68020 code! This is a surprise for me because it means that hardware multiplication is slower than a set of equivalent instructions for this case.

Good emulators are quite accurate for the 68000 and they definitely show that MULUopt=1 makes the code slower. However your results show that MULUopt=1 can be useful at least for the 68020. I can also think that the 68030 must also be accelerated by MULUopt=1. Would you like please to run pi-amiga and pi-amiga1200 on your hardware for 100, 1000, and 3000 digits? A screenshot would be a nice addition for me to insert it in the table.

Last edited by BippyM; 01 June 2021 at 18:24.

30 April 2021, 22:42   #19
saimo
Registered User

Join Date: Aug 2010
Location: Italy
Posts: 407
Quote:
 Thank you. But you also chose the slower code with CMP before the first division. The 68k has an advantage over the x86: the DIVU instructions set V-flag. Why don't use this advantage?
To be honest, I only skimmed through the thread and I thought that the code you landed at in post #8 was for some reason the form you were aiming at, so I just applied some optimizations to that But, yes, I agree that it's better to perform the division first, given that you said that the worst case (overflow set) is very rare.

Quote:
 Finally, your code just replaces MOVEQ and SWAP with MOVE.L and CLR.L - it hardly makes any speed boost.
Other than on 68060, swap is slower. The code I proposed also aims to save cycles by allowing the CPU to execute more stuff in parallel thanks to less register dependencies (and the long write to memory, which, if I understand correctly, is not an option).
Anyway, on to the divu-first code...

Leaving aside the bvs optimization (that depends on the structure of your code), there's still one thing you can do to avoid the moveq at the beginning of the code, thus saving a little time in the case of the bvs branch:
Code:
```     divu.w   d4,d6
bvc.b    .div32no

divul.l  d4,d7:d6
move.w   d7,(a3)
bra.b    .div32f

.div32no
move.l   d6,d7
clr.w    d6
eor.l    d6,d7
swap.w   d6
move.w   d6,(a3)

.div32f```

01 May 2021, 08:50   #20
litwr
Registered User

Join Date: Mar 2016
Location: Ozherele
Posts: 229
Quote:
 Originally Posted by saimo The code proposed by litwr doesn't have increments, and from the context it looks like (a3) is a variable rather than an item in an array/buffer, so it seems it should be possible to long-align it.
A3 is a pointer to an array element. The element size is 2 byte. The pointer decreases. So long word access may slow down the algo.

Quote:
 Originally Posted by saimo To be honest, I only skimmed through the thread and I thought that the code you landed at in post #8 was for some reason the form you were aiming at, so I just applied some optimizations to that But, yes, I agree that it's better to perform the division first, given that you said that the worst case (overflow set) is very rare.

Quote:
 Originally Posted by saimo Other than on 68060, swap is slower. The code I proposed also aims to save cycles by allowing the CPU to execute more stuff in parallel thanks to less register dependencies (and the long write to memory, which, if I understand correctly, is not an option).
Thank you. I didn't know this. However CLR vs SWAP timing is rather odd. The 68000 executes SWAP faster than CLR but the 68020 executes CLR faster than SWAP! It is interesting to reduce instruction dependency in the code that may speed up the execution on the 68020 and higher 68k. Actually I didn't think about it. But I have just checked the code and IMHO it is difficult to improve it this way. The code for the main loop is short, it is only 17 lines (or 25 if MULUopt=1) between .l2 and BCC .l2 - one can check it too.

Quote:
 Originally Posted by saimo Anyway, on to the divu-first code...

Quote:
 Originally Posted by saimo Leaving aside the bvs optimization (that depends on the structure of your code), there's still one thing you can do to avoid the moveq at the beginning of the code, thus saving a little time in the case of the bvs branch
My BVS optimization is in the last attachment. However it is not important because it is independent from other optimizations. However, it is short and I can show it here too

Code:
```         divu.w d4,d6
bvs.s .longdiv

moveq.l #0,d7
move.w d6,d7
clr.w d6
swap d6
move.w d6,(a3)```

 Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)

 Similar Threads Thread Thread Starter Forum Replies Last Post BSzili Coders. Asm / Hardware 7 25 January 2021 21:18 mcgeezer Coders. Asm / Hardware 7 07 February 2019 14:59 Audio Snow request.Old Rare Games 30 20 August 2018 12:17 mtb support.Apps 1 08 September 2002 18:59

 Posting Rules You may not post new threads You may not post replies You may not post attachments You may not edit your posts BB code is On Smilies are On [IMG] code is On HTML code is Off Forum Rules
 Forum Jump User Control Panel Private Messages Subscriptions Who's Online Search Forums Forums Home News Main     Amiga scene     Retrogaming General Discussion     Nostalgia & memories Support     New to Emulation or Amiga scene         Member Introductions     support.WinUAE     support.WinFellow     support.OtherUAE     support.FS-UAE         project.AmigaLive     support.Hardware         Hardware mods         Hardware pics     support.Games     support.Demos     support.Apps     support.Amiga Forever     support.Amix     support.Other Requests     request.UAE Wishlist     request.Old Rare Games     request.Demos     request.Apps     request.Modules     request.Music     request.Other     Looking for a game name ?     Games images which need to be WHDified abime.net - Hall Of Light     HOL news     HOL suggestions and feedback     HOL data problems     HOL contributions abime.net - Amiga Magazine Rack     AMR news     AMR suggestions and feedback     AMR data problems     AMR contributions abime.net - Home Projects     project.Amiga Lore     project.EAB     project.IRC     project.Mods Jukebox     project.Wiki abime.net - Hosted Projects     project.aGTW     project.APoV     project.ClassicWB     project.Jambo!     project.Green Amiga Alien GUIDES     project.Maptapper     project.Sprites     project.WinUAE - Kaillera Other Projects     project.Amiga Demo DVD     project.Amiga Game Factory     project.CARE     project.Amiga File Server     project.CD32 Conversion     project.Game Cover Art         GCA.Feedback and Suggestions         GCA.Work in Progress         GCA.Cover Requests         GCA.Usefull Programs         GCA.Helpdesk     project.KGLoad     project.MAGE     project.Missing Full Shareware Games     project.SPS (was CAPS)     project.TOSEC (amiga only)     project.WHDLoad         project.Killergorilla's WHD packs Misc     Amiga websites reviews     MarketPlace         Swapshop     Kinky Amiga Stuff     Collections     EAB's competition Coders     Coders. General         Coders. Releases         Coders. Tutorials     Coders. Asm / Hardware     Coders. System         Coders. Scripting         Coders. Nextgen     Coders. Language         Coders. C/C++         Coders. AMOS         Coders. Blitz Basic     Coders. Contest         Coders. Entries Creation     Graphics         Graphics. Work In Progress         Graphics. Finished Work         Graphics. Tutorials     Music         Music. Work In Progress         Music. Finished Work         Music. Tutorials

All times are GMT +2. The time now is 15:36.

 -- EAB3 skin ---- EAB2 skin ---- Mobile skin Archive - Top