English Amiga Board


Old 26 April 2021, 19:17   #1
litwr
Registered User

 
Join Date: Mar 2016
Location: Ozherele
Posts: 229
Optimizing the 68020+ 32-bit math

I have a project, a calculator for the number pi. I recently found out that my 68020 code is actually slower than my 68000 code. I was misled by emulators (FS-UAE, Hatari), which show the opposite: that the 68020 code is faster. It seems that DIVUL takes far fewer cycles in emulators than on real hardware. So I am trying to write 68020 code that is faster than the 68000 code.
The code for the 68020/30 is quite short:
Code:
     divul.l d4,d7:d6
     move d7,(a3)
The equivalent code for the 68000 is longer (but faster!):
Code:
     moveq.l #0,d7
     divu d4,d6
     bvc .div32no\@

     swap d6
     move d6,d7
     divu d4,d7
     swap d7
     move d7,d6
     swap d6
     divu d4,d6
.div32no\@
     move d6,d7
     clr d6
     swap d6
     move d6,(a3)
The code for the 68000 is faster because the branch to .div32no\@ is almost always taken in the pi spigot algorithm.

My whole project is available on github https://github.com/litwr2/rosetta-pi...e/master/amiga

Maybe someone can help me write better code for the 68020/30? Any hints are welcome. Thank you in advance. It would be interesting to make the 68020/30 code faster than the 68000 code. However, so far I have only been able to make slightly shorter code for the 68020/30.

Code:
     moveq.l #0,d7
     divu d4,d6
     bvc .div32no\@

     divul d4,d7:d6
     move d7,(a3)     ;r[i] <- d%b
     bra .div32f\@

.div32no\@
     move d6,d7
     clr d6
     swap d6
     move d6,(a3)     ;r[i] <- d%b
.div32f\@
You may notice that D6 and D7 are exchanged between the two paths; that is OK.
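To make the idea above concrete, here is a minimal C model of it, since C is easier to test than the assembly. It is only a sketch: `divu_w` imitates how DIVU.W reports a quotient that does not fit in 16 bits (the V flag), and `div32_by_16` is a hypothetical wrapper that falls back to a full division in that rare case. Neither name comes from the project.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of DIVU.W: 32-bit dividend, 16-bit divisor.
   Returns false (the V flag) and leaves the outputs untouched
   when the quotient needs more than 16 bits. */
bool divu_w(uint32_t dividend, uint16_t divisor,
            uint16_t *quot, uint16_t *rem)
{
    assert(divisor != 0);
    uint32_t q = dividend / divisor;
    if (q > 0xFFFFu)
        return false;
    *quot = (uint16_t)q;
    *rem = (uint16_t)(dividend % divisor);
    return true;
}

/* Hypothetical wrapper: try the cheap 16-bit-quotient division first,
   use the full division (the DIVUL.L path) only on overflow. */
void div32_by_16(uint32_t dividend, uint16_t divisor,
                 uint32_t *quot, uint16_t *rem)
{
    uint16_t q16, r16;
    if (divu_w(dividend, divisor, &q16, &r16)) {
        *quot = q16;                              /* common case */
        *rem = r16;
    } else {
        *quot = dividend / divisor;               /* rare case */
        *rem = (uint16_t)(dividend % divisor);
    }
}
```

The whole gain depends on the first branch being the common one, which is what the pi spigot is said to guarantee here.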
Old 27 April 2021, 10:43   #2
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 1,149
Quote:
Originally Posted by litwr View Post
The equivalent code for the 68000 is longer (but faster!):
Code:
     moveq.l #0,d7
     divu d4,d6
     bvc .div32no\@

     swap d6
     move d6,d7
     divu d4,d7
     swap d7
     move d7,d6
     swap d6
     divu d4,d6
.div32no\@
     move d6,d7
     clr d6
     swap d6
     move d6,(a3)
That is NOT equivalent code. A DIVU.L dx,da:db is a 64 by 32 bit division. That's not what your code is doing. To emulate that on a machine without a 32-bit quotient, you need something like Algorithm D. And yes, that would be slower than a DIVU.L.
Old 28 April 2021, 02:32   #3
a/b
Registered User

 
Join Date: Jun 2016
Location: europe
Posts: 386
Well, he's not using divu.L, it's divuL(.L) which is 32/32 -> 32:32.
I didn't look too closely at this because... Don't really want to start any flame wars but that code nearly gave me cancer. *Every* single instruction that should have a size specifier (because of multiple valid options), and there's almost a dozen of them in that short clip, doesn't have it. And moveq, which is *always* a longword operation and doesn't ever need it, has it.
When I see that I just check out. Having to decipher M68K code because it's butchered so badly is not my idea of fun. I mean, it's OK when it's some fresh guy who's still learning the basics and does all kind of stuff, we all did that at some point, but this... Nothing personal, I just don't understand why would anyone write so ambiguous code: it's more error prone, less portable, others have to waste time figuring out what you wanted to do, ...
Old 28 April 2021, 10:54   #4
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 1,149
Well, it doesn't really matter too much. The major problem here is the 32-bit quotient size, which requires a more complicated algorithm than a cascaded divu, which has a 16-bit quotient. A full 32/16 division with a 32-bit remainder and a 32-bit quotient is easy with divu, but a full 32/32 division requires "algorithm D" (or some other division algorithm) and is more complex than what is shown here.
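The "easy" cascaded case can be sketched in C as follows. This is an illustrative model, not code from the project: `div32_by_16_cascaded` is a made-up name, and it only covers a divisor that fits in 16 bits. Each of the two divisions yields a quotient below 65536, which is all DIVU.W can deliver; a 32-bit divisor would need Algorithm D or similar, as noted above.

```c
#include <assert.h>
#include <stdint.h>

/* 32/16 division giving a 32-bit quotient and a 16-bit remainder,
   built from two steps whose quotients each fit in 16 bits. */
void div32_by_16_cascaded(uint32_t dividend, uint16_t divisor,
                          uint32_t *quot, uint16_t *rem)
{
    assert(divisor != 0);
    uint32_t hi = dividend >> 16;        /* upper word of the dividend */
    uint32_t lo = dividend & 0xFFFFu;    /* lower word of the dividend */

    uint32_t q_hi = hi / divisor;        /* fits in 16 bits: hi < 65536 */
    uint32_t r_hi = hi % divisor;        /* r_hi < divisor */

    /* r_hi < divisor, so (r_hi:lo) / divisor also fits in 16 bits */
    uint32_t mid = (r_hi << 16) | lo;
    uint32_t q_lo = mid / divisor;

    *quot = (q_hi << 16) | q_lo;
    *rem = (uint16_t)(mid % divisor);
}
```

This is the schoolbook method in base 65536: divide the upper digit, carry its remainder into the lower digit, divide again.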
Old 29 April 2021, 08:36   #5
litwr
Registered User

 
Join Date: Mar 2016
Location: Ozherele
Posts: 229
Quote:
Originally Posted by Thomas Richter View Post
That is NOT equivalent code. A DIVU.L dx,da:db is a 64 by 32 bit division. That's not what your code is doing. To emulate that on a machine without a 32-bit quotient, you need something like Algorithm D. And yes, that would be slower than a DIVU.L.

Well, it doesn't really matter too much. The major problem here is the 32-bit quotient size, which requires a more complicated algorithm than a cascaded divu, which has a 16-bit quotient. A full 32/16 division with a 32-bit remainder and a 32-bit quotient is easy with divu, but a full 32/32 division requires "algorithm D" (or some other division algorithm) and is more complex than what is shown here.
Sorry, I omitted an important detail. It is all about 32/16 division with a 32q:16r result. So the 68000 code is actually equivalent to the shorter 68020/30 code for this case. Indeed, as a/b noted, the DIVUL instruction is used, so in any case we have a 32/32 division instruction, not 64/32.
Quote:
Originally Posted by a/b View Post
Well, he's not using divu.L, it's divuL(.L) which is 32/32 -> 32:32.
I didn't look too closely at this because... Don't really want to start any flame wars but that code nearly gave me cancer. *Every* single instruction that should have a size specifier (because of multiple valid options), and there's almost a dozen of them in that short clip, doesn't have it. And moveq, which is *always* a longword operation and doesn't ever need it, has it.
When I see that I just check out. Having to decipher M68K code because it's butchered so badly is not my idea of fun. I mean, it's OK when it's some fresh guy who's still learning the basics and does all kind of stuff, we all did that at some point, but this... Nothing personal, I just don't understand why would anyone write so ambiguous code: it's more error prone, less portable, others have to waste time figuring out what you wanted to do, ...
It is impossible to please everyone. It is quite common to skip the .w qualifier. You can find evidence of this in many books. However, I can try to rewrite the code to your taste.
68020/30:
Code:
     divul.l d4,d7:d6
     move.w d7,(a3)
68000:
Code:
     moveq.l #0,d7
     divu.w d4,d6
     bvc .div32no

     swap d6
     move.w d6,d7
     divu.w d4,d7
     swap d7
     move.w d7,d6
     swap d6
     divu.w d4,d6
.div32no
     move.w d6,d7
     clr.w d6
     swap d6
     move.w d6,(a3)
68020/30 (improved):
Code:
     moveq.l #0,d7
     divu.w d4,d6
     bvc .div32no

     divul.l d4,d7:d6
     move.w d7,(a3)     ;r[i] <- d%b
     bra .div32f

.div32no
     move.w d6,d7
     clr.w d6
     swap d6
     move.w d6,(a3)     ;r[i] <- d%b
.div32f
I hope it is ok for you now. And I can assure you that if your text had been less poisonous, the result would have been the same.
However, I am almost sure that it is impossible to optimize this 68020/30 division any further than I show in my improved code.
It is interesting that I was able to optimize the 80386 code for the same case.
The initial 80386 code
Code:
         div esi
         mov [si+ra+1],dx
The optimized 80386 code (it gives even a larger gain on the 80486)
Code:
         rol eax,16
         cmp ax,si
         jnc .lx

         mov dx,ax
         shr eax,16
         div si
.lxc:    mov [si+ra+1],dx
Code:
.lx:     rol eax,16
         div esi
         jmp .lxc
Indeed, it is a rather complex optimization, but I hoped that the Amiga experts might know some 68k tricks which I missed.
Old 29 April 2021, 12:20   #6
a/b
Registered User

 
Join Date: Jun 2016
Location: europe
Posts: 386
Why didn't you use "the same" algorithm, something like (code not tested)?
Code:
	move.l	d4,d0
	swap	d0		; d0 = divisor<<16 (assumes the upper word of d4 is 0)
	cmp.l	d0,d6		; dividend >= (divisor<<16)?
	bhs.b	.32bit

.16bit	divu.w	d4,d6
	swap	d6
	move.w	d6,(a3)
	bra.b	.done		; skip the 32-bit path

.32bit	divul.l	d4,d7:d6
	move.w	d7,(a3)
.done
edit: To keep it simple: do a faster (32/16) div whenever possible *without* penalty of failing (it's still a slooow div), extra check is compensated for.
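Why the up-front compare predicts the DIVU.W overflow exactly can be checked with a small C sketch. The function names are hypothetical, and it assumes a non-zero divisor that fits in 16 bits. The quotient exceeds 0xFFFF precisely when the dividend is at least divisor<<16, and because the low word of divisor<<16 is zero, comparing only the high word of the dividend with the divisor (the form the x86 code uses) gives the same answer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The pre-check: full 32-bit compare against divisor<<16. */
bool needs_long_divide(uint32_t dividend, uint16_t divisor)
{
    assert(divisor != 0);
    return dividend >= ((uint32_t)divisor << 16);
}

/* Same test using only the high word of the dividend. */
bool high_word_ge_divisor(uint32_t dividend, uint16_t divisor)
{
    assert(divisor != 0);
    return (dividend >> 16) >= divisor;
}

/* What DIVU.W itself would report through the V flag. */
bool quotient_overflows_16(uint32_t dividend, uint16_t divisor)
{
    assert(divisor != 0);
    return dividend / divisor > 0xFFFFu;
}
```

All three predicates agree, so the choice between "compare first" and "divide first and test V" is purely a matter of cycle counts and of how often each path runs.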

Last edited by a/b; 29 April 2021 at 13:15.
Old 29 April 2021, 14:09   #7
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 1,149
Precisely. It depends most likely on the distribution of the input which version is faster.
Old 29 April 2021, 14:56   #8
litwr
Registered User

 
Join Date: Mar 2016
Location: Ozherele
Posts: 229
Quote:
Originally Posted by a/b View Post
Why didn't you use "the same" algorithm, something like (code not tested)?
Code:
	move.l	d4,d0
	swap	d0		; d0 = divisor<<16 (assumes the upper word of d4 is 0)
	cmp.l	d0,d6		; dividend >= (divisor<<16)?
	bhs.b	.32bit

.16bit	divu.w	d4,d6
	swap	d6
	move.w	d6,(a3)
	bra.b	.done		; skip the 32-bit path

.32bit	divul.l	d4,d7:d6
	move.w	d7,(a3)
.done
edit: To keep it simple: do a faster (32/16) div whenever possible *without* penalty of failing (it's still a slooow div), extra check is compensated for.
Thank you! However, it is not that easy, because d6 and d7 must hold the quotient and remainder as 32-bit values. So we actually need a sequence of MOVE, CLR, SWAP before .32bit. My code, which is equivalent to the 386 code, is as follows:

Code:
     moveq.l #0,d7
     swap d6
     cmp.w d4,d6
     bcs .div32no

     swap d6
     divul.l d4,d7:d6
     move.w d7,(a3) 
     bra .div32f

.div32no
     swap d6
     divu.w d4,d6
     move.w d6,d7
     clr.w d6
     swap d6 
     move.w d6,(a3) 
.div32f
This costs 2 extra SWAPs. On the 80386 the optimization saves only 2 or 3 cycles; on the 486, 4 or 5. So it is really very complex. It is sad that neither the 80386 nor the 68020 is cycle-exact in popular emulators. Moreover, the emulators are especially inaccurate for DIVU.L and DIVUL.

Last edited by litwr; 29 April 2021 at 15:02.
Old 29 April 2021, 15:02   #9
litwr
Registered User

 
Join Date: Mar 2016
Location: Ozherele
Posts: 229
Quote:
Originally Posted by Thomas Richter View Post
Precisely. It depends most likely on the distribution of the input which version is faster.
We can just ignore the branch with DIVUL - it is executed very rarely.
Old 29 April 2021, 16:00   #10
a/b
Registered User

 
Join Date: Jun 2016
Location: europe
Posts: 386
OK, I didn't take that into account, keeping data in d6/d7.

Here are my suggestions:
1. maybe invert the bcs condition: if the 16-bit case is executed a lot more frequently, it should be the branch-not-taken path *if* you can adjust your code to avoid a bra at the end
2. if you have a spare register, use a move+swap+cmp sequence instead of swap+cmp+2*swap; it's the same speed but 2 bytes shorter, so potentially very slightly faster because you can squeeze 2 more bytes into the icache (not that large on 020/030)
3. moveq #0,d7 should be moved to after .div32no (only the 16-bit case needs it), because the 32-bit div will set all 32 bits anyway, or....
4. moveq #0,d7 should be executed only once before the loop (the code implies that the remainder is always 16-bit, and setting d7 bits 16-31 only once will suffice)
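Point 4 can be illustrated with a small C model. It is only a sketch under assumptions: `move_w` imitates MOVE.W into a data register (only bits 0-15 change), `run_fast_path` is a hypothetical loop that takes the 16-bit path every time, and nothing else is assumed to write the upper half of d7. In the 68020 version the rare DIVUL.L path leaves a remainder smaller than the 16-bit divisor in d7, so its upper word would also stay zero there.

```c
#include <assert.h>
#include <stdint.h>

/* Model of MOVE.W src,Dn: only bits 0-15 of the destination change. */
uint32_t move_w(uint32_t dst, uint16_t src)
{
    return (dst & 0xFFFF0000u) | src;
}

/* Runs n divisions that all take the 16-bit path and returns d7.
   d7_initial plays the role of d7 before the loop. */
uint32_t run_fast_path(uint32_t d7_initial, const uint32_t *dividends,
                       int n, uint16_t divisor)
{
    assert(divisor != 0);
    uint32_t d7 = d7_initial;
    for (int i = 0; i < n; i++) {
        uint32_t q = dividends[i] / divisor;
        assert(q <= 0xFFFFu);            /* 16-bit case only */
        d7 = move_w(d7, (uint16_t)q);    /* move.w d6,d7 */
    }
    return d7;
}
```

If d7 is cleared once before the loop, every later word-sized write leaves a clean 32-bit value; if it starts with stale upper bits, they survive all iterations, which is why the single clear is needed at all.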

Last edited by a/b; 29 April 2021 at 16:10.
Old 29 April 2021, 20:03   #11
saimo
Registered User
 
Join Date: Aug 2010
Location: Italy
Posts: 407
Quote:
Originally Posted by litwr View Post
We can just ignore the branch with DIVUL - it is executed very rarely.
Then I'd go for this:
Code:
     move.l     d6,d7
     swap.w     d6
     cmp.w      d4,d6
     bcs.b      .div32no

     divul.l    d4,d6:d7
     move.w     d6,(a3)
     exg.l      d6,d7
     bra.b      .div32f

.div32no
     divu.w     d4,d7
     clr.l      d6
     move.w     d7,d6
     clr.w      d7
     swap.w     d7
     move.w     d7,(a3)

.div32f
This code gives:
* 32-bit quotient in d6;
* 32-bit remainder in d7, with upper word set to 0;
* 16-bit remainder written to (a3).
Also, it executes some stuff in parallel, saving cycles.

If you don't care about the upper word of d7 being 0:
Code:
     move.l     d6,d7
     swap.w     d6
     cmp.w      d4,d6
     bcs.b      .div32no

     divul.l    d4,d6:d7
     move.w     d6,(a3)
     exg.l      d6,d7
     bra.b      .div32f

.div32no
     divu.w     d4,d7
     clr.l      d6
     move.w     d7,d6
     swap.w     d7
     move.w     d7,(a3)

.div32f
If you don't care about the registers being exchanged (as you mentioned in a post):
Code:
     move.l     d6,d7
     swap.w     d6
     cmp.w      d4,d6
     bcs.b      .div32no

     divul.l    d4,d6:d7
     move.w     d6,(a3)
     bra.b      .div32f

.div32no
     divu.w     d4,d7
     clr.l      d6
     move.w     d7,d6
     swap.w     d7
     move.w     d7,(a3)

.div32f
If you can afford to trash the word at (2,a3):
Code:
     move.l     d6,d7
     swap.w     d6
     cmp.w      d4,d6
     bcs.b      .div32no

     divul.l    d4,d6:d7
     move.w     d6,(a3)
     bra.b      .div32f

.div32no
     divu.w     d4,d7
     clr.l      d6
     move.l     d7,(a3)
     move.w     d7,d6
     swap.w     d7

.div32f
And if you don't care about the remainder in d7:
Code:
     move.l     d6,d7
     swap.w     d6
     cmp.w      d4,d6
     bcs.b      .div32no

     divul.l    d4,d6:d7
     move.w     d6,(a3)
     bra.b      .div32f

.div32no
     divu.w     d4,d7
     clr.l      d6
     move.l     d7,(a3)
     move.w     d7,d6

.div32f
Old 29 April 2021, 21:57   #12
StingRay
move.l #$c0ff33,throat

 
Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 6,651
Quote:
Originally Posted by litwr View Post
It is quite common to skip the .w qualifier. You can find evidence of this in many books.

You should read better books then! There are assemblers which default to .l which in turn breaks such code. And rightfully so as the size specifiers are there for a reason.
Old 29 April 2021, 23:06   #13
a/b
Registered User

 
Join Date: Jun 2016
Location: europe
Posts: 386
Quote:
Originally Posted by saimo View Post
If you can afford to trash the word at (2,a3):
Keep in mind that misaligned longword access is slower. There is no (a3)+, so we don't know by how much a3 is being incremented, but it might be ok though.
Old 29 April 2021, 23:15   #14
saimo
Registered User
 
Join Date: Aug 2010
Location: Italy
Posts: 407
Quote:
Originally Posted by a/b View Post
Keep in mind that misaligned longword access is slower. There is no (a3)+, so we don't know by how much a3 is being incremented, but it might be ok though.
Ah, yes, alignment is crucial there to actually benefit from the long write. Mentioning that didn't even cross my mind as it's a given. Thanks for pointing it out, though.
The code proposed by litwr doesn't have increments, and from the context it looks like (a3) is a variable rather than an item in an array/buffer, so it seems it should be possible to long-align it.
Old 30 April 2021, 14:28   #15
modrobert
old bearded fool

 
Join Date: Jan 2010
Location: Bangkok
Age: 53
Posts: 572
I compiled pi-amiga.asm with 'vasm pi-amiga.asm -Fhunkexe -nosym -o pi-amiga' and ran it on my stock A1200 with fastram, got the following result:

Code:
pi-amiga
number ? calculator v8 (68000)
number of digits (up to 9252)? 800
31415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865132823066470938446095505822317253594081284811174502841027019385211055596446229489549303819644288109756659334461284756482337867831652712019091456485669234603486104543266482133936072602491412737245870066063155881748815209209628292540917153643678925903600113305305488204665213841469519415116094330572703657595919530921861173819326117931051185480744623799627495673518857527248912279381830119491298336733624406566430860213949463952247371907021798609437027705392171762931767523846748184676694051320005681271452635608277857713427577896091736371787214684409012249534301465495853710507922796892589235420199561121290219608640344181598136297747713099605187072113499999983729780499510597317328160963185  3.24
Assuming I understood the last output there, it took 3.24 seconds?


With "m68020" and "MULUopt = 1" I got this result:

Code:
pi-amiga
number ? calculator v8 (68020)
number of digits (up to 9252)? 800
31415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865132823066470938446095505822317253594081284811174502841027019385211055596446229489549303819644288109756659334461284756482337867831652712019091456485669234603486104543266482133936072602491412737245870066063155881748815209209628292540917153643678925903600113305305488204665213841469519415116094330572703657595919530921861173819326117931051185480744623799627495673518857527248912279381830119491298336733624406566430860213949463952247371907021798609437027705392171762931767523846748184676694051320005681271452635608277857713427577896091736371787214684409012249534301465495853710507922796892589235420199561121290219608640344181598136297747713099605187072113499999983729780499510597317328160963185 3.14
Took 3.14 seconds.


Also tried the C example as mentioned in the comments, compiled with SAS/C and 'SCOPTIONS' as follows:

Code:
DEBUG=SYMBOLFLUSH
MATH=STANDARD
CPU=68020
ERRORREXX
OPTIMIZE
LINK
DATAMEMORY=FAST
STRIPDEBUG
Source:

Code:
/* https://crypto.stanford.edu/pbc/notes/pi/code.html */

#include <stdio.h>

int main() {
    int r[2800 + 1];
    int i, k;
    int b, d;
    int c = 0;

    /* r[2800] is read in the first pass, so it must be initialized too */
    for (i = 0; i <= 2800; i++) {
        r[i] = 2000;
    }

    for (k = 2800; k > 0; k -= 14) {
        d = 0;
        i = k;
        for (;;) {
            d += r[i] * 10000;
            b = 2 * i - 1;
            r[i] = d % b;
            d /= b;
            i--;
            if (i == 0) break;
            d *= i;
        }
        printf("%.4d", c + d / 10000);
        c = d % 10000;
    }

    printf("\n");
    return 0;
}

Result:

Code:
> timeit pi
31415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865132823066470938446095505822317253594081284811174502841027019385211055596446229489549303819644288109756659334461284756482337867831652712019091456485669234603486104543266482133936072602491412737245870066063155881748815209209628292540917153643678925903600113305305488204665213841469519415116094330572703657595919530921861173819326117931051185480744623799627495673518857527248912279381830119491298336733624406566430860213949463952247371907021798609437027705392171762931767523846748184676694051320005681271452635608277857713427577896091736371787214684409012249534301465495853710507922796892589235420199561121290219608640344181598136297747713099605187072113499999983729780499510597317328160963185
Elapsed: 7.25s

Also tried vbcc 'vc pi.c -o vc_pi -O2 -DCPU=68020' and the time it took was roughly 14 seconds. The "-DCPU=68020" option didn't seem to have any effect: same time as with 68000, although there was a few bytes' difference in size. When compiling for 68000 with SAS/C the time it took was roughly 13 seconds, in other words almost twice the performance for 68020. I haven't disassembled the resulting binaries (yet), but might do that later out of curiosity.

Last edited by modrobert; 30 April 2021 at 16:45.
Old 30 April 2021, 16:00   #16
vbc
Registered User
 
Join Date: Jan 2021
Location: Germany
Posts: 2
Quote:
Originally Posted by modrobert View Post
Also tried vbcc 'vc pi.c -o vc_pi -O2 -DCPU=68020' and the time it took was roughly 14 seconds. The "-DCPU=68020" option didn't seem to have any effect: same time as with 68000, although there was a few bytes' difference in size. When compiling for 68000 with SAS/C the time it took was roughly 13 seconds, in other words almost twice the performance for 68020. I haven't disassembled the resulting binaries (yet), but might do that later out of curiosity.
-DCPU=68020 is the same as putting "#define CPU 68020" in your code. The option to generate code for 68020 is -cpu=68020.
Old 30 April 2021, 16:33   #17
modrobert
old bearded fool

 
Join Date: Jan 2010
Location: Bangkok
Age: 53
Posts: 572
Quote:
Originally Posted by vbc View Post
-DCPU=68020 is the same as putting "#define CPU 68020" in your code. The option to generate code for 68020 is -cpu=68020.
Aha, thanks. I got this result for vbcc after compiling with 'vc pi.c -o vc68020_pi -O2 -DCPU=68020 -cpu=68020':

Code:
> timeit vc68020_pi
31415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865132823066470938446095505822317253594081284811174502841027019385211055596446229489549303819644288109756659334461284756482337867831652712019091456485669234603486104543266482133936072602491412737245870066063155881748815209209628292540917153643678925903600113305305488204665213841469519415116094330572703657595919530921861173819326117931051185480744623799627495673518857527248912279381830119491298336733624406566430860213949463952247371907021798609437027705392171762931767523846748184676694051320005681271452635608277857713427577896091736371787214684409012249534301465495853710507922796892589235420199561121290219608640344181598136297747713099605187072113499999983729780499510597317328160963185
Elapsed: 6.78s
Old 30 April 2021, 21:06   #18
litwr
Registered User

 
Join Date: Mar 2016
Location: Ozherele
Posts: 229
Quote:
Originally Posted by a/b View Post
Here are my suggestions:
1. maybe invert the bcs condition: if the 16-bit case is executed a lot more frequently, it should be the branch-not-taken path *if* you can adjust your code to avoid a bra at the end
2. if you have a spare register, use a move+swap+cmp sequence instead of swap+cmp+2*swap; it's the same speed but 2 bytes shorter, so potentially very slightly faster because you can squeeze 2 more bytes into the icache (not that large on 020/030)
3. moveq #0,d7 should be moved to after .div32no (only the 16-bit case needs it), because the 32-bit div will set all 32 bits anyway, or....
4. moveq #0,d7 should be executed only once before the loop (the code implies that the remainder is always 16-bit, and setting d7 bits 16-31 only once will suffice)
1. Eureka! How could I have missed this variant?! Maybe I wanted the macro code in one piece? But I split it for the 80386... It is one of those crazy moments when a man searches for his pen while holding it in his hand. Thank you very much. Maybe I was also reluctant to do this optimization because Bcc timings are very unusual on the 68k. For instance, for Bcc.W a taken branch takes longer than a not-taken one. I attach my improved programs to this post. Emulators (which are accurate for the 68000) show a speed boost of more than 1%, the largest gain in years. However, I need help with hardware to check the gain for the 68020 case.
2. This gains too little. The whole main loop is less than 100 bytes, so it can't improve cache usage. Indeed, it is good to make the code a bit shorter, but you chose the slower code with CMP before the first division...
3. You are right. I found this out myself too. But it doesn't affect the performance because it only speeds up code that is executed very rarely.
4. I don't understand this point.

Quote:
Originally Posted by saimo View Post
Then I'd go for this:
Code:
     move.l     d6,d7
     swap.w     d6
     cmp.w      d4,d6
     bcs.b      .div32no

     divul.l    d4,d6:d7
     move.w     d6,(a3)
     exg.l      d6,d7
     bra.b      .div32f

.div32no
     divu.w     d4,d7
     clr.l      d6
     move.w     d7,d6
     clr.w      d7
     swap.w     d7
     move.w     d7,(a3)

.div32f
This code gives:
* 32-bit quotient in d6;
* 32-bit remainder in d7, with upper word set to 0;
* 16-bit remainder written to (a3).
Also, it executes some stuff in parallel, saving cycles.
Thank you. But you also chose the slower code with CMP before the first division. The 68k has an advantage over the x86: the DIVU instructions set the V flag. Why not use this advantage? Finally, your code just replaces MOVEQ and SWAP with MOVE.L and CLR.L, which hardly gives any speed boost.

Quote:
Originally Posted by saimo View Post
If you don't care about the upper word of d7 being 0:
No, it is important. Actually, the algo just subtracts D6 and D7 from D5 after the division. So we may exchange D6 and D7 but D5 is a long value.

Quote:
Originally Posted by saimo View Post
If you can afford to trash the word at (2,a3):
This code corrupts D7.

Quote:
Originally Posted by StingRay View Post
You should read better books then! There are assemblers which default to .l which in turn breaks such code. And rightfully so as the size specifiers are there for a reason.
Maybe. I can just show you a good book where they use just MOVE instead of MOVE.W - Amiga Machine Language by Stefan Dittrich (1989). In a newer book Total! Amiga assembler by Paul Overaa (1995) they wrote "If the object size is not specified then a word size (16 bit) is assumed".
IMHO it is rather a matter of taste. Some people prefer more detailed code, some prefer brevity.
BTW VASM always uses .W by default if the size is omitted.

Quote:
Originally Posted by a/b View Post
Keep in mind that misaligned longword access is slower. There is no (a3)+, so we don't know by how much a3 is being incremented, but it might be ok though.
The code has the move.w -(a3),d0 instruction in its main loop.

Quote:
Originally Posted by modrobert View Post
Code:
pi-amiga
number ? calculator v8 (68000)
number of digits (up to 9252)? 800
3141592653589793...
3.24
Assuming I understood the last output there, it took 3.24 seconds?
Exactly! However, could you provide results for 1000 and/or 3000 digits? That would allow me to compare your results with previous ones. It is better to show only the timing, because the algo is well tested and the digits must be correct.

Quote:
Originally Posted by modrobert View Post
With "m68020" and "MULUopt = 1" I got this result:

Code:
pi-amiga
number ? calculator v8 (68020)
number of digits (up to 9252)? 800
3141592653589793...
3.14
Thank you very much!!! Your results show that MULUopt speeds up the 68020 code! This is a surprise to me because it means that hardware multiplication is slower than a set of equivalent instructions in this case.

Good emulators are quite accurate for the 68000, and they definitely show that MULUopt=1 makes the code slower. However, your results show that MULUopt=1 can be useful at least on the 68020. I suspect that the 68030 is also accelerated by MULUopt=1. Could you please run pi-amiga and pi-amiga1200 on your hardware for 100, 1000, and 3000 digits? A screenshot would be a nice addition that I could insert in the table.

Last edited by BippyM; 01 June 2021 at 18:24.
Old 30 April 2021, 22:42   #19
saimo
Registered User
 
Join Date: Aug 2010
Location: Italy
Posts: 407
Quote:
Thank you. But you also chose the slower code with CMP before the first division. The 68k has an advantage over the x86: the DIVU instructions set the V flag. Why not use this advantage?
To be honest, I only skimmed through the thread and I thought that the code you landed at in post #8 was for some reason the form you were aiming at, so I just applied some optimizations to that. But, yes, I agree that it's better to perform the division first, given that you said that the worst case (overflow set) is very rare.

Quote:
Finally, your code just replaces MOVEQ and SWAP with MOVE.L and CLR.L - it hardly makes any speed boost.
Other than on 68060, swap is slower. The code I proposed also aims to save cycles by allowing the CPU to execute more stuff in parallel thanks to fewer register dependencies (and the long write to memory, which, if I understand correctly, is not an option).
Anyway, on to the divu-first code...

Leaving aside the bvs optimization (that depends on the structure of your code), there's still one thing you can do to avoid the moveq at the beginning of the code, thus saving a little time in the case of the bvs branch:
Code:
     divu.w   d4,d6
     bvc.b    .div32no

     divul.l  d4,d7:d6
     move.w   d7,(a3)
     bra.b    .div32f

.div32no
     move.l   d6,d7 
     clr.w    d6
     eor.l    d6,d7
     swap.w   d6
     move.w   d6,(a3)
     
.div32f
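The register shuffle in the .div32no path of the listing above can be followed step by step in C. This is just a model of the four instructions, under the assumption that DIVU.W left remainder:quotient packed in d6; `split_divu_result` is a made-up name. The point is that EOR with the isolated upper word clears that word in the copy, so no MOVEQ is needed up front.

```c
#include <assert.h>
#include <stdint.h>

/* packed = result of DIVU.W: remainder in bits 16-31, quotient in 0-15. */
void split_divu_result(uint32_t packed, uint32_t *quot, uint32_t *rem)
{
    uint32_t d6 = packed;
    uint32_t d7 = d6;                  /* move.l d6,d7 -> rem:quot    */
    d6 &= 0xFFFF0000u;                 /* clr.w  d6    -> rem:0000    */
    d7 ^= d6;                          /* eor.l  d6,d7 -> 0000:quot   */
    d6 = (d6 << 16) | (d6 >> 16);      /* swap   d6    -> 0000:rem    */
    *quot = d7;
    *rem = d6;
}
```

For example, 100000 / 7 packs as 0x000537CD (remainder 5, quotient 0x37CD = 14285) and splits into 14285 and 5, both with a zero upper word.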
Old 01 May 2021, 08:50   #20
litwr
Registered User

 
Join Date: Mar 2016
Location: Ozherele
Posts: 229
Quote:
Originally Posted by saimo View Post
The code proposed by litwr doesn't have increments, and from the context it looks like (a3) is a variable rather than an item in an array/buffer, so it seems it should be possible to long-align it.
A3 is a pointer to an array element. The element size is 2 bytes. The pointer decreases. So longword access may slow down the algo.
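This layout can be sketched in C by adapting the program modrobert posted above. It is a hypothetical re-implementation, not a translation of the assembly: `pi_spigot`, `PI_TERMS` and the pointer `p` (standing in for A3) are illustrative names, and the buffer is fixed at 800 digits. Remainders are stored as 16-bit values because each one is smaller than the divisor b = 2i - 1, which stays far below 65536 here.

```c
#include <assert.h>
#include <stdint.h>

#define PI_TERMS 2800   /* 14 terms per group of 4 digits -> 800 digits */

/* Writes 800 digits of pi (no decimal point) plus a NUL into out. */
void pi_spigot(char *out)
{
    static uint16_t r[PI_TERMS + 1];    /* 2-byte remainders */
    uint32_t carry = 0;

    for (int i = 0; i <= PI_TERMS; i++)
        r[i] = 2000;

    for (int k = PI_TERMS; k > 0; k -= 14) {
        uint32_t d = 0;
        uint32_t i = (uint32_t)k;
        uint16_t *p = &r[k];            /* descending pointer, like A3 */
        for (;;) {
            d += (uint32_t)*p * 10000u;
            uint32_t b = 2 * i - 1;
            assert(b <= 0xFFFFu);       /* divisor fits in 16 bits */
            *p = (uint16_t)(d % b);     /* r[i] <- d % b */
            d /= b;
            p--;
            i--;
            if (i == 0)
                break;
            d *= i;
        }
        uint32_t group = carry + d / 10000u;
        carry = d % 10000u;
        for (int j = 3; j >= 0; j--) {  /* emit 4 decimal digits */
            out[j] = (char)('0' + group % 10);
            group /= 10;
        }
        out += 4;
    }
    *out = '\0';
}
```

Each store goes to a 2-byte slot one element below the previous one, so a longword store at the same address would overwrite the neighbouring remainder and would be misaligned on every other element.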

Quote:
Originally Posted by saimo View Post
To be honest, I only skimmed through the thread and I thought that the code you landed at in post #8 was for some reason the form you were aiming at, so I just applied some optimizations to that. But, yes, I agree that it's better to perform the division first, given that you said that the worst case (overflow set) is very rare.
I showed the CMP-first version only because a/b asked for it. Sorry I didn't give more information about it earlier.

Quote:
Originally Posted by saimo View Post
Other than on 68060, swap is slower. The code I proposed also aims to save cycles by allowing the CPU to execute more stuff in parallel thanks to fewer register dependencies (and the long write to memory, which, if I understand correctly, is not an option).
Thank you. I didn't know this. However, the CLR vs SWAP timing is rather odd. The 68000 executes SWAP faster than CLR, but the 68020 executes CLR faster than SWAP! It is interesting that reducing instruction dependencies in the code may speed up execution on the 68020 and higher 68k CPUs. Actually, I hadn't thought about it. But I have just checked the code and IMHO it is difficult to improve it this way. The main loop is short: only 17 lines (or 25 if MULUopt=1) between .l2 and BCC .l2; anyone can check it.

Quote:
Originally Posted by saimo View Post
Anyway, on to the divu-first code...
Your code again corrupts D7.

Quote:
Originally Posted by saimo View Post
Leaving aside the bvs optimization (that depends on the structure of your code), there's still one thing you can do to avoid the moveq at the beginning of the code, thus saving a little time in the case of the bvs branch
My BVS optimization is in the last attachment. It is independent of the other optimizations, so it is not important here, but it is short, so I can show it here too:

Code:
         divu.w d4,d6
         bvs.s .longdiv

         moveq.l #0,d7
         move.w d6,d7
         clr.w d6
         swap d6
         move.w d6,(a3)