Optimizing the 68020+ 32-bit math

litwr · 26 April 2021, 19:17

I have a project, a calculator for the number pi. I recently found out that my 68020 code is actually slower than my 68000 code.

I was misled by emulators (FS-UAE, Hatari), which show the opposite: that the 68020 code is faster. It seems that DIVUL takes far fewer cycles in emulators than on real hardware.

So I am trying to write 68020 code that is faster than the 68000 code.
The code for the 68020/30 is quite short:

Code:

     divul.l d4,d7:d6
     move d7,(a3)

The equivalent code for the 68000 is longer (but faster!):

Code:

     moveq.l #0,d7
     divu d4,d6
     bvc .div32no\@

     swap d6
     move d6,d7
     divu d4,d7
     swap d7
     move d7,d6
     swap d6
     divu d4,d6
.div32no\@
     move d6,d7
     clr d6
     swap d6
     move d6,(a3)

The code for the 68000 is faster because the branch to .div32no\@ is taken almost always in the pi-spigot algo.
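
For readers who do not read 68k assembly, here is a minimal C sketch of what the 68000 sequence above computes. It assumes a nonzero divisor that fits in 16 bits. The names `divu_w` and `div32by16` are hypothetical and exist only for this illustration; they are not taken from the project.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of DIVU.W: 32-bit dividend, 16-bit divisor, 16-bit quotient.
   Returns false when the quotient does not fit in 16 bits (the V flag case);
   the outputs are then left untouched, like the operands of the real instruction. */
bool divu_w(uint32_t n, uint16_t d, uint16_t *q, uint16_t *r)
{
    if ((n >> 16) >= d)
        return false;
    *q = (uint16_t)(n / d);
    *r = (uint16_t)(n % d);
    return true;
}

/* 32/16 division giving a 32-bit quotient and a 16-bit remainder,
   built only from divides with a 16-bit quotient. */
void div32by16(uint32_t n, uint16_t d, uint32_t *quot, uint16_t *rem)
{
    uint16_t q, r, qhi, rhi, qlo, rlo;

    assert(d != 0);
    if (divu_w(n, d, &q, &r)) {          /* fast path: a single divide */
        *quot = q;
        *rem = r;
        return;
    }
    /* Slow path: schoolbook division, one 16-bit "digit" at a time. */
    (void)divu_w(n >> 16, d, &qhi, &rhi);                    /* high word; cannot overflow */
    (void)divu_w(((uint32_t)rhi << 16) | (n & 0xFFFFu), d,   /* rhi < d; cannot overflow */
                 &qlo, &rlo);
    *quot = ((uint32_t)qhi << 16) | qlo;
    *rem = rlo;
}
```

The fast path corresponds to the taken `bvc`; the slow path corresponds to the three SWAP/DIVU steps between the branch and the label.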

My whole project is available on github https://github.com/litwr2/rosetta-pi...e/master/amiga

Maybe someone can help me write better code for the 68020/30? Any hints are welcome. Thank you in advance. It would be interesting to make the 68020/30 code faster than the 68000 code. However, so far I have only managed to make the 68020/30 code slightly shorter.

Code:

     moveq.l #0,d7
     divu d4,d6
     bvc .div32no\@

     divul d4,d7:d6
     move d7,(a3)     ;r[i] <- d%b
     bra .div32f\@

.div32no\@
     move d6,d7
     clr d6
     swap d6
     move d6,(a3)     ;r[i] <- d%b
.div32f\@

Note that D6 and D7 are exchanged here; that is OK.

Thomas Richter · 27 April 2021, 10:43

Quote:

Originally Posted by litwr

The equivalent code for the 68000 is longer (but faster!):

Code:

     moveq.l #0,d7
     divu d4,d6
     bvc .div32no\@

     swap d6
     move d6,d7
     divu d4,d7
     swap d7
     move d7,d6
     swap d6
     divu d4,d6
.div32no\@
     move d6,d7
     clr d6
     swap d6
     move d6,(a3)

That is NOT equivalent code. A DIVU.L dx,da:db is a 64 by 32 bit division. That's not what your code is doing. To emulate that on a machine without a 32-bit quotient, you need something like Algorithm D. And yes, that would be slower than a DIVU.L.

a/b · 28 April 2021, 02:32

Well, he's not using divu.L, it's divuL(.L) which is 32/32 -> 32:32.
I didn't look too closely at this because... Don't really want to start any flame wars but that code nearly gave me cancer. *Every* single instruction that should have a size specifier (because of multiple valid options), and there's almost a dozen of them in that short clip, doesn't have it. And moveq, which is *always* a longword operation and doesn't ever need it, has it.
When I see that I just check out. Having to decipher M68K code because it's butchered so badly is not my idea of fun. I mean, it's OK when it's some fresh guy who's still learning the basics and does all kinds of stuff, we all did that at some point, but this... Nothing personal, I just don't understand why anyone would write such ambiguous code: it's more error prone, less portable, others have to waste time figuring out what you wanted to do, ...

Thomas Richter · 28 April 2021, 10:54

Well, it doesn't really matter too much. The major problem here is the 32-bit quotient, which requires a more complicated algorithm than a cascaded divu, which has a 16-bit quotient. A full 32/16 division with a 32-bit remainder and a 32-bit quotient is easy with divu, but a full 32/32 division requires "Algorithm D" (or some other division algorithm) and is more complex than what is shown here.
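
As an illustration of "some other division algorithm", here is a minimal C sketch of restoring shift-and-subtract division for a full 32/32 case. This is not Algorithm D; it is the simplest alternative, shown only to make the extra cost visible. The name `div32by32` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Restoring shift-and-subtract division: 32-bit dividend, 32-bit divisor.
   It produces one quotient bit per iteration, so it always runs 32 iterations. */
void div32by32(uint32_t n, uint32_t d, uint32_t *quot, uint32_t *rem)
{
    uint32_t q = 0;
    uint64_t r = 0;   /* 64 bits wide so that (r << 1) cannot lose its top bit */
    int bit;

    assert(d != 0);
    for (bit = 31; bit >= 0; --bit) {
        r = (r << 1) | ((n >> bit) & 1u);   /* bring down the next dividend bit */
        if (r >= d) {                       /* divisor fits: subtract, set the quotient bit */
            r -= d;
            q |= (uint32_t)1 << bit;
        }
    }
    *quot = q;
    *rem = (uint32_t)r;
}
```

A loop like this is far longer than a pair of DIVU.W instructions, which is why the distinction between a 16-bit and a 32-bit divisor matters for the 68000.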

litwr · 29 April 2021, 08:36

Quote:

Originally Posted by Thomas Richter

That is NOT equivalent code. A DIVU.L dx,da:db is a 64 by 32 bit division. That's not what your code is doing. To emulate that on a machine without a 32-bit quotient, you need something like Algorithm D. And yes, that would be slower than a DIVU.L.

Well, it doesn't really matter too much. The major problem here is the 32-bit quotient, which requires a more complicated algorithm than a cascaded divu, which has a 16-bit quotient. A full 32/16 division with a 32-bit remainder and a 32-bit quotient is easy with divu, but a full 32/32 division requires "Algorithm D" (or some other division algorithm) and is more complex than what is shown here.

Sorry, I failed to mention an important detail. It is all about a 32/16 division with a 32-bit quotient and a 16-bit remainder (32q:16r). So the 68000 code is actually equivalent to the shorter 68020/30 code for this case. Indeed, as a/b noted, the DIVUL instruction is used, so in any case we have a 32/32 division instruction, not 64/32.

Quote:

Originally Posted by a/b

Well, he's not using divu.L, it's divuL(.L) which is 32/32 -> 32:32.
I didn't look too closely at this because... Don't really want to start any flame wars but that code nearly gave me cancer. *Every* single instruction that should have a size specifier (because of multiple valid options), and there's almost a dozen of them in that short clip, doesn't have it. And moveq, which is *always* a longword operation and doesn't ever need it, has it.
When I see that I just check out. Having to decipher M68K code because it's butchered so badly is not my idea of fun. I mean, it's OK when it's some fresh guy who's still learning the basics and does all kinds of stuff, we all did that at some point, but this... Nothing personal, I just don't understand why anyone would write such ambiguous code: it's more error prone, less portable, others have to waste time figuring out what you wanted to do, ...

It is impossible to please everyone. It is quite common to omit the .w qualifier; you can find examples of this in many books. However, I can try to write the code to suit your taste.
68020/30:

Code:

     divul.l d4,d7:d6
     move.w d7,(a3)

68000:

Code:

     moveq.l #0,d7
     divu.w d4,d6
     bvc .div32no

     swap d6
     move.w d6,d7
     divu.w d4,d7
     swap d7
     move.w d7,d6
     swap d6
     divu.w d4,d6
.div32no
     move.w d6,d7
     clr.w d6
     swap d6
     move.w d6,(a3)

68020/30 (improved):

Code:

     moveq.l #0,d7
     divu.w d4,d6
     bvc .div32no

     divul.l d4,d7:d6
     move.w d7,(a3)     ;r[i] <- d%b
     bra .div32f

.div32no
     move.w d6,d7
     clr.w d6
     swap d6
     move.w d6,(a3)     ;r[i] <- d%b
.div32f

I hope it is OK for you now. And I can assure you that if your text had been less poisonous, the result would have been the same.

However, I am almost sure that it is impossible to optimize this 68020/30 division any better than in my improved code.

Interestingly, I was able to optimize the 80386 code for the same case.
The initial 80386 code:

Code:

         div esi
         mov [si+ra+1],dx

The optimized 80386 code (it gives an even larger gain on the 80486):

Code:

         rol eax,16
         cmp ax,si
         jnc .lx

         mov dx,ax
         shr eax,16
         div si
.lxc:    mov [si+ra+1],dx

Code:

.lx:     rol eax,16
         div esi
         jmp .lxc

Indeed, it is a rather complex optimization, but I hoped that the Amiga experts might know some 68k tricks which I missed.

a/b · 29 April 2021, 12:20

Why didn't you use "the same" algorithm, something like (code not tested)?

Code:

	move.l	d4,d0
	swap	d0
	cmp.l	d0,d6		; dividend >= (divisor<<16)?
	bhs.b	.32bit

.16bit	divu.w	d4,d6
	swap	d6
	move.w	d6,(a3)
	bra.b	.done		; do not fall through into the 32-bit path

.32bit	divul.l	d4,d7:d6
	move.w	d7,(a3)
.done

edit: To keep it simple: do a faster (32/16) div whenever possible *without* penalty of failing (it's still a slooow div), extra check is compensated for.
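
The pre-check idea can be sketched in C as follows. This is a minimal model, assuming a nonzero divisor that fits in 16 bits; `fits_16bit_quotient` and `div_checked` are hypothetical names used only here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when n / d fits in 16 bits, that is, when DIVU.W would not overflow.
   n < (d << 16) is the same test as (n >> 16) < d, because the low
   16 bits of (d << 16) are all zero. */
bool fits_16bit_quotient(uint32_t n, uint16_t d)
{
    return n < ((uint32_t)d << 16);
}

/* Use the cheap divide when it is safe and the full divide otherwise.
   Both paths give the same numbers; only their cost differs on real hardware. */
void div_checked(uint32_t n, uint16_t d,
                 uint32_t *quot, uint16_t *rem, bool *used_fast)
{
    assert(d != 0);
    *used_fast = fits_16bit_quotient(n, d);
    if (*used_fast) {
        *quot = (uint16_t)(n / d);   /* stands in for DIVU.W  */
        *rem  = (uint16_t)(n % d);
    } else {
        *quot = n / d;               /* stands in for DIVUL.L */
        *rem  = (uint16_t)(n % d);
    }
}
```

The trade-off is that the check costs a few cycles on every division, while a failed DIVU.W costs cycles only when it overflows, so which one wins depends on how often the overflow case occurs.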

Thomas Richter · 29 April 2021, 14:09

Precisely. It depends most likely on the distribution of the input which version is faster.

litwr · 29 April 2021, 14:56

Quote:

Originally Posted by a/b

Why didn't you use "the same" algorithm, something like (code not tested)?

Code:

	move.l	d4,d0
	swap	d0
	cmp.l	d0,d6		; dividend >= (divisor<<16)?
	bhs.b	.32bit

.16bit	divu.w	d4,d6
	swap	d6
	move.w	d6,(a3)
	bra.b	.done		; do not fall through into the 32-bit path

.32bit	divul.l	d4,d7:d6
	move.w	d7,(a3)
.done

edit: To keep it simple: do a faster (32/16) div whenever possible *without* penalty of failing (it's still a slooow div), extra check is compensated for.

Thank you! However, it is not that easy, because d6 and d7 must hold the quotient and the remainder as full 32-bit values. So we actually need a sequence of MOVE, CLR, SWAP before .32bit. My code, which is equivalent to the 386 code, is the following:

Code:

     moveq.l #0,d7
     swap d6
     cmp.w d4,d6
     bcs .div32no

     swap d6
     divul.l d4,d7:d6
     move.w d7,(a3) 
     bra .div32f

.div32no
     swap d6
     divu.w d4,d6
     move.w d6,d7
     clr.w d6
     swap d6 
     move.w d6,(a3) 
.div32f

This makes 2 extra SWAPs.

The optimization for the 80386 saves only 2 or 3 cycles, and for the 486 only 4 or 5, so it is really very complex. It is sad that neither the 80386 nor the 68020 is cycle-exact in popular emulators. Moreover, the emulators are especially inaccurate for DIVU.L and DIVUL.

litwr · 29 April 2021, 15:02

Quote:

Originally Posted by Thomas Richter

Precisely. It depends most likely on the distribution of the input which version is faster.

We can just ignore the branch with DIVUL - it is executed very rarely.
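
One way to check how rare the overflow case is, without real hardware, is to count it in a C model of the spigot. The sketch below is an assumption-laden illustration: it follows the C program posted in this thread (2800 array elements, base 10000), not the assembly project itself, and `spigot_count` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

#define N 2800   /* array length of the C program in this thread (800 digits) */

/* Runs the spigot and counts how often d / b would overflow a 16-bit quotient.
   Returns the first four-digit group produced, as a sanity check. */
int spigot_count(unsigned long *fast, unsigned long *slow)
{
    static uint32_t r[N + 1];
    uint32_t d, b, c = 0;
    int i, k, first = -1;

    assert(fast && slow);
    *fast = *slow = 0;
    for (i = 0; i < N; i++)
        r[i] = 2000;
    r[N] = 0;

    for (k = N; k > 0; k -= 14) {
        d = 0;
        i = k;
        for (;;) {
            d += r[i] * 10000u;
            b = 2u * (uint32_t)i - 1u;
            if ((d >> 16) < b)
                ++*fast;     /* DIVU.W would succeed */
            else
                ++*slow;     /* DIVU.W would set V: the long division is needed */
            r[i] = d % b;
            d /= b;
            if (--i == 0)
                break;
            d *= (uint32_t)i;
        }
        if (first < 0)
            first = (int)(c + d / 10000u);
        c = d % 10000u;
    }
    return first;
}
```

Calling `spigot_count(&fast, &slow)` and comparing the two counters gives the ratio between the two paths for this particular model; the assembly version may differ in detail.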

a/b · 29 April 2021, 16:00

OK, I didn't take that into account, keeping data in d6/d7.

Here are my suggestions:
1. maybe invert the bcs condition: if the 16-bit case is executed a lot more frequently, it should be the branch-not-taken path *if* you can adjust your code to avoid a bra at the end
2. if you have a spare register, use a move+swap+cmp sequence instead of swap+cmp+2*swap; it's the same speed but 2 bytes shorter, so potentially very slightly faster because you can squeeze 2 more bytes into the icache (which is not that large on the 020/030)
3. moveq #0,d7 should be moved to after .div32no (only the 16-bit case needs it), because the 32-bit div will set all 32 bits anyway, or....
4. moveq #0,d7 should be executed only once before the loop (the code implies that the remainder is always 16-bit, so setting d7 bits 16-31 only once will suffice)

saimo · 29 April 2021, 20:03

Quote:

Originally Posted by litwr

We can just ignore the branch with DIVUL - it is executed very rarely.

Then I'd go for this:

Code:

     move.l     d6,d7
     swap.w     d6
     cmp.w      d4,d6
     bcs.b      .div32no

     divul.l    d4,d6:d7
     move.w     d6,(a3)
     exg.l      d6,d7
     bra.b      .div32f

.div32no
     divu.w     d4,d7
     clr.l      d6
     move.w     d7,d6
     clr.w      d7
     swap.w     d7
     move.w     d7,(a3)

.div32f

This code gives:
* 32-bit quotient in d6;
* 32-bit remainder in d7, with upper word set to 0;
* 16-bit remainder written to (a3).
Also, it executes some stuff in parallel, saving cycles.

If you don't care about the upper word of d7 being 0:

Code:

     move.l     d6,d7
     swap.w     d6
     cmp.w      d4,d6
     bcs.b      .div32no

     divul.l    d4,d6:d7
     move.w     d6,(a3)
     exg.l      d6,d7
     bra.b      .div32f

.div32no
     divu.w     d4,d7
     clr.l      d6
     move.w     d7,d6
     swap.w     d7
     move.w     d7,(a3)

.div32f

If you don't care about the registers being exchanged (as you mentioned in a post):

Code:

     move.l     d6,d7
     swap.w     d6
     cmp.w      d4,d6
     bcs.b      .div32no

     divul.l    d4,d6:d7
     move.w     d6,(a3)
     bra.b      .div32f

.div32no
     divu.w     d4,d7
     clr.l      d6
     move.w     d7,d6
     swap.w     d7
     move.w     d7,(a3)

.div32f

If you can afford to trash the word at (2,a3):

Code:

     move.l     d6,d7
     swap.w     d6
     cmp.w      d4,d6
     bcs.b      .div32no

     divul.l    d4,d6:d7
     move.w     d6,(a3)
     bra.b      .div32f

.div32no
     divu.w     d4,d7
     clr.l      d6
     move.l     d7,(a3)
     move.w     d7,d6
     swap.w     d7

.div32f

And if you don't care about the remainder in d7:

Code:

     move.l     d6,d7
     swap.w     d6
     cmp.w      d4,d6
     bcs.b      .div32no

     divul.l    d4,d6:d7
     move.w     d6,(a3)
     bra.b      .div32f

.div32no
     divu.w     d4,d7
     clr.l      d6
     move.l     d7,(a3)
     move.w     d7,d6

.div32f

StingRay · 29 April 2021, 21:57

Quote:

Originally Posted by litwr

It is quite common to skip .w qualifier. You can find proofs for this in many books.

You should read better books then! There are assemblers which default to .l which in turn breaks such code. And rightfully so as the size specifiers are there for a reason.

a/b · 29 April 2021, 23:06

Quote:

Originally Posted by saimo

If you can afford to trash the word at (2,a3):

Keep in mind that misaligned longword access is slower. There's no (a3)+, so we don't know by how much a3 is being incremented, but it might be OK though.

saimo · 29 April 2021, 23:15

Quote:

Originally Posted by a/b

Keep in mind that misaligned longword access is slower. There's no (a3)+, so we don't know by how much a3 is being incremented, but it might be OK though.

Ah, yes, alignment is crucial there to actually benefit from the long write. Mentioning that didn't even cross my mind as it's a given. Thanks for pointing it out, though.
The code proposed by litwr doesn't have increments, and from the context it looks like (a3) is a variable rather than an item in an array/buffer, so it seems it should be possible to long-align it.

modrobert · 30 April 2021, 14:28

I compiled pi-amiga.asm with 'vasm pi-amiga.asm -Fhunkexe -nosym -o pi-amiga' and ran it on my stock A1200 with fastram, got the following result:

Code:

pi-amiga
number ? calculator v8 (68000)
number of digits (up to 9252)? 800
31415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865132823066470938446095505822317253594081284811174502841027019385211055596446229489549303819644288109756659334461284756482337867831652712019091456485669234603486104543266482133936072602491412737245870066063155881748815209209628292540917153643678925903600113305305488204665213841469519415116094330572703657595919530921861173819326117931051185480744623799627495673518857527248912279381830119491298336733624406566430860213949463952247371907021798609437027705392171762931767523846748184676694051320005681271452635608277857713427577896091736371787214684409012249534301465495853710507922796892589235420199561121290219608640344181598136297747713099605187072113499999983729780499510597317328160963185  3.24

Assuming I understood the last output there, it took 3.24 seconds?

With "m68020" and "MULUopt = 1" I got this result:

Code:

pi-amiga
number ? calculator v8 (68020)
number of digits (up to 9252)? 800
31415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865132823066470938446095505822317253594081284811174502841027019385211055596446229489549303819644288109756659334461284756482337867831652712019091456485669234603486104543266482133936072602491412737245870066063155881748815209209628292540917153643678925903600113305305488204665213841469519415116094330572703657595919530921861173819326117931051185480744623799627495673518857527248912279381830119491298336733624406566430860213949463952247371907021798609437027705392171762931767523846748184676694051320005681271452635608277857713427577896091736371787214684409012249534301465495853710507922796892589235420199561121290219608640344181598136297747713099605187072113499999983729780499510597317328160963185 3.14

Took 3.14 seconds.

Also tried the C example as mentioned in the comments, compiled with SAS/C and 'SCOPTIONS' as follows:

Code:

DEBUG=SYMBOLFLUSH
MATH=STANDARD
CPU=68020
ERRORREXX
OPTIMIZE
LINK
DATAMEMORY=FAST
STRIPDEBUG

Source:

Code:

/* https://crypto.stanford.edu/pbc/notes/pi/code.html */

#include <stdio.h>

int main() {
    int r[2800 + 1];
    int i, k;
    int b, d;
    int c = 0;

    for (i = 0; i < 2800; i++) {
        r[i] = 2000;
    }
    r[i] = 0;   /* r[2800] is read on the first pass, so it must be initialised */

    for (k = 2800; k > 0; k -= 14) {
        d = 0;
        i = k;
        for (;;) {
            d += r[i] * 10000;
            b = 2 * i - 1;
            r[i] = d % b;
            d /= b;
            i--;
            if (i == 0) break;
            d *= i;
        }
        printf("%.4d", c + d / 10000);
        c = d % 10000;
    }

    printf("\n");
    return 0;
}

Result:

Code:

> timeit pi
31415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865132823066470938446095505822317253594081284811174502841027019385211055596446229489549303819644288109756659334461284756482337867831652712019091456485669234603486104543266482133936072602491412737245870066063155881748815209209628292540917153643678925903600113305305488204665213841469519415116094330572703657595919530921861173819326117931051185480744623799627495673518857527248912279381830119491298336733624406566430860213949463952247371907021798609437027705392171762931767523846748184676694051320005681271452635608277857713427577896091736371787214684409012249534301465495853710507922796892589235420199561121290219608640344181598136297747713099605187072113499999983729780499510597317328160963185
Elapsed: 7.25s

Also tried vbcc 'vc pi.c -o vc_pi -O2 -DCPU=68020' and the time it took was roughly 14 seconds; the "-DCPU=68020" option didn't seem to have any effect, same time as with 68000 although with a few bytes' difference in size. When compiling for 68000 with SAS/C the time it took was roughly 13 seconds, in other words almost twice the performance for 68020. I haven't disassembled the resulting binaries (yet), but might do that later out of curiosity.

vbc · 30 April 2021, 16:00

Quote:

Originally Posted by modrobert

Also tried vbcc 'vc pi.c -o vc_pi -O2 -DCPU=68020' and the time it took was roughly 14 seconds; the "-DCPU=68020" option didn't seem to have any effect, same time as with 68000 although with a few bytes' difference in size. When compiling for 68000 with SAS/C the time it took was roughly 13 seconds, in other words almost twice the performance for 68020. I haven't disassembled the resulting binaries (yet), but might do that later out of curiosity.

-DCPU=68020 is the same as putting "#define CPU 68020" in your code. The option to generate code for 68020 is -cpu=68020.

modrobert · 30 April 2021, 16:33

Quote:

Originally Posted by vbc

-DCPU=68020 is the same as putting "#define CPU 68020" in your code. The option to generate code for 68020 is -cpu=68020.

Aha, thanks. I got this result for vbcc after compiling with 'vc pi.c -o vc68020_pi -O2 -DCPU=68020 -cpu=68020':

Code:

> timeit vc68020_pi
31415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865132823066470938446095505822317253594081284811174502841027019385211055596446229489549303819644288109756659334461284756482337867831652712019091456485669234603486104543266482133936072602491412737245870066063155881748815209209628292540917153643678925903600113305305488204665213841469519415116094330572703657595919530921861173819326117931051185480744623799627495673518857527248912279381830119491298336733624406566430860213949463952247371907021798609437027705392171762931767523846748184676694051320005681271452635608277857713427577896091736371787214684409012249534301465495853710507922796892589235420199561121290219608640344181598136297747713099605187072113499999983729780499510597317328160963185
Elapsed: 6.78s

litwr · 30 April 2021, 21:06

Quote:

Originally Posted by a/b

Here are my suggestions:
1. maybe invert the bcs condition: if the 16-bit case is executed a lot more frequently, it should be the branch-not-taken path *if* you can adjust your code to avoid a bra at the end
2. if you have a spare register, use a move+swap+cmp sequence instead of swap+cmp+2*swap; it's the same speed but 2 bytes shorter, so potentially very slightly faster because you can squeeze 2 more bytes into the icache (which is not that large on the 020/030)
3. moveq #0,d7 should be moved to after .div32no (only the 16-bit case needs it), because the 32-bit div will set all 32 bits anyway, or....
4. moveq #0,d7 should be executed only once before the loop (the code implies that the remainder is always 16-bit, so setting d7 bits 16-31 only once will suffice)

1. Eureka! How could I have missed this variant?! Maybe I wanted the macro code in one piece? But I split it for the 80386... It is one of those crazy moments when a man discovers that the pen he lost is in his hand.

Thank you very much. Maybe I was also reluctant to do this optimization because Bcc timings are very unusual on the 68k. For instance, on the 68000 a Bcc.W that is not taken takes longer than one that is taken. I attach my improved programs to this post. Emulators (which are accurate for the 68000) show a speed boost of more than 1%, the largest gain in years. However, I need help with hardware to check the gain for the 68020 case.
2. This gives too little. The whole main loop is less than 100 bytes, so it can't help cache usage. Indeed it is good to make the code a bit shorter, but you chose the slower code with CMP before the first division...
3. You are right. I found this out myself too. But it doesn't affect the performance because it only speeds up code which is executed very rarely.
4. I don't understand this point.

Quote:

Originally Posted by saimo

Then I'd go for this:

Code:

     move.l     d6,d7
     swap.w     d6
     cmp.w      d4,d6
     bcs.b      .div32no

     divul.l    d4,d6:d7
     move.w     d6,(a3)
     exg.l      d6,d7
     bra.b      .div32f

.div32no
     divu.w     d4,d7
     clr.l      d6
     move.w     d7,d6
     clr.w      d7
     swap.w     d7
     move.w     d7,(a3)

.div32f

This code gives:
* 32-bit quotient in d6;
* 32-bit remainder in d7, with upper word set to 0;
* 16-bit remainder written to (a3).
Also, it executes some stuff in parallel, saving cycles.

Thank you. But you also chose the slower code with CMP before the first division. The 68k has an advantage over the x86: the DIVU instructions set the V flag. Why not use this advantage?

Finally, your code just replaces MOVEQ and SWAP with MOVE.L and CLR.L - it hardly makes any speed boost.

Quote:

Originally Posted by saimo

If you don't care about the upper word of d7 being 0:

No, it is important. Actually, the algo just subtracts D6 and D7 from D5 after the division. So we may exchange D6 and D7 but D5 is a long value.

Quote:

Originally Posted by saimo

If you can afford to trash the word at (2,a3):

This code corrupts D7.

Quote:

Originally Posted by StingRay

You should read better books then! There are assemblers which default to .l which in turn breaks such code. And rightfully so as the size specifiers are there for a reason.

Maybe. I can point you to a good book that uses plain MOVE instead of MOVE.W: Amiga Machine Language by Stefan Dittrich (1989). A newer book, Total! Amiga assembler by Paul Overaa (1995), says: "If the object size is not specified then a word size (16 bit) is assumed".
IMHO it is rather a matter of taste. Some people prefer more detailed code, some prefer brevity.
BTW VASM always uses .W by default if the size is omitted.

Quote:

Originally Posted by a/b

Keep in mind that misaligned longword access is slower. There's no (a3)+, so we don't know by how much a3 is being incremented, but it might be OK though.

The code has the move.w -(a3),d0 instruction in its main loop.

Quote:

Originally Posted by modrobert

Code:

pi-amiga
number ? calculator v8 (68000)
number of digits (up to 9252)? 800
3141592653589793...
3.24

Assuming I understood the last output there, it took 3.24 seconds?

Exactly! However, would you provide us with results for 1000 and/or 3000 digits? That would allow me to compare your results with previous ones. It is better to show only the timing, because the algo is well tested and the digits must be correct.

Quote:

Originally Posted by modrobert

With "m68020" and "MULUopt = 1" I got this result:

Code:

pi-amiga
number ? calculator v8 (68020)
number of digits (up to 9252)? 800
3141592653589793...
3.14

Thank you very much!!! Your results show that MULUopt speeds up the 68020 code! This is a surprise to me because it means that hardware multiplication is slower than a set of equivalent instructions in this case.

Good emulators are quite accurate for the 68000 and they definitely show that MULUopt=1 makes the code slower. However, your results show that MULUopt=1 can be useful at least on the 68020. I would guess that the 68030 is also accelerated by MULUopt=1. Would you please run pi-amiga and pi-amiga1200 on your hardware for 100, 1000, and 3000 digits? A screenshot would be a nice addition that I could insert into the table.

saimo · 30 April 2021, 22:42

Quote:

Thank you. But you also chose the slower code with CMP before the first division. The 68k has an advantage over the x86: the DIVU instructions set the V flag. Why not use this advantage?

To be honest, I only skimmed through the thread and I thought that the code you landed at in post #8 was for some reason the form you were aiming at, so I just applied some optimizations to that.

But, yes, I agree that it's better to perform the division first, given that you said that the worst case (overflow set) is very rare.

Quote:

Finally, your code just replaces MOVEQ and SWAP with MOVE.L and CLR.L - it hardly makes any speed boost.

Other than on 68060, swap is slower. The code I proposed also aims to save cycles by allowing the CPU to execute more stuff in parallel thanks to fewer register dependencies (and the long write to memory, which, if I understand correctly, is not an option).
Anyway, on to the divu-first code...

Leaving aside the bvs optimization (that depends on the structure of your code), there's still one thing you can do to avoid the moveq at the beginning of the code, thus saving a little time in the case of the bvs branch:

Code:

     divu.w   d4,d6
     bvc.b    .div32no

     divul.l  d4,d7:d6
     move.w   d7,(a3)
     bra.b    .div32f

.div32no
     move.l   d6,d7 
     clr.w    d6
     eor.l    d6,d7
     swap.w   d6
     move.w   d6,(a3)
     
.div32f
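
The register shuffle in the .div32no path above can be followed step by step in C. This is a minimal sketch under the assumption that the input is the register exactly as DIVU.W leaves it (remainder in the upper word, quotient in the lower word); `split_divu_result` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Input: remainder in the upper word, quotient in the lower word.
   Output: both values zero-extended to 32 bits, without a separate MOVEQ #0. */
void split_divu_result(uint32_t packed, uint32_t *quot, uint32_t *rem)
{
    uint32_t d6 = packed;
    uint32_t d7;

    assert(quot && rem);
    d7 = d6;                        /* move.l d6,d7 : d7 = rem:quot  */
    d6 &= 0xFFFF0000u;              /* clr.w  d6    : d6 = rem:0000  */
    d7 ^= d6;                       /* eor.l  d6,d7 : d7 = 0000:quot */
    d6 = (d6 << 16) | (d6 >> 16);   /* swap   d6    : d6 = 0000:rem  */
    *quot = d7;
    *rem = d6;
}
```

The EOR clears the upper word of the quotient copy because both registers hold the same remainder bits there, so the explicit zeroing of d7 before the division is not needed.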

litwr · 01 May 2021, 08:50

Quote:

Originally Posted by saimo

The code proposed by litwr doesn't have increments, and from the context it looks like (a3) is a variable rather than an item in an array/buffer, so it seems it should be possible to long-align it.

A3 is a pointer to an array element. The element size is 2 bytes. The pointer is decremented, so longword access may slow down the algo.

Quote:

Originally Posted by saimo

To be honest, I only skimmed through the thread and I thought that the code you landed at in post #8 was for some reason the form you were aiming at, so I just applied some optimizations to that.

But, yes, I agree that it's better to perform the division first, given that you said that the worst case (overflow set) is very rare.

I showed the CMP-first version only at a/b's request. Sorry I didn't give more information about it earlier.

Quote:

Originally Posted by saimo

Other than on 68060, swap is slower. The code I proposed also aims to save cycles by allowing the CPU to execute more stuff in parallel thanks to fewer register dependencies (and the long write to memory, which, if I understand correctly, is not an option).

Thank you. I didn't know this. However, CLR vs SWAP timing is rather odd. The 68000 executes SWAP faster than CLR but the 68020 executes CLR faster than SWAP! It would be interesting to reduce instruction dependencies in the code, which may speed up execution on the 68020 and higher 68k CPUs. Actually, I hadn't thought about it. But I have just checked the code and IMHO it is difficult to improve it this way. The code for the main loop is short, only 17 lines (or 25 if MULUopt=1) between .l2 and BCC .l2; anyone can check it too.

Quote:

Originally Posted by saimo

Anyway, on to the divu-first code...

Your code again corrupts D7.

Quote:

Originally Posted by saimo

Leaving aside the bvs optimization (that depends on the structure of your code), there's still one thing you can do to avoid the moveq at the beginning of the code, thus saving a little time in the case of the bvs branch

My BVS optimization is in the last attachment. It is not important here because it is independent of the other optimizations, but it is short, so I can show it here too:

Code:

         divu.w d4,d6
         bvs.s .longdiv

         moveq.l #0,d7
         move.w d6,d7
         clr.w d6
         swap d6
         move.w d6,(a3)
