Old 30 April 2021, 21:06   #18
litwr
Quote:
Originally Posted by a/b View Post
Here are my suggestions:
1. maybe invert the bcs condition: if the 16-bit case is executed a lot more frequently it should be as branch not taken *if* you can adjust your code to avoid a bra at the end
2. if you have a spare register, use move+swap+cmp sequence instead of swap+cmp+2*swap, it's the same speed but 2 bytes shorter, so potentially very slightly faster because you can squeeze 2 more bytes into icache (not that large if 020/030)
3. moveq #0,d7 should be moved to after .div32no (only the 16-bit case needs it), because 32-bit div will set all 32 bits anyway, or....
4. moveq #0,d7 should be executed only once before the loop (code implies that the remainder is always 16-bit, and setting d7 bits 16-31 only once will suffice)
1. Eureka! How could I have missed this variant?! Maybe I wanted to keep the macro code in one piece? But I already split it for the 80386... It is that crazy moment when a man discovers that the pen he thought he lost is in his hand. Thank you very much. Perhaps I was also reluctant to do this optimization because Bcc timings are rather unusual on the 68k: for Bcc.W, a branch that is not taken actually takes longer than one that is taken. I have attached my improved programs to this post. Emulators (which are cycle-accurate for the 68000) show more than a 1% speed boost - the largest gain in years. However, I still need help with real hardware to check the gain for the 68020 case. (A sketch of the inverted arrangement follows this list.)
2. This gains too little. The whole main loop is less than 100 bytes, so it cannot improve cache usage. It is indeed good to make the code a bit shorter, but you chose the slower arrangement with the CMP before the first division...
3. You are right; I found this out myself too. However, it doesn't affect performance, because it only speeds up code that is executed very rarely.
4. I don't understand this point of yours.
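To make point 1 concrete, here is a minimal sketch of the inverted arrangement, using the register layout from saimo's snippet quoted below (d4 = 16-bit divisor, d6 = 32-bit dividend, a3 = output pointer); it is only an illustration, not the actual macro from pi-amiga:
Code:
     move.l     d6,d7
     swap.w     d6
     cmp.w      d4,d6
     bcc.b      .div32yes      ; rare case: quotient will not fit in 16 bits

     divu.w     d4,d7          ; frequent 16-bit case is now the fall-through
     moveq      #0,d6
     move.w     d7,d6          ; d6 = 16-bit quotient, zero-extended
     clr.w      d7
     swap.w     d7             ; d7 = 16-bit remainder, upper word clear
     move.w     d7,(a3)        ; store the 16-bit remainder
.div32f                        ; main flow continues here, no bra in the hot path

*    ... rest of the loop body ...

.div32yes                      ; rare long divide moved out of the hot path
     divul.l    d4,d6:d7
     move.w     d6,(a3)        ; store the 16-bit remainder
     exg.l      d6,d7          ; d6 = quotient, d7 = remainder (upper word is 0)
     bra.b      .div32f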

Quote:
Originally Posted by saimo View Post
Then I'd go for this:
Code:
     move.l     d6,d7
     swap.w     d6
     cmp.w      d4,d6
     bcs.b      .div32no

     divul.l    d4,d6:d7
     move.w     d6,(a3)
     exg.l      d6,d7
     bra.b      .div32f

.div32no
     divu.w     d4,d7
     clr.l      d6
     move.w     d7,d6
     clr.w      d7
     swap.w     d7
     move.w     d7,(a3)

.div32f
This code gives:
* 32-bit quotient in d6;
* 32-bit remainder in d7, with upper word set to 0;
* 16-bit remainder written to (a3).
Also, it executes some stuff in parallel, saving cycles.
Thank you. But you also chose the slower arrangement with the CMP before the first division. The 68k has an advantage over the x86 here: DIVU sets the V flag on overflow, so why not use it? Finally, your code just replaces MOVEQ and SWAP with MOVE.L and CLR.L - that hardly gives any speed boost.
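For example, here is a minimal sketch of the V-flag approach for the 68020 path, reusing the register assignments from the snippet above (d4 = 16-bit divisor, d6 = 32-bit dividend, a3 = output pointer); again, it is only an illustration, not the code from pi-amiga:
Code:
     move.l     d6,d7
     divu.w     d4,d7          ; try the 16-bit divide first; V is set on overflow
     bvs.b      .div32         ; quotient does not fit in 16 bits -> long divide
     moveq      #0,d6
     move.w     d7,d6          ; d6 = 16-bit quotient, zero-extended
     clr.w      d7
     swap.w     d7             ; d7 = 16-bit remainder, upper word clear
     move.w     d7,(a3)        ; store the 16-bit remainder
     bra.b      .div32f

.div32
     divul.l    d4,d6:d7       ; d7 still holds the dividend (DIVU leaves it unchanged on overflow)
     move.w     d6,(a3)        ; store the 16-bit remainder
     exg.l      d6,d7          ; d6 = quotient, d7 = remainder (upper word is 0)

.div32f
The 68000 detects the DIVU overflow early and terminates the instruction quickly, so the failed divide is cheap; of course, the 68000 path needs a different fallback, since DIVUL is 68020+ only.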

Quote:
Originally Posted by a/b View Post
If you don't care about the upper word of d7 being 0:
No, it is important. The algorithm simply subtracts D6 and D7 from D5 after the division, so we may exchange D6 and D7, but D5 is a long value.
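For illustration, that step looks roughly like this (the surrounding code is not shown in this thread, so the register usage here is just what the description above implies):
Code:
     sub.l      d6,d5          ; subtract the 32-bit quotient from the long value in d5
     sub.l      d7,d5          ; subtract the remainder - hence its upper word must be 0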

Quote:
Originally Posted by a/b View Post
If you can afford to trash the word at (2,a3):
This code corrupts D7.

Quote:
Originally Posted by StingRay View Post
You should read better books then! There are assemblers which default to .l which in turn breaks such code. And rightfully so as the size specifiers are there for a reason.
Maybe. I can point you to a good book that uses plain MOVE instead of MOVE.W: Amiga Machine Language by Stefan Dittrich (1989). A newer book, Total! Amiga Assembler by Paul Overaa (1995), states: "If the object size is not specified then a word size (16 bit) is assumed".
IMHO it is rather a matter of taste: some people prefer more explicit code, others prefer brevity.
BTW, vasm always defaults to .W if the size is omitted.
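For example (this is my understanding of vasm's Devpac-style defaults; check the manual of the assembler you actually use):
Code:
     move       d0,d1          ; assembles as move.w d0,d1 - word is the default size
     move.l     d0,d1          ; explicit long-word move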

Quote:
Originally Posted by a/b View Post
Keep in mind that misaligned longword access is slower. No (a3)+ so we don't know for how much is a3 being incremented, but it might be ok though.
The code has a move.w -(a3),d0 instruction in its main loop, so a3 moves in steps of 2 there.

Quote:
Originally Posted by modrobert View Post
Code:
pi-amiga
number ? calculator v8 (68000)
number of digits (up to 9252)? 800
3141592653589793...
3.24
Assuming I understood the last output there, it took 3.24 seconds?
Exactly! However, would you like to provide results for 1000 and/or 3000 digits? That would let me compare your results with the previous ones. It is enough to show only the timing, because the algorithm is well tested and the digits must be correct.

Quote:
Originally Posted by StingRay View Post
With "m68020" and "MULUopt = 1" I got this result:

Code:
pi-amiga
number ? calculator v8 (68020)
number of digits (up to 9252)? 800
3141592653589793...
3.14
Thank you very much!!! Your result shows that MULUopt speeds up the 68020 code! This is a surprise to me, because it means that hardware multiplication is slower than a set of equivalent instructions in this case.

Good emulators are quite accurate for the 68000, and they clearly show that MULUopt=1 makes the 68000 code slower. However, your results show that MULUopt=1 can be useful at least on the 68020, and I suspect the 68030 would also be accelerated by it. Would you please run pi-amiga and pi-amiga1200 on your hardware for 100, 1000, and 3000 digits? A screenshot would be a nice addition that I could insert into the table.
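To make "a set of equivalent instructions" concrete (the actual MULUopt replacement is not shown here, and the constant 10 is only a hypothetical example), a multiplication by a known constant can be expressed with shifts and adds like this:
Code:
* multiply d0.w by 10 without MULU (d1 is a scratch register)
     add.w      d0,d0          ; d0 = x*2
     move.w     d0,d1          ; d1 = x*2
     lsl.w      #2,d0          ; d0 = x*8
     add.w      d1,d0          ; d0 = x*10
Whether such a sequence wins over MULU depends on the CPU and on the particular constant, which matches the different results seen for the 68000 and 68020.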

Last edited by BippyM; 01 June 2021 at 18:24.