English Amiga Board - View Single Post - Optimizing the 68020+ 32-bit math

litwr · 01 May 2021, 08:50

Quote:

Originally Posted by saimo

The code proposed by litwr doesn't have increments, and from the context it looks like (a3) is a variable rather than an item in an array/buffer, so it seems it should be possible to long-align it.

A3 is a pointer to an array element. The element size is 2 bytes, and the pointer decreases, so a long-word access through it cannot always be aligned and may slow down the algorithm.
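To illustrate the alignment point, here is a small Python sketch. The start address 0x1008 and the helper names are made up for the example, and I assume the pointer steps down by one 2-byte element at a time.

```python
def long_aligned(addr):
    """True if addr is a multiple of 4, i.e. a long word read there is aligned."""
    return addr % 4 == 0

def walk_down(start, count, elem_size=2):
    """Addresses visited by a pointer that decreases by elem_size each step."""
    return [start - i * elem_size for i in range(count)]

# Hypothetical start address, chosen only for illustration.
addresses = walk_down(0x1008, 4)
alignment = [long_aligned(a) for a in addresses]
```

With 2-byte steps, every other address is only word-aligned, so a long-word access there would be misaligned.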

Quote:

Originally Posted by saimo

To be honest, I only skimmed through the thread and I thought that the code you landed at in post #8 was for some reason the form you were aiming at, so I just applied some optimizations to that

But, yes, I agree that it's better to perform the division first, given that you said that the worst case (overflow set) is very rare.

I showed the CMP-first version only because a/b asked for it. Sorry I didn't give more information about it earlier.

Quote:

Originally Posted by saimo

Other than on 68060, swap is slower. The code I proposed also aims to save cycles by allowing the CPU to execute more stuff in parallel thanks to less register dependencies (and the long write to memory, which, if I understand correctly, is not an option).

Thank you, I didn't know this. However, the CLR vs SWAP timing is rather odd: the 68000 executes SWAP faster than CLR, but the 68020 executes CLR faster than SWAP! It is interesting that reducing instruction dependencies in the code may speed up execution on the 68020 and higher 68k CPUs. Actually, I hadn't thought about it. But I have just checked the code, and IMHO it is difficult to improve it this way. The main loop is short: it is only 17 lines (or 25 if MULUopt=1) between .l2 and BCC .l2, so anyone can check it too.

Quote:

Originally Posted by saimo

Anyway, on to the divu-first code...

Your code again corrupts D7.

Quote:

Originally Posted by saimo

Leaving aside the bvs optimization (that depends on the structure of your code), there's still one thing you can do to avoid the moveq at the beginning of the code, thus saving a little time in the case of the bvs branch

My BVS optimization is in the last attachment. It is not important here because it is independent of the other optimizations, but it is short, so I can show it here too:

Code:

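         divu.w d4,d6          ; d6 = remainder (high word) : quotient (low word)
         bvs.s .longdiv        ; rare case: quotient doesn't fit in 16 bits

         moveq.l #0,d7         ; clear all 32 bits of d7
         move.w d6,d7          ; d7 = quotient, zero-extended
         clr.w d6              ; drop the quotient from d6
         swap d6               ; d6 = remainder, zero-extended
         move.w d6,(a3)        ; store the remainder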
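
For anyone who doesn't read 68k assembly, here is a rough Python model of what the fast path above leaves in the registers. It is only a sketch: the function names divu_w and fast_path are made up for illustration, registers are plain integers masked to 32 bits, and the .longdiv branch is not modelled because it isn't shown here.

```python
MASK32 = 0xFFFFFFFF

def divu_w(dividend, divisor):
    """Model of DIVU.W: 32-bit dividend divided by a 16-bit divisor.

    Returns (register, overflow). On overflow the register is left
    unchanged and the V flag is set, as on the real CPU.
    """
    divisor &= 0xFFFF
    if divisor == 0:
        # The real CPU takes a divide-by-zero trap here.
        raise ZeroDivisionError("DIVU.W by zero")
    quotient, remainder = divmod(dividend & MASK32, divisor)
    if quotient > 0xFFFF:
        return dividend & MASK32, True
    return (remainder << 16) | quotient, False

def fast_path(d6, d4):
    """Model of the non-overflow path; returns (d7, d6, stored word) or None."""
    d6, overflow = divu_w(d6, d4)
    if overflow:
        return None                              # bvs.s .longdiv (not modelled)
    d7 = 0                                       # moveq.l #0,d7
    d7 = (d7 & 0xFFFF0000) | (d6 & 0xFFFF)       # move.w d6,d7
    d6 &= 0xFFFF0000                             # clr.w d6
    d6 = ((d6 << 16) | (d6 >> 16)) & MASK32      # swap d6
    stored = d6 & 0xFFFF                         # move.w d6,(a3)
    return d7, d6, stored
```

So after the fast path D7 holds the quotient and D6 the remainder, both with a clean upper word, and the remainder is what gets written to (a3).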