Optimizing the 68020+ 32-bit math - Page 3

VladR · 03 May 2021, 17:55

It should be criminally simple to create a benchmark (running, say, 1,000,000x) that compares execution time of various inputs to the DIV and create another variant that uses divu.w.

Shouldn't take more than 10 minutes to write.

All you really need is just an actual HW to run it.

A 20 MHz CPU should execute about 250,000 divisions in a second (at ~80 cycles).
Feeding the two registers and looping should bring that to about ~175,000 (very rough estimate, don't have the cycle table at hand, so feel free to look it up and adjust accordingly).

So, running a benchmark for 10,000,000x should provide sufficient benchmarking precision [from comparison standpoint].

In fact, what I don't understand, is that how come, if you are concerned about the performance, and you have an access to the HW, you actually haven't written such benchmark in the first place before even creating the thread ?

saimo · 03 May 2021, 18:19

@litwr

If you have a spare register, you could do this:

Code:

   move.l   #$ffff,d5; somewhere in your initialization code
   ...

   move.w   d6,d7
   swap.w   d6
   and.l    d5,d7
   move.w   d6,(a3)
   and.l    d5,d6

This removes all the registers conflicts (it's always a good thing) and exploits the free cycles after the write (i.e. and.l d5,d6 comes for free).

Regarding the puzzling timing results: are interrupts off during the test?

Edit: I just had a quick look at the inner loop of your code; I noticed that it reads from memory the item calculated at the previous step: why not exploiting the fact that you already have it in a register?

Unless I'm misunderstanding something, I'd propose this (for clarity, I have removed the mulu optimization stuff):

Code:

         addq.l  #2,a3
         move.l  #$ffff,d3
         <initialize d6>
         ...
         bra.b   .l4

.longdiv divul.l d4,d7:d6
         move.w  d7,-(a3)
         move.l  d7,d6

         subq.l  #2,d4
         bcs.b   .enddiv

.l2      sub.l   d6,d5
         sub.l   d7,d5
         lsr.l   #1,d5

.l4      mulu.w  d1,d6
         add.l   d5,d6
         move.l  d6,d5
         divu.w  d4,d6
         bvs.b   .longdiv

         move.w  d6,d7
         swap.w  d6
         and.l   d3,d7
         move.w  d6,-(a3)
         and.l   d3,d6

         subq.l  #2,d4
         bcc.b   .l2

.enddiv

Note that all the instructions between the writes and the branch instructions are free.

Additional note: -(a3) gives a better result (3 cycles) than (a3)+ (4 cycles) on 68020 only in the best case (i.e. when the instructions using it benefit from the maximum overlap conditions), but otherwise it gives a worse result (5 cycles); also, it always performs worse on 68030 (it makes no difference on 68040 and 68060, instead); you might experiment by adding a nop before the loop: by changing the alignment of the instructions, you might get better results (but first you must have a reliable way of making tests).

I see room for optimization also outside of the loop, but I'm affected by a weird allergy to reading code (I'm serious), so that's all I can do

roondar · 03 May 2021, 20:47

Right, just because I'm really rather interested in the Motorola 68K architecture, I did a few minor tests using extremely simple code I executed while all interrupts and the screen were disabled. I timed using the Amiga CIA timers and ran 4000 DIVU.L's and DIVUL.L's in a simple DBRA loop.

I found the following:

WinUAE set to 'aproximate A1200 speed' ran both DIVU.L and DIVUL.L much, much faster than my real A1200. Depending on the instruction, between 4,9x and 6,5x as fast.
On real hardware, there is absolutely no difference in execution time depending on the value passed to the DIVU(L) instruction. I tried a variety of values and all resulted in exactly the same timing.
On both real hardware and WinUAE DIVU.L is ever so slightly faster than DIVUL.L (by about 6%).

Edit: note that the 6% speed difference might actually not be due to the different instructions, but instead due to the need to reset an additional register in the loop for the 64 bit variants.

modrobert · 04 May 2021, 08:35

Quote:

Originally Posted by litwr

Why make a disassembly, you have sources.

I can again confirm that DIVUL is executed almost never.

I only have 'pi-amiga-8.asm' and 'pi-amiga-9.asm' in 'pi-amiga-cmp.zip', so wanted to see exactly what the binaries you supplied did. The DIVU.W instruction is slow as well; 42-44 clocks (ten times slower than other average instructions).

Thomas Richter · 04 May 2021, 08:41

Quote:

Originally Posted by roondar

On real hardware, there is absolutely no difference in execution time depending on the value passed to the DIVU(L) instruction. I tried a variety of values and all resulted in exactly the same timing.

That depends on the processor. On my 68060, the time is (almost) constant, but the instruction is much quicker in the overflow case. The manual states so, it is 28 cycles, with a footnode saying that there are early-out conditions for the 32/16 division, which is exactly what I have found.

I do not know for the 68020 or 68030. For the 68000, the manual gives timing information, and there it depends on the numbers.

roondar · 04 May 2021, 10:02

Quote:

Originally Posted by Thomas Richter

That depends on the processor. On my 68060, the time is (almost) constant, but the instruction is much quicker in the overflow case. The manual states so, it is 28 cycles, with a footnode saying that there are early-out conditions for the 32/16 division, which is exactly what I have found.

I do not know for the 68020 or 68030. For the 68000, the manual gives timing information, and there it depends on the numbers.

Ah, perhaps I should've specified this. But my test results were specifically for the 68020 only, as that is what litwr was talking about.
So for the record: my test was done on a stock Amiga 1200, no expansions.

litwr · 04 May 2021, 11:43

Quote:

Originally Posted by VladR

In fact, what I don't understand, is that how come, if you are concerned about the performance, and you have an access to the HW, you actually haven't written such benchmark in the first place before even creating the thread ?

I don't have hardware.

This thread is about the pi-spigot optimization, and the peculiarities of division implementations are just a collateral effect.

Quote:

Originally Posted by saimo

If you have a spare register, you could do this:

Code:

   move.l   #$ffff,d5; somewhere in your initialization code
   ...

   move.w   d6,d7
   swap.w   d6
   and.l    d5,d7
   move.w   d6,(a3)
   and.l    d5,d6

This removes all the registers conflicts (it's always a good thing) and exploits the free cycles after the write (i.e. and.l d5,d6 comes for free).

Thank you but I completely missed your point.

It seems your code is one instruction longer. Of course, the AND and MOVE in your code can be executed in parallel but I doubt that your code will be faster even on the 68060.

Quote:

Originally Posted by saimo

Regarding the puzzling timing results: are interrupts off during the test?

Yes they are enabled in all my pi-spigot implementations but you can notice that modrobert ran long tests for several minutes, so it can't affect results this way. I can repeat that it can only be explained by temperature raise. Of course, modrobert can just run the two tests in reverse order and this will confirm or reject my explanation.

Quote:

Originally Posted by saimo

Edit: I just had a quick look at the inner loop of your code; I noticed that it reads from memory the item calculated at the previous step: why not exploiting the fact that you already have it in a register?

No, my code reads data into a register with move.w -(a3),d0 and then writes a new value with move d6,(a3) or move d7,(a3). On the next step the algo takes a diffrent value.

Quote:

Originally Posted by saimo

Unless I'm misunderstanding something, I'd propose this (for clarity, I have removed the mulu optimization stuff):

Code:

         addq.l  #2,a3
         move.l  #$ffff,d3
         <initialize d6>
         ...
         bra.b   .l4

.longdiv divul.l d4,d7:d6
         move.w  d7,-(a3)
         move.l  d7,d6

         subq.l  #2,d4
         bcs.b   .enddiv

.l2      sub.l   d6,d5
         sub.l   d7,d5
         lsr.l   #1,d5

.l4      mulu.w  d1,d6
         add.l   d5,d6
         move.l  d6,d5
         divu.w  d4,d6
         bvs.b   .longdiv

         move.w  d6,d7
         swap.w  d6
         and.l   d3,d7
         move.w  d6,-(a3)
         and.l   d3,d6

         subq.l  #2,d4
         bcc.b   .l2

.enddiv

This code can't work because D6 at .L4 keeps a wrong value - it is not a value from the array but the remainder or quotient.

Quote:

Originally Posted by saimo

I see room for optimization also outside of the loop, but I'm affected by a weird allergy to reading code (I'm serious), so that's all I can do

Sorry, but it is very short.

A man had a similar the number pi calculation project - this algo is about 2 times faster but is also 3-4 larger.

Quote:

Originally Posted by modrobert

I only have 'pi-amiga-8.asm' and 'pi-amiga-9.asm' in 'pi-amiga-cmp.zip', so wanted to see exactly what the binaries you supplied did. The DIVU.W instruction is slow as well; 42-44 clocks (ten times slower than other average instructions).

The binaries were compiled from those sources. There were four cases:
1) mc68000 and MULUopt=0;
2) mc68020 and MULUopt=0;
3) mc68000 and MULUopt=1;
4) mc68020 and MULUopt=1.
BTW the table is updated. I also made a size optimization, so v10 is now the newest but it has the same speed as v9.

Quote:

Originally Posted by roondar

Right, just because I'm really rather interested in the Motorola 68K architecture, I did a few minor tests using extremely simple code I executed while all interrupts and the screen were disabled. I timed using the Amiga CIA timers and ran 4000 DIVU.L's and DIVUL.L's in a simple DBRA loop.

I found the following:

WinUAE set to 'aproximate A1200 speed' ran both DIVU.L and DIVUL.L much, much faster than my real A1200. Depending on the instruction, between 4,9x and 6,5x as fast.
On real hardware, there is absolutely no difference in execution time depending on the value passed to the DIVU(L) instruction. I tried a variety of values and all resulted in exactly the same timing.
On both real hardware and WinUAE DIVU.L is ever so slightly faster than DIVUL.L (by about 6%).

Edit: note that the 6% speed difference might actually not be due to the different instructions, but instead due to the need to reset an additional register in the loop for the 64 bit variants.

Thank you. One can only wonder why emulator writers still use so poor code for division emulation.

Quote:

Originally Posted by Thomas Richter

That depends on the processor. On my 68060, the time is (almost) constant, but the instruction is much quicker in the overflow case. The manual states so, it is 28 cycles, with a footnode saying that there are early-out conditions for the 32/16 division, which is exactly what I have found.

I do not know for the 68020 or 68030. For the 68000, the manual gives timing information, and there it depends on the numbers.

It is really interesting what is the performance difference between the 68020 and 68030. It is very confusing that instruction timings for these processors are calculated by different methods. The pi-spigot results show almost identical numbers for both these processors.

roondar · 04 May 2021, 13:37

Quote:

Originally Posted by litwr

Thank you. One can only wonder why emulator writers still use so poor code for division emulation.

It's not really poor code that's the problem. Processors beyond the 68000/68010/80386 are hard to emulate because they have cache memory and instruction speed that varies depending on the instruction stream. Since no one knows exactly how the 68020 internal sequencer works (same with 486+), accurate timing while emulating is basically impossible to get.

So compromises have to be made, where some instructions will end up being too fast and others too slow.

Toni Wilen · 04 May 2021, 17:52

When I was coding and testing my CPU tester, I noticed that 68020 and 68030 have almost all undocumented features working identically, for example DIV undocumented flags work exactly the same.

I am almost sure execution core and microcode is practically identical but everything else has been updated (memory controller, MMU, caches etc. Updating execution core probably wasn't worth the required testing)

Division algorithm improvements:

68000 is very slow, pure microcode without any "hardware acceleration", timing fully known (but was not fully documented)

68010 division is improved, UAE emulated timing should be fully correct. I did some experiments and noticed that timing is correct if 68000 algorithm is slightly modified and conditional microcode branches removed, for example 68000 does it similar to "skip following microcode block if condition false" and skipped block was "subtract x from y". 68010 does similar to "subtract [x if true, 0 if not true] from y". (Apparently 68010 hardware gained feature that can conditionally force input to zero when doing ALU operations) This improved performance and reduced timing variation depending on input values (68010 DIV timing still depends on input variables but not as much as 68000). 68010 MUL also uses same method to improve performance.

68020/030 almost certainly uses similar algoritm that 68010 uses. 020/030 is still fully(?) microcoded.

Unfortunately Motorola documentation only talks about max cycle times (as was already mentioned) which are useless for emulation purposes. It needs to use best case timing because max times generally are much less common in real world. Everything would run far too slowly when using max timings or even if using average times!. Thats why "inaccurate emulation" has to use minimum possible timings. Which are unknown. Caches and long pipeliline makes it impossible to get any useful timing results. Yes, you can disable caches and all and get "correct" timing results but they are also totally useless in real world..

(and some 68020+ only instructions might lack "slow down" completely. no one should use them in speed critical parts anyway)

saimo · 04 May 2021, 18:39

Quote:

Originally Posted by litwr

Quote:

Originally Posted by saimo

@litwr

If you have a spare register, you could do this:

Code:

   move.l   #$ffff,d5; somewhere in your initialization code
   ...

   move.w   d6,d7
   swap.w   d6
   and.l    d5,d7
   move.w   d6,(a3)
   and.l    d5,d6

This removes all the registers conflicts (it's always a good thing) and exploits the free cycles after the write (i.e. and.l d5,d6 comes for free).

Thank you but I completely missed your point. It seems your code is one instruction longer. Of course, the AND and MOVE in your code can be executed in parallel but I doubt that your code will be faster even on the 68060.

Except for the move.l #$ffff,d5 instruction, which needs to be executed once and for all outside of the loop, the code is the same size.
As for the speed improvement:
* the removal of the registers conflicts does allow to exploit both the execution pipelines of the 68060 (the instructions used are suitable for both the primary and secondary pipeline), so the code would run faster on such CPU;
* the and instruction after the write is for free also on 68020, as it executes while the write is being performed, whereas the same doesn't happen for bcc - forgot to mention earlier: if my memory doesn't fail me, branch instructions (I'm pretty such that this applies to dbcc at least) can't execute instead while writes happen, at least on 68020 and 68030 - it's something I had noticed experimentally a while back.

Quote:

Yes they are enabled in all my pi-spigot implementations but you can notice that modrobert ran long tests for several minutes, so it can't affect results this way. I can repeat that it can only be explained by temperature raise. Of course, modrobert can just run the two tests in reverse order and this will confirm or reject my explanation.

You want the maximum precision in this case, and keeping the interrupts on in a multitasking environment won't help at all. Just call Disable() and Enable() from exec.library before and after the test loop, respectively (and don't call OS functions in between).

Quote:

No, my code reads data into a register with move.w -(a3),d0 and then writes a new value with move d6,(a3) or move d7,(a3). On the next step the algo takes a diffrent value.

DOH, what a dumb thing I said and did there! Sorry for the idiotic code. I'm totally ashamed.

Here's a proper version:

Code:

         move.l  #$ffff,d3
         ...
         bra.b   .l4

.longdiv divul.l d4,d7:d6
         move.w  d7,(a3)

         subq.l  #2,d4
         bcs.b   .enddiv

.l2      sub.l   d6,d5
         sub.l   d7,d5

.l4      move.w  -(a3),d0
         lsr.l   #1,d5
         mulu.w  d1,d0
         add.l   d0,d5
         move.l  d5,d6
         divu.w  d4,d6
         bvs.b   .longdiv

         move.w  d6,d7
         swap.w  d6
         and.l   d3,d7
         move.w  d6,(a3)
         and.l   d3,d6

         subq.l  #2,d4
         bcc.b   .l2

.enddiv

One optimization I forgot to mention previously is the replacement of bra after divul.l: this saves quite a few cycles, albeit only in the rare cases of long division.
A new optimization here is that lsr.l has been moved to avoid 2 registers conflicts (this can be done safely given that the initialization clears d5 before jumping to .l4).

For further speed improvements I suggest (sorry for repeating, but better safe than sorry):
* try to put a nop before the loop code (a 2-byte different alignment might give a different result);
* if it's possible to reverse the items order in the array, use (a3) in place of -(a3) and (a3)+ in place of (a3): this will give faster results on 68030 and maybe even on 68020 - this should also be checked with and without nop.

modrobert · 04 May 2021, 23:04

Quote:

Originally Posted by litwr

Yes they are enabled in all my pi-spigot implementations but you can notice that modrobert ran long tests for several minutes, so it can't affect results this way. I can repeat that it can only be explained by temperature raise. Of course, modrobert can just run the two tests in reverse order and this will confirm or reject my explanation.

Reverse order...

Code:

pi-amiga1200-9mo
number ? calculator v9 [MULUopt](68020)
number of digits (up to 9248)? 9000
314...  295.12

Code:

pi-amiga-9mo
number ? calculator v9 [MULUopt](68000)
number of digits (up to 9248)? 9000
314...  292.28

Don_Adan · 05 May 2021, 04:55

Quote:

Originally Posted by litwr

IMHO it is much easier just to believe the author of the code.

Ok.
What is this?

Code:

         or d5,d5      ; ???
         beq .l20

         move d5,d1
         addq #3,d5
         and #$fffc,d5
         cmp.b #10,(a0)
         bne .l21

Why not?

Code:

 ;        or d5,d5
 ;        beq .l20

         move d5,d1
          beq.b .l20
         addq #3,d5
         and #$fffc,d5
         cmp.b #10,(a0)
         bne .l21

And what is this?

Code:

         move.l $6c,rasterie+2
         move.l #rasteri,$6c

Even for A1200 this is dangerous.

Also if you know 68020 instructions timing then divul version is useless for 68020/68030.

saimo · 05 May 2021, 16:23

Quote:

Originally Posted by roondar

On both real hardware and WinUAE DIVU.L is ever so slightly faster than DIVUL.L (by about 6%).

Edit: note that the 6% speed difference might actually not be due to the different instructions, but instead due to the need to reset an additional register in the loop for the 64 bit variants.

Thanks for the tests, roondar!

So, divu.l (64 bits) seems faster than divul.l (32 bits)? Maybe you meant the opposite (which would fit the "reset additional register" in the following comment)?
As for the overhead effects, you could countercheck by adding dummy operations of the same kind in the loop of the apparently faster instruction.
The MC68020UM manual doesn't even mention timings for divul, so maybe the reason is that it performs exactly like divu.

roondar · 05 May 2021, 16:56

It may be my 68020 is still a bit weak (I don't know all the instructions by heart yet), but in my test I used the following two forms:

Code:

   divu.l   d0,d1
   divul.l  d0,d1:d2

And that assembled fine. To be 100% clear on this, I just tested and divu.l d0,d1:d2 also works (but, to be clear, this wasn't tested by me for speed). However, divul.l d0,d1 does not assemble. At any rate, I'm not actually sure whether or not resetting the register makes a difference, speed wise. So the 6% difference could be either the instruction itself, of the additional register reset.

saimo · 05 May 2021, 17:10

Quote:

Originally Posted by roondar

It may be my 68020 is still a bit weak (I don't know all the instructions by heart yet), but in my test I used the following two forms:

Code:

   divu.l   d0,d1
   divul.l  d0,d1:d2

And that assembled fine. To be 100% clear on this, I just tested and divu.l d0,d1:d2 also works (but, to be clear, this wasn't tested by me for speed). However, divul.l d0,d1 does not assemble. At any rate, I'm not actually sure whether or not resetting the register makes a difference, speed wise. So the 6% difference could be either the instruction itself, of the additional register reset.

divul is always 32 bits, and returns both quotient and remainder, while divu.l comes in two forms:
* divu.l <ea>,dq (32/32 -> 32q)
* divu.l <ea>dr:dq (64/32 -> 32q, 32r)

Indeed it makes sense that divu.l d0,d1 might be faster than divul.l d0,d1 (both 32 bits, but the latter also returns the remainder), but I'd be surprised if the 64 bit divu.l were faster - but now I see that you didn't test that one

saimo · 05 May 2021, 20:05

I ended up making a few tests myself, and what already reported by roondar is totally confirmed: i.e. divu instructions all perform at the same speed and seem not to depend on the input (at least when overflow doesn't happen - I haven't checked that).
I have made the tests on a standard A1200 and on the same A1200 with an Blizzard 1230-IV turned on - i.e., on a 14 MHz 68020 and on a 50 MHz 68030.

The test were based on this simple loop...

Code:

   move.w  #499,d0
.l move.l  #$7fffffff,d1
   move.l  #$12345678,d2
   moveq.l #$ffffffff,d3
   <divu instruction>
   dbf     d0,.l

... the interrupts were off and the elapsed time was measured with a CIA.

These are the results (times expressed in CIA ticks):

Code:

 instruction        68020   68030
------------------+-------+-------
 divu.l  d1,d3    |   9ca |   2c6
 divu.l  d1,d2:d3 |   9ca |   2c6
 divul.l d1,d2:d3 |   9ca |   2c6

Note: I've tried several other input values (even giving large quotients) with the same results.

EDIT: what follows is a brain fart! "divu instructions all perform at the same speed" kept lingering in my head and made me come to the conclusion stated below, by making me forget that the pi code uses divu.w (not divu.l!), which is faster.

This means that the bvs "optimization" in the pi code is useless (Don Adan, is this that you were referring to, maybe?). Such code can then be reduced to:

Code:

.l4 move.w  -(a3),d0
    lsr.l   #1,d5
    mulu.w  d1,d0
    add.l   d0,d5
    move.l  d5,d6
    divul.l d4,d7:d6
    move.w  d7,(a3)
    sub.l   d6,d5
    sub.l   d7,d5      
    subq.l  #2,d4
    bcc.b   .l4

Bruce Abbott · 05 May 2021, 21:30

Quote:

Originally Posted by litwr

I almost sure that it is impossible to optimize this 68020/30 division better than I show in my improved code.

It is interesting that I could optimize the 80386 code for the same case.

You have discovered one of the nice things about 68k code - most of the time there is little to gain from 'optimizing' for the 020 and 030. This is good because it means you can concentrate on producing clean readable code rather than tying yourself in knots attempting to micro-optimize it, and you know that if you write good code for the 68000 it will be good for more advanced CPUs too.

Quote:

I can just show you a good book where they use just MOVE instead of MOVE.W - Amiga Machine Language by Stefan Dittrich (1989). In a newer book Total! Amiga assembler by Paul Overaa (1995) they wrote "If the object size is not specified then a word size (16 bit) is assumed".
IMHO it is rather a matter of tastes.

It's not just 'tastes', but a matter of producing unambiguous code which is easier for someone unfamiliar with it to understand.

Quote:

Originally Posted by Don_Adan

Ok.
What is this?

Code:

         or d5,d5      ; ???
         beq .l20

         move d5,d1

This is what get when you are an x86 coder who doesn't know that most 68k instructions affect the flags.

modrobert · 06 May 2021, 00:25

After reading Toni's post I got a bit curious about the microcode "state machine" for DIVU.W for 68020 which is the most used instruction of the slow ones in litwr's code.

I used TimeIt (from aminet) and wrote some code for the test.

test.c

Code:

/* stub for testing */

#include <stdio.h>
#include <stdlib.h>

unsigned long divuw(__reg("d0") unsigned long a, __reg("d1") unsigned long b, \
 __reg("d2") unsigned long c);

int main(int argc, char *argv[])
{
 unsigned long dividend, divisor, counter, result;
 
 if(argc < 4) {
  printf("Usage: %s <dividend> <divisor> <counter>\n", argv[0]);
  exit(1);
 }
 
 dividend = atol(argv[1]);
 divisor = atol(argv[2]);
 if(divisor == 0) {
  printf("Error: divisor is zero or not an integer.\n");
  exit(2);
 }
 counter = atol(argv[3]);
 
 printf("Running division: %lu / %lu\n", dividend, divisor);
 result = divuw(dividend, divisor, counter);
 printf("Done with %lu divisions, result: 0x%08lx\n", counter, result);
 
 return 0;
}

div.s

Code:

        section "CODE",code
        public  _divuw
        cnop    0,4

_divuw
        move.l d3,-(sp)
        move.l d0,d3    ; save dividend

.loop
        move.l d3,d0    ; restore dividend
        divu.w d1,d0
        subq.l #1,d2    ; loop until counter reaches zero
        bne .loop

        move.l (sp)+,d3
        ; result from last division in d0
        rts

Compiled it with 'vc test.c div.s -o test_div' (vbcc) on my A1200.

Code:

Usage: test_div <dividend> <divisor> <counter>

Here are the results...

There was no check for division by one, which should be able to return rather fast, same goes for "dividend == divisor".

Code:

> timeit test_div 1 1 4000000
Running division: 1 / 1
Done with 4000000 divisions, result: 0x00000001
Elapsed: 15.98s

Both dividend and divisor are power of two which is easy to check and can be calculated fast using lsr.l.

Code:

> timeit test_div 128 8 4000000
Running division: 128 / 8
Done with 4000000 divisions, result: 0x00000010
Elapsed: 15.96s

When dividend is zero result is zero, but the 68020 still needs to think about it.

Code:

> timeit test_div 0 1 4000000
Running with: 0 / 1
Done with 4000000 divisions, result: 0x00000000
Elapsed: 15.86s

This fraction gives infinite decimals, but no major time difference, as can be seen.

Code:

> timeit test_div 7 3 4000000
Running division: 7 / 3
Done with 4000000 divisions, result: 0x00010002
Elapsed: 15.95s

Just testing bigger dividend and divisor, similar performance.

Code:

> timeit test_div 64737 33223 4000000
Running division: 64737 / 33223
Done with 4000000 divisions, result: 0x7b1a0001
Elapsed: 15.98s

Here I removed the 'DIVU.W' in div.s, replaced it with 'NOP', and compiled for reference.

Code:

> timeit test_div 0 1 4000000
Running with: 0 / 1
Done with 4000000 divisions, result: 0x00000000
Elapsed: 3.70s

Conclusion:

The microcode "state machine" is really dumb/simple for DIVU.W in the sense that there are no optimizations based on input, so either the hardware engineers were saving logic gates or had other restrictions due to the CPU design in general. On the bright side, this leaves room for improvement from a programmer perspective, avoid DIVxx whenever possible, there are plenty of useful instructions to choose from in the amazing 68k ISA.

EDIT:

Added a test where DIVU.W is commented out in 'div.s'.

Code:

> timeit test_div 1 1 4000000
Running division: 1 / 1
Done with 4000000 divisions, result: 0x00000001
Elapsed: 3.00s

Subtract 3 seconds from the previous tests to get DIVU.W time.

roondar · 06 May 2021, 00:52

On the Amiga, such expensive instructions can still be useful if you're using the chipset as well and aren't running on a fast CPU with Fast RAM. A simple trick would be to run one or more (big) blit(s) in the background, do your DIVxx stuff in the foreground. Essentially the DIVxx stuff becomes 'free' because you're still doing useful work regardless. And the Blitter would only barely be slowed down by the CPU because it's mostly calculating so it runs at near-full-speed too

I'm pretty sure some games/demos use tricks like that - certainly on A500.

Thomas Richter · 06 May 2021, 07:56

Quote:

Originally Posted by modrobert

After reading Toni's post I got a bit curious about the microcode "state machine" for DIVU.W for 68020 which is the most used instruction of the slow ones in litwr's code.

Well, in your case, the division is just a very minor ingredient in the overall running time and other code parts dominate, thus I'm not sure you would be able to see much of a difference from this test code. It needs a tighter loop around the div.

That said, I did not see an argument dependency on my 68060 except that it is much faster in the overflow case, but that's the only fast case I could discover.

03 May 2021, 17:55	#41
VladR Registered User Join Date: Dec 2019 Location: North Dakota Posts: 741	It should be criminally simple to create a benchmark (running, say, 1,000,000x) that compares execution time of various inputs to the DIV and create another variant that uses divu.w. Shouldn't take more than 10 minutes to write. All you really need is just an actual HW to run it. A 20 MHz CPU should execute about 250,000 divisions in a second (at ~80 cycles). Feeding the two registers and looping should bring that to about ~175,000 (very rough estimate, don't have the cycle table at hand, so feel free to look it up and adjust accordingly). So, running a benchmark for 10,000,000x should provide sufficient benchmarking precision [from comparison standpoint]. In fact, what I don't understand, is that how come, if you are concerned about the performance, and you have an access to the HW, you actually haven't written such benchmark in the first place before even creating the thread ? Last edited by VladR; 03 May 2021 at 17:57. Reason: typos

03 May 2021, 18:19	#42
saimo Registered User Join Date: Aug 2010 Location: Italy Posts: 854	@litwr If you have a spare register, you could do this: Code: move.l #$ffff,d5; somewhere in your initialization code ... move.w d6,d7 swap.w d6 and.l d5,d7 move.w d6,(a3) and.l d5,d6 This removes all the registers conflicts (it's always a good thing) and exploits the free cycles after the write (i.e. and.l d5,d6 comes for free). Regarding the puzzling timing results: are interrupts off during the test? Edit: I just had a quick look at the inner loop of your code; I noticed that it reads from memory the item calculated at the previous step: why not exploiting the fact that you already have it in a register? Unless I'm misunderstanding something, I'd propose this (for clarity, I have removed the mulu optimization stuff): Code: addq.l #2,a3 move.l #$ffff,d3 <initialize d6> ... bra.b .l4 .longdiv divul.l d4,d7:d6 move.w d7,-(a3) move.l d7,d6 subq.l #2,d4 bcs.b .enddiv .l2 sub.l d6,d5 sub.l d7,d5 lsr.l #1,d5 .l4 mulu.w d1,d6 add.l d5,d6 move.l d6,d5 divu.w d4,d6 bvs.b .longdiv move.w d6,d7 swap.w d6 and.l d3,d7 move.w d6,-(a3) and.l d3,d6 subq.l #2,d4 bcc.b .l2 .enddiv Note that all the instructions between the writes and the branch instructions are free. Additional note: -(a3) gives a better result (3 cycles) than (a3)+ (4 cycles) on 68020 only in the best case (i.e. when the instructions using it benefit from the maximum overlap conditions), but otherwise it gives a worse result (5 cycles); also, it always performs worse on 68030 (it makes no difference on 68040 and 68060, instead); you might experiment by adding a nop before the loop: by changing the alignment of the instructions, you might get better results (but first you must have a reliable way of making tests). I see room for optimization also outside of the loop, but I'm affected by a weird allergy to reading code (I'm serious), so that's all I can do Last edited by saimo; 03 May 2021 at 23:52.

03 May 2021, 20:47	#43
roondar Registered User Join Date: Jul 2015 Location: The Netherlands Posts: 3,436	Right, just because I'm really rather interested in the Motorola 68K architecture, I did a few minor tests using extremely simple code I executed while all interrupts and the screen were disabled. I timed using the Amiga CIA timers and ran 4000 DIVU.L's and DIVUL.L's in a simple DBRA loop. I found the following: WinUAE set to 'aproximate A1200 speed' ran both DIVU.L and DIVUL.L much, much faster than my real A1200. Depending on the instruction, between 4,9x and 6,5x as fast. On real hardware, there is absolutely no difference in execution time depending on the value passed to the DIVU(L) instruction. I tried a variety of values and all resulted in exactly the same timing. On both real hardware and WinUAE DIVU.L is ever so slightly faster than DIVUL.L (by about 6%). Edit: note that the 6% speed difference might actually not be due to the different instructions, but instead due to the need to reset an additional register in the loop for the 64 bit variants. Last edited by roondar; 03 May 2021 at 20:54.

05 May 2021, 16:56	#54
roondar Registered User Join Date: Jul 2015 Location: The Netherlands Posts: 3,436	It may be my 68020 is still a bit weak (I don't know all the instructions by heart yet), but in my test I used the following two forms: Code: divu.l d0,d1 divul.l d0,d1:d2 And that assembled fine. To be 100% clear on this, I just tested and divu.l d0,d1:d2 also works (but, to be clear, this wasn't tested by me for speed). However, divul.l d0,d1 does not assemble. At any rate, I'm not actually sure whether or not resetting the register makes a difference, speed wise. So the 6% difference could be either the instruction itself, of the additional register reset.

05 May 2021, 20:05	#56
saimo Registered User Join Date: Aug 2010 Location: Italy Posts: 854	I ended up making a few tests myself, and what already reported by roondar is totally confirmed: i.e. divu instructions all perform at the same speed and seem not to depend on the input (at least when overflow doesn't happen - I haven't checked that). I have made the tests on a standard A1200 and on the same A1200 with an Blizzard 1230-IV turned on - i.e., on a 14 MHz 68020 and on a 50 MHz 68030. The test were based on this simple loop... Code: move.w #499,d0 .l move.l #$7fffffff,d1 move.l #$12345678,d2 moveq.l #$ffffffff,d3 <divu instruction> dbf d0,.l ... the interrupts were off and the elapsed time was measured with a CIA. These are the results (times expressed in CIA ticks): Code: instruction 68020 68030 ------------------+-------+------- divu.l d1,d3 \| 9ca \| 2c6 divu.l d1,d2:d3 \| 9ca \| 2c6 divul.l d1,d2:d3 \| 9ca \| 2c6 Note: I've tried several other input values (even giving large quotients) with the same results. EDIT: what follows is a brain fart! "divu instructions all perform at the same speed" kept lingering in my head and made me come to the conclusion stated below, by making me forget that the pi code uses divu.w (not divu.l!), which is faster. This means that the bvs "optimization" in the pi code is useless (Don Adan, is this that you were referring to, maybe?). Such code can then be reduced to: Code: .l4 move.w -(a3),d0 lsr.l #1,d5 mulu.w d1,d0 add.l d0,d5 move.l d5,d6 divul.l d4,d7:d6 move.w d7,(a3) sub.l d6,d5 sub.l d7,d5 subq.l #2,d4 bcc.b .l4 Last edited by saimo; 08 May 2021 at 13:32.

04 May 2021, 17:52	#49
Toni Wilen WinUAE developer Join Date: Aug 2001 Location: Hämeenlinna/Finland Age: 49 Posts: 26,570	When I was coding and testing my CPU tester, I noticed that 68020 and 68030 have almost all undocumented features working identically, for example DIV undocumented flags work exactly the same. I am almost sure execution core and microcode is practically identical but everything else has been updated (memory controller, MMU, caches etc. Updating execution core probably wasn't worth the required testing) Division algorithm improvements: 68000 is very slow, pure microcode without any "hardware acceleration", timing fully known (but was not fully documented) 68010 division is improved, UAE emulated timing should be fully correct. I did some experiments and noticed that timing is correct if 68000 algorithm is slightly modified and conditional microcode branches removed, for example 68000 does it similar to "skip following microcode block if condition false" and skipped block was "subtract x from y". 68010 does similar to "subtract [x if true, 0 if not true] from y". (Apparently 68010 hardware gained feature that can conditionally force input to zero when doing ALU operations) This improved performance and reduced timing variation depending on input values (68010 DIV timing still depends on input variables but not as much as 68000). 68010 MUL also uses same method to improve performance. 68020/030 almost certainly uses similar algoritm that 68010 uses. 020/030 is still fully(?) microcoded. Unfortunately Motorola documentation only talks about max cycle times (as was already mentioned) which are useless for emulation purposes. It needs to use best case timing because max times generally are much less common in real world. Everything would run far too slowly when using max timings or even if using average times!. Thats why "inaccurate emulation" has to use minimum possible timings. Which are unknown. Caches and long pipeliline makes it impossible to get any useful timing results. Yes, you can disable caches and all and get "correct" timing results but they are also totally useless in real world.. (and some 68020+ only instructions might lack "slow down" completely. no one should use them in speed critical parts anyway)

06 May 2021, 00:52	#59
roondar Registered User Join Date: Jul 2015 Location: The Netherlands Posts: 3,436	On the Amiga, such expensive instructions can still be useful if you're using the chipset as well and aren't running on a fast CPU with Fast RAM. A simple trick would be to run one or more (big) blit(s) in the background, do your DIVxx stuff in the foreground. Essentially the DIVxx stuff becomes 'free' because you're still doing useful work regardless. And the Blitter would only barely be slowed down by the CPU because it's mostly calculating so it runs at near-full-speed too I'm pretty sure some games/demos use tricks like that - certainly on A500.

Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
68020 Bit Field Instructions	mcgeezer	Coders. Asm / Hardware	9	27 October 2023 23:21
68060 64-bit integer math	BSzili	Coders. Asm / Hardware	7	25 January 2021 21:18
Discovery: Math	Audio Snow	request.Old Rare Games	30	20 August 2018 12:17
Math apps	mtb	support.Apps	1	08 September 2002 18:59