Optimizing the 68020+ 32-bit math - Page 4

Thomas Richter · 06 May 2021, 07:59

Quote:

Originally Posted by roondar

On the Amiga, such expensive instructions can still be useful if you're using the chipset as well and aren't running on a fast CPU with Fast RAM.

Actually, I'm doing a lot of signal processing here in my day job, and what I learned is: Regardless what the CPU is, avoid divisions. You typically replace them by a multiplication by a pre-shifted inverse, and a right-shift. That's precise enough, and a lot faster than the division algorithm.
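A minimal sketch of that multiply-and-shift idea, assuming 16-bit unsigned operands and a reciprocal pre-shifted by 32 bits and rounded up. The helper names `make_reciprocal` and `div_by_reciprocal` are hypothetical and only illustrate the technique; they are not taken from any signal-processing code.

```python
def make_reciprocal(d):
    """Return ceil-like 2**32 / d for a 16-bit divisor (hypothetical helper)."""
    if not 1 <= d <= 0xFFFF:
        raise ValueError("divisor must be in 1..65535")
    # Rounding the reciprocal up keeps the truncated product from
    # falling below the true quotient.
    return (1 << 32) // d + 1


def div_by_reciprocal(x, m):
    """Divide a 16-bit unsigned x with one multiply and one right shift."""
    return (x * m) >> 32


if __name__ == "__main__":
    m = make_reciprocal(10)            # computed once, reused for many x
    print(div_by_reciprocal(12345, m))  # 1234
```

The reciprocal is computed once per divisor, so the cost of the real division is paid only when the parameter changes. In this 16-bit setting the result matches floor division exactly; note that for d = 1 the constant needs 33 bits, so a register-sized implementation would have to special-case it.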

modrobert · 06 May 2021, 09:38

Quote:

Originally Posted by Thomas Richter

Well, in your case, the division is just a very minor ingredient in the overall running time and other code parts dominate, thus I'm not sure you would be able to see much of a difference from this test code. It needs a tighter loop around the div.

I mitigated that by replacing DIVU.W with NOP in the last test from previous post.

EDIT:

Realized now when typing this, it would be better to just comment DIVU.W out in 'div.s', so here is that test.

Code:

> timeit test_div 1 1 4000000
Running division: 1 / 1
Done with 4000000 divisions, result: 0x00000001
Elapsed: 3.00s

Just subtract 3 seconds from the test results to get DIVU.W time.

roondar · 06 May 2021, 09:54

Quote:

Originally Posted by Thomas Richter

Actually, I'm doing a lot of signal processing here in my day job, and what I learned is: Regardless what the CPU is, avoid divisions. You typically replace them by a multiplication by a pre-shifted inverse, and a right-shift. That's precise enough, and a lot faster than the division algorithm.

Indeed, it's of course always better to have a faster algorithm. But if you can't avoid it somehow, on the low-end Amigas you can still partially mitigate the cost by doing what I pointed out. That said, I'd say that on the 68020, MUL.L still qualifies as an expensive instruction, so it would also qualify for the same 'trick'.

modrobert · 06 May 2021, 10:51

I checked the TG68 VHDL source code (TG68_fast.vhd), which is used as 68000 in FPGA solutions such as Minimig.

Code:

-----------------------------------------------------------------------------
-- DIVU
-----------------------------------------------------------------------------
PROCESS (clk, execOPC, opcode, OP1out, OP2out, div_reg, dummy_div_sub, div_quot, div_sign, dummy_div_over, dummy_div)
    BEGIN
        set_V_Flag <= '0';

        IF rising_edge(clk) THEN
            IF clkena='1' THEN
                IF decodeOPC='1' THEN
                    IF opcode(8)='1' AND reg_QB(31)='1' THEN                -- Neg divisor
                        div_sign <= '1';
                        div_reg <= 0-reg_QB;
                    ELSE
                        div_sign <= '0';
                        div_reg <= reg_QB;
                    END IF;
                ELSIF exec_DIVU='1' THEN
                    div_reg <= div_quot;
                END IF;
            END IF;
        END IF;

        dummy_div_over <= ('0'&OP1out(31 downto 16))-('0'&OP2out(15 downto 0));

        IF opcode(8)='1' AND OP2out(15) ='1' THEN
            dummy_div_sub <= (div_reg(31 downto 15))+('1'&OP2out(15 downto 0));
        ELSE
            dummy_div_sub <= (div_reg(31 downto 15))-('0'&OP2out(15 downto 0));
        END IF;

        IF (dummy_div_sub(16))='1' THEN
            div_quot(31 downto 16) <= div_reg(30 downto 15);
        ELSE
            div_quot(31 downto 16) <= dummy_div_sub(15 downto 0);
        END IF;

        div_quot(15 downto 0) <= div_reg(14 downto 0)&NOT dummy_div_sub(16);

        IF execOPC='1' AND opcode(8)='1' AND (OP2out(15) XOR div_sign)='1' THEN
            dummy_div(15 downto 0) <= 0-div_quot(15 downto 0);
        ELSE
            dummy_div(15 downto 0) <= div_quot(15 downto 0);
        END IF;

        IF div_sign='1' THEN
            dummy_div(31 downto 16) <= 0-div_quot(31 downto 16);
        ELSE
            dummy_div(31 downto 16) <= div_quot(31 downto 16);
        END IF;

        IF (opcode(8)='1' AND (OP2out(15) XOR div_sign XOR dummy_div(15))='1' AND dummy_div(15 downto 0)/=X"0000")  --Overflow DIVS
            OR (opcode(8)='0' AND dummy_div_over(16)='0') THEN  --Overflow DIVU
            set_V_Flag <= '1';
        END IF;
    END PROCESS;

(Some of the operations for the "div" registers are handled by the ALU, search for "--ALU" and also "-- execute microcode".)

I've done some simple hardware designs in the past (both Verilog and VHDL), and modified some 8-bit state machines that somewhat resemble a CPU, but this is another level; I have never had experience designing arithmetic logic (so far).

The general idea seems to be "division bit for bit": the dividend, divisor and output are handled through several 16- and 32-bit custom division registers (div_reg, div_quot, OP2out, etc.) that are updated from bit operations on each clock as the process continues. This partially explains why the inputs rarely matter: it just chugs through each of the bits over several passes. The only exception I can find where the input is checked is when the divisor is zero (which gives a guru on the Amiga, so no speed gain there).
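To make the "bit for bit" idea concrete, here is a minimal software model of an unsigned 32/16 division that produces one quotient bit per step. It is a sketch under stated assumptions: the function name `divu_w` and its return convention are hypothetical, and it is not a translation of the TG68 process above, only the generic shift-and-subtract scheme that such hardware resembles.

```python
def divu_w(dividend, divisor):
    """Unsigned 32/16 -> 16-bit quotient, 16-bit remainder.

    Returns (quotient, remainder, overflow). Illustrative model only.
    """
    if divisor == 0:
        raise ZeroDivisionError("division by zero traps on the real CPU")
    dividend &= 0xFFFFFFFF
    divisor &= 0xFFFF
    # Overflow: the quotient would not fit in 16 bits.
    if (dividend >> 16) >= divisor:
        return None, None, True
    rem = dividend >> 16      # running remainder starts as the high word
    low = dividend & 0xFFFF   # bits still to be shifted in
    quot = 0
    for _ in range(16):       # always 16 steps, whatever the operands are
        rem = (rem << 1) | ((low >> 15) & 1)
        low = (low << 1) & 0xFFFF
        quot <<= 1
        if rem >= divisor:    # trial subtraction succeeded
            rem -= divisor
            quot |= 1
    return quot, rem, False


if __name__ == "__main__":
    print(divu_w(0xFFFF, 7))  # (9362, 1, False)
```

The loop count is fixed, which is consistent with the observation that the operand values barely influence the running time; only the zero-divisor and overflow checks depend on the input.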

I don't know if the 68EC020 in A1200 uses the same DIVU design as TG68, but perhaps something similar.

litwr · 06 May 2021, 20:08

Quote:

Originally Posted by meynaf

still have the memory of your pi-spigot program where you removed features just to show shorter x86 code.

I have added information about code density in table #4 of https://litwr2.github.io/pi-spigot-b...benchmark.html - I hope meynaf will like it.

The 68000 shows good results, though it is behind the leaders: the VAX, IBM/370, and NS 32016.

Quote:

Originally Posted by roondar

It's not really poor code that's the problem. Processors beyond the 68000/68010/80386 are hard to emulate because they have cache memory and instruction speed that varies depending on the instruction stream. Since no one knows exactly how the 68020 internal sequencer works (same with 486+), accurate timing while emulating is basically impossible to get.

Quote:

Originally Posted by Toni Wilen

When I was coding and testing my CPU tester, I noticed that 68020 and 68030 have almost all undocumented features working identically, for example DIV undocumented flags work exactly the same.

The problems that roondar listed can make up to 40% speed emulation difference but FS-UAE or Hatari make it up to 400%! It is absolutely definite that they use a wrong timing constant for DIVU.L and DIVUL.

Quote:

Originally Posted by saimo

Except for the move.l #$ffff,d5 instruction, which needs to be executed once and for all outside of the loop, the code is the same size.
As for the speed improvement:
* the removal of the registers conflicts does allow to exploit both the execution pipelines of the 68060 (the instructions used are suitable for both the primary and secondary pipeline), so the code would run faster on such CPU;
* the and instruction after the write is for free also on 68020, as it executes while the write is being performed, whereas the same doesn't happen for bcc - forgot to mention earlier: if my memory doesn't fail me, branch instructions (I'm pretty sure that this applies to dbcc at least) can't execute while writes happen, at least on 68020 and 68030 - it's something I had noticed experimentally a while back.

It seems that your code has the same size and maybe it will be a bit faster on the 68060. But I doubt that it will be faster on the 68020/30. Maybe I will prepare code to test your idea later.

Quote:

Originally Posted by saimo

You want the maximum precision in this case, and keeping the interrupts on in a multitasking environment won't help at all. Just call Disable() and Enable() from exec.library before and after the test loop, respectively (and don't call OS functions in between).

The Amiga documentation has a note about Disable(): DO NOT USE THIS CALL WITHOUT GOOD JUSTIFICATION. THIS CALL IS VERY DANGEROUS! Of course, my timing code is not perfect, but it is very short and it works. However, if you can show me a better timing code snippet, it will be welcome. I have really missed a good code snippet that uses the timer.

Quote:

Originally Posted by saimo

Here's a proper version:

Thank you, but it is rather a superscalar optimization. My code is too ancient for it.

Quote:

Originally Posted by saimo

For further speed improvements I suggest (sorry for repeating, but better safe than sorry):
* try to put a nop before the loop code (a 2-byte different alignment might give a different result);

Thank you! It seems this suggestion resolves the big mystery of modrobert's results. The 68020 code has a label that is only word aligned, and this slows down the 68020. It is interesting that the 68030 does not slow down in this case (at least when all the code fits in its instruction cache) - results from the Atari TT confirm this.

Quote:

Originally Posted by saimo

* if it's possible to reverse the items order in the array, use (a3) in place of -(a3) and (a3)+ in place of (a3): this will give faster results on 68030 and maybe even on 68020 - this should also be checked with and without nop.

This changes the algo which is fixed by its C-sources.

Quote:

Originally Posted by modrobert

Reverse order...

Code:

pi-amiga1200-9mo
number ? calculator v9 [MULUopt](68020)
number of digits (up to 9248)? 9000
314...  295.12

Code:

pi-amiga-9mo
number ? calculator v9 [MULUopt](68000)
number of digits (up to 9248)? 9000
314...  292.28

Thank you very much! Your results confirm that Commodore made very good hardware whose speed is not affected by a slight temperature change. Now I am sure that this speed difference is caused by the alignment of the .l2 label: it is double-word aligned in the 68000 version and word aligned in the 68020 version. Thanks to saimo for the hint. I have just added ALIGN 2 before this label.

Quote:

Originally Posted by Don_Adan

What is this?
Why not?

Code:

         move d5,d1
         beq.b .l20
         addq #3,d5
         and #$fffc,d5
         cmp.b #10,(a0)
         bne .l21

Thank you but this code is outside any loops so IMHO more clear logic is better for this case.

Quote:

Originally Posted by Don_Adan

And what is this?

Code:

         move.l $6c,rasterie+2
         move.l #rasteri,$6c

Even for A1200 this is dangerous.

It just works. However it will be good if you give me an example of better code to measure time. I have already asked saimo for this.

Quote:

Originally Posted by Don_Adan

Also if you know 68020 instructions timing then divul version is useless for 68020/68030.

It is not true for my pi-spigot code. The code with DIVUL is faster than the code with two DIVU.W and it is much shorter.

Quote:

Originally Posted by saimo

This means that the bvs "optimization" in the pi code is useless (Don Adan, is this that you were referring to, maybe?). Such code can then be reduced to:

It seems you missed the idea of the BVS optimization. It simply lets BVS.S fall through (branch not taken) in the more frequent case, which is faster on the 68000 and the 68020/30.

Quote:

Originally Posted by Bruce Abbott

This is what you get when you are an x86 coder who doesn't know that most 68k instructions affect the flags.

Thank you. Don_Adan mentioned this code above - I have replied to him. You know, I did a lot of 6502 coding.

Quote:

Originally Posted by Thomas Richter

Actually, I'm doing a lot of signal processing here in my day job, and what I learned is: Regardless what the CPU is, avoid divisions. You typically replace them by a multiplication by a pre-shifted inverse, and a right-shift. That's precise enough, and a lot faster than the division algorithm.

I hope you know the fantastic ARM code to do division by 10 using multiplication.

Code:

        SUB     r2, r1, #10             ; keep (x-10) for later
        SUB     r1, r1, r1, lsr #2
        ADD     r1, r1, r1, lsr #4
        ADD     r1, r1, r1, lsr #8
        ADD     r1, r1, r1, lsr #16
        MOV     r1, r1, lsr #3
        ADD     r3, r1, r1, asl #2
        SUBS    r2, r2, r3, asl #1      ; calc (x-10) - (x/10)*10
        ADDMI   r2, r2, #10             ; fix-up remainder
        ADDPL   r1, r1, #1              ; fix-up quotient

It divides 32-bit R1 by 10 and returns a 32-bit quotient in R1 and a remainder in R2 - it only takes 9 cycles even on the ARM1.
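For readers without an ARM at hand, the routine can be followed step by step in a small Python model. This is only a sketch: the function name `arm_div10` is hypothetical, registers are modelled as integers masked to 32 bits, and the N flag of SUBS is taken as bit 31 of the result.

```python
MASK = 0xFFFFFFFF  # registers are 32 bits wide


def arm_div10(x):
    """Model the ARM sequence above; returns (quotient, remainder)."""
    r1 = x & MASK
    r2 = (r1 - 10) & MASK              # SUB   r2, r1, #10
    r1 = (r1 - (r1 >> 2)) & MASK       # SUB   r1, r1, r1, lsr #2   (about 0.75 x)
    r1 = (r1 + (r1 >> 4)) & MASK       # ADD   r1, r1, r1, lsr #4
    r1 = (r1 + (r1 >> 8)) & MASK       # ADD   r1, r1, r1, lsr #8
    r1 = (r1 + (r1 >> 16)) & MASK      # ADD   r1, r1, r1, lsr #16  (about 0.8 x)
    r1 >>= 3                           # MOV   r1, r1, lsr #3       (about x / 10)
    r3 = (r1 + (r1 << 2)) & MASK       # ADD   r3, r1, r1, asl #2   (5 * estimate)
    r2 = (r2 - (r3 << 1)) & MASK       # SUBS  r2, r2, r3, asl #1
    if r2 & 0x80000000:                # MI: estimate was already the quotient
        r2 = (r2 + 10) & MASK          # ADDMI r2, r2, #10
    else:                              # PL: estimate was one too small
        r1 = (r1 + 1) & MASK           # ADDPL r1, r1, #1
    return r1, r2


if __name__ == "__main__":
    print(arm_div10(12345))  # (1234, 5)
```

The shift-and-add chain only approximates x/10 from below, by at most one; the final conditional pair repairs both the quotient and the remainder.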

meynaf · 06 May 2021, 21:25

Quote:

Originally Posted by litwr

I have added information about code density in table #4 of https://litwr2.github.io/pi-spigot-b...benchmark.html - I hope meynaf will like it.

Why would I ? Too small example, too much OS specific code. The fact that it's optimised for speed rather than code size doesn't make the test very meaningful either.

Quote:

Originally Posted by litwr

It divides 32-bit R1 by 10 and returns a 32-bit quotient in R1 and a remainder in R2 - it only takes 9 cycles even on the ARM1.

And 40 bytes of code just to divide by 10.

Thomas Richter · 06 May 2021, 21:45

Quote:

Originally Posted by litwr

I hope you know the fantastic ARM code to do division by 10 using multiplication.

Not specifically, but the general trick is to find an exponent T such that 2^T + 1 is divisible by 5, i.e. u * 5 = 2^T + 1.

Then (x * u * 5 - x) = (x * (2^T + 1) - x) = x * 2^T, which is 0 mod 2^T, thus x * u = x / 5 mod 2^T whenever x is a multiple of 5. Dividing by 2 is then simple.

A particular choice (though possibly not yours) is T = 18, 2^T + 1 = 262145 = 5 * 52429.

Hence, to divide by 5, multiply x by 52429, and take the mod 2^18 (easy, just masking). To divide by 10, add another rightshift.

Of course, that is not the only choice for T.

However, that is not in general what is needed in my job because the divisors are parameters and not constants.
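A short check of the T = 18 choice, as a sketch with hypothetical helper names. It assumes two readings of the trick: masking with 2^18 gives the exact quotient only when x is a multiple of 5 and the quotient fits in 18 bits, while using the same constant with a right shift instead of a mask gives the floor quotient for any 16-bit x.

```python
T = 18
MASK = (1 << T) - 1
U = 52429  # 5 * U == 2**T + 1, so U is the inverse of 5 modulo 2**T


def exact_div5(x):
    """Exact x / 5 via the modular inverse.

    Only valid when x is a multiple of 5 and x // 5 < 2**T.
    """
    return (x * U) & MASK


def floor_div5(x):
    """floor(x / 5) for 16-bit x: same constant, shift instead of mask."""
    return (x * U) >> T


def floor_div10(x):
    """floor(x / 10) for 16-bit x: one more right shift."""
    return (x * U) >> (T + 1)


if __name__ == "__main__":
    print(exact_div5(1000))    # 200
    print(floor_div10(12345))  # 1234
```

For an x that is not a multiple of 5 the masked form returns an unrelated value, which is why the shift form is the one normally used for general division by a constant.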

Don_Adan · 06 May 2021, 22:42

"Thank you but this code is outside any loops so IMHO more clear logic is better for this case."
Yes, or.w d5,d5 is outside loops, but why you used Max. digits value, if you wasted memory for nothing on Amiga version? It was only example, perhaps more unnecessary code can exist.

"It just works. However it will be good if you give me an example of better code to measure time. I have already asked saimo for this"

I will use AddIntServer and RemIntServer from exec.

Or use something like this:
http://eab.abime.net/showpost.php?p=552625&postcount=43

Or better use/adapt time measure routine from c2p routine. Originally written by Jim Drew.

http://eab.abime.net/showpost.php?p=...&postcount=235

saimo · 07 May 2021, 00:04

It's been about a week since I forced myself to stop working on my new game because of serious sleep issues (in fact, I can't always think straight). But this divu thing got me intrigued, so I just couldn't help but make more tests

I decided to check the divu instructions with broader ranges of values (always ensuring that overflow does not occur), comparing them against one another, and seeing how they behave on 68020 and 68030.
The tests were made with interrupts and DMA off, and the time has been measured using the 32-bit timer from CIA A.

First of all, an overview of the tests performed and of the results:

Code:

 # | OPERATION        | DIVIDEND          |         DIVISOR         | ITERATIONS | TIME 68020 | TIME 68030
---+------------------+-------------------+-------------------------+------------+------------+------------
 1 | 32/16 -> 16q 16r | 2^16-1            |         1 ... 2^16-1    |     2^16-1 |     180325 |      50210
   | divu.w dx,dy     | $ffff             |         1 ... 65535     |      65535 |  §  176949 |  §   50210
---+------------------+-------------------+-------------------------+------------+------------+------------
 2 | 32/16 -> 16q 16r | (2^16-1) * 2^15   |      2^15 ... 2^16-1    |       2^15 |      88578 |      25106
   | divu.w dx,dy     | $7fff8000         |     32768 ... 65535     |      32768 |  §   90166 |  §   25106
---+------------------+-------------------+-------------------------+------------+------------+------------
 3 | 32/32 -> 32q     | 2^32-1            |         1 ... 2^20      |       2^20 |    4718597 |    1338906
   | divu.l dx,dy     | $ffffffff         |         1 ... 1048576   |    1048576 |  § 4875883 |  § 1368658
---+------------------+-------------------+-------------------------+------------+------------+------------
 4 | 32/32 -> 32q 32r | 2^32-1            |         1 ... 2^20      |       2^20 |    4718597 |    1338906
   | divul.l dx,dy:dz | $ffffffff         |         1 ... 1048576   |    1048576 |  § 4771025 |  § 1338905
---+------------------+-------------------+-------------------------+------------+------------+------------
 5 | 64/32 -> 32q 32r | 2^32-1            |         1 ... 2^16-1    |     2^16-1 |     304742 |      85541
   | divu.l dx,dy:dz  | $ffffffff         |         1 ... 65535     |      65535 |  §  301465 |  §   85542
---+------------------+-------------------+-------------------------+------------+------------+------------
 6 | 64/32 -> 32q 32r | (2^32-1) * 2^31   | 2^32-2^20 ... 2^32-1    |       2^20 |    4823456 |    1368660
   | divu.l dx,dy:dz  | $7fffffff80000000 | $fff00000 ... $ffffffff |    1048576 |  § 4875884 |  § 1368659

The times are expressed in CIA clocks.
The times are relative to the whole core loops, not just to the divu instructions.
§ = time when the code alignment was altered with an nop before the core loop.

First considerations:
* the 68020 is more sensitive than the 68030 to alignment;
* 32- and 64-bit divus seem to perform very similarly;
* 16-bit divisions seem faster (as one would expect);
* remainders do not seem to impact performance (as one would expect);
* input data does not seem to affect performance (as the MC68020UM implies and contrary to what the MC68030UM says).

Now, given that the number of iterations differs, to be able to compare the times, let's see how long a single iteration took on average (best times only):

Code:

 # | OPERATION        | ITERATIONS | TIME 020 | TIME 030 |  AVERAGE 68020 |  AVERAGE 68030
---+------------------+------------+----------+----------+----------------+----------------
 1 | divu.w dx,dy     |      65535 |   176949 |    50210 | 2.700068665599 | 0.766155489433
---+------------------+------------+----------+----------+----------------+----------------
 2 | divu.w dx,dy     |      32768 |    88578 |    25106 | 2.703186035156 | 0.766174316406
---+------------------+------------+----------+----------+----------------+----------------
 3 | divu.l dx,dy     |    1048576 |  4718597 |  1338906 | 4.500004768372 | 1.276880264282
---+------------------+------------+----------+----------+----------------+----------------
 4 | divul.l dx,dy:dz |    1048576 |  4718597 |  1338906 | 4.500004768372 | 1.276880264282
---+------------------+------------+----------+----------+----------------+----------------
 5 | divu.l dx,dy:dz  |      65535 |   301465 |    85541 | 4.600061036088 | 1.305271992065
---+------------------+------------+----------+----------+----------------+----------------
 6 | divu.l dx,dy:dz  |    1048576 |  4823456 |  1368659 | 4.600006103516 | 1.305254936218

This clearly shows that:
* 16-bit divus are faster;
* 32-bit divus perform equally regardless of the remainder;
* 64-bit divus perform equally regardless of the remainder.

It also seems to indicate that 64-bit divus are slower than 32-bit divus - but that isn't the case, because the difference is due to the extra code in the core loops.
So, before proceeding further, let's look at the core loops.

Code:

TEST #1

   move.l  #$ffff,d2 ;$0000ffff
   moveq.l #1,d7     ;i = 1
.l move.l  d2,d0     ;$0000ffff
   divu.w  d7,d0     ;$0000ffff/i
   addq.w  #1,d7     ;++i
   bne.b   .l


TEST #2

   move.l #$7fff8000,d2 ;$ffff*2^15
   move.w #$8000,d7     ;i = 2^15
.l move.l d2,d0         ;$ffff*2^15
   divu.w d7,d0         ;($ffff*2^15)/i
   addq.w #1,d7         ;++i
   bne.b  .l


TEST #3

   moveq.l #-1,d2      ;$ffffffff
   move.l  #$100000,d7 ;i = 2^20
.l move.l  d2,d0       ;$ffffffff
   divu.l  d7,d0       ;($ffffffff)/i
   subq.l  #1,d7       ;--i
   bne.b   .l


TEST #4

   moveq.l #-1,d2      ;$ffffffff
   move.l  #$100000,d7 ;i = 2^20
.l move.l  d2,d0       ;$ffffffff
   divul.l d7,d1:d0    ;($ffffffff)/i
   subq.l  #1,d7       ;--i
   bne.b   .l


TEST #5

   moveq.l #-1,d2   ;$ffffffff
   moveq.l #1,d7    ;i = 1
.l clr.l   d1       ;0
   move.l  d2,d0    ;$ffffffff
   divu.l  d7,d1:d0 ;($00000000ffffffff)/i
   addq.w  #1,d7    ;++i
   bne.b   .l


TEST #6

   move.l #$7fffffff,d3 ;$7fffffff
   move.l #$80000000,d2 ;$80000000
   move.l #$fff00000,d7 ;i = 2^32-2^20
.l move.l d3,d1         ;$7fffffff
   move.l d2,d0         ;$80000000
   divu.l d7,d1:d0      ;((2^32-1)*2^31)/i
   addq.l #1,d7         ;++i
   bne.b  .l

The divus of tests #3, #4, #5 and #6 all perform at the same speed: in fact, by adding a dummy move to test #3 (the same could be done with test #4)...

Code:

   moveq.l #-1,d2      ;$ffffffff
   move.l  #$100000,d7 ;i = 2^20
.l move.l  d4,d5       ;dummy operation
   move.l  d2,d0       ;$ffffffff
   divu.l  d7,d0       ;($ffffffff)/i
   subq.l  #1,d7       ;--i
   bne.b   .l

... to make the cycles of the non-divu instructions equal to those of tests #5 and #6, the execution time becomes exactly the same.

Now let's compare the performance of the two CPUs:

Code:

 # | OPERATION        | TIME 020 | TIME 030 | TIME 020 / TIME 030
---+------------------+----------+----------+---------------------
 1 | divu.w dx,dy     |   176949 |    50210 | 3.524178450508
---+------------------+----------+----------+---------------------
 2 | divu.w dx,dy     |    88578 |    25106 | 3.52816059906
---+------------------+----------+----------+---------------------
 3 | divu.l dx,dy     |  4718597 |  1338906 | 3.5242182797
---+------------------+----------+----------+---------------------
 4 | divul.l dx,dy:dz |  4718597 |  1338906 | 3.5242182797
---+------------------+----------+----------+---------------------
 5 | divu.l dx,dy:dz  |   301465 |    85541 | 3.524216457605
---+------------------+----------+----------+---------------------
 6 | divu.l dx,dy:dz  |  4823456 |  1368659 | 3.524220423056

The 68020 of the standard A1200 runs at 14.18758 MHz, while the 68030 on the Blizzard 1230-IV I used runs at 50 MHz. Therefore, frequency-wise, the 68030 runs 50 / 14.18758 = 3.524209202697 times faster than the 68020, exactly like the table indicates.
This also means that the divu implementations are the same on both CPUs. In fact, their user's manuals indicate the very same timings (for the cache-case, that is, but the other cases are almost identical too). I guess that where the MC68030UM says that the actual timing depends on the input data, it refers to divisions by 0 and overflows.
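The comparison can be rechecked by converting the measured CIA clocks into CPU cycles per loop iteration. The sketch below uses the figures from the tables and the CIA frequency of 0.709379 MHz quoted later in this post; the function name `cycles_per_iteration` and the dictionary layout are hypothetical.

```python
CIA_MHZ = 0.709379       # CIA clock quoted later in the post
CPU_020_MHZ = 14.18758   # standard A1200
CPU_030_MHZ = 50.0       # Blizzard 1230-IV

# test number: (iterations, CIA clocks on 68020, CIA clocks on 68030), best times
RESULTS = {
    1: (65535, 176949, 50210),
    2: (32768, 88578, 25106),
    3: (1048576, 4718597, 1338906),
    4: (1048576, 4718597, 1338906),
    5: (65535, 301465, 85541),
    6: (1048576, 4823456, 1368659),
}


def cycles_per_iteration(cia_clocks, iterations, cpu_mhz):
    """Convert a measured time in CIA clocks into CPU cycles per iteration."""
    return cia_clocks / iterations * (cpu_mhz / CIA_MHZ)


if __name__ == "__main__":
    for test, (n, t020, t030) in RESULTS.items():
        c020 = cycles_per_iteration(t020, n, CPU_020_MHZ)
        c030 = cycles_per_iteration(t030, n, CPU_030_MHZ)
        print(f"test {test}: 68020 {c020:.2f} cycles, 68030 {c030:.2f} cycles")
```

Expressed in CPU cycles, both processors land on nearly the same per-iteration figure for each test, which is another way of saying that the time ratio simply follows the clock ratio.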

Finally, one last question: are the results reliable?
Let's look at the test #1 code and at its cycles on the 68020, and let's compare them with the result obtained experimentally.

Code:

   move.b  #$41,$bfee01 ;(reload timer and start it)

   move.l  #$ffff,d2    ;5w  6c (yes, the cache-case is said to be worse than the worst-case)
   moveq.l #1,d7        ;3w  2c

.l move.l  d2,d0        ;3w  2c
   divu.w  d7,d0        ;44w 44c
   addq.w  #1,d7        ;3w  2c
   bne.b   .l           ;9w  6c 4cl

   clr.b   $bfee01      ;6w+5w 4c+4c (stop timer)

w = worst-case
c = cache-case
l = last iteration

Initially, the CPU has to fetch the loop instructions from RAM, and the initialization instructions as well (by the way, only now it dawned on me that I should have put them before the write to the CIA CRA register, but I'm too tired now to redo all the tests and calculations again), so the execution is slower (worst-case) and takes 64 cycles:

Code:

longword-aligned code case:

   move.l  #$ffff,d2    ;5
   moveq.l #1,d7        ;2 (because it has been prefetched with the previous instruction)
.l move.l  d2,d0        ;3
   divu.w  d7,d0        ;44
   addq.w  #1,d7        ;3
   bne.b   .l           ;7 (because the opcode has been prefetched, so only the offset needs an additional read)

word-aligned code case:

   move.l  #$ffff,d2    ;4 (because the opcode has been prefetched with the previous instruction)
   moveq.l #1,d7        ;3
.l move.l  d2,d0        ;2 (because it has been prefetched with the previous instruction)
   divu.w  d7,d0        ;44
   addq.w  #1,d7        ;2 (because it has been prefetched with the previous instruction)
   bne.b   .l           ;9

Afterwards, each iteration takes 2+44+2+6 = 54 cycles.
The last iteration takes 2 cycles less as the branch is not taken, i.e. 52 cycles.
Finally, the write to the CIA has to be evaluated taking into account that its timing has a base time (6w 4c) plus the effective address calculation time (5w 4c). Depending on the code alignment, the opcode might already be in the cache thanks to the 32-bit fetch for bne, so, in that case, the time is 4c+5w = 9 cycles; otherwise, the time is 6w+5w = 11 cycles.
Therefore, theoretically, the whole execution takes 64+54*65533+52+9 = 3538907 or 64+54*65533+52+11 = 3538909 cycles.
The CIA runs at 0.709379 MHz, i.e. 1/20th of the CPU speed, so the elapsed time in CPU cycles is 176949*20 = 3538980.
The difference between the actual time and the theoretical time is thus 3538980-3538907 = 73 or 3538980-3538909 = 71 cycles. I guess that it can be explained as follows:
* the instructions are fetched from CHIP RAM, which is slower than what the MC68020UM assumes (its timings are relative to a 0-wait-state RAM); namely, the CHIP RAM runs at 1/4 of the CPU frequency; given that when the instructions are not in the cache theoretically 64-54+5 = 15 or 64-54+7 = 17 cycles more for RAM accesses are needed, execution takes actually 15*4 = 60 or 17*4 = 68 cycles longer;
* the access to the CIA is slow due to the slower frequency of the chip (1/20 of the CPU frequency), so, in the worst case, clr could actually take 19 cycles longer.
Even if I made some mistake, and even without considering the performance penalty factors, 73 out of 3538980 cycles represent a 0.002063% error, so I'd say that the tests are reliable. That's also supported by the fact that I ran the tests multiple times, always obtaining the same results.
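The arithmetic of this check is easy to reproduce. The sketch below plugs in the figures from the paragraphs above; the function name `theoretical_cycles` and its parameter names are hypothetical, and the factor 20 is the CPU-to-CIA clock ratio on the 68020 stated above.

```python
CPU_CYCLES_PER_CIA_CLOCK = 20  # 14.18758 MHz / 0.709379 MHz


def theoretical_cycles(first_pass, per_iteration, last_iteration,
                       final_write, iterations):
    """Sum the cycle model: one uncached pass, cached passes, exit, timer stop."""
    cached_passes = iterations - 2  # all but the first and the last iteration
    return (first_pass + per_iteration * cached_passes
            + last_iteration + final_write)


if __name__ == "__main__":
    measured = 176949 * CPU_CYCLES_PER_CIA_CLOCK
    for final_write in (9, 11):
        model = theoretical_cycles(64, 54, 52, final_write, 65535)
        diff = measured - model
        print(model, diff, f"{100 * diff / measured:.6f}%")
```

Splitting the model into named parts makes it straightforward to redo the sum for other loop bodies by changing only the per-part cycle counts.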

EDIT START

I just had to see how test #1 performs with minimized overhead, so I rewrote it like this:

Code:

   move.l  #$ffff,d2
   moveq.l #1,d7
   lea.l   $bfee01,a5
   move.b  #$41,(a5)  ;(reload timer and start it)

.l move.l  d2,d0      ;3w  2c
   divu.w  d7,d0      ;44w 44c
   addq.w  #1,d7      ;3w  2c
   bne.b   .l         ;9w  6c 4cl

   clr.b   (a5)       ;6w+2w 4c+2c  (stop timer)

w = worst-case
c = cache-case
l = last iteration

The actual times measured on an unexpanded A1200, in CIA clocks, are:
* without nop before the code: 176946;
* with nop before the code: 180323.

(Side note: this means that, in the best case, the cost of the overhead of the previous version of the code was (176949-176946)*20 = 60 cycles.)

When the code is not cached, the theoretical cycles are:

Code:

longword-aligned code case:

.l move.l  d2,d0      ;3
   divu.w  d7,d0      ;44
   addq.w  #1,d7      ;3
   bne.b   .l         ;7 (because the opcode has been prefetched, so only the offset needs an additional read)

   clr.b   (a5)       ;4+2 = 6 (because it has been prefetched with the previous instruction)

word-aligned code case:

.l move.l  d2,d0      ;2 (because it has been prefetched with the previous instruction)
   divu.w  d7,d0      ;44
   addq.w  #1,d7      ;2 (because it has been prefetched with the previous instruction)
   bne.b   .l         ;9

   clr.b   (a5)       ;6+2 = 8

Therefore, considering the best case:
* first iteration: 57 cycles;
* next iterations: 54 cycles;
* last iteration: 52 cycles;
* final write: 6 cycles;
* total: 57+54*65533+52+6 = 3538897 cycles

The difference between actual time and theoretical time is 176946*20-3538897 = 23 cycles. By looking at the timings above, we see that the additional cycles to fetch the instructions from RAM are 57-54 = 3. Due to the CHIP RAM slowness, those actually amount to 3*4 = 12 cycles. The remaining 23-12 = 11 cycles should be due to the access to the CIA.
To be honest, I didn't write down on paper a timing chart of the CPU activity (and, even if I tried, there is no documentation that explains how to do that exactly), so calculations might be a bit off here and there, but still the closeness of the figures and the 100% stable test results prove even better that the measured times are accurate enough for the purpose of evaluating the divu operations performance.

EDIT END

Conclusions? Don't get involved in threads that might steal an enormous amount of your time for stuff you'll never have a use for anyway

@litwr

I'll get back to you tomorrow ASAP (edit: couldn't sleep). Now I should really try to get some sleep.

modrobert · 07 May 2021, 06:09

Quote:

Originally Posted by saimo

Conclusions? Don't get involved in threads that might steal an enormous amount of your time for stuff you'll never have a use for anyway

I find your results interesting, and the reward for doing this is learning more; the very foundation of every future decision. Basic research vs being productive, you need both to stay sane.

litwr · 08 May 2021, 09:36

Quote:

Originally Posted by meynaf

Why would I ? Too small example, too much OS specific code. The fact that it's optimised for speed rather than code size doesn't make the test very meaningful either.

It seems you would have been happier if the 8086 had shown a better code density.

Anyway, the main loop doesn't contain any OS specific code.

And, IMHO, we have a very good case, we prepare code optimized for speed and get its size - this gives us the size/density of the fastest code.

Quote:

Originally Posted by Don_Adan

"Thank you but this code is outside any loops so IMHO more clear logic is better for this case."
Yes, or.w d5,d5 is outside loops, but why you used Max. digits value, if you wasted memory for nothing on Amiga version? It was only example, perhaps more unnecessary code can exist.

Sorry, it seems I missed something.

Would you like please to clarify your phrase "but why you used Max. digits value"? Thank you.
BTW I removed that OR; it is a tiny and insignificant change, but an improvement! Thank you very much.

Quote:

Originally Posted by Don_Adan

I will use AddIntServer and RemIntServer from exec.

Thank you. But you pointed out to me earlier that I wasted 2 bytes.

Quote:

Originally Posted by Don_Adan

Or use something like this:
http://eab.abime.net/showpost.php?p=552625&postcount=43

Why use this complexity if we have AddIntServer/RemIntServer?

Quote:

Originally Posted by Don_Adan

Or better use/adapt time measure routine from c2p routine. Originally written by Jim Drew.
http://eab.abime.net/showpost.php?p=...&postcount=235

Thank you very much. I am going to dig into this code. However this code is too low level, its author uses numbers instead of library function names.

Quote:

Originally Posted by saimo

It's been about a week since I forced myself to stop working on my new game because of serious sleep issues (in fact, I can't always think straight). But this divu thing got me intrigued, so I just couldn't help but make more tests

You have shown a kind of magic of exact 68020/30 cycle counting. Thank you very much. I would like to learn this wisdom someday.
BTW, would you please run pi-amiga, pi-amiga1200 and pi-amigax on your Blizzard 1230-IV @50 MHz for me (100, 1000, 3000 digits)? The archive is attached.

Quote:

Originally Posted by modrobert

Reverse order...

Could you run pi-amiga, pi-amiga1200, pi-amigax (from pi-amiga-11-beta.zip) for 3000 digits? It can help to find more details about the 68020. BTW I have made several tests with my old 80386 board. I tested the following instruction sequence:

Code:

.loop:   nop
         ...
         jmp .loop

It was a surprise for me that the 80386 is faster when .loop is double word aligned! I was sure that the x86 ignores code alignment completely.

Don_Adan · 08 May 2021, 10:40

Quote:

Originally Posted by litwr

It seems you would have been happier if the 8086 had showed a better code density.

Anyway, the main loop doesn't contain any OS specific code.

And, IMHO, we have a very good case, we prepare code optimized for speed and get its size - this gives us the size/density of the fastest code.

Sorry, it seems I missed something.

Could you please clarify your phrase "but why you used Max. digits value"? Thank you.
BTW I removed that OR; it is a tiny and insignificant change, but an improvement! Thank you very much.

Thank you. But you pointed out earlier that I wasted 2 bytes.

Why use this complexity if we have AddIntServer/RemIntServer?

Thank you very much. I am going to dig into this code. However this code is too low level, its author uses numbers instead of library function names.

You have shown a kind of magic of exact 68020/30 cycle counting. Thank you very much. I would like to learn this wisdom someday.
BTW could you please run pi-amiga, pi-amiga1200, pi-amigax on your Blizzard 1230-IV @50 MHz for me (100, 1000, 3000 digits)? The archive is attached.

Could you run pi-amiga, pi-amiga1200, pi-amigax (from pi-amiga-11-beta.zip) for 3000 digits? It can help to find more details about the 68020. BTW I have made several tests with my old 80386 board. I tested the following instruction sequence:

Code:

.loop:   nop
         ...
         jmp .loop

It was a surprise for me that the 80386 is faster when .loop is double word aligned! I was sure that the x86 ignores code alignment completely.

Why using get VBR? Because you can learn something new about Amiga.

Why you used Max Digits value, if your Amiga code is inefficiency? For Atari you have 9288 digits, for Amiga only 9252 digits.

And you wasted much more than 2 bytes.
Why do you allocate and free memory for this routine when a BSS section is enough?

Section BSS,Digits

ds.b $10000-(endmark-start)

The best option is using Code_BSS, then only one memory area will be used and code can be fully PC relative.

If you want to reach good number of digits, your code must be fully PC relative.

litwr · 08 May 2021, 12:39

Quote:

Originally Posted by Don_Adan

Why using get VBR? Because you can learn something new about Amiga.
Why you used Max Digits value, if your Amiga code is inefficiency? For Atari you have 9288 digits, for Amiga only 9252 digits.
And you wasted much more than 2 bytes. Why you allocated/freeing memory for this routine? When BSS section is enough?

Section BSS,Digits
ds.b $10000-(endmark-start)

The best option is using Code_BSS, then only one memory area will be used and code can be fully PC relative.

Thanks for the idea about BSS. Of course, it would be best to use BSS instead of dancing around AllocMem/FreeMem. However, the gain is not as good as I would like, because I don't know how to combine CODE and BSS into one section. Is it possible? It would be good to have something like

Code:

msg2  dc.b 10
bss
mydata   blk.b 60000    ;it is a separate section, this makes a file larger :(

Is there a way to ask the system just to reserve memory and append it to a loaded code section? I tried to use Section Code_BSS but this just reserves memory on every DS.

How can I create a true BSS region inside a Code_BSS section?

Indeed even a separate BSS section gives some gain. Maybe I will make this tiny improvement (which gives us less than 10 digits) sometime. Of course I remember your super tricky way to use the stack, but I would like to use more conventional coding. And, you know, the main goal of my project is maximum speed; size optimization is a secondary and unimportant goal there.

Sorry, some of your remarks are still cryptic to me. Could you please clarify some of your phrases?

"Why using get VBR? Because you can learn something new about Amiga." - Sorry, I completely missed your idea here.

"Why you used Max Digits value, if your Amiga code is inefficiency?" - I still don't understand what is wrong with the Max Digits value. You know, the Amiga OS doesn't have a symbol input function, so the getnum functions for the Amiga and Atari ST are not the same. You can notice that TOS allows us to reuse the memory allocated for this function for the main array later. I don't know how to use the same approach under WB.

Quote:

Originally Posted by Don_Adan

If you want to reach good number of digits, your code must be fully PC relative.

IMHO it is quite PC-relative now. The first version of the pi-spigot for the Amiga was less PC-relative. That was fixed quite long ago, thanks to the EAB experts.

saimo · 08 May 2021, 13:31

Quote:

Originally Posted by litwr

It seems that your code has the same size and maybe it will be a bit faster on the 68060. But I doubt that it will be faster on the 68020/30. Maybe I prepare code to test your idea later.

Even if it weren't faster on 68020/30, it wouldn't hurt to have it perform better on 68060, would it?

And, anyway, I'm confident it runs a bit faster on all 68020+ CPUs - if I get a chance, I'll make a test later myself.

Quote:

The Amiga documentation has a note about Disable(): DO NOT USE THIS CALL WITHOUT GOOD JUSTIFICATION. THIS CALL IS VERY DANGEROUS! Of course, my timing code is not perfect but it is very short and it works. However, if you can show me a better timing code snippet, it will be welcome. I really miss a good code snippet which uses a timer.

Yes, the function comes with that warning - and that warning is wise. However, for just some internal testing, you can use it.
To make reliable tests on an unexpanded Amiga, you should also turn DMA off (because DMA affects the access of the CPU to the RAM), but a proper takeover and restore code isn't trivial (and still might cause issues on some machines). Anyway, exclusively for internal testing and to quickly make the tests, you could use move.w #$4000,$dff096 before the test loop and move.w #$c000,$dff096 after it. Please note that this is extremely brute code, so, it's best to run the tests on a minimal environment, i.e. after booting the machine without startup-sequence.

Regarding the timer code, for maximum precision, hardware (CIA timers) should be used directly - so, again, a proper startup and restore code is needed. In absence of that, it's best to use the OS functions. I see that Don Adan already provided you with a couple of pointers.

Quote:

Thank you but it is rather super-scalar optimization. My code is too ancient for it.

Sorry, but that doesn't mean anything. I simply changed the order of some instructions, replaced some other instructions with AND and replaced a BRA with a SUBQ and a BCS: hardly rocket science. The code remains simple and works fine on any 68020+ CPU.
For convenience, only because this thread became quite complicated, here's the code again:

Code:

         move.l  #$ffff,d3
         ...
         bra.b   .l4

.longdiv divul.l d4,d7:d6
         move.w  d7,(a3)

         subq.l  #2,d4
         bcs.b   .enddiv

.l2      sub.l   d6,d5
         sub.l   d7,d5

.l4      move.w  -(a3),d0
         lsr.l   #1,d5
         mulu.w  d1,d0
         add.l   d0,d5
         move.l  d5,d6
         divu.w  d4,d6
         bvs.b   .longdiv

         move.w  d6,d7
         swap.w  d6
         and.l   d3,d7
         move.w  d6,(a3)
         and.l   d3,d6

         subq.l  #2,d4
         bcc.b   .l2

.enddiv

Quote:

Thank you! It seems this suggestion resolves the big mystery of modrobert's results. The 68020 code has a label that has word alignment and this slow down the 68020. It is interesting that the 68030 does not slow down in this case (at least when all code fits its instruction cache) - results from the Atari TT confirm this.

Yes, the 68030 is less sensitive to code alignment. But as long as you don't have a reliable way to measure time, don't be too sure about the NOP solution.

Quote:

It seems you missed the idea of BVS-optimization. It just allows to use BVS.S with no branch taken for more often case and it is faster for the 68000 and 68020/30.

Well, I perfectly understood that (and even expressly agreed with it in post #19): in post #56 I simply brain-farted (added explanation directly in the post).

Quote:

BTW could you please run pi-amiga, pi-amiga1200, pi-amigax on your Blizzard 1230-IV @50 MHz for me (100, 1000, 3000 digits)? The archive is attached.

Sure. I hope to get back to you some time later today or tomorrow.

EDIT

Results on 68020 (3000 digits):
* pi-amiga: 37.32
* pi-amiga1200 and pi-amigax: 36.94

Results on 68030: all .00, regardless of the number of digits.

saimo · 08 May 2021, 13:44

Quote:

Originally Posted by modrobert

I find your results interesting, and the reward for doing this is learning more; the very foundation of every future decision. Basic research vs being productive, you need both to stay sane.

Well, the problem is that all that work didn't really tell us much: we already knew that the timings are basically the same for both the 68020 and 68030 (as per the official manuals) and that the input value doesn't make a difference on 68020 (as per its manual). It was therefore likely that also on the 68030 the speed didn't depend on the input value, even if the manual says "Indicates Maximum Time (Acutal time is data dependent)" (typo not mine), so basically the tests simply confirmed just that. Quite little... but it's been fun!

Don_Adan · 08 May 2021, 15:40

Quote:

Originally Posted by litwr

Thanks for the idea about BSS. Of course, it would be best to use BSS instead of dancing around AllocMem/FreeMem. However, the gain is not as good as I would like, because I don't know how to combine CODE and BSS into one section. Is it possible? It would be good to have something like

Code:

msg2  dc.b 10
bss
mydata   blk.b 60000    ;it is a separate section, this makes a file larger :(

Is there a way to ask the system just to reserve memory and append it to a loaded code section? I tried to use Section Code_BSS but this just reserves memory on every DS.

How can I create a true BSS region inside a Code_BSS section?

Indeed even a separate BSS section gives some gain. Maybe I will make this tiny improvement (which gives us less than 10 digits) sometime. Of course I remember your super tricky way to use the stack, but I would like to use more conventional coding. And, you know, the main goal of my project is maximum speed; size optimization is a secondary and unimportant goal there.

Sorry, some of your remarks are still cryptic to me. Could you please clarify some of your phrases?

"Why using get VBR? Because you can learn something new about Amiga." - Sorry, I completely missed your idea here.

"Why you used Max Digits value, if your Amiga code is inefficiency?" - I still don't understand what is wrong with the Max Digits value. You know, the Amiga OS doesn't have a symbol input function, so the getnum functions for the Amiga and Atari ST are not the same. You can notice that TOS allows us to reuse the memory allocated for this function for the main array later. I don't know how to use the same approach under WB.

IMHO it is quite PC-relative now. The first version of the pi-spigot for the Amiga was less PC-relative. That was fixed quite long ago, thanks to the EAB experts.

I always created the Code_BSS section manually when I needed it. You can use Vasm; it can create a Code_BSS (or maybe BSS_Code) hunk, if I remember Phx's info correctly. But I have never used this assembler.
The stack version was Ross's idea; you can use the normal version with a Code_BSS section.

Regarding VBR, you must remember that on 68010+ CPUs the VBR is not necessarily at address $0. In that case your code will not work or may crash.

And yes your code can be optimised/shortened much more than for only 10 digits.
You used:

maxn dc.w 0

move.l d0,maxn

You know that this is buggy? move.l writes four bytes into the two-byte maxn. Of course, if you want you can overwrite the "dos.library" name; it was originally meynaf's idea for the shortest code.
Anyway, the easiest and shortest fix is to replace maxn with the D7 register.
move.l d0,d7

move.l d7,d5

cmp.w d7,d5

This is used for something?

msg6 dc.b 'no fast memory',10

For me only wasted memory.

modrobert · 08 May 2021, 16:11

Quote:

Originally Posted by saimo

Well, the problem is that all that work didn't really tell us much: we already knew that the timings are basically the same for both the 68020 and 68030 (as per the official manuals) and that the input value doesn't make a difference on 68020 (as per its manual). It was therefore likely that also on the 68030 the speed didn't depend on the input value, even if the manual says "Indicates Maximum Time (Acutal time is data dependent)" (typo not mine), so basically the tests simply confirmed just that. Quite little... but it's been fun!

Yes, fun indeed. I just learned that 4 million NOP instructions takes 0.70 seconds to run on a 68EC020 @ 14MHz.
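As a rough sanity check on that figure, the elapsed time can be converted into CPU cycles per NOP. This is only a sketch: it assumes the standard A1200 clock of 14.18758 MHz and that the 0.70 s covers nothing but the 4 million NOPs; the names are made up for illustration.

```python
# Rough per-NOP cost implied by "4 million NOPs in 0.70 seconds".
# Assumption: CPU clock is the standard A1200 figure, 14.18758 MHz.
CPU_HZ = 14_187_580
ELAPSED_S = 0.70
NOP_COUNT = 4_000_000


def cycles_per_instruction(elapsed_s: float, count: int, cpu_hz: int = CPU_HZ) -> float:
    """Average number of CPU cycles spent on each executed instruction."""
    return elapsed_s * cpu_hz / count


nop_cycles = cycles_per_instruction(ELAPSED_S, NOP_COUNT)
print(f"{nop_cycles:.2f} cycles per NOP")
```

Under those assumptions the result is close to 2.5 cycles per NOP.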

Seriously though, I think optimizing is so much fun. Besides being a lost art, it's such a challenge on classic hardware and useful knowledge in all types of programming. Know the hardware, know the software, now make it fast.

EDIT:

litwr,

Code:

> pi-amiga
number pi calculator v11 (beta)(68000)
number of digits (up to 9248)? 3000
314159...  35.30

Code:

> pi-amiga1200
number pi calculator v11 (beta)(68020)
number of digits (up to 9248)? 3000
314159...  34.70

Code:

> pi-amigax
number pi calculator v11 (Beta, SuperScalar)(68020)
number of digits (up to 9244)? 3000
314159...  34.68

Don_Adan · 08 May 2021, 19:56

BTW. When you made one hunk (code and bss) version, you can easy beat Atari Max Digits value.
BTW2. For good, but unfair code you can use all 64 KB for Max Digits.

litwr · 09 May 2021, 13:58

Quote:

Originally Posted by saimo

Even if it weren't faster on 68020/30, it wouldn't hurt to have it perform better on 68060, would it?

And, anyway, I'm confident it runs a bit faster on all 68020+ CPUs - if I get a chance, I'll make a test later myself.

It is sad that the 68060 was not used in any mainstream computer. IMHO it was not worse than the first PPC. But special optimization for the 68060 can drive me crazy, because it is very difficult to test code for this processor. Emulators are very inaccurate even for the 68020, and I think that a 68060 emulator still does not exist at all. Your and modrobert's results proved that your optimizations do not accelerate the 68020/30.

Quote:

Originally Posted by saimo

Yes, the function comes with that warning - and that warning is wise. However, for just some internal testing, you can use it.

Now my code uses Forbid/Permit and this makes it less than 1% faster.

Quote:

Originally Posted by saimo

To make reliable tests on an unexpanded Amiga, you should also turn DMA off (because DMA affects the access of the CPU to the RAM), but a proper takeover and restore code isn't trivial (and still might cause issues on some machines). Anyway, exclusively for internal testing and to quickly make the tests, you could use move.w #$4000,$dff096 before the test loop and move.w #$c000,$dff096 after it. Please note that this is extremely brute code, so, it's best to run the tests on a minimal environment, i.e. after booting the machine without startup-sequence.

I have just tested this. I didn't notice any speed change; at most there were deviations of less than 0.1% in the results.

Quote:

Originally Posted by saimo

Regarding the timer code, for maximum precision, hardware (CIA timers) should be used directly - so, again, a proper startup and restore code is needed. In absence of that, it's best to use the OS functions. I see that Don Adan already provided you with a couple of pointers.

IMHO Vertical blank interrupts are quite good too.

Quote:

Originally Posted by saimo

For convenience, only because this thread became quite complicated, here's the code again:

You already tested your code - it is in pi-amigax.

Quote:

Originally Posted by saimo

Yes, the 68030 is less sensitive to code alignment. But as long as you don't have a reliable way to measure time, don't be too sure about the NOP solution.

This subject requires exact numbers. IMHO time measurement has been quite accurate. You can check it by your stopwatch.

Quote:

Originally Posted by saimo

Results on 68020 (3000 digits):
* pi-amiga: 37.32
* pi-amiga1200 and pi-amigax: 36.94
Results on 68030: all .00, regardless of the number of digits.

Thank you very much. Your and modrobert's results show that MULUopt=1 saves 2 cycles.
It seems your 68030 system relocates the interrupt vector table. It is the first time I have met such a thing. Thanks to Don_Adan, I have just made new code which uses AddIntServer/RemIntServer instead of working directly with the interrupt vector. Could you please rerun the new code (it is attached) on your 68030 hardware for me?

Quote:

Originally Posted by Don_Adan

I always manually created Code_BSS section, if I needed. You can use Vasm, it has possibility to create Code_BSS (or maybe BSS_Code) hunk, if I remember right Phx infos. But i never used this assembler.
Stack version it was Ross idea, you can use normal version with Code_ Bss section.

Sorry, I don't know how to do this. Any hints?
Now I just use a BSS section, but all the gain from it and from your other optimizations was eaten by the new AddIntServer/RemIntServer code. It is also sad that VASM doesn't allow us to use

Code:

ds.b 65536-endmark+start

and I have to use ugly code instead. Is VASM a multi-pass assembler or not?! BTW do you know how to get a section size? Maybe this would help to fix the ugliness mentioned above.

Quote:

Originally Posted by Don_Adan

Regarding VBR, you must remember that on 68010+ CPUs the VBR is not necessarily at address $0. In that case your code will not work or may crash.

Thank you very much. Your point has been proven by the fact that saimo's 68030 board doesn't like my previous way to get timings.

Quote:

Originally Posted by Don_Adan

And yes your code can be optimized/shortened much more than for only 10 digits.

10 digits of the number pi mean 70 bytes and all your optimizations give less than 10 bytes.
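To put numbers on that trade-off, here is a small sketch. It assumes the cost scales linearly at 7 bytes per digit (taken from the "10 digits mean 70 bytes" figure above); the function names are hypothetical, and the 9288/9252 limits are the Atari and Amiga values quoted earlier in the thread.

```python
# Memory cost of pi digits, assuming a linear 7 bytes per digit
# (derived from "10 digits ... mean 70 bytes"; an assumption, not a measured value).
BYTES_PER_DIGIT = 7


def bytes_for_digits(digits: int) -> int:
    """Array bytes needed to compute the given number of digits."""
    return digits * BYTES_PER_DIGIT


def digits_from_bytes(saved_bytes: int) -> int:
    """Whole extra digits that a given number of saved bytes can buy."""
    return saved_bytes // BYTES_PER_DIGIT


extra_digits = digits_from_bytes(10)                         # gain from saving 10 bytes
atari_gap = bytes_for_digits(9288) - bytes_for_digits(9252)  # bytes behind the Atari/Amiga gap
print(extra_digits, atari_gap)
```

Under this assumption, saving 10 bytes buys a single digit, while closing the 36-digit gap to the Atari version needs 252 bytes.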

Quote:

Originally Posted by Don_Adan

You know that this is buggy? Of course, if you want you can overwrite "dos.library" name, it was originally meynaf's idea for shortest code.

This name was already used so it couldn't create an issue.

Anyway I removed the maxn-variable completely. Thank you.

Quote:

Originally Posted by Don_Adan

For me only wasted memory.

Thank you I already fixed it a bit earlier.

Quote:

Originally Posted by modrobert

Yes, fun indeed. I just learned that 4 million NOP instructions takes 0.70 seconds to run on a 68EC020 @ 14MHz.

Why? 4 million NOPs mean 8 MB of continuous code - it is not easy for the plain Amiga.

Quote:

Originally Posted by modrobert

Code:

> pi-amiga
number pi calculator v11 (beta)(68000)
number of digits (up to 9248)? 3000
314159...  35.30

Code:

> pi-amiga1200
number pi calculator v11 (beta)(68020)
number of digits (up to 9248)? 3000
314159...  34.70

Code:

> pi-amigax
number pi calculator v11 (Beta, SuperScalar)(68020)
number of digits (up to 9244)? 3000
314159...  34.68

Thank you very much! Now your results correspond to my theories.

Quote:

Originally Posted by Don_Adan

BTW. When you made one hunk (code and bss) version, you can easy beat Atari Max Digits value.
BTW2. For good, but unfair code you can use all 64 KB for Max Digits.

Please be less cryptic. Could you please provide us with details?

Don_Adan · 09 May 2021, 16:31

I don't know what is necessary (option or naming) to make Vasm create a Code_BSS section automatically.
But perhaps you can find some info here:

http://eab.abime.net/showthread.php?t=97310

I always did this manually.
Assemble your source as one section program with
ds.b $10000-(endmark-start)

at end of source.
Later, manually cut/remove the empty bytes created by the ds.b $10000-(endmark-start) part, and manually edit one longword in the Amiga exe header. You can find which one by comparing the normal code version and the code_bss version of meynaf's pi routine from the old (pi?) thread.
For your routine specifically, you must replace the second $00004000 value in the Amiga exe file with the (endmark-start+3)/4 value.

All sizes of pure sections in Amiga exe are stored as size/4 (longword).

Then section 16 bytes is stored as $00000004, 1024 bytes as $00000100, 65536 bytes as $00004000.

Of course target sections like Code_C(hip), Code_F(ast), Data_C(hip) etc set some bits in stored longword too, but this is not your case.
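The size encoding described above can be sketched as follows. This is a minimal illustration of the rounding rule only, not a hunk editor; the function name is hypothetical, and the memory-type flag bits mentioned for Code_C and similar sections are ignored.

```python
# Section sizes in an Amiga executable are stored in longwords (4 bytes),
# so a byte count is rounded up to the next multiple of 4 and divided by 4.
def size_in_longwords(size_bytes: int) -> int:
    """Value to store for a section of size_bytes bytes, i.e. (size+3)/4."""
    return (size_bytes + 3) // 4


for size in (16, 1024, 65536):
    print(f"{size:5d} bytes -> ${size_in_longwords(size):08x}")
```

The same function applied to endmark-start gives the value to patch in place of the second $00004000.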

saimo · 07 May 2021, 00:04 (post #69)

It's about a week that I forced myself to stop working on my new game because of serious sleep issues (in fact, I can't always think straight). But this divu thing got me intrigued, so I just couldn't help but make more tests.

I decided to check the divu instructions with broader ranges of values (always ensuring that overflow does not occur), comparing them against one another, and seeing how they behave on 68020 and 68030. The tests were made with interrupts and DMA off, and the time has been measured using the 32-bit timer from CIA A.

First of all, an overview of the tests performed and of the results:

Code:

 # | OPERATION        | DIVIDEND          | DIVISOR                 | ITERATIONS | TIME 68020 | TIME 68030
---+------------------+-------------------+-------------------------+------------+------------+------------
 1 | 32/16 -> 16q 16r | 2^16-1            | 1 ... 2^16-1            | 2^16-1     |     180325 |      50210
   | divu.w dx,dy     | $ffff             | 1 ... 65535             | 65535      |   § 176949 |    § 50210
---+------------------+-------------------+-------------------------+------------+------------+------------
 2 | 32/16 -> 16q 16r | (2^16-1) * 2^15   | 2^15 ... 2^16-1         | 2^15       |      88578 |      25106
   | divu.w dx,dy     | $7fff8000         | 32768 ... 65535         | 32768      |    § 90166 |    § 25106
---+------------------+-------------------+-------------------------+------------+------------+------------
 3 | 32/32 -> 32q     | 2^32-1            | 1 ... 2^20              | 2^20       |    4718597 |    1338906
   | divu.l dx,dy     | $ffffffff         | 1 ... 1048576           | 1048576    |  § 4875883 |  § 1368658
---+------------------+-------------------+-------------------------+------------+------------+------------
 4 | 32/32 -> 32q 32r | 2^32-1            | 1 ... 2^20              | 2^20       |    4718597 |    1338906
   | divul.l dx,dy:dz | $ffffffff         | 1 ... 1048576           | 1048576    |  § 4771025 |  § 1338905
---+------------------+-------------------+-------------------------+------------+------------+------------
 5 | 64/32 -> 32q 32r | 2^32-1            | 1 ... 2^16-1            | 2^16-1     |     304742 |      85541
   | divu.l dx,dy:dz  | $ffffffff         | 1 ... 65535             | 65535      |   § 301465 |    § 85542
---+------------------+-------------------+-------------------------+------------+------------+------------
 6 | 64/32 -> 32q 32r | (2^32-1) * 2^31   | 2^32-2^20 ... 2^32-1    | 2^20       |    4823456 |    1368660
   | divu.l dx,dy:dz  | $7fffffff80000000 | $fff00000 ... $ffffffff | 1048576    |  § 4875884 |  § 1368659

The times are expressed in CIA clocks.
The times are relative to the whole core loops, not just to the divu instructions.
§ = time when the code alignment was altered with a nop before the core loop.

First considerations:
* the 68020 is more sensitive than the 68030 to alignment;
* 32- and 64-bit divus seem to perform very similarly;
* 16-bit divisions seem faster (as one would expect);
* remainders do not seem to impact performance (as one would expect);
* input data does not seem to affect performance (as the MC68020UM implies and contrary to what the MC68030UM says).

Now, given that the number of iterations differ, to be able to compare the times, let's see how long a single iteration took on average (best times only):

Code:

 # | OPERATION        | ITERATIONS | TIME 020 | TIME 030 | AVERAGE 68020  | AVERAGE 68030
---+------------------+------------+----------+----------+----------------+----------------
 1 | divu.w dx,dy     |      65535 |   176949 |    50210 | 2.700068665599 | 0.766155489433
---+------------------+------------+----------+----------+----------------+----------------
 2 | divu.w dx,dy     |      32768 |    88578 |    25106 | 2.703186035156 | 0.766174316406
---+------------------+------------+----------+----------+----------------+----------------
 3 | divu.l dx,dy     |    1048576 |  4718597 |  1338906 | 4.500004768372 | 1.276880264282
---+------------------+------------+----------+----------+----------------+----------------
 4 | divul.l dx,dy:dz |    1048576 |  4718597 |  1338906 | 4.500004768372 | 1.276880264282
---+------------------+------------+----------+----------+----------------+----------------
 5 | divu.l dx,dy:dz  |      65535 |   301465 |    85541 | 4.600061036088 | 1.305271992065
---+------------------+------------+----------+----------+----------------+----------------
 6 | divu.l dx,dy:dz  |    1048576 |  4823456 |  1368659 | 4.600006103516 | 1.305254936218

This clearly shows that:
* 16-bit divus are faster;
* 32-bit divus perform equally regardless of the remainder;
* 64-bit divus perform equally regardless of the remainder.

It also seems to indicate that 64-bit divus are slower than 32-bit divus - but that isn't the case, because the difference is due to the extra code in the core loops. So, before proceeding further, let's look at the core loops.

Code:

TEST #1
         move.l  #$ffff,d2       ;$0000ffff
         moveq.l #1,d7           ;i = 1
.l       move.l  d2,d0           ;$0000ffff
         divu.w  d7,d0           ;$0000ffff/i
         addq.w  #1,d7           ;++i
         bne.b   .l

TEST #2
         move.l  #$7fff8000,d2   ;$ffff*2^15
         move.w  #$8000,d7       ;i = 2^15
.l       move.l  d2,d0           ;$ffff*2^15
         divu.w  d7,d0           ;($ffff*2^15)/i
         addq.w  #1,d7           ;++i
         bne.b   .l

TEST #3
         moveq.l #-1,d2          ;$ffffffff
         move.l  #$100000,d7     ;i = 2^20
.l       move.l  d2,d0           ;$ffffffff
         divu.l  d7,d0           ;($ffffffff)/i
         subq.l  #1,d7           ;--i
         bne.b   .l

TEST #4
         moveq.l #-1,d2          ;$ffffffff
         move.l  #$100000,d7     ;i = 2^20
.l       move.l  d2,d0           ;$ffffffff
         divul.l d7,d1:d0        ;($ffffffff)/i
         subq.l  #1,d7           ;--i
         bne.b   .l

TEST #5
         moveq.l #-1,d2          ;$ffffffff
         moveq.l #1,d7           ;i = 1
.l       clr.l   d1              ;0
         move.l  d2,d0           ;$ffffffff
         divu.l  d7,d1:d0        ;($00000000ffffffff)/i
         addq.w  #1,d7           ;++i
         bne.b   .l

TEST #6
         move.l  #$7fffffff,d3   ;$7fffffff
         move.l  #$80000000,d2   ;$80000000
         move.l  #$fff00000,d7   ;i = 2^32-2^20
.l       move.l  d3,d1           ;$7fffffff
         move.l  d2,d0           ;$80000000
         divu.l  d7,d1:d0        ;((2^32-1)*2^31)/i
         addq.l  #1,d7           ;++i
         bne.b   .l

The divus of tests #3, #4, #5 and #6 all perform at the same speed: in fact, by adding a dummy move to test #3 (the same could be done with test #4)...

Code:

         moveq.l #-1,d2          ;$ffffffff
         move.l  #$100000,d7     ;i = 2^20
.l       move.l  d4,d5           ;dummy operation
         move.l  d2,d0           ;$ffffffff
         divu.l  d7,d0           ;($ffffffff)/i
         subq.l  #1,d7           ;--i
         bne.b   .l

... to make the non-divu instructions cycles equal to those of tests #5 and #6, the execution time becomes exactly the same.

Now let's compare the performance of the two CPUs:

Code:

 # | OPERATION        | TIME 020 | TIME 030 | TIME 020 / TIME 030
---+------------------+----------+----------+---------------------
 1 | divu.w dx,dy     |   176949 |    50210 | 3.524178450508
---+------------------+----------+----------+---------------------
 2 | divu.w dx,dy     |    88578 |    25106 | 3.52816059906
---+------------------+----------+----------+---------------------
 3 | divu.l dx,dy     |  4718597 |  1338906 | 3.5242182797
---+------------------+----------+----------+---------------------
 4 | divul.l dx,dy:dz |  4718597 |  1338906 | 3.5242182797
---+------------------+----------+----------+---------------------
 5 | divu.l dx,dy:dz  |   301465 |    85541 | 3.524216457605
---+------------------+----------+----------+---------------------
 6 | divu.l dx,dy:dz  |  4823456 |  1368659 | 3.524220423056

The 68020 of the standard A1200 runs at 14.18758 MHz, while the 68030 on the Blizzard 1230-IV I used runs at 50 MHz. Therefore, frequency-wise, the 68030 runs 50 / 14.18758 = 3.524209202697 times faster than the 68020, exactly like the table indicates. This also means that the divus implementations are the same on both the CPUs. In fact, their user's manuals indicate the very same timings (for the cache-case, that is, but also the other cases are almost identical). I guess that where the MC68030UM says that the actual timing depends on the input data, it refers to divisions by 0 and overflows.

Finally, one last question: are the results reliable? Let's look at the test #1 code and at its cycles on the 68020, and let's compare them with the result obtained experimentally.

Code:

         move.b  #$41,$bfee01    ;(reload timer and start it)
         move.l  #$ffff,d2       ;5w 6c (yes, the cache-case is said to be worse than the worst-case)
         moveq.l #1,d7           ;3w 2c
.l       move.l  d2,d0           ;3w 2c
         divu.w  d7,d0           ;44w 44c
         addq.w  #1,d7           ;3w 2c
         bne.b   .l              ;9w 6c 4cl
         clr.b   $bfee01         ;6w+5w 4c+4c (stop timer)

w = worst-case
c = cache-case
l = last iteration

Initially, the CPU has to fetch the instructions from RAM and the initialization instructions as well (by the way, only now it dawned on me that I should have put them before the write to the CIA CRA register, but I'm too tired now to redo all the tests and calculations again), so the execution is slower (worst-case) and takes 64 cycles:

Code:

longword-aligned code case:
         move.l  #$ffff,d2       ;5
         moveq.l #1,d7           ;2 (because it has been prefetched with the previous instruction)
.l       move.l  d2,d0           ;3
         divu.w  d7,d0           ;44
         addq.w  #1,d7           ;3
         bne.b   .l              ;7 (because the opcode has been prefetched, so only the offset needs an additional read)

word-aligned code case:
         move.l  #$ffff,d2       ;4 (because the opcode has been prefetched with the previous instruction)
         moveq.l #1,d7           ;3
.l       move.l  d2,d0           ;2 (because it has been prefetched with the previous instruction)
         divu.w  d7,d0           ;44
         addq.w  #1,d7           ;2 (because it has been prefetched with the previous instruction)
         bne.b   .l              ;9

Afterwards, each iteration takes 2+44+2+6 = 54 cycles.
The last iteration takes 2 cycles less as the branch is not taken, i.e. 52 cycles.
Finally, the write to the CIA has to be evaluated taking into account that its timing has a base time (6w 4c) plus the calculate effective address time (5w 4c). Depending on the code alignment, the opcode might already be in the cache thanks to the 32-bit fetch for bne, so, in that case, the time is 4c+5w = 9 cycles; otherwise, the time is 6w+5w = 11 cycles.
Therefore, theoretically, the whole execution takes 64+54*65533+52+9 = 3538907 or 64+54*65533+52+11 = 3538909 cycles.

The CIA runs at 0.709379 MHz, i.e. 1/20th of the CPU speed, so the elapsed time in CPU cycles is 176949*20 = 3538980.

The difference between the actual time and the theoretical time is thus 3538980-3538907 = 73 or 3538980-3538909 = 71 cycles. I guess that it can be explained as follows:
* the instructions are fetched from CHIP RAM, which is slower than what the MC68020UM assumes (its timings are relative to a 0-wait-state RAM); namely, the CHIP RAM runs at 1/4 of the CPU frequency; given that when the instructions are not in the cache theoretically 64-54+5 = 15 or 64-54+7 = 17 cycles more for RAM accesses are needed, execution takes actually 15*4 = 60 or 17*4 = 68 cycles longer;
* the access to the CIA is slow due to the slower frequency of the chip (1/20 of the CPU frequency), so, in the worst case, clr could actually take 19 cycles longer.

Even if I made some mistake, and even without considering the performance penalty factors, 73 out of 3538980 cycles represent a 0.002063% error, so I'd say that the tests are reliable. That's also supported by the fact that I ran the tests multiple times, always obtaining the same results.

EDIT START

I just had to see how test #1 performs with minimized overhead, so I rewrote it like this:

Code:

         move.l  #$ffff,d2
         moveq.l #1,d7
         lea.l   $bfee01,a5
         move.b  #$41,(a5)       ;(reload timer and start it)
.l       move.l  d2,d0           ;3w 2c
         divu.w  d7,d0           ;44w 44c
         addq.w  #1,d7           ;3w 2c
         bne.b   .l              ;9w 6c 4cl
         clr.b   (a5)            ;6w+2w 4c+2c (stop timer)

w = worst-case
c = cache-case
l = last iteration

The actual cycles measured on an unexpanded A1200 are:
* without nop before the code: 176946;
* with nop before the code: 180323.

(Side note: this means that, in the best case, the cost of the overhead of the previous version of the code was (176949-176946)*20 = 60 cycles.)

When the code is not cached, the theoretical cycles are:

Code:

longword-aligned code case:
.l       move.l  d2,d0           ;3
         divu.w  d7,d0           ;44
         addq.w  #1,d7           ;3
         bne.b   .l              ;7 (because the opcode has been prefetched, so only the offset needs an additional read)
         clr.b   (a5)            ;4+2 = 6 (because it has been prefetched with the previous instruction)

word-aligned code case:
.l       move.l  d2,d0           ;2 (because it has been prefetched with the previous instruction)
         divu.w  d7,d0           ;44
         addq.w  #1,d7           ;2 (because it has been prefetched with the previous instruction)
         bne.b   .l              ;9
         clr.b   (a5)            ;6+2 = 8

Therefore, considering the best case:
* first iteration: 57 cycles;
* next iterations: 54 cycles;
* last iteration: 52 cycles;
* final write: 6 cycles;
* total: 57+54*65533+52+6 = 3538897 cycles.

The difference between actual time and theoretical time is 176946*20-3538897 = 23 cycles.
By looking at the timings above, we see that the additional cycles to fetch the instructions from RAM are 57-54 = 3. Due to the CHIP RAM slowness, those actually amount to 3*4 = 12 cycles. The remaining 23-12 = 11 cycles should be due to the access to the CIA.
To be honest, I didn't write down on paper a timing chart of the CPU activity (and, even if I tried, there is no documentation that explains how to do that exactly), so calculations might be a bit off here and there, but still the closeness of the figures and the 100% stable test results prove even better that the measured times are accurate enough for the purpose of evaluating the divu operations performance.

EDIT END

Conclusions? Don't get involved in threads that might steal you an enormous amount of time for stuff you'll never have a use for anyway.

@litwr
I'll get back to you tomorrow ASAP (edit: couldn't sleep). Now I should really try to get some sleep.

Last edited by saimo; 08 May 2021 at 00:34.
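The cycle bookkeeping for test #1 in the post above can be re-checked with a few lines. This is just a sketch of the arithmetic, using only figures quoted in the post (54 cycles per cached iteration, 65535 iterations, a CIA clock at 1/20th of the CPU clock); the function and variable names are illustrative.

```python
# Re-checks the theoretical vs. measured cycle counts for the divu.w test (#1).
CIA_TO_CPU = 20      # one CIA tick corresponds to 20 CPU cycles
LOOP_CYCLES = 54     # 2 + 44 + 2 + 6 cycles per cached iteration
ITERATIONS = 65535   # the divisor runs from 1 to 65535


def theoretical_cycles(first: int, last: int, tail: int) -> int:
    """Uncached first pass + cached middle iterations + last iteration + timer stop."""
    middle = LOOP_CYCLES * (ITERATIONS - 2)
    return first + middle + last + tail


def measured_cycles(cia_ticks: int) -> int:
    """Convert a CIA timer reading into CPU cycles."""
    return cia_ticks * CIA_TO_CPU


# Original test: 64-cycle first pass (includes the setup), 9-cycle timer stop.
original_gap = measured_cycles(176949) - theoretical_cycles(64, 52, 9)
# Reworked test: 57-cycle first pass, 6-cycle timer stop.
reworked_gap = measured_cycles(176946) - theoretical_cycles(57, 52, 6)

# Clock ratio between the 50 MHz 68030 and the 14.18758 MHz 68020.
clock_ratio = 50 / 14.18758

print(original_gap, reworked_gap, round(clock_ratio, 6))  # prints: 73 23 3.524209
```

Both gaps match the 73 and 23 cycles derived in the post, and the clock ratio agrees with the measured 020/030 time ratios to about four decimal places.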


Don_Adan · 06 May 2021, 22:42 (#68)

"Thank you but this code is outside any loops so IMHO more clear logic is better for this case."

Yes, or.w d5,d5 is outside the loops, but why did you use the Max. digits value if you wasted memory for nothing in the Amiga version? It was only an example; perhaps more unnecessary code exists.

"It just works. However it will be good if you give me an example of better code to measure time. I have already asked saimo for this"

I would use AddIntServer and RemIntServer from exec. Or use something like this:
http://eab.abime.net/showpost.php?p=552625&postcount=43
Or, better, use/adapt the time measurement routine from the c2p routine, originally written by Jim Drew:
http://eab.abime.net/showpost.php?p=...&postcount=235

Don_Adan · 08 May 2021, 19:56 (#78)

BTW, if you make a one-hunk (code and bss) version, you can easily beat the Atari Max Digits value.
BTW2, for good but unfair code, you can use all 64 KB for Max Digits.

Don_Adan · 09 May 2021, 16:31 (#80)

I don't know what is necessary (an option or a naming convention) to make Vasm create a Code_BSS section automatically. But perhaps you can find some info here:
http://eab.abime.net/showthread.php?t=97310

I always did this manually. Assemble your source as a one-section program with ds.b $10000-(endmark-start) at the end of the source. Then manually cut/remove the empty bytes created by the ds.b $10000-(endmark-start) part, and manually edit one longword in the Amiga exe header. You can find which one by comparing the normal code version and the code_bss version of meynaf's pi routine from the old (pi?) thread.

For your routine specifically, you must replace the second $00004000 value in the Amiga exe file with the (endmark-start+3)/4 value. All sizes of pure sections in an Amiga exe are stored as size/4 (longwords). So a 16-byte section is stored as $00000004, 1024 bytes as $00000100, and 65536 bytes as $00004000. Of course, target sections like Code_C(hip), Code_F(ast), Data_C(hip) etc. also set some bits in the stored longword, but this is not your case.
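
The size arithmetic described above can be tried out with a small Python sketch. This is only an illustration of the two formulas quoted in the post, (endmark-start+3)/4 and $10000-(endmark-start); the helper names `hunk_size_longwords` and `bss_padding` are hypothetical, and the script does not read or patch a real executable.

```python
# Sketch: the size longword and the ds.b padding from the post above.
# hunk_size_longwords and bss_padding are hypothetical helper names.

def hunk_size_longwords(size_bytes):
    """Section size in longwords, rounded up: (size + 3) / 4."""
    return (size_bytes + 3) // 4


def bss_padding(code_bytes, total_bytes=0x10000):
    """Bytes reserved by ds.b $10000-(endmark-start) for a given code length."""
    return total_bytes - code_bytes


# The three examples given in the post.
for size in (16, 1024, 65536):
    print(f"{size} bytes -> ${hunk_size_longwords(size):08X}")
```

For example, with a made-up code length of 1000 bytes (endmark-start = 1000), `bss_padding(1000)` gives 64536 bytes of padding and `hunk_size_longwords(1000)` gives 250, so the longword to write would be $000000FA.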
