English Amiga Board

English Amiga Board (https://eab.abime.net/index.php)
-   Coders. General (https://eab.abime.net/forumdisplay.php?f=37)
-   -   Optimizing the 68020+ 32-bit math (https://eab.abime.net/showthread.php?t=106742)

modrobert 01 May 2021 10:54

Quote:

Originally Posted by litwr (Post 1480385)
Would you like please to run pi-amiga and pi-amiga1200 on your hardware for 100, 1000, and 3000 digits? A screenshot would be a nice addition for me to insert it in the table.

OK, I used the binaries in your attached archive 'pispigot.zip'. Here are the results for v9:

Code:

pi-amiga
number ? calculator v9 (68000)
number of digits (up to 9248)? 100
314...  .10

Code:

pi-amiga
number ? calculator v9 (68000)
number of digits (up to 9248)? 1000
314...  4.40

Code:

pi-amiga
number ? calculator v9 (68000)
number of digits (up to 9248)? 3000
314...  34.16


Code:

pi-amiga1200
number ? calculator v9 (68020)
number of digits (up to 9252)? 100
314...  .12

Code:

pi-amiga1200
number ? calculator v9 (68020)
number of digits (up to 9252)? 1000
314...  4.42

Code:

pi-amiga1200
number ? calculator v9 (68020)
number of digits (up to 9252)? 3000
314...  34.42


The time results vary a bit between runs, most likely due to multitasking and perhaps accuracy of the frame counter, on the first "100" test I get results between 0.09 and 0.12 running the same program.

I attached a screenshot, quality kind of sucks, it's hard to get good ones from CRT screens (looks better in reality).

saimo 01 May 2021 12:06

Quote:

Originally Posted by litwr (Post 1480440)
I showed the CMP-first version only because it was a/b's demand. Sorry I didn't add more information about it afore.

No need to apologize: it's my fault that I didn't read the thread properly (and still haven't :p).

Quote:

Thank you. I didn't know this. However CLR vs SWAP timing is rather odd. The 68000 executes SWAP faster than CLR but the 68020 executes CLR faster than SWAP!
Given that your code uses divu.l and divul.l, which are not available on the 68000, I assumed that it was for 68020+, so I didn't hesitate to replace moveq.l with clr.l (just for readability reasons - it's only a personal preference).

Quote:

It is interesting to reduce instruction dependency in the code that may speed up the execution on the 68020 and higher 68k. Actually I didn't think about it.
For completeness, reducing the register conflicts gives an improvement only on the 68060.

Quote:

Your code again corrupts D7.
How so? It puts the quotient in the low word and clears the upper word, just like your code does.

Quote:

Code:

        divu.w d4,d6
        bvs.s .longdiv

        moveq.l #0,d7
        move.w d6,d7
        clr.w d6
        swap d6
        move.w d6,(a3)


One register conflict can be avoided by using the same trick proposed in my previous post:

Code:

  divu.w d4,d6
  bvs.s  .longdiv

  move.l d6,d7
  clr.w  d6
  eor.l  d6,d7
  swap.w d6
  move.w d6,(a3)

If you can change the array item size to 4 bytes, then the code can become:

Code:

  divu.w d4,d6
  bvs.s  .longdiv

  move.l d6,d7
  clr.w  d6
  eor.l  d6,d7
  move.l d6,(a3)
  swap.w d6

This removes one more register conflict and has the swap executing in parallel with the write (but the latter is not necessarily an improvement, as other instructions after this piece of code could execute in parallel as well).

litwr 01 May 2021 12:57

Quote:

Originally Posted by saimo (Post 1480459)
For completeness, reducing the register conflicts gives an improvement only on the 68060.

I still don't know when the 68020/30 can perform 2-3 instructions at once. :(

Quote:

Originally Posted by saimo (Post 1480459)
How so? It puts the quotient in the low word and clears the upper word, just like your code does.

Sorry again. I was wrong. What a nice EOR-optimization! Thank you. The 68020 oddity can make me mad, the latest modrobert's results help this. :banghead However your optimization just replaces MOVEQ and MOVE.W with MOVE.L and EOR.L and IMHO your pair is even a bit slower.

Quote:

Originally Posted by saimo (Post 1480459)
If you can change the array item size to 4 bytes, then the code can become:

No, because there is the 4th rule of the pi-spigot: it utilizes all available RAM below 64 KB limit to get the maximum number of calculated digits, so it is forbidden to restrict artificially the maximum number of digits. Actually we can took a double word, process two its words and save the double word back. But I doubt that this can accelerate the program.

saimo 01 May 2021 13:57

Quote:

Originally Posted by litwr (Post 1480467)
I still don't know when the 68020/30 can perform 2-3 instructions at once. :(

They generally can't: they can partly execute stuff in parallel depending on the instructions types, operands and alignment in memory. When the "head" and "tail" (as they're called in the 68030 documentation) timings of instructions overlap, then parallel execution occurs (the head of an instruction can execute in parallel with the tail of the previous instruction; if the head happens to be all that the instruction consists in, then the whole instruction, in practice, takes 0 cycles) The 68040 has a deeper pipeline and the 68060 is also superscalar, so those CPUs can do even better. If you feel like learning the details, check out the user manuals of the 68k CPUs (they're available for download from NXP - here's the one for the M68020).
That said, after an instruction that writes to memory, the CPU can enjoy many "free" cycles for the cached non-memory instructions; for example, I measured that after a write to CHIP RAM, the 68030 on my Blizzard 1230-IV has 26 free cycles which can be used for all the instructions that fit in those cycles; during a write to FAST RAM (which the CPU has exclusive access to and is no-wait-state) the CPU has 4 free cycles. (Exception: if I remember correctly - it's been a while - rotate and, maybe, also shift instructions, for some strange reason, actually can't benefit from the free cycles and cause the CPU to stall until the write finishes.) This is the reason why I proposed the longword write trick.

Quote:

Sorry again. I was wrong. What a nice EOR-optimization! Thank you. The 68020 oddity can make me mad, the latest modrobert's results help this. :banghead However your optimization just replaces MOVEQ and MOVE.W with MOVE.L and EOR.L and IMHO your pair is even a bit slower.
The instructions I used are register-only and are as fast as it gets on any 68020+ CPU ;) Basically, my code does not add cycles, but it eliminates a register dependency, thus actually making the code faster on the 68060.

litwr 01 May 2021 18:53

Quote:

Originally Posted by modrobert (Post 1480450)
Code:

pi-amiga
number ? calculator v9 (68000)
number of digits (up to 9248)? 100
314...  .10

Code:

pi-amiga
number ? calculator v9 (68000)
number of digits (up to 9248)? 1000
314...  4.40

Code:

pi-amiga
number ? calculator v9 (68000)
number of digits (up to 9248)? 3000
314...  34.16

Code:

pi-amiga1200
number ? calculator v9 (68020)
number of digits (up to 9252)? 100
314...  .12

Code:

pi-amiga1200
number ? calculator v9 (68020)
number of digits (up to 9252)? 1000
314...  4.42

Code:

pi-amiga1200
number ? calculator v9 (68020)
number of digits (up to 9252)? 3000
314...  34.42


Thank you very much. Your results show that the 68020 code optimization is a very uneasy task. They again show that the 68020 code is slower than the pure 68000 code and this is very odd. Would you like please to help me a bit more? I prepared an ultimate test which can show us all unclear details. The attached archive contains the next executable files:
pi-amiga-8 - an old version (68000)
pi-amiga-8mo - an old version with MULUopt=1 (68000)
pi-amiga-9 - BVS optimization (68000)
pi-amiga-9mo - BVS optimization with MULUopt=1 (68000)
pi-amiga1200-8 - an old version (68020)
pi-amiga1200-8mo - an old version with MULUopt=1 (68020)
pi-amiga1200-9 - BVS optimization (68020)
pi-amiga1200-9mo - BVS optimization with MULUopt=1 (68020)

You have already run pi-amiga-9mo and pi-amiga1200-9mo - if you rerun them the results must be the same. So results for pi-amiga-8, pi-amiga-8mo, pi-amiga-9, pi-amiga1200-8, pi-amiga1200-8mo, pi-amiga1200-9 are only required to get information which optimization actually works on the 68020. This time only 3000 digit results are required.

Quote:

Originally Posted by saimo (Post 1480477)
They generally can't: they can partly execute stuff in parallel depending on the instructions types, operands and alignment in memory. When the "head" and "tail" (as they're called in the 68030 documentation) timings of instructions overlap, then parallel execution occurs (the head of an instruction can execute in parallel with the tail of the previous instruction; if the head happens to be all that the instruction consists in, then the whole instruction, in practice, takes 0 cycles) The 68040 has a deeper pipeline and the 68060 is also superscalar, so those CPUs can do even better. If you feel like learning the details, check out the user manuals of the 68k CPUs (they're available for download from NXP - here's the one for the M68020).
That said, after an instruction that writes to memory, the CPU can enjoy many "free" cycles for the cached non-memory instructions; for example, I measured that after a write to CHIP RAM, the 68030 on my Blizzard 1230-IV has 26 free cycles which can be used for all the instructions that fit in those cycles; during a write to FAST RAM (which the CPU has exclusive access to and is no-wait-state) the CPU has 4 free cycles. (Exception: if I remember correctly - it's been a while - rotate and, maybe, also shift instructions, for some strange reason, actually can't benefit from the free cycles and cause the CPU to stall until the write finishes.) This is the reason why I proposed the longword write trick.

Thank you very much for this very interesting information. This also partially explains why the 68020 simulators are still so far from the perfection.

Quote:

Originally Posted by saimo (Post 1480477)
The instructions I used are register-only and are as fast as it gets on any 68020+ CPU ;) Basically, my code does not add cycles, but it eliminates a register dependency, thus actually making the code faster on the 68060.

It is sad that Commodore left PC market so early. :( The 68060 based Amiga would have been quite good until maybe 1998. But now I have large problems even with the 68020 optimization... I doubt that I can discover how to optimize the code even for the 68030 - I will need help from Atari people for this. So the 68060 optimization is rather among pure fantastics for me now.

modrobert 02 May 2021 11:23

While waiting for new electrolytic capacitors to ship for my original A1200 PSU (it needs recapping), I have connected a PC ATX 250W power supply temporarily which gives me more options.

litwr,

Do you want me to run the new tests with the stock A1200 68020 same as before, or using ACA-1232 68030 @ 33MHz (with 128mb fastram)?

EDIT:

Never mind, the ACA-1232 seems to be broken, so can't do 68030 for now.


Amiga 1200 stock 68020 @ 14MHz with 4mb fastram

Code:

pi-amiga-8
number ? calculator v8 [MULUopt](68000)
number of digits (up to 9252)? 3000
314...  35.90

Code:

pi-amiga-8mo
number ? calculator v8 [MULUopt](68000)
number of digits (up to 9248)? 3000
314...  35.30

Code:

pi-amiga-9
number ? calculator v9(68000)
number of digits (up to 9248)? 3000
314...  34.72

Code:

pi-amiga-9mo
number ? calculator v9 [MULUopt](68000)
number of digits (up to 9248)? 3000
314...  34.72

Code:

pi-amiga1200-8
number ? calculator v8 [MULUopt](68020)
number of digits (up to 9252)? 3000
314...  36.72

Code:

pi-amiga1200-8mo
number ? calculator v8 [MULUopt](68020)
number of digits (up to 9248)? 3000
314...  35.54

Code:

pi-amiga1200-9
number ? calculator v9(68020)
number of digits (up to 9252)? 3000
314...  35.00

Code:

pi-amiga1200-9mo
number ? calculator v9 [MULUopt](68020)
number of digits (up to 9248)? 3000
314...  34.94


EDIT 2:

I played around with 'pi-amiga-9.asm' and modified it (edit/compile) to not print the digits via "Write(a6)" to stdout (under PR0000 label), just to see how much it would gain, and the result was 32.80 seconds, so that saved roughly 2 seconds.

BTW: Noticed you have "jmp Write(a6)" on line 307 in 'pi-amiga-9.asm', shouldn't that be "jsr Write(a6)"?

litwr 02 May 2021 20:26

Quote:

Originally Posted by modrobert (Post 1480675)
While waiting for new electrolytic capacitors to ship for my original A1200 PSU (it needs recapping), I have connected a PC ATX 250W power supply temporarily which gives me more options.

Thank you very much. Your results clarify some things. So it seems that the BVS-optimization saves 2 cycles and the MULUopt-optimization on the 68020 saves 1. However it is a big mystery why the 68020 code (with DIVUL) is always slower than the plain 68000 code?! This is impossible! I can only suggest that we have some random deviations and your results showed cases when these deviations were in favor of the 68000 code. Maybe only the long run of PI-AMIGA-9MO and PI-AMIGA1200-9MO for 9000 digits can help to find the truth (you can redirect the output to a file to minimize the part of IO-timings).
BTW could you detach fast RAM and run PI-AMIGA-9MO for 3000 digits? Indeed it would also be nice to get results from your 68030 hardware sometime in the future.


Quote:

Originally Posted by modrobert (Post 1480675)
I played around with 'pi-amiga-9.asm' and modified it (edit/compile) to not print the digits via "Write(a6)" to stdout (under PR0000 label), just to see how much it would gain, and the result was 32.80 seconds, so that saved roughly 2 seconds.

BTW: Noticed you have "jmp Write(a6)" on line 307 in 'pi-amiga-9.asm', shouldn't that be "jsr Write(a6)"?

You can just set IO=0 to disable the screen output. If you selected format option for CPU+IO you would get CPU and IO time separately: the ER-value is calculated basing only on the pure CPU time, no IO time is taken into account.
The JMP instruction is ok - how could it work if it was wrong? :)

modrobert 02 May 2021 21:19

Quote:

Originally Posted by litwr (Post 1480773)
Thank you very much. Your results clarify some things. So it seems that the BVS-optimization saves 2 cycles and the MULUopt-optimization on the 68020 saves 1. However it is a big mystery why the 68020 code (with DIVUL) is always slower than the plain 68000 code?! This is impossible! I can only suggest that we have some random deviations and your results showed cases when these deviations were in favor of the 68000 code. Maybe only the long run of PI-AMIGA-9MO and PI-AMIGA1200-9MO for 9000 digits can help to find the truth (you can redirect the output to a file to minimize the part of IO-timings).
BTW could you detach fast RAM and run PI-AMIGA-9MO for 3000 digits? Indeed it would also be nice to get results from your 68030 hardware sometime in the future.

Code:

pi-amiga-9mo
number ? calculator v9 [MULUopt](68000)
number of digits (up to 9248)? 9000
314...  291.88

Code:

pi-amiga1200-9mo
number ? calculator v9 [MULUopt](68020)
number of digits (up to 9248)? 9000
314...  294.38


Without fastram...
Code:

pi-amiga-9mo
number ? calculator v9 [MULUopt](68000)
number of digits (up to 9248)? 3000
314...  36.84

Quote:

Originally Posted by litwr (Post 1480773)
You can just set IO=0 to disable the screen output. If you selected format option for CPU+IO you would get CPU and IO time separately: the ER-value is calculated basing only on the pure CPU time, no IO time is taken into account.
The JMP instruction is ok - how could it work if it was wrong? :)

Aha, OK, missed the IO constant in the source code.

I noticed you don't select any type of RAM "clr.l d1" when allocating "AllocMem(a6)", according to documentation it should pick fastram first, but not sure if that can be overridden by compiler settings?

modrobert 02 May 2021 21:45

litwr,

Check the attached screenshot regarding number of CPU cycles for DIVU.L.

Source:
https://www.nxp.com/docs/en/data-sheet/MC68020UM.pdf

litwr 03 May 2021 15:18

Quote:

Originally Posted by modrobert (Post 1480788)
Code:

pi-amiga-9mo
number ? calculator v9 [MULUopt](68000)
number of digits (up to 9248)? 9000
314...  291.88

Code:

pi-amiga1200-9mo
number ? calculator v9 [MULUopt](68020)
number of digits (up to 9248)? 9000
314...  294.38


Without fastram...
Code:

pi-amiga-9mo
number ? calculator v9 [MULUopt](68000)
number of digits (up to 9248)? 3000
314...  36.84

I noticed you don't select any type of RAM "clr.l d1" when allocating "AllocMem(a6)", according to documentation it should pick fastram first, but not sure if that can be overridden by compiler settings?

Thank you very much. However your results look impossible. Both programs execute the same code
Code:

        divu d4,d6
        bvs.s .longdiv

        moveq.l #0,d7

and your results show that their timings are different.
Code sequences at .longdiv are different but the branch to .longdiv is taken almost never, it is about 1 branch taken of 10,000,000 cases. I can only suggest one plausible explanation. Maybe your results are caused by programs run order. You ran PI-AMIGA-9 first. I can assume that this processing makes your hardware a bit hotter and slower. IMHO if you run PI-AMIGA1200-9MO first then it will be faster. If it is not right then I am completely baffled.

Thank you for your remark about the AllocMem invocation. But I can't understand what compiler setting can affect memory allocation function call...

litwr 03 May 2021 15:20

Quote:

Originally Posted by modrobert (Post 1480799)
litwr,

Check the attached screenshot regarding number of CPU cycles for DIVU.L.

Source:
https://www.nxp.com/docs/en/data-sheet/MC68020UM.pdf

Interestingly that it seems DIVUL.L and two variants of DIVU.L have the same timings despite of they have quite different amount of work to do...

modrobert 03 May 2021 15:42

Quote:

Originally Posted by litwr (Post 1480952)
Interestingly that it seems DIVUL.L and two variants of DIVU.L have the same timings despite of they have quite different amount of work to do...

I meant compared to DIVU.W, close to twice the number of CPU clocks at 76-79 for DIVU.L (and roughly 20 times slower than average instructions) which could be the reason 68020 code doesn't make much difference in this case.

roondar 03 May 2021 15:53

Regarding the timings of instructions, It should be pointed out that the timings in the Motorola manuals for the DIV and MUL instructions represent the maximum number of cycles they can take*, not the minimum. The actual timing varies depending on amongst other things the given input, though Motorola has not included information about how much this variation is**.

If I had to guess, I'd say that the 64 bit DIV instructions will probably perform worse than the 32 bit ones.

*) From the manual linked above, page 8-11: "This CC time is a maximum since the times given for the MULU.L and DIVS.L are maximums.".

**) See the introduction to chapter 8: "This section describes the instruction execution and operations (table searches, etc.) of the MC68020/EC020 in terms of external clock cycles. It provides accurate execution and operation timing guidelines but not exact timings for every possible circumstance. This approach is used since exact execution time for an instruction or operation is highly dependent on memory speeds and other variables."

modrobert 03 May 2021 16:24

Quote:

Originally Posted by roondar (Post 1480966)
Regarding the timings of instructions, It should be pointed out that the timings in the Motorola manuals for the DIV and MUL instructions represent the maximum number of cycles they can take*, not the minimum. The actual timing varies depending on amongst other things the given input, though Motorola has not included information about how much this variation is**.

If I had to guess, I'd say that the 64 bit DIV instructions will probably perform worse than the 32 bit ones.

*) From the manual linked above, page 8-11: "This CC time is a maximum since the times given for the MULU.L and DIVS.L are maximums.".

**) See the introduction to chapter 8: "This section describes the instruction execution and operations (table searches, etc.) of the MC68020/EC020 in terms of external clock cycles. It provides accurate execution and operation timing guidelines but not exact timings for every possible circumstance. This approach is used since exact execution time for an instruction or operation is highly dependent on memory speeds and other variables."

Looks pretty clear to me, they have columns ranging from "best case" to "worst case", and this is how to interpret the data in section 8.2.8.

https://modrobert.ddns.net/pics/mc68020_clocks.png

Time wise you can do roughly 20 ("average speed" ~4 clocks) instructions for the cost of one DIVU.L!

roondar 03 May 2021 16:42

Best case through worst case in those diagrams only refer to time with instruction overlap/no overlap but in cache/no overlap not in cache*. They do not refer to the differences in timing for DIV/MUL due to different values or operand size being passed (in the case of the 64 bit operations).

*) The last one is especially interesting because it assumes memory access is penalized by just 1 cycle per access, which is way better than what the A1200 without Fast RAM actually manages.

Edit: on a side note, it's very logical that DIV/MUL have different execution speeds based on differing inputs as the amount of work needed to be done for both does vary based on input.

modrobert 03 May 2021 16:46

Here are the disassembled files attached for 'pi-amiga-9mo' and 'pi-amiga1200-9mo' binaries from 'pi-amiga-cmp.zip'. As you can see the 'pi-amiga1200-9mo' includes "divul.l d4,d7:d6", but this part of the code is not called much according to litwr. Lots of DIVU.W instructions in both which is slow as well.

Code:

lab_fa:
  divul.l  d4,d7:d6
  move.w  d7,(a3)
  bra.b    lab_12e


Thomas Richter 03 May 2021 17:26

Quote:

Originally Posted by litwr (Post 1480952)
Interestingly that it seems DIVUL.L and two variants of DIVU.L have the same timings despite of they have quite different amount of work to do...


Not really, they have the same amount of work to do. Your average division algorithm creates the remainder as by-product anyhow. The typical division implementation is a 2nbits/nbits division.


IOWs, the underlying algorithm is probably much the same for all division versions, just that some of the outputs are thrown away and/or some of the inputs are assumed to be zero.

Don_Adan 03 May 2021 17:41

Quote:

Originally Posted by modrobert (Post 1480977)
Here are the disassembled files attached for 'pi-amiga-9mo' and 'pi-amiga1200-9mo' binaries from 'pi-amiga-cmp.zip'. As you can see the 'pi-amiga1200-9mo' includes "divul.l d4,d7:d6", but this part of the code is not called much according to litwr. Lots of DIVU.W instructions in both which is slow as well.

Code:

lab_fa:
  divul.l  d4,d7:d6
  move.w  d7,(a3)
  bra.b    lab_12e


You can add temporary counter how many times this routine is called, if only a few times like about 10 times, i will rewrote fully this pi routine and remove bvs check, because pi is constant. i will call longdiv routine only when is necessary. Something like:

245 times divu.w
1 time divu.l
387 times divu.w
1 time divu.l
1647 times divu.w
2 times divu.l
etc

up to f.e 10000 times

if someone called this routine more than 10000 calls then old routine after 10000 times can be called within bvs check.

litwr 03 May 2021 17:50

Quote:

Originally Posted by roondar (Post 1480966)
Regarding the timings of instructions, It should be pointed out that the timings in the Motorola manuals for the DIV and MUL instructions represent the maximum number of cycles they can take*, not the minimum. The actual timing varies depending on amongst other things the given input, though Motorola has not included information about how much this variation is**.

If I had to guess, I'd say that the 64 bit DIV instructions will probably perform worse than the 32 bit ones.

*) From the manual linked above, page 8-11: "This CC time is a maximum since the times given for the MULU.L and DIVS.L are maximums.".

**) See the introduction to chapter 8: "This section describes the instruction execution and operations (table searches, etc.) of the MC68020/EC020 in terms of external clock cycles. It provides accurate execution and operation timing guidelines but not exact timings for every possible circumstance. This approach is used since exact execution time for an instruction or operation is highly dependent on memory speeds and other variables."

My gosh! :) Have anybody ever tried to find those minimums? IMHO it is quite possible that minimum = maximum because some division implementations have constant time of executions for any their arguments.

Quote:

Originally Posted by modrobert (Post 1480977)
Here are the disassembled files attached for 'pi-amiga-9mo' and 'pi-amiga1200-9mo' binaries from 'pi-amiga-cmp.zip'. As you can see the 'pi-amiga1200-9mo' includes "divul.l d4,d7:d6", but this part of the code is not called much according to litwr. Lots of DIVU.W instructions in both which is slow as well.

Code:

lab_fa:
  divul.l  d4,d7:d6
  move.w  d7,(a3)
  bra.b    lab_12e


Why make a disassembly, you have sources. :) I can again confirm that DIVUL is executed almost never.

litwr 03 May 2021 17:52

Quote:

Originally Posted by Don_Adan (Post 1480992)
You can add temporary counter how many times this routine is called, if only a few times like about 10 times, i will rewrote fully this pi routine and remove bvs check, because pi is constant.

IMHO it is much easier just to believe the author of the code. :)


All times are GMT +2. The time now is 10:19.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2024, vBulletin Solutions Inc.

Page generated in 0.09582 seconds with 11 queries