Optimizing the 68020+ 32-bit math - Page 5

saimo · 09 May 2021, 18:07

@litwr

Quote:

Originally Posted by litwr

It is sad that the 68060 was not used in any mainstream computer. IMHO it was not worse than the first PPC. But the special optimization for the 68060 can make me crazy because it is very difficult to test code for this processor. Emulators are very inaccurate even for the 68020 and I can think that a 68060 emulator does not still exist at all. Your and modrobert's results proved that your optimizations do not accelerate the 68020/30.

Simply don't look at the numbers spat out by emulators

As for optimizing for 68060, I'm not saying that you should aim for that: I simply suggested some innocuous rearrangement of instructions so that there are less registers conflicts.

Quote:

Now my code uses Forbid/Permit and this makes it less than 1% faster.

I have just tested this. I didn't notice any speed change, maybe there was only something less than 0.1% deviations in results.

IMHO Vertical blank interrupts are quite good too.

That method provides a 50 Hz (on PAL) or 60 Hz (on NTSC) resolution which is just too coarse to measure the difference made by a 1- or 2-cycle optimization for something that runs only for a very short time. Moreover, it's affected by the hardware and OS interrupt processing overhead.
Just to be clear: I'm saying this only to point out that it's difficult to evaluate performance accurately with such method; unfortunately, more precise measurements require lots of additional code and direct hardware access.

Quote:

This subject requires exact numbers. IMHO time measurement has been quite accurate. You can check it by your stopwatch.

Which would be the most inaccurate way of measuring the time

Quote:

Thank you very much. Your and modrobert's results show that MULUopt=1 saves 2 cycles.

The cache-case timings of the mulu optimization code are these:

Code:

  moveq.l #0,d0    ;2
  move.w  -(a3),d0
  move.l  d0,d1    ;2
  lsl.l   #3,d0    ;4
  sub.l   d0,d1    ;2
  add.l   d0,d0    ;2
  sub.l   d0,d1    ;2
  sub.l   d0,d1    ;2
  lsl.l   #8,d1    ;4
  sub.l   d1,d0    ;2

The total time is 22 cycles. Note that I said cache-case: thanks to instructions overlap, the actual performance might be better.
The mulu.w timings are instead:
* best-case (maximum overlap case): 25 cycles;
* cache-case: 27 cycles.
In short, the mulu optimization is quite good (albeit it takes a few bytes) and saves more than 2 cycles. It could cause issues if it makes the main loop larger than the instruction cache (256 bytes), though (but at a glance it doesn't look like the code is even close to that).
Maybe the figure you came up with was influnced by the measurements inaccuracy.

Quote:

It seems your 68030 system relocates interrupt vector table. It is the first time I met such a thing. Thanks to Don_Adan I have just made new code which uses AddIntServer/RemIntServer instead of the direct work with the interrupt vector. Please could you rerun the new code (it is attached) on your 68030 hardware for me?

Indeed I have the table relocated. I'm unsure whether I'll be able to make the tests today, but sooner or later I'll let you know.

About testing... yesterday and earlier today I made some more tests.
First of all, I took my old test-bed program (the same I used for the divu tests reported previously) and I changed it to use, instead of the CIA A 32-bit timer (which runs at a trifle more than 0.7 MHz), the CIA B TOD clock, which has a rasterline resolution (so, 312 times more precise on PAL and 262 times more precise on NTSC than vertical blanking), plus fine-grained timing based on the raster beam position (VHPOSR register): this gives a color clock resolution (about 3.5 MHz), which is 5 times more precise than before.
Then, I have put your core loop and my core loop inside the test-bed program (commenting out divu.w + bvs.b, because I didn't have the outer stuff that initializes data and registers) and ran various tests. The results were:
* the speed is basically the same on 68020, regardless of some instructions rearrangement and of code alignment;
* my code performed better on 68030.
I must say that the first result surprised me a bit (even if the removal of divu.w + bvs.b meant that the divul branch optimization was not exploited). Then I changed my code again a little bit to try an optimization that I came up with last night (it had been lingering in my head forever, but somehow I couldn't see it clearly). The resulting code is 2 bytes shorter and, theoretically, 2 cycles faster on both 68020 and 68030 (and a bit less 68060-friendly):

Code:

         clr.l   d3
         ...
         bra.b   .l4

.longdiv divul.l d4,d7:d6
         move.w  d7,(a3)

         subq.l  #2,d4
         bcs.b   .enddiv

         sub.l   d6,d5
         sub.l   d7,d5

.l4      move.w  -(a3),d0
         lsr.l   #1,d5
         mulu.w  d1,d0
         add.l   d0,d5
         move.l  d5,d6
         divu.w  d4,d6
         bvs.b   .longdiv

         move.w  d6,d3
         clr.w   d6
         sub.l   d3,d5
         swap.w  d6
         move.w  d6,(a3)
         sub.l   d6,d5

         subq.l  #2,d4
         bcc.b   .l4

.enddiv

I tested it against your code and, surprisingly, still on 68020 the results were pretty much the same; on 68030, there was a speed improvement:

Code:

elapsed time:
 · color clocks: 14499
 · rasterlines:  63.7
 · frames:       0.2
 · µs:           4087.8

elapsed time:
 · color clocks: 14075
 · rasterlines:  61.9
 · frames:       0.1
 · µs:           3968.2

Notes:
 · loop executed 3000 times;
 · divu.w + bvs.b commented out.

That's about 1.03 times / 3% faster.

That convinced me to finally make a full test. (Fighting against myself,) I looked at the whole source to add also the outer loop. The impression I had had a few days ago regarding other possible optimizations was confirmed, so I rewrote the code as follows:

Code:

DN       =       3000            ;digits number
IN       =       ((DN/2)*7)      ;items number

* INITIALIZATION (buffer address in a2)

         movea.l a2,a0
         move.l  #2000*$10001,d0
         move.w  #IN/2-1,d1
.fill    move.l  d0,(a0)+
         dbra    d1,.fill

         move.w  #10000,d1
         move.w  #IN*2,d2
         clr.l   d3
         movea.w #28,a1

* MAIN LOOP

.l0      clr.l   d4
         clr.l   d5
         move.w  d2,d4
         lea.l   (a2,d2.w),a3
         subq.l  #1,d4
         bra.b   .l4

.longdiv divul.l d4,d7:d6
         move.w  d7,(a3)

         subq.l  #2,d4
         bcs.b   .enddiv

         sub.l   d6,d5
         sub.l   d7,d5

.l4      move.w  -(a3),d0
         lsr.l   #1,d5
         mulu.w  d1,d0
         add.l   d0,d5
         move.l  d5,d6
         divu.w  d4,d6
         bvs.b   .longdiv

         move.w  d6,d3
         clr.w   d6
         sub.l   d3,d5
         swap.w  d6
         move.w  d6,(a3)
         sub.l   d6,d5

         subq.l  #2,d4
         bcc.b   .l4

.enddiv  divu.w  d1,d5
         sub.w   a1,d2
         bcc.b   .l0

Did I understand correctly? Did I break anything? Once the code is correct, I'll test its performance.

litwr · 10 May 2021, 20:12

Quote:

Originally Posted by saimo

That method provides a 50 Hz (on PAL) or 60 Hz (on NTSC) resolution which is just too coarse to measure the difference made by a 1- or 2-cycle optimization for something that runs only for a very short time. Moreover, it's affected by the hardware and OS interrupt processing overhead.
Just to be clear: I'm saying this only to point out that it's difficult to evaluate performance accurately with such method; unfortunately, more precise measurements require lots of additional code and direct hardware access.

I can hardly agree with you. The program measures time of a loop which executed about 4,000,000 times for 3000 digits - 0.02s is quite a good accuracy, because it shows even 0.02/4e6 = 5ns loop timing changes.

Quote:

Originally Posted by saimo

Which would be the most inaccurate way of measuring the time

IMHO stopwatch accuracy is about 0.2s - it is not too bad.

Quote:

Originally Posted by saimo

The cache-case timings of the mulu optimization code are these:

Code:

  moveq.l #0,d0    ;2
  move.w  -(a3),d0
  move.l  d0,d1    ;2
  lsl.l   #3,d0    ;4
  sub.l   d0,d1    ;2
  add.l   d0,d0    ;2
  sub.l   d0,d1    ;2
  sub.l   d0,d1    ;2
  lsl.l   #8,d1    ;4
  sub.l   d1,d0    ;2

The total time is 22 cycles. Note that I said cache-case: thanks to instructions overlap, the actual performance might be better.
The mulu.w timings are instead:
* best-case (maximum overlap case): 25 cycles;
* cache-case: 27 cycles.
In short, the mulu optimization is quite good (albeit it takes a few bytes) and saves more than 2 cycles. It could cause issues if it makes the main loop larger than the instruction cache (256 bytes), though (but at a glance it doesn't look like the code is even close to that).

I have checked manuals and got the next table

Code:

                      68020       68030
  moveq.l #0,d0       0-2-3       2-2
  move.l d0,d1        0-2-3       2-2
  lsl.l #3,d0         1-4-4       4-4
  sub.l d0,d1         0-2-3       2-2
  add.l d0,d0         0-2-3       2-2
  sub.l d0,d1         0-2-3       2-2
  sub.l d0,d1         0-2-3       2-2
  lsl.l #8,d1         1-4-4       4-4
  sub.l d1,d0         0-2-3       2-2
                      2-22-29     22-22

  mulu d1,d0          25-27-28    28-28

According to this table MULUopt=1 should give us 5 saved cycles for the 68020 and 6 cycles for the 68030. However results from hardware give us rather 2-3 saved cycles from the 68020. Maybe the 68020 partially invalidates its cache after a jump? The x86 just drops its instruction queue on any jump, even a jump to the next instruction - this feature is used, for instance when the x86 is turned into protected mode. BTW I tried to implement the same multiplication optimization for the 80386. I found out that it doesn't work for it. Theoretically this optimization may slightly speed up the 80386 but practically on pi-spigot data it is slower. However I am sure that it must speed up the 80486.
Results from your 68030 hardware can help us better understand this case. I am also curious why do the 68020 timings have 3 cases while the 68030 have only 2?

Quote:

Originally Posted by saimo

Indeed I have the table relocated. I'm unsure whether I'll be able to make the tests today, but sooner or later I'll let you know.

I still don't understand this. Why can this happen? What is the reason to relocate this table?

Quote:

Originally Posted by saimo

About testing... yesterday and earlier today I made some more tests.
...
Then, I have put your core loop and my core loop inside the test-bed program (commenting out divu.w + bvs.b, because I didn't have the outer stuff that initializes data and registers) and ran various tests. The results were:
* the speed is basically the same on 68020, regardless of some instructions rearrangement and of code alignment;
* my code performed better on 68030.

I have to note that results may depend on actual data processed. It is important if division or multiplication depend on their arguments. It seems that division and multiplication on the 68020/30 do not depend on their arguments but there may be other minor factors which depends on data processed.

Quote:

Originally Posted by saimo

The resulting code is 2 bytes shorter and, theoretically, 2 cycles faster on both 68020 and 68030 (and a bit less 68060-friendly):

Wow! Thank you very much. Your optimization actually makes code faster! It speeds up the 68020/30 and 68000 programs. However this optimization makes code larger - I had to fix some details. Would you like please to run this new code (pi-amiga, pi-amiga1200) for me on your hardware (68020, 68030) for 100, 1000, 3000 digits?

Quote:

Originally Posted by modrobert

...

saimo could make an almost impossible thing. He has made the code a bit more faster! Could you please run the new programs for me? I hope it is the last time. I am sure that no more optimizations possible.

EDIT. I have replaced the attached file by an upgraded one. The changes do not affect the performance. I just added a better memory management which allows us to get more digits - thanks to Don_Adan.

litwr · 10 May 2021, 21:23

Quote:

Originally Posted by Don_Adan

I dont know, what is necessary (option or naming) for auto creating Code_BSS section by Vasm.
But perhaps some infos you can find here:

http://eab.abime.net/showthread.php?t=97310

Thank you very much! I just needed the -databss option. It was easy.

Now the Amiga version of pi-spigot can show up to 9272 digits! It is 20 digits more than the previous version. However the Atari ST version allows us to get up to 9288 digits. It is still a bit more. The new Amiga version (11 beta 4) archive is attached to the previous post.

Don_Adan · 10 May 2021, 23:02

Quote:

Originally Posted by litwr

Thank you very much! I just needed the -databss option. It was easy.

Now the Amiga version of pi-spigot can show up to 9272 digits! It is 20 digits more than the previous version. However the Atari ST version allows us to get up to 9288 digits. It is still a bit more. The new Amiga version (11 beta 4) archive is attached to the previous post.

I dont coded long time, but ln_Name is unnecessary, can be null/zero. And is_Data can be set as A1 input, if i remember right.

Code:

rasteri      addq.l #1,(A1)
;If you set your interrupt to priority 10 or higher then a0 must point at $dff000 on exit
      moveq #0,d0  ; must set Z flag on exit!
      rts

VBlankServer:
      dc.l  0,0                   ;ln_Succ,ln_Pred
      dc.b  NT_INTERRUPT,0        ;ln_Type,ln_Pri
      dc.l  0               ;ln_Name
      dc.l  time,rasteri             ;is_Data,is_Code

saimo · 11 May 2021, 00:27

Quote:

Originally Posted by litwr

I can hardly agree with you. The program measures time of a loop which executed about 4,000,000 times for 3000 digits - 0.02s is quite a good accuracy, because it shows even 0.02/4e6 = 5ns loop timing changes.

Of course, the more the iterations, the higher the precision. But what if the loops are few (like, for example, for just a few digits)?
Let's suppose that the loops are just 40,000 and that each loop takes 100 cycles: the total amount of cycles is 40,000 * 100 = 4,000,000. Execution on a 50 MHz CPU would require 4,000,000 / 50,000,000 = 0.08 s, i.e. 4 PAL frames. Now, if the loop were optimized by 2 cycles, the total amount of cycles would become 40,000 * 98 = 3,920,000, which, on the same CPU, would run in 3,920,000 / 50,000,000 = 0.0784 s, i.e. = 3,92 frames. The 0.02 s resolution wouldn't suffice anymore.
The results I'm reporting below show in practice the limitations of the method. But if it fits your bill, no problem with me

Quote:

I have checked manuals and got the next table

Code:

                      68020       68030
  moveq.l #0,d0       0-2-3       2-2
  move.l d0,d1        0-2-3       2-2
  lsl.l #3,d0         1-4-4       4-4
  sub.l d0,d1         0-2-3       2-2
  add.l d0,d0         0-2-3       2-2
  sub.l d0,d1         0-2-3       2-2
  sub.l d0,d1         0-2-3       2-2
  lsl.l #8,d1         1-4-4       4-4
  sub.l d1,d0         0-2-3       2-2
                      2-22-29     22-22

  mulu d1,d0          25-27-28    28-28

According to this table MULUopt=1 should give us 5 saved cycles for the 68020 and 6 cycles for the 68030. However results from hardware give us rather 2-3 saved cycles from the 68020. Maybe the 68020 partially invalidates its cache after a jump?

As far as I know, no. I've never read anything like that on the MC68k manuals. In that regard, branching affects only the 68060 branch prediction.

Quote:

I am also curious why do the 68020 timings have 3 cases while the 68030 have only 2?

The memory controllers of those CPU differ, and probably so also instruction fetching/dispatching/etc. From the timings, we can see that the CPUs are very similar, but not identical.
The 68020 manual illustrates 3 cases: best-case (maximum overlap), cache-case and worst-case (no-cache). It basically says: look, when you're really good/lucky, instructions will overlap and execute faster - and with some examples it shows what that means; but it doesn't provide details for all the instructions. The 68030 manual is better: it basically says the same thing, but instead of giving out the best cases, it provides head and tail (the parts that can overlap) timings for each addressing mode and instruction, so that it is actually possible to calculate the benefits of overlapping.

Quote:

I still don't understand this. Why can this happen? What is the reason to relocate this table?

It doesn't "happen": it's forced to happen (by the user, normally). The reason is that locations 0-1023 belong to the CHIP RAM, which is the RAM shared between CPU and chipset, and thus subject to contention (therefore, for example, the deeper is the screen mode, the more screen data needs to be fetched, the less time the CPU has access to the CHIP RAM); moreover, its clock is slower than that of the CPU (for example, 1/4 of the 68020 in the Amiga 1200). So, if the machine happens to have FAST RAM, i.e. RAM exclusively accessed by the CPU, it's best to have the vectors relocated there, so that interrupts processing is faster.

Quote:

Wow! Thank you very much. Your optimization actually makes code faster! It speeds up the 68020/30 and 68000 programs. However this optimization makes code larger - I had to fix some details. Would you like please to run this new code (pi-amiga, pi-amiga1200) for me on your hardware (68020, 68030) for 100, 1000, 3000 digits?

Here are the results. I have run each test 3 or 4 times.

68020, pi-amiga:
* 100 digits: 0.12
* 1000 digits: 4.80 - 4.82
* 3000 digits: 36.92 - 36.94

68020, pi-amiga1200:
* 100 digits: 0.12 - 0.14
* 1000 digits: 4.64
* 3000 digits: 35.32

68030, data cache burst on, pi-amiga:
* 100 digits: 0.04 - 0.08
* 1000 digits: 1.68
* 3000 digits: 10.96

68030, data cache burst off, pi-amiga:
* 100 digits: 0.04 - 0.08
* 1000 digits: 1.68
* 3000 digits: 10.94 - 10.96

68030, data cache burst on, pi-amiga1200:
* 100 digits: 0.06 - 0.08
* 1000 digits: 1.64 - 1.66
* 3000 digits: 10.62 - 10.64

68030, data cache burst off, pi-amiga1200:
* 100 digits: 0.06 - 0.08
* 1000 digits: 1.64 - 1.66
* 3000 digits: 10.62

The above tests have been run without booting with startup-sequence, i.e. with the barest system running. Out of curiosity, I tested it also with AmigaOS fully booted, and that confirmed the impression I had from watching the screen: the timing includes the printout operations! That has a serious impact on the times. For example, I have run the pi-amiga test with a value of 1000 a number of times, changing the shell windows size and the screen depth, and obtained times ranging from 1.98 to 4.12 (and I could have obtained even more extreme values by removing some patches, playing more with the window size, etc.).

Quote:

saimo could make an almost impossible thing. He has made the code a bit more faster! Could you please run the new programs for me? I hope it is the last time. I am sure that no more optimizations possible.

Wait, there's still the outer loop to be optimized! Did you check the code at the bottom of the previous post?

Don_Adan · 11 May 2021, 04:13

Why not?

Code:

;for vasm assembler
;it calculates pi-number using the next C-algorithm
;https://crypto.stanford.edu/pbc/notes/pi/code.html

;#include <stdio.h>
;#define N 2800
;main() {
;   long r[N + 1], i, k, b, c;
;   c = 0;
;   for (i = 1; i <= N; i++)   ;it is the fixed line!, the original was (i = 0; i < N; ...
;      r[i] = 2000;
;   for (k = N; k > 0; k -= 14) {
;      d = 0;
;      i = k;
;      for(;;) {
;         d += r[i]*10000;
;         b = i*2 - 1;
;         r[i] = d%b;
;         d /= b;
;         i--;
;         if (i == 0) break;
;         d *= i;
;      }
;      printf("%.4d", (int)(c + d/10000));
;      c = d%10000;
;   }
;}

;the time of the calculation is quadratic, so if T is time to calculate N digits
;then 4*T is required to calculate 2*N digits
;main loop count is 7*(4+D)*D/16, D - number of digits

;So r[0] is never used.  The program for 680x0 uses r[0] and doesn't use r[N] - so it optimizes the memory usage by 2 bytes

;litwr has written this for 680x0
;tricky provided some help
;MMS gave some support
;Thorham and meynaf helped a lot
;a/b, saimo and modrobert helped to optimize the 68k code

     mc68000
MULUopt = 0   ;1 is much slower for 68000, for 68020 it is the same for FS-UAE and faintly faster with the real 68020
IO = 1

OldOpenLibrary	= -408
CloseLibrary	= -414
Output = -60
Input = -54
Write = -48
Read = -42
AllocMem = -198
FreeMem = -210
Forbid = -132
Permit = -138
AddIntServer = -168
RemIntServer = -174
VBlankFrequency = 530
INTB_VERTB = 5     ;for vblank interrupt
NT_INTERRUPT = 2   ;node type

;N = 7*D/2 ;D digits, e.g., N = 350 for 100 digits

div32x16 macro    ;D7=D6/D4, D6=D6%D4
     ;clr.l d7
     moveq.l #0,d7
     divu d4,d6
     bvc .div32no\@

     swap d6
     move d6,d7
     divu d4,d7
     swap d7
     move d7,d6
     swap d6
     divu d4,d6
.div32no\@
     move d6,d7
     clr d6
     swap d6
endm

start    
         bra.w Ra

Back
.fill    move.l d0,(a0)+
         dbra d3,.fill
.l0      clr.l d5       ;d <- 0
         clr.l d4
         move kv(pc),d4
         add.l d4,d4     ;i <-k*2
         move.l a2,a3
         adda.l d4,a3
         subq.l #1,d4     ;b <- 2*i-1
         moveq.l #0,d3
  ifeq MULUopt
         move #10000,d1
  endif
         bra .l4

.longdiv
  if __VASM&28              ;68020/30?
         divul d4,d3:d6
  else
         swap d6
         move d6,d3
         divu d4,d3
         swap d3
         move d3,d6
         swap d6
         divu d4,d6

         move d6,d3
         exg.l d6,d3
         clr d3
         swap d3
  endif
         move d3,(a3)     ;r[i] <- d%b
         bra.s .enddiv

  if __VASM&28              ;68020/30?
         align 2
  endif
.l2      sub.l d6,d5
         sub.l d3,d5
         lsr.l d5
.l4
  if MULUopt
         moveq.l #0,d0  ;MULU optimization
  endif
         move -(a3),d0      ; r[i]
  if MULUopt
         move.l d0,d1   ;MULU optimization
         lsl.l #3,d0
         sub.l d0,d1
         add.l d0,d0
         sub.l d0,d1
         sub.l d0,d1
         lsl.l #8,d1
         sub.l d1,d0
  else
         mulu d1,d0       ;r[i]*10000, removed with MULU optimization
  endif
         add.l d0,d5       ;d += d + r[i]*10000
         move.l d5,d6
         divu d4,d6
         bvs.s .longdiv

         move d6,d3
         clr d6
         swap d6
         move d6,(a3)     ;r[i] <- d%b
.enddiv
         subq #2,d4    ;i <- i - 1
         bcc .l2       ;the main loop
  if MULUopt
         divu #10000,d5  ;MULU optimization
  else
         divu d1,d5      ;removed with MULU optimization
  endif
  if IO
         add cv(pc),d5    ;c + d/10000
         swap d5      ;c <- d%10000
         move d5,cv
         clr d5
         swap d5
         bsr PR0000
  endif
         sub.w #14,kv
         bne .l0

         move.l time(pc),d5
         ;move.w #$c000,$dff096    ;DMA on
         exg.l a5,a6
         moveq.l #INTB_VERTB,d0
         lea.l VBlankServer(pc),a1
         jsr RemIntServer(a6)
         jsr Permit(a6)
         exg.l a5,a6

         moveq.l #1,d3
         move.l cout(pc),d1
         move.l #msgx,d2
         jsr Write(a6)  ;space

         move.l d5,d3
         lsl.l d5
         cmp.b #50,VBlankFrequency(a5)
         beq .l8

         lsl.l d5      ;60 Hz
         add.l d3,d5
         divu #3,d5
         swap d5
         lsr #2,d5
         swap d5
         negx.l d5
         neg.l d5

.l8      lea string(pc),a3
         move #10,d4
         move.l d5,d6
         div32x16
         move.b d6,(a3)+
         divu d4,d7
         swap d7
         move.b d7,(a3)+
         clr d7
         swap d7
         move.b #'.'-'0',(a3)+
.l12     tst d7
         beq .l11

         divu d4,d7
         swap d7
         move.b d7,(a3)+
         clr d7
         swap d7
         bra .l12

.l11     add.b #'0',-(a3)
         moveq #1,d3
         move.l cout(pc),d1
         move.l a3,d2
         jsr Write(a6)
         cmp.l #string,a3
         bne .l11

         move.l cout(pc),d1
         move.l #msgx+1,d2
         jsr Write(a6)  ;newline

         move.l a6,a1
         move.l a5,a6
         jmp CloseLibrary(a6)

PR0000     ;prints d5
       lea buf(pc),a0
       bsr .l1
       moveq #4,d3
       move.l cout(pc),d1
       move.l #buf,d2
       jmp Write(a6)             ;call Write(stdout,buff,size)

.l1    divu #1000,d5
       bsr .l0
       clr d5
       swap d5

       divu #100,d5
       bsr .l0
       clr d5
       swap d5

       divu #10,d5
       bsr .l0
       swap d5

.l0    eori.b #'0',d5
       move.b d5,(a0)+
eos    rts

rasteri      addq.l #1,time
;If you set your interrupt to priority 10 or higher then a0 must point at $dff000 on exit
      moveq #0,d0  ; must set Z flag on exit!
      rts

VBlankServer:
      dc.l  0,0                   ;ln_Succ,ln_Pred
      dc.b  NT_INTERRUPT,0        ;ln_Type,ln_Pri
      dc.l  libname               ;ln_Name
      dc.l  0,rasteri             ;is_Data,is_Code

cv  dc.w 0
kv  dc.w 0
time dc.l 0
cout dc.l 0
buf ds.b 4
msgx dc.b 32,10
        align 2
ra

         lea.l libname(pc),a1         ;open the dos library
         move.l 4,a5
         move.l a5,a6
         jsr OldOpenLibrary(a6)
         move.l d0,a6
         jsr Output(a6)          ;get stdout
         move.l d0,cout
         move.l d0,d1                   ;call Write(stdout,buff,size)
         move.l #msg1,d2
         moveq #msg4-msg1,d3
         jsr Write(a6)
         move.l #start+$10000-ra,d0
         divu #7,d0
         ext.l d0
         and.b #$fc,d0
         move.l d0,d7     ;d7=maxn

.l20     move.l cout(pc),d1
         move.l #msg4,d2
         moveq #msg5-msg4,d3
         jsr Write(a6)
         move.l d7,d5
         bsr PR0000
         move.l cout(pc),d1
         move.l #msg5,d2
         moveq #msg3-msg5,d3
         jsr Write(a6)
       
getnum
        jsr Input(a6)          ;get stdin
         move.l #string,d2     ;set by previous call
         move.l d0,d1
         moveq.l #5,d3     ;+ newline
         jsr Read(a6)
         subq #1,d0
         beq getnum

         move.l d2,a0
         clr.l d5
.l1      clr d6
         move.b (a0)+,d6
         sub.b #'0',d6
         add d6,d5
         subq #1,d0
         beq skip

         mulu #10,d5
         bra .l1

skip
         cmp d7,d5
         bhi .l20

         move d5,d1
         beq .l20

         addq #3,d5
         and #$fffc,d5
         cmp.b #10,(a0)
         bne .l21

         cmp d1,d5
         beq .l7

.l21     move d5,a4
         bsr PR0000
         move a4,d5
         move.l cout(pc),d1
         move.l #msg3,d2
         moveq #msg2-msg3+1,d3
         jsr Write(a6)
.l7      lsr d5
         mulu #7,d5
         move.l d5,d3
         lea.l ra(pc),a2

         exg.l a5,a6
         jsr Forbid(a6)
         moveq.l #INTB_VERTB,d0
         lea.l VBlankServer(pc),a1
         jsr AddIntServer(a6)
         exg.l a5,a6
         ;move.w #$4000,$dff096    ;DMA off

         lsr d3
         subq #1,d3
         move.l #2000*65537,d0
         move.l a2,a0

         clr cv
         move d5,kv
         bra.w Back


string = msg1
libname  dc.b "dos.library",0
msg1  dc.b 'number pi calculator v11 [beta 4]'
  if __VASM&28              ;68020/30?
      dc.b '(68020)'
  else
      dc.b '(68000)'
  endif
msg4  dc.b 10,'number of digits (up to '
msg5 dc.b ')? '
msg3  dc.b ' digits will be printed'
msg2  dc.b 10
      dcb.b 65536-(ra-start)
      end start

modrobert · 11 May 2021, 06:02

Quote:

Originally Posted by litwr

saimo could make an almost impossible thing. He has made the code a bit more faster! Could you please run the new programs for me? I hope it is the last time. I am sure that no more optimizations possible.

Sorry, my A1200 got serious screen flicker via RGB, after I checked the motherboard it turns out there are some leaky electrolytic SMD capacitors, ordered new ones, will take a few days to fix. This could be the reason my ACA-1232 (68030 @ 33MHz) fails intermittently as well.

Looks like saimo got you covered.

litwr · 11 May 2021, 19:39

Quote:

Originally Posted by saimo

Here are the results. I have run each test 3 or 4 times.
68020, pi-amiga:
* 100 digits: 0.12
* 1000 digits: 4.80 - 4.82
* 3000 digits: 36.92 - 36.94

68020, pi-amiga1200:
* 100 digits: 0.12 - 0.14
* 1000 digits: 4.64
* 3000 digits: 35.32

68030, data cache burst on, pi-amiga:
* 100 digits: 0.04 - 0.08
* 1000 digits: 1.68
* 3000 digits: 10.96

68030, data cache burst off, pi-amiga:
* 100 digits: 0.04 - 0.08
* 1000 digits: 1.68
* 3000 digits: 10.94 - 10.96

68030, data cache burst on, pi-amiga1200:
* 100 digits: 0.06 - 0.08
* 1000 digits: 1.64 - 1.66
* 3000 digits: 10.62 - 10.64

68030, data cache burst off, pi-amiga1200:
* 100 digits: 0.06 - 0.08
* 1000 digits: 1.64 - 1.66
* 3000 digits: 10.62

The above tests have been run without booting with startup-sequence, i.e. with the barest system running.

Thank you very much for so thorough information. However I need several more details from you. Did your 68020 system use fast RAM? Did your runs actually print digits of the pi on the screen?

Quote:

Originally Posted by modrobert

Sorry, my A1200 got serious screen flicker via RGB, after I checked the motherboard it turns out there are some leaky electrolytic SMD capacitors, ordered new ones, will take a few days to fix. This could be the reason my ACA-1232 (68030 @ 33MHz) fails intermittently as well.

It is really sad to know about your problems.

I wish your hardware to be fixed soon.

saimo · 11 May 2021, 20:52

Quote:

Originally Posted by litwr

Thank you very much for so thorough information. However I need several more details from you. Did your 68020 system use fast RAM? Did your runs actually print digits of the pi on the screen?

To use the 68020, I have to disable the accelerator board that mounts the 68030 and the FAST RAM, the 68020 tests are relative to the stock A1200 (which has only CHIP RAM).
Yes, the test programs outputted the digits.

litwr · 12 May 2021, 19:54

Quote:

Originally Posted by saimo

Let's suppose that the loops are just 40,000 and that each loop takes 100 cycles: the total amount of cycles is 40,000 * 100 = 4,000,000. Execution on a 50 MHz CPU would require 4,000,000 / 50,000,000 = 0.08 s, i.e. 4 PAL frames. Now, if the loop were optimized by 2 cycles, the total amount of cycles would become 40,000 * 98 = 3,920,000, which, on the same CPU, would run in 3,920,000 / 50,000,000 = 0.0784 s, i.e. = 3,92 frames. The 0.02 s resolution wouldn't suffice anymore.

You are right for your case but the pi-spigot case is different.

And anyway, you can just increase the number of iteration from 100 to, for instance 10000...

Quote:

Originally Posted by saimo

As far as I know, no. I've never read anything like that on the MC68k manuals. In that regard, branching affects only the 68060 branch prediction.

So we have a mystery again. I have attached an archive which contains two programs: pi-mulu and pi-opt. We need to run them on real 68020 or 68030 hardware and check the timing difference (=D). We know that for N digits the algo does K=7*(N+4)*N/16 iterations. So if we multiply D/K by the CPU frequency we get the number of the saved ticks. Of course, larger N is better, IMHO N=3000 or more will be ok. For other N we must get the same number of saved ticks in cases of the 68020/30. Would anybody like please to run these programs on real 68020/30 hardware?
I tried to check this saved tick number for the emulated A500. I got data that FS-UAE is rather not completely accurate at least for the MULU timings.

I got that code with MULU is about 42 ticks faster that with MULU optimization. Let's examine the 68000 timings

Code:

  moveq.l #0,d0       4
  move.l d0,d1        4       
  lsl.l #3,d0         14      
  sub.l d0,d1         6       
  add.l d0,d0         6       
  sub.l d0,d1         6       
  sub.l d0,d1         6       
  lsl.l #8,d1         24
  sub.l d1,d0         6       
                      =76

  mulu d1,d0          38-70 - it is 54 on average

So 42 is too large. The maximum number possible is 38 and the real number is rather close to 30. It will be very interesting to get results from a real A500 - this can be a bug report for emulator writers.

Quote:

Originally Posted by saimo

It doesn't "happen": it's forced to happen (by the user, normally). The reason is that locations 0-1023 belong to the CHIP RAM, which is the RAM shared between CPU and chipset, and thus subject to contention (therefore, for example, the deeper is the screen mode, the more screen data needs to be fetched, the less time the CPU has access to the CHIP RAM); moreover, its clock is slower than that of the CPU (for example, 1/4 of the 68020 in the Amiga 1200). So, if the machine happens to have FAST RAM, i.e. RAM exclusively accessed by the CPU, it's best to have the vectors relocated there, so that interrupts processing is faster.

Thank you very much. However I still have doubts about usefulness of such the 68010+ feature. The interrupt initiating sequence is rather slow on the 68k, so possible loss of 4 cycles means a very little overhead in actual interrupt processing. On the other hand, the 68030+ and even some 68000/10/20 systems have an MMU which allows us to do the same trick in more natural and common way. Maybe I missed something?

Quote:

Originally Posted by saimo

Here are the results. I have run each test 3 or 4 times.

68030, data cache burst on, pi-amiga:
* 100 digits: 0.04 - 0.08
* 1000 digits: 1.68

68030, data cache burst off, pi-amiga:
* 100 digits: 0.04 - 0.08
* 1000 digits: 1.68

68030, data cache burst on, pi-amiga1200:
* 100 digits: 0.06 - 0.08
* 1000 digits: 1.64 - 1.66

68030, data cache burst off, pi-amiga1200:
* 100 digits: 0.06 - 0.08
* 1000 digits: 1.64 - 1.66

IMHO it is possible that your two last results have a typo for 100 digits timing. The 0.06 value for the pi-amiga1200 looks not correct, it can't be more than the corresponding value from pi-amiga.
It is also interesting that your results show a big advantage of the 68030 over the 68020. The table #2 shows ER that is equal 422 for the 68030 and 438 for the 68020. These numbers were almost equal before you published results from your Blizzard card.

Quote:

Originally Posted by saimo

The above tests have been run without booting with startup-sequence, i.e. with the barest system running. Out of curiosity, I tested it also with AmigaOS fully booted, and that confirmed the impression I had from watching the screen: the timing includes the printout operations! That has a serious impact on the times. For example, I have run the pi-amiga test with a value of 1000 a number of times, changing the shell windows size and the screen depth, and obtained times ranging from 1.98 to 4.12 (and I could have obtained even more extreme values by removing some patches, playing more with the window size, etc.).

Of course, we have to provide conditions for the best timings: maximize the shell window and clear the screen.

Quote:

Originally Posted by saimo

Wait, there's still the outer loop to be optimized! Did you check the code at the bottom of the previous post?

Of course we can do something in the code outside the main loop but it doesn't affect timings. So I am rather reluctant to do anything there. IMHO your changes maybe save several cycles but this doesn't affect the program performance. I am also not sure if your changes make code a bit larger. Anyway thank you.

Quote:

Originally Posted by Don_Adan

Why not?

This way we can modify the Atari ST code too.

However the main issue is in the fact that such changes makes the code larger.

Quote:

Originally Posted by modrobert

Looks like saimo got you covered.

Your A1200 has fast RAM, so its results cannot be replaced by results from saimo's systems.

Don_Adan · 12 May 2021, 22:29

"This way we can modify the Atari ST code too. However the main issue is in the fact that such changes makes the code larger."

Amiga initialisation code is perhaps larger than Atari ST.
Not exactly larger, because "getnum" routine was inlined. Current version can be shortest much more, still is not fully PC relative, registers can be better used too. Here is example for 4 bytes code gain plus one reloc too.
I dont know if main target is fastest code, or shortest code or Max Digits code.

Code:

PR0000     ;prints d5
       lea buf(pc),a0
 move.l A0,D2
       bsr .l1
       moveq #4,d3
       move.l cout(pc),d1
;       move.l #buf,d2
       jmp Write(a6)             ;call Write(stdout,buff,size)

.l1    divu #1000,d5
       bsr .l0
       clr d5
       swap d5

       divu #100,d5
       bsr .l0
       clr d5
       swap d5

       divu #10,d5
       bsr .l0
       swap d5

.l0    eori.b #'0',d5
       move.b d5,(a0)+
eos    rts

modrobert · 13 May 2021, 19:31

Quote:

Originally Posted by litwr

Your A1200 has fast RAM, so its results cannot be replaced by results from saimo's systems.

Alright, A1200 recapped along with the power supply, what do you want me to test?

EDIT:

Did these for now...

Code:

> pi-mulu
number pi calculator v11 [mulu opt test](68000)
number of digits (up to 9276)? 3000
 31.86

Code:

> pi-opt
number pi calculator v11 [mulu opt test](68000)
number of digits (up to 9276)? 3000
 30.72

There was no pi output from these (assuming this is intentional).

saimo · 13 May 2021, 23:55

Quote:

Originally Posted by litwr

So we have a mystery again ...

Again, simply don't use emulators as reference.

Quote:

Thank you very much. However I still have doubts about usefulness of such the 68010+ feature. The interrupt initiating sequence is rather slow on the 68k, so possible loss of 4 cycles means a very little overhead in actual interrupt processing. On the other hand, the 68030+ and even some 68000/10/20 systems have an MMU which allows us to do the same trick in more natural and common way. Maybe I missed something?

Yes.
Interrupts are critical (especially on devices with limited power, like the Amiga or those that use the M68k CPUs as embedded controllers), so the faster they are handled, the better. Now, the VBR helps speeding up the execution of interrupts on Amiga because of its RAM architecture, but that might not apply to other systems.
But even without the speed factor, being able to relocate the vectors table could be an advantage: for example, more freedom for hardware/OS designers, possibility of exploiting those 1024 bytes in the 0 page (which can be accessed more quickly thanks to its 16 bit addresses), more robustness (less risk of destroying the vectors due to null pointers).
The MMU is meant for other purposes, and it's best to use it for those, instead of complicating its usage even more. Not to mention that MMU slows things down a bit and requires RAM (OK, transparent translation registers might help, but then again those are better used for other purposes).
Moreover, only the non-EC/LC 68030+ CPUs have an MMU.

Quote:

IMHO it is possible that your two last results have a typo for 100 digits timing. The 0.06 value for the pi-amiga1200 looks not correct, it can't be more than the corresponding value from pi-amiga.

No typo: those are the actual figures. I ran the tests multiple times.

Quote:

It is also interesting that your results show a big advantage of the 68030 over the 68020.

No surprise at all there: if you look at the figures, you can easily see that the 68030 is about 3.5 faster than the 68020, and that's perfectly logical since that's the frequency ratio, and the CPU cache and the instructions timings are (basically) the same.

Quote:

The table #2 shows ER that is equal 422 for the 68030 and 438 for the 68020. These numbers were almost equal before you published results from your Blizzard card.

I can't follow this. What's ER?

Quote:

Of course, we have to provide conditions for the best timings: maximize the shell window and clear the screen.

Nope, the solution is disabling the printout: if you want to check the speed of the algorithm, eliminate all the other factors (especially considering the limited precision of the timing mechanism).

Quote:

Of course we can do something in the code outside the main loop but it doesn't affect timings. So I am rather reluctant to do anything there. IMHO your changes maybe save several cycles but this doesn't affect the program performance. I am also not sure if your changes make code a bit larger. Anyway thank you.

At a very quick glance (no time for a thorough analysis now), the differences are the following (I have left out the similar/identical/equivalent parts).

EDIT: I gave the code a less swift look and added also 68020 cache-case timings; b = bytes, c = cycles.

Your code:

Code:

   move d5,kv      ;6b 6c
   ...
   move kv(pc),d4  ;4b 3c
   add.l d4,d4     ;2b 2c
   move.l a2,a3    ;2b 2c
   adda.l d4,a3    ;2b 2c
   ...
   sub.w #14,kv    ;6b 10c

total: 22b 25c

My code:

Code:

   movea.w #28,a1       ;4b 4c
   ...
   move.w  d2,d4        ;2b 2c
   lea.l   (a2,d2.w),a3 ;4b 6c
   ...
   sub.w   a1,d2        ;2b 2c

total: 12b 14c

The timings assume 0-wait-state memory accesses, so the actual timings of the instructions that use kv are much slower (like, two or three times slower).
Note: even just 11 cycles per external loop do make a difference of some thousands of cycles when thousands of digits are calculated; the difference might escape the vertical blanking based time measuring, but I just can't see why not making the code faster - that's one of the purposes why you started this thread, isn't it?

Additional note: like I said in the original post, whether my code does the right thing is still to be verified.

By the way, I have an alternative version of the inner loop which is 2 cycles faster on 68020 (and probably even faster on 68040 and 68060, but not on 68030) when the mulu optimization is not used. I'm trying to figure out a way to adapt the mulu optimization version to take advantage of the optimization, but it looks like it might be impossible.

litwr · 14 May 2021, 21:46

Quote:

Originally Posted by Don_Adan

Amiga initialisation code is perhaps larger than Atari ST.
Not exactly larger, because "getnum" routine was inlined. Current version can be shortest much more, still is not fully PC relative, registers can be better used too. Here is example for 4 bytes code gain plus one reloc too.
I dont know if main target is fastest code, or shortest code or Max Digits code.

Code:

PR0000     ;prints d5
       lea buf(pc),a0
 move.l A0,D2

Thank you. I have used your optimization of PR0000. However the size of code or number of digits are not important - they have only secondary meaning. You can check code for the x86 - it even contains a long sonorous phrase which just occupies its bytes.

Even the number of digits for the current Amiga version is larger than this number for the IBM PC.

I don't think that the inlined getnum and similar tricks are useful. They make code less clear and don't affect the speed.

Quote:

Originally Posted by modrobert

Alright, A1200 recapped along with the power supply, what do you want me to test?

Code:

> pi-mulu
number pi calculator v11 [mulu opt test](68000)
number of digits (up to 9276)? 3000
 31.86

Code:

> pi-opt
number pi calculator v11 [mulu opt test](68000)
number of digits (up to 9276)? 3000
 30.72

There was no pi output from these (assuming this is intentional).

Thank you very much! Your results proved that the gain from MULU optimization is not 5 cycles but only 4 = (31.86-30.72)/(7*3000*3004/16)*7090000*2. So it seems that sometime the 68020 treats the MULU instruction as its best case...

My request has also been to run programs from pi-amiga-11-beta-4.zip for 100/1000/3000 digits. Your system uses fast RAM and this gives us other numbers, not the same as saimo's system. The archive is attached to this post
http://eab.abime.net/showpost.php?p=...0&postcount=82

Quote:

Originally Posted by saimo

Again, simply don't use emulators as reference.

However for the 68000 FS-UAE is quite accurate. I only showed that it has some minor timing issue with the MULU timing emulation.

Quote:

Originally Posted by saimo

Interrupts are critical (especially on devices with limited power, like the Amiga or those that use the M68k CPUs as embedded controllers), so the faster they are handled, the better. Now, the VBR helps speeding up the execution of interrupts on Amiga because of its RAM architecture, but that might not apply to other systems.

IMHO if we need low interrupt latency we should use 6502/65816.

An interrupt handler routine can't be executed faster than 50 cycles on the 68020/30. And typical handlers need more than 100 cycles. So 1 saved cycle means very little. And it is only due some peculiarities of the Amiga architecture.

Quote:

Originally Posted by saimo

But even without the speed factor, being able to relocate the vectors table could be an advantage: for example, more freedom for hardware/OS designers, possibility of exploiting those 1024 bytes in the 0 page (which can be accessed more quickly thanks to its 16 bit addresses), more robustness (less risk of destroying the vectors due to null pointers).

IMHO all this things are rather minor, they can give only tiny advantages in rather very rare cases. Of course, it is just my IMHO.

Quote:

Originally Posted by saimo

Moreover, only the non-EC/LC 68030+ CPUs have an MMU.

Thank you, I missed this. I was sure that all top Amigas can run Unix. There was a strange story when Mehdi Ali refused to share Unix for the Amiga with Sun - https://www.commodore.ca/commodore-h...-of-commodore/ - one might think that Commodore invested much in their Unix, so it could be used widely...

Quote:

Originally Posted by saimo

No typo: those are the actual figures. I ran the tests multiple times.

It is one more mystery.

Quote:

Originally Posted by saimo

No surprise at all there: if you look at the figures, you can easily see that the 68030 is about 3.5 faster than the 68020, and that's perfectly logical since that's the frequency ratio, and the CPU cache and the instructions timings are (basically) the same.

What figures? The 68030 can't be 3.5 faster than the 68020.

Quote:

Originally Posted by saimo

I can't follow this. What's ER?

It is the value in the second table which sorted out many CPUs. The cite The next table contains approximate values of efficiency reciprocals (ER). These values are calculated by multiplication of the time (CPU time only without IO) of the calculation of 3000 digits by the CPU frequency. The ER values are gotten for pi-spigot which uses base 16-bit integer arithmetic. The best ER value for each CPU is taken. The ER value reflects the efficiency of a CPU electronics.

Quote:

Originally Posted by saimo

Nope, the solution is disabling the printout: if you want to check the speed of the algorithm, eliminate all the other factors (especially considering the limited precision of the timing mechanism).

It can't be disabled because the program is the number pi calculator.

Let me repeat again. I gather 3 numbers for the performance because this actually allows us to have combined value for a CPU performance and the standard character printout function speed coded in that 3 numbers. I have checked their accuracy many times. So your objection doesn't rely on facts. Just perform your measurements accurately and you get that the CPU performance is measured with less than 1% errors.

Quote:

Originally Posted by saimo

EDIT: I gave the code a less swift look and added also 68020 cache-case timings; b = bytes, c = cycles.
Note: even just 11 cycles per external loop do make a difference of some thousands of cycles when thousands of digits are calculated; the difference might escape the vertical blanking based time measuring, but I just can't see why not making the code faster - that's one of the purposes why you started this thread, isn't it?

The external loop is executed only D/4 times where D is the number of the pi digits while the internal loop is executed 7*(D+4)*D/16 times... Moreover the external loop contains also the printout routine which is much longer than 11 cycles.

Quote:

Originally Posted by saimo

Additional note: like I said in the original post, whether my code does the right thing is still to be verified.

And it is the main issue. Of course I know that it is possible to use registers more optimal but the gain is 0 so I don't invest my time for this.

Quote:

Originally Posted by saimo

By the way, I have an alternative version of the inner loop which is 2 cycles faster on 68020 (and probably even faster on 68040 and 68060, but not on 68030) when the mulu optimization is not used. I'm trying to figure out a way to adapt the mulu optimization version to take advantage of the optimization, but it looks like it might be impossible.

It will be interesting to know details. I was sure that this is already almost impossible.

saimo · 14 May 2021, 23:05

@litwr

I honestly can't afford to get into a philosophical/mindset debate, so I'll keep the answers to the bare minimum.

Quote:

Originally Posted by litwr

What figures? The 68030 can't be 3.5 faster than the 68020.

The frequency of the 68030 in my Amiga is more than 3.5 faster than that of the 68020 in the same machine. See my previous posts.

Quote:

And it is the main issue. Of course I know that it is possible to use registers more optimal but the gain is 0 so I don't invest my time for this.

You've shown interest in saving bytes, and my code, besides being faster, also saves bytes - overall, the gain is all but 0.
Moreover, it should be easy to you to see whether the code is correct.

Your code does:
loop:
get kv
kv*2 -> add to array offset
kv*2-1 -> divisor
...
kv -= 14
goto loop

My code does (using more efficient instructions):
kv*2 -> kv2
loop:
kv2 -> add to array offset
kv2-1 -> divisor
...
kv -= 28
goto loop

That should work, right?

Quote:

It will be interesting to know details. I was sure that this is already almost impossible.

If/when I'm done with it, I'll post the code. Maybe included in a whole algorithm (which is basically already in place) or full program. But now I'm busy with other stuff (a game, which actually I had to stop working on due to sleeping issues).

Don_Adan · 15 May 2021, 00:55

Write routine is integrated with main loop, then D2 is trashed and A1 can be trashed via Write(A6) as scratch register. Full loop is up to .l0 label, i think. D7 can be better used and maybe (SP), but i dont checked code exactly.

modrobert · 15 May 2021, 10:38

Quote:

Originally Posted by litwr

My request has also been to run programs from pi-amiga-11-beta-4.zip for 100/1000/3000 digits. Your system uses fast RAM and this gives us other numbers, not the same as saimo's system. The archive is attached to this post
http://eab.abime.net/showpost.php?p=...0&postcount=82

Code:

 > pi-amiga
number pi calculator v11 [beta 4](68000)
number of digits (up to 9272)? 100
314...  .14

> pi-amiga
number pi calculator v11 [beta 4](68000)
number of digits (up to 9272)? 1000
314...  4.72

> pi-amiga
number pi calculator v11 [beta 4](68000)
number of digits (up to 9272)? 3000
314... 35.46

Code:

> pi-amiga1200
number pi calculator v11 [beta 4](68020)
number of digits (up to 9272)? 100
314...  .12

> pi-amiga1200
number pi calculator v11 [beta 4](68020)
number of digits (up to 9272)? 1000
314...  4.36

> pi-amiga1200
number pi calculator v11 [beta 4](68020)
number of digits (up to 9272)? 3000
314...  33.96

saimo · 15 May 2021, 13:41

Quote:

Originally Posted by Don_Adan

Write routine is integrated with main loop, then D2 is trashed and A1 can be trashed via Write(A6) as scratch register. Full loop is up to .l0 label, i think. D7 can be better used and maybe (SP), but i dont checked code exactly.

Right, the write routine: I wasn't even remotely thinking about that, as I was focused on optimizing the core. a1 is not a problem: the .l0 loop doesn't use a2,a4 and a5, so a1 can be replaced with one of those registers. Regarding d2, my further-optimized code might help, as it frees d7 (which then can be used in place of d2):

Code:

         clr.l d6
         ...

.l0      move.w  #20000,d1
         ...

.longdiv divul.l d4,d6:d5
         move.w  d6,(a3)
         suba.l  d5,a5
.l2      suba.l  d6,a5
         subq.w  #2,d4
         bcs.b   .enddiv

.l4      move.w  -(a3),d5
         mulu.w  d1,d5
         add.l   a5,d5
         lsr.l   #1,d5
         movea.l d5,a5
         divu.w  d4,d5
         bvs.b   .longdiv

         move.w  d5,d6
         swap.w  d5
         move.w  d5,(a3)
         suba.w  d5,a5
         subq.w  #2,d4
         bcc.b   .l2

.enddiv

The trick exploits the sign-extension of suba.w to avoid clearing the upper word of a register and thus save two cycles. As a side effect, d7 is no longer needed.

That said, the following must be noted.

The trick doesn't bring more speed on 68030 because suba.w takes 4 cycles instead of 2, unlike on the other 68020+ CPUs - actually, I'm starting to wonder if it's an erratum on the MC68020UM or MC68030UM, given the strong similarity of the two CPUs (my memory tells me that there actually is a penalty, but recently it failed me precisely regarding similar matters). Maybe I'll check it out later.

To avoid losing cycles for the calculation of d (the mulu part), the latter has to change from:
d = d/2 + r[i]*10000
to:
d = (d + r[i]*20000)/2
This works also when d is odd thanks to the fact that r[i]*20000 is always even.

This change means that the trick doesn't bring any benefit (actually, it causes a 2 cycles penalty if suba.w actually takes longer) if the mulu optimization is used because multiplying by 20000 requires 2 cycles more (I've been trying lots of alternatives and built some tables to try to avoid that, but it doesn't look like it's possible - this is what I was referring to in my previous post).

move.w d5,d6 and suba.l d6,a5 work fine because d5 and d6 are always a 16-bit number, being the remainder of the division by d4, which is a 16-bit number itself.

suba.w d5,a5 works fine when bit #15 of d5 is 0, i.e. when d5 < 32768; given that d5 is the remainder of the division, the following condition applies:
* d4, the divisor, must be < 32769
* d4 is at most N*2-1;
* N is 7*D/2;
* hence, d4 is at most (7*D/2)*2-1 = 7*D-1;
* hence, it must be 7*D-1 < 32769 -> D < (32769+1)/7 = 4681.

Whether such condition is acceptable is another story. If not, a clr.w d5 before swap.w d5 is enough to remove suba.w d5,a5 and move .l2 one line up: this will cancel the 2-cycle gain and bring the limit of D back to the original value (7*D-1 < 65536 -> D < (65537+1)/7 = 9362) - but, of course, this is useful only in the non-mulu-optimization case.

EDIT

I have just tested the behaviour of suba.w on real hardware and it turns out that, as I feared, the MC68020UM is wrong, as 2 additional cycles are required.

For the test, I had this loop repeated 65536 times:

Code:

.l suba.w|l d0,a0
   dbf      d0,.l

According to the MC68020, the cache-case timings are 2 for suba (regardless of the size) and 6 for dbf, for a total of 8 cycles.
According to the MC68030, instead, suba.w takes 4 cycles and suba.l 2 cycles.

On 68020, the results were:

suba.l
color clocks: 131510
total cycles_: (147924*14187580)/3546895 = 526040
cycles per loop: 526040/65536 = 8.03

suba.w
color clocks: 164332
total cycles: (164332*14187580)/3546895 = 657328
cycles per loop: 657328/65536 = 10.03

On 68030, the results were:

suba.l
color clocks: 37254
total cycles: (37254*50000000)/3546895 = 525163.6
cycles per loop: 525163.6 / 8.01

suba.w
color clocks: 46564
total cycles: (46564*50000000)/3546895 = 656405.1
cycles per loop: 656405.1/65536 = 10.02

(The total cycles are calculated as <measured color clocks> * <CPU frequency> / <color clock frequency>.)

Therefore, unfortunately, speed-wise, the alternative code in this post can be useful at most only on 68040 and 68060.

litwr · 15 May 2021, 15:10

The pipack #55 has just been released.

Sorry it has been happened before saimo posted his last update.
Being pushed by saimo I removed the kv-variable, this shortened the code by 16 bytes. However it was not enough to get 4 more digits for the Amiga but it was enough for the Atari ST that can print up to 9292 digits now.
BTW Does anyone know how to assign another name for a register in VASM assembly? EQU doesn't work.

Quote:

Originally Posted by saimo

I honestly can't afford to get into a philosophical/mindset debate, so I'll keep the answers to the bare minimum.

I am not a philosopher.

I am a technician who has also interest in history. IMHO I always talk about facts.

Quote:

Originally Posted by saimo

You've shown interest in saving bytes, and my code, besides being faster, also saves bytes - overall, the gain is all but 0.
Moreover, it should be easy to you to see whether the code is correct.

Yes less bytes is better but... Why leave the airplane body without fancy paint? or... Why leave the cabin without comfortable seats?

I am sure your optimization idea is correct. But it slightly changes the algo and anyway requires some testing. You know my latest optimization (pushed

by you) gave us 16 bytes and it is more than 12 bytes you offered.

Quote:

Originally Posted by saimo

If/when I'm done with it, I'll post the code. Maybe included in a whole algorithm (which is basically already in place) or full program. But now I'm busy with other stuff (a game, which actually I had to stop working on due to sleeping issues).

It sounds like something super-impossible. However it seems that your last update will require the next iteration for the Amiga and Atari ST.

Now I need some time to dig into your last post.

Quote:

Originally Posted by modrobert

Code:

 > pi-amiga
number pi calculator v11 [beta 4](68000)
number of digits (up to 9272)? 100
314...  .14

> pi-amiga
number pi calculator v11 [beta 4](68000)
number of digits (up to 9272)? 1000
314...  4.72

> pi-amiga
number pi calculator v11 [beta 4](68000)
number of digits (up to 9272)? 3000
314... 35.46

Code:

> pi-amiga1200
number pi calculator v11 [beta 4](68020)
number of digits (up to 9272)? 100
314...  .12

> pi-amiga1200
number pi calculator v11 [beta 4](68020)
number of digits (up to 9272)? 1000
314...  4.36

> pi-amiga1200
number pi calculator v11 [beta 4](68020)
number of digits (up to 9272)? 3000
314...  33.96

Thank you very much! Your results are very interesting. They show that the better optimized 68020 code beats fast RAM advantages. Software has beaten hardware this time.

meynaf · 15 May 2021, 15:29

Quote:

Originally Posted by litwr

BTW Does anyone know how to assign another name for a register in VASM assembly? EQU doesn't work.

Use EQUR.

12 May 2021, 22:29	#91
Don_Adan Registered User Join Date: Jan 2008 Location: Warsaw/Poland Age: 55 Posts: 1,957	"This way we can modify the Atari ST code too. However the main issue is in the fact that such changes makes the code larger." Amiga initialisation code is perhaps larger than Atari ST. Not exactly larger, because "getnum" routine was inlined. Current version can be shortest much more, still is not fully PC relative, registers can be better used too. Here is example for 4 bytes code gain plus one reloc too. I dont know if main target is fastest code, or shortest code or Max Digits code. Code: PR0000 ;prints d5 lea buf(pc),a0 move.l A0,D2 bsr .l1 moveq #4,d3 move.l cout(pc),d1 ; move.l #buf,d2 jmp Write(a6) ;call Write(stdout,buff,size) .l1 divu #1000,d5 bsr .l0 clr d5 swap d5 divu #100,d5 bsr .l0 clr d5 swap d5 divu #10,d5 bsr .l0 swap d5 .l0 eori.b #'0',d5 move.b d5,(a0)+ eos rts

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
68020 Bit Field Instructions	mcgeezer	Coders. Asm / Hardware	9	27 October 2023 23:21
68060 64-bit integer math	BSzili	Coders. Asm / Hardware	7	25 January 2021 21:18
Discovery: Math	Audio Snow	request.Old Rare Games	30	20 August 2018 12:17
Math apps	mtb	support.Apps	1	08 September 2002 18:59

15 May 2021, 00:55	#96
Don_Adan Registered User Join Date: Jan 2008 Location: Warsaw/Poland Age: 55 Posts: 1,957	Write routine is integrated with main loop, then D2 is trashed and A1 can be trashed via Write(A6) as scratch register. Full loop is up to .l0 label, i think. D7 can be better used and maybe (SP), but i dont checked code exactly.

Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)