09 May 2021, 18:07 | #81 | |||||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 787
|
@litwr
Quote:
As for optimizing for 68060, I'm not saying that you should aim for that: I simply suggested some innocuous rearrangement of instructions so that there are less registers conflicts. Quote:
Just to be clear: I'm saying this only to point out that it's difficult to evaluate performance accurately with such method; unfortunately, more precise measurements require lots of additional code and direct hardware access. Quote:
Quote:
Code:
moveq.l #0,d0 ;2 move.w -(a3),d0 move.l d0,d1 ;2 lsl.l #3,d0 ;4 sub.l d0,d1 ;2 add.l d0,d0 ;2 sub.l d0,d1 ;2 sub.l d0,d1 ;2 lsl.l #8,d1 ;4 sub.l d1,d0 ;2 The mulu.w timings are instead: * best-case (maximum overlap case): 25 cycles; * cache-case: 27 cycles. In short, the mulu optimization is quite good (albeit it takes a few bytes) and saves more than 2 cycles. It could cause issues if it makes the main loop larger than the instruction cache (256 bytes), though (but at a glance it doesn't look like the code is even close to that). Maybe the figure you came up with was influnced by the measurements inaccuracy. Quote:
About testing... yesterday and earlier today I made some more tests. First of all, I took my old test-bed program (the same I used for the divu tests reported previously) and I changed it to use, instead of the CIA A 32-bit timer (which runs at a trifle more than 0.7 MHz), the CIA B TOD clock, which has a rasterline resolution (so, 312 times more precise on PAL and 262 times more precise on NTSC than vertical blanking), plus fine-grained timing based on the raster beam position (VHPOSR register): this gives a color clock resolution (about 3.5 MHz), which is 5 times more precise than before. Then, I have put your core loop and my core loop inside the test-bed program (commenting out divu.w + bvs.b, because I didn't have the outer stuff that initializes data and registers) and ran various tests. The results were: * the speed is basically the same on 68020, regardless of some instructions rearrangement and of code alignment; * my code performed better on 68030. I must say that the first result surprised me a bit (even if the removal of divu.w + bvs.b meant that the divul branch optimization was not exploited). Then I changed my code again a little bit to try an optimization that I came up with last night (it had been lingering in my head forever, but somehow I couldn't see it clearly). The resulting code is 2 bytes shorter and, theoretically, 2 cycles faster on both 68020 and 68030 (and a bit less 68060-friendly): Code:
clr.l d3 ... bra.b .l4 .longdiv divul.l d4,d7:d6 move.w d7,(a3) subq.l #2,d4 bcs.b .enddiv sub.l d6,d5 sub.l d7,d5 .l4 move.w -(a3),d0 lsr.l #1,d5 mulu.w d1,d0 add.l d0,d5 move.l d5,d6 divu.w d4,d6 bvs.b .longdiv move.w d6,d3 clr.w d6 sub.l d3,d5 swap.w d6 move.w d6,(a3) sub.l d6,d5 subq.l #2,d4 bcc.b .l4 .enddiv Code:
elapsed time: · color clocks: 14499 · rasterlines: 63.7 · frames: 0.2 · µs: 4087.8 elapsed time: · color clocks: 14075 · rasterlines: 61.9 · frames: 0.1 · µs: 3968.2 Notes: · loop executed 3000 times; · divu.w + bvs.b commented out. That convinced me to finally make a full test. (Fighting against myself,) I looked at the whole source to add also the outer loop. The impression I had had a few days ago regarding other possible optimizations was confirmed, so I rewrote the code as follows: Code:
DN = 3000 ;digits number IN = ((DN/2)*7) ;items number * INITIALIZATION (buffer address in a2) movea.l a2,a0 move.l #2000*$10001,d0 move.w #IN/2-1,d1 .fill move.l d0,(a0)+ dbra d1,.fill move.w #10000,d1 move.w #IN*2,d2 clr.l d3 movea.w #28,a1 * MAIN LOOP .l0 clr.l d4 clr.l d5 move.w d2,d4 lea.l (a2,d2.w),a3 subq.l #1,d4 bra.b .l4 .longdiv divul.l d4,d7:d6 move.w d7,(a3) subq.l #2,d4 bcs.b .enddiv sub.l d6,d5 sub.l d7,d5 .l4 move.w -(a3),d0 lsr.l #1,d5 mulu.w d1,d0 add.l d0,d5 move.l d5,d6 divu.w d4,d6 bvs.b .longdiv move.w d6,d3 clr.w d6 sub.l d3,d5 swap.w d6 move.w d6,(a3) sub.l d6,d5 subq.l #2,d4 bcc.b .l4 .enddiv divu.w d1,d5 sub.w a1,d2 bcc.b .l0 Last edited by saimo; 10 May 2021 at 08:40. |
|||||
10 May 2021, 20:12 | #82 | |||||
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
Quote:
IMHO stopwatch accuracy is about 0.2s - it is not too bad. Quote:
Code:
68020 68030 moveq.l #0,d0 0-2-3 2-2 move.l d0,d1 0-2-3 2-2 lsl.l #3,d0 1-4-4 4-4 sub.l d0,d1 0-2-3 2-2 add.l d0,d0 0-2-3 2-2 sub.l d0,d1 0-2-3 2-2 sub.l d0,d1 0-2-3 2-2 lsl.l #8,d1 1-4-4 4-4 sub.l d1,d0 0-2-3 2-2 2-22-29 22-22 mulu d1,d0 25-27-28 28-28 Results from your 68030 hardware can help us better understand this case. I am also curious why do the 68020 timings have 3 cases while the 68030 have only 2? Quote:
Quote:
Quote:
saimo could make an almost impossible thing. He has made the code a bit more faster! Could you please run the new programs for me? I hope it is the last time. I am sure that no more optimizations possible. EDIT. I have replaced the attached file by an upgraded one. The changes do not affect the performance. I just added a better memory management which allows us to get more digits - thanks to Don_Adan. Last edited by BippyM; 01 June 2021 at 18:24. |
|||||
10 May 2021, 21:23 | #83 | |
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
Quote:
|
|
10 May 2021, 23:02 | #84 | |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,959
|
Quote:
Code:
rasteri addq.l #1,(A1) ;If you set your interrupt to priority 10 or higher then a0 must point at $dff000 on exit moveq #0,d0 ; must set Z flag on exit! rts VBlankServer: dc.l 0,0 ;ln_Succ,ln_Pred dc.b NT_INTERRUPT,0 ;ln_Type,ln_Pri dc.l 0 ;ln_Name dc.l time,rasteri ;is_Data,is_Code |
|
11 May 2021, 00:27 | #85 | ||||||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 787
|
Quote:
Let's suppose that the loops are just 40,000 and that each loop takes 100 cycles: the total amount of cycles is 40,000 * 100 = 4,000,000. Execution on a 50 MHz CPU would require 4,000,000 / 50,000,000 = 0.08 s, i.e. 4 PAL frames. Now, if the loop were optimized by 2 cycles, the total amount of cycles would become 40,000 * 98 = 3,920,000, which, on the same CPU, would run in 3,920,000 / 50,000,000 = 0.0784 s, i.e. = 3,92 frames. The 0.02 s resolution wouldn't suffice anymore. The results I'm reporting below show in practice the limitations of the method. But if it fits your bill, no problem with me Quote:
Quote:
The 68020 manual illustrates 3 cases: best-case (maximum overlap), cache-case and worst-case (no-cache). It basically says: look, when you're really good/lucky, instructions will overlap and execute faster - and with some examples it shows what that means; but it doesn't provide details for all the instructions. The 68030 manual is better: it basically says the same thing, but instead of giving out the best cases, it provides head and tail (the parts that can overlap) timings for each addressing mode and instruction, so that it is actually possible to calculate the benefits of overlapping. Quote:
Quote:
68020, pi-amiga: * 100 digits: 0.12 * 1000 digits: 4.80 - 4.82 * 3000 digits: 36.92 - 36.94 68020, pi-amiga1200: * 100 digits: 0.12 - 0.14 * 1000 digits: 4.64 * 3000 digits: 35.32 68030, data cache burst on, pi-amiga: * 100 digits: 0.04 - 0.08 * 1000 digits: 1.68 * 3000 digits: 10.96 68030, data cache burst off, pi-amiga: * 100 digits: 0.04 - 0.08 * 1000 digits: 1.68 * 3000 digits: 10.94 - 10.96 68030, data cache burst on, pi-amiga1200: * 100 digits: 0.06 - 0.08 * 1000 digits: 1.64 - 1.66 * 3000 digits: 10.62 - 10.64 68030, data cache burst off, pi-amiga1200: * 100 digits: 0.06 - 0.08 * 1000 digits: 1.64 - 1.66 * 3000 digits: 10.62 The above tests have been run without booting with startup-sequence, i.e. with the barest system running. Out of curiosity, I tested it also with AmigaOS fully booted, and that confirmed the impression I had from watching the screen: the timing includes the printout operations! That has a serious impact on the times. For example, I have run the pi-amiga test with a value of 1000 a number of times, changing the shell windows size and the screen depth, and obtained times ranging from 1.98 to 4.12 (and I could have obtained even more extreme values by removing some patches, playing more with the window size, etc.). Quote:
|
||||||
11 May 2021, 04:13 | #86 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,959
|
Why not?
Code:
;for vasm assembler ;it calculates pi-number using the next C-algorithm ;https://crypto.stanford.edu/pbc/notes/pi/code.html ;#include <stdio.h> ;#define N 2800 ;main() { ; long r[N + 1], i, k, b, c; ; c = 0; ; for (i = 1; i <= N; i++) ;it is the fixed line!, the original was (i = 0; i < N; ... ; r[i] = 2000; ; for (k = N; k > 0; k -= 14) { ; d = 0; ; i = k; ; for(;;) { ; d += r[i]*10000; ; b = i*2 - 1; ; r[i] = d%b; ; d /= b; ; i--; ; if (i == 0) break; ; d *= i; ; } ; printf("%.4d", (int)(c + d/10000)); ; c = d%10000; ; } ;} ;the time of the calculation is quadratic, so if T is time to calculate N digits ;then 4*T is required to calculate 2*N digits ;main loop count is 7*(4+D)*D/16, D - number of digits ;So r[0] is never used. The program for 680x0 uses r[0] and doesn't use r[N] - so it optimizes the memory usage by 2 bytes ;litwr has written this for 680x0 ;tricky provided some help ;MMS gave some support ;Thorham and meynaf helped a lot ;a/b, saimo and modrobert helped to optimize the 68k code mc68000 MULUopt = 0 ;1 is much slower for 68000, for 68020 it is the same for FS-UAE and faintly faster with the real 68020 IO = 1 OldOpenLibrary = -408 CloseLibrary = -414 Output = -60 Input = -54 Write = -48 Read = -42 AllocMem = -198 FreeMem = -210 Forbid = -132 Permit = -138 AddIntServer = -168 RemIntServer = -174 VBlankFrequency = 530 INTB_VERTB = 5 ;for vblank interrupt NT_INTERRUPT = 2 ;node type ;N = 7*D/2 ;D digits, e.g., N = 350 for 100 digits div32x16 macro ;D7=D6/D4, D6=D6%D4 ;clr.l d7 moveq.l #0,d7 divu d4,d6 bvc .div32no\@ swap d6 move d6,d7 divu d4,d7 swap d7 move d7,d6 swap d6 divu d4,d6 .div32no\@ move d6,d7 clr d6 swap d6 endm start bra.w Ra Back .fill move.l d0,(a0)+ dbra d3,.fill .l0 clr.l d5 ;d <- 0 clr.l d4 move kv(pc),d4 add.l d4,d4 ;i <-k*2 move.l a2,a3 adda.l d4,a3 subq.l #1,d4 ;b <- 2*i-1 moveq.l #0,d3 ifeq MULUopt move #10000,d1 endif bra .l4 .longdiv if __VASM&28 ;68020/30? divul d4,d3:d6 else swap d6 move d6,d3 divu d4,d3 swap d3 move d3,d6 swap d6 divu d4,d6 move d6,d3 exg.l d6,d3 clr d3 swap d3 endif move d3,(a3) ;r[i] <- d%b bra.s .enddiv if __VASM&28 ;68020/30? align 2 endif .l2 sub.l d6,d5 sub.l d3,d5 lsr.l d5 .l4 if MULUopt moveq.l #0,d0 ;MULU optimization endif move -(a3),d0 ; r[i] if MULUopt move.l d0,d1 ;MULU optimization lsl.l #3,d0 sub.l d0,d1 add.l d0,d0 sub.l d0,d1 sub.l d0,d1 lsl.l #8,d1 sub.l d1,d0 else mulu d1,d0 ;r[i]*10000, removed with MULU optimization endif add.l d0,d5 ;d += d + r[i]*10000 move.l d5,d6 divu d4,d6 bvs.s .longdiv move d6,d3 clr d6 swap d6 move d6,(a3) ;r[i] <- d%b .enddiv subq #2,d4 ;i <- i - 1 bcc .l2 ;the main loop if MULUopt divu #10000,d5 ;MULU optimization else divu d1,d5 ;removed with MULU optimization endif if IO add cv(pc),d5 ;c + d/10000 swap d5 ;c <- d%10000 move d5,cv clr d5 swap d5 bsr PR0000 endif sub.w #14,kv bne .l0 move.l time(pc),d5 ;move.w #$c000,$dff096 ;DMA on exg.l a5,a6 moveq.l #INTB_VERTB,d0 lea.l VBlankServer(pc),a1 jsr RemIntServer(a6) jsr Permit(a6) exg.l a5,a6 moveq.l #1,d3 move.l cout(pc),d1 move.l #msgx,d2 jsr Write(a6) ;space move.l d5,d3 lsl.l d5 cmp.b #50,VBlankFrequency(a5) beq .l8 lsl.l d5 ;60 Hz add.l d3,d5 divu #3,d5 swap d5 lsr #2,d5 swap d5 negx.l d5 neg.l d5 .l8 lea string(pc),a3 move #10,d4 move.l d5,d6 div32x16 move.b d6,(a3)+ divu d4,d7 swap d7 move.b d7,(a3)+ clr d7 swap d7 move.b #'.'-'0',(a3)+ .l12 tst d7 beq .l11 divu d4,d7 swap d7 move.b d7,(a3)+ clr d7 swap d7 bra .l12 .l11 add.b #'0',-(a3) moveq #1,d3 move.l cout(pc),d1 move.l a3,d2 jsr Write(a6) cmp.l #string,a3 bne .l11 move.l cout(pc),d1 move.l #msgx+1,d2 jsr Write(a6) ;newline move.l a6,a1 move.l a5,a6 jmp CloseLibrary(a6) PR0000 ;prints d5 lea buf(pc),a0 bsr .l1 moveq #4,d3 move.l cout(pc),d1 move.l #buf,d2 jmp Write(a6) ;call Write(stdout,buff,size) .l1 divu #1000,d5 bsr .l0 clr d5 swap d5 divu #100,d5 bsr .l0 clr d5 swap d5 divu #10,d5 bsr .l0 swap d5 .l0 eori.b #'0',d5 move.b d5,(a0)+ eos rts rasteri addq.l #1,time ;If you set your interrupt to priority 10 or higher then a0 must point at $dff000 on exit moveq #0,d0 ; must set Z flag on exit! rts VBlankServer: dc.l 0,0 ;ln_Succ,ln_Pred dc.b NT_INTERRUPT,0 ;ln_Type,ln_Pri dc.l libname ;ln_Name dc.l 0,rasteri ;is_Data,is_Code cv dc.w 0 kv dc.w 0 time dc.l 0 cout dc.l 0 buf ds.b 4 msgx dc.b 32,10 align 2 ra lea.l libname(pc),a1 ;open the dos library move.l 4,a5 move.l a5,a6 jsr OldOpenLibrary(a6) move.l d0,a6 jsr Output(a6) ;get stdout move.l d0,cout move.l d0,d1 ;call Write(stdout,buff,size) move.l #msg1,d2 moveq #msg4-msg1,d3 jsr Write(a6) move.l #start+$10000-ra,d0 divu #7,d0 ext.l d0 and.b #$fc,d0 move.l d0,d7 ;d7=maxn .l20 move.l cout(pc),d1 move.l #msg4,d2 moveq #msg5-msg4,d3 jsr Write(a6) move.l d7,d5 bsr PR0000 move.l cout(pc),d1 move.l #msg5,d2 moveq #msg3-msg5,d3 jsr Write(a6) getnum jsr Input(a6) ;get stdin move.l #string,d2 ;set by previous call move.l d0,d1 moveq.l #5,d3 ;+ newline jsr Read(a6) subq #1,d0 beq getnum move.l d2,a0 clr.l d5 .l1 clr d6 move.b (a0)+,d6 sub.b #'0',d6 add d6,d5 subq #1,d0 beq skip mulu #10,d5 bra .l1 skip cmp d7,d5 bhi .l20 move d5,d1 beq .l20 addq #3,d5 and #$fffc,d5 cmp.b #10,(a0) bne .l21 cmp d1,d5 beq .l7 .l21 move d5,a4 bsr PR0000 move a4,d5 move.l cout(pc),d1 move.l #msg3,d2 moveq #msg2-msg3+1,d3 jsr Write(a6) .l7 lsr d5 mulu #7,d5 move.l d5,d3 lea.l ra(pc),a2 exg.l a5,a6 jsr Forbid(a6) moveq.l #INTB_VERTB,d0 lea.l VBlankServer(pc),a1 jsr AddIntServer(a6) exg.l a5,a6 ;move.w #$4000,$dff096 ;DMA off lsr d3 subq #1,d3 move.l #2000*65537,d0 move.l a2,a0 clr cv move d5,kv bra.w Back string = msg1 libname dc.b "dos.library",0 msg1 dc.b 'number pi calculator v11 [beta 4]' if __VASM&28 ;68020/30? dc.b '(68020)' else dc.b '(68000)' endif msg4 dc.b 10,'number of digits (up to ' msg5 dc.b ')? ' msg3 dc.b ' digits will be printed' msg2 dc.b 10 dcb.b 65536-(ra-start) end start Last edited by Don_Adan; 11 May 2021 at 04:23. |
11 May 2021, 06:02 | #87 | |
old bearded fool
Join Date: Jan 2010
Location: Bangkok
Age: 56
Posts: 775
|
Quote:
Looks like saimo got you covered. |
|
11 May 2021, 19:39 | #88 | ||
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
Quote:
Quote:
|
||
11 May 2021, 20:52 | #89 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 787
|
Quote:
Yes, the test programs outputted the digits. |
|
12 May 2021, 19:54 | #90 | ||||||
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
Quote:
Quote:
I tried to check this saved tick number for the emulated A500. I got data that FS-UAE is rather not completely accurate at least for the MULU timings. I got that code with MULU is about 42 ticks faster that with MULU optimization. Let's examine the 68000 timings Code:
moveq.l #0,d0 4 move.l d0,d1 4 lsl.l #3,d0 14 sub.l d0,d1 6 add.l d0,d0 6 sub.l d0,d1 6 sub.l d0,d1 6 lsl.l #8,d1 24 sub.l d1,d0 6 =76 mulu d1,d0 38-70 - it is 54 on average Quote:
Quote:
It is also interesting that your results show a big advantage of the 68030 over the 68020. The table #2 shows ER that is equal 422 for the 68030 and 438 for the 68020. These numbers were almost equal before you published results from your Blizzard card. Quote:
Quote:
This way we can modify the Atari ST code too. However the main issue is in the fact that such changes makes the code larger. Your A1200 has fast RAM, so its results cannot be replaced by results from saimo's systems. Last edited by BippyM; 01 June 2021 at 18:24. |
||||||
12 May 2021, 22:29 | #91 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,959
|
"This way we can modify the Atari ST code too. However the main issue is in the fact that such changes makes the code larger."
Amiga initialisation code is perhaps larger than Atari ST. Not exactly larger, because "getnum" routine was inlined. Current version can be shortest much more, still is not fully PC relative, registers can be better used too. Here is example for 4 bytes code gain plus one reloc too. I dont know if main target is fastest code, or shortest code or Max Digits code. Code:
PR0000 ;prints d5 lea buf(pc),a0 move.l A0,D2 bsr .l1 moveq #4,d3 move.l cout(pc),d1 ; move.l #buf,d2 jmp Write(a6) ;call Write(stdout,buff,size) .l1 divu #1000,d5 bsr .l0 clr d5 swap d5 divu #100,d5 bsr .l0 clr d5 swap d5 divu #10,d5 bsr .l0 swap d5 .l0 eori.b #'0',d5 move.b d5,(a0)+ eos rts |
13 May 2021, 19:31 | #92 | |
old bearded fool
Join Date: Jan 2010
Location: Bangkok
Age: 56
Posts: 775
|
Quote:
EDIT: Did these for now... Code:
> pi-mulu number pi calculator v11 [mulu opt test](68000) number of digits (up to 9276)? 3000 31.86 Code:
> pi-opt number pi calculator v11 [mulu opt test](68000) number of digits (up to 9276)? 3000 30.72 Last edited by modrobert; 13 May 2021 at 21:25. |
|
13 May 2021, 23:55 | #93 | ||||||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 787
|
Again, simply don't use emulators as reference.
Quote:
Interrupts are critical (especially on devices with limited power, like the Amiga or those that use the M68k CPUs as embedded controllers), so the faster they are handled, the better. Now, the VBR helps speeding up the execution of interrupts on Amiga because of its RAM architecture, but that might not apply to other systems. But even without the speed factor, being able to relocate the vectors table could be an advantage: for example, more freedom for hardware/OS designers, possibility of exploiting those 1024 bytes in the 0 page (which can be accessed more quickly thanks to its 16 bit addresses), more robustness (less risk of destroying the vectors due to null pointers). The MMU is meant for other purposes, and it's best to use it for those, instead of complicating its usage even more. Not to mention that MMU slows things down a bit and requires RAM (OK, transparent translation registers might help, but then again those are better used for other purposes). Moreover, only the non-EC/LC 68030+ CPUs have an MMU. Quote:
Quote:
Quote:
Quote:
Quote:
EDIT: I gave the code a less swift look and added also 68020 cache-case timings; b = bytes, c = cycles. Your code: Code:
move d5,kv ;6b 6c ... move kv(pc),d4 ;4b 3c add.l d4,d4 ;2b 2c move.l a2,a3 ;2b 2c adda.l d4,a3 ;2b 2c ... sub.w #14,kv ;6b 10c total: 22b 25c Code:
movea.w #28,a1 ;4b 4c ... move.w d2,d4 ;2b 2c lea.l (a2,d2.w),a3 ;4b 6c ... sub.w a1,d2 ;2b 2c total: 12b 14c Note: even just 11 cycles per external loop do make a difference of some thousands of cycles when thousands of digits are calculated; the difference might escape the vertical blanking based time measuring, but I just can't see why not making the code faster - that's one of the purposes why you started this thread, isn't it? Additional note: like I said in the original post, whether my code does the right thing is still to be verified. By the way, I have an alternative version of the inner loop which is 2 cycles faster on 68020 (and probably even faster on 68040 and 68060, but not on 68030) when the mulu optimization is not used. I'm trying to figure out a way to adapt the mulu optimization version to take advantage of the optimization, but it looks like it might be impossible. Last edited by saimo; 14 May 2021 at 16:44. |
||||||
14 May 2021, 21:46 | #94 | ||||||||||
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
Quote:
Quote:
My request has also been to run programs from pi-amiga-11-beta-4.zip for 100/1000/3000 digits. Your system uses fast RAM and this gives us other numbers, not the same as saimo's system. The archive is attached to this post http://eab.abime.net/showpost.php?p=...0&postcount=82 However for the 68000 FS-UAE is quite accurate. I only showed that it has some minor timing issue with the MULU timing emulation. Quote:
Quote:
Thank you, I missed this. I was sure that all top Amigas can run Unix. There was a strange story when Mehdi Ali refused to share Unix for the Amiga with Sun - https://www.commodore.ca/commodore-h...-of-commodore/ - one might think that Commodore invested much in their Unix, so it could be used widely... Quote:
Quote:
It is the value in the second table which sorted out many CPUs. The cite The next table contains approximate values of efficiency reciprocals (ER). These values are calculated by multiplication of the time (CPU time only without IO) of the calculation of 3000 digits by the CPU frequency. The ER values are gotten for pi-spigot which uses base 16-bit integer arithmetic. The best ER value for each CPU is taken. The ER value reflects the efficiency of a CPU electronics. Quote:
Quote:
Quote:
Quote:
|
||||||||||
14 May 2021, 23:05 | #95 | ||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 787
|
@litwr
I honestly can't afford to get into a philosophical/mindset debate, so I'll keep the answers to the bare minimum. The frequency of the 68030 in my Amiga is more than 3.5 faster than that of the 68020 in the same machine. See my previous posts. Quote:
Moreover, it should be easy to you to see whether the code is correct. Your code does: loop: get kv kv*2 -> add to array offset kv*2-1 -> divisor ... kv -= 14 goto loop My code does (using more efficient instructions): kv*2 -> kv2 loop: kv2 -> add to array offset kv2-1 -> divisor ... kv -= 28 goto loop That should work, right? Quote:
|
||
15 May 2021, 00:55 | #96 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,959
|
Write routine is integrated with main loop, then D2 is trashed and A1 can be trashed via Write(A6) as scratch register. Full loop is up to .l0 label, i think. D7 can be better used and maybe (SP), but i dont checked code exactly.
|
15 May 2021, 10:38 | #97 | |
old bearded fool
Join Date: Jan 2010
Location: Bangkok
Age: 56
Posts: 775
|
Quote:
Code:
> pi-amiga number pi calculator v11 [beta 4](68000) number of digits (up to 9272)? 100 314... .14 > pi-amiga number pi calculator v11 [beta 4](68000) number of digits (up to 9272)? 1000 314... 4.72 > pi-amiga number pi calculator v11 [beta 4](68000) number of digits (up to 9272)? 3000 314... 35.46 Code:
> pi-amiga1200 number pi calculator v11 [beta 4](68020) number of digits (up to 9272)? 100 314... .12 > pi-amiga1200 number pi calculator v11 [beta 4](68020) number of digits (up to 9272)? 1000 314... 4.36 > pi-amiga1200 number pi calculator v11 [beta 4](68020) number of digits (up to 9272)? 3000 314... 33.96 |
|
15 May 2021, 13:41 | #98 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 787
|
Quote:
Code:
clr.l d6 ... .l0 move.w #20000,d1 ... .longdiv divul.l d4,d6:d5 move.w d6,(a3) suba.l d5,a5 .l2 suba.l d6,a5 subq.w #2,d4 bcs.b .enddiv .l4 move.w -(a3),d5 mulu.w d1,d5 add.l a5,d5 lsr.l #1,d5 movea.l d5,a5 divu.w d4,d5 bvs.b .longdiv move.w d5,d6 swap.w d5 move.w d5,(a3) suba.w d5,a5 subq.w #2,d4 bcc.b .l2 .enddiv That said, the following must be noted. The trick doesn't bring more speed on 68030 because suba.w takes 4 cycles instead of 2, unlike on the other 68020+ CPUs - actually, I'm starting to wonder if it's an erratum on the MC68020UM or MC68030UM, given the strong similarity of the two CPUs (my memory tells me that there actually is a penalty, but recently it failed me precisely regarding similar matters). Maybe I'll check it out later. To avoid losing cycles for the calculation of d (the mulu part), the latter has to change from: d = d/2 + r[i]*10000 to: d = (d + r[i]*20000)/2 This works also when d is odd thanks to the fact that r[i]*20000 is always even. This change means that the trick doesn't bring any benefit (actually, it causes a 2 cycles penalty if suba.w actually takes longer) if the mulu optimization is used because multiplying by 20000 requires 2 cycles more (I've been trying lots of alternatives and built some tables to try to avoid that, but it doesn't look like it's possible - this is what I was referring to in my previous post). move.w d5,d6 and suba.l d6,a5 work fine because d5 and d6 are always a 16-bit number, being the remainder of the division by d4, which is a 16-bit number itself. suba.w d5,a5 works fine when bit #15 of d5 is 0, i.e. when d5 < 32768; given that d5 is the remainder of the division, the following condition applies: * d4, the divisor, must be < 32769 * d4 is at most N*2-1; * N is 7*D/2; * hence, d4 is at most (7*D/2)*2-1 = 7*D-1; * hence, it must be 7*D-1 < 32769 -> D < (32769+1)/7 = 4681. Whether such condition is acceptable is another story. If not, a clr.w d5 before swap.w d5 is enough to remove suba.w d5,a5 and move .l2 one line up: this will cancel the 2-cycle gain and bring the limit of D back to the original value (7*D-1 < 65536 -> D < (65537+1)/7 = 9362) - but, of course, this is useful only in the non-mulu-optimization case. EDIT I have just tested the behaviour of suba.w on real hardware and it turns out that, as I feared, the MC68020UM is wrong, as 2 additional cycles are required. For the test, I had this loop repeated 65536 times: Code:
.l suba.w|l d0,a0 dbf d0,.l According to the MC68030, instead, suba.w takes 4 cycles and suba.l 2 cycles. On 68020, the results were: suba.l color clocks: 131510 total cycles_: (147924*14187580)/3546895 = 526040 cycles per loop: 526040/65536 = 8.03 suba.w color clocks: 164332 total cycles: (164332*14187580)/3546895 = 657328 cycles per loop: 657328/65536 = 10.03 On 68030, the results were: suba.l color clocks: 37254 total cycles: (37254*50000000)/3546895 = 525163.6 cycles per loop: 525163.6 / 8.01 suba.w color clocks: 46564 total cycles: (46564*50000000)/3546895 = 656405.1 cycles per loop: 656405.1/65536 = 10.02 (The total cycles are calculated as <measured color clocks> * <CPU frequency> / <color clock frequency>.) Therefore, unfortunately, speed-wise, the alternative code in this post can be useful at most only on 68040 and 68060. Last edited by saimo; 15 May 2021 at 19:17. |
|
15 May 2021, 15:10 | #99 | ||||
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
The pipack #55 has just been released. Sorry it has been happened before saimo posted his last update.
Being pushed by saimo I removed the kv-variable, this shortened the code by 16 bytes. However it was not enough to get 4 more digits for the Amiga but it was enough for the Atari ST that can print up to 9292 digits now. BTW Does anyone know how to assign another name for a register in VASM assembly? EQU doesn't work. Quote:
Quote:
I am sure your optimization idea is correct. But it slightly changes the algo and anyway requires some testing. You know my latest optimization (pushed by you) gave us 16 bytes and it is more than 12 bytes you offered. Quote:
Quote:
|
||||
15 May 2021, 15:29 | #100 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
|
Currently Active Users Viewing This Thread: 1 (0 members and 1 guests) | |
Thread Tools | |
Similar Threads | ||||
Thread | Thread Starter | Forum | Replies | Last Post |
68020 Bit Field Instructions | mcgeezer | Coders. Asm / Hardware | 9 | 27 October 2023 23:21 |
68060 64-bit integer math | BSzili | Coders. Asm / Hardware | 7 | 25 January 2021 21:18 |
Discovery: Math | Audio Snow | request.Old Rare Games | 30 | 20 August 2018 12:17 |
Math apps | mtb | support.Apps | 1 | 08 September 2002 18:59 |
|
|