![]() |
![]() |
#1 |
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
A mathematical demo
I have made a code for 80386 at 12.5 MHz which is slightly faster than the same functionality code for 68020 at 14 MHz. The project is at http://litwr2.atspace.eu/pi/pi-spigot-benchmark.html. The main loop is between labels .l2 and .l4, it is 12 instructions only. The same code for 80386 occupies 13 lines. C sources are also provided. Is it possible to win 80386 having 1.5 MHz handicap? Help is welcome. Thanks in advance. I have to notify It is not easy task.
|
![]() |
![]() |
#2 |
Registered User
Join Date: Dec 2014
Location: germany
Posts: 439
|
I just had a very brief look at the code, but you might want to unroll the loop a few times. The 68020 has 256 bytes of instruction cache, so unrolling it that it fits into cache might save you the "bra .l2" instruction a few times.
On a side note, I'm not sure how exact UAE reproduces the timing behaviour of the 68020. It's very precise for the 68000, but I remember the author mentioning that it is worse for the 020. Last edited by chb; 12 January 2017 at 20:59. |
![]() |
![]() |
#3 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
Emulators are good for 68000 but for 68020+ timing they are useless.
Unrolling is not needed, u can save that bra every time ('xcept the very first) by reordering the code : Code:
bra.s .l2 .loop sub.l d6,d5 sub.l d7,d5 lsr.l #1,d5 .l2 move.w -(a3),d0 mulu d1,d0 add.l d0,d5 move.l d5,d6 divul.l d4,d7:d6 move.w d7,(a3) subq.w #2,d4 bcc.s .loop .l4 |
![]() |
![]() |
#4 | |
Registered User
Join Date: Dec 2014
Location: germany
Posts: 439
|
Quote:
![]() Code:
(save d3 to mem or free adress register if needed) move.w d4,d3 lsr.w #2,d3 subq.w #1,d3 bmi.s .l2 bra.s .l2_8 .loop_8 sub.l d6,d5 sub.l d7,d5 lsr.l #1,d5 .l2_8 move.w -(a3),d0 mulu d1,d0 add.l d0,d5 move.l d5,d6 divul.l d4,d7:d6 move.w d7,(a3) subq.w #2,d4 (unroll additional 7x) dbra d3, .loop_8 ;------ unchanged from meynaf's post, except the bra .l2 .loop sub.l d6,d5 sub.l d7,d5 lsr.l #1,d5 .l2 move.w -(a3),d0 mulu d1,d0 add.l d0,d5 move.l d5,d6 divul.l d4,d7:d6 move.w d7,(a3) subq.w #2,d4 bcc.s .loop .l4 Last edited by chb; 13 January 2017 at 00:32. |
|
![]() |
![]() |
#5 |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,849
|
How about this:
Code:
bra .l1 .loopc sub.l d5,d0 sub.l d3,d0 lsr.l #1,d0 mulu.w d6,d3 move.l d3,(a1) .l1 add.l -(a1),d0 move.l d0,d5 divul.l d1,d3:d5 subq.l #2,d1 bgt .loopc mulu.w d6,d3 move.l d3,(a1) Code:
incdir "asminc:" include "exec/exec.i" include "dos/dos.i" include "lvo/lvos.i" execBase equ 4 N equ 2800 start ; ; open dos library ; move.l execBase,a6 lea dosName,a1 moveq #36,d0 jsr _LVOOpenLibrary(a6) move.l d0,dosBase beq .exit ; ; fill array ; lea r+4,a0 move.l #2000*10000,d0 move.l #N-1,d7 .loopa move.l d0,(a0)+ dbra d7,.loopa ; ; main loop ; move.l dosBase,a6 lea printfArgs,a3 lea r,a2 move.l #10000,d6 clr.l d4 move.l #N,d7 .loopb clr.l d0 move.l d7,d1 lea 4(a2,d1.w*4),a1 add.l d1,d1 subq.l #1,d1 bra .l1 .loopc sub.l d5,d0 sub.l d3,d0 lsr.l #1,d0 mulu.w d6,d3 move.l d3,(a1) .l1 add.l -(a1),d0 move.l d0,d5 divul.l d1,d3:d5 subq.l #2,d1 bgt .loopc mulu.w d6,d3 move.l d3,(a1) ; print digits divul.l d6,d3:d0 add.l d4,d0 move.l d0,(a3) move.l #formatString,d1 move.l a3,d2 jsr _LVOVPrintf(a6) move.l d3,d4 sub.w #14,d7 bgt .loopb ; print line feed move.l #lineFeed,d1 move.l a3,d2 jsr _LVOVPrintf(a6) ; cleanup and exit .exit move.l execBase,a6 move.l dosBase,a1 jsr _LVOCloseLibrary(a6) rts section data,data_p dosBase dc.l 0 r dcb.l N+1 printfArgs dcb.l 4 formatString dc.b "%04ld",0 lineFeed dc.b 10,0 dosName dc.b "dos.library",0 |
![]() |
![]() |
#6 |
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
Thanks to everybody for help.
![]() ![]() @meynaf Thank you very much for the brilliant idea. ![]() ![]() @Thorham Wow! You have written a new program. Thank you very much. Your code looks close to perfect. ![]() ![]() ![]() |
![]() |
![]() |
#7 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
Your program will crash if it can't open dos.library as you closelibrary(null) in that case
![]() Perhaps you should also consider using a BSS section ; your exe is quite large. Nevertheless reducing the loop to 10 instructions is very nice ![]() |
![]() |
![]() |
#8 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
Oh, i forgot this.
Since it's for 68020 we could replace : Code:
mulu.w d6,d3 ; d6=10000 ($2710) Code:
move.l d3,d2 lsl.l #3,d3 sub.l d3,d2 add.l d3,d3 sub.l d3,d2 sub.l d3,d2 lsl.l #8,d2 sub.l d2,d3 |
![]() |
![]() |
#9 | ||
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,849
|
Quote:
Quote:
By the way, nice x 10000 optimization. |
||
![]() |
![]() |
#10 |
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
IMHO this optimized MUL is the key to outperform 80386. Thanks!
![]() |
![]() |
![]() |
#11 | ||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
Quote:
However, one reason why this OpenLibrary call can fail is precisely because the program is run under some version lower than V36... Yeah, this one wasn't exactly easy to make. Perhaps i can push the vice further and unroll the loop x7 - as the number of iterations is always dividable by 7. Now the loop is slightly under 250 bytes but i don't see a way to make it shorter : Code:
; init lea rtab,a4 ; gives ds.w 2800 move.w #2800/2-1,d0 move.l #(2000<<16)+2000,d1 move.l a4,a0 .iloop move.l d1,(a0)+ dbf d0,.iloop ; main loop suba.l a5,a5 move.l #2800*2-1,d7 .mloop moveq #0,d5 move.l a4,a3 move.l d7,d4 .loop7 movem.w (a3),d0-d3/a0-a2 do1 macro move.l \1,d6 lsl.l #3,d6 sub.l d6,\1 add.l d6,d6 add.l d6,d5 add.l d6,d6 sub.l \1,d6 lsl.l #8,d6 add.l d6,d5 move.l d5,d0 divul.l d4,d6:d0 move.w d6,(a3)+ sub.l d0,d5 sub.l d6,d5 lsr.l #1,d5 subq.l #2,d4 endm do1 d0 do1 d1 do1 d2 do1 d3 do1 a0 do1 a1 do1 a2 bcc .loop7 divul.l #10000,d1:d0 add.l a5,d0 move.l d1,a5 ; printf d0 here lea 14*2(a4),a4 subi.w #28,d7 bgt .mloop Quote:
![]() That said, i doubt the 386 is a champion in this area either... Having more registers can perhaps help too, btw. Depends how this lookup is made. I didn't check the exact range of values. |
||
![]() |
![]() |
#12 | |||
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,849
|
Quote:
Nice, I certainly didn't think of that. Quote:
Quote:
Here's a version that should work with Kickstart 1.2, possibly lower (uses a 200 byte table for padded number conversion instead of vprintf): Code:
; ; PI spigot, 68020+ Kickstart 1.2+ ; _LVOOldOpenLibrary equ -408 _LVOCloseLibrary equ -414 _LVOOutput equ -60 _LVOWrite equ -48 execBase equ 4 N equ 2800 start ; ; open dos library ; move.l execBase,a6 lea dosName,a1 jsr _LVOOldOpenLibrary(a6) move.l d0,dosBase beq exit ; ; get output handle ; move.l dosBase,a6 jsr _LVOOutput(a6) move.l d0,stdOut ; ; fill array ; lea r,a0 move.l #2000*10000,d0 clr.l (a0)+ move.l #N-1,d7 .loopa move.l d0,(a0)+ dbra d7,.loopa ; ; main loop ; lea decTable,a4 lea formatString,a5 lea r,a2 move.l #10000,d6 clr.l d4 move.l #N,d7 .loopb clr.l d0 move.l d7,d1 lea 4(a2,d1.w*4),a1 add.l d1,d1 subq.l #1,d1 bra .l1 .loopc sub.l d5,d0 sub.l d3,d0 lsr.l #1,d0 mulu.w d6,d3 move.l d3,(a1) .l1 add.l -(a1),d0 move.l d0,d5 divul.l d1,d3:d5 subq.l #2,d1 bgt .loopc mulu.w d6,d3 move.l d3,(a1) ; write digits to string divul.l d6,d3:d0 add.l d4,d0 divu.w #100,d0 move.w (a4,d0.w*2),(a5)+ swap d0 move.w (a4,d0.w*2),(a5)+ move.l d3,d4 sub.w #14,d7 bgt .loopb ; print PI digits move.w #$0a00,(a5)+ move.l dosBase,a6 move.l stdOut,d1 move.l #formatString,d2 move.l a5,d3 sub.l #formatString,d3 jsr _LVOWrite(a6) ; cleanup and exit exit move.l execBase,a6 move.l dosBase,a1 tst.l a1 beq .l1 jsr _LVOCloseLibrary(a6) .l1 rts section data,data_p dosBase dc.l 0 stdOut dc.l 0 decTable dc.b "00010203040506070809" dc.b "10111213141516171819" dc.b "20212223242526272829" dc.b "30313233343536373839" dc.b "40414243444546474849" dc.b "50515253545556575859" dc.b "60616263646566676869" dc.b "70717273747576777879" dc.b "80818283848586878889" dc.b "90919293949596979899" dosName dc.b "dos.library",0 section data2,bss_p r ds.l N+1 formatString ds.b 1024 |
|||
![]() |
![]() |
#13 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
I know, you need V36 so you ask for V36 which is ok, the problem is that the program will crash if V36 dos isn't available - which would easily be avoided by just branching to rts and not to the CloseLibrary call.
Means a 20kb table. Perhaps a little bit overkill for this kind of program... |
![]() |
![]() |
#14 | |
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
Quote:
Code:
68020 vs 80286 16x16=32 27 21 32x32=32 43 - 32x32=64 43 - 68030 vs 80386 16x16=32 27 9-22 32x32=32 43 9-38 32x32=64 43 9-38 68040 vs 80486 16x16=32 14 13-26 32x32=32 20 13-42 32x32=64 20 13-42 68060 vs Pentium 16x16=32 2 11 32x32=32 2 10 32x32=64 - 10 ![]() Last edited by matthey; 13 January 2017 at 19:46. |
|
![]() |
![]() |
#15 |
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
Thanks a lot!
![]() ![]() Last edited by litwr; 13 January 2017 at 20:06. |
![]() |
![]() |
#16 | ||
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,849
|
Quote:
![]() ![]() Quote:
![]() That reminds me, I still have to add a timer and allow a variable number of digits to be calculated. |
||
![]() |
![]() |
#17 | ||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
Quote:
Filling such a table is easy but takes some amount of time. Or we could accept a large exe where most others are under 1kb ![]() And the net benefit can be very small. A table lookup can be 12 (maybe 14, i don't remember exactly) cycles, with big wait states because in unexpanded A1200 we're in chipmem... Quote:
Makes me think i've recently added this kind of function to my include files. So here it goes, in case it could be helpful for ya : Code:
movem.l d2-d3/a0-a2/a6,-(a7) subq.l #8,a7 ; cheap struct timeval move.l a7,a0 move.l timereq+$14,a6 ; timereq is ioreq for previously opened timer.device jsr -$42(a6) ; GetSysTime movem.l (a7)+,d0-d1 ; d0=tv_secs, d1=tv_micro movem.l oldtime,d2-d3 ; two ds.l to store previous value movem.l d0-d1,oldtime sub.l d3,d1 ; micro subx.l d3,d3 ; keep the carry subx.l d2,d0 ; d0=secs andi.l #1000000,d3 ; +1 million if carry set add.l d3,d1 ; d1=micro movem.l (a7)+,d2-d3/a0-a2/a6 rts |
||
![]() |
![]() |
#18 |
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
@meynaf Sorry, the fast multiplication code is not complete. It assumes that the high word of D3 is zero but it is not. So we need CLR.L D3 before MOVE.W -(A3),D3. (I want to have 80386 MOVZX here.) This may be solved by keeping 32-bit words in the main array (r). But this reduces the number of digits twice.
![]() However I am curious is it possible use small tables with 68020 or 68000 for faster multiplication like with z80 or 6502? @Thorham Your multiplication optimization also relies on 32 bit words. I have to use such words with ARM code because it can't work with 16 bit words. So I am definitely going to update ARM code with this optimization. Anyway a lot of thanks to everybody. ![]() |
![]() |
![]() |
#19 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
|
![]() |
![]() |
#20 | |
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
Quote:
The 68020+ is one of the best ISAs ever created for handling tables. It should be able to handle practically every table the 8 bit processors could do. However, complex addressing modes were not particularly fast and accessing memory was especially slow back in the Amiga days. |
|
![]() |
Currently Active Users Viewing This Thread: 1 (0 members and 1 guests) | |
Thread Tools | |
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
Maniac Mansion Demo Disk - Onslaught releases the demo of a classic for C64! | Neil79 | Retrogaming General Discussion | 0 | 16 January 2015 10:40 |
Yet another "plz help me find a demo" thead! AGA demo w/ texturemapped building | ruinatokyo | request.Demos | 1 | 26 September 2013 16:29 |
Jason Lowe and The Mathematical Reflex Test | jedk | Retrogaming General Discussion | 5 | 30 January 2013 02:13 |
Old Amiga Demo Wanted -- Music Track "The last ninja demo" | scribbleuk | request.Demos | 13 | 23 April 2012 13:35 |
CU Amiga #75/Simon The Sorcerer Demo + CU Amiga #99/Amazon Queen Demo | MethodGit | request.Old Rare Games | 12 | 16 February 2004 17:16 |
|
|