English Amiga Board


Go Back   English Amiga Board > Coders > Coders. Asm / Hardware

 
 
Thread Tools
Old 12 January 2017, 18:16   #1
litwr
Registered User
 
Join Date: Mar 2016
Location: Ozherele
Posts: 229
A mathematical demo

I have made a code for 80386 at 12.5 MHz which is slightly faster than the same functionality code for 68020 at 14 MHz. The project is at http://litwr2.atspace.eu/pi/pi-spigot-benchmark.html. The main loop is between labels .l2 and .l4, it is 12 instructions only. The same code for 80386 occupies 13 lines. C sources are also provided. Is it possible to win 80386 having 1.5 MHz handicap? Help is welcome. Thanks in advance. I have to notify It is not easy task.
litwr is offline  
Old 12 January 2017, 20:53   #2
chb
Registered User
 
Join Date: Dec 2014
Location: germany
Posts: 439
I just had a very brief look at the code, but you might want to unroll the loop a few times. The 68020 has 256 bytes of instruction cache, so unrolling it that it fits into cache might save you the "bra .l2" instruction a few times.

On a side note, I'm not sure how exact UAE reproduces the timing behaviour of the 68020. It's very precise for the 68000, but I remember the author mentioning that it is worse for the 020.

Last edited by chb; 12 January 2017 at 20:59.
chb is offline  
Old 12 January 2017, 21:14   #3
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Emulators are good for 68000 but for 68020+ timing they are useless.

Unrolling is not needed, u can save that bra every time ('xcept the very first) by reordering the code :
Code:
	bra.s .l2
.loop
	sub.l d6,d5
	sub.l d7,d5
	lsr.l #1,d5
.l2	move.w -(a3),d0
	mulu d1,d0
	add.l d0,d5
	move.l d5,d6
	divul.l d4,d7:d6
	move.w d7,(a3)
	subq.w #2,d4
	bcc.s .loop
.l4
meynaf is offline  
Old 12 January 2017, 22:13   #4
chb
Registered User
 
Join Date: Dec 2014
Location: germany
Posts: 439
Quote:
Originally Posted by meynaf View Post
Unrolling is not needed, u can save that bra every time ('xcept the very first) by reordering the code
Nice! But if d4 has usually a quite large value, you could still gain something from unrolling: Split the loop in a part that's for example 8x unrolled and a non-unrolled (rolled?) part, and set an additional loop counter to N/8, respectively. Saves the bcc.s a couple of times, but needs an additional data register (d3 is free temporarily?) and a dbra at the end.

Code:
        (save d3 to mem or free adress register if needed)
        move.w d4,d3
        lsr.w #2,d3
        subq.w #1,d3
        bmi.s .l2

	bra.s .l2_8
.loop_8
	sub.l d6,d5
	sub.l d7,d5
	lsr.l #1,d5
.l2_8	move.w -(a3),d0
	mulu d1,d0
	add.l d0,d5
	move.l d5,d6
	divul.l d4,d7:d6
	move.w d7,(a3)
	subq.w #2,d4


(unroll additional 7x)

	dbra d3, .loop_8

;------ unchanged from meynaf's post, except the bra .l2


.loop
	sub.l d6,d5
	sub.l d7,d5
	lsr.l #1,d5
.l2	move.w -(a3),d0
	mulu d1,d0
	add.l d0,d5
	move.l d5,d6
	divul.l d4,d7:d6
	move.w d7,(a3)
	subq.w #2,d4
	bcc.s .loop
.l4
Warning, that code is untested, and my 68k assembly is very rusty...

Last edited by chb; 13 January 2017 at 00:32.
chb is offline  
Old 13 January 2017, 08:38   #5
Thorham
Computer Nerd
 
Thorham's Avatar
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,849
How about this:

Code:
    bra     .l1

.loopc
    sub.l   d5,d0
    sub.l   d3,d0
    lsr.l   #1,d0
    mulu.w  d6,d3
    move.l  d3,(a1)
.l1
    add.l   -(a1),d0
    move.l  d0,d5
    divul.l d1,d3:d5

    subq.l  #2,d1
    bgt     .loopc

    mulu.w  d6,d3
    move.l  d3,(a1)
Full program (tested) 68020/30 AOS2+:

Code:
    incdir  "asminc:"

    include "exec/exec.i"
    include "dos/dos.i"
    include "lvo/lvos.i"

execBase equ 4

N equ 2800

start
;
; open dos library
;
    move.l  execBase,a6
    lea     dosName,a1
    moveq   #36,d0
    jsr     _LVOOpenLibrary(a6)
    move.l  d0,dosBase
    beq     .exit
;
; fill array
;
    lea     r+4,a0
    move.l  #2000*10000,d0

    move.l  #N-1,d7
.loopa
    move.l  d0,(a0)+
    dbra    d7,.loopa
;
; main loop
;
    move.l  dosBase,a6
    lea     printfArgs,a3
    lea     r,a2
    move.l  #10000,d6
    clr.l   d4

    move.l  #N,d7
.loopb
    clr.l   d0
    move.l  d7,d1

    lea     4(a2,d1.w*4),a1

    add.l   d1,d1
    subq.l  #1,d1

    bra     .l1

.loopc
    sub.l   d5,d0
    sub.l   d3,d0
    lsr.l   #1,d0
    mulu.w  d6,d3
    move.l  d3,(a1)
.l1
    add.l   -(a1),d0
    move.l  d0,d5
    divul.l d1,d3:d5

    subq.l  #2,d1
    bgt     .loopc

    mulu.w  d6,d3
    move.l  d3,(a1)

; print digits

    divul.l d6,d3:d0
    add.l   d4,d0
    move.l  d0,(a3)

    move.l  #formatString,d1
    move.l  a3,d2
    jsr     _LVOVPrintf(a6)

    move.l  d3,d4

    sub.w   #14,d7
    bgt     .loopb

; print line feed

    move.l  #lineFeed,d1
    move.l  a3,d2
    jsr     _LVOVPrintf(a6)

; cleanup and exit

.exit
    move.l  execBase,a6
    move.l  dosBase,a1
    jsr     _LVOCloseLibrary(a6)
    rts



    section data,data_p

dosBase
    dc.l    0

r
    dcb.l   N+1

printfArgs
    dcb.l   4

formatString
    dc.b    "%04ld",0

lineFeed
    dc.b    10,0

dosName
    dc.b    "dos.library",0
Thorham is offline  
Old 13 January 2017, 10:36   #6
litwr
Registered User
 
Join Date: Mar 2016
Location: Ozherele
Posts: 229
Thanks to everybody for help. I am going to rewrite a lot of codes...
@meynaf
Thank you very much for the brilliant idea. It is a shame for me to miss it.
@Thorham
Wow! You have written a new program. Thank you very much. Your code looks close to perfect.
litwr is offline  
Old 13 January 2017, 11:07   #7
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
Originally Posted by Thorham View Post
Full program (tested) 68020/30 AOS2+:
Your program will crash if it can't open dos.library as you closelibrary(null) in that case

Perhaps you should also consider using a BSS section ; your exe is quite large.

Nevertheless reducing the loop to 10 instructions is very nice
meynaf is offline  
Old 13 January 2017, 12:31   #8
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Oh, i forgot this.

Since it's for 68020 we could replace :
Code:
   mulu.w  d6,d3        ; d6=10000 ($2710)
by :
Code:
 move.l d3,d2
 lsl.l #3,d3
 sub.l d3,d2
 add.l d3,d3
 sub.l d3,d2
 sub.l d3,d2
 lsl.l #8,d2
 sub.l d2,d3
(20 clocks for mul instead of 26 if i count correctly)
meynaf is offline  
Old 13 January 2017, 16:57   #9
Thorham
Computer Nerd
 
Thorham's Avatar
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,849
Quote:
Originally Posted by meynaf View Post
Your program will crash if it can't open dos.library as you closelibrary(null) in that case
Passing a NULL is safe starting with V36: http://amigadev.elowar.com/read/ADCD.../node01F7.html

Quote:
Originally Posted by meynaf View Post
Perhaps you should also consider using a BSS section ; your exe is quite large.
Absolutely.

By the way, nice x 10000 optimization.
Thorham is offline  
Old 13 January 2017, 17:27   #10
litwr
Registered User
 
Join Date: Mar 2016
Location: Ozherele
Posts: 229
IMHO this optimized MUL is the key to outperform 80386. Thanks! It was surprise for me that hardware multiplication is a bit slow for 68020. May it be faster with a small lookup (<32K)? z80 and 6502 codes use 768 bytes lookup.
litwr is offline  
Old 13 January 2017, 17:40   #11
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
Originally Posted by Thorham View Post
Passing a NULL is safe starting with V36: http://amigadev.elowar.com/read/ADCD.../node01F7.html
Didn't know.
However, one reason why this OpenLibrary call can fail is precisely because the program is run under some version lower than V36...

Quote:
Originally Posted by Thorham View Post
By the way, nice x 10000 optimization.
Yeah, this one wasn't exactly easy to make.

Perhaps i can push the vice further and unroll the loop x7 - as the number of iterations is always dividable by 7. Now the loop is slightly under 250 bytes but i don't see a way to make it shorter :
Code:
; init
 lea rtab,a4                     ; gives ds.w 2800
 move.w #2800/2-1,d0
 move.l #(2000<<16)+2000,d1
 move.l a4,a0
.iloop
 move.l d1,(a0)+
 dbf d0,.iloop

; main loop
 suba.l a5,a5
 move.l #2800*2-1,d7
.mloop
 moveq #0,d5
 move.l a4,a3
 move.l d7,d4
.loop7
 movem.w (a3),d0-d3/a0-a2
do1 macro
 move.l \1,d6
 lsl.l #3,d6
 sub.l d6,\1
 add.l d6,d6
 add.l d6,d5
 add.l d6,d6
 sub.l \1,d6
 lsl.l #8,d6
 add.l d6,d5
 move.l d5,d0
 divul.l d4,d6:d0
 move.w d6,(a3)+
 sub.l d0,d5
 sub.l d6,d5
 lsr.l #1,d5
 subq.l #2,d4
 endm
 do1 d0
 do1 d1
 do1 d2
 do1 d3
 do1 a0
 do1 a1
 do1 a2
 bcc .loop7
 divul.l #10000,d1:d0
 add.l a5,d0
 move.l d1,a5

; printf d0 here

 lea 14*2(a4),a4
 subi.w #28,d7
 bgt .mloop
Note the buffer is now used in reverse order.


Quote:
Originally Posted by litwr View Post
IMHO this optimized MUL is the key to outperform 80386. Thanks! It was surprise for me that hardware multiplication is a bit slow for 68020.
Multiply speed is the major weakness of 020/030 if you ask me
That said, i doubt the 386 is a champion in this area either...

Having more registers can perhaps help too, btw.


Quote:
Originally Posted by litwr View Post
May it be faster with a small lookup (<32K)? z80 and 6502 codes use 768 bytes lookup.
Depends how this lookup is made. I didn't check the exact range of values.
meynaf is offline  
Old 13 January 2017, 18:46   #12
Thorham
Computer Nerd
 
Thorham's Avatar
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,849
Quote:
Originally Posted by meynaf View Post
one reason why this OpenLibrary call can fail is precisely because the program is run under some version lower than V36...
Yes, but vprintf needs V36.

Quote:
Originally Posted by meynaf View Post
Yeah, this one wasn't exactly easy to make.
Nice, I certainly didn't think of that.

Quote:
Originally Posted by meynaf View Post
Perhaps i can push the vice further and unroll the loop x7 - as the number of iterations is always dividable by 7. Now the loop is slightly under 250 bytes but i don't see a way to make it shorter
Interesting, I'll check it out.

Quote:
Originally Posted by meynaf View Post
Depends how this lookup is made. I didn't check the exact range of values.
Range should be 0 - 9999 (modulus 10000).

Here's a version that should work with Kickstart 1.2, possibly lower (uses a 200 byte table for padded number conversion instead of vprintf):
Code:
;
; PI spigot, 68020+ Kickstart 1.2+
;

_LVOOldOpenLibrary equ -408
_LVOCloseLibrary equ -414
_LVOOutput equ -60
_LVOWrite equ -48

execBase equ 4

N equ 2800

start
;
; open dos library
;
    move.l  execBase,a6
    lea     dosName,a1
    jsr     _LVOOldOpenLibrary(a6)
    move.l  d0,dosBase
    beq     exit
;
; get output handle
;
    move.l  dosBase,a6
    jsr     _LVOOutput(a6)
    move.l  d0,stdOut
;
; fill array
;
    lea     r,a0
    move.l  #2000*10000,d0

    clr.l   (a0)+

    move.l  #N-1,d7
.loopa
    move.l  d0,(a0)+
    dbra    d7,.loopa
;
; main loop
;
    lea     decTable,a4
    lea     formatString,a5
    lea     r,a2
    move.l  #10000,d6
    clr.l   d4

    move.l  #N,d7
.loopb
    clr.l   d0
    move.l  d7,d1

    lea     4(a2,d1.w*4),a1

    add.l   d1,d1
    subq.l  #1,d1

    bra     .l1

.loopc
    sub.l   d5,d0
    sub.l   d3,d0
    lsr.l   #1,d0
    mulu.w  d6,d3
    move.l  d3,(a1)
.l1
    add.l   -(a1),d0
    move.l  d0,d5
    divul.l d1,d3:d5

    subq.l  #2,d1
    bgt     .loopc

    mulu.w  d6,d3
    move.l  d3,(a1)

; write digits to string

    divul.l d6,d3:d0
    add.l   d4,d0

    divu.w  #100,d0
    move.w  (a4,d0.w*2),(a5)+
    swap    d0
    move.w  (a4,d0.w*2),(a5)+

    move.l  d3,d4

    sub.w   #14,d7
    bgt     .loopb

; print PI digits

    move.w  #$0a00,(a5)+

    move.l  dosBase,a6

    move.l  stdOut,d1
    move.l  #formatString,d2
    move.l  a5,d3
    sub.l   #formatString,d3
    jsr     _LVOWrite(a6)

; cleanup and exit

exit
    move.l  execBase,a6
    move.l  dosBase,a1
    tst.l   a1
    beq     .l1
    jsr     _LVOCloseLibrary(a6)
.l1
    rts

    section data,data_p

dosBase
    dc.l    0

stdOut
    dc.l    0

decTable
    dc.b    "00010203040506070809"
    dc.b    "10111213141516171819"
    dc.b    "20212223242526272829"
    dc.b    "30313233343536373839"
    dc.b    "40414243444546474849"
    dc.b    "50515253545556575859"
    dc.b    "60616263646566676869"
    dc.b    "70717273747576777879"
    dc.b    "80818283848586878889"
    dc.b    "90919293949596979899"

dosName
    dc.b    "dos.library",0

    section data2,bss_p
r
    ds.l    N+1

formatString
    ds.b    1024
Thorham is offline  
Old 13 January 2017, 19:11   #13
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
Originally Posted by Thorham View Post
Yes, but vprintf needs V36.
I know, you need V36 so you ask for V36 which is ok, the problem is that the program will crash if V36 dos isn't available - which would easily be avoided by just branching to rts and not to the CloseLibrary call.


Quote:
Originally Posted by Thorham View Post
Range should be 0 - 9999 (modulus 10000).
Means a 20kb table. Perhaps a little bit overkill for this kind of program...
meynaf is offline  
Old 13 January 2017, 19:40   #14
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Quote:
Originally Posted by meynaf View Post
Multiply speed is the major weakness of 020/030 if you ask me
That said, i doubt the 386 is a champion in this area either...
These are the timings I came up with quickly for unsigned integer register to register multiply in cache.

Code:
68020 vs 80286
16x16=32 27 21
32x32=32 43 -
32x32=64 43 -

68030 vs 80386
16x16=32 27 9-22
32x32=32 43 9-38
32x32=64 43 9-38

68040 vs 80486
16x16=32 14 13-26
32x32=32 20 13-42
32x32=64 20 13-42

68060 vs Pentium
16x16=32 2 11
32x32=32 2 10
32x32=64 - 10
It appears the 68k has better timings except for the 68020-68030 years (as you suspected). The 68k MUL was more flexible with more general purpose registers. The x86 processors tended to have higher clock speeds. Of course, a modern 68k ASIC could have a 1 cycle MUL for all types .

Last edited by matthey; 13 January 2017 at 19:46.
matthey is offline  
Old 13 January 2017, 19:57   #15
litwr
Registered User
 
Join Date: Mar 2016
Location: Ozherele
Posts: 229
Thanks a lot! I need several days to digest such a big dish. 80286 (80386) has a very good division and a much less good multiplication.

Last edited by litwr; 13 January 2017 at 20:06.
litwr is offline  
Old 13 January 2017, 20:34   #16
Thorham
Computer Nerd
 
Thorham's Avatar
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,849
Quote:
Originally Posted by meynaf View Post
I know, you need V36 so you ask for V36 which is ok, the problem is that the program will crash if V36 dos isn't available - which would easily be avoided by just branching to rts and not to the CloseLibrary call.
Yeah, of course Is fixed in the last version I posted though, but good call

Quote:
Originally Posted by meynaf View Post
Means a 20kb table. Perhaps a little bit overkill for this kind of program...
It's a benchmark, so it's not overkill (my 1 megabyte HAM8 table is overkill ). With this you can see the difference between a multiply instruction and table access on multiple platforms.

That reminds me, I still have to add a timer and allow a variable number of digits to be calculated.
Thorham is offline  
Old 14 January 2017, 10:29   #17
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
Originally Posted by Thorham View Post
It's a benchmark, so it's not overkill (my 1 megabyte HAM8 table is overkill ). With this you can see the difference between a multiply instruction and table access on multiple platforms.
I said 20kb but in fact it's gonna be 40kb, due the result is 32-bit and not 16.
Filling such a table is easy but takes some amount of time. Or we could accept a large exe where most others are under 1kb
And the net benefit can be very small. A table lookup can be 12 (maybe 14, i don't remember exactly) cycles, with big wait states because in unexpanded A1200 we're in chipmem...


Quote:
Originally Posted by Thorham View Post
That reminds me, I still have to add a timer and allow a variable number of digits to be calculated.
Can't help you for variable digits, but for the timer i have the relevant code. Of course you could go for cheap vbl counter ; I do many benchmarks this way - but due to the relative high speed of the A1200, it's perhaps not precise enough.

Makes me think i've recently added this kind of function to my include files.
So here it goes, in case it could be helpful for ya :
Code:
 movem.l d2-d3/a0-a2/a6,-(a7)
 subq.l #8,a7            ; cheap struct timeval
 move.l a7,a0
 move.l timereq+$14,a6   ; timereq is ioreq for previously opened timer.device
 jsr -$42(a6)            ; GetSysTime
 movem.l (a7)+,d0-d1     ; d0=tv_secs, d1=tv_micro
 movem.l oldtime,d2-d3   ; two ds.l to store previous value
 movem.l d0-d1,oldtime
 sub.l d3,d1             ; micro
 subx.l d3,d3            ; keep the carry
 subx.l d2,d0            ; d0=secs
 andi.l #1000000,d3      ; +1 million if carry set
 add.l d3,d1             ; d1=micro
 movem.l (a7)+,d2-d3/a0-a2/a6
 rts
Call it once at startup and once at the end. Returns d0=secs & d1=micro. Does not touch other regs. Of course you have to open timer.device prior to first call.
meynaf is offline  
Old 14 January 2017, 12:10   #18
litwr
Registered User
 
Join Date: Mar 2016
Location: Ozherele
Posts: 229
@meynaf Sorry, the fast multiplication code is not complete. It assumes that the high word of D3 is zero but it is not. So we need CLR.L D3 before MOVE.W -(A3),D3. (I want to have 80386 MOVZX here.) This may be solved by keeping 32-bit words in the main array (r). But this reduces the number of digits twice.
However I am curious is it possible use small tables with 68020 or 68000 for faster multiplication like with z80 or 6502?
@Thorham Your multiplication optimization also relies on 32 bit words. I have to use such words with ARM code because it can't work with 16 bit words. So I am definitely going to update ARM code with this optimization.
Anyway a lot of thanks to everybody.
litwr is offline  
Old 14 January 2017, 12:58   #19
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
Originally Posted by litwr View Post
@meynaf Sorry, the fast multiplication code is not complete. It assumes that the high word of D3 is zero but it is not.
Depends how you use it. With MOVEM.W you get a free extend so my unrolled code works fine.
meynaf is offline  
Old 14 January 2017, 16:53   #20
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Quote:
Originally Posted by litwr View Post
Sorry, the fast multiplication code is not complete. It assumes that the high word of D3 is zero but it is not. So we need CLR.L D3 before MOVE.W -(A3),D3. (I want to have 80386 MOVZX here.) This may be solved by keeping 32-bit words in the main array (r). But this reduces the number of digits twice.
The 68k god waves his hand and says "You don't need a useless MOVZX instruction. The Intel engineers had not ascended when they mistakenly added MOVZX to the x86 and the Motorola engineers were not exalted when they added MVZ to the ColdFire. Pay no attention to the instruction saved, 2 bytes saved, single register functionality or compiler importance. The Force has spoken."

Quote:
Originally Posted by litwr View Post
However I am curious is it possible use small tables with 68020 or 68000 for faster multiplication like with z80 or 6502?
The 68020+ is one of the best ISAs ever created for handling tables. It should be able to handle practically every table the 8 bit processors could do. However, complex addressing modes were not particularly fast and accessing memory was especially slow back in the Amiga days.
matthey is offline  
 


Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)
 
Thread Tools

Similar Threads
Thread Thread Starter Forum Replies Last Post
Maniac Mansion Demo Disk - Onslaught releases the demo of a classic for C64! Neil79 Retrogaming General Discussion 0 16 January 2015 10:40
Yet another "plz help me find a demo" thead! AGA demo w/ texturemapped building ruinatokyo request.Demos 1 26 September 2013 16:29
Jason Lowe and The Mathematical Reflex Test jedk Retrogaming General Discussion 5 30 January 2013 02:13
Old Amiga Demo Wanted -- Music Track "The last ninja demo" scribbleuk request.Demos 13 23 April 2012 13:35
CU Amiga #75/Simon The Sorcerer Demo + CU Amiga #99/Amazon Queen Demo MethodGit request.Old Rare Games 12 16 February 2004 17:16

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

Forum Jump


All times are GMT +2. The time now is 16:02.

Top

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2024, vBulletin Solutions Inc.
Page generated in 0.11302 seconds with 13 queries