A mathematical demo

litwr · 12 January 2017, 18:16

I have made a code for 80386 at 12.5 MHz which is slightly faster than the same functionality code for 68020 at 14 MHz. The project is at http://litwr2.atspace.eu/pi/pi-spigot-benchmark.html. The main loop is between labels .l2 and .l4, it is 12 instructions only. The same code for 80386 occupies 13 lines. C sources are also provided. Is it possible to win 80386 having 1.5 MHz handicap? Help is welcome. Thanks in advance. I have to notify It is not easy task.

chb · 12 January 2017, 20:53

I just had a very brief look at the code, but you might want to unroll the loop a few times. The 68020 has 256 bytes of instruction cache, so unrolling it that it fits into cache might save you the "bra .l2" instruction a few times.

On a side note, I'm not sure how exact UAE reproduces the timing behaviour of the 68020. It's very precise for the 68000, but I remember the author mentioning that it is worse for the 020.

meynaf · 12 January 2017, 21:14

Emulators are good for 68000 but for 68020+ timing they are useless.

Unrolling is not needed, u can save that bra every time ('xcept the very first) by reordering the code :

Code:

	bra.s .l2
.loop
	sub.l d6,d5
	sub.l d7,d5
	lsr.l #1,d5
.l2	move.w -(a3),d0
	mulu d1,d0
	add.l d0,d5
	move.l d5,d6
	divul.l d4,d7:d6
	move.w d7,(a3)
	subq.w #2,d4
	bcc.s .loop
.l4

chb · 12 January 2017, 22:13

Quote:

Originally Posted by meynaf

Unrolling is not needed, u can save that bra every time ('xcept the very first) by reordering the code

Nice! But if d4 has usually a quite large value, you could still gain something from unrolling: Split the loop in a part that's for example 8x unrolled and a non-unrolled (rolled?

) part, and set an additional loop counter to N/8, respectively. Saves the bcc.s a couple of times, but needs an additional data register (d3 is free temporarily?) and a dbra at the end.

Code:

        (save d3 to mem or free adress register if needed)
        move.w d4,d3
        lsr.w #2,d3
        subq.w #1,d3
        bmi.s .l2

	bra.s .l2_8
.loop_8
	sub.l d6,d5
	sub.l d7,d5
	lsr.l #1,d5
.l2_8	move.w -(a3),d0
	mulu d1,d0
	add.l d0,d5
	move.l d5,d6
	divul.l d4,d7:d6
	move.w d7,(a3)
	subq.w #2,d4


(unroll additional 7x)

	dbra d3, .loop_8

;------ unchanged from meynaf's post, except the bra .l2


.loop
	sub.l d6,d5
	sub.l d7,d5
	lsr.l #1,d5
.l2	move.w -(a3),d0
	mulu d1,d0
	add.l d0,d5
	move.l d5,d6
	divul.l d4,d7:d6
	move.w d7,(a3)
	subq.w #2,d4
	bcc.s .loop
.l4

Warning, that code is untested, and my 68k assembly is very rusty...

Thorham · 13 January 2017, 08:38

How about this:

Code:

    bra     .l1

.loopc
    sub.l   d5,d0
    sub.l   d3,d0
    lsr.l   #1,d0
    mulu.w  d6,d3
    move.l  d3,(a1)
.l1
    add.l   -(a1),d0
    move.l  d0,d5
    divul.l d1,d3:d5

    subq.l  #2,d1
    bgt     .loopc

    mulu.w  d6,d3
    move.l  d3,(a1)

Full program (tested) 68020/30 AOS2+:

Code:

    incdir  "asminc:"

    include "exec/exec.i"
    include "dos/dos.i"
    include "lvo/lvos.i"

execBase equ 4

N equ 2800

start
;
; open dos library
;
    move.l  execBase,a6
    lea     dosName,a1
    moveq   #36,d0
    jsr     _LVOOpenLibrary(a6)
    move.l  d0,dosBase
    beq     .exit
;
; fill array
;
    lea     r+4,a0
    move.l  #2000*10000,d0

    move.l  #N-1,d7
.loopa
    move.l  d0,(a0)+
    dbra    d7,.loopa
;
; main loop
;
    move.l  dosBase,a6
    lea     printfArgs,a3
    lea     r,a2
    move.l  #10000,d6
    clr.l   d4

    move.l  #N,d7
.loopb
    clr.l   d0
    move.l  d7,d1

    lea     4(a2,d1.w*4),a1

    add.l   d1,d1
    subq.l  #1,d1

    bra     .l1

.loopc
    sub.l   d5,d0
    sub.l   d3,d0
    lsr.l   #1,d0
    mulu.w  d6,d3
    move.l  d3,(a1)
.l1
    add.l   -(a1),d0
    move.l  d0,d5
    divul.l d1,d3:d5

    subq.l  #2,d1
    bgt     .loopc

    mulu.w  d6,d3
    move.l  d3,(a1)

; print digits

    divul.l d6,d3:d0
    add.l   d4,d0
    move.l  d0,(a3)

    move.l  #formatString,d1
    move.l  a3,d2
    jsr     _LVOVPrintf(a6)

    move.l  d3,d4

    sub.w   #14,d7
    bgt     .loopb

; print line feed

    move.l  #lineFeed,d1
    move.l  a3,d2
    jsr     _LVOVPrintf(a6)

; cleanup and exit

.exit
    move.l  execBase,a6
    move.l  dosBase,a1
    jsr     _LVOCloseLibrary(a6)
    rts



    section data,data_p

dosBase
    dc.l    0

r
    dcb.l   N+1

printfArgs
    dcb.l   4

formatString
    dc.b    "%04ld",0

lineFeed
    dc.b    10,0

dosName
    dc.b    "dos.library",0

litwr · 13 January 2017, 10:36

Thanks to everybody for help.

I am going to rewrite a lot of codes...

@meynaf
Thank you very much for the brilliant idea.

It is a shame for me to miss it.

@Thorham
Wow! You have written a new program. Thank you very much. Your code looks close to perfect.

meynaf · 13 January 2017, 11:07

Quote:

Originally Posted by Thorham

Full program (tested) 68020/30 AOS2+:

Your program will crash if it can't open dos.library as you closelibrary(null) in that case

Perhaps you should also consider using a BSS section ; your exe is quite large.

Nevertheless reducing the loop to 10 instructions is very nice

meynaf · 13 January 2017, 12:31

Oh, i forgot this.

Since it's for 68020 we could replace :

Code:

   mulu.w  d6,d3        ; d6=10000 ($2710)

by :

Code:

 move.l d3,d2
 lsl.l #3,d3
 sub.l d3,d2
 add.l d3,d3
 sub.l d3,d2
 sub.l d3,d2
 lsl.l #8,d2
 sub.l d2,d3

(20 clocks for mul instead of 26 if i count correctly)

Thorham · 13 January 2017, 16:57

Quote:

Originally Posted by meynaf

Your program will crash if it can't open dos.library as you closelibrary(null) in that case

Passing a NULL is safe starting with V36: http://amigadev.elowar.com/read/ADCD.../node01F7.html

Quote:

Originally Posted by meynaf

Perhaps you should also consider using a BSS section ; your exe is quite large.

Absolutely.

By the way, nice x 10000 optimization.

litwr · 13 January 2017, 17:27

IMHO this optimized MUL is the key to outperform 80386. Thanks!

It was surprise for me that hardware multiplication is a bit slow for 68020. May it be faster with a small lookup (<32K)? z80 and 6502 codes use 768 bytes lookup.

meynaf · 13 January 2017, 17:40

Quote:

Originally Posted by Thorham

Passing a NULL is safe starting with V36: http://amigadev.elowar.com/read/ADCD.../node01F7.html

Didn't know.
However, one reason why this OpenLibrary call can fail is precisely because the program is run under some version lower than V36...

Quote:

Originally Posted by Thorham

By the way, nice x 10000 optimization.

Yeah, this one wasn't exactly easy to make.

Perhaps i can push the vice further and unroll the loop x7 - as the number of iterations is always dividable by 7. Now the loop is slightly under 250 bytes but i don't see a way to make it shorter :

Code:

; init
 lea rtab,a4                     ; gives ds.w 2800
 move.w #2800/2-1,d0
 move.l #(2000<<16)+2000,d1
 move.l a4,a0
.iloop
 move.l d1,(a0)+
 dbf d0,.iloop

; main loop
 suba.l a5,a5
 move.l #2800*2-1,d7
.mloop
 moveq #0,d5
 move.l a4,a3
 move.l d7,d4
.loop7
 movem.w (a3),d0-d3/a0-a2
do1 macro
 move.l \1,d6
 lsl.l #3,d6
 sub.l d6,\1
 add.l d6,d6
 add.l d6,d5
 add.l d6,d6
 sub.l \1,d6
 lsl.l #8,d6
 add.l d6,d5
 move.l d5,d0
 divul.l d4,d6:d0
 move.w d6,(a3)+
 sub.l d0,d5
 sub.l d6,d5
 lsr.l #1,d5
 subq.l #2,d4
 endm
 do1 d0
 do1 d1
 do1 d2
 do1 d3
 do1 a0
 do1 a1
 do1 a2
 bcc .loop7
 divul.l #10000,d1:d0
 add.l a5,d0
 move.l d1,a5

; printf d0 here

 lea 14*2(a4),a4
 subi.w #28,d7
 bgt .mloop

Note the buffer is now used in reverse order.

Quote:

Originally Posted by litwr

IMHO this optimized MUL is the key to outperform 80386. Thanks!

It was surprise for me that hardware multiplication is a bit slow for 68020.

Multiply speed is the major weakness of 020/030 if you ask me

That said, i doubt the 386 is a champion in this area either...

Having more registers can perhaps help too, btw.

Quote:

Originally Posted by litwr

May it be faster with a small lookup (<32K)? z80 and 6502 codes use 768 bytes lookup.

Depends how this lookup is made. I didn't check the exact range of values.

Thorham · 13 January 2017, 18:46

Quote:

Originally Posted by meynaf

one reason why this OpenLibrary call can fail is precisely because the program is run under some version lower than V36...

Yes, but vprintf needs V36.

Quote:

Originally Posted by meynaf

Yeah, this one wasn't exactly easy to make.

Nice, I certainly didn't think of that.

Quote:

Originally Posted by meynaf

Perhaps i can push the vice further and unroll the loop x7 - as the number of iterations is always dividable by 7. Now the loop is slightly under 250 bytes but i don't see a way to make it shorter

Interesting, I'll check it out.

Quote:

Originally Posted by meynaf

Depends how this lookup is made. I didn't check the exact range of values.

Range should be 0 - 9999 (modulus 10000).

Here's a version that should work with Kickstart 1.2, possibly lower (uses a 200 byte table for padded number conversion instead of vprintf):

Code:

;
; PI spigot, 68020+ Kickstart 1.2+
;

_LVOOldOpenLibrary equ -408
_LVOCloseLibrary equ -414
_LVOOutput equ -60
_LVOWrite equ -48

execBase equ 4

N equ 2800

start
;
; open dos library
;
    move.l  execBase,a6
    lea     dosName,a1
    jsr     _LVOOldOpenLibrary(a6)
    move.l  d0,dosBase
    beq     exit
;
; get output handle
;
    move.l  dosBase,a6
    jsr     _LVOOutput(a6)
    move.l  d0,stdOut
;
; fill array
;
    lea     r,a0
    move.l  #2000*10000,d0

    clr.l   (a0)+

    move.l  #N-1,d7
.loopa
    move.l  d0,(a0)+
    dbra    d7,.loopa
;
; main loop
;
    lea     decTable,a4
    lea     formatString,a5
    lea     r,a2
    move.l  #10000,d6
    clr.l   d4

    move.l  #N,d7
.loopb
    clr.l   d0
    move.l  d7,d1

    lea     4(a2,d1.w*4),a1

    add.l   d1,d1
    subq.l  #1,d1

    bra     .l1

.loopc
    sub.l   d5,d0
    sub.l   d3,d0
    lsr.l   #1,d0
    mulu.w  d6,d3
    move.l  d3,(a1)
.l1
    add.l   -(a1),d0
    move.l  d0,d5
    divul.l d1,d3:d5

    subq.l  #2,d1
    bgt     .loopc

    mulu.w  d6,d3
    move.l  d3,(a1)

; write digits to string

    divul.l d6,d3:d0
    add.l   d4,d0

    divu.w  #100,d0
    move.w  (a4,d0.w*2),(a5)+
    swap    d0
    move.w  (a4,d0.w*2),(a5)+

    move.l  d3,d4

    sub.w   #14,d7
    bgt     .loopb

; print PI digits

    move.w  #$0a00,(a5)+

    move.l  dosBase,a6

    move.l  stdOut,d1
    move.l  #formatString,d2
    move.l  a5,d3
    sub.l   #formatString,d3
    jsr     _LVOWrite(a6)

; cleanup and exit

exit
    move.l  execBase,a6
    move.l  dosBase,a1
    tst.l   a1
    beq     .l1
    jsr     _LVOCloseLibrary(a6)
.l1
    rts

    section data,data_p

dosBase
    dc.l    0

stdOut
    dc.l    0

decTable
    dc.b    "00010203040506070809"
    dc.b    "10111213141516171819"
    dc.b    "20212223242526272829"
    dc.b    "30313233343536373839"
    dc.b    "40414243444546474849"
    dc.b    "50515253545556575859"
    dc.b    "60616263646566676869"
    dc.b    "70717273747576777879"
    dc.b    "80818283848586878889"
    dc.b    "90919293949596979899"

dosName
    dc.b    "dos.library",0

    section data2,bss_p
r
    ds.l    N+1

formatString
    ds.b    1024

meynaf · 13 January 2017, 19:11

Quote:

Originally Posted by Thorham

Yes, but vprintf needs V36.

I know, you need V36 so you ask for V36 which is ok, the problem is that the program will crash if V36 dos isn't available - which would easily be avoided by just branching to rts and not to the CloseLibrary call.

Quote:

Originally Posted by Thorham

Range should be 0 - 9999 (modulus 10000).

Means a 20kb table. Perhaps a little bit overkill for this kind of program...

matthey · 13 January 2017, 19:40

Quote:

Originally Posted by meynaf

Multiply speed is the major weakness of 020/030 if you ask me

That said, i doubt the 386 is a champion in this area either...

These are the timings I came up with quickly for unsigned integer register to register multiply in cache.

Code:

68020 vs 80286
16x16=32 27 21
32x32=32 43 -
32x32=64 43 -

68030 vs 80386
16x16=32 27 9-22
32x32=32 43 9-38
32x32=64 43 9-38

68040 vs 80486
16x16=32 14 13-26
32x32=32 20 13-42
32x32=64 20 13-42

68060 vs Pentium
16x16=32 2 11
32x32=32 2 10
32x32=64 - 10

It appears the 68k has better timings except for the 68020-68030 years (as you suspected). The 68k MUL was more flexible with more general purpose registers. The x86 processors tended to have higher clock speeds. Of course, a modern 68k ASIC could have a 1 cycle MUL for all types

.

litwr · 13 January 2017, 19:57

Thanks a lot!

I need several days to digest such a big dish.

80286 (80386) has a very good division and a much less good multiplication.

Thorham · 13 January 2017, 20:34

Quote:

Originally Posted by meynaf

I know, you need V36 so you ask for V36 which is ok, the problem is that the program will crash if V36 dos isn't available - which would easily be avoided by just branching to rts and not to the CloseLibrary call.

Yeah, of course

Is fixed in the last version I posted though, but good call

Quote:

Originally Posted by meynaf

Means a 20kb table. Perhaps a little bit overkill for this kind of program...

It's a benchmark, so it's not overkill (my 1 megabyte HAM8 table is overkill

). With this you can see the difference between a multiply instruction and table access on multiple platforms.

That reminds me, I still have to add a timer and allow a variable number of digits to be calculated.

meynaf · 14 January 2017, 10:29

Quote:

Originally Posted by Thorham

It's a benchmark, so it's not overkill (my 1 megabyte HAM8 table is overkill

). With this you can see the difference between a multiply instruction and table access on multiple platforms.

I said 20kb but in fact it's gonna be 40kb, due the result is 32-bit and not 16.
Filling such a table is easy but takes some amount of time. Or we could accept a large exe where most others are under 1kb

And the net benefit can be very small. A table lookup can be 12 (maybe 14, i don't remember exactly) cycles, with big wait states because in unexpanded A1200 we're in chipmem...

Quote:

Originally Posted by Thorham

That reminds me, I still have to add a timer and allow a variable number of digits to be calculated.

Can't help you for variable digits, but for the timer i have the relevant code. Of course you could go for cheap vbl counter ; I do many benchmarks this way - but due to the relative high speed of the A1200, it's perhaps not precise enough.

Makes me think i've recently added this kind of function to my include files.
So here it goes, in case it could be helpful for ya :

Code:

 movem.l d2-d3/a0-a2/a6,-(a7)
 subq.l #8,a7            ; cheap struct timeval
 move.l a7,a0
 move.l timereq+$14,a6   ; timereq is ioreq for previously opened timer.device
 jsr -$42(a6)            ; GetSysTime
 movem.l (a7)+,d0-d1     ; d0=tv_secs, d1=tv_micro
 movem.l oldtime,d2-d3   ; two ds.l to store previous value
 movem.l d0-d1,oldtime
 sub.l d3,d1             ; micro
 subx.l d3,d3            ; keep the carry
 subx.l d2,d0            ; d0=secs
 andi.l #1000000,d3      ; +1 million if carry set
 add.l d3,d1             ; d1=micro
 movem.l (a7)+,d2-d3/a0-a2/a6
 rts

Call it once at startup and once at the end. Returns d0=secs & d1=micro. Does not touch other regs. Of course you have to open timer.device prior to first call.

litwr · 14 January 2017, 12:10

@meynaf Sorry, the fast multiplication code is not complete. It assumes that the high word of D3 is zero but it is not. So we need CLR.L D3 before MOVE.W -(A3),D3. (I want to have 80386 MOVZX here.) This may be solved by keeping 32-bit words in the main array (r). But this reduces the number of digits twice.

However I am curious is it possible use small tables with 68020 or 68000 for faster multiplication like with z80 or 6502?
@Thorham Your multiplication optimization also relies on 32 bit words. I have to use such words with ARM code because it can't work with 16 bit words. So I am definitely going to update ARM code with this optimization.
Anyway a lot of thanks to everybody.

meynaf · 14 January 2017, 12:58

Quote:

Originally Posted by litwr

@meynaf Sorry, the fast multiplication code is not complete. It assumes that the high word of D3 is zero but it is not.

Depends how you use it. With MOVEM.W you get a free extend so my unrolled code works fine.

matthey · 14 January 2017, 16:53

Quote:

Originally Posted by litwr

Sorry, the fast multiplication code is not complete. It assumes that the high word of D3 is zero but it is not. So we need CLR.L D3 before MOVE.W -(A3),D3. (I want to have 80386 MOVZX here.) This may be solved by keeping 32-bit words in the main array (r). But this reduces the number of digits twice.

The 68k god waves his hand and says "You don't need a useless MOVZX instruction. The Intel engineers had not ascended when they mistakenly added MOVZX to the x86 and the Motorola engineers were not exalted when they added MVZ to the ColdFire. Pay no attention to the instruction saved, 2 bytes saved, single register functionality or compiler importance. The Force has spoken."

Quote:

Originally Posted by litwr

However I am curious is it possible use small tables with 68020 or 68000 for faster multiplication like with z80 or 6502?

The 68020+ is one of the best ISAs ever created for handling tables. It should be able to handle practically every table the 8 bit processors could do. However, complex addressing modes were not particularly fast and accessing memory was especially slow back in the Amiga days.

13 January 2017, 19:57	#15
litwr Registered User Join Date: Mar 2016 Location: Ozherele Posts: 229	Thanks a lot! I need several days to digest such a big dish. 80286 (80386) has a very good division and a much less good multiplication. Last edited by litwr; 13 January 2017 at 20:06.

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
Maniac Mansion Demo Disk - Onslaught releases the demo of a classic for C64!	Neil79	Retrogaming General Discussion	0	16 January 2015 10:40
Yet another "plz help me find a demo" thead! AGA demo w/ texturemapped building	ruinatokyo	request.Demos	1	26 September 2013 16:29
Jason Lowe and The Mathematical Reflex Test	jedk	Retrogaming General Discussion	5	30 January 2013 02:13
Old Amiga Demo Wanted -- Music Track "The last ninja demo"	scribbleuk	request.Demos	13	23 April 2012 13:35
CU Amiga #75/Simon The Sorcerer Demo + CU Amiga #99/Amazon Queen Demo	MethodGit	request.Old Rare Games	12	16 February 2004 17:16

12 January 2017, 18:16	#1
litwr Registered User Join Date: Mar 2016 Location: Ozherele Posts: 229	A mathematical demo I have made a code for 80386 at 12.5 MHz which is slightly faster than the same functionality code for 68020 at 14 MHz. The project is at http://litwr2.atspace.eu/pi/pi-spigot-benchmark.html. The main loop is between labels .l2 and .l4, it is 12 instructions only. The same code for 80386 occupies 13 lines. C sources are also provided. Is it possible to win 80386 having 1.5 MHz handicap? Help is welcome. Thanks in advance. I have to notify It is not easy task.

12 January 2017, 20:53	#2
chb Registered User Join Date: Dec 2014 Location: germany Posts: 439	I just had a very brief look at the code, but you might want to unroll the loop a few times. The 68020 has 256 bytes of instruction cache, so unrolling it that it fits into cache might save you the "bra .l2" instruction a few times. On a side note, I'm not sure how exact UAE reproduces the timing behaviour of the 68020. It's very precise for the 68000, but I remember the author mentioning that it is worse for the 020. Last edited by chb; 12 January 2017 at 20:59.

12 January 2017, 21:14	#3
meynaf son of 68k Join Date: Nov 2007 Location: Lyon / France Age: 51 Posts: 5,355	Emulators are good for 68000 but for 68020+ timing they are useless. Unrolling is not needed, u can save that bra every time ('xcept the very first) by reordering the code : Code: bra.s .l2 .loop sub.l d6,d5 sub.l d7,d5 lsr.l #1,d5 .l2 move.w -(a3),d0 mulu d1,d0 add.l d0,d5 move.l d5,d6 divul.l d4,d7:d6 move.w d7,(a3) subq.w #2,d4 bcc.s .loop .l4

13 January 2017, 10:36	#6
litwr Registered User Join Date: Mar 2016 Location: Ozherele Posts: 229	Thanks to everybody for help. I am going to rewrite a lot of codes... @meynaf Thank you very much for the brilliant idea. It is a shame for me to miss it. @Thorham Wow! You have written a new program. Thank you very much. Your code looks close to perfect.

13 January 2017, 12:31	#8
meynaf son of 68k Join Date: Nov 2007 Location: Lyon / France Age: 51 Posts: 5,355	Oh, i forgot this. Since it's for 68020 we could replace : Code: mulu.w d6,d3 ; d6=10000 ($2710) by : Code: move.l d3,d2 lsl.l #3,d3 sub.l d3,d2 add.l d3,d3 sub.l d3,d2 sub.l d3,d2 lsl.l #8,d2 sub.l d2,d3 (20 clocks for mul instead of 26 if i count correctly)

13 January 2017, 17:27	#10
litwr Registered User Join Date: Mar 2016 Location: Ozherele Posts: 229	IMHO this optimized MUL is the key to outperform 80386. Thanks! It was surprise for me that hardware multiplication is a bit slow for 68020. May it be faster with a small lookup (<32K)? z80 and 6502 codes use 768 bytes lookup.

14 January 2017, 12:10	#18
litwr Registered User Join Date: Mar 2016 Location: Ozherele Posts: 229	@meynaf Sorry, the fast multiplication code is not complete. It assumes that the high word of D3 is zero but it is not. So we need CLR.L D3 before MOVE.W -(A3),D3. (I want to have 80386 MOVZX here.) This may be solved by keeping 32-bit words in the main array (r). But this reduces the number of digits twice. However I am curious is it possible use small tables with 68020 or 68000 for faster multiplication like with z80 or 6502? @Thorham Your multiplication optimization also relies on 32 bit words. I have to use such words with ARM code because it can't work with 16 bit words. So I am definitely going to update ARM code with this optimization. Anyway a lot of thanks to everybody.

Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)