68060 64-bit integer math

BSzili · 05 December 2020, 07:57

I have a few functions I'd like to optimize for the '060:

Code:

static inline int scale(int eax, int edx, int ecx) { return (int32_t)(((int64_t)eax * (int64_t)edx) / (int64_t)ecx); }
static inline int mulscale(int eax, int edx, int ecx) { return (int32_t)(((int64_t)eax * (int64_t)edx) >> (int8_t)ecx); }
static inline int divscale(int eax, int ebx, int ecx) { return (int32_t)(((int64_t)eax << (int8_t)ecx) / (int64_t)ebx); }

When compiling these for the 060, I'll end up with GCC integer library functions (__muldi3, __divdi3, __ashrdi3, __ashldi3), which has some overhead. I looked up assembly optimized versions, and found these:

Code:

_scale:
    move.l  d1,-(sp)
    move.l  d2,-(sp)

    muls.l  d1,d1:d0
    divs.l  d2,d1:d0

    move.l  (sp)+,d2
    move.l  (sp)+,d1
    rts

_mulscale:
    movem.l d1-d3,-(sp)

    muls.l  d1,d1:d0
    moveq   #32,d3
    sub.l   d2,d3
    lsr.l   d2,d0
    lsl.l   d3,d1
    or.l    d1,d0

    movem.l (sp)+,d1-d3
    rts

_divscale:
    move.l  d2,-(sp)
    move.l  d3,-(sp)

    move.l  d0,d3
    lsl.l   d2,d0
    neg.b   d2
    and.b   #$1f,d2
    asr.l   d2,d3
    divs.l  d1,d3:d0

    move.l  (sp)+,d3
    move.l  (sp)+,d2
    rts

Unfortunately these use muls.l, divs.l which is good for 020-040, but I'll be emulated on the 060. I don't really know a lot of 68k assembly, so I'm not sure how I could avoid using the emulated integer instructions. I did a naive implementation using double precision floating point math, which worked for smaller numbers, but didn't have enough precision to replace the original functions

Could someone point me in the right direction?

Thomas Richter · 05 December 2020, 09:58

64 bit math functions are in the utility.library, at least a 32x32->64 multiply. This function is either an already 68060 capable version on 3.1.4, or patched over by the 68060.library (before), so you are safe to use it. There is (at present) no 64/32 divide function in the utility.library. This will change in 3.2. Unfortunately, there is no "trivial" replacement for it in the sense of a ten-liner. Thus, I could either give you a "shorter, easy to understand but slow" version, or a "longer, complex but faster" version. What do you need?

BSzili · 05 December 2020, 10:54

Thanks for the tip! I'm fine with the complex but faster one for the division, as these functions are called quite often.

Thomas Richter · 05 December 2020, 14:49

Quote:

Originally Posted by BSzili

Thanks for the tip! I'm fine with the complex but faster one for the division, as these functions are called quite often.

Well, here you go:

Code:

    ;; signed 64/32 divide, 68020 function
SDiv64H:
    divs.l    d2,d1:d0
    rts
    ;; signed 64/32 divide, 68060 function
SDiv646:
    movem.l    d2-d7/a2-a5,-(sp)

    sub.l    a2,a2        ;sign flag divisor
    sub.l    a3,a3        ;sign flag dividend
    move.l    d0,a4
    move.l    d1,a5        ;save original contents

    move.l    d2,d7        ;save divisor
    bpl.s    1$        ;make positive
    addq.w    #1,a2        ;set flag
    neg.l    d7
1$:

    move.l    d0,d6
    move.l    d1,d5        ;save dividend
    bpl.s    2$
    addq.w    #1,a3        ;set flag
    neg.l    d6
    negx.l    d5        ;invert
2$:
    tst.l    d5        ;is the high non-zero? If so, full divide
    bne.s    3$

    tst.l    d6        ;is low zero?
    beq.s    10$        ;yes, we are done

    ;; here low <> 0
    cmp.l    d6,d7        ;is the divisor <= lo (dividend)
    bls.s    5$        ;yes, use a 32-bit divide

    exg.l    d5,d6        ;q = 0, r = dividend
    bra.s    6$        ;can't divide, done
5$:    
    divul.l    d7,d5:d6                    ;# it's only a 32/32 bit div!
    bra.b    6$
3$:    ;; full 64 bit case here. do we have an overflow?
    cmp.l    d5,d7
    bls.s    7$        ;yes
    ;; here full 64/32 divide
    bsr    AlgorithmD    ;perform classical algorithm D from Knuth
    ;; remainder in d6, quotient in d5
6$:    ;done here, check whether there is a sign switch
    cmp.w    #0,a3
    beq.s    8$
    neg.l    d5        ;remainder has the same sign as dividend
8$:
    cmp.w    a2,a3        ;same signs of divisor and dividend?
    beq.s    9$

    cmp.l    #$80000000,d6    ;representable as 32-bit negative number?
    bhi.s    7$        ;overflow
    neg.l    d6        ;make a 2s complement for quotient
    bra.s    10$
9$:                ;here positive
    tst.l    d6        ;will fit into 32 bits?
    bmi.s    7$        ;overflow if not
10$:                ;here done
    move.l    d5,d1        ;remainder -> d1
    move.l    d6,d0        ;quotient -> d0
    andi.b    #$fc,ccr    ;clear V and C
    bra.s    11$
7$:
    move.l    a4,d0        ;restore original register content
    move.l    a5,d1
    ori.b    #$2,ccr        ;set overflow bit
11$:
    movem.l    (sp)+,d2-d7/a2-a5
    rts

AlgorithmD is Knuth's classical division algorithm:

Code:

    ;; division algorithm. This is either a
    ;; full algorithmD from Knuth "The Art of Programming", Vol.2
    ;; or using a 32:16 division in case the divisor fits
    ;; into 16 bits.
    ;; This algorithm divides d5:d6 by d7
AlgorithmD:    
    swap    d7
    tst.w    d7
    bne.s    1$
    ;; ok, we only need to divide by 16 bits, so
    ;; things are somewhat simpler
    swap    d7        ; restore divisor
    ;; note that we already know that the division
    ;; does not overflow (checked upwards)
    moveq    #0,d1

    swap    d5        ; same as r*b if previous step rqd
    swap    d6        ; get u3 to lsw position
    move.w    d6,d5        ; rb + u3

    divu.w    d7,d5

    move.w    d5,d1        ; first quotient word
    swap    d6        ; get u4
    move.w    d6,d5        ; rb + u4

    divu.w    d7,d5

    swap    d1
    move.w    d5,d1        ; 2nd quotient 'digit'
    clr.w    d5        
    swap    d5        ;now remainder
    move.l    d1,d6        ; and quotient
    rts    
1$:
    swap d7            ; restore divisor
    ;; classical algorithm D.
    ;; In this algorithm, the divisor is treated as a 2 digit (word) number
    ;; which is divided into a 3 digit (word) dividend to get one quotient
    ;; digit (word). After subtraction, the dividend is shifted and the
    ;; process repeated. Before beginning, the divisor and quotient are
    ;; 'normalized' so that the process of estimating the quotient digit
    ;; will yield verifiably correct results..
    moveq    #0,d4                        ;all the flags in d4
ddnchk:
    ;; normalize/upshift the divisor to use full 32 bits, adjust dividend with it.
    ;; the number of shifts goes into d4
    ;; note that d7 is at least 0x00010000
    tst.l    d7
    bmi.b    ddnormalized
ddnchk2:
    addq.l    #$1,d4                        ;count in d4
    lsl.l    #$1,d6                        ;# shift u4,u3 with overflow to u2
    roxl.l    #$1,d5                        ;# shift u1,u2 
    lsl.l    #$1,d7                        ;# shift the divisor
    bpl.b    ddnchk2

    swap    d4                        ;keep lo-word free

ddnormalized:
    ;; Now calculate an estimate of the quotient words (msw first, then lsw).
    ;; The comments use subscripts for the first quotient digit determination.
    move.l    d7,d3                        ;# divisor
    move.l    d5,d2                        ;# dividend mslw
    swap    d3
    swap    d2
    move.w    #$ffff,d1                    ;# use max trial quotient word
    cmp.w    d3,d2                        ;# V1 = U1 ?
    beq.b    ddadj0

    move.l    d5,d1
    divu.w    d3,d1                        ;# use quotient of mslw/msw
ddadj0:

    ;; now test the trial quotient and adjust. This step plus the
    ;; normalization assures (according to Knuth) that the trial
    ;; quotient will be at worst 2 too large.
    ;; NOTE: We do not perform step D3 here. This is not required, as
    ;; D4 is sufficient for adjusting a quotient that has been guessed
    ;; "too large". At most, it can be off by two (easy to prove).

    move.l    d7,d2        ;V1V2->d2

    ;; at this stage, d1 is scaled by 1<<16. Evaluate the 32x32 product d1'0xd7->d2(hi),d3(lo)
    ;; d0,d2,d3,d6 are scratches

    move.l    d7,d0        ;V1V2->d0
    swap     d2        ;get hi of d7 = V1
    mulu.w    d1,d0        ;V2*q: scaled by 2^16, must be split in higher/lower pair
    mulu.w    d1,d2        ;V1*q: the upper 32 bit = V1*q, must be scaled by 2^32
    move.l    d0,d3        ;get lo
    clr.w    d0        ;clear lo
    swap    d3        ;part of it
    swap    d0        ;swap: scale by 2^16: This must be added to hi
    clr.w    d3        ;shift up by 16
    add.l    d0,d2        ;add to hi

    sub.l    d3,d6        
    subx.l    d2,d5        ; subtract double precision
    bcc.s    dd2nd        ; no carry, do next quotient digit

    ;; need to add back divisor longword to current ms 3 digits of dividend
    ;; - according to Knuth, this is done only 2 out of 65536 times for random
    ;; divisor, dividend selection.
    ;;
    ;; thor: computations show that this loop is run at most twice.
    ;; this is better than knuth in this specific case because we
    ;; avoid the multiplications in D3 and this step here (D4) is 
    ;; only addition.

    move.l    d7,d3
    move.l    d7,d2        ; add V1V20 back to U1U2U3U4
    swap    d3
    clr.w    d2
    clr.w    d3
    swap    d2        ; d3 now ls word of divisor
ddaddup:
    subq.l    #$1,d1        ; q is one too large
    add.l    d3,d6        ; aligned with 3rd word of dividend
    addx.l    d2,d5
    bcc    ddaddup        ; until we're positive again
dd2nd:
    tst.l    d4
    bmi.s    ddremain

;# first quotient digit now correct. store digit and shift the
;# (subtracted) dividend 
    move.w    d1,d4        ;keep hi-quotient
    swap    d5
    swap    d6
    move.w    d6,d5
    clr.w    d6        ;shift remainder up by 16 bits
    bset    #31,d4        ;second digit
    bra.s    ddnormalized
ddremain:
;# add 2nd word to quotient, get the remainder.
    swap    d1
    move.w    d4,d1        ;get result of previous division
    swap    d1        ;restore order
    swap    d4        ;restore normalization counter

;# shift down one word/digit to renormalize remainder.
    move.w    d5,d6
    swap    d6
    swap    d5
    clr.l    d7
    move.b    d4,d7
    beq.s    ddrn
    ;; shift d5:d6 to the right by d7 bits, d5 is high
    move.l    d5,d0
    lsr.l    d7,d6        ; shift low out
    ror.l    d7,d0        ; move low bits of high to high
    lsr.l    d7,d5        ; shift high down
    eor.l    d5,d0        ; high bits are in d0
    or.l    d0,d6        ; into d6
ddrn:
    move.l    d6,d5        ; remainder
    move.l    d1,d6
    rts

BSzili · 05 December 2020, 15:44

Thanks again, I'll benchmark this against the output of the compiler that uses libgcc.a.

a/b · 05 December 2020, 18:22

Both 0 and $8000000 are not affected by a NEG.L, and since they are both passable beforehand this can be optimized:

Code:

;    cmp.l    #$80000000,d6    ;representable as 32-bit negative number?
;    bhi.s    7$        ;overflow
;    neg.l    d6        ;make a 2s complement for quotient
    neg.l    d6        ;make a 2s complement for quotient
    bgt.s    7$        ;overflow
    bra.s    10$

Overflow range [$80000001,$ffffffff] is NEG'd to [$7fffffff,1] (=> BGT).

Ernst Blofeld · 25 January 2021, 20:31

Will that implentation of AlgorithmD work by itself on plain 68k machines? It doesn't rely on the shortcuts in the 020 specific code above?

a/b · 25 January 2021, 21:18

Yeah, it will work on any M68K.

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
Math [divs]	LeCaravage	Coders. Asm / Hardware	3	23 October 2020 08:39
Discovery: Math	Audio Snow	request.Old Rare Games	30	20 August 2018 12:17
68060 integer instruction emulation in 2.7.0	jotd	support.WinUAE	14	04 January 2014 19:20
Math library?	Tiddlypeeps	Coders. General	5	08 April 2010 19:45

05 December 2020, 07:57	#1
BSzili old chunk of coal Join Date: Nov 2011 Location: Hungary Posts: 1,289	68060 64-bit integer math I have a few functions I'd like to optimize for the '060: Code: static inline int scale(int eax, int edx, int ecx) { return (int32_t)(((int64_t)eax * (int64_t)edx) / (int64_t)ecx); } static inline int mulscale(int eax, int edx, int ecx) { return (int32_t)(((int64_t)eax * (int64_t)edx) >> (int8_t)ecx); } static inline int divscale(int eax, int ebx, int ecx) { return (int32_t)(((int64_t)eax << (int8_t)ecx) / (int64_t)ebx); } When compiling these for the 060, I'll end up with GCC integer library functions (__muldi3, __divdi3, __ashrdi3, __ashldi3), which has some overhead. I looked up assembly optimized versions, and found these: Code: _scale: move.l d1,-(sp) move.l d2,-(sp) muls.l d1,d1:d0 divs.l d2,d1:d0 move.l (sp)+,d2 move.l (sp)+,d1 rts _mulscale: movem.l d1-d3,-(sp) muls.l d1,d1:d0 moveq #32,d3 sub.l d2,d3 lsr.l d2,d0 lsl.l d3,d1 or.l d1,d0 movem.l (sp)+,d1-d3 rts _divscale: move.l d2,-(sp) move.l d3,-(sp) move.l d0,d3 lsl.l d2,d0 neg.b d2 and.b #$1f,d2 asr.l d2,d3 divs.l d1,d3:d0 move.l (sp)+,d3 move.l (sp)+,d2 rts Unfortunately these use muls.l, divs.l which is good for 020-040, but I'll be emulated on the 060. I don't really know a lot of 68k assembly, so I'm not sure how I could avoid using the emulated integer instructions. I did a naive implementation using double precision floating point math, which worked for smaller numbers, but didn't have enough precision to replace the original functions Could someone point me in the right direction? Last edited by BSzili; 05 December 2020 at 08:18.

05 December 2020, 09:58	#2
Thomas Richter Registered User Join Date: Jan 2019 Location: Germany Posts: 3,216	64 bit math functions are in the utility.library, at least a 32x32->64 multiply. This function is either an already 68060 capable version on 3.1.4, or patched over by the 68060.library (before), so you are safe to use it. There is (at present) no 64/32 divide function in the utility.library. This will change in 3.2. Unfortunately, there is no "trivial" replacement for it in the sense of a ten-liner. Thus, I could either give you a "shorter, easy to understand but slow" version, or a "longer, complex but faster" version. What do you need?

05 December 2020, 10:54	#3
BSzili old chunk of coal Join Date: Nov 2011 Location: Hungary Posts: 1,289	Thanks for the tip! I'm fine with the complex but faster one for the division, as these functions are called quite often.

05 December 2020, 15:44	#5
BSzili old chunk of coal Join Date: Nov 2011 Location: Hungary Posts: 1,289	Thanks again, I'll benchmark this against the output of the compiler that uses libgcc.a.

05 December 2020, 18:22	#6
a/b Registered User Join Date: Jun 2016 Location: europe Posts: 1,039	Both 0 and $8000000 are not affected by a NEG.L, and since they are both passable beforehand this can be optimized: Code: ; cmp.l #$80000000,d6 ;representable as 32-bit negative number? ; bhi.s 7$ ;overflow ; neg.l d6 ;make a 2s complement for quotient neg.l d6 ;make a 2s complement for quotient bgt.s 7$ ;overflow bra.s 10$ Overflow range [$80000001,$ffffffff] is NEG'd to [$7fffffff,1] (=> BGT).

25 January 2021, 20:31	#7
Ernst Blofeld <optimized out> Join Date: Sep 2020 Location: <optimized out> Posts: 321	Will that implentation of AlgorithmD work by itself on plain 68k machines? It doesn't rely on the shortcuts in the 020 specific code above?

25 January 2021, 21:18	#8
a/b Registered User Join Date: Jun 2016 Location: europe Posts: 1,039	Yeah, it will work on any M68K.

Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)