English Amiga Board


Go Back   English Amiga Board > Coders > Coders. Asm / Hardware

 
 
Thread Tools
Old 05 December 2020, 08:57   #1
BSzili
Registered User

BSzili's Avatar
 
Join Date: Nov 2011
Location: Hungary
Posts: 334
68060 64-bit integer math

I have a few functions I'd like to optimize for the '060:
Code:
static inline int scale(int eax, int edx, int ecx) { return (int32_t)(((int64_t)eax * (int64_t)edx) / (int64_t)ecx); }
static inline int mulscale(int eax, int edx, int ecx) { return (int32_t)(((int64_t)eax * (int64_t)edx) >> (int8_t)ecx); }
static inline int divscale(int eax, int ebx, int ecx) { return (int32_t)(((int64_t)eax << (int8_t)ecx) / (int64_t)ebx); }
When compiling these for the 060, I'll end up with GCC integer library functions (__muldi3, __divdi3, __ashrdi3, __ashldi3), which has some overhead. I looked up assembly optimized versions, and found these:
Code:
_scale:
    move.l  d1,-(sp)
    move.l  d2,-(sp)

    muls.l  d1,d1:d0
    divs.l  d2,d1:d0

    move.l  (sp)+,d2
    move.l  (sp)+,d1
    rts

_mulscale:
    movem.l d1-d3,-(sp)

    muls.l  d1,d1:d0
    moveq   #32,d3
    sub.l   d2,d3
    lsr.l   d2,d0
    lsl.l   d3,d1
    or.l    d1,d0

    movem.l (sp)+,d1-d3
    rts

_divscale:
    move.l  d2,-(sp)
    move.l  d3,-(sp)

    move.l  d0,d3
    lsl.l   d2,d0
    neg.b   d2
    and.b   #$1f,d2
    asr.l   d2,d3
    divs.l  d1,d3:d0

    move.l  (sp)+,d3
    move.l  (sp)+,d2
    rts
Unfortunately these use muls.l, divs.l which is good for 020-040, but I'll be emulated on the 060. I don't really know a lot of 68k assembly, so I'm not sure how I could avoid using the emulated integer instructions. I did a naive implementation using double precision floating point math, which worked for smaller numbers, but didn't have enough precision to replace the original functions Could someone point me in the right direction?

Last edited by BSzili; 05 December 2020 at 09:18.
BSzili is offline  
Old 05 December 2020, 10:58   #2
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 964
64 bit math functions are in the utility.library, at least a 32x32->64 multiply. This function is either an already 68060 capable version on 3.1.4, or patched over by the 68060.library (before), so you are safe to use it. There is (at present) no 64/32 divide function in the utility.library. This will change in 3.2. Unfortunately, there is no "trivial" replacement for it in the sense of a ten-liner. Thus, I could either give you a "shorter, easy to understand but slow" version, or a "longer, complex but faster" version. What do you need?
Thomas Richter is offline  
Old 05 December 2020, 11:54   #3
BSzili
Registered User

BSzili's Avatar
 
Join Date: Nov 2011
Location: Hungary
Posts: 334
Thanks for the tip! I'm fine with the complex but faster one for the division, as these functions are called quite often.
BSzili is offline  
Old 05 December 2020, 15:49   #4
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 964
Quote:
Originally Posted by BSzili View Post
Thanks for the tip! I'm fine with the complex but faster one for the division, as these functions are called quite often.

Well, here you go:


Code:
    ;; signed 64/32 divide, 68020 function
SDiv64H:
    divs.l    d2,d1:d0
    rts
    ;; signed 64/32 divide, 68060 function
SDiv646:
    movem.l    d2-d7/a2-a5,-(sp)

    sub.l    a2,a2        ;sign flag divisor
    sub.l    a3,a3        ;sign flag dividend
    move.l    d0,a4
    move.l    d1,a5        ;save original contents

    move.l    d2,d7        ;save divisor
    bpl.s    1$        ;make positive
    addq.w    #1,a2        ;set flag
    neg.l    d7
1$:

    move.l    d0,d6
    move.l    d1,d5        ;save dividend
    bpl.s    2$
    addq.w    #1,a3        ;set flag
    neg.l    d6
    negx.l    d5        ;invert
2$:
    tst.l    d5        ;is the high non-zero? If so, full divide
    bne.s    3$

    tst.l    d6        ;is low zero?
    beq.s    10$        ;yes, we are done

    ;; here low <> 0
    cmp.l    d6,d7        ;is the divisor <= lo (dividend)
    bls.s    5$        ;yes, use a 32-bit divide

    exg.l    d5,d6        ;q = 0, r = dividend
    bra.s    6$        ;can't divide, done
5$:    
    divul.l    d7,d5:d6                    ;# it's only a 32/32 bit div!
    bra.b    6$
3$:    ;; full 64 bit case here. do we have an overflow?
    cmp.l    d5,d7
    bls.s    7$        ;yes
    ;; here full 64/32 divide
    bsr    AlgorithmD    ;perform classical algorithm D from Knuth
    ;; remainder in d6, quotient in d5
6$:    ;done here, check whether there is a sign switch
    cmp.w    #0,a3
    beq.s    8$
    neg.l    d5        ;remainder has the same sign as dividend
8$:
    cmp.w    a2,a3        ;same signs of divisor and dividend?
    beq.s    9$

    cmp.l    #$80000000,d6    ;representable as 32-bit negative number?
    bhi.s    7$        ;overflow
    neg.l    d6        ;make a 2s complement for quotient
    bra.s    10$
9$:                ;here positive
    tst.l    d6        ;will fit into 32 bits?
    bmi.s    7$        ;overflow if not
10$:                ;here done
    move.l    d5,d1        ;remainder -> d1
    move.l    d6,d0        ;quotient -> d0
    andi.b    #$fc,ccr    ;clear V and C
    bra.s    11$
7$:
    move.l    a4,d0        ;restore original register content
    move.l    a5,d1
    ori.b    #$2,ccr        ;set overflow bit
11$:
    movem.l    (sp)+,d2-d7/a2-a5
    rts

AlgorithmD is Knuth's classical division algorithm:
Code:
    ;; division algorithm. This is either a
    ;; full algorithmD from Knuth "The Art of Programming", Vol.2
    ;; or using a 32:16 division in case the divisor fits
    ;; into 16 bits.
    ;; This algorithm divides d5:d6 by d7
AlgorithmD:    
    swap    d7
    tst.w    d7
    bne.s    1$
    ;; ok, we only need to divide by 16 bits, so
    ;; things are somewhat simpler
    swap    d7        ; restore divisor
    ;; note that we already know that the division
    ;; does not overflow (checked upwards)
    moveq    #0,d1

    swap    d5        ; same as r*b if previous step rqd
    swap    d6        ; get u3 to lsw position
    move.w    d6,d5        ; rb + u3

    divu.w    d7,d5

    move.w    d5,d1        ; first quotient word
    swap    d6        ; get u4
    move.w    d6,d5        ; rb + u4

    divu.w    d7,d5

    swap    d1
    move.w    d5,d1        ; 2nd quotient 'digit'
    clr.w    d5        
    swap    d5        ;now remainder
    move.l    d1,d6        ; and quotient
    rts    
1$:
    swap d7            ; restore divisor
    ;; classical algorithm D.
    ;; In this algorithm, the divisor is treated as a 2 digit (word) number
    ;; which is divided into a 3 digit (word) dividend to get one quotient
    ;; digit (word). After subtraction, the dividend is shifted and the
    ;; process repeated. Before beginning, the divisor and quotient are
    ;; 'normalized' so that the process of estimating the quotient digit
    ;; will yield verifiably correct results..
    moveq    #0,d4                        ;all the flags in d4
ddnchk:
    ;; normalize/upshift the divisor to use full 32 bits, adjust dividend with it.
    ;; the number of shifts goes into d4
    ;; note that d7 is at least 0x00010000
    tst.l    d7
    bmi.b    ddnormalized
ddnchk2:
    addq.l    #$1,d4                        ;count in d4
    lsl.l    #$1,d6                        ;# shift u4,u3 with overflow to u2
    roxl.l    #$1,d5                        ;# shift u1,u2 
    lsl.l    #$1,d7                        ;# shift the divisor
    bpl.b    ddnchk2

    swap    d4                        ;keep lo-word free

ddnormalized:
    ;; Now calculate an estimate of the quotient words (msw first, then lsw).
    ;; The comments use subscripts for the first quotient digit determination.
    move.l    d7,d3                        ;# divisor
    move.l    d5,d2                        ;# dividend mslw
    swap    d3
    swap    d2
    move.w    #$ffff,d1                    ;# use max trial quotient word
    cmp.w    d3,d2                        ;# V1 = U1 ?
    beq.b    ddadj0

    move.l    d5,d1
    divu.w    d3,d1                        ;# use quotient of mslw/msw
ddadj0:

    ;; now test the trial quotient and adjust. This step plus the
    ;; normalization assures (according to Knuth) that the trial
    ;; quotient will be at worst 2 too large.
    ;; NOTE: We do not perform step D3 here. This is not required, as
    ;; D4 is sufficient for adjusting a quotient that has been guessed
    ;; "too large". At most, it can be off by two (easy to prove).

    move.l    d7,d2        ;V1V2->d2

    ;; at this stage, d1 is scaled by 1<<16. Evaluate the 32x32 product d1'0xd7->d2(hi),d3(lo)
    ;; d0,d2,d3,d6 are scratches

    move.l    d7,d0        ;V1V2->d0
    swap     d2        ;get hi of d7 = V1
    mulu.w    d1,d0        ;V2*q: scaled by 2^16, must be split in higher/lower pair
    mulu.w    d1,d2        ;V1*q: the upper 32 bit = V1*q, must be scaled by 2^32
    move.l    d0,d3        ;get lo
    clr.w    d0        ;clear lo
    swap    d3        ;part of it
    swap    d0        ;swap: scale by 2^16: This must be added to hi
    clr.w    d3        ;shift up by 16
    add.l    d0,d2        ;add to hi

    sub.l    d3,d6        
    subx.l    d2,d5        ; subtract double precision
    bcc.s    dd2nd        ; no carry, do next quotient digit

    ;; need to add back divisor longword to current ms 3 digits of dividend
    ;; - according to Knuth, this is done only 2 out of 65536 times for random
    ;; divisor, dividend selection.
    ;;
    ;; thor: computations show that this loop is run at most twice.
    ;; this is better than knuth in this specific case because we
    ;; avoid the multiplications in D3 and this step here (D4) is 
    ;; only addition.

    move.l    d7,d3
    move.l    d7,d2        ; add V1V20 back to U1U2U3U4
    swap    d3
    clr.w    d2
    clr.w    d3
    swap    d2        ; d3 now ls word of divisor
ddaddup:
    subq.l    #$1,d1        ; q is one too large
    add.l    d3,d6        ; aligned with 3rd word of dividend
    addx.l    d2,d5
    bcc    ddaddup        ; until we're positive again
dd2nd:
    tst.l    d4
    bmi.s    ddremain

;# first quotient digit now correct. store digit and shift the
;# (subtracted) dividend 
    move.w    d1,d4        ;keep hi-quotient
    swap    d5
    swap    d6
    move.w    d6,d5
    clr.w    d6        ;shift remainder up by 16 bits
    bset    #31,d4        ;second digit
    bra.s    ddnormalized
ddremain:
;# add 2nd word to quotient, get the remainder.
    swap    d1
    move.w    d4,d1        ;get result of previous division
    swap    d1        ;restore order
    swap    d4        ;restore normalization counter

;# shift down one word/digit to renormalize remainder.
    move.w    d5,d6
    swap    d6
    swap    d5
    clr.l    d7
    move.b    d4,d7
    beq.s    ddrn
    ;; shift d5:d6 to the right by d7 bits, d5 is high
    move.l    d5,d0
    lsr.l    d7,d6        ; shift low out
    ror.l    d7,d0        ; move low bits of high to high
    lsr.l    d7,d5        ; shift high down
    eor.l    d5,d0        ; high bits are in d0
    or.l    d0,d6        ; into d6
ddrn:
    move.l    d6,d5        ; remainder
    move.l    d1,d6
    rts
Thomas Richter is offline  
Old 05 December 2020, 16:44   #5
BSzili
Registered User

BSzili's Avatar
 
Join Date: Nov 2011
Location: Hungary
Posts: 334
Thanks again, I'll benchmark this against the output of the compiler that uses libgcc.a.
BSzili is offline  
Old 05 December 2020, 19:22   #6
a/b
Registered User

 
Join Date: Jun 2016
Location: europe
Posts: 306
Both 0 and $8000000 are not affected by a NEG.L, and since they are both passable beforehand this can be optimized:
Code:
;    cmp.l    #$80000000,d6    ;representable as 32-bit negative number?
;    bhi.s    7$        ;overflow
;    neg.l    d6        ;make a 2s complement for quotient
    neg.l    d6        ;make a 2s complement for quotient
    bgt.s    7$        ;overflow
    bra.s    10$
Overflow range [$80000001,$ffffffff] is NEG'd to [$7fffffff,1] (=> BGT).
a/b is offline  
Old 25 January 2021, 21:31   #7
Ernst Blofeld
<optimized out>

Ernst Blofeld's Avatar
 
Join Date: Sep 2020
Location: <optimized out>
Posts: 272
Will that implentation of AlgorithmD work by itself on plain 68k machines? It doesn't rely on the shortcuts in the 020 specific code above?
Ernst Blofeld is offline  
Old 25 January 2021, 22:18   #8
a/b
Registered User

 
Join Date: Jun 2016
Location: europe
Posts: 306
Yeah, it will work on any M68K.
a/b is offline  
 


Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)
 
Thread Tools

Similar Threads
Thread Thread Starter Forum Replies Last Post
Math [divs] LeCaravage Coders. Asm / Hardware 3 23 October 2020 09:39
Discovery: Math Audio Snow request.Old Rare Games 30 20 August 2018 13:17
68060 integer instruction emulation in 2.7.0 jotd support.WinUAE 14 04 January 2014 20:20
Math library? Tiddlypeeps Coders. General 5 08 April 2010 20:45

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

Forum Jump


All times are GMT +2. The time now is 22:06.


Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, vBulletin Solutions Inc.
Page generated in 0.12616 seconds with 13 queries