05 December 2020, 07:57 | #1 |
old chunk of coal
Join Date: Nov 2011
Location: Hungary
Posts: 1,300
|
68060 64-bit integer math
I have a few functions I'd like to optimize for the '060:
Code:
static inline int scale(int eax, int edx, int ecx) { return (int32_t)(((int64_t)eax * (int64_t)edx) / (int64_t)ecx); } static inline int mulscale(int eax, int edx, int ecx) { return (int32_t)(((int64_t)eax * (int64_t)edx) >> (int8_t)ecx); } static inline int divscale(int eax, int ebx, int ecx) { return (int32_t)(((int64_t)eax << (int8_t)ecx) / (int64_t)ebx); } Code:
_scale: move.l d1,-(sp) move.l d2,-(sp) muls.l d1,d1:d0 divs.l d2,d1:d0 move.l (sp)+,d2 move.l (sp)+,d1 rts _mulscale: movem.l d1-d3,-(sp) muls.l d1,d1:d0 moveq #32,d3 sub.l d2,d3 lsr.l d2,d0 lsl.l d3,d1 or.l d1,d0 movem.l (sp)+,d1-d3 rts _divscale: move.l d2,-(sp) move.l d3,-(sp) move.l d0,d3 lsl.l d2,d0 neg.b d2 and.b #$1f,d2 asr.l d2,d3 divs.l d1,d3:d0 move.l (sp)+,d3 move.l (sp)+,d2 rts Last edited by BSzili; 05 December 2020 at 08:18. |
05 December 2020, 09:58 | #2 |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,307
|
64 bit math functions are in the utility.library, at least a 32x32->64 multiply. This function is either an already 68060 capable version on 3.1.4, or patched over by the 68060.library (before), so you are safe to use it. There is (at present) no 64/32 divide function in the utility.library. This will change in 3.2. Unfortunately, there is no "trivial" replacement for it in the sense of a ten-liner. Thus, I could either give you a "shorter, easy to understand but slow" version, or a "longer, complex but faster" version. What do you need?
|
05 December 2020, 10:54 | #3 |
old chunk of coal
Join Date: Nov 2011
Location: Hungary
Posts: 1,300
|
Thanks for the tip! I'm fine with the complex but faster one for the division, as these functions are called quite often.
|
05 December 2020, 14:49 | #4 | |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,307
|
Quote:
Well, here you go: Code:
;; signed 64/32 divide, 68020 function SDiv64H: divs.l d2,d1:d0 rts ;; signed 64/32 divide, 68060 function SDiv646: movem.l d2-d7/a2-a5,-(sp) sub.l a2,a2 ;sign flag divisor sub.l a3,a3 ;sign flag dividend move.l d0,a4 move.l d1,a5 ;save original contents move.l d2,d7 ;save divisor bpl.s 1$ ;make positive addq.w #1,a2 ;set flag neg.l d7 1$: move.l d0,d6 move.l d1,d5 ;save dividend bpl.s 2$ addq.w #1,a3 ;set flag neg.l d6 negx.l d5 ;invert 2$: tst.l d5 ;is the high non-zero? If so, full divide bne.s 3$ tst.l d6 ;is low zero? beq.s 10$ ;yes, we are done ;; here low <> 0 cmp.l d6,d7 ;is the divisor <= lo (dividend) bls.s 5$ ;yes, use a 32-bit divide exg.l d5,d6 ;q = 0, r = dividend bra.s 6$ ;can't divide, done 5$: divul.l d7,d5:d6 ;# it's only a 32/32 bit div! bra.b 6$ 3$: ;; full 64 bit case here. do we have an overflow? cmp.l d5,d7 bls.s 7$ ;yes ;; here full 64/32 divide bsr AlgorithmD ;perform classical algorithm D from Knuth ;; remainder in d6, quotient in d5 6$: ;done here, check whether there is a sign switch cmp.w #0,a3 beq.s 8$ neg.l d5 ;remainder has the same sign as dividend 8$: cmp.w a2,a3 ;same signs of divisor and dividend? beq.s 9$ cmp.l #$80000000,d6 ;representable as 32-bit negative number? bhi.s 7$ ;overflow neg.l d6 ;make a 2s complement for quotient bra.s 10$ 9$: ;here positive tst.l d6 ;will fit into 32 bits? bmi.s 7$ ;overflow if not 10$: ;here done move.l d5,d1 ;remainder -> d1 move.l d6,d0 ;quotient -> d0 andi.b #$fc,ccr ;clear V and C bra.s 11$ 7$: move.l a4,d0 ;restore original register content move.l a5,d1 ori.b #$2,ccr ;set overflow bit 11$: movem.l (sp)+,d2-d7/a2-a5 rts AlgorithmD is Knuth's classical division algorithm: Code:
;; division algorithm. This is either a ;; full algorithmD from Knuth "The Art of Programming", Vol.2 ;; or using a 32:16 division in case the divisor fits ;; into 16 bits. ;; This algorithm divides d5:d6 by d7 AlgorithmD: swap d7 tst.w d7 bne.s 1$ ;; ok, we only need to divide by 16 bits, so ;; things are somewhat simpler swap d7 ; restore divisor ;; note that we already know that the division ;; does not overflow (checked upwards) moveq #0,d1 swap d5 ; same as r*b if previous step rqd swap d6 ; get u3 to lsw position move.w d6,d5 ; rb + u3 divu.w d7,d5 move.w d5,d1 ; first quotient word swap d6 ; get u4 move.w d6,d5 ; rb + u4 divu.w d7,d5 swap d1 move.w d5,d1 ; 2nd quotient 'digit' clr.w d5 swap d5 ;now remainder move.l d1,d6 ; and quotient rts 1$: swap d7 ; restore divisor ;; classical algorithm D. ;; In this algorithm, the divisor is treated as a 2 digit (word) number ;; which is divided into a 3 digit (word) dividend to get one quotient ;; digit (word). After subtraction, the dividend is shifted and the ;; process repeated. Before beginning, the divisor and quotient are ;; 'normalized' so that the process of estimating the quotient digit ;; will yield verifiably correct results.. moveq #0,d4 ;all the flags in d4 ddnchk: ;; normalize/upshift the divisor to use full 32 bits, adjust dividend with it. ;; the number of shifts goes into d4 ;; note that d7 is at least 0x00010000 tst.l d7 bmi.b ddnormalized ddnchk2: addq.l #$1,d4 ;count in d4 lsl.l #$1,d6 ;# shift u4,u3 with overflow to u2 roxl.l #$1,d5 ;# shift u1,u2 lsl.l #$1,d7 ;# shift the divisor bpl.b ddnchk2 swap d4 ;keep lo-word free ddnormalized: ;; Now calculate an estimate of the quotient words (msw first, then lsw). ;; The comments use subscripts for the first quotient digit determination. move.l d7,d3 ;# divisor move.l d5,d2 ;# dividend mslw swap d3 swap d2 move.w #$ffff,d1 ;# use max trial quotient word cmp.w d3,d2 ;# V1 = U1 ? beq.b ddadj0 move.l d5,d1 divu.w d3,d1 ;# use quotient of mslw/msw ddadj0: ;; now test the trial quotient and adjust. This step plus the ;; normalization assures (according to Knuth) that the trial ;; quotient will be at worst 2 too large. ;; NOTE: We do not perform step D3 here. This is not required, as ;; D4 is sufficient for adjusting a quotient that has been guessed ;; "too large". At most, it can be off by two (easy to prove). move.l d7,d2 ;V1V2->d2 ;; at this stage, d1 is scaled by 1<<16. Evaluate the 32x32 product d1'0xd7->d2(hi),d3(lo) ;; d0,d2,d3,d6 are scratches move.l d7,d0 ;V1V2->d0 swap d2 ;get hi of d7 = V1 mulu.w d1,d0 ;V2*q: scaled by 2^16, must be split in higher/lower pair mulu.w d1,d2 ;V1*q: the upper 32 bit = V1*q, must be scaled by 2^32 move.l d0,d3 ;get lo clr.w d0 ;clear lo swap d3 ;part of it swap d0 ;swap: scale by 2^16: This must be added to hi clr.w d3 ;shift up by 16 add.l d0,d2 ;add to hi sub.l d3,d6 subx.l d2,d5 ; subtract double precision bcc.s dd2nd ; no carry, do next quotient digit ;; need to add back divisor longword to current ms 3 digits of dividend ;; - according to Knuth, this is done only 2 out of 65536 times for random ;; divisor, dividend selection. ;; ;; thor: computations show that this loop is run at most twice. ;; this is better than knuth in this specific case because we ;; avoid the multiplications in D3 and this step here (D4) is ;; only addition. move.l d7,d3 move.l d7,d2 ; add V1V20 back to U1U2U3U4 swap d3 clr.w d2 clr.w d3 swap d2 ; d3 now ls word of divisor ddaddup: subq.l #$1,d1 ; q is one too large add.l d3,d6 ; aligned with 3rd word of dividend addx.l d2,d5 bcc ddaddup ; until we're positive again dd2nd: tst.l d4 bmi.s ddremain ;# first quotient digit now correct. store digit and shift the ;# (subtracted) dividend move.w d1,d4 ;keep hi-quotient swap d5 swap d6 move.w d6,d5 clr.w d6 ;shift remainder up by 16 bits bset #31,d4 ;second digit bra.s ddnormalized ddremain: ;# add 2nd word to quotient, get the remainder. swap d1 move.w d4,d1 ;get result of previous division swap d1 ;restore order swap d4 ;restore normalization counter ;# shift down one word/digit to renormalize remainder. move.w d5,d6 swap d6 swap d5 clr.l d7 move.b d4,d7 beq.s ddrn ;; shift d5:d6 to the right by d7 bits, d5 is high move.l d5,d0 lsr.l d7,d6 ; shift low out ror.l d7,d0 ; move low bits of high to high lsr.l d7,d5 ; shift high down eor.l d5,d0 ; high bits are in d0 or.l d0,d6 ; into d6 ddrn: move.l d6,d5 ; remainder move.l d1,d6 rts |
|
05 December 2020, 15:44 | #5 |
old chunk of coal
Join Date: Nov 2011
Location: Hungary
Posts: 1,300
|
Thanks again, I'll benchmark this against the output of the compiler that uses libgcc.a.
|
05 December 2020, 18:22 | #6 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,062
|
Both 0 and $8000000 are not affected by a NEG.L, and since they are both passable beforehand this can be optimized:
Code:
; cmp.l #$80000000,d6 ;representable as 32-bit negative number? ; bhi.s 7$ ;overflow ; neg.l d6 ;make a 2s complement for quotient neg.l d6 ;make a 2s complement for quotient bgt.s 7$ ;overflow bra.s 10$ |
25 January 2021, 20:31 | #7 |
<optimized out>
Join Date: Sep 2020
Location: <optimized out>
Posts: 321
|
Will that implentation of AlgorithmD work by itself on plain 68k machines? It doesn't rely on the shortcuts in the 020 specific code above?
|
25 January 2021, 21:18 | #8 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,062
|
Yeah, it will work on any M68K.
|
Currently Active Users Viewing This Thread: 1 (0 members and 1 guests) | |
Thread Tools | |
Similar Threads | ||||
Thread | Thread Starter | Forum | Replies | Last Post |
Math [divs] | LeCaravage | Coders. Asm / Hardware | 3 | 23 October 2020 08:39 |
Discovery: Math | Audio Snow | request.Old Rare Games | 30 | 20 August 2018 12:17 |
68060 integer instruction emulation in 2.7.0 | jotd | support.WinUAE | 14 | 04 January 2014 19:20 |
Math library? | Tiddlypeeps | Coders. General | 5 | 08 April 2010 19:45 |
|
|