Quote:
Originally Posted by a/b
Why didn't you use "the same" algorithm, something like (code not tested)?
Code:
move.l d4,d0
swap d0
cmp.l d0,d6 ; dividend >= (divisor<<16)?
bhs.b .32bit
.16bit divu.w d4,d6
swap d6
move.w d6,(a3)
.32bit divul.l d4,d7:d6
move.w d7,(a3)
edit: To keep it simple: do a faster (32/16) div whenever possible *without* penalty of failing (it's still a slooow div), extra check is compensated for.
Thank you! However, it is not that easy, because afterwards d6 and d7 must hold the quotient and the remainder as full 32-bit values. So we actually need a MOVE, CLR, SWAP sequence before .32bit. My code, which is equivalent to the 386 code, is as follows:
Code:
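moveq.l #0,d7        ; d7 high word must be 0 for the 16-bit path
swap d6              ; bring the dividend's high word down
cmp.w d4,d6          ; high word < divisor?
bcs .div32no         ; yes: the quotient fits in 16 bits
swap d6              ; restore the dividend
divul.l d4,d7:d6     ; 32/32: d6 = quotient, d7 = remainder
move.w d7,(a3)       ; store the remainder
bra .div32f
.div32no
swap d6              ; restore the dividend
divu.w d4,d6         ; 32/16: d6 = remainder:quotient
swap d6              ; d6 = quotient:remainder
move.w d6,d7         ; d7 = remainder, zero-extended
clr.w d6
swap d6              ; d6 = quotient, zero-extended
move.w d7,(a3)       ; store the remainder
.div32f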
This makes 2 extra SWAPs.
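For reference, here is a rough C model of what the cmp.w test decides. This is only my sketch of the idea, not the real routine: the names divres and div_dispatch are made up for illustration, and it assumes the divisor is nonzero and fits in 16 bits. The 32/16 divide is safe exactly when the high word of the dividend is below the divisor, because then the quotient cannot exceed 0xFFFF.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: quot plays the role of d6 and rem the role of d7,
   as DIVUL.L d4,d7:d6 leaves them. */
typedef struct {
    uint32_t quot;
    uint32_t rem;
} divres;

/* Assumption: 0 < divisor <= 0xFFFF, the same assumption cmp.w makes. */
static divres div_dispatch(uint32_t dividend, uint16_t divisor, int *used16)
{
    divres r;
    assert(divisor != 0);
    if ((uint16_t)(dividend >> 16) < divisor) {
        /* 32/16 case: both results fit in 16 bits and are zero-extended */
        r.quot = (uint16_t)(dividend / divisor);
        r.rem  = (uint16_t)(dividend % divisor);
        *used16 = 1;
    } else {
        /* the quotient needs more than 16 bits: full 32/32 divide */
        r.quot = dividend / divisor;
        r.rem  = dividend % divisor;
        *used16 = 0;
    }
    return r;
}
```

Both branches return the same values in the same fields, which is the property the assembly has to preserve for d6 and d7.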
On the 80386 this optimization saves only 2 or 3 cycles; on the 486 it saves 4 or 5. So it is really quite complicated. It is sad that neither the 80386 nor the 68020 is cycle-exact in popular emulators. Moreover, the emulators are especially inaccurate for DIVU.L and DIVUL.