Why didn't you use "the same" algorithm, something like this (code not tested)?
Code:
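        move.l  d4,d0
        swap    d0              ; d0 = divisor<<16 (assumes d4 fits in 16 bits)
        cmp.l   d0,d6           ; dividend >= (divisor<<16)?
        bhs.b   .32bit          ; yes: quotient won't fit in 16 bits
.16bit  divu.w  d4,d6           ; 32/16: remainder in high word, quotient in low word
        swap    d6              ; bring remainder into the low word
        move.w  d6,(a3)
        bra.b   .done           ; don't fall through into the 32-bit path
.32bit  divul.l d4,d7:d6        ; 32/32: remainder in d7, quotient in d6
        move.w  d7,(a3)
.done                           ; placeholder label, continue here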
Edit: To keep it simple: use the faster (32/16) div whenever possible, *without* the penalty of a failed attempt (a div that overflows is still a slooow div); the extra check pays for itself.
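For anyone who wants to convince themselves the check is right, here is a small C model of the comparison used in the code above. It is only a sketch: the function names are made up for illustration, and it assumes the divisor is nonzero and fits in 16 bits (which the swap trick needs anyway).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the post): nonzero when a 32/16 divide
   cannot overflow, i.e. when the quotient fits in 16 bits. */
static int fits_16bit_quotient(uint32_t dividend, uint16_t divisor)
{
    assert(divisor != 0);
    /* Same comparison as the cmp.l in the asm: dividend < (divisor << 16). */
    return dividend < ((uint32_t)divisor << 16);
}

/* Cross-check: the predicate agrees with actually computing the quotient. */
static int predicate_matches(uint32_t dividend, uint16_t divisor)
{
    uint32_t q = dividend / divisor;
    return fits_16bit_quotient(dividend, divisor) == (q <= 0xFFFFu);
}
```

So with a divisor of 2, a dividend of $0001FFFF still takes the short path (quotient $FFFF), while $00020000 has to take the long one.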