Old 29 April 2021, 16:00   #10
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,043
OK, I didn't take that into account, keeping data in d6/d7.

Here are my suggestions:
1. Maybe invert the bcs condition: if the 16-bit case is executed much more often, it should be the branch-not-taken path, *if* you can rearrange your code to avoid a bra at the end.
2. If you have a spare register, use a move+swap+cmp sequence instead of swap+cmp+2*swap. It's the same speed but 2 bytes shorter, so potentially very slightly faster because you can squeeze 2 more bytes into the icache (which isn't that large on 020/030).
3. moveq #0,d7 should be moved to after .div32no (only the 16-bit case needs it), because the 32-bit div will set all 32 bits anyway, or...
4. moveq #0,d7 should be executed only once, before the loop (the code implies that the remainder is always 16-bit, so clearing d7 bits 16-31 once will suffice).
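
To illustrate point 2, here is a rough sketch of the two shapes. It is not your actual code: I'm assuming the 32-bit value is in d6, the 16-bit value it is compared against is in d5, and d0 is the spare register. All three register choices and the labels test_a, test_b, .done and .div32no_b are made up for the example.

```asm
; Variant A: no spare register. swap changes the condition codes, so d6
; can only be restored after the branch, which costs one swap per path.
test_a:
        swap    d6              ; high word of d6 -> low word
        cmp.w   d5,d6           ; flags from (high word of d6) - d5.w
        bcs.s   .div32no        ; taken if high word < d5.w (unsigned)
        swap    d6              ; restore d6 on the fall-through path
        ; ... 32-bit case goes here ...
        bra.s   .done
.div32no:
        swap    d6              ; restore d6 on the taken path
        ; ... 16-bit case goes here ...
.done:
        rts

; Variant B: spare register d0 (assumed free). d6 is never modified,
; so no restoring swap is needed on either path.
test_b:
        move.l  d6,d0           ; copy, d6 stays intact
        swap    d0              ; high word of the copy -> low word
        cmp.w   d5,d0           ; same comparison as in variant A
        bcs.s   .div32no_b
        ; ... 32-bit case goes here ...
        rts
.div32no_b:
        ; ... 16-bit case goes here ...
        rts
```

Either way three instructions are executed before the branch outcome is settled (swap+cmp+swap versus move+swap+cmp), but variant A has to carry four one-word instructions in the code (8 bytes) while variant B carries three (6 bytes).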

Last edited by a/b; 29 April 2021 at 16:10.