English Amiga Board - View Single Post - Optimizing the 68020+ 32-bit math

a/b · 29 April 2021, 16:00

OK, I didn't take that into account, keeping data in d6/d7.

Here are my suggestions:
1. maybe invert the bcs condition: if the 16-bit case is executed a lot more frequently, it should be the branch-not-taken (fall-through) path, *if* you can adjust your code to avoid a bra at the end
2. if you have a spare register, use a move+swap+cmp sequence instead of swap+cmp+2*swap: it's the same speed but 2 bytes shorter, so potentially very slightly faster because you can squeeze 2 more bytes into the icache (which is not that large on 020/030)
3. moveq #0,d7 should be moved to after .div32no (only the 16-bit case needs it), because the 32-bit div will set all 32 bits anyway, or...
4. moveq #0,d7 should be executed only once before the loop (code implies that the remainder is always 16-bit, and setting d7 bits 16-31 only once will suffice)
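
To make the reasoning behind points 3 and 4 concrete, here is a minimal C model of the two divide paths. It is only a sketch under assumptions: it assumes the check in front of the bcs compares the high word of the dividend against a 16-bit divisor, and the function names (fits_divu_w, divu_w, div_dispatch) are hypothetical, not taken from the code in the thread.

```c
#include <stdint.h>

/* Assumed meaning of the check before the bcs: the 16-bit DIVU.W path is
   safe when the high word of the dividend is below the divisor, because
   the quotient then fits in 16 bits. */
static int fits_divu_w(uint32_t dividend, uint16_t divisor)
{
    return (uint16_t)(dividend >> 16) < divisor;
}

/* Model of DIVU.W: the result holds the remainder in bits 16-31 and the
   quotient in bits 0-15. The caller must ensure fits_divu_w() holds. */
static uint32_t divu_w(uint32_t dividend, uint16_t divisor)
{
    uint32_t q = dividend / divisor;
    uint32_t r = dividend % divisor;
    return (r << 16) | (q & 0xFFFFu);
}

/* Divide with dispatch between the two paths. The divisor must be
   non-zero (a real 68k would take a divide-by-zero trap). */
static uint32_t div_dispatch(uint32_t dividend, uint16_t divisor, uint32_t *rem)
{
    if (fits_divu_w(dividend, divisor)) {
        uint32_t packed = divu_w(dividend, divisor);
        *rem = packed >> 16;       /* only 16 bits arrive here, so the upper
                                      half of the target must already be 0 */
        return packed & 0xFFFFu;
    }
    *rem = dividend % divisor;     /* 32-bit divide writes all 32 bits */
    return dividend / divisor;
}
```

In both paths the remainder is smaller than the 16-bit divisor, so its upper 16 bits are always zero. In this model that is why clearing the upper half of the remainder register once, before the loop, would be enough.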

Last edited by a/b; 29 April 2021 at 16:10.