I'm more into 000/040...
Anyway, three minor things after taking a quick look at the code and 020 tables:
- lea (ham8.colorTable.w,pc,d6.w*8),a3 is out of 8-bit range
- is and.b #$fc,dx as fast as and.b d6,dx? if so, the moveq #-4,d6 is not needed
- would (-6,a3)/(-4,a3)/(-2,a3) be faster than subq.l #6,a3 followed by 3x (a3)+? (postinc is as fast as indirect with displacement)
Branching looks OK to me (G's weight is 3 so it makes sense to assume branch not taken when comparing d4 with d3/d5).
EDIT:
So much for taking a nap, now I can't shut my brain off..
Code:
; move.w #512-1,-(sp)
...
; move.w #640-1,d7
move.l #((512-1)<<16)+(640-1),d7 ; y count in high word, x count in low word
dbf d7,.loopx
..
; subq.w #1,(sp)
sub.l #(2<<16)-640,d7 ; low word is $ffff after dbf: reload x count and decrement y
bge .loopy
; addq.l #2,sp
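
To sanity check the constant in the sub.l, here is a small C model of the packed counter. It is only a sketch of the register arithmetic, not the real loop: the helper names dbf and count_iterations are made up for illustration, and it assumes the y count fits in 15 bits so that bge (modelled here as a plain sign test, since no overflow can occur) sees a non-negative long until the last row is done.

```c
#include <assert.h>
#include <stdint.h>

/* Model of "dbf d7,label": decrement only the low word of d7.
   Returns 1 if the branch is taken (low word did not wrap to $ffff). */
static int dbf(uint32_t *d7)
{
    uint16_t lo = (uint16_t)(*d7 - 1);      /* word-sized decrement */
    *d7 = (*d7 & 0xFFFF0000u) | lo;         /* high word is untouched */
    return lo != 0xFFFFu;
}

/* Count inner-loop passes for a rows x cols packed counter
   (rows <= 32768, cols <= 65536 assumed). */
static uint32_t count_iterations(uint32_t rows, uint32_t cols)
{
    /* move.l #((rows-1)<<16)+(cols-1),d7 */
    uint32_t d7 = ((rows - 1) << 16) + (cols - 1);
    uint32_t n = 0;

    do {                                    /* .loopy */
        do {                                /* .loopx */
            n++;                            /* stands in for the pixel work */
        } while (dbf(&d7));

        /* After dbf falls through, the low word is always $ffff. */
        assert((d7 & 0xFFFFu) == 0xFFFFu);

        /* sub.l #(2<<16)-cols,d7
           y:$ffff - ($20000 - cols) = (y-1):(cols-1) */
        d7 -= (2u << 16) - cols;
    } while ((int32_t)d7 >= 0);             /* bge .loopy */

    return n;
}
```

Calling count_iterations(512, 640) walks the same counter values as the asm above: the subtraction turns y:$ffff into (y-1):639 in one instruction, and the result only goes negative once y was already 0.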