So moving the D2 change code did make a speed improvement!
Replacing movem with move+ext, however, turned out to be no faster (but also no slower). I didn't try it for the 16-bit mixer, as that would require 2x move and 2x ext, which I assume would be slower.
I released a new version (with source) containing the optimizations. If I had to guess, it should be around 1-4% faster.
https://16-bits.org/etc/xmaplay060_v041.zip
I spent a lot of time shuffling code around, and this is the fastest I could get it (even if it may look like more stuff should go in between the D2 change and the D2 use).