English Amiga Board - 68030 multiplication optimisation

English Amiga Board (https://eab.abime.net/index.php)

- Coders. Asm / Hardware (https://eab.abime.net/forumdisplay.php?f=112)

- - 68030 multiplication optimisation (https://eab.abime.net/showthread.php?t=113135)

68030 multiplication optimisation

I have replaced a bunch of fixed point integer division (immediate divisor) with some muls/asr approximations that work well on 68060 thanks to its fast multiplication.

I want to create separately optimised versions for 030, where according to the user manual, the worst case is 28 cycles (not including EA or wait states). It doesn't state what the best case is, however, but it's clear the number of set bits is a factor.

For the 030, the plan would be to perform series of shift and accumulate steps. Since we are multiplying by an immediate, this could be quite efficient as you'd only do what is strictly necessary.

Except, without knowing what the best case is for the multiplication operation, a multi instruction approach like this could be worse.

Does anyone know the "best" case? Assume that you are multiplying by a number that has at least 2 bits set or we'd just use a shift instead.

I'd assume 12+16 (and 32-bit muls are extra 16 = 44).
If you have access to actual 020/030 hardware you could try 10000x mulu.w #0, then #1, #3, .. and see if it approximately works that way. If it does, then muls most likely work the same way as on 68000 (01/10 pairs).

A trivial example might be mulu.w #320, d0

I can see this being reworked something like:

lsl.l #6, d0 ; d0 * 64
move.l d0, d1 ; temporary in d1
lsl.l #2, d0 ; d0 * 256
add.l d1, d0 ; d0 * (256 + 64)

But is it going to be any quicker?

Aren't 030 and 020 cycle times quite similar (except for minor differences in cache speed)? In that case best case given by 020UM is 25 cycles (maybe 030 can do 24?, but either way negligible difference from worst).

Your snippet should be faster just counting cycles, but the larger code size may negate the benefit (more instruction cache used). Counting cycles is a good guide, but you really want to measure the full loop on more advanced CPUs.

Quote:

Originally Posted by paraj (Post 1587876)

I'm not explicitly targeting 020 separately. It looks like 020/030, 040 benefit from this sort of thing. 16 cycles for mulu.w on 040 (not including EA).

Quote:

Originally Posted by Karlos (Post 1587879)

I'm not explicitly targeting 020 separately. It looks like 020/030, 040 and 060 benefit from this sort of thing. 16 cycles for mulu.w on 040 (not including EA).

Multiplication only needs 2 cycles on 68060.

Quote:

Originally Posted by StingRay (Post 1587901)

Multiplication only needs 2 cycles on 68060.

Yes, which is why I've used it to replace division (muls followed by asr). However the multiply is still expensive on older CPUs.

You explicitly mentioned that 060 would benefit from such optimisation. That's not true and as I didn't know if you know :) I mentioned it.

Quote:

Originally Posted by StingRay (Post 1587915)

You explicitly mentioned that 060 would benefit from such optimisation. That's not true and as I didn't know if you know :) I mentioned it.

No, it's probably my bad wording. The current implementation of many integer division has been redone for 060 using multiply/shift right. This works well because division is up to 18 cycles but the multiply shift is just a few.

The same optimisation does help on 040 and below, where integer division is even slower, but the multiply operation on 020-040 is nowhere near as efficient as it is on 060. So I wanted to create something similar for those where the multiply step is replaced by left shift/add. Ultimately having separate binaries tuned for 020/030, 040 and 060. Hope that's clearer.

This will only be faster where we can beat the mul operation. For 040 16*16=>32 is stated as 16 cycles (execution only), and I don't know for sure if that's just the most pessimistic case. What I don't want to do is make something slower, lol.