68030 multiplication optimisation

Karlos · 08 January 2023, 16:32

I have replaced a bunch of fixed point integer division (immediate divisor) with some muls/asr approximations that work well on 68060 thanks to its fast multiplication.

I want to create separately optimised versions for 030, where according to the user manual, the worst case is 28 cycles (not including EA or wait states). It doesn't state what the best case is, however, but it's clear the number of set bits is a factor.

For the 030, the plan would be to perform series of shift and accumulate steps. Since we are multiplying by an immediate, this could be quite efficient as you'd only do what is strictly necessary.

Except, without knowing what the best case is for the multiplication operation, a multi instruction approach like this could be worse.

Does anyone know the "best" case? Assume that you are multiplying by a number that has at least 2 bits set or we'd just use a shift instead.

a/b · 08 January 2023, 17:47

I'd assume 12+16 (and 32-bit muls are extra 16 = 44).
If you have access to actual 020/030 hardware you could try 10000x mulu.w #0, then #1, #3, .. and see if it approximately works that way. If it does, then muls most likely work the same way as on 68000 (01/10 pairs).

Karlos · 08 January 2023, 18:28

A trivial example might be mulu.w #320, d0

I can see this being reworked something like:

lsl.l #6, d0 ; d0 * 64
move.l d0, d1 ; temporary in d1
lsl.l #2, d0 ; d0 * 256
add.l d1, d0 ; d0 * (256 + 64)

But is it going to be any quicker?

paraj · 08 January 2023, 18:50

Aren't 030 and 020 cycle times quite similar (except for minor differences in cache speed)? In that case best case given by 020UM is 25 cycles (maybe 030 can do 24?, but either way negligible difference from worst).

Your snippet should be faster just counting cycles, but the larger code size may negate the benefit (more instruction cache used). Counting cycles is a good guide, but you really want to measure the full loop on more advanced CPUs.

Karlos · 08 January 2023, 18:59

Quote:

Originally Posted by paraj

Aren't 030 and 020 cycle times quite similar (except for minor differences in cache speed)? In that case best case given by 020UM is 25 cycles (maybe 030 can do 24?, but either way negligible difference from worst).

Your snippet should be faster just counting cycles, but the larger code size may negate the benefit (more instruction cache used). Counting cycles is a good guide, but you really want to measure the full loop on more advanced CPUs.

I'm not explicitly targeting 020 separately. It looks like 020/030, 040 benefit from this sort of thing. 16 cycles for mulu.w on 040 (not including EA).

StingRay · 08 January 2023, 22:48

Quote:

Originally Posted by Karlos

I'm not explicitly targeting 020 separately. It looks like 020/030, 040 and 060 benefit from this sort of thing. 16 cycles for mulu.w on 040 (not including EA).

Multiplication only needs 2 cycles on 68060.

Karlos · 08 January 2023, 23:39

Quote:

Originally Posted by StingRay

Multiplication only needs 2 cycles on 68060.

Yes, which is why I've used it to replace division (muls followed by asr). However the multiply is still expensive on older CPUs.

StingRay · 09 January 2023, 00:11

You explicitly mentioned that 060 would benefit from such optimisation. That's not true and as I didn't know if you know

I mentioned it.

Karlos · 09 January 2023, 12:03

Quote:

Originally Posted by StingRay

You explicitly mentioned that 060 would benefit from such optimisation. That's not true and as I didn't know if you know

I mentioned it.

No, it's probably my bad wording. The current implementation of many integer division has been redone for 060 using multiply/shift right. This works well because division is up to 18 cycles but the multiply shift is just a few.

The same optimisation does help on 040 and below, where integer division is even slower, but the multiply operation on 020-040 is nowhere near as efficient as it is on 060. So I wanted to create something similar for those where the multiply step is replaced by left shift/add. Ultimately having separate binaries tuned for 020/030, 040 and 060. Hope that's clearer.

This will only be faster where we can beat the mul operation. For 040 16*16=>32 is stated as 16 cycles (execution only), and I don't know for sure if that's just the most pessimistic case. What I don't want to do is make something slower, lol.

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
Informed optimisation	Ernst Blofeld	Coders. General	88	09 February 2021 19:47
vbcc 0.9e signed multiplication issue?	dalton	Coders. C/C++	3	09 January 2018 08:47
68000 optimisation	Galahad/FLT	Coders. Asm / Hardware	9	20 August 2016 00:29
Picasso IV optimisation	Tony Landais	support.Hardware	10	01 September 2006 19:54

08 January 2023, 16:32	#1
Karlos Alien Bleed Join Date: Aug 2022 Location: UK Posts: 4,180	68030 multiplication optimisation I have replaced a bunch of fixed point integer division (immediate divisor) with some muls/asr approximations that work well on 68060 thanks to its fast multiplication. I want to create separately optimised versions for 030, where according to the user manual, the worst case is 28 cycles (not including EA or wait states). It doesn't state what the best case is, however, but it's clear the number of set bits is a factor. For the 030, the plan would be to perform series of shift and accumulate steps. Since we are multiplying by an immediate, this could be quite efficient as you'd only do what is strictly necessary. Except, without knowing what the best case is for the multiplication operation, a multi instruction approach like this could be worse. Does anyone know the "best" case? Assume that you are multiplying by a number that has at least 2 bits set or we'd just use a shift instead. Last edited by Karlos; 08 January 2023 at 18:20.

08 January 2023, 17:47	#2
a/b Registered User Join Date: Jun 2016 Location: europe Posts: 1,039	I'd assume 12+16 (and 32-bit muls are extra 16 = 44). If you have access to actual 020/030 hardware you could try 10000x mulu.w #0, then #1, #3, .. and see if it approximately works that way. If it does, then muls most likely work the same way as on 68000 (01/10 pairs).

08 January 2023, 18:28	#3
Karlos Alien Bleed Join Date: Aug 2022 Location: UK Posts: 4,180	A trivial example might be mulu.w #320, d0 I can see this being reworked something like: lsl.l #6, d0 ; d0 * 64 move.l d0, d1 ; temporary in d1 lsl.l #2, d0 ; d0 * 256 add.l d1, d0 ; d0 * (256 + 64) But is it going to be any quicker?

08 January 2023, 18:50	#4
paraj Registered User Join Date: Feb 2017 Location: Denmark Posts: 1,113	Aren't 030 and 020 cycle times quite similar (except for minor differences in cache speed)? In that case best case given by 020UM is 25 cycles (maybe 030 can do 24?, but either way negligible difference from worst). Your snippet should be faster just counting cycles, but the larger code size may negate the benefit (more instruction cache used). Counting cycles is a good guide, but you really want to measure the full loop on more advanced CPUs.

09 January 2023, 00:11	#8
StingRay move.l #$c0ff33,throat Join Date: Dec 2005 Location: Berlin/Joymoney Posts: 6,863	You explicitly mentioned that 060 would benefit from such optimisation. That's not true and as I didn't know if you know I mentioned it.

Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)