08 January 2023, 16:32 | #1 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,180
|
68030 multiplication optimisation
I have replaced a bunch of fixed-point integer divisions (immediate divisor) with muls/asr approximations that work well on the 68060 thanks to its fast multiplication.
I want to create separately optimised versions for the 030, where according to the user manual the worst case for a multiply is 28 cycles (not including EA or wait states). It doesn't state what the best case is, but it's clear the number of set bits is a factor. For the 030, the plan would be to perform a series of shift and accumulate steps. Since we are multiplying by an immediate, this could be quite efficient, as you'd only do what is strictly necessary. However, without knowing the best case for the multiply instruction, a multi-instruction approach like this could turn out worse. Does anyone know the "best" case? Assume the multiplier has at least 2 bits set, or we'd just use a shift instead.
Last edited by Karlos; 08 January 2023 at 18:20. |
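The shift and accumulate plan can be modelled outside the target CPU to see how many terms a given immediate needs. The following is a minimal sketch in Python, used here only as a convenient model; the function names are hypothetical and nothing in it reflects real 030 timings. Each set bit in the multiplier contributes one shifted copy of the value, so the number of set bits is the number of terms to accumulate.

```python
def set_bit_positions(multiplier):
    """Return the positions of the set bits in a non-negative multiplier, lowest first."""
    positions = []
    bit = 0
    while multiplier:
        if multiplier & 1:
            positions.append(bit)
        multiplier >>= 1
        bit += 1
    return positions


def shift_accumulate(value, multiplier):
    """Multiply by summing one shifted copy of value per set bit of the multiplier."""
    total = 0
    for position in set_bit_positions(multiplier):
        total += value << position
    return total


# 320 = 256 + 64, so only two shifted terms are needed.
print(set_bit_positions(320))
print(shift_accumulate(7, 320))
```

A multiplier with many set bits needs correspondingly many shift/add steps, which is where a single multiply instruction is more likely to win.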
08 January 2023, 17:47 | #2 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,039
|
I'd assume 12+16 (and 32-bit muls are extra 16 = 44).
If you have access to actual 020/030 hardware you could time 10000x mulu.w #0, then #1, #3, ... and see whether the timing scales approximately that way. If it does, then muls most likely works the same way as on the 68000 (counting 01/10 pairs). |
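For reference, the 68000 behaviour mentioned above follows the formula given in the 68000 user manual: 38 + 2n clocks (excluding effective address time), where n is the number of ones in the source for mulu, and the number of 01/10 pairs in the source with a zero appended as LSB for muls. The sketch below models only that 68000 formula in Python; the function names are hypothetical, and whether the 020/030 follow a similar pattern is an open assumption that the suggested benchmark would have to confirm.

```python
def mulu_68000_cycles(source):
    """68000 MULU.W: 38 + 2n clocks, n = number of ones in the 16-bit source."""
    source &= 0xFFFF
    return 38 + 2 * bin(source).count("1")


def muls_68000_cycles(source):
    """68000 MULS.W: 38 + 2n clocks, n = number of 01/10 pairs in the
    17-bit pattern formed by appending a zero below the source's LSB."""
    extended = (source & 0xFFFF) << 1            # 17-bit pattern, bit 0 is the appended zero
    transitions = extended ^ (extended >> 1)     # bit i set where bits i and i+1 differ
    return 38 + 2 * bin(transitions & 0xFFFF).count("1")  # 16 adjacent pairs


for multiplier in (0, 1, 3, 0x5555, 0xFFFF):
    print(hex(multiplier), mulu_68000_cycles(multiplier), muls_68000_cycles(multiplier))
```

If measured 020/030 times for #0, #1, #3 and so on rise in steps like the mulu column, that would suggest a similar data-dependent scheme.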
08 January 2023, 18:28 | #3 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,180
|
A trivial example might be mulu.w #320, d0
I can see this being reworked something like:
    lsl.l #6, d0 ; d0 * 64
    move.l d0, d1 ; temporary in d1
    lsl.l #2, d0 ; d0 * 256
    add.l d1, d0 ; d0 * (256 + 64)
But is it going to be any quicker? |
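One detail worth checking before swapping the sequences: mulu.w reads only the low word of d0, whereas the lsl.l/add.l version operates on all 32 bits. A minimal Python model of both (hypothetical function names, registers modelled as 32-bit unsigned values) suggests they agree whenever the upper word of d0 is clear, and differ otherwise.

```python
MASK32 = 0xFFFFFFFF


def mulu_w_320(d0):
    """Model of mulu.w #320, d0: unsigned 16x16 -> 32, low word of d0 only."""
    return ((d0 & 0xFFFF) * 320) & MASK32


def shift_add_320(d0):
    """Model of the lsl/move/lsl/add sequence on 32-bit registers."""
    d0 = (d0 << 6) & MASK32    # lsl.l #6, d0   ; d0 * 64
    d1 = d0                    # move.l d0, d1  ; temporary in d1
    d0 = (d0 << 2) & MASK32    # lsl.l #2, d0   ; d0 * 256
    d0 = (d0 + d1) & MASK32    # add.l d1, d0   ; d0 * (256 + 64)
    return d0


# Identical for every value that fits in the low word.
print(all(mulu_w_320(x) == shift_add_320(x) for x in range(0x10000)))
# Differs once the upper word of d0 holds anything.
print(mulu_w_320(0x00010001), shift_add_320(0x00010001))
```

So the replacement is only a drop-in if the upper word of d0 is known to be zero at that point; otherwise clearing it first adds to the instruction count being weighed against the multiply.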
08 January 2023, 18:50 | #4 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,113
|
Aren't 030 and 020 cycle times quite similar (except for minor differences in cache speed)? In that case the best case given by the 020UM is 25 cycles (maybe the 030 can do 24? Either way, a negligible difference from the worst case).
Your snippet should be faster if you just count cycles, but the larger code size may negate the benefit (more instruction cache used). Counting cycles is a good guide, but on more advanced CPUs you really want to measure the full loop. |
08 January 2023, 18:59 | #5 | |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,180
|
Quote:
Last edited by Karlos; 09 January 2023 at 12:18. |
|
08 January 2023, 22:48 | #6 |
move.l #$c0ff33,throat
Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 6,863
|
|
08 January 2023, 23:39 | #7 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,180
|
|
09 January 2023, 00:11 | #8 |
move.l #$c0ff33,throat
Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 6,863
|
You explicitly mentioned that the 060 would benefit from such an optimisation. That's not true, and since I wasn't sure whether you knew, I mentioned it.
|
09 January 2023, 12:03 | #9 | |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,180
|
Quote:
The same optimisation does help on 040 and below, where integer division is even slower, but the multiply operation on 020-040 is nowhere near as efficient as it is on 060. So I wanted to create something similar for those, where the multiply step is replaced by left shift/add. The end goal is separate binaries tuned for 020/030, 040 and 060. Hope that's clearer.
This will only be faster where we can beat the mul operation. For the 040, 16*16=>32 is stated as 16 cycles (execution only), and I don't know for sure whether that's just the most pessimistic case. What I don't want to do is make something slower, lol.
Last edited by Karlos; 09 January 2023 at 12:17. |
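The thread does not show the actual constants or rounding used for the muls/asr division approximation, so the following is only a sketch under stated assumptions: a 16-bit shift, a multiplier of ceil(2^16 / divisor), and non-negative inputs. The Python function names are hypothetical. It models multiply-then-arithmetic-shift and searches for the first input where the approximation departs from true integer division.

```python
def reciprocal_multiplier(divisor, shift=16):
    """Return ceil(2**shift / divisor), the scaled reciprocal used as the immediate."""
    return -((-(1 << shift)) // divisor)


def approx_divide(value, multiplier, shift=16):
    """Model of multiply followed by arithmetic shift right (Python's >> is arithmetic)."""
    return (value * multiplier) >> shift


def first_mismatch(divisor, limit, shift=16):
    """Return the smallest non-negative value below limit where the
    approximation differs from integer division, or None if there is none."""
    multiplier = reciprocal_multiplier(divisor, shift)
    for value in range(limit):
        if approx_divide(value, multiplier, shift) != value // divisor:
            return value
    return None


print(reciprocal_multiplier(10))
print(approx_divide(1000, reciprocal_multiplier(10)))
print(first_mismatch(10, 32768))
```

A search like this gives the input range over which a chosen multiplier and shift are exact. Negative inputs need separate care, because an arithmetic shift rounds towards negative infinity whereas a signed divide truncates towards zero.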
|