English Amiga Board


08 January 2023, 16:32   #1
Karlos
68030 multiplication optimisation

I have replaced a bunch of fixed-point integer divisions (by immediate divisors) with muls/asr approximations, which work well on the 68060 thanks to its fast multiplication.
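
For readers unfamiliar with the trick, the muls/asr replacement can be modelled in portable C. This is only a sketch under stated assumptions, not the code the post refers to: the divisor 10, the multiplier 6554 (2^16 / 10 rounded up), the shift of 16 and all function names are chosen here purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Arithmetic shift right written portably; models what 68k asr.l does
   (rounds towards minus infinity for negative values). */
static int32_t asr32(int32_t v, unsigned n)
{
    assert(n < 32);
    return v < 0 ? ~((~v) >> n) : v >> n;
}

/* Models "muls.w #6554,d0" followed by an arithmetic shift right by 16.
   6554 = ceil(65536 / 10); the divisor and constants are assumptions. */
static int32_t div10_muls_asr(int16_t x)
{
    return asr32((int32_t)x * 6554, 16);
}

/* Counts inputs in [0, limit) where the approximation differs from x / 10. */
static int div10_mismatches(int limit)
{
    int bad = 0;
    assert(limit >= 0 && limit <= 32768);
    for (int x = 0; x < limit; ++x)
        if (div10_muls_asr((int16_t)x) != x / 10)
            ++bad;
    return bad;
}
```

With these particular constants the result matches x / 10 for 0 <= x < 16384 and first goes wrong at x = 16389. For negative x the shift rounds towards minus infinity where a true division truncates towards zero, so div10_muls_asr(-1) gives -1 rather than 0, which is why "approximation" is the right word.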

I want to create separately optimised versions for the 030 where, according to the user manual, the worst case for a multiply is 28 cycles (not including EA or wait states). The manual doesn't state the best case, but it's clear the number of set bits is a factor.

For the 030, the plan would be to perform a series of shift and accumulate steps. Since we are multiplying by an immediate, this could be quite efficient, as you'd only do what is strictly necessary.

However, without knowing the best case for the multiplication operation, a multi-instruction approach like this could turn out slower.

Does anyone know the "best" case? Assume that you are multiplying by a number that has at least 2 bits set, or we'd just use a shift instead.
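
The shift and accumulate plan can be sketched in C to see how much work a given immediate implies. This is a minimal model, with hypothetical function names, of the plain "one shifted term per set bit" decomposition. It makes no attempt at cleverer factorings and says nothing about cycle counts.

```c
#include <assert.h>
#include <stdint.h>

/* Number of set bits in k = number of shifted terms in the decomposition. */
static unsigned shift_add_terms(uint32_t k)
{
    unsigned n = 0;
    for (unsigned bit = 0; bit < 32; ++bit)
        if (k & (UINT32_C(1) << bit))
            ++n;
    return n;
}

/* Adds needed to combine those terms; a zero constant has no terms at all. */
static unsigned adds_needed(uint32_t k)
{
    assert(k != 0);
    return shift_add_terms(k) - 1;
}

/* Multiplies x by k using only shifts and adds, one term per set bit.
   Arithmetic is modulo 2^32, as in a 32-bit data register. */
static uint32_t mul_shift_add(uint32_t x, uint32_t k)
{
    uint32_t acc = 0;
    for (unsigned bit = 0; bit < 32; ++bit)
        if (k & (UINT32_C(1) << bit))
            acc += x << bit;
    return acc;
}
```

A constant with a single set bit needs no adds, which is the "just use a shift" case; each further set bit costs one more shifted term and one more add.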

Last edited by Karlos; 08 January 2023 at 18:20.

08 January 2023, 17:47   #2
a/b
I'd assume 12+16, i.e. a best case of 12 with up to 16 more depending on the multiplier (and 32-bit muls are an extra 16, for 44).
If you have access to actual 020/030 hardware you could time 10000 repetitions of mulu.w #0, then #1, #3, ... and see if it approximately works that way. If it does, then muls most likely works the same way as on the 68000 (01/10 pairs).
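
For reference, the 68000 rule mentioned above can be written down in C. This sketch models only the documented 68000 timing (38 + 2n clocks, excluding effective address time). Whether the 020/030 follow a similar pattern is exactly what the suggested measurement would have to show. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static unsigned popcount16(uint16_t v)
{
    unsigned n = 0;
    while (v) {
        n += v & 1u;
        v >>= 1;
    }
    return n;
}

/* 68000 MULU.W: 38 + 2n clocks, n = number of 1 bits in the source. */
static unsigned mulu_68000_clocks(uint16_t src)
{
    return 38 + 2 * popcount16(src);
}

/* 68000 MULS.W: 38 + 2n clocks, n = number of 01 or 10 pairs in the
   17-bit value formed by appending a 0 below bit 0 of the source. */
static unsigned muls_68000_clocks(uint16_t src)
{
    uint32_t v = (uint32_t)src << 1;
    uint16_t pairs = (uint16_t)((v ^ (v >> 1)) & 0xFFFFu);
    unsigned clocks = 38 + 2 * popcount16(pairs);
    assert(clocks <= 70); /* documented 68000 worst case */
    return clocks;
}
```

Under this model the multipliers #0, #1 and #3 give 38, 40 and 42 clocks for mulu.w, so a measured run that steps up evenly in the same way would support the assumption.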

08 January 2023, 18:28   #3
Karlos
A trivial example might be mulu.w #320, d0

I can see this being reworked something like:

lsl.l #6, d0 ; d0 * 64
move.l d0, d1 ; temporary in d1
lsl.l #2, d0 ; d0 * 256
add.l d1, d0 ; d0 * (256 + 64)

But is it going to be any quicker?
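
Before timing anything, it is worth checking that the rework computes the same thing. The C model below mirrors the four instructions above on 32-bit values and compares them with a model of mulu.w, which uses only the low word of d0. The function names are hypothetical and the sketch says nothing about speed.

```c
#include <assert.h>
#include <stdint.h>

/* Models mulu.w #320,d0: only the low word of d0 is used as the operand. */
static uint32_t mulu_w_320(uint32_t d0)
{
    return (uint32_t)(uint16_t)d0 * 320u;
}

/* Models the four-instruction sequence, which works on all 32 bits of d0. */
static uint32_t shift_add_320(uint32_t d0)
{
    uint32_t d1;
    d0 <<= 6;   /* lsl.l #6,d0   -> x * 64         */
    d1 = d0;    /* move.l d0,d1  -> keep x * 64    */
    d0 <<= 2;   /* lsl.l #2,d0   -> x * 256        */
    d0 += d1;   /* add.l d1,d0   -> x * (256 + 64) */
    return d0;
}

/* Counts inputs in [0, limit) where the two models disagree. */
static unsigned mismatches_below(uint32_t limit)
{
    unsigned bad = 0;
    assert(limit <= 0x10000u); /* only 16-bit inputs are compared here */
    for (uint32_t x = 0; x < limit; ++x)
        if (shift_add_320(x) != mulu_w_320(x))
            ++bad;
    return bad;
}
```

The two agree for every input whose upper word is clear. They differ when the upper word of d0 holds anything else (for example $00010001), because mulu.w ignores it while lsl.l shifts it along. A drop-in replacement therefore assumes d0 has already been zero-extended.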

08 January 2023, 18:50   #4
paraj
Aren't 030 and 020 cycle times quite similar (except for minor differences in cache speed)? In that case best case given by 020UM is 25 cycles (maybe 030 can do 24?, but either way negligible difference from worst).

Your snippet should be faster just counting cycles, but the larger code size may negate the benefit (more instruction cache used). Counting cycles is a good guide, but you really want to measure the full loop on more advanced CPUs.

08 January 2023, 18:59   #5
Karlos
Quote (originally posted by paraj):
Aren't 030 and 020 cycle times quite similar (except for minor differences in cache speed)? In that case best case given by 020UM is 25 cycles (maybe 030 can do 24?, but either way negligible difference from worst).

Your snippet should be faster just counting cycles, but the larger code size may negate the benefit (more instruction cache used). Counting cycles is a good guide, but you really want to measure the full loop on more advanced CPUs.

I'm not explicitly targeting the 020 separately. It looks like the 020/030 and 040 all benefit from this sort of thing: mulu.w is 16 cycles on the 040 (not including EA).

Last edited by Karlos; 09 January 2023 at 12:18.

08 January 2023, 22:48   #6
StingRay
Quote (originally posted by Karlos):
I'm not explicitly targeting 020 separately. It looks like 020/030, 040 and 060 benefit from this sort of thing. 16 cycles for mulu.w on 040 (not including EA).

Multiplication only needs 2 cycles on 68060.

08 January 2023, 23:39   #7
Karlos
Quote (originally posted by StingRay):
Multiplication only needs 2 cycles on 68060.

Yes, which is why I've used it to replace division (muls followed by asr). However, the multiply is still expensive on older CPUs.

09 January 2023, 00:11   #8
StingRay
You explicitly mentioned that the 060 would benefit from such an optimisation. That's not true, and as I didn't know whether you were aware of it, I mentioned it.

09 January 2023, 12:03   #9
Karlos
Quote (originally posted by StingRay):
You explicitly mentioned that 060 would benefit from such optimisation. That's not true and as I didn't know if you know I mentioned it.

No, it's probably my bad wording. Many of the integer divisions in the current implementation have been redone for the 060 using multiply/shift right. This works well because division is up to 18 cycles, whereas the multiply and shift take just a few.

The same optimisation does help on the 040 and below, where integer division is even slower, but the multiply operation on the 020-040 is nowhere near as efficient as it is on the 060. So I wanted to create something similar for those, where the multiply step is replaced by left shift/add. The end goal is separate binaries tuned for 020/030, 040 and 060. Hope that's clearer.

This will only be faster where we can beat the mul operation. For 040 16*16=>32 is stated as 16 cycles (execution only), and I don't know for sure if that's just the most pessimistic case. What I don't want to do is make something slower, lol.
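
To make the trade-off concrete, here is a C sketch of a divide by 10 in which the multiply step is itself replaced by shifts and adds. Everything specific in it is an assumption for illustration: the divisor, the short reciprocal 205 (2^11 / 10 rounded up), the input range and the function names. It models the arithmetic only, not any CPU's timing.

```c
#include <assert.h>
#include <stdint.h>

/* x * 205 using shifts and adds only: 205 = 128 + 64 + 8 + 4 + 1. */
static uint32_t mul205_shift_add(uint32_t x)
{
    return (x << 7) + (x << 6) + (x << 3) + (x << 2) + x;
}

/* Approximates x / 10 as (x * 205) >> 11 for small non-negative x. */
static uint32_t div10_shift_add(uint32_t x)
{
    return mul205_shift_add(x) >> 11;
}

/* Counts inputs in [0, limit) where the approximation differs from x / 10. */
static unsigned div10_shift_add_mismatches(uint32_t limit)
{
    unsigned bad = 0;
    assert(limit <= 0x10000u);
    for (uint32_t x = 0; x < limit; ++x)
        if (div10_shift_add(x) != x / 10)
            ++bad;
    return bad;
}
```

The short reciprocal needs only five terms but is exact only for 0 <= x < 1024 (it first fails at x = 1029). A 16-bit reciprocal such as 6554 covers a wider range but has seven set bits, so the shift/add version grows just as the multiply it is meant to beat becomes more attractive.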

Last edited by Karlos; 09 January 2023 at 12:17.