English Amiga Board

Go Back   English Amiga Board > Coders > Coders. Asm / Hardware

Thread Tools
Old 18 September 2018, 21:06   #1
Registered User

Join Date: Sep 2018
Location: Peterborough
Posts: 4

I am converting an ACELP speech decoder and 3 small fragments of code (under 150 bytes in total) take 25% of the bus bandwidth. They are:

MULS - 32-bit x 32-bit --->64 bit (used by FFT)
CLZ - count leading zeros (for Huffman decoding)
MULSHIFT32 -32-bix x 32-bit --->top 32-bits of result.

I am pretty confident that I have dealt with the first two of these fragments BUT my major problem is that just removing the preservation of the bottom-32 bits of the multiply only saves 1 instruction.

As usual, the code uses 4 x 16-bit partials but as I'm sure everyone knows, the bottom and top part of the result are the last 2 instructions in the fragment. What I need is a novel approach and I have spent weeks trying to work this out, I'm not just being lazy.

The only thing I can think of is approaching the calculations downwards. If your first step is to calculate the top 16-bits of each register then you have a FLOOR i.e. You know the minimum possible value. It is only a vague notion, sadly but I firmly intend to optimize as much as is humanly possible.

Thanks for your time,
Clubcard is offline  
Old 18 September 2018, 21:46   #2
son of 68k
meynaf's Avatar
Join Date: Nov 2007
Location: Lyon / France
Age: 45
Posts: 3,357
Are you doing that on 68000 ? If not, you can use 68020+ long mul instructions.
meynaf is offline  
Old 18 September 2018, 23:31   #3
Registered User

Join Date: Jun 2015
Location: Germany
Posts: 514
And CLZ is available on 020+ as BFFFO.
grond is offline  
Old 18 September 2018, 23:51   #4
Join Date: Jul 2008
Location: Sweden
Posts: 2,219
What do your 32-bit factors look like, are they evenly distributed over the whole 32-bit range? If there's a bias towards smaller values in the 16-bit range, then the overhead of a few comparisons and branches could eliminate up to four multiplications.
Leffmann is offline  
Old 04 October 2018, 23:38   #5
Registered User

Join Date: Sep 2018
Location: Peterborough
Posts: 4
I am using an FPGA implementation of 68000 with the aim of finding the lowest number of transistors & power that can decode the 32kb/s ACELP stream. First I see how fast a standard 68000 instruction set has to run (faster uses more power) and then when that has been exhaustively optimized, then add/remove functionality.

The idea is to plot the complexity against speed.

It's really the MULSHIFT (32-bit x 32-bit ---> top 32-bit) that is causing me pain. Hackers Delight has an algorithm to find the high part of the product but it doesn't result in any savings!

I just hoped someone knew of an established trick or where to look for same.
Clubcard is offline  
Old 05 October 2018, 09:45   #6
Registered User
Join Date: Nov 2006
Location: Stockholm, Sweden
Posts: 198
If you want 100% bit-precise results, and both input operands are randomly distributed 32-bit values, then I don't think there is much you can do.

I believe that the CPUs that offer MULHIGH instructions (I'm thinking old MIPS as an example) run a full 32x32=>64 multiply and then throw away the bottom 32 bits.

Hacker's delight version for unsigned ints here for reference

When I sketched it out on paper I ended up with these steps for a regular 4-multiplies version:

result = (a * b) >> 32

partial_a = lower_half(a) * lower_half(b)

partial_b = lower_half(a) * upper_half(b)

partial_c = upper_half(a) * lower_half(b)

partial_d = upper_half(a) * upper_half(b)

result = upper_half(upper_half(partial_a) + lower_half(partial_b) + lower_half(partial_c)) + upper_half(partial_b) + upper_half(partial_c) + partial_d
The reason why the lower portions need to be computed is because they influence (via carries) the upper portion.

If you accept off-by-one errors in your computation, then you can omit the partial_a term. That one contributes at most +1 to the result. Then you have:

result = (a * b) >> 32, max error of 1

partial_b = lower_half(a) * upper_half(b)

partial_c = upper_half(a) * lower_half(b)

partial_d = upper_half(a) * upper_half(b)

result = upper_half(partial_b) + upper_half(partial_c) + partial_d
It's difficult to reduce the number of operations down further with a reasonal bound on the errors; you'd need to look at the actual values of a & b then.

However, reading up a bit on ACELP, it's a prediction based thing right? If the predictor isn't getting reset frequently then you probably need bit-exact results.

... which is why your best bet is what others have said here, analyze your input data.
Kalms is offline  
Old 06 October 2018, 22:17   #7
Registered User

Join Date: Sep 2018
Location: Peterborough
Posts: 4
Using Kuratsuba multiplication means only 3 16x16 MULUs are needed for 32-bit x 32-bit --> 64-bit multiply. MULSHIFT32 is the most common macro and while I have optimized it, I have a gut feeling that since we know the size of the product (highest 1 bit) = size of the factors (highest 1 bits) added or 1 bit less. I just have a feeling that I can get the top bit into C. Km uses (a x c) + (b x d) - ((a - c) x (b - d)) but if I only need to calculate 17, bits then as you can see, that's going to be 2 multiplies.

Floor & Ceiling was another idea but that isn't to efficient.

Last edited by Clubcard; 06 October 2018 at 22:31.
Clubcard is offline  

Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)
Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

Forum Jump

All times are GMT +2. The time now is 21:32.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2019, vBulletin Solutions Inc.
Page generated in 0.06690 seconds with 14 queries