English Amiga Board Mulshift32

 18 September 2018, 21:06 #1 Clubcard Registered User   Join Date: Sep 2018 Location: Peterborough Posts: 4

Mulshift32

Hello, I am converting an ACELP speech decoder, and 3 small fragments of code (under 150 bytes in total) take 25% of the bus bandwidth. They are:

MULS - 32-bit x 32-bit ---> 64-bit (used by the FFT)
CLZ - count leading zeros (for Huffman decoding)
MULSHIFT32 - 32-bit x 32-bit ---> top 32 bits of the result

I am pretty confident that I have dealt with the first two of these fragments, BUT my major problem is that simply removing the preservation of the bottom 32 bits of the multiply only saves 1 instruction. As usual, the code uses 4 x 16-bit partial products, but as I'm sure everyone knows, the bottom and top parts of the result are only produced by the last 2 instructions in the fragment.

What I need is a novel approach, and I have spent weeks trying to work this out; I'm not just being lazy. The only thing I can think of is approaching the calculation from the top down: if your first step is to multiply the top 16 bits of each register, then you have a FLOOR, i.e. you know the minimum possible value of the result. It is only a vague notion, sadly, but I firmly intend to optimize as much as is humanly possible.

Thanks for your time, CC
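[For reference, here is a small C model of what the three fragments compute. This is a sketch of the semantics only, not the poster's 68000 code; the function names are invented for illustration.]

```c
#include <stdint.h>

/* MULS: signed 32-bit x 32-bit ---> full 64-bit product (used by the FFT) */
int64_t muls64(int32_t a, int32_t b) {
    return (int64_t)a * (int64_t)b;
}

/* CLZ: count leading zeros of a 32-bit value (32 when x == 0),
   used for Huffman decoding */
int clz32(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while (!(x & 0x80000000u)) {   /* shift left until the top bit is set */
        x <<= 1;
        n++;
    }
    return n;
}

/* MULSHIFT32: top 32 bits of the signed 64-bit product */
int32_t mulshift32(int32_t a, int32_t b) {
    return (int32_t)(((int64_t)a * (int64_t)b) >> 32);
}
```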
 18 September 2018, 21:46 #2 meynaf son of 68k   Join Date: Nov 2007 Location: Lyon / France Age: 45 Posts: 3,357 Are you doing that on a 68000? If not, you can use the 68020+ long mul instructions.
 18 September 2018, 23:31 #3 grond Registered User   Join Date: Jun 2015 Location: Germany Posts: 514 And CLZ is available on 020+ as BFFFO.
 18 September 2018, 23:51 #4 Leffmann   Join Date: Jul 2008 Location: Sweden Posts: 2,219 What do your 32-bit factors look like, are they evenly distributed over the whole 32-bit range? If there's a bias towards smaller values in the 16-bit range, then the overhead of a few comparisons and branches could eliminate up to four multiplications.
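[Leffmann's suggestion can be sketched in C as follows. This is a hypothetical model, not anyone's actual code: mulu16 stands in for the 68000's 16x16->32 MULU, and the early tests skip partial products that are guaranteed to be zero when an operand fits in 16 bits.]

```c
#include <stdint.h>

/* Models the 68000 MULU: unsigned 16x16 -> 32. */
uint32_t mulu16(uint16_t a, uint16_t b) {
    return (uint32_t)a * b;
}

/* Unsigned 32x32 -> 64 multiply that branches on operand size:
   1 MULU if both operands fit in 16 bits, 2 if one does, 4 otherwise. */
uint64_t mulu64_biased(uint32_t a, uint32_t b) {
    uint16_t ah = (uint16_t)(a >> 16), al = (uint16_t)a;
    uint16_t bh = (uint16_t)(b >> 16), bl = (uint16_t)b;

    if (ah == 0 && bh == 0)                     /* both small: 1 multiply */
        return mulu16(al, bl);
    if (ah == 0)                                /* only a is small: 2 multiplies */
        return ((uint64_t)mulu16(al, bh) << 16) + mulu16(al, bl);
    if (bh == 0)                                /* only b is small: 2 multiplies */
        return ((uint64_t)mulu16(ah, bl) << 16) + mulu16(al, bl);

    /* general case: all 4 partial products */
    return ((uint64_t)mulu16(ah, bh) << 32)
         + ((uint64_t)mulu16(ah, bl) << 16)
         + ((uint64_t)mulu16(al, bh) << 16)
         +  (uint64_t)mulu16(al, bl);
}
```

Whether the comparison/branch overhead pays off depends entirely on how often the small-operand paths are taken, which is the point about analyzing the input distribution.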
 04 October 2018, 23:38 #5 Clubcard Registered User   Join Date: Sep 2018 Location: Peterborough Posts: 4

I am using an FPGA implementation of the 68000, with the aim of finding the lowest number of transistors & power that can decode the 32 kb/s ACELP stream. First I see how fast a standard 68000 instruction set has to run (faster uses more power), and then, when that has been exhaustively optimized, I add/remove functionality. The idea is to plot complexity against speed.

It's really the MULSHIFT (32-bit x 32-bit ---> top 32 bits) that is causing me pain. Hacker's Delight has an algorithm to find the high part of the product, but it doesn't result in any savings! I just hoped someone knew of an established trick, or where to look for same.
 05 October 2018, 09:45 #6 Kalms Registered User   Join Date: Nov 2006 Location: Stockholm, Sweden Posts: 198

If you want 100% bit-precise results, and both input operands are randomly distributed 32-bit values, then I don't think there is much you can do. I believe that the CPUs that offer MULHIGH instructions (old MIPS, for example) run a full 32x32=>64 multiply and then throw away the bottom 32 bits. Hacker's Delight's version for unsigned ints is here for reference.

When I sketched it out on paper I ended up with these steps for a regular 4-multiplies version:

Code:
```
result = (a * b) >> 32

partial_a = lower_half(a) * lower_half(b)
partial_b = lower_half(a) * upper_half(b)
partial_c = upper_half(a) * lower_half(b)
partial_d = upper_half(a) * upper_half(b)

result = upper_half(upper_half(partial_a) + lower_half(partial_b) + lower_half(partial_c))
       + upper_half(partial_b) + upper_half(partial_c) + partial_d
```

The reason the lower portions need to be computed is that they influence (via carries) the upper portion. If you accept off-by-one errors in your computation, then you can omit the partial_a term; it contributes at most +1 to the result. Then you have:

Code:
```
result = (a * b) >> 32, max error of 1

partial_b = lower_half(a) * upper_half(b)
partial_c = upper_half(a) * lower_half(b)
partial_d = upper_half(a) * upper_half(b)

result = upper_half(partial_b) + upper_half(partial_c) + partial_d
```

It's difficult to reduce the number of operations further with a reasonable bound on the errors; you'd need to look at the actual values of a & b then.

However, reading up a bit on ACELP, it's a prediction-based thing, right? If the predictor isn't getting reset frequently then you probably need bit-exact results... which is why your best bet is what others have said here: analyze your input data.
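[Kalms' two pseudocode variants, written out in C so they can be checked against a full 64-bit multiply. The hi/lo helpers and function names are invented for this sketch; only 16x16 partial products are used, as on a 68000.]

```c
#include <stdint.h>

/* 16-bit halves of a 32-bit value */
uint32_t hi(uint32_t x) { return x >> 16; }
uint32_t lo(uint32_t x) { return x & 0xFFFFu; }

/* Exact (a * b) >> 32 from four 16x16 partial products.
   The 'mid' sum propagates the carries from the low half into the top. */
uint32_t mulhi_exact(uint32_t a, uint32_t b) {
    uint32_t pa = lo(a) * lo(b);
    uint32_t pb = lo(a) * hi(b);
    uint32_t pc = hi(a) * lo(b);
    uint32_t pd = hi(a) * hi(b);
    uint32_t mid = hi(pa) + lo(pb) + lo(pc);   /* fits in 18 bits */
    return hi(mid) + hi(pb) + hi(pc) + pd;
}

/* Approximate version: partial_a dropped, result is at most 1 too small. */
uint32_t mulhi_approx(uint32_t a, uint32_t b) {
    uint32_t pb = lo(a) * hi(b);
    uint32_t pc = hi(a) * lo(b);
    uint32_t pd = hi(a) * hi(b);
    return hi(pb) + hi(pc) + pd;
}
```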
 06 October 2018, 22:17 #7 Clubcard Registered User   Join Date: Sep 2018 Location: Peterborough Posts: 4

Using Karatsuba multiplication means only 3 16x16 MULUs are needed for a 32-bit x 32-bit ---> 64-bit multiply. MULSHIFT32 is the most common macro, and while I have optimized it, I have a gut feeling based on the fact that the bit-length of the product equals the sum of the bit-lengths of the factors, or 1 bit less. I just have a feeling that I can get the top bit into C.

With a, c the upper halves and b, d the lower halves, Karatsuba computes the middle term as (a x c) + (b x d) - ((a - b) x (c - d)), but if I only need to calculate 17 bits then, as you can see, that's still going to be 2 multiplies. Floor & ceiling was another idea, but that isn't too efficient.

Last edited by Clubcard; 06 October 2018 at 22:31.
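[The 3-multiply Karatsuba scheme mentioned above can be sketched in C like this. With x = a*2^16 + b and y = c*2^16 + d, the middle term ad + bc equals ac + bd - (a - b)(c - d), so ac and bd are reused and only one extra (signed) multiply is needed. The function name is mine, not the poster's.]

```c
#include <stdint.h>

/* Karatsuba unsigned 32x32 -> 64 multiply with three 16x16 multiplies:
   two unsigned (like MULU) and one signed (like MULS), since (a-b)(c-d)
   can be negative. */
uint64_t mulu64_karatsuba(uint32_t x, uint32_t y) {
    uint32_t a = x >> 16, b = x & 0xFFFFu;     /* halves of x */
    uint32_t c = y >> 16, d = y & 0xFFFFu;     /* halves of y */

    uint32_t ac = a * c;                       /* multiply #1 */
    uint32_t bd = b * d;                       /* multiply #2 */
    int64_t  m  = (int64_t)((int32_t)a - (int32_t)b)
                * (int64_t)((int32_t)c - (int32_t)d); /* multiply #3 */

    int64_t mid = (int64_t)ac + (int64_t)bd - m;      /* = ad + bc >= 0 */

    return ((uint64_t)ac << 32) + ((uint64_t)mid << 16) + bd;
}
```

The trade-off on a real 68000 is that the saved MULU is paid for with extra subtracts, sign handling, and the carry bookkeeping for the middle term.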
