19 October 2020, 21:21 | #1 |
Registered User
Join Date: Sep 2009
Location: Norway
Posts: 1,712
|
Optimizing linear interpolation routine for a live resampler
I'm currently porting the Fasttracker 2.09 XM replayer from i386 asm to 68020 asm (Amiga), and I'm rather close to being finished. Of course, mixing at only 28604Hz means you want some kind of resampling interpolation to prevent a ton of aliasing, so I went with linear interpolation since it's the fastest I can think of.
Here's my current code for producing one interpolated 8-bit PCM sample from two 8-bit PCM input samples and a 16-bit sampling position fraction:
Code:
    move.w (a3,d2.l),d3 ; d3.w = 2x 8-bit signed samples
    move.b d3,d5
    ext.w d5
    asr.w #8,d3
    sub.w d3,d5
    move.w d7,d4 ; d4.w = copy of fractional sampling position (0..65535)
    lsr.w #8,d4
    muls.w d4,d5
    asr.w #8,d5
    add.w d5,d3 ; d3.w = -128..127
I only have d3, d4 and d5 available. a6 can be used for a LUT pointer. If anyone sees a way to make this faster, or has an idea of how to calculate a LUT for this with little instruction overhead, let me know. I would be really glad! Even a LUT with 4 bits of fractional precision should be OK.
PS: It may look like I am potentially reading out of bounds by reading two samples, but in reality the loaded samples have the correct sample point stored at the end of the sample (or end of loop).
Last edited by 8bitbubsy; 19 October 2020 at 23:21. |
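For readers who do not read 68k assembly, the routine above can be modelled in C roughly as follows. This is only a sketch: the names `lerp8` and `floor_div256` are made up for illustration, and the model uses a 32-bit intermediate product, so it is a reference for the intended arithmetic rather than a cycle-exact copy of the assembly.

```c
#include <stdint.h>

/* Divide by 256, rounding toward minus infinity (what an arithmetic
 * right shift by 8 does), without relying on >> of a negative value. */
static int32_t floor_div256(int32_t x)
{
    return (x >= 0) ? (x / 256) : -((-x + 255) / 256);
}

/* Linear interpolation between two signed 8-bit samples.
 * frac16 is the 16-bit fractional sampling position (0..65535);
 * only its top 8 bits are used, as in the assembly above. */
static int8_t lerp8(int8_t s1, int8_t s2, uint16_t frac16)
{
    int32_t delta = (int32_t)s2 - (int32_t)s1; /* -255..255 */
    int32_t frac8 = frac16 >> 8;               /* 0..255 */
    int32_t step  = floor_div256(delta * frac8);
    return (int8_t)(s1 + step);                /* always between s1 and s2 */
}
```

With a fraction of 0 the result is `s1`; as the fraction approaches 65535 the result approaches `s2` without quite reaching it.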
19 October 2020, 21:55 | #2 |
Registered User
Join Date: Mar 2012
Location: Norfolk, UK
Posts: 1,156
|
Have you considered mixing at, say, four times the target frequency, then using a simple running-average (so just addition and shifting) as a low-pass filter to downsample the complete mix? It won't exactly be hifi (but then neither will linear interpolation) but it might be good enough, and cheaper than trying to interpolate each channel individually.
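One way to read the running-average idea is a plain block average over each group of four oversampled mix values. The sketch below assumes a 4x oversampled 16-bit mix buffer; `decimate_box` and `OVERSAMPLE` are hypothetical names, and a sliding-window average would be an equally valid reading of the suggestion.

```c
#include <stddef.h>
#include <stdint.h>

#define OVERSAMPLE 4 /* assumed oversampling factor */

/* Average each group of OVERSAMPLE input samples into one output sample.
 * `in` must hold out_len * OVERSAMPLE samples. */
static void decimate_box(const int16_t *in, int16_t *out, size_t out_len)
{
    for (size_t i = 0; i < out_len; i++) {
        int32_t sum = 0;
        for (size_t k = 0; k < OVERSAMPLE; k++)
            sum += in[i * OVERSAMPLE + k];
        out[i] = (int16_t)(sum / OVERSAMPLE); /* truncates toward zero */
    }
}
```

The cost is one addition per oversampled value plus one divide (a shift in assembly) per output sample, but every channel has to be mixed four times as often.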
|
19 October 2020, 21:56 | #3 |
Registered User
Join Date: Sep 2009
Location: Norway
Posts: 1,712
|
Oversampling (of up to 32 channels at 28kHz) is going to be too slow for 68020..68060 Amigas, but otherwise a neat suggestion.
|
19 October 2020, 23:42 | #4 |
Registered User
Join Date: Mar 2012
Location: Norfolk, UK
Posts: 1,156
|
Maybe simple IIR filters, then?
(out_new = (out_old + in) >> 1; or maybe out_new = (out_old + 7*in) >> 3.) It should be cheaper to compute than linear interpolation, and you could potentially have a few different versions of the routine with different coefficients, selected by the upsampling factor. |
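The two filter forms quoted above can be sketched in C as below. The function names are illustrative only, and the sketch uses integer division, which truncates toward zero, whereas an arithmetic shift in assembly rounds toward minus infinity, so negative values can differ by one.

```c
#include <stdint.h>

/* out_new = (out_old + in) >> 1, with the state kept in *out_old. */
static int32_t iir_half(int32_t *out_old, int32_t in)
{
    *out_old = (*out_old + in) / 2;
    return *out_old;
}

/* out_new = (out_old + 7*in) >> 3, a much milder smoothing. */
static int32_t iir_seven_eighths(int32_t *out_old, int32_t in)
{
    *out_old = (*out_old + 7 * in) / 8;
    return *out_old;
}
```

Feeding a constant input of 100 into `iir_half` from a zero state gives 50, then 75, and so on toward 100.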
20 October 2020, 00:22 | #5 | |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,018
|
Quote:
    moveq #0,d4
    move.w d7,d4
    lsr.w #1,d5
    move.b d5,d4
    add.w (a6,d4.l*2),d3
    ;clr.w d4
    ; addx.w d4,d3 ; better precision?
|
|
20 October 2020, 00:40 | #6 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,018
|
or maybe 2x bigger table?
    moveq #0,d4
    move.w d5,d4
    lsl.w #7,d4
    add.l d4,d4
    ror.w #8,d7
    move.b d7,d4
    ror.w #8,d7
    add.w (a6,d4.l*2),d3
or single table:
    moveq #0,d4
    move.w d5,d4
    lsl.w #7,d4
    add.l d4,d4
    ror.w #8,d7
    move.b d7,d4
    ror.w #8,d7
    move.b (a6,d4.l),d5
    ext.w d5
    add.w d5,d3
|
20 October 2020, 07:02 | #7 |
J.M.D - Bedroom Musician
Join Date: Apr 2014
Location: los angeles,ca
Posts: 3,566
|
You seem like the right person to ask, since I was thinking about some unconventional uses of the XM format natively on the Amiga: for example, using the four (or three, or two) Paula channels for replay, but with the possibility to switch pattern channels on and off, so as to have interactive soundtracks à la Monkey Island (more variations of the theme, where turning on a channel and replacing it with another one makes it sound different), or a way for a program to change sample volumes. I know those are not standard replay routines and that playback is limited to the four hardware channels and Amiga frequencies, but I am considering breaking some barriers...
[edit - can a mod make a separate thread for this? I realized I am OT] Last edited by saimon69; 20 October 2020 at 19:31. |
20 October 2020, 11:11 | #8 |
Registered User
Join Date: Sep 2009
Location: Norway
Posts: 1,712
|
robinsonb5: That filter is not going to help much since you're not handling the fractional position whatsoever. There will still be somewhat hard edges from the nearest neighbor sampling, which will create aliasing.
Don_Adan: I might test your code, but that's already more instruction overhead than linear interpolation using muls; it may even be slower! I was thinking of something like:
sample1 += centeredLUT[((sample2-sample1) << 4) | ((frac >> 12) & 15)];
But it just seems to take way too many instructions to set this up. Also, I don't want to reduce the resolution of the delta sample (s2-s1), and I don't want a gigantic LUT either...
Here's how FT2 did it in its older mixers:
Code:
    mov ax,[esi]
    xor eax,08080h
    mov bl,al
    sub bl,ah
    sbb bh,bh
    shld ebx,edi,4 ; edi = frac ($xxxx0000)
    xor ah,ah
    add al,[bx+CDA_IPTab+CDA_IPTabSize/2]
saimon69: I don't think I'm the right person to ask. I'm just directly porting old code; I don't really know how to do what you're asking.
Last edited by 8bitbubsy; 20 October 2020 at 12:08. |
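The `centeredLUT` expression above could be fleshed out in C along the following lines. This is a sketch under assumptions, not FT2's table: the 16-bit entry type, the 512-row layout and all names except `centeredLUT`'s idea of a centred pointer are choices made here for illustration.

```c
#include <stdint.h>

#define FRAC_BITS  4
#define FRAC_STEPS (1 << FRAC_BITS) /* 16 fractional steps */
#define DELTA_ROWS 512              /* one row per delta in -256..255 */

static int16_t lerp_table[DELTA_ROWS * FRAC_STEPS];

/* Points at the row for delta == 0, so negative deltas index backwards. */
static const int16_t *centered_lut = lerp_table + (DELTA_ROWS / 2) * FRAC_STEPS;

static void init_lerp_table(void)
{
    for (int delta = -256; delta < 256; delta++)
        for (int frac = 0; frac < FRAC_STEPS; frac++)
            lerp_table[(delta + 256) * FRAC_STEPS + frac] =
                (int16_t)((delta * frac) / FRAC_STEPS); /* truncates toward zero */
}

/* Interpolate using the top 4 bits of a 16-bit fractional position. */
static int8_t lerp_lut(int8_t s1, int8_t s2, uint16_t frac16)
{
    int delta = s2 - s1;              /* -255..255 */
    int frac4 = (frac16 >> 12) & 15;  /* 0..15 */
    /* delta * 16 + frac4 is the same index as (delta << 4) | frac4 */
    return (int8_t)(s1 + centered_lut[delta * FRAC_STEPS + frac4]);
}
```

The table is 8192 entries (16 KB with 16-bit entries); the index setup is still a subtract, a shift, a mask and an OR, which is the overhead complained about above.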
20 October 2020, 12:27 | #9 | |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,018
|
Quote:
|
|
20 October 2020, 13:06 | #10 |
Registered User
Join Date: Dec 2014
Location: germany
Posts: 439
|
Hmm, what is your definition of "gigantic" for a LUT? If you can live with 64k and 8 bit fraction resolution, the following might work (not tested):
Code:
    move.w (a3,d2.l),d3 ; d3.w = 2x 8-bit signed samples
    move.b d3,d5
    move.b #0,d3 ; or reserve a zero register
    asl.w #8,d5 ; subtract the sample values * 256 in the next step
    sub.w d3,d5 ; high byte = delta, low byte = 0
    move.b d7,d5 ; d5.b = copy of fractional sampling position (0..255)
    move.b (a4,d5.w),d3 ; a4 = pre-shifted multiplication LUT
EDIT: the LUT would look like this:
Code:
    {{-128*0>>8,-128*1>>8,...,-128*255>>8},
     {-127*0>>8,-127*1>>8,...,-127*255>>8},
     ...
     {127*0>>8,127*1>>8,...,127*255>>8}}
EDIT: hmm, can we save it by using
Code:
    move.w (a3,d2.l),d3 ; d3.w = 2x 8-bit signed samples
    move.b d3,d5
    move.b #0,d3 ; or reserve a zero register
    asl.w #7,d5 ; subtract the sample values * 128 in the next step
    asr.w #1,d3
    sub.w d3,d5
    or.b d7,d5 ; d7.b = copy of fractional sampling position (0..127)
    move.b (a4,d5.w),d3 ; a4 = pre-shifted multiplication LUT
Last edited by chb; 20 October 2020 at 13:31. |
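The 64K table laid out above (256 delta rows of 256 fraction columns) could be generated in C roughly like this. The names are hypothetical, and the sketch assumes the table pointer sits at the delta == 0 row because a word-sized index register is sign-extended on the 68k. Note that it only covers deltas of -128..127, while the difference of two 8-bit samples can need 9 bits.

```c
#include <stdint.h>

static int8_t mul_table[256 * 256];

/* Points at the delta == 0 row so a signed index reaches both halves. */
static const int8_t *mul_lut = mul_table + 128 * 256;

/* x / 256 rounded toward minus infinity, like an arithmetic >> 8. */
static int floor_shift8(int x)
{
    return (x >= 0) ? (x / 256) : -((-x + 255) / 256);
}

static void init_mul_table(void)
{
    for (int delta = -128; delta < 128; delta++)
        for (int frac = 0; frac < 256; frac++)
            mul_table[(delta + 128) * 256 + frac] =
                (int8_t)floor_shift8(delta * frac);
}

/* Index = delta in the high byte, fraction in the low byte. */
static int8_t scaled_delta(int8_t delta, uint8_t frac8)
{
    return mul_lut[delta * 256 + frac8];
}
```

The interpolated sample would then be `s1 + scaled_delta(s2 - s1, frac8)`, provided the delta fits in 8 bits.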
20 October 2020, 14:24 | #11 | |
Registered User
Join Date: Sep 2009
Location: Norway
Posts: 1,712
|
In terms of LUT size, I don't really want it to be bigger than 64K. That would mean full sample point delta precision (9 bits) + 7 bits of fractional precision. That's plenty for linear interpolation already.
Quote:
I could of course change the frac to be 8 bits wide, but given that FT2 supports very low resampling rates, you want to maximize time precision. EDIT: Ah, I see that you mentioned 8-bit frac resolution in the beginning of the post. But as said, I want more precision. Last edited by 8bitbubsy; 20 October 2020 at 14:33. |
|
20 October 2020, 15:57 | #12 |
Registered User
Join Date: Dec 2014
Location: germany
Posts: 439
|
Ok, I'll give it another try. Let's assume frac(n+1) = frac(n) + delta_frac in every step, frac(0)=0. I hope that's what you are using.
We could use a different format for frac: use 24 bits and put the LSBs in, let's say, d6 and the MSB in d7. We then need two registers for delta_frac. I do not know your code, so let's assume they are d0 and d1. Again, this is not tested, so please take it as inspiration rather than working code.
Code:
    ; compute frac:
    ; d7 frac MSB, d6 frac LSBs,
    ; d1 delta_frac MSB, d0 delta_frac LSBs
    add.w d0,d6
    addx.b d1,d7
    ; interpolation:
    moveq #0,d3 ; clear d3
    move.w (a3,d2.l),d3 ; d3.w = 2x 8-bit signed samples
    move.b d3,d5
    clr.b d3
    asl.w #8,d5 ; subtract the sample values * 256 in the next step
    sub.l d3,d5 ; treat as unsigned long, so LUT needs some re-ordering?
    move.b d7,d5 ; d7.b = fractional sampling position MSB (0..255)
    move.b (a4,d5.l),d3 ; a4 = pre-shifted multiplication LUT
Last edited by chb; 20 October 2020 at 16:22. |
20 October 2020, 16:24 | #13 |
Registered User
Join Date: Sep 2009
Location: Norway
Posts: 1,712
|
Yeah, I do the sampling position like this:
Code:
    add.w d6,d7
    addx.l d1,d2
d7.w = temporary sampling position fraction (16-bit)
d1.l = signed high 16-bit part of delta (integer samples, signed because it's negative for backwards sampling mode)
d2.l = sampling position
Also remember that I only have the d3, d4, d5 and a6 regs available for free use in the mixing loop. Here are the full inner mixer loop macros: https://pastebin.com/Mi9DpbSE
Last edited by 8bitbubsy; 20 October 2020 at 16:35. |
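The add/addx pair above is a split fixed-point addition: the 16-bit fractions are added first, and the carry out of that addition is folded into the integer position. A C model might look like this, assuming d6.w holds the fractional part of the delta (the post does not say so explicitly); `SamplePos` and `step_position` are made-up names.

```c
#include <stdint.h>

typedef struct {
    int32_t  pos;  /* integer sampling position (d2.l) */
    uint16_t frac; /* fractional sampling position (d7.w) */
} SamplePos;

/* delta_int models d1.l, delta_frac models d6.w (assumed). */
static void step_position(SamplePos *p, int32_t delta_int, uint16_t delta_frac)
{
    uint32_t sum = (uint32_t)p->frac + delta_frac;
    p->frac = (uint16_t)sum;                    /* add.w d6,d7 */
    p->pos += delta_int + (int32_t)(sum >> 16); /* addx.l d1,d2 adds the carry */
}
```

A backwards step of half a sample is then `delta_int = -1, delta_frac = 0x8000`: starting at 10.0, the position becomes 9.5 and then 9.0.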
20 October 2020, 17:41 | #14 | |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,018
|
Quote:
You can check the mixing routine from the Mugician II replayer; it used all (17) 68k registers and all parts of the registers for mixing. |
|
21 October 2020, 14:30 | #15 |
Registered User
Join Date: Sep 2009
Location: Norway
Posts: 1,712
|
I managed to calculate a lerp LUT with 9-bit delta precision and 7-bit frac precision, and it works... but... it's about the same speed as the muls code on a 68020! So I was right to begin with, the instruction overhead is slow.
Here's how I did it: Code:
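    move.w (a3,d2.l),d3
    move.b d3,d5
    ext.w d5
    asr.w #8,d3
    sub.w d3,d5
    lsl.w #7,d5 ; delta into the upper 9 bits
    move.w d7,d4
    rol.w #7,d4 ; top 7 bits of frac into the low bits
    and.b #127,d4
    or.b d4,d5 ; d5.w = LUT index
    add.b (a6,d5.w),d3 ; a6 = LUT pointer (d5.w is sign-extended)
    and.w #$ff,d3
Code:
    move.w (a3,d2.l),d3
    move.b d3,d5
    ext.w d5
    asr.w #8,d3
    sub.w d3,d5
    move.w d7,d4
    lsr.w #8,d4
    muls.w d4,d5
    asr.w #8,d5
    add.w d5,d3
Code:
    #include <math.h>
    #include <stdint.h>

    int8_t lerpLUT[65536];

    void generateLerpLUT(void)
    {
        int8_t *ptr8 = lerpLUT;
        for (int32_t smp = -256; smp < 256; smp++)
        {
            for (int32_t frac = 0; frac < 128; frac++)
            {
                /* The product can exceed the int8_t range; go through int32_t
                   so the narrowing is a plain integer conversion (it wraps on
                   the usual compilers) and not an out-of-range float cast. */
                *ptr8++ = (int8_t)(int32_t)round(smp * (frac / 128.0));
            }
        }
    }
Last edited by 8bitbubsy; 21 October 2020 at 17:15. |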
21 October 2020, 14:44 | #16 | |
Registered User
Join Date: Dec 2014
Location: germany
Posts: 439
|
EDIT: Did not see your last post
Thanks for the code, makes it clearer now. Quote:
I'll give it one last try. If a2 cannot be used, the high word of d6 could be used instead, or a7 if available. Or use some PC-relative addressing to free a data register, e.g. for the volume LUTs?
Code:
    ; Register map: (* indicates modified by chb)
    ; a0 = original audio buffer pointer (LRLRLR..)
    ; a1 = current left volume LUT pointer
    ; *a2 = a2 high word = delta_low LSBs , low word = 0 (was: mixer function table)
    ; a3 = sample data pointer
    ; a4 = current right volume LUT pointer
    ; a5 = current audio buffer pointer (LRLRLR..)
    ; *a6 = pre-shifted multiplication LUT
    ; d0.w = bytes to mix
    ; d1.l = sample read delta high (signed)
    ; d2.l = sample data position
    ; d3 = <temporarily used in mixer loop>
    ; d4 = <temporarily used in mixer loop>, needs initialization with 0.l
    ; d5 = <temporarily used in mixer loop>
    ; *d6.b = sample read delta low MSB
    ; *d7 high word = sample position LSBs
    ; *d7.b = fractional sample position MSB
    ; *d7.w MSB = 0
    ; ============================================================
    ; interpolation:
    moveq #0,d3 ; clear d3
    move.w (a3,d2.l),d3 ; d3.w = 2x 8-bit signed samples S1 S2
    move.b d3,d4 ; save unshifted S2 for further operation
    move.l d4,d5 ; move.l to clear d5 (upper three bytes of d4 always 0)
    clr.b d3
    lsl.w #8,d5 ; subtract the sample values * 256 in the next step
    sub.l d3,d5 ; 256*(S2-S1), treat as unsigned long
    move.b d7,d5 ; d7.b = fractional sampling position MSB (0..255)
    move.b (a6,d5.l),d3 ; a6 = LUT, see below for structure
    ; d3 = 256*(S2-S1)*(1-frac)
    sub.b d4,d3 ; S_frac = S1 + (S2-S1)*frac = S2 - (S2-S1)*(1-frac)
    move.w (a1,d3.w*2),d5
    swap d5
    move.w (a4,d3.w*2),d5
    add.l d5,(a5)+
    add.l a2,d7 ; a2 delta_low LSBs | 0.w
    addx.b d6,d7 ; d6.b delta_low MSB
    addx.l d1,d2 ; sample data position
    ;; if we cannot use a2 let's take this code:
    ;; d6 contains delta_low LSBs in the high word and MSB in the lowest byte
    ;move.w d7,d3 ; save delta_low MSB
    ;add.l d6,d7 ; add LSBs
    ;move.w d3,d7 ; restore MSB
    ;addx.b d6,d7 ; add MSB
    ;addx.l d1,d2 ; sample data position
    ; LUT structure stores (1-frac) =^ (255-frac)
    ; values are word size, MSB = 0
    ;
    ; {{-128*255>>8,-128*254>>8,...,-128*0>>8},
    ; {-127*255>>8,-127*254>>8,...,-127*0>>8},
    ; ...
    ; {127*255>>8,127*254>>8,...,127*0>>8}}
|
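The rewrite used in the comments above, S1 + (S2-S1)*frac = S2 - (S2-S1)*(1-frac), can be checked in fixed point with a short C sketch. The function names are illustrative, and the sketch represents (1-frac) as (256-frac8), for which the identity is exact; the table in the post stores (255-frac) instead, so its results may differ slightly from this model.

```c
#include <stdint.h>

/* Interpolated value scaled by 256, computed from S1. */
static int32_t lerp_from_s1_x256(int s1, int s2, int frac8)
{
    return s1 * 256 + (s2 - s1) * frac8;
}

/* The same value, computed from S2 with the complementary fraction. */
static int32_t lerp_from_s2_x256(int s1, int s2, int frac8)
{
    return s2 * 256 - (s2 - s1) * (256 - frac8);
}

/* Returns 1 if both forms agree for every 8-bit sample pair and fraction. */
static int forms_agree(void)
{
    for (int s1 = -128; s1 < 128; s1++)
        for (int s2 = -128; s2 < 128; s2++)
            for (int frac8 = 0; frac8 < 256; frac8++)
                if (lerp_from_s1_x256(s1, s2, frac8) !=
                    lerp_from_s2_x256(s1, s2, frac8))
                    return 0;
    return 1;
}
```

Starting from S2 is what lets the routine keep the unshifted S2 in a register and drop a separate extraction of S1.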
|
21 October 2020, 14:50 | #17 |
Registered User
Join Date: Sep 2009
Location: Norway
Posts: 1,712
|
Given that the code in my previous post (posted not long ago) is slower than the original mul code, I really doubt this will be faster. Thanks for the effort anyway! Appreciated.
EDIT: Oh no, my new LUT code doesn't seem to work after all. I guess I used the wrong binary during testing. But even if it were to work, it'd be slower anyway. Last edited by 8bitbubsy; 21 October 2020 at 14:57. |
21 October 2020, 15:53 | #18 |
Registered User
Join Date: Dec 2014
Location: germany
Posts: 439
|
Well, it was some nice puzzle. Interesting that the table access is so slow, what was your configuration?
|
21 October 2020, 15:55 | #19 | |
Registered User
Join Date: Sep 2009
Location: Norway
Posts: 1,712
|
Quote:
I made a benchmark program where I ran 2048 iterations of the mixer macro with rasterbars, to see how much time it takes. This is probably not a good way to test it, as the scenario is slightly different, but I think it should give a general idea, at least. Anyway, I'm going to redo the benchmark once I get this to actually work. |
|
21 October 2020, 16:08 | #20 |
Registered User
Join Date: Sep 2009
Location: Norway
Posts: 1,712
|
Sorry for the double-post, but I managed to fix the code now. And after benchmarking, it seems to be just a tiny bit slower now. I updated the post with my working version.
EDIT: ARGH! I still managed to compile the previous version thinking I was compiling the LUT version, and apparently it still doesn't work like it should. Haha EDIT2: D'oh! I put the table in the BSS hunk, so it got cleared lol. It works now, for sure. Last edited by 8bitbubsy; 21 October 2020 at 16:33. |