English Amiga Board


Old 19 October 2020, 22:21   #1
8bitbubsy
Registered User

 
Join Date: Sep 2009
Location: Norway
Posts: 1,491
Optimizing linear interpolation routine for a live resampler

I'm currently porting the Fasttracker 2.09 XM replayer from i386 asm to 68020 asm (Amiga), and I'm rather close to being finished. Of course, mixing at only 28604Hz means you want some kind of resampling interpolation to prevent a ton of aliasing, so I went with linear interpolation since it's the fastest I can think of.

Here's my current code for producing one interpolated 8-bit PCM sample from two 8-bit PCM input samples and a 16-bit fractional sampling position:
Code:
    move.w (a3,d2.l),d3    ; d3.w = 2x 8-bit signed samples
    move.b d3,d5
    ext.w  d5
    asr.w  #8,d3
    sub.w  d3,d5
    move.w d7,d4    ; d4.w = copy of fractional sampling position (0..65535)
    lsr.w  #8,d4
    muls.w d4,d5
    asr.w  #8,d5
    add.w  d5,d3    ; d3.w = -128..127
As you can see, this has a ton of overhead already, not to mention that MULS is a slow instruction, even on 68020. I thought of using a precalculated lerp LUT just like FT2 did for its legacy 16-bit mixer, but then comes the problem of getting the delta sample properly shifted with the frac to prepare for LUT indexing. i386 has a neat SHLD/SHRD instruction for doing this with few instructions, but this is not the case for 68020.
I only have d3, d4 and d5 available. a6 can be used for a LUT pointer.
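
For anyone who finds the asm hard to follow, here is a minimal C sketch of the arithmetic the routine above is meant to perform. The names `asr8` and `lerp8` are made up for this illustration, and the sketch models the intended result (first sample plus the scaled difference), not the exact register contents at each step.

```c
#include <stdint.h>

/* Floor-divide by 256, mirroring what an arithmetic shift right by 8 does. */
static int32_t asr8(int32_t v)
{
    return (v >= 0) ? (v / 256) : -((-v + 255) / 256);
}

/* s1, s2: two consecutive signed 8-bit samples.
   frac:   16-bit fractional sampling position (0..65535). */
static int8_t lerp8(int8_t s1, int8_t s2, uint16_t frac)
{
    int32_t delta = (int32_t)s2 - s1;   /* -255..255 */
    int32_t frac8 = frac >> 8;          /* only the top 8 bits are used */
    return (int8_t)(s1 + asr8(delta * frac8));
}
```

With frac at the halfway point (0x8000), `lerp8(0, 100, 0x8000)` lands in the middle of the two samples.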

If anyone sees a way to make this faster, or has an idea of how to calculate a LUT for this with little instruction overhead, let me know. I would be really glad! Even a LUT with 4 bits of fractional precision should be OK.
PS: It may look like I am potentially reading out of bounds by reading two samples, but in reality the loaded samples have the correct sample point stored at the end of the sample (or end of loop).

Last edited by 8bitbubsy; 20 October 2020 at 00:21.
Old 19 October 2020, 22:55   #2
robinsonb5
Registered User
 
Join Date: Mar 2012
Location: Norfolk, UK
Posts: 871
Have you considered mixing at, say, four times the target frequency, then using a simple running-average (so just addition and shifting) as a low-pass filter to downsample the complete mix? It won't exactly be hifi (but then neither will linear interpolation) but it might be good enough, and cheaper than trying to interpolate each channel individually.
Old 19 October 2020, 22:56   #3
8bitbubsy
Registered User

 
Join Date: Sep 2009
Location: Norway
Posts: 1,491
Oversampling (of up to 32 channels at 28kHz) is going to be too slow for 68020..68060 Amigas, but otherwise a neat suggestion.
Old 20 October 2020, 00:42   #4
robinsonb5
Registered User
 
Join Date: Mar 2012
Location: Norfolk, UK
Posts: 871
Maybe simple IIR filters, then?
(out_new = (out_old + in) >> 1; or maybe out_new = (out_old + 7*in) >> 3)

It should be cheaper to compute than linear interpolation, and you could potentially have a few different versions of the routine with different coefficients, selected by the upsampling factor.
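
A minimal C sketch of those two one-pole filters, in case the shorthand is unclear. The function names are hypothetical, and the sketch uses integer division (which rounds toward zero) where the asm version would use an arithmetic shift (which rounds toward minus infinity), so negative results can differ by one.

```c
#include <stdint.h>

/* out_new = (out_old + in) >> 1: equal weight on old output and new input. */
static int16_t iir_half(int16_t out_old, int16_t in)
{
    return (int16_t)(((int32_t)out_old + in) / 2);
}

/* out_new = (out_old + 7*in) >> 3: 1/8 old output, 7/8 new input,
   so a much gentler low-pass than the version above. */
static int16_t iir_eighth(int16_t out_old, int16_t in)
{
    return (int16_t)(((int32_t)out_old + 7 * (int32_t)in) / 8);
}
```

Each call takes the previous output and the next input sample, so the caller only has to keep one value of state per filtered stream.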
Old 20 October 2020, 01:22   #5
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 52
Posts: 1,250
Maybe something like this:
moveq #0,d4
move.w d7,d4
lsr.w #1,d5
move.b d5,d4
add.w (a6,d4.l*2),d3
;clr.w d4
; addx.w d4,d3 better precision?
Old 20 October 2020, 01:40   #6
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 52
Posts: 1,250
or maybe 2x bigger table?
moveq #0,d4
move.w d5,d4
lsl.w #7,d4
add.l d4,d4
ror.w #8,d7
move.b d7,d4
ror.w #8,d7
add.w (a6,d4.l*2),d3

or single table:

moveq #0,d4
move.w d5,d4
lsl.w #7,d4
add.l d4,d4
ror.w #8,d7
move.b d7,d4
ror.w #8,d7
move.b (a6,d4.l),d5
ext.w d5
add.w d5,d3
Old 20 October 2020, 08:02   #7
saimon69
J.M.D - Bedroom Musician

 
Join Date: Apr 2014
Location: los angeles,ca
Posts: 1,577
You seem like the right person to ask, since I was thinking about some unconventional uses of the XM format natively on the Amiga: for example, using the four (or three, or two) Paula channels for replay, but with the possibility to switch pattern channels on and off, so as to have interactive soundtracks à la Monkey Island (a variation of the theme, where turning on a channel or replacing it with another one makes it sound different), or a way for a program to change sample volumes. I know those are not standard replay routines, and that playback is limited to the four hardware channels and Amiga frequencies, but I am considering breaking some barriers...

[Edit: can a mod split this into a separate thread? I realized I am off-topic.]

Last edited by saimon69; 20 October 2020 at 20:31.
Old 20 October 2020, 12:11   #8
8bitbubsy
Registered User

 
Join Date: Sep 2009
Location: Norway
Posts: 1,491
robinsonb5: That filter is not going to help much since you're not handling the fractional position whatsoever. There will still be somewhat hard edges from the nearest neighbor sampling, which will create aliasing.

Don_Adan: I'll maybe test your code, but that's already more instruction overhead than linear interpolation using muls, maybe it's even slower!
I was thinking of something like: sample1 += centeredLUT[((sample2-sample1) << 4) | ((frac >> 12) & 15)];

But it just seems to be way too many instructions to set this up. Also I don't want to shift the resolution of the delta sample (s2-s1), and I don't want to have a gigantic LUT either...
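
As a rough illustration of that centered-LUT idea, here is a minimal C sketch with 4 bits of fractional precision. Everything in it (`lut_storage`, `init_centered_lut`, `lerp_lut4`, `wrap8`, the floor rounding) is an assumption made for the example, not code from FT2 or from my replayer. The index is built with a multiply and an add, because left-shifting a negative value is not well defined in C.

```c
#include <stdint.h>

#define LUT_SIZE (512 * 16)   /* 9-bit signed delta x 16 fractions = 8192 bytes */

static int8_t lut_storage[LUT_SIZE];
/* "Centered": index 0 means delta = 0, so negative indices stay inside the array. */
static int8_t *const centeredLUT = lut_storage + LUT_SIZE / 2;

/* Reduce to a signed byte modulo 256 (what a byte-sized add would keep). */
static int8_t wrap8(int32_t v)
{
    v = ((v % 256) + 256) % 256;              /* 0..255 */
    return (int8_t)(v < 128 ? v : v - 256);
}

static void init_centered_lut(void)
{
    for (int32_t delta = -256; delta < 256; delta++)
        for (int32_t f = 0; f < 16; f++) {
            int32_t v = delta * f;
            v = (v >= 0) ? v / 16 : -((-v + 15) / 16);  /* floor(delta*f/16) */
            centeredLUT[delta * 16 + f] = wrap8(v);
        }
}

static int8_t lerp_lut4(int8_t s1, int8_t s2, uint16_t frac)
{
    int32_t delta = (int32_t)s2 - s1;           /* -255..255 */
    int32_t index = delta * 16 + (frac >> 12);  /* top 4 bits of frac */
    return wrap8(s1 + centeredLUT[index]);
}
```

Call `init_centered_lut()` once before mixing. The table entries are stored modulo 256, since a delta of up to 255 does not fit in a signed byte, and the final byte-sized add brings the result back into range.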
Here's how FT2 did it in its older mixers:
Code:
           mov ax,[esi]
           xor eax,08080h
           mov bl,al
           sub bl,ah
           sbb bh,bh
           shld ebx,edi,4 ; edi = frac ($xxxx0000)
           xor ah,ah
           add al,[bx+CDA_IPTab+CDA_IPTabSize/2]
In some ways, i386 asm is better than 680x0 asm...

saimon69: I don't think I'm the right person to ask. I'm just directly porting old code, I don't really know how to do your request.

Last edited by 8bitbubsy; 20 October 2020 at 13:08.
Old 20 October 2020, 13:27   #9
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 52
Posts: 1,250
It cannot be slower on 680x0, except maybe on the 68060, because of the muls in your code. Of course it can be made shorter, but that depends on your full loop routine. For example, handling the "copy of fractional sampling position (0..65535)" differently could shorten the code by two more instructions.
Old 20 October 2020, 14:06   #10
chb
Registered User

 
Join Date: Dec 2014
Location: germany
Posts: 295
Hmm, what is your definition of "gigantic" for a LUT? If you can live with 64k and 8 bit fraction resolution, the following might work (not tested):
Code:
    move.w (a3,d2.l),d3    ; d3.w = 2x 8-bit signed samples
    move.b d3,d5           
    move.b #0,d3           ; or reserve a zero register
    asl.w  #8,d5           ; subtract the sample values * 256 in the next step
    sub.w  d3,d5           ; high byte = delta, low byte = 0
    move.b d7,d5           ; d5.b = copy of fractional sampling position (0..255) 
    move.b (a4,d5.w),d3    ; a4 = pre-shifted multiplication LUT
I did not test it, so I hope I did not mess up the signed values.

EDIT: the LUT would look like this:
Code:
{{-128*0>>8,-128*1>>8,...,-128*255>>8},
 {-127*0>>8,-127*1>>8,...,-127*255>>8},
 ...
 {127*0>>8,127*1>>8,...,127*255>>8}}
EDIT: Ah, I messed up; delta can obviously be -255 to +255.
EDIT: hmm, can we save it by using
Code:
    move.w (a3,d2.l),d3    ; d3.w = 2x 8-bit signed samples
    move.b d3,d5
    move.b #0,d3           ; or reserve a zero register
    asl.w  #7,d5           ; subtract the sample values * 128 in the next step
    asr.w  #1,d3
    sub.w  d3,d5
    or.b   d7,d5           ; d7.b = copy of fractional sampling position (0..127)
    move.b (a4,d5.w),d3    ; a4 = pre-shifted multiplication LUT
?

Last edited by chb; 20 October 2020 at 14:31.
Old 20 October 2020, 15:24   #11
8bitbubsy
Registered User

 
Join Date: Sep 2009
Location: Norway
Posts: 1,491
In terms of LUT size, I don't really want it to be bigger than 64K. That would mean full sample point delta precision (9 bits) + 7 bits of fractional precision. That's plenty for linear interpolation already.
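
To make the 9 + 7 bit split concrete, here is a small C sketch of how such an index could be packed. The function name `lut_index` is hypothetical, and biasing the delta by 256 (so that the most negative delta maps to index 0) is an assumption about the table layout; using a signed index with the pointer aimed at the middle of the table is equivalent.

```c
#include <stdint.h>

/* Pack a signed 9-bit sample delta (-256..255) and the top 7 bits of a
   16-bit fraction into one table index in the range 0..65535. */
static uint16_t lut_index(int32_t delta, uint16_t frac)
{
    uint32_t d9 = (uint32_t)(delta + 256);       /* bias to 0..511 */
    return (uint16_t)((d9 << 7) | (frac >> 9));  /* 9 bits delta, 7 bits frac */
}
```

512 delta values times 128 fraction steps gives exactly 65536 one-byte entries, which is where the 64K limit comes from.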

Quote:
Originally Posted by chb
EDIT: A, I messed up, delta can be -255 to +255 obviously
EDIT: hmm, can we save it by using
Code:
    move.w (a3,d2.l),d3    ; d3.w = 2x 8-bit signed samples
    move.b d3,d5
    move.b #0,d3           ; or reserve a zero register
    asl.w  #7,d5           ; subtract the sample values * 128 in the next step
    asr.w  #1,d3
    sub.w  d3,d5
    or.b   d7,d5           ; d7.b = copy of fractional sampling position (0..127)
    move.b (a4,d5.w),d3    ; a4 = pre-shifted multiplication LUT
?
Almost... but remember that d7.w (frac) is 0..65535, and you want to use its most significant bits, not the lower 8 bits! See what I mean by instruction overhead?
I could of course change the frac to be 8 bits wide, but given that FT2 supports very low resampling rates, you want to maximize time precision.

EDIT: Ah, I see that you mentioned 8-bit frac resolution in the beginning of the post. But as said, I want more precision.

Last edited by 8bitbubsy; 20 October 2020 at 15:33.
Old 20 October 2020, 16:57   #12
chb
Registered User

 
Join Date: Dec 2014
Location: germany
Posts: 295
Ok, I'll give it another try. Let's assume frac(n+1) = frac(n) + delta_frac in every step, frac(0)=0. I hope that's what you are using.

We could use a different format for frac: use 24 bits and put the LSBs in, let's say, d6 and the MSB in d7. We then need two registers for delta_frac. I do not know your code, so let's assume they are d0 and d1.

Again, this is not tested, so please take it as inspiration rather than working code.

Code:
; compute frac:       
; d7 frac MSB, d6 frac LSBs, 
; d1 delta_frac MSB, d0 delta_frac LSBs 

add.w d0,d6
addx.b d1,d7

; interpolation:

moveq #0,d3          ; clear d3
move.w (a3,d2.l),d3  ; d3.w = 2x 8-bit signed samples
move.b d3,d5           
clr.b d3         
asl.w  #8,d5         ; subtract the sample values * 256 in the next step
sub.l  d3,d5         ; treat as unsigned long, so LUT needs some re-ordering?
move.b d7,d5         ; d7.b = fractional sampling position MSB (0..255)
move.b (a4,d5.l),d3  ; a4 = pre-shifted multiplication LUT
It's a 128k table then, but that might be an acceptable trade-off. I am not 100% sure about that "sub.l d3,d5"; I guess it should be OK to treat those values as unsigned (maybe you'll need to reorder your LUT accordingly), as it's in two's complement. Still, I might be wrong here.

Last edited by chb; 20 October 2020 at 17:22.
Old 20 October 2020, 17:24   #13
8bitbubsy
Registered User

 
Join Date: Sep 2009
Location: Norway
Posts: 1,491
Yeah, I do the sampling position like this:
Code:
	add.w	d6,d7
	addx.l	d1,d2
d6.w = low 16-bit part of delta (sub-samples)
d7.w = temporary sampling position fraction (16-bit)
d1.l = signed high 16-bit part of delta (integer samples, signed because it's negative for backwards sampling mode)
d2.l = sampling position
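
In C terms, that add.w/addx.l pair is a split fixed-point position where the carry out of the 16-bit fraction is added into the integer position. A minimal sketch, with the struct and function names invented for the example:

```c
#include <stdint.h>

typedef struct {
    int32_t  pos;   /* integer sample position (d2.l) */
    uint16_t frac;  /* fractional position, 0..65535 (d7.w) */
} SamplePos;

/* delta_hi: signed integer part of the step (d1.l).
   delta_lo: fractional part of the step (d6.w). */
static void advance(SamplePos *p, int32_t delta_hi, uint16_t delta_lo)
{
    uint32_t sum = (uint32_t)p->frac + delta_lo;
    p->frac = (uint16_t)sum;                    /* add.w  d6,d7 */
    p->pos += delta_hi + (int32_t)(sum >> 16);  /* addx.l d1,d2 (adds the carry) */
}
```

For backwards playback at half speed the step is delta_hi = -1 and delta_lo = 0x8000, i.e. -1 + 0.5 = -0.5 samples per output sample, which is why the high part has to be signed.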

Also remember that I only have d3, d4, d5 and a6 regs available for free use in the mixing loop.

Here's the full inner mixer loop macros: https://pastebin.com/Mi9DpbSE

Last edited by 8bitbubsy; 20 October 2020 at 17:35.
Old 20 October 2020, 18:41   #14
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 52
Posts: 1,250
If you don't have enough registers, you can easily free the d7 or d6 register by using the swap instruction twice on d0 in your loop. For table access, perhaps PC-relative addressing can be used too.
You can check the mixing routine from the Mugician II replayer; it uses all (17) 68k registers, and all parts of the registers, for mixing.
Old 21 October 2020, 15:30   #15
8bitbubsy
Registered User

 
Join Date: Sep 2009
Location: Norway
Posts: 1,491
I managed to calculate a lerp LUT with 9-bit delta precision and 7-bit frac precision, and it works... but... it's about the same speed as the muls code on a 68020! So I was right to begin with, the instruction overhead is slow.

Here's how I did it:
Code:
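	move.w	(a3,d2.l),d3	; d3.w = 2x 8-bit signed samples (s1 high byte, s2 low byte)
	move.b	d3,d5
	ext.w	d5		; d5.w = s2, sign-extended
	asr.w	#8,d3		; d3.w = s1, sign-extended
	sub.w	d3,d5		; d5.w = delta = s2-s1 (-255..255)
	lsl.w	#7,d5		; delta -> upper 9 bits of the LUT index
	move.w	d7,d4		; d4.w = copy of frac (0..65535)
	rol.w	#7,d4		; top 7 bits of frac -> bits 0..6
	and.b	#127,d4		; drop the stray bit rotated into bit 7
	or.b	d4,d5		; d5.w = (delta << 7) | (frac >> 9)
	add.b	(a6,d5.w),d3	; d3.b = s1 + LUT entry (a6 = LUT pointer)
	and.w	#$ff,d3		; clear the upper byte of d3.w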
vs. old muls method:

Code:
	move.w	(a3,d2.l),d3
	move.b	d3,d5
	ext.w	d5
	asr.w	#8,d3
	sub.w	d3,d5
	move.w	d7,d4
	lsr.w	#8,d4
	muls.w	d4,d5
	asr.w	#8,d5
	add.w	d5,d3
Generating the lut:
Code:
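#include <stdint.h>
#include <math.h>

int8_t lerpLUT[65536];

void generateLerpLUT(void)
{
	int8_t *ptr8 = lerpLUT;
	for (int32_t smp = -256; smp < 256; smp++) // 9-bit signed sample delta
	{
		for (int32_t frac = 0; frac < 128; frac++) // 7-bit fraction
		{
			// Convert via int32_t first: casting a double straight to int8_t is
			// undefined when the value is outside -128..127 (deltas reach +/-255).
			// The stored byte is then the value modulo 256 on the usual compilers.
			*ptr8++ = (int8_t)(uint8_t)(int32_t)round(smp * (frac / 128.0));
		}
	}
}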
I could change the LUT to use 8-bit frac precision, and then eliminate the AND'ing, but then the upper part of d5.l has to be cleared (longword LUT access), which probably doesn't make it much faster after all...
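
To get a feel for how much accuracy the 7-bit LUT gives up compared to the multiply version, here is a minimal C model of both paths. All names are made up for the sketch, the table is indexed with a biased delta (delta + 256), and the rounding is done with integers (half away from zero, like round()) so the example does not need the math library. It models the intended arithmetic only, not cycle counts.

```c
#include <stdint.h>

static int8_t lut[65536];   /* 512 deltas x 128 fractions */

/* Reduce to a signed byte modulo 256 (what add.b keeps). */
static int8_t wrap8(int32_t v)
{
    v = ((v % 256) + 256) % 256;
    return (int8_t)(v < 128 ? v : v - 256);
}

static void build_lut(void)
{
    int8_t *p = lut;
    for (int32_t smp = -256; smp < 256; smp++)
        for (int32_t frac = 0; frac < 128; frac++) {
            int32_t v = smp * frac;   /* round(v / 128), half away from zero */
            v = (v >= 0) ? (v + 64) / 128 : -((-v + 64) / 128);
            *p++ = wrap8(v);
        }
}

/* LUT path: 9-bit delta, top 7 bits of frac. */
static int8_t lerp_lut(int8_t s1, int8_t s2, uint16_t frac)
{
    int32_t delta = (int32_t)s2 - s1;
    return wrap8(s1 + lut[(delta + 256) * 128 + (frac >> 9)]);
}

/* Multiply path: top 8 bits of frac, floor after the shift. */
static int8_t lerp_mul(int8_t s1, int8_t s2, uint16_t frac)
{
    int32_t v = ((int32_t)s2 - s1) * (frac >> 8);
    v = (v >= 0) ? v / 256 : -((-v + 255) / 256);
    return (int8_t)(s1 + v);
}

/* Largest absolute difference between the two paths over all inputs. */
static int32_t max_abs_diff(void)
{
    int32_t worst = 0;
    for (int32_t s1 = -128; s1 < 128; s1++)
        for (int32_t s2 = -128; s2 < 128; s2++)
            for (int32_t f = 0; f < 256; f++) {
                uint16_t frac = (uint16_t)(f << 8);
                int32_t d = lerp_lut((int8_t)s1, (int8_t)s2, frac)
                          - lerp_mul((int8_t)s1, (int8_t)s2, frac);
                if (d < 0) d = -d;
                if (d > worst) worst = d;
            }
    return worst;
}
```

After `build_lut()`, `max_abs_diff()` should come out as a couple of LSBs at most; the gap comes from rounding versus flooring plus the one dropped fraction bit, which supports the point that 7 bits of fraction is plenty for linear interpolation.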

Last edited by 8bitbubsy; 21 October 2020 at 18:15.
Old 21 October 2020, 15:44   #16
chb
Registered User

 
Join Date: Dec 2014
Location: germany
Posts: 295
EDIT: Did not see your last post

Thanks for the code, makes it clearer now.
Quote:
Originally Posted by 8bitbubsy
Also remember that I only have d3, d4, d5 and a6 regs available for free use in the mixing loop.
Can a2 be saved and restored after mixing? I do not see it used anywhere in your code.

I'll give it one last try. If one cannot use a2, one may use the high word of d6 instead. Or a7, if available. Or use some PC-relative addressing to free a data register, e.g. for the volume LUTs?
Code:
; Register map: (* indicates modified by chb)
;  a0 = original audio buffer pointer (LRLRLR..)
;  a1 = current left volume LUT pointer
; *a2 = a2 high word = delta_low LSBs , low word = 0 (was: mixer function table)
;  a3 = sample data pointer
;  a4 = current right volume LUT pointer
;  a5 = current audio buffer pointer (LRLRLR..)
; *a6 = pre-shifted multiplication LUT
;  d0.w = bytes to mix
;  d1.l = sample read delta high (signed)
;  d2.l	= sample data position
;  d3   = <temporarily used in mixer loop>
;  d4   = <temporarily used in mixer loop>, needs initialization with 0.l
;  d5   = <temporarily used in mixer loop>
; *d6.b = sample read delta low MSB
; *d7 high word = sample position LSBs
; *d7.b = fractional sample position MSB
; *d7.w MSB = 0
; ============================================================

; interpolation:

moveq #0,d3          	; clear d3
move.w (a3,d2.l),d3  	; d3.w = 2x 8-bit signed samples S1 S2
move.b d3,d4  	     	; save unshifted S2 for further operation
move.l d4,d5   	     	; move.l to clear d5 (upper three bytes of d4 always 0)          
clr.b d3         
lsl.w  #8,d5         	; subtract the sample values * 256 in the next step
sub.l  d3,d5         	; 256*(S2-S1), treat as unsigned long
move.b d7,d5         	; d7.b = fractional sampling position MSB (0..255)
move.b (a6,d5.l),d3  	; a6 = LUT, see below for structure
			; d3 = 256*(S2-S1)*(1-frac)
sub.b	d4,d3	     	; S_frac = S1 + (S2-S1)*frac = S2 - (S2-S1)*(1-frac)
move.w	(a1,d3.w*2),d5
swap	d5
move.w	(a4,d3.w*2),d5
add.l	d5,(a5)+

add.l	a2,d7	     ; a2 delta_low LSBs | 0.w
addx.b  d6,d7	     ; d6.b delta_low MSB
addx.l	d1,d2	     ; sample data position

;; if we cannot use a2 let's take this code:
;; d6 contains delta_low LSBs in the high word and MSB in the lowest byte
;move.w d7,d3	     ; save delta_low MSB
;add.l	d6,d7	     ; add LSBs
;move.w d3,d7	     ; restore MSB
;addx.b   d6,d7	     ; add MSB
;addx.l	d1,d2	     ; sample data position


;  LUT structure stores (1-frac) =^ (255-frac)
;  values are word size, MSB = 0
;
;  {{-128*255>>8,-128*254>>8,...,-128*0>>8},
;   {-127*255>>8,-127*254>>8,...,-127*0>>8},
;   ...
;   {127*255>>8,127*254>>8,...,127*0>>8}}
I cannot test it here, so unverified code again. May be full of eels, eh, bugs
Old 21 October 2020, 15:50   #17
8bitbubsy
Registered User

 
Join Date: Sep 2009
Location: Norway
Posts: 1,491
Given that the code in my previous post (posted not long ago) is slower than the original mul code, I really doubt this will be faster. Thanks for the effort anyway! Appreciated.

EDIT: Oh no, my new LUT code doesn't seem to work after all. I guess I used the wrong binary during testing. But even if it were to work, it'd be slower anyway.

Last edited by 8bitbubsy; 21 October 2020 at 15:57.
Old 21 October 2020, 16:53   #18
chb
Registered User

 
Join Date: Dec 2014
Location: germany
Posts: 295
Quote:
Originally Posted by 8bitbubsy
Given that the code in my previous post (posted not long ago) is slower than the original mul code, I really doubt this will be faster. Thanks for the effort anyway! Appreciated.
Well, it was some nice puzzle. Interesting that the table access is so slow, what was your configuration?
Old 21 October 2020, 16:55   #19
8bitbubsy
Registered User

 
Join Date: Sep 2009
Location: Norway
Posts: 1,491
Quote:
Originally Posted by chb
Well, it was some nice puzzle. Interesting that the table access is so slow, what was your configuration?
WinUAE + stock, cycle-exact A1200 config w/ some fastmem.

I made a benchmark program where I ran 2048 iterations of the mixer macro with rasterbars, to see how much time it takes.
This is probably not a good way to test it, as the scenario is slightly different, but I think it should give a general idea, at least.

Anyway, I'm going to redo the benchmark once I get this to actually work.
Old 21 October 2020, 17:08   #20
8bitbubsy
Registered User

 
Join Date: Sep 2009
Location: Norway
Posts: 1,491
Sorry for the double post, but I managed to fix the code now. After benchmarking, it seems to be just a tiny bit slower. I updated the post with my working version.

EDIT: ARGH! I still managed to compile the previous version thinking I was compiling the LUT version, and apparently it still doesn't work like it should. Haha

EDIT2: D'oh! I put the table in the BSS hunk, so it got cleared lol. It works now, for sure.

Last edited by 8bitbubsy; 21 October 2020 at 17:33.