Optimizing linear interpolation routine for a live resampler

8bitbubsy · 19 October 2020, 21:21

I'm currently porting the Fasttracker 2.09 XM replayer from i386 asm to 68020 asm (Amiga), and I'm rather close to being finished. Of course, mixing at only 28604Hz means you want some kind of resampling interpolation to prevent a ton of aliasing, so I went with linear interpolation since it's the fastest I can think of.

Here's my current code for getting 1x PCM 8-bit interpolation sample out of 2x 8-bit PCM input samples and a 16-bit sampling position fraction:

Code:

    move.w (a3,d2.l),d3    ; d3.w = 2x 8-bit signed samples
    move.b d3,d5
    ext.w  d5
    asr.w  #8,d3
    sub.w  d3,d5
    move.w d7,d4    ; d4.w = copy of fractional sampling position (0..65535)
    lsr.w  #8,d4
    muls.w d4,d5
    asr.w  #8,d5
    add.w  d5,d3    ; d3.w = -128..127

As you can see, this has a ton of overhead already, not to mention that MULS is a slow instruction, even on 68020. I thought of using a precalculated lerp LUT just like FT2 did for its legacy 16-bit mixer, but then comes the problem of getting the delta sample properly shifted with the frac to prepare for LUT indexing. i386 has a neat SHLD/SHRD instruction for doing this with few instructions, but this is not the case for 68020.
I only have d3, d4 and d5 available. a6 can be used for a LUT pointer.

If anyone sees a way to make this faster, or has an idea of how to calculate a LUT for this with little instruction overhead, let me know. I would be really glad! Even a LUT with 4 bits of fractional precision should be OK.
PS: It may look like I am potentially reading out of bounds by reading two samples, but in reality the loaded samples have the correct sample point stored at the end of the sample (or end of loop).
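
For reference, the arithmetic the MULS routine above is aiming for can be modelled in C. This is only a sketch: `lerp8_ref` is a hypothetical name, and it does the multiply in 32 bits rather than reproducing the exact 16-bit register behaviour of the 68020 code.

```c
#include <stdint.h>

/* Hypothetical reference model (not from the thread's code):
   s1, s2 = the two signed 8-bit samples, frac = 16-bit fractional position.
   Only the top 8 bits of frac are used, like the lsr.w #8 in the asm. */
static int lerp8_ref(int8_t s1, int8_t s2, uint16_t frac)
{
    int32_t delta = (int32_t)s2 - (int32_t)s1;   /* -255..255 */
    int32_t f8 = frac >> 8;                      /* 0..255 */
    int32_t prod = delta * f8;

    /* Floor division by 256, which is what an arithmetic shift right does.
       Written out by hand because >> on negative values is
       implementation-defined in C. */
    int32_t scaled = (prod >= 0) ? (prod >> 8) : -((-prod + 255) >> 8);

    return (int)(s1 + scaled);                   /* -128..127 */
}
```

A model like this is mostly useful as something to compare a LUT-based version against, entry by entry, before touching the asm.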

robinsonb5 · 19 October 2020, 21:55

Have you considered mixing at, say, four times the target frequency, then using a simple running-average (so just addition and shifting) as a low-pass filter to downsample the complete mix? It won't exactly be hifi (but then neither will linear interpolation) but it might be good enough, and cheaper than trying to interpolate each channel individually.
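
As a rough illustration of that idea, here is a C sketch of the downsampling step only. It assumes the complete mix has already been rendered at 4x the target rate into a 16-bit buffer; `decimate4_avg` and the buffer layout are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: averages each group of 4 input samples into one
   output sample. 'in' must hold 4 * out_len samples (the 4x-rate mix). */
static void decimate4_avg(const int16_t *in, int16_t *out, size_t out_len)
{
    for (size_t i = 0; i < out_len; i++) {
        int32_t sum = (int32_t)in[4 * i]     + in[4 * i + 1]
                    + (int32_t)in[4 * i + 2] + in[4 * i + 3];
        /* Division by 4; on 68k this would be a shift (asr #2). */
        out[i] = (int16_t)(sum / 4);
    }
}
```

The filter itself is cheap, but the mixer in front of it has to produce four times as many samples per channel.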

8bitbubsy · 19 October 2020, 21:56

Oversampling (of up to 32 channels at 28kHz) is going to be too slow for 68020..68060 Amigas, but otherwise a neat suggestion.

robinsonb5 · 19 October 2020, 23:42

Maybe simple IIR filters, then?
(out_new = (out_old + in) >> 1; or maybe out_new = (out_old + 7*in) >> 3)

It should be cheaper to compute than linear interpolation, and you could potentially have a few different versions of the routine with different coefficients, selected by the upsampling factor.
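
The two filters above could be sketched in C like this. The function names and the 16-bit state variable are assumptions for the example, and division is used in place of `>>` only to keep the C portable for negative values.

```c
#include <stdint.h>

/* Hypothetical one-pole filters matching the two formulas above.
   'state' holds out_old and is updated to out_new. */

/* out_new = (out_old + in) / 2 */
static int16_t iir_half(int16_t *state, int16_t in)
{
    *state = (int16_t)(((int32_t)*state + in) / 2);
    return *state;
}

/* out_new = (out_old + 7*in) / 8, a much lighter smoothing */
static int16_t iir_eighth(int16_t *state, int16_t in)
{
    *state = (int16_t)(((int32_t)*state + 7 * (int32_t)in) / 8);
    return *state;
}
```

Different coefficient pairs would give the "few different versions of the routine" mentioned above.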

Don_Adan · 20 October 2020, 00:22

Quote:

Originally Posted by 8bitbubsy

I'm currently porting the Fasttracker 2.09 XM replayer from i386 asm to 68020 asm (Amiga), and I'm rather close to being finished. Of course, mixing at only 28604Hz means you want some kind of resampling interpolation to prevent a ton of aliasing, so I went with linear interpolation since it's the fastest I can think of.

Here's my current code for getting 1x PCM 8-bit interpolation sample out of 2x 8-bit PCM input samples and a 16-bit sampling position fraction:

Code:

    move.w (a3,d2.l),d3    ; d3.w = 2x 8-bit signed samples
    move.b d3,d5
    ext.w  d5
    asr.w  #8,d3
    sub.w  d3,d5
    move.w d7,d4    ; d4.w = copy of fractional sampling position (0..65535)
    lsr.w  #8,d4
    muls.w d4,d5
    asr.w  #8,d5
    add.w  d5,d3    ; d3.w = -128..127

As you can see, this has a ton of overhead already, not to mention that MULS is a slow instruction, even on 68020. I thought of using a precalculated lerp LUT just like FT2 did for its legacy 16-bit mixer, but then comes the problem of getting the delta sample properly shifted with the frac to prepare for LUT indexing. i386 has a neat SHLD/SHRD instruction for doing this with few instructions, but this is not the case for 68020.
I only have d3, d4 and d5 available. a6 can be used for a LUT pointer.

If anyone sees a way to make this faster, or has an idea of how to calculate a LUT for this with little instruction overhead, let me know. I would be really glad! Even a LUT with 4 bits of fractional precision should be OK.
PS: It may look like I am potentially reading out of bounds by reading two samples, but in reality the loaded samples have the correct sample point stored at the end of the sample (or end of loop).

Maybe something like this:
    moveq #0,d4
    move.w d7,d4
    lsr.w #1,d5
    move.b d5,d4
    add.w (a6,d4.l*2),d3
    ;clr.w d4
    ; addx.w d4,d3 better precision?

Don_Adan · 20 October 2020, 00:40

or maybe 2x bigger table?
    moveq #0,d4
    move.w d5,d4
    lsl.w #7,d4
    add.l d4,d4
    ror.w #8,d7
    move.b d7,d4
    ror.w #8,d7
    add.w (a6,d4.l*2),d3

or single table:

    moveq #0,d4
    move.w d5,d4
    lsl.w #7,d4
    add.l d4,d4
    ror.w #8,d7
    move.b d7,d4
    ror.w #8,d7
    move.b (a6,d4.l),d5
    ext.w d5
    add.w d5,d3

saimon69 · 20 October 2020, 07:02

You seem like the right person to ask, since I was thinking about some unconventional uses of the XM format natively on the Amiga: for example, using the four (or three, or two) Paula channels for replay, but with the possibility of switching pattern channels on and off, to get interactive soundtracks a la Monkey Island (more variations of the theme, where turning one channel on and replacing it with another makes it sound different), or a way for a program to change sample volumes. I know those are not standard replay routines, and that playback is limited to the four hardware channels and Amiga frequencies, but I am considering breaking some barriers...

[edit - can a mod make a separate thread for this? I realized I am OT]

8bitbubsy · 20 October 2020, 11:11

robinsonb5: That filter is not going to help much since you're not handling the fractional position whatsoever. There will still be somewhat hard edges from the nearest neighbor sampling, which will create aliasing.

Don_Adan: I'll maybe test your code, but that's already more instruction overhead than linear interpolation using muls, maybe it's even slower!
I was thinking like: sample1 += centeredLUT[((sample2-sample1) << 4) | ((frac >> 12) & 15)];

But it just seems to be way too many instructions to set this up.
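
To make the indexing concrete, here is a C model of that 4-bit-fraction LUT. It is a sketch under assumptions: the names are invented, a bias of +256 on the delta stands in for the "centered" table pointer, and entries are stored modulo 256 so that the final 8-bit add wraps to the right value.

```c
#include <stdint.h>

#define FRAC_BITS  4
#define FRAC_STEPS (1 << FRAC_BITS)            /* 16 fraction steps */

/* 512 deltas (-256..255) x 16 fraction steps = 8192 bytes */
static uint8_t lut4[512 * FRAC_STEPS];

static void build_lut4(void)
{
    for (int delta = -256; delta < 256; delta++) {
        for (int f = 0; f < FRAC_STEPS; f++) {
            int prod = delta * f;
            /* floor(prod / 16), written out for negative values */
            int v = (prod >= 0) ? prod / FRAC_STEPS
                                : -((-prod + FRAC_STEPS - 1) / FRAC_STEPS);
            lut4[(delta + 256) * FRAC_STEPS + f] = (uint8_t)v;  /* mod 256 */
        }
    }
}

/* sample1 += LUT[(delta << 4) | (frac >> 12)], done with 8-bit wraparound */
static int lerp_lut4(int8_t s1, int8_t s2, uint16_t frac)
{
    int delta = s2 - s1;
    int index = (delta + 256) * FRAC_STEPS + (frac >> 12);
    uint8_t sum = (uint8_t)((uint8_t)s1 + lut4[index]);
    return (int)(sum ^ 0x80) - 0x80;           /* reinterpret as signed 8-bit */
}
```

The table is only 8 KB, so the cost is in building the index, not in the table size.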

Also I don't want to shift the resolution of the delta sample (s2-s1), and I don't want to have a gigantic LUT either...
Here's how FT2 did it in its older mixers:

Code:

           mov ax,[esi]
           xor eax,08080h
           mov bl,al
           sub bl,ah
           sbb bh,bh
           shld ebx,edi,4 ; edi = frac ($xxxx0000)
           xor ah,ah
           add al,[bx+CDA_IPTab+CDA_IPTabSize/2]

In some ways, i386 asm is better than 680x0 asm...

saimon69: I don't think I'm the right person to ask. I'm just directly porting old code, I don't really know how to do your request.

Don_Adan · 20 October 2020, 12:27

Quote:

Originally Posted by 8bitbubsy

robinsonb5: That filter is not going to help much since you're not handling the fractional position whatsoever. There will still be somewhat hard edges from the nearest neighbor sampling, which will create aliasing.

Don_Adan: I'll maybe test your code, but that's already more instruction overhead than linear interpolation using muls, maybe it's even slower!
I was thinking like: sample1 += centeredLUT[((sample2-sample1) << 4) | ((frac >> 12) & 15)];

But it just seems to be way too many instructions to set this up.

Also I don't want to shift the resolution of the delta sample (s2-s1), and I don't want to have a gigantic LUT either...
Here's how FT2 did it in its older mixers:

Code:

           mov ax,[esi]
           xor eax,08080h
           mov bl,al
           sub bl,ah
           sbb bh,bh
           shld ebx,edi,4 ; edi = frac ($xxxx0000)
           xor ah,ah
           add al,[bx+CDA_IPTab+CDA_IPTabSize/2]

In some ways, i386 asm is better than 680x0 asm...

saimon69: I don't think I'm the right person to ask. I'm just directly porting old code, I don't really know how to do your request.

It cannot be slower on 680x0, except maybe on the 68060, because of the muls in your code. Of course it can be made shorter, but that depends on your full loop routine. For example, handling the "copy of fractional sampling position (0..65535)" differently could shorten the code by 2 more instructions.

chb · 20 October 2020, 13:06

Hmm, what is your definition of "gigantic" for a LUT? If you can live with 64k and 8 bit fraction resolution, the following might work (not tested):

Code:

    move.w (a3,d2.l),d3    ; d3.w = 2x 8-bit signed samples
    move.b d3,d5           
    move.b #0,d3           ; or reserve a zero register
    asl.w  #8,d5           ; subtract the sample values * 256 in the next step
    sub.w  d3,d5           ; high byte = delta, low byte = 0
    move.b d7,d5           ; d5.b = copy of fractional sampling position (0..255) 
    move.b (a4,d5.w),d3    ; a4 = pre-shifted multiplication LUT

I did not test it, so I hope I did not mess up the signed values.

EDIT: the LUT would look like this:

Code:

{{-128*0>>8,-128*1>>8,...,-128*255>>8},
 {-127*0>>8,-127*1>>8,...,-127*255>>8},
 ...
 {127*0>>8,127*1>>8,...,127*255>>8}}

EDIT: Ah, I messed up, delta can be -255 to +255 obviously

EDIT: hmm, can we save it by using

Code:

    move.w (a3,d2.l),d3    ; d3.w = 2x 8-bit signed samples
    move.b d3,d5
    move.b #0,d3           ; or reserve a zero register
    asl.w  #7,d5           ; subtract the sample values * 128 in the next step
    asr.w  #1,d3
    sub.w  d3,d5
    or.b   d7,d5           ; d7.b = copy of fractional sampling position (0..127)
    move.b (a4,d5.w),d3    ; a4 = pre-shifted multiplication LUT

?

8bitbubsy · 20 October 2020, 14:24

In terms of LUT size, I don't really want it to be bigger than 64K. That would mean full sample point delta precision (9 bits) + 7 bits of fractional precision. That's plenty for linear interpolation already.

Quote:

Originally Posted by chb

EDIT: A, I messed up, delta can be -255 to +255 obviously

EDIT: hmm, can we save it by using

Code:

    move.w (a3,d2.l),d3    ; d3.w = 2x 8-bit signed samples
    move.b d3,d5
    move.b #0,d3           ; or reserve a zero register
    asl.w  #7,d5           ; subtract the sample values * 128 in the next step
    asr.w  #1,d3
    sub.w  d3,d5
    or.b   d7,d5           ; d7.b = copy of fractional sampling position (0..127)
    move.b (a4,d5.w),d3    ; a4 = pre-shifted multiplication LUT

?

Almost... but remember that d7.w (frac) is 0..65535, and you want to use the most significant bits of it, not the lower 8 bits! See what I mean by instruction overhead?

I could of course change the frac to be 8 bits wide, but given that FT2 supports very low resampling rates, you want to maximize time precision.

EDIT: Ah, I see that you mentioned 8-bit frac resolution in the beginning of the post. But as said, I want more precision.

chb · 20 October 2020, 15:57

Ok, I'll give it another try. Let's assume frac(n+1) = frac(n) + delta_frac in every step, frac(0)=0. I hope that's what you are using.

We could use a different format for frac - we use 24 bit and put the LSBs in let's say d6 and the MSB in d7. We then need two registers for delta_frac - I do not know your code, let's assume they are d0 and d1.

Again, this is not tested, so please rather take it as an inspiration than working code

Code:

; compute frac:       
; d7 frac MSB, d6 frac LSBs, 
; d1 delta_frac MSB, d0 delta_frac LSBs 

add.w d0,d6
addx.b d1,d7

; interpolation:

moveq #0,d3          ; clear d3
move.w (a3,d2.l),d3  ; d3.w = 2x 8-bit signed samples
move.b d3,d5           
clr.b d3         
asl.w  #8,d5         ; subtract the sample values * 256 in the next step
sub.l  d3,d5         ; treat as unsigned long, so LUT needs some re-ordering?
move.b d7,d5         ; d7.b = fractional sampling position MSB (0..255)
move.b (a4,d5.l),d3  ; a4 = pre-shifted multiplication LUT

It's a 128k table then, but that might be an acceptable trade-off. I am not 100% sure about that "sub.l d3,d5" - I guess it should be OK to treat those values as unsigned (maybe you'll need to reorder your LUT accordingly), as it's in two's complement. Still, I might be wrong here.

8bitbubsy · 20 October 2020, 16:24

Yeah, I do the sampling position like this:

Code:

	add.w	d6,d7
	addx.l	d1,d2

d6.w = low 16-bit part of delta (sub-samples)
d7.w = temporary sampling position fraction (16-bit)
d1.l = signed high 16-bit part of delta (integer samples, signed because it's negative for backwards sampling mode)
d2.l = sampling position
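
In C terms, that add/addx pair behaves roughly like the sketch below. `SamplePos` and `step_pos` are names invented for the example; the carry out of the 16-bit fraction add is what addx.l folds into the integer position.

```c
#include <stdint.h>

/* Hypothetical model of the position registers */
typedef struct {
    int32_t  pos;    /* integer sampling position (d2.l) */
    uint16_t frac;   /* sampling position fraction (d7.w) */
} SamplePos;

/* delta_hi = signed integer part of the delta (d1.l),
   delta_lo = fractional part of the delta (d6.w) */
static void step_pos(SamplePos *p, int32_t delta_hi, uint16_t delta_lo)
{
    uint32_t sum = (uint32_t)p->frac + delta_lo;   /* add.w  d6,d7 */
    p->frac = (uint16_t)sum;
    p->pos += delta_hi + (int32_t)(sum >> 16);     /* addx.l d1,d2 (carry in) */
}
```

A backwards step of half a sample is then delta_hi = -1 with delta_lo = 0x8000, which is why only the high part needs to be signed.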

Also remember that I only have d3, d4, d5 and a6 regs available for free use in the mixing loop.

Here's the full inner mixer loop macros: https://pastebin.com/Mi9DpbSE

Don_Adan · 20 October 2020, 17:41

Quote:

Originally Posted by 8bitbubsy

Yeah, I do the sampling position like this:

Code:

	add.w	d6,d7
	addx.l	d1,d2

d6.w = low 16-bit part of delta (sub-samples)
d7.w = temporary sampling position fraction (16-bit)
d1.l = signed high 16-bit part of delta (integer samples, signed because it's negative for backwards sampling mode)
d2.l = sampling position

Also remember that I only have d3, d4, d5 and a6 regs available for free use in the mixing loop.

Here's the full inner mixer loop macros: https://pastebin.com/Mi9DpbSE

If you don't have enough registers, you can easily free the d7 or d6 register by using the swap instruction on d0 twice in your loop. The PC register could perhaps be used for table access too.
You can check the mixing routine from the Mugician II replayer; it uses all (17) 68k registers and all parts of the registers for mixing.

8bitbubsy · 21 October 2020, 14:30

I managed to calculate a lerp LUT with 9-bit delta precision and 7-bit frac precision, and it works... but... it's about the same speed as the muls code on a 68020! So I was right to begin with, the instruction overhead is slow.

Here's how I did it:

Code:

	move.w	(a3,d2.l),d3
	move.b	d3,d5
	ext.w	d5
	asr.w	#8,d3
	sub.w	d3,d5
	lsl.w	#7,d5
	move.w	d7,d4
	rol.w	#7,d4
	and.b	#127,d4
	or.b	d4,d5
	add.b	(a6,d5.w),d3
	and.w	#$ff,d3

vs. old muls method:

Code:

	move.w	(a3,d2.l),d3
	move.b	d3,d5
	ext.w	d5
	asr.w	#8,d3
	sub.w	d3,d5
	move.w	d7,d4
	lsr.w	#8,d4
	muls.w	d4,d5
	asr.w	#8,d5
	add.w	d5,d3

Generating the lut:

Code:

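#include <math.h>    /* round() */
#include <stdint.h>  /* int8_t, int32_t */

int8_t lerpLUT[65536];

/* 512 delta rows (-256..255) x 128 frac steps = 65536 entries */
void generateLerpLUT(void)
{
	int8_t *ptr8 = lerpLUT;
	for (int32_t smp = -256; smp < 256; smp++)
	{
		for (int32_t frac = 0; frac < 128; frac++)
			*ptr8++ = (int8_t)round(smp * (frac / 128.0));
	}
}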

I could change the LUT to use 8-bit frac precision, and then eliminate the AND'ing, but then the upper part of d5.l has to be cleared (longword LUT access), which probably doesn't make it much faster after all...
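
For checking the asm against something, the lookup side can be modelled in C as well. This is a sketch with assumptions: the names are invented, integer rounding replaces round() so no math library is needed, and the table pointer is taken to point at the middle of the table, since (a6,d5.w) sign-extends the 16-bit index.

```c
#include <stdint.h>

/* 512 delta rows (-256..255) x 128 frac steps, entries stored modulo 256 */
static uint8_t lerp_lut7[65536];

static void build_lerp_lut7(void)
{
    for (int delta = -256; delta < 256; delta++) {
        for (int f = 0; f < 128; f++) {
            int prod = delta * f;
            /* round half away from zero, like round(delta * (f / 128.0)) */
            int v = (prod >= 0) ? (prod + 64) / 128 : -((-prod + 64) / 128);
            lerp_lut7[(delta + 256) * 128 + f] = (uint8_t)v;
        }
    }
}

static int lerp_lut7_lookup(int8_t s1, int8_t s2, uint16_t frac)
{
    /* Assumed: the LUT register points at the table centre */
    const uint8_t *center = lerp_lut7 + 32768;

    int delta = s2 - s1;                           /* sub.w  d3,d5 */
    int f7 = frac >> 9;                            /* rol.w #7 + and.b #127 */
    int16_t index = (int16_t)(delta * 128 + f7);   /* lsl.w #7 + or.b, -32768..32767 */

    uint8_t sum = (uint8_t)((uint8_t)s1 + center[index]);  /* add.b (a6,d5.w),d3 */
    return (int)(sum ^ 0x80) - 0x80;               /* reinterpret as signed 8-bit */
}
```

The packed index covers the full signed 16-bit range exactly, which is why 9 delta bits plus 7 fraction bits is the most that fits a word-indexed 64K table.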

chb · 21 October 2020, 14:44

EDIT: Did not see your last post

Thanks for the code, makes it clearer now.

Quote:

Originally Posted by 8bitbubsy

Also remember that I only have d3, d4, d5 and a6 regs available for free use in the mixing loop.

Can a2 be saved and restored after mixing? I do not see it used anywhere in your code.

I give it a last try.

If one cannot use a2, one may use the high word of d6 instead. Or a7 if available. Or use some pc relative addressing to free a data register, e.g. for the volume LUTs?

Code:

; Register map: (* indicates modified by chb)
;  a0 = original audio buffer pointer (LRLRLR..)
;  a1 = current left volume LUT pointer
; *a2 = a2 high word = delta_low LSBs , low word = 0 (was: mixer function table)
;  a3 = sample data pointer
;  a4 = current right volume LUT pointer
;  a5 = current audio buffer pointer (LRLRLR..)
; *a6 = pre-shifted multiplication LUT
;  d0.w = bytes to mix
;  d1.l = sample read delta high (signed)
;  d2.l	= sample data position
;  d3   = <temporarily used in mixer loop>
;  d4   = <temporarily used in mixer loop>, needs initialization with 0.l
;  d5   = <temporarily used in mixer loop>
; *d6.b = sample read delta low MSB
; *d7 high word = sample position LSBs
; *d7.b = fractional sample position MSB
; *d7.w MSB = 0
; ============================================================

; interpolation:

moveq #0,d3          	; clear d3
move.w (a3,d2.l),d3  	; d3.w = 2x 8-bit signed samples S1 S2
move.b d3,d4  	     	; save unshifted S2 for further operation
move.l d4,d5   	     	; move.l to clear d5 (upper three bytes of d4 always 0)          
clr.b d3         
lsl.w  #8,d5         	; subtract the sample values * 256 in the next step
sub.l  d3,d5         	; 256*(S2-S1), treat as unsigned long
move.b d7,d5         	; d7.b = fractional sampling position MSB (0..255)
move.b (a6,d5.l),d3  	; a6 = LUT, see below for structure
			; d3 = 256*(S2-S1)*(1-frac)
sub.b	d4,d3	     	; S_frac = S1 + (S2-S1)*frac = S2 - (S2-S1)*(1-frac)
move.w	(a1,d3.w*2),d5
swap	d5
move.w	(a4,d3.w*2),d5
add.l	d5,(a5)+

add.l	a2,d7	     ; a2 delta_low LSBs | 0.w
addx.b  d6,d7	     ; d6.b delta_low MSB
addx.l	d1,d2	     ; sample data position

;; if we cannot use a2 let's take this code:
;; d6 contains delta_low LSBs in the high word and MSB in the lowest byte
;move.w d7,d3	     ; save delta_low MSB
;add.l	d6,d7	     ; add LSBs
;move.w d3,d7	     ; restore MSB
;addx.b   d6,d7	     ; add MSB
;addx.l	d1,d2	     ; sample data position


;  LUT structure stores (1-frac) =^ (255-frac)
;  values are word size, MSB = 0
;
;  {{-128*255>>8,-128*254>>8,...,-128*0>>8},
;   {-127*255>>8,-127*254>>8,...,-127*0>>8},
;   ...
;   {127*255>>8,127*254>>8,...,127*0>>8}}

I cannot test it here, so unverified code again. May be full of eels, eh, bugs

8bitbubsy · 21 October 2020, 14:50

Given that the code in my previous post (posted not long ago) is slower than the original mul code, I really doubt this will be faster. Thanks for the effort anyway! Appreciated.

EDIT: Oh no, my new LUT code doesn't seem to work after all. I guess I used the wrong binary during testing. But even if it was to work, it'd be slower anyway.

chb · 21 October 2020, 15:53

Quote:

Originally Posted by 8bitbubsy

Given that the code in my previous post (posted not long ago) is slower than the original mul code, I really doubt this will be faster. Thanks for the effort anyway! Appreciated.

Well, it was some nice puzzle.

Interesting that the table access is so slow, what was your configuration?

8bitbubsy · 21 October 2020, 15:55

Quote:

Originally Posted by chb

Well, it was some nice puzzle.

Interesting that the table access is so slow, what was your configuration?

WinUAE + stock, cycle-exact A1200 config w/ some fastmem.

I made a benchmark program where I ran 2048 iterations of the mixer macro with rasterbars, to see how much time it takes.
This is probably not a good way to test it, as the scenario is slightly different, but I think it should give a general idea, at least.

Anyway, I'm going to redo the benchmark once I get this to actually work.

8bitbubsy · 21 October 2020, 16:08

Sorry for the double-post, but I managed to fix the code now. And after benchmark, it seems to be just a tiny bit slower now. I updated the post with my working version.

EDIT: ARGH! I still managed to compile the previous version thinking I was compiling the LUT version, and apparently it still doesn't work like it should. Haha

EDIT2: D'oh! I put the table in the BSS hunk, so it got cleared lol. It works now, for sure.
