Optimizing linear interpolation routine for a live resampler - Page 2

chb · 21 October 2020, 16:26

Quote:

Originally Posted by 8bitbubsy

WinUAE + stock, cycle-exact A1200 config w/ some fastmem.

Ah, ok. AFAIK the only CPU WinUAE emulates cycle-exact is the 68000; for the 68020 and upwards the emulation is less precise (because it is much harder and mostly undocumented). But I don't know if the difference is significant in your case.

Quote:

Originally Posted by 8bitbubsy

EDIT: ARGH! I still managed to compile the previous version thinking I was compiling the LUT version, and apparently it still doesn't work like it should. Haha

Haha, been there, done that so many times.

Don_Adan · 21 October 2020, 16:53

Quote:

Originally Posted by 8bitbubsy

I managed to calculate a lerp LUT with 9-bit delta precision and 7-bit frac precision, and it works... but... it's about the same speed as the muls code on a 68020! So I was right to begin with, the instruction overhead is slow.

Here's how I did it:

Code:

	move.w	(a3,d2.l),d3
	move.b	d3,d5
	ext.w	d5
	asr.w	#8,d3
	sub.w	d3,d5
	lsl.w	#7,d5
	move.w	d7,d4
	rol.w	#7,d4
	and.b	#127,d4
	or.b	d4,d5
	add.b	(a6,d5.w),d3
	ext.w	d3

vs. old muls method:

Code:

	move.w	(a3,d2.l),d3
	move.b	d3,d5
	ext.w	d5
	asr.w	#8,d3
	sub.w	d3,d5
	move.w	d7,d4
	lsr.w	#8,d4
	muls.w	d4,d5
	asr.w	#8,d5
	add.w	d5,d3

Generating the lut:

Code:

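#include <math.h>   // round()
#include <stdint.h>

int8_t lerpLUT[65536]; // 512 deltas (-256..255) x 128 fractions

void generateLerpLUT(void)
{
	int8_t *ptr8 = lerpLUT;
	for (int32_t smp = -256; smp < 256; smp++) // smp = sample2-sample1
	{
		for (int32_t frac = 0; frac < 128; frac++)
			*ptr8++ = (int8_t)(int32_t)round(smp * (frac / 128.0)); // via int32_t: out-of-range values are stored modulo 256, which is all add.b needs
	}
}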

I could change the LUT to use 8-bit frac precision, and then eliminate the AND'ing, but then the upper part of d5.l has to be cleared (longword LUT access), which probably doesn't make it much faster after all...
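To make the indexing in the LUT routine concrete, here is a small C model of it. It is only a sketch: `lerpViaLUT` is a hypothetical helper, the generator is an integer-only rewrite of the one in the post (same rounding, no libm), and it assumes a6 points at the middle of the table (`lerpLUT + 32768`) so that a negative delta gives a negative index.

```c
#include <assert.h>
#include <stdint.h>

static int8_t lerpLUT[65536];

/* Integer-only equivalent of the generator above (rounds half away from zero). */
static void generateLerpLUT(void)
{
	int8_t *ptr8 = lerpLUT;
	for (int32_t smp = -256; smp < 256; smp++)
	{
		for (int32_t frac = 0; frac < 128; frac++)
		{
			int32_t n = smp * frac;
			int32_t r = (n >= 0) ? (n + 64) / 128 : -((-n + 64) / 128);
			*ptr8++ = (int8_t)r; /* out-of-range values wrap to 8 bits on common compilers */
		}
	}
}

/* Hypothetical helper: models the LUT path for one output sample. */
static int8_t lerpViaLUT(int8_t s1, int8_t s2, uint16_t frac16)
{
	int32_t delta = s2 - s1;             /* sub.w d3,d5: -255..255 */
	int32_t frac7 = frac16 >> 9;         /* rol.w #7 + and.b #127: top 7 bits */
	int32_t index = delta * 128 + frac7; /* lsl.w #7 + or.b, read as a signed word */
	assert(index >= -32768 && index < 32768);
	const int8_t *centre = lerpLUT + 32768; /* assumed value of a6 */
	return (int8_t)(s1 + centre[index]);    /* add.b: result taken modulo 256 */
}
```

The table entries for large deltas do not fit in a signed byte, but because the final add is also done modulo 256 the sum still comes out right, e.g. samples -128 and 127 at fraction $8000 give 0.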

First, you should test on a real Amiga 68020, not WinUAE.
Second, this routine is called four times in a row, if I remember right, so you can do

move.w d7,d4
rol.w #7,d4
and.b #127,d4
before the loop rather than inside your loop routine.

8bitbubsy · 21 October 2020, 16:54

No I can't, because the fraction changes for every output sample!

EDIT: Ok, the LUT method is faster on my 68030 50MHz Amiga! So that's good news. Also I edited the code again as I had to replace ext.w d3 with and.w #$ff,d3

So I think the thing to focus on now is to try and optimize this any further, if possible:

Code:

	move.w	(a3,d2.l),d3 ; read 2x signed 8-bit PCM samples
	move.b	d3,d5
	ext.w	d5
	asr.w	#8,d3
	sub.w	d3,d5
	lsl.w	#7,d5
	move.w	d7,d4 ; copy of sampling position fraction (16-bit)
	rol.w	#7,d4
	and.b	#127,d4
	or.b	d4,d5
	add.b	(a6,d5.w),d3
	and.w	#$ff,d3 ; d3.b = -128..127 (ready for volume LUT)

Maybe one can use a bitfield instruction to get the LUT index calculated...

Also sorry for not listening too much to the suggestions, I just thought they were not suitable (changing frac etc).
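The post does not say why `ext.w d3` had to become `and.w #$ff,d3`. One plausible reading, consistent with the `(a1,d3.w*2)` volume look-up that appears later in the thread, is that the volume table has 256 entries starting at the pointer, so the index must be 0..255 rather than -128..127. The sketch below models both instructions; `buildVolumeLUT`, its layout and the 0..64 volume range are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* ext.w d3: sign-extend the low byte into the word (-128..127). */
static int16_t afterExtW(uint16_t d3)
{
	return (int16_t)(int8_t)(d3 & 0xFF);
}

/* and.w #$ff,d3: clear the stale upper byte (0..255). */
static int16_t afterAndW(uint16_t d3)
{
	return (int16_t)(d3 & 0x00FF);
}

/* Hypothetical layout: entry i holds the scaled value of the signed byte (int8_t)i. */
static void buildVolumeLUT(int16_t lut[256], int16_t volume)
{
	assert(volume >= 0 && volume <= 64); /* assumed tracker-style volume range */
	for (int i = 0; i < 256; i++)
		lut[i] = (int16_t)((int8_t)i * volume);
}
```

With this layout, a d3.w of $FFFE (sample -2) indexes entry 254 after the `and.w`, whereas the `ext.w` result of -2 would read in front of the table.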

Don_Adan · 21 October 2020, 17:38

Quote:

Originally Posted by 8bitbubsy

No I can't, because the fraction changes for every output sample!

EDIT: Ok, the LUT method is faster on my 68030 50MHz Amiga! So that's good news. Also I edited the code again as I had to replace ext.w d3 with and.w #$ff,d3

So I think the thing to focus on now is to try and optimize this any further, if possible:

Code:

	move.w	(a3,d2.l),d3
	move.b	d3,d5
	ext.w	d5
	asr.w	#8,d3
	sub.w	d3,d5
	lsl.w	#7,d5
	move.w	d7,d4
	rol.w	#7,d4
	and.b	#127,d4
	or.b	d4,d5
	add.b	(a6,d5.w),d3
	and.w	#$ff,d3

Maybe one can use a bitfield instruction to get the LUT index calculated...

OK, right. The original code:

MIXCF MACRO
move.w (a3,d2.l),d3
move.b d3,d5
ext.w d5
asr.w #8,d3
sub.w d3,d5
move.w d7,d4
lsr.w #8,d4
muls.w d4,d5
asr.w #8,d5
add.w d5,d3
move.w (a1,d3.w*2),d5
add.w d5,(a5)+
add.w d5,(a5)+
add.w d6,d7
addx.l d1,d2
ENDM

Perhaps with some modification this can be faster, or at least the same speed (I don't remember the 68020 timings). d4 is free now too.

Code:

	move.w	(a3,d2.l),d3
	move.b	d3,d5
	ext.w	d5
	asr.w	#8,d3
	sub.w	d3,d5
	lsl.w	#7,d5
;	move.w	d7,d4
;	rol.w	#7,d4
;	and.b	#127,d4
;	or.b	d4,d5
	rol.l	#7,d7
	or.b	d7,d5
	ror.l	#7,d7		; restore d7
	add.b	(a6,d5.w),d3
	and.w	#$ff,d3

.....

	add.l	d6,d7	; the original d6/d7 word values must be in the high word, and the low word must be cleared, before the loop
	addx.l	d1,d2
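The high-word layout can be modelled in C as follows. This is a sketch with hypothetical names (`Phase`, `stepPhase`, `frac7`); it assumes d2 holds the integer sample position and d1 the integer part of the step, as the `addx.l d1,d2` suggests.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
	uint32_t pos;  /* d2: integer sample position */
	uint32_t frac; /* d7: fraction in the high word, low word zero */
} Phase;

/* Rotate left by n bits (1..31), like rol.l #n. */
static uint32_t rol32(uint32_t v, unsigned n)
{
	assert(n > 0 && n < 32);
	return (v << n) | (v >> (32 - n));
}

/* add.l d6,d7 ; addx.l d1,d2 */
static void stepPhase(Phase *p, uint32_t stepInt, uint32_t stepFracHi)
{
	assert((stepFracHi & 0xFFFF) == 0); /* low word must be cleared */
	uint32_t sum = p->frac + stepFracHi;
	uint32_t carry = sum < p->frac;     /* the X flag after add.l */
	p->frac = sum;
	p->pos += stepInt + carry;          /* addx.l adds the carry too */
}

/* rol.l #7,d7 then use d7.b: the top 7 fraction bits, bit 7 stays 0. */
static uint8_t frac7(uint32_t fracHi)
{
	return (uint8_t)rol32(fracHi, 7);
}
```

Because the low word is zero, the rotate leaves bit 7 of the low byte clear, which is why the `and.b #127` is no longer needed before the `or.b`.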

8bitbubsy · 21 October 2020, 18:07

Thanks, that was slightly faster. I decided to store $00ff in d4.w for the and.w, which additionally made it a tiny bit faster.

Code:

	move.w	(a3,d2.l),d3
	move.b	d3,d5
	ext.w	d5
	asr.w	#8,d3
	sub.w	d3,d5
	lsl.w	#7,d5
	rol.l	#7,d7
	or.b	d7,d5
	ror.l	#7,d7
	add.b	(a6,d5.w),d3
	and.w	d4,d3
	move.w	(a1,d3.w*2),d5
	add.w	d5,(a5)+
	add.w	d5,(a5)+
	add.l	d6,d7
	addx.l	d1,d2

a/b · 21 October 2020, 19:25

If you are using only the upper 16 bits of d6/d7, you can use the lower 16 bits instead of d5. Then you only have to rotate d7 left and back right, and d5 is free. Something like:

Code:

	move.w	(a3,d2.l),d3
	move.b	d3,d7
	ext.w	d7
	asr.w	#8,d3
	sub.w	d3,d7
	rol.l	#7,d7
	add.b	(a6,d7.w),d3
	ror.l	#7,d7
	and.w	d4,d3
	move.w	(a1,d3.w*2),d7
	add.w	d7,(a5)+
	add.w	d7,(a5)+
	add.l	d6,d7
	addx.l	d1,d2
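A small C model of why one rotate is enough here: with the fraction in the high word of d7 and the sample delta in the low word, `rol.l #7` leaves `(delta << 7) | (frac >> 9)` in d7.w. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left by n bits (1..31), like rol.l #n. */
static uint32_t rol32(uint32_t v, unsigned n)
{
	assert(n > 0 && n < 32);
	return (v << n) | (v >> (32 - n));
}

/* d7 = fraction (high word) : delta (low word); returns d7.w after rol.l #7. */
static int16_t lutIndex(uint16_t frac16, int16_t delta)
{
	assert(delta >= -255 && delta <= 255);
	uint32_t d7 = ((uint32_t)frac16 << 16) | (uint16_t)delta;
	d7 = rol32(d7, 7);
	return (int16_t)(uint16_t)d7; /* low word, read as a signed index */
}
```

The nine significant bits of the delta end up in bits 7..15 and the top seven fraction bits in bits 0..6, so the result equals `delta * 128 + (frac16 >> 9)`.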

8bitbubsy · 21 October 2020, 19:31

Awesome! That worked and gave a nice speed improvement.

a/b · 21 October 2020, 19:47

Laaaag ;p.
And if you have no use for d5, I think move d7,d5 with ror d5 should be faster than ror/rol d7 on a 020/030.

8bitbubsy · 21 October 2020, 19:56

Quote:

Originally Posted by a/b

Laaaag ;p.
And if you have no use for d5, I think move d7,d5 with ror d5 should be faster than ror/rol d7 on a 020/030.

d5 is sadly in use.
EDIT: d5.l can be used for the center mixer. Now, I'm a bit confused as to what you meant I could do with d5.

Here's the current mixers:

Stereo mix:

Code:

; d0.w = bytes to mix
MIXSF MACRO
    move.w (a3,d2.l),d3   ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    sub.w  d3,d7          ; d7.w = sample2-sample1
    rol.l  #7,d7
    add.b  (a6,d7.w),d3    
    and.w  d4,d3          ; d3.w = $00xx = 8-bit signed interpolated sample
    ror.l  #7,d7
    move.w (a1,d3.w*2),d5 ; d5.w = left output sample (from volume LUT)
    swap   d5
    move.w (a4,d3.w*2),d5 ; d5.l = (leftSample << 16) | rightSample
    add.l  d5,(a5)+
    add.l  d6,d7          ; increase sampling position
    addx.l d1,d2
    ENDM

Center mix (slightly faster when channel pan is in center):

Code:

; d0.w = bytes to mix
MIXCF MACRO
    move.w (a3,d2.l),d3    ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    sub.w  d3,d7           ; d7.w = sample2-sample1
    rol.l  #7,d7
    add.b  (a6,d7.w),d3
    and.w  d4,d3           ; d3.w = $00xx = 8-bit signed interpolated sample
    ror.l  #7,d7
    move.w (a1,d3.w*2),d3  ; d3.w = output sample (from volume LUT)
    add.w  d3,(a5)+
    add.w  d3,(a5)+
    add.l  d6,d7           ; increase sampling position
    addx.l d1,d2
    ENDM
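One property of the packed longword add in the stereo mixer that the thread does not discuss: `add.l` treats the left/right pair as a single 32-bit number, so a carry out of the right (low) word lands in the left (high) word. The C sketch below, with hypothetical helper names, shows the effect; whether an occasional off-by-one in the left channel matters before the 14-bit post-mix stage is a judgment call.

```c
#include <assert.h>
#include <stdint.h>

/* move.w left,d5 ; swap d5 ; move.w right,d5 */
static uint32_t packStereo(int16_t left, int16_t right)
{
	return ((uint32_t)(uint16_t)left << 16) | (uint16_t)right;
}

static int16_t leftOf(uint32_t frame)
{
	return (int16_t)(frame >> 16);
}

static int16_t rightOf(uint32_t frame)
{
	return (int16_t)(frame & 0xFFFF);
}

/* Mixing right = -1 with right = +1 wraps to 0 and bumps left by one. */
static void demoCarry(void)
{
	uint32_t frame = packStereo(0, -1); /* $0000FFFF */
	frame += packStereo(0, 1);          /* add.l d5,(a5)+ */
	assert(rightOf(frame) == 0);
	assert(leftOf(frame) == 1);         /* carry leaked from the right word */
}
```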

Don_Adan · 21 October 2020, 20:36

Quote:

Originally Posted by 8bitbubsy

d5 is sadly in use.
EDIT: d5.l can be used for the center mixer. Now, I'm a bit confused as to what you meant I could do with d5.

Here's the current mixers:

Stereo mix:

Code:

; d0.w = bytes to mix
MIXSF MACRO
    move.w (a3,d2.l),d3   ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    sub.w  d3,d7          ; d7.w = sample2-sample1
    rol.l  #7,d7
    add.b  (a6,d7.w),d3    
    and.w  d4,d3          ; d3.w = $00xx = 8-bit signed interpolated sample
    ror.l  #7,d7
    move.w (a1,d3.w*2),d5 ; d5.w = left output sample (from volume LUT)
    swap   d5
    move.w (a4,d3.w*2),d5 ; d5.l = (leftSample << 16) | rightSample
    add.l  d5,(a5)+
    add.l  d6,d7          ; increase sampling position
    addx.l d1,d2
    ENDM

Center mix (slightly faster when channel pan is in center):

Code:

; d0.w = bytes to mix
MIXCF MACRO
    move.w (a3,d2.l),d3    ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    sub.w  d3,d7           ; d7.w = sample2-sample1
    rol.l  #7,d7
    add.b  (a6,d7.w),d3
    and.w  d4,d3           ; d3.w = $00xx = 8-bit signed interpolated sample
    ror.l  #7,d7
    move.w (a1,d3.w*2),d3  ; d3.w = output sample (from volume LUT)
    add.w  d3,(a5)+
    add.w  d3,(a5)+
    add.l  d6,d7           ; increase sampling position
    addx.l d1,d2
    ENDM

Perhaps something like this:

Code:

; d0.w = bytes to mix
MIXSF MACRO
    move.w (a3,d2.l),d3   ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    sub.w  d3,d7          ; d7.w = sample2-sample1
    move.l d7,d5
    rol.l  #7,d5
    add.b  (a6,d5.w),d3    
    and.w  d4,d3          ; d3.w = $00xx = 8-bit signed interpolated sample
    move.w (a1,d3.w*2),d5 ; d5.w = left output sample (from volume LUT)
    swap   d5
    move.w (a4,d3.w*2),d5 ; d5.l = (leftSample << 16) | rightSample
    add.l  d5,(a5)+
    add.l  d6,d7          ; increase sampling position
    addx.l d1,d2
    ENDM

8bitbubsy · 21 October 2020, 21:12

Ah, like that! Yes, it made it slightly faster.

So now we're left with:

Stereo mix:

Code:

; d0.w = bytes to mix
MIXSF MACRO
    move.w (a3,d2.l),d3   ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    sub.w  d3,d7          ; d7.w = sample2-sample1
    move.l d7,d5 
    rol.l  #7,d5
    add.b  (a6,d5.w),d3    
    and.w  d4,d3          ; d3.w = $00xx = 8-bit signed interpolated sample
    move.w (a1,d3.w*2),d5 ; d5.w = left output sample (from volume LUT)
    swap   d5
    move.w (a4,d3.w*2),d5 ; d5.l = (leftSample << 16) | rightSample
    add.l  d5,(a5)+
    add.l  d6,d7          ; increase sampling position
    addx.l d1,d2
    ENDM

Center mix (slightly faster when channel pan is in center):

Code:

; d0.w = bytes to mix
MIXCF MACRO
    move.w (a3,d2.l),d3    ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    sub.w  d3,d7           ; d7.w = sample2-sample1
    move.l d7,d5    
    rol.l  #7,d5
    add.b  (a6,d5.w),d3
    and.w  d4,d3           ; d3.w = $00xx = 8-bit signed interpolated sample
    move.w (a1,d3.w*2),d3  ; d3.w = output sample (from volume LUT)
    add.w  d3,(a5)+
    add.w  d3,(a5)+
    add.l  d6,d7           ; increase sampling position
    addx.l d1,d2
    ENDM

Don_Adan · 21 October 2020, 21:35

Quote:

Originally Posted by 8bitbubsy

Ah, like that! Yes, it made it slightly faster.

So now we're left with:

Stereo mix:

Code:

; d0.w = bytes to mix
MIXSF MACRO
    move.w (a3,d2.l),d3   ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    sub.w  d3,d7          ; d7.w = sample2-sample1
    move.l d7,d5 
    rol.l  #7,d5
    add.b  (a6,d5.w),d3    
    and.w  d4,d3          ; d3.w = $00xx = 8-bit signed interpolated sample
    move.w (a1,d3.w*2),d5 ; d5.w = left output sample (from volume LUT)
    swap   d5
    move.w (a4,d3.w*2),d5 ; d5.l = (leftSample << 16) | rightSample
    add.l  d5,(a5)+
    add.l  d6,d7          ; increase sampling position
    addx.l d1,d2
    ENDM

Center mix (slightly faster when channel pan is in center):

Code:

; d0.w = bytes to mix
MIXCF MACRO
    move.w (a3,d2.l),d3    ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    sub.w  d3,d7           ; d7.w = sample2-sample1
    move.l d7,d5    
    rol.l  #7,d5
    add.b  (a6,d5.w),d3
    and.w  d4,d3           ; d3.w = $00xx = 8-bit signed interpolated sample
    move.w (a1,d3.w*2),d3  ; d3.w = output sample (from volume LUT)
    add.w  d3,(a5)+
    add.w  d3,(a5)+
    add.l  d6,d7           ; increase sampling position
    addx.l d1,d2
    ENDM

Perhaps, but if A5 points to chip RAM, the writes in MIXCF could perhaps be pipelined on the 68030. meynaf is the expert on 68030 pipelining. Or maybe one longword ADD would be faster than two word ADDs?

8bitbubsy · 21 October 2020, 21:36

I'm mixing to a 16-bit fastmem stereo buffer, then in the post-mixing stage I use a post-mixing table to convert it to pre-clamped, normalized 14-bit values for Paula (yes, I use 14-bit output).
I played around with trying to make it use longword add for center mix, but it turned out to be slower. E.g. move.w d3,d5 / swap d5 / move.w d3,d5 / add.l d5,(a5)+

a/b · 21 October 2020, 22:15

It might do nothing, since right-shift is extra fast on 020, but just in case... Replace the first four instructions with:

Code:

	move.w	(a3,d2.l),d7	; d7.w = 2x signed 8-bit samples
	bfexts	d7{16:8},d3
	ext.w	d7
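For readers less familiar with the bitfield syntax: `bfexts d7{16:8},d3` extracts the 8 bits starting 16 bits below the top of d7 (that is, bits 15..8) and sign-extends them. A C sketch of what the three instructions compute, with a hypothetical function name:

```c
#include <assert.h>
#include <stdint.h>

/* Models: move.w (a3,d2.l),d7 ; bfexts d7{16:8},d3 ; ext.w d7 */
static void splitSamples(uint16_t word, int16_t *s1, int16_t *s2)
{
	assert(s1 && s2);
	*s1 = (int8_t)(word >> 8);   /* bfexts: high byte, sign-extended (d3) */
	*s2 = (int8_t)(word & 0xFF); /* ext.w: low byte, sign-extended (d7.w) */
}
```

The result is the same pair of values the original four instructions leave in d3 and d7.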

8bitbubsy · 21 October 2020, 22:20

Quote:

Originally Posted by a/b

It might do nothing, since right-shift is extra fast on 020, but just in case... Replace the first four instructions with:

Code:

    move.w    (a3,d2.l),d7    ; d7.w = 2x signed 8-bit samples
    bfexts    d7{16:8},d3
    ext.w    d7

Just benchmarked it on my 68030 50MHz A1200, and it's about 2-4% slower.

Don_Adan · 21 October 2020, 22:29

Quote:

Originally Posted by 8bitbubsy

I'm mixing to a 16-bit fastmem stereo buffer, then in the post-mixing stage I use a post-mixing table to convert it to pre-clamped, normalized 14-bit values for Paula (yes, I use 14-bit output).
I played around with trying to make it use longword add for center mix, but it turned out to be slower. E.g. move.w d3,d5 swap d5 move.w d3,d5 add.l d5,(a5)+

If you want, you can check this:

Code:

; d0.w = bytes to mix
MIXCF MACRO
    move.w (a3,d2.l),d3    ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    sub.w  d3,d7           ; d7.w = sample2-sample1
    move.l d7,d5    
    rol.l  #7,d5
    add.b  (a6,d5.w),d3
    and.w  d4,d3           ; d3.w = $00xx = 8-bit signed interpolated sample
    add.w (a1,d3.w*2),(a5)+ 
    add.l  d6,d7           ; increase sampling position
    addx.l d1,d2
    add.w (a1,d3.w*2),(a5)+ 
   ENDM

8bitbubsy · 21 October 2020, 22:30

That's an extra look-up + word read from memory, can that possibly be faster?
Also no need to put instructions in between audio buffer writes, I'm not using chipmem here.

add (An,Dn),(An)+ is also not a valid opcode. You can only do that on move, I think.

a/b · 21 October 2020, 23:25

OK, another idea...
d3 bits 8-15 are either all 0 or all 1, so if you make the a1/a4 tables twice as large (512 words instead of 256) with indices -256 to 255 (-256 = 0, -255 = 1, ... -1 = 255) and a1/a4 pointing to index 0, you can drop:

Code:

	and.w	d4,d3

And d4 is now free.
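A C sketch of the doubled table described above, under the assumption that each entry is the signed sample byte times a volume in 0..64 (the thread does not show how the volume tables are filled). `volTable`, `volCentre` and `buildMirroredVolumeLUT` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

static int16_t volTable[512];

/* Pointer to index 0, i.e. the middle of the table (what a1/a4 would hold). */
static const int16_t *const volCentre = volTable + 256;

static void buildMirroredVolumeLUT(int16_t volume)
{
	assert(volume >= 0 && volume <= 64); /* assumed volume range */
	for (int i = 0; i < 256; i++)
	{
		int16_t v = (int16_t)((int8_t)i * volume);
		volTable[256 + i] = v; /* index i       (d3.w = $00xx) */
		volTable[i] = v;       /* index i - 256 (d3.w = $FFxx) */
	}
}
```

Either form of d3.w now reaches the same value, e.g. `volCentre[-2]` and `volCentre[254]` both return the entry for sample byte $FE, so the mask is not needed.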

Don_Adan · 21 October 2020, 23:42

Quote:

Originally Posted by 8bitbubsy

That's an extra look-up + word read from memory, can that possibly be faster?
Also no need to put instructions in between audio buffer writes, I'm not using chipmem here.

add (An,Dn),(An)+ is also not a valid opcode. You can only do that on move, I think.

Right, there is no such opcode. But you can try this. Writes to fastmem are pipelined too, if I remember right.

Code:

; d0.w = bytes to mix
MIXCF MACRO
    move.w (a3,d2.l),d3    ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    sub.w  d3,d7           ; d7.w = sample2-sample1
    move.l d7,d5    
    rol.l  #7,d5
    add.b  (a6,d5.w),d3
    and.w  d4,d3           ; d3.w = $00xx = 8-bit signed interpolated sample
    move.w (a1,d3.w*2),d3  ; d3.w = output sample (from volume LUT)
    add.w  d3,(a5)+
    add.l  d6,d7           ; increase sampling position
    addx.l d1,d2
    add.w  d3,(a5)+
    ENDM

8bitbubsy · 22 October 2020, 12:20

Quote:

Originally Posted by Don_Adan

Right, there is no such opcode. But you can try this. Writes to fastmem are pipelined too, if I remember right.

Code:

; d0.w = bytes to mix
MIXCF MACRO
    move.w (a3,d2.l),d3    ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    sub.w  d3,d7           ; d7.w = sample2-sample1
    move.l d7,d5    
    rol.l  #7,d5
    add.b  (a6,d5.w),d3
    and.w  d4,d3           ; d3.w = $00xx = 8-bit signed interpolated sample
    move.w (a1,d3.w*2),d3  ; d3.w = output sample (from volume LUT)
    add.w  d3,(a5)+
    add.l  d6,d7           ; increase sampling position
    addx.l d1,d2
    add.w  d3,(a5)+
    ENDM

This was actually a bit slower on my 68030 50MHz A1200 benchmark, for some reason??

Quote:

Originally Posted by a/b

OK, another idea...
d3 bits 8-15 are either all 0 or all 1, so if you make the a1/a4 tables twice as large (512 words instead of 256) with indices -256 to 255 (-256 = 0, -255 = 1, ... -1 = 255) and a1/a4 pointing to index 0, you can drop:

Code:

    and.w    d4,d3

And d4 is now free.

I pre-centered the volume LUT pointers so that they can handle a signed look-up (still the same LUT size), then I doubled the lerp LUT size so that it holds signed word values. Now d4 is indeed free and the code is slightly faster. It's currently like this:

Stereo mix:

Code:

; d0.w = bytes to mix
MIXSF MACRO
    move.w (a3,d2.l),d3   ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    sub.w  d3,d7          ; d7.w = sample2-sample1
    move.l d7,d5 
    rol.l  #7,d5
    add.w  (a6,d5.w*2),d3
    move.w (a1,d3.w*2),d5 ; d5.w = left output sample (from volume LUT)
    swap   d5
    move.w (a4,d3.w*2),d5 ; d5.l = (leftSample << 16) | rightSample
    add.l  d5,(a5)+
    add.l  d6,d7          ; increase sampling position
    addx.l d1,d2
    ENDM

Center mix (slightly faster when channel pan is in center):

Code:

; d0.w = bytes to mix
MIXCF MACRO
    move.w (a3,d2.l),d3    ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    sub.w  d3,d7           ; d7.w = sample2-sample1
    move.l d7,d5    
    rol.l  #7,d5
    add.w  (a6,d5.w*2),d3
    move.w (a1,d3.w*2),d3  ; d3.w = output sample (from volume LUT)
    add.w  d3,(a5)+
    add.w  d3,(a5)+
    add.l  d6,d7           ; increase sampling position
    addx.l d1,d2
    ENDM

Getting quite fast now, but the binary is getting big. 433kB as of now.
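As a reference for checking the final macros, here is a C model of one center-mix step without the buffer write. It is a sketch under assumptions: the word LUT is filled with the same rounding as the earlier byte generator, the volume table is simply sample times a volume in 0..64, and all names (`lerpLUT16`, `volLUT`, `buildTables`, `mixOne`) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static int16_t lerpLUT16[65536]; /* (delta, frac7) -> signed word */
static int16_t volLUT[256];      /* sample -128..127 -> scaled output */

static void buildTables(int16_t volume)
{
	assert(volume >= 0 && volume <= 64); /* assumed volume range */
	for (int32_t delta = -256; delta < 256; delta++)
	{
		for (int32_t f = 0; f < 128; f++)
		{
			int32_t n = delta * f; /* round n/128 half away from zero */
			int32_t r = (n >= 0) ? (n + 64) / 128 : -((-n + 64) / 128);
			lerpLUT16[(delta + 256) * 128 + f] = (int16_t)r;
		}
	}
	for (int32_t s = -128; s < 128; s++)
		volLUT[s + 128] = (int16_t)(s * volume);
}

/* One MIXCF-style step: returns the value added to both output words. */
static int16_t mixOne(int8_t s1, int8_t s2, uint16_t frac16)
{
	const int16_t *lerpCentre = lerpLUT16 + 32768; /* assumed a6 */
	const int16_t *volCentre = volLUT + 128;       /* pre-centered a1 */
	int32_t index = (s2 - s1) * 128 + (frac16 >> 9);
	int16_t smp = (int16_t)(s1 + lerpCentre[index]); /* add.w (a6,d5.w*2),d3 */
	return volCentre[smp];                           /* move.w (a1,d3.w*2),d3 */
}
```

Since the word entries are no longer stored modulo 256, the interpolated sample stays within -128..127 as a proper signed word, which is what lets the pre-centered 256-entry volume table work without a mask.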
