Optimizing linear interpolation routine for a live resampler - Page 3

a/b · 22 October 2020, 13:41

Ah, so the a1/a4 tables are a small part of a large volume table. OK, it wasn't clear from the source. Not worth it on its own, assuming no use for d4 to make the code even faster, but good to hear that you managed to salvage it.

There is a typo in MIXCF that keeps surviving, should be rol d5, not d7.

Tomislav · 22 October 2020, 13:53

Quote:

Originally Posted by 8bitbubsy

Code:

    move.w (a3,d2.l),d3    ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    sub.w  d3,d7           ; d7.w = sample2-sample1

Did you maybe forget an ext.w, or is it intentional?

Code:

    move.w (a3,d2.l),d3    ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    ext.w  d3              ; <- this one
    sub.w  d3,d7           ; d7.w = sample2-sample1

Also, "ext.w Dn" is not a replacement for "and.w #$00ff,Dn": if the data register holds, for example, $12345687, it will be extended to $1234ff87. It extends a signed byte to a signed word.

But my first assumption is that both bytes are signed ("; d3.w = 2x signed 8-bit samples") and that you maybe forgot to extend the second one.

8bitbubsy · 22 October 2020, 14:26

a/b: Ah yes, that's just a typo when I'm making my posts, it's correct in the actual source code. I'll edit them again.
Tomislav: It works as intended, and I don't see any problems with it. I'm working with signed word samples at that stage, and the upper word of d3 is never used in the mixing loop. The sample left in d3 is properly sign-extended from 8 bits to a signed word by the ASR shift (which copies the sign bit).

Also here's how it sounds as of now. Playing a 20-channel XM song with linear interpolation at 14-bit 28604Hz on my Amiga 1200 (68030 50MHz):
https://www.dropbox.com/s/s6d24ng9hv...play2.mp3?dl=1

It has some quantization noise because all 16-bit samples are converted to 8-bit at load time. Also, there is no volume ramping, so it clicks/pops sometimes.

18-20 channels seems to be about the absolute max for a 68030 50MHz, and that uses almost all available CPU time.

Don_Adan · 22 October 2020, 14:51

Quote:

Originally Posted by Tomislav

Did you maybe forget an ext.w, or is it intentional?

Code:

    move.w (a3,d2.l),d3    ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    ext.w  d3              ; <- this one
    sub.w  d3,d7           ; d7.w = sample2-sample1

Also, "ext.w Dn" is not a replacement for "and.w #$00ff,Dn": if the data register holds, for example, $12345687, it will be extended to $1234ff87. It extends a signed byte to a signed word.

But my first assumption is that both bytes are signed ("; d3.w = 2x signed 8-bit samples") and that you maybe forgot to extend the second one.

After asr.w #8,d3, ext.w d3 is useless.
ext.w d3 would only make sense after lsr.w #8,d3.
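
To make the point about asr.w concrete, here is a small Python model of the two instruction sequences. It is only a sketch: the function names are invented for illustration and nothing here is taken from the actual mixer source. It compares an arithmetic shift (asr.w #8) against a logical shift followed by ext.w for every possible 16-bit input.

```python
def asr_w_8(w):
    """Model asr.w #8,Dn on a 16-bit value: bit 15 is copied into the vacated bits."""
    r = (w & 0xFFFF) >> 8
    if w & 0x8000:
        r |= 0xFF00
    return r


def lsr_w_8_then_ext_w(w):
    """Model lsr.w #8,Dn followed by ext.w Dn: zero-fill, then sign-extend bit 7."""
    r = (w & 0xFFFF) >> 8
    if r & 0x80:
        r |= 0xFF00
    return r


def to_signed16(w):
    """Interpret a 16-bit pattern as a signed word."""
    return w - 0x10000 if w & 0x8000 else w


def mismatches():
    """Count the 16-bit inputs for which the two sequences give different words."""
    return sum(asr_w_8(w) != lsr_w_8_then_ext_w(w) for w in range(0x10000))


if __name__ == "__main__":
    print(mismatches())
    # High byte $87 of the word $8712, read back as a signed word.
    print(to_signed16(asr_w_8(0x8712)))
```

Under this model the two sequences agree for every input, so an extra ext.w after asr.w #8 changes nothing.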

Don_Adan · 23 October 2020, 06:58

Quote:

Originally Posted by 8bitbubsy

This was actually a bit slower on my 68030 50MHz A1200 benchmark, for some reason??

I pre-centered the volume LUT pointers so that they can handle a signed look-up (still same LUT size), then I increased the lerp LUT size by two, so that it uses signed word values. Now d4 is indeed free and the code is slightly faster. It's currently like this:

Stereo mix:

Code:

; d0.w = bytes to mix
MIXSF MACRO
    move.w (a3,d2.l),d3   ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    sub.w  d3,d7          ; d7.w = sample2-sample1
    move.l d7,d5 
    rol.l  #7,d5
    add.w  (a6,d5.w*2),d3
    move.w (a1,d3.w*2),d5 ; d5.w = left output sample (from volume LUT)
    swap   d5
    move.w (a4,d3.w*2),d5 ; d5.l = (leftSample << 16) | rightSample
    add.l  d5,(a5)+
    add.l  d6,d7          ; increase sampling position
    addx.l d1,d2
    ENDM

Center mix (slightly faster when channel pan is in center):

Code:

; d0.w = bytes to mix
MIXCF MACRO
    move.w (a3,d2.l),d3    ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    sub.w  d3,d7           ; d7.w = sample2-sample1
    move.l d7,d5    
    rol.l  #7,d5
    add.w  (a6,d5.w*2),d3
    move.w (a1,d3.w*2),d3  ; d3.w = output sample (from volume LUT)
    add.w  d3,(a5)+
    add.w  d3,(a5)+
    add.l  d6,d7           ; increase sampling position
    addx.l d1,d2
    ENDM

Getting quite fast now, but the binary is getting big. 433kB as of now.

If you want, you can speed up your code by using a doubled (longword) table at A1:

Stereo mix:

Code:

; d0.w = bytes to mix
MIXSF MACRO
    move.w (a3,d2.l),d3   ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    sub.w  d3,d7          ; d7.w = sample2-sample1
    move.l d7,d5 
    rol.l  #7,d5
    add.w  (a6,d5.w*2),d3
    move.w (a1,d3.w*4),d5 ; d5.w = left output sample (from volume LUT)
    swap   d5
    move.w (a4,d3.w*2),d5 ; d5.l = (leftSample << 16) | rightSample
    ; if A1 and A4 use the same table, use: move.w (a4,d3.w*4),d5
    add.l  d5,(a5)+
    add.l  d6,d7          ; increase sampling position
    addx.l d1,d2
    ENDM

Center mix (slightly faster when channel pan is in center):

Code:

; d0.w = bytes to mix
MIXCF MACRO
    move.w (a3,d2.l),d3    ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    sub.w  d3,d7           ; d7.w = sample2-sample1
    move.l d7,d5    
    rol.l  #7,d5
    add.w  (a6,d5.w*2),d3
    move.l (a1,d3.w*4),d3  ; d3.l = output sample (from volume LUT)
    add.l  d3,(a5)+
    add.l  d6,d7           ; increase sampling position
    addx.l d1,d2
    ENDM

Don_Adan · 23 October 2020, 07:42

One more trick that saves one instruction; you can use it for the single-sided table version too:

Stereo mix:

Code:

; d0.w = bytes to mix
MIXSF MACRO
    move.w (a3,d2.l),d3   ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    sub.w  d3,d7          ; d7.w = sample2-sample1
    move.l d7,d5 
    rol.l  #7,d5
    add.w  (a6,d5.w*2),d3
    move.l (a1,d3.w*4),d5 ; d5 high word = left output sample (from volume LUT)
;    swap   d5            ; no longer needed
    move.w (a4,d3.w*2),d5 ; d5.l = (leftSample << 16) | rightSample
    ; if A1 and A4 use the same table, use: move.w (a4,d3.w*4),d5
    add.l  d5,(a5)+
    add.l  d6,d7          ; increase sampling position
    addx.l d1,d2
    ENDM

8bitbubsy · 23 October 2020, 10:42

That table is the same for both L and R volume (it's pre-offset with the current voice volume), and it's already 256kB in size. I'd rather not double that just to make it a few percent faster at max. Thanks anyway!

Don_Adan · 23 October 2020, 11:19

Quote:

Originally Posted by 8bitbubsy

That table is the same for both L and R volume (it's pre-offset with the current voice volume), and it's already 256kB in size. I'd rather not double that just to make it a few percent faster at max. Thanks anyway!

OK, then only this can be useful: it drops the swap instruction.
move.l (a1,d3.w*2),d5 ; d5 high word = left output sample (from volume LUT)
; swap d5
move.w (a4,d3.w*2),d5 ; d5.l = (leftSample << 16) | rightSample
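
The trick can be modelled in a few lines of Python, assuming big-endian tables of 16-bit entries indexed from 0. The function and parameter names are hypothetical, and the real code indexes with a signed offset from a pre-centered pointer, which this sketch does not attempt. The longword read puts entry i in the high word and entry i+1 in the low word, and the following move.w then overwrites the low word.

```python
def pack_with_swap(left_lut, right_lut, i, d5=0xDEADBEEF):
    """Model: move.w (a1,i*2),d5 / swap d5 / move.w (a4,i*2),d5."""
    d5 = (d5 & 0xFFFF0000) | (left_lut[i] & 0xFFFF)    # move.w: low word only
    d5 = ((d5 << 16) | (d5 >> 16)) & 0xFFFFFFFF        # swap: exchange the halves
    d5 = (d5 & 0xFFFF0000) | (right_lut[i] & 0xFFFF)   # move.w: low word only
    return d5


def pack_with_long_read(left_lut, right_lut, i):
    """Model: move.l (a1,i*2),d5 / move.w (a4,i*2),d5 (no swap)."""
    # The long read fetches entry i (high word) and entry i+1 (low word).
    d5 = ((left_lut[i] & 0xFFFF) << 16) | (left_lut[i + 1] & 0xFFFF)
    d5 = (d5 & 0xFFFF0000) | (right_lut[i] & 0xFFFF)   # low word replaced
    return d5


if __name__ == "__main__":
    left = [0x1111, 0x2222, 0x3333]
    right = [0xAAAA, 0xBBBB, 0xCCCC]
    print(hex(pack_with_swap(left, right, 1)))
    print(hex(pack_with_long_read(left, right, 1)))
```

Both functions produce the same packed longword. Note that the long read touches entry i+1, so in this model the last table entry needs one more readable word after it.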

8bitbubsy · 23 October 2020, 11:29

But is a longword read as fast as a word read on 68020?

chb · 23 October 2020, 12:09

Quote:

Originally Posted by 8bitbubsy

But is a longword read as fast as a word read on 68020?

Yes, but only if your longword is 32-bit aligned.

Don_Adan · 23 October 2020, 13:17

Quote:

Originally Posted by 8bitbubsy

But is a longword read as fast as a word read on 68020?

You can check this, but if I remember right, word and longword reads have the same speed on the 68020 at even addresses. Only reads at odd addresses have a penalty, as can happen with e.g. this:
move.w (a3,d2.l),d3 ; d3.w = 2x signed 8-bit samples

a/b · 23 October 2020, 14:07

020+ has a 32-bit bus, so misaligned 32-bit transfers (e.g. a longword at address 2, 6, ..., as well as at any odd address) are slower since they require two 32-bit transfers.

8bitbubsy · 23 October 2020, 15:51

That might be a problem, since the pre-offsetting of the volume LUTs may mess up the 32-bit alignment in the actual look-up.

ross · 23 October 2020, 15:57

I follow this thread casually, but it is very interesting.
I looked at the latest proposed routines and I don't think there are penalties for misalignments.
32-bit accesses are longword aligned and 16-bit accesses are word aligned. The access speed to memory is maximum.

a/b · 23 October 2020, 16:07

Yeah, because the suggested changes rely on doubling the table size to ensure the alignment. And the table size has already been doubled, so that would be 4x the original size.
It's his call, of course ;p.

ross · 23 October 2020, 16:09

Quote:

Originally Posted by a/b

Yeah, because the suggested changes rely on doubling the table size to ensure the alignment. And the table size has already been doubled, so that would be 4x the original size.
It's his call, of course ;p.

Ah ok

Yes, the version with 'smaller' table can misalign.

Don_Adan · 23 October 2020, 16:23

Quote:

Originally Posted by 8bitbubsy

That might be a problem, since the pre-offsetting of the volume LUTs may mess up the 32-bit alignment in the actual look-up.

Anyway, I think that even with your current LUT it will be the fastest. 50% of the reads have no penalty and 50% do, but you save the swap instruction (4 cycles on the 68020, if I remember right). This is easy to test if you have a benchmark for your mixing routine. Of course, the doubled (longword) table will perhaps be the fastest of all. Just for a speed test, you could build the 2x256KB table version too. It would be interesting to see which version is the fastest, and by how much. You could also test the original mulu version.

BTW, because of the extra odd-address penalty for
move.w (a3,d2.l),d3 ; d3.w = 2x signed 8-bit samples
I also thought about
move.b (a3),d3
move.b 1(a3),d7
but too many other changes would be necessary for the A3 handling, and it may not even be the fastest.

meynaf · 23 October 2020, 17:22

IIRC misaligned longword access is more deadly than misaligned word access because a word access in the middle of a long doesn't have penalty.

This means that if a0 is longword aligned, word accesses will be:
. 0(a0) aligned word, ok
. 1(a0) inside longword, ok
. 2(a0) aligned word, ok
. 3(a0) misaligned access (only 25% of cases)
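
The percentages quoted in this discussion can be reproduced with a deliberately simplified Python model of a 32-bit bus. The assumption, labelled as such, is that an access costs one transfer per aligned longword it touches; caches, narrower bus ports and exact cycle counts are ignored, and the function names are made up for illustration.

```python
def bus_transfers(address, size):
    """Number of 32-bit transfers for an access of `size` bytes at `address`.

    Simplified assumption: memory is seen as aligned 4-byte longwords and
    each longword touched costs one transfer.
    """
    first = address // 4
    last = (address + size - 1) // 4
    return last - first + 1


def penalised_fraction(size, step):
    """Fraction of addresses 0, step, 2*step, ... below 4 that need two transfers."""
    addresses = list(range(0, 4, step))
    slow = [a for a in addresses if bus_transfers(a, size) > 1]
    return len(slow) / len(addresses)


if __name__ == "__main__":
    print(penalised_fraction(2, 1))  # word reads at any byte address
    print(penalised_fraction(4, 2))  # longword reads at any even address
    print(penalised_fraction(4, 4))  # longword reads at longword-aligned addresses
```

In this model a word read at an arbitrary byte address is penalised in 1 case out of 4, a longword read from a word-sized table in 1 case out of 2, and a longword read from a longword-sized table never, which matches the figures given in the posts above.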

8bitbubsy · 24 October 2020, 10:18

Quote:

Originally Posted by Don_Adan

Anyway, I think that even with your current LUT it will be the fastest. 50% of the reads have no penalty and 50% do, but you save the swap instruction (4 cycles on the 68020, if I remember right). This is easy to test if you have a benchmark for your mixing routine. Of course, the doubled (longword) table will perhaps be the fastest of all. Just for a speed test, you could build the 2x256KB table version too. It would be interesting to see which version is the fastest, and by how much. You could also test the original mulu version.

BTW, because of the extra odd-address penalty for
move.w (a3,d2.l),d3 ; d3.w = 2x signed 8-bit samples
I also thought about
move.b (a3),d3
move.b 1(a3),d7
but too many other changes would be necessary for the A3 handling, and it may not even be the fastest.

Oh no, I totally forgot that this has a possible word access misalignment! If only one could use addx on an address register, then I could add the sampling position to a3 before the loop, then read two bytes, then addx on a3. And when I leave the loop, I subtract the sample base from a3 to get the new sampling position, before I handle sample end/loop end.

I could still do that method by using d2 as the relative sampling position (d2 = a3+d2 before loop). Then I do "move.l d3,a3" as the first instruction in the loop.
That's one extra move instruction, so probably slower in the end?

chb · 24 October 2020, 11:18

Quote:

Originally Posted by 8bitbubsy

Oh no, I totally forgot that this has a possible word access misalignment! If only one could use addx on an address register, then I could add the sampling position to a3 before the loop, then read two bytes, then addx on a3. And when I leave the loop, I subtract the sample base from a3 to get the new sampling position, before I handle sample end/loop end.

I could still do that method by using d2 as the relative sampling position (d2 = a3+d2 before loop). Then I do "move.l d3,a3" as the first instruction in the loop.
That's one extra move instruction, so probably slower in the end?

As meynaf says, you'll only have a misaligned word access in 25% of cases. There's IMHO no easy way around that, at least if the integer part of your sample read delta is >1 (so you are skipping a significant proportion of the sample values).

It may give some benefit to have two sets of mixing routines: one for integer delta <= 1 where you access every sample, and potentially also are able to re-use the delta between two samples, so that "move.w (a3,d2.l),d3", asr and ext instructions are only necessary once per input sample, not per output sample. And one for delta > 1.
