English Amiga Board


Old 22 October 2020, 13:41   #41
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,050
Ah, so the a1/a4 tables are a small part of a large volume table. OK, it wasn't clear from the source. Not worth it on its own, assuming no use for d4 to make the code even faster, but good to hear that you managed to salvage it.

There is a typo in MIXCF that keeps surviving, should be rol d5, not d7.
Old 22 October 2020, 13:53   #42
Tomislav
Registered User
 
Join Date: Aug 2014
Location: Zagreb / Croatia
Posts: 302
Quote:
Originally Posted by 8bitbubsy
Code:
    move.w (a3,d2.l),d3    ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    sub.w  d3,d7           ; d7.w = sample2-sample1
Did you maybe forget an ext.w, or is it intentional?
Code:
    move.w (a3,d2.l),d3    ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    ext.w  d3              ; <- this one
    sub.w  d3,d7           ; d7.w = sample2-sample1
Also, "ext.w Dn" is not a replacement for "and.w #$00ff,Dn": if the data register holds, for example, $12345687, ext.w turns it into $1234ff87. It sign-extends a signed byte to a signed word.

But my first assumption is that both bytes are signed ("; d3.w = 2x signed 8-bit samples"), and that you maybe forgot to extend the second one.
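
To make the difference concrete, here is a minimal C sketch that models the two instructions on a 32-bit register value. The function names `ext_w` and `and_w_00ff` are made up for this illustration; they are not part of any source in this thread.

```c
#include <stdint.h>

/* Model of 68k "ext.w Dn": sign-extend the low byte into the low word,
   leaving the upper word of the register untouched. */
uint32_t ext_w(uint32_t dn)
{
    uint32_t low = dn & 0xFFu;
    if (low & 0x80u)
        low |= 0xFF00u;               /* negative byte: set bits 8..15 */
    return (dn & 0xFFFF0000u) | low;
}

/* Model of "and.w #$00ff,Dn": clear bits 8..15, upper word untouched. */
uint32_t and_w_00ff(uint32_t dn)
{
    return dn & 0xFFFF00FFu;
}
```

With the value from the post, `ext_w(0x12345687)` gives `0x1234FF87`, while `and_w_00ff(0x12345687)` gives `0x12340087`. The two only agree when bit 7 of the low byte is clear.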
Old 22 October 2020, 14:26   #43
8bitbubsy
Registered User
 
Join Date: Sep 2009
Location: Norway
Posts: 1,712
a/b: Ah yes, that's just a typo when I'm making my posts, it's correct in the actual source code. I'll edit them again.
Tomislav: It works as intended, and I don't see any problems with it. I'm working with signed word samples at that stage, and the upper word of d3 is never used in the mixing loop. The other sample is properly converted to a signed word by the ASR shift (which replicates the sign bit).

Also here's how it sounds as of now. Playing a 20-channel XM song with linear interpolation at 14-bit 28604Hz on my Amiga 1200 (68030 50MHz):
https://www.dropbox.com/s/s6d24ng9hv...play2.mp3?dl=1

It has some quantization noise because all 16-bit samples are converted to 8-bit at load time. Also, there is no volume ramping, so it clicks/pops sometimes.

18-20 channels seems to be about the absolute max for a 68030 50MHz, and it will use almost all available CPU time.

Last edited by 8bitbubsy; 22 October 2020 at 14:55.
Old 22 October 2020, 14:51   #44
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,024
Quote:
Originally Posted by Tomislav
Did you maybe forget an ext.w, or is it intentional?
Code:
    move.w (a3,d2.l),d3    ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    ext.w  d3              ; <- this one
    sub.w  d3,d7           ; d7.w = sample2-sample1
Also, "ext.w Dn" is not a replacement for "and.w #$00ff,Dn": if the data register holds, for example, $12345687, ext.w turns it into $1234ff87. It sign-extends a signed byte to a signed word.

But my first assumption is that both bytes are signed ("; d3.w = 2x signed 8-bit samples"), and that you maybe forgot to extend the second one.
After asr.w #8,d3, ext.w d3 is useless.
Only after lsr.w #8,d3 would ext.w d3 make sense.
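
A small C sketch can check this exhaustively. It models the two instructions on the low word only; the helper names are invented for the example. After an arithmetic shift the high byte is already sign-extended across the word, so a following ext.w changes nothing, whereas after a logical shift (lsr.w #8) the upper byte is zero and ext.w would still be needed.

```c
#include <stdint.h>

/* Model of "asr.w #8,Dn" on the low word: the high byte ends up
   sign-extended across the whole word. */
int16_t asr_w_8(uint16_t w)
{
    int v = w >> 8;               /* high byte, 0..255 */
    if (v >= 128)
        v -= 256;                 /* reinterpret as signed 8-bit */
    return (int16_t)v;
}

/* Model of "ext.w Dn" on the low word: sign-extend the low byte. */
int16_t ext_w_low(uint16_t w)
{
    int v = w & 0xFF;
    if (v >= 128)
        v -= 256;
    return (int16_t)v;
}

/* Returns 1 if ext.w after asr.w #8 never changes the result,
   checked for every possible 16-bit input. */
int ext_after_asr_is_redundant(void)
{
    for (uint32_t w = 0; w <= 0xFFFFu; w++) {
        int16_t shifted = asr_w_8((uint16_t)w);
        if (ext_w_low((uint16_t)shifted) != shifted)
            return 0;
    }
    return 1;
}
```

For the sample pair $8F7F, `asr_w_8` yields -113 (from $8F) and `ext_w_low` yields 127 (from $7F), so sample2-sample1 is 240.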
Old 23 October 2020, 06:58   #45
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,024
Quote:
Originally Posted by 8bitbubsy
This was actually a bit slower on my 68030 50MHz A1200 benchmark, for some reason??


I pre-centered the volume LUT pointers so that they can handle a signed look-up (still same LUT size), then I increased the lerp LUT size by two, so that it uses signed word values. Now d4 is indeed free and the code is slightly faster. It's currently like this:

Stereo mix:
Code:
; d0.w = bytes to mix
MIXSF MACRO
    move.w (a3,d2.l),d3   ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    sub.w  d3,d7          ; d7.w = sample2-sample1
    move.l d7,d5 
    rol.l  #7,d5
    add.w  (a6,d5.w*2),d3
    move.w (a1,d3.w*2),d5 ; d5.w = left output sample (from volume LUT)
    swap   d5
    move.w (a4,d3.w*2),d5 ; d5.l = (leftSample << 16) | rightSample
    add.l  d5,(a5)+
    add.l  d6,d7          ; increase sampling position
    addx.l d1,d2
    ENDM
Center mix (slightly faster when channel pan is in center):

Code:
; d0.w = bytes to mix
MIXCF MACRO
    move.w (a3,d2.l),d3    ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    sub.w  d3,d7           ; d7.w = sample2-sample1
    move.l d7,d5    
    rol.l  #7,d5
    add.w  (a6,d5.w*2),d3
    move.w (a1,d3.w*2),d3  ; d3.w = output sample (from volume LUT)
    add.w  d3,(a5)+
    add.w  d3,(a5)+
    add.l  d6,d7           ; increase sampling position
    addx.l d1,d2
    ENDM
Getting quite fast now, but the binary is getting big. 433kB as of now.
If you want, you can speed up your code by using a doubled (longword) table at A1:

Stereo mix:
Code:
; d0.w = bytes to mix
MIXSF MACRO
    move.w (a3,d2.l),d3   ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    sub.w  d3,d7          ; d7.w = sample2-sample1
    move.l d7,d5 
    rol.l  #7,d5
    add.w  (a6,d5.w*2),d3
    move.w (a1,d3.w*4),d5 ; d5.w = left output sample (from volume LUT)
    swap   d5
    move.w (a4,d3.w*2),d5 ; d5.l = (leftSample << 16) | rightSample
 ; if A1 and A4 used same table, use     move.w (a4,d3.w*4),d5
    add.l  d5,(a5)+
    add.l  d6,d7          ; increase sampling position
    addx.l d1,d2
    ENDM
Center mix (slightly faster when channel pan is in center):

Code:
; d0.w = bytes to mix
MIXCF MACRO
    move.w (a3,d2.l),d3    ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    sub.w  d3,d7           ; d7.w = sample2-sample1
    move.l d7,d5    
    rol.l  #7,d5
    add.w  (a6,d5.w*2),d3
    move.l (a1,d3.w*4),d3  ; d3.l = output sample (from volume LUT)
    add.l  d3,(a5)+
    add.l  d6,d7           ; increase sampling position
    addx.l d1,d2
    ENDM
Old 23 October 2020, 07:42   #46
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,024
One more trick that saves an instruction; you can use it for the single (word) table version too:

Stereo mix:
Code:
; d0.w = bytes to mix
MIXSF MACRO
    move.w (a3,d2.l),d3   ; d3.w = 2x signed 8-bit samples
    move.b d3,d7
    ext.w  d7
    asr.w  #8,d3
    sub.w  d3,d7          ; d7.w = sample2-sample1
    move.l d7,d5 
    rol.l  #7,d5
    add.w  (a6,d5.w*2),d3
    move.l (a1,d3.w*4),d5 ; d5 high word = left output sample (from volume LUT)
;    swap   d5
    move.w (a4,d3.w*2),d5 ; d5.l = (leftSample << 16) | rightSample
 ; if A1 and A4 used same table, use     move.w (a4,d3.w*4),d5
    add.l  d5,(a5)+
    add.l  d6,d7          ; increase sampling position
    addx.l d1,d2
    ENDM
Old 23 October 2020, 10:42   #47
8bitbubsy
Registered User
 
Join Date: Sep 2009
Location: Norway
Posts: 1,712
That table is the same for both L and R volume (it's pre-offset with the current voice volume), and it's already 256kB in size. I'd rather not double that just to make it a few percent faster at max. Thanks anyway!
Old 23 October 2020, 11:19   #48
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,024
Quote:
Originally Posted by 8bitbubsy
That table is the same for both L and R volume (it's pre-offset with the current voice volume), and it's already 256kB in size. I'd rather not double that just to make it a few percent faster at max. Thanks anyway!
OK, then only this may be useful: it saves one swap instruction.
move.l (a1,d3.w*2),d5 ; d5 high word = left output sample (from volume LUT)
; swap d5
move.w (a4,d3.w*2),d5 ; d5.l = (leftSample << 16) | rightSample
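
The reason this works can be sketched in C: on a big-endian CPU, a longword read from a table of 16-bit entries puts the addressed entry in the high word (and the following entry in the low word, which the next move.w overwrites). The sketch below is only a model; `left_lut` and `right_lut` are hypothetical stand-ins for the tables at a1 and a4.

```c
#include <stdint.h>

/* Big-endian longword read at a word index, as "move.l (a1,d3.w*2),d5"
   would see it: entry `index` lands in the high word, entry `index + 1`
   in the low word. */
uint32_t read_long_at_word_index(const int16_t *lut, int index)
{
    return ((uint32_t)(uint16_t)lut[index] << 16) | (uint16_t)lut[index + 1];
}

/* Model of "move.w <ea>,d5": replace only the low word of d5. */
uint32_t move_w_into(uint32_t d5, int16_t value)
{
    return (d5 & 0xFFFF0000u) | (uint16_t)value;
}

/* move.l + move.w, no swap. */
uint32_t pack_without_swap(const int16_t *left_lut, const int16_t *right_lut, int index)
{
    uint32_t d5 = read_long_at_word_index(left_lut, index);
    return move_w_into(d5, right_lut[index]);
}

/* Original sequence: move.w + swap + move.w. */
uint32_t pack_with_swap(const int16_t *left_lut, const int16_t *right_lut, int index)
{
    uint32_t d5 = (uint16_t)left_lut[index];   /* move.w (a1,d3.w*2),d5 */
    d5 = (d5 << 16) | (d5 >> 16);              /* swap d5 */
    return move_w_into(d5, right_lut[index]);  /* move.w (a4,d3.w*2),d5 */
}
```

Both functions return the same packed value. Note that the longword read touches one entry past the index, so the table needs at least one readable word after its last entry.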
Old 23 October 2020, 11:29   #49
8bitbubsy
Registered User
 
Join Date: Sep 2009
Location: Norway
Posts: 1,712
But is a longword read as fast as a word read on 68020?
Old 23 October 2020, 12:09   #50
chb
Registered User
 
Join Date: Dec 2014
Location: germany
Posts: 439
Quote:
Originally Posted by 8bitbubsy
But is a longword read as fast as a word read on 68020?
Yes, but only if your longword is 32-bit aligned.
Old 23 October 2020, 13:17   #51
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,024
Quote:
Originally Posted by 8bitbubsy
But is a longword read as fast as a word read on 68020?
You can check this, but if I remember right, word and longword reads have the same speed on the 68020 at even addresses. Only reads from odd addresses have a penalty, for example this one:
move.w (a3,d2.l),d3 ; d3.w = 2x signed 8-bit samples
Old 23 October 2020, 14:07   #52
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,050
020+ has a 32-bit bus, so misaligned 32-bit transfers (e.g. a longword at address 2, 6, ..., as well as any odd address) are slower since they require two 32-bit transfers.
Old 23 October 2020, 15:51   #53
8bitbubsy
Registered User
 
Join Date: Sep 2009
Location: Norway
Posts: 1,712
That might be a problem, since the pre-offsetting of the volume LUTs may mess up the 32-bit alignment in the actual look-up.
Old 23 October 2020, 15:57   #54
ross
Defendit numerus
 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 54
Posts: 4,483
I follow this thread casually, but it is very interesting.
I looked at the latest proposed routines and I don't think there are penalties for misalignments.
32-bit accesses are longword aligned and 16-bit accesses are word aligned. The access speed to memory is maximum.
Old 23 October 2020, 16:07   #55
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,050
Yeah, because the suggested changes rely on doubling the table size to ensure the alignment. And the table size has already been doubled, so that would be 4x the original size.
It's his call, of course ;p.
Old 23 October 2020, 16:09   #56
ross
Defendit numerus
 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 54
Posts: 4,483
Quote:
Originally Posted by a/b
Yeah, because the suggested changes rely on doubling the table size to ensure the alignment. And the table size has already been doubled, so that would be 4x the original size.
It's his call, of course ;p.
Ah ok

Yes, the version with 'smaller' table can misalign.
Old 23 October 2020, 16:23   #57
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,024
Quote:
Originally Posted by 8bitbubsy
That might be a problem, since the pre-offsetting of the volume LUTs may mess up the 32-bit alignment in the actual look-up.
Anyway, I think this will be fastest even with your current LUT. 50% of the reads have no penalty and 50% do, but you save the swap instruction (4c on 68020, if I remember right). This is easy to test if you have a benchmark for your mixing routine. The double (longword) table will perhaps be fastest of all, of course. Just for a speed test, you could also make a 2x256KB table version. It would be interesting to see which version is fastest, and by how much. You could also test the original mulu version.

BTW, because of the extra odd-address penalty for
move.w (a3,d2.l),d3 ; d3.w = 2x signed 8-bit samples
I also thought about
move.b (a3),d3
move.b 1(a3),d7
but too many other changes would be needed for the A3 handling, and it might not be faster.
Old 23 October 2020, 17:22   #58
meynaf
son of 68k
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,351
IIRC misaligned longword access is more deadly than misaligned word access, because a word access in the middle of a longword doesn't have a penalty.

This means that if a0 is longword aligned, word accesses will be :
. 0(a0) aligned word, ok
. 1(a0) inside longword, ok
. 2(a0) aligned word, ok
. 3(a0) misaligned access (only 25% of cases)
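
The alignment rules discussed in the last few posts can be sketched as a small C model. It only counts 32-bit bus transfers, assuming a 32-bit data bus and 32-bit wide memory; real cycle costs also depend on caches and memory speed, which this ignores. The function names are invented for the example.

```c
#include <stdint.h>

/* Number of 32-bit bus transfers for an access of `size` bytes (1, 2 or 4)
   at byte address `addr`: two if the operand crosses a longword boundary. */
int bus_transfers(uint32_t addr, uint32_t size)
{
    return ((addr & 3u) + size > 4u) ? 2 : 1;
}

/* Count how many of `count` accesses of `size` bytes, spaced `stride` bytes
   apart starting at `base`, need two transfers. */
int count_split_accesses(uint32_t base, uint32_t stride, uint32_t size, uint32_t count)
{
    int split = 0;
    for (uint32_t i = 0; i < count; i++) {
        if (bus_transfers(base + i * stride, size) == 2)
            split++;
    }
    return split;
}
```

With a longword-aligned base, word reads at every byte offset split in 1 of 4 cases (only offset 3), longword reads from a word-sized table (stride 2) split in half the cases, and longword reads from a longword-sized table (stride 4) never split.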
Old 24 October 2020, 10:18   #59
8bitbubsy
Registered User
 
Join Date: Sep 2009
Location: Norway
Posts: 1,712
Quote:
Originally Posted by Don_Adan
Anyway, I think this will be fastest even with your current LUT. 50% of the reads have no penalty and 50% do, but you save the swap instruction (4c on 68020, if I remember right). This is easy to test if you have a benchmark for your mixing routine. The double (longword) table will perhaps be fastest of all, of course. Just for a speed test, you could also make a 2x256KB table version. It would be interesting to see which version is fastest, and by how much. You could also test the original mulu version.

BTW, because of the extra odd-address penalty for
move.w (a3,d2.l),d3 ; d3.w = 2x signed 8-bit samples
I also thought about
move.b (a3),d3
move.b 1(a3),d7
but too many other changes would be needed for the A3 handling, and it might not be faster.
Oh no, I totally forgot that this has a possible word access misalignment! If only one could use addx on an address register, then I could add the sampling position to a3 before the loop, then read two bytes, then addx on a3. And when I leave the loop, I subtract the sample base from a3 to get the new sampling position, before I handle sample end/loop end.

I could still do that by using d2 as the absolute sampling address (d2 = a3+d2 before the loop). Then I do "move.l d2,a3" as the first instruction in the loop.
That's one extra move instruction, so probably slower in the end?
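
For reference, the add.l/addx.l pair that advances the sampling position in the mixing macros can be modelled in C as below. ADDX only takes data registers (or predecrement memory operands), which is why the position cannot live in a3 directly. This is a sketch, not the author's code: the struct and function names are hypothetical, and it assumes the fraction sits in the upper bits of d7 and the low word of d6 is zero, so the sample delta kept in d7's low word does not disturb the carry.

```c
#include <stdint.h>

/* Hypothetical state mirroring the registers used by the mixing macros. */
typedef struct {
    uint32_t frac;  /* d7: fractional sampling position */
    uint32_t pos;   /* d2: integer sampling position */
} SamplePos;

/* Model of "add.l d6,d7" followed by "addx.l d1,d2": the carry out of the
   fractional add (the X flag) is added into the integer position. */
void step(SamplePos *p, uint32_t frac_delta /* d6 */, uint32_t int_delta /* d1 */)
{
    uint32_t sum = p->frac + frac_delta;
    uint32_t carry = (sum < frac_delta) ? 1u : 0u;  /* X flag after add.l */
    p->frac = sum;
    p->pos += int_delta + carry;                    /* addx.l d1,d2 */
}
```

With a step of 1.5 samples (`int_delta = 1`, `frac_delta = 0x80000000`), the position goes 0, 1, 3, 4, 6, and so on: every second step the fraction wraps and the carry adds one extra sample.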
Old 24 October 2020, 11:18   #60
chb
Registered User
 
Join Date: Dec 2014
Location: germany
Posts: 439
Quote:
Originally Posted by 8bitbubsy
Oh no, I totally forgot that this has a possible word access misalignment! If only one could use addx on an address register, then I could add the sampling position to a3 before the loop, then read two bytes, then addx on a3. And when I leave the loop, I subtract the sample base from a3 to get the new sampling position, before I handle sample end/loop end.

I could still do that by using d2 as the absolute sampling address (d2 = a3+d2 before the loop). Then I do "move.l d2,a3" as the first instruction in the loop.
That's one extra move instruction, so probably slower in the end?
As meynaf says, the word access will only be misaligned in 25% of cases. There's IMHO no easy way around that, at least if your integer sample read delta is > 1 (so you are skipping a significant proportion of the sample values).

It may be beneficial to have two sets of mixing routines: one for integer delta <= 1, where you access every sample and can potentially also reuse the delta between two samples, so that the "move.w (a3,d2.l),d3", asr and ext instructions are only needed once per input sample rather than once per output sample. And one for delta > 1.
 

