Sound overhaul for TKG - Page 4

Karlos · 21 June 2024, 18:32

I definitely intend to make it user selectable regardless. There will also be a version that does not use move16 for the input channel data fetches.

I made a lookup hit rate test and ran half a megabyte of signed 8-bit audio through it. It happened to be music as I wanted a reasonable worst case for the lookup.

The lookup was bucketed into blocks of 16 entries to simulate the cache line arrangement, and I summed the hit count per bucket. The results are very promising.

Code:

 Read 522784 values, checking distribution
 Linear:
 Array
 (
     [0] => 146970
     [1] => 69347
     [2] => 27306
     [3] => 11820
     [4] => 5776
     [5] => 2553
     [6] => 561
     [7] => 197
     [8] => 430
     [9] => 1226
     [10] => 2687
     [11] => 6220
     [12] => 14353
     [13] => 28584
     [14] => 71530
     [15] => 133224
 )
 Delta:
 Array
 (
     [0] => 283956
     [1] => 15449
     [2] => 2236
     [3] => 161
     [4] => 9
     [5] => 0
     [6] => 0
     [7] => 0
     [8] => 0
     [9] => 0
     [10] => 0
     [11] => 6
     [12] => 79
     [13] => 1776
     [14] => 14188
     [15] => 204924
 )

The signed 8-bit value is of course treated as an unsigned word index lookup by our addressing so we get the most hits close to the beginning and end of the tables. With delta encoding, we really focus the hits to the first and last bucket here. To properly simulate my proposed access pattern, I need to tweak this to do one linear lookup then 15 deltas, but I expect the distribution to be closer to the delta, for obvious reasons.
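The bucketing test described above could look something like this. It is a minimal sketch rather than the script that produced the numbers: the function name is made up, and the delta is assumed to wrap modulo 256 so that it still fits a byte-sized index.

```python
def bucket_hits(samples, buckets=16, delta=False):
    """Count table lookups per bucket for a stream of signed 8-bit samples.

    samples: iterable of ints in -128..127
    delta:   if True, look up the difference from the previous sample instead
    """
    hits = [0] * buckets
    per_bucket = 256 // buckets           # table entries covered by one bucket
    prev = 0
    for s in samples:
        value = (s - prev) if delta else s
        prev = s
        index = value & 0xFF              # signed byte reinterpreted as unsigned index
        hits[index // per_bucket] += 1
    return hits


demo = [0, 1, 2, 3, -1, -2, 100, 101]
linear = bucket_hits(demo)
deltas = bucket_hits(demo, delta=True)
```

Small positive values land in the first bucket and small negative values in the last, which is why both ends of the table light up.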

Karlos · 21 June 2024, 18:45

With slightly better formatting....

Code:

 Read 522784 values, checking distribution
 Linear:
 +------+---------+-------+
 | Line |  Access | Hit % |
 +------+---------+-------+
 |    0 |  146970 | 28.11 | 
 |    1 |   69347 | 13.26 | 
 |    2 |   27306 |  5.22 | 
 |    3 |   11820 |  2.26 | 
 |    4 |    5776 |  1.10 | 
 |    5 |    2553 |  0.49 | 
 |    6 |     561 |  0.11 | 
 |    7 |     197 |  0.04 | 
 |    8 |     430 |  0.08 | 
 |    9 |    1226 |  0.23 | 
 |   10 |    2687 |  0.51 | 
 |   11 |    6220 |  1.19 | 
 |   12 |   14353 |  2.75 | 
 |   13 |   28584 |  5.47 | 
 |   14 |   71530 | 13.68 | 
 |   15 |  133224 | 25.48 | 
 +------+---------+-------+
 Delta:
 +------+---------+-------+
 | Line |  Access | Hit % |
 +------+---------+-------+
 |    0 |  283956 | 54.32 | 
 |    1 |   15449 |  2.96 | 
 |    2 |    2236 |  0.43 | 
 |    3 |     161 |  0.03 | 
 |    4 |       9 |  0.00 | 
 |    5 |       0 |  0.00 | 
 |    6 |       0 |  0.00 | 
 |    7 |       0 |  0.00 | 
 |    8 |       0 |  0.00 | 
 |    9 |       0 |  0.00 | 
 |   10 |       0 |  0.00 | 
 |   11 |       6 |  0.00 | 
 |   12 |      79 |  0.02 | 
 |   13 |    1776 |  0.34 | 
 |   14 |   14188 |  2.71 | 
 |   15 |  204924 | 39.20 | 
 +------+---------+-------+
 Linear-1/Delta-15:
 +------+---------+-------+
 | Line |  Access | Hit % |
 +------+---------+-------+
 |    0 |  274808 | 52.57 | 
 |    1 |   18557 |  3.55 | 
 |    2 |    3924 |  0.75 | 
 |    3 |     906 |  0.17 | 
 |    4 |     402 |  0.08 | 
 |    5 |     157 |  0.03 | 
 |    6 |      38 |  0.01 | 
 |    7 |      10 |  0.00 | 
 |    8 |      30 |  0.01 | 
 |    9 |      66 |  0.01 | 
 |   10 |     176 |  0.03 | 
 |   11 |     391 |  0.07 | 
 |   12 |     994 |  0.19 | 
 |   13 |    3449 |  0.66 | 
 |   14 |   17780 |  3.40 | 
 |   15 |  201096 | 38.47 | 
 +------+---------+-------+

The tables are actually made of 16 bit words so I should've chunked blocks of 8 rather than 16. Nevertheless, in the last scheme 91% of all accesses would be to just 4 cache lines.

I just tested with 32 buckets to account for the word size and I still get 78% of all hits in Line 0 and Line 31.

Karlos · 21 June 2024, 18:57

@paraj

Quote:

Just a thought, but have you looked (heard?) into how much fidelity is lost if you restrict yourself to volume levels that only require shifts?

To be clear, this is my get out of jail free card. You can see I already have shift normalisation code in place for exact powers of 2, and it processes two samples at a time. I can see that the fidelity will be impacted when, for example, you have a 16 bit peak just above half way. Then, you'll be using full channel volume, which will mean playing the 8-bit sample at close to half the range, in effect dropping to 7 bits rather than 8.

However, I think there are possibly some fairly easy win solutions. We can probably normalise to volumes like 48 without too much effort, since doing a scale by 1.5 is two adds and a shift. You do have the trade-off there of having slightly branchier code, but it's only per frame of 16 output samples.
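As a quick illustration of "two adds and a shift", here is a sketch of scaling by 1.5 without a multiply. The function name is hypothetical, and how this slots into the normaliser is not shown in the thread; presumably the 1.5 arises because 48 is 1.5 × 32.

```python
def scale_by_1_5(x):
    """Multiply by 1.5 using two adds and one arithmetic shift: (x + x + x) >> 1."""
    return (x + x + x) >> 1   # Python's >> on ints is an arithmetic (floor) shift


examples = [scale_by_1_5(v) for v in (10, -10, 7, -7)]
```

Odd inputs lose half a unit to the shift, rounding toward negative infinity, so 7 becomes 10 and -7 becomes -11.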

Karlos · 21 June 2024, 21:43

@paraj

I pushed a new version. This one uses the following ReadArgs options:

M=Multiply/S,D=DumpBuffers/S

If Multiply is given, the mixing stage switches to a pure multiplication based implementation where the only lookup is for the initial scale factor (in a table of 16 entries). The loop is naive and trivial. d4 contains the scale factor, a3 points at the fetch buffer and a4 the accumulation buffer:

Code:

        moveq   #CACHE_LINE_SIZE-1,d1    ; d1 = sample count - 1 (dbra loops d1+1 times)

.next_sample_multiply:
        move.b  (a3)+,d0  ; next sample from the buffer
        ext.w   d0        ; sign extend
        muls.w  d4,d0     ; scale
        add.w   d0,(a4)+  ; accumulate onto the target buffer
        dbra    d1,.next_sample_multiply

This way you can compare before and after. I haven't yet added an option for not using multiplication normalisation, so that stays on. I am very interested to know if the multiplication path is faster on the 68060 (I assume so, but the proof is in the measurement)
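For anyone who wants to check the output of the multiply path off-target, the loop above can be modelled roughly like this. It is a sketch, assuming CACHE_LINE_SIZE is 16 (one frame of 16 samples) and that the accumulation buffer holds 16-bit words that wrap on overflow, as add.w does.

```python
CACHE_LINE_SIZE = 16  # assumption: one frame of 16 samples per pass


def mix_multiply(fetch, accum, scale):
    """Model of the multiply loop: accum[i] += sign_extend(fetch[i]) * scale (16-bit wrap).

    fetch: bytes-like, raw signed 8-bit samples stored as 0..255
    accum: list of unsigned 16-bit words, modified in place
    """
    for i in range(CACHE_LINE_SIZE):
        byte = fetch[i]
        sample = byte - 256 if byte > 127 else byte   # ext.w: sign extend the byte
        product = sample * scale                      # muls.w d4,d0
        accum[i] = (accum[i] + product) & 0xFFFF      # add.w keeps only the low word
    return accum


frame = bytes([1, 255] + [0] * 14)   # +1, -1, then silence
result = mix_multiply(frame, [0] * 16, 4)
```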

The DumpBuffers option just prevents it writing the .raw files unless you ask for it.

paraj · 22 June 2024, 18:26

Quote:

Originally Posted by Karlos

@paraj

I pushed a new version. This one uses the following ReadArgs options:

M=Multiply/S,D=DumpBuffers/S

If Multiply is given, the mixing stage switches to a pure multiplication based implementation where the only lookup is for the initial scale factor (in a table of 16 entries). The loop is naive and trivial. d4 contains the scale factor, a0 points at the fetch buffer and a4 the accumulation buffer:

Code:

        moveq   #CACHE_LINE_SIZE-1,d1    ; num samples in d1

.next_sample_multiply:
        move.b  (a3)+,d0  ; next sample from the buffer
        ext.w   d0        ; sign extend
        muls.w  d4,d0     ; scale
        add.w   d0,(a4)+  ; accumulate onto the target buffer
        dbra    d1,.next_sample_multiply

This way you can compare before and after. I haven't yet added an option for not using multiplication normalisation, so that stays on. I am very interested to know if the multiplication path is faster on the 68060 (I assume so, but the proof is in the measurement)

The DumpBuffer option just prevents it writing the .raw files unless you ask for it.

No mul: 330570
Mul: 289848
(And btw on 060 you always want to replace dbra with subq+bcc, though I doubt it will be noticeable in practice here)

But yeah, by all means explore all options for full precision even on <060, the road is often more interesting than the destination

Karlos · 22 June 2024, 18:50

Quote:

Originally Posted by paraj

No mul: 330570
Mul: 289848
(And btw on 060 you always want to replace dbra with subq+bcc, though I doubt it will be noticeable in practice here)

About 12% less time (roughly 14% faster) and much less impact on the data cache! This is what I'm talking about. Regarding dbra, is that something branch prediction related? I assumed it wouldn't make any difference.

Quote:

But yeah, by all means explore all options for full precision even on <060, the road is often more interesting than the destination

Indeed. This is something I've been wanting to explore for quite a long time now. Even if it comes to nothing, it's still fun.

I think I have the 060 version ready for now, then. Basically the existing version, without any conditionals. You should have the option of choosing it on 040 still, but I'm going to experiment with a lookup-mixer driven, shift normalised variation for that.

Karlos · 22 June 2024, 19:01

About 2.14 ms per packet under "maximum mixer load" at 16kHz.

I am going to add a tool to reconstruct a 16 bit stream from the dump files and feed it a pure 16 bit input stream for normalisation. Then I can do some comparison. Obviously this is theoretical until I actually wire it up to Paula.

paraj · 22 June 2024, 19:17

Quote:

Originally Posted by Karlos

Regarding dbra, is that something branch prediction related? I assumed it wouldn't make any difference.

dbcc is pOEP-only meaning it can never pair with another instruction. subq is pOEP/sOEP so it can pair and correctly predicted branches (Bcc) are free (yes, really, they can take 0 cycles).

So let's say you have an inner loop of

Code:

.loop
 ; ...
 move.b d0,(a0)+ ; assume this executes in pOEP
 subq.w #1,d0    ; sOEP
 bne.b .loop     ; free when correctly predicted

In this case looping is essentially free. The subq executes concurrently with the store (ignoring cache stuff of course), and the bne is eliminated by the branch predictor (Ifetch just continues at .loop and you pay a penalty (3 cycles iirc) when it terminates).

Code:

   dbf d0,.loop

however can never pair (it's pOEP-only), and can never be eliminated, so it always costs 1 cycle (even when correctly predicted). Of course you might be able to fit an unrelated instruction into the sOEP after the store, but the bottom line is that it's always better to use addq/subq + bcc for loops.

Quote:

Originally Posted by Karlos

Indeed. This is something I've been wanting to explore for quite a long time now. Even if it comes to nothing, it's still fun.

I think, I have the 060 version ready for now, then. Basically the existing version, without any conditionals. You should have the option of choosing it on 040 still, but I'm going to experiment with a lookup-mixer driven, shift normalised variation for that.

Hope we can get some (real) 040 people to do some timing

Karlos · 22 June 2024, 19:21

I have a real 040, but I daren't power it up. It's not been turned on for 10+ years and I'm sure it needs some serious TLC.

I'll make sure all the loops follow your example for the 060 path.

paraj · 22 June 2024, 19:33

Quote:

Originally Posted by Karlos

I have a real 040, but I daren't power it up. It's not been turned on for 10+ years and I'm sure it needs some serious TLC.

I'll make sure all the loops follow your example for the 060 path.

Had my 060 in a similar state but powered it on anyway (not recommended), and it worked fine, but then got the A1200+060 recapped and checked, and caps had leaked (though I wasn't able to tell by visual inspection). Highly recommended so you can sleep safely at night knowing it's at least not getting worse, and I can continue my cycle counting. Also have a few other Amigas in various states of decay though, but at least my favorite son is thriving.

dbf -> subq/bcc is just one of those mindless "free" optimizations you get used to for 060, but of course it very rarely matters, and is probably bad in general if the code is meant for a blend of targets.

Karlos · 22 June 2024, 21:40

Tiny update (nothing worth testing): I wrote a quick and dirty script to reassemble the raw output dumps (samples and volume data) back into a 16-bit stereo sample stream file so that I can examine it. The test as it stood was clipping in the middle, so dropping the default level of the mixer helped.

The good news is that the resulting 16-bit stream ended up exactly as expected: a weird sound of the airstrike coming down some sort of pipe (an artefact of deliberately mixing each channel a 16 sample frame out).

I was honestly expecting it to output a complete load of crap due to the prototypical and buggy nature, but it actually surprised me.

So the next obvious thing to do is to make it play via Paula and to do something more practical with it. A good test might be to have a couple of simple 8-bit loop samples (maybe some stems that work together) and make some sort of simple control interface.

Karlos · 23 June 2024, 14:57

@paraj

Just looking at the volmod.c source, I assume the external play() function declaration was for some assembler that was never needed in the end?

My understanding is that once initialised and told to play, Paula will loop the same DMA buffer(s) endlessly and that there's an interrupt when it is about to repeat. How long before the wrap does this fire or is it just before?

paraj · 23 June 2024, 17:10

Quote:

Originally Posted by Karlos

@paraj
Just looking at the volmod.c source, I assume the external play() function declaration was for some assembler that was never needed in the end?

Yes, it was used in the code I copy/pasted from, but not here.

Quote:

Originally Posted by Karlos

My understanding is that once initialised and told to play, Paula will loop the same DMA buffer(s) endlessly and that there's an interrupt when it is about to repeat. How long before the wrap does this fire or is it just before?

Audio can be a bit tricky, so you may want to re-read the Audio chapter of the HRM and meditate a bit on the state transition diagram, but you get an interrupt when the DMA engine has copied the location/length registers to the "backup" registers. This is (roughly) when the sample starts (re-)playing (to my best understanding).

To double buffer, clear the audio register bits of INTREQ, play buffer A (write ac_ptr/ac_len), wait for the bit to be set in INTREQR, clear it and play buffer B. Next interrupt will signal that buffer A finished playing and buffer B started. So prepare buffer A and play it (after clearing the int bit), rinse and repeat (while alternating buffers).

In volmod.c I start by playing buffer B (silence) to make the loop a bit simpler. (Ignore the convert(0); it's just for timing.)
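The double-buffering sequence described above can be walked through with a toy model. This is only a sketch of the ordering of events, not of real hardware access: ToyPaulaChannel and its methods are invented for illustration, and the latch-then-interrupt behaviour follows the description in this post rather than the HRM text itself.

```python
class ToyPaulaChannel:
    """Hypothetical model: the location/length registers are latched when a buffer (re)starts."""

    def __init__(self):
        self.pending = None     # last value written to the location/length registers
        self.playing = None     # latched "backup" copy the DMA engine reads from
        self.int_flag = False   # stands in for the channel's bit in INTREQR

    def write_buffer(self, name):
        self.pending = name     # takes effect at the next buffer boundary

    def buffer_boundary(self):
        """DMA starts or wraps: latch the pending registers and raise the interrupt."""
        self.playing = self.pending
        self.int_flag = True


def run_double_buffer(frames):
    ch = ToyPaulaChannel()
    buffers = ["A", "B"]
    played = []
    ch.write_buffer(buffers[0])   # queue buffer A
    ch.buffer_boundary()          # playback starts: A latched, interrupt raised
    nxt = 1
    for _ in range(frames):
        assert ch.int_flag        # "wait" for the interrupt bit
        ch.int_flag = False       # clear it
        played.append(ch.playing)
        ch.write_buffer(buffers[nxt])   # prepare and queue the other buffer
        nxt ^= 1
        ch.buffer_boundary()      # current buffer ends, queued one takes over
    return played


order = run_double_buffer(4)

looping = ToyPaulaChannel()
looping.write_buffer("A")
looping.buffer_boundary()
looping.buffer_boundary()         # nothing new written, so the same buffer repeats
```

In this model, forgetting to queue a new buffer simply replays the old one, which matches the "loops the same DMA buffer endlessly" understanding.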

Karlos · 23 June 2024, 17:19

On the other track (no pun intended), I updated the readme with the findings on how to optimise the lookup for 68040. I added to a brain dump. If you are interested, https://github.com/0xABADCAFE/tkg-mi...-ov-file#68040

Karlos · 23 June 2024, 17:31

I think what I need is slightly different to vanilla double buffering. I'm creating a packet of audio on the fly based on current game state, so I want to reduce the latency as much as possible. I don't want to create a crackly, jittery mess either, but I think 20ms is about the maximum I want the mixer to ever be ahead. Just something else to add to the meditation pile.
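To put the 20ms figure in sample terms, here is a small sketch. It assumes the 16kHz rate mentioned earlier in the thread and the 16-sample frame; the function name is made up.

```python
def samples_for_latency(rate_hz, latency_ms, frame_samples=16):
    """Return (samples, whole frames) that fit in the given latency budget."""
    samples = rate_hz * latency_ms // 1000
    frames = samples // frame_samples
    return samples, frames


budget = samples_for_latency(16000, 20)
```

At 16kHz that works out to 320 samples, or 20 frames of 16 samples, as the most the mixer would run ahead.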

paraj · 23 June 2024, 19:51

Quote:

Originally Posted by Karlos

On the other track (no pun intended), I updated the readme with the findings on how to optimise the lookup for 68040. I added to a brain dump. If you are interested, https://github.com/0xABADCAFE/tkg-mi...-ov-file#68040

I am interested, but not sure I grasp what you're trying to do. Is the idea to have a few tables for "small" delta values to avoid multiplication most of the time, but fall back to that as required?

Quote:

Originally Posted by Karlos

I think what I need is slightly different to vanilla double buffering. I'm creating a packet of audio on the fly based on current game state, so I want to reduce the latency as much as possible. I don't want to create crackly, jittery mess also, but I think 20ms is about the maximum I want the mixer to ever be ahead. Just something else to add to the meditation pile.

Isn't the main difficulty synchronizing with the "main thread"? While I do think you could in theory make it just right for the audio buffer to only need refilling during vblank, I think that's going to be a mess with the RTG code, and thus I think it's better to try to decouple it. Maybe you're past this train of thought and pondering bigger issues though.

Karlos · 23 June 2024, 20:08

The plan is to do as few multiplications as humanly possible on the 040.

If we have 16 channels, with independent left/right volumes, the mixing step alone needs 16x16x2 multiplications per frame of 16 samples. That's quite a lot, then you have up to 32 more during normalisation. Each one is about 18 cycles just for the ALU so this is a non starter for me.
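The arithmetic behind that, as a rough sketch: the counts and the 18-cycle figure are taken from the post, while the 16kHz output rate comes from earlier in the thread and is an assumption here.

```python
def mul_budget(channels=16, frame=16, sides=2, norm_muls=32, cycles_per_mul=18):
    """Return (multiplications, ALU cycles) for one frame of 16 output samples."""
    muls = channels * frame * sides + norm_muls
    return muls, muls * cycles_per_mul


muls, cycles = mul_budget()
frames_per_second = 16000 // 16          # assumed 16kHz output, 16 samples per frame
cycles_per_second = cycles * frames_per_second
```

That is 544 multiplications and 9,792 cycles per frame, or close to ten million cycles a second spent on multiplication alone.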

The obvious alternative, that is already in place, is to create a set of lookup tables, one for each input volume level, that maps an input 8-bit sample to the required 16 bit one for the volume level.

The problem with this is cache churn. Each table is 32 cache lines big and with 16 input channels they all can be at different volumes. So I wanted a way to focus the lookups per table into the least number of hot lines.
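A quick sketch of where the 32 lines come from and what the worst case looks like. The 16-byte line and 4KB data cache are the 68040's; the helper names are made up, and the worst case assumes every line of every table gets touched.

```python
CACHE_LINE_BYTES = 16    # 68040 cache line size
ENTRY_BYTES = 2          # 16-bit table entries
DATA_CACHE_BYTES = 4096  # 68040 data cache size


def table_lines(entries=256):
    """Cache lines occupied by one volume lookup table."""
    return entries * ENTRY_BYTES // CACHE_LINE_BYTES


def worst_case_lines(channels=16):
    """Lines touched if every channel uses a different table and hits all of it."""
    return channels * table_lines()


lines_per_table = table_lines()
worst = worst_case_lines()
cache_lines = DATA_CACHE_BYTES // CACHE_LINE_BYTES
```

In that worst case the tables alone span 512 lines, twice the 256 lines the 040 data cache can hold, before counting the sample and accumulation buffers.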

Karlos · 23 June 2024, 20:11

The obvious solution to that is to convert the sample input data to delta values so that the lookups cluster around the smallest values. Which, at least for the simulation, produced excellent results. A delta scheme does add a couple of complexities I wanted to avoid, particularly when changing volume from one frame of 16 samples to the next. Hence the Linear-1/Delta-15 compromise, which the simulation shows is "almost" as good.
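The Linear-1/Delta-15 access pattern can be simulated along these lines. As before this is a sketch under assumptions, not the original test script: the function name is invented, and deltas are assumed to wrap modulo 256.

```python
def bucket_hits_linear1_delta15(samples, frame=16, buckets=16):
    """Bucket lookups when each frame stores one absolute sample then 15 deltas."""
    hits = [0] * buckets
    per_bucket = 256 // buckets
    for start in range(0, len(samples), frame):
        prev = 0
        for i, s in enumerate(samples[start:start + frame]):
            value = s if i == 0 else s - prev   # first sample absolute, the rest as deltas
            prev = s
            hits[(value & 0xFF) // per_bucket] += 1
    return hits


demo = [100] * 16 + [-100] * 16
hits = bucket_hits_linear1_delta15(demo)
```

Because each frame restarts from an absolute value, a volume change between frames never needs state carried over from the previous one; only one lookup in 16 is spread across the table.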

Karlos · 24 June 2024, 00:31

Quick question - on 68040 / 68060, is it still faster to clear memory using, for example, move.l d0,(a0)+ (assuming that d0 contains zero), or is clr.l (a0)+ just as good?

I use clr for clarity but I prefer cycle efficiency here.

Karlos · 24 June 2024, 00:52

I just pushed an update. It's statically using an 060 codepath, so the parameters don't do anything (except the dump option).

I removed all the indirection, replaced all the dbf with sub/bne pairings and removed a redundant clearing of the fetch buffer.

Curious to know if it's any better than last time, before I do the 040 version.
