21 June 2024, 18:32 | #61 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,689
|
I definitely intend to make it user selectable regardless. There will also be a version that does not use move16 for the input channel data fetches.
I made a lookup hit rate test and ran half a megabyte of signed 8-bit audio through it. It happened to be music as I wanted a reasonable worst case for the lookup. The lookup was bucketed into blocks of 16 to simulate the cacheline arrangement and I summed the hit count per bucket. The results are very promising.
Code:
Read 522784 values, checking distribution
Linear:
Array
(
    [0] => 146970
    [1] => 69347
    [2] => 27306
    [3] => 11820
    [4] => 5776
    [5] => 2553
    [6] => 561
    [7] => 197
    [8] => 430
    [9] => 1226
    [10] => 2687
    [11] => 6220
    [12] => 14353
    [13] => 28584
    [14] => 71530
    [15] => 133224
)
Delta:
Array
(
    [0] => 283956
    [1] => 15449
    [2] => 2236
    [3] => 161
    [4] => 9
    [5] => 0
    [6] => 0
    [7] => 0
    [8] => 0
    [9] => 0
    [10] => 0
    [11] => 6
    [12] => 79
    [13] => 1776
    [14] => 14188
    [15] => 204924
) |
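For anyone wanting to reproduce this kind of measurement, here is a minimal sketch in C. It is not the script that produced the numbers above: the function names are made up, and it assumes that deltas wrap to 8 bits and that the first delta is taken against zero.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_BUCKETS 16  /* 256 possible index values, 16 per bucket */

/* An 8-bit table index 0..255 falls into bucket index/16. Small positive
   samples land in bucket 0, small negative ones (0xF0..0xFF) in bucket 15. */
static unsigned bucket_of(uint8_t index)
{
    return (unsigned)index >> 4;
}

/* Count lookups per bucket when the raw samples are used as indices. */
void count_linear(const int8_t *samples, size_t n, unsigned long hits[NUM_BUCKETS])
{
    for (size_t i = 0; i < NUM_BUCKETS; i++)
        hits[i] = 0;
    for (size_t i = 0; i < n; i++)
        hits[bucket_of((uint8_t)samples[i])]++;
}

/* Count lookups per bucket when sample-to-sample deltas are used instead.
   Assumption: deltas wrap to 8 bits and the first delta is taken from 0. */
void count_delta(const int8_t *samples, size_t n, unsigned long hits[NUM_BUCKETS])
{
    int8_t previous = 0;
    for (size_t i = 0; i < NUM_BUCKETS; i++)
        hits[i] = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t delta = (uint8_t)(samples[i] - previous);
        hits[bucket_of(delta)]++;
        previous = samples[i];
    }
}
```

Dividing each bucket count by the number of values read gives a per-line hit percentage.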
21 June 2024, 18:45 | #62 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,689
|
With slightly better formatting....
Code:
Read 522784 values, checking distribution
Linear:
+------+---------+-------+
| Line | Access  | Hit % |
+------+---------+-------+
|    0 |  146970 | 28.11 |
|    1 |   69347 | 13.26 |
|    2 |   27306 |  5.22 |
|    3 |   11820 |  2.26 |
|    4 |    5776 |  1.10 |
|    5 |    2553 |  0.49 |
|    6 |     561 |  0.11 |
|    7 |     197 |  0.04 |
|    8 |     430 |  0.08 |
|    9 |    1226 |  0.23 |
|   10 |    2687 |  0.51 |
|   11 |    6220 |  1.19 |
|   12 |   14353 |  2.75 |
|   13 |   28584 |  5.47 |
|   14 |   71530 | 13.68 |
|   15 |  133224 | 25.48 |
+------+---------+-------+
Delta:
+------+---------+-------+
| Line | Access  | Hit % |
+------+---------+-------+
|    0 |  283956 | 54.32 |
|    1 |   15449 |  2.96 |
|    2 |    2236 |  0.43 |
|    3 |     161 |  0.03 |
|    4 |       9 |  0.00 |
|    5 |       0 |  0.00 |
|    6 |       0 |  0.00 |
|    7 |       0 |  0.00 |
|    8 |       0 |  0.00 |
|    9 |       0 |  0.00 |
|   10 |       0 |  0.00 |
|   11 |       6 |  0.00 |
|   12 |      79 |  0.02 |
|   13 |    1776 |  0.34 |
|   14 |   14188 |  2.71 |
|   15 |  204924 | 39.20 |
+------+---------+-------+
Linear-1/Delta-15:
+------+---------+-------+
| Line | Access  | Hit % |
+------+---------+-------+
|    0 |  274808 | 52.57 |
|    1 |   18557 |  3.55 |
|    2 |    3924 |  0.75 |
|    3 |     906 |  0.17 |
|    4 |     402 |  0.08 |
|    5 |     157 |  0.03 |
|    6 |      38 |  0.01 |
|    7 |      10 |  0.00 |
|    8 |      30 |  0.01 |
|    9 |      66 |  0.01 |
|   10 |     176 |  0.03 |
|   11 |     391 |  0.07 |
|   12 |     994 |  0.19 |
|   13 |    3449 |  0.66 |
|   14 |   17780 |  3.40 |
|   15 |  201096 | 38.47 |
+------+---------+-------+
I just tested with 32 buckets to account for the word size and I still get 78% of all hits in Line 0 and Line 31
Last edited by Karlos; 21 June 2024 at 18:51. |
21 June 2024, 18:57 | #63 | |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,689
|
@paraj
Quote:
However, I think there are possibly some fairly easy win solutions. We can probably normalise to volumes like 48 without too much effort, since doing a scale by 1.5 is two adds and a shift. You do have the trade-off there of having slightly branchier code, but it's only per frame of 16 output samples. |
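Reading "two adds and a shift" as (x + x + x) >> 1, the 1.5x step could be sketched in C like this. The function name is made up for illustration, and the right shift of a negative value is assumed to be arithmetic, as asr is on the 68k and as it is on mainstream C compilers.

```c
#include <stdint.h>

/* Scale a 16-bit sample by 1.5 using two adds and one shift: (3 * x) / 2.
   The sum is formed in 32 bits so the intermediate 3 * x cannot overflow.
   Assumption: >> on a negative value is an arithmetic shift. */
int32_t scale_by_three_halves(int16_t x)
{
    int32_t wide = x;
    return (wide + wide + wide) >> 1;
}
```

Since 48 is 32 x 1.5, a power-of-two shift followed by this step would presumably be enough to reach that level without a multiply.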
|
21 June 2024, 21:43 | #64 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,689
|
@paraj
I pushed a new version. This one uses the following ReadArgs options:
M=Multiply/S,D=DumpBuffers/S
If Multiply is given, the mixing stage switches to a pure multiplication based implementation where the only lookup is for the initial scale factor (in a table of 16 entries). The loop is naive and trivial. d4 contains the scale factor, a3 points at the fetch buffer and a4 at the accumulation buffer:
Code:
    moveq   #CACHE_LINE_SIZE-1,d1   ; num samples - 1 in d1 (dbra counter)
.next_sample_multiply:
    move.b  (a3)+,d0                ; next sample from the buffer
    ext.w   d0                      ; sign extend
    muls.w  d4,d0                   ; scale
    add.w   d0,(a4)+                ; accumulate onto the target buffer
    dbra    d1,.next_sample_multiply
The DumpBuffers option just prevents it writing the .raw files unless you ask for it. |
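For readers who don't speak 68k, a rough C equivalent of that loop follows. It is only a sketch: the function and parameter names are invented here, and CACHE_LINE_SIZE is assumed to be 16 (the 68040/68060 line size, matching the 16-sample frames discussed in the thread).

```c
#include <stdint.h>

#define CACHE_LINE_SIZE 16  /* assumed: bytes per cache line, one frame of samples */

/* Scale one frame of signed 8-bit samples and accumulate it onto a 16-bit
   mix buffer, mirroring the assembly: fetch, sign extend, multiply, add.
   Assumes the running sum stays within 16 bits; the add.w in the original
   would simply wrap if it did not. */
void mix_frame_multiply(const int8_t *fetch, int16_t scale, int16_t *accum)
{
    for (int i = 0; i < CACHE_LINE_SIZE; i++) {
        int32_t product = (int32_t)fetch[i] * scale;  /* ext.w + muls.w */
        accum[i] = (int16_t)(accum[i] + product);     /* add.w d0,(a4)+ */
    }
}
```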
22 June 2024, 18:26 | #65 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,275
|
Quote:
Mul: 289848
(And btw on 060 you always want to replace dbra with subq+bcc, though I doubt it will be noticeable in practice here.)
But yeah, by all means explore all options for full precision even on <060; the road is often more interesting than the destination. |
|
22 June 2024, 18:50 | #66 | ||
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,689
|
Quote:
Quote:
I think I have the 060 version ready for now, then. Basically the existing version, without any conditionals. You should still have the option of choosing it on 040, but I'm going to experiment with a lookup-mixer driven, shift normalised variation for that. |
||
22 June 2024, 19:01 | #67 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,689
|
About 2.14 ms per packet under "maximum mixer load" at 16kHz.
I am going to add a tool to reconstruct a 16 bit stream from the dump files and feed it a pure 16 bit input stream for normalisation. Then I can do some comparison. Obviously this is theoretical until I actually wire it up to Paula. |
22 June 2024, 19:17 | #68 | ||
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,275
|
Quote:
So let's say you have an inner loop of
Code:
.loop
    ; ...
    move.b  d0,(a0)+    ; assume this executes in pOEP
Code:
    subq.w  #1,d0       ; sOEP
    bne.b   .loop       ; free when correctly predicted
Code:
    dbf     d0,.loop
Quote:
|
||
22 June 2024, 19:21 | #69 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,689
|
I have a real 040, but I daren't power it up. It's not been turned on for 10+ years and I'm sure it needs some serious TLC.
I'll make sure all the loops follow your example for the 060 path. |
22 June 2024, 19:33 | #70 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,275
|
Quote:
dbf -> subq/bcc is just one of those mindless "free" optimizations you get used to for 060, but of course it very rarely matters, and is probably bad in general if the code is meant for a blend of targets. |
|
22 June 2024, 21:40 | #71 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,689
|
Tiny update (nothing worth testing): I wrote a quick and dirty script to reassemble the raw output dumps (samples and volume data) back into a 16-bit stereo sample stream file so that I can examine it. As it stood, the test was clipping in the middle, so dropping the default level of the mixer helped.
The good news is that the resulting 16-bit stream ended up exactly as expected: a weird sound of the airstrike coming down some sort of pipe (an artefact of deliberately mixing each channel a 16 sample frame out). I was honestly expecting it to output a complete load of crap due to its prototypical and buggy nature, but it actually surprised me. So the next obvious thing to do is to make it play via Paula and to do something more practical with it. A good test might be to have a couple of simple 8-bit loop samples (maybe some stems that work together) and to make some sort of simple control interface. |
23 June 2024, 14:57 | #72 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,689
|
@paraj
Just looking at the volmod.c source, I assume the external play() function declaration was for some assembler that was never needed in the end? My understanding is that once initialised and told to play, Paula will loop the same DMA buffer(s) endlessly and that there's an interrupt when it is about to repeat. How long before the wrap does this fire, or is it just before? |
23 June 2024, 17:10 | #73 | ||
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,275
|
Quote:
Quote:
To double buffer, clear the audio register bits of INTREQ, play buffer A (write ac_ptr/ac_len), wait for the bit to be set in INTREQR, clear it and play buffer B. The next interrupt will signal that buffer A finished playing and buffer B started. So prepare buffer A and play it (after clearing the int bit), rinse and repeat (while alternating buffers). In volmod.c I start by playing buffer B (silence) to make the loop a bit simpler. (Ignore the convert(0); it's just for timing.) |
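The sequencing described above can be modelled off-target with a small C sketch. Everything here is a stand-in for illustration and is not taken from volmod.c: FakeChannel, fill_silence and service_channel are hypothetical names, and on real hardware the interrupt bit is cleared by writing it to INTREQ, not by masking a variable.

```c
#include <stddef.h>
#include <stdint.h>

#define INTF_AUD0 0x0080u  /* audio channel 0 bit in INTREQ/INTREQR */

/* Host-side stand-in for one Paula channel (hypothetical). */
typedef struct {
    uint16_t intreqr;      /* pending interrupt bits, as read from INTREQR */
    const int8_t *ac_ptr;  /* next buffer the channel will latch */
    uint16_t ac_len;       /* its length in 16-bit words */
} FakeChannel;

/* Example fill routine: writes silence. A mixer would render audio here. */
void fill_silence(int8_t *buffer, size_t bytes)
{
    for (size_t i = 0; i < bytes; i++)
        buffer[i] = 0;
}

/* One polling step. When the interrupt bit is set, the buffer queued last
   time has just started playing and buffers[*free_index] has finished, so
   acknowledge, refill the free buffer and queue it for the next wrap.
   Returns 1 if a buffer was queued, 0 if playback is still mid-buffer. */
int service_channel(FakeChannel *ch, int8_t *buffers[2], int *free_index,
                    uint16_t length_words,
                    void (*fill)(int8_t *buffer, size_t bytes))
{
    if ((ch->intreqr & INTF_AUD0) == 0)
        return 0;
    ch->intreqr &= (uint16_t)~INTF_AUD0;  /* clear the int bit first */
    fill(buffers[*free_index], (size_t)length_words * 2);
    ch->ac_ptr = buffers[*free_index];    /* "play" means write ptr/len */
    ch->ac_len = length_words;
    *free_index ^= 1;                     /* alternate A and B */
    return 1;
}
```

Starting with free_index at 0 (buffer A) while buffer B is already queued matches the "start by playing buffer B" trick mentioned in the post.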
||
23 June 2024, 17:19 | #74 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,689
|
On the other track (no pun intended), I updated the readme with the findings on how to optimise the lookup for 68040. I added to a brain dump. If you are interested, https://github.com/0xABADCAFE/tkg-mi...-ov-file#68040
|
23 June 2024, 17:31 | #75 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,689
|
I think what I need is slightly different to vanilla double buffering. I'm creating a packet of audio on the fly based on current game state, so I want to reduce the latency as much as possible. I don't want to create a crackly, jittery mess either, but I think 20ms is about the maximum I want the mixer to ever be ahead. Just something else to add to the meditation pile.
|
23 June 2024, 19:51 | #76 | ||
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,275
|
Quote:
Quote:
|
||
23 June 2024, 20:08 | #77 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,689
|
The plan is to do as few multiplications as humanly possible on the 040.
If we have 16 channels with independent left/right volumes, the mixing step alone needs 16x16x2 multiplications per frame of 16 samples. That's quite a lot, and then you have up to 32 more during normalisation. Each one is about 18 cycles just for the ALU, so this is a non-starter for me. The obvious alternative, which is already in place, is to create a set of lookup tables, one for each input volume level, that maps an input 8-bit sample to the required 16-bit one for that volume level. The problem with this is cache churn. Each table is 32 cache lines big, and with 16 input channels they can all be at different volumes. So I wanted a way to focus the lookups per table into the smallest number of hot lines. |
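As a rough illustration of the table approach, here is a C sketch of one per-volume table. The names are invented for this example, and the volume is assumed to be small enough that sample x volume fits in 16 bits; the actual table contents used by the mixer are not shown in the thread.

```c
#include <stdint.h>

#define TABLE_ENTRIES 256  /* one entry per possible 8-bit sample value */

/* Fill a table so that table[(uint8_t)sample] == sample * volume.
   256 entries of 2 bytes is 512 bytes, i.e. 32 cache lines of 16 bytes. */
void build_volume_table(int16_t table[TABLE_ENTRIES], int16_t volume)
{
    for (int index = 0; index < TABLE_ENTRIES; index++) {
        int sample = (index < 128) ? index : index - 256;  /* undo the unsigned index */
        table[index] = (int16_t)(sample * volume);
    }
}

/* Scale one sample with a table read instead of a multiply. */
int16_t scale_with_table(const int16_t table[TABLE_ENTRIES], int8_t sample)
{
    return table[(uint8_t)sample];
}
```

For scale: 16x16x2 is 512 multiplies per frame, which at the quoted 18 cycles each comes to roughly 9,216 cycles before normalisation. The table read avoids that cost, but only while the lines it touches stay in the data cache.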
23 June 2024, 20:11 | #78 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,689
|
The obvious solution to that is to convert the sample input data to delta values so that the lookups cluster around the smallest values. Which, at least for the simulation, produced excellent results. A delta scheme does add a couple of complexities I wanted to avoid, particularly when changing volume from one frame of 16 samples to the next. Hence the Linear-1/Delta-15 compromise, which the simulation shows is "almost" as good.
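One way to picture the Linear-1/Delta-15 layout is the C sketch below: the first sample of each 16-sample frame is stored as-is and the other 15 as 8-bit wrapped differences. The function names are made up, and how the real mixer handles deltas that overflow 8 bits is not stated in the thread, so this only shows the encoding and its round trip.

```c
#include <stdint.h>

#define FRAME_SAMPLES 16

/* Encode one frame: sample 0 is stored as-is (linear), samples 1..15 as the
   8-bit wrapped difference from their predecessor (delta). The output bytes
   are the raw table indices a lookup would use. */
void encode_linear1_delta15(const int8_t in[FRAME_SAMPLES], uint8_t out[FRAME_SAMPLES])
{
    out[0] = (uint8_t)in[0];
    for (int i = 1; i < FRAME_SAMPLES; i++)
        out[i] = (uint8_t)(in[i] - in[i - 1]);
}

/* Decode by running sum, again wrapping to 8 bits. The result is the
   two's complement byte of each original sample. */
void decode_linear1_delta15(const uint8_t in[FRAME_SAMPLES], uint8_t out[FRAME_SAMPLES])
{
    out[0] = in[0];
    for (int i = 1; i < FRAME_SAMPLES; i++)
        out[i] = (uint8_t)(out[i - 1] + in[i]);
}
```

Because every frame restarts from a linear sample, no running state has to be carried across a frame boundary, which is presumably what makes per-frame volume changes simpler than with a pure delta stream.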
|
24 June 2024, 00:31 | #79 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,689
|
Quick question: on 68040/68060, is it still faster to clear memory using, for example, move.l d0,(a0)+ rather than clr.l (a0)+, assuming that d0 contains zero?
I use clr for clarity but I prefer cycle efficiency here. |
24 June 2024, 00:52 | #80 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,689
|
I just pushed an update. It's statically using an 060 codepath, so the parameters don't do anything (except the dump option).
I removed all the indirection, replaced all the dbf with sub/bne pairings and removed a redundant clearing of the fetch buffer. Curious to know if it's any better than last time, before I do the 040 version. |