08 June 2024, 17:58 | #21 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,861
It's good enough for Yamaha. This is what they do (or rather used to do) for digital FM synthesis.
Last edited by Karlos; 08 June 2024 at 19:25. |
09 June 2024, 02:09 | #22 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,861
I've been thinking a bit about this. In my old experiments with amplitude modulated encoding (which is what I am trying to achieve for this), because I was encoding a 16-bit file, I didn't care too much about computational cost.
So, what I was doing was to take the absolute maximum value of a collection of 16-bit samples and find the normalisation factor. That's 32768 / abs_max (which I lazily did as a float). It then just found the closest match in an array of normalisation factors for each of the 64 Paula levels. The index is (one less than) the volume to set Paula to, and the value is the amount by which all the samples need to be multiplied in order to normalise the frame.

The thing is, there are only 64 entries in that table, for obvious reasons (there are only 64 non-zero Paula volumes we can use). So why do 32768/abs_max when I could do 64/(abs_max >> 9)? And in that case, (abs_max >> 9) is the index already. The value it needs to contain is the corresponding amplification factor.

So, as a trivial example, suppose abs_max for a set was 23456. 32768/23456 gives an ideal normalisation factor of about 1.397. My clunky search would search the 64 predefined values and select index 45 (Paula volume 46), with a factor of 1.3913. If we just do 23456 >> 9, we get 45. The result is the index we wanted. This isn't a trick, and while there might be some small differences in the exact values of abs_max where we switch from one index to the next, it's not going to be huge.

-edit- Checked, and it tends to be out by 1 for a fraction of the crossovers, but consistently on the lower side, which actually works better - less chance of clipping.

So all that's missing, then, is to scale the samples we are left with. For 68060 this can be a fixed point multiplication, since you get 16x16 => 32 in about 3 cycles. It's a lot more for 68040 though.

Last edited by Karlos; 09 June 2024 at 02:27.
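A quick sanity check of the shift trick (illustrative Python, not the 68k implementation): build the 64-entry factor table, run the old closest-match search, and confirm the shift lands on the same index, give or take the one-off crossover behaviour described above.

```python
# Table entry i holds the amplification factor for Paula volume i + 1.
factors = [64.0 / vol for vol in range(1, 65)]

def search_index(abs_max):
    """The 'clunky search': index whose factor is closest to 32768/abs_max."""
    ideal = 32768.0 / abs_max
    return min(range(64), key=lambda i: abs(factors[i] - ideal))

# The worked example: abs_max = 23456.
assert 23456 >> 9 == 45                 # index 45 -> Paula volume 46
assert round(factors[45], 4) == 1.3913  # the factor the search found

# The shift agrees with the search to within 1, and when it differs it
# picks the higher index (smaller factor), so less chance of clipping.
for abs_max in range(1, 32768, 97):
    diff = (abs_max >> 9) - search_index(abs_max)
    assert diff in (0, 1)
```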
09 June 2024, 02:49 | #23 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,861
Maybe I'm worrying too much about the multiplication here. As soon as I've normalised the first 4 samples and truncated to 8 bit, I'm going to have to do a long write to the chip ram buffer that I'm building. So perhaps the cost of the remaining samples is hidden behind this, as long as it can keep on crunching while those chip writes are ongoing.
09 June 2024, 13:07 | #24 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,861
Thinking about it a bit more, there are also three distinct normalisation cases:
1. No normalisation. When the absolute peak level is above 32256, we are just going to truncate to 8 bit and play at full volume.
2. Perfect power of 2 normalisation. There are 6 ranges where our normalisation becomes a simple left shift and take the upper 8 bits.
3. Fractional normalisation. This is everything else, which notionally requires multiplication.

On an 060, it's probably fine to ignore the cases and just always multiply, because it will be hidden behind the chip ram writes (I think, please correct if otherwise). Case 3 is the one that bothers me, because on 040 we need four of them for every long chip write and they'll cost.

However, it's also the case that the end result of our multiplication and truncation is a signed 8 bit value. This makes me think that at some point, we can consider making this a pure lookup. Naively, a 32K*64 lookup of 8 bit values is a 2MB table, so that's a dud. However, not all bits of the input will affect the output; at best only 8 of them matter (the 8 highest set ones). With some shifting, the lookup table becomes 256*64. It's actually less than that because there are 7 sets covered by cases 1 and 2, but for simplicity let's ignore that for a moment. So maybe we can turn this into a pure lookup problem.
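The three-case dispatch can be sketched like this (illustrative Python; the 8.8 fixed-point representation of the factor is my assumption, not something fixed in the thread):

```python
def classify(abs_max):
    """Map a frame's absolute peak to (case, parameter)."""
    index = min(abs_max >> 9, 63)       # clamp: abs(-32768) >> 9 == 64
    volume = index + 1                  # Paula volume 1..64
    if volume == 64:
        return ("none", 0)              # case 1: truncate, play at full volume
    if volume in (1, 2, 4, 8, 16, 32):  # factor 64/volume is a power of 2
        shift = (64 // volume).bit_length() - 1
        return ("shift", shift)         # case 2: normalise by left shift
    n = (64 << 8) // volume             # case 3: 8.8 fixed-point factor
    return ("multiply", n)

assert classify(32700) == ("none", 0)            # peak >= 32256
assert classify(100) == ("shift", 6)             # volume 1: amplify by 64 = 1 << 6
assert classify(23456) == ("multiply", 356)      # volume 46: factor ~1.39
```

The six power-of-2 volumes (1, 2, 4, 8, 16, 32) give shifts 6 down to 1; together with case 1 (volume 64, factor 1) that's the 7 sets mentioned above.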
09 June 2024, 20:51 | #25 |
Registered User
Join Date: Oct 2020
Location: Bicester
Posts: 2,084
|
I wish I could help, but most of this audio stuff goes way over my head.
09 June 2024, 21:54 | #26 | |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,861
Quote:
Specifically, I have a set of 16-bit signed values (samples from mixing the multiple channels) that I need to multiply by another 16 bit value, N (a fixed point value that will increase the volume of the samples to maximum), and then take the most significant 8 bits as my result.

There is one case where N = 1, so that's easy to bypass. There are 6 other cases where N is an exact power of 2 (2, 4, 8, 16, 32 and 64) that can be done using shifts. This leaves me with all the cases where N is some fractional value, which notionally require the multiplication step.

However, intuition tells me that because there are only going to be 256 output values from all this, not all 16 bits of the input matter. Given we are looking for an 8 bit result, it seems intuitive that only the most significant 8 bits of the input matter*. A naive 32K*64 table is far too big, but 256*64 is a much more manageable lookup size.

*I don't just mean the upper byte; I mean the 8 bits after any leading zeros.
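That intuition can be brute-forced (illustrative Python; the assumptions here are mine: N is held as 8.8 fixed point, 64/volume, and the 8-bit result is bits 16-23 of the 32-bit product). Keeping only the 8 bits after the leading zeros for each volume index changes the 8-bit result by at most 1 LSB:

```python
worst = 0
for index in range(64):
    volume = index + 1
    n = (64 << 8) // volume                 # 8.8 fixed-point factor 64/volume
    top = 512 * volume                      # abs_max < top for this index
    drop = max(0, top.bit_length() - 1 - 8) # bits below the top 8 significant
    for s in range(0, top, 7):              # positive magnitudes, sampled
        exact = (s * n) >> 16               # using the full 16-bit input
        approx = ((s >> drop) * n) >> (16 - drop)  # top 8 significant bits only
        worst = max(worst, abs(exact - approx))
assert worst <= 1                           # at most 1 LSB of 8-bit error
```

The discarded bits are worth less than 2*volume, and the factor is about 16384/volume in 8.8, so the lost contribution to the product stays below one output LSB's worth (65536), which is why the 256*64 table can work.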
16 June 2024, 12:30 | #27 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,861
Question for the cycle counting experts:
Roughly, how many CPU cycles can you stuff behind a long write to chip ram on a typical 68060 at 50MHz? If it's only accessing data you expect to be resident in a cache line already, are there any really obvious gotchas I should know about?
Last edited by Karlos; 16 June 2024 at 13:24.
16 June 2024, 13:48 | #28 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,298
If there are no cache misses, then always at least 28. A write to chip mem takes 2 CCKs (564ns), 50MHz cycles are 20ns, and 564/20 ~= 28.2.
I just verified this on my 060 by timing a loop of: Code:
	move.l	d1,(a0)		; a0 points to chip mem
	rept	\1
	mulu.w	d2,d3
	endr
Code:
 0	690 ns	34.5 cycles
 5	690 ns	34.5 cycles
10	690 ns	34.5 cycles
14	690 ns	34.5 cycles
15	690 ns	34.5 cycles
16	690 ns	34.5 cycles
17	722 ns	36.1 cycles
18	762 ns	38.1 cycles
20	842 ns	42.1 cycles

Really no gotcha except for the cache misses part. You can ship off (up to) 4 writes to the store buffer "for free" and it will retire them ASAP, but a cache miss will stall until the store buffer is empty. The writes have to be properly aligned long words of course, otherwise you'll needlessly be wasting bandwidth.
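The arithmetic checks out directly (illustrative Python; the PAL colour clock frequency of ~3.546895 MHz is general Amiga knowledge, not from the post):

```python
# A chip write takes 2 CCKs; a 50 MHz 68060 cycle is 20 ns.
CCK_NS = 1e9 / 3_546_895       # ~281.9 ns per PAL colour clock
write_ns = 2 * CCK_NS          # ~564 ns per chip long write
cpu_ns = 1e9 / 50e6            # 20 ns per 50 MHz CPU cycle

assert round(write_ns) == 564
assert 28 < write_ns / cpu_ns < 29   # ~28.2 free cycles per write
assert 690 / cpu_ns == 34.5          # measured loop: 690 ns = 34.5 cycles
```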
16 June 2024, 14:36 | #29 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,861
I'm going to need to smuggle something heinous behind it https://github.com/0xABADCAFE/tkg-mi...xer_asm.s#L210
There's a buffer of 16 16-bit samples that is cache aligned and has just been mixed into, so it should be hot. We've got a fixed point normalisation factor in d2 that will maximise them for playback at a particular volume (already identified). The resulting product has the desired signed 8-bit value in bits 16-23, so we swap them and move that byte into a buffer long. Once we've done this for 4 samples, the complete long is written to chip and we repeat.

There's a lot about this code I don't like. For example, I don't know if it would be faster to write the product to a cached long and then move the byte out to some other buffer, and the instruction scheduling probably kills superscalar execution. But if most of the cycles are lost behind the 4 long chip writes, maybe it's ok. We already did a word write to chip for the volume channel buffer before we entered this.
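The per-sample step can be stated compactly (illustrative Python; the 8.8 factor representation is my reading of the code, and 23456 with factor 356 is the worked example from earlier in the thread):

```python
def norm_byte(sample, n):
    """muls.w + swap + move.b: bits 16-23 of the signed 32-bit product."""
    product = (sample * n) & 0xFFFFFFFF   # signed 16x16 -> 32, like muls.w
    return (product >> 16) & 0xFF

n = (64 << 8) // 46             # 8.8 fixed-point factor for Paula volume 46
assert n == 356                 # ~1.39, the example frame's factor
assert norm_byte(23456, n) == 127     # the frame's peak hits full scale
assert norm_byte(-23456, n) == 0x80   # -peak truncates (toward -inf) to -128
assert norm_byte(0, n) == 0
```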
16 June 2024, 14:40 | #30 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,861
This code will be despicable on anything less than an 060, as those multiplications will murder you, bury you, dig you back up, desecrate you and then leave you for the birds.
The options I am considering are either plain shift normalisation, which will be quick but loses a significant amount of dynamic range, or some kind of lookup.
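The dynamic range cost of shift-only normalisation can be quantified (illustrative Python, under my assumptions: amplification capped at 64, and only counting frames that a fractional factor could fully normalise, abs_max >= 256). The worst case lands just above a power-of-two boundary and only reaches about half of full scale, roughly 6 dB short:

```python
import math

def shift_only_peak(abs_max):
    """Peak after the largest power-of-2 amplification that won't clip."""
    shift = 0
    while shift < 6 and (abs_max << (shift + 1)) <= 32767:
        shift += 1
    return abs_max << shift

worst = min(shift_only_peak(a) / 32767.0 for a in range(256, 32768))
assert 0.49 < worst < 0.51            # barely half scale in the worst case
assert 20 * math.log10(worst) < -6.0  # about -6 dB below full scale
```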
16 June 2024, 15:13 | #31 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,298
That code is only slightly slower than "copy" speed:
Code:
	rept	4
	move.w	(a2)+,d0
	endr
	move.l	d0,(a1)+
16 June 2024, 15:31 | #32 | |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,861
Quote:
16 June 2024, 16:30 | #33 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,298
|
Quote:
The real issue is that my test was too simple. Since I test with large buffers, there is a cache miss after every other chip write, so we don't get as many free cycles. After adding a bit of prefetching (128 bytes), it performs slightly better than the naive copy loop. You really do have to be careful when you take those cache misses.
16 June 2024, 16:52 | #34 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,861
|
As long as it's close to copy speed in my case, I'll be happy enough. That'll be a decent result considering that we are doing the whole sample normalisation thing.
On the 040, that code will be a disaster. Each multiply alone is at best 18 cycles IIRC. You also have half as many to play with. |
16 June 2024, 17:06 | #35 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,298
Quote:
But to settle the 060 part, your code is (essentially) copy speed. Doing fast->fast conversion with everything in cache takes ~26 cycles/long, fast->chip ~35 (704ns) including loop overhead in both cases. FWIW I've previously measured cache misses to add 13 cycles on my machine (excluding cost of flushing push/store buffers). |
16 June 2024, 17:09 | #36 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,861
I got rid of the first swap anyway.
Code:
.mul_norm_four:				; something like this, for 060
	move.w	(a2)+,d0	; xx:xx:AA:aa
	muls.w	d2,d0		; 00:AA:xx:xx
	lsr.l	#8,d0		; 00:00:AA:xx
	move.w	d0,d1		; xx:xx:AA:xx
	move.w	(a2)+,d0	; xx:xx:BB:bb
	muls.w	d2,d0		; xx:BB:xx:xx
	swap	d0		; xx:xx:xx:BB
	move.b	d0,d1		; xx:xx:AA:BB
	lsl.l	#8,d1		; xx:AA:BB:00
	move.w	(a2)+,d0	; xx:xx:CC:cc
	muls.w	d2,d0		; xx:CC:xx:xx
	swap	d0		; xx:xx:xx:CC
	move.b	d0,d1		; xx:AA:BB:CC
	lsl.l	#8,d1		; AA:BB:CC:00
	move.w	(a2)+,d0	; xx:xx:DD:dd
	muls.w	d2,d0		; xx:DD:xx:xx
	swap	d0		; xx:xx:xx:DD
	move.b	d0,d1		; AA:BB:CC:DD
	move.l	d1,(a1)+	; long slow chip write here
	dbra	d4,.mul_norm_four
	move.l	a1,(a4)
	bra.s	.done_channel_normalise

.shift_norm_four:
	move.l	(a2)+,d0	; AA:aa:BB:bb
	lsr.l	d2,d0		; 00:AA:xx:BB
	move.l	(a2)+,d1	; CC:cc:DD:dd
	lsl.w	#8,d0		; 00:AA:BB:00
	lsr.l	d2,d1		; xx:CC:xx:DD
	lsl.l	#8,d0		; AA:BB:00:00
	lsl.w	#8,d1		; xx:CC:DD:00
	lsr.l	#8,d1		; 00:xx:CC:DD
	move.w	d1,d0		; AA:BB:CC:DD
	move.l	d0,(a1)+	; long slow chip write
	dbra	d4,.shift_norm_four
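For what it's worth, the d0/d1 byte shuffling in .mul_norm_four can be modelled outside the asm to confirm it really ends up with AA:BB:CC:DD (illustrative Python model of the register moves; muls.w is modelled as signed 16x16 -> 32, which is valid here because the 8.8 factor never has bit 15 set):

```python
M32 = 0xFFFFFFFF

def mul_norm_four(samples, n):
    """Replay the register moves for four signed 16-bit samples."""
    d1 = 0xDEADBEEF                           # garbage, like a real register
    d0 = ((samples[0] * n) & M32) >> 8        # muls.w + lsr.l #8
    d1 = (d1 & 0xFFFF0000) | (d0 & 0xFFFF)    # move.w d0,d1 -> xx:xx:AA:xx
    for k in (1, 2, 3):
        d0 = (samples[k] * n) & M32           # muls.w
        d0 = ((d0 >> 16) | (d0 << 16)) & M32  # swap
        d1 = (d1 & 0xFFFFFF00) | (d0 & 0xFF)  # move.b d0,d1
        if k < 3:
            d1 = (d1 << 8) & M32              # lsl.l #8,d1
    return d1                                 # AA:BB:CC:DD

def packed_reference(samples, n):
    """Straightforward big-endian pack of bits 16-23 of each product."""
    out = 0
    for s in samples:
        out = ((out << 8) | ((((s * n) & M32) >> 16) & 0xFF)) & M32
    return out

for frame in ([1000, -2000, 32767, -32768], [0, 1, -1, 12345]):
    assert mul_norm_four(frame, 356) == packed_reference(frame, 356)
```

The initial garbage in d1 is shifted out by the two lsl.l #8 steps, which is why the first sample can skip the swap.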
16 June 2024, 17:15 | #37 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,861
Now I just need to test it out with some actual input stream data and write the output to a file for validation.
16 June 2024, 20:39 | #38 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,861
Which you'd think would be easy, but every time I sit down, I get some other distraction to deal with lol.
Happy father's day |
17 June 2024, 11:16 | #39 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,861
@paraj
Those numbers make me think of the mixing side now. Currently, there is a set of 8 bit to 16 bit sample conversion tables, where each table is for a particular virtual volume level for a mixer channel. There are only 15 of these tables (zero is handled trivially) and each one is 256 words.

It bothers me that access within a table will be quite scattered, and I can imagine it being plagued by misses. I could just store the positive half of each table and handle the sign, but even then, that's a lot of potential jumping about. I also thought about delta encoding the input to make the lookups cluster around zero more, but there are some problems with that approach too, particularly when the volume levels are changing.

Do you think a straight up multiplication would be better on 060? That's 2 multiplications per sample per active channel.
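A sketch of those conversion tables (illustrative Python; the exact per-volume scale is my assumption, with table v mapping signed sample s to s * v), plus a check that an ext.w/muls replacement computes the identical word:

```python
def build_table(v):
    """256-entry table for virtual volume v, indexed by the raw byte."""
    out = []
    for b in range(256):
        s = b - 256 if b >= 128 else b      # byte reinterpreted as signed
        out.append((s * v) & 0xFFFF)        # 16-bit two's complement word
    return out

tables = [None] + [build_table(v) for v in range(1, 16)]  # volumes 1..15

# Sign-extend + multiply gives the same word as the table lookup.
for v in (1, 7, 15):
    for b in range(256):
        s = b - 256 if b >= 128 else b      # ext.w
        assert (s * v) & 0xFFFF == tables[v][b]   # muls.w == lookup
```

Under this assumption, the 15 x 256 x 2 = 7680 bytes of scattered table data are exactly what the straight multiply would make redundant.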
17 June 2024, 12:45 | #40 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,298
If you just need multiplication (and even a little extra calculation, assuming it's not something like clipping) then straight calculation is probably faster even without cache effects. Consider your current loop:
Code:
; Cycle 1
	move.b	(a3)+,d0	; pOEP
				; sOEP idle because move also uses memory cycle
; pOEP change/use stall for 3 cycles waiting for d0
; Cycle 2-5
	move.w	0(a2,d0.w*2),d4	; pOEP
				; sOEP idle because add also uses memory cycle
; Cycle 6-7
	add.w	d4,(a4)+	; pOEP
				; sOEP idle because dbra is pOEP-only
; Cycle 8
	dbra	d1,.next_sample	; pOEP
Something like this should be more 060 friendly (assuming the lookup can be replaced by ext.w / muls):

Code:
; Cycle 1
	move.b	(a3)+,d0	; pOEP
				; sOEP idle because ext.w d0 needs d0
; Cycle 2
	ext.w	d0		; pOEP
				; sOEP idle because muls is pOEP-only
; Cycle 3-4
	muls.w	d2,d0		; pOEP
				; sOEP idle because muls is pOEP-only
; Cycle 5-6
	add.w	d0,(a4)+	; pOEP
	subq.w	#1,d1		; sOEP
; Assuming correctly predicted (taking 0 cycles)
	bne.b	.next_sample