English Amiga Board > Coders > Coders. General

Old 08 June 2024, 17:58   #21
Karlos
Alien Bleed
 
Karlos's Avatar
 
Join Date: Aug 2022
Location: UK
Posts: 4,473
It's good enough for Yamaha. This is what they do (or rather used to do) for digital FM synthesis.

Old 09 June 2024, 02:09   #22
Karlos
I've been thinking a bit about this. In my old experiments with amplitude modulated encoding (which is what I am trying to achieve for this), because I was encoding a 16-bit file, I didn't care too much about computational cost.

So, what I was doing was to take the absolute maximum value of a collection of 16-bit samples and find the normalisation factor. That's 32768 / abs_max (which I lazily computed as a float). The code then just found the closest match in an array of normalisation factors, one for each of the 64 Paula levels. The index is (one less than) the volume to set Paula to, and the value is the amount by which all the samples need to be multiplied in order to normalise the frame.

The thing is, there are only 64 entries in that table, for obvious reasons (there are only 64 non-zero Paula volumes we can use).

So why do 32768/abs_max when I could do 64/(abs_max >> 9) ? And in that case, (abs_max >> 9) is the index already. The value it needs to contain is the corresponding amplification factor.

So, as a trivial example, suppose abs_max for a set was 23456. 32768/23456 gives an ideal normalisation factor of about 1.397. My clunky search would search the 64 predefined values and select index 45 (Paula volume 46), with a factor of 1.3913.

If we just do 23456>>9, we get 45. The result is the index we wanted. This isn't a trick and while there might be some small differences in the exact values of abs_max where we switch from one index to the next, it's not going to be huge.

-edit- checked and it tends to be out by 1 for a fraction of the crossovers, but consistently on the lower side, which actually works better - less chance of clipping.

So all that's missing, then, is to scale the samples we are left with. For 68060 this can be a fixed point multiplication, since you get 16x16 => 32 in about 3 cycles. It's a lot more for 68040 though.
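To make the arithmetic concrete, here's a small C sketch of the shift-derived index (helper names are hypothetical; it assumes the 64 non-zero Paula volumes each cover a 512-wide band of peak values, with the factor held as 2.14 fixed point):

```c
#include <assert.h>
#include <stdint.h>

/* Each of the 64 non-zero Paula volumes covers a 512-wide band of peak
 * values (32768 / 64 = 512), so the volume index falls straight out of
 * a shift -- no search over a factor table needed. */
static int volume_index(uint16_t abs_max)
{
    int idx = abs_max >> 9;              /* 0..63 for sane peaks */
    return idx > 63 ? 63 : idx;          /* clamp, just in case */
}

/* Matching normalisation factor as 2.14 fixed point: 64 / (idx + 1). */
static uint16_t norm_factor_2_14(int idx)
{
    return (uint16_t)((64UL << 14) / (unsigned)(idx + 1));
}
```

For abs_max = 23456 this yields index 45 (Paula volume 46) and a 2.14 factor of 22795, i.e. 22795/16384 ≈ 1.3913, matching the table search described above.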

Old 09 June 2024, 02:49   #23
Karlos
Maybe I'm worrying too much about the multiplication here. As soon as I've normalised the first 4 samples and truncated to 8 bit, I'm going to have to do a long write to the chip ram buffer that I'm building. So perhaps the cost of the remaining samples is hidden behind this, as long as it can keep on crunching while those chip writes are ongoing.
Old 09 June 2024, 13:07   #24
Karlos
Thinking about it a bit more, there are also three distinct normalisation cases:

1. No normalisation. When the absolute peak level is at or above 32256, we are just going to truncate to 8 bit and play at full volume.
2. Perfect power of 2 normalisation. There are 6 ranges where our normalisation becomes a simple left shift and take the upper 8 bits.
3. Fractional normalisation. This is everything else, that notionally requires multiplication.

On an 060, it's probably fine to ignore the cases and just always multiply because it will be hidden behind the chip ram writes (I think, please correct if otherwise).

Case 3 is the one that bothers me, because on 040, we need four of them for every long chip write we need and they'll cost.

However, it's also the case that the end result of our multiplication and truncation is a signed 8 bit value. This makes me think that at some point, we can consider making this a pure lookup. Naively, a 32K*64 lookup of 8 bit values is a 2MB table, so that's a dud. However, not all bits of the input will affect the output: at best only 8 of them matter (the 8 highest set ones). With some shifting, the lookup table becomes 256*64. It's actually less than that because there are 7 sets covered by cases 1 and 2, but for simplicity let's ignore that for a moment.

So maybe we can turn this into a pure lookup problem.
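As a sanity check, the three-way split can be modelled in C (a sketch under the assumptions above: the factor for index idx is 64/(idx+1), so it is an exact power of two only when idx+1 is itself a power of two; names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

enum norm_case { NORM_NONE, NORM_SHIFT, NORM_MUL };

/* Classify a frame by its absolute peak.  idx = abs_max >> 9 picks the
 * Paula volume (idx + 1).  idx 63 is the factor-1, truncate-only case;
 * the pure-shift bands are those where idx + 1 is a power of two. */
static enum norm_case classify(uint16_t abs_max)
{
    int idx = abs_max >> 9;
    if (idx >= 63)
        return NORM_NONE;                /* peak >= 32256: full volume */
    int vol = idx + 1;
    if ((vol & (vol - 1)) == 0)
        return NORM_SHIFT;               /* factor 2, 4, 8, 16, 32 or 64 */
    return NORM_MUL;                     /* fractional factor */
}
```

Counting the bands for idx 0..62 gives exactly 6 pure-shift cases, which together with the no-op case accounts for the 7 sets mentioned above.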
Old 09 June 2024, 20:51   #25
abu_the_monkey
Registered User
 
Join Date: Oct 2020
Location: Bicester
Posts: 2,022
I wish I could help, but most of this audio stuff goes way over my head.
Old 09 June 2024, 21:54   #26
Karlos
Quote:
Originally Posted by abu_the_monkey View Post
I wish I could help, but most of this audio stuff goes way over my head.
The challenge at the moment is not really an audio problem, it's a how to avoid (fixed point) multiplication problem. I'm not too bothered on 68060, because it can do multiplication easily in something like 2 or 3 cycles. On 040 though, it's much longer, 18+ IIRC.

Specifically, I have a set of 16-bit signed values (samples from mixing the multiple channels) that I need to multiply by another 16-bit value, N (a fixed point value that will increase the volume of the samples to maximum), and then take the most significant 8 bits as my result. There is one case where N = 1, so that's easy to bypass. There are 6 other cases where N is an exact power of 2 (2, 4, 8, 16, 32 and 64) that can be done using shifts.

This leaves me with all the cases where N is some fractional value, which notionally require the multiplication step. However, intuition tells me that because there are only going to be 256 output values from all this, not all 16 bits of the input matter. Given we are looking for an 8 bit result, it seems intuitive that only the most significant 8 bits of the input matter*, and 256*64 is a much more manageable lookup size.

*I don't just mean the upper byte, the 8 bits after any leading zeros.
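The claim in that footnote can be tested numerically. This C sketch (an assumed 2.14 fixed-point factor with the output taken as the top 8 bits of the normalised result; `top8` is a hypothetical helper, and `__builtin_clz` assumes GCC/Clang) quantises a sample to its 8 most significant set bits and compares against the exact product:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Exact path: s is a mixed 16-bit sample, n a 2.14 fixed-point
 * normalisation factor; keep the top 8 bits of the normalised 16-bit
 * value, i.e. bits 22..29 of the 32-bit product. */
static int exact8(int16_t s, uint16_t n)
{
    return (int)(((int32_t)s * n) >> 22);    /* arithmetic shift assumed */
}

/* Quantise s to its 8 most significant set bits (the bits after any
 * leading zeros), zeroing the rest -- the proposed lookup key. */
static int16_t top8(int16_t s)
{
    int v = abs(s);
    if (v < 256)
        return s;                            /* 8 bits or fewer already */
    int p = 31 - __builtin_clz((unsigned)v); /* position of top set bit */
    int sh = p - 7;
    v = (v >> sh) << sh;
    return (int16_t)(s < 0 ? -v : v);
}
```

Sweeping samples against a couple of factors, the quantised output stays within 2 LSBs of the exact one in this sketch, which suggests a 256*64 table is viable.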
Old 16 June 2024, 12:30   #27
Karlos
Question for the cycle counting experts:

Roughly, how many CPU cycles can you stuff behind a long write to chip ram on a typical 68060 at 50MHz? If it's only accessing data you expect to be resident in a cache line already, are there any really obvious gotchas I should know about?

Old 16 June 2024, 13:48   #28
paraj
Registered User
 
paraj's Avatar
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,215
If there are no cache misses, then always at least 28. A write to chip mem takes 2 CCKs (564ns), 50MHz cycles are 20ns, and 564/20 ~= 28.2.

I just verified this on my 060 by timing a loop of:
Code:
        move.l  d1,(a0) ; a0 points to chip mem
        rept    \1
        mulu.w  d2,d3
        endr
For different repetition counts I get
Code:
0             690 ns 34.5 cycles
5             690 ns 34.5 cycles
10            690 ns 34.5 cycles
14            690 ns 34.5 cycles
15            690 ns 34.5 cycles
16            690 ns 34.5 cycles
17            722 ns 36.1 cycles
18            762 ns 38.1 cycles
20            842 ns 42.1 cycles
690ns corresponds to a write speed of 5.8*10^6 bytes/second, which squares with the 5.5 figure reported by bustest (I ran my test with DMA off).

Really no gotcha except for the cache misses part. You can ship off (up to) 4 writes to the store buffer "for free" and it will retire them ASAP, but a cache miss will stall until the store buffer is empty. The writes have to be properly aligned long words of course, otherwise you'll needlessly be wasting bandwidth.
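For reference, the timing arithmetic above as a quick sketch (assuming a PAL CCK of ~3.547 MHz, 2 CCKs per chip bus access and a 20 ns CPU cycle at 50 MHz):

```c
#include <assert.h>

/* One chip bus access occupies 2 CCKs.  A PAL CCK is ~281.94 ns
 * (1000 / 3.546895 MHz), so a long write to chip ram holds the bus for
 * ~563.9 ns; at 20 ns per cycle a 50 MHz 060 can hide ~28 cycles of
 * work behind each write. */
static double chip_write_ns(void)
{
    return 2.0 * (1e3 / 3.546895);
}

static double hidden_cycles_at_50mhz(void)
{
    return chip_write_ns() / 20.0;
}

/* Throughput implied by a measured time per long (one long = 4 bytes). */
static double bytes_per_sec(double ns_per_long)
{
    return 4.0 / (ns_per_long * 1e-9);
}
```

The measured 690 ns per long works out at ~5.8 MB/s, squaring with the bustest figure quoted above.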
Old 16 June 2024, 14:36   #29
Karlos
I'm going to need to smuggle something heinous behind it https://github.com/0xABADCAFE/tkg-mi...xer_asm.s#L210

There's a buffer of 16 16-bit samples that is cache aligned and having just been mixed into, so should be hot. We've got a fixed point normalisation factor in d2 that will maximise them for playback at a particular volume (already identified). The resulting product has the desired signed 8-bit value in bits 16-23, so we swap them and move that byte into a buffer long. Once we've done this for 4 samples, the complete long is written to chip and we repeat.

There's a lot about this code I don't like. For example, I don't know if it would be faster to write the product to a cached long and then move the byte out to some other buffer, and the instruction scheduling probably kills superscalar execution. But if most of the cycles are lost behind the 4 long chip writes, maybe it's OK. We already did a word write to chip for the volume channel buffer before we entered this.
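A C model of what that inner step is meant to compute (not the asm itself, just a sketch: the factor n is assumed pre-scaled so the wanted signed byte lands in bits 16-23 of each product, and four results are packed big-endian into one long):

```c
#include <assert.h>
#include <stdint.h>

/* Pack four normalised samples into one big-endian long, modelling the
 * mul_norm inner step: byte_i = bits 16..23 of (s_i * n), i.e. the
 * signed 8-bit result the swap/move.b sequence extracts. */
static uint32_t norm_pack4(const int16_t s[4], uint16_t n)
{
    uint32_t out = 0;
    for (int i = 0; i < 4; i++) {
        int8_t b = (int8_t)(((int32_t)s[i] * n) >> 16);
        out = (out << 8) | (uint8_t)b;   /* first sample ends up highest */
    }
    return out;
}
```

One long like this per four mixed samples is what each slow chip write then ships out.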
Old 16 June 2024, 14:40   #30
Karlos
This code will be despicable on anything less than an 060, as those multiplications will murder you, bury you, dig you back up, desecrate you and then leave you for the birds.

The options I am considering are just shift normalisation, which will be quick but loses a significant amount of dynamic range, or some kind of lookup.
Old 16 June 2024, 15:13   #31
paraj
That code is only slightly slower than "copy" speed:
Code:
        rept 4
        move.w  (a2)+,d0
        endr
        move.l  d0,(a1)+
comes out at 46 cycles, while the inner part of your loop (dbf omitted) clocks in at 53 cycles. Not sure why cost is not fully hidden, but it's probably close enough anyway.
Old 16 June 2024, 15:31   #32
Karlos
Quote:
Originally Posted by paraj View Post
That code is only slightly slower than "copy" speed:
Code:
        rept 4
        move.w  (a2)+,d0
        endr
        move.l  d0,(a1)+
comes out at 46 cycles, while the inner part of your loop (dbf omitted) clocks in at 53 cycles. Not sure why cost is not fully hidden, but it's probably close enough anyway.
I bet the swaps don't help. Don't they tie up the primary execution unit?
Old 16 June 2024, 16:30   #33
paraj
Quote:
Originally Posted by Karlos View Post
I bet the swaps don't help. Don't they tie up the primary execution unit?
The code is not great for superscalar execution; swap/mul/dbf are all pOEP-only and there are register write/use stalls, but it should still be <28 cycles.

The real issue is that my test was too simple. Since I test with large buffers there is a cache miss after every other chip write so we don't get as many free cycles. After adding a bit of prefetching (128 bytes) it performs slightly better than the naive copy loop.

You really do have to be careful when you take those cache misses.
Old 16 June 2024, 16:52   #34
Karlos
As long as it's close to copy speed in my case, I'll be happy enough. That'll be a decent result considering that we are doing the whole sample normalisation thing.

On the 040, that code will be a disaster. Each multiply alone is at best 18 cycles IIRC. You also have half as many to play with.
Old 16 June 2024, 17:06   #35
paraj
Quote:
Originally Posted by Karlos View Post
As long as it's close to copy speed in my case, I'll be happy enough. That'll be a decent result considering that we are doing the whole sample normalisation thing. On the 040, that code will be a disaster. Each multiply alone is at best 18 cycles IIRC. You also have half as many to play with.
< 060 is going to be more challenging/fun in other words

But to settle the 060 part, your code is (essentially) copy speed. Doing fast->fast conversion with everything in cache takes ~26 cycles/long, fast->chip ~35 (704ns) including loop overhead in both cases. FWIW I've previously measured cache misses to add 13 cycles on my machine (excluding cost of flushing push/store buffers).
Old 16 June 2024, 17:09   #36
Karlos
I got rid of the first swap anyway.
Code:
.mul_norm_four:
        ; something like this, for 060
        move.w  (a2)+,d0    ; xx:xx:AA:aa
        muls.w  d2,d0       ; 00:AA:xx:xx
        lsr.l   #8,d0       ; 00:00:AA:xx
        move.w  d0,d1       ; xx:xx:AA:xx

        move.w  (a2)+,d0    ; xx:xx:BB:bb
        muls.w  d2,d0       ; xx:BB:xx:xx
        swap    d0          ; xx:xx:xx:BB
        move.b  d0,d1       ; xx:xx:AA:BB
        lsl.l   #8,d1       ; xx:AA:BB:00

        move.w  (a2)+,d0    ; xx:xx:CC:cc
        muls.w  d2,d0       ; xx:CC:xx:xx
        swap    d0          ; xx:xx:xx:CC
        move.b  d0,d1       ; xx:AA:BB:CC
        lsl.l   #8,d1       ; AA:BB:CC:00

        move.w  (a2)+,d0    ; xx:xx:DD:dd
        muls.w  d2,d0       ; xx:DD:xx:xx
        swap    d0          ; xx:xx:xx:DD
        move.b  d0,d1       ; AA:BB:CC:DD

        move.l  d1,(a1)+    ; long slow chip write here

        dbra    d4,.mul_norm_four
        move.l a1,(a4)

        bra.s   .done_channel_normalise

.shift_norm_four:
        move.l  (a2)+,d0 ; AA:aa:BB:bb
        lsr.l   d2,d0    ; 00:AA:xx:BB
        move.l  (a2)+,d1 ; CC:cc:DD:dd
        lsl.w   #8,d0    ; 00:AA:BB:00
        lsr.l   d2,d1    ; xx:CC:xx:DD
        lsl.l   #8,d0    ; AA:BB:00:00
        lsl.w   #8,d1    ; xx:CC:DD:00
        lsr.l   #8,d1    ; 00:xx:CC:DD
        move.w  d1,d0    ; AA:BB:CC:DD

        move.l  d0,(a1)+ ; long slow chip write
        dbra    d4,.shift_norm_four
Old 16 June 2024, 17:15   #37
Karlos
Now I just need to test it out with some actual input stream data and write the output to a file for validation.
Old 16 June 2024, 20:39   #38
Karlos
Which you think would be easy but every time I sit down, I get some other distraction to deal with lol.

Happy father's day
Old 17 June 2024, 11:16   #39
Karlos
@paraj

Those numbers make me think of the mixing side now. Currently, there is a set of 8-bit to 16-bit sample conversion tables, where each table is for a particular virtual volume level for a mixer channel. There are only 15 of these tables (zero is handled trivially) and each one is 256 words. It bothers me that access within a table will be quite scattered and I can imagine it being plagued by misses. I could just store the positive half of each table and handle the sign, but even then, that's a lot of potential jumping about. I also thought about delta encoding the input to make the lookups cluster around zero more, but there are some problems with that approach too, particularly when the volume levels are changing.

Do you think a straight up multiplication would be better on 060? That's 2 multiplications, per sample per active channel.
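For comparison, a sketch of both paths in C. The table layout matches the description above (15 tables of 256 words), but the scale rule is an assumption for illustration: sample * level * 17, so level 15 spans nearly the full 16-bit range (15 * 17 = 255):

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: one 256-entry table of pre-scaled words per non-zero
 * virtual volume level (15 tables, 512 bytes each), indexed by the raw
 * sample byte. */
static int16_t mix_table[15][256];

static void build_mix_tables(void)
{
    for (int level = 1; level <= 15; level++)
        for (int u = 0; u < 256; u++)
            mix_table[level - 1][u] = (int16_t)((int8_t)u * level * 17);
}

/* The straight multiply alternative, replacing the scattered lookup
 * (level * 17 could be folded into one pre-computed factor per channel). */
static int16_t mix_mul(int8_t s, int level)
{
    return (int16_t)(s * level * 17);
}
```

Whether the multiply beats the table then comes down to how often the table lookups miss the cache, which is exactly the question here.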
Old 17 June 2024, 12:45   #40
paraj
If you just need multiplication (and even a little extra calculation, assuming it's not something like clipping) then straight calculation is probably faster even without cache effects. Consider your current loop:
Code:
        ; Cycle 1
        move.b  (a3)+,d0                        ; pOEP
        ; sOEP idle because move also uses memory cycle
        ; pOEP Change/use stall for 3 cycles waiting for d0
        ; Cycle 2-5
        move.w  0(a2,d0.w*2),d4                 ; pOEP
        ; sOEP idle because add also uses memory cycle
        ; Cycle 6-7
        add.w   d4,(a4)+                        ; pOEP
        ; sOEP idle because dbra is pOEP-only
        ; Cycle 8
        dbra    d1,.next_sample               ; pOEP
The lookup ends up taking (at least) 3 cycles due to the change/use stall, and there isn't really any (easy) way of hiding the stall (except perhaps some tricky loop overlapping).

Something like this should be more 060 friendly (assuming the lookup can be replaced by ext.w / muls):
Code:
        ; Cycle 1
        move.b  (a3)+,d0                        ; pOEP
        ; sOEP idle because ext.w d0 needs d0
        ; Cycle 2
        ext.w   d0                              ; pOEP
        ; sOEP idle because muls is pOEP-only
        ; Cycle 3-4
        muls.w  d2,d0                           ; pOEP
        ; sOEP idle because muls is pOEP-only
        ; Cycle 5-6
        add.w   d0,(a4)+                        ; pOEP
        subq.w  #1,d1                           ; sOEP
        ; Assuming correctly predicted (taking 0 cycles)
        bne.b   .next_sample