English Amiga Board > Coders > Coders. General

Old 08 June 2024, 17:58   #21
Karlos
Alien Bleed
 
Karlos's Avatar
 
Join Date: Aug 2022
Location: UK
Posts: 4,473
It's good enough for Yamaha. This is what they do (or rather used to do) for digital FM synthesis.

Old 09 June 2024, 02:09   #22
Karlos
I've been thinking a bit about this. In my old experiments with amplitude modulated encoding (which is what I am trying to achieve for this), because I was encoding a 16-bit file, I didn't care too much about computational cost.

So, what I was doing was to take the absolute maximum value of a collection of 16-bit samples and find the normalisation factor. That's 32768 / abs_max (which I lazily computed as a float). The code then just found the closest match in an array of normalisation factors, one for each of the 64 Paula levels. The index is (one less than) the volume to set Paula to, and the value is the amount by which all the samples need to be multiplied in order to normalise the frame.

The thing is, there are only 64 entries in that table, for obvious reasons (there are only 64 non-zero Paula volumes we can use).

So why do 32768/abs_max when I could do 64/(abs_max >> 9) ? And in that case, (abs_max >> 9) is the index already. The value it needs to contain is the corresponding amplification factor.

So, as a trivial example, suppose abs_max for a set was 23456. 32768/23456 gives an ideal normalisation factor of about 1.397. My clunky search would search the 64 predefined values and select index 45 (Paula volume 46), with a factor of 1.3913.

If we just do 23456>>9, we get 45. The result is the index we wanted. This isn't a trick and while there might be some small differences in the exact values of abs_max where we switch from one index to the next, it's not going to be huge.

-edit- checked and it tends to be out by 1 for a fraction of the crossovers, but consistently on the lower side, which actually works better - less chance of clipping.

So all that's missing, then, is to scale the samples we are left with. For 68060 this can be a fixed point multiplication, since you get 16x16 => 32 in about 3 cycles. It's a lot more for 68040 though.
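To make the arithmetic concrete, here's a small C sketch of the shift-derived index (helper names are hypothetical; it assumes the 64 non-zero Paula volumes each cover a 512-wide band of peak values, with the factor held as 2.14 fixed point):

```c
#include <assert.h>
#include <stdint.h>

/* Each of the 64 non-zero Paula volumes covers a 512-wide band of peak
 * values (32768 / 64 = 512), so the volume index falls straight out of
 * a shift -- no search over a factor table needed. */
static int volume_index(uint16_t abs_max)
{
    int idx = abs_max >> 9;              /* 0..63 for sane peaks */
    return idx > 63 ? 63 : idx;          /* clamp, just in case */
}

/* Matching normalisation factor as 2.14 fixed point: 64 / (idx + 1). */
static uint16_t norm_factor_2_14(int idx)
{
    return (uint16_t)((64UL << 14) / (unsigned)(idx + 1));
}
```

For abs_max = 23456 this yields index 45 (Paula volume 46) and a 2.14 factor of 22795, i.e. 22795/16384 ≈ 1.3913, matching the table search described above.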

Old 09 June 2024, 02:49   #23
Karlos
Maybe I'm worrying too much about the multiplication here. As soon as I've normalised the first 4 samples and truncated to 8 bit, I'm going to have to do a long write to the chip ram buffer that I'm building. So perhaps the cost of the remaining samples is hidden behind this, as long as it can keep on crunching while those chip writes are ongoing.
Old 09 June 2024, 13:07   #24
Karlos
Thinking about it a bit more, there are also three distinct normalisation cases:

1. No normalisation. When the absolute peak level is at or above 32256, we are just going to truncate to 8 bit and play at full volume.
2. Perfect power of 2 normalisation. There are 6 ranges where our normalisation becomes a simple left shift and take the upper 8 bits.
3. Fractional normalisation. This is everything else, that notionally requires multiplication.

On an 060, it's probably fine to ignore the cases and just always multiply because it will be hidden behind the chip ram writes (I think, please correct if otherwise).

Case 3 is the one that bothers me, because on 040, we need four of them for every long chip write we need and they'll cost.

However, it's also the case that the end result of our multiplication and truncation is a signed 8 bit value. This makes me think that at some point, we can consider making this a pure lookup. Naively, a 32K*64 lookup of 8 bit values is a 2MB table, so that's a dud. However, not all bits of the input will affect the output: at best only 8 of them matter (the 8 highest set ones). With some shifting, the lookup table becomes 256*64. It's actually less than that because there are 7 sets covered by cases 1 and 2, but for simplicity let's ignore that for a moment.

So maybe we can turn this into a pure lookup problem.
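As a sanity check, the three-way split can be modelled in C (a sketch under the assumptions above: the factor for index idx is 64/(idx+1), so it is an exact power of two only when idx+1 is itself a power of two; names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

enum norm_case { NORM_NONE, NORM_SHIFT, NORM_MUL };

/* Classify a frame by its absolute peak.  idx = abs_max >> 9 picks the
 * Paula volume (idx + 1).  idx 63 is the factor-1, truncate-only case;
 * the pure-shift bands are those where idx + 1 is a power of two. */
static enum norm_case classify(uint16_t abs_max)
{
    int idx = abs_max >> 9;
    if (idx >= 63)
        return NORM_NONE;                /* peak >= 32256: full volume */
    int vol = idx + 1;
    if ((vol & (vol - 1)) == 0)
        return NORM_SHIFT;               /* factor 2, 4, 8, 16, 32 or 64 */
    return NORM_MUL;                     /* fractional factor */
}
```

Counting the bands for idx 0..62 gives exactly 6 pure-shift cases, which together with the no-op case accounts for the 7 sets mentioned above.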
Old 09 June 2024, 20:51   #25
abu_the_monkey
Registered User
 
Join Date: Oct 2020
Location: Bicester
Posts: 2,022
I wish I could help, but most of this audio stuff goes way over my head.
Old 09 June 2024, 21:54   #26
Karlos
Quote:
Originally Posted by abu_the_monkey View Post
I wish I could help, but most of this audio stuff goes way over my head.
The challenge at the moment is not really an audio problem, it's a how to avoid (fixed point) multiplication problem. I'm not too bothered on 68060, because it can do multiplication easily in something like 2 or 3 cycles. On 040 though, it's much longer, 18+ IIRC.

Specifically, I have a set of 16-bit signed values (samples from mixing the multiple channels) that I need to multiply by another 16-bit value, N (a fixed point value that will increase the volume of the samples to maximum), and then take the most significant 8 bits as my result. There is one case where N = 1, so that's easy to bypass. There are 6 other cases where N is an exact power of 2 (2, 4, 8, 16, 32 and 64) that can be done using shifts.

This leaves me with all the cases where N is some fractional value, which notionally require the multiplication step. However, intuition tells me that because there are only going to be 256 output values from all this, not all 16 bits of the input matter. Given we are looking for an 8 bit result, it seems intuitive that only the most significant 8 bits of the input matter*, and 256*64 is a much more manageable lookup size.

*I don't just mean the upper byte, the 8 bits after any leading zeros.
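The claim in that footnote can be tested numerically. This C sketch (an assumed 2.14 fixed-point factor with the output taken as the top 8 bits of the normalised result; `top8` is a hypothetical helper, and `__builtin_clz` assumes GCC/Clang) quantises a sample to its 8 most significant set bits and compares against the exact product:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Exact path: s is a mixed 16-bit sample, n a 2.14 fixed-point
 * normalisation factor; keep the top 8 bits of the normalised 16-bit
 * value, i.e. bits 22..29 of the 32-bit product. */
static int exact8(int16_t s, uint16_t n)
{
    return (int)(((int32_t)s * n) >> 22);    /* arithmetic shift assumed */
}

/* Quantise s to its 8 most significant set bits (the bits after any
 * leading zeros), zeroing the rest -- the proposed lookup key. */
static int16_t top8(int16_t s)
{
    int v = abs(s);
    if (v < 256)
        return s;                            /* 8 bits or fewer already */
    int p = 31 - __builtin_clz((unsigned)v); /* position of top set bit */
    int sh = p - 7;
    v = (v >> sh) << sh;
    return (int16_t)(s < 0 ? -v : v);
}
```

Sweeping samples against a couple of factors, the quantised output stays within 2 LSBs of the exact one in this sketch, which suggests a 256*64 table is viable.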
Old 16 June 2024, 12:30   #27
Karlos
Question for the cycle counting experts:

Roughly, how many CPU cycles can you stuff behind a long write to chip ram on a typical 68060 at 50MHz? If it's only accessing data you expect to be resident in a cache line already, are there any really obvious gotchas I should know about?

Old 16 June 2024, 13:48   #28
paraj
Registered User
 
paraj's Avatar
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,215
If there are no cache misses, then always at least 28. A write to chip mem takes 2 CCKs (564ns), 50MHz cycles are 20ns, and 564/20 ~= 28.2.

I just verified this on my 060 by timing a loop of:
Code:
        move.l  d1,(a0) ; a0 points to chip mem
        rept    \1
        mulu.w  d2,d3
        endr
For different repetition counts I get
Code:
0             690 ns 34.5 cycles
5             690 ns 34.5 cycles
10            690 ns 34.5 cycles
14            690 ns 34.5 cycles
15            690 ns 34.5 cycles
16            690 ns 34.5 cycles
17            722 ns 36.1 cycles
18            762 ns 38.1 cycles
20            842 ns 42.1 cycles
690ns corresponds to a write speed of 5.8*10^6 bytes/second, which squares with the 5.5 figure reported by bustest (I ran my test with DMA off).

Really no gotcha except for the cache misses part. You can ship off (up to) 4 writes to the store buffer "for free" and it will retire them ASAP, but a cache miss will stall until the store buffer is empty. The writes have to be properly aligned long words of course, otherwise you'll needlessly be wasting bandwidth.
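For reference, the timing arithmetic above as a quick sketch (assuming a PAL CCK of ~3.547 MHz, 2 CCKs per chip bus access and a 20 ns CPU cycle at 50 MHz):

```c
#include <assert.h>

/* One chip bus access occupies 2 CCKs.  A PAL CCK is ~281.94 ns
 * (1000 / 3.546895 MHz), so a long write to chip ram holds the bus for
 * ~563.9 ns; at 20 ns per cycle a 50 MHz 060 can hide ~28 cycles of
 * work behind each write. */
static double chip_write_ns(void)
{
    return 2.0 * (1e3 / 3.546895);
}

static double hidden_cycles_at_50mhz(void)
{
    return chip_write_ns() / 20.0;
}

/* Throughput implied by a measured time per long (one long = 4 bytes). */
static double bytes_per_sec(double ns_per_long)
{
    return 4.0 / (ns_per_long * 1e-9);
}
```

The measured 690 ns per long works out at ~5.8 MB/s, squaring with the bustest figure quoted above.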
Old 16 June 2024, 14:36   #29
Karlos
I'm going to need to smuggle something heinous behind it https://github.com/0xABADCAFE/tkg-mi...xer_asm.s#L210

There's a buffer of 16 16-bit samples that is cache aligned and having just been mixed into, so should be hot. We've got a fixed point normalisation factor in d2 that will maximise them for playback at a particular volume (already identified). The resulting product has the desired signed 8-bit value in bits 16-23, so we swap them and move that byte into a buffer long. Once we've done this for 4 samples, the complete long is written to chip and we repeat.

There's a lot about this code I don't like. For example, I don't know if it would be faster to write the product to a cached long and then move the byte out to some other buffer, and the instruction scheduling probably kills superscalar execution. But if most of the cycles are lost behind the 4 long chip writes, maybe it's OK. We already did a word write to chip for the volume channel buffer before we entered this.
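A C model of what that inner step is meant to compute (not the asm itself, just a sketch: the factor n is assumed pre-scaled so the wanted signed byte lands in bits 16-23 of each product, and four results are packed big-endian into one long):

```c
#include <assert.h>
#include <stdint.h>

/* Pack four normalised samples into one big-endian long, modelling the
 * mul_norm inner step: byte_i = bits 16..23 of (s_i * n), i.e. the
 * signed 8-bit result the swap/move.b sequence extracts. */
static uint32_t norm_pack4(const int16_t s[4], uint16_t n)
{
    uint32_t out = 0;
    for (int i = 0; i < 4; i++) {
        int8_t b = (int8_t)(((int32_t)s[i] * n) >> 16);
        out = (out << 8) | (uint8_t)b;   /* first sample ends up highest */
    }
    return out;
}
```

One long like this per four mixed samples is what each slow chip write then ships out.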
Old 16 June 2024, 14:40   #30
Karlos
This code will be despicable on anything less than an 060, as those multiplications will murder you, bury you, dig you back up, desecrate you and then leave you for the birds.

The options I am considering are just shift normalisation, which will be quick but loses a significant amount of dynamic range, or some kind of lookup.
Old 16 June 2024, 15:13   #31
paraj
That code is only slightly slower than "copy" speed:
Code:
        rept 4
        move.w  (a2)+,d0
        endr
        move.l  d0,(a1)+
comes out at 46 cycles, while the inner part of your loop (dbf omitted) clocks in at 53 cycles. Not sure why cost is not fully hidden, but it's probably close enough anyway.
Old 16 June 2024, 15:31   #32
Karlos
Quote:
Originally Posted by paraj View Post
That code is only slightly slower than "copy" speed:
Code:
        rept 4
        move.w  (a2)+,d0
        endr
        move.l  d0,(a1)+
comes out at 46 cycles, while the inner part of your loop (dbf omitted) clocks in at 53 cycles. Not sure why cost is not fully hidden, but it's probably close enough anyway.
I bet the swaps don't help. Don't they tie up the primary execution unit?
Old 16 June 2024, 16:30   #33
paraj
Quote:
Originally Posted by Karlos View Post
I bet the swaps don't help. Don't they tie up the primary execution unit?
The code is not great for superscalar execution; swap/mul/dbf are all pOEP-only and there are register write/use stalls, but it should still be <28 cycles.

The real issue is that my test was too simple. Since I test with large buffers there is a cache miss after every other chip write so we don't get as many free cycles. After adding a bit of prefetching (128 bytes) it performs slightly better than the naive copy loop.

You really do have to be careful when you take those cache misses.
Old 16 June 2024, 16:52   #34
Karlos
As long as it's close to copy speed in my case, I'll be happy enough. That'll be a decent result considering that we are doing the whole sample normalisation thing.

On the 040, that code will be a disaster. Each multiply alone is at best 18 cycles IIRC. You also have half as many to play with.
Old 16 June 2024, 17:06   #35
paraj
Quote:
Originally Posted by Karlos View Post
As long as it's close to copy speed in my case, I'll be happy enough. That'll be a decent result considering that we are doing the whole sample normalisation thing. On the 040, that code will be a disaster. Each multiply alone is at best 18 cycles IIRC. You also have half as many to play with.
< 060 is going to be more challenging/fun in other words

But to settle the 060 part, your code is (essentially) copy speed. Doing fast->fast conversion with everything in cache takes ~26 cycles/long, fast->chip ~35 (704ns) including loop overhead in both cases. FWIW I've previously measured cache misses to add 13 cycles on my machine (excluding cost of flushing push/store buffers).
Old 16 June 2024, 17:09   #36
Karlos
I got rid of the first swap anyway.
Code:
.mul_norm_four:
        ; something like this, for 060
        move.w  (a2)+,d0    ; xx:xx:AA:aa
        muls.w  d2,d0       ; 00:AA:xx:xx
        lsr.l   #8,d0       ; 00:00:AA:xx
        move.w  d0,d1       ; xx:xx:AA:xx

        move.w  (a2)+,d0    ; xx:xx:BB:bb
        muls.w  d2,d0       ; xx:BB:xx:xx
        swap    d0          ; xx:xx:xx:BB
        move.b  d0,d1       ; xx:xx:AA:BB
        lsl.l   #8,d1       ; xx:AA:BB:00

        move.w  (a2)+,d0    ; xx:xx:CC:cc
        muls.w  d2,d0       ; xx:CC:xx:xx
        swap    d0          ; xx:xx:xx:CC
        move.b  d0,d1       ; xx:AA:BB:CC
        lsl.l   #8,d1       ; AA:BB:CC:00

        move.w  (a2)+,d0    ; xx:xx:DD:dd
        muls.w  d2,d0       ; xx:DD:xx:xx
        swap    d0          ; xx:xx:xx:DD
        move.b  d0,d1       ; AA:BB:CC:DD

        move.l  d1,(a1)+    ; long slow chip write here

        dbra    d4,.mul_norm_four
        move.l a1,(a4)

        bra.s   .done_channel_normalise

.shift_norm_four:
        move.l  (a2)+,d0 ; AA:aa:BB:bb
        lsr.l   d2,d0    ; 00:AA:xx:BB
        move.l  (a2)+,d1 ; CC:cc:DD:dd
        lsl.w   #8,d0    ; 00:AA:BB:00
        lsr.l   d2,d1    ; xx:CC:xx:DD
        lsl.l   #8,d0    ; AA:BB:00:00
        lsl.w   #8,d1    ; xx:CC:DD:00
        lsr.l   #8,d1    ; 00:xx:CC:DD
        move.w  d1,d0    ; AA:BB:CC:DD

        move.l  d0,(a1)+ ; long slow chip write
        dbra    d4,.shift_norm_four
Old 16 June 2024, 17:15   #37
Karlos
Now I just need to test it out with some actual input stream data and write the output to a file for validation.
Old 16 June 2024, 20:39   #38
Karlos
Which you think would be easy but every time I sit down, I get some other distraction to deal with lol.

Happy father's day
Old 17 June 2024, 11:16   #39
Karlos
@paraj

Those numbers make me think of the mixing side now. Currently, there is a set of 8-bit to 16-bit sample conversion tables, where each table is for a particular virtual volume level for a mixer channel. There are only 15 of these tables (zero is handled trivially) and each one is 256 words. It bothers me that access within a table will be quite scattered and I can imagine it being plagued by misses. I could just store the positive half of each table and handle the sign, but even then, that's a lot of potential jumping about. I also thought about delta encoding the input to make the lookups cluster around zero more, but there are some problems with that approach too, particularly when the volume levels are changing.

Do you think a straight up multiplication would be better on 060? That's 2 multiplications, per sample per active channel.
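For comparison, a sketch of both paths in C. The table layout matches the description above (15 tables of 256 words), but the scale rule is an assumption for illustration: sample * level * 17, so level 15 spans nearly the full 16-bit range (15 * 17 = 255):

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: one 256-entry table of pre-scaled words per non-zero
 * virtual volume level (15 tables, 512 bytes each), indexed by the raw
 * sample byte. */
static int16_t mix_table[15][256];

static void build_mix_tables(void)
{
    for (int level = 1; level <= 15; level++)
        for (int u = 0; u < 256; u++)
            mix_table[level - 1][u] = (int16_t)((int8_t)u * level * 17);
}

/* The straight multiply alternative, replacing the scattered lookup
 * (level * 17 could be folded into one pre-computed factor per channel). */
static int16_t mix_mul(int8_t s, int level)
{
    return (int16_t)(s * level * 17);
}
```

Whether the multiply beats the table then comes down to how often the table lookups miss the cache, which is exactly the question here.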
Old 17 June 2024, 12:45   #40
paraj
If you just need multiplication (and even a little extra calculation, assuming it's not something like clipping) then straight calculation is probably faster even without cache effects. Consider your current loop:
Code:
        ; Cycle 1
        move.b  (a3)+,d0                        ; pOEP
        ; sOEP idle because move also uses memory cycle
        ; pOEP Change/use stall for 3 cycles waiting for d0
        ; Cycle 2-5
        move.w  0(a2,d0.w*2),d4                 ; pOEP
        ; sOEP idle because add also uses memory cycle
        ; Cycle 6-7
        add.w   d4,(a4)+                        ; pOEP
        ; sOEP idle because dbra is pOEP-only
        ; Cycle 8
        dbra    d1,.next_sample               ; pOEP
The lookup ends up taking (at least) 3 cycles due to the change/use stall, and there isn't really any (easy) way of hiding the stall (except perhaps some tricky loop overlapping).

Something like this should be more 060 friendly (assuming the lookup can be replaced by ext.w / muls):
Code:
        ; Cycle 1
        move.b  (a3)+,d0                        ; pOEP
        ; sOEP idle because ext.w d0 needs d0
        ; Cycle 2
        ext.w   d0                              ; pOEP
        ; sOEP idle because muls is pOEP-only
        ; Cycle 3-4
        muls.w  d2,d0                           ; pOEP
        ; sOEP idle because muls is pOEP-only
        ; Cycle 5-6
        add.w   d0,(a4)+                        ; pOEP
        subq.w  #1,d1                           ; sOEP
        ; Assuming correctly predicted (taking 0 cycles)
        bne.b   .next_sample