Sound overhaul for TKG - Page 6

Karlos · 29 June 2024, 23:39

Quote:

Originally Posted by abu_the_monkey

Adulting? sometimes it really sucks all your time away

How naive of me to think that I'd get some time, lol

Karlos · 30 June 2024, 12:53

I've definitely gotten closer. The reconstructed waveform has every alternate sample correct, which means I'm flipping something each iteration. Has to be sign related...

Karlos · 30 June 2024, 17:08

Right, that's working now. The 040 code path produces the same resulting data dumps as the 060 code path.

The mixing loop can probably be improved:

Code:

.mix_first_sample:
        move.b  (a3)+,d0         ; next 8-bit sample.
        move.w  (a2,d0.w*2),d4   ; look up the volume adjusted word
        add.w   d4,(a4)+         ; accumulate onto the target buffer
        move.w  d0,d6            ; d6.w contains last 8-bit sample value

.mix_next_sample:
        neg.b   d0               ; Calculate the next 8-bit delta in d0
        move.b  (a3)+,d6         ; Next 8-bit sample in d6
        add.b   d6,d0            ; 8-bit delta in d0
        add.w   (a2,d0.w*2),d4   ; Add looked up 16-bit delta to last 16-bit sample
        add.w   d4,(a4)+         ; Accumulate
        move.b  d6,d0            ; current sample becomes the previous one for the next pass
        dbra    d1,.mix_next_sample
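
For anyone wanting to sanity-check the delta idea off-target, here is a rough C model of the loop above. It is a sketch under assumptions, not the actual mixer: the table layout (`build_volume_table`, signed byte times volume) and all function names are made up for illustration, and 16-bit words are held as `uint16_t` so that wrap-around is well defined. With a plain `sample * vol` table like this one, the delta path is only guaranteed to match the direct lookup while each step between neighbouring samples fits in a signed byte; how the real table deals with larger steps is not shown in this thread page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table layout: entry u holds (u read as a signed byte) * vol,
   kept as a raw 16-bit word so that wrap-around is well defined in C. */
static void build_volume_table(uint16_t table[256], int vol)
{
    assert(vol >= 0 && vol <= 256);
    for (int u = 0; u < 256; u++) {
        int s = (u < 128) ? u : u - 256;        /* reinterpret byte as signed */
        table[u] = (uint16_t)(s * vol);
    }
}

/* Reference: one table lookup per sample. */
static void mix_direct(const uint8_t *src, uint16_t *dst, size_t n,
                       const uint16_t table[256])
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint16_t)(dst[i] + table[src[i]]);
}

/* Delta variant, mirroring .mix_first_sample / .mix_next_sample. */
static void mix_delta(const uint8_t *src, uint16_t *dst, size_t n,
                      const uint16_t table[256])
{
    if (n == 0)
        return;
    uint8_t prev = src[0];
    uint16_t cur = table[prev];                 /* first sample: plain lookup */
    dst[0] = (uint16_t)(dst[0] + cur);
    for (size_t i = 1; i < n; i++) {
        uint8_t next = src[i];
        uint8_t delta = (uint8_t)(next - prev); /* 8-bit wrap, like neg.b + add.b */
        cur = (uint16_t)(cur + table[delta]);   /* add looked-up 16-bit delta */
        dst[i] = (uint16_t)(dst[i] + cur);      /* accumulate onto the target */
        prev = next;
    }
}

/* Returns 1 when both variants leave identical data in the target buffer. */
static int delta_matches_direct(void)
{
    /* Signed samples 0, 10, 50, 100, 60, -20, -100, -128 as raw bytes;
       every step between neighbours fits in a signed byte. */
    static const uint8_t src[8] = { 0, 10, 50, 100, 60, 236, 156, 128 };
    uint16_t table[256], a[8], b[8];

    build_volume_table(table, 64);
    for (int i = 0; i < 8; i++)
        a[i] = b[i] = (uint16_t)(1000 + i);     /* pretend earlier channels are mixed in */
    mix_direct(src, a, 8, table);
    mix_delta(src, b, 8, table);
    for (int i = 0; i < 8; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

Comparing the two output buffers this way is the same kind of check as comparing the data dumps of the two code paths.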

Karlos · 30 June 2024, 17:10

All it took was 30 mins peace and quiet.

So the next job is to look at the normalisation code. We can do some "between" power of two multipliers, e.g. we could use addition and shift right to get a normalisation of 1.5x and replay at volume level 43 (ideally it would be 42.67 but you have to go with what you have).
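
As a rough illustration of that arithmetic (a sketch only; `asr1`, `scale_1_5x` and `compensating_volume` are hypothetical names, and it assumes the usual maximum hardware volume of 64): 1.5x is the sample plus half of itself, and the matching replay volume is 64 / 1.5 = 42.67, which rounds to 43.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_VOLUME 64   /* assumed maximum replay volume */

/* Arithmetic shift right by one, written so it does not depend on how the
   compiler shifts negative values (equivalent to floor(x / 2)). */
static int32_t asr1(int32_t x)
{
    return (x >= 0) ? (x >> 1) : -((-x + 1) >> 1);
}

/* 1.5x gain using only an add and a shift: x + x/2. */
static int32_t scale_1_5x(int32_t x)
{
    return x + asr1(x);
}

/* Replay volume that cancels a gain of gain_num/gain_den, rounded to nearest. */
static int compensating_volume(int gain_num, int gain_den)
{
    assert(gain_num > 0 && gain_den > 0);
    return (MAX_VOLUME * gain_den * 2 + gain_num) / (gain_num * 2);
}
```

With these helpers, `compensating_volume(3, 2)` gives 43 for the 1.5x case, while the plain power-of-two cases `compensating_volume(1, 1)` and `compensating_volume(2, 1)` give 64 and 32.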

pipper · 30 June 2024, 20:05

Quote:

Originally Posted by Karlos

While we are on the subject of caches, I have to wonder how it might be if we could do the columnar wall rendering horizontally. Even if that meant rendering the image rotated and/or tiled. Every 1 pixel wide column to the row major framebuffer must be pretty horrible.

I contemplated this some time ago for Doom as well, i.e. trying to do the column post rendering horizontally to benefit from caching and write combining. The necessary transpose on present could potentially be hidden in the C2P pass(???), as C2P is itself a form of transpose.
But then there’s floor rendering, which is already horizontal. For instance, DoomAttack’s floor rendering already computes 4 pixels and writes them in one longword write. Doing the floor vertically is likely inefficient, as it would mean giving up on “constant z” along the floor lines.
If you did floors and walls as separate passes (walls horizontal) and did a transpose of the wall rendering in between, it would probably eat up any benefits from horizontal rendering.
Since the wall posts are already stored linearly, the only benefit would be the linear write (instead of wasting a 16-byte cache line fill just to write a single pixel back).
I once did an experiment with storing the floor tiles in a 4x4 tiled fashion, where blocks of 16 pixels are stored in a single 16-byte cache line. The advantage would be that when you pull a texel, it’s very likely that the next needed neighbor texel will be pulled into cache at the same time.
But it complicates the addressing in the inner texturing loop and thus likely cancels any cycles saved by pulling from cache.
But maybe one could use this approach for framebuffer writes during wall rendering? The necessary recombination of walls and floors could be done at C2P time (or when copying to RTG), and “unscrambling” the 4x4 blocks could be done there as well, likely in a more optimized way.
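
To make the addressing cost concrete, here is a hedged C sketch of how a 4x4 tiled layout might be indexed compared with the usual linear one. The 64x64 texture size and all function names are assumptions for illustration, not taken from any of the engines mentioned.

```c
#include <assert.h>
#include <stdint.h>

#define TEX_SIZE 64   /* assumed texture edge length in texels */
#define BLOCK    4    /* 4x4 texels = 16 bytes = one cache line */

/* Usual row-major layout: one multiply (or shift) and one add. */
static unsigned linear_offset(unsigned u, unsigned v)
{
    assert(u < TEX_SIZE && v < TEX_SIZE);
    return v * TEX_SIZE + u;
}

/* Tiled layout: pick the 4x4 block first, then the texel inside it.
   The extra masks and shifts are the added inner-loop cost. */
static unsigned tiled_offset(unsigned u, unsigned v)
{
    assert(u < TEX_SIZE && v < TEX_SIZE);
    unsigned block  = (v >> 2) * (TEX_SIZE / BLOCK) + (u >> 2);
    unsigned within = ((v & 3) << 2) | (u & 3);
    return block * (BLOCK * BLOCK) + within;
}

/* One-off conversion of a linear texture into the tiled layout. */
static void tile_texture(const uint8_t *linear, uint8_t *tiled)
{
    for (unsigned v = 0; v < TEX_SIZE; v++)
        for (unsigned u = 0; u < TEX_SIZE; u++)
            tiled[tiled_offset(u, v)] = linear[linear_offset(u, v)];
}
```

In this layout, texels (1,1) and (2,2) land in the same 16-byte block, whereas in the linear layout they are 65 bytes apart.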

paraj · 30 June 2024, 20:39

Quote:

Originally Posted by pipper

I contemplated this some time ago for Doom as well, i.e. trying to do the column post rendering horizontally to benefit from caching and write combining. The necessary transpose on present could potentially be hidden in the C2P pass(???), as C2P is itself a form of transpose.
But then there’s floor rendering, which is already horizontal. For instance, DoomAttack’s floor rendering already computes 4 pixels and writes them in one longword write. Doing the floor vertically is likely inefficient, as it would mean giving up on “constant z” along the floor lines.
If you did floors and walls as separate passes (walls horizontal) and did a transpose of the wall rendering in between, it would probably eat up any benefits from horizontal rendering.
Since the wall posts are already stored linearly, the only benefit would be the linear write (instead of wasting a 16-byte cache line fill just to write a single pixel back).
I once did an experiment with storing the floor tiles in a 4x4 tiled fashion, where blocks of 16 pixels are stored in a single 16-byte cache line. The advantage would be that when you pull a texel, it’s very likely that the next needed neighbor texel will be pulled into cache at the same time.
But it complicates the addressing in the inner texturing loop and thus likely cancels any cycles saved by pulling from cache.
But maybe one could use this approach for framebuffer writes during wall rendering? The necessary recombination of walls and floors could be done at C2P time (or when copying to RTG), and “unscrambling” the 4x4 blocks could be done there as well, likely in a more optimized way.

Yeah, while interesting, it's probably not worth exploring too far since the effort would be too great. Caches, even on the 060, seem to be just too small for anything really worthwhile. The only really great speed-up I've achieved in my own stuff is from interleaving C2P and effect calculation, and even then only with severe restrictions on input (i.e. it must fit in D$).

Karlos · 30 June 2024, 20:48

I think it depends on your approach to rendering. If you implemented tile based rendering from the start, a 32*32 tile is a 1kB working set that fits easily in your cache even on 040. You can copy that directly to your framebuffer once it's done. Which could be RTG memory and some move16 fun, or it could be chip memory with C2P.
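
A minimal C sketch of that idea, assuming an 8-bit chunky framebuffer; the `Tile` type and the `draw_column` / `blit_tile` helpers are hypothetical. Walls can still be drawn column by column, but the stride is only 32 bytes inside a 1 kB buffer (the 040 has a 4 kB data cache), and the finished tile goes out as 32 linear row copies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TILE 32

/* 32 * 32 bytes = 1 kB working set. */
typedef struct {
    uint8_t px[TILE][TILE];
} Tile;

/* Column-wise drawing inside the tile: every write stays in the small buffer. */
static void draw_column(Tile *t, int x, uint8_t colour)
{
    assert(x >= 0 && x < TILE);
    for (int y = 0; y < TILE; y++)
        t->px[y][x] = colour;
}

/* Copy a finished tile to tile position (tx, ty) of a row-major framebuffer.
   Each row is a linear 32-byte write; a real version might use move16 or
   feed a C2P routine here instead of memcpy. */
static void blit_tile(uint8_t *fb, size_t pitch, size_t height,
                      size_t tx, size_t ty, const Tile *t)
{
    assert((tx + 1) * TILE <= pitch && (ty + 1) * TILE <= height);
    uint8_t *dst = fb + ty * TILE * pitch + tx * TILE;
    for (int row = 0; row < TILE; row++, dst += pitch)
        memcpy(dst, t->px[row], TILE);
}
```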

paraj · 30 June 2024, 20:58

Quote:

Originally Posted by Karlos

I think it depends on your approach to rendering. If you implemented tile based rendering from the start, a 32*32 tile is a 1kB working set that fits easily in your cache even on 040. You can copy that directly to your framebuffer once it's done. Which could be RTG memory and some move16 fun, or it could be chip memory with C2P.

I share your enthusiasm, but doubt the numbers will work out in our favor (compared to effort). Happy to be proved wrong!

Karlos · 30 June 2024, 21:23

I think this only works when the engine is designed with these principles in mind from the beginning.

Karlos · 30 June 2024, 21:41

Quote:

Originally Posted by paraj

I share your enthusiasm, but doubt the numbers will work out in our favor (compared to effort). Happy to be proved wrong!

Back on topic, how does the latest mixer perform for you? It will run the 040 code path by default, unless you put USE060 as a parameter.

The current normalisation code is basically the same, so we'd be measuring the difference between the lookup and multiplication approaches to mixing.

paraj · 30 June 2024, 22:00

Quote:

Originally Posted by Karlos

Back on topic, how does the latest mixer perform for you? It will run the 040 code path by default, unless you put USE060 as a parameter.

The current normalisation code is basically the same, so we'd be measuring the difference between the lookup and multiplication approaches to mixing.

355996 / 263203 (latter is with USE060)

Karlos · 30 June 2024, 22:56

It would be good to see the same tests on a real 040. 1.35x faster for the 060 specific path seems a nice boost.

Karlos · 30 June 2024, 23:13

Can you retry without the data cache enabled? That could be quite an interesting comparison.

Karlos · 30 June 2024, 23:18

I need a non-delta 040 version for comparison on real hardware. It would be annoying, but I can imagine that the extra logic required to manage the delta code ends up being slower. Of course, in such a scheme you would just pre-encode your 8-bit samples and simplify the corresponding mixing loop.
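
Roughly what that pre-encoded scheme could look like, modelled in C. This is a sketch under assumptions: the function names and the `sample * vol` table layout are hypothetical, and 16-bit words are held as `uint16_t` so wrap-around is well defined. The encoder runs once, offline, and the mixing loop then has no first-sample special case and no per-sample subtraction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table: entry u holds (u read as a signed byte) * vol. */
static void build_volume_table(uint16_t table[256], int vol)
{
    assert(vol >= 0 && vol <= 256);
    for (int u = 0; u < 256; u++) {
        int s = (u < 128) ? u : u - 256;
        table[u] = (uint16_t)(s * vol);
    }
}

/* Offline step: replace each sample by its 8-bit difference from the previous
   one. The first sample is differenced against 0, so it is stored unchanged. */
static void delta_encode(uint8_t *samples, size_t n)
{
    uint8_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t cur = samples[i];
        samples[i] = (uint8_t)(cur - prev);
        prev = cur;
    }
}

/* Mixing loop over pre-encoded data: one lookup and two adds per sample. */
static void mix_preencoded(const uint8_t *deltas, uint16_t *dst, size_t n,
                           const uint16_t table[256])
{
    assert(deltas != NULL && dst != NULL);
    uint16_t cur = 0;                               /* running 16-bit sample */
    for (size_t i = 0; i < n; i++) {
        cur = (uint16_t)(cur + table[deltas[i]]);   /* add looked-up delta */
        dst[i] = (uint16_t)(dst[i] + cur);          /* accumulate onto the target */
    }
}
```

As with any 8-bit delta scheme built on a plain linear table, this sketch only reproduces the original exactly while neighbouring samples differ by an amount that fits in a signed byte.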

Karlos · Yesterday, 15:15

Anyone with a working 040?

paraj · Yesterday, 17:44

Quote:

Originally Posted by Karlos

Can you retry without the data cache enabled? That could be quite an interesting comparison.

Without D$: 1304199 / 1001972

Karlos · Yesterday, 19:50

Quote:

Originally Posted by paraj

Without D$: 1304199 / 1001972

So apart from being significantly slower, the relative difference is about the same. I don't know how I feel about that. We are writing to chip RAM. I should add a few more options.

1. NoDelta switch for the 040 path
2. MixOnly switch for testing just the mixing and skipping the normalisation and chip buffer writes.
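
For reference, the ratios behind "about the same" can be worked out from the figures quoted above. A small C sketch (the constant names are made up; the numbers are the ones paraj posted, in whatever units the test program reports):

```c
#include <assert.h>

/* Figures quoted in the thread: default (040) path / USE060 path. */
static const double PATH040_DCACHE_ON  = 355996.0;
static const double PATH060_DCACHE_ON  = 263203.0;
static const double PATH040_DCACHE_OFF = 1304199.0;
static const double PATH060_DCACHE_OFF = 1001972.0;

/* How many times larger a is than b. */
static double ratio(double a, double b)
{
    assert(b > 0.0);
    return a / b;
}
```

`ratio(PATH040_DCACHE_ON, PATH060_DCACHE_ON)` comes out at roughly 1.35 and `ratio(PATH040_DCACHE_OFF, PATH060_DCACHE_OFF)` at roughly 1.30, while turning the data cache off costs about 3.7x on the 040 path and about 3.8x on the 060 path.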

paraj · Yesterday, 20:01

Quote:

Originally Posted by Karlos

So apart from being significantly slower, the relative difference is about the same. I don't know how I feel about that. We are writing to chip RAM. I should add a few more options.

1. NoDelta switch for the 040 path
2. MixOnly switch for testing just the mixing and skipping the normalisation and chip buffer writes.

I think, but haven't checked, that disabling D$ means chip writes can't overlap with other calculations on 060 (that access memory), so be careful how you interpret results.

Other benchmarks will be good, but IMO you should make one exe that tests a bunch of interesting stuff at once without needing options (like your Akiko tests), and then corner a 040 owner for forced testing.

Karlos · Yesterday, 20:08

Quote:

Originally Posted by paraj

I think, but haven't checked, that disabling D$ means chip writes can't overlap with other calculations on 060, so be careful how you interpret results.

I think this is why a "mix only" test makes sense.

Quote:

Other benchmarks will be good, but IMO you should make one exe that tests a bunch of interesting stuff at once without needing options (like your Akiko tests), and then corner a 040 owner for forced testing.

You're right. However, I'm hedging my bets that on some machines the seemingly suboptimal approach will be the best one and so on.



