29 June 2024, 23:39 | #101 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,571
|
|
30 June 2024, 12:53 | #102 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,571
|
I've definitely gotten closer. The reconstructed waveform has every alternate sample correct, which means I'm flipping something each iteration. Has to be sign related...
|
30 June 2024, 17:08 | #103 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,571
|
Right, that's working now. The 040 code path produces the same resulting data dumps as the 060 codepath.
The mixing loop can probably be improved.
Code:
.mix_first_sample:
    move.b  (a3)+,d0            ; next 8-bit sample.
    move.w  (a2,d0.w*2),d4      ; look up the volume adjusted word
    add.w   d4,(a4)+            ; accumulate onto the target buffer
    move.w  d0,d6               ; d6.w contains last 8-bit sample value
.mix_next_sample:
    neg.b   d0                  ; Calculate the next 8-bit delta in d0
    move.b  (a3)+,d6            ; Next 8-bit sample in d6
    add.b   d6,d0               ; 8-bit delta in d0
    add.w   (a2,d0.w*2),d4      ; Add looked up 16-bit delta to last 16-bit sample
    add.w   d4,(a4)+            ; Accumulate
    move.b  d6,d0
    dbra    d1,.mix_next_sample
|
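For anyone following along without a 68k to hand, here is a rough C model of what that loop appears to do. It is only a sketch under assumptions: the table layout (entry u holds volume times the signed value of byte u) and the names build_volume_table, mix_direct and mix_delta are made up for illustration and are not taken from the actual source. Under that assumed table, the delta path matches a plain per-sample lookup as long as each sample-to-sample difference fits in a signed byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table layout: entry u holds volume * (byte u read as signed). */
void build_volume_table(int16_t table[256], int volume)
{
    for (int u = 0; u < 256; ++u) {
        int s = (u < 128) ? u : u - 256;   /* signed value of the byte */
        table[u] = (int16_t)(volume * s);
    }
}

/* Reference mix: one table lookup per sample, accumulated onto dst. */
void mix_direct(const int8_t *src, int16_t *dst, size_t n,
                const int16_t table[256])
{
    for (size_t i = 0; i < n; ++i) {
        uint16_t sum = (uint16_t)((uint16_t)dst[i] + (uint16_t)table[(uint8_t)src[i]]);
        dst[i] = (int16_t)sum;
    }
}

/* Delta mix modelled on the posted loop: look up the first sample, then
 * add the looked-up 8-bit delta to the running 16-bit value each step. */
void mix_delta(const int8_t *src, int16_t *dst, size_t n,
               const int16_t table[256])
{
    if (n == 0) {
        return;
    }
    uint8_t last = (uint8_t)src[0];               /* d0 / d6 */
    uint16_t acc = (uint16_t)table[last];         /* d4 */
    dst[0] = (int16_t)(uint16_t)((uint16_t)dst[0] + acc);
    for (size_t i = 1; i < n; ++i) {
        uint8_t next = (uint8_t)src[i];
        uint8_t delta = (uint8_t)(next - last);   /* neg.b d0 ; add.b d6,d0 */
        acc = (uint16_t)(acc + (uint16_t)table[delta]);
        dst[i] = (int16_t)(uint16_t)((uint16_t)dst[i] + acc);
        last = next;                              /* move.b d6,d0 */
    }
}
```

Running both functions over the same input and comparing the buffers is a cheap way to catch the "every alternate sample" kind of sign slip.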
30 June 2024, 17:10 | #104 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,571
|
All it took was 30 mins of peace and quiet.
So the next job is to look at the normalisation code. We can do some "in between" power-of-two multipliers, e.g. we could use an addition and a shift right to get a normalisation of 1.5x and replay at volume level 43 (ideally it would be 42.67, but you have to go with what you have).
Last edited by Karlos; 30 June 2024 at 17:23. |
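To make the 1.5x idea concrete, here is a minimal C sketch. It assumes the usual maximum hardware volume of 64, which is where 64 / 1.5 = 42.67 comes from. The function names scale_1_5 and compensating_volume are hypothetical, not from the actual mixer.

```c
#include <stdint.h>

/* Scale by 1.5 with one shift and one add: x + x/2.
 * Note: right-shifting a negative value is implementation-defined in C;
 * on the usual compilers it is an arithmetic shift, like asr on the 68k. */
int32_t scale_1_5(int32_t x)
{
    return x + (x >> 1);
}

/* Replay volume (0..64) that compensates a gain of num/den,
 * rounded to the nearest integer. Assumes a maximum volume of 64. */
int compensating_volume(int num, int den)
{
    return (64 * den + num / 2) / num;
}
```

With a gain of 3/2 this gives 43, and with a plain power-of-two gain of 2/1 it gives 32, so the in-between multiplier slots in halfway between the shift-only steps.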
30 June 2024, 20:05 | #105 | |
Registered User
Join Date: Jul 2017
Location: San Jose
Posts: 683
|
Quote:
I contemplated this some time ago for Doom as well, i.e. try to do the column post rendering horizontally to benefit from caching and write combining. The necessary transpose on present could potentially be hidden in the C2P pass (?), as C2P is itself a form of transpose.
But then there's floor rendering, which is already horizontal. For instance, DoomAttack's floor rendering already computes 4 pixels and writes them in one longword write. Doing the floor vertically is likely inefficient, as it would mean giving up on "constant z" along the floor lines. If you did floors and walls as separate passes (walls horizontal) and did a transpose-wall-rendering step in between, it would probably eat up any benefits from horizontal rendering. Since the wall posts are already stored linearly, the only benefit would be the linear write (instead of wasting a 16-byte cache line fill just to write a single pixel back).
I once did an experiment with storing the floor tiles in a 4x4 tiled fashion, where blocks of 16 pixels are stored in a single 16-byte cache line. The advantage would be that when you pull a texel, it's very likely that the next needed neighbor texel will be pulled into cache at the same time. But it complicates the addressing in the inner texturing loop and thus likely cancels any cycle savings from pulling from cache.
But maybe one could use this approach for framebuffer writes during wall rendering? The necessary recombination of walls and floors could be done at C2P time (or when copying to RTG), and "unscrambling" the 4x4 blocks could be done there as well, likely more optimized. |
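As an illustration of the 4x4 tiled layout, here is a small C sketch of the addressing. It assumes 8-bit texels and a texture width that is a multiple of 4. The names tiled_offset and tile_texture are invented for this example, and it is not the code from the experiment described above.

```c
#include <stdint.h>

/* Byte offset of texel (u, v) when the texture is stored as 4x4 blocks,
 * each block occupying 16 consecutive bytes (one cache line if aligned).
 * Assumes 8-bit texels and width being a multiple of 4. */
uint32_t tiled_offset(uint32_t u, uint32_t v, uint32_t width)
{
    uint32_t blocks_per_row = width >> 2;
    uint32_t block = (v >> 2) * blocks_per_row + (u >> 2);
    return (block << 4) | ((v & 3u) << 2) | (u & 3u);
}

/* Re-pack a linear width x height texture into the tiled order. */
void tile_texture(const uint8_t *linear, uint8_t *tiled,
                  uint32_t width, uint32_t height)
{
    for (uint32_t v = 0; v < height; ++v) {
        for (uint32_t u = 0; u < width; ++u) {
            tiled[tiled_offset(u, v, width)] = linear[v * width + u];
        }
    }
}
```

The extra shifts and masks in tiled_offset are the addressing cost mentioned above: they have to be paid per texel in the inner loop unless the stepping can be arranged to avoid them.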
|
30 June 2024, 20:39 | #106 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,239
|
Quote:
|
|
30 June 2024, 20:48 | #107 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,571
|
I think it depends on your approach to rendering. If you implemented tile-based rendering from the start, a 32*32 tile is a 1kB working set that fits easily in your cache, even on the 040. You can copy that directly to your framebuffer once it's done. That could be RTG memory with some move16 fun, or it could be chip memory with C2P.
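A minimal C sketch of the "render into a small tile, then copy it out" step, assuming 8-bit pixels and a linear framebuffer. The name blit_tile and the tile constants are hypothetical, and a real version would use move16 or C2P for the copy rather than memcpy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { TILE_W = 32, TILE_H = 32 };   /* 32 * 32 bytes = 1 kB working set */

/* Copy one finished tile into a linear 8-bit framebuffer.
 * tx, ty are tile coordinates; pitch is the framebuffer row size in bytes. */
void blit_tile(const uint8_t tile[TILE_W * TILE_H], uint8_t *fb,
               size_t pitch, size_t tx, size_t ty)
{
    uint8_t *dst = fb + ty * TILE_H * pitch + tx * TILE_W;
    for (size_t y = 0; y < TILE_H; ++y) {
        memcpy(dst + y * pitch, tile + y * TILE_W, TILE_W);
    }
}
```

All the random-access work happens inside the 1 kB tile, and the framebuffer only ever sees sequential row writes.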
|
30 June 2024, 20:58 | #108 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,239
|
Quote:
|
|
30 June 2024, 21:23 | #109 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,571
|
I think this only works when the engine is designed with these principles in mind from the beginning.
|
30 June 2024, 21:41 | #110 | |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,571
|
Quote:
The current normalisation code is basically the same so we'd be measuring the difference in the lookup v multiplication approach to mixing. |
|
30 June 2024, 22:00 | #111 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,239
|
Quote:
355996 / 263203 (latter is with USE060) |
|
30 June 2024, 22:56 | #112 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,571
|
It would be good to see the same tests on a real 040. 1.35x faster for the 060 specific path seems a nice boost.
Last edited by Karlos; 30 June 2024 at 23:03. |
30 June 2024, 23:13 | #113 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,571
|
Can you retry with the data cache disabled? That could be quite an interesting comparison.
|
30 June 2024, 23:18 | #114 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,571
|
I need to have a non-delta 040 version for comparison on real hardware. It would be annoying, but I can imagine that the extra logic required to manage the delta code ends up being slower. Of course, in such a scheme you would just pre-encode your 8-bit samples as deltas and simplify the corresponding mixing loop.
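Here is a rough C sketch of what pre-encoding could look like, under the same assumed table layout as the mixing loop (entry u holds volume times the signed value of byte u). The names make_delta_table, delta_encode and mix_preencoded are invented for illustration. As with the runtime delta path, it only reproduces the direct result when each sample-to-sample difference fits in a signed byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table layout: entry u holds volume * (byte u read as signed). */
void make_delta_table(int16_t table[256], int volume)
{
    for (int u = 0; u < 256; ++u) {
        int s = (u < 128) ? u : u - 256;
        table[u] = (int16_t)(volume * s);
    }
}

/* One-off pass: replace each sample by its 8-bit difference from the
 * previous one. The first "previous" value is 0, so enc[0] is the sample. */
void delta_encode(const int8_t *src, uint8_t *enc, size_t n)
{
    uint8_t last = 0;
    for (size_t i = 0; i < n; ++i) {
        uint8_t cur = (uint8_t)src[i];
        enc[i] = (uint8_t)(cur - last);
        last = cur;
    }
}

/* Simplified mix: no neg/add per sample, just lookup and accumulate. */
void mix_preencoded(const uint8_t *enc, int16_t *dst, size_t n,
                    const int16_t table[256])
{
    uint16_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc = (uint16_t)(acc + (uint16_t)table[enc[i]]);
        dst[i] = (int16_t)(uint16_t)((uint16_t)dst[i] + acc);
    }
}
```

The inner loop drops the per-sample delta calculation entirely, at the cost of having to keep the sample data in encoded form.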
|
01 July 2024, 15:15 | #115 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,571
|
Anyone with a working 040?
|
01 July 2024, 17:44 | #116 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,239
|
|
01 July 2024, 19:50 | #117 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,571
|
So apart from being significantly slower, the relative difference is about the same. I don't know how I feel about that. We are writing to chip RAM. I should add a few more options.
1. NoDelta switch for the 040 path
2. MixOnly switch for testing just the mixing and skipping the normalisation and chip buffer writes.
|
01 July 2024, 20:01 | #118 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,239
|
Quote:
Other benchmarks will be good, but IMO you should make one exe that tests a bunch of interesting stuff at once without needing options (like your Akiko tests), and then corner a 040 owner for forced testing. |
|
01 July 2024, 20:08 | #119 | ||
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,571
|
Quote:
Quote:
|
||
05 July 2024, 19:32 | #120 |
Registered User
Join Date: Oct 2020
Location: Bicester
Posts: 2,038
|
Run on my A4000, 040 @ 25 MHz.
|