English Amiga Board


Old 29 June 2024, 23:39   #101
Karlos
Alien Bleed
 
 
Join Date: Aug 2022
Location: UK
Posts: 4,506
Quote:
Originally Posted by abu_the_monkey View Post
Adulting? sometimes it really sucks all your time away
How naive of me to think that I'd get some time, lol
Old 30 June 2024, 12:53   #102
Karlos
Alien Bleed
 
 
Join Date: Aug 2022
Location: UK
Posts: 4,506
I've definitely gotten closer. The reconstructed waveform has every alternate sample correct, which means I'm flipping something each iteration. Has to be sign related...
Old 30 June 2024, 17:08   #103
Karlos
Alien Bleed
 
 
Join Date: Aug 2022
Location: UK
Posts: 4,506
Right, that's working now. The 040 code path produces the same data dumps as the 060 code path.

The mixing loop can probably be improved:
Code:
.mix_first_sample:
        move.b  (a3)+,d0         ; next 8-bit sample.
        move.w  (a2,d0.w*2),d4   ; look up the volume adjusted word
        add.w   d4,(a4)+         ; accumulate onto the target buffer
        move.w  d0,d6            ; d6.w contains last 8-bit sample value

.mix_next_sample:
        neg.b   d0               ; Calculate the next 8-bit delta in d0
        move.b  (a3)+,d6         ; Next 8-bit sample in d6
        add.b   d6,d0            ; 8-bit delta in d0
        add.w   (a2,d0.w*2),d4   ; Add looked up 16-bit delta to last 16-bit sample
        add.w   d4,(a4)+         ; Accumulate
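        move.b  d6,d0            ; current sample becomes the previous one for the next delta
        dbra    d1,.mix_next_sample ; decrement d1.w and loop until it reaches -1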
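
For anyone following along, here is a rough C model of what the loop above does, checked against a plain per-sample lookup. It is only a sketch: `build_table`, `mix_direct` and `mix_delta` are hypothetical names, and the table contents (signed 8-bit value times a scale of 256) are an assumption, not the mixer's actual volume table. In this model the 8-bit delta can wrap, and the 16-bit accumulator only cancels that wrap when the scale is a multiple of 256, such as the 256 used here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table: entry i holds the signed 8-bit value i times scale. */
static void build_table(int16_t table[256], int scale) {
    assert(table);
    for (int i = 0; i < 256; i++)
        table[i] = (int16_t)((int8_t)i * scale);
}

/* Reference: look up every sample directly and accumulate into dst. */
static void mix_direct(const int8_t *src, int16_t *dst, size_t n,
                       const int16_t table[256]) {
    assert(src && dst && table);
    for (size_t i = 0; i < n; i++)
        dst[i] = (int16_t)(uint16_t)(dst[i] + table[(uint8_t)src[i]]);
}

/* Delta form: look up the first sample, then add the looked-up 8-bit
   delta to the previous 16-bit value, as the assembly loop does. */
static void mix_delta(const int8_t *src, int16_t *dst, size_t n,
                      const int16_t table[256]) {
    assert(src && dst && table);
    if (n == 0)
        return;
    uint8_t prev = (uint8_t)src[0];
    uint16_t acc = (uint16_t)table[prev];          /* first 16-bit sample */
    dst[0] = (int16_t)(uint16_t)(dst[0] + acc);
    for (size_t i = 1; i < n; i++) {
        uint8_t cur = (uint8_t)src[i];
        uint8_t delta = (uint8_t)(cur - prev);     /* wraps modulo 256 */
        acc = (uint16_t)(acc + (uint16_t)table[delta]); /* wraps modulo 65536 */
        dst[i] = (int16_t)(uint16_t)(dst[i] + acc);
        prev = cur;
    }
}
```

Both functions accumulate onto whatever is already in `dst`, so a target buffer would be cleared once and then mixed into once per channel.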
Old 30 June 2024, 17:10   #104
Karlos
Alien Bleed
 
 
Join Date: Aug 2022
Location: UK
Posts: 4,506
All it took was 30 mins peace and quiet.

So the next job is to look at the normalisation code. We can do some multipliers that fall between powers of two, e.g. an addition and a right shift gives a normalisation of 1.5x, which we would replay at volume level 43 (ideally it would be 42.67, i.e. 64/1.5, but you have to go with what you have).
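
A minimal C sketch of that arithmetic, for illustration only. `gain_1_5` and `replay_volume_for_gain` are hypothetical names, not anything from the mixer source, and the 64 is the maximum hardware volume level.

```c
#include <assert.h>
#include <stdint.h>

/* 1.5x gain as one add and one shift: x + x/2.
   Assumes >> on a negative value is an arithmetic shift (true for the
   usual compilers, but implementation-defined in C). The result is
   widened to 32 bits; clamping back to 16 bits is left to the caller. */
static int32_t gain_1_5(int16_t x) {
    return (int32_t)x + (x >> 1);
}

/* Replay volume (0..64) that compensates for a gain of num/den,
   rounded to the nearest integer level. */
static int replay_volume_for_gain(int num, int den) {
    assert(num > 0 && den > 0);
    return (64 * den + num / 2) / num;
}
```

With a gain of 3/2 this gives level 43; gains of 1 and 2 give 64 and 32, matching the plain power-of-two cases.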

Last edited by Karlos; 30 June 2024 at 17:23.
Old 30 June 2024, 20:05   #105
pipper
Registered User
 
Join Date: Jul 2017
Location: San Jose
Posts: 679
Quote:
Originally Posted by Karlos View Post
While we are on the subject of caches, I have to wonder how it might be if we could do the columnar wall rendering horizontally. Even if that meant rendering the image rotated and/or tiled. Writing every 1-pixel-wide column to the row-major framebuffer must be pretty horrible.

I contemplated this some time ago for Doom as well, i.e. try to do the column post rendering horizontally to benefit from caching and write combining. The necessary transpose on present could potentially be hidden in the C2P pass(???) as C2P is itself a form of transpose.
But then there’s floor rendering which is already horizontal. And for instance DoomAttack’s floor rendering is already computing 4 pixels and writes them in one longword write. Doing the floor vertically is likely inefficient as it would mean to give up on “constant z” along the floor lines.
If you did floors and walls as separate passes (walls horizontal) and did a transpose-wall-rendering in between, it would probably eat up any benefits from horizontal rendering.
Since the wall posts are already stored linearly, the only benefit would be the linear write (instead of wasting a 16byte cache line fill just to write a single pixel back).
I once did an experiment with storing the floor tiles in a 4x4 tiled fashion, where blocks of 16 pixels are stored in a single 16byte cache line. The advantage would be that when you pull a texel, it’s very likely that the next needed neighbor texel will be pulled into cache at the same time.
But it complicates the addressing in the inner texturing loop and thus likely cancels any savings of cycles when pulling from cache.
But maybe one could use this approach for framebuffer writes during wall rendering? The necessary recombination of walls and floors could be done at C2P time (or when copying to RTG) and “unscrambling” the 4x4 blocks could be done there as well, likely more optimized.
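
To make the addressing cost of the 4x4 tiled layout concrete, here is a small C sketch comparing it with plain row-major addressing. The 64x64 texture size and the names `linear_offset` and `tiled_offset` are assumptions for illustration, and a block only maps to a single 16-byte cache line if the texture base is 16-byte aligned.

```c
#include <assert.h>
#include <stddef.h>

enum { TEX_W = 64, TEX_H = 64 };  /* assumed texture size, 1 byte per texel */

/* Row-major layout: one multiply (or shift) and one add. */
static size_t linear_offset(unsigned u, unsigned v) {
    assert(u < TEX_W && v < TEX_H);
    return (size_t)v * TEX_W + u;
}

/* 4x4 tiled layout: each 16-byte block holds a 4x4 square of texels,
   so the block index and the position inside the block have to be
   computed separately and then combined. */
static size_t tiled_offset(unsigned u, unsigned v) {
    assert(u < TEX_W && v < TEX_H);
    size_t block  = (size_t)(v >> 2) * (TEX_W >> 2) + (u >> 2);
    size_t within = ((v & 3u) << 2) | (u & 3u);
    return (block << 4) | within;
}
```

The extra shifts and masks in `tiled_offset` are the kind of per-texel work that can cancel out the cache benefit in an inner loop.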
Old 30 June 2024, 20:39   #106
paraj
Registered User
 
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,226
Quote:
Originally Posted by pipper View Post
I contemplated this some time ago for Doom as well, i.e. try to do the column post rendering horizontally to benefit from caching and write combining. The necessary transpose on present could potentially be hidden in the C2P pass(???) as C2P is itself a form of transpose.
But then there’s floor rendering which is already horizontal. And for instance DoomAttack’s floor rendering is already computing 4 pixels and writes them in one longword write. Doing the floor vertically is likely inefficient as it would mean to give up on “constant z” along the floor lines.
If you did floors and walls as separate passes (walls horizontal) and did a transpose-wall-rendering in between, it would probably eat up any benefits from horizontal rendering.
Since the wall posts are already stored linearly, the only benefit would be the linear write (instead of wasting a 16byte cache line fill just to write a single pixel back).
I once did an experiment with storing the floor tiles in a 4x4 tiled fashion, where blocks of 16 pixels are stored in a single 16byte cache line. The advantage would be that when you pull a texel, it’s very likely that the next needed neighbor texel will be pulled into cache at the same time.
But it complicates the addressing in the inner texturing loop and thus likely cancels any savings of cycles when pulling from cache.
But maybe one could use this approach for framebuffer writes during wall rendering? The necessary recombination of walls and floors could be done at C2P time (or when copying to RTG) and “unscrambling” the 4x4 blocks could be done there as well, likely more optimized.
Yeah, while interesting, it's probably not worth exploring too far since the effort would be too great. Caches even on 060 seem to be just too small for anything really worthwhile. Only really great speed up I've achieved in my own stuff is from interleaving C2P and effect calculation, and even then only with severe restrictions on input (i.e. must fit in D$).
Old 30 June 2024, 20:48   #107
Karlos
Alien Bleed
 
 
Join Date: Aug 2022
Location: UK
Posts: 4,506
I think it depends on your approach to rendering. If you implemented tile based rendering from the start, a 32*32 tile is a 1kB working set that fits easily in your cache even on 040. You can copy that directly to your framebuffer once it's done. Which could be RTG memory and some move16 fun, or it could be chip memory with C2P.
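
As a rough illustration of the tile idea, assuming 8-bit pixels and a row-major destination: a 32x32 tile is rendered into a small contiguous buffer and then copied out row by row. `blit_tile` is a hypothetical helper, and the `memcpy` merely stands in for whatever the real copy would be (move16 to RTG memory, or a C2P pass to chip memory).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { TILE = 32 };  /* 32 x 32 x 1 byte = 1024 bytes of working set */

/* Copy one finished tile into a row-major framebuffer.
   fb_pitch is the framebuffer width in bytes; tile_x and tile_y are
   tile coordinates, not pixel coordinates. */
static void blit_tile(uint8_t *fb, size_t fb_pitch,
                      size_t tile_x, size_t tile_y,
                      const uint8_t tile[TILE * TILE]) {
    assert(fb && tile && fb_pitch >= TILE);
    uint8_t *dst = fb + tile_y * TILE * fb_pitch + tile_x * TILE;
    for (size_t row = 0; row < TILE; row++)
        memcpy(dst + row * fb_pitch, tile + row * TILE, TILE);
}
```

Each destination row is a 32-byte linear write, so the column-at-a-time stores happen only inside the small tile buffer.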
Old 30 June 2024, 20:58   #108
paraj
Registered User
 
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,226
Quote:
Originally Posted by Karlos View Post
I think it depends on your approach to rendering. If you implemented tile based rendering from the start, a 32*32 tile is a 1kB working set that fits easily in your cache even on 040. You can copy that directly to your framebuffer once it's done. Which could be RTG memory and some move16 fun, or it could be chip memory with C2P.
I share your enthusiasm, but doubt the numbers will work out in our favor (compared to effort). Happy to be proved wrong!
Old 30 June 2024, 21:23   #109
Karlos
Alien Bleed
 
 
Join Date: Aug 2022
Location: UK
Posts: 4,506
I think this only works when the engine is designed with these principles in mind from the beginning.
Old 30 June 2024, 21:41   #110
Karlos
Alien Bleed
 
 
Join Date: Aug 2022
Location: UK
Posts: 4,506
Quote:
Originally Posted by paraj View Post
I share your enthusiasm, but doubt the numbers will work out in our favor (compared to effort). Happy to be proved wrong!
Back on topic, how does the latest mixer perform for you? It will run the 040 code path by default, unless you pass USE060 as a parameter.

The current normalisation code is basically the same, so we'd be measuring the difference between the lookup and multiplication approaches to mixing.
Old 30 June 2024, 22:00   #111
paraj
Registered User
 
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,226
Quote:
Originally Posted by Karlos View Post
Back on topic, how does the latest mixer perform for you? It will run the 040 code path by default, unless you pass USE060 as a parameter.

The current normalisation code is basically the same, so we'd be measuring the difference between the lookup and multiplication approaches to mixing.

355996 / 263203 (latter is with USE060)
Old 30 June 2024, 22:56   #112
Karlos
Alien Bleed
 
 
Join Date: Aug 2022
Location: UK
Posts: 4,506
It would be good to see the same tests on a real 040. 1.35x faster for the 060 specific path seems a nice boost.

Last edited by Karlos; 30 June 2024 at 23:03.
Old 30 June 2024, 23:13   #113
Karlos
Alien Bleed
 
 
Join Date: Aug 2022
Location: UK
Posts: 4,506
Can you retry without the data cache enabled? That could be quite an interesting comparison.
Old 30 June 2024, 23:18   #114
Karlos
Alien Bleed
 
 
Join Date: Aug 2022
Location: UK
Posts: 4,506
I need to have a non delta 040 version for comparison on real hardware. It would be annoying, but I can imagine that the extra logic required to manage the delta code ends up being slower. Of course, in such a scheme you would just pre-encode your 8-bit samples and simplify the corresponding mixing loop.
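
A sketch of what the pre-encoded variant might look like in C, under the same caveats as any model of the assembly: `delta_encode` and `mix_preencoded` are hypothetical names, and the table passed in is assumed to map a byte to its scaled 16-bit value. The delta calculation moves to an offline step, so the mixing loop is left with one lookup and two adds per sample.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offline step: out[0] is the first sample itself, out[i] is the
   wrapped 8-bit difference between sample i and sample i-1. */
static void delta_encode(const int8_t *src, uint8_t *out, size_t n) {
    assert(src && out);
    uint8_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t cur = (uint8_t)src[i];
        out[i] = (uint8_t)(cur - prev);
        prev = cur;
    }
}

/* Mixing loop for pre-encoded data: no negate, no second register
   holding the previous 8-bit sample. */
static void mix_preencoded(const uint8_t *deltas, int16_t *dst, size_t n,
                           const int16_t table[256]) {
    assert(deltas && dst && table);
    uint16_t acc = 0;                                    /* running 16-bit sample */
    for (size_t i = 0; i < n; i++) {
        acc = (uint16_t)(acc + (uint16_t)table[deltas[i]]);
        dst[i] = (int16_t)(uint16_t)(dst[i] + acc);
    }
}
```

As with the in-loop delta version, exact reconstruction in this model relies on the table scale being a multiple of 256 whenever a delta wraps.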
Old Yesterday, 15:15   #115
Karlos
Alien Bleed
 
 
Join Date: Aug 2022
Location: UK
Posts: 4,506
Anyone with a working 040?
Old Yesterday, 17:44   #116
paraj
Registered User
 
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,226
Quote:
Originally Posted by Karlos View Post
Can you retry without the data cache enabled? That could be quite an interesting comparison.
Without D$: 1304199 / 1001972
Old Yesterday, 19:50   #117
Karlos
Alien Bleed
 
 
Join Date: Aug 2022
Location: UK
Posts: 4,506
Quote:
Originally Posted by paraj View Post
Without D$: 1304199 / 1001972
So apart from being significantly slower, the relative difference is about the same. I don't know how I feel about that. We are writing to chip RAM. I should add a few more options.

1. NoDelta switch for the 040 path
2. MixOnly switch for testing just the mixing and skipping the normalisation and chip buffer writes.
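
For reference, the ratios behind "about the same", assuming lower figures mean faster: with D$ enabled 355996 / 263203 is about 1.35, without it 1304199 / 1001972 is about 1.30, and turning D$ off costs roughly 3.7x on the 040 path and 3.8x on the 060 path. A trivial C helper (the name `ratio` is just for illustration) reproduces these:

```c
#include <assert.h>

/* Ratio of two timings in the same unit; lower timing = faster. */
static double ratio(double slower, double faster) {
    assert(faster > 0.0);
    return slower / faster;
}
```

For example, `ratio(355996, 263203)` gives the 060-path speedup with the data cache on.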
Old Yesterday, 20:01   #118
paraj
Registered User
 
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,226
Quote:
Originally Posted by Karlos View Post
So apart from being significantly slower, the relative difference is about the same. I don't know how I feel about that. We are writing to chip RAM. I should add a few more options.

1. NoDelta switch for the 040 path
2. MixOnly switch for testing just the mixing and skipping the normalisation and chip buffer writes.
I think, but haven't checked, that disabling D$ means chip writes can't overlap with other calculations on 060 (that access memory), so be careful how you interpret results.


Other benchmarks will be good, but IMO you should make one exe that tests a bunch of interesting stuff at once without needing options (like your Akiko tests), and then corner a 040 owner for forced testing.
Old Yesterday, 20:08   #119
Karlos
Alien Bleed
 
 
Join Date: Aug 2022
Location: UK
Posts: 4,506
Quote:
Originally Posted by paraj View Post
I think, but haven't checked, that disabling D$ means chip writes can't overlap with other calculations on 060, so be careful how you interpret results.
I think this is why a "mix only" test makes sense.

Quote:
Other benchmarks will be good, but IMO you should make one exe that tests a bunch of interesting stuff at once without needing options (like your Akiko tests), and then corner a 040 owner for forced testing.
You're right. However, I'm hedging my bets that on some machines the seemingly suboptimal approach will be the best one and so on.
 

