08 December 2023, 23:31 | #101 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
This build has the sequential rendering mentioned a few posts back. As I feared, the gain is negligible on my A1200 (about 0.1 fps). Everything gained by rendering the landscape sequentially is lost to two things: the less efficient copy from FAST RAM to CHIP RAM (the rotation can be done fully in parallel with the writes, but writing 4 longwords per loop with two nested loops is still less efficient than burst-copying 32 longwords at a time with partially unrolled code), and the rendering of the background, which is no longer a matter of sequential writes for just the strictly needed number of lines (depending on the horizon Y).
That said:
* at the moment, the rendering of the background is dumb: the whole thing gets rendered, because rendering just the strictly needed lines (as columns) requires a dynamically-sized loop per line, which is so much more complicated that it is hardly more advantageous than a plain copy; anyway, this is just a placeholder, as I plan to put the background rendering code at the tail of the landscape rendering code, where only the dots actually visible can be rendered at the end of each column;
* this is just an initial version and there might still be room for optimization.
But I have to wonder: if I were to make a game with this, I'd feel sick at the thought of having Xs, Ys, widths and heights swapped around...
P.S. Sorry if it sounds cryptic, but I'm just too tired to explain more adequately.
EDIT: afterwards I made a couple of optimizations and there's still something else that can be done; if sleep doesn't kill me, I'll take care of it after work this afternoon.
Last edited by saimo; 09 December 2023 at 18:09. Reason: Removed attachment as I provided a newer version later. |
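For readers trying to picture the "swapped Xs and Ys", here is a minimal Python sketch of the general idea, not the actual 68k code. The buffer sizes and the names `render_transposed`, `rotate_copy` and `column_pixels` are hypothetical: each screen column is written as one contiguous run into a transposed raster, and a second pass puts the pixels back into normal row-major order (the role played by the rotate&write copy to CHIP RAM).

```python
# Hypothetical sizes, chosen only for illustration.
WIDTH, HEIGHT = 8, 4

def render_transposed(column_pixels):
    """Write each screen column as one contiguous run of HEIGHT pixels.

    column_pixels(x) is a stand-in for the voxel column renderer: it
    returns the HEIGHT pixels of screen column x, top to bottom.
    """
    raster = []
    for x in range(WIDTH):
        raster.extend(column_pixels(x))  # purely sequential writes
    return raster

def rotate_copy(raster):
    """Copy the transposed raster into row-major screen order."""
    screen = [0] * (WIDTH * HEIGHT)
    for y in range(HEIGHT):
        for x in range(WIDTH):
            screen[y * WIDTH + x] = raster[x * HEIGHT + y]
    return screen

# Toy "renderer": each pixel value encodes its screen position as 10*y + x.
screen = rotate_copy(render_transposed(lambda x: [10 * y + x for y in range(HEIGHT)]))
```

The sketch only shows the index swap; it says nothing about the relative cost of the two passes, which is what the benchmarks in this thread measure.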
09 December 2023, 05:37 | #102 |
Registered User
Join Date: Jan 2019
Location: Finland
Posts: 654
|
Tried this on TF1260+MMULib, it glitches out.
|
09 December 2023, 08:53 | #103 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
Did you use the version attached to my previous post (#101)? If so, maybe I've screwed something up in the new rendering code (not the copy&rotate stuff though, as the splash screen shows fine). Could you try the one attached here (it's the previous build)? It works fine on an A4000/060 and is finally even faster than on my 68030 (see attached picture). Last edited by saimo; 09 December 2023 at 18:39. Reason: Removed attachment as I provided a newer version later. |
|
09 December 2023, 09:01 | #104 |
old bearded fool
Join Date: Jan 2010
Location: Bangkok
Age: 57
Posts: 779
|
Is there any reason to use the MMU for Amiga OS on 68k?
(For 68k Linux it's a requirement IIRC.) I remember that when I was using a 68030 @ 33MHz with MMU (ACA-1232, now bork), I installed MMULib primarily to test the WHDLoad debug support for it, but uninstalled it shortly after because it noticeably affected performance and occasionally I ran into compatibility issues. There seem to be a lot of people using the MMU now; is it because of Amiga OS 3.2? I can't remember it being mentioned much in the past. |
09 December 2023, 16:10 | #105 | |
Registered User
Join Date: Jan 2019
Location: Finland
Posts: 654
|
Quote:
The Nov 28 build on the website works without glitches, but running it after setpatch kills fps. |
|
09 December 2023, 18:04 | #106 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
But, yes, it comes with a speed penalty. |
|
09 December 2023, 18:09 | #107 | ||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
Quote:
|
||
09 December 2023, 18:39 | #108 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
I dedicated some time to the rendering code, to see if I could improve the performance with the sequential writes. Before everything else, I made the benchmark perform the rendering 256 times while rotating the camera by 360° (instead of rendering always the same scene like it did before), so that the figures would be more meaningful.
Then I benchmarked the performance of the engine that writes by columns and of the engine that writes sequentially and block-copies the background. Then, as anticipated yesterday, I implemented the rendering of the background in the core loop, hoping that the avoidance of overdraw would offset the more complex and overhead-heavy code. Finally, I benchmarked this last engine. These are the results (as always, on my A1200 + Blizzard 1230 IV):
Code:
VOXEL       | BACKGROUND                      | FPS     | FPS    | PED81C COST
WRITES      | RENDERING                       | (BLIND) | (FULL) | (FPS / FRAMES)
------------+---------------------------------+---------+--------+----------------
by columns  | block copy, only required lines | 23.108  | 20.918 | 2.190 / 0.227
sequential  | block copy, whole               | 23.546  | 21.162 | 2.384 / 0.239
sequential  | by columns                      | 23.618  | 21.220 | 2.398 / 0.239
The difference 0.239-0.227 tells us that the rotate&write copy from FAST RAM to CHIP RAM of the rendered raster is 0.012 frames (i.e. about 3.744 rasterlines per frame) slower than a simple block copy of the whole raster. Even if the rotation logic happens totally in parallel with the writes (in fact, if I disable its code altogether, I get the same or even a worse (!) performance), that's inevitable, given the structurally more complex code. That said, the operation takes about 312*0.239 ≈ 74.6 rasterlines per frame, which isn't too bad.
The rendering of the background by columns performs better when little of it shows. There's no absolute best method. But this method has the potential advantage that it allows adding skewing (which I don't plan to add, though).
Attached is an archive that contains all three versions:
* PVE-C: Column writes
* PVE-SW: Sequential writes, Whole background
* PVE-SC: Sequential writes, background rendered by Columns.
There are no changes regarding MMU and caches handling, so, given the report above, glitches are likely. Tests and reports will be much appreciated, as always.
Last edited by saimo; 10 December 2023 at 23:05. Reason: Removed attachment, as I provided a newer version later. |
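The cost columns can be reproduced with a few lines of Python. This is a sketch under an assumption not stated in the post: the "FRAMES" figure is taken to be the extra time per rendered frame, expressed in 50 Hz PAL display frames. That reading matches all three rows of the table to three decimals, but the names `ped81c_cost` and `PAL_FRAME_RATE` are mine.

```python
PAL_FRAME_RATE = 50.0          # assumption: PAL display, 50 frames per second
RASTERLINES_PER_FRAME = 312    # figure used in the post

def ped81c_cost(blind_fps, full_fps):
    """Return (cost in fps, cost in display frames per rendered frame)."""
    cost_fps = blind_fps - full_fps
    # Extra time per rendered frame, converted from seconds to display frames.
    cost_frames = (1.0 / full_fps - 1.0 / blind_fps) * PAL_FRAME_RATE
    return cost_fps, cost_frames

results = {
    "by columns / block copy, only required lines": (23.108, 20.918),
    "sequential / block copy, whole": (23.546, 21.162),
    "sequential / by columns": (23.618, 21.220),
}
for name, (blind, full) in results.items():
    fps, frames = ped81c_cost(blind, full)
    print(f"{name}: {fps:.3f} fps, {frames:.3f} frames, "
          f"{frames * RASTERLINES_PER_FRAME:.1f} rasterlines")
```

Multiplying the frames figure by 312 gives the rasterline estimates quoted in the post.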
09 December 2023, 20:26 | #109 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,217
|
        BB     / FB
PVE-C   25.679 / 24.297
PVE-SC  24.975 / 24.914
PVE-SW  24.975 / 24.914
Glitches a bit, but not too bad |
09 December 2023, 22:34 | #110 | ||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
@paraj
Thanks! Quote:
I'm afraid that until I get the glitches fixed, the timings aren't reliable. Quote:
By the way, I received other reports: glitches confirmed on another TF1260, and the only version that works fine is PVE-C on A4000. |
||
10 December 2023, 18:19 | #111 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
While revising the code, I noticed that the initialization code uses Exec's AttnFlags to detect the 68060 (this is actually in a library of routines of my own that I build PVE against), which just doesn't work if PVE is run before SetPatch (e.g. if it is tested without startup-sequence). Fixed now.
Also, I thought it would be worth testing whether the precise caching-inhibit model for the first 16 MB of the address space would cure the glitches. I also fixed a cleanup bug (the program would attempt to restore the MMU as per the system even if the MMU had never been reconfigured). Last edited by saimo; 10 December 2023 at 23:07. Reason: Removed attachment, as I provided a newer version later. |
10 December 2023, 18:53 | #112 | |
Registered User
Join Date: Jan 2019
Location: Finland
Posts: 654
|
Quote:
|
|
10 December 2023, 20:14 | #113 |
Registered User
Join Date: Jul 2017
Location: San Jose
Posts: 677
|
A bit off-topic: do you know how the cache behaves when iterating through the voxel landscape pixels for rendering? I’m assuming they’re stored in a large pitch-linear 2d array.
I have been toying with the idea of scrambling the landscape “texture” such that a group of 4x4 pixels sits linearly in memory. Thus, accessing any of the pixels in a 4x4 group will pull in all the others in the same burst cycle. Accessing any of the neighbors next will likely result in a cache hit. This way the performance of the renderer should be the same regardless of the camera orientation. For full effect it would require the landscape texture to have “mipmaps”, though. I have not implemented such an addressing scheme yet and wonder if the better cache behavior is worth the more complicated address decoding effort. |
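To make the address decoding cost concrete, here is a minimal Python sketch of such a 4x4 tiled layout. It assumes 8-bit texels and a map width that is a multiple of 4; the function names are hypothetical. With those assumptions one tile occupies 16 bytes, so a tile aligned to a 16-byte boundary would fit in a single 16-byte cache line.

```python
TILE = 4  # 4x4 texels per tile; with 8-bit texels that is 16 bytes

def linear_offset(x, y, width):
    """Offset of texel (x, y) in a plain pitch-linear map."""
    return y * width + x

def tiled_offset(x, y, width):
    """Offset of texel (x, y) when each 4x4 tile is stored contiguously.

    Assumes width is a multiple of TILE.
    """
    tiles_per_row = width // TILE
    tile_index = (y // TILE) * tiles_per_row + (x // TILE)
    within_tile = (y % TILE) * TILE + (x % TILE)
    return tile_index * TILE * TILE + within_tile

def swizzle(linear_map, width, height):
    """Rearrange a pitch-linear map into the tiled layout."""
    tiled = [0] * (width * height)
    for y in range(height):
        for x in range(width):
            tiled[tiled_offset(x, y, width)] = linear_map[linear_offset(x, y, width)]
    return tiled
```

Since `TILE` is a power of two, the divisions and remainders reduce to shifts and masks, which is the extra per-sample work the question above weighs against the better cache behavior.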
10 December 2023, 20:56 | #115 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,478
|
@pipper
You can go a step further if you have 8 bit texels. You can arrange 4x4 patches into a cache line. |
10 December 2023, 21:07 | #116 | |||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
*Which is what I'm trying to avoid also for 68040 and 68060 (that's already solved for 68030)... Hmm... it just occurred to me that I haven't checked what happens on my 68030 machine if I turn the data cache completely off and then on like for the other CPUs (because I simply use the burst enable/disable bit)... Gotta try, maybe that's what's causing the glitches (although I can't see how: the caches are writethrough, they're disabled only during the rendering loop and the rendering loop accesses only the map data and a few local variables that get initialized without reading them first just before turning the caches off). EDIT: see my next post. Quote:
Quote:
Last edited by saimo; 10 December 2023 at 23:09. |
|||
10 December 2023, 23:03 | #117 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
I tried to have the data cache switched on and off also on 68030, just like on 68040 and 68060, to see if it had the same disruptive effects - and it didn't (but the cache is smaller, so this doesn't mean much; it would have helped if the problems arose, instead).
While at it, I decided to test whether the cache handling policy for the 68030 was still the best with the new rendering code (sequential writes + rotate&copy to CHIP RAM). Before, the policy was:
* data cache always on;
* data cache burst on only during the block-copy of the background and the block-copy of the raster from FAST RAM to CHIP RAM.
Now I tried this policy:
* data cache always on;
* data cache burst always on except during the voxel rendering.
And also:
* data cache and burst always on.
To my surprise, it turned out that the last (and simplest) strategy was the best one, raising the (BB/FB) benchmark results from 23.618 / 21.220 to 23.720 / 21.290 (just a little, but still...). I guess this is because now the background rendering, which benefits from the burst, is part of the core rendering code. So, it is worth checking whether the same applies to 68040 and 68060: this build here has the data cache toggling disabled for all the CPUs.
P.S. If this also cured the glitches, it just wouldn't seem right (as I can't see why disabling the caches temporarily, just during the execution of a rendering routine that only modifies local or uncacheable data, could have those negative effects)! I'd be happier if the issue were in the buffering stuff (a race condition or something - I still have to investigate, but now I really must put myself to bed).
Last edited by saimo; 11 December 2023 at 20:51. Reason: Removed attachment as I provided a newer version later. |
10 December 2023, 23:12 | #118 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,217
|
Seems good! 37.858 (bb) and 30.656 (fb) with my normal config and no glitches!
https://i.imgur.com/JgV5xPZ.mp4 Last edited by paraj; 10 December 2023 at 23:20. |
10 December 2023, 23:44 | #119 |
Registered User
Join Date: Jan 2019
Location: Finland
Posts: 654
|
No glitches on TF1260 either.
BB 50 MHz: 26.479
FB 50 MHz: 22.855
BB 100 MHz: 47.814
FB 100 MHz: 38.163 |
11 December 2023, 08:34 | #120 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Thanks guys!
Now I know. Last night I was dreaming of playing guitar - and quite enjoying it. Suddenly, I woke up for no reason. After a few seconds spent contemplating the music I had in my head, my mind returned to the riddle, and instantly found the mistake. It was so obvious! And I had even touched on it a number of times... It's the combination of caching, buffering and interrupts! Full details and new builds to find the fastest settings after work (damn real life!).
EDIT: here's the explanation (test builds will follow later). This explanation is basically a confession and - believe me - I'm totally ashamed. It's a lot of words, but it's really simple actually, and you have probably already guessed it from what I wrote above.
Keeping in mind that the caches are writethrough (but I'll also provide builds to test copyback), this is what happens:
n1. the renderer, when it starts executing, turns the data cache off;
n2. then it initializes a few local variables without referring to them, but referring only to 4 global variables: camera X, camera Y, camera angle and address of the raster to be written to;
n3. during the execution, it writes only to the local variables and to the raster in FAST RAM;
n4. when it is done, it turns the data cache on;
n5. a buffering operation is executed to mark the raster ready-for-copy-to-CHIP-RAM and make the other raster in FAST RAM available for the next render.
All of that, on its own, works just fine. But... the renderer is not alone! There's also the buffering and the global logic, which are interrupt-driven - more precisely, when the bottom of the display window is reached, a COPER interrupt fires and:
i1. the display-related buffering takes place (the CHIP RAM buffers get swapped around);
i2. if a rendered raster is available, it gets rotated&written to the hidden CHIP RAM buffer;
i3. the control logic (at the moment, just the handling of the user input and of the camera) executes.
Things go awry in this (frequent) case:
e1. the renderer turns the cache off (n1);
e2. at any time between steps n2 and n3, the COPER interrupt fires, one or more times (i1-i3);
e3. the renderer finishes and turns the cache on (n4).
The problem is that at step e2 the buffering and camera variables get altered in memory while the cache is off, which makes the corresponding data still sitting in the cache stale. When the cache is turned back on, those stale cached copies hide the good data in memory, and thus all the following operations (e.g. step n5, or steps i1-i3 if the interrupt fires when the data cache is on) work on invalid data. Result: buffering totally screwed up and horrible glitches.
The cure is simple: cinva dc (invalidate the whole data cache) when disabling or enabling the cache. I'll post new builds that implement this solution, but that also enable the cache during the FAST RAM -> CHIP RAM operation (like it used to be for the 68030) and with copyback enabled, to see which policy gives the fastest result.
Last edited by saimo; 11 December 2023 at 14:39. |
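The failure sequence e1-e3 can be reproduced with a toy model in Python. This is only a sketch: the `Machine` class, its method names and the `camera_x` variable are invented for illustration, and real 680x0 caches work on lines rather than single variables. The point it shows is that a disabled writethrough cache keeps its old contents, so re-enabling it without invalidation exposes stale values.

```python
class Machine:
    """Toy model: one memory dict plus a writethrough data cache."""

    def __init__(self):
        self.memory = {}
        self.cache = {}
        self.cache_on = True

    def write(self, addr, value):
        self.memory[addr] = value          # writethrough: memory is always updated
        if self.cache_on:
            self.cache[addr] = value       # the cache only sees writes while enabled

    def read(self, addr):
        if self.cache_on:
            if addr not in self.cache:
                self.cache[addr] = self.memory[addr]  # miss: fill from memory
            return self.cache[addr]
        return self.memory[addr]

    def set_cache(self, enabled, invalidate=False):
        self.cache_on = enabled
        if invalidate:
            self.cache.clear()             # stand-in for invalidating the data cache


def run(invalidate):
    m = Machine()
    m.write("camera_x", 100)
    m.read("camera_x")                     # the value is now cached
    m.set_cache(False, invalidate)         # e1: renderer turns the cache off
    m.write("camera_x", 101)               # e2: interrupt updates the variable
    m.set_cache(True, invalidate)          # e3: renderer turns the cache on
    return m.read("camera_x")
```

In this model `run(False)` returns the old value 100, while `run(True)` returns the updated 101, which mirrors the cure described above.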