English Amiga Board


Old 08 December 2023, 23:31   #101
saimo
Registered User
 
Join Date: Aug 2010
Location: Italy
Posts: 854
This build here has the sequential rendering mentioned a few posts back. As I feared, the gain is negligible on my A1200 (about 0.1 fps): whatever the sequential rendering of the landscape gains is lost again in two places. The first is the less efficient copy from FAST RAM to CHIP RAM: although the rotation can be done fully in parallel with the writes, writing 4 longwords per loop inside two nested loops is still less efficient than burst-copying 32 longwords at a time with partially unrolled code. The second is the rendering of the background, which is no longer a matter of sequential writes for just the strictly needed number of lines (depending on the horizon Y).
That said:
* at the moment, the rendering of the background is dumb: the whole thing gets rendered, as rendering just the lines (as columns) strictly needed requires a dynamically-sized loop per line, which is so much more complicated that it is hardly more advantageous than a plain copy; anyway, this is just a placeholder as I plan to put the background rendering code at the tail of the landscape rendering code, where only the dots actually visible can be rendered at the end of each column;
* this is just an initial version and there might still be room for optimization.
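
To make the trade-off concrete, here is a small Python sketch (not the actual 68k code) of the two copy styles discussed above: a plain block copy versus a copy that swaps X and Y on the fly. It assumes the sequentially rendered raster is stored column after column and only needs transposing; whether the real routine also flips an axis is not stated in this post. The raster size and the function names are made up for illustration.

```python
def block_copy(src):
    """Plain sequential copy: one flat pass over the buffer."""
    return list(src)


def rotate_copy(src, width, height):
    """Copy a raster stored column after column into row-major order.

    src holds `width` columns of `height` dots each (column-major);
    the result holds `height` rows of `width` dots each.
    """
    dst = [0] * (width * height)
    for x in range(width):            # outer loop: source columns
        base = x * height
        for y in range(height):       # inner loop: dots of one column
            dst[y * width + x] = src[base + y]
    return dst


# Tiny hypothetical raster: the dot at (x, y) holds the value 10*x + y.
W, H = 4, 3
column_major = [10 * x + y for x in range(W) for y in range(H)]
row_major = rotate_copy(column_major, W, H)
```

The reads are sequential but the writes jump by `width` on every step, which is the structural reason the nested-loop version cannot be unrolled into one long burst the way the block copy can.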

But I have to wonder: if I were to make a game with this, I'd feel sick at the thought of having Xs, Ys, widths and heights swapped around...

P.S. Sorry if it sounds cryptic, but I'm just too tired to explain more adequately.

EDIT: afterwards I made a couple of optimizations and there's still something else that can be done; if sleep doesn't kill me, I'll take care of it after work this afternoon.

Last edited by saimo; 09 December 2023 at 18:09. Reason: Removed attachment as I provided a newer version later.
Old 09 December 2023, 05:37   #102
Aardvark
Registered User
 
Join Date: Jan 2019
Location: Finland
Posts: 654
Tried this on TF1260+MMULib, it glitches out.
[ Show youtube player ]
Old 09 December 2023, 08:53   #103
saimo
Quote:
Originally Posted by Aardvark View Post
Tried this on TF1260+MMULib, it glitches out.
[ Show youtube player ]
Thanks for the test and the video!
Did you use the version attached to my previous post (#101)? If so, maybe I've screwed something up in the new rendering code (not the copy&rotate stuff though, as the splash screen shows fine).
Could you try the one attached here (it's the previous build)? It works fine on an A4000/060 and finally even faster than on my 68030 (see attached picture).
Attached image: results.jpg (99.4 KB)

Last edited by saimo; 09 December 2023 at 18:39. Reason: Removed attachment as I provided a newer version later.
Old 09 December 2023, 09:01   #104
modrobert
old bearded fool
 
Join Date: Jan 2010
Location: Bangkok
Age: 57
Posts: 779
Is there any reason to use the MMU for Amiga OS on 68k?

(For 68k Linux it's a requirement IIRC.)

I remember when using a 68030 @ 33MHz with MMU (ACA-1232, now bork), installed MMULib primarily to test the WHDLoad debug support for it, but uninstalled shortly after because it noticeably affected performance and occasionally I ran into compatibility issues.

There seem to be a lot of people using the MMU now; is it because of AmigaOS 3.2? I can't remember it being mentioned much in the past.
Old 09 December 2023, 16:10   #105
Aardvark
Quote:
Originally Posted by saimo View Post
Thanks for the test and the video!
Did you use the version attached to my previous post (#101)? If so, maybe I've screwed something up in the new rendering code (not the copy&rotate stuff though, as the splash screen shows fine).
Could you try the one attached here (it's the previous build)? It works fine on an A4000/060 and finally even faster than on my 68030 (see attached picture).
The previous build runs with some glitches, and with the NSD option it freezes after a short while.
[ Show youtube player ]

The Nov 28 build on the website works without glitches, but running it after SetPatch kills the fps.
Old 09 December 2023, 18:04   #106
saimo
Quote:
Originally Posted by modrobert View Post
Is there any reason to use the MMU for Amiga OS on 68k?

(For 68k Linux it's a requirement IIRC.)

I remember when using a 68030 @ 33MHz with MMU (ACA-1232, now bork), installed MMULib primarily to test the WHDLoad debug support for it, but uninstalled shortly after because it noticeably affected performance and occasionally I ran into compatibility issues.

There seem to be a lot of people using the MMU now; is it because of AmigaOS 3.2? I can't remember it being mentioned much in the past.
As far as I know, AmigaOS 3.2 (which I don't have) does not impose the use of the MMU. But, as is being discussed in this thread, the MMU must be configured properly on 68040 and 68060. It might also be the only way to get some special hardware to work. It's useful for protecting unallocated memory areas and for debugging. It can be used to remap the ROM. And so on.
But, yes, it comes with a speed penalty.
Old 09 December 2023, 18:09   #107
saimo
Quote:
Originally Posted by Aardvark View Post
The previous build runs with some glitches, and with the NSD option it freezes after a short while.
[ Show youtube player ]
At some point I noticed that the fps counter reported 81: that should not happen, as buffering is limited to 50 Hz. From that and the glitches, I guess it's a matter of interrupts. Will investigate. Thank you.

Quote:
The Nov 28 build on the website works without glitches, but running it after SetPatch kills the fps.
I guess SetPatch enabled table-based translation, which affects the performance because that build does not configure the MMU.
Old 09 December 2023, 18:39   #108
saimo
I dedicated some time to the rendering code, to see if I could improve the performance with the sequential writes. Before anything else, I made the benchmark perform the rendering 256 times while rotating the camera through 360° (instead of always rendering the same scene like it did before), so that the figures would be more meaningful.
Then I benchmarked the performance of the engine that writes by columns and of the engine that writes sequentially and block-copies the background. Then, as anticipated yesterday, I implemented the rendering of the background in the core loop, hoping that avoiding overdraw would offset the more complex code and its overhead. Finally, I benchmarked this last engine.
These are the results (as always, on my A1200 + Blizzard 1230 IV):

Code:
 VOXEL      | BACKGROUND                      | FPS     | FPS    |  PED81C COST
 WRITES     | RENDERING                       | (BLIND) | (FULL) | (FPS / FRAMES)
------------+---------------------------------+---------+--------+----------------
 by columns | block copy, only required lines |  23.108 | 20.918 |  2.190 / 0.227
 sequential | block copy, whole               |  23.546 | 21.162 |  2.384 / 0.239
 sequential | by columns                      |  23.618 | 21.220 |  2.398 / 0.239
I wasn't really sure that performance would improve, but it did - albeit only a little: the sequential-writes code now beats the column-writes code by 21.220-20.918 = 0.302 fps. Whether it's worth the extra headaches of having the X and Y axes swapped around is to be seen (and honestly I'm not that confident).

The difference 0.239-0.227 tells us that the rotate&write copy of the rendered raster from FAST RAM to CHIP RAM is 0.012 frames (i.e. about 3.744 rasterlines per frame) slower than a simple block copy of the whole raster. Even though the rotation logic runs totally in parallel with the writes (in fact, if I disable its code altogether, I get the same or even a worse (!) performance), that's inevitable, given the structurally more complex code.
That said, the operation takes about 312*0.239 = 74.5 rasterlines per frame, which isn't too bad.
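
For anyone who wants to check the arithmetic, the figures in the table can be reproduced with a few lines of Python. Note that the formula for the FRAMES column is inferred from the numbers (it matches all three rows to three decimals), not quoted from the engine's source; the function and constant names are made up for this sketch.

```python
PAL_LINES = 312      # rasterlines per frame, as used in the post
DISPLAY_HZ = 50      # PAL display rate


def copy_cost_frames(fps_blind, fps_full):
    """Display frames spent per rendered frame on the copy to CHIP RAM.

    Inferred formula: time per frame with the copy minus time per frame
    without it, both measured in 1/50 s display frames.
    """
    return DISPLAY_HZ / fps_full - DISPLAY_HZ / fps_blind


# (blind fps, full fps) taken from the benchmark table above.
results = {
    "by columns / block copy, required lines": (23.108, 20.918),
    "sequential / block copy, whole": (23.546, 21.162),
    "sequential / by columns": (23.618, 21.220),
}

for name, (blind, full) in results.items():
    frames = copy_cost_frames(blind, full)
    print(f"{name}: {blind - full:.3f} fps, {frames:.3f} frames, "
          f"{frames * PAL_LINES:.1f} rasterlines")
```

Multiplying the FRAMES value by 312 gives the cost in rasterlines, which is where the figure of roughly 74.5 rasterlines comes from.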

The rendering of the background by columns performs better when little of it shows. There's no absolute best method. But this method has the potential advantage that it would allow skewing to be added (which I don't plan to do, though).

Attached is an archive that contains all three versions:
* PVE-C: Column writes
* PVE-SW: Sequential writes, Whole background
* PVE-SC: Sequential writes, background rendered by Columns.

There are no changes regarding MMU and caches handling, so, given the report above, glitches are likely.

Tests and reports will be much appreciated, as always.

Last edited by saimo; 10 December 2023 at 23:05. Reason: Removed attachment, as I provided a newer version later.
Old 09 December 2023, 20:26   #109
paraj
Registered User
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,217
        BB     / FB
PVE-C   25.679 / 24.297
PVE-SC  24.975 / 24.914
PVE-SW  24.975 / 24.914

Glitches a bit, but not too bad
Old 09 December 2023, 22:34   #110
saimo
@paraj

Thanks!

Quote:
Originally Posted by paraj View Post
        BB     / FB
PVE-C   25.679 / 24.297
PVE-SC  24.975 / 24.914
PVE-SW  24.975 / 24.914
Puzzling results. Column rendering is faster than sequential rendering, as I expected, given that the adda to move one column up executes completely in parallel with the dot write, while that can't happen on the 68030. But then the block copy from FAST RAM to CHIP RAM is much slower than the rotate&copy operation and, just as incredibly, the rotate&copy operation would affect the performance by only 24.975-24.914 = 0.061 fps (an average of 19 rasterlines per frame)?
I'm afraid that until I get the glitches fixed, the timings aren't reliable.

Quote:
Glitches a bit, but not too bad
One glitch alone is one glitch too many.

By the way, I received other reports: glitches confirmed on another TF1260, and the only version that works fine is PVE-C on A4000.
Old 10 December 2023, 18:19   #111
saimo
While revising the code, I noticed that the initialization code uses Exec's AttnFlags to detect the 68060 (this is actually in a library of routines of my own that I build PVE against), which just doesn't work if PVE is run before SetPatch (e.g. if it is tested without startup-sequence). Fixed now.
Also, I thought it would be worth testing whether the precise caching-inhibit model for the first 16 MB of the address space would cure the glitches.
I also fixed a cleanup bug (the program would attempt to restore the MMU as per the system even if the MMU had never been reconfigured).

Last edited by saimo; 10 December 2023 at 23:07. Reason: Removed attachment, as I provided a newer version later.
Old 10 December 2023, 18:53   #112
Aardvark
Quote:
Originally Posted by saimo View Post
While revising the code, I noticed that the initialization code uses Exec's AttnFlags to detect the 68060 (this is actually in a library of routines of my own that I build PVE against), which just doesn't work if PVE is run before SetPatch (e.g. if it is tested without startup-sequence). Fixed now.
Also, I thought it would be worth testing whether the precise caching-inhibit model for the first 16 MB of the address space would cure the glitches.
I also fixed a cleanup bug (the program would attempt to restore the MMU as per the system even if the MMU had never been reconfigured).
It now starts without startup-sequence, but the result is the same as in #102.
Old 10 December 2023, 20:14   #113
pipper
Registered User
 
Join Date: Jul 2017
Location: San Jose
Posts: 677
A bit off-topic: do you know how the cache behaves when iterating through the voxel landscape pixels for rendering? I’m assuming they’re stored in a large pitch-linear 2d array.
I have been toying with the idea of scrambling the landscape “texture” such that a group of 4x4 pixels sits linearly in memory. Thus, accessing any of the pixels in a 4x4 group will pull in all the others in the same burst cycle. Accessing any of the neighbors next will likely result in a cache hit. This way the performance of the renderer should be the same regardless of the camera orientation. For full effect it would require the landscape texture to have “mipmaps”, though. I have not implemented such an addressing scheme yet and wonder if the better cache behavior is worth the more complicated address decoding effort.
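
As a rough illustration of the address decoding involved, here is a Python sketch comparing a pitch-linear offset with a 4x4-tiled one. It is only a model of the idea, not renderer code: offsets are counted in map items rather than bytes, the map width is assumed to be a multiple of 4, and the function names are invented.

```python
TILE = 4  # tile edge, in map items


def linear_offset(x, y, width):
    """Offset of (x, y) in a plain pitch-linear map."""
    return y * width + x


def tiled_offset(x, y, width):
    """Offset of (x, y) when each 4x4 tile is stored contiguously.

    width must be a multiple of TILE.
    """
    tiles_per_row = width // TILE
    tile_index = (y // TILE) * tiles_per_row + (x // TILE)
    within_tile = (y % TILE) * TILE + (x % TILE)
    return tile_index * TILE * TILE + within_tile


# All 16 items of the tile covering x = 4..7, y = 8..11 in a 16-wide map.
tile_offsets = sorted(tiled_offset(x, y, 16)
                      for y in range(8, 12) for x in range(4, 8))
```

With power-of-two sizes the divisions and remainders reduce to shifts and masks, but that is still several extra operations per lookup compared with `linear_offset`, which is the cost being weighed against the better cache behavior.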
Old 10 December 2023, 20:52   #114
saimo
Quote:
Originally Posted by Aardvark View Post
It now starts without startup-sequence, but the result is the same as in #102.
Thanks for the quick feedback!
Next step: checking buffering and rendering (again).
Old 10 December 2023, 20:56   #115
Karlos
Alien Bleed
 
Join Date: Aug 2022
Location: UK
Posts: 4,478
@pipper

You can go a step further if you have 8 bit texels. You can arrange 4x4 patches into a cache line.
Old 10 December 2023, 21:07   #116
saimo
Quote:
Originally Posted by pipper View Post
A bit off-topic: do you know how the cache behaves when iterating through the voxel landscape pixels for rendering?
I guess that it simply fetches the line that each pixel is in*? But you clearly know this already, so maybe I'm misunderstanding your question.

*Which is what I'm trying to avoid also for 68040 and 68060 (that's already solved for 68030)... Hmm... it just occurred to me that I haven't checked what happens on my 68030 machine if I turn the data cache completely off and then on like for the other CPUs (because I simply use the burst enable/disable bit)... Gotta try, maybe that's what's causing the glitches (although I can't see how: the caches are writethrough, they're disabled only during the rendering loop and the rendering loop accesses only the map data and a few local variables that get initialized without reading them first just before turning the caches off).

EDIT: see my next post.

Quote:
I’m assuming they’re stored in a large pitch-linear 2d array.
Correct. More precisely, each item in the array is a pair of bytes: the first indicates the height and the second the color.
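
A minimal Python sketch of that layout, assuming a tiny 4x2 map (the real map size is not given here) and hypothetical helper names. It shows why coupling the two bytes is convenient: a single big-endian 16-bit read, as on the 68k, yields both the height and the color of a cell.

```python
import struct

MAP_W, MAP_H = 4, 2  # tiny illustrative map


def pack_map(cells):
    """cells: rows of (height, color) tuples -> bytes, height byte first."""
    return b"".join(struct.pack("BB", h, c) for row in cells for h, c in row)


def fetch(map_bytes, x, y, width=MAP_W):
    """One 16-bit read returns both the height and the color of a cell."""
    (word,) = struct.unpack_from(">H", map_bytes, 2 * (y * width + x))
    return word >> 8, word & 0xFF   # big-endian: the first byte is the height


cells = [
    [(10, 1), (20, 2), (30, 3), (40, 4)],
    [(50, 5), (60, 6), (70, 7), (80, 8)],
]
landscape = pack_map(cells)
height, color = fetch(landscape, 2, 1)
```

Whether the actual renderer reads the pair with one word access is an assumption of this sketch.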

Quote:
I have been toying with the idea of scrambling the landscape “texture” such that a group of 4x4 pixels sits linearly in memory. Thus, accessing any of the pixels in a 4x4 group will pull in all the others in the same burst cycle. Accessing any of the neighbors next will likely result in a cache hit. This way the performance of the renderer should be the same regardless of the camera orientation. For full effect it would require the landscape texture to have “mipmaps”, though. I have not implemented such an addressing scheme yet and wonder if the better cache behavior is worth the more complicated address decoding effort.
I like the concept, but at first thought the code is so much more complex that I can't see it being more efficient. Plus, this method loses the benefit of coupling together heights and colors.

Last edited by saimo; 10 December 2023 at 23:09.
Old 10 December 2023, 23:03   #117
saimo
I tried to have the data cache switched on and off also on 68030, just like on 68040 and 68060, to see if it had the same disruptive effects - and it didn't (but the cache is smaller, so this doesn't mean much; it would have helped if the problems arose, instead).

While at it, I decided to test whether the cache handling policy for the 68030 was still the best one for the new rendering code as well (sequential writes + rotate&copy to CHIP RAM).

Before, the policy was:
* data cache always on;
* data cache burst on only during the block-copy of the background and the block-copy of the raster from FAST RAM to CHIP RAM.

Now I tried this policy:
* data cache always on;
* data cache burst always on except during the voxel rendering.

And also:
* data cache and burst always on.

To my surprise, it turned out that the last (and simplest) strategy was the best one, raising the (BB/FB) benchmark results from 23.618 / 21.220 to 23.720 / 21.290 (just a little, but still...).
I guess this is because the background rendering, which benefits from the burst, is now part of the core rendering code. So, it is worth checking whether the same applies to 68040 and 68060: in this build the toggling of the data cache is switched off for all the CPUs.

P.S. If this also cured the glitches, it just wouldn't seem right (as I can't see why temporarily disabling the caches just during the execution of a rendering routine that only modifies local or uncacheable data could have those negative effects)! I'd be happier if the issue were in the buffering stuff (a race condition or something - I still have to investigate, but now I really must put myself to bed).

Last edited by saimo; 11 December 2023 at 20:51. Reason: Removed attachment as I provided a newer version later.
Old 10 December 2023, 23:12   #118
paraj
Seems good! 37.858 (bb) and 30.656 (fb) with my normal config and no glitches!

https://i.imgur.com/JgV5xPZ.mp4

Last edited by paraj; 10 December 2023 at 23:20.
Old 10 December 2023, 23:44   #119
Aardvark
No glitches on TF1260 either.

BB 50 MHz 26.479
FB 50 MHz 22.855
BB 100 MHz 47.814
FB 100 MHz 38.163
Old 11 December 2023, 08:34   #120
saimo
Thanks guys!
Now I know. Last night I was dreaming of playing guitar - and quite enjoying it. Suddenly, I woke up for no reason. After a few seconds spent contemplating the music I had in my head, my mind returned to the riddle, and instantly found the mistake. It was so obvious! And I had even touched on it a number of times... It's the combination of caching, buffering and interrupts!
Full details and new builds to find the fastest settings after work (damn real life!).

EDIT: here's the explanation (test builds will follow later). This explanation is basically a confession and - believe me - I'm totally ashamed. It's a lot of words, but it's really simple actually, and probably you have already guessed it from what I wrote above.

Keeping in mind that the caches are writethrough (but I'll provide builds to test also copyback), this is what happens:
n1. the renderer, when it starts executing, turns the data cache off;
n2. then it initializes a few local variables without referring to them, but referring only to 4 global variables: camera X, camera Y, camera angle and address of the raster to be written to;
n3. during the execution, it writes only to the local variables and to the raster in FAST RAM;
n4. when it is done, it turns the data cache on;
n5. a buffering operation is executed to mark the raster ready-for-copy-to-CHIP-RAM and make the other raster in FAST RAM available for the next render.

All of that, on its own, works just fine. But... the renderer is not alone! There's also the buffering and the global logic, which are interrupt-driven - more precisely, when the bottom of the display window is reached, a COPER interrupt fires and:
i1. the display-related buffering takes place (the CHIP RAM buffers get swapped around);
i2. if a rendered raster is available, it gets rotated&written to the hidden CHIP RAM buffer;
i3. the control logic (at the moment, just the handling of the user input and of the camera) executes.

Things go awry in this (frequent) case:
e1. the renderer turns the cache off (n1);
e2. at any time between steps n2 and n3, the COPER interrupt fires, one or more times (i1-i3);
e3. the renderer finishes and turns the cache on (n4).

The problem is that at step e2 the buffering and camera variables get altered in memory while the cache is off, so the copies of them still sitting in the cache become stale. When the cache is turned back on, those stale copies hide the good data in memory, and thus all the following operations (e.g. step n5 or steps i1-i3 if the interrupt fires when the data cache is on) work on invalid data. Result: buffering totally screwed up and horrible glitches.
The cure is simple: cinva dc when disabling or enabling the cache.
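
The sequence e1-e3 can be replayed with a toy model in Python. This is only a sketch under simplifying assumptions: the cache is modelled per variable rather than per 16-byte line, the class and function names are invented, and `invalidate_all()` merely stands in for invalidating the data cache on the real CPU.

```python
class WriteThroughCache:
    """Toy model of a write-through data cache in front of RAM."""

    def __init__(self, memory):
        self.memory = memory      # backing store: address -> value
        self.lines = {}           # cached copies: address -> value
        self.enabled = True

    def read(self, addr):
        if not self.enabled:
            return self.memory[addr]          # bypass, cache left untouched
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.memory[addr] = value             # write-through to RAM
        if self.enabled and addr in self.lines:
            self.lines[addr] = value          # write hit: refresh the copy
        # with the cache disabled, an old cached copy silently goes stale

    def invalidate_all(self):
        self.lines.clear()                    # stand-in for the invalidation


def run(invalidate):
    ram = {"camera_x": 100}
    cache = WriteThroughCache(ram)
    cache.read("camera_x")            # variable is cached before the render
    cache.enabled = False             # e1: renderer turns the cache off
    cache.write("camera_x", 101)      # e2: interrupt updates the variable
    if invalidate:
        cache.invalidate_all()        # the cure
    cache.enabled = True              # e3: renderer turns the cache back on
    return cache.read("camera_x")
```

Here `run(False)` hands back the stale value while RAM already holds the new one, and `run(True)` returns the fresh value, which is the whole bug and its fix in miniature.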

I'll post new builds that implement this solution, but that also enable the cache during the FAST RAM -> CHIP RAM operation (like it used to be for the 68030) and enable copyback, to see which policy gives the fastest result.

Last edited by saimo; 11 December 2023 at 14:39.