06 December 2023, 13:53 | #81 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
Since it's been decades since I last touched the MMU, I thought I'd better leave the TT registers alone, wrongly thinking that ATC-based translation would override them (when, instead, it's the other way around) and assuming that they were already correctly set. Many thanks for the suggestion.
There's one thing I don't understand about the settings you suggested for the DTTs, though:
* DTT0 = $0000c040 means that addresses 00xxxxxx are transparently translated and not writeable;
* DTT1 = $00ffc000 means that all addresses are transparently translated and writeable.
The settings conflict with each other and, although the M68060UM doesn't explicitly say how such conflicts are resolved, paragraph 4.4 suggests that write protection would prevail ("When write protection is enabled for a block..."): in that case, writes to CHIP RAM and chip registers would fail. Shouldn't all registers be set to $00ffc000?
EDIT: hmm... I'm not convinced the problem is related to the TTs, because:
* when the NOMMU switch is not used, the MMU is not fiddled with, so the jerks should not appear (i.e. I must have introduced a bug somewhere else);
* given that, as you reported, the TTs are disabled from outside, when the program disables the ATC-based translation the CPU basically behaves like a 68EC060, i.e. it uses the addresses literally and does not perform any extra caching/writeability handling on them.
Doh, I was forgetting about cache coherency issues. I'll also try with all the TTs disabled.
On second thought, I'd better transparently translate the whole address space to mark it writethrough.
Last edited by saimo; 06 December 2023 at 16:11.
|
06 December 2023, 15:48 | #82 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,480
|
Colour me curious. I wonder if the MMU could be problematic for TKG, it has quite a few randomly accessed tables.
|
06 December 2023, 16:05 | #83 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
|
|
06 December 2023, 17:34 | #84 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,217
|
Quote:
Not exactly the MMU expert either, but $0000c040 should mean E=1, CM=%10 for the lower 16MB and, now that I've written that, it should probably be $0000c060 instead (i.e. CM=%11 => Cache-Inhibited, Imprecise Exception Model). Writethrough sounds dangerous for custom registers.
Section 4.4 states that "If both registers match, the TT0 status bits are used for the access." so that's guaranteed. If neither matches and paging is disabled, you get the default values specified in the translation control register.
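For anyone wanting to double-check the bit fields discussed above, here is a minimal Python sketch that splits a transparent translation register value into its fields. The layout it assumes (base in bits 31-24, mask in bits 23-16, E in bit 15, S in bits 14-13, CM in bits 6-5, W in bit 2) is my reading of the 68060 manual, so verify it there before relying on it; `decode_ttr` and `CACHE_MODES` are names made up for illustration.

```python
# Cache mode names for the 2-bit CM field (68060 wording; assumed from the manual).
CACHE_MODES = {
    0b00: "writethrough, cacheable",
    0b01: "copyback, cacheable",
    0b10: "cache-inhibited, precise",
    0b11: "cache-inhibited, imprecise",
}


def decode_ttr(value):
    """Split a 32-bit TTR value into its fields (layout is an assumption)."""
    return {
        "base": (value >> 24) & 0xFF,          # logical address base
        "mask": (value >> 16) & 0xFF,          # bits of the base to ignore
        "enabled": bool(value & 0x8000),       # E bit
        "s_field": (value >> 13) & 0b11,       # 0b10/0b11: ignore user/supervisor
        "cache_mode": CACHE_MODES[(value >> 5) & 0b11],
        "write_protected": bool(value & 0x4),  # W bit
    }


for v in (0x0000C040, 0x0000C060, 0x00FFC000):
    print(f"${v:08x}", decode_ttr(v))
```

Under that layout, $0000c040 has W clear: the $40 bit belongs to the CM field, not to write protection, which is the point made in the post above.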
|
06 December 2023, 18:00 | #85 |
old bearded fool
Join Date: Jan 2010
Location: Bangkok
Age: 57
Posts: 779
|
I tried PVE on a stock A1200 with fast RAM. Even though it's only 6 FPS on average, it runs smoothly for a "3D" engine. Impressive results.
Code:
total number of frames rendered: 940
total number of frames shown: 7780
frames rendered per second average: 6.04
frames per render average: 8.27
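The four figures in that printout are related by simple arithmetic. The sketch below reproduces them, assuming a 50 Hz display where each "frame shown" is one vertical blank (the post does not state the refresh rate, so `REFRESH_HZ` is an assumption, and `stats` is a made-up helper).

```python
REFRESH_HZ = 50  # assumption: PAL display, one "frame shown" per vertical blank


def stats(rendered, shown, refresh_hz=REFRESH_HZ):
    """Return (rendered frames per second, shown frames per rendered frame)."""
    seconds = shown / refresh_hz
    return rendered / seconds, shown / rendered


fps, frames_per_render = stats(940, 7780)
print(f"{fps:.2f} rendered frames per second")
print(f"{frames_per_render:.4f} shown frames per rendered frame")
```

The second value works out to roughly 8.2766, so the 8.27 in the printout looks truncated rather than rounded.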
06 December 2023, 18:09 | #86 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
Anyway, I agree that going for the precise mode is best. But I'll give it another thought later/though (EDIT: "though" was supposed to be "tomorrow"... can't think straight...) (too tired now).
The attached build sets the MMU up as you suggested (and also has a couple of other small changes, but they are neither MMU-related nor bugfixes).
Last edited by saimo; 07 December 2023 at 00:25. Reason: Removed attachment, as I provided a newer version later.
|
06 December 2023, 18:41 | #87 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,217
|
Get some rest and return with a fresh perspective
Still really no speed difference with your latest build; however, nommu now seems to be less glitchy!
https://i.imgur.com/rvgztVC.mp4
Maybe the cache modification thing you started doing is causing it (for some reason)?
Quote:
Also ATC misses, which are the only ones you'll avoid by switching to TT, should normally not have that big of an effect. Of course if there is really no locality in your memory accesses it's going to be slow (around the same as a chip read apparently!), but my test is really pathological. |
|
06 December 2023, 18:41 | #88 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
I guess that such a machine would be capable of even better performance: during the initial development I had tried a version whose renderer code was less efficient than the current one* on a stock A1200, and it already reached 5 fps. That was thanks to the fact that, back then, graphics were rendered directly to a raster in CHIP RAM. However, seeing that the speed was too low and considering that with just 2 MB the maps have to be small, after a while I decided to make FAST RAM compulsory and changed the buffering strategy: before, there were 3 buffers in CHIP RAM; now there are 2 buffers in CHIP RAM and 2 buffers in FAST RAM, and graphics get rendered in FAST RAM first and then copied to CHIP RAM (that, if I remember correctly, produced a gain of 1 or 2 fps on my 68030 machine, but I'm pretty sure that it isn't ideal for a machine that only has additional FAST RAM).
*I came up with several optimizations afterwards. Also, that version did not render the background, but performed clearing by continuing the drawing of the columns. That was really inefficient, but it was basically placeholder code for drawing the background, with the idea of also adding skewing some day. Eventually I dropped the idea, as it would affect the speed too much.
|
06 December 2023, 20:03 | #89 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,217
|
After a bit of guru meditation, I figured out that the reason it was slower is that the cache should of course be configured as writeback, not writethrough, for fast mem. This gives a super tiny benefit in the starting pose of the original TKG with a random build I had lying around (and in another quick test).
I currently - subject to further tests - think the best way to achieve your no-MMU setup on 060 is something like (in pseudo-code): Code:
# Paged address translation is enabled, and DTT0/1 are not in use?
if TC.e and not DTT0.e and not DTT1.e:
    DTT0 = $403fc020 # Enable writeback, transparent translation for Z3 fast RAM
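To see which addresses a value like $403fc020 actually covers, here is a small Python sketch of the base/mask comparison. It assumes the usual TTR matching rule (the top address byte is compared with the base, ignoring any bit set in the mask) and deliberately ignores the S field and function codes; `ttr_matches` is a hypothetical helper, not something from the thread.

```python
def ttr_matches(ttr, address):
    """True if the top address byte matches the TTR base, ignoring masked bits."""
    base = (ttr >> 24) & 0xFF
    mask = (ttr >> 16) & 0xFF
    enabled = bool(ttr & 0x8000)
    top = (address >> 24) & 0xFF
    return enabled and ((top ^ base) & ~mask & 0xFF) == 0


DTT0 = 0x403FC020
for addr in (0x00DFF000, 0x3FFFFFFF, 0x40000000, 0x7FFFFFFF, 0x80000000):
    verdict = "match" if ttr_matches(DTT0, addr) else "no match"
    print(f"${addr:08x}: {verdict}")
```

With base $40 and mask $3f, only $40000000-$7fffffff matches, so chip RAM and the custom registers in the first 16 MB are left to whatever the existing MMU setup says.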
07 December 2023, 00:23 | #90 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Gosh, I had swapped ITT0 and DTT0 around!
New build, with that fixed. It also features another change, unrelated to the MMU but related to the screen refresh, that makes a certain piece of code more robust - although it should not have played any role in the glitches (unless there was expansion hardware causing NMI transitions - which wasn't the case according to the exception occurrences log). Sorry for not providing a description: I'd have explained if it weren't too long and I weren't supposed to be sleeping already (for at least a month straight...).
EDIT: the version originally uploaded with this post included a broken table; if you had already downloaded the archive, please re-download it - sorry.
@paraj
Moreover, I've used CM = 10 (precise/serialized model) for the first 16 MB: isn't that more advisable, after all? If this finally works, the next step will be trying copyback for the addresses >= $1000000.
Last edited by saimo; 08 December 2023 at 01:37. Reason: Removed attachment as I provided a newer version later.
07 December 2023, 18:31 | #91 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,217
|
Still glitchy, and crashed when running bb from RAM:
Precise mode is certainly safer, but I don't think it's needed (though I haven't tested it). Be aware that it means the store buffer can't be used for chip RAM. Maybe not a big deal in this case, since you're not C2Ping.
But really, I stand by my suggestion from yesterday: set up only DTT0, for fast RAM, and don't fiddle with anything else. You get all of the benefits you're looking for and none of the downsides.
For your next test build, maybe try to disable all of the advanced stuff, including the data cache thing (at least with an option). I have a small tool that can set up DTT0 as I suggest, and I know it works, so I will report whether that improves things like we expect.
07 December 2023, 20:01 | #92 | ||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
ITT0 = $00ffc000
ITT1 = $00ffc000
DTT0 = $0000c060
DTT1 = $00ffc000
(Which totally makes sense to me.)
Quote:
The rendering and buffering stuff has not changed since the last build that worked, which did include manipulating the caches. And, regarding them, there's nothing advanced - this is what happens (on 68040 and 68060) in the user mode code that renders the graphics:
1. trap #0;
2. render landscape;
3. trap #1.
The 68060 trap #0 handler, which disables the data cache, is as simple as this:
Code:
move.l #$20808000,d0 ;(.ESB,EBC,EIC) - EDC clear: data cache off
movec.l d0,cacr
rte
Code:
move.l #$a0808000,d0 ;(.EDC,ESB,EBC,EIC) - EDC set: data cache on
movec.l d0,cacr
rte
Attached is an archive that contains two builds:
* both set the MMU registers as indicated above (but only when NOMMU is used: the MMU registers are not changed otherwise);
* one has the traps and the other doesn't;
* both count the occurrences of the exceptions: the final printout should have only the counter of vector 27 different from zero in the no-traps build; the other should also have the counters of vectors 32 and 33 different from 0, and all the counters should be equal (EDIT: the last part is true only if rendering is done at 50 fps).
Last edited by saimo; 08 December 2023 at 12:40. Reason: Removed attachment as I provided a newer version later.
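As a cross-check of the two CACR constants used by the handlers, the following Python sketch lists which enable bits each value sets. The bit positions (EDC = 31, ESB = 29, EBC = 23, EIC = 15) are what I understand the 68060 CACR layout to be, so treat them as an assumption; `CACR_BITS` and `cacr_flags` are illustrative names only.

```python
# 68060 CACR enable bits referenced by the handler comments (positions assumed).
CACR_BITS = {"EDC": 31, "ESB": 29, "EBC": 23, "EIC": 15}


def cacr_flags(value):
    """Return the names of the enable bits set in a 68060 CACR value."""
    return [name for name, bit in CACR_BITS.items() if value & (1 << bit)]


print(cacr_flags(0x20808000))  # value loaded by the trap #0 handler
print(cacr_flags(0xA0808000))  # value loaded by the other handler
print(hex(0x20808000 ^ 0xA0808000))  # the only bit that differs
```

The two values differ only in bit 31, which matches the handler comments: everything stays enabled except the data cache, which is toggled.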
||
07 December 2023, 20:20 | #93 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,217
|
PVE: normal build - glitches (like before)
bb/fb crashes
PVE_no_traps: no glitches
bb: 19.131 (261 x 27 int, no others; same for the rest of the tests), fb: 17.340 (288 x 27)
Enable my DTT0 thing: bb: 25.400 (197 x 27 int), fb: 22.451 (223 x 27)
Not 100% sure, but you probably need to flush the caches before disabling the DC. Since writeback is likely enabled beforehand, bad things probably happen if that data isn't written back before the cache is disabled.
Regarding the *TT0/1 setup: looks about right, but again, I think what you really want is just to avoid ATC-miss penalties (which you should get from setting up only one DTT register). Everything else is just asking for trouble for no benefit. It's interesting to experiment with getting it set up correctly otherwise (I'm learning a lot), but it seems tangential to your goal.
08 December 2023, 01:35 | #94 | |||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
@paraj
Alright, chances are that your last report enlightened me and finally we have a working solution! Quote:
EDIT: just noticed that the counters of exceptions 27, 32 and 33 differ in the same test on 68040 and 68060 (but not on 68030), and I guess that was what you referred to; a difference is not normal when running benchmarks - I'll investigate that.
The code works fine; it was the "and all the counters should be equal" statement that was wrong - well, mostly: the counters are equal if execution runs at 50 fps, as the number of traps indicates the number of frames rendered.
Quote:
And so the solution is: set up the MMU properly so that caching is also handled correctly and turning the DMA on/off as needed (without flushing) has no negative effect. It's either both caches and MMU or none.
I think I know now why I was so confused (besides having no experience of coding specifically for the 68060, and sleeplessness, that is):
* the critical point was the false belief that, before I started fiddling with the MMU, PVE worked fine;
* that was an illusion: quite a few posts back (haven't checked), at some point, you reported for the first time that PVE didn't work anymore;
* that was when I had started fiddling with the caches;
* the next build, which had some initialization stuff changed, worked again, and so I thought that the caches were OK;
* I started fiddling with the MMU and got lost;
* at some other point, I realized that, due to another initialization issue (the CPU ID variable being used before it was set, due to the shuffling of some code), the generic 68020 code had started being used;
* that code was exactly what allowed PVE to work, as it does not contain the DMA toggling!
* however, I didn't realize that and kept on believing that the caches were just fine and that the problems (which had brutally appeared after fixing the CPU ID bug) were related to the MMU only;
* the turning point was your idea of making a test without touching the caches.
Quote:
So, I came to these conclusions:
* no more NOMMU switch: to achieve the best performance it is necessary to bypass the table searches and enable/disable the data cache (burst) on the fly, so the MMU setup always has to be customized (on all non-EC CPUs, that is); at most, if I get requests, I'll add a SAFE/S switch that disables both caches and MMU handling;
* ITT0 = $00ffc000;
* DTT0 = $0000c060;
* DTT1 = $00ffc000;
* DMA turned on/off as shown in the previous post (with $80008000/$00008000 for 68040; on 68030 things are different: the data cache is always on, but the burst is enabled only when copying blocks of memory).
The attached build implements the above. If it works, I'd also like to try DTT1 = $00ffc020 (copyback) in order to gain a little more speed (although I expect only a tiny improvement, as the writes to FAST RAM to update variables are very few and all outside of the rendering core loop).
Last edited by saimo; 08 December 2023 at 23:31. Reason: Removed attachment as I provided a newer version later.
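The DTT0/DTT1 pair listed above relies on the rule quoted earlier in the thread from section 4.4 ("If both registers match, the TT0 status bits are used for the access"). A minimal Python sketch of that priority follows; the field layout is assumed as before, the S field is ignored, and `matches` and `data_cache_mode` are hypothetical names.

```python
CACHE_MODES = (
    "writethrough",
    "copyback",
    "cache-inhibited precise",
    "cache-inhibited imprecise",
)


def matches(ttr, address):
    """Base/mask comparison on address bits 31-24; requires the E bit."""
    base, mask = (ttr >> 24) & 0xFF, (ttr >> 16) & 0xFF
    top = (address >> 24) & 0xFF
    return bool(ttr & 0x8000) and ((top ^ base) & ~mask & 0xFF) == 0


def data_cache_mode(address, dtt0, dtt1):
    """Cache mode picked for a data access; DTT0 wins when both registers match."""
    for ttr in (dtt0, dtt1):
        if matches(ttr, address):
            return CACHE_MODES[(ttr >> 5) & 0b11]
    return None  # no TTR hit: paged translation or the TC defaults apply


DTT0, DTT1 = 0x0000C060, 0x00FFC000
print(data_cache_mode(0x00DFF180, DTT0, DTT1))  # custom chip register area
print(data_cache_mode(0x40001000, DTT0, DTT1))  # an address above the first 16 MB
```

In this model the first 16 MB end up cache-inhibited even though DTT1 also matches them, and everything else falls through to DTT1's writethrough mode; swapping DTT1 for $00ffc020 would turn that into copyback.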
|||
08 December 2023, 12:31 | #95 |
Moderator
Join Date: Nov 2001
Location: Germany
Posts: 876
|
Why do you touch ITT* at all?
It should not be relevant IMHO. If you change something in the MMU setup you also need to flush/clear the caches (CPUSH/CINV) because the caches operate independently. |
08 December 2023, 13:01 | #96 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
To prevent the way the MMU is set up when the program is launched from affecting performance. By setting ITT0 to $00ffc000, table searches never happen.
Quote:
I think that the exact cause of the glitches/crashes was that the initialization code, together with (partially) incorrect MMU register values, did this at some point after taking the system over:
1. clear caches (after modifying the code in a few places);
2. set interrupt and trap vectors (with the vbr pointing to a table in FAST RAM);
3. change the MMU setup.
If (as is probable) the caches were operating in copyback mode, the vectors did not get written to RAM, so it was just luck that the program showed something at all. Now the code clears the caches together with the MMU setup.
By the way, I have received a report that PVE now works fine on an A4000/060 when running normally; the same report says that the benchmarks don't work, but I still have to look into that.
Last edited by saimo; 08 December 2023 at 13:19.
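The failure mode described above can be illustrated with a toy model. The Python sketch below is not how the 68060 cache works internally (no lines, sets or tags, and it says nothing about what real hardware does with dirty data when the cache is switched off); it only shows why a value written in copyback mode is invisible to any access that bypasses the cache until it has been pushed. `CopybackCache` is a made-up class.

```python
class CopybackCache:
    """Toy copyback data cache: writes stay in the cache until pushed."""

    def __init__(self, memory):
        self.memory = memory   # backing RAM, address -> value
        self.dirty = {}        # values written but not yet in RAM
        self.enabled = True

    def write(self, address, value):
        if self.enabled:
            self.dirty[address] = value   # copyback: RAM is not updated yet
        else:
            self.memory[address] = value

    def read(self, address):
        if self.enabled and address in self.dirty:
            return self.dirty[address]
        return self.memory.get(address, 0)  # stale if the dirty copy is bypassed

    def push(self):
        """Write all dirty values back to RAM (the job CPUSH does on real hardware)."""
        self.memory.update(self.dirty)
        self.dirty.clear()


ram = {}
cache = CopybackCache(ram)
cache.write(0x6C, 0x1234)   # arbitrary address standing in for a vector slot
cache.enabled = False       # data cache turned off without a push
stale = cache.read(0x6C)    # RAM never received the value
cache.enabled = True
cache.push()                # now the value reaches RAM
cache.enabled = False
fresh = cache.read(0x6C)
print(hex(stale), hex(fresh))
```

In the model, the read made while the cache is off sees the old RAM content, and only the read after the push sees the new value, which is the reason for pushing the caches before changing the cache or MMU setup.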
|
08 December 2023, 14:08 | #97 | |
Moderator
Join Date: Nov 2001
Location: Germany
Posts: 876
|
Quote:
I first assumed you left the MMU on and only additionally set the TT*s. It's a pity that there are only 64 ATC entries each for data and instructions. It would be interesting to see what would happen if the MMU tables were made cacheable. I assume mmu.library sets them to noncacheable (which also makes sense in my eyes). Table searches would then still occur, but most of them could be satisfied via the data cache.
|
08 December 2023, 15:25 | #98 | |||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
By the way: actually I do leave the MMU on (that is, if it's enabled to begin with, as I do not modify TC), but the way the TT registers are set up effectively disables table-based translation. Quote:
Quote:
Anyway, in the context of hardware-hitting software like this, table-based translation is hardly needed... oh, wait: maybe you're thinking of WHDLoad?
|||
08 December 2023, 17:40 | #99 | |||
Moderator
Join Date: Nov 2001
Location: Germany
Posts: 876
|
Quote:
As I understand it, you only have a problem with data. Then there is no need to change anything on the instruction side. The ATCs are also separate for data and instructions.
Quote:
Quote:
I think a good idea would be to allocate a 16M-aligned block of memory, make it transparently translated and put all the data that should not be cached in this segment. People will probably need 48M of RAM for this, in order to have a free 16M-aligned block.
Another idea would be to make the TT apply only in supervisor mode and run your code in user or supervisor mode depending on the desired cache use (or vice versa). I don't know if this is feasible. Probably too complicated.
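As a rough illustration of the aligned-block idea, this Python sketch finds the first 16 MB-aligned block inside a memory region and builds a TTR value that matches exactly that block. It assumes the region is entirely free and contiguous (real memory lists are fragmented), uses the TTR layout assumed elsewhere in this thread, and the board address in the example is hypothetical; `aligned_block` and `ttr_for_block` are made-up helpers.

```python
BLOCK = 16 * 1024 * 1024  # 16 MB: the span selected by address bits 31-24


def aligned_block(region_start, region_size, block=BLOCK):
    """Start of the first block-aligned block that fits in the region, or None."""
    start = (region_start + block - 1) // block * block  # round up to alignment
    if start + block <= region_start + region_size:
        return start
    return None


def ttr_for_block(start, cache_mode=0b10):
    """TTR matching one 16 MB block: mask 0, E=1, S=%10, given CM (default: precise)."""
    return ((start >> 24) << 24) | 0x0000C000 | (cache_mode << 5)


# Hypothetical 32 MB memory board that does not start on a 16 MB boundary.
start = aligned_block(0x40800000, 32 * 1024 * 1024)
print(hex(start), hex(ttr_for_block(start)))
```

A region that starts off-boundary needs almost twice the block size to be sure of containing one aligned block, which goes some way towards explaining why a generous amount of RAM is suggested above.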
|||
08 December 2023, 23:15 | #100 | ||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
Quote:
Is it some general idea, or is it intended for this little project? In the latter case, nothing that expensive and complicated is needed: just transparent translation for the whole address space, with special care for the first 16 MB and caching. Last edited by saimo; 08 December 2023 at 23:32. |
||