03 December 2023, 19:00 | #61 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,213
|
19.23 / 17.48 so yeah tiny improvement
SS (and all other relevant stuff) is already enabled, so you shouldn't need to fiddle with that. Do you happen to use any of the unimplemented instructions (e.g. 64-bit mul/div)? |
03 December 2023, 20:47 | #62 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Been outside for a while, but the brain kept on returning to this all the time. At some point, I wondered whether maybe there's an exception firing all the time, disrupting the execution. I'll cook up a test build that counts the occurrences of all the exceptions.
This makes sense. In the meantime, I received the results from tests on 2 other machines - full recap in the table below. Code:
AMIGA |                               | FPS     | FPS    | PED81C COST    |
MODEL | ACCELERATOR BOARD             | (BLIND) | (FULL) | (FPS / FRAMES) | NOTE
------+-------------------------------+---------+--------+----------------+-----
1200  | PiStorm32 + Raspberry Pi 3 A+ | 50.00   | 50.00  | 0.00 / ?.???   |
CD³²  | The Beast 030                 |         | 30.48  |                | 1
1200  | Blizzard 1230 IV              | 23.14   | 20.92  | 2.22 / 0.229   | 2
1200  | Blizzard 1260                 | 19.23   | 17.48  | 1.75 / 0.260   |
1200  | TerribleFire TF1260           | 13.96   | 13.15  | 0.81 / 0.221   |
4000  | Cyberstorm MK III             | 19.53   | 18.05  | 1.48 / 0.210   |

1. 68030 70 MHz, SRAM (1 cycle wrap-around burst)
2. 68030 50 MHz, RAM 60 ns
Quote:
|
|
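For readers who want to check the PED81C COST column of the table in post #62, here is a minimal Python sketch. It assumes a 50 Hz PAL display (an assumption, although it reproduces the posted figures) and that the FRAMES figure is the extra number of display frames spent per rendered frame. The function name and the constant are made up for illustration.

```python
PAL_HZ = 50.0  # assumption: 50 Hz PAL display refresh


def ped81c_cost(fps_blind, fps_full, display_hz=PAL_HZ):
    """Return (fps lost, extra display frames per rendered frame)."""
    fps_cost = fps_blind - fps_full
    # Display frames needed per rendered frame, with and without the display on.
    frames_cost = display_hz / fps_full - display_hz / fps_blind
    return round(fps_cost, 2), round(frames_cost, 3)


# (blind fps, full fps) pairs copied from the table.
results = {
    "Blizzard 1230 IV": (23.14, 20.92),
    "Blizzard 1260": (19.23, 17.48),
    "TerribleFire TF1260": (13.96, 13.15),
    "Cyberstorm MK III": (19.53, 18.05),
}

for board, (blind, full) in results.items():
    print(board, ped81c_cost(blind, full))
```

Seen this way, the per-frame cost is roughly the same (about 0.21 to 0.26 display frames) on every board, even though the fps difference varies a lot.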
03 December 2023, 21:05 | #63 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,213
|
Roughly how many fastmem accesses are you doing per frame? "The Beast" numbers seem to indicate you might be limited by those. I don't recall the worst-case cycle count for that, and don't know how you're doing stuff, but maybe try limiting the drawing distance or something?
|
03 December 2023, 21:16 | #64 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
Anyway, of course limiting the distance would increase the speed, but that won't explain why the 68030 performs better than the 68060 |
|
04 December 2023, 01:45 | #65 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
@paraj
Still quick, but better answer. Rendering is done by columns, from bottom to top and then left to right. The code applies a depth of 256 steps per column, so it evaluates 256*128 = 32768 dots per frame (and then renders only those which are actually visible). For each of those dots, the renderer does this:
1. calculate the dot position in the map;
2. read the dot height from the map (memory read #1);
3. calculate the screen Y the dot would project to, taking the height of the camera into account (this requires a lookup table read - memory read #2 (1));
4. if the dot happens to be hidden behind the dots rendered previously, pass to the next dot (2);
5. read the dot color from the map (memory read #3);
6. plot the dot upwards (3), starting from the screen Y of the previous topmost dot to the screen Y just calculated (N memory writes (4)).
(1) This read cannot be efficiently optimized with real-time calculations, also on 68060.
(2) This is not rare at all and avoids further memory accesses.
(3) Even if the dots were plotted sequentially, using (an)+, the gain would be very little: on my machine that saves 2 cycles per write, but eventually the gain is of just 0.4 fps... which would be lost due to a more inefficient rendering of the background (it would be rotated by 90° and thus need more lines, i.e. more loops overhead) and, above all, due to the need to reorder the data while copying it to CHIP RAM, thus losing the benefits of the burst copy (even if most, if not all, of the rearranging could be done in parallel with the writes).
(4) On average very few.
The code boils down to a bunch of simple operations (moves, adds, subs, ands, etc.).
For it to be the cause of the poor performance on 68060 (even if only 1 pipeline were used!), given the significantly smaller timings of the 68060 instructions, the FAST RAM access of those 68060 boards would have to be really terrible compared to my Blizzard's - which is unlikely, as the comparable FAST RAM -> CHIP RAM copy speeds also indicate.
I'm not sharing the code yet because I want to first give it another thought (if micro-optimizations are possible, I want to find them myself) and because I'm under the impression that there's a bigger problem somewhere else (not saying that's the case for sure, though).
Could you try the attached build? This one counts the occurrences of all the exceptions during the execution. The output will be like this: Code:
total number of frames rendered: 100
total number of frames shown: 100
frames rendered per second average: 50.00
frames per render average: 1.00
ex#: count
0: 0 1: 0 2: 0 3: 0 4: 0 5: 0 6: 0 7: 0
8: 0 9: 0 10: 0 11: 0 12: 0 13: 0 14: 0 15: 0
16: 0 17: 0 18: 0 19: 0 20: 0 21: 0 22: 0 23: 0
24: 0 25: 0 26: 0 27: 99 (5) 28: 0 29: 0 30: 0 31: 0
32: 0 33: 0 34: 0 35: 0 36: 0 37: 0 38: 0 39: 0
40: 0 41: 0 42: 0 43: 0 44: 0 45: 0 46: 0 47: 0
48: 0 49: 0 50: 0 51: 0 52: 0 53: 0 54: 0 55: 0
56: 0 57: 0 58: 0 59: 0 60: 0 61: 0 62: 0 63: 0
64: 0 65: 0 66: 0 67: 0 68: 0 69: 0 70: 0 71: 0
72: 0 73: 0 74: 0 75: 0 76: 0 77: 0 78: 0 79: 0
80: 0 81: 0 82: 0 83: 0 84: 0 85: 0 86: 0 87: 0
88: 0 89: 0 90: 0 91: 0 92: 0 93: 0 94: 0 95: 0
96: 0 97: 0 98: 0 99: 0 100: 0 101: 0 102: 0 103: 0
104: 0 105: 0 106: 0 107: 0 108: 0 109: 0 110: 0 111: 0
112: 0 113: 0 114: 0 115: 0 116: 0 117: 0 118: 0 119: 0
120: 0 121: 0 122: 0 123: 0 124: 0 125: 0 126: 0 127: 0
128: 0 129: 0 130: 0 131: 0 132: 0 133: 0 134: 0 135: 0
136: 0 137: 0 138: 0 139: 0 140: 0 141: 0 142: 0 143: 0
144: 0 145: 0 146: 0 147: 0 148: 0 149: 0 150: 0 151: 0
152: 0 153: 0 154: 0 155: 0 156: 0 157: 0 158: 0 159: 0
160: 0 161: 0 162: 0 163: 0 164: 0 165: 0 166: 0 167: 0
168: 0 169: 0 170: 0 171: 0 172: 0 173: 0 174: 0 175: 0
176: 0 177: 0 178: 0 179: 0 180: 0 181: 0 182: 0 183: 0
184: 0 185: 0 186: 0 187: 0 188: 0 189: 0 190: 0 191: 0
192: 0 193: 0 194: 0 195: 0 196: 0 197: 0 198: 0 199: 0
200: 0 201: 0 202: 0 203: 0 204: 0 205: 0 206: 0 207: 0
208: 0 209: 0 210: 0 211: 0 212: 0 213: 0 214: 0 215: 0
216: 0 217: 0 218: 0 219: 0 220: 0 221: 0 222: 0 223: 0
224: 0 225: 0 226: 0 227: 0 228: 0 229: 0 230: 0 231: 0
232: 0 233: 0 234: 0 235: 0 236: 0 237: 0 238: 0 239: 0
240: 0 241: 0 242: 0 243: 0 244: 0 245: 0 246: 0 247: 0
248: 0 249: 0 250: 0 251: 0 252: 0 253: 0 254: 0 255: 0
(5) Level 3 interrupt; COPER is used to synchronize with the bottom of the screen.
Last edited by saimo; 04 December 2023 at 21:20.
Reason: Removed attachment as I provided a newer version later. |
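The per-dot steps listed in post #65 can be sketched roughly as follows. This is only an illustrative Python model, not saimo's code (which has not been shared): the map layout, the shape of the projection table, the `scale` factor and all function names are assumptions made up for the example.

```python
def build_projection(depth, max_height, camera_h, screen_h, scale=1.0):
    """Hypothetical lookup table: proj[d][h] = screen row for height h at depth step d."""
    horizon = screen_h // 2
    table = []
    for d in range(depth):
        row = []
        for h in range(max_height + 1):
            sy = int(horizon + (h - camera_h) * scale / (d + 1))
            row.append(max(0, min(screen_h, sy)))  # clamp to the screen
        table.append(row)
    return table


def render_column(height_map, color_map, proj, start, step, depth, screen_h, map_size):
    """Render one screen column bottom to top; None marks untouched background."""
    column = [None] * screen_h  # index 0 is the bottom row
    top = 0                     # first row not covered yet
    x, y = start
    dx, dy = step
    for d in range(depth):
        # 1. position of the dot in the (wrapping) map
        mx, my = int(x) % map_size, int(y) % map_size
        # 2. memory read #1: dot height
        h = height_map[my][mx]
        # 3. memory read #2: projected screen Y from the lookup table
        sy = proj[d][h]
        # 4. hidden dots are skipped before any further memory access
        if sy > top:
            # 5. memory read #3: dot color
            c = color_map[my][mx]
            # 6. plot upwards from the previous topmost row
            for row in range(top, sy):
                column[row] = c
            top = sy
        x += dx
        y += dy
    return column
```

Note how a hidden dot costs two reads (height and table) and no writes, which matches footnote (2): the early rejection in step 4 is what keeps the number of writes per column low.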
04 December 2023, 07:23 | #66 | |
Registered User
Join Date: Jul 2023
Location: Domsjö/Sweden
Posts: 56
|
Quote:
I've made some changes to the SRAM controller. Read speed is still the same as in the bustest thread. https://eab.abime.net/showpost.php?p...4&postcount=40
BB:
tot frames shown = 144
fps avg = 34.72
frames per renderer = 1.44
FB:
tot frames shown = 164
fps avg = 30.48
frames per renderer = 1.64
New BUSTEST FAST write performance:
writew 43.2 ns 46.3 MB/S
writel 43.1 ns 92.9 MB/S
writem 39.5 ns 101.3 MB/S
|
04 December 2023, 11:12 | #67 |
Thalion Webshrine
Join Date: Jan 2004
Location: Oxford
Posts: 14,465
|
|
04 December 2023, 13:48 | #68 | |
ex. demoscener "Bigmama"
Join Date: Jun 2012
Location: Fyn / Denmark
Posts: 1,642
|
Quote:
|
|
04 December 2023, 16:36 | #69 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,213
|
Only expected interrupts. However, I've figured out why it's slow on my machine: ATC misses! If I disable page address translation and rely just on the TTRs I get 34.72 / 28.57 fps respectively.
I timed a small test program that reads a byte from a (pseudo-)random offset into an array and varied the size: up to 256KB there are no differences, and the no-MMU case stays flat at around 375 ns per loop iteration, while with the MMU it grows to 867 ns for a 32MB array.
EDIT: 256KB of course lines up perfectly with a 64-entry ATC and a page size of 4K. I also forgot something actionable: being more cache friendly is likely a big rework, but I think grouping (height,color) rather than having separate arrays would likely be an easy win. Obviously you can't just switch off the MMU w/o consequences, so don't change something like that in your own code.
Last edited by paraj; 04 December 2023 at 19:40.
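The arithmetic behind the 256KB threshold, and behind the (height,color) grouping suggestion, can be sketched in a few lines of Python. This is a deliberately simplified model: it treats the ATC as 64 freely usable entries and ignores associativity and replacement policy, and the one-byte-per-dot map layout is an assumption, since the actual data format has not been posted.

```python
PAGE_SIZE = 4096    # 4K pages, as stated in the post
ATC_ENTRIES = 64    # data ATC entries, as stated in the post


def atc_reach(entries=ATC_ENTRIES, page_size=PAGE_SIZE):
    """Bytes that can be mapped at once without any ATC miss."""
    return entries * page_size


def expected_hit_rate(array_bytes, entries=ATC_ENTRIES, page_size=PAGE_SIZE):
    """Rough upper bound on the hit rate for uniformly random byte reads."""
    pages = -(-array_bytes // page_size)  # ceiling division
    return min(1.0, entries / pages)


def pages_for_dot(i, map_bytes, interleaved, page_size=PAGE_SIZE):
    """Pages touched when fetching height and color of dot i (1 byte each, assumed)."""
    if interleaved:
        addrs = (2 * i, 2 * i + 1)   # height and color stored side by side
    else:
        addrs = (i, map_bytes + i)   # two separate arrays, one after the other
    return len({a // page_size for a in addrs})


print(atc_reach())                          # bytes covered by the ATC
print(expected_hit_rate(32 * 1024 * 1024))  # 32MB array
print(pages_for_dot(5000, 1 << 20, False), pages_for_dot(5000, 1 << 20, True))
```

Under these assumptions, interleaving halves the number of pages a visible dot touches, which is why it looks like a cheap first step before any larger rework.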
04 December 2023, 21:22 | #70 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
Your figures make sense: they're just slightly better than those that can be derived proportionally from my 50 MHz 68030 (e.g. 21 fps * 70 / 50 = 29.4). I'm about to post a new version that has a more precise timing: it would be great if you could post the results output by that one, too. |
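The proportional estimate in post #70 is plain clock scaling. As a tiny Python sketch (the function name is made up, and the model assumes performance scales linearly with CPU clock, which ignores memory speed):

```python
def scale_fps(fps, from_mhz, to_mhz):
    """Naive estimate: fps scales linearly with the CPU clock."""
    return fps * to_mhz / from_mhz


estimate = scale_fps(21, 50, 70)  # 68030 at 50 MHz scaled to 70 MHz
measured = 30.48                  # The Beast 030, full display, from post #66
print(round(estimate, 1), round(measured / estimate, 3))
```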
|
04 December 2023, 21:25 | #71 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Yes, that would definitely have a negative effect, but you gave me an idea: to add a switch that allows turning superscalar dispatch on and off at will, so that one can measure the benefits of the 68060's parallelism. It's already done. I'm posting the new version after replying to the other posts here.
|
04 December 2023, 21:30 | #72 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
That said, I might give the sequential rendering a shot |
|
04 December 2023, 22:05 | #73 | |||||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
And fewer exceptions than those that actually happened: I had stupidly set the vectors that handle trap #0/1 (which I use to fiddle with the caches in real time) after setting the routine that counts the exceptions. Once that was fixed, the numbers for vectors 32 and 33 were sane as well.
Quote:
Last night I received a report (from the always super-supportive and enthusiastic klx300r) that confirmed that no unexpected exceptions occurred, which convinced me that there must have been something else... and then it dawned on me, and I came to exactly the same conclusion! In fact, this afternoon - unaware of your warning below - I also implemented a NOMMU switch that disables ATC-based translation. Quote:
Quote:
Quote:
Quote:
Here's the context:
1. the program takes over the system entirely and stores its state carefully;
2. if the NOMMU switch is specified, it stores the current value of the tcr and disables the translation;
3. it never uses the OS anymore;
4. it accesses exclusively the chipset and the CHIP and FAST areas previously allocated;
5. upon exit, it restores the tcr, restores the system state and cleans everything up.
In the past (circa 2002-2004) I did fiddle directly with the MMU for an experimental and rather complex piece of software, which happily ran not only on my A1200/030, but also on A1200/060, A4000/040 and A4000/060. But your warning makes me wonder: have things changed in the meantime? Can logical and physical RAM addresses differ?
Last edited by saimo; 04 December 2023 at 22:11.
|||||
04 December 2023, 22:17 | #74 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
New test build.
Changes:
1. more precise timing: now the program uses the color clocks to measure the elapsed time, instead of the number of displayed frames;
2. added the NOSUPERSCALARDISPATCH=NSD command line switch, which turns off the 68060 superscalar dispatch;
3. added the NOMMU=NM command line switch, which turns off the MMU (this helps improve the speed) - WARNING: USE AT YOUR OWN RISK.
Last edited by saimo; 05 December 2023 at 22:27. Reason: Removed attachment as I provided a newer version afterwards.
05 December 2023, 00:57 | #75 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,469
|
@paraj
That graph is for general pseudorandom byte sized accesses in an array, right? |
05 December 2023, 18:10 | #76 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,213
|
Quote:
Code:
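; presumably: d0 = number of iterations, d1 = array size (a power of two)
_byteread::
        movem.l d2/d3,-(sp)
        move.l  d1,d3
        subq.l  #1,d3           ; d3 = size-1, used as offset mask
        lea     buffer,a0
        move.l  #3141592,d1     ; pseudo-random seed
.loop:
        move.l  d3,d2
        and.l   d1,d2           ; d2 = offset inside the array
        rol.l   d1,d1           ; scramble: rotate d1 by its own low bits
        addq.l  #7,d1
        move.b  (a0,d2.l),d2    ; the timed byte read
        subq.l  #1,d0
        bne.b   .loop
        movem.l (sp)+,d2/d3
        rts
@saimo: If the OS is completely off, and there is basically no MMU usage before, then yes, it will probably work. But it's not super uncommon to have things that e.g. move the first 4k to fast mem using the MMU, so you've got to be careful about something like that.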
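To get a feel for how the test routine in post #76 spreads its reads, here is a Python model of its offset generator. It is a sketch under assumptions: that d1 holds a power-of-two array size on entry, and that `rol.l d1,d1` uses the register count modulo 64 as documented for 68k register shifts. The helper names are invented for the example.

```python
MASK32 = 0xFFFFFFFF


def rol32(value, count):
    """32-bit rotate left; the register count is taken modulo 64 as on the 68k."""
    count = (count & 63) % 32
    if count == 0:
        return value & MASK32
    return ((value << count) | (value >> (32 - count))) & MASK32


def offsets(array_size, iterations, seed=3141592):
    """Yield the byte offsets read by the loop (array_size must be a power of two)."""
    mask = array_size - 1
    d1 = seed
    for _ in range(iterations):
        yield d1 & mask              # and.l d1,d2
        d1 = rol32(d1, d1)           # rol.l d1,d1
        d1 = (d1 + 7) & MASK32       # addq.l #7,d1


def distinct_pages(array_size, iterations, page_size=4096):
    """Number of different 4K pages touched by the reads."""
    return len({off // page_size for off in offsets(array_size, iterations)})


for size_kb in (64, 256, 1024, 32 * 1024):
    print(size_kb, "KB:", distinct_pages(size_kb * 1024, 10000), "pages touched")
```

Comparing the page counts against the 64 pages an ATC of that size can hold gives a rough idea of when translation misses start to dominate.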
|
05 December 2023, 22:40 | #77 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
Could you try the attached version and see if the NOMMU switch does the magic also on your machine, please?
Changes since the previous version:
* fixed a nasty bug that caused a longword to be written to a random location when the fps indicator was on (two bsrs were used before restoring the stack pointer!);
* fixed the shell template ("/FB" -> "=FB/S");
* fixed the cleanup code in one place (it used exec.Supervisor() before the tcr and the OS were restored - thanks to the fact that the vbr points to a copy of the exception vectors table and that the code only modifies the level 3 interrupt and the trap #0 and #1 vectors, the flaw didn't have any practical consequences, but still...);
* made sure that the CPU tcr is not touched if the NOMMU switch is not specified (so that there are no problems on 68EC060s either).
Last edited by saimo; 06 December 2023 at 18:10. Reason: Removed attachment as I provided a newer version later.
|
06 December 2023, 11:43 | #78 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,213
|
bb: 26.378, fb: 25.412, nommu bb: 25.062, nommu fb: 25.031
Glitches quite a bit now: https://i.imgur.com/vUmUUVG.mp4 (no command line arguments, but nommu looks the same) |
06 December 2023, 12:01 | #79 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
It looks like I screwed something up, but apart from adding the NOMMU switch and the bugfixes (which are unrelated to rendering), the rest remained unchanged... puzzling. Given that the figures are higher than those you had posted previously and that the NOMMU makes no difference, I take it that the MMU is disabled already before launching the executable or that you changed the MMU settings somehow? (Thanks for the useful video! Edit: it clearly shows that the background overwrites the rendered graphics, which means that buffering got broken... the problem must be in the COPER interrupt and the traps - but that's quite a mystery, since I didn't touch those.) |
|
06 December 2023, 12:50 | #80 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,213
|
MMU is enabled. Same result running no s-s (but + setpatch).
Looks like you're not setting the transparent translation registers? MMUlib clears them (sets them to $FFFF6040). You should set them to something like:
ITT0/1 = $00ffc000
DTT0 = $0000c040
DTT1 = $00ffc000
before disabling paging, and restore them before re-enabling it.
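To make the register values in post #80 easier to read, here is a small Python decoder. It is a sketch based on the 68040/68060 transparent translation register layout as commonly documented (address base in bits 31-24, address mask in bits 23-16, enable in bit 15, S field in bits 14-13, cache mode in bits 6-5, write protect in bit 2); please verify against the processor user's manual before relying on it. The function names are made up.

```python
CACHE_MODES = {
    0: "cachable, writethrough",
    1: "cachable, copyback",
    2: "cache-inhibited, precise (serialized on 68040)",
    3: "cache-inhibited, imprecise (nonserialized on 68040)",
}


def decode_ttr(value):
    """Split a 68040/68060 TTR value into its fields."""
    cm = (value >> 5) & 3
    return {
        "base": (value >> 24) & 0xFF,    # compared with address bits 31-24
        "mask": (value >> 16) & 0xFF,    # 1 bits are "don't care"
        "enabled": bool(value & 0x8000),
        "s_field": (value >> 13) & 3,    # 2 or 3: match user and supervisor
        "cm": cm,
        "cache_mode": CACHE_MODES[cm],
        "write_protect": bool(value & 0x4),
    }


def ttr_matches(value, address):
    """True if an enabled TTR would transparently translate this address."""
    t = decode_ttr(value)
    if not t["enabled"]:
        return False
    top = (address >> 24) & 0xFF
    return ((top ^ t["base"]) & ~t["mask"] & 0xFF) == 0


for name, v in (("cleared", 0xFFFF6040), ("first 16MB", 0x0000C040), ("everything", 0x00FFC000)):
    print(name, decode_ttr(v))
```

Read this way, $0000c040 covers only the first 16MB (CHIP RAM and the custom chips) with caching inhibited, $00ffc000 covers the whole address space as cachable writethrough, and $FFFF6040 has the enable bit clear, so it matches nothing.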