PED81C - pseudo-native, no C2P chunky screens for AGA - Page 4

paraj · 03 December 2023, 19:00

19.23 / 17.48 so yeah tiny improvement

SS (and all other relevant stuff) is already enabled, so you shouldn't need to fiddle with that. Do you happen to use any of the unimplemented instructions (e.g. 64-bit mul/div)?

saimo · 03 December 2023, 20:47

Been outside for a while, but the brain kept on returning to this all the time. At some point, I wondered whether maybe there's an exception firing all the time, disrupting the execution. I'll cook up a test build that counts the occurrences of all the exceptions.

Quote:

Originally Posted by paraj

19.23 / 17.48 so yeah tiny improvement

This makes sense.

In the meantime, I received the results of tests on 2 other machines; the full recap is in the table below.

Code:

AMIGA |                               |     FPS |    FPS | PED81C COST    |
MODEL | ACCELERATOR BOARD             | (BLIND) | (FULL) | (FPS / FRAMES) | NOTE
------+-------------------------------+---------+--------+----------------+-----
 1200 | PiStorm32 + Raspberry Pi 3 A+ |   50.00 |  50.00 | 0.00 / ?.???   |
 CD³² | The Beast 030                 |         |  30.48 |                | 1
 1200 | Blizzard 1230 IV              |   23.14 |  20.92 | 2.22 / 0.229   | 2
 1200 | Blizzard 1260                 |   19.23 |  17.48 | 1.75 / 0.260   |
 1200 | TerribleFire TF1260           |   13.96 |  13.15 | 0.81 / 0.221   |
 4000 | Cyberstorm MK III             |   19.53 |  18.05 | 1.48 / 0.210   |

1. 68030 70 MHz, SRAM (1 cycle wrap-around burst)
2. 68030 50 MHz, RAM 60 ns

The time taken to write the data to CHIP RAM is very similar between all the machines (that is, other than the PiStorm-equipped one, which is so fast that it neutralizes the impact of the copy to CHIP RAM).
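The PED81C COST column appears to follow from simple arithmetic on the two FPS columns. This is reverse-engineered from the figures in the table, not something stated in the post: the FPS cost is the blind rate minus the full rate, and the frame cost is the extra time per rendered frame expressed in display frames. A minimal C sketch, where the function names and the 50 Hz constant are assumptions:

```c
#include <assert.h>
#include <stdio.h>

#define DISPLAY_HZ 50.0 /* assumption: PAL display, 50 frames per second */

/* Frames per second lost when the copy to CHIP RAM is enabled. */
double cost_fps(double blind_fps, double full_fps)
{
    assert(blind_fps > 0.0 && full_fps > 0.0);
    return blind_fps - full_fps;
}

/* Extra time spent per rendered frame, measured in display frames. */
double cost_frames(double blind_fps, double full_fps)
{
    assert(blind_fps > 0.0 && full_fps > 0.0);
    return DISPLAY_HZ / full_fps - DISPLAY_HZ / blind_fps;
}

/* Print one "FPS / FRAMES" entry in the style of the table. */
void print_cost(const char *board, double blind_fps, double full_fps)
{
    printf("%-20s %.2f / %.3f\n", board,
           cost_fps(blind_fps, full_fps), cost_frames(blind_fps, full_fps));
}
```

Calling `print_cost("Blizzard 1230 IV", 23.14, 20.92)` reproduces the 2.22 / 0.229 entry, and the other rows with both FPS figures check out the same way.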

Quote:

SS (and all other relevant stuff) is already enabled, so you shouldn't need to fiddle with that. Do you happen to use any of the unimplemented instructions (e.g. 64-bit mul/div)?

Nope, no unimplemented instructions are used.

paraj · 03 December 2023, 21:05

Roughly how many fastmem accesses are you doing per frame? "The beast" numbers seem to indicate you might be limited by those. I don't recall the worst-case #cycles for that, and don't know how you're doing stuff, but maybe try limiting the drawing distance or something?

saimo · 03 December 2023, 21:16

Quote:

Originally Posted by paraj

Roughly how many fastmem accesses are you doing per frame? "The beast" numbers seem to indicate you might be limited by those. I don't recall the worst-case #cycles for that, and don't know how you're doing stuff, but maybe try limiting the drawing distance or something?

Can't give a proper answer right now, but I'll get back to you.
Anyway, of course limiting the distance would increase the speed, but that wouldn't explain why the 68030 performs better than the 68060.

saimo · 04 December 2023, 01:45

@paraj

Still quick, but better answer.

Rendering is done by columns, from bottom to top and then left to right.
The code applies a depth of 256 steps per column, so it evaluates 256*128 = 32768 dots per frame (and then renders only those which are actually visible).
For each of those dots, the renderer does this:
1. calculate the dot position in the map;
2. read the dot height from the map (memory read #1);
3. calculate the screen Y the dot would project to, taking the height of the camera into account (this requires a lookup table read - memory read #2 (1));
4. if the dot happens to be hidden behind the dots rendered previously, pass to the next dot (2);
5. read the dot color from the map (memory read #3);
6. plot the dot upwards (3), starting from the screen Y of the previous topmost dot to the screen Y just calculated (N memory writes (4)).

(1) This read cannot be efficiently replaced with real-time calculations, not even on the 68060.
(2) This is not rare at all and avoids further memory accesses.
(3) Even if the dots were plotted sequentially, using (an)+, the gain would be very little: on my machine that saves 2 cycles per write, but eventually the gain is of just 0.4 fps... which would be lost due to a more inefficient rendering of the background (it would be rotated by 90° and thus need more lines, i.e. more loops overhead) and, above all, due to the need to reorder the data while copying it to CHIP RAM, thus losing the benefits of the burst copy (even if most, if not all, of the rearranging could be done in parallel with the writes).
(4) On average very few.
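The six steps above can be sketched in C as a plain column renderer. This is only an illustration under stated assumptions, not saimo's code (which is 68k assembly and has not been shared): the map size, screen height, straight non-rotated rays, table layout and all names are hypothetical. Only the 256 depth steps, the 128 columns and the order of the memory accesses come from the post.

```c
#include <assert.h>
#include <stdint.h>

#define MAP_SIZE 256   /* hypothetical map side (power of two) */
#define SCREEN_W 128   /* columns: 256 * 128 = 32768 dots per frame */
#define SCREEN_H 64    /* hypothetical screen height */
#define DEPTH    256   /* depth steps per column */

typedef struct {
    uint8_t height[MAP_SIZE * MAP_SIZE];
    uint8_t color[MAP_SIZE * MAP_SIZE];
} Map;

/* proj[z * 256 + h] is the screen row of a dot of height h at depth step z,
   precomputed for the current camera height (hypothetical table layout).
   screen is SCREEN_H rows of SCREEN_W bytes, row 0 at the top.
   Returns the number of pixel writes performed. */
unsigned long render_frame(const Map *map, const int16_t *proj,
                           int cam_x, int cam_y, uint8_t *screen)
{
    unsigned long writes = 0;

    assert(map && proj && screen);
    for (int col = 0; col < SCREEN_W; col++) {           /* left to right */
        int top = SCREEN_H;                              /* nothing drawn yet */
        for (int z = 0; z < DEPTH; z++) {                /* near to far */
            /* 1. dot position in the map (straight rays, wrap-around map) */
            unsigned mx = (unsigned)(cam_x + col) & (MAP_SIZE - 1);
            unsigned my = (unsigned)(cam_y + z) & (MAP_SIZE - 1);
            unsigned idx = my * MAP_SIZE + mx;

            uint8_t h = map->height[idx];                /* 2. memory read #1 */
            int y = proj[z * 256 + h];                   /* 3. memory read #2 */
            if (y < 0)
                y = 0;                                   /* clip to the screen top */
            if (y >= top)
                continue;                                /* 4. hidden: next dot */

            uint8_t c = map->color[idx];                 /* 5. memory read #3 */
            for (int row = top - 1; row >= y; row--) {   /* 6. plot upwards */
                screen[row * SCREEN_W + col] = c;
                writes++;
            }
            top = y;
        }
    }
    return writes;
}
```

Note how the height and table reads happen for every one of the 32768 dots, while the color read and the writes only happen for dots that pass the visibility test in step 4.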

The code boils down to a bunch of simple operations (moves, adds, subs, ands, etc.). For it to be the cause of the poor performance on 68060 (even if only 1 pipeline were used!), given the significantly smaller timings of the 68060 instructions, the FAST RAM access of those 68060 boards must be really terrible with respect to my Blizzard's - which is unlikely, as also the comparable FAST RAM -> CHIP RAM copy speeds indicate.
I'm not sharing the code yet because I want to first give it another thought (if micro-optimizations are possible, I want to find them myself) and because I'm under the impression that there's a bigger problem somewhere else (not saying that's the case for sure, though).

Could you try the attached build? This one counts the occurrences of all the exceptions during the execution. The output will be like this:

Code:

total number of frames rendered:    100
total number of frames shown:       100
frames rendered per second average: 50.00
frames per render average:          1.00
ex#: count
  0: 0
  1: 0
  2: 0
  3: 0
  4: 0
  5: 0
  6: 0
  7: 0
  8: 0
  9: 0
 10: 0
 11: 0
 12: 0
 13: 0
 14: 0
 15: 0
 16: 0
 17: 0
 18: 0
 19: 0
 20: 0
 21: 0
 22: 0
 23: 0
 24: 0
 25: 0
 26: 0
 27: 99
 28: 0
 29: 0
 30: 0
 31: 0
 32: 0
 33: 0
 34: 0
 35: 0
 36: 0
 37: 0
 38: 0
 39: 0
 40: 0
 41: 0
 42: 0
 43: 0
 44: 0
 45: 0
 46: 0
 47: 0
 48: 0
 49: 0
 50: 0
 51: 0
 52: 0
 53: 0
 54: 0
 55: 0
 56: 0
 57: 0
 58: 0
 59: 0
 60: 0
 61: 0
 62: 0
 63: 0
 64: 0
 65: 0
 66: 0
 67: 0
 68: 0
 69: 0
 70: 0
 71: 0
 72: 0
 73: 0
 74: 0
 75: 0
 76: 0
 77: 0
 78: 0
 79: 0
 80: 0
 81: 0
 82: 0
 83: 0
 84: 0
 85: 0
 86: 0
 87: 0
 88: 0
 89: 0
 90: 0
 91: 0
 92: 0
 93: 0
 94: 0
 95: 0
 96: 0
 97: 0
 98: 0
 99: 0
100: 0
101: 0
102: 0
103: 0
104: 0
105: 0
106: 0
107: 0
108: 0
109: 0
110: 0
111: 0
112: 0
113: 0
114: 0
115: 0
116: 0
117: 0
118: 0
119: 0
120: 0
121: 0
122: 0
123: 0
124: 0
125: 0
126: 0
127: 0
128: 0
129: 0
130: 0
131: 0
132: 0
133: 0
134: 0
135: 0
136: 0
137: 0
138: 0
139: 0
140: 0
141: 0
142: 0
143: 0
144: 0
145: 0
146: 0
147: 0
148: 0
149: 0
150: 0
151: 0
152: 0
153: 0
154: 0
155: 0
156: 0
157: 0
158: 0
159: 0
160: 0
161: 0
162: 0
163: 0
164: 0
165: 0
166: 0
167: 0
168: 0
169: 0
170: 0
171: 0
172: 0
173: 0
174: 0
175: 0
176: 0
177: 0
178: 0
179: 0
180: 0
181: 0
182: 0
183: 0
184: 0
185: 0
186: 0
187: 0
188: 0
189: 0
190: 0
191: 0
192: 0
193: 0
194: 0
195: 0
196: 0
197: 0
198: 0
199: 0
200: 0
201: 0
202: 0
203: 0
204: 0
205: 0
206: 0
207: 0
208: 0
209: 0
210: 0
211: 0
212: 0
213: 0
214: 0
215: 0
216: 0
217: 0
218: 0
219: 0
220: 0
221: 0
222: 0
223: 0
224: 0
225: 0
226: 0
227: 0
228: 0
229: 0
230: 0
231: 0
232: 0
233: 0
234: 0
235: 0
236: 0
237: 0
238: 0
239: 0
240: 0
241: 0
242: 0
243: 0
244: 0
245: 0
246: 0
247: 0
248: 0
249: 0
250: 0
251: 0
252: 0
253: 0
254: 0
255: 0

If all works as expected, only the count of exception #27 (5) should be non-0 (and equal to the number of the frames shown minus 1).

(5) Level 3 interrupt; COPER is used to synchronize with the bottom of the screen.
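The expectation stated above (only vector 27 non-zero, and equal to the number of frames shown minus 1) can be checked mechanically against a report. A small C sketch; the function name and the idea of feeding it an array parsed from the output are assumptions:

```c
#include <assert.h>

#define NUM_VECTORS 256
#define VEC_LEVEL3  27   /* level 3 interrupt autovector */

/* Returns -1 if the counts match the expectation, otherwise the number
   of the first vector whose count is not the expected one. */
int first_unexpected_vector(const unsigned long counts[NUM_VECTORS],
                            unsigned long frames_shown)
{
    assert(frames_shown > 0);
    for (int v = 0; v < NUM_VECTORS; v++) {
        unsigned long expected = (v == VEC_LEVEL3) ? frames_shown - 1 : 0;
        if (counts[v] != expected)
            return v;
    }
    return -1;
}
```

With the sample output above (100 frames shown, 99 occurrences of vector 27, everything else 0) the function returns -1.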

Lunda · 04 December 2023, 07:23

Quote:

Originally Posted by saimo

Cool, thanks!
Could you also run the blind benchmark and give me the result, please? Also, what are your Amiga model, accelerator board and RAM speed? I'd like to add your figures to the table in the manual.

Here you go.
I've made some changes to the SRAM controller. Read speed is still the same as in the bustest thread.
https://eab.abime.net/showpost.php?p...4&postcount=40

BB:
tot frames shown = 144
fps avg = 34.72
frames per renderer = 1.44

FB:
tot frames shown = 164
fps avg = 30.48
frames per renderer = 1.64

New BUSTEST FAST write performance:
writew 43.2 ns 46.3 MB/S
writel 43.1 ns 92.9 MB/S
writem 39.5 ns 101.3 MB/S

alexh · 04 December 2023, 11:12

Quote:

Originally Posted by saimo

explain why the 68030 performs better than the 68060

Switch off Superscalar (setting PCR(0)=0) and see if that has any effect on 68060?

I would imagine it should have a negative effect but you never know

hooverphonique · 04 December 2023, 13:48

Quote:

Originally Posted by saimo

(3) Even if the dots were plotted sequentially, using (an)+, the gain would be very little: on my machine that saves 2 cycles per write, but eventually the gain is of just 0.4 fps... which would be lost due to a more inefficient rendering of the background (it would be rotated by 90° and thus need more lines, i.e. more loops overhead) and, above all, due to the need to reorder the data while copying it to CHIP RAM, thus losing the benefits of the burst copy (even if most, if not all, of the rearranging could be done in parallel with the writes).

Would rotating the bitmap while copying from fast to chip not basically come for free? fast ram is cachable, so reading contiguously with burst should be good; chip ram is not cachable, so should not affect your data cache.

paraj · 04 December 2023, 16:36

Only expected interrupts. However, I've figured out why it's slow on my machine: ATC misses! If I disable page address translation and rely just on the TTRs I get 34.72 / 28.57 fps respectively.

I timed a small test program that reads a byte from a (pseudo-)random offset into an array and varied the size:

Up to 256KB there is no difference; the no-MMU case stays flat at ~375 ns per loop iteration, while with the MMU it grows to 867 ns for a 32MB array.

EDIT: 256KB of course lines up perfectly with a 64-entry ATC and a page size of 4K. I also forgot something actionable: being more cache-friendly is likely a big rework, but I think grouping (height, color) rather than having separate arrays would likely be an easy win. Obviously you can't just switch off the MMU without consequences, so don't change something like that in your own code.
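To make the two points above concrete: 64 ATC entries times 4 KB pages cover 256 KB, and interleaving (height, color) puts both values of a dot in the same page, so the second lookup never misses. The C sketch below is a toy model only. It treats the ATC as a fully associative LRU table, which is a simplification, uses an arbitrary pseudo-random dot sequence rather than the renderer's real access pattern, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE   4096u
#define ATC_ENTRIES 64u
#define NO_PAGE     0xFFFFFFFFu

typedef struct {
    uint32_t page[ATC_ENTRIES];
    uint32_t stamp[ATC_ENTRIES];   /* 0 = entry never used */
    uint32_t clock;
    unsigned long misses;
} Atc;

/* Amount of memory the ATC can map without a miss: 64 * 4 KB = 256 KB. */
unsigned long atc_reach_bytes(void)
{
    return (unsigned long)ATC_ENTRIES * PAGE_SIZE;
}

void atc_reset(Atc *a)
{
    for (unsigned i = 0; i < ATC_ENTRIES; i++) {
        a->page[i] = NO_PAGE;
        a->stamp[i] = 0;
    }
    a->clock = 0;
    a->misses = 0;
}

/* Look up one address; on a miss, replace the least recently used entry. */
void atc_access(Atc *a, uint32_t addr)
{
    uint32_t page = addr / PAGE_SIZE;
    unsigned victim = 0;

    a->clock++;
    for (unsigned i = 0; i < ATC_ENTRIES; i++) {
        if (a->page[i] == page) {          /* hit */
            a->stamp[i] = a->clock;
            return;
        }
        if (a->stamp[i] < a->stamp[victim])
            victim = i;
    }
    a->misses++;
    a->page[victim] = page;
    a->stamp[victim] = a->clock;
}

/* Count misses for `dots` pseudo-random dots in a map of `cells` cells,
   each dot reading one height byte and one color byte. */
unsigned long count_misses(uint32_t cells, int interleaved, unsigned long dots)
{
    Atc atc;
    uint32_t state = 3141592u;

    assert(cells > 0 && cells <= (1u << 24));
    atc_reset(&atc);
    for (unsigned long n = 0; n < dots; n++) {
        uint32_t i = (state >> 8) % cells;
        state = state * 1664525u + 1013904223u;   /* simple LCG */
        if (interleaved) {
            atc_access(&atc, 2 * i);              /* height */
            atc_access(&atc, 2 * i + 1);          /* color, same page */
        } else {
            atc_access(&atc, i);                  /* height array */
            atc_access(&atc, cells + i);          /* color array */
        }
    }
    return atc.misses;
}
```

In this model, a total footprint of up to 256 KB only pays the initial fills, while for larger maps the interleaved layout can never miss more than once per dot, against up to twice per dot for separate arrays.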

saimo · 04 December 2023, 21:22

Quote:

Originally Posted by Lunda

Here you go.
I've made some changes to the SRAM controller. Read speed is still the same as in the bustest thread.
https://eab.abime.net/showpost.php?p...4&postcount=40

BB:
tot frames shown = 144
fps avg = 34.72
frames per renderer = 1.44

FB:
tot frames shown = 164
fps avg = 30.48
frames per renderer = 1.64

New BUSTEST FAST write performance:
writew 43.2 ns 46.3 MB/S
writel 43.1 ns 92.9 MB/S
writem 39.5 ns 101.3 MB/S

Thank you!
Your figures make sense: they're just slightly better than those that can be derived proportionally from my 50 MHz 68030 (e.g. 21 fps * 70 / 50 = 29.4).

I'm about to post a new version that has a more precise timing: it would be great if you could post the results output by that one, too.

saimo · 04 December 2023, 21:25

Quote:

Originally Posted by alexh

Switch off Superscalar (setting PCR(0)=0) and see if that has any effect on 68060?

I would imagine it should have a negative effect but you never know

Yes, that would definitely have a negative effect, but you gave me an idea: to add a switch that turns superscalar dispatch on and off at will, so that one can measure the benefits of the 68060's parallelism. It's already done. I'm posting the new version after replying to the other posts here.

saimo · 04 December 2023, 21:30

Quote:

Originally Posted by hooverphonique

Would rotating the bitmap while copying from fast to chip not basically come for free? fast ram is cachable, so reading contiguously with burst should be good; chip ram is not cachable, so should not affect your data cache.

Indeed the data cache wouldn't be affected and the rotation can be done in parallel with the writes, but what I was trying to say is that code that does that won't be as quick as the code I pasted in post #28. I know because when I wrote that code I tested tens of different variants, and that one turned out to be the one that performs best on my machine; in particular, writing 13 longwords with single moves performs worse than movem.
That said, I might give the sequential rendering a shot.

saimo · 04 December 2023, 22:05

Quote:

Originally Posted by paraj

Only expected interrupts.

And fewer exceptions than those that actually happened: I had stupidly set the vectors that handle trap #0/1 (which I use to fiddle with the caches in real time) after setting the routine that counts the exceptions.

Once that was fixed, the counts of vectors 32 and 33 were sane, too.

Quote:

However, I've figured out why it's slow on my machine: ATC-misses!

Last night I received a report (from the always super-supportive and enthusiastic klx300r) confirming that no unexpected exceptions occurred, which convinced me that there must be something else... and then it dawned on me, and I came to exactly the same conclusion!
In fact, this afternoon - unaware of your warning below - I also implemented a NOMMU switch that disables ATC-based translation.

Quote:

If I disable page address translation and rely just on the TTRs I get 34.72 / 28.57 fps respectively.

Now we're talking! At a glance, 1.5x faster rendering than on the 68030 looks realistic, considering the code. And maybe the performance is even better: in the last build, due to a rearrangement of the startup code, the part that performed the CPU-specific initializations was using an uninitialized (0) CPU ID variable, so the program used the generic 68020 code.

Quote:

I timed a small test program that reads a byte from a (pseudo-)random offset into an array and varied the size:

Up to 256KB there is no difference; the no-MMU case stays flat at ~375 ns per loop iteration, while with the MMU it grows to 867 ns for a 32MB array.

Thanks for the test and the report.

Quote:

EDIT: 256KB of course lines up perfectly with a 64-entry ATC and a page size of 4K. I also forgot something actionable: being more cache-friendly is likely a big rework, but I think grouping (height, color) rather than having separate arrays would likely be an easy win.

Good suggestion but, unfortunately, the data has been organized like that from the start.

Quote:

Obviously you can't just switch off the MMU without consequences, so don't change something like that in your own code.

This scared me. I wonder, where's the danger?
Here's the context:
1. the program takes over the system entirely, stores its state carefully;
2. if the NOMMU switch is specified, it stores the current value of the tcr and disables the translation;
3. it never uses the OS anymore;
4. it accesses exclusively the chipset and the CHIP and FAST areas previously allocated;
5. upon exit, it restores the tcr, restores the system state and cleans everything up.

In the past (circa 2002-2004) I did fiddle directly with the MMU for an experimental and rather complex piece of software, which happily ran not only on my A1200/030, but also on A1200/060, A4000/040 and A4000/060.
But your warning makes me wonder: have things changed in the meanwhile? Can logical and physical RAM addresses differ?

saimo · 04 December 2023, 22:17

New test build.

Changes:
1. more precise timing: now the program uses the color clocks to measure the elapsed time, instead of the displayed frames number;
2. added the NOSUPERSCALARDISPATCH=NSD command line switch, which turns off the 68060 superscalar dispatch;
3. added the NOMMU=NM command line switch, which turns off the MMU (this helps improve the speed) - WARNING: USE AT YOUR OWN RISK.

Karlos · 05 December 2023, 00:57

@paraj

That graph is for general pseudorandom byte sized accesses in an array, right?

paraj · 05 December 2023, 18:10

Quote:

Originally Posted by Karlos

@paraj

That graph is for general pseudorandom byte sized accesses in an array, right?

Yes. Timing this function 10 times, with d0=100000 and d1=arraysize:

Code:

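_byteread::                             ; in: d0 = iteration count, d1 = array size (presumably a power of two)
        movem.l d2/d3,-(sp)             ; save the registers clobbered below
        move.l  d1,d3
        subq.l  #1,d3                   ; d3 = size - 1, used as offset mask
        lea     buffer,a0               ; a0 = base of the test array
        move.l  #3141592,d1             ; d1 = pseudo-random state (seed)
.loop:
        move.l  d3,d2
        and.l   d1,d2                   ; d2 = state & mask = offset into the array
        rol.l   d1,d1                   ; rotate the state by its own value (count taken modulo 64)
        addq.l  #7,d1                   ; perturb the state
        move.b  (a0,d2.l),d2            ; the timed byte read
        subq.l  #1,d0
        bne.b   .loop                   ; repeat until the counter reaches zero
        movem.l (sp)+,d2/d3             ; restore the saved registers
        rts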

@saimo: If OS is completely off, and there is basically no MMU usage before, then yes, it will probably work. But it's not super uncommon to have things that e.g. move the first 4k to fast mem using MMU so gotta be careful about something like that.
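For readers who want to try the same access pattern off-target, the timing loop above can be approximated in C. This is a hedged port, not paraj's code: the buffer is passed as a parameter instead of being a global, the loaded bytes are summed so a compiler cannot drop the reads, and a zero iteration count simply does nothing (the assembly would wrap around). The array size is assumed to be a power of two, since the routine masks with size - 1.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit rotate left. The 68k takes a register count modulo 64, and on a
   32-bit operand a rotation by n gives the same result as one by n mod 32. */
uint32_t rol32(uint32_t v, unsigned count)
{
    count &= 31u;
    return count ? (v << count) | (v >> (32u - count)) : v;
}

/* Reads `iterations` bytes at pseudo-random offsets inside the buffer and
   returns their sum. */
uint32_t byteread(uint32_t iterations, uint32_t array_size,
                  const volatile uint8_t *buffer)
{
    uint32_t mask, state = 3141592u, sink = 0;

    assert(array_size != 0 && (array_size & (array_size - 1)) == 0);
    mask = array_size - 1;
    for (uint32_t n = 0; n < iterations; n++) {
        uint32_t offset = state & mask;       /* and.l  d1,d2 */
        state = rol32(state, state & 63u);    /* rol.l  d1,d1 */
        state += 7;                           /* addq.l #7,d1 */
        sink += buffer[offset];               /* move.b (a0,d2.l),d2 */
    }
    return sink;
}
```

Timing a call such as `byteread(100000, size, buffer)` for growing power-of-two sizes would mirror the measurement described above, although absolute figures on a modern host say nothing about the 68060.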

saimo · 05 December 2023, 22:40

Quote:

Originally Posted by paraj

@saimo: If OS is completely off, and there is basically no MMU usage before, then yes, it will probably work. But it's not super uncommon to have things that e.g. move the first 4k to fast mem using MMU so gotta be careful about something like that.

I just tested it with both MuMove4K and MuForce running on my real Amiga and on WinUAE emulating a 68040 and a 68060, and it works fine. On the real Amiga the speed drops by 50% when the MMU is active, but specifying the NOMMU switch brings the speed back to normal. Funnily, since my PC isn't powerful, the same happens under emulation, too.

Could you try the attached version and see if the NOMMU switch does the magic also on your machine, please?

Changes since the previous version:
* fixed a nasty bug that caused a longword to be written to a random location when the fps indicator was on (two bsrs were used before restoring the stack pointer!);
* fixed the shell template ("/FB" -> "=FB/S");
* fixed the cleanup code in one place (it used exec.Supervisor() before the tcr and the OS were restored; since the vbr points to a copy of the exception vectors table and the code only modifies the level 3 interrupt and the trap #0 and #1 vectors, the flaw had no practical consequences, but still...);
* the tcr is no longer touched unless the NOMMU switch is specified (so that there are no problems on 68EC060s either).

paraj · 06 December 2023, 11:43

bb: 26.378, fb: 25.412, nommu bb: 25.062, nommu fb: 25.031

Glitches quite a bit now: https://i.imgur.com/vUmUUVG.mp4 (no command line arguments, but nommu looks the same)

saimo · 06 December 2023, 12:01

Quote:

Originally Posted by paraj

bb: 26.378, fb: 25.412, nommu bb: 25.062, nommu fb: 25.031

Glitches quite a bit now: https://i.imgur.com/vUmUUVG.mp4 (no command line arguments, but nommu looks the same)

It looks like I screwed something up, but apart from adding the NOMMU switch and the bugfixes (which are unrelated to rendering), the rest remained unchanged... puzzling.
Given that the figures are higher than those you had posted previously and that the NOMMU makes no difference, I take it that the MMU is disabled already before launching the executable or that you changed the MMU settings somehow?
(Thanks for the useful video! Edit: it clearly shows that the background overwrites the rendered graphics, which means that buffering got broken... the problem must be in the COPER interrupt and the traps - but that's quite a mystery, since I didn't touch those.)

paraj · 06 December 2023, 12:50

MMU is enabled. Same result running no s-s (but + setpatch).

Looks like you're not setting the transparent translation registers? MMUlib clears them (sets them to $FFFF6040). You should set them to something like:

ITT0/1 = $00ffc000
DTT0 = $0000c040
DTT1 = $00ffc000

before disabling paging, and restore before re-enabling.
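The suggested register values are easier to read once split into fields. The C sketch below decodes a transparent translation register, assuming the 68040/68060 layout as I recall it from the Motorola manuals (address base in bits 31-24, address mask in bits 23-16, enable in bit 15, S-field in bits 14-13, cache mode in bits 6-5, write protect in bit 2). Treat the layout and all names as assumptions to be checked against the user's manual.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint8_t  base;          /* bits 31-24: logical address base */
    uint8_t  mask;          /* bits 23-16: 1 = ignore that address bit */
    int      enabled;       /* bit 15 */
    unsigned s_field;       /* bits 14-13: 2 or 3 = user and supervisor */
    unsigned cache_mode;    /* bits 6-5: 0 = write-through, 1 = copyback,
                               2 and 3 = cache-inhibited */
    int      write_protect; /* bit 2 */
} Ttr;

Ttr ttr_decode(uint32_t raw)
{
    Ttr t;

    t.base = (uint8_t)(raw >> 24);
    t.mask = (uint8_t)(raw >> 16);
    t.enabled = (int)((raw >> 15) & 1u);
    t.s_field = (raw >> 13) & 3u;
    t.cache_mode = (raw >> 5) & 3u;
    t.write_protect = (int)((raw >> 2) & 1u);
    return t;
}

/* Address match only; the S-field (user/supervisor) test is ignored here. */
int ttr_matches(const Ttr *t, uint32_t address)
{
    uint8_t top = (uint8_t)(address >> 24);

    assert(t != 0);
    return t->enabled && (((top ^ t->base) & (uint8_t)~t->mask) == 0);
}
```

Read this way, $0000c040 matches only the low 16 MB (where CHIP RAM and the custom chip registers live) and marks it cache-inhibited, $00ffc000 matches the whole address space as cacheable write-through, and $FFFF6040 has the enable bit clear, so it matches nothing.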
