English Amiga Board


Old 03 December 2023, 19:00   #61
paraj
Registered User
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,105
19.23 / 17.48 so yeah tiny improvement

SS (and all other relevant stuff) is already enabled, so you shouldn't need to fiddle with that. Do you happen to use any of the unimplemented instructions (e.g. 64-bit mul/div)?
Old 03 December 2023, 20:47   #62
saimo
Registered User
 
Join Date: Aug 2010
Location: Italy
Posts: 787
Been outside for a while, but the brain kept on returning to this all the time. At some point, I wondered whether maybe there's an exception firing all the time, disrupting the execution. I'll cook up a test build that counts the occurrences of all the exceptions.

Quote:
Originally Posted by paraj View Post
19.23 / 17.48 so yeah tiny improvement
This makes sense.

In the meantime, I received the results from tests on 2 other machines - full recap in the table below.

Code:
AMIGA |                               |     FPS |    FPS | PED81C COST    |
MODEL | ACCELERATOR BOARD             | (BLIND) | (FULL) | (FPS / FRAMES) | NOTE
------+-------------------------------+---------+--------+----------------+-----
 1200 | PiStorm32 + Raspberry Pi 3 A+ |   50.00 |  50.00 | 0.00 / ?.???   |
 CD³² | The Beast 030                 |         |  30.48 |                | 1
 1200 | Blizzard 1230 IV              |   23.14 |  20.92 | 2.22 / 0.229   | 2
 1200 | Blizzard 1260                 |   19.23 |  17.48 | 1.75 / 0.260   |
 1200 | TerribleFire TF1260           |   13.96 |  13.15 | 0.81 / 0.221   |
 4000 | Cyberstorm MK III             |   19.53 |  18.05 | 1.48 / 0.210   |

1. 68030 70 MHz, SRAM (1 cycle wrap-around burst)
2. 68030 50 MHz, RAM 60 ns
The time taken to write the data to CHIP RAM is very similar between all the machines (that is, other than the PiStorm-equipped one, which is so fast that it neutralizes the impact of the copy to CHIP RAM).
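The two PED81C COST columns can be reproduced from the two FPS columns. A minimal sketch, assuming a 50 Hz PAL display (consistent with the 50.00 fps cap in the first row) and assuming the FRAMES figure is the difference in display frames spent per rendered frame; the function name is made up for illustration:

```python
PAL_HZ = 50.0  # assumption: 50 Hz PAL display

def ped81c_cost(fps_blind, fps_full, display_hz=PAL_HZ):
    """Return (fps lost, extra display frames per rendered frame)."""
    fps_cost = fps_blind - fps_full
    # display frames per rendered frame, with and without the copy to CHIP RAM
    frames_cost = display_hz / fps_full - display_hz / fps_blind
    return round(fps_cost, 2), round(frames_cost, 3)

for board, blind, full in [
    ("Blizzard 1230 IV", 23.14, 20.92),
    ("Blizzard 1260", 19.23, 17.48),
    ("TerribleFire TF1260", 13.96, 13.15),
    ("Cyberstorm MK III", 19.53, 18.05),
]:
    print(board, ped81c_cost(blind, full))
```

For the PiStorm row both figures are capped at 50.00, so the cost in frames cannot be derived this way, which matches the "?.???" in the table.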

Quote:
SS (and all other relevant stuff) is already enabled, so you shouldn't need to fiddle with that. Do you happen to use any of the unimplemented instructions (e.g. 64-bit mul/div)?
Nope, no unimplemented instructions are used.
Old 03 December 2023, 21:05   #63
paraj
Roughly how many fastmem accesses are you doing per frame? "The beast" numbers seem to indicate you might be limited by those. I don't recall the worst-case number of cycles for that, and I don't know how you're doing stuff, but maybe try limiting the drawing distance or something?
Old 03 December 2023, 21:16   #64
saimo
Quote:
Originally Posted by paraj View Post
Roughly how many fastmem accesses are you doing per frame? "The beast" numbers seem to indicate you might be limited by those. I don't recall the worst-case number of cycles for that, and I don't know how you're doing stuff, but maybe try limiting the drawing distance or something?
Can't give a proper answer right now, but I'll get back to you.
Anyway, of course limiting the distance would increase the speed, but that won't explain why the 68030 performs better than the 68060.
Old 04 December 2023, 01:45   #65
saimo
@paraj

Still quick, but better answer.

Rendering is done by columns, from bottom to top and then left to right.
The code applies a depth of 256 steps per column, so it evaluates 256*128 = 32768 dots per frame (and then renders only those which are actually visible).
For each of those dots, the renderer does this:
1. calculate the dot position in the map;
2. read the dot height from the map (memory read #1);
3. calculate the screen Y the dot would project to, taking the height of the camera into account (this requires a lookup table read - memory read #2 (1));
4. if the dot happens to be hidden behind the dots rendered previously, pass to the next dot (2);
5. read the dot color from the map (memory read #3);
6. plot the dot upwards (3), starting from the screen Y of the previous topmost dot to the screen Y just calculated (N memory writes (4)).

(1) This read cannot be efficiently replaced with real-time calculations, even on 68060.
(2) This is not rare at all and avoids further memory accesses.
(3) Even if the dots were plotted sequentially, using (an)+, the gain would be very little: on my machine that saves 2 cycles per write, but in the end the gain is just 0.4 fps... which would be lost due to a less efficient rendering of the background (it would be rotated by 90° and thus need more lines, i.e. more loop overhead) and, above all, due to the need to reorder the data while copying it to CHIP RAM, thus losing the benefits of the burst copy (even if most, if not all, of the rearranging could be done in parallel with the writes).
(4) On average very few.
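The per-dot steps listed above can be sketched as follows. This is only an illustration of the described flow, not the actual renderer: the names, the data layout and the shape of the projection table (`proj[depth_step][height]`) are assumptions, and screen Y is counted upwards from the bottom of the column.

```python
COLUMNS = 128  # columns per frame (from the post)
DEPTH = 256    # depth steps per column (from the post)

def render_column(positions, heights, colors, proj, screen_h, background=0):
    """Render one screen column, front to back.

    positions: map index of the dot at each depth step (step 1, precomputed here)
    heights, colors: the two separate map arrays
    proj: hypothetical lookup table, proj[depth_step][height] -> screen Y
    """
    column = [background] * screen_h
    top = 0  # screen Y just above the topmost dot plotted so far
    for z, pos in enumerate(positions):
        h = heights[pos]                 # memory read #1
        y = min(proj[z][h], screen_h)    # memory read #2 (lookup table)
        if y <= top:
            continue                     # hidden behind previous dots: no more accesses
        c = colors[pos]                  # memory read #3
        for yy in range(top, y):         # N memory writes, plotting upwards
            column[yy] = c
        top = y
    return column

print(COLUMNS * DEPTH)  # dots evaluated per frame: 32768
```

Note how a hidden dot costs only two reads, which is why the early-out in step 4 matters.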

The code boils down to a bunch of simple operations (moves, adds, subs, ands, etc.). For it to be the cause of the poor performance on 68060 (even if only 1 pipeline were used!), given the significantly smaller timings of the 68060 instructions, the FAST RAM access of those 68060 boards would have to be really terrible compared to my Blizzard's - which is unlikely, as the comparable FAST RAM -> CHIP RAM copy speeds also indicate.
I'm not sharing the code yet because I want to first give it another thought (if micro-optimizations are possible, I want to find them myself) and because I'm under the impression that there's a bigger problem somewhere else (not saying that's the case for sure, though).

Could you try the attached build? This one counts the occurrences of all the exceptions during the execution. The output will be like this:
Code:
total number of frames rendered:    100
total number of frames shown:       100
frames rendered per second average: 50.00
frames per render average:          1.00
ex#: count
  0: 0
  1: 0
  2: 0
  3: 0
  4: 0
  5: 0
  6: 0
  7: 0
  8: 0
  9: 0
 10: 0
 11: 0
 12: 0
 13: 0
 14: 0
 15: 0
 16: 0
 17: 0
 18: 0
 19: 0
 20: 0
 21: 0
 22: 0
 23: 0
 24: 0
 25: 0
 26: 0
 27: 99
 28: 0
 29: 0
 30: 0
 31: 0
 32: 0
 33: 0
 34: 0
 35: 0
 36: 0
 37: 0
 38: 0
 39: 0
 40: 0
 41: 0
 42: 0
 43: 0
 44: 0
 45: 0
 46: 0
 47: 0
 48: 0
 49: 0
 50: 0
 51: 0
 52: 0
 53: 0
 54: 0
 55: 0
 56: 0
 57: 0
 58: 0
 59: 0
 60: 0
 61: 0
 62: 0
 63: 0
 64: 0
 65: 0
 66: 0
 67: 0
 68: 0
 69: 0
 70: 0
 71: 0
 72: 0
 73: 0
 74: 0
 75: 0
 76: 0
 77: 0
 78: 0
 79: 0
 80: 0
 81: 0
 82: 0
 83: 0
 84: 0
 85: 0
 86: 0
 87: 0
 88: 0
 89: 0
 90: 0
 91: 0
 92: 0
 93: 0
 94: 0
 95: 0
 96: 0
 97: 0
 98: 0
 99: 0
100: 0
101: 0
102: 0
103: 0
104: 0
105: 0
106: 0
107: 0
108: 0
109: 0
110: 0
111: 0
112: 0
113: 0
114: 0
115: 0
116: 0
117: 0
118: 0
119: 0
120: 0
121: 0
122: 0
123: 0
124: 0
125: 0
126: 0
127: 0
128: 0
129: 0
130: 0
131: 0
132: 0
133: 0
134: 0
135: 0
136: 0
137: 0
138: 0
139: 0
140: 0
141: 0
142: 0
143: 0
144: 0
145: 0
146: 0
147: 0
148: 0
149: 0
150: 0
151: 0
152: 0
153: 0
154: 0
155: 0
156: 0
157: 0
158: 0
159: 0
160: 0
161: 0
162: 0
163: 0
164: 0
165: 0
166: 0
167: 0
168: 0
169: 0
170: 0
171: 0
172: 0
173: 0
174: 0
175: 0
176: 0
177: 0
178: 0
179: 0
180: 0
181: 0
182: 0
183: 0
184: 0
185: 0
186: 0
187: 0
188: 0
189: 0
190: 0
191: 0
192: 0
193: 0
194: 0
195: 0
196: 0
197: 0
198: 0
199: 0
200: 0
201: 0
202: 0
203: 0
204: 0
205: 0
206: 0
207: 0
208: 0
209: 0
210: 0
211: 0
212: 0
213: 0
214: 0
215: 0
216: 0
217: 0
218: 0
219: 0
220: 0
221: 0
222: 0
223: 0
224: 0
225: 0
226: 0
227: 0
228: 0
229: 0
230: 0
231: 0
232: 0
233: 0
234: 0
235: 0
236: 0
237: 0
238: 0
239: 0
240: 0
241: 0
242: 0
243: 0
244: 0
245: 0
246: 0
247: 0
248: 0
249: 0
250: 0
251: 0
252: 0
253: 0
254: 0
255: 0
If all works as expected, only the count of exception #27 (5) should be non-0 (and equal to the number of the frames shown minus 1).

(5) Level 3 interrupt; COPER is used to synchronize with the bottom of the screen.
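A report in this format can be checked mechanically. The following is a small sketch, not part of the test build: it parses the "ex#: count" lines and flags any vector whose count differs from the expectation stated above (vector 27 equal to the frames shown minus 1, everything else 0). The vector names follow the standard 680x0 exception vector table; the function names are made up for illustration.

```python
def vector_name(n):
    """Human-readable name for a 680x0 exception vector number."""
    names = {2: "access fault (bus error)", 3: "address error",
             4: "illegal instruction", 5: "integer divide by zero",
             8: "privilege violation", 9: "trace",
             10: "line-A emulator", 11: "line-F emulator",
             24: "spurious interrupt"}
    if n in names:
        return names[n]
    if 25 <= n <= 31:
        return f"level {n - 24} interrupt autovector"
    if 32 <= n <= 47:
        return f"TRAP #{n - 32}"
    return f"vector {n}"

def parse_counts(report):
    """Extract {vector: count} from the lines that look like ' 27: 99'."""
    counts = {}
    for line in report.splitlines():
        left, sep, right = line.partition(":")
        if sep and left.strip().isdigit() and right.strip().isdigit():
            counts[int(left)] = int(right)
    return counts

def unexpected(counts, frames_shown):
    """List (vector, name, count) for every count that is not as expected."""
    problems = []
    for n, c in sorted(counts.items()):
        expected = frames_shown - 1 if n == 27 else 0
        if c != expected:
            problems.append((n, vector_name(n), c))
    return problems

report = "total number of frames shown:       100\nex#: count\n 26: 0\n 27: 99\n 28: 0\n"
print(unexpected(parse_counts(report), 100))
```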

Last edited by saimo; 04 December 2023 at 21:20. Reason: Removed attachment as I provided a newer version later.
Old 04 December 2023, 07:23   #66
Lunda
Registered User
 
Join Date: Jul 2023
Location: Domsjö/Sweden
Posts: 35
Quote:
Originally Posted by saimo View Post
Cool, thanks!
Could you run also the blind benchmark and give me the result, please? Also, what is your Amiga model, accelerator board and RAM speed? I'd like to add your figures to the table in the manual.
Here you go.
I've made some changes to the SRAM controller. Read speed is still the same as in the bustest thread.
https://eab.abime.net/showpost.php?p...4&postcount=40

BB:
tot frames shown = 144
fps avg = 34.72
frames per renderer = 1.44

FB:
tot frames shown = 164
fps avg = 30.48
frames per renderer = 1.64

New BUSTEST FAST write performance:
writew 43.2 ns 46.3 MB/S
writel 43.1 ns 92.9 MB/S
writem 39.5 ns 101.3 MB/S
Old 04 December 2023, 11:12   #67
alexh
Thalion Webshrine
 
Join Date: Jan 2004
Location: Oxford
Posts: 14,354
Quote:
Originally Posted by saimo View Post
explain why the 68030 performs better than the 68060
Switch off Superscalar (setting PCR(0)=0) and see if that has any effect on 68060?

I would imagine it should have a negative effect but you never know
Old 04 December 2023, 13:48   #68
hooverphonique
ex. demoscener "Bigmama"
 
Join Date: Jun 2012
Location: Fyn / Denmark
Posts: 1,624
Quote:
Originally Posted by saimo View Post
(3) Even if the dots were plotted sequentially, using (an)+, the gain would be very little: on my machine that saves 2 cycles per write, but in the end the gain is just 0.4 fps... which would be lost due to a less efficient rendering of the background (it would be rotated by 90° and thus need more lines, i.e. more loop overhead) and, above all, due to the need to reorder the data while copying it to CHIP RAM, thus losing the benefits of the burst copy (even if most, if not all, of the rearranging could be done in parallel with the writes).
Would rotating the bitmap while copying from fast to chip not basically come for free? fast ram is cachable, so reading contiguously with burst should be good; chip ram is not cachable, so should not affect your data cache.
Old 04 December 2023, 16:36   #69
paraj
Only expected interrupts. However, I've figured out why it's slow on my machine: ATC misses! If I disable page address translation and rely just on the TTRs I get 34.72 / 28.57 fps respectively.

I timed a small test program that reads a byte from a (pseudo-)random offset into an array and varied the size:


Up to 256KB there are no differences, and the no-MMU case stays flat at around ~375ns/loop iteration, while with the MMU it grows to 867ns for a 32MB array.

EDIT: 256KB of course lines up perfectly with the 64-entry ATC and a page size of 4K, and I forgot something actionable: of course being more cache friendly is likely a big rework, but I think grouping (height,color) rather than having separate arrays would likely be an easy win. Obviously you can't just switch off the MMU w/o consequences, so don't change something like that in your own code.
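The 256KB figure and the measured timings can be tied together with a bit of arithmetic. A rough sketch, assuming uniformly random accesses and modelling the ATC as simply holding any 64 pages (set associativity and replacement policy are ignored), so the per-miss cost it yields is only a ballpark estimate; the function names are made up for illustration:

```python
ATC_ENTRIES = 64       # data ATC entries (from the post)
PAGE_SIZE = 4 * 1024   # 4K pages (from the post)

def atc_reach(entries=ATC_ENTRIES, page_size=PAGE_SIZE):
    """Bytes of memory the ATC can map without a table walk."""
    return entries * page_size

def est_miss_rate(array_size, reach=None):
    """Estimated fraction of random accesses that miss the ATC."""
    reach = atc_reach() if reach is None else reach
    return max(0.0, 1.0 - reach / array_size)

def est_miss_penalty_ns(t_mmu_ns, t_nommu_ns, array_size):
    """Extra time per ATC miss implied by the two per-iteration timings."""
    rate = est_miss_rate(array_size)
    if rate == 0.0:
        raise ValueError("array fits in the ATC reach: no misses to attribute")
    return (t_mmu_ns - t_nommu_ns) / rate

print(atc_reach() // 1024, "KB")                          # 256 KB
print(est_miss_rate(32 * 1024 * 1024))                    # ~0.992 at 32MB
print(est_miss_penalty_ns(867, 375, 32 * 1024 * 1024))    # roughly 500 ns per miss
```

Under these assumptions nearly every access to a 32MB array misses, so almost the whole 867 - 375 = 492 ns difference is table-walk time.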
Attached thumbnail: atc-miss.png

Last edited by paraj; 04 December 2023 at 19:40.
Old 04 December 2023, 21:22   #70
saimo
Quote:
Originally Posted by Lunda View Post
Here you go.
I've made some changes to the SRAM controller. Read speed is still the same as in the bustest thread.
https://eab.abime.net/showpost.php?p...4&postcount=40

BB:
tot frames shown = 144
fps avg = 34.72
frames per renderer = 1.44

FB:
tot frames shown = 164
fps avg = 30.48
frames per renderer = 1.64

New BUSTEST FAST write performance:
writew 43.2 ns 46.3 MB/S
writel 43.1 ns 92.9 MB/S
writem 39.5 ns 101.3 MB/S
Thank you!
Your figures make sense: they're just slightly better than those that can be derived proportionally from my 50 MHz 68030 (e.g. 21 fps * 70 / 50 = 29.4).

I'm about to post a new version that has a more precise timing: it would be great if you could post the results output by that one, too.
Old 04 December 2023, 21:25   #71
saimo
Quote:
Originally Posted by alexh View Post
Switch off Superscalar (setting PCR(0)=0) and see if that has any effect on 68060?

I would imagine it should have a negative effect but you never know
Yes, that would definitely have a negative effect, but you gave me an idea: to add a switch that allows turning superscalar dispatch on and off at will, so that one can measure the benefits of the 68060's parallelism. It's already done. I'm posting the new version after replying to the other posts here.
Old 04 December 2023, 21:30   #72
saimo
Quote:
Originally Posted by hooverphonique View Post
Would rotating the bitmap while copying from fast to chip not basically come for free? fast ram is cachable, so reading contiguously with burst should be good; chip ram is not cachable, so should not affect your data cache.
Indeed the data cache wouldn't be affected and rotating can be done in parallel with the writes, but what I was trying to say is that code that does that won't be as quick as the code I pasted in post #28. I know because when I wrote that code I tested tens of different variants, and that one turned out to be the one that performs best on my machine - in particular, writing 13 longwords with single moves performs worse than movem.
That said, I might give the sequential rendering a shot.
Old 04 December 2023, 22:05   #73
saimo
Quote:
Originally Posted by paraj View Post
Only expected interrupts.
And fewer exceptions than those that actually happened: I had stupidly set the vectors that handle trap #0/1 (which I use to fiddle with the caches in real time) after setting the routine that counts the exceptions. Once I fixed that, the counts of vectors 32 and 33 were also sane.

Quote:
However, I've figured out why it's slow on my machine: ATC-misses!

Last night I received a report (from the always super-supportive and enthusiastic klx300r) that confirmed that no unexpected exceptions occurred, which convinced me that there must be something else... and then it dawned on me, and I came to exactly the same conclusion!
In fact, this afternoon - unaware of your warning below - I also implemented a NOMMU switch that disables ATC-based translation.

Quote:
If I disable page address translation and rely just on the TTR's I get 34.72 / 28.57 fps respectively.
Now we're talking! At a glance, 1.5x faster rendering than on the 68030 looks realistic, considering the code. And maybe the performance is even better: in the last build, due to a rearrangement of the startup code, the part that performed the CPU-specific initializations was using an uninitialized (0) CPU ID variable, so the program used the generic 68020 code.

Quote:
I timed a small test program that reads a byte from a (pseudo-)random offset into an array and varied the size:


Up to 256KB there are no differences, and the no-MMU case stays flat at around ~375ns/loop iteration, while with the MMU it grows to 867ns for a 32MB array.
Thanks for the test and the report.

Quote:
EDIT: 256KB of course lines up perfectly with the 64-entry ATC and a page size of 4K, and I forgot something actionable: of course being more cache friendly is likely a big rework, but I think grouping (height,color) rather than having separate arrays would likely be an easy win.
Good suggestion but, unfortunately, the data has always been organized like that.

Quote:
Obviously you can't just switch off the MMU w/o consequences, so don't change something like that in your own code.
This scared me. I wonder, where's the danger?
Here's the context:
1. the program takes over the system entirely, stores its state carefully;
2. if the NOMMU switch is specified, it stores the current value of the tcr and disables the translation;
3. it never uses the OS again;
4. it accesses exclusively the chipset and the CHIP and FAST areas previously allocated;
5. upon exit, it restores the tcr, restores the system state and cleans everything up.

In the past (circa 2002-2004) I did fiddle directly with the MMU for an experimental and rather complex piece of software, which happily ran not only on my A1200/030, but also on A1200/060, A4000/040 and A4000/060.
But your warning makes me wonder: have things changed in the meantime? Can logical and physical RAM addresses differ?

Last edited by saimo; 04 December 2023 at 22:11.
Old 04 December 2023, 22:17   #74
saimo
New test build.

Changes:
1. more precise timing: now the program uses the color clocks to measure the elapsed time, instead of the displayed frames number;
2. added the NOSUPERSCALARDISPATCH=NSD command line switch, which allows turning off the 68060 superscalar dispatch;
3. added the NOMMU=NM command line switch, which allows turning off the MMU (this helps improve the speed) - WARNING: USE AT YOUR OWN RISK.
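Why color clocks give a more precise figure: counting displayed frames can only resolve the elapsed time to a whole frame, while the color clock ticks at about 3.55 MHz on PAL machines. A sketch of the arithmetic only, assuming a PAL system; how the program actually samples the clock is not shown in this thread, and the function name is made up for illustration:

```python
PAL_CCK_HZ = 3_546_895  # PAL color clock frequency in Hz

def fps_from_color_clocks(frames_rendered, elapsed_cck, cck_hz=PAL_CCK_HZ):
    """Average rendered frames per second from an elapsed color clock count."""
    elapsed_seconds = elapsed_cck / cck_hz
    return frames_rendered / elapsed_seconds

# 100 frames rendered in exactly 2 seconds' worth of color clocks
print(fps_from_color_clocks(100, 2 * PAL_CCK_HZ))
```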

Last edited by saimo; 05 December 2023 at 22:27. Reason: Removed attachment as I provided a newer version afterwards.
Old 05 December 2023, 00:57   #75
Karlos
Alien Bleed
 
Join Date: Aug 2022
Location: UK
Posts: 4,165
@paraj

That graph is for general pseudorandom byte sized accesses in an array, right?
Old 05 December 2023, 18:10   #76
paraj
Quote:
Originally Posted by Karlos View Post
@paraj

That graph is for general pseudorandom byte sized accesses in an array, right?
Yes. Timing this function 10 times, with d0=100000 and d1=arraysize:
Code:
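; d0 = number of iterations, d1 = array size in bytes
; (the masking below only covers the whole array if the size is a power of two)
_byteread::
        movem.l d2/d3,-(sp)     ; save the scratch registers
        move.l  d1,d3
        subq.l  #1,d3           ; d3 = arraysize-1, used as offset mask
        lea     buffer,a0       ; a0 = start of the array
        move.l  #3141592,d1     ; d1 = pseudo-random state (seed)
.loop:
        move.l  d3,d2
        and.l   d1,d2           ; d2 = state & mask = offset into the array
        rol.l   d1,d1           ; scramble the state: rotate it by its own low bits
        addq.l  #7,d1
        move.b  (a0,d2.l),d2    ; the byte read being timed
        subq.l  #1,d0
        bne.b   .loop           ; repeat d0 times
        movem.l (sp)+,d2/d3     ; restore the scratch registers
        rts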
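
For anyone who wants to study the access pattern off-target, here is a Python port of the offset generator in the routine above. It is a sketch for analysis only (it reproduces the offsets, not the timing); the helper names are made up, and it relies on the 680x0 taking a register rotate count modulo 64.

```python
MASK32 = 0xFFFFFFFF

def rol32(value, count):
    """Rotate a 32-bit value left, like rol.l Dx,Dy."""
    count %= 64   # the CPU uses the count register modulo 64
    count %= 32   # rotating 32 bits by 32 positions is the identity
    if count == 0:
        return value
    return ((value << count) | (value >> (32 - count))) & MASK32

def offsets(iterations, array_size, seed=3141592):
    """Yield the array offsets read by the loop (array_size: power of two)."""
    mask = array_size - 1
    state = seed
    for _ in range(iterations):
        yield state & mask
        state = rol32(state, state)      # rol.l d1,d1
        state = (state + 7) & MASK32     # addq.l #7,d1

def pages_touched(iterations, array_size, page_size=4096):
    """Number of distinct pages hit, to compare against the 64 ATC entries."""
    return len({off // page_size for off in offsets(iterations, array_size)})

print(list(offsets(2, 0x10000)))
print(pages_touched(100000, 32 * 1024 * 1024))
```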

@saimo: If OS is completely off, and there is basically no MMU usage before, then yes, it will probably work. But it's not super uncommon to have things that e.g. move the first 4k to fast mem using MMU so gotta be careful about something like that.
Old 05 December 2023, 22:40   #77
saimo
Quote:
Originally Posted by paraj View Post
@saimo: If OS is completely off, and there is basically no MMU usage before, then yes, it will probably work. But it's not super uncommon to have things that e.g. move the first 4k to fast mem using MMU so gotta be careful about something like that.
I just tested it with both MuMove4K and MuForce running on my real Amiga and on WinUAE emulating a 68040 and a 68060, and it works fine. On the real Amiga the speed drops by 50% when the MMU is active, but specifying the NOMMU switch brings the speed back to normal. Funnily, because my PC isn't powerful, the same also happens under emulation.
Could you try the attached version and see if the NOMMU switch does the magic also on your machine, please?

Changes since the previous version:
* fixed a nasty bug that caused a longword to be written to a random location when the fps indicator was on (two bsrs were used before restoring the stack pointer!);
* fixed the shell template ("/FB" -> "=FB/S");
* fixed the cleanup code in one place (it used exec.Supervisor() before the tcr and the OS were restored - thanks to the fact that the vbr points to a copy of the exception vectors table and that the code only modifies the level 3 interrupt and the trap #0 and #1 vectors, the flaw didn't have any practical consequences, but still...);
* the CPU tcr is no longer touched unless the NOMMU switch is specified (so that there are no problems on 68EC060s either).

Last edited by saimo; 06 December 2023 at 18:10. Reason: Removed attachment as I provided a newer version later.
Old 06 December 2023, 11:43   #78
paraj
bb: 26.378, fb: 25.412, nommu bb: 25.062, nommu fb: 25.031

Glitches quite a bit now: https://i.imgur.com/vUmUUVG.mp4 (no command line arguments, but nommu looks the same)
Old 06 December 2023, 12:01   #79
saimo
Quote:
Originally Posted by paraj View Post
bb: 26.378, fb: 25.412, nommu bb: 25.062, nommu fb: 25.031

Glitches quite a bit now: https://i.imgur.com/vUmUUVG.mp4 (no command line arguments, but nommu looks the same)

It looks like I screwed something up, but apart from adding the NOMMU switch and the bugfixes (which are unrelated to rendering), the rest remained unchanged... puzzling.
Given that the figures are higher than those you had posted previously and that the NOMMU makes no difference, I take it that the MMU is disabled already before launching the executable or that you changed the MMU settings somehow?
(Thanks for the useful video! Edit: it clearly shows that the background overwrites the rendered graphics, which means that buffering got broken... the problem must be in the COPER interrupt and the traps - but that's quite a mystery, since I didn't touch those.)
Old 06 December 2023, 12:50   #80
paraj
MMU is enabled. Same result when booting without startup-sequence (but with SetPatch).

Looks like you're not setting the transparent translation registers? MMUlib clears them (sets them to $FFFF6040). You should set them to something like:

ITT0/1 = $00ffc000
DTT0 = $0000c040
DTT1 = $00ffc000

before disabling paging, and restore before re-enabling.
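
To see what those values mean, here is a small decoder for the transparent translation register fields. It is a sketch based on my reading of the 68040/68060 TTR layout (address base in bits 31-24, address mask in bits 23-16, enable in bit 15, S field in bits 14-13, cache mode in bits 6-5); the matching function ignores the S field and write protection, so treat it as an approximation and check the user's manual before relying on it.

```python
CACHE_MODES = {0: "cachable, write-through", 1: "cachable, copyback",
               2: "cache-inhibited, precise", 3: "cache-inhibited, imprecise"}

def decode_ttr(value):
    """Split a TTR value into its main fields."""
    return {
        "base": (value >> 24) & 0xFF,       # compared with address bits 31-24
        "mask": (value >> 16) & 0xFF,       # 1 bits = ignore that address bit
        "enabled": bool(value & 0x8000),
        "s_field": (value >> 13) & 3,       # 2 or 3 = match user and supervisor
        "cache_mode": (value >> 5) & 3,
    }

def ttr_matches(value, address):
    """True if an enabled TTR covers the address (S field not considered)."""
    t = decode_ttr(value)
    if not t["enabled"]:
        return False
    top = (address >> 24) & 0xFF
    return ((top ^ t["base"]) & ~t["mask"] & 0xFF) == 0

for v in (0x00FFC000, 0x0000C040, 0xFFFF6040):
    t = decode_ttr(v)
    print(f"${v:08x}", t, CACHE_MODES[t["cache_mode"]])
```

Read this way, $00ffc000 matches every address as cachable, $0000c040 marks the low 16MB (chip RAM, custom registers) as cache-inhibited, and $FFFF6040 has the enable bit clear, i.e. the register is switched off.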