English Amiga Board


Old 25 August 2007, 20:06   #1
girv
Mostly Harmless
 
Join Date: Aug 2004
Location: Northern Ireland
Posts: 1,115
WHDLoad blitter speed tests - needs you!

A WHDLoad slave that runs some blitter speed tests will be in The Zone shortly. The intention is to investigate one possible cause of the annoying slowdowns many WHDLoad games experience on certain processors. 68040, I'm looking at you.

A sequence of blits, of the "cookie cutter" type typically found in games, is run with different sizes (16x16 to 1008x1008!), different types of blitter waiting code and different settings for the CPU caches. The time taken for each blit is measured using the CIA timers, stored, and finally output to a text file "bspeed.txt" when all tests are complete.
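
For anyone comparing raw timer counts between machines, here is a minimal sketch of turning CIA ticks into microseconds. It assumes the CIA timers count the E clock (about 709379 Hz on PAL machines and 715909 Hz on NTSC, independent of accelerator speed). The helper name and the sample tick count are made up for illustration, and the units actually written to bspeed.txt are not stated in this thread.

```python
# CIA timer input clock (E clock) in Hz for each video standard.
CIA_HZ = {"PAL": 709_379, "NTSC": 715_909}

def ticks_to_microseconds(ticks, video="PAL"):
    """Convert a CIA tick count to microseconds (hypothetical helper)."""
    return ticks * 1_000_000 / CIA_HZ[video]

# Illustrative value only, not a measured result.
print(round(ticks_to_microseconds(1000, "PAL"), 1))
```

The same tick count means a slightly shorter time on an NTSC machine, which is why PAL and NTSC results should not be compared tick for tick.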

The zip file in The Zone also includes some Excel spreadsheets containing the results from my A1200-060 and some pretty graphs to help analysis.

So here's your bit: run "WHDLoad BlitterSpeed.slave" and post, PM or email me your bspeed.txt files along with your machine specifications. AGA users can also see the effect of the FMODE register on blitter speed by passing the CUSTOM1 parameter to the slave with a value of 0, 1, 2 or 3.

You will need a 68020 or better, 0.5 MB chip RAM and ~2 MB fast RAM in order to run the tests. They will take 5-10 minutes to complete and the screen will flash colours to let you know it's still running. I'm especially interested in 68040 results, but the more the merrier!
Old 25 August 2007, 21:13   #2
Graham Humphrey
Moderator
 
Join Date: Jul 2004
Location: Norwich, Norfolk, UK
Age: 37
Posts: 11,167
Will test it for you either tonight or tomorrow...
Old 25 August 2007, 21:33   #3
Mad-Matt
Longplayer
 
Join Date: Jan 2005
Location: Lincoln / UK
Age: 44
Posts: 1,852
Have set the test going on my BPPC 040/25.

Config is A1200-BlizzPPC 040/25, Mediator+Voodoo3/net/sb128/tv.
OS39 + custom OS39 ROM including the HOGWAITBLIT patch, if that makes a difference. As the miggy hasn't been on for so long, the battery has gone flat and lost its settings, so I have just forced 60ns memory mode from the command line since I can't see the boot menu to set it.

CHIPNOCACHE WHDLoad option set for compatibility. One normal and 4 CUSTOM1 tests.

Edit: 1 normal test added with chip cacheability enabled, just in case it affects anything.
Attached Files
File Type: txt bspeed-Custom1=0.txt (1.5 KB, 269 views)
File Type: txt bspeed-Custom1=1.txt (1.5 KB, 240 views)
File Type: txt bspeed-Custom1=2.txt (1.5 KB, 251 views)
File Type: txt bspeed-Custom1=3.txt (1.5 KB, 259 views)
File Type: txt bspeedCHIPCACHEEnabled.txt (1.5 KB, 266 views)

Last edited by Mad-Matt; 27 February 2021 at 14:32.
Old 26 August 2007, 07:12   #4
mrodfr
Registered User
 
Join Date: Jan 2005
Location: 62-France
Age: 56
Posts: 413
Hello,

Tested on a1200dbox + 1260/50 / 48 MB fast + P96 + AOS 3.9 BB2.
BlizKick enabled with some modules running. custom=1.

I just ran WHDLoad BlitterSpeed.slave and waited for it to finish.

Interested to know the result.
Attached Files
File Type: txt bspeed.txt (1.5 KB, 259 views)
Old 26 August 2007, 10:18   #5
Shoonay
Global Caturator
 
Join Date: Aug 2004
Location: Porando
Age: 43
Posts: 6,107
Cool, will run the test and I do hope it'll help solve the slowdown problem, tho I'm on 030

---=== EDITED ===---
Done...

Tested on Amiga 1200 with Blizzard 030/50MHz 32MB; OS 3.9 BB2 with FBlit & SystemPatch

Last edited by Shoonay; 13 May 2008 at 12:47.
Old 26 August 2007, 12:19   #6
derringer
Registered User
 
Join Date: Aug 2007
Location: Budapest/Hungary
Posts: 13
My config:

A1200 Bppc 68040/33MHz /603+/233MHz, 2Mb Chip 128MB Fast Ram, Mediator/Voodoo etc.

Chip RAM cacheability is disabled in the WHDLoad config for compatibility.

OS: 3.9, Kickstart 3.1 with blitzkick

Hope I can help you.
Attached Files
File Type: txt bspeed.txt (1.5 KB, 262 views)
Old 26 August 2007, 14:16   #7
Graham Humphrey
Moderator
 
Join Date: Jul 2004
Location: Norwich, Norfolk, UK
Age: 37
Posts: 11,167
Okay, done, courtesy of my A1200 with Apollo 040/33, and 32MB Fast RAM. WHDLoad 16.8 was used, and no tooltypes were set. If you want me to test it with any tooltypes just give me a shout.
Attached Files
File Type: txt bspeed.txt (1.5 KB, 265 views)
Old 26 August 2007, 19:42   #8
girv
Mostly Harmless
 
Join Date: Aug 2004
Location: Northern Ireland
Posts: 1,115
@madmatt & Graham: do you run PAL Amigas ?
Old 26 August 2007, 20:09   #9
Graham Humphrey
Moderator
 
Join Date: Jul 2004
Location: Norwich, Norfolk, UK
Age: 37
Posts: 11,167
Yep.
Old 26 August 2007, 20:58   #10
keropi
.
 
Join Date: Oct 2004
Location: Ioannina/Greece
Posts: 5,040
Here are my results: PAL, A4000D and CS-PPC 060/50 MHz... no tooltypes whatsoever... chipmem cache disabled, as usual for 040/060...

Attached Files
File Type: txt BSPEED.TXT (1.5 KB, 260 views)
Old 26 August 2007, 21:24   #11
Mad-Matt
Longplayer
 
Join Date: Jan 2005
Location: Lincoln / UK
Age: 44
Posts: 1,852
Quote:
Originally Posted by girv View Post
@madmatt & Graham: do you run PAL Amigas ?
yes
Old 27 August 2007, 00:49   #12
girv
Mostly Harmless
 
Join Date: Aug 2004
Location: Northern Ireland
Posts: 1,115
...and the results are in

Firstly, thanks all for taking the time to run my little test.

From the results, it seems the standard blitter wait code supplied with WHDLoad is the best of the bunch. It may be possible to improve on it (I don't know) but it's not a bad start in any case and also proves Wepl's idea to avoid blitter slowdown was right. Not that that was really ever in doubt.

I've attached a spreadsheet to this message containing some further analysis of the times reported for the 'A' code, i.e. the standard WHDLoad blitter wait. There are some interesting results...
  • With caches enabled, running in chipmem instead of fastmem costs roughly 10-20% blitter speed on 040 but may cost over 30% if the blits are small; 060 sees no slowdown and 030 sees a small slowdown for small blits only.
  • With caches disabled, running in chipmem instead of fastmem, all processors see a blitter slowdown of 10-20% but 040 can get a massive 45% loss of blitter speed on small blits.
  • Running in fastmem, disabling caches costs 8% blitter speed on 060 for small blits, and smaller amounts on other processors. I'm not sure why this is - best guess is that the CPU is running slower and hitting the dmacon register (to test for blit done) less often, so interfering less with the blitter DMA; this may also explain why slower processors see less of an effect from disabling the caches, but doesn't explain why the effect lessens with larger blits. Anyone care to take a stab?
  • Running in fastmem, with caches enabled, 040-33 machines will blit faster than 060-50 machines. Again, my best guess is that the CPU is running slower and hitting the dmacon register (to test for blit done) less often. This may suggest an improvement is possible to the standard blitwait code for CPUs faster than 040-33.
  • Running in chipmem, disabling caches costs 10-20% blitter speed on 060; 040 and 030 see no effect, but 040 BPPC actually speeds up! I believe this shows the effect of how the BPPC boards handle chipmem cacheability.
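
The percentages in the list above can be reproduced from any pair of timings. This is a minimal sketch, assuming "blitter speed" means blits per unit time (the inverse of the measured time). Whether the spreadsheet uses this definition or the plain increase in time is an assumption, and the sample timings are invented.

```python
def speed_loss_percent(baseline_time, slower_time):
    """Percentage of blitter speed lost when a blit takes slower_time
    instead of baseline_time (same units). Speed is taken as 1/time."""
    if baseline_time <= 0 or slower_time <= 0:
        raise ValueError("times must be positive")
    return (1.0 - baseline_time / slower_time) * 100.0

# Invented timings: a blit taking 100 ticks from fastmem and 125 from chipmem.
print(speed_loss_percent(100, 125))
```

With this definition a blit that takes 25% longer has lost about 20% of its speed, so it matters which of the two figures is being quoted.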

Some lessons learned then:
  • Use Wepl's blitwait code
  • It might be possible to improve Wepl's blitwait code for 060
  • Move all blitwaits to fastmem (into the slave would be fine). I guess this would apply for all code but blitwaits are obvious targets and easy to move.
  • Enable caches in fastmem
  • Enable caches in chipmem for 060

Perhaps these are obvious, but at least now there are hard numbers to back them up!

These tests just enable or disable the instruction / data caches wholesale. For further investigation it might be a worthwhile experiment to run a test with the two caches enabled & disabled separately and with the data cache running in different modes, especially for 040 machines running out of chipmem.

But whatever you do, don't take my word for this. Look at the numbers yourself and see if you agree!
Attached Files
File Type: zip compare2.zip (7.0 KB, 246 views)
Old 27 August 2007, 12:40   #13
Codetapper
2 contact me: email only!
 
Join Date: May 2001
Location: Auckland / New Zealand
Posts: 3,187
It would be nice if there was an easy way to do the equivalent of a jsr BlitWait directly into WHDLoad running in fast memory - and WHDLoad can put the best patch for each processor rather than putting the code into each slave.

The global blitter code could be tweaked for each CPU and the poor slave programmer doesn't have to worry about which blitwait code to run! And only one place to update to improve it.
Old 27 August 2007, 13:04   #14
musashi5150
move.w #$4489,$dff07e
 
Join Date: Sep 2005
Location: Norfolk, UK
Age: 42
Posts: 2,351
That's a great idea, Codetapper. WHDLoad could determine which CPU type the machine has at startup and then any "jsr resload_BlitWait" would use the most efficient routine automatically. Clever.
Old 27 August 2007, 13:52   #15
Toni Wilen
WinUAE developer
 
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,534
Can I see the source code please?
Old 27 August 2007, 16:30   #16
dlfrsilver
CaptainM68K-SPS France
 
Join Date: Dec 2004
Location: Melun nearby Paris/France
Age: 46
Posts: 10,474
Hi girv, here is mine, a bit late.
Attached Files
File Type: txt bspeed.txt (1.5 KB, 260 views)
Old 27 August 2007, 23:00   #17
Wepl
Moderator
 
Join Date: Nov 2001
Location: Germany
Posts: 869
First, Girv, many thanks for investigating this topic. It will help to clear things up, to make the right patches and to optimize the installs.

Quote:
Originally Posted by girv View Post
From the results, it seems the standard blitter wait code supplied with WHDLoad is the best of the bunch. It may be possible to improve on it (I don't know) but it's not a bad start in any case and also proves Wepl's idea to avoid blitter slowdown was right. Not that that was really ever in doubt

I've attached a spreadsheet to this message containing some further analysis of the times reported for the 'A' code ie: the standard WHDLoad blitter wait.
Can you please tell what wait routine is used in the cases B-F?
Best would be to send the slave source. I would also like to include the slave in whdload-dev package because I think it is of general interest.

I would also like to add a wait routine using the stop instruction, which should be the fastest, although it may be hard to use for the general patch case (routine not tested by me, small corrections may be required...):
Code:
init:
 move.w #$7fff,_custom+intena            ;disable all interrupts
 lea    int,a0
 move.l a0,$6c                           ;level 3 autovector (VBR assumed 0)
 move.w #INTF_SETCLR|INTF_BLIT|INTF_INTEN,_custom+intena ;enable blitter-finished interrupt only
 ...

int:
 move.w #INTF_BLIT,_custom+intreq        ;acknowledge blitter interrupt
 tst.w  _custom+intreqr                  ;dummy read, presumably so the write lands before rte
 rte

blitwait:
 stop   #$2000                           ;sleep until an interrupt arrives
 btst   #DMAB_BLTDONE-8,_custom+dmaconr  ;should be obsolete, but for security
 bne    blitwait                         ;blitter still busy, wait again
 move   #$2700,sr                        ;avoid interrupt occurring before stop
 rts
One way to make the results more reliable is to do n iterations for the small blits. Also, the overhead of the blitwait routine itself may become important if there are many small blits.

Another question is how many DMA channels are used in your examples. I think this will have a noticeable effect on the results. I would expect much more influence from the blitwait routine if all 4 channels are used and less impact if only one or two channels are used.

The next point is the impact of BLITHOG/BLTPRI.

Quote:
Originally Posted by girv View Post
  • With caches enabled, running in chipmem instead of fastmem costs roughly 10-20% blitter speed on 040 but may cost over 30% if the blits are small; 060 sees no slowdown and 030 sees a small slowdown for small blits only.
The blitwait routine should fit into the cache of 020..060. So there should be no difference if the code is in chip or fast. If the 040 has a slowdown here this can only mean that this is a board which cannot cache chipmem. A small slowdown on the 030 means probably that your other code has flushed the cache. Looping n times for the small blits should eliminate this.

Quote:
Originally Posted by girv View Post
  • With caches disabled, running in chipmem instead of fastmem, all processors see a blitter slowdown of 10-20% but 040 can get a massive 45% loss of blitter speed on small blits.
  • Running in fastmem, disabling caches costs 8% blitter speed on 060 for small blits, and smaller amounts on other processors. I'm not sure why this is - best guess is that the CPU is running slower and hitting the dmacon register (to test for blit done) less often, so interfering less with the blitter DMA; this may also explain why slower processors see less of an effect from disabling the caches, but doesn't explain why the effect lessens with larger blits. Anyone care to take a stab?
  • Running in fastmem, with caches enabled, 040-33 machines will blit faster than 060-50 machines. Again, my best guess is that the CPU is running slower and hitting the dmacon register (to test for blit done) less often. This may suggest an improvement is possible to the standard blitwait code for CPUs faster than 040-33.
That was one of my main questions: does reading dmaconr (or any other custom register) slow down the blitter? That accessing chip mem slows it down is something you can read in the KRM.
It seems so. To confirm this, please try the wait routine using the stop instruction. I would expect it to give the same speed (at least for the large blits) on all machines.

Quote:
Originally Posted by girv View Post
  • Running in chipmem, disabling caches costs 10-20% blitter speed on 060; 040 and 030 see no effect, but 040 BPPC actually speeds up! I believe this shows the effect of how the BPPC boards handle chipmem cacheability.
yep.

Quote:
Originally Posted by girv View Post
Some lessons learned then:
  • Use Wepl's blitwait code
  • It might be possible to improve Wepl's blitwait code for 060
  • Move all blitwaits to fastmem (into the slave would be fine). I guess this would apply for all code but blitwaits are obvious targets and easy to move.
  • Enable caches in fastmem
  • Enable caches in chipmem for 060
Caches in fastmem/slave are on by default.
Enabling caches in chipmem may cause other problems; also, there are boards out there (040/060) which do not support cacheability of chipmem (CHIPNOCACHE must be used).

Quote:
Originally Posted by girv View Post
These tests just enable or disable the instruction / data caches wholesale. For further investigation it might be a worthwhile experiment to run a test with the two caches enabled & disabled separately and with the data cache running in different modes, especially for 040 machines running out of chipmem.
data cache should have no effect here because there is nothing to be cached.

Quote:
Originally Posted by Codetapper View Post
It would be nice if there was an easy way to do the equivalent of a jsr BlitWait directly into WHDLoad running in fast memory - and WHDLoad can put the best patch for each processor rather than putting the code into each slave.

The global blitter code could be tweaked for each CPU and the poor slave programmer doesn't have to worry about which blitwait code to run! And only one place to update to improve it.
Yes, if we have a blitwait routine which makes a useful difference on different setups. From the results I don't see much difference between the routines.
Old 27 August 2007, 23:01   #18
girv
Mostly Harmless
 
Join Date: Aug 2004
Location: Northern Ireland
Posts: 1,115
Source code is now in The Zone. It's probably best if this is reviewed anyway. Do with it what you will.

It does contain a pretty clear cut example of how to use the CIA timers in linked mode to time longer periods than one 16 bit counter will allow. This is the first serious use of the CIAs I've ever made and I'm quite pleased with how it's gone.
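
For readers who have not used linked timers, here is a minimal sketch of the arithmetic only. It is not taken from the slave source. It assumes both timers are loaded with $FFFF, timer A counts down on the CIA clock and timer B counts timer A underflows, so the pair behaves like one 32-bit down-counter. The function name is hypothetical, and exact off-by-one behaviour at underflow is ignored.

```python
def elapsed_ticks(timer_a, timer_b, start=0xFFFF):
    """Ticks elapsed since both 16-bit down-counters were loaded with start.
    timer_b counts timer_a underflows, so it forms the high 16 bits."""
    for value in (timer_a, timer_b):
        if not 0 <= value <= 0xFFFF:
            raise ValueError("timer values are 16-bit")
    high = start - timer_b   # underflows of timer A so far
    low = start - timer_a    # ticks into the current timer A period
    return high * (start + 1) + low

# Illustrative readings only: one full timer A period plus 16 ticks.
print(elapsed_ticks(0xFFEF, 0xFFFE))
```

On real hardware the two counters cannot be read in one access, so a reader would normally re-read timer B after timer A and retry if it changed in between.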

I've also attached results from my Blizzard 1260 50 MHz card, running versions of the standard WHDLoad BLITWAIT macro with different numbers (0-5) of "tst.b _ciaa" in the loop. Blitwait loops were running from fastmem with all caches enabled. As you can see, for blits sized 16x16 - 64x64, which I guess are the most common found in games, there is a definite 10+% blitter speedup to be had on 060 machines by including 4 or 5 "tst.b _ciaa" instructions instead of the standard 2. Adding more increases performance on larger blits too, but the gains aren't so dramatic, so perhaps 4 or 5 is the best compromise for these machines?
Attached Files
File Type: zip 060.zip (3.4 KB, 230 views)
Old 27 August 2007, 23:19   #19
girv
Mostly Harmless
 
Join Date: Aug 2004
Location: Northern Ireland
Posts: 1,115
Quote:
Originally Posted by Wepl View Post
First Girv, many thanks for doing the investigation on this topic, this will help to clear things, doing the right patches and help to optimize the installs.
No probs. You did mention it needed looking into.

Quote:
Originally Posted by Wepl View Post
Can you please tell what wait routine is used in the cases B-F?
Best would be to send the slave source. I would also like to include the slave in whdload-dev package because I think it is of general interest.
Source is now in The Zone, but for reference (B) used no delaying instructions, (C) and (D) used NOPS, (E) used a short integer instruction and (F) used a long integer instruction (divs.l). I was trying to see if keeping the CPU busy and off the memory buses altogether would speed up blitter operations.

Quote:
Originally Posted by Wepl View Post
I also like to add a wait routine using the stop instruction which should be the fastest.
Good idea

Quote:
Originally Posted by Wepl View Post
A point to make the results more reliable is that you should do n iterations for the small blits.
Each blit is performed 64 times and the results averaged at the end.

Quote:
Originally Posted by Wepl View Post
Another topic is how much dma channels are used is your examples? I think this will have a noticeable effect on the results.

Next point is the impact of BLITHOG/BLTPRI?
I used all four channels to simulate the "cookie cut" blit operation used heavily in games (I think). I didn't look at BLTPRI, or audio DMA.

Quote:
Originally Posted by Wepl View Post
The blitwait routine should fit into the cache of 020..060. So there should be no difference if the code is in chip or fast. If the 040 has a slowdown here this can only mean that this is a board which cannot cache chipmem.
I guess the rule is to move the blitwait loops to fastmem in any case?

Quote:
Originally Posted by Wepl View Post
enabling caches in chipmem may cause other problems, also there are boards out there (40/60) which do not support cachability of chipmem (CHIPNOCACHE must be used).
CHIPNOCACHE would override the setup in the slave, right? So if the slave sets the optimal caches for most boards, the owners of non-caching boards can override the defaults to suit their hardware?

Quote:
Originally Posted by Wepl View Post
data cache should have no effect here because there is nothing to be cached.
Yep, you are correct. I realised this too late last night.

Quote:
Originally Posted by Wepl View Post
Yes, if we have a blitwait routine which makes a useful difference on different setups.
10-15% on 060 good enough?
Old 28 August 2007, 00:01   #20
girv
Mostly Harmless
 
Join Date: Aug 2004
Location: Northern Ireland
Posts: 1,115
Quote:
Originally Posted by Wepl View Post
It seems so. To prove that more please try the wait routine using the stop instruction. I would expect that using this gives the same speed (at least for the large blits) on all machines.
I did a quick test on my 060: the STOP routine is quicker for 128x128, gaining about 6-8% over the standard WHDLoad blitwait with caches on and 15-20% with caches off in chipmem. It's much slower for smaller sizes - I guess it's the interrupt raising/handling overhead.
 

