WHDLoad blitter speed tests - needs you!

girv · 25 August 2007, 20:06

In The Zone shortly will be a WHDLoad slave that does some blitter speed tests. The intention is to investigate one possible cause of the annoying slowdowns many WHDLoad games experience on certain processors. 68040, I'm looking at you.

A sequence of blits, of the "cookie cutter" type typically found in games, is run with different sizes (16x16 to 1008x1008!), different types of blitter waiting code and different settings for the CPU caches. The time taken for each blit is measured using the CIA timers, stored, and finally output to a text file "bspeed.txt" when all tests are complete.

The zip file in The Zone also includes some Excel spreadsheets containing the results from my A1200-060 and some pretty graphs to help analysis.

So here's your bit: run "WHDLoad BlitterSpeed.slave" and post, PM or email me your bspeed.txt files along with your machine specifications. AGA users can also see the effect of the FMODE register on blitter speed by passing the CUSTOM1 parameter to the slave with values of 0,1,2 or 3.

You will need a 68020 or better, 0.5MB chipram and ~2MB fastram in order to run the tests. They will take 5-10 minutes to complete and the screen will flash colours to let you know it's still running. I'm especially interested in 68040 results, but the more the merrier!

Graham Humphrey · 25 August 2007, 21:13

Will test it for you either tonight or tomorrow...

Mad-Matt · 25 August 2007, 21:33

Have set the test going on my BPPC 040/25.

Config is A1200-BlizzPPC 040/25, Mediator+Voodoo3/net/sb128/tv.
OS39 + custom OS39 ROM including the HOGWAITBLIT patch, if that makes a difference. As the miggy hasn't been on for so long, the battery has flattened and lost its settings, so I have just forced 60ns memory mode from the command line since I can't see the boot menu to set it.

CHIPNOCACHE whd option set for compatibility. One normal and 4 custom1 tests.

Edit : 1 normal test added with Chip cachability enabled just in case it affects anything

mrodfr · 26 August 2007, 07:12

hello,

tested on a1200dbox+1260/50/48MB fast+P96+aos3.9bb2.
blizzkick enabled with some modules running. custom=1.

just run WHDLoad BlitterSpeed.slave and wait for the end.

interesting to know the result.

Shoonay · 26 August 2007, 10:18

Cool, will run the test and I do hope it'll help solve the slowdown problem, tho I'm on 030

---=== EDITED ===---
Done...

Tested on Amiga 1200 with Blizzard 030/50MHz 32MB; OS 3.9 BB2 with FBlit & SystemPatch

derringer · 26 August 2007, 12:19

My config:

A1200 Bppc 68040/33MHz /603+/233MHz, 2Mb Chip 128MB Fast Ram, Mediator/Voodoo etc.

Chip ram cacheable disabled in whdload config for compatibility.

OS: 3.9, Kickstart 3.1 with blitzkick

Hope I can help you.

Graham Humphrey · 26 August 2007, 14:16

Okay, done, courtesy of my A1200 with Apollo 040/33, and 32MB Fast RAM. WHDLoad 16.8 was used, and no tooltypes were set. If you want me to test it with any tooltypes just give me a shout.

girv · 26 August 2007, 19:42

@madmatt & Graham: do you run PAL Amigas ?

Graham Humphrey · 26 August 2007, 20:09

Yep.

keropi · 26 August 2007, 20:58

here are my results, PAL, A4000D and cs-ppc 060/50mhz... no tooltypes whatsoever... chipmem cache disabled as usual for 040/060...

Mad-Matt · 26 August 2007, 21:24

Quote:

Originally Posted by girv

@madmatt & Graham: do you run PAL Amigas ?

yes

girv · 27 August 2007, 00:49

Firstly, thanks all for taking the time to run my little test.

From the results, it seems the standard blitter wait code supplied with WHDLoad is the best of the bunch. It may be possible to improve on it (I don't know) but it's not a bad start in any case and also proves Wepl's idea to avoid blitter slowdown was right. Not that that was really ever in doubt.

I've attached a spreadsheet to this message containing some further analysis of the times reported for the 'A' code, i.e. the standard WHDLoad blitter wait. There are some interesting results...

With caches enabled, running in chipmem instead of fastmem costs roughly 10-20% blitter speed on 040 but may cost over 30% if the blits are small; 060 sees no slowdown and 030 sees a small slowdown for small blits only.
With caches disabled, running in chipmem instead of fastmem, all processors see a blitter slowdown of 10-20% but 040 can get a massive 45% loss of blitter speed on small blits.
Running in fastmem, disabling caches costs 8% blitter speed on 060 for small blits, and smaller amounts on other processors. I'm not sure why this is - best guess is that the CPU is running slower and hitting the dmacon register (to test for blit done) less often, so interfering less with the blitter DMA; this may also explain why slower processors see less of an effect from disabling the caches, but doesn't explain why the effect lessens with larger blits. Anyone care to take a stab?
Running in fastmem, with caches enabled, 040-33 machines will blit faster than 060-50 machines. Again, my best guess is that the CPU is running slower and hitting the dmacon register (to test for blit done) less often. This may suggest an improvement is possible to the standard blitwait code for CPUs faster than 040-33.
Running in chipmem, disabling caches costs 10-20% blitter speed on 060; 040 and 030 see no effect, but 040 BPPC actually speeds up! I believe this shows the effect of how the BPPC boards handle chipmem cacheability.

Some lessons learned then:

Use Wepl's blitwait code
It might be possible to improve Wepl's blitwait code for 060
Move all blitwaits to fastmem (into the slave would be fine). I guess this would apply for all code but blitwaits are obvious targets and easy to move.
Enable caches in fastmem
Enable caches in chipmem for 060

Perhaps these are obvious, but at least now there are hard numbers to back them up!

These tests just enable or disable the instruction / data caches wholesale. For further investigation it might be a worthwhile experiment to run a test with the two caches enabled & disabled separately and with the data cache running in different modes, especially for 040 machines running out of chipmem.

But whatever you do, don't take my word for this. Look at the numbers yourself and see if you agree!

Codetapper · 27 August 2007, 12:40

It would be nice if there was an easy way to do the equivalent of a jsr BlitWait directly into WHDLoad running in fast memory - and WHDLoad can put the best patch for each processor rather than putting the code into each slave.

The global blitter code could be tweaked for each CPU and the poor slave programmer doesn't have to worry about which blitwait code to run! And only one place to update to improve it.

musashi5150 · 27 August 2007, 13:04

That's a great idea Codetapper

WHDLoad could determine which CPU type the machine has at startup and then any "jsr resload_BlitWait" would use the most efficient routine automatically. Clever

Toni Wilen · 27 August 2007, 13:52

Can I see the source code please?

dlfrsilver · 27 August 2007, 16:30

hi girv here is mine a bit late

Wepl · 27 August 2007, 23:00

First Girv, many thanks for doing the investigation on this topic, this will help to clear things up, to make the right patches and to optimize the installs.

Quote:

Originally Posted by girv

From the results, it seems the standard blitter wait code supplied with WHDLoad is the best of the bunch. It may be possible to improve on it (I don't know) but it's not a bad start in any case and also proves Wepl's idea to avoid blitter slowdown was right. Not that that was really ever in doubt

I've attached a spreadsheet to this message containing some further analysis of the times reported for the 'A' code ie: the standard WHDLoad blitter wait.

Can you please tell me which wait routine is used in the cases B-F?
Best would be to send the slave source. I would also like to include the slave in the whdload-dev package because I think it is of general interest.

I would also like to add a wait routine using the stop instruction, which should be the fastest, although it may be hard to use it for the general patch case (routine not tested by me, small corrections may be required...):

Code:

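init:
 move.w #$7fff,_custom+intena            ;disable all interrupts
 lea    int,a0
 move.l a0,$6c                           ;level 3 autovector (assumes VBR=0)
 move.w #INTF_SETCLR|INTF_BLIT|INTF_INTEN,_custom+intena
 ...

int:
 move.w #INTF_BLIT,_custom+intreq        ;acknowledge the blitter interrupt
 tst.w  _custom+intreqr                  ;make sure the write has reached the chipset
 rte

blitwait:
 stop   #$2000                           ;sleep until an interrupt arrives
 btst   #DMAB_BLTDONE-8,_custom+dmaconr  ;should be obsolete, but for security
 bne    blitwait                         ;blitter still busy, wait again
 move   #$2700,sr                        ;avoid interrupt occurring before stop
 rts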

A point to make the results more reliable is that you should do n iterations for the small blits. Also the overhead of the blitwait routine itself may become important if there are many small blits.

Another topic is how many DMA channels are used in your examples? I think this will have a noticeable effect on the results. I would expect much more influence of the blitwait routine if all 4 channels are used and less impact if only one or two channels are used.

Next point is the impact of BLITHOG/BLTPRI?

Quote:

Originally Posted by girv

With caches enabled, running in chipmem instead of fastmem costs roughly 10-20% blitter speed on 040 but may cost over 30% if the blits are small; 060 sees no slowdown and 030 sees a small slowdown for small blits only.

The blitwait routine should fit into the cache of 020..060. So there should be no difference if the code is in chip or fast. If the 040 has a slowdown here this can only mean that this is a board which cannot cache chipmem. A small slowdown on the 030 probably means that your other code has flushed the cache. Looping n times for the small blits should eliminate this.

Quote:

Originally Posted by girv

With caches disabled, running in chipmem instead of fastmem, all processors see a blitter slowdown of 10-20% but 040 can get a massive 45% loss of blitter speed on small blits.
Running in fastmem, disabling caches costs 8% blitter speed on 060 for small blits, and smaller amounts on other processors. I'm not sure why this is - best guess is that the CPU is running slower and hitting the dmacon register (to test for blit done) less often, so interfering less with the blitter DMA; this may also explain why slower processors see less of an effect from disabling the caches, but doesn't explain why the effect lessens with larger blits. Anyone care to take a stab?
Running in fastmem, with caches enabled, 040-33 machines will blit faster than 060-50 machines. Again, my best guess is that the CPU is running slower and hitting the dmacon register (to test for blit done) less often. This may suggest an improvement is possible to the standard blitwait code for CPUs faster than 040-33.

That was one of my main questions: does reading the dmaconr (or any other custom register) slow down the blitter? That accessing chip mem slows it down is documented in the KRM.
It seems so. To confirm this, please try the wait routine using the stop instruction. I would expect that using this gives the same speed (at least for the large blits) on all machines.

Quote:

Originally Posted by girv

Running in chipmem, disabling caches costs 10-20% blitter speed on 060; 040 and 030 see no effect, but 040 BPPC actually speeds up! I believe this shows the effect of how the BPPC boards handle chipmem cacheability.

yep.

Quote:

Originally Posted by girv

Some lessons learned then:

Use Wepl's blitwait code
It might be possible to improve Wepl's blitwait code for 060
Move all blitwaits to fastmem (into the slave would be fine). I guess this would apply for all code but blitwaits are obvious targets and easy to move.
Enable caches in fastmem
Enable caches in chipmem for 060

caches in fastmem/slave are default on
enabling caches in chipmem may cause other problems, also there are boards out there (40/60) which do not support cachability of chipmem (CHIPNOCACHE must be used).

Quote:

Originally Posted by girv

These tests just enable or disable the instruction / data caches wholesale. For further investigation it might be a worthwhile experiment to run a test with the two caches enabled & disabled separately and with the data cache running in different modes, especially for 040 machines running out of chipmem.

data cache should have no effect here because there is nothing to be cached.

Quote:

Originally Posted by Codetapper

It would be nice if there was an easy way to do the equivalent of a jsr BlitWait directly into WHDLoad running in fast memory - and WHDLoad can put the best patch for each processor rather than putting the code into each slave.

The global blitter code could be tweaked for each CPU and the poor slave programmer doesn't have to worry about which blitwait code to run! And only one place to update to improve it.

Yes, if we have a blitwait routine which makes a useful difference on different setups. From the results I don't see much difference between the routines.

girv · 27 August 2007, 23:01

Source code is now in The Zone. It's probably best if this is reviewed anyway

Do with it what you will.

It does contain a pretty clear cut example of how to use the CIA timers in linked mode to time longer periods than one 16 bit counter will allow. This is the first serious use of the CIAs I've ever made and I'm quite pleased with how it's gone.
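
For anyone without the source to hand, the linked-timer idea can be sketched roughly as below. This is only an illustration and not the code from the slave: it assumes CIA-B is free for the test, and the labels timer_start and timer_stop are made up for the example. Timer B is set to count timer A underflows, so the pair behaves like one 32 bit down counter clocked by the E clock.

```asm
_ciab		equ	$bfd000		; CIA-B base (assumed unused by anything else)
ciatalo		equ	$400		; timer A low byte
ciatahi		equ	$500		; timer A high byte
ciatblo		equ	$600		; timer B low byte
ciatbhi		equ	$700		; timer B high byte
ciacra		equ	$e00		; control register A
ciacrb		equ	$f00		; control register B

; start a 32 bit count down from $ffffffff
timer_start:
	lea	_ciab,a0
	move.b	#0,ciacra(a0)		; stop timer A
	move.b	#0,ciacrb(a0)		; stop timer B
	move.b	#$ff,ciatalo(a0)	; latch $ffff into timer A
	move.b	#$ff,ciatahi(a0)
	move.b	#$ff,ciatblo(a0)	; latch $ffff into timer B
	move.b	#$ff,ciatbhi(a0)
	move.b	#%01010001,ciacrb(a0)	; B: count A underflows, LOAD, START
	move.b	#%00010001,ciacra(a0)	; A: count E clock, continuous, LOAD, START
	rts

; stop the timers and return the elapsed E clock ticks in d0
timer_stop:
	lea	_ciab,a0
	move.b	#0,ciacra(a0)		; stop A first, B then stops ticking too
	moveq	#0,d0
	move.b	ciatbhi(a0),d0		; upper 16 bits come from timer B
	lsl.w	#8,d0
	move.b	ciatblo(a0),d0
	swap	d0
	move.b	ciatahi(a0),d0		; lower 16 bits come from timer A
	lsl.w	#8,d0
	move.b	ciatalo(a0),d0
	not.l	d0			; counters run downwards from $ffffffff
	rts
```

Stopping timer A before reading avoids the case where A underflows between the two reads and the halves no longer belong together.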

I've also attached results from my Blizzard 1260 50MHz card, running versions of the standard WHDLoad BLITWAIT macro with different numbers (0-5) of "tst.b _ciaa" in the loop. Blitwait loops were running from fastmem with all caches enabled. As you can see, for blits sized 16x16 - 64x64, which I guess are the most common found in games, there is a definite 10+% blitter speedup to be had on 060 machines by including 4 or 5 "tst.b _ciaa" instructions instead of the standard 2. Adding more increases performance on larger blits too but the gains aren't so dramatic, so perhaps 4 or 5 is the best compromise for these machines?
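
A wait loop with four CIA reads might look like the sketch below. It follows the general shape of a dmaconr polling loop with "tst.b _ciaa" delays as described above rather than being copied from WHDLoad; the macro name BLITWAIT4 is hypothetical, and the equates are written out only so the fragment stands alone (normally they come from the hardware includes).

```asm
_custom		equ	$dff000
_ciaa		equ	$bfe001
dmaconr		equ	$002
DMAB_BLTDONE	equ	14		; blitter busy bit in dmaconr

BLITWAIT4	MACRO
	tst.b	(_custom+dmaconr)		; dummy read before testing the busy bit
	btst	#DMAB_BLTDONE-8,(_custom+dmaconr)
	beq.b	.done\@				; blitter already idle
.wait\@	tst.b	(_ciaa)				; slow CIA reads keep the CPU
	tst.b	(_ciaa)				; away from the chip bus between polls
	tst.b	(_ciaa)
	tst.b	(_ciaa)
	btst	#DMAB_BLTDONE-8,(_custom+dmaconr)
	bne.b	.wait\@				; still busy, poll again
.done\@
	ENDM
```

Each CIA access is tied to the slow E clock, so more of them means fewer dmaconr polls per blit, which is the effect being measured here.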

girv · 27 August 2007, 23:19

Quote:

Originally Posted by Wepl

First Girv, many thanks for doing the investigation on this topic, this will help to clear things, doing the right patches and help to optimize the installs.

No probs

You did mention it needed looking into

Quote:

Originally Posted by Wepl

Can you please tell what wait routine is used in the cases B-F?
Best would be to send the slave source. I would also like to include the slave in whdload-dev package because I think it is of general interest.

Source is now in The Zone, but for reference (B) used no delaying instructions, (C) and (D) used NOPS, (E) used a short integer instruction and (F) used a long integer instruction (divs.l). I was trying to see if keeping the CPU busy and off the memory buses altogether would speed up blitter operations.

Quote:

Originally Posted by Wepl

I would also like to add a wait routine using the stop instruction, which should be the fastest.

Good idea

Quote:

Originally Posted by Wepl

A point to make the results more reliable is that you should do n iterations for the small blits.

Each blit is performed 64 times and the results averaged at the end.

Quote:

Originally Posted by Wepl

Another topic is how many DMA channels are used in your examples? I think this will have a noticeable effect on the results.

Next point is the impact of BLITHOG/BLTPRI?

I used all four channels to simulate the "cookie cut" blit operation used heavily in games (I think). I didn't look at BLTPRI, or audio DMA.
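
For reference, a four-channel cookie-cut blit is typically set up along these lines. This is a sketch and not the test code itself: the label cookie_blit, the register usage (a0-a2) and the 40 byte wide destination are assumptions made for the example, and all pointers must be in chipmem.

```asm
_custom		equ	$dff000
bltcon0		equ	$040
bltafwm		equ	$044
bltcpth		equ	$048
bltbpth		equ	$04c
bltapth		equ	$050
bltdpth		equ	$054
bltsize		equ	$058
bltcmod		equ	$060
bltbmod		equ	$062
bltamod		equ	$064
bltdmod		equ	$066

; a0 = mask (channel A), a1 = bob image (channel B),
; a2 = position in the background (channels C and D)
; the caller must have waited for the blitter to be idle
cookie_blit:
	lea	_custom,a5
	move.l	#$0fca0000,bltcon0(a5)	; bltcon0=$0fca (A,B,C,D on, D=AB+/AC), bltcon1=0
	move.l	#$ffffffff,bltafwm(a5)	; first and last word masks: keep all bits
	move.l	a0,bltapth(a5)		; A: mask
	move.l	a1,bltbpth(a5)		; B: image
	move.l	a2,bltcpth(a5)		; C: background read
	move.l	a2,bltdpth(a5)		; D: background write
	move.w	#0,bltamod(a5)		; mask and image are exactly one word wide
	move.w	#0,bltbmod(a5)
	move.w	#40-2,bltcmod(a5)	; skip the rest of each 40 byte line
	move.w	#40-2,bltdmod(a5)
	move.w	#$0401,bltsize(a5)	; 16 lines x 1 word, writing this starts the blit
	rts
```

With all four channels active the blitter takes the most chip bus cycles per word, so this is the case where CPU accesses during the wait loop should compete with it the most.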

Quote:

Originally Posted by Wepl

The blitwait routine should fit into the cache of 020..060. So there should be no difference if the code is in chip or fast. If the 040 has a slowdown here this can only mean that this is a board which cannot cache chipmem.

I guess the rule is to move the blitwait loops to fastmem in any case ?

Quote:

Originally Posted by Wepl

enabling caches in chipmem may cause other problems, also there are boards out there (40/60) which do not support cachability of chipmem (CHIPNOCACHE must be used).

CHIPNOCACHE would override the setup in the slave right? So if the slave sets the optimal caches for most boards, the owners of non-caching boards can override the defaults to suit their hardware?

Quote:

Originally Posted by Wepl

data cache should have no effect here because there is nothing to be cached.

Yep, you are correct. I realised this too late last night

Quote:

Originally Posted by Wepl

Yes, if we have a blitwait routine which makes a useful difference on different setups.

10-15% on 060 good enough?

girv · 28 August 2007, 00:01

Quote:

Originally Posted by Wepl

It seems so. To confirm this, please try the wait routine using the stop instruction. I would expect that using this gives the same speed (at least for the large blits) on all machines.

I did a quick test on my 060: the STOP routine is quicker for 128x128, gaining about 6-8% over the standard WHDLoad blitwait with caches on and 15-20% with caches off in chipmem. It's much slower for smaller sizes - I guess it's the interrupt raising/handling overhead.
