English Amiga Board


Old 04 October 2017, 16:30   #1
SpeedGeek
Registered User
Join Date: Dec 2010
Location: Wisconsin USA
Age: 55
Posts: 382
FastCache040+ Released!

FastCache040+ 1.9 ©SpeedGeek 2018

INTRODUCTION:
FastCache040+ is a patch to replace the CachePreDMA() and
CachePostDMA() functions of most 68040/060 libraries. While
the old functions are adequate they are far from optimal.
The old functions have roughly 2x more code than the new ones
provided with this patch (see CODE SIZE COMPARISONS below)!

Also, the new functions implement a much more efficient method
of managing the Copyback cache for DMA. While every system
will have some CPU performance loss under DMA conditions, the
new functions keep this performance loss to a bare minimum.

FEATURES:
- Replaces CachePreDMA() and CachePostDMA() with smaller
and more efficient code
- Replaces complex MMU code with simple and fast DTTR code
- Temporarily changes Copyback mode to Write Through for DMA
(but only when required!). See MEMF_24BIT change for v1.3.
- Never flushes the ATC!
- Never flushes the DC for Chip RAM DMA!
- Uses 68040/060 library detection code
- Will not patch itself
- 100% Assembler code

CODE SIZE COMPARISONS:
- FastCache040+ 1.9 (NewFunc 164 bytes)
- 68060.library 46.7 (OldFunc 304 bytes)
- 68040.library 44.2 (OldFunc 414 bytes)

REQUIREMENTS:
- Amiga with 68040 or 68060 CPU and MMU
- 68040.library or 68060.library

WARNING:
Do NOT use this patch with GigaMEM, VMM or any similar
virtual memory software! Do NOT use this patch with any
code which uses the MMU to write protect or remap modified
data structures!

NOTES:
Remapping a mirror image of the Kickstart ROM with the MMU
is OK! The new functions still have one thing in common with
the old functions. They do NOT translate virtual addresses
as specified in the Amiga RKRM! For more info on the old
functions see the Enforcer.guide by Michael Sinz.

UPDATE:
FastCache040+ v1.7 has been removed. Phase5 68060.library
users should use FixMapP5 before using this patch.

HISTORY:
v1.0 - First release
v1.1 - Fixed a bug which prevented the patch from installing
- Added code to use OldCachePreDMA for MEMF_24BIT
transfers (I don't know why errors occurred here)
v1.2 - Added code to use OldCachePostDMA for MEMF_24BIT
transfers (So MMU Pages can be restored to original)
v1.3 - Added code to change MEMF_24BIT transfers to NoCache.
This eliminated all OldFunc calls. MEMF_24BIT
transfers may have some CPU performance loss but the
NewFunc code performance benefits should still justify
this.
v1.4 - Removed MEMF_24BIT code from PreDMA/PostDMA for the
case of 16 byte aligned transfers. This will allow
some MEMF_24BIT transfers to be cache enabled!
v1.5 - Found an occasional Recoverable Alert bug which could
possibly result in a crash but only on 060 systems!
The simple fix was to move "CINVA NC" in PostDMA to the
end of the code.
- Removed the "+" character from the executable name due
to an unknown "Feature" of the Amiga Shell causing script
execution and version command problems.
v1.6 - Added code to PostDMA to Flush the cache conditionally
(if the Store buffer and cache are enabled). Added NOPs
to sync the pipelines before RTE (CINVA is now obsolete)
v1.6P5 - Removed code to allow PostDMA cache Flush for the case
of 16 byte aligned transfers. Added code to skip PostDMA cache
Flush for the case of cache disabled MEMF_24BIT transfers.
v1.7 - Removed all v1.6P5 PostDMA cache flush code so users can run at full speed!
v1.8 - Reworked the code to eliminate a serious (but seldom
noticed) data transfer corruption bug for the case of multiple
DMA drivers in the same system. Special Thanks to
Ralph Babel for his excellent knowledge on this topic.
v1.9 - Fixed "D2 Register Not Preserved" coding bug in PreDMA.
Most DMA drivers don't seem to need it preserved but
Thanks to Cosmos for reporting it anyway. Moved PostDMA
Nest count code to user section of code. This eliminates
any calls to Supervisor when the count is more than 1.
v1.9BR - Added new "Experimental" code which should allow only
DMA targeted 16MB blocks of Fast RAM to change to Write
Through mode. This "In Theory" allows the other 16MB
blocks to remain in Copyback mode. This can only benefit
"Big RAM" systems with 32MB+ of Fast RAM and ONLY when
these systems run apps which use the extra Fast RAM.
WARNING: Use at your own risk!
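
The nest count mentioned for v1.8 and v1.9 can be pictured with a small model. This is only a Python sketch of the idea, not the patch code: the class and attribute names are hypothetical, and the real patch keeps a single word counter (Nest) in 68k assembler. It assumes every counted PreDMA call is later matched by a counted PostDMA call.

```python
class DttNestModel:
    """Toy model of the Nest counter that guards the DTTR override."""

    def __init__(self):
        self.nest = 0              # outstanding DMA transfers that needed the override
        self.dtt_enabled = False   # stands in for DTT0/DTT1 being active
        self.supervisor_calls = 0  # how often supervisor mode was entered

    def pre_dma(self):
        # PreDMA enters supervisor mode on this path, bumps the count
        # and switches the override on.
        self.supervisor_calls += 1
        self.nest += 1
        self.dtt_enabled = True

    def post_dma(self):
        # The count is decremented in user mode; supervisor mode is only
        # entered when the last outstanding transfer has finished.
        self.nest -= 1
        if self.nest == 0:
            self.supervisor_calls += 1
            self.dtt_enabled = False


# Two DMA drivers with overlapping transfers:
model = DttNestModel()
model.pre_dma()
model.pre_dma()
model.post_dma()          # first driver finishes, override must stay on
still_on = model.dtt_enabled
model.post_dma()          # last driver finishes, override is dropped
```

With two overlapping transfers the override stays active until the second PostDMA, which is the multiple-driver case v1.8 was reworked for.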
Code:
CachePostDMA:	
	MOVE.L  A0,D1
	ANDI.L  #$FFE00000,D1   ;Chip RAM
	BEQ.B	  lbC00002A
	BTST	  #3,D0		;ReadFromRam	
	BNE.B	  lbC00002A
	MOVE.L  A0,D1
	OR.L    (A1),D1
	ANDI.B  #15,D1	        ;16 byte aligned
	BEQ.B	  lbC00002A
	LEA	  Nest(PC),A1
	SUBQ.W  #1,(A1)
	BNE.B	  lbC00002A		
	MOVE.L  A5,-(SP)			
	LEA	  (lbC00004E,PC),A5
	JSR	  (-$1E,A6) 	;Call Supervisor
	MOVE.L  (SP)+,A5

lbC00002A
	RTS

lbC00004E
	MOVEQ   #0,D1
	MOVEC   D1,DTT1		;Disable DTT1	    
	MOVEC   D1,DTT0		;Disable DTT0               
	RTE            

CachePreDMA:
	MOVEM.L A0/A5,-(SP)		
	MOVE.L  A0,D1
	ANDI.L  #$FFE00000,D1   ;Chip RAM
	BEQ.B   lbC000068
	ANDI.B  #$A,D0		;Continue or ReadFromRam
	BNE.B	  lbC000060
	MOVE.L  A0,D1
	OR.L    (A1),D1
	ANDI.B  #15,D1		;16 byte aligned
	BEQ.B	  lbC000060
lbC000054		
	LEA	  (lbC000074,PC),A5
	BRA.B   lbC000064
lbC000060
	LEA	  (lbC000084,PC),A5
lbC000064	 
	JSR	  (-$1E,A6) 	;Call Supervisor
lbC000068
	MOVEM.L (SP)+,A0/A5
	MOVE.L   A0,D0
	RTS

lbC000074
	LEA	   Nest(PC),A1
	ADDQ.W   #1,(A1)
	MOVE.L   #$00008040,D1	;NoCache mode + Serialized      		
	MOVEC   D1,DTT0		;Enable DTT0
	MOVE.L   A0,D1
	ANDI.L   #$FF000000,D1  ;MEMF_24BIT
	BEQ.B	   lbC000084
	MOVE.L   #$00FF8000,D1 	;Cache WT mode + User FC
	MOVEC    D1,DTT1	;Enable DTT1	
lbC000084
	MOVEC    CACR,D1
	BTST     #31,D1         ;Data cache enabled
	BEQ.B    lbC000090
	CPUSHA   DC 		;Flush dirty cache lines 
lbC000090 
	RTE		
Nest:	DC.W	0
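
For readers who do not read 68k assembler, the branch structure of the listing above can be restated as a small host-side model. This is a Python sketch under stated assumptions, not the patch itself: the function names and the returned path labels are hypothetical, and the flag values assume the usual exec definitions (DMAB_Continue = bit 1, DMAB_ReadFromRAM = bit 3), which match the BTST #3 and ANDI.B #$A tests in the listing.

```python
DMA_CONTINUE = 1 << 1        # assumed value of DMA_Continue
DMA_READ_FROM_RAM = 1 << 3   # assumed value of DMA_ReadFromRAM


def is_chip_ram(addr):
    # ANDI.L #$FFE00000,D1 : no address bits set above the low 2MB
    return (addr & 0xFFE00000) == 0


def is_aligned16(addr, length):
    # OR.L (A1),D1 / ANDI.B #15,D1 : address and length both multiples of 16
    return ((addr | length) & 15) == 0


def pre_dma_path(addr, length, flags):
    """Return which PreDMA path the listing would take (labels are hypothetical)."""
    if is_chip_ram(addr):
        return "none"        # no Supervisor call at all
    if flags & (DMA_CONTINUE | DMA_READ_FROM_RAM):
        return "push"        # lbC000084: CPUSHA DC if the data cache is enabled
    if is_aligned16(addr, length):
        return "push"
    return "dtt+push"        # lbC000074: Nest+1, enable DTTRs, then push


def post_dma_path(addr, length, flags):
    """Return which PostDMA path the listing would take."""
    if is_chip_ram(addr):
        return "none"
    if flags & DMA_READ_FROM_RAM:
        return "none"
    if is_aligned16(addr, length):
        return "none"
    return "nest-1"          # Nest-1, DTTRs disabled when the count reaches 0
```

For example, a 512 byte write to $08000004 takes the "dtt+push" path in PreDMA and the "nest-1" path in PostDMA, while the same transfer at $08000000 only pushes the cache.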
Attached Files
File Type: lha FIXMAPP5_12.LHA (3.2 KB, 38 views)
File Type: lha FASTCACHE040+19.LHA (3.0 KB, 26 views)
File Type: lha FASTCACHE040+19BR.LHA (3.2 KB, 24 views)
File Type: lha CACHEDMABENCH11.LHA (2.2 KB, 27 views)

Last edited by SpeedGeek; 21 May 2018 at 19:36.
Old 04 October 2017, 17:57   #2
a/b
Registered User

 
Join Date: Jun 2016
Location: europe
Posts: 63
A couple of suggestions for CachePreDMA (2 bytes shorter and one less branch):
Code:
CachePreDMA:
    MOVEM.L    A0/A5,-(SP)        
    MOVE.L  A0,D1
    ANDI.L  #$FFE00000,D1   ;Chip RAM
    BEQ.B   lbC000068
 LEA    (lbC000084,PC),A5 ; moved from below
    ANDI.B    #$A,D0        ;Continue or ReadFromRam
    BNE.B    lbC000060
    MOVE.L    A0,D1
    OR.L    (A1),D1
    ANDI.B    #15,D1        ;16 Byte aligned
    BEQ.B    lbC000060
    LEA    (lbC000074,PC),A5
; or alternatively (faster on 040, don't know about 060; maybe add.w works better overall):
; LEA    (lbC000074-lbC000084,A5),A5

;;    BRA.B   lbC000064
lbC000060
;;    LEA    (lbC000084,PC),A5
;;lbC000064         
    JSR    (-$1E,A6)    ;Call Supervisor
lbC000068
Old 05 October 2017, 17:20   #3
SpeedGeek
Quote:
Originally Posted by a/b
A couple of suggestions for CachePreDMA (2 bytes shorter and one less branch):
Thanks for the suggestion but saving 2 bytes of code does not result in faster execution in this case. BRA.B is faster than LEA for both 040 and 060 (but for 060 it's even faster with branch prediction).

If you want to make a patch with your suggestion that's OK with me. This patch code obtains most of its performance benefit from more efficient cache management, so small changes in code size or execution speed won't make much of a performance difference anyway.

Last edited by SpeedGeek; 05 October 2017 at 18:21.
Old 05 October 2017, 18:01   #4
kgc210
Registered User

 
Join Date: Jun 2016
Location: Stoke-On-Trent, England
Posts: 425
SpeedGeek

What sort of real life situations would benefit from this patch?
Or does it speed up all uses of the CPU?
Old 05 October 2017, 18:36   #5
SpeedGeek
Quote:
Originally Posted by kgc210
SpeedGeek

What sort of real life situations would benefit from this patch?
Or does it speed up all uses of the CPU?
Any situation where a DMA controller transfers data to Fast RAM. Also, Chip RAM transfers when the driver doesn't handle the fact that Chip RAM is non-cacheable memory (because it expects the old functions to handle this for it).

There are a few benchmark programs (e.g. RSCP, DiskSpeed) which test "CPU Availability" for SCSI DMA transfers. Unfortunately, they pre-date the 68040 CPU and really don't provide any reliable results here.

Last edited by SpeedGeek; 05 October 2017 at 18:50.
Old 05 October 2017, 18:53   #6
kgc210
Ahh ok thanks for the explanation.
Do you think it would help on the Warpengine and A4000T or are their drivers optimised anyway?
Old 06 October 2017, 03:17   #7
SpeedGeek
** NEWS UPDATE **

Sorry, there was a bug in v1.0 with the patch install code.

v1.1 - Fixed a bug which prevented the patch from installing
- Added code to use OldCachePreDMA for MEMF_24BIT
transfers (I don't know why errors occurred here)

Last edited by SpeedGeek; 06 October 2017 at 08:20.
Old 06 October 2017, 15:11   #8
SpeedGeek
** 2ND NEWS UPDATE **

v1.2 released (updated patch size info)
- Added code to use OldCachePostDMA for MEMF_24BIT
transfers (So MMU Pages can be restored to original)
Old 06 October 2017, 15:15   #9
SpeedGeek
OK, I believe I have found a solution to the MEMF_24BIT transfer
error problem without OldPre/OldPost calls. Unfortunately, the cache mode would have to be changed to NoCache.

This would make the NewFunc code a little smaller but could reduce CPU performance a little for MEMF_24BIT transfers.

So it's a trade off situation... will give it some more thought!

Last edited by SpeedGeek; 06 October 2017 at 17:35.
Old 10 October 2017, 22:09   #10
SpeedGeek
** 3RD NEWS UPDATE **

v1.3 Released!
- Added code to change MEMF_24BIT transfers to NoCache.
This eliminated all OldFunc calls. MEMF_24BIT transfers may have
some CPU performance loss but the NewFunc code performance
benefits should still justify this.

NOTES: v1.2 is now considered obsolete and was removed with the v1.5 release.

EDIT:
v1.4 Released!
- Removed MEMF_24BIT code from PreDMA/PostDMA for the
case of 16 byte aligned transfers. This will allow
some MEMF_24BIT transfers to be cache enabled!

EDIT2:
The v1.4 NewFuncSrc for lbC000080 should read as follows:
ORI.W #$8000,D1 ;Cache WT mode + User FC
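
The "Cache WT mode + User FC" comment refers to fields of the transparent translation registers. As a reading aid, here is a small Python sketch that decodes the two DTTR constants used in the post #1 listing ($00008040 and $00FF8000). The field layout is an assumption taken from the 68040 user's manual description of the TTR format (base in bits 31-24, mask in bits 23-16, enable in bit 15, S field in bits 14-13, cache mode in bits 6-5); the function names are hypothetical.

```python
CACHE_MODES = {
    0b00: "cacheable, write-through",
    0b01: "cacheable, copyback",
    0b10: "non-cacheable, serialized",
    0b11: "non-cacheable",
}


def decode_ttr(value):
    """Split a 32-bit transparent translation register value into its fields."""
    return {
        "base": (value >> 24) & 0xFF,     # address bits 31-24 to compare
        "mask": (value >> 16) & 0xFF,     # set bits are ignored in the compare
        "enabled": bool(value & 0x8000),  # E bit
        "s_field": (value >> 13) & 0b11,  # 0 = match user mode accesses only
        "cache_mode": CACHE_MODES[(value >> 5) & 0b11],
    }


def ttr_matches(value, addr):
    """True if an enabled TTR with this value covers the given address."""
    fields = decode_ttr(value)
    if not fields["enabled"]:
        return False
    top = (addr >> 24) & 0xFF
    return ((top ^ fields["base"]) & ~fields["mask"] & 0xFF) == 0


dtt0 = decode_ttr(0x00008040)   # covers the low 16MB (MEMF_24BIT range)
dtt1 = decode_ttr(0x00FF8000)   # mask $FF: covers every address
```

Under this reading, DTT0 marks the low 16MB as non-cacheable and serialized, and DTT1 switches everything else to write-through while it is enabled.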

Last edited by SpeedGeek; 30 March 2018 at 15:35.
Old 14 October 2017, 14:35   #11
SpeedGeek
Ok guys, now it's your turn to post your compatibility results!

Please provide information on 68040.library or 68060.library vendor and version. Also, accelerator card type and vendor is requested too. Thank you!
Old 14 October 2017, 17:08   #12
daxb
Registered User
 
Join Date: Oct 2009
Location: Germany
Posts: 2,020
How can we test compatibility? Is there a benchmark tool or similar to see the benefits?
Old 14 October 2017, 18:03   #13
SpeedGeek
Quote:
Originally Posted by daxb
How can we test compatibility? Is there a benchmark tool or similar to see the benefits?
EDIT: See post #16 for benchmark info.

I have already tested 68040.library 44.2 (H&P) with an A3640 and 68060.library 46.7 (Phase5) with an A3660. However, these libraries may configure themselves differently on other systems. Also, there are 3rd party libraries (e.g. GVP, Apollo, etc.) which should be tested as well.

Last edited by SpeedGeek; 16 October 2017 at 06:04.
Old 14 October 2017, 19:11   #14
daxb
Quote:
Originally Posted by SpeedGeek
I have already tested 68040.library 44.2 (H&P)...
How?
Old 15 October 2017, 13:06   #15
SpeedGeek
Quote:
Originally Posted by daxb
How?
Simply install the patch, use your system normally and look for any DMA transfer errors.

I went a little further than that. I made an LHA loop script which extracts 12MB of archives to the RAM disk. I installed a 2MB Zorro2 memory board for MEMF_24BIT testing. I have another script which changes the priority of the Zorro2 memory so the archive files extract there first. I loaded programs which open a screen in Chip RAM to test Chip RAM DMA, but loading icons on the Workbench screen does the same thing unless you are using RTG.
Old 15 October 2017, 18:52   #16
SpeedGeek
** 4TH NEWS UPDATE **

There was another stupid version bug in v1.4 which has now been fixed (it was just a fully functional v1.4 reporting itself as v1.3).

I now have a simple benchmark tool called "CacheDMAmips" (see attached image). I will probably release it when I am satisfied with the compatibility results.

EDIT: CacheDMAmips was removed for providing bogus results. Obviously, programs compiled on an old "Pile of Crap" C compiler and using v34 timer.device functions are not so reliable. Mips benchmark results are generally bogus anyway! Thus a new improved benchmark tool is called for!

Last edited by SpeedGeek; 23 October 2017 at 17:57.
Old 23 October 2017, 17:12   #17
SpeedGeek
Ok, I now have a new improved benchmark tool. Sadly, only 1 user has provided compatibility results so far.

EDIT: 3 users have now provided compatibility feedback.
See post #1 for the archive.

Code:
CacheDMAbench 1.1 ©SpeedGeek 2018

INTRODUCTION:
CacheDMAbench is a benchmark tool for FastCache040+. It can
also be used for benchmarking old Function Calls of most
68040/060 libraries. It runs a tight loop of 10000 CacheDMA
Function Calls. These FCs are made in paired Cache 
PreDMA/PostDMA calls. A smaller number (500 or 5%) of these
FCs are directed to Chip memory.

The parameters of this benchmark may have no relationship to
actual use but the goal here is to show the cumulative effect
on CPU performance of using these FCs frequently. However,
a great amount of effort was made to time these FCs as
accurately as possible. Results of the benchmark are reported
in Microseconds.

FEATURES:
- Benchmarks Cache PreDMA/PostDMA Function Calls      
- Uses ECLOCK timing of timer.device
- Uses 64 Bit integer math FCs of utility.library (v39+)  
  (Uses 32 Bit integer math for older versions)
- 100% Assembler code
           
REQUIREMENTS:
- Amiga with 68040 or 68060 CPU and MMU
- 68040.library or 68060.library
- FastCache040+

WARNINGS:
- This tool disables multitasking for a short time (but not
  interrupts). This seems to be a somewhat fair balance
  between accuracy and system friendly operation.  
- This tool may occasionally exit with a "Sorry, benchmark
  took too long." message. This happens either to avoid
  reporting bogus results or something even worse... the
  infamous Divide by Zero exception! 
 
NOTES:
- FastCache040+ should be renamed to "FastCache040" because
  the "+" character causes problems with script execution
  (due to an undocumented feature of the Amiga Shell).
- Do NOT expect the results to be exactly the same each time
  the benchmark is executed. 68040s are moderately dynamic
  in performance and 68060s are extremely dynamic!
- Benchmark result comparisons against other users' systems
  have no practical use. Benchmark result comparisons on 
  your system are exactly what this tool was developed for!

HISTORY:
v1.0 - First release
v1.1 - Fixed address and size bugs in FC loop code which
could have affected the results.
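
Since the results are reported in microseconds from ECLOCK readings, the conversion step can be sketched as follows. This is a Python illustration only, not the tool's code: the function name is hypothetical, and the 709379 Hz figure is the commonly quoted PAL E-clock rate, used here as an assumption (on a real system the rate is whatever timer.device ReadEClock() returns).

```python
PAL_ECLOCK_HZ = 709379  # assumed PAL E-clock rate in ticks per second


def eclock_to_microseconds(start_ticks, stop_ticks, eclock_hz):
    """Convert an E-clock tick interval to whole microseconds."""
    elapsed = stop_ticks - start_ticks
    if eclock_hz <= 0 or elapsed < 0:
        raise ValueError("invalid E-clock reading")
    # elapsed * 1,000,000 no longer fits in 32 bits once elapsed exceeds
    # 4294 ticks (about 6 ms at the PAL rate), so a wider product is needed.
    return (elapsed * 1_000_000) // eclock_hz


one_second = eclock_to_microseconds(0, PAL_ECLOCK_HZ, PAL_ECLOCK_HZ)
ten_ms = eclock_to_microseconds(100, 100 + 7094, PAL_ECLOCK_HZ)
```

The overflow limit in the comment is presumably why the tool prefers the 64 bit math functions of utility.library when they are available; how the 32 bit fallback scales its values is not described in the readme.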

Last edited by SpeedGeek; 21 May 2018 at 17:34.
Old 23 October 2017, 18:16   #18
kgc210
Could you include your benchmark tool? I will give it a go on my A4000.
Old 24 October 2017, 16:08   #19
SpeedGeek
Once again, I have requested compatibility results not benchmark results (See posts #11-15). I don't understand this "Horse before the Carriage" preoccupation of some users.

What is the point of making extraordinary efforts to determine how fast the Horse can run, if the Horse is fundamentally weak, lacking the strength and endurance to pull the carriage to its final destination?

Last edited by SpeedGeek; 24 October 2017 at 17:41.
Old 24 October 2017, 21:48   #20
amigasith
Registered User

Join Date: Jan 2013
Location: Wild South / Germany
Age: 43
Posts: 206
Hi SpeedGeek,

I just tested version 1.4 of your FastCache040+ patch on my A4000 with A3640 and 68040.library version 37.30 (original OS3.1). Result: I didn't get any DMA transfer errors, but I also didn't notice any speedup... I have a Fastlane Z3 running in my A4000, so I hope that it actually triggered some DMA transfers. Not sure if this helps, but at least you have one feedback now

Cheers,
Marc