English Amiga Board


Old 04 October 2017, 16:30   #1
SpeedGeek
Moderator
 
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 839
FastCache040+ Released!

FastCache040+ 2.6 ©SpeedGeek 2024

INTRODUCTION:
FastCache040+ is a patch to replace the CachePreDMA() and
CachePostDMA() functions of most 68040/060 libraries. While
the old functions are adequate, they are far from optimal.
The old functions have up to 2x more code than the new ones
provided with this patch!

Also, the new functions implement a much more efficient method
of managing the Copyback cache for DMA. While every system
will have some CPU performance loss under DMA conditions, the
new functions keep this performance loss to a bare minimum.

FEATURES:
- Replaces CachePreDMA() and CachePostDMA() with smaller
and more efficient code
- Replaces complex MMU code with simple and fast DTTR code
- Temporarily changes Copyback mode to Write Through for DMA
(but only when required!).
- Never flushes the ATC!
- Never flushes the DC for Chip RAM DMA!
- Uses 68040/060 library detection code
- Will not patch itself
- 100% Assembler code
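
As a reading aid, the "only when required" checks can be modelled in portable C. This is a minimal sketch that mirrors the tests at the top of CachePreDMA in the v2.6 listing further down this post; it is not the patch itself. The enum and function names are hypothetical, and the flag constant restates DMA_ReadFromRAM (bit 3) so the sketch is standalone.

```c
#include <stdint.h>

/* DMA_ReadFromRAM is bit 3 of the flags argument (tested with BTST #3,D0). */
#define DMA_READ_FROM_RAM (1u << 3)

/* Hypothetical names for the three outcomes of the PreDMA checks. */
typedef enum {
    PRE_DMA_NOTHING,     /* return at once, no cache work at all        */
    PRE_DMA_FLUSH_ONLY,  /* push dirty lines, leave the cache mode alone */
    PRE_DMA_CHANGE_MODE  /* program the DTTR(s), then push dirty lines   */
} pre_dma_action;

/* address = A0, length = the longword A1 points to, flags = D0 */
pre_dma_action pre_dma_decide(uint32_t address, uint32_t length, uint32_t flags)
{
    if ((address & 0xFFE00000u) == 0)      /* first 2MB: Chip RAM */
        return PRE_DMA_NOTHING;
    if (flags & DMA_READ_FROM_RAM)         /* ReadFromRam transfers are skipped */
        return PRE_DMA_NOTHING;
    if (((address | length) & 15u) == 0)   /* start and length 16 byte aligned */
        return PRE_DMA_FLUSH_ONLY;
    return PRE_DMA_CHANGE_MODE;            /* everything else changes cache mode */
}
```

For example, a 512 byte transfer to $08000000 only needs the flush, while the same transfer to $08000002 takes the mode change path.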

CODE SIZE COMPARISONS:
- FastCache040+ 2.6 (NewFunc 206 bytes)
- 68060.library 46.7 (OldFunc 304 bytes)
- 68040.library 44.2 (OldFunc 414 bytes)

REQUIREMENTS:
- Amiga with 68040 or 68060 CPU and MMU
- 68040.library or 68060.library

WARNING:
Do NOT use this patch with GigaMEM, VMM or any similar
virtual memory software! Do NOT use this patch with any
code which uses the MMU to write protect or remap modified
data structures!

NOTES:
Remapping a mirror image of the Kickstart ROM with the MMU
is OK! The new functions still have one thing in common with
the old functions: they do NOT translate virtual addresses
as specified in the Amiga RKRM! For more info on the old
functions see the Enforcer.guide by Michael Sinz.

UPDATE:
FastCache040+ v1.7 has been removed. Phase5 68060.library
users can optionally use FixMapP5.

HISTORY:
(Pre 2.0 history deleted)
v2.0 - Added code to enable only one DTTR when the Nest count
is one. Most systems have only one DMA driver and only need to
have 16MB of address space managed for this case.
Removed 1.9BR version which was over-rated due to most DMA
drivers operating at higher priority than typical user tasks.
v2.1 - Reworked the code to fix a problem with Snoopy 2.0
(Aminet). Sorry, this version no longer supports 16 byte aligned
cache enabled MEMF_24BIT transfers. NOTE: The original P5
library functions have problems with Snoopy too.
v2.2 - The Snoopy fix broke MEMF_24BIT transfers. So another
bug fix was required. Let's hope it's the last.
v2.3 - The 16 byte alignment code is back and now avoids the
change of cache mode for this specific case. Removed
Continue case from PreDMA since the expected results are
the same as the Non-Continue case. The cache disable test
code was removed to save the overhead of this very
uncommon case.
v2.4 - Reworked PostDMA code to fix Nested call cache flush bugs.
We really don't want to forget about systems with multiple DMA
drivers do we?
v2.5 - Fixed another rare but possible bug with DMA transfers
crossing the 16MB boundary of the DTTR! So now (except for
MEMF_24BIT DMA transfers) both DTTRs are enabled to
manage the full 32MB of address space.
v2.6 - The previous bug fix only worked for addresses in the 16MB
range. This fix should now work with any address.
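
The 16MB boundary issue in the v2.5 and v2.6 entries is easier to see by decoding the DTTR values used in the listing. The following is a simplified C model, not code from the patch: it assumes the 68040 transparent translation register layout (base in bits 31-24, mask in bits 23-16, enable in bit 15, cache mode in bits 6-5) and ignores the function-code field, which the listing always sets to "ignore FC". The names `ttr_fields`, `ttr_decode` and `ttr_matches` are hypothetical.

```c
#include <stdint.h>

/* Fields of a 68040 transparent translation register that matter here. */
typedef struct {
    uint32_t base;        /* logical address base, bits 31-24               */
    uint32_t mask;        /* logical address mask, bits 23-16 (1 = ignore)  */
    int      enabled;     /* E, bit 15                                      */
    unsigned cache_mode;  /* CM, bits 6-5: 0 = write-through,               */
                          /*               2 = cache inhibited, serialized  */
} ttr_fields;

ttr_fields ttr_decode(uint32_t ttr)
{
    ttr_fields f;
    f.base       = (ttr >> 24) & 0xFFu;
    f.mask       = (ttr >> 16) & 0xFFu;
    f.enabled    = (int)((ttr >> 15) & 1u);
    f.cache_mode = (unsigned)((ttr >> 5) & 3u);
    return f;
}

/* Simplified match test: only the top address byte is compared,
   and masked bits are ignored. Function codes are not modelled. */
int ttr_matches(uint32_t ttr, uint32_t address)
{
    ttr_fields f = ttr_decode(ttr);
    if (!f.enabled)
        return 0;
    return (((address >> 24) ^ f.base) & ~f.mask & 0xFFu) == 0;
}
```

With DTT0 = $0800C000, address $08FFFFFF matches but $09000000 does not, so a transfer crossing that boundary is only covered once DTT1 = $0900C000 is also enabled. A value such as $00FFC000 (mask $FF) matches every address.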
Code:
CachePostDMA:	
	MOVE.L  A0,D1
	ANDI.L  #$FFE00000,D1   ;Chip RAM
	BEQ.B	lbC00002A
	BTST	#3,D0		;ReadFromRam	
	BNE.B	lbC00002A
	MOVE.L	A5,-(SP)
	MOVE.L  A0,D1
	OR.L    (A1),D1
	ANDI.B  #15,D1		;16 byte aligned
	BEQ.B	lbC000020			
	LEA	Nest(PC),A1
	SUBQ.W  #1,(A1)
	BEQ.B	lbC000024
lbC000020	
	LEA	(lbC000050,PC),A5
	BRA.B	lbC000028					
lbC000024
	LEA	(lbC00004E,PC),A5
lbC000028	
	JSR	(-$1E,A6)	;Call Supervisor
	MOVE.L	(SP)+,A5

lbC00002A
	RTS

lbC00004E
	MOVEQ   #0,D1
	MOVEC	D1,DTT1		;Disable DTT1	    
	MOVEC   D1,DTT0		;Disable DTT0
lbC000050	
	CPUSHA	DC       
	RTE            

CachePreDMA:
	MOVEM.L	A0/A5,-(SP)		
	MOVE.L  A0,D1
	ANDI.L  #$FFE00000,D1   ;Chip RAM
	BEQ.B   lbC000068
	BTST	#3,D0		;ReadFromRam
	BNE.B	lbC000068
	MOVE.L  A0,D1
	OR.L    (A1),D1
	ANDI.B  #15,D1		;16 byte aligned
	BEQ.B	lbC000060		
	LEA	(lbC000074,PC),A5
	BRA.B   lbC000064
lbC000060
	LEA	(lbC000084,PC),A5
lbC000064	 
	JSR	(-$1E,A6)	;Call Supervisor
lbC000068
	MOVEM.L	(SP)+,A0/A5
	MOVE.L  A0,D0
	RTS

lbC000074	
	LEA	Nest(PC),A1
	TST.W	(A1)
	BEQ.B   lbC000078
	MOVE.L  #$0000C040,D1	;NoCache mode + Serialized      		
	MOVEC	D1,DTT0		;Enable DTT0
	MOVE.L  A0,D1
	ANDI.L  #$FF000000,D1   ;MEMF_24BIT
	BEQ.B	lbC000082
	MOVE.L  #$00FFC000,D1 	;Cache WT mode + ignore FC
	MOVEC 	D1,DTT1		;Enable DTT1	
	BRA.B   lbC000082
lbC000078
	MOVE.L  A0,D1
	ANDI.L  #$FF000000,D1   ;MEMF_24BIT
	BNE.B   lbC000080
	ORI.W	#$C040,D1 	;NoCache mode + Serialized
	MOVEC	D1,DTT0
	BRA.B	lbC000082
lbC000080
	ORI.W   #$C000,D1	;Cache WT mode + ignore FC
	MOVEC	D1,DTT0		;Lower 16MB cache control
	ADDI.L  #$1000000,D1    ;Increment address 
	MOVEC	D1,DTT1		;Upper 16MB cache control
	
lbC000082
	ADDQ.W  #1,(A1)	
lbC000084
	CPUSHA 	DC		;Flush dirty cache lines 
	RTE		
Nest:	DC.W	0
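
For readers who do not read 68k assembly, the Nest counter handling in the listing above can be followed with a small C model. It is a sketch under stated assumptions, not a translation of the whole patch: it covers only the unaligned Fast RAM path (labels lbC000074 and lbC00004E), it stores the DTTR values in a struct instead of writing them with MOVEC, and the CPUSHA DC flush is only marked by a comment. The struct and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical stand-in for the CPU registers and the Nest word. */
typedef struct {
    uint16_t nest;  /* number of outstanding mode-changing PreDMA calls */
    uint32_t dtt0;
    uint32_t dtt1;
} cache_state;

/* Mode-change path of CachePreDMA (label lbC000074). */
void pre_dma_mode_change(cache_state *s, uint32_t address)
{
    uint32_t base = address & 0xFF000000u;

    if (s->nest != 0) {
        /* Nested call: stop tracking one range and widen the coverage. */
        s->dtt0 = 0x0000C040u;            /* low 16MB, NoCache + serialized */
        if (base != 0)
            s->dtt1 = 0x00FFC000u;        /* mask $FF, write-through */
    } else if (base == 0) {
        s->dtt0 = 0x0000C040u;            /* MEMF_24BIT: NoCache + serialized */
    } else {
        s->dtt0 = base | 0xC000u;         /* lower 16MB, write-through */
        s->dtt1 = s->dtt0 + 0x01000000u;  /* upper 16MB, write-through */
    }
    s->nest++;
    /* CPUSHA DC runs here in the listing. */
}

/* Unaligned Fast RAM path of CachePostDMA. */
void post_dma_mode_restore(cache_state *s)
{
    s->nest--;
    if (s->nest == 0) {                   /* last outstanding transfer */
        s->dtt1 = 0;
        s->dtt0 = 0;
    }
    /* CPUSHA DC runs here in the listing. */
}
```

In this model the DTTRs are only cleared when the last outstanding PostDMA call brings the counter back to zero, which is the nested-call behaviour the v2.4 history entry refers to.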
Attached Files
File Type: lha CACHEDMABENCH11.LHA (2.2 KB, 473 views)
File Type: lha FIXMAPP5_14.LHA (3.4 KB, 383 views)
File Type: lha FASTCACHE040+26.LHA (3.0 KB, 30 views)

Last edited by SpeedGeek; 01 March 2024 at 14:06.
Old 04 October 2017, 17:57   #2
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,039
A couple of suggestions for CachePreDMA (2 bytes shorter and one less branch):
Code:
CachePreDMA:
    MOVEM.L    A0/A5,-(SP)        
    MOVE.L  A0,D1
    ANDI.L  #$FFE00000,D1   ;Chip RAM
    BEQ.B   lbC000068
 LEA    (lbC000084,PC),A5 ; moved from below
    ANDI.B    #$A,D0        ;Continue or ReadFromRam
    BNE.B    lbC000060
    MOVE.L    A0,D1
    OR.L    (A1),D1
    ANDI.B    #15,D1        ;16 Byte aligned
    BEQ.B    lbC000060
    LEA    (lbC000074,PC),A5
; or alternatively (faster on 040, don't know about 060; maybe add.w works better overall):
; LEA    (lbC000074-lbC000084,A5),A5

;;    BRA.B   lbC000064
lbC000060
;;    LEA    (lbC000084,PC),A5
;;lbC000064         
    JSR    (-$1E,A6)    ;Call Supervisor
lbC000068
Old 05 October 2017, 17:20   #3
SpeedGeek
Moderator
 
Quote:
Originally Posted by a/b View Post
A couple of suggestions for CachePreDMA (2 bytes shorter and one less branch):
Thanks for the suggestion, but saving 2 bytes of code does not result in faster execution in this case. BRA.B is faster than LEA for both 040 and 060 (but for 060 it's even faster with branch prediction).

If you want to make a patch with your suggestion that's OK with me. This patch code obtains most of its performance benefit from more efficient cache management, so small changes in code size or execution speed won't make much of a performance difference anyway.

Last edited by SpeedGeek; 05 October 2017 at 18:21.
Old 05 October 2017, 18:01   #4
kgc210
Registered User
 
Join Date: Jun 2016
Location: Stoke-On-Trent, England
Posts: 450
SpeedGeek

What sort of real life situations would benefit from this patch?
Or does it speed up all uses of the CPU?
Old 05 October 2017, 18:36   #5
SpeedGeek
Moderator
 
Quote:
Originally Posted by kgc210 View Post
SpeedGeek

What sort of real life situations would benefit from this patch?
Or does it speed up all uses of the CPU?
Any situation where a DMA controller transfers data to Fast RAM. Also, for Chip RAM when the driver doesn't handle the case that Chip RAM is non-cacheable memory (because it expects the old functions to handle that for it).

There are a few benchmark programs (e.g. RSCP, DiskSpeed) which test "CPU Availability" for SCSI DMA transfers. Unfortunately, they pre-date the 68040 CPU and really don't provide any reliable results here.

Last edited by SpeedGeek; 05 October 2017 at 18:50.
Old 05 October 2017, 18:53   #6
kgc210
Registered User
 
Ahh ok thanks for the explanation.
Do you think it would help on the Warpengine and A4000T or are their drivers optimised anyway?
Old 06 October 2017, 03:17   #7
SpeedGeek
Moderator
 
** NEWS UPDATE **

Sorry, there was a bug in v1.0 with the patch install code.

v1.1 - Fixed a bug which prevented the patch from installing
- Added code to use OldCachePreDMA for MEMF_24BIT
transfers (I don't know why errors occurred here)

Last edited by SpeedGeek; 06 October 2017 at 08:20.
Old 06 October 2017, 15:11   #8
SpeedGeek
Moderator
 
** 2ND NEWS UPDATE **

v1.2 released (updated patch size info)
- Added code to use OldCachePostDMA for MEMF_24BIT
transfers (So MMU Pages can be restored to original)
Old 06 October 2017, 15:15   #9
SpeedGeek
Moderator
 
OK, I believe I have found a solution to the MEMF_24BIT transfer error problem without OldPre/OldPost calls. Unfortunately, the cache mode would have to be changed to NoCache.

This would make the NewFunc code a little smaller but could reduce CPU performance a little for MEMF_24BIT transfers.

So it's a trade-off situation... will give it some more thought!

Last edited by SpeedGeek; 06 October 2017 at 17:35.
Old 10 October 2017, 22:09   #10
SpeedGeek
Moderator
 
** 3RD NEWS UPDATE **

v1.3 Released!
- Added code to change MEMF_24BIT transfers to NoCache.
This eliminated all OldFunc calls. MEMF_24BIT transfers may have
some CPU performance loss but the NewFunc code performance
benefits should still justify this.

NOTES: v1.2 is now considered obsolete and was removed with the v1.5 release.

EDIT:
v1.4 Released!
- Removed MEMF_24BIT code from PreDMA/PostDMA for the
case of 16 byte aligned transfers. This will allow
some MEMF_24BIT transfers to be cache enabled!

EDIT2:
The v1.4 NewFuncSrc for lbC000080 should read as follows:
ORI.W #$8000,D1 ;Cache WT mode + User FC

Last edited by SpeedGeek; 30 March 2018 at 15:35.
Old 14 October 2017, 14:35   #11
SpeedGeek
Moderator
 
Ok guys, now it's your turn to post your compatibility results!

Please provide information on 68040.library or 68060.library vendor and version. Accelerator card type and vendor are requested too. Thank you!
Old 14 October 2017, 17:08   #12
daxb
Registered User
 
Join Date: Oct 2009
Location: Germany
Posts: 3,303
How can we test compatibility? Is there a benchmark tool or similar to see the benefits?
Old 14 October 2017, 18:03   #13
SpeedGeek
Moderator
 
Quote:
Originally Posted by daxb View Post
How can we test compatibility? Is there a benchmark tool or similar to see the benefits?
EDIT: See post #16 for benchmark info.

I have already tested 68040.library 44.2 (H&P) with an A3640 and 68060.library 46.7 (Phase5) with an A3660. However, these libraries may configure themselves differently on other systems. Also, there are 3rd party libraries (e.g. GVP, Apollo, etc.) which should be tested as well.

Last edited by SpeedGeek; 16 October 2017 at 06:04.
Old 14 October 2017, 19:11   #14
daxb
Registered User
 
Quote:
Originally Posted by SpeedGeek View Post
I have already tested 68040.library 44.2 (H&P)...
How?
Old 15 October 2017, 13:06   #15
SpeedGeek
Moderator
 
Quote:
Originally Posted by daxb View Post
How?
Simply install the patch, use your system normally and look for any DMA transfer errors.

I went a little further than that. I made an LHA loop script which extracts 12MB of archives to the RAM disk. I installed a 2MB Zorro2 memory board for MEMF_24BIT testing. I have another script which changes the priority of the Zorro2 memory so the archive files extract there first. I loaded programs which open a screen in Chip RAM to test Chip RAM DMA, but loading icons on the Workbench screen does the same thing unless you are using RTG.
Old 15 October 2017, 18:52   #16
SpeedGeek
Moderator
 
** 4TH NEWS UPDATE **

There was another stupid version bug in v1.4 which has now been fixed (it was just a fully functional v1.4 reporting itself as v1.3).

I now have a simple benchmark tool called "CacheDMAmips" (see attached image). I will probably release it when I am satisfied with the compatibility results.

EDIT: CacheDMAmips was removed for providing bogus results. Obviously, programs compiled on an old "Pile of Crap" C compiler and using v34 timer.device functions are not so reliable. Mips benchmark results are generally bogus anyway! Thus a new improved benchmark tool is called for!

Last edited by SpeedGeek; 23 October 2017 at 17:57.
Old 23 October 2017, 17:12   #17
SpeedGeek
Moderator
 
Ok, I now have a new, improved benchmark tool. Sadly, only 1 user has provided compatibility results so far.

EDIT: 3 users have now provided compatibility feedback.
See post #1 for the archive.

Code:
CacheDMAbench 1.1 ©SpeedGeek 2018

INTRODUCTION:
CacheDMAbench is a benchmark tool for FastCache040+. It can
also be used for benchmarking old Function Calls of most
68040/060 libraries. It runs a tight loop of 10000 CacheDMA
Function Calls. These FCs are made in paired Cache 
PreDMA/PostDMA calls. A smaller number (500 or 5%) of these
FCs are directed to Chip memory.

The parameters of this benchmark may have no relationship to
actual use but the goal here is to show the cumulative effect
on CPU performance of using these FCs frequently. However,
a great amount of effort was made to time these FCs as
accurately as possible. Results of the benchmark are reported
in Microseconds.

FEATURES:
- Benchmarks Cache PreDMA/PostDMA Function Calls      
- Uses ECLOCK timing of timer.device
- Uses 64 Bit integer math FCs of utility.library (v39+)  
  (Uses 32 Bit integer math for older versions)
- 100% Assembler code
           
REQUIREMENTS:
- Amiga with 68040 or 68060 CPU and MMU
- 68040.library or 68060.library
- FastCache040+

WARNINGS:
- This tool disables multitasking for a short time (but not
  interrupts). This seems to be a somewhat fair balance
  between accuracy and system friendly operation.  
- This tool may occasionally exit with a "Sorry, benchmark
  took too long." message. This happens either to avoid
  reporting bogus results or something even worse... the
  infamous Divide by Zero exception! 
 
NOTES:
- FastCache040+ should be renamed to "FastCache040" because
  the "+" character causes problems with script execution
  (due to an undocumented feature of the Amiga Shell).
- Do NOT expect the results to be exactly the same each time
  the benchmark is executed. 68040s are moderately dynamic
  in performance and 68060s are extremely dynamic!
- Benchmark result comparisons against other users' systems
  have no practical use. Benchmark result comparisons on 
  your system are exactly what this tool was developed for!

HISTORY:
v1.0 - First release
v1.1 - Fixed address and size bugs in FC loop code which
could have affected the results.
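
The readme does not show how the tool turns EClock ticks into microseconds, so the following C sketch is only one plausible way to do it, written to illustrate why 64 bit math and a divide-by-zero guard come up. The function name and its error convention are hypothetical. The tick rate is assumed to be the value reported by timer.device ReadEClock(), which is 709379 Hz on PAL and 715909 Hz on NTSC machines.

```c
#include <stdint.h>

/* Converts an elapsed EClock tick count to microseconds.
   Returns 0 on success and stores the result in *out_us.
   Returns -1 when the result cannot be computed safely. */
int eclock_to_microseconds(uint64_t ticks, uint32_t eclock_hz, uint64_t *out_us)
{
    if (eclock_hz == 0)                   /* would be a divide by zero */
        return -1;
    if (ticks > UINT64_MAX / 1000000u)    /* product would overflow 64 bits */
        return -1;
    *out_us = (ticks * 1000000u) / eclock_hz;
    return 0;
}
```

With only 32 bit math the product `ticks * 1000000` overflows once the tick count passes about 4,294, which is roughly 6 ms at the PAL rate, so a fallback without 64 bit support would have to reorder the arithmetic or give up on long runs.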

Last edited by SpeedGeek; 21 May 2018 at 17:34.
Old 23 October 2017, 18:16   #18
kgc210
Registered User
 
Could you include your benchmark tool? I will give it a go on my A4000.
Old 24 October 2017, 16:08   #19
SpeedGeek
Moderator
 
Once again, I have requested compatibility results not benchmark results (See posts #11-15). I don't understand this "Horse before the Carriage" preoccupation of some users.

What is the point of making extraordinary efforts to determine how fast the Horse can run, if the Horse is fundamentally weak, lacking the strength and endurance to pull the carriage to its final destination?

Last edited by SpeedGeek; 24 October 2017 at 17:41.
Old 24 October 2017, 21:48   #20
amigasith
Registered User
 
Join Date: Jan 2013
Location: Wild South / Germany
Age: 48
Posts: 271
Hi SpeedGeek,

I just tested version 1.4 of your FastCache040+ patch on my A4000 with A3640 and 68040.library version 37.30 (original OS3.1). Result: I didn't get any DMA transfer errors, but I also didn't notice any speedup... I have a Fastlane Z3 running in my A4000, so I hope that it actually triggered some DMA transfers. Not sure if this helps, but at least you have one piece of feedback now.

Cheers,
Marc