#1 |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 59
Posts: 729
|
FastCache040+ 2.5 ©SpeedGeek 2022
INTRODUCTION:
FastCache040+ is a patch to replace the CachePreDMA() and CachePostDMA() functions of most 68040/060 libraries. While the old functions are adequate, they are far from optimal. The old functions have roughly 1.5x to 2x as much code as the new ones provided with this patch (see the code size comparisons below)! Also, the new functions implement a much more efficient method of managing the Copyback cache for DMA. While every system will have some CPU performance loss under DMA conditions, the new functions keep this performance loss to a bare minimum.

FEATURES:
- Replaces CachePreDMA() and CachePostDMA() with smaller and more efficient code
- Replaces complex MMU code with simple and fast DTTR code
- Temporarily changes Copyback mode to Write Through for DMA (but only when required!)
- Never flushes the ATC!
- Never flushes the DC for Chip RAM DMA!
- Uses 68040/060 library detection code
- Will not patch itself
- 100% Assembler code

CODE SIZE COMPARISONS:
- FastCache040+ 2.5 (NewFunc 204 bytes)
- 68060.library 46.7 (OldFunc 304 bytes)
- 68040.library 44.2 (OldFunc 414 bytes)

REQUIREMENTS:
- Amiga with 68040 or 68060 CPU and MMU
- 68040.library or 68060.library

WARNING:
Do NOT use this patch with GigaMEM, VMM or any similar virtual memory software!
Do NOT use this patch with any code which uses the MMU to write protect or remap modified data structures!

NOTES:
Remapping a mirror image of the Kickstart ROM with the MMU is OK!
The new functions still have one thing in common with the old functions: they do NOT translate virtual addresses as specified in the Amiga RKRM! For more info on the old functions see the Enforcer.guide by Michael Sinz.

UPDATE:
FastCache040+ v1.7 has been removed. Phase5 68060.library users can optionally use FixMapP5.

HISTORY:
(Pre 2.0 history deleted)
v2.0 - Added code to enable only one DTTR when the Nest count is one. Most systems have only one DMA driver and only need to have 16MB of address space managed for this case. Removed the 1.9BR version, which was over-rated because most DMA drivers operate at a higher priority than typical user tasks.
v2.1 - Reworked the code to fix a problem with Snoopy 2.0 (Aminet). Sorry, this version no longer supports 16 byte aligned cache enabled MEMF_24BIT transfers. NOTE: The original P5 library functions have problems with Snoopy too.
v2.2 - The Snoopy fix broke MEMF_24BIT transfers, so another bug fix was required. Let's hope it's the last.
v2.3 - The 16 byte alignment code is back and now avoids the change of cache mode for this specific case. Removed the Continue case from PreDMA since the expected results are the same as the Non-Continue case. The cache disable test code was removed to save the overhead of this very uncommon case.
v2.4 - Reworked the PostDMA code to fix Nested call cache flush bugs. We really don't want to forget about systems with multiple DMA drivers, do we?
v2.5 - Fixed another rare but possible bug with DMA transfers crossing the 16MB boundary of the DTTR! So now (except for MEMF_24BIT DMA transfers) both DTTRs are enabled to manage the full 32MB of address space.

Code:
    CachePostDMA:
            MOVE.L  A0,D1
            ANDI.L  #$FFE00000,D1   ;Chip RAM
            BEQ.B   lbC00002A
            BTST    #3,D0           ;ReadFromRam
            BNE.B   lbC00002A
            MOVE.L  A5,-(SP)
            MOVE.L  A0,D1
            OR.L    (A1),D1
            ANDI.B  #15,D1          ;16 byte aligned
            BEQ.B   lbC000020
            LEA     Nest(PC),A1
            SUBQ.W  #1,(A1)
            BEQ.B   lbC000024
    lbC000020
            LEA     (lbC000050,PC),A5
            BRA.B   lbC000028
    lbC000024
            LEA     (lbC00004E,PC),A5
    lbC000028
            JSR     (-$1E,A6)       ;Call Supervisor
            MOVE.L  (SP)+,A5
    lbC00002A
            RTS
    lbC00004E
            MOVEQ   #0,D1
            MOVEC   D1,DTT1         ;Disable DTT1
            MOVEC   D1,DTT0         ;Disable DTT0
    lbC000050
            CPUSHA  DC
            RTE

    CachePreDMA:
            MOVEM.L A0/A5,-(SP)
            MOVE.L  A0,D1
            ANDI.L  #$FFE00000,D1   ;Chip RAM
            BEQ.B   lbC000068
            BTST    #3,D0           ;ReadFromRam
            BNE.B   lbC000068
            MOVE.L  A0,D1
            OR.L    (A1),D1
            ANDI.B  #15,D1          ;16 byte aligned
            BEQ.B   lbC000060
            LEA     (lbC000074,PC),A5
            BRA.B   lbC000064
    lbC000060
            LEA     (lbC000084,PC),A5
    lbC000064
            JSR     (-$1E,A6)       ;Call Supervisor
    lbC000068
            MOVEM.L (SP)+,A0/A5
            MOVE.L  A0,D0
            RTS
    lbC000074
            LEA     Nest(PC),A1
            TST.W   (A1)
            BEQ.B   lbC000078
            MOVE.L  #$0000C040,D1   ;NoCache mode + Serialized
            MOVEC   D1,DTT0         ;Enable DTT0
            MOVE.L  A0,D1
            ANDI.L  #$FF000000,D1   ;MEMF_24BIT
            BEQ.B   lbC000082
            MOVE.L  #$00FFC000,D1   ;Cache WT mode + ignore FC
            MOVEC   D1,DTT1         ;Enable DTT1
            BRA.B   lbC000082
    lbC000078
            MOVE.L  A0,D1
            ANDI.L  #$FF000000,D1   ;MEMF_24BIT
            BNE.B   lbC000080
            ORI.W   #$C040,D1       ;NoCache mode + Serialized
            MOVEC   D1,DTT0
            BRA.B   lbC000082
    lbC000080
            ORI.W   #$C000,D1       ;Cache WT mode + ignore FC
            MOVEC   D1,DTT0         ;Lower 16MB cache control
            LSR.W   #1,D1           ;Adjust Word data for Long shift
            LSL.L   #1,D1           ;Long shift
            MOVEC   D1,DTT1         ;Upper 16MB cache control
    lbC000082
            ADDQ.W  #1,(A1)
    lbC000084
            CPUSHA  DC              ;Flush dirty cache lines
            RTE

    Nest:
            DC.W    0

Last edited by SpeedGeek; 26 October 2022 at 18:34.
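For readers who find the branch structure hard to follow in assembler, here is a minimal host-side sketch in C of the decisions both functions make before calling Supervisor(). It only models the tests in the listing (Chip RAM mask, ReadFromRAM flag, 16 byte alignment of address and length) and the Nest counter. The names `dma_path`, `pre_dma_path`, `dma_state`, `model_pre_dma` and `model_post_dma` are made up for illustration, and the `assert` on the Nest count is an assumption of the model, not something the patch checks.

```c
#include <assert.h>
#include <stdint.h>

/* Flag bit tested by BTST #3,D0 in the listing (DMA_ReadFromRAM in exec). */
#define DMA_ReadFromRAM (1UL << 3)

/* Hypothetical names for the three paths through the listing. */
typedef enum {
    DMA_PATH_NONE,        /* return at once: Chip RAM or ReadFromRAM set */
    DMA_PATH_FLUSH_ONLY,  /* CPUSHA DC only: 16 byte aligned transfer */
    DMA_PATH_SET_DTTR     /* touch the DTTR(s) and Nest, then CPUSHA DC */
} dma_path;

/* Mirrors the tests made at the top of CachePreDMA and CachePostDMA. */
dma_path pre_dma_path(uint32_t addr, uint32_t length, uint32_t flags)
{
    if ((addr & 0xFFE00000UL) == 0)      /* below 2MB: Chip RAM */
        return DMA_PATH_NONE;
    if (flags & DMA_ReadFromRAM)
        return DMA_PATH_NONE;
    if (((addr | length) & 15) == 0)     /* address and length both multiples of 16 */
        return DMA_PATH_FLUSH_ONLY;
    return DMA_PATH_SET_DTTR;
}

/* Model of the Nest word plus a flag standing in for "DTTRs enabled". */
typedef struct {
    unsigned nest;
    int dttr_enabled;
} dma_state;

void model_pre_dma(dma_state *s, dma_path p)
{
    if (p == DMA_PATH_SET_DTTR) {
        s->dttr_enabled = 1;             /* MOVEC D1,DTT0 (and DTT1) */
        s->nest++;                       /* ADDQ.W #1,(A1) */
    }
}

void model_post_dma(dma_state *s, dma_path p)
{
    if (p != DMA_PATH_SET_DTTR)
        return;                          /* nothing to undo on the other paths */
    assert(s->nest > 0);                 /* model assumption: calls are paired */
    if (--s->nest == 0)                  /* SUBQ.W #1,(A1) / BEQ.B */
        s->dttr_enabled = 0;             /* last user: disable DTT1 and DTT0 */
}
```

With two overlapping unaligned transfers, the first `model_post_dma` call leaves the DTTRs enabled and only the second one clears them, which is the nested case that v2.4 addresses.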
#2 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 852
|
A couple of suggestions for CachePreDMA (2 bytes shorter and one less branch):
Code:
    CachePreDMA:
            MOVEM.L A0/A5,-(SP)
            MOVE.L  A0,D1
            ANDI.L  #$FFE00000,D1   ;Chip RAM
            BEQ.B   lbC000068
            LEA     (lbC000084,PC),A5       ; moved from below
            ANDI.B  #$A,D0          ;Continue or ReadFromRam
            BNE.B   lbC000060
            MOVE.L  A0,D1
            OR.L    (A1),D1
            ANDI.B  #15,D1          ;16 Byte aligned
            BEQ.B   lbC000060
            LEA     (lbC000074,PC),A5
    ; or alternatively (faster on 040, don't know about 060; maybe add.w works better overall):
    ;       LEA     (lbC000074-lbC000084,A5),A5
    ;;      BRA.B   lbC000064
    lbC000060
    ;;      LEA     (lbC000084,PC),A5
    ;;lbC000064
            JSR     (-$1E,A6)       ;Call Supervisor
    lbC000068
#3 | |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 59
Posts: 729
|
Quote:
If you want to make a patch with your suggestion, that's OK with me. This patch obtains most of its performance benefit from more efficient cache management, so small changes in code size or execution speed won't make much of a performance difference anyway.
Last edited by SpeedGeek; 05 October 2017 at 18:21.
#4 |
Registered User
Join Date: Jun 2016
Location: Stoke-On-Trent, England
Posts: 450
|
SpeedGeek
What sort of real life situations would benefit from this patch? Or does it speed up all uses of the CPU? |
#5 | |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 59
Posts: 729
|
Quote:
There are a few benchmark programs (e.g. RSCP, DiskSpeed) which test "CPU Availability" for SCSI DMA transfers. Unfortunately, they pre-date the 68040 CPU and really don't provide any reliable results here.
Last edited by SpeedGeek; 05 October 2017 at 18:50.
#6 |
Registered User
Join Date: Jun 2016
Location: Stoke-On-Trent, England
Posts: 450
|
Ahh ok thanks for the explanation.
Do you think it would help on the Warpengine and A4000T or are their drivers optimised anyway? |
#7 |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 59
Posts: 729
|
** NEWS UPDATE **
Sorry, there was a bug in v1.0 with the patch install code.
v1.1
- Fixed a bug which prevented the patch from installing
- Added code to use OldCachePreDMA for MEMF_24BIT transfers (I don't know why errors occurred here)
Last edited by SpeedGeek; 06 October 2017 at 08:20.
#8 |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 59
Posts: 729
|
** 2ND NEWS UPDATE **
v1.2 released (updated patch size info)
- Added code to use OldCachePostDMA for MEMF_24BIT transfers (so MMU pages can be restored to their original state)
#9 |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 59
Posts: 729
|
OK, I believe I have found a solution to the MEMF_24BIT transfer error problem without OldPre/OldPost calls. Unfortunately, the cache mode would have to be changed to NoCache. This would make the NewFunc code a little smaller but could reduce CPU performance a little for MEMF_24BIT transfers. So it's a trade-off... I will give it some more thought!
Last edited by SpeedGeek; 06 October 2017 at 17:35.
#10 |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 59
Posts: 729
|
** 3RD NEWS UPDATE **
v1.3 Released!
- Added code to change MEMF_24BIT transfers to NoCache. This eliminated all OldFunc calls. MEMF_24BIT transfers may have some CPU performance loss but the NewFunc code performance benefits should still justify this.
NOTES: v1.2 is now considered obsolete and was removed with the v1.5 release.
EDIT: v1.4 Released!
- Removed MEMF_24BIT code from PreDMA/PostDMA for the case of 16 byte aligned transfers. This will allow some MEMF_24BIT transfers to be cache enabled!
EDIT2: The v1.4 NewFuncSrc for lbC000080 should read as follows:
    ORI.W   #$8000,D1       ;Cache WT mode + User FC
Last edited by SpeedGeek; 30 March 2018 at 15:35.
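The DTTR constants quoted in this thread ($C040, $C000, $00FFC000, $8000) are easier to read once split into fields. Below is a small C sketch of a decoder, assuming the 68040 transparent translation register layout as documented in the Motorola user's manual (base in bits 31-24, mask in bits 23-16, E in bit 15, S in bits 14-13, CM in bits 6-5). The type and function names are hypothetical, and `ttr_matches` deliberately ignores the function code check, so it is a simplification rather than a full model of the hardware. On the 68060 the two non-cacheable CM encodings are named differently (precise/imprecise), which the sketch does not distinguish.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CM field, bits 6-5 (68040 naming). */
typedef enum {
    TTR_CM_WRITETHROUGH = 0,
    TTR_CM_COPYBACK = 1,
    TTR_CM_NOCACHE_SERIALIZED = 2,
    TTR_CM_NOCACHE = 3
} ttr_cache_mode;

/* S field, bits 14-13: 00 user, 01 supervisor, 1x ignore FC2. */
typedef enum {
    TTR_S_USER = 0,
    TTR_S_SUPERVISOR = 1,
    TTR_S_IGNORE_FC = 2
} ttr_s_field;

typedef struct {
    uint8_t base;        /* logical address base, compared with A31-A24 */
    uint8_t mask;        /* set bits are ignored in the comparison */
    bool enabled;        /* E bit */
    ttr_s_field s;
    ttr_cache_mode cm;
} ttr_fields;

ttr_fields ttr_decode(uint32_t v)
{
    ttr_fields f;
    unsigned s = (v >> 13) & 3;

    f.base = (uint8_t)(v >> 24);
    f.mask = (uint8_t)(v >> 16);
    f.enabled = ((v >> 15) & 1) != 0;
    f.s = (s & 2) ? TTR_S_IGNORE_FC : (ttr_s_field)s;
    f.cm = (ttr_cache_mode)((v >> 5) & 3);
    return f;
}

/* Address match only; the function code (user/supervisor) test is left out. */
bool ttr_matches(uint32_t v, uint32_t addr)
{
    ttr_fields f = ttr_decode(v);
    uint8_t hi = (uint8_t)(addr >> 24);

    return f.enabled && (((hi ^ f.base) & ~f.mask & 0xFF) == 0);
}

/* Usage example: the constants from the thread, decoded. */
void ttr_example(void)
{
    assert(ttr_decode(0x0000C040UL).cm == TTR_CM_NOCACHE_SERIALIZED);
    assert(ttr_decode(0x0000C040UL).s == TTR_S_IGNORE_FC);
    assert(ttr_decode(0x00FFC000UL).cm == TTR_CM_WRITETHROUGH);
    assert(ttr_decode(0x00008000UL).s == TTR_S_USER);
    assert(ttr_matches(0x00FFC000UL, 0x07000000UL)); /* mask $FF matches any address */
}
```

Decoded this way, $0000C040 covers only the lowest 16MB (MEMF_24BIT space) as non-cacheable, while $00FFC000 has every mask bit set and therefore matches all addresses in Write Through mode.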
#11 |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 59
Posts: 729
|
Ok guys, now it's your turn to post your compatibility results!
Please provide the vendor and version of your 68040.library or 68060.library. Also, please include the accelerator card type and vendor. Thank you!
#12 |
Registered User
Join Date: Oct 2009
Location: Germany
Posts: 3,185
|
How can we test compatibility? Is there a benchmark tool or similar to see the benefits?
#13 | |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 59
Posts: 729
|
Quote:
I have already tested 68040.library 44.2 (H&P) with an A3640 and 68060.library 46.7 (Phase5) with an A3660. However, these libraries may configure themselves differently on other systems. Also, there are 3rd party libraries (e.g. GVP, Apollo, etc.) which should be tested as well.
Last edited by SpeedGeek; 16 October 2017 at 06:04.
#14 |
Registered User
Join Date: Oct 2009
Location: Germany
Posts: 3,185
|
|
#15 |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 59
Posts: 729
|
Simply install the patch, use your system normally and look for any DMA transfer errors.
I went a little further than that. I made an LHA loop script which extracts 12MB of archives to the RAM disk. I installed a 2MB Zorro2 memory board for MEMF_24BIT testing. I have another script which changes the priority of the Zorro2 memory so the archive files extract there first. I loaded programs which open a screen in Chip RAM to test Chip RAM DMA, but loading icons on the Workbench screen does the same thing unless you are using RTG.
#16 |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 59
Posts: 729
|
** 4TH NEWS UPDATE **
There was another stupid version bug in v1.4 which has now been fixed (it was just a fully functional v1.4 reporting itself as v1.3). I now have a simple benchmark tool called "CacheDMAmips" (see attached image). I will probably release it when I am satisfied with the compatibility results.
EDIT: CacheDMAmips was removed for providing bogus results. Obviously, programs compiled on an old "Pile of Crap" C compiler and using v34 timer.device functions are not so reliable. Mips benchmark results are generally bogus anyway! Thus a new improved benchmark tool is called for!
Last edited by SpeedGeek; 23 October 2017 at 17:57.
#17 |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 59
Posts: 729
|
Ok, I now have a new improved benchmark tool. Sadly, only 1 user has provided compatibility results so far.
EDIT: 3 users have now provided compatibility feedback.
See post #1 for the archive.
Code:
CacheDMAbench 1.1 ©SpeedGeek 2018

INTRODUCTION:
CacheDMAbench is a benchmark tool for FastCache040+. It can also be used for benchmarking old Function Calls of most 68040/060 libraries. It runs a tight loop of 10000 CacheDMA Function Calls. These FCs are made in paired Cache PreDMA/PostDMA calls. A smaller number (500 or 5%) of these FCs are directed to Chip memory. The parameters of this benchmark may have no relationship to actual use, but the goal here is to show the cumulative effect on CPU performance of using these FCs frequently. However, a great amount of effort was made to time these FCs as accurately as possible. Results of the benchmark are reported in Microseconds.

FEATURES:
- Benchmarks Cache PreDMA/PostDMA Function Calls
- Uses ECLOCK timing of timer.device
- Uses 64 Bit integer math FCs of utility.library (v39+) (Uses 32 Bit integer math for older versions)
- 100% Assembler code

REQUIREMENTS:
- Amiga with 68040 or 68060 CPU and MMU
- 68040.library or 68060.library
- FastCache040+

WARNINGS:
- This tool disables multitasking for a short time (but not interrupts). This seems to be a somewhat fair balance between accuracy and system friendly operation.
- This tool may occasionally exit with a "Sorry, benchmark took too long." message. This happens either to avoid reporting bogus results or something even worse... the infamous Divide by Zero exception!

NOTES:
- FastCache040+ should be renamed to "FastCache040" because the "+" character causes problems with script execution (due to an undocumented feature of the Amiga Shell).
- Do NOT expect the results to be exactly the same each time the benchmark is executed. 68040s are moderately dynamic in performance and 68060s are extremely dynamic!
- Benchmark result comparisons against other users' systems have no practical use. Benchmark result comparisons on your own system are exactly what this tool was developed for!

HISTORY:
v1.0 - First release
v1.1 - Fixed address and size bugs in FC loop code which could have affected the results.

Last edited by SpeedGeek; 21 May 2018 at 17:34.
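The readme says the results are timed with the ECLOCK and reported in microseconds using 64 bit integer math. A rough C sketch of that conversion is below. It is not the tool's actual code: the function names are hypothetical, and it assumes the tick frequency is the value timer.device's ReadEClock() returns (ticks per second, for example 709379 on a PAL machine). The zero check is there because a zero divisor is exactly the Divide by Zero case the readme warns about.

```c
#include <assert.h>
#include <stdint.h>

/* Convert an elapsed EClock tick count to microseconds with 64 bit math.
   Returns 0 when the frequency is unknown instead of dividing by zero. */
uint64_t eclock_ticks_to_us(uint64_t ticks, uint32_t eclock_hz)
{
    if (eclock_hz == 0)
        return 0;
    return (ticks * 1000000ULL) / eclock_hz;
}

/* Average time per function call, again guarding the divisor. */
uint64_t us_per_call(uint64_t total_us, uint32_t calls)
{
    if (calls == 0)
        return 0;
    return total_us / calls;
}

/* Usage example: one second worth of PAL EClock ticks is 1000000 us,
   and spread over the benchmark's 10000 calls that is 100 us each. */
void eclock_example(void)
{
    uint64_t us = eclock_ticks_to_us(709379ULL, 709379UL);

    assert(us == 1000000ULL);
    assert(us_per_call(us, 10000UL) == 100ULL);
}
```

Multiplying before dividing keeps the precision, which is why a 64 bit intermediate is needed: a 32 bit product would overflow after only a few thousand ticks.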
#18 |
Registered User
Join Date: Jun 2016
Location: Stoke-On-Trent, England
Posts: 450
|
Could you include your benchmark tool? I will give it a go on my A4000.
#19 |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 59
Posts: 729
|
Once again, I have requested compatibility results not benchmark results (See posts #11-15). I don't understand this "Horse before the Carriage" preoccupation of some users.
What is the point of making extraordinary efforts to determine how fast the Horse can run, if the Horse is fundamentally weak, lacking the strength and endurance to pull the carriage to its final destination?
Last edited by SpeedGeek; 24 October 2017 at 17:41.
#20 |
Registered User
Join Date: Jan 2013
Location: Wild South / Germany
Age: 47
Posts: 269
|
Hi SpeedGeek,
I just tested version 1.4 of your FastCache040+ patch on my A4000 with A3640 and 68040.library version 37.30 (original OS3.1).
Result: I didn't get any DMA transfer errors, but I also didn't notice any speedup... I have a Fastlane Z3 running in my A4000, so I hope that it actually triggered some DMA transfers.
Not sure if this helps, but at least you have one feedback now.
Cheers,
Marc