28 December 2023, 15:30 | #1 | ||
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 846
|
Software which uses Copymem improvement patches
Quote:
https://eab.abime.net/showthread.php?t=76777

This thread will be more generalized and cover the broader discussion. The following is a list of software and tools (including my own releases here on EAB) which can use or benefit as follows:

Code:
APPLICATION........EXEC ROUTINE
FastRom2MB.........Copymemquick()
MapRom040+.........Copymemquick()
TurboRom040+.......Copymemquick()
Code:
APPLICATION...........EXEC ROUTINE
LHA_68K...............Copymem()
gvpscsi.device(4.x)...Copymem()*
omniscsi.device(6.x)..Copymem()*
vbak2091..............Copymem()*
8n1.device............Copymem()
Resource(6.x).........Copymemquick()

*Used for buffered DMA transfers.
Code:
APPLICATION...........EXEC ROUTINE
Cpu FastROM...........Copymemquick()
Last edited by SpeedGeek; 19 January 2024 at 14:45. |
||
28 December 2023, 15:40 | #2 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,516
|
The only things that will benefit from memory copy optimisation are:
1) Applications which copy large amounts of data from one place to another, which will benefit from any throughput improvements. These will soon become IO bound, however.

2) Applications which frequently copy small amounts of data from one place to another, which can benefit from reduced function call setup time.

Most applications don't fit either category, since copying large amounts of data around is generally considered "a bad thing" (tm), and equally, frequently copying smaller items around (e.g. implementing by-value semantics for structure passing) is not considered a "good thing". Even where compiled code does this, for most typical structures it's faster and simpler to directly emit a few unrolled move instructions to duplicate a typical C structure from one location to another than it is to make a library call to exec.

I think you are more likely to benefit from improving the efficiency of other memory-related calls, in particular allocation and deallocation of memory, since those can be expensive and are called quite frequently; in the end, even C applications eventually resort to calling those.

There is a somewhat mythical additional case, which will definitely benefit:

3) Applications which frequently copy non-trivial amounts of memory around. These would benefit from both reduced call setup time and improved bandwidth. However, such programs are rare and tend to be ports from other systems that are already punishingly inefficient on 68K, to the extent that routinely copying page-sized chunks of memory around is the least of your performance worries. |
29 December 2023, 15:02 | #3 | |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 846
|
Quote:
The reason why Commodore included these functions in the exec.library was to provide efficient implementations of the most commonly used functions. Since the library is always open and referenced at a specified location, it avoids the extra overhead of opening and closing the library. Whether or not a C compiler can duplicate structures more efficiently with internal functions was a decision made by the developer of that particular C compiler. I do most of my coding projects in assembler, and in the projects I have done I have referenced structures but had no particular need for duplication. Getting back to a more general discussion, I think C coders can realize just as well as ASM coders that calling Copymem() for <= 4 bytes is very inefficient (a definite should-not-do), but there are some lazy coders who will do that anyway. Last edited by SpeedGeek; 29 December 2023 at 15:10. |
|
29 December 2023, 15:35 | #4 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,516
|
Calling it for anything that takes longer than the necessary number of inlined move operations is inefficient. This is why your typical C compiler will just emit something like that when copying a structure. I am sure there's a break-even point somewhere, but that depends on the CPU and cache state.
|
29 December 2023, 17:19 | #5 | |
Registered User
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,385
|
Quote:
My 020+ routine is not optimized for 040/060 or unaligned blocks, but usually the block size that I have to copy is only between 30 bytes and 5 kB, not large enough or frequently called enough to benefit from additional checks of the circumstances and branching into CPU-optimized subroutines. Anyway, I won't notice any speed improvements from faster copying of memory blocks in my icon.library, not even with benchmarks. |
|
30 December 2023, 14:57 | #6 | |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 846
|
Quote:
You probably won't notice any performance difference under WinUAE. You will need real M68K hardware to observe the difference. Let me know if you are interested. Last edited by SpeedGeek; 30 December 2023 at 15:10. |
|
30 December 2023, 16:25 | #7 | ||||
Registered User
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,385
|
Quote:
Update: By using MOVE.W instead of MOVE.L in my loop, the icon loading and displaying got 0.5 % - 1.2 % slower without the JIT, but emulated "cycle-exact" at 28 MHz, although PCs don't like word-size instructions because they need an extra prefix byte. By using MOVE.B it was 1.3 % - 3.9 % slower. The first result was for ColorIcons (46x46), the second for large ARGB images, 64x64 or 256x256, or 16 kB - 256 kB blocks. With the JIT enabled there was no reproducible speed difference.

I don't know if you could improve the copy speed by more than a factor of two for 040/060 CPUs, but even if you can, I cannot expect to find your CopyMem patches installed on every user's system. The same problem exists for the scratch registers. I would always prefer to use an internal copy routine.

Update2: From the test results above I can estimate that my CopyMemBlock routine consumes about 1.3 % of the overall execution time for loading and displaying large icons, or just ~0.45 % for normal ColorIcons. This would mean that if your code could be twice as fast as mine on 040/060, it would gain a 0.65 % speed improvement in the best case, or maybe <1 % for being 4 times faster; nothing that anybody would ever notice. That's similar to the conclusion I had to draw when I once tried to improve the MathLibs in 2005 and never found an application which really got faster.

Last edited by PeterK; 30 December 2023 at 23:28. |
||||
31 December 2023, 16:40 | #8 | ||
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 846
|
Quote:
Longword alignment benefits the most when the source and destination have equal offsets. Otherwise, it's better to have at least one operand longword aligned. For the 030, destination alignment gives a small performance improvement (because it has a write buffer), but for the 020 it won't make any difference. Quote:
BTW, the code in CMQ&B040 is quite different from CMQ&B. If you want to see Move16 performance differences on a real 68040, you will find them in the "Testit" results posted in the thread linked in post #1. Last edited by SpeedGeek; 31 December 2023 at 19:17. |
||
31 December 2023, 17:20 | #9 |
Registered User
Join Date: Sep 2022
Location: Switzerland
Posts: 120
|
Interesting topic!
I am still learning Assembler and also thought about doing my own specialized CopyMem. Yeah, I also think that WinUAE is not (always) a good way of doing benchmarks?!

That's a point I am wondering about. On 020+, MOVE.L is the fastest. But how about on the 000 and 010? How do 2x MOVE.W compare to 1x MOVE.L, as the 000 and 010 have to split 32-bit accesses into two 16-bit accesses?! And on the 010, due to its special loop mode, is it a good idea to unroll mini loops, i.e. (pseudo-code) Code:
LOOP    MOVE
        DBcc    D0, LOOP
Code:
LOOP    MOVE
        MOVE
        MOVE
        MOVE
        DBcc    D0/4, LOOP |
31 December 2023, 20:28 | #10 | ||
Registered User
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,385
|
Quote:
The purpose of my tests with MOVE.W and MOVE.B was to see approximately how much longer the CopyMem routine would take, simply assuming two or four times the number of cycles, even if that might not be 100 % correct. I think these quick tests are good enough to estimate by rule of thumb which portion of the execution time CopyMemBlock takes in my library. Last edited by PeterK; 31 December 2023 at 20:56. |
||
31 December 2023, 21:17 | #11 | |
Registered User
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,385
|
Quote:
Update: I've examined the 40 calls of CopyMemBlock in my library a bit more. Unfortunately, there are only 3 large aligned ARGB longword copies, but none of them is used for normal icon decoding: one is only for icons with 1 image, another only for the second images of selected icons, and the last is for a special case where speed doesn't matter. And there are 4 copies for blocks of a few kB which can't use CopyMemQuick(); nothing that will benefit enough from calling exec CopyMem(). Last edited by PeterK; 01 January 2024 at 16:45. |
|
31 December 2023, 23:02 | #12 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,516
|
What's the typical break-even point, in bytes, at which the efficiency of CMQ with all these patches pays for its own library call overhead, compared to a locally inlined move.l loop with a spot of unrolling? Let's say I'm most interested in the 68030 and above for this question, and we are dealing with a cacheable source but uncacheable destination (e.g. local Fast RAM to RTG VRAM).
|
01 January 2024, 16:38 | #13 | ||
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 846
|
Quote:
If you really think preserving and restoring D1 to support a JSR to Copymem() is going to reduce the performance of your CopyMemBlock function that much, I suppose I could give you a license to use and distribute the CMQ&B 1.7 source code with your icon.library. You will see that it provides quite a large amount of unrolled MOVE.L with minimal support code. Now, this large amount of unrolled MOVE.L seems to give the best performance on 020/030 systems (even so, there is a point of diminishing returns). The instruction cache is only 256 bytes, and any Copymem code which is too big to fit, or which should co-exist with other code you want cached, is something to think about.

For the 040/060 things are much different. They certainly have much larger instruction caches, but you will just end up wasting them with large unrolled loops. This is because the execution pipelines are very much faster, such that small loops running from the cache can execute as fast as the large loops on 020/030. Quote:
Last edited by SpeedGeek; 01 January 2024 at 22:34. |
||
01 January 2024, 16:57 | #14 | ||
Registered User
Join Date: Sep 2022
Location: Switzerland
Posts: 120
|
Quote:
neither am I Quote:
That's also my philosophy. |
||
01 January 2024, 17:15 | #15 |
Registered User
Join Date: Sep 2022
Location: Switzerland
Posts: 120
|
If I understand Motorola's documentation correctly, MOVE16 source and destination are (forced to be) longword aligned and it always copies 16 bytes?!
I especially wonder about the 060 because of its superscalarity, i.e.

MOVE.L (A0)+,(A1)+
MOVE.L (A0)+,(A1)+
MOVE.L (A0)+,(A1)+
MOVE.L (A0)+,(A1)+

won't profit from the superscalarity, but

LOOP    MOVE.L (A0)+,(A1)+
        DBcc or Bcc LOOP

should?? "Unfortunately" MOVE16 is pOEP-only. |
01 January 2024, 19:43 | #16 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,516
|
More explicitly, if memory serves (pardon the pun), MOVE16 just ignores the lower 4 bits of the address and moves the whole cache line. This can result in out-of-bounds access for the unaware.
I've also had really tricky problems in the past with MOVE16 transfers; I think some 040s have problem cases. Last edited by Karlos; 01 January 2024 at 19:51. |
02 January 2024, 01:17 | #17 | |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,324
|
Quote:
That, and many other unexpected things. There are a couple of errata in the Motorola chips concerning MOVE16, and more errata in the implications for burst transfers over Zorro. Avoid. MOVE16 can (unexpectedly) invalidate/validate cache lines of unaffected memory (both the 040 and 060 are affected), and MOVE16 also bursts, which at least on some boards does not go too well if the source or target is only reachable over Zorro-II (most GVP boards, and also at least the B2060, are affected by this issue). |
|
02 January 2024, 16:39 | #18 |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 846
|
@Thread
There is already a MOVE16 discussion thread here: https://eab.abime.net/showthread.php?t=102820 What is the point in repeating what's already been posted on that thread? |
02 January 2024, 23:22 | #19 | |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 846
|
Quote:
|
|
04 January 2024, 07:09 | #20 |
Engineer
Join Date: Oct 2018
Location: Shadow realm
Posts: 167
|
These copymem patches... is there a kickstart patch as an alternative?
I don't see why you wouldn't want them on most of the time. |