28 December 2023, 15:30 | #1 | ||
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 846
|
Software which uses Copymem improvement patches
Quote:
https://eab.abime.net/showthread.php?t=76777

This thread will be more generalized and cover the broader discussion. The following is a list of software and tools (including my own releases here on EAB) which can use or benefit as follows:

Code:
APPLICATION........EXEC ROUTINE
FastRom2MB.........Copymemquick()
MapRom040+.........Copymemquick()
TurboRom040+.......Copymemquick()
Code:
APPLICATION...........EXEC ROUTINE
LHA_68K...............Copymem()
gvpscsi.device(4.x)...Copymem()*
omniscsi.device(6.x)..Copymem()*
vbak2091..............Copymem()*
8n1.device............Copymem()
Resource(6.x).........Copymemquick()

*Used for buffered DMA transfers.
Code:
APPLICATION...........EXEC ROUTINE
Cpu FastROM...........Copymemquick()
Last edited by SpeedGeek; 19 January 2024 at 14:45. |
||
28 December 2023, 15:40 | #2 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,516
|
The only things that will benefit from memory copy optimisation are:
1) Applications which copy large amounts of data from one place to another, which will benefit from any throughput improvements. These will soon become IO bound, however.

2) Applications which frequently copy small amounts of data from one place to another, which can benefit from reduced function call setup time.

Most applications don't fit either category, since copying large amounts of data around is generally considered "a bad thing" (tm), and equally, frequently copying smaller items around (e.g. implementing by-value semantics for structure passing) is not considered a "good thing". Even where compiled code does this, for most typical structures it's faster and simpler to directly emit a few unrolled move instructions to duplicate a typical C structure from one location to another than it is to make a library call to exec.

I think you are more likely to benefit from improving the efficiency of other memory-related calls, in particular allocation and deallocation of memory, since those can be expensive and are called quite frequently; in the end, even C applications eventually resort to calling those.

There is a somewhat mythical additional case, which will definitely benefit:

3) Applications which frequently copy non-trivial amounts of memory around. These would benefit from both reduced call setup time and improved bandwidth. However, such programs are rare and tend to be ports from other systems that are already punishingly inefficient on 68K, to the extent that routinely copying page-sized chunks of memory around is the least of your performance worries. |
29 December 2023, 15:02 | #3 | |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 846
|
Quote:
The reason why Commodore included these functions in the exec.library was to provide efficient implementations of the most commonly used functions. Since the library is always open and referenced at a specified location, it avoids the extra overhead of opening and closing the library. Whether or not a C compiler can duplicate structures more efficiently with internal functions was a decision made by the developer of that particular C compiler. I do most of my coding projects in assembler, and in the projects I have done I have referenced structures but had no particular need for duplication. Getting back to a more general discussion, I think C coders can realize just as well as ASM coders that calling Copymem() for <= 4 bytes is very inefficient (a definite should-not-do), but there are some lazy coders who will do that anyway. Last edited by SpeedGeek; 29 December 2023 at 15:10. |
|
29 December 2023, 15:35 | #4 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,516
|
Calling it for anything that takes longer than the necessary number of inlined move operations is inefficient. This is why your typical C compiler will just emit something like that when copying a structure. I am sure there's a break-even point somewhere, but that depends on the CPU and cache state.
|
29 December 2023, 17:19 | #5 | |
Registered User
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,385
|
Quote:
My 020+ routine is not optimized for 040/060 or unaligned blocks, but usually the block size that I have to copy is only between 30 bytes and 5 kB, not large enough or frequently called enough to benefit from additional checks of the circumstances and branching into CPU-optimized subroutines. Anyway, I won't notice any speed improvements from faster copying of memory blocks in my icon.library, not even with benchmarks. |
|
30 December 2023, 14:57 | #6 | |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 846
|
Quote:
You probably won't notice any performance difference under WinUAE. You will need real M68K hardware to observe the difference. Let me know if you are interested. Last edited by SpeedGeek; 30 December 2023 at 15:10. |
|
30 December 2023, 16:25 | #7 | ||||
Registered User
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,385
|
Quote:
Update: By using MOVE.W instead of MOVE.L in my loop, the icon loading and displaying got 0.5 % - 1.2 % slower without the JIT, but emulated "cycle-exact" at 28 MHz, although PCs don't like word-size instructions because they need an extra prefix byte. By using MOVE.B it was 1.3 % - 3.9 % slower. The first result was for ColorIcons (46x46), the second for large ARGB images, 64x64 or 256x256, or 16 kB - 256 kB blocks. With the JIT enabled there was no reproducible speed difference.

I don't know if you could improve the copy speed by more than a factor of two for 040/060 CPUs, but even if you can, I cannot expect to find your CopyMem patches installed on every user's system. The same problem exists for the scratch registers. I would always prefer to use an internal copy routine.

Update2: From the test results above I can estimate that my CopyMemBlock routine consumes about 1.3 % of the overall execution time for loading and displaying large icons, or just ~0.45 % for normal ColorIcons. This would mean that if your code could be twice as fast as mine on 040/060, it would gain a 0.65 % speed improvement in the best case, or maybe <1 % for being 4 times faster; nothing that anybody would ever notice. That's similar to the conclusion I had to draw when I once tried to improve the MathLibs in 2005 and never found an application which really got faster.

Last edited by PeterK; 30 December 2023 at 23:28. |
||||
31 December 2023, 16:40 | #8 | ||
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 846
|
Quote:
Longword alignment benefits the most when the source and destination have equal offsets. Otherwise, it's better to have at least one operand longword aligned. For the 030, destination alignment gives a small performance improvement (because it has a write buffer), but for the 020 it won't make any difference. Quote:
BTW, the code in CMQ&B040 is quite different from CMQ&B. If you want to see Move16 performance differences on a real 68040, you will find them in the "Testit" results posted in the thread linked in post #1. Last edited by SpeedGeek; 31 December 2023 at 19:17. |
||
31 December 2023, 17:20 | #9 |
Registered User
Join Date: Sep 2022
Location: Switzerland
Posts: 120
|
Interesting topic!
I am still learning Assembler and also thought about doing my own specialized CopyMem. Yeah, I also think that WinUAE is not (always) a good way of doing benchmarks?!

That's a point I am wondering about. On 020+, MOVE.L is the fastest. But how about on the 000 and 010? How do 2x MOVE.W compare to 1x MOVE.L, as the 000 and 010 have to split 32-bit accesses into two 16-bit accesses?! And on the 010, due to its special loop mode, is it a good idea to unroll mini loops, i.e. (pseudo-code) Code:
LOOP    MOVE
        DBcc    D0, LOOP
Code:
LOOP    MOVE
        MOVE
        MOVE
        MOVE
        DBcc    D0/4, LOOP |
31 December 2023, 20:28 | #10 | ||
Registered User
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,385
|
Quote:
The purpose of my tests with MOVE.W and MOVE.B was to see approximately how much longer the CopyMem routine would take, simply assuming two or four times the number of cycles, even if that might not be 100 % correct. I think these quick tests are good enough to estimate by rule of thumb which portion of the execution time CopyMemBlock takes in my library. Last edited by PeterK; 31 December 2023 at 20:56. |
||
31 December 2023, 21:17 | #11 | |
Registered User
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,385
|
Quote:
Update: I've examined the 40 calls of CopyMemBlock in my library a bit more. Unfortunately, there are only 3 large aligned ARGB longword copies, but none of them is used for normal icon decoding: one is only for icons with 1 image, another only for the second images of selected icons, and the last is for a special case where speed doesn't matter. And there are 4 copies for blocks of a few kB which can't use CopyMemQuick(); nothing that will benefit enough from calling exec CopyMem(). Last edited by PeterK; 01 January 2024 at 16:45. |
|
31 December 2023, 23:02 | #12 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,516
|
What's the typical break-even point, in bytes, at which the efficiency of CMQ with all these patches pays for its own library call overhead, compared to a locally inlined move.l loop with a spot of unrolling? Let's say I'm most interested in the 68030 and above for this question, and we are dealing with a cacheable source but uncacheable destination (e.g. local Fast RAM to RTG VRAM).
|
01 January 2024, 16:38 | #13 | ||
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 846
|
Quote:
If you really think preserving and restoring D1 to support a JSR to Copymem() is going to reduce the performance of your CopyMemBlock function that much, I suppose I could give you a license to use and distribute the CMQ&B 1.7 source code with your icon.library. You will see that it provides quite a large amount of unrolled MOVE.L with minimal support code. Now, this large amount of unrolled MOVE.L seems to give the best performance on 020/030 systems (even so, there is a point of diminishing returns). The instruction cache is only 256 bytes, and any Copymem code which is too big to fit, or which should co-exist with other code you want cached, is something to think about.

For the 040/060 things are much different. They certainly have much larger instruction caches, but you will just end up wasting them with large unrolled loops. This is because the execution pipelines are very much faster, such that small loops running from the cache can execute as fast as the large loops on 020/030. Quote:
Last edited by SpeedGeek; 01 January 2024 at 22:34. |
||
01 January 2024, 16:57 | #14 | ||
Registered User
Join Date: Sep 2022
Location: Switzerland
Posts: 120
|
Quote:
neither am I Quote:
That's also my philosophy. |
||
01 January 2024, 17:15 | #15 |
Registered User
Join Date: Sep 2022
Location: Switzerland
Posts: 120
|
If I understand Motorola's documentation correctly, MOVE16 source and destination are (forced to be) longword aligned and it always copies 16 bytes?!
I especially wonder about the 060 because of its superscalarity, i.e.

MOVE.L (A0)+,(A1)+
MOVE.L (A0)+,(A1)+
MOVE.L (A0)+,(A1)+
MOVE.L (A0)+,(A1)+

won't profit from the superscalarity, but

LOOP    MOVE.L (A0)+,(A1)+
        DBcc or Bcc LOOP

should?? "Unfortunately" MOVE16 is pOEP-only. |
01 January 2024, 19:43 | #16 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,516
|
More explicitly, if memory serves (pardon the pun), MOVE16 just ignores the lower 4 bits of the address and moves the whole cache line. This can result in out-of-bounds access for the unaware.
I've also had really tricky problems in the past with MOVE16 transfers; I think some 040s have problem cases. Last edited by Karlos; 01 January 2024 at 19:51. |
02 January 2024, 01:17 | #17 | |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,324
|
Quote:
That, and many other unexpected things. There are a couple of errata in the Motorola chips concerning MOVE16, and more errata in the implications for burst transfers over Zorro. Avoid. MOVE16 can (unexpectedly) invalidate/validate cache lines of unaffected memory (both the 040 and 060 are affected), and MOVE16 also bursts, which at least on some boards does not go too well if the source or target is only reachable over Zorro-II (most GVP boards, and also at least the B2060, are affected by this issue). |
|
02 January 2024, 16:39 | #18 |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 846
|
@Thread
There is already a MOVE16 discussion thread here: https://eab.abime.net/showthread.php?t=102820 What is the point in repeating what's already been posted on that thread? |
02 January 2024, 23:22 | #19 | |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 846
|
Quote:
|
|
04 January 2024, 07:09 | #20 |
Engineer
Join Date: Oct 2018
Location: Shadow realm
Posts: 167
|
These copymem patches... is there a kickstart patch as an alternative?
I don't see why you wouldn't want them on most of the time. |