C++ to Assembler conversion (speedup) memory copy hack - Page 2

NOB · 05 February 2010, 16:10

How many clock cycles does movem.l (a0),d0-d7/a2-a6 need (68000 CPU)?

Edit: added CPU type

StingRay · 05 February 2010, 16:14

Quote:

Originally Posted by NOB

How many clock cycles does movem.l (a0),d0-d7/a2-a6 need?

This question can't be answered since you didn't say on which CPU.

matthey · 06 February 2010, 00:34

Quote:

Originally Posted by NOB

How many clock cycles does movem.l (a0),d0-d7/a2-a6 need (68000 CPU)?
Edit: added CPU type

movem.l (an)+,reg_list ; 8+8n (2+2n/0)
movem.l reg_list,-(an) ; 4+10n (1/n)

n=number of registers moved.

Maybe you can explain how the movem.l (a0),reg_list works. It would have been handy if there was a movem.l (a0)+,(a1)+ though. Then it would really be move memory.

Toni Wilen · 06 February 2010, 09:27

(If you are wondering why 68000 MOVEM to registers is slower)

MOVEM to register(s) is one memory cycle (4 clocks) slower because it does one extra dummy word read at the end (something to do with pipelining in microcode, I think).

MOVEM.W (An)+,<registers> = 4+4*n+4+4
MOVEM.W <registers>,-(An) = 4+4*n+4

MOVEM.L (An)+,<registers> = 4+(2*4)*n+4+4
MOVEM.L <registers>,-(An) = 4+(2*4)*n+4

(I wrote them that way because this shows the order of memory accesses)
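The breakdown above can be tallied with a short Python sketch. This is only an illustration of the quoted formulas; the function name and its parameters are made up for this example and come from no tool or manual.

```python
def movem_cycles_68000(size, n, to_registers):
    """Clock cycles for a 68000 MOVEM, following the breakdown quoted above.

    size: "w" or "l"
    n: number of registers in the register list
    to_registers: True for MOVEM (An)+,<registers>,
                  False for MOVEM <registers>,-(An)
    """
    if size not in ("w", "l"):
        raise ValueError("size must be 'w' or 'l'")
    accesses_per_register = 1 if size == "w" else 2  # a longword is two word accesses
    cycles = 4                                # leading 4 in the formulas
    cycles += 4 * accesses_per_register * n   # 4*n for words, (2*4)*n for longwords
    cycles += 4                               # trailing 4 in the formulas
    if to_registers:
        cycles += 4                           # the extra dummy word read
    return cycles


if __name__ == "__main__":
    # d0-d7/a2-a6 is a list of 13 registers
    print(movem_cycles_68000("l", 13, True))
    print(movem_cycles_68000("l", 13, False))
```

For the 13-register list d0-d7/a2-a6 from the original question this gives 116 clocks for the load, assuming plain (a0) costs the same as (a0)+, and 112 clocks for the store with predecrement.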

MOVEM.L + blitter is the fastest way to clear/copy Chip RAM data on a non-expanded 68000 Amiga. (The CPU only uses every other memory cycle, so half of the cycles would be wasted without the blitter.)

NOB · 06 February 2010, 11:56

Quote:

Originally Posted by matthey

Maybe you can explain how the movem.l (a0),reg_list works.

Asmone is not that accurate.

(a0) or (a0)+ does not impress asmone in this case.

Interesting info! Thanks

StingRay · 06 February 2010, 12:37

Quote:

Originally Posted by matthey

It would have been handy if there was a movem.l (a0)+,(a1)+ though. Then it would really be move memory.

movem = move multiple (registers), not move memory.

Quote:

Originally Posted by NOB

Asmone is not that accurate.

(a0) or (a0)+ does not impress asmone in this case.

What do you mean here? Asmone is not that accurate? It assembles movem.l (ax),reglist correctly and it does the same for movem (ax)+,reglist.

Code:

    lea    test(pc),a0
    movem.l    (a0),d0-d7        ; a0 not changed
    ....
    lea    test(pc),a0
    movem.l    (a0)+,d0-d7        ; a0 will point to test+8*4 after movem

NOB · 06 February 2010, 13:47

Oops! Thanks for the correction! I overlooked that.

PeterK · 06 February 2010, 23:06

Your discussion here just prompted me to change the CopyMemBlock() routine of the icon.library too. The source buffers are always aligned and D1 is a scratch register in this case:

Code:

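; D0 = number of bytes (works only < 2 MByte)
; A0 = source address, A1 = target address
CopyMemBlock        ROR.L          #5,D0            ; D0.W = bytes/32, remainder in top 5 bits
                    SUBQ.W         #1,D0            ; adjust block count for DBRA
                    BCS.S          .less32          ; no complete 32-byte block
                    MOVEM.L        D2-D7/A2,-(SP)   ; save registers (D1 is scratch)
.next32             MOVEM.L        (A0)+,D1-D7/A2   ; load 32 bytes
                    MOVEM.L        D1-D7/A2,(A1)    ; store 32 bytes
                    LEA            32(A1),A1
                    DBRA.S         D0,.next32
                    MOVEM.L        (SP)+,D2-D7/A2   ; restore registers
.less32             ADDQ.W         #1,D0            ; D0.W = 0 again
                    ROL.L          #5,D0            ; D0 = remaining bytes (0..31)
                    SUBQ.W         #1,D0
                    BCS.S          .done
.last31             MOVE.B         (A0)+,(A1)+
                    DBRA.S         D0,.last31
.done               RTS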
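
To see where the "< 2 MByte" limit comes from, the ROR.L/ROL.L arithmetic can be modelled in a few lines of Python. This is a sketch of the register arithmetic only, not of the copy itself; split_count and the two rotate helpers are hypothetical names invented for this example.

```python
MASK32 = 0xFFFFFFFF


def ror32(value, bits):
    """Rotate a 32-bit value right, like ROR.L."""
    return ((value >> bits) | (value << (32 - bits))) & MASK32


def rol32(value, bits):
    """Rotate a 32-bit value left, like ROL.L."""
    return ((value << bits) | (value >> (32 - bits))) & MASK32


def split_count(byte_count):
    """Return (blocks_of_32, tail_bytes) as the routine derives them from D0."""
    d0 = ror32(byte_count & MASK32, 5)
    blocks = d0 & 0xFFFF      # SUBQ.W and DBRA only look at the low word
    d0 &= 0xFFFF0000          # low word is 0 after the loop and ADDQ.W #1
    d0 = rol32(d0, 5)         # bring the low 5 bits of the count back
    tail = d0 & 0xFFFF        # the byte loop also counts with the low word
    return blocks, tail


if __name__ == "__main__":
    for count in (31, 100, 2 * 1024 * 1024 - 1, 2 * 1024 * 1024):
        print(count, split_count(count))
```

With exactly 2 MByte the block count no longer fits into the low word, so both values come out as 0 and the routine would copy nothing.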

matthey · 07 February 2010, 06:01

@PeterK

Good to see you still improving icon.library. It works well here. Thanks.

You say your data is aligned. Are they longword or just word aligned? Is your size an even number of longwords as well like is required by exec CopyMemQuick()? If it is, then you are better off changing the last loop to something like...

lastloop:
move.l (a0)+,(a1)+
subq.l #4,d0
bhi.b lastloop

I would expect your code to perform similarly to a patched exec CopyMem() or CopyMemQuick() on the 68000, 68020 and 68030 after the overhead of the library call is considered (only a jmp and move.l execbase,a6 if your routine is not inlined). It will perform very poorly on a 68040 because...

movem is slower than move
the branches are the wrong way and there is no branch cache
dbra has more setup time than bcc and scheduling is worse
the 68040 does not like shifts or rotates

The 68060 fares better but prefers...

branches the other way (as on the 68040, taken branches are faster)
long operations

I would expect an average performance compared to using a patched exec CopyMem() or CopyMemQuick() of...

68000 95-105%
68020 95-105%
68030 95-105%
68040 50-70%
68060 70-90%

Testing (timing) tells the real story, and most of the good CopyMem() and CopyMemQuick() patches have been tested and are hard to beat. Rolling your own CopyMem() will only be a few percent faster at best on the 68000-68030 but will be considerably slower on the 68040-68060 and take more memory on all. Plus there is (hopefully) no need for testing and debugging when using the exec copy routines.

PeterK · 07 February 2010, 21:30

Quote:

You say your data is aligned. Are they longword or just word aligned?

Yes, the buffers are usually allocated by AllocVec() and thus longword aligned.

Quote:

Is your size an even number of longwords as well like is required by exec CopyMemQuick()?

Autodocs say: the size must be an integral number of longwords (e.g. the size must be evenly divisible by four)
...not an even number of longwords, but in my case it can be any number of bytes, worst example: copying the RGB bytes of a colormap, so the number can be odd.

Quote:

It will perform very poorly on a 68040 because...

movem is slower than move
the branches are the wrong way and there is no branch cache
dbra has more setup time than bcc and scheduling is worse
the 68040 does not like shifts or rotates

Ooopps, I didn't know that the 68040 is sooo badly designed

Btw, do you have any CPU timing tables for the 68040 and the 68060 ?
Update: Already found this page now: http://www.mactech.com/articles/mact...040/index.html

Ok, maybe my copy routine is not optimal in every respect, but it's probably faster now than the MOVE.L (A0)+,(A1)+ routine with a PC relative jumptable at the end that I found in the original library. Or is that better for a 68040?

Please remember that the average block size will be something between 500 and 5000 bytes only and atm this is not the bottleneck in the icon.library. There are still more than 4000 other code lines to analyze and to optimize.

But if I get the feeling one day that this is becoming the bottleneck, then I will try to use exec's CopyMem() instead, as you suggested. Many thanks anyway for your helpful comments.

matthey · 08 February 2010, 01:00

Quote:

Originally Posted by PeterK

Yes, the buffers are usually allocated by AllocVec() and thus longword aligned.

Autodocs say: the size must be an integral number of longwords (e.g. the size must be evenly divisible by four)
...not an even number of longwords, but in my case it can be any number of bytes, worst example: copying the RGB bytes of a colormap, so the number can be odd.

You would have to use CopyMem() then. It's just a few cycles slower before the main copy loop and after as it checks alignment and longword aligns the data if necessary. The 68020/68030 may even use a move.b loop for really small copies as you do. The main copy loop should not be any slower with CopyMem() than CopyMemQuick().
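
As a rough picture of that "align first, then copy longwords" idea, here is a Python sketch. It is not exec's actual code: plan_copy is a hypothetical helper, and it assumes source and destination share the same alignment so that a single address decides the split.

```python
def plan_copy(address, byte_count):
    """Split a copy into (head_bytes, longwords, tail_bytes).

    head_bytes are copied one by one until the address is longword
    aligned, then whole longwords follow, then the leftover bytes.
    Assumption: source and destination have the same alignment.
    """
    head = min((-address) % 4, byte_count)   # bytes up to the next multiple of 4
    longwords, tail = divmod(byte_count - head, 4)
    return head, longwords, tail


if __name__ == "__main__":
    print(plan_copy(0x1000, 10))   # already aligned
    print(plan_copy(0x1001, 10))   # three bytes needed to align
```

For buffers that are already longword aligned the head is always 0, so only the short tail handling is extra work compared to a longword-only copy.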

Quote:

Ooopps, I didn't know that the 68040 is sooo badly designed

Btw, do you have any CPU timing tables for the 68040 and the 68060 ?

The 68040 is very different but not really bad. The movem being slower than move is an oversight, but taken branches being faster is the better way (Natami N68050+ will be the same way). The 68040 has its advantages too. It can process large and complex instructions with almost no slowdown. It's not that it's slow with smaller instructions, it's just that it could do more.

The timings are in the 68040 and 68060 User's Manuals, which were still available from Freescale (web site) last time I checked. If not, PM me your e-mail and I will mail you the PDFs. The 68060 timings are tricky as it's superscalar (2 integer instructions at once). Since most instructions take 1 cycle, it matters more whether an instruction can execute in both integer units. Swap, for example, cannot, but shift can: the 68060 can do 2 shifts (of any allowed size) per cycle but only 1 swap.

Quote:

Ok, maybe my copy routine is not optimal in every respect, but it's probably faster now than the MOVE.L (A0)+,(A1)+ routine with a PC relative jumptable at the end that I found in the original library. Or is that better for a 68040?

You mean the SASC standard brain dead copy routine with jump table? It starts like this...

mem_copy:
cmpi.l #$18,d0
blt .smallcopy
.loop:
move.l (a0)+,(a1)+
move.l (a0)+,(a1)+
move.l (a0)+,(a1)+
move.l (a0)+,(a1)+
move.l (a0)+,(a1)+
move.l (a0)+,(a1)+
subi.l #$18,d0
cmpi.l #$18,d0
bge .loop
.smallcopy:
add.w d0,d0
move.w (jumptable,pc,d0.w),d0
jmp (2,pc,d0.w)

The 68040 likes the unrolled move.l loop. It's the fastest way to copy memory on the 68040 for smaller copies (move16 for bigger). It's not slowed down by the 6 byte xxxi.l instructions. It handles the complex jumptable setup instructions with ease, but it doesn't like the jmp because it cannot be predicted. The 68040 and 68060 do not like jump tables because of their more sophisticated branch prediction. The copy routine does not attempt to align data, but that is fine in your case (longword aligned src and dest). I would expect this routine to be faster on the 68040 than yours in most cases. It's still pretty poorly written: the instructions are big, and having both a subi and a cmpi inside the loop is unnecessary. I would expect your copy routine to perform better in most cases on most processors, and it's smaller. It is difficult to make a copy routine that performs well on the whole 68k family, and that is why the exec memory copy routines come in handy. They also save code.

Quote:

Please remember that the average block size will be something between 500 and 5000 bytes only and atm this is not the bottleneck in the icon.library. There are still more than 4000 other code lines to analyze and to optimize.

Even small copies can be substantially faster with efficient copy routines. Look at the chart in the readme here...

http://fi.aminet.net/util/boot/CopyMem.readme

Copying 4 bytes a set number of times with AmigaOS 3.9 CopyMem() went from .51 seconds to .06 seconds with my CopyMem060 patch. Similar gains were made for the 68040. The standard AmigaOS 3.9 CopyMem() uses a movem.l loop and was improved from AmigaOS 3.5.

Quote:

But if I get the feeling one day that this is becoming the bottleneck, then I will try to use exec's CopyMem() instead, as you suggested. Many thanks anyway for your helpful comments.

Copying memory uses many cycles. Maybe I'll check the timing for you if I find enough time.

PeterK · 08 February 2010, 01:23

Quote:

The 68040 likes the unrolled move.l loop.

Yes, it's exactly this SASC routine. Meanwhile, I've done some calculations comparing the MOVE.L (A0)+,(A1)+ version with the MOVEM.L version. I'm surprised, because the first version is even faster on the 68020. It needs just 64 cycles, but the MOVEM.L version needs more!! So I will go back to 8 MOVE.L instructions, and then I won't need any additional register backups on the stack either. MOVEM.L may only have an advantage on the 68020 when more registers are used.

Code:

CopyMemBlock        ROR.L          #5,D0           ; works only < 2 MByte; D0.W = bytes/32
                    BRA.S          .test32
.copy32             MOVE.L         (A0)+,(A1)+     ; copy one 32-byte block
                    MOVE.L         (A0)+,(A1)+
                    MOVE.L         (A0)+,(A1)+
                    MOVE.L         (A0)+,(A1)+
                    MOVE.L         (A0)+,(A1)+
                    MOVE.L         (A0)+,(A1)+
                    MOVE.L         (A0)+,(A1)+
                    MOVE.L         (A0)+,(A1)+
.test32             DBRA.S         D0,.copy32
                    ADDQ.W         #1,D0           ; D0.W = 0 again
                    ROL.L          #3,D0           ; D0.W = remaining longwords (0..7)
                    BRA.S          .longs
.copy4              MOVE.L         (A0)+,(A1)+
.longs              DBRA.S         D0,.copy4
                    ADDQ.W         #1,D0           ; D0.W = 0 again
                    ROL.L          #2,D0           ; D0.W = remaining bytes (0..3)
                    BRA.S          .bytes
.copy1              MOVE.B         (A0)+,(A1)+
.bytes              DBRA.S         D0,.copy1
                    RTS

Ok, what do you think about this version?

matthey · 08 February 2010, 14:58

That should perform better on all 68k processors. Yes, the movem.l has a lot of overhead to overcome (saving the registers to the stack) and needs to use most of the registers to be efficient. An unrolled move.l loop is fast and good for the whole 68k family and all but large copies. The efficiency comes from the removed branches, which eat many cycles except on the 68060 with its 0 cycle predicted branches. You are on the right track. I'm not a fan of the bra into the dbra loop. I usually use a subq and bcc as it skips the branch at the top and isn't any slower on the 68040-68060. It might be slightly slower in the loop on the 68020-68030. Dbra's biggest advantage is on the 68000. I haven't really studied the 68000-68030 like I have the later processors though. The other thing about your routine is the last loop. It runs 3 times at most. It works better to just btst bit 1 and do a move.w (a0)+,(a1)+ if it's set, and then btst bit 0 (or lsr into carry) and do a move.b (a0)+,(a1)+ if it's set.
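
In Python terms, the loop-free tail suggested above might look like the sketch below. copy_tail, its parameters and the bytearray buffers are made up for illustration; only the two bit tests mirror the suggestion.

```python
def copy_tail(src, dst, offset, remaining):
    """Copy the last 0..3 bytes without a loop and return the new offset.

    Bit 1 of the remaining count selects one word copy,
    bit 0 selects one byte copy.
    """
    if remaining & 2:                                   # like btst #1
        dst[offset:offset + 2] = src[offset:offset + 2]  # move.w (a0)+,(a1)+
        offset += 2
    if remaining & 1:                                   # like btst #0
        dst[offset] = src[offset]                        # move.b (a0)+,(a1)+
        offset += 1
    return offset


if __name__ == "__main__":
    source = bytes(range(8))
    target = bytearray(8)
    print(copy_tail(source, target, 4, 3), target)
```

Two tests and at most two moves replace up to three trips through a byte loop.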

I remembered that those manuals are online here...

http://amiga.serveftp.net/datasheets.html

They are called datasheets there, which is the same as the User's Manuals. There are 68020-68060 "datasheets" with timings and other good info. This is a great Amiga site, so have a look around too.

PeterK · 08 February 2010, 19:24

@matthey

Many thanks for all your useful hints. It's always good to get some advice from an expert who knows all the arguments for and against any instruction in such a routine already. - Not tested yet, but the following should do it?

Code:

.longs              DBRA.S         D0,.copy4
                    ADD.L          D0,D0           ; bit 1 of the byte count -> carry
                    BCC.S          .last
                    MOVE.W         (A0)+,(A1)+     ; copy the remaining word
.last               ADD.L          D0,D0           ; bit 0 of the byte count -> carry
                    BCC.S          .done
                    MOVE.B         (A0),(A1) ; no post increment, thx matthey
.done               RTS
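
Why the two ADD.L D0,D0 work here can be checked with a small Python model of D0. It assumes the ROR.L #5 / ROL.L #3 sequence of the previous CopyMemBlock version and a byte count below 2 MByte; the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def add_l_d0_d0(d0):
    """Model ADD.L D0,D0: return (new D0, carry), carry being the old bit 31."""
    return (d0 << 1) & MASK32, (d0 >> 31) & 1


def tail_flags(byte_count):
    """Return (copy_word, copy_byte) as the two BCC.S tests would see them."""
    # State after the .longs loop: bits 1 and 0 of the byte count sit in
    # D0 bits 31 and 30, and the low word is $FFFF from the finished DBRA.
    d0 = ((byte_count & 3) << 30) | 0xFFFF
    d0, word_carry = add_l_d0_d0(d0)
    d0, byte_carry = add_l_d0_d0(d0)
    return bool(word_carry), bool(byte_carry)


if __name__ == "__main__":
    for count in (0, 1, 2, 3, 37):
        print(count, tail_flags(count))
```

The $FFFF left in the low word does not matter because only the bit shifted out at the top reaches the carry flag.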

@NovaCoder

I hope this was not too much like a thread hijacking for you. Sorry, if so!

matthey · 09 February 2010, 01:19

I didn't think that would work at first but that add.l d0,d0 followed by bcc is very fast and I think it will work. You can drop the post increment on the last move though...

MOVE.B (A0),(A1) ;removing post increment is faster except 68060

Here is a little 68k optimization guide that is very good also...

http://www.freescale.com/files/32bit...80X0OPTAPP.txt

PeterK · 09 February 2010, 20:37

Yes, the ADD.L D0,D0 and BCC.S works and it avoids the ADDQ.W #1,D0 and ROL.L #2,D0 and also the BTST instructions.

Thanks for your great attention regarding the post increment. I'm soo blind and have simply overlooked that.

Btw, the quality of the "datasheets" from http://amiga.serveftp.net is much better than that of the "same" manuals from Freescale, because Freescale's PDF files are just a poor-quality rescan.
