English Amiga Board


Old 05 February 2010, 16:10   #21
NOB
Zone Friend
 
Join Date: Aug 2005
Location: Germany
Age: 52
Posts: 424
How many clock cycles does movem.l (a0),d0-d7/a2-a6 need (68000 CPU)?



edit:CPU type

Last edited by NOB; 05 February 2010 at 18:55.
Old 05 February 2010, 16:14   #22
StingRay
move.l #$c0ff33,throat
 
Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 6,863
Quote:
Originally Posted by NOB View Post
How many clock cycles does movem.l (a0),d0-d7/a2-a6 need?
This question can't be answered since you didn't say on which CPU.
Old 06 February 2010, 00:34   #23
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Quote:
Originally Posted by NOB View Post
How many clock cycles does movem.l (a0),d0-d7/a2-a6 need (68000 CPU)?
edit:CPU type
movem.l (an)+,reg_list ; 12+8n (3+2n/0)
movem.l reg_list,-(an) ; 8+8n (2/2n)

n=number of registers moved.

Maybe you can explain how the movem.l (a0),reg_list works. It would have been handy if there was a movem.l (a0)+,(a1)+ though. Then it would really be move memory.
Old 06 February 2010, 09:27   #24
Toni Wilen
WinUAE developer
 
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,505
(If you are wondering why 68000 MOVEM to registers is slower)

MOVEM to register(s) is one memory access (4 clock cycles) slower because it does one extra dummy word read at the end. (something to do with pipelining in microcode I think)

MOVEM.W (An)+,<registers> = 4+4*n+4+4
MOVEM.W <registers>,-(An) = 4+4*n+4

MOVEM.L (An)+,<registers> = 4+(2*4)*n+4+4
MOVEM.L <registers>,-(An) = 4+(2*4)*n+4

(I wrote them that way because this shows the order of memory accesses)
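
As a quick cross-check, the breakdowns above can be put into a small C helper. This is only a sketch: the function names are made up for illustration, and it assumes the plain (An), (An)+ and -(An) addressing modes discussed here, with n counting the registers in the list.

```c
#include <assert.h>

/* 68000 MOVEM cycle counts, written in the same order as the breakdown above.
   Both function names are hypothetical. n = number of registers in the list. */

/* memory -> registers: 4 + transfers + 4, plus 4 for the dummy word read */
static int movem_to_regs_cycles(int n, int is_long)
{
    assert(n >= 1 && n <= 16);
    int per_reg = is_long ? 2 * 4 : 4;   /* a longword needs two word accesses */
    return 4 + per_reg * n + 4 + 4;
}

/* registers -> memory: same pattern without the dummy read at the end */
static int movem_to_mem_cycles(int n, int is_long)
{
    assert(n >= 1 && n <= 16);
    int per_reg = is_long ? 2 * 4 : 4;
    return 4 + per_reg * n + 4;
}
```

For the movem.l (a0),d0-d7/a2-a6 from the original question the list has 8 + 5 = 13 registers, so movem_to_regs_cycles(13, 1) works out to 4 + 8*13 + 4 + 4 = 116 cycles.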

MOVEM.L + blitter is the fastest way to clear/copy Chip RAM data on a non-expanded 68000 Amiga. (The CPU only uses every other cycle, so half of the cycles would be wasted without the blitter.)
Old 06 February 2010, 11:56   #25
NOB
Zone Friend
 
Join Date: Aug 2005
Location: Germany
Age: 52
Posts: 424
Quote:
Originally Posted by matthey View Post


Maybe you can explain how the movem.l (a0),reg_list works .

Asmone is not that accurate. (a0) or (a0)+ does not impress asmone in this case.

Interesting info! Thanks.
Attached image: test (a0).png (8.1 KB)

Last edited by NOB; 06 February 2010 at 12:03.
Old 06 February 2010, 12:37   #26
StingRay
move.l #$c0ff33,throat
 
Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 6,863
Quote:
Originally Posted by matthey View Post
It would have been handy if there was a movem.l (a0)+,(a1)+ though. Then it would really be move memory .
movem = move multiple (registers), not move memory.

Quote:
Originally Posted by NOB View Post
Asmone is not that accurate. (a0) or (a0)+ does not impress asmone in this case.
What do you mean here? Asmone is not that accurate? It assembles movem.l (ax),reglist correctly and it does the same for movem (ax)+,reglist.

Code:
    lea    test(pc),a0
    movem.l    (a0),d0-d7        ; a0 not changed
    ....
    lea    test(pc),a0
    movem.l    (a0)+,d0-d7        ; a0 will point to test+8*4 after movem
Old 06 February 2010, 13:47   #27
NOB
Zone Friend
 
Join Date: Aug 2005
Location: Germany
Age: 52
Posts: 424
Oops! Thanks for the correction! I had overlooked that.
Old 06 February 2010, 23:06   #28
PeterK
Registered User
 
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,365
Your discussion here just induced me to change the CopyMemBlock() routine of the icon.library too. The source buffers are always aligned and D1 is a scratch register in this case:

Code:
CopyMemBlock        ROR.L          #5,D0            ; works only < 2 MByte
                    SUBQ.W         #1,D0            ; D0 = number of bytes
                    BCS.S          .less32          ; A0 = source address
                    MOVEM.L        D2-D7/A2,-(SP)   ; A1 = target address
.next32             MOVEM.L        (A0)+,D1-D7/A2
                    MOVEM.L        D1-D7/A2,(A1)
                    LEA            32(A1),A1
                    DBRA.S         D0,.next32
                    MOVEM.L        (SP)+,D2-D7/A2
.less32             ADDQ.W         #1,D0
                    ROL.L          #5,D0
                    SUBQ.W         #1,D0
                    BCS.S          .done
.last31             MOVE.B         (A0)+,(A1)+
                    DBRA.S         D0,.last31
.done               RTS
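
For anyone who wants to check the rotate trick without an assembler, here is a minimal C model of what ROR.L #5 does to the byte count. The helper names ror32 and split_count are made up for this sketch; it only models the register contents, not the copying itself.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit value right by k bits, like ROR.L #k,D0 (0 < k < 32). */
static uint32_t ror32(uint32_t v, unsigned k)
{
    return (v >> k) | (v << (32u - k));
}

/* Hypothetical helper: split a byte count the way the routine does.
   After the rotate, the low word holds the number of 32-byte blocks and
   the top five bits hold the 0..31 leftover bytes. */
static void split_count(uint32_t bytes, uint16_t *blocks, uint32_t *rest)
{
    assert(bytes < (1u << 21));   /* the "< 2 MByte" limit: 65536 blocks * 32 */
    uint32_t d0 = ror32(bytes, 5);
    *blocks = (uint16_t)d0;       /* the word that SUBQ.W and DBRA work on */
    *rest   = d0 >> 27;           /* what ROL.L #5 brings back into the low bits */
}
```

The 2 MByte limit falls out of the model directly: the block count has to fit into the 16-bit word that DBRA counts down, and 65536 blocks of 32 bytes are 2 MByte.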

Last edited by PeterK; 07 February 2010 at 00:47.
Old 07 February 2010, 06:01   #29
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
@PeterK

Good to see you still improving icon.library. It works well here. Thanks.

You say your data is aligned. Are they longword or just word aligned? Is your size an even number of longwords as well like is required by exec CopyMemQuick()? If it is, then you are better off changing the last loop to something like...

lastloop:
    move.l  (a0)+,(a1)+
    subq.l  #4,d0
    bhi.b   lastloop

I would expect your code to perform similarly to a patched exec CopyMem() or CopyMemQuick() on the 68000, 68020 and 68030 once the overhead of the library call is considered (only a jmp and a move.l execbase,a6 if your routine is not inlined). It will perform very poorly on a 68040 because...

movem is slower than move
the branches are the wrong way and there is no branch cache
dbra has more setup time than bcc and scheduling is worse
the 68040 does not like shifts or rotates

The 68060 fares better but prefers...

branches the other way (as on the 68040, taken branches are faster)
long operations

I would expect an average performance compared to using a patched exec CopyMem() or CopyMemQuick() of...

68000 95-105%
68020 95-105%
68030 95-105%
68040 50-70%
68060 70-90%

Testing (timing) tells the real story, and most of the good CopyMem()/CopyMemQuick() patches have been tested and are hard to beat. Rolling your own CopyMem() will only be a few percent faster at best on the 68000-68030 but will be considerably slower on the 68040-68060 and take more memory on all of them. Plus there is (hopefully) no need for testing and debugging when using the exec copy routines.

Last edited by matthey; 07 February 2010 at 07:47.
Old 07 February 2010, 21:30   #30
PeterK
Registered User
 
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,365
Quote:
You say your data is aligned. Are they longword or just word aligned?
Yes, the buffers are usually allocated by AllocVec() and thus longword aligned.

Quote:
Is your size an even number of longwords as well like is required by exec CopyMemQuick()?
Autodocs say: the size must be an integral number of longwords (e.g. the size must be evenly divisible by four)
...not an even number of longwords, but in my case it can be any number of bytes, worst example: copying the RGB bytes of a colormap, so the number can be odd.

Quote:
It will perform very poorly on a 68040 because...

movem is slower than move
the branches are the wrong way and there is no branch cache
dbra has more setup time than bcc and scheduling is worse
the 68040 does not like shifts or rotates
Ooopps, I didn't know that the 68040 is sooo badly designed
Btw, do you have any CPU timing tables for the 68040 and the 68060 ?
Update: Already found this page now: http://www.mactech.com/articles/mact...040/index.html

Ok, maybe my copy routine is not optimal under every aspect but it's probably faster now than the MOVE.L (A0)+, (A1)+ routine with a PC relative jumptable at the end like I found it in the original library. Or is that better for an 68040 ?

Please remember that the average block size will be something between 500 and 5000 bytes only and atm this is not the bottleneck in the icon.library. There are still more than 4000 other code lines to analyze and to optimize.

But if I should get the feeling that this becomes the bottleneck one day then I would try to use execs CopyMem() instead as you suggested. Many thanks anyway for your helpful comments.

Last edited by PeterK; 07 February 2010 at 23:11.
Old 08 February 2010, 01:00   #31
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Quote:
Originally Posted by PeterK View Post
Yes, the buffers are usually allocated by AllocVec() and thus longword aligned.

Autodocs say: the size must be an integral number of longwords (e.g. the size must be evenly divisible by four)
...not an even number of longwords, but in my case it can be any number of bytes, worst example: copying the RGB bytes of a colormap, so the number can be odd.
You would have to use CopyMem() then. It's just a few cycles slower before the main copy loop and after as it checks alignment and longword aligns the data if necessary. The 68020/68030 may even use a move.b loop for really small copies as you do. The main copy loop should not be any slower with CopyMem() than CopyMemQuick().
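
To make the restriction concrete, here is a small sketch of the test a caller could do before choosing between the two calls. fits_copymemquick is a made-up helper, not part of exec; it only encodes the CopyMemQuick() conditions (longword-aligned source and destination, size divisible by four).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not an exec function: returns 1 when a copy meets
   the CopyMemQuick() restrictions, 0 when CopyMem() is the one to call. */
static int fits_copymemquick(const void *src, const void *dst, size_t size)
{
    assert(src != NULL && dst != NULL);
    /* OR the three values together: any set bit in the low two bits means
       a misaligned pointer or a size that is not a multiple of four. */
    uintptr_t bits = (uintptr_t)src | (uintptr_t)dst | (uintptr_t)size;
    return (bits & 3u) == 0;
}
```

With AllocVec() buffers the pointer part always passes, so for a colormap copy with an odd byte count it is the size alone that rules out CopyMemQuick().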

Quote:
Ooopps, I didn't know that the 68040 is sooo badly designed
Btw, do you have any CPU timing tables for the 68040 and the 68060 ?
The 68040 is very different but not really bad. The movem being slower than move is an oversight, but taken branches being faster is the better way (Natami N68050+ will be the same way). The 68040 has its advantages too. It can process large and complex instructions with relatively little slowdown. It's not that it's slow with smaller instructions, it's just that it could do more.

The timings are in the 68040 and 68060 User's Manuals, which were still available from the Freescale web site last time I checked. If not, PM me your e-mail and I will mail you the PDFs. The 68060 timings are tricky as it's superscalar (2 integer instructions at once). It's more important that an instruction can work in both integer units, as most instructions take 1 cycle. Swap, for example, cannot, but shift can. The 68060 can do 2 shifts (of any allowed size) per cycle but only 1 swap.

Quote:
Ok, maybe my copy routine is not optimal under every aspect but it's probably faster now than the MOVE.L (A0)+, (A1)+ routine with a PC relative jumptable at the end like I found it in the original library. Or is that better for an 68040 ?
You mean the SASC standard brain dead copy routine with jump table? It starts like this...

mem_copy:
    cmpi.l  #$18,d0
    blt     .smallcopy
.loop:
    move.l  (a0)+,(a1)+
    move.l  (a0)+,(a1)+
    move.l  (a0)+,(a1)+
    move.l  (a0)+,(a1)+
    move.l  (a0)+,(a1)+
    move.l  (a0)+,(a1)+
    subi.l  #$18,d0
    cmpi.l  #$18,d0
    bge     .loop
.smallcopy:
    add.w   d0,d0
    move.w  (jumptable,pc,d0.w),d0
    jmp     (2,pc,d0.w)

The 68040 likes the unrolled move.l loop. It's the fastest way to copy memory on the 68040 for smaller copies (move16 for bigger). It's not slowed down by the 6 byte xxxi.l instructions. It handles the complex jumptable setup instructions with ease, but it doesn't like the jmp because it cannot be predicted. The 68040 and 68060 do not like jump tables because of their more sophisticated branch prediction. The copy routine does not attempt to align data, but that is fine in your case (longword aligned src and dest). I would expect this routine to be faster on the 68040 than yours in most cases. It's still pretty poorly written: the instructions are big, and having both a subi and a cmpi inside the loop is unnecessary. I would expect your copy routine to perform better in most cases on most processors, and it's smaller. It is difficult to make a copy routine that performs well on the whole 68k family and that is why the exec memory copy routines come in handy. They also save code.

Quote:
Please remember that the average block size will be something between 500 and 5000 bytes only and atm this is not the bottleneck in the icon.library. There are still more than 4000 other code lines to analyze and to optimize.
Even small copies can be substantially faster with efficient copy routines. Look at the chart in the readme here...

http://fi.aminet.net/util/boot/CopyMem.readme

Copying 4 bytes a set number of times with AmigaOS 3.9 CopyMem() went from .51 seconds to .06 seconds with my CopyMem060 patch. Similar gains were made for the 68040. The standard AmigaOS 3.9 CopyMem() uses a movem.l loop and was improved from AmigaOS 3.5.

Quote:
But if I should get the feeling that this becomes the bottleneck one day then I would try to use execs CopyMem() instead as you suggested. Many thanks anyway for your helpful comments.
Copying memory uses many cycles. Maybe I'll check the timing for you if I find enough time.
Old 08 February 2010, 01:23   #32
PeterK
Registered User
 
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,365
Quote:
The 68040 likes the unrolled move.l loop.
Yes, it's exactly this SASC routine. And meanwhile, I've done some calculations comparing the MOVE.L (A0)+, (A1)+ version with the MOVEM.L version. I'm surprised, because the first version is even faster for the 68020. It needs just 64 cycles, but the MOVEM.L version needs more!! So I will go back to 8 MOVE.L instructions and then I won't need any additional register backups on the stack, too. MOVEM.L may have an advantage with some more registers on the 68020 only.

Code:
CopyMemBlock        ROR.L          #5,D0           ; works only < 2 MByte
                    BRA.S          .test32    
.copy32             MOVE.L         (A0)+,(A1)+
                    MOVE.L         (A0)+,(A1)+
                    MOVE.L         (A0)+,(A1)+
                    MOVE.L         (A0)+,(A1)+
                    MOVE.L         (A0)+,(A1)+
                    MOVE.L         (A0)+,(A1)+
                    MOVE.L         (A0)+,(A1)+
                    MOVE.L         (A0)+,(A1)+
.test32             DBRA.S         D0,.copy32
                    ADDQ.W         #1,D0
                    ROL.L          #3,D0
                    BRA.S          .longs
.copy4              MOVE.L         (A0)+,(A1)+
.longs              DBRA.S         D0,.copy4
                    ADDQ.W         #1,D0
                    ROL.L          #2,D0
                    BRA.S          .bytes
.copy1              MOVE.B         (A0)+,(A1)+
.bytes              DBRA.S         D0,.copy1
                    RTS
Ok, what do you think about this version?
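
For comparison, the same three stages can be sketched in portable C. This is only a model for checking the block/longword/byte split on a host machine; copy_staged is a made-up name, and memcpy() of a fixed size stands in for the unrolled MOVE.L instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch of the three stages (copy_staged is a hypothetical name):
   32-byte blocks, then leftover longwords, then leftover bytes. */
static void copy_staged(const unsigned char *src, unsigned char *dst, size_t n)
{
    assert(src != NULL && dst != NULL);

    size_t blocks = n >> 5;          /* count / 32, the .copy32 loop */
    size_t longs  = (n >> 2) & 7;    /* 0..7 longwords, the .copy4 loop */
    size_t bytes  = n & 3;           /* 0..3 bytes, the .copy1 loop */

    while (blocks--) { memcpy(dst, src, 32); src += 32; dst += 32; }
    while (longs--)  { memcpy(dst, src, 4);  src += 4;  dst += 4;  }
    while (bytes--)  { *dst++ = *src++; }
}
```

The three shift-and-mask lines correspond to what the ROR.L #5, ROL.L #3 and ROL.L #2 bring into the low word of D0 in turn.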

Last edited by PeterK; 08 February 2010 at 13:07.
Old 08 February 2010, 14:58   #33
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
That should perform better on all 68k processors. Yes, the movem.l has a lot of overhead to overcome (saving the registers to the stack) and needs to use most of the registers to be efficient. An unrolled move.l loop is fast and good for the whole 68k family and for all but large copies. The efficiency comes from the removed branches, which eat many cycles except on the 68060 with its 0 cycle predicted branches. You are on the right track. I'm not a fan of the bra into the dbra loop. I usually use a subq and bcc as it skips the branch at the top and isn't any slower on 68040-68060. It might be slightly slower in the loop for the 68020-68030. Dbra's biggest advantage is on the 68000. I haven't really studied the 68000-68030 like I have the later processors though. The other thing about your routine is the last loop. It runs 3 times at most. It works better to just btst bit 1 and do a move.w (a0)+,(a1)+ if it's set, and then btst bit 0 (or lsr into carry) and do a move.b (a0)+,(a1)+ if it's set.

I remembered that those manuals are online here...

http://amiga.serveftp.net/datasheets.html

They are labelled datasheets, which is the same thing as the User's Manuals. There are 68020-68060 "datasheets" there with timings and other good info. This is a great Amiga site so have a look around too.
Old 08 February 2010, 19:24   #34
PeterK
Registered User
 
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,365
@matthey

Many thanks for all your useful hints. It's always good to get some advice from an expert who knows all the arguments for and against any instruction in such a routine already. - Not tested yet, but the following should do it?
Code:
.longs              DBRA.S         D0,.copy4
                    ADD.L          D0,D0
                    BCC.S          .last
                    MOVE.W         (A0)+,(A1)+
.last               ADD.L          D0,D0
                    BCC.S          .done
                    MOVE.B         (A0),(A1) ; no post increment, thx matthey
.done               RTS
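
A small C model of the tail, for anyone who wants to convince themselves that the carry trick works. It assumes, as in the routine above, that the two leftover-count bits sit in bits 31 and 30 of D0 when the longword loop ends; tail_bytes is a made-up name and it only counts the bytes the tail would copy.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the tail: each ADD.L D0,D0 shifts the top bit of D0 out into the
   carry flag. tail_bytes is a hypothetical name; it returns how many bytes
   the MOVE.W / MOVE.B pair would copy for a given D0. */
static unsigned tail_bytes(uint32_t d0)
{
    unsigned copied = 0;

    unsigned carry = d0 >> 31;   /* carry out of the first ADD.L D0,D0 */
    d0 <<= 1;
    if (carry) copied += 2;      /* BCC not taken: MOVE.W (A0)+,(A1)+ */

    carry = d0 >> 31;            /* carry out of the second ADD.L D0,D0 */
    if (carry) copied += 1;      /* BCC not taken: MOVE.B (A0),(A1) */

    assert(copied <= 3);
    return copied;
}
```

The low word of D0 is $FFFF after the DBRA loop has run out, but that does not matter here because only the two bits shifted out at the top are ever tested.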
@NovaCoder

I hope this was not too much like a thread hijacking for you. Sorry, if so!

Last edited by PeterK; 09 February 2010 at 20:45.
Old 09 February 2010, 01:19   #35
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
I didn't think that would work at first but that add.l d0,d0 followed by bcc is very fast and I think it will work. You can drop the post increment on the last move though...

MOVE.B (A0),(A1) ;removing post increment is faster except 68060

Here is a little 68k optimization guide that is very good also...

http://www.freescale.com/files/32bit...80X0OPTAPP.txt

Last edited by matthey; 09 February 2010 at 02:37.
Old 09 February 2010, 20:37   #36
PeterK
Registered User
 
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,365
Yes, the ADD.L D0,D0 and BCC.S works, and it avoids the ADDQ.W #1,D0 and ROL.L #2,D0 as well as the BTST instructions.

Thanks for your great attention regarding the post increment. I'm soo blind and have simply overlooked that.

Btw, the quality of the "datasheets" from http://amiga.serveftp.net is much better than that of the "same" manuals from Freescale, because the Freescale PDF files are poor-quality rescans.
 

