05 February 2010, 16:10 | #21 |
Zone Friend
Join Date: Aug 2005
Location: Germany
Age: 52
Posts: 424
|
How many clock cycles does movem.l (a0),d0-d7/a2-a6 take on a 68000 CPU?
Edit: added CPU type. Last edited by NOB; 05 February 2010 at 18:55. |
05 February 2010, 16:14 | #22 |
move.l #$c0ff33,throat
Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 6,863
|
|
06 February 2010, 00:34 | #23 | |
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
Quote:
movem.l reg_list,-(an) ; 4+10n (1/n), n = number of registers moved
Maybe you can explain how movem.l (a0),reg_list works. It would have been handy if there were a movem.l (a0)+,(a1)+ though. Then it would really be "move memory". |
|
06 February 2010, 09:27 | #24 |
WinUAE developer
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,505
|
(If you are wondering why 68000 MOVEM to registers is slower)
MOVEM to register(s) is one memory cycle (4 clocks) slower because it does one extra dummy word read at the end (something to do with pipelining in microcode, I think).
MOVEM.W (An)+,<registers> = 4+4*n+4+4
MOVEM.W <registers>,-(An) = 4+4*n+4
MOVEM.L (An)+,<registers> = 4+(2*4)*n+4+4
MOVEM.L <registers>,-(An) = 4+(2*4)*n+4
(I wrote them that way because this shows the order of memory accesses.)
MOVEM.L + blitter is the fastest way to clear/copy Chip RAM data on a non-expanded 68000 Amiga. (The CPU uses every other cycle; half of the cycles would be wasted without the blitter.) |
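To put numbers on the formulas in this post, here is a minimal Python sketch that simply evaluates them. The function name `movem_cycles` and its parameters are made up for illustration. It also assumes that the plain (An) source form asked about at the top of the thread costs the same as (An)+, which is how the Motorola timing tables list it; the post itself only gives the (An)+ and -(An) forms.

```python
def movem_cycles(n, size="l", to_registers=True):
    """Evaluate the 68000 MOVEM cycle formulas quoted in the post above.

    n            -- number of registers in the register list
    size         -- "w" or "l" (one or two 4-clock accesses per register)
    to_registers -- True for memory -> registers, False for registers -> memory
    """
    accesses_per_register = 2 if size == "l" else 1
    # Terms kept in the order the post writes them: 4 + transfers + 4 (+ 4)
    cycles = 4 + 4 * accesses_per_register * n + 4
    if to_registers:
        cycles += 4  # the extra dummy word read at the end
    return cycles


# movem.l (a0),d0-d7/a2-a6 moves 13 registers (assumption: same cost as (a0)+)
print(movem_cycles(13))                      # 116
print(movem_cycles(13, to_registers=False))  # 112
```

In closed form the long versions come out as 12+8n for memory to registers and 8+8n for registers to memory.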
06 February 2010, 11:56 | #25 |
Zone Friend
Join Date: Aug 2005
Location: Germany
Age: 52
Posts: 424
|
AsmOne is not that accurate: it makes no difference to AsmOne whether I use (a0) or (a0)+ in this case. Interesting info, thanks! Last edited by NOB; 06 February 2010 at 12:03. |
06 February 2010, 12:37 | #26 | ||
move.l #$c0ff33,throat
Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 6,863
|
Quote:
Quote:
Code:
    lea     test(pc),a0
    movem.l (a0),d0-d7      ; a0 not changed
    ....
    lea     test(pc),a0
    movem.l (a0)+,d0-d7     ; a0 will point to test+8*4 after movem
|
||
06 February 2010, 13:47 | #27 |
Zone Friend
Join Date: Aug 2005
Location: Germany
Age: 52
Posts: 424
|
Oops! Thanks for the correction! I overlooked that.
|
06 February 2010, 23:06 | #28 |
Registered User
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,365
|
Your discussion here just induced me to change the CopyMemBlock() routine of the icon.library too. The source buffers are always aligned and D1 is a scratch register in this case:
Code:
CopyMemBlock
        ROR.L   #5,D0           ; works only < 2 MByte
        SUBQ.W  #1,D0           ; D0 = number of bytes
        BCS.S   .less32         ; A0 = source address
        MOVEM.L D2-D7/A2,-(SP)  ; A1 = target address
.next32 MOVEM.L (A0)+,D1-D7/A2
        MOVEM.L D1-D7/A2,(A1)
        LEA     32(A1),A1
        DBRA.S  D0,.next32
        MOVEM.L (SP)+,D2-D7/A2
.less32 ADDQ.W  #1,D0
        ROL.L   #5,D0
        SUBQ.W  #1,D0
        BCS.S   .done
.last31 MOVE.B  (A0)+,(A1)+
        DBRA.S  D0,.last31
.done   RTS
Last edited by PeterK; 07 February 2010 at 00:47. |
07 February 2010, 06:01 | #29 |
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
@PeterK
Good to see you still improving icon.library. It works well here. Thanks.
You say your data is aligned. Is it longword aligned or just word aligned? Is your size an even number of longwords as well, as is required by exec CopyMemQuick()? If it is, then you are better off changing the last loop to something like...
lastloop:
        move.l  (a0)+,(a1)+
        subq.l  #4,d0
        bhi.b   lastloop
I would expect your code to perform similarly to a patched exec CopyMem() or CopyMemQuick() on the 68000, 68020 and 68030 once the overhead of the library call is considered (only a jmp and a move.l execbase,a6 if your routine is not inlined). It will perform very poorly on a 68040 because...
    movem is slower than move
    the branches are the wrong way and there is no branch cache
    dbra has more setup time than bcc and scheduling is worse
    the 68040 does not like shifts or rotates
The 68060 fares better but prefers...
    branches the other way (as on the 68040, taken branches are faster)
    long operations
I would expect an average performance, compared to using a patched exec CopyMem() or CopyMemQuick(), of...
    68000  95-105%
    68020  95-105%
    68030  95-105%
    68040  50-70%
    68060  70-90%
Testing (timing) tells the real story, and most of the good CopyMem()/CopyMemQuick() patches have been tested and are hard to beat. Rolling your own CopyMem() will only be a few percent faster at best on the 68000-68030 but will be considerably slower on the 68040-68060 and take more memory on all of them. Plus there is (hopefully) no need for testing and debugging when using the exec copy routines.
Last edited by matthey; 07 February 2010 at 07:47. |
07 February 2010, 21:30 | #30 | |||
Registered User
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,365
|
Quote:
Quote:
...not an even number of longwords, but in my case it can be any number of bytes, worst example: copying the RGB bytes of a colormap, so the number can be odd. Quote:
Btw, do you have any CPU timing tables for the 68040 and the 68060? Update: I already found this page: http://www.mactech.com/articles/mact...040/index.html
OK, maybe my copy routine is not optimal in every respect, but it's probably faster now than the MOVE.L (A0)+,(A1)+ routine with a PC-relative jump table at the end that I found in the original library. Or is that better for a 68040?
Please remember that the average block size will only be somewhere between 500 and 5000 bytes, and at the moment this is not the bottleneck in icon.library. There are still more than 4000 other lines of code to analyze and optimize. But if I get the feeling one day that this has become the bottleneck, I will try exec's CopyMem() instead, as you suggested. Many thanks anyway for your helpful comments.
Last edited by PeterK; 07 February 2010 at 23:11. |
|||
08 February 2010, 01:00 | #31 | |||||
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
Quote:
Quote:
The timings are in the 68040 and 68060 User's Manuals, which were still available from the Freescale web site last time I checked. If not, PM me your e-mail and I will mail you the PDFs. The 68060 timings are tricky as it's superscalar (2 integer instructions at once). Since most instructions take 1 cycle, it matters more whether an instruction can execute in both integer units. Swap, for example, cannot, but shift can: the 68060 can do 2 shifts (of any allowed size) per cycle but only 1 swap. Quote:
mem_copy:
        cmpi.l  #$18,d0
        blt     .smallcopy
.loop:
        move.l  (a0)+,(a1)+
        move.l  (a0)+,(a1)+
        move.l  (a0)+,(a1)+
        move.l  (a0)+,(a1)+
        move.l  (a0)+,(a1)+
        move.l  (a0)+,(a1)+
        subi.l  #$18,d0
        cmpi.l  #$18,d0
        bge     .loop
.smallcopy:
        add.w   d0,d0
        move.w  (jumptable,pc,d0.w),d0
        jmp     (2,pc,d0.w)
The 68040 likes the unrolled move.l loop. It's the fastest way to copy memory on the 68040 for smaller copies (move16 for bigger). It's not slowed down by the 6 byte xxxi.l instructions. It handles the complex jumptable setup instructions with ease but it doesn't like the jmp because it can not be predicted. The 68040 and 68060 do not like jump tables because of the more sophisticated branch prediction. The copy routine does not attempt to align data but that is good in your case (long word aligned src and dest).
I would expect this routine to be faster on the 68040 than yours in most cases. It's still pretty poorly written with the big instructions and both a subi and cmpi inside the loop are unnecessary. I would expect your copy routine will perform better in most cases on most processors and it's smaller. It is difficult to make a copy routine that performs well on the whole 68k family and that is why the exec memory copy routines come in handy. They also save code. Quote:
http://fi.aminet.net/util/boot/CopyMem.readme Copying 4 bytes a set number of times with AmigaOS 3.9 CopyMem() went from .51 seconds to .06 seconds with my CopyMem060 patch. Similar gains were made for the 68040. The standard AmigaOS 3.9 CopyMem() uses a movem.l loop and was improved from AmigaOS 3.5. Quote:
|
|||||
08 February 2010, 01:23 | #32 | |
Registered User
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,365
|
Quote:
Code:
CopyMemBlock
        ROR.L   #5,D0           ; works only < 2 MByte
        BRA.S   .test32
.copy32 MOVE.L  (A0)+,(A1)+
        MOVE.L  (A0)+,(A1)+
        MOVE.L  (A0)+,(A1)+
        MOVE.L  (A0)+,(A1)+
        MOVE.L  (A0)+,(A1)+
        MOVE.L  (A0)+,(A1)+
        MOVE.L  (A0)+,(A1)+
        MOVE.L  (A0)+,(A1)+
.test32 DBRA.S  D0,.copy32
        ADDQ.W  #1,D0
        ROL.L   #3,D0
        BRA.S   .longs
.copy4  MOVE.L  (A0)+,(A1)+
.longs  DBRA.S  D0,.copy4
        ADDQ.W  #1,D0
        ROL.L   #2,D0
        BRA.S   .bytes
.copy1  MOVE.B  (A0)+,(A1)+
.bytes  DBRA.S  D0,.copy1
        RTS
Last edited by PeterK; 08 February 2010 at 13:07. |
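For readers following the rotate trick in this routine, the sketch below models in Python how D0 is split into 32-byte blocks, longwords and single bytes. It is only a model of the register arithmetic, not a replacement for the assembly. The names `ror32`, `rol32` and `split_size` are invented for this example, and the model assumes, as the routine's own comment does, that the size is below 2 MByte.

```python
MASK32 = 0xFFFFFFFF


def ror32(x, n):
    """Rotate a 32-bit value right by n bits (ROR.L #n,D0)."""
    return ((x >> n) | (x << (32 - n))) & MASK32


def rol32(x, n):
    """Rotate a 32-bit value left by n bits (ROL.L #n,D0)."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def split_size(size):
    """Return (blocks32, longs, single_bytes) as the three DBRA loops see them."""
    assert 0 <= size < 2 * 1024 * 1024      # limit stated in the routine's comment
    d0 = ror32(size, 5)
    blocks32 = d0 & 0xFFFF                  # DBRA only counts on the low word
    d0 &= 0xFFFF0000                        # loop exit ($FFFF) + ADDQ.W #1 -> low word 0
    d0 = rol32(d0, 3)                       # size bits 2..4 arrive in the low word
    longs = d0 & 0xFFFF
    d0 &= 0xFFFF0000                        # same again after the longword loop
    d0 = rol32(d0, 2)                       # size bits 0..1 arrive in the low word
    single_bytes = d0 & 0xFFFF
    return blocks32, longs, single_bytes


print(split_size(77))   # (2, 3, 1): 2*32 + 3*4 + 1 = 77
```

The low five bits of the size are parked in the top of the register by the first rotate and are brought back down in two steps, so no second register is needed for the remainder.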
|
08 February 2010, 14:58 | #33 |
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
That should perform better on all 68k processors. Yes, the movem.l has a lot of overhead to overcome (saving the registers to the stack) and needs to use most of the registers to be efficient. An unrolled move.l loop is fast and good for the whole 68k family and for all but large copies. The efficiency comes from the removed branches, which eat many cycles except on the 68060 with its 0-cycle predicted branches. You are on the right track.
I'm not a fan of the bra into the dbra loop. I usually use a subq and bcc, as it skips the branch at the top and isn't any slower on the 68040-68060. It might be slightly slower in the loop on the 68020-68030. Dbra's biggest advantage is on the 68000. I haven't really studied the 68000-68030 like I have the later processors though.
The other thing about your routine is the last loop. It runs 3 times at most. It works better to just btst bit 1 and do a move.w (a0)+,(a1)+ if it's set, then btst bit 0 (or lsr into carry) and do a move.b (a0)+,(a1)+ if it's set.
I remembered that those manuals are online here... http://amiga.serveftp.net/datasheets.html They are called datasheets there, which is the same as the User's Manuals. There are 68020-68060 "datasheets" with timings and other good info. This is a great Amiga site, so have a look around too. |
08 February 2010, 19:24 | #34 |
Registered User
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,365
|
@matthey
Many thanks for all your useful hints. It's always good to get some advice from an expert who knows all the arguments for and against any instruction in such a routine already. - Not tested yet, but the following should do it? Code:
.longs  DBRA.S  D0,.copy4
        ADD.L   D0,D0
        BCC.S   .last
        MOVE.W  (A0)+,(A1)+
.last   ADD.L   D0,D0
        BCC.S   .done
        MOVE.B  (A0),(A1)       ; no post increment, thx matthey
.done   RTS
I hope this was not too much like a thread hijacking for you. Sorry, if so!
Last edited by PeterK; 09 February 2010 at 20:45. |
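Why the two ADD.L D0,D0 / BCC.S pairs pick out exactly the last 0 to 3 bytes can be checked with a small Python model. This is a sketch under the assumption that the fragment follows the ROR.L #5 / ROL.L #3 sequence of the routine in post #32; the helper names `add_l_self` and `tail_flags` are hypothetical and exist only for this illustration.

```python
MASK32 = 0xFFFFFFFF


def ror32(x, n):
    """Rotate a 32-bit value right by n bits (ROR.L #n,D0)."""
    return ((x >> n) | (x << (32 - n))) & MASK32


def rol32(x, n):
    """Rotate a 32-bit value left by n bits (ROL.L #n,D0)."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def add_l_self(d0):
    """ADD.L D0,D0: return (result, carry); the carry is the old bit 31."""
    return (d0 << 1) & MASK32, d0 >> 31


def tail_flags(size):
    """Return (copy_word, copy_byte) as decided by the two BCC.S tests."""
    d0 = ror32(size, 5) & 0xFFFF0000   # after the 32-byte loop and ADDQ.W #1
    d0 = rol32(d0, 3) | 0xFFFF         # after the longword DBRA loop (low word = $FFFF)
    d0, carry_word = add_l_self(d0)    # carry = size bit 1 -> MOVE.W (A0)+,(A1)+
    d0, carry_byte = add_l_self(d0)    # carry = size bit 0 -> MOVE.B (A0),(A1)
    return bool(carry_word), bool(carry_byte)


print(tail_flags(77))   # (False, True): 77 % 4 == 1, so only the byte move runs
```

After the ROL.L #3, size bit 1 sits in bit 31 and size bit 0 in bit 30, so each doubling pushes the next remainder bit into the carry flag. The $FFFF left in the low word by DBRA does not matter because only the top bits are tested.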
09 February 2010, 01:19 | #35 |
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
I didn't think that would work at first but that add.l d0,d0 followed by bcc is very fast and I think it will work. You can drop the post increment on the last move though...
        MOVE.B  (A0),(A1)       ; removing post increment is faster except 68060
Here is a little 68k optimization guide that is very good also... http://www.freescale.com/files/32bit...80X0OPTAPP.txt
Last edited by matthey; 09 February 2010 at 02:37. |
09 February 2010, 20:37 | #36 |
Registered User
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,365
|
Yes, the ADD.L D0,D0 plus BCC.S works, and it avoids the ADDQ.W #1,D0 and ROL.L #2,D0 as well as the BTST instructions.
Thanks for spotting the post increment. I was so blind and simply overlooked that. Btw, the quality of the "datasheets" from http://amiga.serveftp.net was much better than that of the "same" manuals from Freescale, whose PDF files are a poor-quality rescan. |