19 April 2014, 18:18 | #1 |
Registered User
Join Date: May 2013
Location: Kleppe / Norway
Posts: 266
|
fastest possible rom copy loop
I needed a small ROM copy routine to toy around with, so I made a small piece of code which seems to be working. Can this loop be improved to use fewer cycles?
Code:
        lea.l   $f80000,a0
        lea.l   $e00000,a1
Loop:
        move.l  (a0),(a1)
        add.l   #$4,a0
        add.l   #$4,a1
        cmpi.l  #$e80000,a1
        bne     Loop
Thanks in advance for suggestions. |
19 April 2014, 18:26 | #2 |
Glastonbridge Software
Join Date: Jan 2012
Location: Edinburgh/Scotland
Posts: 2,243
|
Code:
        lea.l   $f80000,a0
        lea.l   $e00000,a1
        move.l  #$80000,d0
Loop:
        move.l  (a0)+,(a1)+
        subq.l  #4,d0
        bgt.s   Loop |
19 April 2014, 18:35 | #3 |
Glastonbridge Software
Join Date: Jan 2012
Location: Edinburgh/Scotland
Posts: 2,243
|
Code:
        lea.l   $f80000,a0
        lea.l   $e00000,a1
        move.l  #$ffff,d0
Loop:
        move.l  (a0)+,(a1)+
        move.l  (a0)+,(a1)+
        dbra    d0,Loop |
19 April 2014, 18:40 | #4 |
Registered User
Join Date: May 2013
Location: Kleppe / Norway
Posts: 266
|
Massive speedup!
Thanks! |
19 April 2014, 19:25 | #5 |
Registered User
Join Date: Aug 2004
Location:
Posts: 3,349
|
You could use a similar DBF loop but using MOVEM.L instead, so something like
Code:
        MOVE.W  #10921,D0               ;(512*1024)/48 - 1
loop:
        MOVEM.L (A0)+,D1-D7/A2-A6       ;12 registers = 48 bytes
        MOVEM.L D1-D7/A2-A6,(A1)+
        DBF     D0,loop
; There are 32 bytes left over to copy
        MOVEM.L (A0)+,D1-D7/A2          ;8 registers = 32 bytes
        MOVEM.L D1-D7/A2,(A1) |
19 April 2014, 19:37 | #6 |
WinUAE developer
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,567
|
There is no MOVEM <regs>,(An)+ addressing mode. Only (An) or -(An).
Something like this works (but don't bother with it if the CPU is a 68020+):
Code:
copy
        movem.l (A0)+,<regs>
        movem.l <regs>,(A1)
        add.l   d1,a1
        dbf     d0,copy |
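With the <regs> placeholder filled in, the corrected loop might look like the sketch below. It assumes a 512 KB copy between the addresses from the first post and that d0-d7/a2-a6 may be trashed. It also swaps the add.l d1,a1 for lea 48(a1),a1 so that d1 stays free as a twelfth register; that substitution is an illustration, not something suggested in the thread.

```asm
; Sketch only: copy 512 KB with MOVEM on a plain 68000.
; Assumes word-aligned addresses and that d0-d7/a2-a6 may be trashed.
        lea.l   $f80000,a0              ; source
        lea.l   $e00000,a1              ; destination
        move.w  #10921,d0               ; 10922 blocks of 48 bytes, minus 1 for dbf
copy:
        movem.l (a0)+,d1-d7/a2-a6       ; read 12 longwords, a0 advances by 48
        movem.l d1-d7/a2-a6,(a1)        ; write them; a1 is not advanced
        lea     48(a1),a1               ; step the destination by hand
        dbf     d0,copy
; 10922 * 48 = 524256 bytes done, 32 bytes remain
        movem.l (a0)+,d1-d7/a2          ; 8 registers = 32 bytes
        movem.l d1-d7/a2,(a1)
```

Under an OS, the trashed registers would need saving and restoring around the loop.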
19 April 2014, 22:49 | #7 |
Registered User
Join Date: May 2013
Location: Kleppe / Norway
Posts: 266
|
This one is even faster than your first suggestion.
My small program with 2 of those loops takes only a second to complete at 7 MHz (68000), and less of course at higher frequencies. |
20 April 2014, 19:13 | #8 |
Glastonbridge Software
Join Date: Jan 2012
Location: Edinburgh/Scotland
Posts: 2,243
|
Why exactly are you copying the ROM to $e00000? This area is marked as "reserved" in the HW reference manual. If you want to be safe, you should really reserve some memory from exec.library; otherwise, who knows what you will be writing on top of.
|
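Reserving memory from exec.library, as suggested above, could be sketched roughly as follows. The library offset and the MEMF_FAST flag are the standard exec values; the NoMemory label is hypothetical, and the buffer address is whatever exec returns rather than a fixed location.

```asm
; Sketch only: ask exec.library for a 512 KB buffer instead of
; writing to a hard-coded address.
_LVOAllocMem    equ     -198            ; exec.library function offset
MEMF_FAST       equ     4               ; bit 2 of the requirements mask

        move.l  4.w,a6                  ; ExecBase
        move.l  #$80000,d0              ; byteSize = 512 KB
        moveq   #MEMF_FAST,d1           ; requirements: fast RAM only
        jsr     _LVOAllocMem(a6)        ; returns address in d0, 0 on failure
        tst.l   d0
        beq.s   NoMemory                ; hypothetical error exit
        move.l  d0,a1                   ; destination for the copy loop
        lea.l   $f80000,a0              ; set a0 after the call: exec may trash d0/d1/a0/a1
        ; ... copy loop goes here ...
NoMemory:
        rts
```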
21 April 2014, 01:51 | #9 |
Unregistered User
Join Date: Sep 2012
Location: Copenhagen / DK
Age: 44
Posts: 4,190
|
Are you optimizing for speed or size? You could unroll the loop even further to gain more speed if size is not a major issue, e.g. try adding another two move.l's and halving d0.
|
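The unrolling described above might look like this, as a sketch using the addresses from the earlier posts: four moves per pass, so the counter drops from $ffff to $7fff.

```asm
; Sketch only: four move.l per pass instead of two.
        lea.l   $f80000,a0
        lea.l   $e00000,a1
        move.w  #$7fff,d0               ; $8000 passes * 16 bytes = 512 KB
Loop:
        move.l  (a0)+,(a1)+
        move.l  (a0)+,(a1)+
        move.l  (a0)+,(a1)+
        move.l  (a0)+,(a1)+
        dbra    d0,Loop
```

Each doubling of the unroll halves the number of dbra executions, at the cost of a longer loop body.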
21 April 2014, 22:59 | #10 | ||
Registered User
Join Date: May 2013
Location: Kleppe / Norway
Posts: 266
|
Quote:
The second loop transfers it back to rom address area, only now it is mapped up in fastram, so I get a "fast-rom".
Quote:
Will try to add more lines and decrease loop counter. Thanks.
Last edited by prowler; 22 April 2014 at 21:20. Reason: Back-to-back posts merged. |
||
29 April 2014, 05:19 | #11 |
Registered User
Join Date: Aug 2001
Location: Connecticut USA
Posts: 617
|
MOVEM supports the following addressing modes:
(Ax)
-(Ax) (register to memory transfer only)
(Ax)+ (memory to register transfer only)
d(Ax)
d(Ax,Rx)
(Abs).L
(Abs).W
If you are shooting for speed, you should be using MOVEM, not MOVE. MOVEM requires two 16-bit fetch words for the instruction and can transfer up to 14 longwords for that fetch, but requires a 2nd MOVEM to write out the data, whereas MOVE.L requires one 16-bit fetch word but only copies 1 longword for that fetch. If you use MOVEM with 8 registers and unroll, you can get a loop like:
Code:
(Instruction word count)
(3) LEA     $F80000,A0                  ; source address
(3) LEA     $E00000,A1                  ; destination address
(2) MOVE.W  #(($80000/128)-1),D6        ; loop count, copying 128 bytes per iteration
(1) SUB.L   D7,D7                       ; clear d7 to 0
(2) BSET    #5,D7                       ; put 32 into d7 (immediate bit number takes an extension word)
loop:
(2) MOVEM.L (A0)+,D0-D5/A2-A3           ; unrolled loop, 4 executions of movem
(2) MOVEM.L D0-D5/A2-A3,(A1)
(1) ADDA.L  D7,A1
(2) MOVEM.L (A0)+,D0-D5/A2-A3
(2) MOVEM.L D0-D5/A2-A3,(A1)
(1) ADDA.L  D7,A1
(2) MOVEM.L (A0)+,D0-D5/A2-A3
(2) MOVEM.L D0-D5/A2-A3,(A1)
(1) ADDA.L  D7,A1
(2) MOVEM.L (A0)+,D0-D5/A2-A3
(2) MOVEM.L D0-D5/A2-A3,(A1)
(1) ADDA.L  D7,A1
(2) DBRA    D6,loop
MRSBEANBAG'S loop of
Code:
(1) move.l  (a0)+,(a1)+
(1) move.l  (a0)+,(a1)+
(1) move.l  (a0)+,(a1)+
(1) move.l  (a0)+,(a1)+
(1) move.l  (a0)+,(a1)+
(1) move.l  (a0)+,(a1)+
(1) move.l  (a0)+,(a1)+
(1) move.l  (a0)+,(a1)+
(1) move.l  (a0)+,(a1)+
(1) move.l  (a0)+,(a1)+
(1) move.l  (a0)+,(a1)+
(1) move.l  (a0)+,(a1)+
(1) move.l  (a0)+,(a1)+
(1) move.l  (a0)+,(a1)+
(1) move.l  (a0)+,(a1)+
(1) move.l  (a0)+,(a1)+
(1) move.l  (a0)+,(a1)+
(1) move.l  (a0)+,(a1)+
(1) move.l  (a0)+,(a1)+
(1) move.l  (a0)+,(a1)+
(2) dbra    d0,Loop
Your original loop
Code:
Loop:
(1) move.l  (a0),(a1)
(3) add.l   #$4,a0
(3) add.l   #$4,a1
(3) cmpi.l  #$e80000,a1
(1) bne     Loop
Last edited by Shadowfire; 29 April 2014 at 05:43. |
29 April 2014, 06:27 | #12 |
Registered User
Join Date: Dec 2013
Location: Lake Havasu City, AZ
Posts: 741
|
Yep, movem.l is the fastest for non-040/060 CPUs. With 040's I use move16 instead.
|
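For the 68040 case mentioned above, a MOVE16 loop could be sketched as below. It assumes the same addresses as the earlier posts and an assembler switched to 68040 mode (how that is done is assembler-specific). MOVE16 works on 16-byte-aligned lines, which both addresses here satisfy.

```asm
; Sketch only, 68040/68060: MOVE16 transfers one 16-byte line per instruction.
        lea.l   $f80000,a0              ; source, 16-byte aligned
        lea.l   $e00000,a1              ; destination, 16-byte aligned
        move.w  #$7fff,d0               ; $8000 lines * 16 bytes = 512 KB
Line:
        move16  (a0)+,(a1)+
        dbra    d0,Line
```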
29 April 2014, 15:14 | #13 | |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 846
|
Quote:
|
|
29 April 2014, 16:05 | #14 |
Unregistered User
Join Date: Sep 2012
Location: Copenhagen / DK
Age: 44
Posts: 4,190
|
|
29 April 2014, 16:28 | #15 | |
Moderator
Join Date: Dec 2010
Location: Wisconsin USA
Age: 60
Posts: 846
|
Quote:
BusSpeedTest 0.19 (mlelstv) Buffer: 262144 Bytes, Alignment: 32768
========================================================================
memtype  addr       op      cycle      calib   bandwidth
rom      $00F80000  readw   1176.8 ns  normal  1.7 * 10^6 byte/s
rom      $00F80000  readl   1757.7 ns  normal  2.3 * 10^6 byte/s
rom      $00F80000  readm   1395.6 ns  normal  2.9 * 10^6 byte/s

BusSpeedTest 0.19 (mlelstv) Buffer: 262144 Bytes, Alignment: 32768
========================================================================
memtype  addr       op      cycle      calib   bandwidth
fast     $00240000  readw   1177.6 ns  normal  1.7 * 10^6 byte/s
fast     $00240000  readl   1760.5 ns  normal  2.3 * 10^6 byte/s
fast     $00240000  readm   1390.0 ns  normal  2.9 * 10^6 byte/s
fast     $00240000  writew  1178.0 ns  normal  1.7 * 10^6 byte/s
fast     $00240000  writel  1760.6 ns  normal  2.3 * 10^6 byte/s
fast     $00240000  writem  1319.3 ns  normal  3.0 * 10^6 byte/s

Note: NTSC = 7.16 MHz, PAL = 7.09 MHz
Last edited by SpeedGeek; 30 April 2014 at 13:26. |
|
29 April 2014, 16:50 | #16 |
Unregistered User
Join Date: Sep 2012
Location: Copenhagen / DK
Age: 44
Posts: 4,190
|
They do look quite identical. Not sure then why I can feel a difference in responsiveness when using skick.
|
02 May 2014, 20:58 | #17 | |
Registered User
Join Date: May 2013
Location: Kleppe / Norway
Posts: 266
|
Quote:
But the important point is that it is much faster at higher cpu clock frequencies, for which this is intended (I'm toying with a homemade internal simple cpu/ram-board with a 68HC000 processor, a CPLD, 16MB of SRAM, and bidirectional bus drivers for data and address lines). I have been using bustest, like yourself, to confirm better "rom" speeds at higher cpu frequencies. |
|
10 May 2014, 19:10 | #18 |
Moderator
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,652
|
Toni's loop is the fastest IF you use more than 8 registers AND repeat many times to get less adds. Otherwise not.
move.l (a0)+,(a1)+ takes 20 cycles on 68000, 2x movem.l approaches 16 cycles per longword if you use many registers. So a repeated move.l (a0)+,(a1)+ will take you close to the max already. Remember that speed is only important on the slowest platforms you want to support, so code for them. On a 68030 a dead slow copy loop will be fast enough (as perceived by the user) already. |
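The figures above can be checked with a rough per-instruction tally. The sketch below assumes a 68000 with no wait states and uses the standard timings: 12+8n cycles for a MOVEM.L read of n registers, 8+8n for the write, 8 for the LEA, and 10 for a DBRA that loops. The two loops are shown only for counting and are not meant to run back to back; the register list and the LEA are illustrative choices.

```asm
; Sketch only: 68000 cycle counts, assuming no wait states.

; Variant A: two move.l per pass
LoopA:
        move.l  (a0)+,(a1)+             ; 20 cycles
        move.l  (a0)+,(a1)+             ; 20 cycles
        dbra    d0,LoopA                ; 10 cycles while looping
; 50 cycles / 2 longwords = 25 cycles per longword

; Variant B: MOVEM with n = 12 registers
LoopB:
        movem.l (a0)+,d1-d7/a2-a6       ; 12 + 8*12 = 108 cycles
        movem.l d1-d7/a2-a6,(a1)        ;  8 + 8*12 = 104 cycles
        lea     48(a1),a1               ;  8 cycles
        dbra    d0,LoopB                ; 10 cycles while looping
; 230 cycles / 12 longwords = about 19.2 cycles per longword
```

The MOVEM pair costs 20 + 16n cycles for n longwords, which is where the 16-cycle limit comes from; a heavily unrolled move.l loop sits just above 20, so the gap between the two is small.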