fastest possible rom copy loop

Yulquen74 · 19 April 2014, 18:18

I needed a small ROM copy routine to toy around with, so I wrote a short piece of code which seems to work. Can this loop be improved to use fewer cycles?

lea.l $f80000,a0
lea.l $e00000,a1

Loop:
move.l (a0),(a1)
add.l #$4,a0
add.l #$4,a1
cmpi.l #$e80000,a1
bne Loop

thanks in advance for suggestions.

Mrs Beanbag · 19 April 2014, 18:26

Code:

lea.l $f80000,a0
lea.l $e00000,a1
move.l #$80000,d0

Loop:
move.l (a0)+,(a1)+
subq.l #4,d0
bgt.s Loop

Mrs Beanbag · 19 April 2014, 18:35

Code:

lea.l $f80000,a0
lea.l $e00000,a1
move.l #$ffff,d0

Loop:
move.l (a0)+,(a1)+
move.l (a0)+,(a1)+
dbra d0,Loop

Yulquen74 · 19 April 2014, 18:40

Massive speedup!
Thanks!

mark_k · 19 April 2014, 19:25

You could use a similar DBF loop but using MOVEM.L instead, so something like

Code:

        MOVE.W  #10921,D0            ;(512*1024)/48 - 1
loop:   MOVEM.L (A0)+,D1-D7/A2-A6    ;12 registers = 48 bytes
        MOVEM.L D1-D7/A2-A6,(A1)+
        DBF D0,loop
; There are 32 bytes left over to copy
        MOVEM.L (A0)+,D1-D7/A2      ;8 registers = 32 bytes
        MOVEM.L D1-D7/A2,(A1)

Toni Wilen · 19 April 2014, 19:37

There is no MOVEM <regs>,(An)+ addressing mode. Only (An) or -(An).

Something like this works (but don't bother with it if CPU is 68020+)

copy
movem.l (A0)+,<regs>
movem.l <regs>,(A1)
add.l d1,a1
dbf d0,copy
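
For illustration, a fuller version of that loop for the 512 KB copy discussed in this thread might look like the sketch below. The register list, the loop count, the tail handling and the use of lea 48(a1),a1 in place of add.l d1,a1 (so that d1 can carry data) are all assumptions, not something posted above. It trashes d0-d7 and a0-a6, so anything the caller still needs has to be saved first.

```asm
; Sketch only: copy 512 KB in 48-byte MOVEM blocks on a plain 68000.
; Addresses are the ones used earlier in the thread.
        lea     $f80000,a0              ; source (ROM base)
        lea     $e00000,a1              ; destination
        move.w  #10922-1,d0             ; 10922 blocks * 48 bytes = 524256 bytes
copy:   movem.l (a0)+,d1-d7/a2-a6       ; read 12 longwords, a0 advances by 48
        movem.l d1-d7/a2-a6,(a1)        ; write them, a1 is left unchanged
        lea     48(a1),a1               ; so step a1 forward by hand
        dbf     d0,copy
; 524288 - 524256 = 32 bytes are still left to copy
        movem.l (a0)+,d1-d7/a2          ; tail: 8 longwords = 32 bytes
        movem.l d1-d7/a2,(a1)
```

The tail is needed because 512 KB is not a multiple of 48. A register list of 8 (32 bytes per block, 16384 blocks) would divide evenly and avoid it, at the cost of more loop overhead per byte.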

Yulquen74 · 19 April 2014, 22:49

This one is even faster than your first suggestion

My small program with 2 of those loops takes
only a second to complete at 7mhz (68000),
and less of course at higher frequencies.

Quote:

Originally Posted by Mrs Beanbag

Code:

lea.l $f80000,a0
lea.l $e00000,a1
move.l #$ffff,d0

Loop:
move.l (a0)+,(a1)+
move.l (a0)+,(a1)+
dbra d0,Loop

Mrs Beanbag · 20 April 2014, 19:13

why exactly are you copying the ROM to $e00000? This area is marked as "reserved" in the HW reference manual. If you want to be safe you should really reserve some memory from exec.library otherwise who knows what you will be writing on top of.

demolition · 21 April 2014, 01:51

Quote:

Originally Posted by Yulquen74

This one is even faster than your first suggestion

My small program with 2 of those loops takes
only a second to complete at 7mhz (68000),
and less of course at higher frequencies.

Are you optimizing for speed or size? You could unroll the loop even further to gain more speed if size is not a major issue, e.g. try adding another two move.l's and halving d0.

Yulquen74 · 21 April 2014, 22:59

Quote:

Originally Posted by Mrs Beanbag

why exactly are you copying the ROM to $e00000? This area is marked as "reserved" in the HW reference manual. If you want to be safe you should really reserve some memory from exec.library otherwise who knows what you will be writing on top of.

I have fast RAM mapped in that area, and it's perfectly safe to use that way as long as I do it before the memory is added to the system pool with the AddMem command.
The second loop transfers it back to rom address area, only now it is mapped up in fastram, so I get a "fast-rom".

Quote:

Originally Posted by demolition

Are you optimizing for speed or size? You could unroll the loop even further to gain more speed if size is not a major issue, e.g. try adding another two move.l's and halving d0.

Optimizing for speed.
Will try to add more lines and decrease loop counter. Thanks.
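
A minimal sketch of that suggestion, assuming the same addresses as before: four move.l per pass and the count halved from $ffff to $7fff. Going by the 68000 timing tables (20 cycles per move.l (a0)+,(a1)+ and 10 for a taken dbra, both without wait states), this should cost 90 cycles per 16 bytes, against 50 cycles per 8 bytes for the two-move version.

```asm
; Sketch only: Mrs Beanbag's loop unrolled to four moves per pass.
        lea     $f80000,a0              ; source
        lea     $e00000,a1              ; destination
        move.w  #$7fff,d0               ; 32768 passes * 16 bytes = 512 KB
Loop:   move.l  (a0)+,(a1)+
        move.l  (a0)+,(a1)+
        move.l  (a0)+,(a1)+
        move.l  (a0)+,(a1)+
        dbra    d0,Loop                 ; dbra only looks at the low word of d0
```

Each further doubling of the unroll halves the share of the dbra overhead, so the gain shrinks quickly once the loop body is a few dozen moves long.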

Shadowfire · 29 April 2014, 05:19

MOVEM supports the following addressing modes:
(Ax)
-(Ax) (register to memory transfer only)
(Ax)+ (memory to register transfer only)
d(Ax)
d(Ax,Rx)
(Abs).L
(Abs).W
If you are shooting for speed, you should be using MOVEM, not MOVE. A MOVEM takes two 16-bit instruction words and can transfer up to 14 longwords for that fetch, although a second MOVEM is needed to write the data out again. MOVE.L (a0)+,(a1)+ takes only one instruction word, but copies just one longword for that fetch.

If you use a MOVEM with 8 registers, unrolling, you can get a loop like:

Code:

(Instruction word count)
(3)LEA $F80000,A0; source address
(3)LEA $E00000,A1; destination address
(2)MOVE.W #(($80000/128)-1),D6; loop count, copying 128 bytes per iteration
(1)SUB.L D7,D7; clear d7 to 0
(2)BSET #5,D7; set bit 5 to put 32 into d7

loop:
(2)MOVEM.L (A0)+,D0-D5/A2-A3; unrolled loop, 4 MOVEM pairs per iteration
(2)MOVEM.L D0-D5/A2-A3,(A1)
(1)ADDA.L D7,A1
(2)MOVEM.L (A0)+,D0-D5/A2-A3
(2)MOVEM.L D0-D5/A2-A3,(A1)
(1)ADDA.L D7,A1
(2)MOVEM.L (A0)+,D0-D5/A2-A3
(2)MOVEM.L D0-D5/A2-A3,(A1)
(1)ADDA.L D7,A1
(2)MOVEM.L (A0)+,D0-D5/A2-A3
(2)MOVEM.L D0-D5/A2-A3,(A1)
(1)ADDA.L D7,A1
(2)DBRA.W D6,loop

This loop fetches 22 instruction words (plus 4 dummy reads, one for each MOVEM.L (A0)+ instruction) to copy 128 bytes of data in each iteration, or 128/26 = 4.92 bytes copied per word fetched.

Mrs Beanbag's loop of

Code:

(1)move.l (a0)+,(a1)+
(1)move.l (a0)+,(a1)+
(1)move.l (a0)+,(a1)+
(1)move.l (a0)+,(a1)+
(1)move.l (a0)+,(a1)+
(1)move.l (a0)+,(a1)+
(1)move.l (a0)+,(a1)+
(1)move.l (a0)+,(a1)+
(1)move.l (a0)+,(a1)+
(1)move.l (a0)+,(a1)+
(1)move.l (a0)+,(a1)+
(1)move.l (a0)+,(a1)+
(1)move.l (a0)+,(a1)+
(1)move.l (a0)+,(a1)+
(1)move.l (a0)+,(a1)+
(1)move.l (a0)+,(a1)+
(1)move.l (a0)+,(a1)+
(1)move.l (a0)+,(a1)+
(1)move.l (a0)+,(a1)+
(1)move.l (a0)+,(a1)+
(2)dbra d0,Loop

unrolled to 20 levels, fetches 22 instruction words to copy 80 bytes, or 3.636 bytes/instruction word.

Your original loop

Code:

Loop:
(1)move.l (a0),(a1)
(3)add.l #$4,a0
(3)add.l #$4,a1
(3)cmpi.l #$e80000,a1
(1)bne Loop

fetches 11 instruction words to copy 4 bytes, or 0.36363 bytes/instruction word.

JimDrew · 29 April 2014, 06:27

Yep, movem.l is the fastest for non-040/060 CPUs. With 040's I use move16 instead.

SpeedGeek · 29 April 2014, 15:14

Quote:

Originally Posted by Yulquen74

I have fast RAM mapped in that area, and it's perfectly safe to use that way as long as I do it before the memory is added to the system pool with the AddMem command.
The second loop transfers it back to rom address area, only now it is mapped up in fastram, so I get a "fast-rom".

Really? How does your 7MHz 68000 access Fast RAM any faster than the Kickstart ROM?

demolition · 29 April 2014, 16:05

Quote:

Originally Posted by SpeedGeek

Really? How does your 7MHz 68000 access Fast RAM any faster than the Kickstart ROM?

Fast RAM is much faster than the ROM. I use skick on my A500+ with 7 MHz CPU to map the kickstart into fast RAM, and there is a noticeable difference in speed after doing it.

SpeedGeek · 29 April 2014, 16:28

Quote:

Originally Posted by demolition

Fast RAM is much faster than the ROM. I use skick on my A500+ with 7 MHz CPU to map the kickstart into fast RAM, and there is a noticeable difference in speed after doing it.

Here are my A2000 7 MHz 68000 bustest results:

BusSpeedTest 0.19 (mlelstv) Buffer: 262144 Bytes, Alignment: 32768
========================================================================
memtype  addr       op      cycle      calib   bandwidth
rom      $00F80000  readw   1176.8 ns  normal  1.7 * 10^6 byte/s
rom      $00F80000  readl   1757.7 ns  normal  2.3 * 10^6 byte/s
rom      $00F80000  readm   1395.6 ns  normal  2.9 * 10^6 byte/s

BusSpeedTest 0.19 (mlelstv) Buffer: 262144 Bytes, Alignment: 32768
========================================================================
memtype  addr       op      cycle      calib   bandwidth
fast     $00240000  readw   1177.6 ns  normal  1.7 * 10^6 byte/s
fast     $00240000  readl   1760.5 ns  normal  2.3 * 10^6 byte/s
fast     $00240000  readm   1390.0 ns  normal  2.9 * 10^6 byte/s
fast     $00240000  writew  1178.0 ns  normal  1.7 * 10^6 byte/s
fast     $00240000  writel  1760.6 ns  normal  2.3 * 10^6 byte/s
fast     $00240000  writem  1319.3 ns  normal  3.0 * 10^6 byte/s

Note: NTSC = 7.16 MHz, PAL = 7.09 MHz

demolition · 29 April 2014, 16:50

They do look quite identical. Not sure then why I can feel a difference in responsiveness when using skick.

Yulquen74 · 02 May 2014, 20:58

Quote:

Originally Posted by SpeedGeek

Really? How does your 7MHz 68000 access Fast RAM any faster than the Kickstart ROM?

You are right of course, it is not faster at 7MHz.

But the important point is that it is much faster at higher cpu clock frequencies, for which this is intended (I'm toying with a homemade internal simple cpu/ram-board with a 68HC000 processor, a CPLD, 16MB of SRAM, and bidirectional bus drivers for data and address lines).

I have been using bustest, like yourself, to confirm better "rom" speeds at higher cpu frequencies.

Photon · 10 May 2014, 19:10

Toni's loop is the fastest IF you use more than 8 registers AND repeat many times to get fewer adds. Otherwise not.

move.l (a0)+,(a1)+ takes 20 cycles on 68000, 2x movem.l approaches 16 cycles per longword if you use many registers.

So a repeated move.l (a0)+,(a1)+ will take you close to the max already.

Remember that speed is only important on the slowest platforms you want to support, so code for them. On a 68030 a dead slow copy loop will be fast enough (as perceived by the user) already.
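
As a rough check on those figures, the cycle counts can be written next to the code. This is only a sketch: the counts are taken to be the no-wait-state values from Motorola's 68000 instruction timing tables and should be checked against the manual, the 12-register list is an arbitrary choice, and lea 48(a1),a1 stands in for the add in Toni's loop. It assumes a0, a1 and d0 have already been set up.

```asm
; n = number of registers in the MOVEM list (12 here)
copy:   movem.l (a0)+,d1-d7/a2-a6       ; 12 + 8n = 108 cycles
        movem.l d1-d7/a2-a6,(a1)        ;  8 + 8n = 104 cycles
        lea     48(a1),a1               ;  8 cycles
        dbf     d0,copy                 ; 10 cycles while the branch is taken
; per pass: 108 + 104 + 8 + 10 = 230 cycles for 12 longwords, about 19.2 each
; the MOVEM pair alone costs (20 + 16n)/n = 16 + 20/n cycles per longword,
; which tends towards 16 as n grows; move.l (a0)+,(a1)+ is a flat 20
```

With the pointer fix-up and dbf included, the advantage over a well-unrolled move.l loop is real but small, which matches the point above.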
