68000 unaligned copy code performance

!ZAJC! · 15 February 2024, 07:37

Hi crew,

Newbie here, I would love your feedback on my Amiga Motorola 68000 code.

Assume I have to copy an arbitrarily small/large block of memory where either:
a) source address is odd and destination address is even
or
b) source address is even and destination address is odd
- length can be 0 to 512kB ($7FFFF)
- source and destination do not overlap

Here is what I have:

Code:

memcpy_b:
;;	a0.l - tgt
;;	a1.l - src
;;	d0.l - len (bytes)

	moveq.l	#7,d1
	and.w	d0,d1	; rem[d1] <- len[d0] & 7
	neg.w	d1	; negoff[d1] <- ...
	add.w	d1,d1	; negoff[d1] <- -2*rem[d1]

	asr.l	#3,d0	; blocks[d0] <- len[d0] / 8
	
	jmp	.done(pc,d1)	; jump to done - negoff[d1]

.cpyblk:
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
.done:
	dbra	d0,.cpyblk
	rts

The way I see it, I spend 44 cycles for setup and then I have about 10% overhead for looping.

Probably would not hurt to increase the setup to 46 cycles and do 16 moves instead of 8?

Any other approaches, maybe bugs in my code?

roondar · 15 February 2024, 09:50

One common approach would be to check how far away from alignment you are (either at or at destination, or both) and do only move.b's to get word aligned. Then, do the remainder using move.l's (or perhaps even movem.l's). This will get you better performance than doing it all byte-by-byte.

If you want to make the code optimal for 68020/030, doing the same as above, but aligning to longword instead will improve performance on those architectures.

Edit:
Regarding bugs, you might want to check the code that decides how many times to loop. If I am reading this correctly, you currently loop one more time than you should, because DBRA runs from the counter you give and stops when it hits -1, not when it hits 0.

Also be aware that your code doesn't do range checking and DBRA requires a word-sized operator. If this code is called with a size that is too big, or alternatively is called with a negative size, the behaviour is probably not what you want. This may be fine for our purposes, though. Just something to keep in mind.

jotd · 15 February 2024, 10:04

I think you can find examples of optimized overlapping memcopy routines already coded

hint: NOT chatgpt, this is damn unoptimized AND wrong!

Code:

    move.l  D0, D1         ; Copy the number of bytes to D1 for loop control
    move.l  A0, A2         ; Copy source address to A2
    move.l  A1, A3         ; Copy destination address to A3

copy_loop:
    move.b  (A2)+, (A3)+  ; Copy a byte from source to destination
    dbra    D1, copy_loop ; Decrement D1 and repeat if not zero

    rts                   ; Return from subroutine

let's have a good laugh again after I told Chatgpt that it was wrong and unoptimized

Code:

    move.l  D0, D1          ; Copy the number of bytes to D1 for loop control
    move.l  A0, A2          ; Copy source address to A2
    move.l  A1, A3          ; Copy destination address to A3

    ; Check if the number of bytes is not zero
    beq     copy_done

    ; Check if the number of bytes is odd (not divisible by 4)
    tst.b   D1
    beq.s   copy_aligned

    ; Copy remaining bytes one byte at a time
copy_byte_loop:
    move.b  (A2)+, (A3)+   ; Copy a byte from source to destination
    dbra    D1, copy_byte_loop

copy_aligned:
    ; Now D1 contains the number of long words to copy
    ; Copy long words (4 bytes) at a time
copy_long_loop:
    move.l  (A2)+, (A3)+   ; Copy a long word from source to destination
    dbra    D1, copy_long_loop

copy_done:
    rts                    ; Return from subroutine

there are so many issues that it makes my eyes bleed.

To do optimized copy, you'd have to consider a lot of cases, knowing that if source & dest aren't aligned you can NEVER use move.l.. Not a trivial problem. Even amiga OS doesn't have that. It has CopyMem and CopyMemQuick...

roondar · 15 February 2024, 10:29

Ah, I should've been a lot clearer about what I meant, with the move.l part. Sorry.
As jotd writes, in the case of source & dest having different alignment, you indeed can never use move.l (or move.w).

However, if both start and end are odd, then (and only then) you still can get the copy to be aligned. In that particular case, you can deal with the odd byte first and then switch to move.l. Sorry for the confusion here.

a/b · 15 February 2024, 11:48

Assuming you want to have a simple all-on-one routine and avoid checking for special cases where you could optimize for speed, that works.
As mentioned above, you don't check for 0 length, so if that's needed I'd do something like:

Code:

COPY_BITS	EQU	3

	subq.l	#1,d0
	bmi.b	.done
	moveq	#1<<COPY_BITS-1,d1
	eor.w	d1,d0
	and.w	d0,d1
	add.w	d1,d1
	lsr.l	#COPY_BITS,d0
	jmp	(.copy,pc,d1.w)
.copy	REPT	1<<COPY_BITS
	move.b	(a1)+,(a0)+
	ENDR
	dbf	d0,.copy
.done	rts

!ZAJC! · 15 February 2024, 11:54

Quote:

Originally Posted by roondar

One common approach would be to check how far away from alignment you are (either at or at destination, or both) and do only move.b's to get word aligned. Then, do the remainder using move.l's (or perhaps even movem.l's). This will get you better performance than doing it all byte-by-byte.

I understand this approach, but I believe it simply will not work in this case:
From my problem definition:

Quote:

a) source address is odd and destination address is even
or
b) source address is even and destination address is odd

Let me try to restate the above:

Quote:

(A0 XOR A1) AND 1 = 1

Quote:

If you want to make the code optimal for 68020/030, doing the same as above, but aligning to longword instead will improve performance on those architectures.

Thanks. As mentioned, if dealing with, say, strings or compressed files, I cannot guarantee that (a0 XOR a1) AND 1 = 0.

I titled the post with 68000 because I am coding for OCS Amiga 500 right now. On 020/030 I believe unaligned writes will also go through, but not sure of the performance implications.

Quote:

Edit:
Regarding bugs, you might want to check the code that decides how many times to loop. If I am reading this correctly, you currently loop one more time than you should, because DBRA runs from the counter you give and stops when it hits -1, not when it hits 0.

From my understanding DBRA dX, loc is a 10-cycle version of the following:

Code:

subq #1, dX
cmp.w #-1, dX
bne.s loc

For, say, d0 = 5, we will do asr.l d0,3 which will make d0 = 0. dbra will decrement d0 to -1 and then not branch to loc.

Quote:

Also be aware that your code doesn't do range checking and DBRA requires a word-sized operator. If this code is called with a size that is too big, or alternatively is called with a negative size, the behaviour is probably not what you want. This may be fine for our purposes, though. Just something to keep in mind.

Thanks, I believe my code will work from d0 =0 up to 512K (as defined). Notice that d0 contains the number of 8-bytes blocks, so the full range is 8*65535 = 524280 + 7 bytes

I am not sure how to define a copy with a negative length, I think that's a caller-side bug, as opposed to a mandatory subroutine check. The code I posted should work for d0.l = 0 (length 0).

!ZAJC! · 15 February 2024, 11:58

Quote:

Originally Posted by roondar

However, if both start and end are odd, then (and only then) you still can get the copy to be aligned. In that particular case, you can deal with the odd byte first and then switch to move.l. Sorry for the confusion here.

I think what you mean is source and destination having equal parity (equal value of last bit).Generally, I feel like you need to consider parity between all 3 inputs: source, destination and length. Consider for example even source, even destination and odd length. Or odd source, odd destination and even length.

!ZAJC! · 15 February 2024, 12:05

Quote:

Originally Posted by jotd

I think you can find examples of optimized overlapping memcopy routines already coded

Quote:

To do optimized copy, you'd have to consider a lot of cases, knowing that if source & dest aren't aligned you can NEVER use move.l.. Not a trivial problem. Even amiga OS doesn't have that. It has CopyMem and CopyMemQuick...

Thanks jotd for chipping in! Optimized memory copies are my idea of a fun way to get into a particular architecture, and I am sure I am not the only one on this forum

I first looked at CopyMem and CopyMemQuick and, as you said, per docs, they explicitly do not cover this case. Which is exactly why I wrote memcpy_b.
This is why I think coding this right, case by case, for 68000 was a fun way to get into coding and cycle counting

I noticed that a typical case on potentially unaligned data was not covered by ROM libraries. Compression, string routines, etc. are bound to have this problem, so I wondered how the experienced 68000 coders handle this one and not waste cycles.

jotd · 15 February 2024, 12:13

@a/b nice Duff's device code!!

A simple idea would be:

- if both aligned, use move.l until size doesn't allow, then use move.b
- if both unaligned, align with a few move.b then switch to case above
- if not same alignment, you're stuffed! move.b all the way... or maybe if source is aligned then read long, and write bytes using shifts... Would require benching.

I wonder if using libc 68000 memcpy wouldn't be a good idea. That one seems to cover a few cases: https://elixir.bootlin.com/linux/lat...k/lib/memcpy.c

!ZAJC! · 15 February 2024, 12:23

Quote:

Originally Posted by a/b

As mentioned above, you don't check for 0 length, so if that's needed I'd do something like:

From what I understand, I don't need to explicitly check for 0 length, because the logic in my code already takes care of it without treating it as a special case. Am I missing something? Where's the bug?

This way, I spend 44 cycles for setup.

My code spends 16 cycles less on setup than your code and works on 0 length, yielding a significantly better performance for smaller lengths while also performing in overall fewer clock cycles for larger lengths.

Your code could potentially have overall better performance if the caller deliberately calls it with lots of zero-length copies, but I think we all agree that's not the intent.

Thanks for taking a look so quickly.
I still think jacking this up to blocks of 16 gives the best overall size/speed tradeoff given that the move instruction is only 2 bytes in this case.

!ZAJC! · 15 February 2024, 12:26

Quote:

Originally Posted by jotd

@a/b nice Duff's device code!!

A simple idea would be:

- if both aligned, use move.l until size doesn't allow, then use move.b

There has to be a case where it pays to spill other registers to stack and use movem.l also, am I right?

Quote:

- if both unaligned, align with a few move.b then switch to case above
- if not same alignment, you're stuffed! move.b all the way... or maybe if source is aligned then read long, and write bytes using shifts... Would require benching.

Quote:

I wonder if using libc 68000 memcpy wouldn't be a good idea. That one seems to cover a few cases: https://elixir.bootlin.com/linux/lat...k/lib/memcpy.c

Hard to tell without seeing the assembly output of the compiler. gcc? clang? vbcc? The right thing to do IMHO would be to avoid spilling register to stack, otherwise your performance is horrible for small blocks.

chb · 15 February 2024, 13:21

Quote:

Originally Posted by !ZAJC!

There has to be a case where it pays to spill other registers to stack and use movem.l also, am I right?

AFAIR movem.l beats single move.l's when copying >4 longwords, not counting the overhead of the loop or the need to save and restore registers.

BTW, if your source and destination happen to be in chip memory, the fastest method by far for large blocks is using the blitter.

koobo · 15 February 2024, 13:41

I remember playing around with even-to-odd copy back in the day, so I dug up the old sources, looks like this:

Code:

        movem.l (a0)+,d0-d7
        rol.l   #8,d0
        rol.l   #8,d1
        rol.l   #8,d2
        rol.l   #8,d3
        rol.l   #8,d4
        rol.l   #8,d5
        rol.l   #8,d6
        rol.l   #8,d7
        move.b  d0,(a1)+
        move.b  d1,d0
        move.b  d2,d1
        move.b  d3,d2
        move.b  d4,d3
        move.b  d5,d4
        move.b  d6,d5
        move.b  d7,d6
        move.b  (a0)+,d7
        movem.l d0-d7,(a1)
        add.l   a3,a1
        move.b  (a0)+,(a1)+

The question is is this actually faster than a straight byte copy. Rolling bits is painfully slow on the 68000 at least.

chb · 15 February 2024, 13:59

Quote:

Originally Posted by koobo

The question is is this actually faster than a straight byte copy. Rolling bits is painfully slow on the 68000 at least.

If I did the math correctly, it is indeed a bit slower than unrolled move.b-s:
32 x move.b (a0)+,(a1)+: 32*12 cycles = 384 cycles
The above code: 404 cycles, of which 192 are for rolling, 148 for the movem.l-s, and 64 cycles for the remaining stuff. But your code is faster on a system without fastmem when there's a lot of DMA going on, as it needs less memory accesses.

roondar · 15 February 2024, 14:06

I've combined my response into a single post for convience's sake

Quote:

Originally Posted by !ZAJC!

On 020/030 I believe unaligned writes will also go through, but not sure of the performance implications.

Yes, 020+ allows for unaligned reads and writes, though there is a pretty big perfomance penalty for doing so. Depending on the value of the lower 2 address bits (and assuming 32 bit memory), such a read/write would either need 1 memory access or 2 memory accesses.

In the case of a combined read/write, you'd then end up with somewhere between 2 memory accesses for fully aligned and 4 memory accesses for fully unaligned.

Quote:

Originally Posted by !ZAJC!

From my understanding DBRA dX, loc is a 10-cycle version of the following:
Code:
subq #1, dX cmp.w #-1, dX bne.s loc
For, say, d0 = 5, we will do asr.l d0,3 which will make d0 = 0. dbra will decrement d0 to -1 and then not branch to loc.

The underlying reason for my misunderstanding here is that I had misread your code, specifically the part that set the offset for the jump into D1 (I missed you negate the result so it's a negative offset rather than positive). Serves me right for not taking more time

Your way to read DBRA is, of course, correct

Quote:

Originally Posted by !ZAJC!

I think what you mean is source and destination having equal parity (equal value of last bit).Generally, I feel like you need to consider parity between all 3 inputs: source, destination and length. Consider for example even source, even destination and odd length. Or odd source, odd destination and even length.

The length shouldn't matter for whether or not you can use move.l, only the alignment (or parity if you prefer) being the same between the two addresses.

You are correct that in case of such an optimisation you would have to take care of non-4 byte multiple lengths in the copy algorithm. Whether or not this makes sense really is dependent mostly on what kind of memory block sizes you are making this for. For really small copies, it's probably faster to make the copy as simple as it can be. If you expect to routinely copy somewhat more bytes, the extra overhead for doing this houskeeping will be easily paid for by transfering most of the copy in chunks of 4 bytes instead.

Copying 4 bytes using 4x move.b (a0)+,(a1)+ takes 48 cycles, while copying 4 bytes using move.l (a0)+,(a1)+ takes 20 cycles (assuming 68000). I haven't calculated this, but it seems to me that copies as small as 32 bytes, or perhaps even 16 bytes will already be faster if they can be done like this - even with the additional housekeeping involved.

Don_Adan · 15 February 2024, 16:51

I dont remember exactly timings, but if this code is for 68000, then perhaps using movep.l can be the fastest option.

a/b · 15 February 2024, 16:52

If you don't need any kind of input sanitation, that of course simplifies the things.
The rest comes down to size/speed balance (how much speed do you really need and how much memory can you expend for this routine), and how does the length distribution look like (is this some kind of an entirely generic copy routine, or are there certain length ranges and thresholds that are more prevalent)?
If you need good speed for smaller lengths, getting rid of an empty dbf + rshift will be noticable, and you could try a low overhead code. Something like:

Code:

	moveq	#50,d1		; 60+ requires a slower .w branch
	sub.l	d0,d1
	blt.b	.long
	add.w	d1,d1
	jmp	(.copy,pc,d1.w)
.copy	REPT	50
	move.b	(a1)+,(a0)+
	ENDR
	rts
.long
; handle lengths 50+ the usual way, eg. 8x or 16x unrolled
; set-up could be simplified (d1 reuse) if the above threshold is 2^n

NorthWay · 15 February 2024, 17:07

Back in the day (think Fish disks) there was a CopyMem patch by a Norwegian dude who went by the name Art IIRC. He did all sorts of unaligned and combinations thereof. I can't find that right now, but perhaps someone remembers better than me or has better google-fu. (He also had a fix for serial.device that used A6 wrong.)

paraj · 15 February 2024, 19:07

Maybe COPMQR28 (via this)?

NorthWay · 15 February 2024, 23:35

Quote:

Originally Posted by paraj

Maybe COPMQR28 (via this)?

Yes! That must be the one. Need to inspect the source to see if my memory was right.

15 February 2024, 09:50	#2
roondar Registered User Join Date: Jul 2015 Location: The Netherlands Posts: 3,411	One common approach would be to check how far away from alignment you are (either at or at destination, or both) and do only move.b's to get word aligned. Then, do the remainder using move.l's (or perhaps even movem.l's). This will get you better performance than doing it all byte-by-byte. If you want to make the code optimal for 68020/030, doing the same as above, but aligning to longword instead will improve performance on those architectures. Edit: Regarding bugs, you might want to check the code that decides how many times to loop. If I am reading this correctly, you currently loop one more time than you should, because DBRA runs from the counter you give and stops when it hits -1, not when it hits 0. Also be aware that your code doesn't do range checking and DBRA requires a word-sized operator. If this code is called with a size that is too big, or alternatively is called with a negative size, the behaviour is probably not what you want. This may be fine for our purposes, though. Just something to keep in mind. Last edited by roondar; 15 February 2024 at 09:59.

15 February 2024, 13:41	#13
koobo Registered User Join Date: Sep 2019 Location: Finland Posts: 363	I remember playing around with even-to-odd copy back in the day, so I dug up the old sources, looks like this: Code: movem.l (a0)+,d0-d7 rol.l #8,d0 rol.l #8,d1 rol.l #8,d2 rol.l #8,d3 rol.l #8,d4 rol.l #8,d5 rol.l #8,d6 rol.l #8,d7 move.b d0,(a1)+ move.b d1,d0 move.b d2,d1 move.b d3,d2 move.b d4,d3 move.b d5,d4 move.b d6,d5 move.b d7,d6 move.b (a0)+,d7 movem.l d0-d7,(a1) add.l a3,a1 move.b (a0)+,(a1)+ The question is is this actually faster than a straight byte copy. Rolling bits is painfully slow on the 68000 at least.

15 February 2024, 16:52	#17
a/b Registered User Join Date: Jun 2016 Location: europe Posts: 1,039	If you don't need any kind of input sanitation, that of course simplifies the things. The rest comes down to size/speed balance (how much speed do you really need and how much memory can you expend for this routine), and how does the length distribution look like (is this some kind of an entirely generic copy routine, or are there certain length ranges and thresholds that are more prevalent)? If you need good speed for smaller lengths, getting rid of an empty dbf + rshift will be noticable, and you could try a low overhead code. Something like: Code: moveq #50,d1 ; 60+ requires a slower .w branch sub.l d0,d1 blt.b .long add.w d1,d1 jmp (.copy,pc,d1.w) .copy REPT 50 move.b (a1)+,(a0)+ ENDR rts .long ; handle lengths 50+ the usual way, eg. 8x or 16x unrolled ; set-up could be simplified (d1 reuse) if the above threshold is 2^n

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
68000 code optimisations	pmc	Coders. Asm / Hardware	248	17 September 2023 13:20
[C, RTG] My raycaster funtime.. sharing code, also looking for performance boost..	mateusz_s	Coders. General	26	23 April 2023 12:40
how to make this code 68000 compliant?	jotd	Coders. Asm / Hardware	2	30 January 2021 18:25
PErformance on 68000 machines	Amiga1992	project.WHDLoad	4	02 July 2014 11:16
68000 boot code	billt	Coders. General	15	05 May 2012 20:13

15 February 2024, 07:37	#1
!ZAJC! Registered User Join Date: Jan 2024 Location: Zagreb / Croatia Posts: 11	68000 unaligned copy code performance Hi crew, Newbie here, I would love your feedback on my Amiga Motorola 68000 code. Assume I have to copy an arbitrarily small/large block of memory where either: a) source address is odd and destination address is even or b) source address is even and destination address is odd - length can be 0 to 512kB ($7FFFF) - source and destination do not overlap Here is what I have: Code: memcpy_b: ;; a0.l - tgt ;; a1.l - src ;; d0.l - len (bytes) moveq.l #7,d1 and.w d0,d1 ; rem[d1] <- len[d0] & 7 neg.w d1 ; negoff[d1] <- ... add.w d1,d1 ; negoff[d1] <- -2*rem[d1] asr.l #3,d0 ; blocks[d0] <- len[d0] / 8 jmp .done(pc,d1) ; jump to done - negoff[d1] .cpyblk: move.b (a1)+,(a0)+ move.b (a1)+,(a0)+ move.b (a1)+,(a0)+ move.b (a1)+,(a0)+ move.b (a1)+,(a0)+ move.b (a1)+,(a0)+ move.b (a1)+,(a0)+ move.b (a1)+,(a0)+ .done: dbra d0,.cpyblk rts The way I see it, I spend 44 cycles for setup and then I have about 10% overhead for looping. Probably would not hurt to increase the setup to 46 cycles and do 16 moves instead of 8? Any other approaches, maybe bugs in my code?

15 February 2024, 10:29	#4
roondar Registered User Join Date: Jul 2015 Location: The Netherlands Posts: 3,411	Ah, I should've been a lot clearer about what I meant, with the move.l part. Sorry. As jotd writes, in the case of source & dest having different alignment, you indeed can never use move.l (or move.w). However, if both start and end are odd, then (and only then) you still can get the copy to be aligned. In that particular case, you can deal with the odd byte first and then switch to move.l. Sorry for the confusion here.

15 February 2024, 11:48	#5
a/b Registered User Join Date: Jun 2016 Location: europe Posts: 1,039	Assuming you want to have a simple all-on-one routine and avoid checking for special cases where you could optimize for speed, that works. As mentioned above, you don't check for 0 length, so if that's needed I'd do something like: Code: COPY_BITS EQU 3 subq.l #1,d0 bmi.b .done moveq #1<<COPY_BITS-1,d1 eor.w d1,d0 and.w d0,d1 add.w d1,d1 lsr.l #COPY_BITS,d0 jmp (.copy,pc,d1.w) .copy REPT 1<<COPY_BITS move.b (a1)+,(a0)+ ENDR dbf d0,.copy .done rts

15 February 2024, 12:13	#9
jotd This cat is no more Join Date: Dec 2004 Location: FRANCE Age: 52 Posts: 8,197	@a/b nice Duff's device code!! A simple idea would be: - if both aligned, use move.l until size doesn't allow, then use move.b - if both unaligned, align with a few move.b then switch to case above - if not same alignment, you're stuffed! move.b all the way... or maybe if source is aligned then read long, and write bytes using shifts... Would require benching. I wonder if using libc 68000 memcpy wouldn't be a good idea. That one seems to cover a few cases: https://elixir.bootlin.com/linux/lat...k/lib/memcpy.c

15 February 2024, 16:51	#16
Don_Adan Registered User Join Date: Jan 2008 Location: Warsaw/Poland Age: 55 Posts: 1,975	I dont remember exactly timings, but if this code is for 68000, then perhaps using movep.l can be the fastest option.

15 February 2024, 17:07	#18
NorthWay Registered User Join Date: May 2013 Location: Grimstad / Norway Posts: 839	Back in the day (think Fish disks) there was a CopyMem patch by a Norwegian dude who went by the name Art IIRC. He did all sorts of unaligned and combinations thereof. I can't find that right now, but perhaps someone remembers better than me or has better google-fu. (He also had a fix for serial.device that used A6 wrong.)

15 February 2024, 19:07	#19
paraj Registered User Join Date: Feb 2017 Location: Denmark Posts: 1,105	Maybe COPMQR28 (via this)?

Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)