English Amiga Board


Go Back   English Amiga Board > Coders > Coders. Asm / Hardware

 
 
Thread Tools
Old 15 February 2024, 07:37   #1
!ZAJC!
Registered User
 
Join Date: Jan 2024
Location: Zagreb / Croatia
Posts: 11
68000 unaligned copy code performance

Hi crew,

Newbie here, I would love your feedback on my Amiga Motorola 68000 code.

Assume I have to copy an arbitrarily small/large block of memory where either:
a) source address is odd and destination address is even
or
b) source address is even and destination address is odd
- length can be 0 to 512kB ($7FFFF)
- source and destination do not overlap

Here is what I have:

Code:
memcpy_b:
;;	a0.l - tgt
;;	a1.l - src
;;	d0.l - len (bytes)

	moveq.l	#7,d1
	and.w	d0,d1	; rem[d1] <- len[d0] & 7
	neg.w	d1	; negoff[d1] <- ...
	add.w	d1,d1	; negoff[d1] <- -2*rem[d1]

	asr.l	#3,d0	; blocks[d0] <- len[d0] / 8
	
	jmp	.done(pc,d1)	; jump to done - negoff[d1]

.cpyblk:
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
.done:
	dbra	d0,.cpyblk
	rts
The way I see it, I spend 44 cycles for setup and then I have about 10% overhead for looping.

Probably would not hurt to increase the setup to 46 cycles and do 16 moves instead of 8?

Any other approaches, maybe bugs in my code?
!ZAJC! is offline  
Old 15 February 2024, 09:50   #2
roondar
Registered User
 
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,411
One common approach would be to check how far away from alignment you are (either at or at destination, or both) and do only move.b's to get word aligned. Then, do the remainder using move.l's (or perhaps even movem.l's). This will get you better performance than doing it all byte-by-byte.

If you want to make the code optimal for 68020/030, doing the same as above, but aligning to longword instead will improve performance on those architectures.

Edit:
Regarding bugs, you might want to check the code that decides how many times to loop. If I am reading this correctly, you currently loop one more time than you should, because DBRA runs from the counter you give and stops when it hits -1, not when it hits 0.

Also be aware that your code doesn't do range checking and DBRA requires a word-sized operator. If this code is called with a size that is too big, or alternatively is called with a negative size, the behaviour is probably not what you want. This may be fine for our purposes, though. Just something to keep in mind.

Last edited by roondar; 15 February 2024 at 09:59.
roondar is online now  
Old 15 February 2024, 10:04   #3
jotd
This cat is no more
 
jotd's Avatar
 
Join Date: Dec 2004
Location: FRANCE
Age: 52
Posts: 8,197
I think you can find examples of optimized overlapping memcopy routines already coded

hint: NOT chatgpt, this is damn unoptimized AND wrong!

Code:
    move.l  D0, D1         ; Copy the number of bytes to D1 for loop control
    move.l  A0, A2         ; Copy source address to A2
    move.l  A1, A3         ; Copy destination address to A3

copy_loop:
    move.b  (A2)+, (A3)+  ; Copy a byte from source to destination
    dbra    D1, copy_loop ; Decrement D1 and repeat if not zero

    rts                   ; Return from subroutine
let's have a good laugh again after I told Chatgpt that it was wrong and unoptimized

Code:
    move.l  D0, D1          ; Copy the number of bytes to D1 for loop control
    move.l  A0, A2          ; Copy source address to A2
    move.l  A1, A3          ; Copy destination address to A3

    ; Check if the number of bytes is not zero
    beq     copy_done

    ; Check if the number of bytes is odd (not divisible by 4)
    tst.b   D1
    beq.s   copy_aligned

    ; Copy remaining bytes one byte at a time
copy_byte_loop:
    move.b  (A2)+, (A3)+   ; Copy a byte from source to destination
    dbra    D1, copy_byte_loop

copy_aligned:
    ; Now D1 contains the number of long words to copy
    ; Copy long words (4 bytes) at a time
copy_long_loop:
    move.l  (A2)+, (A3)+   ; Copy a long word from source to destination
    dbra    D1, copy_long_loop

copy_done:
    rts                    ; Return from subroutine
there are so many issues that it makes my eyes bleed.

To do optimized copy, you'd have to consider a lot of cases, knowing that if source & dest aren't aligned you can NEVER use move.l.. Not a trivial problem. Even amiga OS doesn't have that. It has CopyMem and CopyMemQuick...
jotd is offline  
Old 15 February 2024, 10:29   #4
roondar
Registered User
 
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,411
Ah, I should've been a lot clearer about what I meant, with the move.l part. Sorry.
As jotd writes, in the case of source & dest having different alignment, you indeed can never use move.l (or move.w).

However, if both start and end are odd, then (and only then) you still can get the copy to be aligned. In that particular case, you can deal with the odd byte first and then switch to move.l. Sorry for the confusion here.
roondar is online now  
Old 15 February 2024, 11:48   #5
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,039
Assuming you want to have a simple all-on-one routine and avoid checking for special cases where you could optimize for speed, that works.
As mentioned above, you don't check for 0 length, so if that's needed I'd do something like:
Code:
COPY_BITS	EQU	3

	subq.l	#1,d0
	bmi.b	.done
	moveq	#1<<COPY_BITS-1,d1
	eor.w	d1,d0
	and.w	d0,d1
	add.w	d1,d1
	lsr.l	#COPY_BITS,d0
	jmp	(.copy,pc,d1.w)
.copy	REPT	1<<COPY_BITS
	move.b	(a1)+,(a0)+
	ENDR
	dbf	d0,.copy
.done	rts
a/b is offline  
Old 15 February 2024, 11:54   #6
!ZAJC!
Registered User
 
Join Date: Jan 2024
Location: Zagreb / Croatia
Posts: 11
Quote:
Originally Posted by roondar View Post
One common approach would be to check how far away from alignment you are (either at or at destination, or both) and do only move.b's to get word aligned. Then, do the remainder using move.l's (or perhaps even movem.l's). This will get you better performance than doing it all byte-by-byte.
I understand this approach, but I believe it simply will not work in this case:
From my problem definition:
Quote:
a) source address is odd and destination address is even
or
b) source address is even and destination address is odd
Let me try to restate the above:
Quote:
(A0 XOR A1) AND 1 = 1

Quote:
If you want to make the code optimal for 68020/030, doing the same as above, but aligning to longword instead will improve performance on those architectures.
Thanks. As mentioned, if dealing with, say, strings or compressed files, I cannot guarantee that (a0 XOR a1) AND 1 = 0.

I titled the post with 68000 because I am coding for OCS Amiga 500 right now. On 020/030 I believe unaligned writes will also go through, but not sure of the performance implications.

Quote:
Edit:
Regarding bugs, you might want to check the code that decides how many times to loop. If I am reading this correctly, you currently loop one more time than you should, because DBRA runs from the counter you give and stops when it hits -1, not when it hits 0.
From my understanding DBRA dX, loc is a 10-cycle version of the following:
Code:
subq #1, dX
cmp.w #-1, dX
bne.s loc
For, say, d0 = 5, we will do asr.l d0,3 which will make d0 = 0. dbra will decrement d0 to -1 and then not branch to loc.

Quote:
Also be aware that your code doesn't do range checking and DBRA requires a word-sized operator. If this code is called with a size that is too big, or alternatively is called with a negative size, the behaviour is probably not what you want. This may be fine for our purposes, though. Just something to keep in mind.
Thanks, I believe my code will work from d0 =0 up to 512K (as defined). Notice that d0 contains the number of 8-bytes blocks, so the full range is 8*65535 = 524280 + 7 bytes

I am not sure how to define a copy with a negative length, I think that's a caller-side bug, as opposed to a mandatory subroutine check. The code I posted should work for d0.l = 0 (length 0).
!ZAJC! is offline  
Old 15 February 2024, 11:58   #7
!ZAJC!
Registered User
 
Join Date: Jan 2024
Location: Zagreb / Croatia
Posts: 11
Quote:
Originally Posted by roondar View Post
However, if both start and end are odd, then (and only then) you still can get the copy to be aligned. In that particular case, you can deal with the odd byte first and then switch to move.l. Sorry for the confusion here.
I think what you mean is source and destination having equal parity (equal value of last bit).Generally, I feel like you need to consider parity between all 3 inputs: source, destination and length. Consider for example even source, even destination and odd length. Or odd source, odd destination and even length.
!ZAJC! is offline  
Old 15 February 2024, 12:05   #8
!ZAJC!
Registered User
 
Join Date: Jan 2024
Location: Zagreb / Croatia
Posts: 11
Quote:
Originally Posted by jotd View Post
I think you can find examples of optimized overlapping memcopy routines already coded
Quote:
To do optimized copy, you'd have to consider a lot of cases, knowing that if source & dest aren't aligned you can NEVER use move.l.. Not a trivial problem. Even amiga OS doesn't have that. It has CopyMem and CopyMemQuick...
Thanks jotd for chipping in! Optimized memory copies are my idea of a fun way to get into a particular architecture, and I am sure I am not the only one on this forum

I first looked at CopyMem and CopyMemQuick and, as you said, per docs, they explicitly do not cover this case. Which is exactly why I wrote memcpy_b.
This is why I think coding this right, case by case, for 68000 was a fun way to get into coding and cycle counting

I noticed that a typical case on potentially unaligned data was not covered by ROM libraries. Compression, string routines, etc. are bound to have this problem, so I wondered how the experienced 68000 coders handle this one and not waste cycles.
!ZAJC! is offline  
Old 15 February 2024, 12:13   #9
jotd
This cat is no more
 
jotd's Avatar
 
Join Date: Dec 2004
Location: FRANCE
Age: 52
Posts: 8,197
@a/b nice Duff's device code!!

A simple idea would be:

- if both aligned, use move.l until size doesn't allow, then use move.b
- if both unaligned, align with a few move.b then switch to case above
- if not same alignment, you're stuffed! move.b all the way... or maybe if source is aligned then read long, and write bytes using shifts... Would require benching.

I wonder if using libc 68000 memcpy wouldn't be a good idea. That one seems to cover a few cases: https://elixir.bootlin.com/linux/lat...k/lib/memcpy.c
jotd is offline  
Old 15 February 2024, 12:23   #10
!ZAJC!
Registered User
 
Join Date: Jan 2024
Location: Zagreb / Croatia
Posts: 11
Quote:
Originally Posted by a/b View Post
As mentioned above, you don't check for 0 length, so if that's needed I'd do something like:
From what I understand, I don't need to explicitly check for 0 length, because the logic in my code already takes care of it without treating it as a special case. Am I missing something? Where's the bug?

This way, I spend 44 cycles for setup.

My code spends 16 cycles less on setup than your code and works on 0 length, yielding a significantly better performance for smaller lengths while also performing in overall fewer clock cycles for larger lengths.

Your code could potentially have overall better performance if the caller deliberately calls it with lots of zero-length copies, but I think we all agree that's not the intent.

Thanks for taking a look so quickly.
I still think jacking this up to blocks of 16 gives the best overall size/speed tradeoff given that the move instruction is only 2 bytes in this case.
!ZAJC! is offline  
Old 15 February 2024, 12:26   #11
!ZAJC!
Registered User
 
Join Date: Jan 2024
Location: Zagreb / Croatia
Posts: 11
Quote:
Originally Posted by jotd View Post
@a/b nice Duff's device code!!

A simple idea would be:

- if both aligned, use move.l until size doesn't allow, then use move.b
There has to be a case where it pays to spill other registers to stack and use movem.l also, am I right?

Quote:
- if both unaligned, align with a few move.b then switch to case above
- if not same alignment, you're stuffed! move.b all the way... or maybe if source is aligned then read long, and write bytes using shifts... Would require benching.
Quote:
I wonder if using libc 68000 memcpy wouldn't be a good idea. That one seems to cover a few cases: https://elixir.bootlin.com/linux/lat...k/lib/memcpy.c
Hard to tell without seeing the assembly output of the compiler. gcc? clang? vbcc? The right thing to do IMHO would be to avoid spilling register to stack, otherwise your performance is horrible for small blocks.
!ZAJC! is offline  
Old 15 February 2024, 13:21   #12
chb
Registered User
 
Join Date: Dec 2014
Location: germany
Posts: 439
Quote:
Originally Posted by !ZAJC! View Post
There has to be a case where it pays to spill other registers to stack and use movem.l also, am I right?
AFAIR movem.l beats single move.l's when copying >4 longwords, not counting the overhead of the loop or the need to save and restore registers.

BTW, if your source and destination happen to be in chip memory, the fastest method by far for large blocks is using the blitter.
chb is offline  
Old 15 February 2024, 13:41   #13
koobo
Registered User
 
koobo's Avatar
 
Join Date: Sep 2019
Location: Finland
Posts: 363
I remember playing around with even-to-odd copy back in the day, so I dug up the old sources, looks like this:

Code:
        movem.l (a0)+,d0-d7
        rol.l   #8,d0
        rol.l   #8,d1
        rol.l   #8,d2
        rol.l   #8,d3
        rol.l   #8,d4
        rol.l   #8,d5
        rol.l   #8,d6
        rol.l   #8,d7
        move.b  d0,(a1)+
        move.b  d1,d0
        move.b  d2,d1
        move.b  d3,d2
        move.b  d4,d3
        move.b  d5,d4
        move.b  d6,d5
        move.b  d7,d6
        move.b  (a0)+,d7
        movem.l d0-d7,(a1)
        add.l   a3,a1
        move.b  (a0)+,(a1)+
The question is is this actually faster than a straight byte copy. Rolling bits is painfully slow on the 68000 at least.
koobo is offline  
Old 15 February 2024, 13:59   #14
chb
Registered User
 
Join Date: Dec 2014
Location: germany
Posts: 439
Quote:
Originally Posted by koobo View Post
The question is is this actually faster than a straight byte copy. Rolling bits is painfully slow on the 68000 at least.
If I did the math correctly, it is indeed a bit slower than unrolled move.b-s:
32 x move.b (a0)+,(a1)+: 32*12 cycles = 384 cycles
The above code: 404 cycles, of which 192 are for rolling, 148 for the movem.l-s, and 64 cycles for the remaining stuff. But your code is faster on a system without fastmem when there's a lot of DMA going on, as it needs less memory accesses.
chb is offline  
Old 15 February 2024, 14:06   #15
roondar
Registered User
 
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,411
I've combined my response into a single post for convience's sake
Quote:
Originally Posted by !ZAJC!
On 020/030 I believe unaligned writes will also go through, but not sure of the performance implications.
Yes, 020+ allows for unaligned reads and writes, though there is a pretty big perfomance penalty for doing so. Depending on the value of the lower 2 address bits (and assuming 32 bit memory), such a read/write would either need 1 memory access or 2 memory accesses.

In the case of a combined read/write, you'd then end up with somewhere between 2 memory accesses for fully aligned and 4 memory accesses for fully unaligned.
Quote:
Originally Posted by !ZAJC!
From my understanding DBRA dX, loc is a 10-cycle version of the following:
Code:
subq #1, dX cmp.w #-1, dX bne.s loc
For, say, d0 = 5, we will do asr.l d0,3 which will make d0 = 0. dbra will decrement d0 to -1 and then not branch to loc.
The underlying reason for my misunderstanding here is that I had misread your code, specifically the part that set the offset for the jump into D1 (I missed you negate the result so it's a negative offset rather than positive). Serves me right for not taking more time

Your way to read DBRA is, of course, correct
Quote:
Originally Posted by !ZAJC! View Post
I think what you mean is source and destination having equal parity (equal value of last bit).Generally, I feel like you need to consider parity between all 3 inputs: source, destination and length. Consider for example even source, even destination and odd length. Or odd source, odd destination and even length.
The length shouldn't matter for whether or not you can use move.l, only the alignment (or parity if you prefer) being the same between the two addresses.

You are correct that in case of such an optimisation you would have to take care of non-4 byte multiple lengths in the copy algorithm. Whether or not this makes sense really is dependent mostly on what kind of memory block sizes you are making this for. For really small copies, it's probably faster to make the copy as simple as it can be. If you expect to routinely copy somewhat more bytes, the extra overhead for doing this houskeeping will be easily paid for by transfering most of the copy in chunks of 4 bytes instead.

Copying 4 bytes using 4x move.b (a0)+,(a1)+ takes 48 cycles, while copying 4 bytes using move.l (a0)+,(a1)+ takes 20 cycles (assuming 68000). I haven't calculated this, but it seems to me that copies as small as 32 bytes, or perhaps even 16 bytes will already be faster if they can be done like this - even with the additional housekeeping involved.
roondar is online now  
Old 15 February 2024, 16:51   #16
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,975
I dont remember exactly timings, but if this code is for 68000, then perhaps using movep.l can be the fastest option.
Don_Adan is offline  
Old 15 February 2024, 16:52   #17
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,039
If you don't need any kind of input sanitation, that of course simplifies the things.
The rest comes down to size/speed balance (how much speed do you really need and how much memory can you expend for this routine), and how does the length distribution look like (is this some kind of an entirely generic copy routine, or are there certain length ranges and thresholds that are more prevalent)?
If you need good speed for smaller lengths, getting rid of an empty dbf + rshift will be noticable, and you could try a low overhead code. Something like:
Code:
	moveq	#50,d1		; 60+ requires a slower .w branch
	sub.l	d0,d1
	blt.b	.long
	add.w	d1,d1
	jmp	(.copy,pc,d1.w)
.copy	REPT	50
	move.b	(a1)+,(a0)+
	ENDR
	rts
.long
; handle lengths 50+ the usual way, eg. 8x or 16x unrolled
; set-up could be simplified (d1 reuse) if the above threshold is 2^n
a/b is offline  
Old 15 February 2024, 17:07   #18
NorthWay
Registered User
 
Join Date: May 2013
Location: Grimstad / Norway
Posts: 839
Back in the day (think Fish disks) there was a CopyMem patch by a Norwegian dude who went by the name Art IIRC. He did all sorts of unaligned and combinations thereof. I can't find that right now, but perhaps someone remembers better than me or has better google-fu. (He also had a fix for serial.device that used A6 wrong.)
NorthWay is offline  
Old 15 February 2024, 19:07   #19
paraj
Registered User
 
paraj's Avatar
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,105
Maybe COPMQR28 (via this)?
paraj is offline  
Old 15 February 2024, 23:35   #20
NorthWay
Registered User
 
Join Date: May 2013
Location: Grimstad / Norway
Posts: 839
Quote:
Originally Posted by paraj View Post
Maybe COPMQR28 (via this)?
Yes! That must be the one. Need to inspect the source to see if my memory was right.
NorthWay is offline  
 


Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)
 
Thread Tools

Similar Threads
Thread Thread Starter Forum Replies Last Post
68000 code optimisations pmc Coders. Asm / Hardware 248 17 September 2023 13:20
[C, RTG] My raycaster funtime.. sharing code, also looking for performance boost.. mateusz_s Coders. General 26 23 April 2023 12:40
how to make this code 68000 compliant? jotd Coders. Asm / Hardware 2 30 January 2021 18:25
PErformance on 68000 machines Amiga1992 project.WHDLoad 4 02 July 2014 11:16
68000 boot code billt Coders. General 15 05 May 2012 20:13

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

Forum Jump


All times are GMT +2. The time now is 00:21.

Top

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2024, vBulletin Solutions Inc.
Page generated in 0.09765 seconds with 15 queries