15 February 2024, 07:37 | #1 |
Registered User
Join Date: Jan 2024
Location: Zagreb / Croatia
Posts: 11
|
68000 unaligned copy code performance
Hi crew,
Newbie here, I would love your feedback on my Amiga Motorola 68000 code. Assume I have to copy an arbitrarily small/large block of memory where either: a) source address is odd and destination address is even or b) source address is even and destination address is odd - length can be 0 to 512kB ($7FFFF) - source and destination do not overlap Here is what I have: Code:
memcpy_b: ;; a0.l - tgt ;; a1.l - src ;; d0.l - len (bytes) moveq.l #7,d1 and.w d0,d1 ; rem[d1] <- len[d0] & 7 neg.w d1 ; negoff[d1] <- ... add.w d1,d1 ; negoff[d1] <- -2*rem[d1] asr.l #3,d0 ; blocks[d0] <- len[d0] / 8 jmp .done(pc,d1) ; jump to done - negoff[d1] .cpyblk: move.b (a1)+,(a0)+ move.b (a1)+,(a0)+ move.b (a1)+,(a0)+ move.b (a1)+,(a0)+ move.b (a1)+,(a0)+ move.b (a1)+,(a0)+ move.b (a1)+,(a0)+ move.b (a1)+,(a0)+ .done: dbra d0,.cpyblk rts Probably would not hurt to increase the setup to 46 cycles and do 16 moves instead of 8? Any other approaches, maybe bugs in my code? |
15 February 2024, 09:50 | #2 |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,411
|
One common approach would be to check how far away from alignment you are (either at or at destination, or both) and do only move.b's to get word aligned. Then, do the remainder using move.l's (or perhaps even movem.l's). This will get you better performance than doing it all byte-by-byte.
If you want to make the code optimal for 68020/030, doing the same as above, but aligning to longword instead will improve performance on those architectures. Edit: Regarding bugs, you might want to check the code that decides how many times to loop. If I am reading this correctly, you currently loop one more time than you should, because DBRA runs from the counter you give and stops when it hits -1, not when it hits 0. Also be aware that your code doesn't do range checking and DBRA requires a word-sized operator. If this code is called with a size that is too big, or alternatively is called with a negative size, the behaviour is probably not what you want. This may be fine for our purposes, though. Just something to keep in mind. Last edited by roondar; 15 February 2024 at 09:59. |
15 February 2024, 10:04 | #3 |
This cat is no more
Join Date: Dec 2004
Location: FRANCE
Age: 52
Posts: 8,197
|
I think you can find examples of optimized overlapping memcopy routines already coded
hint: NOT chatgpt, this is damn unoptimized AND wrong! Code:
move.l D0, D1 ; Copy the number of bytes to D1 for loop control move.l A0, A2 ; Copy source address to A2 move.l A1, A3 ; Copy destination address to A3 copy_loop: move.b (A2)+, (A3)+ ; Copy a byte from source to destination dbra D1, copy_loop ; Decrement D1 and repeat if not zero rts ; Return from subroutine Code:
move.l D0, D1 ; Copy the number of bytes to D1 for loop control move.l A0, A2 ; Copy source address to A2 move.l A1, A3 ; Copy destination address to A3 ; Check if the number of bytes is not zero beq copy_done ; Check if the number of bytes is odd (not divisible by 4) tst.b D1 beq.s copy_aligned ; Copy remaining bytes one byte at a time copy_byte_loop: move.b (A2)+, (A3)+ ; Copy a byte from source to destination dbra D1, copy_byte_loop copy_aligned: ; Now D1 contains the number of long words to copy ; Copy long words (4 bytes) at a time copy_long_loop: move.l (A2)+, (A3)+ ; Copy a long word from source to destination dbra D1, copy_long_loop copy_done: rts ; Return from subroutine To do optimized copy, you'd have to consider a lot of cases, knowing that if source & dest aren't aligned you can NEVER use move.l.. Not a trivial problem. Even amiga OS doesn't have that. It has CopyMem and CopyMemQuick... |
15 February 2024, 10:29 | #4 |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,411
|
Ah, I should've been a lot clearer about what I meant, with the move.l part. Sorry.
As jotd writes, in the case of source & dest having different alignment, you indeed can never use move.l (or move.w). However, if both start and end are odd, then (and only then) you still can get the copy to be aligned. In that particular case, you can deal with the odd byte first and then switch to move.l. Sorry for the confusion here. |
15 February 2024, 11:48 | #5 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,039
|
Assuming you want to have a simple all-on-one routine and avoid checking for special cases where you could optimize for speed, that works.
As mentioned above, you don't check for 0 length, so if that's needed I'd do something like: Code:
COPY_BITS EQU 3 subq.l #1,d0 bmi.b .done moveq #1<<COPY_BITS-1,d1 eor.w d1,d0 and.w d0,d1 add.w d1,d1 lsr.l #COPY_BITS,d0 jmp (.copy,pc,d1.w) .copy REPT 1<<COPY_BITS move.b (a1)+,(a0)+ ENDR dbf d0,.copy .done rts |
15 February 2024, 11:54 | #6 | ||||||
Registered User
Join Date: Jan 2024
Location: Zagreb / Croatia
Posts: 11
|
Quote:
From my problem definition: Quote:
Quote:
Quote:
I titled the post with 68000 because I am coding for OCS Amiga 500 right now. On 020/030 I believe unaligned writes will also go through, but not sure of the performance implications. Quote:
Code:
subq #1, dX cmp.w #-1, dX bne.s loc Quote:
I am not sure how to define a copy with a negative length, I think that's a caller-side bug, as opposed to a mandatory subroutine check. The code I posted should work for d0.l = 0 (length 0). |
||||||
15 February 2024, 11:58 | #7 |
Registered User
Join Date: Jan 2024
Location: Zagreb / Croatia
Posts: 11
|
I think what you mean is source and destination having equal parity (equal value of last bit).Generally, I feel like you need to consider parity between all 3 inputs: source, destination and length. Consider for example even source, even destination and odd length. Or odd source, odd destination and even length.
|
15 February 2024, 12:05 | #8 | ||
Registered User
Join Date: Jan 2024
Location: Zagreb / Croatia
Posts: 11
|
Quote:
Quote:
I first looked at CopyMem and CopyMemQuick and, as you said, per docs, they explicitly do not cover this case. Which is exactly why I wrote memcpy_b. This is why I think coding this right, case by case, for 68000 was a fun way to get into coding and cycle counting I noticed that a typical case on potentially unaligned data was not covered by ROM libraries. Compression, string routines, etc. are bound to have this problem, so I wondered how the experienced 68000 coders handle this one and not waste cycles. |
||
15 February 2024, 12:13 | #9 |
This cat is no more
Join Date: Dec 2004
Location: FRANCE
Age: 52
Posts: 8,197
|
@a/b nice Duff's device code!!
A simple idea would be: - if both aligned, use move.l until size doesn't allow, then use move.b - if both unaligned, align with a few move.b then switch to case above - if not same alignment, you're stuffed! move.b all the way... or maybe if source is aligned then read long, and write bytes using shifts... Would require benching. I wonder if using libc 68000 memcpy wouldn't be a good idea. That one seems to cover a few cases: https://elixir.bootlin.com/linux/lat...k/lib/memcpy.c |
15 February 2024, 12:23 | #10 | |
Registered User
Join Date: Jan 2024
Location: Zagreb / Croatia
Posts: 11
|
Quote:
This way, I spend 44 cycles for setup. My code spends 16 cycles less on setup than your code and works on 0 length, yielding a significantly better performance for smaller lengths while also performing in overall fewer clock cycles for larger lengths. Your code could potentially have overall better performance if the caller deliberately calls it with lots of zero-length copies, but I think we all agree that's not the intent. Thanks for taking a look so quickly. I still think jacking this up to blocks of 16 gives the best overall size/speed tradeoff given that the move instruction is only 2 bytes in this case. |
|
15 February 2024, 12:26 | #11 | |||
Registered User
Join Date: Jan 2024
Location: Zagreb / Croatia
Posts: 11
|
Quote:
Quote:
Quote:
|
|||
15 February 2024, 13:21 | #12 | |
Registered User
Join Date: Dec 2014
Location: germany
Posts: 439
|
Quote:
BTW, if your source and destination happen to be in chip memory, the fastest method by far for large blocks is using the blitter. |
|
15 February 2024, 13:41 | #13 |
Registered User
Join Date: Sep 2019
Location: Finland
Posts: 363
|
I remember playing around with even-to-odd copy back in the day, so I dug up the old sources, looks like this:
Code:
movem.l (a0)+,d0-d7 rol.l #8,d0 rol.l #8,d1 rol.l #8,d2 rol.l #8,d3 rol.l #8,d4 rol.l #8,d5 rol.l #8,d6 rol.l #8,d7 move.b d0,(a1)+ move.b d1,d0 move.b d2,d1 move.b d3,d2 move.b d4,d3 move.b d5,d4 move.b d6,d5 move.b d7,d6 move.b (a0)+,d7 movem.l d0-d7,(a1) add.l a3,a1 move.b (a0)+,(a1)+ |
15 February 2024, 13:59 | #14 | |
Registered User
Join Date: Dec 2014
Location: germany
Posts: 439
|
Quote:
32 x move.b (a0)+,(a1)+: 32*12 cycles = 384 cycles The above code: 404 cycles, of which 192 are for rolling, 148 for the movem.l-s, and 64 cycles for the remaining stuff. But your code is faster on a system without fastmem when there's a lot of DMA going on, as it needs less memory accesses. |
|
15 February 2024, 14:06 | #15 | |||
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,411
|
I've combined my response into a single post for convience's sake
Quote:
In the case of a combined read/write, you'd then end up with somewhere between 2 memory accesses for fully aligned and 4 memory accesses for fully unaligned. Quote:
Your way to read DBRA is, of course, correct Quote:
You are correct that in case of such an optimisation you would have to take care of non-4 byte multiple lengths in the copy algorithm. Whether or not this makes sense really is dependent mostly on what kind of memory block sizes you are making this for. For really small copies, it's probably faster to make the copy as simple as it can be. If you expect to routinely copy somewhat more bytes, the extra overhead for doing this houskeeping will be easily paid for by transfering most of the copy in chunks of 4 bytes instead. Copying 4 bytes using 4x move.b (a0)+,(a1)+ takes 48 cycles, while copying 4 bytes using move.l (a0)+,(a1)+ takes 20 cycles (assuming 68000). I haven't calculated this, but it seems to me that copies as small as 32 bytes, or perhaps even 16 bytes will already be faster if they can be done like this - even with the additional housekeeping involved. |
|||
15 February 2024, 16:51 | #16 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,975
|
I dont remember exactly timings, but if this code is for 68000, then perhaps using movep.l can be the fastest option.
|
15 February 2024, 16:52 | #17 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,039
|
If you don't need any kind of input sanitation, that of course simplifies the things.
The rest comes down to size/speed balance (how much speed do you really need and how much memory can you expend for this routine), and how does the length distribution look like (is this some kind of an entirely generic copy routine, or are there certain length ranges and thresholds that are more prevalent)? If you need good speed for smaller lengths, getting rid of an empty dbf + rshift will be noticable, and you could try a low overhead code. Something like: Code:
moveq #50,d1 ; 60+ requires a slower .w branch sub.l d0,d1 blt.b .long add.w d1,d1 jmp (.copy,pc,d1.w) .copy REPT 50 move.b (a1)+,(a0)+ ENDR rts .long ; handle lengths 50+ the usual way, eg. 8x or 16x unrolled ; set-up could be simplified (d1 reuse) if the above threshold is 2^n |
15 February 2024, 17:07 | #18 |
Registered User
Join Date: May 2013
Location: Grimstad / Norway
Posts: 839
|
Back in the day (think Fish disks) there was a CopyMem patch by a Norwegian dude who went by the name Art IIRC. He did all sorts of unaligned and combinations thereof. I can't find that right now, but perhaps someone remembers better than me or has better google-fu. (He also had a fix for serial.device that used A6 wrong.)
|
Currently Active Users Viewing This Thread: 1 (0 members and 1 guests) | |
Thread Tools | |
Similar Threads | ||||
Thread | Thread Starter | Forum | Replies | Last Post |
68000 code optimisations | pmc | Coders. Asm / Hardware | 248 | 17 September 2023 13:20 |
[C, RTG] My raycaster funtime.. sharing code, also looking for performance boost.. | mateusz_s | Coders. General | 26 | 23 April 2023 12:40 |
how to make this code 68000 compliant? | jotd | Coders. Asm / Hardware | 2 | 30 January 2021 18:25 |
PErformance on 68000 machines | Amiga1992 | project.WHDLoad | 4 | 02 July 2014 11:16 |
68000 boot code | billt | Coders. General | 15 | 05 May 2012 20:13 |
|
|