English Amiga Board


Old 01 March 2024, 03:03   #1
!ZAJC!
Registered User
 
Join Date: Jan 2024
Location: Zagreb / Croatia
Posts: 11
Work in Progress: the ultimate 68000 memcpy?

Based on some stimulating feedback on this forum, I decided to take the plunge and try to implement the fastest possible generalized memcpy for an OCS Amiga 68000. Interested in further feedback and open to criticism.

Requirements:
- Must support all sizes from 0 bytes to 1 MB, even or odd values.
- Must support all valid source and target addresses: aligned or unaligned, even or odd.
- All setup overhead including stack pushing and popping counts towards the CPU clocks used.

Goals:
- Beat a trivial move.b copy loop for copies as small as a few bytes, and reach movem-level performance at as small a copy size as possible.

Future research:
- Also include blitter copies for chip RAM. The blitter is insanely fast, reaching 2 CPU clocks/byte.
- Also include shadow chip RAM copies for 512K Chip / 512K Slow Fat Agnus Amigas.

Here's what I got:



For reference, the best a simple move.b/dbra loop can do is 22 clocks/byte. If you include error checking, we beat it from 2-byte copies onward.

The best case for a simple move.l/dbra loop is 7.5 clocks/byte. We beat this loop for lengths of 40 bytes or more. Of course, a simple move.l/dbra loop does not work for odd addresses or for lengths that are not multiples of 4.

By the time we are at 1 KB, we match the performance of 16 unrolled move.l instructions followed by a dbra. By the time we are at 5 KB, we are close to saturating the 68000 bus bandwidth.
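For anyone who wants to sanity-check the two baselines, the numbers drop straight out of the standard 68000 timing tables. A quick model of mine (not the actual benchmark harness):

```python
# Back-of-the-envelope 68000 cycle model for the two baseline loops
# (standard instruction timings, no DMA contention assumed).

MOVE_B_POSTINC = 12  # move.b (a1)+,(a0)+
MOVE_L_POSTINC = 20  # move.l (a1)+,(a0)+
DBRA_TAKEN = 10      # dbra when the branch is taken

# move.b/dbra moves one byte per iteration
clocks_per_byte_b = (MOVE_B_POSTINC + DBRA_TAKEN) / 1

# move.l/dbra moves four bytes per iteration
clocks_per_byte_l = (MOVE_L_POSTINC + DBRA_TAKEN) / 4

print(clocks_per_byte_b)  # 22.0
print(clocks_per_byte_l)  # 7.5
```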





Code:
;; The ultimate(?) 68000 memcpy by !ZAJC!/GDS
;; SPDX-License-Identifier: 0BSD
;; parameters:
;;	a0.l - tgt
;;	a1.l - src
;;	d0.l - len (bytes)
;; jsr memcpy

memcpy_16unaligned:
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
memcpy_unaligned_loop:		;108
	dbra	d0,memcpy_16unaligned
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	rts
memcpy:
	subq.l	#8,d0	;8
	bgt.s	.plus8	;16
	add.w	d0,d0	;20
	neg.w	d0	;24 d0 <- (8 - len) * 2
	jmp	.micro(pc,d0)	; 36
.unaligned:		;62
	moveq	#15,d1	;66
	and.w	d0,d1	;70
	add.w	d1,d1	;74
	neg.w	d1	;78
	lsr.l	#4,d0	;94
	jmp	memcpy_unaligned_loop(pc,d1)	;108
.micro:	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	rts
.odd_a0:
	move.w	a1,d1			;24
	lsr.w	#1,d1			;32
	bcc.s	.unaligned		;40
	move.b	(a1)+,(a0)+		;52
	subq	#1,d0			;56
	bra.s	.aligncopy		;66

.movemcpy:				;100
	; movemcpy entry block
	movem.l d0/d2-d7/a2-a6,-(sp)	;104
	lsr.l	#7,d0			;126
	subq.w	#1,d0			;130
	; movemcpy entry block = 130 cycles
.movem128:				;230
	; begin movem128 cycle block
	movem.l	(a1)+,d1-d7/a2-a6	;108
	movem.l	d1-d7/a2-a6,(a0)	;212
	moveq	#48,d1			;216
	adda.l	d1,a0			;224
	movem.l	(a1)+,d1-d7/a2-a6	;332
	movem.l	d1-d7/a2-a6,(a0)	;436
	moveq	#48,d1			;440
	adda.l	d1,a0			;448
	move.l	(a1)+,(a0)+		;468
	move.l	(a1)+,(a0)+		;488
	move.l	(a1)+,(a0)+		;508
	move.l	(a1)+,(a0)+		;528
	move.l	(a1)+,(a0)+		;548
	move.l	(a1)+,(a0)+		;568
	move.l	(a1)+,(a0)+		;588
	move.l	(a1)+,(a0)+		;608
	dbra	d0,.movem128		;618
	; end movem128 block: 618/128 = 4.83 clocks/byte

	; movem128 exit block:
	movem.l	(sp)+, a2-a6/d0/d2-d7	;108
	moveq	#127,d1			;112
	and.w	d0,d1			;116
	move.w	d1,d0			;120
	moveq	#92,d1			;124
	cmp.w	d1,d0			;128
	ble.s	.from8to100		;136
	bra.s	.from100to520		;146
	; movem128 entry + exit = 130 + 142 = 272
	; previous copy detect/setup overhead = 100
	; total movem setup overhead = 376 clocks
	; at 512 bytes minimum:
	; (376+618*4)/512 = 5.56 clocks/byte
.plus8:				 ;18
	; start align block
	move.w	a0,d1		 ;4
	lsr.w	#1,d1		 ;12
	bcs.s	.odd_a0		 ;20
	move.w	a1,d1		 ;24
	lsr.w	#1,d1		 ;32
	bcs.s	.unaligned 	 ;40
	; end align block, 40 cycles

.aligncopy:			 ;58
	moveq	#92,d1		 ;62
	cmp.l	d1,d0		 ;68
	ble.s	.from8to100	 ;76
	cmpi.l	#512,d0		 ;90
	bge.s	.movemcpy	 ;98
.from100to520:
	moveq	#64,d1		 ;102
	sub.w	d1,d0		 ;106
.loop64:
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	sub.w	d1,d0
	bgt.s	.loop64
	add.w	d1,d0

.from8to100:			;78
	move.l	(a1)+,(a0)+	;98
	move.l	(a1)+,(a0)+	;118

.aligncopy_jtbl:
	move.b	.cpyjtbl(pc,d0),d0	;132
	jmp	.cpya(pc,d0)	;146
.cpyjtbl:
	dc.b	.cpy00-.cpya,.cpy01-.cpya,.cpy02-.cpya,.cpy03-.cpya
	dc.b	.cpy04-.cpya,.cpy05-.cpya,.cpy06-.cpya,.cpy07-.cpya
	dc.b	.cpy08-.cpya,.cpy09-.cpya,.cpy10-.cpya,.cpy11-.cpya
	dc.b	.cpy12-.cpya,.cpy13-.cpya,.cpy14-.cpya,.cpy15-.cpya
	dc.b	.cpy16-.cpya,.cpy17-.cpya,.cpy18-.cpya,.cpy19-.cpya
	dc.b	.cpy20-.cpya,.cpy21-.cpya,.cpy22-.cpya,.cpy23-.cpya
	dc.b	.cpy24-.cpya,.cpy25-.cpya,.cpy26-.cpya,.cpy27-.cpya
	dc.b	.cpy28-.cpya,.cpy29-.cpya,.cpy30-.cpya,.cpy31-.cpya
	dc.b	.cpy32-.cpya,.cpy33-.cpya,.cpy34-.cpya,.cpy35-.cpya
	dc.b	.cpy36-.cpya,.cpy37-.cpya,.cpy38-.cpya,.cpy39-.cpya
	dc.b	.cpy40-.cpya,.cpy41-.cpya,.cpy42-.cpya,.cpy43-.cpya
	dc.b	.cpy44-.cpya,.cpy45-.cpya,.cpy46-.cpya,.cpy47-.cpya
	dc.b	.cpy48-.cpya,.cpy49-.cpya,.cpy50-.cpya,.cpy51-.cpya
	dc.b	.cpy52-.cpya,.cpy53-.cpya,.cpy54-.cpya,.cpy55-.cpya
	dc.b	.cpy56-.cpya,.cpy57-.cpya,.cpy58-.cpya,.cpy59-.cpya
	dc.b	.cpy60-.cpya,.cpy61-.cpya,.cpy62-.cpya,.cpy63-.cpya
	dc.b	.cpy64-.cpya,.cpy65-.cpya,.cpy66-.cpya,.cpy67-.cpya
	dc.b	.cpy68-.cpya,.cpy69-.cpya,.cpy70-.cpya,.cpy71-.cpya
	dc.b	.cpy72-.cpya,.cpy73-.cpya,.cpy74-.cpya,.cpy75-.cpya
	dc.b	.cpy76-.cpya,.cpy77-.cpya,.cpy78-.cpya,.cpy79-.cpya
	dc.b	.cpy80-.cpya,.cpy81-.cpya,.cpy82-.cpya,.cpy83-.cpya
	dc.b	.cpy84-.cpya,.cpy85-.cpya,.cpy86-.cpya,.cpy87-.cpya
	dc.b	.cpy88-.cpya,.cpy89-.cpya,.cpy90-.cpya,.cpy91-.cpya
	dc.b	.cpy92-.cpya,.cpy00-.cpya
.cpya:
.cpy92:	move.l	(a1)+,(a0)+
.cpy88:	move.l	(a1)+,(a0)+
.cpy84:	move.l	(a1)+,(a0)+
.cpy80:	move.l	(a1)+,(a0)+
.cpy76:	move.l	(a1)+,(a0)+
.cpy72:	move.l	(a1)+,(a0)+
.cpy68:	move.l	(a1)+,(a0)+
.cpy64:	move.l	(a1)+,(a0)+
.cpy60:	move.l	(a1)+,(a0)+
.cpy56:	move.l	(a1)+,(a0)+
.cpy52:	move.l	(a1)+,(a0)+
.cpy48:	move.l	(a1)+,(a0)+
.cpy44:	move.l	(a1)+,(a0)+
.cpy40:	move.l	(a1)+,(a0)+
.cpy36:	move.l	(a1)+,(a0)+
.cpy32:	move.l	(a1)+,(a0)+
.cpy28:	move.l	(a1)+,(a0)+
.cpy24:	move.l	(a1)+,(a0)+
.cpy20:	move.l	(a1)+,(a0)+
.cpy16: move.l	(a1)+,(a0)+
.cpy12:	move.l	(a1)+,(a0)+
.cpy08:	move.l	(a1)+,(a0)+
.cpy04:	move.l	(a1)+,(a0)+
.cpy00:	rts
.cpy91:	move.l	(a1)+,(a0)+
.cpy87:	move.l	(a1)+,(a0)+
.cpy83:	move.l	(a1)+,(a0)+
.cpy79:	move.l	(a1)+,(a0)+
.cpy75:	move.l	(a1)+,(a0)+
.cpy71:	move.l	(a1)+,(a0)+
.cpy67:	move.l	(a1)+,(a0)+
.cpy63:	move.l	(a1)+,(a0)+
.cpy59:	move.l	(a1)+,(a0)+
.cpy55:	move.l	(a1)+,(a0)+
.cpy51:	move.l	(a1)+,(a0)+
.cpy47:	move.l	(a1)+,(a0)+
.cpy43:	move.l	(a1)+,(a0)+
.cpy39:	move.l	(a1)+,(a0)+
.cpy35:	move.l	(a1)+,(a0)+
.cpy31:	move.l	(a1)+,(a0)+
.cpy27:	move.l	(a1)+,(a0)+
.cpy23:	move.l	(a1)+,(a0)+
.cpy19:	move.l	(a1)+,(a0)+
.cpy15:	move.l	(a1)+,(a0)+
.cpy11:	move.l	(a1)+,(a0)+
.cpy07:	move.l	(a1)+,(a0)+
.cpy03:	move.w	(a1)+,(a0)+
.cpy01: move.b	(a1)+,(a0)+
	rts
.cpy90:	move.l	(a1)+,(a0)+
.cpy86:	move.l	(a1)+,(a0)+
.cpy82:	move.l	(a1)+,(a0)+
.cpy78:	move.l	(a1)+,(a0)+
.cpy74:	move.l	(a1)+,(a0)+
.cpy70:	move.l	(a1)+,(a0)+
.cpy66:	move.l	(a1)+,(a0)+
.cpy62:	move.l	(a1)+,(a0)+
.cpy58:	move.l	(a1)+,(a0)+
.cpy54:	move.l	(a1)+,(a0)+
.cpy50:	move.l	(a1)+,(a0)+
.cpy46:	move.l	(a1)+,(a0)+
.cpy42:	move.l	(a1)+,(a0)+
.cpy38:	move.l	(a1)+,(a0)+
.cpy34:	move.l	(a1)+,(a0)+
.cpy30:	move.l	(a1)+,(a0)+
.cpy26:	move.l	(a1)+,(a0)+
.cpy22:	move.l	(a1)+,(a0)+
.cpy18:	move.l	(a1)+,(a0)+
.cpy14:	move.l	(a1)+,(a0)+
.cpy10:	move.l	(a1)+,(a0)+
.cpy06:	move.l	(a1)+,(a0)+
.cpy02:	move.w	(a1)+,(a0)+
	rts
.cpy89:	move.l	(a1)+,(a0)+
.cpy85:	move.l	(a1)+,(a0)+
.cpy81:	move.l	(a1)+,(a0)+
.cpy77:	move.l	(a1)+,(a0)+
.cpy73:	move.l	(a1)+,(a0)+
.cpy69:	move.l	(a1)+,(a0)+
.cpy65:	move.l	(a1)+,(a0)+
.cpy61:	move.l	(a1)+,(a0)+
.cpy57:	move.l	(a1)+,(a0)+
.cpy53:	move.l	(a1)+,(a0)+
.cpy49:	move.l	(a1)+,(a0)+
.cpy45:	move.l	(a1)+,(a0)+
.cpy41:	move.l	(a1)+,(a0)+
.cpy37:	move.l	(a1)+,(a0)+
.cpy33:	move.l	(a1)+,(a0)+
.cpy29:	move.l	(a1)+,(a0)+
.cpy25:	move.l	(a1)+,(a0)+
.cpy21:	move.l	(a1)+,(a0)+
.cpy17:	move.l	(a1)+,(a0)+
.cpy13:	move.l	(a1)+,(a0)+
.cpy09:	move.l	(a1)+,(a0)+
.cpy05:	move.l	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	rts
Some features:
- Small branches (Bxx.s) everywhere despite lots of jump tables and unrolled loops
- Takes only 36 cycles to detect a small copy (8 bytes or less; subq + bgt.s) and jump into unrolled move.b code followed by an rts
- Takes only 40 cycles to classify pointers as even-aligned, odd-aligned, or unaligned, using lsr.w #1 followed by bcs/bcc
- Needs only 28 cycles to decide between 4 different variants of move.l unrolled loops
- Switches to word-sized operations for lengths guaranteed to be under 64k
- Balances between shifting, masking, dbra, and simple subtraction, depending on whether we do fewer than 8 iterations
- Switches to an interrupt-friendly version of movem when the copies are large enough to offset the cost of spilling all registers to the stack.
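To make the dispatch easier to follow at a glance, here is a Python paraphrase of the decision tree (my own sketch; the strategy names are mine, and the thresholds come from the asm, where d0 holds len-8 after the entry subq):

```python
def pick_strategy(src: int, dst: int, length: int) -> str:
    """Paraphrase of the memcpy dispatch: which copy variant runs for a
    given src (a1), dst (a0) and length.  Strategy names are mine."""
    d0 = length - 8              # subq.l #8,d0 at entry
    if d0 <= 0:
        return "micro"           # jump into unrolled move.b + rts
    if dst & 1:
        if not (src & 1):
            return "unaligned"   # mixed parity: byte-copy loop
        d0 -= 1                  # both odd: one move.b aligns both pointers
    elif src & 1:
        return "unaligned"
    if d0 <= 92:
        return "jump_table"      # .from8to100
    if d0 >= 512:
        return "movem"           # .movemcpy
    return "loop64"              # .from100to520
```

So, for example, a 64-byte even/even copy goes through the jump table, while a 2 KB one takes the movem path.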

Thanks to:
- roondar and jotd for urging me to consider the aligned case and consider looking at when the aligned copy pays off despite setup.
- Amigo for looking at my code and urging me to minimize tradeoffs for very small copies and get more aggressive with unrolling.
- chb for reminding me of how fast blitter is compared to everything mentioned in this thread (A->D copy is 2 clocks/byte)
- koobo for digging out a creative rol/move.b approach that almost works on the 68000.
- NorthWay and paraj for digging out existing memcpy approaches. I think I rediscovered some tricks (alignment testing) while making significant improvements in others:
- mainly around small copies: using byte-offset jump tables to take care of non-long-aligned remainders and being extreme in minimizing the cycles spent picking between the various copy versions.
- a faster movem version that uses a combination of move.l and movem to build a 128-byte block, letting me use dbra (10 clocks every 128 bytes) instead of sub.l + Bxx.s (16 clocks every 48 bytes) and letting me finish off the rest of the copy with the same fast jump-table code I use elsewhere.
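To illustrate the byte-offset jump-table trick: there are four chains of unrolled move.l, one per remainder mod 4, each ending in an rts after an optional word/byte tail. A little Python model (mine, just checking the bookkeeping) shows every remainder from 0 to 92 copies exactly the right number of bytes:

```python
def tail_copy_plan(r: int):
    """Model of one .cpyNN jump-table entry: for remainder r (0..92),
    how many move.l, move.w and move.b the selected chain executes."""
    longs = r // 4                  # each chain is a run of move.l
    rem = r % 4
    words = 1 if rem >= 2 else 0    # chains for r % 4 in (2, 3) end in move.w
    bytes_ = rem & 1                # chains for odd r end in move.b
    return longs, words, bytes_

# Every table entry copies exactly its remainder's worth of bytes.
assert all(4 * l + 2 * w + b == r
           for r in range(93)
           for (l, w, b) in (tail_copy_plan(r),))
```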
Attached Thumbnails: 68000_memcpy_clocks_bytes.png (45.4 KB), 68000_memcpy_throughput.png (54.0 KB), 68000_memcpy_small_size_performance.png (55.3 KB)
Old 01 March 2024, 06:43   #2
koobo
Registered User
 
 
Join Date: Sep 2019
Location: Finland
Posts: 363
Impressive! I especially like the measurement graphs.

How about replacing this:
Code:
	movem.l	(a1)+,d1-d7/a2-a6	;108
	movem.l	d1-d7/a2-a6,(a0)	;212
	moveq	#48,d1			;216
	adda.l	d1,a0			;224
	movem.l	(a1)+,d1-d7/a2-a6	;332
	movem.l	d1-d7/a2-a6,(a0)	;436
	moveq	#48,d1			;440
	adda.l	d1,a0			;448
...with this:
Code:
	movem.l	(a1)+,d1-d7/a2-a6	;108
	movem.l	d1-d7/a2-a6,(a0)	;212
	movem.l	(a1)+,d1-d7/a2-a6	;332
	movem.l	d1-d7/a2-a6,48(a0)	;436
        lea     48+48(a0),a0
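Quick cycle arithmetic for that substitution, using the standard 68000 timing tables (my own estimate, not measured): movem.l to d16(An) costs 4 cycles more than to (An), and the single lea replaces both moveq/adda.l pairs:

```python
# Cycle cost of advancing a0 inside the 128-byte movem block,
# per the standard 68000 timing tables (estimate, not measured).

MOVEQ = 4            # moveq #48,d1
ADDA_L_DN = 8        # adda.l d1,a0
MOVEM_D16_EXTRA = 4  # movem.l regs,48(a0) vs movem.l regs,(a0)
LEA_D16 = 8          # lea 96(a0),a0

original = 2 * (MOVEQ + ADDA_L_DN)     # two moveq/adda.l pairs
suggested = MOVEM_D16_EXTRA + LEA_D16  # the variant suggested above

print(original, suggested, original - suggested)  # 24 12 12
```

So roughly 12 cycles saved per 128-byte block, if I haven't slipped on the adda.l timing.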
Old 01 March 2024, 08:25   #3
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,039
8x move.b + rts at .micro could be deleted and .micro moved to right after dbra d0,memcpy_16unaligned.
Also, there's one sneaky subq without .w, to be consistent :P.
Old 01 March 2024, 14:04   #4
NorthWay
Registered User
 
Join Date: May 2013
Location: Grimstad / Norway
Posts: 841
Quote:
Originally Posted by !ZAJC! View Post
Future research:
- Also include blitter copies for chip RAM. Blitter is insanely fast reaching 2 CPU clocks/byte.
- Also include shadow chip ram copies for 512K Chip/512 Slow Fat Agnus Amigas..
SCSI DMA chip memcopy?
Old 01 March 2024, 14:08   #5
temisu
Registered User
 
Join Date: Mar 2017
Location: Tallinn / Estonia
Posts: 74
How did you profile them? I wrote a good implementation for memset (at least I think it is good), but I would like to run it through a similar test and look at the graph.
Old 01 March 2024, 16:56   #6
mc6809e
Registered User
 
Join Date: Jan 2012
Location: USA
Posts: 372
I'd be interested in seeing how DMA affects various techniques. With lots of DMA the goal might be to minimize the ratio of instruction fetch to actual memory read/writes instead of raw speed, trading away instruction fetch for idle cycles like the rol.l technique for byte unaligned moves.

Of course just using the code in exec might be best in huge DMA situations.
Old 01 March 2024, 17:17   #7
jotd
This cat is no more
 
 
Join Date: Dec 2004
Location: FRANCE
Age: 52
Posts: 8,209
nice work!
Old 01 March 2024, 17:56   #8
KONEY
OctaMED Music Composer
 
 
Join Date: Jan 2009
Location: Venice - Italy
Age: 49
Posts: 667
Fascinating work!
 

