English Amiga Board


Old 01 March 2024, 03:03   #1
!ZAJC!
Registered User
 
Join Date: Jan 2024
Location: Zagreb / Croatia
Posts: 11
Work in Progress: the ultimate 68000 memcpy?

Based on some stimulating feedback on this forum, I decided to take the plunge and try to implement the fastest possible generalized memcpy for an OCS Amiga 68000. Interested in further feedback and open to criticism.

Requirements:
- Must support all sizes from 0 bytes to 1 MB, even or odd values.
- Must support all valid source and target addresses: aligned or unaligned, even or odd.
- All setup overhead including stack pushing and popping counts towards the CPU clocks used.

Goals:
- Beat a trivial move.b copy loop for copies as small as a few bytes, and reach movem-level performance at as small a copy size as possible.

Future research:
- Also include blitter copies for chip RAM. The blitter is insanely fast, reaching 2 CPU clocks/byte.
- Also include shadow chip RAM copies for 512K Chip / 512K Slow Fat Agnus Amigas.

Here's what I got:



For reference, the best a simple move.b/dbra loop can do is 22 clocks/byte. If you include error checking, we beat it from 2-byte copies onward.

The best case for a simple move.l/dbra loop is 7.5 clocks/byte. We beat this loop for lengths of 40 bytes or more. Of course, a simple move.l/dbra loop does not work for odd addresses or for lengths that are not multiples of 4.

By the time we are at 1 KB, we match the performance of 16 unrolled move.l instructions followed by a dbra. By the time we are at 5 KB, we are close to saturating the 68000 bus bandwidth.
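For anyone who wants to sanity-check the two baselines, the numbers drop straight out of the standard 68000 timing tables. A quick model of mine (not the actual benchmark harness):

```python
# Back-of-the-envelope 68000 cycle model for the two baseline loops
# (standard instruction timings, no DMA contention assumed).

MOVE_B_POSTINC = 12  # move.b (a1)+,(a0)+
MOVE_L_POSTINC = 20  # move.l (a1)+,(a0)+
DBRA_TAKEN = 10      # dbra when the branch is taken

# move.b/dbra moves one byte per iteration
clocks_per_byte_b = (MOVE_B_POSTINC + DBRA_TAKEN) / 1

# move.l/dbra moves four bytes per iteration
clocks_per_byte_l = (MOVE_L_POSTINC + DBRA_TAKEN) / 4

print(clocks_per_byte_b)  # 22.0
print(clocks_per_byte_l)  # 7.5
```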





Code:
;; The ultimate(?) 68000 memcpy by !ZAJC!/GDS
;; SPDX-License-Identifier: 0BSD
;; parameters:
;;	a0.l - tgt
;;	a1.l - src
;;	d0.l - len (bytes)
;; jsr memcpy

memcpy_16unaligned:
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
memcpy_unaligned_loop:		;108
	dbra	d0,memcpy_16unaligned
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	rts
memcpy:
	subq.l	#8,d0	;8
	bgt.s	.plus8	;16
	add.w	d0,d0	;20
	neg.w	d0	;24 d0 <- (8 - len) * 2
	jmp	.micro(pc,d0)	; 36
.unaligned:		;62
	moveq	#15,d1	;66
	and.w	d0,d1	;70
	add.w	d1,d1	;74
	neg.w	d1	;78
	lsr.l	#4,d0	;94
	jmp	memcpy_unaligned_loop(pc,d1)	;108
.micro:	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	rts
.odd_a0:
	move.w	a1,d1			;24
	lsr.w	#1,d1			;32
	bcc.s	.unaligned		;40
	move.b	(a1)+,(a0)+		;52
	subq	#1,d0			;56
	bra.s	.aligncopy		;66

.movemcpy:				;100
	; movemcpy entry block
	movem.l d0/d2-d7/a2-a6,-(sp)	;104
	lsr.l	#7,d0			;126
	subq.w	#1,d0			;130
	; movemcpy entry block = 130 cycles
.movem128:				;230
	; begin movem128 cycle block
	movem.l	(a1)+,d1-d7/a2-a6	;108
	movem.l	d1-d7/a2-a6,(a0)	;212
	moveq	#48,d1			;216
	adda.l	d1,a0			;224
	movem.l	(a1)+,d1-d7/a2-a6	;332
	movem.l	d1-d7/a2-a6,(a0)	;436
	moveq	#48,d1			;440
	adda.l	d1,a0			;448
	move.l	(a1)+,(a0)+		;468
	move.l	(a1)+,(a0)+		;488
	move.l	(a1)+,(a0)+		;508
	move.l	(a1)+,(a0)+		;528
	move.l	(a1)+,(a0)+		;548
	move.l	(a1)+,(a0)+		;568
	move.l	(a1)+,(a0)+		;588
	move.l	(a1)+,(a0)+		;608
	dbra	d0,.movem128		;618
	; end movem128 block: 618/128 = 4.83 clocks/byte

	; movem128 exit block:
	movem.l	(sp)+, a2-a6/d0/d2-d7	;108
	moveq	#127,d1			;112
	and.w	d0,d1			;116
	move.w	d1,d0			;120
	moveq	#92,d1			;124
	cmp.w	d1,d0			;128
	ble.s	.from8to100		;136
	bra.s	.from100to520		;146
	; movem128 entry + exit = 130 + 142 = 272
	; previous copy detect/setup overhead = 100
	; total movem setup overhead = 376 clocks
	; at 512 bytes minimum:
	; (376+618*4)/512 = 5.56 clocks/byte
.plus8:				 ;18
	; start align block
	move.w	a0,d1		 ;4
	lsr.w	#1,d1		 ;12
	bcs.s	.odd_a0		 ;20
	move.w	a1,d1		 ;24
	lsr.w	#1,d1		 ;32
	bcs.s	.unaligned 	 ;40
	; end align block, 40 cycles

.aligncopy:			 ;58
	moveq	#92,d1		 ;62
	cmp.l	d1,d0		 ;68
	ble.s	.from8to100	 ;76
	cmpi.l	#512,d0		 ;90
	bge.s	.movemcpy	 ;98
.from100to520:
	moveq	#64,d1		 ;102
	sub.w	d1,d0		 ;106
.loop64:
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	move.l	(a1)+,(a0)+
	sub.w	d1,d0
	bgt.s	.loop64
	add.w	d1,d0

.from8to100:			;78
	move.l	(a1)+,(a0)+	;98
	move.l	(a1)+,(a0)+	;118

.aligncopy_jtbl:
	move.b	.cpyjtbl(pc,d0),d0	;132
	jmp	.cpya(pc,d0)	;146
.cpyjtbl:
	dc.b	.cpy00-.cpya,.cpy01-.cpya,.cpy02-.cpya,.cpy03-.cpya
	dc.b	.cpy04-.cpya,.cpy05-.cpya,.cpy06-.cpya,.cpy07-.cpya
	dc.b	.cpy08-.cpya,.cpy09-.cpya,.cpy10-.cpya,.cpy11-.cpya
	dc.b	.cpy12-.cpya,.cpy13-.cpya,.cpy14-.cpya,.cpy15-.cpya
	dc.b	.cpy16-.cpya,.cpy17-.cpya,.cpy18-.cpya,.cpy19-.cpya
	dc.b	.cpy20-.cpya,.cpy21-.cpya,.cpy22-.cpya,.cpy23-.cpya
	dc.b	.cpy24-.cpya,.cpy25-.cpya,.cpy26-.cpya,.cpy27-.cpya
	dc.b	.cpy28-.cpya,.cpy29-.cpya,.cpy30-.cpya,.cpy31-.cpya
	dc.b	.cpy32-.cpya,.cpy33-.cpya,.cpy34-.cpya,.cpy35-.cpya
	dc.b	.cpy36-.cpya,.cpy37-.cpya,.cpy38-.cpya,.cpy39-.cpya
	dc.b	.cpy40-.cpya,.cpy41-.cpya,.cpy42-.cpya,.cpy43-.cpya
	dc.b	.cpy44-.cpya,.cpy45-.cpya,.cpy46-.cpya,.cpy47-.cpya
	dc.b	.cpy48-.cpya,.cpy49-.cpya,.cpy50-.cpya,.cpy51-.cpya
	dc.b	.cpy52-.cpya,.cpy53-.cpya,.cpy54-.cpya,.cpy55-.cpya
	dc.b	.cpy56-.cpya,.cpy57-.cpya,.cpy58-.cpya,.cpy59-.cpya
	dc.b	.cpy60-.cpya,.cpy61-.cpya,.cpy62-.cpya,.cpy63-.cpya
	dc.b	.cpy64-.cpya,.cpy65-.cpya,.cpy66-.cpya,.cpy67-.cpya
	dc.b	.cpy68-.cpya,.cpy69-.cpya,.cpy70-.cpya,.cpy71-.cpya
	dc.b	.cpy72-.cpya,.cpy73-.cpya,.cpy74-.cpya,.cpy75-.cpya
	dc.b	.cpy76-.cpya,.cpy77-.cpya,.cpy78-.cpya,.cpy79-.cpya
	dc.b	.cpy80-.cpya,.cpy81-.cpya,.cpy82-.cpya,.cpy83-.cpya
	dc.b	.cpy84-.cpya,.cpy85-.cpya,.cpy86-.cpya,.cpy87-.cpya
	dc.b	.cpy88-.cpya,.cpy89-.cpya,.cpy90-.cpya,.cpy91-.cpya
	dc.b	.cpy92-.cpya,.cpy00-.cpya
.cpya:
.cpy92:	move.l	(a1)+,(a0)+
.cpy88:	move.l	(a1)+,(a0)+
.cpy84:	move.l	(a1)+,(a0)+
.cpy80:	move.l	(a1)+,(a0)+
.cpy76:	move.l	(a1)+,(a0)+
.cpy72:	move.l	(a1)+,(a0)+
.cpy68:	move.l	(a1)+,(a0)+
.cpy64:	move.l	(a1)+,(a0)+
.cpy60:	move.l	(a1)+,(a0)+
.cpy56:	move.l	(a1)+,(a0)+
.cpy52:	move.l	(a1)+,(a0)+
.cpy48:	move.l	(a1)+,(a0)+
.cpy44:	move.l	(a1)+,(a0)+
.cpy40:	move.l	(a1)+,(a0)+
.cpy36:	move.l	(a1)+,(a0)+
.cpy32:	move.l	(a1)+,(a0)+
.cpy28:	move.l	(a1)+,(a0)+
.cpy24:	move.l	(a1)+,(a0)+
.cpy20:	move.l	(a1)+,(a0)+
.cpy16: move.l	(a1)+,(a0)+
.cpy12:	move.l	(a1)+,(a0)+
.cpy08:	move.l	(a1)+,(a0)+
.cpy04:	move.l	(a1)+,(a0)+
.cpy00:	rts
.cpy91:	move.l	(a1)+,(a0)+
.cpy87:	move.l	(a1)+,(a0)+
.cpy83:	move.l	(a1)+,(a0)+
.cpy79:	move.l	(a1)+,(a0)+
.cpy75:	move.l	(a1)+,(a0)+
.cpy71:	move.l	(a1)+,(a0)+
.cpy67:	move.l	(a1)+,(a0)+
.cpy63:	move.l	(a1)+,(a0)+
.cpy59:	move.l	(a1)+,(a0)+
.cpy55:	move.l	(a1)+,(a0)+
.cpy51:	move.l	(a1)+,(a0)+
.cpy47:	move.l	(a1)+,(a0)+
.cpy43:	move.l	(a1)+,(a0)+
.cpy39:	move.l	(a1)+,(a0)+
.cpy35:	move.l	(a1)+,(a0)+
.cpy31:	move.l	(a1)+,(a0)+
.cpy27:	move.l	(a1)+,(a0)+
.cpy23:	move.l	(a1)+,(a0)+
.cpy19:	move.l	(a1)+,(a0)+
.cpy15:	move.l	(a1)+,(a0)+
.cpy11:	move.l	(a1)+,(a0)+
.cpy07:	move.l	(a1)+,(a0)+
.cpy03:	move.w	(a1)+,(a0)+
.cpy01: move.b	(a1)+,(a0)+
	rts
.cpy90:	move.l	(a1)+,(a0)+
.cpy86:	move.l	(a1)+,(a0)+
.cpy82:	move.l	(a1)+,(a0)+
.cpy78:	move.l	(a1)+,(a0)+
.cpy74:	move.l	(a1)+,(a0)+
.cpy70:	move.l	(a1)+,(a0)+
.cpy66:	move.l	(a1)+,(a0)+
.cpy62:	move.l	(a1)+,(a0)+
.cpy58:	move.l	(a1)+,(a0)+
.cpy54:	move.l	(a1)+,(a0)+
.cpy50:	move.l	(a1)+,(a0)+
.cpy46:	move.l	(a1)+,(a0)+
.cpy42:	move.l	(a1)+,(a0)+
.cpy38:	move.l	(a1)+,(a0)+
.cpy34:	move.l	(a1)+,(a0)+
.cpy30:	move.l	(a1)+,(a0)+
.cpy26:	move.l	(a1)+,(a0)+
.cpy22:	move.l	(a1)+,(a0)+
.cpy18:	move.l	(a1)+,(a0)+
.cpy14:	move.l	(a1)+,(a0)+
.cpy10:	move.l	(a1)+,(a0)+
.cpy06:	move.l	(a1)+,(a0)+
.cpy02:	move.w	(a1)+,(a0)+
	rts
.cpy89:	move.l	(a1)+,(a0)+
.cpy85:	move.l	(a1)+,(a0)+
.cpy81:	move.l	(a1)+,(a0)+
.cpy77:	move.l	(a1)+,(a0)+
.cpy73:	move.l	(a1)+,(a0)+
.cpy69:	move.l	(a1)+,(a0)+
.cpy65:	move.l	(a1)+,(a0)+
.cpy61:	move.l	(a1)+,(a0)+
.cpy57:	move.l	(a1)+,(a0)+
.cpy53:	move.l	(a1)+,(a0)+
.cpy49:	move.l	(a1)+,(a0)+
.cpy45:	move.l	(a1)+,(a0)+
.cpy41:	move.l	(a1)+,(a0)+
.cpy37:	move.l	(a1)+,(a0)+
.cpy33:	move.l	(a1)+,(a0)+
.cpy29:	move.l	(a1)+,(a0)+
.cpy25:	move.l	(a1)+,(a0)+
.cpy21:	move.l	(a1)+,(a0)+
.cpy17:	move.l	(a1)+,(a0)+
.cpy13:	move.l	(a1)+,(a0)+
.cpy09:	move.l	(a1)+,(a0)+
.cpy05:	move.l	(a1)+,(a0)+
	move.b	(a1)+,(a0)+
	rts
Some features:
- Small branches (Bxx.s) everywhere despite lots of jump tables and unrolled loops
- Takes only 36 cycles to detect a small copy (8 bytes or less; subq + bgt.s) and jump into unrolled move.b code followed by an rts
- Takes only 40 cycles to classify pointers as even-aligned, odd-aligned, or unaligned, using lsr.w #1 followed by bcs/bcc
- Needs only 28 cycles to decide between 4 different variants of move.l unrolled loops
- Switches to word-sized operations for lengths guaranteed to be under 64k
- Balances between shifting, masking, dbra, and simple subtraction, depending on whether we do fewer than 8 iterations
- Switches to an interrupt-friendly version of movem when the copies are large enough to offset the cost of spilling all registers to the stack.
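To make the dispatch easier to follow at a glance, here is a Python paraphrase of the decision tree (my own sketch; the strategy names are mine, and the thresholds come from the asm, where d0 holds len-8 after the entry subq):

```python
def pick_strategy(src: int, dst: int, length: int) -> str:
    """Paraphrase of the memcpy dispatch: which copy variant runs for a
    given src (a1), dst (a0) and length.  Strategy names are mine."""
    d0 = length - 8              # subq.l #8,d0 at entry
    if d0 <= 0:
        return "micro"           # jump into unrolled move.b + rts
    if dst & 1:
        if not (src & 1):
            return "unaligned"   # mixed parity: byte-copy loop
        d0 -= 1                  # both odd: one move.b aligns both pointers
    elif src & 1:
        return "unaligned"
    if d0 <= 92:
        return "jump_table"      # .from8to100
    if d0 >= 512:
        return "movem"           # .movemcpy
    return "loop64"              # .from100to520
```

So, for example, a 64-byte even/even copy goes through the jump table, while a 2 KB one takes the movem path.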

Thanks to:
- roondar and jotd for urging me to consider the aligned case and consider looking at when the aligned copy pays off despite setup.
- Amigo for looking at my code and urging me to minimize tradeoffs for very small copies and get more aggressive with unrolling.
- chb for reminding me of how fast blitter is compared to everything mentioned in this thread (A->D copy is 2 clocks/byte)
- koobo for digging out a creative rol/move.b approach that almost works on the 68000.
- NorthWay and paraj for digging out existing memcpy approaches. I think I rediscovered some tricks (alignment testing) while making significant improvements in others:
- mainly around small copies: using byte-offset jump tables to take care of non-long-aligned remainders and being extreme in minimizing the cycles spent picking between the various copy versions.
- a faster movem version that uses a combination of move.l and movem to build a 128-byte block, letting me use dbra (10 clocks every 128 bytes) instead of sub.l + Bxx.s (16 clocks every 48 bytes) and letting me finish off the rest of the copy with the same fast jump-table code I use elsewhere.
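To illustrate the byte-offset jump-table trick: there are four chains of unrolled move.l, one per remainder mod 4, each ending in an rts after an optional word/byte tail. A little Python model (mine, just checking the bookkeeping) shows every remainder from 0 to 92 copies exactly the right number of bytes:

```python
def tail_copy_plan(r: int):
    """Model of one .cpyNN jump-table entry: for remainder r (0..92),
    how many move.l, move.w and move.b the selected chain executes."""
    longs = r // 4                  # each chain is a run of move.l
    rem = r % 4
    words = 1 if rem >= 2 else 0    # chains for r % 4 in (2, 3) end in move.w
    bytes_ = rem & 1                # chains for odd r end in move.b
    return longs, words, bytes_

# Every table entry copies exactly its remainder's worth of bytes.
assert all(4 * l + 2 * w + b == r
           for r in range(93)
           for (l, w, b) in (tail_copy_plan(r),))
```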
Attached Thumbnails: 68000_memcpy_clocks_bytes.png (45.4 KB), 68000_memcpy_throughput.png (54.0 KB), 68000_memcpy_small_size_performance.png (55.3 KB)
Old 01 March 2024, 06:43   #2
koobo
Registered User
 
 
Join Date: Sep 2019
Location: Finland
Posts: 363
Impressive! I especially like the measurement graphs.

How about replacing this:
Code:
	movem.l	(a1)+,d1-d7/a2-a6	;108
	movem.l	d1-d7/a2-a6,(a0)	;212
	moveq	#48,d1			;216
	adda.l	d1,a0			;224
	movem.l	(a1)+,d1-d7/a2-a6	;332
	movem.l	d1-d7/a2-a6,(a0)	;436
	moveq	#48,d1			;440
	adda.l	d1,a0			;448
...with this:
Code:
	movem.l	(a1)+,d1-d7/a2-a6	;108
	movem.l	d1-d7/a2-a6,(a0)	;212
	movem.l	(a1)+,d1-d7/a2-a6	;332
	movem.l	d1-d7/a2-a6,48(a0)	;436
        lea     48+48(a0),a0
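Quick cycle arithmetic for that substitution, using the standard 68000 timing tables (my own estimate, not measured): movem.l to d16(An) costs 4 cycles more than to (An), and the single lea replaces both moveq/adda.l pairs:

```python
# Cycle cost of advancing a0 inside the 128-byte movem block,
# per the standard 68000 timing tables (estimate, not measured).

MOVEQ = 4            # moveq #48,d1
ADDA_L_DN = 8        # adda.l d1,a0
MOVEM_D16_EXTRA = 4  # movem.l regs,48(a0) vs movem.l regs,(a0)
LEA_D16 = 8          # lea 96(a0),a0

original = 2 * (MOVEQ + ADDA_L_DN)     # two moveq/adda.l pairs
suggested = MOVEM_D16_EXTRA + LEA_D16  # the variant suggested above

print(original, suggested, original - suggested)  # 24 12 12
```

So roughly 12 cycles saved per 128-byte block, if I haven't slipped on the adda.l timing.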
Old 01 March 2024, 08:25   #3
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,039
8x move.b + rts at .micro could be deleted and .micro moved to right after dbra d0,memcpy_16unaligned.
Also, there's one sneaky subq without .w, to be consistent :P.
Old 01 March 2024, 14:04   #4
NorthWay
Registered User
 
Join Date: May 2013
Location: Grimstad / Norway
Posts: 841
Quote:
Originally Posted by !ZAJC! View Post
Future research:
- Also include blitter copies for chip RAM. Blitter is insanely fast reaching 2 CPU clocks/byte.
- Also include shadow chip ram copies for 512K Chip/512 Slow Fat Agnus Amigas..
SCSI DMA chip memcopy?
Old 01 March 2024, 14:08   #5
temisu
Registered User
 
Join Date: Mar 2017
Location: Tallinn / Estonia
Posts: 74
How did you profile them? I wrote a good implementation for memset (at least I think it is good), but I would like to run it through a similar test and look at the graph.
Old 01 March 2024, 16:56   #6
mc6809e
Registered User
 
Join Date: Jan 2012
Location: USA
Posts: 372
I'd be interested in seeing how DMA affects various techniques. With lots of DMA the goal might be to minimize the ratio of instruction fetch to actual memory read/writes instead of raw speed, trading away instruction fetch for idle cycles like the rol.l technique for byte unaligned moves.

Of course just using the code in exec might be best in huge DMA situations.
Old 01 March 2024, 17:17   #7
jotd
This cat is no more
 
 
Join Date: Dec 2004
Location: FRANCE
Age: 52
Posts: 8,209
nice work!
Old 01 March 2024, 17:56   #8
KONEY
OctaMED Music Composer
 
 
Join Date: Jan 2009
Location: Venice - Italy
Age: 49
Posts: 667
Fascinating work!
 

