C++ to Assembler conversion (speedup) memory copy hack

NovaCoder · 18 January 2010, 01:01

Hi,

In this game port that I'm working on there is some code that copies dirty rectangles to a chunky buffer, and sometimes updates the entire screen (it's not my code BTW). Can anyone give me a faster assembly-based version or any general speed-up comments (for plain C)? It's for 030+ and AGA only, 320x200 8-bit display.

static byte *backBuffer;
backBuffer = (byte*)AllocMem(64000, MEMF_FAST);

void updateBackBuffer(byte *src, int x, int y, int w, int h) {
    byte *dst;

    dst = (byte*)backBuffer + y*320 + x;

    do {
        CopyMem(src, dst, w);
        dst += 320;
        src += 320;
    } while (--h);
}

Jgames · 18 January 2010, 18:37

Does CopyMem use the blitter for the copy?
It seems CopyMem copies the screen one horizontal line at a time.
The secret to speeding this up is to know what CopyMem does and can do (for example, whether it uses the blitter).
Again, these are just guesses on my part, as I have never coded for the Amiga.

Thorham · 18 January 2010, 18:39

Quote:

Originally Posted by Jgames

Does CopyMem use the blitter for the copy?

No, it doesn't. The routine is used to copy data to a chunky buffer. The chunky buffer is stored in fastmem.

StingRay · 18 January 2010, 18:51

Quote:

Originally Posted by NovaCoder

Can anyone give me a faster assembly-based version or any general speed-up comments (for plain C)? It's for 030+ and AGA only, 320x200 8-bit display.

The only thing I can think of is to have a special case when source and dest have the same dimensions, i.e. you need to copy the full chunkybuffer. In that case you could use this:

Code:

; a0: source
; a1: dest

updateBackBuffer_full
    move.w    #320*200/4-1,d0
.loop    move.l    (a0)+,(a1)+
    dbf    d0,.loop
    rts

In all other cases I don't think you'll gain much by using an asm version.

NovaCoder · 19 January 2010, 04:48

I hoped that you could do something clever with the memory pointers but maybe it's not possible.

Someone else came up with this, but I'm not sure if it would actually be any faster:

Code:

src = src + y1 * width + x1;
dst = dst + y2 * width + x2;

for (i = 0; i < copyheight; i++)
{
    CopyMem(src, dst, copywidth);
    src += width;
    dst += width;
}

The best I could do, was something like:

Code:

if(x == 0 & y == 0) {
    CopyMemQuick(src, backBuffer, w*h);
} else {
    src = src + y1 * width + x1;
    dst = dst + y2 * width + x2;

    for (i = 0; i < copyheight; i++)
    {
        CopyMem(src, dst, copywidth);
        src += width;
        dst += width;
    }
}

StingRay · 19 January 2010, 09:44

The first version will not be any faster since the inner loop (i.e. the time-consuming part) is exactly the same. Your version is fine (if you remove the bug that is :P), but you'd be better off spending time optimising other parts of the game anyway IMHO.

If you want to optimise the copyloop you can try to replace the CopyMem call with an asm version [move.b/move.l (a0)+,(a1)+] but I don't think this will help much.

Leffmann · 19 January 2010, 22:00

Have you looked at CopyMem to see if it's worth optimizing or not, i.e. if it's anything more clever than a plain for-loop or the equivalent byte copy loop in assembly?

This is a generic byte copy loop to compare execution speed with:

Code:

; A0 = source    A1 = destination
; D0 = x         D1 = y
; D2 = width     D3 = height

updateBackBuffer    mulu.w    #320, d1
                    add.l     d1, a1
                    add.w     d0, a1

                    move.w    #320, d1
                    sub.w     d2, d1
                    subq.w    #1, d2
                    subq.w    #1, d3
                    
.nextrow            move.w    d2, d0
.copy               move.b    (a0)+, (a1)+
                    dbf       d0, .copy
                    
                    add.w     d1, a0
                    add.w     d1, a1
                    
                    dbf       d3, .nextrow
                    rts

StingRay · 19 January 2010, 22:05

Quote:

Originally Posted by Leffmann

Have you looked at CopyMem to see if it's worth optimizing or not, i.e. if it's anything more clever than a plain for-loop or the equivalent byte copy loop in assembly?

AFAIR CopyMem has several routines and copies words/longwords when possible etc. Thus I still do not think it is worth spending much time optimising the copy loop; there should be LOTS of other things that can (and should) be optimised in such a game.

Leffmann · 19 January 2010, 22:15

Yeah, first thing that comes to mind is to convert all graphics to planar and ditch the whole C2P thing. Maybe it's a lot of work but the speed improvement would probably be worth it.

Since no chunky buffer is immediately visible on the screen, can't you just keep a single buffer and restore changes, draw new graphics and merge the dirty rectangles? Why do you need both a front and a back chunky buffer? Also, the AmigaOS memory allocation functions can be real performance killers, so never allocate memory repeatedly when drawing or updating unless it's absolutely necessary.
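As a rough illustration of the "merge the dirty rectangles" idea, here is a minimal C sketch that replaces two rectangles with their bounding box. The `Rect` type and the `rectUnion` name are hypothetical and not taken from the game's source.

```c
#include <assert.h>

/* Hypothetical rectangle type (top-left corner plus size), for illustration only. */
typedef struct { int x, y, w, h; } Rect;

/* Return the smallest rectangle that covers both a and b. */
static Rect rectUnion(Rect a, Rect b)
{
    int x0 = a.x < b.x ? a.x : b.x;
    int y0 = a.y < b.y ? a.y : b.y;
    int x1 = (a.x + a.w) > (b.x + b.w) ? (a.x + a.w) : (b.x + b.w);
    int y1 = (a.y + a.h) > (b.y + b.h) ? (a.y + a.h) : (b.y + b.h);
    Rect r;

    assert(a.w > 0 && a.h > 0 && b.w > 0 && b.h > 0);

    r.x = x0;
    r.y = y0;
    r.w = x1 - x0;
    r.h = y1 - y0;
    return r;
}
```

The bounding box can cover more pixels than the two rectangles together, so merging probably only pays off when the rectangles overlap or lie close to each other.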

NovaCoder · 19 January 2010, 23:15

Quote:

Originally Posted by StingRay

Your version is fine (if you remove the bug that is :P)

Bug? Did I make a stuff-up again?

NovaCoder · 19 January 2010, 23:17

Quote:

Originally Posted by Leffmann

Yeah, first thing that comes to mind is to convert all graphics to planar and ditch the whole C2P thing. Maybe it's a lot of work but the speed improvement would probably be worth it.

Since no chunky buffer is immediately visible on the screen, can't you just keep a single buffer and restore changes, draw new graphics and merge the dirty rectangles? Why do you need both a front and a back chunky buffer? Also, the AmigaOS memory allocation functions can be real performance killers, so never allocate memory repeatedly when drawing or updating unless it's absolutely necessary.

It's too hard to convert all of the graphics to Planar but I might pre-convert the mouse pointer. Anyway thanks for all your help guys

StingRay · 20 January 2010, 15:52

Quote:

Originally Posted by NovaCoder

Bug? Did I make a stuff-up again?

The bug is in this line:
if(x == 0 & y == 0) {

Photon · 30 January 2010, 00:44

Doing simple things with the proper method is hacking now? That means I'm finally a hacker!!

Align both buffers a0,a1 to an even address.

Pseudocode:

save stack pointer
REPT copyareabytesize/4/14
movem.l (a0),d0-d7/a2-a7
movem.l d0-d7/a2-a7,(a1)+
lea 14*4(a0),a0
ENDR
restore stack pointer
copy the last few words with any method you like

If source and destination are within 64K of each other (unlikely...) you can use the a0 register as well for a small gain, replace 14 with 15 in the code. You can also remove the lea line and replace (a0) with a calculated offset(a0) if you have a capable assembler.
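The chunk arithmetic behind the REPT count can be sketched in portable C. This is only an illustration of the split into 14-longword blocks plus a remainder: `copyInChunks` is a hypothetical name and `memcpy` stands in for the movem.l pair, which plain C cannot express.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_BYTES (14 * 4)   /* 14 registers x 4 bytes per movem.l */

/* Copy n bytes as whole 56-byte chunks, then whatever is left over. */
static void copyInChunks(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t chunks = n / CHUNK_BYTES;   /* corresponds to the REPT count */
    size_t rest   = n % CHUNK_BYTES;   /* "the last few words" */

    assert(dst != NULL && src != NULL);

    while (chunks--) {
        memcpy(dst, src, CHUNK_BYTES); /* stand-in for the movem.l load/store pair */
        dst += CHUNK_BYTES;
        src += CHUNK_BYTES;
    }
    memcpy(dst, src, rest);
}
```

For the full 64000-byte buffer this gives 1142 chunks with 48 bytes (12 longwords) left over.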

StingRay · 30 January 2010, 00:58

Quote:

Originally Posted by Photon

Doing simple things with the proper method is hacking now? That means I'm finally a hacker!!

movem.l d0-d7/a2-a7,(a1)+

So that is proper 68k asm for you?

Photon · 30 January 2010, 01:04

Drunkard revision.

Set a0 and a1 to end of buffers
save stack pointer
REPT copyareabytesize/4/14
lea -14*4(a0),a0
movem.l (a0),d0-d7/a2-a7
movem.l d0-d7/a2-a7,-(a1)
ENDR
restore stack pointer
copy the last few words with any method you like

beats move.l (a0)+,(a1)+ and dbf anyway. Muhaha.

StingRay · 30 January 2010, 01:07

Quote:

Originally Posted by Photon

beats move.l (a0)+,(a1)+ and dbf anyway. Muhaha.

Sure thing because it'll crash since you are using a7 and the game is systemfriendly. It's hard to be a haxx0r, I know.

Photon · 30 January 2010, 01:08

He can use as few registers as he wants ofc, that's why I explained the 14/15 factor.

Hope it gives some ideas and explains what can be done.

Leffmann · 30 January 2010, 03:31

Quote:

Originally Posted by StingRay

Sure thing because it'll crash since you are using a7 and the game is systemfriendly. It's hard to be a haxx0r, I know.

But do you know WHY it might crash because you decide to use A7 freely?

I was surprised when I found out why.

BTW what happened to this memory copy thing? Was the CopyMem routine optimized? The asm replacement for updateBackBuffer I wrote is slow and generic, and if the original C function calling CopyMem is faster then there's probably no performance gain to be found in the updateBackBuffer function.

There are 2 errors in the if(x == 0 & y == 0) line. First, the single & is the bitwise operator while the double && is the boolean one. It will still work in this particular case, since in C a true comparison evaluates to 1, and the bitwise and of 1 and 1 is again 1, which evaluates as true. Second, you probably intended to check whether the width was 320: that's when there's no gap between the end of one line and the beginning of the next, and the whole area can be treated and copied as one contiguous block of bytes.
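A minimal sketch of the width check in portable C, assuming the source uses the same 320-byte row stride as the back buffer. `updateBackBufferSketch` and `SCREEN_W` are hypothetical names, and `memcpy` stands in for CopyMem so the sketch builds outside AmigaOS.

```c
#include <assert.h>
#include <string.h>

#define SCREEN_W 320

typedef unsigned char byte;

/* Copy a w x h rectangle from src into backBuffer at (x, y).
   Both buffers are assumed to use a row stride of SCREEN_W bytes. */
static void updateBackBufferSketch(byte *backBuffer, const byte *src,
                                   int x, int y, int w, int h)
{
    byte *dst = backBuffer + y * SCREEN_W + x;

    assert(w > 0 && h > 0 && x >= 0 && y >= 0 && x + w <= SCREEN_W);

    if (w == SCREEN_W) {
        /* Full-width rows leave no gap between lines: one contiguous block. */
        memcpy(dst, src, (size_t)w * (size_t)h);
    } else {
        /* Narrower rectangle: copy row by row, skipping the gap each time. */
        do {
            memcpy(dst, src, (size_t)w);
            dst += SCREEN_W;
            src += SCREEN_W;
        } while (--h);
    }
}
```

Testing the width rather than x and y also catches full-width bands that do not start at the top of the screen.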

Wepl · 05 February 2010, 09:23

Quote:

Originally Posted by NovaCoder

it's for 030+ and AGA only, 320x200 8-bit display.
static byte *backBuffer;
backBuffer = (byte*)AllocMem(64000, MEMF_FAST);

void updateBackBuffer(byte *src, int x, int y, int w, int h) {
    byte *dst;

    dst = (byte*)backBuffer + y*320 + x;

    do {
        CopyMem(src, dst, w);
        dst += 320;
        src += 320;
    } while (--h);
}

Calling a system routine for each line is not optimal; better to copy yourself.
In general, for performance the 'for' loop is most suitable for C compilers, e.g.

Code:

void updateBackBuffer(byte *src, int x, int y, int w, int h) {
	int i, j;
	byte *dst;

	dst = (byte*) backBuffer + y*320 + x;
	j = 320 - w;	/* gap from the end of one row to the start of the next */
	for (; h > 0; h--, dst += j, src += j)
		for (i = w; i > 0; i--)
			*dst++ = *src++;
}

If w is often larger than 16 I would consider adapting the routine to copy long words instead of bytes: start by copying bytes until dst is long aligned, then copy longs, and then copy the remaining bytes.
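The head/body/tail scheme described above might look roughly like this in portable C. `copyRowLongs` is a hypothetical helper name. The 4-byte `memcpy` calls are the well-defined way to express a longword move in C; whether a particular Amiga compiler turns them into a single move.l is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes: single bytes until dst is 4-byte aligned,
   then 4 bytes at a time, then the remaining tail bytes. */
static void copyRowLongs(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    /* Head: byte copies until dst sits on a longword boundary. */
    while (n > 0 && ((uintptr_t)dst & 3u) != 0) {
        *dst++ = *src++;
        n--;
    }

    /* Body: one longword per iteration. Going through a temporary with
       memcpy keeps this valid C even when src is not aligned. */
    while (n >= 4) {
        uint32_t tmp;
        memcpy(&tmp, src, 4);
        memcpy(dst, &tmp, 4);
        src += 4;
        dst += 4;
        n -= 4;
    }

    /* Tail: at most three leftover bytes. */
    while (n > 0) {
        *dst++ = *src++;
        n--;
    }
}
```

Only dst is aligned here. If src ends up on a different offset within a longword, every longword read is still misaligned.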

matthey · 05 February 2010, 10:40

Quote:

Originally Posted by Wepl

Calling a system routine for each line is not optimal; better to copy yourself.

Yes, there is a substantial amount of overhead for each call to CopyMem(), but there is also a substantial amount of overhead to using C. The AmigaOS 3.9 CopyMem() is better than most C programmers can do, and it can be patched to be even faster. It is also friendlier for people with more advanced 68k processors. I am the author of CopyMem on Aminet, which patches the CopyMem() routines on a 68040 or 68060 for an average gain of about 35%-40% over the AmigaOS 3.9 copy routines. I haven't written a 68020/68030 patch yet, but there is the likes of CMQ v2.8 that is pretty good already. The optimal 68020 copy routine for a small longword-aligned source and destination (320 byte) copy is going to be either an unrolled move.l loop or possibly a movem.l loop, but there is setup overhead that may not be overcome for a small copy. The unrolled loop is possible in C but probably not pretty, and the movem.l loop is probably not possible without inline assembler. It's not so bad to install a tested and optimal CopyMem() patch and live with the small overhead of the library call even at 320 bytes.

Quote:

in general for performance the 'for' loop is most suitable for c compilers, e.g.

Actually, for the 68k processors at least, the do while loop is the best for performance because it avoids a tst/cmp and branch at the top of the loop. This is both faster and smaller. At least you kept the counter counting down to 0. This is optimal as the tst/cmp is usually unnecessary as the 68k sets the condition codes (for free) when the counter is decremented.
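A small C sketch of that loop shape, with a hypothetical `copyBytesCountDown` helper. The exact instructions a compiler emits for it (a dbf, or a subq/bne pair) are an assumption and worth checking in the generated assembly.

```c
#include <assert.h>

/* Byte copy with a counter that runs down to zero.
   The loop test sits at the bottom, so there is no compare-and-branch
   at the top of every iteration. */
static void copyBytesCountDown(unsigned char *dst, const unsigned char *src,
                               unsigned n)
{
    assert(dst != 0 && src != 0);

    if (n == 0)
        return;          /* a do/while body always runs at least once */

    do {
        *dst++ = *src++;
    } while (--n);       /* the decrement itself provides the loop condition */
}
```

The explicit zero check is the price of the do/while form. The original updateBackBuffer has no such check, so it relies on never being called with h equal to 0.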

Quote:

If w is often larger than 16 I would consider adapting the routine to copy long words instead of bytes: start by copying bytes until dst is long aligned, then copy longs, and then copy the remaining bytes.

Please just use CopyMem() if the data is unaligned. It's much easier to handle unaligned copies in assembler. A good assembler CopyMem() routine will align the src and dest as well as possible with a minimum of overhead. The 68020/68030 does NOT like unaligned copies. The 68040 and 68060 are much more tolerant, especially if the data is in the caches. I have seen way too many of these jump-table-based, roll-your-own C CopyMem() routines that end up way slower than using CopyMem(). I just replaced one the other day in datatypes.library. It saved several hundred bytes and made the copy several times faster.
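One cheap check a caller could make before choosing between its own longword loop and CopyMem() is whether src and dst share the same offset within a longword. This is a sketch only, and `sameLongPhase` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Return 1 if a and b have the same offset modulo 4, meaning both can be
   brought to a longword boundary by skipping the same number of bytes. */
static int sameLongPhase(const void *a, const void *b)
{
    assert(a != 0 && b != 0);
    return (((uintptr_t)a ^ (uintptr_t)b) & 3u) == 0;
}
```

When this returns 0, no amount of leading byte copies can align both pointers at once, which is the case where handing the row to CopyMem() looks most attractive. With a 320-byte stride on both buffers the result is the same for every row, so it only needs checking once per rectangle.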
