English Amiga Board


Old 18 January 2010, 01:01   #1
NovaCoder
Registered User
 
Join Date: Sep 2007
Location: Melbourne/Australia
Posts: 4,408
C++ to Assembler conversion (speedup) memory copy hack

Hi,

In this game port that I'm working on, there is some code that copies dirty rectangles to a chunky buffer and sometimes updates the entire screen (it's not my code BTW). Can anyone give me a faster assembly-based version or any general speed-up comments (for plain C)? It's for 030+ and AGA only, 320x200 8-bit display.



static byte *backBuffer;
backBuffer = (byte*)AllocMem(64000, MEMF_FAST);   /* at startup: 320*200 = 64000 bytes of fast RAM */

void updateBackBuffer(byte *src, int x, int y, int w, int h) {
    byte *dst;

    dst = (byte*)backBuffer + y*320 + x;

    /* copy one row of the dirty rectangle per iteration; assumes h >= 1 */
    do {
        CopyMem(src, dst, w);
        dst += 320;
        src += 320;
    } while (--h);
}
Old 18 January 2010, 18:37   #2
Jgames
Registered User
 
Join Date: Mar 2009
Location: UK
Posts: 457
Does CopyMem use the blitter for the copy?
CopyMem seems to copy the screen one horizontal line at a time.
The secret to speeding this up is to know what CopyMem does and can do (for example, whether it uses the blitter).
Again, these are just guesses on my part, as I have never coded for the Amiga.
Old 18 January 2010, 18:39   #3
Thorham
Computer Nerd
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,831
Quote:
Originally Posted by Jgames View Post
Does CopyMem use the blitter for the copy?
No, it doesn't. The routine is used to copy data to a chunky buffer. The chunky buffer is stored in fastmem.
Old 18 January 2010, 18:51   #4
StingRay
move.l #$c0ff33,throat
 
Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 6,863
Quote:
Originally Posted by NovaCoder View Post
Can anyone give me a faster assembly-based version or any general speed-up comments (for plain C)? It's for 030+ and AGA only, 320x200 8-bit display.
The only thing I can think of is to have a special case when source and dest have the same dimensions, i.e. you need to copy the full chunkybuffer. In that case you could use this:

Code:
; a0: source
; a1: dest

updateBackBuffer_full
        move.w  #320*200/4-1,d0     ; 16000 longwords, minus 1 because dbf stops at -1
.loop   move.l  (a0)+,(a1)+         ; copy one longword, advance both pointers
        dbf     d0,.loop
        rts
In all other cases I don't think you'll gain much by using an asm version.
Old 19 January 2010, 04:48   #5
NovaCoder
Registered User
 
Join Date: Sep 2007
Location: Melbourne/Australia
Posts: 4,408
I hoped that you could do something clever with the memory pointers but maybe it's not possible.

Someone else came up with this, but I'm not sure if it would actually be any faster:

Code:
src = src + y1 * width + x1;
dst = dst + y2 * width + x2;

for (i = 0; i < copyheight; i++)
{
    CopyMem(src, dst, copywidth);
    src += width;
    dst += width;
}
The best I could do was something like:
Code:
if(x == 0 & y == 0) {
    CopyMemQuick(src, backBuffer, w*h);
} else {
    src = src + y1 * width + x1;
    dst = dst + y2 * width + x2;

    for (i = 0; i < copyheight; i++)
    {
        CopyMem(src, dst, copywidth);
        src += width;
        dst += width;
    }
}
Old 19 January 2010, 09:44   #6
StingRay
move.l #$c0ff33,throat
 
Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 6,863
The first version will not be any faster since the inner loop (i.e. the time-consuming part) is exactly the same. Your version is fine (if you remove the bug that is :P). You would do better to spend your time optimising other parts of the game anyway, IMHO. If you want to optimise the copy loop, you can try replacing the CopyMem call with an asm version [move.b/move.l (a0)+,(a1)+], but I don't think this will help much.

Last edited by StingRay; 19 January 2010 at 09:57. Reason: some corrections
Old 19 January 2010, 22:00   #7
Leffmann
 
Join Date: Jul 2008
Location: Sweden
Posts: 2,269
Have you looked at CopyMem to see if it's worth optimizing or not, i.e. if it's anything more clever than a plain for-loop or the equivalent byte copy loop in assembly?

This is a generic byte copy loop to compare execution speed with:

Code:
; A0 = source    A1 = destination
; D0 = x         D1 = y
; D2 = width     D3 = height

updateBackBuffer    mulu.w    #320, d1          ; d1 = y*320
                    add.l     d1, a1
                    add.w     d0, a1            ; a1 = destination + y*320 + x

                    move.w    #320, d1
                    sub.w     d2, d1            ; d1 = 320 - width (gap to the next row)
                    subq.w    #1, d2            ; dbf counts down to -1
                    subq.w    #1, d3

.nextrow            move.w    d2, d0
.copy               move.b    (a0)+, (a1)+
                    dbf       d0, .copy

                    add.w     d1, a0            ; move both pointers to the next row
                    add.w     d1, a1

                    dbf       d3, .nextrow
                    rts
Old 19 January 2010, 22:05   #8
StingRay
move.l #$c0ff33,throat
 
Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 6,863
Quote:
Originally Posted by Leffmann View Post
Have you looked at CopyMem to see if it's worth optimizing or not, i.e. if it's anything more clever than a plain for-loop or the equivalent byte copy loop in assembly?
AFAIR CopyMem has several routines and copies words/longwords when possible etc. Thus I still do not think it is worth spending much time optimising the copy loop; there should be LOTS of other things that can (and should) be optimised in such a game.
Old 19 January 2010, 22:15   #9
Leffmann
 
Join Date: Jul 2008
Location: Sweden
Posts: 2,269
Yeah, first thing that comes to mind is to convert all graphics to planar and ditch the whole C2P thing. Maybe it's a lot of work but the speed improvement would probably be worth it.

Since no chunky buffer is immediately visible on the screen, can't you just keep a single buffer and restore changes, draw new graphics and merge the dirty rectangles? Why do you need both a front and a back chunky buffer? Also, the AmigaOS memory allocation functions can be real performance killers, so never allocate memory repeatedly when drawing or updating unless it's absolutely necessary.
Old 19 January 2010, 23:15   #10
NovaCoder
Registered User
 
Join Date: Sep 2007
Location: Melbourne/Australia
Posts: 4,408
Quote:
Originally Posted by StingRay View Post
Your version is fine (if you remove the bug that is :P)
Bug? Did I make a stuff-up again?
Old 19 January 2010, 23:17   #11
NovaCoder
Registered User
 
Join Date: Sep 2007
Location: Melbourne/Australia
Posts: 4,408
Quote:
Originally Posted by Leffmann View Post
Yeah, first thing that comes to mind is to convert all graphics to planar and ditch the whole C2P thing. Maybe it's a lot of work but the speed improvement would probably be worth it.

Since no chunky buffer is immediately visible on the screen, can't you just keep a single buffer and restore changes, draw new graphics and merge the dirty rectangles? Why do you need both a front and a back chunky buffer? Also, the AmigaOS memory allocation functions can be real performance killers, so never allocate memory repeatedly when drawing or updating unless it's absolutely necessary.
It's too hard to convert all of the graphics to Planar but I might pre-convert the mouse pointer. Anyway thanks for all your help guys
Old 20 January 2010, 15:52   #12
StingRay
move.l #$c0ff33,throat
 
Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 6,863
Quote:
Originally Posted by NovaCoder View Post
Bug? Did I make a stuff-up again?

The bug is in this line:
if(x == 0 & y == 0) {
Old 30 January 2010, 00:44   #13
Photon
Moderator
 
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,650
Doing simple things with the proper method is hacking now? That means I'm finally a hacker!!

Align both buffers a0,a1 to an even address.

Pseudocode:

save stack pointer
REPT copyareabytesize/4/14
movem.l (a0),d0-d7/a2-a7
movem.l d0-d7/a2-a7,(a1)+
lea 14*4(a0),a0
ENDR
restore stack pointer
copy the last few words with any method you like

If source and destination are within 64K of each other (unlikely...) you can use the a0 register as well for a small gain, replace 14 with 15 in the code. You can also remove the lea line and replace (a0) with a calculated offset(a0) if you have a capable assembler.
Old 30 January 2010, 00:58   #14
StingRay
move.l #$c0ff33,throat
 
Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 6,863
Quote:
Originally Posted by Photon View Post
Doing simple things with the proper method is hacking now? That means I'm finally a hacker!!

movem.l d0-d7/a2-a7,(a1)+
So that is proper 68k asm for you?
Old 30 January 2010, 01:04   #15
Photon
Moderator
 
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,650
Drunkard revision.

Set a0 and a1 to end of buffers
save stack pointer
REPT copyareabytesize/4/14
lea -14*4(a0),a0
movem.l (a0),d0-d7/a2-a7
movem.l d0-d7/a2-a7,-(a1)
ENDR
restore stack pointer
copy the last few words with any method you like


beats move.l (a0)+,(a1)+ and dbf anyway. Muhaha.
Old 30 January 2010, 01:07   #16
StingRay
move.l #$c0ff33,throat
 
Join Date: Dec 2005
Location: Berlin/Joymoney
Posts: 6,863
Quote:
Originally Posted by Photon View Post
beats move.l (a0)+,(a1)+ and dbf anyway. Muhaha.
Sure thing because it'll crash since you are using a7 and the game is systemfriendly. It's hard to be a haxx0r, I know.
Old 30 January 2010, 01:08   #17
Photon
Moderator
 
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,650
He can use as few registers as he wants ofc, that's why I explained the 14/15 factor. Hope it gives some ideas and explains what can be done.
Old 30 January 2010, 03:31   #18
Leffmann
 
Join Date: Jul 2008
Location: Sweden
Posts: 2,269
Quote:
Originally Posted by StingRay View Post
Sure thing because it'll crash since you are using a7 and the game is systemfriendly. It's hard to be a haxx0r, I know.
But do you know WHY it might crash because you decide to use A7 freely? I was surprised when I found out why.

BTW what happened to this memory copy thing? Was the CopyMem routine optimized? The asm replacement for updateBackBuffer I wrote is slow and generic, and if the original C function calling CopyMem is faster then there's probably no performance gain to be found in the updateBackBuffer function.

There are two errors in the if(x == 0 & y == 0) line. First, the single & is the bitwise operator while the double && is the boolean one. It will still work in this particular case, since a comparison in C evaluates to exactly 0 or 1, so the bitwise AND of two true comparisons is again 1, which evaluates as true. Second, you probably intended to check whether the width is 320: that's when there's no gap between the end of one line and the beginning of the next, and the whole area can be thought of and copied as one contiguous block of bytes.
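
The two fixes described in this post could look roughly like the sketch below. It is not the game's actual code: memcpy stands in for CopyMem/CopyMemQuick so that it builds off-Amiga, backBuffer is a static array rather than an AllocMem allocation, and SCREEN_W, SCREEN_H and update_back_buffer are hypothetical names. Like the original function, it assumes src points at the first pixel of the rectangle inside a source buffer with a 320-byte pitch.

```c
#include <assert.h>
#include <string.h>

#define SCREEN_W 320
#define SCREEN_H 200

typedef unsigned char byte;

/* Assumption: a static array replaces the AllocMem'ed buffer. */
static byte backBuffer[SCREEN_W * SCREEN_H];

void update_back_buffer(const byte *src, int x, int y, int w, int h)
{
    byte *dst;

    assert(x >= 0 && y >= 0 && w > 0 && h > 0);
    assert(x + w <= SCREEN_W && y + h <= SCREEN_H);

    dst = backBuffer + y * SCREEN_W + x;

    if (x == 0 && w == SCREEN_W) {
        /* Full-width rows have no gap between them: one contiguous copy. */
        memcpy(dst, src, (size_t)w * (size_t)h);
    } else {
        /* Otherwise copy row by row, stepping a full line each time. */
        do {
            memcpy(dst, src, (size_t)w);
            dst += SCREEN_W;
            src += SCREEN_W;
        } while (--h);
    }
}
```

Testing the width instead of y means the fast path also covers full-width strips that do not start at the top of the screen.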
Old 05 February 2010, 09:23   #19
Wepl
Moderator
 
Join Date: Nov 2001
Location: Germany
Posts: 873
Quote:
Originally Posted by NovaCoder View Post
it's for 030+ and AGA only 320x200 8 bit display.
static byte *backBuffer;
backBuffer = (byte*)AllocMem(64000, MEMF_FAST);

void updateBackBuffer(byte *src, int x, int y, int w, int h) {
    byte *dst;

    dst = (byte*)backBuffer + y*320 + x;

    do {
        CopyMem(src, dst, w);
        dst += 320;
        src += 320;
    } while (--h);
}
Calling a system routine for each line is not optimal; it is better to do the copy yourself.
In general, the 'for' loop is the most suitable for C compilers when it comes to performance, e.g.

Code:
void updateBackBuffer(byte *src, int x, int y, int w, int h) {
	int i, j;
	byte *dst;

	dst = (byte*) backBuffer + y*320 + x;
	j = 320 - w;	/* gap from the end of one row to the start of the next */
	for (; h > 0; h--, dst += j, src += j)
		for (i = w; i > 0; i--)
			*dst++ = *src++;
}
If w is often larger than 16, I would consider adapting the routine to copy longwords instead of bytes: first copy bytes until dst is longword-aligned, then copy longwords, and finally copy the remaining bytes.
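
The three steps suggested here (bytes up to alignment, then longwords, then the tail) might be sketched as follows. This is an illustration under assumptions, not a tuned routine: copy_row_long is a hypothetical name, and the 4-byte memcpy calls are used as a portable way to express a 32-bit load and store, since src may still be unaligned once dst has been aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef unsigned char byte;

/* Hypothetical helper: copy n bytes, writing longwords once dst is aligned. */
void copy_row_long(byte *dst, const byte *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    /* 1. Copy single bytes until dst sits on a 4-byte boundary. */
    while (n > 0 && ((uintptr_t)dst & 3u) != 0) {
        *dst++ = *src++;
        n--;
    }

    /* 2. Copy longwords; only dst is guaranteed to be aligned here. */
    while (n >= 4) {
        uint32_t tmp;
        memcpy(&tmp, src, 4);
        memcpy(dst, &tmp, 4);
        dst += 4;
        src += 4;
        n -= 4;
    }

    /* 3. Copy the remaining 0..3 bytes. */
    while (n > 0) {
        *dst++ = *src++;
        n--;
    }
}
```

Whether this beats a per-row CopyMem() call on a real 68030 would have to be measured; the sketch only shows the control flow.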
Old 05 February 2010, 10:40   #20
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Quote:
Originally Posted by Wepl View Post
calling a system routine for each line is not optimal better to copy yourself.
Yes, there is a substantial amount of overhead for each call to CopyMem() but there is also a substantial amount of overhead to using C. The AmigaOS 3.9 CopyMem() is better than most C programmers can do and it can be patched to be even faster. It is also friendlier for people with more advanced 68k processors. I am the author of CopyMem on Aminet that patches the CopyMem() routines on a 68040 or 68060 for an average of about 35%-40% gain over the AmigaOS 3.9 copy routines. I haven't written a 68020/68030 patch yet but there is the likes of CMQ v2.8 that is pretty good already. The optimal 68020 copy routine for a small longword aligned source and destination (320 byte) copy is going to be either an unrolled move.l loop or possibly a movem.l loop but there is setup overhead that may not be overcome for a small copy. The unrolled loop is possible in C but probably not pretty and the movem.l loop is probably not possible without inline assembler. It's not so bad to install a tested and optimal CopyMem() patch and live with the small overhead of the library call even at 320 bytes.
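
For illustration, an unrolled longword loop of the kind mentioned above could be written in C roughly like this. It is only a sketch: copy_row_unrolled and ROW_LONGS are hypothetical names, both pointers are assumed to be longword-aligned, the row length is fixed at 320 bytes, and the unroll factor of eight is an arbitrary choice that happens to divide 80 evenly.

```c
#include <assert.h>
#include <stdint.h>

#define ROW_LONGS (320 / 4)   /* 80 longwords per 320-byte row */

/* Hypothetical helper: copy one longword-aligned 320-byte row,
   eight longwords per loop iteration. */
void copy_row_unrolled(uint32_t *dst, const uint32_t *src)
{
    int blocks = ROW_LONGS / 8;   /* 10 iterations, no remainder */

    assert(dst != 0 && src != 0);

    do {
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
    } while (--blocks);
}
```

Whether the compiler turns each line into a single move.l (a0)+,(a1)+ depends on the compiler and its settings, so the generated code is worth checking before relying on it.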

Quote:
in general for performance the 'for' loop is most suitable for c compilers, e.g.
Actually, for the 68k processors at least, the do while loop is the best for performance because it avoids a tst/cmp and branch at the top of the loop. This is both faster and smaller. At least you kept the counter counting down to 0. This is optimal as the tst/cmp is usually unnecessary as the 68k sets the condition codes (for free) when the counter is decremented.
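
The difference between the two loop shapes can be seen side by side in the sketch below. The function names are hypothetical, and whether a given compiler really emits a decrement followed by a single branch (or a dbf) for the second form depends on the compiler and its optimisation settings.

```c
#include <assert.h>

typedef unsigned char byte;

/* 'for' form: the condition is tested before the first iteration,
   so n == 0 is handled, at the cost of a test at the top of the loop. */
void copy_for(byte *dst, const byte *src, int n)
{
    int i;

    for (i = n; i > 0; i--)
        *dst++ = *src++;
}

/* 'do while' form: a single test at the bottom, on the counter that
   was just decremented. The caller must guarantee n >= 1. */
void copy_do_while(byte *dst, const byte *src, int n)
{
    assert(n >= 1);

    do {
        *dst++ = *src++;
    } while (--n);
}
```

The price of the do while form is the precondition: with n == 0 the loop body would still run once and the counter would wrap, which is why the sketch asserts n >= 1.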

Quote:
if w is often larger than 16 I would consider to adapt a routine to copy long words instead bytes, for that start copy bytes until dst is long aligned then copy longs and then copy remaining bytes
Please just use CopyMem() if the data is unaligned. It's much easier to handle unaligned copies in assembler. A good assembler CopyMem() routine will align the src and dest as well as possible with a minimum of overhead. The 68020/68030 does NOT like unaligned copies. The 68040 and 68060 are much more tolerant especially if the data is in the caches. I have seen way too many of these jump table based roll your own C CopyMem() routines that end up way slower than using CopyMem(). I just replaced one the other day in datatypes.library. Saved several hundred bytes and made the copy several times faster.