English Amiga Board


Go Back   English Amiga Board > Coders > Coders. Asm / Hardware

 
 
Thread Tools
Old 25 May 2016, 20:00   #1
arti
Registered User

 
Join Date: Jul 2008
Location: Poland
Posts: 590
Lightbulb ASM optimising NetSurf

In this thread I would like to propose to interested asm coders, NetSurf code parts, that could be asm optimised for speed.

First proposition:

Error diffusion dithering NetSurf AGA is using:

Code:
/** Find best palette match for given colour. */
static inline uint8_t nsfb_palette_best_match(struct nsfb_palette_s *palette,
        nsfb_colour_t c, int *r_error, int *g_error, int *b_error)
{
    uint8_t best_col = 0;

    nsfb_colour_t palent;
    int col;
    int dr, dg, db; /* delta red, green blue values */

    int cur_distance;
    int best_distance = INT_MAX;

    int r = ( c        & 0xFF);
    int g = ((c >>  8) & 0xFF);
    int b = ((c >> 16) & 0xFF);

    switch (palette->type) {
    case NSFB_PALETTE_NSFB_8BPP:
        /* Index into colour cube part */
        dr = ((r * 5) + 128) / 256;
        dg = ((g * 7) + 128) / 256;
        db = ((b * 4) + 128) / 256;
        col = 40 * dr + 5 * dg + db;

        palent = palette->data[col];
        dr = r - ( palent        & 0xFF);
        dg = g - ((palent >>  8) & 0xFF);
        db = b - ((palent >> 16) & 0xFF);
        cur_distance = (dr * dr) + (dg * dg) + (db * db);

        best_col = col;
        best_distance = cur_distance;
        *r_error = dr;
        *g_error = dg;
        *b_error = db;

        /* Index into grayscale part */
        col = (r + g + b + (45 / 2)) / (15 * 3) - 1 + 240;
        palent = palette->data[col];

        dr = r - (palent & 0xFF);
        dg = g - (palent & 0xFF);
        db = b - (palent & 0xFF);
        cur_distance = (dr * dr) + (dg * dg) + (db * db);
        if (cur_distance < best_distance) {
            best_col = col;
            *r_error = dr;
            *g_error = dg;
            *b_error = db;
        }
        break;

    case NSFB_PALETTE_OTHER:
        /* Try all colours in palette */
        for (col = 0; col <= palette->last; col++) {
            palent = palette->data[col];

            dr = r - ( palent        & 0xFF);
            dg = g - ((palent >>  8) & 0xFF);
            db = b - ((palent >> 16) & 0xFF);
            cur_distance = (dr * dr) + (dg * dg) + (db * db);
            if (cur_distance < best_distance) {
                best_distance = cur_distance;
                best_col = col;
                *r_error = dr;
                *g_error = dg;
                *b_error = db;
            }
        }
        break;

    default:
        break;
    }

        return best_col;
}


/** Find best palette match for given colour, with error diffusion. */
static inline uint8_t nsfb_palette_best_match_dither(
        struct nsfb_palette_s *palette, nsfb_colour_t c)
{
    int r, g, b;
    int current;
    int error;
    int width = palette->dither_ctx.width;
    uint8_t best_col_index;

    if (palette == NULL)
        return 0;

    if (palette->dither == false)
        return nsfb_palette_best_match(palette, c, &r, &g, &b);

    current = palette->dither_ctx.current;

    /* Get RGB components of colour, and apply error */
    r = ( c        & 0xFF) + palette->dither_ctx.data[current    ];
    g = ((c >>  8) & 0xFF) + palette->dither_ctx.data[current + 1];
    b = ((c >> 16) & 0xFF) + palette->dither_ctx.data[current + 2];

    /* Clamp new RGB components to range */
    if (r <   0) r =   0;
    if (r > 255) r = 255;
    if (g <   0) g =   0;
    if (g > 255) g = 255;
    if (b <   0) b =   0;
    if (b > 255) b = 255;

    /* Reset error diffusion slots to 0 */
    palette->dither_ctx.data[current    ] = 0;
    palette->dither_ctx.data[current + 1] = 0;
    palette->dither_ctx.data[current + 2] = 0;

    /* Rebuild colour from modified components */
    c = r + (g << 8) + (b << 16);

    /* Get best match for pixel, and find errors for each component */
    best_col_index = nsfb_palette_best_match(palette, c, &r, &g, &b);

    /* Advance one set of error diffusion slots */
    current += 3;
    if (current >= width)
        current = 0;
    palette->dither_ctx.current = current;

    /* Save errors
     *
     *       [*]-[N]
     *      / | \
     *   [l]-[m]-[r]
     */
    error = current;

    /* Error for [N] (next) */
    if (error != 0) {
        /* The pixel exists */
        palette->dither_ctx.data[error    ] += r * 7 / 16;
        palette->dither_ctx.data[error + 1] += g * 7 / 16;
        palette->dither_ctx.data[error + 2] += b * 7 / 16;
    }

    error += width - 2 * 3;
    if (error >= width)
        error -= width;
    /* Error for [l] (below, left) */
    if (error >= 0 && error != 3) {
        /* The pixel exists */
        palette->dither_ctx.data[error    ] += r * 3 / 16;
        palette->dither_ctx.data[error + 1] += g * 3 / 16;
        palette->dither_ctx.data[error + 2] += b * 3 / 16;
    }

    error += 3;
    if (error >= width)
        error -= width;
    /* Error for [m] (below, middle) */
    palette->dither_ctx.data[error    ] += r * 5 / 16;
    palette->dither_ctx.data[error + 1] += g * 5 / 16;
    palette->dither_ctx.data[error + 2] += b * 5 / 16;

    error += 3;
    if (error >= width)
        error -= width;
    /* Error for [r] (below, right) */
    if (error != 0) {
        /* The pixel exists */
        palette->dither_ctx.data[error    ] += r / 16;
        palette->dither_ctx.data[error + 1] += g / 16;
        palette->dither_ctx.data[error + 2] += b / 16;
    }

    return best_col_index;
}
I hope you enjoy coding and NetSurf users enjoy faster browsing
arti is offline  
Old 25 May 2016, 20:52   #2
meynaf
son of 68k
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 48
Posts: 4,323
Getting this faster first requires using the proper method. Trying all colors in the palette is very, very slow.
You can simply reduce the color precision to, say, 4 or 5 bits per component and use a table that will directly return the index of the nearest color.
meynaf is offline  
Old 25 May 2016, 21:44   #3
Thorham
Computer Nerd

Thorham's Avatar
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 45
Posts: 3,235
Another one is that Sierra filter Lite is faster than Floyd Steinberg dithering for the same quality. And then there's the fact that optimized error diffusion is probably going to be *slower* than good HAM8 rendering with a fixed palette, and that will certainly almost always produce better looking results.

Optionally, if you really want speed, then the best way is Bayer 4x4 ordered dithering with a 216 color palette. Is lightning fast, leaves 40 colors free, and looks reasonable.

So consider that error diffusion is slower than HAM8 rendering, and that Bayer 4x4 dithering is much faster than either method.
Thorham is offline  
Old 26 May 2016, 00:46   #4
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Quote:
Originally Posted by arti View Post
In this thread I would like to propose to interested asm coders, NetSurf code parts, that could be asm optimised for speed.

First proposition:

Error diffusion dithering NetSurf AGA is using:
Did you use a profiler to identify this code?

This code is not bad for C but there is room for improvement. All the divides are powers of 2 which is very smart but most of the divides and right shifts are signed which is bad. The following is the code for signed division by 256 with GCC 3.x -O2.

Code:
tst.l d0 ; or an operation which sets the CC
bpl.b *+8
add.l #255,d0
asr.l #8,d0
and unsigned division by 256.

Code:
lsr.l #8,d0
The unsigned division by 256 code is a big improvement and is usually faster on most processors. The signed variables could be cast as unsigned before a division or shift but look at the resulting code to make sure the compiler is not generating extra code for a conversion. All the casting could make the code less readable as well. It would be best to convert the code to use unsigned int where possible.

With GCC on the 68k, it was often beneficial to use intermediate variables and do shifting by 8.

Code:
    unsigned int temp_c = (unsigned int)c;
    int r = (temp_c & 0xFF);
    temp_c >>= 8;
    int g = (temp_c & 0xFF);
    temp_c >>= 8;
    int b = (temp_c & 0xFF);
This is better because the shifts are <=8 for the 68k and unsigned which is faster for the 68040 and older 68k processors. Optimizing for a particular processor should not be necessary (and is frowned upon) but commonly generates better code. IMO, it is ok if it is equally fast on modern processors which this should be.
matthey is offline  
Old 26 May 2016, 03:31   #5
NovaCoder
Registered User
NovaCoder's Avatar
 
Join Date: Sep 2007
Location: Melbourne/Australia
Posts: 4,089
Hi Arti,

The problem with optimizing large C/C++ GCC projects for the Amiga is the lack of a profiler. Without a profiler you are just guessing which area's of the code is slowing things down, you don't actually know where the real bottlenecks are

At least with a FPS game you normally have a frames per second counter so you can get some feedback about your ASM tweaks but you still don't know if you have have applied those ASM tweaks to the slowest C code.

Remember that it's better to optimize a slow bit of code that gets called many times than to optimize a really slow bit of code that rarely gets called

It's also dangerous to assume that C code converted to ASM will always be quicker than pure C code, normally it will be but you can never be 100% sure.

When I was optimizing the C code for AGA Quake I was just relying on ‘gut instinct’ and guesswork, with a decent GCC profiler then I could have got a few more FPS dammit!

Last edited by NovaCoder; 26 May 2016 at 06:29.
NovaCoder is offline  
Old 26 May 2016, 03:58   #6
Thorham
Computer Nerd

Thorham's Avatar
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 45
Posts: 3,235
A profiler doesn't seem necessary here. The whole way in which this works seems painfully slow. Basically this stuff gets called for each pixel in the bitmap you're error diffusing. Definitely slows things down.
Thorham is offline  
Old 26 May 2016, 19:59   #7
arti
Registered User

 
Join Date: Jul 2008
Location: Poland
Posts: 590
Quote:
Originally Posted by matthey View Post

With GCC on the 68k, it was often beneficial to use intermediate variables and do shifting by 8.

Code:
    unsigned int temp_c = (unsigned int)c;
    int r = (temp_c & 0xFF);
    temp_c >>= 8;
    int g = (temp_c & 0xFF);
    temp_c >>= 8;
    int b = (temp_c & 0xFF);
Thanks , this works
arti is offline  
Old 27 May 2016, 02:45   #8
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Quote:
Originally Posted by Thorham View Post
A profiler doesn't seem necessary here. The whole way in which this works seems painfully slow. Basically this stuff gets called for each pixel in the bitmap you're error diffusing. Definitely slows things down.
Good algorithms are nice but sometimes it is necessary to deal with the provided interfaces or else change a lot of code. It is better to use a profiler to find which crappy code needs to be dealt with first.

Quote:
Originally Posted by arti View Post
Thanks , this works
I tried some more example code by rewriting palette.h which I have attached as palette.c (EAB does not recognize .h attachments so rename back to palette.h). The programmers need to turn off GCC slop mode and compile in C99 mode. For example, bool is used without including stdbool.h and NULL is not defined without string.h. This code is no longer portable following the C99 standard but rather relies on C99 slop mode in newer versions of GCC. The code will not work with 16 bit ints either. Using int everywhere is rarely the most efficient datatype with C99. The programmer does not have a very good understanding of processors and expects the compiler to do the "optimizing". Perhaps not a good algorithm too? Anyway, my palette.h is an example of how to optimize C code. It should be faster on every processor but more so on the 68k, if it works .

Last edited by matthey; 27 May 2016 at 20:10.
matthey is offline  
Old 27 May 2016, 03:04   #9
Thorham
Computer Nerd

Thorham's Avatar
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 45
Posts: 3,235
Quote:
Originally Posted by matthey View Post
It is better to use a profiler to find which crappy code needs to be dealt with first.
Doesn't matter, because you have to deal with ALL crappy code
Thorham is offline  
Old 27 May 2016, 03:06   #10
wawa
Registered User
 
Join Date: Aug 2007
Location: berlin/germany
Posts: 1,054
Quote:
Originally Posted by matthey View Post
bool is used without including stdbool.h and NULL is not defined without string.h
mm. i hate to introduce well informed conversation, but have you yourself profiled this? id expect content of headers like that is being handled by preprocessor, where it could be easier to improve for particular platform, by a choice of options or macros.

the remaining question is, how to find an appropriate profiler for 68k. bernd has mentioned he had a commecial peephole optimizer, he hasnt ever revealed towards me, what it was. aros has a set of debugging tools, that might be useful even for 68k code, however i wasnt able to get grip of this even what concerns a proper 68k gdb usage.
wawa is offline  
Old 27 May 2016, 03:09   #11
wawa
Registered User
 
Join Date: Aug 2007
Location: berlin/germany
Posts: 1,054
Quote:
Originally Posted by Thorham View Post
Doesn't matter, because you have to deal with ALL crappy code
cmon. your a coder. im not. but do you really expect to have every single c line of all components and apps rewritten and optimized by hand in machine code?
wawa is offline  
Old 27 May 2016, 03:27   #12
Thorham
Computer Nerd

Thorham's Avatar
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 45
Posts: 3,235
Quote:
Originally Posted by wawa View Post
cmon. your a coder. im not. but do you really expect to have every single c line of all components and apps rewritten and optimized by hand in machine code?
I wasn't being entirely serious, but no, of course not. It would be quite pointless. The important parts are the tight loops (the heavy weight code, so to speak). Before that, it makes sense to see if things are done in the right way, because if they're not, there's no point in optimizing that code.
Thorham is offline  
Old 27 May 2016, 03:54   #13
NovaCoder
Registered User
NovaCoder's Avatar
 
Join Date: Sep 2007
Location: Melbourne/Australia
Posts: 4,089
Quote:
Originally Posted by matthey View Post
Good algorithms are nice but sometimes it is necessary to deal with the provided interfaces or else change a lot of code. It is better to use a profiler to find which crappy code needs to be dealt with first.
That was exactly my point


Confucius say 'the fastest code is the code that you don't execute'


This is why I spend all day at work deleting code and then seeing if the game will still run, if I break something then I just put that bit of code back in again....easy!

Last edited by NovaCoder; 27 May 2016 at 04:03.
NovaCoder is offline  
Old 27 May 2016, 07:50   #14
meynaf
son of 68k
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 48
Posts: 4,323
A profiler isn't needed to see that executing a loop that is up to 256 iterations per pixel, will make the program very slow...

Anyway even in C you should be able to use da ol' background color trick. Syntax in C is horrible but in ASM that gives :
Code:
; do that at the start of your routine
 move.w #$f00,$dff180

; do that at the end of your routine
 move.w #$000,$dff180
Sure, it's not very precise.
But the more red you see, you know the worse it is.
Of course you can do several routines or code parts simultaneously with different colors.
Don't forget some kind of #ifdef or your users will wonder what's happening


Quote:
Originally Posted by NovaCoder View Post
This is why I spend all day at work deleting code and then seeing if the game will still run, if I break something then I just put that bit of code back in again....easy!
I hope you're not serious. This is the best way to make programs unstable.
There are usually many particular cases in programs, that occur rarely. Here you're just destroying them.
meynaf is offline  
Old 27 May 2016, 10:18   #15
arti
Registered User

 
Join Date: Jul 2008
Location: Poland
Posts: 590
@matthey
Cool but something is not right or my headers?

Should be http://amigaworld.net/modules/myalbum/photos/459.jpg
Attached Thumbnails
Click image for larger version

Name:	nsopt.jpg
Views:	149
Size:	423.1 KB
ID:	48617  

Last edited by arti; 27 May 2016 at 10:24.
arti is offline  
Old 27 May 2016, 10:31   #16
arti
Registered User

 
Join Date: Jul 2008
Location: Poland
Posts: 590
Bug is in nsfb_palette_best_match_dither function.
arti is offline  
Old 27 May 2016, 10:41   #17
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Quote:
Originally Posted by wawa View Post
mm. i hate to introduce well informed conversation, but have you yourself profiled this?
No. I attempted to improve the C code as an example. If I have to also show how to profile then we might be in trouble. Of course, we might not notice much difference if this routine was not CPU intensive (my guess is that it would save 20-100 bytes in the executable size though). I would expect maybe a 10%-20% improvement on the 68k for C optimizing like this. Assembler might be able to gain that much again without changing the algorithm. There were some more aggressive optimizations which I didn't try because they would be more error prone and I can't test.

Quote:
Originally Posted by wawa View Post
id expect content of headers like that is being handled by preprocessor, where it could be easier to improve for particular platform, by a choice of options or macros.
GCC "improved it" by not requiring the preprocessor instructions at all thus creating their own C standard. It should be called GCC Slop C.

Quote:
Originally Posted by wawa View Post
the remaining question is, how to find an appropriate profiler for 68k.
For an alien compiler to the Amiga like GCC, it may be better to stick with GCC tools. I like AProf for more Amiga friendly compilers but it is old and quirky. Maybe I will try and recompile the source again some day.

Quote:
Originally Posted by wawa View Post
bernd has mentioned he had a commercial peephole optimizer, he hasnt ever revealed towards me, what it was.
Vasm has the best 68k peephole optimizer in the world. It is not quite a drop in for GCC's GAS and vbcc doesn't compile Slop C.

Quote:
Originally Posted by wawa View Post
aros has a set of debugging tools, that might be useful even for 68k code, however i wasnt able to get grip of this even what concerns a proper 68k gdb usage.
I prefer to stick with BDebug, AProf, etc. rather than the GCC tools.
matthey is offline  
Old 27 May 2016, 10:47   #18
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Quote:
Originally Posted by arti View Post
Bug is in nsfb_palette_best_match_dither function.
Alright. That is the more complex of the 2 functions. Try undoing my changes until you can isolate the problem. I'll look over the code some more and see if I can spot the problem.

Edit: I'm not seeing a problem but it is getting late and I'm tired.

Last edited by matthey; 27 May 2016 at 11:31.
matthey is offline  
Old 27 May 2016, 11:36   #19
arti
Registered User

 
Join Date: Jul 2008
Location: Poland
Posts: 590
Bug is under

Code:
	/* Save errors
	 *
	 *[*]-[N]
	 *      / | \
	 *   [l]-[m]-[r]
	 */
by adding (uint_fast32_t)

palette->dither_ctx.data[error ] += (uint_fast32_t)r * 7 / 16;

Edit:
changed to int_fast32_t and it works,
I think scrolling is faster now
Thanks a lot!
arti is offline  
Old 27 May 2016, 16:13   #20
bubbob42
Registered User
 
Join Date: Oct 2012
Location: Germany
Posts: 521
SAS/C features two profilers, but they depend on the compiler's startup code, unfortunately.
bubbob42 is offline  
 


Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)
 
Thread Tools

Similar Threads
Thread Thread Starter Forum Replies Last Post
Tool to convert asm to gnu asm (gas) Asman Coders. Asm / Hardware 13 30 December 2020 11:57
Optimising CD32 Boot Times Amigajay project.CD32 Conversion 11 05 March 2016 13:30
Can we start an "Optimising Workbench/Your Amiga (A Guide For Novices)" thread? vroom6sri support.Other 4 16 January 2014 10:06
Optimising ILBM decode pmc Coders. Asm / Hardware 21 12 October 2011 20:24
help optimising a section of code h0ffman Coders. General 15 02 March 2011 13:19

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

Forum Jump


All times are GMT +2. The time now is 16:29.


Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, vBulletin Solutions Inc.
Page generated in 0.12986 seconds with 16 queries