NetSurf AGA optimizing

arti · 30 September 2013, 17:36

In order to speed up scrolling a little faster this code should be optimized or even better, asm'fied:

Code:

	switch (palette->type) {
	case NSFB_PALETTE_NSFB_8BPP:
		/* Index into colour cube part */
		dr = ((( c        & 0xFF) * 5) + 128) / 256;
		dg = ((((c >>  8) & 0xFF) * 7) + 128) / 256;
		db = ((((c >> 16) & 0xFF) * 4) + 128) / 256;
		col = 40 * dr + 5 * dg + db;

		palent = palette->data[col];
		dr = ( c        & 0xFF) - ( palent        & 0xFF);
		dg = ((c >>  8) & 0xFF) - ((palent >> 8 ) & 0xFF);
		db = ((c >> 16) & 0xFF) - ((palent >> 16) & 0xFF);
		cur_distance = (dr * dr) + (dg * dg) + (db * db);

		best_col = col;
		best_distance = cur_distance;
		*r_error = dr;
		*g_error = dg;
		*b_error = db;

		/* Index into grayscale part */
		col = (( c        & 0xFF) +
		       ((c >>  8) & 0xFF) +
		       ((c >> 16) & 0xFF) + (45 / 2)) / (15 * 3) - 1 + 240;
		palent = palette->data[col];

		dr = ( c        & 0xFF) - ( palent        & 0xFF);
		dg = ((c >>  8) & 0xFF) - ((palent >>  8) & 0xFF);
		db = ((c >> 16) & 0xFF) - ((palent >> 16) & 0xFF);
		cur_distance = (dr * dr) + (dg * dg) + (db * db);
		if (cur_distance < best_distance) {
			best_distance = cur_distance;
			best_col = col;
			*r_error = dr;
			*g_error = dg;
			*b_error = db;
		}
		break;

Unfortunately, I don't have that knowledge, so, if someone here could do this, that would be a benefit for us all.

Thorham · 30 September 2013, 19:07

It would help if you could explain what this code does exactly. Also, what are the types of the variables? What does palette->data[col] contain exactly? Rewriting this in assembler isn't hard, and it's certainly possible to optimize it quite a bit, but without some extra info it's going to be a little problematic. Especially knowledge of what the code does allows better optimization.

arti · 01 October 2013, 16:45

Here is whole file:

http://pastebin.com/X9rg2WSA

It converts 24bit to 8bit colour.

And function above is used here:

http://pastebin.com/CwyUWVsp

Thorham · 01 October 2013, 20:37

Thanks, seeing this in the right context is much better

Anyway, I've made some simple optimizations to the original C code. Get it here: pallete.txt

One thing: Make sure you generate and add a multiplication table that multiplies a number by itself. This is for the deltas. The table should be 511 ints long, starting with the result for -255 * -255 and it should end with the result for 255 * 255. Then just use a pointer that points to the middle of the table (0 * 0). The array name used in the code is square, but you'll see that.

There are numerous other optimizations possible, but I'm not very experienced with C (I do know some things about optimizing), so try the code first.

In assembly language all this can be done in a better way, but we'll get to that later

NovaCoder · 02 October 2013, 00:52

Just to help a bit

Code:

/** representation of a colour.
 *
 * The colour value comprises of four components arranged in the order ABGR:
 * bits 24-31 are the alpha value and represent the opacity. 0 is
 *   transparent i.e. there would be no change in the target surface if
 *   this colour were to be used and 0xFF is opaque.
 *
 * bits 16-23 are the Blue component of the colour.
 *
 * bits 8-15 are the Green component of the colour.
 *
 * bits 0-7 are the Red component of the colour.
 */
typedef uint32_t nsfb_colour_t;

NovaCoder · 02 October 2013, 01:08

The old code was written like this, did you update it?

Code:

static uint8_t colour_to_pixel(nsfb_t *nsfb, nsfb_colour_t c)
{
        nsfb_colour_t palent;
        int col;

        int dr, dg, db; /* delta red, green blue values */

        int cur_distance;
        int best_distance = INT_MAX;
        uint8_t best_col = 0;

        for (col = 0; col < 256; col++) {
                palent = nsfb->palette[col];

                dr = (c & 0xFF) - (palent & 0xFF);
                dg = ((c >> 8) & 0xFF) - ((palent >> 8) & 0xFF);
                db = ((c >> 16) & 0xFF) - ((palent >> 16) & 0xFF);
                cur_distance = ((dr * dr) + (dg * dg) + (db *db));
                if (cur_distance < best_distance) {
                        best_distance = cur_distance;
                        best_col = col;
                }
        }

        return best_col;
}

The real problem you've got here is that colour is internally stored as a 32 bit value, normally the way to improve performance is to use a lookup table but this won't work in your case because the memory requirements would be huge.

Color conversion is always going to be slow if it's doing it for every time it needs to find out what color a pixel is

Maybe there is some fancy code that will allow you to store 16,777,216 indexes? (I think alpha can be ignored for AGA)

It looks like this call is only used in two places, is that correct?

Code:

static bool fill(nsfb_t *nsfb, nsfb_bbox_t *rect, nsfb_colour_t c)
{
        int y;
        uint8_t ent;
        uint8_t *pvideo;

        if (!nsfb_plot_clip_ctx(nsfb, rect))
                return true; /* fill lies outside current clipping region */

        pvideo = get_xy_loc(nsfb, rect->x0, rect->y0);

        ent = colour_to_pixel(nsfb, c);

        for (y = rect->y0; y < rect->y1; y++) {
                memset(pvideo, ent, rect->x1 - rect->x0);
                pvideo += nsfb->linelen;
        }

        return true;
}

Code:

static bool
glyph1(nsfb_t *nsfb,
       nsfb_bbox_t *loc,
       const uint8_t *pixel,
       int pitch,
       nsfb_colour_t c)
{
        PLOT_TYPE *pvideo;
        PLOT_TYPE fgcol;
        int xloop, yloop;
        int xoff, yoff; /* x and y offset into image */
        int x = loc->x0;
        int y = loc->y0;
        int width = loc->x1 - loc->x0;
        int height = loc->y1 - loc->y0;
        const uint8_t *fntd;
        uint8_t row;

        if (!nsfb_plot_clip_ctx(nsfb, loc))
                return true;

        if (height > (loc->y1 - loc->y0))
                height = (loc->y1 - loc->y0);

        if (width > (loc->x1 - loc->x0))
                width = (loc->x1 - loc->x0);

        xoff = loc->x0 - x;
        yoff = loc->y0 - y;

        pvideo = get_xy_loc(nsfb, loc->x0, loc->y0);

        fgcol = colour_to_pixel(nsfb, c);

        for (yloop = yoff; yloop < height; yloop++) {
                fntd = pixel + (yloop * (pitch>>3)) + (xoff>>3);
                row = (*fntd++) << (xoff & 3);
                for (xloop = xoff; xloop < width ; xloop++) {
                        if (((xloop % 8) == 0) && (xloop != 0)) {
                                row = *fntd++;
                        }

                        if ((row & 0x80) != 0) {
                                *(pvideo + xloop) = fgcol;
                        }
                        row = row << 1;

                }

                pvideo += PLOT_LINELEN(nsfb->linelen);
        }

        return true;
}

static bool
glyph8(nsfb_t *nsfb,
       nsfb_bbox_t *loc,
       const uint8_t *pixel,
       int pitch,
       nsfb_colour_t c)
{
        PLOT_TYPE *pvideo;
		nsfb_colour_t fgcol;
        nsfb_colour_t abpixel; /* alphablended pixel */
        int xloop, yloop;
        int xoff, yoff; /* x and y offset into image */
        int x = loc->x0;
        int y = loc->y0;
        int width = loc->x1 - loc->x0;
        int height = loc->y1 - loc->y0;

        if (!nsfb_plot_clip_ctx(nsfb, loc))
                return true;

        if (height > (loc->y1 - loc->y0))
                height = (loc->y1 - loc->y0);

        if (width > (loc->x1 - loc->x0))
                width = (loc->x1 - loc->x0);

        xoff = loc->x0 - x;
        yoff = loc->y0 - y;

        pvideo = get_xy_loc(nsfb, loc->x0, loc->y0);

        fgcol = c & 0xFFFFFF;

        for (yloop = 0; yloop < height; yloop++) {
                for (xloop = 0; xloop < width; xloop++) {
                        abpixel = (pixel[((yoff + yloop) * pitch) + xloop + xoff] << 24) | fgcol;
                        if ((abpixel & 0xFF000000) != 0) {
                                /* pixel is not transparent */
                                if ((abpixel & 0xFF000000) != 0xFF000000) {
                                        abpixel = nsfb_plot_ablend(abpixel,
                                                                   pixel_to_colour(nsfb, *(pvideo + xloop)));
                                }

                                *(pvideo + xloop) = colour_to_pixel(nsfb, abpixel);
                        }
                }
                pvideo += PLOT_LINELEN(nsfb->linelen);
        }

        return true;
}

Thorham · 02 October 2013, 02:59

Quote:

Originally Posted by NovaCoder

The real problem you've got here is that colour is internally stored as a 32 bit value, normally the way to improve performance is to use a lookup table but this won't work in your case because the memory requirements would be huge.

They could still use a lookup table. Just convert the 24 bit colors to 12 bit colors. Now you only need a 4096 entries long table. Should be sufficient. A few more bits, say 15, wouldn't hurt and the table still wouldn't be too large.

Quote:

Originally Posted by NovaCoder

Color conversion is always going to be slow if it's doing it for every time it needs to find out what color a pixel is

No way around that unfortunately.

Ultimately there's only so much you can do to speed this up. It would be much better to rewrite the whole loop with all calls inlined in assembly language. This goes for the dither routine as well, which can also be sped up even more by using Sierra Filter Lite instead of Floyd Steiberg (and is just as good).

NovaCoder · 02 October 2013, 03:47

Yep that's what I'm thinking now........convert the color from 32 to 16 bits and then do a lookup.

That should result in a massive speed increase

NovaCoder · 02 October 2013, 04:18

So this is how I'd write it:

Code:

/*
 * Copyright 2009 Vincent Sanders <vince@simtec.co.uk>
 * Copyright 2010 Michael Drake <tlsa@netsurf-browser.org>
 *
 * This file is part of libnsfb, http://www.netsurf-browser.org/
 * Licenced under the MIT License,
 *                http://www.opensource.org/licenses/mit-license.php
 */

#include <stdbool.h>
#include <endian.h>
#include <stdlib.h>
#include <string.h>

#include "libnsfb.h"
#include "libnsfb_plot.h"
#include "libnsfb_plot_util.h"

#include "nsfb.h"
#include "palette.h"
#include "plot.h"

static byte palette_index_lookup[65536];

static uint8_t colour_to_pixel(nsfb_t *nsfb, nsfb_colour_t c)
{
        int 16bitColor;
		int paletteIndex;
		
		if (nsfb->palette == NULL)
                return 0;

		// First convert you 32bit ABGR color to a 16bit RGB color
		16bitColor = COLOR_CONVERSION_MACRO(c);

		// See if the 16bit color is already stored in the lookup.
		if (palette_index_lookup[16bitColor]) {
			// Just return the cached value.
			paletteIndex = palette_index_lookup[16bitColor];
		} else {
			// Calcuate
			paletteIndex = nsfb_palette_best_match_dither(nsfb->palette,c);
			
			// Store in lookup so it can be used next time.
			palette_index_lookup[16bitColor] = paletteIndex;
		}
		
		return paletteIndex;
}

Obviously this code is a bit rough but you should be able to get the idea.

You'll need to come up with a MACRO to convert from ABGR to a 16 bit color (Google is your friend).

You'll also need to initialize the 'palette_index_lookup' and reset it every time a logical palette update occurs.

And you should handle black (16bit value of zero) properly.

Thorham · 02 October 2013, 16:38

Quote:

Originally Posted by NovaCoder

You'll need to come up with a MACRO to convert from ABGR to a 16 bit color (Google is your friend).

That's easy enough... in assembler

Quote:

Originally Posted by NovaCoder

You'll also need to initialize the 'palette_index_lookup' and reset it every time a logical palette update occurs.

What for? Just use a palette that never changes. You need a 'one size fits all' palette anyway. Just generate a lookup table for the palette once (or store on disk), and be done with it. There's also no need to cache any values.

In assembler it would look something like this (68020/030, 060 needs to be optimized differently):

Code:

	lea	c+1,a0 		; ptr to c, skip alpha part
	clr	d0 		; clear table index
	move.b	(a0)+,d0	; get red
	lsl.l	#5,d0		; make room for next byte (shift)
	move.b	(a0)+,d		; get green
	lsl.l	#5,d0		; make room for next byte (shift)
	move.b	(a0)+,d0	; get blue
	lsr.l	#3,d0		; final 15 bit value
	move.l	(a1,d0.w),d0	; get palette color number

It would be faster if this is done directly in the loop that reads the argb pixels, where you should read bytes, not the whole argb pixel.

There's always a ton of stuff you can happily do in C, but tight loops like this should be optimized in assembler... after the C code works properly, of course

NovaCoder · 03 October 2013, 02:53

Quote:

Originally Posted by Thorham

What for? Just use a palette that never changes. You need a 'one size fits all' palette anyway. Just generate a lookup table for the palette once (or store on disk), and be done with it. There's also no need to cache any values.

Yep you could do that, it would obviously give the best speed but it would probably look pretty nasty. For my ScummVM ports I used a paint program to build an optimised 256 color palette based on a screen grab I took of the 16bit GUI, I then simply load up that palette file at program start up (this only works because the GUI in ScummVM is based on a unified color scheme).

I'm not sure how often NetSurf is updating the palette but if it's coding properly it will be updating it only once per page, if that is the case then it would look a lot better (and still be reasonably fast) to use my palette caching code listed above.

An important thing to check would be to ensure that it's not updating the palette each time the user scrolls the web page (the palette should be based on the entire page including the off screen regions).

PeterK · 03 October 2013, 07:37

You can find my color reduction from ARGB to the OS 3.5 palette mapped format with up to 256 colors here:

http://eab.abime.net/showthread.php?p=914435#post914435

The OS 3.5 format can be adapted to every AGA screen mode with the normal color mapping code which is also part of my library (including the optional and much faster color mapping based on a 512 byte cache). Just search for IconBeFast in my source code and ask me if I should explain how it works.

Thorham · 03 October 2013, 08:52

To NovaCoder:

Yep, it won't be pretty

Anyway, if you're going to generate a palette for the whole page, you might as well render it in HAM8. Looks better, and I wouldn't be surprised if it was faster, too (especially when you start adding error diffusion into the mix). Meynaf's HAM8 rendering routine (not the one I showed you) runs at about 120 cycles per pixel on a 68030. Seems hard to beat.

Also, you'd have to render the whole page with all the images in one go, otherwise you can't calculate the palette. It's better if images are loaded dynamically so that the user can at least read and scroll through the text on the page. It seems to me that a fixed palette or HAM is best.

Leffmann · 03 October 2013, 15:44

Quote:

Originally Posted by arti

In order to speed up scrolling a little faster this code should be optimized or even better, asm'fied:

Do you mean it will speed up loading of the page, or does the browser dither the images and render the whole page over and over again as you scroll around on it? If pages load quickly, but scrolling is slow, then the bottleneck is probably not in the code that renders the page.

EDIT: try this to start with. I've changed the calculation of col on line 92 in pallete.h as it would sometimes produce an index of 256, and I've reduced the size of the r g b_error variables to 16-bit, so you need to change this in the dithering function as well. The code also assumes entries 239 through 255 in the palette are perfect grays:

Code:

arti · 04 October 2013, 12:26

So, I simply replace code from line 92 to 106 with this code , yes?

Where can I find tables.i include?

Leffmann · 04 October 2013, 16:58

Unfortunately it's not that simple, you can't just plop some assembly right into the C source code without knowing more about the compiler.

I've added all files to the attached archive, and for now you can try linking in the object file, change the rgb_error variables to 16-bit sizes, and declare and call it like this:

Code:

int func_NSFB_PALETTE_NSFB_8BPP(int, void*, void*, void*, void*);

switch (palette->type) {
  case NSFB_PALETTE_NSFB_8BPP:
    best_col = func_NSFB_PALETTE_NSFB_8BPP(c, palette->data, &r_error, &g_error, &b_error);
    break;

    ...
}

If it doesn't work then you have to look in the manual and find out about the compiler's calling conventions. The best thing would probably be to replace all of it with an optimized function that takes a 24-bit image and a 256-color palette, and outputs a dithered 256-color image.

arti · 04 October 2013, 18:37

It doesn't work.

Quote:

The best thing would probably be to replace all of it with an optimized function that takes a 24-bit image and a 256-color palette, and outputs a dithered 256-color image.

I agree, PeterK has this function in asm used in his icon.library.

I'll ask him how to implement it here:

Code:

static uint8_t colour_to_pixel(nsfb_t *nsfb, nsfb_colour_t c)
{
        if (nsfb->palette == NULL)
                return 0;

			
		return nsfb_palette_best_match_dither(nsfb->palette,c);
}

PeterK · 05 October 2013, 00:11

Attached is the nature image (reduced to 256x160) as a direct drawing snapshot of my nature.info icon file. The size limitation exists only for the icons, of course. My color reduction routine could also handle larger sizes, too.
http://eab.abime.net/showpost.php?p=...&postcount=730
But I'm sorry, I've no intentions to port my assembler code to C.

bubbob42 · 05 October 2013, 10:11

Quote:

Originally Posted by PeterK

But I'm sorry, I've no intentions to port my assembler code to C.

Maybe you could just create some linkable object modules instead? I think there is no need to port stuff.

PeterK · 05 October 2013, 23:46

I came to the conclusion that NetSurf needs a totally different solution than my color recuction code since it creates a seperate palette for every image which is not really suitable for a browser.

Just for short now what I'm planning to do:

The netsurf browser needs a fixed 256 color palette in order to decode all images as fast as possible. I want to use the 4 system colors and then add 252 fixed colors to the palette (6x7x6 possible values for the RGB components). The first 3 system colors (transparent, black and white) should be used too for the color assignments.

The decoding of the color components will be done without any table or comparison by simply substracting 42 or 36 again and again from the component values. This will only require about 10 substractions for an average RGB value, no DIV, no MUL or anything CPU time consuming. And it can probably done in C.

But I've no idea yet how good the image quality could be without using any dithering (at least not in the first version based on this direct color to palette assignment method).

Update1: The first problem I noticed now are the mouse pointer colors (pen 16-18) which could detroy my concept for the fixed palette.

Update2: Instead of 252 colors I will try to use 216 colors now (RGB = 6x6x6) from pen 32 up to pen 247. This leave some free pens for MWB and the color mapping and it also allows correct gray shades. The color components could have the values 15, 60, 105, 150, 195, 240 for a first test.

30 September 2013, 19:07	#2
Thorham Computer Nerd Join Date: Sep 2007 Location: Rotterdam/Netherlands Age: 47 Posts: 3,796	It would help if you could explain what this code does exactly. Also, what are the types of the variables? What does palette->data[col] contain exactly? Rewriting this in assembler isn't hard, and it's certainly possible to optimize it quite a bit, but without some extra info it's going to be a little problematic. Especially knowledge of what the code does allows better optimization. Last edited by Thorham; 30 September 2013 at 19:20.

02 October 2013, 03:47	#8
NovaCoder Registered User Join Date: Sep 2007 Location: Melbourne/Australia Posts: 4,405	Yep that's what I'm thinking now........convert the color from 32 to 16 bits and then do a lookup. That should result in a massive speed increase Last edited by NovaCoder; 02 October 2013 at 04:00.

05 October 2013, 23:46	#20
PeterK Registered User Join Date: Apr 2005 Location: digital hell, Germany, after 1984, but worse Posts: 3,378	I came to the conclusion that NetSurf needs a totally different solution than my color recuction code since it creates a seperate palette for every image which is not really suitable for a browser. Just for short now what I'm planning to do: The netsurf browser needs a fixed 256 color palette in order to decode all images as fast as possible. I want to use the 4 system colors and then add 252 fixed colors to the palette (6x7x6 possible values for the RGB components). The first 3 system colors (transparent, black and white) should be used too for the color assignments. The decoding of the color components will be done without any table or comparison by simply substracting 42 or 36 again and again from the component values. This will only require about 10 substractions for an average RGB value, no DIV, no MUL or anything CPU time consuming. And it can probably done in C. But I've no idea yet how good the image quality could be without using any dithering (at least not in the first version based on this direct color to palette assignment method). Update1: The first problem I noticed now are the mouse pointer colors (pen 16-18) which could detroy my concept for the fixed palette. Update2: Instead of 252 colors I will try to use 216 colors now (RGB = 6x6x6) from pen 32 up to pen 247. This leave some free pens for MWB and the color mapping and it also allows correct gray shades. The color components could have the values 15, 60, 105, 150, 195, 240 for a first test. Last edited by PeterK; 07 October 2013 at 11:07.

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
NetSurf for AGA	arti	News	92	14 March 2016 21:44
Optimizing question: instruction order	TheDarkCoder	Coders. Asm / Hardware	9	29 October 2011 17:07
Layered tile engine optimizing.	Thorham	Coders. General	0	30 September 2011 20:43
Benching and optimizing CF-IDE speed	Photon	support.Hardware	12	15 July 2009 01:48
For people who like optimizing 680x0 code.	Thorham	Coders. General	5	28 May 2008 11:48

01 October 2013, 16:45	#3
arti Registered User Join Date: Jul 2008 Location: Poland Posts: 663	Here is whole file: http://pastebin.com/X9rg2WSA It converts 24bit to 8bit colour. And function above is used here: http://pastebin.com/CwyUWVsp Last edited by arti; 01 October 2013 at 17:01.

01 October 2013, 20:37	#4
Thorham Computer Nerd Join Date: Sep 2007 Location: Rotterdam/Netherlands Age: 47 Posts: 3,796	Thanks, seeing this in the right context is much better Anyway, I've made some simple optimizations to the original C code. Get it here: pallete.txt One thing: Make sure you generate and add a multiplication table that multiplies a number by itself. This is for the deltas. The table should be 511 ints long, starting with the result for *-255 -255 and it should end with the result for 255 * 255. Then just use a pointer that points to the middle of the table (0 * 0). The array name used in the code is square**, but you'll see that. There are numerous other optimizations possible, but I'm not very experienced with C (I do know some things about optimizing), so try the code first. In assembly language all this can be done in a better way, but we'll get to that later

03 October 2013, 07:37	#12
PeterK Registered User Join Date: Apr 2005 Location: digital hell, Germany, after 1984, but worse Posts: 3,378	You can find my color reduction from ARGB to the OS 3.5 palette mapped format with up to 256 colors here: http://eab.abime.net/showthread.php?p=914435#post914435 The OS 3.5 format can be adapted to every AGA screen mode with the normal color mapping code which is also part of my library (including the optional and much faster color mapping based on a 512 byte cache). Just search for IconBeFast in my source code and ask me if I should explain how it works.

03 October 2013, 08:52	#13
Thorham Computer Nerd Join Date: Sep 2007 Location: Rotterdam/Netherlands Age: 47 Posts: 3,796	To NovaCoder: Yep, it won't be pretty Anyway, if you're going to generate a palette for the whole page, you might as well render it in HAM8. Looks better, and I wouldn't be surprised if it was faster, too (especially when you start adding error diffusion into the mix). Meynaf's HAM8 rendering routine (not the one I showed you) runs at about 120 cycles per pixel on a 68030. Seems hard to beat. Also, you'd have to render the whole page with all the images in one go, otherwise you can't calculate the palette. It's better if images are loaded dynamically so that the user can at least read and scroll through the text on the page. It seems to me that a fixed palette or HAM is best.

04 October 2013, 12:26	#15
arti Registered User Join Date: Jul 2008 Location: Poland Posts: 663	So, I simply replace code from line 92 to 106 with this code , yes? Where can I find tables.i include?

Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)