English Amiga Board


Go Back   English Amiga Board > Coders > Coders. Asm / Hardware

 
 
Thread Tools
Old 30 September 2013, 17:36   #1
arti
Registered User
 
Join Date: Jul 2008
Location: Poland
Posts: 663
NetSurf AGA optimizing

In order to speed up scrolling a little faster this code should be optimized or even better, asm'fied:

Code:
	switch (palette->type) {
	case NSFB_PALETTE_NSFB_8BPP:
		/* Index into colour cube part */
		dr = ((( c        & 0xFF) * 5) + 128) / 256;
		dg = ((((c >>  8) & 0xFF) * 7) + 128) / 256;
		db = ((((c >> 16) & 0xFF) * 4) + 128) / 256;
		col = 40 * dr + 5 * dg + db;

		palent = palette->data[col];
		dr = ( c        & 0xFF) - ( palent        & 0xFF);
		dg = ((c >>  8) & 0xFF) - ((palent >> 8 ) & 0xFF);
		db = ((c >> 16) & 0xFF) - ((palent >> 16) & 0xFF);
		cur_distance = (dr * dr) + (dg * dg) + (db * db);

		best_col = col;
		best_distance = cur_distance;
		*r_error = dr;
		*g_error = dg;
		*b_error = db;

		/* Index into grayscale part */
		col = (( c        & 0xFF) +
		       ((c >>  8) & 0xFF) +
		       ((c >> 16) & 0xFF) + (45 / 2)) / (15 * 3) - 1 + 240;
		palent = palette->data[col];

		dr = ( c        & 0xFF) - ( palent        & 0xFF);
		dg = ((c >>  8) & 0xFF) - ((palent >>  8) & 0xFF);
		db = ((c >> 16) & 0xFF) - ((palent >> 16) & 0xFF);
		cur_distance = (dr * dr) + (dg * dg) + (db * db);
		if (cur_distance < best_distance) {
			best_distance = cur_distance;
			best_col = col;
			*r_error = dr;
			*g_error = dg;
			*b_error = db;
		}
		break;
Unfortunately, I don't have that knowledge, so, if someone here could do this, that would be a benefit for us all.
arti is offline  
Old 30 September 2013, 19:07   #2
Thorham
Computer Nerd
 
Thorham's Avatar
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,796
It would help if you could explain what this code does exactly. Also, what are the types of the variables? What does palette->data[col] contain exactly? Rewriting this in assembler isn't hard, and it's certainly possible to optimize it quite a bit, but without some extra info it's going to be a little problematic. Especially knowledge of what the code does allows better optimization.

Last edited by Thorham; 30 September 2013 at 19:20.
Thorham is offline  
Old 01 October 2013, 16:45   #3
arti
Registered User
 
Join Date: Jul 2008
Location: Poland
Posts: 663
Here is whole file:

http://pastebin.com/X9rg2WSA

It converts 24bit to 8bit colour.

And function above is used here:

http://pastebin.com/CwyUWVsp

Last edited by arti; 01 October 2013 at 17:01.
arti is offline  
Old 01 October 2013, 20:37   #4
Thorham
Computer Nerd
 
Thorham's Avatar
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,796
Thanks, seeing this in the right context is much better

Anyway, I've made some simple optimizations to the original C code. Get it here: pallete.txt

One thing: Make sure you generate and add a multiplication table that multiplies a number by itself. This is for the deltas. The table should be 511 ints long, starting with the result for -255 * -255 and it should end with the result for 255 * 255. Then just use a pointer that points to the middle of the table (0 * 0). The array name used in the code is square, but you'll see that.

There are numerous other optimizations possible, but I'm not very experienced with C (I do know some things about optimizing), so try the code first.

In assembly language all this can be done in a better way, but we'll get to that later
Thorham is offline  
Old 02 October 2013, 00:52   #5
NovaCoder
Registered User
 
NovaCoder's Avatar
 
Join Date: Sep 2007
Location: Melbourne/Australia
Posts: 4,405
Just to help a bit

Code:
/** representation of a colour.
 *
 * The colour value comprises of four components arranged in the order ABGR:
 * bits 24-31 are the alpha value and represent the opacity. 0 is
 *   transparent i.e. there would be no change in the target surface if
 *   this colour were to be used and 0xFF is opaque.
 *
 * bits 16-23 are the Blue component of the colour.
 *
 * bits 8-15 are the Green component of the colour.
 *
 * bits 0-7 are the Red component of the colour.
 */
typedef uint32_t nsfb_colour_t;
NovaCoder is offline  
Old 02 October 2013, 01:08   #6
NovaCoder
Registered User
 
NovaCoder's Avatar
 
Join Date: Sep 2007
Location: Melbourne/Australia
Posts: 4,405
The old code was written like this, did you update it?

Code:
static uint8_t colour_to_pixel(nsfb_t *nsfb, nsfb_colour_t c)
{
        nsfb_colour_t palent;
        int col;

        int dr, dg, db; /* delta red, green blue values */

        int cur_distance;
        int best_distance = INT_MAX;
        uint8_t best_col = 0;

        for (col = 0; col < 256; col++) {
                palent = nsfb->palette[col];

                dr = (c & 0xFF) - (palent & 0xFF);
                dg = ((c >> 8) & 0xFF) - ((palent >> 8) & 0xFF);
                db = ((c >> 16) & 0xFF) - ((palent >> 16) & 0xFF);
                cur_distance = ((dr * dr) + (dg * dg) + (db *db));
                if (cur_distance < best_distance) {
                        best_distance = cur_distance;
                        best_col = col;
                }
        }

        return best_col;
}
The real problem you've got here is that colour is internally stored as a 32 bit value, normally the way to improve performance is to use a lookup table but this won't work in your case because the memory requirements would be huge.

Color conversion is always going to be slow if it's doing it for every time it needs to find out what color a pixel is

Maybe there is some fancy code that will allow you to store 16,777,216 indexes? (I think alpha can be ignored for AGA)


It looks like this call is only used in two places, is that correct?

Code:
static bool fill(nsfb_t *nsfb, nsfb_bbox_t *rect, nsfb_colour_t c)
{
        int y;
        uint8_t ent;
        uint8_t *pvideo;

        if (!nsfb_plot_clip_ctx(nsfb, rect))
                return true; /* fill lies outside current clipping region */

        pvideo = get_xy_loc(nsfb, rect->x0, rect->y0);

        ent = colour_to_pixel(nsfb, c);

        for (y = rect->y0; y < rect->y1; y++) {
                memset(pvideo, ent, rect->x1 - rect->x0);
                pvideo += nsfb->linelen;
        }

        return true;
}
Code:
static bool
glyph1(nsfb_t *nsfb,
       nsfb_bbox_t *loc,
       const uint8_t *pixel,
       int pitch,
       nsfb_colour_t c)
{
        PLOT_TYPE *pvideo;
        PLOT_TYPE fgcol;
        int xloop, yloop;
        int xoff, yoff; /* x and y offset into image */
        int x = loc->x0;
        int y = loc->y0;
        int width = loc->x1 - loc->x0;
        int height = loc->y1 - loc->y0;
        const uint8_t *fntd;
        uint8_t row;

        if (!nsfb_plot_clip_ctx(nsfb, loc))
                return true;

        if (height > (loc->y1 - loc->y0))
                height = (loc->y1 - loc->y0);

        if (width > (loc->x1 - loc->x0))
                width = (loc->x1 - loc->x0);

        xoff = loc->x0 - x;
        yoff = loc->y0 - y;

        pvideo = get_xy_loc(nsfb, loc->x0, loc->y0);

        fgcol = colour_to_pixel(nsfb, c);

        for (yloop = yoff; yloop < height; yloop++) {
                fntd = pixel + (yloop * (pitch>>3)) + (xoff>>3);
                row = (*fntd++) << (xoff & 3);
                for (xloop = xoff; xloop < width ; xloop++) {
                        if (((xloop % 8) == 0) && (xloop != 0)) {
                                row = *fntd++;
                        }

                        if ((row & 0x80) != 0) {
                                *(pvideo + xloop) = fgcol;
                        }
                        row = row << 1;

                }

                pvideo += PLOT_LINELEN(nsfb->linelen);
        }

        return true;
}

static bool
glyph8(nsfb_t *nsfb,
       nsfb_bbox_t *loc,
       const uint8_t *pixel,
       int pitch,
       nsfb_colour_t c)
{
        PLOT_TYPE *pvideo;
		nsfb_colour_t fgcol;
        nsfb_colour_t abpixel; /* alphablended pixel */
        int xloop, yloop;
        int xoff, yoff; /* x and y offset into image */
        int x = loc->x0;
        int y = loc->y0;
        int width = loc->x1 - loc->x0;
        int height = loc->y1 - loc->y0;

        if (!nsfb_plot_clip_ctx(nsfb, loc))
                return true;

        if (height > (loc->y1 - loc->y0))
                height = (loc->y1 - loc->y0);

        if (width > (loc->x1 - loc->x0))
                width = (loc->x1 - loc->x0);

        xoff = loc->x0 - x;
        yoff = loc->y0 - y;

        pvideo = get_xy_loc(nsfb, loc->x0, loc->y0);

        fgcol = c & 0xFFFFFF;

        for (yloop = 0; yloop < height; yloop++) {
                for (xloop = 0; xloop < width; xloop++) {
                        abpixel = (pixel[((yoff + yloop) * pitch) + xloop + xoff] << 24) | fgcol;
                        if ((abpixel & 0xFF000000) != 0) {
                                /* pixel is not transparent */
                                if ((abpixel & 0xFF000000) != 0xFF000000) {
                                        abpixel = nsfb_plot_ablend(abpixel,
                                                                   pixel_to_colour(nsfb, *(pvideo + xloop)));
                                }

                                *(pvideo + xloop) = colour_to_pixel(nsfb, abpixel);
                        }
                }
                pvideo += PLOT_LINELEN(nsfb->linelen);
        }

        return true;
}

Last edited by NovaCoder; 02 October 2013 at 01:23.
NovaCoder is offline  
Old 02 October 2013, 02:59   #7
Thorham
Computer Nerd
 
Thorham's Avatar
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,796
Quote:
Originally Posted by NovaCoder View Post
The real problem you've got here is that colour is internally stored as a 32 bit value, normally the way to improve performance is to use a lookup table but this won't work in your case because the memory requirements would be huge.
They could still use a lookup table. Just convert the 24 bit colors to 12 bit colors. Now you only need a 4096 entries long table. Should be sufficient. A few more bits, say 15, wouldn't hurt and the table still wouldn't be too large.

Quote:
Originally Posted by NovaCoder View Post
Color conversion is always going to be slow if it's doing it for every time it needs to find out what color a pixel is
No way around that unfortunately.

Ultimately there's only so much you can do to speed this up. It would be much better to rewrite the whole loop with all calls inlined in assembly language. This goes for the dither routine as well, which can also be sped up even more by using Sierra Filter Lite instead of Floyd Steiberg (and is just as good).
Thorham is offline  
Old 02 October 2013, 03:47   #8
NovaCoder
Registered User
 
NovaCoder's Avatar
 
Join Date: Sep 2007
Location: Melbourne/Australia
Posts: 4,405
Yep that's what I'm thinking now........convert the color from 32 to 16 bits and then do a lookup.

That should result in a massive speed increase

Last edited by NovaCoder; 02 October 2013 at 04:00.
NovaCoder is offline  
Old 02 October 2013, 04:18   #9
NovaCoder
Registered User
 
NovaCoder's Avatar
 
Join Date: Sep 2007
Location: Melbourne/Australia
Posts: 4,405
So this is how I'd write it:

Code:
/*
 * Copyright 2009 Vincent Sanders <vince@simtec.co.uk>
 * Copyright 2010 Michael Drake <tlsa@netsurf-browser.org>
 *
 * This file is part of libnsfb, http://www.netsurf-browser.org/
 * Licenced under the MIT License,
 *                http://www.opensource.org/licenses/mit-license.php
 */

#include <stdbool.h>
#include <endian.h>
#include <stdlib.h>
#include <string.h>

#include "libnsfb.h"
#include "libnsfb_plot.h"
#include "libnsfb_plot_util.h"

#include "nsfb.h"
#include "palette.h"
#include "plot.h"

static byte palette_index_lookup[65536];

static uint8_t colour_to_pixel(nsfb_t *nsfb, nsfb_colour_t c)
{
        int 16bitColor;
		int paletteIndex;
		
		if (nsfb->palette == NULL)
                return 0;

		// First convert you 32bit ABGR color to a 16bit RGB color
		16bitColor = COLOR_CONVERSION_MACRO(c);

		// See if the 16bit color is already stored in the lookup.
		if (palette_index_lookup[16bitColor]) {
			// Just return the cached value.
			paletteIndex = palette_index_lookup[16bitColor];
		} else {
			// Calcuate
			paletteIndex = nsfb_palette_best_match_dither(nsfb->palette,c);
			
			// Store in lookup so it can be used next time.
			palette_index_lookup[16bitColor] = paletteIndex;
		}
		
		return paletteIndex;
}
Obviously this code is a bit rough but you should be able to get the idea.

You'll need to come up with a MACRO to convert from ABGR to a 16 bit color (Google is your friend).

You'll also need to initialize the 'palette_index_lookup' and reset it every time a logical palette update occurs.

And you should handle black (16bit value of zero) properly.

Last edited by NovaCoder; 08 October 2013 at 03:30.
NovaCoder is offline  
Old 02 October 2013, 16:38   #10
Thorham
Computer Nerd
 
Thorham's Avatar
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,796
Quote:
Originally Posted by NovaCoder View Post
You'll need to come up with a MACRO to convert from ABGR to a 16 bit color (Google is your friend).
That's easy enough... in assembler

Quote:
Originally Posted by NovaCoder View Post
You'll also need to initialize the 'palette_index_lookup' and reset it every time a logical palette update occurs.
What for? Just use a palette that never changes. You need a 'one size fits all' palette anyway. Just generate a lookup table for the palette once (or store on disk), and be done with it. There's also no need to cache any values.

In assembler it would look something like this (68020/030, 060 needs to be optimized differently):
Code:
	lea	c+1,a0 		; ptr to c, skip alpha part
	clr	d0 		; clear table index
	move.b	(a0)+,d0	; get red
	lsl.l	#5,d0		; make room for next byte (shift)
	move.b	(a0)+,d		; get green
	lsl.l	#5,d0		; make room for next byte (shift)
	move.b	(a0)+,d0	; get blue
	lsr.l	#3,d0		; final 15 bit value
	move.l	(a1,d0.w),d0	; get palette color number
It would be faster if this is done directly in the loop that reads the argb pixels, where you should read bytes, not the whole argb pixel.

There's always a ton of stuff you can happily do in C, but tight loops like this should be optimized in assembler... after the C code works properly, of course

Last edited by Thorham; 02 October 2013 at 21:36.
Thorham is offline  
Old 03 October 2013, 02:53   #11
NovaCoder
Registered User
 
NovaCoder's Avatar
 
Join Date: Sep 2007
Location: Melbourne/Australia
Posts: 4,405
Quote:
Originally Posted by Thorham View Post
What for? Just use a palette that never changes. You need a 'one size fits all' palette anyway. Just generate a lookup table for the palette once (or store on disk), and be done with it. There's also no need to cache any values.
Yep you could do that, it would obviously give the best speed but it would probably look pretty nasty. For my ScummVM ports I used a paint program to build an optimised 256 color palette based on a screen grab I took of the 16bit GUI, I then simply load up that palette file at program start up (this only works because the GUI in ScummVM is based on a unified color scheme).

I'm not sure how often NetSurf is updating the palette but if it's coding properly it will be updating it only once per page, if that is the case then it would look a lot better (and still be reasonably fast) to use my palette caching code listed above.

An important thing to check would be to ensure that it's not updating the palette each time the user scrolls the web page (the palette should be based on the entire page including the off screen regions).
NovaCoder is offline  
Old 03 October 2013, 07:37   #12
PeterK
Registered User
 
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,378
You can find my color reduction from ARGB to the OS 3.5 palette mapped format with up to 256 colors here:

http://eab.abime.net/showthread.php?p=914435#post914435

The OS 3.5 format can be adapted to every AGA screen mode with the normal color mapping code which is also part of my library (including the optional and much faster color mapping based on a 512 byte cache). Just search for IconBeFast in my source code and ask me if I should explain how it works.
PeterK is offline  
Old 03 October 2013, 08:52   #13
Thorham
Computer Nerd
 
Thorham's Avatar
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,796
To NovaCoder:

Yep, it won't be pretty

Anyway, if you're going to generate a palette for the whole page, you might as well render it in HAM8. Looks better, and I wouldn't be surprised if it was faster, too (especially when you start adding error diffusion into the mix). Meynaf's HAM8 rendering routine (not the one I showed you) runs at about 120 cycles per pixel on a 68030. Seems hard to beat.

Also, you'd have to render the whole page with all the images in one go, otherwise you can't calculate the palette. It's better if images are loaded dynamically so that the user can at least read and scroll through the text on the page. It seems to me that a fixed palette or HAM is best.
Thorham is offline  
Old 03 October 2013, 15:44   #14
Leffmann
 
Join Date: Jul 2008
Location: Sweden
Posts: 2,269
Quote:
Originally Posted by arti View Post
In order to speed up scrolling a little faster this code should be optimized or even better, asm'fied:
Do you mean it will speed up loading of the page, or does the browser dither the images and render the whole page over and over again as you scroll around on it? If pages load quickly, but scrolling is slow, then the bottleneck is probably not in the code that renders the page.

EDIT: try this to start with. I've changed the calculation of col on line 92 in pallete.h as it would sometimes produce an index of 256, and I've reduced the size of the r g b_error variables to 16-bit, so you need to change this in the dithering function as well. The code also assumes entries 239 through 255 in the palette are perfect grays:
Code:
-

Last edited by Leffmann; 04 October 2013 at 16:58.
Leffmann is offline  
Old 04 October 2013, 12:26   #15
arti
Registered User
 
Join Date: Jul 2008
Location: Poland
Posts: 663
So, I simply replace code from line 92 to 106 with this code , yes?

Where can I find tables.i include?
arti is offline  
Old 04 October 2013, 16:58   #16
Leffmann
 
Join Date: Jul 2008
Location: Sweden
Posts: 2,269
Unfortunately it's not that simple, you can't just plop some assembly right into the C source code without knowing more about the compiler.

I've added all files to the attached archive, and for now you can try linking in the object file, change the rgb_error variables to 16-bit sizes, and declare and call it like this:

Code:
int func_NSFB_PALETTE_NSFB_8BPP(int, void*, void*, void*, void*);

switch (palette->type) {
  case NSFB_PALETTE_NSFB_8BPP:
    best_col = func_NSFB_PALETTE_NSFB_8BPP(c, palette->data, &r_error, &g_error, &b_error);
    break;

    ...
}
If it doesn't work then you have to look in the manual and find out about the compiler's calling conventions. The best thing would probably be to replace all of it with an optimized function that takes a 24-bit image and a 256-color palette, and outputs a dithered 256-color image.
Attached Thumbnails
Click image for larger version

Name:	nature.png
Views:	394
Size:	130.5 KB
ID:	37035   Click image for larger version

Name:	nature8bit.png
Views:	393
Size:	30.8 KB
ID:	37036  
Attached Files
File Type: zip conv.zip (4.3 KB, 201 views)
Leffmann is offline  
Old 04 October 2013, 18:37   #17
arti
Registered User
 
Join Date: Jul 2008
Location: Poland
Posts: 663
It doesn't work.

Quote:
The best thing would probably be to replace all of it with an optimized function that takes a 24-bit image and a 256-color palette, and outputs a dithered 256-color image.
I agree, PeterK has this function in asm used in his icon.library.

I'll ask him how to implement it here:

Code:
static uint8_t colour_to_pixel(nsfb_t *nsfb, nsfb_colour_t c)
{
        if (nsfb->palette == NULL)
                return 0;

			
		return nsfb_palette_best_match_dither(nsfb->palette,c);
}

Last edited by arti; 04 October 2013 at 18:46.
arti is offline  
Old 05 October 2013, 00:11   #18
PeterK
Registered User
 
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,378
Attached is the nature image (reduced to 256x160) as a direct drawing snapshot of my nature.info icon file. The size limitation exists only for the icons, of course. My color reduction routine could also handle larger sizes, too.
http://eab.abime.net/showpost.php?p=...&postcount=730
But I'm sorry, I've no intentions to port my assembler code to C.
Attached Files
File Type: rar nature.info.rar (20.3 KB, 182 views)

Last edited by PeterK; 10 May 2018 at 21:21.
PeterK is offline  
Old 05 October 2013, 10:11   #19
bubbob42
Registered User
 
Join Date: Oct 2012
Location: Germany
Posts: 585
Quote:
Originally Posted by PeterK View Post
But I'm sorry, I've no intentions to port my assembler code to C.
Maybe you could just create some linkable object modules instead? I think there is no need to port stuff.
bubbob42 is offline  
Old 05 October 2013, 23:46   #20
PeterK
Registered User
 
Join Date: Apr 2005
Location: digital hell, Germany, after 1984, but worse
Posts: 3,378
I came to the conclusion that NetSurf needs a totally different solution than my color recuction code since it creates a seperate palette for every image which is not really suitable for a browser.

Just for short now what I'm planning to do:

The netsurf browser needs a fixed 256 color palette in order to decode all images as fast as possible. I want to use the 4 system colors and then add 252 fixed colors to the palette (6x7x6 possible values for the RGB components). The first 3 system colors (transparent, black and white) should be used too for the color assignments.

The decoding of the color components will be done without any table or comparison by simply substracting 42 or 36 again and again from the component values. This will only require about 10 substractions for an average RGB value, no DIV, no MUL or anything CPU time consuming. And it can probably done in C.

But I've no idea yet how good the image quality could be without using any dithering (at least not in the first version based on this direct color to palette assignment method).

Update1: The first problem I noticed now are the mouse pointer colors (pen 16-18) which could detroy my concept for the fixed palette.

Update2: Instead of 252 colors I will try to use 216 colors now (RGB = 6x6x6) from pen 32 up to pen 247. This leave some free pens for MWB and the color mapping and it also allows correct gray shades. The color components could have the values 15, 60, 105, 150, 195, 240 for a first test.

Last edited by PeterK; 07 October 2013 at 11:07.
PeterK is offline  
 


Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)
 
Thread Tools

Similar Threads
Thread Thread Starter Forum Replies Last Post
NetSurf for AGA arti News 92 14 March 2016 21:44
Optimizing question: instruction order TheDarkCoder Coders. Asm / Hardware 9 29 October 2011 17:07
Layered tile engine optimizing. Thorham Coders. General 0 30 September 2011 20:43
Benching and optimizing CF-IDE speed Photon support.Hardware 12 15 July 2009 01:48
For people who like optimizing 680x0 code. Thorham Coders. General 5 28 May 2008 11:48

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

Forum Jump


All times are GMT +2. The time now is 05:50.

Top

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2024, vBulletin Solutions Inc.
Page generated in 0.14115 seconds with 14 queries