NetSurf AGA optimizing - Page 5

Thorham · 20 October 2013, 18:29

Quote:

Originally Posted by Mrs Beanbag

instead of

Code:

	sub.l	#640*6,a0

how about

Code:

        lea -640*6(A0),A0

Yeah, haven't gotten around to optimizing it further because I was still busy with the contents of the loop. Now that I have plenty of registers left, I'm going to do this:

Code:

    sub.l   d5,a0

PeterK · 20 October 2013, 18:47

@Thorham
Your C-array init would require setting up all 834 bytes by hand. That's not what I'm looking for.

I still need a short and efficient table generator in C like this one in assembler:

Quote:

/* DCBlock.Byte count, data */

DCB.B 53, 4
DCB.B 43, 5
DCB.B 43, 6
DCB.B 43, 7
DCB.B 43, 8
DCB.B 53, 9

DCB.B 41, 0
DCB.B 37, 6
DCB.B 37, 12
DCB.B 37, 18
DCB.B 37, 24
DCB.B 37, 30
DCB.B 52, 36

DCB.B 53, 0
DCB.B 43, 42
DCB.B 43, 84
DCB.B 43, 126
DCB.B 43, 168
DCB.B 53, 210
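A minimal C sketch of such a generator, assuming the table is simply the DCB.B runs above expanded in order (3 blocks of 278 bytes, 834 in total). The names `struct run`, `cube_676_runs` and `build_table_for_cube_676` are made up for illustration; only `table_for_cube_676` appears in the thread. The loop variable is declared up front so it also compiles with older gcc defaults.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CUBE_TABLE_SIZE 834  /* 3 blocks of 278 bytes */

/* One run: 'count' copies of 'value', like "DCB.B count, value". */
struct run { unsigned char count, value; };

/* The run list from the assembler table (38 bytes of data instead of 834). */
static const struct run cube_676_runs[] = {
    {53, 4}, {43, 5}, {43, 6}, {43, 7}, {43, 8}, {53, 9},
    {41, 0}, {37, 6}, {37, 12}, {37, 18}, {37, 24}, {37, 30}, {52, 36},
    {53, 0}, {43, 42}, {43, 84}, {43, 126}, {43, 168}, {53, 210},
};

unsigned char table_for_cube_676[CUBE_TABLE_SIZE];

/* Expand the runs into table_for_cube_676; returns the number of bytes written. */
size_t build_table_for_cube_676(void)
{
    size_t pos = 0;
    size_t i;

    for (i = 0; i < sizeof cube_676_runs / sizeof cube_676_runs[0]; i++) {
        /* Guard against a run list that overflows the table. */
        assert(pos + cube_676_runs[i].count <= CUBE_TABLE_SIZE);
        memset(table_for_cube_676 + pos, cube_676_runs[i].value,
               cube_676_runs[i].count);
        pos += cube_676_runs[i].count;
    }
    return pos;
}
```

Calling `build_table_for_cube_676()` once at start-up fills the table; the return value can be compared against 834 as a sanity check on the run list.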

The other code could be like this now:

Code:

case NSFB_PALETTE_CUBE_676:

	dr = ( c        & 0xFF);
	dg = ((c >>  8) & 0xFF);
	db = ((c >> 16) & 0xFF);

	if ((pushRGBlevel = ~pushRGBlevel)) { /* toggles between 0 and ~0: push up every 2nd pixel */
		dr += 0x16;
		dg += 0x16;
		db += 0x16;
	}
	if (dr > 250)
		if (dg > 250)
			if (db > 250) return 2; /* this is white */

	best_col = table_for_cube_676[dr+556]
		 + table_for_cube_676[dg+278]
		 + table_for_cube_676[db];

	break;

Thorham · 20 October 2013, 18:56

Quote:

Originally Posted by PeterK

Your C-array init would require setting up all 834 bytes by hand. That's not what I'm looking for. I need a short and efficient table generator.

Sorry, thought you meant how to do that asm table in C

Mrs Beanbag · 20 October 2013, 18:58

Quote:

Originally Posted by Thorham

Yeah, haven't gotten around to optimizing it further because I was still busy with the contents of the loop. Now that I have plenty of registers left, I'm going to do this:

Code:

    sub.l   d5,a0

Fair enough, although it's worth remembering that sub.w will sign-extend the source operand when the destination is an address register. Makes no odds in this case though, I suppose.
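The sign-extension point can be modelled in a few lines of C. This is only a sketch of the arithmetic, not of any real emulator; `suba_w` and `suba_l` are hypothetical helper names. It shows why 640*6 = 3840 is harmless (it fits in a positive 16-bit value) while a word source of $8000 or more would move the pointer the other way.

```c
#include <assert.h>
#include <stdint.h>

/* Model of "suba.w src,An": the 16-bit source is sign-extended to
   32 bits first, then subtracted from the full address register. */
uint32_t suba_w(uint32_t an, uint16_t src)
{
    int32_t extended = (src & 0x8000) ? (int32_t)src - 0x10000
                                      : (int32_t)src;
    return an - (uint32_t)extended;
}

/* Model of "suba.l src,An": plain 32-bit subtraction. */
uint32_t suba_l(uint32_t an, uint32_t src)
{
    return an - src;
}
```

With these, `suba_w(a, 640 * 6)` and `suba_l(a, 640 * 6)` agree for any `a`, whereas `suba_w(a, 0x8000)` adds 32768 to `a` instead of subtracting it.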

Thorham · 20 October 2013, 18:59

Quote:

Originally Posted by Mrs Beanbag

Fair enough, although it's worth remembering that sub.w will sign-extend the source operand when the destination is an address register. Makes no odds in this case though, I suppose.

Size of register to register subs and adds makes no difference on 68020+. Same for moves, logical operators, shifts and rotates.

Mrs Beanbag · 20 October 2013, 19:14

Quote:

Originally Posted by Thorham

To PeterK:

Code:

; dithering
    subq.l    #1,d3
    bge    .l1
    moveq    #2,d3
    add.l    d4,a2

.l1
    dbra    d6,.loopx

Code:

; dithering
    subq.l    #1,d3
    dblt    d6,.loopx
    bge.s  .l1
    moveq    #2,d3
    add.l    d4,a2

    dbra    d6,.loopx
.l1
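For anyone unsure why the dblt version is equivalent: DBcc falls through when its condition is true, otherwise it decrements the counter word and branches unless the counter reached -1, and it leaves the condition codes alone. Below is a rough C model of both loop tails under those rules. The struct and function names are invented for this sketch, and the starting value of d3 is an assumption.

```c
#include <assert.h>
#include <stdint.h>

struct loop_result { long iterations; long a2_steps; int d3; };

/* Counter part of dbra/dbcc: decrement the word, continue unless it hit -1. */
static int db_step(int16_t *counter)
{
    *counter = (int16_t)(*counter - 1);
    return *counter != -1;
}

/* Original tail: subq / bge .l1 / moveq / add / .l1: dbra */
struct loop_result run_original(int16_t d6, int d3)
{
    struct loop_result r = {0, 0, 0};
    for (;;) {
        r.iterations++;               /* loop body would run here */
        d3 -= 1;                      /* subq.l #1,d3 */
        if (d3 < 0) {                 /* bge .l1 not taken */
            d3 = 2;                   /* moveq #2,d3 */
            r.a2_steps++;             /* add.l d4,a2 */
        }
        if (!db_step(&d6)) break;     /* dbra d6,.loopx */
    }
    r.d3 = d3;
    return r;
}

/* dblt tail: the common case loops straight back from the dblt. */
struct loop_result run_dblt(int16_t d6, int d3)
{
    struct loop_result r = {0, 0, 0};
    for (;;) {
        r.iterations++;
        d3 -= 1;                      /* subq.l #1,d3 */
        if (d3 >= 0) {                /* dblt: LT is false, so count down */
            if (db_step(&d6)) continue;
            break;                    /* counter expired: bge.s .l1 is taken */
        }
        d3 = 2;                       /* moveq #2,d3 */
        r.a2_steps++;                 /* add.l d4,a2 */
        if (!db_step(&d6)) break;     /* dbra d6,.loopx */
    }
    r.d3 = d3;
    return r;
}
```

Both functions give the same iteration count, the same number of a2 adjustments and the same final d3 for a given start state; the saving in the assembler comes from two out of three passes executing one branch instruction instead of two.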

PeterK · 20 October 2013, 20:03

@arti
Atm, I don't know how to help you any further as long as you don't tell me what you need now.

Thorham · 20 October 2013, 20:04

To Mrs Beanbag:

Good one

That shaves off a good few cycles! I should read up on the dbcc instruction, because I only use it to make for loops.

arti · 20 October 2013, 20:45

@PeterK

Should I comment out nsfb_palette_generate_nsfb_8bpp(nsfb->palette);
and use nsfb_palette_generate_cube_676(nsfb->palette); instead?
Or use both functions?

I've implemented your code and this is the result. It doesn't work yet.

PeterK · 20 October 2013, 20:59

Yeah, you could try to comment out nsfb_palette_generate_nsfb_8bpp and use nsfb_palette_generate_cube_676 instead.

I must admit that I don't understand all the dependencies in Netsurf concerning how the palettes are mapped to the screen pens and how it manages to use more than one palette at the same time. I've never done anything with Netsurf yet.

If you are still using my older code then please comment the alpha channel handling out:
// if (c < 0x46000000) return 0; /* alpha < 70 gets pen 0 */

Maybe NetSurf always sets the alpha channel to zero? I don't know.

arti · 20 October 2013, 21:18

Have you looked at common.c? Maybe that helps you understand.

PeterK · 20 October 2013, 21:37

Where can I download the latest source code of Netsurf and which compiler and additional resources will I need to compile it?

arti · 20 October 2013, 21:59

Here https://www.dropbox.com/sh/k49d8viddz9xo28/Z-HGQIXIRe

I use gcc 4.5.0 for cygwin from amiga.sf with AmiDevCpp 0.9.8 workspace

Thorham · 20 October 2013, 22:54

Quote:

Originally Posted by Mrs Beanbag

Code:

; dithering
    subq.l    #1,d3
    dblt    d6,.loopx
    bge.s  .l1
    moveq    #2,d3
    add.l    d4,a2

    dbra    d6,.loopx
.l1

LOL:

Code:

; dithering
    move.l  a2,d5
    move.l  a3,a2
    move.l  a4,a3
    move.l  d5,a4
    
    dbra    d6,.loopx
    sub.l   #640*6,a0
    dbra    d7,.loopy

Mrs Beanbag · 20 October 2013, 23:00

Quote:

Originally Posted by Thorham

LOL:

Code:

; dithering
    move.l  a2,d5
    move.l  a3,a2
    move.l  a4,a3
    move.l  d5,a4
    
    dbra    d6,.loopx
    sub.l   #640*6,a0
    dbra    d7,.loopy

I take it d4 is double d2 then

edit: d2=256, d4=512, right?

Thorham · 20 October 2013, 23:09

Quote:

Originally Posted by Mrs Beanbag

I take it d4 is double d2 then

edit: d2=256, d4=512, right?

Here's the whole render routine:

Code:

renderImage
    lea     image_end-640*3,a0
    lea     bmp,a1
    lea     tableR+256-16,a2

    move.l  a2,a3
    add.l   #16,a3
    move.l  a3,a4
    add.l   #16,a4

    clr.l   d0
;
; render loop
;
    move.l  #512-1,d7   ; image height
.loopy
    move.l  #640-1,d6   ; image width
.loopx
    move.b  (a0)+,d0
    move.b  (a2,d0.w,256*6.w),d1

    move.b  (a0)+,d0
    add.b   (a2,d0.w,256*3.w),d1

    move.b  (a0)+,d0
    add.b   (a2,d0.w),d1

    move.b  d1,(a1)+

; dithering
    move.l  a2,d2
    move.l  a3,a2
    move.l  a4,a3
    move.l  d2,a4

.next
    dbra    d6,.loopx
    sub.l   #640*6,a0
    dbra    d7,.loopy
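For readers who don't speak 68k, here is a rough C rendering of the routine above. It is a sketch under assumptions: the three tables are taken to be 768 bytes each and contiguous (which is what the 256*3 and 256*6 displacements imply), the function name and parameters are invented, and the rotating a2/a3/a4 pointers are replaced by a phase counter that selects the -16, 0, +16 offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SPAN 768  /* assumed size of each per-channel table */

/* tables: three consecutive TABLE_SPAN-byte tables.
   src:    width*height pixels of 3 bytes; rows are read from the last
           one in memory to the first, as in the assembler listing.
   dst:    width*height palette indices, written sequentially. */
void render_image(const uint8_t *tables, const uint8_t *src,
                  uint8_t *dst, int width, int height)
{
    static const int dither[3] = { -16, 0, 16 };
    int phase = 0;
    int x, y;

    for (y = height - 1; y >= 0; y--) {
        const uint8_t *p = src + (size_t)y * width * 3;
        for (x = 0; x < width; x++) {
            const uint8_t *base = tables + 256 + dither[phase];
            uint8_t pen;

            pen  = base[2 * TABLE_SPAN + *p++];  /* first byte, third table  */
            pen += base[1 * TABLE_SPAN + *p++];  /* second byte, second table */
            pen += base[*p++];                   /* third byte, first table  */
            *dst++ = pen;

            phase = (phase + 1) % 3;  /* stands in for rotating a2/a3/a4 */
        }
    }
}
```

The phase is deliberately not reset per row, since the pointer rotation in the assembler also carries over from one row to the next.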

Mrs Beanbag · 20 October 2013, 23:12

Neat! You could probably re-arrange the instruction order a bit to assist pipelining/mitigate memory stalls.

Code:

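; note: interleaved like this, d2 has to hold the old a3 value before
; the first pass (the original loop never initialises d2), and the
; three offsets are then used in the order a2, a4, a3
.loopx
    move.b  (a0)+,d0
    move.l  a4,a3
    move.b  (a2,d0.w,256*6.w),d1

    move.b  (a0)+,d0
    move.l  d2,a4
    add.b   (a2,d0.w,256*3.w),d1

    move.b  (a0)+,d0
    move.l  a2,d2
    add.b   (a2,d0.w),d1

    move.l  a3,a2
    move.b  d1,(a1)+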

I'll admit I have no idea what the structure of this look-up table is.

Thorham · 20 October 2013, 23:40

Quote:

Originally Posted by Mrs Beanbag

Neat! You could probably re-arrange the instruction order a bit to assist pipelining/mitigate memory stalls.

Yeah, for '60 that's best. This code is written for '20/'30, so you can't do much as far as I'm aware (although I have no clue about those memory stalls, if 20/30 has them).

Quote:

Originally Posted by Mrs Beanbag

I'll admit I have no idea what the structure of this look-up table is.

It's a little hard to explain, but it's not too complicated. Here's the code that generates the table:

Code:

;
; generate color reduction tables
;
genTables
    movem.l d0-a6,-(sp)

    lea tableR+256,a0
    lea tableG+256,a1
    lea tableB+256,a2

    clr.l   d0 ; Red
    clr.l   d1 ; Green
    clr.l   d2 ; Blue

    move.l  #(1<<16)/(51-1),d3 ; Red 16bit.16bit fixed point number
    move.l  #(1<<16)/(42-1),d4 ; Green 16bit.16bit fixed point number

    move.l  #255,d7
.loop
    move.l  d0,d6
    swap    d6
    move.b  d6,(a0)+

    move.l  d1,d6
    swap    d6
    mulu.w  #6,d6
    move.b  d6,(a1)+

    move.l  d0,d6               ; blue reuses the red accumulator (same 6-level step); d2 is never advanced
    swap    d6
    mulu.w  #7*6,d6
    move.b  d6,(a2)+

    add.l   d3,d0
    add.l   d4,d1

    dbra    d7,.loop

    lea tableR,a0
    lea tableG,a1
    lea tableB,a2

    move.l  508(a0),d0
    move.l  508(a1),d1
    move.l  508(a2),d2

    move.l  #255,d7
.loop2
    move.b  d0,512(a0)
    clr.b   (a0)+
    move.b  d1,512(a1)
    clr.b   (a1)+
    move.b  d2,512(a2)
    clr.b   (a2)+

    dbra    d7,.loop2

    movem.l (sp)+,d0-a6
    rts
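The same table layout expressed in C, as a sketch only. It assumes tableR, tableG and tableB are contiguous 768-byte areas (256 zeros, a 256-entry staircase, then 256 copies of the top value), which is what the generator and the render routine together suggest. `gen_table` and `gen_tables` are illustrative names.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_SPAN 768  /* assumed: 256 zeros + 256 staircase + 256 clamped */

/* Fill one table. 'step' is a 16.16 fixed-point increment per input
   value; 'scale' is this channel's stride in the 6*7*6 palette. */
static void gen_table(uint8_t *table, uint32_t step, unsigned scale)
{
    uint32_t acc = 0;
    int i;

    for (i = 0; i < 256; i++) {
        table[i] = 0;                                   /* clamp low  */
        table[256 + i] = (uint8_t)((acc >> 16) * scale); /* swap + mulu */
        acc += step;
    }
    for (i = 0; i < 256; i++)
        table[512 + i] = table[511];                    /* clamp high */
}

/* tables must have room for 3 * TABLE_SPAN bytes. */
void gen_tables(uint8_t *tables)
{
    gen_table(tables,                  (1u << 16) / (51 - 1), 1);      /* red   */
    gen_table(tables + TABLE_SPAN,     (1u << 16) / (42 - 1), 6);      /* green */
    gen_table(tables + 2 * TABLE_SPAN, (1u << 16) / (51 - 1), 7 * 6);  /* blue  */
}
```

The zero and clamped areas on either side of each staircase are what let the render loop add -16 or +16 to the table pointer without any range checks.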

And here's the palette generation code:

Code:

;
; set palette to a 6*7*6 palette
;
setPalette
    movem.l d0-a6,-(sp)
    move.l  scr,a5

    lea     sc_BitMap(a5),a4    ; unused: a4 is overwritten by the next instruction
    lea     sc_ViewPort(a5),a4
    move.l  a4,svport

    lea b,a0
    moveq   #255/5,d4
    moveq   #255/6,d3

    moveq   #5,d7   ; blue
.loopz
    moveq   #6,d6   ; green
.loopy
    moveq   #5,d5   ; red
.loopx
    moveq   #5,d0
    sub.l   d5,d0
    mulu.w  d4,d0
    ror.l   #8,d0
    move.l  d0,(a0)+

    moveq   #6,d0
    sub.l   d6,d0
    mulu.w  d3,d0
    ror.l   #8,d0
    move.l  d0,(a0)+

    moveq   #5,d0
    sub.l   d7,d0
    mulu.w  d4,d0
    ror.l   #8,d0
    move.l  d0,(a0)+

    dbra    d5,.loopx
    dbra    d6,.loopy
    dbra    d7,.loopz

    move.l  gfxbase,a6
    move.l  svport,a0
    lea     pal,a1
    jsr     _LVOLoadRGB32(a6)

    movem.l (sp)+,d0-a6
    rts
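A C sketch of just the colour generation part, for comparison. It only fills an array with the 252 R, G, B triples in the same order and with the same left-justified 32-bit values that the `ror.l #8` produces; the header and terminator that a LoadRGB32 table needs, and the call itself, are left out. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CUBE_COLOURS (6 * 7 * 6)  /* 252 palette entries */

/* Writes CUBE_COLOURS * 3 longs (R, G, B per colour). Each 8-bit level
   is moved into the top byte, as "ror.l #8" does for values below 256.
   Colour index = r + 6*g + 42*b, matching the lookup table strides. */
void gen_cube_676_palette(uint32_t *rgb)
{
    int r, g, b;

    for (b = 0; b < 6; b++)
        for (g = 0; g < 7; g++)
            for (r = 0; r < 6; r++) {
                *rgb++ = (uint32_t)(r * (255 / 5)) << 24;
                *rgb++ = (uint32_t)(g * (255 / 6)) << 24;
                *rgb++ = (uint32_t)(b * (255 / 5)) << 24;
            }
}
```

Note that with integer steps of 51 and 42 the brightest green comes out as 252 rather than 255, same as in the assembler version.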

Mrs Beanbag · 21 October 2013, 15:51

another thing you could do is unroll the loop 3 times, and get rid of those four moves entirely.

Thorham · 21 October 2013, 16:43

Quote:

Originally Posted by Mrs Beanbag

another thing you could do is unroll the loop 3 times, and get rid of those four moves entirely.

Thanks, that's a good idea

Especially with a loop this small. Make a macro, and it should stay pretty clean looking, too.
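A sketch of the unroll-by-three idea in C, with a macro per pixel so the body stays readable. Assumptions: 768-byte contiguous tables as in the render routine, a hypothetical function name, and one row handled per call. Since 640 is not a multiple of 3, the leftover pixel is handled after the main loop, and this version restarts the dither phase on every row, unlike the pointer-rotating loop where the phase carries over.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SPAN 768  /* assumed size of each per-channel table */

/* One pixel: sum the three per-channel lookups relative to 'base'.
   Uses and advances the local variables 'src' and 'dst'. */
#define PIXEL(base) do {                                   \
        uint8_t pen = (base)[2 * TABLE_SPAN + src[0]];     \
        pen += (base)[TABLE_SPAN + src[1]];                \
        pen += (base)[src[2]];                             \
        *dst++ = pen;                                      \
        src += 3;                                          \
    } while (0)

/* Render n pixels of one row; no pointer rotation is needed because
   each unrolled slot always uses the same fixed offset. */
void render_row_unrolled(const uint8_t *tables, const uint8_t *src,
                         uint8_t *dst, size_t n)
{
    const uint8_t *lo  = tables + 256 - 16;
    const uint8_t *mid = tables + 256;
    const uint8_t *hi  = tables + 256 + 16;
    size_t i;

    for (i = 0; i + 3 <= n; i += 3) {
        PIXEL(lo);
        PIXEL(mid);
        PIXEL(hi);
    }
    /* Up to two leftover pixels when n is not a multiple of 3. */
    if (i < n) { PIXEL(lo); i++; }
    if (i < n) { PIXEL(mid); }
}
```

In assembler the same effect would come from repeating the macro three times with a2, a3 and a4 as the base register and looping 213 times, plus one extra pixel per 640-pixel row.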


