Optimizing HAM8 renderer.

Thorham · 20 June 2017, 20:47

For people who like optimizing 68020/68030 code (don't sacrifice render quality):

Code:

ham8.render
    movem.l d0-a6,-(sp)

    move.l  bmpFile,a5
    add.l   bmpFileSize,a5
    sub.l   #640*3,a5
    move.l  bmp,a6

    clr.l   d0
    clr.l   d1
    clr.l   d2
    clr.l   d3
    clr.l   d4
    clr.l   d5

    move.w  #512-1,-(sp)
.loopy

    clra    a0
    clra    a1
    clra    a2

    move.w  #640-1,d7
.loopx

; read pixel's red green and blue components (little endian)

    move.b  (a5)+,d2 ; blue
    move.b  (a5)+,d1 ; green
    move.b  (a5)+,d0 ; red

; get pointer to closest palette color

    move.b  d0,d6
    lsl.w   #4,d6
    move.b  d1,d6
    lsl.w   #4,d6
    move.b  d2,d6
    lsr.w   #4,d6

    lea     (ham8.colorTable.w,pc,d6.w*8),a3

; palette difference

    move.w  (a3)+,d3 ; red
    sub.w   d0,d3
    subx.w  d6,d6
    eor.w   d6,d3

    move.w  (a3)+,d4 ; green
    sub.w   d1,d4
    subx.w  d6,d6
    eor.w   d6,d4

    move.w  (a3)+,d5 ; blue
    sub.w   d2,d5
    subx.w  d6,d6
    eor.w   d6,d5

; calculate weighted x2 x3 x1 threshold

    add.l   d4,d3
    add.l   d3,d3
    add.l   d4,d3
    add.l   d5,d3
    move.l  d3,a4

; ham difference

    move.l  a0,d3 ; red
    sub.w   d0,d3
    subx.w  d6,d6
    eor.w   d6,d3

    move.l  a1,d4 ; green
    sub.w   d1,d4
    subx.w  d6,d6
    eor.w   d6,d4

    move.l  a2,d5 ; blue
    sub.w   d2,d5
    subx.w  d6,d6
    eor.w   d6,d5

; 2x 3x 1x ham difference weights

    add.l   d3,d3
    move.l  d4,d6
    add.l   d4,d4
    add.l   d6,d4

; mask for ham pixels

    moveq   #-4,d6

; compare ham differences for green

    cmp.l   d4,d3
    bgt.s   .red
    cmp.l   d4,d5
    bgt.s   .blue

; check weighted threshold

    add.l   d5,d3
    cmp.l   d3,a4
    ble.s   .palette

; update ham color, set green ham code, write pixel

    and.b   d6,d1
    move.l  d1,a1
    addq.l  #3,d1
    move.b  d1,(a6)+

    dbra    d7,.loopx
    bra.s   .next

; compare ham differences for red

.red
    cmp.l   d3,d5
    bgt.s   .blue

; check weighted threshold

    add.l   d5,d4
    cmp.l   d4,a4
    ble.s   .palette

; update ham color, set red ham code, write pixel

    and.b   d6,d0
    move.l  d0,a0
    addq.l  #2,d0
    move.b  d0,(a6)+

    dbra    d7,.loopx
    bra.s   .next

; check weighted threshold

.blue
    add.l   d4,d3
    cmp.l   d3,a4
    ble.s   .palette

; update ham color, set blue ham code, write pixel

    and.b   d6,d2
    move.l  d2,a2
    addq.l  #1,d2
    move.b  d2,(a6)+

    dbra    d7,.loopx
    bra.s   .next

; write palette color and update current ham color

.palette

    subq.l  #6,a3

    move.w  (a3)+,a0
    move.w  (a3)+,a1
    move.w  (a3)+,a2
    move.b  (a3),(a6)+

    dbra    d7,.loopx

.next
    sub.l   #640*6,a5

    subq.w  #1,(sp)
    bge     .loopy

    addq.l  #2,sp

    movem.l (sp)+,d0-a6
    rts

ham8.render_end
ham8.colorTable
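
As a reading aid, the per-pixel arithmetic in the listing above can be modelled in a few lines of Python. This is a sketch of how the instructions appear to behave, not code from the thread: the function names are invented for illustration, inputs are assumed to be 0..255 components, and the `subx`/`eor` pair is modelled as a one's complement, which would make negative differences come out one smaller than a true absolute value.

```python
def abs_diff_68k(a, b):
    """Model sub.w b,a / subx.w d6,d6 / eor.w d6,a for 0..255 inputs."""
    diff = (a - b) & 0xFFFF           # sub.w: X flag is set on borrow (a < b)
    mask = 0xFFFF if a < b else 0     # subx.w d6,d6 leaves -X in d6
    return diff ^ mask                # eor.w: one's complement when negative


def weighted_sum(dr, dg, db):
    """Model the four add.l instructions that build the palette threshold."""
    d3 = dr + dg                      # add.l d4,d3
    d3 += d3                          # add.l d3,d3
    d3 += dg                          # add.l d4,d3
    d3 += db                          # add.l d5,d3
    return d3                         # 2*dr + 3*dg + 1*db


def choose(pal_err, ham_r, ham_g, ham_b):
    """Pick the output for one pixel from already weighted errors.

    ham_r, ham_g, ham_b are the 2x, 3x and 1x weighted HAM differences.
    The channel with the largest error is the one HAM would replace, so
    the error left over is the sum of the other two.
    """
    if ham_r > ham_g:                 # cmp.l d4,d3 / bgt.s .red
        if ham_b > ham_r:             # cmp.l d3,d5 / bgt.s .blue
            channel, rest = "blue", ham_r + ham_g
        else:
            channel, rest = "red", ham_g + ham_b
    elif ham_b > ham_g:               # cmp.l d4,d5 / bgt.s .blue
        channel, rest = "blue", ham_r + ham_g
    else:
        channel, rest = "green", ham_r + ham_b
    # ble.s .palette: the palette entry wins ties
    return "palette" if pal_err <= rest else channel
```

In this model `abs_diff_68k(3, 10)` gives 6 rather than 7, and `choose` falls back to the palette entry whenever its weighted error is no larger than what the best HAM modification would leave behind.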

Thorham · 22 June 2017, 04:33

I refuse to believe no one sees any optimizations at all.

a/b · 22 June 2017, 08:21

I'm more into 000/040...
Anyway, three minor things after taking a quick look at the code and 020 tables:
- lea (ham8.colorTable.w,pc,d6.w*8),a3 is out of 8-bit range
- and.b #$fc,dx as fast as and.b d6,dx? if so, moveq #-4,d6 not needed
- (-6,a3)/(-4,a3)/(-2,a3) faster than subq.l #6,a3 and 3x(a3)+? (postinc as fast as indirect displacement)

Branching looks OK to me (G's weight is 3 so it makes sense to assume branch not taken when comparing d4 with d3/d5).

EDIT:
So much for taking a nap, now I can't shut my brain off...

Code:

;  move.w  #512-1,-(sp)
...
;  move.w  #640-1,d7
  move.l  #(512-1)<<16+(640-1),d7

  dbf d7,.loopx
..
;  subq.w  #1,(sp)
  sub.l #(2<<16)-640,d7
  bge     .loopy
;  addq.l  #2,sp

Last edited by a/b; 22 June 2017 at 09:28.

Thorham · 22 June 2017, 13:16

Thanks, but it's not that easy.

Quote:

Originally Posted by a/b

- lea (ham8.colorTable.w,pc,d6.w*8),a3 is out of 8-bit range

That one might be a bit of a problem, because the table is 32 KB.
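
The table size matches what the index computation in the render loop seems to produce. Below is a small Python model of it, assuming the shifts operate on the low word of d6 as written and that each entry is 8 bytes (taken from the `*8` scale factor). `palette_index`, `stale`, `ENTRY_SIZE` and `TABLE_BYTES` are names made up here.

```python
def palette_index(r, g, b, stale=0xFFFC):
    """Model the move.b/lsl.w/lsr.w sequence on the low word of d6.

    stale is whatever the low word held beforehand; its high byte is
    shifted out by the two lsl.w #4 steps, so it does not matter.
    """
    d6 = (stale & 0xFF00) | r         # move.b d0,d6
    d6 = (d6 << 4) & 0xFFFF           # lsl.w  #4,d6
    d6 = (d6 & 0xFF00) | g            # move.b d1,d6
    d6 = (d6 << 4) & 0xFFFF           # lsl.w  #4,d6
    d6 = (d6 & 0xFF00) | b            # move.b d2,d6
    return d6 >> 4                    # lsr.w  #4,d6


ENTRY_SIZE = 8                        # assumed from the d6.w*8 scale factor
TABLE_BYTES = (palette_index(255, 255, 255) + 1) * ENTRY_SIZE
```

The index is the top four bits of red, green and blue packed into 12 bits, so 4096 entries of 8 bytes come to 32768 bytes.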

Quote:

Originally Posted by a/b

- and.b #$fc,dx as fast as and.b d6,dx? if so, moveq #-4,d6 not needed

On 68020/30, AND immediate is the same speed as AND register plus the moveq.

Quote:

Originally Posted by a/b

- (-6,a3)/(-4,a3)/(-2,a3) faster than subq.l #6,a3 and 3x(a3)+? (postinc as fast as indirect displacement)

Auto decrement is 1 cycle slower than auto increment (really). Furthermore, you have to move the write to memory to a place where nothing gets pipelined (the dbra gets partially pipelined now).

Quote:

Originally Posted by a/b

Branching looks OK to me (G's weight is 3 so it makes sense to assume branch not taken when comparing d4 with d3/d5).

G's case should happen the most often, so it's done first (if that's what you mean).

Quote:

Originally Posted by a/b

Code:

;  move.w  #512-1,-(sp)
...
;  move.w  #640-1,d7
  move.l  #(512-1)<<16+(640-1),d7

  dbf d7,.loopx
..
;  subq.w  #1,(sp)
  sub.l #(2<<16)-640,d7
  bge     .loopy
;  addq.l  #2,sp

That's very interesting, thanks.
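
A quick way to convince yourself the combined counter works is to step through it outside the assembler. The Python below is only a model of how `dbf` and `sub.l` are understood to treat d7 (low word counts pixels, high word counts rows). `count_pixels` is a made-up name, and the `bge` test is simplified to a sign check, which assumes no signed overflow at these sizes.

```python
def count_pixels(width=640, height=512):
    """Count inner-loop iterations driven by one packed 32-bit counter."""
    d7 = ((height - 1) << 16) + (width - 1)   # move.l #(512-1)<<16+(640-1),d7
    step = (2 << 16) - width                  # sub.l  #(2<<16)-640,d7
    pixels = 0
    while True:                               # .loopy
        while True:                           # .loopx
            pixels += 1                       # one pixel of work
            low = ((d7 & 0xFFFF) - 1) & 0xFFFF
            d7 = (d7 & 0xFFFF0000) | low      # dbf touches the low word only
            if low == 0xFFFF:                 # dbf falls through at -1
                break
        # One subtraction both borrows a row from the high word and
        # reloads the low word with width-1.
        d7 = (d7 - step) & 0xFFFFFFFF
        if d7 & 0x80000000:                   # negative: bge not taken
            break
    return pixels
```

With the defaults this visits 640*512 pixels, the same as the original pair of counters, while freeing the stack slot and the `addq.l #2,sp` cleanup.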

a/b · 22 June 2017, 17:02

I meant:

Code:

.palette
;    subq.l  #6,a3
;    move.w  (a3)+,a0
;    move.w  (a3)+,a1
;    move.w  (a3)+,a2
    move.w (-6,a3),a0
    move.w (-4,a3),a1
    move.w (-2,a3),a2
    move.b  (a3),(a6)+

But, just the same as with AND, it depends on several things. In theory it could be faster. (Ax)+ calc-ea is 2/2/2 (best/cache/worst case), (d16,pc/Ax) is 2/2/3. calc&fetch-ea is 4/4/4 vs. 3/5/6, so in the best-case scenario it's 1 cycle less and the subq is not needed.
For AND, calc&fetch-ea looks like 0/0/0 for register and 0/2/3 for immediate so, again, in the best case it's the same speed but the moveq is not needed. In theory, and assuming 020 ;P.

Let me take a look at 030... Uhm, this is significantly different.
(Ax)+ calc-ea is 0+0/2/2 (head+tail/cache/nocache) and (d16,pc/Ax) is 2+0/2/2, so yeah it's very likely slower on 030.

As for the color table: it's fine as is. I simply forgot that asm-one wants the 16-bit displacement at the end; otherwise it parses the operand as brief/old mode and then complains that it's not within 8 bits.

Thorham · 22 June 2017, 18:29

Quote:

Originally Posted by a/b

Let me take a look at 030... Uhm, this is significantly different.

It is. 16 bit immediate AND is always 4 cycles, for example. The code you posted is a few cycles slower on 68020/30. Very typical how that works.

Quote:

Originally Posted by a/b

I simply forgot that asm-one wants the 16-bit displacement at the end

Yeah, I use Barfly. It's very annoying that different assemblers have different syntax for this.

Come on guys... surely more people can see something?


