English Amiga Board - Optimizing HAM8 renderer.

For people who like optimizing 68020/68030 code (don't sacrifice render quality):

Code:

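ham8.render

    movem.l d0-a6,-(sp)         ; save everything except sp

; a5 = start of the last row in the file, which is the top image row
; (24-bit BMP stores rows bottom-up); a6 = chunky output, 1 byte per pixel

    move.l  bmpFile,a5
    add.l   bmpFileSize,a5
    sub.l   #640*3,a5
    move.l  bmp,a6

    clr.l   d0
    clr.l   d1
    clr.l   d2
    clr.l   d3
    clr.l   d4
    clr.l   d5

    move.w  #512-1,-(sp)        ; row counter on the stack: all data registers are in use

.loopy

; reset the current HAM colour (a0/a1/a2 = red/green/blue) to 0 for each row
; (68k has no clr for address registers, so subtract each from itself)

    suba.l  a0,a0
    suba.l  a1,a1
    suba.l  a2,a2

    move.w  #640-1,d7

.loopx

; read the pixel's blue, green and red bytes (24-bit BMP stores BGR)

    move.b  (a5)+,d2 ; blue
    move.b  (a5)+,d1 ; green
    move.b  (a5)+,d0 ; red

; get pointer to closest palette color: build the 12-bit index rrrrggggbbbb
; from the top nibbles, then scale it by the 8-byte table entry size

    move.b  d0,d6
    lsl.w   #4,d6
    move.b  d1,d6
    lsl.w   #4,d6
    move.b  d2,d6
    lsr.w   #4,d6

    lea     (ham8.colorTable.w,pc,d6.w*8),a3

; palette difference: sub/subx/eor gives |table - pixel| without a branch
; (one less than the true value when the difference is negative)

    move.w  (a3)+,d3 ; red
    sub.w   d0,d3
    subx.w  d6,d6
    eor.w   d6,d3

    move.w  (a3)+,d4 ; green
    sub.w   d1,d4
    subx.w  d6,d6
    eor.w   d6,d4

    move.w  (a3)+,d5 ; blue
    sub.w   d2,d5
    subx.w  d6,d6
    eor.w   d6,d5

; weighted palette error = 2*red + 3*green + 1*blue, kept in a4

    add.l   d4,d3
    add.l   d3,d3
    add.l   d4,d3
    add.l   d5,d3
    move.l  d3,a4

; ham difference: same trick against the current HAM colour

    move.l  a0,d3 ; red
    sub.w   d0,d3
    subx.w  d6,d6
    eor.w   d6,d3

    move.l  a1,d4 ; green
    sub.w   d1,d4
    subx.w  d6,d6
    eor.w   d6,d4

    move.l  a2,d5 ; blue
    sub.w   d2,d5
    subx.w  d6,d6
    eor.w   d6,d5

; 2x 3x 1x ham difference weights

    add.l   d3,d3
    move.l  d4,d6
    add.l   d4,d4
    add.l   d6,d4

; mask for ham pixels: $FC keeps the 6 data bits and clears the 2 control bits

    moveq   #-4,d6

; compare ham differences, green first: modify the largest weighted difference

    cmp.l   d4,d3
    bgt.s   .red
    cmp.l   d4,d5
    bgt.s   .blue

; check weighted threshold: use the palette colour unless its error is
; larger than what remains after modifying green (red + blue)

    add.l   d5,d3
    cmp.l   d3,a4
    ble.s   .palette

; update ham color, set green ham code (%11), write pixel

    and.b   d6,d1
    move.l  d1,a1
    addq.l  #3,d1
    move.b  d1,(a6)+

    dbra    d7,.loopx
    bra.s   .next

; compare ham differences for red (red already beats green)

.red
    cmp.l   d3,d5
    bgt.s   .blue

; check weighted threshold (remaining error: green + blue)

    add.l   d5,d4
    cmp.l   d4,a4
    ble.s   .palette

; update ham color, set red ham code (%10), write pixel

    and.b   d6,d0
    move.l  d0,a0
    addq.l  #2,d0
    move.b  d0,(a6)+

    dbra    d7,.loopx
    bra.s   .next

; check weighted threshold (remaining error: red + green)

.blue
    add.l   d4,d3
    cmp.l   d3,a4
    ble.s   .palette

; update ham color, set blue ham code (%01), write pixel

    and.b   d6,d2
    move.l  d2,a2
    addq.l  #1,d2
    move.b  d2,(a6)+

    dbra    d7,.loopx
    bra.s   .next

; write palette color and update current ham color

.palette

    subq.l  #6,a3

    move.w  (a3)+,a0
    move.w  (a3)+,a1
    move.w  (a3)+,a2
    move.b  (a3),(a6)+

    dbra    d7,.loopx

.next
    sub.l   #640*6,a5           ; back to the start of the previous row in the file

    subq.w  #1,(sp)
    bge     .loopy

    addq.l  #2,sp               ; drop the row counter

    movem.l (sp)+,d0-a6
    rts

ham8.render_end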

ham8.colorTable
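Two of the register tricks in the inner loop are easy to check off-target. Below is a minimal Python model, assuming 16-bit word semantics. It covers the nibble-packing sequence that builds the 12-bit table index and the sub/subx/eor absolute difference. The function names are hypothetical and do not come from the post.

```python
# Models 68k word (16-bit) register behaviour with plain Python ints.
WORD = 0xFFFF  # word operations wrap at 16 bits


def color_index(r: int, g: int, b: int, d6: int = WORD) -> int:
    """Mimic move.b/lsl.w/lsr.w: build the 12-bit index rrrrggggbbbb from
    the top nibble of each 8-bit component. Stale bits in d6 get shifted
    out, so its starting value does not matter."""
    d6 = (d6 & 0xFF00) | r           # move.b d0,d6
    d6 = (d6 << 4) & WORD            # lsl.w  #4,d6
    d6 = (d6 & 0xFF00) | g           # move.b d1,d6
    d6 = (d6 << 4) & WORD            # lsl.w  #4,d6
    d6 = (d6 & 0xFF00) | b           # move.b d2,d6
    return d6 >> 4                   # lsr.w  #4,d6


def approx_abs_diff(a: int, b: int) -> int:
    """Mimic sub.w / subx.w d6,d6 / eor.w d6: when a < b the borrow turns
    the mask into $FFFF, and the eor yields ~(a - b) = b - a - 1."""
    diff = (a - b) & WORD            # sub.w: sets X on borrow
    mask = WORD if a < b else 0      # subx.w d6,d6 computes 0 - 0 - X
    return diff ^ mask               # eor.w  d6,dN
```

For example, `color_index(0xAB, 0xCD, 0xEF)` gives `0xACE`. `approx_abs_diff(3, 10)` gives 6 rather than 7, which is the small bias the eor trick accepts in exchange for avoiding a branch.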

I refuse to believe no one sees any optimizations at all.

I'm more into 000/040...
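Anyway, three minor things after a quick look at the code and the 020 timing tables:
- lea (ham8.colorTable.w,pc,d6.w*8),a3: the displacement is out of 8-bit range, so it can't use the brief (d8,PC,Xn) format and needs the slower full extension word format
- is and.b #$fc,dx as fast as and.b d6,dx? If so, moveq #-4,d6 isn't needed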
- (-6,a3)/(-4,a3)/(-2,a3) faster than subq.l #6,a3 and 3x(a3)+? (postinc as fast as indirect displacement)
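The (-6,a3)/(-4,a3)/(-2,a3) suggestion depends on the table entry layout that the loop implies. Each entry holds three words followed by the palette pixel byte, at the 8-byte stride given by d6.w*8. Here is a small Python model of that layout. The trailing pad byte and the function names are assumptions and are not stated in the thread.

```python
import struct

# Hypothetical model of one ham8.colorTable entry as the loop reads it:
# red.w, green.w, blue.w, then the palette pixel byte. The 8th byte is
# assumed to be padding so entries match the d6.w*8 stride. 68k is big-endian.
ENTRY = struct.Struct(">HHHBx")
assert ENTRY.size == 8


def entry_offset(index: int) -> int:
    """Byte offset that lea (ham8.colorTable.w,pc,d6.w*8),a3 adds."""
    return index * ENTRY.size


def palette_fields(table: bytes, index: int) -> tuple[int, int, int, int]:
    """Read an entry the way the suggested .palette path would: a3 has
    already moved past the three words, so no subq.l #6,a3 is needed."""
    a3 = entry_offset(index) + 6                        # after 3x move.w (a3)+
    red = struct.unpack_from(">H", table, a3 - 6)[0]    # move.w (-6,a3),a0
    green = struct.unpack_from(">H", table, a3 - 4)[0]  # move.w (-4,a3),a1
    blue = struct.unpack_from(">H", table, a3 - 2)[0]   # move.w (-2,a3),a2
    pixel = table[a3]                                   # move.b (a3),(a6)+
    return red, green, blue, pixel
```

With `table = ENTRY.pack(0, 0, 0, 0) + ENTRY.pack(0x12, 0x34, 0x56, 0x14)`, the call `palette_fields(table, 1)` returns `(0x12, 0x34, 0x56, 0x14)`.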

Branching looks OK to me (G's weight is 3 so it makes sense to assume branch not taken when comparing d4 with d3/d5).
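The branch order discussed above can be traced in a Python model of a single pixel decision. This is a sketch for checking the logic, not a replacement for the asm. `ham8_pixel` and `_diff` are hypothetical names, and the table entry is passed in directly rather than looked up.

```python
RGB = tuple[int, int, int]
RED, GREEN, BLUE = 0, 1, 2
HAM_CODE = {RED: 0b10, GREEN: 0b11, BLUE: 0b01}  # low two bits, as in the posted loop


def _diff(a: int, b: int) -> int:
    """sub.w/subx.w/eor.w on 8-bit values: |a - b|, one less when a < b."""
    return a - b if a >= b else b - a - 1


def ham8_pixel(pixel: RGB, ham: RGB, entry: tuple[int, int, int, int]) -> tuple[int, RGB]:
    """Return (output byte, new current HAM colour) for one pixel,
    following the compare/branch order of the posted inner loop."""
    r, g, b = pixel
    pr, pg, pb, pal_byte = entry
    pal_err = 2 * _diff(pr, r) + 3 * _diff(pg, g) + _diff(pb, b)  # a4
    errs = (2 * _diff(ham[RED], r),     # d3
            3 * _diff(ham[GREEN], g),   # d4
            _diff(ham[BLUE], b))        # d5
    er, eg, eb = errs
    if er > eg:                          # bgt.s .red
        channel = BLUE if eb > er else RED
    elif eb > eg:                        # bgt.s .blue
        channel = BLUE
    else:                                # no branch taken: modify green
        channel = GREEN
    ham_err = sum(errs) - errs[channel]  # error left after modifying channel
    if pal_err <= ham_err:               # ble.s .palette
        return pal_byte, (pr, pg, pb)
    value = pixel[channel] & 0xFC        # and.b d6,dN with the $FC mask
    new_ham = list(ham)
    new_ham[channel] = value             # move.l dN,aN
    return value | HAM_CODE[channel], (new_ham[0], new_ham[1], new_ham[2])  # addq.l #code,dN
```

Ties between channels go to green, and a palette error equal to the remaining HAM error selects the palette entry. Both behaviours mirror the bgt/ble conditions in the loop.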

EDIT:
So much for taking a nap, now I can't switch my brain off...

Code:

;  move.w  #512-1,-(sp)
...
;  move.w  #640-1,d7
  move.l  #(512-1)<<16+(640-1),d7

  dbf d7,.loopx
..
;  subq.w  #1,(sp)
  sub.l #(2<<16)-640,d7
  bge     .loopy
;  addq.l  #2,sp
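The packed counter works because dbf only touches the low word. When a row finishes, the low word is $FFFF. Subtracting (2<<16)-640 is the same as adding 640 and subtracting 2<<16. Adding 640 to $FFFF leaves 639 in the low word and carries one into the high word, so the high word drops by one overall. The Python simulation below confirms the iteration counts. The function name is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def count_iterations(rows: int = 512, cols: int = 640) -> tuple[int, int]:
    """Simulate the single-register loop counter: row count in the high
    word, pixel count in the low word. Returns (rows_run, pixels_run)."""
    d7 = ((rows - 1) << 16) + (cols - 1)      # move.l #(512-1)<<16+(640-1),d7
    step = (2 << 16) - cols                   # operand of sub.l #(2<<16)-640,d7
    rows_run = pixels_run = 0
    while True:
        rows_run += 1
        while True:                           # .loopx body, then dbf d7,.loopx
            pixels_run += 1
            low = ((d7 & 0xFFFF) - 1) & 0xFFFF
            d7 = (d7 & 0xFFFF0000) | low      # dbf decrements only the low word
            if low == 0xFFFF:                 # falls through once it reaches -1
                break
        d7 = (d7 - step) & MASK32             # sub.l #(2<<16)-640,d7
        if d7 & 0x80000000:                   # bge: no overflow here, so sign bit only
            break
    return rows_run, pixels_run
```

`count_iterations()` runs 512 rows of 640 pixels each, matching the original stack-based counter. This works as long as the pixel count fits in the low word.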

Thanks, but it's not that easy :D

Quote: