English Amiga Board

Thorham 20 June 2017 20:47

Optimizing HAM8 renderer.
 
For people who like optimizing 68020/68030 code (don't sacrifice render quality):

Code:

ham8.render
    movem.l d0-a6,-(sp)

    move.l  bmpFile,a5
    add.l  bmpFileSize,a5
    sub.l  #640*3,a5
    move.l  bmp,a6

    clr.l  d0
    clr.l  d1
    clr.l  d2
    clr.l  d3
    clr.l  d4
    clr.l  d5

    move.w  #512-1,-(sp)
.loopy

    clra    a0 ; clra is not a stock 68k mnemonic, presumably a macro (e.g. sub.l a0,a0)
    clra    a1
    clra    a2

    move.w  #640-1,d7
.loopx

; read pixel's red green and blue components (little endian)

    move.b  (a5)+,d2 ; blue
    move.b  (a5)+,d1 ; green
    move.b  (a5)+,d0 ; red

; get pointer to closest palette color

    move.b  d0,d6
    lsl.w  #4,d6
    move.b  d1,d6
    lsl.w  #4,d6
    move.b  d2,d6
    lsr.w  #4,d6

    lea    (ham8.colorTable.w,pc,d6.w*8),a3

; palette difference

    move.w  (a3)+,d3 ; red
    sub.w  d0,d3
    subx.w  d6,d6
    eor.w  d6,d3

    move.w  (a3)+,d4 ; green
    sub.w  d1,d4
    subx.w  d6,d6
    eor.w  d6,d4

    move.w  (a3)+,d5 ; blue
    sub.w  d2,d5
    subx.w  d6,d6
    eor.w  d6,d5

; calculate weighted x2 x3 x1 threshold

    add.l  d4,d3
    add.l  d3,d3
    add.l  d4,d3
    add.l  d5,d3
    move.l  d3,a4

; ham difference

    move.l  a0,d3 ; red
    sub.w  d0,d3
    subx.w  d6,d6
    eor.w  d6,d3

    move.l  a1,d4 ; green
    sub.w  d1,d4
    subx.w  d6,d6
    eor.w  d6,d4

    move.l  a2,d5 ; blue
    sub.w  d2,d5
    subx.w  d6,d6
    eor.w  d6,d5

; 2x 3x 1x ham difference weights

    add.l  d3,d3
    move.l  d4,d6
    add.l  d4,d4
    add.l  d6,d4

; mask for ham pixels

    moveq  #-4,d6

; compare ham differences for green

    cmp.l  d4,d3
    bgt.s  .red
    cmp.l  d4,d5
    bgt.s  .blue

; check weighted threshold

    add.l  d5,d3
    cmp.l  d3,a4
    ble.s  .palette

; update ham color, set green ham code, write pixel

    and.b  d6,d1
    move.l  d1,a1
    addq.l  #3,d1
    move.b  d1,(a6)+

    dbra    d7,.loopx
    bra.s  .next

; compare ham differences for red

.red
    cmp.l  d3,d5
    bgt.s  .blue

; check weighted threshold

    add.l  d5,d4
    cmp.l  d4,a4
    ble.s  .palette

; update ham color, set red ham code, write pixel

    and.b  d6,d0
    move.l  d0,a0
    addq.l  #2,d0
    move.b  d0,(a6)+

    dbra    d7,.loopx
    bra.s  .next

; check weighted threshold

.blue
    add.l  d4,d3
    cmp.l  d3,a4
    ble.s  .palette

; update ham color, set blue ham code, write pixel

    and.b  d6,d2
    move.l  d2,a2
    addq.l  #1,d2
    move.b  d2,(a6)+

    dbra    d7,.loopx
    bra.s  .next

; write palette color and update current ham color

.palette

    subq.l  #6,a3

    move.w  (a3)+,a0
    move.w  (a3)+,a1
    move.w  (a3)+,a2
    move.b  (a3),(a6)+

    dbra    d7,.loopx

.next
    sub.l  #640*6,a5

    subq.w  #1,(sp)
    bge    .loopy

    addq.l  #2,sp

    movem.l (sp)+,d0-a6
    rts

ham8.render_end
ham8.colorTable
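
For anyone who wants to check a change against the original decision logic, here is a rough C model of what one pass through .loopx appears to do. It is a reading of the listing above, not the author's reference code: the names TableEntry, HamState, ham_diff and ham8_pixel are made up for this sketch, and the 8-byte entry layout (three words plus the pixel byte) is inferred from how a3 is used. Note that the sub.w/subx.w/eor.w sequence yields |x| - 1 for negative differences, which the model reproduces.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of one 8-byte colour-table entry as the listing reads
   it: three words (R, G, B), then the byte written for a palette pixel. */
typedef struct {
    int16_t r, g, b;
    uint8_t pixel;
    uint8_t pad;      /* assumed padding, never read by the listing */
} TableEntry;

/* Running HAM colour, i.e. what a0/a1/a2 hold in the listing. */
typedef struct { int r, g, b; } HamState;

/* sub.w / subx.w / eor.w: |a - b| when a >= b, |a - b| - 1 otherwise. */
static int ham_diff(int a, int b) {
    int d = a - b;
    return d < 0 ? ~d : d;
}

/* Returns the byte for one pixel and updates the running HAM colour.
   e is the table entry already looked up for (r, g, b). */
static uint8_t ham8_pixel(int r, int g, int b, const TableEntry *e, HamState *h) {
    assert(r >= 0 && r < 256 && g >= 0 && g < 256 && b >= 0 && b < 256);

    /* Error of the palette candidate, weighted 2:3:1 (R:G:B). */
    int pal = 2 * ham_diff(e->r, r) + 3 * ham_diff(e->g, g) + ham_diff(e->b, b);

    /* Weighted error per channel if the current HAM colour is kept. */
    int er = 2 * ham_diff(h->r, r);
    int eg = 3 * ham_diff(h->g, g);
    int eb = ham_diff(h->b, b);

    /* Pick the channel with the largest error, same tie-breaking as the
       bgt.s chain; rest is the error left in the other two channels. */
    int code, rest;
    if (er <= eg && eb <= eg)     { code = 3; rest = er + eb; }  /* green */
    else if (er > eg && eb <= er) { code = 2; rest = eg + eb; }  /* red   */
    else                          { code = 1; rest = er + eg; }  /* blue  */

    if (pal <= rest) {            /* ble.s .palette */
        h->r = e->r; h->g = e->g; h->b = e->b;
        return e->pixel;
    }
    if (code == 3) { h->g = g & 0xFC; return (uint8_t)(h->g + 3); }
    if (code == 2) { h->r = r & 0xFC; return (uint8_t)(h->r + 2); }
    h->b = b & 0xFC;
    return (uint8_t)(h->b + 1);
}
```

Calling ham8_pixel once per source pixel, left to right, with the state reset to zero at the start of each row, should give the same bytes as the assembly for a table with the same contents.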


Thorham 22 June 2017 04:33

I refuse to believe no one sees any optimizations at all.

a/b 22 June 2017 08:21

I'm more into 000/040...
Anyway, three minor things after taking a quick look at the code and 020 tables:
- lea (ham8.colorTable.w,pc,d6.w*8),a3 is out of 8-bit range
- and.b #$fc,dx as fast as and.b d6,dx? if so, moveq #-4,d6 not needed
- (-6,a3)/(-4,a3)/(-2,a3) faster than subq.l #6,a3 and 3x(a3)+? (postinc as fast as indirect displacement)

Branching looks OK to me (G's weight is 3 so it makes sense to assume branch not taken when comparing d4 with d3/d5).

EDIT:
So much for taking a nap, now I can't shut my brain off...
Code:

;  move.w  #512-1,-(sp)
...
;  move.w  #640-1,d7
  move.l  #((512-1)<<16)+(640-1),d7 ; extra parentheses: operator precedence differs between assemblers

  dbf d7,.loopx
..
;  subq.w  #1,(sp)
  sub.l #(2<<16)-640,d7
  bge    .loopy
;  addq.l  #2,sp
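
The constant in the sub.l is easy to get wrong by one, so here is a small C model of the packed counter that can be used to check it. This is only a sketch of the arithmetic, not part of the renderer: packed_loop_iterations is a made-up name, and d7 is modelled as a plain 32-bit unsigned value with dbf acting on its low word only.

```c
#include <assert.h>
#include <stdint.h>

/* Counts how often the loop body runs when the row counter lives in the
   high word of d7 and the column counter in the low word. */
static long packed_loop_iterations(uint32_t rows, uint32_t cols) {
    assert(rows >= 1 && rows <= 0x8000 && cols >= 1 && cols <= 0x10000);

    /* move.l #((rows-1)<<16)+(cols-1),d7 */
    uint32_t d7 = ((rows - 1) << 16) + (cols - 1);
    long body = 0;

    do {
        uint16_t lo;
        do {
            body++;                          /* .loopx body */
            /* dbf d7,.loopx: decrement the low word only and fall
               through once it has wrapped to $FFFF */
            lo = (uint16_t)(d7 - 1);
            d7 = (d7 & 0xFFFF0000u) | lo;
        } while (lo != 0xFFFF);

        /* sub.l #(2<<16)-cols,d7: takes one row off the high word and
           turns the $FFFF in the low word back into cols-1 */
        d7 -= (2u << 16) - cols;
    } while ((d7 & 0x80000000u) == 0);       /* bge .loopy */

    return body;
}
```

With rows = 512 and cols = 640 the model runs the body 512*640 times, which matches the original two-counter version.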


Thorham 22 June 2017 13:16

Thanks, but it's not that easy :D

Quote:

Originally Posted by a/b (Post 1166802)
- lea (ham8.colorTable.w,pc,d6.w*8),a3 is out of 8-bit range

That one might be a bit of a problem, because the table is 32kb.
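
To make the 32 KB figure concrete, here is a hedged C sketch of the table geometry as it can be inferred from the listing: the index is built from the top four bits of each component and scaled by 8 in the lea. The struct layout and the names ColorTableEntry, color_table_index and color_table_bytes are assumptions made for illustration, and the size check relies on a compiler that lays the struct out in 8 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of one entry: the listing reads three words and then one
   byte from it, and indexes the table with d6.w*8. */
typedef struct {
    int16_t r, g, b;   /* colour of the closest palette entry */
    uint8_t pixel;     /* byte written for a palette pixel */
    uint8_t pad;       /* assumed filler up to 8 bytes */
} ColorTableEntry;

enum { COLOR_TABLE_ENTRIES = 1 << 12 };   /* 4 bits each of R, G and B */

/* Same value the move.b/lsl.w/lsr.w sequence leaves in d6.w. */
static unsigned color_table_index(unsigned r, unsigned g, unsigned b) {
    assert(r < 256 && g < 256 && b < 256);
    return ((r >> 4) << 8) | ((g >> 4) << 4) | (b >> 4);
}

/* Total table size in bytes. */
static size_t color_table_bytes(void) {
    return (size_t)COLOR_TABLE_ENTRIES * sizeof(ColorTableEntry);
}
```

4096 entries of 8 bytes each come to 32768 bytes, so every entry is reachable with a 16-bit displacement plus the scaled index as long as the table sits close enough to the lea.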

Quote:

Originally Posted by a/b (Post 1166802)
- and.b #$fc,dx as fast as and.b d6,dx? if so, moveq #-4,d6 not needed

On the 68020/30, AND immediate takes the same time as AND register plus the moveq.

Quote:

Originally Posted by a/b (Post 1166802)
- (-6,a3)/(-4,a3)/(-2,a3) faster than subq.l #6,a3 and 3x(a3)+? (postinc as fast as indirect displacement)

Auto decrement is 1 cycle slower than auto increment (really). Furthermore, you have to move the write to memory to a place where nothing gets pipelined (the dbra gets partially pipelined now).

Quote:

Originally Posted by a/b (Post 1166802)
Branching looks OK to me (G's weight is 3 so it makes sense to assume branch not taken when comparing d4 with d3/d5).

G's case should happen the most often, so it's done first (if that's what you mean).

Quote:

Originally Posted by a/b (Post 1166802)
Code:

;  move.w  #512-1,-(sp)
...
;  move.w  #640-1,d7
  move.l  #((512-1)<<16)+(640-1),d7

  dbf d7,.loopx
..
;  subq.w  #1,(sp)
  sub.l #(2<<16)-640,d7
  bge    .loopy
;  addq.l  #2,sp


That's very interesting, thanks.

a/b 22 June 2017 17:02

I meant:
Code:

.palette
;    subq.l  #6,a3
;    move.w  (a3)+,a0
;    move.w  (a3)+,a1
;    move.w  (a3)+,a2
    move.w (-6,a3),a0
    move.w (-4,a3),a1
    move.w (-2,a3),a2
    move.b  (a3),(a6)+

But, just as with AND, it depends on several things. In theory it could be faster. Calc-ea for (Ax)+ is 2/2/2 (best/cache/worst case) and for (d16,PC)/(d16,Ax) it is 2/2/3. Calc&fetch-ea is 4/4/4 vs. 3/5/6, so in the best case it's 1 cycle less and the subq is not needed.
For AND, calc&fetch-ea looks like 0/0/0 for a register and 0/2/3 for an immediate so, again, in the best case it's the same speed but the moveq is not needed. In theory, and assuming 020 ;P.

Let me take a look at 030... Uhm, this is significantly different.
(Ax)+ calc-ea is 0+0/2/2 (head+tail/cache/nocache) and (d16,pc/Ax) is 2+0/2/2, so yeah it's very likely slower on 030.

And regarding the color table: it's fine as is. I simply forgot that asm-one wants the 16-bit displacement at the end; otherwise it parses the operand as brief (old-style) mode and then complains that it's not within 8-bit range.

Thorham 22 June 2017 18:29

Quote:

Originally Posted by a/b (Post 1166907)
Let me take a look at 030... Uhm, this is significantly different.

It is. 16 bit immediate AND is always 4 cycles, for example. The code you posted is a few cycles slower on 68020/30. Very typical how that works.

Quote:

Originally Posted by a/b (Post 1166907)
I simply forgot that asm-one wants the 16-bit displacement at the end

Yeah, I use Barfly. It's very annoying that different assemblers have different syntax for this.

Come on guys... surely more people can see something?

