20 June 2017, 20:47 | #1 |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,938
|
Optimizing HAM8 renderer.
For people who like optimizing 68020/68030 code (don't sacrifice render quality):
Code:
ham8.render
    movem.l d0-a6,-(sp)

    move.l  bmpFile,a5
    add.l   bmpFileSize,a5
    sub.l   #640*3,a5
    move.l  bmp,a6

    clr.l   d0
    clr.l   d1
    clr.l   d2
    clr.l   d3
    clr.l   d4
    clr.l   d5

    move.w  #512-1,-(sp)
.loopy
    clra    a0
    clra    a1
    clra    a2

    move.w  #640-1,d7
.loopx
; read pixel's red green and blue components (little endian)
    move.b  (a5)+,d2 ; blue
    move.b  (a5)+,d1 ; green
    move.b  (a5)+,d0 ; red

; get pointer to closest palette color
    move.b  d0,d6
    lsl.w   #4,d6
    move.b  d1,d6
    lsl.w   #4,d6
    move.b  d2,d6
    lsr.w   #4,d6

    lea     (ham8.colorTable.w,pc,d6.w*8),a3

; palette difference
    move.w  (a3)+,d3 ; red
    sub.w   d0,d3
    subx.w  d6,d6
    eor.w   d6,d3

    move.w  (a3)+,d4 ; green
    sub.w   d1,d4
    subx.w  d6,d6
    eor.w   d6,d4

    move.w  (a3)+,d5 ; blue
    sub.w   d2,d5
    subx.w  d6,d6
    eor.w   d6,d5

; calculate weighted x2 x3 x1 threshold
    add.l   d4,d3
    add.l   d3,d3
    add.l   d4,d3
    add.l   d5,d3

    move.l  d3,a4

; ham difference
    move.l  a0,d3 ; red
    sub.w   d0,d3
    subx.w  d6,d6
    eor.w   d6,d3

    move.l  a1,d4 ; green
    sub.w   d1,d4
    subx.w  d6,d6
    eor.w   d6,d4

    move.l  a2,d5 ; blue
    sub.w   d2,d5
    subx.w  d6,d6
    eor.w   d6,d5

; 2x 3x 1x ham difference weights
    add.l   d3,d3
    move.l  d4,d6
    add.l   d4,d4
    add.l   d6,d4

; mask for ham pixels
    moveq   #-4,d6

; compare ham differences for green
    cmp.l   d4,d3
    bgt.s   .red
    cmp.l   d4,d5
    bgt.s   .blue

; check weighted threshold
    add.l   d5,d3
    cmp.l   d3,a4
    ble.s   .palette

; update ham color, set green ham code, write pixel
    and.b   d6,d1
    move.l  d1,a1
    addq.l  #3,d1
    move.b  d1,(a6)+

    dbra    d7,.loopx
    bra.s   .next

; compare ham differences for red
.red
    cmp.l   d3,d5
    bgt.s   .blue

; check weighted threshold
    add.l   d5,d4
    cmp.l   d4,a4
    ble.s   .palette

; update ham color, set red ham code, write pixel
    and.b   d6,d0
    move.l  d0,a0
    addq.l  #2,d0
    move.b  d0,(a6)+

    dbra    d7,.loopx
    bra.s   .next

; check weighted threshold
.blue
    add.l   d4,d3
    cmp.l   d3,a4
    ble.s   .palette

; update ham color, set blue ham code, write pixel
    and.b   d6,d2
    move.l  d2,a2
    addq.l  #1,d2
    move.b  d2,(a6)+

    dbra    d7,.loopx
    bra.s   .next

; write palette color and update current ham color
.palette
    subq.l  #6,a3
    move.w  (a3)+,a0
    move.w  (a3)+,a1
    move.w  (a3)+,a2
    move.b  (a3),(a6)+

    dbra    d7,.loopx
.next
    sub.l   #640*6,a5

    subq.w  #1,(sp)
    bge     .loopy

    addq.l  #2,sp

    movem.l (sp)+,d0-a6
    rts
ham8.render_end

ham8.colorTable
|
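The listing ends at an empty ham8.colorTable label, so the table contents are not shown. Reading the code, each entry looks like 8 bytes: three words that the loop compares against the pixel's red, green and blue, followed by one byte that .palette copies straight to the output. The sketch below is only that inference written out; the label name, the colour values and the pen number are made up, and the pen<<2 encoding is an assumption based on the HAM codes being added to the low two bits elsewhere in the loop.

```asm
; Sketch only: layout inferred from how .loopx and .palette read the table.
; exampleEntry, the colour values and the pen number are hypothetical.
exampleEntry
        dc.w    $00f0           ; red of the closest palette colour (0-255 in a word)
        dc.w    $0080           ; green
        dc.w    $0010           ; blue
        dc.b    5<<2            ; assumed pixel byte: pen 5 in the top six bits,
                                ; low two bits 00 (no HAM modify code)
        dc.b    0               ; pad to 8 bytes, matching the d6.w*8 scaling
; The index in d6 is 4 bits each of red, green and blue, so 4096 entries * 8 bytes = 32 KB.
```

With this layout the three (a3)+ word reads leave a3 pointing at the pixel byte, which is why .palette can finish with move.b (a3),(a6)+.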
22 June 2017, 04:33 | #2 |
Computer Nerd
|
I refuse to believe no one sees any optimizations at all.
|
22 June 2017, 08:21 | #3 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,089
|
I'm more into 000/040...
Anyway, three minor things after taking a quick look at the code and 020 tables:
- lea (ham8.colorTable.w,pc,d6.w*8),a3 is out of 8-bit range
- and.b #$fc,dx as fast as and.b d6,dx? if so, moveq #-4,d6 not needed
- (-6,a3)/(-4,a3)/(-2,a3) faster than subq.l #6,a3 and 3x(a3)+? (postinc as fast as indirect displacement)
Branching looks OK to me (G's weight is 3 so it makes sense to assume branch not taken when comparing d4 with d3/d5).
EDIT: So much for taking a nap, now I can't shut my brain off..
Code:
;   move.w  #512-1,-(sp)
...
;   move.w  #640-1,d7
    move.l  #(512-1)<<16+(640-1),d7
    dbf     d7,.loopx
..
;   subq.w  #1,(sp)
    sub.l   #(2<<16)-640,d7
    bge     .loopy
;   addq.l  #2,sp
Last edited by a/b; 22 June 2017 at 09:28. |
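The constant (2<<16)-640 in the packed-counter suggestion is easier to trust with the numbers written out. The trace below is a sketch, assuming d7 is loaded once before .loopy and that nothing else in the loop touches its high word (the posted renderer only uses d7 in dbra). The extra parentheses around the shift are added here so the value does not depend on the assembler's operator precedence.

```asm
; Sketch: rows in the high word of d7, columns in the low word.
        move.l  #((512-1)<<16)+(640-1),d7   ; d7 = $01ff027f, loaded once
.loopy
        ; per-row setup goes here (clra a0/a1/a2 in the renderer)
.loopx
        ; per-pixel work goes here
        dbf     d7,.loopx       ; only the low word counts: 639..0, exits with it at $ffff
        ; per-row cleanup goes here (sub.l #640*6,a5 in the renderer)
        sub.l   #(2<<16)-640,d7 ; $01ffffff - $0001fd80 = $01fe027f:
                                ; row 511 -> 510, column count back to 639
        bge     .loopy          ; last row: $0000ffff - $0001fd80 is negative, loop ends
```

The instructions this replaces run once per row rather than once per pixel, so the gain is roughly 512 times a handful of cycles per frame, not a change to the inner loop.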
22 June 2017, 13:16 | #4 |
Computer Nerd
|
Thanks, but it's not that easy.
That one might be a bit of a problem, because the table is 32 KB.
That's very interesting, thanks.
|
22 June 2017, 17:02 | #5 |
Registered User
|
I meant:
Code:
.palette
;   subq.l  #6,a3
;   move.w  (a3)+,a0
;   move.w  (a3)+,a1
;   move.w  (a3)+,a2
    move.w  (-6,a3),a0
    move.w  (-4,a3),a1
    move.w  (-2,a3),a2
    move.b  (a3),(a6)+
For AND, calc&fetch-ea looks like 0/0/0 for register and 0/2/3 for immediate so, again, best case it's the same speed but moveq is not needed. In theory, and assuming 020 ;P.
Let me take a look at 030... Uhm, this is significantly different. (Ax)+ calc-ea is 0+0/2/2 (head+tail/cache/nocache) and (d16,pc/Ax) is 2+0/2/2, so yeah, it's very likely slower on 030.
And regarding the color table: it's fine as is. I simply forgot that asm-one wants the 16-bit displacement at the end, otherwise it will parse it as brief/old mode and then complain it's not within 8-bit range. |
22 June 2017, 18:29 | #6 |
Computer Nerd
|
It is. A 16-bit immediate AND is always 4 cycles, for example. The code you posted is a few cycles slower on 68020/30. Very typical how that works.
Come on guys... surely more people can see something? |
|