21 July 2024, 17:08 | #1 |
This cat is no more
Join Date: Dec 2004
Location: FRANCE
Age: 52
Posts: 8,417
|
Can someone help me optimize this blitter routine?
I wrote it myself, so I'm not questioning it too much, but maybe I'm missing something BIG...
The inputs are pretty easy to understand. The code supports vertical clipping and works on A0, which points to a list of bitplane pointers (16 pixel width). If a bitplane pointer is 0, that plane is skipped (which is a big optimization already). I also chose not to "cookie cut" the background when a bitplane is 0 (which can lead to strange effects when BOBs overlap, but in practice it's barely noticeable). CHECK_BLITTER_BOUNDS is only enabled in "developer" mode. WAIT_BLIT is a macro that sets the "blitter nasty" flag, waits for the blitter, then clears "blitter nasty" again. Code:
.macro WAIT_BLIT
    move.w  #0x8400,(dmacon,a5)     | blitter high priority
wait\@:
    BTST    #6,(dmaconr,a5)
    BNE.S   wait\@
    move.w  #0x0400,(dmacon,a5)     | blitter normal priority
.endm
Code:
* < A5: custom
* < D0.W,D1.W: x,y
* < A0: source (pointer on array of planes)
* < A1: destination fg plane, also background to mix with cookie cut fg plane
* < A3: source mask for cookie cut
* < D2: width in bytes (inc. 2 extra for shifting)
* < D3: number of planes
* < D4: height. If negative, source is copied with negative modulo (flip)
* < D5: y offset for source planes
* blit mask set
* returns: start of destination in A1 (computed from old A1+X,Y)
* trashes: a1
blit_planes_any_internal_cookie_cut:
    movem.l d0-d7/a2/a4,-(a7)
* pre-compute the maximum of shit here
    tst.w   d4
    bpl.b   1f
* inverted y blit
    sub.w   d4,d1               | pre-add height to d1
    subq.w  #1,d1               | minus one
1:
    tst     d1
    beq.b   2f                  | optim
    cmp.w   #NB_LINES,d1
    jcc     8f                  | too low, won't be drawn, may as well optimize
    lea     mulNB_BYTES_PER_ROW_table,a4
    .ifdef  NO68020
    add.w   d1,d1
    move.w  (a4,d1.w),d1        | y times 40
    .else
    move.w  (a4,d1.w*2),d1      | y times 40
    .endif
2:
    move.w  d5,-(a7)
    moveq   #0,d5
    move.w  #0x0fca,d5          | B+C-A->D cookie cut
    swap    d5
    moveq   #0,d6               | make sure D6.L is zero!!
    move.w  d0,d6
    beq.b   4f
    lsr.w   #3,d0
    bclr    #0,d0
    and.w   #0xF,d6
    beq.b   3f                  | if 0 shift, optimize a few instructions
    lsl.l   #8,d6
    lsl.l   #4,d6
    or.w    d6,d5               | add shift to mask (bltcon1)
    swap    d6
    clr.w   d6
    or.l    d6,d5               | add shift
3:
    add.w   d0,d1
4:
* make offset even. Blitter will ignore odd address
* but a 68000 CPU doesn't and since we RETURN A1...
    bclr    #0,d1
    add.w   d1,a1               | plane position (D1 < 0x7FFF, 288*40=0x2D00)
    move.w  #NB_BYTES_PER_ROW,d0
    tst.w   d4
    bpl.b   5f
    neg.w   d0
    neg.w   d4                  | make d4 positive again
5:
    sub.w   d2,d0               | blit width
    lsl.w   #6,d4
    lsr.w   #1,d2
    add.w   d2,d4               | blit height
* always the same settings (ATM)
* prepare d1
    moveq   #0,d1
    move.w  #0x0BCA,d1
    swap    d1
    or.l    d6,d1
* now just wait for blitter ready to write all registers
    WAIT_BLIT
* blitter registers set
    clr.w   bltamod(a5)         | A modulo=bytes to skip between lines
    clr.w   bltbmod(a5)         | B modulo=bytes to skip between lines
    move.l  d5,d7               | save cookie cut bltcon
    move.w  (a7)+,d5
    move.w  d0,bltcmod(a5)      | C modulo
    move.w  d0,bltdmod(a5)      | D modulo
    add.w   d5,a3               | apply to mask too
    subq    #1,d3
    beq.b   7f
    subq    #1,d3
6:
    jbsr    process_1_plane
    lea     (BG_SCREEN_PLANE_SIZE,a1),a1
    WAIT_BLIT
    dbf     d3,6b
7:
    jbsr    process_1_plane
8:
    movem.l (a7)+,d0-d7/a2/a4
    rts

process_1_plane:
    move.l  a3,bltapt(a5)       | source graphic top left corner (mask)
    move.l  (a0)+,d0
    jeq     63f                 | do nothing ATM see if it works
    move.l  d0,a4
    add.w   d5,a4
    bra.b   61f
60:
* source is 0: just apply mask (less bandwidth lost) and change bltcon
    move.l  d1,bltcon0(a5)      | sets con0 and con1: C-A->D cookie cut, B fixed
    clr.w   bltbdat(a5)         | B word is zero
    bra.b   62f
61:
* non-zero: set data source & bltcon
    move.l  d7,bltcon0(a5)      | sets con0 and con1: C-A+B->D cookie cut full
    move.l  a4,bltbpt(a5)       | source graphic top left corner
62:
    CHECK_BLITTER_BOUNDS
    move.l  a1,bltcpt(a5)       | pristine background top (bottom) left corner
    move.l  a1,bltdpt(a5)       | destination top (bottom) left corner
    move.w  d4,bltsize(a5)      | rectangle size, starts blit
63:
    rts
|
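The register values the routine builds can be sanity-checked off-target with a small Python model. This is only a sketch: the function names are hypothetical, NB_BYTES_PER_ROW is assumed to be 40 (from the "y times 40" comments), and clipping and the y adjustment for flipped blits are left out.

```python
def cookie_cut_bltcon(x: int) -> int:
    """32-bit value written to BLTCON0/BLTCON1 (the routine's D7)."""
    shift = x & 0xF
    bltcon0 = 0x0FCA | (shift << 12)  # channels A-D on, minterm 0xCA, A shift
    bltcon1 = shift << 12             # B shift
    return (bltcon0 << 16) | bltcon1


def bltsize(height: int, width_bytes: int) -> int:
    """BLTSIZE word: height in bits 15-6, width in words in bits 5-0."""
    return ((height << 6) + (width_bytes >> 1)) & 0xFFFF


def dest_offset(x: int, y: int, bytes_per_row: int = 40) -> int:
    """Even byte offset added to A1 (bytes_per_row is an assumption)."""
    return (y * bytes_per_row + ((x >> 3) & ~1)) & ~1


def cd_modulo(width_bytes: int, flipped: bool, bytes_per_row: int = 40) -> int:
    """Modulo written to BLTCMOD/BLTDMOD; the row is negated for a flipped blit."""
    row = -bytes_per_row if flipped else bytes_per_row
    return row - width_bytes
```

For example, a 16-pixel-wide BOB (4 bytes including the extra shift word) that is 16 lines high at x=21, y=10 would give a bltcon value of 0x5FCA5000, a BLTSIZE of 0x0402 and a destination offset of 402.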
21 July 2024, 17:59 | #2 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,286
|
A few micro optimizations that spring to mind:
Code:
    lsl.l   #8,d6
    lsl.l   #4,d6
can be replaced with
    ror.w   #4,d6
But really, that whole sequence... Are you blitting a lot at x=0? (That optimization is probably not worth it.) So:
Code:
    moveq   #$f,d6
    and.w   d0,d6
    beq.b   .noshift
    ror.w   #4,d6
    ...
And of course make process_1_plane a macro and inline it. |
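A quick way to convince yourself that the shorter shift sequence is equivalent is to model both in Python. This is a sketch, not emulator-grade code: the helper names are made up for illustration, and D6 is assumed to hold only the 4-bit shift (upper word zero), as it does in the original routine.

```python
def lsl_l(value: int, count: int) -> int:
    """Model of lsl.l: 32-bit logical shift left."""
    return (value << count) & 0xFFFFFFFF


def ror_w(value: int, count: int) -> int:
    """Model of ror.w: rotate the low word right, leave the upper word alone."""
    low = value & 0xFFFF
    rotated = ((low >> count) | (low << (16 - count))) & 0xFFFF
    return (value & 0xFFFF0000) | rotated


def shift_original(x: int) -> int:
    d6 = x & 0xF           # moveq #0,d6 / move.w d0,d6 / and.w #0xF,d6
    d6 = lsl_l(d6, 8)      # lsl.l #8,d6
    return lsl_l(d6, 4)    # lsl.l #4,d6


def shift_suggested(x: int) -> int:
    d6 = 0xF & x           # moveq #$f,d6 / and.w d0,d6
    return ror_w(d6, 4)    # ror.w #4,d6
```

Both functions put the shift in bits 15-12 of the low word for every x, which is why ror.w (rather than ror.l) is enough here.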
21 July 2024, 18:06 | #3 |
This cat is no more
Join Date: Dec 2004
Location: FRANCE
Age: 52
Posts: 8,417
|
Makes sense, except that you probably mean ror.l #4,d6. Oh, since d6 can't be > 0xFFFF, OK, I get it!!
About inlining the big routine: yes, it would be good for the 68000 (though the overhead is negligible given the size of the routine), but probably not so much for the 68020. Last edited by jotd; 21 July 2024 at 18:20. |
21 July 2024, 18:29 | #4 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,286
|
Quote:
Probably won't be a win on the 020, but measurement is king of course. Whether you consider 34 cycles per loop iteration (for bsr.b+rts) negligible is up to you, but if it is, then the above is even less worthwhile. |
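The 34-cycle figure is the sum of the 68000 timings for bsr.b (18 cycles) and rts (16 cycles). A trivial sketch of how that scales with the number of planes, assuming one call to process_1_plane per plane (the function name below is hypothetical):

```python
BSR_B_CYCLES = 18  # 68000 timing for bsr.b
RTS_CYCLES = 16    # 68000 timing for rts


def call_overhead(planes: int) -> int:
    """CPU cycles spent on call/return alone for one BOB."""
    return planes * (BSR_B_CYCLES + RTS_CYCLES)
```

For a 5-plane BOB that is 170 cycles per blit, which is the amount inlining would save on a 68000.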
|
21 July 2024, 20:02 | #5 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,103
|
For me
Code:
    move.w  d5,-(a7)            ; this is a bug, because a longword is used
    moveq   #0,d5
    move.w  #0x0fca,d5          | B+C-A->D cookie cut
    swap    d5
Code:
    move.l  d5,-(a7)
    move.l  #0x0fca0000,d5      | B+C-A->D cookie cut
|
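The two ways of loading D5 can be compared with a small Python model (a sketch with hypothetical helper names). It also models the stack balance, because the original routine later restores the value with move.w (a7)+,d5: whichever size is pushed, the matching pop has to use the same size.

```python
def swap(value: int) -> int:
    """Model of swap: exchange the two words of a 32-bit register."""
    value &= 0xFFFFFFFF
    return ((value << 16) | (value >> 16)) & 0xFFFFFFFF


def load_bltcon_original() -> int:
    d5 = 0                             # moveq #0,d5
    d5 = (d5 & 0xFFFF0000) | 0x0FCA    # move.w #0x0fca,d5
    return swap(d5)                    # swap d5


def load_bltcon_suggested() -> int:
    return 0x0FCA0000                  # move.l #0x0fca0000,d5


def stack_imbalance(push_bytes: int, pop_bytes: int) -> int:
    """Bytes left on the stack after one push and one pop."""
    return push_bytes - pop_bytes
```

Both loads produce the same register value, so the single move.l is a pure size/speed win. A word push with a word pop is balanced; switching only the push to move.l would leave 2 bytes on the stack unless the pop is changed to move.l as well.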