Can someone help me optimize this blitter routine?

jotd · 21 July 2024, 17:08

I wrote this myself, so I'm not questioning it too much, but maybe I'm missing something BIG...

The inputs are pretty easy to understand. The code supports vertical clipping and works on A0, which is a pointer to a list of bitplanes (16 pixel width). If a bitplane is 0 then it's skipped (which is a big optimization already). Also, I chose not to "cookie cut" the background if a bitplane is 0 (which can lead to strange effects when BOBs are overlaid, but in practice it's barely noticeable).

CHECK_BLITTER_BOUNDS is only enabled in "developer" mode.
WAIT_BLIT is a macro that sets the "blitter nasty" flag, waits for the blitter, and clears "blitter nasty" again.

Code:

.macro	WAIT_BLIT
	move.w	#0x8400,(dmacon,a5)		| blitter high priority
wait\@:
	BTST	#6,(dmaconr,a5)
	BNE.S	wait\@
	move.w	#0x0400,(dmacon,a5)		| blitter normal priority
.endm

It also uses a multiplication table to compute the row offset from the Y value (mulNB_BYTES_PER_ROW_table). The blitter mask is all Fs throughout the game.
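
For reference, a table like this could be generated at assembly time. This is only a sketch, assuming GNU as and that NB_LINES and NB_BYTES_PER_ROW are already defined as absolute constants at this point; row_offset is a hypothetical scratch symbol, not a name from the game source.

Code:

    .balign 2                     | word table, keep it on an even address
    .set    row_offset,0          | hypothetical scratch symbol
mulNB_BYTES_PER_ROW_table:
    .rept   NB_LINES
    .word   row_offset            | entry for line y: y * NB_BYTES_PER_ROW
    .set    row_offset,row_offset+NB_BYTES_PER_ROW
    .endr

Each entry is one word, which is why the lookup in the routine scales Y by 2 before indexing.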

Code:

* < A5: custom
* < D0.W,D1.W: x,y
* < A0: source (pointer on array of planes)
* < A1: destination fg plane, also background to mix with cookie cut fg plane
* < A3: source mask for cookie cut
* < D2: width in bytes (inc. 2 extra for shifting)
* < D3: number of planes
* < D4: height. If negative, source is copied with negative modulo (flip)
* < D5: y offset for source planes

* blit mask set
* returns: start of destination in A1 (computed from old A1+X,Y)
* trashes: a1
blit_planes_any_internal_cookie_cut:
    movem.l d0-d7/a2/a4,-(a7)
    * pre-compute the maximum of shit here
    tst.w    d4
    bpl.b    1f
    * inverted y blit
    
    sub.w    d4,d1    | pre-add height to d1
    subq.w    #1,d1    | minus one
1:
    tst    d1
    beq.b   2f    | optim
    cmp.w    #NB_LINES,d1
    jcc        8f            | too low, won't be drawn, may as well optimize
    lea        mulNB_BYTES_PER_ROW_table,a4
    .ifdef    NO68020
    add.w    d1,d1
    move.w  (a4,d1.w),d1    | y times 40
    .else
    move.w  (a4,d1.w*2),d1    | y times 40
    .endif
2:
    move.w    d5,-(a7)
    moveq    #0,d5
    move.w  #0x0fca,d5    | B+C-A->D cookie cut   
    swap    d5
    moveq    #0,d6        | make sure D6.L is zero!!
    move.w  d0,d6
    beq.b   4f
    lsr.w   #3,d0
    bclr    #0,d0
    and.w   #0xF,d6
    beq.b    3f                | if 0 shift, optimize a few instructions
    lsl.l   #8,d6
    lsl.l   #4,d6
    or.w    d6,d5            | add shift to mask (bltcon1)
    swap    d6
    clr.w   d6
    or.l    d6,d5            | add shift
3:   
    add.w   d0,d1
4:
    * make offset even. Blitter will ignore odd address
    * but a 68000 CPU doesn't and since we RETURN A1...
    bclr    #0,d1
    add.w   d1,a1       | plane position (D1 < 0x7FFF, 288*40=0x2D00)
    move.w    #NB_BYTES_PER_ROW,d0
    tst.w    d4
    bpl.b    5f
    neg.w    d0
    neg.w    d4    | make d4 positive again
5:

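    sub.w   d2,d0       | modulo = (+/-)row bytes - blit width in bytes
    lsl.w   #6,d4       | height goes in the upper bits of bltsize
    lsr.w   #1,d2       | width in bytes -> width in words
    add.w   d2,d4       | d4 = bltsize value (height*64 + width in words)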
    * always the same settings (ATM)

    * prepare d1
    moveq    #0,d1
    move.w    #0x0BCA,d1
    swap    d1
    or.l    d6,d1

    * now just wait for blitter ready to write all registers
    WAIT_BLIT
    
    * blitter registers set

    clr.w bltamod(a5)        |A modulo=bytes to skip between lines
    clr.w bltbmod(a5)        |B modulo=bytes to skip between lines
    move.l    d5,d7            | save cookie cut bltcon
    move.w    (a7)+,d5
    
    move.w  d0,bltcmod(a5)    |C modulo
    move.w  d0,bltdmod(a5)    |D modulo
                    
    add.w    d5,a3            | apply to mask too
    subq    #1,d3
    beq.b    7f
    subq    #1,d3
6:
    jbsr    process_1_plane
    lea        (BG_SCREEN_PLANE_SIZE,a1),a1
    WAIT_BLIT
    dbf        d3,6b
7:
    jbsr    process_1_plane
8:   
    movem.l (a7)+,d0-d7/a2/a4
    rts
    
process_1_plane:
    move.l a3,bltapt(a5)    |  source graphic top left corner (mask)
    move.l (a0)+,d0
    jeq    63f    | do nothing ATM see if it works
    move.l    d0,a4
    add.w    d5,a4
    bra.b    61f
60:
    * source is 0: just apply mask (less bandwidth lost) and change bltcon
    move.l    d1,bltcon0(a5)    | sets con0 and con1: C-A->D cookie cut, B fixed
     clr.w    bltbdat(a5)    |B word is zero
    bra.b    62f
61:
    * non-zero: set data source & bltcon
    move.l    d7,bltcon0(a5)    | sets con0 and con1: C-A+B->D cookie cut full
    move.l    a4,bltbpt(a5)    |source graphic top left corner
62:
    CHECK_BLITTER_BOUNDS
     move.l    a1,bltcpt(a5)    |pristine background top (bottom) left corner
    move.l    a1,bltdpt(a5)    |destination top (bottom) left corner
    move.w  d4,bltsize(a5)    |rectangle size, starts blit
63:    
    rts

paraj · 21 July 2024, 17:59

A few micro optimizations that spring to mind:

Code:

    lsl.l   #8,d6
    lsl.l   #4,d6

Faster to use

ror.w #4,d6

But really, that whole sequence... Are you blitting a lot at x=0? If not, that optimization is probably not worth it. So:

Code:

   moveq #$f,d6
   and.w d0,d6
   beq.b .noshift
   ror.w #4,d6
   ...
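
Spelled out in the GNU as syntax of the original routine, and without the zero-shift special cases, the whole shift setup might look like the sketch below. It assumes d5 already holds 0x0fca0000 and d1 the row offset, as in the posted code; it is an untested sketch, not a drop-in replacement.

Code:

    moveq   #0xf,d6
    and.w   d0,d6       | d6.l = x & 15 (moveq cleared the upper word)
    ror.w   #4,d6       | shift value now in bits 15-12
    or.w    d6,d5       | B shift into bltcon1 (low word of d5)
    swap    d6          | shift value now in bits 31-28, low word is zero
    or.l    d6,d5       | A shift into bltcon0 (high word of d5)
    lsr.w   #3,d0       | x / 8 = byte offset
    bclr    #0,d0       | round down to a word boundary
    add.w   d0,d1       | add to the row offset

With x = 0 every step produces zero, so the result is the same as taking the early-out branches, just without the tests.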

Also, clr.w to memory is not great on 68000 since it does a useless read (not dangerous here, but beware). moveq #0,tempreg followed by move.w tempreg,xxx(a5) and move.w tempreg,yyy(a5) is faster than 2x clr.w.
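
Applied to the two modulo writes in the routine, that could look like this. It is a sketch that assumes d7 is free as the temporary register at that point (in the posted code d7 is only loaded afterwards).

Code:

    moveq   #0,d7               | assumption: d7 is free here
    move.w  d7,bltamod(a5)      | A modulo = 0, plain write, no read first
    move.w  d7,bltbmod(a5)      | B modulo = 0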

And of course, make process_1_plane a macro and inline it.
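
A minimal sketch of what that could look like, using the same \@ local-label trick as WAIT_BLIT. PROCESS_1_PLANE and skip\@ are hypothetical names, and the sketch leaves out CHECK_BLITTER_BOUNDS and the currently unused "source is 0" path.

Code:

.macro	PROCESS_1_PLANE
    move.l  a3,bltapt(a5)       | mask
    move.l  (a0)+,d0            | next plane pointer
    jeq     skip\@              | null plane: no blit, as in the subroutine
    move.l  d0,a4
    add.w   d5,a4               | apply y offset to the source
    move.l  d7,bltcon0(a5)      | sets bltcon0 and bltcon1
    move.l  a4,bltbpt(a5)       | source graphic
    move.l  a1,bltcpt(a5)       | background
    move.l  a1,bltdpt(a5)       | destination
    move.w  d4,bltsize(a5)      | starts the blit
skip\@:
.endm

6:
    PROCESS_1_PLANE
    lea     (BG_SCREEN_PLANE_SIZE,a1),a1
    WAIT_BLIT
    dbf     d3,6b
7:
    PROCESS_1_PLANE

The body is expanded twice, so the code gets larger in exchange for dropping the call and return on every plane.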

jotd · 21 July 2024, 18:06

makes sense. Except that you probably mean ror.l #4,d6. Oh since d6 can't be > 0xFFFF ok I get it!!

About inlining the big routine, yes, it would be good for 68000 (but the overhead is negligible given the size of the routine), probably not so much for 68020.

paraj · 21 July 2024, 18:29

Quote:

Originally Posted by jotd

makes sense. Except that you probably mean ror.l #4,d6. Oh since d6 can't be > 0xFFFF ok I get it!!

It would move the bits into the wrong position with ror.l
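
To make the difference concrete, here is what the two rotates do to a shift value of 5 (a worked example, with the results in the comments).

Code:

    moveq   #5,d6       | d6 = 0x00000005
    ror.w   #4,d6       | rotates the low word only: d6 = 0x00005000
    moveq   #5,d6       | d6 = 0x00000005
    ror.l   #4,d6       | rotates all 32 bits:       d6 = 0x50000000

The .w form puts the value in bits 15-12 of the word, which is where the shift fields sit in bltcon0 and bltcon1.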

Also, it looks like you don't need the clr.w d6 after swap d6 (the upper word is always clear), and maybe you can rearrange the instructions a bit to avoid some of the swap logic anyway.

Quote:

Originally Posted by jotd

About inlining the big routine, yes, it would be good for 68000 (but the overhead is negligible given the size of the routine), probably not so much for 68020.

It probably won't be a win on 020, but measurement is king, of course. Whether you consider 34 cycles per loop iteration (for bsr.b+rts) negligible is up to you, but if it is, then the above is even less worthwhile.

Don_Adan · 21 July 2024, 20:02

For me

Code:

    move.w    d5,-(a7)    | this is a bug, because a longword is used
   moveq    #0,d5
   move.w  #0x0fca,d5    | B+C-A->D cookie cut   
   swap    d5

Code:

   move.l    d5,-(a7)
 
    move.l  #0x0fca0000,d5    | B+C-A->D cookie cut
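
One thing to keep in mind with this change: the push and the pop have to use the same size. A sketch of the matching lines, assuming the rest of the routine stays as posted:

Code:

    move.l  d5,-(a7)            | push y offset as a longword
    move.l  #0x0fca0000,d5      | B+C-A->D cookie cut, bltcon1 = 0
    | ... shift and offset setup unchanged ...
    move.l  d5,d7               | save cookie cut bltcon
    move.l  (a7)+,d5            | pop as a longword too, or the stack ends up off by 2

As far as the posted code shows, d5 is only used through add.w after the pop, so the original word-sized push/pop pair also keeps the stack balanced; the clearer gain is the single move.l immediate replacing moveq + move.w + swap.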



