CPU Filling vs. Blitter Filling Routine

victim · 10 February 2010, 18:57

Hello coder boys!

At the moment I am trying to program a new 3D engine.
Right now I am working on a better CPU fill routine for the 68020 to 68060 processors.

Do you have any ideas how I can improve the routine, or do you know another solution?

What do you think about flood fill or scanline algorithms?

My first experiment is very slow. Here is my result so far.

Code:

;--------------------------------------------------------------------------------
;---PROCESSOR-FILLING-Routine 10.02.2010 Sascha Müller alias Victim of Savage ---

;first build the fill tables
           bsr     InitFillTable

;other routines run here, then the fill
           bsr    ProcFill
           rts

ProcFill:            
            move.l    planebufferwork,a0
            moveq.l  #0,d0
            moveq.l  #0,d1
            lea        intab(pc),a2
            lea        dbtab(pc),a3
            move     #255,d7
pr_fyl:   lea         fntab(pc),a1
            move.l    a1,4(a3)
            lea         fitab(pc),a1
            move.l    a1,(a3)
            moveq    #40-1,d6    ;width
pr_fxl:    tst.b      d1             ;speed up by ignoring 0 bytes
            beq        pr_zer
pr_set:   tst.b      (a0)
            bne        pr_lin
            move.b    #$ff,(a0)+
            dbf        d6,pr_set
            bra        pr_ny
pr_zer:   tst.b    (a0)
            bne        pr_lin
            addq.l    #1,a0        ;addq is shorter and faster than adda.l #1
            dbf        d6,pr_zer
            bra        pr_ny
pr_lin:   ;move    #$1ff,d5
;llo:       nop
            ;dbf        d5,llo
            move.b    (a0),d0
            move.b    (a1,d0.l),(a0)+
            tst.b    (a2,d0.l)
            beq        pr_nic        ;no insert change
            move.l    0(a3),d2    ;change filltables
            move.l    4(a3),0(a3)
            move.l    d2,4(a3)
            move.l    0(a3),a1    ;->a1
            not.b    d1            ;change insert in d1
pr_nic:   dbf        d6,pr_fxl
pr_ny:    dbf        d7,pr_fyl
            rts

;---    creates three tables later used for filling    ---

InitFillTable:
            moveq.l    #0,d0            ;init routine
            moveq.l    #0,d3

            lea        fitab(pc),a0    ;fill    table
            lea        intab(pc),a1    ;insert table
            move     #255,d4            ;256 different bytes fill 
pr_bl:    move     d3,d0
            bsr        pr_bfi
            move.b   d0,(a0)+
            move.b    d1,(a1)+
            addq      #1,d3    
            dbf        d4,pr_bl
        
            lea        fitab(pc),a0
            lea        fntab(pc),a1
            move     #255,d1
pr_inv:   move.b   (a0)+,d0
            not.b      d0
            move.b    d0,(a1)+
            dbf        d1,pr_inv
            rts

pr_bfi:   moveq.l   #0,d1        ;fill one byte (d0) the way the blitter does
            moveq    #7,d2        ;to test all eight bits
pr_bfl:   btst       d2,d0
            bne        pr_ich
            tst.b      d1
            beq        pr_nb
            bset      d2,d0
pr_nb:   dbf        d2,pr_bfl
            rts
        
pr_ich:   tst.b    d1
            bne        pr_nof    
            bclr    d2,d0
pr_nof:  not.b    d1
            bra        pr_nb    

dbtab:   dc.l    fitab,fntab    ;change table
fitab:    blk.b    256,0        ;fill table    insert=0
fntab:    blk.b    256,0        ;fill table    insert=1
intab:    blk.b    256,0        ;insert change    ($ff)

so long.....
Victim

StingRay · 10 February 2010, 19:16

Since your code is not really commented (which parameters etc pp?) I don't know how it is supposed to work (and I don't feel like doing "guess work"). Your code looks kinda strange to me anyway.
Anyway, here's my flat filler which should be easy to understand.

Code:

*******************************************
*** Draw flat filled Polygons		***
*******************************************

; $VER: POLYFILLER FLAT v3.o, Wed, 24-Mar-2oo4
;	(c)oded by StingRay/[S]carab^Scoopex
;
; derived from my texture-mapper, no division table used
; maximum 3 divisions per polygon, writes 4 pixels at once
;

; d0-d2: coords
;    d3: color
;    a4: surface
;    a6: ptr to object structure


DRAW_FLAT
	moveq	#0,d5
	move.b	d3,d5		; ......cc
	cmp.w	#P_T50,SURF_FLAGS(a4)
	beq.b	.nolc
	lsl.w	#8,d5		; ....cc00
	move.b	d3,d5		; ....cccc
	move.w	d5,d3
	swap	d5		; cccc....
	move.w	d3,d5		; cccccccc

.nolc	cmp.w	d0,d2
	bge.b	.ok1
	exg.l	d0,d2
.ok1	cmp.w	d1,d2
	bge.b	.ok2
	exg.l	d1,d2
.ok2	cmp.w	d0,d1
	bge.b	.ok3
	exg.l	d0,d1
.ok3	
	
	move.b	OBJ_CLIP(a6),d7
	lea	.VARS(pc),a6
	movem.l	d0-d2,.X1(a6)	; x/y 1-3
	move.b	d7,.CLIP(a6)	; clipping status
	move.w	SURF_FLAGS(a4),.FLAGS(a6)


	move.w	d2,d7
	sub.w	d0,d7		; y3-y1
	beq.w	.exit
	move.l	d2,d6
	sub.l	d0,d6		; x3-x1
	asr.l	#8,d6
	ext.l	d7
	divs.l	d7,d6
	move.l	d6,.DXDY2(a6)


	move.w	d1,d7
	sub.w	d0,d7		; y2-y1
	bne.b	.noflat

; xend for 2nd section = x1
	move.w	.X1(a6),d1
	ext.l	d1
	lsl.l	#8,d1
	bra.w	.part2


.noflat	move.w	.X2(a6),d1
	sub.w	.X1(a6),d1
	ext.l	d7
	ext.l	d1
	lsl.l	#8,d1
	divs.l	d7,d1
	move.l	d1,.DXDY1(a6)


	*** Draw top part of triangle ***

	move.w	.X1(a6),d0
	move.w	d0,d1

	move.l	.DXDY2(a6),a2
	move.l	.DXDY1(a6),a3

	tst.b	.CLIP(a6)
	beq.b	.nclip
	move.w	.Y1(a6),a5
	move.w	.Y2(a6),d7
	bsr.b	.draw
	bra.b	.part2
	
.nclip	move.w	.Y2(a6),d7
	sub.w	.Y1(a6),d7
	move.l	ENG_CHUNKYBUFFER(pc),a0
	move.w	.Y1(a6),d6
	mulu.w	#CHUNKYX,d6
	add.l	d6,a0
	bsr.b	.draw

	*** Draw bottom part of triangle ***

.part2	move.w	.Y3(a6),d7
	sub.w	.Y2(a6),d7
	beq.w	.exit
	move.w	.X3(a6),d0
	sub.w	.X2(a6),d0
	ext.l	d7
	ext.l	d0
	lsl.l	#8,d0
	divs.l	d7,d0
	move.l	d0,.DXDY1(a6)

	move.w	.X2(a6),d0
	move.l	.DXDY1(a6),a2
	move.l	.DXDY2(a6),a3

	tst.b	.CLIP(a6)
	beq.b	.nclip2
	move.w	.Y2(a6),a5
	move.w	.Y3(a6),d7
	bra.b	.draw2

.nclip2	move.l	ENG_CHUNKYBUFFER(pc),a0
	move.w	.Y2(a6),d6
	mulu.w	#CHUNKYX,d6
	add.l	d6,a0
	bra.b	.draw2

*******************************************
*** NON-CLIPPED DRAW	NON-TRANSPARENT	***
*******************************************

; d0: x1
; d1: x2
; d5: color
; d7: height
; a0: chunkybuffer + yoffset
; a2: DXDY2 (DXDY1 for the 2nd section)
; a3: DXDY1 (DXDY2 for the 2nd section)

.draw	ext.l	d1
	lsl.l	#8,d1
.draw2	ext.l	d0
	lsl.l	#8,d0

	subq.w	#1,d7				; adapt "dbf"
	cmp.l	a2,a3
	bge.b	.swap

	move.l	d0,a2				; x left
	move.l	d1,a1				; x right
	move.l	.DXDY1(a6),d6			; delta xleft
	move.l	.DXDY2(a6),a4			; delta xright
	bsr.b	.go
	move.l	a1,d1
.exit	rts

.swap	move.l	d1,a2				; x left
	move.l	d0,a1				; x right
	move.l	.DXDY2(a6),d6			; delta xleft
	move.l	.DXDY1(a6),a4			; delta xright
	bsr.b	.go
	move.l	a2,d1
	rts
	
	CNOP	0,4
.go	tst.b	.CLIP(a6)
	bne.w	.cl_go
	cmp.w	#P_T50,.FLAGS(a6)
	beq.b	.trans
.loopY	move.l	a2,d0		; x start
	move.l	a1,d1		; x end
	lsr.l	#8,d0
	lsr.l	#8,d1
	sub.w	d0,d1		; delta X = width of scanline
	ble.w	.noX
	lea	(a0,d0.w),a5

	lsr.w	#1,d1
	bcc.b	.nobyte
	move.b	d5,(a5)+

.nobyte	lsr.w	#1,d1
	bcc.b	.noword
	move.w	d5,(a5)+

.noword	subq.w	#1,d1
	bmi.b	.nolong
.loopX	move.l	d5,(a5)+
	dbf	d1,.loopX

.nolong

.noX	add.l	a4,a1
	add.l	d6,a2
	lea	CHUNKYX(a0),a0
	dbf	d7,.loopY
	rts


*******************************************
*** NON-CLIPPED DRAW	50% TRANSPARENT	***
*******************************************

.trans
.tloopY	move.l	a2,d0		; x start
	move.l	a1,d1		; x end
	lsr.l	#8,d0
	lsr.l	#8,d1
	sub.w	d0,d1		; delta X = width of scanline
	ble.b	.tnoX
	lea	(a0,d0.w),a5
.tloopX	moveq	#0,d4
	move.b	(a5),d4
	add.w	d5,d4
	lsr.w	#1,d4
	move.b	d4,(a5)+
	subq.w	#1,d1
	bne.b	.tloopX
.tnoX	add.l	a4,a1
	add.l	d6,a2
	lea	CHUNKYX(a0),a0
	dbf	d7,.tloopY
	rts


*******************************************
*** CLIPPED DRAW	NON-TRANSPARENT	***
*******************************************

; d0: x1
; d1: x2
; d5: color
; d7: y end (last row of this section)
; a5: y start (current row, counted up)
; a2: DXDY2 (DXDY1 for the 2nd section)
; a3: DXDY1 (DXDY2 for the 2nd section)

	CNOP	0,4
.cl_go
.cl_loopY
	cmp.w	#P_T50,.FLAGS(a6)
	beq.b	.cltrans
	cmp.w	a5,d7
	blt.b	.cl_exit

	move.w	a5,d0
	cmp.w	#CLIPY_MAX,d0
	bgt.b	.cl_exit
	cmp.w	#CLIPY_MIN,d0
	blt.b	.cl_noX
	mulu.w	#CHUNKYX,d0
	move.l	ENG_CHUNKYBUFFER(pc),a0
	add.l	d0,a0
	

	move.l	a2,d0		; x start
	move.l	a1,d1		; x end
	lsr.l	#8,d0
	lsr.l	#8,d1


	movem.w	([ENG_CLIPX_TABPTR,pc],a5.w*4),d2/d3
.cl_cl	cmp.w	d3,d0
	bgt.b	.cl_noX
	cmp.w	d2,d1
	blt.b	.cl_noX
	cmp.w	d3,d1
	ble.b	.cl_xmaxok
	move.w	d3,d1
.cl_xmaxok
	cmp.w	d2,d0
	bge.b	.cl_xminok
	move.w	d2,d0
.cl_xminok

;	cmp.w	#CLIPX_MAX,d0
;	bgt.b	.cl_noX
;	cmp.w	#CLIPX_MIN,d1
;	blt.b	.cl_noX
;	cmp.w	#CLIPX_MAX,d1
;	ble.b	.cl_xmaxok
;	move.w	#CLIPX_MAX,d1
;.cl_xmaxok
;	cmp.w	#CLIPX_MIN,d0
;	bge.b	.cl_xminok
;	moveq	#CLIPX_MIN,d0	; x1 = CLIPX_MIN
;.cl_xminok

	sub.w	d0,d1		; delta X = width of scanline
	ble.w	.cl_noX
	add.w	d0,a0
	lsr.w	#1,d1
	bcc.b	.cl_nob
	move.b	d5,(a0)+

.cl_nob	lsr.w	#1,d1
	bcc.b	.cl_now
	move.w	d5,(a0)+

.cl_now	subq.w	#1,d1
	bmi.b	.cl_nol
.cl_loopX
	move.l	d5,(a0)+
	dbf	d1,.cl_loopX
.cl_nol

.cl_noX	add.l	a4,a1
	add.l	d6,a2

	addq.w	#1,a5
	bra.b	.cl_loopY
.cl_exit
	rts

*******************************************
*** CLIPPED DRAW	50% TRANSPARENT	***
*******************************************

.cltrans
.clt_loopY
	cmp.w	a5,d7
	blt.b	.cl_exit

	move.w	a5,d0
	cmp.w	#CLIPY_MAX,d0
	bgt.b	.cl_exit
	cmp.w	#CLIPY_MIN,d0
	blt.b	.clt_noX
	mulu.w	#CHUNKYX,d0
	move.l	ENG_CHUNKYBUFFER(pc),a0
	add.l	d0,a0
	

	move.l	a2,d0		; x start
	move.l	a1,d1		; x end
	lsr.l	#8,d0
	lsr.l	#8,d1


	cmp.w	#CLIPX_MAX,d0
	bgt.b	.clt_noX
	cmp.w	#CLIPX_MIN,d1
	blt.b	.clt_noX
	cmp.w	#CLIPX_MAX,d1
	ble.b	.clt_xmaxok
	move.w	#CLIPX_MAX,d1
.clt_xmaxok
	cmp.w	#CLIPX_MIN,d0
	bge.b	.clt_xminok
	moveq	#CLIPX_MIN,d0	; x1 = CLIPX_MIN
.clt_xminok

	sub.w	d0,d1		; delta X = width of scanline
	ble.w	.clt_noX
	add.w	d0,a0
	
.clt_loopX
	moveq	#0,d4
	move.b	(a0),d4
	add.w	d5,d4
	lsr.w	#1,d4
	move.b	d4,(a0)+
	subq.w	#1,d1
	bne.b	.clt_loopX

.clt_noX
	add.l	a4,a1
	add.l	d6,a2

	addq.w	#1,a5
	bra.b	.clt_loopY


.VARS	RSRESET
.X1	rs.w	1
.Y1	rs.w	1
.X2	rs.w	1
.Y2	rs.w	1
.X3	rs.w	1
.Y3	rs.w	1
.DXDY1	rs.l	1
.DXDY2	rs.l	1
.CLIP	rs.w	1
.FLAGS	rs.w	1
.SIZE	rs.b	0
	dcb.b	.SIZE

pmc · 10 February 2010, 19:31

@ Sting - I take it in general as you go above 68020 it starts to get quicker to do blitter type stuff using the processor? - perhaps gaining additional speed because you're able to use fast RAM...?

Samurai_Crow · 10 February 2010, 20:21

@pmc

RE: '020+ CPU blitting routines

The '020 has a 256 byte code cache and 32 bit memory bus, making small loops go much faster and outstripping the bandwidth of a 16-bit ECS blitter. The main reason you'd want to use a CPU-blitting routine though, is that chunky graphics are almost always faster than planar.

Photon · 10 February 2010, 21:30

I think he probably means 'filled polygons'?

In the Stunner Dentro I had 25% of the screen height filled by the CPU while the blitter filled the rest IIRC. Could have been 12.5%. Stunner has a completely other (set of) 'fill methods', as it's inconvex vectors. Let me know if you want me to dig it up Victim.

I'm not sure he means filling blitter-drawn polys ofc. If it's only going to work on expanded A1200's (and better), it'd be better to keep it all in fastram and do all the work with the CPU, including the final copy to chipmem screen.

I know little about cache vs stock A1200 behavior - I'd have to learn more about the behavior if it fills the chipmem screen directly.
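
A minimal sketch of that final copy step, assuming a single 320x256 bitplane (10240 bytes) that was rendered into a work buffer in fast RAM. The name `CopyToChip` and the register assignments are made up for illustration; a tuned version would probably be unrolled further:

```asm
; Hypothetical helper, not taken from any post in this thread.
; a0 = work buffer in fast RAM (source)
; a1 = visible bitplane in chip RAM (destination)
; Both are assumed to be longword aligned.
; 320*256/8 = 10240 bytes, copied 16 bytes per pass.

CopyToChip:
		move.w	#10240/16-1,d0	;640 passes, -1 for dbf
.copy:		move.l	(a0)+,(a1)+
		move.l	(a0)+,(a1)+
		move.l	(a0)+,(a1)+
		move.l	(a0)+,(a1)+
		dbf	d0,.copy
		rts
```

All drawing, clearing and filling then touches only fast RAM, and chip RAM is written exactly once per frame.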

KevG · 11 February 2010, 00:04

I did some time tests on this recently and found that you can get a large speed increase if you do alternate blitting and filling on a 68000 Amiga. plane 1 = blit, plane 2 = fill etc....

However, you get a HUGE increase in speed if you only cpu-fill on an 68020 with some fast ram. Hope that helps.

Kev G

pmc · 11 February 2010, 07:55

Quote:

Originally Posted by KevG

I did some time tests on this recently and found that you can get a large speed increase if you do alternate blitting and filling on a 68000 Amiga. plane 1 = blit, plane 2 = fill etc....

I do my coding on 68000 Amigas so this is interesting to me. When you say large, how large are we talking?

Speed increase is there purely because of processor / blitter doing these operations concurrently...?

Photon · 11 February 2010, 18:13

Quote:

Originally Posted by KevG

I did some time tests on this recently and found that you can get a large speed increase if you do alternate blitting and filling on a 68000 Amiga. plane 1 = blit, plane 2 = fill etc....

However, you get a HUGE increase in speed if you only cpu-fill on an 68020 with some fast ram. Hope that helps.

Kev G

In my experience it's better to use a percentage of the height; that way you can set different factors for different CPUs, DMA time stolen by bitplanes, etc. But yes, blitter fill doesn't use all sources, so if you have nothing else to do with the CPU while it fills, you can CPU-fill with something like your method or mine.
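
As a rough illustration of the percentage idea, the row split could be computed like this. `SplitRows` and the 256-row screen height are assumptions for the sketch; handing the two parts to the CPU filler and the blitter is not shown:

```asm
; Hypothetical helper, not taken from any post in this thread.
; in:  d0.w = share of the screen height filled by the CPU, in percent (0..100)
; out: d0.w = number of rows for the CPU
;      d1.w = number of rows left for the blitter

SplitRows:
		mulu.w	#256,d0		;percent * screen height (max 25600)
		divu.w	#100,d0		;low word = rows for the CPU
		move.w	#256,d1
		sub.w	d0,d1		;the rest goes to the blitter
		rts
```

With 25% this gives 64 rows for the CPU and 192 for the blitter; the factor can then be tuned per machine.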

victim · 12 February 2010, 00:18

Hi coder boys!

Many thanks for your quick replies, and first of all please excuse my bad English; I am still learning.

Quote:

StingRay
Since your code is not really commented (which parameters etc pp?) I don't know how it is supposed to work (and I don't feel like doing "guess work"). Your code looks kinda strange to me anyway.
Anyway, here's my flat filler which should be easy to understand.

My compliments, that is an excellent routine you have programmed there.
My first goal is a routine without any chunky mode programming.
Chunky mode will only come in a later version.

Quote:

Photon I think he probably means 'filled polygons'?

In the Stunner Dentro I had 25% of the screen height filled by the CPU while the blitter filled the rest IIRC. Could have been 12.5%. Stunner has a completely other (set of) 'fill methods', as it's inconvex vectors. Let me know if you want me to dig it up Victim.

I'm not sure he means filling blitter-drawn polys ofc. If it's only going to work on expanded A1200's (and better), it'd be better to keep it all in fastram and do all the work with the CPU, including the final copy to chipmem screen.

I know little about cache vs stock A1200 behavior - I'd have to learn more about the behavior if it fills the chipmem screen directly.

Can you give me an example or show me?

What can I still improve in my routine?

Do you have any ideas?

How do I make use of the cache of the 68040 or 68060?

Here is one of my current routines, but as I said, it is unfortunately very slow.

- Pure CPU vector calculation
- Pure CPU clear
- Pure CPU Line
- Pure CPU Fill

Code:

		
;*******		
;****		Date: 11.02.2010	Prog: CPU-Fill-Vector
;****		
;****		Done by Victim of Savage (Sascha Mueller)
;*******


		
		auto	cs\Sinus\0\451\451\$7fff-572\572\w0\ny
		auto	e\o\j
	
		section	prog,CODE_C
		
start:		movem.l	d0-d7/a0-a6,-(sp)

		move	#$4000,$dff09a
		lea	CopperList(pc),a0
		move.l	a0,$dff084			;Init CopperList

		bsr	InitCPUFillTab		;MAKE FILL TAB
		bsr	InitMulTab
		

MouseWait:	bsr	WaitRaster
		bsr	TribbleBuffer
		bsr	CPUClear
		bsr	Rotate
		bsr	Angels

		movem.l	d0-d7/a0-a6,-(sp)
		bsr	CPUFill
		movem.l	(sp)+,d0-d7/a0-a6

		move	#$ff,$dff180

		btst	#6,$bfe001
		bne.w	MouseWait

		movem.l	(sp)+,d0-d7/a0-a6
		rts

WaitRaster:	move.l	$dff004,d0
		asr.l	#8,d0
		and	#$1ff,d0
		cmp	#300,d0		;300
		bne.s	WaitRaster
		rts

TribbleBuffer:
		move.l	PlaneBufferShow(pc),d0
		move.l	PlaneBufferWork(pc),PlaneBufferShow
		move.l	PlaneBufferClear(pc),PlaneBufferWork
		move.l	d0,PlaneBufferClear

		
		lea.l	CopperPlanes(pc),a3
		move.l	PlaneBufferShow(pc),d0
		move	d0,6(a3)
		swap	d0
		move	d0,2(a3)
		swap	d0
		add.l	#40,d0
		move	d0,14(a3)
		swap	d0
		move	d0,10(a3)
		rts

;*****************************************************
;** CPU-Clear for two Bitplanes
;*****************************************************

CPUClear:	lea	leer(pc),a6
		movem.l	(a6)+,d0-d7/a0-a5
		move.l	PlaneBufferClear(pc),a6
		lea.l	256*40*2(a6),a6
		blk.l	364,$48e6fffc		;2 Planes 366
		rts

leer:		blk.l	14,0

;*****************************************************
;** CPU-Vector-Rotation with 12 Muls
;** In a later version i will make a matrix with 9 muls 
;*****************************************************

Rotate:		lea	Vektor(pc),a1
		lea	XY(pc),a2	
		lea	Sinus(pc),a3
		
		moveq	#7,d7
Rot:
		movem	(a1)+,d0-d2

		tst	AngleXSpeed
		beq	RotateY

		move	alpha(pc),d6		;Angle
		move	(a3,d6.w*2),d5		;Sinus   
		move	(a3,d6.w*2,180),d6	;Cosinus   

		move	d1,d3		;y save
		move	d2,d4		;z save

		muls	d6,d1		;y*cos(alpha)
		muls	d5,d4		;z*sin(alpha)
		sub.l	d4,d1		;y*cos(alpha) - z*sin(alpha)

		add.l	d1,d1		;2^15 = $8000
		swap	d1
				
		muls	d5,d3		;y*sin(alpha)
		muls	d6,d2		;z*cos(alpha)
		add.l	d3,d2		;y*sin(alpha) + z*cos(alpha)

		add.l	d2,d2		;2^15
		swap	d2

RotateY:	tst	AngleYSpeed
		beq	RotateZ	

		move	Beta(pc),d6		;Angle
		move	(a3,d6.w*2),d5		;Sinus   
		move	(a3,d6.w*2,180),d6	;Cosinus 

		move	d0,d3		;x save
		move	d2,d4		;z save

		muls	d6,d0		;x*cos(beta)
		muls	d5,d4		;z*sin(beta)
		add.l	d4,d0		;x*cos(beta) + z*sin(beta)

		add.l	d0,d0		;2^15
		swap	d0
				
		muls	d5,d3		;-x*sin(beta)
		muls	d6,d2		;z*cos(beta)
		sub.l	d3,d2		;-x*sin(beta) + z*cos(beta) (because -x = neg)
					;z*cos(beta)-x*sin(beta)  

		add.l	d2,d2		;2^15
		swap	d2


RotateZ:	tst	AngleZSpeed
		beq	CentralProj
		
		move	Gamma(pc),d6		;Angle
		move	(a3,d6.w*2),d5		;Sinus   
		move	(a3,d6.w*2,180),d6	;Cosinus 

		move	d0,d3		;x save
		move	d1,d4		;y save

		muls	d6,d0		;x*cos(gamma)
		muls	d5,d4		;y*sin(gamma)
		sub.l	d4,d0		;x*cos(gamma) - y*sin(gamma)

		add.l	d0,d0		;2^15
		swap	d0
				
		muls	d5,d3		;x*sin(gamma)
		muls	d6,d1		;y*cos(gamma)
		add.l	d3,d1		;x*sin(gamma) + y*cos(gamma)

		add.l	d1,d1		;2^15
		swap	d1

CentralProj:	add	xpos(pc),d0
		add	ypos(pc),d1
		sub	zpos(pc),d2
		

;* Qx,Qy and Qz are the angles additions 
;*
;* Zx,Zy and Zz are the central projection values (constant charged)

;* Px = Zx-Zz*Qx-Zx/Qz-Zz

		move	d0,d4
		sub	x(pc),d4	;Qx-Zx
		move	d2,d5
		sub	z(pc),d5	;Qz-Zz
		muls	z(pc),d4	;Zz*Qx-Zx
		divs	d5,d4		;Zz*Qx-Zx/Qz-Zz
		sub	x(pc),d4	;Zx-Zz*Qx-Zx/Qz-Zz

;* Py = Zy-Zz*Qy-Zy/Qz-Zz

		move	d1,d6
		sub	y(pc),d6	;Qy-Zy
		muls	z(pc),d6	;Zz*Qy-Zy
		divs	d5,d6		;Zz*Qy-Zy/Qz-Zz
		sub	y(pc),d6	;Zy-Zz*Qy-Zy/Qz-Zz

		add	#320/2,d4
		add	#256/2,d6

		move	d4,(a2)+
		move	d6,(a2)+

		dbf	d7,Rot

		lea.l	connect(pc),a3		;connect
		lea	xy(pc),a2		;xy
		move	(a3)+,d7
		subq	#1,d7

PolySort:	move	(a3)+,d6
		subq	#1,d6
		move	(a3)+,color

		move	(a3),d4
		movem	(a2,d4.w),d0/d1		;x1,y1

		move	2(a3),d4
		movem	(a2,d4.w),d2/d3		;x2,y2

		move	4(a3),d4
		movem	(a2,d4.w),d4/d5		;x3,y3

		sub	d1,d5
		sub	d0,d2
		sub	d0,d4
		sub	d1,d3

		muls	d2,d5
		muls	d3,d4

		sub.l	d4,d5

		bmi	nopoly
	
DrawLines:	move	(a3)+,d4
		movem	(a2,d4.w),d0/d1

		move	(a3),d4
		movem	(a2,d4.w),d2/d3

		move.l 	PlaneBufferWork(pc),a0
		moveq	#40,d4
		move	color(pc),d5
			
		movem	d0-d3,-(sp)
		btst	#0,d5
		beq	SkipPoly
		movem.l	d4-d7/a0-a6,-(a7)
		bsr	CPULine
		movem.l	(a7)+,d4-d7/a0-a6
		
		
SkipPoly:	movem	(sp)+,d0-d3
		btst	#1,d5
		beq	SkipP2
		add.l	#40,a0
		movem.l	d4-d7/a0-a6,-(a7)
		bsr	CPULine
		movem.l	(a7)+,d4-d7/a0-a6
		
		
SkipP2:		dbra	d6,DrawLines
		bra	NextStep

NoPoly:		addq.l	#8,a3
		add	d6,d6
		add	(a3,d6.w),d6

NextStep:	addq.l	#2,a3
		dbra	d7,PolySort
		rts

color:		dc.w	0

ZAdd:		dc.w	-60		;Z World
XY:		blk.w	80*2

xpos:		dc.w	0
ypos:		dc.w	0
zpos:		dc.w	-1050

;Viewer

x:		dc.w	0
y:		dc.w	0
z:		dc.w	600


;*****************************************************
;** Angels addition
;*****************************************************

Angels:		move	AngleXSpeed(pc),d0
		add	d0,Alpha
		cmp	#360,Alpha		;360 degrees
		blt	NextYAngle		;branching if smaller
		move	#0,Alpha		;if  zero then 360 = 0

NextYAngle:	move	AngleYSpeed(pc),d0
		add	d0,Beta
		cmp	#360,Beta
		blt	NextZAngle
		move	#0,Beta

NextZAngle:	move	AngleZSpeed(pc),d0
		add	d0,Gamma
		cmp	#360,Gamma
		blt	NextAngleEnd
		move	#0,Gamma

NextAngleEnd:	rts

AngleXSpeed:	dc.w	2
AngleYSpeed:	dc.w	2
AngleZSpeed:	dc.w	2

Alpha:		dc.w	0
Beta:		dc.w	0
Gamma:		dc.w	0


**		x,y,z

Vektor:		dc.w	-50,50,-50
		dc.w	-50,-50,-50
		dc.w	50,-50,-50
		dc.w	50,50,-50
		dc.w	-50,50,50
		dc.w	-50,-50,50
		dc.w	50,-50,50
		dc.w	50,50,50
		

connect:	dc.w	6
		dc.w	4, 1,0*4,1*4,2*4,3*4,0*4
		dc.w	4, 1,4*4,7*4,6*4,5*4,4*4
		dc.w	4, 2,0*4,4*4,5*4,1*4,0*4
		dc.w	4, 2,3*4,2*4,6*4,7*4,3*4
		dc.w	4, 3,1*4,5*4,6*4,2*4,1*4
		dc.w	4, 3,0*4,3*4,7*4,4*4,0*4


;*****************************************************
;** Mul tab for the CPU-Line routine
;*****************************************************

InitMulTab:	move 	#255,d7
		lea	MulTab(pc),a0
		moveq 	#0,d0
loop1:		move 	d0,(a0)+
		add 	#80,d0
		dbf 	d7,loop1
		rts

;*****************************************************
;** CPU-Draw-Line routine with special Fill BIT
;*****************************************************

;---------------------------------------------------------------
;	d0 = x1
;	d1 = y1
;	d2 = x2
;	d3 = y2
;	a0 = PlanePointer

CPULine:	lea	multab(pc),a1
		cmp	d1,d3
		bgt	pl_ord
		beq	pl_out
		exg.l	d0,d2
		exg.l	d1,d3
pl_ord:		move	d2,d4
		move	d3,d5
		sub	d1,d5
		sub	d0,d4
		bge	pl_o78
pl_o56:		move	#-1,a3	;x-symmetry
		neg	d4
		move	d4,d2
		add	d0,d2
		cmp	d4,d5
		bgt	pl_d67
		bra	pl_d58
pl_o78:		move	#1,a3
		cmp	d4,d5
		bgt	pl_d67
pl_d58:		moveq.l	#0,d4
		move	d0,d6
		move	d1,d7
		move	d2,a2
		move	d6,d0
		move	d7,d1
		sub	d0,a2	;dx
		sub	d1,d3	;dy
		sub	a2,d4
		asr	#1,d4	;error=-dy/2
		move	a2,d5
		subq	#1,d5
pl_l58:		move	d6,d0
		move	d7,d1
		add	d3,d4	;error=error+dy
		blt	pl_t58	
		;add	d1,d1		;--- pset ----------	
		move	(a1,d1.w*2),d1	;mulu	#40,d1
		move	d0,d2
		asr	#3,d2
		add	d2,d1
		not	d0
		bchg	d0,(a0,d1)	;-------------------
		addq	#1,d7
		sub	a2,d4	;error=error-dx
pl_t58:		add	a3,d6	;a3:={-1,1}
		dbf	d5,pl_l58
		rts

pl_d67:		moveq.l	#0,d4
		move	d0,d6
		move	d1,d7
		move	d2,a2
		sub	d0,a2	;dx
		sub	d1,d3	;dy
		sub	a2,d4
		asr	#1,d4	;error=-dx/2
		move	d3,d5
		subq	#1,d5
pl_l67:		move	d6,d0
		move	d7,d1
		;add	d1,d1		;--- pset ---------
		move	(a1,d1.w*2),d1	;mulu	#40,d1
		move	d0,d2
		asr	#3,d2
		add	d2,d1
		not	d0
		bchg	d0,(a0,d1)	;------------------
		add	a2,d4	;error=error+dx
		blt	pl_t67	
		add	a3,d6	;a3:={-1,1}
		sub	d3,d4	;error=error-dy
pl_t67:		addq	#1,d7
		dbf	d5,pl_l67
pl_out:		rts

MulTab:		blk.w 256,0



;*****************************************************
;** CPU-Fill Routine
;*****************************************************


CPUFill:	move.l	planebufferwork(pc),a0
		
		moveq.l	#0,d0
		moveq.l	#0,d1
		lea	intab(pc),a2
		lea	dbtab(pc),a3
		move.w	#256*2-1,d7	;2 planes * 256 rows, -1 for dbf
pr_fyl:		lea	fntab(pc),a1
		move.l	a1,4(a3)
		lea	fitab(pc),a1
		move.l	a1,(a3)
		move.w	#40-1,d6	;width
pr_fxl:
;---
		tst.b	d1		;speed up by ignoring 0 bytes
		beq	pr_zer
pr_set:		tst.b	(a0)
		bne	pr_lin
		move.b	#$ff,(a0)+
		dbf	d6,pr_set
		bra	pr_ny
pr_zer:		tst.b	(a0)
		bne	pr_lin
		addq.l	#1,a0		;addq is shorter and faster than adda.l #1
		dbf	d6,pr_zer
		bra	pr_ny
pr_lin:
;---
;		move.w	#$1ff,d5
;llo:		nop
;		dbf	d5,llo

		move.b	(a0),d0
		move.b	(a1,d0.l),(a0)+
		tst.b	(a2,d0.l)
		beq	pr_nic		;no insert change
		move.l	0(a3),d2	;change filltables
		move.l	4(a3),0(a3)
		move.l	d2,4(a3)
		move.l	0(a3),a1	;->a1
		not.b	d1		;change insert in d1
pr_nic:		dbf	d6,pr_fxl
pr_ny:		dbf	d7,pr_fyl
		rts

;---	creates three tables later used for filling	---

InitCPUFillTab:	

		moveq.l	#0,d0		;init routine
		moveq.l	#0,d3
		lea	fitab(pc),a0	;fill	table
		lea	intab(pc),a1	;insert table
		move	#255,d4		;256 different bytes to fill
pr_bl:		move	d3,d0
		bsr	pr_bfi
		move.b	d0,(a0)+
		move.b	d1,(a1)+
		addq	#1,d3	
		dbf	d4,pr_bl
		lea	fitab,a0
		lea	fntab,a1
		move	#255,d1
pr_inv:		move.b	(a0)+,d0
		not.b	d0
		move.b	d0,(a1)+
		dbf	d1,pr_inv
		rts

pr_bfi:		moveq.l	#0,d1		;fill one byte (d0) as the blitter
		moveq	#7,d2		;test all eight bits 
pr_bfl:		btst	d2,d0
		bne	pr_ich
		tst.b	d1
		beq	pr_nb
		bset	d2,d0
pr_nb:		dbf	d2,pr_bfl
		rts
pr_ich:		tst.b	d1
		bne	pr_nof	
		bclr	d2,d0
pr_nof:		not.b	d1
		bra	pr_nb	

dbtab:	dc.l	fitab,fntab	;change table
fitab:	blk.b	256,0		;fill table		insert=0
fntab:	blk.b	256,0		;fill table		insert=1
intab:	blk.b	256,0		;insert change	($ff)

CopperList:	
		dc.l	$01200000
		dc.l	$01220000
		dc.l	$01fc0000
		dc.l	$01020000
		dc.l	$01040000
		dc.l	$01060000
		dc.l	$01080028
		dc.l	$010a0028
CopperPlanes:	
		dc.l	$00e00000
		dc.l	$00e20000
		dc.l	$00e40000
		dc.l	$00e60000
		dc.l	$01002200
		dc.l	$008e2981
		dc.l	$009029d1
		dc.l	$00920038
		dc.l	$009400d0
		dc.l	$01800000
		dc.l	$0182000f
		dc.l	$0184000a
		dc.l	$01860006
		dc.l	-2

		; 2^15 =  $8000 
Sinus:		blk.w	452,0

PlaneBufferShow:	dc.l	Buffer1
PlaneBufferWork:	dc.l	Buffer2
PlaneBufferClear:	dc.l	Buffer3

Buffer1:	blk.b	320*256/8*2,0
Buffer2:	blk.b	320*256/8*2,0
Buffer3:	blk.b	320*256/8*2,0
END

Download the code: www.savage-crew.de/code/CPU-Vector-Fill-Line.s

so long....
victim

Leffmann · 12 February 2010, 02:13

Try changing your line drawing routine so it plots 1 pixel per column instead of per row, then you can fill vertically and do 32 pixels in one go, f.ex:

Code:

          lea       Bitplane, a0
          moveq     #40, d2             ; Screen width in bytes

          moveq     #10-1, d3           ; Screen width in 32-bit longwords
.xloop    moveq     #0, d1              ; Set fill carry to 0

          move.w    #256-1, d0          ; 256 rows
.yloop    eor.l     (a0), d1            ; Fill
          move.l    d1, (a0)

          add.w     d2, a0              ; Step to next row
          dbf       d0, .yloop

          lea       -40*256+4(a0), a0   ; Step to top of next column
          dbf       d3, .xloop

StingRay · 12 February 2010, 09:44

Hmm, not sure if longword eor filler will work correctly in bitplane mode. I am sure that Leffmann's code won't assemble though (eor.l (a0),d1 = you wish). :P

Anyway, why use such brute force method when you can do it in a different way (that's how I would do it anyway):

- Have a table of 256*2 words which holds xstart/xend coords, initialize this table with "invalid" coords (e.g. moveq #-1,dx move.l dx,(ax)+ )
- instead of drawing pixels in your line draw routine you save x coords, you need to check if a coord has already been written (that's why you need to initialize the table), if so, store the value as xend, otherwise it's xstart
- also save ymin/ymax so you don't have to read the whole buffer later
- code a simple horizontal line drawer which reads the coords from the table and draws a line from x1 to x2, obviously you want a "write 32 pixels at once" routine
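
The steps above could be sketched roughly like this. This is only an illustration under assumptions, not StingRay's actual code: the names `SpanTab`, `InitSpanTab`, `StoreSpanX` and `FillSpan` are made up, x is assumed to be already clipped to 0..319, the bitplane rows are 40 bytes wide and longword aligned, and a 68020+ is required for the scaled index. Keeping xstart/xend sorted is an addition of this sketch, and the ymin/ymax tracking is left out.

```asm
; Hypothetical sketch of the span table method (68020+).

InitSpanTab:
		lea	SpanTab(pc),a0
		moveq	#-1,d0		;$ffffffff = xstart and xend invalid
		move.w	#256-1,d1
.clr:		move.l	d0,(a0)+
		dbf	d1,.clr
		rts

; Called by the edge tracer once per row instead of plotting a pixel.
; in: d0.w = x (0..319), d1.w = y (0..255)
StoreSpanX:
		lea	SpanTab(pc),a0
		lea	(a0,d1.w*4),a0	;entry for this row
		tst.w	(a0)		;xstart still invalid (-1)?
		bpl.b	.second
		move.w	d0,(a0)		;first edge on this row
		rts
.second:	cmp.w	(a0),d0
		bge.b	.right
		move.w	(a0),2(a0)	;new x is further left: old xstart becomes xend
		move.w	d0,(a0)
		rts
.right:		move.w	d0,2(a0)
		rts

; Sets the pixels x1..x2 (inclusive) in one bitplane row, 32 at a time.
; in: d0.w = x1, d1.w = x2 (x1 <= x2), a0 = start of the row
FillSpan:
		move.w	d0,d2
		lsr.w	#5,d2		;index of first longword
		move.w	d1,d3
		lsr.w	#5,d3		;index of last longword
		lea	(a0,d2.w*4),a1
		moveq	#-1,d4
		and.w	#31,d0
		lsr.l	d0,d4		;left mask: pixels from x1 to the right
		moveq	#-1,d5
		and.w	#31,d1
		eor.w	#31,d1		;31-(x2 & 31)
		lsl.l	d1,d5		;right mask: pixels from the left up to x2
		sub.w	d2,d3		;number of longwords - 1
		bne.b	.multi
		and.l	d5,d4		;span lies inside one longword
		or.l	d4,(a1)
		rts
.multi:		or.l	d4,(a1)+	;partial longword on the left
		moveq	#-1,d4
		subq.w	#2,d3		;full longwords in between, -1 for dbf
		bmi.b	.last
.mid:		move.l	d4,(a1)+
		dbf	d3,.mid
.last:		or.l	d5,(a1)		;partial longword on the right
		rts

SpanTab:	dcb.l	256,-1		;256 * (xstart.w, xend.w)
```

This only holds two x values per row, which is enough for convex polygons as long as the edge tracer stores exactly one x per row and edge. The filler would then walk `SpanTab` from ymin to ymax, skip rows whose xstart is still -1 and call `FillSpan` for the rest.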

victim · 12 February 2010, 12:07

Quote:

Originally Posted by StingRay

Hmm, not sure if longword eor filler will work correctly in bitplane mode. I am sure that Leffmann's code won't assemble though (eor.l (a0),d1 = you wish). :P

Anyway, why use such brute force method when you can do it in a different way (that's how I would do it anyway):

- Have a table of 256*2 words which holds xstart/xend coords, initialize this table with "invalid" coords (e.g. moveq #-1,dx move.l dx,(ax)+ )
- instead of drawing pixels in your line draw routine you save x coords, you need to check if a coord has already been written (that's why you need to initialize the table), if so, store the value as xend, otherwise it's xstart
- also save ymin/ymax so you don't have to read the whole buffer later
- code a simple horizontal line drawer which reads the coords from the table and draws a line from x1 to x2, obviously you want a "write 32 pixels at once" routine

StingRay, you are usually absolutely right, but I love old-school effects. Chunky mode would probably be the easiest in this case. A vector routine without chunky mode just looks beautiful and clean and does not look so rough.

A very good example, perhaps one of the best CPU vector routines, is the demo ARTE by SANITY. I mean the end part of the demo.
http://www.pouet.net/prod.php?which=1477
The faster the CPU, the faster and more fluidly the routines in the demo run. Only half the screen is used there, but given the large number of objects it is still very fast.

I will try to implement your suggestions soon.

so long...
Victim

StingRay · 12 February 2010, 12:36

Quote:

Originally Posted by victim

StingRay, you are usually absolutely right, but I love old-school effects. Chunky mode would probably be the easiest in this case. A vector routine without chunky mode just looks beautiful and clean and does not look so rough.

Where did I mention chunky mode (except in my first post)? Read my post again and if you understand(!) it you'll see that it'll work in bitplane mode.
In chunky mode I would just use my chunky triangle filler you can find in my first post.

Also, vectors look the same in chunky and bitplane mode, just the way you draw them is different, nothing else. I don't see how they look "rough" in chunky mode, care to elaborate?

victim · 12 February 2010, 13:01

Quote:

Originally Posted by StingRay

Where did I mention chunky mode (except in my first post)? Read my post again and if you understand(!) it you'll see that it'll work in bitplane mode.
In chunky mode I would just use my chunky triangle filler.

Also, vectors look the same in chunky and bitplane mode, just the way you draw them is different, nothing else. I don't see how they look "rough" in chunky mode, care to elaborate?

I probably put that a little misleadingly.

You did not mention chunky mode directly. I had only skimmed your routine, spotted the variable "ENG_CHUNKYBUFFER" and thought it was a special routine just for chunky mode.

Now I understand your routine: the data is first calculated normally, and only later translated with a "chunky to planar" converter and then drawn to the screen.

so long....
Victim

StingRay · 12 February 2010, 13:05

Quote:

Originally Posted by victim

I probably put that a little misleadingly.

You did not mention chunky mode directly. I had only skimmed your routine, spotted the variable "ENG_CHUNKYBUFFER" and thought it was a special routine just for chunky mode.

Now I understand your routine: the data is first calculated normally, and only later translated with a "chunky to planar" converter and then drawn to the screen.

I only posted my chunky filler because I didn't really understand what you wanted to do at first. But you replied to my other post here and that was an answer to your "how can I make my CPU fill routine faster" question and had nothing to do with any chunky mode at all.

Leffmann · 12 February 2010, 13:09

Oops, should've checked the code in an assembler first. The 68K is convenient but it's not as orthogonal as you'd wish, is it?

XOR filling is well tested and is exactly what the blitter does, except we do 32 bits in parallel and this is why it works like it should in bitplane mode. Here's the corrected algorithm:

Code:

          lea       Bitplane, a0
          moveq     #40, d2             ; Screen width in bytes

          moveq     #10-1, d3           ; Screen width in 32-bit longwords
.xloop    moveq     #0, d1              ; Set fill carry to 0

          move.w    #256-1, d0          ; 256 rows
.yloop    move.l    (a0), d4
          eor.l     d4, d1              ; Fill
          move.l    d1, (a0)

          add.w     d2, a0              ; Step to next row
          dbf       d0, .yloop

          lea       -40*256+4(a0), a0   ; Step to top of next column
          dbf       d3, .xloop

One benefit of drawing and filling using these methods is you can remove duplicate lines and also avoid lots of overdraw, and eliminate overdraw completely with convex objects, but using scanline buffer methods will probably be faster at some point when fastmem is involved. Would be really interesting to see some good benchmarks on this.
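
One cheap way to cut the work of the XOR filler is to run it only over the rows the object actually touches. A sketch of that, assuming the line routine has recorded the top and bottom row (passed here in d5/d6) and the same 40-byte-wide single bitplane as in the loop above. The name `FillRows` and the register interface are hypothetical:

```asm
; Hypothetical variant of the vertical XOR fill, limited to ymin..ymax.
; in: a0 = start of bitplane
;     d5.w = ymin, d6.w = ymax (inclusive, ymin <= ymax)

FillRows:
		move.w	d5,d0
		mulu.w	#40,d0
		add.l	d0,a0		;address of row ymin
		sub.w	d5,d6		;number of rows - 1, ready for dbf

		moveq	#10-1,d3	;screen width in 32-bit longwords
.xloop:		moveq	#0,d1		;fill carry starts at 0 above ymin
		move.l	a0,a1
		move.w	d6,d0
.yloop:		move.l	(a1),d4
		eor.l	d4,d1		;toggle the fill state on every set pixel
		move.l	d1,(a1)
		lea	40(a1),a1	;step to next row
		dbf	d0,.yloop

		addq.l	#4,a0		;top of next column
		dbf	d3,.xloop
		rts
```

This relies on nothing being drawn above ymin, so starting the carry at 0 on that row gives the same result as filling from the top of the screen. The longword columns could be limited to xmin..xmax in the same way.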

StingRay · 12 February 2010, 13:13

Quote:

Originally Posted by Leffmann

XOR filling is well tested and is exactly what the blitter does, except we do 32 bits in parallel and this is why it works like it should in bitplane mode.

I know how eor fillers work (yes, I once did C64 coding too ;D), just wasn't sure if it would work for a whole longword. Anyway, I still think my approach is faster as it'll only draw the really needed pixels without having to go through the whole scanline. And it's more flexible too since you can use it to do pencil vectors f.e.

victim · 12 February 2010, 14:32

Quote:

Originally Posted by Leffmann

Oops, should've checked the code in an assembler first. The 68K is convenient but it's not as orthogonal as you'd wish, is it?

XOR filling is well tested and is exactly what the blitter does, except we do 32 bits in parallel and this is why it works like it should in bitplane mode. Here's the corrected algorithm:

Code:

         

CPUFIll:
          lea       Bitplane, a0
          moveq     #40, d2             ; Screen width in bytes

          moveq     #10-1, d3           ; Screen width in 32-bit longwords
.xloop    moveq     #0, d1              ; Set fill carry to 0

          move.w    #256-1, d0          ; 256 rows
.yloop    move.l    (a0), d4
          eor.l     d4, d1              ; Fill
          move.l    d1, (a0)

          add.w     d2, a0              ; Step to next row
          dbf       d0, .yloop

          lea       -40*256+4(a0), a0   ; Step to top of next column
          dbf       d3, .xloop

One benefit of drawing and filling using these methods is you can remove duplicate lines and also avoid lots of overdraw, and eliminate overdraw completely with convex objects, but using scanline buffer methods will probably be faster at some point when fastmem is involved. Would be really interesting to see some good benchmarks on this.

I think you also have to watch out for the left and right screen edges; the borders have to be handled both horizontally and vertically.
I think I will try another experiment with the scanline algorithm:

Code:

CPUFill:    
            movea.l    PlaneBufferWork(pc),a0    ;Bit Plane pointer
            move    #255,d6        ; 256 rows
            movea.l    a0,a1
FillVert:    
            moveq.l    #0,d4
            moveq    #9,d5
FillHori:    tst.l    (a0)
            bne        Fit
            move.l    d4,(a1)+
            lea        4(a0),a0
            dbf        d5,FillHori
            dbf        d6,FillVert
            rts
            
Fit:        move.l    (a0),d0
            move.l    d0,d1
            subq.l    #1,d0
            move.l    d0,d2
            and.l    d1,d0
            bne        Betw
FBit:        not.l    d4
            beq        Left
Right:        move.l    (a0)+,d0
            subq.l    #1,d0
            move.l    d0,(a1)+
            dbf        d5,FillHori
            dbf        d6,FillVert
            rts

Left:        move.l    (a0)+,d0
            subq.l    #1,d0
            not.l    d0
            move.l    d0,(a1)+
            dbf        d5,FillHori
            dbf        d6,FillVert
            rts

Betw:        lea        4(a0),a0
            subq.l    #1,d0
            ;or.l    d1,d0
            eor.l    d2,d0        ;exclusiv Fill
            move.l    d0,(a1)+
            dbf        d5,FillHori
            dbf        d6,FillVert
            rts

so long...

victim

Lekman · 26 January 2014, 02:15

Quote:

Originally Posted by victim

I think you also have to watch out for the left and right screen edges; the borders have to be handled both horizontally and vertically.
I think I will try another experiment with the scanline algorithm:

Code:

CPUFill:    
            movea.l    PlaneBufferWork(pc),a0    ;Bit Plane pointer
            move    #255,d6        ; 256 rows
            movea.l    a0,a1
FillVert:    
            moveq.l    #0,d4
            moveq    #9,d5
FillHori:    tst.l    (a0)
            bne        Fit
            move.l    d4,(a1)+
            lea        4(a0),a0
            dbf        d5,FillHori
            dbf        d6,FillVert
            rts
            
Fit:        move.l    (a0),d0
            move.l    d0,d1
            subq.l    #1,d0
            move.l    d0,d2
            and.l    d1,d0
            bne        Betw
FBit:        not.l    d4
            beq        Left
Right:        move.l    (a0)+,d0
            subq.l    #1,d0
            move.l    d0,(a1)+
            dbf        d5,FillHori
            dbf        d6,FillVert
            rts

Left:        move.l    (a0)+,d0
            subq.l    #1,d0
            not.l    d0
            move.l    d0,(a1)+
            dbf        d5,FillHori
            dbf        d6,FillVert
            rts

Betw:        lea        4(a0),a0
            subq.l    #1,d0
            ;or.l    d1,d0
            eor.l    d2,d0        ;exclusiv Fill
            move.l    d0,(a1)+
            dbf        d5,FillHori
            dbf        d6,FillVert
            rts

so long...

victim

I searched a bit online for "cpu fill amiga" to see if my old CPU fill was any good, and found this almost four-year-old thread. My routine is pretty similar to yours; it clears the screen while filling from the draw buffer.

Code:

*****************************************************************************
** CPU FILL:
*****************************************************************************
** After the "dbne loop" I have to subtract one from d6 because when the
** "dbne" instruction doesn't branch, it doesn't decrement. I'm using a
** dbf to take care of that.
**
** Made by Lekman/Hemoraiders 1997
*****************************************************************************

CPU_FILL:    lea    DrawBuffer(pc),a0
        lea    Screen(pc),a1
        move.l    #ScreenSize*2,d2
        add.l    d2,a0            ; ***
        add.l    d2,a1            ; ***
        move.w    #(ScrHeight*2)-1,d7

.FillLoop    lea    (a0),a2            ; Source
        lea    (a1),a3            ; Destination
        moveq    #(ScrWidth/4)-1,d6    ; Screenwidth in longwords

        moveq    #0,d5

.FindPoint    move.l    d5,-(a3)        ; Clear/Fill
        move.l    -(a2),d0
        dbne    d6,.FindPoint        ;loops if d0=$00000000
        beq.s    .NextLine        ;line finished?

        move.l    d5,d2
        bmi.s    .FEndBit    ; Find End Bit

.FStartBit    move.l    d0,d1
        subq.l    #1,d1
        and.l    d1,d0    ; Mask out first bit
        eor.l    d0,d1    ; Mask out other bits
        not.l    d1
        or.l    d1,d2    ; Mask
        tst.l    d0
        beq.s    .Filling    ;only one bit in longword?

.FEndBit    move.l    d0,d1
        subq.l    #1,d1
        and.l    d1,d0    ; Mask out first bit
        eor.l    d0,d1    ; Mask out other bits
        and.l    d1,d2    ; Mask
        tst.l    d0
        bne.s    .FStartBit
        moveq    #0,d5        ; Clearing
        bra.s    .StoreLongWord
.Filling    moveq    #-1,d5        ; Filling

.StoreLongWord    move.l    d2,(a3)
        dbf    d6,.FindPoint    ; Fill all long-words

.NextLine    lea    -ScrWidth(a0),a0
        lea    -ScrWidth(a1),a1
        dbf    d7,.FillLoop
        rts
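
The bit juggling in .FStartBit and .FEndBit is easier to follow with numbers. Here is a small worked sketch, assuming a made-up longword with two edge pixels at bits 20 and 10 and no fill active (d2 = 0). The instructions are the ones from the routine above; only the input value is invented.

Code:

        move.l  #$00100400,d0   ; edges at bit 20 and bit 10
        moveq   #0,d2           ; not filling yet

        ; start edge (as in .FStartBit)
        move.l  d0,d1
        subq.l  #1,d1           ; d1 = $001003FF
        and.l   d1,d0           ; d0 = $00100000, lowest edge bit removed
        eor.l   d0,d1           ; d1 = $000003FF, the bits below that edge
        not.l   d1              ; d1 = $FFFFFC00, the edge and everything above
        or.l    d1,d2           ; d2 = $FFFFFC00

        ; end edge (as in .FEndBit)
        move.l  d0,d1
        subq.l  #1,d1           ; d1 = $000FFFFF
        and.l   d1,d0           ; d0 = $00000000, no edges left
        eor.l   d0,d1           ; d1 = $000FFFFF, the bits below the second edge
        and.l   d1,d2           ; d2 = $000FFC00

The stored longword has bits 10 to 19 set: the start edge is included and the end edge is not. Because d0 ends up zero with the span closed, the routine continues with d5 = 0, so the following longwords are cleared rather than filled.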

I modified it slightly to just clear / fill the required area.

Code:

*****************************************************************************
** CPU FILL:
*****************************************************************************

        lea    DrawBuffer+ScreenSize+ScrWidth,a0
        move.l    ScreenP(pc),a1
        lea    ScreenSize(a1),a1

        move.w    (a2)+,d0        ; CPUFill_Pos (a2 is assumed to point at the Pos/Mod/Size words on entry)
        
        add.w    d0,a0
        add.w    d0,a1
        
        movem.w    (a2),d3/d4/d7        ; CPUFill_Mod/Size

.FillLoop    move.w    d4,d6        ; Screenwidth in longwords

        moveq    #0,d5

.FindPoint    move.l    d5,-(a1)        ; Clear
        move.l    -(a0),d0
        dbne    d6,.FindPoint
        beq.s    .NextLine

        move.l    d5,d2
        bmi.s    .FEndBit    ; Find End Bit

.FStartBit    move.l    d0,d1
        subq.l    #1,d1
        and.l    d1,d0    ; Mask out first bit
        eor.l    d0,d1    ; Mask out other bits
        not.l    d1
        or.l    d1,d2    ; Mask
        tst.l    d0
        beq.s    .Filling

.FEndBit    move.l    d0,d1
        subq.l    #1,d1
        and.l    d1,d0    ; Mask out first bit
        eor.l    d0,d1    ; Mask out other bits
        and.l    d1,d2    ; Mask
        tst.l    d0
        bne.s    .FStartBit
        moveq    #0,d5        ; Clearing
        bra.s    .StoreLongWord
.Filling    moveq    #-1,d5        ; Filling

.StoreLongWord    move.l    d2,(a1)
        dbf    d6,.FindPoint
.NextLine    sub.w    d3,a0
        sub.w    d3,a1
        dbf    d7,.FillLoop
.NoFill        rts


pmc · 10 February 2010, 19:31

@ Sting - I take it in general as you go above 68020 it starts to get quicker to do blitter type stuff using the processor? - perhaps gaining additional speed because you're able to use fast RAM...?

Samurai_Crow · 10 February 2010, 20:21

@pmc RE: '020+ CPU blitting routines

The '020 has a 256 byte code cache and 32 bit memory bus, making small loops go much faster and outstripping the bandwidth of a 16-bit ECS blitter. The main reason you'd want to use a CPU-blitting routine though, is that chunky graphics are almost always faster than planar.

Photon · 10 February 2010, 21:30

I think he probably means 'filled polygons'? In the Stunner Dentro I had 25% of the screen height filled by the CPU while the blitter filled the rest IIRC. Could have been 12.5%. Stunner has a completely other (set of) 'fill methods', as it's inconvex vectors. Let me know if you want me to dig it up Victim.

I'm not sure he means filling blitter-drawn polys ofc. If it's only going to work on expanded A1200's (and better), it'd be better to keep it all in fastram and do all the work with the CPU, including the final copy to chipmem screen. I know little about cache vs stock A1200 behavior - I'd have to learn more about the behavior if it fills the chipmem screen directly.

KevG · 11 February 2010, 00:04

I did some time tests on this recently and found that you can get a large speed increase if you do alternate blitting and filling on a 68000 Amiga. plane 1 = blit, plane 2 = fill etc.... However, you get a HUGE increase in speed if you only cpu-fill on an 68020 with some fast ram.

Hope that helps.

Kev G

StingRay · 12 February 2010, 09:44

Hmm, not sure if longword eor filler will work correctly in bitplane mode. I am sure that Leffmann's code won't assemble though (eor.l (a0),d1 = you wish ). :P

Anyway, why use such brute force method when you can do it in a different way (that's how I would do it anyway):

- Have a table of 256*2 words which holds xstart/xend coords, initialize this table with "invalid" coords (e.g. moveq #-1,dx move.l dx,(ax)+ )
- instead of drawing pixels in your line draw routine you save x coords, you need to check if a coord has already been written (that's why you need to initialize the table), if so, store the value as xend, otherwise it's xstart
- also save ymin/ymax so you don't have to read the whole buffer later
- code a simple horizontal line drawer which reads the coords from the table and draws a line from x1 to x2, obviously you want a "write 32 pixels at once" routine
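
A minimal sketch of the table part of this idea, under a few assumptions that are not in the post: the table is called SpanTab, each row holds xstart.w followed by xend.w, x is never negative (so -1 can mean "invalid"), and y is already clipped to 0..255. The ymin/ymax tracking and the horizontal line drawer are left out.

Code:

InitSpans:
        lea     SpanTab,a0
        moveq   #-1,d0
        move.w  #256-1,d1
.clr    move.l  d0,(a0)+        ; xstart = xend = -1 ("invalid")
        dbf     d1,.clr
        rts

; In: d0.w = x, d1.w = y. Trashes d1 and a0.
StoreEdge:
        lea     SpanTab,a0
        lsl.w   #2,d1           ; 4 bytes per row
        adda.w  d1,a0
        tst.w   (a0)
        bmi.s   .first          ; still -1: first coord on this row
        cmp.w   (a0),d0
        bge.s   .end            ; x >= xstart: it is the right end
        move.w  (a0),2(a0)      ; otherwise the old xstart becomes xend
.first  move.w  d0,(a0)
        rts
.end    move.w  d0,2(a0)
        rts

SpanTab: ds.l   256             ; hypothetical table, one longword per row

StoreEdge would be called from the line routine once per row instead of plotting a pixel. It keeps xstart <= xend, which only makes sense when each row receives at most two coords, i.e. for convex polygons.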

