How did games like Starglider 2 get such a high frame rate? - Page 5

roondar · 25 August 2018, 16:55

Quote:

Originally Posted by deimos

Right now, I have no idea how to debug something like this. With my small polygons this bit of code doesn’t get called all the time, so if I had a debugger I’d set a breakpoint. Maybe in ‘87 I’d know where to start, but today? No chance.

You can use WinUAE's built-in debugger to help out.
Not sure if this is the best way to set breakpoints, but if you open the debugger (there is a command line and a GUI one, but I only know how to open the GUI one) with shift-F12, you can use the following commands:

Code:

g - continue execution
m - memory dump
d - disassemble

t - step one instruction, optionally followed by the number of instructions to step
z - step through one instruction (useful for JSR, DBRA, etc.)
f <address> - add a breakpoint
w <address> - break when memory address is read/written

There are more commands, but these can help a lot.

For instance, if you do something like move.w $8,$8 in your code when you want to break into the debugger and then set up a w 8 in the debugger itself, it will break whenever it reaches that instruction.

a/b · 26 August 2018, 11:49

Try this:

Code:

do_middle
		moveq	#17+2, d0
		sub.w	d5, d0
		add.w	d4, d0
		blt		do_right
		add.w	d0, d0
		jmp		movetable(pc, d0.w)

movetable
		move.w	d7, (a1)+
		move.w	d7, (a1)+
...
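
For readers more at home in C, the computed jump above is roughly a switch with fall-through into an unrolled run of stores. This is only a sketch: the name `fill_words_unrolled` and the limit of 8 words are made up for illustration, and a real table would be as long as the longest run you need.

```c
#include <stdint.h>

/* Hypothetical helper: store `pattern` into `count` consecutive words
 * without a loop. `count` must be 0..8; anything larger writes nothing. */
static uint16_t *fill_words_unrolled(uint16_t *dst, uint16_t pattern, unsigned count)
{
    switch (count) {            /* plays the role of jmp movetable(pc, d0.w) */
    case 8: *dst++ = pattern;   /* fall through */
    case 7: *dst++ = pattern;   /* fall through */
    case 6: *dst++ = pattern;   /* fall through */
    case 5: *dst++ = pattern;   /* fall through */
    case 4: *dst++ = pattern;   /* fall through */
    case 3: *dst++ = pattern;   /* fall through */
    case 2: *dst++ = pattern;   /* fall through */
    case 1: *dst++ = pattern;   /* fall through */
    default: break;
    }
    return dst;                 /* points just past the last word written */
}
```

Each `case` corresponds to one `move.w d7, (a1)+`; entering further down the list writes fewer words.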

deimos · 26 August 2018, 13:07

Thanks. I'm not sure right now whether I'll use this or stick with the current jump table - the jump table allows me to mix in move.l and movem.l instructions to do the really long runs, which might give a bigger advantage, but I won't know until I know how many long runs I have. I'll keep this snippet until then.

One thing I have done though is to move this code to outside the loop, and store the destination to jump to in a register. The total time spent in the subroutine is now 8 seconds, giving me 2.5 fps.

deimos · 26 August 2018, 13:18

Current state of my code:

Code:

section	"CODE", code

public	_FastDrawHorizontalLine

; void FastDrawHorizontalLine(const PLANEPTR planes [], const WORD x1, const WORD x2, const WORD y, const WORD colour)
;
; 	a0 	planes
; 	d0 	x1
; 	d1 	x2
; 	d2 	y
; 	d3 	colour

_FastDrawHorizontalLine
	movem.l	d2-d7/a2-a4, -(a7)

; todo: calculate and increment per row outside this routine to eliminate the multiply
	mulu.w	#20, d2			; 20 words per 320 pixel row

	move.w	d0, d4
	asr.w	#4, d4
	add.w	d2, d4

	move.w	d1, d5
	asr.w	#4, d5
	add.w	d2, d5

	and.w	#$000f, d0
	add.w	d0, d0
	lea	left_bits, a1
	move.w	(a1, d0.w), d0

	and.w	#$000f, d1
	add.w	d1, d1
	lea	right_bits, a1
	move.w	(a1, d1.w), d1

	moveq	#6, d6			; 6 bitplanes

	cmp.w	d4, d5
	beq	do_overlap

	sub.w	d4, d5
	sub.w	#2, d5
	cmp.w	#17, d5
	bhi	1$
	add.w	d5, d5
	add.w	d5, d5
	move.l	duff(pc, d5.w), a2
	bra	2$
1$	lea	do_right, a2
2$	move.w	d0, a3
	move.w	d1, a4
	
	bra	loop_over_planes

loop_over_planes_start
	move.l	(a0)+, a1
	move.w	d4, d7
	add.w	d7, d7
	lea	(a1, d7.w), a1

	lsr.b	d3
	bcc 	1$
	moveq	#-1, d7
	bra	2$
1$	moveq	#0, d7
2$

do_left
	move.w	a3, d1
	move.w	(a1), d0
	tst.w	d7
	beq	1$
	or.w	d1, d0
	bra	2$
1$	not.w	d1
	and.w	d1, d0
2$	move.w	d0, (a1)+

	jmp	(a2)

	cnop	0, 4

duff	dc.l	2$, 3$, 4$, 5$, 6$, 7$, 8$, 9$, 10$, 11$, 12$, 13$, 14$, 15$, 16$, 17$, 18$, 19$

19$	move.w	d7, (a1)+
18$	move.w	d7, (a1)+
17$	move.w	d7, (a1)+
16$	move.w	d7, (a1)+
15$	move.w	d7, (a1)+
14$	move.w	d7, (a1)+
13$	move.w	d7, (a1)+
12$	move.w	d7, (a1)+
11$	move.w	d7, (a1)+
10$	move.w	d7, (a1)+
9$	move.w	d7, (a1)+
8$	move.w	d7, (a1)+
7$	move.w	d7, (a1)+
6$	move.w	d7, (a1)+
5$	move.w	d7, (a1)+
4$	move.w	d7, (a1)+
3$	move.w	d7, (a1)+
2$	move.w	d7, (a1)+

do_right
	move.w	a4, d1
	move.w	(a1), d0
	tst.w	d7
	beq	1$
	or.w	d1, d0
	bra	2$
1$	not.w	d1
	and.w	d1, d0
2$	move.w	d0, (a1)+

loop_over_planes
	dbra	d6, loop_over_planes_start
	bra	return

do_overlap
	and.w	d0, d1
	move.w	d1, d2
	not.w	d2

	add.w	d4, d4

	bra	loop_over_planes_overlap

loop_over_planes_start_overlap
	move.l	(a0)+, a1
	lea	(a1, d4.w), a1

	move.w	(a1), d0

	lsr.b	d3
	bcc 	1$
	or.w	d1, d0
	bra	2$
1$	and.w	d2, d0
2$	move.w	d0, (a1)+

loop_over_planes_overlap
	dbra	d6, loop_over_planes_start_overlap

return
	movem.l	(a7)+, d2-d7/a2-a4
	rts

	section	"DATA", data

	cnop	0, 4

left_bits
	dc.w	$ffff, $7fff, $3fff, $1fff, $0fff, $07ff, $03ff, $01ff
	dc.w	$00ff, $007f, $003f, $001f, $000f, $0007, $0003, $0001

right_bits
	dc.w	$8000, $c000, $e000, $f000, $f800, $fc00, $fe00, $ff00
	dc.w	$ff80, $ffc0, $ffe0, $fff0, $fff8, $fffc, $fffe, $ffff
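
As a cross-check of the masking logic, here is a rough C model of what the routine does for a single bitplane row. It is a sketch, not a translation: `left_mask`, `right_mask`, `apply_mask` and `draw_span` are hypothetical names, the masks are computed with shifts instead of being read from `left_bits` / `right_bits`, and x1 <= x2 with both non-negative is assumed.

```c
#include <stdint.h>

/* Same values as the left_bits / right_bits tables, computed on the fly. */
static uint16_t left_mask(int x)  { return (uint16_t)(0xFFFFu >> (x & 15)); }
static uint16_t right_mask(int x) { return (uint16_t)(0xFFFFu << (15 - (x & 15))); }

/* Set or clear the masked bits of one word. */
static void apply_mask(uint16_t *word, uint16_t mask, int bit)
{
    if (bit) *word |= mask;
    else     *word &= (uint16_t)~mask;
}

/* Draw pixels x1..x2 (inclusive) in one row of one bitplane. */
static void draw_span(uint16_t *row, int x1, int x2, int bit)
{
    int w1 = x1 >> 4, w2 = x2 >> 4, w;
    uint16_t lm = left_mask(x1), rm = right_mask(x2);

    if (w1 == w2) {                     /* the do_overlap case */
        apply_mask(&row[w1], (uint16_t)(lm & rm), bit);
        return;
    }
    apply_mask(&row[w1], lm, bit);      /* do_left */
    for (w = w1 + 1; w < w2; w++)       /* the unrolled middle words */
        row[w] = bit ? 0xFFFF : 0x0000;
    apply_mask(&row[w2], rm, bit);      /* do_right */
}
```

The assembly repeats this once per bitplane, taking `bit` from successive low bits of the colour.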

roondar · 26 August 2018, 14:48

Quote:

Originally Posted by deimos

Thanks. I'm not sure right now whether I'll use this or stick with the current jump table - the jump table allows me to mix in move.l and movem.l instructions to do the really long runs, which might give a bigger advantage, but I won't know until I know how many long runs I have. I'll keep this snippet until then.

One thing I have done though is to move this code to outside the loop, and store the destination to jump to in a register. The total time spent in the subroutine is now 8 seconds, giving me 2.5 fps.

Well, despite the overall program still being slow, that does seem like a good start for the drawing routines. I know that 2.5fps isn't much, but you're drawing much faster than before. We started with 15 + 17 + 20 = 52.
Now we're at: 15 + 17 + 8 = 40. That means your drawing routine (20 down to 8) is now 2.5 times as fast as it was, which is pretty good.

It'll be interesting to see how much faster this part can still get with the other improvements you're still considering.

After that, the hard part starts: optimising the other two parts of the code...

hooverphonique · 27 August 2018, 14:09

Quote:

Originally Posted by Gorf

ok .. not sure about the rest, but for the multiplication you want to replace:
move it to registers and do two shifts (lsl.w #4, d3 and lsl.w #2, d2) and add both results...

or even

Code:

add.w d2,d2
add.w d2,d2 ; *4
move.w d2,d3
add.w d3,d3
add.w d3,d3 ; *16
add.w d3,d2 ; *20
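
The same decomposition (20 = 4 + 16) written out in C, purely as an illustration. `times20` is a made-up name, and the result is truncated to 16 bits just as the `.w` instructions would truncate it.

```c
#include <stdint.h>

/* Multiply by 20 using only doublings and one add, as in the snippet above. */
static uint16_t times20(uint16_t y)
{
    uint16_t y4  = (uint16_t)(y << 2);    /* add.w d2,d2 twice -> y * 4  */
    uint16_t y16 = (uint16_t)(y4 << 2);   /* add.w d3,d3 twice -> y * 16 */
    return (uint16_t)(y4 + y16);          /* y * 4 + y * 16 = y * 20     */
}
```

For the 0..255 row range used in this thread the result always fits in a word.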

deimos · 27 August 2018, 16:38

Quote:

Originally Posted by roondar

We started with 15 + 17 + 20 = 52.
Now we're at: 15 + 17 + 8 = 40.

And we're now at: 15 + 13 + 8 = 36.

I took what I learned getting the 20 to an 8 and applied it to the 17. I haven't even touched the assembly language for it, I've just rewritten parts of the C in a way that generates better assembly.

roondar · 27 August 2018, 17:05

I will add that I don't actually know what a reasonable CPU polygon drawing speed is.

I mean, I can tell you that the CPU writing a single pattern to memory has a (rather theoretical) peak of about 34000 bytes/frame on a 68000 @ 7 MHz, or about 18fps for your 80x80x6 bitplane ball, but that number is not very useful considering you'd like your code to be more than a giant pile of move.w d7,(a1)+'s.

zero · 28 August 2018, 10:34

You could mega-unroll the drawing loops. movem all the spans. Unrolled loops for each colour.

Often it can be faster to create a "draw table" as you calculate the polygons, and then fill them all in one go at the end so that you have more registers to work with.

You can also avoid the need to sort polygons by instead sorting objects by z depth and using simple backface culling with carefully designed models (no concave parts).

I noticed something about NFS. Some colours are reserved just for the other racers. The amateur riders, the track and all the other objects use a smaller number of colours. Could just be aesthetics but could also be a nice little optimization, because even if you have say 4 bitplanes most of the screen only needs to draw 3. The areas where the other competitors are drawn can have plane 4 quickly block cleared over them.
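
To make the backface culling suggestion concrete, here is one common way to do the test on already projected 2D points, sketched in C. The function names are hypothetical, and the sign convention is an assumption: front faces are taken to be wound clockwise on a screen whose y axis points down. Flip the comparison if your models use the other winding.

```c
/* Twice the signed area of the triangle p0, p1, p2 in screen coordinates. */
static long signed_area2(const short p0[2], const short p1[2], const short p2[2])
{
    return (long)(p1[0] - p0[0]) * (p2[1] - p0[1])
         - (long)(p1[1] - p0[1]) * (p2[0] - p0[0]);
}

/* Assumption: clockwise on screen (y down) means front-facing, which makes
 * the signed area positive. Zero area means edge-on, so skip that as well. */
static int is_back_facing(const short poly[][2])
{
    return signed_area2(poly[0], poly[1], poly[2]) <= 0;
}
```

Only the first three points are examined, which is enough for flat, convex polygons.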

deimos · 29 August 2018, 12:47

Here's the current state of my scanline fill routine that calls the horizontal line draw routine from earlier.

I've rewritten the code in assembly, using the compiler generated code as a reference. I've reduced the time spent in it from 13s to 10s (was originally 17s).

This gives 15 + 10 + 8 = 33.

Or 3.03fps.

Not bad, compared to before, but still a way to go though.

As before, I'd be very interested in any pointers on how to improve this code. The one bit I think is probably the worst is the bit around insert_loop, which inserts an edge into a sorted array. The array is guaranteed to be small (say maximum 8 elements, but normally 3 or 4), so I don't think the algorithm really matters too much, but I think I've probably implemented it poorly.
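
For reference, the sorted insert described here looks like this in C. It is a sketch based on the comments in the assembly, with a hypothetical `insert_edge` name, and the caller is assumed to guarantee that the array has room for one more pointer.

```c
typedef struct {
    short x, miny, maxy, sx, numerator, denominator, n;
} Edge;

/* Insert `edge` into `edges[0..num_edges-1]`, which is sorted by ascending
 * miny, and return the new count. */
static int insert_edge(Edge *edges[], int num_edges, Edge *edge)
{
    int k = num_edges;

    /* Shift larger entries up one slot until the right position is found. */
    while (k > 0 && edges[k - 1]->miny > edge->miny) {
        edges[k] = edges[k - 1];
        k--;
    }
    edges[k] = edge;
    return num_edges + 1;
}
```

With only 3 or 4 entries this does a handful of compares at most, so the choice of algorithm is unlikely to be the bottleneck.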

Another problem I seem to be always facing is running out of registers. I guess that's life though.

Anyway, here's the code. It is quite long, but it naturally breaks into two halves, which then aren't too bad to understand. The first half creates the edges (from the polygon's points) and inserts them into an array, sorted by their min y value. The second half takes a pair of edges from the array and bresenhams down them, drawing horizontal lines between them, and replacing them with the next one from the array when their end is reached.

EDIT: Updated code, I spotted and fixed a couple of things.

Code:

	section	"CODE", code

	public	_FastScanlineFill

	public	_FastDrawHorizontalLine

; void FastScanlineFill(struct RastPort * rastPort, const WORD c, WORD poly[][2], const WORD colour)
;
;	a0	rastPort	(for now)
;	d0	c
;	a1	poly
;	d1	colour

; todo - use the proper macros to define these
; offsets into the Edge struct
x		equ	0
miny		equ	2
maxy		equ	4
sx		equ	6
numerator	equ	8
denominator	equ	10
N		equ	12

_FastScanlineFill
	movem.l	a2-a6/d2-d7, -(a7)

	move.l	(4, a0), a0
	lea	(8, a0), a0
	move.l	a0, planes				; planes = rastPort->BitMap->Planes

	move.l	a1, poly
	subq.w	#1, d0
	move.w	d0, cm1
	move.w	d1, colour


	; create all non-horizontal edges for the polygon

	moveq	#0, d7					; numEdges = 0

	moveq.l	#0, d6					; visible = FALSE

	lea	edgePool, a3				; edge = &edgePool[0]

	move.l	poly, a2				; p = &poly[0][0];

	move.w	(a2)+, d4				; x2 = *p++;
	move.w	(a2)+, d5				; y2 = *p++;

	move.w	#0, i					; i = 0
	bra	first_loop

first_loop_begin
	move.w	d4, d2					; x1 = x2;
	move.w	d5, d3					; y1 = y2

	move.w	cm1, d0
	cmp.w	i, d0
	bne	1$					; if (i == c - 1)
	move.l	poly, a2				; 	p = &poly[0][0]
1$
	move.w	(a2)+, d4				; x2
	move.w	(a2)+, d5				; y2

	; ignore horizontal edges

	cmp.w	d3, d5
	beq	first_loop_continue			; if (y1 == y2) continue

	cmp.w	d3, d5
	ble	2$					; if (y1 < y2)



	move.w	d2, (x, a3)				; 	edge->x = x1
	move.w	d3, (miny, a3)				; 	edge->miny = y1
	move.w	d5, (maxy, a3)				; 	edge->maxy = y2
	move.w	d4, d1
	sub.w	d2, d1					; 	dx = x2 - x1
	move.w	d5, d0
	sub.w	d3, d0					; 	y2 - y1
	move.w	d0, (denominator, a3)			; 	edge->denominator = y2 - y1
	move.w	d0, (N, a3)				; 	edge->N = edge->denominator



	tst.w	d6					; 	visible
	bne	3$
	tst.w	d5					; 	y2 >= 0
	blt	3$
	cmp.w	#256, d3				; 	y1 < HEIGHT
	bge	3$
	moveq	#1, d6					; 	visible = visible || y2 >= 0 && y1 < HEIGHT
	bra	3$

2$							; else

	move.w	d4, (x, a3)				; 	edge->x = x2
	move.w	d5, (miny, a3)				;	edge->miny = y2
	move.w	d3, (maxy, a3)				;	edge->maxy = y1
	move.w	d2, d1
	sub.w	d4, d1					; 	dx = x1 - x2
	move.w	d3, d0
	sub.w	d5, d0
	move.w	d0, (denominator, a3)			;	edge->denominator = y1 - y2
	move.w	d0, (N, a3)				;	edge->N = edge->denominator



	tst.w	d6
	bne	3$
	tst.w	d3
	blt	3$
	cmp.w	#256, d5
	bge	3$
	moveq	#1, d6					; 	visible = visible || y1 >= 0 && y2 < HEIGHT

3$
	tst.w	d1
	blt	4$					; if (dx >= 0)

	move.w	#1, (sx, a3)				;	edge->sx = 1;
	move.w	d1, (numerator, a3)			; 	edge->numerator = dx

	bra	5$
4$							; else

	move.w	#-1, (sx, a3)				;	edge->sx = -1
	neg.w	d1
	move.w	d1, (numerator, a3)			;	edge->numerator = -dx

5$


; generally unhappy about this bit...

; insert into sorted array (of max size 4)

	move.w	d7, d1					; k = numEdges

	lea	edges, a0
	move.w	d1, d0
	add.w	d0, d0
	add.w	d0, d0
	lea	(a0, d0.w), a0
	move.l	a0, a5					; pk = &edges[k]

	lea	(-4, a5), a4				; pj = pk - 1

	bra	insert_loop
insert_loop_begin
	move.l	(a4), (a5)				;	*pk-- = *pj--
	subq.l	#4, a4
	subq.l	#4, a5

	subq.w	#1, d1					; 	k--
insert_loop
	tst.w	d1
	ble	insert_loop_end				; k > 0

	move.l	(a4), a0
	move.w	(miny, a0), d0
	cmp.w	(miny, a3), d0
	bgt	insert_loop_begin			; (*pj)->miny > edge->miny
insert_loop_end
	move.l	a3, (a5)				; *pk = edge
	addq.w	#1, d7					; numEdges++



first_loop_continue
	lea	(14, a3), a3				; edge++
	addq.w	#1, i
first_loop
	move.w	i, d0
	cmp.w	cm1, d0
	ble	first_loop_begin


; ---------------------------------------------------------------------------


	; if the polygon is entirely above or below the display then ignore it

	tst.w	d6
	beq	return					; if (!visible) return


; ---------------------------------------------------------------------------


; second half


	; start at the top of the first polygon

	lea	edges, a4				; firstEdge = &edges[0]

	move.l	(a4)+, a2				; edgeTracker1 = *firstEdge++

	move.w	(miny, a2), d2				; y = edgeTracker1->miny

	; exit if we're starting below the bottom of the display

	cmp.w	#256, d2
	bge	return					; if (y >= HEIGHT) return

	; initialise edge trackers


	move.l	(a4)+, a3				; edgeTracker2 = *firstEdge++

	subq.w	#2, d7					; numEdges -= 2

main_loop

	; draw the current scanline if we're within the display

	tst.w	d2
	blt	4$					; if (y >= 0)
	move.w	(x, a2), d0				; 	x1 = edgeTracker1->x
	move.w	(x, a3), d1				; 	x2 = edgeTracker2->x

	cmp.w	d1, d0
	ble	1$					;	if (x1 > x2)
	exg.l	d0, d1					;		x1 <=> x2
1$
	tst.w	d1
	blt	4$					; 	if (x2 >= 0 &&
	cmp.w	#320, d0
	bge	4$					;		x1 < WIDTH)
	tst.w	d0
	bge	2$					;		if (x1 < 0)
	moveq	#0, d0					;			x1 = 0
2$
	cmp.w	#320, d1
	ble	3$					; 		if (x2 > WIDTH)
	move.w	#320, d1				; 			x2 = WIDTH

3$

	cmp.w	d0, d1
	beq	4$					; 		if (x1 != x2)

	move.l	planes, a0
	subq.l	#1, d1
	move.w	colour, d3
	jsr	_FastDrawHorizontalLine			;			FastDrawHorizontalLine(planes, x1, x2 - 1, y, colour)

4$
	; exit if bottom of display reached
	cmp.w	#255, d2
	bge	return					; if (y >= HEIGHT - 1) return

	; replace edge trackers, exit if there are none left

	move.w	(maxy, a2), d0
	cmp.w	d2, d0
	bne	6$					; if (edgeTracker1->maxy == y)
	tst.w	d7
	ble	return					;	if (numEdges <= 0) return

	move.l	(a4)+, a2				; 	edgeTracker1 = *firstEdge++;

	subq.w	#1, d7					;	numEdges--
6$
	move.w	(maxy, a3), d0
	cmp.w	d2, d0
	bne	7$					; if (edgeTracker2->maxy == y)
	tst.w	d7
	ble	return					;	if (numEdges <= 0) return

	move.l	(a4)+, a3				;	edgeTracker2 = *firstEdge++

	subq.w	#1, d7					;	numEdges--

7$
	; update edge trackers

	move.w	(N, a2), d0
	move.w	(numerator, a2), d1
	add.w	d0, d1
	move.w	d1, (N, a2)				; edgeTracker1->N += edgeTracker1->numerator
	bra	edge_tracker_1_loop
edge_tracker_1_loop_begin
	move.w	(x, a2), d0
	move.w	(sx, a2), d1
	add.w	d0, d1
	move.w	d1, (x, a2)				; edgeTracker1->x += edgeTracker1->sx
	move.w	(N, a2), d0
	move.w	(denominator, a2), d1
	neg.w	d1
	add.w	d0, d1
	move.w	d1, (N, a2)				; edgeTracker1->N -= edgeTracker1->denominator
edge_tracker_1_loop
	move.w	(N, a2), d0
	cmp.w	(denominator, a2), d0
	bgt	edge_tracker_1_loop_begin		; edgeTracker1->N > edgeTracker1->denominator


	move.w	(N, a3), d0
	move.w	(numerator, a3), d1
	add.w	d0, d1
	move.w	d1, (N, a3)				; edgeTracker2->N += edgeTracker2->numerator
	bra	edge_tracker_2_loop
edge_tracker_2_loop_begin
	move.w	(x, a3), d0
	move.w	(sx, a3), d1
	add.w	d0, d1
	move.w	d1, (x, a3)				; edgeTracker2->x += edgeTracker2->sx
	move.w	(N, a3), d0
	move.w	(denominator, a3), d1
	neg.w	d1
	add.w	d0, d1
	move.w	d1, (N, a3)				; edgeTracker2->N -= edgeTracker2->denominator
edge_tracker_2_loop
	move.w	(N, a3), d0
	cmp.w	(denominator, a3), d0
	bgt	edge_tracker_2_loop_begin		; edgeTracker2->N > edgeTracker2->denominator


	; move to the next scanline
	addq.w	#1, d2					; y++

	bra	main_loop

return

	movem.l	(a7)+, a2-a6/d2-d7

	rts

	section	"BSS", bss

planes		ds.l	1
cm1		ds.w	1
poly		ds.l	1
colour		ds.w	1

edges		ds.l	4
edgePool	ds.w	4*7

; run out of registers
i		ds.w	1
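
The per-scanline edge update in the second half can be summarised in C as below. This is a sketch reconstructed from the comments in the assembly. `edge_step` is a hypothetical name, and the field called N in the assembly is `n` here.

```c
typedef struct {
    short x, miny, maxy, sx, numerator, denominator, n;
} Edge;

/* Advance an edge by one scanline: add |dx| to the error term and move x
 * by sx for every whole dy that has accumulated. */
static void edge_step(Edge *e)
{
    e->n += e->numerator;               /* N += |dx| */
    while (e->n > e->denominator) {     /* while (N > dy) */
        e->x += e->sx;                  /*     x += sx */
        e->n -= e->denominator;         /*     N -= dy */
    }
}
```

Because N starts out equal to the denominator, an edge from (0,0) to (10,5) arrives at x = 10 after five steps.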

deimos · 29 August 2018, 18:33

Quote:

Originally Posted by zero

Often it can be faster to create a "draw table" as you calculate the polygons, and then fill them all in one go at the end so that you have more registers to work with.

I think this is an idea worth exploring, thanks.

zero · 30 August 2018, 09:59

Quote:

Originally Posted by deimos

Here's the current state of my scanline fill routine that calls the horizontal line draw routine from earlier.

Think about how you can optimize the scanline drawing code. You have two scenarios.

1. Very short span, all in one word. Can be optimized with a 256 byte lookup table.

2. Longer spans, which have a start word, end word and if long enough some fill words between them.

For (2) I'd suggest lookup tables for the start and end words again. Just calculate the word width of the line (shift by 4) and then do

- Write start word
- Jump into span fill (including jump over if span is 0 width)
- Write end word

No loops.

Another option is to do all the start/end words and then come back and do span fills with movem. You can just create a little stack of span fills to do.
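
One way to read the lookup table suggested for case 1 is a 16 x 16 table indexed by the first and last bit of the span, giving the finished mask in a single lookup. The C sketch below builds such a table. The names are hypothetical, and stored as words the 256 entries take 512 bytes.

```c
#include <stdint.h>

/* span_mask[s][e] covers bits s..e of a word, where bit 0 is the leftmost pixel. */
static uint16_t span_mask[16][16];

static void init_span_masks(void)
{
    int s, e;
    for (s = 0; s < 16; s++) {
        for (e = 0; e < 16; e++) {
            uint16_t left  = (uint16_t)(0xFFFFu >> s);          /* bits s..15 */
            uint16_t right = (uint16_t)(0xFFFFu << (15 - e));   /* bits 0..e  */
            span_mask[s][e] = (uint16_t)(left & right);         /* 0 if e < s */
        }
    }
}
```

In assembly the index would be something like (s * 16 + e) * 2, so whether this beats ANDing two 16-entry tables would need measuring.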

roondar · 30 August 2018, 12:09

I didn't have time to look into your code in detail, but here are a few things I noted anyway - maybe they help.

1) You use static memory addresses to store/retrieve variables for a number of things, including the loop counters. If this is really required (you generally do not want anything that is frequently used in a routine to be outside of a register), there is a more efficient way to do so, which is using the stack instead. Doing so allows access to these variables via the stack pointer, which both makes using structures easier and is usually faster than using direct memory addresses.

You can do this using something like this:

Code:

   movem.l d0-d7/a0-a6,-(a7) ; Start of routine, push to stack
   lea     -128(a7),a7       ; reserve 128 bytes on the stack
...stuff happens...

   move.w  dx,2(a7)          ; push variable to the reserved area
   move.w  dx,4(a7)          ; push variable to the reserved area
...more stuff...

   move.w  2(a7),dx          ; pull variable from the reserved area
   cmp.w   4(a7),dx          ; compare to variable in the reserved area
...even more stuff...

   lea     128(a7),a7        ; release reserved stack space
   movem.l (a7)+,d0-d7/a0-a6 ; End of routine, pop from stack

If you prefer, you can also use the LINK/UNLK instructions to achieve this. However, that generally requires a free address register to be useful.

2) You've chosen to keep the loop counters outside of registers. Depending on how often you expect the loop to run and how often each segment of the loop is entered, this may be less optimal than pushing some other variables into memory instead.

3) I'd suggest looking into the DBcc instructions (DBRA/DBNE/DBMI/etc). These can optimise loops by combining a branch type test with a decrementing counter and are faster than doing both separately. This does require your loop counter to run in reverse.

4) As with the other code you showed, the loop structure feels a bit odd. Now it is possible you chose this structure for a very good reason (it may be faster), but I do find it odd to see so many loops being triggered by a branch into a point somewhere in the loop (rather than starting without such a branch).

Generally I'd recommend reducing the number of branches and compares to the absolute minimum possible.
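
To illustrate point 3, this is roughly how a DBRA loop counts, modelled in C with hypothetical function names. DBRA decrements the counter and branches back unless the counter has just become -1, so a loop that falls into its body first runs counter + 1 times, while a loop entered at the DBRA itself (as the horizontal line routine earlier in the thread does) runs counter times.

```c
/* Body first, DBRA at the bottom: runs counter + 1 times. */
static int dbra_body_first(short counter)
{
    int runs = 0;
    do {
        runs++;                     /* loop body */
    } while (--counter != -1);      /* dbra: decrement, loop unless now -1 */
    return runs;
}

/* Entered by branching straight to the DBRA: runs counter times. */
static int dbra_test_first(short counter)
{
    int runs = 0;
    while (--counter != -1)
        runs++;                     /* loop body */
    return runs;
}
```

Only non-negative starting counts are considered here.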

deimos · 30 August 2018, 12:39

I believe I have essentially what you've suggested, except I have lookup tables for each end, and AND them together if the start and end are within the same word.

deimos · 30 August 2018, 12:58

1, 2 & 3 came about because I simply ran out of registers. I tried to choose the variables that were accessed the least frequently to keep in memory, and that included the loop counter. I'd like to change the loop to use a dbra, that would also get rid of the cm1 variable as that's used to detect the final loop through the points when we have to connect back to the first point. But I have to free up a register or two first. I'm thinking of either misusing an address register, or swapping to use the top and bottom halves of a register.

As for using the stack for things that won't fit in registers, I'm all for it, I just need to understand how to calculate the sizes / offsets.

Regarding 4, this is the way I used to do it in '87. I'll have a closer look to see if there's actually a reason, or whether I learnt it wrong back then.

deimos · 30 August 2018, 13:04

Creating a draw table and drawing all the lines as a batch gained me a second of processing time, from 33s to 32s, or 0.1 fps, as I don't have to save a bunch of registers on the stack for every line drawn.

Quote:

Originally Posted by zero

Often it can be faster to create a "draw table" as you calculate the polygons, and then fill them all in one go at the end so that you have more registers to work with.
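
For anyone following along, the draw table idea amounts to something like the C sketch below: record spans while walking the edges, then draw them all in one pass. All names are hypothetical, and a byte-per-pixel buffer stands in for the bitplanes to keep the example short.

```c
#include <stddef.h>

#define MAX_SPANS 256               /* assumption: one span per scanline */

typedef struct { short y, x1, x2; } Span;       /* x1..x2 inclusive */

typedef struct {
    Span spans[MAX_SPANS];
    int  count;
} DrawTable;

/* Record one span while scan converting; returns 0 if the table is full. */
static int draw_table_add(DrawTable *t, short y, short x1, short x2)
{
    if (t->count >= MAX_SPANS)
        return 0;
    t->spans[t->count].y  = y;
    t->spans[t->count].x1 = x1;
    t->spans[t->count].x2 = x2;
    t->count++;
    return 1;
}

/* Draw every recorded span in one go, then empty the table.
 * Returns the number of spans drawn. */
static int draw_table_flush(DrawTable *t, unsigned char *pixels, int width,
                            unsigned char colour)
{
    int i, x, drawn = t->count;
    for (i = 0; i < t->count; i++) {
        const Span *s = &t->spans[i];
        unsigned char *row = pixels + (size_t)s->y * (size_t)width;
        for (x = s->x1; x <= s->x2; x++)
            row[x] = colour;
    }
    t->count = 0;
    return drawn;
}
```

The saving reported above comes from the flush loop keeping its state in registers instead of saving and restoring them for every line.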

roondar · 30 August 2018, 13:54

Quote:

Originally Posted by deimos

As for using the stack for things that won't fit in registers, I'm all for it, I just need to understand how to calculate the sizes / offsets.

I'm all for being lazy and thus I usually let the assembler do that for me by defining a structure or two. The syntax to do so is sadly not fully standardized so my example may not work, but here is a VASM example nonetheless.

Code:

        rsset 2  ; Starting offset should be equal to size (2 per .w/4 per .l) of
                 ; first element to store
cm1     rs.w 1   ; word variable cm1
i       rs.w 1   ; word variable i
extra   rs.l 1   ; long variable extra
exstr   rs.l 10  ; example structure - 10 longs in size
strsize EQU __RS ; __RS is the internal variable VASM uses for rs offsets.

        lea.l  -strsize(a7),a7 ; reserves space for this structure on the stack
        move.w #10,cm1(a7)
        move.w #0,i(a7)
        move.l #$c0ffee,extra(a7)
... etc...

        lea.l   strsize(a7),a7 ; frees stack

If you prefer to do this manually, it should be a case of using an offset of 2 per word/byte and 4 for every long word.

deimos · 30 August 2018, 13:58

Thanks, I'll give it a go.

I take it I can use the same method for defining the constants used to access into my Edge structures?

Code:

x		equ	0
miny		equ	2
maxy		equ	4
sx		equ	6
numerator	equ	8
denominator	equ	10
N		equ	12
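
As a sanity check on those numbers, the C side can produce the same offsets with `offsetof`. This is a sketch that assumes the Edge struct is seven 16-bit fields in the order listed above; the `EDGE_*` constant names are made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout, matching the equ table field for field. */
typedef struct {
    int16_t x, miny, maxy, sx, numerator, denominator, N;
} Edge;

enum {
    EDGE_X           = offsetof(Edge, x),
    EDGE_MINY        = offsetof(Edge, miny),
    EDGE_MAXY        = offsetof(Edge, maxy),
    EDGE_SX          = offsetof(Edge, sx),
    EDGE_NUMERATOR   = offsetof(Edge, numerator),
    EDGE_DENOMINATOR = offsetof(Edge, denominator),
    EDGE_N           = offsetof(Edge, N),
    EDGE_SIZE        = sizeof(Edge)     /* the 14 used for edge++ */
};
```

If the C struct ever changes, comparing these constants with the assembler's offsets catches a mismatch early.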

roondar · 30 August 2018, 14:17

Yes, that is indeed possible, as long as you use valid offsets.

I did forget to mention something:

Quote:

I'm thinking of either misusing an address register, or swapping to use the top and bottom halves of a register.

I'd say: go for it, I use address registers for non-address register purposes as needed.

There are a few small caveats in doing so: sometimes doing an operation in one direction is faster than the other (for example, cmp.w dx,an is slower than cmp.w an,dx), and moves to address registers never set the condition codes. But other than that, it can really help.

The swap around trick can also work, but isn't always faster (you usually end up needing two swaps, which can add up).
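
The swap trick, modelled in C for clarity, with hypothetical helper names. Two 16-bit values share one 32-bit register, word-sized instructions only see the low half, and `swap` exchanges the halves.

```c
#include <stdint.h>

/* Keep two word-sized variables in one 32-bit "register". */
static uint32_t pack_words(uint16_t high, uint16_t low)
{
    return ((uint32_t)high << 16) | low;
}

/* What the 68000 swap instruction does to a data register. */
static uint32_t swap_halves(uint32_t r)
{
    return (r << 16) | (r >> 16);
}

/* The half that .w instructions operate on. */
static uint16_t low_word(uint32_t r)
{
    return (uint16_t)r;
}
```

Reaching the hidden value costs one swap, and getting back to the original one costs another, which is where the "two swaps" overhead comes from.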

zero · 30 August 2018, 18:06

Quote:

Originally Posted by deimos

Creating a draw table and drawing all the lines as a batch gained me a second of processing time, from 33s to 32s, or 0.1 fps, as I don't have to save a bunch of registers on the stack for every line drawn.

So you got rid of this table?

Code:

19$	move.w	d7, (a1)+
18$	move.w	d7, (a1)+
17$	move.w	d7, (a1)+
16$	move.w	d7, (a1)+
15$	move.w	d7, (a1)+
14$	move.w	d7, (a1)+
13$	move.w	d7, (a1)+
12$	move.w	d7, (a1)+
11$	move.w	d7, (a1)+
10$	move.w	d7, (a1)+
9$	move.w	d7, (a1)+
8$	move.w	d7, (a1)+
7$	move.w	d7, (a1)+
6$	move.w	d7, (a1)+
5$	move.w	d7, (a1)+
4$	move.w	d7, (a1)+
3$	move.w	d7, (a1)+
2$	move.w	d7, (a1)+

Replaced it with a movem table? Particularly for the longer spans that will be the fastest possible way of doing it. Remember you can movem.w and movem.l too.
