How did games like Starglider 2 get such a high frame rate? - Page 4

Tigerskunk · 22 August 2018, 09:25

I only want to drop by to say that I think it's nice someone is doing new 3D game stuff on OCS Amigas. We have a lot of 2D game dev going on again, but 3D was missing so far...

desiv · 22 August 2018, 19:59

Quote:

Originally Posted by roondar

There is that jet fighter game that apparently ran at 50Hz. But I'm forgetting the name now.

Possibly not the one you were thinking of, but the Fighter Duel games on the Amiga seemed pretty darn fast (and in hires IIRC).

desiv

zero · 23 August 2018, 10:23

It's surprising how many pixels you can push with the CPU.

A 320x200 pixel display area is 32k per frame in 16 colours (4 bitplanes), which is 1.6 MB/sec at 50 fps. At the lower end, say 24k for 8 colours (3 bitplanes) and 600k/sec at 25 fps. That's more realistic for an A500 that also needs to do some geometry processing.
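For anyone who wants to check those figures, here is a minimal C sketch of the arithmetic. The function names are made up for illustration, "k" is taken as 1000 bytes to match the numbers in the post, and it assumes every pixel of every bitplane is rewritten each frame.

```c
#include <stdint.h>

/* Bytes in one frame of a planar display: each bitplane stores
   one bit per pixel, so 8 pixels of one plane fit in a byte. */
uint32_t frame_bytes(uint32_t width, uint32_t height, uint32_t bitplanes)
{
    return width * height * bitplanes / 8u;
}

/* Bytes per second the CPU has to write to redraw the whole
   display area at the given frame rate. */
uint32_t fill_bandwidth(uint32_t width, uint32_t height,
                        uint32_t bitplanes, uint32_t fps)
{
    return frame_bytes(width, height, bitplanes) * fps;
}
```

With 4 bitplanes at 50 fps this gives 32000 bytes per frame and 1600000 bytes per second; with 3 bitplanes at 25 fps it gives 24000 and 600000.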

Games like NSP and Fighter Duel must be getting close to the bandwidth limits of the RAM. If I find the time I'd like to do a frame rate analysis of NSP, but my estimate would be 15-20 fps.

roondar · 23 August 2018, 13:29

Quote:

Originally Posted by desiv

Possibly not the one you were thinking of, but the Fighter Duel games on the Amiga seemed pretty darn fast (and in hires IIRC).

desiv

Fighter Duel! That's the one I meant

zero · 23 August 2018, 15:17

The horizon on Fighter Duel is interesting. It looks like it uses several colours, which is relatively expensive to draw. Considering the rest of the 3D display only uses a few colours, it would slow that drawing down a lot unless some trick was used.

deimos · 24 August 2018, 09:46

Just so that anyone who's interested knows, I just got back to this after a bit of a forced break. I've done some work on the assembly version of the horizontal line routine and have knocked 5s off the 52s processing time. That's about 25% off that bit of code.

Most of the improvement came from changing a lot of .l instructions to .w - I think that, as C constants are always at least int sized, the compiler is forced to treat them as 32 bits. This caused a lot of things to be extended to longs before long operations are applied and then the top half of the long result is discarded as the result is assigned to a word.

I'm not sure how I can fix that in C, but if I could, I'd see a small improvement straight away, and the generated code would be a better starting point for an assembly re-implementation.

Quote:

Originally Posted by deimos

Here are the current numbers following the symmetry-based optimisations on the vertex data suggested by chb:

My code takes 52s to draw 100 frames (320x256, 6 bitplane / EHB mode). There are 128 polygons represented by 12 x-coordinates, 5 y-coordinates and 12 z-coordinates (114 vertices pointing into that coordinate data).

100 / 52 gives 1.92 fps.

The CPU time is spent as follows:

35% is spent rotating and moving the ball and doing the 3D calculations.
27% is doing the edge tracking part of the scanline fill algorithm.
38% is spent drawing horizontal lines between active edges to fill the polygons.

The code is mostly C, but I do have some assembly in there for a couple of hotspots and for the fixed point arithmetic.

I think my next step is to rewrite more of it in assembly, but I don't believe I'll get more than a 25% improvement across the board, firstly because it won't be changing the underlying algorithms, and secondly because the code generated by the compiler isn't bad to start with.

netraider · 24 August 2018, 11:13

Quote:

Originally Posted by Estrayk

You said, one of the fastest. Is there any faster than NSP for Amiga OCS?

This.

Although the objects are simple, it's the fastest and smoothest I've ever seen on Amiga OCS, and manages full frame rate most of the time.
Very playable too...

roondar · 24 August 2018, 13:46

Quote:

Originally Posted by deimos

Just so that anyone who's interested knows, I just got back to this after a bit of a forced break. I've done some work on the assembly version of the horizontal line routine and have knocked 5s off the 52s processing time. That's about 25% off that bit of code.

I have a question, because I'm still a bit puzzled here.

You see, if I understand what you wrote correctly, your drawing code takes 38*0.75=28,5% of the total time and your frame rate is now 100/47=2,13fps.

Of this time, 28,5% is spent drawing. This would lead to a 'no-calculations frame rate' of (1/0,285)*2,13=7,5fps. And that seems like it's rather slow, even when accounting for the fact the CPU is used to draw.

Edit: reading this back, I'm fairly sure my calculated new percentage is not correct, but my question still stands even if I am a bit off with the numbers

This all makes me wonder: what is your effective fill rate in bytes (or pixels if you prefer) per second?

Quote:

Most of the improvement came from changing a lot of .l instructions to .w - I think that, as C constants are always at least int sized, the compiler is forced to treat them as 32 bits. This caused a lot of things to be extended to longs before long operations are applied and then the top half of the long result is discarded as the result is assigned to a word.

I'm not sure how I can fix that in C, but if I could, I'd see a small improvement straight away, and the generated code would be a better starting point for an assembly re-implementation.

You might be able to force them into 16 bit values by casting them to a 16 bit type whenever you use them. If the compiler is 'smart enough', it should use them as .w values then (and not do some silly conversion step).
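A rough sketch of what that casting could look like in C. WORD here is a stand-in typedef and the function names are invented for illustration. Whether a particular compiler then emits .w instructions is not guaranteed, so the generated assembly would still need checking.

```c
#include <stdint.h>

typedef int16_t WORD;   /* stand-in for the Amiga WORD type (assumption) */

/* The literal 20 has type int, so y is promoted and the multiply and
   add are evaluated at int width, then narrowed on return. */
WORD word_offset_promoted(WORD x, WORD y)
{
    return (WORD)(y * 20 + (x >> 4));
}

/* Every intermediate is cast back to WORD. The language still promotes
   the operands, but the explicit narrowing tells the compiler that only
   the low 16 bits of each step are needed. */
WORD word_offset_narrowed(WORD x, WORD y)
{
    WORD row = (WORD)(y * (WORD)20);
    WORD col = (WORD)(x >> 4);
    return (WORD)(row + col);
}
```

For on-screen coordinates both versions return the same value, which is what lets a compiler legally pick the shorter instructions in the second form.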

deimos · 24 August 2018, 14:07

Quote:

Originally Posted by roondar

your drawing code takes 38*0.75=28,5% of the total time and your frame rate is now 100/47=2,13fps

Almost. The code was 15 + 17 + 20 = 52, it's now 15 + 17 + 15 = 47.

15 / 47 = 32%.

But, your point stands, if the 15 + 17 weren't there, I'd have 15 seconds for 100 frames, or 6.67 fps, which isn't fast enough.

What's my effective fill rate? Quick maths, looking at the image in the very first post in this thread and estimating the ball at 80px diameter, leads me to say approximately 10,000 pixels per second (at 6 bit planes). I'm not sure if that's what you meant.

Edit: I don't think that maths is correct, the ball is about 5000 pixels, at 6.67 fps that's 33,000 pixels per second.

I'll experiment with casting down to 16 bits when I get to the next step.

And hopefully I can post my line draw code later today or tomorrow.

roondar · 24 August 2018, 14:42

Quote:

Originally Posted by deimos

Almost. The code was 15 + 17 + 20 = 52, it's now 15 + 17 + 15 = 47.

15 / 47 = 32%.

But, your point stands, if the 15 + 17 weren't there, I'd have 15 seconds for 100 frames, or 6.67 fps, which isn't fast enough.

What's my effective fill rate? Quick maths, looking at the image in the very first post in this thread and estimating the ball at 80px diameter, leads me to say approximately 10,000 pixels per second (at 6 bit planes). I'm not sure if that's what you meant.

Edit: I don't think that maths is correct, the ball is about 5000 pixels, at 6.67 fps that's 33,000 pixels per second.

I'll experiment with casting down to 16 bits when I get to the next step.

And hopefully I can post my line draw code later today or tomorrow.

Let's see, 33000 pixels x 6bpl = 198000 1-bpl pixels drawn in total over one second. Spread over 50 display frames per second, that would be 3968 pixels per 1/50s frame (1-bpl, rounded up to 1 word width increments).

Without the code it's of course tricky to be certain how you implemented it, but assuming you've optimised the scanline algorithm to draw as many full words per step as possible (rather than separate pixels which would be very slow) that would lead to an optimal case of moving 248 words of data per frame.

Something isn't right there. I understand much more is happening than just moving the data, but this result would indicate the code spends in the region of 500-600 cycles for every word drawn (@ 1bpl). Which feels like it's a very high amount.
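The estimate above can be reproduced with a few lines of C. This is only a sketch: the 7.09 MHz PAL CPU clock is my assumption (the post does not state which clock was used), the names are invented, and bitplane DMA stealing cycles from the CPU is ignored.

```c
#include <stdint.h>

#define PAL_CPU_HZ  7093790u  /* assumed: stock PAL 68000 clock */
#define DISPLAY_HZ  50u       /* display frames per second */

/* 1-bitplane pixels written per 1/50 s, rounded up to whole
   16-pixel words. */
uint32_t words_per_display_frame(uint32_t pixels_per_sec, uint32_t bitplanes)
{
    uint32_t plane_pixels = pixels_per_sec * bitplanes / DISPLAY_HZ;
    return (plane_pixels + 15u) / 16u;
}

/* CPU cycles available for each word written, if drawing were the
   only thing the CPU did. */
uint32_t cycles_per_word(uint32_t pixels_per_sec, uint32_t bitplanes)
{
    uint32_t cycles_per_frame = PAL_CPU_HZ / DISPLAY_HZ;
    return cycles_per_frame / words_per_display_frame(pixels_per_sec, bitplanes);
}
```

For 33000 pixels per second at 6 bitplanes this gives 248 words per display frame and 572 cycles per word, which falls inside the 500-600 range quoted above.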

grond · 24 August 2018, 16:52

Not sure anyone has mentioned it yet: you can define a bounding box for all your objects, project that onto the screen and determine the (word-aligned) regions where the updated bounding box and the preceding bounding box do not overlap. Then you just need to clear the non-overlapping segments of the bounding boxes for the next frame.
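A minimal C sketch of the rectangle bookkeeping this implies. The Box type and function names are hypothetical, coordinates are assumed to be non-negative screen pixels, and it only works out which parts of the previous box fall outside the new one. Whatever redraws inside the new box is assumed to overwrite every pixel there.

```c
/* Half-open screen rectangle: x0 <= x < x1, y0 <= y < y1 (pixels). */
typedef struct { int x0, y0, x1, y1; } Box;

/* Widen a box so both horizontal edges sit on 16-pixel word boundaries. */
Box word_align(Box b)
{
    b.x0 &= ~15;
    b.x1 = (b.x1 + 15) & ~15;
    return b;
}

/* Write the parts of `old` not covered by `cur` into out[] (at most
   4 rectangles) and return how many were written. */
int box_difference(Box old, Box cur, Box out[4])
{
    int n = 0;
    int ix0 = cur.x0 > old.x0 ? cur.x0 : old.x0;
    int iy0 = cur.y0 > old.y0 ? cur.y0 : old.y0;
    int ix1 = cur.x1 < old.x1 ? cur.x1 : old.x1;
    int iy1 = cur.y1 < old.y1 ? cur.y1 : old.y1;

    if (ix0 >= ix1 || iy0 >= iy1) {   /* no overlap: clear all of old */
        out[0] = old;
        return 1;
    }
    if (old.y0 < iy0) out[n++] = (Box){ old.x0, old.y0, old.x1, iy0 };    /* strip above */
    if (iy1 < old.y1) out[n++] = (Box){ old.x0, iy1, old.x1, old.y1 };    /* strip below */
    if (old.x0 < ix0) out[n++] = (Box){ old.x0, iy0, ix0, iy1 };          /* strip left  */
    if (ix1 < old.x1) out[n++] = (Box){ ix1, iy0, old.x1, iy1 };          /* strip right */
    return n;
}
```

Each returned rectangle is already word-aligned horizontally if both inputs went through word_align first, so it can be cleared with whole-word writes.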

deimos · 24 August 2018, 19:17

Quote:

Originally Posted by deimos

hopefully I can post my line draw code later today

Code:

	section	"CODE", code

	public	_FastDrawHorizontalLine

; void FastDrawHorizontalLine(const PLANEPTR planes [], const WORD x1, const WORD x2, const WORD y, const WORD colour)
;
; 	a0 	planes
; 	d0 	x1
; 	d1 	x2
; 	d2 	y
; 	d3 	colour

_FastDrawHorizontalLine
		movem.l	d2-d7/a2-a5, -(a7)
	
		mulu.w	#20, d2						; 20 words per 320 pixel row - aim to replace this multiply with an add

		move.w	d0, d4
		asr.w	#4, d4
		add.w	d2, d4

		move.w	d1, d5
		asr.w	#4, d5
		add.w	d2, d5

		and.w	#$000f, d0
		lsl.w	#1, d0
		lea		left_bits, a2
		move.w	(a2, d0.w), d0
		move.w	d0, a3

		and.w	#$000f, d1
		lsl.w	#1, d1
		lea		right_bits, a2
		move.w	(a2, d1.w), d1
		move.w	d1, a4

		and.w	d1, d0
		move.w	d0, a5

		moveq	#6, d6						; 6 bitplanes
		bra		loop_over_planes

loop_over_planes_start
		move.l	(a0)+, a1
		move.w	d4, d7
		lsl.w	#1, d7
		lea		(a1, d7.w), a1

		lsr.b	d3
		bcc 	1$
		moveq	#-1, d7
		bra		2$
1$		moveq	#0, d7

2$		cmp.w	d4, d5
		beq		do_overlap

do_left
		move.w	a3, d1
		move.w	(a1), d0
		tst.w	d7
		beq		1$
		or.w	d1, d0
		bra		2$
1$		not.w	d1
		and.w	d1, d0
2$		move.w	d0, (a1)+

do_middle
		move.w	d5, d0
		sub.w	d4, d0
		sub.w	#2, d0
		cmp.w	#17, d0
		bhi		do_right
		lsl.w	#2, d0
		move.l	duff(pc, d0.w), a2
		jmp	 	(a2)

		cnop	0, 4
duff	dc.l	2$
		dc.l	3$
		dc.l	4$
		dc.l	5$
		dc.l	6$
		dc.l	7$
		dc.l	8$
		dc.l	9$
		dc.l	10$
		dc.l	11$
		dc.l	12$
		dc.l	13$
		dc.l	14$
		dc.l	15$
		dc.l	16$
		dc.l	17$
		dc.l	18$
		dc.l	19$

19$		move.w	d7, (a1)+
18$		move.w	d7, (a1)+
17$		move.w	d7, (a1)+
16$		move.w	d7, (a1)+
15$		move.w	d7, (a1)+
14$		move.w	d7, (a1)+
13$		move.w	d7, (a1)+
12$		move.w	d7, (a1)+
11$		move.w	d7, (a1)+
10$		move.w	d7, (a1)+
9$		move.w	d7, (a1)+
8$		move.w	d7, (a1)+
7$		move.w	d7, (a1)+
6$		move.w	d7, (a1)+
5$		move.w	d7, (a1)+
4$		move.w	d7, (a1)+
3$		move.w	d7, (a1)+
2$		move.w	d7, (a1)+

do_right
		move.w	a4, d1
		move.w	(a1), d0
		tst.w	d7
		beq		1$
		or.w	d1, d0
		bra		2$
1$		not.w	d1
		and.w	d1, d0
2$		move.w	d0, (a1)+

		bra		loop_over_planes

do_overlap
		move.w	a5, d1
		move.w	(a1), d0
		tst.w	d7
		beq		1$
		or.w	d1, d0
		bra		2$
1$		not.w	d1
		and.w	d1, d0
2$		move.w	d0, (a1)+

loop_over_planes
		dbra	d6, loop_over_planes_start

		movem.l	(a7)+, d2-d7/a2-a5
		rts

		section	"DATA", data

		cnop	0, 4

left_bits
		dc.w	$ffff, $7fff, $3fff, $1fff, $0fff, $07ff, $03ff, $01ff
		dc.w	$00ff, $007f, $003f, $001f, $000f, $0007, $0003, $0001
	
right_bits
		dc.w	$8000, $c000, $e000, $f000, $f800, $fc00, $fe00, $ff00
		dc.w	$ff80, $ffc0, $ffe0, $fff0, $fff8, $fffc, $fffe, $ffff

Oh, and I'm now at 15 + 17 + 10 = 42, or 2.38 fps, meaning I've doubled the performance of this piece of code by rewriting it in assembly.

Gorf · 24 August 2018, 22:19

ok .. not sure about the rest, but for the multiplication you want to replace:
copy the value into a second register, do two shifts (lsl.w #4, d3 and lsl.w #2, d2) and add both results (y*16 + y*4 = y*20)...

but still: I am sure we are missing something obvious, as it is way too slow
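The shift-and-add replacement for the multiply by 20 can be checked in C before touching the assembly. This is just a sketch with an invented function name, and it assumes the 20-words-per-row layout from the posted routine. Whether it actually beats mulu on a 68000 is worth timing rather than assuming.

```c
#include <stdint.h>

/* y * 20 written as y*16 + y*4: two shifts and one add. */
uint16_t row_offset_words(uint16_t y)
{
    return (uint16_t)((y << 4) + (y << 2));
}
```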

roondar · 24 August 2018, 23:13

Looking at your code and new times quoted, there are a few things to note. As before, I'm going to just look at the drawing part and pretend that the rest is not being done

1) Your new code runs at 10fps/6 bitplanes. It should draw at 15fps if you used 4 bitplanes/20 fps if you used 3 bitplanes. Probably even faster, as 6 bitplanes slows down the CPU by something on the order of 30% per frame (it loses 50% during the period where a six bitplane screen is shown, 0% where no screen is shown).

Most, if not all, OCS Amiga 3D games run using a maximum of 4 bitplanes - the machine is just not fast enough to do 3D well in more than that. I'd suggest you try that.

2) Your indirect jump can be optimized. Each of the move statements takes 2 bytes, so you don't need a lookup table. You should be able to do something like

Code:

   add.w d0,d0 ; Adding is slightly faster on 68000 than a shift up to a shift of 2. Shifting three or more is faster using the shift commands.
   jmp movetable(pc,d0.w) ; Jump can be indirect

movetable 
   move.w d7,(a1)+
   move.w d7,(a1)+
; etc

This will be slightly faster per loop.

3) There is one other thing that might make it faster for bigger objects (as in, wider than 32 pixels on average). Right now, you are using move.w d7,(a1)+.
If you draw mostly those larger objects, using move.l d7,(a1)+ could be quite a bit faster.

Code:

move.w dn,(an)+ ; 8 cycles per word => 16 cycles per long
move.l dn,(an)+ ; 12 cycles per long =>25% faster

4) Another very minor thing I spotted was that there are multiple small lsl.w/lsl.b commands. A few cycles can be saved per loop by changing these to add.w dx,dx's/add.b dx,dx's instead (for shifts of 1 or 2 anyway).

Other than that, the middle part of the loop (which is where you should spend the most time and so I concentrated on it) is looking fine.

--
For the other parts, it might be possible to change part of the code so that you could drop a few tst.w and other branches. I haven't looked at it in great detail, but it feels you might be able to drop a few of those by reordering things.

However, I feel the screen depth you used is probably the biggest hurdle right now.

Maybe others can chip in and get us to see more optimizations than me though

The other code I might not be able to help with as much, as I haven't done any 3D stuff. I do know about 2D stuff and 68k, but not 3D.

deimos · 25 August 2018, 14:03

Regarding 1, yes, I know, but I'm still hoping to make it work. I'd really like to keep the extra colours as I get some nice subtle shading effects and more realistic shadows, and I hope that if I have the top half of the screen in 6 bitplane mode and the bottom half for dashboard / controls in 3 or 4 bitplanes then the CPU will still have enough access to memory. If it doesn't work out then I can change this.

Regarding 2 & 3, I can do 2 right now, but if I decide to do 3, or to use movem.l for longer lines, then I guess I'll need the lookup table back, so I might leave it for now.

At the moment all my polygons are small, only around half of them have a middle part to fill. I don't yet know what the final mix will be, and I suspect this is something where I'll have to implement all the options and let them fight it out amongst themselves. I can imagine all sorts of complex combinations, but considering there's only 10 longs across the screen, perhaps a simple split into even and odd runs where the odd runs get one extra move.w before falling into a Duff's device of move.l's.
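A small C cost model for that even/odd split, using the 8 and 12 cycle figures roondar quoted for move.w and move.l. It is only a sketch with invented names: it counts the store instructions alone and ignores the jump into the unrolled run and any DMA contention.

```c
/* Cycle costs quoted earlier in the thread for a plain 68000. */
#define CYCLES_MOVE_W 8    /* move.w dn,(an)+ */
#define CYCLES_MOVE_L 12   /* move.l dn,(an)+ */

/* Fill a run of `words` 16-bit words using move.w only. */
int fill_cycles_words(int words)
{
    return words * CYCLES_MOVE_W;
}

/* Odd runs get one move.w first, the rest goes out as move.l pairs. */
int fill_cycles_longs(int words)
{
    return (words & 1) * CYCLES_MOVE_W + (words / 2) * CYCLES_MOVE_L;
}
```

For the longest possible middle run of 18 words this is 144 cycles against 108, while a one-word run costs 8 cycles either way, so the gain depends entirely on how many long runs the final polygon mix contains.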

Regarding 4, these sound like easy things that I should do right now, and start to build up my list of similar quick fixes.

Regarding the other parts, I think I may be able to separate the do_overlap part (where the line begins and ends within the same word) and give it its own bitplane loop. That would eliminate one comparison and branch per loop, and maybe free up a register (of which I currently have none spare).

I'll see what I can do today.

Quote:

Originally Posted by roondar

Looking at your code and new times quoted, there are a few things to note. As before, I'm going to just look at the drawing part and pretend that the rest is not being done

1) Your new code runs at 10fps/6 bitplanes. It should draw at 15fps if you used 4 bitplanes/20 fps if you used 3 bitplanes. Probably even faster, as 6 bitplanes slows down the CPU by something on the order of 30% per frame (it loses 50% during the period where a six bitplane screen is shown, 0% where no screen is shown).

Most, if not all, OCS Amiga 3D games run using a maximum of 4 bitplanes - the machine is just not fast enough to do 3D well in more than that. I'd suggest you try that.

2) Your indirect jump can be optimized. Each of the move statements takes 2 bytes, so you don't need a lookup table. You should be able to do something like

Code:

   add.w d0,d0 ; Adding is slightly faster on 68000 than a shift up to a shift of 2. Shifting three or more is faster using the shift commands.
   jmp movetable(pc,d0.w) ; Jump can be indirect

movetable 
   move.w d7,(a1)+
   move.w d7,(a1)+
; etc

This will be slightly faster per loop.

3) There is one other thing that might make it faster for bigger objects (as in, wider than 32 pixels on average). Right now, you are using move.w d7,(a1)+.
If you draw mostly those larger objects, using move.l d7,(a1)+ could be quite a bit faster.

Code:

move.w dn,(an)+ ; 8 cycles per word => 16 cycles per long
move.l dn,(an)+ ; 12 cycles per long =>25% faster

4) Another very minor thing I spotted was that there are multiple small lsl.w/lsl.b commands. A few cycles can be saved per loop by changing these to add.w dx,dx's/add.b dx,dx's instead (for shifts of 1 or 2 anyway).

Other than that, the middle part of the loop (which is where you should spend the most time and so I concentrated on it) is looking fine.

--
For the other parts, it might be possible to change part of the code so that you could drop a few tst.w and other branches. I haven't looked at it in great detail, but it feels you might be able to drop a few of those by reordering things.

However, I feel the screen depth you used is probably the biggest hurdle right now.

Maybe others can chip in and get us to see more optimizations than me though

The other code I might not be able to help with as much, as I haven't done any 3D stuff. I do know about 2D stuff and 68k, but not 3D.

roondar · 25 August 2018, 14:45

Right, so I misunderstood the drawing routine originally. I thought you called this routine once per line of the object being displayed. But you are in fact calling it once per line per polygon being displayed.

With this and your desire to keep running at 6 bitplanes in mind, I've thought up a few other things you could try, though one of them isn't going to help for readability much.

As for the move.l idea, yeah that's only worthwhile when plotting big polygons.

1) This is the most involved suggestion.

You could try to change the way you deal with drawing a polygon in the colour of your choice. Right now, you basically draw the entire polygon, plane by plane (alternating between drawing and erasing by choosing what value to use as source for the move.w's based on the colour per plane) using one routine.

An option could be to do this only once, effectively drawing to a one bitplane polygon 'buffer'. Then combine this pre-drawn one bitplane version with the actual image. This will cause overhead in the form of drawing one additional bitplane of pixels, but will remove a lot of overhead by no longer needing a fairly involved drawing process for each plane but just for one plane.

The remaining planes will lose all that overhead because they essentially become copy or mask operations.

2) This is easier to implement and might still help quite a bit. Currently, you calculate the Y offset using a mulu for every line of every polygon. This is not required. You could move this calculation out into whatever routine calls the line drawing routine. Because you draw line by line, all you'd need to do is multiply once before the start of drawing the polygon and then, for every line, add the modulo to Y, instead of adding one to Y.

This way, a ten line polygon will save out on nine multiply commands.

You can go one further and actually implement this all the way in your projection (I'm assuming the 3D coordinates you use are projected onto the screen and that they are not on a one-to-one scale here). This is more involved, but could remove the need to multiply from the drawing process entirely and keep it there where it's probably being used already.

3) A bit of a micro optimisation this, but you could opt to turn the routine into a macro instead of a called routine. A bsr (or jsr)/rts combo takes up quite a few cycles, especially given how often you call the routine (once per line of each polygon adds up). This will definitely reduce readability of code and isn't the neatest of things but it can help. Do note that this will save more cycles than my 4th suggestion in my earlier post.

4) Another micro optimisation, but consider stack usage. Each register pushed to & later pulled from the stack also costs something like 16 cycles. This is not a lot, but when you call routines very often, it does become worthwhile to check if the registers all need to be put onto the stack and if all of those that really do need to be on the stack actually need to be put there as long values.
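Suggestion 2 above, hoisting the multiply out of the per-line routine, might look like this in C. The function name and the output array are purely illustrative, and the 20 words per row comes from the posted routine.

```c
#include <stdint.h>

#define WORDS_PER_ROW 20   /* 320 pixels / 16 pixels per word */

/* Store the word offset of each scanline of a polygon that starts at
   row y_top and spans `height` rows. Returns the number of rows. */
int collect_row_offsets(int y_top, int height, uint16_t *out)
{
    /* One multiply per polygon... */
    uint16_t offset = (uint16_t)(y_top * WORDS_PER_ROW);

    for (int i = 0; i < height; i++) {
        out[i] = offset;
        offset += WORDS_PER_ROW;   /* ...and one add per scanline. */
    }
    return height;
}
```

In the real fill loop the offset would be passed straight to the line routine instead of being stored, so a ten-line polygon does one multiply and nine adds.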

deimos · 25 August 2018, 15:11

Yes, this is just a polygon fill, not a multiple polygon fill. I looked at those and decided (rightly or wrongly) that they were only suited to chunky displays where you could write a pixel at a time. Also, I can limit myself to convex, non-intersecting polygons, so one polygon at a time works for me.

1) Might actually be a really smart idea, I will give it proper consideration.

2) Already considered and planned, this is what I meant by the comment next to it "aim to replace this multiply with an add". Coherence.

3 & 4) I can't see any purpose of this code apart from supporting the scanline polygon fill, so once it's right (if point 1 doesn't change things) I'll bung it in place and make its register saving specific to the one use case instead of making it a general purpose subroutine.
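For reference, a minimal C sketch of how the one-bitplane buffer idea from point 1 could combine with the real bitplanes. roondar only outlined the approach, so the names and the flat word arrays here are assumptions. A real version would also restrict the loop to the words the polygon actually touches.

```c
#include <stddef.h>
#include <stdint.h>

/* Combine a pre-rendered 1-bitplane mask with one destination plane:
   OR the mask in when the colour has this plane's bit set, otherwise
   AND it out. */
void apply_mask_to_plane(uint16_t *plane, const uint16_t *mask,
                         size_t words, int colour_bit)
{
    for (size_t i = 0; i < words; i++) {
        if (colour_bit)
            plane[i] |= mask[i];
        else
            plane[i] &= (uint16_t)~mask[i];
    }
}

/* Apply the mask to every bitplane, taking one bit of `colour` per
   plane, lowest bit first as in the posted routine. */
void apply_mask(uint16_t *planes[], int depth, const uint16_t *mask,
                size_t words, unsigned colour)
{
    for (int p = 0; p < depth; p++)
        apply_mask_to_plane(planes[p], mask, words, (int)((colour >> p) & 1u));
}
```

The edge masking and span logic then run once per line to build the mask, and each plane only needs the simple OR or AND-NOT pass.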

Quote:

Originally Posted by roondar

Right, so I misunderstood the drawing routine originally. I thought you called this routine once per line of the object being displayed. But you are in fact calling it once per line per polygon being displayed.

With this and your desire to keep running at 6 bitplanes in mind, I've thought up a few other things you could try, though one of them isn't going to help for readability much.

As for the move.l idea, yeah that's only worthwhile when plotting big polygons.

1) This is the most involved suggestion.

You could try to change the way you deal with drawing a polygon in the colour of your choice. Right now, you basically draw the entire polygon, plane by plane (alternating between drawing and erasing by choosing what value to use as source for the move.w's based on the colour per plane) using one routine.

An option could be to do this only once, effectively drawing to a one bitplane polygon 'buffer'. Then combine this pre-drawn one bitplane version with the actual image. This will cause overhead in the form of drawing one additional bitplane of pixels, but will remove a lot of overhead by no longer needing a fairly involved drawing process for each plane but just for one plane.

The remaining planes will lose all that overhead because they essentially become copy or mask operations.

2) This is easier to implement and might still help quite a bit. Currently, you calculate the Y offset using a mulu for every line of every polygon. This is not required. You could move this calculation out into whatever routine calls the line drawing routine. Because you draw line by line, all you'd need to do is multiply once before the start of drawing the polygon and then, for every line, add the modulo to Y, instead of adding one to Y.

This way, a ten line polygon will save out on nine multiply commands.

You can go one further and actually implement this all the way in your projection (I'm assuming the 3D coordinates you use are projected onto the screen and that they are not on a one-to-one scale here). This is more involved, but could remove the need to multiply from the drawing process entirely and keep it there where it's probably being used already.

3) A bit of a micro optimisation this, but you could opt to turn the routine into a macro instead of a called routine. A bsr (or jsr)/rts combo takes up quite a few cycles, especially given how often you call the routine (once per line of each polygon adds up). This will definitely reduce readability of code and isn't the neatest of things but it can help. Do note that this will save more cycles than my 4th suggestion in my earlier post.

4) Another micro optimisation, but consider stack usage. Each register pushed to & later pulled from the stack also costs something like 16 cycles. This is not a lot, but when you call routines very often, it does become worthwhile to check if the registers all need to be put onto the stack and if all of those that really do need to be on the stack actually need to be put there as long values.

deimos · 25 August 2018, 16:25

@roondar, I'm just trying this out, as a last micro-optimisation before I step back and evaluate everything...

Quote:

Originally Posted by roondar

2) Your indirect jump can be optimized. Each of the move statements takes 2 bytes, so you don't need a lookup table. You should be able to do something like

Code:

   add.w d0,d0 ; Adding is slightly faster on 68000 than a shift up to a shift of 2. Shifting three or more is faster using the shift commands.
   jmp movetable(pc,d0.w) ; Jump can be indirect

movetable 
   move.w d7,(a1)+
   move.w d7,(a1)+
; etc

This will be slightly faster per loop.

I'm not sure what is going wrong, it looks like I'm always jumping to one of the first entries of the table, in that I get almost a whole row drawn, as if the value of d0 at the time of the jump is 0. I swear I've not changed anything else, and the values of d4 and d5 are exactly the same as before.

My current code is as follows:

Code:

do_middle
		move.w	d5, d0
		sub.w	d4, d0
		sub.w	#2, d0
		cmp.w	#17, d0
		bhi		do_right
		add.w	d0, d0
		jmp		movetable(pc, d0.w)

movetable
		move.w	d7, (a1)+
		move.w	d7, (a1)+
		move.w	d7, (a1)+
		move.w	d7, (a1)+
		move.w	d7, (a1)+
		move.w	d7, (a1)+
		move.w	d7, (a1)+
		move.w	d7, (a1)+
		move.w	d7, (a1)+
		move.w	d7, (a1)+
		move.w	d7, (a1)+
		move.w	d7, (a1)+
		move.w	d7, (a1)+
		move.w	d7, (a1)+
		move.w	d7, (a1)+
		move.w	d7, (a1)+
		move.w	d7, (a1)+
		move.w	d7, (a1)+

do_right

roondar · 25 August 2018, 16:32

That's quite odd. I use very similar code in a sprite update routine I once wrote (for very small sprites, the CPU beats the blitter for filling). And that works just fine.

Code:

            add.w    d1,d1                    ; *2 to get the offset in words.
            jmp        .unrolled(pc,d1)        ; Jump to the right address here.

.unrolled   move.l    (a0)+,(a1)+
            move.l    (a0)+,(a1)+
            move.l    (a0)+,(a1)+
            ...etc

I use the same basic idea for jump tables into branches. It really should work

Edit: wait... I understand, you call into the jump table using the value of D0, which is numbered precisely opposite to what it would need to be

For 0 pixels, you would need a D0 of <table entries>*2 (jumping just past the last move), for max width it'd have to be 0. Might not work as an optimisation then.
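The reversed numbering is easy to model in C. This sketch assumes the 18-entry run of 2-byte move.w instructions from deimos's post, and the function names are invented for illustration.

```c
#define TABLE_ENTRIES  18   /* unrolled move.w d7,(a1)+ instructions */
#define BYTES_PER_MOVE 2    /* each one is a single 2-byte opcode */

/* How many moves run when execution enters the unrolled run at the
   given byte offset and falls through to the end. */
int moves_executed(int offset_bytes)
{
    return TABLE_ENTRIES - offset_bytes / BYTES_PER_MOVE;
}

/* Byte offset to jump to so that exactly `count` moves run. */
int jump_offset_bytes(int count)
{
    return (TABLE_ENTRIES - count) * BYTES_PER_MOVE;
}
```

Jumping with an offset of 0 runs all 18 moves, which matches the "almost a whole row drawn" symptom for short lines. In the posted routine d0 appears to hold count - 1 at that point, so the equivalent register value would be (17 - d0) * 2. That mapping is my reading of the code, not something confirmed in the thread.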

deimos · 25 August 2018, 16:43

Edit: just saw your edit. Yes, higher numbers are at the start of the table. Maybe I could alter the maths to make it work, but considering everything I may be better off leaving it how it was.

Right now, I have no idea how to debug something like this. With my small polygons this bit of code doesn’t get called all the time, so if I had a debugger I’d set a breakpoint. Maybe in ‘87 I’d know where to start, but today? No chance.

Quote:

Originally Posted by roondar

That's quite odd. I use very similar code in a sprite update routine I once wrote (for very small sprites, the CPU beats the blitter for filling). And that works just fine.

Code:

            add.w    d1,d1                    ; *2 to get the offset in words.
            jmp        .unrolled(pc,d1)        ; Jump to the right address here.

.unrolled   move.l    (a0)+,(a1)+
            move.l    (a0)+,(a1)+
            move.l    (a0)+,(a1)+
            ...etc

I use the same basic idea for jump tables into branches. It really should work

Edit: wait... I understand, you call into the jump table using the value of D0, which is numbered precisely opposite to what it would need to be

For 0 pixels, you would need a D0 of <table entries>+1, for max width it'd have to be 0. Might not work as an optimisation then.

24 August 2018, 23:13	#74
roondar Registered User Join Date: Jul 2015 Location: The Netherlands Posts: 3,414	Looking at your code and new times quoted, there are a few things to note. As before, I'm going to just look at the drawing part and pretend that the rest is not being done 1) Your new code runs at 10fps/6 bitplanes. It should draw at 15fps if you used 4 bitplanes/20 fps if you used 3 bitplanes. Probably even faster, as 6 bitplanes slows down the CPU by something on the order of 30% per frame (it loses 50% during the period where a six bitplane screen is shown, 0% where no screen is shown). Most, if not all, OCS Amiga 3D games run using a maximum of 4 bitplanes - the machine is just not fast enough to do 3D well in more than that. I'd suggest you try that. 2) Your indirect jump can be optimized. Each of the move statements take 2 bytes, so you don't need a lookup table. You should be able to do something like Code: add.w d0,d0 ; Adding is slightly faster on 68000 than a shift up to a shift of 2. Shifting three or more is faster using the shift commands. jmp movetable(pc,d0.w) ; Jump can be indirect movetable move.w d7,(a1)+ move.w d7,(a1)+ ; etc This will be slightly faster per loop. 3) There is one other thing that might make it faster for bigger objects (as in, wider than 32 pixels on average). Right now, you are using move.w d7,(a1)+. If you draw mostly those larger objects, using move.l d7,(a1)+ could be quite a bit faster. Code: move.w dn,(an)+ ; 8 cycles per word => 16 cycles per long move.l dn,(an)+ ; 12 cycles per long =>25% faster 4) Another very minor thing I spotted was that there are multiple small lsl.w/lsl.b commands. A few cycles can be saved per loop by changing these to add.w dx,dx's/add.b dx,dx's instead (for shifts of 1 or 2 anyway). Other than that, the middle part of the loop (which is where you should spend the most time and so I concentrated on it) is looking fine. 
-- For the other parts, it might be possible to change part of the code so that you could drop a few tst.w and other branches. I haven't looked at it in great detail, but it feels you might be able to drop a few of those by reordering things. However, I feel the screen depth you used is probably the biggest hurdle right now. Maybe others can chip in and get us to see more optimizations than me though The other code I might not be able to help with as much, as I haven't done any 3D stuff. I do know about 2D stuff and 68k, but not 3D. Last edited by roondar; 24 August 2018 at 23:19.

25 August 2018, 14:45	#76
roondar Registered User Join Date: Jul 2015 Location: The Netherlands Posts: 3,414	Right, so I misunderstood the drawing routine originally. I thought you called this routine once per line of the object being displayed. But you are in fact calling it once per line per polygon being displayed. With this and your desire to keep running at 6 bitplanes in mind, I've thought up a few other things you could try, though one of them isn't going to help readability much. As for the move.l idea, yeah, that's only worthwhile when plotting big polygons.

1) This is the most involved suggestion. You could try to change the way you deal with drawing a polygon in the colour of your choice. Right now, you basically draw the entire polygon, plane by plane (alternating between drawing and erasing by choosing what value to use as source for the move.w's based on the colour per plane) using one routine. An option could be to do this only once, effectively drawing to a one bitplane polygon 'buffer'. Then combine this pre-drawn one bitplane version with the actual image. This will cause overhead in the form of drawing one additional bitplane of pixels, but will remove a lot of overhead, because the fairly involved drawing process is then needed for just one plane instead of for each plane. The remaining planes lose all that overhead because they essentially become copy or mask operations.

2) This is easier to implement and might still help quite a bit. Currently, you calculate the Y offset using a mulu for every line of every polygon. This is not required. You could move this calculation out into whatever routine calls the line drawing routine. Because you draw line by line, all you'd need to do is multiply once before you start drawing the polygon and then, for every line, add the modulo to the Y offset instead of adding one to Y. This way, a ten-line polygon saves nine multiplies.
You can go one further and actually implement this all the way in your projection (I'm assuming the 3D coordinates you use are projected onto the screen and that they are not on a one-to-one scale here). This is more involved, but could remove the need to multiply from the drawing process entirely and keep the multiply in the projection code, where one is probably being used already.

3) A bit of a micro optimisation this, but you could opt to turn the routine into a macro instead of a called routine. A bsr (or jsr)/rts combo takes up quite a few cycles, especially given how often you call the routine (once per line of each polygon adds up). This will definitely reduce readability of the code and isn't the neatest of things, but it can help. Do note that this will save more cycles than my 4th suggestion in my earlier post.

4) Another micro optimisation, but consider stack usage. Each register pushed to and later pulled from the stack also costs something like 16 cycles. This is not a lot, but when you call routines very often, it does become worthwhile to check whether all the registers need to be put onto the stack, and whether those that really do need to be on the stack actually need to be put there as long values.

Last edited by roondar; 25 August 2018 at 14:57.
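
Suggestion 2 can be sketched in C rather than 68k assembly. This is only an illustration under assumptions: `BYTES_PER_LINE`, `fill_span` and `fill_rows` are hypothetical names, a single 320-pixel-wide bitplane is assumed, and spans are filled in whole bytes to keep it short. None of this is taken from deimos's actual routine.

```c
#include <stdint.h>
#include <string.h>

#define BYTES_PER_LINE 40   /* assumed: 320 pixels / 8, one bitplane */
#define SCREEN_LINES   200  /* assumed display height */

/* Hypothetical span filler: sets whole bytes x0..x1 (inclusive) on one line. */
static void fill_span(uint8_t *line, int x0, int x1)
{
    memset(line + x0, 0xFF, (size_t)(x1 - x0 + 1));
}

/* Multiply once per polygon, then step the line pointer for every row. */
static void fill_rows(uint8_t *plane, int y_top, int rows, int x0, int x1)
{
    /* The only multiply: offset of the first line of the polygon. */
    uint8_t *line = plane + (size_t)y_top * BYTES_PER_LINE;

    for (int i = 0; i < rows; i++) {
        fill_span(line, x0, x1);
        line += BYTES_PER_LINE;  /* add the line stride, no y * stride here */
    }
}
```

For a ten-row polygon this does one multiply and ten additions, where a per-line multiply would do ten multiplies.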

25 August 2018, 16:32	#79
roondar Registered User Join Date: Jul 2015 Location: The Netherlands Posts: 3,414	That's quite odd. I use very similar code in a sprite update routine I once wrote (for very small sprites, the CPU beats the blitter for filling). And that works just fine.

Code:
	add.w	d1,d1			; *2 to get the offset in words.
	jmp	.unrolled(pc,d1)	; Jump to the right address here.
.unrolled
	move.l	(a0)+,(a1)+
	move.l	(a0)+,(a1)+
	move.l	(a0)+,(a1)+
	...etc

I use the same basic idea for jump tables into branches. It really should work.

Edit: wait... I understand, you call into the jump table using the value of D0, which is numbered precisely opposite to what it would need to be. For 0 pixels, you would need a D0 of (<table entries>+1)*2, for max width it'd have to be 0. Might not work as an optimisation then.

Last edited by roondar; 25 August 2018 at 16:41. Reason: So yeah, the 68000 doesn't like odd addresses :P
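
The "numbered precisely opposite" point can be seen in a C sketch of the same idea, a switch that enters an unrolled run of copies part-way in. This is an analogy in C, not the assembly from the thread; `copy_unrolled` is a hypothetical name and the run is assumed to be four longwords long.

```c
#include <stdint.h>

/* Copies n longwords (0..4) by entering an unrolled run part-way in.
 * The labels run from the largest count down to zero: the entry for the
 * maximum count sits at the start of the run, the entry for 0 at the end. */
static void copy_unrolled(uint32_t *dst, const uint32_t *src, unsigned n)
{
    switch (n) {
    case 4: *dst++ = *src++; /* fall through */
    case 3: *dst++ = *src++; /* fall through */
    case 2: *dst++ = *src++; /* fall through */
    case 1: *dst++ = *src++; /* fall through */
    case 0: break;
    default: break;          /* counts above 4 are ignored in this sketch */
    }
}
```

Entering at `case n` leaves exactly n copies to execute, so the distance from the start of the run is (4 - n) copies. A count that indexes from the start of the run would have to be inverted first, which is extra work on every call.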

24 August 2018, 16:52	#71
grond Registered User Join Date: Jun 2015 Location: Germany Posts: 1,918	Not sure anyone has mentioned it yet: you can define a bounding box for all your objects, project that onto the screen and determine the (word-aligned) regions where the updated bounding box and the preceding bounding box do not overlap. Then you just need to clear the non-overlapping segments of the bounding boxes for the next frame.
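
A rough C sketch of the bounding-box idea, under assumptions: `Box`, `align_to_words` and `stale_regions` are hypothetical names, coordinates are non-negative screen pixels, and boxes are half-open. It only works out which word-aligned parts of the previous box fall outside the new one. How the overlapping part gets refreshed is not covered by the post and not by this sketch either.

```c
typedef struct { int x0, y0, x1, y1; } Box;  /* half-open: [x0,x1) x [y0,y1) */

static int imax(int a, int b) { return a > b ? a : b; }
static int imin(int a, int b) { return a < b ? a : b; }

/* Widen a box horizontally to 16-pixel (one word) boundaries. */
static Box align_to_words(Box b)
{
    b.x0 &= ~15;               /* left edge rounds down */
    b.x1 = (b.x1 + 15) & ~15;  /* right edge rounds up */
    return b;
}

/* Writes up to 4 boxes covering (prev minus cur) into out, returns the count. */
static int stale_regions(Box prev, Box cur, Box out[4])
{
    int n = 0;
    prev = align_to_words(prev);
    cur = align_to_words(cur);

    /* No overlap at all: the whole previous box is stale. */
    if (cur.x0 >= prev.x1 || cur.x1 <= prev.x0 ||
        cur.y0 >= prev.y1 || cur.y1 <= prev.y0) {
        out[n++] = prev;
        return n;
    }

    /* Intersection of the two boxes. */
    int ix0 = imax(cur.x0, prev.x0), ix1 = imin(cur.x1, prev.x1);
    int iy0 = imax(cur.y0, prev.y0), iy1 = imin(cur.y1, prev.y1);

    if (iy0 > prev.y0) out[n++] = (Box){prev.x0, prev.y0, prev.x1, iy0}; /* top */
    if (iy1 < prev.y1) out[n++] = (Box){prev.x0, iy1, prev.x1, prev.y1}; /* bottom */
    if (ix0 > prev.x0) out[n++] = (Box){prev.x0, iy0, ix0, iy1};         /* left */
    if (ix1 < prev.x1) out[n++] = (Box){ix1, iy0, prev.x1, iy1};         /* right */
    return n;
}
```

Because every x edge is a multiple of 16, each returned strip can be cleared with whole-word writes.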

24 August 2018, 22:19	#73
Gorf Registered User Join Date: May 2017 Location: Munich/Bavaria Posts: 2,295	OK, not sure about the rest, but for the multiplication you want to replace: copy the value into two registers, do two shifts (lsl.w #4,d3 and lsl.w #2,d2) and add both results. But still, I am sure we are missing something obvious, as it is way too slow.
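
Shifting left by 4 and by 2 and adding the results gives 16y + 4y, so the two shifts above amount to a multiply by 20. The post does not state the constant, so 20 is an inference from the shift counts, and `times20` is a hypothetical name. A small C sketch of the same arithmetic:

```c
#include <stdint.h>

/* y * 20 computed as (y << 4) + (y << 2), i.e. 16y + 4y.
 * The result is truncated to 16 bits, like a word-sized register. */
static uint16_t times20(uint16_t y)
{
    return (uint16_t)((y << 4) + (y << 2));
}
```

For line numbers within a 200-line display the result fits comfortably in a word.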
