Layered tile engine optimizing.

Thorham · 30 September 2011, 20:43

Hi,

I've been working on my Advance Wars 2 conversion and have written a new layered tile engine (the old one sucked) that needs to run as fast as possible.

Basically the lowest target is an A1200 with some trapdoor fastmem (2 MB?) and an HD.

The question is: can the code below be optimized further? I think it's fast enough already (haven't tested it yet), but some input on the subject would be greatly appreciated.

The code should be optimized for 68020s and 68030s; anything above will run this fast enough anyway.

The code simply reads four layers of 16x16 pixel bitmaps with masks (the first layer has no mask). The masks are interleaved into the bitmap data, so you read 32 mask bits followed by 32 tile bits (the routine reads two lines of 16 pixels at a time). This is done twice, once into d0 and once into d1. After that there is a simple transpose (from Kalms' c2p), and the two longwords are written to chipmem.
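For checking the output on a PC, the merge and transpose described above can be modelled in a few lines of C. This is only a sketch: the `MaskedLong` struct and the function names are made up for illustration, and it assumes the upper 16 bits of each longword hold the first of the two lines.

```c
#include <stdint.h>

/* Hypothetical model of the interleaved data: for every masked layer,
   a 32-bit mask is followed by 32 bits of tile data. */
typedef struct {
    uint32_t mask;  /* 1 bits keep the layers underneath */
    uint32_t data;  /* tile bits, OR-ed in after masking */
} MaskedLong;

/* Merge one longword (two 16-pixel lines) of four layers:
   the base layer has no mask, the three layers above it do.
   Mirrors the move.l / and.l / or.l chain in the inner loop. */
static uint32_t merge_layers(uint32_t base, const MaskedLong upper[3])
{
    uint32_t v = base;
    for (int i = 0; i < 3; i++)
        v = (v & upper[i].mask) | upper[i].data;
    return v;
}

/* 2x2 word transpose of the two merged longwords.
   a = d0, b = d1 before the swap/eor.w sequence.
   out0 is the first longword written, out1 the second.
   Assumption: the upper word of each input is the first line. */
static void transpose_words(uint32_t a, uint32_t b,
                            uint32_t *out0, uint32_t *out1)
{
    *out0 = (a & 0xFFFF0000u) | (b >> 16);      /* upper words of a and b */
    *out1 = (a << 16) | (b & 0x0000FFFFu);      /* lower words of a and b */
}
```

Feeding both functions the same test data as the 68k routine and comparing the longwords that end up in chipmem should show quickly whether the mask order and the word order are what you expect.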

Note that this routine does not handle movement of individual sprites; everything is simply aligned to the 16x16 pixel grid. The required frame rate is about 6 or 7 frames per second (I'll write other code for things that require super smoothness).

If anyone can see a way to do it better, then let's hear it! Any questions? Please ask, and sorry about the lack of comments.

Code:

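update
	movem.l	d0-a6,-(sp)	; save all registers

	move.l	gfx_bank_table,-(sp)	; read back as 8(sp): bank table pointer
	move.l	screen_map,-(sp)	; read back as 4(sp): screen map position
	move.l	#10240-16*4,d3	; may be wrong, check

	move.l	#160,-(sp)	; (sp): outer loop counter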
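.loopz
	move.l	4(sp),a5	; a5 = current screen map position
	move.l	8(sp),a4	; a4 = start of bank table

	move.l	(a4)+,a0	; each pointer = bank table entry + map entry
	add.l	(a5)+,a0	; a0 = layer 1 for d0 (no mask)
	move.l	(a4)+,a1
	add.l	(a5)+,a1	; a1 = layer 2 for d0
	move.l	(a4)+,a2
	add.l	(a5)+,a2	; a2 = layer 3 for d0
	move.l	(a4)+,d2
	add.l	(a5)+,d2	; d2 = layer 4 for d0 (swapped into a2 later)

	move.l	(a4)+,a3
	add.l	(a5)+,a3	; a3 = layer 1 for d1 (no mask)
	move.l	(a4)+,d0
	add.l	(a5)+,d0	; d0 = layer 2 for d1 (moved to a4 below)
	move.l	(a4)+,d1
	add.l	(a5)+,d1	; d1 = layer 3 for d1 (moved to a5 below)
	move.l	(a4)+,d5
	add.l	(a5)+,d5	; d5 = layer 4 for d1 (swapped into a5 later)

	move.l	d0,a4	; a4 = layer 2 for d1
	move.l	a5,4(sp)	; save the advanced screen map position
	move.l	d1,a5	; a5 = layer 3 for d1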
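	moveq	#8-1,d6	; 8 outer passes
.loopy
	moveq	#8-1,d7	; 8 longwords (two lines each) per pass
.loopx
	move.l	(a0)+,d0	; layer 1
	and.l	(a1)+,d0	; layer 2 mask
	or.l	(a1)+,d0	; layer 2 tile bits
	and.l	(a2)+,d0	; layer 3 mask
	or.l	(a2)+,d0	; layer 3 tile bits
	exg	d2,a2	; a2 = layer 4, layer 3 parked in d2
	and.l	(a2)+,d0	; layer 4 mask
	or.l	(a2)+,d0	; layer 4 tile bits

	move.l	(a3)+,d1	; same again for the second longword
	and.l	(a4)+,d1
	or.l	(a4)+,d1
	and.l	(a5)+,d1
	or.l	(a5)+,d1
	exg	d5,a5	; a5 = layer 4, layer 3 parked in d5
	and.l	(a5)+,d1
	or.l	(a5)+,d1

	swap	d1	; transpose: exchange low word of d0 with high word of d1
	eor.w	d0,d1
	eor.w	d1,d0	; d0 = high word of d0 : high word of d1
	move.l	d0,(a6)	; a6 (destination) is not set up in this routine
	add.l	d4,a6	; d4 (step between writes) is not set up here either
	exg	d2,a2	; a2 = layer 3 again
	eor.w	d0,d1
	swap	d1	; d1 = low word of d0 : low word of d1
	move.l	d1,(a6)
	add.l	d4,a6
	exg	d5,a5	; a5 = layer 3 again
.nextx
	dbra	d7,.loopx
	add.l	d3,a6
.nexty
	dbra	d6,.loopy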
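	sub.l	#81920+40*16-4,a6	; may be wrong, check
.nextz
	move.l	(sp),d0	; decrement the loop counter on the stack
	subq.l	#1,d0
	move.l	d0,(sp)	; move.l sets Z when the counter reaches zero
	bne	.loopz
.exit
	add.l	#12,sp	; drop the three longwords pushed at entry
	movem.l	(sp)+,d0-a6
	rts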
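
The two constants marked "may be wrong, check" can be checked on paper or with a small helper that walks the destination offset the same way the loops do. The sketch below is only that: `walk_tile_pair` is a made-up name, and it assumes d4 holds a 40-byte line stride, that the bitplanes are 10240 bytes apart, and that .loopy steps through 8 bitplanes. None of that is stated in the listing, so adjust the numbers to the real screen layout.

```c
#include <stdint.h>

/* Walk the destination offset for ONE pass of .loopz and return where
   a6 ends up relative to where it started.
   step        = value added after every write      (add.l d4,a6)
   plane_fixup = value added after each .loopx run  (add.l d3,a6)
   rewind      = value subtracted after .loopy      (sub.l #...,a6) */
static long walk_tile_pair(long step, long plane_fixup, long rewind)
{
    long off = 0;
    for (int outer = 0; outer < 8; outer++) {      /* .loopy */
        for (int inner = 0; inner < 8; inner++)    /* .loopx */
            off += 2 * step;                       /* two writes per iteration */
        off += plane_fixup;
    }
    return off - rewind;
}
```

Under those assumptions, `walk_tile_pair(40, 10240 - 16*40, 8*10240 - 4)` returns 4, which is one longword to the right of the starting position. The constants in the listing give a different offset with the same stride, which fits the "may be wrong, check" comments.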

Last edited by Thorham; 01 October 2011 at 00:49.