Mega Typhoon Deconstruction - Page 4

Antiriad_UK · 09 August 2020, 23:14

I did have a $ffdf,$fffe in there. I tend to trigger my lev3 irq from the copper, usually just after the last line of the display so I can start clearing - so I usually wait for 255, then my last display line. But in this case adding one more bob pushes the display past line 255, so that when it hits $ffdf,$fffe it skips a frame. So I'd hit a fake limit.

I've swapped it to a more normal irq at line 0 and I got this up to 79 BOBs as well. Output looks better. Look how fast the blits occur after the display dma ends - imagine if it were that fast all the time...
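For anyone following along: the $ffdf,$fffe pair is there because the copper only compares the low 8 bits of the vertical position, so a wait for any line past 255 first has to wait for the end of line 255. A minimal Python sketch of how those wait words are put together (the helper name `copper_wait` and the example line numbers are made up for illustration, they are not taken from the actual copperlist):

```python
def copper_wait(line, hpos=0x00):
    """Return the copper WAIT words needed to reach a given raster line."""
    words = []
    if line > 255:
        # Wait for the end of line 255 first; only the low 8 bits of the
        # line number can be compared by the wait that follows.
        words += [0xFFDF, 0xFFFE]
    # First word: vertical position, horizontal position, bit 0 set = WAIT
    first = ((line & 0xFF) << 8) | (hpos & 0xFE) | 0x01
    # Second word: compare all position bits, do not wait for the blitter
    words += [first, 0xFFFE]
    return words


for line in (100, 255, 268):
    print(line, [f"${w:04x}" for w in copper_wait(line)])
```

A wait placed before line 256 needs only one pair of words, while anything after it needs two, which is why a display growing past line 255 changes how the list has to be laid out.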

roondar · 09 August 2020, 23:34

Hmm, still 79... That's actually quite interesting, I would've expected a bit more impact of moving to Copper blits. But if I understand you correctly, both methods (at least as you did them) take a very similar amount of total time. Fascinating stuff

ross · 09 August 2020, 23:47

Quote:

Originally Posted by roondar

Hmm, still 79... That's actually quite interesting, I would've expected a bit more impact of moving to Copper blits. But if I understand you correctly, both methods (at least as you did them) take a very similar amount of total time. Fascinating stuff

Nah, not bad at all

4 static, 7 updates in chip mem.
My stupid calculations, seen against these results, seem quite sensible.
Consider that this is a bad situation in which to use the copper to control the blitter; the 68k code version just does too little...
More than anything else the 'gain' comes from the absence of [wasted CPU cycles for blitter waits]

Antiriad_UK · 09 August 2020, 23:49

I think it kinda makes sense. Identical BOBs are as predictable as it gets. No long or short blits. In the code the blit takes longer than the CPU code in the draw loop so it's only held up by the blitwait - there's never that normal bottleneck of having the blitter sitting idle.

I think with anything more complicated in the CPU part the copper one will pull ahead. But is losing other copper features worth it...

Out of interest, I enabled 5 bitplanes just to see and both versions dropped to 57 BOBs.

roondar · 10 August 2020, 11:25

Quote:

Originally Posted by ross

Nah, not bad at all

4 static, 7 updates in chip mem.
My stupid calculations, seen against these results, seem quite sensible.
Consider that this is a bad situation in which to use the copper to control the blitter; the 68k code version just does too little...
More than anything else the 'gain' comes from the absence of [wasted CPU cycles for blitter waits]

Let me explain

It's actually in part your calculations that made me wonder why there was no gain

Basically, I kind of disagree that we shouldn't see any gains (or perhaps I just don't understand why, which is equally possible). As I understand it, the advantages of Copper blitting are the following:

CPU/Blitter concurrency can be maximized without needing to use expensive interrupts. This results in a modest gain in performance by achieving better bus utilization (indeed, in the optimal case you can have the Blitter run almost all of the time instead of having parts of the frame where the Blitter will never run)
There is no need to wait on the Blitter using the CPU, saving those cycles
Copper based Blitting requires significantly fewer cycles to set up the blits, even if we take into account the time spent updating the Copperlist

Of these, only the first advantage is essentially nullified by having simple code that doesn't do much. The other two should still stand and should thus lead to visible gains. But in the example shown by Antiriad_UK it seems that those two advantages have a near zero effect.

This is surprising to me, as I've been told by several people (and your calculations also suggest this) that there are pretty big advantages in terms of setup cost and avoiding CPU based Blitter waiting.

I'm inclined to conclude that perhaps the cost of setting up/maintaining the Copperlist(s) for Copper based blitting is not where the speed advantage lies (despite many claims to the contrary). Rather, the advantage apparently lies purely in reaching better bus utilization.

mr.spiv · 10 August 2020, 12:12

Quote:

Originally Posted by roondar

Let me explain

..
There is no need to wait on the Blitter using the CPU, saving those cycles
Copper based Blitting requires significantly fewer cycles to set up the blits, even if we take into account the time spent updating the Copperlist

Of these, only the first advantage is essentially nullified by having simple code that doesn't do much. The other two should still stand and should thus lead to visible gains. But in the example shown by Antiriad_UK it seems that those two advantages have a near zero effect.

This is surprising to me, as I've been told by several people (and your calculations also suggest this) that there are pretty big advantages in terms of setup cost and avoiding CPU based Blitter waiting.

I'm inclined to conclude that perhaps the cost of setting up/maintaining the Copperlist(s) for Copper based blitting is not where the speed advantage lies (despite many claims to the contrary). Rather, the advantage apparently lies purely in reaching better bus utilization.

We are still talking about a fairly modest number of blits IMHO. Take e.g. a sinus scroller (lo- or hires) or similar where you do 320 to 640 blits... there the setup time starts to count more.

roondar · 10 August 2020, 13:00

Quote:

Originally Posted by mr.spiv

We are still talking about a fairly modest number of blits IMHO. Take e.g. a sinus scroller (lo- or hires) or similar where you do 320 to 640 blits... there the setup time starts to count more.

The thing is that looking at the Visual DMA debugger tells us that both versions of the effect end their last blit at pretty much exactly the same spot of the frame (almost exactly at the bottom right), while achieving the same results. This implies the gains are actually very small.

As a thought experiment: suppose the Copper based blitting method offered a very small 6 CPU cycle gain per blit in terms of setup/Blitter waiting over using the CPU to blit as normal. In that case we ought to see at least one full rasterline that is "free" (as in not blitting)*. That doesn't seem to be the case, so the gain appears to be even smaller than that.

Had the Copper based blitting method offered the kind of gains that ross predicted, we ought to see at least some extra bobs or a chunk of free raster time. We see neither, which to me seems to show that there is indeed very little actual gain in this case.

*) 6*79=474 cycles, or slightly more than 1 rasterline.
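The footnote's arithmetic can be checked with a few lines of Python. This is only a rough sketch: it assumes a PAL machine with 227 colour clocks per rasterline and 2 CPU cycles per colour clock (7.09 MHz 68000), it ignores bus contention, and the 6-cycle saving is the hypothetical figure from the thought experiment rather than a measurement. The function name is made up for illustration.

```python
COLOUR_CLOCKS_PER_LINE = 227      # assumption: PAL rasterline length
CPU_CYCLES_PER_COLOUR_CLOCK = 2   # assumption: 7.09 MHz 68000
CYCLES_PER_LINE = COLOUR_CLOCKS_PER_LINE * CPU_CYCLES_PER_COLOUR_CLOCK


def rasterlines_freed(cycles_saved_per_blit, blits):
    """Return (total CPU cycles saved, the same amount in rasterlines)."""
    total = cycles_saved_per_blit * blits
    return total, total / CYCLES_PER_LINE


total, lines = rasterlines_freed(6, 79)
print(total, "cycles =", round(lines, 2), "rasterlines")
```

Under those assumptions a line is 454 CPU cycles, so even a 6-cycle saving per blit should already free a little more than one line over 79 blits.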

mcgeezer · 10 August 2020, 13:47

I'm wondering if there's a gain to be made in the screen clear? As the copper list already holds the positions of the bobs, I'm wondering if it's quicker to use that to clear the bobs too. That would have an advantage over the CPU based version.

I'd need to think it through properly but I think I'm right.

ross · 10 August 2020, 14:14

Quote:

Originally Posted by roondar

Let me explain

It's actually in part your calculations that made me wonder why there was no gain

The key is here: "4 static, 7 updates in chip mem".

In my superficial calculation I assumed 2 (vs 11) chip mem updates for the gain; here there are 7 (vs 11) for a tie!
So you can also roughly estimate how much CWAIT_BFD saves compared to CPU_BWAIT.
For me it is a good result; from here on you can only gain.
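To make the "11 words per blit" figure concrete, here is a rough Python sketch of one blit as it might sit in a copperlist: a blitter-finished wait followed by eleven MOVEs. The register offsets are the standard custom chip ones; the function name, the example values and putting the wait in front of the MOVEs are assumptions for illustration, not Antiriad_UK's actual list. Which four words stay static per blit isn't spelled out in the thread, so the sketch doesn't guess.

```python
# Blitter register offsets from the custom chip base ($dff000)
BLTCON0, BLTCON1 = 0x040, 0x042
BLTCPTH, BLTBPTH, BLTAPTH, BLTDPTH = 0x048, 0x04C, 0x050, 0x054
BLTSIZE = 0x058


def copper_blit(bltcon0, bltcon1, cpt, bpt, apt, dpt, bltsize):
    """Return the copper words for one blit: a wait plus 11 MOVEs."""
    # WAIT with the BFD bit cleared: hold here until the blitter is idle
    words = [0x0001, 0x7FFE]
    words += [BLTCON0, bltcon0, BLTCON1, bltcon1]
    for reg, ptr in ((BLTCPTH, cpt), (BLTBPTH, bpt),
                     (BLTAPTH, apt), (BLTDPTH, dpt)):
        # Each 32-bit pointer takes two MOVEs: high word, then low word
        words += [reg, (ptr >> 16) & 0xFFFF, reg + 2, ptr & 0xFFFF]
    # Writing BLTSIZE starts the blit, so it goes last
    words += [BLTSIZE, bltsize]
    return words


# Hypothetical example values: a 16-line, 2-word-wide blit
blit = copper_blit(0x0FCA, 0x0000, 0x20000, 0x21000, 0x22000, 0x20000,
                   (16 << 6) | 2)
print(len(blit), "words,", (len(blit) - 2) // 2, "MOVEs")
```

Every MOVE data word that changes from frame to frame is one chip memory write for the CPU, which is where the 7 vs 2 updates come in. On real hardware the copper also needs the CDANG bit in COPCON set before it may write to the blitter registers.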

Quote:

Originally Posted by mcgeezer

I'm wondering if there's a gain to be made in the screen clear? As the copper list already holds the positions of the bobs, I'm wondering if it's quicker to use that to clear the bobs too. That would have an advantage over the CPU based version.

I'd need to think it through properly but I think I'm right.

Yep, but only if you build the copper list as a series of sub-jumps.
I have a half idea how to do it ..

roondar · 10 August 2020, 16:51

Quote:

Originally Posted by ross

The key is here: "4 static, 7 updates in chip mem".

In my superficial calculation I assumed 2 (vs 11) chip mem updates for the gain; here there are 7 (vs 11) for a tie!
So you can also roughly estimate how much CWAIT_BFD saves compared to CPU_BWAIT.
For me it is a good result; from here on you can only gain.

Ahhh, that explains things

So if you could get it down to two chipmemory updates you'd gain something on the order of 12*5 CPU cycles per blit (or about 4700 cycles total in this example case). That's a nice little bump in performance.
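Spelling that estimate out as a quick Python sketch. The 12 cycles per chip memory update is the figure used in the estimate above; reading the "5" as the difference between the 7 updates in Antiriad_UK's test and the 2 updates ross assumed is an interpretation, and one blit per bob is assumed as well:

```python
CYCLES_PER_UPDATE = 12   # figure used in the estimate above
UPDATES_NOW = 7          # chip mem updates per blit in the tested version
UPDATES_TARGET = 2       # updates per blit assumed in ross's calculation
BLITS = 79               # assumption: one blit per bob

saved_per_blit = (UPDATES_NOW - UPDATES_TARGET) * CYCLES_PER_UPDATE
saved_total = saved_per_blit * BLITS
print(saved_per_blit, "cycles per blit,", saved_total, "cycles per frame")
```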

Antiriad_UK · 15 August 2020, 15:41

Just to finish this off for me, I did a blit interrupt version. I usually run everything in lev3 vblank/lev3 copper which was clashing with the lev3 blit interrupt. So I changed to running in a lev1 softint initiated from the copper. That was interesting.

Copper list blitting version was 79 BOBs and I got the blit interrupt version to 65 BOBs. The interrupt save/restore registers is the killer part. I could optimize this case because it is so simple, but the difference between 'movem.l d0-d7/a0-a6' vs 'movem.l a0/a6' was the difference between 55 and 65 BOBs.

Definitely not optimal but I can think of a few pieces of code where I've not been able to split the CPU/blit parts up neatly and maybe for a handful of blits this method is useful.
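To put rough numbers on the register save/restore cost, here is a small Python sketch using the standard 68000 timings for movem.l (8+8n cycles to push n registers with -(sp), 12+8n to pull them back with (sp)+). It ignores bus contention with the blitter and display DMA, so the real cost on a busy frame will be higher. The function names are made up for illustration.

```python
def movem_save(n):
    """Cycles for movem.l <n regs>,-(sp) on a 68000."""
    return 8 + 8 * n


def movem_restore(n):
    """Cycles for movem.l (sp)+,<n regs> on a 68000."""
    return 12 + 8 * n


def irq_overhead(n):
    """Save plus restore cost per interrupt for n registers."""
    return movem_save(n) + movem_restore(n)


full = irq_overhead(15)   # d0-d7/a0-a6
small = irq_overhead(2)   # a0/a6
print(full, small, full - small)
```

With these timings the full register set costs 260 cycles per interrupt against 52 for a0/a6, and that difference is paid again on every blitter interrupt.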

Code:

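P0_BlitDoneIrq_SecondBlit:
	movem.l	a0/a6,-(sp)		;only save what the handler uses

	lea	_custom+bltcpth,a6

	move.w	#INTF_BLIT,intreq-bltcpth(a6)	;acknowledge the blitter interrupt
	move.w	#INTF_BLIT,intreq-bltcpth(a6)	;repeated write, a common safeguard on faster CPUs

	move.l	BOB_BlitQueue_PTR(pc),a0	;next entry in the blit queue
	cmp.l	#BOB_BlitQueue_End,a0
	bge.s	.exit			;queue empty, nothing to start

	move.l	(a0)+,bltcon0-bltcpth(a6)	;bltcon0 + bltcon1
	move.l 	(a0)+,(a6)+	;bltcpt
	move.l 	(a0)+,(a6)+	;bltbpt
	move.l 	(a0)+,(a6)+	;bltapt
	move.l 	(a0)+,(a6)+	;bltdpt
	move.w	(a0)+,(a6)	;bltsize, this starts the blit

	move.l	a0,BOB_BlitQueue_PTR	;remember where the queue continues
.exit:	
	movem.l	(sp)+,a0/a6
	rte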

BOB_BlitQueue_PTR:	dc.l	0

	rsreset
;BLTQ_BLTCON0		rs.w	1
;BLTQ_BLTCON1		rs.w	1
;BLTQ_BLTCPTH		rs.w	1
;BLTQ_BLTCPTL		rs.w	1
;BLTQ_BLTBPTH		rs.w	1
;BLTQ_BLTBPTL		rs.w	1
;BLTQ_BLTAPTH		rs.w	1
;BLTQ_BLTAPTL		rs.w	1
;BLTQ_BLTDPTH		rs.w	1
;BLTQ_BLTDPTL		rs.w	1
;BLTQ_BLTSIZE		rs.w	1
;BLTQ_SIZEOF		rs.w	0

mcgeezer · 15 August 2020, 16:39

Nice one @Antiriad_UK.

I'm on the same journey but on stock AGA.

In the same 3 bitplane graphics mode using traditional blits I'm currently at 111 16x16 bobs (full clear and redraw every frame).

I haven't done any sine/cosine yet but can plot at any x/y position independently.

I'll move it to the Copper blits soon and post results... may also put the mode into Dual Playfield on AGA and test results which would be a more real world.

Geezer

FSizzle · 15 August 2020, 19:50

Quote:

Originally Posted by Antiriad_UK

The interrupt save/restore registers is the killer part.

Code:

    movem.l    a0/a6,-(sp)
    ...

    movem.l    (sp)+,a0/a6

It's not much, but you could save 4 cycles and a bus access on the restore by doing this:

Code:

move.l (sp)+, a0   // 12 (3/0)
move.l (sp)+, a6   // 12 (3/0)

instead of:

Code:

movem.l (sp)+, a0/a6   // 28 (7/0)   (12+8n (3+2n/0))

About enough for an extra half a bob
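A quick Python sketch of where that trick stops paying off, using the timing formula quoted above (12+8n cycles and 3+2n bus reads for the movem.l restore) against 12 cycles and 3 reads for each separate move.l (sp)+. The function names are made up for illustration:

```python
def movem_restore(n):
    """(cycles, bus reads) for movem.l (sp)+,<n regs> on a 68000."""
    return 12 + 8 * n, 3 + 2 * n


def separate_moves(n):
    """(cycles, bus reads) for n individual move.l (sp)+,reg."""
    return 12 * n, 3 * n


for n in range(1, 5):
    print(n, "regs: movem", movem_restore(n), "moves", separate_moves(n))
```

With these numbers the separate moves only win for one or two registers; at three registers the cycle counts are equal and from four onwards movem.l is ahead.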

mcgeezer · 15 August 2020, 22:55

So here's mine after a bit of work on AGA.

CPU Blits I can push around 136 16x16 (8 col) bobs

COPPER Blits I can push around 132 16x16 (8 Col) bobs.

Screen mode is x4 fetch with 3 bitplanes enabled.

More importantly, I've found this to be a really great discussion point on Amiga programming as I have learned so much from it. If I'd known this when doing my previous projects I could have improved their performance by a long stretch.

Thanks to all for your contributions. A great gift of knowledge from everyone involved.

Geezer

Antiriad_UK · 15 August 2020, 23:34

Yes me too. I’ve shunned copper blits and blit interrupts before. I’m kinda in awe at the people who figured this out without today’s resources.

I’m also impressed with the order of the blit registers. Amiga team thought so far ahead about the optimal order of those registers. Crazy.

DanScott · 15 August 2020, 23:45

Quote:

Originally Posted by mcgeezer

CPU Blits I can push around 136 16x16 (8 col) bobs

COPPER Blits I can push around 132 16x16 (8 Col) bobs.

The classic "bob record" demos on A500 (From around 1989/1990) used copper blitting techniques to achieve the best results

mcgeezer · 15 August 2020, 23:57

Quote:

Originally Posted by DanScott

The classic "bob record" demos on A500 (From around 1989/1990) used copper blitting techniques to achieve the best results

Yeah, I'm going to take a look... I think Dragons demo had a decent one if I recall.

But with that said, it's one thing gaining performance in a demo, and quite another gaining it when writing a game (only in my opinion of course).

Edit - it's amazing the old stuff I remember... just checked out the bobs on Dragons Megademo, clearly a screen buffer trick used there... was used all the time on the ST demos.

Antiriad_UK · 16 August 2020, 00:11

Unlimited bobs don’t count. They do make me smile though

buzzybee · 18 August 2020, 05:22

Very inspiring and enlightening thread going on here. Pretty enlightening for Kevin's and my current project Proxima 3 too, since this relies heavily on moving around as many objects as possible within one frame.

Thank you guys for your measures and comparisons regarding cpu and dma overhead. This is really amazing work and very precious input.

The conclusion for P3 seems to be that it makes sense to stick with a rather traditional approach of feeding the blitter with the CPU and using the copper for visual fx. The copper feeds BPLxMOD and BPLCON1 at least every two scanlines with modified modulo and scroll data to achieve a number of visual distortions, and I can't see how I could combine this with the copper feeding the blitter.

The feeding-blitter-with-copper technique seems very interesting and I'd love to use it in a future project. I imagine there is a lot of room for elegant and fast optimisations here. For example by setting up a number of predefined sub-copperlists for various bob sizes and bitplane source addresses, which are then called from a main copperlist with a series of copper jumps. Modifying these jumps would be the only job the CPU would have to do each frame.

Tigerskunk · 18 August 2020, 10:09

This approach seems to be a bit of a pain in the ass for any non-generic display engine, doesn't it?
Like, if you have a sprite parallax layer or a lot of copper palette changes this thing seems hard to code.
