English Amiga Board

Old 09 August 2020, 23:14   #61
Antiriad_UK
I did have a $ffdf,$fffe in there. I tend to trigger my lev3 irq from the copper, usually just after the last line of the display so I can start clearing, so I usually wait for line 255 and then for my last display line. But in this case adding one more bob pushes the display past line 255, so that when the copper hits $ffdf,$fffe it skips a frame. So I'd hit a fake limit.

I've swapped it to a more normal irq at line 0 and I got this up to 79 BOBs as well. Output looks better. Look how fast the blits occur after the display dma ends - imagine if it were that fast all the time...
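
As a rough illustration of the copper list tail described above (not the actual list from this demo), the sketch below waits past line 255 and then raises the copper interrupt. The last display line of $12c is an assumed standard PAL value, and the layout is hypothetical.

```asm
	; Hypothetical end of a copper list that raises the level 3 copper
	; interrupt just after the last display line.
	dc.w	$ffdf,$fffe	; WAIT for the end of line 255 so lines >255 can be addressed
	dc.w	$2c01,$fffe	; WAIT for line $12c (assumed first line after the display)
	dc.w	$009c,$8010	; MOVE INTF_SETCLR|INTF_COPER to INTREQ ($09c): request copper irq
	dc.w	$ffff,$fffe	; end of list (wait for a position that never occurs)
```

If the copper only reaches the first WAIT after the beam has already passed line 255, the 8-bit line compare cannot match again in that frame, which would be consistent with the skipped frame described above.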
[Attached image: 003.png]

Old 09 August 2020, 23:34   #62
roondar
Hmm, still 79... That's actually quite interesting, I would've expected a bit more impact of moving to Copper blits. But if I understand you correctly, both methods (at least as you did them) take a very similar amount of total time. Fascinating stuff

Old 09 August 2020, 23:47   #63
ross
Quote:
Originally Posted by roondar View Post
Hmm, still 79... That's actually quite interesting, I would've expected a bit more impact of moving to Copper blits. But if I understand you correctly, both methods (at least as you did them) take a very similar amount of total time. Fascinating stuff
Nah, not bad at all

4 static, 7 updates in chip mem.
My stupid calculations, seen against these results, seem quite sensible.
Consider that this is a bad situation for using the copper to control the blitter; the 68k code version just does too little...
More than anything else, the 'gain' comes from the absence of [wasted CPU cycles for blitter waits]

Last edited by ross; 10 August 2020 at 00:05. Reason: []

Old 09 August 2020, 23:49   #64
Antiriad_UK
I think it kinda makes sense. Identical BOBs are as predictable as it gets. No long or short blits. In the code the blit takes longer than the CPU code in the draw loop, so it's only held up by the blitwait - there's never that normal bottleneck of having the blitter sitting idle.

I think with anything more complicated in the CPU part the copper one will pull ahead. But is losing other copper features worth it...

Out of interest, I enabled 5 bitplanes just to see and both versions dropped to 57 BOBs.

Old 10 August 2020, 11:25   #65
roondar
Quote:
Originally Posted by ross View Post
Nah, not bad at all

4 static, 7 updates in chip mem.
My stupid calculations, seen against these results, seem quite sensible.
Consider that this is a bad situation for using the copper to control the blitter; the 68k code version just does too little...
More than anything else, the 'gain' comes from the absence of [wasted CPU cycles for blitter waits]
Let me explain
It's actually in part your calculations that made me wonder why there was no gain

Basically, I kind of disagree that we shouldn't see any gains (or perhaps I just don't understand why, which is equally possible). As I understand it, the advantages of Copper blitting are the following:
  1. CPU/Blitter concurrency can be maximized without needing to use expensive interrupts. This results in a modest gain in performance by achieving better bus utilization (indeed, in the optimal case you can have the Blitter run almost all of the time instead of having parts of the frame where the Blitter will never run)
  2. There is no need to wait on the Blitter using the CPU, saving those cycles
  3. Copper based blitting requires significantly fewer cycles to set up the blits, even if we take into account the time spent updating the Copperlist
Of these, only the first advantage is essentially nullified by having simple code that doesn't do much. The other two should still stand and should thus lead to visible gains. But in the example shown by Antiriad_UK it seems that those two advantages essentially have a near zero effect.

This is surprising to me, as I've been told by several people (and your calculations also suggest this) that there are pretty big advantages in terms of setup cost and avoiding CPU based Blitter waiting.

I'm inclined to conclude that perhaps the cost of setting up/maintaining the Copperlist(s) for Copper based blitting is not where the speed advantage lies (despite many claims to the contrary). Rather, the advantage apparently lies purely in reaching better bus utilization.
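
To make the setup-cost comparison a bit more concrete, here is a rough sketch of what one Copper-driven blit can look like in the Copperlist. The register values are placeholders and the label is hypothetical; nothing in this thread says which of the 11 words are the static ones. It also assumes the CDANG bit in COPCON has been set so the Copper is allowed to write the Blitter registers.

```asm
; One hypothetical copper blit entry: 1 WAIT + 11 MOVEs = 48 bytes.
; Assumes the CPU ran "move.w #$0002,copcon(a6)" (CDANG) beforehand.
CopBlit_Entry:
	dc.w	$0001,$7ffe	; WAIT with BFD bit clear: holds until the blitter is finished
	dc.w	$0040,$0000	; BLTCON0 (placeholder value)
	dc.w	$0042,$0000	; BLTCON1
	dc.w	$0048,$0000	; BLTCPTH
	dc.w	$004a,$0000	; BLTCPTL
	dc.w	$004c,$0000	; BLTBPTH
	dc.w	$004e,$0000	; BLTBPTL
	dc.w	$0050,$0000	; BLTAPTH
	dc.w	$0052,$0000	; BLTAPTL
	dc.w	$0054,$0000	; BLTDPTH
	dc.w	$0056,$0000	; BLTDPTL
	dc.w	$0058,$0000	; BLTSIZE: writing it starts the blit
```

With a layout like this the CPU only has to patch the operand words that change per BOB (at offsets 6, 10, 14, ... from the entry start), which is where the number of chip memory updates per blit comes into the comparison.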

Last edited by roondar; 10 August 2020 at 11:35.

Old 10 August 2020, 12:12   #66
mr.spiv
Quote:
Originally Posted by roondar View Post
Let me explain
  1. ..
  2. There is no need to wait on the Blitter using the CPU, saving those cycles
  3. Copper based blitting requires significantly fewer cycles to set up the blits, even if we take into account the time spent updating the Copperlist
Of these, only the first advantage is essentially nullified by having simple code that doesn't do much. The other two should still stand and should thus lead to visible gains. But in the example shown by Antiriad_UK it seems that those two advantages essentially have a near zero effect.

This is surprising to me, as I've been told by several people (and your calculations also suggest this) that there are pretty big advantages in terms of setup cost and avoiding CPU based Blitter waiting.

I'm inclined to conclude that perhaps the cost of setting up/maintaining the Copperlist(s) for Copper based blitting is not where the speed advantage lies (despite many claims to the contrary). Rather, the advantage apparently lies purely in reaching better bus utilization.
We are still talking about a fairly modest number of blits IMHO. Take e.g. a sinus scroller (lo- or hires) or similar where you do 320 to 640 blits... there the setup time starts to count more.

Old 10 August 2020, 13:00   #67
roondar
Quote:
Originally Posted by mr.spiv View Post
We are still talking about a fairly modest number of blits IMHO. Take e.g. a sinus scroller (lo- or hires) or similar where you do 320 to 640 blits... there the setup time starts to count more.
The thing is that looking at the Visual DMA debugger tells us that both versions of the effect end their last blit at pretty much exactly the same spot of the frame (almost exactly at the bottom right), while achieving the same results. This implies the gains are actually very small.

As a thought experiment: suppose the Copper based blitting method offered a very small 6 CPU cycle gain per blit in terms of setup/Blitter waiting over using the CPU to blit as normal. In that case we ought to see at least one full rasterline that is "free" (as in not blitting)*. That doesn't seem to be the case, so the gain appears to be even smaller than that.

Had the Copper based blitting method offered the kind of gains that ross predicted, we ought to see at least some extra bobs or a chunk of free raster time. We see neither, which to me seems to show that there are indeed very few actual gains in this case.

*) 6*79 = 474 cycles, or slightly more than 1 rasterline (a PAL rasterline is roughly 454 CPU cycles).

Old 10 August 2020, 13:47   #68
mcgeezer
I'm wondering if there's a gain to be made in the screen clear? As the copper list already holds the positions of the BOBs, I'm wondering if it's quicker to use that to clear the BOBs too. That would have an advantage over the CPU based version.

I'd need to think it through properly but I think I'm right.

Old 10 August 2020, 14:14   #69
ross
Quote:
Originally Posted by roondar View Post
Let me explain
It's actually in part your calculations that made me wonder why there was no gain
The key is here: "4 static, 7 updates in chip mem".

In my superficial calculation I assumed 2 (vs 11) chip mem updates for the gain; here there are 7 (vs 11), for a tie!
So you can also roughly estimate how much CWAIT_BFD saves compared to CPU_BWAIT.
For me it is a good result, from here on you can only gain
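
For reference, the CPU-side wait being compared here is usually some variant of the loop below. This is only a generic sketch: the routine name is hypothetical, a6 is assumed to hold the _custom base, and the actual CPU_BWAIT and CWAIT_BFD macros are not shown in this thread.

```asm
; Classic CPU-side blitter wait (a6 = _custom base, assumed).
CPU_BlitWait:
	tst.w	dmaconr(a6)		; dummy read, a common workaround for early Agnus chips
.wait:	btst	#6,dmaconr(a6)		; bit 6 of the high byte = bit 14 (blitter busy)
	bne.s	.wait			; spin until the blitter is idle
	rts
```

Every pass through this loop costs CPU cycles and chip bus reads, whereas a copper WAIT with the BFD bit clear stalls only the copper.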

Quote:
Originally Posted by mcgeezer View Post
I'm wondering if there's a gain to be made in the screen clear? As the copper list already holds the positions of the BOBs, I'm wondering if it's quicker to use that to clear the BOBs too. That would have an advantage over the CPU based version.

I'd need to think it through properly but I think I'm right.
Yep, but only if you make the copper list a series of sub-jumps.
I have a half idea how to do it ..

Old 10 August 2020, 16:51   #70
roondar
Quote:
Originally Posted by ross View Post
The key is here: "4 static, 7 updates in chip mem".

In my superficial calculation I assumed 2 (vs 11) chip mem updates for the gain; here there are 7 (vs 11), for a tie!
So you can also roughly estimate how much CWAIT_BFD saves compared to CPU_BWAIT.
For me it is a good result, from here on you can only gain
Ahhh, that explains things
So if you could get it down to two chip memory updates you'd gain something on the order of 12*5 CPU cycles per blit (5 fewer updates at roughly 12 cycles each), or about 4700 cycles total in this example case. That's a nice little bump in performance.

Old 15 August 2020, 15:41   #71
Antiriad_UK
Just to finish this off for me, I did a blit interrupt version. I usually run everything in lev3 vblank/lev3 copper which was clashing with the lev3 blit interrupt. So I changed to running in a lev1 softint initiated from the copper. That was interesting.

Copper list blitting version was 79 BOBs and I got the blit interrupt version to 65 BOBs. The interrupt save/restore registers is the killer part. I could optimize this case because it is so simple, but the difference between 'movem.l d0-d7/a0-a6' vs 'movem.l a0/a6' was the difference between 55 and 65 BOBs.

Definitely not optimal but I can think of a few pieces of code where I've not been able to split the CPU/blit parts up neatly and maybe for a handful of blits this method is useful.

Code:
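P0_BlitDoneIrq_SecondBlit:
	movem.l	a0/a6,-(sp)		;save only the two registers used here

	lea	_custom+bltcpth,a6	;a6 = first blitter pointer register

	move.w	#INTF_BLIT,intreq-bltcpth(a6)	;acknowledge the blitter irq
	move.w	#INTF_BLIT,intreq-bltcpth(a6)	;(written twice, a common precaution on fast CPUs)

	move.l	BOB_BlitQueue_PTR(pc),a0	;next queue entry
	cmp.l	#BOB_BlitQueue_End,a0
	bge.s	.exit			;queue exhausted: exit without starting a blit

	move.l	(a0)+,bltcon0-bltcpth(a6)	;bltcon0+bltcon1
	move.l 	(a0)+,(a6)+	;bltcpt
	move.l 	(a0)+,(a6)+	;bltbpt
	move.l 	(a0)+,(a6)+	;bltapt
	move.l 	(a0)+,(a6)+	;bltdpt
	move.w	(a0)+,(a6)	;bltsize (this write starts the blit)

	move.l	a0,BOB_BlitQueue_PTR	;remember where to continue next irq
.exit:	
	movem.l	(sp)+,a0/a6
	rte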

BOB_BlitQueue_PTR:	dc.l	0

	rsreset
;BLTQ_BLTCON0		rs.w	1
;BLTQ_BLTCON1		rs.w	1
;BLTQ_BLTCPTH		rs.w	1
;BLTQ_BLTCPTL		rs.w	1
;BLTQ_BLTBPTH		rs.w	1
;BLTQ_BLTBPTL		rs.w	1
;BLTQ_BLTAPTH		rs.w	1
;BLTQ_BLTAPTL		rs.w	1
;BLTQ_BLTDPTH		rs.w	1
;BLTQ_BLTDPTL		rs.w	1
;BLTQ_BLTSIZE		rs.w	1
;BLTQ_SIZEOF		rs.w	0
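
The handler above relies on something else filling the queue and starting the chain. A minimal sketch of that side could look like the following. The labels BOB_BlitQueue and BOB_BlitQueue_Kick are made up for illustration and are not from the original source. The INTF_ names are the standard ones from hardware/intbits.i. The sketch assumes the blitter is idle when it is called and that the level 3 handler dispatches INTF_BLIT to the routine above.

```asm
; Hypothetical kick-off for the blit queue (not from the original source).
; Assumes BOB_BlitQueue..BOB_BlitQueue_End already holds 22-byte entries
; in the BLTQ_ layout above and that the blitter is currently idle.
BOB_BlitQueue_Kick:
	lea	_custom,a6
	move.l	#BOB_BlitQueue,BOB_BlitQueue_PTR		; restart at the first entry
	move.w	#INTF_SETCLR|INTF_INTEN|INTF_BLIT,intena(a6)	; enable the blitter-finished irq
	move.w	#INTF_SETCLR|INTF_BLIT,intreq(a6)		; request it by hand so the handler issues blit #1
	rts
```

Each finished blit then raises the interrupt again, and the chain stops by itself once the queue pointer reaches BOB_BlitQueue_End.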
[Attached image: 004.png]

Old 15 August 2020, 16:39   #72
mcgeezer
Nice one @Antiriad_UK.

I'm on the same journey but on stock AGA.

In the same 3 bitplane graphics mode using traditional blits I'm currently at 111 16x16 bobs (full clear and redraw every frame).

I haven't done any sine/cosine yet but can plot at any x/y position independently.

I'll move it to Copper blits soon and post results... I may also put the mode into Dual Playfield on AGA and test that, which would be a more real-world case.

Geezer

Old 15 August 2020, 19:50   #73
FSizzle
Quote:
Originally Posted by Antiriad_UK View Post
The interrupt save/restore registers is the killer part.
Code:
    movem.l    a0/a6,-(sp)
    ...

    movem.l    (sp)+,a0/a6

It's not much, but you could save 4 cycles and a bus access on the restore by doing this:
Code:
move.l (sp)+, a0   ; 12 (3/0)   a0 first: movem stored it at the lower address
move.l (sp)+, a6   ; 12 (3/0)
instead of:
Code:
movem.l (sp)+, a0/a6   ; 28 (7/0)   (12+8n (3+2n/0))
About enough for an extra half a bob

Last edited by FSizzle; 15 August 2020 at 19:58.

Old 15 August 2020, 22:55   #74
mcgeezer
So here's mine after a bit of work on AGA.

CPU Blits I can push around 136 16x16 (8 col) bobs

COPPER Blits I can push around 132 16x16 (8 Col) bobs.

Screen mode is x4 fetch with 3 bitplanes enabled.

More importantly, I've found this to be a really great discussion on Amiga programming, as I have learned so much from it. If I'd known all this when doing my previous projects, I could have improved their performance by a long stretch.

Thanks to all for your contributions. A great gift of knowledge from everyone involved.

Geezer
[Attached images: COP_BLITS.png, CPU_BLITS.png]

Old 15 August 2020, 23:34   #75
Antiriad_UK
Yes me too. I’ve shunned copper blits and blit interrupts before. I’m kinda in awe at the people who figured this out without today’s resources.

I’m also impressed with the order of the blit registers. Amiga team thought so far ahead about the optimal order of those registers. Crazy.

Old 15 August 2020, 23:45   #76
DanScott
Quote:
Originally Posted by mcgeezer View Post

CPU Blits I can push around 136 16x16 (8 col) bobs

COPPER Blits I can push around 132 16x16 (8 Col) bobs.
The classic "bob record" demos on A500 (From around 1989/1990) used copper blitting techniques to achieve the best results

Old 15 August 2020, 23:57   #77
mcgeezer
Quote:
Originally Posted by DanScott View Post
The classic "bob record" demos on A500 (From around 1989/1990) used copper blitting techniques to achieve the best results
Yeah, I'm going to take a look... I think Dragons demo had a decent one if I recall.

But with that said, gaining performance in a demo is one thing, and gaining it while writing a game is quite another (only in my opinion of course).

Edit - it's amazing the old stuff I remember... just checked out the bobs on Dragons Megademo, clearly a screen buffer trick used there... was used all the time on the ST demos.

Last edited by mcgeezer; 16 August 2020 at 00:03.

Old 16 August 2020, 00:11   #78
Antiriad_UK
Unlimited bobs don’t count. They do make me smile though

Old 18 August 2020, 05:22   #79
buzzybee
Very inspiring and enlightening thread going on here. Especially useful for Kevin's and my current project Proxima 3 too, since it relies heavily on moving around as many objects as possible within one frame.

Thank you guys for your measurements and comparisons regarding CPU and DMA overhead. This is really amazing work and very precious input.

The conclusion for P3 seems to be that it makes sense to stick with a rather traditional approach of feeding the blitter with the CPU and using the copper for visual fx. The copper feeds BPLxMOD and BPLCON1 at least every two scanlines with modified modulo and scroll data to achieve a number of visual distortions, and I can't see how I could combine this with the copper feeding the blitter.

The feeding-the-blitter-with-the-copper technique seems very interesting and I'd love to use it in a future project. I imagine there is a lot of room for elegant and fast optimisations here. For example by setting up a number of predefined sub-copperlists for various bob sizes and bitplane source addresses, which are then called from a main copperlist with a series of copper jumps. Modifying these jumps would be the only job the CPU would have to do each frame.
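
A copper jump of the kind described here can be sketched as below. This is an illustration under assumptions, not code from any of the projects mentioned: the labels are hypothetical, the address words are placeholders the CPU would fill in, and both lists are assumed to sit in chip memory.

```asm
; Hypothetical "call" from a main copper list into a sub-copperlist.
Main_CallBob:
	dc.w	$0084,$0000	; COP2LCH: high word of the sub-list address (patched by CPU)
	dc.w	$0086,$0000	; COP2LCL: low word of the sub-list address (patched by CPU)
	dc.w	$0080,$0000	; COP1LCH: high word of Main_Return (patched by CPU)
	dc.w	$0082,$0000	; COP1LCL: low word of Main_Return (patched by CPU)
	dc.w	$008a,$0000	; strobe COPJMP2: continue at the sub-list
Main_Return:
	; ... rest of the main list ...

Sub_Bob16x16:
	; ... predefined blitter register MOVEs for one BOB size ...
	dc.w	$0088,$0000	; strobe COPJMP1: jump back to Main_Return
```

One caveat with this particular scheme: the hardware also restarts the copper from COP1LC at every vertical blank, so COP1LC has to point back at the start of the main list before the frame ends, either through MOVEs at the end of the list or from the vblank interrupt.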

Old 18 August 2020, 10:09   #80
Tigerskunk
This approach seems to be a bit of a pain in the ass for any non-general display engine, doesn't it?
Like, if you have a sprite parallax layer or a lot of copper palette changes, this thing seems hard to code.