OCS Blitter Speed (FrameBuffer Clear)

VladR · 15 June 2022, 15:49

What percentage of frame time on NTSC OCS does the Blitter need to clear:
1 Bitplane
2 Bitplanes
4 Bitplanes
6 Bitplanes

Each BP being (320x200) = 8,000 Bytes.

From a quick search, it would appear that the Blitter on OCS is slower than a CPU clear, but if that is the case it wouldn't be a huge issue for me because I could simply run other parts of the 3D pipeline in parallel (like I do on Jaguar, where I initiate the clear at the beginning of the frame and do 3D transform and clipping in parallel).

Of course, in a 6-BP EHB Mode I might not get many cycles for parallel execution, but it should still be faster in the end (meaning, if I lock framerate to 20 fps, I should still finish everything faster when working in parallel compared to a slightly faster CPU clear).

VladR · 15 June 2022, 15:54

I just found this:
Thread: https://eab.abime.net/showthread.php?t=103515

Quote:

Originally Posted by BigT

I was messing around with clearing a single bitplane in ASM on A500 OCS ie 320x256=10240 bytes. My findings were as follows:

Code:

178 scanlines  clr.l (a0)+ / dbra loop
118 scanlines  clr.l (a0)+ / dbra loop, unrolled x16 clr.l statements
 73 scanlines  move.l d1,(a0)+ unrolled x16
 56 scanlines  movem.l d1-d6/a2-a3,-(a0) unrolled x2
 50 scanlines  blitter D channel clear
 27 scanlines  Blitter D + movem.l combination

What I was most surprised by was how close in performance a movem.l operation was to the blitter. I had always thought the blitter was much faster....

VladR · 15 June 2022, 15:58

So, if it takes 50 scanlines out of 256 to clear 10,240 Bytes, then for 6 BPs in EHB, it would take:
6 * 8,000 [Bytes] = 48,000 Bytes
48,000 / 10,240 = 4.6875

4.6875 * 50 [scanlines] = 234.375 scanlines

234.375 scanlines - that's basically almost entire frame - like 235/256 = ~92% of frame time ?
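For reference, the same extrapolation as a quick Python sketch. It assumes clear time scales linearly with byte count and simply reuses BigT's 50-scanline figure (measured on an A500 with a 256-line display); the function name `clear_scanlines` is made up for illustration.

```python
# Linear extrapolation of BigT's measurement (50 scanlines per 10,240 bytes).
# Assumption: blitter clear time scales linearly with the number of bytes.

MEASURED_BYTES = 320 * 256 // 8   # 10,240 bytes, one 320x256 bitplane
MEASURED_SCANLINES = 50           # blitter D-channel clear, per BigT


def clear_scanlines(bitplanes, bytes_per_plane=8000):
    """Estimated scanlines needed to blitter-clear `bitplanes` planes."""
    total_bytes = bitplanes * bytes_per_plane
    return total_bytes / MEASURED_BYTES * MEASURED_SCANLINES


for planes in (1, 2, 4, 6):
    lines = clear_scanlines(planes)
    print(f"{planes} BPL: {lines:.1f} scanlines, "
          f"{lines / 256:.0%} of a 256-line frame")
```

Note that the 256-line divisor is the PAL figure from BigT's test, so the percentage is only a rough guide for NTSC.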

Does that sound about right that it would actually take that long just to clear FrameBuffer ?

The Use case here is a 3D starfield - I am wondering where exactly is the threshold where it still makes performance sense to just erase last-frame's stars instead of brute-force clear.
Approach 1: BruteForce FB Clear - much faster DrawPixel version (that does not do AND masking), and gives us the advantage of being able to draw anything else (as it will be cleared next frame automatically)
Approach 2: No FB Clear - much slower DrawPixel version with AND masking, plus we have to clear last frame's pixels - effectively ~halving pixel throughput (but gaining the time it would take to clear the FB)
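To make the difference between the two DrawPixel variants concrete, here is a minimal Python model of a planar framebuffer. It is only a sketch of the idea, not the 68000 routines being benchmarked; all names (`make_planes`, `draw_pixel_cleared`, `draw_pixel_masked`, `read_pixel`) are hypothetical.

```python
# Planar framebuffer model: one bytearray per bitplane, 320x200, 6 planes.
WIDTH, HEIGHT, DEPTH = 320, 200, 6
ROW_BYTES = WIDTH // 8


def make_planes():
    return [bytearray(ROW_BYTES * HEIGHT) for _ in range(DEPTH)]


def draw_pixel_cleared(planes, x, y, color):
    """Approach 1: the buffer is known to be zero, so only OR in the 1 bits."""
    offset = y * ROW_BYTES + (x >> 3)
    bit = 0x80 >> (x & 7)
    for p in range(DEPTH):
        if (color >> p) & 1:
            planes[p][offset] |= bit


def draw_pixel_masked(planes, x, y, color):
    """Approach 2: AND the old bit out of every plane, then OR in the new one."""
    offset = y * ROW_BYTES + (x >> 3)
    bit = 0x80 >> (x & 7)
    for p in range(DEPTH):
        planes[p][offset] &= ~bit & 0xFF
        if (color >> p) & 1:
            planes[p][offset] |= bit


def read_pixel(planes, x, y):
    """Gather one bit from each plane back into a color index."""
    offset = y * ROW_BYTES + (x >> 3)
    bit = 0x80 >> (x & 7)
    return sum(1 << p for p in range(DEPTH) if planes[p][offset] & bit)
```

The masked version touches all six planes on every pixel, while the cleared-buffer version only writes the planes whose bit is set, which is where the cycle difference comes from.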

Depending on the benchmark numbers, the final scene complexity will differ quite a bit, and I am curious about the numbers for each approach (and the threshold).

a/b · 15 June 2022, 17:48

Blitter clear is good if you can do it 100% in parallel and never have to wait for it to finish. Otherwise, a cpu+blitter split is faster. Blitclear is 0 sources, 1 destination, so it has an idle state when it's supposed to read, and it's not running at full speed like it would with 1+ sources, meaning there could be some unused dma slots depending on what the cpu and other dma are doing.
Best would be to benchmark it against cpu dot clear, I'd guess. Another option is adaptive approach, if you're under a certain number of stars you clear them individually with cpu, otherwise blit and/or cpu clear.

VladR · 15 June 2022, 20:05

Quote:

Originally Posted by a/b

Blitter clear is good if you can do it 100% in parallel and never have to wait for it to finish. Otherwise, a cpu+blitter split is faster.

Yeah, I don't think I should have to wait for Blitter, because at the very least, I have 2 full frames worth of work (maybe 3 -> 20 fps).
As long as there are no framedrops, 20 fps is plenty smooth. But there must be enough performance buffer for the worst CPU spikes (worst clipping scenarios plus AI spike)...

What I'm realizing right now is that with 6 Bitplanes, even Blitter will be affected by DMA (though it still has higher priority than CPU), so it might even take longer than ~93% of frame time to clear all 6 BPs ?

Quote:

Originally Posted by a/b

Another option is adaptive approach, if you're under a certain number of stars you clear them individually with cpu, otherwise blit and/or cpu clear.

You mean, like, Level of Detail ? Perhaps as an option, yes. With the obvious impact on framerate...

paraj · 15 June 2022, 20:39

Quote:

Originally Posted by VladR

Yeah, I don't think I should have to wait for Blitter, because at the very least, I have 2 full frames worth of work (maybe 3 -> 20 fps).
As long as there are no framedrops, 20 fps is plenty smooth. But there must be enough performance buffer for the worst CPU spikes (worst clipping scenarios plus AI spike)...

What I'm realizing right now is that with 6 Bitplanes, even Blitter will be affected by DMA (though it still has higher priority than CPU), so it might even take longer than ~93% of frame time to clear all 6 BPs ?

You mean, like, Level of Detail ? Perhaps as an option, yes. With the obvious impact on framerate...

I'm used to PAL, so numbers might be a bit off, but roughly in NTSC you have 262 scanlines with 223 usable DMA slots per scanline = 58426. 320x200x6 BPL uses 24000 (~41%). Clearing the screen using the blitter also takes 24000, leaving you very little (~10K) to do anything else. Timewise the clearing would indeed take 48K CCKs (@~3.5MHz), almost a complete frame, though every other DMA slot would be open (if it isn't used for display). You're starting to see why very few (if any) games used EHB in game.
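The slot budget above can be tallied in a few lines of Python. This is a rough sketch using only the approximate figures from this post (262 lines, 223 usable slots per line); the variable names are made up, and the two-CCKs-per-word clear cost reflects the idle cycle a/b described.

```python
# Rough NTSC DMA slot budget for a 320x200, 6-bitplane (EHB) display.
SCANLINES = 262
SLOTS_PER_LINE = 223                        # approximate usable DMA slots
frame_slots = SCANLINES * SLOTS_PER_LINE    # 58,426

words_per_plane = 320 * 200 // 16           # 4,000 words = 8,000 bytes
display_slots = 6 * words_per_plane         # bitplane fetches: 24,000
blit_write_slots = 6 * words_per_plane      # blitter clear writes: 24,000

left_over = frame_slots - display_slots - blit_write_slots
# Each cleared word costs one write slot plus one idle slot (no source read).
clear_ccks = 2 * blit_write_slots

print(f"frame slots:   {frame_slots}")
print(f"display share: {display_slots / frame_slots:.0%}")
print(f"left over:     {left_over}")
print(f"clear time:    {clear_ccks} CCKs")
```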

For the cut-off a/b mentioned, it's not so much LOD as considering the trade-off. Simplified example assuming a stock A500: clearing 8000 bytes using only the blitter takes 4000 * 4 7MHz cycles (16,000). Say you've optimized clearing a pixel down to a single "and.b dN, ofs(AM)" instruction taking 16 7MHz cycles; then the break-even point where it becomes better to use the blitter would be 1000 stars. You'd use method 2 (clear with CPU) if the number of displayed stars was less than 1000.
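As a sketch of that cut-off logic in Python, using the simplified numbers above (4 cycles per blitted word, 16 cycles per CPU-cleared star); `break_even_stars` and `pick_clear_method` are hypothetical helper names, and real figures would have to come from benchmarking.

```python
def break_even_stars(clear_bytes=8000, blit_cycles_per_word=4,
                     cpu_cycles_per_star=16):
    """Star count at which a full blitter clear costs the same as
    clearing each star individually with the CPU."""
    blit_cycles = (clear_bytes // 2) * blit_cycles_per_word
    return blit_cycles // cpu_cycles_per_star


def pick_clear_method(star_count, threshold):
    """Adaptive choice: per-star CPU clear below the threshold,
    full-buffer blitter clear at or above it."""
    return "cpu" if star_count < threshold else "blitter"


threshold = break_even_stars()
print(threshold)
print(pick_clear_method(400, threshold))
print(pick_clear_method(1500, threshold))
```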

VladR · 15 June 2022, 23:29

Quote:

Originally Posted by paraj

I'm used to PAL, so numbers might be a bit off, but roughly in NTSC you have 262 scanlines with 223 usable DMA slots per scanline = 58426. 320x200x6 BPL uses 24000 (~41%). Clearing the screen using the blitter also takes 24000, leaving you very little (~10K) to do anything else. Timewise the clearing would indeed take 48K CCKs (@~3.5MHz), almost a complete frame, though every other DMA slot would be open (if it isn't used for display).

Ouch

Thanks for confirming.

Quote:

Originally Posted by paraj

You're starting to see why very few (if any) games used EHB in game..

Well, actually, not really. For double-buffering, it's not that huge of a deal. Yes, it sucks having one full frame destroyed by clearing framebuffer, but we can still design the game around 30 or 20 fps. For EHB, probably 20 fps, because EHB takes 24,000 DMA slots just for display. But, even then, we still have 2 full frames worth of CPU time (well, 2 frames of ~54% anyway - kinda like just one full frame at 4 BPLs).
Probably can't move too many huge SW sprites around the screen with this much CPU throughput, but surely plenty of games can be designed around that.
Of course, it's a major complication for dev, so that must have played some role, I guess.

Quote:

Originally Posted by paraj

For the cut-off a/b mentioned, it's not so much LOD as considering the trade-off. Simplified example assuming a stock A500: clearing 8000 bytes using only the blitter takes 4000 * 4 7MHz cycles (16,000). Say you've optimized clearing a pixel down to a single "and.b dN, ofs(AM)" instruction taking 16 7MHz cycles; then the break-even point where it becomes better to use the blitter would be 1000 stars. You'd use method 2 (clear with CPU) if the number of displayed stars was less than 1000.

Let's talk specific numbers for EHB (6 BPLs):

236c: ClearPixel
230c: DrawPixel (No AND Mask - assumes FB was cleared)
390c: DrawPixel (Clears all bits first, writes only 1s)

0.54*119,333 = 64,439c (available cycles after DMA given ~54% utilization)

Let's just round the clearing of framebuffer to full frame (for comparison purposes).
How much can we do in the same time ?
64,439 = PixelCount * (236+230)
PixelCount = 64,439 / 466 = 138 - this is the threshold for EHB

So, during same time it takes to Clear FrameBuffer, 138 individual pixels can be cleared and drawn anew.
Well, slightly less in reality, because those 138 pixels need to be read from an array, which is additional cycles obviously.
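The same threshold calculation as a small Python sketch, using only the cycle counts quoted in this post. The ~54% CPU share and 119,333 cycles per frame are the estimates from above, not measured values, and the constant names are made up.

```python
# EHB threshold: how many pixels can be erased and redrawn in the time
# a full 6-bitplane framebuffer clear takes (rounded to one frame).
CYCLES_PER_FRAME = 119_333   # 7MHz cycles per NTSC frame (estimate)
CPU_SHARE = 0.54             # fraction left to the CPU after DMA (estimate)
CLEAR_PIXEL = 236            # cycles to erase last frame's pixel
DRAW_PIXEL_NO_MASK = 230     # cycles to draw without AND masking

budget = int(CPU_SHARE * CYCLES_PER_FRAME)
threshold = budget // (CLEAR_PIXEL + DRAW_PIXEL_NO_MASK)

print(f"cycle budget: {budget}")
print(f"threshold:    {threshold} pixels")
```

Any per-pixel overhead for fetching coordinates from the array would be added to the 466-cycle divisor and lower the threshold accordingly.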

Tigerskunk · 16 June 2022, 12:01

Aren't you coding on the Vamp?

Why are you interested in the Blitter then? Would be like the biggest bottleneck of all time if you used that instead of a simple CPU clear routine.

VladR · 16 June 2022, 14:28

Quote:

Originally Posted by Tigerskunk

Aren't you coding on the Vamp?

Yes, but it's not exclusive anymore. Since I got a fantastic remote job a year ago, I had to temporarily put my big Vampire project on hold (for as long as I keep the job). And I discovered the wonderful world of OCS. It's kinda like Atari 800XL on steroids - it has Blitter, 32 registers, Copper, expandable RAM, OS and a giant variety of upgrades (up to the level of Jaguar and probably above with 060, MIPS-wise).

It's a better target to remake 8-bit games than Jaguar. Not to mention the absence of the general Jaguar toxic hostility omnipresent in jag forums.

Here, we can actually have a technical conversation, people are willing to share what they know without belittling newbies (like me) and OCS gives a coder so many options how to implement things.

Still, I wish there was a 160x200 native resolution for OCS...

Quote:

Originally Posted by Tigerskunk

Why are you interested in the Blitter then? Would be like the biggest bottleneck of all time if you used that instead of a simple CPU clear routine.

Because on Jaguar, at the start of the frame, I initiate the FrameBuffer Clear, and in parallel, start doing the 3d pipeline.
If you break the pipeline into discrete batches, it's possible to time it perfectly with Blitter just having finished clearing (without ever having to wait).

There's no EHB 3D game on OCS but that doesn't mean one can't be made. As long as one understands the constraints, it's doable.

Basically - if I lock the framerate to 20 fps and have 3 full frames, I got 3 * 0.54 * 119,333 = 3 * 64,439c = 193,317 cycle budget.
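A tiny Python helper for playing with that budget at different locked framerates. It is only a sketch built on the same estimates (119,333 cycles per frame, ~54% left to the CPU); `cycle_budget` is a made-up name.

```python
def cycle_budget(frames_per_update, cycles_per_frame=119_333, cpu_share=0.54):
    """CPU cycles available per displayed update when the game is locked
    to one update every `frames_per_update` video frames."""
    per_frame = int(cpu_share * cycles_per_frame)
    return frames_per_update * per_frame


for frames, fps in ((1, 60), (2, 30), (3, 20)):
    print(f"{fps} fps: {cycle_budget(frames)} cycles")
```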

Question is - how much 3D can you do in 193,317c ? I don't know yet, but that's what we're trying to figure out now.


