Copper driven blitter queue

paraj · 01 September 2022, 19:01

Thought of experimenting with copper driven blits (of course someone already thought of at least parts of it: http://eab.abime.net/showpost.php?p=...5&postcount=19)
The basic idea is to have the main copper list regularly "check" if it needs to/can start a new blit, i.e. interspersing it with SKIP (blitter busy) + write to COPJMP2 instructions. When no blit is necessary, the 2nd copperlist would just return.
A few problems with this if I understand it correctly:

Writes to copjmpX reload the pointer from copXlc, so you need to store the "return address" in the "opposite" copXlc register before "returning"?

The normal "do nothing" function for cop2 needs to be:

Code:

loop:
    dc.w cop2lc+0, loop>>16     ; next COPJMP2 lands here again
    dc.w cop2lc+2, loop&$ffff
    dc.w copjmp1, 0             ; "return" to wherever cop1lc points

and the "Check if blit needed" needs to be:

Code:

    dc.w cop1lc+0, next>>16     ; store the return address in cop1lc
    dc.w cop1lc+2, next&$ffff
    dc.w $ffff, $0001 ; skip if blitter busy
    dc.w copjmp2, 0             ; "call" copperlist 2
next:

Correct? So they need to be patched up correctly by code and not done nicely by asm macros? (of course high part can be skipped if laid out correctly in memory).
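A minimal sketch of that patching step, in Python rather than 68k asm so it can be run anywhere. It is not code from this thread: `check_stub` is a hypothetical helper. It assumes the usual custom register offsets (COP1LC at $080/$082, COPJMP2 at $08A), and the SKIP words are copied verbatim from the post above without checking them.

```python
# Hypothetical helper: emit the "check if blit needed" stub for a given
# chip RAM address, filling in the return address once the final
# location of the list is known.
COP1LCH, COP1LCL = 0x080, 0x082   # COP1LC high/low word (register offsets)
COPJMP2 = 0x08A                   # strobe: restart the copper at COP2LC

def check_stub(addr):
    """Return the 8 words of the stub placed at chip RAM address addr."""
    nxt = addr + 4 * 4            # 4 copper instructions of 4 bytes each
    return [
        COP1LCH, (nxt >> 16) & 0xFFFF,
        COP1LCL, nxt & 0xFFFF,
        0xFFFF, 0x0001,           # SKIP words copied verbatim from the post
        COPJMP2, 0,
    ]

# A stub that straddles a 64K boundary: the high word of "next" differs
# from the high word of the stub's own address.
stub = check_stub(0x0002FFF8)
print([f"${w:04x}" for w in stub])
```

In the example the stub starts in page $0002 but `next` falls in page $0003, which is the case where the high word cannot be skipped.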

Also probably an issue with triggering a blit correctly without race conditions. You could always restrict yourself to blits only starting at a certain point (with an interrupt routine), but maybe it can be done better?

I'll experiment on my own, but maybe somebody already did the hard work

a/b · 01 September 2022, 19:46

Quote:

Originally Posted by paraj

Writes to copjmpX reload the pointer from copXlc, so you need to store the "return address" in the "opposite" copXlc register before "returning"?

If you want to return to "somewhere in the middle of" copper1 before VBL, yes (and also set the ptr back to the start of copper1 before EOF, for a correct VBL restart). Otherwise copper1 will be restarted automatically during VBL.

Jobbo · 01 September 2022, 19:59

I can't think when it makes sense to construct a copper list where it will constantly check if there's a blit to process.

It seems much simpler to treat the copper list as a command buffer you filled in during the previous frame.

My set up basically has an array of blits each with a wait at the start. The last wait gets patched so it's a jump that will terminate the sequence of blits.

So, each frame I just unpatch that jump and then start again at the front of the array filling it up. It's double buffered so a little more complex than that but not much.

In order to be as optimal as possible I only patch the values I know I need to change in the copper list of blits, which is easy enough if they are all lines but not so practical for something more generic.
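A toy model of the patch/unpatch bookkeeping described above, as I read it. All names are hypothetical. The two instruction encodings are assumptions (a WAIT at position 0 with the blitter-finished-disable bit clear, and a write to the COPJMP1 strobe at $088 as the "jump"). Double buffering is left out.

```python
# Toy model only: each slot is a "head" instruction followed by a blit.
WAIT_BLIT = (0x0001, 0x0000)   # assumed encoding: WAIT that also waits for the blitter
JUMP_OUT = (0x0088, 0x0000)    # assumed encoding: write to the COPJMP1 strobe

class BlitList:
    def __init__(self, capacity):
        # One extra head so a completely full list can still be terminated.
        self.head = [WAIT_BLIT] * (capacity + 1)
        self.body = [None] * capacity
        self.end = 0

    def begin_frame(self):
        self.head[self.end] = WAIT_BLIT   # unpatch last frame's jump
        self.end = 0                      # start refilling from the front

    def add(self, blit):
        self.body[self.end] = blit
        self.end += 1

    def end_frame(self):
        self.head[self.end] = JUMP_OUT    # patch the wait after the last blit

    def executed(self):
        """Blits the copper would reach before it hits the jump."""
        out = []
        for head, blit in zip(self.head, self.body + [None]):
            if head == JUMP_OUT:
                break
            out.append(blit)
        return out

q = BlitList(4)
q.begin_frame(); q.add("line 0"); q.add("line 1"); q.end_frame()
print(q.executed())
```

Only one head word pair changes per frame boundary, so the per-blit setup already in the list stays untouched unless it actually changes.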

paraj · 01 September 2022, 20:20

Quote:

Originally Posted by a/b

If you want to return to "somewhere in the middle of" copper1 before VBL, yes (and also set the ptr back to the start of copper1 before EOF, for a correct VBL restart). Otherwise copper1 will be restarted automatically during VBL.

Thanks, yes, I basically want to do a "subroutine"(more like coroutine) call to copperlist2.

Quote:

Originally Posted by Jobbo

I can't think when it makes sense to construct a copper list where it will constantly check if there's a blit to process.

It seems much simpler to treat the copper list as a command buffer you filled in during the previous frame.

My set up basically has an array of blits each with a wait at the start. The last wait gets patched so it's a jump that will terminate the sequence of blits.

So, each frame I just unpatch that jump and then start again at the front of the array filling it up. It's double buffered so a little more complex than that but not much.

In order to be as optimal as possible I only patch the values I know I need to change in the copper list of blits, which is easy enough if they are all lines but not so practical for something more generic.

I wouldn't "constantly" check, just at regular intervals (based on the application).

Instead of being coy, I'll just mention that I have a specific application in mind. I'm working on an effect where I do a lot of CPU heavy stuff (~1-3 frames) and then do ~1 frame of blitter ops (C2P). Thinking of ways to reduce the CPU load to speed it up. I need to do 12 blits (on OCS) for the C2P part, so I was thinking of speeding it up by not doing a (CPU driven) blitter queue, but instead offloading it to the copper. Would need tuning of course, but if I have, say, 6-8 checks per frame done by the copper, that could be faster.

ross · 01 September 2022, 20:27

Quote:

Originally Posted by Jobbo

I can't think when it makes sense to construct a copper list where it will constantly check if there's a blit to process.

Well, the point is that your queue disrupts all copper effects in the frame.
With this approach you can put some blits in the 'dead spots' of the copper list.

@paraj: I don't know what will come out but I'm curious

Jobbo · 01 September 2022, 20:27

For so few blits I would just use interrupts.

Using the copper is more about maximizing the number of blits you can submit, at the cost of doing all the other things the copper is normally good for like color changes etc.

You can of course interleave copper blits and other copper work but it's a pain.

It's fun to try none the less

Jobbo · 01 September 2022, 20:31

Quote:

Originally Posted by ross

Well, the point is that your queue disrupts all copper effects in the frame.
With this approach you can put some blits in the 'dead spots' of the copper list.

That's fair, I'm not sure what was linked but it does make some sense to interleave checks that will skip if the last blit hasn't finished yet. Then it can continue on with whatever has to happen that line or whatever.

Bound to be pretty wasteful of DMA slots with all that copper checking, but maybe worth it for some cases.

ross · 01 September 2022, 20:41

Quote:

Originally Posted by Jobbo

Bound to be pretty wasteful of DMA slots with all that copper checking, but maybe worth it for some cases.

Yep, this could be a problem.
And the fact that the handler code will not be trivial at all..

But experiments are experiments, curiosity cannot be stopped!

Believe it or not, right now I'm thinking about how to use the blitter as a Huffman decoding accelerator when building a 'sparse' table.

paraj · 01 September 2022, 20:50

Actually, might as well post the code. It's not like you guys can't keep a secret right?

I really want this to run in 3 frames (which I think is the limit at 320x256), but I'm running out of normal ideas. Hence this thread.

Jobbo · 01 September 2022, 21:25

I can take a look later and see if I have any ideas. Still at work for now.

Jobbo · 02 September 2022, 02:32

From what I see using Bartman's awesome profiler there's very little CPU time spent starting the blits. So, not much to gain trying to speed that up.

The blitter work in total lasts about 1 frame for each rotation which means the blitter is idle for 2 full frames.

If you could somehow find more work for the blitter that would unburden the CPU then that'd be your best bet for making an improvement.

I'd have to spend more time understanding the code to know how it all works. Or you could explain and I might have some thoughts. But I'm no expert, I'm sure some others would have more to say.

paraj · 02 September 2022, 09:40

Yeah, I'm not expecting miracles, but since I'm 100% CPU bound anything might help. It was mostly to try it out as I think it might be useful in general. Also it could open up other optimization opportunities, for example there are only 6 combinations of source/destination so the copperlists could be precalculated.

The core is "renderline", which renders 160 scrambled chunky pixels, 4 pixels at a time. 4 pre-scrambled textures are used: Texture 0 contains data like this (numbers are bitplanes): 32------ 10------, Texture 1: --32---- --10----, etc.

Code:

        move.w  ofs1(a1),d0
        or.w    ofs2(a2),d0
        or.w    ofs3(a3),d0
        or.w    ofs4(a4),d0
        movep.w d0,dofs(a0)

After the movep two consecutive words look like this:

Code:

a3 a2 b3 b2 c3 c2 d3 d2 e3 e2 f3 f2 g3 g2 h3 h2
a1 a0 b1 b0 c1 c0 d1 d0 e1 e0 f1 f0 g1 g0 h1 h0
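To make that layout concrete, here is a small Python model of one 8-pixel group. It is not the author's code. `scramble`, `movep_w` and `render8` are hypothetical names, and the slot-to-bit mapping is inferred from the "32------ 10------" description above. `movep_w` models the 68000 behaviour of writing the high byte at the address and the low byte two bytes later.

```python
def scramble(pixel, slot):
    """16-bit texture word for a 4-bit pixel in slot 0..3 (0 = leftmost)."""
    shift = 6 - 2 * slot
    hi = ((pixel >> 2) & 3) << shift   # planes 3 and 2 go to the high byte
    lo = (pixel & 3) << shift          # planes 1 and 0 go to the low byte
    return (hi << 8) | lo

def movep_w(mem, addr, value):
    """Model of movep.w Dn,d(An): bytes land at addr and addr+2."""
    mem[addr] = (value >> 8) & 0xFF
    mem[addr + 2] = value & 0xFF

def render8(pixels):
    """OR four scrambled pixels per group, then movep two groups."""
    mem = bytearray(4)
    for group in range(2):
        d0 = 0
        for slot in range(4):
            d0 |= scramble(pixels[4 * group + slot], slot)
        movep_w(mem, group, d0)        # second group is offset by one byte
    return mem

# Pixel a = %1111, pixel h = %0001, everything else 0.
out = render8([0b1111, 0, 0, 0, 0, 0, 0, 0b0001])
print(f"{out[0]:08b}{out[1]:08b}")   # word 0: planes 3/2 of a..h
print(f"{out[2]:08b}{out[3]:08b}")   # word 1: planes 1/0 of a..h
```

The two interleaved movep writes are what turn two 4-pixel words into one word of plane 3/2 bits and one word of plane 1/0 bits for all eight pixels.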

Then the blitter is used to perform horizontal doubling and extract the individual bitplane data.

updatecode modifies "renderline" to sample from x*scale*cos(angle), x*scale*-sin(angle) (throwing away some precision because integer offsets are used)

yloop steps in the vertical direction calculating the starting offset for each scanline and then "calls" renderline to output pixels. The textures are doubled in size such that the starting offset + horizontal offset doesn't step outside the texture.

Of course I could offload the job of movep.w to the blitter, but I don't think it's faster (movep.w vs move.w only adds 8 cycles)

Galahad/FLT · 02 September 2022, 13:03

Beware that MOVEP isn't supported on 68060

pink^abyss · 02 September 2022, 13:05

Quote:

Originally Posted by paraj

Also probably an issue with triggering a blit correctly without race conditions. You could always restrict yourself to blits only starting at a certain point (with an interrupt routine), but maybe it can be done better?

As Jobbo said, the blitting is almost optimal. You could gain a few cycles by starting the first blit in rasterline 300, instead of rasterline 302

In the frames without blitting you still have 20% of idle DMA cycles. Perhaps you could do some work with the blitter here.
I would try to get the effect running in 2 frames. A window of 256x256 (or even smaller) could make it happen.

paraj · 02 September 2022, 19:07

Quote:

Originally Posted by Galahad/FLT

Beware that MOVEP isn't supported on 68060

Yeah, thanks, if AFB_68060 is set I replace all the movep instructions with bsr.w movepemu (which does the necessary work instead). Without that it's 10x slower than a500 on my 1260

(reminds me that I need a more proper check in case that flag isn't set).

Quote:

Originally Posted by pink^abyss

As Jobbo said, the blitting is almost optimal. You could gain a few cycles by starting the first blit in rasterline 300, instead of rasterline 302

In the frames without blitting you still have 20% of idle DMA cycles. Perhaps you could do some work with the blitter here.
I would try to get the effect running in 2 frames. A window of 256x256 (or even smaller) could make it happen.

So it's probably worthwhile to do the 8x1 step with the blitter even if it's (theoretically) more cycles overall. Thanks, I'll try that though that would be a bit disappointing since I more or less started out wanting to try out a c2p using movep...

First though, I need to get this copper blitter queue thing working to stay on topic

a/b · 02 September 2022, 20:51

Quote:

Originally Posted by paraj

... Without that it's 10x slower than a500 on my 1260

Hmm, that's not good. I have a couple of movep using routines, and my initial plan was to write a very specific emulation routine, and hopefully 68060 would be sufficiently faster than 68000 to make it work.

Could you please try this and see how much slower it is? Should be pretty straightforward to include. It handles your case: movep.w d0,(offset,a0) with destroyable d0.

Code:

; a5=vbr
	move.l	($0f4,a5),-(a7)		; #61 (unimpl. int instr.)
	move.l	#HandleMovep,($0f4,a5)

	lea	(Test,pc),a0
	move.w	#$1234,d0
	movep.w	d0,(1,a0)

	move.l	(a7)+,($0f4,a5)
	rts

Test	DC.L	~0

HandleMovep
	movem.l	a0/a1,-(a7)
	move.l	(4*2+2,a7),a1		; +0 = sr, +2 = pc
	add.w	(2,a1),a0		; dst offset

	move.b	d0,(2,a0)
	lsr.w	#8,d0
	move.b	d0,(a0)

	addq.l	#4,a1			; skip movep
	move.l	a1,(4*2+2,a7)		; update pc

	movem.l	(a7)+,a0/a1
	rte

paraj · 02 September 2022, 21:38

Quote:

Originally Posted by a/b

Hmm, that's not good. I have a couple of movep using routines, and my initial plan was to write a very specific emulation routine, and hopefully 68060 would be sufficiently faster than 68000 to make it work.

Don't trust this since I only did a quick test, but it runs at ~1530 scanlines (~500 used with my emu code). I'll do proper timing when I have more free time.

a/b · 02 September 2022, 22:31

OK, thanks. Looks way too slow, I'll have to try something else (a simple bsr.w won't do, it's ~230KB of unrolled variable length repeats with non-uniform offsets).

paraj · 03 September 2022, 12:28

All right, so it is possible, but as y'all already predicted it's not faster. It's actually slower.

The code also got quite complicated and ugly, but it is possible. For reasons I don't quite understand the "is blitter free" checks have to be spaced quite closely (at least in this case). Maybe I did something wrong though.

The copperlists absolutely have to be kept within the same 64k page (it's ok for cop1 and cop2 to be in different ones though) such that only the lower part of the address needs to be updated. This also allows "racefree" starting of the blits by writing to cop2lc+2.
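The 64k constraint can be checked (or enforced) when the lists are allocated. A minimal sketch in Python, with hypothetical helper names. The idea is that if the whole list sits in one 64K page, the high word of cop2lc never changes, so redirecting the copper is a single 16-bit write and the copper cannot observe a half-updated pointer.

```python
PAGE = 0x10000   # 64 KiB

def same_64k_page(start, length):
    """True if [start, start+length) lies within one 64 KiB page."""
    return (start >> 16) == ((start + length - 1) >> 16)

def align_into_page(start, length):
    """Smallest address >= start where the list fits in one page."""
    if length > PAGE:
        raise ValueError("list larger than 64 KiB")
    if same_64k_page(start, length):
        return start
    return (start + PAGE - 1) & ~(PAGE - 1)   # bump to the next page boundary

print(hex(align_into_page(0x2FF00, 0x200)))
```

A real allocator would over-allocate by the list length and place the list at the aligned address. That part is left out here.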

EDIT: And using the blitter for the 8x1 pass instead of using movep.w is 10 raster lines faster (bringing it to 964).
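For scale, a quick back-of-the-envelope check of that figure against the 3-frame target mentioned earlier. It assumes 964 is the total raster-line cost per rendered frame and takes a PAL frame as 313 lines; both are my assumptions, not statements from the post.

```python
LINES_PER_FRAME = 313          # assumption: PAL, non-interlaced long frame
measured = 964                 # raster lines quoted above
budget = 3 * LINES_PER_FRAME   # 3-frame target
over = measured - budget
print(f"{measured / LINES_PER_FRAME:.2f} frames, {over} lines over budget")
```

Under those assumptions the effect is still a couple of dozen raster lines short of fitting in 3 frames.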

Quote:

Originally Posted by a/b

OK, thanks. Looks way too slow, I'll have to try something else (a simple bsr.w won't do, it's ~230KB of unrolled variable length repeats with non-uniform offsets).

Testing it in isolation (still writing to chipram) I get:
060.library (system handler): 3365 cycles
Your version: 708 cycles
My version: 157 cycles (including the extra tst.w dmaconr)

I guess the pipeline synchronization that happens when handling exceptions is killing performance here. Maybe it's not so bad in your case, depending on what you're using it for. You can send me a test executable if you have something in particular you want timed.

a/b · 03 September 2022, 15:38

Thanks again. That was with superscalar mode enabled?
