English Amiga Board


Old 01 September 2022, 19:01   #1
paraj
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,279
Copper driven blitter queue

Thought of experimenting with copper-driven blits (of course someone already thought of at least parts of it: http://eab.abime.net/showpost.php?p=...5&postcount=19 )
The basic idea is to have the main copper list regularly "check" whether it needs to/can start a new blit, i.e. interspersing it with SKIP (blitter busy) + write-to-COPJMP2 instructions. When no blit is necessary, the 2nd copper list would just return.
A few problems with this, if I understand it correctly:

Writes to copjmpX reload the pointer from copXlc, so you need to store the "return address" in the "opposite" copXlc register before "returning"?

The normal "do nothing" function for cop2 needs to be:
Code:
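loop:
    dc.w cop2lc+0, loop>>16     ; cop2lc = loop (high word)
    dc.w cop2lc+2, loop&$ffff   ; cop2lc = loop (low word)
    dc.w copjmp1, 0             ; "return": jump via cop1lc
and the "Check if blit needed" needs to be:
Code:
    dc.w cop1lc+0, next>>16     ; cop1lc = return address (high word)
    dc.w cop1lc+2, next&$ffff   ; cop1lc = return address (low word)
    dc.w $ffff, $0001 ; skip if blitter busy
    dc.w copjmp2, 0             ; "call": jump via cop2lc
next: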
Correct? So the addresses need to be patched up by code rather than done nicely by asm macros? (Of course the high part can be skipped if everything is laid out correctly in memory.)
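To make the "patched up by code" part concrete, here is a minimal host-side C sketch that emits the same four copper instructions as the check sequence above. The register offsets are the standard custom chip ones; the helper name `emit_blit_check` and the idea of passing the list's chip RAM address as a plain integer are assumptions made for illustration, not something from the attached source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Custom chip register offsets (relative to $DFF000), as used in copper MOVEs. */
enum {
    COP1LCH = 0x080, COP1LCL = 0x082,
    COP2LCH = 0x084, COP2LCL = 0x086,
    COPJMP1 = 0x088, COPJMP2 = 0x08A
};

/* Hypothetical helper: writes the 4-instruction "check" sequence into list[]
 * and returns the number of 16-bit words written (always 8).
 * list_addr is the chip RAM address at which list[0] will live. */
static size_t emit_blit_check(uint16_t *list, uint32_t list_addr)
{
    assert(list != NULL);

    /* "next" is the address right after the 4 instructions (4 * 4 bytes). */
    uint32_t next = list_addr + 16;
    size_t n = 0;

    list[n++] = COP1LCH; list[n++] = (uint16_t)(next >> 16);     /* return address, high */
    list[n++] = COP1LCL; list[n++] = (uint16_t)(next & 0xffff);  /* return address, low  */
    list[n++] = 0xffff;  list[n++] = 0x0001;  /* SKIP with BFD clear and zero masks */
    list[n++] = COPJMP2; list[n++] = 0x0000;  /* strobe: copper reloads from COP2LC */
    return n;
}
```

If the whole list stays inside one 64K page, the COP1LCH write could be dropped and only the low word patched, which is the "high part can be skipped" case.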

Also probably an issue with triggering a blit correctly without race conditions. You could always restrict yourself to blits only starting at a certain point (with an interrupt routine), but maybe it can be done better?

I'll experiment on my own, but maybe somebody already did the hard work
Old 01 September 2022, 19:46   #2
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,076
Quote:
Originally Posted by paraj View Post
Writes to copjmpX reload the pointer from copXlc, so you need to store the "return address" in the "opposite" copXlc register before "returning"?
If you want to return to "somewhere in the middle of" copper1 before VBL, yes (and also set the pointer back to the start of copper1 before the end of the frame, for a correct VBL restart). Otherwise copper1 will be restarted automatically during VBL.
Old 01 September 2022, 19:59   #3
Jobbo
Registered User
Join Date: Jun 2020
Location: Druidia
Posts: 389
I can't think when it makes sense to construct a copper list where it will constantly check if there's a blit to process.

It seems much simpler to treat the copper list as a command buffer you filled in during the previous frame.

My set up basically has an array of blits each with a wait at the start. The last wait gets patched so it's a jump that will terminate the sequence of blits.

So, each frame I just unpatch that jump and then start again at the front of the array filling it up. It's double buffered so a little more complex than that but not much.

In order to be as optimal as possible I only patch the values I know I need to change in the copper list of blits, which is easy enough if they are all lines but not so practical for something more generic.
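A rough C model of that command-buffer idea, for readers who want to see the patch/unpatch step spelled out. This is only a sketch under assumptions: each "blit" is reduced to a single register write (a real blit needs several), the wait is taken to be the usual copper blitter wait ($0001,$0000), the terminating jump is assumed to be a COPJMP2 strobe, and all names (`BlitList`, `list_begin`, `list_add`, `list_end`) are made up here. Double buffering is left out.

```c
#include <assert.h>
#include <stdint.h>

enum { BLTSIZE = 0x058, COPJMP2 = 0x08A, SLOT_WORDS = 4, MAX_BLITS = 4 };

/* Copper WAIT with zero position and zero masks, BFD clear: blitter wait. */
static const uint16_t WAIT_BLIT[2] = { 0x0001, 0x0000 };

/* w[0..1] = wait (or patched jump), w[2..3] = one MOVE standing in for a blit. */
typedef struct { uint16_t w[SLOT_WORDS]; } Slot;

/* One extra slot so there is always room for the terminator. */
typedef struct { Slot slot[MAX_BLITS + 1]; int used; } BlitList;

/* Start of frame: turn last frame's terminator back into a wait, then rewind. */
static void list_begin(BlitList *l)
{
    assert(l != 0);
    l->slot[l->used].w[0] = WAIT_BLIT[0];
    l->slot[l->used].w[1] = WAIT_BLIT[1];
    l->used = 0;
}

/* Append one blit; returns 0 when the array is full. */
static int list_add(BlitList *l, uint16_t reg, uint16_t value)
{
    assert(l != 0);
    if (l->used >= MAX_BLITS)
        return 0;
    Slot *s = &l->slot[l->used++];
    s->w[0] = WAIT_BLIT[0];
    s->w[1] = WAIT_BLIT[1];
    s->w[2] = reg;
    s->w[3] = value;
    return 1;
}

/* End of frame: patch the wait of the first unused slot into a jump. */
static void list_end(BlitList *l)
{
    assert(l != 0);
    l->slot[l->used].w[0] = COPJMP2;
    l->slot[l->used].w[1] = 0x0000;
}
```

Only two words are touched per frame to terminate the list, which matches the point about patching as little as possible.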
Old 01 September 2022, 20:20   #4
paraj
Quote:
Originally Posted by a/b View Post
If you want to return to "somewhere in the middle of" copper1 before VBL, yes (and also set the pointer back to the start of copper1 before the end of the frame, for a correct VBL restart). Otherwise copper1 will be restarted automatically during VBL.
Thanks, yes, I basically want to do a "subroutine" (more like coroutine) call to copperlist2.

Quote:
Originally Posted by Jobbo View Post
I can't think when it makes sense to construct a copper list where it will constantly check if there's a blit to process.

It seems much simpler to treat the copper list as a command buffer you filled in during the previous frame.

My set up basically has an array of blits each with a wait at the start. The last wait gets patched so it's a jump that will terminate the sequence of blits.

So, each frame I just unpatch that jump and then start again at the front of the array filling it up. It's double buffered so a little more complex than that but not much.

In order to be as optimal as possible I only patch the values I know I need to change in the copper list of blits, which is easy enough if they are all lines but not so practical for something more generic.
I wouldn't "constantly" check, just at regular intervals (based on the application).

Instead of being coy, I'll just mention that I have a specific application in mind. I'm working on an effect where I do a lot of CPU-heavy stuff (~1-3 frames) and then do ~1 frame of blitter ops (C2P). I'm thinking of ways to reduce the CPU load to speed it up. I need to do 12 blits (on OCS) for the C2P part, so I was thinking of speeding it up by not doing a (CPU-driven) blitter queue, but instead offloading it to the copper. It would need tuning of course, but if I have, say, 6-8 checks per frame done by the copper, that could be faster.
Old 01 September 2022, 20:27   #5
ross
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 54
Posts: 4,508
Quote:
Originally Posted by Jobbo View Post
I can't think when it makes sense to construct a copper list where it will constantly check if there's a blit to process.
Well, the point is that your queue disrupts all copper effects in the frame.
With this approach you can put some blits in the 'dead spots' of the copper list.

@paraj: I don't know what will come out but I'm curious
Old 01 September 2022, 20:27   #6
Jobbo
For so few blits I would just use interrupts.

Using the copper is more about maximizing the number of blits you can submit, at the cost of doing all the other things the copper is normally good for like color changes etc.

You can of course interleave copper blits and other copper work but it's a pain.

It's fun to try nonetheless.
Old 01 September 2022, 20:31   #7
Jobbo
Quote:
Originally Posted by ross View Post
Well, the point is that your queue disrupts all copper effects in the frame.
With this approach you can put some blits in the 'dead spots' of the copper list.

That's fair, I'm not sure what was linked but it does make some sense to interleave checks that will skip if the last blit hasn't finished yet. Then it can continue on with whatever has to happen that line or whatever.

Bound to be pretty wasteful of DMA slots with all that copper checking, but maybe worth it for some cases.
Old 01 September 2022, 20:41   #8
ross
Quote:
Originally Posted by Jobbo View Post
Bound to be pretty wasteful of DMA slots with all that copper checking, but maybe worth it for some cases.
Yep, this could be a problem.
And the fact that the handler code will not be trivial at all..

But experiments are experiments, curiosity cannot be stopped!

Consider that right now I'm thinking about how to use the blitter as a Huffman decoding accelerator when building a 'sparse' table.
Old 01 September 2022, 20:50   #9
paraj
Actually, might as well post the code. It's not like you guys can't keep a secret, right?

I really want this to run in 3 frames (which I think is the limit at 320x256), but I'm running out of normal ideas. Hence this thread.
Attached Files
File Type: zip rotozoom4.zip (24.9 KB, 60 views)
Old 01 September 2022, 21:25   #10
Jobbo
I can take a look later and see if I have any ideas. Still at work for now.
Old 02 September 2022, 02:32   #11
Jobbo
From what I see using Bartman's awesome profiler there's very little CPU time spent starting the blits. So, not much to gain trying to speed that up.

The blitter work in total lasts about 1 frame for each rotation which means the blitter is idle for 2 full frames.

If you could somehow find more work for the blitter that would unburden the CPU then that'd be your best bet for making an improvement.

I'd have to spend more time understanding the code to know how it all works. Or you could explain and I might have some thoughts. But I'm no expert, I'm sure some others would have more to say.
Old 02 September 2022, 09:40   #12
paraj
Yeah, I'm not expecting miracles, but since I'm 100% CPU bound anything might help. It was mostly to try it out as I think it might be useful in general. Also it could open up other optimization opportunities, for example there are only 6 combinations of source/destination so the copperlists could be precalculated.

The core is "renderline", which renders 160 scrambled chunky pixels, 4 pixels at a time. Four pre-scrambled textures are used: Texture 0 contains data like this (numbers are bitplanes): 32------ 10------, Texture 1 --32---- --10----, etc.
Code:
        move.w  ofs1(a1),d0
        or.w    ofs2(a2),d0
        or.w    ofs3(a3),d0
        or.w    ofs4(a4),d0
        movep.w d0,dofs(a0)
After the movep two consecutive words look like this:
Code:
a3 a2 b3 b2 c3 c2 d3 d2 e3 e2 f3 f2 g3 g2 h3 h2
a1 a0 b1 b0 c1 c0 d1 d0 e1 e0 f1 f0 g1 g0 h1 h0
Then the blitter is used to perform horizontal doubling and extract the individual bitplane data.
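For anyone who wants to check the byte layout on a host machine, here is a small C model of the inner step. It assumes the texture layout described above (pixel bits 3,2 in the high byte and bits 1,0 in the low byte, shifted right by two bits per texture) and the documented movep.w behaviour of writing the high byte at the destination and the low byte two bytes further on. The helper names `scramble`, `combine4` and `movep_w` are invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: texture word for 4-bit pixel p in texture "slot" 0..3.
 * Slot 0 gives 32------ 10------, slot 1 gives --32---- --10----, and so on. */
static uint16_t scramble(unsigned p, unsigned slot)
{
    assert(p < 16 && slot < 4);
    unsigned hi = (p >> 2) & 3;      /* bitplane bits 3 and 2 */
    unsigned lo = p & 3;             /* bitplane bits 1 and 0 */
    unsigned shift = 6 - 2 * slot;   /* position inside each byte */
    return (uint16_t)(((hi << shift) << 8) | (lo << shift));
}

/* The four or.w instructions: merge one word from each texture. */
static uint16_t combine4(uint16_t t0, uint16_t t1, uint16_t t2, uint16_t t3)
{
    return (uint16_t)(t0 | t1 | t2 | t3);
}

/* Model of "movep.w Dn,disp(An)": high byte to disp, low byte to disp+2. */
static void movep_w(uint8_t *base, int disp, uint16_t value)
{
    assert(base != 0);
    base[disp]     = (uint8_t)(value >> 8);
    base[disp + 2] = (uint8_t)(value & 0xff);
}
```

Two calls, one at displacement 0 for pixels a..d and one at displacement 1 for pixels e..h, fill two consecutive big-endian words in the a3 a2 ... h3 h2 / a1 a0 ... h1 h0 arrangement shown above.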

updatecode modifies "renderline" to sample from x*scale*cos(angle), x*scale*-sin(angle) (throwing away some precision because integer offsets are used)

yloop steps in the vertical direction calculating the starting offset for each scanline and then "calls" renderline to output pixels. The textures are doubled in size such that the starting offset + horizontal offset doesn't step outside the texture.

Of course I could offload the job of movep.w to the blitter, but I don't think it's faster (movep.w vs move.w only adds 8 cycles)
Old 02 September 2022, 13:03   #13
Galahad/FLT
Going nowhere
Join Date: Oct 2001
Location: United Kingdom
Age: 50
Posts: 9,025
Beware that MOVEP isn't supported on 68060
Old 02 September 2022, 13:05   #14
pink^abyss
Registered User
 
Join Date: Aug 2018
Location: Untergrund/Germany
Posts: 410
Quote:
Originally Posted by paraj View Post
Also probably an issue with triggering a blit correctly without race conditions. You could always restrict yourself to blits only starting at a certain point (with an interrupt routine), but maybe it can be done better

As Jobbo said, the blitting is almost optimal. You could gain a few cycles by starting the first blit on raster line 300 instead of raster line 302.
In the frames without blitting you still have 20% of idle DMA cycles. Perhaps you could do some work with the blitter here.
I would try to get the effect running in 2 frames. A window of 256x256 (or even smaller) could make it happen.
Old 02 September 2022, 19:07   #15
paraj
Quote:
Originally Posted by Galahad/FLT View Post
Beware that MOVEP isn't supported on 68060
Yeah, thanks, if AFB_68060 is set I replace all the movep instructions with bsr.w movepemu (which does the necessary work instead). Without that it's 10x slower than a500 on my 1260 (reminds me that I need a more proper check in case that flag isn't set).

Quote:
Originally Posted by pink^abyss View Post
As Jobbo said, the blitting is almost optimal. You could gain a few cycles by starting the first blit on raster line 300 instead of raster line 302.
In the frames without blitting you still have 20% of idle DMA cycles. Perhaps you could do some work with the blitter here.
I would try to get the effect running in 2 frames. A window of 256x256 (or even smaller) could make it happen.
So it's probably worthwhile to do the 8x1 step with the blitter even if it's (theoretically) more cycles overall. Thanks, I'll try that, though it would be a bit disappointing since I more or less started out wanting to try a C2P using movep...

First though, I need to get this copper blitter queue thing working to stay on topic
Old 02 September 2022, 20:51   #16
a/b
Quote:
Originally Posted by paraj View Post
... Without that it's 10x slower than a500 on my 1260
Hmm, that's not good. I have a couple of routines that use movep, and my initial plan was to write a very specific emulation routine, hoping the 68060 would be sufficiently faster than the 68000 to make it work.

Could you please try this and see how much slower it is? Should be pretty straightforward to include. It handles your case: movep.w d0,(offset,a0) with destroyable d0.

Code:
; a5=vbr
	move.l	($0f4,a5),-(a7)		; #61 (unimpl. int instr.)
	move.l	#HandleMovep,($0f4,a5)

	lea	(Test,pc),a0
	move.w	#$1234,d0
	movep.w	d0,(1,a0)

	move.l	(a7)+,($0f4,a5)
	rts

Test	DC.L	~0

HandleMovep
	movem.l	a0/a1,-(a7)
	move.l	(4*2+2,a7),a1		; +0 = sr, +2 = pc
	add.w	(2,a1),a0		; dst offset

	move.b	d0,(2,a0)
	lsr.w	#8,d0
	move.b	d0,(a0)

	addq.l	#4,a1			; skip movep
	move.l	a1,(4*2+2,a7)		; update pc

	movem.l	(a7)+,a0/a1
	rte
Old 02 September 2022, 21:38   #17
paraj
Quote:
Originally Posted by a/b View Post
Hmm, that's not good. I have a couple of routines that use movep, and my initial plan was to write a very specific emulation routine, hoping the 68060 would be sufficiently faster than the 68000 to make it work.
Don't trust this since I only did a quick test, but it runs at ~1530 scanlines (~500 used with my emu code). I'll do proper timing when I have more free time.
Old 02 September 2022, 22:31   #18
a/b
OK, thanks. Looks way too slow, I'll have to try something else (a simple bsr.w won't do, it's ~230KB of unrolled variable length repeats with non-uniform offsets).
Old 03 September 2022, 12:28   #19
paraj
All right, so it is possible, but as y'all already predicted, it's not faster. It's actually slower.
The code also got quite complicated and ugly, but it is possible. For reasons I don't quite understand the "is blitter free" checks have to be spaced quite closely (at least in this case). Maybe I did something wrong though.


The copperlists absolutely have to be kept within the same 64k page (it's ok for cop1 and cop2 to be in different ones though) such that only the lower part of the address needs to be updated. This also allows "racefree" starting of the blits by writing to cop2lc+2.
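The 64K constraint can be sketched in C as follows. The reasoning, as I read it, is that a full pointer update takes two 16-bit register writes and the copper could jump in between them, whereas a single write to the low word cannot be seen half done. `CopLC` is just a host-side stand-in for the cop2lc register pair, and `same_64k_page` and `retarget_low` are hypothetical names, not taken from the attached source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when the byte range [start, start+size) does not cross a 64K boundary. */
static bool same_64k_page(uint32_t start, uint32_t size)
{
    assert(size > 0);
    return (start >> 16) == ((start + size - 1) >> 16);
}

/* Host-side stand-in for the COP2LCH/COP2LCL register pair. */
typedef struct { uint16_t hi, lo; } CopLC;

static uint32_t coplc_value(const CopLC *r)
{
    return ((uint32_t)r->hi << 16) | r->lo;
}

/* Retarget the pointer with a single low-word write (the cop2lc+2 write).
 * Refuses, leaving the register untouched, if the target is in another page. */
static bool retarget_low(CopLC *r, uint32_t target)
{
    assert(r != 0);
    if ((uint16_t)(target >> 16) != r->hi)
        return false;
    r->lo = (uint16_t)(target & 0xffff);
    return true;
}
```

A build-time or startup check along the lines of `same_64k_page(list_start, list_size)` for each copper list would catch a bad memory layout before it turns into a hard-to-find glitch.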

EDIT: And using the blitter for the 8x1 pass instead of using movep.w is 10 raster lines faster (bringing it to 964).

Quote:
Originally Posted by a/b View Post
OK, thanks. Looks way too slow, I'll have to try something else (a simple bsr.w won't do, it's ~230KB of unrolled variable length repeats with non-uniform offsets).
Testing it in isolation (still writing to chipram) I get:
060.library (system handler): 3365 cycles
Your version: 708 cycles
My version: 157 cycles (including the extra tst.w dmaconr)

I guess the pipeline synchronization that happens when handling exceptions is killing performance here. Maybe it's not so bad in your case, depending on what you're using it for. You can send me a test executable if you have something in particular you want timed.
Attached Files
File Type: zip rotozoom4_copper_blitq.zip (22.5 KB, 45 views)

Last edited by paraj; 03 September 2022 at 13:04.
Old 03 September 2022, 15:38   #20
a/b
Thanks again. That was with superscalar mode enabled?