#1 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,279
|
Copper driven blitter queue
Thought of experimenting with copper-driven blits (of course someone already thought of at least parts of it: http://eab.abime.net/showpost.php?p=...5&postcount=19 ).
The basic idea is to have the main copper list regularly "check" whether it needs to (and can) start a new blit, i.e. interspersing it with SKIP (blitter busy) + write-to-COPJMP2 instructions. When no blit is necessary, the 2nd copperlist would just return. A few problems with this, if I understand it correctly: writes to copjmpX reload the pointer from copXlc, so you need to store the "return address" in the "opposite" copXlc register before "returning"? The normal "do nothing" function for cop2 needs to be:
Code:
loop:
    dc.w cop2lc+0, loop>>16     ; point cop2lc back at this stub (high word)
    dc.w cop2lc+2, loop&$ffff   ; (low word)
    dc.w copjmp1, 0             ; return to wherever cop1lc points
Code:
    dc.w cop1lc+0, next>>16     ; store the return address in cop1lc (high word)
    dc.w cop1lc+2, next&$ffff   ; (low word)
    dc.w $ffff, $0001           ; skip if blitter busy
    dc.w copjmp2, 0             ; jump to the 2nd copperlist
next:
There's also probably an issue with triggering a blit correctly without race conditions. You could always restrict yourself to blits only starting at a certain point (with an interrupt routine), but maybe it can be done better? I'll experiment on my own, but maybe somebody already did the hard work.
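To make the "return address" bookkeeping concrete, here is a small Python model (not Amiga code) of the words such a check fragment would contain. The register offsets are the usual custom chip offsets (COP1LC at $080, COP2LC at $084, COPJMP1 at $088, COPJMP2 at $08A). The helper names `copper_move`, `set_pointer` and `blit_check` and the example address are made up for this sketch, and the SKIP words are simply copied from the listing above.

```python
# Custom chip register offsets (relative to $DFF000).
COP1LC, COP2LC = 0x080, 0x084
COPJMP1, COPJMP2 = 0x088, 0x08A

def copper_move(reg, value):
    """One copper MOVE: register offset word, then a 16-bit data word."""
    return [reg & 0x01FE, value & 0xFFFF]

def set_pointer(lc_reg, address):
    """Two MOVEs that load a 32-bit address into COPxLC (high word first)."""
    return copper_move(lc_reg, address >> 16) + copper_move(lc_reg + 2, address)

def blit_check(next_address):
    """Hypothetical check fragment: store return address, SKIP, jump to cop2."""
    words = set_pointer(COP1LC, next_address)
    words += [0xFFFF, 0x0001]          # SKIP words taken verbatim from the post
    words += copper_move(COPJMP2, 0)   # strobe: copper continues at COP2LC
    return words

# Example with an invented return address.
fragment = blit_check(0x00021234)
print(" ".join(f"{w:04x}" for w in fragment))
```

The fragment is 4 copper instructions (16 bytes), so in a real list `next_address` would be the address of the fragment plus 16.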
#2 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,076
|
If you want to return to "somewhere in the middle of" copper1 before VBL, yes (and also set the pointer back to the start of copper1 before the end of the frame, for a correct VBL restart). Otherwise copper1 will be restarted automatically during VBL.
#3 |
Registered User
Join Date: Jun 2020
Location: Druidia
Posts: 389
|
I can't think of a case where it makes sense to construct a copper list that constantly checks whether there's a blit to process.
It seems much simpler to treat the copper list as a command buffer you filled in during the previous frame. My setup basically has an array of blits, each with a wait at the start. The last wait gets patched into a jump that terminates the sequence of blits. So, each frame I just unpatch that jump and then start again at the front of the array, filling it up. It's double buffered, so it's a little more complex than that, but not much. To be as optimal as possible I only patch the values I know I need to change in the copper list of blits, which is easy enough if they are all lines but not so practical for something more generic.
#4 | ||
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,279
|
Quote:
Quote:
Instead of being coy, I'll just mention that I have a specific application in mind. I'm working on an effect where I do a lot of CPU-heavy stuff (~1-3 frames) and then do ~1 frame of blitter ops (C2P). I'm thinking of ways to reduce the CPU load to speed it up. I need to do 12 blits (on OCS) for the C2P part, so I was thinking of speeding it up by not using a CPU-driven blitter queue, but instead offloading the queue to the copper. It would need tuning of course, but if I have, say, 6-8 checks per frame done by the copper/blitter, that could be faster.
#5 | |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 54
Posts: 4,508
|
Quote:
With this approach you can put some blits in the 'dead spots' of the copper list.
@paraj: I don't know what will come out but I'm curious.
#6 |
Registered User
Join Date: Jun 2020
Location: Druidia
Posts: 389
|
For so few blits I would just use interrupts.
Using the copper is more about maximizing the number of blits you can submit, at the cost of giving up the other things the copper is normally good for, like color changes etc. You can of course interleave copper blits and other copper work, but it's a pain. It's fun to try nonetheless.
#7 | |
Registered User
Join Date: Jun 2020
Location: Druidia
Posts: 389
|
Quote:
That's fair. I'm not sure what was linked, but it does make some sense to interleave checks that will skip if the last blit hasn't finished yet. The copper can then continue with whatever else has to happen on that line. It's bound to be pretty wasteful of DMA slots with all that copper checking, but maybe worth it for some cases.
#8 | |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 54
Posts: 4,508
|
Quote:
And there's the fact that the handler code will not be trivial at all. But experiments are experiments, curiosity cannot be stopped! Consider that right now I'm thinking about how to use the blitter as a Huffman decoding accelerator when building a 'sparse' table.
#9 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,279
|
Actually, might as well post the code. It's not like you guys can't keep a secret, right?
I really want this to run in 3 frames (which I think is the limit at 320x256), but I'm running out of normal ideas. Hence this thread.
#10 |
Registered User
Join Date: Jun 2020
Location: Druidia
Posts: 389
|
I can take a look later and see if I have any ideas. Still at work for now.
#11 |
Registered User
Join Date: Jun 2020
Location: Druidia
Posts: 389
|
From what I see using Bartman's awesome profiler there's very little CPU time spent starting the blits. So, not much to gain trying to speed that up.
The blitter work in total lasts about 1 frame for each rotation, which means the blitter is idle for 2 full frames. If you could somehow find more work for the blitter that would unburden the CPU, then that'd be your best bet for making an improvement. I'd have to spend more time understanding the code to know how it all works. Or you could explain and I might have some thoughts. But I'm no expert, I'm sure some others would have more to say.
#12 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,279
|
Yeah, I'm not expecting miracles, but since I'm 100% CPU bound anything might help. It was mostly to try it out as I think it might be useful in general. Also it could open up other optimization opportunities, for example there are only 6 combinations of source/destination so the copperlists could be precalculated.
The core is "renderline", which renders 160 scrambled chunky pixels, 4 pixels at a time. 4 pre-scrambled textures are used (Texture 0 contains data like this, where the numbers are bitplanes: 32------ 10------, Texture 1: --32---- --10----, etc.).
Code:
    move.w  ofs1(a1),d0         ; fetch the pre-scrambled word for the 1st pixel
    or.w    ofs2(a2),d0         ; merge the 2nd pixel
    or.w    ofs3(a3),d0         ; merge the 3rd pixel
    or.w    ofs4(a4),d0         ; merge the 4th pixel
    movep.w d0,dofs(a0)         ; high byte to dofs(a0), low byte to dofs+2(a0)
Code:
a3 a2 b3 b2 c3 c2 d3 d2 e3 e2 f3 f2 g3 g2 h3 h2
a1 a0 b1 b0 c1 c0 d1 d0 e1 e0 f1 f0 g1 g0 h1 h0
"updatecode" modifies "renderline" to sample from x*scale*cos(angle), x*scale*-sin(angle) (throwing away some precision because integer offsets are used).
"yloop" steps in the vertical direction, calculating the starting offset for each scanline, and then "calls" renderline to output pixels. The textures are doubled in size so that the starting offset + horizontal offset doesn't step outside the texture.
Of course I could offload the job of movep.w to the blitter, but I don't think it's faster (movep.w vs move.w only adds 8 cycles).
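As a rough illustration of the pre-scrambled texture idea, here is a Python model (not the actual renderer) of how four 4-bit pixels could be OR'ed into one word with the layout described above. The function names `scramble` and `combine` are invented for this sketch, and the bit placement is inferred from the "32------ 10------" description, so treat it as an assumption.

```python
def scramble(pixel, slot):
    """Pre-scramble one 4-bit pixel for position `slot` (0..3) in a group of 4.

    The high byte receives bitplanes 3 and 2, the low byte bitplanes 1 and 0,
    each pair shifted right by 2*slot bits (slot 0: 32------ 10------).
    """
    hi = (pixel >> 2) & 3
    lo = pixel & 3
    shift = 6 - 2 * slot
    return (hi << (8 + shift)) | (lo << shift)

def combine(pixels):
    """OR four pre-scrambled words together, like the move.w/or.w sequence."""
    word = 0
    for slot, pixel in enumerate(pixels):
        word |= scramble(pixel, slot)
    return word

# Pixels a, b, c, d with arbitrary example values.
word = combine([0b1111, 0b0000, 0b1010, 0b0101])
print(f"{word >> 8:08b} {word & 0xFF:08b}")
```

Because each texture only ever occupies its own bit pairs, the OR never needs masking, which is presumably why the textures are stored four times.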
#13 |
Going nowhere
Join Date: Oct 2001
Location: United Kingdom
Age: 50
Posts: 9,025
|
Beware that MOVEP isn't supported on 68060
#14 | |
Registered User
Join Date: Aug 2018
Location: Untergrund/Germany
Posts: 410
|
Quote:
As Jobbo said, the blitting is almost optimal. You could gain a few cycles by starting the first blit in rasterline 300, instead of rasterline 302.
In the frames without blitting you still have 20% of idle DMA cycles. Perhaps you could do some work with the blitter here. I would try to get the effect running in 2 frames. A window of 256x256 (or even smaller) could make it happen.
#15 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,279
|
Yeah, thanks. If AFB_68060 is set I replace all the movep instructions with bsr.w movepemu (which does the necessary work instead). Without that it's 10x slower than an A500 on my 1260.
Quote:
First though, I need to get this copper blitter queue thing working to stay on topic.
#16 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,076
|
Hmm, that's not good. I have a couple of movep using routines, and my initial plan was to write a very specific emulation routine, and hopefully 68060 would be sufficiently faster than 68000 to make it work.
Could you please try this and see how much slower it is? Should be pretty straightforward to include. It handles your case: movep.w d0,(offset,a0) with destroyable d0. Code:
; a5=vbr
    move.l  ($0f4,a5),-(a7)         ; #61 (unimpl. int instr.)
    move.l  #HandleMovep,($0f4,a5)
    lea     (Test,pc),a0
    move.w  #$1234,d0
    movep.w d0,(1,a0)
    move.l  (a7)+,($0f4,a5)
    rts
Test
    DC.L    ~0
HandleMovep
    movem.l a0/a1,-(a7)
    move.l  (4*2+2,a7),a1           ; +0 = sr, +2 = pc
    add.w   (2,a1),a0               ; dst offset
    move.b  d0,(2,a0)
    lsr.w   #8,d0
    move.b  d0,(a0)
    addq.l  #4,a1                   ; skip movep
    move.l  a1,(4*2+2,a7)           ; update pc
    movem.l (a7)+,a0/a1
    rte
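For readers unfamiliar with the instruction being emulated, this is a tiny Python model of what `movep.w d0,(offset,a0)` does to memory: the high byte goes to the effective address and the low byte two bytes further on. The function name `movep_w` is invented for this sketch. The data mirrors the test above (a longword of all ones, d0 = $1234, offset 1).

```python
def movep_w(memory, address, value):
    """Model of MOVEP.W Dn,(d16,An): two bytes written to alternate addresses."""
    memory[address] = (value >> 8) & 0xFF   # high byte at the effective address
    memory[address + 2] = value & 0xFF      # low byte two bytes further on

# Same data as the test in the post: DC.L ~0, d0 = $1234, offset 1.
test = bytearray(b"\xff\xff\xff\xff")
movep_w(test, 1, 0x1234)
print(test.hex())
```

The bytes in between are left untouched, which is what the handler's two `move.b` writes to (a0) and (2,a0) reproduce.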
#17 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,279
|
Don't trust this since I only did a quick test, but it runs at ~1530 scanlines (~500 used with my emu code). I'll do proper timing when I have more free time.
#18 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,076
|
OK, thanks. Looks way too slow, I'll have to try something else (a simple bsr.w won't do, it's ~230KB of unrolled variable length repeats with non-uniform offsets).
#19 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,279
|
All right, so it is possible, but as y'all already predicted it's not faster. It's actually slower.
The code also got quite complicated and ugly, but it is possible. For reasons I don't quite understand, the "is blitter free" checks have to be spaced quite closely (at least in this case). Maybe I did something wrong though. The copperlists absolutely have to be kept within the same 64k page (it's OK for cop1 and cop2 to be in different ones though) so that only the lower word of the address needs to be updated. This also allows "race-free" starting of the blits by writing to cop2lc+2.
EDIT: And using the blitter for the 8x1 pass instead of using movep.w is 10 raster lines faster (bringing it to 964).
Quote:
060.library (system handler): 3365 cycles
Your version: 708 cycles
My version: 157 cycles (including the extra tst.w dmaconr)
I guess the pipeline synchronization that happens when handling exceptions is killing performance here. Maybe it's not so bad in your case, depending on what you're using it for. You can send me a test executable if you have something in particular you want timed.
Last edited by paraj; 03 September 2022 at 13:04.
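The 64k page restriction can be checked when the copperlists are allocated. Below is a minimal Python sketch of that check, with invented addresses and hypothetical helper names (`same_64k_page`, `low_word`). It only models the address arithmetic, not the copper itself.

```python
def same_64k_page(start, length):
    """True if the byte range [start, start+length) stays inside one 64K page."""
    return (start >> 16) == ((start + length - 1) >> 16)

def low_word(address):
    """The 16 bits that would be written to COPxLC+2 (the low half)."""
    return address & 0xFFFF

# A 256-byte list ending exactly at a page boundary is fine...
print(same_64k_page(0x2FF00, 0x100))
# ...but one byte more crosses into the next page.
print(same_64k_page(0x2FF00, 0x101))
```

When the check holds, the high word in COPxLC never changes, so redirecting the copper is a single 16-bit write of `low_word(target)`. A single write cannot be observed half-done, which fits the "race-free" remark above.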
#20 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,076
|
Thanks again. That was with superscalar mode enabled?