10 June 2016, 12:34 | #1 |
Registered User
Join Date: Jun 2016
Location: UK
Posts: 428
|
Maximum blitter speed with pipelining
How can we get the maximum speed from the blitter? Let's target Amigas with only chip RAM, so a stock A500 or A1200.
The blitter registers are not double buffered, so you can't pre-load them between operations. The CPU stops when the blitter has priority anyway... Can you toggle the priority bit on halfway through an operation? Then you could start in friendly mode, load the next operation's settings into CPU registers, enable the priority bit, and then immediately write those settings into the blitter's registers.
What other techniques can be used to get maximum speed from the blitter? Say I have a load of pre-calculated operations I want to perform, like a demo effect or bob list. |
10 June 2016, 12:42 | #2 |
Registered User
Join Date: Feb 2010
Location: Espoo / Finland
Posts: 818
|
|
10 June 2016, 13:03 | #3 |
WinUAE developer
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,506
|
I'd say "it depends", not no or yes
|
10 June 2016, 13:26 | #4 |
Glastonbridge Software
Join Date: Jan 2012
Location: Edinburgh/Scotland
Posts: 2,243
|
if you have only Chip RAM and no instruction caches
|
10 June 2016, 13:59 | #5 |
Registered User
Join Date: Jun 2016
Location: UK
Posts: 428
|
Yeah, should have specified, a 68000 with only chip RAM stops, but of course an A1200 with 68EC020 can continue to execute code from its cache. However, in this scenario, since we need to fetch the next blitter operation parameters from RAM...
I suppose you could use speedcode, but getting the instructions into the cache seems like a tricky problem. What about the copper? If you calculate the right wait positions so that it copies new data as soon as the blitter finishes, it could be quicker than the CPU I think. There might be some wasted cycles due to the copper not being able to wait for the exact point it needs to (extremes of the scanline). |
10 June 2016, 19:05 | #6 |
WinUAE developer
Join Date: Aug 2001
Location: Hämeenlinna/Finland
Age: 49
Posts: 26,506
|
It still depends, even with a chip RAM only A500 (or chip + "slow" RAM). It depends on the selected channel combination; some have idle cycles that are free for the CPU.
The copper has a blitter wait bit, which is used in many demos to start multiple blits sequentially. |
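A minimal sketch of what such a copper-driven blit queue looks like as raw copper words, written as a Python list generator (the kind of thing a build script might do). The register offsets are the documented custom chip offsets; the helper names `cop_move`, `cop_wait_blitter` and `queue_blits` are made up for illustration. It assumes the copper danger bit (CDANG in COPCON) has been set, since the copper cannot write the blitter registers otherwise, and it assumes the usual `$0001,$0000` encoding for a WAIT that only holds for the blitter.

```python
# Custom chip register offsets (from $DFF000), per the hardware reference.
BLTCON0 = 0x040
BLTSIZE = 0x058

def cop_move(reg, value):
    """One copper MOVE: register offset (bit 0 clear), then the data word."""
    return [reg & 0x01FE, value & 0xFFFF]

def cop_wait_blitter():
    """WAIT for beam position 0,0 with all compare bits masked off and the
    blitter-finished-disable bit (bit 15 of word 2) clear, so the only
    thing holding the copper is a busy blitter."""
    return [0x0001, 0x0000]

def queue_blits(blits):
    """blits: list of blits, each a list of (register, value) pairs.
    Hypothetical helper; BLTSIZE must be the last pair of each blit
    because writing it starts the blitter."""
    words = []
    for regs in blits:
        words += cop_wait_blitter()        # previous blit must be done
        for reg, value in regs:
            words += cop_move(reg, value)
    words += cop_wait_blitter()            # let the last blit finish
    words += [0xFFFF, 0xFFFE]              # conventional end of list
    return words

if __name__ == "__main__":
    demo = queue_blits([[(BLTCON0, 0x09F0), (BLTSIZE, 0x0414)]])
    print(" ".join(f"{w:04X}" for w in demo))
```

Each blit costs one WAIT plus one MOVE per register, and the CPU is not involved at all once the list is running.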
11 June 2016, 22:20 | #7 |
Registered User
Join Date: Jun 2016
Location: UK
Posts: 428
|
Thanks Toni. So would you say that is the fastest possible way to queue up blits, by waiting and loading registers with the copper?
Any other useful optimizations? I suppose arranging your bitmaps in memory so that you don't need to reload address registers might help. |
12 June 2016, 06:28 | #8 |
Code Kitten
Join Date: Aug 2015
Location: Montreal/Canadia
Age: 52
Posts: 1,178
|
This is definitely the fastest, since the copper will react to the blitter finishing much faster than the CPU can react to a blitter interrupt, and it will be much faster at setting blitter registers.
Possible optimizations include grouping blits so that consecutive blits share very similar setups, so that the minimum number of registers needs to be re-set between each. This might require quite a bit of CPU time if you need to do it dynamically and cannot predict the order statically. |
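A rough sketch of the "only re-set what changed" idea, as an offline Python helper. It assumes that the control words, masks and modulos keep their values after a blit, while the pointer registers advance during the blit and so are always rewritten; BLTSIZE is always written last because it starts the blit. The function names and the dict-per-blit layout are hypothetical, and sorting is only safe when the draw order of the blits does not matter.

```python
# Registers assumed to hold their value from one blit to the next.
STATIC_REGS = ("BLTCON0", "BLTCON1", "BLTAFWM", "BLTALWM",
               "BLTAMOD", "BLTBMOD", "BLTCMOD", "BLTDMOD")

def minimal_writes(blits):
    """blits: list of dicts, register name -> value, each with a BLTSIZE.
    Returns, per blit, the (name, value) writes that are really needed."""
    held = {}                      # what the blitter currently holds
    result = []
    for blit in blits:
        writes = []
        for name, value in blit.items():
            if name == "BLTSIZE":
                continue           # handled last, see below
            if name in STATIC_REGS:
                if held.get(name) == value:
                    continue       # unchanged, skip the write
                held[name] = value
            writes.append((name, value))
        writes.append(("BLTSIZE", blit["BLTSIZE"]))  # starts the blit
        result.append(writes)
    return result

def group_by_setup(blits):
    """Stable sort so blits with identical static setups end up adjacent."""
    return sorted(blits,
                  key=lambda b: tuple(b.get(r, -1) for r in STATIC_REGS))

if __name__ == "__main__":
    copy = {"BLTCON0": 0x09F0, "BLTDPT": 0x20000, "BLTSIZE": 0x0402}
    cut = {"BLTCON0": 0x0FCA, "BLTDPT": 0x20100, "BLTSIZE": 0x0402}
    order = [copy, cut, dict(copy, BLTDPT=0x20200)]
    for label, blits in (("as given", order), ("grouped", group_by_setup(order))):
        print(label, sum(len(w) for w in minimal_writes(blits)), "writes")
```

Doing the grouping at build time costs nothing at run time; doing it per frame is where the CPU cost mentioned above comes in.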
12 June 2016, 06:34 | #9 |
Registered User
Join Date: Feb 2011
Location: Italy/Rome
Posts: 2,281
|
Interleaved bitplanes could be faster than the usual bitplane setup
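To illustrate why interleaving can help: with separate bitplanes a bob needs one blit per plane, while with interleaved planes the rows of all planes follow each other in memory, so one taller blit can cover them. A small sketch of the BLTSIZE arithmetic, assuming the OCS limits of 1024 lines by 64 words; `bltsize` and `blits_for_bob` are illustrative names. It also assumes the mask (if any) is stored once per plane so the single blit can use it.

```python
def bltsize(height, width_words):
    """Pack height (bits 15-6) and width in words (bits 5-0) into BLTSIZE.
    A height of 1024 and a width of 64 are both encoded as 0."""
    if not (1 <= height <= 1024 and 1 <= width_words <= 64):
        raise ValueError("blit does not fit in a single OCS BLTSIZE")
    return ((height & 0x3FF) << 6) | (width_words & 0x3F)

def blits_for_bob(height, width_words, depth, interleaved):
    """BLTSIZE values needed to draw one bob across all bitplanes."""
    if interleaved:
        # One blit; each bob line spans `depth` consecutive memory rows.
        return [bltsize(height * depth, width_words)]
    # One blit per plane, each needing its own pointer setup and wait.
    return [bltsize(height, width_words)] * depth

if __name__ == "__main__":
    for mode in (False, True):
        sizes = blits_for_bob(16, 2, 5, interleaved=mode)
        print("interleaved" if mode else "separate",
              [f"{s:04X}" for s in sizes])
```

The saving is in setup and wait overhead per blit, not in the number of words the blitter moves.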
|
17 July 2016, 18:09 | #10 |
Moderator
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,602
|
It does for most things everyone does with the Blitter.
Exceptions are clear and polyfill. If you need CPU cycles during a blit, you can disable BLTPRI or use the uncommon USEx channel masks 7, 5, or 3. This will make the blit finish later. |
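For reference, a small sketch of the bit patterns involved, assuming "USEx channel mask" means the four channel enable bits in BLTCON0 (bits 11-8, in the order A, B, C, D). The DMACON values are the documented SET/CLR and BLTPRI bits; the helper names are invented for this example.

```python
SETCLR = 0x8000   # DMACON bit 15: 1 = set the listed bits, 0 = clear them
BLTPRI = 0x0400   # DMACON bit 10: blitter priority ("blitter nasty")

def dmacon_bltpri(enable):
    """Word to write to DMACON to switch blitter priority on or off."""
    return (SETCLR | BLTPRI) if enable else BLTPRI

def bltcon0(shift, use_mask, minterm):
    """Pack the A shift, channel enable bits and minterm into BLTCON0."""
    return ((shift & 0xF) << 12) | ((use_mask & 0xF) << 8) | (minterm & 0xFF)

def channels(use_mask):
    """Name the DMA channels enabled by a 4-bit USE mask."""
    bits = ((8, "A"), (4, "B"), (2, "C"), (1, "D"))
    return "".join(name for bit, name in bits if use_mask & bit)

if __name__ == "__main__":
    for mask in (7, 5, 3):
        print(mask, channels(mask), f"{bltcon0(0, mask, 0xCC):04X}")
    print(f"BLTPRI on: {dmacon_bltpri(True):04X}, off: {dmacon_bltpri(False):04X}")
```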
18 July 2016, 10:48 | #11 | |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,410
|
Quote:
Surely if you're going to blit with the Copper the CPU has to create/update the copperlist for every blit you do? Doesn't this mean you end up spending more time because the CPU would've done more or less the same writes anyway to set up the blitter even when the Copper isn't used?
I guess it depends a bit, if you're dead set on using blitter interrupts it'd probably be faster than the CPU, but if you just blit in sequence with the CPU you don't have the overhead of interrupts so the CPU should win* in that case, shouldn't it?
*) You'd only have Blitter wait overhead but that can be limited by dynamically switching the BLTPRI bit in DMACON as part of the blitter wait. |
|
18 July 2016, 10:51 | #12 | |
Registered User
Join Date: Feb 2011
Location: Italy/Rome
Posts: 2,281
|
Quote:
|
|
18 July 2016, 11:12 | #13 |
Registered User
Join Date: Jun 2016
Location: UK
Posts: 428
|
It helps to think about what the CPU has to do in order to set up the blitter. The CPU has to write several words to the blitter's registers. Normally it would fetch the data from RAM into Dx registers and then write it out again, but that could be optimized a bit with speed code (self-modifying code loading immediate values).
The copper is a bit more efficient because it's all basically speed code, and because it uses a reduced address space to write values encoded in the instructions directly to chipset registers. In other words, the CPU has to do three memory accesses (load immediate instruction, move to absolute address instruction, write to blitter register) and the copper only has to do two (read instruction, write to blitter register).
Of course it's different when you have an 020 or some fast RAM and the CPU can pre-load the values it wants to write to the blitter. Even then the copper is probably faster because the CPU will have to poll the blitter to check when it has finished.
This brings us to the other major advantage of the copper. It can wait for the blitter to finish and there is no interrupt or polling overhead.
So the fastest way to blit is to create a custom copperlist. That can get very tricky if you are trying to use the copper for other stuff like palette changes and sprite hacks. I think this is why you rarely see both in demo effects that make heavy use of the blitter. |
18 July 2016, 11:15 | #14 | |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,410
|
Quote:
Maybe I'm just missing something, but it still feels to me using the copper to blit won't be faster. More convenient because you don't need to bother with interrupts etc, but I'm not so sure about it being faster. |
|
18 July 2016, 11:18 | #15 | |
Registered User
Join Date: Feb 2011
Location: Italy/Rome
Posts: 2,281
|
Quote:
|
|
18 July 2016, 11:21 | #16 | |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,410
|
Quote:
What I'm saying is that:
A) the polling overhead is, if done right, nearly free* and
B) my guess is that writing the copper list will take about as much time as it would've to just blit directly with the CPU (consider: you need to do the exact same work as setting up the blitter to write out the copperlist updates).
Now, I haven't actually created a copper based blitting system where the CPU creates/updates such a list so I could be very wrong indeed. Hence my confusion.
*) If you use the extra DMA access you need for safe blitter waiting to set BLTPRI to on as part of the blitter wait, and then deactivate it after the blitter wait loop, you're looking at something on the order of 20 cycles total overhead per blit (assuming 68000 / no fast memory).
Last edited by roondar; 18 July 2016 at 11:26. |
|
18 July 2016, 11:37 | #17 |
Registered User
Join Date: Feb 2011
Location: Italy/Rome
Posts: 2,281
|
Could we have a dynamic copper list using the copper's jump register, and update only that with the CPU?
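One way this might look, sketched as copper words in Python: the main list loads COP2LC with the address of the current blit list and strobes COPJMP2, and the CPU only rewrites the two address words to switch to another pre-built list. The register offsets are the documented ones; `jump_to` and `retarget` are hypothetical helpers, and how the blit list gets back to the main list (for example by simply ending there) is left open.

```python
# Copper location and jump strobe registers (offsets from $DFF000).
COP2LCH, COP2LCL, COPJMP2 = 0x084, 0x086, 0x08A

def jump_to(addr):
    """Copper words that point COP2LC at addr and then strobe COPJMP2."""
    return [COP2LCH, (addr >> 16) & 0xFFFF,
            COP2LCL, addr & 0xFFFF,
            COPJMP2, 0x0000]          # any write to COPJMP2 triggers the jump

def retarget(main_list, jump_index, addr):
    """CPU-side update: patch only the two address words of an existing
    jump sequence that starts at main_list[jump_index]."""
    main_list[jump_index + 1] = (addr >> 16) & 0xFFFF
    main_list[jump_index + 3] = addr & 0xFFFF

if __name__ == "__main__":
    main_list = jump_to(0x023456)
    retarget(main_list, 0, 0x024000)   # switch to another blit list
    print(" ".join(f"{w:04X}" for w in main_list))
```

With two or more blit lists prepared in advance, the per-frame CPU work is just the two word writes done by `retarget`.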
|
18 July 2016, 11:59 | #18 | |
Registered User
Join Date: Jun 2016
Location: UK
Posts: 428
|
Quote:
Code:
move $00, BLTCPTH
move $00, BLTCPTL
...
move $00, BLTCON0
move $00, BLTSIZE
wait for blitter finished bit
move $00, BLTCPTH
move $00, BLTCPTL
...
move $00, BLTCON0
move $00, BLTSIZE
wait for blitter finished bit
....
Consider that many of the values will not change from frame to frame. In a game you might have slots allocated to player and enemy bobs, so their source addresses, masks, functions and sizes stay the same. Only the destination address and bit rotation change. |
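A sketch of the slot idea: build each bob's copper instructions once, remember where the data words live, and per frame patch only the words that depend on the bob's position. This assumes a cookie-cut style blit where the C and D pointers both address the screen and the shift sits in the top four bits of BLTCON0 and BLTCON1; `build_slot` and `move_bob` are illustrative names, not anything from the thread.

```python
# Blitter register offsets (from $DFF000), per the hardware reference.
BLTCON0, BLTCON1 = 0x040, 0x042
BLTCPTH, BLTCPTL = 0x048, 0x04A
BLTDPTH, BLTDPTL = 0x054, 0x056
BLTSIZE = 0x058

def build_slot(regs):
    """regs: list of (register, value), BLTSIZE last.
    Returns the copper words (blitter wait first) and a map from
    register offset to the index of its data word."""
    words, index = [0x0001, 0x0000], {}     # wait for blitter finished
    for reg, value in regs:
        index[reg] = len(words) + 1         # data word follows the offset
        words += [reg, value & 0xFFFF]
    return words, index

def move_bob(words, index, dest, shift):
    """Per-frame update: patch screen address and shift only."""
    for hi, lo in ((BLTCPTH, BLTCPTL), (BLTDPTH, BLTDPTL)):
        words[index[hi]] = (dest >> 16) & 0xFFFF
        words[index[lo]] = dest & 0xFFFF
    for con in (BLTCON0, BLTCON1):
        words[index[con]] = (words[index[con]] & 0x0FFF) | ((shift & 0xF) << 12)

if __name__ == "__main__":
    slot, idx = build_slot([(BLTCON0, 0x0FCA), (BLTCON1, 0x0000),
                            (BLTCPTH, 0), (BLTCPTL, 0),
                            (BLTDPTH, 0), (BLTDPTL, 0),
                            (BLTSIZE, 0x0402)])
    move_bob(slot, idx, 0x012346, 5)
    print(" ".join(f"{w:04X}" for w in slot))
```

Here the per-frame cost is six word writes per bob, against sixteen words for rebuilding the whole slot.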
|
18 July 2016, 16:28 | #19 | |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,410
|
Quote:
That is indeed useful, though I suspect you'll need to update source/mask addresses relatively often for animation purposes - at which point the gain will be lower. Nice idea though, learned something new |
|
18 July 2016, 21:42 | #20 | |
Code Kitten
Join Date: Aug 2015
Location: Montreal/Canadia
Age: 52
Posts: 1,178
|
Quote:
|
|