Maximum blitter speed with pipelining

zero · 10 June 2016, 12:34

How can we get the maximum speed from the blitter? Let's target Amigas with only chip RAM, so a stock A500 or A1200.

The blitter registers are not double buffered, so you can't pre-load them between operations. The CPU stops when the blitter has priority anyway...

Can you switch the priority bit on halfway through an operation? Then you could start in friendly mode, load the next operation into CPU registers, enable the priority bit, and then immediately write the next operation's settings into the blitter's registers.

What other techniques can be used to get maximum speed from the blitter? Say I have a load of pre-calculated operations I want to perform, like a demo effect or bob list.

britelite · 10 June 2016, 12:42

Quote:

Originally Posted by zero

The CPU stops when the blitter has priority anyway...

No, it doesn't

Toni Wilen · 10 June 2016, 13:03

I'd say "it depends", not no or yes

Mrs Beanbag · 10 June 2016, 13:26

if you have only Chip RAM and no instruction caches

zero · 10 June 2016, 13:59

Yeah, should have specified, a 68000 with only chip RAM stops, but of course an A1200 with 68EC020 can continue to execute code from its cache. However, in this scenario, since we need to fetch the next blitter operation parameters from RAM...

I suppose you could use speedcode, but getting the instructions into the cache seems like a tricky problem.

What about the copper? If you calculate the right wait positions so that it copies new data as soon as the blitter finishes, it could be quicker than the CPU I think. There might be some wasted cycles due to the copper not being able to wait for the exact point it needs to (extremes of the scanline).

Toni Wilen · 10 June 2016, 19:05

It still depends, even on an A500 with only chip RAM (or chip + "slow" RAM). It depends on the selected channel combination; some have idle cycles that are free for the CPU.

The copper has a blitter wait bit, which is used in many demos to start multiple blits sequentially.

zero · 11 June 2016, 22:20

Thanks Toni. So would you say that is the fastest possible way to queue up blits, by waiting and loading registers with the copper?

Any other useful optimizations? I suppose arranging your bitmaps in memory so that you don't need to reload address registers might help.

ReadOnlyCat · 12 June 2016, 06:28

This is definitely the fastest since the copper will react much faster than the CPU to a blitter interrupt and will be much faster at setting blitter registers.
Possible optimizations include grouping blits so that consecutive blits share very similar setups, so that the minimum number of registers needs to be re-set between them. This might require quite a bit of CPU time, though, if you need to do it dynamically and cannot predict the order statically.
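To make the grouping idea concrete, here is a minimal C sketch, not taken from any poster's code. It assumes that only the "sticky" blitter state (control words and modulos) is modelled, since pointers and BLTSIZE have to be written for every blit anyway. The `BlitSetup` struct, the four-register subset and all function names are hypothetical. It sorts a blit queue so that identical setups end up adjacent, and counts the register writes before and after.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical subset of blitter state that survives from one blit to the next. */
enum { REG_BLTCON0, REG_BLTCON1, REG_BLTAMOD, REG_BLTDMOD, REG_COUNT };

typedef struct {
    uint16_t reg[REG_COUNT];
} BlitSetup;

/* Total order over setups; any consistent order puts equal setups next to each other. */
static int compare_setups(const void *a, const void *b)
{
    const BlitSetup *x = a, *y = b;
    for (size_t i = 0; i < REG_COUNT; i++) {
        if (x->reg[i] != y->reg[i])
            return x->reg[i] < y->reg[i] ? -1 : 1;
    }
    return 0;
}

/* Count register writes if each blit only rewrites what differs from the
 * previous blit. The first blit is assumed to write every register. */
static size_t count_writes(const BlitSetup *blits, size_t n)
{
    size_t writes = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t r = 0; r < REG_COUNT; r++) {
            if (i == 0 || blits[i].reg[r] != blits[i - 1].reg[r])
                writes++;
        }
    }
    return writes;
}

/* Reorder the queue so that blits with the same setup are consecutive. */
static void group_blits(BlitSetup *blits, size_t n)
{
    assert(blits != NULL || n == 0);
    if (n > 1)
        qsort(blits, n, sizeof blits[0], compare_setups);
}
```

With two alternating setups that differ in two registers, four blits need 10 writes in the original order and 6 after grouping. Reordering is only safe when the blits do not overlap on screen, since it changes the drawing order.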

sandruzzo · 12 June 2016, 06:34

Interleaved bitplanes could be faster than the usual bitplane setup.
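One way to read this: with interleaved bitplanes a bob can be drawn with a single blit whose height is lines × planes, instead of one blit per plane, so the registers are set up and the blitter is waited on only once. A minimal sketch, assuming the OCS/ECS BLTSIZE layout (height in bits 15-6, width in words in bits 5-0, with 0 meaning the maximum); the helper name is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a BLTSIZE value. A height of 1024 and a width of 64 words both
 * encode as 0, which the masks below produce naturally. */
static uint16_t bltsize(unsigned height_lines, unsigned width_words)
{
    assert(height_lines >= 1 && height_lines <= 1024);
    assert(width_words >= 1 && width_words <= 64);
    return (uint16_t)(((height_lines & 0x3FFu) << 6) | (width_words & 0x3Fu));
}
```

For a 16-line, 5-plane bob that is 2 words wide, `bltsize(16 * 5, 2)` gives $1402 for one interleaved blit, versus five separate blits of `bltsize(16, 2)` = $0402. The usual catch is that a cookie-cut mask then has to be repeated for every plane, and the combined height has to fit in a single BLTSIZE.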

Photon · 17 July 2016, 18:09

Quote:

Originally Posted by zero

The CPU stops when the blitter has priority anyway...

Quote:

Originally Posted by britelite

No, it doesn't

It does for most things everyone does with the Blitter.

Exceptions are clear and polyfill.

If you need CPU cycles during a blit, you can disable BLTPRI or use the uncommon USEx channel masks 7, 5, or 3. This will make the blit finish later.

roondar · 18 July 2016, 10:48

Quote:

Originally Posted by ReadOnlyCat

This is definitely the fastest since the copper will react much faster than the CPU to a blitter interrupt and will be much faster at setting blitter registers.
Possible optimizations include grouping blits so that consecutive blits share very similar setups, so that the minimum number of registers needs to be re-set between them. This might require quite a bit of CPU time, though, if you need to do it dynamically and cannot predict the order statically.

I'm a bit confused about this.

Surely if you're going to blit with the Copper the CPU has to create/update the copperlist for every blit you do? Doesn't this mean you end up spending more time because the CPU would've done more or less the same writes anyway to set up the blitter even when the Copper isn't used?

I guess it depends a bit, if you're dead set on using blitter interrupts it'd probably be faster than the CPU, but if you just blit in sequence with the CPU you don't have the overhead of interrupts so the CPU should win* in that case, shouldn't it?

*) You'd only have Blitter wait overhead but that can be limited by dynamically switching the BLTPRI bit in DMACON as part of the blitter wait.

sandruzzo · 18 July 2016, 10:51

Quote:

Originally Posted by roondar

I'm a bit confused about this.

Surely if you're going to blit with the Copper the CPU has to create/update the copperlist for every blit you do? Doesn't this mean you end up spending more time because the CPU would've done more or less the same writes anyway to set up the blitter even when the Copper isn't used?

I guess it depends a bit, if you're dead set on using blitter interrupts it'd probably be faster than the CPU, but if you just blit in sequence with the CPU you don't have the overhead of interrupts so the CPU should win* in that case, shouldn't it?

*) You'd only have Blitter wait overhead but that can be limited by dynamically switching the BLTPRI bit in DMACON as part of the blitter wait.

Maybe, if you can precalculate some copper list, you can achieve maximum blitter speed

zero · 18 July 2016, 11:12

It helps to think about what the CPU has to do in order to set up the blitter. The CPU has to write several words to the blitter's registers. Normally it would fetch the data from RAM into Dx registers and then write it out again, but that could be optimized a bit with speed code (self-modifying code loading immediate values).

The copper is a bit more efficient because it's all basically speed code, and because it uses a reduced address space to write values encoded in the instructions directly to chipset registers.

In other words, the CPU has to do three memory accesses (load immediate instruction, move to absolute address instruction, write to blitter register) and the copper only has to do two (read instruction, write to blitter register).

Of course it's different when you have an 020 or some fast RAM and the CPU can pre-load the values it wants to write to the blitter. Even then the copper is probably faster because the CPU will have to poll the blitter to check when it has finished.

This brings us to the other major advantage of the copper. It can wait for the blitter to finish and there is no interrupt or polling overhead.

So the fastest way to blit is to create a custom copperlist. That can get very tricky if you are trying to use the copper for other stuff like palette changes and sprite hacks. I think this is why you rarely see both in demo effects that make heavy use of the blitter.

roondar · 18 July 2016, 11:15

Quote:

Originally Posted by sandruzzo

Maybe, if you can precalculate some copper list, you can achieve maximum blitter speed

But then you'd still need to update that precalculated list whenever you want more/less bobs, or different positions/animation frames for your bobs.

Maybe I'm just missing something, but it still feels to me using the copper to blit won't be faster. More convenient because you don't need to bother with interrupts etc, but I'm not so sure about it being faster.

sandruzzo · 18 July 2016, 11:18

Quote:

Originally Posted by roondar

But then you'd still need to update that precalculated list whenever you want more/less bobs, or different positions/animation frames for your bobs.

Maybe I'm just missing something, but it still feels to me using the copper to blit won't be faster. More convenient because you don't need to bother with interrupts etc, but I'm not so sure about it being faster.

For a general game it will be a challenge to do that. Maybe some steady parts, like scrolling updates, can be done by the blitter this way, or maybe demos and 3D stuff.

roondar · 18 July 2016, 11:21

Quote:

Originally Posted by zero

It helps to think about what the CPU has to do in order to set up the blitter. The CPU has to write several words to the blitter's registers. Normally it would fetch the data from RAM into Dx registers and then write it out again, but that could be optimized a bit with speed code (self-modifying code loading immediate values).

The copper is a bit more efficient because it's all basically speed code, and because it uses a reduced address space to write values encoded in the instructions directly to chipset registers.

In other words, the CPU has to do three memory accesses (load immediate instruction, move to absolute address instruction, write to blitter register) and the copper only has to do two (read instruction, write to blitter register).

Of course it's different when you have an 020 or some fast RAM and the CPU can pre-load the values it wants to write to the blitter. Even then the copper is probably faster because the CPU will have to poll the blitter to check when it has finished.

This brings us to the other major advantage of the copper. It can wait for the blitter to finish and there is no interrupt or polling overhead.

So the fastest way to blit is to create a custom copperlist. That can get very tricky if you are trying to use the copper for other stuff like palette changes and sprite hacks. I think this is why you rarely see both in demo effects that make heavy use of the blitter.

I'm not denying that the copper list will be executed faster than the CPU can set up the blitter. It will!

What I'm saying is that:

A) the polling overhead is, if done right, nearly free* and
B) my guess is that writing the copper list will take about as much time as it would've to just blit directly with the CPU (consider: you need to do the exact same work as setting up the blitter to write out the copperlist updates)

Now, I haven't actually created a copper based blitting system where the CPU creates/updates such a list so I could be very wrong indeed. Hence my confusion

*) The safe blitter wait needs an extra DMA access anyway. If you use that access to switch BLTPRI on as part of the blitter wait and then switch it off again after the wait loop, you're looking at something on the order of 20 cycles total overhead per blit (assuming 68000 / no fast memory).

sandruzzo · 18 July 2016, 11:37

Could we have some dynamic copper list using the copper's jump register, and update only that with the CPU?

zero · 18 July 2016, 11:59

Quote:

Originally Posted by roondar

B) my guess is that writing the copper list will take about as much time as it would've to just blit directly with the CPU (consider: you need to do the exact same work as setting up the blitter to write out the copperlist updates)

That's definitely not the case though. You can create a copperlist like this (pseudo-code, I don't have my Amiga hat on right now):

Code:

move $00, BLTCPTH
move $00, BLTCPTL
...
move $00, BLTCON0
move $00, BLTSIZE
wait for blitter finished bit
move $00, BLTCPTH
move $00, BLTCPTL
...
move $00, BLTCON0
move $00, BLTSIZE
wait for blitter finished bit
....

Now, all you need to do to set up your blits is write some words into this copperlist. You can replace one of the waits with the usual $FFFF,$FFFE to end the list if you don't need every "slot".

Consider that many of the values will not change from frame to frame. In a game you might have slots allocated to player and enemy bobs, so their source addresses, masks, functions and sizes stay the same. Only the destination address and bit rotation change.
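The slot idea can be sketched in C as a table of copper words that the CPU patches in place. This is a minimal illustration under assumptions, not zero's code: the slot layout, the `SLOT_*` offsets and the function names are hypothetical, modulos and first/last word masks are left out for brevity, and $0001,$0000 is assumed to be the encoding of a copper wait that only waits for the blitter. The register offsets are the standard custom chip offsets. On real hardware the copper also needs the danger bit in COPCON set before it is allowed to write blitter registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Custom chip register offsets (low pointer word is at high word + 2). */
enum {
    BLTCON0 = 0x040, BLTCON1 = 0x042,
    BLTCPTH = 0x048, BLTBPTH = 0x04C, BLTAPTH = 0x050, BLTDPTH = 0x054,
    BLTSIZE = 0x058
};

/* Hypothetical slot layout: word indices of the patchable data words. */
enum {
    SLOT_CON0 = 1, SLOT_CON1 = 3,
    SLOT_APT = 5, SLOT_BPT = 9, SLOT_CPT = 13, SLOT_DPT = 17, /* high word; low is +2 */
    SLOT_SIZE = 21,
    SLOT_WAIT = 22,  /* two-word wait (or end-of-list) instruction */
    SLOT_WORDS = 24
};

/* Fill one slot with copper MOVEs (register offset, data) and a blitter wait. */
static void init_slot(uint16_t *slot)
{
    static const uint16_t regs[] = {
        BLTCON0, BLTCON1,
        BLTAPTH, BLTAPTH + 2, BLTBPTH, BLTBPTH + 2,
        BLTCPTH, BLTCPTH + 2, BLTDPTH, BLTDPTH + 2,
        BLTSIZE /* written last: this write starts the blit */
    };
    for (size_t i = 0; i < sizeof regs / sizeof regs[0]; i++) {
        slot[2 * i] = regs[i];
        slot[2 * i + 1] = 0;
    }
    slot[SLOT_WAIT] = 0x0001;     /* assumed "wait for blitter finished" */
    slot[SLOT_WAIT + 1] = 0x0000;
}

/* Patch a 32-bit pointer into a high/low pair of MOVE data words. */
static void set_ptr(uint16_t *slot, int hi_index, uint32_t addr)
{
    slot[hi_index] = (uint16_t)(addr >> 16);
    slot[hi_index + 2] = (uint16_t)(addr & 0xFFFFu);
}

/* Per-frame update: only the destination (C and D) and the shift change. */
static void move_bob(uint16_t *slot, uint32_t dest, unsigned shift)
{
    assert(shift < 16);
    set_ptr(slot, SLOT_CPT, dest);
    set_ptr(slot, SLOT_DPT, dest);
    slot[SLOT_CON0] = (uint16_t)((slot[SLOT_CON0] & 0x0FFFu) | (shift << 12));
    slot[SLOT_CON1] = (uint16_t)((slot[SLOT_CON1] & 0x0FFFu) | (shift << 12));
}

/* End the copper list after this slot's blit instead of waiting for it. */
static void end_list_at(uint16_t *slot)
{
    slot[SLOT_WAIT] = 0xFFFF;
    slot[SLOT_WAIT + 1] = 0xFFFE;
}
```

Once a slot has been initialised and its sources, minterm and size filled in, moving the bob costs six word writes per frame (four pointer words plus the two shift fields), which is the saving being described here.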

roondar · 18 July 2016, 16:28

Quote:

Originally Posted by zero

That's definitely not the case though. You can create a copperlist like this (pseudo-code, I don't have my Amiga hat on right now):

Code:

move $00, BLTCPTH
move $00, BLTCPTL
...
move $00, BLTCON0
move $00, BLTSIZE
wait for blitter finished bit
move $00, BLTCPTH
move $00, BLTCPTL
...
move $00, BLTCON0
move $00, BLTSIZE
wait for blitter finished bit
....

Now, all you need to do to set up your blits is write some words into this copperlist. You can replace one of the waits with the usual $FFFF,$FFFE to end the list if you don't need every "slot".

Consider that many of the values will not change from frame to frame. In a game you might have slots allocated to player and enemy bobs, so their source addresses, masks, functions and sizes stay the same. Only the destination address and bit rotation change.

So, the gain is in not needing to update every value every frame.

That is indeed useful, though I suspect you'll need to update source/mask addresses relatively often for animation purposes - at which point the gain will be lower.

Nice idea though, learned something new

ReadOnlyCat · 18 July 2016, 21:42

Quote:

Originally Posted by roondar

So, the gain is in not needing to update every value every frame.

That is indeed useful, though I suspect you'll need to update source/mask addresses relatively often for animation purposes - at which point the gain will be lower.

Nice idea though, learned something new

It doesn't work with every type of game though. If one needs to create a Copper gradient while driving the Blitter then a pre-made Copper list does not work anymore. This is why Lotus 2 (or 3, I am not sure anymore) uses CPU interrupts to drive the Blitter: they needed the Copper to create the colored roadside strips.
