Subtract/Saturate with Blitter?

remz · 24 August 2022, 19:24

Hello again Amiga Coders!

I am quite new to the Amiga Blitter, and I have an interesting challenge/idea:
Is it possible to perform a subtraction using the blitter that would "clip at zero" to prevent underflow?
Let me explain what I would like to achieve; it might make things clearer:
The display is in HAM, 6bpp. The first four planes contain a value from 0 to 15.
I would like to subtract a constant value between 0 and 15. The result should "clip" at zero to prevent underflow.
Example: if the color was 13, subtracting 3 would result in 10.
If the color was 1, subtracting 3 would result in 0.
To make it more complicated, bitplanes 5 and 6 need to stay untouched. This might not be an issue if the calculations are done "per scanline" with one blit per line.

Thinking about it perhaps the cpu would be better suited for this task. Even perhaps my graphics should be stored in "chunky" mode to speed up reading the value, and then written back in bitplane mode. I'll have to think about it more.
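
For reference, the target operation and the planar layout that makes it awkward can be written down in a few lines of Python. This is only a sketch of the semantics, not Amiga code; the names `saturating_sub` and `to_planes` are made up for illustration, and the bit order inside a plane word is an arbitrary choice here.

```python
def saturating_sub(value, k):
    """Subtract k from a 4-bit value, clipping at zero instead of wrapping."""
    assert 0 <= value <= 15 and 0 <= k <= 15
    return max(0, value - k)

def to_planes(pixels):
    """Spread 4-bit pixels over four plane words: plane p holds bit p of every pixel."""
    planes = [0, 0, 0, 0]
    for x, value in enumerate(pixels):
        for p in range(4):
            planes[p] |= ((value >> p) & 1) << x
    return planes

print(saturating_sub(13, 3))   # 10
print(saturating_sub(1, 3))    # 0
print(to_planes([13, 1]))      # [3, 0, 1, 1]
```

The second function shows the difficulty: the four bits of one pixel live in four different words, so a per-pixel subtraction has to be rebuilt from bitwise operations on whole planes.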

robinsonb5 · 24 August 2022, 19:44

Yes, it's certainly possible - I did it in AMOSPro some years ago (using some trickery with poking bitplane pointers, to restrict which planes are affected by a blit.) - and I believe the AMCAF extension had similar facilities.

http://retroramblings.net/?p=25

Thomas Richter · 24 August 2022, 23:09

Quote:

Originally Posted by remz

Is it possible to perform a subtraction using the blitter that would "clip at zero" to prevent underflow?

Not really. Please understand that the operations the blitter can perform are limited to those offered by the minterms, i.e. bit combinations - bit by bit - of the source and the destination. There are 16 minterms in total how source and destination can be combined (ignoring masking), and subtraction is not one of them, since for that the blitter would also need to access multiple bitplanes, not just source and target.

With a lot of creativity, you can probably build something suitable, but it would need multiple blits. At this point, using the CPU would certainly be faster than triggering the blitter multiple times.

ross · 24 August 2022, 23:42

Quote:

Originally Posted by Thomas Richter

Please understand that the operations the blitter can perform are limited to those offered by the minterms, i.e. bit combinations - bit by bit - of the source and the destination. There are 16 minterms in total how source and destination can be combined..

hmm, no.
Minterms are 8 (2^3), and combinations are 256 (2^(2^3)).

If the source channels had been 4 then the Minterms would have been 16, but at that point a whole 16 bit word would have been needed to define the function (65536 combinations).

EDIT: of course many of these combinations don't make sense, or they only make sense with some channels with static data, or with channels deactivated (or even all deactivated), but that's another story

robinsonb5 · 25 August 2022, 01:09

Quote:

Originally Posted by Thomas Richter

At this point, using the CPU would certainly be faster than triggering the blitter multiple times.

Only if the source data is in chunky format and you're using a C2P.

A multi-pass blitter algorithm can modify the data in-place much faster than the CPU would be able to. (The demo in my blog post linked above manages around 5fps when fading a 320x256 HAM6 screen to black. There's no way the CPU's going to manipulate planar data at those sorts of speeds.)

I've used the same techniques to draw shaded bevels on a 16-colour greyscale screen, again much faster than the CPU could do it.

Thomas Richter · 25 August 2022, 01:33

Quote:

Originally Posted by ross

hmm, no.
Minterms are 8 (2^3), and combinations are 256 (2^(2^3)).

Look, as I said, you typically need a mask, so you can only use two of the 3 channels, the third is usually occupied by the mask such that destination=source for those pixels that are outside the mask. The number was not picked at random and not without reason. This all aside, there is no subtraction minterm, as the blitter operates on bit combinations.

Thomas Richter · 25 August 2022, 01:40

Quote:

Originally Posted by robinsonb5

A multi-pass blitter algorithm can modify the data in-place much faster than the CPU would be able to.

Cough. No, this depends on the CPU, and the operation. The blitter cannot do magic, it can only operate at the bandwidth of the chip RAM bus, and the CPU can also only access chip RAM at the same speed. However, depending on how many source channels you take, and depending on the blitter nasty flag, the blitter cannot occupy every possible cycle (i.e. every cycle not taken by DMA).

In particular, if you have complex operations like subtractions the blitter cannot perform (without going through multiple iterations at least), the CPU becomes faster.

Thus, while there is an advantage of the blitter for a poor old 68000, this advantage melts away quite quickly with faster CPUs. The blitter can still be of some advantage for faster CPUs if the CPU can do something else while the blitter is operating on the screen, but if manipulating chip mem is the only thing that needs to be done, then a faster CPU will outperform the blitter easily.

Quote:

Originally Posted by robinsonb5

I've used the same techniques to draw shaded bevels on a 16-colour greyscale screen, again much faster than the CPU could do it.

In this generality, certainly not. As said, a CPU on a turbo board can access chip memory every cycle, and therefore saturate the chip memory bus easily. The blitter will not be able to do so, *except* if you really need all channels, and set the blitter nasty flag.

roondar · 25 August 2022, 09:10

Quote:

Originally Posted by Thomas Richter

In this generality, certainly not. As said, a CPU on a turbo board can access chip memory every cycle, and therefore saturate the chip memory bus easily. The blitter will not be able to do so, *except* if you really need all channels, and set the blitter nasty flag.

This is not true on both counts. First, CPUs on the Amiga are limited to every-other-cycle access to Chip RAM*. This remains true for Turbo boards, which still access Chip RAM through Agnus/Alice (which is where the limit comes from). Second, on OCS and most ECS systems** the CPU will also be limited to 16 bit Chip RAM access, no matter the CPU type. Third, the Blitter can access every cycle in Chip RAM for several different channel combinations, which include most of the major block based GFX operations (copy, mask & cookie-cut). It will indeed yield some cycles to the CPU without Blitter nasty set, but that's easy enough to fix.

*) Note that CPUs can access any Chip RAM cycle, they just can't access two cycles back-to-back.
**) The exception is the A3000, which is still limited to every-other-cycle Chip RAM access but, like AGA, does allow 32 bit access to Chip RAM.

ross · 25 August 2022, 09:39

Quote:

Originally Posted by Thomas Richter

Look, as I said, you typically need a mask, so you can only use two of the 3 channels, the third is usually occupied by the mask..

If before I had the doubt now I have the certainty that you are confusing the Minterms with the possible results of the boolean operation applied to the number of inputs.

A Minterm is a Boolean function that takes the value 1 in correspondence with a single configuration of independent (Boolean) input variables.
In canonical form you can represent it as a sum of products, therefore with ANDs (but it can also use NORs).
This is what every single LF bit (from 0 to 7), in the BLTCON0 register, does: select one of the products for the ABC input channels in their possible combinations of 0/1 status.
This matrix of possible combinations, given by the values of the inputs and by the Minterms, allows you to build any logical operation (which I usually prefer to solve for simplicity with a Karnaugh map; there are several solvers online too).

When you talk about "two of the 3 [input] channels" you are talking about 4 possible Minterms (and therefore 16 possible combinations).
Mask or not mask these are not 16 Minterms.
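
To make the LF byte concrete: the eight LF bits form a truth table indexed by the three channel bits, with A as the most significant index bit, so 8 minterms give 2^8 = 256 possible LF values. A small Python sketch (the helper name `lf_byte` is hypothetical, purely for illustration) that derives the LF value from a boolean function:

```python
def lf_byte(func):
    """Build the 8-bit LF value for D = func(a, b, c).

    LF bit number (a << 2) | (b << 1) | c is set when func is true for
    that input combination, so each LF bit enables exactly one minterm.
    """
    lf = 0
    for index in range(8):
        a, b, c = (index >> 2) & 1, (index >> 1) & 1, index & 1
        if func(a, b, c):
            lf |= 1 << index
    return lf

print(hex(lf_byte(lambda a, b, c: a)))                        # 0xf0, D = A (plain copy)
print(hex(lf_byte(lambda a, b, c: (a & b) | ((1 - a) & c))))  # 0xca, cookie-cut
print(hex(lf_byte(lambda a, b, c: a ^ b ^ c)))                # 0x96, three-way XOR
```

The same helper can be used to look up the LF value for any function found with a Karnaugh map.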

hooverphonique · 25 August 2022, 10:05

You could take a look at some source code that does shade bobs (if you can find it). It usually performs the addition using multiple blitter passes. Doing it in a single pass isn't possible.

Found a thread on shade bobs here: http://eab.abime.net/showthread.php?t=71954

ross · 25 August 2022, 10:43

Quote:

Originally Posted by Thomas Richter

As said, a CPU on a turbo board can access chip memory every cycle, and therefore saturate the chip memory bus easily. The blitter will not be able to do so, *except* if you really need all channels, and set the blitter nasty flag.

Ah, I missed this, but I see that roondar has already answered

Two other statements that are not true.

First.
Agnus is the arbitrator of the 'internal' memory and therefore any request for access must reach it, it is not possible to access 'directly' (otherwise the whole unified memory system used by the CPU and DMA channels would go to hell!). This means that if in the first cycle I send the address in the second I will have the result, you can't do better. Of course you can do it in any cycle 'type' (odd or even).
A simple movem at 7MHz of the 68000 'saturates' the bus regarding CPU accesses, you can also try with a 50MHz machine but the internal bus usage would be the same.

Second.
The blitter is pipelined and you do not "really need all channels" to be active to saturate the bus.
Even a simple AD copy (two active channels) can use all memory cycles (that the CPU can never do) after a very short initial phase of start-up.

Of course all this has nothing to do with the topic of whether the blitter can 'directly' do 'complex' operations such as additions or subtractions.
But it doesn't seem to me that anyone has claimed that it can

(but it can 'trick' and do it with multipass)

EDIT: obviously with an accelerator and fast ram nobody prevents you from using the chipmemory as a simple framebuffer (and therefore doing much faster than the blitter and doing much more complex operations, or using the blitter only for the secondary and simple things) but it seems to me that in this topic we are talking about something else..

malko · 25 August 2022, 13:04

^ Thanks for those explanations ross. They are pieces of the puzzle I can insert in my global understanding of the system

chb · 25 August 2022, 14:06

It is certainly possible to do saturated subtraction with the blitter (using multiple passes as ross pointed out), and for 4-bit values in planar format it will most likely be considerably faster than using the CPU, at least on an unexpanded Amiga.

Untested, more or less from the top of my head, so near certainly riddled with errors:

Truth table for binary subtraction (A: operand 1/bitplane data, B: operand 2/constant that is being subtracted, Ci/Co: carry/borrow in/out, D: difference). This can be directly expressed as a minterm for each D and Co.

Code:

A   B   Ci  |   D   Co
0   0   0   |   0   0
0   0   1   |   1   1
0   1   0   |   1   1
0   1   1   |   0   1
1   0   0   |   1   0
1   0   1   |   0   0
1   1   0   |   0   0
1   1   1   |   1   1

To obtain D and Co, you need one blit each; for the first blit Ci is 0, for the next it's the borrow from the last blit.

Saturation: D' = D AND (NOT Cn) on all result planes from before: if the last carry/borrow Cn is set, indicating underflow, set all planes to zero.

Performance: If you subtract a constant (can be kept in blitter register, no DMA needed), this gives (for 4-bit values) 2x AD-blit (least significant bit without earlier borrow) + 10x ABD-blit (bits 1,2,3 + 4x saturation), in total 34 memory accesses for 4 4-bit values or 68 clock cycles (without other DMA), which equals 17 cycles per 4-bit value. I doubt you can beat that with the CPU on the 68000, esp. if you need to convert to planar (or even planar-chunky-planar).
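
The scheme above can be simulated on plain integers to check the logic before writing any blitter code. The following Python sketch is an assumption-laden model, not a blitter implementation: `subtract_saturate`, `to_planes` and `from_planes` are hypothetical helper names, each plane is one 16-bit word, and every statement in the loop is a pure bitwise function of at most three words, which is the kind of operation one blitter pass can do.

```python
WORD = 0xFFFF  # one 16-bit word = 16 pixels of one bitplane

def to_planes(pixels):
    """Plane p gets bit p of every pixel (pixel x lands in bit x)."""
    return [sum(((v >> p) & 1) << x for x, v in enumerate(pixels)) for p in range(4)]

def from_planes(planes, count):
    """Inverse of to_planes for the first `count` pixels."""
    return [sum(((planes[p] >> x) & 1) << p for p in range(4)) for x in range(count)]

def subtract_saturate(planes, k):
    """Subtract constant k (0..15) from 16 planar 4-bit pixels, clipping at 0."""
    borrow = 0
    result = []
    for p in range(4):
        a = planes[p]
        b = WORD if (k >> p) & 1 else 0       # constant bit, replicated over the word
        diff = a ^ b ^ borrow                 # D  = A xor B xor Ci
        borrow = ((~a & b) | (~a & borrow) | (b & borrow)) & WORD  # Co
        result.append(diff)
    # Saturation: a set final borrow marks pixels that underflowed.
    return [d & ~borrow & WORD for d in result]

pixels = list(range(16))
print(from_planes(subtract_saturate(to_planes(pixels), 3), 16))
# [0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
```

Because the constant is known in advance, the borrow expression collapses to something simpler for each plane, which is where the reduced blit count comes from.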

paraj · 25 August 2022, 17:39

CPU is probably not faster, but for this case you don't need to do any c<->p conversions. Just operate on the pixels in parallel like the blitter. Pseudo-code:

Code:

    # pixel = saturate(pixel + n), if n is negative use 2's complement representation
    mask0 = $ffffffff if (n&1) else 0
    ...
    [carry, bpl0] = half_add(bpl0, mask0)
    [carry, bpl1] = full_add(bpl1, mask1, carry)
    [carry, bpl2] = full_add(bpl2, mask2, carry)
    [carry, bpl3] = full_add(bpl3, mask3, carry)
    if n < 0:
        for bpl in bpl0...3: bpl &= carry # this works because the overflow bit (bit4) = bpl4 ^ mask4 ^ Cin = 0 ^ 1 ^ Cin = NOT Cin, B (mask)=1 comes from n being negative
    else:
        for bpl in bpl0...3: bpl |= carry

Now since mask0..3 is known (if it isn't just create 16 functions) the "full_add" can be optimized:

Code:

    Res   = A ^ B ^ Cin
    Cout  = (A & B) | (Cin & (A ^ B))

    B (mask) = 0:
    Res  = A ^ 0 ^ Cin = A ^ Cin
    Cout = (A & 0) | (Cin & (A ^ 0)) = A & Cin

    B = 1:
    Res  = A ^ 1 ^ Cin = NOT(A ^ Cin)
    Cout = (A & 1) | (Cin & (A ^ 1)) = A | Cin

EDIT: Of course the half_add part can also be optimized, and maybe the results can be propagated
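
As a sanity check of the pseudo-code, here is a runnable Python sketch of the same idea. It is only a model under assumptions: `add_saturate` is a made-up name, each plane is a 32-bit integer, and the half_add is treated as a full_add with carry-in 0. Python's `>>` on negative integers is arithmetic, so `(n >> p) & 1` yields the two's complement bits of n directly.

```python
LONG = 0xFFFFFFFF  # 32 pixels of one bitplane per longword

def add_saturate(planes, n):
    """pixel = saturate(pixel + n) for planar 4-bit pixels, with -15 <= n <= 15."""
    carry = 0
    out = []
    for p in range(4):
        a = planes[p]
        if (n >> p) & 1:                 # mask bit B = 1
            res = ~(a ^ carry) & LONG    # Res  = NOT(A ^ Cin)
            carry = a | carry            # Cout = A | Cin
        else:                            # mask bit B = 0
            res = a ^ carry              # Res  = A ^ Cin
            carry = a & carry            # Cout = A & Cin
        out.append(res)
    if n < 0:
        return [bpl & carry for bpl in out]   # no carry out means underflow: clip to 0
    return [bpl | carry for bpl in out]       # carry out means overflow: clip to 15

# One pixel per call, replicated over the whole longword.
for v in (13, 1):
    planes = [LONG if (v >> p) & 1 else 0 for p in range(4)]
    out = add_saturate(planes, -3)
    print(v, "->", sum((out[p] & 1) << p for p in range(4)))
# 13 -> 10
# 1 -> 0
```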

Thomas Richter · 25 August 2022, 18:19

Quote:

Originally Posted by ross

If before I had the doubt now I have the certainty that you are confusing the Minterms with the possible results of the boolean operation applied to the number of inputs.

No, and I urge you to read my post carefully again. Please do not tell me the obvious. Once again, if you need a mask, and you typically do if you have arbitrary rectangles to blit (or operate on), you are down to 16 minterms of the total 256 available because the remaining 240 will not take care of the mask correctly. You can also see that in a different way: If you have one source, and one destination, you have 2^(2^2) = 16 possible combinations you need to care about. The third source the blitter offers is occupied already and has to have a particular function for masking to work, so you are constrained to 4 minterm bits = 16 minterms, not 256.
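
The counting behind this argument can be checked by brute force. The sketch below assumes channel A carries the mask, B the source and C the destination (the usual cookie-cut assignment, but an assumption here), and counts the LF values that leave the destination unchanged wherever the mask bit is 0. The name `output_bit` is hypothetical.

```python
def output_bit(lf, a, b, c):
    """Blitter logic result for one bit position, given the 8-bit LF value."""
    return (lf >> ((a << 2) | (b << 1) | c)) & 1

# Keep only LF values where mask A = 0 forces D = C (destination untouched).
mask_safe = [
    lf for lf in range(256)
    if all(output_bit(lf, 0, b, c) == c for b in (0, 1) for c in (0, 1))
]

print(len(mask_safe))                  # 16
print([hex(lf) for lf in mask_safe])   # 0xa, 0x1a, ... 0xfa (low nibble always 0xA)
```

So both numbers show up: 256 LF values exist in total, and 16 of them respect a mask in this sense, the well-known 0xCA cookie-cut among them.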

Thomas Richter · 25 August 2022, 18:25

Quote:

Originally Posted by chb

It is certainly possible to do saturated subtraction with the blitter (using multiple passes as ross pointed out), and for 4-bit values in planar format it will most likely be considerably faster than using the CPU, at least on an unexpanded Amiga.

Only on an unaccelerated Amiga. If you have complex operations like subtraction, you are probably much better off extracting the four bitplanes with the CPU using a p2c, subtracting, and then writing back with a c2p. A 68000 at 7MHz cannot do that, but as soon as you get faster, the dominating (and limiting) factor is not the execution speed of the CPU, but the bottleneck of Chip RAM bandwidth. You can arrange your CPU loop such that the write to chip RAM is completely hidden in the push-buffer of the CPU such that it overlaps with other CPU processing tasks, and only the read will block your processing loop.

On the other hand an operation like subtractions needs the blitter to go over the data multiple times (more often than the number of bitplanes you have), and that limits the effective bandwidth to the chip mem bandwidth divided by the number of processing loops.

Even for simple blits, a CPU blit is faster than a native blit, provided you have a CPU that is fast enough.

remz · 25 August 2022, 18:41

Fascinating read! So much good information.
In my use case I wish the game to work on plain 68000 chipram Amiga: so I want to have the worst-case running at maximum speed.
I will study the various approaches mentioned here.
The Linear feedback register won't work because in HAM, the non-indexed colors are always 0 to 15, they cannot be reorganized like for regular indexed blitter shadeBobs.
Thank you all for your inputs, ideas, and examples!

paraj · 25 August 2022, 19:14

Good luck with your HAM experiments

Did a very quick CPU implementation and it comes out at 246(42/8) (according to https://68kcounter.grahambates.com/) for 32 pixels with -3 + saturate. Measured code:

Code:

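    move.l (a0),d0          ; d0-d3 = bits 0-3 of 32 pixels
    move.l 4(a0),d1
    move.l 8(a0),d2
    move.l 12(a0),d3
    move.l d0,d4            ; bit 0, adding 1: carry out = A
    not.l d0                ;   result = NOT A
    move.l d4,d5            ; bit 1, adding 0: d5 = carry in
    and.l d1,d4             ;   carry out = A AND Cin
    eor.l d5,d1             ;   result = A XOR Cin
    move.l d4,d5            ; bit 2, adding 1: d5 = carry in
    or.l d2,d4              ;   carry out = A OR Cin
    eor.l d5,d2
    not.l d2                ;   result = NOT(A XOR Cin)
    move.l d4,d5            ; bit 3, adding 1: same pattern as bit 2
    or.l d3,d4
    eor.l d5,d3
    not.l d3
    and.l d4,d0             ; no final carry = underflow: clear the pixel
    and.l d4,d1
    and.l d4,d2
    and.l d4,d3
    move.l d0,(a0)          ; write the four planes back
    move.l d1,4(a0)
    move.l d2,8(a0)
    move.l d3,12(a0)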

From attached macro/test stuff (probably has errors, since it was done quickly)
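
One cheap way to check a listing like this without an Amiga at hand is to transliterate it into Python and test all 16 pixel values. The sketch below mirrors the register operations one by one; the function name `sub3_saturate` and the 32-bit masking constant `M` are assumptions for the emulation, not part of the measured code.

```python
M = 0xFFFFFFFF  # emulate 32-bit data registers

def sub3_saturate(p0, p1, p2, p3):
    """Register-level mirror of the listing: four plane longwords in, four out."""
    d0, d1, d2, d3 = p0, p1, p2, p3
    d4 = d0              # move.l d0,d4   carry out of bit 0
    d0 = ~d0 & M         # not.l  d0
    d5 = d4              # move.l d4,d5
    d4 &= d1             # and.l  d1,d4   carry out of bit 1
    d1 ^= d5             # eor.l  d5,d1
    d5 = d4              # move.l d4,d5
    d4 |= d2             # or.l   d2,d4   carry out of bit 2
    d2 = ~(d2 ^ d5) & M  # eor.l  d5,d2 / not.l d2
    d5 = d4              # move.l d4,d5
    d4 |= d3             # or.l   d3,d4   final carry
    d3 = ~(d3 ^ d5) & M  # eor.l  d5,d3 / not.l d3
    return d0 & d4, d1 & d4, d2 & d4, d3 & d4

for v in range(16):
    planes = [M if (v >> p) & 1 else 0 for p in range(4)]
    out = sub3_saturate(*planes)
    result = sum((out[p] & 1) << p for p in range(4))
    assert result == max(0, v - 3)
print("all 16 pixel values match max(0, v - 3)")
```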

ross · 25 August 2022, 19:25

Quote:

Originally Posted by Thomas Richter

No, and I urge you to read my post carefully again. Please do not tell me the obvious. Once again, if you need a mask, and you typically do if you have arbitrary rectangles to blit (or operate on), you are down to 16 minterms of the total 256 available because the remaining 240 will not take care of the mask correctly. You can also see that in a different way: If you have one source, and one destination, you have 2^(2^2) = 16 possible combinations you need to care about. The third source the blitter offers is occupied already and has to have a particular function for masking to work, so you are constrained to 4 minterm bits = 16 minterms, not 256.

Bah.., deleted, I do not feel like controversy.

We could probably argue for hours and not find points of agreement (I think you are too specific and limiting, and you use *in my opinion* incorrect terms).
Too different points of view

Message to the OP: experiment! The Blitter is a nice piece of hardware

Especially if you want to use it on bare machines, on which it gives its best.

remz · 26 August 2022, 00:03

Quote:

Originally Posted by ross

Message to the OP: experiment! The Blitter is a nice piece of hardware

Especially if you want to use it on bare machines, on which it gives its best.

Yes, absolutely! It is great to use the blitter in non-hog mode so the cpu can interleave instructions, running in parallel like setting up the next blit for a fraction of the cost!

I will experiment and try to implement my "semi realtime shading" and post back the results.
