English Amiga Board


Old 24 August 2022, 19:24   #1
remz
Registered User
 
Join Date: May 2022
Location: Canada
Posts: 138
Subtract/Saturate with Blitter?

Hello again Amiga Coders!

I am quite a newbie when it comes to using the Amiga Blitter, and I have an interesting challenge/idea:
Is it possible to perform a subtraction using the blitter that would "clip at zero" to prevent underflow?
Let me explain what I would like to achieve, it might make it clearer:
The display is in HAM, 6bpp. The first four planes contain a value from 0 to 15.
I would like to subtract a constant value between 0 and 15. The goal is for the result to "clip" at zero to prevent underflow.
Example: if the color was 13, subtracting 3 would result in 10.
If the color was 1, subtracting 3 would result in 0.
To make it more complicated, bitplanes 5 and 6 need to stay untouched. This might not be an issue if the calculations are done "per scanline" with one blit per line.
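
For reference, the requested operation on a single 4-bit value is just a subtraction clamped at zero. A minimal Python sketch (the name `sat_sub` is made up here for illustration) that can serve as ground truth when checking any blitter or CPU implementation:

```python
def sat_sub(value: int, k: int) -> int:
    """Subtract k from a 4-bit value, clipping at zero instead of wrapping."""
    assert 0 <= value <= 15 and 0 <= k <= 15
    return max(value - k, 0)

print(sat_sub(13, 3))  # 10
print(sat_sub(1, 3))   # 0
```

The two HAM control planes (5 and 6) never enter this calculation, which is why they can simply be left out of whatever operation touches planes 1 to 4.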

Thinking about it perhaps the cpu would be better suited for this task. Even perhaps my graphics should be stored in "chunky" mode to speed up reading the value, and then written back in bitplane mode. I'll have to think about it more.
Old 24 August 2022, 19:44   #2
robinsonb5
Registered User
 
Join Date: Mar 2012
Location: Norfolk, UK
Posts: 1,153
Yes, it's certainly possible - I did it in AMOSPro some years ago (using some trickery with poking bitplane pointers, to restrict which planes are affected by a blit.) - and I believe the AMCAF extension had similar facilities.

http://retroramblings.net/?p=25
Old 24 August 2022, 23:09   #3
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 3,215
Quote:
Originally Posted by remz View Post
Is it possible to perform a subtraction using the blitter that would "clip at zero" to prevent underflow?
Not really. Please understand that the operations the blitter can perform are limited to those offered by the minterms, i.e. bit combinations - bit by bit - of the source and the destination. There are 16 minterms in total by which source and destination can be combined (ignoring masking), and subtraction is none of them - for that, the blitter would also need to access multiple bitplanes, not just source and target.

With a lot of creativity, you can probably build up something suitable, but it would need multiple blits. At this point, using the CPU would certainly be faster than triggering the blitter multiple times.
Old 24 August 2022, 23:42   #4
ross
Defendit numerus
 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
Quote:
Originally Posted by Thomas Richter View Post
Please understand that the operations the blitter can perform are limited to those offered by the minterms, i.e. bit combinations - bit by bit - of the source and the destination. There are 16 minterms in total by which source and destination can be combined..
hmm, no.
Minterms are 8 (2^3), and combinations are 256 (2^(2^3)).

If the source channels had been 4 then the Minterms would have been 16, but at that point a whole 16 bit word would have been needed to define the function (65536 combinations).


EDIT: of course many of these combinations don't make sense, or they only make sense with some channels with static data, or with channels deactivated (or even all deactivated), but that's another story
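
The counting argument can be checked with a few lines of Python. This is only a worked-arithmetic sketch; the function names are made up for illustration:

```python
def count_minterms(channels: int) -> int:
    """One minterm per combination of 0/1 values on the input channels."""
    return 2 ** channels

def count_functions(channels: int) -> int:
    """Each minterm is either included or not, so 2^(number of minterms) functions."""
    return 2 ** count_minterms(channels)

for n in (2, 3, 4):
    print(n, count_minterms(n), count_functions(n))
# 2 4 16
# 3 8 256
# 4 16 65536
```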

Last edited by ross; 24 August 2022 at 23:57.
Old 25 August 2022, 01:09   #5
robinsonb5
Registered User
 
Join Date: Mar 2012
Location: Norfolk, UK
Posts: 1,153
Quote:
Originally Posted by Thomas Richter View Post
At this point, using the CPU would certainly be faster than triggering the blitter multiple times.
Only if the source data is in chunky format and you're using a C2P.

A multi-pass blitter algorithm can modify the data in-place much faster than the CPU would be able to. (The demo in my blog post linked above manages around 5fps when fading a 320x256 HAM6 screen to black. There's no way the CPU's going to manipulate planar data at those sorts of speeds.)

I've used the same techniques to draw shaded bevels on a 16-colour greyscale screen, again much faster than the CPU could do it.
Old 25 August 2022, 01:33   #6
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 3,215
Quote:
Originally Posted by ross View Post
hmm, no.
Minterms are 8 (2^3), and combinations are 256 (2^(2^3)).
Look, as I said, you typically need a mask, so you can only use two of the 3 channels, the third is usually occupied by the mask such that destination=source for those pixels that are outside the mask. The number was not picked at random and not without reason. This all aside, there is no subtraction minterm, as the blitter operates on bit combinations.

Last edited by Thomas Richter; 25 August 2022 at 01:41.
Old 25 August 2022, 01:40   #7
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 3,215
Quote:
Originally Posted by robinsonb5 View Post
A multi-pass blitter algorithm can modify the data in-place much faster than the CPU would be able to.
Cough. No, this depends on the CPU, and the operation. The blitter cannot do magic, it can only operate at the bandwidth of the chip RAM bus, and the CPU can also only access chip RAM at the same speed. However, depending on how many source channels you take, and depending on the blitter nasty flag, the blitter cannot occupy every possible cycle (i.e. every cycle not taken by DMA).


In particular, if you have complex operations like subtractions the blitter cannot perform (without going through multiple iterations at least), the CPU becomes faster.


Thus, while there is an advantage of the blitter for a poor old 68000, this advantage melts away quite quickly with faster CPUs. The blitter can still be of some advantage for faster CPUs if the CPU can do something else while the blitter is operating on the screen, but if manipulating chip mem is the only thing that needs to be done, then a faster CPU will outperform the blitter easily.


Quote:
Originally Posted by robinsonb5 View Post

I've used the same techniques to draw shaded bevels on a 16-colour greyscale screen, again much faster than the CPU could do it.
In this generality, certainly not. As said, a CPU on a turbo board can access chip memory every cycle, and therefore saturate the chip memory bus easily. The blitter will not be able to do so, *except* if you really need all channels, and set the blitter nasty flag.
Old 25 August 2022, 09:10   #8
roondar
Registered User
 
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,410
Quote:
Originally Posted by Thomas Richter View Post
In this generality, certainly not. As said, a CPU on a turbo board can access chip memory every cycle, and therefore saturate the chip memory bus easily. The blitter will not be able to do so, *except* if you really need all channels, and set the blitter nasty flag.
This is not true on either count. First, CPUs on the Amiga are limited to every-other-cycle access to Chip RAM*. This remains true for Turbo boards, which still access Chip RAM through Agnus/Alice (which is where the limit comes from). Second, on OCS and most ECS systems** the CPU will also be limited to 16 bit Chip RAM access, no matter the CPU type. Third, the Blitter can access every cycle in Chip RAM for several different channel combinations, which include most of the major block based GFX operations (copy, mask & cookie-cut). It will indeed yield some cycles to the CPU without Blitter nasty set, but that's easy enough to fix.

*) Note that CPUs can access any Chip RAM cycle, they just can't access two cycles back-to-back.
**) The exception is the A3000, which is still limited to every-other-cycle Chip RAM access but, like AGA, does allow 32 bit access to Chip RAM.

Last edited by roondar; 25 August 2022 at 10:05. Reason: Made it more complete
Old 25 August 2022, 09:39   #9
ross
Defendit numerus
 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
Quote:
Originally Posted by Thomas Richter View Post
Look, as I said, you typically need a mask, so you can only use two of the 3 channels, the third is usually occupied by the mask..
If before I had the doubt now I have the certainty that you are confusing the Minterms with the possible results of the boolean operation applied to the number of inputs.

A Minterm is a Boolean function that takes the value 1 in correspondence with a single configuration of independent (Boolean) input variables.
In canonical form you can represent it as a sum of products, therefore with ANDs (but it can also use NORs).
This is what every single LF bit (from 0 to 7), in the BLTCON0 register, does: select one of the products for the ABC input channels in their possible combinations of 0/1 status.
This matrix of possible combinations, given by the values of the inputs and by the Minterms, allows you to build any logical operation (which I usually prefer to solve for simplicity with a Karnaugh map, there are several solvers online too).

When you talk about "two of the 3 [input] channels" you are talking about 4 possible Minterms (and therefore 16 possible combinations).
Mask or not mask these are not 16 Minterms.
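
To make the role of the LF bits concrete, here is a small Python sketch that builds the 8-bit LF value for a given boolean function of the A, B and C channels. The helper name `lf_byte` is made up; the bit layout assumed is the usual one where LF bit number (A*4 + B*2 + C) is set when the function is true for that input combination:

```python
def lf_byte(func) -> int:
    """Build the 8-bit LF value: bit (A*4 + B*2 + C) is set when func(A, B, C) is true."""
    lf = 0
    for index in range(8):
        a, b, c = (index >> 2) & 1, (index >> 1) & 1, index & 1
        if func(a, b, c):
            lf |= 1 << index
    return lf

# D = A (plain copy from channel A)
copy_a = lf_byte(lambda a, b, c: a)
# D = (A and B) or (not A and C): the classic cookie-cut with A as the mask
cookie_cut = lf_byte(lambda a, b, c: (a & b) | ((1 - a) & c))

print(hex(copy_a), hex(cookie_cut))  # 0xf0 0xca
```

Each of the 8 bits corresponds to one minterm; the byte as a whole is one of the 256 possible functions.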
Old 25 August 2022, 10:05   #10
hooverphonique
ex. demoscener "Bigmama"
 
Join Date: Jun 2012
Location: Fyn / Denmark
Posts: 1,624
You could take a look at some source code that does shade bobs (if you can find it). It usually performs the addition using multiple blitter passes. Doing it in a single pass isn't possible.

Found a thread on shade bobs here: http://eab.abime.net/showthread.php?t=71954
Old 25 August 2022, 10:43   #11
ross
Defendit numerus
 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
Quote:
Originally Posted by Thomas Richter View Post
As said, a CPU on a turbo board can access chip memory every cycle, and therefore saturate the chip memory bus easily. The blitter will not be able to do so, *except* if you really need all channels, and set the blitter nasty flag.
Ah, I missed this, but I see that roondar has already answered.
Two other statements that are not true.

First.
Agnus is the arbiter of the 'internal' memory and therefore any request for access must go through it; it is not possible to access memory 'directly' (otherwise the whole unified memory system used by the CPU and DMA channels would go to hell!). This means that if I send the address in the first cycle, I will have the result in the second; you can't do better. Of course you can do it in any cycle 'type' (odd or even).
A simple movem on a 7MHz 68000 'saturates' the bus as far as CPU accesses are concerned; you can try with a 50MHz machine too, but the internal bus usage would be the same.

Second.
The blitter is pipelined and you do not "really need all channels" to be active to saturate the bus.
Even a simple AD copy (two active channels) can use all memory cycles (which the CPU can never do) after a very short initial start-up phase.

Of course all this has nothing to do with the topic of whether the blitter can 'directly' do 'complex' operations such as additions or subtractions.
But it doesn't seem to me that anyone has claimed that it can (though it can 'cheat' and do it with multiple passes)


EDIT: obviously with an accelerator and fast ram nobody prevents you from using the chipmemory as a simple framebuffer (and therefore doing much faster than the blitter and doing much more complex operations, or using the blitter only for the secondary and simple things) but it seems to me that in this topic we are talking about something else..

Last edited by ross; 25 August 2022 at 10:53.
Old 25 August 2022, 13:04   #12
malko
Ex nihilo nihil
 
Join Date: Oct 2017
Location: CH
Posts: 4,856
^ Thanks for those explanations ross. They are pieces of the puzzle I can insert into my overall understanding of the system

Last edited by malko; 25 August 2022 at 13:20. Reason: typo
Old 25 August 2022, 14:06   #13
chb
Registered User
 
Join Date: Dec 2014
Location: germany
Posts: 439
It is certainly possible to do saturated subtraction with the blitter (using multiple passes as ross pointed out), and for 4-bit values in planar format it will most likely be considerably faster than using the CPU, at least on an unexpanded Amiga.

Untested, more or less off the top of my head, so almost certainly riddled with errors:

Truth table for binary subtraction (A: operand 1/bitplane data, B: operand 2/constant that is being subtracted, Ci/Co: borrow in/out, D: difference). This can be directly expressed as a minterm value for each of D and Co.

Code:
A   B   Ci  |   D   Co
0   0   0   |   0   0
0   0   1   |   1   1
0   1   0   |   1   1
0   1   1   |   0   1
1   0   0   |   1   0
1   0   1   |   0   0
1   1   0   |   0   0
1   1   1   |   1   1
To obtain D and Co, you need one blit each; for the first blit Ci is 0, for the next it's the borrow from the last blit.

Saturation: D' = D AND (NOT Cn) on all result planes from before: if the last carry/borrow Cn is set, indicating underflow, set all planes to zero.

Performance: If you subtract a constant (can be kept in a blitter register, no DMA needed), this gives (for 4-bit values) 2x AD-blit (least significant bit without earlier borrow) + 10x ABD-blit (bits 1,2,3 + 4x saturation), in sum 34 memory accesses per word position, i.e. for 16 4-bit values (one word in each of the four planes), or 68 clock cycles (without other DMA), which equals about 4.25 cycles per 4-bit value. I doubt you can beat that with the CPU on the 68000, esp. if you need to convert to planar (or even planar-chunky-planar).
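
The multi-pass scheme can be modelled in Python to check the logic before writing any blitter code. This is a sketch under stated assumptions, not tested on hardware: `blit` only models the blitter's logic unit on one 16-bit word, the LF values are derived from the standard full-subtractor equations (D = A xor B xor Ci, Co = (not A and B) or (not A and Ci) or (B and Ci)), and all names are made up for illustration. Pixel x is stored in bit x of each word purely for convenience.

```python
WORD = 0xFFFF  # one 16-bit blitter word

def blit(lf, a, b, c):
    """Apply an LF byte bitwise to three 16-bit source words (logic unit only)."""
    d = 0
    for bit in range(16):
        index = (((a >> bit) & 1) << 2) | (((b >> bit) & 1) << 1) | ((c >> bit) & 1)
        d |= ((lf >> index) & 1) << bit
    return d

LF_DIFF = 0x96      # D  = A xor B xor Ci
LF_BORROW = 0x8E    # Co = (not A and B) or (not A and Ci) or (B and Ci)
LF_SATURATE = 0x50  # D' = A and not C  (A = result plane, C = final borrow)

def sat_sub_planes(planes, k):
    """planes: four 16-bit words, plane 0 holding bit 0 of 16 pixels. Returns new planes."""
    borrow = 0
    out = []
    for i, plane in enumerate(planes):
        const = WORD if (k >> i) & 1 else 0  # bit i of the constant, replicated
        out.append(blit(LF_DIFF, plane, const, borrow))   # one pass for D
        borrow = blit(LF_BORROW, plane, const, borrow)    # one pass for Co
    return [blit(LF_SATURATE, p, 0, borrow) for p in out]  # four saturation passes

def to_planes(pixels):
    return [sum(((p >> i) & 1) << x for x, p in enumerate(pixels)) for i in range(4)]

def from_planes(planes, count=16):
    return [sum(((planes[i] >> x) & 1) << i for i in range(4)) for x in range(count)]

pixels = list(range(16))
result = from_planes(sat_sub_planes(to_planes(pixels), 3))
print(result)
```

The model performs 8 difference/borrow passes plus 4 saturation passes, i.e. the same 12 blits counted above; on real hardware the two passes for bit 0 need no borrow input and can be the cheaper AD blits.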
Old 25 August 2022, 17:39   #14
paraj
Registered User
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
CPU is probably not faster, but for this case you don't need to do any c<->p conversions. Just operate on the pixels in parallel like the blitter. Pseudo-code:
Code:
    # pixel = saturate(pixel + n), if n is negative use 2's complement representation
    mask0 = $ffffffff if (n&1) else 0
    ...
    [carry, bpl0] = half_add(bpl0, mask0)
    [carry, bpl1] = full_add(bpl1, mask1, carry)
    [carry, bpl2] = full_add(bpl2, mask2, carry)
    [carry, bpl3] = full_add(bpl3, mask3, carry)
    if n < 0:
        for bpl in bpl0...3: bpl &= carry # this works because the overflow bit (bit4) = bpl4 ^ mask4 ^ Cin = 0 ^ 1 ^ Cin = NOT Cin, B (mask)=1 comes from n being negative
    else:
        for bpl in bpl0...3: bpl |= carry
Now since mask0..3 is known (if it isn't just create 16 functions) the "full_add" can be optimized:
Code:
    Res   = A ^ B ^ Cin
    Cout  = (A & B) | (Cin & (A ^ B))

    B (mask) = 0:
    Res  = A ^ 0 ^ Cin = A ^ Cin
    Cout = (A & 0) | (Cin & (A ^ 0)) = A & Cin

    B = 1:
    Res  = A ^ 1 ^ Cin = NOT(A ^ Cin)
    Cout = (A & 1) | (Cin & (A ^ 1)) = A | Cin
EDIT: Of course the half_add part can also be optimized, and maybe the results can be propagated
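
The pseudo-code above can be turned into a small runnable Python model to verify the saturation logic for every constant. This is a sketch, not the attached addsat.c: the names `add_saturate`, `to_planes` and `from_planes` are made up, planes are modelled as 32-bit integers, and pixel x is stored in bit x for convenience.

```python
MASK32 = 0xFFFFFFFF

def add_saturate(planes, n):
    """planes: four 32-bit words (plane 0 = bit 0 of 32 pixels); n in -15..15."""
    # Bit i of the 4-bit two's complement constant, replicated across the word.
    masks = [MASK32 if ((n & 0xF) >> i) & 1 else 0 for i in range(4)]
    carry = 0  # starting with carry 0 makes the first step a half add
    out = []
    for a, b in zip(planes, masks):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    if n < 0:
        return [p & carry for p in out]  # no carry out means underflow: clear to 0
    return [p | carry for p in out]      # carry out means overflow: set to 15

def to_planes(pixels):
    return [sum(((p >> i) & 1) << x for x, p in enumerate(pixels)) for i in range(4)]

def from_planes(planes, count=32):
    return [sum(((planes[i] >> x) & 1) << i for i in range(4)) for x in range(count)]

pixels = list(range(16)) * 2
print(from_planes(add_saturate(to_planes(pixels), -3)))
```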

Last edited by paraj; 25 August 2022 at 18:10.
Old 25 August 2022, 18:19   #15
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 3,215
Quote:
Originally Posted by ross View Post
If before I had the doubt now I have the certainty that you are confusing the Minterms with the possible results of the boolean operation applied to the number of inputs.
No, and I urge you to read my post carefully again. Please do not tell me the obvious. Once again, if you need a mask, and you typically do if you have arbitrary rectangles to blit (or operate on), you are down to 16 minterms of the total 256 available because the remaining 240 will not take care of the mask correctly. You can also see that in a different way: If you have one source, and one destination, you have 2^(2^2) = 16 possible combinations you need to care about. The third source the blitter offers is occupied already and has to have a particular function for masking to work, so you are constrained to 4 minterm bits = 16 minterms, not 256.
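
The "16 out of 256" count for masked operations can be illustrated in Python. This is only a sketch of the counting argument and does not settle which terminology is correct; it assumes channel A carries the mask, B the source and C the destination, and the name `masked_lf` is made up:

```python
def masked_lf(table4: int) -> int:
    """LF byte for: D = f(B, C) where the mask A is 1, D = C (unchanged) where A is 0.

    table4 is the 4-bit truth table of f, indexed by B*2 + C.
    """
    lf = 0
    for index in range(8):
        a, b, c = (index >> 2) & 1, (index >> 1) & 1, index & 1
        result = (table4 >> (b * 2 + c)) & 1 if a else c
        lf |= result << index
    return lf

masked = sorted({masked_lf(t) for t in range(16)})
print(len(masked))               # 16
print([hex(v) for v in masked])
```

Under these assumptions every such LF value has the low nibble fixed at $A (destination preserved outside the mask), and the cookie-cut value $CA is the member with f(B, C) = B.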
Old 25 August 2022, 18:25   #16
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 3,215
Quote:
Originally Posted by chb View Post
It is certainly possible to do saturated subtraction with the blitter (using multiple passes as ross pointed out), and for 4-bit values in planar format it will most likely be considerably faster than using the CPU, at least on an unexpanded Amiga.
Only on an unaccelerated Amiga. If you have complex operations like subtraction, you are probably much better off extracting the four bitplanes with the CPU using a p2c, subtracting, and then writing back with a c2p. A 68000 at 7 MHz cannot do that, but as soon as you get faster, the dominating (and limiting) factor is not the execution speed of the CPU, but the bottleneck of Chip RAM bandwidth. You can arrange your CPU loop such that the write to chip RAM is completely hidden in the push-buffer of the CPU so that it overlaps with other CPU processing tasks, and only the read will block your processing loop.

On the other hand, an operation like subtraction needs the blitter to go over the data multiple times (more often than the number of bitplanes you have), and that limits the effective bandwidth to the chip mem bandwidth divided by the number of processing passes.


Even for simple blits, a CPU blit is faster than a native blit, provided you have a CPU that is fast enough.
Old 25 August 2022, 18:41   #17
remz
Registered User
 
Join Date: May 2022
Location: Canada
Posts: 138
Fascinating read! So much good information.
In my use case I want the game to work on a plain 68000 Amiga with chip RAM only, so I want the worst case to run as fast as possible.
I will study the various approaches mentioned here.
The linear feedback register approach won't work because in HAM, the non-indexed colors are always 0 to 15; they cannot be reorganized as they can for regular indexed blitter shade bobs.
Thank you all for your inputs, ideas, and examples!
Old 25 August 2022, 19:14   #18
paraj
Registered User
 
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
Good luck with your HAM experiments

Did a very quick CPU implementation and it comes out at 246(42/8) (according to https://68kcounter.grahambates.com/) for 32 pixels with -3 + saturate. Measured code:
Code:
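    move.l (a0),d0          ; d0 = plane 0 (32 pixels)
    move.l (a0,$0004),d1    ; d1 = plane 1
    move.l (a0,$0008),d2    ; d2 = plane 2
    move.l (a0,$000c),d3    ; d3 = plane 3
    move.l d0,d4            ; bit 0, mask=1: carry = A
    not.l d0                ;   result = NOT A
    move.l d4,d5            ; keep carry-in for the eor
    and.l d1,d4             ; bit 1, mask=0: carry = A & Cin
    eor.l d5,d1             ;   result = A ^ Cin
    move.l d4,d5
    or.l d2,d4              ; bit 2, mask=1: carry = A | Cin
    eor.l d5,d2
    not.l d2                ;   result = NOT(A ^ Cin)
    move.l d4,d5
    or.l d3,d4              ; bit 3, mask=1: carry = A | Cin
    eor.l d5,d3
    not.l d3                ;   result = NOT(A ^ Cin)
    and.l d4,d0             ; saturate: no final carry means underflow,
    and.l d4,d1             ;   so clear all four planes for those pixels
    and.l d4,d2
    and.l d4,d3
    move.l d0,(a0)          ; write the four planes back
    move.l d1,(a0,$0004)
    move.l d2,(a0,$0008)
    move.l d3,(a0,$000c)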
From attached macro/test stuff (probably has errors, since it was done quickly)
Attached Files
File Type: c addsat.c (1.5 KB, 19 views)
File Type: s sup.S (1.1 KB, 23 views)
Old 25 August 2022, 19:25   #19
ross
Defendit numerus
 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
Quote:
Originally Posted by Thomas Richter View Post
No, and I urge you to read my post carefully again. Please do not tell me the obvious. Once again, if you need a mask, and you typically do if you have arbitrary rectangles to blit (or operate on), you are down to 16 minterms of the total 256 available because the remaining 240 will not take care of the mask correctly. You can also see that in a different way: If you have one source, and one destination, you have 2^(2^2) = 16 possible combinations you need to care about. The third source the blitter offers is occupied already and has to have a particular function for masking to work, so you are constrained to 4 minterm bits = 16 minterms, not 256.
Bah.., deleted, I do not feel like controversy.

We could probably argue for hours and not find points of agreement (I think you are too specific and limiting, and you use *in my opinion* incorrect terms).
Too different points of view

Message to the OP: experiment! The Blitter is a nice piece of hardware
Especially if you want to use it on bare machines, on which it gives its best.

Last edited by ross; 25 August 2022 at 21:19.
Old 26 August 2022, 00:03   #20
remz
Registered User
 
Join Date: May 2022
Location: Canada
Posts: 138
Quote:
Originally Posted by ross View Post
Message to the OP: experiment! The Blitter is a nice piece of hardware
Especially if you want to use it on bare machines, on which it gives its best.
Yes, absolutely! It is great to use the blitter in non-hog mode so the cpu can interleave instructions, running in parallel like setting up the next blit for a fraction of the cost!

I will experiment and try to implement my "semi realtime shading" and post back the results.