24 August 2022, 19:24 | #1 |
Registered User
Join Date: May 2022
Location: Canada
Posts: 138
|
Subtract/Saturate with Blitter?
Hello again Amiga Coders!
I am quite a newbie when it comes to using the Amiga Blitter, and I have an interesting challenge/idea: is it possible to perform a subtraction using the blitter that would "clip at zero" to prevent underflow? Let me explain what I would like to achieve; it might make it clearer. The display is in HAM, 6bpp. The first four planes contain a value from 0 to 15. I would like to subtract a constant value between 0 and 15. The goal is for the result to "clip" at zero to prevent underflow. Example: if the color was 13, subtracting 3 would result in 10. If the color was 1, subtracting 3 would result in 0. To make it more complicated, bitplanes 5 and 6 need to stay untouched. This might not be an issue if the calculations are done "per scanline" with one blit per line. Thinking about it, perhaps the CPU would be better suited for this task. Perhaps my graphics should even be stored in "chunky" mode to speed up reading the value, and then written back in bitplane mode. I'll have to think about it more. |
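For reference, the per-pixel behaviour being asked for can be written down as a small chunky-mode model. This is only a sketch of the requirement, not of any blitter technique; the function name `saturating_sub` and the treatment of a pixel as a 6-bit integer with the HAM control bits in bits 4 and 5 are assumptions made for illustration.

```python
def saturating_sub(pixel: int, amount: int) -> int:
    """Subtract `amount` from the low 4 bits of a 6-bit HAM pixel, clamping at 0.

    Bits 4 and 5 (bitplanes 5 and 6) are passed through unchanged.
    """
    control = pixel & 0b110000   # planes 5 and 6 stay untouched
    value = pixel & 0b001111     # planes 1 to 4 hold the 0..15 value
    return control | max(0, value - amount)


print(saturating_sub(13, 3))  # 10
print(saturating_sub(1, 3))   # 0
```

Any planar implementation, blitter or CPU, can be checked against this model pixel by pixel.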
24 August 2022, 19:44 | #2 |
Registered User
Join Date: Mar 2012
Location: Norfolk, UK
Posts: 1,153
|
Yes, it's certainly possible - I did it in AMOSPro some years ago (using some trickery with poking bitplane pointers, to restrict which planes are affected by a blit.) - and I believe the AMCAF extension had similar facilities.
http://retroramblings.net/?p=25 |
24 August 2022, 23:09 | #3 | |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,215
|
Quote:
With a lot of creativity, you can probably build up something suitable, but it would need multiple blits. At this point, using the CPU would certainly be faster than triggering the blitter multiple times. |
|
24 August 2022, 23:42 | #4 | |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
|
Quote:
Minterms are 8 (2^3), and combinations are 256 (2^(2^3)). If the source channels had been 4 then the Minterms would have been 16, but at that point a whole 16 bit word would have been needed to define the function (65536 combinations). EDIT: of course many of these combinations don't make sense, or they only make sense with some channels with static data, or with channels deactivated (or even all deactivated), but that's another story. Last edited by ross; 24 August 2022 at 23:57. |
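The counts quoted above can be checked by brute force. A minimal sketch, with hypothetical helper names `count_minterms` and `count_functions`:

```python
from itertools import product


def count_minterms(channels: int) -> int:
    """One minterm per distinct 0/1 combination of the input channels."""
    return len(list(product((0, 1), repeat=channels)))


def count_functions(channels: int) -> int:
    """Each minterm can be switched on or off independently."""
    return 2 ** count_minterms(channels)


for channels in (3, 4):
    print(channels, count_minterms(channels), count_functions(channels))
# 3 8 256
# 4 16 65536
```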
|
25 August 2022, 01:09 | #5 | |
Registered User
Join Date: Mar 2012
Location: Norfolk, UK
Posts: 1,153
|
Quote:
A multi-pass blitter algorithm can modify the data in-place much faster than the CPU would be able to. (The demo in my blog post linked above manages around 5fps when fading a 320x256 HAM6 screen to black. There's no way the CPU's going to manipulate planar data at those sorts of speeds.) I've used the same techniques to draw shaded bevels on a 16-colour greyscale screen, again much faster than the CPU could do it. |
|
25 August 2022, 01:33 | #6 |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,215
|
Look, as I said, you typically need a mask, so you can only use two of the 3 channels, the third is usually occupied by the mask such that destination=source for those pixels that are outside the mask. The number was not picked at random and not without reason. This all aside, there is no subtraction minterm as the blitter operates on bit combinations.
Last edited by Thomas Richter; 25 August 2022 at 01:41. |
25 August 2022, 01:40 | #7 | |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,215
|
Quote:
In particular, if you have complex operations like subtractions the blitter cannot perform (without going through multiple iterations at least), the CPU becomes faster. Thus, while there is an advantage of the blitter for a poor old 68000, this advantage melts away quite quickly with faster CPUs. The blitter can still be of some advantage for faster CPUs if the CPU can do something else while the blitter is operating on the screen, but if manipulating chip mem is the only thing that needs to be done, then a faster CPU will outperform the blitter easily. In this generality, certainly not. As said, a CPU on a turbo board can access chip memory every cycle, and therefore saturate the chip memory bus easily. The blitter will not be able to do so, *except* if you really need all channels, and set the blitter nasty flag. |
|
25 August 2022, 09:10 | #8 | |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,409
|
Quote:
*) Note that CPUs can access any Chip RAM cycle, they just can't access two cycles back-to-back. **) The exception is the A3000, still limited to every-other-cycle Chip RAM access, but like AGA it does allow 32 bit access to Chip RAM. Last edited by roondar; 25 August 2022 at 10:05. Reason: Made it more complete |
|
25 August 2022, 09:39 | #9 | |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
|
Quote:
A Minterm is a Boolean function that takes the value 1 for exactly one configuration of the independent (Boolean) input variables. In canonical form you can represent a function as a sum of such products, therefore with ANDs (but it can also use NORs). This is what every single LF bit (from 0 to 7) in the BLTCON0 register does: it selects one of the products for the ABC input channels in their possible combinations of 0/1 status. This matrix of possible combinations, given by the values of the inputs and by the Minterms, allows you to build any logical operation (which, for simplicity, I usually prefer to solve with a Karnaugh map; there are several solvers online too). When you talk about "two of the 3 [input] channels" you are talking about 4 possible Minterms (and therefore 16 possible combinations). Mask or no mask, these are not 16 Minterms. |
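As an alternative to a Karnaugh map, the LF byte can also be built mechanically by evaluating the wanted function for all eight input combinations. A minimal sketch, assuming the usual bit order in which LF bit number `(A<<2)|(B<<1)|C` corresponds to that input combination; the helper name `lf_byte` is made up for this example.

```python
def lf_byte(func) -> int:
    """Return the 8-bit LF value for a Boolean function of channels A, B, C.

    Bit number (A << 2) | (B << 1) | C is set when func(A, B, C) is 1.
    """
    value = 0
    for index in range(8):
        a, b, c = (index >> 2) & 1, (index >> 1) & 1, index & 1
        if func(a, b, c) & 1:
            value |= 1 << index
    return value


print(hex(lf_byte(lambda a, b, c: a)))                         # 0xf0  D = A
print(hex(lf_byte(lambda a, b, c: a ^ c)))                     # 0x5a  D = A xor C
print(hex(lf_byte(lambda a, b, c: (a & b) | ((1 - a) & c))))   # 0xca  D = AB + (not A)C
```

The three printed values are the familiar copy, XOR and cookie-cut settings, which is a quick sanity check on the bit order.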
|
25 August 2022, 10:05 | #10 |
ex. demoscener "Bigmama"
Join Date: Jun 2012
Location: Fyn / Denmark
Posts: 1,624
|
You could take a look at some source code that does shade bobs (if you can find it). It usually performs the addition using multiple blitter passes. Doing it in a single pass isn't possible.
Found a thread on shade bobs here: http://eab.abime.net/showthread.php?t=71954 |
25 August 2022, 10:43 | #11 | |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
|
Quote:
Two other statements that are not true. First: Agnus is the arbiter of the 'internal' memory, and therefore any request for access must go through it; it is not possible to access memory 'directly' (otherwise the whole unified memory system used by the CPU and DMA channels would go to hell!). This means that if I send the address in the first cycle, I will have the result in the second; you can't do better. Of course you can do it in any cycle 'type' (odd or even). A simple movem on a 7MHz 68000 'saturates' the bus as far as CPU accesses are concerned; you can also try with a 50MHz machine, but the internal bus usage would be the same. Second: the blitter is pipelined and you do not "really need all channels" to be active to saturate the bus. Even a simple AD copy (two active channels) can use all memory cycles (which the CPU can never do) after a very short initial start-up phase. Of course all this has nothing to do with the topic of whether the blitter can 'directly' do 'complex' operations such as additions or subtractions. But it doesn't seem to me that anyone has claimed that it can (though it can 'cheat' and do it with multiple passes). EDIT: obviously, with an accelerator and fast RAM nobody prevents you from using chip memory as a simple framebuffer (and therefore being much faster than the blitter and doing much more complex operations, or using the blitter only for secondary and simple things), but it seems to me that in this topic we are talking about something else. Last edited by ross; 25 August 2022 at 10:53. |
|
25 August 2022, 13:04 | #12 |
Ex nihilo nihil
Join Date: Oct 2017
Location: CH
Posts: 4,856
|
^ Thanks for those explanations, ross. They are pieces of the puzzle that I can fit into my overall understanding of the system.
Last edited by malko; 25 August 2022 at 13:20. Reason: typo |
25 August 2022, 14:06 | #13 |
Registered User
Join Date: Dec 2014
Location: germany
Posts: 439
|
It is certainly possible to do saturated subtraction with the blitter (using multiple passes as ross pointed out), and for 4-bit values in planar format it will most likely be considerably faster than using the CPU, at least on an unexpanded Amiga.
Untested, more or less off the top of my head, so almost certainly riddled with errors: Truth table for binary subtraction (A: operand 1/bitplane data, B: operand 2/constant that is being subtracted, Ci/Co: carry/borrow in/out, D: difference). This can be directly expressed as minterms for each of D and Co. Code:
A B Ci | D Co
0 0 0  | 0 0
0 0 1  | 1 1
0 1 0  | 1 1
0 1 1  | 0 1
1 0 0  | 1 0
1 1 0  | 0 0
1 0 1  | 0 0
1 1 1  | 1 1
Saturation: D' = D AND (NOT Cn) on all result planes from before: if the last carry/borrow Cn is set, indicating underflow, set all planes to zero.
Performance: If you subtract a constant (can be kept in a blitter register, no DMA needed), this gives (for 4-bit values) 2x AD-blit (least significant bit without earlier borrow) + 10x ABD-blit (bits 1,2,3 + 4x saturation), in sum 34 memory accesses for 4 4-bit values or 68 clock cycles (without other DMA), which equals 17 cycles per 4-bit value. I doubt you can beat that with the CPU on the 68000, esp. if you need to convert to planar (or even planar-chunky-planar). |
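The multi-pass scheme described above can be modelled in software before committing to blitter code. The sketch below works on one 16-bit word per bitplane and uses the standard full-subtractor equations D = A xor B xor Ci and Co = (not A and B) or (not (A xor B) and Ci), followed by the saturation step D' = D and not Cn. The helper names (`to_planes`, `from_planes`, `sub_saturate`) and the pixel-to-bit ordering are assumptions for illustration; each statement inside the loop stands in for one blitter pass.

```python
import random

WORD = 0xFFFF  # one 16-bit word of a bitplane holds one bit of 16 pixels


def to_planes(pixels):
    """Pack sixteen 4-bit pixels into four plane words (plane 0 = bit 0)."""
    planes = [0, 0, 0, 0]
    for pos, value in enumerate(pixels):
        for bit in range(4):
            planes[bit] |= ((value >> bit) & 1) << (15 - pos)
    return planes


def from_planes(planes):
    """Unpack four plane words back into sixteen 4-bit pixels."""
    return [sum(((planes[bit] >> (15 - pos)) & 1) << bit for bit in range(4))
            for pos in range(16)]


def sub_saturate(planes, n):
    """Subtract the constant n (0..15) from every pixel, clamping at zero."""
    borrow = 0
    diff = []
    for bit, a in enumerate(planes):
        b = WORD if (n >> bit) & 1 else 0                  # constant bit, all pixels
        diff.append((a ^ b ^ borrow) & WORD)               # D  = A xor B xor Ci
        borrow = ((~a & b) | (~(a ^ b) & borrow)) & WORD   # Co (borrow out)
    return [d & ~borrow & WORD for d in diff]              # D' = D and not Cn


random.seed(1)
pixels = [random.randrange(16) for _ in range(16)]
for n in range(16):
    result = from_planes(sub_saturate(to_planes(pixels), n))
    assert result == [max(0, p - n) for p in pixels]

print(from_planes(sub_saturate(to_planes([13, 1] + [0] * 14), 3))[:2])  # [10, 0]
```

Because B is a constant, each D and Co expression reduces to a two-input function of the plane data and the previous borrow plane, which is what makes the AD/ABD blit count in the post plausible.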
25 August 2022, 17:39 | #14 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
|
CPU is probably not faster, but for this case you don't need to do any c<->p conversions. Just operate on the pixels in parallel like the blitter. Pseudo-code:
Code:
# pixel = saturate(pixel + n), if n is negative use 2's complement representation
mask0 = $ffffffff if (n&1) else 0
...
[carry, bpl0] = half_add(bpl0, mask0)
[carry, bpl1] = full_add(bpl1, mask1, carry)
[carry, bpl2] = full_add(bpl2, mask2, carry)
[carry, bpl3] = full_add(bpl3, mask3, carry)
if n < 0:
    for bpl in bpl0...3:
        bpl &= carry # this works because the overflow bit (bit4) = bpl4 ^ mask4 ^ Cin = 0 ^ 1 ^ Cin = NOT Cin, B (mask)=1 comes from n being negative
else:
    for bpl in bpl0...3:
        bpl |= carry
Code:
Res = A ^ B ^ Cin
Cout = (A & B) | (Cin & (A ^ B))
B (mask) = 0:
    Res = A ^ 0 ^ Cin = A ^ Cin
    Cout = (A & 0) | (Cin & (A ^ 0)) = A & Cin
B = 1:
    Res = A ^ 1 ^ Cin = NOT(A ^ Cin)
    Cout = (A & 1) | (Cin & (A ^ 1)) = A | Cin
Last edited by paraj; 25 August 2022 at 18:10. |
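The pseudo-code above translates almost line for line into runnable Python, which makes it easy to test the idea on a desktop machine first. This is a sketch under assumptions: the helper names `pack32`, `unpack32` and `add_saturate` are invented here, each plane is modelled as one 32-bit integer, and a plain full adder is used for plane 0 with a zero carry-in instead of a separate half adder.

```python
import random

LONG = 0xFFFFFFFF  # one longword per plane = one bit of 32 pixels


def full_add(a, b, cin):
    """Bitwise full adder applied to 32 pixels at once."""
    res = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return cout, res


def add_saturate(bpl, n):
    """pixel = saturate(pixel + n) for n in -15..15; bpl holds four longwords."""
    masks = [LONG if ((n & 0xF) >> k) & 1 else 0 for k in range(4)]  # 4-bit 2's complement
    carry = 0
    out = []
    for plane, mask in zip(bpl, masks):
        carry, res = full_add(plane, mask, carry)
        out.append(res)
    if n < 0:
        return [p & carry for p in out]   # no carry out = underflow, clamp to 0
    return [p | carry for p in out]       # carry out = overflow, clamp to 15


def pack32(pixels):
    return [sum(((p >> k) & 1) << (31 - i) for i, p in enumerate(pixels))
            for k in range(4)]


def unpack32(bpl):
    return [sum(((bpl[k] >> (31 - i)) & 1) << k for k in range(4))
            for i in range(32)]


random.seed(2)
pixels = [random.randrange(16) for _ in range(32)]
for n in range(-15, 16):
    expected = [min(15, max(0, p + n)) for p in pixels]
    assert unpack32(add_saturate(pack32(pixels), n)) == expected
```

For a fixed n the masks are constants, so each `full_add` collapses to the simplified B = 0 or B = 1 forms listed in the post.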
25 August 2022, 18:19 | #15 |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,215
|
No, and I urge you to read my post carefully again. Please do not tell me the obvious. Once again, if you need a mask, and you typically do if you have arbitrary rectangles to blit (or operate on), you are down to 16 minterms of the total 256 available because the remaining 240 will not take care of the mask correctly. You can also see that in a different way: If you have one source, and one destination, you have 2^(2^2) = 16 possible combinations you need to care about. The third source the blitter offers is occupied already and has to have a particular function for masking to work, so you are constrained to 4 minterm bits = 16 minterms, not 256.
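One way to make the "16 out of 256" argument concrete is to enumerate every LF value of the form "inside the mask apply f(source, destination), outside the mask keep the destination". The sketch assumes channel A carries the mask, B the source and C the destination (a common cookie-cut arrangement, but an assumption here), and reuses a hypothetical `lf_byte` helper with LF bit number `(A<<2)|(B<<1)|C`.

```python
def lf_byte(func) -> int:
    """8-bit LF value; bit (A << 2) | (B << 1) | C is set when func(A, B, C) is 1."""
    value = 0
    for index in range(8):
        a, b, c = (index >> 2) & 1, (index >> 1) & 1, index & 1
        if func(a, b, c) & 1:
            value |= 1 << index
    return value


def masked_lf_values():
    """All LF bytes for D = f(B, C) where A = 1, and D = C where A = 0."""
    values = []
    for table in range(16):  # every truth table of a two-input function f(B, C)
        def f(b, c, table=table):
            return (table >> ((b << 1) | c)) & 1
        values.append(lf_byte(lambda a, b, c, f=f: f(b, c) if a else c))
    return values


values = masked_lf_values()
print(len(set(values)))            # 16
print([hex(v) for v in values])    # 0xa, 0x1a, 0x2a, ... 0xfa
```

All sixteen values share the low nibble $A (destination passes through where the mask is 0), and the cookie-cut value $CA is one of them.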
|
25 August 2022, 18:25 | #16 | |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,215
|
Quote:
On the other hand, an operation like subtraction needs the blitter to go over the data multiple times (more often than the number of bitplanes you have), and that limits the effective bandwidth to the chip mem bandwidth divided by the number of processing loops. Even for simple blits, a CPU blit is faster than a native blit, provided you have a CPU that is fast enough. |
|
25 August 2022, 18:41 | #17 |
Registered User
Join Date: May 2022
Location: Canada
Posts: 138
|
Fascinating read! So much good information.
In my use case I want the game to work on a plain 68000 Amiga with only chip RAM, so I want the worst case to run at maximum speed. I will study the various approaches mentioned here. The linear feedback register won't work because in HAM the non-indexed colors are always 0 to 15; they cannot be reorganized as for regular indexed blitter shade bobs. Thank you all for your input, ideas, and examples! |
25 August 2022, 19:14 | #18 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,099
|
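Good luck with your HAM experiments.
Did a very quick CPU implementation and it comes out at 246(42/8) cycles (according to https://68kcounter.grahambates.com/) for 32 pixels with -3 + saturate. Measured code: Code:
    move.l  (a0),d0          ; d0 = bpl0
    move.l  (a0,$0004),d1    ; d1 = bpl1
    move.l  (a0,$0008),d2    ; d2 = bpl2
    move.l  (a0,$000c),d3    ; d3 = bpl3
    move.l  d0,d4            ; bit 0, mask=1: carry = bpl0
    not.l   d0               ;   result = NOT bpl0
    move.l  d4,d5            ; keep carry-in
    and.l   d1,d4            ; bit 1, mask=0: carry = bpl1 AND carry
    eor.l   d5,d1            ;   result = bpl1 XOR carry-in
    move.l  d4,d5
    or.l    d2,d4            ; bit 2, mask=1: carry = bpl2 OR carry
    eor.l   d5,d2
    not.l   d2               ;   result = NOT(bpl2 XOR carry-in)
    move.l  d4,d5
    or.l    d3,d4            ; bit 3, mask=1: carry = bpl3 OR carry
    eor.l   d5,d3
    not.l   d3               ;   result = NOT(bpl3 XOR carry-in)
    and.l   d4,d0            ; saturate: no final carry means underflow,
    and.l   d4,d1            ;   so clear those pixels in every plane
    and.l   d4,d2
    and.l   d4,d3
    move.l  d0,(a0)          ; write the four planes back
    move.l  d1,(a0,$0004)
    move.l  d2,(a0,$0008)
    move.l  d3,(a0,$000c)
|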
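To put the 246-cycle figure into perspective, here is a rough back-of-the-envelope estimate. It is only a sketch under stated assumptions: a 320x256 screen (the size mentioned earlier in the thread), a PAL 68000 clock of about 7.09 MHz, no loop or addressing overhead, and no cycles lost to display or other DMA. The real figure will therefore be worse than what this prints.

```python
CYCLES_PER_BLOCK = 246       # from the post: one 32-pixel block, four planes
PIXELS_PER_BLOCK = 32
WIDTH, HEIGHT = 320, 256     # assumed screen size
CPU_HZ = 7_093_790           # assumed PAL 68000 clock

cycles_per_pixel = CYCLES_PER_BLOCK / PIXELS_PER_BLOCK
blocks = WIDTH * HEIGHT // PIXELS_PER_BLOCK
frame_cycles = blocks * CYCLES_PER_BLOCK
best_case_fps = CPU_HZ / frame_cycles

print(cycles_per_pixel)          # 7.6875
print(frame_cycles)              # 629760
print(round(best_case_fps, 1))   # 11.3
```

This is an upper bound on the update rate for the measured instruction sequence alone, not a benchmark.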
25 August 2022, 19:25 | #19 | |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
|
Quote:
We could probably argue for hours and not find points of agreement (I think you are too specific and limiting, and you use, *in my opinion*, incorrect terms). Too different points of view. Message to the OP: experiment! The Blitter is a nice piece of hardware, especially if you want to use it on bare machines, on which it gives its best. Last edited by ross; 25 August 2022 at 21:19. |
|
26 August 2022, 00:03 | #20 | |
Registered User
Join Date: May 2022
Location: Canada
Posts: 138
|
Quote:
I will experiment and try to implement my "semi realtime shading" and post back the results. |
|