Optimizing polygonfill bitcopy
After I have drawn its edges and XOR-filled the polygon on a 1-bit plane, I copy the data to the actual planes. Since I wanted to support dithering (like PenA and PenB in the OS API), I built a two-by-eight support table for masking the planes when copying:
Code:
t = 2; Code:
while (--h >= 0)
Any suggestions on how to make this bitcopy more efficient? (The algorithm, I mean. When that is done, I will rewrite it in assembly.) |
Do you actually need to clear the destination inside the loop (*dstptr16 &= ..) ?
The blitter is perfect for this sort of copy with logical op applied, btw ;-) |
Of course I do. There is no telling what the corresponding bits on that plane will be. We are doing a dithering fill here. I could only spare the zeroing and then the writing if I copied to a single plane, because that would necessarily hold only one type of bit (depending on whether I filled the polygon on the temporary plane with zeros or ones), and thus I could just OR or AND them with the destination plane. But on multiple planes, this depends on the colour index. Example with two planes and dithering colours number 2 and 3:
Code:
PLANE 0: |
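The clear-then-write step argued for above can be sketched in C for a single longword. This is only an illustration, not the routine from the thread: the function name `merge_long` and its three parameters are made up here, and it assumes that a 1 bit in the temporary plane marks a pixel inside the polygon.

```c
#include <stdint.h>

/* Hypothetical helper (not from the thread): merge one longword of a plane.
 * mask    - longword from the 1-bit temporary plane, 1 = inside the polygon
 * pattern - dither pattern for this plane and this row
 * Inside the polygon the destination takes the pattern bit (which may be
 * 0 or 1), outside the polygon the destination is left untouched. */
static uint32_t merge_long(uint32_t dst, uint32_t mask, uint32_t pattern)
{
    return (dst & ~mask) | (pattern & mask);
}
```

For example, `merge_long(0xFFFF0000, 0x00FFFF00, 0xAAAAAAAA)` gives `0xFFAAAA00`: the covered bits that had to become 0 were cleared, which a plain OR could not do.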
I rewrote it in ASM, but it only got slower. Any ideas?
Code:
section code,code |
Going to bed, so this is only a few obvious things, will probably look at it tomorrow.
Code:
_PolygonBitmapToPlanes32: |
Unfortunately the code crashes. The polygon has been drawn, but dots appear on the screen in a matrix and then, GURU.
Edit: And I think I know why. When I used register a5, my code crashed too. That is why I skipped it and use a6 instead. Preserving it on the stack and pulling it back at the end did not help at all: if it was used, it crashed. Edit #2: Nope, I've found it: you pushed registers to the stack with movem.l d5-d7/a3-a6,-(a7), but you pulled them back with movem.l (a7)+,d5-d7/a3/a4/a6, skipping a5. Now it works. And while it's significantly faster than my original code, it is still slower than the C code. http://oscomp.hu/depot/PolygonBitmapToPlanes32.png I'm gonna check your modifications, to understand this. |
Quote:
|
It's possible, but it seems that the problem was that a/b pushed a5 onto the stack, but did not pull it back. I have no idea what the problem was when I messed around with a5. Either I messed up something about push/pull too, or you're right and an external interrupt messed up my registers. |
Quote:
Weird that a/b code is slower than C... |
Yeah, as I said I glanced over it and only did the obvious/simple optimizations. And apparently too tired to even do that properly ;P.
|
Quote:
|
Quote:
Regarding the stack, only the move.b instructions are lined up to word. |
Quote:
F.e. A5=$00010000
Move.w a5,-(sp) ; $0000
Move.w (sp)+,a5 ; $00000000
A5 is not restored correctly, if I remember right. |
Don't understand this part:
Code:
asl.l #2,d2 ; Modulo <<= 2;
Basically I need an extra free reg (almost all code between c_p and c_w can be pushed one level higher, but I need an extra reg to avoid using the stack) and I'm trying to figure out the relation between width/rowsize/modulo. What do you send in d1/d2/d3 and how is the data organized in memory? |
A word move to an address register is sign-extended to a longword and destroys whatever was in the upper half of the register. Same with e.g. adda.w (the source is sign-extended to a longword and a longword addition is done on the register), etc.
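To make the effect concrete, here is a small C model of saving an address register with a word push and restoring it with a word pop. It is only a sketch of the behaviour described above; the function names are invented for illustration and nothing here is 68k code from the thread.

```c
#include <stdint.h>

/* Model of MOVEA.W <ea>,An: the 16-bit source is sign-extended to 32 bits
 * and replaces the whole address register. */
static uint32_t movea_w(uint16_t word)
{
    uint32_t value = word;
    if (value & 0x8000u)
        value |= 0xFFFF0000u;   /* copy bit 15 into the upper half */
    return value;
}

/* Model of "move.w a5,-(sp)" followed by "move.w (sp)+,a5". */
static uint32_t save_restore_as_word(uint32_t areg)
{
    uint16_t on_stack = (uint16_t)(areg & 0xFFFFu); /* only the low word is saved */
    return movea_w(on_stack);                       /* upper half is rebuilt from bit 15 */
}
```

With a5 = $00010000 the register comes back as $00000000, and with $00018000 it comes back as $FFFF8000. Only values that already fit in a signed word survive, which is why address registers have to be saved as longwords.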
|
Quote:
So TCH had been lucky with A6 in his version (and this explains the crash with A5). |
Try this, I take no responsibility if anything explodes ;P.
A few things...
1) It seems you are accessing the pattern LUT backwards. You are going through bitplanes 0->N, while the LUT index goes N->0. The version below goes 0->N in both cases, which I presume is the correct way.
2) The temp area seems to be the same width as dest (full width), but is organized differently.
dest:
- line0: AAAA00BBBB00CCCC00DDD00,
- line1: AAAA00BBBB00CCCC00DDD00,
...
temp:
- line0: AAAABBBBCCCCDDD00000000,
- line1: AAAABBBBCCCCDDD00000000,
...
A=bpl0, B=bpl1, C=bpl2, D=bpl3, 0 = area not affected by a blit
If my assumption is correct, the explosion won't be too big.
Code:
; d3 (rowsize) is not needed, we assume that rowsize = depth*((width+modulo)<<2) |
@Don_Adan:
That's a good thing to know, that address registers need to be pushed/pulled as longwords, thanks.
@a/b: Modulo is multiplied by 4, because the former C loops were using it with 32-bit pointers. I posted the ASM code in an earlier post and the args were listed there, but now I'll elaborate on them.
a0.l is the destination zone's upper left corner on the zeroth plane of the screen.
a1.l is the same, but for the source bitmap, which has only one plane.
d0.w and d1.w are the height (as in number of rows) and width (as in number of longwords per row), in this order. These are the dimensions of the work area, not the screen.
d2.l is modulo value, as in how many longwords are left on this screen, until the beginning of the same line of the work area on the next plane.
d3.l is the number of bytes per row of one plane on the screen. (As in: RastPort->BitMap->BytesPerRow)
d4.w is the depth of the screen. It is passed as a word, because if I sent a byte, then pushing it to the stack would mess up word-alignment. Maybe this is not necessary, I was only being cautious.
And a2.l is the LUT for patterns, which is filled up by the code in my opening post.
The source area is a simple 1-bit bitmap, with identical dimensions to the destination screen. First, I calculate the work rectangle by iterating over the vertex coordinates of the polygon, then I clear the work rectangle with zeros, then draw the outlines of the polygon, then fill it with ones using xorfill. Then I copy this work area to each plane of the destination screen, line by line, but masked with the corresponding value of the LUT. (A direct copy would always result in the last colour index.)
As for the LUT, you can see how it's filled up in the opening post. Basically it iterates through the bits of both colour indexes and puts together patterns for all planes, in which one pixel is one colour and the next pixel is the other, so a LUT entry is a mask pattern for a bitplane.
And since it has to alternate per line, the LUT has 16 entries, and the upper entries have the same patterns but with swapped colours (by XOR 3, but I've just realized that the XOR is not needed, and neither is the double loop; I can just rotate the patterns by one bit and put them into pattern[1][i]). The result is visible in the picture I linked previously. (Without the separate dots, of course.) However, I filled up the LUT backwards, to be able to decrement the loop variable to zero instead of incrementing it to the depth, and to spare a D - i operation each time. So, RowSize needs to be passed. (Your code did not explode, but did not work either, due to this.) |
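The LUT described above might be built roughly like this in C, using the rotate-by-one-bit idea instead of the XOR. This is a sketch under assumptions and not the code from the opening post: the name `build_dither_lut`, the `MAX_DEPTH` constant, the forward plane order (entry 0 = plane 0, whereas the original fills it backwards) and the choice of which pen lands on the leftmost pixel are all made up for the illustration.

```c
#include <stdint.h>

#define MAX_DEPTH 8   /* assumed maximum number of bitplanes */

/* lut[0][p]: pattern for plane p on even rows, lut[1][p]: on odd rows.
 * Assumption: pen_a goes to the pixels in bits 31, 29, 27, ... and pen_b
 * to the pixels in between. */
static void build_dither_lut(uint32_t lut[2][MAX_DEPTH],
                             unsigned pen_a, unsigned pen_b)
{
    for (int p = 0; p < MAX_DEPTH; p++) {
        uint32_t pat = 0;
        if ((pen_a >> p) & 1u)
            pat |= 0xAAAAAAAAu;               /* plane bit of pen A */
        if ((pen_b >> p) & 1u)
            pat |= 0x55555555u;               /* plane bit of pen B */
        lut[0][p] = pat;
        lut[1][p] = (pat << 1) | (pat >> 31); /* rotate left by one pixel */
    }
}
```

With pens 2 and 3 this gives 0x55555555 / 0xAAAAAAAA for plane 0 (the pens differ there, so the rows alternate) and 0xFFFFFFFF on both rows for plane 1 (both pens have that bit set).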
> d2.l is modulo value, as in how many longwords are left on this screen, until the beginning of the same line of the work area on the next plane
OK, this is important because it's misleading. It's a delta, not modulo (semantics :p ). Modulo would be, for me, how many bytes until the next line, e.g. bitmap width is 320px and blit width is 256px, so modulo is then 64px or 8 bytes. While in your case it's 320*bitmap_height/8 bytes, so for a 320x256 bitmap that's 10240 bytes. That got me confused, and when I asked what do you send in d1/d2/d3 I meant the actual values so I could clearly see what's going on. OK, I'll post a fixed version later today... |
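The two quantities being distinguished here can be written down as a small C sketch. The function names are invented for this illustration, and the plane delta formula assumes a non-interleaved bitmap where each plane is stored as one contiguous block, which is how the numbers above read.

```c
#include <stddef.h>

/* Row modulo in the usual blit sense: bytes to skip at the end of each
 * blitted row to reach the start of the next row in the same plane. */
static size_t row_modulo_bytes(size_t bitmap_width_px, size_t blit_width_px)
{
    return (bitmap_width_px - blit_width_px) / 8;
}

/* Plane delta: bytes between the same position on two consecutive planes,
 * assuming each plane is a contiguous width*height/8 byte block. */
static size_t plane_delta_bytes(size_t bitmap_width_px, size_t bitmap_height)
{
    return bitmap_width_px * bitmap_height / 8;
}
```

For a 320x256 bitmap and a 256px wide blit, `row_modulo_bytes(320, 256)` is 8 and `plane_delta_bytes(320, 256)` is 10240, matching the figures in the post.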
Did a simple test with a 320x256x3 bitmap and 256x128 blit size, and the produced output 100% matched the original code's output, so this should work in a non-explosive way.
And yeah, pattern LUT must be in normal (non-backward) order... Code:
_PolygonBitmapToPlanes32: |