Can I write a game in 3 weeks? - Page 3

deimos · 18 October 2019, 11:00

Quote:

Originally Posted by phx

Looks as if you need more than 1ms on average for drawing a line (in 4 bitplanes?). That's less than 20 lines per frame, which is pretty low.

Would it help if you draw a polygon once, then copy it masked into all the bitplanes needed for the chosen colour?

That is what I'm doing. I have 2 single plane scratch buffers. I xor one dot lines into scratch buffer A, fill it into scratch buffer B then copy masked scratchbuffer B into each bitplane of the real screen buffer (setting or clearing as the colour needs), then finally xor the lines to get rid of them, leaving scratch buffer A clean.

Code:

void FillPolygon2D(const ScreenBuffer * screenBuffer,
                   const ScreenBuffer * scratchScreenBufferA, const ScreenBuffer * scratchScreenBufferB,
                   const Point2D * polygon, const UWORD n, const UWORD colour)
{
    P_START // profiling

    Rectangle2D boundingRect = { -1, -1, -1, -1 };

    for (UWORD i = 0; i < n; i++) {
        if (boundingRect.top == -1 || polygon[i].value[Y] < boundingRect.top)
            boundingRect.top = polygon[i].value[Y];
        if (boundingRect.right == -1 || polygon[i].value[X] > boundingRect.right)
            boundingRect.right = polygon[i].value[X];
        if (boundingRect.bottom == -1 || polygon[i].value[Y] > boundingRect.bottom)
            boundingRect.bottom = polygon[i].value[Y];
        if (boundingRect.left == -1 || polygon[i].value[X] < boundingRect.left)
            boundingRect.left = polygon[i].value[X];
    }
    
    for (UWORD i = 0; i < n; i++)
        DrawOneDotLine(scratchScreenBufferA, scratchScreenBufferB, (Point2D [2]) { polygon[i], polygon[(i + 1) % n] });

    ScratchAreaFill(scratchScreenBufferB, scratchScreenBufferA, boundingRect);

    CopyScratchInColour(screenBuffer, scratchScreenBufferB, colour, boundingRect);

    for (UWORD i = 0; i < n; i++)
        DrawOneDotLine(scratchScreenBufferA, scratchScreenBufferB, (Point2D [2]) { polygon[i], polygon[(i + 1) % n] });

    P_STOP // profiling
}

a/b · 18 October 2019, 13:04

All the other stuff aside...
1. If you initialize boundingRect differently (something like: top +maxint, bottom -maxint, left +maxint, right -maxint), you can get rid of 4 comparisons with -1.
2. You are doing 2 slow divisions (modulos) per vertex. Change for loops to for(i=0; i<n-1; i++) DrawOneDotLine(...,i,i+1), and do the final call outside as DrawOneDotLine(...,n-1,0). That way you don't have to handle the wrap per vertex.
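
Both suggestions can be sketched in C roughly as follows. The type definitions are stand-ins (the thread never shows the real Point2D or Rectangle2D), and ForEachEdge with its callback is a hypothetical wrapper used in place of the real DrawOneDotLine call.

```c
#include <limits.h>

/* Stand-in types: assumptions, the real definitions are not shown in the thread. */
typedef short WORD;
typedef unsigned short UWORD;
enum { X = 0, Y = 1 };
typedef struct { WORD value[2]; } Point2D;
typedef struct { WORD top, right, bottom, left; } Rectangle2D;

/* Bounding box without the "== -1" checks: start from an inverted (empty) box. */
static Rectangle2D BoundingRect(const Point2D *polygon, UWORD n)
{
    /* Field order is top, right, bottom, left. */
    Rectangle2D r = { SHRT_MAX, SHRT_MIN, SHRT_MIN, SHRT_MAX };

    for (UWORD i = 0; i < n; i++) {
        WORD x = polygon[i].value[X];
        WORD y = polygon[i].value[Y];
        if (y < r.top)    r.top = y;
        if (x > r.right)  r.right = x;
        if (y > r.bottom) r.bottom = y;
        if (x < r.left)   r.left = x;
    }
    return r;
}

/* Hypothetical edge callback standing in for DrawOneDotLine. */
typedef void (*EdgeFn)(const Point2D *a, const Point2D *b, void *context);

/* Visit every edge once; the wrap-around edge is handled outside the loop,
   so no modulo (division) is needed per vertex. */
static void ForEachEdge(const Point2D *polygon, UWORD n, EdgeFn fn, void *context)
{
    if (n < 2)
        return;
    for (UWORD i = 0; i + 1 < n; i++)
        fn(&polygon[i], &polygon[i + 1], context);
    fn(&polygon[n - 1], &polygon[0], context); /* closing edge */
}

/* Example callback: just counts the edges it is given. */
static void CountEdge(const Point2D *a, const Point2D *b, void *context)
{
    (void)a;
    (void)b;
    ++*(int *)context;
}
```

Passing two Point2D pointers also avoids building a temporary two-element array for every call.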

deimos · 18 October 2019, 13:55

Quote:

Originally Posted by a/b

All the other stuff aside...
1. If you initialize boundingRect differently (something like: top +maxint, bottom -maxint, left +maxint, right -maxint), you can get rid of 4 comparisons with -1.
2. You are doing 2 slow divisions (modulos) per vertex. Change for loops to for(i=0; i<n-1; i++) DrawOneDotLine(...,i,i+1), and do the final call outside as DrawOneDotLine(...,n-1,0). That way you don't have to handle the wrap per vertex.

I've done this, but it has not made a significant difference.

I think it must be down to the overhead of the blitter interrupts. I don't think I'm spending much actual time in these routines, but the clock keeps ticking while the interrupts are being serviced. I'm adding some code to try and see what's happening. I don't think I can measure the interrupt overhead, but maybe I can count how many times I'm interrupted.

DanScott · 18 October 2019, 13:58

Consider blit clearing smaller areas for small polygons (might be quicker than redrawing the lines to clear the scratch)

There's probably a "size" that you could use as the cut-off point.. not sure what the value would be though, would need some experimentation.

Also, it probably is quicker to render smaller polygons directly to the screen, completely with the CPU

deimos · 18 October 2019, 14:13

Quote:

Originally Posted by DanScott

Consider blit clearing smaller areas for small polygons (might be quicker than redrawing the lines to clear the scratch)

There's probably a "size" that you could use as the cut-off point.. not sure what the value would be though, would need some experimentation.

Sure, but that's not going to make the difference I need. Even if clearing the lines were free, I'd only get less than half the time spent in this function back. I might even get to a whopping four frames per second.

Quote:

Originally Posted by DanScott

Also, it probably is quicker to render smaller polygons directly to the screen, completely with the CPU

My queue is based around callbacks, so I could implement some bits as CPU-based, but right now I think the problem is the queue itself, so apart from reducing the number of interrupts generated per polygon, I don't think this would solve the problem.

a/b · 18 October 2019, 14:37

Yeah, I was just checking the previous posts and noticed this (number of calls):
Blitter_AquireBlit 22556 1268
Blitter_EnqueueBlit 18440 5923
DrawOneDotLine 18006 19968

So, you are using interrupts to draw every single line? That's like instant (at least) 2 times slower than start blitter linedraw and then calc the next line while blitter is working. Even if you inline aquire/enque and optimize in asm, it's still super slow.
In general... I'd use interrupts only for larger blits (and not for lines), and if I'd need it done asynchronously I'd rather use copper lists, which could be very tricky if you are not doing stuff with fixed frame rates (e.g demo effects).

DanScott · 18 October 2019, 14:45

I would imagine he is CPU calculating his lines / fills / copies with the CPU into one buffer, while the 2nd buffer is being rendered with the interrupt??

deimos · 18 October 2019, 14:53

Quote:

Originally Posted by a/b

Yeah, I was just checking the previous posts and noticed this (number of calls):
Blitter_AquireBlit 22556 1268
Blitter_EnqueueBlit 18440 5923
DrawOneDotLine 18006 19968

So, you are using interrupts to draw every single line? That's like instant (at least) 2 times slower than start blitter linedraw and then calc the next line while blitter is working. Even if you inline aquire/enque and optimize in asm, it's still super slow.
In general... I'd use interrupts only for larger blits (and not for lines), and if I'd need it done asynchronously I'd rather use copper lists, which could be very tricky if you are not doing stuff with fixed frame rates (e.g demo effects).

Yeah, every blit goes through the queue, so there's an interrupt every time a blit finishes, which may start another blit.

I thought that what I'd lose with interrupt overhead I'd gain in never having to busy-wait.

I don't see a practical way to mix and match an interrupt-driven blitter queue with ad hoc blits. I suppose I could write something akin to a yield function that would consume from the queue instead of an interrupt. But that does not spark joy.
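
For what it's worth, a yield-style consumer could look something like the sketch below. Everything in it is hypothetical: the queue layout, the names BlitQueue_Push and BlitQueue_Yield, and the blitterBusy hook (which on real hardware would test the busy bit in DMACONR) are assumptions, not the actual queue from this project.

```c
#include <stdbool.h>

/* Hypothetical queued blit: a function that programs the blitter registers. */
typedef struct {
    void (*start)(void *context);
    void *context;
} QueuedBlit;

#define QUEUE_SIZE 8u /* power of two, so wrapping is an AND rather than a division */

typedef struct {
    QueuedBlit items[QUEUE_SIZE];
    unsigned head;             /* next blit to start */
    unsigned tail;             /* next free slot */
    bool (*blitterBusy)(void); /* assumed hook: true while a blit is running */
} BlitQueue;

/* Add a blit to the queue; returns false if the queue is full. */
static bool BlitQueue_Push(BlitQueue *q, QueuedBlit blit)
{
    unsigned next = (q->tail + 1u) & (QUEUE_SIZE - 1u);
    if (next == q->head)
        return false;
    q->items[q->tail] = blit;
    q->tail = next;
    return true;
}

/* Called from the main loop between pieces of CPU work, instead of from the
   BLIT interrupt. Starts the next blit only if the blitter is idle.
   Returns true if a blit was started. */
static bool BlitQueue_Yield(BlitQueue *q)
{
    if (q->head == q->tail || q->blitterBusy())
        return false;
    QueuedBlit blit = q->items[q->head];
    q->head = (q->head + 1u) & (QUEUE_SIZE - 1u);
    blit.start(blit.context);
    return true;
}

/* Test doubles for trying the queue off-target. */
static bool NeverBusy(void) { return false; }
static void CountStart(void *context) { ++*(int *)context; }
```

The trade-off is that the main code has to call the yield function often enough to keep the blitter fed, which is exactly the bookkeeping the interrupt was meant to avoid.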

deimos · 18 October 2019, 14:54

Quote:

Originally Posted by DanScott

I would imagine he is CPU calculating his lines / fills / copies with the CPU into one buffer, while the 2nd buffer is being rendered with the interrupt??

No CPU, only Blitter.

a/b · 18 October 2019, 15:10

Well, context (blit size in this case) matters... Interrupts work with few large blits. Many small blits, probably best to do it all with cpu (and busy waits). And with mixed sizes, copper if you can afford it. Otherwise I'd try cpu and busy waits, *and* set blitter nasty with area blits (clear/copy/fill). And if you have some really large blits, try to decouple them from the pipeline and do with interrupts.
Doing it all with a single routine, like in this case, obviously won't cut it ;\.

deimos · 18 October 2019, 15:14

As it stands, drawing just the first frame takes 178 blits, so 178 BLIT interrupts will be generated. Just how much overhead are we talking about to process an interrupt? If it takes 500ms to render the screen, and most of the time is spent in interrupt overhead, then that implies around 2ms of overhead per interrupt. That sounds like a big number.

roondar · 18 October 2019, 15:18

Using interrupts to control the Blitter is a very nice and elegant way of doing things. I've experimented with it for quite a while.

However, nice as it is from a programmer's viewpoint (no busy waiting, simply queue stuff) it is usually quite slow on (low end) Amigas. As of now, I've not managed to make my interrupt-based blitting run anywhere near the speed of my non-interrupt-based blitting.

Although I've not found any form of blitting that was consistently faster using interrupts, I have found that blits that don't use all DMA slots (such as drawing lines and clear blits) do better than those that do use all DMA slots (such as copy and cookie cut blits). This is because these types of blits allow the CPU to do more useful work while blitting. However, even with that, interrupt-based blitting was up to twice as slow for me as using a 'smart blitter wait'* (depending on size of blit - smaller was much closer to this than bigger). Basically, interrupt overhead can cost as much or more than what you normally spend setting up blits on an A500.

*) Meaning: set up as much of the next blit as possible while the previous blit is running and only then start waiting. When you are finally waiting, set BLTPRI in DMACON to one ('Blitter Nasty mode') first. When done waiting, switch it back to zero.
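
A minimal sketch of the waiting part of that footnote, in C. The bit values are the standard DMACON ones (SETCLR is bit 15, blitter busy is bit 14 of DMACONR, BLTPRI is bit 10), but the BlitterRegs struct is only a two-register stand-in so the logic can be read and tried in isolation. It is an assumption, not the real custom chip register layout, and WaitBlitNasty is a made-up name.

```c
#include <stdint.h>

/* Stand-in for the two registers used here (NOT the real register layout). */
typedef struct {
    volatile uint16_t dmaconr; /* read-only status on real hardware */
    volatile uint16_t dmacon;  /* write-only control on real hardware */
} BlitterRegs;

#define DMAF_SETCLR  0x8000u /* bit 15: 1 = set the given bits, 0 = clear them */
#define DMAF_BLTDONE 0x4000u /* bit 14 of DMACONR: blitter busy */
#define DMAF_BLITHOG 0x0400u /* bit 10: BLTPRI, "blitter nasty" */

/* Call this only after as much of the NEXT blit as possible has been
   prepared (addresses, modulos, masks computed), so the CPU has nothing
   useful left to do while the previous blit finishes. */
static void WaitBlitNasty(BlitterRegs *regs)
{
    /* Give the blitter priority over the CPU for the duration of the wait. */
    regs->dmacon = DMAF_SETCLR | DMAF_BLITHOG;

    while (regs->dmaconr & DMAF_BLTDONE) {
        /* busy-wait until the blitter reports idle */
    }

    /* Writing the bit without SETCLR clears it: back to normal priority. */
    regs->dmacon = DMAF_BLITHOG;
}
```

Turning BLTPRI on only while waiting lets the CPU run freely during the set-up phase and lets the blitter finish as quickly as possible once the CPU is idle anyway.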

Edit: on overhead numbers... IIRC my assembly blitter routine ends up taking around 150-200 cycles per blit (in total). An interrupt takes a minimum of around 70 cycles assuming you just immediately RTE without doing anything. Add in saving/restoring registers and acknowledging the interrupt and you're looking at around 150-200 cycles in overhead for an interrupt.

Doing 178 blits per frame would then cost my code between ~27.000 and ~71.000 cycles in overhead (without vs with interrupt). For reference, the 68000 runs at ~7MHz. So it has about 141.000 cycles per frame to play with. Any cycle used by the CPU can't be used by the blitter so that suffers too.

The above is all based on A500/68000.
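
The figures above can be reproduced with a few lines of C. This is one reading of the numbers, not something stated outright in the post: the low end is taken as 178 blits at 150 cycles of set-up each, the high end as 178 blits at 200 cycles of set-up plus 200 cycles of interrupt overhead, and the per-frame budget assumes a PAL machine (7.09379 MHz, 50 frames per second).

```c
/* Total overhead in CPU cycles for a number of blits at a given cost each. */
static unsigned long OverheadCycles(unsigned blits, unsigned cyclesPerBlit)
{
    return (unsigned long)blits * cyclesPerBlit;
}

/* CPU cycles available in one frame (integer division, rounds down). */
static unsigned long CyclesPerFrame(unsigned long clockHz, unsigned framesPerSecond)
{
    return clockHz / framesPerSecond;
}

/* Assumed inputs:
     OverheadCycles(178, 150)       ->  26700  (~27.000, set-up only, low estimate)
     OverheadCycles(178, 200 + 200) ->  71200  (~71.000, set-up + interrupt, high estimate)
     CyclesPerFrame(7093790UL, 50)  -> 141875  (~141.000 cycles per PAL frame)   */
```

On those assumptions the interrupt-driven case eats about half of the frame's CPU cycles before any real work is done.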

deimos · 18 October 2019, 15:37

Quote:

Originally Posted by roondar

Doing 178 blits per frame would then cost my code between ~50.000 and ~100.000 cycles in overhead (without vs with interrupt). For reference, the 68000 runs at ~7MHz. So it has about 141.000 cycles per frame to play with. Any cycle used by the CPU can't be used by the blitter so that suffers too.

Those numbers, even though large, don't seem to explain where all the time is going in my code. Even if an interrupt-driven queue is a dead end, I'd still like to get to the bottom of this mystery before I start again. Maybe I've just done something dumb.

roondar · 18 October 2019, 15:41

Quote:

Originally Posted by deimos

Those numbers, even though large, don't seem to explain where all the time is going in my code. Even if an interrupt-driven queue is a dead end, I'd still like to get to the bottom of this mystery before I start again. Maybe I've just done something dumb.

Do note I corrected them, they're actually quite a bit lower still

A question here would be what the performance difference here is between your compiled C code and assembly. I honestly don't know, it could be the compiler makes very decent code and the numbers are similar. It could also be the code is much slower.

Perhaps this is worth investigating by itself?
Or can we already tell the per-blit setup time from the numbers you posted?

My reason for pointing this out is that if adding a small bit of assembly in your code would make a big difference, it may be worthwhile.

deimos · 18 October 2019, 15:52

I think the compiler is doing a good job, I can't see anything that would cause a massive amount of extra overhead (remembering I'm trying to find where half a second has gone).

Code:

00001202 <_InterruptHandler_handleLevel3Interrupt>:

static __attribute__((interrupt)) void _InterruptHandler_handleLevel3Interrupt(void) {
    1202:	48e7 e0c0      	movem.l d0-d2/a0-a1,-(sp)
    
    for (UWORD intreqr_intenar = custom->intreqr & custom->intenar & (INTF_COPER | INTF_VERTB | INTF_BLIT);
    1206:	3439 00df f01e 	move.w dff01e <_end+0xddb07e>,d2
    120c:	3039 00df f01c 	move.w dff01c <_end+0xddb07c>,d0
    1212:	c440           	and.w d0,d2
    1214:	0242 0070      	andi.w #112,d2
    1218:	6700 0096      	beq.w 12b0 <_InterruptHandler_handleLevel3Interrupt+0xae>
         intreqr_intenar;
         intreqr_intenar = custom->intreqr & custom->intenar & (INTF_COPER | INTF_VERTB | INTF_BLIT))
    {
        if (intreqr_intenar & INTF_COPER) {
    121c:	0802 0004      	btst #4,d2
    1220:	6724           	beq.s 1246 <_InterruptHandler_handleLevel3Interrupt+0x44>
            custom->intreq = (UWORD) INTF_COPER; custom->intreq = (UWORD) INTF_COPER;
    1222:	33fc 0010 00df 	move.w #16,dff09c <_end+0xddb0fc>
    1228:	f09c 
    122a:	33fc 0010 00df 	move.w #16,dff09c <_end+0xddb0fc>
    1230:	f09c 
            interruptHandler->_processInterrupt(interruptHandler, INTB_COPER);
    1232:	2079 0000 7fc0 	movea.l 7fc0 <interruptHandler>,a0
    1238:	4878 0004      	pea 4 <_start+0x4>
    123c:	2f08           	move.l a0,-(sp)
    123e:	2068 0008      	movea.l 8(a0),a0
    1242:	4e90           	jsr (a0)
    1244:	508f           	addq.l #8,sp
        }
        
        if (intreqr_intenar & INTF_VERTB) {
    1246:	0802 0005      	btst #5,d2
    124a:	6724           	beq.s 1270 <_InterruptHandler_handleLevel3Interrupt+0x6e>
            custom->intreq = (UWORD) INTF_VERTB; custom->intreq = (UWORD) INTF_VERTB;
    124c:	33fc 0020 00df 	move.w #32,dff09c <_end+0xddb0fc>
    1252:	f09c 
    1254:	33fc 0020 00df 	move.w #32,dff09c <_end+0xddb0fc>
    125a:	f09c 
            interruptHandler->_processInterrupt(interruptHandler, INTB_VERTB);
    125c:	2079 0000 7fc0 	movea.l 7fc0 <interruptHandler>,a0
    1262:	4878 0005      	pea 5 <_start+0x5>
    1266:	2f08           	move.l a0,-(sp)
    1268:	2068 0008      	movea.l 8(a0),a0
    126c:	4e90           	jsr (a0)
    126e:	508f           	addq.l #8,sp
        }
        
        if (intreqr_intenar & INTF_BLIT) {
    1270:	0802 0006      	btst #6,d2
    1274:	6790           	beq.s 1206 <_InterruptHandler_handleLevel3Interrupt+0x4>
            custom->intreq = (UWORD) INTF_BLIT; custom->intreq = (UWORD) INTF_BLIT;
    1276:	33fc 0040 00df 	move.w #64,dff09c <_end+0xddb0fc>
    127c:	f09c 
    127e:	33fc 0040 00df 	move.w #64,dff09c <_end+0xddb0fc>
    1284:	f09c 
            interruptHandler->_processInterrupt(interruptHandler, INTB_BLIT);
    1286:	2079 0000 7fc0 	movea.l 7fc0 <interruptHandler>,a0
    128c:	4878 0006      	pea 6 <_start+0x6>
    1290:	2f08           	move.l a0,-(sp)
    1292:	2068 0008      	movea.l 8(a0),a0
    1296:	4e90           	jsr (a0)
         intreqr_intenar = custom->intreqr & custom->intenar & (INTF_COPER | INTF_VERTB | INTF_BLIT))
    1298:	3039 00df f01e 	move.w dff01e <_end+0xddb07e>,d0
    129e:	3439 00df f01c 	move.w dff01c <_end+0xddb07c>,d2
    12a4:	c440           	and.w d0,d2
    12a6:	0242 0070      	andi.w #112,d2
    for (UWORD intreqr_intenar = custom->intreqr & custom->intenar & (INTF_COPER | INTF_VERTB | INTF_BLIT);
    12aa:	508f           	addq.l #8,sp
    12ac:	6600 ff6e      	bne.w 121c <_InterruptHandler_handleLevel3Interrupt+0x1a>
        }
    }
}
    12b0:	4cdf 0307      	movem.l (sp)+,d0-d2/a0-a1
    12b4:	4e73           	rte

000012b6 <_InterruptHandler_handleLevel4Interrupt>:
    12b6:	4e73           	rte

000012b8 <_InterruptHandler_handleLevel5Interrupt>:
    12b8:	4e73           	rte

000012ba <_InterruptHandler_handleLevel6Interrupt>:
    12ba:	4e73           	rte

a/b · 18 October 2019, 16:44

If we look at the profiling...
FillPolygon2D: 39sec
- almost all aquire/enqueue calls are done here, so 1sec+6sec=7sec
- copyscratch 3sec
- DrawOneDotLine 20sec
That leaves us with 39-7-3-20=9sec.
Since it was measured before the suggested adjustments:
- boundingbox: always 8 cmps per vertex except 4 for the first one
- 2 divs per vertex (that's 300 cycles on average, say 4 vertices per poly and 50 polys is then 60k extra cycles per drawn frame)
- DrawOneDotLine is using a newly constructed struct with two member structs as 3rd parameter instead of e.g. taking two Point2D pointers
- majority of calls are done here, everything is passed via stack with extra copy/paste (e.g. scratchScreenBufferA, scratchScreenBufferB are the same for all calls).
So while 9sec looks like a lot, it's reasonable considering the above. I don't see anything being out of line.

deimos · 18 October 2019, 18:07

Quote:

Originally Posted by a/b

If we look at the profiling...
FillPolygon2D: 39sec
- almost all aquire/enqueue calls are done here, so 1sec+6sec=7sec
- copyscratch 3sec
- DrawOneDotLine 20sec
That leaves us with 39-7-3-20=9sec.
Since it was measured before the suggested adjustments:
- boundingbox: always 8 cmps per vertex except 4 for the first one
- 2 divs per vertex (that's 300 cycles on average, say 4 vertices per poly and 50 polys is then 60k extra cycles per drawn frame)
- DrawOneDotLine is using a newly constructed struct with two member structs as 3rd parameter instead of e.g. taking two Point2D pointers
- majority of calls are done here, everything is passed via stack with extra copy/paste (e.g. scratchScreenBufferA, scratchScreenBufferB are the same for all calls).
So while 9sec looks like a lot, it's reasonable considering the above. I don't see anything being out of line.

I can't see a way to test what you've said, but I can test a few related things.

If I remove the calls to ScratchAreaFill and CopyScratchInColour from FillPolygon2D, the amount of time spent in DrawOneDotLine should not be affected (apart from the time spent catching up on the blits for those two calls).

In my current test code (with your initial optimisations to FillPolygon2D implemented) the time spent in DrawOneDotLine drops from 30s to 26s.

These two function calls take almost the same parameters. I can leave the call to ScratchAreaFill in while removing the call to CopyScratchInColour. The time spent in DrawOneDotLine returns to 30s.

I'm not sure what this tells me, but it does make me think that a large proportion of what the clock says is spent in FillPolygon2D is actually spent elsewhere, in interrupts or interrupt overhead. And the amount of "lost" time seems too high to be explained by interrupt overhead or inefficiencies due to compiling from C.

a/b · 18 October 2019, 19:26

First, small correction. I missed fillscratch, so that's 5sec for fill+copy and 7sec for code within FillPolygon2D...

How about if you test these scenarios:
- completely skip blitter code in the interrupt handler (only handle intena bits),
- only handle clear screen and similar (skip poly related blits),
- instead of actual poly blits, use size=(1<<6)+1 D=0 blits (minimal size blits, so you keep triggering interrupts).
- anything similar that comes to your mind.
So you can see how much overhead you have from pure interrupts, interrupts with minimal blits, etc. and then compare.
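
For reference, the size value in the third scenario comes from how BLTSIZE is packed: height in lines in bits 15-6, width in words in bits 5-0. A small helper makes that explicit (BltSize is a made-up name, not a function from this project):

```c
#include <stdint.h>

/* Pack a blit size for BLTSIZE: height (lines) in bits 15-6, width (words)
   in bits 5-0. Masking means a height of 1024 or a width of 64 wraps to 0,
   which is how the hardware encodes those maximum values. */
static uint16_t BltSize(uint16_t heightLines, uint16_t widthWords)
{
    return (uint16_t)(((heightLines & 0x3FFu) << 6) | (widthWords & 0x3Fu));
}
```

BltSize(1, 1) gives (1<<6)+1, i.e. one line of one word, the smallest blit that will still raise a BLIT interrupt when it completes.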

DanScott · 18 October 2019, 20:31

Of course, you have to remember that you are going to be writing your blitter values to a memory buffer, and then reading them again before writing them AGAIN to the actual blitter hardware registers... this (with the overhead of the blitter interrupt request) is going to eat a LOT of extra cycles.... the only main advantage is that you are not halting the CPU with blit waits... perhaps with a few large blits, then an interrupt is better, and with a lot of small blits, not so good.

Makes you wonder why nearly ALL the games used software CPU rendering back in the day

And demos are designed around the amiga hardware (filling convex objects in one pass etc..) so they will use the blitter to the max in the most efficient way possible.

To me, the "holy grail" of Amiga 3D, would be to have a system that plotted the edge points of polygons in a complex scene, allowing the whole scene to be filled in one blitter fill operation at the end

something I have been looking at on and off since around 1991

deimos · 18 October 2019, 21:11

I've added profiling to my interrupt code. The numbers make me think there's a problem there.

Code:

_InterruptHandler_handleLevel3Interrupt	33426	40725
_GameInterruptHandler_processVERTB	7793	528
_GameInterruptHandler_processBLIT	29293	1037

This run had a total elapsed time of 156s, of which I spent 40s handling level 3 interrupts, but out of that only 0.5s + 1s were spent really doing stuff.
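
Working the averages out from those three rows (a quick sketch, nothing more; the helper names are made up):

```c
/* Average time per call in milliseconds; 0 if there were no calls. */
static double AverageMs(unsigned long totalMs, unsigned long calls)
{
    return calls ? (double)totalMs / (double)calls : 0.0;
}

/* Time inside the level 3 handler that is not accounted for by the
   VERTB and BLIT processing functions it calls. */
static unsigned long UnaccountedMs(unsigned long handlerMs,
                                   unsigned long vertbMs,
                                   unsigned long blitMs)
{
    return handlerMs - vertbMs - blitMs;
}

/* With the numbers from this run:
     AverageMs(40725, 33426)        is about 1.22 ms per level 3 interrupt
     AverageMs(1037, 29293)         is about 0.035 ms per processBLIT call
     UnaccountedMs(40725, 528, 1037) = 39160 ms                              */
```

About 1.2ms per interrupt is well over 8,000 cycles on a ~7MHz 68000, so nearly all of the 40s is being recorded inside the handler but outside the two processing functions.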

Obviously you can add interrupts to the list of things I don't really understand.

Code:

static __attribute__((interrupt)) void _InterruptHandler_handleLevel3Interrupt(void) {
    F_START

    for (UWORD intreqr_intenar = custom->intreqr & custom->intenar & (INTF_COPER | INTF_VERTB | INTF_BLIT);
         intreqr_intenar;
         intreqr_intenar = custom->intreqr & custom->intenar & (INTF_COPER | INTF_VERTB | INTF_BLIT))
    {
        if (intreqr_intenar & INTF_COPER) {
            custom->intreq = (UWORD) INTF_COPER; custom->intreq = (UWORD) INTF_COPER;
            interruptHandler->_processInterrupt(interruptHandler, INTB_COPER);
        }
        
        if (intreqr_intenar & INTF_VERTB) {
            custom->intreq = (UWORD) INTF_VERTB; custom->intreq = (UWORD) INTF_VERTB;
            interruptHandler->_processInterrupt(interruptHandler, INTB_VERTB);
        }
        
        if (intreqr_intenar & INTF_BLIT) {
            custom->intreq = (UWORD) INTF_BLIT; custom->intreq = (UWORD) INTF_BLIT;
            interruptHandler->_processInterrupt(interruptHandler, INTB_BLIT);
        }
    }

    F_STOP
}

And the number of times the BLIT interrupt is handled is more than the number of blits I actually enqueue.

Code:

Function Name				Number of Calls	Total Elapsed Time (ms)
_InterruptHandler_handleLevel2Interrupt	0		0
_InterruptHandler_handleLevel3Interrupt	33426		40725
_GameInterruptHandler_processVERTB	7793		528
_GameInterruptHandler_processBLIT	29293		1037
PlayGame				1		156282
FillScreen				88		132
Blitter_AquireBlit			22556		1337
Blitter_EnqueueBlit			22556		61151
Renderer_TransformModel			88		3144
Renderer_RenderModel			88		152210
RenderFace				2237		144695
ClipPolyToNearPlane			2187		2054
ClipAndFillPolygon2D			2187		138433
FillPolygon2D				2187		129042
DrawOneDotLine				18006		74168
ScratchAreaFill				2187		9254
CopyScratchInColour			2187		24010
DrawingComplete				88		153



