CPU cycles left, when Blitter is busy

phx · 05 June 2023, 11:13

In my last games I had a loop to draw all BOBs like this (with BLTAFWM, BLTALWM, BLTAMOD and BLTBMOD appropriately preconfigured):

Code:

        bra     .2

.1:     lea     BOsize(a0),a0
        move.w  (a0)+,d3                ; BOsize
        move.w  (a0)+,d1                ; BOshift
        move.w  d1,d2
        or.w    d6,d2                   ; BLTCON0: use ABCD, D=AB+/AC
        swap    d2
        move.w  d1,d2                   ; d2 = BLTCON0 | BLTCON1
        movem.l (a0)+,a2-a3/a5          ; BOimg, BOmsk, BOpos
        move.l  a5,a1
        move.w  (a0),d1                 ; BOmod

        WAITBLIT
        move.w  d1,BLTCMOD(a6)
        move.w  d1,BLTDMOD(a6)
        move.l  d2,BLTCON0(a6)
        movem.l a1-a3/a5,BLTCPT(a6)
        move.w  d3,BLTSIZE(a6)

        move.l  d0,a0
.2:     move.l  BOnext(a0),d0
        bne     .1

I rarely count cycles and just go by feeling: if there are any Chip RAM bus cycles left while the Blitter is busy, they may as well be used to fetch the data for the next BOB.

But, assuming the worst case, a Chip-RAM-only 7 MHz A500, how many free cycles are there really, when all four Blitter channels are used on a five-plane display with 256 rows and DDFSTRT=$30, DDFSTOP=$d8? There shouldn't be much copper activity, besides the top of display and at the split-line for vertical scrolling.

Can I expect to execute some more code after drawing each (16x16x5) BOB? Or are there already too few cycles available? I don't need exact numbers, just your estimation.

(What I would like to do additionally is to mark the dirty map tiles for redraw.)

paraj · 05 June 2023, 13:13

With all channels enabled in area mode the blitter will use every available cycle, so there are no "free" (unused) ones.

However, if you don't have blitter nasty enabled, you will find that you can actually fit in more productive work "for free" since the memory cycles "stolen" by the CPU would otherwise go to waste waiting for the blitter to finish.

The code before "WAITBLIT" needs 29 memory cycles and will complete after roughly 150 CCKs (It will go: Internal cycles, wait 3 CCKs, memory cycle) while your blit needs 16*5*4=320.
Inside the display area, with 5 bitplanes each doing 22 fetches per line, there will be 223 - 22*5 (display DMA) - 227/5 (CPU stealing every fifth) = 68 or so left for the blitter, and the blitting will take ~4.7 scanlines to finish (~1.8 outside).

So if my estimates are right, you can get in around 51 memory accesses outside the display area, and 183 inside it with the CPU (-8 for the blitter wait loop) without slowing things down.
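
The arithmetic behind these figures can be sketched as follows. This is only a back-of-envelope model, not a cycle-exact one: the 223 usable slots per line, the every-fifth-cycle CPU share and the one-word-per-row blit width are the figures used in the post, the fetch formula assumes a standard lores fetch of one word per plane every 8 CCKs, and the function names are made up for illustration.

```python
# Rough bus budget for one 16x16x5 BOB blit using all four channels.
# Constants come from the estimate in the post; names are illustrative only.

def fetches_per_line(ddfstrt, ddfstop):
    """Lores bitplane fetches per plane and line (assumes one word per 8 CCKs)."""
    return (ddfstop - ddfstrt) // 8 + 1

def blit_accesses(rows, planes, channels):
    """Memory accesses for a blit that is one word wide per row and plane."""
    return rows * planes * channels

words = fetches_per_line(0x30, 0xD8)       # 22 fetches per plane
blit = blit_accesses(16, 5, 4)             # 320 accesses in total
display_dma = words * 5                    # 110 slots taken by 5 bitplanes
cpu_share = 227 // 5                       # 45 slots, CPU stealing every fifth CCK

inside = 223 - display_dma - cpu_share     # blitter slots per line, display on
outside = 223 - cpu_share                  # blitter slots per line, display off

print(words, blit, inside, outside)        # 22 320 68 178
print(round(blit / inside, 1))             # 4.7 scanlines inside the display
print(round(blit / outside, 1))            # 1.8 scanlines outside the display
```

Changing `rows`, `planes` or the DDFSTRT/DDFSTOP values gives a quick feel for how the budget moves with other BOB sizes or display widths.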

Try to put in 40 nops before WAITBLIT and see if it affects performance notably

phx · 05 June 2023, 17:48

Quote:

Originally Posted by paraj

However, if you don't have blitter nasty enabled

Only during WAITBLIT.

Quote:

The code before "WAITBLIT" needs 29 memory cycles

How do you count memory cycles? Instruction word and data word fetches combined seem to be around 20.

Quote:

So if my estimates are right, you can get in around 51 memory accesses outside the display area, and 183 inside it with the CPU (-8 for the blitter wait loop) without slowing things down.

Wow! So there are four times the memory cycles, which have already been used, still available? Very nice! Thanks for your detailed analysis!

Quote:

Try to put in 40 nops before WAITBLIT and see if it affects performance notably

40 NOPs are not 40 memory accesses?

paraj · 05 June 2023, 18:14

Quote:

Originally Posted by phx

Only during WAITBLIT.

How do you count memory cycles? Instruction word and data word fetches combined seem to be around 20.

Wow! So there are four times the memory cycles, which have already been used, still available? Very nice! Thanks for your detailed analysis!

40 NOPs are not 40 memory accesses?

Ah, if you enable blitter nasty in WAITBLIT much of the analysis goes out the window

The "extra" cycles are only there if blitter nasty is off and you're wasting them waiting for the blitter to finish. Once you enable nasty, the CPU will effectively halt until the blit is done.

However before that pure CPU cycles (i.e. ones that don't need to access memory, like for shifts/muls/divs) are essentially free, so there could still be a benefit in putting some work there.

For the memory cycles I also counted the looping part (not manually, I'm too lazy, I just threw the code into https://68kcounter.grahambates.com/ created by another EAB member). 1 nop = one memory access due to the instruction prefetch (and assuming no true fast RAM it will block the blitter w/ nasty off after waiting for 3 CCKs).

phx · 05 June 2023, 19:20

Quote:

Originally Posted by paraj

Ah, if you enable blitter nasty in WAITBLIT much of the analysis goes out the window

Why? When the code finally enters WAITBLIT then everything I wanted the CPU to do is done.

Quote:

The "extra" cycles are only there if blitter nasty is off

Sure. When the Blitter is available again, then the Nasty-flag is turned off, before restarting the Blitter.

Code:

        macro   WAITBLIT
        move.w  #$8400,DMACON(a6)
.\@:    btst    #6,DMACONR(a6)
        bne.b   .\@
        move.w  #$0400,DMACON(a6)
        endm
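
For readers unfamiliar with the constants in the macro, here is a small sketch that decodes them. The bit positions are the DMACON/DMACONR ones from the Hardware Reference Manual; the helper name is made up for illustration.

```python
# Decode the words the WAITBLIT macro writes to DMACON.
SETCLR = 0x8000   # bit 15: 1 = set the listed bits, 0 = clear them
BLTPRI = 0x0400   # bit 10: blitter priority ("blitter nasty")
BBUSY = 0x4000    # bit 14 of DMACONR: blitter busy (read-only)

def describe_dmacon_write(value):
    """Return (action, affected bits) for a word written to DMACON."""
    action = "set" if value & SETCLR else "clear"
    return action, value & ~SETCLR & 0xFFFF

first = describe_dmacon_write(0x8400)    # entering the wait loop
last = describe_dmacon_write(0x0400)     # leaving the wait loop
print(first, first[1] == BLTPRI)         # ('set', 1024) True
print(last, last[1] == BLTPRI)           # ('clear', 1024) True

# btst #6 on a memory operand tests a byte. DMACONR(a6) addresses the high
# byte of the big-endian word, so byte bit 6 corresponds to word bit 14.
tested_bit = 1 << (6 + 8)
print(tested_bit == BBUSY)               # True
```

So the macro turns nasty mode on, polls the busy flag, and turns nasty mode off again before the next blit is started.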

We can also assume Blitter-Nasty is always off, to make things easier and avoid confusion.

Quote:

and you're wasting them waiting for the blitter to finish.

Note that multiple BOBs will be blitted from a list, so I want to waste cycles at the beginning of that loop while the previous BOB is blitted.

Quote:

However before that pure CPU cycles (i.e. ones that don't need to access memory, like for shifts/muls/divs) are essentially free, so there could still be a benefit in putting some work there.

Now I'm confused. First we had 234 free memory cycles and now none?

paraj · 05 June 2023, 20:05

Quote:

Originally Posted by phx

Why? When the code finally enters WAITBLIT then everything I wanted the CPU to do is done.

Sure. When the Blitter is available again, then the Nasty-flag is turned off, before restarting the Blitter.

Code:

        macro   WAITBLIT
        move.w  #$8400,DMACON(a6)
.\@:    btst    #6,DMACONR(a6)
        bne.b   .\@
        move.w  #$0400,DMACON(a6)
        endm

We can also assume Blitter-Nasty is always off, to make things easier and avoid confusion.

Note that multiple BOBs will be blitted from a list, so I want to waste cycles at the beginning of that loop while the previous BOB is blitted.

Now I'm confused. First we had 234 free memory cycles and now none?

I probably phrased things badly, let me try again.

The "free" memory cycles were based on the (wrong) assumption that blitter nasty was disabled all the time, and that cycles were being wasted in the loop waiting for the blitter to finish. Since that's not the case, there isn't much to gain here.

For my previous estimate:

Simplifying things let's assume only CPU and blitter access memory and CPU is only executing instructions with no overhead (meaning memory access would normally happen every 4 CPU (7MHz) cycles e.g. like a NOP).
After you start the blit the access pattern will be CBBBB repeated. I.e. the CPU accesses memory, does some internal processing for one CCK (where the blitter has access), then requests the bus again and waits for 3 cycles (where the blitter has access) before being allowed to do its read.
If you're doing real work with that CPU instruction that's great, but if it's JUST waiting for the blit to finish that's wasted work ("free cycles"). In this case every 5th cycle is wasted by waiting.

With display DMA active it gets worse since without blitter nasty the CPU will still get ~1/5 memory cycles, and with 5 bitplanes active it looks more like CDBDBDDD (with D being display DMA), i.e. the blitter getting around the same number of memory accesses as the CPU.

In this case (which doesn't apply to you) the memory cycles used by the CPU could be put to better use.
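
To put numbers on those patterns, here is a minimal sketch that just counts who owns each slot. It takes the pattern strings literally (C = CPU, B = blitter, D = display DMA), so it is only as accurate as the simplified patterns themselves; the helper name is made up for illustration.

```python
from collections import Counter

def shares(pattern):
    """Fraction of bus slots owned by CPU (C), blitter (B) and display DMA (D)."""
    counts = Counter(pattern)
    return {owner: counts[owner] / len(pattern) for owner in "CBD"}

no_display = shares("CBBBB")
five_planes = shares("CDBDBDDD")
print(no_display)     # {'C': 0.2, 'B': 0.8, 'D': 0.0}
print(five_planes)    # {'C': 0.125, 'B': 0.25, 'D': 0.625}
```

With the display off the blitter owns four slots out of five. In the five-bitplane pattern its share drops to a quarter, so every slot the CPU spends just polling is a much larger fraction of what the blitter has left.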

phx · 06 June 2023, 01:33

Oh! So you are basically saying that doing some more work with the CPU, while the Blitter is running, does not give me any advantage compared to doing that work after all Blitter activity has ceased, because Nasty-mode (during WAITBLIT) would then be entered later, consuming that potential advantage?

Hmm. Maybe there is still an advantage to do some more work with data from the current BOB structure, because setting up a second loop to go over the list of BOBs again, after all BOBs have been drawn, might be more expensive?

paraj · 06 June 2023, 08:00

Quote:

Oh! So you are basically saying that doing some more work with the CPU, while the Blitter is running, does not give me any advantage compared to doing that work after all Blitter activity has ceased, because Nasty-mode (during WAITBLIT) would then be entered later, consuming that potential advantage?

Yes, exactly. On the other hand, while the blitter is running, internal CPU cycles (the ones that don't need to access memory) will overlap blitter activity like I mentioned, so in a sense it's optimal to place more code there, but only if it doesn't cause you to miss a deadline.
I.e. if you have:

Blitter stuff
CPU stuff
Publish frame

Overlapping CPU & blitter is good, but if you have blitter stuff / publish frame / CPU stuff you have to be more careful.

Quote:

Hmm. Maybe there is still an advantage to do some more work with data from the current BOB structure, because setting up a second loop to go over the list of BOBs again, after all BOBs have been drawn, might be more expensive?

Yes, assuming the above and that you can do it without needing to spill registers, it will at worst be no faster.

phx · 06 June 2023, 14:32

Ok, thanks! That helps. Then I will add some more code to that loop.

Rock'n Roll · 08 June 2023, 09:52

Quote:

So if my estimates are right, you can get in around 51 memory accesses outside the display area, and 183 inside it with the CPU (-8 for the blitter wait loop) without slowing things down.

how do I get 51? or 183? (51+183=234?)

paraj · 08 June 2023, 17:43

Simplifying a bit (and remember this only holds if blitter nasty is off, and stays off) and sticking outside the display area, and assuming no other DMA activity.

The CPU will access memory every 5th CCK. After the blitter is started the CPU does 29 memory accesses (instruction fetches and data reads), and the blitter will have gotten 116 (29*4) memory accesses in that time, meaning it needs another 204 (320-116) to finish the blit.
While those are done, the CPU gets one access for every four blitter accesses (204/4 = 51).

Inside the display area the calculation is more complicated since the blitter will back off (allow the CPU access) regardless of the reason the CPU was blocked, meaning more cycles will be given to the CPU relative to the blitter. Memory access from the CPU's point of view will still take ~5 CCKs, but now it's not 4/5 going to the blitter, only ~30% (~50% is used by bitplane DMA).

So after the CPU has done its 29 memory accesses the blitter will have only done ~44 (every scanline the blitter gets ~68 vs. ~45 for the CPU). For the remaining 276 (320-44) memory accesses the blitter needs to do, the CPU will get ~183 (276/(68/45)).

Probably made some mistakes and other faulty assumptions, but that's roughly how I arrived at those numbers
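
The same derivation written out as a short calculation, so the 51 and 183 can be reproduced. It uses only the figures from this post (29 CPU accesses before WAITBLIT, 320 blitter accesses, ~68 blitter vs. ~45 CPU accesses per scanline inside the display) and inherits all of its simplifications; the variable names are made up.

```python
BLIT = 16 * 5 * 4        # 320 blitter accesses for one BOB
CPU_SETUP = 29           # CPU accesses between BLTSIZE and the next WAITBLIT

# Outside the display area: CBBBB, four blitter accesses per CPU access.
done_outside = CPU_SETUP * 4                    # 116
left_outside = BLIT - done_outside              # 204
free_outside = left_outside // 4                # 51

# Inside the display area: per scanline the blitter gets ~68, the CPU ~45.
ratio = 68 / 45
done_inside = round(CPU_SETUP * ratio)          # 44
left_inside = BLIT - done_inside                # 276
free_inside = round(left_inside / ratio)        # 183

print(free_outside, free_inside, free_outside + free_inside)   # 51 183 234
```

As stated above, these numbers only hold while blitter nasty is off and stays off.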

Photon · 08 June 2023, 20:51

You should enable BLTPRI if this is one of the largest jobs in your render loop. Toggling it during Blitter wait affects nothing (well, except making the Blitter wait code slightly longer to execute.)

You use all 3 sources, which means no CPU for you. On top of this, for the display of the bitplanes, using 5 bitplanes robs the Blitter of 25% of its memory accesses.

If you have a large CPU load, and it's compatible with any raster waits in the Copper (in your case nope or maybe), you may off-load the loading of register values to the Copper, else a Blitter interrupt. In this case, you can disable BLTPRI to free ~25% of the Blitter's memory accesses for the CPU.

If BLTPRI=1 + optimal minterm, then the only memory access cycles available to the CPU are as specified per number of sources and minterm by the figure in HRM (collated here), as the pipeline fills and empties. In addition to this, you can fit 1 or 2 memory accesses for CPU internal instruction(s) before the Blitter wait (very difficult to make use of in practice).

Rock'n Roll · 09 June 2023, 12:46

Ok, I understand your calculation so far. Thanks.
But the question was if there are still CPU cycles free while Bob is blitting?

And without getting too specific now, a rough estimate could be the following:
In the example: 320 cycles for the blit and 118 CPU cycles without WAITBLIT.

The simple estimation would then be:
for CBBBB: 320/4=80 ; 80 x CPU-cycles and 80*4 Blitter cycles ; Blitter free before CPU finish
for CDBDBDDD: 320/2=160 ; 160 x CPU-cycles and 160*2 Blitter cycles ; CPU finish before Blitter
So the value is somewhere in between. To be more precise, the shares could be weighted by percentage.

paraj · 09 June 2023, 20:55

You're right that the value will be in between, but 118 CPU cycles is 59 CCKs, and only roughly half are memory cycles. Only memory accesses really matter here.

Your "CDBDBDDD" example will probably look like "CDBDCDDD" if the first CPU cycle was preparing for a memory access.

This is the key point: if the CPU has been waiting to access memory for 3 cycles with blitter nasty off, it will "steal" the cycle from the blitter even if those cycles didn't go to the blitter.

And while it's not an issue for phx in this case, be careful about blindly enabling blitter nasty in your WAITBLIT macros as it can (and likely will) delay interrupts until the blit has finished.

phx · 10 June 2023, 14:22

Quote:

Originally Posted by Photon

You should enable BLTPRI if this is one of the largest jobs in your render loop. Toggling it during Blitter wait affects nothing (well, except making the Blitter wait code slightly longer to execute.)

Hmm. I don't understand. You are probably saying that the Blitter is getting all memory cycles anyway, when the CPU is idle? So BLTPRI doesn't give any advantage there?

But the CPU isn't idle. At least it has to check DMACONR for the Blitter-Done flag, which steals every third or fifth (or whatever) memory cycle from the Blitter. Or am I still misunderstanding?

Or were you saying I should better use BLTPRI from the beginning? That's the question. Is the loop in my initial post faster with BLTPRI always enabled (and taken out of the blitter wait)?

Quote:

If you have a large CPU load, and it's compatible with any raster waits in the Copper (in your case nope or maybe), you may off-load the loading of register values to the Copper

IMHO this is far too complex. It may work in a static demo scenario, but a game is usually too dynamic. I guess I would need a lot of time to update and dynamically write copper lists instead (?).

Quote:

else a Blitter interrupt. In this case, you can disable BLTPRI to free ~25% of the Blitter's memory accesses for the CPU.

But then you get some serious overhead for interrupt processing in exchange. I have yet to see real world tests here, especially when blitting dozens of 16x16 BOBs.

Quote:

In addition to this, you can fit 1 or 2 memory accesses for CPU internal instruction(s) before the Blitter wait (very difficult to make use of in practice).

Not sure what you are referring to. Can you elaborate? But when it is so difficult to exploit, it is probably not worth it.

Quote:

Originally Posted by paraj

And while it's not an issue for phx in this case, be careful about blindly enabling blitter nasty in your WAITBLIT macros as it can (and likely will) delay interrupts until the blit has finished.

Good point. Although not too relevant in my case. It may delay the CIA-B timer interrupt a little bit, which affects music replay. Otherwise there is only the VERTB interrupt, and when I don't finish rendering within a frame then the engine is definitely too slow!

Rock'n Roll · 10 June 2023, 22:07

The question is: when the CPU code is finished, is the blitter already free again, or does it still have to be waited for? Waiting would be bad.

So record whether the blitter is free or busy the first time WAITBLIT is reached. If the blitter is still busy, fill in "useless" instructions until that state flips.
That way you can determine the number of free cycles experimentally.

phx · 11 June 2023, 20:52

Quote:

Originally Posted by Rock'n Roll

The question is: when the CPU code is finished, is the blitter already free again, or does it still have to be waited for?

According to paraj's estimations the Blitter is still busy (on a Chip-only 68000) after executing the code from my initial posting.

Quote:

Waiting would be bad.

Yes. Although I can never eliminate the wait, because there will be systems with Fast-RAM and/or a faster CPU.

Quote:

So record whether the blitter is free or busy the first time WAITBLIT is reached. If the blitter is still busy, fill in "useless" instructions until that state flips.
That way you can determine the number of free cycles experimentally.

Indeed, I probably have to make some real-world tests.

Rock'n Roll · 13 June 2023, 14:14

Quote:

You're right that the value will be in between, but 118 CPU cycles is 59 CCKs, and only roughly half are memory cycles. Only memory accesses really matter here.

How should I interpret the CPU cycle information (68k counter) in the context of the CCK/chip-cycle information from single-stepping or the DMA debugger?
Or how do I know how many CPU memory access cycles I have?

Code:

Total cycles:	; from 68k counter
20(4/1) 8	move.w	#$0000,$dff042
28(5/2) 10	move.l	#START,$dff054
20(4/1) 8	move.w	#$0000,$dff066
20(4/1) 8	move.w	#(1*64)+$10,$dff058 
>d
0002a708 33fc 0000 00df f066      move.w #$0000,$00dff066
0002a710 33fc 0050 00df f058      move.w #$0050,$00dff058
>t
Cycles: 14 Chip, 28 CPU. (V=80 H=72 -> V=80 H=86)
  D0 00005000   D1 00010540   D2 00000000   D3 00000000
  D4 00000000   D5 00000000   D6 00000000   D7 00000000
  A0 00C028F6   A1 00010540   A2 00000000   A3 00000000
  A4 00000000   A5 00000000   A6 00C028F6   A7 00C5F898
USP  00C5F898 ISP  00C60898
T=00 S=0 M=0 X=1 N=0 Z=0 V=0 C=0 IMASK=0 STP=0
Prefetch 33fc (MOVE) 0000 (OR) Chip latch 00000000
0002a708 33fc 0000 00df f066      move.w #$0000,$00dff066
Next PC: 0002a710
>t
Cycles: 10 Chip, 20 CPU. (V=80 H=86 -> V=80 H=96)
  D0 00005000   D1 00010540   D2 00000000   D3 00000000
  D4 00000000   D5 00000000   D6 00000000   D7 00000000
  A0 00C028F6   A1 00010540   A2 00000000   A3 00000000
  A4 00000000   A5 00000000   A6 00C028F6   A7 00C5F898
USP  00C5F898 ISP  00C60898
T=00 S=0 M=0 X=1 N=0 Z=1 V=0 C=0 IMASK=0 STP=0
Prefetch 33fc (MOVE) 0050 (OR) Chip latch 00000050
0002a710 33fc 0050 00df f058      move.w #$0050,$00dff058
Next PC: 0002a718

>v $50 !80
Line: 50  80 HPOS 50  80:
 [50  80]  [51  81]  [52  82]  [53  83]  [54  84]  [55  85]  [56  86]  [57  87]
   CPU-WW  BPL2 112    CPU-WW  BPL1 110    CPU-RW  BPL2 112    CPU-RW  BPL1 110
     0001      0000      0508      0018      0000      0000      00DF      0000		; move.w #$0000,$00dff066
 00DFF054  0001A898  00DFF056  00015898  0002A70A  0001A89A  0002A70C  0001589A
 235A0600  235A0800  235A0A00  235A0C00  235A0E00  235A1000  235A1200  235A1400

 [58  88]  [59  89]  [5A  90]  [5B  91]  [5C  92]  [5D  93]  [5E  94]  [5F  95]
   CPU-RW  BPL2 112    CPU-RW  BPL1 110    CPU-WW  BPL2 112    CPU-RW  BPL1 110
     F066      0000      33FC      0000      0000      0000      0050      0000
 0002A70E  0001A89C  0002A710  0001589C  00DFF066  0001A89E  0002A712  0001589E
 235A1600  235A1800  235A1A00  235A1C00  235A1E00  235A2000  235A2200  235A2400

 [60  96]  [61  97]  [62  98]  [63  99]  [64 100]  [65 101]  [66 102]  [67 103]
   CPU-RW  BPL2 112    CPU-RW  BPL1 110    CPU-RW  BPL2 112    CPU-WW  BPL1 110
     00DF      0000      F058      0000      0839      0000      0050      0000		; move.w #$0050,$00dff058
 0002A714  0001A8A0  0002A716  000158A0  0002A718  0001A8A2  00DFF058  000158A2
 235A2600  235A2800  235A2A00  235A2C00  235A2E00  235A3000  235A3200  235A3400

For move.w #$0050,$00dff058
0002a710 33fc 0050 00df f058 move.w #$0050,$00dff058 ; 20(4/1) 8 bytes
I count:
4xCPU-RW on 5A-33FC, 5E-0050, 60-00DF, 62-F058 it's 4
1xCPU-WW on 5C-0000 it's 1
That makes 4+1=5 memory accesses, i.e. 5 CCKs within 5*2=10 chip cycles.

So, according to the 68k counter, there are 118 total cycles and 29 CCK for phx's routine (from after BLTSIZE up to WAITBLIT).
And the 29 CCK are the relevant value in this view, right?

paraj · 13 June 2023, 18:00

Quote:

For move.w #$0050,$00dff058
0002a710 33fc 0050 00df f058 move.w #$0050,$00dff058 ; 20(4/1) 8 bytes
I count:
4xCPU-RW on 5A-33FC, 5E-0050, 60-00DF, 62-F058 it's 4
1xCPU-WW on 5C-0000 it's 1
That makes 4+1=5 memory accesses, i.e. 5 CCKs within 5*2=10 chip cycles.

The (4/1) part matches the 4xCPU-RW/1xCPU-WW (that's what those numbers mean: number of memory reads/writes).
Chip cycles are CCKs. On all Amigas any memory access that needs to go through Agnus (chip mem/slow mem and custom registers) always takes 2 CCKs which is why you're seeing 10 CCKs for 5 memory accesses.
1 CCK for setup (free for DMA), and 1 for the actual access (CPU has exclusive access). For a 7MHz 68000 this works out nicely (not a coincidence) since it can only access memory every 4 CPU cycles (=2 CCKs) anyway.
That's why the CPU can run at essentially full speed even with 4 bitplanes active in lores (the bitplane DMA is scheduled to -4-2-3-1), but will slow down with 5 or 6 (some of those -'s where the CPU is accessing memory go away).
On a stock A1200 it's the same thing, but since the CPU can do long word memory accesses it can stay at (mostly) full speed. It is, however, *THE* bottleneck for accelerated Amigas (if they need to write to chip RAM).
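
As a small illustration of how to read the counter's notation, the sketch below splits a timing such as 20(4/1) into total CCKs and CCKs spent on the bus. It assumes a 7 MHz 68000 running from chip RAM with no contention (2 CPU cycles per CCK, 2 CCKs per memory access, as described above); the helper name is made up.

```python
import re

def bus_ccks(timing):
    """Parse 'cycles(reads/writes)' and return (total CCKs, CCKs on the bus)."""
    match = re.fullmatch(r"(\d+)\((\d+)/(\d+)\)", timing)
    cycles, reads, writes = map(int, match.groups())
    return cycles // 2, (reads + writes) * 2

move_w = bus_ccks("20(4/1)")    # move.w #imm,abs.l
move_l = bus_ccks("28(5/2)")    # move.l #imm,abs.l
mulu = bus_ccks("46(2/0)")      # a multiply: mostly internal cycles

print(move_w, move_l, mulu)     # (10, 10) (14, 14) (23, 4)
```

For the two moves every CCK is tied to a memory access, which matches the 10 and 14 chip cycles reported by the debugger. An instruction with a long internal phase, like the multiply, spends most of its time off the bus.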

Quote:

So, according to the 68k counter, there are 118 total cycles and 29 CCK for phx's routine (from after BLTSIZE up to WAITBLIT).
And the 29 CCK are the relevant value in this view, right?

I was playing a bit fast and loose with the terminology and equating memory accesses with CCKs (even if the CPU couldn't complete them that fast with no contention) since when the blitter has started it will "solve" the issue of the CPU not being fast enough.

29 memory accesses is the relevant number since it's the bottleneck. That it would take 118 CPU cycles to complete is sort of incidental (but could factor in for more complicated scenarios).

Don't know if that makes sense

Rock'n Roll · 14 June 2023, 10:29

@Toni
I never understood this chip-cycle information completely, but now it's clear.

Code:

CPU cycles are most of the time twice the chip cycles with memory access
12(3/0) 6 - lea	bitplane,a0		    3*2=6 Chip	12 CPU	ok
 8(2/0) 4 - and.w  #$000f,d0		    2*2=4 Chip   8 CPU	ok
12(2/1) 4 - move.w d0,$40(a5)		(2+1)*2=6 Chip  12 CPU  ok

but not always
46(2/0) 4 - mulu #40,d0			2*2=4 Chip-cycles with memory access
 8(1/0) 2 - add.w  d1,a0		1*2=2 Chip
12(1/0) 2 - lsr.w #3,d1			1*2=2 Chip

Now I know the chip-cycle value, but not how many of those chip cycles involve a memory access.
e.g.
...
00024b54 c0fc 0028 mulu.w #$0028,d0
Next PC: 00024b58
>t
Cycles: 23 Chip, 46 CPU. (V=308 H=194 -> V=308 H=217)
VPOS: 308 ($134) HPOS: 217 ($0d9) COP: $0006a538

And maybe the R/W information could be added to the first line, and also summarized if more than one step is traced.
>t
Cycles: 46 CPU, 23 Chip (2/0). (V=308 H=194 -> V=308 H=217)
VPOS: 308 ($134) HPOS: 217 ($0d9) COP: $0006a538
