#1 |
Natteravn
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,417
|
CPU cycles left, when Blitter is busy
In my last games I had a loop to draw all BOBs like this (with BLTAFWM, BLTALWM, BLTAMOD,BLTBMOD appropriately preconfigured):
Code:
        bra     .2
.1:     lea     BOsize(a0),a0
        move.w  (a0)+,d3                ; BOsize
        move.w  (a0)+,d1                ; BOshift
        move.w  d1,d2
        or.w    d6,d2                   ; BLTCON0: use ABCD, D=AB+/AC
        swap    d2
        move.w  d1,d2                   ; d2 = BLTCON0 | BLTCON1
        movem.l (a0)+,a2-a3/a5          ; BOimg, BOmsk, BOpos
        move.l  a5,a1
        move.w  (a0),d1                 ; BOmod
        WAITBLIT
        move.w  d1,BLTCMOD(a6)
        move.w  d1,BLTDMOD(a6)
        move.l  d2,BLTCON0(a6)
        movem.l a1-a3/a5,BLTCPT(a6)
        move.w  d3,BLTSIZE(a6)
        move.l  d0,a0
.2:     move.l  BOnext(a0),d0
        bne     .1

But, assuming the worst case, a Chip-RAM-only 7 MHz A500, how many free cycles are there really, when all four Blitter channels are used on a five-plane display with 256 rows and DDFSTRT=$30, DDFSTOP=$d8? There shouldn't be much copper activity, besides the top of display and at the split-line for vertical scrolling.
Can I expect to execute some more code after drawing each (16x16x5) BOB? Or are there already too few cycles available? I don't need exact numbers, just your estimation. (What I would like to do additionally is to mark the dirty map tiles for redraw.)
#2 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 872
|
With all channels enabled in area mode the blitter will use every available cycle, so there are no "free" (unused) ones.
However, if you don't have blitter nasty enabled, you will find that you can actually fit in more productive work "for free", since the memory cycles "stolen" by the CPU would otherwise go to waste waiting for the blitter to finish.
The code before "WAITBLIT" needs 29 memory cycles and will complete after roughly 150 CCKs (it will go: internal cycles, wait 3 CCKs, memory cycle), while your blit needs 16*5*4=320 memory accesses. Inside the display area, with 5 bitplanes each doing 22 fetches per line, there will be 223 - 22*5 (display DMA) - 227/5 (CPU stealing every fifth) = 68 or so left for the blitter, and the blitting will take ~4.7 scanlines to finish (~1.8 outside).
So if my estimates are right, you can get in around 51 memory accesses outside the display area, and 183 inside it, with the CPU (-8 for the blitter wait loop) without slowing things down. Try to put in 40 nops before WAITBLIT and see if it affects performance notably.
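The per-scanline estimate above can be written out as a short calculation. This is only a sketch of the arithmetic in this post, not a cycle-exact model; it is written in Python for convenience, all names are made up for illustration, and the 223 and 227 figures are simply taken from the post as given.

```python
# Sketch of the scanline budget estimated in the post above.
# All constant and function names are illustrative, not from any Amiga API.

USABLE_SLOTS = 223            # slots per line, as used in the estimate above
CCK_PER_LINE = 227            # colour clocks per line, used for the CPU share
BITPLANES = 5
DDFSTRT, DDFSTOP = 0x30, 0xD8

# Lores fetches one word per plane every 8 CCKs between DDFSTRT and DDFSTOP.
FETCHES_PER_PLANE = (DDFSTOP - DDFSTRT) // 8 + 1      # 22

BLIT_ACCESSES = 16 * 5 * 4    # rows * planes * channels, as counted above


def blitter_slots_inside_display():
    """Slots per line left for the blitter while bitplane DMA is active."""
    display_dma = BITPLANES * FETCHES_PER_PLANE       # 110
    cpu_share = CCK_PER_LINE // 5                     # CPU takes every fifth: 45
    return USABLE_SLOTS - display_dma - cpu_share


def blitter_slots_outside_display():
    """Without display DMA the blitter gets four of every five CCKs."""
    return CCK_PER_LINE * 4 / 5


inside = blitter_slots_inside_display()
outside = blitter_slots_outside_display()
lines_inside = round(BLIT_ACCESSES / inside, 1)
lines_outside = round(BLIT_ACCESSES / outside, 1)
print(inside, lines_inside, lines_outside)
```

With these inputs the script reproduces the 68 slots per line and the ~4.7 and ~1.8 scanline figures quoted above.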
#3 | |||
Natteravn
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,417
|
Only during WAITBLIT.
Quote:
Quote:
Quote:
|
#4 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 872
|
Quote:
Ah, if you enable blitter nasty in WAITBLIT, much of the analysis goes out the window. The "extra" cycles are only there if blitter nasty is off and you're wasting them waiting for the blitter to finish. Once you enable nasty, the CPU will effectively halt until the blit is done. However, before that, pure CPU cycles (i.e. ones that don't need to access memory, like for shifts/muls/divs) are essentially free, so there could still be a benefit in putting some work there.
For the memory cycles I also counted the looping part (not manually, I'm too lazy, I just threw the code into https://68kcounter.grahambates.com/ created by another EAB member). 1 nop = one memory access due to the instruction prefetch (and assuming no true fast RAM it will block the blitter w/ nasty off after waiting for 3 CCKs).
|
#5 | ||||
Natteravn
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,417
|
Quote:
Quote:
Code:
        macro   WAITBLIT
        move.w  #$8400,DMACON(a6)       ; set blitter nasty while waiting
.\@:    btst    #6,DMACONR(a6)          ; blitter still busy?
        bne.b   .\@
        move.w  #$0400,DMACON(a6)       ; clear blitter nasty again
        endm

Quote:
Quote:
#6 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 872
|
Quote:
The "free" memory cycles were based on the (wrong) assumption that blitter nasty was disabled (all the time), and that cycles were being wasted in the loop waiting for the blitter to finish. Since that's not the case, there isn't much gain in this case.
For my previous estimate: Simplifying things, let's assume only CPU and blitter access memory and the CPU is only executing instructions with no overhead (meaning memory access would normally happen every 4 CPU (7MHz) cycles, e.g. like a NOP). After you start the blit the access pattern will be CBBBB repeated. I.e. the CPU accesses memory, does some internal processing for one CCK (where the blitter has access), then requests the bus again and waits for 3 cycles (where the blitter has access) before being allowed to do its read. If you're doing real work with that CPU instruction that's great, but if it's JUST waiting for the blit to finish that's wasted work ("free cycles"). In this case every 5th cycle is wasted by waiting.
With display DMA active it gets worse, since without blitter nasty the CPU will still get ~1/5 of the memory cycles, and with 5 bitplanes active it looks more like CDBDBDDD (with D being display DMA), i.e. the blitter getting around the same number of memory accesses as the CPU. In this case (which doesn't apply to you) the memory cycles used by the CPU could be put to better use.
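The CBBBB pattern described above can be reproduced with a toy arbiter. This is a deliberately simplified sketch in Python, under the same assumptions as the post (only CPU and blitter on the bus, nasty off, one CCK of internal CPU work per access, the blitter yielding once the CPU has waited 3 CCKs); the function and its parameters are hypothetical and it is not a cycle-exact emulation.

```python
# Toy bus model for the simplified case described above.
# 'C' = CPU memory access, 'B' = blitter memory access, one letter per CCK.

def bus_pattern(ccks, internal=1, max_wait=3):
    """Return the slot owners for `ccks` consecutive CCKs.

    internal: CCKs of internal CPU work after each access (assumption: 1).
    max_wait: CCKs the CPU waits before the blitter yields (nasty off: 3).
    """
    slots = []
    cpu_turn = True      # the CPU gets the very first slot
    busy = 0             # internal CCKs left before the CPU requests the bus
    waited = 0           # CCKs the CPU has been waiting for the bus
    for _ in range(ccks):
        if cpu_turn:
            slots.append("C")
            busy, waited, cpu_turn = internal, 0, False
        else:
            slots.append("B")
            if busy > 0:
                busy -= 1            # CPU is not asking for the bus yet
            else:
                waited += 1          # CPU is blocked by the blitter
                if waited == max_wait:
                    cpu_turn = True  # blitter backs off for one slot
    return "".join(slots)


pattern = bus_pattern(20)
cpu_share = pattern.count("C") / len(pattern)
print(pattern, cpu_share)
```

For 20 CCKs this gives CBBBB four times over, i.e. the CPU ends up with one slot in five, which is the share the estimate relies on.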
|
#7 |
Natteravn
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,417
|
Oh! So you are basically saying that doing some more work with the CPU, while the Blitter is running, does not give me any advantage compared to doing that work after all Blitter activity has ceased, because Nasty-mode (during WAITBLIT) would then be entered later, consuming that potential advantage?
Hmm. Maybe there is still an advantage to do some more work with data from the current BOB structure, because setting up a second loop to go over the list of BOBs again, after all BOBs have been drawn, might be more expensive? |
#8 | ||
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 872
|
Quote:
I.e. if you have:
Blitter stuff
CPU stuff
Publish frame
Overlapping CPU & blitter is good, but if you have blitter stuff / publish frame / CPU stuff you have to be more careful.
Quote:
|
#9 |
Natteravn
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,417
|
Ok, thanks! That helps. Then I will add some more code to that loop.
#10 | |
German Translator
Join Date: Aug 2018
Location: Drübeck / Germany
Age: 49
Posts: 138
|
Quote:
|
|
#11 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 872
|
Simplifying a bit (and remember this only holds if blitter nasty is off, and stays off) and sticking outside the display area, and assuming no other DMA activity.
The CPU will access memory every 5th CCK. After the blitter is started the CPU does 29 memory accesses (instruction fetch, reading), and the blitter will have gotten 116 (29*4) memory accesses in that time, meaning it needs 204 (320-116) more to finish the blit. One CPU access for every four of those (204/4 = 51) will be given to the CPU.
Inside the display area the calculation is more complicated, since the blitter will back off (allow the CPU access) regardless of the reason the CPU was blocked, meaning more cycles will be given to the CPU relative to the blitter. Memory access from the CPU's point of view will still take ~5 CCK, but now it's not 4/5 going to the blitter, but only ~30% (~50% is used by bitplane DMA). So after the CPU has done its 29 memory accesses the blitter will have only done ~44 (every scanline the blitter gets ~68 vs. the CPU's ~45). For the remaining 276 (320-44) memory accesses the blitter needs to do, the CPU will get ~183 (276/(68/45)).
Probably made some mistakes and other faulty assumptions, but that's roughly how I arrived at those numbers.
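For reference, the two numbers can be recomputed from the figures in this post. The following is a sketch of that arithmetic only, in Python; the helper name and its parameter are illustrative, and it inherits every simplification made above (nasty off throughout, no other DMA, fixed blitter-to-CPU ratios).

```python
# Recomputing the "51 outside / 183 inside" estimate from the post above.
# Names are illustrative; the ratios are the ones assumed in the post.

BLIT_ACCESSES = 320      # memory accesses the blit needs in total
SETUP_ACCESSES = 29      # CPU memory accesses between BLTSIZE and WAITBLIT


def free_cpu_accesses(blitter_per_cpu):
    """Return (blitter accesses done during setup,
               blitter accesses remaining,
               CPU accesses available while the rest of the blit runs)."""
    done = round(SETUP_ACCESSES * blitter_per_cpu)
    remaining = BLIT_ACCESSES - done
    cpu_free = round(remaining / blitter_per_cpu)
    return done, remaining, cpu_free


# Outside the display area: four blitter accesses per CPU access (CBBBB).
outside = free_cpu_accesses(4)
# Inside the display area: ~68 blitter vs ~45 CPU accesses per scanline.
inside = free_cpu_accesses(68 / 45)
print(outside, inside)
```

The last element of each tuple is the number of CPU memory accesses that fit in before the blit completes, under these assumptions.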
#12 |
Moderator
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,455
|
You should enable BLTPRI if this is one of the largest jobs in your render loop. Toggling it during Blitter wait affects nothing (well, except making the Blitter wait code slightly longer to execute.)
You use all 3 sources, which means no CPU for you.
If you have a large CPU load, and it's compatible with any raster waits in the Copper (in your case nope or maybe), you may off-load the loading of register values to the Copper, else to a Blitter interrupt. In this case, you can disable BLTPRI to free ~25% of the Blitter's memory accesses for the CPU.
If BLTPRI=1 + optimal minterm, then the only memory access cycles available to the CPU are as specified per number of sources and minterm by the figure in HRM (collated here), as the pipeline fills and empties. In addition to this, you can fit 1 or 2 memory accesses for CPU internal instruction(s) before the Blitter wait (very difficult to make use of in practice).
#13 |
German Translator
Join Date: Aug 2018
Location: Drübeck / Germany
Age: 49
Posts: 138
|
Ok, I understand your calculation so far. Thanks.
But the question was whether there are still CPU cycles free while the BOB is being blitted. And without getting too specific now, a rough estimate could be the following:
In the example: 320 cycles for the blit and 118 CPU cycles without WAITBLIT.
The simple estimation would then be:
for CBBBB: 320/4=80 ; 80 x CPU cycles and 80*4 Blitter cycles ; Blitter free before CPU finishes
for CDBDBDDD: 320/2=160 ; 160 x CPU cycles and 160*2 Blitter cycles ; CPU finishes before Blitter
So the value is somewhere in between. To be more precise, the shares could be weighted by percentage.
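A minimal sketch of that bracketing, in Python. The weighting function is purely hypothetical: the post only says the two cases "could be weighted by percentage", so the linear blend and the `display_share` parameter (the fraction of the blit spent inside the display area) are assumptions made for illustration.

```python
# Bracketing the number of CPU slots available during a 320-access blit,
# following the two access patterns named above. Names are illustrative.

BLIT_ACCESSES = 320


def cpu_slots(blitter_slots_per_cpu_slot):
    """CPU slots that occur while the blitter does all of its accesses."""
    return BLIT_ACCESSES // blitter_slots_per_cpu_slot


lower = cpu_slots(4)    # CBBBB: four blitter slots per CPU slot
upper = cpu_slots(2)    # CDBDBDDD: two blitter slots per CPU slot


def weighted_estimate(display_share):
    """Hypothetical linear blend between the two cases.

    display_share: assumed fraction (0.0 to 1.0) of the blit that runs
    while bitplane DMA is active.
    """
    if not 0.0 <= display_share <= 1.0:
        raise ValueError("display_share must be between 0 and 1")
    return lower + (upper - lower) * display_share


print(lower, upper, weighted_estimate(0.5))
```

Any real figure depends on where in the frame the blit happens to run, so the blend is best treated as a starting point for measurement rather than a prediction.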
#14 |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 872
|
You're right that the value will be in between, but 118 CPU cycles is 59 CCKs, and only roughly half are memory cycles. Only memory accesses really matter here.
Your "CDBDBDDD" example will probably look like "CDBDCDDD" if the first CPU cycle was preparing for a memory access. This is the key point: if the CPU has been waiting to access memory for 3 cycles with blitter nasty off, it will "steal" the cycle from the blitter even if those cycles didn't go to the blitter.
And while it's not an issue for phx in this case, be careful about blindly enabling blitter nasty in your WAITBLIT macros, as it can (and likely will) delay interrupts until the blit has finished.
#15 | |||||
Natteravn
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,417
|
Quote:
But the CPU isn't idle. At least it has to check DMACONR for the Blitter-Done flag, which steals every third or fifth (or whatever) memory cycle from the Blitter. Or am I still misunderstanding?
Or were you saying I should better use BLTPRI from the beginning? That's the question. Is the loop in my initial post faster with BLTPRI always enabled (and taken out of the blitter wait)?
Quote:
Quote:
Quote:
Quote:
#16 |
German Translator
Join Date: Aug 2018
Location: Drübeck / Germany
Age: 49
Posts: 138
|
The question is: when the CPU code has finished, is the blitter already free again, or does it still have to be waited for? Waiting would be bad.
So save the blitter state (free or busy) the first time WAITBLIT checks it. If the blitter is still busy, fill in "useless" instructions before WAITBLIT until that state changes. That way you can determine the number of free cycles experimentally.
#17 | |||
Natteravn
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,417
|
Quote:
Quote:
Quote:
|
#18 | |
German Translator
Join Date: Aug 2018
Location: Drübeck / Germany
Age: 49
Posts: 138
|
Quote:
Or how do I know how many CPU memory access cycles I have?
Code:
Total cycles: ; from 68k counter
20(4/1)  8  move.w #$0000,$dff042
28(5/2) 10  move.l #START,$dff054
20(4/1)  8  move.w #$0000,$dff066
20(4/1)  8  move.w #(1*64)+$10,$dff058

>d
0002a708 33fc 0000 00df f066 move.w #$0000,$00dff066
0002a710 33fc 0050 00df f058 move.w #$0050,$00dff058
>t
Cycles: 14 Chip, 28 CPU. (V=80 H=72 -> V=80 H=86)
D0 00005000 D1 00010540 D2 00000000 D3 00000000
D4 00000000 D5 00000000 D6 00000000 D7 00000000
A0 00C028F6 A1 00010540 A2 00000000 A3 00000000
A4 00000000 A5 00000000 A6 00C028F6 A7 00C5F898
USP 00C5F898 ISP 00C60898
T=00 S=0 M=0 X=1 N=0 Z=0 V=0 C=0 IMASK=0 STP=0
Prefetch 33fc (MOVE) 0000 (OR) Chip latch 00000000
0002a708 33fc 0000 00df f066 move.w #$0000,$00dff066
Next PC: 0002a710
>t
Cycles: 10 Chip, 20 CPU. (V=80 H=86 -> V=80 H=96)
D0 00005000 D1 00010540 D2 00000000 D3 00000000
D4 00000000 D5 00000000 D6 00000000 D7 00000000
A0 00C028F6 A1 00010540 A2 00000000 A3 00000000
A4 00000000 A5 00000000 A6 00C028F6 A7 00C5F898
USP 00C5F898 ISP 00C60898
T=00 S=0 M=0 X=1 N=0 Z=1 V=0 C=0 IMASK=0 STP=0
Prefetch 33fc (MOVE) 0050 (OR) Chip latch 00000050
0002a710 33fc 0050 00df f058 move.w #$0050,$00dff058
Next PC: 0002a718
>v $50 !80
Line: 50 80 HPOS 50 80:
[50 80]   [51 81]   [52 82]   [53 83]   [54 84]   [55 85]   [56 86]   [57 87]
CPU-WW    BPL2 112  CPU-WW    BPL1 110  CPU-RW    BPL2 112  CPU-RW    BPL1 110
0001      0000      0508      0018      0000      0000      00DF      0000      ; move.w #$0000,$00dff066
00DFF054  0001A898  00DFF056  00015898  0002A70A  0001A89A  0002A70C  0001589A
235A0600  235A0800  235A0A00  235A0C00  235A0E00  235A1000  235A1200  235A1400

[58 88]   [59 89]   [5A 90]   [5B 91]   [5C 92]   [5D 93]   [5E 94]   [5F 95]
CPU-RW    BPL2 112  CPU-RW    BPL1 110  CPU-WW    BPL2 112  CPU-RW    BPL1 110
F066      0000      33FC      0000      0000      0000      0050      0000
0002A70E  0001A89C  0002A710  0001589C  00DFF066  0001A89E  0002A712  0001589E
235A1600  235A1800  235A1A00  235A1C00  235A1E00  235A2000  235A2200  235A2400

[60 96]   [61 97]   [62 98]   [63 99]   [64 100]  [65 101]  [66 102]  [67 103]
CPU-RW    BPL2 112  CPU-RW    BPL1 110  CPU-RW    BPL2 112  CPU-WW    BPL1 110
00DF      0000      F058      0000      0839      0000      0050      0000      ; move.w #$0050,$00dff058
0002A714  0001A8A0  0002A716  000158A0  0002A718  0001A8A2  00DFF058  000158A2
235A2600  235A2800  235A2A00  235A2C00  235A2E00  235A3000  235A3200  235A3400

0002a710 33fc 0050 00df f058 move.w #$0050,$00dff058 ; 20(4/1) 8 bytes

I count:
4x CPU-RW on 5A-33FC, 5E-0050, 60-00DF, 62-F058: that's 4
1x CPU-WW on 5C-0000: that's 1
That is 4+1=5, so 5 CCK in 5*2=10 chip cycles.
So there are 118 total cycles and 29 CCK for the phx routine (from after BLTSIZE up to WAITBLIT, according to 68k counter). And the 29 CCK are the relevant value in this view. Or?
|
#19 | ||
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 872
|
Quote:
Chip cycles are CCKs. On all Amigas any memory access that needs to go through Agnus (chip mem/slow mem and custom registers) always takes 2 CCKs, which is why you're seeing 10 CCKs for 5 memory accesses: 1 CCK for setup (free for DMA), and 1 for the actual access (CPU has exclusive access). For a 7MHz 68000 this works out nicely (not a coincidence), since it can only access memory every 4 CPU cycles (=2 CCKs) anyway. That's why the CPU can run at essentially full speed even with 4 bitplanes active in lores (the bitplane DMA is scheduled as -4-2-3-1), but will slow down with 5 or 6 (some of those -'s where the CPU is accessing memory go away). On a stock A1200 it's the same thing, but since the CPU can do long word memory accesses it can stay at (mostly) full speed, but it is *THE* bottleneck for accelerated Amigas (if they need to write to chip RAM).
Quote:
29 memory accesses is the relevant number since it's the bottleneck. That it would take 118 CPU cycles to complete is sort of incidental (but could factor in for more complicated scenarios). Don't know if that makes sense.
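The relationship between the timing notation and CCKs can be sketched as follows. This is an illustrative Python helper, not part of 68kcounter or WinUAE; it assumes the `cycles(reads/writes)` notation shown earlier in the thread, a 7 MHz 68000 (2 CPU cycles per CCK), and 2 CCKs per chip bus access with no DMA contention.

```python
import re

# Converts a timing string such as "20(4/1)" into the quantities discussed
# above. The function name and the returned keys are made up for illustration.


def parse_timing(text):
    """Split 'cycles(reads/writes)' into CPU cycles, CCKs and bus accesses."""
    match = re.fullmatch(r"(\d+)\((\d+)/(\d+)\)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised timing string: {text!r}")
    cycles, reads, writes = (int(group) for group in match.groups())
    accesses = reads + writes
    return {
        "cpu_cycles": cycles,
        "total_ccks": cycles // 2,   # 2 CPU cycles per CCK at 7 MHz
        "accesses": accesses,        # the number that competes with the blitter
        "bus_ccks": accesses * 2,    # 1 setup CCK + 1 access CCK each
    }


move_w = parse_timing("20(4/1)")
move_l = parse_timing("28(5/2)")
print(move_w)
print(move_l)
```

For `20(4/1)` this gives 5 memory accesses and 10 CCKs, in line with the explanation above; it is the `accesses` value, summed over the routine, that corresponds to the 29 used in the estimate.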
#20 |
German Translator
Join Date: Aug 2018
Location: Drübeck / Germany
Age: 49
Posts: 138
|
@Toni
I never understood this chip-cycle information completely, but now it's clear.
Code:
CPU cycles are, most of the time, twice the chip cycles with memory access:
12(3/0) 6 - lea bitplane,a0        3*2=6 Chip      12 CPU  ok
 8(2/0) 4 - and.w #$000f,d0        2*2=4 Chip       8 CPU  ok
12(2/1) 4 - move.w d0,$40(a5)      (2+1)*2=6 Chip  12 CPU  ok

but not always:
46(2/0) 4 - mulu #40,d0            2*2=4 Chip cycles with memory access
 8(1/0) 2 - add.w d1,a0            1*2=2 Chip
12(1/0) 2 - lsr.w #3,d1            1*2=2 Chip

e.g.
...
00024b54 c0fc 0028 mulu.w #$0028,d0
Next PC: 00024b58
>t
Cycles: 23 Chip, 46 CPU. (V=308 H=194 -> V=308 H=217)
VPOS: 308 ($134) HPOS: 217 ($0d9) COP: $0006a538

And maybe the R/W information could be added to the first line, and also summarized if more than one step is traced:
>t
Cycles: 46 CPU, 23 Chip (2/0). (V=308 H=194 -> V=308 H=217)
VPOS: 308 ($134) HPOS: 217 ($0d9) COP: $0006a538