17 March 2024, 20:17 | #1 |
ex. demoscener "Bigmama"
Join Date: Jun 2012
Location: Fyn / Denmark
Posts: 1,646
|
Assessing cpu cycles available on A500 with slow mem only
I was trying to make a ball-park assessment of whether a piece of code would be able to complete in a single PAL frame (20ms).
With only a little copper DMA (display setup) and a single lo-res bitplane, I would think that the 68k would run at full speed, i.e. not be held up by DMA. 7.09MHz * 20ms/frame ~ 141000 cpu cycles. If I added up the cycles of my code correctly, it seems I only get around 100k cycles before the code starts to take more than one frame. Is the above right, or am I completely off track here? |
17 March 2024, 21:21 | #2 |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 54
Posts: 4,501
|
No
Code:
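nops = 141456/4

        lea     $dff000,a6              ; custom chip base
        move.w  #$7fff,($9a,a6)         ; INTENA: disable all interrupts
        move.w  #$7fff,($96,a6)         ; DMACON: disable all DMA
        move.w  #$1a5c,($8e,a6)         ; DIWSTRT
        move.w  #$38c8,($90,a6)         ; DIWSTOP
        move.w  #$18,($92,a6)           ; DDFSTRT
        move.w  #$d8,($94,a6)           ; DDFSTOP
        move.w  #$1200,($100,a6)        ; BPLCON0: 1 bitplane, lo-res
        moveq   #0,d2
        move.w  d2,($180,a6)            ; COLOR00
        move.w  d2,($182,a6)            ; COLOR01
        move.w  d2,($140,a6)            ; SPR0POS
        move.w  d2,($142,a6)            ; SPR0CTL
        move.w  #$8300,($96,a6)         ; DMACON: enable bitplane DMA
        move.w  #$4e71,d0               ; NOP opcode
        move.w  #nops-3-1,d1
        lea     (code,pc),a0
        lea     (-8,a0),a1              ; a1 = the two colour writes before 'code'
.cn     move.w  d0,(a0)+                ; fill the buffer with nops-3 NOPs
        dbf     d1,.cn
        move.w  #$4ef9,(a0)+            ; JMP abs.l opcode
        move.l  a1,(a0)                 ; jump target: back to the colour writes
.line   cmpi.b  #$40,(6,a6)             ; wait for raster line $40
        bne.b   .line
        move.w  d1,($180,a6)            ; d1 = $ffff after dbf: marker colour
        move.w  d2,($180,a6)            ; back to black
code    dx.w    nops
|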
17 March 2024, 21:31 | #3 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,068
|
(227-4 for mem refresh)*312 is ~69.5k dma slots. 320x256 lores is 5120. This leaves you with ~64.5k or ~129k cycles. If your code contains instructions taking 6/10/14/... cycles you will lose 2 cycles each time there is a collision with bitplane dma.
Still too big of a difference, maybe your cycles aren't entirely accurate. When I'm doing estimates I count dma slots (memory reads/writes) and then multiply by 4, that's usually good enough. |
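The slot arithmetic in the post above can be reproduced with a few lines of Python. This is only a sketch of a/b's back-of-the-envelope budget: the constant and function names are made up for illustration, and the model treats every bitplane fetch as a slot lost to the CPU, which the later replies suggest is pessimistic for a low bitplane count.

```python
# Rough PAL frame budget following the figures quoted in the post.
# All names here are hypothetical; only the numbers come from the thread.
SLOTS_PER_LINE = 227     # DMA slots per raster line
REFRESH_SLOTS = 4        # slots per line taken by memory refresh
LINES_PER_FRAME = 312    # PAL lines, as used in the post
CYCLES_PER_SLOT = 2      # one DMA slot lasts two CPU cycles


def free_slots(words_per_line: int, display_lines: int, bitplanes: int) -> int:
    """Slots left after refresh and bitplane fetches are subtracted."""
    total = (SLOTS_PER_LINE - REFRESH_SLOTS) * LINES_PER_FRAME
    bitplane_fetches = words_per_line * display_lines * bitplanes
    return total - bitplane_fetches


# 320x256 lo-res, one bitplane: 20 words per line, 256 lines.
slots = free_slots(words_per_line=20, display_lines=256, bitplanes=1)
cycles = slots * CYCLES_PER_SLOT
print(slots, cycles)
```

Changing `bitplanes` or the display size gives a quick first guess for other setups, with the same caveat about pessimism.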
17 March 2024, 21:36 | #4 | |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 54
Posts: 4,501
|
Quote:
This in my opinion brings it much closer to 140K cycles. My stupid nops code does this and the ~141K cycles are respected.
EDIT: I add some considerations:
- we often tend to exclude the impact of IRQ calls (which add cycles for the call/setup and exit);
- Copper is a bit invasive, because the cycles are always alternated and therefore if the code is not in multiples of 4 and aligned it tends to cause cycles to be shifted forward;
- the more bitplanes you add, the more the possibility of collision increases and therefore the slowdown increases more than linearly;
- access to the CIAs is very slow.
This implies that in conditions of 'normal' game/demo code the a/b estimates are close to reality.
Last edited by ross; 17 March 2024 at 21:56. |
|
17 March 2024, 23:22 | #5 |
ex. demoscener "Bigmama"
Join Date: Jun 2012
Location: Fyn / Denmark
Posts: 1,646
|
Hmm.. thanks for confirming that it's probably my calculations which are wrong.
@a/b did you mean you count dma slots used and subtract from total to get cpu cycle estimate? |
18 March 2024, 00:18 | #6 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,068
|
Nope, just count them, and then multiply by 4 if you need the cycles or add with other dma usages to see if it's under ~70k, or use a metric that suits you.
E.g. +1 for opcode, +1 for each extra opcode word (#, offset, index, ...), +1 for word r/w, +2 for long r/w, etc. Much faster that way to get a good early estimate. |
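The counting rule above is easy to turn into a small helper. The sketch below is an assumption-laden illustration, not a/b's actual tool: the function and parameter names are invented, and it only counts memory accesses, so instructions with extra internal cycles (shifts, mul/div) come out too low.

```python
# Hypothetical helper for the "count slots, multiply by 4" estimate.
def estimate_slots(extension_words: int = 0,
                   word_accesses: int = 0,
                   long_accesses: int = 0) -> int:
    """+1 opcode, +1 per extension word, +1 per word r/w, +2 per long r/w."""
    return 1 + extension_words + word_accesses + 2 * long_accesses


def estimate_cycles(extension_words: int = 0,
                    word_accesses: int = 0,
                    long_accesses: int = 0) -> int:
    """Each counted slot is taken as 4 CPU cycles."""
    return 4 * estimate_slots(extension_words, word_accesses, long_accesses)


# nop: opcode only
nop_cycles = estimate_cycles()
# move.w #$7fff,($9a,a6): immediate word + displacement word + one word write
move_w_cycles = estimate_cycles(extension_words=2, word_accesses=1)
# move.l (a0)+,(a1)+: one long read + one long write
move_l_cycles = estimate_cycles(long_accesses=2)
print(nop_cycles, move_w_cycles, move_l_cycles)
```

For these three instructions the estimate happens to equal the documented 68000 timings (4, 16 and 20 cycles), which is why the method works well for code dominated by moves.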
19 March 2024, 02:10 | #7 |
Registered User
Join Date: Jan 2012
Location: USA
Posts: 373
|
The table here might be of interest to you: https://gist.githubusercontent.com/c...b3d2/Yacht.txt
Every 'n' or 'p' or 'r' etc. is two CPU cycles, which is the time it takes for one DMA slot. As you can see there are many 'np' 'nr' 'nw' type pairings. These types of pairings are very common and allow for interleaved DMA. Every 'n' is two CPU cycles where the CPU isn't using the data bus. Ideally DMA happens at that point. That's not always the case, though. You'll see some instructions with one or more 'n' states that aren't paired. These are the sorts of instructions a/b was talking about, where collisions might happen. The most common situation involves taken branches. |
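As a rough illustration of how that notation can be read, the sketch below counts two CPU cycles per letter and flags 'n' states that stand alone. It assumes the patterns are whitespace-separated groups of letters as described in the post; the function names and the example strings are written for illustration, not copied from the table.

```python
# Hypothetical reader for the microcycle notation described above.
def yacht_cycles(pattern: str) -> int:
    """Two CPU cycles for every letter in the pattern."""
    return 2 * sum(ch.isalpha() for ch in pattern)


def unpaired_idle(pattern: str) -> int:
    """Count 'n' states not paired with a bus access in the same group."""
    return sum(len(group) for group in pattern.split() if set(group) == {"n"})


# Illustrative patterns: a write plus prefetch, and a taken short branch.
for pattern in ("nw np", "n np np"):
    print(pattern, yacht_cycles(pattern), unpaired_idle(pattern))
```

A pattern with a non-zero unpaired count has a cycle total that is not a multiple of 4, which makes it a candidate for the 2-cycle collision penalty mentioned earlier in the thread.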
19 March 2024, 09:58 | #8 |
ex. demoscener "Bigmama"
Join Date: Jun 2012
Location: Fyn / Denmark
Posts: 1,646
|
thanks, guys.
the code I was assessing only used move, or, addq, lea and the occasional lsl, but there was indeed a 12-cycle or that I had mistakenly counted as an 8-cycle or. At least I now know the 140k-cycle ball-park isn't the culprit. |
20 March 2024, 03:54 | #9 |
Registered User
Join Date: Jan 2024
Location: Zagreb / Croatia
Posts: 11
|
A follow-up post coming up, but here's my exact numbers(****):
The following are regardless of resolution
1 bitplane: 141476 cycles
2 bitplane: 141476 cycles
3 bitplane: 141476 cycles
4 bitplane: 141476 cycles
You get these cycles even in giga 376x286 overscan by ross above. Some restrictions apply (*)(**)(***)
5 or 6 bitplanes?
standard PAL LORES:
5 bitplane 320x256: 120996
6 bitplane 320x256: 100516
320x200 PAL LORES (not NTSC!!):
5 bitplane 320x200: 125476
6 bitplane 320x200: 109476
376x286 PAL LORES not-your-dads-scrolling-overscan(C)ross:
5 bitplane 376x286 + scroll: 112876
6 bitplane 376x286 + scroll: 84276
(*) no sprites, no audio, no copper, no disk.
(**) Also, forget about your nice 14-clock instructions such as move.w -(a0),(a1)+ taking 14 cycles if Agnus is busy displaying a line, they will take 16 cycles as the 14th cycle will collide with Agnus fetching your bitplane data.
(***) The more bitplanes you have and the larger your display size, the more non-multiples of 4 cycles are forced to become multiples of 4 cycles (or even 8 cycles if 5 or 6 bpls) when Agnus is getting bitplanes
(****) Total overkill that OP didn't ask for, but we are coding on Amigas in 2024 |
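The 5 and 6 bitplane figures above follow a simple pattern: each fetch for a bitplane beyond the fourth costs the CPU 4 cycles. The sketch below is a model fitted to the posted numbers, not the poster's measurement code. The names are invented, and the 25 fetches per line for the 376-pixel scrolling overscan case is inferred from the totals rather than stated in the post.

```python
# Model fitted to the posted figures; names are hypothetical.
BASE_CYCLES = 141476          # posted figure for 1 to 4 bitplanes
CYCLES_LOST_PER_FETCH = 4     # inferred from the posted totals


def frame_cycles(bitplanes: int, fetches_per_line: int, lines: int) -> int:
    """CPU cycles per PAL frame left over for a given lo-res display."""
    extra_planes = max(0, bitplanes - 4)
    lost = CYCLES_LOST_PER_FETCH * fetches_per_line * lines * extra_planes
    return BASE_CYCLES - lost


# (label, fetches per line, displayed lines); 25 fetches is an inference.
displays = [("320x256", 20, 256), ("320x200", 20, 200), ("376x286+scroll", 25, 286)]
for label, fetches, lines in displays:
    for planes in (4, 5, 6):
        print(label, planes, frame_cycles(planes, fetches, lines))
```

The model says nothing about the extra 2-cycle penalties from footnotes (**) and (***); it only reproduces the totals for code made of 4-cycle-aligned instructions.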
20 March 2024, 06:27 | #10 |
Registered User
Join Date: Jul 2009
Location: Lala Land
Posts: 608
|
Looking forward to seeing the result of this exploration. Nice project.
|
20 March 2024, 10:18 | #11 |
Registered User
Join Date: Oct 2023
Location: London, UK
Posts: 124
|
Brilliant investigation work!
|
21 March 2024, 01:59 | #12 |
Registered User
Join Date: Jan 2012
Location: USA
Posts: 373
|
Great work! Love it.
I wonder, how badly are MULs and DIVs affected? I have this fantasy that one day someone will write 3d code that includes a math scheduler that tries to time the use of these instructions during bitplane fetch to minimize DMA collisions. These instructions have lots of internal cycles and require few memory accesses. Be interesting to see how these instructions are affected especially the 6 plane low res and 3 plane hires cases. Even full 4 plane hires would be interesting. |
21 March 2024, 15:17 | #13 |
German Translator
Join Date: Aug 2018
Location: Drübeck / Germany
Age: 49
Posts: 200
|
simple but great!
WinUAE Debugger results: only one breakpoint on the jmp
>g
Breakpoint 0 triggered.
Cycles: 70740 Chip, 141480 CPU. (V=68 H=158 -> V=67 H=74)
and with cycle exact
>g
Breakpoint 0 triggered.
Cycles: 71053 Chip, 142106 CPU. (V=253 H=85 -> V=253 H=87)
also the visual DMA-Debugger output very stable! very nice picture. |
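The first debugger figure can be cross-checked by tallying the loop in ross's test code by hand. The sketch below uses the standard 68000 timings for the three instruction forms involved (nop 4 cycles, move.w Dn,(d16,An) 12 cycles, jmp abs.l 12 cycles); the variable names are made up for illustration.

```python
# Hand tally of one pass through the NOP loop from post #2.
NOPS = 141456 // 4            # buffer size in words, from the listing
NOP_CYCLES = 4
MOVE_W_DN_D16AN_CYCLES = 12   # move.w d1,($180,a6) and move.w d2,($180,a6)
JMP_ABS_L_CYCLES = 12         # jmp back to the two colour writes

filled_nops = NOPS - 3        # the last three words hold the jmp and its address
loop_cycles = (2 * MOVE_W_DN_D16AN_CYCLES
               + filled_nops * NOP_CYCLES
               + JMP_ABS_L_CYCLES)
chip_slots = loop_cycles // 2  # one chip slot per two CPU cycles

print(loop_cycles, chip_slots)
```

The result matches the first measurement above (141480 CPU, 70740 Chip). The cycle-exact run reports more because it includes the time the CPU actually waits on the bus.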
21 March 2024, 17:08 | #14 | |||
Registered User
Join Date: Jan 2024
Location: Zagreb / Croatia
Posts: 11
|
Quote:
In case you missed it, this thread here contains the code that should give you a stable frame for an arbitrary PAL lores screen, including 5 or 6 bitplanes where you have contention between Agnus and the Motorola 68000.
Quote:
Hopefully it's easy to assemble my code, and it should be super easy to add a few muls and divs and their expected clock cycles from this 68000 instruction cycle count chart that I use all the time. So totally agree, as you said, CPU calc during bitplane DMA and blitter during non-display will get you maximum perf.
Quote:
Would love it if you also gave my code a spin, especially if you can check it out on a real Amiga and play with DIW/DDF settings and 5-6 bitplanes (I see from your previous posts that Ross and you are all over this stuff) |
|||