Assessing cpu cycles available on A500 with slow mem only

hooverphonique · 17 March 2024, 20:17

I was trying to do a ball-park assessment of whether a piece of code would be able to complete in a single PAL frame (20 ms).

With only a little copper dma (display setup) and a single lo-res bitplane, I would think that the 68k would run at full speed, i.e. not be tied up by dma.

7.09MHz * 20ms/frame ~ 141000 cpu cycles.

If I added up the cycles of my code correctly, it seems the limit is closer to 100k cycles before the code starts to take more than one frame.

Is the above right, or am I completely off track here?
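As a quick sanity check of the ball-park figure above, here is a minimal sketch of the arithmetic. It assumes the PAL 68000 clock of 7.09379 MHz and a nominal 20 ms frame; the names are made up for illustration.

```python
# Ball-park CPU cycle budget for one PAL frame.
# Assumptions: PAL A500 CPU clock of 7.09379 MHz, nominal 20 ms frame.
PAL_CPU_HZ = 7_093_790
FRAME_SECONDS = 0.020


def cycles_per_frame(cpu_hz=PAL_CPU_HZ, frame_seconds=FRAME_SECONDS):
    """Return the raw number of CPU cycles in one frame, ignoring all DMA."""
    return round(cpu_hz * frame_seconds)


budget = cycles_per_frame()
nop_equivalents = budget // 4  # a NOP takes 4 cycles

print(budget, nop_equivalents)
```

This ignores memory refresh and all other DMA, so it is an upper bound rather than a usable budget.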

ross · 17 March 2024, 21:21

Quote:

Originally Posted by hooverphonique

If I added up the cycles of my code correctly..

No

Code:

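nops	        =	141456/4		; loop body length in 4-cycle NOP units

		lea	$dff000,a6		; custom chip base
		move.w	#$7fff,($9a,a6)		; INTENA: disable all interrupts
		move.w	#$7fff,($96,a6)		; DMACON: disable all DMA
		move.w	#$1a5c,($8e,a6)		; DIWSTRT
		move.w	#$38c8,($90,a6)		; DIWSTOP
		move.w	#$18,($92,a6)		; DDFSTRT
		move.w	#$d8,($94,a6)		; DDFSTOP
		move.w	#$1200,($100,a6)	; BPLCON0: 1 bitplane, colour on
		moveq	#0,d2
		move.w	d2,($180,a6)		; COLOR00 = black
		move.w	d2,($182,a6)		; COLOR01 = black
		move.w	d2,($140,a6)		; SPR0POS = 0
		move.w	d2,($142,a6)		; SPR0CTL = 0
		move.w	#$8300,($96,a6)		; DMACON: enable master + bitplane DMA

		move.w	#$4e71,d0		; NOP opcode
		move.w	#nops-3-1,d1		; dbf runs count+1 times: nops-3 NOPs
		lea	(code,pc),a0
		lea	(-8,a0),a1		; a1 = the two colour writes before 'code'
.cn		move.w	d0,(a0)+		; fill the buffer with NOPs
		dbf	d1,.cn
		move.w	#$4ef9,(a0)+		; JMP abs.l opcode
		move.l	a1,(a0)			; jump target: back to the colour writes
.line	        cmpi.b	#$40,(6,a6)		; wait for raster line $40 (VHPOSR high byte)
		bne.b	.line
		move.w	d1,($180,a6)		; COLOR00 = $ffff (d1 after dbf): marker
		move.w	d2,($180,a6)		; back to black, then fall into the NOPs

code	        dx.w	nops			; room for nops-3 NOPs + JMP (3 words)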

a/b · 17 March 2024, 21:31

(227-4 for mem refresh)*312 is ~69.5k dma slots. 320x256 lores is 5120. This leaves you with ~64.5k or ~129k cycles. If your code contains instructions taking 6/10/14/... cycles you will lose 2 cycles each time there is a collision with bitplane dma.
Still too big of a difference, maybe your cycles aren't entirely accurate.
When I'm doing estimates I count dma slots (memory reads/writes) and then multiply by 4, that's usually good enough.
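The slot arithmetic in this post can be reproduced with a short sketch. It assumes 227 slots per line with 4 lost to memory refresh, and one bitplane word fetch per 16 pixels; the line count is a parameter (312 as in the post, 313 for a long PAL frame). The function name and defaults are illustrative only.

```python
# Reproduce the DMA-slot estimate: free slots per frame and the
# equivalent CPU cycles (2 CPU cycles per slot).
SLOTS_PER_LINE = 227
REFRESH_SLOTS_PER_LINE = 4


def free_slots(lines=312, bitplanes=1, words_per_line=20, display_lines=256):
    """Slots left after refresh and bitplane fetches (worst-case model)."""
    total = (SLOTS_PER_LINE - REFRESH_SLOTS_PER_LINE) * lines
    bitplane_dma = bitplanes * words_per_line * display_lines
    return total - bitplane_dma


slots_312 = free_slots()           # figures used in the post
slots_313 = free_slots(lines=313)  # long PAL frame
print(slots_312, slots_312 * 2)
print(slots_313, slots_313 * 2)
```

Note that this model charges every bitplane fetch against the CPU, so it is pessimistic for low bitplane counts where fetches can interleave with CPU accesses.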

ross · 17 March 2024, 21:36

Quote:

Originally Posted by a/b

(227-4 for mem refresh)*312 is ~69.5k dma slots. 320x256 lores is 5120. This leaves you with ~64.5k or ~129k cycles. If your code contains instructions taking 6/10/14/... cycles you will lose 2 cycles each time there is a collision with bitplane dma.
Still too big of a difference, maybe your cycles aren't entirely accurate.
When I'm doing estimates I count dma slots (memory reads/writes) and then multiply by 4, that's usually good enough.

It's 313 lines/frame, and a single active bitplane has negligible impact most of the time.
This in my opinion brings it much closer to 140Kcycles.

My stupid nops code does this and the ~141K cycles are respected.

EDIT
Some additional considerations:
- we often tend to ignore the impact of IRQ calls (which add cycles for the call/setup and exit);
- the Copper is a bit invasive: its cycles always alternate, so if the code is not in multiples of 4 cycles and aligned, it tends to cause CPU cycles to be shifted forward;
- the more bitplanes you add, the more likely collisions become, so the slowdown grows more than linearly;
- access to the CIAs is very slow.

This implies that for 'normal' game/demo code, a/b's estimates are close to reality.

hooverphonique · 17 March 2024, 23:22

Hmm.. thanks for confirming that it's probably my calculations which are wrong.

@a/b did you mean you count dma slots used and subtract from total to get cpu cycle estimate?

a/b · 18 March 2024, 00:18

Nope, just count them, and then multiply by 4 if you need the cycles, or add them to the other DMA usage to see if the total is under ~70k, or use a metric that suits you.
E.g. +1 for opcode, +1 for each extra opcode word (#, offset, index, ...), +1 for word r/w, +2 for long r/w, etc. Much faster that way to get a good early estimate.
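The counting rule above can be written down as a tiny estimator. This is only a sketch of the rule as described (one slot per opcode word, extension word and word access, two per long access, times 4 cycles); the function names and the example instructions are chosen for illustration.

```python
# Slot-counting estimate: every memory access is one slot, 4 CPU cycles each.
def estimate_slots(ext_words=0, word_accesses=0, long_accesses=0):
    """1 for the opcode, plus extension words, plus data reads/writes."""
    return 1 + ext_words + word_accesses + 2 * long_accesses


def estimate_cycles(ext_words=0, word_accesses=0, long_accesses=0):
    return 4 * estimate_slots(ext_words, word_accesses, long_accesses)


examples = {
    "or.w (a0),d0": estimate_cycles(word_accesses=1),    # read only
    "or.w d0,(a0)": estimate_cycles(word_accesses=2),    # read + write
    "lea (8,a0),a1": estimate_cycles(ext_words=1),       # offset word
    "move.l d0,(a0)": estimate_cycles(long_accesses=1),  # long write
}
for text, cyc in examples.items():
    print(f"{text:16s} ~{cyc} cycles")
```

For these four instructions the estimate happens to agree with the standard 68000 timing tables; instructions with extra internal cycles (shifts, multiplies, predecrement addressing) will come out lower than the real figure.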

mc6809e · 19 March 2024, 02:10

The table here might be of interest to you: https://gist.githubusercontent.com/c...b3d2/Yacht.txt

Every 'n' or 'p' or 'r' etc, is two CPU cycles which is the time it takes for one DMA slot.

As you can see there are many 'np' 'nr' 'nw' type pairings. These types of pairings are very common and allow for interleaved DMA. Every 'n' is two CPU cycles where the CPU isn't using the data bus. Ideally DMA happens at that point.

That's not always the case, though. You'll see some instructions with one or more 'n' states that aren't paired. These are the sorts of instructions a/b was talking about, where collisions might happen. The most common situation involves taken branches.
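To make the notation concrete, here is a rough sketch that counts cycles from a string in that style, taking each letter as 2 CPU cycles as described above. It only handles plain space-separated tokens, and the second example string is an illustrative sequence for a 14-cycle instruction, not something copied from the table.

```python
# Count cycles from a micro-cycle string: each letter is 2 CPU cycles.
def cycles(seq):
    return 2 * sum(len(tok) for tok in seq.split())


def unpaired_idle_states(seq):
    """Count 'n' states that are not paired with a bus access."""
    return sum(len(tok) for tok in seq.split() if set(tok) == {"n"})


nop_like = "np"             # a single program fetch
unpaired = "n nr nw np"     # illustrative: one lone 'n' before the accesses

print(cycles(nop_like), unpaired_idle_states(nop_like))
print(cycles(unpaired), unpaired_idle_states(unpaired))
```

A sequence whose total is not a multiple of 4 (such as the second one) is the kind that can lose 2 cycles when it meets bitplane DMA.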

hooverphonique · 19 March 2024, 09:58

thanks, guys.

the code I was assessing was only using move, or, addq, lea and the occasional lsl, but there was indeed a 12-cycle 'or' that I had mistakenly counted as an 8-cycle 'or'

at least I now know the 140k cycle ball-park isn't the culprit.

!ZAJC! · 20 March 2024, 03:54

A follow-up post is coming up, but here are my exact numbers (****):

The following hold regardless of resolution:
1 bitplane: 141476 cycles
2 bitplanes: 141476 cycles
3 bitplanes: 141476 cycles
4 bitplanes: 141476 cycles

You get these cycles even in giga 376x286 overscan by ross above.
Some restrictions apply (*)(**)(***)

5 or 6 bitplanes?
standard PAL LORES:
5 bitplane 320x256: 120996
6 bitplane 320x256: 100516

320x200 PAL LORES (not NTSC!!):
5 bitplane 320x200: 125476
6 bitplane 320x200: 109476

376x286 PAL LORES not-your-dads-scrolling-overscan(C)ross:
5 bitplane 376x286 + scroll: 112876
6 bitplane 376x286 + scroll: 84276

(*) no sprites, no audio, no copper, no disk.
(**) Also, forget about your nice 14-cycle instructions such as move.w -(a0),(a1)+ taking 14 cycles while Agnus is busy displaying a line; they will take 16 cycles, as the 14th cycle will collide with Agnus fetching your bitplane data.
(***) The more bitplanes you have and the larger your display size, the more non-multiples of 4 cycles are forced to become multiples of 4 cycles (or even 8 cycles if 5 or 6 bpls) when Agnus is getting bitplanes.

(****) Total overkill that OP didn't ask for, but we are coding on Amigas in 2024
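The posted 5- and 6-bitplane figures all fit one simple pattern, sketched below. This is a fit inferred from the numbers in this post, not a formula stated by the author: it assumes a loss of 4 CPU cycles per fetched word per line with 5 bitplanes and 8 with 6, with 20 words per line for a 320-wide display and 25 for the 376-wide scrolling overscan. All names are illustrative.

```python
# Inferred fit: budget = base - lines * words_per_line * penalty(bitplanes)
BASE_CYCLES = 141476           # figure posted for 1 to 4 bitplanes
PENALTY_PER_WORD = {5: 4, 6: 8}  # CPU cycles lost per fetched word (assumed)


def frame_budget(bitplanes, display_lines, words_per_line):
    penalty = PENALTY_PER_WORD.get(bitplanes, 0)
    return BASE_CYCLES - display_lines * words_per_line * penalty


posted = {
    (5, 256, 20): 120996,
    (6, 256, 20): 100516,
    (5, 200, 20): 125476,
    (6, 200, 20): 109476,
    (5, 286, 25): 112876,
    (6, 286, 25): 84276,
}
for args, expected in posted.items():
    print(args, frame_budget(*args), expected)
```

The same restrictions as in the post apply (no sprites, audio, copper or disk DMA), and the penalty only models bitplane contention.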

copse · 20 March 2024, 06:27

Looking forward to seeing the result of this exploration. Nice project.

reassembler · 20 March 2024, 10:18

Brilliant investigation work!

mc6809e · 21 March 2024, 01:59

Quote:

Originally Posted by !ZAJC!

A follow-up post coming up, but here's my exact numbers(****):

Great work! Love it.

I wonder, how badly are MULs and DIVs affected?

I have this fantasy that one day someone will write 3d code that includes a math scheduler that tries to time the use of these instructions during bitplane fetch to minimize DMA collisions.

These instructions have lots of internal cycles and require few memory accesses.

Be interesting to see how these instructions are affected especially the 6 plane low res and 3 plane hires cases.

Even full 4 plane hires would be interesting.

Rock'n Roll · 21 March 2024, 15:17

Quote:

Originally Posted by ross

My stupid nops code does this and the ~141K cycles are respected.

simple but great!

WinUAE Debugger results:
only one breakpoint on the jmp
>g
Breakpoint 0 triggered.
Cycles: 70740 Chip, 141480 CPU. (V=68 H=158 -> V=67 H=74)

and with cycle exact
>g
Breakpoint 0 triggered.
Cycles: 71053 Chip, 142106 CPU. (V=253 H=85 -> V=253 H=87)

also the visual DMA-Debugger output very stable! very nice picture.
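The chip cycle counts in the debugger output can be cross-checked against the reported beam positions. The sketch below assumes a frame of 313 lines with 227 slots each and 2 CPU cycles per slot; the function name and the full_frames parameter are made up for this example.

```python
# Convert a pair of beam positions (V, H) into elapsed chip cycles.
SLOTS_PER_LINE = 227
LINES_PER_FRAME = 313
FRAME_SLOTS = SLOTS_PER_LINE * LINES_PER_FRAME


def chip_cycles_between(start, end, full_frames=0):
    """start/end are (V, H) tuples; full_frames counts whole extra frames."""
    (v0, h0), (v1, h1) = start, end
    delta = ((v1 - v0) * SLOTS_PER_LINE + (h1 - h0)) % FRAME_SLOTS
    return delta + full_frames * FRAME_SLOTS


first = chip_cycles_between((68, 158), (67, 74))
second = chip_cycles_between((253, 85), (253, 87), full_frames=1)

print(first, first * 2)    # compare with "70740 Chip, 141480 CPU"
print(second, second * 2)  # compare with "71053 Chip, 142106 CPU"
```

Under these assumptions the first run is slightly less than one frame, while the cycle-exact run is one full frame plus 2 slots.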

!ZAJC! · 21 March 2024, 17:08

Quote:

Originally Posted by copse

Looking forward to seeing the result of this exploration. Nice project.

Thank you, folks, for the encouragement.

In case you missed it, this thread here contains the code that should give you a stable frame for arbitrary PAL lores screens, including 5 or 6 bitplanes, where you have contention between Agnus and the Motorola 68000.

Quote:

Originally Posted by mc6809e

I wonder, how badly are MULs and DIVs affected?

Be interesting to see how these instructions are affected especially the 6 plane low res and 3 plane hires cases.

The 6-plane lores case should be super easy to check, but I expect mul and div to be virtually unaffected, especially with operands in registers.

Hopefully it's easy to assemble my code, and it should be super easy to add a few muls and divs and their expected clock cycles from this 68000 instruction cycle count chart that I use all the time.

So totally agree, as you said, CPU calc during bitplane DMA and blitter during non-display will get you maximum perf.

Quote:

Originally Posted by Rock'n Roll

simple but great!
Cycles: 71053 Chip, 142106 CPU. (V=253 H=85 -> V=253 H=87)

also the visual DMA-Debugger output very stable! very nice picture.

I agree, Ross' code was super inspiring and set me out to try and chase a closed-form non-empirical formula for CPU cycles.

Would love it if you also gave my code a spin, especially if you can check it out on a real Amiga and play with DIW/DDF settings and 5-6 bitplanes

(I see from your previous posts that Ross and you are all over this stuff)
