English Amiga Board


Old 17 March 2024, 20:17   #1
hooverphonique
ex. demoscener "Bigmama"
 
Join Date: Jun 2012
Location: Fyn / Denmark
Posts: 1,624
Assessing cpu cycles available on A500 with slow mem only

I was trying to do a ball-park assessment of whether a piece of code would be able to complete in a single PAL frame (20 ms).


With only a little Copper DMA (display setup) and a single lo-res bitplane, I would think that the 68k would run at full speed, i.e. not be held up by DMA.


7.09MHz * 20ms/frame ~ 141000 cpu cycles.


If I added up the cycles of my code correctly, it seems to be closer to 100k cycles before it starts to take more than one frame.



Is the above right, or am I completely off track here?
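
The budget above is just clock rate times frame time, which can be sketched in Python. The 7.09 MHz and 20 ms figures come from the post; the more precise PAL clock of 7,093,790 Hz and the function name are assumptions added for illustration.

```python
# Rough CPU cycle budget for one frame: clock rate times frame duration.
# Integer arithmetic is used so the results are exact.

def cycles_per_frame(cpu_hz, frame_ms=20):
    """CPU cycles elapsing in frame_ms milliseconds at cpu_hz."""
    return cpu_hz * frame_ms // 1000

# Rounded clock as quoted in the post.
print(cycles_per_frame(7_090_000))
# Assumed exact PAL 68000 clock (7.09379 MHz).
print(cycles_per_frame(7_093_790))
```

Either way the result is a little under 142k cycles, so the ~141k ball-park is a reasonable starting point before any DMA is taken into account.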
Old 17 March 2024, 21:21   #2
ross
Defendit numerus
 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,474
Quote:
Originally Posted by hooverphonique
If I added up the cycles of my code correctly..
No

Code:
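nops	        =	141456/4		; frame budget expressed in 4-cycle NOPs

		lea	$dff000,a6		; custom chip register base
		move.w	#$7fff,($9a,a6)		; INTENA: disable all interrupts
		move.w	#$7fff,($96,a6)		; DMACON: disable all DMA
		move.w	#$1a5c,($8e,a6)		; DIWSTRT
		move.w	#$38c8,($90,a6)		; DIWSTOP
		move.w	#$18,($92,a6)		; DDFSTRT
		move.w	#$d8,($94,a6)		; DDFSTOP
		move.w	#$1200,($100,a6)	; BPLCON0: one bitplane, colour on
		moveq	#0,d2
		move.w	d2,($180,a6)		; COLOR00 = black
		move.w	d2,($182,a6)		; COLOR01 = black
		move.w	d2,($140,a6)		; SPR0POS
		move.w	d2,($142,a6)		; SPR0CTL
		move.w	#$8300,($96,a6)		; DMACON: master enable + bitplane DMA

		move.w	#$4e71,d0		; NOP opcode
		move.w	#nops-3-1,d1		; dbf count: writes nops-3 NOPs
		lea	(code,pc),a0
		lea	(-8,a0),a1		; jump target: the two COLOR00 writes below
.cn		move.w	d0,(a0)+		; fill the buffer with NOPs
		dbf	d1,.cn
		move.w	#$4ef9,(a0)+		; JMP abs.l opcode (12 cycles = 3 NOPs)
		move.l	a1,(a0)			; ...back to the colour flash
.line	        cmpi.b	#$40,(6,a6)		; wait until VHPOSR high byte = $40
		bne.b	.line
		move.w	d1,($180,a6)		; d1 = $ffff after dbf: white flash
		move.w	d2,($180,a6)		; back to black

code	        dx.w	nops			; NOP buffer plus room for the JMP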
Attached Files
File Type: 68k frame_cycles.68k (152 Bytes, 28 views)
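
As a cross-check, the nominal length of one pass through the loop built by the listing can be added up by hand. This is a sketch assuming the standard 68000 timings with no DMA contention (NOP 4 cycles, JMP abs.l 12, move.w Dn,(d16,An) 12) and assuming the JMP lands on the first of the two COLOR00 writes; the Python names are made up for illustration.

```python
# Nominal cycle count of one pass through the self-built NOP loop.
NOP_CYCLES = 4
JMP_ABS_L_CYCLES = 12
MOVE_W_DN_D16AN_CYCLES = 12   # move.w d1,($180,a6) and move.w d2,($180,a6)

def loop_cycles(nops=141456 // 4):
    # dbf is loaded with nops-3-1, so it stores nops-3 NOP opcodes;
    # the remaining 3 words of the buffer hold the JMP abs.l.
    nop_count = nops - 3
    return (nop_count * NOP_CYCLES
            + JMP_ABS_L_CYCLES
            + 2 * MOVE_W_DN_D16AN_CYCLES)

print(loop_cycles())
```

Under those assumptions the loop comes to 141480 cycles, which is the same CPU figure the WinUAE debugger reports for the non-cycle-exact run later in the thread.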
Old 17 March 2024, 21:31   #3
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,039
(227-4 for mem refresh)*312 is ~69.5k dma slots. 320x256 lores is 5120. This leaves you with ~64.5k or ~129k cycles. If your code contains instructions taking 6/10/14/... cycles you will lose 2 cycles each time there is a collision with bitplane dma.
Still too big of a difference, maybe your cycles aren't entirely accurate.
When I'm doing estimates I count dma slots (memory reads/writes) and then multiply by 4, that's usually good enough.
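
The slot arithmetic in this post can be written out as a small Python sketch. The numbers are the ones given above (227 slots per line, 4 for refresh, 312 lines, 20 words per line for a 320-pixel lores bitplane over 256 lines); the function and parameter names are illustrative only.

```python
# DMA slot budget per frame, following the arithmetic in the post.

def slot_budget(lines=312, slots_per_line=227, refresh_slots=4,
                words_per_line=20, display_lines=256, planes=1):
    total = (slots_per_line - refresh_slots) * lines
    bitplane = words_per_line * display_lines * planes
    free = total - bitplane
    # One DMA slot lasts two CPU cycles.
    return total, bitplane, free, free * 2

total, bitplane, free, cpu = slot_budget()
print(total, bitplane, free, cpu)
```

With the default arguments this gives 69576 slots in total, 5120 for the bitplane, 64456 left over and 128912 CPU cycles, i.e. the ~69.5k, ~64.5k and ~129k quoted above.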
Old 17 March 2024, 21:36   #4
ross
Defendit numerus
 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,474
Quote:
Originally Posted by a/b
(227-4 for mem refresh)*312 is ~69.5k dma slots. 320x256 lores is 5120. This leaves you with ~64.5k or ~129k cycles. If your code contains instructions taking 6/10/14/... cycles you will lose 2 cycles each time there is a collision with bitplane dma.
Still too big of a difference, maybe your cycles aren't entirely accurate.
When I'm doing estimates I count dma slots (memory reads/writes) and then multiply by 4, that's usually good enough.
313 lines/frame, and the single active bitplane has negligible impact most of the time.
In my opinion this brings it much closer to 140K cycles.

My stupid NOP code does this, and the ~141K cycles are respected.

EDIT
I add some considerations:
- we often tend to exclude the impact of IRQ calls (which add cycles for the call/setup and exit);
- the Copper is a bit invasive, because its cycles always alternate, so code that is not in multiples of 4 cycles and aligned tends to have its cycles shifted forward;
- the more bitplanes you add, the greater the chance of collisions, so the slowdown grows more than linearly;
- access to the CIAs is very slow..

This implies that for 'normal' game/demo code, a/b's estimates are close to reality.

Last edited by ross; 17 March 2024 at 21:56.
Old 17 March 2024, 23:22   #5
hooverphonique
ex. demoscener "Bigmama"
 
Join Date: Jun 2012
Location: Fyn / Denmark
Posts: 1,624
Hmm.. thanks for confirming that it's probably my calculations which are wrong.

@a/b did you mean you count dma slots used and subtract from total to get cpu cycle estimate?
Old 18 March 2024, 00:18   #6
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,039
Nope, just count them, and then multiply by 4 if you need the cycles, or add them to the other DMA usage to see if the total is under ~70k, or use a metric that suits you.
E.g. +1 for opcode, +1 for each extra opcode word (#, offset, index, ...), +1 for word r/w, +2 for long r/w, etc. Much faster that way to get a good early estimate.
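
The counting rule above is easy to mechanise. A minimal Python sketch follows; the function name and the example instructions are chosen here for illustration, and instructions with extra internal cycles (shifts, multiplies and so on) will be undercounted by this method.

```python
# Early estimate: count memory accesses, then multiply by 4 for CPU cycles.

def estimate(ext_words=0, word_accesses=0, long_accesses=0):
    """Return (memory accesses, estimated cycles) for one instruction.

    +1 for the opcode word, +1 per extension word (immediate, offset,
    index), +1 per word read/write, +2 per long read/write.
    """
    accesses = 1 + ext_words + word_accesses + 2 * long_accesses
    return accesses, accesses * 4

print(estimate())                              # nop
print(estimate(ext_words=2, word_accesses=1))  # move.w #$7fff,($9a,a6)
print(estimate(long_accesses=2))               # move.l (a0)+,(a1)+
```

For these three examples the estimate agrees with the documented 68000 timings of 4, 16 and 20 cycles.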
Old 19 March 2024, 02:10   #7
mc6809e
Registered User
 
Join Date: Jan 2012
Location: USA
Posts: 372
The table here might be of interest to you: https://gist.githubusercontent.com/c...b3d2/Yacht.txt

Every 'n' or 'p' or 'r' etc, is two CPU cycles which is the time it takes for one DMA slot.

As you can see there are many 'np' 'nr' 'nw' type pairings. These types of pairings are very common and allow for interleaved DMA. Every 'n' is two CPU cycles where the CPU isn't using the data bus. Ideally DMA happens at that point.

That's not always the case, though. You'll see some instructions with one or more 'n' states that aren't paired. These are the sorts of instructions a/b was talking about, where collisions might happen. The most common situation involves taken branches.
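
Following the description above (every letter is two CPU cycles, and an 'n' with no bus access paired to it is a potential collision point), a timing string can be tallied with a few lines of Python. This is only a sketch: the function name is made up, the example strings are written here for illustration rather than copied from the table, and any other symbols the table uses are not handled.

```python
# Tally a Yacht-style timing string: 2 CPU cycles per letter,
# and count the idle 'n' states that have no bus access paired with them.

def tally(pattern):
    total_cycles = 0
    unpaired_idle = 0
    for token in pattern.split():
        total_cycles += 2 * len(token)
        if set(token) == {"n"}:        # "n" or "nn": idle only, no access
            unpaired_idle += len(token)
    return total_cycles, unpaired_idle

print(tally("np nw"))    # two paired accesses
print(tally("np n"))     # one access plus an unpaired idle state
print(tally("n np np"))  # leading unpaired idle state
```

Strings whose total is not a multiple of 4 are the ones that can end up misaligned with the DMA slots.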
Old 19 March 2024, 09:58   #8
hooverphonique
ex. demoscener "Bigmama"
 
Join Date: Jun 2012
Location: Fyn / Denmark
Posts: 1,624
thanks, guys.

the code I was assessing only used move, or, addq, lea and the occasional lsl, and it was indeed a 12-cycle OR that I had mistakenly counted as an 8-cycle OR

at least I now know the 140k cycle ball-park isn't the culprit.
Old 20 March 2024, 03:54   #9
!ZAJC!
Registered User
 
Join Date: Jan 2024
Location: Zagreb / Croatia
Posts: 11
A follow-up post is coming up, but here are my exact numbers(****):

The following are regardless of resolution
1 bitplane: 141476 cycles
2 bitplane: 141476 cycles
3 bitplane: 141476 cycles
4 bitplane: 141476 cycles

You get these cycles even in the giga 376x286 overscan used by ross above.
Some restrictions apply (*)(**)(***)

5 or 6 bitplanes?
standard PAL LORES:
5 bitplane 320x256: 120996
6 bitplane 320x256: 100516

320x200 PAL LORES (not NTSC!!):
5 bitplane 320x200: 125476
6 bitplane 320x200: 109476

376x286 PAL LORES not-your-dads-scrolling-overscan(C)ross:
5 bitplane 376x286 + scroll: 112876
6 bitplane 376x286 + scroll: 84276

(*) no sprites, no audio, no copper, no disk.
(**) Also, forget about your nice 14-cycle instructions such as move.w -(a0),(a1)+: if Agnus is busy displaying a line they will take 16 cycles, as the 14th cycle collides with Agnus fetching your bitplane data.
(***) The more bitplanes you have and the larger your display size, the more non-multiples of 4 cycles are forced to become multiples of 4 cycles (or even 8 cycles with 5 or 6 bitplanes) while Agnus is fetching bitplanes.

(****) Total overkill that OP didn't ask for, but we are coding on Amigas in 2024
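
The 5 and 6 bitplane figures above happen to fit a simple pattern: the 141476-cycle baseline minus a fixed cost per fetched word per displayed line. The Python sketch below reproduces the posted numbers under that reading. The 4 and 8 cycle penalties, the 20 and 25 words per line, and all names are inferred from the table for illustration; they are not taken from !ZAJC!'s code.

```python
# Fit to the posted table: baseline minus a per-word, per-line penalty.
BASE_CYCLES = 141476
PENALTY_PER_WORD = {5: 4, 6: 8}   # CPU cycles lost per fetched word (inferred)

def cpu_cycles(planes, words_per_line, lines):
    # 1 to 4 lores bitplanes cost nothing in the posted figures.
    penalty = PENALTY_PER_WORD.get(planes, 0)
    return BASE_CYCLES - lines * words_per_line * penalty

for planes in (4, 5, 6):
    print(planes,
          cpu_cycles(planes, 20, 256),   # 320x256
          cpu_cycles(planes, 20, 200),   # 320x200
          cpu_cycles(planes, 25, 286))   # 376x286 + scroll (25 words assumed)
```

That the same two penalties match all six posted values suggests the loss scales with the number of words fetched, but the closed form itself is only an observation about the table.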
Old 20 March 2024, 06:27   #10
copse
Registered User
 
Join Date: Jul 2009
Location: Lala Land
Posts: 522
Looking forward to seeing the result of this exploration. Nice project.
Old 20 March 2024, 10:18   #11
reassembler
Registered User
 
Join Date: Oct 2023
Location: London, UK
Posts: 92
Brilliant investigation work!
Old 21 March 2024, 01:59   #12
mc6809e
Registered User
 
Join Date: Jan 2012
Location: USA
Posts: 372
Quote:
Originally Posted by !ZAJC!
A follow-up post coming up, but here's my exact numbers(****):
Great work! Love it.

I wonder, how badly are MULs and DIVs affected?

I have this fantasy that one day someone will write 3d code that includes a math scheduler that tries to time the use of these instructions during bitplane fetch to minimize DMA collisions.

These instructions have lots of internal cycles and require few memory accesses.

Be interesting to see how these instructions are affected especially the 6 plane low res and 3 plane hires cases.

Even full 4 plane hires would be interesting.
Old 21 March 2024, 15:17   #13
Rock'n Roll
German Translator
 
Join Date: Aug 2018
Location: Drübeck / Germany
Age: 49
Posts: 185
Quote:
Originally Posted by ross
My stupid NOP code does this, and the ~141K cycles are respected.
simple but great!

WinUAE Debugger results:
only one breakpoint on the jmp
>g
Breakpoint 0 triggered.
Cycles: 70740 Chip, 141480 CPU. (V=68 H=158 -> V=67 H=74)

and with cycle exact
>g
Breakpoint 0 triggered.
Cycles: 71053 Chip, 142106 CPU. (V=253 H=85 -> V=253 H=87)

also the visual DMA-Debugger output very stable! very nice picture.
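
The two debugger readings can be cross-checked with a little arithmetic. In both, the CPU figure is exactly twice the chip figure, and the cycle-exact chip count sits within 2 of a full PAL frame if one assumes 313 lines of 227 colour clocks each (the line length is an assumption here, not something stated in the output). A short Python sketch, with illustrative names:

```python
# Cross-check of the WinUAE debugger figures quoted above.
CPU_PER_CHIP = 2
PAL_LINES = 313
CCK_PER_LINE = 227     # assumed colour clocks per PAL line

def chip_to_cpu(chip_cycles):
    return chip_cycles * CPU_PER_CHIP

def frame_chip_cycles(lines=PAL_LINES, cck=CCK_PER_LINE):
    return lines * cck

for label, chip in (("default", 70740), ("cycle exact", 71053)):
    # Print CPU cycles and the distance from one assumed full frame.
    print(label, chip_to_cpu(chip), chip - frame_chip_cycles())
```

On that reading the cycle-exact run shows the loop taking one whole frame per pass, while the other run reports only the nominal instruction timings.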
Old 21 March 2024, 17:08   #14
!ZAJC!
Registered User
 
Join Date: Jan 2024
Location: Zagreb / Croatia
Posts: 11
Quote:
Originally Posted by copse
Looking forward to seeing the result of this exploration. Nice project.
Thank you folks for encouragement

In case you missed it, this thread here contains the code that should give you a stable frame for arbitrary PAL lores screens, including 5 or 6 bitplanes where you have contention between Agnus and the Motorola 68000.

Quote:
Originally Posted by mc6809e
I wonder, how badly are MULs and DIVs affected?

Be interesting to see how these instructions are affected especially the 6 plane low res and 3 plane hires cases.
The 6-plane lores case should be super easy to check, but I expect mul and div to be virtually unaffected, especially with operands in registers.

Hopefully it's easy to assemble my code, and it should be super easy to add a few muls and divs and their expected clock cycles from this 68000 instruction cycle count chart that I use all the time.

So totally agree, as you said, CPU calc during bitplane DMA and blitter during non-display will get you maximum perf.

Quote:
Originally Posted by Rock'n Roll
simple but great!
Cycles: 71053 Chip, 142106 CPU. (V=253 H=85 -> V=253 H=87)

also the visual DMA-Debugger output very stable! very nice picture.
I agree, Ross' code was super inspiring and set me out to try and chase a closed-form non-empirical formula for CPU cycles.

Would love it if you also gave my code a spin, especially if you can check it out on a real Amiga and play with DIW/DDF settings and 5-6 bitplanes (I see from your previous posts that Ross and you are all over this stuff)