English Amiga Board


23 August 2017, 18:39   #1
amilo3438
Trying to measure CPU cycles per instruction (A500)

I am trying to measure CPU cycles per instruction using ASM-One V1.20 in the WinUAE A500 quickstart configuration.

Here is the program:

        move.l  #$dff006,a0     ;VHPOSR
        move.w  (a0),d0         ;start_value
        move.w  (a0),d1         ;end_value
        move.w  d1,d2
        sub.w   d0,d2           ;diff_value
        rts


The result is d2=00000004.

So, if the resolution of the VHPOSR horizontal counter is about 280 ns (= 2 x 140 ns CPU cycle time),
the value in d2 corresponds to 8 CPU cycles (or 2 bus cycles).
That means the instruction "move.w (a0),d1" takes 8 CPU cycles (or 2 bus cycles).

In this way it would be possible to "measure" the cycles of any instruction placed in between, e.g.:

        move.l  #$dff006,a0     ;VHPOSR
        move.w  (a0),d0         ;start_value
        nop                     ;unknown_value
        move.w  (a0),d1         ;end_value (we now know this takes $00000004)
        move.w  d1,d2           ;end_value to d2
        sub.w   d0,d2           ;end_value - start_value
        sub.w   #$4,d2          ;d2 - $00000004 = unknown_value
        rts


The first result is d2=00000006, but since we already know the last instruction takes $00000004,
the final result is d2=00000002, which corresponds to 4 CPU cycles (or 1 bus cycle).
That means the instruction "nop" takes 4 CPU cycles (or 1 bus cycle).
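
A minimal sketch of the same idea, with the read overhead measured at run time instead of being hardcoded as 4. It assumes a 7 MHz 68000 in an A500, that all reads fall on the same raster line (so the horizontal byte of VHPOSR does not wrap), and that no DMA or interrupt steals cycles in between. The register choices are arbitrary.

```asm
        lea     $dff006,a0      ; VHPOSR: high byte = V7-V0, low byte = H8-H1

        move.w  (a0),d0         ; calibration: two back-to-back reads
        move.w  (a0),d1
        sub.w   d0,d1           ; d1 = overhead of one read, in color clocks

        move.w  (a0),d0         ; start_value
        nop                     ; instruction under test
        move.w  (a0),d2         ; end_value
        sub.w   d0,d2           ; d2 = elapsed color clocks, incl. one read
        sub.w   d1,d2           ; remove the calibrated read overhead
        add.w   d2,d2           ; 1 color clock = 2 CPU cycles at 7 MHz
        rts                     ; d2 = CPU cycles of the tested instruction
```

With "nop" as the tested instruction, the figures quoted above (6 and 4 color clocks) would leave 2 color clocks, i.e. 4 CPU cycles, in d2.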

Now I would like to check: is this a correct approach or not? (TIA)

23 August 2017, 18:45   #2
Toni Wilen
WinUAE developer
It is not possible to time anything cycle-accurately in software.

Each CPU access to a custom register takes 2 color clocks (the color clock runs at ~3.5 MHz), even on AGA systems, which is a huge waste of clocks when the CPU is a 68020+.

23 August 2017, 18:50   #3
amilo3438
Hmm, I hoped it could work (at least approximately).

Thanks for the answer!

23 August 2017, 19:11   #4
amilo3438
Quote:
Originally Posted by Toni Wilen
It is not possible to time anything cycle-accurately in software.

Each CPU custom register access takes 2 color clocks (1 color clock = ~3.5MHz)

Yes, but don't the two register reads at the beginning and at the end cancel that delay out!? (Only what is in between remains, i.e. what happens between start_value and end_value.)

23 August 2017, 19:20   #5
Toni Wilen
It isn't that simple either. Each CPU custom register access also needs to wait until the CPU clock is in sync with the color clock, so there are an "unknown" 0-7 extra wasted clocks during which the CPU is doing nothing (or some internal ALU operation). Short timing results are therefore totally useless.

23 August 2017, 19:33   #6
amilo3438
Quote:
Originally Posted by amilo3438
So, if the resolution of the VHPOSR horizontal counter is about 280 ns (= 2 x 140 ns CPU cycle time),
the value in d2 corresponds to 8 CPU cycles (or 2 bus cycles).
That means the instruction "move.w (a0),d1" takes 8 CPU cycles (or 2 bus cycles).

The first result is d2=00000006, but since we already know the last instruction takes $00000004,
the final result is d2=00000002, which corresponds to 4 CPU cycles (or 1 bus cycle).
That means the instruction "nop" takes 4 CPU cycles (or 1 bus cycle).

At least the above seems to work for the two examples mentioned, "move.w (a0),d1" and "nop":

http://oldwww.nvg.ntnu.no/amiga/MC68...s/timmove.HTML

Move Byte and Word Instruction Execution Times
source (An), destination Dn: 8(2/0)

http://oldwww.nvg.ntnu.no/amiga/MC68...s/timmisc.HTML

instruction   size   register   memory
NOP           -      4(1/0)     -


Quote:
Originally Posted by Toni Wilen
-> short timing result are totally useless..

Anyway, I want to time only one instruction at a time, not a whole program!
And for the two examples mentioned above it works. (I still need to test some other instructions to confirm whether it is usable in general.)

Last edited by amilo3438; 23 August 2017 at 19:40.

23 August 2017, 20:28   #7
Toni Wilen
No. It does not work; it may only appear to work in some cases.

23 August 2017, 21:11   #8
amilo3438
Quote:
Originally Posted by Toni Wilen
No. It does not work, it may appear to work in some cases only.

Can you give an example instruction? (I can't find a case where it does not work.)

23 August 2017, 21:33   #9
Toni Wilen
For example, if the code is already in cache, which is a common real-world use case.

This is totally useless: those cycle usage charts for 68020+ are only theoretical min/max values, and they can only be used to calculate worst/best case situations.

EDIT: for some reason I thought you meant 68020+. The 68000 has none of those issues and there isn't even any need to do any tests. It is simple and can be fully checked using a logic analyzer, because the previous or next instruction makes no difference to execution speed.

This method is quite accurate with the 68000 because it is always in sync with the color clock and a 68000 memory cycle is exactly 2 color clocks. (But make sure to run the test code in real fast RAM!)

23 August 2017, 21:39   #10
amilo3438
Yes, I am playing with the 68000 in an A500 configuration only!
So far this works fine here.

Here is the final version:

        move.l  #$dff006,a0     ;VHPOSR
test:
        move.w  (a0),d0
        cmp.b   #$00,d0         ;test if Hpos=0 (low byte of VHPOSR only)
        bne     test
        move.w  (a0),d0         ;start_value
        nop                     ;unknown_cycles to count
        move.w  (a0),d1         ;end_value (we know this takes $00000004)
        move.w  d1,d2           ;end_value to d2
        sub.w   d0,d2           ;end_value - start_value
        sub.w   #$4,d2          ;d2 - $00000004
        add.w   d2,d2           ;d2 x 2 = unknown_cycles
        rts


So the unknown_cycles result for "nop" is in d2.
("nop" can be replaced with any other instruction to test.)

P.S.
It may be necessary to disable interrupts via INTENA at the start and re-enable them at the end, but I am not sure how to do that (and the DMA channels too, everything).
I am not a programmer and do not have much experience with machine coding on the Amiga (very little, only the basics).

Last edited by amilo3438; 23 August 2017 at 21:56.

23 August 2017, 22:25   #11
Toni Wilen
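Quick and dirty way:

        move.w  #$4000,INTENA   ;clear the master interrupt enable bit
        move.w  #$0200,DMACON   ;clear the master DMA enable bit

        ; test code goes here

        move.w  #$8200,DMACON   ;set the master DMA enable bit again
        move.w  #$c000,INTENA   ;set the master interrupt enable bit again

But you can still hit refresh cycles when accessing custom registers (or chip RAM), which adds an extra 2-cycle delay.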
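
A variation on the snippet above, sketched under some assumptions: it uses the standard custom chip register offsets (DMACONR $002, INTENAR $01c, DMACON $096, INTENA $09a from base $dff000), it saves the previous enable state and restores exactly that instead of assuming everything was on, and it expects the test code to leave d6, d7 and a6 untouched. It is not the method from the post, only one way to wrap a measurement.

```asm
CUSTOM  equ     $dff000
DMACONR equ     $002            ; DMA control, read
INTENAR equ     $01c            ; interrupt enable, read
DMACON  equ     $096            ; DMA control, write
INTENA  equ     $09a            ; interrupt enable, write

        lea     CUSTOM,a6
        move.w  INTENAR(a6),d6  ; remember the enabled interrupts
        move.w  DMACONR(a6),d7  ; remember the enabled DMA channels
        move.w  #$7fff,INTENA(a6) ; bit 15 clear = clear all listed bits
        move.w  #$7fff,DMACON(a6)

        ; ... test code goes here (must preserve d6, d7, a6) ...

        or.w    #$8000,d7       ; bit 15 set = set all listed bits
        or.w    #$8000,d6
        move.w  d7,DMACON(a6)   ; restore the DMA channels
        move.w  d6,INTENA(a6)   ; restore the interrupts
        rts
```

Note that the display is blanked while DMA is off, and that memory refresh cycles still occur, as mentioned above.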

23 August 2017, 22:37   #12
amilo3438
Thanks!

I wonder if the same or similar code could be used on an A1200 for a quick and dirty comparison between real hardware and emulation (at least for unknown cases like muls, mulu, divs etc.)?

EDIT:
Because an A1200 runs at 14 MHz, I guess the best results would come from slowing it down to 7 MHz or even 3.5 MHz.
For example, on a standard 14 MHz A1200 the test program for the "nop" instruction gives 0; at 7 MHz it gives 1 and at 3.5 MHz it gives 3. (It needs to be run without "add.w d2,d2" and "sub.w #$4,d2", so that it measures/compares only the color clocks for the tested "nop" plus the "move.w (a0),d1" instruction, since the latter's value also depends on the current CPU frequency.)

Last edited by amilo3438; 24 August 2017 at 00:18.

24 August 2017, 09:11   #13
meynaf
Quote:
Originally Posted by amilo3438
I wonder if same/similar code can be used on an A1200 for quick and dirty comparison between a real and emulation ? (at least for unknown cases like muls, mulu, divs etc.)

If what you want is the clock cycle count of a specific instruction, a better way would be to execute it a large number of times in a loop.
On my 68030/50 I have a program doing 50,000,000 iterations of a loop. I count the number of seconds it takes, subtract 6 to account for the dbf, and that gives me the clock cycles of a single instruction or group of instructions.
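
A minimal sketch of what such a loop could look like. This is not meynaf's actual program: the ITERATIONS name and the 32-bit counting idiom are illustrative, and the elapsed time is assumed to be measured externally (stopwatch or a system timer). Since dbf only counts a 16-bit word, the high word of d7 is decremented separately.

```asm
ITERATIONS equ  50000000

        move.l  #ITERATIONS-1,d7
loop:
        nop                     ; instruction(s) under test
        dbf     d7,loop         ; counts down the low word of d7
        sub.l   #$10000,d7      ; low word expired: borrow from the high word
        bcc.s   loop            ; no borrow yet, so run another 65536 passes
        rts
```

At 50 MHz, 50,000,000 iterations take one second per clock cycle spent in each iteration, which is why the number of seconds (minus the loop overhead) can be read directly as a cycle count. The sub.l/bcc.s pair only runs once every 65536 passes, so its cost is negligible.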

24 August 2017, 13:39   #14
amilo3438
So the idea of using VHPOSR (i.e. color clocks) to count cycles works in practice, but it has only been tested on a 68000 in an A500.

The final program is attached as a picture below.

Also, instead of just one instruction, more instructions can be added, but if there are too many the program may end with an error (D0-D2 reset to zero), which means the start counter was higher than the end counter. If everything is fine, register D2 will contain the cpu_cycles.

Cheers!

PS. I feel this concept can be improved further, so I leave that to someone experienced in Amiga machine language programming.


Quote:
Originally Posted by meynaf
If what you want is having clock cycles of a specific instruction, a better way would be executing a large number of it in a loop.

Yes, that may be one way, but I wanted to see whether it works using VHPOSR, and it does.
[Attached image: Count_instr_Cycles_on_A500.png]

Last edited by amilo3438; 24 August 2017 at 13:46.

24 August 2017, 13:48   #15
Kalms
You can use VHPOSR as a high-precision timer source on all Amiga systems. You need to make sure that the resolution is set to something such that you know how the hardware will count (DBLPAL will count differently from PAL for example). Also, you need to take into account that accessing VHPOSR is done over a bus where you compete with other hardware for access cycles. As people have outlined above, the simplest way to minimize the measurement errors is to 1) minimize other hardware activity and 2) measure across a large chunk of code & time. You can combine VHPOSR with the TOD counters to do measurements over >1 frame without needing any interrupt driven activity in between.
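
As a rough sketch of measuring across more than one raster line within a single frame, VPOSR and VHPOSR can be read together as one long word and converted to color clocks. The assumptions here are loud ones: a PAL display with 227 color clocks per line, both reads taken in the same frame, and no correction for the cost of the reads themselves. The register choices are arbitrary.

```asm
        lea     $dff004,a0      ; VPOSR, directly followed by VHPOSR

        move.l  (a0),d0         ; start: V8 in bit 16, V7-V0 / H8-H1 below
        ; ... block of code under test ...
        move.l  (a0),d1         ; end

        and.l   #$0001ffff,d0   ; drop LOF and the chip ID bits
        and.l   #$0001ffff,d1

        move.l  d0,d2
        lsr.l   #8,d2           ; d2 = start line (V8-V0)
        move.l  d1,d3
        lsr.l   #8,d3           ; d3 = end line
        sub.w   d2,d3           ; d3 = whole lines elapsed
        mulu    #227,d3         ; assumed: 227 color clocks per PAL line

        and.w   #$00ff,d0       ; keep only the horizontal counters
        and.w   #$00ff,d1
        sub.w   d0,d1           ; horizontal difference, may be negative
        ext.l   d1
        add.l   d1,d3           ; d3 = elapsed color clocks
        rts
```

For measurements longer than one frame, the TOD counters mentioned above would have to supply the frame count.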

If you are profiling on a non-A500 platform, then you are probably targeting a range of configurations. You will need to combine the profiling with estimates (based on instruction timings from the processor manuals) to build performant code across a range of hardware.

24 August 2017, 14:11   #16
amilo3438
My motivation was to try to find a way to approximately "measure/compare" an emulated CPU against a real CPU using software.

OK, this "proof of concept" obviously works for the 68000 and A500, but at higher CPU speeds the VHPOSR resolution is not enough, so one would need to test e.g. 10 (or more) copies of the same instruction instead of just one, or reduce the CPU speed.

Generally, it would be nice if this concept could be used e.g. to take some values on a real machine like an A1200 and then compare them with the same values on an emulated A1200, in order to improve the emulation accuracy even more, but I am afraid my current knowledge of machine programming is still not enough for such a task.

24 August 2017, 16:53   #17
Kalms
Quote:
Originally Posted by amilo3438
My motivation was to try to find a way to approximately "measure/compare" an emulated CPU against a real CPU using software.

OK, this "proof of concept" obviously works for the 68000 and A500, but at higher CPU speeds the VHPOSR resolution is not enough, so one would need to test e.g. 10 (or more) copies of the same instruction instead of just one, or reduce the CPU speed.

Generally, it would be nice if this concept could be used e.g. to take some values on a real machine like an A1200 and then compare them with the same values on an emulated A1200, in order to improve the emulation accuracy even more, but I am afraid my current knowledge of machine programming is still not enough for such a task.

As others have mentioned, the measurement error will be large if you take single instructions. If you want less than 10% error margin then I think you will need to measure blocks that take 100+ cycles to execute.

When you begin to look at faster machines than the A500, the interactions between CPUs, buses and other hardware become more complicated and more pronounced. You can probably use the framework that you have, but you will also need to design different test cases very carefully.

Then, I wonder what the purpose is of the comparison. Is the ultimate purpose to adjust the emulator to match real hardware performance better? You will find that the emulator includes a number of approximations of how the machine is built, and you will need to understand both the real machine's workings (best done through analysis, read hw specs, do measurements) and the emulator (best done by reading the source code -- not by doing measurements) before you will be able to make useful changes to the emulator.

In other words; the timing framework will enable you to tell that "yep there seems to be a difference in <this area over here> between the real hw and the emulator" but it will probably not enable you to pinpoint exactly where and why.

24 August 2017, 18:42   #18
amilo3438
Here is a new, updated version (attached).

I have checked how it works with loops and found that it counts accurately up to d2=$190 (400) cpu_cycles (which equals 100 nops).

So loops of up to 400 cpu_cycles should work fine, I hope!

PS.
As mentioned before, the cpu_cycles result is in d2, and on an error registers d0-d2 are all cleared (which means the test took more than 400 cycles).

EDIT:
Counting for a loop with d4=$1b (27), shown in the attached picture below:
Result in d2=$18c (396) cycles: 396-(4*28)-(10*27)-14=0.
(Note: the loop counts from 27 down to 0 = 28 iterations; nop takes (4*28) and dbeq takes (10*27)+14 cycles.)

EDIT2:
Added a version that is 4 cycles faster (optimized).
[Attached image: Count_instr_Cycles_on_A500_acc_till_400_cycles_optimized.png]

Last edited by amilo3438; 25 August 2017 at 00:59.

31 August 2017, 13:26   #19
amilo3438
New version that can now count accurately up to 35444 mem_cycles (or nops) => 141776 cpu_cycles.

Note: it needs an A500 + fast RAM configuration to guarantee that there are no memory wait states.
Depending on the instruction under test, it could also work fine with chip memory only (I guess).
(No: with chip memory it works fine only up to $e3-d0 color_clocks.)


EDIT: Counting example:

d7=$278d loop value (max value for an accurate count)
loop:
        nop
        dbf     d7,loop

d3=$229c8 cpu_cycles => $229c8-(4x$278e)-(10x$278d)-14=0 => accurate!
[Attached images: A500+FastRAM_001.png, A500+FastRAM_002.png]

Last edited by amilo3438; 01 September 2017 at 18:38.

31 August 2017, 13:49   #20
Thorham
Quote:
Originally Posted by meynaf
If what you want is having clock cycles of a specific instruction, a better way would be executing a large number of it in a loop.
On my 68030/50 i have a program doing 50,000,000 iterations of a loop. I count the number of seconds it takes, do -6 to take dbf into account, and it gives me the clock cycles of a single or group of instructions.

That's indeed the best way.

I do something similar. On my 50 MHz 68030, I execute the code to be measured one million times and simply count the number of vertical blanks in a VBL interrupt. For screen modes with a refresh rate of 50 Hz, this count gives you the number of cycles per iteration, including the loop handling (50,000,000 / 50 = 1,000,000 cycles per frame, so each counted frame corresponds to one cycle per iteration).

Advantage compared to meynaf's method: it takes a lot less time to run for larger pieces of code while still being quite accurate.

Disadvantage compared to meynaf's method: more code. If you run something 50 million times, then you can probably just use a time stamp (from timer.device, not dos.library).