English Amiga Board


Old 16 April 2024, 11:26   #41
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by Don_Adan View Post
If result will be 85 cycles, then is executed parallel with mulu.
If result will be 87 cycles, then is not executed parallel with mulu.
It's not that easy. You will discover that the extra add.l instructions do not add 2 cycles each: some add nothing, and then suddenly you get a lot of extra cycles. This is due to chipmem timing alignment.
Old 16 April 2024, 11:30   #42
saimo
Registered User
 
 
Join Date: Aug 2010
Location: Italy
Posts: 787
@Don_Adan

Quote:
Originally Posted by Don_Adan View Post
Of course to be sure if add.l d3,d3 is executed in parallel to mulu, for testing next add.l d3,d3 must be using for same code:

Code:
                 core: .l move.l d0,(a0)
                          mulu.w d1,d2
                          add.l  d3,d3
                          add.l  d3,d3
                          move.l d0,(a0)
                          dbf    d7,.l
If result will be 85 cycles, then is executed parallel with mulu.
If result will be 87 cycles, then is not executed parallel with mulu.
You can be sure that nothing can execute in parallel with mulu because the 68030 is able to overlap only the tail (which comes from the write part) of an instruction with the head (of the ea) of the following instruction, and mulu's tail is 0 cycles.
This is confirmed already by these tests:

Code:
                 core: .l move.l d6,d0 ;2 cycles
                          mulu.l d7,d0 ;45 cycles*
                          dbf    d7,.l ;6 cycles

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 3476590.087
CPU cycles / per loop: 53.048
         color clocks: 246622
          rasterlines: 1086.440
               frames: 3.471
                   µs: 69531.801
*As reported in post #6, mulu always takes 44 or 46 cycles, depending on the operands. Since in this code one operand is the loop counter, 45 cycles is a plausible average.
Code:
                 core: .l move.w d6,d0 ;2 cycles
                          mulu.w d7,d0 ;27 cycles*
                          dbf    d7,.l ;6 cycles

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 2296896.299
CPU cycles / per loop: 35.047
         color clocks: 162937
          rasterlines: 717.784
               frames: 2.293
                   µs: 45937.925
*Same here, except that when the size is .w mulu is faster.
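
The per-instruction figures in the code comments can be recovered by subtracting the fixed loop costs from the measured per-loop time. A minimal Python sketch of that arithmetic, taking the 2-cycle move and the 6-cycle dbf from the comments above (the helper name is made up for illustration):

```python
# Fixed per-iteration costs, taken from the code comments in the two tests above.
MOVE_CYCLES = 2   # move.l / move.w Dn,Dn
DBF_CYCLES = 6    # dbf

def mulu_cycles(per_loop):
    """Estimate mulu's cost from a measured per-loop figure (hypothetical helper)."""
    return per_loop - MOVE_CYCLES - DBF_CYCLES

mulu_l = mulu_cycles(53.048)  # "CPU cycles / per loop" of the mulu.l test
mulu_w = mulu_cycles(35.047)  # "CPU cycles / per loop" of the mulu.w test
print(f"mulu.l ~ {mulu_l:.3f} cycles, mulu.w ~ {mulu_w:.3f} cycles")
```

Both estimates land about 0.05 cycles above the 45 and 27 quoted in the comments.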


@meynaf

Quote:
What seems wrong to me is that measurement overhead makes it look like it uses 85 clocks but it should really use 84. Which is 3x28.
First move.l takes 28.
Mulu.w takes 26 (maybe 24 here if first 2 execute in parallel to the write).
Second move.l takes 28 but it cannot start immediately if timing not multiple of 28 (maybe 14 ?). So we can fit one small instruction before.
Could be interesting to add 2-cycle instructions until timing jumps to next chipmem slot.
85 cycles look odd to me, too. I'll make more tests, give it a good thought (not done yet) and report back.
By the way, bus slots do not happen at exact multiples of 14 CPU cycles, but every 50.000000/3.546895 = ~14.0968 cycles (and then, of course, there is the limitation that the CPU is never granted two consecutive slots, which is what brings the time of consecutive writes to CHIP RAM to 28+ cycles).
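
The slot arithmetic above, written out as a small Python sketch. The two clock values are the ones quoted in the post; the variable names are illustrative only:

```python
CPU_MHZ = 50.000000         # nominal CPU clock, as shown in the test reports
SLOT_CLOCK_MHZ = 3.546895   # rate at which bus slots occur, per the post

# CPU cycles that elapse during one bus slot.
slot_cycles = CPU_MHZ / SLOT_CLOCK_MHZ

# The CPU is never granted two consecutive slots, so back-to-back
# CHIP RAM writes are at least two slots apart.
min_write_gap = 2 * slot_cycles

print(f"one slot      ~ {slot_cycles:.4f} CPU cycles")
print(f"write-to-write ~ {min_write_gap:.3f} CPU cycles")
```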
Old 16 April 2024, 11:35   #43
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,975
I can say this is not confirmed for me, but maybe I'm just strange.
add.l is one thing and dbf is another.
NOT ALL instructions have to work in parallel with mulu.
So these tests are not 100% valid for me.
Old 16 April 2024, 11:57   #44
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by saimo View Post
As reported in a post #6, mulu always takes 44 or 46 cycles, depending on the operands. Since in this code one operands is the loop counter, 45 cycles are a plausible average.
I remember mulu.w being 26 cycles and mulu.l 44 cycles (never 46).
Never found a different timing except for value zero.
Immediate mulu.w adds 2 ; no difference between mulu and muls.


Quote:
Originally Posted by saimo View Post
Code:
                 core: .l move.w d6,d0 ;2 cycles
                          mulu.w d7,d0 ;27 cycles*
                          dbf    d7,.l ;6 cycles
I have my doubts about the 27 cycles. I have timed a great many multiplies in DCT code and consistently found an immediate mul to take 28 cycles regardless of the values.
You should really check your timing overhead, or even your calculation.


Quote:
Originally Posted by saimo View Post
85 cycles look odd to me, too. I'll make more tests, give it a good thought (not done yet) and report back.
I think 65536 loops aren't enough to get a precise value.
I used to run 50,000,000 loops so that 1 second = 1 cycle. Yes, this took time.
I used 25000 loops with dbf, repeated 2000 times by an outer loop. I found results 0.5% higher than the real values.
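
The "1 second = 1 cycle" shortcut follows directly from the clock rate: with as many loops as there are CPU cycles per second, the wall-clock time in seconds equals the cycles per loop. A quick Python check, assuming a nominal 50 MHz clock (the function name is illustrative):

```python
CPU_HZ = 50_000_000  # nominal clock; the real crystal may differ slightly

def run_seconds(loops, cycles_per_loop, cpu_hz=CPU_HZ):
    """Wall-clock time of a timing loop, ignoring any fixed overhead."""
    return loops * cycles_per_loop / cpu_hz

inner, outer = 25_000, 2_000          # dbf inner loop, repeated by an outer loop
total_loops = inner * outer           # 50,000,000 iterations in all

print(run_seconds(total_loops, 1))    # one cycle per loop takes one second
print(run_seconds(total_loops, 36))   # a 36-cycle loop takes 36 seconds
print(run_seconds(65_536, 36))        # the same loop with only 65536 iterations
```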


Quote:
Originally Posted by saimo View Post
By the way, bus slots do not happen at exact multiples of 14 CPU cycles, but every 50.000000/3.546895 = ~14.0968 cycles (and then, of course, there is the limitation that the CPU is never granted two consecutive slots, which is what brings the time of consecutive writes to CHIP RAM to 28+ cycles).
You don't have exactly 50,000,000 cycles per second either. It could be slightly more, or slightly less. I wouldn't be surprised if the real measured value were 49,656,530 instead.
Then we would have different timings from one machine to another, too, which won't help...
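
For reference, 49,656,530 is exactly 14 times the 3.546895 MHz figure quoted above, i.e. the clock the CPU would have if one bus slot lasted exactly 14 CPU cycles. The sketch below only checks that arithmetic; that this is where the number comes from is an assumption:

```python
SLOT_CLOCK_HZ = 3_546_895      # 3.546895 MHz, the divisor used in the quoted post
NOMINAL_CPU_HZ = 50_000_000

# CPU clock for which one bus slot would be exactly 14 CPU cycles long.
locked_cpu_hz = 14 * SLOT_CLOCK_HZ

# How far that is from the nominal 50 MHz, as a fraction.
deviation = 1 - locked_cpu_hz / NOMINAL_CPU_HZ

print(locked_cpu_hz)
print(f"{deviation:.3%} below nominal")
```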
Old 16 April 2024, 12:22   #45
saimo
Registered User
 
 
Join Date: Aug 2010
Location: Italy
Posts: 787
Quote:
Originally Posted by Don_Adan View Post
I can say this is not confirmed for me, but im strange.
Something different is add.l and something different is dbf
NOT ALL instructions must works parallel with mulu.
Sorry, but no instruction that follows mulu can execute in parallel with it simply because the 68030 architecture doesn't allow it. Check out the formula 11-2 at 11.3.4 and the timings table at 11.6.8 in the MC68030UM.
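
The overlap rule can be sketched in a few lines of Python: the head of an instruction may hide under the tail of the previous one, so the saving is the smaller of the two. This is a simplified reading of the rule as described in this thread, not a transcription of the manual's formula. All head/tail/cycle numbers below are illustrative placeholders, except mulu's tail of 0, which is the value stated in the thread; real figures are in the MC68030UM tables.

```python
def sequence_cycles(instructions):
    """Total time of a sequence of (head, tail, cycles) tuples.

    Each instruction's head can overlap the previous instruction's tail,
    so min(previous tail, head) cycles are saved at every boundary.
    """
    total = 0
    prev_tail = 0
    for head, tail, cycles in instructions:
        total += cycles - min(prev_tail, head)
        prev_tail = tail
    return total

# (head, tail, cycles); placeholder values for illustration only.
MULU = (2, 0, 28)     # tail = 0, as stated in the thread
ADD = (2, 0, 2)
WRITER = (0, 2, 4)    # hypothetical instruction with a 2-cycle tail

print(sequence_cycles([MULU, ADD]))    # nothing to hide under a 0-cycle tail
print(sequence_cycles([WRITER, ADD]))  # add's head hides under the 2-cycle tail
```

With a tail of 0, whatever follows mulu pays its full cost, which is the point being made above.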
Old 16 April 2024, 12:58   #46
saimo
Registered User
 
 
Join Date: Aug 2010
Location: Italy
Posts: 787
Quote:
Originally Posted by meynaf View Post
I remember mulu.w being 26 cycles and mulu.l 44 cycles (never 46).
Never found a different timing except for value zero.
Immediate mulu.w adds 2 ; no difference between mulu and muls.

I doubt about 27 cycles. I timed just too many multiplies in dct stuff and consistently found immediate mul to be 28 cycles regardless of values.
Really you should check your timing overhead or even calculation.
I've considered that, too, but in that case I would always get wrong values; instead, I get consistently correct results for cases where there are no doubts about timings (e.g. .l dbf dx,.l -> 6 cycles).
By the way, before running the actual test, the proggie runs a dummy test to measure the overhead and then subtracts that overhead from the final calculation. But even if it didn't, the overhead would be negligible given enough iterations, as it basically consists of the access time to the CIA B TOD HI and LO registers (a couple of CIA cycles, i.e. about 70.5 CPU cycles*), the VHPOSR register (14 cycles twice) and a jsr ([bd.l]) (another 14 cycles) - say, 200 cycles in all, which carry no real weight against the millions of cycles being measured here (e.g. 5570816 cycles for the move-mulu-move case).

To be precise, here is the relevant part of the code:
Code:
* TIMING START

	clr.b  RA_CIABTODHI        ;clear TODHI and stop TOD clock
	clr.b  RA_CIABTODMID       ;clear TODMID
	cnop   0,4                 ;align code (for equal conditions - see overhead calculation)
	bsr    _WtVB               ;wait for frame to start (for equal conditions - see overhead calculation)
	clr.b  RA_CIABTODLO        ;clear TODLO and start TOD clock
	move.w RA_VHPOSR,BeamYX0   ;get initial beam position

* MEASURED CODE

	jsr    ([BlobAddress_C.l])

* TIMING STOP

	tst.b  RA_CIABTODHI        ;stop TOD clock
	move.w RA_VHPOSR,BeamYX1   ;get final beam position
*Quite a while ago I measured that the access time to CIAs varies from machine to machine and even on the same machine it might depend on the instruction (move, tst, clr, etc.). IIRC, the 68030 on my Blizzard 1230 IV requires 2 CIA cycles to perform a read (but, even if it were 1 cycle more or less, there would be no practical difference).
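
To put the overhead in proportion, here is a small Python sketch using the figures from this post (about 200 cycles of overhead, 5570816 measured cycles, 65536 loops). The CIA clock value is an assumption: it is the PAL rate of 709,379 Hz, which is not stated in the post.

```python
CPU_HZ = 50_000_000
CIA_HZ = 709_379          # assumed PAL CIA clock; not given in the post

OVERHEAD_CYCLES = 200     # rough total quoted above
MEASURED_CYCLES = 5_570_816
LOOPS = 65_536

cia_cycle = CPU_HZ / CIA_HZ                      # CPU cycles per CIA cycle
share = OVERHEAD_CYCLES / MEASURED_CYCLES        # fraction of the whole run
per_loop = OVERHEAD_CYCLES / LOOPS               # error per loop if not subtracted

print(f"one CIA cycle ~ {cia_cycle:.1f} CPU cycles")
print(f"overhead share ~ {share:.4%}")
print(f"per-loop error ~ {per_loop:.4f} cycles")
```

Even left uncorrected, 200 cycles would shift the per-loop figure by about 0.003 cycles at 65536 loops.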

Quote:
I think 65536 loops aren't enough to have precise value.
I used to have 50,000,000 loops so that 1 second = 1 cycle. Yes this took time.
I used 25000 loops with dbf, 2000 times with an outer loop. Found 0,5% more than real values.
No problem, I can also run tests with 50 million loops.

In the meantime, I finally got around to running tests to measure how many cycles one can stuff between CHIP RAM writes, when mulu.w is also in between, without affecting the overall execution time:

Code:
+--------------------------------------------------------+
|                   .l move.l d0,(a0)                    |
|                      mulu.w d1,d2                      |
|                      move.l d0,(a0)                    |
|                      dbf    d7,.l                      |
|                                                        |
|                Note: buffer in CHIP RAM                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 5570816.164                     |
|CPU cycles / per loop : 85.003                          |
|         color clocks : 395182                          |
|          rasterlines : 1740.889                        |
|               frames : 5.561                           |
|                   µs : 111416.323                      |
+----------------------+---------------------------------+

+--------------------------------------------------------+
|                   .l move.l d0,(a0)                    |
|                      rept   14                         |
|                      add.l  d3,d3                      |
|                      endr                              |
|                      mulu.w d1,d2                      |
|                      move.l d0,(a0)                    |
|                      dbf    d7,.l                      |
|                                                        |
|                Note: buffer in CHIP RAM                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 5570830.261                     |
|CPU cycles / per loop : 85.004                          |
|         color clocks : 395183                          |
|          rasterlines : 1740.894                        |
|               frames : 5.561                           |
|                   µs : 111416.605                      |
+----------------------+---------------------------------+


+--------------------------------------------------------+
|                   .l move.l d0,(a0)                    |
|                      rept   15                         |
|                      add.l  d3,d3                      |
|                      endr                              |
|                      mulu.w d1,d2                      |
|                      move.l d0,(a0)                    |
|                      dbf    d7,.l                      |
|                                                        |
|                Note: buffer in CHIP RAM                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 6489239.179                     |
|CPU cycles / per loop : 99.017                          |
|         color clocks : 460333                          |
|          rasterlines : 2027.898                        |
|               frames : 6.478                           |
|                   µs : 129784.783                      |
+----------------------+---------------------------------+

+--------------------------------------------------------+
|                   .l move.l d0,(a0)                    |
|                      mulu.w d1,d2                      |
|                      add.l  d3,d3                      |
|                      add.l  d3,d3                      |
|                      dbf    d7,.l                      |
|                                                        |
|                Note: buffer in CHIP RAM                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 5570816.164                     |
|CPU cycles / per loop : 85.003                          |
|         color clocks : 395182                          |
|          rasterlines : 1740.889                        |
|               frames : 5.561                           |
|                   µs : 111416.323                      |
+----------------------+---------------------------------+

+--------------------------------------------------------+
|                   .l move.l d0,(a0)                    |
|                      mulu.w d1,d2                      |
|                      add.l  d3,d3                      |
|                      add.l  d3,d3                      |
|                      add.l  d3,d3                      |
|                      move.l d0,(a0)                    |
|                      dbf    d7,.l                      |
|                                                        |
|                Note: buffer in CHIP RAM                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 6502729.852                     |
|CPU cycles / per loop : 99.223                          |
|         color clocks : 461290                          |
|          rasterlines : 2032.114                        |
|               frames : 6.492                           |
|                   µs : 130054.597                      |
+----------------------+---------------------------------+

+--------------------------------------------------------+
|                   .l move.l d0,(a0)                    |
|                      mulu.w d1,d2                      |
|                      move.l d0,(a0)                    |
|                      dbf    d7,.l                      |
|                                                        |
|                Note: buffer in FAST RAM                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 2502146.243                     |
|CPU cycles / per loop : 38.179                          |
|         color clocks : 177497                          |
|          rasterlines : 781.925                         |
|               frames : 2.498                           |
|                   µs : 50042.924                       |
+----------------------+---------------------------------+

+--------------------------------------------------------+
|                   .l move.l d0,(a0)                    |
|                      add.l  d3,d3                      |
|                      mulu.w d1,d2                      |
|                      move.l d0,(a0)                    |
|                      dbf    d7,.l                      |
|                                                        |
|                Note: buffer in FAST RAM                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 2502146.243                     |
|CPU cycles / per loop : 38.179                          |
|         color clocks : 177497                          |
|          rasterlines : 781.925                         |
|               frames : 2.498                           |
|                   µs : 50042.924                       |
+----------------------+---------------------------------+

+--------------------------------------------------------+
|                   .l move.l d0,(a0)                    |
|                      add.l  d3,d3                      |
|                      add.l  d3,d3                      |
|                      mulu.w d1,d2                      |
|                      move.l d0,(a0)                    |
|                      dbf    d7,.l                      |
|                                                        |
|                Note: buffer in FAST RAM                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 2631414.236                     |
|CPU cycles / per loop : 40.152                          |
|         color clocks : 186667                          |
|          rasterlines : 822.321                         |
|               frames : 2.627                           |
|                   µs : 52628.284                       |
+----------------------+---------------------------------+

+--------------------------------------------------------+
|                   .l move.l d0,(a0)                    |
|                      mulu.w d1,d2                      |
|                      add.l  d3,d3                      |
|                      move.l d0,(a0)                    |
|                      dbf    d7,.l                      |
|                                                        |
|                Note: buffer in FAST RAM                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 2631414.236                     |
|CPU cycles / per loop : 40.152                          |
|         color clocks : 186667                          |
|          rasterlines : 822.321                         |
|               frames : 2.627                           |
|                   µs : 52628.284                       |
+----------------------+---------------------------------+
Summary (number of add.l instructions that fit without increasing the execution time):
* CHIP RAM / move / add / mulu / move -> 14 adds
* CHIP RAM / move / mulu / add / move -> 2 adds
* FAST RAM / move / add / mulu / move -> 1 add
* FAST RAM / move / mulu / add / move -> 0 adds
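
Converted to cycles, and compared with the jump seen once one add too many is inserted, the summary looks as follows. This is a Python sketch using only figures from the tables above, with add.l counted as 2 cycles as elsewhere in the thread; the dictionary keys are just labels:

```python
ADD_CYCLES = 2  # cost of one add.l d3,d3, as assumed throughout the thread

# Number of add.l instructions that fit for free, from the summary.
free_adds = {
    "CHIP, adds before mulu": 14,
    "CHIP, adds after mulu": 2,
    "FAST, adds before mulu": 1,
    "FAST, adds after mulu": 0,
}
free_cycles = {case: n * ADD_CYCLES for case, n in free_adds.items()}

# Per-loop jump when one more add is inserted (figures from the tables).
chip_jump = 99.017 - 85.004   # 15 adds vs 14 adds, CHIP RAM
fast_jump = 40.152 - 38.179   # 2 adds vs 1 add, FAST RAM

slot_cycles = 50.000000 / 3.546895  # one bus slot in CPU cycles

for case, cycles in free_cycles.items():
    print(f"{case}: {cycles} cycles hidden")
print(f"CHIP RAM jump ~ {chip_jump:.3f} cycles (one slot ~ {slot_cycles:.3f})")
print(f"FAST RAM jump ~ {fast_jump:.3f} cycles")
```

In CHIP RAM the penalty for the 15th add is about one whole bus slot rather than 2 cycles, while in FAST RAM it is just the cost of the add itself.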
Old 16 April 2024, 14:18   #47
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by saimo View Post
I've considered that, too, but then again I would always get wrong values, and instead I get consistently correct results for cases where there are no doubts about timings (e.g. .l dbf dx,.l -> 6 cycles).
Getting the right values for small cycle counts doesn't mean they won't deviate for larger ones. E.g. add 5% to 6 and you still get 6. But add 5% to 20 and you'll get 21.
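
The rounding effect is easy to reproduce. A minimal Python sketch, using a 5% relative error purely as an example value (the function name is made up):

```python
def displayed_cycles(true_cycles, relative_error):
    """Cycle count as it would appear after rounding a slightly inflated measurement."""
    return round(true_cycles * (1 + relative_error))

# A 5% error disappears in rounding for a short loop...
print(displayed_cycles(6, 0.05))
# ...but shows up as a whole extra cycle for a longer one.
print(displayed_cycles(20, 0.05))
```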


Quote:
Originally Posted by saimo View Post
By the way, before running the actual test, the proggie runs a dummy test to measure the overhead and then it subtracts that overhead from the final calculation. But even if it didn't do that, the overhead is negligible with enough number of iterations, as it's basically the access time to the CIA B TOD HI and LO registers (a couple of CIA cycles, i.e. about 70.5 CPU cycles*), the VHPOSR register (14 cycles twice) and a jsr ([bd.l]) (other 14 cycles) - say, a 200 cycles in all, which have no real weight on the millions of cycles being measured here (e.g. 5570816 cycles for the move-mulu-move case).
I wouldn't trust CIA timings too much...
Anyway, you've shown non-integer timings such as 85.380, 56.636, 16.200, 42.715, 53.048, 71.158, 85.003. Don't tell me that just chopping off the fractional part can give the right result.


Quote:
Originally Posted by saimo View Post
In the meanwhile, I finally got around to make tests to measure how many cycles one can stuff between CHIP RAM writes when mulu.w is also in between without affecting the overall execution time:
<snip>
Summary:
* CHIP RAM / move / add / mulu / move -> 14 adds
* CHIP RAM / move / mulu / add / move -> 2 adds
* FAST RAM / move / add / mulu / move -> 1 adds
* FAST RAM / move / mulu / add / move -> 0 adds
Ok, let's try to interpret these :
Code:
.l move.l d0,(a0)                  ; 28 (r=26)
 mulu.w d1,d2                      ; 26->24 (r=24, stall, next @+4)
 move.l d0,(a0)                    ; 28->32 (+4 stall)
 dbf    d7,.l                      ; 6->0
; total 84 (3 blocks of 28)

.l move.l d0,(a0)                  ; 28 (r=26)
 rept   14
 add.l  d3,d3                      ; 28->2 (next @+12 missed -> @+26)
 endr
 mulu.w d1,d2                      ; 26 (next @+0)
 move.l d0,(a0)                    ; 28
 dbf    d7,.l                      ; 6->0
; total 84

.l move.l d0,(a0)                  ; 28 (r=26)
 rept   15
 add.l  d3,d3                      ; 30->4 (next @+10 missed -> @+24 missed -> @+38)
 endr
 mulu.w d1,d2                      ; 26 (next @ +12)
 move.l d0,(a0)                    ; 28 + 12 (stall)
 dbf    d7,.l                      ; 6->0
; total 98 (7 blocks of 14)

.l move.l d0,(a0)                  ; 28->36 (r=26, stall +8 from prev. iteration)
 mulu.w d1,d2                      ; 26->24 (r=24, stall, next @+4)
 add.l  d3,d3                      ; 2 (next @+2)
 add.l  d3,d3                      ; 2 (next @+0)
 dbf    d7,.l                      ; 6 (next @+8) - next move will stall +8
; total 70 (5 blocks of 14)
; error - are you sure of your 85 ? (this is exact same values as first test)

.l move.l d0,(a0)                  ; 28 (r=26)
 mulu.w d1,d2                      ; 26->24 (r=24, stall, next @+4)
 add.l  d3,d3                      ; 2 (next @+2)
 add.l  d3,d3                      ; 2 (next @+0)
 add.l  d3,d3                      ; 2 (next @+12)
 move.l d0,(a0)                    ; 28->40 (r=24, stall +12)
 dbf    d7,.l                      ; 6->0
; total 98
So the chipmem results seem predictable after all.
But the fastmem results are 2 cycles lower than I expected.
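
The totals in the annotated listings can be re-checked by adding up the effective per-instruction times from the comments (the value after each "->", or the stalled value where one is given). A Python sketch of that bookkeeping; the labels are shorthand for the five CHIP RAM tests:

```python
SLOT = 14  # chipmem slot length in CPU cycles, rounded as in the listings

# Effective cycles per line, read off the comments in the listings above.
effective = {
    "move/mulu/move":         [28, 24, 32, 0],
    "move/14 adds/mulu/move": [28, 2, 26, 28, 0],
    "move/15 adds/mulu/move": [28, 4, 26, 40, 0],
    "move/mulu/2 adds":       [36, 24, 2, 2, 6],
    "move/mulu/3 adds/move":  [28, 24, 2, 2, 2, 40, 0],
}

totals = {name: sum(parts) for name, parts in effective.items()}

for name, total in totals.items():
    blocks, remainder = divmod(total, SLOT)
    print(f"{name}: {total} cycles = {blocks} blocks of {SLOT}, remainder {remainder}")
```

Every total comes out as a whole number of 14-cycle blocks, which is what the model assumes.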
Old 16 April 2024, 15:26   #48
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,975
In general, some results look strange to me, e.g. 99 cycles for 3x add.l d3,d3, when 1x add.l d3,d3 and 2x add.l d3,d3 take only 85 cycles.
Perhaps there are more dependencies at play.
I can only suspect that CCR handling is the problem, e.g. the X bit.

Anyway, if you really want to check whether chip RAM writes can be covered by mulu,
then replace
move.l d0,(A0)
with
movem.l d0,(A0)
If I remember right, movem doesn't change the CCR, whereas move does.
Old 16 April 2024, 18:25   #49
saimo
Registered User
 
 
Join Date: Aug 2010
Location: Italy
Posts: 787
@meynaf

Reply split in 2 parts, as the forum tells me it's too long...

Quote:
Originally Posted by meynaf View Post
Getting the right values for small cycle counts doesn't mean they won't deviate for larger ones. E.g. add 5% to 6 and you still get 6. But add 5% to 20 and you'll get 21.
In this case that logic doesn't apply, as the overhead is incurred only once per test (not per loop) and the way the time is measured (more about this below) does not depend on the number of loops.
Anyway, as promised, I also ran tests with 50 million loops, and they confirm that the proggie measures the time correctly: in the results below, you can see that the reported times are just 1/1000th of a cycle away from perfectly round figures (of course, with so many more loops, precision increased).
These tests focus on mulu.x alone, without accesses to RAM, to be sure about the precision of the figures.
The analysis of the (interesting) results follows the data.
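
As a reading aid for the results that follow: their labels give 26 cycles for an even multiplier and 28 for an odd one, which is enough to account for the 27-cycle average when the multiplier sweeps the whole 16-bit range. A Python sketch, assuming (as the labels suggest, but do not prove) that parity alone decides the timing:

```python
EVEN_OPERAND_CYCLES = 26  # from the "d1 = 0" and "d1 = even" labels
ODD_OPERAND_CYCLES = 28   # from the "d1 = 7" and "d1 = odd" labels

def average_mulu_w(operands):
    """Average mulu.w time over a set of multipliers, under the parity assumption."""
    costs = [ODD_OPERAND_CYCLES if value & 1 else EVEN_OPERAND_CYCLES
             for value in operands]
    return sum(costs) / len(costs)

# Loop counter running over all 16-bit values, as in the dbf d7 tests.
sweep_average = average_mulu_w(range(65536))
print(sweep_average)
```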

Code:
mulu.w d1,d2 / d1 = 0 / 65536 loops
26 cycles

+--------------------------------------------------------+
|              .l move.w d0,d6 ;d0 = $1234               |
|                 mulu.w d1,d6 ;d1 = 0                   |
|                 dbf    d7,.l                           |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 2231416.492                     |
|CPU cycles / per loop : 34.048                          |
|         color clocks : 158292                          |
|          rasterlines : 697.321                         |
|               frames : 2.227                           |
|                   µs : 44628.329                       |
+----------------------+---------------------------------+


mulu.w d1,d2 / d1 = 0 / 50000000 loops
26 cycles

+--------------------------------------------------------+
|              .l move.w d0,d6 ;d0 = $1234               |
|                 mulu.w d1,d6 ;d1 = 0                   |
|                 subq.l #1,d7                           |
|                 bne.b  .l                              |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 50000000                        |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 1799995164.784                  |
|CPU cycles / per loop : 35.999                          |
|         color clocks : 127687877                       |
|          rasterlines : 562501.660                      |
|               frames : 1797.129                        |
|                   µs : 35999903.295                    |
+----------------------+---------------------------------+


mulu.w d1,d2 / d1 = 7 / 65536 loops
28 cycles

+--------------------------------------------------------+
|              .l move.w d0,d6 ;d0 = $1234               |
|                 mulu.w d1,d6 ;d1 = 7                   |
|                 dbf    d7,.l                           |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 2362446.590                     |
|CPU cycles / per loop : 36.048                          |
|         color clocks : 167587                          |
|          rasterlines : 738.268                         |
|               frames : 2.358                           |
|                   µs : 47248.931                       |
+----------------------+---------------------------------+


mulu.w d1,d2 / d1 = 7 / 50000000 loops
28 cycles

+--------------------------------------------------------+
|              .l move.w d0,d6 ;d0 = $1234               |
|                 mulu.w d1,d6 ;d1 = 7                   |
|                 subq.l #1,d7                           |
|                 bne.b  .l                              |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 50000000                        |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 1899994600.911                  |
|CPU cycles / per loop : 37.999                          |
|         color clocks : 134781627                       |
|          rasterlines : 593751.660                      |
|               frames : 1896.970                        |
|                   µs : 37999892.018                    |
+----------------------+---------------------------------+


mulu.w d1,d2 / d1 = even / 25000000 loops
26 cycles

+--------------------------------------------------------+
|              .l move.w d0,d6 ;d0 = $1234               |
|                 mulu.w d7,d6                           |
|                 subq.l #2,d7                           |
|                 bne.b  .l                              |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 25000000                        |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 899999252.867                   |
|CPU cycles / per loop : 35.999                          |
|         color clocks : 63844057                        |
|          rasterlines : 281251.352                      |
|               frames : 898.566                         |
|                   µs : 17999985.057                    |
+----------------------+---------------------------------+


mulu.w d1,d2 / d1 = odd / 25000000 loops
28 cycles

+--------------------------------------------------------+
|              .l move.w d0,d6 ;d0 = $1234               |
|                 mulu.w d7,d6                           |
|                 subq.l #2,d7                           |
|                 bpl.b  .l                              |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 25000000                        |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 949998970.930                   |
|CPU cycles / per loop : 37.999                          |
|         color clocks : 67390932                        |
|          rasterlines : 296876.352                      |
|               frames : 948.486                         |
|                   µs : 18999979.418                    |
+----------------------+---------------------------------+


mulu.w d1,d2 / d1 from 65535 to 0 / 65536 loops
27 cycles on average

+--------------------------------------------------------+
|              .l move.w d0,d6 ;d0 = $1234               |
|                 mulu.w d7,d6                           |
|                 dbf    d7,.l                           |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 2296966.783                     |
|CPU cycles / per loop : 35.048                          |
|         color clocks : 162942                          |
|          rasterlines : 717.806                         |
|               frames : 2.293                           |
|                   µs : 45939.335                       |
+----------------------+---------------------------------+


mulu.w d1,d2 / d1 from 65535 to 0 / 50000000 loops
27 cycles on average

+--------------------------------------------------------+
|              .l move.w d0,d6 ;d0 = $1234               |
|                 mulu.w d7,d6                           |
|                 subq.l #1,d7                           |
|                 bne.b  .l                              |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 50000000                        |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 1849994812.364                  |
|CPU cycles / per loop : 36.999                          |
|         color clocks : 131234747                       |
|          rasterlines : 578126.638                      |
|               frames : 1847.049                        |
|                   µs : 36999896.247                    |
+----------------------+---------------------------------+


mulu.l d1,d2 / d1 = 0 / 65536 loops
44 cycles

+--------------------------------------------------------+
|            .l move.l d0,d6 ;d0 = $12345678             |
|               mulu.l d1,d6 ;d1 = 0                     |
|               dbf    d7,.l                             |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 3407839.814                     |
|CPU cycles / per loop : 51.999                          |
|         color clocks : 241745                          |
|          rasterlines : 1064.955                        |
|               frames : 3.402                           |
|                   µs : 68156.796                       |
+----------------------+---------------------------------+


mulu.l d1,d2 / d1 = 0 / 50000000 loops
44 cycles

+--------------------------------------------------------+
|            .l move.l d0,d6 ;d0 = $12345678             |
|               mulu.l d1,d6 ;d1 = 0                     |
|               subq.l #1,d7                             |
|               bne.b  .l                                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 50000000                        |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 2699992415.901                  |
|CPU cycles / per loop : 53.999                          |
|         color clocks : 191531792                       |
|          rasterlines : 843752.387                      |
|               frames : 2695.694                        |
|                   µs : 53999848.318                    |
+----------------------+---------------------------------+


mulu.l d1,d2 / d1 = 7 / 65536 loops
46 cycles

+--------------------------------------------------------+
|            .l move.l d0,d6 ;d0 = $12345678             |
|               mulu.l d1,d6 ;d1 = 7                     |
|               dbf    d7,.l                             |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 3542140.379                     |
|CPU cycles / per loop : 54.048                          |
|         color clocks : 251272                          |
|          rasterlines : 1106.925                        |
|               frames : 3.536                           |
|                   µs : 70842.807                       |
+----------------------+---------------------------------+


mulu.l d1,d2 / d1 = 7 / 50000000 loops
46 cycles

+--------------------------------------------------------+
|            .l move.l d0,d6 ;d0 = $12345678             |
|               mulu.l d1,d6 ;d1 = 7                     |
|               subq.l #1,d7                             |
|               bne.b  .l                                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 50000000                        |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 2799991570.091                  |
|CPU cycles / per loop : 55.999                          |
|         color clocks : 198625522                       |
|          rasterlines : 875002.299                      |
|               frames : 2795.534                        |
|                   µs : 55999831.401                    |
+----------------------+---------------------------------+


mulu.l d1,d2 / d1 = even / 25000000 loops
44 cycles

+--------------------------------------------------------+
|            .l move.l d0,d6 ;d0 = $12345678             |
|               mulu.l d7,d6                             |
|               subq.l #2,d7                             |
|               bne.b  .l                                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 25000000                        |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 1349996870.502                  |
|CPU cycles / per loop : 53.999                          |
|         color clocks : 95765943                        |
|          rasterlines : 421876.400                      |
|               frames : 1347.847                        |
|                   µs : 26999937.410                    |
+----------------------+---------------------------------+


mulu.l d1,d2 / d1 = odd / 25000000 loops
46 cycles

+--------------------------------------------------------+
|            .l move.l d0,d6 ;d0 = $12345678             |
|               mulu.l d7,d6                             |
|               subq.l #2,d7                             |
|               bpl.b  .l                                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 25000000                        |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 1399997067.857                  |
|CPU cycles / per loop : 55.999                          |
|         color clocks : 99312852                        |
|          rasterlines : 437501.550                      |
|               frames : 1397.768                        |
|                   µs : 27999941.357                    |
+----------------------+---------------------------------+


mulu.l d1,d2 / d1 from 65535 to 0 / 65536 loops
45 cycles on average

+--------------------------------------------------------+
|            .l move.l d0,d6 ;d0 = $12345678             |
|               mulu.l d7,d6                             |
|               dbf    d7,.l                             |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 3476590.087                     |
|CPU cycles / per loop : 53.048                          |
|         color clocks : 246622                          |
|          rasterlines : 1086.440                        |
|               frames : 3.471                           |
|                   µs : 69531.801                       |
+----------------------+---------------------------------+


mulu.l d1,d2 / d1 from 65535 to 0 / 50000000 loops
45 cycles on average

+--------------------------------------------------------+
|            .l move.l d0,d6 ;d0 = $12345678             |
|               mulu.l d7,d6                             |
|               subq.l #1,d7                             |
|               bne.b  .l                                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 50000000                        |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 2749988088.172                  |
|CPU cycles / per loop : 54.999                          |
|         color clocks : 195078380                       |
|          rasterlines : 859376.123                      |
|               frames : 2745.610                        |
|                   µs : 54999761.763                    |
+----------------------+---------------------------------+
Summary:
* mulu.w d1,d2 / d1 = 0 / 65536 loops -> 26 cycles
* mulu.w d1,d2 / d1 = 0 / 5000000 loops -> 26 cycles
* mulu.w d1,d2 / d1 = 7 / 65536 loops -> 28 cycles
* mulu.w d1,d2 / d1 = 7 / 50000000 loops -> 28 cycles
* mulu.w d1,d2 / d1 = even / 25000000 loops -> 26 cycles
* mulu.w d1,d2 / d1 = odd / 25000000 loops -> 28 cycles
* mulu.w d1,d2 / d1 from 65535 to 0 / 65536 loops -> 27 cycles on average
* mulu.w d1,d2 / d1 from 65535 to 0 / 50000000 loops -> 27 cycles on average
* mulu.l d1,d2 / d1 = 0 / 65536 loops -> 44 cycles
* mulu.l d1,d2 / d1 = 0 / 50000000 loops -> 44 cycles
* mulu.l d1,d2 / d1 = 7 / 65536 loops -> 46 cycles
* mulu.l d1,d2 / d1 = 7 / 50000000 loops -> 46 cycles
* mulu.l d1,d2 / d1 = even / 25000000 loops -> 44 cycles
* mulu.l d1,d2 / d1 = odd / 25000000 loops -> 46 cycles
* mulu.l d1,d2 / d1 from 65535 to 0 / 65536 loops -> 45 cycles on average
* mulu.l d1,d2 / d1 from 65535 to 0 / 50000000 loops -> 45 cycles on average

Takeaways:
* mulu.w takes either 26 or 28 cycles;
* mulu.l takes either 44 or 46 cycles;
* when the left (source) operand is odd, mulu.x takes 2 more cycles (I guess this is also true for muls.x);
* the figures obtained by means of millions of loops match those obtained with 65536 loops.
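
For reference, the per-instruction figures in the summary can be reproduced from the raw per-loop numbers by subtracting the loop overhead. Below is a minimal Python sketch of that subtraction. It assumes 2 cycles for the register move and 6 for dbf (as annotated in the earlier test) and, as an inference from the figures rather than something the tests state, 2 + 6 cycles for the subq/bne pair. The names are made up for illustration.

```python
# Loop overhead in CPU cycles (assumptions, see the text above):
# register move = 2, dbf = 6, subq + bne = 2 + 6.
OVERHEAD_DBF = 2 + 6
OVERHEAD_SUBQ_BNE = 2 + 2 + 6


def mul_cycles(per_loop, overhead):
    """Return the cycles left for the multiply, rounded to the nearest integer."""
    return round(per_loop - overhead)


# Per-loop figures copied from the test boxes above.
print(mul_cycles(35.999, OVERHEAD_SUBQ_BNE))  # mulu.w, even operands
print(mul_cycles(37.999, OVERHEAD_SUBQ_BNE))  # mulu.w, odd operands
print(mul_cycles(35.048, OVERHEAD_DBF))       # mulu.w, 65535 to 0
print(mul_cycles(53.999, OVERHEAD_SUBQ_BNE))  # mulu.l, even operands
print(mul_cycles(55.999, OVERHEAD_SUBQ_BNE))  # mulu.l, odd operands
print(mul_cycles(53.048, OVERHEAD_DBF))       # mulu.l, 65535 to 0
```

With these overheads the six calls give back the 26/28/27 and 44/46/45 figures listed in the summary.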

Quote:
I wouldn't trust CIA timings too much...
I use the CIA B TOD to count the number of rasterlines elapsed, and that's reliable. For sub-rasterline precision I read VHPOSR - see the code snippet in my previous post.

Quote:
Anyway, you've shown non-integer timings such as 85.380, 56.636, 16.200, 42.715, 53.048, 71.158, 85.003. Don't tell me that just chopping the non-integer part of that, can give the right result.
I derive the CPU cycles from the color clocks measured by means of the CIA B TOD and VHPOSR as mentioned above. The formula is <color clocks> * <CPU frequency> / 3.546895, and that's where the decimals come from. Unfortunately, since there isn't a higher resolution clock, I have to make do with this. The precision is of course not perfect, but, as you can see from the results above, the figures are reliable already with 65536 loops.
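
The conversion formula above can be checked against any of the boxes. This is only a Python sketch of the arithmetic, not the actual measurement code (which runs on the Amiga and is not shown here); the function and constant names are made up.

```python
# PAL color clock frequency in MHz, as used in the formula above.
COLOR_CLOCK_MHZ = 3.546895


def cpu_cycles(color_clocks, cpu_mhz=50.0):
    """Convert a color clock count into CPU cycles."""
    return color_clocks * cpu_mhz / COLOR_CLOCK_MHZ


# Color clocks and loop count from the first mulu.w box above.
total = cpu_cycles(63844057)
print(f"{total:.3f}")             # total CPU cycles
print(f"{total / 25000000:.5f}")  # CPU cycles per loop
```

The total agrees with the 899999252.867 reported in that box to within the last printed digit.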
Old 16 April 2024, 18:45   #50
saimo
@meynaf

Reply part 2

Quote:
Ok, let's try to interpret these :
Code:
.l move.l d0,(a0)                  ; 28 (r=26)
 mulu.w d1,d2                      ; 26->24 (r=24, stall, next @+4)
 move.l d0,(a0)                    ; 28->32 (+4 stall)
 dbf    d7,.l                      ; 6->0
; total 84 (3 blocks of 28)

.l move.l d0,(a0)                  ; 28 (r=26)
 rept   14
 add.l  d3,d3                      ; 28->2 (next @+12 missed -> @+26)
 endr
 mulu.w d1,d2                      ; 26 (next @+0)
 move.l d0,(a0)                    ; 28
 dbf    d7,.l                      ; 6->0
; total 84

.l move.l d0,(a0)                  ; 28 (r=26)
 rept   15
 add.l  d3,d3                      ; 30->4 (next @+10 missed -> @+24 missed -> @+38)
 endr
 mulu.w d1,d2                      ; 26 (next @+12)
 move.l d0,(a0)                    ; 28 + 12 (stall)
 dbf    d7,.l                      ; 6->0
; total 98 (7 blocks of 14)

.l move.l d0,(a0)                  ; 28->36 (r=26, stall +8 from prev. iteration)
 mulu.w d1,d2                      ; 26->24 (r=24, stall, next @+4)
 add.l  d3,d3                      ; 2 (next @+2)
 add.l  d3,d3                      ; 2 (next @+0)
 dbf    d7,.l                      ; 6 (next @+8) - next move will stall +8
; total 70 (5 blocks of 14)
; error - are you sure of your 85 ? (this is exact same values as first test)

.l move.l d0,(a0)                  ; 28 (r=26)
 mulu.w d1,d2                      ; 26->24 (r=24, stall, next @+4)
 add.l  d3,d3                      ; 2 (next @+2)
 add.l  d3,d3                      ; 2 (next @+0)
 add.l  d3,d3                      ; 2 (next @+12)
 move.l d0,(a0)                    ; 28->40 (r=24, stall +12)
 dbf    d7,.l                      ; 6->0
; total 98
Seems chipmem results are predictable after all
But fastmem results give 2 cycles less than expected by me.
About the matter of what happens when memory is accessed, I've drawn a couple of quick and dirty diagrams for the move-mulu-move case and, to my surprise, it looks like, in theory, the speed should be 1 color clock (i.e. 14 CPU cycles) less:



I'll prepare a proper picture and explain the underlying thoughts later (or more probably tomorrow - I haven't slept for basically 2 days after months of bad sleep, so I'm about to collapse...).

I also ran the test again using 50000000 loops, obtaining this, which confirms 85 cycles on average:
Code:
move > mulu.w d1,d2 > move / d1 from 65535 to 0 / 50000000 loops

+--------------------------------------------------------+
|             .l move.l d0,(a0)                          |
|                mulu.w d1,d2   ;d1 = d2 = 0             |
|                move.l d0,(a0)                          |
|                subq.l #1,d7                            |
|                bne.b  .l                               |
|                                                        |
|                Note: buffer in CHIP RAM                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 50000000                        |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 4247766807.306                  |
|CPU cycles / per loop : 84.955                          |
|         color clocks : 301327657                       |
|          rasterlines : 1327434.612                     |
|               frames : 4241.005                        |
|                   µs : 84955336.146                    |
+----------------------+---------------------------------+
Anyway, here's a first guess about what gives such odd timings: as seen from the results (and also from my doodles above), the mulus happen to end very close to the next color clock, so maybe, at times, the CPU misses that slot and thus ends up taking 14 cycles more. How many times?
Using 65536 loops for convenience:
* total cycles: 65536 * 85 = 5570560
* expected cycles: 65536 * 84 = 5505024
* from that: (65536 - x) * 84 + x * (84 + 14) = 65536 * 85 -> x = 65536/14 = about 4681 times

Why does that happen? I don't know. For sure it's not because of the odd/even operand factor (alone), otherwise the slot would be missed 50% of the time and the average time would be higher. I guess it's first necessary to figure out why the execution most of the time takes 84 cycles instead of 98 in the first place.
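
The estimate of how often the slot is missed can be written out as a small Python sketch. It simply solves the equation given above for x, using the rounded 85-cycle average; the names are made up for illustration.

```python
from fractions import Fraction

LOOPS = 65536
FAST = 84       # cycles when the slot is hit
SLOW = 84 + 14  # cycles when the slot is missed (one color clock later)
AVERAGE = 85    # measured average, rounded


def missed_slots(loops, fast, slow, average):
    """Solve (loops - x) * fast + x * slow = loops * average for x."""
    return Fraction(loops * (average - fast), slow - fast)


x = missed_slots(LOOPS, FAST, SLOW, AVERAGE)
print(float(x))          # number of slow iterations
print(float(x / LOOPS))  # fraction of slow iterations
```

That is 65536/14 slow iterations, i.e. roughly one loop in 14 missing its slot, far from the 50% an odd/even effect alone would give.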
Old 16 April 2024, 18:57   #51
saimo
@Don_Adan

Quote:
Originally Posted by Don_Adan View Post
In general, some results look strange to me, e.g. 99 cycles for 3x add.l d3,d3, when 1x add.l d3,d3 and 2x add.l d3,d3 take only 85 cycles.
Perhaps there are more relationships at play.
I can only suspect that CCR handling can be problematic, e.g. the X bit.
To me it looks like you're missing the fact that 1 color clock = about 14 CPU cycles, that the CPU gets to write only on those boundaries and that the CPU can't use two consecutive CHIP bus slots. So, even if X adds might overlap nicely, X+1 adds might push the CPU into the next CHIP bus slot.
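
To make the slot argument concrete, here is a small Python sketch. It assumes, as a simplification, exactly 14 CPU cycles per color clock and that a CHIP RAM access can only complete on a multiple of that; the real bus behaviour is more involved, and the function name is made up.

```python
CYCLES_PER_SLOT = 14  # approximation: 50 MHz / 3.546895 MHz is about 14.1


def next_slot(cycles):
    """Round a cycle count up to the next multiple of CYCLES_PER_SLOT."""
    return -(-cycles // CYCLES_PER_SLOT) * CYCLES_PER_SLOT


print(next_slot(84))  # already on a slot boundary
print(next_slot(86))  # one extra 2-cycle add.l past the boundary
```

Under this simplification, work that fits exactly in 84 cycles stays at 84, while 2 more cycles cost a whole extra slot (98).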

Quote:
Anyway, if you really want to check whether chip RAM writes can be covered by mulu, then replace
move.l d0,(A0)
with
movem.l d0,(A0)
Movem doesn't change the CCR if I remember right, whereas move does.
The fact that exg cannot run in parallel as well makes me suspect that the ccr plays no role, but I'll run the test and let you know.

EDIT: the movem test confirms that the ccr doesn't matter ->

Code:
+--------------------------------------------------------+
|            .l movem.l d0,(a0)                          |
|               mulu.w  d1,d2   ;d1 = d2 = 0             |
|               movem.l d0,(a0)                          |
|               subq.l  #1, d7                           |
|               bne.b   .l                               |
|                                                        |
|                Note: buffer in CHIP RAM                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 50000000                        |
|         leaked bytes : 672                             |
+----------------------+---------------------------------+
|   CPU cycles / total : 4247766877.790                  |
|CPU cycles / per loop : 84.955                          |
|         color clocks : 301327662                       |
|          rasterlines : 1327434.634                     |
|               frames : 4241.005                        |
|                   µs : 84955337.555                    |
+----------------------+---------------------------------+
The test is included in the attached archive.


@all

Attached here is the updated archive with the latest tests included.

Last edited by saimo; 16 April 2024 at 23:18. Reason: Removed attachment and fixed the English.
Old 16 April 2024, 19:06   #52
meynaf
Quote:
Originally Posted by saimo View Post
In this case such logic doesn't apply, as the overhead applies only once per test (not per loop) and the way the time is measured (more about this below) does not depend on the loops number.
Sorry, but once-per-test overhead cannot explain the non-integer results you got before. Especially not the fact that they disappeared with extra precision.


Quote:
Originally Posted by saimo View Post
I derive the CPU cycles from the color clocks measured by means of the CIA B TOD and VHPOSR as mentioned above. The formula is <color clocks> * <CPU frequency> / 3.546895, and that's where the decimals come from. Unfortunately, since there isn't a higher resolution clock, I have to make do with this. The precision is of course not perfect, but, as you can see from the results above, the figures are reliable already with 65536 loops.
There are easier ways to compute the timing.
Old 16 April 2024, 19:22   #53
saimo
Quote:
Originally Posted by meynaf View Post
Sorry, but once-per-test overhead cannot explain the non-integer results you got before. Especially not the fact that they disappeared with extra precision.
My reply was relative to the reliability of the figures, not to why decimals appear (which I explained separately).


Quote:
There are easier ways to compute the timing.
I'd be glad to implement a better method: which would that be?
Old 16 April 2024, 19:38   #54
meynaf
Quote:
Originally Posted by saimo View Post
My reply was relative to the reliability of the figures, not to why decimals appear (which I explained separately).
But both are the same. The decimals indicated something didn't go well.


Quote:
Originally Posted by saimo View Post
I'd be glad to implement a better method: which would that be?
Not 'better' but easier as i said : just count the vblanks...
Cheap and (with enough loops) reliable enough.
Old 16 April 2024, 23:16   #55
saimo
@meynaf

Quote:
Originally Posted by meynaf View Post
But both are the same. The decimals indicated something didn't go well.
1. You first questioned the precision of the decimals, saying that the error might scale depending on the magnitude of the numbers.
2. I explained that the error doesn't depend on that.
3. You said that the origin of the decimals was not explained.
4. I said that I was answering a different remark (and added that the explanation you wanted was provided elsewhere in the very same post).
5. Now you're saying that the precision of the decimals and where they come from are the same thing and that they indicate some error.
Sorry, but here you're not being fair, so this comment of yours goes to >NIL:

Quote:
Not 'better' but easier as i said : just count the vblanks...
Cheap and (with enough loops) reliable enough.
Here's an even easier method - the easiest and always indisputably correct one: printing out "CPU cycles taken: more than 0".
What's the point in suggesting an easier method if it isn't just as precise? Especially considering that in this context precision matters and, even more, considering that you have been questioning the precision of the results! You really seem to just want to be against for the sake of it.

Anyway, technically speaking, using the vertical blanks is 313*227/50 = 1421.02 times less precise than the method I used (which isn't that complicated, by the way), so, no, it isn't worth considering.

That said, I must admit that I was so annoyed by the comments that I decided to have a second look at my code to try and make it even more precise. After some investigation, I found something I wasn't aware of: the CIA B TOD clock does not increase when a rasterline starts or ends (as I had erroneously believed), but at around color clock $60 (sometimes a little less, sometimes a little more, and this figure is also affected by the slow access time of the CIAs). On some occasions, this caused the color clocks count to be wrong by 227 (i.e. 1 rasterline). So, I changed the code to take that into account and to work regardless of when exactly the TOD clock increases. The code is now not only free of the off-by-one issue, but also simpler, more beautiful (than what was shown in the snippet I posted earlier) and - guess what - more precise. So much more precise that even the 65536 loops tests provide figures that are just 0.001 or 0.002 away from the perfect ones - here are some examples:

Code:
+--------------------------------------------------------+
|                      .l dbf d7,.l                      |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 393358.134                      |
|CPU cycles / per loop : 6.002                           |
|         color clocks : 27904                           |
|          rasterlines : 122.925                         |
|               frames : 0.392                           |
|                   µs : 7867.162                        |
+----------------------+---------------------------------+


+--------------------------------------------------------+
|              .l move.w d0,d6 ;d0 = $1234               |
|                 mulu.w d7,d6                           |
|                 dbf    d7,.l                           |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 2293907.770                     |
|CPU cycles / per loop : 35.002                          |
|         color clocks : 162725                          |
|          rasterlines : 716.850                         |
|               frames : 2.290                           |
|                   µs : 45878.155                       |
+----------------------+---------------------------------+


+--------------------------------------------------------+
|            .l move.l d0,d6 ;d0 = $12345678             |
|               mulu.l d7,d6                             |
|               dbf    d7,.l                             |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 3473488.783                     |
|CPU cycles / per loop : 53.001                          |
|         color clocks : 246402                          |
|          rasterlines : 1085.471                        |
|               frames : 3.467                           |
|                   µs : 69469.775                       |
+----------------------+---------------------------------+


+--------------------------------------------------------+
|             .l move.l d0,(a0)                          |
|                mulu.w d1,d2   ;d1 = d2 = 0             |
|                move.l d0,(a0)                          |
|                dbf    d7,.l                            |
|                                                        |
|                Note: buffer in CHIP RAM                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 5567728.957                     |
|CPU cycles / per loop : 84.956                          |
|         color clocks : 394963                          |
|          rasterlines : 1739.925                        |
|               frames : 5.558                           |
|                   µs : 111354.579                      |
+----------------------+---------------------------------+


+--------------------------------------------------------+
|             .l move.l d0,(a0)                          |
|                mulu.w d1,d2   ;d1 = d2 = 0             |
|                move.l d0,(a0)                          |
|                subq.l #1,d7                            |
|                bne.b  .l                               |
|                                                        |
|                Note: buffer in CHIP RAM                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 50000000                        |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 4247763790.583                  |
|CPU cycles / per loop : 84.955                          |
|         color clocks : 301327443                       |
|          rasterlines : 1327433.669                     |
|               frames : 4241.002                        |
|                   µs : 84955275.811                    |
+----------------------+---------------------------------+
Attached is the updated archive.
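
As a rough sanity check on the one-rasterline miscount described above, this Python sketch estimates how much a 227 color clock error shifts the per-loop figure. It is an order-of-magnitude check under the posted conversion formula, not a claim about the exact cause of the earlier decimals; the names are made up.

```python
COLOR_CLOCK_MHZ = 3.546895   # PAL color clock, from the conversion formula
CPU_MHZ = 50.0               # CPU clock from the test boxes
CLOCKS_PER_RASTERLINE = 227  # size of the miscount mentioned above


def rasterline_error_per_loop(loops):
    """CPU cycles per loop added by a one-rasterline miscount."""
    extra_cycles = CLOCKS_PER_RASTERLINE * CPU_MHZ / COLOR_CLOCK_MHZ
    return extra_cycles / loops


print(f"{rasterline_error_per_loop(65536):.4f}")     # short tests
print(f"{rasterline_error_per_loop(50000000):.6f}")  # long tests
```

For 65536 loops this comes to a bit under 0.05 cycles per loop, the same order as the drop from 35.048 to 35.002 and from 53.048 to 53.001 in the boxes above, while for 50000000 loops it is negligible.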

Last edited by saimo; 17 April 2024 at 23:04. Reason: Removed attachment as I provided a newer version later.
Old 17 April 2024, 11:40   #56
meynaf
Quote:
Originally Posted by saimo View Post
1. You first questioned the precision of the decimals, saying that the error might scale depending on the magnitude of the numbers.
Which is true. An error in a constant, clocks not exactly being 3546895 and 50000000, an interrupt not disabled... Many things can make the result drift away.


Quote:
Originally Posted by saimo View Post
2. I explained that the error doesn't depend on that.
And where did I agree to this ? I did not. It can still depend on that.
I haven't said it depends only on that, though. There is setup-related error, which is why many loops are needed.


Quote:
Originally Posted by saimo View Post
3. You said that the origin of the decimals was not explained.
What ? No. What i said is that once-per-test overhead wasn't enough to explain them.


Quote:
Originally Posted by saimo View Post
4. I said that I was answering a different remark (and added that the explanation you wanted was provided elsewhere in the very same post).
Except that your explanation wasn't convincing and there was only a single remark to reply to.


Quote:
Originally Posted by saimo View Post
5. Now you're saying that the precision of the decimals and where they come from are the same thing and that they indicate some error.
Look at the values i wrote in post #47. These can *not* be the right result.


Quote:
Originally Posted by saimo View Post
Sorry, but here you're not being fair, so this comment of yours goes to >NIL:
I think i'll put that on the account of your recent lack of sleep.
I may lack the most basic of diplomacy skills, that's sure - and we might just have had a misunderstanding after all - but i'm not unfair.


Quote:
Originally Posted by saimo View Post
Here's an even easier method - the easiest and always indisputably correct one: printing out "CPU cycles taken: more than 0".
Very funny -- or maybe not, as this looks a lot like a strawman fallacy.
(And wrong in the case we're timing an empty routine.)


Quote:
Originally Posted by saimo View Post
What's the point in suggesting an easier method if it isn't just as precise?
Because it's a lot simpler, requires fewer computations and thus is less error prone ?


Quote:
Originally Posted by saimo View Post
Especially considering that in this context precision matters and, even more, considering that you have been questioning the precision of the results! You really seem to just want to be against for the sake of it.
Now you're being unfair ! The precision we need is 1 cycle. So as long as this integer cycle value is unambiguous, extra precision is useless. But previously you didn't reach this (with 65536 loops). And using the vblank is enough to have it.


Quote:
Originally Posted by saimo View Post
Anyway, technically speaking, using the vertical blanks is 313*227/50 = 1421.02 times less precise than the method I used (which isn't that complicated, by the way), so, no, it isn't worth considering.
As i said, it is not "the more, the better".
Using the vertical blank gives at most 20 ms of error, which is quite acceptable for a computation that lasts several seconds.
We need to know the number of cycles, not the number of 1/1000th of a cycle.
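As a rough illustration of that trade-off, the Python sketch below estimates the worst-case error per loop that a timer's granularity can introduce, and how many loops would be needed to bring it under half a cycle. It assumes a 50 MHz CPU and the nominal PAL clock quoted in the thread, and it treats the timer granularity as the only error source. The helper names are hypothetical.

```python
import math

CPU_HZ = 50_000_000              # nominal CPU clock quoted in the thread
PAL_COLOR_CLOCK_HZ = 3_546_895   # nominal PAL color clock quoted in the thread
VBLANK_PERIOD_S = 1 / 50         # one PAL frame, i.e. 20 ms


def worst_case_error_per_loop(granularity_s, loops, cpu_hz=CPU_HZ):
    """Worst-case timing error, in CPU cycles per loop iteration."""
    return granularity_s * cpu_hz / loops


def loops_needed(granularity_s, max_error_cycles, cpu_hz=CPU_HZ):
    """Smallest loop count keeping the per-loop error within the bound."""
    return math.ceil(granularity_s * cpu_hz / max_error_cycles)


# With 65536 loops, as in the tests posted so far.
vblank_err_65536 = worst_case_error_per_loop(VBLANK_PERIOD_S, 65_536)
cck_err_65536 = worst_case_error_per_loop(1 / PAL_COLOR_CLOCK_HZ, 65_536)

# Loops needed for the vblank method to resolve an integer cycle count.
vblank_loops = loops_needed(VBLANK_PERIOD_S, 0.5)

print(f"vblank timer, 65536 loops: +/- {vblank_err_65536:.2f} cycles/loop")
print(f"color clock,  65536 loops: +/- {cck_err_65536:.6f} cycles/loop")
print(f"vblank loops for 0.5 cycle: {vblank_loops}")
```

Under these assumptions, the vblank method cannot resolve a single cycle with only 65536 loops, but it can with about two million loops, which at roughly 85 cycles per loop is a run of a few seconds.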


Quote:
Originally Posted by saimo View Post
That said, I must admit that I was so annoyed by the comments that I decided to have a second look at my code to try and make it even more precise. After some investigation, I found something I wasn't aware of: the CIA B TOD clock does not increase when a rasterline starts or ends (as I had erroneously believed), but at around color clock $60 (sometimes a little less, sometimes a little more, and this figure is also affected by the slow access time of the CIAs). On some occasions, this caused the color clocks count to be wrong by 227 (i.e. 1 rasterline). So, I changed the code to take that into account and to work regardless of when exactly the TOD clock increases. The code now is not only no longer affected by the off-by-one issue, but is actually simpler, more beautiful (than what was shown in the snippet I posted earlier) and - guess what - more precise. So much more precise that even the 65536 loops tests provide figures that are just 0.001 or 0.002 away from the perfect ones - here are some examples:
So at least this has been useful for something.
Perhaps we could get further by watching this 84.955 closer, hmm ?
meynaf is offline  
Old 17 April 2024, 13:03   #57
saimo
Registered User
 
saimo's Avatar
 
Join Date: Aug 2010
Location: Italy
Posts: 787
@meynaf

>NIL:
saimo is offline  
Old 17 April 2024, 14:09   #58
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by saimo View Post
@meynaf

>NIL:
You should really go to sleep.
meynaf is offline  
Old 17 April 2024, 14:21   #59
Karlos
Alien Bleed
 
Karlos's Avatar
 
Join Date: Aug 2022
Location: UK
Posts: 4,165
Can we stop the nonsense and get back to measuring? This is interesting, the flexing isn't.
Karlos is online now  
Old 17 April 2024, 14:35   #60
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by Karlos View Post
Can we stop the nonsense and get back to measuring? This is interesting, the flexing isn't.
You may try to find a way to predict the obtained values, and ask for extra tests when some theory needs to be verified. Me, i tried but with little result.
meynaf is offline  
 

