mulu.l cycles on 68030 and 68040 - Page 3

meynaf · 16 April 2024, 11:26

Quote:

Originally Posted by Don_Adan

If the result is 85 cycles, then it is executed in parallel with mulu.
If the result is 87 cycles, then it is not executed in parallel with mulu.

It's not that easy. You will find that the extra add.l instructions do not cost +2 cycles each. Some will cost nothing, and then suddenly you will get a lot of extra cycles. This is due to chipmem timing alignment.

saimo · 16 April 2024, 11:30

@Don_Adan

Quote:

Originally Posted by Don_Adan

Of course, to be sure whether add.l d3,d3 is executed in parallel with mulu, a second add.l d3,d3 must be added to the same code for testing:

Code:

                 core: .l move.l d0,(a0)
                          mulu.w d1,d2
                          add.l  d3,d3
                          add.l  d3,d3
                          move.l d0,(a0)
                          dbf    d7,.l

If the result is 85 cycles, then it is executed in parallel with mulu.
If the result is 87 cycles, then it is not executed in parallel with mulu.

You can be sure that nothing can execute in parallel with mulu because the 68030 is able to overlap only the tail (which comes from the write part) of an instruction with the head (of the ea) of the following instruction, and mulu's tail is 0 cycles.
This is confirmed already by these tests:

Code:

                 core: .l move.l d6,d0 ;2 cycles
                          mulu.l d7,d0 ;45 cycles*
                          dbf    d7,.l ;6 cycles

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 3476590.087
CPU cycles / per loop: 53.048
         color clocks: 246622
          rasterlines: 1086.440
               frames: 3.471
                   µs: 69531.801

*As reported in post #6, mulu always takes 44 or 46 cycles, depending on the operands. Since one of the operands in this code is the loop counter, 45 cycles is a plausible average.

Code:

                 core: .l move.w d6,d0 ;2 cycles
                          mulu.w d7,d0 ;27 cycles*
                          dbf    d7,.l ;6 cycles

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 2296896.299
CPU cycles / per loop: 35.047
         color clocks: 162937
          rasterlines: 717.784
               frames: 2.293
                   µs: 45937.925

*Same here, except that when the size is .w mulu is faster.
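
As a quick sanity check on the two results above, the cost of mulu can be estimated by subtracting the fixed cost of the other two instructions from the measured per-loop figure. The sketch below is only illustrative: the helper name is made up, and it assumes the 2-cycle move and 6-cycle dbf annotated in the listings simply add up with no overlap.

```python
# Rough estimate of the mulu cost from a measured loop time.
# Assumption: move = 2 cycles and dbf = 6 cycles (as annotated in the
# listings), and the instruction costs simply add up.
MOVE_CYCLES = 2
DBF_CYCLES = 6

def estimate_mulu(cycles_per_loop):
    """Subtract the fixed loop costs; the remainder is attributed to mulu."""
    return cycles_per_loop - MOVE_CYCLES - DBF_CYCLES

mulu_l = estimate_mulu(53.048)  # measured per-loop time of the mulu.l test
mulu_w = estimate_mulu(35.047)  # measured per-loop time of the mulu.w test
print(f"mulu.l ~ {mulu_l:.3f} cycles, mulu.w ~ {mulu_w:.3f} cycles")
```

Both estimates land close to the 45 and 27 cycles given in the comments.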

@meynaf

Quote:

What seems wrong to me is that measurement overhead makes it look like it uses 85 clocks but it should really use 84. Which is 3x28.
First move.l takes 28.
Mulu.w takes 26 (maybe 24 here if first 2 execute in parallel to the write).
Second move.l takes 28 but it cannot start immediately if the timing is not a multiple of 28 (maybe 14?). So we can fit one small instruction before.
Could be interesting to add 2-cycle instructions until timing jumps to next chipmem slot.

85 cycles look odd to me, too. I'll make more tests, give it a good thought (not done yet) and report back.
By the way, bus slots do not happen at exact multiples of 14 CPU cycles, but every 50.000000/3.546895 = ~14.0968 cycles (and then, of course, there is the limitation that the CPU is never granted two consecutive slots, which is what brings the time of consecutive writes to CHIP RAM to 28+ cycles).
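
The slot arithmetic above can be reproduced in a few lines. This is just a sketch using the two clock values quoted in the post; the variable names are arbitrary.

```python
# Bus slot spacing as seen by the CPU, from the two clock values quoted above.
CPU_MHZ = 50.000000
COLOR_CLOCK_MHZ = 3.546895

# One bus slot lasts one color clock, expressed here in CPU cycles.
cycles_per_slot = CPU_MHZ / COLOR_CLOCK_MHZ

# The CPU is never granted two consecutive slots, so back-to-back
# CHIP RAM writes are at least two slots apart.
min_write_spacing = 2 * cycles_per_slot

print(f"{cycles_per_slot:.4f} CPU cycles per slot")
print(f"{min_write_spacing:.4f} CPU cycles between consecutive writes")
```

Two slots come to slightly more than 28 cycles, which is where the "28+" comes from.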

Don_Adan · 16 April 2024, 11:35

I'd say this is not confirmed for me, but maybe I'm strange.
add.l is one thing and dbf is another.
NOT ALL instructions have to work in parallel with mulu.
So these tests are not 100% valid for me.

meynaf · 16 April 2024, 11:57

Quote:

Originally Posted by saimo

As reported in post #6, mulu always takes 44 or 46 cycles, depending on the operands. Since one of the operands in this code is the loop counter, 45 cycles is a plausible average.

I remember mulu.w being 26 cycles and mulu.l 44 cycles (never 46).
Never found a different timing except for value zero.
Immediate mulu.w adds 2 ; no difference between mulu and muls.

Quote:

Originally Posted by saimo

Code:

                 core: .l move.w d6,d0 ;2 cycles
                          mulu.w d7,d0 ;27 cycles*
                          dbf    d7,.l ;6 cycles

I have doubts about the 27 cycles. I have timed a great many multiplies in DCT code and consistently found immediate mul to be 28 cycles regardless of the values.
You really should check your timing overhead, or even your calculation.

Quote:

Originally Posted by saimo

85 cycles look odd to me, too. I'll make more tests, give it a good thought (not done yet) and report back.

I think 65536 loops aren't enough to get a precise value.
I used to run 50,000,000 loops so that 1 second = 1 cycle. Yes, this took time.

I used 25000 loops with dbf, repeated 2000 times by an outer loop. I found values 0.5% higher than the real ones.
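
The "1 second = 1 cycle" trick works because the loop count equals the nominal clock rate in Hz. A minimal sketch of the conversion follows; the function name and the 36-second elapsed time are made-up illustration values, and the clock is assumed to be exactly 50 MHz.

```python
# With as many loops as there are clock ticks per second, the elapsed
# time in seconds reads directly as cycles per loop.
CLOCK_HZ = 50_000_000  # nominal value; the real crystal may differ slightly

def cycles_per_loop(elapsed_seconds, loops, clock_hz=CLOCK_HZ):
    """Convert a wall-clock measurement into CPU cycles per loop iteration."""
    return elapsed_seconds * clock_hz / loops

# Hypothetical measurement: the whole run took 36 seconds.
direct = cycles_per_loop(36.0, 50_000_000)
# 25,000 inner loops repeated 2,000 times give the same total count
# (the small cost of the outer loop itself is ignored here).
nested = cycles_per_loop(36.0, 25_000 * 2_000)
print(direct, nested)
```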

Quote:

Originally Posted by saimo

By the way, bus slots do not happen at exact multiples of 14 CPU cycles, but every 50.000000/3.546895 = ~14.0968 cycles (and then, of course, there is the limitation that the CPU is never granted two consecutive slots, which is what brings the time of consecutive writes to CHIP RAM to 28+ cycles).

You don't have exactly 50,000,000 cycles either. It could be slightly more, or slightly less. I wouldn't be surprised if the real measured value were 49,656,530 instead.

Then we would have different timing from one machine to another, too, which won't help...

saimo · 16 April 2024, 12:22

Quote:

Originally Posted by Don_Adan

I'd say this is not confirmed for me, but maybe I'm strange.
add.l is one thing and dbf is another.
NOT ALL instructions have to work in parallel with mulu.

Sorry, but no instruction that follows mulu can execute in parallel with it simply because the 68030 architecture doesn't allow it. Check out the formula 11-2 at 11.3.4 and the timings table at 11.6.8 in the MC68030UM.

saimo · 16 April 2024, 12:58

Quote:

Originally Posted by meynaf

I remember mulu.w being 26 cycles and mulu.l 44 cycles (never 46).
Never found a different timing except for value zero.
Immediate mulu.w adds 2 ; no difference between mulu and muls.

I have doubts about the 27 cycles. I have timed a great many multiplies in DCT code and consistently found immediate mul to be 28 cycles regardless of the values.
You really should check your timing overhead, or even your calculation.

I've considered that, too, but then again I would always get wrong values, and instead I get consistently correct results for cases where there are no doubts about timings (e.g. .l dbf dx,.l -> 6 cycles).
By the way, before running the actual test, the proggie runs a dummy test to measure the overhead and then subtracts that overhead from the final calculation. But even if it didn't do that, the overhead is negligible with a large enough number of iterations, as it's basically the access time to the CIA B TOD HI and LO registers (a couple of CIA cycles, at about 70.5 CPU cycles each*), the VHPOSR register (14 cycles twice) and a jsr ([bd.l]) (another 14 cycles) - say, 200 cycles in all, which have no real weight on the millions of cycles being measured here (e.g. 5570816 cycles for the move-mulu-move case).

To be precise, here is the relevant part of the code:

Code:

* TIMING START

	clr.b  RA_CIABTODHI        ;clear TODHI and stop TOD clock
	clr.b  RA_CIABTODMID       ;clear TODMID
	cnop   0,4                 ;align code (for equal conditions - see overhead calculation)
	bsr    _WtVB               ;wait for frame to start (for equal conditions - see overhead calculation)
	clr.b  RA_CIABTODLO        ;clear TODLO and start TOD clock
	move.w RA_VHPOSR,BeamYX0   ;get initial beam position

* MEASURED CODE

	jsr    ([BlobAddress_C.l])

* TIMING STOP

	tst.b  RA_CIABTODHI        ;stop TOD clock
	move.w RA_VHPOSR,BeamYX1   ;get final beam position

*Quite a while ago I measured that the access time to the CIAs varies from machine to machine, and even on the same machine it might depend on the instruction (move, tst, clr, etc.). IIRC, the 68030 on my Blizzard 1230 IV requires 2 CIA cycles to perform a read (but, even if it were 1 cycle more or less, there would be no practical difference).
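
To put the overhead estimate into numbers, here is a small sketch. It reads the figures above as one CIA access of 2 CIA cycles at roughly 70.5 CPU cycles each, two VHPOSR reads and one jsr; that reading is an interpretation, and the variable names are arbitrary.

```python
# Back-of-the-envelope weight of the measurement overhead.
# Assumption: 2 CIA cycles at ~70.5 CPU cycles each, two VHPOSR reads of
# 14 cycles and one jsr of 14 cycles, as described in the post.
CIA_CYCLE_IN_CPU_CYCLES = 70.5
overhead_estimate = 2 * CIA_CYCLE_IN_CPU_CYCLES + 2 * 14 + 14

OVERHEAD_CYCLES = 200        # the rounded-up "say, 200 cycles" figure
TOTAL_CYCLES = 5570816.164   # move-mulu-move test in CHIP RAM
LOOPS = 65536

overhead_fraction = OVERHEAD_CYCLES / TOTAL_CYCLES
overhead_per_loop = OVERHEAD_CYCLES / LOOPS

print(f"estimated overhead: {overhead_estimate:.0f} cycles")
print(f"share of total time: {overhead_fraction:.6%}")
print(f"cycles per loop if not subtracted: {overhead_per_loop:.4f}")
```

Even if left in, 200 cycles would shift the per-loop figure by only about 0.003 cycles at 65536 loops.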

Quote:

I think 65536 loops aren't enough to get a precise value.
I used to run 50,000,000 loops so that 1 second = 1 cycle. Yes, this took time.
I used 25000 loops with dbf, repeated 2000 times by an outer loop. I found values 0.5% higher than the real ones.

No problem, I can make tests also with 50 million loops.

In the meantime, I finally got around to running tests that measure how many cycles one can stuff between CHIP RAM writes, when mulu.w is also in between, without affecting the overall execution time:

Code:

+--------------------------------------------------------+
|                   .l move.l d0,(a0)                    |
|                      mulu.w d1,d2                      |
|                      move.l d0,(a0)                    |
|                      dbf    d7,.l                      |
|                                                        |
|                Note: buffer in CHIP RAM                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 5570816.164                     |
|CPU cycles / per loop : 85.003                          |
|         color clocks : 395182                          |
|          rasterlines : 1740.889                        |
|               frames : 5.561                           |
|                   µs : 111416.323                      |
+----------------------+---------------------------------+

+--------------------------------------------------------+
|                   .l move.l d0,(a0)                    |
|                      rept   14                         |
|                      add.l  d3,d3                      |
|                      endr                              |
|                      mulu.w d1,d2                      |
|                      move.l d0,(a0)                    |
|                      dbf    d7,.l                      |
|                                                        |
|                Note: buffer in CHIP RAM                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 5570830.261                     |
|CPU cycles / per loop : 85.004                          |
|         color clocks : 395183                          |
|          rasterlines : 1740.894                        |
|               frames : 5.561                           |
|                   µs : 111416.605                      |
+----------------------+---------------------------------+


+--------------------------------------------------------+
|                   .l move.l d0,(a0)                    |
|                      rept   15                         |
|                      add.l  d3,d3                      |
|                      endr                              |
|                      mulu.w d1,d2                      |
|                      move.l d0,(a0)                    |
|                      dbf    d7,.l                      |
|                                                        |
|                Note: buffer in CHIP RAM                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 6489239.179                     |
|CPU cycles / per loop : 99.017                          |
|         color clocks : 460333                          |
|          rasterlines : 2027.898                        |
|               frames : 6.478                           |
|                   µs : 129784.783                      |
+----------------------+---------------------------------+

+--------------------------------------------------------+
|                   .l move.l d0,(a0)                    |
|                      mulu.w d1,d2                      |
|                      add.l  d3,d3                      |
|                      add.l  d3,d3                      |
|                      dbf    d7,.l                      |
|                                                        |
|                Note: buffer in CHIP RAM                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 5570816.164                     |
|CPU cycles / per loop : 85.003                          |
|         color clocks : 395182                          |
|          rasterlines : 1740.889                        |
|               frames : 5.561                           |
|                   µs : 111416.323                      |
+----------------------+---------------------------------+

+--------------------------------------------------------+
|                   .l move.l d0,(a0)                    |
|                      mulu.w d1,d2                      |
|                      add.l  d3,d3                      |
|                      add.l  d3,d3                      |
|                      add.l  d3,d3                      |
|                      move.l d0,(a0)                    |
|                      dbf    d7,.l                      |
|                                                        |
|                Note: buffer in CHIP RAM                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 6502729.852                     |
|CPU cycles / per loop : 99.223                          |
|         color clocks : 461290                          |
|          rasterlines : 2032.114                        |
|               frames : 6.492                           |
|                   µs : 130054.597                      |
+----------------------+---------------------------------+

+--------------------------------------------------------+
|                   .l move.l d0,(a0)                    |
|                      mulu.w d1,d2                      |
|                      move.l d0,(a0)                    |
|                      dbf    d7,.l                      |
|                                                        |
|                Note: buffer in FAST RAM                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 2502146.243                     |
|CPU cycles / per loop : 38.179                          |
|         color clocks : 177497                          |
|          rasterlines : 781.925                         |
|               frames : 2.498                           |
|                   µs : 50042.924                       |
+----------------------+---------------------------------+

+--------------------------------------------------------+
|                   .l move.l d0,(a0)                    |
|                      add.l  d3,d3                      |
|                      mulu.w d1,d2                      |
|                      move.l d0,(a0)                    |
|                      dbf    d7,.l                      |
|                                                        |
|                Note: buffer in FAST RAM                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 2502146.243                     |
|CPU cycles / per loop : 38.179                          |
|         color clocks : 177497                          |
|          rasterlines : 781.925                         |
|               frames : 2.498                           |
|                   µs : 50042.924                       |
+----------------------+---------------------------------+

+--------------------------------------------------------+
|                   .l move.l d0,(a0)                    |
|                      add.l  d3,d3                      |
|                      add.l  d3,d3                      |
|                      mulu.w d1,d2                      |
|                      move.l d0,(a0)                    |
|                      dbf    d7,.l                      |
|                                                        |
|                Note: buffer in FAST RAM                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 2631414.236                     |
|CPU cycles / per loop : 40.152                          |
|         color clocks : 186667                          |
|          rasterlines : 822.321                         |
|               frames : 2.627                           |
|                   µs : 52628.284                       |
+----------------------+---------------------------------+

+--------------------------------------------------------+
|                   .l move.l d0,(a0)                    |
|                      mulu.w d1,d2                      |
|                      add.l  d3,d3                      |
|                      move.l d0,(a0)                    |
|                      dbf    d7,.l                      |
|                                                        |
|                Note: buffer in FAST RAM                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 2631414.236                     |
|CPU cycles / per loop : 40.152                          |
|         color clocks : 186667                          |
|          rasterlines : 822.321                         |
|               frames : 2.627                           |
|                   µs : 52628.284                       |
+----------------------+---------------------------------+

Summary (maximum number of add.l instructions that fit without increasing the execution time):
* CHIP RAM / move / add / mulu / move -> 14 adds
* CHIP RAM / move / mulu / add / move -> 2 adds
* FAST RAM / move / add / mulu / move -> 1 add
* FAST RAM / move / mulu / add / move -> 0 adds
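
A note on reading the tables above: the "CPU cycles / total" and "µs" figures appear to be derived from the color clock count using the same 50.000000/3.546895 ratio mentioned earlier. This is inferred from the numbers, not stated by the tool, and the function name below is made up. A quick check against the first CHIP RAM test:

```python
# Consistency check: do the reported CPU cycles follow from the color clocks?
# Assumption: cycles = color clocks * (CPU MHz / color clock MHz).
CPU_MHZ = 50.000000
COLOR_CLOCK_MHZ = 3.546895

def color_clocks_to_cpu_cycles(color_clocks):
    """Scale a color clock count to CPU cycles at the nominal CPU clock."""
    return color_clocks * CPU_MHZ / COLOR_CLOCK_MHZ

cycles = color_clocks_to_cpu_cycles(395182)  # first CHIP RAM test
micros = cycles / CPU_MHZ                    # CPU cycles -> microseconds

print(f"{cycles:.3f} CPU cycles, {micros:.3f} µs")
```

The result agrees with the 5570816.164 cycles and 111416.323 µs in the table to within rounding.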

meynaf · 16 April 2024, 14:18

Quote:

Originally Posted by saimo

I've considered that, too, but then again I would always get wrong values, and instead I get consistently correct results for cases where there are no doubts about timings (e.g. .l dbf dx,.l -> 6 cycles).

That you get the right values for a small number of cycles doesn't mean they won't deviate for larger numbers. E.g. add 5% to 6, you still get 6. But add 5% to 20 and you'll get 21.
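
The point about relative error can be made concrete with a tiny sketch. The helper name is made up, and a 5% error is assumed for both values.

```python
# A constant relative error hides behind rounding for small values
# but shows up as whole cycles for larger ones.
def measured(true_cycles, relative_error):
    """Value a timer with the given relative error would report, rounded."""
    return round(true_cycles * (1 + relative_error))

small = measured(6, 0.05)    # 6.3 rounds back to 6
large = measured(20, 0.05)   # 21.0 is a full cycle off
print(small, large)
```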

Quote:

Originally Posted by saimo

By the way, before running the actual test, the proggie runs a dummy test to measure the overhead and then subtracts that overhead from the final calculation. But even if it didn't do that, the overhead is negligible with a large enough number of iterations, as it's basically the access time to the CIA B TOD HI and LO registers (a couple of CIA cycles, at about 70.5 CPU cycles each*), the VHPOSR register (14 cycles twice) and a jsr ([bd.l]) (another 14 cycles) - say, 200 cycles in all, which have no real weight on the millions of cycles being measured here (e.g. 5570816 cycles for the move-mulu-move case).

I wouldn't trust CIA timings too much...
Anyway, you've shown non-integer timings such as 85.380, 56.636, 16.200, 42.715, 53.048, 71.158, 85.003. Don't tell me that just chopping off the non-integer part can give the right result.

Quote:

Originally Posted by saimo

In the meantime, I finally got around to running tests that measure how many cycles one can stuff between CHIP RAM writes, when mulu.w is also in between, without affecting the overall execution time:
<snip>
Summary:
* CHIP RAM / move / add / mulu / move -> 14 adds
* CHIP RAM / move / mulu / add / move -> 2 adds
* FAST RAM / move / add / mulu / move -> 1 add
* FAST RAM / move / mulu / add / move -> 0 adds

Ok, let's try to interpret these :

Code:

.l move.l d0,(a0)                  ; 28 (r=26)
 mulu.w d1,d2                      ; 26->24 (r=24, stall, next @+4)
 move.l d0,(a0)                    ; 28->32 (+4 stall)
 dbf    d7,.l                      ; 6->0
; total 84 (3 blocks of 28)

.l move.l d0,(a0)                  ; 28 (r=26)
 rept   14
 add.l  d3,d3                      ; 28->2 (next @+12 missed -> @+26)
 endr
 mulu.w d1,d2                      ; 26 (next @+0)
 move.l d0,(a0)                    ; 28
 dbf    d7,.l                      ; 6->0
; total 84

.l move.l d0,(a0)                  ; 28 (r=26)
 rept   15
 add.l  d3,d3                      ; 30->4 (next @+10 missed -> @+24 missed -> @+38)
 endr
 mulu.w d1,d2                      ; 26 (next @ +12)
 move.l d0,(a0)                    ; 28 + 12 (stall)
 dbf    d7,.l                      ; 6->0
; total 98 (7 blocks of 14)

.l move.l d0,(a0)                  ; 28->36 (r=26, stall +8 from prev. iteration)
 mulu.w d1,d2                      ; 26->24 (r=24, stall, next @+4)
 add.l  d3,d3                      ; 2 (next @+2)
 add.l  d3,d3                      ; 2 (next @+0)
 dbf    d7,.l                      ; 6 (next @+8) - next move will stall +8
; total 70 (5 blocks of 14)
; error - are you sure of your 85 ? (this is exact same values as first test)

.l move.l d0,(a0)                  ; 28 (r=26)
 mulu.w d1,d2                      ; 26->24 (r=24, stall, next @+4)
 add.l  d3,d3                      ; 2 (next @+2)
 add.l  d3,d3                      ; 2 (next @+0)
 add.l  d3,d3                      ; 2 (next @+12)
 move.l d0,(a0)                    ; 28->40 (r=24, stall +12)
 dbf    d7,.l                      ; 6->0
; total 98

So the chipmem results seem predictable after all.
But the fastmem results are 2 cycles lower than I expected.
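
One way to cross-check the "blocks of 14" reading is to express the measured CHIP RAM loop times in bus slots, using the ~14.0968 cycles per slot from saimo's earlier post instead of a flat 14. This is only a sketch; the dictionary keys are shorthand labels for the tests above.

```python
# Express measured CHIP RAM loop times in units of bus slots.
# Assumption: one slot = 50.000000 / 3.546895 CPU cycles (~14.0968).
CYCLES_PER_SLOT = 50.000000 / 3.546895

measured_per_loop = {
    "move/mulu/move": 85.003,
    "move/15 adds/mulu/move": 99.017,
    "move/mulu/3 adds/move": 99.223,
}

# Fractional slot count per loop, then the nearest whole number of slots.
slots = {name: cycles / CYCLES_PER_SLOT for name, cycles in measured_per_loop.items()}
nearest = {name: round(value) for name, value in slots.items()}

for name in measured_per_loop:
    print(f"{name}: {slots[name]:.3f} slots (nearest: {nearest[name]})")
```

With this slot length, 6 slots come to about 84.6 cycles and 7 slots to about 98.7, so the measured 85 and 99 sit close to whole slot counts.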

Don_Adan · 16 April 2024, 15:26

In general, some results look strange to me, e.g. 99 cycles for 3x add.l d3,d3, when 1x add.l d3,d3 and 2x add.l d3,d3 take only 85 cycles.
Perhaps there are more dependencies at play.
I can only suspect that CCR handling might be the problem, e.g. the X bit.

Anyway, if you really want to check whether chip RAM writes can be covered by mulu,
then replace
move.l d0,(A0)
with
movem.l d0,(A0)
If I remember right, movem doesn't change the CCR, whereas move does.

saimo · 16 April 2024, 18:25

@meynaf

Reply split in 2 parts, as the forum tells me it's too long...

Quote:

Originally Posted by meynaf

That you get the right values for a small number of cycles doesn't mean they won't deviate for larger numbers. E.g. add 5% to 6, you still get 6. But add 5% to 20 and you'll get 21.

In this case such logic doesn't apply, as the overhead is incurred only once per test (not per loop) and the way the time is measured (more about this below) does not depend on the number of loops.
Anyway, as promised, I also made tests with 50 million loops, and they confirm that the proggie measures the time correctly - in the results below, you can see that the reported times are just 1/1000th away from perfectly round figures (of course, with so many more loops, precision increased).
These tests focus on mulu.x alone, without accesses to RAM, to be sure about the precision of the figures.
The analysis of the (interesting) results follows the data.

Code:

mulu.w d1,d2 / d1 = 0 / 65536 loops
26 cycles

+--------------------------------------------------------+
|              .l move.w d0,d6 ;d0 = $1234               |
|                 mulu.w d1,d6 ;d1 = 0                   |
|                 dbf    d7,.l                           |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 2231416.492                     |
|CPU cycles / per loop : 34.048                          |
|         color clocks : 158292                          |
|          rasterlines : 697.321                         |
|               frames : 2.227                           |
|                   µs : 44628.329                       |
+----------------------+---------------------------------+


mulu.w d1,d2 / d1 = 0 / 50000000 loops
26 cycles

+--------------------------------------------------------+
|              .l move.w d0,d6 ;d0 = $1234               |
|                 mulu.w d1,d6 ;d1 = 0                   |
|                 subq.l #1,d7                           |
|                 bne.b  .l                              |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 50000000                        |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 1799995164.784                  |
|CPU cycles / per loop : 35.999                          |
|         color clocks : 127687877                       |
|          rasterlines : 562501.660                      |
|               frames : 1797.129                        |
|                   µs : 35999903.295                    |
+----------------------+---------------------------------+


mulu.w d1,d2 / d1 = 7 / 65536 loops
28 cycles

+--------------------------------------------------------+
|              .l move.w d0,d6 ;d0 = $1234               |
|                 mulu.w d1,d6 ;d1 = 7                   |
|                 dbf    d7,.l                           |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 2362446.590                     |
|CPU cycles / per loop : 36.048                          |
|         color clocks : 167587                          |
|          rasterlines : 738.268                         |
|               frames : 2.358                           |
|                   µs : 47248.931                       |
+----------------------+---------------------------------+


mulu.w d1,d2 / d1 = 7 / 50000000 loops
28 cycles

+--------------------------------------------------------+
|              .l move.w d0,d6 ;d0 = $1234               |
|                 mulu.w d1,d6 ;d1 = 7                   |
|                 subq.l #1,d7                           |
|                 bne.b  .l                              |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 50000000                        |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 1899994600.911                  |
|CPU cycles / per loop : 37.999                          |
|         color clocks : 134781627                       |
|          rasterlines : 593751.660                      |
|               frames : 1896.970                        |
|                   µs : 37999892.018                    |
+----------------------+---------------------------------+


mulu.w d1,d2 / d1 = even / 25000000 loops
26 cycles

+--------------------------------------------------------+
|              .l move.w d0,d6 ;d0 = $1234               |
|                 mulu.w d7,d6                           |
|                 subq.l #2,d7                           |
|                 bne.b  .l                              |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 25000000                        |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 899999252.867                   |
|CPU cycles / per loop : 35.999                          |
|         color clocks : 63844057                        |
|          rasterlines : 281251.352                      |
|               frames : 898.566                         |
|                   µs : 17999985.057                    |
+----------------------+---------------------------------+


mulu.w d1,d2 / d1 = odd / 25000000 loops
28 cycles

+--------------------------------------------------------+
|              .l move.w d0,d6 ;d0 = $1234               |
|                 mulu.w d7,d6                           |
|                 subq.l #2,d7                           |
|                 bpl.b  .l                              |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 25000000                        |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 949998970.930                   |
|CPU cycles / per loop : 37.999                          |
|         color clocks : 67390932                        |
|          rasterlines : 296876.352                      |
|               frames : 948.486                         |
|                   µs : 18999979.418                    |
+----------------------+---------------------------------+


mulu.w d1,d2 / d1 from 65535 to 0 / 65536 loops
27 cycles on average

+--------------------------------------------------------+
|              .l move.w d0,d6 ;d0 = $1234               |
|                 mulu.w d7,d6                           |
|                 dbf    d7,.l                           |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 2296966.783                     |
|CPU cycles / per loop : 35.048                          |
|         color clocks : 162942                          |
|          rasterlines : 717.806                         |
|               frames : 2.293                           |
|                   µs : 45939.335                       |
+----------------------+---------------------------------+


mulu.w d1,d2 / d1 from 65535 to 0 / 50000000 loops
27 cycles on average

+--------------------------------------------------------+
|              .l move.w d0,d6 ;d0 = $1234               |
|                 mulu.w d7,d6                           |
|                 subq.l #1,d7                           |
|                 bne.b  .l                              |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 50000000                        |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 1849994812.364                  |
|CPU cycles / per loop : 36.999                          |
|         color clocks : 131234747                       |
|          rasterlines : 578126.638                      |
|               frames : 1847.049                        |
|                   µs : 36999896.247                    |
+----------------------+---------------------------------+


mulu.l d1,d2 / d1 = 0 / 65536 loops
44 cycles

+--------------------------------------------------------+
|            .l move.l d0,d6 ;d0 = $12345678             |
|               mulu.l d1,d6 ;d1 = 0                     |
|               dbf    d7,.l                             |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 3407839.814                     |
|CPU cycles / per loop : 51.999                          |
|         color clocks : 241745                          |
|          rasterlines : 1064.955                        |
|               frames : 3.402                           |
|                   µs : 68156.796                       |
+----------------------+---------------------------------+


mulu.l d1,d2 / d1 = 0 / 50000000 loops
44 cycles

+--------------------------------------------------------+
|            .l move.l d0,d6 ;d0 = $12345678             |
|               mulu.l d1,d6 ;d1 = 0                     |
|               subq.l #1,d7                             |
|               bne.b  .l                                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 50000000                        |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 2699992415.901                  |
|CPU cycles / per loop : 53.999                          |
|         color clocks : 191531792                       |
|          rasterlines : 843752.387                      |
|               frames : 2695.694                        |
|                   µs : 53999848.318                    |
+----------------------+---------------------------------+


mulu.l d1,d2 / d1 = 7 / 65536 loops
46 cycles

+--------------------------------------------------------+
|            .l move.l d0,d6 ;d0 = $12345678             |
|               mulu.l d1,d6 ;d1 = 7                     |
|               dbf    d7,.l                             |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 3542140.379                     |
|CPU cycles / per loop : 54.048                          |
|         color clocks : 251272                          |
|          rasterlines : 1106.925                        |
|               frames : 3.536                           |
|                   µs : 70842.807                       |
+----------------------+---------------------------------+


mulu.l d1,d2 / d1 = 7 / 50000000 loops
46 cycles

+--------------------------------------------------------+
|            .l move.l d0,d6 ;d0 = $12345678             |
|               mulu.l d1,d6 ;d1 = 7                     |
|               subq.l #1,d7                             |
|               bne.b  .l                                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 50000000                        |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 2799991570.091                  |
|CPU cycles / per loop : 55.999                          |
|         color clocks : 198625522                       |
|          rasterlines : 875002.299                      |
|               frames : 2795.534                        |
|                   µs : 55999831.401                    |
+----------------------+---------------------------------+


mulu.l d1,d2 / d1 = even / 25000000 loops
44 cycles

+--------------------------------------------------------+
|            .l move.l d0,d6 ;d0 = $12345678             |
|               mulu.l d7,d6                             |
|               subq.l #2,d7                             |
|               bne.b  .l                                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 25000000                        |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 1349996870.502                  |
|CPU cycles / per loop : 53.999                          |
|         color clocks : 95765943                        |
|          rasterlines : 421876.400                      |
|               frames : 1347.847                        |
|                   µs : 26999937.410                    |
+----------------------+---------------------------------+


mulu.l d1,d2 / d1 = odd / 25000000 loops
46 cycles

+--------------------------------------------------------+
|            .l move.l d0,d6 ;d0 = $12345678             |
|               mulu.l d7,d6                             |
|               subq.l #2,d7                             |
|               bpl.b  .l                                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 25000000                        |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 1399997067.857                  |
|CPU cycles / per loop : 55.999                          |
|         color clocks : 99312852                        |
|          rasterlines : 437501.550                      |
|               frames : 1397.768                        |
|                   µs : 27999941.357                    |
+----------------------+---------------------------------+


mulu.l d1,d2 / d1 from 65535 to 0 / 65536 loops
45 cycles on average

+--------------------------------------------------------+
|            .l move.l d0,d6 ;d0 = $12345678             |
|               mulu.l d7,d6                             |
|               dbf    d7,.l                             |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 3476590.087                     |
|CPU cycles / per loop : 53.048                          |
|         color clocks : 246622                          |
|          rasterlines : 1086.440                        |
|               frames : 3.471                           |
|                   µs : 69531.801                       |
+----------------------+---------------------------------+


mulu.l d1,d2 / d1 from 65535 to 0 / 50000000 loops
45 cycles on average

+--------------------------------------------------------+
|            .l move.l d0,d6 ;d0 = $12345678             |
|               mulu.l d7,d6                             |
|               subq.l #1,d7                             |
|               bne.b  .l                                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 50000000                        |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 2749988088.172                  |
|CPU cycles / per loop : 54.999                          |
|         color clocks : 195078380                       |
|          rasterlines : 859376.123                      |
|               frames : 2745.610                        |
|                   µs : 54999761.763                    |
+----------------------+---------------------------------+

Summary:
* mulu.w d1,d2 / d1 = 0 / 65536 loops -> 26 cycles
* mulu.w d1,d2 / d1 = 0 / 50000000 loops -> 26 cycles
* mulu.w d1,d2 / d1 = 7 / 65536 loops -> 28 cycles
* mulu.w d1,d2 / d1 = 7 / 50000000 loops -> 28 cycles
* mulu.w d1,d2 / d1 = even / 25000000 loops -> 26 cycles
* mulu.w d1,d2 / d1 = odd / 25000000 loops -> 28 cycles
* mulu.w d1,d2 / d1 from 65535 to 0 / 65536 loops -> 27 cycles on average
* mulu.w d1,d2 / d1 from 65535 to 0 / 50000000 loops -> 27 cycles on average
* mulu.l d1,d2 / d1 = 0 / 65536 loops -> 44 cycles
* mulu.l d1,d2 / d1 = 0 / 50000000 loops -> 44 cycles
* mulu.l d1,d2 / d1 = 7 / 65536 loops -> 46 cycles
* mulu.l d1,d2 / d1 = 7 / 50000000 loops -> 46 cycles
* mulu.l d1,d2 / d1 = even / 25000000 loops -> 44 cycles
* mulu.l d1,d2 / d1 = odd / 25000000 loops -> 46 cycles
* mulu.l d1,d2 / d1 from 65535 to 0 / 65536 loops -> 45 cycles on average
* mulu.l d1,d2 / d1 from 65535 to 0 / 50000000 loops -> 45 cycles on average

Takeaways:
* mulu.w takes either 26 or 28 cycles;
* mulu.l takes either 44 or 46 cycles;
* when the source (left) operand is odd, mulu.x takes 2 cycles more (I guess this is also true for muls.x);
* the figures obtained by means of millions of loops match those obtained with 65536 loops.
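
How each mulu figure follows from its per-loop measurement can be sketched in a few lines of Python (this is not part of the test archive). The overheads are assumptions: 2 cycles for the move and 6 for dbf, as annotated earlier in the thread, and 8 for the subq/branch pair, which is inferred from the figures themselves rather than taken from a manual. The names `OVERHEAD` and `mulu_cycles` are hypothetical.

```python
# Assumed loop overheads in CPU cycles (see the note above):
#   "dbf"      = move (2) + dbf (6)
#   "subq_bcc" = move (2) + subq/bcc pair (8, inferred from the measurements)
OVERHEAD = {"dbf": 2 + 6, "subq_bcc": 2 + 8}


def mulu_cycles(per_loop, loop_kind):
    """Subtract the assumed loop overhead and round to the nearest whole cycle."""
    return round(per_loop - OVERHEAD[loop_kind])


# Per-loop figures copied from the result boxes above.
measurements = [
    ("mulu.w, d1 from 65535 to 0, dbf loop", 35.048, "dbf"),
    ("mulu.w, d1 from 65535 to 0, subq/bne loop", 36.999, "subq_bcc"),
    ("mulu.l, d1 = 0, dbf loop", 51.999, "dbf"),
    ("mulu.l, d1 = 7, subq/bne loop", 55.999, "subq_bcc"),
    ("mulu.l, d1 from 65535 to 0, dbf loop", 53.048, "dbf"),
]

for label, per_loop, kind in measurements:
    print(f"{label}: {mulu_cycles(per_loop, kind)} cycles")
```

The rounding only hides the sub-cycle measurement error; it does not change which integer the figure points to.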

Quote:

I wouldn't trust CIA timings too much...

I use the CIA B TOD to count the number of rasterlines elapsed, and that's reliable. For sub-rasterline precision I read VHPOSR - see the code snippet in my previous post.

Quote:

Anyway, you've shown non-integer timings such as 85.380, 56.636, 16.200, 42.715, 53.048, 71.158, 85.003. Don't tell me that just chopping the non-integer part of that, can give the right result.

I derive the CPU cycles from the color clocks measured by means of the CIA B TOD and VHPOSR as mentioned above. The formula is <color clocks> * <CPU frequency> / 3.546895, and that's where the decimals come from. Unfortunately, since there isn't a higher resolution clock, I have to make do with this. The precision is of course not perfect, but, as you can see from the results above, the figures are reliable already with 65536 loops.
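
As a quick check of the formula, here is a minimal Python sketch (not the code used by the test program) that converts a color clock count into CPU cycles. It assumes the CPU frequency is given in MHz and that 3.546895 is the color clock frequency in MHz, as the formula implies; `cpu_cycles` is a hypothetical helper name.

```python
COLOR_CLOCK_MHZ = 3.546895  # constant taken from the formula above


def cpu_cycles(color_clocks, cpu_mhz=50.0):
    """Convert a color clock count into CPU cycles at the given CPU frequency."""
    return color_clocks * cpu_mhz / COLOR_CLOCK_MHZ


# Color clock count from the mulu.w / dbf test with 65536 loops.
total = cpu_cycles(162942)
print(f"total cycles:           {total:.3f}")
print(f"cycles per loop:        {total / 65536:.3f}")
print(f"cycles per color clock: {cpu_cycles(1):.3f}")
```

The results agree with the 2296966.783 total and 35.048 per loop reported in the box to within the last printed digit, and the last line shows where the "about 14 CPU cycles per color clock" figure comes from.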

saimo · 16 April 2024, 18:45

@meynaf

Reply part 2

Quote:

Ok, let's try to interpret these :

Code:

.l move.l d0,(a0)                  ; 28 (r=26)
 mulu.w d1,d2                      ; 26->24 (r=24, stall, next @+4)
 move.l d0,(a0)                    ; 28->32 (+4 stall)
 dbf    d7,.l                      ; 6->0
; total 84 (3 blocks of 28)

.l move.l d0,(a0)                  ; 28 (r=26)
 rept   14
 add.l  d3,d3                      ; 28->2 (next @+12 missed -> @+26)
 endr
 mulu.w d1,d2                      ; 26 (next @+0)
 move.l d0,(a0)                    ; 28
 dbf    d7,.l                      ; 6->0
; total 84

.l move.l d0,(a0)                  ; 28 (r=26)
 rept   15
 add.l  d3,d3                      ; 30->4 (next @+10 missed -> @+24 missed -> @+38)
 endr
 mulu.w d1,d2                      ; 26 (next @ +12)
 move.l d0,(a0)                    ; 28 + 12 (stall)
 dbf    d7,.l                      ; 6->0
; total 98 (7 blocks of 14)

.l move.l d0,(a0)                  ; 28->36 (r=26, stall +8 from prev. iteration)
 mulu.w d1,d2                      ; 26->24 (r=24, stall, next @+4)
 add.l  d3,d3                      ; 2 (next @+2)
 add.l  d3,d3                      ; 2 (next@ 0)
 dbf    d7,.l                      ; 6 (next @+8) - next move will stall +8
; total 70 (5 blocks of 14)
; error - are you sure of your 85 ? (this is exact same values as first test)

.l move.l d0,(a0)                  ; 28 (r=26)
 mulu.w d1,d2                      ; 26->24 (r=24, stall, next @+4)
 add.l  d3,d3                      ; 2 (next @+2)
 add.l  d3,d3                      ; 2 (next @+0)
 add.l  d3,d3                      ; 2 (next @+12)
 move.l d0,(a0)                    ; 28->40 (r=24, stall +12)
 dbf    d7,.l                      ; 6->0
; total 98

Seems chipmem results are predictable after all
But fastmem results give 2 cycles less than expected by me.

About the matter of what happens when memory is accessed, I've drawn a couple of quick and dirty diagrams for the move-mulu-move case and, to my surprise, it looks like, in theory, the execution time should be 1 color clock (i.e. about 14 CPU cycles) shorter:

I'll prepare a proper picture and explain the underlying thoughts later (or more probably tomorrow - I haven't slept for basically 2 days after months of bad sleep, so I'm about to collapse...).

I also ran the test again using 50000000 loops, obtaining this, which confirms 85 cycles on average:

Code:

move > mulu.w d1,d2 > move / d1 from 65535 to 0 / 50000000 loops

+--------------------------------------------------------+
|             .l move.l d0,(a0)                          |
|                mulu.w d1,d2   ;d1 = d2 = 0             |
|                move.l d0,(a0)                          |
|                subq.l #1,d7                            |
|                bne.b  .l                               |
|                                                        |
|                Note: buffer in CHIP RAM                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 50000000                        |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 4247766807.306                  |
|CPU cycles / per loop : 84.955                          |
|         color clocks : 301327657                       |
|          rasterlines : 1327434.612                     |
|               frames : 4241.005                        |
|                   µs : 84955336.146                    |
+----------------------+---------------------------------+

Anyway, here's a first guess about what gives such odd timings: as seen from the results (and also from my doodles above), the mulus happen to end very close to the next color clock, so maybe, at times, the CPU misses that slot and thus ends up taking 14 cycles more. How many times?
Using 65536 loops for convenience:
* total cycles: 65536 * 85 = 5570560
* expected cycles: 65536 * 84 = 5505024
* from that: (65536 - x) * 84 + x * (84 + 14) = 65536 * 85 -> x = 65536/14 = about 4681 times
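
The same estimate can be reproduced with a small Python sketch. It only restates the equation above under its own assumptions (84 cycles for a normal loop, a 14-cycle penalty for a missed slot); `missed_slots` is a hypothetical name.

```python
def missed_slots(loops, measured_avg, base=84, penalty=14):
    """Solve (loops - x) * base + x * (base + penalty) = loops * measured_avg for x."""
    return loops * (measured_avg - base) / penalty


# With the rounded average of 85 cycles used in the list above.
print(round(missed_slots(65536, 85)))

# With the 84.955 average measured in the 50000000-loop test above.
print(round(missed_slots(65536, 84.955)))
```

With the rounded 85-cycle average the slot is missed roughly once every 14 loops; with the measured 84.955 the estimate drops slightly, to about 4470 misses per 65536 loops.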

Why does that happen? I don't know. For sure it's not because of the odd/even operand factor (alone), otherwise the slot would be missed 50% of the time and the average time would be higher. I guess it's first necessary to figure out why the execution takes 84 cycles instead of 98 most of the time in the first place.

saimo · 16 April 2024, 18:57

@Don_Adan

Quote:

Originally Posted by Don_Adan

In general some results are strange for me, f.e 99 cycles for 3x add.l d3,d3, when for 1x add.l d3,d3 and 2x add.l d3,d3 is only 85 cycles.
Perhaps must exist more relationships.
I can only suspect than CCR handling can be problematic. f.e bit X

To me it looks like you're missing the fact that 1 color clock = about 14 CPU cycles, that the CPU gets to write only on those boundaries and that the CPU can't use two consecutive CHIP bus slots. So, even if X adds might overlap nicely, X+1 adds might push the CPU into the next CHIP bus slot.

Quote:

Anyway, but if You really want to check if chip ram writes can be covered by mulu.
Then replace
move.l d0,(A0)
with
movem.l d0,(A0)
Movem dont change CCR if I remember right, when move change CCR.

The fact that exg cannot run in parallel as well makes me suspect that the ccr plays no role, but I'll run the test and let you know.

EDIT: the movem test confirms that the ccr doesn't matter ->

Code:

+--------------------------------------------------------+
|            .l movem.l d0,(a0)                          |
|               mulu.w  d1,d2   ;d1 = d2 = 0             |
|               movem.l d0,(a0)                          |
|               subq.l  #1,d7                            |
|               bne.b   .l                               |
|                                                        |
|                Note: buffer in CHIP RAM                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 50000000                        |
|         leaked bytes : 672                             |
+----------------------+---------------------------------+
|   CPU cycles / total : 4247766877.790                  |
|CPU cycles / per loop : 84.955                          |
|         color clocks : 301327662                       |
|          rasterlines : 1327434.634                     |
|               frames : 4241.005                        |
|                   µs : 84955337.555                    |
+----------------------+---------------------------------+

The test is included in the attached archive.

@all

Attached here is the updated archive with the latest tests included.

meynaf · 16 April 2024, 19:06

Quote:

Originally Posted by saimo

In this case such logic doesn't apply, as the overhead applies only once per test (not per loop) and the way the time is measured (more about this below) does not depend on the loops number.

Sorry, but once-per-test overhead can not explain the non-integer results you got before. Especially not that they disappeared with extra precision.

Quote:

Originally Posted by saimo

I derive the CPU cycles from the color clocks measured by means of the CIA B TOD and VHPOSR as mentioned above. The formula is <color clocks> * <CPU frequency> / 3.546895, and that's where the decimals come from. Unfortunately, since there isn't a higher resolution clock, I have to make do with this. The precision is of course not perfect, but, as you can see from the results above, the figures are reliable already with 65536 loops.

There are easier ways to compute the timing.

saimo · 16 April 2024, 19:22

Quote:

Originally Posted by meynaf

Sorry, but once-per-test overhead can not explain the non-integer results you got before. Especially not that they disappeared with extra precision.

My reply was relative to the reliability of the figures, not to why decimals appear (which I explained separately).

Quote:

There are easier ways to compute the timing.

I'd be glad to implement a better method: which would that be?

meynaf · 16 April 2024, 19:38

Quote:

Originally Posted by saimo

My reply was relative to the reliability of the figures, not to why decimals appear (which I explained separately).

But both are the same. The decimals indicated something didn't go well.

Quote:

Originally Posted by saimo

I'd be glad to implement a better method: which would that be?

Not 'better' but easier as i said : just count the vblanks...
Cheap and (with enough loops) reliable enough.

saimo · 16 April 2024, 23:16

@meynaf

Quote:

Originally Posted by meynaf

But both are the same. The decimals indicated something didn't go well.

1. You first questioned the precision of the decimals, saying that the error might scale depending on the magnitude of the numbers.
2. I explained that the error doesn't depend on that.
3. You said that the origin of the decimals was not explained.
4. I said that I was answering to a different remark (and added that the explanation you wanted was provided elsewhere in the very same post).
5. Now you're saying that the precision of the decimals and where they come from are the same thing and that they indicate some error.
Sorry, but here you're not being fair, so this comment of yours goes to >NIL:

Quote:

Not 'better' but easier as i said : just count the vblanks...
Cheap and (with enough loops) reliable enough.

Here's an even easier method - the easiest and always indisputably correct one: printing out "CPU cycles taken: more than 0".
What's the point in suggesting an easier method if it isn't just as precise? Especially considering that in this context precision matters and, even more, considering that you have been questioning the precision of the results! You really seem to just want to be against for the sake of it.

Anyway, technically speaking, using the vertical blanks is 313*227/50 = 1421.02 times less precise than the method I used (which isn't that complicated, by the way), so, no, it isn't worth considering.

That said, I must admit that I was so annoyed by the comments that I decided to have a second look at my code to try and make it even more precise. After some investigation, I found something I wasn't aware of: the CIA B TOD clock does not increase when a rasterline starts or ends (as I had erroneously believed), but at around color clock $60 (sometimes a little less, sometimes a little more, and this figure is also affected by the slow access time of the CIAs). On some occasions, this caused the color clocks count to be wrong by 227 (i.e. 1 rasterline). So, I changed the code to take that into account and to work regardless of when exactly the TOD clock increases. The code now is not only no longer affected by the off-by-one issue, but is actually simpler, more beautiful (than what is shown in the snippet I posted earlier) and - guess what - more precise. So much more precise that even the 65536-loop tests provide figures that are just 0.001 or 0.002 away from the perfect ones - here are some examples:

Code:

+--------------------------------------------------------+
|                      .l dbf d7,.l                      |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 393358.134                      |
|CPU cycles / per loop : 6.002                           |
|         color clocks : 27904                           |
|          rasterlines : 122.925                         |
|               frames : 0.392                           |
|                   µs : 7867.162                        |
+----------------------+---------------------------------+


+--------------------------------------------------------+
|              .l move.w d0,d6 ;d0 = $1234               |
|                 mulu.w d7,d6                           |
|                 dbf    d7,.l                           |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 2293907.770                     |
|CPU cycles / per loop : 35.002                          |
|         color clocks : 162725                          |
|          rasterlines : 716.850                         |
|               frames : 2.290                           |
|                   µs : 45878.155                       |
+----------------------+---------------------------------+


+--------------------------------------------------------+
|            .l move.l d0,d6 ;d0 = $12345678             |
|               mulu.l d7,d6                             |
|               dbf    d7,.l                             |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 3473488.783                     |
|CPU cycles / per loop : 53.001                          |
|         color clocks : 246402                          |
|          rasterlines : 1085.471                        |
|               frames : 3.467                           |
|                   µs : 69469.775                       |
+----------------------+---------------------------------+


+--------------------------------------------------------+
|             .l move.l d0,(a0)                          |
|                mulu.w d1,d2   ;d1 = d2 = 0             |
|                move.l d0,(a0)                          |
|                dbf    d7,.l                            |
|                                                        |
|                Note: buffer in CHIP RAM                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 65536                           |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 5567728.957                     |
|CPU cycles / per loop : 84.956                          |
|         color clocks : 394963                          |
|          rasterlines : 1739.925                        |
|               frames : 5.558                           |
|                   µs : 111354.579                      |
+----------------------+---------------------------------+


+--------------------------------------------------------+
|             .l move.l d0,(a0)                          |
|                mulu.w d1,d2   ;d1 = d2 = 0             |
|                move.l d0,(a0)                          |
|                subq.l #1,d7                            |
|                bne.b  .l                               |
|                                                        |
|                Note: buffer in CHIP RAM                |
+----------------------+---------------------------------+
|                  CPU : 68030 50.000000 MHz IiDd..      |
|         loops number : 50000000                        |
|         leaked bytes : 0                               |
+----------------------+---------------------------------+
|   CPU cycles / total : 4247763790.583                  |
|CPU cycles / per loop : 84.955                          |
|         color clocks : 301327443                       |
|          rasterlines : 1327433.669                     |
|               frames : 4241.002                        |
|                   µs : 84955275.811                    |
+----------------------+---------------------------------+

Attached is the updated archive.

meynaf · 17 April 2024, 11:40

Quote:

Originally Posted by saimo

1. You first questioned the precision of the decimals, saying that the error might scale depending on the magnitude of the numbers.

Which is true. An error in a constant, clocks not exactly being 3546895 and 50000000, an interrupt not disabled... Many things can make the result drift away.

Quote:

Originally Posted by saimo

2. I explained that the error doesn't depend on that.

And where did I agree to this ? I did not. It can still depend on that.
I haven't said it depends only on that, though. There is setup-related error, which is why many loops are needed.

Quote:

Originally Posted by saimo

3. You said that the origin of the decimals was not explained.

What ? No. What i said is that once-per-test overhead wasn't enough to explain them.

Quote:

Originally Posted by saimo

4. I said that I was answering to a different remark (and added that the explanation you wanted was provided elsewhere in the very same post).

Except that your explanation wasn't convincing and there was only a single remark to reply to.

Quote:

Originally Posted by saimo

5. Now you're saying that the precision of the decimals and where they come from are the same thing and that they indicate some error.

Look at the values i wrote in post #47. These can *not* be the right result.

Quote:

Originally Posted by saimo

Sorry, but here you're not being fair, so this comment of yours goes to >NIL:

I think i'll put that on the account of your recent lack of sleep.
I may lack the most basic of diplomacy skills, that's sure - and we might just have had a misunderstanding after all - but i'm not unfair.

Quote:

Originally Posted by saimo

Here's an even easier method - the easiest and always indisputably correct one: printing out "CPU cycles taken: more than 0".

Very funny -- or maybe not, as this looks a lot like a strawman fallacy.
(And wrong in the case we're timing an empty routine.)

Quote:

Originally Posted by saimo

What's the point in suggesting an easier method if it isn't just as precise?

Because it's a lot simpler, requires fewer computations and thus is less error-prone?

Quote:

Originally Posted by saimo

Especially considering that in this context precision matters and, even more, considering that you have been questioning the precision of the results! You really seem to just want to be against for the sake of it.

Now you're being unfair ! The precision we need is 1 cycle. So as long as this integer cycle value is unambiguous, extra precision is useless. But previously you didn't reach this (with 65536 loops). And using the vblank is enough to have it.

Quote:

Originally Posted by saimo

Anyway, technically speaking, using the vertical blanks is 313*227/50 = 1421.02 times less precise than the method I used (which isn't that complicated, by the way), so, no, it isn't worth considering.

As i said, it is not "the more, the better".
Vertical blank is max 20ms error, which is quite acceptable for a computation that lasts several seconds.
We need to know the number of cycles, not the number of 1/1000th of a cycle.
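
To put numbers on this argument, here is a rough Python sketch of the worst-case per-loop error caused by the timer resolution alone. It assumes a 50 MHz CPU, a 20 ms vertical blank period and one color clock at 3546895 Hz as the finer resolution; `worst_case_error_cycles` is a hypothetical helper and the model ignores every other error source mentioned in the thread.

```python
CPU_HZ = 50_000_000          # CPU clock used in the tests
VBLANK_S = 0.020             # one vertical blank period, as stated above
COLOR_CLOCK_S = 1 / 3546895  # one color clock


def worst_case_error_cycles(loops, resolution_s, cpu_hz=CPU_HZ):
    """Worst-case error per loop, in CPU cycles, from timer resolution alone."""
    return resolution_s * cpu_hz / loops


for loops in (65536, 50_000_000):
    vb = worst_case_error_cycles(loops, VBLANK_S)
    cc = worst_case_error_cycles(loops, COLOR_CLOCK_S)
    print(f"{loops} loops: vblank {vb:.4f} cycles/loop, color clock {cc:.7f} cycles/loop")
```

Under these assumptions, counting vertical blanks resolves a single cycle only when the loop count is in the millions, while color clock resolution is already well below one cycle per loop at 65536 loops.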

Quote:

Originally Posted by saimo

That said, I must admit that I was so annoyed by the comments that I decided to have a second look at my code to try and make it even more precise. After some investigation, I found something I wasn't aware of: the CIA B TOD clock does not increase when a rasterline starts or ends (as I had erroneously believed), but at around color clock $60 (sometimes a little less, sometimes a little more, and this figure is also affected by the slow access time of the CIAs). On some occasions, this caused the color clocks count to be wrong by 227 (i.e. 1 rasterline). So, I changed the code to take that into account and to work regardless of when exactly the TOD clock increases. The code now is not only no longer affected by the off-by-one issue, but is actually simpler, more beautiful (than what is shown in the snippet I posted earlier) and - guess what - more precise. So much more precise that even the 65536-loop tests provide figures that are just 0.001 or 0.002 away from the perfect ones - here are some examples:

So at least this has been useful for something.
Perhaps we could get further by watching this 84.955 closer, hmm ?

saimo · 17 April 2024, 13:03

@meynaf

>NIL:

meynaf · 17 April 2024, 14:09

Quote:

Originally Posted by saimo

@maynaf

>NIL:

You should really go to sleep.

Karlos · 17 April 2024, 14:21

Can we stop the nonsense and get back to measuring? This is interesting, the flexing isn't.

meynaf · 17 April 2024, 14:35

Quote:

Originally Posted by Karlos

Can we stop the nonsense and get back to measuring? This is interesting, the flexing isn't.

You may try to find a way to predict the obtained values, and ask for extra tests when some theory needs to be verified. I tried myself, but with little result.


Don_Adan · 16 April 2024, 11:35 (post #43)

I'd say this is not confirmed for me, but maybe that's just me. add.l is one thing and dbf is another: NOT ALL instructions have to work in parallel with mulu. So these tests are not 100% valid for me.

Don_Adan · 16 April 2024, 15:26 (post #48)

In general some results are strange to me, e.g. 99 cycles for 3x add.l d3,d3, when 1x add.l d3,d3 and 2x add.l d3,d3 both take only 85 cycles. Perhaps there are more relationships at play. I can only suspect that CCR handling may be problematic, e.g. the X bit. Anyway, if you really want to check whether chip RAM writes can be covered by mulu, then replace move.l d0,(a0) with movem.l d0,(a0). movem doesn't change the CCR, if I remember right, whereas move does.

