16 April 2024, 11:26 | #41 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,365
|
Not so easy. You will discover that each extra add.l does not simply add 2 cycles. Some will add nothing and then suddenly you will get lots of extra cycles. This is due to chipmem timing alignment.
|
16 April 2024, 11:30 | #42 | ||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 857
|
@Don_Adan
Quote:
This is confirmed already by these tests:
Code:
core:
.l      move.l  d6,d0           ;2 cycles
        mulu.l  d7,d0           ;45 cycles*
        dbf     d7,.l           ;6 cycles

CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 3476590.087
CPU cycles / per loop: 53.048
color clocks: 246622
rasterlines: 1086.440
frames: 3.471
µs: 69531.801
Code:
core:
.l      move.w  d6,d0           ;2 cycles
        mulu.w  d7,d0           ;27 cycles*
        dbf     d7,.l           ;6 cycles

CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 2296896.299
CPU cycles / per loop: 35.047
color clocks: 162937
rasterlines: 717.784
frames: 2.293
µs: 45937.925
@meynaf
Quote:
By the way, bus slots do not happen at exact multiples of 14 CPU cycles, but every 50.000000/3.546895 = ~14.0968 cycles (and then, of course, there is the limitation that the CPU is never granted two consecutive slots, which is what brings the time of consecutive writes to CHIP RAM to 28+ cycles). |
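For reference, the slot arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names are invented for illustration, and the "every other slot" rule is taken directly from the sentence above rather than from any hardware documentation.

```python
CPU_MHZ = 50.0              # 68030 clock used in the tests
COLOR_CLOCK_MHZ = 3.546895  # color clock figure quoted in the post

def cycles_per_slot():
    """CPU cycles between two bus slots (one slot per color clock)."""
    return CPU_MHZ / COLOR_CLOCK_MHZ

def min_cycles_between_chip_writes():
    """The CPU is never granted two consecutive slots, so back-to-back
    CHIP RAM writes are at least two slots apart."""
    return 2 * cycles_per_slot()

print(round(cycles_per_slot(), 4))                 # about 14.0968
print(round(min_cycles_between_chip_writes(), 4))  # a bit over 28
```

This is where the "28+" figure comes from: two slots of roughly 14.1 cycles each.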
||
16 April 2024, 11:35 | #43 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,052
|
I'd say this is not confirmed for me, but I'm strange.
add.l is one thing and dbf is another: not all instructions necessarily work in parallel with mulu. So these tests are not 100% valid for me. |
16 April 2024, 11:57 | #44 | ||||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,365
|
Quote:
Never found a different timing except for value zero. Immediate mulu.w adds 2 ; no difference between mulu and muls. Quote:
Really you should check your timing overhead or even calculation. Quote:
I used to have 50,000,000 loops so that 1 second = 1 cycle. Yes this took time. I used 25000 loops with dbf, repeated 2000 times by an outer loop. The results were 0.5% above the real values. Quote:
Then we would have different timing from one machine to another, too, which won't help... |
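The "1 second = 1 cycle" trick mentioned above can be written down as a small Python sketch. The 50 MHz clock is an assumption inferred from the 50,000,000 figure, and the function name is made up for illustration.

```python
def cycles_per_loop(elapsed_seconds, loops, clock_hz=50_000_000):
    """Average CPU cycles spent per loop iteration.

    clock_hz defaults to 50 MHz, which is an assumption: it is the clock
    for which 50,000,000 loops make one second equal one cycle.
    """
    return elapsed_seconds * clock_hz / loops

# 25000 inner loops repeated 2000 times give the 50,000,000 iterations
# mentioned above.
total_loops = 25000 * 2000
print(total_loops)                         # 50000000
print(cycles_per_loop(36.0, total_loops))  # 36.0 -> 36 seconds means 36 cycles
```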
||||
16 April 2024, 12:22 | #45 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 857
|
Sorry, but no instruction that follows mulu can execute in parallel with it simply because the 68030 architecture doesn't allow it. Check out the formula 11-2 at 11.3.4 and the timings table at 11.6.8 in the MC68030UM.
|
16 April 2024, 12:58 | #46 | ||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 857
|
Quote:
By the way, before running the actual test, the proggie runs a dummy test to measure the overhead and then subtracts that overhead from the final calculation. But even if it didn't do that, the overhead would be negligible with a large enough number of iterations, as it's basically the access time to the CIA B TOD HI and LO registers (a couple of CIA cycles, i.e. about 70.5 CPU cycles*), the VHPOSR register (14 cycles twice) and a jsr ([bd.l]) (another 14 cycles): say, about 200 cycles in all, which have no real weight against the millions of cycles being measured here (e.g. 5570816 cycles for the move-mulu-move case). To be precise, here is the relevant part of the code:
Code:
* TIMING START
        clr.b   RA_CIABTODHI            ;clear TODHI and stop TOD clock
        clr.b   RA_CIABTODMID           ;clear TODMID
        cnop    0,4                     ;align code (for equal conditions - see overhead calculation)
        bsr     _WtVB                   ;wait for frame to start (for equal conditions - see overhead calculation)
        clr.b   RA_CIABTODLO            ;clear TODLO and start TOD clock
        move.w  RA_VHPOSR,BeamYX0       ;get initial beam position
* MEASURED CODE
        jsr     ([BlobAddress_C.l])
* TIMING STOP
        tst.b   RA_CIABTODHI            ;stop TOD clock
        move.w  RA_VHPOSR,BeamYX1       ;get final beam position
Quote:
In the meantime, I finally got around to running tests to measure how many cycles one can stuff between CHIP RAM writes, when mulu.w is also in between, without affecting the overall execution time:
Code:
.l      move.l  d0,(a0)
        mulu.w  d1,d2
        move.l  d0,(a0)
        dbf     d7,.l
Note: buffer in CHIP RAM
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 65536
         leaked bytes : 0
   CPU cycles / total : 5570816.164
CPU cycles / per loop : 85.003
         color clocks : 395182
          rasterlines : 1740.889
               frames : 5.561
                   µs : 111416.323

.l      move.l  d0,(a0)
        rept    14
        add.l   d3,d3
        endr
        mulu.w  d1,d2
        move.l  d0,(a0)
        dbf     d7,.l
Note: buffer in CHIP RAM
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 65536
         leaked bytes : 0
   CPU cycles / total : 5570830.261
CPU cycles / per loop : 85.004
         color clocks : 395183
          rasterlines : 1740.894
               frames : 5.561
                   µs : 111416.605

.l      move.l  d0,(a0)
        rept    15
        add.l   d3,d3
        endr
        mulu.w  d1,d2
        move.l  d0,(a0)
        dbf     d7,.l
Note: buffer in CHIP RAM
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 65536
         leaked bytes : 0
   CPU cycles / total : 6489239.179
CPU cycles / per loop : 99.017
         color clocks : 460333
          rasterlines : 2027.898
               frames : 6.478
                   µs : 129784.783

.l      move.l  d0,(a0)
        mulu.w  d1,d2
        add.l   d3,d3
        add.l   d3,d3
        dbf     d7,.l
Note: buffer in CHIP RAM
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 65536
         leaked bytes : 0
   CPU cycles / total : 5570816.164
CPU cycles / per loop : 85.003
         color clocks : 395182
          rasterlines : 1740.889
               frames : 5.561
                   µs : 111416.323

.l      move.l  d0,(a0)
        mulu.w  d1,d2
        add.l   d3,d3
        add.l   d3,d3
        add.l   d3,d3
        move.l  d0,(a0)
        dbf     d7,.l
Note: buffer in CHIP RAM
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 65536
         leaked bytes : 0
   CPU cycles / total : 6502729.852
CPU cycles / per loop : 99.223
         color clocks : 461290
          rasterlines : 2032.114
               frames : 6.492
                   µs : 130054.597

.l      move.l  d0,(a0)
        mulu.w  d1,d2
        move.l  d0,(a0)
        dbf     d7,.l
Note: buffer in FAST RAM
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 65536
         leaked bytes : 0
   CPU cycles / total : 2502146.243
CPU cycles / per loop : 38.179
         color clocks : 177497
          rasterlines : 781.925
               frames : 2.498
                   µs : 50042.924

.l      move.l  d0,(a0)
        add.l   d3,d3
        mulu.w  d1,d2
        move.l  d0,(a0)
        dbf     d7,.l
Note: buffer in FAST RAM
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 65536
         leaked bytes : 0
   CPU cycles / total : 2502146.243
CPU cycles / per loop : 38.179
         color clocks : 177497
          rasterlines : 781.925
               frames : 2.498
                   µs : 50042.924

.l      move.l  d0,(a0)
        add.l   d3,d3
        add.l   d3,d3
        mulu.w  d1,d2
        move.l  d0,(a0)
        dbf     d7,.l
Note: buffer in FAST RAM
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 65536
         leaked bytes : 0
   CPU cycles / total : 2631414.236
CPU cycles / per loop : 40.152
         color clocks : 186667
          rasterlines : 822.321
               frames : 2.627
                   µs : 52628.284

.l      move.l  d0,(a0)
        mulu.w  d1,d2
        add.l   d3,d3
        move.l  d0,(a0)
        dbf     d7,.l
Note: buffer in FAST RAM
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 65536
         leaked bytes : 0
   CPU cycles / total : 2631414.236
CPU cycles / per loop : 40.152
         color clocks : 186667
          rasterlines : 822.321
               frames : 2.627
                   µs : 52628.284

* CHIP RAM / move / add / mulu / move -> 14 adds
* CHIP RAM / move / mulu / add / move -> 2 adds
* FAST RAM / move / add / mulu / move -> 1 add
* FAST RAM / move / mulu / add / move -> 0 adds |
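The figures in each report appear to be derived from the measured color clock count. The Python sketch below reproduces the first CHIP RAM report under that assumption. It is not the proggie's actual code: the function name is hypothetical, and the constants are the 50 MHz CPU clock, the 3.546895 MHz color clock, and the 227 color clocks per rasterline and 313 rasterlines per frame mentioned elsewhere in the thread.

```python
CPU_MHZ = 50.0
COLOR_CLOCK_MHZ = 3.546895
COLOR_CLOCKS_PER_LINE = 227
LINES_PER_FRAME = 313

def report_from_color_clocks(color_clocks, loops):
    """Derive the other report figures from the color clock count
    (assumed relationship, checked against the numbers posted above)."""
    cycles = color_clocks * CPU_MHZ / COLOR_CLOCK_MHZ
    rasterlines = color_clocks / COLOR_CLOCKS_PER_LINE
    return {
        "cycles_total": cycles,
        "cycles_per_loop": cycles / loops,
        "rasterlines": rasterlines,
        "frames": rasterlines / LINES_PER_FRAME,
        "microseconds": cycles / CPU_MHZ,
    }

# First test above: 395182 color clocks over 65536 loops.
for name, value in report_from_color_clocks(395182, 65536).items():
    print(f"{name}: {value:.3f}")
```

The values agree with the posted report to within the last printed digit, which suggests the report truncates rather than rounds.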
||
16 April 2024, 14:18 | #47 | |||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,365
|
Quote:
Quote:
Anyway, you've shown non-integer timings such as 85.380, 56.636, 16.200, 42.715, 53.048, 71.158, 85.003. Don't tell me that just chopping the non-integer part of that, can give the right result. Quote:
Code:
.l      move.l  d0,(a0)         ; 28 (r=26)
        mulu.w  d1,d2           ; 26->24 (r=24, stall, next @+4)
        move.l  d0,(a0)         ; 28->32 (+4 stall)
        dbf     d7,.l           ; 6->0
; total 84 (3 blocks of 28)

.l      move.l  d0,(a0)         ; 28 (r=26)
        rept    14
        add.l   d3,d3           ; 28->2 (next @+12 missed -> @+26)
        endr
        mulu.w  d1,d2           ; 26 (next @+0)
        move.l  d0,(a0)         ; 28
        dbf     d7,.l           ; 6->0
; total 84

.l      move.l  d0,(a0)         ; 28 (r=26)
        rept    15
        add.l   d3,d3           ; 30->4 (next @+10 missed -> @+24 missed -> @+38)
        endr
        mulu.w  d1,d2           ; 26 (next @+12)
        move.l  d0,(a0)         ; 28 + 12 (stall)
        dbf     d7,.l           ; 6->0
; total 98 (7 blocks of 14)

.l      move.l  d0,(a0)         ; 28->36 (r=26, stall +8 from prev. iteration)
        mulu.w  d1,d2           ; 26->24 (r=24, stall, next @+4)
        add.l   d3,d3           ; 2 (next @+2)
        add.l   d3,d3           ; 2 (next @+0)
        dbf     d7,.l           ; 6 (next @+8) - next move will stall +8
; total 70 (5 blocks of 14)
; error - are you sure of your 85 ? (this is exact same values as first test)

.l      move.l  d0,(a0)         ; 28 (r=26)
        mulu.w  d1,d2           ; 26->24 (r=24, stall, next @+4)
        add.l   d3,d3           ; 2 (next @+2)
        add.l   d3,d3           ; 2 (next @+0)
        add.l   d3,d3           ; 2 (next @+12)
        move.l  d0,(a0)         ; 28->40 (r=24, stall +12)
        dbf     d7,.l           ; 6->0
; total 98
But fastmem results give 2 cycles less than expected by me. |
|||
16 April 2024, 15:26 | #48 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,052
|
In general, some results look strange to me, e.g. 99 cycles for 3x add.l d3,d3, when 1x add.l d3,d3 and 2x add.l d3,d3 take only 85 cycles.
Perhaps there are more dependencies involved. I can only suspect that CCR handling may be problematic, e.g. the X bit. Anyway, if you really want to check whether chip RAM writes can be hidden by mulu, replace move.l d0,(a0) with movem.l d0,(a0). If I remember right, movem doesn't change the CCR, whereas move does. |
16 April 2024, 18:25 | #49 | |||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 857
|
@meynaf
Reply split in 2 parts, as the forum tells me it's too long... Quote:
Anyway, as promised, I also ran tests with 50 million loops, and they confirm that the proggie measures the time correctly - in the results below, you can see that the reported times are just 1/1000th away from perfectly round figures (of course, with so many more loops, precision increased). These tests focus on mulu.x alone, without accesses to RAM, to be sure about the precision of the figures. The analysis of the (interesting) results follows the data.
Code:
mulu.w d1,d2 / d1 = 0 / 65536 loops
26 cycles
.l      move.w  d0,d6           ;d0 = $1234
        mulu.w  d1,d6           ;d1 = 0
        dbf     d7,.l
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 65536
         leaked bytes : 0
   CPU cycles / total : 2231416.492
CPU cycles / per loop : 34.048
         color clocks : 158292
          rasterlines : 697.321
               frames : 2.227
                   µs : 44628.329

mulu.w d1,d2 / d1 = 0 / 50000000 loops
26 cycles
.l      move.w  d0,d6           ;d0 = $1234
        mulu.w  d1,d6           ;d1 = 0
        subq.l  #1,d7
        bne.b   .l
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 50000000
         leaked bytes : 0
   CPU cycles / total : 1799995164.784
CPU cycles / per loop : 35.999
         color clocks : 127687877
          rasterlines : 562501.660
               frames : 1797.129
                   µs : 35999903.295

mulu.w d1,d2 / d1 = 7 / 65536 loops
28 cycles
.l      move.w  d0,d6           ;d0 = $1234
        mulu.w  d1,d6           ;d1 = 7
        dbf     d7,.l
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 65536
         leaked bytes : 0
   CPU cycles / total : 2362446.590
CPU cycles / per loop : 36.048
         color clocks : 167587
          rasterlines : 738.268
               frames : 2.358
                   µs : 47248.931

mulu.w d1,d2 / d1 = 7 / 50000000 loops
28 cycles
.l      move.w  d0,d6           ;d0 = $1234
        mulu.w  d1,d6           ;d1 = 7
        subq.l  #1,d7
        bne.b   .l
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 50000000
         leaked bytes : 0
   CPU cycles / total : 1899994600.911
CPU cycles / per loop : 37.999
         color clocks : 134781627
          rasterlines : 593751.660
               frames : 1896.970
                   µs : 37999892.018

mulu.w d1,d2 / d1 = even / 25000000 loops
26 cycles
.l      move.w  d0,d6           ;d0 = $1234
        mulu.w  d7,d6
        subq.l  #2,d7
        bne.b   .l
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 25000000
         leaked bytes : 0
   CPU cycles / total : 899999252.867
CPU cycles / per loop : 35.999
         color clocks : 63844057
          rasterlines : 281251.352
               frames : 898.566
                   µs : 17999985.057

mulu.w d1,d2 / d1 = odd / 25000000 loops
28 cycles
.l      move.w  d0,d6           ;d0 = $1234
        mulu.w  d7,d6
        subq.l  #2,d7
        bpl.b   .l
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 25000000
         leaked bytes : 0
   CPU cycles / total : 949998970.930
CPU cycles / per loop : 37.999
         color clocks : 67390932
          rasterlines : 296876.352
               frames : 948.486
                   µs : 18999979.418

mulu.w d1,d2 / d1 from 65535 to 0 / 65536 loops
27 cycles on average
.l      move.w  d0,d6           ;d0 = $1234
        mulu.w  d7,d6
        dbf     d7,.l
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 65536
         leaked bytes : 0
   CPU cycles / total : 2296966.783
CPU cycles / per loop : 35.048
         color clocks : 162942
          rasterlines : 717.806
               frames : 2.293
                   µs : 45939.335

mulu.w d1,d2 / d1 from 65535 to 0 / 50000000 loops
27 cycles on average
.l      move.w  d0,d6           ;d0 = $1234
        mulu.w  d7,d6
        subq.l  #1,d7
        bne.b   .l
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 50000000
         leaked bytes : 0
   CPU cycles / total : 1849994812.364
CPU cycles / per loop : 36.999
         color clocks : 131234747
          rasterlines : 578126.638
               frames : 1847.049
                   µs : 36999896.247

mulu.l d1,d2 / d1 = 0 / 65536 loops
44 cycles
.l      move.l  d0,d6           ;d0 = $12345678
        mulu.l  d1,d6           ;d1 = 0
        dbf     d7,.l
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 65536
         leaked bytes : 0
   CPU cycles / total : 3407839.814
CPU cycles / per loop : 51.999
         color clocks : 241745
          rasterlines : 1064.955
               frames : 3.402
                   µs : 68156.796

mulu.l d1,d1 / d1 = 0 / 50000000 loops
44 cycles
.l      move.l  d0,d6           ;d0 = $12345678
        mulu.l  d1,d6           ;d1 = 0
        subq.l  #1,d7
        bne.b   .l
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 50000000
         leaked bytes : 0
   CPU cycles / total : 2699992415.901
CPU cycles / per loop : 53.999
         color clocks : 191531792
          rasterlines : 843752.387
               frames : 2695.694
                   µs : 53999848.318

mulu.l d1,d2 / d1 = 7 / 65536 loops
46 cycles
.l      move.l  d0,d6           ;d0 = $12345678
        mulu.l  d1,d6           ;d1 = 7
        dbf     d7,.l
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 65536
         leaked bytes : 0
   CPU cycles / total : 3542140.379
CPU cycles / per loop : 54.048
         color clocks : 251272
          rasterlines : 1106.925
               frames : 3.536
                   µs : 70842.807

mulu.l d1,d1 / d1 = 7 / 50000000 loops
46 cycles
.l      move.l  d0,d6           ;d0 = $12345678
        mulu.l  d1,d6           ;d1 = 7
        subq.l  #1,d7
        bne.b   .l
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 50000000
         leaked bytes : 0
   CPU cycles / total : 2799991570.091
CPU cycles / per loop : 55.999
         color clocks : 198625522
          rasterlines : 875002.299
               frames : 2795.534
                   µs : 55999831.401

mulu.l d1,d2 / d1 = even / 25000000 loops
44 cycles
.l      move.l  d0,d6           ;d0 = $12345678
        mulu.l  d7,d6
        subq.l  #2,d7
        bne.b   .l
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 25000000
         leaked bytes : 0
   CPU cycles / total : 1349996870.502
CPU cycles / per loop : 53.999
         color clocks : 95765943
          rasterlines : 421876.400
               frames : 1347.847
                   µs : 26999937.410

mulu.l d1,d2 / d1 = odd / 25000000 loops
46 cycles
.l      move.l  d0,d6           ;d0 = $12345678
        mulu.l  d7,d6
        subq.l  #2,d7
        bpl.b   .l
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 25000000
         leaked bytes : 0
   CPU cycles / total : 1399997067.857
CPU cycles / per loop : 55.999
         color clocks : 99312852
          rasterlines : 437501.550
               frames : 1397.768
                   µs : 27999941.357

mulu.l d1,d2 / d1 from 65535 to 0 / 65536 loops
45 cycles on average
.l      move.l  d0,d6           ;d0 = $12345678
        mulu.l  d7,d6
        dbf     d7,.l
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 65536
         leaked bytes : 0
   CPU cycles / total : 3476590.087
CPU cycles / per loop : 53.048
         color clocks : 246622
          rasterlines : 1086.440
               frames : 3.471
                   µs : 69531.801

mulu.l d1,d1 / d1 from 65535 to 0 / 50000000 loops
45 cycles on average
.l      move.l  d0,d6           ;d0 = $12345678
        mulu.l  d7,d6
        subq.l  #1,d7
        bne.b   .l
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 50000000
         leaked bytes : 0
   CPU cycles / total : 2749988088.172
CPU cycles / per loop : 54.999
         color clocks : 195078380
          rasterlines : 859376.123
               frames : 2745.610
                   µs : 54999761.763

* mulu.w d1,d2 / d1 = 0 / 65536 loops -> 26 cycles
* mulu.w d1,d2 / d1 = 0 / 50000000 loops -> 26 cycles
* mulu.w d1,d2 / d1 = 7 / 65536 loops -> 28 cycles
* mulu.w d1,d2 / d1 = 7 / 50000000 loops -> 28 cycles
* mulu.w d1,d2 / d1 = even / 25000000 loops -> 26 cycles
* mulu.w d1,d2 / d1 = odd / 25000000 loops -> 28 cycles
* mulu.w d1,d2 / d1 from 65535 to 0 / 65536 loops -> 27 cycles on average
* mulu.w d1,d2 / d1 from 65535 to 0 / 50000000 loops -> 27 cycles on average
* mulu.l d1,d2 / d1 = 0 / 65536 loops -> 44 cycles
* mulu.l d1,d1 / d1 = 0 / 50000000 loops -> 44 cycles
* mulu.l d1,d2 / d1 = 7 / 65536 loops -> 46 cycles
* mulu.l d1,d1 / d1 = 7 / 50000000 loops -> 46 cycles
* mulu.l d1,d2 / d1 = even / 25000000 loops -> 44 cycles
* mulu.l d1,d2 / d1 = odd / 25000000 loops -> 46 cycles
* mulu.l d1,d2 / d1 from 65535 to 0 / 65536 loops -> 45 cycles on average
* mulu.l d1,d1 / d1 from 65535 to 0 / 50000000 loops -> 45 cycles on average

Takeaways:
* mulu.w takes either 26 or 28 cycles;
* mulu.l takes either 44 or 46 cycles;
* when the left operand is odd, mulu.x takes 2 cycles more (I guess this is true also for muls.x);
* the figures obtained by means of millions of loops match those obtained with 65536 loops.
Quote:
Quote:
|
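For readers who want to see how the per-instruction figures in the summary relate to the per-loop times, here is a rough Python sketch. The dbf-loop overhead uses the 2 + 6 cycle annotations quoted earlier in the thread. The 10-cycle overhead for the subq/bne loops is an assumption read back from the measurements themselves, not a documented timing, and the names are invented for illustration.

```python
# Loop overhead in CPU cycles (everything in the loop except mulu).
LOOP_OVERHEAD = {
    "dbf": 2 + 6,     # move (2) + dbf (6), as annotated earlier in the thread
    "subq_bne": 10,   # ASSUMPTION: inferred from the figures, not from a manual
}

def mulu_cycles(per_loop_cycles, loop_kind):
    """Estimate the mulu cost by removing the loop overhead and rounding
    away the small measurement error."""
    return round(per_loop_cycles - LOOP_OVERHEAD[loop_kind])

print(mulu_cycles(34.048, "dbf"))       # mulu.w, d1 = 0
print(mulu_cycles(37.999, "subq_bne"))  # mulu.w, d1 = 7
print(mulu_cycles(53.048, "dbf"))       # mulu.l, d1 from 65535 to 0
```

Because the subq/bne overhead is itself inferred from the data, only the dbf rows are an independent check of the 26/28 and 44/46 figures.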
|||
16 April 2024, 18:45 | #50 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 857
|
@meynaf
Reply part 2 Quote:
I'll prepare a proper picture and explain the underlying thoughts later (or more probably tomorrow - I haven't slept for basically 2 days after months of bad sleep, so I'm about to collapse...). I also ran the test again using 50000000 loops, obtaining this, which confirms 85 cycles on average: Code:
move > mulu.w d1,d2 > move / d1 from 65535 to 0 / 50000000 loops
.l      move.l  d0,(a0)
        mulu.w  d1,d2           ;d1 = d2 = 0
        move.l  d0,(a0)
        subq.l  #1,d7
        bne.b   .l
Note: buffer in CHIP RAM
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 50000000
         leaked bytes : 0
   CPU cycles / total : 4247766807.306
CPU cycles / per loop : 84.955
         color clocks : 301327657
          rasterlines : 1327434.612
               frames : 4241.005
                   µs : 84955336.146

Using 65536 loops for convenience:
* total cycles: 65536 * 85 = 5570560
* expected cycles: 65536 * 84 = 5505024
* from that: (65536 - x) * 84 + x * (84 + 14) = 65536 * 85 -> x = 65536/14 = about 4681 times
Why does that happen? I don't know. For sure it's not because of the odd/even operand factor (alone), otherwise the slot would be missed 50% of the time and the average time would be higher. I guess it's first necessary to figure out why the execution most of the time takes 84 cycles instead of 98 in the first place. |
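The little equation above can be solved in general form. The Python sketch below does that; the function name and parameter names are invented for illustration, and the model (each iteration costs either 84 cycles or 84 + 14) is the one assumed in the post.

```python
def missed_slot_iterations(loops, measured_per_loop, expected=84, penalty=14):
    """Solve (loops - x) * expected + x * (expected + penalty)
             = loops * measured_per_loop   for x.

    x is the number of iterations that would have to miss a bus slot
    (costing `penalty` extra cycles) to explain the measured average.
    """
    return loops * (measured_per_loop - expected) / penalty

x = missed_slot_iterations(65536, 85)
print(round(x, 2))             # about 4681 iterations
print(round(x / 65536, 4))     # i.e. one iteration in 14
```

Under this model an average of 85 cycles means one iteration in 14 pays the extra slot, which is far from the 50% one would expect if the odd/even operand were the only cause.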
|
16 April 2024, 18:57 | #51 | ||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 857
|
@Don_Adan
Quote:
Quote:
EDIT: the movem test confirms that the ccr doesn't matter -> Code:
.l      movem.l d0,(a0)
        mulu.w  d1,d2           ;d1 = d2 = 0
        movem.l d0,(a0)
        subq.l  #1,d7
        bne.b   .l
Note: buffer in CHIP RAM
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 50000000
         leaked bytes : 672
   CPU cycles / total : 4247766877.790
CPU cycles / per loop : 84.955
         color clocks : 301327662
          rasterlines : 1327434.634
               frames : 4241.005
                   µs : 84955337.555
@all
Attached here is the updated archive with the latest tests included.
Last edited by saimo; 16 April 2024 at 23:18. Reason: Removed attachment and fixed the English. |
||
16 April 2024, 19:06 | #52 | ||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,365
|
Quote:
Quote:
|
||
16 April 2024, 19:22 | #53 | ||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 857
|
Quote:
Quote:
|
||
16 April 2024, 19:38 | #54 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,365
|
Quote:
Not 'better' but easier as i said : just count the vblanks... Cheap and (with enough loops) reliable enough. |
|
16 April 2024, 23:16 | #55 | ||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 857
|
@meynaf
Quote:
2. I explained that the error doesn't depend on that.
3. You said that the origin of the decimals was not explained.
4. I said that I was answering a different remark (and added that the explanation you wanted was provided elsewhere in the very same post).
5. Now you're saying that the precision of the decimals and where they come from are the same thing and that they indicate some error.
Sorry, but here you're not being fair, so this comment of yours goes to >NIL:
Quote:
What's the point in suggesting an easier method if it isn't just as precise? Especially considering that in this context precision matters and, even more, considering that you have been questioning the precision of the results! You really seem to just want to be against for the sake of it. Anyway, technically speaking, using the vertical blanks is 313*227/50 = 1421.02 times less precise than the method I used (which isn't that complicated, by the way), so, no, it isn't worth considering.
That said, I must admit that I was so annoyed by the comments that I decided to have a second look at my code to try and make it even more precise. After some investigation, I found something I wasn't aware of: the CIA B TOD clock does not increase when a rasterline starts or ends (as I had erroneously believed), but at around color clock $60 (sometimes a little less, sometimes a little more, and this figure is affected also by the slow access time of CIAs). On some occasions, this caused the color clocks count to be wrong by 227 (i.e. 1 rasterline). So, I changed the code to take that into account and to work regardless of when exactly the TOD clock increases. The code now is not only no longer affected by the off-by-one issue, but is actually simpler, more beautiful (than what is shown in the snippet I posted earlier) and - guess what - more precise. So much more precise that even the 65536 loops tests provide figures that are just 0.001 or 0.002 away from the perfect ones - here are some examples:
Code:
.l      dbf     d7,.l
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 65536
         leaked bytes : 0
   CPU cycles / total : 393358.134
CPU cycles / per loop : 6.002
         color clocks : 27904
          rasterlines : 122.925
               frames : 0.392
                   µs : 7867.162

.l      move.w  d0,d6           ;d0 = $1234
        mulu.w  d7,d6
        dbf     d7,.l
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 65536
         leaked bytes : 0
   CPU cycles / total : 2293907.770
CPU cycles / per loop : 35.002
         color clocks : 162725
          rasterlines : 716.850
               frames : 2.290
                   µs : 45878.155

.l      move.l  d0,d6           ;d0 = $12345678
        mulu.l  d7,d6
        dbf     d7,.l
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 65536
         leaked bytes : 0
   CPU cycles / total : 3473488.783
CPU cycles / per loop : 53.001
         color clocks : 246402
          rasterlines : 1085.471
               frames : 3.467
                   µs : 69469.775

.l      move.l  d0,(a0)
        mulu.w  d1,d2           ;d1 = d2 = 0
        move.l  d0,(a0)
        dbf     d7,.l
Note: buffer in CHIP RAM
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 65536
         leaked bytes : 0
   CPU cycles / total : 5567728.957
CPU cycles / per loop : 84.956
         color clocks : 394963
          rasterlines : 1739.925
               frames : 5.558
                   µs : 111354.579

.l      move.l  d0,(a0)
        mulu.w  d1,d2           ;d1 = d2 = 0
        move.l  d0,(a0)
        subq.l  #1,d7
        bne.b   .l
Note: buffer in CHIP RAM
                  CPU : 68030 50.000000 MHz IiDd..
         loops number : 50000000
         leaked bytes : 0
   CPU cycles / total : 4247763790.583
CPU cycles / per loop : 84.955
         color clocks : 301327443
          rasterlines : 1327433.669
               frames : 4241.002
                   µs : 84955275.811
Last edited by saimo; 17 April 2024 at 23:04. Reason: Removed attachment as I provided a newer version later. |
||
17 April 2024, 11:40 | #56 | |||||||||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,365
|
Quote:
And where did I agree to this ? I did not. It can still depend on that. I haven't said it depends only on that, though. There is setup-related error, which is why many loops are needed. What ? No. What i said is that once-per-test overhead wasn't enough to explain them. Quote:
Quote:
Quote:
I may lack the most basic of diplomacy skills, that's sure - and we might just have had a misunderstanding after all - but i'm not unfair. Quote:
(And wrong in the case we're timing an empty routine. ) Quote:
Quote:
Quote:
Vertical blank is max 20ms error, which is quite acceptable for a computation that lasts several seconds. We need to know the number of cycles, not the number of 1/1000th of a cycle. Quote:
Perhaps we could get further by watching this 84.955 closer, hmm ? |
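To put the "max 20ms error" argument in numbers, here is a small Python sketch of the worst-case per-loop error when timing by counting vertical blanks. The 50 MHz clock and the loop counts are the ones used in the thread; the function name is hypothetical.

```python
def vblank_error_cycles_per_loop(loops, clock_hz=50_000_000, frame_seconds=0.020):
    """Worst-case error, in CPU cycles per loop, when the elapsed time is
    only known to within one 20 ms frame."""
    return frame_seconds * clock_hz / loops

# With 50 million loops, one frame of uncertainty is a tiny fraction of a cycle.
print(vblank_error_cycles_per_loop(50_000_000))

# With only 65536 loops, the same uncertainty is worth many cycles per loop.
print(vblank_error_cycles_per_loop(65_536))
```

So counting vblanks only resolves whole cycles when the loop count is in the millions, which is consistent with both positions in the exchange above.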
|||||||||
17 April 2024, 13:03 | #57 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 857
|
@meynaf
>NIL: |
17 April 2024, 14:09 | #58 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,365
|
|
17 April 2024, 14:21 | #59 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,555
|
Can we stop the nonsense and get back to measuring? This is interesting, the flexing isn't.
|
17 April 2024, 14:35 | #60 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,365
|
|