11 April 2024, 15:53 | #21 | |||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,365
|
Quote:
A memory access (with a simple addressing mode), as I observed years ago, takes 4+n cycles, where n is 1 for the data cache, 4 for 60 ns RAM, 5 for 70 ns RAM and 24 for chipmem. Oh, well. Perhaps it's just the (an)+ mode changing things. Have you tested (an), (an)+ and -(an) to see the differences?
Quote:
So it should be 24 after a write to chip (28 cycles overall). Or what did I miss?
Quote:
Did you know it is also possible to hide the cost of some instructions between reads? This can happen if data cache burst is active: the first read starts a burst, but the second read stalls unless a few instructions are executed in between. The gain is minimal, though.
swap, really? There are quite a few of them in a c2p and they didn't seem to cause issues. nop does, however. Not knowing this at the time, I tried to count available cycles and failed because nop isn't a real nop. This is what my memory says; it may be wrong.
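The 4+n rule of thumb above can be written down as a small lookup, which makes the "28 cycles overall" figure for a chip write easy to check. This is only a Python sketch of the numbers recalled in this post; the table and the `access_cycles` helper are made up for illustration, and the figures themselves may be wrong, as the poster says.

```python
# Extra cycles "n" per access, as recalled in the post (may be inaccurate).
EXTRA_CYCLES = {
    "data cache": 1,
    "60ns ram": 4,
    "70ns ram": 5,
    "chipmem": 24,
}

BASE_CYCLES = 4  # fixed part of the "4+n" rule


def access_cycles(target: str) -> int:
    """Estimated cycles for one access with a simple addressing mode.

    Hypothetical helper: it only restates the rule of thumb from the post.
    """
    return BASE_CYCLES + EXTRA_CYCLES[target]


if __name__ == "__main__":
    for target in EXTRA_CYCLES:
        print(f"{target:>10}: {access_cycles(target)} cycles")
```

The model ignores any overlap between the access and neighbouring instructions, which is exactly what the measurements later in the thread try to expose.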
|||
12 April 2024, 01:19 | #22 | |||||||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 857
|
@meynaf
I only noticed your post now. It's too late to run other tests today, but I'll do more and, if anybody is interested, I'll also provide the tool I'm using, so that anyone can verify my results and run tests on other machines.
Quote:
By the way, my previous post has some figures that deal with those addressing modes.
Quote:
Quote:
I'll test this as well.
Quote:
Quote:
Quote:
Quote:
|
|||||||
12 April 2024, 08:20 | #23 | |||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,365
|
Quote:
Quote:
From memory I would have said:
Code:
.l      move.b  dx,-(az)        ; 8
        move.b  dx,-(az)        ; 8
        dbf     dy,.l           ; 0
; total 16 (you got 15)

.l      move.b  dx,-(az)        ; 8
        add.l   dw,dw           ; 0 (2/4)
        add.l   dw,dw           ; 0 (4/4)
        move.b  dx,-(az)        ; 8
        dbf     dy,.l           ; 0
; total 16 (you got 15)

.l      move.b  dx,-(az)        ; 8
        add.l   dw,dw           ; 0 (2/4)
        add.l   dw,dw           ; 0 (4/4)
        add.l   dw,dw           ; 2
        move.b  dx,-(az)        ; 8
        dbf     dy,.l           ; 0
; total 18 (you got 16)

.l      move.b  dx,(az)         ; 8
        move.b  dx,(az)         ; 8
        dbf     dy,.l           ; 0
; total 16 (you got 14)

.l      move.b  dx,(az)         ; 8
        add.l   dw,dw           ; 0 (2/4)
        add.l   dw,dw           ; 0 (4/4)
        move.b  dx,(az)         ; 8
        dbf     dy,.l           ; 0
; total 16 (you got 14)

.l      move.b  dx,(az)         ; 8
        add.l   dw,dw           ; 0 (2/4)
        add.l   dw,dw           ; 0 (4/4)
        add.l   dw,dw           ; 2
        move.b  dx,(az)         ; 8
        dbf     dy,.l           ; 0
; total 18 (you got 16)
Quote:
I checked my c2p code: there is a swap that should have broken my cycle counts completely.
Yes, please.
EDIT: Testing clr could be interesting too. From what I've read, clr -(ax) is faster than clr (ax)+...
Last edited by meynaf; 13 April 2024 at 09:04.
|||
12 April 2024, 10:09 | #24 | |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,052
|
Quote:
In your example, you put only one (2-byte) mulu.l instruction between two chip memory writes. I'm no 68030 expert, but detecting (not even executing) that the next instruction is a memory access could stop pipelining for the mulu.l instruction. I think you should also check the following cases, to be sure:
Code:
.l      move.l  dx,(ay)
        mulu.l  dw,dz
        dbf     dy,.l
Code:
.l      move.l  dx,(ay)
        mulu.l  dw,dz
        move.l  dx,dx
        dbf     dy,.l
Code:
.l      move.l  dx,(ay)
        mulu.w  dw,dz
        dbf     dy,.l
|
12 April 2024, 23:58 | #25 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 857
|
@meynaf
First off: sorry for scaring you, but swap can indeed run in parallel! Yesterday I must have swapped (heh) binaries/results. (To be honest, I scared myself too, as I use swap that way in several places...) Today I couldn't run all the tests, so here's an initial batch: Code:
Results obtained on Amiga 1200, Blizzard 1230 IV, 68030 @ 50 MHz, 60 ns RAM
-------------------------------------------------------
core:
.l      dbf     d7,.l
loops number: 65536
CPU cycles / total: 396431.244
CPU cycles / per loop: 6.049
color clocks: 28122
rasterlines: 123.885
frames: 0.395
µs: 7928.624
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        dbf     d7,.l
note: (a0) -> CHIP RAM
loops number: 65536
CPU cycles / total: 1855862.662
CPU cycles / per loop: 28.318
color clocks: 131651
rasterlines: 579.960
frames: 1.852
µs: 37117.253
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        dbf     d7,.l
note: (a0) -> FAST RAM
loops number: 65536
CPU cycles / total: 530647.228
CPU cycles / per loop: 8.097
color clocks: 37643
rasterlines: 165.828
frames: 0.529
µs: 10612.944
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        rept    13
        add.l   d1,d1
        endr
        move.l  d0,(a0)
        dbf     d7,.l
note: (a0) -> CHIP RAM
loops number: 65536
CPU cycles / total: 3711767.616
CPU cycles / per loop: 56.637
color clocks: 263305
rasterlines: 1159.933
frames: 3.705
µs: 74235.352
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        rept    14
        add.l   d1,d1
        endr
        move.l  d0,(a0)
        dbf     d7,.l
note: (a0) -> CHIP RAM
loops number: 65536
CPU cycles / total: 4663473.263
CPU cycles / per loop: 71.158
color clocks: 330817
rasterlines: 1457.343
frames: 4.656
µs: 93269.465
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        add.l   d3,d3
        mulu.w  d1,d2
        move.l  d0,(a0)
        dbf     d7,.l
note: (a0) -> CHIP RAM
loops number: 65536
CPU cycles / total: 5570816.164
CPU cycles / per loop: 85.003
color clocks: 395182
rasterlines: 1740.889
frames: 5.561
µs: 111416.323
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        mulu.w  d1,d2
        move.l  d0,(a0)
        dbf     d7,.l
note: (a0) -> CHIP RAM
loops number: 65536
CPU cycles / total: 5570816.164
CPU cycles / per loop: 85.003
color clocks: 395182
rasterlines: 1740.889
frames: 5.561
µs: 111416.323
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        mulu.w  d1,d2
        add.l   d3,d3
        move.l  d0,(a0)
        dbf     d7,.l
note: (a0) -> CHIP RAM
loops number: 65536
CPU cycles / total: 5570816.164
CPU cycles / per loop: 85.003
color clocks: 395182
rasterlines: 1740.889
frames: 5.561
µs: 111416.323
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        swap.w  d1
        move.l  d0,(a0)
        dbf     d7,.l
note: (a0) -> CHIP RAM
loops number: 65536
CPU cycles / total: 3711767.616
CPU cycles / per loop: 56.637
color clocks: 263305
rasterlines: 1159.933
frames: 3.705
µs: 74235.352
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        tst.l   d1
        move.l  d0,(a0)
        dbf     d7,.l
note: (a0) -> CHIP RAM
loops number: 65536
CPU cycles / total: 4663473.263
CPU cycles / per loop: 71.158
color clocks: 330817
rasterlines: 1457.343
frames: 4.656
µs: 93269.465
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        move.l  d0,(a0)
        dbf     d7,.l
note: (a0) -> CHIP RAM
loops number: 65536
CPU cycles / total: 3711767.616
CPU cycles / per loop: 56.637
color clocks: 263305
rasterlines: 1159.933
frames: 3.705
µs: 74235.352
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        add.l   d1,d1
        add.l   d1,d1
        move.l  d0,(a0)
        dbf     d7,.l
note: (a0) -> FAST RAM
loops number: 65536
CPU cycles / total: 924752.494
CPU cycles / per loop: 14.110
color clocks: 65600
rasterlines: 288.986
frames: 0.923
µs: 18495.049
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        add.l   d1,d1
        add.l   d1,d1
        add.l   d1,d1
        move.l  d0,(a0)
        dbf     d7,.l
note: (a0) -> FAST RAM
loops number: 65536
CPU cycles / total: 1058150.861
CPU cycles / per loop: 16.146
color clocks: 75063
rasterlines: 330.674
frames: 1.056
µs: 21163.017
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        add.l   d3,d3
        mulu.w  d1,d2
        move.l  d0,(a0)
        dbf     d7,.l
note: (a0) -> FAST RAM
loops number: 65536
CPU cycles / total: 2502146.243
CPU cycles / per loop: 38.179
color clocks: 177497
rasterlines: 781.925
frames: 2.498
µs: 50042.924
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        mulu.w  d1,d2
        move.l  d0,(a0)
        dbf     d7,.l
note: (a0) -> FAST RAM
loops number: 65536
CPU cycles / total: 2502146.243
CPU cycles / per loop: 38.179
color clocks: 177497
rasterlines: 781.925
frames: 2.498
µs: 50042.924
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        mulu.w  d1,d2
        add.l   d3,d3
        move.l  d0,(a0)
        dbf     d7,.l
note: (a0) -> FAST RAM
loops number: 65536
CPU cycles / total: 2631414.236
CPU cycles / per loop: 40.152
color clocks: 186667
rasterlines: 822.321
frames: 2.627
µs: 52628.284
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        swap.w  d1
        move.l  d0,(a0)
        dbf     d7,.l
note: (a0) -> FAST RAM
loops number: 65536
CPU cycles / total: 924752.494
CPU cycles / per loop: 14.110
color clocks: 65600
rasterlines: 288.986
frames: 0.923
µs: 18495.049
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        tst.l   d1
        move.l  d0,(a0)
        dbf     d7,.l
note: (a0) -> FAST RAM
loops number: 65536
CPU cycles / total: 1061801.942
CPU cycles / per loop: 16.201
color clocks: 75322
rasterlines: 331.814
frames: 1.060
µs: 21236.038
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        move.l  d0,(a0)
        dbf     d7,.l
note: (a0) -> FAST RAM
loops number: 65536
CPU cycles / total: 924752.494
CPU cycles / per loop: 14.110
color clocks: 65600
rasterlines: 288.986
frames: 0.923
µs: 18495.049

More tests will come tomorrow / in the next days.
@Don_Adan
You had managed to explain yourself, no worries.
Regarding your questions: while mulu itself can't run in parallel, it's possible to add instructions before and after it that might still execute in parallel, depending on the bus timing (so, when the accesses are made to CHIP RAM, it's more likely that the other instructions will come for free). Check out the mulu tests in the results above.
Last edited by saimo; 14 April 2024 at 02:33. Reason: Removed attachment as I provided an updated one in a later post.
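The derived figures in each result can be cross-checked against one another. The Python sketch below, using three entries copied from the results above, recomputes cycles per loop, the ratio of CPU cycles to color clocks, and the time in µs. The 50 MHz clock comes from the results header; that the tool derives its cycle counts from color clocks is only an inference from the constant ratio, and the `summarize` helper is hypothetical.

```python
CPU_MHZ = 50.0  # CPU clock stated in the results header

# name -> (total CPU cycles, color clocks, loops), copied from the results
SAMPLES = {
    "dbf only": (396431.244, 28122, 65536),
    "move.l d0,(a0) -> CHIP RAM": (1855862.662, 131651, 65536),
    "move.l / mulu.w / move.l -> CHIP RAM": (5570816.164, 395182, 65536),
}


def summarize(total_cycles: float, color_clocks: int, loops: int) -> dict:
    """Recompute the derived figures from the raw totals (illustrative helper)."""
    return {
        "cycles_per_loop": total_cycles / loops,
        "cycles_per_color_clock": total_cycles / color_clocks,
        "microseconds": total_cycles / CPU_MHZ,
    }


if __name__ == "__main__":
    for name, sample in SAMPLES.items():
        s = summarize(*sample)
        print(f"{name}: {s['cycles_per_loop']:.3f} cycles/loop, "
              f"{s['cycles_per_color_clock']:.4f} cycles/color clock, "
              f"{s['microseconds']:.3f} us")
```

Every entry gives about 14.097 CPU cycles per color clock, which at 50 MHz corresponds to a color clock of roughly 3.547 MHz, the PAL value. One color clock is therefore about 14.1 CPU cycles, and a two-color-clock chip access about 28.2.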
13 April 2024, 09:21 | #26 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,365
|
Sure, swap did really surprise me here.
I have my own tool for measuring, but unfortunately my A1200 isn't in the right shape to perform tests... More tests are welcome, but heck, I can't interpret the results we already have. |
13 April 2024, 11:57 | #27 |
BiO-sanitation Battalion
Join Date: Jun 2017
Location: Scotland
Posts: 152
|
Code:
.l      move.l  dx,(ay)
        rol.l   #1,dz
        move.l  dx,(ay)
        dbf     dy,.l
B
13 April 2024, 13:04 | #28 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 857
|
|
13 April 2024, 13:08 | #29 |
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,556
|
It would be interesting to see the cycles per loop versus an entirely naive summation of worst-case clocks, with no parallelism for each. It's easy to underestimate the effort that went into the design of these older CPUs.
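A rough Python sketch of that comparison for three of the CHIP RAM tests follows. The per-instruction costs are rule-of-thumb figures quoted by posters in this thread (28 for a chip write, 26 for mulu.w, 2 for add.l, about 6 for dbf); they are assumptions for illustration rather than datasheet worst cases, and the `naive_sum` helper is hypothetical.

```python
# Rule-of-thumb costs in CPU cycles, taken from figures quoted in this thread.
# They are assumptions for illustration, not datasheet worst-case values.
COST = {
    "move.l to CHIP": 28,  # "28 cycles overall" for a chip write
    "mulu.w": 26,          # figure mentioned for mulu.w
    "add.l": 2,
    "dbf": 6,              # the empty dbf loop measures about 6.049
}

# name -> (instruction sequence, measured CPU cycles per loop from the results)
TESTS = {
    "move.l / move.l / dbf": (
        ["move.l to CHIP", "move.l to CHIP", "dbf"], 56.637),
    "move.l / 13x add.l / move.l / dbf": (
        ["move.l to CHIP"] + ["add.l"] * 13 + ["move.l to CHIP", "dbf"], 56.637),
    "move.l / mulu.w / move.l / dbf": (
        ["move.l to CHIP", "mulu.w", "move.l to CHIP", "dbf"], 85.003),
}


def naive_sum(instructions: list) -> int:
    """Add up the per-instruction costs with no overlap at all."""
    return sum(COST[name] for name in instructions)


if __name__ == "__main__":
    for name, (instructions, measured) in TESTS.items():
        naive = naive_sum(instructions)
        print(f"{name}: naive {naive}, measured {measured}, "
              f"hidden {naive - measured:.3f}")
```

With these assumed costs, the thirteen add.l instructions disappear almost entirely behind the chip writes, while the mulu.w case hides only a few cycles.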
|
14 April 2024, 02:29 | #30 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 857
|
A few more tests: some clr, mulu and instructions after reads cases.
Updated results: Code:
-------------------------------------------------------
core:
.l      clr.l   (a0)+
        clr.l   (a0)+
        dbf     d7,.l
note: buffer in CHIP RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 5595485.628
CPU cycles / per loop: 85.380
color clocks: 396932
rasterlines: 1748.599
frames: 5.586
µs: 111909.712
-------------------------------------------------------
core:
.l      clr.l   (a0)+
        clr.l   (a0)+
        dbf     d7,.l
note: buffer in FAST RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 1189519.283
CPU cycles / per loop: 18.150
color clocks: 84382
rasterlines: 371.726
frames: 1.187
µs: 23790.385
-------------------------------------------------------
core:
.l      clr.l   -(a0)
        clr.l   -(a0)
        dbf     d7,.l
note: buffer in CHIP RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 3711753.519
CPU cycles / per loop: 56.636
color clocks: 263304
rasterlines: 1159.929
frames: 3.705
µs: 74235.070
-------------------------------------------------------
core:
.l      clr.l   -(a0)
        clr.l   -(a0)
        dbf     d7,.l
note: buffer in FAST RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 1061731.458
CPU cycles / per loop: 16.200
color clocks: 75317
rasterlines: 331.792
frames: 1.060
µs: 21234.629
-------------------------------------------------------
core:
.l      clr.l   (a0)+
        dbf     d7,.l
note: buffer in CHIP RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 2799378.047
CPU cycles / per loop: 42.715
color clocks: 198582
rasterlines: 874.810
frames: 2.794
µs: 55987.560
-------------------------------------------------------
core:
.l      clr.l   (a0)+
        dbf     d7,.l
note: buffer in FAST RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 659872.931
CPU cycles / per loop: 10.068
color clocks: 46810
rasterlines: 206.211
frames: 0.658
µs: 13197.458
-------------------------------------------------------
core:
.l      clr.l   -(a0)
        dbf     d7,.l
note: buffer in CHIP RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 1855848.566
CPU cycles / per loop: 28.318
color clocks: 131650
rasterlines: 579.955
frames: 1.852
µs: 37116.971
-------------------------------------------------------
core:
.l      clr.l   -(a0)
        dbf     d7,.l
note: buffer in FAST RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 658843.862
CPU cycles / per loop: 10.053
color clocks: 46737
rasterlines: 205.889
frames: 0.657
µs: 13176.877
-------------------------------------------------------
core:
.l      dbf     d7,.l
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 396431.244
CPU cycles / per loop: 6.049
color clocks: 28122
rasterlines: 123.885
frames: 0.395
µs: 7928.624
-------------------------------------------------------
core:
.l      move.l  (a0),d0
        dbf     d7,.l
note: buffer in CHIP RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 2799378.047
CPU cycles / per loop: 42.715
color clocks: 198582
rasterlines: 874.810
frames: 2.794
µs: 55987.560
-------------------------------------------------------
core:
.l      move.l  (a0),d0
        add.l   d1,d1
        add.l   d1,d1
        add.l   d1,d1
        dbf     d7,.l
note: buffer in CHIP RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 2799378.047
CPU cycles / per loop: 42.715
color clocks: 198582
rasterlines: 874.810
frames: 2.794
µs: 55987.560
-------------------------------------------------------
core:
.l      move.l  (a0),d0
        add.l   d1,d1
        add.l   d1,d1
        add.l   d1,d1
        add.l   d1,d1
        dbf     d7,.l
note: buffer in CHIP RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 3714883.017
CPU cycles / per loop: 56.684
color clocks: 263526
rasterlines: 1160.907
frames: 3.708
µs: 74297.660
-------------------------------------------------------
core:
.l      move.l  (a0),d0
        dbf     d7,.l
note: buffer in FAST RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 658561.925
CPU cycles / per loop: 10.048
color clocks: 46717
rasterlines: 205.801
frames: 0.657
µs: 13171.238
-------------------------------------------------------
core:
.l      move.l  (a0),d0
        add.l   d1,d1
        dbf     d7,.l
note: buffer in FAST RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 789592.023
CPU cycles / per loop: 12.048
color clocks: 56012
rasterlines: 246.748
frames: 0.788
µs: 15791.840
-------------------------------------------------------
core:
.l      move.l  d6,d0
        mulu.l  d7,d0
        dbf     d7,.l
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 3476590.087
CPU cycles / per loop: 53.048
color clocks: 246622
rasterlines: 1086.440
frames: 3.471
µs: 69531.801
-------------------------------------------------------
core:
.l      move.w  d6,d0
        mulu.w  d7,d0
        dbf     d7,.l
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 2296896.299
CPU cycles / per loop: 35.047
color clocks: 162937
rasterlines: 717.784
frames: 2.293
µs: 45937.925
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        dbf     d7,.l
note: buffer in CHIP RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 1855848.566
CPU cycles / per loop: 28.318
color clocks: 131650
rasterlines: 579.955
frames: 1.852
µs: 37116.971
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        rept    13
        add.l   d1,d1
        endr
        move.l  d0,(a0)
        dbf     d7,.l
note: buffer in CHIP RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 3711767.616
CPU cycles / per loop: 56.637
color clocks: 263305
rasterlines: 1159.933
frames: 3.705
µs: 74235.352
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        rept    14
        add.l   d1,d1
        endr
        move.l  d0,(a0)
        dbf     d7,.l
note: buffer in CHIP RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 4663473.263
CPU cycles / per loop: 71.158
color clocks: 330817
rasterlines: 1457.343
frames: 4.656
µs: 93269.465
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        add.l   d3,d3
        mulu.w  d1,d2
        move.l  d0,(a0)
        dbf     d7,.l
note: buffer in CHIP RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 5570816.164
CPU cycles / per loop: 85.003
color clocks: 395182
rasterlines: 1740.889
frames: 5.561
µs: 111416.323
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        move.l  d0,(a0)
        dbf     d7,.l
note: buffer in CHIP RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 3714897.114
CPU cycles / per loop: 56.684
color clocks: 263527
rasterlines: 1160.911
frames: 3.708
µs: 74297.942
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        mulu.w  d1,d2
        add.l   d3,d3
        move.l  d0,(a0)
        dbf     d7,.l
note: buffer in CHIP RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 5570830.261
CPU cycles / per loop: 85.004
color clocks: 395183
rasterlines: 1740.894
frames: 5.561
µs: 111416.605
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        mulu.w  d1,d2
        move.l  d0,(a0)
        dbf     d7,.l
note: buffer in CHIP RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 5570816.164
CPU cycles / per loop: 85.003
color clocks: 395182
rasterlines: 1740.889
frames: 5.561
µs: 111416.323
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        swap.w  d1
        move.l  d0,(a0)
        dbf     d7,.l
note: buffer in CHIP RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 3711767.616
CPU cycles / per loop: 56.637
color clocks: 263305
rasterlines: 1159.933
frames: 3.705
µs: 74235.352
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        tst.l   d1
        move.l  d0,(a0)
        dbf     d7,.l
note: buffer in CHIP RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 4663473.263
CPU cycles / per loop: 71.158
color clocks: 330817
rasterlines: 1457.343
frames: 4.656
µs: 93269.465
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        dbf     d7,.l
note: buffer in FAST RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 530689.518
CPU cycles / per loop: 8.097
color clocks: 37646
rasterlines: 165.841
frames: 0.529
µs: 10613.790
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        add.l   d1,d1
        add.l   d1,d1
        move.l  d0,(a0)
        dbf     d7,.l
note: buffer in FAST RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 924738.397
CPU cycles / per loop: 14.110
color clocks: 65599
rasterlines: 288.982
frames: 0.923
µs: 18494.767
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        add.l   d1,d1
        add.l   d1,d1
        add.l   d1,d1
        move.l  d0,(a0)
        dbf     d7,.l
note: buffer in FAST RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 1058136.764
CPU cycles / per loop: 16.145
color clocks: 75062
rasterlines: 330.669
frames: 1.056
µs: 21162.735
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        add.l   d3,d3
        mulu.w  d1,d2
        move.l  d0,(a0)
        dbf     d7,.l
note: buffer in FAST RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 2498960.358
CPU cycles / per loop: 38.131
color clocks: 177271
rasterlines: 780.929
frames: 2.494
µs: 49979.207
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        move.l  d0,(a0)
        dbf     d7,.l
note: buffer in FAST RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 924752.494
CPU cycles / per loop: 14.110
color clocks: 65600
rasterlines: 288.986
frames: 0.923
µs: 18495.049
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        mulu.w  d1,d2
        add.l   d3,d3
        move.l  d0,(a0)
        dbf     d7,.l
note: buffer in FAST RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 2631414.236
CPU cycles / per loop: 40.152
color clocks: 186667
rasterlines: 822.321
frames: 2.627
µs: 52628.284
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        mulu.w  d1,d2
        move.l  d0,(a0)
        dbf     d7,.l
note: buffer in FAST RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 2498974.455
CPU cycles / per loop: 38.131
color clocks: 177272
rasterlines: 780.933
frames: 2.494
µs: 49979.489
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        swap.w  d1
        move.l  d0,(a0)
        dbf     d7,.l
note: buffer in FAST RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 924752.494
CPU cycles / per loop: 14.110
color clocks: 65600
rasterlines: 288.986
frames: 0.923
µs: 18495.049
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        tst.l   d1
        move.l  d0,(a0)
        dbf     d7,.l
note: buffer in FAST RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 1061731.458
CPU cycles / per loop: 16.200
color clocks: 75317
rasterlines: 331.792
frames: 1.060
µs: 21234.629

Last edited by saimo; 15 April 2024 at 23:21. Reason: Removed attachment as I provided a newer one later.
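meynaf's earlier remark that clr -(ax) is reportedly faster than clr (ax)+ can be checked against the CHIP RAM clr results above. The Python sketch below expresses the difference in color clocks per loop; the numbers are copied from the results, while the helper names are made up for illustration and no cause for the difference is assumed.

```python
LOOPS = 65536  # loop count used by every test

# Color clock totals for the CHIP RAM clr tests, copied from the results above.
CHIP_COLOR_CLOCKS = {
    "clr.l (a0)+": 198582,
    "clr.l -(a0)": 131650,
    "clr.l (a0)+ x2": 396932,
    "clr.l -(a0) x2": 263304,
}


def per_loop(name: str) -> float:
    """Color clocks spent per loop iteration for the named test."""
    return CHIP_COLOR_CLOCKS[name] / LOOPS


def extra_color_clocks(postinc: str, predec: str) -> float:
    """How many more color clocks per loop the (a0)+ form takes than -(a0)."""
    return per_loop(postinc) - per_loop(predec)


if __name__ == "__main__":
    single = extra_color_clocks("clr.l (a0)+", "clr.l -(a0)")
    double = extra_color_clocks("clr.l (a0)+ x2", "clr.l -(a0) x2")
    print(f"one clr per loop:  (a0)+ costs {single:.3f} more color clocks")
    print(f"two clrs per loop: (a0)+ costs {double:.3f} more color clocks")
```

In CHIP RAM the post-increment form costs roughly one extra color clock per clr. In FAST RAM the single-clr loops are nearly identical (10.068 against 10.053 cycles), while the two-clr loops differ by about 2 cycles (18.150 against 16.200).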
14 April 2024, 02:35 | #31 | ||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 857
|
Quote:
Quote:
|
||
14 April 2024, 11:13 | #32 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,052
|
Thanks. That's too many tests in one code block; for my poor Android it would be better to split them into a few smaller examples.
But it looks like I was right: the instruction right after mulu cannot be a memory instruction; it must be a register instruction. If this is a general rule (I don't know) for all 68030 instructions that need more than 2 cycles, then tests like the following are not good for pipelining: Code:
        move.l  d0,(a0)         ; chip ram
        mulu    d1,d2
        move.l  d0,(a0)
Code:
        move.l  d0,(a0)         ; chip ram
        mulu    d1,d2
        add.l   d3,d3
        add.l   d3,d3
        move.l  d3,d3
        ......
        move.l  d0,(a0)
15 April 2024, 01:06 | #33 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 857
|
@Don_Adan
Sorry for the big results list. Here are the tests you're interested in (straight from the list): Code:
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        add.l   d3,d3
        mulu.w  d1,d2
        move.l  d0,(a0)
        dbf     d7,.l
note: buffer in CHIP RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 5570816.164
CPU cycles / per loop: 85.003
color clocks: 395182
rasterlines: 1740.889
frames: 5.561
µs: 111416.323
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        mulu.w  d1,d2
        add.l   d3,d3
        move.l  d0,(a0)
        dbf     d7,.l
note: buffer in CHIP RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 5570830.261
CPU cycles / per loop: 85.004
color clocks: 395183
rasterlines: 1740.894
frames: 5.561
µs: 111416.605
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        mulu.w  d1,d2
        move.l  d0,(a0)
        dbf     d7,.l
note: buffer in CHIP RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 5570816.164
CPU cycles / per loop: 85.003
color clocks: 395182
rasterlines: 1740.889
frames: 5.561
µs: 111416.323

Note: with mulu, which is so slow, of course not many other instructions can be added; in the tests I didn't even try to add more because I was only interested in verifying the concept.
@all
Tomorrow I hope to manage to provide more tests and an updated tool that allows controlling the caches from the command line (this is already done).
15 April 2024, 13:56 | #34 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,052
|
Perhaps the best option for pipelining the mulu instruction would be to pipeline it with 2-3 other mulu instructions that use immediate operands.
Something like this: Code:
        mulu.w  D0,D1
        mulu.w  #$1234,D2
        mulu.w  #$5678,D3
Then, in the time of one mulu, 3-4 mulu could be done. But it needs a special routine that uses 1 normal mulu and 2-3 immediate mulu. Personally, I have never needed such a routine, but maybe it could be useful for something.
15 April 2024, 23:11 | #35 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 857
|
Quote:
Code:
        mulu.w  d0,d1
        move.l  d2,d3
        lsl.l   #2,d2
        add.l   d2,d3
Last edited by saimo; 16 April 2024 at 11:30. Reason: Fixed formatting, which had been screwed up by the browser/site.
|
15 April 2024, 23:21 | #36 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 857
|
I couldn't make more tests, but I did improve the test tool: I prettified the output to make it more readable and added cache control flags to the commandline.
It's attached here, along with all the previous tests and a little readme (which is actually a snippet from the source code) that explains what the tool does and how to use it. Last edited by saimo; 16 April 2024 at 19:25. Reason: Removed attachment as I provided a newer version later. |
15 April 2024, 23:50 | #37 | |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,052
|
Quote:
But if nothing can be executed in parallel with mulu, then this idea is useless.
|
16 April 2024, 00:24 | #38 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,052
|
But, comparing these two results:
Code:
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        mulu.w  d1,d2
        add.l   d3,d3
        move.l  d0,(a0)
        dbf     d7,.l
note: buffer in CHIP RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 5570830.261
CPU cycles / per loop: 85.004
color clocks: 395183
rasterlines: 1740.894
frames: 5.561
µs: 111416.605
Code:
-------------------------------------------------------
core:
.l      move.l  d0,(a0)
        mulu.w  d1,d2
        move.l  d0,(a0)
        dbf     d7,.l
note: buffer in CHIP RAM
CPU: 68030 @ 50.000000 MHz
loops number: 65536
CPU cycles / total: 5570816.164
CPU cycles / per loop: 85.003
color clocks: 395182
rasterlines: 1740.889
frames: 5.561
µs: 111416.323
To me, add.l d3,d3 is executed in parallel with mulu.
16 April 2024, 08:34 | #39 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,365
|
It does not execute in parallel with mulu. It takes 2 cycles off the second move.l, which waits 2 cycles less for chipmem.
What seems wrong to me is that measurement overhead makes it look like it uses 85 clocks when it should really use 84, which is 3x28. The first move.l takes 28. mulu.w takes 26 (maybe 24 here if the first 2 execute in parallel with the write). The second move.l takes 28, but it cannot start immediately if the timing is not a multiple of 28 (maybe 14?). So we can fit one small instruction before it. It could be interesting to add 2-cycle instructions until the timing jumps to the next chipmem slot.
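One way to look at the "multiple of 28 (maybe 14?)" question is to express the CHIP RAM results in color clocks per loop instead of CPU cycles. The Python sketch below does that for five results posted earlier in the thread. The conversion factor of about 14.0968 CPU cycles per color clock is simply the ratio of the two totals in those results; the helper name is hypothetical.

```python
LOOPS = 65536
CYCLES_PER_COLOR_CLOCK = 14.0968  # total CPU cycles / color clocks in the results

# Color clock totals of CHIP RAM tests, copied from the posted results.
COLOR_CLOCKS = {
    "move.l / dbf": 131650,
    "move.l / swap.w / move.l / dbf": 263305,
    "move.l / tst.l / move.l / dbf": 330817,
    "move.l / mulu.w / move.l / dbf": 395182,
    "move.l / mulu.w / add.l / move.l / dbf": 395183,
}


def color_clocks_per_loop(total_color_clocks: int) -> float:
    """Average number of color clocks spent in one loop iteration."""
    return total_color_clocks / LOOPS


if __name__ == "__main__":
    for name, total in COLOR_CLOCKS.items():
        cc = color_clocks_per_loop(total)
        whole = round(cc)
        print(f"{name}: {cc:.3f} color clocks/loop "
              f"(~{whole} = {whole * CYCLES_PER_COLOR_CLOCK:.2f} CPU cycles)")
```

The loops land close to 2, 4, 5 and 6 color clocks, so the step between results is one color clock (about 14.1 CPU cycles) rather than a full 28. Six color clocks are about 84.58 CPU cycles at this clock ratio, which accounts for most of the gap between the expected 84 and the measured 85.003.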
16 April 2024, 10:51 | #40 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,052
|
Of course, to be sure whether add.l d3,d3 is executed in parallel with mulu, another add.l d3,d3 must be added to the same code for testing:
Code:
core:
.l      move.l  d0,(a0)
        mulu.w  d1,d2
        add.l   d3,d3
        add.l   d3,d3
        move.l  d0,(a0)
        dbf     d7,.l
If the result is 87 cycles, then it is not executed in parallel with mulu.