English Amiga Board


11 April 2024, 15:53   #21
meynaf
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by saimo
Even better: all those 28 cycles are free (also for some not-so-simple instructions like pack); between two moves to CHIP RAM, like in the examples I posted above, 26 cycles are free (the remaining 2 cycles, I guess, are used precisely by the second move).
They can't all be free?
A memory access (with a simple addressing mode), as I observed years ago, takes 4+n cycles, where n is 1 for data cache, 4 for 60 ns RAM, 5 for 70 ns RAM, 24 for chipmem.
Oh, well. Perhaps it's just the (an)+ mode changing things. Have you made tests with (an), (an)+ and -(an) to see the differences?


Quote:
Originally Posted by saimo
Yes, on my Blizzard 1230 IV with 60 ns there are 4 free cycles after a write to FAST RAM.
That's with 8 cycles overall.
So there should be 24 after a write to chip (28 cycles overall).
Or what did I miss?
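For reference, the 4+n rule of thumb above can be written out as a small table. This is only a sketch in Python for the arithmetic (the names `ACCESS_LATENCY` and `access_cycles` are made up for illustration); the n values are the ones quoted from memory above, not fresh measurements.

```python
# Extra cycles n per memory type, as quoted from memory in the post above.
ACCESS_LATENCY = {
    "data cache": 1,
    "60ns fast ram": 4,
    "70ns fast ram": 5,
    "chipmem": 24,
}

BASE_CYCLES = 4  # fixed part of the "4+n" rule for a simple addressing mode


def access_cycles(memory: str) -> int:
    """Total cycles for one access under the 4+n rule of thumb."""
    return BASE_CYCLES + ACCESS_LATENCY[memory]


for memory in ACCESS_LATENCY:
    print(f"{memory:>14}: {access_cycles(memory)} cycles")
```

Under this rule a write to 60 ns FAST RAM comes to 8 cycles and a write to chipmem to 28, which are the totals referred to above.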


Quote:
Originally Posted by saimo
I didn't remember the data cache thing (EDIT: or maybe I never stumbled upon it, because I always preload data in registers before writing to memory precisely to exploit the free cycles). Thanks for pointing it out.
Sometimes we use the top of the stack when we don't have enough registers. Not the best idea here.

Did you know it is also possible to hide the cost of some instructions between reads?
This can happen if data cache burst is active. The first read will start a burst, but the second read will stall, unless a few instructions are executed in between. The gain is minimal, though.


Quote:
Originally Posted by saimo
And unfortunately also some simple ones like exg or swap.
swap, really? There are quite a few of them in a c2p and they didn't seem to cause issues.
nop does, however. Not knowing this at the time, I tried to count available cycles and it failed because nop isn't a real nop.


Quote:
Originally Posted by saimo
This is new to me: I'll make a test right away and post the result.
This is what my memory says; it may be wrong.
12 April 2024, 01:19   #22
saimo
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 787
@meynaf

I noticed your post only now. It's too late to make other tests now, but I'll do more and, if anybody is interested, I'll also provide the tool I'm using to make the tests, so that anyone can verify my results and also make tests on other machines.

Quote:
Originally Posted by meynaf
They can't all be free?
A memory access (with a simple addressing mode), as I observed years ago, takes 4+n cycles, where n is 1 for data cache, 4 for 60 ns RAM, 5 for 70 ns RAM, 24 for chipmem.
Oh, well. Perhaps it's just the (an)+ mode changing things. Have you made tests with (an), (an)+ and -(an) to see the differences?
I'm saying the opposite: 26 usable cycles between CHIP RAM writes, instead of 24.

By the way, in my previous post there are some figures that deal with those addressing modes.

Quote:
Did you know it is also possible to hide the cost of some instructions between reads?
Yes, indeed. It's a relatively recent discovery for me: if I remember correctly, I noticed that only when I started working on SkillGrid, i.e. in 2017 (when I returned to classic Amiga development). (And if I had known it before, then I had also forgotten it.)

Quote:
This can happen if data cache burst is active. The first read will start a burst, but the second read will stall, unless a few instructions are executed in between. The gain is minimal, though.
I rarely use the data cache burst, so I never measured the gain in that case. Off the top of my head, I can say that after a read from CHIP RAM (so, without the data cache being involved) there are 4 free cycles and also that even a humble 68020 in a stock A1200 enjoys 2 free cycles!

I'll make tests also about this.

Quote:
swap, really?
Yes, really. And here's another even more surprising one: tst.

Quote:
There are quite a few of them in a c2p and they didn't seem to cause issues.
I can provide tests and figures about this as well.

Quote:
nop does, however. Not knowing this at the time, I tried to count available cycles and it failed because nop isn't a real nop.
Yep, nop by specification waits for bus activity to end before "executing".

Quote:
This is what my memory says; it may be wrong.
Eventually I made the tests and reported the results in my previous post - if you haven't noticed them, check them out.
12 April 2024, 08:20   #23
meynaf
Quote:
Originally Posted by saimo
I'm saying the opposite: 26 usable cycles between CHIP RAM writes, instead of 24.
Can you post the same figures as in your previous post, but for chipmem? This looks really weird!


Quote:
Originally Posted by saimo
By the way, in my previous post there are some figures that deal with those addressing modes.
I noticed that. It looks even weirder to me.

From memory I would have said:
Code:
.l move.b dx,-(az) ; 8
   move.b dx,-(az) ; 8
   dbf    dy,.l    ; 0
total 16 (you got 15)

.l move.b dx,-(az) ; 8
   add.l  dw,dw    ; 0 (2/4)
   add.l  dw,dw    ; 0 (4/4)
   move.b dx,-(az) ; 8
   dbf    dy,.l    ; 0
total 16 (you got 15)

.l move.b dx,-(az) ; 8
   add.l  dw,dw    ; 0 (2/4)
   add.l  dw,dw    ; 0 (4/4)
   add.l  dw,dw    ; 2
   move.b dx,-(az) ; 8
   dbf    dy,.l    ; 0
total 18 (you got 16)

.l move.b dx,(az)  ; 8
   move.b dx,(az)  ; 8
   dbf    dy,.l    ; 0
total 16 (you got 14)

.l move.b dx,(az)  ; 8
   add.l  dw,dw    ; 0 (2/4)
   add.l  dw,dw    ; 0 (4/4)
   move.b dx,(az)  ; 8
   dbf    dy,.l    ; 0
total 16 (you got 14)

.l move.b dx,(az)  ; 8
   add.l  dw,dw    ; 0 (2/4)
   add.l  dw,dw    ; 0 (4/4)
   add.l  dw,dw    ; 2
   move.b dx,(az)  ; 8
   dbf    dy,.l    ; 0
total 18 (you got 16)
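The predictions above follow from a simple rule: each move costs 8 cycles, up to 4 cycles of register-only work hide behind the preceding write, and dbf is counted as free. A minimal sketch of that rule, in Python just for the arithmetic; the function name and constants are illustrative assumptions (including the guess that the target is 60 ns FAST RAM), and the measured values are the "you got" figures from the listing.

```python
MOVE_CYCLES = 8       # one move.b to memory, per the 4+n rule (assumed 60 ns FAST RAM)
FREE_AFTER_WRITE = 4  # cycles assumed to overlap with the write
ADD_CYCLES = 2        # one add.l dw,dw


def predicted_loop(n_adds: int) -> int:
    """Predicted cycles for: move, n_adds x add.l, move, dbf (dbf counted as 0)."""
    visible = max(0, n_adds * ADD_CYCLES - FREE_AFTER_WRITE)
    return MOVE_CYCLES + visible + MOVE_CYCLES


# (addressing mode, number of add.l, measured cycles per loop)
cases = [
    ("-(az)", 0, 15), ("-(az)", 2, 15), ("-(az)", 3, 16),
    ("(az)", 0, 14), ("(az)", 2, 14), ("(az)", 3, 16),
]

for mode, n_adds, measured in cases:
    predicted = predicted_loop(n_adds)
    print(f"{mode:>6} adds={n_adds}: predicted {predicted}, measured {measured}, "
          f"off by {predicted - measured}")
```

Every measured loop comes out 1 or 2 cycles shorter than this rule predicts, which is the discrepancy being discussed.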

Quote:
Originally Posted by saimo
I rarely use the data cache burst, so I never measured the gain in that case. Off the top of my head, I can say that after a read from CHIP RAM (so, without the data cache being involved) there are 4 free cycles and also that even a humble 68020 in a stock A1200 enjoys 2 free cycles!

I'll make tests also about this.
Chip RAM accesses depend more on chipset properties than on CPU properties. It seems the best timing is obtained when the loop is a multiple of 14 cycles; adding instructions in areas without memory accesses can have strange effects.
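One possible reading of the 14-cycle figure, offered as a guess rather than something stated above: on a 50 MHz CPU, one PAL color clock (3.546895 MHz) lasts about 14.1 CPU cycles, so a loop that is a multiple of 14 cycles stays roughly in step with the chip bus. A quick Python check of that arithmetic (the CPU frequency is assumed to be 50 MHz and the machine is assumed to be PAL):

```python
CPU_HZ = 50_000_000             # assumed CPU clock; change for other accelerators
PAL_COLOR_CLOCK_HZ = 3_546_895  # PAL color clock frequency

cycles_per_color_clock = CPU_HZ / PAL_COLOR_CLOCK_HZ
print(f"CPU cycles per color clock: {cycles_per_color_clock:.3f}")

# Loop lengths that span a whole number of color clocks.
for n in range(1, 7):
    print(f"{n} color clocks = {n * cycles_per_color_clock:.2f} CPU cycles")
```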


Quote:
Originally Posted by saimo
Yes, really. And here's another even more surprising one: tst.
I checked my c2p code: there is a swap that should have broken my cycle counts completely.


Quote:
Originally Posted by saimo
I can provide tests and figures about this as well.
Yes, please.

EDIT:
Testing clr could be interesting too. From what I've read, clr -(ax) is faster than clr (ax)+...

Last edited by meynaf; 13 April 2024 at 09:04.
12 April 2024, 10:09   #24
Don_Adan
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,975
Quote:
Originally Posted by saimo
Yes, many more instructions are possible to exploit all those cycles. The examples in the post you quoted used single instructions to show the impact of instructions that cannot execute in parallel with writes to RAM.
That's not exactly what I meant, but my English is very limited.
In your example, you put only one (2 bytes) mulu.l instruction between two chip memory writes. I'm not a 68030 expert, but detecting (not even executing) the next instruction as a memory access instruction can stop pipelining for the mulu.l instruction.

I think you should also check the following cases, to be sure:

Code:
.l move.l dx,(ay)
   mulu.l dw,dz
   dbf    dy,.l

Code:
.l move.l dx,(ay)
   mulu.l dw,dz
   move.l dx,dx
   dbf    dy,.l

Code:
.l move.l dx,(ay)
   mulu.w dw,dz
   dbf    dy,.l
12 April 2024, 23:58   #25
saimo
@meynaf

First off: sorry for scaring you, but swap can indeed run in parallel! Yesterday I must have swapped (heh) binaries/results.
(To be honest, I scared myself too, as I use swap that way in several places...)

Today I couldn't make all the tests, so here's an initial bunch:

Code:
Results obtained on Amiga 1200, Blizzard 1230 IV, 68030 @ 50 MHz, 60 ns RAM

-------------------------------------------------------
                 core: .l dbf d7,.l

         loops number: 65536
   CPU cycles / total: 396431.244
CPU cycles / per loop: 6.049
         color clocks: 28122
          rasterlines: 123.885
               frames: 0.395
                   µs: 7928.624

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          dbf    d7,.l

                 note: (a0) -> CHIP RAM

         loops number: 65536
   CPU cycles / total: 1855862.662
CPU cycles / per loop: 28.318
         color clocks: 131651
          rasterlines: 579.960
               frames: 1.852
                   µs: 37117.253

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          dbf    d7,.l

                 note: (a0) -> FAST RAM

         loops number: 65536
   CPU cycles / total: 530647.228
CPU cycles / per loop: 8.097
         color clocks: 37643
          rasterlines: 165.828
               frames: 0.529
                   µs: 10612.944

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          rept   13
                          add.l  d1,d1
                          endr
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: (a0) -> CHIP RAM

         loops number: 65536
   CPU cycles / total: 3711767.616
CPU cycles / per loop: 56.637
         color clocks: 263305
          rasterlines: 1159.933
               frames: 3.705
                   µs: 74235.352

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          rept   14
                          add.l  d1,d1
                          endr
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: (a0) -> CHIP RAM

         loops number: 65536
   CPU cycles / total: 4663473.263
CPU cycles / per loop: 71.158
         color clocks: 330817
          rasterlines: 1457.343
               frames: 4.656
                   µs: 93269.465

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          add.l  d3,d3
                          mulu.w d1,d2
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: (a0) -> CHIP RAM

         loops number: 65536
   CPU cycles / total: 5570816.164
CPU cycles / per loop: 85.003
         color clocks: 395182
          rasterlines: 1740.889
               frames: 5.561
                   µs: 111416.323

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          mulu.w d1,d2
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: (a0) -> CHIP RAM

         loops number: 65536
   CPU cycles / total: 5570816.164
CPU cycles / per loop: 85.003
         color clocks: 395182
          rasterlines: 1740.889
               frames: 5.561
                   µs: 111416.323

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          mulu.w d1,d2
                          add.l  d3,d3
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: (a0) -> CHIP RAM

         loops number: 65536
   CPU cycles / total: 5570816.164
CPU cycles / per loop: 85.003
         color clocks: 395182
          rasterlines: 1740.889
               frames: 5.561
                   µs: 111416.323

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          swap.w d1
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: (a0) -> CHIP RAM

         loops number: 65536
   CPU cycles / total: 3711767.616
CPU cycles / per loop: 56.637
         color clocks: 263305
          rasterlines: 1159.933
               frames: 3.705
                   µs: 74235.352

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          tst.l  d1
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: (a0) -> CHIP RAM

         loops number: 65536
   CPU cycles / total: 4663473.263
CPU cycles / per loop: 71.158
         color clocks: 330817
          rasterlines: 1457.343
               frames: 4.656
                   µs: 93269.465

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: (a0) -> CHIP RAM

         loops number: 65536
   CPU cycles / total: 3711767.616
CPU cycles / per loop: 56.637
         color clocks: 263305
          rasterlines: 1159.933
               frames: 3.705
                   µs: 74235.352

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          add.l  d1,d1
                          add.l  d1,d1
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: (a0) -> FAST RAM

         loops number: 65536
   CPU cycles / total: 924752.494
CPU cycles / per loop: 14.110
         color clocks: 65600
          rasterlines: 288.986
               frames: 0.923
                   µs: 18495.049

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          add.l  d1,d1
                          add.l  d1,d1
                          add.l  d1,d1
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: (a0) -> FAST RAM

         loops number: 65536
   CPU cycles / total: 1058150.861
CPU cycles / per loop: 16.146
         color clocks: 75063
          rasterlines: 330.674
               frames: 1.056
                   µs: 21163.017

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          add.l  d3,d3
                          mulu.w d1,d2
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: (a0) -> FAST RAM

         loops number: 65536
   CPU cycles / total: 2502146.243
CPU cycles / per loop: 38.179
         color clocks: 177497
          rasterlines: 781.925
               frames: 2.498
                   µs: 50042.924

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          mulu.w d1,d2
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: (a0) -> FAST RAM

         loops number: 65536
   CPU cycles / total: 2502146.243
CPU cycles / per loop: 38.179
         color clocks: 177497
          rasterlines: 781.925
               frames: 2.498
                   µs: 50042.924

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          mulu.w d1,d2
                          add.l  d3,d3
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: (a0) -> FAST RAM

         loops number: 65536
   CPU cycles / total: 2631414.236
CPU cycles / per loop: 40.152
         color clocks: 186667
          rasterlines: 822.321
               frames: 2.627
                   µs: 52628.284

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          swap.w d1
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: (a0) -> FAST RAM

         loops number: 65536
   CPU cycles / total: 924752.494
CPU cycles / per loop: 14.110
         color clocks: 65600
          rasterlines: 288.986
               frames: 0.923
                   µs: 18495.049

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          tst.l  d1
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: (a0) -> FAST RAM

         loops number: 65536
   CPU cycles / total: 1061801.942
CPU cycles / per loop: 16.201
         color clocks: 75322
          rasterlines: 331.814
               frames: 1.060
                   µs: 21236.038

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: (a0) -> FAST RAM

         loops number: 65536
   CPU cycles / total: 924752.494
CPU cycles / per loop: 14.110
         color clocks: 65600
          rasterlines: 288.986
               frames: 0.923
                   µs: 18495.049
Attached is an archive that contains the test tool, the test binaries and sources, and a script to run all the tests. The script specifies 50000000 for the CPU frequency, so you might need to replace that (I really should add CPU frequency autodetection to the tool...). No time for explanations now, but it's all quite self-explanatory.
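The derived columns in the results appear to be consistent with the following conversions, which is also why the CPU frequency given to the tool matters. This is a Python sketch inferred from the numbers only, not taken from the tool's source: the constants (PAL color clock of 3546895 Hz, 227 color clocks per rasterline, 313 rasterlines per frame) and the function name `derive` are assumptions.

```python
PAL_COLOR_CLOCK_HZ = 3_546_895  # assumed: PAL color clock
COLOR_CLOCKS_PER_LINE = 227     # assumed: color clocks per rasterline
LINES_PER_FRAME = 313           # assumed: rasterlines per frame


def derive(color_clocks: int, loops: int, cpu_hz: int = 50_000_000) -> dict:
    """Recompute the derived columns from a measured color clock count."""
    total_cycles = color_clocks * cpu_hz / PAL_COLOR_CLOCK_HZ
    rasterlines = color_clocks / COLOR_CLOCKS_PER_LINE
    return {
        "cycles_total": total_cycles,
        "cycles_per_loop": total_cycles / loops,
        "rasterlines": rasterlines,
        "frames": rasterlines / LINES_PER_FRAME,
        "us": color_clocks / PAL_COLOR_CLOCK_HZ * 1e6,
    }


# First result block above: the empty dbf loop.
row = derive(color_clocks=28122, loops=65536)
for name, value in row.items():
    print(f"{name:>16}: {value:.3f}")
```

For the first result block (28122 color clocks, 65536 loops) this gives about 396431.24 total cycles and 6.049 cycles per loop, in line with the listing. With a different `cpu_hz` only the cycle columns change.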

More tests will come tomorrow / in the next days.


@Don_Adan

You did manage to explain yourself, no worries.
Regarding your questions: while mulu itself can't run in parallel, it's possible to add instructions before and after it that might still execute in parallel, depending on the bus timing (so, when the accesses are made to CHIP RAM, it's more likely that the other instructions will come for free). Check out the mulu tests in the results above.
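To make the mulu point concrete, the cost of the extra add.l can be read off the results above by subtracting the mulu-only loop from the loops that add an instruction. A small Python sketch of that subtraction (the dictionary layout and the `extra_cost` name are just for illustration; the figures are copied from the results above):

```python
# CPU cycles per loop for: move.l d0,(a0) / <variant> / move.l d0,(a0) / dbf,
# copied from the results above.
measured = {
    "CHIP": {"mulu": 85.003, "add+mulu": 85.003, "mulu+add": 85.003},
    "FAST": {"mulu": 38.179, "add+mulu": 38.179, "mulu+add": 40.152},
}


def extra_cost(ram: str, variant: str) -> float:
    """Cycles per loop added by the add.l, relative to the mulu-only loop."""
    return round(measured[ram][variant] - measured[ram]["mulu"], 3)


for ram in measured:
    for variant in ("add+mulu", "mulu+add"):
        print(f"{ram} {variant}: +{extra_cost(ram, variant):.3f} cycles")
```

In CHIP RAM the add.l costs nothing on either side of mulu; in FAST RAM it is free before mulu but costs about 2 cycles after it.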

Last edited by saimo; 14 April 2024 at 02:33. Reason: Removed attachment as I provided an updated one in a later post.
13 April 2024, 09:21   #26
meynaf
Sure, swap really did surprise me here.
I have my own tool for measuring, but unfortunately my A1200 isn't in the right shape to perform tests...

More tests are welcome, but I really can't interpret the results we already have.
13 April 2024, 11:57   #27
Old_Bob
BiO-sanitation Battalion
Join Date: Jun 2017
Location: Scotland
Posts: 151
Code:
.l move.l dx,(ay)
   rol.l  #1,dz
   move.l dx,(ay)
   dbf    dy,.l
Interesting info. Have you tested this using lsl.l as well?

B
13 April 2024, 13:04   #28
saimo
Quote:
Originally Posted by Old_Bob
Code:
.l move.l dx,(ay)
   rol.l  #1,dz
   move.l dx,(ay)
   dbf    dy,.l
Interesting info. Have you tested this using lsl.l as well?

B
Yes, lsl can execute in parallel. Shifts and rotations are quite peculiar - check out my previous posts in this thread.
13 April 2024, 13:08   #29
Karlos
Alien Bleed
Join Date: Aug 2022
Location: UK
Posts: 4,165
It would be interesting to see the cycles per loop versus an entirely naive summation of worst-case clocks, with no parallelism, for each. It's easy to underestimate the effort that went into the design of these older CPUs.
14 April 2024, 02:29   #30
saimo
A few more tests: some cases with clr, with mulu, and with instructions placed after reads.
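As a quick way to compare the two clr addressing modes, the per-loop cycle counts of the single-clr tests can be set side by side. A Python sketch (the table layout and the `postinc_penalty` name are illustrative; the figures are copied from the single-clr entries in the results of this post):

```python
# CPU cycles per loop for "clr.l <ea>" followed by dbf, copied from this post's results.
single_clr = {
    "CHIP": {"(a0)+": 42.715, "-(a0)": 28.318},
    "FAST": {"(a0)+": 10.068, "-(a0)": 10.053},
}


def postinc_penalty(ram: str) -> float:
    """Extra cycles per loop for clr.l (a0)+ compared with clr.l -(a0)."""
    return single_clr[ram]["(a0)+"] - single_clr[ram]["-(a0)"]


for ram in single_clr:
    print(f"{ram}: clr.l (a0)+ costs {postinc_penalty(ram):.3f} more cycles per loop")
```

In CHIP RAM the difference is about 14.4 cycles per loop; in FAST RAM it is nearly zero for a single clr.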

Updated results:
Code:
-------------------------------------------------------
                 core: .l clr.l (a0)+
                          clr.l (a0)+
                          dbf   d7,.l

                 note: buffer in CHIP RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 5595485.628
CPU cycles / per loop: 85.380
         color clocks: 396932
          rasterlines: 1748.599
               frames: 5.586
                   µs: 111909.712

-------------------------------------------------------
                 core: .l clr.l (a0)+
                          clr.l (a0)+
                          dbf   d7,.l

                 note: buffer in FAST RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 1189519.283
CPU cycles / per loop: 18.150
         color clocks: 84382
          rasterlines: 371.726
               frames: 1.187
                   µs: 23790.385

-------------------------------------------------------
                 core: .l clr.l -(a0)
                          clr.l -(a0)
                          dbf   d7,.l

                 note: buffer in CHIP RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 3711753.519
CPU cycles / per loop: 56.636
         color clocks: 263304
          rasterlines: 1159.929
               frames: 3.705
                   µs: 74235.070

-------------------------------------------------------
                 core: .l clr.l -(a0)
                          clr.l -(a0)
                          dbf   d7,.l

                 note: buffer in FAST RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 1061731.458
CPU cycles / per loop: 16.200
         color clocks: 75317
          rasterlines: 331.792
               frames: 1.060
                   µs: 21234.629

-------------------------------------------------------
                 core: .l clr.l (a0)+
                          dbf   d7,.l

                 note: buffer in CHIP RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 2799378.047
CPU cycles / per loop: 42.715
         color clocks: 198582
          rasterlines: 874.810
               frames: 2.794
                   µs: 55987.560

-------------------------------------------------------
                 core: .l clr.l (a0)+
                          dbf   d7,.l

                 note: buffer in FAST RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 659872.931
CPU cycles / per loop: 10.068
         color clocks: 46810
          rasterlines: 206.211
               frames: 0.658
                   µs: 13197.458

-------------------------------------------------------
                 core: .l clr.l -(a0)
                          dbf   d7,.l

                 note: buffer in CHIP RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 1855848.566
CPU cycles / per loop: 28.318
         color clocks: 131650
          rasterlines: 579.955
               frames: 1.852
                   µs: 37116.971

-------------------------------------------------------
                 core: .l clr.l -(a0)
                          dbf   d7,.l

                 note: buffer in FAST RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 658843.862
CPU cycles / per loop: 10.053
         color clocks: 46737
          rasterlines: 205.889
               frames: 0.657
                   µs: 13176.877

-------------------------------------------------------
                 core: .l dbf d7,.l

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 396431.244
CPU cycles / per loop: 6.049
         color clocks: 28122
          rasterlines: 123.885
               frames: 0.395
                   µs: 7928.624

-------------------------------------------------------
                 core: .l move.l (a0),d0
                          dbf    d7,.l

                 note: buffer in CHIP RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 2799378.047
CPU cycles / per loop: 42.715
         color clocks: 198582
          rasterlines: 874.810
               frames: 2.794
                   µs: 55987.560

-------------------------------------------------------
                 core: .l move.l (a0),d0
                          add.l  d1,d1
                          add.l  d1,d1
                          add.l  d1,d1
                          dbf    d7,.l

                 note: buffer in CHIP RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 2799378.047
CPU cycles / per loop: 42.715
         color clocks: 198582
          rasterlines: 874.810
               frames: 2.794
                   µs: 55987.560

-------------------------------------------------------
                 core: .l move.l (a0),d0
                          add.l  d1,d1
                          add.l  d1,d1
                          add.l  d1,d1
                          add.l  d1,d1
                          dbf    d7,.l

                 note: buffer in CHIP RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 3714883.017
CPU cycles / per loop: 56.684
         color clocks: 263526
          rasterlines: 1160.907
               frames: 3.708
                   µs: 74297.660

-------------------------------------------------------
                 core: .l move.l (a0),d0
                          dbf    d7,.l

                 note: buffer in FAST RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 658561.925
CPU cycles / per loop: 10.048
         color clocks: 46717
          rasterlines: 205.801
               frames: 0.657
                   µs: 13171.238

-------------------------------------------------------
                 core: .l move.l (a0),d0
                          add.l  d1,d1
                          dbf    d7,.l

                 note: buffer in FAST RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 789592.023
CPU cycles / per loop: 12.048
         color clocks: 56012
          rasterlines: 246.748
               frames: 0.788
                   µs: 15791.840

-------------------------------------------------------
                 core: .l move.l d6,d0
                          mulu.l d7,d0
                          dbf    d7,.l

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 3476590.087
CPU cycles / per loop: 53.048
         color clocks: 246622
          rasterlines: 1086.440
               frames: 3.471
                   µs: 69531.801

-------------------------------------------------------
                 core: .l move.w d6,d0
                          mulu.w d7,d0
                          dbf    d7,.l

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 2296896.299
CPU cycles / per loop: 35.047
         color clocks: 162937
          rasterlines: 717.784
               frames: 2.293
                   µs: 45937.925

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          dbf    d7,.l

                 note: buffer in CHIP RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 1855848.566
CPU cycles / per loop: 28.318
         color clocks: 131650
          rasterlines: 579.955
               frames: 1.852
                   µs: 37116.971

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          rept   13
                          add.l  d1,d1
                          endr
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: buffer in CHIP RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 3711767.616
CPU cycles / per loop: 56.637
         color clocks: 263305
          rasterlines: 1159.933
               frames: 3.705
                   µs: 74235.352

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          rept   14
                          add.l  d1,d1
                          endr
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: buffer in CHIP RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 4663473.263
CPU cycles / per loop: 71.158
         color clocks: 330817
          rasterlines: 1457.343
               frames: 4.656
                   µs: 93269.465

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          add.l  d3,d3
                          mulu.w d1,d2
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: buffer in CHIP RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 5570816.164
CPU cycles / per loop: 85.003
         color clocks: 395182
          rasterlines: 1740.889
               frames: 5.561
                   µs: 111416.323

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: buffer in CHIP RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 3714897.114
CPU cycles / per loop: 56.684
         color clocks: 263527
          rasterlines: 1160.911
               frames: 3.708
                   µs: 74297.942

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          mulu.w d1,d2
                          add.l  d3,d3
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: buffer in CHIP RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 5570830.261
CPU cycles / per loop: 85.004
         color clocks: 395183
          rasterlines: 1740.894
               frames: 5.561
                   µs: 111416.605

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          mulu.w d1,d2
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: buffer in CHIP RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 5570816.164
CPU cycles / per loop: 85.003
         color clocks: 395182
          rasterlines: 1740.889
               frames: 5.561
                   µs: 111416.323

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          swap.w d1
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: buffer in CHIP RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 3711767.616
CPU cycles / per loop: 56.637
         color clocks: 263305
          rasterlines: 1159.933
               frames: 3.705
                   µs: 74235.352

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          tst.l  d1
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: buffer in CHIP RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 4663473.263
CPU cycles / per loop: 71.158
         color clocks: 330817
          rasterlines: 1457.343
               frames: 4.656
                   µs: 93269.465

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          dbf    d7,.l

                 note: buffer in FAST RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 530689.518
CPU cycles / per loop: 8.097
         color clocks: 37646
          rasterlines: 165.841
               frames: 0.529
                   µs: 10613.790

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          add.l  d1,d1
                          add.l  d1,d1
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: buffer in FAST RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 924738.397
CPU cycles / per loop: 14.110
         color clocks: 65599
          rasterlines: 288.982
               frames: 0.923
                   µs: 18494.767

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          add.l  d1,d1
                          add.l  d1,d1
                          add.l  d1,d1
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: buffer in FAST RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 1058136.764
CPU cycles / per loop: 16.145
         color clocks: 75062
          rasterlines: 330.669
               frames: 1.056
                   µs: 21162.735

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          add.l  d3,d3
                          mulu.w d1,d2
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: buffer in FAST RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 2498960.358
CPU cycles / per loop: 38.131
         color clocks: 177271
          rasterlines: 780.929
               frames: 2.494
                   µs: 49979.207

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: buffer in FAST RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 924752.494
CPU cycles / per loop: 14.110
         color clocks: 65600
          rasterlines: 288.986
               frames: 0.923
                   µs: 18495.049

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          mulu.w d1,d2
                          add.l  d3,d3
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: buffer in FAST RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 2631414.236
CPU cycles / per loop: 40.152
         color clocks: 186667
          rasterlines: 822.321
               frames: 2.627
                   µs: 52628.284

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          mulu.w d1,d2
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: buffer in FAST RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 2498974.455
CPU cycles / per loop: 38.131
         color clocks: 177272
          rasterlines: 780.933
               frames: 2.494
                   µs: 49979.489

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          swap.w d1
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: buffer in FAST RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 924752.494
CPU cycles / per loop: 14.110
         color clocks: 65600
          rasterlines: 288.986
               frames: 0.923
                   µs: 18495.049

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          tst.l  d1
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: buffer in FAST RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 1061731.458
CPU cycles / per loop: 16.200
         color clocks: 75317
          rasterlines: 331.792
               frames: 1.060
                   µs: 21234.629
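
As a rough cross-check of the FAST RAM figures above, the per-loop counts can be compared against the plain two-write loop (14.110 cycles). The small Python sketch below only subtracts and rounds; the dictionary keys are informal labels chosen here for illustration, and the values are copied from the listing.

```python
# Per-loop cycle counts copied from the FAST RAM results above.
# Each key describes what sits between the two "move.l d0,(a0)" writes.
fast_ram = {
    "nothing": 14.110,
    "add, add": 14.110,
    "add, add, add": 16.145,
    "add, mulu": 38.131,
    "mulu, add": 40.152,
    "mulu": 38.131,
    "swap": 14.110,
    "tst": 16.200,
}

baseline = fast_ram["nothing"]

# Extra cost of each filler relative to the bare two-write loop,
# rounded to whole cycles to hide the small measurement noise.
extra = {name: round(cycles - baseline) for name, cycles in fast_ram.items()}

for name, cost in extra.items():
    print(f"{name:>14}: +{cost} cycles")
```

On these figures two add.l fit between the writes at no cost, a third costs about 2 cycles, and an add.l placed before mulu is free while one placed after it costs about 2 cycles.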

Last edited by saimo; 15 April 2024 at 23:21. Reason: Removed attachment as I provided a newer one later.
Old 14 April 2024, 02:35   #31
saimo
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 787
Quote:
Originally Posted by Karlos View Post
It would be interesting to see the cycles per loop versus an entirely naive summation of worst case clocks, no parallelism for each.
Unfortunately that would take a lot of manual counting or the writing of a worst-case cycle-counting tool... i.e. a lot of work!

Quote:
It's easy to underestimate the effort that went into the design of these older CPUs
They get the utmost respect from me
Old 14 April 2024, 11:13   #32
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,975
Thanks. There are too many tests in one code block; for my poor Android it would be better to split them into a few smaller examples.
But it looks like I was right.
The instruction after mulu cannot be a memory instruction; it must be a register instruction.
I don't know whether this is a general rule for all 68030 instructions that need more than 2 cycles.
If it is, then tests like the next one are not good for pipelining.

Code:
 move.l d0,(a0) ; chip ram
 mulu d1,d2
 move.l d0,(a0)
It must be like this:
Code:
 move.l d0,(a0) ; chip ram
 mulu d1,d2
 add.l d3,d3
 add.l d3,d3
 move.l d3,d3
 ......
 move.l d0,(a0)
Old 15 April 2024, 01:06   #33
saimo
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 787
@Don_Adan

Sorry for the big results list.
Here are the tests you're interested in (straight from the list):

Code:
-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          add.l  d3,d3
                          mulu.w d1,d2
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: buffer in CHIP RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 5570816.164
CPU cycles / per loop: 85.003
         color clocks: 395182
          rasterlines: 1740.889
               frames: 5.561
                   µs: 111416.323

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          mulu.w d1,d2
                          add.l  d3,d3
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: buffer in CHIP RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 5570830.261
CPU cycles / per loop: 85.004
         color clocks: 395183
          rasterlines: 1740.894
               frames: 5.561
                   µs: 111416.605

-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          mulu.w d1,d2
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: buffer in CHIP RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 5570816.164
CPU cycles / per loop: 85.003
         color clocks: 395182
          rasterlines: 1740.889
               frames: 5.561
                   µs: 111416.323
They show, as I had said, that one can fit before and after mulu (or any other instruction that cannot execute in parallel with memory accesses) as many instructions as allowed by the bus timing (and CHIP RAM allows a lot). Of course, one of those critical instructions can cause a waste of time when code cannot be arranged to make the most of the parallelism.
Note: with mulu, which is so slow, of course not many other instructions can be added; in the tests I didn't even try to add more because I was interested in just verifying the concept.


@all

Tomorrow I hope to provide more tests and an updated tool that allows controlling the caches from the command line (this is already done).
Old 15 April 2024, 13:56   #34
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,975
Perhaps the best option for pipelining a mulu instruction would be to pipeline 2-3 other mulu (immediate) instructions with it.
Something like this:

Code:
 mulu.w D0,D1
 mulu.w #$1234,D2
 mulu.w #$5678,D3
Then replace the mulu for D2 and for D3 with move.l, add.l, sub.l and lsr.l.
In the time of 1 mulu, 3-4 multiplications could then be done.
But it needs a special routine that uses 1 normal mulu and 2-3 immediate mulu.
Personally, I have never needed such a routine, but maybe it can be useful for something.
Old 15 April 2024, 23:11   #35
saimo
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 787
Quote:
Originally Posted by Don_Adan View Post
Perhaps the best option for pipelining a mulu instruction would be to pipeline 2-3 other mulu (immediate) instructions with it.
Something like this:

Code:
 mulu.w D0,D1
 mulu.w #$1234,D2
 mulu.w #$5678,D3
Then replace the mulu for D2 and for D3 with move.l, add.l, sub.l and lsr.l.
In the time of 1 mulu, 3-4 multiplications could then be done.
Do you mean getting quick multiplications with moves, adds and shifts done during mulus with something like this?

Code:
   mulu.w d0,d1 
   move.l d2,d3
   lsl.l  #2,d2
   add.l  d2,d3
This won't help: on 68030, nothing can execute in parallel with mulu (mulu has a tail of 0 cycles). The instructions that follow mulu will execute only after it ends.
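
For reference, the three-instruction sequence above leaves d2 × 5 in d3 (and d2 × 4 in d2), all modulo 2^32. The Python sketch below only checks that arithmetic with 32-bit wrap-around; it says nothing about timing, and the function names are hypothetical, chosen for this illustration.

```python
MASK32 = 0xFFFFFFFF  # 68k data registers are 32 bits wide


def times5_shift_add(x):
    """Emulate the move/lsl/add sequence on 32-bit registers."""
    d2 = x & MASK32
    d3 = d2                      # move.l d2,d3  -> d3 = x
    d2 = (d2 << 2) & MASK32      # lsl.l  #2,d2  -> d2 = 4*x (mod 2^32)
    d3 = (d3 + d2) & MASK32      # add.l  d2,d3  -> d3 = 5*x (mod 2^32)
    return d3


def times5_reference(x):
    """Plain multiplication by 5, truncated to 32 bits."""
    return (5 * x) & MASK32


samples = [0, 1, 7, 0x1234, 0x3FFFFFFF, 0xFFFFFFFF]
matches = [times5_shift_add(x) == times5_reference(x) for x in samples]
print(all(matches))
```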

Last edited by saimo; 16 April 2024 at 11:30. Reason: Fixed formatting, which had been screwed up by the browser/site.
Old 15 April 2024, 23:21   #36
saimo
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 787
I couldn't make more tests, but I did improve the test tool: I prettified the output to make it more readable and added cache control flags to the commandline.
It's attached here, along with all the previous tests and a little readme (which is actually a snippet from the source code) that explains what the tool does and how to use it.

Last edited by saimo; 16 April 2024 at 19:25. Reason: Removed attachment as I provided a newer version later.
Old 15 April 2024, 23:50   #37
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,975
Quote:
Originally Posted by saimo View Post
Do you mean getting quick multiplications with moves, adds and shifts done during mulus with something like this?

Code:
   mulu.w d0,d1
   move.l d2,d3
   lsl.l  #2,d2
   add.l  d2,d3
This won't help: on 68030, nothing can execute in parallel with mulu (mulu has a tail of 0 cycles). The instructions that follow mulu will execute only after it ends.
Yes, I mean quick multiplications done with moves, adds and shifts.
But if nothing can execute in parallel with mulu, then this idea is useless.
Old 16 April 2024, 00:24   #38
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,975
But, comparing this code:

Code:
-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          mulu.w d1,d2
                          add.l  d3,d3
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: buffer in CHIP RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 5570830.261
CPU cycles / per loop: 85.004
         color clocks: 395183
          rasterlines: 1740.894
               frames: 5.561
                   µs: 111416.605
with this code

Code:
-------------------------------------------------------
                 core: .l move.l d0,(a0)
                          mulu.w d1,d2
                          move.l d0,(a0)
                          dbf    d7,.l

                 note: buffer in CHIP RAM

                  CPU: 68030 @ 50.000000 MHz
         loops number: 65536

   CPU cycles / total: 5570816.164
CPU cycles / per loop: 85.003
         color clocks: 395182
          rasterlines: 1740.889
               frames: 5.561
                   µs: 111416.323
I think that you are wrong.
To me, it looks like add.l d3,d3 is executed in parallel with mulu.
Old 16 April 2024, 08:34   #39
meynaf
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
It does not execute in parallel to mulu. It takes 2 cycles off the second move.l which waits 2 cycles less for chipmem.

What seems wrong to me is that measurement overhead makes it look like it uses 85 clocks, but it should really use 84, which is 3x28.
First move.l takes 28.
Mulu.w takes 26 (maybe 24 here if the first 2 execute in parallel with the write).
Second move.l takes 28, but it cannot start immediately if the timing is not a multiple of 28 (maybe 14?). So we can fit one small instruction before it.
It could be interesting to add 2-cycle instructions until the timing jumps to the next chipmem slot.
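
The 84-versus-85 question can also be looked at in color clocks, which the result blocks report alongside CPU cycles. The Python sketch below is only arithmetic on numbers from the listings; it assumes the PAL color clock of 3.546895 MHz and assumes that one 28-cycle chipmem access corresponds to 2 color clocks, neither of which is stated in the thread.

```python
CPU_MHZ = 50.0       # from the result blocks: 68030 @ 50.000000 MHz
CCK_MHZ = 3.546895   # assumption: PAL color clock frequency

# CPU cycles that elapse during one color clock (about 14.097).
cycles_per_cck = CPU_MHZ / CCK_MHZ


def cycles_to_cck(cycles):
    """Convert a CPU cycle count to color clocks."""
    return cycles / cycles_per_cck


measured_cycles = 85.003                   # per loop, mulu test in CHIP RAM
measured_cck = cycles_to_cck(measured_cycles)

# Cross-check against the listing: 395182 color clocks over 65536 loops.
listed_cck = 395182 / 65536

# Assumption: three chipmem slots of 2 color clocks each.
six_cck_in_cycles = 6 * cycles_per_cck

print(f"cycles per color clock: {cycles_per_cck:.3f}")
print(f"measured loop:          {measured_cck:.3f} color clocks")
print(f"6 color clocks:         {six_cck_in_cycles:.2f} CPU cycles")
```

On these assumptions the loop takes about 6.03 color clocks, and six whole color clocks are about 84.6 CPU cycles, so part of the gap between 84 and 85 may come from 28 being a rounded figure rather than from measurement overhead alone.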
Old 16 April 2024, 10:51   #40
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 1,975
Of course, to be sure whether add.l d3,d3 is executed in parallel with mulu, one more add.l d3,d3 must be added to the same code for testing:

Code:
                 core: .l move.l d0,(a0)
                          mulu.w d1,d2
                          add.l  d3,d3
                          add.l  d3,d3
                          move.l d0,(a0)
                          dbf    d7,.l
If the result is 85 cycles, then it is executed in parallel with mulu.
If the result is 87 cycles, then it is not executed in parallel with mulu.
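
If chipmem writes can only start on slot boundaries, as suggested earlier in the thread, a third outcome is possible: the time could jump by a whole slot instead of by 2 cycles. The Python sketch below is a toy model of that idea only. It assumes the figures quoted in the thread (28 cycles per write, 26 for mulu.w, 2 per add.l), ignores dbf and loop wrap-around, and uses a hypothetical function name; its outputs are model predictions, not measurements.

```python
def loop_cycles(n_adds, slot, write=28, mulu=26, add=2):
    """Toy estimate of cycles per loop for: write, mulu, n adds, write.

    The second write is assumed to start at the first slot boundary
    at or after the point where the CPU is ready for it.
    """
    ready = write + mulu + n_adds * add   # CPU wants the bus again here
    start = -(-ready // slot) * slot      # round up to the next slot boundary
    return start + write


# Sweep the number of add.l instructions for both candidate slot sizes.
for n in range(4):
    print(n, loop_cycles(n, slot=28), loop_cycles(n, slot=14))
```

In this model zero or one add.l gives 84 cycles either way, while a second add.l pushes the loop to 112 cycles with 28-cycle slots or 98 cycles with 14-cycle slots, so the proposed test might also help tell the two slot sizes apart.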