07 April 2024, 13:55 | #1 |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,304
|
mulu.l cycles on 68030 and 68040
Does anyone have the formula for the number of cycles a 32x32->32 multiplication takes on the 68030 and 68040? I'm particularly interested in the case where the number of 1-bits in the source is 4. For the 68030, the time depends on the number of 1-bits in the multiplicand, which seems plausible. The UM only gives the worst case, and from that one can guess that the multiplier takes 2 cycles per set bit, probably using some microcoded shift+add algorithm. For the 68040, the UM states that the multiplier takes constant time, 16 or 20 cycles, depending on the 16 or 32 bit case. That sounds odd, because the best time on the 68040 would then be worse than the best time on the 68030, so I guess what is stated in the UM is only the worst case, but I do not know. I currently do not have a 68040 at hand, so if anyone could measure this, I would appreciate it.

PS: Yes, that's really one of the rare cases where a potential multiplication sits deep in the inner loop of a fairly important assembler algorithm, so it does matter this time; otherwise I wouldn't ask.
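For illustration, the kind of shift+add loop such a microcoded multiplier is usually modelled on would look like the sketch below. This is purely an illustration of the suspected scheme; the register use, labels and early-out are my invention, not from any Motorola documentation. Code:

        moveq   #0,d2           ; accumulator
.next   lsr.l   #1,d1           ; shift the next multiplier bit into C
        bcc.s   .skip
        add.l   d0,d2           ; bit was set: add the shifted multiplicand
.skip   add.l   d0,d0           ; multiplicand <<= 1
        tst.l   d1
        bne.s   .next           ; stop once no 1-bits remain

If each executed add costs extra cycles, the total time tracks the number of set bits, which would match the 68030 behaviour described above.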
|
07 April 2024, 15:56 | #2 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,054
|
Here are some empirical numbers for a 68040 @ 25 MHz, with all DMA/interrupts off and a PAL screen. Yes, the measurement method (probably both of us agree on this) is not that great, but it has worked for me for smaller code (no mul/div up to now, though) and individual instructions for a long time.
Also, mulu.l is very nearly constant. There was a small deviation (up to 2/3 of a rasterline after 8000 repeats) in some rare-ish cases with some numbers. I guess that's good enough to call it constant.

Test case is a dbf loop executed 1000x, unrolled 8x (REPT 8) within the dbf; start at raster position line=0, slot=0 and record the ending position (line,slot):

empty                                         $02,$80
nop                                           $29,$da
moveq                                         $06,$dd
moveq d0+d1                                   $0b,$da
move.l #x,d0                                  $07,$03
move.l #x,d0 + move.l #y,d1                   $0c,$10
move.l #x,d0 + move.l #y,d1 + mulu.l d0,d1    $75,$8d (lowest $74,$dc)
mulu.l d0,d1                                  $6a,$d3
mulu.l d0,d1 + moveq #0,d2                    $6f,$d6

1 cycle = 1/25 MHz = 40 nanosec. So: first number * 20000/313 + second number * 20000/(313*227) = execution time in microsec. For mulu.l (minus empty) we get $68,$53 => 6669 microsec total, 834 nanosec per instruction, or 21 cycles (1 for ea, 20 for exec, which matches UM pg. 10-25).
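Spelling out the arithmetic for the mulu.l case, to make the conversion explicit: $68,$53 is 104 rasterlines and 83 slots, so 104 * 20000/313 + 83 * 20000/(313*227) ~ 6645.4 + 23.4 ~ 6669 microsec; divided by the 8000 executed instructions that is ~834 nanosec each, and 834/40 ~ 21 cycles.
|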
07 April 2024, 18:15 | #3 |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,831
|
Could you tell us what exactly, to satisfy anyone's curiosity?
|
07 April 2024, 18:21 | #4 |
Registered User
Join Date: May 2023
Location: Norwich
Posts: 426
|
Is that necessarily odd? If they've changed the algorithm used internally, it's plausible that the best case is worse but the worst (and, more importantly, average) case is significantly better. That might be an acceptable trade-off in many cases.
|
07 April 2024, 18:29 | #5 |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,304
|
It's the core of the bitplane-by-bitplane C2P algorithm within P96. This is a tad more complicated than your average C2P because it also needs to implement minterms, masking and clipping, and it needs to be agnostic to the number of bitplanes without wasting too much space. Thus it is short on registers, and I also want a scalable and testable algorithm that supports blank bitplanes, missing bitplanes and all the other fancy stuff graphics has to offer, without replicating code for all possible bitplane counts and arrangements.
It turns out that you can actually replace part of the bit shuffling by a 32x32->32 multiplication, which moves bits from 8 chunky pixels into one planar byte with a single instruction. While the idea is neat for the 68060 and its fast multiplier, the 68040 multiplier seems too slow to make it work there. For the 68030 the situation is different again, because all the other instructions are slow as well, so the multiply version seems to be on par there.
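For the curious, the general shape of the trick for one bitplane looks like this; the mask and magic constant are invented for a four-pixels-per-longword illustration only (byte-sized chunky pixels, bitplane 0), and this is certainly not P96's actual code. Code:

        and.l   #$01010101,d0   ; keep bit 0 of each of the 4 chunky bytes
        mulu.l  #$01020408,d0   ; the multiply sums 4 shifted copies, so
                                ; the 4 bits land collision-free in bits
                                ; 27-24, leftmost pixel in the highest bit
        rol.l   #8,d0           ; rotate them down to bits 3-0
        and.l   #$f,d0          ; one planar nibble

Two such gathers (8 pixels sit in two longwords), one shifted left by 4 and or'ed with the other, yield the planar byte; for the other bitplanes the pixel data is shifted right by the plane number before masking. One mulu.l thus replaces a whole cascade of shift/or steps, which is only a win if the multiplier is fast.
|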
07 April 2024, 23:02 | #6 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
@Thomas Richter
On the 68030, mulu.l dx,dy always takes (or at least seems to take) 44 or 46 cycles (44 when dx is "particular"; EDIT: further tests revealed that "particular" = "odd"). Here are a few results from my 68030: Code:
dx         | dy         | 44 cycles?
-----------+------------+-----------
0          | 1          | yes
1          | 0          |
1          | 1          |
1          | 2          |
2          | 1          | yes
1          | 15         | yes
15         | 1          |
15         | 15         |
15         | 16         |
16         | 15         | yes
15         | $ffff      |
$ffff      | 15         |
16         | $ffff      | yes
$ffff      | 16         |
16         | $10000     | yes
$10000     | 16         | yes
$ffff      | $ffff      |
$ffff      | $10000     |
$10000     | $ffff      | yes
15         | $10101010  |
$10101010  | 15         |
15         | $11111111  |
$11111111  | 15         |
3          | $ffffffff  |
$ffffffff  | 3          |

Code:
.loop   move.l  d1,d0
        mulu.l  d2,d0
        dbf     d7,.loop

The elapsed time is reported in color clocks. Given that, once the code is in the instruction cache, move.l takes 2 cycles and dbf takes 6 cycles, the cycles taken by mulu.l can be calculated with (<color clocks> * <CPU frequency in MHz> / 3.546895) / 65536 - 8.

Side notes:
* the MC68030UM/AD REV 2 says that mulu.l EA,Dn, in the worst case, takes 44 cycles + the FIEA time;
* I have no idea why they chose the FIEA time instead of the FEA time (they do it consistently for all the .l muls and divs); (EDIT: explanation found in section 11.3.4: "Many two-word instructions (e.g., MULU.L, DIV.L, BFSET, etc.) include the fetch immediate effective address (fiea) time or the calculate immediate effective address (ciea) time in the execution time calculation. The timing for immediate data of word length (#<data>.W) is used for these calculations. If the instruction has a source and a destination, the source EA is used for the table lookup. If the instruction is single operand, the effective address of that operand is used.");
* given that the operand is in a data register, the FEA time (0) should apply, in theory;
* if it were so, my timings would be off by 2 cycles; but if I replace mulu.l with add.l, which takes 2 cycles, each iteration takes 10 cycles, which is correct (i.e. the program measures the elapsed time correctly);
* unless other tests prove the contrary, it looks to me like the MC68030UM reports a wrong figure or documents it poorly (the FIEA thing).

Last edited by saimo; 19 April 2024 at 14:04. Reason: Removed attachment, as it is no longer needed. Added snippet from MC68030UM.
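To turn the formula around with a concrete (made-up) figure as a sanity check: a 46-cycle mulu.l means 65536 * (2+46+6) = 3538944 CPU cycles per run, which at 50 MHz corresponds to about 3538944 * 3.546895/50 ~ 251000 color clocks; feeding that back in indeed gives (251000 * 50/3.546895)/65536 - 8 ~ 46.
|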
08 April 2024, 19:17 | #7 |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,304
|
Thanks, but with mulu so slow on the lower processors, I'd rather leave the algorithm as is; there is no benefit then.
|
08 April 2024, 20:46 | #8 | |
Registered User
Join Date: Jan 2012
Location: USA
Posts: 373
|
Quote:
Mulu.l might be a win if you can initiate a cache line fill or chip RAM write just before the mulu.l. It would require a certain amount of software pipelining. For example, let's say you've already got BP data ready from previous operations. Hold it until you're about to do the mulu. Just before the mulu, initiate a write of the BP data to chip RAM and then execute the mulu. The mulu will execute while the bus logic performs the slow write to chip RAM in parallel (assuming the mulu instruction is cached). The same goes for uncached chunky pixel reads: the first read misses, causing a stall and a cache line fill, but the hardware pipeline can continue once the first byte is available and instructions/data are already in the cache/registers; the bus logic updates the data cache in parallel. The optimum might be some mix of mulu and other techniques: mulu when the bus is expected to be busy with cache line fills/chip RAM writes, and other, faster code when your cache/registers already have everything you need. There used to be a guy named Terje Mathisen on usenet's comp.arch who talked about such techniques for x86. I think id Software brought him in at one point to help optimize Quake. The guy is a legend. He probably could have single-handedly saved Commodore and Amiga if he'd been a fan. Alas, he was a PC/Intel x86 guy.
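A hedged sketch of the proposed interleaving, with registers and surrounding code invented purely for illustration. Code:

        move.l  d3,(a1)+        ; post the finished planar longword first;
                                ; the slow chip RAM write drains in parallel
        mulu.l  d6,d0           ; multiply executes while the write completes
        move.l  (a0)+,d2        ; next chunky data; a miss here overlaps
                                ; the cache line fill the same way
|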
|
08 April 2024, 21:35 | #9 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
Last edited by saimo; 10 April 2024 at 19:06. Reason: Self-correction |
|
09 April 2024, 08:23 | #10 |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,304
|
No need to worry about it, really. The instruction execution time is hidden by the chip mem write anyway, which should allocate in the store buffer. I just want to avoid a 68030 or 68040 potentially not exploiting the full bandwidth because too many cycles sit in the critical path. The old algorithm takes fewer cycles on the 68040 and is thus better suited there, whereas with a bit of reshuffling the execution time on the 68060 stayed on the level of the mulu.l-based conversion.
What I found a little odd is that within the old conversion (folding bits on top of each other in a longword with a couple of shifts and or's), which is now active again, a ROR is measurably and reproducibly faster than an LSR, which is counter-intuitive. Both instructions are listed as 1 cycle, pOEP | sOEP, and ROR is one cycle slower on the 68030, so maybe something else is going on here.
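For reference, the classic building block of such shift-based folding is a merge step like the one below, which swaps 4-bit groups between two registers; this is the generic textbook form, not necessarily what P96 actually does. Code:

        move.l  d0,d2
        lsr.l   #4,d2
        eor.l   d1,d2
        and.l   #$0f0f0f0f,d2   ; d2 = where d0's high nibbles differ
                                ; from d1's low nibbles
        eor.l   d2,d1           ; d1's low nibbles become d0's high nibbles
        lsl.l   #4,d2
        eor.l   d2,d0           ; d0's high nibbles become d1's old low nibbles

With the masking in place, an LSL and a ROL by 4 produce the same result here, which is why one can often swap shift flavours freely in such steps and simply pick whichever times better.
|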
09 April 2024, 15:03 | #11 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
When it comes to shifts, timings are more than odd on the 68030:

roxd #x,dy takes only 4 cycles instead of 12 when x=1 (it's as fast as lsd #1,dy).

asl, rod and roxd cannot execute in parallel with writes to RAM. In this situation:

        move.l  dx,(ay)
        <shift>
        move.l  dx,(ay)

the timings of <shift> are as follows:
* if the memory location is in CHIP RAM: 14.5 cycles on average, for any of the mentioned instructions;
* if the memory location is in FAST RAM: asl: 4 cycles; rod #x,dy: 2 cycles; rod dx,dy: 4 cycles; roxd: 8 cycles.

(All the figures I post are relative to the only Amiga I have, i.e. an A1200 with a Blizzard 1230 IV, 68030 @ 50 MHz, 60 ns FAST RAM, and DMA completely off.)
|
09 April 2024, 19:10 | #12 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,189
|
Quote:
|
|
10 April 2024, 15:44 | #13 | |
Registered User
Join Date: Jul 2014
Location: Warsaw/Poland
Posts: 192
|
Quote:
The CPU gets an access to chip RAM every other 3.5 MHz clock, which is about 1.77 million accesses per second. In your case, a 68030 @ 50 MHz, the CPU has to wait roughly 28.5 (50/1.77) cycles for an access to chip RAM. One access covers two instructions because of the 32-bit bus, so that is 28.5 CPU cycles for two instructions, or ~14.5 CPU cycles per instruction.
|
|
10 April 2024, 19:04 | #14 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
If you are referring to the writes, the bus access limitations are not what directly causes those instructions to execute slower than normal. But they provide another indirect way to support the assumption (which I had left implied) that the instructions that cannot execute in parallel with the writes must depend on circuitry shared with, or dependent on, the memory controller. Oh boy, what a convoluted way to say that it's the CPU's fault, not the bus's (the bus just exposes this specific weakness of the CPU). (EDIT: re-reading it, I must conclude that the wording sucks bigtime; what follows clarifies things, though.) (EDIT2: no intention to belittle your observation - your post is more than appropriate and welcome, and you're perfectly right to point out that bus access is a key factor here.)

For the sake of clarity, here are some of the tests and numbers I derived the figures from. I measured a loop of the kind: Code:
.l      move.l  dx,(ay)
        move.l  dx,(ay)
        dbf     dy,.l

It is possible to add instructions between the moves, and between the dbf and the first move, amounting to up to 26 cycles (so, 52 cycles in all) without affecting the overall execution time... except when the instructions cannot execute in parallel. For example: Code:
.l      move.l  dx,(ay)
        rol.l   #1,dz
        move.l  dx,(ay)
        dbf     dy,.l

Given that this thread is about mulu.l, returning to it triggered a distant memory: weren't muls (and divs) also unable to run in parallel? So I tested this: Code:
.l      move.l  dx,(ay)
        mulu.l  dw,dz
        move.l  dx,(ay)
        dbf     dy,.l

Last edited by saimo; 10 April 2024 at 22:26. Reason: EDIT: added a code line that got cut and clarified a few things.
|
|
11 April 2024, 04:23 | #15 | |
Registered User
Join Date: Jan 2012
Location: USA
Posts: 373
|
Quote:
It seems the best that can be expected, in terms of hiding the write wait states of a previous instruction through parallel execution, is the head time of the next instruction. And the head time for MULx.L dx,dy is two. That's it: at most two cycles saved. Ugh. Depressing. Bit field operations on data registers have larger head times, and so do shifts and rotates. Curiously, UNPK dx,dy takes 8 cycles but also has a head time of 8 cycles, making it essentially free after a write to chip RAM.
|
|
11 April 2024, 10:50 | #16 | |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,030
|
Quote:
You added a very short (2-byte) instruction between the two writes to memory. You should try with more instructions between the two writes. If I remember meynaf's info correctly, a single write to chip memory on the 68030 can hide instructions amounting to ~28 cycles in the pipe.
|
|
11 April 2024, 13:13 | #17 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,352
|
From experience I know that a 50 MHz 030 can do one chipmem write every 28 clocks, and this allows hiding up to 24 clocks of simple instructions behind it. Fastmem writes were 8 clocks and could hide 4.
Any memory access, even a read from dcache, will stall. Some instructions will stall too, especially complex ones. IIRC 4 cycles were enough to fully hide a dbf even though it's supposed to be 6. |
11 April 2024, 15:22 | #18 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
What I've been trying to say is that some instructions, for internal reasons, cannot execute (entirely) while the memory controller is busy. mulx, divx, asl, rod and roxd have been mentioned already; others are exg, swap, abcd, sbcd, andi/eori/ori to ccr/sr, and probably more (I didn't test the whole instruction set, and I can't remember all the cases I've stumbled upon over the years - also, having been away from the Amiga from 2003 to 2017 made lots of my memories rust). On the other hand, some instructions can execute entirely in parallel, for example the already-mentioned lsd, adds, subs and many others. As a practical example, these two loops take exactly the same time: Code:
.l      move.l  dx,(az)
        move.l  dx,(az)
        dbf     dy,.l

.l      move.l  dx,(az)
        adda.l  #k,aw
        adda.l  #k,aw
        adda.l  #k,aw
        adda.l  #k,aw
        move.l  dx,(az)
        adda.l  #k,aw
        adda.l  #k,aw
        adda.l  #k,aw
        dbf     dy,.l
|
|
11 April 2024, 15:25 | #19 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
Last edited by saimo; 11 April 2024 at 15:33. |
|
11 April 2024, 15:32 | #20 | |||||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
Quote:
Quote:
Quote:
Quote:
EDIT: I can't achieve that. The best one can do is to exploit the tail of the write. For example, take this piece of code: Code:
.l      move.l  dx,-(az)
        dbf     dy,.l

The same goes with Code:
.l      move.l  dx,(az)
        dbf     dy,.l

It's interesting to note that in these cases dbf cannot exploit all the 4 free cycles (contrary to what happens when writing to CHIP RAM). Hmmm... maybe there are not 4 free cycles after all? More tests are required...

EDIT2: Tests confirmed the oddity of dbcc: it can execute entirely in parallel with writes to CHIP RAM, but when the writes are to FAST RAM the only overlapping is that given by head and tail. The same goes for bra.x and bcc.x. I also checked whether there actually are 4 cycles free after a write to FAST RAM with these tests: Code:
.l      move.b  dx,-(az)        ;4 (0 head, 2 tail)
        move.b  dx,-(az)        ;4 (0 head, 2 tail)
        dbf     dy,.l           ;6 (6 head, 0 tail)

* Measured cycles: 15.15 (theoretical: 4+4+6-2 = 12)

.l      move.b  dx,-(az)        ;4 (0 head, 2 tail)
        add.l   dw,dw           ;2 (2 head, 0 tail)
        add.l   dw,dw           ;2 (2 head, 0 tail)
        move.b  dx,-(az)        ;4 (0 head, 2 tail)
        dbf     dy,.l           ;6 (6 head, 0 tail)

* Measured cycles: 15.15 (so the adds are free)

.l      move.b  dx,-(az)        ;4 (0 head, 2 tail)
        add.l   dw,dw           ;2 (2 head, 0 tail)
        add.l   dw,dw           ;2 (2 head, 0 tail)
        add.l   dw,dw           ;2 (2 head, 0 tail)
        move.b  dx,-(az)        ;4 (0 head, 2 tail)
        dbf     dy,.l           ;6 (6 head, 0 tail)

* Measured cycles: 16.10 (so 2 adds and a half are free)

.l      move.b  dx,(az)         ;4 (0 head, 2 tail)
        move.b  dx,(az)         ;4 (0 head, 2 tail)
        dbf     dy,.l           ;6 (6 head, 0 tail)

* Measured cycles: 14.09 (theoretical: 4+4+6-2 = 12)

.l      move.b  dx,(az)         ;4 (0 head, 2 tail)
        add.l   dw,dw           ;2 (2 head, 0 tail)
        add.l   dw,dw           ;2 (2 head, 0 tail)
        move.b  dx,(az)         ;4 (0 head, 2 tail)
        dbf     dy,.l           ;6 (6 head, 0 tail)

* Measured cycles: 14.09 (so the adds are free)

.l      move.b  dx,(az)         ;4 (0 head, 2 tail)
        add.l   dw,dw           ;2 (2 head, 0 tail)
        add.l   dw,dw           ;2 (2 head, 0 tail)
        add.l   dw,dw           ;2 (2 head, 0 tail)
        move.b  dx,(az)         ;4 (0 head, 2 tail)
        dbf     dy,.l           ;6 (6 head, 0 tail)

* Measured cycles: 16.12 (so only the 2 adds are free)

Timings with (az)+ are the same as with (az).

Last edited by saimo; 11 April 2024 at 17:04.
|
|||||