07 April 2024, 13:55 | #1 |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,304
|
mulu.l cycles on 68030 and 68040
Does anyone have the formula for the number of cycles a 32x32->32 multiplication takes on the 68030 and 68040? I'm particularly interested in the case where the number of 1-bits in the source is 4. For the 68030, the time depends on the number of 1-bits in the multiplicand, which seems plausible. The UM only gives the worst case, and from that one can guess that the multiplier takes 2 cycles per set bit, probably using some microcoded shift+add algorithm. For the 68040, the UM states that the multiplier takes constant time, 16 or 20 cycles, depending on the 16 or 32 bit case. That sounds odd, because the best time on the 68040 would then be worse than the best time on the 68030, so I guess what is stated in the UM is only the worst case, but I do not know. I currently do not have a 68040 at hand, so if anyone could measure this, I would appreciate it.

PS: Yes, that's really one of the rare cases where a potential multiplication sits deep in the inner loop of a fairly important assembler algorithm, so it does matter this time; otherwise I wouldn't ask.
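For illustration, the kind of shift+add loop such a microcoded multiplier is usually modelled on would look like the sketch below. This is purely an illustration of the suspected scheme; the register use, labels and early-out are my invention, not from any Motorola documentation. Code:

        moveq   #0,d2           ; accumulator
.next   lsr.l   #1,d1           ; shift the next multiplier bit into C
        bcc.s   .skip
        add.l   d0,d2           ; bit was set: add the shifted multiplicand
.skip   add.l   d0,d0           ; multiplicand <<= 1
        tst.l   d1
        bne.s   .next           ; stop once no 1-bits remain

If each executed add costs extra cycles, the total time tracks the number of set bits, which would match the 68030 behaviour described above.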
|
07 April 2024, 15:56 | #2 |
Registered User
Join Date: Jun 2016
Location: europe
Posts: 1,054
|
Here are some empirical numbers for a 68040 @ 25 MHz, with all DMA/interrupts off and a PAL screen. Yes, the measurement method (probably both of us agree on this) is not that great, but it has worked for me for smaller code (no mul/div up to now, though) and individual instructions for a long time.
Also, mulu.l is very nearly constant. There was a small deviation (up to 2/3 of a rasterline after 8000 repeats) in some rare-ish cases with some numbers. I guess that's good enough to call it constant.

Test case is a dbf loop executed 1000x, unrolled 8x (REPT 8) within the dbf; start at raster position line=0, slot=0 and record the ending position (line,slot):

empty                                         $02,$80
nop                                           $29,$da
moveq                                         $06,$dd
moveq d0+d1                                   $0b,$da
move.l #x,d0                                  $07,$03
move.l #x,d0 + move.l #y,d1                   $0c,$10
move.l #x,d0 + move.l #y,d1 + mulu.l d0,d1    $75,$8d (lowest $74,$dc)
mulu.l d0,d1                                  $6a,$d3
mulu.l d0,d1 + moveq #0,d2                    $6f,$d6

1 cycle = 1/25 MHz = 40 nanosec. So: first number * 20000/313 + second number * 20000/(313*227) = execution time in microsec. For mulu.l (minus empty) we get $68,$53 => 6669 microsec total, 834 nanosec per instruction, or 21 cycles (1 for ea, 20 for exec, which matches UM pg. 10-25).
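Spelling out the arithmetic for the mulu.l case, to make the conversion explicit: $68,$53 is 104 rasterlines and 83 slots, so 104 * 20000/313 + 83 * 20000/(313*227) ~ 6645.4 + 23.4 ~ 6669 microsec; divided by the 8000 executed instructions that is ~834 nanosec each, and 834/40 ~ 21 cycles.
|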
07 April 2024, 18:15 | #3 |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,831
|
Could you tell us what exactly, to satisfy anyone's curiosity?
|
07 April 2024, 18:21 | #4 |
Registered User
Join Date: May 2023
Location: Norwich
Posts: 426
|
Is that necessarily odd? If they've changed the algorithm used internally, it's plausible that the best case is worse but the worst (and, more importantly, average) case is significantly better. That might be an acceptable trade-off in many cases.
|
07 April 2024, 18:29 | #5 |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,304
|
It's the core of the bitplane-by-bitplane C2P algorithm within P96. This is a tad more complicated than your average C2P because it also needs to implement minterms, masking and clipping, and it needs to be agnostic to the number of bitplanes without wasting too much space. Thus it is short on registers, and I also want a scalable and testable algorithm that supports blank bitplanes, missing bitplanes and all the other fancy stuff graphics has to offer, without replicating code for all possible bitplane counts and arrangements.
It turns out that you can actually replace part of the bit shuffling by a 32x32->32 multiplication, which moves bits from 8 chunky pixels into one planar byte with a single instruction. While the idea is neat for the 68060 and its fast multiplier, the 68040 multiplier seems too slow to make it work there. For the 68030 the situation is different again, because all the other instructions are slow as well, so the multiply version seems to be on par there.
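For the curious, the general shape of the trick for one bitplane looks like this; the mask and magic constant are invented for a four-pixels-per-longword illustration only (byte-sized chunky pixels, bitplane 0), and this is certainly not P96's actual code. Code:

        and.l   #$01010101,d0   ; keep bit 0 of each of the 4 chunky bytes
        mulu.l  #$01020408,d0   ; the multiply sums 4 shifted copies, so
                                ; the 4 bits land collision-free in bits
                                ; 27-24, leftmost pixel in the highest bit
        rol.l   #8,d0           ; rotate them down to bits 3-0
        and.l   #$f,d0          ; one planar nibble

Two such gathers (8 pixels sit in two longwords), one shifted left by 4 and or'ed with the other, yield the planar byte; for the other bitplanes the pixel data is shifted right by the plane number before masking. One mulu.l thus replaces a whole cascade of shift/or steps, which is only a win if the multiplier is fast.
|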
07 April 2024, 23:02 | #6 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
@Thomas Richter
On the 68030, mulu.l dx,dy always takes (or at least seems to take) 44 or 46 cycles (44 when dx is "particular"; EDIT: further tests revealed that "particular" = "odd"). Here are a few results from my 68030: Code:
dx         | dy         | 44 cycles?
-----------+------------+-----------
0          | 1          | yes
1          | 0          |
1          | 1          |
1          | 2          |
2          | 1          | yes
1          | 15         | yes
15         | 1          |
15         | 15         |
15         | 16         |
16         | 15         | yes
15         | $ffff      |
$ffff      | 15         |
16         | $ffff      | yes
$ffff      | 16         |
16         | $10000     | yes
$10000     | 16         | yes
$ffff      | $ffff      |
$ffff      | $10000     |
$10000     | $ffff      | yes
15         | $10101010  |
$10101010  | 15         |
15         | $11111111  |
$11111111  | 15         |
3          | $ffffffff  |
$ffffffff  | 3          |

Code:
.loop   move.l  d1,d0
        mulu.l  d2,d0
        dbf     d7,.loop

The elapsed time is reported in color clocks. Given that, once the code is in the instruction cache, move.l takes 2 cycles and dbf takes 6 cycles, the cycles taken by mulu.l can be calculated with (<color clocks> * <CPU frequency in MHz> / 3.546895) / 65536 - 8.

Side notes:
* the MC68030UM/AD REV 2 says that mulu.l EA,Dn, in the worst case, takes 44 cycles + the FIEA time;
* I have no idea why they chose the FIEA time instead of the FEA time (they do it consistently for all the .l muls and divs); (EDIT: explanation found in section 11.3.4: "Many two-word instructions (e.g., MULU.L, DIV.L, BFSET, etc.) include the fetch immediate effective address (fiea) time or the calculate immediate effective address (ciea) time in the execution time calculation. The timing for immediate data of word length (#<data>.W) is used for these calculations. If the instruction has a source and a destination, the source EA is used for the table lookup. If the instruction is single operand, the effective address of that operand is used.");
* given that the operand is in a data register, the FEA time (0) should apply, in theory;
* if it were so, my timings would be off by 2 cycles; but if I replace mulu.l with add.l, which takes 2 cycles, each iteration takes 10 cycles, which is correct (i.e. the program measures the elapsed time correctly);
* unless other tests prove the contrary, it looks to me like the MC68030UM reports a wrong figure or documents it poorly (the FIEA thing).

Last edited by saimo; 19 April 2024 at 14:04. Reason: Removed attachment, as it is no longer needed. Added snippet from MC68030UM.
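To turn the formula around with a concrete (made-up) figure as a sanity check: a 46-cycle mulu.l means 65536 * (2+46+6) = 3538944 CPU cycles per run, which at 50 MHz corresponds to about 3538944 * 3.546895/50 ~ 251000 color clocks; feeding that back in indeed gives (251000 * 50/3.546895)/65536 - 8 ~ 46.
|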
08 April 2024, 19:17 | #7 |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,304
|
Thanks, but with mulu so slow on the lower processors, I'd rather leave the algorithm as is; there is no benefit then.
|
08 April 2024, 20:46 | #8 | |
Registered User
Join Date: Jan 2012
Location: USA
Posts: 373
|
Quote:
Mulu.l might be a win if you can initiate a cache line fill or chip RAM write just before the mulu.l. It would require a certain amount of software pipelining. For example, let's say you've already got BP data ready from previous operations. Hold it until you're about to do the mulu. Just before the mulu, initiate a write of the BP data to chip RAM and then execute the mulu. The mulu will execute while the bus logic performs the slow write to chip RAM in parallel (assuming the mulu instruction is cached). The same goes for uncached chunky pixel reads: the first read misses, causing a stall and a cache line fill, but the hardware pipeline can continue once the first byte is available and instructions/data are already in the cache/registers; the bus logic updates the data cache in parallel. The optimum might be some mix of mulu and other techniques: mulu when the bus is expected to be busy with cache line fills/chip RAM writes, and other, faster code when your cache/registers already have everything you need. There used to be a guy named Terje Mathisen on usenet's comp.arch who talked about such techniques for x86. I think id Software brought him in at one point to help optimize Quake. The guy is a legend. He probably could have single-handedly saved Commodore and Amiga if he'd been a fan. Alas, he was a PC/Intel x86 guy.
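A hedged sketch of the proposed interleaving, with registers and surrounding code invented purely for illustration. Code:

        move.l  d3,(a1)+        ; post the finished planar longword first;
                                ; the slow chip RAM write drains in parallel
        mulu.l  d6,d0           ; multiply executes while the write completes
        move.l  (a0)+,d2        ; next chunky data; a miss here overlaps
                                ; the cache line fill the same way
|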
|
08 April 2024, 21:35 | #9 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
Last edited by saimo; 10 April 2024 at 19:06. Reason: Self-correction |
|
09 April 2024, 08:23 | #10 |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,304
|
No need to worry about it, really. The instruction execution time is hidden by the chip mem write anyway, which should allocate in the store buffer. I just want to avoid a 68030 or 68040 potentially not exploiting the full bandwidth because too many cycles sit in the critical path. The old algorithm takes fewer cycles on the 68040 and is thus better suited there, whereas with a bit of reshuffling the execution time on the 68060 stayed on the level of the mulu.l-based conversion.
What I found a little odd is that within the old conversion (folding bits on top of each other in a longword with a couple of shifts and or's), which is now active again, a ROR is measurably and reproducibly faster than an LSR, which is counter-intuitive. Both instructions are listed as 1 cycle, pOEP | sOEP, and ROR is one cycle slower on the 68030, so maybe something else is going on here.
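For reference, the classic building block of such shift-based folding is a merge step like the one below, which swaps 4-bit groups between two registers; this is the generic textbook form, not necessarily what P96 actually does. Code:

        move.l  d0,d2
        lsr.l   #4,d2
        eor.l   d1,d2
        and.l   #$0f0f0f0f,d2   ; d2 = where d0's high nibbles differ
                                ; from d1's low nibbles
        eor.l   d2,d1           ; d1's low nibbles become d0's high nibbles
        lsl.l   #4,d2
        eor.l   d2,d0           ; d0's high nibbles become d1's old low nibbles

With the masking in place, an LSL and a ROL by 4 produce the same result here, which is why one can often swap shift flavours freely in such steps and simply pick whichever times better.
|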
09 April 2024, 15:03 | #11 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
When it comes to shifts, timings are more than odd on the 68030:

roxd #x,dy takes only 4 cycles instead of 12 when x=1 (it's as fast as lsd #1,dy).

asl, rod and roxd cannot execute in parallel with writes to RAM. In this situation:

        move.l  dx,(ay)
        <shift>
        move.l  dx,(ay)

the timings of <shift> are as follows:
* if the memory location is in CHIP RAM: 14.5 cycles on average, for any of the mentioned instructions;
* if the memory location is in FAST RAM: asl: 4 cycles; rod #x,dy: 2 cycles; rod dx,dy: 4 cycles; roxd: 8 cycles.

(All the figures I post are relative to the only Amiga I have, i.e. an A1200 with a Blizzard 1230 IV, 68030 @ 50 MHz, 60 ns FAST RAM, and DMA completely off.)
|
09 April 2024, 19:10 | #12 | |
Registered User
Join Date: Feb 2017
Location: Denmark
Posts: 1,189
|
Quote:
|
|
10 April 2024, 15:44 | #13 | |
Registered User
Join Date: Jul 2014
Location: Warsaw/Poland
Posts: 192
|
Quote:
The CPU gets an access to chip RAM every other 3.5 MHz clock, which is about 1.77 million accesses per second. In your case, a 68030 @ 50 MHz, the CPU has to wait roughly 28.5 (50/1.77) cycles for an access to chip RAM. One access covers two instructions because of the 32-bit bus, so that is 28.5 CPU cycles for two instructions, or ~14.5 CPU cycles per instruction.
|
|
10 April 2024, 19:04 | #14 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
If you are referring to the writes, the bus access limitations are not what directly causes those instructions to execute slower than normal. But they provide another indirect way to support the assumption (which I had left implied) that the instructions that cannot execute in parallel with the writes must depend on circuitry shared with, or dependent on, the memory controller. Oh boy, what a convoluted way to say that it's the CPU's fault, not the bus's (the bus just exposes this specific weakness of the CPU). (EDIT: re-reading it, I must conclude that the wording sucks bigtime; what follows clarifies things, though.) (EDIT2: no intention to belittle your observation - your post is more than appropriate and welcome, and you're perfectly right to point out that bus access is a key factor here.)

For the sake of clarity, here are some of the tests and numbers I derived the figures from. I measured a loop of the kind: Code:
.l      move.l  dx,(ay)
        move.l  dx,(ay)
        dbf     dy,.l

It is possible to add instructions between the moves, and between the dbf and the first move, amounting to up to 26 cycles (so, 52 cycles in all) without affecting the overall execution time... except when the instructions cannot execute in parallel. For example: Code:
.l      move.l  dx,(ay)
        rol.l   #1,dz
        move.l  dx,(ay)
        dbf     dy,.l

Given that this thread is about mulu.l, returning to it triggered a distant memory: weren't muls (and divs) also unable to run in parallel? So I tested this: Code:
.l      move.l  dx,(ay)
        mulu.l  dw,dz
        move.l  dx,(ay)
        dbf     dy,.l

Last edited by saimo; 10 April 2024 at 22:26. Reason: EDIT: added a code line that got cut and clarified a few things.
|
|
11 April 2024, 04:23 | #15 | |
Registered User
Join Date: Jan 2012
Location: USA
Posts: 373
|
Quote:
It seems the best that can be expected, in terms of hiding the write wait states of a previous instruction through parallel execution, is the head time of the next instruction. And the head time for MULx.L dx,dy is two. That's it: at most two cycles saved. Ugh. Depressing. Bit field operations on data registers have larger head times, and so do shifts and rotates. Curiously, UNPK dx,dy takes 8 cycles but also has a head time of 8 cycles, making it essentially free after a write to chip RAM.
|
|
11 April 2024, 10:50 | #16 | |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,030
|
Quote:
You added a very short (2-byte) instruction between the two writes to memory. You should try with more instructions between the two writes. If I remember meynaf's info correctly, a single write to chip memory on the 68030 can hide instructions amounting to ~28 cycles in the pipe.
|
|
11 April 2024, 13:13 | #17 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,352
|
From experience I know that a 50 MHz 030 can do one chipmem write every 28 clocks, and this allows hiding up to 24 clocks of simple instructions behind it. Fastmem writes were 8 clocks and could hide 4.
Any memory access, even a read from dcache, will stall. Some instructions will stall too, especially complex ones. IIRC 4 cycles were enough to fully hide a dbf even though it's supposed to be 6. |
11 April 2024, 15:22 | #18 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
What I've been trying to say is that some instructions, for internal reasons, cannot execute (entirely) while the memory controller is busy. mulx, divx, asl, rod and roxd have been mentioned already; others are exg, swap, abcd, sbcd, andi/eori/ori to ccr/sr, and probably more (I didn't test the whole instruction set, and I can't remember all the cases I've stumbled upon over the years - also, having been away from the Amiga from 2003 to 2017 made lots of my memories rust). On the other hand, some instructions can execute entirely in parallel, for example the already-mentioned lsd, adds, subs and many others. As a practical example, these two loops take exactly the same time: Code:
.l      move.l  dx,(az)
        move.l  dx,(az)
        dbf     dy,.l

.l      move.l  dx,(az)
        adda.l  #k,aw
        adda.l  #k,aw
        adda.l  #k,aw
        adda.l  #k,aw
        move.l  dx,(az)
        adda.l  #k,aw
        adda.l  #k,aw
        adda.l  #k,aw
        dbf     dy,.l
|
|
11 April 2024, 15:25 | #19 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
Last edited by saimo; 11 April 2024 at 15:33. |
|
11 April 2024, 15:32 | #20 | |||||
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
Quote:
Quote:
Quote:
Quote:
EDIT: I can't achieve that. The best one can do is to exploit the tail of the write. For example, take this piece of code: Code:
.l      move.l  dx,-(az)
        dbf     dy,.l

The same goes with Code:
.l      move.l  dx,(az)
        dbf     dy,.l

It's interesting to note that in these cases dbf cannot exploit all the 4 free cycles (contrary to what happens when writing to CHIP RAM). Hmmm... maybe there are not 4 free cycles after all? More tests are required...

EDIT2: Tests confirmed the oddity of dbcc: it can execute entirely in parallel with writes to CHIP RAM, but when the writes are to FAST RAM the only overlapping is that given by head and tail. The same goes for bra.x and bcc.x. I also checked whether there actually are 4 cycles free after a write to FAST RAM with these tests: Code:
.l      move.b  dx,-(az)        ;4 (0 head, 2 tail)
        move.b  dx,-(az)        ;4 (0 head, 2 tail)
        dbf     dy,.l           ;6 (6 head, 0 tail)

* Measured cycles: 15.15 (theoretical: 4+4+6-2 = 12)

.l      move.b  dx,-(az)        ;4 (0 head, 2 tail)
        add.l   dw,dw           ;2 (2 head, 0 tail)
        add.l   dw,dw           ;2 (2 head, 0 tail)
        move.b  dx,-(az)        ;4 (0 head, 2 tail)
        dbf     dy,.l           ;6 (6 head, 0 tail)

* Measured cycles: 15.15 (so the adds are free)

.l      move.b  dx,-(az)        ;4 (0 head, 2 tail)
        add.l   dw,dw           ;2 (2 head, 0 tail)
        add.l   dw,dw           ;2 (2 head, 0 tail)
        add.l   dw,dw           ;2 (2 head, 0 tail)
        move.b  dx,-(az)        ;4 (0 head, 2 tail)
        dbf     dy,.l           ;6 (6 head, 0 tail)

* Measured cycles: 16.10 (so 2 adds and a half are free)

.l      move.b  dx,(az)         ;4 (0 head, 2 tail)
        move.b  dx,(az)         ;4 (0 head, 2 tail)
        dbf     dy,.l           ;6 (6 head, 0 tail)

* Measured cycles: 14.09 (theoretical: 4+4+6-2 = 12)

.l      move.b  dx,(az)         ;4 (0 head, 2 tail)
        add.l   dw,dw           ;2 (2 head, 0 tail)
        add.l   dw,dw           ;2 (2 head, 0 tail)
        move.b  dx,(az)         ;4 (0 head, 2 tail)
        dbf     dy,.l           ;6 (6 head, 0 tail)

* Measured cycles: 14.09 (so the adds are free)

.l      move.b  dx,(az)         ;4 (0 head, 2 tail)
        add.l   dw,dw           ;2 (2 head, 0 tail)
        add.l   dw,dw           ;2 (2 head, 0 tail)
        add.l   dw,dw           ;2 (2 head, 0 tail)
        move.b  dx,(az)         ;4 (0 head, 2 tail)
        dbf     dy,.l           ;6 (6 head, 0 tail)

* Measured cycles: 16.12 (so only the 2 adds are free)

Timings with (az)+ are the same as with (az).

Last edited by saimo; 11 April 2024 at 17:04.
|
|||||