17 April 2024, 15:58 | #61 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
17 April 2024, 18:22 | #62 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
To understand what happens in the cases where the average execution time is not round, I resorted to a properly-drawn diagram (as I had anticipated) and a piece of code that uses only instructions whose timings are certain and that has a non-round average execution time. I went for two moves to CHIP RAM with as many adds stuffed in between:
Code:
.l      move.l  d0,(a0)
        rept    13
        add.l   d1,d1
        endr
        move.l  d0,(a0)
        rept    10
        add.l   d1,d1
        endr
        dbf     d7,.l
Code:
.l      move.l  d0,(a0)
        move.l  d0,(a0)
        dbf     d7,.l
Code:
.l      move.l  d0,(a0)
        move.l  d0,(a0)
        dbf     d7,.l

Note: buffer in CHIP RAM
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 3711908.584
CPU cycles / per loop : 56.639
color clocks          : 263315
rasterlines           : 1159.977
frames                : 3.705
µs                    : 74238.171

.l      move.l  d0,(a0)
        rept    13
        add.l   d1,d1
        endr
        move.l  d0,(a0)
        rept    10
        add.l   d1,d1
        endr
        dbf     d7,.l

Note: buffer in CHIP RAM
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 3711964.972
CPU cycles / per loop : 56.640
color clocks          : 263319
rasterlines           : 1159.995
frames                : 3.706
µs                    : 74239.299

.l      move.l  d0,(a0)
        rept    13
        add.l   d1,d1
        endr
        move.l  d0,(a0)
        rept    11
        add.l   d1,d1
        endr
        dbf     d7,.l

Note: buffer in CHIP RAM
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 4660400.152
CPU cycles / per loop : 71.112
color clocks          : 330599
rasterlines           : 1456.383
frames                : 4.652
µs                    : 93208.003
Code:
.l      move.l  d0,(a0)
        rept    13
        add.l   d1,d1
        endr
        move.l  d0,(a0)
        rept    9
        add.l   d1,d1
        endr
        subq.l  #1,d7
        bne.b   .l

Note: buffer in CHIP RAM
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 50000000
leaked bytes          : 0
CPU cycles / total    : 2831842625.733
CPU cycles / per loop : 56.636
color clocks          : 200884969
rasterlines           : 884955.810
frames                : 2827.334
µs                    : 56636852.514

Then, I drew a precise diagram based on the duration of a color clock expressed in CPU cycles: 50/3.546895 = 14.09683681078803, which, for practical reasons, can be reasonably rounded to 14.1.

EDIT 3: the original picture was flawed, as in various points I had aligned the CPU instructions to the color clocks (don't ask me why: I don't know myself)! Even worse, based on that, I even wrote some nonsense - shame on me! Luckily, the conclusion was still valid, as it was based on hard numbers and reasoning.

EDIT 5: replaced the picture again with a more detailed one which shows the cycles of the moves and their sub-cycle states as per MC68030UM 7.3.2, to help see how the CPU talks with Alice (in particular, refer to figure 7-25).

The picture shows that the loop takes either 56 or 57 cycles, always fitting in 4 color clocks. It also shows that the loops repeat with a [57, 56, 56, 57, 56] cycle pattern (going from top-left to bottom-right, the first and the last loops are identical).
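As a quick sanity check of the figures above (plain Python arithmetic on the numbers quoted in the post, nothing measured here):

```python
# Color clock duration in 68030 cycles, from the two clock frequencies
# quoted above (50 MHz CPU, 3.546895 MHz PAL color clock).
CPU_MHZ = 50.0
CCLK_MHZ = 3.546895

cycles_per_cclk = CPU_MHZ / CCLK_MHZ       # ~14.0968, rounded to 14.1 in the text

# Per-loop cycle counts read off the diagram: they average to exactly
# 4 color clocks per loop.
pattern = [57, 56, 56, 57, 56]
avg = sum(pattern) / len(pattern)          # 56.4 cycles per loop on average

print(round(cycles_per_cclk, 4))           # 14.0968
print(avg, round(4 * cycles_per_cclk, 3))  # ~56.4 vs ~56.387
```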
Note: actually, due to the drift caused by a color clock being a little less than 14.1 CPU cycles, periodically - but very rarely - there will be a "jump", possibly of 1 color clock. (EDIT 4: OK, I just had to calculate this, too. The difference is 14.1-14.09683681078803 = 0.00316318921197; this means that every 1/0.00316318921197 = 316.1366371053127 color clocks, from the CPU's point of view, there is a "short" color clock that lasts just 14 cycles. This might have no impact but, after a while, the growing drift will cause the CPU to miss a slot or eliminate a 1-cycle wait state; maybe this is what explains the difference between the calculated and measured times reported below.)

Since the loop basically fits in 4 color clocks, all loops should take 65536*4 = 262144 color clocks, not 263319. Something is missing, something that, every now and then, causes the CPU to miss a CHIP bus slot... when I said this to myself, it dawned on me: memory refresh cycles!

Let's assume that the loop begins exactly at color clock #0: that's a memory refresh slot, so execution actually starts only at slot 1. Given that the CPU is granted only every other slot, execution uses slots 1, 3, ..., 225 and then would go back to slot #0 of the next rasterline - which, of course, it cannot use. So, basically, the calculation must add an extra color clock for every rasterline. The test report shows that execution lasts 1159.977 (= 263315/227) rasterlines: therefore, the color clocks should be 262144+1159 = 263303 in total. What were the measured color clocks, again? 263319 - that's 16 color clocks more - a 16*100/263303 = 0.0061% difference from the theoretical value. I'm pretty sure that the other odd figures returned by the tests can be demonstrated in a similar way. But time is limited and I have better things to do.
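The drift and the refresh-slot correction can be re-derived numerically; a sketch in Python, assuming only the 227-color-clock PAL rasterline (the measured totals are copied from the test reports above):

```python
# Drift: from the CPU's point of view, how often a "short" (14-cycle)
# color clock shows up when assuming 14.1 cycles per color clock.
cycles_per_cclk = 50.0 / 3.546895
drift = 14.1 - cycles_per_cclk
print(round(1 / drift, 1))              # ~316.1 color clocks between "short" ones

# Refresh correction: one extra color clock per rasterline.
CCLKS_PER_LINE = 227                    # PAL rasterline length in color clocks
measured = 263319                       # rept 13 / rept 10 test above
ideal = 65536 * 4                       # 4 color clocks per loop -> 262144
lines = measured // CCLKS_PER_LINE      # full rasterlines spanned
predicted = ideal + lines               # add one collision per rasterline
print(predicted, measured - predicted)  # a residue of a few color clocks remains
```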
Plus, this additional research proved (again) that the test tool provides reliable figures, so there's no further work to do on it (well, since I started sharing it here, I've been adding features to make it more suitable for the public, and I realized that maybe others could find it useful: maybe I could do a little more work on it and release it publicly).

EDIT: originally I had started drawing the diagram for the mulu test code; since mulu's behaviour is not certain (given that one test shows that it's possible to execute 2 adds after it for free, it seems that more than its head can run in parallel with writes), I soon realized that I'd better remove the unknown and then return to it later. But once I reached the conclusion illustrated above, I realized the exercise was not worth the effort, as the actual results and the explanation found are sufficient for me.

EDIT 2: attached updated test tool archive; the tool now also shows the supervisor state registers and provides blobs with a simple mechanism to dump data to disk.

@Thomas Richter
Sorry for the huge OT but at least, as far as multiplications go, it unveiled that odd left operands cause the 68030 to take 2 more cycles when multiplying and that multiplications won't run in parallel with memory writes.

Last edited by saimo; 19 April 2024 at 19:49. Reason: Replaced picture (was flawed); 1160 -> 1159; added paragraph at the end; removed attachment, as I provide a newer one later.
18 April 2024, 10:51 | #63 |
Registered User
Join Date: Oct 2023
Location: London, UK
Posts: 124
|
This is a fascinating read, thanks for your investigations and for posting the methodology in detail. It might also help with some minor C2P optimizations in my case, I'm guessing.
I was also interested in reading the earlier results with FastRAM. I'd read about this elsewhere, but never seen it analysed in quite this detail! |
18 April 2024, 17:01 | #64 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
Despite the good intentions I professed in my previous post, the mystery of move-mulu-move taking 6 color clocks instead of the theoretical 5 kept on bugging me, so eventually I ended up doing a little research on that as well. I'm halfway through it and the surprise I've stumbled upon is quite, well, surprising. Along the way, I've touched up my little tool (I'll post it later together with what I found) - maybe it might help you tune your code. |
19 April 2024, 19:48 | #65 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Eventually I couldn't resist and dove into the mulu oddity...
As mentioned earlier, tests showed that this code takes 6 color clocks per loop instead of the theoretical 5:
Code:
.l      move.l  d0,(a0)
        mulu.w  d1,d2   ;d1 even
        move.l  d0,(a0)
        dbf     d7,.l
Code:
.l      move.l  d0,(a0) ;CHIP RAM
        mulu.w  d1,d2   ;d1 even
        move.l  d0,(a0) ;CHIP RAM
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 5567743.054
CPU cycles / per loop : 84.957
color clocks          : 394964
rasterlines           : 1739.929
frames                : 5.558
µs                    : 111354.861

.l      move.l  d0,(a0) ;CHIP RAM
        mulu.w  d1,d2   ;d1 even
        move.l  d0,(a0) ;CHIP RAM
        subq.l  #1,d7
        bne.b   .l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 50000000
leaked bytes          : 0
CPU cycles / total    : 4247763804.679
CPU cycles / per loop : 84.955
color clocks          : 301327444
rasterlines           : 1327433.674
frames                : 4241.002
µs                    : 84955276.093

Theoretically, the timings should be these:
Code:
.l      move.l  d0,(a0) ;stalls until next color clock; 1 thru 2 color clocks -> 14.1 thru 28.2 = 15 thru 29 cycles
        mulu.w  d1,d2   ;26 cycles
        move.l  d0,(a0) ;stalls until next color clock; 1 thru 2 color clocks -> 14.1 thru 28.2 = 15 thru 29 cycles
        dbf     d7,.l   ;6 cycles, in parallel with move

* 15+26+15 = 56 cycles (= 56*3.546895/50 = 3.9725224 color clocks) if no stall happens (an impossible condition, due to the unrelated clock frequencies);
* 29+26+29 = 84 cycles (= 84*3.546895/50 = 5.9587836 color clocks) if 2 full-color-clock stalls happen.

However, a stall doesn't necessarily have to last a whole color clock: it could be even as short as 1 CPU cycle; so, why can't something in between the two extremes happen? To try to answer the question, I made some tests that measured the wasted cycles. The first test aimed to find how many dummy adds can be executed before mulu without affecting the execution time. The result was 14:
Code:
.l      move.l  d0,(a0) ;CHIP RAM
        add.l   d3,d3...        ;14x
        mulu.w  d1,d2   ;d1 even
        move.l  d0,(a0) ;CHIP RAM
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 5567799.441
CPU cycles / per loop : 84.957
color clocks          : 394968
rasterlines           : 1739.947
frames                : 5.558
µs                    : 111355.988

.l      move.l  d0,(a0) ;CHIP RAM
        add.l   d3,d3...        ;15x
        mulu.w  d1,d2   ;d1 even
        move.l  d0,(a0) ;CHIP RAM
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 6486151.972
CPU cycles / per loop : 98.970
color clocks          : 460114
rasterlines           : 2026.933
frames                : 6.475
µs                    : 129723.039

The second test did the same with the dummy adds placed after mulu; there, the limit turned out to be 2:
Code:
.l      move.l  d0,(a0) ;CHIP RAM
        mulu.w  d1,d2   ;d1 even
        add.l   d3,d3...        ;2x
        move.l  d0,(a0) ;CHIP RAM
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 5567813.538
CPU cycles / per loop : 84.958
color clocks          : 394969
rasterlines           : 1739.951
frames                : 5.558
µs                    : 111356.270

.l      move.l  d0,(a0) ;CHIP RAM
        mulu.w  d1,d2   ;d1 even
        add.l   d3,d3...        ;3x
        move.l  d0,(a0) ;CHIP RAM
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 6499628.548
CPU cycles / per loop : 99.176
color clocks          : 461070
rasterlines           : 2031.145
frames                : 6.489
µs                    : 129992.570

To double-check whether such conclusions were correct, I wrote a test that inserts 12 adds before mulu, 2 adds after it and also 9 adds, 1 subq and 1 bne after the second write, to exploit every available cycle. Making it loop 200000 times confirmed what was found so far:
Code:
.l      move.l  d0,(a0) ;CHIP RAM
        add.l   d3,d3...        ;12x
        mulu.w  d1,d2   ;d1 even
        add.l   d3,d3...        ;2x
        move.l  d0,(a0) ;CHIP RAM
        add.l   d3,d3...        ;9x
        subq.l  #1,d7
        bne.b   .l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 200000
leaked bytes          : 0
CPU cycles / total    : 16991255.732
CPU cycles / per loop : 84.956
color clocks          : 1205324
rasterlines           : 5309.797
frames                : 16.964
µs                    : 339825.114

Visually, we get confirmation that the loop takes 84 or 85 cycles, and it is very clear that mulu can't start earlier than 12 cycles after the tail of the write has started. Why is that? I don't know, but it will be interesting to check this behaviour using other instructions that can't execute in parallel with RAM writes (I've started looking into this as well, but I'll leave it for a future post; curiosity, what a terrible beast!).

Returning to the test results, we notice that 200000 loops required 1205324 color clocks; without memory refresh slot collisions, they would be 200000*6 = 1200000, so the difference from the measured color clocks is 1205324-1200000 = 5324. Unlike the case discussed in post #62, here there is not an access every other color clock (A-A-A-...); the pattern is A---A-...: so, how many collisions will happen? Let's assume that the loop begins at slot #0: the initial collision will make the loop use the slots 1 (6*0+1), 5 (6*0+5), 7 (6*1+1), 11 (6*1+5), ... 223 (37*6+1), 227 (37*6+5) = 0 - collision again! The same will repeat on the next rasterline.
And the same would happen even if execution started at any other slot, as there are 4 memory refresh slots (0, 2, 4, 6), one of which will surely be stumbled upon, because the pattern repeats every 6 accesses (<color slot> mod 6 falls in [0, 5]). Thus, also in this case, the number of collisions will be equal to the number of rasterlines. Therefore, the theoretical number of color clocks is 1200000+5309 = 1205309, which is just 1205324-1205309 = 15 color clocks less than the measured ones (a 15*100/1205309 = 0.0012444941504627% error). My current guess is that this (insignificant) difference is due to the drift caused by the fact that 1 color clock is slightly less than 14.1 cycles (see post #62 again, as I have added a few considerations about this - plus, I have also updated the diagram) or, more generally, to the misalignment of the clock frequencies.

In conclusion, the measured performance is verified (and the reliability of the tool is demonstrated again).

Attached is the archive that contains a slightly updated tool plus the sources and binaries of the tests mentioned here.

Last edited by saimo; 20 April 2024 at 16:54. Reason: Fixed English in a couple of places. Removed attachment as I provided a newer one later.
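The collision count for the 6-color-clock loop above can be checked the same way (Python arithmetic on the reported figures; the slot positions follow the A---A- pattern described in the post):

```python
CCLKS_PER_LINE = 227                    # PAL rasterline length in color clocks

# Accesses land on slots 6k+1 and 6k+5; the last access of a line,
# slot 37*6+5 = 227, wraps onto slot 0 of the next line - a refresh
# slot - so exactly one collision happens per rasterline.
assert (37 * 6 + 5) % CCLKS_PER_LINE == 0

loops = 200000
measured = 1205324                      # color clocks reported by the test
ideal = loops * 6                       # 1200000: 6 color clocks per loop
lines = measured // CCLKS_PER_LINE      # full rasterlines spanned
predicted = ideal + lines               # one lost slot per rasterline
print(predicted, measured - predicted)  # 15-color-clock residue, as in the post
```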
20 April 2024, 00:12 | #66 |
Registered User
Join Date: May 2013
Location: Grimstad / Norway
Posts: 852
|
Stupid Q: Does movem behave the same as move?
20 April 2024, 13:57 | #67 |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,310
|
The thread slightly derailed from the original question, which is no longer relevant for me, as I solved the issue in a different way: mulu is simply too complex on the 68030 to provide competitive performance. Maybe the algorithm itself is interesting anyhow. It takes four 32-bit values carrying 16 chunky pixels, and converts them bitplane by bitplane into 16-bit words that are pushed out into chip memory. As the chip memory interface is only 16 bits wide on ECS machines, I was hoping that this would be sufficient to saturate the bandwidth of the bus, but that does not seem to be the case. Unfortunately, there are not enough registers left for a 32-pixel conversion function, as each word (or long-word) also needs a mask (to cover edge cases) and minterms also have to be emulated.
Anyhow, this is how this attempt looked: Code:
_loop:                                  ;next pixels (within the line)
        movem.l (a2)+,d0/d2/d4/d6
_nextplane:                             ;next bitplane
        lea     C2P_PLANEPTRS(a7),a0
        rol.l   #4,d0
        move.l  (a0),d5                 ;get first bitplane pointer
        rol.l   #4,d4                   ;pre-shift
        bra.s   _bitplanedone
Write_NotSrc:
        not.w   d1
Write_Src:
        move.w  d1,(a5,a3.l)
_bitplanedone:
        move.l  d0,d7                   ;2 1
        move.l  d5,d1                   ;was pre-loaded with next bitplane start
        ble.s   _skipbitplane
        move.l  d2,d1
        and.l   #$10101010,d7
        and.l   #$01010101,d1
        ror.l   #1,d0
        or.l    d7,d1
        lsr.l   #1,d2
        mulu.l  #$01020408,d1
        move.l  d5,a3
        move.l  d4,d7
        move.l  d6,d5
        and.l   #$10101010,d7
        and.l   #$01010101,d5
        ror.l   #1,d4
        or.l    d5,d7
        lsr.l   #1,d6
        mulu.l  #$01020408,d7
        swap    d1
        rol.l   #8,d7
        addq.l  #4,a0
        move.b  d7,d1
        move.l  (a0),d5
        jmp     (a1)                    ;write data with mask

The overall problem is not really the running time of the main loop with the mulu (at least not on the 68060), but that there is only one 16-bit word written per iteration.
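To make the mulu trick easier to follow: after the and/or merging, the work register holds one bit of the current bitplane at bit 0 and one at bit 4 of every byte, and the multiply by $01020408 (the sum of shifts by 3, 10, 17 and 24) funnels those 8 scattered bits into the top byte of the 32-bit product, with no carries colliding on the way. A sketch of just that step in Python (the function name is mine, not from the routine):

```python
MAGIC = 0x01020408  # = (1 << 24) + (1 << 17) + (1 << 10) + (1 << 3)

def gather_bits(x):
    """Collect the bits at positions 0, 4, 8, ..., 28 of x into the top
    byte of the 32-bit product x * MAGIC, as mulu.l #$01020408 does in
    the loop above. The output byte comes out as b7 b5 b3 b1 b6 b4 b2 b0,
    where b_i is the input bit at position 4*i."""
    assert x & ~0x11111111 == 0, "only bits 0 and 4 of each byte may be set"
    return ((x * MAGIC) >> 24) & 0xFF

# Bits set at positions 28, 24, 16 and 4, i.e. b7, b6, b4 and b1:
print(hex(gather_bits(0x11010010)))  # 0x9c = 0b10011100
```

The following swap / rol.l #8 / move.b sequence then pairs the two gathered bytes into the 16-bit word that is written to the bitplane.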
20 April 2024, 15:05 | #68 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
If you want to go to 32 bits per iteration, it is perhaps possible.
Not having the full picture, I may be missing something, but what about this:
Code:
_loop:                                  ;next pixels (within the line)
        movem.l (a2)+,d0/d2/d4/d6
_nextplane:                             ;next bitplane
        lea     C2P_PLANEPTRS(a7),a0
        rol.l   #4,d0
        rol.l   #4,d4                   ;pre-shift
        bra.s   _bitplanedone
Write_NotSrc:
        not.l   d1
Write_Src:
        move.l  d1,(a5,a3.l)
_bitplanedone:
        move.l  (a0)+,a3
        move.l  a3,d1                   ;was pre-loaded with next bitplane start
        ble.s   _skipbitplane
        move.l  d0,d7
        move.l  d2,d1
        and.l   #$10101010,d7
        and.l   #$01010101,d1
        ror.l   #1,d0
        or.l    d7,d1
        lsr.l   #1,d2
        mulu.l  #$01020408,d1
        move.l  d4,d7
        move.l  d6,d5
        and.l   #$10101010,d7
        and.l   #$01010101,d5
        ror.l   #1,d4
        or.l    d5,d7
        lsr.l   #1,d6
        mulu.l  #$01020408,d7
        swap    d1
        rol.l   #8,d7
        move.b  d7,d1
        move.l  d0,d7
        move.l  d2,d5
        and.l   #$10101010,d7
        and.l   #$01010101,d5
        ror.l   #1,d0
        or.l    d7,d5
        lsr.l   #1,d2
        mulu.l  #$01020408,d5
        lsl.l   #8,d1
        move.b  d5,d1
        move.l  d4,d7
        move.l  d6,d5
        and.l   #$10101010,d7
        and.l   #$01010101,d5
        ror.l   #1,d4
        or.l    d5,d7
        lsr.l   #1,d6
        mulu.l  #$01020408,d7
        lsl.l   #8,d1
        move.b  d7,d1
        jmp     (a1)                    ;write data with mask
20 April 2024, 16:21 | #69 | |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,310
|
Quote:
That is not immediately possible, since there are no free registers anymore, nor is it clear whether this source data is accessible or beyond the edge of the source (that would require another comparison with the end-of-line register and potentially a call to the "fetch partial" function). Neither can you reuse the existing set of four registers (d0,d2,d4,d6), as you would then lose the loop invariance (bits 0 and 4 of the work registers contain the bits of the bitplane to convert).
20 April 2024, 16:43 | #70 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
@Thomas Richter
Right, sorry again. I won't post further OT stuff other than what follows (just to close what was discussed).

@NorthWay
Yes:
Code:
.l      movem.l d0,(a0) ;CHIP RAM
        movem.l d0,(a0) ;CHIP RAM
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 3708666.312
CPU cycles / per loop : 56.589
color clocks          : 263085
rasterlines           : 1158.964
frames                : 3.702
µs                    : 74173.326

.l      movem.l d0,(a0) ;CHIP RAM
        mulu.w  d1,d2   ;d1 even
        movem.l d0,(a0) ;CHIP RAM
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 5567728.957
CPU cycles / per loop : 84.956
color clocks          : 394963
rasterlines           : 1739.925
frames                : 5.558
µs                    : 111354.579

@all
First of all, a correction to myself: exg does execute in parallel with writes to RAM:
Code:
.l      move.l  d0,(a0) ;CHIP RAM
        move.l  d0,(a0) ;CHIP RAM
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 3711880.391
CPU cycles / per loop : 56.638
color clocks          : 263313
rasterlines           : 1159.969
frames                : 3.705
µs                    : 74237.607

.l      exg.l   d0,d1
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 655488.814
CPU cycles / per loop : 10.001
color clocks          : 46499
rasterlines           : 204.841
frames                : 0.654
µs                    : 13109.776

.l      move.l  d0,(a0) ;CHIP RAM
        exg.l   d1,d2
        move.l  d0,(a0) ;CHIP RAM
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 3711894.487
CPU cycles / per loop : 56.639
color clocks          : 263314
rasterlines           : 1159.973
frames                : 3.705
µs                    : 74237.889

Then, complete tests that show that the only factor that affects mulu's execution speed is whether the left operand is even or odd (the same goes for muls, but I'm not including the results here for practical reasons):
Code:
mulu.w 0,even - 26 cycles
.l      move.w  d1,d2   ;d1 even
        mulu.w  d0,d2   ;d0 zero
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 2225143.400
CPU cycles / per loop : 33.952
color clocks          : 157847
rasterlines           : 695.361
frames                : 2.221
µs                    : 44502.868

mulu.w 0,odd - 26 cycles
.l      move.w  d1,d2   ;d1 odd
        mulu.w  d0,d2   ;d0 zero
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 2225129.303
CPU cycles / per loop : 33.952
color clocks          : 157846
rasterlines           : 695.356
frames                : 2.221
µs                    : 44502.586

mulu.w even,0 - 26 cycles
.l      mulu.w  d1,d0   ;d1 even, d0 zero
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 2094042.817
CPU cycles / per loop : 31.952
color clocks          : 148547
rasterlines           : 654.392
frames                : 2.090
µs                    : 41880.856

mulu.w odd,0 - 28 cycles
.l      mulu.w  d1,d0   ;d1 odd, d0 zero
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 2225143.400
CPU cycles / per loop : 33.952
color clocks          : 157847
rasterlines           : 695.361
frames                : 2.221
µs                    : 44502.868

mulu.w even,even - 26 cycles
.l      move.w  d1,d2   ;d1 even
        mulu.w  d0,d2   ;d0 even
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 2225129.303
CPU cycles / per loop : 33.952
color clocks          : 157846
rasterlines           : 695.356
frames                : 2.221
µs                    : 44502.586

mulu.w even,odd - 26 cycles
.l      move.w  d1,d2   ;d1 odd
        mulu.w  d0,d2   ;d0 even
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 2225143.400
CPU cycles / per loop : 33.952
color clocks          : 157847
rasterlines           : 695.361
frames                : 2.221
µs                    : 44502.868

mulu.w odd,even - 28 cycles
.l      move.w  d1,d2   ;d1 even
        mulu.w  d0,d2   ;d0 odd
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 2356243.982
CPU cycles / per loop : 35.953
color clocks          : 167147
rasterlines           : 736.330
frames                : 2.352
µs                    : 47124.879

mulu.w odd,odd - 28 cycles
.l      move.w  d1,d2   ;d1 odd
        mulu.w  d0,d2   ;d0 odd
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 2356243.982
CPU cycles / per loop : 35.953
color clocks          : 167147
rasterlines           : 736.330
frames                : 2.352
µs                    : 47124.879
Code:
.l      swap.w  d0
        dbf     d7,.l
------------------------------------------------
CPU                   : 68020 14.187580 MHz I.....
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 655412.000
CPU cycles / per loop : 10.000
color clocks          : 163853
rasterlines           : 721.819
frames                : 2.306
µs                    : 46196.180

.l      move.l  d0,(a0) ;CHIP RAM
        add.l   d3,d3...        ;12x
        mulu.w  d1,d2   ;d1 even
        add.l   d3,d3...        ;2x
        move.l  d0,(a0) ;CHIP RAM
        add.l   d3,d3...        ;9x
        subq.l  #1,d7
        bne.b   .l
------------------------------------------------
CPU                   : 68020 14.187580 MHz I.....
loops number          : 200000
leaked bytes          : 0
CPU cycles / total    : 16800148.000
CPU cycles / per loop : 84.000
color clocks          : 4200037
rasterlines           : 18502.365
frames                : 59.112
µs                    : 1184144.723

@admins
Would it be possible to move the OT posts to a separate thread?

Last edited by saimo; 23 April 2024 at 23:26. Reason: Updated archive.
20 April 2024, 16:53 | #71 | ||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
Quote:
Quote:
Perhaps posting a little more code would help. Losing the data in d0/d2/d4/d6 seems acceptable - just read the data again. The classic 68030 c2p does two 4-bit passes, reading the same data twice.
20 April 2024, 17:38 | #72 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,039
|
Code:
_loop:                                  ;next pixels (within the line)
        movem.l (a2)+,d0/d2/d4/d6
_nextplane:                             ;next bitplane
        lea     C2P_PLANEPTRS(a7),a0
        rol.l   #4,d0
        move.l  (a0),d5                 ;get first bitplane pointer
        rol.l   #4,d4                   ;pre-shift
        bra.s   _bitplanedone
Write_NotSrc:
        not.w   d1
Write_Src:
        move.w  d1,(a5,a3.l)
_bitplanedone:
        move.l  d0,d7                   ;2 1
        move.l  d5,d1                   ;was pre-loaded with next bitplane start
        ble.s   _skipbitplane
        move.l  #$01010101,d5           ; +
        move.l  d1,a3                   ; +
        move.l  d2,d1
        and.l   #$10101010,d7
;       and.l   #$01010101,d1
        and.l   d5,d1                   ; +
        ror.l   #1,d0
        or.l    d7,d1
        lsr.l   #1,d2
        mulu.l  #$01020408,d1
;       move.l  d5,a3
        move.l  d4,d7
;       move.l  d6,d5
        and.l   #$10101010,d7
;       and.l   #$01010101,d5
        and.l   d6,d5                   ; +
        ror.l   #1,d4
        or.l    d5,d7
        lsr.l   #1,d6
        mulu.l  #$01020408,d7
        swap    d1
        rol.l   #8,d7
        addq.l  #4,a0
        move.b  d7,d1
        move.l  (a0),d5
        jmp     (a1)                    ;write data with mask
20 April 2024, 18:16 | #73 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,039
|
Code:
_loop:                                  ;next pixels (within the line)
        movem.l (a2)+,d0/d2/d4/d6
_nextplane:                             ;next bitplane
        lea     C2P_PLANEPTRS(a7),a0
        rol.l   #4,d0
        move.l  (a0),d5                 ;get first bitplane pointer
        rol.l   #4,d4                   ;pre-shift
        bra.s   _bitplanedone
Write_NotSrc:
        not.w   d1
Write_Src:
        move.w  d1,(a5,a3.l)
_bitplanedone:
        move.l  d0,d7                   ;2 1
        move.l  d5,d1                   ;was pre-loaded with next bitplane start
        ble.s   _skipbitplane
        move.l  #$01010101,d5           ; +
        move.l  d1,a3                   ; +
        move.l  d2,d1
        and.l   #$10101010,d7
;       and.l   #$01010101,d1
        and.l   d5,d1                   ; +
        ror.l   #1,d0
        or.l    d7,d1
        lsr.l   #1,d2
        mulu.l  #$01020408,d1
;       move.l  d5,a3
;       move.l  d4,d7
;       move.l  d6,d5
;       and.l   #$10101010,d7
        move.l  d5,d7                   ; +
        add.l   d7,d7                   ; +
        and.l   d4,d7                   ; +
;       and.l   #$01010101,d5
        and.l   d6,d5                   ; +
        ror.l   #1,d4
        or.l    d5,d7
        lsr.l   #1,d6
        mulu.l  #$01020408,d7
        swap    d1
        rol.l   #8,d7
        addq.l  #4,a0
        move.b  d7,d1
        move.l  (a0),d5
        jmp     (a1)                    ;write data with mask
20 April 2024, 19:28 | #74 | ||
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,310
|
Not visible here. a4 is the "write data with mask" function, and a6 is the end-of-line pointer to which a2 is compared.
Look, I'm not using this code anyhow, so this does not make much sense, and you're missing the point. It does not matter to squeeze a couple of cycles from this loop as long as only 16 bits are written at a time. The mulu approach is not the right one to begin with. It looked like a neat idea initially because it keeps the instruction count low, but mulu is apparently eating too many cycles compared to a couple of manual "bit folding" instructions, at least on the 030 and 040. On the 68060, it should be faster, but the net benefit is zero - the problem sits elsewhere. Quote:
Quote:
Yes, but it violates a couple of constraints I have - it reads potentially invalid memory, causing hits or crashes, and it does not implement minterms. Besides, the code is not maintainable and scales poorly. |
20 April 2024, 19:47 | #75 | |||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
Quote:
Quote:
Quote:
It cannot become invalid if you lock it during the whole process. |
20 April 2024, 20:27 | #76 | ||
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,310
|
Quote:
Quote:
Err, what? That's not a matter of "locking". It is a matter of "reaching the end of the RAM". |
20 April 2024, 20:58 | #77 | ||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
Quote:
Quote:
Shouldn't happen anyway if you align the source. |
20 April 2024, 22:56 | #78 | |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,310
|
Quote:
The source is wherever the source happens to be - there is no control over alignment - and yes, it surely is a problem if you find hardware registers before or behind the source bitmap, if that is a graphics card.
21 April 2024, 08:35 | #79 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
Quote:
The video ram should have its start and end longword aligned. It can't be anywhere. Just clip correctly and perform aligned reads. |
21 April 2024, 08:37 | #80 | ||
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,732
|
Quote:
Quote:
How do you know that it isn't 'sufficient to saturate the bandwidth of the bus'? |