17 April 2024, 15:58 | #61 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
17 April 2024, 18:22 | #62 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
To understand what happens in the cases where the average execution time is not round, I resorted to a properly-drawn diagram (as I had anticipated) and a piece of code that uses only instructions whose timings are certain and that has a non-round average execution time. I went for two moves to CHIP RAM with as many adds stuffed in between:
Code:
.l      move.l  d0,(a0)
        rept    13
        add.l   d1,d1
        endr
        move.l  d0,(a0)
        rept    10
        add.l   d1,d1
        endr
        dbf     d7,.l
Code:
.l      move.l  d0,(a0)
        move.l  d0,(a0)
        dbf     d7,.l
Code:
.l      move.l  d0,(a0)
        move.l  d0,(a0)
        dbf     d7,.l

Note: buffer in CHIP RAM
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 3711908.584
CPU cycles / per loop : 56.639
color clocks          : 263315
rasterlines           : 1159.977
frames                : 3.705
µs                    : 74238.171

.l      move.l  d0,(a0)
        rept    13
        add.l   d1,d1
        endr
        move.l  d0,(a0)
        rept    10
        add.l   d1,d1
        endr
        dbf     d7,.l

Note: buffer in CHIP RAM
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 3711964.972
CPU cycles / per loop : 56.640
color clocks          : 263319
rasterlines           : 1159.995
frames                : 3.706
µs                    : 74239.299

.l      move.l  d0,(a0)
        rept    13
        add.l   d1,d1
        endr
        move.l  d0,(a0)
        rept    11
        add.l   d1,d1
        endr
        dbf     d7,.l

Note: buffer in CHIP RAM
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 4660400.152
CPU cycles / per loop : 71.112
color clocks          : 330599
rasterlines           : 1456.383
frames                : 4.652
µs                    : 93208.003
Code:
.l      move.l  d0,(a0)
        rept    13
        add.l   d1,d1
        endr
        move.l  d0,(a0)
        rept    9
        add.l   d1,d1
        endr
        subq.l  #1,d7
        bne.b   .l

Note: buffer in CHIP RAM
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 50000000
leaked bytes          : 0
CPU cycles / total    : 2831842625.733
CPU cycles / per loop : 56.636
color clocks          : 200884969
rasterlines           : 884955.810
frames                : 2827.334
µs                    : 56636852.514

Then, I drew a precise diagram based on the duration of a color clock expressed in CPU cycles: 50/3.546895 = 14.09683681078803, which, for practical reasons, can be reasonably rounded to 14.1.

EDIT 3: the original picture was flawed, as in various points I had aligned the CPU instructions to the color clocks (don't ask me why: I don't know myself)! Even worse, based on that, I even wrote some nonsense - shame on me! Luckily, the conclusion was still valid, as it was based on hard numbers and reasoning.

EDIT 5: replaced the picture again with a more detailed one which shows the cycles of the moves and their sub-cycle states as per MC68030UM 7.3.2, to help see how the CPU talks with Alice (in particular, refer to figure 7-25).

The picture shows that the loop takes either 56 or 57 cycles, always fitting in 4 color clocks. It also shows that the loops repeat with a [57, 56, 56, 57, 56] cycle pattern (going from top-left to bottom-right, the first and the last loops are identical).
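As a quick sanity check of the figures above (plain Python arithmetic on the numbers quoted in the post, nothing measured here):

```python
# Color clock duration in 68030 cycles, from the two clock frequencies
# quoted above (50 MHz CPU, 3.546895 MHz PAL color clock).
CPU_MHZ = 50.0
CCLK_MHZ = 3.546895

cycles_per_cclk = CPU_MHZ / CCLK_MHZ       # ~14.0968, rounded to 14.1 in the text

# Per-loop cycle counts read off the diagram: they average to exactly
# 4 color clocks per loop.
pattern = [57, 56, 56, 57, 56]
avg = sum(pattern) / len(pattern)          # 56.4 cycles per loop on average

print(round(cycles_per_cclk, 4))           # 14.0968
print(avg, round(4 * cycles_per_cclk, 3))  # ~56.4 vs ~56.387
```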
Note: actually, due to the drift caused by a color clock being a little less than 14.1 CPU cycles, periodically - but very rarely - there will be a "jump", possibly of 1 color clock. (EDIT 4: OK, I just had to calculate this, too. The difference is 14.1-14.09683681078803 = 0.00316318921197; this means that every 1/0.00316318921197 = 316.1366371053127 color clocks, from the CPU's point of view, there is a "short" color clock that lasts just 14 cycles. This might have no impact but, after a while, the growing drift will cause the CPU to miss a slot or eliminate a 1-cycle wait state; maybe this is what explains the difference between the calculated and measured times reported below.)

Since the loop basically fits in 4 color clocks, all loops should take 65536*4 = 262144 color clocks, not 263319. Something is missing, something that, every now and then, causes the CPU to miss a CHIP bus slot... when I said this to myself, it dawned on me: memory refresh cycles!

Let's assume that the loop begins exactly at color clock #0: that's a memory refresh slot, so execution actually starts only at slot 1. Given that the CPU is granted only every other slot, execution uses slots 1, 3, ..., 225 and then would go back to slot #0 of the next rasterline - which, of course, it cannot use. So, basically, the calculation must add an extra color clock for every rasterline. The test report shows that execution lasts 1159.977 (= 263315/227) rasterlines: therefore, the color clocks should be 262144+1159 = 263303 in total. What were the measured color clocks, again? 263319 - that's 16 color clocks more - a 16*100/263303 = 0.0061% difference from the theoretical value. I'm pretty sure that the other odd figures returned by the tests can be demonstrated in a similar way. But time is limited and I have better things to do.
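The drift and the refresh-slot correction can be re-derived numerically; a sketch in Python, assuming only the 227-color-clock PAL rasterline (the measured totals are copied from the test reports above):

```python
# Drift: from the CPU's point of view, how often a "short" (14-cycle)
# color clock shows up when assuming 14.1 cycles per color clock.
cycles_per_cclk = 50.0 / 3.546895
drift = 14.1 - cycles_per_cclk
print(round(1 / drift, 1))              # ~316.1 color clocks between "short" ones

# Refresh correction: one extra color clock per rasterline.
CCLKS_PER_LINE = 227                    # PAL rasterline length in color clocks
measured = 263319                       # rept 13 / rept 10 test above
ideal = 65536 * 4                       # 4 color clocks per loop -> 262144
lines = measured // CCLKS_PER_LINE      # full rasterlines spanned
predicted = ideal + lines               # add one collision per rasterline
print(predicted, measured - predicted)  # a residue of a few color clocks remains
```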
Plus, this additional research proved (again) that the test tool provides reliable figures, so there's no further work to do on it (well, since I started sharing it here, I've been adding features to make it more suitable for the public, and I realized that maybe others could find it useful: maybe I could do a little more work on it and release it publicly).

EDIT: originally I had started drawing the diagram for the mulu test code; since mulu's behaviour is not certain (given that one test shows that it's possible to execute 2 adds after it for free, it seems that more than its head can run in parallel with writes), I soon realized that I'd better remove the unknown and then return to it later. But once I reached the conclusion illustrated above, I realized the exercise was not worth the effort, as the actual results and the explanation found are sufficient for me.

EDIT 2: attached updated test tool archive; the tool now also shows the supervisor state registers and provides blobs with a simple mechanism to dump data to disk.

@Thomas Richter
Sorry for the huge OT but at least, as far as multiplications go, it unveiled that odd left operands cause the 68030 to take 2 more cycles when multiplying and that multiplications won't run in parallel with memory writes.

Last edited by saimo; 19 April 2024 at 19:49. Reason: Replaced picture (was flawed); 1160 -> 1159; added paragraph at the end; removed attachment, as I provide a newer one later.
18 April 2024, 10:51 | #63 |
Registered User
Join Date: Oct 2023
Location: London, UK
Posts: 124
|
This is a fascinating read, thanks for your investigations and for posting the methodology in detail. It might also help with some minor C2P optimizations in my case, I'm guessing.
I was also interested in reading the earlier results with FastRAM. I'd read about this elsewhere, but never seen it analysed in quite this detail! |
18 April 2024, 17:01 | #64 | |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Quote:
Despite the good intentions I professed in my previous post, the mystery of move-mulu-move taking 6 color clocks instead of the theoretical 5 kept on bugging me, so eventually I ended up doing a little research on that as well. I'm halfway through it and the surprise I've stumbled upon is quite, well, surprising. Along the way, I've touched up my little tool (I'll post it later together with what I found) - maybe it might help you tune your code. |
19 April 2024, 19:48 | #65 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
Eventually I couldn't resist and dove into the mulu oddity...
As mentioned earlier, tests showed that this code takes 6 color clocks per loop instead of the theoretical 5:
Code:
.l      move.l  d0,(a0)
        mulu.w  d1,d2   ;d1 even
        move.l  d0,(a0)
        dbf     d7,.l
Code:
.l      move.l  d0,(a0) ;CHIP RAM
        mulu.w  d1,d2   ;d1 even
        move.l  d0,(a0) ;CHIP RAM
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 5567743.054
CPU cycles / per loop : 84.957
color clocks          : 394964
rasterlines           : 1739.929
frames                : 5.558
µs                    : 111354.861

.l      move.l  d0,(a0) ;CHIP RAM
        mulu.w  d1,d2   ;d1 even
        move.l  d0,(a0) ;CHIP RAM
        subq.l  #1,d7
        bne.b   .l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 50000000
leaked bytes          : 0
CPU cycles / total    : 4247763804.679
CPU cycles / per loop : 84.955
color clocks          : 301327444
rasterlines           : 1327433.674
frames                : 4241.002
µs                    : 84955276.093

Theoretically, the timings should be these:
Code:
.l      move.l  d0,(a0) ;stalls until next color clock; 1 thru 2 color clocks -> 14.1 thru 28.2 = 15 thru 29 cycles
        mulu.w  d1,d2   ;26 cycles
        move.l  d0,(a0) ;stalls until next color clock; 1 thru 2 color clocks -> 14.1 thru 28.2 = 15 thru 29 cycles
        dbf     d7,.l   ;6 cycles, in parallel with move

* 15+26+15 = 56 cycles (= 56*3.546895/50 = 3.9725224 color clocks) if no stall happens (an impossible condition, due to the unrelated clock frequencies);
* 29+26+29 = 84 cycles (= 84*3.546895/50 = 5.9587836 color clocks) if 2 full-color-clock stalls happen.

However, a stall doesn't necessarily have to last a whole color clock: it could be even as short as 1 CPU cycle; so, why can't something in between the two extremes happen? To try to answer the question, I made some tests that measured the wasted cycles. The first test aimed to find how many dummy adds can be executed before mulu without affecting the execution time. The result was 14:
Code:
.l      move.l  d0,(a0) ;CHIP RAM
        add.l   d3,d3...        ;14x
        mulu.w  d1,d2   ;d1 even
        move.l  d0,(a0) ;CHIP RAM
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 5567799.441
CPU cycles / per loop : 84.957
color clocks          : 394968
rasterlines           : 1739.947
frames                : 5.558
µs                    : 111355.988

.l      move.l  d0,(a0) ;CHIP RAM
        add.l   d3,d3...        ;15x
        mulu.w  d1,d2   ;d1 even
        move.l  d0,(a0) ;CHIP RAM
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 6486151.972
CPU cycles / per loop : 98.970
color clocks          : 460114
rasterlines           : 2026.933
frames                : 6.475
µs                    : 129723.039

The second test did the same with the dummy adds placed after mulu; there, the limit turned out to be 2:
Code:
.l      move.l  d0,(a0) ;CHIP RAM
        mulu.w  d1,d2   ;d1 even
        add.l   d3,d3...        ;2x
        move.l  d0,(a0) ;CHIP RAM
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 5567813.538
CPU cycles / per loop : 84.958
color clocks          : 394969
rasterlines           : 1739.951
frames                : 5.558
µs                    : 111356.270

.l      move.l  d0,(a0) ;CHIP RAM
        mulu.w  d1,d2   ;d1 even
        add.l   d3,d3...        ;3x
        move.l  d0,(a0) ;CHIP RAM
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 6499628.548
CPU cycles / per loop : 99.176
color clocks          : 461070
rasterlines           : 2031.145
frames                : 6.489
µs                    : 129992.570

To double-check whether such conclusions were correct, I wrote a test that inserts 12 adds before mulu, 2 adds after it and also 9 adds, 1 subq and 1 bne after the second write, to exploit every available cycle. Making it loop 200000 times confirmed what was found so far:
Code:
.l      move.l  d0,(a0) ;CHIP RAM
        add.l   d3,d3...        ;12x
        mulu.w  d1,d2   ;d1 even
        add.l   d3,d3...        ;2x
        move.l  d0,(a0) ;CHIP RAM
        add.l   d3,d3...        ;9x
        subq.l  #1,d7
        bne.b   .l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 200000
leaked bytes          : 0
CPU cycles / total    : 16991255.732
CPU cycles / per loop : 84.956
color clocks          : 1205324
rasterlines           : 5309.797
frames                : 16.964
µs                    : 339825.114

Visually, we get confirmation that the loop takes 84 or 85 cycles, and it is very clear that mulu can't start earlier than 12 cycles after the tail of the write has started. Why is that? I don't know, but it will be interesting to check this behaviour using other instructions that can't execute in parallel with RAM writes (I've started looking into this as well, but I'll leave it for a future post; curiosity, what a terrible beast!).

Returning to the test results, we notice that 200000 loops required 1205324 color clocks; without memory refresh slot collisions, they would be 200000*6 = 1200000, so the difference from the measured color clocks is 1205324-1200000 = 5324. Unlike the case discussed in post #62, here there is not an access every other color clock (A-A-A-...); the pattern is A---A-...: so, how many collisions will happen? Let's assume that the loop begins at slot #0: the initial collision will make the loop use the slots 1 (6*0+1), 5 (6*0+5), 7 (6*1+1), 11 (6*1+5), ... 223 (37*6+1), 227 (37*6+5) = 0 - collision again! The same will repeat on the next rasterline.
And the same would happen even if execution started at any other slot, as there are 4 memory refresh slots (0, 2, 4, 6), one of which will surely be stumbled upon, because the pattern repeats every 6 accesses (<color slot> mod 6 falls in [0, 5]). Thus, also in this case, the number of collisions will be equal to the number of rasterlines. Therefore, the theoretical number of color clocks is 1200000+5309 = 1205309, which is just 1205324-1205309 = 15 color clocks less than the measured ones (a 15*100/1205309 = 0.0012444941504627% error). My current guess is that this (insignificant) difference is due to the drift caused by the fact that 1 color clock is slightly less than 14.1 cycles (see post #62 again, as I have added a few considerations about this - plus, I have also updated the diagram) or, more generally, to the misalignment of the clock frequencies.

In conclusion, the measured performance is verified (and the reliability of the tool is demonstrated again).

Attached is the archive that contains a slightly updated tool plus the sources and binaries of the tests mentioned here.

Last edited by saimo; 20 April 2024 at 16:54. Reason: Fixed English in a couple of places. Removed attachment as I provided a newer one later.
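The collision count for the 6-color-clock loop above can be checked the same way (Python arithmetic on the reported figures; the slot positions follow the A---A- pattern described in the post):

```python
CCLKS_PER_LINE = 227                    # PAL rasterline length in color clocks

# Accesses land on slots 6k+1 and 6k+5; the last access of a line,
# slot 37*6+5 = 227, wraps onto slot 0 of the next line - a refresh
# slot - so exactly one collision happens per rasterline.
assert (37 * 6 + 5) % CCLKS_PER_LINE == 0

loops = 200000
measured = 1205324                      # color clocks reported by the test
ideal = loops * 6                       # 1200000: 6 color clocks per loop
lines = measured // CCLKS_PER_LINE      # full rasterlines spanned
predicted = ideal + lines               # one lost slot per rasterline
print(predicted, measured - predicted)  # 15-color-clock residue, as in the post
```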
20 April 2024, 00:12 | #66 |
Registered User
Join Date: May 2013
Location: Grimstad / Norway
Posts: 852
|
Stupid Q: Does movem behave the same as move?
20 April 2024, 13:57 | #67 |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,310
|
The thread slightly derailed from the original question, which is no longer relevant for me, as I solved the issue in a different way: mulu is simply too complex on the 68030 to provide competitive performance. Maybe the algorithm itself is interesting anyhow. It takes four 32-bit values carrying 16 chunky pixels, and converts them bitplane by bitplane into 16-bit words that are pushed out into chip memory. As the chip memory interface is only 16 bits wide on ECS machines, I was hoping that this would be sufficient to saturate the bandwidth of the bus, but that does not seem to be the case. Unfortunately, there are not enough registers left for a 32-pixel conversion function, as each word (or long-word) also needs a mask (to cover edge cases) and minterms also have to be emulated.
Anyhow, this is how this attempt looked: Code:
_loop:                                  ;next pixels (within the line)
        movem.l (a2)+,d0/d2/d4/d6
_nextplane:                             ;next bitplane
        lea     C2P_PLANEPTRS(a7),a0
        rol.l   #4,d0
        move.l  (a0),d5                 ;get first bitplane pointer
        rol.l   #4,d4                   ;pre-shift
        bra.s   _bitplanedone
Write_NotSrc:
        not.w   d1
Write_Src:
        move.w  d1,(a5,a3.l)
_bitplanedone:
        move.l  d0,d7                   ;2 1
        move.l  d5,d1                   ;was pre-loaded with next bitplane start
        ble.s   _skipbitplane
        move.l  d2,d1
        and.l   #$10101010,d7
        and.l   #$01010101,d1
        ror.l   #1,d0
        or.l    d7,d1
        lsr.l   #1,d2
        mulu.l  #$01020408,d1
        move.l  d5,a3
        move.l  d4,d7
        move.l  d6,d5
        and.l   #$10101010,d7
        and.l   #$01010101,d5
        ror.l   #1,d4
        or.l    d5,d7
        lsr.l   #1,d6
        mulu.l  #$01020408,d7
        swap    d1
        rol.l   #8,d7
        addq.l  #4,a0
        move.b  d7,d1
        move.l  (a0),d5
        jmp     (a1)                    ;write data with mask

The overall problem is not really the running time of the main loop with the mulu (at least not on the 68060), but that there is only one 16-bit word written per iteration.
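To make the mulu trick easier to follow: after the and/or merging, the work register holds one bit of the current bitplane at bit 0 and one at bit 4 of every byte, and the multiply by $01020408 (the sum of shifts by 3, 10, 17 and 24) funnels those 8 scattered bits into the top byte of the 32-bit product, with no carries colliding on the way. A sketch of just that step in Python (the function name is mine, not from the routine):

```python
MAGIC = 0x01020408  # = (1 << 24) + (1 << 17) + (1 << 10) + (1 << 3)

def gather_bits(x):
    """Collect the bits at positions 0, 4, 8, ..., 28 of x into the top
    byte of the 32-bit product x * MAGIC, as mulu.l #$01020408 does in
    the loop above. The output byte comes out as b7 b5 b3 b1 b6 b4 b2 b0,
    where b_i is the input bit at position 4*i."""
    assert x & ~0x11111111 == 0, "only bits 0 and 4 of each byte may be set"
    return ((x * MAGIC) >> 24) & 0xFF

# Bits set at positions 28, 24, 16 and 4, i.e. b7, b6, b4 and b1:
print(hex(gather_bits(0x11010010)))  # 0x9c = 0b10011100
```

The following swap / rol.l #8 / move.b sequence then pairs the two gathered bytes into the 16-bit word that is written to the bitplane.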
20 April 2024, 15:05 | #68 |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
If you want to go to 32 bits per iteration, it is perhaps possible.
Not having the full picture, I may be missing something, but what about this:
Code:
_loop:                                  ;next pixels (within the line)
        movem.l (a2)+,d0/d2/d4/d6
_nextplane:                             ;next bitplane
        lea     C2P_PLANEPTRS(a7),a0
        rol.l   #4,d0
        rol.l   #4,d4                   ;pre-shift
        bra.s   _bitplanedone
Write_NotSrc:
        not.l   d1
Write_Src:
        move.l  d1,(a5,a3.l)
_bitplanedone:
        move.l  (a0)+,a3
        move.l  a3,d1                   ;was pre-loaded with next bitplane start
        ble.s   _skipbitplane
        move.l  d0,d7
        move.l  d2,d1
        and.l   #$10101010,d7
        and.l   #$01010101,d1
        ror.l   #1,d0
        or.l    d7,d1
        lsr.l   #1,d2
        mulu.l  #$01020408,d1
        move.l  d4,d7
        move.l  d6,d5
        and.l   #$10101010,d7
        and.l   #$01010101,d5
        ror.l   #1,d4
        or.l    d5,d7
        lsr.l   #1,d6
        mulu.l  #$01020408,d7
        swap    d1
        rol.l   #8,d7
        move.b  d7,d1
        move.l  d0,d7
        move.l  d2,d5
        and.l   #$10101010,d7
        and.l   #$01010101,d5
        ror.l   #1,d0
        or.l    d7,d5
        lsr.l   #1,d2
        mulu.l  #$01020408,d5
        lsl.l   #8,d1
        move.b  d5,d1
        move.l  d4,d7
        move.l  d6,d5
        and.l   #$10101010,d7
        and.l   #$01010101,d5
        ror.l   #1,d4
        or.l    d5,d7
        lsr.l   #1,d6
        mulu.l  #$01020408,d7
        lsl.l   #8,d1
        move.b  d7,d1
        jmp     (a1)                    ;write data with mask
20 April 2024, 16:21 | #69 | |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,310
|
Quote:
That is not immediately possible, since there are no free registers anymore, nor is it clear whether this source data is accessible or beyond the edge of the source (that would require another comparison with the end-of-line register and potentially a call to the "fetch partial" function). Neither can you reuse the existing set of four registers (d0,d2,d4,d6), as you would then lose the loop invariance (bits 0 and 4 of the work registers contain the bits of the bitplane to convert).
20 April 2024, 16:43 | #70 |
Registered User
Join Date: Aug 2010
Location: Italy
Posts: 854
|
@Thomas Richter
Right, sorry again. I won't post further OT stuff other than what follows (just to close what was discussed).

@NorthWay
Yes:
Code:
.l      movem.l d0,(a0) ;CHIP RAM
        movem.l d0,(a0) ;CHIP RAM
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 3708666.312
CPU cycles / per loop : 56.589
color clocks          : 263085
rasterlines           : 1158.964
frames                : 3.702
µs                    : 74173.326

.l      movem.l d0,(a0) ;CHIP RAM
        mulu.w  d1,d2   ;d1 even
        movem.l d0,(a0) ;CHIP RAM
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 5567728.957
CPU cycles / per loop : 84.956
color clocks          : 394963
rasterlines           : 1739.925
frames                : 5.558
µs                    : 111354.579

@all
First of all, a correction to myself: exg does execute in parallel with writes to RAM:
Code:
.l      move.l  d0,(a0) ;CHIP RAM
        move.l  d0,(a0) ;CHIP RAM
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 3711880.391
CPU cycles / per loop : 56.638
color clocks          : 263313
rasterlines           : 1159.969
frames                : 3.705
µs                    : 74237.607

.l      exg.l   d0,d1
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 655488.814
CPU cycles / per loop : 10.001
color clocks          : 46499
rasterlines           : 204.841
frames                : 0.654
µs                    : 13109.776

.l      move.l  d0,(a0) ;CHIP RAM
        exg.l   d1,d2
        move.l  d0,(a0) ;CHIP RAM
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 3711894.487
CPU cycles / per loop : 56.639
color clocks          : 263314
rasterlines           : 1159.973
frames                : 3.705
µs                    : 74237.889

Then, complete tests that show that the only factor that affects mulu's execution speed is whether the left operand is even or odd (the same goes for muls, but I'm not including the results here for practical reasons):
Code:
mulu.w 0,even - 26 cycles
.l      move.w  d1,d2   ;d1 even
        mulu.w  d0,d2   ;d0 zero
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 2225143.400
CPU cycles / per loop : 33.952
color clocks          : 157847
rasterlines           : 695.361
frames                : 2.221
µs                    : 44502.868

mulu.w 0,odd - 26 cycles
.l      move.w  d1,d2   ;d1 odd
        mulu.w  d0,d2   ;d0 zero
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 2225129.303
CPU cycles / per loop : 33.952
color clocks          : 157846
rasterlines           : 695.356
frames                : 2.221
µs                    : 44502.586

mulu.w even,0 - 26 cycles
.l      mulu.w  d1,d0   ;d1 even, d0 zero
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 2094042.817
CPU cycles / per loop : 31.952
color clocks          : 148547
rasterlines           : 654.392
frames                : 2.090
µs                    : 41880.856

mulu.w odd,0 - 28 cycles
.l      mulu.w  d1,d0   ;d1 odd, d0 zero
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 2225143.400
CPU cycles / per loop : 33.952
color clocks          : 157847
rasterlines           : 695.361
frames                : 2.221
µs                    : 44502.868

mulu.w even,even - 26 cycles
.l      move.w  d1,d2   ;d1 even
        mulu.w  d0,d2   ;d0 even
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 2225129.303
CPU cycles / per loop : 33.952
color clocks          : 157846
rasterlines           : 695.356
frames                : 2.221
µs                    : 44502.586

mulu.w even,odd - 26 cycles
.l      move.w  d1,d2   ;d1 odd
        mulu.w  d0,d2   ;d0 even
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 2225143.400
CPU cycles / per loop : 33.952
color clocks          : 157847
rasterlines           : 695.361
frames                : 2.221
µs                    : 44502.868

mulu.w odd,even - 28 cycles
.l      move.w  d1,d2   ;d1 even
        mulu.w  d0,d2   ;d0 odd
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 2356243.982
CPU cycles / per loop : 35.953
color clocks          : 167147
rasterlines           : 736.330
frames                : 2.352
µs                    : 47124.879

mulu.w odd,odd - 28 cycles
.l      move.w  d1,d2   ;d1 odd
        mulu.w  d0,d2   ;d0 odd
        dbf     d7,.l
------------------------------------------------
CPU                   : 68030 50.000000 MHz IiDd..
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 2356243.982
CPU cycles / per loop : 35.953
color clocks          : 167147
rasterlines           : 736.330
frames                : 2.352
µs                    : 47124.879
Code:
.l      swap.w  d0
        dbf     d7,.l
------------------------------------------------
CPU                   : 68020 14.187580 MHz I.....
loops number          : 65536
leaked bytes          : 0
CPU cycles / total    : 655412.000
CPU cycles / per loop : 10.000
color clocks          : 163853
rasterlines           : 721.819
frames                : 2.306
µs                    : 46196.180

.l      move.l  d0,(a0) ;CHIP RAM
        add.l   d3,d3...        ;12x
        mulu.w  d1,d2   ;d1 even
        add.l   d3,d3...        ;2x
        move.l  d0,(a0) ;CHIP RAM
        add.l   d3,d3...        ;9x
        subq.l  #1,d7
        bne.b   .l
------------------------------------------------
CPU                   : 68020 14.187580 MHz I.....
loops number          : 200000
leaked bytes          : 0
CPU cycles / total    : 16800148.000
CPU cycles / per loop : 84.000
color clocks          : 4200037
rasterlines           : 18502.365
frames                : 59.112
µs                    : 1184144.723

@admins
Would it be possible to move the OT posts to a separate thread?

Last edited by saimo; 23 April 2024 at 23:26. Reason: Updated archive.
20 April 2024, 16:53 | #71 | ||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
Quote:
Quote:
Perhaps posting a little more code would help. Losing the data in d0/d2/d4/d6 seems acceptable - just read the data again. The classic 68030 c2p does two 4-bit passes, reading the same data twice.
20 April 2024, 17:38 | #72 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,039
|
Code:
_loop:                                  ;next pixels (within the line)
        movem.l (a2)+,d0/d2/d4/d6
_nextplane:                             ;next bitplane
        lea     C2P_PLANEPTRS(a7),a0
        rol.l   #4,d0
        move.l  (a0),d5                 ;get first bitplane pointer
        rol.l   #4,d4                   ;pre-shift
        bra.s   _bitplanedone
Write_NotSrc:
        not.w   d1
Write_Src:
        move.w  d1,(a5,a3.l)
_bitplanedone:
        move.l  d0,d7                   ;2 1
        move.l  d5,d1                   ;was pre-loaded with next bitplane start
        ble.s   _skipbitplane
        move.l  #$01010101,d5           ; +
        move.l  d1,a3                   ; +
        move.l  d2,d1
        and.l   #$10101010,d7
;       and.l   #$01010101,d1
        and.l   d5,d1                   ; +
        ror.l   #1,d0
        or.l    d7,d1
        lsr.l   #1,d2
        mulu.l  #$01020408,d1
;       move.l  d5,a3
        move.l  d4,d7
;       move.l  d6,d5
        and.l   #$10101010,d7
;       and.l   #$01010101,d5
        and.l   d6,d5                   ; +
        ror.l   #1,d4
        or.l    d5,d7
        lsr.l   #1,d6
        mulu.l  #$01020408,d7
        swap    d1
        rol.l   #8,d7
        addq.l  #4,a0
        move.b  d7,d1
        move.l  (a0),d5
        jmp     (a1)                    ;write data with mask
20 April 2024, 18:16 | #73 |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,039
|
Code:
_loop:                                  ;next pixels (within the line)
        movem.l (a2)+,d0/d2/d4/d6
_nextplane:                             ;next bitplane
        lea     C2P_PLANEPTRS(a7),a0
        rol.l   #4,d0
        move.l  (a0),d5                 ;get first bitplane pointer
        rol.l   #4,d4                   ;pre-shift
        bra.s   _bitplanedone
Write_NotSrc:
        not.w   d1
Write_Src:
        move.w  d1,(a5,a3.l)
_bitplanedone:
        move.l  d0,d7                   ;2 1
        move.l  d5,d1                   ;was pre-loaded with next bitplane start
        ble.s   _skipbitplane
        move.l  #$01010101,d5           ; +
        move.l  d1,a3                   ; +
        move.l  d2,d1
        and.l   #$10101010,d7
;       and.l   #$01010101,d1
        and.l   d5,d1                   ; +
        ror.l   #1,d0
        or.l    d7,d1
        lsr.l   #1,d2
        mulu.l  #$01020408,d1
;       move.l  d5,a3
;       move.l  d4,d7
;       move.l  d6,d5
;       and.l   #$10101010,d7
        move.l  d5,d7                   ; +
        add.l   d7,d7                   ; +
        and.l   d4,d7                   ; +
;       and.l   #$01010101,d5
        and.l   d6,d5                   ; +
        ror.l   #1,d4
        or.l    d5,d7
        lsr.l   #1,d6
        mulu.l  #$01020408,d7
        swap    d1
        rol.l   #8,d7
        addq.l  #4,a0
        move.b  d7,d1
        move.l  (a0),d5
        jmp     (a1)                    ;write data with mask
20 April 2024, 19:28 | #74 | ||
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,310
|
Not visible here. a4 is the "write data with mask" function, and a6 is the end-of-line pointer to which a2 is compared.
Look, I'm not using this code anyhow, so this does not make much sense, and you're missing the point. It does not matter to squeeze a couple of cycles from this loop as long as only 16 bits are written at a time. The mulu approach is not the right one to begin with. It looked like a neat idea initially because it keeps the instruction count low, but mulu is apparently eating too many cycles compared to a couple of manual "bit folding" instructions, at least on the 030 and 040. On the 68060, it should be faster, but the net benefit is zero - the problem sits elsewhere. Quote:
Quote:
Yes, but it violates a couple of constraints I have - it reads potentially invalid memory, causing hits or crashes, and it does not implement minterms. Besides, the code is not maintainable and scales poorly. |
20 April 2024, 19:47 | #75 | |||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
Quote:
Quote:
Quote:
It cannot become invalid if you lock it during the whole process. |
20 April 2024, 20:27 | #76 | ||
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,310
|
Quote:
Quote:
Err, what? That's not a matter of "locking". It is a matter of "reaching the end of the RAM". |
20 April 2024, 20:58 | #77 | ||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
Quote:
Quote:
Shouldn't happen anyway if you align the source. |
20 April 2024, 22:56 | #78 | |
Registered User
Join Date: Jan 2019
Location: Germany
Posts: 3,310
|
Quote:
The source is wherever the source happens to be - there is no control over alignment - and yes, it surely is a problem if you find hardware registers before or behind the source bitmap, if that is a graphics card.
21 April 2024, 08:35 | #79 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
Quote:
The video ram should have its start and end longword aligned. It can't be anywhere. Just clip correctly and perform aligned reads. |
21 April 2024, 08:37 | #80 | ||
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,732
|
Quote:
Quote:
How do you know that it isn't 'sufficient to saturate the bandwidth of the bus'? |