68060 change/use register stalls

Nut · 13 April 2012, 15:50

Hello guys,

I'm reading the 68060 user manual, page 306, paragraph 3. It talks about change/use register stalls. I'm not quite sure how this works, so I'd appreciate it if you could enlighten me a bit.

My code is like this

Code:

scs.b   d6              ; poep
subq.l  #$1,d7          ; soep
add.w   (a2,d6*2),a6    ; poep
subq.l  #$1,d6          ; soep
move.w  $2(a6),a0       ; poep

As I understand it from the manual, there are two places where this code will stall. The first is that I'm writing d6 and then using it in an address calculation; the second is that I'm writing a6 and then using it in an address calculation. Both stalls are of the change/use type the document describes.

According to the document, the first stall is 3 cycles and the second stall is 2 cycles.

My question is: where do you start counting the stall? From the cycle after the instruction that is the root cause of the stall (the one changing d6 or a6), or from the root-cause instruction itself? I guess it's the next cycle after, as the document has an example where a mulu instruction causes a stall of 2 cycles, but mulu already takes 2 cycles by itself.

Example

Code:

add.w   d0,a0           ; poep
subq.l  #$1,d7          ; soep
move.w  (a0),d0         ; poep

There is a stall of 2 cycles, right? Counting from the move.w command?

How does this behave?

Code:

add.w   d0,a0           ; poep
subq.l  #$1,d7          ; soep
subq.l  #$1,d6          ; poep
subq.l  #$1,d5          ; soep
move.w  (a0),d0         ; poep

Is there a stall of 1 now? Or no stall?

I don't have a 68060 card and I can't test this with WinUAE. I would still like to optimize my code for the 060. My first code was particularly problematic as it would stall 3+2 cycles, but I could do some reordering. I was optimizing my code just for parallel execution (poep/soep), but now that I think about the stalls, it looks bad for that particular part.

Anything else I should know about stalls? (not cache stalls)

Wepl · 13 April 2012, 19:25

which page is 306?
my book isn't numbered that way. you probably have only the pdf.

matthey · 14 April 2012, 02:00

@Wepl
The Instruction Execution Timing Section 10-10.

@Nut
The change/use stall occurs at the <ea> calculation of an indirect addressing mode, when the result of a previously updated register is not available yet because of pipeline delays. It's best to use 32-bit longword instruction results, as they can sometimes be forwarded, which reduces the delay. You have the right idea with rescheduling instructions and pOEP/sOEP operation.

Wepl · 14 April 2012, 12:02

I can confirm your guess: according to the manual, the stall is counted from the end of the root-cause instruction.
In the last example this would make a stall of 1 cycle.
It would be best to verify this by testing on a real 68060. If you write test code I can execute it for you.

Photon · 14 April 2012, 17:09

The number of cycles of delay depends on which stage in the pipeline has to wait for the needed value (in this case the ea stage) and on which of the two pipes is occupied, odd or even. Each stage is a cycle.

Since optimization only matters in loops executed many times, calculating the exact stall cycles only matters in loops that also fit in the instruction cache and access no memory. So except for those rare cases, thinking about cache use and separating each load from its use with 1 instruction is a much saner approach to milking performance.

In other words, I'm an anal optimizer myself but I would save counting stage cycles for those very rare loops.

In more complex CPUs like this one, you can never ever do timing and optimizing in theory, so if you don't have a real Amiga with 060 I really recommend getting one.

I found the timing-by-rasterline method more appropriate than ever before, because you can't have all the performance aspects (dirty cache lines, addressing mode extra cycles, which instruction is odd or even, branch behavior) 100% in your head when coding, only "in theory".

And if the loop doesn't even take a rasterline you don't need to optimize it, another rule of thumb I have.

Timing-by-rasterline is either 1) change $dff180 before the call or loop and change it back after, or 2) save $dff004.l to someplace after. If the background-colored chunk has visibly shrunk or the 2nd byte of the longword has decreased, you have optimized the routine.
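
A minimal sketch of both variants combined, for illustration only. Routine and LineBefore are hypothetical names, and the sketch assumes that nothing else (copper list, OS) rewrites the background colour meanwhile and that Routine finishes within one frame:

```asm
; Sketch only: Routine and LineBefore are hypothetical names.
; $dff004 read as a longword gives VPOSR/VHPOSR (beam position),
; $dff180 is COLOR00 (background colour).
TimeIt:
        move.l  $dff004,d0
        and.l   #$0001ff00,d0   ; keep vertical position bits V8-V0
        lsr.l   #8,d0
        move.l  d0,LineBefore   ; raster line before the routine
        move.w  #$0f00,$dff180  ; 1) background red while the routine runs
        bsr     Routine
        move.w  #$0000,$dff180  ;    background back to black
        move.l  $dff004,d0      ; 2) beam position after the routine
        and.l   #$0001ff00,d0
        lsr.l   #8,d0
        sub.l   LineBefore,d0   ; d0 = raster lines taken (no frame wrap assumed)
        rts

LineBefore:
        dc.l    0
```

The height of the red band on screen and the value left in d0 should both shrink when a change to Routine makes it faster.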

Nut · 17 April 2012, 20:07

Thanks dudes! I now understand the nature of the stall. It's the pipeline design of the execution engine. If you change a register just before a command that needs <effective address> calculation, and the register you changed is part of that calculation, the <ea> has to be calculated again. That's why it stalls 1-3 cycles depending on what stage of the pipeline the new correct <ea> (or something else) comes from. Yes, this makes it much more understandable.

I have the PDF(s) of all 68k processors. I'm only optimizing the essential inner loops that consume 95% of the CPU time in my program. I don't care much about the rest of the code. I'm trying to optimize for both 040 and 060, and I just suppose the code runs fine on 020 and 030 too if I do that. I particularly want to squeeze everything out of the 060 superscalar design. Yes, it's theory mostly, but theory mostly works.

I'm not reading or writing memory much, so the caches do what they do. Basically I'm just reading stuff in linearly and then writing it out linearly after processing. Only single memory reads and writes now and then.

I want to ask a couple of other questions on the poep/soep parallel execution. User manual 060 page 304, paragraphs 5 & 6...
It says there are a few important exceptions to the rule that the result of the poep instruction can't be used in the soep during the same cycle. Usually this can't be done, except in these two cases:

1. is a long move from <ea> to register and using that on soep

Code:

move.l  (a0),d1         ; poep
add.l   d1,d2           ; soep

or

move.l  d0,d1           ; poep
add.l   d1,d2           ; soep

This works because d1 result is known before the execution of the commands, and the move is long so there is no unknown component on the register d1, I think.

2. is a move (any size?) from register to <mem> after poep

Code:

add.l       d2,d1       ; poep
move.l/w/b  d1,(a0)     ; soep

This works because the result is sent to memory and no register has to be updated, I think. So these were the two exceptions to the rule (according to user manual).

Okay, in my code I do this many times. I really like this construct because it's so efficient, but I'm worried that it doesn't in fact work in one cycle.

Code:

add.l   d2,d1           ; poep
move.l  d1,d0           ; soep

Does this work or not? Can somebody actually test this? Another variation of the same idea which I would like to use is

Code:

add.w   (a0),a1         ; poep
move.l  a1,d0           ; soep

I understand that for both of these d1/a1 result is not known before execution, but I was hoping the CPU would be intelligent enough to be able to send the result directly to another register during the same cycle. It's like case2 but instead of memory send it to a register (long move).

Please test these two. With this info I could optimize my innerloops for poep/soep (and avoid stalls). I really like optimizing btw

Photon · 18 April 2012, 00:04

You're welcome.

A result is never available to the same pipe in a parallel-pipe CPU at the same cycle, so no matter what you do, you can't fit those 4 examples in 1 (2xparallel instructions) cycle. Assuming the last 2 examples are in a loop that is already in the ICACHE, the first will take 2 cycles (odd or even), and the second will take the 2+DCACHE cycles (odd or even). The pipes are separate and their results are available at the same time, so the only thing you can do to optimize is to pad with other useful in-loop instructions that are suitable for the free even/odd slots.

Again, it doesn't help to "know a specific piped+cached CPU well enough", the factors are too many to write a perfect theoretically optimized loop. If you have a single memory-access instruction in the loop, the penalty for cache misses is much greater than any pipe-optimization you can conjure up.

The "sane approach" recommended above reaps rewards on all CPUs, and it would have optimized your 4 examples and answered your question.

You want the answer to which is faster on the 68060, but in my opinion it's always a mistake to optimize code supporting 680x0 for the fastest CPU rather than the slowest CPU. The fast one will always be faster and fewer people will have it. If the target is 68060-mainly (not only) it's a different matter, but only us demofreaks do that, I think.

If you are coding a demo for AGA/060 then there is absolutely no way to know without getting one of those and running the code. And even then there are differences between memory interfaces (and HDD interfaces if you are loading, and OS performance if you are accessing the OS during the demo) requiring discretion.

Nut · 26 April 2012, 16:17

Quote:

A result is never available to the same pipe in a parallel-pipe CPU at the same cycle, so no matter what you do, you can't fit those 4 examples in 1 (2xparallel instructions) cycle. Assuming the last 2 examples are in a loop that is already in the ICACHE, the first will take 2 cycles (odd or even), and the second will take the 2+DCACHE cycles (odd or even). The pipes are separate and their results are available at the same time, so the only thing you can do to optimize is to pad with other useful in-loop instructions that are suitable for the free even/odd slots.

Yes, but the first two examples were exceptions to the rule that the user manual mentions. They do work. Haven't tested but I explained exactly how it's said in the manual. They do run in 1 cycle. (case1, case2) The magic is that what seems to be sent to soep during the same cycle, is not really sent, but the register contents are already known during the execution stage, they are figured out earlier in the pipelines. Check out the manual if you like, or my explanations, you'll figure why they do work.

I was wondering if the CPU would be intelligent enough to send the result to another register during "soep cycle" (last 2 examples). If those loops were unrolled it would work btw after the first poep instruction (soep/poep, soep/poep, ...) see case1. Long move can be "forwarded" to soep during the same cycle. I'm a little sad it doesn't work for a register (last 2 examples), but it does work for memory output (case2). It's just STUPID!! Why couldn't it just forward the result to another register and be available at the next cycle?? The long move is just forwarding. Like case1, but the other way around.

Anyway, I was able to reorder my code to avoid all stalls in the inner loops. I also did some other optimizations. I think it's good enough now, since cache misses would likely mask any small speed improvements anyway. After all, I must read data in and write it out, so I can't avoid an occasional cache miss, I think. I'm doing some software audio mixing, and it's like a stream of data that goes in and comes out in a constant flow.

britelite · 27 April 2012, 09:08

Code:

move.l  (a0),d1         ; poep
add.l   d1,d2           ; soep

In this case I would do a soep-instruction that doesn't use d1 in between these two instructions, just in case the memory being read isn't in cache.
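
A sketch of that spacing, assuming there is some useful in-loop work to move into the gap. The subq.l on d7 here is only a hypothetical filler, not taken from the code above:

```asm
; Sketch only: the subq.l is a hypothetical filler instruction.
        move.l  (a0),d1         ; load d1, may miss the data cache
        subq.l  #1,d7           ; independent work that does not touch d1
        add.l   d1,d2           ; first use of d1, one instruction later
```

Whether this actually gains anything would still need to be measured on a real 68060.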
