13 April 2012, 15:50 | #1 |
Registered User
Join Date: Feb 2010
Location: Helsinki, Finland
Posts: 36
|
68060 change/use register stalls
Hello guys,
I'm reading the 68060 user manual, page 306, paragraph 3, which covers change/use register stalls. I'm not quite sure how this works, so I'd appreciate it if you could enlighten me a bit. My code looks like this:
Code:
    scs.b   d6                ; poep
    subq.l  #$1,d7            ; soep
    add.w   (a2,d6*2),a6      ; poep
    subq.l  #$1,d6            ; soep
    move.w  $2(a6),a0         ; poep
According to the manual, the first stall is 3 cycles and the second stall is 2 cycles. My question is: where do you start counting the stall? From the cycle after the instruction that is the root cause of the stall (the one changing d6 or a6), or from the root-cause instruction itself? I guess it's the next cycle after, as the manual has an example where a mulu instruction causes a stall of 2 cycles, but mulu already takes 2 cycles by itself.
Example
Code:
    add.w   d0,a0             ; poep
    subq.l  #$1,d7            ; soep
    move.w  (a0),d0           ; poep
How does this behave?
Code:
    add.w   d0,a0             ; poep
    subq.l  #$1,d7            ; soep
    subq.l  #$1,d6            ; poep
    subq.l  #$1,d5            ; soep
    move.w  (a0),d0           ; poep
I don't have a 68060 card and I can't test this with WinUAE. I would still like to optimize my code for the 060. My first code was particularly problematic, as it would stall 3+2 cycles, but I could do some reordering. I was optimizing my code just for parallel execution (poep/soep), but now that I think about the stalls, that particular part looks bad.
Anything else I should know about stalls? (not cache stalls) |
13 April 2012, 19:25 | #2 |
Moderator
Join Date: Nov 2001
Location: Germany
Posts: 866
|
which page is 306?
my book isn't numbered that way. you probably have only the pdf. |
14 April 2012, 02:00 | #3 |
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
@Wepl
The Instruction Execution Timing Section 10-10.
@Nut
The change/use stall occurs at the <ea> calculation of an indirect addressing mode when the result of a previously updated register is not available yet because of pipeline delays. It's best to use 32-bit longword instructions, as their results can sometimes be forwarded, which reduces the delay. You have the right idea with rescheduling instructions and pOEP/sOEP operation.
Last edited by matthey; 14 April 2012 at 02:32. |
14 April 2012, 12:02 | #4 |
Moderator
Join Date: Nov 2001
Location: Germany
Posts: 866
|
I can confirm your guess: according to the manual, the stall is counted from the end of the root-cause instruction.
In the last example this would make a stall of 1 cycle. It would be best to verify this by testing on a real 68060. If you write test code, I can execute it for you. |
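Test code for this could look roughly like the sketch below. It is only an illustration: the labels `Near` and `Far`, the repeat counts, and the register choices are made up, and it assumes a Motorola-syntax assembler that supports `REPT`/`ENDR` and `.local` labels. Both routines execute exactly the same instructions and differ only in how far the use of a0 sits from the instruction that changes it, mirroring the two examples in the first post. Any timing difference measured on a real 68060 should then come from the change/use stall rather than from instruction count.

```asm
; Hypothetical test kernels for the change/use stall; not taken from the manual.
; Assumption: a0 points at a readable, word-aligned buffer on entry.
; d0 is kept at 0 so a0 never moves; d4-d6 are scratch, d7 is the loop counter.

Near:                                   ; change and use one instruction apart
        moveq   #0,d0
        move.w  #1000-1,d7
.loop:
        REPT    16
        add.w   d0,a0                   ; changes a0
        subq.l  #1,d6                   ; filler
        move.w  (a0),d1                 ; uses a0 as base register
        subq.l  #1,d5                   ; filler
        subq.l  #1,d4                   ; filler
        ENDR
        dbf     d7,.loop
        rts

Far:                                    ; same instructions, use moved three fillers away
        moveq   #0,d0
        move.w  #1000-1,d7
.loop:
        REPT    16
        add.w   d0,a0                   ; changes a0
        subq.l  #1,d6                   ; filler
        subq.l  #1,d5                   ; filler
        subq.l  #1,d4                   ; filler
        move.w  (a0),d1                 ; uses a0 as base register
        ENDR
        dbf     d7,.loop
        rts
```

The unrolled body is small enough to stay in the instruction cache, and the single word read always hits the same address, so cache effects should be the same for both routines. How the instructions pair up across the unrolled copies is not verified here and would have to be read off the measurement.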
14 April 2012, 17:09 | #5 |
Moderator
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,602
|
The number of cycles of delay depends on which stage in the pipeline has to wait for the needed value (in this case the ea stage) and which of the two pipes is occupied, odd or even. Each stage is a cycle.
As optimization is unimportant except in loops executed many times, calculating the exact stall cycles is unimportant except in such loops that also fit in the instruction cache and access no memory. So except for those rare cases, thinking about cache use and interleaving load-use with 1 instruction is a much more sane approach to milking performance. In other words, I'm an anal optimizer myself, but I would save counting stage cycles for those very rare loops.
In more complex CPUs like this one, you can never ever do timing and optimizing in theory, so if you don't have a real Amiga with 060 I really recommend getting one. I found the timing-by-rasterline method more appropriate than ever before, because you can't have all the performance aspects (dirty cache lines, addressing mode extra cycles, which instruction is odd or even, branch behavior) 100% in your head when coding, only "in theory". And if the loop doesn't even take a rasterline you don't need to optimize it, another rule of thumb I have.
Timing-by-rasterline is either
1) change $dff180 before the call or loop, and change it back after, or
2) save $dff004.l to someplace after.
If the background-colored chunk visibly shrunk or the 2nd byte of the longword decreased, you optimized the routine.
Last edited by Photon; 14 April 2012 at 17:14. |
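The two timing-by-rasterline variants could be sketched as follows. This is a hedged illustration, not Photon's actual code: `TimeIt`, `InnerLoop`, `LinesTaken`, and `START_LINE` are hypothetical names, the start line is arbitrary, and it assumes nothing else (a copper list, for instance) rewrites the background colour while the routine runs. Waiting for a fixed raster line first is an addition so that the saved beam position can be compared between runs.

```asm
; Hypothetical harness; InnerLoop is a placeholder for the routine under test.
; Assumption: interrupts and DMA are quiet enough that timings repeat per frame.

START_LINE equ  $30                     ; arbitrary raster line to start from

TimeIt:
.wait:  move.l  $dff004,d0              ; VPOSR/VHPOSR read as one longword
        and.l   #$0001ff00,d0           ; keep vertical position V8-V0
        cmp.l   #START_LINE<<8,d0
        bne.s   .wait                   ; spin until the beam reaches START_LINE

        move.w  #$0f00,$dff180          ; 1) background red while the routine runs
        bsr     InnerLoop
        move.w  #$0000,$dff180          ;    and back to black afterwards

        move.l  $dff004,d0              ; 2) sample the beam position afterwards
        lsr.l   #8,d0                   ; drop the horizontal position
        and.l   #$1ff,d0                ; d0 = raster line reached
        sub.l   #START_LINE,d0          ; d0 = raster lines taken
        move.l  d0,LinesTaken
        rts

InnerLoop:
        rts                             ; replace with the code to be timed

LinesTaken:
        dc.l    0
```

The line count is only meaningful if the routine finishes within the same frame; a run that crosses the end of the frame wraps around and needs a frame counter on top.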
17 April 2012, 20:07 | #6 |
Registered User
Join Date: Feb 2010
Location: Helsinki, Finland
Posts: 36
|
Thanks dudes! I now understand the nature of the stall. It's the pipeline design of the execution engine. If you change a register just before a command that needs <effective address> calculation, and the register you changed is part of that calculation, the <ea> has to be calculated again. That's why it stalls 1-3 cycles depending on what stage of the pipeline the new correct <ea> (or something else) comes from. Yes, this makes it much more understandable.
I have the PDF(s) of all 68k-processors. I'm only optimizing the essential inner loops that consume 95% of CPU time in my program. I don't care much about the rest of the code. I'm trying to optimize for both 040 and 060, and I just assume the code runs fine on 020 and 030 too if I do that. I particularly want to squeeze everything out of the 060 superscalar design. Yes, it's theory mostly, but theory mostly works. I'm not reading or writing memory much, so the caches do what they do. Basically I'm just reading stuff in linearly and then writing it out linearly after processing, with only single memory reads and writes now and then.
I want to ask a couple of other questions on the poep/soep parallel execution. The 060 user manual, page 304, paragraphs 5 & 6, says there are a few important exceptions to the rule about using the result of the poep in the soep during the same cycle. Usually this can't be done, except for these two cases:
1. A long move from <ea> to a register, with that register then used in the soep.
Code:
    move.l  (a0),d1           ; poep
    add.l   d1,d2             ; soep
or
    move.l  d0,d1             ; poep
    add.l   d1,d2             ; soep
2. A move (any size?) from a register to <mem> in the soep, after the poep has changed that register.
Code:
    add.l   d2,d1             ; poep
    move.l/w/b d1,(a0)        ; soep
Okay, in my code I do the following many times. I really like this construct because it's so efficient, but I'm worried that it doesn't in fact execute in one cycle.
Code:
    add.l   d2,d1             ; poep
    move.l  d1,d0             ; soep
Code:
    add.w   (a0),a1           ; poep
    move.l  a1,d0             ; soep
Please test these two. With this info I could optimize my inner loops for poep/soep (and avoid stalls). I really like optimizing, btw. |
18 April 2012, 00:04 | #7 |
Moderator
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,602
|
You're welcome.
A result is never available to the same pipe in a parallel-pipe CPU at the same cycle, so no matter what you do, you can't fit those 4 examples in 1 (2xparallel instructions) cycle. Assuming the last 2 examples are in a loop that is already in the ICACHE, the first will take 2 cycles (odd or even), and the second will take 2+DCACHE cycles (odd or even). The pipes are separate and their results are available at the same time, so the only thing you can do to optimize is to pad with other useful in-loop instructions that are suitable for the free even/odd slots.
Again, it doesn't help to "know a specific piped+cached CPU well enough"; the factors are too many to write a perfect theoretically optimized loop. If you have a single memory-access instruction in the loop, the penalty for cache misses is much greater than any pipe-optimization you can conjure up. The "sane approach" recommended above reaps rewards on all CPUs, and it would have optimized your 4 examples/answered your question.
You want the answer to which is faster on 68060, but it's my opinion that it's always a mistake to optimize code supporting 680x0 for the fastest CPU rather than the slowest CPU. The fast one will always be faster and fewer people will have it. If the target is 68060-mainly (not only) it's a different matter, but only us demofreaks do that, I think.
If you are coding a demo for AGA/060 then there is absolutely no way to know without getting one of those and running the code. And even then there are differences between memory interfaces (and HDD interfaces if you are loading, and OS performance if you are accessing the OS during the demo) requiring discretion. |
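Padding the free slots can be pictured with the sketch below, which interleaves the `add.l d2,d1` / `move.l d1,d0` pair from the question with a second, independent chain. The second chain and its registers (d3-d5) are made up for illustration and stand for whatever other work the loop has to do anyway. No cycle counts are claimed here; whether each pair really dual-issues would have to be checked on a real 68060.

```asm
; Hypothetical interleaving of two independent dependency chains.
; Chain A is the pair from the question; chain B is placeholder work.
        add.l   d2,d1           ; A: produce d1
        add.l   d5,d4           ; B: produce d4, independent of chain A
        move.l  d1,d0           ; A: consume d1, produced one instruction pair earlier
        move.l  d4,d3           ; B: consume d4, produced one instruction pair earlier
```

The idea is that neither instruction in a pair reads a register written by its partner, so the dependency that blocked the original pairing is moved across a pair boundary instead.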
26 April 2012, 16:17 | #8 | |
Registered User
Join Date: Feb 2010
Location: Helsinki, Finland
Posts: 36
|
Quote:
I was wondering if the CPU would be intelligent enough to send the result to another register during the "soep cycle" (last 2 examples). If those loops were unrolled it would work, btw, after the first poep instruction (soep/poep, soep/poep, ...); see case 1. A long move can be "forwarded" to the soep during the same cycle. I'm a little sad it doesn't work for a register (last 2 examples), but it does work for memory output (case 2). It's just STUPID!! Why couldn't it just forward the result to another register and have it available at the next cycle?? The long move is just forwarding. Like case 1, but the other way around.
Anyway, I was able to reorder my code to avoid all stalls in the inner loops. I also did some other optimizations. I think it's good enough now, since cache misses would likely mask any small speed improvements anyway. After all, I must read in data and write it out. I can't avoid an occasional cache miss, I think. I'm doing some software audio mixing, and it's like a stream of data that goes in and comes out in a constant flow. |
|
27 April 2012, 09:08 | #9 |
Registered User
Join Date: Feb 2010
Location: Espoo / Finland
Posts: 818
|
Code:
    move.l  (a0),d1           ; poep
    add.l   d1,d2             ; soep
|