English Amiga Board


Old 13 April 2012, 15:50   #1
Nut
Registered User
 
Join Date: Feb 2010
Location: Helsinki, Finland
Posts: 36
Post 68060 change/use register stalls

Hello guys,

I'm reading the 68060 user manual, page 306, paragraph 3. It talks about change/use register stalls. I'm not quite sure how this works, so I'd appreciate it if you could enlighten me a bit.

My code is like this

Code:
scs.b   d6              ; poep
subq.l  #$1,d7          ; soep
add.w   (a2,d6*2),a6    ; poep
subq.l  #$1,d6          ; soep
move.w  $2(a6),a0       ; poep
As I understand it from the manual, there are two places where this code will stall. First, I'm writing d6 and then using it in an address calculation; second, I'm writing a6 and then using it in an address calculation. Both stalls are of the change/use type the manual describes.

According to the manual, the first stall is 3 cycles and the second stall is 2 cycles.

My question is: where do you start counting the stall? After the instruction that is the root cause of the stall (changing d6, changing a6) has finished, or from the start of the root-cause instruction? I guess it's the next cycle after, as the manual has an example where a mulu instruction causes a stall of 2 cycles, but mulu already takes 2 cycles in itself.

Example

Code:
add.w   d0,a0           ; poep
subq.l  #$1,d7          ; soep
move.w  (a0),d0         ; poep
There is a stall of 2 cycles, right? Counting from the move.w command?

How does this behave?

Code:
add.w   d0,a0           ; poep
subq.l  #$1,d7          ; soep
subq.l  #$1,d6          ; poep
subq.l  #$1,d5          ; soep
move.w  (a0),d0         ; poep
Is there a stall of 1 now? Or no stall?

I don't have a 68060 card and I can't test this with WinUAE. I would still like to optimize my code for the 060. My first code was particularly problematic, as it would stall 3+2 cycles, but I could do some reordering. I was optimizing my code just for parallel execution (poep/soep), but now that I think about the stalls, that particular part looks bad.

Anything else I should know about stalls? (not cache stalls)
Old 13 April 2012, 19:25   #2
Wepl
Moderator
 
Join Date: Nov 2001
Location: Germany
Posts: 764
which page is 306?
my book isn't numbered that way. you probably have only the pdf.
Old 14 April 2012, 02:00   #3
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
@Wepl
The Instruction Execution Timing Section 10-10.

@Nut
The change/use stall occurs when the <ea> calculation of an indirect addressing mode needs a register that was updated shortly before, and the result of that update is not available yet because of pipeline delays. It's best to use 32 bit longword instruction results, as they can sometimes be forwarded, reducing the delay. You have the right idea with rescheduling instructions and pOEP/sOEP operation.

Last edited by matthey; 14 April 2012 at 02:32.
Old 14 April 2012, 12:02   #4
Wepl
Moderator
 
Join Date: Nov 2001
Location: Germany
Posts: 764
I can confirm your guess: according to the manual, the stall is counted from the end of the root-cause instruction.
In the last example this would make a stall of 1 cycle.
It would be best to verify this by testing on a real 68060. If you write test code I can execute it for you.
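
To make the counting concrete, here are the examples from the first post annotated under that reading. This is only a sketch: it assumes the 2-cycle change/use stall for a0 quoted above and that every poep/soep pair issues in one cycle. The extra subq pair on d4/d3 in (c) is a hypothetical filler that is not in the original posts, and none of these numbers have been measured on hardware.

Code:
; (a) use directly after the change: expect a 2-cycle stall
add.w   d0,a0           ; poep  a0 changed here
subq.l  #$1,d7          ; soep
move.w  (a0),d0         ; poep  a0 needed for <ea>, waits 2 cycles

; (b) one independent pair in between: expect a 1-cycle stall
add.w   d0,a0           ; poep  a0 changed here
subq.l  #$1,d7          ; soep
subq.l  #$1,d6          ; poep  independent work covers 1 cycle
subq.l  #$1,d5          ; soep
move.w  (a0),d0         ; poep  waits the remaining 1 cycle

; (c) two independent pairs in between: expect no stall
add.w   d0,a0           ; poep  a0 changed here
subq.l  #$1,d7          ; soep
subq.l  #$1,d6          ; poep  independent work covers cycle 1
subq.l  #$1,d5          ; soep
subq.l  #$1,d4          ; poep  hypothetical filler covers cycle 2
subq.l  #$1,d3          ; soep
move.w  (a0),d0         ; poep  a0 should be available by now

In other words, each independent pair placed between the change and the use should hide one cycle of the stall, if the manual is read correctly here.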
Old 14 April 2012, 17:09   #5
Photon
Moderator

 
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 4,954
The number of cycles of delay depends on which stage in the pipeline has to wait for the needed value (in this case the ea stage) and which of the two pipes is occupied, odd or even. Each stage is a cycle.

As optimization is unimportant except in loops executed many times, calculating the exact stall cycles is unimportant except in such loops that also fit in the instruction cache and access no memory. So except for those rare cases, thinking about cache use and interleaving load-use with 1 instruction is a much saner approach to milking performance.

In other words, I'm an anal optimizer myself but I would save counting stage cycles for those very rare loops.

In more complex CPUs like this one, you can never ever do timing and optimizing in theory, so if you don't have a real Amiga with 060 I really recommend getting one.

I found the timing-by-rasterline method more appropriate than ever before, because you can't have all the performance aspects (dirty cache lines, addressing mode extra cycles, which instruction is odd or even, branch behavior) 100% in your head when coding, only "in theory". And if the loop doesn't even take a rasterline you don't need to optimize it, another rule of thumb I have.

Timing-by-rasterline means either 1) changing $dff180 (the background color) before the call or loop and changing it back after, or 2) saving $dff004.l somewhere after the routine. If the background-colored chunk visibly shrank or the 2nd byte of the longword decreased, you optimized the routine.
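
A minimal sketch of both variants, assuming a hypothetical routine label Routine and a hypothetical longword variable RasterPos (neither is from this thread), and assuming nothing else writes the background color while the routine runs. $dff180 is COLOR00, and reading $dff004 as a longword returns VPOSR and VHPOSR together.

Code:
; 1) color bar: the height of the red chunk shows the time taken
        move.w  #$0f00,$dff180  ; COLOR00 = red while the routine runs
        bsr     Routine         ; hypothetical routine under test
        move.w  #$0000,$dff180  ; COLOR00 back to black

; 2) beam position: store where the beam is when the routine ends
        bsr     Routine         ; hypothetical routine under test
        move.l  $dff004,d0      ; VPOSR in the high word, VHPOSR in the low word
        move.l  d0,RasterPos    ; bits 8-15 = low 8 bits of the vertical position

RasterPos:
        dc.l    0               ; hypothetical storage for the saved beam position

For the saved value to be comparable between runs, the routine has to start at the same beam position every time, for example right after waiting for a fixed line.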

Last edited by Photon; 14 April 2012 at 17:14.
Old 17 April 2012, 20:07   #6
Nut
Registered User
 
Join Date: Feb 2010
Location: Helsinki, Finland
Posts: 36
Thanks dudes! I now understand the nature of the stall. It's the pipeline design of the execution engine. If you change a register just before a command that needs <effective address> calculation, and the register you changed is part of that calculation, the <ea> has to be calculated again. That's why it stalls 1-3 cycles depending on what stage of the pipeline the new correct <ea> (or something else) comes from. Yes, this makes it much more understandable.

I have the PDF(s) of all 68k-processors. I'm only optimizing essential inner loops that consume 95% of CPU time in my program. The rest of the code I don't care much about. I'm trying to optimize for both 040 and 060, and just suppose they run fine on 020 and 030 too, if I do that. I particularly want to squeeze everything out of the 060 superscalar design. Yes, it's theory mostly, but theory mostly works. I'm not reading or writing memory much, so the caches do what they do. Basically just reading stuff linearly in and then writing it linearly out after processing. Only single memory reads and writes now and then.



I want to ask a couple of other questions on the poep/soep parallel execution. User manual 060 page 304, paragraph 5 & 6...
It says there are a few important exceptions to the rule about using the result of the poep instruction in the soep during the same cycle. Usually this can't be done, except in these two cases:

1. A long move from <ea> to a register in the poep, with the soep using that register

Code:
move.l  (a0),d1         ; poep
add.l   d1,d2           ; soep

; or

move.l  d0,d1           ; poep
add.l   d1,d2           ; soep
This works because d1 result is known before the execution of the commands, and the move is long so there is no unknown component on the register d1, I think.

2. A move (any size?) from a register to <mem> in the soep, using the result of the poep

Code:
add.l       d2,d1       ; poep
move.l/w/b  d1,(a0)     ; soep (any of the three sizes)
This works because the result is sent to memory and no register has to be updated, I think. So these were the two exceptions to the rule (according to user manual).



Okay, in my code I do this many times. I really like this construct because it's so efficient. But I'm worried it doesn't work in one cycle, in fact.

Code:
add.l   d2,d1           ; poep
move.l  d1,d0           ; soep
Does this work or not? Can somebody actually test this? Another variation of the same idea which I would like to use is

Code:
add.w   (a0),a1         ; poep
move.l  a1,d0           ; soep
I understand that for both of these d1/a1 result is not known before execution, but I was hoping the CPU would be intelligent enough to be able to send the result directly to another register during the same cycle. It's like case2 but instead of memory send it to a register (long move).

Please test these two. With this info I could optimize my innerloops for poep/soep (and avoid stalls). I really like optimizing btw
Old 18 April 2012, 00:04   #7
Photon
Moderator

 
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 4,954
You're welcome.

A result is never available to the same pipe in a parallel-pipe CPU at the same cycle, so no matter what you do, you can't fit those 4 examples in 1 (2xparallel instructions) cycle. Assuming the last 2 examples are in a loop that is already in the ICACHE, the first will take 2 cycles (odd or even), and the second will take the 2+DCACHE cycles (odd or even). The pipes are separate and their results are available at the same time, so the only thing you can do to optimize is to pad with other useful in-loop instructions that are suitable for the free even/odd slots.

Again, it doesn't help to "know a specific piped+cached CPU well enough", the factors are too many to write a perfect theoretically optimized loop. If you have a single memory-access instruction in the loop, the penalty for cache misses is much greater than any pipe-optimization you can conjure up.

The "sane approach" recommended above reaps rewards on all CPUs, and it would have optimized your 4 examples and answered your question.

You want the answer to which is faster on the 68060, but it's my opinion that it's always a mistake to optimize code supporting 680x0 for the fastest CPU rather than the slowest CPU. The fast one will always be faster and fewer people will have it. If the target is 68060-mainly (not only) it's a different matter, but only us demofreaks do that, I think.

If you are coding a demo for AGA/060 then there is absolutely no way to know without getting one of those and running the code. And even then there are differences between memory interfaces (and HDD interfaces if you are loading, and OS performance if you are accessing the OS during the demo) requiring discretion.
Old 26 April 2012, 16:17   #8
Nut
Registered User
 
Join Date: Feb 2010
Location: Helsinki, Finland
Posts: 36
Quote:
A result is never available to the same pipe in a parallel-pipe CPU at the same cycle, so no matter what you do, you can't fit those 4 examples in 1 (2xparallel instructions) cycle. Assuming the last 2 examples are in a loop that is already in the ICACHE, the first will take 2 cycles (odd or even), and the second will take the 2+DCACHE cycles (odd or even). The pipes are separate and their results are available at the same time, so the only thing you can do to optimize is to pad with other useful in-loop instructions that are suitable for the free even/odd slots.
Yes, but the first two examples were exceptions to the rule that the user manual mentions. They do work. Haven't tested but I explained exactly how it's said in the manual. They do run in 1 cycle. (case1, case2) The magic is that what seems to be sent to soep during the same cycle, is not really sent, but the register contents are already known during the execution stage, they are figured out earlier in the pipelines. Check out the manual if you like, or my explanations, you'll figure why they do work.

I was wondering if the CPU would be intelligent enough to send the result to another register during "soep cycle" (last 2 examples). If those loops were unrolled it would work btw after the first poep instruction (soep/poep, soep/poep, ...) see case1. Long move can be "forwarded" to soep during the same cycle. I'm a little sad it doesn't work for a register (last 2 examples), but it does work for memory output (case2). It's just STUPID!! Why couldn't it just forward the result to another register and be available at the next cycle?? The long move is just forwarding. Like case1, but the other way around.

Anyway, I was able to reorder my code to avoid all stalls in the inner loops. I also did some other optimizations. I think it's good enough now, since cache misses would likely mask any small speed improvements anyway. After all, I must read in data and write it out. Can't avoid an occasional cache miss, I think. I'm doing some software audio mixing and it's like a stream of data that goes in and comes out in a constant flow.
Old 27 April 2012, 09:08   #9
britelite
Registered User
 
Join Date: Feb 2010
Location: Espoo / Finland
Posts: 787
Code:
move.l  (a0),d1         ; poep
add.l   d1,d2           ; soep
In this case I would do a soep-instruction that doesn't use d1 in between these two instructions, just in case the memory being read isn't in cache.
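
Something like the following, as a sketch only. The subq on d7 is a placeholder for any useful instruction that doesn't touch d1 (it is borrowed from the earlier examples in this thread), and how the instructions actually pair depends on the surrounding code.

Code:
move.l  (a0),d1         ; poep  load, may miss the data cache
subq.l  #$1,d7          ; soep  independent filler, does not use d1
add.l   d1,d2           ; poep of the next pair, d1 is consumed one pair later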