English Amiga Board - View Single Post

Nut · 17 April 2012, 20:07

Thanks dudes! I now understand the nature of the stall. It's the pipeline design of the execution engine. If you change a register just before an instruction that needs an <effective address> calculation, and the register you changed is part of that calculation, the <ea> has to be calculated again. That's why it stalls 1-3 cycles, depending on which stage of the pipeline the new, correct <ea> (or something else) comes from. Yes, this makes it much more understandable.
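
To check that I've got it right, here is a minimal sketch of what I think the stall looks like and how reordering might avoid it. The registers and the two filler adds are made up for illustration, and whether two independent instructions are enough to hide the whole stall is my assumption, not something I've measured.

Code:

; a0 is changed right before it is needed for the <ea>
        adda.l  d0,a0           ; a0 changes here
        move.l  (a0),d1         ; <ea> uses the new a0, so this should stall
        add.l   d2,d3           ; unrelated work
        add.l   d4,d5           ; unrelated work

; reordered: the unrelated work sits between the change and the use
        adda.l  d0,a0           ; a0 changes here
        add.l   d2,d3           ; does not touch a0
        add.l   d4,d5           ; does not touch a0
        move.l  (a0),d1         ; a0 was changed three instructions earlier

Both versions leave the same values in the registers, but the condition codes at the end come from a different instruction, so the reordering is only safe if nothing tests the flags afterwards.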

I have the PDF(s) for all the 68k processors. I'm only optimizing the essential inner loops that consume 95% of the CPU time in my program. I don't care much about the rest of the code. I'm trying to optimize for both the 040 and the 060, and I just assume the code will run fine on the 020 and 030 too if I do that. I particularly want to squeeze everything out of the 060's superscalar design. Yes, it's mostly theory, but theory mostly works.

I'm not reading or writing memory much, so the caches do what they do. Basically I'm just reading stuff in linearly and then writing it out linearly after processing, with only single memory reads and writes now and then.

I want to ask a couple of other questions on the poep/soep parallel execution. User manual 060 page 304, paragraph 5 & 6...
It says there are a few important exceptions to the rule about using the result of poep in soep during the same cycle. Usually this can't be done, except in these two cases:

1. A long move from <ea> to a register in poep, with that register then used in soep

Code:

move.l  (a0),d1         ; poep
add.l   d1,d2           ; soep

or

move.l  d0,d1           ; poep
add.l   d1,d2           ; soep

This works because the d1 result is known before the instructions execute, and since the move is long there is no unknown part of register d1, I think.

2. A move (any size?) from a register to <mem> in soep, where that register holds the result of poep

Code:

add.l   d2,d1           ; poep
move.l  d1,(a0)         ; soep (.w and .b too?)

This works because the result is sent to memory and no register has to be updated, I think. So these are the two exceptions to the rule (according to the user manual).

Okay, in my code I do the following many times. I really like this construct because it's so efficient, but I'm worried that it doesn't in fact work in one cycle.

Code:

add.l   d2,d1           ; poep
move.l  d1,d0           ; soep

Does this work or not? Can somebody actually test this? Another variation of the same idea that I would like to use is

Code:

add.w   (a0),a1         ; poep
move.l  a1,d0           ; soep

I understand that in both of these the d1/a1 result is not known before execution, but I was hoping the CPU would be intelligent enough to send the result directly to another register during the same cycle. It's like case 2, but sending the result to a register (long move) instead of memory.
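
If it turns out that these don't pair, the fallback I can think of is to interleave two independent calculations so that neither instruction in a pair needs the other's result. This is only a sketch: d3, d4 and d5 stand for a hypothetical second calculation that my loop may or may not have, and I'm assuming each pair is otherwise allowed to run in both pipes.

Code:

        add.l   d2,d1           ; poep, first calculation
        add.l   d4,d3           ; soep, second calculation, independent of d1
        move.l  d1,d0           ; poep, d1 was finished in the previous pair
        move.l  d3,d5           ; soep, d3 was finished in the previous pair

This costs extra registers, so it only helps where the loop has two streams of work to overlap.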

Please test these two. With this info I could optimize my inner loops for poep/soep (and avoid stalls). I really like optimizing, btw.
