English Amiga Board


Old 03 May 2018, 16:53   #1
Megol
Registered User
 
Join Date: May 2014
Location: inside the emulator
Posts: 377
A question to coders - 3 op instructions [fpga 68k]

I've started thinking about 68k on FPGA again. As before, it will probably not result in anything, except wasting my time of course.

Thinking about a 3 operand extension, where instructions can have two sources and one (independent) destination, leads to some practical problems and a question for anybody who knows 68k assembly (and is interested in this theoretical game of time-wasting).

If anybody is interested in why this is a relevant question, I've tried to describe the technical problem below, but it isn't really needed to answer the question. IOW, skip to the question unless you are interested in technical mumbo-jumbo.

--
The main problem with keeping the old semantics for 3 operand instructions appears when going for an out of order execution core. Any practical implementation has to have register renaming in order to remove false dependencies.

Very simplified, register renaming means referring to a register as register[rename_table[register_number]] instead of register[register_number].
New results allocate a free physical register and leave the old register content unchanged. This means old instructions can see the old register content and new instructions can see the new register content.
This is a large part of what makes it possible to execute instructions out of order, with older instructions starting after newer ones have finished.
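
To make the indirection concrete, here is a minimal Python sketch of a rename table (an illustration added for this write-up, not an actual design). The class name `RenameTable`, its methods and the free-list policy are all assumptions made for the example.

```python
class RenameTable:
    """Toy renamer: maps architectural register names to physical register indices."""

    def __init__(self, arch_regs, num_physical):
        # Initially each architectural register maps to its own physical register.
        self.table = {name: i for i, name in enumerate(arch_regs)}
        # Remaining physical registers are free (hypothetical FIFO free list).
        self.free = list(range(len(arch_regs), num_physical))
        self.phys = [0] * num_physical

    def lookup(self, name):
        """Physical register currently holding architectural register `name`."""
        return self.table[name]

    def allocate(self, name):
        """Give `name` a fresh physical register; the old one keeps its content."""
        new = self.free.pop(0)
        self.table[name] = new
        return new


rt = RenameTable(["d0", "d1"], num_physical=8)
rt.phys[rt.lookup("d1")] = 0x11223344   # value seen by older instructions

old_d1 = rt.lookup("d1")
new_d1 = rt.allocate("d1")              # a new result for d1 gets a new register
rt.phys[new_d1] = 0xAABBCCDD            # value seen by newer instructions
```

Older instructions keep reading `rt.phys[old_d1]`, while anything decoded afterwards looks up `d1` and finds `new_d1`.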

A 2 operand instruction would be converted to a 3 operand internally:

add.b d0, d1 => _add.b d0_old, d1_old, d1_new ; _add is the internal format of the processor

And there are no great problems keeping the old semantics: the upper 24 bits of d1_old are simply copied unchanged into the newly allocated d1_new register.
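
As a rough illustration of that merge, here is a small Python sketch with registers modelled as plain 32-bit integers. The function name `add_b_internal` is made up for the example.

```python
def add_b_internal(d0_old, d1_old):
    """Model of _add.b d0_old, d1_old, d1_new: returns the value of d1_new."""
    low = (d0_old + d1_old) & 0xFF   # byte-sized add, carry out of bit 7 is dropped
    upper = d1_old & 0xFFFFFF00      # upper 24 bits come from a register that is read anyway
    return upper | low


d1_new = add_b_internal(0x00000001, 0x123456FF)
```

The point is that no extra source is needed: d1_old is already an input of the add, so its upper bits are available for free.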

It isn't so easy when going to 3 operand instructions:

add.b d0, d1, d2 => _add.b d0_old, d1_old, d2_new

To keep the old semantics here, the upper bits of d2 have to be preserved. However, as written, the old d2 is not a source of the instruction, so its upper bits aren't known.

One way to handle this would be to expand this further, adding a read of the old d2 value:

add.b d0, d1, d2 => _add.b d0_old, d1_old, d2_old, d2_new

And then copy the upper bits of d2_old to d2_new, as in the 2 operand example.

However, adding more sources adds complications: more register read ports, more wires, more multiplexers and a more complex instruction scheduler.

Another way would be inserting extra internal operations when needed. Reading a byte from d2_new would work without problems (as the upper 24 bits aren't relevant), but reading a longword from it would cause the processor to add an internal operation merging the high bits of d2_old with the low byte of d2_new:

__fuse.b d2_new, d2_old, d2_now_extra_new

That, however, adds complications elsewhere in the design. Some x86 processors have used a similar design for a similar problem (partial register writes), but they had no other reasonable choice (ISA specific).
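
A minimal Python sketch of this "merge only when needed" idea follows. It assumes each physical register remembers how many of its low bits are valid; `PhysReg`, `read_byte` and `read_long` are hypothetical names, and the `uops` list only records that an extra internal operation was issued.

```python
from dataclasses import dataclass


@dataclass
class PhysReg:
    value: int        # 32-bit container; only the low `valid_bits` are meaningful
    valid_bits: int   # 8, 16 or 32


def read_byte(new):
    """A byte read never needs the upper bits, so no extra operation is issued."""
    return new.value & 0xFF


def read_long(new, old, uops):
    """A longword read of a partial register triggers the extra merge operation."""
    if new.valid_bits == 32:
        return new.value
    mask = (1 << new.valid_bits) - 1
    uops.append("__fuse")                       # extra internal operation
    return (old.value & ~mask & 0xFFFFFFFF) | (new.value & mask)


d2_old = PhysReg(0x12345678, 32)
d2_new = PhysReg(0x000000AB, 8)   # result of a byte-sized 3 operand add
uops = []
byte_view = read_byte(d2_new)
long_view = read_long(d2_new, d2_old, uops)
```

The cost only shows up for readers that need the full register, which is also where the extra tracking complexity comes from.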

But when creating a new extension there is another alternative: by either zero or sign extending the (byte/word) result, the upper bits of the old register aren't needed anymore.
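
The two extension variants could be sketched like this in Python. This is an illustration only: the proposed instructions do not exist, and all function names are invented for the example.

```python
def zero_extend_b(result):
    """Keep the low byte, clear the upper 24 bits."""
    return result & 0xFF


def sign_extend_b(result):
    """Keep the low byte, copy bit 7 into the upper 24 bits."""
    low = result & 0xFF
    return (low | 0xFFFFFF00) if (low & 0x80) else low


def add_b_3op(d0, d1, extend):
    """Hypothetical add.b d0, d1, d2: the result depends only on d0 and d1."""
    return extend(d0 + d1)


d2_zero = add_b_3op(0x7F, 0x01, zero_extend_b)
d2_sign = add_b_3op(0x7F, 0x01, sign_extend_b)
```

Either way the old content of d2 never enters the computation, so the instruction stays a plain two-source operation.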
--

The question is relatively simple: if the instruction set is extended to support 3 operands is it in your opinion okay to have 3 operand instructions always zero or sign extend byte/word operations?
And as a follow-up question: if not, would it be reasonable for 3 op instructions that don't zero/sign extend to be slower?
Old 03 May 2018, 18:07   #2
meynaf
son of 68k
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,322
Quote:
Originally Posted by Megol View Post
The question is relatively simple: if the instruction set is extended to support 3 operands is it in your opinion okay to have 3 operand instructions always zero or sign extend byte/word operations?
Extending could have some uses, but it's not consistent with the normal behaviour and would probably lead to confusion.


Quote:
Originally Posted by Megol View Post
And as a follow-up question: if not, would it be reasonable for 3 op instructions that doesn't zero/sign extend to be slower?
Not really, as they would become totally useless, being the same as a move followed by a 2-op instruction.

IMO the only way to implement 3-op in the 68k is to do it "silently" by decoding instruction pairs as if they were single instructions. You do not need to create new encodings at all. In addition, old programs don't need to be rewritten to benefit from this.

Note that 3-operand does not occur as frequently as most people appear to believe and is probably not worth the trouble anyway.
Old 04 May 2018, 18:25   #3
Megol
Registered User
 
Join Date: May 2014
Location: inside the emulator
Posts: 377
Quote:
Originally Posted by meynaf View Post
Extending could have some uses, but it's not consistent with the normal behaviour and would probably lead to confusion.
I agree that confusion isn't good, however the problems described are real.

Quote:
Not really, as they would become total useless, being the same as move followed by 2-op instruction.
Actually that doesn't have to be the case. In an unlikely future with a wide OoO 68k with, let's say, 3 ALU pipes, 2 of those pipes could handle 3 op instructions with zero/sign extension and a third pipe could handle more complex operations, including 3 op instructions that don't extend. Another possibility would be handling 3 op without extension as a two cycle operation or (as above) inserting a one cycle internal operation behind the first one.

Quote:
IMO the only way to implement 3-op in the 68k is to do it "silently" by decoding instruction pairs as if they were single instructions. You do not need to create new encodings at all. In addition old programs don't need to be rewritten to take benefit from this.
Fusion obviously has advantages, including compatibility. However, that approach also has several disadvantages: instructions have to be adjacent to be fused, condition codes have to be handled, and sub-register updates still create problems.

Code:
MOVE.B D1, D0
ADD.B D2, D0 

_ADD.B D1_old, D2_old, D0_new
Observe the same problem as described above popping up: the upper bits of the old D0 are still needed, but here one can't handle it by zero/sign extending the result. IOW, extra overhead.

One shouldn't discount the problem of MOVE updating condition codes either, as they have to be tracked.

However, it wouldn't be a huge problem to support MOVE fusion for longword operations with a limited subset of following instructions.
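
To spell out why the fused byte pair still needs a third input, here is a small Python check. The helper names are invented, and registers are plain 32-bit integers.

```python
def move_b(src, dst):
    """MOVE.B src, dst: replaces the low byte of dst only."""
    return (dst & 0xFFFFFF00) | (src & 0xFF)


def add_b(src, dst):
    """ADD.B src, dst: byte add into the low byte of dst."""
    return (dst & 0xFFFFFF00) | ((src + dst) & 0xFF)


def fused_move_add_b(d1, d2, d0_old):
    """One internal op for the pair; note that d0_old is a third source."""
    return (d0_old & 0xFFFFFF00) | ((d1 + d2) & 0xFF)


d0, d1, d2 = 0xCAFE0000, 0x00000010, 0x000000F5
sequential = add_b(d2, move_b(d1, d0))   # MOVE.B D1,D0 then ADD.B D2,D0
fused = fused_move_add_b(d1, d2, d0)
```

Both paths give the same D0, but only because the fused form reads the old D0 as well, which is exactly the extra source the 3 operand form was supposed to avoid.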

Quote:
Note that 3-operand does not occur as frequently as most people appear to believe and is probably not worth the trouble anyway.
That is true, and the only reason for thinking about them is to reduce overhead for another, more generic extension; you may remember it from earlier discussions.

As I'm not coding much nowadays and even less in 68k assembly I thought that the opinions of skilled assembly coders should influence the (vaporware) development.
Old 04 May 2018, 21:53   #4
meynaf
son of 68k
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,322
Quote:
Originally Posted by Megol View Post
Fusion obviously have advantages including compatibility. However that approach also have several disadvantages, instructions have to be adjacent to be fused, condition codes have to be handled, sub-register updates still create problems.
If the 3-op is done with simple macros, then the instructions are adjacent. No loss in comparison to new 3-op instructions.
If you follow a MOVE by something like ADD, then the MOVE condition codes don't have to be computed at all - because ADD will change them all later.
Sub-register updates can be ignored if you decide to support only longword size. Else, well, it's time to be innovative
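
The condition code argument can be sketched as a small set computation in Python. The flag sets follow the 68000 definitions (MOVE sets N and Z, clears V and C, and leaves X alone; ADD writes all five), while the function name and the optional `flags_read_by_second` parameter are assumptions made for the example.

```python
# Condition code bits written by each instruction (68000: X, N, Z, V, C).
FLAGS_WRITTEN = {
    "MOVE": {"N", "Z", "V", "C"},        # X is not affected by MOVE
    "ADD": {"X", "N", "Z", "V", "C"},
}


def dead_flags(first, second, flags_read_by_second=frozenset()):
    """Flags of `first` that need not be computed when `second` directly follows.

    A flag is dead if `second` overwrites it without reading it first.
    """
    overwritten = FLAGS_WRITTEN[first] & FLAGS_WRITTEN[second]
    return overwritten - set(flags_read_by_second)


move_then_add = dead_flags("MOVE", "ADD")
```

For a MOVE followed by ADD every flag the MOVE would write is dead, so a fused pair only has to produce the ADD's flags.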


Quote:
Originally Posted by Megol View Post
Observe the same problem as described above popping up - but here one can't handle it by zero/sign extending the result. IOW extra overheads.
Perhaps the size could be handled by the register file itself ?
I mean, the high part would then be unimportant out of the ALU, meaning you don't need to read D2. But upon writing, you say it's a byte, and the register file masks D2's high part out.
Ok, that's just my 2 cents


Quote:
Originally Posted by Megol View Post
One shouldn't discount the problem of MOVE updating condition codes either as they have to be tracked.
ADD will destroy them all so previous MOVE doesn't have to update the cc.


Quote:
Originally Posted by Megol View Post
That is true and the only reason thinking about them is to reduce overheads for another more generic extension, you may remember them from earlier discussions.
I've had discussions about extensions with a lot of people and I don't remember much of them
It might be better to discuss these directly.

That said, perhaps it would be better to just implement the full cpu before adding anything new, it's already a daunting task by itself...
Old 06 May 2018, 13:37   #5
Megol
Registered User
 
Join Date: May 2014
Location: inside the emulator
Posts: 377
Quote:
Originally Posted by meynaf View Post
If the 3-op is done with simple macros, then the instructions are adjacent. No loss in comparison to new 3-op instructions.
If you follow a MOVE by something like ADD, then the MOVE condition codes don't have to be computed at all - because ADD will change them all later.
Sub-register updates can be ignored if you decide to support only longword size. Else, well, it's time to be innovative
The problem is that the main idea is supporting a "prefix" instruction before any old instruction, translating any 2 op instruction into a 3 op one.

Sure, just supporting longs is one option, but it doesn't feel elegant. However, requiring sign/zero extension isn't elegant either; it changes the feel of the 68k instruction set. :/

Quote:
Perhaps the size could be handled by the register file itself ?
I mean, the high part would then be unimportant out of the ALU, meaning you don't need to read D2. But upon writing, you say it's a byte, and the register file masks D2's high part out.
Ok, that's just my 2 cents
That is possible if not using register renaming and is (I guess) what the Apollo uses. When using renaming, the D2 that is written by an instruction isn't the same register as the one that existed before the instruction executed.

ADD.B D0, D1
MOVE.L #$FEEDBEEF, D0

The MOVE uses the same register name (D0) as the previous instruction, however it ignores the content of D0. This is a so-called false dependency, commonly called Write After Read (WAR).

When using register renaming, the processor handles this by letting the first instruction have a different D0 than the second. So after decoding, the internal instructions can be:

_ADD.B D0_0, D1_0, D1_1
_MOVE.L #$FEEDBEEF, D0_1

The false dependency is removed and the MOVE can execute in parallel with or before the ADD.

Registers are allocated from a larger register file (e.g. some x86-64 processors have >160 physical registers but only 16 general registers) and any physical register can be used for any programmer visible register.
That means D0_0 is really any physical register, and not the same one as D0_1, D0_2 etc.

So:
ADD.B D0, D1

becomes:

_ADD.B D0_0, D1_0, D1_1

And here the upper 24 bits can be taken from the old register D1_0 and copied unchanged to D1_1, since D1_0 is an input.

ADD.B D0, D1, D2

would be translated to:

_ADD.B D0_0, D1_0, D2_1 ; D2_1 is a previously unused physical register

Observe that the "old" D2 isn't available as an input to the instruction, so the upper bits are unknown.

So why not simply see that D2 is overwritten and map D2_1 to the same physical register as D2_0?

OR.B D2, D4
ADD.B D0, D1, D2

_OR.B D2_0, D4_0, D4_1
_ADD.B D0_0, D1_0, D2_0

Because then we'd still have a false dependency between the instructions: the ADD has to be executed after the OR.

In other words, that solution is one of the ways 3 op instructions that don't extend the result can be run slower: by effectively "disabling" OoO execution.
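
The dependency reasoning behind the examples in this post (the WAR hazard that renaming removes, and the one that reusing D2_0 brings back) can be sketched as a small hazard check in Python. The representation of an operation as `(destination, [sources])` and the function name are assumptions for illustration only.

```python
def must_stay_ordered(older, younger):
    """True if `younger` may not execute before `older`.

    Each op is (destination, [sources]), using register names as strings.
    """
    o_dst, o_src = older
    y_dst, y_src = younger
    raw = o_dst in y_src    # younger reads what older writes (true dependency)
    war = y_dst in o_src    # younger overwrites what older still has to read
    waw = y_dst == o_dst    # both write the same register
    return raw or war or waw


# ADD.B D0,D1 then MOVE.L #imm,D0 with architectural names: WAR on D0.
unrenamed = must_stay_ordered(("D1", ["D0", "D1"]), ("D0", []))

# _ADD.B D0_0,D1_0,D1_1 then _MOVE.L #imm,D0_1: no shared names left.
renamed = must_stay_ordered(("D1_1", ["D0_0", "D1_0"]), ("D0_1", []))

# _OR.B D2_0,D4_0,D4_1 then _ADD.B D0_0,D1_0,D2_0: reusing D2_0 as the
# destination reintroduces a WAR hazard against the OR.
reused_d2 = must_stay_ordered(("D4_1", ["D2_0", "D4_0"]), ("D2_0", ["D0_0", "D1_0"]))
```

Only the middle case is free to reorder; writing the 3 op result into the old physical register forces the ADD to wait for the OR.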

Quote:
ADD will destroy them all so previous MOVE doesn't have to update the cc.
Yes, such cases can be detected, but in order to make it general, tracking has to be implemented.

Quote:
I had discussions for extensions with a lot of ppl and i don't remember much of them
It might be better to discuss these directly.

That said, perhaps it would be better to just implement the full cpu before adding anything new, it's already a daunting task by itself...
Yes, that's the priority.
Old 06 May 2018, 17:36   #6
robinsonb5
Registered User
 
Join Date: Mar 2012
Location: Norfolk, UK
Posts: 1,153
Quote:
Originally Posted by Megol View Post
Sure just supporting longs is one option but doesn't feel elegant. However requiring sign/zero extension isn't elegant either, changes the feel of the 68k instruction set. :/
Would an acceptable compromise be to perform the byte-addition as usual, but for the upper bits of d2_1 to come from d1_0 instead of d2_0?
I.e., add.b d0,d1,d2 will leave all 32 bits of d2 set to what d1 would have contained if the third operand hadn't been present?
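
A quick Python sketch of this compromise, under the assumption that registers are plain 32-bit integers; the function names are invented for the example. It checks that the proposed behaviour matches a MOVE.L d1,d2 followed by ADD.B d0,d2, whatever d2 held before.

```python
def add_b(src, dst):
    """Ordinary 2 operand ADD.B: upper 24 bits of dst are preserved."""
    return (dst & 0xFFFFFF00) | ((src + dst) & 0xFF)


def add_b_3op_upper_from_d1(d0, d1):
    """Proposed add.b d0,d1,d2: upper bits of d2 come from d1, old d2 is never read."""
    return (d1 & 0xFFFFFF00) | ((d0 + d1) & 0xFF)


d0, d1 = 0x000000F0, 0x12345620
results = []
for d2_old in (0x00000000, 0xFFFFFFFF, 0xDEADBEEF):
    d2_pair = add_b(d0, d1)              # MOVE.L d1,d2 discards d2_old, then ADD.B d0,d2
    d2_3op = add_b_3op_upper_from_d1(d0, d1)
    results.append((d2_pair, d2_3op))
```

Since the result equals what the 2 operand add would have left in d1, the instruction needs only its two sources.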
Old 07 May 2018, 07:36   #7
meynaf
son of 68k
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,322
Quote:
Originally Posted by Megol View Post
The problem is that the main idea is supporting a "prefix" instruction before any old instruction translating any 2 op instruction into a 3 op one.
A prefix will "push" the instruction, and all fields that were previously located at a fixed place now have a variable one.
Prefixes also make computing the instruction size one step longer.
How do you plan to manage this?


Quote:
Originally Posted by Megol View Post
In other words that solution is one of the ways 3 op instructions that don't extend the result can be run slower: by effectively "disabling" OoO execution.
I see.
It seems that if you don't want to change the feel of the 68k instruction set, you will have to read the 3 registers. A few 020+ instructions need reading more than 2 regs already, btw.
Or, just drop the idea of 3-op. It wouldn't make the code shorter or easier to write, two conditions that are important for 68k but often overlooked (and that Gunnar never understood !).