English Amiga Board - View Single Post

matthey · 08 August 2016, 04:55

Quote:

Originally Posted by meynaf

Can you provide sample code ? This is what i was asking for.

Any sample code for SELcc would be artificial as the instruction does not exist anywhere. There are 2 variations.

if (cc) var=val1

SELcc EA,d0

Code:

  cmp ?           ; set the condition codes
  bcc .skip       ; generic Bcc: taken when the condition is false
  move.l EA,d0    ; d0 = val1
.skip:

The first variation above is less flexible than predication because it can only be used to set a variable (like CMOVcc), whereas predication can do many other operations. Code density is about the same, but SELcc may be simpler to use and/or implement (this needs further review by unbiased hardware designers). Some CPU designs may not have predication, and this variation comes for free with the more powerful variation below.
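As a rough illustration of the semantics only, the two-operand form can be modeled in C. SELcc does not exist on any CPU, so the function name sel2 and its parameters are made up for this sketch.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical model of "SELcc EA,d0":
   d0 is overwritten with the EA operand only when the condition holds,
   otherwise it keeps its old value. */
static uint32_t sel2(bool cc, uint32_t ea, uint32_t d0)
{
    return cc ? ea : d0;
}
```

For example, d0 = sel2(x < y, val1, d0); corresponds to if (x < y) var = val1.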

if (cc) var=val1
else var=val2

SELcc EA,d1,d0

Code:

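  cmp ?           ; set the condition codes
  bcc .skip1      ; generic Bcc: taken when the condition is false
  move.l EA,d0    ; condition true: d0 = val1
  bra .skip2
.skip1:
  move.l d1,d0    ; condition false: d0 = val2
.skip2: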

This variation replaces 2 branches in an if/then/else statement, but again we are limited to setting a variable to one of two choices. Code density is improved: the code would commonly be half the size of the 2-branch version. Removing 2 branches and halving the code size is a big improvement, but the question is how often it could be used and whether compilers could take advantage of it (this needs further review by unbiased compiler designers).
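The three-operand form can be modeled in C the same way. This is only a sketch of the intended behavior. The names sel3 and max_s32 are invented here, and the mask trick is just one branch-free way to express a select in C, not a claim about how hardware would implement it.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical model of "SELcc EA,d1,d0": d0 = cc ? EA : d1,
   written without a branch. */
static uint32_t sel3(bool cc, uint32_t ea, uint32_t d1)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)cc;  /* all ones if cc, else 0 */
    return (ea & mask) | (d1 & ~mask);
}

/* Example use: a signed maximum expressed as one select. */
static int32_t max_s32(int32_t a, int32_t b)
{
    return (int32_t)sel3(a > b, (uint32_t)a, (uint32_t)b);
}
```

Min, max, clamp and similar "pick one of two values" patterns are the kind of if/then/else a compiler would have to recognize to use such an instruction.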

Quote:

Originally Posted by meynaf

But instead of the hint bit we could simply use backward taken / forward not taken and optimize by moving code around.

BTFN is the default static branch prediction on the 68040+. It is not possible to move code around in all cases to avoid branch mispredictions the first 2 times, and it is not a good idea to move code around in some other cases. Efficient code should fall through to make the most of the ICache and maximize the instruction stream length between branches. Falling through branches when BTFN is incorrect can also improve code density slightly.
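For readers unfamiliar with the rule, BTFN (backward taken, forward not taken) can be written as a one-line predicate. This is a generic sketch of the static rule only, not a model of any particular 680x0 implementation. The function name and the addresses used with it are made up.

```c
#include <stdint.h>
#include <stdbool.h>

/* BTFN static prediction: a branch whose target lies at a lower address
   (typically a loop back-edge) is predicted taken; a forward branch is
   predicted not taken. */
static bool btfn_predict_taken(uint32_t branch_pc, uint32_t target_pc)
{
    return target_pc < branch_pc;
}
```

A forward branch that is usually taken is therefore mispredicted by this rule unless the code is rearranged or a hint overrides it.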

Quote:

Originally Posted by meynaf

I like EREV. Some xREV for bits would be consistent then, but B is "single bit". Maybe LREV for longword reverse ?

I'm not a fan of the 'L' at the beginning. I'm leaning toward going back to BITREV if BREV is unclear. BFREV would be clear though?

Quote:

Originally Posted by meynaf

That's one of the reasons why i don't like the SIMD extensions. They get obsolete as soon as the next version comes out.

Yes, they have growing pains but they eventually get to the point of diminishing returns. The 64 bit SIMD registers, integer only operations and difficulty of programming were good reasons to implement and focus on the FPU until there are enough resources to do a proper SIMD. I have no qualms with toying with an SIMD either, but it should be a separate unit.

Quote:

Originally Posted by meynaf

Peephole asm optimization can perhaps do the conversion.

Current compilers are able to use the MOVEQ+AND trick, so there is little use for AND.L #i16.
And anyway if we really want this, it takes a very small encoding space.

Assembler peephole optimizers usually can't do the MOVEQ+OP optimization because there usually isn't a trash register defined. Compilers can and do perform the optimization, but handling it adds complexity to both compilers and processors. Assembler code without the MOVEQ is less typing and looks more professional, and it sometimes saves a register which can be used for other things, or improves code density when a trash register is not available. I think it is great and you hate it. Gunnar liked it, so it was probably easy to implement on the hardware side, as I would expect for something so simple.

Quote:

Originally Posted by meynaf

Yes this is the correct operation. There is a shorter way (same register layout as yours) :

Code:

 move.l d1,d3 ; pOEP
 eor.l d2,d1 ; sOEP
 and.l d0,d1 ; pOEP
 eor.l d1,d2 ; pOEP
 eor.l d3,d1 ; sOEP

This kind of code is quite standard issue, i believe it would have been a poor competition entry.

Nice. You used an EOR EXG also. I did a double version which does schedule better, although I should have reversed the last 2 lines of my code for 3.5 cycles superscalar, against your 3 cycles which wins. The smaller code is better also. I still have my doubts that this is worthy of an instruction, but it is interesting and not the easiest thing to do with existing instructions.
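For anyone following along, here is a C rendering of what the quoted five-instruction sequence computes: the bits selected by a mask are exchanged between two values. The register roles in the comments (d0 = mask, d1 and d2 = the two values, d3 = scratch) are my reading of the quoted code, and the function name is invented.

```c
#include <stdint.h>

/* Exchange the bits selected by mask between *a and *b.
   Each statement mirrors one instruction of the quoted sequence,
   assuming d0 = mask, d1 = *a, d2 = *b, d3 = scratch. */
static void masked_exchange(uint32_t mask, uint32_t *a, uint32_t *b)
{
    uint32_t saved = *a;     /* move.l d1,d3 */
    uint32_t t = *a ^ *b;    /* eor.l  d2,d1 */
    t &= mask;               /* and.l  d0,d1 */
    *b ^= t;                 /* eor.l  d1,d2 */
    *a = t ^ saved;          /* eor.l  d3,d1 */
}
```

Where a mask bit is 0 both values keep their own bit, and where it is 1 the bits trade places, which is the core step of a c2p/p2c merge.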

Quote:

Originally Posted by meynaf

What type of algorithms, well, aside of the classic c2p/p2c it's for whenever you need to exchange selected bits, i.e. extract a bit field or separate bits, while keeping the old value somewhere.
Many cases would go away if we had a BFEXG, though.

BFEXG would be nice and would be possible if there was a free encoding bit in BFEXTU and BFEXTS. The bit could turn on bit swapping between the extracted area and the destination register for BFEXTU/S. This would also make it like BFINS, which could have shared encoding space with 2 encoding bits (EB=Extract Bit, IB=Insert Bit):

EB=0 IB=0 BFTST
EB=1 IB=0 BFEXTS/U
EB=0 IB=1 BFINS
EB=1 IB=1 BFEXG

Of course, you may want BFEXG to swap bits between 2 registers with the same offset and width. I believe this kind of BFEXG would be less general purpose and not work as well on bit streams. Any bit offset >31 would be useless without being able to specify two bit fields, which is too expensive.
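To make the first kind of BFEXG concrete (an extract that also inserts), here is a C sketch restricted to a single 32-bit operand. BFEXG is only a proposal, so the function name, parameter order and the restriction to fields that do not wrap are all assumptions of this sketch. Offset 0 is taken to be the most significant bit, as in the existing 68020 bit field instructions.

```c
#include <stdint.h>

/* Hypothetical unsigned BFEXG on one 32-bit operand:
   the field {offset:width} of *field_src is moved to the low bits of *dn
   (zero-extended, as BFEXTU would do), and the old low 'width' bits of *dn
   are written into the field (as BFINS would do).
   Assumes 1 <= width <= 32 and offset + width <= 32. */
static void bfexg_u(uint32_t *field_src, unsigned offset, unsigned width,
                    uint32_t *dn)
{
    unsigned shift = 32u - offset - width;
    uint32_t low_mask = (width == 32u) ? 0xFFFFFFFFu
                                       : (((uint32_t)1 << width) - 1u);
    uint32_t extracted = (*field_src >> shift) & low_mask;
    uint32_t inserted = *dn & low_mask;

    *field_src = (*field_src & ~(low_mask << shift)) | (inserted << shift);
    *dn = extracted;
}
```

With a memory operand the same idea would extend to offsets beyond 31, which is what makes this form suitable for bit streams.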