English Amiga Board


Old 06 August 2016, 11:09   #21
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,332
Quote:
Originally Posted by matthey View Post
If a conditional or SELcc style instruction could replace all branches then we wouldn't have branches. SELcc is at its best when it can replace an IF/THEN/ELSE eliminating 2 branches. It would also be another possible tool for those evil branches which can't be predicted well.
Can you provide sample code? This is what i was asking for.


Quote:
Originally Posted by matthey View Post
Yes, some CPU designs may choose not to use the branch hint bit and for others it may be the only branch prediction help as it is so cheap. Although many programmers will choose not to use a hint bit or optimize to this level, others may. I see PPC hint bits in much of Frank Wille's PPC assembler code for vbcc and most modern PPC processors don't use it. Amiga programmers may be more likely to use a hint bit because we like optimized code and we generally use slower processors and often in assembler.
But instead of the hint bit we could simply use backward taken / forward not taken and optimize by moving code around.


Quote:
Originally Posted by matthey View Post
Yes, this is a source of inconsistency for me. I prefer to look at the result size. How can you tell new programmers to use longword sizes when the result does not match the instruction size?

SWW.L would not be bad but REVW.L is more understandable, IMO. My preferred syntax is very clear at least. There may be too much use of REV with BREV.L, REVB.L and REVW.L though. Maybe REVB.L could be EREV.L or ESWAP.L for endian reverse or endian swap. I don't know. The originals aren't horrible either even if BYTEREV is a little long.
I like EREV. Some xREV for bits would be consistent then, but B is "single bit". Maybe LREV for longword reverse?


Quote:
Originally Posted by matthey View Post
More addressing space is really the only good reason to move to 64 bit and there are other ways of working around this issue. I guess Gunnar would answer so I have 64 bit registers for my SIMD instructions. What happens if the SIMD gets floating point or grows to 128 bit registers though?
That's one of the reasons why i don't like the SIMD extensions. They get obsolete as soon as the next version comes out.


Quote:
Originally Posted by matthey View Post
The AND.L is simpler and what a compiler is most likely going to use. The AND.L #d16.w,Dn addressing mode is the same size as BFEXTU but it may be faster on some CPU implementations.
Peephole asm optimization can perhaps do the conversion.

Current compilers are able to use the MOVEQ+AND trick, so there is little use for AND.L #i16.
And anyway if we really want this, it takes a very small encoding space.
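For reference, the MOVEQ+AND trick works because MOVEQ sign-extends an 8-bit immediate to 32 bits in a 2-byte opcode, so and.l #15,d0 (6 bytes) can become moveq #15,d1 plus and.l d1,d0 (4 bytes) when a scratch register is free. A quick Python model of the sign extension (an illustrative sketch, not part of the original post; the register names in the comments are arbitrary):

```python
# Model of MOVEQ's sign extension (sketch of standard 68k behaviour).
def moveq(imm8):
    """Sign-extend an 8-bit immediate to a 32-bit value, as MOVEQ does."""
    v = imm8 & 0xFF
    return v | 0xFFFFFF00 if v & 0x80 else v

mask = moveq(15)              # moveq #15,d1  -> 0x0000000F
result = 0x12345678 & mask    # and.l d1,d0   -> 8
```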


Quote:
Originally Posted by matthey View Post
I guess it depends on how you look at it. The addressing mode doesn't use much valuable encoding space but I guess you could say it uses a little encoding space in every instruction with an EA.
Count the number of code words, the difference is big.
Is it worth creating an addressing mode that only a handful of instructions will use? If we list all instructions where the immediate mode is available, we won't get a lot of them.


Quote:
Originally Posted by matthey View Post
I played with your mix and it is trickier than it first looks. It might have been a good algorithm for the programming competition. I came up with
the following.
Code:
mix:
; d0 = mask (trashed)
; d1 = number 1 (mixed result 1)
; d2 = number 2 (mixed result 2)
; d3 = scratch
   move.l d0,d3   ; d3 = mask
   and.l d1,d3    ; d3 = mask & n1
   and.l d2,d0    ; d0 = mask & n2
   eor.l d3,d2    ; d2 = n2 ^ (mask & n1)
   eor.l d0,d1    ; d1 = n1 ^ (mask & n2)
   eor.l d3,d1    ; d1 = (n1 & ~mask) | (n2 & mask)
   eor.l d0,d2    ; d2 = (n2 & ~mask) | (n1 & mask)
Is this the correct operation? Is there shorter/faster/better code for it? What types of algorithms is this used for?
Yes this is the correct operation. There is a shorter way (same register layout as yours) :
Code:
 move.l d1,d3  ; d3 = n1
 eor.l d2,d1   ; d1 = n1 ^ n2
 and.l d0,d1   ; d1 = mask & (n1 ^ n2)
 eor.l d1,d2   ; d2 = (n2 & ~mask) | (n1 & mask)
 eor.l d3,d1   ; d1 = (n1 & ~mask) | (n2 & mask)
This kind of code is quite standard issue, i believe it would have been a poor competition entry.
What type of algorithms, well, aside from the classic c2p/p2c it's for whenever you need to exchange selected bits, i.e. extract a bit field or separate bits, while keeping the old value somewhere.
Many cases would go away if we had a BFEXG, though.
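The masked exchange ("mix") both code sequences compute can be modelled in Python to make the semantics explicit (an illustrative sketch, not part of the original posts; the comments map back to the 5-instruction version):

```python
# Python model of the 5-instruction EOR sequence: exchange the bits of
# n1 and n2 wherever mask has a 1 bit.
def mix(mask, n1, n2):
    d3 = n1                      # move.l d1,d3
    d1 = (n1 ^ n2) & mask        # eor.l d2,d1 + and.l d0,d1
    d2 = n2 ^ d1                 # eor.l d1,d2 -> (n2 & ~mask) | (n1 & mask)
    d1 ^= d3                     # eor.l d3,d1 -> (n1 & ~mask) | (n2 & mask)
    return d1, d2

a, b = mix(0x0F0F0F0F, 0x12345678, 0xABCDEF01)   # low nibbles swapped
```

Applying mix twice with the same mask restores the original values, which is why it fits the exchange use cases described.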


Quote:
Originally Posted by matthey View Post
I'm not a fan of the MIX EA,Rn:Rn using Rn:Rn which should be reserved for a high:low 64 bit value. I would just go for MIX EA,Rn,Rn. The ColdFire tried to limit instructions to 2 OPs also which is why it ended up with REM(S/U) EA,Dr : Dq when there is no 64 bit operation. It is confusing and didn't work for 64 bit REMS/REMU which is one of the reasons I created DIVUR/DIVSR. It also keeps me from using an alias of REMS->DIVSR and REMU->DIVSU like you suggested.
Either Da : Db or Da, Db, doesn't matter for me. Feel free to choose whatever you see fit, i won't pop up to contradict you there.
Old 06 August 2016, 13:34   #22
pandy71
Registered User
 
Join Date: Jun 2010
Location: PL?
Posts: 2,790
Quote:
Originally Posted by matthey View Post
Feel free to discuss any 68k enhancements (ISA, ABI or CPU design) in this thread.
Not sure if this is feasible for 68k but repeat mode (hw loop counter) known from TMS320 (RPT, RPTK, RPTS, RPTB) can be nice and useful thing.
http://www.ti.com/lit/gpn/tms320c30
Old 08 August 2016, 04:55   #23
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Quote:
Originally Posted by meynaf View Post
Can you provide sample code ? This is what i was asking for.
Any sample code for SELcc would be artificial as the instruction does not exist anywhere. There are 2 variations.

if (cc) var=val1

SELcc EA,d0
Code:
  cmp ?
  bcc .skip
  move.l EA,d0
.skip:
The first variation above is less flexible than predication because it can only be used to set a variable (like CMOVcc) where predication can do many other operations. Code density is about the same but the SELcc may be simpler to use and/or implement (needs further review by unbiased hardware designers). Some CPU designs may not have predication and this variation comes for free with the more powerful variation below.

if (cc) var=val1
else var=val2

SELcc EA,d1,d0
Code:
  cmp ?
  bcc .skip1
  move.l EA,d0
  bra .skip2
.skip1:
  move.l d1,d0
.skip2:
This variation replaces 2 branches in an if/then/else statement but we are limited again to only being able to set a variable to one of two choices. Code density is improved and would commonly be half of the 2xbranch version. Removing 2 branches and halving the code size is a big improvement but the question is how often could it be used and could compilers take advantage of it (needs further review by unbiased compiler designers).
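The intended semantics of the hypothetical SELcc can be sketched in Python (SELcc exists nowhere, so everything below is illustrative only, including the branchless mask formulation a CPU or compiler might use internally):

```python
# SELcc EA,d1,d0 modelled as a function: pick val1 when the condition
# holds, val2 otherwise (hypothetical instruction, illustrative only).
def selcc(cc, val1, val2):
    return val1 if cc else val2

# One branchless way to get the same result with 32-bit operations:
# -int(cc) is all ones when cc is true and all zeros when false.
def selcc_branchless(cc, val1, val2):
    m = -int(bool(cc)) & 0xFFFFFFFF
    return (val1 & m) | (val2 & ~m & 0xFFFFFFFF)
```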

Quote:
Originally Posted by meynaf View Post
But instead of the hint bit we could simply use backward taken / forward not taken and optimize by moving code around.
BTFN is the default static branch prediction on the 68040+. It is not possible to move code around in all cases to avoid branch mispredictions the first 2 times, and it is not a good idea to move code around in some other cases. Efficient code should fall through to maximize the ICache and instruction stream length between branches. Falling through branches when BTFN is incorrect can also improve code density slightly.

Quote:
Originally Posted by meynaf View Post
I like EREV. Some xREV for bits would be consistent then, but B is "single bit". Maybe LREV for longword reverse ?
I'm not a fan of the 'L' at the beginning. I'm leaning toward going back to BITREV if BREV is unclear. BFREV would be clear though?

Quote:
Originally Posted by meynaf View Post
That's one of the reasons why i don't like the SIMD extensions. They get obsolete as soon as the next version comes out.
Yes, they have growing pains but they eventually get to the point of diminishing returns. The 64 bit SIMD registers, integer only operations and difficulty of programming were good reasons to implement and focus on the FPU until there are enough resources to do a proper SIMD. I have no qualms with toying with an SIMD either but it should be a separate unit.

Quote:
Originally Posted by meynaf View Post
Peephole asm optimization can perhaps do the conversion.

Current compilers are able to use the MOVEQ+AND trick, so there is little use for AND.L #i16.
And anyway if we really want this, it takes a very small encoding space.
Assembler peephole optimizers usually can't do the MOVEQ+OP optimization because there usually isn't a trash register defined. Compilers can and do do the optimization but it adds to the complexity of compilers and processors. The assembler code is less typing and looks more professional without MOVEQ, and it sometimes saves a register which can be used for other things or improves code density when a trash register is not available. I think it is great and you hate it. Gunnar liked it, so it was probably easy to implement from the hardware side, as I would expect with something so simple.

Quote:
Originally Posted by meynaf View Post
Yes this is the correct operation. There is a shorter way (same register layout as yours) :
Code:
 move.l d1,d3 ; pOEP
 eor.l d2,d1 ; sOEP
 and.l d0,d1 ; pOEP
 eor.l d1,d2 ; pOEP
 eor.l d3,d1 ; sOEP
This kind of code is quite standard issue, i believe it would have been a poor competition entry.
Nice. You used an EOR-based EXG also. I did a double version which allows better scheduling, although I should have reversed the last 2 lines of my code for 3.5 cycles superscalar against your 3 cycles, which wins. The smaller code is better also. I still have my doubts that this is worthy of an instruction but it is interesting and not the easiest to do with existing instructions.

Quote:
Originally Posted by meynaf View Post
What type of algorithms, well, aside of the classic c2p/p2c it's for whenever you need to exchange selected bits, i.e. extract a bit field or separate bits, while keeping the old value somewhere.
Many cases would go away if we had a BFEXG, though.
BFEXG would be nice and would be possible if there was a free encoding bit in BFEXTU and BFEXTS. The bit could turn on bit swapping between the extracted area and the destination register for BFEXTU/S. This would also make it like BFINS, which could have shared encoding space with 2 encoding bits (EB=Extract Bit, IB=Insert Bit):

EB=0 IB=0 BFTST
EB=1 IB=0 BFEXTS/U
EB=0 IB=1 BFINS
EB=1 IB=1 BFEXG

Of course, you may want BFEXG to swap bits between 2 registers with the same offset and width. I believe this kind of BFEXG would be less general purpose and not work as well on bit streams. Any bit offset >31 would be useless without being able to specify 2 bit fields which is too expensive.
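A Python model of the BFEXTU+BFINS style BFEXG being discussed (entirely hypothetical, and assuming the 68k register bit-field convention where offset 0 is the MSB, bit 31):

```python
# Hypothetical BFEXG modelled on BFEXTU+BFINS: swap the <offset,width>
# field of src with the low `width` bits of dst. Offset counts from the
# MSB, as in the 68k register bit-field instructions.
def bfexg(src, dst, offset, width):
    shift = 32 - offset - width             # bit position of the field LSB
    mask = (1 << width) - 1
    field = (src >> shift) & mask           # the BFEXTU part
    src = (src & ~(mask << shift)) | ((dst & mask) << shift)  # the BFINS part
    dst = (dst & ~mask) | field
    return src & 0xFFFFFFFF, dst & 0xFFFFFFFF
```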

Last edited by matthey; 08 August 2016 at 05:19.
Old 08 August 2016, 06:26   #24
matthey
Quote:
Originally Posted by pandy71 View Post
Not sure if this is feasible for 68k but repeat mode (hw loop counter) known from TMS320 (RPT, RPTK, RPTS, RPTB) can be nice and useful thing.
http://www.ti.com/lit/gpn/tms320c30
Hardware repeat/loop registers are good for performance because the register can't be changed in the loop, unlike a general purpose (GP) register. However, they are generally less flexible and the register lost may be better used as GP. Loops not of the decrement and branch type are better off with another GP register. The 68k already has a DBcc instruction but it is challenging for performance because the loop register can change inside the loop. There are several possibilities for dealing with this. Maybe the now unused lowest bit in the displacement could be set, which would tell the CPU that the loop counter is not changed in the loop and optimizations can be made (perhaps the CPU itself could also set the bit after running loops which don't change the counter). We have to play the cards we were dealt with the 68k. We also need good compatibility. I would rather focus on improving the performance of what we have before bolting on a bunch of foreign loop instructions and making processors deal with too many loop variations.
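For anyone following along, the DBcc counter behaviour any such loop optimization must preserve is easy to state in Python (a sketch of standard 68k semantics, nothing new):

```python
# DBF/DBRA loop-count model: the counter is the low word of a data
# register, decremented after each pass; the loop exits when the word
# reaches -1 (0xFFFF), so an initial value of n runs the body n+1 times.
def dbf_iterations(n):
    count = n & 0xFFFF
    passes = 0
    while True:
        passes += 1                   # the loop body runs here
        count = (count - 1) & 0xFFFF  # dbf dn,label decrements the low word
        if count == 0xFFFF:           # fall through: word wrapped to -1
            return passes
```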
Old 08 August 2016, 09:43   #25
meynaf
Quote:
Originally Posted by matthey View Post
Any sample code for SELcc would be artificial as the instruction does not exist anywhere.
That's common among new instructions


Quote:
Originally Posted by matthey View Post
if (cc) var=val1

SELcc EA,d0
Code:
  cmp ?
  bcc .skip
  move.l EA,d0
.skip:
The first variation above is less flexible than predication because it can only be used to set a variable (like CMOVcc) where predication can do many other operations. Code density is about the same but the SELcc may be simpler to use and/or implement (needs further review by unbiased hardware designers). Some CPU designs may not have predication and this variation comes for free with the more powerful variation below.
This variant does not need to add a new instruction. The decoder could just merge the bcc with the move and execute an internal conditional move.
Code size is exactly the same (4+ea).


Quote:
Originally Posted by matthey View Post
if (cc) var=val1
else var=val2

SELcc EA,d1,d0
Code:
  cmp ?
  bcc .skip1
  move.l EA,d0
  bra .skip2
.skip1:
  move.l d1,d0
.skip2:
This variation replaces 2 branches in an if/then/else statement but we are limited again to only being able to set a variable to one of two choices. Code density is improved and would commonly be half of the 2xbranch version. Removing 2 branches and halving the code size is a big improvement but the question is how often could it be used and could compilers take advantage of it (needs further review by unbiased compiler designers).
The block can't be merged into a single instruction (or at least it becomes a lot more difficult). And code size being half of the 2xbranch version would be a nice saving.
However if D1 needs to be a full EA then you're stuck. Same for D0. Immediates (which are the most common case) can't be used (well, not for both operands). This severely limits the number of potential cases. I'm afraid that it'll end up with the coder saying "oh no i can't use it" in 90% of cases.

This is why i suggest having a look in real life code to find use cases of it. I considered this kind of instruction long ago, had a look and didn't find any, but you might be more lucky.
IOW studying complete routines would bring more info.


Quote:
Originally Posted by matthey View Post
BTFN is the default static branch prediction on the 68040+. It is not possible to move code around in all cases to avoid branch mispredictions the first 2 times and is it not a good idea to move code around in some other cases. Efficient code should fall through to maximize the ICache and instruction stream length between branches. Falling through branches when BTFN is incorrect can also improve code density slightly.
But is "first 2 times" really worth worrying about? Code that's not executed many times doesn't need to be optimal.


Quote:
Originally Posted by matthey View Post
I'm not a fan of the 'L' at the beginning. I'm leaning on going back to BITREV if BREV is unclear. BFREV would be clear though?
BFREV is confusing. That's not a true bit-field instruction.
REVL maybe ?


Quote:
Originally Posted by matthey View Post
Assembler peephole optimizers usually can't do the MOVEQ+OP optimization because there usually isn't a trash register defined.
I was talking about the AND.L -> BFEXTU size optimization.


Quote:
Originally Posted by matthey View Post
Compilers can and do do the optimization but it adds to the complexity of compilers and processors to handle them. The assembler code is less typing and looks more professional without MOVEQ and sometimes saves a register which can be used for other things or improved code density when a trash register is not available. I think it is great and you hate it. Gunnar liked it so it was probably easy to implement from the hardware side as I would expect with it so simple.
In comparison to MOVEQ, it's not smaller. If done in HW, MOVEQ+op could be merged in one instruction.

In addition, the short immediate being an addressing mode, the target must be a register ('xcept for move).
It means you can't use short immediates for memory.
It would be kinda strange to be able to do ADD.L #$1234.W,D0 but not ADD.L #$1234.W,(A0), whereas we can do ADD.L #$1234,(A0). No good for orthogonality - if you care about that.


Quote:
Originally Posted by matthey View Post
BFEXG would be nice and would be possible if there was a free encoding bit in BFEXTU and BFEXTS. The bit could turn on bit swapping between the extracted area and the destination register for BFEXTU/S. This would make it also like BFINS which could have shared encoding space with 2 encoding bits (EB=Extract Bit, IB=Insert Bit)

EB=0 IB=0 BFTST
EB=1 IB=0 BFEXTS/U
EB=0 IB=1 BFINS
EB=1 IB=1 BFEXG

Of course, you may want BFEXG to swap bits between 2 registers with the same offset and width. I believe this kind of BFEXG would be less general purpose and not work as well on bit streams. Any bit offset >31 would be useless without being able to specify 2 bit fields which is too expensive.
Adding new bit-field instructions could be done without eating encoding space, but alas only in an incompatible way.

Actually the encoding space could have been reduced by half by using the trick of putting '1111' in the register field for operations that don't use it (bit fields can be useful for An but not A7).
This means we would only have 4 opcodes: 00 BFEXTU/BFTST, 01 BFEXTS/BFCHG, 10 BFINS/BFCLR, 11 BFFFO/BFSET. The version without the register is selected if that register is A7.

Then we could add new bitfield ops : BFREV, BFEXG, BFCMP, maybe even BFASL, BFLSR. Or a simple BFEXT which extracts the field without extending it, keeping the other bits in the target.

But the BF are complicated enough the way they are (for HW), so i believe we're ok with what we already have...


Quote:
Originally Posted by matthey View Post
Hardware repeat/loop registers are good for performance because the register can't be changed in the loop unlike a general purpose (GP) register. However, they are generally less flexible and the register lost may be better used as GP. Loops not of the decrement and branch type are better off with another GP register. The 68k already has a DBcc instruction but it is challenging for performance because the loop register can change inside the loop. There are several possibilities for dealing with this. Maybe the now unused lowest bit in the displacement could be set which would tell the CPU that the loop counter is not changed in the loop and optimizations can be made (perhaps the CPU itself could also set the bit after using loops which don't change the counter). We have the cards we were dealt with with the 68k. We also need good compatibility. I would rather focus on improving the performance of what we have before bolting on a bunch of foreign loop instructions and making processors deal with too many loop variations.
I wonder if hardware loops couldn't replace the SIMD extensions.
That would be called hardware autovectorization.
Then it would be potentially beneficial to every program, not just ones that make the effort to use those filthy vector extensions.
And the next gen could have better performance without rewriting any program.

Just dreaming...
Old 08 August 2016, 21:24   #26
Mrs Beanbag
Glastonbridge Software
 
Mrs Beanbag's Avatar
 
Join Date: Jan 2012
Location: Edinburgh/Scotland
Posts: 2,243
Quote:
Originally Posted by matthey View Post
It is no more difficult to use an absolute location than a PC relative one.
Well that depends. If you are loading your code into an AMOS memory bank or writing an AMOS extension, it is impossible to use relocation. I'm sure this is a shortcoming in AMOS. Nevertheless i do have to work with it sometimes! Or i have in the past.

Quote:
The 68020 addressing modes are very versatile and can do what you want.
Code:
    jmp ([d16,An])
    jsr ([d16,An])
True but they result in a bigger encoding (not taking any other supporting code into account).

In my own code i just jump into a table of branches. I know it is not top for performance but it is great for flexibility, and sometimes i just really want to be able to do that sort of thing.

Quote:
The 68k designers probably had OCD because most data in encodings is well organized and commonly 8, 16 or 32 bits. Data extensions are in multiples of 16 bits as the variable length encodings are multiples of 16 bits.
now i'm not suggesting they "should" have done any differently but i did comment elsewhere about how a byte-sized immediate instruction wastes a whole byte.

As for the whole branch prediction subject, i did wonder if different condition codes have different branch frequencies. Does a BNE get taken more often than a BEQ? Or a BVS?
Old 08 August 2016, 22:23   #27
meynaf
Quote:
Originally Posted by Mrs Beanbag View Post
now i'm not suggesting they "should" have done any differently but i did comment elsewhere about how a byte-sized immediate instruction wastes a whole byte.
A byte encoding would be very good for code density. Too bad it can't be done in a compatible way.


Quote:
Originally Posted by Mrs Beanbag View Post
As for the whole branch prediction subject, i did wonder if different condition codes have different branch frequencies. Does a BNE get taken more often than a BEQ? Or a BVS?
False conditions are probably taken more frequently than true conditions on average, but i doubt it goes very far.
Old 09 August 2016, 00:18   #28
matthey
Quote:
Originally Posted by meynaf View Post
This variant does not need to add a new instruction. The decoder could just merge the bcc with the move and execute an internal conditional move.
Code size is exactly the same (4+ea).
Yes, but not all processors would have predication which I believe is more complex than the SELcc instruction. This variation of SELcc comes for free with the more powerful variation.

Quote:
Originally Posted by meynaf View Post
The block can't be merged in a single instruction (or at least it becomes a lot more difficult). And code size being half of the 2xbranch version would be a nice saving.
However if D1 needs to be full EA then you're caught. Same for D0. Immediates (which are the most common case) can't be used (well, not for both operands). This severely limits the number of potential cases. I'm afraid that it'll end up with the coder saying "oh no i can't use it" in 90% cases.
The cc can be reversed and the EA used for the other selection. Gunnar's statements gave me the impression that conditional writing/storing (especially to cache/memory) was more costly than conditional reading. I believe SELcc is cheaper and has more performance potential than CMOVcc. If the coder can't use it in 90% of cases then it means they can in 10% which I would consider a good result. Removing 2 branches 5% of the time would be a good result.

Quote:
Originally Posted by meynaf View Post
But is "first 2 times" really worth worrying about ? Code that's not executed many times doesn't need to be optimal.
Most users and programmers on fast processors wouldn't worry about a few missed branch predictions as it is such a small percentage of the CPU time wasted. They probably aren't going to worry about the extra energy used fetching and throwing away instructions from branch mis-predictions either (the number can be significant). Embedded systems needing deterministic and consistent timings with near real time results on lower clocked processors with power consumption restrictions may have a different perspective.

Quote:
Originally Posted by meynaf View Post
BFREV is confusing. That's not a true bit-field instruction.
REVL maybe ?
I was asking hypothetically, if there was a BFREV (Bit Field Reverse), would the name be clear. BREV should be just as clear for the non-BF instruction. What happens to REVL with the size appended as REVL.L? It sounds like Reverse Long.Long. I like BREV but the original BITREV name is ok. EREV for BYTEREV would be ok as BYTEREV is a long name I would like to shorten.

Quote:
Originally Posted by meynaf View Post
I was talking about the AND.L -> BFEXTU size optimization.
The assembler peephole optimizer under most compilers will only do optimizations where the cc flags are set the same. The vbcc 68k backend suffers because Volker assumed the assembler could do many peephole optimizations which ended up not being possible. The 68k setting the cc all the time is good for code density but makes peephole optimizing and instruction scheduling more challenging.

Quote:
Originally Posted by meynaf View Post
In comparison to MOVEQ, it's not smaller. If done in HW, MOVEQ+op could be merged in one instruction.
There is complexity to instruction folding/merging and result forwarding. Are we going to assume 3 ops internally with instruction folding, result forwarding and predication? I believe the addressing mode technique is very simple in comparison to processor internal optimizations.

Quote:
Originally Posted by meynaf View Post
In addition, the short immediate being an addressing mode, the target must be a register ('xcept for move).
It means you can't use short immediates for memory.
It would be kinda strange to be able to do ADD.L #$1234.W,D0 and not ADD.L #$1234.W,(A0), where we can do ADD.L #$1234,(A0). No good for orthogonality - if you care about that.
I originally wanted to do the OPI.L #data.w,EA encodings for consistency (less about orthogonality IMO) but the gain was not worth compatibility issues. I caved on that one and sided with you which was one of the major sticking points between us and Gunnar. It would be possible to convert the OPI.L #data.w,Dn -> OP.L #data.w,Dn at least. My code analysis did show a significant gain with the addressing mode.

Quote:
Originally Posted by meynaf View Post
Adding new bit-field instructions could be done without eating encoding space, but alas only in an incompatible way.

Actually it could have been reduced by half if using the trick to use '1111' as register for operations that don't use it (bitfields can be useful for An but not A7).
This means we would only have 4 opcodes : 00 BFEXTU/BFTST, 01 BFEXTS/BFCHG, 10 BFINS/BFCLR, 11 BFFFO/BFSET. Version without the register is selected if that register is A7.

Then we could add new bitfield ops : BFREV, BFEXG, BFCMP, maybe even BFASL, BFLSR. Or a simple BFEXT which extracts the field without extending it, keeping the other bits in the target.

But the BF are complicated enough the way they are (for HW), so i believe we're ok with what we already have...
Yea, the encodings for the BF instructions could have been more powerful and more compact. BFCNT would have been nice also. It is not as easy to do much with them now and CPU designers hate the work needed to implement them. Maybe something to think about for a new CPU though.

Quote:
Originally Posted by meynaf View Post
I wonder if hardware loops couldn't replace the SIMD extensions.
That would be called hardware autovectorization.
Then it would be potentially beneficial to every program, not just ones that make the effort to use those filthy vector extensions.
And the next gen could have better performance without rewriting any program.
Superscalar allows parallel operations but usually doesn't have the pipes, resources and/or registers to do as much in parallel as SIMD. SIMD is very efficient at its limited operations. Superscalar is much more flexible but can't compete with the brute force of an SIMD.
Old 09 August 2016, 01:10   #29
matthey
Quote:
Originally Posted by Mrs Beanbag View Post
Well that depends. If you are loading your code into an AMOS memory bank or writing an AMOS extension, it is impossible to use relocation. I'm sure this is a shortcoming in AMOS. Nevertheless i do have to work with it sometimes! Or i have in the past.
The AmigaOS scatter loader does most of the work. Programmers rarely have to worry about relocs. Position independent code is nice though. It would have to be cheaper than (bd32,PC,Rn.Size*Scale) and have more range than (bd16,PC,Rn.Size*Scale) to try and do away with relocs. Did you not like the (bd20,PC,Rn.Size*Scale) idea?

(bd16,PC,Rn.Size*Scale) range is +32767 to -32768 bytes
(bd20,PC,Rn.Size*Scale) range is +524287 to -524288 bytes
(bd32,PC,Rn.Size*Scale) range is +2147483647 to -2147483648 bytes

(bd20,PC,Rn.Size*Scale) would allow about a 1MB all PC relative executable compared to about a 65kB all PC relative executable with (bd16,PC,Rn.Size*Scale) while giving the same size instruction.
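The quoted ranges are just the signed two's-complement spans of the displacement widths; a quick Python check (illustrative only):

```python
# Signed range of an n-bit two's-complement displacement field.
def disp_range(bits):
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1

# 20 bits covers roughly +/- 512 kB of displacement, which is what makes
# an all PC relative executable of about 1MB possible.
```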

Quote:
Originally Posted by Mrs Beanbag View Post
Code:
    jmp ([d16,An])
    jsr ([d16,An])
True but they result in a bigger encoding (not taking any other supporting code into account).

Code:
    move.l (d16,An),An ; 4 bytes
    jmp (An) ; 2 bytes
vs

Code:
    jmp ([d16,An]) ; 6 bytes
Tie. Same size. There are a few cases where the size is smaller with double memory indirect addressing modes.

Quote:
Originally Posted by Mrs Beanbag View Post
As for the whole branch prediction subject, i did wonder if different condition codes have different branch frequencies. Does a BNE get taken more often than a BEQ? Or a BVS?
Forward BEQ, BNE and BVS branches are usually error checks. Most static prediction is based on the direction only, as explained in the link I posted above.

Always not taken ~40% correct
Always taken ~60% correct
BTFN ~65% correct
Semi-Static hint bit with profiling ~75% correct
Old 09 August 2016, 09:30   #30
meynaf
Quote:
Originally Posted by matthey View Post
Yes, but not all processors would have predication which I believe is more complex than the SELcc instruction. This variation of SELcc comes for free with the more powerful variation.
This isn't predication. This is just a decoder variation, turning an instruction pair into an internal conditional move.


Quote:
Originally Posted by matthey View Post
The cc can be reversed and the EA used for the other selection. Gunnar's statements gave me the impression that conditional writing/storing (especially to cache/memory) was more costly than conditional reading. I believe SELcc is cheaper and has more performance potential than CMOVcc. If the coder can't use it in 90% of cases then it means they can in 10% which I would consider a good result. Removing 2 branches 5% of the time would be a good result.
It wouldn't be 10% of all code, just 10% of the small 1% where the instruction has potential.


Quote:
Originally Posted by matthey View Post
Embedded processors needing deterministic and consistent timings with near real time results on lower clocked processors with power consumption restrictions may have a different perspective.
Sure but BTFN is enough for them.


Quote:
Originally Posted by matthey View Post
I was asking hypothetically, if there was a BFREV (Bit Field Reverse), would the name be clear.
Then of course yes, it's clear.


Quote:
Originally Posted by matthey View Post
BREV should be just as clear for the non-BF instruction.
Not necessarily. Sounds like BTST but doesn't operate on a single bit.


Quote:
Originally Posted by matthey View Post
What happens to REVL with the size appended as REVL.L? It sounds like Reverse Long.Long.
The same thing as for DIVUL.L. Why append a size anyway? REVL is enough by itself.


Quote:
Originally Posted by matthey View Post
I like BREV but the original BITREV name is ok. EREV for BYTEREV would be ok as BYTEREV is a long name I would like to shorten.
If BITREV is ok but not BYTEREV, you could try BYTREV.


Quote:
Originally Posted by matthey View Post
The assembler peephole optimizer under most compilers will only do optimizations where the cc flags are set the same. The vbcc 68k backend suffers because Volker assumed the assembler could do many peephole optimizations which ended up not being possible. The 68k setting the cc all the time is good for code density but makes peephole optimizing and instruction scheduling more challenging.
Then indeed BFEXTU isn't the solution for compilers (even though asm programmers can still do it manually).
But how many times is AND.L of a small constant needed? Do you have some statistics on this?


Quote:
Originally Posted by matthey View Post
There is complexity to instruction folding/merging and result forwarding. Are we going to assume 3 ops internally with instruction folding, result forwarding and predication? I believe the addressing mode technique is very simple in comparison to processor internal optimizations.
Consider the problems your addressing mode can have. Do you allow it to execute if the size isn't a longword? If yes, you're creating many useless combinations. If no, then you make the decoding more complicated.


Quote:
Originally Posted by matthey View Post
I originally wanted to do the OPI.L #data.w,EA encodings for consistency (less about orthogonality IMO) but the gain was not worth compatibility issues. I caved on that one and sided with you which was one of the major sticking points between us and Gunnar. It would be possible to convert the OPI.L #data.w,Dn -> OP.L #data.w,Dn at least. My code analysis did show a significant gain with the addressing mode.
Do you have detailed statistics about this gain?


Quote:
Originally Posted by matthey View Post
Yea, the encodings for the BF instructions could have been more powerful and more compact. BFCNT would have been nice also. It is not as easy to do much with them now and CPU designers hate the work needed to implement them. Maybe something to think about for a new CPU though.
Oddly enough, when reencoding i've thrown BFFFO out of the BF, keeping only simple 32-bit FFO. Seems it's the most complex of them all, the others being only data movers.


Quote:
Originally Posted by matthey View Post
Superscalar allows parallel operations but usually doesn't have the pipes, resources and/or registers to do as much in parallel as SIMD. SIMD is very efficient at its limited operations. Superscalar is much more flexible but can't compete with the brute force of an SIMD.
I was just dreaming about SIMD advantages without the shortcomings.

Current SIMD needs extra-large registers which can't be fed from the DCache (too large) and have problems with memory latency, nullifying a large part of their potential.
Furthermore they're used on fixed size data, which doesn't match real life needs where data isn't necessarily a nice multiple of your SIMD size.
They rely on either handwritten asm (a dead end as it makes asm writing more complicated), cumbersome vector datatypes (another dead end as the casual programmer won't use them), or autovectorization features of the compiler (which can only do trivial cases, when it can do something).

I don't know exactly how some DSP's hardware loops work, but they look and feel like SIMD without extra instructions.


Quote:
Originally Posted by matthey View Post
Always not taken ~40% correct
Always taken ~60% correct
BTFN ~65% correct
Semi-Static hint bit with profiling ~75% correct
Does that mean that the hint bit only provides a 10% gain?
meynaf is offline  
Old 09 August 2016, 11:05   #31
Mrs Beanbag
Glastonbridge Software
 
Mrs Beanbag's Avatar
 
Join Date: Jan 2012
Location: Edinburgh/Scotland
Posts: 2,243
Quote:
Originally Posted by matthey View Post
The AmigaOS scatter loader does most of the work. Programmers rarely have to worry about relocs.
Tell it to Francois Lionet.

If you load an executable into an AMOS memory bank, you only get the first code hunk. It doesn't process the RELOC table. And even if it did, it throws it away so there is no way to re-RELOC it when you save your program and load it in again.

Also when writing AMOS extensions, the compiler will pull only the extension functions that are actually used out of the executable and concatenate them, with no RELOC data; you use special macros to define branches from one function to another. Actually it is horrible, because it just goes through the file looking for some specific codes, so some of your data might accidentally match! But this is what i've got to work with...

Quote:
(bd20,PC,Rn.Size*Scale) would allow about a 1MB all PC relative executable compared to about a 65kB all PC relative executable with (bd16,PC,Rn.Size*Scale) while giving the same size instruction.
Yes a 20 bit offset would be acceptable, if there is space for it, why not use it?

Quote:
Code:
    move.l (d16,An),An ; 4 bytes
    jmp (An) ; 2 bytes
vs

Code:
    jmp ([d16,An]) ; 6 bytes
Tie. Same size. There are a few cases where the size is smaller with double memory indirect addressing modes.
Yes but my point was JMPM/JSRM d16(An) could be only 4 bytes.

Quote:
BEQ, BNE and BVS forward are usually errors. Most static prediction is based on the direction only, as explained in the link I posted above.
The link does not answer my question. I know how static prediction usually works. What i don't know is whether a forward BEQ is taken more or less often than a forward BVS, for instance.
Mrs Beanbag is offline  
Old 09 August 2016, 11:28   #32
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,332
Quote:
Originally Posted by Mrs Beanbag View Post
Tell it to Francois Lionet.

If you load an executable into an AMOS memory bank, you only get the first code hunk. It doesn't process the RELOC table. And even if it did, it throws it away so there is no way to re-RELOC it when you save your program and load it in again.

Also when writing AMOS extensions, the compiler will pull only the extension functions that are actually used out of the executable and concatenate them, with no RELOC data, you use special macros to define branches to one function from another. Actually it is horrible, because it just goes through the file looking for some specific codes, so some of your data might accidentally match! But this is what i've got to work with...
Sounds like fun. If i faced this, i'd start by trying to encapsulate this horror, like having the code just load something from outside with LoadSeg.


Quote:
Originally Posted by Mrs Beanbag View Post
Yes but my point was JMPM/JSRM d16(An) could be only 4 bytes.
You're counting 4 or 6 bytes while having a big 4-byte per entry jump table (the relocs making it even worse). It's like counting cents after having wasted whole dollars.
What about starting by using word size offsets instead?


Quote:
Originally Posted by Mrs Beanbag View Post
The link does not answer my question. I know how static prediction usually works. What i don't know is whether a forward BEQ is taken more or less often than a forward BVS, for instance.
Forward or backward, BVS is rarely taken. BEQ is difficult to predict; i'd say 50%.
This won't help much though, as BEQ is the most frequently occurring branch type whereas BVS is relatively rare.
meynaf is offline  
Old 09 August 2016, 11:36   #33
Mrs Beanbag
Glastonbridge Software
 
Mrs Beanbag's Avatar
 
Join Date: Jan 2012
Location: Edinburgh/Scotland
Posts: 2,243
Quote:
Originally Posted by meynaf View Post
Sounds like fun. If faced this, i'd start by trying to encapsulate this horror, like having the code just load something from outside with LoadSeg.
Well, the whole point of using the AMOS banks really is so that you don't need to have lots of files floating about outside the main executable. But your approach means a whopping resource leak every time you run the program, because there is no way to make sure UnLoadSeg is called on program exit...

As for extensions, it might be worthwhile to create OS libraries with all the functionality and then just have a thin wrapper as the AMOS extension.

Quote:
Forward or backward, BVS is rarely taken. BEQ is difficult to predict; i'd say 50%.
This won't help much though, as BEQ is the most frequently occurring branch type whereas BVS is relatively rare.
True but would be fairly trivial to implement.
Mrs Beanbag is offline  
Old 09 August 2016, 11:49   #34
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,332
Quote:
Originally Posted by Mrs Beanbag View Post
Well, the whole point of using the AMOS banks really is so that you don't need to have lots of files floating about outside the main executable. But your approach means a whopping resource leak every time you run the program, because there is no way to make sure UnLoadSeg is called on program exit...
Better rewrite AMOS itself then


Quote:
Originally Posted by Mrs Beanbag View Post
As for extensions, it might be worthwhile to create OS libraries with all the functionality and then just have a thin wrapper as the AMOS extension.
Yes but then it's CloseLibrary that's not guaranteed to be called. The leak is smaller but still there.


Quote:
Originally Posted by Mrs Beanbag View Post
True but would be fairly trivial to implement.
Trivial yes, but for what gain?
BVS is one branch out of something like 2000.
meynaf is offline  
Old 09 August 2016, 12:21   #35
Mrs Beanbag
Glastonbridge Software
 
Mrs Beanbag's Avatar
 
Join Date: Jan 2012
Location: Edinburgh/Scotland
Posts: 2,243
Quote:
Originally Posted by meynaf View Post
Yes but then it's CloseLibrary that's not guaranteed to be called. The leak is smaller but still there.
An extension gets informed on shutdown, a memory bank does not.

But yeah. Roll on AMOS 3 which isn't terrible in myriad ways? I dunno. The personal answer must be "stop using it" but it's convenient as a development environment.

Quote:
Trivial yes, but for what gain?
BVS is one branch out of something like 2000.
I really don't know unless i see actual data; maybe some more common condition codes are also biased in different ways.

Quote:
You're counting 4 or 6 bytes while having a big 4-byte per entry jump table (the relocs making it even worse). It's like counting cents after having wasted whole dollars.
What about starting by using word size offsets instead?
Well ok, but it's also not just about memory used by the code but about time wasted reading instructions to execute. And a word-read doesn't cost less than a longword-read on 68020+, so...

But anyway, supposing we had an instruction that did this in 4 bytes:
Code:
  add.w d16(An),An
  jmp (An)
and it would be nice not to have to trash An while doing it.

Last edited by Mrs Beanbag; 09 August 2016 at 12:28.
Mrs Beanbag is offline  
Old 09 August 2016, 17:33   #36
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,332
Quote:
Originally Posted by Mrs Beanbag View Post
Well ok, but it's also not just about memory used by the code but about time wasted reading instructions to execute. And a word-read doesn't cost less than a longword-read on 68020+, so...
A word doesn't cost less than a longword to read, but the longword costs more space in the dcache. So if the table gets read repetitively, it had better be small.


Quote:
Originally Posted by Mrs Beanbag View Post
But anyway, supposing we had instruction that did this in 4 bytes:
Code:
  add.w d16(An),An
  jmp (An)
and it would be nice not to have to trash An while doing it.
But there, d16 is a constant.
Did you mean, rather:
Code:
 add.w (An,Dn.w*2),An
 jmp (An)
In that case it would even be better to replace :
Code:
 lea table(pc),An
 add.w (An,Dn.w*2),An
 jmp (An)
By :
Code:
 jmpt table(pc),Dn
meynaf is offline  
Old 09 August 2016, 19:30   #37
Mrs Beanbag
Glastonbridge Software
 
Mrs Beanbag's Avatar
 
Join Date: Jan 2012
Location: Edinburgh/Scotland
Posts: 2,243
Quote:
Originally Posted by meynaf View Post
A word doesn't cost less than a longword to read, but the longword costs more space in the dcache. So if the table gets read repetitively, it had better be small.
fair point

Quote:
But there, d16 is a constant.
Did you mean, rather :
no, i really meant d16 as a constant. Usually, if you are calling a virtual method through a vtable, you know the table offset at compile time, but you don't necessarily know the address of the table itself (if you did, you wouldn't even need to use it, since the addresses of the relevant methods would be known directly). The address of the table will be the first longword in the polymorphic object, so the full process is:
Code:
move.l (An),Am
add.w d16(Am),Am
jsr (Am)
An being a pointer to the object. Am, i suppose, is re-usable if you need to call more than one method on the same object.
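The same dispatch, modelled in Python (names and object layout are made up for illustration; the slot number plays the role of the compile-time d16 offset):

```python
# An "object" is (vtable, data); the call site knows only the slot.
def call_method(obj, slot, *args):
    vtable, _ = obj            # move.l (An),Am  -- fetch the vtable
    method = vtable[slot]      # add.w d16(Am),Am -- index the table
    return method(obj, *args)  # jsr (Am)

def base_name(obj):    return "base"
def square_area(obj):  return obj[1] ** 2
def derived_name(obj): return "derived"

base_vtable    = [base_name, square_area]
derived_vtable = [derived_name, square_area]  # slot 1 inherited unchanged

b = (base_vtable, 3)   # objects: (vtable pointer, instance data)
d = (derived_vtable, 4)
```

The derived table reusing `square_area` directly mirrors the point about un-overridden base class functions.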

Also if inheritance is used, the relevant function might not actually be in the same class or module as the vtable. Where a base class function is not overridden, the vtable for the derived class might point directly to the base class functions. So relative offsets may not be the best choice in this scenario. Although if these kinds of objects are to be loaded and linked dynamically, i suppose more than just the OS reloc tables will be needed.

Also i have problems in my own code, aside from AMOS's foolishness: i compress my executables in my own format, so how to do relocs then? I can process the reloc tables myself into whatever format i can use; i know it's not a hugely complex procedure, being as it's just going through a list of offsets and adding on the base address, but so far i have not bothered to do it since i haven't needed to.
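For what it's worth, the reloc pass really is just that — here's a sketch, assuming a flat list of longword offsets as in a hunk's RELOC32 table:

```python
def apply_relocs(code, offsets, base):
    # For each reloc offset, add the hunk's load address to the
    # big-endian longword stored at that offset in the code image.
    for off in offsets:
        val = int.from_bytes(code[off:off + 4], "big")
        code[off:off + 4] = ((val + base) & 0xFFFFFFFF).to_bytes(4, "big")
    return code

# Tiny made-up image: a longword 0x10 at offset 4 gets the base added.
img = bytearray(b"\x00\x00\x00\x00" + (0x10).to_bytes(4, "big"))
apply_relocs(img, [4], 0x20000)   # 0x10 -> 0x20010
```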
Mrs Beanbag is offline  
Old 09 August 2016, 22:56   #38
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Quote:
Originally Posted by meynaf View Post
This isn't predication. This is just a decoder variation, turning an instruction pair into an internal conditional move.
Yes, but SELcc is more useful if there is no predication support. It may not be wise to assume all processors will have predication though.

Quote:
Originally Posted by meynaf View Post
Sure but BTFN is enough for them.
Maybe. Consistent and predictable performance is more important for embedded processors where most modern processors have good peak performance but can spend many cycles delaying without executing much code at all. A branch hint bit may help smooth out performance and save cycles for repetitive and predictable tasks common in the embedded market. Is an enhanced 68k CPU more likely to be used for embedded or desktop purposes?

Quote:
Originally Posted by meynaf View Post
The same thing as for DIVUL.L. Why appending a size anyway ? REVL is enough by itself.
Why not always append the .L and have REV.L? I really don't know what kind of data I'm reversing when I see REVL or REV.L though. BREV is better for me. If BREV (Bit Reverse) sounds like one bit to you then shouldn't BITREV be BITSREV and BYTEREV be BYTESREV as well?

Quote:
Originally Posted by meynaf View Post
If BITREV is ok but not BYTEREV, you could try BYTREV
I've seen worse.

Quote:
Originally Posted by meynaf View Post
Then indeed BFEXTU isn't the solution for compilers (even though asm programmers can still do it manually).
But how many times is AND.L of a small constant needed ? Do you have some statistics on this ?
I believe small constants would be common as many compilers promote shorter integer datatypes to longwords. Vbcc generated code saw one of the biggest benefits from OP.L #data.w,Dn as well as MVS/MVZ because this helps it promote the integers to longwords and compress the OP.L immediates. SAS/C, at the other extreme, often generates code which works with .b and .w data sizes where only a few peephole optimizations can be made here and there. The workload and optimization options can make a big difference as well.

Quote:
Originally Posted by meynaf View Post
Consider the problems your addressing mode can have. Do you allow it to execute if the size isn't a longword ? If yes, you're creating many useless combinations. If no, then you make the decoding more complicated.
ARM's approach of declaring some encodings as undefined/reserved but not trapping is probably the most efficient. Trapping has its advantages and is more 68k like. It should be enough to be clear that other variations of the addressing mode are undefined/reserved whether they are trapped or not.

Quote:
Originally Posted by meynaf View Post
Do you have detailed statistics about this gain ?
My statistics were more like a random sampling than a full statistical study. I recall cases where I saw 5%+ code density improvements with vbcc generated code with a combination of OP.L #data.w,Dn and MVS/MVZ where SAS/C generated code sometimes didn't even have a 1% improvement. I was searching for particular instruction pairs so it is likely that more gains would be possible with a compiler and peephole assembler aware of the new functionality.

Quote:
Originally Posted by meynaf View Post
Oddly enough, when reencoding i've thrown BFFFO out of the BF, keeping only simple 32-bit FFO. Seems it's the most complex of them all, the others being only data movers.
BFFFO is probably the most powerful BF instruction though. It removes a whole loop (BFCNT would have similar complexity and advantages). It would be possible to do a BFEXT first but it would make TLSFMem type memory handling less optimal. BFFFO is possible in 1-2 cycles, which allows dynamic memory allocations to be 5x faster on the Apollo-core. This alone would be worth the effort to implement BFFFO if it could be used in the OS, which is blocked in the case of the AmigaOS. AROS may seize the opportunity as AmigaOS dies. We need 3 incompatible flavors of AmigaOS after all.
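The TLSF connection is exactly one find-first-one per level — a sketch (sl_bits and the mapping details are illustrative, and sizes are assumed to be at least 2**sl_bits):

```python
def tlsf_index(size, sl_bits=4):
    # First-level index: position of the top set bit of the request
    # size -- what BFFFO/FFO delivers in 1-2 cycles instead of a loop.
    fl = size.bit_length() - 1
    # Second-level index: the next sl_bits bits below the top bit.
    # Assumes size >= 1 << sl_bits so the shift count is non-negative.
    sl = (size >> (fl - sl_bits)) & ((1 << sl_bits) - 1)
    return fl, sl
```

Without a fast FFO, `fl` has to be found with a shift loop or a table, which is where the allocator speedup comes from.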

Quote:
Originally Posted by meynaf View Post
I was just dreaming about SIMD advantages without the shortcomings.

Current SIMD needs extra-large registers which can't be feed with the DCache (too large) and have problems with memory latency, nullifying a large part of their potential.
Furthermore they're used on fixed size data, which doesn't match real life needs where data isn't necessarily a nice multiple of your SIMD size.
They rely on either handwritten asm (a dead end as it makes asm writing more complicated), cumbersome vector datatypes (another dead end as the casual programmer won't use them), or autovectorization features of the compiler (which can only do trivial cases, when it can do something).

I don't know exactly how some DSP's hardware loops work, but they look and feel like SIMD without extra instructions.
DSPs are highly tuned, specialized and difficult to use also. A general purpose superscalar processor with some DSP like instructions can usually keep up but with more cost in resources. An SIMD is for when you go big into data processing; it needs to process a lot of data at once to be worthwhile. It makes sense not to cache big data streams in most cases.

Quote:
Originally Posted by meynaf View Post
Does that mean that the hint bit only provides a 10% gain ?
There are potentially more gains than the branch prediction success gains. Code can be better organized to fall through the branch, increasing the number of instructions between branches, making better use of the ICache and improving code density in some cases. This is what happens when BTFN is the correct prediction, but this is only ~65% correct.
matthey is offline  
Old 09 August 2016, 23:20   #39
Mrs Beanbag
Glastonbridge Software
 
Mrs Beanbag's Avatar
 
Join Date: Jan 2012
Location: Edinburgh/Scotland
Posts: 2,243
andi.l #$ff,Dn can be used to extend an unsigned byte to longword size. Then again, a special instruction for that could be better. ori.l #data.w,Dn, on the other hand, would be perfectly pointless.

add/sub.l #data.w,Dn would be very useful, however. I often need to do this.
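For the record, masking is all the zero-extension is — a trivial Python model of the register contents:

```python
MASK32 = 0xFFFFFFFF

def andi_l(dn, imm):
    # ANDI.L #imm,Dn: 32-bit AND of a data register with an immediate.
    return dn & imm & MASK32

def zext_b(dn):
    # What ANDI.L #$ff,Dn leaves: the low byte, bits 8-31 cleared.
    return andi_l(dn, 0xFF)

def zext_w(dn):
    # What ANDI.L #$ffff,Dn leaves: the low word, bits 16-31 cleared.
    return andi_l(dn, 0xFFFF)
```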
Mrs Beanbag is offline  
Old 10 August 2016, 01:11   #40
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Quote:
Originally Posted by Mrs Beanbag View Post
Yes a 20 bit offset would be acceptable, if there is space for it, why not use it?
Maybe I will try to write up the encoding for evaluation. Gunnar suggested the idea as I recall. I wasn't big on the idea because I only found a few programs which could have used it. The GCC 3.x compiler is one of them as I recall. It could partially explain why the GCC 3.x compiler was so much slower than the GCC 2.x compiler. Most programmers don't look at their code and probably wouldn't have noticed though.

Quote:
Originally Posted by Mrs Beanbag View Post
The link does not answer my question. I know how static prediction usually works. What i don't know if whether a forward BEQ is taken more or less often than a forward BVS, for instance.
You would need a special profiler for this. I'm not sure the right tool exists for the 68k. Motorola may have had tools like this but we are unlikely to ever see them.

Quote:
Originally Posted by Mrs Beanbag View Post
andi.l #$ff,Dn can be used to extend an unsigned byte to longword size. Then again, a special instruction for that could be better.
Vasm already has the following optimizations.

Quote:
Originally Posted by vasm.pdf
ANDI.L #$ff,Dn optimized to MVZ.B Dn,Dn, for ColdFire ISA B/C.
ANDI.L #$ffff,Dn optimized to MVZ.W Dn,Dn, for ColdFire ISA B/C.
I hope it is optimizing AND.L also. This reduces the instruction size from 6 bytes to 2 bytes. These ANDs are common in some compiler generated code.

The key here is that the cc is set the same allowing for a "safe" peephole optimization which can be used for compilers. This is true of the addressing mode I proposed which also sets the cc the same way. The cc of AND is commonly used. In fact, it is not unusual for the data to be thrown away. We could have an AND Dn,#data which would preserve the register but set the cc. It would probably be more common than the BTST Dn,#data which meynaf would like to extend to other sizes.

Quote:
Originally Posted by Mrs Beanbag View Post
ori.l #data.w,Dn, on the other hand, would be perfectly pointless.

add/sub.l #data.w,Dn would be very useful, however. I often need to do this.
OR.L #data.w,Dn sets the cc differently from OR.W #data.w,Dn. A safe peephole optimization from OP.L #data.w,Dn -> OP.W #data.w,Dn is not possible because of this. The condition codes are usually not needed with OR, but are compilers smart enough to use OR.W #data.w,Dn when the cc is not needed? Using OR.W #data.w,Dn would be slower on the 68060 in some cases, so should compilers do this? The new addressing mode would allow compilers to generate OR.L which is safe and gives the best performance.

The new OP.L #data.w,Dn addressing mode would work with the following.

ADD.L, AND.L, CMP.L, DIVx.L, MOVE.L, MULx.L, OR.L, SELcc, SUB.L

It would not work with any of the OPI.L encodings but OPI.L #data,Dn could be converted to OP.L #data.w,Dn.
matthey is offline  
 

