matthey
Quote:
Originally Posted by meynaf
This isn't predication. This is just a decoder variation, turning an instruction pair into an internal conditional move.
Yes, but SELcc is more useful where there is no predication support. It may not be wise to assume all processors will have predication, though.
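As an illustration, here is the branchy pattern next to a hypothetical SELcc form; the sellt mnemonic and operand order below are assumptions, since no encoding is finalized:

Code:
    ; branchy select: d0 = (d1 < d2) ? d3 : d0
        cmp.l   d2,d1       ; compute d1 - d2
        bge.s   .keep       ; a misprediction here costs cycles
        move.l  d3,d0
    .keep:

    ; hypothetical SELcc: one instruction, nothing to mispredict
        cmp.l   d2,d1
        sellt.l d3,d0       ; assumed syntax: if lt then d0 = d3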

Quote:
Originally Posted by meynaf
Sure, but BTFN is enough for them.
Maybe. Consistent and predictable performance matters more for embedded processors: most modern processors have good peak performance but can stall for many cycles without executing much code at all. A branch hint bit may help smooth out performance and save cycles for the repetitive, predictable tasks common in the embedded market. Is an enhanced 68k CPU more likely to be used for embedded or desktop purposes?
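As an illustrative sketch only: no 68k branch hint syntax exists, so this borrows SPARC-style ,pt/,pn (predict taken/not taken) suffixes to show where a hint bit would help BTFN.

Code:
    ; BTFN predicts backward branches taken, forward not taken.
    ; A hint bit lets the compiler mark a forward branch that is
    ; almost always taken, where BTFN would guess wrong:
        tst.l   d0
        beq,pt  .done       ; usually taken: the hint steers the fetch
        addq.l  #1,d1       ; rarely executed
    .done:
        rts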

Quote:
Originally Posted by meynaf
The same thing as for DIVUL.L. Why append a size anyway? REVL is enough by itself.
Why not always append the .L and have REV.L? I really don't know what kind of data I'm reversing when I see REVL or REV.L, though. BREV is better for me. If BREV (Bit Reverse) sounds like one bit to you, then shouldn't BITREV be BITSREV and BYTEREV be BYTESREV as well?
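For reference, these are the two operations the names have to distinguish; both idioms below are plain 68000 code:

Code:
    ; byte reverse (endian swap) of d0, the classic 68k idiom:
        rol.w   #8,d0       ; swap bytes in the low word
        swap    d0          ; exchange the two words
        rol.w   #8,d0       ; swap bytes in the new low word

    ; bit reverse of d0 into d2, one bit per iteration:
        moveq   #31,d1
    .loop:
        lsr.l   #1,d0       ; shift the low bit out into X
        roxl.l  #1,d2       ; rotate it into d2 from the bottom
        dbf     d1,.loop    ; 32 iterations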

Quote:
Originally Posted by meynaf
If BITREV is OK but not BYTEREV, you could try BYTREV.
I've seen worse.

Quote:
Originally Posted by meynaf
Then indeed BFEXTU isn't the solution for compilers (even though asm programmers can still do it manually).
But how often is AND.L with a small constant needed? Do you have any statistics on this?
I believe small constants would be common, as many compilers promote shorter integer datatypes to longwords. Code generated by vbcc saw some of the biggest benefits from OP.L #data.w,Dn as well as MVS/MVZ, because these help it promote integers to longwords and compress the OP.L immediates. SAS/C, at the other extreme, often generates code which works with .b and .w data sizes, where only a few peephole optimizations can be made here and there. The workload and optimization options can make a big difference as well.
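For concreteness, MVS/MVZ here follow the ColdFire ISA_B semantics (load and sign/zero extend to 32 bits in one instruction); the #data.w notation is just a way of writing the proposed compressed immediate and is not standard assembler syntax:

Code:
    ; promoting a signed byte to a longword:
        move.b  (a0),d0     ; classic 68k: three instructions
        ext.w   d0
        ext.l   d0

        mvs.b   (a0),d0     ; one instruction, sign extended to 32 bits

    ; 32-bit op with a small immediate:
        add.l   #100,d0     ; 6 bytes: full 32-bit immediate
        add.l   #100.w,d0   ; 4 bytes: proposed sign-extended word
                            ; immediate (hypothetical notation)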

Quote:
Originally Posted by meynaf
Consider the problems your addressing mode can have. Do you allow it to execute if the size isn't a longword? If yes, you're creating many useless combinations. If no, then you make the decoding more complicated.
ARM's approach of declaring some encodings as undefined/reserved but not trapping is probably the most efficient. Trapping has its advantages and is more 68k-like. It should be enough to make clear that other variations of the addressing mode are undefined/reserved, whether they trap or not.

Quote:
Originally Posted by meynaf
Do you have detailed statistics about this gain?
My statistics were more like a random sampling than a full statistical study. I recall cases where vbcc-generated code saw 5%+ code density improvements from a combination of OP.L #data.w,Dn and MVS/MVZ, while SAS/C-generated code sometimes didn't even show a 1% improvement. I was searching for particular instruction pairs, so more gains would likely be possible with a compiler and peephole assembler aware of the new functionality.

Quote:
Originally Posted by meynaf
Oddly enough, when reencoding I've thrown BFFFO out of the BF set, keeping only a simple 32-bit FFO. It seems to be the most complex of them all, the others being only data movers.
BFFFO is probably the most powerful BF instruction, though. It removes a whole loop (BFCNT would have similar complexity and advantages). It would be possible to do a BFEXT first, but that would make TLSFMem-type memory handling less optimal. BFFFO is possible in 1-2 cycles, which allows dynamic memory allocations to be 5x faster on the Apollo core. This alone would be worth the effort of implementing BFFFO if it could be used in the OS, which is blocked in the case of AmigaOS. AROS may seize the opportunity as AmigaOS dies. We need 3 incompatible flavors of AmigaOS after all.
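A minimal sketch of the TLSF-style use, assuming the standard 68020 BFFFO behaviour (the result is offset+width, i.e. 32 here, when no bit is set):

Code:
    ; d0 = 32-bit bitmap of non-empty free lists
        bfffo   d0{0:32},d1 ; d1 = offset of the first set bit from the MSB

    ; without BFFFO, the same search needs a scan loop:
        moveq   #0,d1
    .scan:
        add.l   d0,d0       ; shift the MSB out into C
        bcs.s   .found
        addq.l  #1,d1
        cmp.l   #32,d1      ; empty bitmap: give up at 32
        bne.s   .scan
    .found: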

Quote:
Originally Posted by meynaf
I was just dreaming about SIMD advantages without the shortcomings.

Current SIMD needs extra-large registers which can't be fed from the DCache (too large) and have problems with memory latency, nullifying a large part of their potential.
Furthermore, they're used on fixed-size data, which doesn't match real-life needs, where data isn't necessarily a nice multiple of your SIMD size.
They rely on either handwritten asm (a dead end, as it makes asm writing more complicated), cumbersome vector datatypes (another dead end, as the casual programmer won't use them), or autovectorization features of the compiler (which can only handle trivial cases, when it can do anything at all).

I don't know exactly how some DSPs' hardware loops work, but they look and feel like SIMD without extra instructions.
DSPs are highly tuned, specialized and also difficult to use. A general-purpose superscalar processor with some DSP-like instructions can usually keep up, but at a higher cost in resources. A SIMD unit is for when you go big into data processing; it needs to process a lot of data at once to be worthwhile. It makes sense not to cache big data streams in most cases.

Quote:
Originally Posted by meynaf
Does that mean that the hint bit only provides a 10% gain?
There are potentially more gains than the branch prediction success rate alone. Code can be organized so the common path falls through the branch, which increases the number of instructions between taken branches, makes better use of the ICache, and improves code density in some cases. This is what happens when BTFN is the correct prediction, but BTFN is only ~65% correct.
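For example (illustrative only, with the same hypothetical ,pn hint suffix as above), the compiler can move the cold case out of line so the hot path is a straight run of sequential fetches:

Code:
        tst.l   d0
        beq,pn  .rare       ; unlikely forward branch, predicted not taken
        addq.l  #1,d1       ; hot path falls through: sequential fetch,
        rts                 ; full use of the ICache line
    .rare:
        moveq   #-1,d1      ; cold path kept out of the hot cache lines
        rts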