English Amiga Board


25 May 2018, 18:44   #581
Megol
Registered User

Join Date: May 2014
Location: inside the emulator
Posts: 332
..

Last edited by Megol; 25 May 2018 at 18:51. Reason: Who cares.
25 May 2018, 18:50   #582
Gorf
Registered User

 
Join Date: May 2017
Location: Munich/Bavaria
Posts: 547
Quote:
Originally Posted by meynaf
So the hardware has to be made ASAP.
Otherwise it is putting the carriage before the horse.
I'm with Kolla on this...

We simply need both the carriage (OS/software) and the horse (hardware).
But new horses are expensive, and there are already a mule, a donkey and an ox in our barn.
The mule kicks, the ox is slow and the donkey is stubborn - but all three are able to draw the carriage. Not as well as a horse would, but that is all we have got for now...

Quote:
If you're speaking about the OS rather than the hardware, then classic may be outdated and missing much, but it ain't slow. On fast hardware it actually flies.
Maybe it would - but there is no compatible fast hardware. So it may be relatively fast, but even the fastest snail is still a snail...

Quote:
That will do no good. Current "modern" hardware is under-documented (usually) and very variable (it's close to saying no two machines have exact same config).
Do you really want it to become a mess of drivers?
HAL (based on the NetBSD anykernel)

Quote:
Software doesn't wear out - it can wait.
Especially vaporware!

Quote:
OK, then maybe it is time to actually start. I could talk about my CPU design all day long, but I'd rather be implementing my VM, assembler and debugger. Maybe I will always keep them to myself. Maybe not. Who knows.
That of course is entirely up to you, Sir.

Quote:
AOS is already modular and extensible.
But what it lacks is more "high-level" APIs: currently not much is available directly (without having to open libraries, devices, etc.), and there are often several calls where one should suffice.
It also lacks good error handling (many functions just crash when given wrong parameters) and resource tracking.
I see it as two levels: a set of APIs that is enough for most programs, and more advanced ones giving the features that are rarely needed (but provide more control).
The guru will meditate on these words.

Quote:
Then again, just do it. Most of the recipes should still be valid for an enhanced ISA.
That is what I am counting on.
Alright then.

Last edited by Gorf; 25 May 2018 at 23:45.
25 May 2018, 19:04   #583
Gorf
Registered User

 
Join Date: May 2017
Location: Munich/Bavaria
Posts: 547
Quote:
Originally Posted by Megol
..
I would have cared...
25 May 2018, 19:08   #584
meynaf
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 45
Posts: 3,197
Quote:
Originally Posted by Megol
Look at a modern processor.
Caches, instruction scheduler, load/store unit.

Everything isn't twice the size. Registers and execution hardware (ALUs) are.
I know it's not exactly everything, but nevertheless close enough.
Load/store unit has to handle addresses twice as large.
And ya better have bigger caches because you have more data to fit.
Of course the decoder probably has to cope with bigger instructions (larger immediates and addresses).
Even branch prediction has to handle larger addresses...


Quote:
Originally Posted by Megol
I haven't mentioned 64-bit addresses. It would be logical to support more than 4 GB of RAM, but it isn't necessary. Even so, letting programs run in a 64-bit address space with a dedicated 4 GB chunk addressed by smaller pointers is trivial.
Not exactly trivial, no. It means that not only data but also pointers would have a variable size. That would make the encoding even more of a mess: every memory access (that's not at a fixed place) would have to say which size the pointer is.
And that's a pity, because otherwise it would indeed be a better solution.


Quote:
Originally Posted by Megol
In fact, when I sketched this Complex RISC thingy I included a base pointer for just that: it's set to one value per program, and all local memory pointers can be 32-bit.
Oh yes, the return of segmentation.
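Megol's per-program base pointer can be modelled in a few lines. This is a minimal sketch under assumptions the thread does not spell out: the 32-bit pointer is zero-extended, added to a 64-bit base register, and the sum wraps at 64 bits. `effective_address` is a hypothetical name, not part of any existing ISA.

```python
MASK32 = 0xFFFFFFFF
MASK64 = (1 << 64) - 1


def effective_address(base: int, ptr32: int) -> int:
    """Hypothetical base-relative addressing: 64-bit base + zero-extended 32-bit pointer.

    Each program sees its own 4 GB window starting at `base`, while the
    pointers it keeps in memory stay 32 bits wide.
    """
    return (base + (ptr32 & MASK32)) & MASK64


# Two programs using the same 32-bit pointer value land in different windows.
prog_a = effective_address(0x0000_0001_0000_0000, 0x1000)
prog_b = effective_address(0x0000_0002_0000_0000, 0x1000)
```

The arithmetic is the easy part; meynaf's objection is about encoding, since each memory access still has to say whether it uses a 32-bit or a full 64-bit pointer.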


Quote:
Originally Posted by Megol
I don't see why you think optionally supporting 64-bit operations is such a huge problem.
I think i have detailed enough of it. But if you have questions, just shoot.


Quote:
Originally Posted by Megol
Yes, there are a number of such programs. That's the reason some people have hacked together a pseudo-32-bit mode with 64-bit registers and the other advantages of AMD64.
So you acknowledge that 32-bit can have advantages? I can't believe it.


Quote:
Originally Posted by Megol
So only your code is relevant? I've used such multiplications to speed up code significantly. As have others.
I have disassembled megabytes of other people's code and never found even a single occurrence. I can return the compliment: the fact that "you" have used it once does not make it generally useful.


Quote:
Originally Posted by Megol
But what about 64x64->64 then? Still need 3 muls + additions.
Assuming there are special 32x32->32 instructions shifting the result left 16 bits and a ternary addition operation, that would be 4 instructions.
Very rare use cases... Usually 64x32->64 is enough... Not really timing critical... Boring...
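The "3 muls + additions" for 64x64->64 can be checked with a small model: only the low 64 bits are kept, so the high-by-high partial product drops out entirely and the two cross products only need their low 32 bits. A sketch in Python, where `mul32x32_64` stands in for a 32x32->64 multiply such as the 68020's MULU.L with a 64-bit result pair; the helper names are made up for illustration, and the fused shift/ternary-add variant Megol mentions is not modelled.

```python
MASK32 = 0xFFFFFFFF
MASK64 = (1 << 64) - 1


def mul32x32_64(a: int, b: int) -> int:
    """32x32->64 unsigned multiply (full product)."""
    return (a & MASK32) * (b & MASK32)


def mul32x32_32(a: int, b: int) -> int:
    """32x32->32 multiply keeping only the low 32 bits."""
    return ((a & MASK32) * (b & MASK32)) & MASK32


def mul64x64_64(a: int, b: int) -> int:
    """Low 64 bits of a 64x64 product, built from three 32-bit multiplies."""
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32
    low = mul32x32_64(a_lo, b_lo)                                         # multiply 1
    cross = (mul32x32_32(a_hi, b_lo) + mul32x32_32(a_lo, b_hi)) & MASK32  # multiplies 2 and 3
    # a_hi * b_hi would be shifted left by 64 bits, so it never reaches the result.
    return (low + (cross << 32)) & MASK64
```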


Quote:
Originally Posted by Megol
Because they wanted minimal code expansion, so they had to reuse some instructions that weren't used in practice.
No. That's not the reason for most limitations. Hint: x86 instructions are limited to 15 bytes.


Quote:
Originally Posted by Megol
I'm not talking about making an evolution here, just a processor that can run converted 68k very fast and not caring about being "pure" RISC.
As long as the outside-visible ISA is programmer-friendly, who cares what's inside to execute the microcode?


Quote:
Originally Posted by Megol
And you obviously didn't read my post, pot kettle etc.
I don't like writing these things, but you started and I could not find a better answer. I know this isn't ideal.


Quote:
Originally Posted by Megol
I didn't write about extending a 68k processor but making a processor that takes ideas from both RISC (easy to decode, very orthogonal) and CISC (more complex instructions, condition codes etc.).
Yes, it's about creating a new processor. Did I write anything else, aside from an occasional reply to something mentioning 68k?


Quote:
Originally Posted by Megol
Previously I've mentioned toying with the idea of extending 68k to 64 bits, and it isn't too bad.
It does not look too bad at first sight, I admit. But attempts to really make things crystal clear reveal it's not that easy. I know, I've tried.


Quote:
Originally Posted by Megol
64-bit operations would need a prefix word, with the exception of a few common instructions (e.g. MOVE.Q Dn, EA and MOVE.Q EA, Dn). This is because the standard operation size encoding only supports b, w, l.
Oh yes, a prefix that would make instruction size computation twice as complicated.


Quote:
Originally Posted by Megol
64-bit immediates would be supported, and so would 64-bit displacements for branches and addresses. Not because they are generally useful but because orthogonality is damn nice.
Frankly, usefulness beats orthogonality. If you want to be able to do lots of dirty things just because they are "orthogonal", get a VAX.


Quote:
Originally Posted by Megol
It would be as orthogonal as the original ISA, as it is the original ISA.
There are a few things that won't work this way, I'm afraid...
Think of the 020+ modes. The largest instruction is 22 bytes and would become 38.
And bitfields. No way to encode more than 32 bits in the expansion word.


Quote:
Originally Posted by Megol
Probably not. Mainly because the C/RISC described is essentially a sketch I made for this thread by checking what a reasonable encoding would look like. So it isn't finished, barely started.
I am a little bit more advanced than you are, it seems


Quote:
Originally Posted by Megol
ARM. The one used almost everywhere. Some people think there should be an alternative, so they support development of RISC-V.
Not really ARM. It's slowly leaving RISC territory, looking more and more like a hybrid.
But RISC-V yes, though it's not what I'd call popular.
Btw, ARM is used everywhere not for any technical advantage it might possess, but merely because it is easy to license and build.


Quote:
Originally Posted by Megol
Both of those would be two instructions in the sketch design.

ADD.WZ (A0), D0, D20
ST.W D20, (A0)+

ORQ.BZ (A1), #1, D20
ST.B D20, (A1)

But I think ld-op with immediate would be complicated and rarely used so more likely:

LD.BZ (A1), D20
ORQ #1, D20, D20
ST.B D20, (A1)
OK for the add, but the other is not a true BSET, which sets the CCR in a useful manner (it's common to set the bit and then check whether it was set before).
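The difference meynaf points out is in the flags. On the 68k, BSET tests the bit before setting it and reports the old state in Z (Z set means the bit was 0), which is what a test-and-set idiom relies on; an OR only reflects the new value. A minimal Python model of a byte-sized memory operand, where the bit number is taken modulo 8 as it is for memory operands; the function names are illustrative only.

```python
def bset_byte(value: int, bit: int) -> tuple[int, bool]:
    """Model 68k BSET on a byte in memory: returns (new_value, Z flag).

    Z is set when the tested bit was 0 *before* the operation."""
    mask = 1 << (bit % 8)
    z = (value & mask) == 0
    return (value | mask) & 0xFF, z


def or_byte(value: int, imm: int) -> tuple[int, bool]:
    """Model a load / OR / store sequence: Z only reflects the result."""
    result = (value | imm) & 0xFF
    return result, result == 0


# Test-and-set: only BSET tells us whether the bit was already set.
first = bset_byte(0x00, 0)    # bit was clear -> Z set
second = bset_byte(0x01, 0)   # bit was set   -> Z clear
```

With the OR version both cases produce the same flags, so the load/OR/store sequence would need an extra test of the loaded value to recover the information BSET gives for free.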


Quote:
Originally Posted by Megol
Why didn't I include ld/st operations? Because they are a pain in the ass to handle in hardware and generally unnecessary when one has 31 free registers.

However, there is room to support ld/st, and if one did, both of those would be 1 instruction for this example.
An overlooked advantage of memory operations is that they don't need extra registers. In some circumstances (e.g. interrupts) as few registers as possible have to be used, because saving them takes time.


Quote:
Originally Posted by Megol
You assume a load is the same speed as a register access.
That's because it is. Don't confuse speed and latency.


Quote:
Originally Posted by Megol
A modern processor is deeply pipelined and uses instruction and data caches. They are generally out of order and have a load-to-use latency of 2 to 4 clock cycles.
OoO is enough to hide these latencies: we're not only doing memory operations, so nearby register ops can be done during that time.


Quote:
Originally Posted by Megol
To decode CISC instructions one needs a deeper decode section in the pipeline, as the instruction length has to be determined before decoding can begin. This means the branch mispredict latency (the time from when a mispredicted branch is detected until execution on the correct path starts) increases. This decreases IPC.
Not true anymore. Instruction sizes can be cached.
And decoding can start before instruction size is known (if you have a wealth of logic to spend for decoding all possible combinations and later choose the right one).
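"Instruction sizes can be cached" refers to predecoding: boundaries are worked out once, when code enters the instruction cache, and stored as marker bits, so later fetches can hand several instructions to the decoders without serially computing lengths (predecode bits alongside the instruction cache have been used in x86 designs, for example). A toy Python sketch; the encoding (top two bits of the first 16-bit word give the number of extension words) is entirely hypothetical and far simpler than real 68k length decoding, and instructions straddling a line boundary are ignored.

```python
def insn_length_words(first_word: int) -> int:
    """Hypothetical encoding: top 2 bits of the first word = number of extension words."""
    return 1 + ((first_word >> 14) & 0x3)


def predecode(words: list[int]) -> list[bool]:
    """Mark which 16-bit words start an instruction (serial walk, done once per line fill)."""
    starts = [False] * len(words)
    i = 0
    while i < len(words):
        starts[i] = True
        i += insn_length_words(words[i])
    return starts


class PredecodeCache:
    """Caches start markers per line address so the serial length walk only runs on a fill."""

    def __init__(self) -> None:
        self._lines: dict[int, list[bool]] = {}
        self.fills = 0

    def start_marks(self, line_addr: int, words: list[int]) -> list[bool]:
        if line_addr not in self._lines:
            self._lines[line_addr] = predecode(words)
            self.fills += 1
        return self._lines[line_addr]


line = [0x0000, 0x4000, 0x1234, 0xC000, 0x0001, 0x0002, 0x0003, 0x0001]
cache = PredecodeCache()
marks = cache.start_marks(0x1000, line)
marks_again = cache.start_marks(0x1000, line)   # hit: no second length walk
```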


Quote:
Originally Posted by Megol
Decoding complex instructions often means they have to be split into two or more. This is because splitting increases performance in several ways; for instance, splitting a load-execute instruction into a load and an integer operation means instruction schedulers are simplified and can be made faster and larger. But needing to do that increases complexity in the decoders, which normally means added pipeline stages - increased mispredict latency.

Then we have the complications of partial register updates, condition code computation and handling etc. Those may not decrease IPC but they do decrease the C part.
Simple read-modify-write instructions do not really need to be split. And even if they are, there is no difference...
Now load-store has a big problem in comparison : instead of a single instruction, we have 3 - and they all depend on the result of the previous one !


Quote:
Originally Posted by Megol
Comparing what? An extended 68k or the RISCy thing mentioned earlier?
Compare any cpu with any other cpu, by showing real code.


Quote:
Originally Posted by Megol
Have you tried that on any other architecture?
Yes.


Quote:
Originally Posted by Megol
RISC was designed using analysis of real world code.
No. It was only a vague theory.
Belief that complex instructions are not really used is simply wrong.
Net result : lots and lots of useful stuff removed.
25 May 2018, 19:17   #585
Gorf
Registered User

 
Join Date: May 2017
Location: Munich/Bavaria
Posts: 547
Quote:
Quote:
Originally Posted by Gorf
Can you give me some examples of code that take two or more instructions on 68k, but are just one instruction on RISC?
ARM:
Code:
BICNE R0, R1, R2 ASR R3 ; If zero flag clear R0 = R1 AND-NOT (R2 >> R3)
Yes, that is actually useful.
I was not arguing to discredit your claim, but simply interested.
So thanks!

Quote:
But note that I was talking about the thingy mentioned in this thread. It would have an operate-branch conditional instruction format, so something like:

CMPBCS D0, D1, .label

Would be equal to:

CMP D0, D1
BCS .label

Reading this and your answer to meynaf, it strikes me as an interesting concept.

I have to admit I have never written a single line of 64-bit asm, so I cannot judge how bad this would be - but as you said, it should be possible to code within a 32-bit range, and the developer would not see a single 64-bit pointer...



Quote:
The canonical form for ARM with no free register would be:

Code:
CMP R0, R1
EORGT R0, R0, R1
EORGT R1, R1, R0
EORGT R0, R0, R1
The more common pure min/max would be two instructions.
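The quoted ARM sequence orders two registers without a scratch register: CMP sets the flags once, and because plain EOR (no S suffix) leaves the flags alone, all three EORGT see the same GT result and together perform an XOR swap when R0 > R1 (signed). A small Python model of that behaviour; `to_signed32` and `cmp_eorgt` are names made up for the illustration.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x: int) -> int:
    """Interpret a 32-bit register value as two's complement."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def cmp_eorgt(r0: int, r1: int) -> tuple[int, int]:
    """Model CMP R0,R1 followed by three EORGT: leaves (min, max) in (R0, R1)."""
    if to_signed32(r0) > to_signed32(r1):   # GT condition from CMP R0, R1
        r0 ^= r1    # EORGT R0, R0, R1
        r1 ^= r0    # EORGT R1, R1, R0
        r0 ^= r1    # EORGT R0, R0, R1
    return r0, r1
```

The three EORs only emulate a register exchange, which is why an EXG-style instruction would shorten the sequence.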
25 May 2018, 19:18   #586
meynaf
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 45
Posts: 3,197
Quote:
Originally Posted by Gorf
I'm with Kolla on this...

We simply need both the carriage (OS/software) and the horse (hardware).
But new horses are expensive, and there are already a mule, a donkey and an ox in our barn.
The mule kicks, the ox is slow and the donkey is stubborn - but all three are able to draw the carriage. Not as well as a horse would, but that's all we have got for now...
Of course, but if you design your carriage for use with donkeys, mules or whatever, you'll be really sorry the day the horse comes and they no longer work together!
Besides, as said previously, hardware can be emulated.


Quote:
Originally Posted by Gorf
Maybe it would - but there is no compatible fast hardware. So it may be relatively fast, but even the fastest snail is still a snail...
Really, it does already. Try JIT emulation and tell me this isn't fast.


Quote:
Originally Posted by Gorf
HAL (based on the NetBSD anykernel)
Not understood...


Quote:
Originally Posted by Gorf
The guru will meditate on these words.
Program failed - Error 87000004
Wait for disk activity to finish.
Suspend | Reboot

(oops, I guru-meditated too)


Quote:
Originally Posted by Gorf
I would have cared...
Me too and I even replied.
25 May 2018, 19:30   #587
meynaf
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 45
Posts: 3,197
Have I been so rude? Why destroy posts like this?


Quote:
Originally Posted by Megol
BICNE R0, R1, R2 ASR R3 ; If zero flag clear R0 = R1 AND-NOT (R2 >> R3)
Unreadable, to say the least!


Quote:
Originally Posted by Megol
Yes, that is actually useful.
For what?


Quote:
Originally Posted by Megol
But note that I was talking about the thingy mentioned in this thread. It would have an operate-branch conditional instruction format, so something like:
CMPBCS D0, D1, .label
Would be equal to:
CMP D0, D1
BCS .label
I tried this for my ISA. It ended up eating as much encoding space as the whole branch instruction, and it could not replace it, as instructions other than CMP also set useful flags.


Quote:
Originally Posted by Megol
The canonical form for ARM with no free register would be:
CMP R0, R1
EORGT R0, R0, R1
EORGT R1, R1, R0
EORGT R0, R0, R1
That would be simpler with an EXG instruction.


Quote:
Originally Posted by Megol
The more common pure min/max would be two instructions.
Indeed, and two instructions actually 25% larger in size than the 3 classical ones used for this...
25 May 2018, 19:30   #588
Gorf
Registered User

 
Join Date: May 2017
Location: Munich/Bavaria
Posts: 547
Quote:
Originally Posted by hth313
Somewhat like Apple did, and there is always DragonFly. We can always look at current macOS and compare that to the old one; it is a step like that. Perhaps the Amiga OS is in better shape than the old Mac OS and more can be used. Some lessons can probably be drawn from the migration/shift Apple did.
DragonFly yes, Matt Dillon is still the man! (but not on my desktop...)

By the way: since yesterday, OS X has been "older" than System x.x
(older as in: sold by Apple for longer)

Quote:
Having a UNIX at the bottom ...
Let me stop you right here.
The only justification for some Unix code is missing drivers. It could be useful to give us a HAL, without developing drivers ourselves.

Otherwise: there are already 1001 Unix/POSIX clones out there - no need for one more.

Last edited by Gorf; 25 May 2018 at 19:42.
25 May 2018, 19:33   #589
meynaf
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 45
Posts: 3,197
Quote:
Originally Posted by Gorf
Otherwise: there are already 1001 Unix/POSIX clones out there - no need for one more.
25 May 2018, 19:42   #590
Gorf
Registered User

 
Join Date: May 2017
Location: Munich/Bavaria
Posts: 547
Quote:
Originally Posted by meynaf
Really, it does already. Try JIT emulation and tell me this isn't fast.
OK: "It isn't fast!"

No, really: it is not fast! Not compared to the hypothetical native speed.
(And that is part of the reason why I am quite confident the emulation can be improved.)

And while the OS, or rather the GUI, certainly reaches a speed where a human can no longer feel the difference - just start a raytracer and compare it to a native implementation (e.g. an old version of POV-Ray), or encode some MPEGs or whatever - or, maybe more crucially, try to browse the web! (other than this nice site)
25 May 2018, 20:00   #591
meynaf
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 45
Posts: 3,197
Quote:
Originally Posted by Gorf
OK: "It isn't fast!"

No, really: it is not fast! Not compared to the hypothetical native speed.
(And that is part of the reason why I am quite confident the emulation can be improved.)
Of course it's not the hypothetical native speed, but even native apps don't reach this


Quote:
Originally Posted by Gorf
And while the OS, or rather the GUI, certainly reaches a speed where a human can no longer feel the difference - just start a raytracer and compare it to a native implementation (e.g. an old version of POV-Ray), or encode some MPEGs or whatever - or, maybe more crucially, try to browse the web! (other than this nice site)
That really depends on what you're doing.
Some Amiga C compilers are way faster than VS on the pc.
Merely booting the machine is also faster.
And switching it off - it's simply different worlds.
Play Protracker module - 0% cpu use.
Decode FLAC (with my program) - 90x real time (I believe that's enough?).

So yes, AOS is fast. It's just running on underpowered hardware.
Changing the OS because the hardware is slow or the emulation isn't giving close to native speeds is throwing the baby out with the bathwater.
25 May 2018, 20:22   #592
Gorf
Registered User

 
Join Date: May 2017
Location: Munich/Bavaria
Posts: 547
Quote:
Originally Posted by meynaf
Of course it's not the hypothetical native speed, but even native apps don't reach this


That really depends on what you're doing.
Some Amiga C compilers are way faster than VS on the PC.
Way faster? That might be true for small chunks of code... but in general?

Quote:
Merely booting the machine is also faster.
A "feature" we have to witness more often than we certainly would like to...

Quote:
And switching it off - it's simply different worlds.
Is it? I can take my Alpine Linux RasPi off the grid any time - no harm will be done. That is a matter of configuration only.

Quote:
Play Protracker module - 0% cpu use.
On real hardware or on UAE?
And what does that tell us about the OS?
Nothing - we can only state that the chipset works as intended and/or UAE is doing a good job emulating the chipset.

Quote:
Decode FLAC (with my program) - 90x real time (I believe that's enough?).
Compared to 500x real time native?
The "fast enough" argument is like "nobody will need more than 640K of RAM".
Yes: for this specific use case it is enough.
And now I want to watch my latest 4k video.

Quote:
So yes, AOS is fast. It's just running on underpowered hardware.
That is what I said. Sadly there is no faster compatible hardware - so I will try to improve the emulator.

Still: native code always wins.

As for the "perfect ISA":

Ideal would be an instruction set, or at least code, that contains extensive hints, e.g. which branch is more likely to be taken, the importance of a loop, or in general what a section of code is meant to do.
The holy grail would certainly be an instruction set that allows binary translation to be done ahead of time.

Last edited by Gorf; 25 May 2018 at 21:10.
25 May 2018, 20:40   #593
kolla
Registered User
Join Date: Nov 2007
Location: Trondheim, Norway
Posts: 1,447
Quote:
Originally Posted by Gorf
And now I want to watch my latest 4k video.
Worse, this is Amiga, the computer for the creative mind - you want to edit that 4k video.

(Ironically I am posting this from an Intel NUC running DragonFlyBSD and FS-UAE /OS3.9, since attempts at AROS hosted on DFBSD still fail)
25 May 2018, 21:02   #594
MigaTech
Only Amiga !!

Join Date: Apr 2017
Location: United Kingdom
Posts: 522
Quote:
Originally Posted by Gorf
- the unsolved ownership of everything "Amiga" :-(
So let's call it "Agima"
Someone beat you to it back in 2012! Take a look at this and think: just maybe it could have been the Next Gen Amiga?

Yeah right with an i7 in there!! The only thing related to Amiga is the name and the article even states this!

Commodore USA they really did try! They kind of remind me of Commodore UK, they held on and ariston too! Check out those asking prices!! WHOA!!

https://www.techradar.com/news/pc/co...-years-1072954
25 May 2018, 21:36   #595
meynaf
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 45
Posts: 3,197
Quote:
Originally Posted by Gorf
Way faster? That might be true for small chunks of code... but in general?
Yes, in general. Especially when loading projects, opening files, etc.
Believe me, the VS2015 I have here is very slow.


Quote:
Originally Posted by Gorf
A "feature" we have to witness more often than we certainly would like to...
Not that much for people who don't code, really.


Quote:
Originally Posted by Gorf
Is it? I can take my Alpine Linux RasPi off the grid any time - no harm will be done. That is a matter of configuration only.
Yes, but handheld devices are not exactly superior to emulation running on a good machine - if superior at all.


Quote:
Originally Posted by Gorf
on real or on use?
What do you mean, real or use?


Quote:
Originally Posted by Gorf
And what does that tell us about the OS?
Nothing - we can only state that the chipset works as intended and/or UAE is doing a good job emulating the chipset.
OK, you have a point here. It says nothing about the OS.
However it says the chipset can do things that are not that obvious on other "superior" machines...


Quote:
Originally Posted by Gorf
Compared to 500x real time native?
Probably not 500x, no.


Quote:
Originally Posted by Gorf
The "fast enough" argument is like "nobody will need more than 640K of RAM".
That's too easy a shortcut. But perhaps everybody needs more than 512 TB of RAM after all


Quote:
Originally Posted by Gorf
Yes: for this specific use-case it is enough.
It is enough for many use cases already.


Quote:
Originally Posted by Gorf
And now I want to watch my latest 4k video.
Sorry, for now I can only do the sound out of an MP4.
Pretty sure it can be done if given direct access to the host's GPU, though.
But wait... can your RasPi do 4k video?


Quote:
Originally Posted by Gorf
That is what I said. Sadly there is no faster compatible hardware - so I will try to improve the emulator.
That sounds like premature optimization. Making it work would already be some achievement.


Quote:
Originally Posted by Gorf
Still: native code always wins.
Not always, no. Native code can be written in inefficient languages or use very poor algorithms, so that emulated code ends up faster.


Quote:
Originally Posted by Gorf
As for the "perfect ISA":

Ideal would be an instruction set, or at least code, that contains extensive hints.
That is possible in some ways. I have thought about it: a software VM can sometimes use a little info, like the size of the converted code (for branch target computation), whether an instruction's flags (CCR) will be used or not, and whatever shortcuts can be taken to reduce the number of host instructions.
But it's not ideal as it is specific to some particular host and it might be more efficient to compute that data at runtime and cache it.
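Whether an instruction's flags will be used is exactly the kind of data a translator can compute itself and cache, as suggested above. A minimal sketch of a backward liveness pass over one basic block, assuming the CCR is treated as a single unit (real 68k instructions update subsets such as X separately) and that flags are conservatively live at the end of the block unless told otherwise; the `Insn` record is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Insn:
    text: str
    reads_flags: bool
    writes_flags: bool


def flags_needed(block: list[Insn], live_out: bool = True) -> list[bool]:
    """Return, per instruction, whether the translated code must compute its flags."""
    needed = [False] * len(block)
    live = live_out  # are flags read somewhere after the current point?
    for i in range(len(block) - 1, -1, -1):
        insn = block[i]
        if insn.writes_flags:
            needed[i] = live
            live = False  # this write hides any earlier flag value
        if insn.reads_flags:
            live = True   # the instruction consumes flags produced before it
    return needed


block = [
    Insn("add.l d0,d1", reads_flags=False, writes_flags=True),
    Insn("sub.l d2,d3", reads_flags=False, writes_flags=True),
    Insn("beq.s .done", reads_flags=True, writes_flags=False),
]
```

Here only the SUB needs its flags materialised; the ADD's flags are overwritten before anything reads them.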


Quote:
Originally Posted by Gorf
E.g. which branch is more likely to be taken -
Branches don't need this - a good host cpu has a good branch predictor that will do the job.


Quote:
Originally Posted by Gorf
the importance of a loop,
Nah. Just execute any loop as fast as possible


Quote:
Originally Posted by Gorf
or in general what a section of code is meant to do.
Do you think a computer can understand what it does? That would require more than what the best AI can currently do.
(Or your sentence had a different sense than it seemed to have.)


Quote:
Originally Posted by Gorf
The holy grail would certainly be an instruction set that allows binary translation to be done ahead of time.
Why? Doing the translation when reading the code is similar in concept to an ordinary CPU's prefetch. And then the code cache can hold translated instructions. This is not the job of the instruction set.

By all means, do not tailor the instruction set for a specific implementation !
25 May 2018, 21:54   #596
Gorf
Registered User

 
Join Date: May 2017
Location: Munich/Bavaria
Posts: 547
Quote:
Originally Posted by meynaf
But wait... can your RasPi do 4k video?
Yes, but only at 15 Hz... but that looks great on a very slow movie

Quote:
Branches don't need this - a good host cpu has a good branch predictor that will do the job.
Sophisticated branch prediction takes a lot of space - hints in the code could render it irrelevant.

Quote:
Nah. Just execute any loop as fast as possible
Think of nested loops and the effort it takes to analyze them - again, hints could be very useful here.

Quote:
Do you think a computer can understand what it does?
No - that is why we need that information from the coder.

Quote:
By all means, do not tailor the instruction set for a specific implementation !
It should provide the opposite: enough hints and clarity to allow AOT binary-to-binary translation into any target.
(And no: I have not the slightest clue how to do this - apparently nobody does.)

Last edited by Gorf; 25 May 2018 at 23:22.
26 May 2018, 09:28   #597
meynaf
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 45
Posts: 3,197
Quote:
Originally Posted by Gorf
Sophisticated branch prediction takes a lot of space - hints in the code could render it irrelevant.
I heard IBM Power did that in the past, and the hint bit is now ignored because ordinary branch prediction works better.
Hints in the code also take space, btw. Encoding space is valuable and there are better uses for it.
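The trade-off between a hint bit and a dynamic predictor can be seen with a toy model. A two-bit saturating counter per branch adapts when a branch changes behaviour at run time, which a fixed hint cannot, but it needs a table of counters, which is the storage cost raised above. This is a textbook-style sketch, not a model of any particular CPU; the class and function names are illustrative.

```python
class TwoBitPredictor:
    """One 2-bit saturating counter per branch address (0-1: not taken, 2-3: taken)."""

    def __init__(self, initial: int = 1) -> None:
        self.initial = initial
        self.counters: dict[int, int] = {}

    def predict(self, pc: int) -> bool:
        return self.counters.get(pc, self.initial) >= 2

    def update(self, pc: int, taken: bool) -> None:
        c = self.counters.get(pc, self.initial)
        self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)


def misses_dynamic(outcomes: list[bool], pc: int = 0x1000) -> int:
    predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


def misses_static(outcomes: list[bool], hint_taken: bool) -> int:
    return sum(1 for taken in outcomes if taken != hint_taken)


# A branch that is taken for one phase of the program and not taken afterwards.
phased = [True] * 50 + [False] * 50
```

For this phased pattern the counter mispredicts 3 times in total, while either fixed hint mispredicts 50 times; for a branch whose bias never changes, a correct hint would do about as well as the counter.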


Quote:
Originally Posted by Gorf
Think of nested loops and the effort it takes to analyze them - again, hints could be very useful here.
What kind of analysis?


Quote:
Originally Posted by Gorf
No - that is why we need that information from the coder.
What information? This is wholly unclear.


Quote:
Originally Posted by Gorf
It should provide the opposite: enough hints and clarity to allow AOT binary-to-binary translation into any target.
If the implementation is visible in the instruction set, then this makes the next implementation problematic if it does things differently.
If one day we get a nice ASIC, these emulation hints will look ridiculously useless.
I've said that somewhere before, but don't forget that architectures persist longer than implementations.


Quote:
Originally Posted by Gorf
(And no: I have not the slightest clue how to do this - apparently nobody does.)
Hints given to some machine to build some program are called code. Thus "enough hints" is like performing the conversion yourself.
31 May 2018, 16:07   #598
Megol
Registered User

Join Date: May 2014
Location: inside the emulator
Posts: 332
Quote:
Originally Posted by meynaf
Have I been so rude? Why destroy posts like this?
Because it doesn't make any difference.

Quote:
Unreadable, to say the least!
BIC = bit clear
NE = execute on Not Equal = Z condition code clear
No S = doesn't set flags

R0, R1, R2 ASR R3 ; R0 = R1 op (R2 ASR R3)

Quote:
For what?
For example, if R2 is negative, one can clear a bitfield from the MSb towards the LSb with the size specified in R3. So: bitfield operations.

BEQ .skip
ASR.L D3, D2 ; D2 = D2 ASR D3
NOT.L D2 ; ...
AND.L D1, D2 ; D2 = D2 AND D1
.skip

Note that I selected this instruction to illustrate a point: one 32-bit RISC-type instruction doing the work of four 68k instructions, 64 bits in total.

While being easier to decode, faster to execute and not overwriting potentially useful data.
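Megol's bitfield example can be checked with a model of both versions. A Python sketch, assuming the usual shift rules: ARM takes the shift amount from the bottom byte of R3, the 68k ASR.L count is taken modulo 64, and in both cases counts of 32 or more fill the register with the sign bit. Note that the ARM form writes R0 and leaves R2 intact, while the 68k sequence overwrites D2; the function names are made up for the illustration.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x: int) -> int:
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def asr32(value: int, amount: int) -> int:
    """Arithmetic shift right; counts of 32 or more give all sign bits."""
    return (to_signed32(value) >> min(amount, 31)) & MASK32


def arm_bicne(r0: int, r1: int, r2: int, r3: int, z: bool) -> int:
    """BICNE R0, R1, R2, ASR R3: returns the new R0."""
    if z:                       # NE fails: R0 unchanged, flags untouched (no S)
        return r0
    return r1 & ~asr32(r2, r3 & 0xFF) & MASK32


def m68k_sequence(d1: int, d2: int, d3: int, z: bool) -> int:
    """BEQ .skip / ASR.L D3,D2 / NOT.L D2 / AND.L D1,D2: returns the new D2."""
    if z:                       # BEQ .skip
        return d2
    d2 = asr32(d2, d3 & 63)     # ASR.L D3,D2
    d2 = ~d2 & MASK32           # NOT.L D2
    return d1 & d2              # AND.L D1,D2


# Clear the top 4 bits of an all-ones value: R2 = 0x80000000 shifted right by 3.
arm = arm_bicne(0, 0xFFFFFFFF, 0x80000000, 3, z=False)
m68k = m68k_sequence(0xFFFFFFFF, 0x80000000, 3, z=False)
```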

Quote:
I tried this for my ISA. It ended up eating as much encoding space as the whole branch instruction, and it could not replace it, as instructions other than CMP also set useful flags.
Those aren't meant for size optimization but to enable one instruction to both do an operation and branch on condition.

The hardware cost of this is slim: each instruction would set the condition codes anyway, comparing condition codes against a specific condition is trivial, and the ALU would only need to pass the result to the branch unit.

The complication in the decoder is how to extract the condition field and detect (and extract) the branch address. Trivial.

Sadly the semantics of DBcc makes it hard to do in one standard instruction but op+branch still makes it easier to translate.
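The proposed compare-and-branch only fuses two existing steps, which a behavioural model makes concrete. A minimal sketch assuming 68k operand order (CMP D0,D1 computes D1 - D0, and CS means carry set, i.e. D1 < D0 unsigned); only the C flag is modelled, and `cmpbcs` is a hypothetical name for the sketch-ISA instruction, not an existing 68k opcode.

```python
MASK32 = 0xFFFFFFFF


def cmp_carry(src: int, dst: int) -> bool:
    """C flag of 68k CMP.L src,dst (computes dst - src): set on unsigned borrow."""
    return (dst & MASK32) < (src & MASK32)


def cmpbcs(d0: int, d1: int, next_pc: int, label: int) -> int:
    """Fused CMP D0,D1 + BCS .label: returns the address executed next."""
    return label if cmp_carry(d0, d1) else next_pc
```

In hardware terms this matches the description above: the flags come out of the ALU as usual, and the extra work is routing the condition result to the branch unit and decoding the condition and target fields.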

Quote:
That would be simpler with an EXG instruction.
Yes and shows that ARM isn't the perfect instruction set.

Not being able to do something like:

ADD R0, R1, #1 LSL R1

Is also sad.

That most instructions waste 4 bits on the condition field is insane; that the PC is a general register is a problem; that it only has 15 normal registers is another.

Quote:
Indeed, and two instructions actually 25% larger in size than the 3 classical ones used for this...
But faster to decode and execute while requiring fewer transistors. Which is why RISC was created in the first place - to be more efficient to execute.
31 May 2018, 16:52   #599
meynaf
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 45
Posts: 3,197
Quote:
Originally Posted by Megol
Because it doesn't make any difference.
What difference would you want to make then ?


Quote:
Originally Posted by Megol
BIC = bit clear
NE = execute on Not Equal = Z condition code clear
No S = doesn't set flags

R0, R1, R2 ASR R3 ; R0 = R1 op (R2 ASR R3)
When I wrote "unreadable", I didn't ask for an explanation. I just wanted to point out that what the thing does isn't obvious and requires some amount of deciphering. This has a bad effect when reading code.
(This is a general statement ; instructions doing several unrelated things together make code harder to read.)


Quote:
Originally Posted by Megol View Post
For example, if R2 is negative, one can clear a bitfield from the MSb towards the LSb with the size specified in R3. So, bitfield operations.

BEQ .skip
ASR.L D3, D2 ; D2 = D2 ASR D3
NOT.L D2 ; ...
AND.L D1, D2 ; D2 = D2 AND D1
.skip

Note that I selected this instruction to illustrate a point: one 32-bit RISC-type instruction doing the work of four 68k instructions, 64 bits in total.

And it does so while being easier to decode and faster to execute, without overwriting potentially useful data.
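To make the comparison concrete, here is a Python model of the quoted A32 instruction `BICNE R0, R1, R2, ASR R3` next to the quoted 68k sequence. It is a sketch assuming 32-bit registers held in plain dicts, with `z` standing for the Z flag left by some earlier instruction; the function names are hypothetical.

```python
# BICNE R0, R1, R2, ASR R3 versus BEQ / ASR.L / NOT.L / AND.L on 68k.

MASK32 = 0xFFFF_FFFF

def asr32(value: int, amount: int) -> int:
    """Arithmetic shift right of a 32-bit value; amounts >= 32 give all sign bits."""
    signed = value - (1 << 32) if value & 0x8000_0000 else value
    return (signed >> amount) & MASK32  # Python's >> keeps the sign

def arm_bicne(regs: dict, z: bool) -> dict:
    """R0 = R1 AND NOT (R2 ASR R3) if Z is clear; R2 is left untouched."""
    regs = dict(regs)
    if not z:                                # NE condition
        shift = regs["r3"] & 0xFF            # A32: bottom byte of Rs
        regs["r0"] = regs["r1"] & ~asr32(regs["r2"], shift) & MASK32
    return regs

def m68k_sequence(regs: dict, z: bool) -> dict:
    """BEQ .skip / ASR.L D3,D2 / NOT.L D2 / AND.L D1,D2; result lands in D2."""
    regs = dict(regs)
    if not z:                                # BEQ .skip not taken
        shift = regs["d3"] % 64              # 68k: count is D3 modulo 64
        d2 = asr32(regs["d2"], shift)        # ASR.L D3,D2
        d2 = ~d2 & MASK32                    # NOT.L D2
        regs["d2"] = regs["d1"] & d2         # AND.L D1,D2
    return regs
```

For shift counts 0 to 63 the two compute the same value. The visible difference is where it goes: the ARM form writes R0 and keeps R2, while the 68k sequence overwrites D2. With R2 = 0x80000000 and R3 = 3, the mask covers the top four bits, so an all-ones R1 becomes 0x0FFFFFFF.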
Yes, but why would one want to do that?
When I wrote "for what", I asked for a concrete case.
One can easily invent an instruction that does the job of many others, but if no program ever contains such instruction sequences, it's pointless.
And that is more or less the case here.
Something like 90% of instructions in a typical ARM stream start with "E" (the "always true" condition) and fewer than 10% use the barrel shifter (I'm giving these figures from memory; if you think they're wrong, you can disassemble some code and compute your own statistics). In fact, this is similar to the usage of branch and shift instructions in any other architecture.
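The "starts with E" statistic is easy to reproduce. In A32, the top nibble of every 32-bit instruction word is the condition field, and 0xE is AL (always). Here is a rough Python sketch (the function names are hypothetical), assuming a raw little-endian dump of A32 code, not Thumb. Literal pools and other data in the section will be counted too, and on ARMv5 and later a top nibble of 0xF marks the unconditional instruction space rather than a condition.

```python
import struct

def condition_stats(code: bytes) -> dict[int, int]:
    """Histogram of the condition field (top nibble) of little-endian A32 words."""
    counts: dict[int, int] = {}
    usable = len(code) - len(code) % 4          # ignore a trailing partial word
    for (word,) in struct.iter_unpack("<I", code[:usable]):
        cond = word >> 28
        counts[cond] = counts.get(cond, 0) + 1
    return counts

def always_fraction(code: bytes) -> float:
    """Fraction of words whose condition field is 0xE (AL)."""
    counts = condition_stats(code)
    total = sum(counts.values())
    return counts.get(0xE, 0) / total if total else 0.0
```

A raw dump of a `.text` section, for instance from `objcopy -O binary -j .text`, can be passed in directly.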

A better example would be a real-life routine, not just "clear a bitfield if some value is negative" (which, by the way, is doable in 3 instructions on 68k rather than 4, but still isn't very useful).
How many instructions, for example, does it take to read a decimal number from a stream?
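For the decimal-number benchmark suggested here, this Python sketch shows what each ISA would have to implement. The name `read_decimal` and its conventions are assumptions: unsigned ASCII digits only, no sign, whitespace or overflow handling. The per-digit work is a load, a subtract of '0', a range check, a multiply-by-10-and-add, and a pointer increment, and that is what an instruction count would be measuring.

```python
def read_decimal(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Parse an unsigned ASCII decimal number starting at data[pos].

    Returns (value, next_pos); next_pos points at the first non-digit byte.
    """
    value = 0
    while pos < len(data):
        digit = data[pos] - 0x30       # load byte, subtract '0'
        if not 0 <= digit <= 9:        # in assembly: one unsigned compare + branch
            break
        value = value * 10 + digit     # multiply-accumulate
        pos += 1                       # advance the stream pointer
    return value, pos
```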


Quote:
Originally Posted by Megol View Post
Those aren't meant for size optimization but to enable one instruction to both do an operation and branch on condition.

The hardware cost for this is slim: each instruction would set the condition codes anyway, comparing condition codes against a specific condition is trivial, and the ALU would only need to pass the result to the branch unit.

The complication in the decoder is extracting the condition field and detecting (and extracting) the branch address. Trivial.

Sadly, the semantics of DBcc make it hard to do in one standard instruction, but op+branch still makes it easier to translate.
These are purely implementation-driven decisions, and perhaps you know, or can guess, what I think about those.


Quote:
Originally Posted by Megol View Post
Yes, and it shows that ARM isn't the perfect instruction set.

Not being able to do something like:

ADD R0, R1, #1 LSL R1

Is also sad.

That most instructions waste 4 bits on the condition field is insane, that the PC is a general register is a problem, and that it only has 15 normal registers is another.
Perhaps it would be interesting to list all the shortcomings and attempt to fix them (for purely academic purposes).
The first thing needed is a complete instruction-set encoding reference table, and so far I've been unable to even find one. It seems every model has its own instruction set, which doesn't help.
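For the classic A32 data-processing class at least, the layout is regular enough to sketch. This Python decoder (hypothetical names, assuming a 32-bit word that really is a data-processing instruction) splits it into the fields argued about in this thread: the 4-bit condition field, opcode, S bit and the shifter operand. It does not check for the multiply, extra load/store and miscellaneous encodings that share bits 27:26 = 00.

```python
OPCODES = ["AND", "EOR", "SUB", "RSB", "ADD", "ADC", "SBC", "RSC",
           "TST", "TEQ", "CMP", "CMN", "ORR", "MOV", "BIC", "MVN"]
# 0xF was "never" on early ARMs; on ARMv5+ it is the unconditional space.
CONDS = ["EQ", "NE", "CS", "CC", "MI", "PL", "VS", "VC",
         "HI", "LS", "GE", "LT", "GT", "LE", "AL", "NV"]
SHIFTS = ["LSL", "LSR", "ASR", "ROR"]

def decode_dp(word: int) -> dict:
    """Split a 32-bit A32 data-processing word into its fields."""
    if (word >> 26) & 0b11 != 0b00:
        raise ValueError("not in the data-processing space")
    fields = {
        "cond": CONDS[word >> 28],               # bits 31:28
        "op": OPCODES[(word >> 21) & 0xF],       # bits 24:21
        "s": bool((word >> 20) & 1),             # bit 20: set flags
        "rn": (word >> 16) & 0xF,
        "rd": (word >> 12) & 0xF,
    }
    if (word >> 25) & 1:                         # I = 1: 8-bit imm rotated right by 2*rot
        rot = ((word >> 8) & 0xF) * 2
        imm8 = word & 0xFF
        fields["imm"] = ((imm8 >> rot) | (imm8 << (32 - rot))) & 0xFFFF_FFFF
    else:
        fields["rm"] = word & 0xF
        fields["shift"] = SHIFTS[(word >> 5) & 0b11]
        if (word >> 4) & 1:                      # shift amount held in register Rs
            fields["rs"] = (word >> 8) & 0xF
        else:                                    # 5-bit constant shift amount
            fields["shift_imm"] = (word >> 7) & 0x1F
    return fields
```

For example, `decode_dp(0x11C10352)` should yield the `BICNE R0, R1, R2, ASR R3` quoted earlier in the thread. That encoding was assembled by hand, so it is worth double-checking against the manual.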


Quote:
Originally Posted by Megol View Post
But faster to decode and execute while requiring less transistors. Which is why RISC was created in the first place - to be more efficient to execute.
Requiring fewer transistors, yes, but not faster anymore.
RISC was an advantage in the past, as it could afford relatively costly techniques for executing instructions fast (things like OoO execution), but now there is a wealth of transistors and RISC designs have no advantage left.
This is why ARM is beaten in performance by x86.
This is also why IBM had to use exceedingly aggressive designs for POWER8 to still be able to compete (and it only does so in floating point).

If you have to execute more instructions to do the same job, today you can't be faster. As simple as that.
Old 03 June 2018, 22:17   #600
kolla
Registered User
Join Date: Nov 2007
Location: Trondheim, Norway
Posts: 1,447
So why are there so few x86 phones around?

All times are GMT +2.

