English Amiga Board


Old 22 May 2018, 13:47   #501
Gorf
Registered User
 
 
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,328
Quote:
Originally Posted by Dunny View Post
I think it will be square, flat, with pins on the underside.
How boring...
It should at least be a hexagon.
Old 22 May 2018, 14:56   #502
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,331
Quote:
Originally Posted by Gorf View Post
most modern CPUs are just emulating the legacy ISA ... the line between hardware and software is somewhat blurred today.
Microcode is at a lower level than software emulation (more transparent).
So you can hide it and pretend there is no emulation - no one will be able to check.


Quote:
Originally Posted by Gorf View Post
true, but there is of course always the speed/price ratio - how much bang for the dollar.
(and there is always the multiple-instances/sandbox/cluster approach that could make use of more cores and level the field a little bit)
If one wants the bang for the dollar, then a previously owned peecee should do the trick...


Quote:
Originally Posted by Gorf View Post
Some I have already mentioned earlier in this thread.
To speed up the first round(s) of code execution and reduce the lag, it is probably a good idea to "outsource" some of the work onto dedicated hardware (FPGA). The first step would be a fast 68k instruction decoder/translator. For that one would need to use all sorts of tricks, like instruction bonding or recognition of some very frequently used tuples of instructions and their shortest representation in the target ISA.

We would need to evaluate where to place the memory controller: use the (mostly faster) host CPU for that, or is some extra channel for the FPGA part better?
The second approach would allow taking care of the endianness swapping in the FPGA before handing data over to the host cpu for calculation.
It could also directly handle some memory operations that would just be very slow on a RISC CPU ...
(and we could implement your idea of a dedicated MMU and/or protection unit)

Hotspots in the code would then be translated and optimized by the software JIT ... preferably in parallel by a second core.

Very hot hotspots could then again be compiled into VHDL for live-updating the FPGA - this could work for things like a ray-tracer or fractal generator.

But more clever ideas are welcome!
If you want to do hardware-assisted things for the host cpu, the first thing to check is how to feed it with code. As i suppose we can't access the cpu's pipeline directly, it has to be located in some memory?


Quote:
Originally Posted by Gorf View Post
best price/speed ratio.
Hmm. I'm afraid getting a cheap peecee would be better already.


Quote:
Originally Posted by Gorf View Post
I am listening.
Ok. See below.


Quote:
Originally Posted by Gorf View Post
Well - such a platform would probably start as some kind of emulation or FPGA implementation, wouldn't it?
Sure it would.


Quote:
Originally Posted by Gorf View Post
but yes: I like the idea, but this of course opens a myriad of possibilities for how to do things
I'm more interested in what to do than how to do it. An implementation can be changed later but the actual architecture is more difficult to change.


Quote:
Originally Posted by Gorf View Post
OK - let's just concentrate on one element for now:
What should the cpu look like?
I can not tell how it should look physically
But from a programmer's pov, i have quite a clear idea.

Going from 68000 to 68020 brought extra programming flexibility.
But from 68020 to 68040, nothing came. This might have contributed to the downfall of the family.

If a coder wants speed, he just takes winuae.
Coders like achieving something big out of something little.
So to provide them the toy they want to play with, my idea is to give them a cpu which is even more friendly to code on than the actual 68k.
And to not fall into the performance-driven design trap.
Old 22 May 2018, 15:05   #503
Megol
Registered User
 
 
Join Date: May 2014
Location: inside the emulator
Posts: 377
Quote:
Originally Posted by Gorf View Post
most modern CPUs are just emulating the legacy ISA ... the line between hardware and software is somewhat blurred today.
Modern processors are implementations of their ISA, and the idea behind the ISA is what IBM called architecture when releasing the IBM 360: a contract describing how the processor behaves rather than how it is implemented.

Not even the old Intel Pentium Pro emulated anything; it implemented the x86 ISA using very RISC-like internal operations.

Quote:
true, but there is of course always the speed/price ratio - how much bang for the dollar.
(and there is always the multiple-instances/sandbox/cluster approach that could make use of more cores and level the field a little bit)

that's what I am trying to figure out. I am still collecting ideas and evaluating approaches.

Some I have already mentioned earlier in this thread.
To speed up the first round(s) of code execution and reduce the lag, it is probably a good idea to "outsource" some of the work onto dedicated hardware (FPGA). The first step would be a fast 68k instruction decoder/translator. For that one would need to use all sorts of tricks, like instruction bonding or recognition of some very frequently used tuples of instructions and their shortest representation in the target ISA.
Let's assume we have no access to the internals of the target ISA (I'll call it the host): no ability to add custom instructions or access registers etc.
Then the FPGA has to communicate with the host via the normal interfaces: memory and interrupt signals.

If the host uses the FPGA with an Akiko-type interface, that is, the host writes bytes to be translated to a memory-mapped area and reads back the result, synchronization is easy. For instance one could probably stall the reading of the translated code until the FPGA is finished with its work. But that would be very inefficient.

So a more reasonable interface is letting the host direct the FPGA to a block of code to be translated, with the target buffer being either implicit or explicit.
The translation hardware then works until some limit is reached, producing a block of code. Synchronization can be either polling the hardware until it signals completion or the host getting an interrupt from the FPGA when done.

Then comes the problem of branch address translation. Unlike naive code translation this isn't a mechanical process: 68k branch addresses have to be looked up and, if translated code for that address is found, its location inserted. If it isn't translated yet, one can imagine the FPGA going down that path to translate the new block of code, but that isn't realistic for several reasons, path explosion being the obvious one.

A software JIT can quickly switch between executing native code and interpreting 68k code. If we remove the interpreter, the host has to point the FPGA to the code block to be executed, wait until translation is done and only then start executing again. I think the overheads would be huge.
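For concreteness, here is a minimal host-side sketch of that block-translation interface. The register layout, the start/done protocol and the way the FPGA is mapped are all invented for illustration; no real board is assumed.

Code:
#include <stdint.h>

/* Hypothetical memory-mapped FPGA registers (e.g. obtained via mmap()). */
volatile uint32_t *fpga;

enum { SRC_ADDR, DST_ADDR, MAX_BYTES, CTRL, STATUS, OUT_LEN };
enum { CTRL_START = 1, STATUS_DONE = 1 };

/* Ask the FPGA to translate one 68k block, then poll until done.
 * Returns the number of host-code bytes produced. */
uint32_t translate_block(uint32_t m68k_pc, uint32_t dst, uint32_t limit)
{
    fpga[SRC_ADDR]  = m68k_pc;   /* where the 68k code lives      */
    fpga[DST_ADDR]  = dst;       /* explicit target buffer        */
    fpga[MAX_BYTES] = limit;     /* "until some limit is reached" */
    fpga[CTRL]      = CTRL_START;
    while (!(fpga[STATUS] & STATUS_DONE))
        ;                        /* an IRQ would avoid this spin  */
    return fpga[OUT_LEN];
}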

Quote:
We would need to evaluate where to place the memory controller: use the (mostly faster) host CPU for that, or is some extra channel for the FPGA part better?
The second approach would allow taking care of the endianness swapping in the FPGA before handing data over to the host cpu for calculation.
It could also directly handle some memory operations that would just be very slow on a RISC CPU ...
(and we could implement your idea of a dedicated MMU and/or protection unit)
Endian swapping isn't needed if using ARM or modern x86: ARM can be toggled to big-endian mode, and x86 can use the MOVBE instruction.

The host processor has a highly optimized cache subsystem; what exactly would the FPGA be able to do faster?
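For instance, host-side big-endian accessors cost almost nothing with the GCC/Clang byte-swap builtins, which compile to single instructions (and to MOVBE on x86 when built with -mmovbe):

Code:
#include <stdint.h>
#include <string.h>

static inline uint32_t read_be32(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);          /* safe unaligned load       */
    return __builtin_bswap32(v);      /* 68k memory is big-endian  */
}

static inline uint16_t read_be16(const uint8_t *p)
{
    uint16_t v;
    memcpy(&v, p, sizeof v);
    return __builtin_bswap16(v);
}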

Quote:
Hotspots in the code would then be translated and optimized by the software JIT ... preferably in parallel by a second core.
That is a good idea. Perhaps it would be better to do the translation on another core too?

Quote:
Very hot hotspots could then again be compiled into VHDL for live-updating the FPGA - this could work for things like a ray-tracer or fractal generator.
That isn't really a good idea - you'd have to generate specialized 68k cores in realtime!

Quote:
I am listening.
What could be offered that WinUAE on a high-end PC doesn't already provide?

Quote:
OK - lets just concentrate on one element for now:
What should the cpu look like?
IMO a RISC/CISC hybrid. Something like:
32-, 64-, and 96-bit instructions.
Load-operate instructions, perhaps operate-store instructions.
Auto-increment and auto-decrement addressing modes.
Immediate values of at least 8, 32, and 64 bits.
Condition codes, but stored per register: 32 registers -> 32 carries, overflows etc. (modeled in C below)
Only 64-bit operations internally; loads can zero/sign-extend from byte, word, longword.
Hardware division.
Perhaps MOVEM-type instructions.
Hardware-supported translation of 68k instructions.
Perhaps a hardware CAM (a type of lookup table) for accelerating address translation.

This would make translation of 68k instructions trivial and be easy to code for.
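A rough, non-authoritative C model of those per-register condition codes, purely to illustrate the proposal: every register carries its own carry/overflow/zero/negative bits, so flag-setting operations never serialize on one global CCR.

Code:
#include <stdint.h>

typedef struct {
    uint64_t val;
    unsigned c : 1, v : 1, z : 1, n : 1;  /* private flags per register */
} xreg;

static void add64(xreg *d, const xreg *a, const xreg *b)
{
    uint64_t r = a->val + b->val;
    d->c = (r < a->val);                              /* carry out    */
    d->v = (~(a->val ^ b->val) & (a->val ^ r)) >> 63; /* signed ovf   */
    d->z = (r == 0);
    d->n = r >> 63;
    d->val = r;
}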
Old 22 May 2018, 15:28   #504
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,331
Quote:
Originally Posted by Megol View Post
What could be offered that WinUAE on a high-end PC doesn't already provide?
The only possibility i can see is extra programming flexibility.


Quote:
Originally Posted by Megol View Post
IMO a RISC/CISC hybrid. Something like:
32-, 64-, and 96-bit instructions.
Load-operate instructions, perhaps operate-store instructions.
Auto-increment and auto-decrement addressing modes.
Immediate values of at least 8, 32, and 64 bits.
Condition codes, but stored per register: 32 registers -> 32 carries, overflows etc.
Only 64-bit operations internally; loads can zero/sign-extend from byte, word, longword.
Hardware division.
Perhaps MOVEM-type instructions.
Hardware-supported translation of 68k instructions.
Perhaps a hardware CAM (a type of lookup table) for accelerating address translation.
Nobody would want to code on that monster.
Beware, too, of 64-bit immediates: a simple move of full-sized data to an absolute address would be a monster instruction of at least 18 bytes (say a 2-byte opcode plus an 8-byte immediate plus an 8-byte address)!
Old 22 May 2018, 15:40   #505
Gorf
Registered User
 
 
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,328
Quote:
Originally Posted by Megol View Post
Modern processors are implementations of their ISA and the idea behind the ISA is what IBM called architecture when releasing the IBM 360: a contract describing how the processor behaves rather than how it is implemented.

Not even the old Intel Pentium Pro emulated anything, it implemented the x86 ISA using very RISC like internal operations.
so any implementation of the x86 ISA, or of any ISA for that matter, would not be considered an emulation.
So UAE does not emulate the 68K but fulfills the ISA contract.

there is no spoon!
Old 22 May 2018, 15:48   #506
Gorf
Registered User
 
 
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,328
Quote:
If you want to do hardware-assisted things for the host cpu, the first thing to check is how to feed it with code. As i suppose we can't access the cpu's pipeline directly, it has to be located in some memory?
Yes. That is indeed the Achilles' heel of many boards, but the situation is improving, and things like ChipLink provide a fast interconnect between cpu and fpga - but I still need to evaluate the details.

I am also wondering how much speed is lost for e.g. WinUAE by Windows (or Linux) preempting the JIT task, callbacks to other parts of UAE, flushing caches and so on ...
having a core dedicated to the task of executing translated code could improve things quite a bit ...
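A sketch of such a dedicated core using the Linux-specific pthread affinity API; run_translated_code() is a hypothetical worker, and truly avoiding preemption would additionally need something like isolcpus on the kernel command line.

Code:
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

extern void *run_translated_code(void *arg);   /* hypothetical worker */

pthread_t start_pinned_jit_core(int cpu)
{
    pthread_t t;
    pthread_attr_t attr;
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);                        /* e.g. cpu = 3 */
    pthread_attr_init(&attr);
    pthread_attr_setaffinity_np(&attr, sizeof set, &set);
    pthread_create(&t, &attr, run_translated_code, NULL);
    pthread_attr_destroy(&attr);
    return t;
}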

Quote:
I can not tell how it should look physically
But from a programmer's pov, i have quite a clear idea.
I thought it was clear that I am NOT talking about the packaging or the silicon!
Of course we need to talk about the ISA.
Old 22 May 2018, 17:39   #507
Gorf
Registered User
 
 
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,328
Quote:
Originally Posted by Megol View Post
If the host uses the FPGA with an Akiko-type interface, that is, the host writes bytes to be translated to a memory-mapped area and reads back the result, synchronization is easy. For instance one could probably stall the reading of the translated code until the FPGA is finished with its work. But that would be very inefficient.
yes, that would be madness

Quote:
So a more reasonable interface is letting the host direct the FPGA to a block of code to be translated, with the target buffer being either implicit or explicit.
The translation hardware then works until some limit is reached, producing a block of code. Synchronization can be either polling the hardware until it signals completion or the host getting an interrupt from the FPGA when done.
preferably via some cpu interconnect mechanism

Quote:
Then comes the problem of branch address translation. Unlike naive code translation this isn't a mechanical process: 68k branch addresses have to be looked up and, if translated code for that address is found, its location inserted. If it isn't translated yet, one can imagine the FPGA going down that path to translate the new block of code, but that isn't realistic for several reasons, path explosion being the obvious one.
That is one reason why I suggested implementing the memory controller on the FPGA side of things - that could provide a mechanism to keep track of branch addresses ...

Quote:
A software JIT can quickly switch between executing native code and interpreting 68k code. If we remove the interpreter, the host has to point the FPGA to the code block to be executed, wait until translation is done and only then start executing again. I think the overheads would be huge.
I already linked a thesis here in this thread that describes exactly that:
the "Bochs" x86 emulator on a PPC-FPGA combo, where only the instruction decoding was done in the FPGA.
Despite the overhead, the speed was improved.
Today the connection between CPU and FPGA is much faster...

Quote:
The host processor has a highly optimized cache subsystem; what exactly would the FPGA be able to do faster?
The host CPU should use its cache and memory bus for the translated blocks of code. It should stay in "native" mode as long as possible.
Instruction decoding and translating can be realized much better in an FPGA, due to parallelism and the possibility of building effective pipelines.
That is the strength of the FPGA, while very fast ALUs are the strength of the CPU.

Quote:
That is a good idea. Perhaps it would be better to do the translation on another core too?
See above. The FPGA can do it faster.
(and without risking cache flushes or other resource conflicts)

Quote:
That isn't really a good idea - you'd have to generate specialized 68k cores in realtime!
I also posted a paper regarding this issue. This has been done before - the latency is just a few milliseconds.

Last edited by Gorf; 22 May 2018 at 18:46.
Old 22 May 2018, 18:46   #508
Megol
Registered User
 
 
Join Date: May 2014
Location: inside the emulator
Posts: 377
Quote:
Originally Posted by meynaf View Post
The only possibility i can see is extra programming flexibility.
I do not understand what that would be.

Quote:
Nobody would want to code on that monster.
Beware, too, of 64-bit immediates: a simple move of full-sized data to an absolute address would be a monster instruction of at least 18 bytes!
Yes, while doing the work of two instructions that together would be 24 bytes. Guess which is smaller, easier to support in hardware, faster?

64-bit values are very rarely needed, but not supporting them would make the processor less orthogonal and harder to use. The same hardware that extracts them from the instruction stream also makes 64-bit branch and address displacements trivial. Compared to CISC instructions, decoding is easy.

What exactly makes this look monstrous to you? The minimum instruction size?
Old 22 May 2018, 19:58   #509
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,331
Quote:
Originally Posted by Gorf View Post
I am also wondering how much speed is lost for e.g. WinUAE by Windows (or Linux) preempting the JIT task, callbacks to other parts of UAE, flushing caches and so on ...
having a core dedicated to the task of executing translated code could improve things quite a bit ...
- OS preempting: makes the timings unstable but does not measurably reduce the peak speed
- other parts of UAE: simple, check the host cpu% when the emulated cpu does nothing
- flushing caches: not needed if code cache and data cache aren't separate (see the sketch below)
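For reference, the one flush a JIT does need is on hosts whose instruction caches are not coherent with the data side (it matters on ARM; on x86 it is essentially free). GCC and Clang expose it portably:

Code:
#include <stddef.h>

/* After the JIT writes fresh host instructions into buf, make the
 * instruction side see them before jumping there. */
void publish_code(char *buf, size_t len)
{
    __builtin___clear_cache(buf, buf + len);
}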


Quote:
Originally Posted by Gorf View Post
I thought it was clear that I am NOT talking about the packaging or the silicon!
Of course we need to talk about the ISA.
I had put a smiley and thought it would be clear, too...


Quote:
Originally Posted by Megol View Post
I do not understand what that would be.
Not surprising.


Quote:
Originally Posted by Megol View Post
Yes, while doing the work of two instructions that together would be 24 bytes. Guess which is smaller, easier to support in hardware, faster?
Smaller, easier to support in hardware, and faster is to just drop 64-bit support.


Quote:
Originally Posted by Megol View Post
64-bit values are very rarely needed, but not supporting them would make the processor less orthogonal and harder to use. The same hardware that extracts them from the instruction stream also makes 64-bit branch and address displacements trivial.
Less orthogonal ???
No 64-bit cpu in the world is orthogonal, and for a good reason.
But you can have an orthogonal, easy-to-use 32-bit cpu.

For 64-bit there are better ways.
For data, merge two 32-bit instructions together to do a single 64-bit one; as an advantage, your code will run regardless of whether it is on the 32-bit or the 64-bit version of your core (see the toy model below).
For addresses, use the trick i mentioned before.
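A toy C model of that pairing, the way 68k ADD/ADDX already composes wide arithmetic from 32-bit halves; a 64-bit core could fuse the pair into one internal op, while a 32-bit core simply executes both:

Code:
#include <stdint.h>

typedef struct { uint32_t lo, hi; } pair64;

/* ADD.L sets the carry; ADDX.L consumes it -- two 32-bit
 * instructions expressing one 64-bit addition. */
static pair64 add64(pair64 a, pair64 b)
{
    pair64 r;
    r.lo = a.lo + b.lo;
    uint32_t carry = (r.lo < a.lo);
    r.hi = a.hi + b.hi + carry;
    return r;
}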


Quote:
Originally Posted by Megol View Post
Compared to CISC instructions, decoding is easy.
But programming is a pain in the a$$. Sorry, but no.
Having to use 4 or 5 instructions to do the job of one is a no-go today. But there are still people believing in RISC lies...


Quote:
Originally Posted by Megol View Post
What exactly makes this look monstrous to you? The minimum instruction size?
It would just be horrible to code on. Besides, it would have very poor code density.
Old 23 May 2018, 14:27   #510
Gorf
Registered User
 
 
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,328
Quote:
Originally Posted by meynaf View Post
- OS preempting: makes the timings unstable but does not measurably reduce the peak speed
- other parts of UAE: simple, check the host cpu% when the emulated cpu does nothing
That tells us that WinUAE is a nice program that does not do much unnecessary stuff.

It tells us nothing about the efficiency!
It does not tell us how it behaves under heavy load.

Quote:
- flushing caches: not needed if code cache and data cache aren't separate
Windows flushes cpu caches all the time!
(talking about the host - not the emulated cpu)

If you look at benchmarks of OSv, MirageOS or other unikernel or bare-metal approaches, the overhead of systems like Windows or Linux eats up at least 5% of your performance....

But that's all not really important for now.
Old 23 May 2018, 14:40   #511
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,331
Quote:
Originally Posted by Gorf View Post
It tells us nothing about the efficiency!
Then what do you call efficiency here?
Is it the number of host instructions per emulated instruction, or something like that?


Quote:
Originally Posted by Gorf View Post
It does not tell us how it behaves under heavy load.
But what do you call heavy load here?
Host side? Other apps eating cpu? Chipset config needing more cpu power than usual?
Or amiga side? Is it the emulated cpu doing heavy things?
Old 23 May 2018, 15:02   #512
Gorf
Registered User
 
 
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,328
Quote:
Originally Posted by meynaf View Post
Then what do you call efficiency here?
Is it the number of host instructions per emulated instruction, or something like that?
no, that would only make sense for a non-JIT, of course.
But ... that would be another interesting number!

efficiency in this case (for me, lacking a better word):
the percentage of cpu time the host cpu spends executing translated (former 68K) instructions.

the 1-x time would then include: host OS, host gfx, host sound, host io, UAE chipset emulation, UAE control (housekeeping and synchronizing), JIT overhead, 68K decoding, ...
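As a rough back-of-the-envelope illustration (numbers invented): if the host spends x = 70% of its time executing translated code, the other 30% is everything else; by Amdahl's law, even eliminating half of that remainder would only give 1 / (0.70 + 0.15), about 1.18, i.e. an 18% speedup. That is the sense in which x bounds the room for improvement.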

Quote:
But what do you call heavy load here?
Host side? Other apps eating cpu?
no, of course not!

Quote:
Chipset config needing more cpu power than usual?
Or amiga side? Is it the emulated cpu doing heavy things?
yes. And this leads to a high load on the host system as well.

Last edited by Gorf; 23 May 2018 at 15:21.
Old 23 May 2018, 16:05   #513
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,331
Quote:
Originally Posted by Gorf View Post
efficiency in this case (for me, lacking a better word):
the percentage of cpu time the host cpu spends executing translated (former 68K) instructions.
That really depends. (I only have basic knowledge about how it works, so don't take this as more than a rough idea.) But even with JIT active, some time is spent executing non-JIT instructions, either because they are not directly supported by the JIT, or because they end translated blocks, or because they are currently being written for later JIT use. And the ratio depends heavily on what the emulated cpu is currently executing.


Quote:
Originally Posted by Gorf View Post
the 1-x time would then include: host OS, host gfx, host sound, host io, UAE chipset emulation, UAE control (housekeeping and synchronizing), JIT overhead, 68K decoding, ...
Again, that does not have a fixed impact; e.g. host gfx can be very different depending on the filter you use, and JIT overhead depends on how often it has to rebuild blocks of translated instructions, etc.


Quote:
Originally Posted by Gorf View Post
yes. And this leads to a high load on the host system as well.
A loaded emulated machine won't make much of a difference, will it? There will just be less time spent waiting.
For the chipset, only rare corner cases really push the cpu, and it seems to count less today than it used to in the past (because machines are faster). And this is anyway a typical case where the fpga can do the work.

Overall, perhaps just reading the cpu% shown by either the task manager or winuae itself will give you some numbers.
But let's be honest: if you expect nice numbers in nice cells of a nice table, then this simply can not be done.
Old 23 May 2018, 16:42   #514
Gorf
Registered User
 
 
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,328
Quote:
Originally Posted by meynaf View Post
That really depends. (I only have basic knowledge about how it works, so don't take this as more than a rough idea.) But even with JIT active, some time is spent executing non-JIT instructions, either because they are not directly supported by the JIT, or because they end translated blocks, or because they are currently being written for later JIT use. And the ratio depends heavily on what the emulated cpu is currently executing.
"your mileage may vary"

I was not expecting a definitive number ... I know it depends on very many variables.
The emulation of some raytracer with more or less static output to a p96 screen and no sound is probably more "effective" than AGA Doom at max resolution...
And some things have an upper limit in usage, as they are simply done after some time, while other stuff may use up a constant percentage no matter how fast your cpu is...

I am just asking because it would give us a rough estimate of how much room for improvement there is.

Quote:
Again, that does not have a fixed impact; e.g. host gfx can be very different depending on the filter you use, and JIT overhead depends on how often it has to rebuild blocks of translated instructions, etc.
true


Quote:
But let's be honest: if you expect nice numbers in nice cells of a nice table, then this simply can not be done.
not expecting anything :-)
just gathering information piece by piece
Old 23 May 2018, 16:55   #515
Gorf
Registered User
 
 
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,328
Quote:
Originally Posted by Megol View Post
Quote:
Very hot hotspots could then again be compiled into VHDL for live-updating the FPGA - this could work for things like a ray-tracer or fractal generator.
That isn't really a good idea - you'd have to generate specialized 68k cores in realtime!
(I need to come back to this one more time, since my first answer was not good enough.)

To make this clear: this is an optional optimization. It would be the third step and is just an idea. But this idea could be useful.

Step one: interpreted execution of code. The FPGA can assist in decoding and translating. A good speed-up, but slower than JIT.

Step two: JIT on the host cpu. Identifying hotspots and optimizing execution. Buffering translated code.

Step three: identifying persistent hotspots and generating specialized cores in the FPGA. This needs to be done by spare cores that are not utilized otherwise.
We would NOT create "specialized 68k cores" or "specialized host-CPU cores", but rather special CL cores or special DSPs - just capable of executing one former loop of code after being sent a single instruction and a range of data.
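A sketch of the step-two-to-step-three promotion, with an invented threshold and a purely hypothetical offload_to_fpga(); the point is only that hotspot detection is a cheap counter on each translated block.

Code:
#include <stdint.h>

#define HOT 100000              /* "persistent" threshold, invented */

struct block {
    uint32_t  m68k_pc;          /* original 68k address      */
    void    (*native)(void);    /* step-two JIT translation  */
    uint64_t  runs;
    int       offloaded;
};

extern void offload_to_fpga(struct block *b);   /* hypothetical */

static void run_block(struct block *b)
{
    b->native();                /* execute the translated code */
    if (++b->runs == HOT && !b->offloaded) {
        b->offloaded = 1;
        offload_to_fpga(b);     /* kicked off from a spare core */
    }
}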
Old 23 May 2018, 17:04   #516
Thorham
Computer Nerd
 
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,781
Quote:
Originally Posted by meynaf View Post
I think i'll pass until you have something that runs directly...
The calculator already runs directly.

Quote:
Originally Posted by meynaf View Post
I can't be 100% sure of this, but as i said, AOS is quite tied to its hardware. It was certainly not designed with multi-platform in mind.
Yeah, I can see how it could be a problem.

Quote:
Originally Posted by meynaf View Post
And of course we could write an OS that is similar. But... who would attempt such a daunting task, for so little result?
How hard can it be?
Old 23 May 2018, 17:06   #517
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,331
Quote:
Originally Posted by Gorf View Post
"your mileage may vary"
Exactly !


Quote:
Originally Posted by Gorf View Post
The emulation of some raytracer with more or less static output to a p96 screen and no sound is probably more "effective" than AGA Doom at max resolution...
Probably, but who knows what's lurking inside...


Quote:
Originally Posted by Gorf View Post
I am just asking because it would give us a rough estimate of how much room for improvement there is.
Well, that depends on how that "improvement" is made.
But in any case emulator settings are what have the most impact.


Quote:
Originally Posted by Gorf View Post
We would NOT create "specialized 68k cores" or "specialized host-CPU cores", but rather special CL cores or special DSPs - just capable of executing one former loop of code after being sent a single instruction and a range of data.
You do not necessarily need to change the fpga's configuration for that.
You could just have some sort of ultra-wide (simd) alu.
Then, when a loop is identified which has all its instructions supported there (and with no bad dependencies), it can be "rewritten" to use that special hardware (see the toy illustration below).
I can tell i'd find this kind of hardware autovectorization a lot more sexy than adding dumb simd extensions to the instruction set...
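A toy illustration of such a rewrite in C, faking the wide ALU with plain 64-bit SWAR; a real hardware autovectorizer would perform this promotion transparently after proving the loop is clean.

Code:
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* What the 68k loop expresses, byte at a time. */
void xor_bytes_scalar(uint8_t *d, const uint8_t *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        d[i] ^= s[i];
}

/* The same loop promoted to one "wide ALU" op per 8 bytes. */
void xor_bytes_wide(uint8_t *d, const uint8_t *s, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t a, b;
        memcpy(&a, d + i, 8);
        memcpy(&b, s + i, 8);
        a ^= b;
        memcpy(d + i, &a, 8);
    }
    for (; i < n; i++)          /* scalar tail */
        d[i] ^= s[i];
}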
Old 23 May 2018, 17:12   #518
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,331
Quote:
Originally Posted by Thorham View Post
The calculator already runs directly.
That's not what i understood from what you wrote a few posts earlier:
Quote:
Originally Posted by Thorham View Post
If you mean Lua code, then all I have is in the archive. If it's C, then you need the Cephes library object file and link it to your own test program, because I have nothing that runs directly.
So do i need my own test program or not? What's needed to run that calculator? Do you have an archive from which i can run the calculator like any other program?


Quote:
Originally Posted by Thorham View Post
How hard can it be?
As hard as writing a new OS is, no more no less.
Old 23 May 2018, 17:18   #519
Gorf
Registered User
 
 
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,328
Quote:
Originally Posted by meynaf View Post
You do not necessarily need to change the fpga's configuration for that.
You could just have some sort of ultra-wide (simd) alu.
Then, when a loop is identified which has all its instructions supported there (and with no bad dependencies), it can be "rewritten" to use that special hardware.
I can tell i'd find this kind of hardware autovectorization a lot more sexy than adding dumb simd extensions to the instruction set...
you are right: SIMD in (a new) instruction set is a must.

but this was a reply to my FPGA-CPU-hybrid emulator idea. And in this case it would need to stick to the legacy 68K ISA.
This special SIMD unit (reconfigurable or not) would be part of the enhanced JIT.

Even Intel is playing with these ideas:
using Intel's own SPMD compiler to create special CL cores in FPGAs that are more efficient than e.g. generic CL cores in your gfx card.

Edit: ah - "lot more sexy" instead of "not more sexy" - i misread the first time ;-)
YES: it is fascinating, but a lot of work....

Last edited by Gorf; 23 May 2018 at 17:31. Reason: reading it again
Old 23 May 2018, 17:30   #520
Gorf
Registered User
 
 
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,328
Quote:
Originally Posted by meynaf View Post
As hard as writing a new OS is, no more no less.
am I wrong, or is AROS not offering exactly that?
Or would you both consider it as too far off already?
 

