Demo Coder Challenge - Vampire Beta Project ! - Page 7

Megol · 31 May 2014, 11:49

Quote:

Originally Posted by Thorham

Why not simply implement what exists and be done with it? Shouldn't it be enough to get existing 68060 speed? What's the deal with wanting more and more speed? Isn't part of the charm that right now we don't have all the speed in the universe?

Also, if you're already moving away from 680x0 (FPGA is NOT 680x0), then why not stick a cheap AMD on a board and run a 680x0 emu?

Perhaps I'm naive, but I just had to ask these questions

IMHO there is a charm in native execution. If one wanted the fastest possible Amiga system something like Amithlon combined with a high performance processor (x86 or perhaps in the future ARM) couldn't be beaten. Not to complain about Amithlon but performance and compatibility could be improved too.

However, that is also the problem: if one wants a semi-compatible system capable of running some old software, MorphOS and AOS4 do provide that. But the people interested in those systems are extremely few compared to those running their original Amigas for nostalgia reasons.

Gunnar · 01 June 2014, 11:14

On an accelerated AMIGA the copy from fastmem to chip mem is the main bottleneck.

For example:
Looking at the Phoenix_demo4, the CPU on the A600-Vampire could
reach 45 FPS if the chipmem bus were not the bottleneck.

The best solution to fix this is to add VIDEO-out to the Turbocard.

The next CPU-Card comes with Video out - also supporting chunky / truecolor.
This will remove this bottleneck allowing much faster games.

meynaf · 02 June 2014, 08:55

Quote:

Originally Posted by pandy71

So how many cycles it must be 200? 2000? 20000?

Well, dunno. How many clocks for fsin in a 68882 ? That must be the target.

Quote:

Originally Posted by pandy71

100k LE's Altera or Xilinx or Lattice or Other...?

I don't remember the exact fpga model, but it was probably Altera.

Quote:

Originally Posted by pandy71

Unless you replace Denise we have only planar and HAM modes - as this discussion is about the CPU, not Denise in FPGA, I see a huge use for C2P and HicolorC2PHAM in hardware...

As a 68060 can already do a C2P nearly in copymem speed and something plugged at the place of the cpu can't go faster than copymem speed (especially for chipmem), i don't think you'll get something fast there.

Also HAM rendering with a good quality is several orders of magnitude more complex than a simple MOVEP...

Quote:

Originally Posted by Megol

Well at least he could stop calling recognizing limits lame and amateurish?

Well, please forgive him if he's a little bit rude, as his level in English isn't exactly high...

Quote:

Originally Posted by Megol

That means the logic needed to detect complex instructions has to be bigger and is probably stored in a ROM. This can make the detection logic a limit to the clock frequency in several ways: increased fan-out of the fetch data, routing delays to and from the ROM, and possibly problems fitting the <logic>->ROM-><logic> path in the detector into one clock cycle. Adding another cycle for complex operation detection will require a micro-flush of the latest fetched instruction, which will complicate even more parts of the fetch/decode pipeline.
But that is only for _detecting_ that the instruction should be considered complex; the fetching then has to be redirected to a microcode engine. That adds more complexity.

That seems to be assuming we want to detect all cases in 1 clock at the decoder. But the decoder could "flag" these as trapped, while at a later stage we redecode trapped instructions and see if they must be handled in a different way.

Quote:

Originally Posted by Megol

In comparison, an enhanced trap mechanism allows some simplifications, as some cases can be handled in software. There still has to be detection logic, but it can be reduced in size as parts of it must already exist to detect illegal instructions.

As said above, trapping them as illegal is enough for a first step.

Quote:

Originally Posted by Megol

But is that really performance critical? The 68k processor natively executing MOVEP as fast as possible is the 68040, right? Let's assume that requires 8 clocks per MOVEP at 33MHz -> ~4M per second peak. Now assume the new processor runs at 200MHz and requires 50 cycles -> ~4M per second peak.

But it doesn't run at 200MHz and 50 cycles are not enough to both decode and execute it.
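
The peak rates in the quoted estimate can be checked with a few lines of arithmetic. This is only a sketch: `peak_ops_per_second` is a name made up here, and the clock and cycle figures are the assumptions from the quote, not measurements.

```python
def peak_ops_per_second(clock_hz, cycles_per_op):
    """Upper bound on back-to-back executions of one instruction per second."""
    return clock_hz / cycles_per_op

# Figures assumed in the quoted post, not measured values.
native_68040 = peak_ops_per_second(33e6, 8)    # MOVEP assumed to take 8 clocks at 33 MHz
trapped_new = peak_ops_per_second(200e6, 50)   # trap-emulated MOVEP assumed at 50 cycles, 200 MHz

print(f"68040 native:  {native_68040 / 1e6:.3f} M/s")
print(f"trap emulated: {trapped_new / 1e6:.3f} M/s")
```

Both come out at roughly 4 million per second, so the disagreement is really about whether 200 MHz and 50 cycles are achievable at all.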

Quote:

Originally Posted by Megol

Save/restore of registers can be done with banking in the register file or renaming techniques depending on the processor design.

But the emulation has to access the normal registers used by the instruction and register banking won't save you.

Quote:

Originally Posted by Megol

It's hard to eliminate the transfer to the handler routine, but it can be accelerated by treating every unimplemented instruction as a bsr to the trap handler, eliminating the mispredict latency.

And you will be happily polluting the instruction cache and the stack. Not to mention you have to find a place in memory where to put that trap handler.

Quote:

Originally Posted by Megol

Removing the checking for which instruction is to be emulated can be done at least partially by having a mechanism that translates part of the instruction word into an offset into the trap handler. Some checking code would still be needed though.

If you partially decode the instruction like this, it requires you to detect that the instruction is complex, which you didn't want to do earlier...
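
To make the idea of turning part of the instruction word into a handler offset concrete, here is a minimal sketch for MOVEP only. The mask and match values follow the MOVEP encoding (0000 ddd ooo 001 aaa, where the top opmode bit is always set); `HANDLER_BASE`, `ENTRY_SIZE` and `trap_handler_offset` are hypothetical names and values chosen for illustration, not part of any design discussed in this thread.

```python
MOVEP_MASK = 0xF138   # bits 15-12, bit 8 and bits 5-3 identify MOVEP
MOVEP_MATCH = 0x0108  # 0000 xxx1 xx00 1xxx

HANDLER_BASE = 0x0000  # hypothetical start of the MOVEP handler block
ENTRY_SIZE = 0x40      # hypothetical fixed space reserved per handler variant

def trap_handler_offset(opcode):
    """Return the handler offset for a MOVEP opcode, or None for anything else."""
    if (opcode & MOVEP_MASK) != MOVEP_MATCH:
        return None  # not MOVEP: this sketch knows no other complex instruction
    # The low two opmode bits select word/long and the transfer direction.
    variant = (opcode >> 6) & 0x3
    return HANDLER_BASE + variant * ENTRY_SIZE

print(trap_handler_offset(0x03CA))  # MOVEP.L D1,(d16,A2)
print(trap_handler_offset(0x4E71))  # NOP
```

The mask-and-compare step is exactly the "detect that it is complex" work in question; the sketch only shows that the remaining dispatch can be a shift and a multiply.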

Quote:

Originally Posted by Megol

However, that is also the problem: if one wants a semi-compatible system capable of running some old software, MorphOS and AOS4 do provide that. But the people interested in those systems are extremely few compared to those running their original Amigas for nostalgia reasons.

And when you're running it for nostalgia reasons, you do not want some instructions to be sacrificed on the altar of "speed".

Gunnar · 02 June 2014, 12:10

Quote:

Originally Posted by meynaf

Well, dunno. How many clocks for fsin in a 68882 ? That must be the target.

What is actually the difference between doing an instruction in software and doing an instruction in hardware?

Let's compare some 68000s.
Phoenix is hardwired and does all its instructions in hardware - and all normal ones in a single cycle.

The original 68000 did all instructions in software in several cycles!
The software for them was in the ROM in the 68K CPU.

The FSIN on the 68882 was a routine that was in fact executed from the ROM of the 68882.

When the 68040 was designed, Motorola figured that ROMs holding FSIN routines would occupy valuable chip space.
And they figured that spending this chip space on a bigger cache was the better decision - as the increased cache size benefits CPU performance.

Motorola's logic was good.
And nothing has changed since then.
Instead of adding ROMs with the routines, spending the chip space on bigger caches is the most sensible solution.

And Motorola is not alone in this idea - all chip companies figured the same ...

robinsonb5 · 02 June 2014, 13:35

Quote:

Originally Posted by Gunnar

What is actually the difference between doing an instruction in software and doing an instruction in hardware?

The difference, to my mind, is simply in whether or not the implementation is completely transparent to the software (including OS) running on the machine. Thus microcode counts as "hardware", whereas traps don't - even though, depending upon implementation, the only practical difference might well be that trap code is visible to the rest of the computer.

pandy71 · 02 June 2014, 14:38

Quote:

Originally Posted by meynaf

Well, dunno. How many clocks for fsin in a 68882 ? That must be the target.

Why? Why, for example, must FSIN be implemented but not C2P - has something suddenly changed in the Amiga (OCS\ECS\AGA) architecture?
As a programmer you may reuse a library with a soft FSIN, reuse code with FSIN, or do it in whatever flavor you want, as the purpose of Sine(x) is context dependent: sometimes a simple LUT is sufficient, sometimes not.

In other words, which method is optimal depends on how smart a programmer/developer you are and how well you know the problem you want to solve. In practice Sine is usually substituted by simpler approximations which are sufficient for the problem at hand, and you don't need 80-bit FP precision.

Quote:

Originally Posted by meynaf

I don't remember the exact fpga model, but it was probably Altera.

Ok, I was quite curious about the details.

Quote:

Originally Posted by meynaf

As a 68060 can already do a C2P nearly in copymem speed and something plugged at the place of the cpu can't go faster than copymem speed (especially for chipmem), i don't think you'll get something fast there.

But why? This is something that does not change; it is performed so frequently that using the CPU for it is a plain waste of CPU - perhaps the CPU should put data directly into BPLxDAT instead of RAM to avoid C2P?

Quote:

Originally Posted by meynaf

Also HAM rendering with a good quality is several orders of magnitude more complex than a simple MOVEP...

However, it seems that software conversion to HAM (HQ especially) requires more cycles than doing it in hardware - once again, to refresh the screen you need to pass a lot of data in an accurate and precisely timed manner.
But this is a purely academic dispute as there is no open code for HQ HAM conversion; the existing open C code for HAM conversion is quite simple and should not be too difficult to implement in VHDL/Verilog (but I agree - the quality will be very poor, so perhaps it should be improved, but in a way that is still usable with the limited number of LEs we have).

We have a CPU accelerator with no other way to display data than feeding CHIP mem (or banging registers), so I would say that we should focus on how to use the existing display hardware - I see C2P and HAM usage as most important, especially for OCS/ECS Amiga models.

Gunnar · 02 June 2014, 14:39

Quote:

Originally Posted by robinsonb5

The difference, to my mind, is simply in whether or not the implementation is completely transparent to the software (including OS) running on the machine. Thus microcode counts as "hardware", whereas traps don't - even though, depending upon implementation, the only practical difference might well be that trap code is visible to the rest of the computer.

What would you call a situation where the ROM code is placed not inside the CPU
but in an external ROM?

Let's say, just like the MICROCODE ROM of the original 68K,
this external ROM is there and does not depend on OS or library support.

So even any "old" software would run out of the box.

The main difference to the 68882 ROM would be that the new ROM is external for cost reduction.

What would you call this setup?

Gunnar · 02 June 2014, 14:47

Quote:

Originally Posted by pandy71

We have a CPU accelerator with no other way to display data than feeding CHIP mem

I agree with you that the chipmem transfer limit is a severe limitation.

There are several ways to improve this.

From a software perspective a nice solution would be a C2P instruction combined with Multithreading.
This combination would allow
1) to run a C2P at high speed from fastmem to fastmem.
2) to run C2P from fastmem to slow chipmem slowly in parallel, with low system resource usage
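
For readers unfamiliar with what a C2P step has to do, here is a minimal sketch of the conversion itself. It assumes one byte per chunky pixel, a pixel count that is a multiple of 8, and the Amiga convention that the leftmost pixel lands in the most significant bit of each plane byte. `chunky_to_planar` is an illustrative name; a real routine (or instruction) would process many pixels per step rather than one bit at a time.

```python
def chunky_to_planar(pixels, depth):
    """Split chunky pixel values into `depth` bitplanes (one bytearray per plane)."""
    if len(pixels) % 8:
        raise ValueError("pixel count must be a multiple of 8")
    planes = [bytearray(len(pixels) // 8) for _ in range(depth)]
    for i, px in enumerate(pixels):
        byte_index, bit = divmod(i, 8)
        for p in range(depth):
            if (px >> p) & 1:
                # Leftmost pixel of each group of 8 goes into bit 7.
                planes[p][byte_index] |= 0x80 >> bit
    return planes

planes = chunky_to_planar([1, 2, 3, 0, 0, 0, 0, 0], depth=2)
print([plane.hex() for plane in planes])
```

Every output byte depends on eight input pixels, which is why the conversion is a bit shuffle rather than a plain copy, and why the destination writes to chipmem remain the slow part.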

Another way to improve the whole setup is to add RGB out to the FPGA card.
This solution will open a lot more options of course with fast high resolution, truecolor screen.

robinsonb5 · 02 June 2014, 15:39

Quote:

Originally Posted by Gunnar

What would you call a situation where the ROM code is placed not inside the CPU
but in an external ROM?

What would you call this setup?

The key distinction is whether or not this ROM appears somewhere in the Amiga's memory map and uses exceptions / traps / autoconfig initialization, or whether it's only visible to the CPU core itself, connected via some new designed-for-the-task mechanism, and thus completely transparent to the Amiga.

robinsonb5 · 02 June 2014, 15:43

Quote:

Originally Posted by Gunnar

From a software perspective a nice solution would be a C2P instruction combined with Multithreading.

Some kind of DMA engine that does C2P on data from Fast RAM and writes it to Chip RAM would be better still, since the CPU could set it going, then forget about it until the job's complete.

Quote:

Another way to improve the whole setup is to add RGB out to the FPGA card.
This solution will open a lot more options of course with fast high resolution, truecolor screen.

Yes, indeed - the difficulty then, however, is that you either have to emulate the entire graphics chipset in the FPGA, have some kind of video-in / scandoubler in the FPGA, or the user has to have two separate monitors or some kind of KVM switch!

Gunnar · 02 June 2014, 16:22

Quote:

Originally Posted by robinsonb5

Some kind of DMA engine that does C2P on data from Fast RAM and writes it to Chip RAM would be better still, since the CPU could set it going, then forget about it until the job's complete.

Yes, a DMA engine would have this advantage.
But of course a DMA engine is always limited in flexibility.

A second CPU thread is a lot more flexible.
Threads can be used for many tasks - e.g. handling IDE or network traffic.
Many tasks which "traditionally" used DMA could also be handled very well with hardware threads.

Quote:

Originally Posted by robinsonb5

Yes, indeed - the difficulty then, however, is that you either have to emulate the entire graphics chipset in the FPGA, have some kind of video-in / scandoubler in the FPGA, or the user has to have two separate monitors or some kind of KVM switch!

Yes.
I have an LCD-TV connected to the AMIGA.
The normal display comes in via Scart.
The new display can come in via HDMI.
This is easy to use.

Adding a Flickerfixer to the FPGA is not difficult.

Putting a whole chipset in the FPGA is more work but was also done before.

pandy71 · 02 June 2014, 16:27

Quote:

Originally Posted by Gunnar

I agree with you that the chipmem transfer limit is a severe limitation.

There are several ways to improve this.

From a software perspective a nice solution would be a C2P instruction combined with Multithreading.
This combination would allow
1) to run a C2P at high speed from fastmem to fastmem.
2) to run C2P from fastmem to slow chipmem slowly in parallel, with low system resource usage

Another way to improve the whole setup is to add RGB out to the FPGA card.
This solution will open a lot more options of course with fast high resolution, truecolor screen.

Ok, I was not clear - my idea is to have a DMA service where the destination can be planar and the source can be chunky. Now that we have more RAM (or can use unused space in the current memory map, or use banking or fancier MMU address translation), the CPU can operate on local ('superfast') memory and create the video buffer there. At some point a 'send2chip' flag is set by the CPU, at the programmer's discretion, and the DMA starts working in the background at the maximum speed allowed by CHIP timing, stealing only a fraction of the CPU's cycles. Meanwhile the CPU can perform calculations on a second buffer, after which it can set the flag to send another chunk of data, and so on. This should be the most efficient approach, as the CPU will be stalled only partially (with clever buffering this kind of transfer can perhaps be fully hidden from the CPU's perspective). The same goes for the HAM case - conversion performed during the transfer from 'superfast' RAM to CHIP RAM.
IMHO, as this is performed 25, 30, 50 or 60 times per second on a full screen, it can be more beneficial than a full extended-precision transcendental FPU implementation (as a 64KB LUT can cover Sine with 32b floats at a resolution of 0.005deg, and I assume it would be the fastest way to have FSIN).
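
The LUT figure can be sanity-checked with a small sketch. 64KB of 4-byte floats is 16384 entries; spread over a quarter wave that is about 0.0055 degrees per entry, close to the 0.005deg mentioned. Using quarter-wave symmetry is my assumption about how that figure was reached, and `lut_sin` and the midpoint sampling are illustrative choices, not anything specified in the post.

```python
import math
from array import array

ENTRIES = 64 * 1024 // 4      # 64 KB of 4-byte floats = 16384 entries
STEP_DEG = 90.0 / ENTRIES     # about 0.0055 degrees per entry

# Quarter-wave table of single-precision floats, sampled at bin midpoints.
QUARTER = array("f", (math.sin(math.radians((i + 0.5) * STEP_DEG))
                      for i in range(ENTRIES)))

def lut_sin(deg):
    """Approximate sin(deg) with one table lookup and no interpolation."""
    deg %= 360.0
    sign = 1.0
    if deg >= 180.0:          # second half of the period is the first, negated
        deg -= 180.0
        sign = -1.0
    if deg > 90.0:            # mirror 90..180 back onto 0..90
        deg = 180.0 - deg
    index = min(int(deg / STEP_DEG), ENTRIES - 1)
    return sign * QUARTER[index]

worst = max(abs(lut_sin(k * 0.01) - math.sin(math.radians(k * 0.01)))
            for k in range(36000))
print(f"table size: {QUARTER.itemsize * ENTRIES} bytes")
print(f"worst error stays below 1e-4: {worst < 1e-4}")
```

Without interpolation the error is bounded by half a table step (in radians), which is far from 80-bit precision but plenty for many graphics uses.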

Video output can be added with the help of an additional board placed over Denise; video from Denise can then be captured and rerouted back to VIDIOT. However, it may also be possible to feed VIDIOT with new video data directly from the FPGA, with Denise video visible as an overlay (in a controlled window, perhaps with a resizer/rescaler). Thus it should be possible to have non-interlaced output with the original video filling the whole screen, the original video as a window inside a bigger added/new video, etc.
The link between boards can be a modern fast serial one (like an HDMI/DVI type of interface - video serializer and deserializer).
But IMHO it would then be better to recreate the whole Amiga, either by using a principle similar to A-Clone or by connecting all the main ICs (Agnus, Paula, Denise) around one FPGA and providing access to memory through the FPGA (as OCS\ECS\AGA use very little bandwidth, this can be seen as a UMA type of architecture and CHIP can be unified with FAST).

Vot · 02 June 2014, 16:34

Quote:

Originally Posted by matthey

Hardware assisted c2p is important when there is a lack of CPU performance as it increases the display frames per second (fps) considerably. With plenty of CPU performance, the fps is limited by slow chipmem. With hardware assisted c2p, the fps is limited by slow chipmem. It may give a couple of more fps but it can't perform miracles. We need the Amiga chipset in fpga where chipmem is fast and where we can create new chunky screen modes.

The Apollo Team has VHDL for Akiko c2p not that it is a particularly efficient way to do hardware assisted c2p. Converting a whole c2p buffer at once would be the most efficient and could be done in parallel with low resources as you say, not that the Vampire has any free space. Phoenix does use separate memory buses for chip and fast memory and I believe separate write buffers (fast memory would use writethrough caching). Perhaps it's this extra parallelism that allows the Vampire with Phoenix to do a few extra fps? If fpga hardware assisted c2p was able to give any more fps, surely it wouldn't be much. It could save a few CPU cycles but the c2p conversion would be a low percentage of the CPU processing power on Phoenix.

>> I always wonder what gains could be made if someone made a memory controller to connect to the memory sockets / pads on the mobo, i.e. so that chip RAM really is the same memory used by the CPU on the accelerator. Have all of the memory addressable by the CPU so memory can be copied into the chip RAM area, bypassing the chipset transfer limits. Or the memory controller could bank-switch parts of the chip mem etc. Lots of different options would be possible.

Hell how about an improved fpga blitter and c2p from "fast mem" area into the "chip mem" area

pandy71 · 02 June 2014, 21:24

Quote:

Originally Posted by Vot

Hell how about an improved fpga blitter and c2p from "fast mem" area into the "chip mem" area

ask Jens,

http://www.totalamiga.org/files/TA25...iewExtract.pdf

BTW it seems that Denise is one of the "easiest" ICs from the Amiga to recreate in FPGA.

Megol · 03 June 2014, 18:12

Quote:

Originally Posted by meynaf

Well, dunno. How many clocks for fsin in a 68882 ? That must be the target.

Why? The processor wouldn't be comparable in other aspects anyway.

Quote:

That seems to be assuming we want to detect all cases in 1 clock at the decoder. But the decoder could "flag" these as trapped, while at a later stage we redecode trapped instructions and see if they must be handled in a different way.

Of course one could do that but it isn't a good fit in a pipelined implementation. But one could use predecode data, see below.

Quote:

But it doesn't run at 200MHz and 50 cycles are not enough to both decode and execute it.

No at the moment it runs either at 0 MHz or +INF MHz depending on how one looks at it.
And why wouldn't 50 cycles be possible with an optimized trap mechanism?

Quote:

But the emulation has to access the normal registers used by the instruction and register banking won't save you.

Supporting that isn't a problem, even a simple instruction to swap banks for selected registers would be enough.
Another way to support it would be using an extension of my prefix mechanism. A third way would be to implement the prefix system and document that using complex instructions can overwrite extended registers (D8-D15, A8-A15). It wouldn't be a problem for existing code and new code could just avoid those instructions.
There are other options too.

NB that a register file implemented in the smallest available type of memory block has 32 or 64 registers, so a design with separate data and address register files has plenty to use for this purpose.

Quote:

And you will be happily polluting the instruction cache and the stack. Not to mention you have to find a place in memory where to put that trap handler.

Do you think ROM and microcode logic is free? They aren't. ROMs use memory blocks otherwise available for caches.

Quote:

If you partially decode the instruction like this, it requires you to detect that the instruction is complex, which you didn't want to do earlier...

I didn't want to do it in a critical path. Some predecode data bits are free, and detecting complex instructions can be done pipelined and parallelized in the instruction cache fill mechanism with very little performance impact.

If one of the two (for Xilinx FPGAs) free bits per instruction word is set the instruction could be trapped.
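
One way to picture the predecode idea: when a cache line is filled, each 16-bit instruction word is stored together with spare flag bits, and the decoder later only looks at the flag. The sketch below assumes 18-bit cache entries (16 data bits plus 2 spare bits, which is my reading of the "two free bits"), flags only MOVEP, and ignores the real complication that extension words can look like opcodes. All names are hypothetical.

```python
COMPLEX_FLAG = 1 << 16  # first spare bit above the 16-bit instruction word

def is_complex(word):
    """Slow classification, done once at cache-fill time. Only MOVEP here."""
    return (word & 0xF138) == 0x0108

def fill_cache_line(words):
    """Store each fetched word with its predecode flag attached."""
    return [w | COMPLEX_FLAG if is_complex(w) else w for w in words]

def must_trap(entry):
    """Fast check at decode time: a single bit test, no opcode matching."""
    return bool(entry & COMPLEX_FLAG)

line = fill_cache_line([0x4E71, 0x03CA])   # NOP, MOVEP.L D1,(d16,A2)
print([must_trap(entry) for entry in line])
```

The classification cost does not disappear; it moves out of the per-cycle decode path into the much less frequent cache fill.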

Quote:

And when you're running it for nostalgia reasons, you do not want some instructions to be sacrificed on the altar of "speed".

I've never run an Amiga with a 68882 nor one with an integrated FPU. The fastest Amigas available use a processor that doesn't implement the whole 68882.

So why would nostalgia require support of all instructions ever existing in the 68k ISA?

Megol · 03 June 2014, 18:14

Quote:

Originally Posted by robinsonb5

The key distinction is whether or not this ROM appears somewhere in the Amiga's memory map and uses exceptions / traps / autoconfig initialization, or whether it's only visible to the CPU core itself, connected via some new designed-for-the-task mechanism, and thus completely transparent to the Amiga.

How would an improved trap mechanism be more visible to users than using microcode?

robinsonb5 · 03 June 2014, 19:33

Quote:

Originally Posted by Megol

How would an improved trap mechanism be more visible to users than using microcode?

It might not be, depending upon how it's implemented. Which was actually my point - that's where the line between "hardware" and "software" lies, given that the very nature of FPGAs blurs that line: if the implementation is completely transparent to the Amiga then it's "hardware", even if the actual implementation is Musashi running on an ARM or suchlike. If you need to run something on the Amiga to set it up, or even if you don't, but the code maps somewhere into the Amiga's memory map where careless software could mess it up, or thoughtless messing with the exception table could stop it working, then it's "software".

Note - that's just clarifying a distinction, not saying either is "better".
If you want me to say which i think is better, then I'd say it's far more important for the base 68000 instruction set to be "hardware" than it is for FPU instructions.

meynaf · 05 June 2014, 14:46

Quote:

Originally Posted by Gunnar

What is actually the difference between doing an instruction in software and doing an instruction in hardware?

It's not as simple as if there were just two categories.

We have the choice between :
- hardwired (full hardware)
- iterative (like HW but in several passes)
- microcode
- emulation

Only the last solution is unacceptable.

Quote:

Originally Posted by Gunnar

Let's compare some 68000s.

Yes, good idea. A 68000 doing MOVEP is ok. A 68030 doing MOVEP is ok. A 68060 doing MOVEP is not ok.
Easy to see the difference, really.

Quote:

Originally Posted by Gunnar

Phoenix is hardwired and does all its instructions in hardware - and all normal ones in a single cycle.

How do you do things such as LINK, UNLK, TRAP, MOVEM ?
What prevents you from doing the same with e.g. MOVEP or FSIN ?

I know that the 7000 LEs of the Vampire aren't enough. But for the full Apollo, why not ? You have enough space for several 68k in 100k LEs !

Quote:

Originally Posted by Gunnar

The original 68000 did all instructions in software in several cycles!
The software for them was in the ROM in the 68K CPU.

The FSIN on the 68882 was a routine that was in fact executed from the ROM of the 68882.

Microcode and software emulation are very different.
But if you want to do instructions such as MOVEP exactly like they were done in the 68000 and absolutely no difference is visible in comparison to it, then it's fine with me.

Alas, while microcode is 100% transparent, software emulation is not.

The 68000's microcode was NOT 68k instructions. It was VLIW. It did not have to save regs, change the PC, decode instructions by software, and return to the caller. I guess the 68030's microcode is similar.

And the 68000 was only 68000 transistors (hence its name). Boy, what a cost nowadays.

Quote:

Originally Posted by Gunnar

When the 68040 was designed, Motorola figured that ROMs holding FSIN routines would occupy valuable chip space.
And they figured that spending this chip space on a bigger cache was the better decision - as the increased cache size benefits CPU performance.

What was right by the time of the 68040 may well be wrong today. If you have 100k LEs, sorry, but space isn't a good excuse.

Also the 68k family started to decay at the time of the 68040. Not for nothing. The 68040's implementation was very poor and is really not a good example of a right choice.

Quote:

Originally Posted by Gunnar

Motorola's logic was good.
And nothing has changed since then.
Instead of adding ROMs with the routines, spending the chip space on bigger caches is the most sensible solution.

Then tell me why even the most recent x86 still supports the old fsin (and the old 16-bit mode, and a lot of other things !)...

No, on the contrary, Moto's logic was anything but good.
It was good up to the 68030 which ruled the world in its time. Not after, when it changed.

Quote:

Originally Posted by Gunnar

And Motorola is not alone in this idea - all chip companies figured the same ...

... and they've all been beaten by Intel, who did not figure the same.

Quote:

Originally Posted by pandy71

Why? Why, for example, must FSIN be implemented but not C2P - has something suddenly changed in the Amiga (OCS\ECS\AGA) architecture?

Because FSIN was there before, not C2P. FSIN is currently in the 68k instruction set, not C2P. Therefore no FSIN is a regression. Easy or not ?

Furthermore, a C2P is 100% Amiga specific - which the 68k must NOT be in any manner IMO.

Quote:

Originally Posted by pandy71

As a programmer you may reuse a library with a soft FSIN, reuse code with FSIN, or do it in whatever flavor you want, as the purpose of Sine(x) is context dependent: sometimes a simple LUT is sufficient, sometimes not.

In other words, which method is optimal depends on how smart a programmer/developer you are and how well you know the problem you want to solve. In practice Sine is usually substituted by simpler approximations which are sufficient for the problem at hand, and you don't need 80-bit FP precision.

You focus too much on fsin. There are other useful transcendental instructions, like fatan for angle computations.

Following your logic, no fpu at all is better. Perhaps this is what you want ?

The limit with the c2p is the chipmem bandwidth, not the cpu. Therefore a hardware c2p wouldn't be much faster.
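
This argument can be put into a toy model: if rendering and the chipmem transfer overlap, the frame rate is bounded by whichever of the two is slower. The sketch below is only that model. `CHIP_BYTES_PER_SECOND` and the frame size are placeholder numbers invented for illustration (the 45 FPS CPU figure is the one quoted earlier in the thread for Phoenix_demo4); none of them are measurements of a real Amiga.

```python
def frame_rate(cpu_fps, frame_bytes, chip_bytes_per_second):
    """Frames per second when CPU work and the chipmem copy run in parallel."""
    bus_fps = chip_bytes_per_second / frame_bytes
    return min(cpu_fps, bus_fps)

FRAME_BYTES = 320 * 200 * 5 // 8    # placeholder: 320x200 in 5 bitplanes
CHIP_BYTES_PER_SECOND = 1_000_000   # placeholder bandwidth, not a measured value

software_c2p = frame_rate(45, FRAME_BYTES, CHIP_BYTES_PER_SECOND)
hardware_c2p = frame_rate(90, FRAME_BYTES, CHIP_BYTES_PER_SECOND)  # CPU side twice as fast
print(software_c2p, hardware_c2p)
```

As long as the bus term is the smaller one, making the CPU side faster does not change the result; only a faster path to the display does.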

Anyway, i don't want fsin for use myself especially. I want it mainly because it was there before.

Quote:

Originally Posted by pandy71

But why? This is something that does not change; it is performed so frequently that using the CPU for it is a plain waste of CPU - perhaps the CPU should put data directly into BPLxDAT instead of RAM to avoid C2P?

A big problem with the C2P is that it's made totally obsolete when you have real chunky. And i don't like short-lived things.

Never forget that architectures persist longer than implementations.

Do you accept adding instructions specific to solve some hardware problem ? Not me. Look at MOVEP for an example : designed for some specific purpose - now in the way and must be kept.

Quote:

Originally Posted by pandy71

However, it seems that software conversion to HAM (HQ especially) requires more cycles than doing it in hardware - once again, to refresh the screen you need to pass a lot of data in an accurate and precisely timed manner.
But this is a purely academic dispute as there is no open code for HQ HAM conversion; the existing open C code for HAM conversion is quite simple and should not be too difficult to implement in VHDL/Verilog (but I agree - the quality will be very poor, so perhaps it should be improved, but in a way that is still usable with the limited number of LEs we have).

HAM isn't 1:1 like a C2P. And it's not a simple computation either, there is decision making in it (unless you have a very poor quality - which i'd vote against).

A good HAM rendering method has to read a pixel, find out whether it's closer to a fixed, red, green or blue pixel, and then emit it according to that choice. Doing that gives a quite big routine already (mine is around 240 bytes of code and you can bet it's optimised to death).

If you wish to do HAM conversion in HW (good quality), you have to know that big TABLES are used there.
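
To show the kind of per-pixel decision described above, here is a minimal greedy HAM6 encoder sketch. It is not the author's routine: it uses a brute-force search instead of tables, works on 4-bit-per-channel colours, and assumes each line starts with colour 0 held. The control codes used are the HAM6 ones (00 = palette entry, 01 = modify blue, 10 = modify red, 11 = modify green); all function names are made up for illustration.

```python
def squared_error(a, b):
    """Sum of squared channel differences between two (r, g, b) tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ham6_encode_line(targets, palette):
    """Greedy HAM6 encoding: per pixel, pick the option closest to the target."""
    held = palette[0]  # assumption: the line starts from colour 0
    codes = []
    for target in targets:
        r, g, b = held
        tr, tg, tb = target
        # Option 1: load one of the 16 fixed palette colours (control bits 00).
        candidates = [(squared_error(col, target), idx, col)
                      for idx, col in enumerate(palette)]
        # Options 2-4: keep two channels of the held colour, replace the third.
        candidates.append((squared_error((r, g, tb), target), 0x10 | tb, (r, g, tb)))
        candidates.append((squared_error((tr, g, b), target), 0x20 | tr, (tr, g, b)))
        candidates.append((squared_error((r, tg, b), target), 0x30 | tg, (r, tg, b)))
        _, code, held = min(candidates, key=lambda c: c[0])
        codes.append(code)
    return codes

greys = [(i, i, i) for i in range(16)]
codes = ham6_encode_line([(5, 5, 5), (5, 5, 9), (15, 5, 9)], greys)
print([hex(c) for c in codes])
```

Even this naive version needs 19 comparisons per pixel and carries state from one pixel to the next, which is what makes it so different from a 1:1 bit shuffle like C2P.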

Quote:

Originally Posted by pandy71

We have a CPU accelerator with no other way to display data than feeding CHIP mem (or banging registers), so I would say that we should focus on how to use the existing display hardware - I see C2P and HAM usage as most important, especially for OCS/ECS Amiga models.

I personally wouldn't use HAM for animations, only for still images - as fringing effects would be quite ugly in an anim.
So i see little use for HAM in HW.

Quote:

Originally Posted by Gunnar

What would you call a situation where the ROM code is placed not inside the CPU but in an external ROM?

"slow because of the latencies" ?

Quote:

Originally Posted by Gunnar

Let's say, just like the MICROCODE ROM of the original 68K,
this external ROM is there and does not depend on OS or library support.

So even any "old" software would run out of the box.

The main difference to the 68882 ROM would be that the new ROM is external for cost reduction.

What would you call this setup?

See above.

A ROM inside the CPU is a lot faster than a ROM outside. Perhaps you forgot that a ROM has latencies, and they're quite big even at 100MHz.

You may want to "hide" these latencies - but then you're gonna pollute the icache with that ROM - which isn't the case with microcode, obviously.

Quote:

Originally Posted by Megol

Why? The processor wouldn't be comparable in other aspects anyway.

Why, simply because we just need to be at least as fast as what was there before - and preferably be faster clock-by-clock too.

Quote:

Originally Posted by Megol

And why wouldn't 50 cycles be possible with an optimized trap mechanism?

Because trapping itself takes time, decoding instructions by software takes a lot of time, indirect access to registers takes time, and returning from a trap handler takes time too. Write the code if you don't believe me.

Back in the days of the Natami's 68050 I wrote a small emu lib for it, so I know what I'm talking about. Even if you remove some of the bottlenecks, what remains is still a horror to handle.
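
As a rough illustration of the steps listed above (software decode, indirect register access, then the actual work), here is a sketch of what a MOVEP emulation handler has to do, written in Python rather than 68k assembly. The register lists and `mem` bytearray are stand-ins for saved CPU state, `emulate_movep` is a made-up name, and trap entry/exit and fetching the displacement word are left out entirely.

```python
def emulate_movep(opcode, disp, d_regs, a_regs, mem):
    """Emulate MOVEP: move 2 or 4 bytes between Dn and every other byte of memory."""
    # Software decode: pull the fields out of the opcode word.
    dn = (opcode >> 9) & 7
    opmode = (opcode >> 6) & 7
    an = opcode & 7
    size = 4 if opmode & 1 else 2      # opmode bit 0: long instead of word
    to_memory = bool(opmode & 2)       # opmode bit 1: register to memory
    # Indirect register access: the register numbers are data, not hardwired.
    addr = (a_regs[an] + disp) & 0xFFFFFFFF
    if to_memory:
        value = d_regs[dn]
        for i in range(size):          # highest byte first, stepping by 2
            mem[addr + 2 * i] = (value >> (8 * (size - 1 - i))) & 0xFF
    else:
        value = 0
        for i in range(size):
            value = (value << 8) | mem[addr + 2 * i]
        if size == 4:
            d_regs[dn] = value
        else:                          # a word load leaves the upper half alone
            d_regs[dn] = (d_regs[dn] & 0xFFFF0000) | value

d = [0, 0x11223344, 0, 0xAAAA0000, 0, 0, 0, 0]
a = [0, 0, 0x10, 0, 0, 0, 0, 0]
mem = bytearray(64)
emulate_movep(0x03CA, 0, d, a, mem)    # MOVEP.L D1,(0,A2)
emulate_movep(0x070A, 0, d, a, mem)    # MOVEP.W (0,A2),D3
print(mem[0x10:0x18].hex(), hex(d[3]))
```

On real hardware each of these steps turns into several instructions, on top of the exception entry and return, which is where the cycle count comes from.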

Quote:

Originally Posted by Megol

Supporting that isn't a problem, even a simple instruction to swap banks for selected registers would be enough.
Another way to support it would be using an extension of my prefix mechanism. A third way would be to implement the prefix system and document that using complex instructions can overwrite extended registers (D8-D15, A8-A15). It wouldn't be a problem for existing code and new code could just avoid those instructions.
There are other options too.

That would add a lot of dirty legacy...

Quote:

Originally Posted by Megol

Do you think ROM and microcode logic is free? They aren't. ROMs use memory blocks otherwise available for caches.

In the same way, do you think your "improved trap mechanism" is free ? It's not. This is leading nowhere.

Quote:

Originally Posted by Megol

I didn't want to do it in a critical path. Some predecode data bits are free and detecting complex instructions can be done pipelined and parallelized in the instruction cache fill mechanism with very little performance impact.

If one of the two (for Xilinx FPGAs) free bits per instruction word is set the instruction could be trapped.

This invalidates the earlier objection to microcode (about decoding the special instructions).
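For readers following the predecode idea quoted above, a minimal sketch of tagging instruction words with a spare "trap" bit while a cache line is filled. Everything here is an assumption for illustration: the function name, the tuple representation of a tagged word, and the choice of MOVEP as the only "complex" instruction being flagged. It also assumes the flag is only consulted when a word is actually issued as an opcode, so an extension word that happens to match the pattern is harmless.

```python
# Hypothetical predecode step run at instruction-cache fill time.
# MOVEP opcodes have the form 0000 ddd 1 xx 001 aaa.
MOVEP_MASK = 0xF138
MOVEP_MATCH = 0x0108

def predecode_line(words):
    """Pair each 16-bit word with a trap flag (one spare predecode bit)."""
    return [(w, (w & MOVEP_MASK) == MOVEP_MATCH) for w in words]

# NOP, MOVEP.L D0,0(A0), then its displacement word.
tagged = predecode_line([0x4E71, 0x01C8, 0x0000])
```

The comparison is a fixed mask-and-match per word, which is why it can sit in the fill path instead of the decode path.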

Quote:

Originally Posted by Megol

I've never run an Amiga with a 68882 nor one with an integrated FPU.

Not really true. You've probably run a PC already, and x87 is very similar to 68882 - yet it's an integrated FPU nowadays.

Quote:

Originally Posted by Megol

The fastest Amigas available use a processor that doesn't implement the whole 68882.

Wrong if you consider UAE as an Amiga - faster than everything else, and full 68882 if you want one.

Anyway, the 68882 is one thing; regular integer instructions are another. We could talk about CAS, or the bitfields for example. Might be a lot more interesting than FSIN, huh ?

Quote:

Originally Posted by Megol

So why would nostalgia require support of all instructions ever existing in the 68k ISA?

Nostalgia asks for everything that was made before, not a castrated, cut-down evolution of it.

Nostalgia wants a cpu that's easy to code on, has a complete instruction set, not a cpu that's the fastest possible and sacrifices everything for that chimeric goal (as you're not gonna be competitive anyway with other current families).

If you want to do a 68k, you do a 68k, period.
If you want to take a subset of its instruction set, then you can reencode it fully and it'll be another story.

I want to code in asm because I like the freedom of it.
And only the 68k (or possibly a derived cpu family) is appropriate for that.
Basically this is what I defend here.

The ISA should be extended, not reduced, even if this costs a few MHz.

Quote:

Originally Posted by Megol

How would an improved trap mechanism be more visible to users than using microcode?

Because it pollutes the address space, the stack, the caches, the memory bus, and perhaps a few other things.
Because it involves executing many more instructions than microcode and would be a lot slower.

As we're running on an FPGA, why not implement BOTH solutions anyway ?
It's possible to switch even at runtime !
So everyone would be happy. The costs and benefits of each solution readily available for direct study. No long, useless discussions.

But perhaps some are afraid of what they would discover ? That, for example, having all instructions isn't much slower than removing some ?

Gunnar · 05 June 2014, 15:39

Quote:

Originally Posted by meynaf

Yes, good idea. A 68000 doing MOVEP is ok. A 68030 doing MOVEP is ok. A 68060 doing MOVEP is not ok.
Easy to see the difference, really.

Fine, let's compare a HW and a SW solution.

MOVEP.L on the 68000 took 24 clocks
24 clocks @ 7 MHz is equivalent to about 411 clocks @ 120 MHz.

Doing a trap costs me 8 clocks.
This means you have 403 clocks to do MOVEP in software
and would still not be slower....

Sounds doable...
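The arithmetic above can be checked with a minimal sketch. The figures (24 clocks, 7 MHz, 120 MHz, 8 clocks per trap) are the ones given in the post; the function and variable names are made up for illustration.

```python
def equivalent_clocks(clocks, old_mhz, new_mhz):
    """Clocks at new_mhz that take the same wall time as clocks at old_mhz."""
    return clocks * new_mhz / old_mhz

# MOVEP.L: 24 clocks on a 7 MHz 68000, rescaled to a 120 MHz core.
budget = int(equivalent_clocks(24, 7, 120))   # 24 * 120 / 7 is about 411.4
trap_cost = 8                                 # trap overhead quoted in the post
remaining = budget - trap_cost                # clocks left for the handler
print(budget, remaining)
```

So a software MOVEP only breaks even with the original 68000 in wall-clock time if the handler fits in the remaining clocks; it says nothing about clock-for-clock speed.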

Quote:

Originally Posted by meynaf

How do you do things such as LINK, UNLK, TRAP, MOVEM ?
What prevents you from doing the same with e.g. MOVEP or FSIN ?

I assume people would rate instructions on their usefulness.
* MOVEM is useful
* MOVEP is nowhere near as important

Of course it's possible to include every instruction....
But would you also include CALLM ?

If not why not?

Megol · 06 June 2014, 14:10

Quote:

Originally Posted by meynaf

Why, simply because we just need to be at least as fast as what was there before - and preferably be faster clock-by-clock too.

Then we agree. Trap and emulate would be as fast or faster.

Quote:

Because trapping itself takes time, decoding instructions in software takes a lot of time, indirect access to registers takes time, and returning from a trap handler takes time too. Write the code if you don't believe me.

Let's see: No. Not with the proposed support. Indirect access? Don't know what you mean by that. Returning is fast too.

Quote:

Back in the days of the Natami's 68050 I wrote a small emu lib for it, so I know what I'm talking about. Even if you remove some of the bottlenecks, what remains is still a horror to handle.

So how long does your MOVEP emulation take?

Quote:

That would add a lot of dirty legacy...

No, because it wouldn't be visible to old software. New software shouldn't see it or, if it were visible, shouldn't use the exposed changes.

Quote:

In the same way, do you think your "improved trap mechanism" is free ? It's not. This is leading nowhere.

Of course it isn't free. It requires predecode logic and ... well depending on the rest of the design almost nothing else.

Note that I don't propose to implement such a mechanism as I think the normal trap mechanism will be more than enough.
The predecode bits have better uses that can potentially accelerate all instructions instead of only accelerating unimplemented instruction emulation.

Quote:

This invalidates the earlier objection to microcode (about decoding the special instructions).

No, not really. First, there is no decoding done. Second, the pipeline design and critical bottlenecks make it significantly different.

Quote:

Not really true. You've probably run a PC already, and x87 is very similar to 68882 - yet it's an integrated FPU nowadays.

Yes. What does that have to do with this discussion? We have no well-paid, experienced team(s) of hardware designers. We have no top-of-the-line ASIC process. We have no market that will bear the cost of several years of development. We don't have almost 4 decades of experience designing 68k compatibles.

We have hobbyists hacking in their spare time, low-end FPGAs and no market.

Quote:

Wrong if you consider UAE as an Amiga - faster than everything else, and full 68882 if you want one.

Anyway, the 68882 is one thing; regular integer instructions are another. We could talk about CAS, or the bitfields for example. Might be a lot more interesting than FSIN, huh ?

Bitfield instructions yes. CAS? It isn't needed unless multiprocessing was added somehow.

Multiprocessing is hard to retrofit into the Amiga, and my previous attempts to discuss the topic didn't result in any feedback, so I guess nobody is interested in even trying to get it to work.

Quote:

Nostalgia asks for everything that was made before, not a castrated, cut-down evolution of it.

That's not my definition. Most people using a C64 don't use a 64kiB C128 (in C64 mode of course, to be relevant), have no Super CPU and no REU. Many don't have a 1351 mouse or JiffyDOS installed.

Quote:

Nostalgia wants a cpu that's easy to code on, has a complete instruction set, not a cpu that's the fastest possible and sacrifices everything for that chimeric goal (as you're not gonna be competitive anyway with other current families).

If you want to do a 68k, you do a 68k, period.
If you want to take a subset of its instruction set, then you can reencode it fully and it'll be another story.

Okay so you'd like a slow 68000 processor? Why don't you buy one then? It is already available for a reasonable price and can be overclocked to several times the original A500 performance given a good memory subsystem.

Quote:

I want to code in asm because I like the freedom of it.
And only the 68k (or possibly a derived cpu family) is appropriate for that.
Basically this is what I defend here.

Many other architectures have good assembly language support and are fun to program. ARM and x86-32 for instance. Yes really.

Quote:

The ISA should be extended, not reduced, even if this costs a few MHz.

In your opinion. But you have selected the absolute superset of instructions ever implemented in the 68k series and made that your goal.
But most people never used that kind of system.

Quote:

Because it pollutes the address space, the stack, the caches, the memory bus, and perhaps a few other things.
Because it involves executing many more instructions than microcode and would be a lot slower.

Microcode isn't anything magical, it is an indirection in the execution path. Nothing more, nothing less.

Quote:

As we're running on an FPGA, why not implement BOTH solutions anyway ?
It's possible to switch even at runtime !
So everyone would be happy. The costs and benefits of each solution readily available for direct study. No long, useless discussions.

But perhaps some are afraid of what they would discover ? That, for example, having all instructions isn't much slower than removing some ?

Okay. But that requires more time and more people working on it. If you want something like that, why don't you do as Majsta did: learn how to do hardware design and try it out?

BTW it will be slower with all instructions implemented. Even having MOVEM support decreases performance, but it has to be supported.
