Demo Coder Challenge - Vampire Beta Project ! - Page 6

Megol · 23 May 2014, 15:29

Quote:

Originally Posted by pandy71

At some point I decided to give up; however, as I began to better see the point of view of many of you, I need to ask once again: for whom will this board be designed, who will be the marketing target, and what is the main purpose of this board(s)?

Also, just out of pure curiosity: I don't understand why so many people insist on using the CPU to perform DMA-like tasks, or why so many people insist on performing, for example, C2P with the CPU (where some small logic seems to be more efficient).

And no, I'm not afraid of sine - I just found it not worth paying 200E more.

The board _is_ designed and has a thread on this very board! This thread is about the Vampire A600 board running a cut-down Apollo core called Phoenix.

That board was designed to put a reasonably priced FPGA design in the hands of Amiga users in order to promote development. Look at http://majsta.com/ for more information.

Or are you talking about the potential new design? I don't know anything about it, and I doubt those talking about it know much either, given the currently available information.

matthey · 23 May 2014, 18:42

Quote:

Originally Posted by pandy71

Also, just out of pure curiosity: I don't understand why so many people insist on using the CPU to perform DMA-like tasks, or why so many people insist on performing, for example, C2P with the CPU (where some small logic seems to be more efficient).

Hardware assisted c2p is important when there is a lack of CPU performance as it increases the display frames per second (fps) considerably. With plenty of CPU performance, the fps is limited by slow chipmem. With hardware assisted c2p, the fps is limited by slow chipmem. It may give a couple of more fps but it can't perform miracles. We need the Amiga chipset in fpga where chipmem is fast and where we can create new chunky screen modes.

Quote:

Originally Posted by robinsonb5

In an FPGA it'd be perfectly possible to design some logic that reads a buffer from SDRAM, does C2P conversion and writes the result to Chip RAM without CPU intervention. If the CPU's running largely from SDRAM / Cache then this background task would have very little impact on the CPU speed.

The Apollo Team has VHDL for Akiko c2p not that it is a particularly efficient way to do hardware assisted c2p. Converting a whole c2p buffer at once would be the most efficient and could be done in parallel with low resources as you say, not that the Vampire has any free space. Phoenix does use separate memory buses for chip and fast memory and I believe separate write buffers (fast memory would use writethrough caching). Perhaps it's this extra parallelism that allows the Vampire with Phoenix to do a few extra fps? If fpga hardware assisted c2p was able to give any more fps, surely it wouldn't be much. It could save a few CPU cycles but the c2p conversion would be a low percentage of the CPU processing power on Phoenix.

robinsonb5 · 23 May 2014, 19:49

Quote:

Originally Posted by matthey

It could save a few CPU cycles but the c2p conversion would be a low percentage of the CPU processing power on Phoenix.

Having write buffers on the Chip RAM bus will definitely help. I know this isn't a fair comparison, but on the Chameleon64's Minimig core I get a rough average of 6.5 fps, peaking at 7.1 on the Phoenix demo, and this goes up to 7.5 peaking at 8.1 if I enable "Turbo chip RAM" which allows Chip RAM to be written at near Fast RAM speeds.
That's a saving of nearly 20ms per frame with nothing changing except the amount of time the CPU has to wait when writing to the Chip RAM bus.
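As a quick check of that frame-time arithmetic, here is a minimal Python sketch that converts the frame rates quoted above into milliseconds per frame. The function name is only for illustration; the fps figures are the ones from this post.

```python
def frame_time_ms(fps):
    """Time spent on one frame, in milliseconds."""
    return 1000.0 / fps

# Figures quoted above: (normal, with "Turbo chip RAM" enabled)
average = (6.5, 7.5)
peak = (7.1, 8.1)

for label, (slow, fast) in (("average", average), ("peak", peak)):
    saved = frame_time_ms(slow) - frame_time_ms(fast)
    print(f"{label}: {frame_time_ms(slow):.1f} ms -> "
          f"{frame_time_ms(fast):.1f} ms per frame, saved {saved:.1f} ms")
```

On the average figures this works out to roughly 20.5 ms saved per frame, and on the peak figures to roughly 17.4 ms.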

Ultimately, though, as long as the CPU has to spoonfeed C2Ped data into Chip RAM it's going to be wasting a lot of time waiting for that bus to be free. What's more, the more powerful the CPU the more potential work will be lost to that waiting.

Thorham · 23 May 2014, 20:06

Quote:

Originally Posted by pandy71

What can be faster? doing this by CPU?

Yes. No one does it using the blitter on accelerated machines because it's slower.

pandy71 · 23 May 2014, 21:31

Quote:

Originally Posted by Megol

The board _is_ designed and has a thread on this very board! This thread is about the Vampire A600 board running a cut-down Apollo core called Phoenix.

That board was designed to put a reasonably priced FPGA design in the hands of Amiga users in order to promote development. Look at http://majsta.com/ for more information.

Or are you talking about the potential new design? I don't know anything about it, and I doubt those talking about it know much either, given the currently available information.

And perhaps a future Vampire A500 (as I understand it, Majsta's only problem is creating the IDE port; the remaining piece needed to use the Vampire A600 functionality in an A500 is E clock generation, since in the A600 the E clock is generated inside Gayle and in the A500 it comes from the CPU). Not sure if a Vampire A500 should be considered a new project.

Quote:

Originally Posted by matthey

Hardware assisted c2p is important when there is a lack of CPU performance as it increases the display frames per second (fps) considerably. With plenty of CPU performance, the fps is limited by slow chipmem. With hardware assisted c2p, the fps is limited by slow chipmem. It may give a couple of more fps but it can't perform miracles. We need the Amiga chipset in fpga where chipmem is fast and where we can create new chunky screen modes.

But this is something else - just the CPU. However, I see no point in performing a specialized (and presumably frequently performed) task in software when it can be performed perfectly well by hardware during the transfer from one memory address (area) to another.

Quote:

Originally Posted by matthey

The Apollo Team has VHDL for Akiko c2p not that it is a particularly efficient way to do hardware assisted c2p. Converting a whole c2p buffer at once would be the most efficient and could be done in parallel with low resources as you say, not that the Vampire has any free space. Phoenix does use separate memory buses for chip and fast memory and I believe separate write buffers (fast memory would use writethrough caching). Perhaps it's this extra parallelism that allows the Vampire with Phoenix to do a few extra fps? If fpga hardware assisted c2p was able to give any more fps, surely it wouldn't be much. It could save a few CPU cycles but the c2p conversion would be a low percentage of the CPU processing power on Phoenix.

I'm not talking about recreating Akiko, as there is a patent covering the C2P aspect of Akiko, and recreating it should be straightforward (but perhaps not optimal from an FPGA perspective). My point is that instead of losing CPU power to perform "always the same task", a small amount of resources should be dedicated to it, and two things should be done at the same time - C2P and the transfer from one type of buffer to another. This can be done in 1 cycle per word/longword, transparently from the CPU's point of view.
And this kind of thing would be far more useful, but also simpler, than rarely used transcendental instructions (don't get me wrong - I would be more than happy to have everything in such a cheap FPGA, but that is impossible).

Quote:

Originally Posted by Thorham

Yes. No one does it using the blitter on accelerated machines because it's slower.

No, this is not the Agnus blitter but a DMA engine that can be created in the FPGA (alongside the CPU). It can perform memory transfers from FAST to CHIP and back at full speed (as allowed by the Amiga bus), and during the transfer additional operations can be performed (like C2P etc.). This means that we can first reduce the CHIP size limitations and then do some obvious and frequently repeated tasks (like moving a screen buffer from FAST to CHIP 25 - 50 times per second) purely in hardware.
C2P is not arithmetic, it is plain bit shuffling - a purely logical operation on bits of data and address. Wasting the CPU on this is simply improper, since the Amiga bus will limit the transfer anyway and the CPU, running from FAST, can do something else at the same time.
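To illustrate the "plain bit shuffling" point, here is a minimal Python sketch of C2P for one group of 8 pixels at 8 bits per pixel. It is only a software model of the data movement, not a proposal for how the FPGA logic would be written, and the function name is made up for this example.

```python
def c2p_8pixels(chunky):
    """Convert 8 chunky pixels (one byte each) into 8 bitplane bytes.

    planes[p] collects bit p of every pixel; the leftmost pixel ends up
    in the most significant bit, as in Amiga bitplanes.
    """
    assert len(chunky) == 8
    planes = [0] * 8
    for x, pixel in enumerate(chunky):
        for p in range(8):
            bit = (pixel >> p) & 1       # pick bit p of this pixel
            planes[p] |= bit << (7 - x)  # place it at pixel position x
    return planes

# Pixel x has only bit x set, so plane p gets only pixel p set.
print([hex(b) for b in c2p_8pixels([1, 2, 4, 8, 16, 32, 64, 128])])
```

The whole operation is an 8x8 bit-matrix transpose: only shifts, masks and ORs, with no adds or multiplies, which is why it maps to simple wiring in hardware.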

Megol · 23 May 2014, 23:02

Quote:

Originally Posted by Don_Adan

About movem and Intel.
Then time to beat Intel chips.

The previous generation Intel Itanium was capable of two memory loads and two memory stores per clock with some limitations.
The current generation reduced this to two memory operations per cycle in total.

Why? It increased performance.

Quote:

I'm not a hardware expert, but I would use an internal 128-bit CPU register
for reads/writes. Four 32-bit registers can be split from/joined into one
128-bit register, and the data (128 bits) can be written/read. Then with
two reads you can read 8 longwords per cycle, or write 4 longwords
per cycle. If I remember right, the move16 command is faster than movem
for 4 registers, so there must be some way to make it faster.
If splitting/joining registers is too hard to do, then I think
a "movemfast" command (or special movem.l handling) could be added.
It would work like movem, but only for successive registers, e.g.

movemfast D0-D3,(A0) is OK
movemfast D0-D2/D4,(A0) is not possible
movemfast D1-A4,(A5) is OK

or special handling of movem.l only

movem.l D0-D3,-(SP) full speed
movem.l D0-D2/D4,-(SP) slowest speed
movem.l (SP)+,D0-A3 full speed

For the movem.w command, data can be joined/split in a similar way.

Combining data cache read/writes in order to reduce memory accesses is trivial. But one still is limited by register ports.

One of the main reasons superscalar FPGA soft cores are about as rare as hen's teeth is the problem of adding register ports.

Read ports are reasonably simple to add, but they complicate signal routing and add latency in the read path. Write ports are very hard to do efficiently.

So in short increasing MOVEM stores is possible but increasing loads is much harder and probably not worth it. Given the complexities required for accelerating MOVEM stores it's probably not worth it either.

Quote:

Of course I don't know hardware. But I have heard too many times that something
is impossible to do, when I was sure (as an amateur) that it was possible.
If you need hardware examples: putting an MC68060 CPU on an A3640, or putting more than 16 MB
on an A4k main board. If you need software examples: a one-disk version
of Turrican 2 (fitting heavily packed ~1070kB of data on a 900kB disk), or writing
RNC copylocks without hardware.

You have a very bullyish posting style IMHO. There are good reasons why one can't optimize all aspects of a design at the same time. Even the leading edge engineers using leading edge ASIC technologies have to decide what to optimize and what to eliminate or do slowly.

Quote:

100 cycles for movep emulation is very good example of wasting of CPU power.
Seems You like to waste of CPU power, I don't like this.
2 or 4 cycles vs 100 cycles, and Your choice is 100.

Are you willing to take a performance hit on _all_ instructions in order to execute MOVEP quickly?
I sure am not, as MOVEP isn't critical except for, IIRC, two demos that use Atari ST style C2P (not optimal for the Amiga bitplane layout).

Even for them it wouldn't be a problem if looking at the numbers:

8MHz 68000, 16 cycles MOVEP -> 0.5 MOPS
100MHz new 68k compatible, 100 cycles MOVEP -> 1 MOPS

However the new chip would most likely be clocked higher and have perhaps a 50 cycles MOVEP emulation.
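The numbers above are just clock rate divided by cycles per instruction. A minimal Python sketch, with a made-up helper name; the last line assumes the emulating core still runs at 100MHz, since no higher clock figure is given here:

```python
def mops(clock_mhz, cycles_per_op):
    """Millions of operations per second for a fixed cycle cost."""
    return clock_mhz / cycles_per_op

print(mops(8, 16))     # 8MHz 68000, 16-cycle MOVEP
print(mops(100, 100))  # 100MHz core, 100-cycle emulated MOVEP
print(mops(100, 50))   # assumption: same 100MHz clock, 50-cycle emulation
```

So even a 100-cycle emulation at 100MHz would match or beat the original 68000 on MOVEP throughput, which is the point being made.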

Quote:

About the trap and finding the emulation routine, of course this is true. But of course you
can create thousands of traps, and every trap can work only for a specific
CPU/FPU instruction, e.g.
trap #10456 can handle movep.l D0,0(A0)
trap #10457 can handle movep.l D0,1(A0)
trap #29456 can handle movep.l D2,51(A2)
etc

This is simple wasting of memory/speed, but You can call this "right design".
Anyway it seems You must be "Trap Master", I think, if for You traps way
is correct way for make good CPU.

Why is trapping to a special mode intended for emulation of instructions worse than trapping to microcode?

One can have a ROM connected to the instruction fetch mechanism, a separate ROM program counter and some dedicated registers for the emulating code. That wouldn't be as slow as using a normal trap mechanism.

Quote:

I don't wrote that Apollo creators are lamers/amateurs. I only wrote
that they go lamers/amateurs way. If for You this is same, then sorry.

From my amateuer point of view:

1. movep instruction -> trap.
2. rare used instruction -> trap.
3. FPU instruction -> trap.
4. hard to implement instruction -> trap.
etc.

Sorry, but I can't call this expert way.

Okay.

Quote:

About SIN support.
Sorry, but I think that you must be wrong that microcode can slow down
the CPU clock rate, if other FPU commands (microcoded too) don't slow the CPU clock rate.
This is illogical to me, and even if it is true, then try to make
a different SIN implementation. Many things can be done in different ways, not only
via the "one and only trap" way.

Supporting MOVEM and MOVE mem,mem by splitting them into several operations doesn't require general microcode support; it can be done by relatively simple hardware. To support complex instructions with microcode one essentially has to build a processor within a processor, which will increase resource usage (leaving less for e.g. caches and other acceleration technologies) and probably reduce the clock rate of the whole design.

Quote:

About xxx16 commands.
I don't know which examples You need, but f.e.
movem.l (A5)+,D7-A2
eor16 D7-A2,(A4)+

I understand how the operations would work; what I don't understand is what use one would get from them, EOR16 excepted, as it could be used for RAID support. But what use would one get from ADD16? Checksums perhaps, but supporting something like proper CRC32 instead would be a better way to spend the resources.

Quote:

I'm not against adding new instructions to the core, but it must be done in a clean way,
not in a dirty way (using opcode space of already available 68040/68882 instructions),
because that will only cause problems. This is simple for me: choose one unused
and easy to handle 2-byte opcode and use it as a prefix/ID only for
the series of 128-bit instructions, and then use the next 2 bytes as the real opcode for fast
instruction decoding. Instructions will be 2 bytes longer (due to the prefix/ID at the beginning),
but 100% compatible with already available 68040/68882 instructions.

I don't think any 68k compatible core designer intends to reuse valid instruction encodings for anything else?
There are other spaces that can be used for extensions.

My hobby project of extending the 68k architecture to 64 bits (not likely to be ever realized) uses small holes in the encoding to add e.g. MOVEM support for an extended register set, branches with 64 bit offset, immediate loads of 64 bits etc.

I don't know if the Apollo core supports it but Gunnar and others have proposed to make address registers capable of a subset of integer operations. That also reuses formats mostly already existing but not supported. Personally I think that's a bad idea BTW but other people think my ideas about 64 bit extensions and prefixes are really bad too.

Thorham · 24 May 2014, 00:54

Quote:

Originally Posted by pandy71

C2P is not arithmetic, it is plain bit shuffling - a purely logical operation on bits of data and address. Wasting the CPU on this is simply improper, since the Amiga bus will limit the transfer anyway and the CPU, running from FAST, can do something else at the same time.

But what will you do the c2p with on an Amiga if you're not going to use the CPU for it? Right, the blitter, and that's just too slow. It limits frame rates, and that's why no one does it. For something like image viewers this might be alright, but for games and demos it's much better to use the CPU. With the blitter I doubt you'll get even 12 FPS in 320x256x8bit.

robinsonb5 · 24 May 2014, 02:06

Quote:

Originally Posted by Thorham

But what will you do the c2p with on an Amiga if you're not going to use the CPU for it? Right, the blitter, and that's just too slow.

Again, the point is that when an FPGA is available, it's not only possible but relatively straightforward to create a custom blitter-like component within the FPGA (alongside the soft CPU) that handles this task more efficiently than the CPU could. It would read chunky data from Fast RAM, so only writing would be limited by Chip RAM speed, and that would be far less of a problem when the CPU's not having to spoon-feed data to the slow Chip RAM bus.

Thorham · 24 May 2014, 10:54

Quote:

Originally Posted by robinsonb5

Again, the point is that when an FPGA is available, it's not only possible but relatively straightforward to create a custom blitter-like component within the FPGA (alongside the soft CPU) that handles this task more efficiently than the CPU could.

Surely a proper chunky mode would be ten times better than some c2p acceleration?

robinsonb5 · 24 May 2014, 13:53

Quote:

Originally Posted by Thorham

Surely a proper chunky mode would be ten times better than some c2p acceleration?

Yes, of course. But good luck implementing one in a device that plugs into the CPU socket and has no VGA connector!

Megol · 24 May 2014, 14:12

Quote:

Originally Posted by robinsonb5

Yes, of course. But good luck implementing one in a device that plugs into the CPU socket and has no VGA connector!

I haven't looked closely at it and have no plans to test it, but shouldn't an accelerator be able to plug into/over the Denise and Agnus chips? If so (and with some mainboard surgery), all chipmem accesses should be routable to the accelerator memory instead, increasing the CPU bandwidth greatly.

Don_Adan · 24 May 2014, 18:10

Quote:

Originally Posted by matthey

How is it possible to remove instructions that never existed in the Apollo?

For me a trap handler is also "pseudo support" for an instruction.

Quote:

Originally Posted by matthey

There is no waste if MOVEP is not used and it shouldn't be used anymore. If
old programs use MOVEP, they will probably still be faster than they
originally were despite the trap/emulation. I use a 68060 with emulated
MOVEP and I can't see any difference nor do I notice any problems from it.

Check this code on your Amiga:

Code:

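        moveq   #-1,D2          ; D2.w = $FFFF, so dbf runs the loop 65536 times
loop
        move.l  (A0)+,D0        ; fetch 8 source bytes
        move.l  (A0)+,D1
        movep.l D0,0(A1)        ; D0 bytes -> A1+0, +2, +4, +6
        movep.l D1,1(A1)        ; D1 bytes -> A1+1, +3, +5, +7
        addq.l  #8,A1           ; next 8-byte destination block
        dbf     D2,loop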

This is a graphics conversion routine for the Rygar arcade game, if I remember right.
Call this at 50Hz or 60Hz and check the effect on your 68060, then call
it on a 68020/68000 CPU and compare.
Of course you can try to rework this code, but perhaps you can never
reach good speed, especially when a movep.l write could work in only 4 cycles on
the Apollo core. Simply put, movep is a useful instruction in some cases. If someone doesn't see
this, then he has an eye problem.
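For readers who don't have the MOVEP semantics in their head, here is a small Python model of what one pass of that loop body writes. It assumes the documented behaviour of MOVEP.L Dn,d(An) (register bytes stored most significant first at every second address); the function names are made up for this sketch.

```python
def movep_l_store(memory, reg, addr):
    """Model of MOVEP.L Dn,d(An): the four register bytes are written,
    most significant first, to addr, addr+2, addr+4 and addr+6."""
    for i in range(4):
        memory[addr + 2 * i] = (reg >> (24 - 8 * i)) & 0xFF

def interleave_block(d0, d1):
    """The 8 bytes written at (A1) by one pass of the loop body."""
    out = bytearray(8)
    movep_l_store(out, d0, 0)  # movep.l D0,0(A1)
    movep_l_store(out, d1, 1)  # movep.l D1,1(A1)
    return bytes(out)

print(interleave_block(0x11223344, 0xAABBCCDD).hex())
```

The two instructions together interleave the bytes of D0 and D1, which without MOVEP takes a longer sequence of shifts, swaps and byte writes.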

Quote:

Originally Posted by matthey

The CPU needs the SRAM for many different uses and often needs big blocks
of several kB. Robbing a few kB could result in one of the cache sizes being
cut in half. You are correct that the tables for some of the instructions
could be shared.

FSIN, FCOS and FSINCOS use the same table
FTAN uses a table (it could use FSIN/FCOS but division is slow)
FASIN and FACOS use FATAN which has a table
FSINH, FCOSH, FTANH, FETOXM1 use FETOX which has a table
FTENTOX, FTWOTOX, FLOGN, FLOG2 and FLOG10 use a table
FMOVECR uses a table

Just the tables is probably 10kB. There is other static data like fp numbers
that needs to be stored and the code has to be stored somewhere. Now you
could be using 20kB out of 200kB which is 10% of the SRAM and that doesn't
count the logic used. My estimate could be significantly off but I hope you
can see that although it's possible to add all the FPU instructions now, it
would take valuable resources from being used elsewhere where they provide a
better speedup.

Seems you are well on the way to fitting FPU instructions in the FPGA.

Quote:

Originally Posted by matthey

This is a good idea but a pipeline creates complexity that makes this more
difficult. Processors use read and write register ports to access the data
in the register file(s). The processor must keep track of dependencies
(conflicts) between the different registers in a superscalar pipelined
processor. When multiple registers are accessed, it makes this job more
complex. Only a certain amount of work can be done in a stage of the
pipeline without slowing down the processor. MOVEM already takes more time to
process because the registers are bitmapped instead of a continuous series.
There are no restrictions on the alignment of the memory access either. I
think some optimization can be done on the Apollo, at least when reading
from memory/cache, but I don't expect it's as easy as you think. Apollo has
a better chance than many processors because unaligned cache accesses are
allowed and the processor is slower than memory.

If movem.l (SP)+,D0-A6 works in 3 (1+2) cycles and movem.l D0-A6,-(SP)
works in 5 (1+4) cycles, then I don't think that movem pipelining
is necessary, because movem will already be very fast. Maybe you can win 1 cycle in
some cases, but it needs more support in the FPGA (more space used).
Anyway, using movem you can make a super fast copy routine, at less than
1 cycle per copied longword.

Quote:

Originally Posted by matthey

It's not about what is impossible but about what is practical. MOVEM could
load or store all registers in a single cycle but the clock speed of the CPU
would be a fraction of what it is. All the 68020+6888x instructions could be
put in hardware but the processor would be slower, take longer to develop
and need a more costly fpga.

Have you made any tests showing that adding/removing an instruction from the core changes
the CPU speed? The easiest way is perhaps to remove movep support from the TG68 core
(Vampire A600) and check whether it is really faster (without movep) than
the full 68000 version.

Quote:

Originally Posted by matthey

I believe that Apollo does not currently support any microcode. Adding
microcode support may slow down the whole CPU. It should be possible (and
probably faster) to do the trig instructions in VHDL but the math is very
complex for some of these instructions. There can be different polynomials
used for different ranges of input based on the calculated error in that
range. The worst case error for fatan reads like this:

Then tell me how the Apollo authors want to add support for FPU or SIMD
instructions, if there is no microcode support? Does adding microcode "maybe slow down" or "slow down" the core?
Without real tests this is only theory to me, and not very logical either. Why?
Because if I assemble an unused and very slow routine into my code, it
only changes the size of my code, but never changes the speed of the code that is used.
Almost everything in the world can be done in different ways, so for
me the Apollo authors must rethink some ideas, because this is not the correct way to me.

Quote:

Originally Posted by matthey

Accuracy and Monotonicity
The returned result is within 2 ulps in 64 bit significant bit, i.e. within
0.5001 ulp to 53 bits if the result is subsequently rounded to double
precision. The result is provably monotonic in double precision.

I think this is in some technical math language other than English. Perhaps
you would like to do the VHDL coding and proof of monotonic result and
accuracy to 2 ulps in 64 bit significant digits? It's not like it's
impossible.

Matt, don't try to jump on the head, you are not BC Kid.
To make direct (no microcode) support for FPU commands, you must start with the easiest commands to implement.
At first it must be fmove command.
Later commands like fneg, fnop, ftst.
Later commands like fmovem, fadd, fsub.
Later commands like fmul, fdiv
...
At end you can adapt commands like fatan etc.

Of course I can try to help you with this work, once I understand how VHDL programming works.

pandy71 · 24 May 2014, 19:29

Quote:

Originally Posted by Thorham

Surely a proper chunky mode would be ten times

As Megol said - we must be realistic - perhaps with a new project, perhaps with a different device, perhaps by recreating the chipset (like A-Clone, not Minimig).
For today, however, I see no point in creating a very fast CPU only to perform C2P when a small amount of hardware can do this in 1 cycle. It makes sense to create special hardware for C2P, as the Amiga architecture is fixed and will not change, so C2P is one of the common things worth doing in hardware.

Don_Adan · 24 May 2014, 19:43

Quote:

Originally Posted by Megol

Are you willing to take a performance hit on _all_ instructions in order to execute MOVEP quickly?
I sure am not, as MOVEP isn't critical except for, IIRC, two demos that use Atari ST style C2P (not optimal for the Amiga bitplane layout).

Even for them it wouldn't be a problem if looking at the numbers:

8MHz 68000, 16 cycles MOVEP -> 0.5 MOPS
100MHz new 68k compatible, 100 cycles MOVEP -> 1 MOPS

However the new chip would most likely be clocked higher and have perhaps a 50 cycles MOVEP emulation.

Movep handling can be slower, but of course 4 cycles is better than 10 cycles.

Quote:

Originally Posted by Megol

Why is trapping to a special mode intended for emulation of instructions worse than trapping to microcode?

One can have a ROM connected to the instruction fetch mechanism, a separate ROM program counter and some dedicated registers for the emulating code. That wouldn't be as slow as using a normal trap mechanism.

Why?
1. Some programs on different platforms will recognize this CPU as a 68060, which is not true and can cause many problems.
2. It needs a special boot ROM for handling movep, and even with a boot ROM it can fail. Many Amiga games (coded correctly, but using movep) use all or almost all of the Amiga chip memory. If the movep handler uses a trap vector after $C0 (the 68060 uses VBR+$F4, if I remember right), then it can simply be trashed by game code or data. Or not even directly by game code, but by crack patches, trainers or trackdisk loaders. The zero page is often used for such tasks.

Quote:

Originally Posted by Megol

I understand how the operations would work; what I don't understand is what use one would get from them, EOR16 excepted, as it could be used for RAID support. But what use would one get from ADD16? Checksums perhaps, but supporting something like proper CRC32 instead would be a better way to spend the resources.

For CRC, for easy handling of two 64-bit values, etc.

Quote:

Originally Posted by Megol

I don't think any 68k compatible core designer intends to reuse valid instruction encodings for anything else?
There are other spaces that can be used for extensions.

My hobby project of extending the 68k architecture to 64 bits (not likely to be ever realized) uses small holes in the encoding to add e.g. MOVEM support for an extended register set, branches with 64 bit offset, immediate loads of 64 bits etc.

I don't know if the Apollo core supports it but Gunnar and others have proposed to make address registers capable of a subset of integer operations. That also reuses formats mostly already existing but not supported. Personally I think that's a bad idea BTW but other people think my ideas about 64 bit extensions and prefixes are really bad too.

Right.
I know that extension word(s) are the best and most logical solution for clean CPU enhancements.
But I heard that someone wants to remove some 68040/68882 commands and replace their opcode space with different instructions. That makes no sense at all to me. Another shortcut. Correctly used extension word(s) can give many more possibilities than replacing already used 68040 opcodes; e.g. it is possible to add a big series of 128-bit commands, support for 64-bit addressing, or a series of good instructions from other CPUs. Of course there is no space in the FPGA for that, but new instructions are not very important to me.

meynaf · 26 May 2014, 10:23

Quote:

Originally Posted by Megol

It's increasingly clear that you don't know anything about hardware. But do try.

While this may be true, it doesn't properly eliminate the possibility that he is right and you are wrong.
Also he could have written that it's increasingly clear that you don't know anything about software. Wouldn't have helped the discussion, huh ?

Quote:

Originally Posted by Megol

No. If MOVEP can be emulated in e.g. 100 cycles on a 100MHz machine that is already enough.
Why? Because the few cases that uses MOVEP will not be performance critical.

There ARE some performance critical cases. Aside from Don's example, many 68000 programs do their endian conversion with MOVEP, like the SCI (Sierra Creative Interpreter) games.

Quote:

Originally Posted by Megol

Supporting SIN would lower the clock rate of the whole design as generic microcode support have to be included.

Not sure, if this unit is designed properly.

Quote:

Originally Posted by pandy71

And no, I'm not afraid of sine - I just found it not worth paying 200E more.

For that 200E you have a lot more than just fsin.

And frankly if the implementation of fsin is so costly, then the person doing it must be fired

Quote:

Originally Posted by matthey

All the 68020+6888x instructions could be put in hardware but the processor would be slower, take longer to develop and need a more costly fpga.

Nobody is asking for a full hardware implementation. With a microcode or a similar instruction cracking unit, you can remove many problems.

Quote:

Originally Posted by matthey

There is no waste if MOVEP is not used and it shouldn't be used anymore.

And why shouldn't it be used anymore ? Because you decided it ? The 68000 manual never stated that this instruction shouldn't be used, sorry.

Quote:

Originally Posted by matthey

If old programs use MOVEP, they will probably still be faster than they originally were despite the trap/emulation.

I wouldn't like to have programs on a 68070 that feel as if i had a 9 mhz 68000 instead...

Quote:

Originally Posted by matthey

I use a 68060 with emulated MOVEP and I can't see any difference nor do I notice any problems from it.

The fact you don't see a difference, doesn't mean there is none.

In addition, IIRC the 68060 has support for CAS, where none is planned by Gunnar. A more interesting subject to discuss than MOVEP, maybe ?

Quote:

Originally Posted by matthey

I believe that Apollo does not currently support any microcode. Adding microcode support may slow down the whole CPU.

It may or may not slow it down. You don't know. This slowdown is theoretical and has never been quantified by real testing.

Quote:

Originally Posted by Megol

Are you willing to take a performance hit on _all_ instructions in order to execute MOVEP quickly?

MOVEP alone won't do that, or it's done in a really wrong way...

But taking a performance hit on _all_ instructions in order to support the full 68k isa, is perfectly acceptable.
Anyway, if adding a new block really has a performance hit on all instructions, then adding a vector core is even worse than i initially thought.

Quote:

Originally Posted by Megol

Why is trapping to a special mode intended for emulation of instructions worse than trapping to microcode?

Because microcode does not have to fetch instructions from memory, interrupt the stream, pollute the cache, save/restore context, etc.

Quote:

Originally Posted by Megol

One can have a ROM connected to the instruction fetch mechanism, a separate ROM program counter and some dedicated registers for the emulating code. That wouldn't be as slow as using a normal trap mechanism.

I wouldn't bet. ROMs have a very high latency.

Quote:

Originally Posted by Megol

Supporting MOVEM and MOVE mem,mem by splitting them into several operations doesn't require general microcode support; it can be done by relatively simple hardware.

So can instructions like MOVEP be done. A very simple "instruction cracking" hardware. If you run it in parallel to the decoder, it's just an extra 2:1 mux after decoding.

Quote:

Originally Posted by Megol

I don't think any 68k compatible core designer intends to reuse valid instruction encodings for anything else?

There was once a tough discussion on apollo-core.com about this. Although I told them it was a very poor and very stupid idea, neither Gunnar nor Matt saw a problem in doing that!

Trapping instructions is slower than supporting them properly; on that, everyone can agree.

Emulation in 50 or 100 clocks is maybe possible. But remember that you have to trap (which by itself is very slow), save your regs somewhere (again very slow), decode the instruction (the more instructions you trap, the worse this becomes), and then finally execute the bloody thing. After that, you restore your regs and return to the main stream. Clocks will be eaten pretty quickly.

But this is not the only reason why no instruction should trap.

Full support of the ISA also has some marketing value that must not be overlooked - ask Apple why they always refused to use the 68060.

And trapping an instruction isn't like having it. It needs some stack space for saving and retrieving the context (a bad thing when the stack is nearly full), an interrupt may come unexpectedly so that your instruction is no longer atomic, you may conflict with debuggers tracing instructions, your trap vector may be overwritten by some programs, etc.
I know from experience that if something CAN go wrong, you can take it for granted that it WILL go wrong.

It may be better to implement all instructions and make some of them optional (i.e. you can selectively enable and disable them).
Then their exact impact will be known, and perhaps two versions of the CPU can be made, one for compatibility and the other for speed.

pandy71 · 26 May 2014, 13:38

Quote:

Originally Posted by meynaf

For that 200E you have a lot more than just fsin.

And frankly if the implementation of fsin is so costly, then the person doing it must be fired

Once again: a rough estimate based on the Altera FPU megacore leads to this kind of figure. Expectations are also key: it seems it must be a fast implementation (a few, perhaps ten, cycles). If you or anyone else can beat Altera and fit a fully pipelined 80-bit FPU in 10k LEs, then I will be more than happy.
And to be honest, which would use fewer resources: C2P and perhaps HiColor-to-HAM conversion, or an FPU?

meynaf · 29 May 2014, 09:52

Quote:

Originally Posted by pandy71

Once again: a rough estimate based on the Altera FPU megacore leads to this kind of figure. Expectations are also key: it seems it must be a fast implementation (a few, perhaps ten, cycles). If you or anyone else can beat Altera and fit a fully pipelined 80-bit FPU in 10k LEs, then I will be more than happy.

I'm not asking for fsin in 10 cycles. 50 or even 100 is fine.
If it fit in an old 68882 chip, it can fit in a moderately priced FPGA - and IIRC Gunnar had found a board with a 100k LE FPGA for $99.

Quote:

Originally Posted by pandy71

And to be honest, which would use fewer resources: C2P and perhaps HiColor-to-HAM conversion, or an FPU?

I see little use for C2P/HAM conversion in hardware; I'd rather have true chunky and true color modes.

Anyway the small FPGA on the Vampire board has room for neither.

pandy71 · 29 May 2014, 16:05

Quote:

Originally Posted by meynaf

I'm not asking for fsin in 10 cycles. 50 or even 100 is fine.
If it fit in an old 68882 chip, it can fit in a moderately priced FPGA - and IIRC Gunnar had found a board with a 100k LE FPGA for $99.

So how many cycles must it be? 200? 2000? 20000?

100k LEs from Altera, Xilinx, Lattice, or someone else?

Quote:

Originally Posted by meynaf

I see little use for C2P/HAM conversion in hardware; I'd rather have true chunky and true color modes.

Anyway the small FPGA on the Vampire board has room for neither.

Unless you replace Denise, we have only planar and HAM modes. As this discussion is about the CPU, not Denise in FPGA, I see a huge use for C2P and HiColor-to-HAM conversion in hardware...

Megol · 29 May 2014, 18:58

Quote:

Originally Posted by meynaf

While this may be true, it doesn't properly eliminate the possibility that he is right and you are wrong.
Also he could have written that it's increasingly clear that you don't know anything about software. Wouldn't have helped the discussion, huh ?

Well, at least he could stop calling recognizing limits lame and amateurish?
That wasn't intended as an insult, and if he (or anyone else) took it as such I can only beg forgiveness and try to formulate it a bit differently.

I'm assuming that we are talking about normal processors using the standard methods available, that is, the microarchitecture is a von Neumann stored-program design using the standard modified Harvard cache (separate instruction and data caches mapped into one common address space). I also assume it is pipelined to some degree.

Then there are a number of so-called critical loops: one is the load-use delay, another is the latency between dependent instructions, and a third is the latency from the instruction length computation to the instruction fetch logic. There are others.
All of those are ideally 1 clock long, and while they can be lengthened, performance in most cases takes a heavy hit: a 2-cycle latency between dependent instructions can, even in a wide out-of-order execution engine, be 20%+ slower than a 1-cycle one.

68k instructions are pretty hard to decode: lengths can vary greatly and formats do too. But to execute 68k code one has to deal with it. Thankfully there aren't many really complex instructions, so most can be translated to a simpler 3-operand RISC-type operation in one step. The big exceptions are MOVE with two memory operands and MOVEM. However, the logic to detect those cases in parallel with the rest of the decoder is fairly small, and the internal operations are of the same type (MOVE), which makes that manageable: the main decoder could generate the first MOVE operation, and then the dedicated logic would stall the fetch/decode pipeline and inject the rest of the MOVE operations required.
Examples:
MOVE src, dst -> MOVE src, temp ; MOVE temp, dst
MOVEM d0-d2, (a0)+ -> MOVE d0, (a0)+ ; MOVE d1, (a0)+ ; MOVE d2, (a0)+ (from memory, it's been a while since I've coded assembly so this could be wrong!)
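
To make the cracking step concrete, here is a minimal software sketch of it in Python. It is only an illustration of the idea, not how any real core does it: the function names, the "temp" register and the operand strings are all hypothetical. It uses the memory-to-register form of MOVEM with long-sized transfers, since (An)+ is a valid MOVEM addressing mode only in that direction.

```python
# Hypothetical sketch of "instruction cracking": expand one complex 68k
# instruction into a sequence of simple MOVE operations.

def is_memory_operand(operand):
    # Assumption for this sketch: any operand written with parentheses,
    # e.g. "(a0)" or "(a0)+", refers to memory.
    return "(" in operand

def crack(mnemonic, operands):
    """Return the list of simple operations for one decoded instruction."""
    if mnemonic == "MOVE":
        src, dst = operands
        if is_memory_operand(src) and is_memory_operand(dst):
            # Memory-to-memory MOVE: go through an internal temporary.
            return [("MOVE", src, "temp"), ("MOVE", "temp", dst)]
        # Everything else is already a single simple operation.
        return [("MOVE", src, dst)]
    if mnemonic == "MOVEM":
        src, registers = operands
        # Memory-to-register MOVEM: one MOVE per register in the list.
        return [("MOVE", src, reg) for reg in registers]
    raise ValueError("not handled by the cracking logic: " + mnemonic)

print(crack("MOVE", ("(a0)", "(a1)")))
print(crack("MOVEM", ("(a0)+", ["d0", "d1", "d2"])))
```

In hardware the list would not be built up front; the dedicated logic would emit one operation per cycle while the fetch/decode pipeline is stalled.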

But if the design needs more complex operations, this isn't enough. Emulating FSIN will require operations including internal looping and, more importantly, operations other than FSIN - sadly it can't be emulated fractally.

That means the logic needed to detect complex instructions has to be bigger and probably stored in a ROM. This can make the detection logic a limit on the clock frequency in several ways: increased fan-out of the fetch data, routing delays to and from the ROM, and possibly problems fitting the <logic>->ROM-><logic> path in the detector into one clock cycle. Adding another cycle for complex operation detection will require a micro-flush of the latest fetched instruction, which will complicate even more parts of the fetch/decode pipeline.
But that is only for _detecting_ that the instruction should be considered complex; then the fetching has to be redirected to a microcode engine. That adds more complexity.

In comparison, an enhanced trap mechanism allows some simplifications, as some cases can be handled in software. There still has to be detection logic, but it can be reduced in size, as parts of it must already exist to detect illegal instructions.

Quote:

There ARE some performance critical cases. Aside of Don's example, many 68000 programs do their endian conversion with MOVEP, like the SCI (Sierra Creative Interpreter) games.

But is that really performance critical? The 68k processor natively executing MOVEP as fast as possible is the 68040, right? Let's assume that requires 8 clocks per MOVEP at 33MHz -> ~4M per second peak. Now assume the new processor runs at 200MHz and requires 50 cycles -> ~4M per second peak.
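
The arithmetic behind those two figures can be checked with a few lines of Python. The clock rates and cycle counts are the assumed numbers from the paragraph above, not measurements, and the function name is just for illustration.

```python
# Peak rate of one instruction = clock frequency / cycles per instruction.
# All inputs are the assumptions stated in the post, not measured values.

def peak_per_second(clock_hz, cycles_per_instruction):
    return clock_hz / cycles_per_instruction

native_68040 = peak_per_second(33e6, 8)     # hardware MOVEP, assumed 8 clocks
trap_emulated = peak_per_second(200e6, 50)  # trap + emulate, assumed 50 cycles

# Cycle budget at 200MHz that exactly matches the assumed 68040 rate.
break_even_cycles = 200e6 / native_68040

print(native_68040)       # 4125000.0
print(trap_emulated)      # 4000000.0
print(break_even_cycles)  # about 48.5
```

Under these assumptions the emulated MOVEP lands within a few percent of the native rate, and any emulation path shorter than roughly 48 cycles would be faster.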

There are many ways to improve trap-and-emulate timings without making large hardware changes.
Save/restore of registers can be done with banking in the register file or with renaming techniques, depending on the processor design.
It's hard to eliminate the transfer to the handler routine, but it can be accelerated by treating every unimplemented instruction as a bsr to the trap handler, eliminating the mispredict latency.
The trap handling code could be stored in an on-chip ROM to ensure that it is "cache resident"; however, if the emulation is needed often it will likely be resident anyway.
Removing the checking for which instruction is to be emulated can be done at least partially by having a mechanism that translates part of the instruction word into an offset into the trap handler. Some checking code would still be needed though.
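
The last point, turning part of the instruction word into an offset into the trap handler, might look like the following Python sketch. The slot size and the choice of the top four opcode bits as the index are assumptions made up for this example. The MOVEP test follows the 68k encoding (bits 15-12 zero, bit 8 set, bits 5-3 equal to 001) and stands in for the residual checking code that would still be needed.

```python
# Hypothetical dispatch: index the trap handler by the top 4 bits of the
# 16-bit opcode word, so each opcode "line" gets its own entry point.

SLOT_BYTES = 64  # assumed size reserved per entry point, not a real value

def handler_offset(opword):
    line = (opword >> 12) & 0xF   # top four bits select the slot
    return line * SLOT_BYTES

def is_movep(opword):
    # Residual check inside the line-0 slot: MOVEP is 0000 ddd 1mm 001 aaa.
    return (opword & 0xF138) == 0x0108

movep_l_reg_to_mem = 0x01C8  # MOVEP.L D0,(d16,A0)
btst_d0_d1 = 0x0101          # BTST D0,D1, shares line 0 with MOVEP

print(handler_offset(movep_l_reg_to_mem), is_movep(movep_l_reg_to_mem))
print(handler_offset(btst_d0_d1), is_movep(btst_d0_d1))
```

Both opcodes map to the same slot, which is why some checking remains: the handler for that slot has to tell MOVEP apart from the bit instructions that share its upper bits.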

Thorham · 29 May 2014, 22:49

Why not simply implement what exists and be done with it? Shouldn't it be enough to get existing 68060 speed? What's the deal with wanting more and more speed? Isn't part of the charm that right now we don't have all the speed in the universe?

Also, if you're already moving away from 680x0 (FPGA is NOT 680x0), then why not stick a cheap AMD on a board and run a 680x0 emu?

Perhaps I'm naive, but I just had to ask these questions.
