23 May 2014, 15:29 | #101 | |
Registered User
Join Date: May 2014
Location: inside the emulator
Posts: 377
|
Quote:
That board was designed to put a reasonably priced FPGA design in the hands of Amiga users in order to promote development. Look at http://majsta.com/ for more information. Or are you talking about the potential new design? I don't know anything about it, and I doubt those talking about it know much either, given the currently available information. |
|
23 May 2014, 18:42 | #102 | ||
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
Quote:
Quote:
|
||
23 May 2014, 19:49 | #103 | |
Registered User
Join Date: Mar 2012
Location: Norfolk, UK
Posts: 1,157
|
Quote:
That's a saving of nearly 20ms per frame with nothing changing except the amount of time the CPU has to wait when writing to the Chip RAM bus. Ultimately, though, as long as the CPU has to spoonfeed C2Ped data into Chip RAM it's going to be wasting a lot of time waiting for that bus to be free. What's more, the more powerful the CPU the more potential work will be lost to that waiting. |
|
23 May 2014, 20:06 | #104 |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,839
|
|
23 May 2014, 21:31 | #105 | ||||
Registered User
Join Date: Jun 2010
Location: PL?
Posts: 2,875
|
Quote:
Quote:
Quote:
And this kind of thing would be far more useful, and also simpler, than rarely used transcendental instructions (don't get me wrong - I would be more than happy to have everything in such a cheap FPGA, but that is impossible). Quote:
C2P is not arithmetic; it is plain bit shuffling, a purely logical operation on the bits of data and addresses. Wasting the CPU on it is simply improper: the Amiga bus will limit the transfer rate anyway, so a CPU running from Fast RAM can do something else in the meantime. Last edited by pandy71; 24 May 2014 at 00:06. |
||||
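To make the "plain bit shuffling" point from #105 concrete, here is a minimal Python sketch of a chunky-to-planar conversion. It is only an illustration, not Amiga code and not anyone's proposed hardware design: the function name and the 2-bit example are made up, and it assumes the usual convention that bit 7 of each bitplane byte is the leftmost pixel.

```python
def chunky_to_planar(pixels, depth=8):
    """Convert chunky pixels (one integer per pixel) into `depth` bitplanes.

    Assumption: bit 7 of each output byte holds the leftmost pixel.
    Only shifts, masks and ORs are used; there is no arithmetic on pixel values.
    """
    if len(pixels) % 8:
        raise ValueError("pixel count must be a multiple of 8")
    planes = [bytearray(len(pixels) // 8) for _ in range(depth)]
    for i, value in enumerate(pixels):
        byte_index, bit = divmod(i, 8)
        for p in range(depth):
            if (value >> p) & 1:
                # Bit p of the pixel lands in plane p at the pixel's position.
                planes[p][byte_index] |= 0x80 >> bit
    return [bytes(plane) for plane in planes]


# Eight 2-bit pixels with colours 0,1,2,3,0,1,2,3
planes = chunky_to_planar([0, 1, 2, 3, 0, 1, 2, 3], depth=2)
print([plane.hex() for plane in planes])
```

Every output bit is a copy of one input bit at a fixed position, which is why the job maps well onto a small block of dedicated logic.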
23 May 2014, 23:02 | #106 | ||||||||
Registered User
Join Date: May 2014
Location: inside the emulator
Posts: 377
|
The previous generation Intel Itanium was capable of two memory loads and two memory stores per clock with some limitations.
The current generation reduced this to two memory operations per cycle in total. Why? It increased performance. Quote:
One of the main reasons superscalar FPGA soft cores are about as rare as hen's teeth is the problem of adding register ports. Read ports are reasonably simple to add, although they complicate signal routing and add latency in the read path. Write ports are very hard to do efficiently. So, in short, speeding up MOVEM stores is possible, but speeding up loads is much harder and probably not worth it. Given the complexity required to accelerate MOVEM stores, that's probably not worth it either. Quote:
Quote:
I sure don't, as MOVEP isn't critical except for, IIRC, two demos that use Atari ST style C2P (not optimal for the Amiga bitplane layout). Even for them it wouldn't be a problem, looking at the numbers:
8 MHz 68000, 16-cycle MOVEP -> 0.5 MOPS
100 MHz new 68k compatible, 100-cycle MOVEP emulation -> 1 MOPS
However, the new chip would most likely be clocked higher and have perhaps a 50-cycle MOVEP emulation. Quote:
One can have a ROM connected to the instruction fetch mechanism, a separate ROM program counter and some dedicated registers for the emulating code. That wouldn't be as slow as using a normal trap mechanism. Quote:
Quote:
Quote:
Quote:
There are other spaces that can be used for extensions. My hobby project of extending the 68k architecture to 64 bits (not likely to be ever realized) uses small holes in the encoding to add e.g. MOVEM support for an extended register set, branches with 64 bit offset, immediate loads of 64 bits etc. I don't know if the Apollo core supports it but Gunnar and others have proposed to make address registers capable of a subset of integer operations. That also reuses formats mostly already existing but not supported. Personally I think that's a bad idea BTW but other people think my ideas about 64 bit extensions and prefixes are really bad too. |
||||||||
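The MOVEP throughput figures in #106 are just clock rate divided by cycles per operation. A small Python sketch of that arithmetic follows, using the post's own numbers. The third case is an assumption: the post only says the new chip would "most likely be clocked higher" with perhaps a 50-cycle emulation, so 100 MHz is used as a conservative placeholder.

```python
def mops(clock_hz, cycles_per_op):
    """Millions of operations per second for a given clock and cycle count."""
    return clock_hz / cycles_per_op / 1_000_000


cases = [
    ("8 MHz 68000, 16-cycle MOVEP", 8_000_000, 16),
    ("100 MHz core, 100-cycle MOVEP emulation", 100_000_000, 100),
    # Assumed clock: the post gives no figure for the faster chip.
    ("100 MHz core, 50-cycle MOVEP emulation", 100_000_000, 50),
]

results = {label: mops(clock, cycles) for label, clock, cycles in cases}
for label, value in results.items():
    print(f"{label}: {value:.1f} MOPS")
```

Even the 100-cycle emulation doubles the MOVEP rate of an 8 MHz 68000 under these figures, which is the point the post is making.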
24 May 2014, 00:54 | #107 | |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,839
|
Quote:
|
|
24 May 2014, 02:06 | #108 |
Registered User
Join Date: Mar 2012
Location: Norfolk, UK
Posts: 1,157
|
Again, the point is that when an FPGA is available, it's not only possible but relatively straightforward to create a custom blitter-like component within the FPGA (alongside the soft CPU) that handles this task more efficiently than the CPU could. It would read chunky data from Fast RAM, so only writing would be limited by Chip RAM speed, and that would be far less of a problem when the CPU's not having to spoon-feed data to the slow Chip RAM bus.
|
24 May 2014, 10:54 | #109 |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,839
|
Surely a proper chunky mode would be ten times better than some c2p acceleration?
|
24 May 2014, 13:53 | #110 |
Registered User
Join Date: Mar 2012
Location: Norfolk, UK
Posts: 1,157
|
|
24 May 2014, 14:12 | #111 |
Registered User
Join Date: May 2014
Location: inside the emulator
Posts: 377
|
I haven't looked closely at it and have no plans to test it, but shouldn't an accelerator be able to plug into/over the Denise and Agnus chips? If so (and with some mainboard surgery), all chipmem accesses should be routable to the accelerator memory instead, greatly increasing the CPU bandwidth.
|
24 May 2014, 18:10 | #112 | |||||||
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,038
|
Quote:
Quote:
Code:
    moveq   #-1,D2          ; D2.w = $FFFF, so dbf runs the loop 65536 times
loop
    move.l  (A0)+,D0        ; fetch two source longwords
    move.l  (A0)+,D1
    movep.l D0,0(A1)        ; bytes of D0 go to offsets 0,2,4,6
    movep.l D1,1(A1)        ; bytes of D1 go to offsets 1,3,5,7
    addq.l  #8,A1           ; advance the destination by 8 bytes
    dbf     D2,loop
Call this at 50Hz or 60Hz and check the effect on your 68060, then run it on a 68020/68000 and compare. Of course you can try to rework this code without movep, but you will perhaps never reach good speed, especially if a movep.l write can execute in only 4 cycles on the Apollo core. Movep is simply a useful instruction in some cases. If someone doesn't see this, then he has an eye problem. Quote:
Quote:
will work in 5 (1+4) cycles, then I don't think movem pipelining is necessary, because movem will already be very fast. Maybe you can win 1 cycle in some cases, but it needs more support in the FPGA (more space used). Anyway, with movem you can make a super fast copy routine, at less than 1 cycle per copied longword. Quote:
CPU speed? Perhaps the easiest way is to remove movep support from the TG68 core (Vampire A600) and check whether it is really faster (without movep) than the full 68000 version. Quote:
instructions, if there is no microcode support? Is it that adding microcode "may slow down" the core, or that it "slows down" the core? Without real tests this is only theory to me, and not very logical either. Why? Because if I assemble an unused and very slow routine into my code, that only changes the size of my code; it never changes the speed of the code that is actually used. Almost everything in the world can be done in different ways, so to me the Apollo authors must rethink some of their ideas, because this is not the correct way in my view. Quote:
To add direct (non-microcoded) support for FPU commands, you must start with the commands that are easiest to implement. First comes fmove. Then commands like fneg, fnop, ftst. Then commands like fmovem, fadd, fsub. Then commands like fmul, fdiv ... At the end you can add commands like fatan etc. Of course I can try to help you with this work, once I understand how VHDL programming works. |
|||||||
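For readers who don't have MOVEP in their head, here is a small Python model of what the two `movep.l` stores in the loop from #112 do to memory. It is a sketch for illustration only: the function names are invented, memory is a plain `bytearray`, and timing is not modelled at all.

```python
def movep_l_store(memory, address, value):
    """Model of MOVEP.L Dn,d(An): write the four bytes of a 32-bit value
    to every other byte address, most significant byte first."""
    for i in range(4):
        memory[address + 2 * i] = (value >> (24 - 8 * i)) & 0xFF


def interleave_pair(d0, d1):
    """Mimic one loop iteration: MOVEP.L D0,0(A1) then MOVEP.L D1,1(A1)."""
    memory = bytearray(8)
    movep_l_store(memory, 0, d0)  # fills offsets 0, 2, 4, 6
    movep_l_store(memory, 1, d1)  # fills offsets 1, 3, 5, 7
    return bytes(memory)


result = interleave_pair(0x11223344, 0xAABBCCDD)
print(result.hex())
```

The two longwords end up byte-interleaved in eight consecutive bytes. Doing the same without MOVEP takes several shift, swap and move instructions per longword, which is the cost the post is pointing at.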
24 May 2014, 19:29 | #113 |
Registered User
Join Date: Jun 2010
Location: PL?
Posts: 2,875
|
As Megol said - we must be realistic - perhaps with a new project, perhaps with a different device, perhaps by recreating the chipset (like A-Clone, not Minimig).
For today, however, I see no point in creating a very fast CPU only to perform C2P when a small amount of hardware can do it in 1 cycle. It makes sense to create special hardware for C2P: the Amiga architecture is fixed and will not change, so C2P is one of the common tasks worth doing in hardware. |
24 May 2014, 19:43 | #114 | ||||
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 56
Posts: 2,038
|
Quote:
Quote:
1. Some programs on different platforms will recognize this CPU as a 68060, which is not true and can cause many problems.
2. It needs a special boot ROM for handling movep, and even with a boot ROM it can fail. Many Amiga games (coded correctly, but using movep) use all or almost all of Amiga chip memory. If the movep handler uses a trap vector after $C0 (the 68060 uses VBR+$F4, if I remember right), then it can simply be trashed by game code or data. Or not even directly by game code, but by crack patches, trainers or trackdisk loaders. The zero page is often used for such tasks. Quote:
Quote:
I know that extension word(s) are the best and most logical solution for clean CPU enhancements. But I heard that someone wants to remove some 68040/68882 commands and reuse their opcode space for different instructions. That makes no sense at all to me. Another shortcut. Correctly used extension word(s) can give many more possibilities than replacing already used 68040 opcodes: for example, it is possible to add a big series of 128 bit commands, support for 64 bit addressing, or a series of good instructions from other CPUs. But of course there is no space in the FPGA, and new instructions are not very important to me anyway. |
||||
26 May 2014, 10:23 | #115 | ||||||||||||||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
Quote:
Also he could have written that it's increasingly clear that you don't know anything about software. Wouldn't have helped the discussion, huh ? Quote:
Quote:
Quote:
And frankly if the implementation of fsin is so costly, then the person doing it must be fired Quote:
Quote:
Quote:
Quote:
In addition, IIRC the 68060 has support for CAS, where none is planned by Gunnar. A more interesting subject to discuss than MOVEP, maybe ? Quote:
Quote:
But taking a performance hit on _all_ instructions in order to support the full 68k isa, is perfectly acceptable. Anyway, if adding a new block really has a performance hit on all instructions, then adding a vector core is even worse than i initially thought. Quote:
Quote:
Quote:
Quote:
Trapping instructions is slower than supporting them properly; on that, everyone can agree. Emulation in 50 or 100 clocks is maybe possible. But remember that you have to trap (which by itself is very slow), save your regs somewhere (again very slow), decode the instruction (the more insns you trap, the worse this becomes), and then finally execute the bloody thing. After that, you restore your regs and return to the main stream. Clocks will be eaten pretty quickly. But this is not the only reason why no instruction should trap. Full support of the isa also has some marketing value that must not be overlooked - ask Apple why they always refused to use the 68060. And trapping an instruction isn't like having it. It needs some stack space for saving and retrieving the context (a bad thing when the stack is nearly full), an interrupt may come unexpectedly and your instruction is no longer atomic, you may possibly conflict with debuggers tracing instructions, your trap vector may be overwritten by some programs, etc. I know from experience that if something CAN go wrong, you can take it for granted that it WILL go wrong. It may be better to implement all instructions and make some of them optional (i.e. you can selectively enable and disable them). Then their exact impact will be known, and perhaps two versions of the cpu can be made, one for compatibility and the other for speed. |
||||||||||||||
26 May 2014, 13:38 | #116 | |
Registered User
Join Date: Jun 2010
Location: PL?
Posts: 2,875
|
Quote:
Once again - a rough estimation based on the Altera FPU megacore leads to this kind of figure. Expectations are also key - it seems that it must be a fast implementation (a few, perhaps ten, cycles). If you or anyone else can beat Altera and fit a fully pipelined 80 bit FPU in 10k LEs, then I will be more than happy. And to be honest, which uses fewer resources - C2P and perhaps HiColor to HAM conversion, or an FPU? |
|
29 May 2014, 09:52 | #117 | ||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
|
Quote:
If it did fit in an old 68882 chip, it can fit in a moderately priced fpga - and IIRC Gunnar had found a board with a 100k LE fpga for $99. Quote:
Anyway the small FPGA on the Vampire board has room for neither. |
||
29 May 2014, 16:05 | #118 | |
Registered User
Join Date: Jun 2010
Location: PL?
Posts: 2,875
|
Quote:
100k LEs - Altera, Xilinx, Lattice or other...? Unless you replace Denise we have only planar and HAM modes. As this discussion is about the CPU and not Denise in FPGA, I see a huge use for C2P and HiColor-to-HAM C2P in hardware... |
|
29 May 2014, 18:58 | #119 | ||
Registered User
Join Date: May 2014
Location: inside the emulator
Posts: 377
|
Quote:
That wasn't intended as an insult, and if he (or anyone else) took it as such I can only beg forgiveness and try to formulate it a bit differently. I'm assuming that we are talking about normal processors using the standard methods available, that is, the microarchitecture is a Von Neumann stored program design using the standard modified Harvard cache arrangement (separate instruction and data caches mapped into one common address space). I also assume it is pipelined to some degree. Then there are a number of so-called critical loops: one is the load-use delay, another the latency between dependent instructions, and a third the latency from instruction length computation to the instruction fetch logic. There are others. All of those are ideally 1 clock long, and while they can be lengthened, performance in most cases takes a heavy hit: a 2 cycle latency between dependent instructions can, even in a wide out-of-order execution engine, be 20%+ slower than a 1 cycle one. 68k instructions are pretty hard to decode; lengths can vary greatly and formats do too. But to execute 68k code one has to deal with it. Thankfully there aren't many really complex instructions, so most can be translated to a simpler 3 operand RISC type operation in one step. The big exceptions are MOVE with two memory operands and MOVEM. However, the logic to detect those cases in parallel with the rest of the decoder is fairly small, and the internal operations are of the same type (MOVE), which makes that manageable - the main decoder could generate the first MOVE operation and then the dedicated logic would stall the fetch/decode pipeline and inject the rest of the MOVE operations required. Examples:
MOVE src, dst -> MOVE src, temp ; MOVE temp, dst
MOVEM d0-d2, (a0)+ -> MOVE d0, (a0)+ ; MOVE d1, (a0)+ ; MOVE d2, (a0)+
(from memory, it's been a while since I've coded assembly so this could be wrong!)
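The MOVEM expansion described above can be sketched in a few lines of Python. This is a toy model of the idea, not a description of any real core's decoder: the function names are invented, and the load direction is used because on a real 68k the `(An)+` mode is only valid for memory-to-register MOVEM (register-to-memory uses `-(An)` or a control mode). The mask layout assumed here is the one for control and postincrement modes, bit 0 = d0 up to bit 15 = a7.

```python
def expand_register_mask(mask):
    """Return register names for a 16-bit MOVEM mask (bit 0 = d0 ... bit 15 = a7)."""
    names = [f"d{i}" for i in range(8)] + [f"a{i}" for i in range(8)]
    return [names[bit] for bit in range(16) if mask & (1 << bit)]


def crack_movem_load(mask, base="a0"):
    """Expand MOVEM.L (base)+,<list> into one simple MOVE.L per listed register.

    In hardware, the first operation would come from the main decoder and the
    rest would be injected while the fetch/decode pipeline is stalled.
    """
    return [f"MOVE.L ({base})+,{reg}" for reg in expand_register_mask(mask)]


micro_ops = crack_movem_load(0b0000_0000_0000_0111)  # d0-d2
for op in micro_ops:
    print(op)
```

The sketch ignores corner cases such as the base register appearing in the register list; it only shows that the expansion is a simple walk over the mask bits.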
But if the design needs more complex operations this isn't enough. Emulating FSIN will require operations including internal looping and, more importantly, operations other than FSIN - sadly it can't be emulated fractally. That means the logic needed to detect complex instructions has to be bigger and probably stored in a ROM. This can make the detection logic a limit on the clock frequency in several ways: increased fan-out of the fetch data, routing delays to and from the ROM, and possibly problems fitting the <logic>->ROM-><logic> path in the detector into one clock cycle. Adding another cycle for complex operation detection will require a micro-flush of the latest fetched instruction, which will complicate even more parts of the fetch/decode pipeline. But that is only for _detecting_ that the instruction should be considered complex; the fetching then has to be redirected to a microcode engine. That adds more complexity. In comparison, an enhanced trap mechanism allows some simplifications, as some cases can be handled in software. There still has to be detection logic, but it can be reduced in size as parts of it must already exist to detect illegal instructions. Quote:
There are many ways to improve trap-and-emulate timings without making large hardware changes. Save/restore of registers can be done with banking in the register file or with renaming techniques, depending on the processor design. It's hard to eliminate the transfer to the handler routine, but it can be accelerated by treating every unimplemented instruction as a bsr to the trap handler, eliminating the mispredict latency. The trap handling code could be stored in an on-chip ROM to ensure that it is "cache resident"; however, if the emulation is needed often it will likely be resident anyway. The check for which instruction is to be emulated can be removed at least partially by having a mechanism that translates part of the instruction word into an offset into the trap handler. Some checking code would still be needed though. |
||
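The last idea in #119, turning part of the instruction word into an offset into the trap handler, might look roughly like the Python sketch below. Everything about the layout is hypothetical: the slot size, the choice of the top four opcode bits as the selector, and the function names are all assumptions for illustration. Only the MOVEP bit pattern (0000 ddd 1mm 001 aaa) is taken from the 68k instruction encoding.

```python
SLOT_SIZE = 64  # hypothetical: bytes reserved per handler slot in the trap ROM


def handler_offset(opcode):
    """Map the top four bits of a 16-bit opcode word to a handler ROM offset."""
    return ((opcode >> 12) & 0xF) * SLOT_SIZE


def is_movep(opcode):
    """Residual check still needed inside the slot: MOVEP is 0000 ddd 1mm 001 aaa."""
    return (opcode & 0xF138) == 0x0108


movep_opcode = 0x01C9  # movep.l d0,0(a1)
fline_opcode = 0xF200  # an F-line (FPU) opcode word

print(handler_offset(movep_opcode), is_movep(movep_opcode))
print(handler_offset(fline_opcode), is_movep(fline_opcode))
```

The hardware computes the offset for free, but the handler in each slot still has to tell MOVEP apart from the other opcodes sharing its top bits, which matches the post's remark that some checking code would remain.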
29 May 2014, 22:49 | #120 |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,839
|
Why not simply implement what exists and be done with it? Shouldn't it be enough to get existing 68060 speed? What's the deal with wanting more and more speed? Isn't part of the charm that right now we don't have all the speed in the universe?
Also, if you're already moving away from 680x0 (FPGA is NOT 680x0), then why not stick a cheap AMD on a board and run a 680x0 emu? Perhaps I'm naive, but I just had to ask these questions |