20 May 2014, 05:40 | #81 | |||
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
Quote:
Quote:
Quote:
You generally don't need even 64-bit precision in a floating point math library, except that much software expects double precision support. The 6888x 80-bit FPU is not accurate past 64 bits for many trigonometry instructions, but the extra precision does help in calculating results accurate to 64 bits (software emulation benefits from the same trick). It would be interesting to compare the accuracy of double precision floating point math functions on other platforms to the 68k FPU's double precision functions. Last edited by matthey; 20 May 2014 at 05:46. |
20 May 2014, 11:07 | #82 | |||
Registered User
Join Date: Jun 2010
Location: PL?
Posts: 2,810
|
Quote:
Check FPGA prices on, for example, Mouser or Farnell - a medium-size FPGA costs at least around 300E. Remember you are trying to fit a CPU+FPU of Pentium (68060) class, only with some improvements, which usually double the size. Those CPU+FPU designs were the state of the semiconductor art in the first half of the 90's - now you are trying to do exactly the same, but with the help of an FPGA. I mean, 20 years is a huge time from the semiconductor industry's point of view, but even FPGAs are limited (the 68060 uses single transistors to do a job where an FPGA uses a cell with multiple transistors, so you need at least 10-20x more transistors on chip to do the same job). Quote:
Quote:
I'm reading a lot of details from you about why the 040/060 FPU is bad (even though most modern CPUs have no transcendental instructions, perhaps except a logarithm unit in some FPUs). If you don't need the full 68882, then perhaps you can accept other limitations that make the design cheaper and more available? And I repeat my question - is there any reason to use FSIN in a web browser? |
20 May 2014, 19:11 | #83 | |
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
Quote:
Not in the core web browser functionality for HTML or pictures. However, audio/video players, a PDF viewer and/or a 3D object plugin could make use of FSIN.
|
20 May 2014, 19:50 | #84 | ||||
Registered User
Join Date: May 2014
Location: inside the emulator
Posts: 377
|
Quote:
Quote:
But I'd argue that the x86 ISA isn't that bad since the 80386, which was released in 1985. In fact I'd like to know of any ISA that measures better for general purpose code with at least a 10% improvement (less than that is hard to measure anyway). Quote:
However, if the processor core supports some kind of microcode anyway, having hardware support for 64-bit floats and partial support for extended precision via microcode could be a solution. Another way would be fast handling of traps for unsupported instructions, combined with some non-architectural support instructions. This will be slower than using microcode, as the trap path most probably has to do at least a partial pipeline flush before reaching the emulation code. Quote:
|
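The trap path described above can be sketched in C: an unimplemented F-line opcode traps into a supervisor handler, which decodes the extension word and dispatches to a software routine. All names, the table layout, and the opmode value used in the test are invented for illustration:

```c
#include <stdint.h>
#include <stddef.h>

typedef void (*emul_fn)(uint16_t opcode, void *frame);

static int fsin_emulated; /* flag just to show the routine ran */

static void emulate_fsin(uint16_t opcode, void *frame)
{
    (void)opcode; (void)frame;
    fsin_emulated = 1; /* real code would compute the sine in software */
}

/* Dispatch table indexed by (a truncation of) the FPU opmode field. */
static emul_fn dispatch[64];

/* Entered after the trap - by this point the pipeline has already
   been flushed, which is the cost Megol mentions. */
static void fline_trap_handler(uint16_t opcode, uint16_t ext, void *frame)
{
    emul_fn fn = dispatch[ext & 0x3f];
    if (fn != NULL)
        fn(opcode, frame);
    /* else: raise an illegal-instruction exception */
}
```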
21 May 2014, 00:00 | #85 | |||||
Registered User
Join Date: Jun 2010
Location: PL?
Posts: 2,810
|
Quote:
Besides this, we are talking not about only a basic form but a full transcendental FPU - fully pipelined, scalar and reasonably fast - faster than the 060. This can't be done in 20-40k LEs (knowing the fact that the 060 has multiple millions of transistors). Quote:
Quote:
So where is the line of compromise - is it SP, DP or EP, a hybrid solution (microcode+HW), plain microcode, or perhaps software emulation? Quote:
Quote:
So I still don't see an area where FSIN is needed, so please give me an example where it must be used (except demos, but what is the point of holding hardware sine records?). Besides this, if FSIN is so important, why is there no FSIN in the PowerPC instruction list, and no implementation of FSIN (except x87) in x86? If you analyze modern ISAs, there is no float sine where fast performance is required - where it exists, it is usually implemented as a relatively slow instruction (which probably could be implemented with comparable speed in software). Perhaps modern software doesn't require FSIN? |
21 May 2014, 03:35 | #86 | |||
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
Quote:
Quote:
Quote:
1) the time spent in the function is small
2) there isn't much advantage to doing the calculation in hardware
3) devoting resources somewhere else makes the whole CPU faster, including the function that does the trig calculation

Only assembler programmers (and users running old fp programs) miss the convenience of the 6888x instructions. C programmers use the standard C functions in floating point math libraries, which handle everything; they usually don't know how it works underneath. |
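To make points 1 and 2 concrete, here is roughly what a software trig function does underneath: after argument reduction, it evaluates a short polynomial. This 5-term Taylor sketch is far cruder than a real libm kernel (which uses minimax coefficients and careful reduction), but it shows how little work the core computation is:

```c
#include <math.h> /* only for the reference sin() used when checking */

/* Taylor series sine, valid after reduction to roughly |x| <= pi/4:
   sin(x) = x - x^3/3! + x^5/5! - x^7/7! + x^9/9!, in Horner form. */
static double sin_poly(double x)
{
    double x2 = x * x;
    return x * (1.0 + x2 * (-1.0/6 + x2 * (1.0/120
               + x2 * (-1.0/5040 + x2 / 362880.0))));
}
```

At |x| <= pi/4 the truncation error is on the order of x^11/11!, far below double precision rounding noise for small arguments.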
21 May 2014, 10:48 | #87 | ||||||
Registered User
Join Date: May 2014
Location: inside the emulator
Posts: 377
|
Quote:
Doing something faster than the 68060 FPU shouldn't be a huge problem, as it isn't pipelined and runs at relatively low speeds. Let's assume a fully extended precision FPU with a 20-stage deep pipeline and a 300MHz frequency for additions and multiplications. This would have a peak performance of 300M/20 = 15M dependent operations, or 300M independent operations, per second. IIRC the 68060 FPU has a 5 cycle latency and can be overclocked to ~100MHz, which would then give a peak of 20M dependent or independent operations/s. Quote:
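The back-of-envelope numbers above are just latency-versus-throughput arithmetic, which can be captured in two lines of C (a sketch of the reasoning, not of any real FPU):

```c
/* Ops/second for a chain of dependent operations: each must wait
   out the full latency of the previous one. */
static double ops_dependent(double clock_hz, int latency_cycles)
{
    return clock_hz / latency_cycles;
}

/* Ops/second for independent operations: a pipelined unit starts one
   per clock; a non-pipelined unit still waits out the latency. */
static double ops_independent(double clock_hz, int latency_cycles, int pipelined)
{
    return pipelined ? clock_hz : clock_hz / latency_cycles;
}
```

With these, the 20-stage 300MHz pipeline gives 15M dependent but 300M independent ops/s, while the non-pipelined 5-cycle 100MHz 68060 FPU gives 20M either way.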
In fact the inherent complexities of the x86 design have been used as an advantage: since many operations require microcoding, the support for that is excellent - so many new features can be supported through the already existing mechanism. And the power consumption of x86 is hard to compare to processors designed with other goals. People often like to compare ARM processors that deliver <75% of the performance of a low power x86 design, without realizing that lowering the clock rate somewhat will lower power use dramatically through several mechanisms. Quote:
I don't agree with that. Nobody uses the FPU for doing integer operations or BCD operations expecting peak performance. Such things shouldn't be allowed to slow down the general design. Having some hardware support for accelerating emulation of those operations could be wise though. Quote:
Quote:
Heck, some pages run 100% as JavaScript without needing it. JavaScript uses floats (doubles) as its standard numeric datatype, BTW. Quote:
|
21 May 2014, 13:32 | #88 | |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 2,002
|
Quote:
handling movep commands - only rol or ror is correct. F.e. Code:
movep.l D0,5(A0)   ; can be handled as:
rol.l   #8,D0
move.b  D0,5(A0)
rol.l   #8,D0
move.b  D0,7(A0)
rol.l   #8,D0
move.b  D0,9(A0)
rol.l   #8,D0
move.b  D0,11(A0)
But if 68882 instructions are really handled as microcode, then almost every routine/code can be optimised for speed or size. Also, if I remember my school math right, only one table is necessary for SIN and COS, so perhaps you can save 2kB. I think that many more optimisations (for size or speed) can be done for the other 68882 instructions too: FSAVE FRESTORE FADD FSADD FDADD FCMP FDIV FSDIV FDDIV FMOD FMUL FSMUL FDMUL FREM FSCALE FSUB FSSUB FDSUB FSGLDIV FSGLMUL FABS FSABS FDABS FACOS FASIN FATAN FATANH FCOS FCOSH FETOX FETOXM1 FGETEXP FGETMAN FINT FINTRZ FLOG10 FLOG2 FLOGN FLOGNP1 FNEG FSNEG FDNEG FSIN FSINH FSQRT FSSQRT FDSQRT FTAN FTANH FTENTOX FTWOTOX FMOVECR FNOP FSINCOS FTST FBF FBEQ FBOGT FBOGE FBOLT FBOLE FBOGL FBOR FBUN FBUEQ FBUGT FBUGE FBULT FBULE FBNE FBT FBSF FBSEQ FBGT FBGE FBLT FBLE FBGL FBGLE FBNGLE FBNGL FBNLE FBNLT FBNGE FBNGT FBSNE FBST FDBF FDBEQ FDBOGT FDBOGE FDBOLT FDBOLE FDBOGL FDBOR FDBUN FDBUEQ FDBUGT FDBUGE FDBULT FDBULE FDBNE FDBT FDBSF FDBSEQ FDBGT FDBGE FDBLT FDBLE FDBGL FDBGLE FDBNGLE FDBNGL FDBNLE FDBNLT FDBNGE FDBNGT FDBSNE FDBST FSF FSEQ FSOGT FSOGE FSOLT FSOLE FSOGL FSOR FSUN FSUEQ FSUGT FSUGE FSULT FSULE FSNE FST FSSF FSSEQ FSGT FSGE FSLT FSLE FSGL FSGLE FSNGLE FSNGL FSNLE FSNLT FSNGE FSNGT FSSNE FSST FTRAPF FTRAPEQ FTRAPOGT FTRAPOGE FTRAPOLT FTRAPOLE FTRAPOGL FTRAPOR FTRAPUN FTRAPUEQ FTRAPUGT FTRAPUGE FTRAPULT FTRAPULE FTRAPNE FTRAPT FTRAPSF FTRAPSEQ FTRAPGT FTRAPGE FTRAPLT FTRAPLE FTRAPGL FTRAPGLE FTRAPNGLE FTRAPNGL FTRAPNLE FTRAPNLT FTRAPNGE FTRAPNGT FTRAPSNE FTRAPST FMOVE FSMOVE FDMOVE FMOVEM The 68882 instruction list is taken from PhxAss; I hope it is complete, since many Amiga assemblers have missed some FPU instructions. That can be one of the reasons why some FPU instructions are very rarely, or maybe even never, used on the Amiga. About your SIMD example.
Do you know the timing (cycles) for both versions of your code? I'm not an FPU expert, but I think this version of your code: Code:
fmovem.x (A0)+,FP0-FP3
fadd.x   (A1)+,FP0
fadd.x   (A1)+,FP1
fadd.x   (A1)+,FP2
fadd.x   (A1)+,FP3
fmovem.x FP0-FP3,(A2)
is better. SIMD instructions make no big sense (except possible creator fame), especially if good 68882 instructions must be removed for them. A much better idea is adding more instructions in move16 style, like: add16, sub16, or16, eor16, and16. |
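For reference, the byte-lane scatter that movep.l performs (and that the rol/move.b sequence earlier in this post reproduces) can be modelled in C. The helper name is made up; movep writes the four register bytes, most significant first, to every second byte of memory:

```c
#include <stdint.h>

/* Model of movep.l Dn,d(An): write the 32-bit register to memory as
   four bytes at offsets d, d+2, d+4, d+6, most significant byte first. */
static void movep_write_long(uint32_t reg, uint8_t *base, int disp)
{
    for (int i = 0; i < 4; i++)
        base[disp + 2 * i] = (uint8_t)(reg >> (24 - 8 * i));
}
```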
|
21 May 2014, 15:51 | #89 | ||||||
Registered User
Join Date: May 2014
Location: inside the emulator
Posts: 377
|
Quote:
Quote:
Quote:
Quote:
Quote:
Your example isn't comparable to the vector one in throughput, BTW. Even if the processor is super-scalar to a wide degree and supports out of order execution, it is much slower. Assume that the two FMOVEM instructions can be run at the same time and that the data dependencies between the FMOVEM and FADD can be hidden. Then each FADD requires a load, the addition itself and then a store. Peak performance for this, using one memory load and one store per cycle, is one FADD/clock (this ignores fill/spill times). However, each FADD also has another load, which lowers the performance to two clocks/FADD. But we needn't stop there: what if the processor could execute the FMOVEMs in one clock cycle using a wide cache data path? Then the initial load takes one cycle, each of the additions can be initiated one cycle apart, and the final FMOVEM takes one cycle. This would, with a relatively realistic (though not for an FPGA) FADD latency of 4 cycles, take 9 cycles for the four FADDs. But what if there suddenly were four FADD-capable FP pipes? It wouldn't help any, due to the load operation. But perhaps the processor should be capable of 4 loads/cycle? Still no go, as now we have a dependency on the A1 register. However, as we already are a long way into fantasy land, why shouldn't the processor be able to detect and resolve this register dependency using parallel addition logic? Then we'll reach a peak throughput of 4 FADDs per cycle, or (for this example, using the 4 cycle FADD latency above) a total latency of 6 cycles. Or one could instead just use a short-vector/SIMD design and avoid the complications with the same throughput. The latter is realistic to include in FPGA designs too. Quote:
|
21 May 2014, 23:20 | #90 | |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 2,002
|
Quote:
If I remember right, rotation commands (very useful) are done in 1 cycle on the 68060. Then my proposal for movep.l handling with two OEPs can work with 4c for a write and 2c for a read, but if you know a better/faster way, you are welcome to show it; maybe I'm able to learn something, who knows? I don't know how fast/slow the rotation commands are on the Apollo core, but if they are slower than 1 cycle, then this is the correct job for the CPU creators:
1st step: how to implement the CPU/FPU command.
2nd step: how to reach maximum speed for this command.
3rd step: how to reduce FPGA space usage for this command.
Not trying to remove already available 68040/68882 instructions from the core, because that is the lamers' way. It's not important that only a few (?) programs used movep commands; it's important that external emulation is very slow and makes no sense. If, for you, trap emulation is the correct way to create a good CPU, then better read these extracts from the CyberPatcher doc:
******************************************************************************
The problem with several Amiga applications on the 68040 and 68060 is that these are compiled for the 6888x. This coprocessor has a lot more fpu instructions than the 68040 and 68060 so these instructions have to be emulated on the more advanced 680xx processors. Unknown instructions cause a trap and during the trap the emulation has to find the right emulation routine and run this function. In a trap the processor is in the Supervisor mode and no other tasks can run. This effect is visible by a not smooth running mouse...the system gets almost unusable the more unimplemented instructions are used by a program. CyberPatcher trys to patch the most used instructions that have to be emulated.
....
At the moment CyberPatcher supports the following programs:
-Mand2000d (large speed up)
-SceneryAnimator (large speed up)
-Imagine 2.x (large speed up)
-Vista Pro
-Lightwave
-Real 3d 2.x
-Maxon Cinema 2
-ImageFX
****************************************************************************
If you don't understand, read again about supervisor mode, no other tasks can run, etc., and again and again. But if you understand that external emulation is the amateurs' way, that's OK.
About memory blocks: if the smallest FPGA has 200kB of memory, then 2kB (1%) to support the SIN/COS commands is nothing and cannot be called "wasting of FPGA space" (Hi, Matt).
About the 68882: which 68882 instructions are, for you, the same? I wrote that this is an extraction from PhxAss, but I can check it exactly.
About SIMD: everything can be compared, especially command/routine speed. I always show examples of whether something is slow or fast. For me the fmovem version is better than the SIMD version, because there are no cycle results for the SIMD version. And nothing must be CASTRATED from the CPU. For me, SIMD commands can only waste FPGA space. BTW, there don't exist even "a few" programs which use SIMD commands. But of course you are on the right way: fmovem and movem commands can/must be the fastest. I think that fmovem and movem.l commands for up to 4 registers can be done in 2 cycles, for up to 8 registers in 3 cycles, etc.
About the xxx16 commands: many possibilities exist (especially with a very fast movem.l or move16) to use these commands; it is only a question of coder imagination. |
22 May 2014, 06:50 | #91 | |||||||
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
Quote:
Quote:
Quote:
FSIN, FCOS and FSINCOS use the same table
FTAN uses a table (it could use FSIN/FCOS but division is slow)
FASIN and FACOS use FATAN, which has a table
FSINH, FCOSH, FTANH and FETOXM1 use FETOX, which has a table
FTENTOX, FTWOTOX, FLOGN, FLOG2 and FLOG10 use a table
FMOVECR uses a table

Just the tables are probably 10kB. There is other static data, like fp numbers, that needs to be stored, and the code has to be stored somewhere. Now you could be using 20kB out of 200kB, which is 10% of the SRAM, and that doesn't count the logic used. My estimate could be significantly off, but I hope you can see that although it's possible to add all the FPU instructions now, it would take valuable resources away from being used elsewhere where they provide a better speedup. Quote:
Quote:
Quote:
Quote:
I would use MOVE16 some if the data didn't have to be 16 byte aligned. Otherwise, about the only use is for system memory copying functions. I don't see much use for the other memory to memory 16 byte instructions either. I do wish there was a full CMP EA,EA in byte, word, and long sizes and I do think it would get some use but there isn't encoding space for it. The Apollo core may be able to do CMP mem,mem and MOVE mem,mem in 1 cycle. |
22 May 2014, 09:46 | #92 | |||||||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,335
|
Nothing for me, really.
Quote:
Quote:
Quote:
Quote:
Or maybe the dcache can output 128 bits (or more) per clock ? Quote:
Quote:
1. Use a GPU (or a GPGPU) to offload your main cpu.
2. Rewrite your code with only integer ops.
3. Get a higher clocked FPGA.
Quote:
Perhaps our friend Pandy doesn't like the FSIN instruction because it's a sin ? (sorry for the pun, i couldn't resist) If so, just consider FATAN for another example, and how you can use it when you're doing angle computations from coordinates. In the same manner, FSQRT can be useful for distance computations (even though i'd prefer an integer sqrt instruction). Furthermore, when you have one transcendental operation, the others can probably reuse your logic (in a SW implementation it's the case). So the block may be huge, but it'll be there only once, and a wealth of other ops are just a few parameters away from what you have now. That said, i'm not giving an enormous value to the FPU. All my code is either pure integer, or uses libs. I'm just saying what a good fpu looks like for me. What's sure is that i'd personally have more use for a new MOVEM.B than for FSIN. |
22 May 2014, 10:04 | #93 | ||||||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,335
|
Quote:
Quote:
What i find particularly funny is that the ones who advocate going the SMP way are often the same ones who advocate the removal of instructions such as these two. Funny, to say the least. It's probably even smaller. I remember my old 8-bit 6502 machine. It was able to compute all of these with its Basic in ROM, and the ROM was only 16kb in total. Quote:
Besides, i don't think there is much space to waste for a floating-point vector core. An integer-only vector core is big enough - and frankly i'd rather have new scalar integer instructions. Quote:
Quote:
Quote:
How about just having CMP d16(An),d16(An) ? Or maybe CMP (SP)+,EA ? These can be encoded, i think. |
22 May 2014, 10:58 | #94 | |||||||
Registered User
Join Date: May 2014
Location: inside the emulator
Posts: 377
|
Quote:
Those limitations are _hard_ to change, and adding more ports of either type _will_ decrease the overall clock frequency! Yes, that includes ASIC implementations. Intel just recently added support for 2 data reads and one data write per cycle from cache. Intel also provides fewer register accesses than required for peak throughput - why? Because that lowers power and increases clock rates. Lowering power also increases performance, BTW. Why do I talk about Intel's chips? Because they _are_ the ones with the best performance. Quote:
Quote:
Why? Because the few cases that use MOVEP will not be performance critical. Quote:
Quote:
Given your certainty about the performance of an FPGA processor implementing the full 68k ISA, you'd have to have it up and running. Quote:
Supporting SIN would lower the clock rate of the whole design, as generic microcode support would have to be included. Or one could have a dedicated hardware unit, which would then complicate the data flows for result writeback. Quote:
The EOR16 could be used for software RAID support, but that's about it. If instead the processor supported short vector/SIMD integer operations with 128-bit registers, not only could every *16 operation be emulated, but a lot of useful work could be done too. |
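A sketch of the software-RAID use of an EOR16-style operation: XOR parity over 16-byte blocks, here in plain C one byte at a time (a 128-bit SIMD unit, or the proposed EOR16, would do all 16 bytes in one operation):

```c
#include <stdint.h>
#include <stddef.h>

/* Accumulate RAID-5 style parity: dst ^= src over a 16-byte block.
   XORing the same block in twice restores the original data. */
static void xor_block16(uint8_t *dst, const uint8_t *src)
{
    for (size_t i = 0; i < 16; i++)
        dst[i] ^= src[i];
}
```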
22 May 2014, 22:33 | #95 | |
Registered User
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 2,002
|
Quote:
Then it's time to beat Intel's chips. I'm not a hardware expert, but I would use an internal 128-bit CPU register for reads/writes. Four 32-bit registers can be split/joined into one 128-bit register, and the data (128 bits) can be written/read at once. Then with two reads you can read 8 longwords per cycle, or write 4 longwords per cycle. If I remember right, the move16 command is faster than movem for 4 registers, so there must exist some way to make it faster. If splitting/joining registers is too hard to do, then I think a "movemfast" command (or special movem.l handling) can be used/added. It could work like movem, but only for successive registers, f.e.
movemfast D0-D3,(A0) is OK
movemfast D0-D2/D4,(A0) is not possible
movemfast D1-A4,(A5) is OK
or with special handling of movem.l only:
movem.l D0-D3,-(SP) full speed
movem.l D0-D2/D4,-(SP) slowest speed
movem.l (SP)+,D0-A3 full speed
For the movem.w command, data can be joined/split similarly. Of course I don't know hardware. But I have heard too many times that something is impossible to make, when I was sure (as an amateur) that it is possible. If you need hardware examples: f.e. putting an MC68060 CPU on an A3640, or putting more than 16 MB on an A4k main board. If you need software examples: f.e. a one-disk version of Turrican 2 (fitting ~1070kB of heavily packed data on a 900kB disk), or writing RNC copylocks without the hardware. 100 cycles for movep emulation is a very good example of wasting CPU power. It seems you like to waste CPU power; I don't. 2 or 4 cycles vs 100 cycles, and your choice is 100. About the trap and finding the emulation routine, of course this is true. But of course you could create thousands of traps, and every trap could handle only one concrete CPU/FPU instruction, f.e.
trap #10456 can handle movep.l D0,0(A0)
trap #10457 can handle movep.l D0,1(A0)
trap #29456 can handle movep.l D2,51(A2)
etc. This is simply a waste of memory/speed, but you can call it "right design".
Anyway, it seems you must be a "Trap Master", I think, if for you the traps way is the correct way to make a good CPU. I didn't write that the Apollo creators are lamers/amateurs. I only wrote that they go the lamers'/amateurs' way. If for you this is the same, then sorry. From my amateur point of view:
1. movep instruction -> trap.
2. rarely used instruction -> trap.
3. FPU instruction -> trap.
4. hard to implement instruction -> trap.
etc. Sorry, but I can't call this the expert way.
About SIN support: sorry, but I think you must be wrong that microcode would slow down the CPU clock rate, if the other FPU commands (microcoded too) don't slow the CPU clock rate. This is illogical to me; or even if it is true, then try another SIN implementation. Many things can be done in different ways, not only via the "one and only trap" way.
About the xxx16 commands: I don't know which examples you need, but f.e.
movem.l (A5)+,D7-A2
eor16 D7-A2,(A4)+
I'm not against adding new instructions to the core, but it must be done in a clean way, not a dirty way (using the opcode space of already available 68040/68882 instructions), because that will only make problems. This is simple for me: choose one unused and easy-to-handle 2-byte opcode and use it as a prefix/ID for the series of 128-bit instructions, and then use the next 2 bytes as the real opcode for fast instruction decoding. The instructions will be 2 bytes longer (due to the prefix/ID at the beginning), but 100% compatible with the already available 68040/68882 instructions. |
23 May 2014, 08:04 | #96 | ||||
Banned
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
|
Quote:
Quote:
Quote:
Quote:
Accuracy and Monotonicity: "The returned result is within 2 ulps in 64 significant bits, i.e. within 0.5001 ulp to 53 bits if the result is subsequently rounded to double precision. The result is provably monotonic in double precision." I think this is in some technical math language other than English. Perhaps you would like to do the VHDL coding and the proof of monotonic results and accuracy to 2 ulps in 64 significant bits? It's not like it's impossible. |
23 May 2014, 12:08 | #97 |
Registered User
Join Date: Jun 2010
Location: PL?
Posts: 2,810
|
At some point I decided to give up; however, as I began to better see the point of view of many of you, I need to ask once again: who will this board be designed for, who is the marketing target, and what is the main purpose of this board (or boards)?
Also, just from pure curiosity: I don't understand why so many people insist on using the CPU to perform DMA-like tasks, and I don't understand why so many people insist on performing, for example, C2P with the CPU (where some small logic seems to be more efficient). And no, I'm not afraid of Sine - I just don't find it worth paying 200E more.
23 May 2014, 12:31 | #98 |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,796
|
|
23 May 2014, 13:47 | #99 |
Registered User
Join Date: Mar 2012
Location: Norfolk, UK
Posts: 1,154
|
In an FPGA it'd be perfectly possible to design some logic that reads a buffer from SDRAM, does C2P conversion and writes the result to Chip RAM without CPU intervention. If the CPU's running largely from SDRAM / Cache then this background task would have very little impact on the CPU speed.
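The chunky-to-planar transform such logic would implement is simple to state in C. This naive reference version transposes 8 chunky pixels (one byte each, up to 8 bitplanes) into 8 planar bytes; real CPU implementations use merged bit-twiddling tricks, and the FPGA would just stream this transform between SDRAM and Chip RAM:

```c
#include <stdint.h>

/* Chunky-to-planar for one 8-pixel group: bit `plane` of pixel `pix`
   becomes bit (7 - pix) of planar byte `plane` (leftmost pixel in the
   most significant bit, as Amiga bitplanes expect). */
static void c2p_8pix(const uint8_t chunky[8], uint8_t planar[8])
{
    for (int plane = 0; plane < 8; plane++) {
        uint8_t b = 0;
        for (int pix = 0; pix < 8; pix++)
            b |= (uint8_t)(((chunky[pix] >> plane) & 1) << (7 - pix));
        planar[plane] = b;
    }
}
```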
|
23 May 2014, 14:03 | #100 | |
Registered User
Join Date: Jun 2010
Location: PL?
Posts: 2,810
|
Quote:
I don't understand why everything must be done by the CPU... copying a buffer 25 times per second from one memory (FAST) to another (CHIP) is one of the simplest things to do in an FPGA... (way simpler than a sophisticated FPU). 2/1 OCS/ECS/AGA cycles per 32 bits - is this slow? What could be faster - doing it with the CPU? Akiko has special hardware to do this... reading/writing memory as fast as memory allows is not slow. Last edited by pandy71; 23 May 2014 at 14:27. |
|