English Amiga Board


English Amiga Board > Coders > Coders. Asm / Hardware

 
 
Old 20 May 2014, 05:40   #81
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Quote:
Originally Posted by meynaf View Post
And the problem is that the SIMD instructions can't work from cache, where FPU instructions can.

And therefore the benefit of your above code is zero. It's probably even slower than normal FPU code because it spends its time waiting on the memory bus.
The instructions can't work from the instruction cache? Why not? A SIMD unit can use the data cache, but that doesn't always make sense because it often processes data streams that would just flush the cache. It should be able to read from the existing data cache but maybe not write to it (or have a setting). It would be interesting to see how the Apollo stream detection would handle streams for the SIMD. I'm not saying a 68k SIMD would be great, as they have big limitations, but I see potential where you seem confident it can't be useful. I do see that the 68k FPU gets bogged down under heavy FPU load, with no integer instructions to execute in parallel during vector and matrix calculations. What do you think is a better solution to these bottlenecks?

Quote:
Originally Posted by pandy71 View Post
For FPU (compatible with 68882) at least 50000LE's is required, for high performance FPU add to this at least 10000 - 20000, add to this CPU, add few other things and you easily going for 200000 LE's (or higher as this is low budget project then no one will spend half year to route manually FPGA and LE's utilization will be less than 75 - 80% or even worse if you go for high speed design where routing is even more restricted) this mean that your FPGA cost approx. 250 - 350E at least add to it his other IC's, PCB, small profit - it should end somewhere around 500 - 600E - good luck - you will be owner for one for 50 boards.
Again, I believe your LE estimates for an FPU are off. I believe a 68060-like FPU with moderate pipelining (5-10 stages) is possible in less than 25k LEs. The Phoenix core uses less than 7k LEs, but a simple 68060 FPU needs 50k? Also, the decoding, the first part of the instruction pipeline (EA calculation) and the caches are shared with the integer units. A fully 6888x-compatible FPU could double the cost in LEs, which would raise the cost of the required FPGA, but 250 Euros? A completely awesome and huge FPGA costs less than half that.

Quote:
Originally Posted by pandy71 View Post
Then good luck with x86 as we all know this is one of less efficient ISA's in IT industry.
I wouldn't say the later x86/x86_64 ISA is inefficient. It's really not too bad after they went to 16 mostly general-purpose registers, added modern instructions and started passing function arguments in registers. There is some old baggage, a lack of orthogonality, and the encodings are less efficient, but performance is an important part of an ISA's efficiency too.

Quote:
Originally Posted by pandy71 View Post
All this together sounds weird - why the heck you need to use FSIN at 64 or 80 bit in web browser?
(but watching how modern programs behave perhaps it is not so strange)
You generally don't need even 64-bit precision in a floating point math library, except that much software expects double precision support. The 6888x 80-bit FPU is not accurate past 64 bits for many trigonometry instructions, but the extra precision does help deliver full 64-bit accuracy (the same trick works in software emulation). It would be interesting to compare the accuracy of double precision floating point math functions on other platforms against the 68k FPU's double precision functions.

Last edited by matthey; 20 May 2014 at 05:46.
matthey is offline  
Old 20 May 2014, 11:07   #82
pandy71
Registered User
 
Join Date: Jun 2010
Location: PL?
Posts: 2,810
Quote:
Originally Posted by matthey View Post
Again, I believe your LE estimates for an FPU are off. I believe a 68060 like FPU with moderate pipelining (5-10 stages) is possible with less than 25k LE. The Phoenix core uses less than 7k LE but a simple 68060 FPU needs 50k LE? Also, the decoding, first part of the instruction pipeline (EA calculation) and caches are shared with the integer units. A fully 6888x compatible FPU could double the cost in LE which would raise the cost of the required fpga but 250 Euros? A completely awesome and huge fpga costs less than half that.
My estimates come directly from the description of Altera's FPU megacore. Considering that many 68882 instructions are not covered by that megacore, and that it only handles up to 64-bit DP where the 68882 is 80-bit EP, we can easily get to numbers like these. Add to this that an FPU will not fit nicely into a typical FPGA architecture: if your target is really high speed, overall resource utilization for the FPU can drop below 60%, so you need a lot of LEs to waste (once again, I doubt you will perform manual optimization for the FPU every time).

Check FPGA prices on, for example, Mouser or Farnell: a medium-size FPGA is around 300E at least. Remember you are trying to fit a Pentium-class CPU+FPU (the 68060) plus some improvements, which usually double the size.
Those CPU+FPU combinations were the state of the semiconductor art in the first half of the 90s; now you are trying to do exactly the same, but with an FPGA. Twenty years is a huge time from the semiconductor industry's point of view, but even FPGAs are limited: the 68060 uses single transistors to do a job where an FPGA uses a cell with multiple transistors, so you need at least 10-20x more transistors on chip to do the same job.

Quote:
Originally Posted by matthey View Post
I wouldn't say the later x86/x86_64 ISA is inefficient. It's really not too bad after they went to 16 mostly general purpose registers, added modern instructions and passed function arguments in registers. There is some old baggage, lack of orthogonality and the encodings are less efficient, but performance is an important consideration in an ISA's efficiency also.
OK, it is not so efficient compared to many modern CPUs; you can see there are other ISAs doing way better. But x86 is the most popular general-purpose ISA, that's all, and its compilers are also very good, which matters a lot nowadays.

Quote:
Originally Posted by matthey View Post
You generally don't need even 64 bit precision in a math floating point library other than much software expects double precision support. The 6888x 80 bit FPU is not accurate past 64 bits for many trigonometry instructions but the extra precision does help to calculate the accuracy to 64 bits (with software emulation too). It would be interesting to compare the accuracy of double precision floating point math functions on other platforms to the 68k FPU double precision functions.
Once again I must say this is a bit contradictory: are you interested in the 68882 or not?
I'm reading lots of details from you about why the 040/060 FPU is bad (even though most modern CPUs have no transcendental instructions, except perhaps a logarithm unit in some FPUs).
If you don't need the full 68882, then perhaps you can accept other limitations that make the design cheaper and more available?

And I repeat my question: is there any reason to use FSIN in a web browser?
pandy71 is offline  
Old 20 May 2014, 19:11   #83
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Quote:
Originally Posted by pandy71 View Post
Once again i must say this is a bit contradictory - are you interested in 68882 or not? I'm reading lot of details from you why 040/060 is bad (even if most modern CPU have no transcendental instructions perhaps except logarithm unit for some FPU's). If you not need full 68882 then perhaps you can accept other limitations that make design cheaper and more available?
I'm not interested in full 6888x compatibility because it's not practical, especially in an FPGA. Keeping extended precision would allow 6888x instructions to be emulated better, as the extra precision helps increase double precision accuracy (better compatibility). It is true that extended precision is both slower and uses more logic than double precision in an FPGA. I haven't made up my mind about the best way to go. It's probably not my decision anyway; I can only make suggestions, giving what I see as the pros and cons. I would not recommend any kind of external FPU, for the reasons I have mentioned before.

Quote:
Originally Posted by pandy71 View Post
And i repeat my question - any reason to use FSIN in web browser?
Not in the core web browser functionality for HTML or pictures. However, audio/video players, a PDF viewer and/or a 3D object plugin could make use of FSIN.
matthey is offline  
Old 20 May 2014, 19:50   #84
Megol
Registered User
 
Megol's Avatar
 
Join Date: May 2014
Location: inside the emulator
Posts: 377
Quote:
Originally Posted by pandy71 View Post
My estimations are directly related to Altera FPU megacore description. Also considering fact that lot of instructions from 68882 is not covered by this megacore and core cover only aspect of max 64 bit DP where 68882 is 80 bit EP we can easily go to this kind of numbers, add to this that FPU will not fit nicely in typical FPGA architecture and if your target is really high speed then overall resource utilization in case of FPU can be less than 60% - as a result you need lot of LE's to waste (once again - i doubt that every time you will perform manual optimization for FPU).

Check for FPGA prices on for example mouser or farnell - medium size FPGA is around 300E at least - remember you are trying to fit CPU+FPU class of Pentium (68060) only with some improvements which usually double size.
Those CPU+FPU was state of semiconductor art in first half of 90's - now you trying to do exactly same but with help of FPGA - i mean 20 years is huge time from semiconductor industry point of view but even FPGA'a are limited (i mean 68060 use single transistors to do job where FPGA use cell with multiple transistors thus you need at least 10 - 20x more transistors on chip to do same job).
The efficiency of FPGA logic depends on a lot of factors. For example: Xilinx Artix/Kintex/Virtex-7 allows a DSP block to be used as an ALU handling addition/subtraction and logical functions. Even the smallest and cheapest Artix chip provides 90 DSP blocks, so using two for ALU functionality and 4 for integer multiplication still leaves 84 blocks free. A naive extended precision floating point multiplier requires ceil(80/16)^2 = 25 DSP blocks. A realistic design would use some kind of tiling scheme, exploiting the asymmetric two's complement 25x18 multiplier used by Xilinx, which reduces the number of blocks.
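The ceil(80/16)^2 figure works out as follows; dsp_blocks_naive is an invented name for this throwaway calculation, mirroring the post's 80-bit operand and 16-bit tile assumption:

```c
/* Naive DSP block count for an N x N bit multiplier built from
   square tiles of a fixed width: ceil(N/width) tiles per side,
   squared for the full partial-product array. */
int dsp_blocks_naive(int operand_bits, int tile_width)
{
    int tiles_per_side = (operand_bits + tile_width - 1) / tile_width; /* ceil */
    return tiles_per_side * tiles_per_side;
}
```

With 80-bit operands and 16-bit tiles this gives 5 x 5 = 25 blocks, matching the estimate above; the Xilinx 25x18 asymmetric multiplier would need fewer tiles.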

Quote:
Ok it is not so efficient when compared to many modern CPU - you can see they are other ISA's doing way better - but x86 is most popular universal ISA that's all, compilers are also very good which is very important nowadays.
ISAs aren't that important; microarchitecture is king in this age of giga-transistor budgets and power limitations.
But I'd argue that the x86 ISA hasn't been that bad since the 80386, released in 1985. In fact, I'd like to know of any ISA that measures better for general purpose code by at least 10% (less than that is hard to measure anyway).

Quote:
Once again i must say this is a bit contradictory - are you interested in 68882 or not?
I'm reading lot of details from you why 040/060 is bad (even if most modern CPU have no transcendental instructions perhaps except logarithm unit for some FPU's).
If you not need full 68882 then perhaps you can accept other limitations that make design cheaper and more available?
Of course. Limiting FP support to 64-bit/double reduces logic for both the multiplier and, more importantly, the adder/subtractor unit. It would also make the divider somewhat smaller and faster.

However, if the processor core supports some kind of microcode anyway, hardware support for 64-bit floats plus partial support for extended precision in microcode could be a solution.

Another way would be fast handling of traps for unsupported instructions, combined with some non-architectural support instructions. This will be slower than microcode, as the trap path most probably has to do at least a partial pipeline flush before reaching the emulation code.

Quote:
And i repeat my question - any reason to use FSIN in web browser?
Plenty. But maybe you are asking about simple old browsers with no modern features? Then I guess there aren't any.
Megol is offline  
Old 21 May 2014, 00:00   #85
pandy71
Registered User
 
Join Date: Jun 2010
Location: PL?
Posts: 2,810
Quote:
Originally Posted by Megol View Post
Efficiency of FPGA logic depends a lot on a lot of factors. On example: using Xilinx Artix/Kintex/Virtex-7 allows one to use a DSP block as an ALU handling addition/subtraction and logical functions. Even the smallest and cheapest Artix chip supports 90 DSP blocks so using two for ALU functionality and 4 for integer multiplication there still are 84 blocks free. A naive extended precision floating point multiplier require ceil(80/16)^2 blocks = 25 DSP blocks. A realistic design would use some kind of tiling scheme due to the asymmetric twos complement 25x18 multiplier used by Xilinx which reduces the amount of blocks.
Adding/subtracting is a more demanding task in FP than multiplication/division, so I would be interested in, for example, the adder figures.
Besides, we are talking not only about the basic operations but a full transcendental FPU, fully pipelined, scalar and reasonably fast, faster than the 060; this can't be done in 20-40k LEs (knowing that the 060 has multiple millions of transistors).

Quote:
Originally Posted by Megol View Post
ISAs aren't that important, microarchitecture is king in this age of Giga transistors and power limitations.
But I'd argue that the x86 ISA isn't that bad since the 80386 that was released in 1985. In fact I'd like to know of any ISA that measure better for general purpose code with at least 10% improvement (less than that is hard to measure anyway).
I always consider power consumption and hardware complexity, and Intel is not shining when those two factors are taken into account. I can only observe how much energy is used by functionally comparable devices, and Intel is simply disappointing.


Quote:
Originally Posted by Megol View Post
Of course. Limiting the FP support to 64 bit/double reduces logic for both the multiplier and more importantly the adder/subtraction unit. It also would make the divider somewhat smaller and faster.

However if the processor core would support some kind of microcode anyway having hardware support for 64 bit floats and partial support for extended precision using microcode could be a solution.
See this whole topic: "no compromise, we need a very fast 68882" etc.
So where is the line of compromise: SP, DP or EP, a hybrid solution (microcode+HW), plain microcode, or perhaps software emulation?


Quote:
Originally Posted by Megol View Post
Another way would be fast handling of traps for unsupported instructions combined with some non-architectural support instructions. This will be slower than using microcode as the trap path most probably have to do an at least partial pipeline flush before reaching the emulation code.
Half of this topic is - "no compromise"...

Quote:
Originally Posted by Megol View Post
Plenty. But you maybe are asking for simple old browsers with no modern features? Then I guess there aren't any.
"Modern features" are designed to not use FP at all (as FP cost HW and SW resources) - btw fact that codec use FP doesn't mean it is better than INT designed video codec - clear proof is for example VP8 vs H.264 where H.264 outperform VP8 on all areas (didn't check VP9 vs H.265).
So i still not see area where to use FSIN so please give me example where it need to be used (except demos but what is point to have hardware sine records?).

Besides, if FSIN is so important, why is there no FSIN in the PowerPC instruction list, and no implementation of FSIN in x86 outside x87?
If you analyze modern ISAs, there is no float sine where fast performance is required; where it exists, it is usually implemented as a relatively slow instruction (and can probably be implemented with comparable speed in software).
Perhaps modern software doesn't require FSIN?
pandy71 is offline  
Old 21 May 2014, 03:35   #86
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Quote:
Originally Posted by pandy71 View Post
Adding/subtracting is more demanding task in FP than multiplication\division thus i would be interested in for example adder calculation.
The complexity of floating point add/sub and mul is comparable. The logic needed for FP add/sub is higher, but then several instructions (FADD, FSUB, FCMP and FNEG) with more variations are supported. Floating point divide is much more complex and slower.
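A toy sketch of why FP addition needs extra logic beyond the raw integer add: the operands must be exponent-aligned first and the result renormalized. The toyfp type and toy_add name are invented for this illustration; this is not IEEE-accurate (no sign, rounding or sticky bits), just the shape of the datapath.

```c
#include <stdint.h>

/* Value represented is mant * 2^exp, mant treated as an unsigned
   32-bit mantissa. Invented toy format, not IEEE 754. */
typedef struct { uint32_t mant; int exp; } toyfp;

static toyfp toy_add(toyfp a, toyfp b)
{
    if (a.exp < b.exp) { toyfp t = a; a = b; b = t; } /* a has the larger exponent */
    int diff = a.exp - b.exp;
    b.mant = (diff < 32) ? (b.mant >> diff) : 0;      /* alignment shift (the big shifter) */
    uint32_t sum = a.mant + b.mant;                   /* the add itself is a plain integer add */
    if (sum < a.mant) {                               /* carry out: renormalize */
        sum = (sum >> 1) | 0x80000000u;
        a.exp++;
    }
    a.mant = sum;
    return a;
}
```

The alignment shifter and renormalizer are exactly the pieces a multiplier array does not need, which is the asymmetry being discussed.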

Quote:
Originally Posted by pandy71 View Post
"Modern features" are designed to not use FP at all (as FP cost HW and SW resources) - btw fact that codec use FP doesn't mean it is better than INT designed video codec - clear proof is for example VP8 vs H.264 where H.264 outperform VP8 on all areas (didn't check VP9 vs H.265).
Have you done much coding? It's not unusual for code to use floating point because it's convenient for some types of algorithms and available on most hardware. It's kind of like owning a car: it may not be the fastest or cheapest transportation, but it sure is convenient.

Quote:
Originally Posted by pandy71 View Post
So i still not see area where to use FSIN so please give me example where it need to be used (except demos but what is point to have hardware sine records?).

Side to this why if FSIN is so important there is no FSIN in PowerPC instruction list or there is no implementation for FSIN (except x87) in x86.
If you analyze modern ISA'a there no is float sine where fast performance is required - if it exist it usually implemented as relatively slow instruction (probably can be implemented with comparable speed in software).
Perhaps modern software doesn't require FSIN?
SIN and COS are used in most cases where there is rotation around a point (whether 2D or 3D). They are generally implemented in floating point as a library function instead of an instruction because:

1) the time spent in the function is small
2) there isn't much advantage to doing the calculation in hardware
3) devoting resources somewhere else makes the whole CPU faster including the function that does the trig calculation

Only assembler programmers (and users running old FP programs) miss the convenience of the 6888x instructions. C programmers use the standard C floating point math library functions, which handle everything; they usually don't know how it works underneath.
matthey is offline  
Old 21 May 2014, 10:48   #87
Megol
Registered User
 
Megol's Avatar
 
Join Date: May 2014
Location: inside the emulator
Posts: 377
Quote:
Originally Posted by pandy71 View Post
Adding/subtracting is more demanding task in FP than multiplication\division thus i would be interested in for example adder calculation.
Side to this we talking not only on basic for but full transcendental FPU - fully pipelined, scalar and reasonable fast - faster than 060 this cant be done in 20 - 40k of LE's (knowing fact that 060 have multiple millions of transistors)
Chain two DSP blocks and you'll have a 96-bit adder/subtractor. That "only" leaves the alignment logic and the shifter. I don't remember what the shifter needs to support for extended precision, but it will require a large chunk of general logic.

Doing something faster than the 68060 FPU shouldn't be a huge problem, as it isn't pipelined and runs at relatively low speeds. Let's assume a fully extended precision FPU with a 20-stage pipeline and a 300MHz clock for additions and multiplications. It would have a peak performance of 300M/20 = 15M dependent operations, or 300M independent operations, per second.
IIRC the 68060 FPU has a 5-cycle latency and can be overclocked to ~100MHz, which gives a peak of 20M operations/s, dependent or independent.
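The peak-rate arithmetic above as a throwaway calculation; the function names are invented, and the model is the simple one used in the post (a dependent chain completes one result per full latency, independent work one per cycle):

```c
/* Peak operation rates for a pipelined unit.
   Dependent chain: each op must wait for the previous result,
   so throughput is clock / latency. Independent stream: one
   result per cycle once the pipe is full. */
double peak_dependent_ops(double clock_hz, int latency_cycles)
{
    return clock_hz / latency_cycles;
}

double peak_independent_ops(double clock_hz)
{
    return clock_hz;
}
```

Plugging in the post's numbers: 300MHz/20 stages gives 15M dependent ops/s, while the non-pipelined 100MHz/5-cycle case gives 20M either way.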

Quote:
I always consider power consumption and hardware complexity - Intel not shining when those two factor are taken into consideration - i can observe only how much energy is used by devices comparable functionally and simply Intel disappointing.
The hardware complexity isn't relevant to users, right? Intel, AMD and VIA/Centaur know how to handle the complexity. Even "simple" designs like the MIPS series have a lot of complications and quirks that complicate implementations but are largely invisible to users. The same applies to ARM.
In fact, the inherent complexity of the x86 design has been used as an advantage: since many operations require microcoding, the support for it is excellent, so many new features can be added through the already existing mechanism.

And x86 power consumption is hard to compare against processors designed for other goals. People often like to compare ARM processors that deliver <75% of the performance of a low-power x86 design, without realizing that lowering the clock rate somewhat lowers power use dramatically through several mechanisms.

Quote:
See whole this topic - "no compromise, we need very fast 68882" etc
So where is line of compromise is this is SP, DP or EP, hybrid solution (microcode+hw) or plain microcode or perhaps software emulation.
I didn't really see that in the thread. What I did see was that all 68882 instructions should be supported, not that every operation would be optimal.
I don't agree with that. Nobody uses the FPU for integer operations or BCD operations expecting peak performance, so such things shouldn't be allowed to slow down the general design. Having some hardware support for accelerating emulation of those operations could be wise, though.

Quote:
Half of this topic is - "no compromise"...
But there are always some compromises that have to be made. E.g. for FPGA designs, a high performance FPU will need a deep pipeline.

Quote:
"Modern features" are designed to not use FP at all (as FP cost HW and SW resources) - btw fact that codec use FP doesn't mean it is better than INT designed video codec - clear proof is for example VP8 vs H.264 where H.264 outperform VP8 on all areas (didn't check VP9 vs H.265).
So i still not see area where to use FSIN so please give me example where it need to be used (except demos but what is point to have hardware sine records?).
People use their browsers for more than watching videos. There are games, emulators, simulators, visualization tools and a lot more out there.
Heck, some pages run 100% as JavaScript without needing to. JavaScript uses floats as its standard number type, BTW.

Quote:
Side to this why if FSIN is so important there is no FSIN in PowerPC instruction list or there is no implementation for FSIN (except x87) in x86.
If you analyze modern ISA'a there no is float sine where fast performance is required - if it exist it usually implemented as relatively slow instruction (probably can be implemented with comparable speed in software).
Perhaps modern software doesn't require FSIN?
Of course it is needed. It's just that the functionality has been moved into libraries rather than microcode. Why? Because the designers thought it was the best solution.
Megol is offline  
Old 21 May 2014, 13:32   #88
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 2,002
Quote:
Originally Posted by matthey View Post
Games and demos that make too many assumptions about the system can't hold us back. Adding 1GB of memory including more chip memory will cause some programs to fail. Does that mean we are stuck with 2MB of chip memory and 128MB of fast? Moving from ECS to AGA, from 68000 to 68020 and from AmigaOS 1.3 to AmigaOS 2.x broke more than a few programs. Were these worthwhile upgrades? I understand that you, meynaf and most current Amiga users want the ultimate retro Amiga. I won't argue that Amiga compatibility is very important but we may need to make some sacrifices to improve performance and attract outside users and programmers.

I'm not saying that an instruction like MOVEP will necessarily be trapped. It may be possible to decode it as a PERM+MOVE.L using an internal register. It depends on how it works out. ROXL and ROXR with a rotate size other than 1 are not easy to do in the shift unit. The instruction is looped back through the shift unit up to 8 times costing 8 cycles. This isn't so bad because I have never seen a ROXL or ROXR with a rotate size other than 1. Some instructions and instruction variations were just not used or were even forbidden to use on the Amiga (read+modify cycle) and make no sense to support in hardware. The same goes for some FPU instructions like FATANH, FCOSH, FSINH, FTANH and FDBcc. I can't recall ever seeing these FPU instructions used. Trigonometry instructions usually use a table lookup with polynomial calculation. Each table can be up to 2kB. That's a lot of dead space in an "affordable" fpga. How much memory do you think is in an affordable fpga?

I have determined that FATAN is used with various vector and polar coordinate calculations and is not uncommon. The atan2() function, which has gained in popularity, calls it. The code+data for FATAN is almost 3kB so there is not likely going to be a big hurry to put it in the fpga and it *is* used.



Vectors and matrices are used in almost everything 3D nowadays. There are many matrix operations, including add, subtract, inverse, scale, multiply-add, compare, length, dot product, cross product, normalize, etc. The original Quake, before SIMD became popular, used vectors. Here is an example of one of the simple operations, a 3-element vector add:

Code:
; void _VectorAdd(vec3_t veca, vec3_t vecb, vec3_t out)
; a0 = -> veca
; a1 = -> vecb
; a2 = -> out
_VectorAdd:
   fmove.s (a0)+,fp0
   fadd.s (a1)+,fp0
   fmove.s (a0)+,fp1
   fmove.s fp0,(a2)+
   fadd.s (a1)+,fp1
   fmove.s (a0),fp0
   fmove.s fp1,(a2)+
   fadd.s (a1),fp0
   fmove.s fp0,(a2)
The instructions are scheduled as much as possible for 2 floating point pipelines, but could still not always do 2 FP instructions in parallel. There is a limit to how much this can be sped up, even with 2 FP pipes, 3-operand FP instructions and more FP registers. A SIMD unit allows several operations to be done in parallel. One could do this:

Code:
; void _VectorAdd(vec3_t veca, vec3_t vecb, vec3_t out)
; a0 = -> veca
; a1 = -> vecb
; a2 = -> out
_VectorAdd:
   vmove.s (a0),v0  ;move 4x fp.s (veca) to v0
   vadd.s (a1),v0  ;add 4x fp.s (vecb) to v0
   vmove.s v0,(a2)  ;move 4x fp.s to the destination (out)
This adds 4 elements in parallel while avoiding a lot of the scheduling problems of the 68k FPU version (which was 3 elements and would need another 3 FPU instructions for a 4-element vector add). Notice that the SIMD inherits the register-memory architecture (and relaxed alignment restrictions) of the 68k, where many SIMD processors use the annoying load/store architecture. Many modern programs expect a SIMD processor and use larger matrices like 4x4. This isn't too bad for an FP vector add, but try an FP vector multiply-add without an FP multiply-add instruction, or an FP cross product. The 68k FPU doesn't have enough FP registers even when taking half the data from memory. This becomes very inefficient and slow on the 68k FPU.
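For reference, what the hypothetical vmove/vadd/vmove sequence computes is just a 4-wide elementwise add. In portable C:

```c
/* 4-element single-precision vector add: the operation a SIMD unit
   performs as one instruction per step, written out per element. */
void vector_add4(const float *a, const float *b, float *out)
{
    for (int i = 0; i < 4; i++)
        out[i] = a[i] + b[i];   /* a SIMD unit does all four at once */
}
```

A vectorizing compiler would turn this loop into exactly the kind of packed add the post sketches.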
Sorry, but I don't understand why you want to use roxl and roxr for handling movep commands. Only rol or ror is correct here. E.g.:

Code:
 movep.l D0,5(A0)

; can be handled as

 rol.l #8,D0
 move.b D0,5(A0)
 rol.l #8,D0
 move.b D0,7(A0)
 rol.l #8,D0
 move.b D0,9(A0)
 rol.l #8,D0
 move.b D0,11(A0)
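The byte-scatter that MOVEP.L performs (and which the rol/move.b sequence above reproduces) can be sketched in C; movep_write_l is a made-up helper name for this illustration:

```c
#include <stdint.h>

/* Emulate the store form of MOVEP.L Dn,(d,An): write the four bytes
   of a 32-bit value, most significant first, to every OTHER byte
   starting at base+disp (offsets d, d+2, d+4, d+6). */
static void movep_write_l(uint32_t d0, uint8_t *base, int disp)
{
    for (int i = 0; i < 4; i++)
        base[disp + 2 * i] = (uint8_t)(d0 >> (24 - 8 * i));
}
```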
I don't know how much memory is available in an FPGA.
But if the 68882 instructions are really handled as microcode, then almost every routine can be optimised for speed or size. Also, if I remember my school math right, only one table is necessary for both SIN and COS, so perhaps you can save 2kB. I think many more (size or speed) optimisations can be done for the other 68882 instructions too:

FSAVE
FRESTORE
FADD
FSADD
FDADD
FCMP
FDIV
FSDIV
FDDIV
FMOD
FMUL
FSMUL
FDMUL
FREM
FSCALE
FSUB
FSSUB
FDSUB
FSGLDIV
FSGLMUL
FABS
FSABS
FDABS
FACOS
FASIN
FATAN
FATANH
FCOS
FCOSH
FETOX
FETOXM1
FGETEXP
FGETMAN
FINT
FINTRZ
FLOG10
FLOG2
FLOGN
FLOGNP1
FNEG
FSNEG
FDNEG
FSIN
FSINH
FSQRT
FSSQRT
FDSQRT
FTAN
FTANH
FTENTOX
FTWOTOX
FMOVECR
FNOP
FSINCOS
FTST
FBF
FBEQ
FBOGT
FBOGE
FBOLT
FBOLE
FBOGL
FBOR
FBUN
FBUEQ
FBUGT
FBUGE
FBULT
FBULE
FBNE
FBT
FBSF
FBSEQ
FBGT
FBGE
FBLT
FBLE
FBGL
FBGLE
FBNGLE
FBNGL
FBNLE
FBNLT
FBNGE
FBNGT
FBSNE
FBST
FDBF
FDBEQ
FDBOGT
FDBOGE
FDBOLT
FDBOLE
FDBOGL
FDBOR
FDBUN
FDBUEQ
FDBUGT
FDBUGE
FDBULT
FDBULE
FDBNE
FDBT
FDBSF
FDBSEQ
FDBGT
FDBGE
FDBLT
FDBLE
FDBGL
FDBGLE
FDBNGLE
FDBNGL
FDBNLE
FDBNLT
FDBNGE
FDBNGT
FDBSNE
FDBST
FSF
FSEQ
FSOGT
FSOGE
FSOLT
FSOLE
FSOGL
FSOR
FSUN
FSUEQ
FSUGT
FSUGE
FSULT
FSULE
FSNE
FST
FSSF
FSSEQ
FSGT
FSGE
FSLT
FSLE
FSGL
FSGLE
FSNGLE
FSNGL
FSNLE
FSNLT
FSNGE
FSNGT
FSSNE
FSST
FTRAPF
FTRAPEQ
FTRAPOGT
FTRAPOGE
FTRAPOLT
FTRAPOLE
FTRAPOGL
FTRAPOR
FTRAPUN
FTRAPUEQ
FTRAPUGT
FTRAPUGE
FTRAPULT
FTRAPULE
FTRAPNE
FTRAPT
FTRAPSF
FTRAPSEQ
FTRAPGT
FTRAPGE
FTRAPLT
FTRAPLE
FTRAPGL
FTRAPGLE
FTRAPNGLE
FTRAPNGL
FTRAPNLE
FTRAPNLT
FTRAPNGE
FTRAPNGT
FTRAPSNE
FTRAPST
FMOVE
FSMOVE
FDMOVE
FMOVEM

The 68882 instruction list is taken from PhxAss; I hope it is complete, as many Amiga assemblers are missing some FPU instructions. That may be one of the reasons why some FPU instructions are very rarely, or maybe never, used on the Amiga.

About your SIMD example: do you know the timing (cycles) for both versions of your code?
I'm not an FPU expert, but I think that this version of your code:

Code:
 fmovem.x (A0)+,FP0-FP3
 fadd.x (A1)+,FP0
 fadd.x (A1)+,FP1
 fadd.x (A1)+,FP2
 fadd.x (A1)+,FP3
 fmovem.x FP0-FP3,(A2)
can reach a similar speed to your SIMD example. So for me, adding SIMD instructions makes little sense (except for possible creator fame), especially if good 68882 instructions must be removed.

A much better idea is adding more instructions in move16 style, like: add16, sub16, or16, eor16, and16.
Don_Adan is offline  
Old 21 May 2014, 15:51   #89
Megol
Registered User
 
Megol's Avatar
 
Join Date: May 2014
Location: inside the emulator
Posts: 377
Quote:
Originally Posted by Don_Adan View Post
Sorry, but I don't understand why you want to use roxl and roxr for
handling movep commands. Correct is rol or ror only. F.e.
Using any type of rotation would just be a waste of power. But you should re-read the parent post: he never proposed using any type of rotate for handling MOVEP, he just stated that ROXL/ROXR are costly except for the special case of rotate by 1. Which is true.

Quote:
Code:
 movep.l D0,5(A0)

; can be handled as

 rol.l #8,D0
 move.b D0,5(A0)
 rol.l #8,D0
 move.b D0,7(A0)
 rol.l #8,D0
 move.b D0,9(A0)
 rol.l #8,D0
 move.b D0,11(A0)
If the processor supports unaligned memory accesses, it could be reduced to 4 moves. But supporting MOVEP efficiently isn't needed; the few programs that use it should tolerate trap + software emulation.

Quote:
I don't know how many memory is available in FPGA.
But if 68882 instructions are really handled as micro code, then almost every
routine/code can be optimised for speed or size. Also if I remember right,
math from my school, then for SIN and COS only one table is necessary,
then perhaps you can save 2kB. I think that much more (size or speed)
optimisations can be done for other 68882 instructions too:
The smallest Xilinx Artix-7 has 50 memory blocks of 4kiB each = 200kiB. In addition, some of the logic blocks can be used as memory too.

Quote:
<snip list of FPU instruction>

68882 instructions list is taken from PhxAss, I hope that is complete,
due many Amiga assemblers has missed some FPU instructions. It can be
one of the reasons, why some FPU instructions are very rare used or maybe
even never used on Amiga.
Why are you listing things that really are one instruction as several?

Quote:
About your SIMD example: do you know the timing (cycles) for both versions
of your code?
I'm not an FPU expert, but I think that the next version of your code:

Code:
 fmovem.x (A0)+,FP0-FP3
 fadd.x (A1)+,FP0
 fadd.x (A1)+,FP1
 fadd.x (A1)+,FP2
 fadd.x (A1)+,FP3
 fmovem.x FP0-FP3,(A2)
can reach similar speed to your SIMD example. So for me adding SIMD
instructions makes no big sense (except possible creator fame), especially
if good 68882 instructions must be removed.
The good 68882 instructions that aren't used?
Your example isn't comparable to the vector one in throughput, BTW. Even if the processor is superscalar to a wide degree and supports out-of-order execution, it is much slower.

Assume that the two FMOVEM instructions can be run at the same time and that the data dependencies between the FMOVEM and FADD can be hidden. Then each FADD requires a load, the addition itself and then a store. Peak performance for this, using one memory load and one store per cycle, is one FADD/clock (this ignores fill/spill times).
However each FADD also has another load, which lowers the performance to two clocks/FADD.

But we needn't stop there: what if the processor could execute the FMOVEMs in one clock cycle using a wide cache data path? Then the initial load takes one cycle, each of the additions can be initiated one cycle apart and the final FMOVEM takes one cycle. With a relatively realistic (though not for an FPGA) FADD latency of 4 cycles, this would take 9 cycles for the four FADDs.

But what if there suddenly were four FADD capable FP pipes? It wouldn't help any, due to the load operation.
But perhaps the processor should be capable of 4 loads/cycle? Still no go as now we have a dependency on the A1 register.
However as we already are a long way into fantasy land why shouldn't the processor be able to detect and resolve this register dependency using parallel addition logic? Then we'll reach a peak throughput of 4 FADD per cycle or (for this example using the 4 cycle FADD latency above) a total latency of 6 cycles.

Or one could instead just use a short-vector/SIMD design and avoid the complications, with the same throughput. The latter is realistic to include in FPGA designs too.
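The 9-cycle and 6-cycle figures above fall out of a toy timing model. This is a sketch, not a real pipeline simulator; it just encodes the stated assumptions (1-cycle wide load and store, FADDs issued a fixed number per cycle, 4-cycle FADD latency):

```python
import math

def total_cycles(n_fadds, issue_per_cycle, fadd_latency):
    """Cycles for: wide load (1) + FADDs issued issue_per_cycle at a
    time + wide store (1). The last FADD issues `last_issue` cycles
    after the first; the store can follow one latency later."""
    last_issue = math.ceil(n_fadds / issue_per_cycle) - 1
    return 1 + last_issue + fadd_latency + 1

# One FADD issued per cycle, 4-cycle latency: the 9 cycles from the post.
assert total_cycles(4, 1, 4) == 9
# Four FP pipes with the A1 dependency resolved: the 6-cycle fantasy case.
assert total_cycles(4, 4, 4) == 6
```

The model ignores everything a real core has to deal with (ports, hazards, fill/spill), which is exactly the point of the paragraph above.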

Quote:
Much better idea is adding more instructions in move16 style, like:
add16, sub16, or16, eor16, and16.
I guess you have some use case for those instructions? Can't see any personally.
Megol is offline  
Old 21 May 2014, 23:20   #90
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 2,002
Quote:
Originally Posted by Megol View Post
<snip — full quote of post #89, shown above>
Interesting, so using rotation commands is a waste of CPU power.
If I remember right, rotation (very useful) commands are done in 1 cycle on the 68060.
Then my proposal for movep.l handling with two OEPs can work in 4 cycles
for write and 2 cycles for read, but if you know a better/faster way, you are
welcome to show it; maybe I'm able to learn something. Who knows?

I don't know how fast or slow rotation commands are on the Apollo core, but
if they are slower than 1 cycle, then this is the correct job for the CPU creators:
1st step: how to implement the CPU/FPU command.
2nd step: how to reach maximum speed for this command.
3rd step: how to reduce FPGA space usage for this command.

Not trying to remove already available 68040/68882 instructions from the core,
because that is the lamers' way.

It is not important that only a few (?) programs used movep commands; it is important
that external emulation is very slow and makes no sense. If for you trap
emulation is the correct way to create a good CPU, then better read these extracts
from the CyberPatcher doc:

******************************************************************************

The problem with several Amiga applications on the 68040 and 68060
is that these are compiled for the 6888x.
This coprocessor has a lot more fpu instructions than the 68040 and
68060 so these instructions have to be emulated on the more advanced
680xx processors.
Unknown instructions cause a trap and during the trap the emulation has
to find the right emulation routine and run this function.
In a trap the processor is in the Supervisor mode and no other tasks
can run.
This effect is visible as a not-smooth-running mouse... the system gets
almost unusable the more unimplemented instructions a program uses.
CyberPatcher tries to patch the most-used instructions that have to be
emulated.

....

At the moment CyberPatcher supports the following programs:

-Mand2000d(large speed up)
-SceneryAnimator(large speed up)
-Imagine 2.x(large speed up)
-Vista Pro
-Lightwave
-Real 3d 2.x
-Maxon Cinema 2
-ImageFX

****************************************************************************

If you don't understand, read again about supervisor mode, no other tasks can
run, etc., again and again. But if you understand that external
emulation is the amateurs' way, that's OK.

About memory blocks: if the smallest FPGA has 200kB of memory, then 2kB (1%)
to support the SIN/COS commands is nothing and cannot be called "wasting FPGA
space" (Hi, Matt ).

About the 68882.
Which 68882 instructions are the same for you?
I wrote that this is an extraction from PhxAss, but I can check it exactly.

About SIMD: everything can be compared, especially command/routine speed.
I always show examples of whether something is slow or fast.
So for me the fmovem version is better than the SIMD version, because no cycle
results were given for the SIMD version. And nothing must be CASTRATED from the CPU.
For me SIMD commands can only be a waste of FPGA space.

BTW, there don't exist even "a few" programs which use SIMD commands.

But of course you are on the right track: fmovem and movem commands can/must be
fastest. I think that fmovem and movem.l commands for up to 4 registers can be
done in 2 cycles, for up to 8 registers in 3 cycles, etc.

About xxx16 commands: many possibilities exist (especially with a very fast movem.l
or move16) to use these commands; it is only a question of coder imagination.
Don_Adan is offline  
Old 22 May 2014, 06:50   #91
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Quote:
Originally Posted by Don_Adan View Post
Not trying to remove already available 68040/68882 instructions from the core, due this is lamers way.
How is it possible to remove instructions that never existed in the Apollo?

Quote:
Originally Posted by Don_Adan View Post
If You don't understand, read again about supervisor mode, no other tasks can run etc. and again and again etc. But if you understand than external
emulation is amateurs way thats OK.
It's really not that bad. Forbid() is not multitasking friendly either. Only the longest trapped instructions stop multitasking for very long and they become faster as the CPU gets faster. Yes, trapping missing instructions is much slower but new programs can avoid using the trapped instructions. It's natural to want all the instructions for retro software but it's not necessary for current and future software.

Quote:
Originally Posted by Don_Adan View Post
About memory blocks, if the smallest FPGA has 200kB memory, then 2kB (1%) for support SIN/COS commands is nothing and can not be called "wasting of FPGA space" (Hi, Matt ).
The CPU needs the SRAM for many different uses and often needs big blocks of several kB. Robbing a few kB could result in one of the cache sizes being cut in half. You are correct that the tables for some of the instructions could be shared.

FSIN, FCOS and FSINCOS use the same table
FTAN uses a table (it could use FSIN/FCOS but division is slow)
FASIN and FACOS use FATAN which has a table
FSINH, FCOSH, FTANH, FETOXM1 use FETOX which has a table
FTENTOX, FTWOTOX, FLOGN, FLOG2 and FLOG10 use a table
FMOVECR uses a table

Just the tables are probably 10kB. There is other static data, like fp numbers, that needs to be stored, and the code has to be stored somewhere. Now you could be using 20kB out of 200kB, which is 10% of the SRAM, and that doesn't count the logic used. My estimate could be significantly off, but I hope you can see that although it's possible to add all the FPU instructions now, it would take valuable resources away from other uses where they provide a better speedup.
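A back-of-envelope sketch of that budget, assuming ~2 kB per table group as in the SIN/COS estimate (the per-table size is a guess carried over from the discussion, not a measured figure):

```python
# Rough SRAM budget for microcoded FPU transcendentals.
TABLE_SIZE = 2 * 1024      # assumed bytes per lookup table (guess)
NUM_TABLES = 6             # the six table groups listed above
SRAM = 200 * 1024          # smallest Artix-7 block RAM, from earlier in the thread

table_bytes = NUM_TABLES * TABLE_SIZE   # 12288 bytes, roughly the "10kB" figure
estimate = 20 * 1024                    # tables + constants + microcode
share_pct = 100 * estimate / SRAM       # fraction of total block RAM consumed

assert table_bytes == 12288
assert share_pct == 10.0
```

So even with generous rounding, the tables alone eat several cache-sized blocks, which is the trade-off being argued here.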

Quote:
Originally Posted by Don_Adan View Post
About 68882.
Which 68882 instructions are for You same?
I wrote that this is extraction from PhxAss, but I can check this exactly.
FBcc, FScc, FDBcc and FTRAPcc variations are handled the same way. There are other instructions that are similar and probably share some logic.

Quote:
Originally Posted by Don_Adan View Post
About SIMD, everything can be comparable, especially commands/routines speed. I always show examples, if something is slow or fast. Then for me fmovem version is better than SIMD version, due no cycles results for SIMD version. And nothing must be CASTRATED from CPU. For me SIMD commands only can wasting of FPGA space.
FMOVEM is slow and difficult to make faster. VADD should be at least as fast as a single FADD. An SIMD processor doesn't have to set condition codes and the pipeline can be shorter. The 68k FPU would normally do the FADD in extended precision where VADD would be done in single precision. It's no contest when there is parallel work to be done, which there is not most of the time. The current 68k FPU is powerful because it's flexible and high precision, not because it's fast.

Quote:
Originally Posted by Don_Adan View Post
BTW. Don't exist even "a few" programs which used SIMD commands.
There are many programs that use SIMD instructions. There are no 68k programs that use SIMD instructions. Creating an SIMD unit that is similar to another allows for easier conversion of the code for that particular SIMD unit.

Quote:
Originally Posted by Don_Adan View Post
But of course you are on good way, fmovem and movem commands can/must be fastest. I think that fmovem and movem.l commands for up to 4 registers can be done in 2 cycles, for up to 8 registers in 3 cycles etc.
A single core of most of the fastest processors in the world can't load or store more than 1 register per cycle. This is because the CPU logic is faster than the memory (the CPU doesn't want to wait while accessing memory). The logic in an fpga is slower than the memory, so it is possible to do more than 1 load or store per cycle with some tricks. Doing 2 or 3 loads per cycle should be possible, but more than 1 store per cycle has other complexities and would be difficult if possible at all. Multiple loads per cycle would make superscalar instruction scheduling easier and all pre-68060 code would benefit tremendously. The 68060 can only access memory with every other instruction to run at full speed (this is similar to most superscalar processors). This would only leave scheduling writes and avoiding dependencies like 2 consecutive instructions working on the same register (and choosing instructions and addressing modes that work in multiple pipelines, although that should be less of a problem with the Apollo than the 68060).
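That pairing rule can be illustrated with a toy issue model — a sketch under the stated assumptions only (2-wide issue, a configurable number of memory ports, no other hazards), not a model of the real 68060 pipelines:

```python
def cycles_toy(instrs, mem_ports=1, width=2):
    """Toy superscalar model: up to `width` instructions per cycle, but
    at most `mem_ports` of them may touch memory. Each entry in `instrs`
    is True for a memory-accessing instruction, False for pure ALU."""
    cycles = i = 0
    while i < len(instrs):
        issued = mem_used = 0
        while i < len(instrs) and issued < width and \
              (not instrs[i] or mem_used < mem_ports):
            mem_used += instrs[i]
            issued += 1
            i += 1
        cycles += 1
    return cycles

# Alternating mem/ALU pairs run at full speed (2 instructions/cycle)...
assert cycles_toy([True, False] * 4) == 4
# ...back-to-back memory accesses drop to 1 instruction/cycle with 1 port...
assert cycles_toy([True] * 8) == 8
# ...and a second load port recovers the pairing for load-heavy code.
assert cycles_toy([True] * 8, mem_ports=2) == 4
```

This is why unscheduled pre-68060 code, which is full of back-to-back memory operations, would benefit so much from extra load ports.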

Quote:
Originally Posted by Don_Adan View Post
About xxx16 commands. Exist many possibilities (especially with very fast movem.l or move16) to usage these commands, this is only question of coder imagination.
I would use MOVE16 somewhat if the data didn't have to be 16-byte aligned. Otherwise, about the only use is for system memory-copying functions. I don't see much use for the other memory-to-memory 16-byte instructions either. I do wish there was a full CMP EA,EA in byte, word, and long sizes, and I do think it would get some use, but there isn't encoding space for it. The Apollo core may be able to do CMP mem,mem and MOVE mem,mem in 1 cycle.
matthey is offline  
Old 22 May 2014, 09:46   #92
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,335
Quote:
Originally Posted by pandy71 View Post
And - what is wrong to use integers?
Nothing for me, really.


Quote:
Originally Posted by pandy71 View Post
And referring to your x86 example - every generation brings new instructions or system components - are you refusing to use them also ?
Of course not, where did you see that ?


Quote:
Originally Posted by pandy71 View Post
Writing new code means you can use new functionality - if so why you insist to compile code in a way that force it to compatibility with oldest CPU.
Always keep in mind that I don't compile code, i'm writing it myself in raw pure bloody asm.


Quote:
Originally Posted by pandy71 View Post
I almost sure that running software emulation at some other CPU i can have similar performance or better at lower cost - FPGA have limitations (lot of them) it is not magical answer for all computing problems.
You certainly can have better throughput but you will also have much higher latencies.


Quote:
Originally Posted by matthey View Post
The instructions can't work from the instruction cache? Why not? An SIMD can use the data cache but it doesn't always make sense because it often uses data streams that would just flush the cache. It should be able to read from the existing data cache but maybe not write it (or have a setting). It would be interesting to see how the Apollo stream detection would handle streams for the SIMD.
The SIMD can't use the data cache, my friend (i was only speaking about dcache, not icache).
Or maybe the dcache can output 128 bits (or more) per clock ?

Quote:
Originally Posted by matthey View Post
I'm not saying a 68k SIMD would be great as they have big limitations but I see potential where you seem confident it can't be useful.
This is theory vs practice. Try to use the existing ones in x86, ppc or even arm, and see what you get.


Quote:
Originally Posted by matthey View Post
I do see that the 68k FPU gets bogged down with heavy FPU use and no integer instructions to execute in parallel with vectors and matrix calculations. What do you think is a better solution to these bottlenecks?
Choose :
1. Use a GPU (or a GPGPU) to offload your main cpu.
2. Rewrite your code with only integer ops.
3. Get a higher clocked FPGA.


Quote:
Originally Posted by Megol View Post
What I did see was that all 68882 instructions should be supported, not that every operation would be optimal.
More generally, that every 68030+68882 instruction should be supported, regardless of whether it's done directly in HW, in hybrid HW+microcode, or just in microcode.


Perhaps our friend Pandy doesn't like the FSIN instruction because it's a sin ?
(sorry for the pun, i couldn't resist)

If so, just consider FATAN for another example, and how you can use it when you're doing angle computations from coordinates.
In the same manner, FSQRT can be useful for distance computations (even though i'd prefer an integer sqr instruction).

Furthermore, when you have one transcendental operation, the others can probably reuse your logic (in a SW implementation it's the case).
So the block may be huge, but it'll be there only once, and a wealth of other ops are just a few parameters away from what you have now.

That said, i'm not giving an enormous value to the FPU. All my code is either pure integer, or uses libs. I'm just saying how a good fpu can look like for me.
What's sure is that i'd personally have more use for a new MOVEM.B than for FSIN.
meynaf is offline  
Old 22 May 2014, 10:04   #93
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,335
Quote:
Originally Posted by matthey View Post
How is it possible to remove instructions that never existed in the Apollo?
They never existed in the Apollo maybe, but what counts here is that they existed in the 68k.


Quote:
Originally Posted by matthey View Post
It's natural to want all the instructions for retro software but it's not necessary for current and future software.
I wouldn't bet on this. Take the example of CAS and CAS2.
What i find particularly funny is that the ones who advocate going the SMP way are often the same who advocate the removal of instructions such as these two.
Funny, to say the least.


Quote:
Originally Posted by matthey View Post
Just the tables is probably 10kB.
It's probably even smaller. I remember my old 8-bit 6502 machine. It was able to compute all of these with its Basic in ROM, and the ROM was only 16kB in total.


Quote:
Originally Posted by matthey View Post
An SIMD processor doesn't have to set condition codes and the pipeline can be shorter.
But you forget that its data is larger. And in the fpga, you'll get big routing times, and costly extra-large muxes.

Besides, i don't think there is much space to waste for a floating-point vector core. An integer-only vector core is big enough - and frankly i'd rather have new scalar integer instructions.


Quote:
Originally Posted by matthey View Post
The current 68k FPU is powerful because it's flexible and high precision not because it's fast.
And its flexibility and high precision have to be kept, i think. These needn't be very fast.


Quote:
Originally Posted by matthey View Post
There are many programs that use SIMD instructions. There are no 68k programs that use SIMD instructions. Creating an SIMD unit that is similar to another allows for easier conversion of the code for that particular SIMD unit.
Many programs ? Huh, i suggest you make some statistics on the usage of these, before saying that !


Quote:
Originally Posted by matthey View Post
I do wish there was a full CMP EA,EA in byte, word, and long sizes and I do think it would get some use but there isn't encoding space for it.
I would have some use for this myself !

How about just having CMP d16(An),d16(An) ?
Or maybe CMP (SP)+,EA ?

These can be encoded, i think.
meynaf is offline  
Old 22 May 2014, 10:58   #94
Megol
Registered User
 
Megol's Avatar
 
Join Date: May 2014
Location: inside the emulator
Posts: 377
Quote:
Originally Posted by Don_Adan View Post
Interesting, using rotation commands is waste of CPU power.
If I remember right, rotation (very useful) commands are done in 1 cycle for 68060.
Then my proposal for movep.l handling with two OEP's can works with 4c
for write and 2c for read, but if you know better/faster way, you are
welcome too show, maybe I'm able to learn something? Who know?
The limitation for fast MOVEM execution will be either register ports or memory ports.
Those limitations are _hard_ to change and adding more ports of either type _will_ decrease the overall clock frequency! Yes, that includes ASIC implementations.
Intel just recently added support for 2 data reads and one data write per cycle from cache.
Intel also uses fewer register accesses than required for peak throughput. Why? Because that lowers power and increases clock rates. Lowering power also increases performance, BTW.

Why do I talk about Intels chips? Because they _are_ the ones with best performance.

Quote:
I don't know, how fast/slow are rotation commands for Apollo core, but
if are slowest than 1 cycle, then this is correct job for CPU creators:
1st step, how implement CPU/FPU command.
2nd step, how reach maximum speed for this command.
3rd step, how reduce FPGA space usage for this command.

Not trying to remove already available 68040/68882 instructions from the core,
due this is lamers way.
It's increasingly clear that you don't know anything about hardware. But do try.

Quote:
Is not important than only a few (?) programs used movep commands, its important
than external emulation is very slow and has no sense.
No. If MOVEP can be emulated in e.g. 100 cycles on a 100MHz machine that is already enough.
Why? Because the few cases that uses MOVEP will not be performance critical.

Quote:
If for You, trap
emulation is correct way for creating good CPU, then better read extractions
from CyberPatcher doc:

******************************************************************************

The problem with several Amiga applications on the 68040 and 68060
is that these are compiled for the 6888x.
This coprocessor has a lot more fpu instructions than the 68040 and
68060 so these instructions have to be emulated on the more advanced
680xx processors.
Unknown instructions cause a trap and during the trap the emulation has
to find the right emulation routine and run this function.
Not true given the right design.

Quote:
In a trap the processor is in the Supervisor mode and no other tasks
can run.
This effect is visible by a not smooth running mouse...the system gets
almost unusable the more unimplemented instructions are used by a program.
CyberPatcher trys to patch the most used instructions that have to be
emulated.

<snip>
****************************************************************************

If You don't understand, read again about supervisor mode, no other tasks can
run etc. and again and again etc. But if you understand than external
emulation is amateurs way thats OK.
As everyone but you is apparently a cheating lame amateur, I wonder how long we'll have to wait for your processor to be released?
Given your certainty about the performance of an FPGA processor implementing the full 68k ISA, you must already have it up and running.

Quote:
About memory blocks, if the smallest FPGA has 200kB memory, then 2kB (1%)
for support SIN/COS commands is nothing and can not be called "wasting of FPGA
space" (Hi, Matt ).
The smallest Artix-7, yes. That isn't likely to be used in an accelerator because it is already pretty expensive.
Supporting SIN would lower the clock rate of the whole design, as generic microcode support would have to be included. Or one could have a dedicated hardware unit, which would then complicate the data flow for result writeback.

Quote:
About 68882.
Which 68882 instructions are for You same?
I wrote that this is extraction from PhxAss, but I can check this exactly.

About SIMD, everything can be comparable, especially commands/routines speed.
I always show examples, if something is slow or fast.
Then for me fmovem version is better than SIMD version, due no cycles results
for SIMD version. And nothing must be CASTRATED from CPU.
For me SIMD commands only can wasting of FPGA space.

BTW. Don't exist even "a few" programs which used SIMD commands.

But of course you are on good way, fmovem and movem commands can/must be
fastest. I think that fmovem and movem.l commands for up to 4 registers can be
done in 2 cycles, for up to 8 registers in 3 cycles etc.

About xxx16 commands. Exist many possibilities (especially with very fast movem.l
or move16) to usage these commands, this is only question of coder imagination.
So you can't give any example? Not surprised to be honest.
The EOR16 could be used for software RAID support but that's about it.
If instead the processor could support short vector/SIMD integer operations with 128 bit registers not only could every *16 operation be emulated but a lot of useful work could be done too.
Megol is offline  
Old 22 May 2014, 22:33   #95
Don_Adan
Registered User
 
Join Date: Jan 2008
Location: Warsaw/Poland
Age: 55
Posts: 2,002
Quote:
Originally Posted by Megol View Post
<snip — full quote of post #94, shown above>
About movem and Intel.

Then it's time to beat Intel chips.
I'm not a hardware expert, but I would use an internal CPU 128-bit register
for read/write. Four 32-bit registers must/can be split/joined into one
128-bit register, and the data (128 bits) can be written/read. Then with
two reads you can read 8 longwords per cycle, or write 4 longwords
per cycle. If I remember right, the move16 command is faster than movem
for 4 registers, so there must exist some way to make it faster.
If splitting/joining registers is too hard to make, then I think
a "movemfast" command (or special movem.l handling) can be used/added.
It would work like movem, but only for consecutive registers, f.e.

movemfast D0-D3,(A0) is OK
movemfast D0-D2/D4,(A0) is not possible
movemfast D1-A4,(A5) is OK

or special handling of movem.l only

movem.l D0-D3,-(SP) full speed
movem.l D0-D2/D4,-(SP) slowest speed
movem.l (SP)+,D0-A3 full speed

For the movem.w command, data can be joined/split similarly.
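The contiguity rule of the proposed movemfast can be sketched in Python (the name "movemfast" and the register ordering are just this proposal, nothing official):

```python
# 68k register file in MOVEM list order: D0-D7 then A0-A7.
REGS = [f"D{i}" for i in range(8)] + [f"A{i}" for i in range(8)]

def is_movemfast_ok(reglist):
    """True if the register list is one consecutive run (e.g. D0-D3 or
    D1-A4), which is all the proposed movemfast would accept."""
    idx = sorted(REGS.index(r) for r in reglist)
    return idx == list(range(idx[0], idx[0] + len(idx)))

assert is_movemfast_ok(["D0", "D1", "D2", "D3"])       # movemfast D0-D3 is OK
assert not is_movemfast_ok(["D0", "D1", "D2", "D4"])   # D0-D2/D4 not possible
assert is_movemfast_ok(                                # D1-A4 is OK
    [f"D{i}" for i in range(1, 8)] + [f"A{i}" for i in range(5)])
```

A contiguous run can be described by just a start index and a count, which is what would make the wide 128-bit transfer straightforward.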

Of course I don't know hardware. But I have heard too many times that something
is impossible to make, when I was sure (as an amateur) that it was possible.
If you need hardware examples: f.e. putting an MC68060 CPU on the A3640, or putting more than 16 MB
on the A4000 main board. If you need software examples: f.e. the one-disk version
of Turrican 2 (fitting ~1070kB of heavily packed data on a 900kB disk), or writing
RNC copylocks without hardware.

100 cycles for movep emulation is a very good example of wasting CPU power.
It seems you like to waste CPU power; I don't.
2 or 4 cycles vs 100 cycles, and your choice is 100.

About the trap and finding the emulation routine: of course this is true. But of course you
can create thousands of traps, where every trap works only for a concrete
CPU/FPU instruction, f.e.
trap #10456 can handle movep.l D0,0(A0)
trap #10457 can handle movep.l D0,1(A0)
trap #29456 can handle movep.l D2,51(A2)
etc

This is simply a waste of memory/speed, but you can call it "the right design".
Anyway it seems you must be a "Trap Master", I think, if for you the trap way
is the correct way to make a good CPU.

I didn't write that the Apollo creators are lamers/amateurs. I only wrote
that they are going the lamers'/amateurs' way. If for you this is the same, then sorry.

From my amateur point of view:

1. movep instruction -> trap.
2. rarely used instruction -> trap.
3. FPU instruction -> trap.
4. hard-to-implement instruction -> trap.
etc.

Sorry, but I can't call this the expert way.

About SIN support.
Sorry, but I think you must be wrong that microcode can slow down the
CPU clock rate, if other FPU commands (also microcoded) don't slow the CPU clock rate.
This is illogical to me; and even if it is true, then try to make
another SIN implementation. Many things can be done in different ways, not only
via the "one and only trap" way.

About xxx16 commands.
I don't know which examples you need, but f.e.
movem.l (A5)+,D7-A2
eor16 D7-A2,(A4)+

I'm not against adding new instructions to the core, but it must be done in a clean way,
not a dirty way (reusing opcode space of already available 68040/68882 instructions),
because that will only make problems. This is simple for me: choose one unused
and easy-to-decode 2-byte opcode and use it only as a prefix/ID for the
series of 128-bit instructions, and use the next 2 bytes as the real opcode for fast
instruction decoding. Instructions will be 2 bytes longer (due to the prefix/ID at the beginning),
but 100% compatible with the already available 68040/68882 instructions.
Don_Adan is offline  
Old 23 May 2014, 08:04   #96
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Quote:
Originally Posted by Don_Adan View Post
About movem and Intel.

Then time to beat Intel chips. I'm not hardware expert, but I will use internal CPU 128 bit register for read/write. Four 32 registers must/can be splitted/joined in one 128 bit register and data (128 bit) can be wrote/readed. Then for two reads You can read 8 longwords per cycle or wrote 4 longwords per cycle. If I remember right move16 command is fastest than movem for 4 registers, then must exist any way for make it fastest. If splitting/joining registers is too hard to make, then I think than "movemfast" command (or special movem.l handling) can be used/added.
It can works like movem, but only for successive registers f.e.

movemfast D0-D3,(A0) is OK
movemfast D0-D2/D4,(A0) is not possible
movemfast D1-A4,(A5) is OK

or special handling of movem.l only

movem.l D0-D3,-(SP) full speed
movem.l D0-D2/D4,-(SP) slowest speed
movem.l (SP)+,D0-A3 full speed

For movem.w command similar data can be joined/splitted.
This is a good idea but a pipeline creates complexity that makes this more difficult. Processors use read and write register ports to access the data in the register file(s). The processor must keep track of dependencies (conflicts) between the different registers in a superscalar pipelined processor. When multiple registers are accessed, it makes this job more complex. Only a certain amount of work can be done in a stage of the pipeline without slowing down the processor. MOVEM already takes more time to process because the registers are bitmapped instead of a continuous series. There are no restrictions on the alignment of the memory access either. I think some optimization can be done on the Apollo, at least when reading from memory/cache, but I don't expect it's as easy as you think. Apollo has a better chance than many processors because unaligned cache accesses are allowed and the processor is slower than memory.
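As a sketch of why the bitmapped register list is extra decode work compared to a contiguous run, here is a Python model of the MOVEM 16-bit register mask (the bit ordering follows the usual description of the 68k encoding; treat it as illustrative):

```python
def movem_registers(mask, predecrement=False):
    """Decode a 68k MOVEM 16-bit register mask into register names.
    Normal order: bit 0 = D0 ... bit 15 = A7; the bit order is
    reversed for the -(An) predecrement addressing mode."""
    names = [f"D{i}" for i in range(8)] + [f"A{i}" for i in range(8)]
    if predecrement:
        names = names[::-1]
    return [names[i] for i in range(16) if mask & (1 << i)]

# movem.l D0-D2/A6,<ea> with the normal bit order:
assert movem_registers(0b0100_0000_0000_0111) == ["D0", "D1", "D2", "A6"]
# The same register set under -(An) ordering is encoded mirrored:
assert movem_registers(0b1110_0000_0000_0010, predecrement=True) == \
    ["A6", "D2", "D1", "D0"]
```

Any of the 65536 mask values is legal, so the hardware must walk the bits one by one (or use a priority encoder) rather than just latching a start register and a count.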

Quote:
Originally Posted by Don_Adan View Post
Of course I don't know hardware. But I have heard too many times that something is impossible to make when I was sure (as an amateur) that it is possible. If you need hardware examples: putting an MC68060 CPU on the A3640, or putting more than 16 MB on the A4000 main board. If you need software examples: the one-disk version of Turrican 2 (fitting ~1070 kB of heavily packed data on a 900 kB disk), or writing RNC copylocks without the hardware.
It's not about what is impossible but about what is practical. MOVEM could load or store all registers in a single cycle but the clock speed of the CPU would be a fraction of what it is. All the 68020+6888x instructions could be put in hardware but the processor would be slower, take longer to develop and need a more costly fpga.

Quote:
Originally Posted by Don_Adan View Post
100 cycles for MOVEP emulation is a very good example of wasting CPU power. It seems you like to waste CPU power; I don't.
2 or 4 cycles vs 100 cycles, and your choice is 100.
There is no waste if MOVEP is not used and it shouldn't be used anymore. If old programs use MOVEP, they will probably still be faster than they originally were despite the trap/emulation. I use a 68060 with emulated MOVEP and I can't see any difference nor do I notice any problems from it.
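For context on what the trap handler has to reproduce: MOVEP scatters the bytes of a data register across every other byte address, which suited byte-wide peripherals sitting on one half of a 16-bit bus. A rough Python sketch of the MOVEP.L store semantics (the memory model here is just a bytearray, not real chip registers):

```python
def movep_l_write(mem, addr, value):
    """Emulate MOVEP.L Dn,(d,An): store the four bytes of a
    32-bit value at every other byte address, high byte first,
    as the 68000 did for byte-wide peripherals."""
    for i in range(4):
        byte = (value >> (24 - 8 * i)) & 0xFF
        mem[addr + 2 * i] = byte

mem = bytearray(8)
movep_l_write(mem, 0, 0x12345678)
print(mem.hex())  # -> 1200340056007800
```

A trap handler must decode the instruction, do these four byte stores, and return - which is where the ~100-cycle figure comes from.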

Quote:
Originally Posted by Don_Adan View Post
About SIN support.
Sorry, but I think you must be wrong that microcode can slow down the CPU clock rate, if other FPU instructions (also microcoded) don't slow the clock rate. That is illogical to me, and even if it is true, then try to make
another SIN implementation. Many things can be done in different ways, not only via the "one and only trap" way.
I believe that Apollo does not currently support any microcode. Adding microcode support may slow down the whole CPU. It should be possible (and probably faster) to do the trig instructions in VHDL but the math is very complex for some of these instructions. There can be different polynomials used for different ranges of input based on the calculated error in that range. The worst case error for fatan reads like this:

Accuracy and Monotonicity
The returned result is within 2 ulps in 64 bit significant bit, i.e. within 0.5001 ulp to 53 bits if the result is subsequently rounded to double precision. The result is provably monotonic in double precision.

I think this is in some technical math language other than English. Perhaps you would like to do the VHDL coding and the proof of monotonic result and accuracy to 2 ulps in 64 significant bits? It's not like it's impossible.
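To make the quoted accuracy language concrete: an "ulp" is a unit in the last place of the correctly rounded result, and the claim is that FATAN stays within 2 of them. A hedged Python sketch (using a deliberately naive truncated Taylor series - not the range-split minimax coefficients any real FPU would use) shows how far a casual implementation lands from that bar:

```python
import math

def ulp_error(approx, exact):
    """Error of an approximation measured in units in the
    last place (ulps) of the double-precision exact result."""
    if exact == 0.0:
        return abs(approx)
    return abs(approx - exact) / math.ulp(exact)

def atan_poly(x):
    """Naive truncated Taylor series for atan on [0, 0.5] --
    purely illustrative, not production FPU coefficients."""
    x2 = x * x
    return x * (1 - x2 / 3 + x2 * x2 / 5 - x2 * x2 * x2 / 7)

worst = max(ulp_error(atan_poly(i / 1000), math.atan(i / 1000))
            for i in range(1, 501))
print(worst)  # billions of ulps -- nowhere near the 2-ulp bound
```

Getting from "billions of ulps" down to a provable 2-ulp, monotonic result is exactly the part that makes the trig instructions hard to do in VHDL.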
matthey is offline  
Old 23 May 2014, 12:08   #97
pandy71
Registered User
 
Join Date: Jun 2010
Location: PL?
Posts: 2,810
At some point I decided to give up; however, as I began to better see the point of view of many of you, I need to ask once again: who is this board designed for, who is the marketing target, and what is the main purpose of the board(s)?

Also, out of pure curiosity: I don't understand why so many people insist on using the CPU to perform DMA-like tasks - for example C2P - where some small dedicated logic seems more efficient.

And no, I'm not afraid of sine - I just don't find it worth paying 200E more for.
pandy71 is offline  
Old 23 May 2014, 12:31   #98
Thorham
Computer Nerd
 
Thorham's Avatar
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,796
Quote:
Originally Posted by pandy71 View Post
i don't understand why so many people insist to perform for example C2P by CPU
What other choice is there? The blitter? Too slow
Thorham is offline  
Old 23 May 2014, 13:47   #99
robinsonb5
Registered User
 
Join Date: Mar 2012
Location: Norfolk, UK
Posts: 1,154
Quote:
Originally Posted by Thorham View Post
What other choice is there? The blitter? Too slow
In an FPGA it'd be perfectly possible to design some logic that reads a buffer from SDRAM, does C2P conversion and writes the result to Chip RAM without CPU intervention. If the CPU's running largely from SDRAM / Cache then this background task would have very little impact on the CPU speed.
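For reference, the transform such logic would implement is simple to state: bit p of chunky pixel i becomes bit (7-i) of bitplane byte p. A naive Python sketch of that mapping (real C2P code, in hardware or software, uses merged shift/mask passes rather than this bit-by-bit loop):

```python
def c2p_8pixels(pixels, depth):
    """Convert 8 chunky pixels into 'depth' bitplane bytes:
    bit (7-i) of plane p holds bit p of pixel i, matching the
    Amiga's planar screen layout."""
    planes = []
    for p in range(depth):
        byte = 0
        for i, pix in enumerate(pixels):
            byte |= ((pix >> p) & 1) << (7 - i)
        planes.append(byte)
    return planes

# 8 pixels of a 4-colour (2-bitplane) image
print(c2p_8pixels([0, 1, 2, 3, 3, 2, 1, 0], 2))  # -> [90, 60]
```

In an FPGA this per-bit wiring is essentially free, which is why a background C2P DMA engine is cheap compared with doing the same shuffle on the CPU.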
robinsonb5 is offline  
Old 23 May 2014, 14:03   #100
pandy71
Registered User
 
Join Date: Jun 2010
Location: PL?
Posts: 2,810
Quote:
Originally Posted by robinsonb5 View Post
In an FPGA it'd be perfectly possible to design some logic that reads a buffer from SDRAM, does C2P conversion and writes the result to Chip RAM without CPU intervention. If the CPU's running largely from SDRAM / Cache then this background task would have very little impact on the CPU speed.
This is my point - a DMA controller (a so-called FDMA: flexible DMA with its own task list, ALU, barrel shifter, and a small memory usable for data and instructions, etc.) can perform various operations on various data streams. C2P is one of the obvious tasks to do on the Amiga, since the Amiga architecture stays the same. Such a DMA controller could work on the Copper principle, only more efficiently and with an extended list of operations - new software could use it directly, and old software could be exposed to it via modified libraries.

I don't understand why everything must be done by the CPU... copying a buffer 25 times per second from one memory (Fast) to another (Chip) is one of the simplest things to do in an FPGA... (way simpler than a sophisticated FPU).

Quote:
Originally Posted by Thorham View Post
What other choice is there? The blitter? Too slow
2/1 cycles per 32 bits on OCS/ECS/AGA - is this slow? What could be faster? Doing it on the CPU?
Akiko has special hardware to do this... reading/writing memory as fast as the memory allows is not slow.

Last edited by pandy71; 23 May 2014 at 14:27.
pandy71 is offline  
 

