Betatesting SoftIEEE FPU emulator - Page 8

Karlos · 04 January 2023, 21:15

@Thomas Richter

Could a variant version that has reduced precision be faster? I appreciate this isn't the goal but it seems to me that a lot of users with faster 060s tend to use their FPU for gaming rather than anything requiring full extended precision.

Thomas Richter · 04 January 2023, 22:26

I'm afraid you are expecting too much. Even if it ran at the speed of mathieeesingbas, it would still be too slow for gaming - please do the math yourself. You would still be below 6fps.

Anyhow, the framework is there, the architecture is open, the interface is documented. All that needs to be done is a re-implementation of the softieee.library. The complicated parts, such as the FPU emulation, instruction decoding and online jitting, are already taken care of by the SoftIEEE binary (not the library) and MuRedox. These binaries do not care how the math core works - and that math core is the softieee.library.

Karlos · 04 January 2023, 22:41

Fair enough. Fixed point builds on EC parts for the win then.

rabidgerry · 04 January 2023, 23:46

Quote:

Originally Posted by Thomas Richter

I'm afraid it wouldn't get any faster. The current speed is at 1/3 of the speed of mathieeedoubbas; the latter is already quite optimized and offers only 56 bit precision. Even if it matched the speed of doubbas, or even singbas, it would still remain at 3fps or maybe 6fps, below "playable".

SoftIEEE is not supposed to replace a full-fledged FPU. If you need the speed of an FPU, get a hardware FPU. It is just supposed to provide an FPU emulation for those programs whose authors were too lazy to go through the system math libraries.

I have an 060 with full FPU. I simply tried SoftIEEE as an experiment in conjunction with an LC I bought, as someone had suggested that I try it. This was after I noted that the LC060 could be overclocked quite comfortably, but the games you might want the overclocking for all seemed to need the FPU in some capacity. So it was a nice little experiment, but as you rightly point out, it won't solve the issues that LC users will have if they want to play games like Duke Nukem etc. or even Doom Attack AIO, as I discovered.

Don_Adan · 05 January 2023, 22:45

Quote:

Originally Posted by Thomas Richter

frestore and fsave are of course emulated by SoftIEEE, that's not the issue. However, they cannot be replaced by jitter functions that *do not* go through an emulator trap. The trouble is that calling a "jitted" function takes at least 4 bytes (JSR.W), but there aren't 4 bytes available to patch.

Replacing them by "traps" does not provide any advantage - a trap is nothing but an exception, but then, there is nothing gained as that replaces just one exception (the original one as captured by SoftIEEE) with another exception (that of the trap).

The whole trick of MuRedox is that there are no exceptions involved anymore.

Yes, you are right, I forgot that these are traps too. But I think that an AllocTrap version could be somewhat faster than the F-line emulation version, because there is no code needed to recognise the instruction and maybe less register usage.

mfilos · 09 January 2023, 11:36

Thomas I see version 40.6 is on Aminet (from yesterday) but the archive has version 40.5 (binary + library).

pandy71 · 09 January 2023, 12:16

@THOR - apologies upfront for my question - I'm curious how feasible it is, from your perspective, to implement such emulation on an ISA other than MC68K: emulate the 881/882 in additional SW/HW while keeping your MC68K frontend. In other words, implement the actual float calculation in a separate solution and use such a virtual 881/882 from the native CPU.

I still have the impression that I'm unable to express my question clearly, so here is an example:
Your SoftIEEE library, but with the float numerics implemented in software on different HW connected to the Amiga (for example a cheap SoC using the RISC-V or ARM ISA, equipped with a float co-processor and perhaps a DSP, and running at something like 300...400MHz).

How feasible is such a hybrid implementation from your perspective? Let's leave the numeric (i.e. non-MC68K) part out of the question.

Thx!

Karlos · 09 January 2023, 12:29

I think that question has been raised already. Someone asked about offloading the instruction to PPC (I think that's what they said), but certainly in that case there's a lot to contend with. Anything along the WarpOS route would be many orders of magnitude slower than the current software implementation.

Unless you have an extremely low latency way to do it, I don't think you'll get away with offloading externally.

pandy71 · 09 January 2023, 13:00

Quote:

Originally Posted by Karlos

I think that question has been raised already. Someone asked about offloading the instruction to PPC (I think that's what they said), but certainly in that case there's a lot to contend with. Anything along the WarpOS route would be many orders of magnitude slower than the current software implementation.

Unless you have an extremely low latency way to do it, I don't think you'll get away with offloading externally.

I'm aware that the related overhead is high, but 40..100MHz MC68K HW may still be slower at software float calculations than external modern HW - the problem is a standard interface between the application and the external HW float implementation.
Nowadays small SoCs are equipped with HW float (albeit 32 bit) and usually a DSP; some of them also have a dedicated NPU for fast low-precision integer work. Such SoCs cost 2..3$ and, besides glue logic, have everything needed for this - so this was my question: a small SoC incapable of full MC68K emulation, but capable of offloading, for example, float calculation at a fraction of the cost of an original 881/882 (not to mention 040/060 boards, where the coprocessor interface may not even be implemented).
Creating some API standard and separating the frontend from the physical implementation of the float calculation could be something interesting.

Ages ago there was, for example, the WEITEK company, which produced many solutions that appeared as simple I/O in the CPU address space... so this is a question about something similar, performing 4..6 times faster than the MC68K in float calculation.

Or something like this https://micromegacorp.com/umfpu64.html easy to hook to even MC68000.

A report comparing softfloat on an 8 bit uC with the same calculations offloaded to such external HW: https://micromegacorp.com/downloads/...g%20WinAVR.pdf - the limitation is of course the SPI interface, but even so the difference is obvious. Assuming a different way of connecting such external HW that significantly reduces the communication overhead, this may be a sane option for replacing the 881/882 together with SoftIEEE, with better results.

I was curious about THOR's opinion on whether it is feasible, from his perspective, to separate the physical calculation implementation from the frontend, so that he is not responsible for any foreign bugs but can still control SoftIEEE as its owner and, for example, focus on his pure MC68K float implementation. So a semi-open standard.

Thomas Richter · 09 January 2023, 13:57

Quote:

Originally Posted by mfilos

Thomas I see version 40.6 is on Aminet (from yesterday) but the archive has version 40.5 (binary + library).

No worries, this is the right version. I apparently forgot to bump the revision, but the binaries are correct.

Thomas Richter · 09 January 2023, 15:47

Quote:

Originally Posted by pandy71

@THOR - apologies upfront for my question - I'm curious how feasible it is, from your perspective, to implement such emulation on an ISA other than MC68K: emulate the 881/882 in additional SW/HW while keeping your MC68K frontend. In other words, implement the actual float calculation in a separate solution and use such a virtual 881/882 from the native CPU.

As said before, it is technically possible. All you need to do is implement the softieee.library interface. This implementation would take the parameters and forward them to the hardware.

However, the resulting solution would still not be on par with a hardware FPU. Let's make a couple of computations: A hardware multiplication on the 68060 is ~2 cycles if I recall. The MuRedox call-in overhead is roughly one order of magnitude larger (~20 cycles), that of SoftIEEE through exception processing a lot larger (~200 cycles). On top of this, the softieee.library still has to forward the parameters to the hardware and perform the operation there. For example, for the 68882, you need to emulate the coprocessor interface in software (probably another 20 cycles) and then the 68882 has to execute the multiplication (which is another >20 cycles), so in the end, you are at about 60 to 100 cycles minimum. That's almost two orders of magnitude slower than the 68060.

The softieee.library multiplication engine is probably 200 cycles (just house numbers), so it is slower, but not that much slower. This is also the reason why the 68882-based "hardware accelerator" solutions were not really working well. The communication overhead to the FPU eats up the performance improvements of the FPU. The 68882 only works well with the 68020/030, where hardware implements the coprocessor interface.
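To make the estimate above easier to follow, here is a minimal sketch that simply adds up the quoted figures per path. The constants are the rough "house numbers" from the post, not measurements; the stage names, the `path_cost` helper and the way the stages are combined (call-in overhead plus compute) are my own reading for illustration.

```python
# Rough per-multiplication cycle figures as quoted in the post ("house numbers").
HW_MUL_68060 = 2         # native 68060 FPU multiply
MUREDOX_CALL = 20        # MuRedox call-in overhead
EXCEPTION_TRAP = 200     # SoftIEEE call-in via exception processing
COPRO_IF_EMULATION = 20  # software emulation of the coprocessor interface
MUL_68882 = 20           # multiply on the 68882 itself (quoted as ">20", so a lower bound)
SOFT_MUL = 200           # softieee.library software multiply


def path_cost(*stages):
    """Total cycles for one multiplication: the sum of all stages on the path."""
    return sum(stages)


# Assumption: each path is "call-in overhead + whatever does the actual math".
paths = {
    "68060 hardware FPU": path_cost(HW_MUL_68060),
    "MuRedox + external 68882": path_cost(MUREDOX_CALL, COPRO_IF_EMULATION, MUL_68882),
    "MuRedox + softieee.library": path_cost(MUREDOX_CALL, SOFT_MUL),
    "exception + softieee.library": path_cost(EXCEPTION_TRAP, SOFT_MUL),
}

for name, cycles in paths.items():
    print(f"{name:30s} {cycles:4d} cycles  {cycles / HW_MUL_68060:6.1f}x native")
```

With these figures the external-hardware path lands at the lower end of the "60 to 100 cycles" range, i.e. a few times faster than the all-software path but still far from the native 68060.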

Quote:

Originally Posted by pandy71

I still have the impression that I'm unable to express my question clearly, so here is an example:
Your SoftIEEE library, but with the float numerics implemented in software on different HW connected to the Amiga (for example a cheap SoC using the RISC-V or ARM ISA, equipped with a float co-processor and perhaps a DSP, and running at something like 300...400MHz).

The softieee.library is a "numerics core". So it takes one or two extended precision floating point numbers in memory, and its "contract" defines that it places the result back in memory. SoftIEEE and MuRedox follow this contract. They do not care *how* the library does its job. This is "currently" an all-software implementation, but nobody stops you from implementing your own softieee.library. Such an alternative implementation would read the operands from memory, forward them to the hardware, and read the results back. Thus, while this construction would be faster, I doubt it would be *much* faster. I would expect a factor of 2 or 3 (see above for calculations), but that still places you one order of magnitude slower than native code on the 68060.

The speed would be, according to this estimate, approximately on par with the mathieeedoubbas.library.
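As an illustration of the "operands in memory, result back in memory" contract described above, here is a minimal sketch. It is not the softieee.library API: `numerics_core_mul`, its argument order and the pluggable `backend` are hypothetical names, and I am assuming the 96-bit extended layout the 68881/68882 use in memory (16 bits sign and exponent, 16 bits padding, 64 bits mantissa, big-endian). The stand-in backend multiplies in Python doubles, so it loses precision that a real extended-precision core would keep.

```python
import math
import struct

EXT_SIZE = 12     # bytes per extended-precision operand in memory (assumed layout)
EXP_BIAS = 16383  # exponent bias of the extended format


def encode_ext(x: float) -> bytes:
    """Pack a finite Python float into the assumed extended format (big-endian)."""
    sign = 0x8000 if math.copysign(1.0, x) < 0 else 0
    if x == 0.0:
        return struct.pack(">HHQ", sign, 0, 0)
    m, e = math.frexp(abs(x))      # abs(x) == m * 2**e with 0.5 <= m < 1
    mantissa = int(m * (1 << 64))  # explicit integer bit ends up in bit 63
    return struct.pack(">HHQ", sign | (e - 1 + EXP_BIAS), 0, mantissa)


def decode_ext(raw) -> float:
    """Unpack a normalized number or zero; the 64-bit mantissa is rounded to a double."""
    sign_exp, _pad, mantissa = struct.unpack(">HHQ", raw)
    sign = -1.0 if sign_exp & 0x8000 else 1.0
    if mantissa == 0:
        return sign * 0.0
    return sign * math.ldexp(mantissa, (sign_exp & 0x7FFF) - EXP_BIAS - 63)


def numerics_core_mul(memory: bytearray, src: int, dst: int,
                      backend=lambda a, b: a * b) -> None:
    """Hypothetical core entry: read both operands from memory, write dst * src to dst.

    `backend` stands in for whatever does the math (software, or external hardware).
    """
    a = decode_ext(memory[dst:dst + EXT_SIZE])
    b = decode_ext(memory[src:src + EXT_SIZE])
    memory[dst:dst + EXT_SIZE] = encode_ext(backend(a, b))


# Usage: two operands in a flat "memory", result replaces the destination operand.
memory = bytearray(2 * EXT_SIZE)
memory[0:EXT_SIZE] = encode_ext(1.5)
memory[EXT_SIZE:] = encode_ext(2.0)
numerics_core_mul(memory, src=EXT_SIZE, dst=0)
result = decode_ext(memory[0:EXT_SIZE])
print(result)
```

The point of the sketch is only that the caller never sees how `backend` works; swapping it for something that talks to external hardware would not change the caller, which is the separation the post describes.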

Thomas Richter · 09 January 2023, 15:52

Quote:

Originally Posted by Karlos

I think that question has been raised already. Someone asked about offloading the instruction to PPC (I think that's what they said), but certainly in that case there's a lot to contend with. Anything along the WarpOS route would be many orders of magnitude slower than the current software implementation.

Pretty much. For PPC offloading, you would again be slower than the 68882 solution because you need to communicate with the external CPU - some form of message passing is required. This does not pay off; it already killed the performance of PowerUp and WarpUp and made these hybrid PPC/68K solutions impractical.

Thomas Richter · 09 January 2023, 15:57

Quote:

Originally Posted by pandy71

I was curious about THOR's opinion on whether it is feasible, from his perspective, to separate the physical calculation implementation from the frontend, so that he is not responsible for any foreign bugs but can still control SoftIEEE as its owner and, for example, focus on his pure MC68K float implementation. So a semi-open standard.

See above. I'm as open as possible on the interface to make such a thing possible, and the interface of the library is as simple as it can be (two pointers to floating point numbers), but even if the actual computation would be immediate, there is still code between "your code" and the actual computation, and that is the MuRedox "trampoline code". It stores essential registers trashed by the softieee.library on the stack (d0-d2/a0-a1/a6, the ccr and the PC), loads the source operands (in the easiest case directly in the softieee.library) and calls the library.

Like it or not, this type of overhead will not go away, no matter how smart your hardware is, and it is already one order of magnitude larger than the 68060 hardware multiplication. Even with instant operation, you would be down to the speed of a 68882, and that is really a *very* optimistic estimate.
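The bound argued here can be written down in a couple of lines. This is only a sketch using the rough figures from this thread (about 20 cycles of MuRedox trampoline overhead, about 2 cycles for a native 68060 multiply); `best_case_slowdown` is a name made up for the illustration.

```python
def best_case_slowdown(call_overhead, native_cycles, compute_cycles=0):
    """Slowdown versus native hardware when every operation pays a fixed call overhead."""
    return (call_overhead + compute_cycles) / native_cycles


# Assumed figures from the thread: ~20 cycles trampoline, ~2 cycles native multiply.
# compute_cycles = 0 models an infinitely fast external unit.
slowdowns = [best_case_slowdown(20, 2, c) for c in (0, 20, 200)]
print(slowdowns)
```

Even with `compute_cycles=0` the ratio stays at the overhead divided by the native cost, so faster external hardware only shrinks the second term.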

Samurai_Crow · 09 January 2023, 16:51

@pandy71 That FPU you linked has a serial interface. That would probably limit performance on an 040+. On a 68000 though.... :-)

pandy71 · 09 January 2023, 22:53

Quote:

Originally Posted by Thomas Richter

As said before, it is technically possible. All you need to do is implement the softieee.library interface. This implementation would take the parameters and forward them to the hardware.

To be honest, I can't find the documentation for softieee.library, hence my question to you as the author and owner of the licence.

Quote:

Originally Posted by Thomas Richter

However, the resulting solution would still not be on par with a hardware FPU. Let's make a couple of computations: A hardware multiplication on the 68060 is ~2 cycles if I recall. The MuRedox call-in overhead is roughly one order of magnitude larger (~20 cycles), that of SoftIEEE through exception processing a lot larger (~200 cycles). On top of this, the softieee.library still has to forward the parameters to the hardware and perform the operation there. For example, for the 68882, you need to emulate the coprocessor interface in software (probably another 20 cycles) and then the 68882 has to execute the multiplication (which is another >20 cycles), so in the end, you are at about 60 to 100 cycles minimum. That's almost two orders of magnitude slower than the 68060.

I'm fully aware of this, but firstly an MC68060 from a reputable source costs more than 500$ today; secondly, if I understand correctly, the goal of this project is to make it possible to run poorly written software that cannot run without a physical HW floating point coprocessor.
The original 881/882 are rather slow HW FPUs, and FPU emulation on a typical MC68K will be even slower (due to, for example, the low clock).
A hybrid solution could fill the gap between high-priced, close to unobtainable HW from reputable sources and salvaged or fake chips...

Quote:

Originally Posted by Thomas Richter

The softieee.library multiplication engine is probably 200 cycles (just house numbers), so it is slower, but not that much slower. This is also the reason why the 68882-based "hardware accelerator" solutions were not really working well. The communication overhead to the FPU eats up the performance improvements of the FPU. The 68882 only works well with the 68020/030, where hardware implements the coprocessor interface.

So a 20 cycle multiplication with a clock of around 200MHz seems not too bad - and if I understand correctly, the software overhead will be exactly the same for a pure software or a hybrid solution?
The 68882 is OK but quite slow - slower even than an 80287 at half the clock - and the 040 and 060 still implement only a subset of the 881/882 instructions, so a hybrid FPU approach may still be beneficial even if it is not on par with a real HW FPU wired to the CPU through the coprocessor interface?

Quote:

Originally Posted by Thomas Richter

The softieee.library is a "numerics core". So it takes one or two extended precision floating point numbers in memory, and its "contract" defines that it places the result back in memory. SoftIEEE and MuRedox follow this contract. They do not care *how* the library does its job. This is "currently" an all-software implementation, but nobody stops you from implementing your own softieee.library. Such an alternative implementation would read the operands from memory, forward them to the hardware, and read the results back. Thus, while this construction would be faster, I doubt it would be *much* faster. I would expect a factor of 2 or 3 (see above for calculations), but that still places you one order of magnitude slower than native code on the 68060.

The speed would be, according to this estimate, approximately on par with the mathieeedoubbas.library.

I agree hybrid can be somewhere between pure SW and real HW.

Quote:

Originally Posted by Thomas Richter

See above. I'm as open as possible on the interface to make such a thing possible, and the interface of the library is as simple as it can be (two pointers to floating point numbers), but even if the actual computation would be immediate, there is still code between "your code" and the actual computation, and that is the MuRedox "trampoline code". It stores essential registers trashed by the softieee.library on the stack (d0-d2/a0-a1/a6, the ccr and the PC), loads the source operands (in the easiest case directly in the softieee.library) and calls the library.

Like it or not, this type of overhead will not go away, no matter how smart your hardware is, and it is already one order of magnitude larger than the 68060 hardware multiplication. Even with instant operation, you would be down to the speed of a 68882, and that is really a *very* optimistic estimate.

Currently the MC68882 seems to be available from reputable sources at somewhere between 40 and 140$, depending on package and clock, and it is still subpar in terms of delivered speed compared with other solutions...

Anyway thanks for your time and hard work.

Quote:

Originally Posted by Samurai_Crow

@pandy71 That FPU you linked has a serial interface. That would probably limit performance on an 040+. On a 68000 though.... :-)

As I pointed out, this was an example solution - a simple illustration that even an 8 bit embedded uC can get some help in a relatively easy way.
4MHz SPI can be replaced with 80MHz SPI or with a parallel interface - the problem with a real FPU for the Amiga is the high price if bought from reputable sources, or the high risk of a fake or faulty chip salvaged from some junk in China, India or Africa if bought on the internet...
The MC68000 can use the 881/882, as Motorola pointed out in their application note AN947, and a similar scheme could be used for hybrid emulation - nowadays there are many 4...6$ SoCs with a HW FPU (usually single precision) clocked at 100...400MHz.

This thread triggered my curiosity about the missing Amiga/Commodore documentation on this interesting topic - something like the Apple SANE documentation "Apple_Numerics_Manual_Second_Edition_1988.pdf"

Thomas Richter · 10 January 2023, 00:03

Quote:

Originally Posted by pandy71

To be honest, I can't find the documentation for softieee.library, hence my question to you as the author and owner of the licence.

Look, maybe that is too obvious, but the documentation is in the SoftIEEE.lha archive you get from Aminet. Where else would it be? Autodocs, pragmas, prototypes, all you need.

Quote:

Originally Posted by pandy71

So a 20 cycle multiplication with a clock of around 200MHz seems not too bad - and if I understand correctly, the software overhead will be exactly the same for a pure software or a hybrid solution?

That is only the call-overhead of MuRedox. Remember, you still need to connect to the actual hardware, fill its registers with the source operands, and get the result back. That is an order of magnitude slower than the 060.

Quote:

Originally Posted by pandy71

Currently the MC68882 seems to be available from reputable sources at somewhere between 40 and 140$, depending on package and clock, and it is still subpar in terms of delivered speed compared with other solutions...

Yes, it is an old chip. Yet, it includes the transcendental functions. Whether you need them is another question. The softieee.library implements them through CORDIC.

Quote:

Originally Posted by pandy71

The MC68000 can use the 881/882, as Motorola pointed out in their application note AN947, and a similar scheme could be used for hybrid emulation - nowadays there are many 4...6$ SoCs with a HW FPU (usually single precision) clocked at 100...400MHz.

This thread triggered my curiosity about the missing Amiga/Commodore documentation on this interesting topic - something like the Apple SANE documentation "Apple_Numerics_Manual_Second_Edition_1988.pdf"

Not sure what you expect, actually. The equivalent of Apple SANE is the mathieeedoubbas/doubtrans and singbas/singtrans libraries, and you find their autodocs in the RKRMs and the NDK. Or, if you like, the autodocs of softieee, which provides something similar to Apple SANE (actually, softieee is much closer to SANE than mathieeedoubbas/trans are).

The mathffp/mathtrans libraries are based on Motorola library code for math functions.

pandy71 · 10 January 2023, 19:38

Quote:

Originally Posted by Thomas Richter

Look, maybe that is too obvious, but the documentation is in the SoftIEEE.lha archive you get from Aminet. Where else would it be? Autodocs, pragmas, prototypes, all you need. That is only the call-overhead of MuRedox. Remember, you still need to connect to the actual hardware, fill its registers with the source operands, and get the result back. That is an order of magnitude slower than the 060. Yes, it is an old chip. Yet, it includes the transcendental functions. Whether you need them is another question. The softieee.library implements them through CORDIC.

Apologies, I downloaded SoftIEEE.lha not from Aminet (the recent version) but from your opening message.
With respect to the 060 - yes, but if you have a full 060, then it seems this package is not for you but for people with an LC060.

Quote:

Originally Posted by Thomas Richter

Not sure what you expect, actually. The equivalent of Apple SANE is the mathieeedoubbas/doubtrans and singbas/singtrans libraries, and you find their autodocs in the RKRMs and the NDK. Or, if you like, the autodocs of softieee, which provides something similar to Apple SANE (actually, softieee is much closer to SANE than mathieeedoubbas/trans are).

The mathffp/mathtrans libraries are based on Motorola library code for math functions.

To be honest I don't expect anything - it was just an example of what I could hypothetically have expected if Commodore had, by accident, been a serious company.

Thx!

shelter · 20 January 2023, 11:47

Just a heads up, with SoftIEEE enabled, MacOS crashes in Shapeshifter.

amifan · 27 January 2023, 00:07

Has anyone tried this with a TF1260 LC and Lightwave 3.5 FPU version? I followed the instructions for installation but get a guru error when running Lightwave. Not sure what to do next.

mfilos · 27 January 2023, 06:36

Quote:

Originally Posted by shelter

Just a heads up, with SoftIEEE enabled, MacOS crashes in Shapeshifter.

I'm under the impression that you need to disable the FPU before running ShapeShifter, as it uses a SoftFPU as well.

At least in my Vampire, I disable it before running it via a script.
