English Amiga Board


Old 24 November 2022, 22:18   #101
OldB0y
Registered User
 
Join Date: Jan 2009
Location: Letchworth/UK
Posts: 86
This is both hilarious and amazing at the same time.

I can now run Quake on my A4000 with its FPU-less LC060 A3660 - and it's slower than when I first tried the leaked unofficial Amiga Quake port on my A1200 with a 68882-equipped Blizzard 1230 II, lol.

But it does work!
OldB0y is offline  
Old 24 November 2022, 22:30   #102
alenppc
Registered User
 
Join Date: Apr 2012
Location: Canada
Age: 44
Posts: 910
Quote:
Originally Posted by Thomas Richter View Post
No matter what, I thank you a lot for helping me, and it seems we even found something interesting and new about the 68060 masks that does not seem to be documented anywhere else.

Just tested this last version and did not get any hits at all. I tried both Quake and Quake2.


The earlier version that did give me hits gave me so many with Q2 that it filled the entire RAD: drive (the log was over 800k, although I did not post that one).
alenppc is offline  
Old 24 November 2022, 22:31   #103
alenppc
Registered User
 
Join Date: Apr 2012
Location: Canada
Age: 44
Posts: 910
Quote:
Originally Posted by OldB0y View Post
This is both hilarious and amazing at the same time.

I can now run Quake on my A4000 with its FPU-less LC060 A3660 - and it's slower than when I first tried the leaked unofficial Amiga Quake port on my A1200 with a 68882-equipped Blizzard 1230 II, lol.

But it does work!

You can grab NovaCoder's softfloat version (posted in the Quake thread) and it should be slightly faster, although the 3660 is an awful card, so probably not by much.
alenppc is offline  
Old 24 November 2022, 22:35   #104
DisasterIncarna
Registered User
 
DisasterIncarna's Avatar
 
Join Date: Oct 2021
Location: England
Posts: 1,237
what witchcraft is next? btw i'm guessing this will merge with your mmu libs? or is it its own thing?
DisasterIncarna is offline  
Old 25 November 2022, 05:43   #105
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 3,302
Quote:
Originally Posted by alenppc View Post
Just tested this last version and did not get any hits at all. I tried both Quake and Quake2.
Thanks, so I guess we're done then. Interesting CPU bug - it only seems to affect fmovem <mem>,register-list. Just as a precaution, I also added the same workaround to fmovem <mem>,control-registers and frestore. The remaining instructions seem to be unaffected.


This is why it is so important to run tests on real hardware...


Again, this was very helpful, thanks a lot!
Thomas Richter is offline  
Old 25 November 2022, 05:44   #106
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 3,302
Quote:
Originally Posted by DisasterIncarna View Post
what witchcraft is next? btw i'm guessing this will merge with your mmu libs? or is it its own thing?

No, this program will remain stand-alone. It is only loosely related to the mmu library and does not depend on it. It supports it, though.


Before you ask: No, you cannot emulate a MMU by software with a similar trick.
Thomas Richter is offline  
Old 25 November 2022, 07:05   #107
Michael
A1260T/PPC/BV/SCSI/NET
 
Michael's Avatar
 
Join Date: Jan 2013
Location: Moscow / Russia
Posts: 840
So, demos like https://www.pouet.net/prod.php?which=2308 are not possible with this.
Michael is offline  
Old 25 November 2022, 07:38   #108
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 3,302
It depends on what you mean by "possible". Real-time? No. Running - yes, provided the thing does not kill the operating system.

But that's really not the point of the software (but then, I never got the point of demos in the first place).
Thomas Richter is offline  
Old 25 November 2022, 08:51   #109
Michael
A1260T/PPC/BV/SCSI/NET
 
Michael's Avatar
 
Join Date: Jan 2013
Location: Moscow / Russia
Posts: 840
Well, Impossible fails with a guru here.

Another creation from the same coders, also FPU-based:
https://www.pouet.net/prod.php?which=2306

This one mostly works, but is extremely slow in places -
below 1 fps, where a proper 060 keeps a good frame rate.

So I suspect that both are system-friendly, since the latter
demo also works on RTG, and you can't kill it there.
Michael is offline  
Old 25 November 2022, 08:59   #110
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 3,302
Quote:
Originally Posted by Michael View Post
Well, Impossible fails with a guru here.

Same as above. Please run SegTracker, Sashimi and MuForce with the DISPC option, redirect the output to RAD:, and post it here. I cannot do anything about it without knowing further details.
Thomas Richter is offline  
Old 25 November 2022, 18:21   #111
alenppc
Registered User
 
Join Date: Apr 2012
Location: Canada
Age: 44
Posts: 910
I tried both of these demos and have not seen any crashes or MuForce hits.


Annoyingly, however, "Impossible" is one of those demos that only presents a black screen on NTSC machines. Even if I switch the WB to a PAL screenmode, the demo will still start in the machine's native video mode (NTSC) and simply play music but not display anything. No MuForce hits, however.

@Thomas, please check your PMs.
alenppc is offline  
Old 26 November 2022, 03:45   #112
Samurai_Crow
Total Chaos forever!
 
Samurai_Crow's Avatar
 
Join Date: Aug 2007
Location: Waterville, MN, USA
Age: 49
Posts: 2,193
Quote:
Originally Posted by Thomas Richter View Post
As a toy project to play with "why not", but as a realistic system design, the answer is quite simple: With software emulation, you would go over many cycles of execution and instruction interpretation just for a single vector instruction, thus there is nothing to be gained by this approach. It will be just slower than multiple scalar 680x0 instructions.
You make an incorrect assumption about doing multiple scalar ops to mimic a vector. I was going to do actual vector ops in hardware on a CPLD chip. If I used a fixed ABI for the vector chip, I could just use Assembly macros to interface with the CPLD. I guess only a custom scalar emulation would be necessary after all.
Samurai_Crow is offline  
Old 26 November 2022, 11:00   #113
Karlos
Alien Bleed
 
Karlos's Avatar
 
Join Date: Aug 2022
Location: UK
Posts: 4,401
Quote:
Originally Posted by Samurai_Crow View Post
You make an incorrect assumption about doing multiple scalar ops to mimic a vector. I was going to do actual vector ops in hardware on a CPLD chip. If I used a fixed ABI for the vector chip, I could just use Assembly macros to interface with the CPLD. I guess only a custom scalar emulation would be necessary after all.
Ok, but the trap overhead of reaching your proposed implementation still exists.
Karlos is online now  
Old 26 November 2022, 11:45   #114
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 3,302
Quote:
Originally Posted by Samurai_Crow View Post
You make an incorrect assumption about doing multiple scalar ops to mimic a vector. I was going to do actual vector ops in hardware on a CPLD chip.
Correct, but how do you feed the chip? The trouble is that you need many more instructions to feed the chip with data, namely going through the exception processing, than it would actually take to process each element of the vector manually. This would only make sense if the vectors are hundreds of bytes long and the chip could read them via DMA, since the fixed trap overhead has to be smaller than the cost of processing the data scalar-wise. Thus, to give you a practical example: Even on a 68881, multiplying four numbers takes approximately 200 cycles. Going through the exception processing takes probably 1000 cycles. Even if the actual vector processor takes only a single cycle, going through the exception processing is still slower than just scalar processing.
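To put rough numbers on the break-even point implied by those figures, here is a minimal C calculation. The 200- and 1000-cycle estimates are the ones quoted above; the one-cycle-per-element cost of the vector unit is an assumed best case, not a measured figure.
Code:
#include <stdio.h>

int main(void)
{
    const double scalar_cycles_per_elem = 200.0 / 4.0; /* ~50 cycles per 68881 multiply (estimate above) */
    const double trap_overhead          = 1000.0;      /* cycles spent in exception processing (estimate above) */
    const double vector_cycles_per_elem = 1.0;         /* assumed best case inside the external chip */

    /* trap + n*vector < n*scalar   =>   n > trap / (scalar - vector) */
    double breakeven = trap_overhead / (scalar_cycles_per_elem - vector_cycles_per_elem);
    printf("vector must hold more than %.0f elements to beat scalar code\n", breakeven);
    return 0;
}
With these numbers the vector has to be longer than about 20 elements (a couple of hundred bytes of extended-precision operands) before the trap-plus-chip route wins, which is the "hundreds of bytes" ballpark mentioned above.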
Thomas Richter is offline  
Old 26 November 2022, 15:05   #115
Samurai_Crow
Total Chaos forever!
 
Samurai_Crow's Avatar
 
Join Date: Aug 2007
Location: Waterville, MN, USA
Age: 49
Posts: 2,193
Quote:
Originally Posted by Thomas Richter View Post
Correct, but how do you feed the chip?
Memory-mapped I/O to absolute addresses whose destination is specified in MOVE16 operations.
Quote:
Originally Posted by Thomas Richter View Post
The trouble is that you need many more instructions to feed the chip with data, namely going through the exception processing, than it would actually take to process each element of the vector manually. This would only make sense if the vectors are hundreds of bytes long and the chip could read them via DMA, since the fixed trap overhead has to be smaller than the cost of processing the data scalar-wise. Thus, to give you a practical example: Even on a 68881, multiplying four numbers takes approximately 200 cycles. Going through the exception processing takes probably 1000 cycles.
The 68881 was neither pipelined nor parallel.
Quote:
Originally Posted by Thomas Richter View Post
Even if the actual vector processor takes only a single cycle, going through the exception processing is still slower than just scalar processing.
More incorrect assumptions. Exception handling would only be needed for in-order non-load-store operations. Thus only extracting values from the vector unit justifies a wait state.
Samurai_Crow is offline  
Old 26 November 2022, 15:23   #116
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 3,302
Quote:
Originally Posted by Samurai_Crow View Post
Memory-mapped I/O to absolute addresses whose destination is specified in MOVE16 operations.
Then assume that whoever wants to use the chip writes to memory-mapped registers. That is much faster than going through the emulator trap.
Quote:
Originally Posted by Samurai_Crow View Post
68881 was not pipelined nor parallel.
You don't understand what I'm trying to say. It does not matter how parallel a processor is. Using the 68881 would still outperform an I/O mapped vector processor if the interface to use this processor goes through an emulation trap, unless the vectors are really large.
Quote:
Originally Posted by Samurai_Crow View Post
More incorrect assumptions. Exception handling would only be needed for in-order non-load-store operations. Thus only extracting values from the vector unit justifies a wait state.
If you want to interface the chip with assembler instructions, that's an emulator trap. And that's simply not a good idea, that's all I'm trying to tell you. It is just going to be slower than running through a library vector, which is the suggested interface, and a much quicker one. Tools like MuRedox are there just to prevent the emulator trap.
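Purely to illustrate the contrast being drawn here - a program writing the memory-mapped registers directly, rather than letting an unimplemented-instruction trap do the feeding - a compile-only C sketch might look like this. The base address, register layout and command code are invented for illustration; a real board would define its own.
Code:
#include <stdint.h>

/* Hypothetical memory-mapped vector unit; addresses and layout are made up. */
#define VEC_BASE   ((volatile float *)0x40000000)
#define VEC_SRC_A  (VEC_BASE + 0)                         /* 4 input operands A   */
#define VEC_SRC_B  (VEC_BASE + 4)                         /* 4 input operands B   */
#define VEC_CMD    ((volatile uint32_t *)(VEC_BASE + 8))  /* command register     */
#define VEC_RESULT (VEC_BASE + 12)                        /* 4 results            */

#define CMD_MUL4   1u   /* hypothetical "multiply 4 elements" command */

static void vec_mul4(const float *a, const float *b, float *out)
{
    for (int i = 0; i < 4; i++) {   /* feed the operands...            */
        VEC_SRC_A[i] = a[i];
        VEC_SRC_B[i] = b[i];
    }
    *VEC_CMD = CMD_MUL4;            /* ...kick off the operation...    */
    for (int i = 0; i < 4; i++)     /* ...and read the results back    */
        out[i] = VEC_RESULT[i];
}
This direct access costs a handful of moves per call; routing the same work through an emulator trap adds the whole exception-processing path on top of it, which is the point being made above.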
Thomas Richter is offline  
Old 27 November 2022, 03:40   #117
Samurai_Crow
Total Chaos forever!
 
Samurai_Crow's Avatar
 
Join Date: Aug 2007
Location: Waterville, MN, USA
Age: 49
Posts: 2,193
@Thomas Richter
Mapping opcodes to the coprocessor interface would definitely be preferred. However, some of the general-purpose coprocessor circuits present in the 68020 and 68030 were offloaded to an external chip on the 68040 and 68060. Unless that chip design gets re-released as an FPGA softcore, I'd be unable to replicate it without going all Gunnar von Boehn and hardwiring the floating-point vectors into the CPU core and ditching the 68LC040 chip altogether. Furthermore, adding the coprocessor softcore to the vector unit would probably increase the size of the vector unit such that it would require a full-fledged FPGA instead of a CPLD. (I kind of doubt it would fit in a CPLD anyway, but that raises the cost nonetheless.)
Samurai_Crow is offline  
Old 27 November 2022, 08:42   #118
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 3,302
Sorry, I'm completely confused about what you are proposing here. First, you say "coprocessor circuits present in the 68020 and 68030 were offloaded to an external chip on the 68040 and 68060". Nothing was "offloaded to an external chip" there. Actually, nothing was offloaded to external hardware. The missing opcodes are offloaded to external software (the fpsp.resource). So do you want to say "I plan an external chip that replaces the fpsp.resource"? If so, SoftIEEE has no role here. This process works (a bit) differently from SoftIEEE, and it would need to go through a CPU library, or lacking this, an fpsp.resource.

What has Gunnar to do with all this? Even if you disable the FPU on his design, the FPU remains active for elementary math and would continue to process data for such elementary operations in only 56 bits rather than the full precision offered by SoftIEEE. Thus, at best you can offload some transcendental functions to an external chip, but whether it makes sense to go through the emulator trap rather than his "millicode" I cannot judge.

Third, what has all this to do with a vector unit, and how does SoftIEEE play into this? As said, going through an emulator trap does not make sense; it would only be slower than scalar math operations carried out multiple times, so as a software interface to an external chip it makes little sense.

If you propose to use SoftIEEE as some kind of "prototype system" where you catch (lacking hardware) the instructions by software - well, you can do that as of today. It would make sense there as a temporary solution just to test the chip until the full interface becomes available in silicon. Just implement a softieee.library. Will I do that? No - that's not the purpose of the project, but the interface is open and documented, so it is doable, and I can help you understand how the interface works.

If you plan to do that as an external chip for the 68LC040 to provide an FPU - that is possible, though again not exactly fast, so I'm not sure how competitive such a design could be.

If you plan that as an external chip for Gunnar's 68EC080, I guess you better talk to Gunnar to get it linked to the system as some sort of coprocessor interface. Good luck with that. The chip currently lacks the ability to re-route all FPU instructions, and even if you can re-route the transcendental functions as a subset to an external chip, it would likely not perform very well, but that's not my problem at all.

Last but not least, I doubt any sort of FPU can be implemented on a CPLD; these chips are much too tiny for such complex operations. You can probably implement CORDIC logic in an FPGA to get the missing functions, but you would still need to find a way to interface this chip to either a Mot chip or Gunnar's EC080. For the 68040 and related chips, there is no coprocessor interface, thus some software layer is necessary. Yes, SoftIEEE can do that (minus vector instructions), and the answer is that you then need the right softieee.library. Doable: read the documentation, then ask me in case you have additional questions.

Thus, to conclude: Please write a concise project proposal of what exactly you are attempting to do. I cannot really make much sense of what you have written so far - sorry.
Thomas Richter is offline  
Old 27 November 2022, 09:15   #119
Karlos
Alien Bleed
 
Karlos's Avatar
 
Join Date: Aug 2022
Location: UK
Posts: 4,401
I think, to summarise, SoftIEEE depends on the unimplemented-instruction exception handling in order to intercept any unimplemented 6888x instruction the CPU encounters. The mechanism is used to hand over to a software emulation of the missing operation. Much of the overhead is in the exception processing itself, so even if you had an external hardware device that could perform the operation itself, the benefit would be minimal. For a traditional SIMD unit, where the onus is on throughput, the proposition would only turn a computational profit for vectors that are very large, and which the external unit would also need to load and store by itself, in order to be faster than a software-only solution.

Vector stuff aside, this notion that the exception overhead is dominant is why I'm curious about the applicability of using the exception trap to patch the caller with a direct call to a handler function in a manner similar to OxyPatcher/CyberPatcher.
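As a toy model of the interception described above (before any caller patching), the following C sketch shows roughly where the per-instruction work sits once the exception has already been taken. This is not SoftIEEE's actual code - the dispatch table, handler signatures and the assumption that the opmode sits in the low 7 bits of the extension word are only there to illustrate the decode-dispatch-emulate-write-back cycle.
Code:
#include <stdio.h>
#include <stdint.h>
#include <math.h>

typedef double (*fpu_emul_fn)(double src, double dst);

static double emul_fsin(double src, double dst) { (void)dst; return sin(src); }
static double emul_fmul(double src, double dst) { return dst * src; }

/* Dispatch on the 7-bit opmode field of the F-line extension word
   (0x0E = FSIN, 0x23 = FMUL on the 6888x). */
static fpu_emul_fn dispatch(uint16_t ext_word)
{
    switch (ext_word & 0x7F) {
    case 0x0E: return emul_fsin;
    case 0x23: return emul_fmul;
    default:   return NULL;   /* genuinely unimplemented: punt */
    }
}

int main(void)
{
    /* Pretend the CPU just trapped on an FSIN with source operand 1.0.
       In the real mechanism, taking the exception, decoding the faulting
       instruction and fetching/writing back the operands is where most
       of the cycles go, not the arithmetic itself. */
    uint16_t ext_word = 0x000E;
    fpu_emul_fn fn = dispatch(ext_word);
    if (fn)
        printf("emulated result: %f\n", fn(1.0, 0.0));
    return 0;
}
The arithmetic in the handler is cheap; the fixed cost of getting into and out of it is what a patch-the-caller scheme like the one mentioned above would try to remove.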
Karlos is online now  
Old 27 November 2022, 09:25   #120
Thomas Richter
Registered User
 
Join Date: Jan 2019
Location: Germany
Posts: 3,302
Please make that MuRedox, as the above projects are dead as a dodo. Updating MuRedox is next on my list, but for that, SoftIEEE first needs to become stable, and I need input on that - which is exactly the purpose of this thread.

This said, even with MuRedox there is an overhead, namely copying the sources in and the targets out, and interfacing the emulating library. It may cut the number of instructions in the emulation path down to perhaps one tenth of the current count, but it's still very noticeable. Even 20 instructions (in reality, it is more, even with MuRedox in place) plus one vector instruction is a noticeable overhead compared to 4 scalar instructions. Thus, you would really need larger vectors, even with MuRedox in place.
Thomas Richter is offline  
 

