English Amiga Board


English Amiga Board > Coders > Coders. Asm / Hardware

 
 
Old 19 June 2014, 22:59   #81
Megol
Registered User

Megol's Avatar
 
Join Date: May 2014
Location: inside the emulator
Posts: 342
While supporting all FPU instructions would be expensive, there are ways to reduce the cost - simplifying the shifters and multipliers would reduce resource use at a performance penalty. Deciding what should be done in microcode and what in dedicated hardware (multi-cycle or not) is probably an optimization problem in itself: too little hardware requires a larger microcode ROM and sequencer, while minimal microcode support requires more and more complex hardware.

Is anyone aware if the 68881 has any hardware patents or documentation hinting at the microcode size?
Megol is offline  
Old 20 June 2014, 00:14   #82
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Quote:
Originally Posted by Megol View Post
While supporting all FPU instructions would be expensive there are ways to reduce the expenses - simplifying shifters and multipliers would help reduce resource use with a performance penalty. Deciding on what should be done in microcode and what should be done in dedicated hardware (multi-cycle or not) is probably an optimization problem in itself. Too little hardware -> require larger microcode ROM and sequencer, smallish microcode support -> requires more and more complex hardware.
The easiest way to reduce the supported instructions is to emulate a 68040/68060 FPU (with FINT/FINTRZ please). Any other hardware FPU instructions could probably be added without any problems for a 68040.library or 68060.library emulation of the 6888x instructions. The unimplemented FPU instruction trap/exception would simply never be generated for instructions supported in hardware.

It does *not* make sense to emulate a 68881/68882 fully. The resource cost is high and some of the code is tricky enough to implement even in software. I've been optimizing some 68060fpsp trig and log functions (probably similar algorithms to the 6888x microcode). They are usually a table look-up combined with polynomial equations (higher clocked processors often use Taylor series loops or similar to minimize memory accesses). The order of fp math operations can make a huge difference in accuracy in some cases. I expect it would be difficult to write these functions in a smallish, simplified microcode implementation. I have found significant optimizations, including simple rearrangement of equations (basic algebra). Even math PhDs aren't necessarily thorough.

Quote:
Originally Posted by Megol View Post
Is anyone aware if the 68881 have any hardware patents or documentation hinting at the microcode size?
I didn't see anything in the MC68881UM except for the constant ROM size (22 extended precision constants of 12 bytes each). The manual does speak of the microcode and internal configuration but doesn't go into much detail. Would the microcode size include tables? Perhaps what you are after is the internal ROM size for estimating the cost in an fpga?
matthey is offline  
Old 20 June 2014, 07:50   #83
JimDrew
Registered User

 
Join Date: Dec 2013
Location: Lake Havasu City, AZ
Posts: 600
I don't think anyone cares what the resource cost is. There is about 24,000 LEs left after AGA Amiga emulation is loaded.
JimDrew is offline  
Old 20 June 2014, 10:55   #84
Gunnar
Registered User

 
Join Date: Apr 2014
Location: Germany
Posts: 154
Quote:
Originally Posted by JimDrew View Post
I don't think anyone cares what the resource cost is. There is about 24,000 LEs left after AGA Amiga emulation is loaded.
I think in this size range you still care.
There are of course FPGA boards with 100,000 LE FPGAs.
There, spending 2K LE more or less - or spending 16K of ROM - matters little.

24,000 LE is about exactly the size you need to include a fully pipelined 68060-like FPU and a good superscalar CPU core...
Gunnar is offline  
Old 20 June 2014, 22:54   #85
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
A full 6888x compatible shallow pipelined FPU as a companion to the TG68 should fit but the performance would probably be lacking. An FPU needs a deeper pipeline than the integer CPU, especially on a 68k FPU where practically all calculations are done in 80 bit extended precision. The 6888x internally has a 1-67 bit barrel shifter (1 cycle in 6888x) that would probably not be very fast in fpga either. This is all offset by faster memory, higher clock speeds and single chip proximity to the CPU so maybe the performance would be acceptable. I believe it would make sense to start with a 68060 compatible FPU as a first step. Even if the resources aren't important, the 68060 FPU contains the core components needed for a more 6888x compatible FPU (plus the FSxxx and FDxxx instructions which I recommend leaving in for 6888x emulation). We could emulate a 68040 with working FPU sooner than a 68030+6888x this way. Emulating a 68040 may be better for Macintosh support anyway.

@JimDrew
Did you ever do an analysis of which instructions are used on the 68040 Macintosh? How often are CAS.L, CAS.W, CAS.B, CAS2.L, CAS2.W and TAS used? Knowing how common the different instruction sizes for CAS and CAS2 are could also be helpful. For example, CAS.L may be supported in a future 68k CPU as it's still relevant in a modern CPU, but CAS.W/CAS.B could be trapped if, as I suspect, they are rarely used. I know CAS2 was used on the Macintosh despite a bug in some 68040 processors, but wasn't this CAS2.L, and was CAS2.W ever used? How did you handle these RMW instructions, which were incompatible with the Amiga chip set and recommended against by C=? I expect they would be safe enough if you just made sure the code was in fast memory and not chip memory?

How often were MOVEP (any size and type), CMP2.L, CMP2.W, CMP2.B, MOVE16, PACK and UNPK used? Did any programs use packed decimal in the FPU?

Did the MacOS (supervisor mode) make use of MOVES or CHK2? Did you patch any evil instructions in Mac programs? If so, at program start or dynamically (like OxyPatcher)?
matthey is offline  
Old 21 June 2014, 03:44   #86
JimDrew
Registered User

 
Join Date: Dec 2013
Location: Lake Havasu City, AZ
Posts: 600
Quote:
Originally Posted by Gunnar View Post
I think in this size range you still care.
There are of course FPGA boards which have 100,000 LE FPGAs.
Then spending 2K LE more or less - or spending 16K ROM - matters little.

24.000 LE is about the size that you exactly need to include a fully pipelined 68060-like FPU and a good super scalar CPU core...
Well, subtract the current size of the TG68 core and that would give you the remainder free with the Replay board after adding the 060 core.

Quote:
Originally Posted by matthey View Post
@JimDrew
Did you ever do an analysis of which instructions are used on the 68040 Macintosh? How often are CAS.L, CAS.W, CAS.B, CAS2.L, CAS2.W and TAS used? Knowing how common the different instruction sizes for CAS and CAS2 could be helpful also. For example, CAS.L may be supported in a future 68k CPU as it's still relevant in a modern CPU but CAS.W/CAS.B could be trapped if rarely used as I suspect. I know CAS2 was used on the Macintosh despite a bug in some 68040 processors but wasn't this CAS2.L and was CAS2.W ever used? How did you handle these RMW instructions that were incompatible with the Amiga chip set and recommended against by C=? I expect they would be safe enough if you just made sure the code was in fast memory and not chip memory?

How often were MOVEP (any size and type), CMP2.L, CMP2.W, CMP2.B, MOVE16, PACK and UNPK used? Did any programs use packed decimal in the FPU?

Did the MacOS (supervisor mode) make use of MOVES or CHK2? Did you patch any evil instructions in Mac programs? If so, at program start or dynamically (like OxyPatcher)?
When I did the Mac emulation for the PC, I did a bunch of 680x0 instruction frequency tests. I don't know if I still have those, but I just might! I was curious about what instructions were used and how often.

There is a lot of 040-specific code in the Mac OS. They didn't ever make an MMU-less 040 Mac, but they did make FPU-less (LC) versions. There were no EC (no MMU/FPU) versions ever made. So quite a few of the 040-specific instructions were used frequently, along with constant MMU manipulation. MOVE16 was definitely number one. PACK/UNPK instructions were used, along with odd FPU instructions (also emulated). If the Mac didn't have the FPU, a pack (library) was loaded that emulated the FPU opcodes via a trap. Apple exploited every oddball opcode that was rarely used on the Amiga.

Funny thing though: the OS was written so cache clears were not needed that often, but the 060 totally blows up on the MacOS. Man, what a mess! I had to patch the crap out of the OS at boot time just to get the Mac to boot - and SuperScalar had to be off, only the instruction cache could be on, and some other weird things had to be done. By the time it was done and working, the 060 was crippled into a speedy 040. It was hardly worth the effort. I did patch things on the fly like OxyPatcher, but the problem is that the MacOS reloads resources with its virtual memory and uses handles instead of fixed memory addresses, so it was like playing whack-a-mole with the OS while it was running.

Last edited by TCD; 21 June 2014 at 08:47. Reason: Back-to-back posts merged
JimDrew is offline  
Old 21 June 2014, 06:21   #87
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Quote:
Originally Posted by JimDrew View Post
When I did the Mac emulation for the PC, I did a bunch of 680x0 instruction frequency tests. I don't know if I still have those, but I just might! I was curious about what and how often instructions were used.
If you have the old logs, that would be great. I altered an Amiga disassembler to gather stats on instructions and addressing modes for the Amiga (it works on Amiga executables). I can also log trapped instructions used on the 68060. Fortunately, Amiga compilers and most Amiga software don't use many of the oddball instructions. MOVEP is the most common, and even it appears only rarely, in old 68000 assembler software. The MMU is not touched except by a few specific programs. This should make it easier to get a basic fpga CPU up and running.

Quote:
Originally Posted by JimDrew View Post
There is a lot of 040 specific code in the Mac OS. They didn't make a 040 MMU-less Mac ever, but they did make FPU-less (LC) versions. There were no EC (no MMU/FPU) versions ever made. So, there were quite a few of the 040-specific instructions used frequently along with constant MMU manipulation. MOVE16 definitely number one. PACK/UNPK instructions were used, along with odd FPU instructions (and also emulated). If the Mac didn't have the FPU a pack (library) was loaded that emulated the FPU opcode via a trap.
All the MMU code may be the killer as far as getting Mac emulation going on Apollo. The 68040/68060 MMU implementation is good but more "deluxe" than Gunnar wants to support. Maybe he would change his mind if a TG68 implementation in the fpgaArcade was low enough overhead. I would also start with the 68040/68060 compatible MMU rather than the 68030 MMU if I was MikeJ. A 68040/68060 compatible MMU+FPU is less work and gives more modern support.

MOVE16 and PACK/UNPK should not be too difficult to handle. MOVE16 might even be worth implementing in hardware (though I don't recommend it for user programs). I don't like that the FPU's packed decimal format is used, however. I've never seen packed decimal used on the Amiga before, and we considered reusing the encoding space. Packed decimal is trapped in the 68040 FPU anyway, so it would have been faster to load a library to do it all in software, with or without a 68040 FPU. It might have been there to support older 68030 Mac software, though.

Quote:
Originally Posted by JimDrew View Post
Apple exploited every odd ball opcode that was rarely used on the Amiga. Funny thing though is that the OS was written so CacheClears were not needed that often, but the 060 totally blows up on the MacOS. Man, what a mess! I had to patch the crap out of the OS at boot time just to get the Mac to boot - and SuperScalar had to be off, only the instruction cache could be on, and some other weird things had to be done. By the time it was done and working, the 060 was crippled to be a speedy 040. It was hardly worth the effort. I did patch things on the fly like OxyPatcher, but the problem is that the MacOS reloads resources with it virtual memory and uses handles instead of fixed memory addresses, so it was like playing wack-a-mole with the OS while it was running.
Turning the 68060 superscalar mode and instruction cache off would give less than 1/2 the performance, maybe more like a 68030. You could use the half data cache mode to give the same data cache size as the 68040 but I bet the problem was the MMU page descriptors that can't be cached on the 68060. Disabling virtual memory may have fixed that problem, if possible, and would be much preferable to disabling the data cache (it would probably allow the branch cache to be turned on also). The 68060 has good compatibility to the 68040 in user mode but supervisor mode can require some adjustments, especially if getting fancy like the 68040 Mac.
matthey is offline  
Old 21 June 2014, 08:46   #88
JimDrew
Registered User

 
Join Date: Dec 2013
Location: Lake Havasu City, AZ
Posts: 600
Benchmarks were still faster than 25MHz 040 with a 50MHz 060, but nowhere near Amiga speeds.
JimDrew is offline  
Old 21 June 2014, 09:38   #89
Gunnar
Registered User

 
Join Date: Apr 2014
Location: Germany
Posts: 154
Quote:
Originally Posted by JimDrew View Post
Benchmarks were still faster than 25MHz 040 with a 50MHz 060, but nowhere near Amiga speeds.
Apollo/Phoenix has two separate caches (ICache and DCache).
But the ICache snoops DCache updates.
This means the two caches are coherent.

I assume that this will fix all those problems - and even with a 64KB cache it would cause no problem.
Gunnar is offline  
Old 21 June 2014, 10:31   #90
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Quote:
Originally Posted by JimDrew View Post
Benchmarks were still faster than 25MHz 040 with a 50MHz 060, but nowhere near Amiga speeds.
That's better than I would expect. Fusion/ShapeShifter don't feel slow on my 68060, although ShapeShifter feels faster (I own a BlitterSoft CD copy of the Fusion/PCx bundle). I wonder if ShapeShifter can run with the DCache turned on. Yaqube's 68060 board for the fpgaArcade destroys the 68040 and is outperforming the early PPC Macs.

I'll take an fpgaArcade+68060 board if you can get me one. I have a spare Rev 6 68060 waiting here, so I don't need the 68060.

Quote:
Originally Posted by Gunnar View Post
Apollo/Phoenix has two seperate caches (icache and Dache).
But the ICache snoops DCache updates.
This means the two caches are coherent.

I assume that this will fix all problems - and even with 64KB cache it would give no problem.
The problems are probably related to cached data of logical/virtual addresses that have been swapped out by the MacOS virtual memory system. The 68060 needs some data (page descriptors) to be in uncached memory (not a problem on the 68040). Of course, no MMU means no virtual memory, which probably solves this problem, but then there is the mess of MMU and MOVES instructions that need to be supported or patched. That's my guess anyway. Now if you decided to support a 68040/68060 compatible MMU with the Apollo, it would solve a lot of problems.
matthey is offline  
Old 21 June 2014, 11:14   #91
Gunnar
Registered User

 
Join Date: Apr 2014
Location: Germany
Posts: 154
Quote:
Originally Posted by matthey View Post
Now if you decided to support a 68040/68060 compatible MMU with the Apollo, it would solve a lot of problems.
An MMU 100% like the 68040 or 68060 MMU is, in my opinion, not suited for today's computing environment and today's program sizes.

The MMU entries of the 68040 were designed to map a whopping 256 KB of memory.

If we implement an MMU today, then I would want an MMU which also works well for much bigger programs.

The industry today goes for MMU designs supporting two page sizes, 4KB and a much bigger one. It would make sense to follow this lead.

Hi Jim.

Quote:
Originally Posted by JimDrew View Post
- and SuperScalar had to be off,
This is interesting.
Do we understand or can you explain why turning SuperScalar off was needed?

Last edited by TCD; 21 June 2014 at 11:25. Reason: Back-to-back posts merged
Gunnar is offline  
Old 21 June 2014, 13:09   #92
Photon
Moderator
Photon's Avatar
 
Join Date: Nov 2004
Location: Hult / Sweden
Posts: 4,589
This thread seems to have gone from a software question to a future hardware wish-list, but never mind.

This is how I see the discussion. It is the same as in 1990, when I rang up Commodore and tried to convince them that a byte-per-pixel mode would be the future and result in faster graphics.

Problem: Your chunky gfx card already works in Workbench, but you want more. Therefore, you want game and demo coders to code for your gfx card.

Problem: Your gfx card wasn't sold in enough numbers. Virtually nobody will enjoy the demo or play the game the coder spent weeks and months on. Coders aren't enticed.

Solution 1: Basically sell some gfx card for postage to anyone who wants it. Hand them out at computer meetings. I could see an open-source solution where you make a very simple card with a blank logic gate chip and basic bus interface, where people can download "cores" and install themselves.

Solution 2: Basically swap out all the hardware to make people want the new hotness and pay. This has been attempted and has not resulted in enough numbers to become a target demo/game platform. I could see a solution in the form of an Amiga mobo remake in modern components ("future-secure Amiga" as selling point) where you sneak in transparent enhanced gfx modes.

Now, regarding the original topic, whose C2P won the compo?
Photon is offline  
Old 21 June 2014, 14:06   #93
Gunnar
Registered User

 
Join Date: Apr 2014
Location: Germany
Posts: 154
Quote:
Originally Posted by Photon View Post
Solution 2: Basically swap out all the hardware to make people want the new hotness and pay.
Hi Photon,

From a hardware side I see two options:

A) For existing classic AMIGAs a turbo card which provides
- 100 Mips CPU
- 128 MB memory
- Network
- IDE
- HDMI Video Out
With all those Chunky modes we want

For something like €150
This is useful and everyone can afford it.


B) Produce a similarly specced standalone system for a similar price.
Gunnar is offline  
Old 21 June 2014, 17:50   #94
Photon
Moderator
Photon's Avatar
 
Join Date: Nov 2004
Location: Hult / Sweden
Posts: 4,589
The problem I'm addressing is establishing a new standard gfx mode for coders to target. It's extremely hard.

RTG didn't, and Indivision didn't. The conclusion I draw from that is that any expansion must be cheap and high volume. (5000+ units).

How many of these do you plan to make? If only 600 or 700 units, that's a tiny audience/market/platform/etc for any coder to target. And then the gfx mode will not be established.

Again, if we're talking a "Workbench computer" and games and demos are not involved, I'm sure it will be a very nice alternative expansion to AGA/060 etc.
Photon is offline  
Old 21 June 2014, 18:12   #95
Gunnar
Registered User

 
Join Date: Apr 2014
Location: Germany
Posts: 154
Quote:
Originally Posted by Photon View Post
The problem I'm addressing is establishing a new standard gfx mode for coders to target. It's extremely hard.
But the graphics modes are not new.
The modes are supported by Picasso/CyberGFX already.
Many Amiga games already support them.

Also, OS 4 / MOS / AROS are based on these graphics modes.
And there is a large number of SDL games around that use these modes.
Gunnar is offline  
Old 21 June 2014, 18:26   #96
Photon
Moderator
Photon's Avatar
 
Join Date: Nov 2004
Location: Hult / Sweden
Posts: 4,589
Good, then that should mean there is documentation and example code out there.

I meant "new" as in alternatives to OCS and AGA over the years, though. I thought we were going to make many Amigas "go chunky" here.

What is the relation of this expansion to the question of how many cycles per pixel a great C2P routine takes? Are you trying to make sure the CPU on it can run such routines with great performance? You started out suggesting 6bpl C2P for A600 and now you're making a Picasso card? I'm confused.

As you know I think an FPGA CPU as 68060 alternative (remake in modern components) is a great idea. Previously, you would need an Amiga with the right slots to get some CPU card and GFX card in place. If this replaces an Apollo or Blizzard in an A1200 at the cost of an 030 expansion it's good enough already

Last edited by Photon; 21 June 2014 at 19:26.
Photon is offline  
Old 21 June 2014, 19:44   #97
Gunnar
Registered User

 
Join Date: Apr 2014
Location: Germany
Posts: 154
Quote:
Originally Posted by Photon View Post
What is the relation of this expansion to the question of how many cycles per pixel a great C2P routine takes?
The existing Vampire600 cards are pure CPU upgrades without Video-out.
For them C2P is needed to run some SDL games or others.

Only the new card will come with new GFX-out included.
Gunnar is offline  
Old 21 June 2014, 19:45   #98
Photon
Moderator
Photon's Avatar
 
Join Date: Nov 2004
Location: Hult / Sweden
Posts: 4,589
So this is a question for a future SDL library update? Or graphics driver, presumably.
Photon is offline  
Old 21 June 2014, 19:56   #99
Gunnar
Registered User

 
Join Date: Apr 2014
Location: Germany
Posts: 154
Quote:
Originally Posted by Photon View Post
So this is a question for a future SDL library update?
I think the question is general.


Normally an A600 has very little CPU power.

With the Vampire the situation is different.
The CPU is fast.
The copy to chip mem is the slow part, the bottleneck.


The difference between CPU speed and chip memory speed is huge.

Example:
Currently, for every word the CPU writes to chipmem it has to wait many cycles - let's say, for example, 80 - before it can do another write.

Instead of waiting, the CPU could also be doing useful work.

This means that instead of doing a plain memcopy, the CPU can do a C2P at the same speed.

But if you use a fast C2P routine, the CPU again has free time and is bored.
Again you can use this free time for something useful.

I have, for example, written a demo for the AMIGA 600 which renders a screen in 15bit hicolor in real time.

And in real time the C2P routine does not only do C2P but also a 15bit to 64-color EHB screen conversion.

I think with good code one could also do a real-time HAM conversion.
And I do not mean the ugly HAM conversion which is used in some demos, but a really good one.

I think there are even more options one could explore here.

This type of routine would be useful in many cases:
useful for the SDL library, useful for video players, and others.
Gunnar is offline  
Old 21 June 2014, 20:11   #100
Photon
Moderator
Photon's Avatar
 
Join Date: Nov 2004
Location: Hult / Sweden
Posts: 4,589
OK, I responded to recent posts that were a bit off topic from your original discussion. Hence my suggestion of some way toward a hardware chunky mode that would have a chance of being embraced by many Amiga owners.

That's a dream I've had for many years that will likely never come true. Sorry for the interruption, and continue to build yours.

Last edited by Photon; 21 June 2014 at 20:26.
Photon is offline  
 

