01 April 2022, 19:01 | #81
Registered User
Join Date: Mar 2022
Location: Birmingham, UK
Posts: 154
Quote:
01 April 2022, 20:15 | #82
Registered User
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
Quote:
That and his apparent disconnects in understanding:
- IDE DMA is possible even if the DMA pins were not connected, and Gunnar never published any sort of "generic" built-in DMAC.
- He called me a liar when I first mentioned how many instructions the ThreadRipper can do on the fly.
- He didn't understand what CORDIC was, or how it could be used to make single-cycle math.
- He would insist I needed more channels of Paula to pull off GNGEO better, not getting that mixing was a fraction of the CPU requirement of ADPCM and FM decode (and 14-bit Paula audio is already better than the NEO GEO's 12-bit, so 16-bit is really useless).
- He didn't get that the chief bottleneck in emulation is the CPU, and that some sort of simple "V68K" mode (where the CPU "mutes" the high 8 bits in exchange for some "page" bits) would allow direct execution for virtually every 68K platform made. This might have made GNGEO too fast -- we would also have needed a good way to throttle the CPU at that point. Imagine running 3.1, 1.3, Atari, Mac, Genesis and NEO GEO at the same time, all with the power of the AC68080! Sigh...
- He didn't (at the time) see any point in implementing the Blitter in hardware when the CPU can do it faster anyway, and spent months trying to "hack" compatibility in without it.
- I could go on, but this shouldn't become a Gunnar rant...
Quote:
The Vampire's memory bus is really, really fast, but shares this with video and other DMA. And this speed doesn't really help when the CPU core is only running at ~85MHz; it just means the CPU is very seldom starved for instructions. It would be like me feeding you 20 pizzas a minute. My ability to give you pizzas that fast does not mean you're able to consume them that fast. For the most part, that's not that important anyway since both processors have cache and most of your code (hopefully) is being executed from that. Quote:
The MC68060 was 1.33 MIPS/MHz and the AC68080 was around 1.54 MIPS/MHz when I was testing on the V4. The Intel Pentium is 1.88 MIPS/MHz. For reference, the 68040 was 1.1 MIPS/MHz and is about as fast as the 680x0 instruction set can get with a single pipeline. This means that, in theory, if you could somehow guarantee that two pipelines were always 100% full, then you'd be able to get 2.2 MIPS/MHz. However, all of these processors have the same asymmetric design -- that is, the main pipeline can do everything and the secondary pipeline can only do a subset. Compound this with the lack of out-of-order execution (another thing Gunnar saw no value in) and inter-instruction dependencies, and a ~40% uplift is about what you'd expect. Even if the Vampire can be reliably pushed to, say, 106MHz (x15), you're only talking a small win (164 MIPS) versus the P75 (126 MIPS) -- so maybe closer to a P90 or P100 (which benefited greatly from 60/66MHz bus clocks respectively). But a clock-push is a clock-push on either side of the comparison fence. Quote:
In reality, GNGEO got about a 5-10% speed-up from AMMX. Was it nice? Yes. Did it have anything like the performance gains Gunnar claims on the website? Not at all. And btw, that difference is between good C code and hand-optimized AMMX code from Gunnar himself. Before I left we had GNGEO running at a solid 60 fps with AMMX (and one frame skip) and about 55 fps without (both with audio disabled -- FM is so hard to do). In the end, I want to say that I like the Vampire and still own a V2 V500. It was and remains a huge achievement, and we shouldn't balk at that. Getting 1.54 MIPS/MHz out of the 680x0 instruction set in an FPGA is genuinely impressive; the only ones who come close are the ColdFire cores from Silvaco. And I think we can applaud these achievements without making up wildly inaccurate numbers.
Last edited by nonarkitten; 01 April 2022 at 20:44.
01 April 2022, 20:33 | #83
Registered User
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
Quote:
Which are mostly completely ignored by C compilers. That being said, the best addressing mode of the x86 is pretty much equivalent to the extended addressing modes and is quite flexible (Displacement + Base Register + Index Register * Scale). Subjective statement. Segmented memory not only made the MMU almost a requirement on x86 (where many 680x0 systems opted out to save a few pennies), but it can be used creatively to speed up some copy operations. Quote:
The key to fast x86 code is not to hold data in registers but to abuse the stack/RAM instead and lean on the cache. This is why modern x86 CPUs have mad amounts of cache and benefit greatly from it. This is generally the approach taken by modern compilers.
01 April 2022, 20:39 | #84
Registered User
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
The Pentium MMX launched at 166MHz and would be about twice as fast as a Vampire. Also, Intel's MMX was more useful and, by not adding new registers, avoided breaking OS compatibility.
01 April 2022, 21:54 | #85
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
I've disassembled enough compiled code to know this is not true. Quote:
And now try to do a variable shift with any register other than CL as the count (just an example). Quote:
But it's not subjective. The 68k can, for example, transfer any memory cell to any other in a single instruction, where x86 needs either two instructions or the clumsy MOVS.
Quote:
02 April 2022, 01:14 | #86
Registered User
Join Date: Apr 2018
Location: Glasgow
Posts: 161
I got an email telling me the IceDrake for the A1200 was ready, but I am not touching it. It looks like a good bit of kit, but it's even closer to the "Amiga in a box" V4, and I think that's what the Apollo guys are driving for.
As an accelerator, and purely as an accelerator (that is, across RGB and that's it), the performance is roughly identical to the V1200, and the pure performance of that is not much better than a V600; it only really benefits from the A1200's internal architecture. So in the seven years or so that the Vampire stuff has been on sale, the CPU hasn't really changed. It's the same performance as half a decade ago.
02 April 2022, 03:41 | #87
Registered User
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
Quote:
So here's your example in 68K (four 68K opcodes for each): http://franke.ms/cex/z/qdxMsn
And here's your example in x86 (four x86 opcodes for each): https://godbolt.org/z/MTM1qbP5Y

"Mostly." You can always craft an odd enough case where it's possible, but then I'd probably say it's shit C code that's going through that much abstraction. And so many times I've written code that "ought to" use the extended addressing modes and didn't. Or tried coercing GCC into emitting a DBCC. Or getting it to emit bitfield opcodes instead of LSL/AND/OR patterns.

The 68K was designed to make assembly coders' lives easier; it was never meant for high-level languages like C. The fact that C can sometimes use these is impressive. The fact that it's incredibly inconsistent is frustrating, and it's usually easier to refactor the code to not be lasagna.
Quote:
But comparing CEX to CEX, guess what -- a memory shift ends up the same code size again. Still not 64-bit. I was able to make a few cases using double pointers use one more x86 opcode. Now, you may be able to compute the parity of the lowest byte with just a lookup on 68K, but that's one opcode versus zero (x86 sets its parity flag as a side effect). And even then, that memory fetch may cost you a pipeline stall if the table isn't already in cache, and even then, it's not instantaneous. Quote:
He said it's the OSD code for MiniMig. You're welcome to poke him for which one precisely, but it's all 90% the same since 2010. But that's not a valid argument anyway. Here's maybe something less arguable: this is NetBSD 9.2 on x86 (right) versus 68k (left). Total 68k size: 1,419,703 bytes. Total x86 size: 1,626,251 bytes. That's about 14.5% bigger, not 50% bigger. And that's 386 code, not Pentium (I couldn't find an i586-specific build of NetBSD). You can grab the source yourself if you're curious. AMD64 was usually bigger (with rare exceptions like sh), but those 64-bit pointers do add up.

I'm not saying more registers don't help in some cases. Hell, I would have killed for eight more registers on ARM, but they're not a cornucopia. In the end, the actual die space on a modern x86 processor dedicated to just the execution of opcodes is virtually infinitesimal next to the caches, cache logic, scheduling, memory handling, etc. So, given the opportunity for the first clean compatibility break x86 ever had, AMD chose to double the register count. Great. It makes a few things a lot faster, no argument there.

But one thing you're totally glossing over is the idea that two instructions can always execute faster than, say, three or four, without taking into account how those operations can stall the pipeline. Even on the 68000, there are cases where two operations are faster than one. So it's a gross over-simplification to say that because x86 *might* need more opcodes for a given operation it's therefore slower. Frankly, it's downright ignorant.
02 April 2022, 09:38 | #88
Registered User
Join Date: Sep 2013
Location: Poland
Posts: 868
|
AFAIK the new AMD64 registers are only available in long mode (which shouldn't be surprising: 32- and 16-bit apps are already compiled against the 16/32-bit register model, and letting new 32-bit apps use the new registers would break compatibility with older processors), so they belong only to the 64-bit side of x86-64, which makes it pointless to even compare them against 32-bit mode.

And even if it weren't pointless per se, consider that most out-of-order processors use register renaming, so there are physically far more registers anyway (though those serve the OOO machinery, not developers).

I was curious how heavy the usage of the new registers is in popular apps, so I compared the latest 7zip x86 vs x86-64. R8-R15 usage is small, maybe 15% of all GPR operations. Running the built-in benchmark, the 64-bit executable is overall roughly 25% faster than the 32-bit one. What's curious is that the 64-bit executable is faster at decompressing (7.6 vs 5.6 GIPS on a single core), while the 32-bit one seems faster at compressing (5.3 vs 4.6 GIPS). Since the compressing scores are quite similar, I'd say wider registers are great for decompression but don't have much impact on compression.

Now then... why am I bringing this up? Well, why did meynaf bring a 64-bit (!!) ISA into a discussion about implementations of an old arch made in the 90s? I don't know. There must be a reason. And I can assure you, meynaf, that if I clocked my Zen 2 down to 200MHz and ran whatever benchmark of that era, you'd find it still much faster than a Pentium MMX 200MHz, despite the same number of architectural registers on the developer's side. That's how much the execution units, prefetch, decode, cache subsystem and memory controllers matter, and that's why x86 posts great performance gains year after year -- not because of 8 new GPRs in the ISA. Zen already has over a hundred physical integer registers; the new architectural registers were added to 64-bit x86 to utilize that growing pool better, since swapping those poor 8 all the time would be inefficient. (And obviously wider registers themselves can bring big performance gains in CERTAIN situations; there are plenty of applications which, compiled for a 64-bit environment, don't run much faster.)

As nonarkitten already showed, there is no great difference between the size of 68k and x86 executables (and that's with 386 code). So the more compact nature of 68k is real but also irrelevant -- not with a difference that small.
02 April 2022, 10:02 | #89
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
So it's a microscopic detail that starts to repeat so many times that it's no longer microscopic. Of course it adds up with other details. Quote:
Now it could be interesting to try again with a routine that needs many registers. That would be OT here, but feel free to open an x86 vs 68k code contest if you want. Quote:
Quote:
The 68k was designed for high-level languages too. Proof: otherwise it wouldn't have the LINK/UNLK instructions (and also CHK/TRAPV, for different reasons). It's just that this wasn't the main goal (even though I remember having been told it was). Quote:
So yes, for x86 it works on the whole register. This is rarely needed and easy to handle on 68k, and it also means nothing useful can be kept in the higher bits (quite a common thing to do on 68k, at least in asm). Quote:
Quote:
x86 does not have the CLR and TST instructions, which are incredibly common. Tell me why the OF flag doesn't go into LAHF/SAHF. 68k can do string ops in both directions easily, where x86 has that stupid DF flag. Your point is that memory operations are fast, and it's true that in modern CPUs the stack is very fast. But x86 is ill-suited even for this; 68k at least can directly do things such as add.w (a7)+,d0. Quote:
Quote:
Besides, the Pentium did not add anything significant in comparison to the 386. Nevertheless, check games that have both a PC and an Amiga version; you may well see that 50% (HOMM2, for example). As I said, the larger a program becomes, the worse it is for x86. Anyhow, compilers for 68k are notoriously weak in comparison to x86 compilers, which have had a lot more effort put into them. I can usually rewrite compiled 68k code down to 50% of its original size, often 25% or even less. So comparing compiled code for code density is biased. Quote:
More registers also mean fewer stalls, as operations on registers can execute in parallel more readily than operations on memory. Data spilling never helps performance, and it's not for nothing that today's compilers tend to pass parameters in registers rather than on the stack. No, it's just not agreeing. Quote:
So my point wasn't about 64-bit; it's about having 16 registers instead of 8. Quote:
Sure, the number of registers follows a law of diminishing returns, but 8 is clearly not enough. Quote:
68k code is denser than x86, by varying amounts, as the advantage grows with larger programs. But it's not usually seen because 68k compilers are so poor.
02 April 2022, 10:32 | #90
Guru Meditating
Join Date: Jun 2014
Location: England
Posts: 2,356
Glad to see everyone stayed on topic
02 April 2022, 10:35 | #91
Registered User
Join Date: Sep 2013
Location: Poland
Posts: 868
Quote:
Quote:
Now from 040 & 060 manual Quote:
Quote:
02 April 2022, 11:01 | #92
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
So yes, compilers sux and any comparison based on them is biased. That's not new. Quote:
Oh, and wider registers are only marginally faster, as 64-bit ops are scarce even in 64-bit programs... So at the top of the list: the new registers. Quote:
It's never the nice things that get into wide use; it seems to have always been the case (I'm not saying AArch64 is even uglier, I don't know, but I can bet it ain't very nice either).
02 April 2022, 14:22 | #93
Registered User
Join Date: Sep 2013
Location: Poland
Posts: 868
Quote:
Quote:
02 April 2022, 14:32 | #94
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
Quote:
You cannot know for sure where the improvement comes from.
02 April 2022, 18:32 | #95
Registered User
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
Yeah, all you've done, meynaf, is boast about your own skills while providing no proof. And if you had spent more than 20 seconds on the NetBSD site, you'd know the 68K code is compiled for 68020 and up and requires an MMU. The x86 code is compiled for the contemporaneous i386 instruction set.
But here's a bigger set of applications -- the core of X11. The largest, Xorg, is 2,433,980 bytes on 68K and 3,026,592 bytes on x86, representing a massive 24% increase in size. The average across this folder is 17.4%.

And like I said, you CONTINUE to ignore the point that more instructions does not always mean slower code. For example, on the 68020 a shift-and-add or shift-and-or combination is usually faster than the bitfield instructions, and breaking the 020's addressing modes down into more basic 68000 addressing modes was also often faster -- in spite of needing to fetch more instructions.

So you need to prove:
a) that x86 code is so much bigger than 68K code, and
b) that this much bigger code actually impacts performance,
without grandstanding, say-so or other nonsense. And your "taste" in code (e.g., hating how REP "looks") is irrelevant; that's beyond the point of even being subjective anymore. At this point, I suspect you're just trolling, looking to pick an argument.
02 April 2022, 18:44 | #96
Registered User
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
ON topic: I'm incredibly unimpressed with Gunnar's continued attempts at distancing the V4 from the Amiga and creating more Vampire-specific lock-in.
The whole POINT of the Vampire was to unite behind the 68K AmigaOS 3.1 code base. It was to STOP fragmenting the market, and yet here we are with yet another feature that almost no one was asking for and that only one piece of hardware supports. How about an MMU so we can debug better? How about better core performance through things like OOE, register windowing, or single-cycle division? Because at this point Gunnar doesn't want to make the AC68080 better; he wants lock-in. He wants applications that only work on the Vampire and showcase its power, because it's now about pushing the V4, not about "saving" the 68K Amiga.
02 April 2022, 19:05 | #97
Registered User
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
Given how impressive the Quake II demo on the Atari Falcon was, I would have been more impressed if the Vampire had included a DSP56K instead. If Quake II looked that good on a 32MHz DSP56K and a 16MHz 68030+882, imagine a similar engine running on an AC68080 and a dual-core DSP56300. I'm sure that for an FPGA mastermind like Gunnar, adding a DSP56K core would be a breeze next to the complexity of the 680x0.
02 April 2022, 19:40 | #98
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
Now search my name on this page. Of course it's just an example. If you have nothing to show, you should perhaps STFU. Quote:
If you want me to spend time on this, a little x86 vs 68k code contest -- but not here in this thread! -- is a good solution, as cherry-picking from random sites does not help. Quote:
Quote:
Quote:
But I think I've told you this is OT here and how to do a better test, so why do you insist? Quote:
My original point was that having more instructions to execute does. (And since individual x86 instructions are not larger, larger x86 code necessarily means more instructions.) Please do not distort what I say. Quote:
Yes, REP is ugly, but it's just one problem among many. You boasted about x86's LOOP instruction, and completely forgot that it can only use CX as the counter, so you cannot have two nested loops using it -- a very poor and very stupidly designed instruction. You want more? String ops on x86 can only use the SI/DI registers as pointers, and only a few instructions can do this; on 68k any address register can be used, by the whole instruction set. Oh, and MUL/DIV target the hardwired AX register (sometimes DX), where 68k can of course use any data register. I could continue like this all day long; I'm just fed up having to prove the obvious to you. Quote:
02 April 2022, 21:21 | #99
Registered User
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
Quote:
Quote:
Quote:
The draw-pixel inner loop in GNGEO took ~10 cycles per pixel for AMMX (8 cycles in the inner loop plus 16 cycles per "line" for the massive palette load, amortized) and 12 cycles for 68K (which included the CLUT load). This is an "ideal" scenario where the AMMX code is mostly just AMMX, and it's not even a 20% speed-up. So AMMX in this case was "worth" about 21 MIPS (not 42 MIPS). Realistically, you can't do this for everything (not even the majority), which is why GNGEO saw only a 5-10% speed-up overall.

The DSP has its own state, its own branching, etc. It provides 1 MIPS/MHz, and even at 84MHz that's a free, continuous and parallel 84 MIPS. It could be clocked higher (FPGAs are ideal for DSP), it could be multi-core, and it can run a plethora of existing code already out there instead of being Gunnar's own toy.
Last edited by nonarkitten; 02 April 2022 at 21:28. Reason: Forgot about the palette load
02 April 2022, 22:13 | #100
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
As said, it was just an example. And you, you're supposed to be better, having shown nothing at all?
Quote:
Quote: