01 April 2022, 19:01 | #81
Registered User
Join Date: Mar 2022
Location: Birmingham, UK
Posts: 154
Quote:
01 April 2022, 20:15 | #82
Registered User
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
Quote:
That and his apparent disconnects in understanding:
- IDE DMA is possible even if the DMA pins were not connected, and Gunnar never published any sort of "generic" built-in DMAC.
- He called me a liar when I first mentioned how many instructions the ThreadRipper can do on the fly.
- He didn't understand what CORDIC was, or how it could be used to make single-cycle math.
- He would insist I needed more channels of Paula to pull off GNGEO better, not getting that mixing was a fraction of the CPU requirement of ADPCM and FM decode (and 14-bit Paula audio is already better than the NEO GEO's 12-bit, so 16-bit is really useless).
- He didn't get that the chief bottleneck in emulation is the CPU, and that some sort of simple "V68K" mode (where the CPU "mutes" the high 8 bits in exchange for some "page" bits) would allow direct execution for virtually every 68K platform made. This might have made GNGEO too fast -- we would also have needed a good way to throttle the CPU at that point. Imagine running 3.1, 1.3, Atari, Mac, Genesis and NEO GEO at the same time, all with the power of the AC68080! Sigh...
- He didn't (at the time) see any point in implementing the Blitter in hardware when the CPU can do it faster anyway, and spent months trying to "hack" compatibility in without it.
- I could go on, but this shouldn't become a Gunnar rant...
Quote:
The Vampire's memory bus is really, really fast, but shares this with video and other DMA. And this speed doesn't really help when the CPU core is only running at ~85MHz; it just means the CPU is very seldom starved for instructions. It would be like me feeding you 20 pizzas a minute. My ability to give you pizzas that fast does not mean you're able to consume them that fast. For the most part, that's not that important anyway since both processors have cache and most of your code (hopefully) is being executed from that. Quote:
The MC68060 was 1.33 MIPS/MHz and the AC68080 was around 1.54 MIPS/MHz when I was testing on the V4. The Intel Pentium is 1.88 MIPS/MHz. For reference, the 68040 was 1.1 MIPS/MHz and is about as fast as the 680x0 instruction set can get with a single pipeline. This means that, in theory, if you could somehow guarantee that two pipelines were always 100% full, then you'd be able to get 2.2 MIPS/MHz. However, all of these processors have the same asymmetric design -- that is, the main pipeline can do everything and the secondary pipeline can only do a subset. Compound this with the lack of out-of-order execution (another thing Gunnar saw no value in) and inter-instruction dependencies, and a ~40% uplift is about what you'd expect. Even if the Vampire can be reliably pushed to, say, 106MHz (x15), you're only talking a small win (164 MIPS) versus the P75 (126 MIPS) -- so maybe closer to a P90 or P100 (which benefited greatly from 60/66MHz bus clocks respectively). But a clock-push is a clock-push on either side of the comparison fence. Quote:
In reality, GNGEO got about a 5-10% speed-up from AMMX. Was it nice? Yes. Did it have anything like the performance gains Gunnar claims on the website? Not at all. And btw, that difference is between good C code and hand-optimized AMMX code from Gunnar himself. Before I left we had GNGEO running at a solid 60 fps with AMMX (and one frame skip) and about 55 fps without (both with audio disabled -- FM is so hard to do). In the end, I want to say that I like the Vampire and still own a V2 V500. It was and remains a huge achievement, and we shouldn't balk at that. Getting 1.54 MIPS/MHz out of the 680x0 instruction set in an FPGA is genuinely impressive; the only ones who come close are the ColdFire cores from Silvaco. And I think we can applaud these achievements without making up wildly inaccurate numbers.
Last edited by nonarkitten; 01 April 2022 at 20:44.
01 April 2022, 20:33 | #83
Registered User
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
Quote:
Which are mostly completely ignored by C compilers. That being said, the best addressing mode of the x86 is pretty much equivalent to the extended addressing modes and is quite flexible (Displacement + Base Register + Index Register * Scale). Subjective statement. Segmented memory not only made the MMU almost a requirement on x86 (where many 680x0 systems opted out to save a few pennies), but it can be used creatively to speed up some copy operations. Quote:
The key to fast x86 code is not to hold data in registers but to abuse the stack/RAM instead and lean on the cache. This is why modern x86 CPUs have mad amounts of cache and benefit greatly from it. This is generally the approach taken by modern compilers.
01 April 2022, 20:39 | #84
Registered User
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
The Pentium MMX launched at 166MHz and would be about twice as fast as a Vampire. Also, Intel's MMX was more useful and, by not adding new registers, avoided breaking OS compatibility.
01 April 2022, 21:54 | #85
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
I've disassembled enough compiled code to know this is not true. Quote:
And now try to do a variable shift with any register other than CL as the count (just an example). Quote:
But it's not subjective. The 68k can, for example, transfer any memory cell to any other in a single instruction, where x86 needs either two instructions or the clumsy MOVS.
Quote:
02 April 2022, 01:14 | #86
Registered User
Join Date: Apr 2018
Location: Glasgow
Posts: 161
I got an email telling me the IceDrake for the A1200 was ready, but I am not touching it. It looks like a good bit of kit, but it's even closer to the "Amiga in a box" V4, and I think that's what the Apollo guys are driving for.
As an accelerator, and purely as an accelerator (that is, across RGB and that's it), the performance is roughly identical to the V1200, and the pure performance of that is not much better than a V600; it only really benefits from the A1200's internal architecture. So in the seven years or so that the Vampire stuff has been on sale, the CPU hasn't really changed. It's the same performance as half a decade ago.
02 April 2022, 03:41 | #87
Registered User
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
Quote:
So here's your example in 68K (four 68K opcodes for each): http://franke.ms/cex/z/qdxMsn
And here's your example in x86 (four x86 opcodes for each): https://godbolt.org/z/MTM1qbP5Y

"Mostly." You can always craft an odd enough case where it's possible, but then I'd probably say it's shit C code that's going through that much abstraction. And so many times I've written code that "ought to" use the extended addressing modes and didn't. Or tried coercing GCC into emitting a DBCC. Or getting it to emit bitfield opcodes instead of LSL/AND/OR patterns.

The 68K was designed to make assembly coders' lives easier; it was never meant for high-level languages like C. The fact that C can sometimes use these is impressive. The fact that it's incredibly inconsistent is frustrating, and it's usually easier to refactor the code to not be lasagna.
Quote:
But comparing CEX to CEX, guess what -- a memory shift ends up the same code size again. Still not 64-bit. I was able to make a few cases using double pointers use one more x86 opcode. Now, you may be able to compute the parity of the lowest byte with just a lookup on 68K, but that's one opcode versus zero (x86 sets its parity flag as a side effect). And even then, that memory fetch may cost you a pipeline stall if the table isn't already in cache, and even then, it's not instantaneous. Quote:
He said it's the OSD code for MiniMig. You're welcome to poke him for which one precisely, but it's all 90% the same since 2010. But that's not a valid argument anyway. Here's maybe something less arguable: this is NetBSD 9.2 on x86 (right) versus 68k (left). Total 68k size: 1,419,703 bytes. Total x86 size: 1,626,251 bytes. That's about 14.5% bigger, not 50% bigger. And that's 386 code, not Pentium (I couldn't find an i586-specific build of NetBSD). You can grab the source yourself if you're curious. AMD64 was usually bigger (with rare exceptions like sh), but those 64-bit pointers do add up.

I'm not saying more registers don't help in some cases. Hell, I would have killed for eight more registers on ARM, but they're not a cornucopia. In the end, the actual die space on a modern x86 processor dedicated to just the execution of opcodes is virtually infinitesimal next to the caches, cache logic, scheduling, memory handling, etc. So, given the opportunity for the first clean compatibility break x86 ever had, AMD chose to double the register count. Great. It makes a few things a lot faster, no argument there.

But one thing you're totally glossing over is the idea that two instructions can always execute faster than, say, three or four, without taking into account how those operations can stall the pipeline. Even on the 68000, there are cases where two operations are faster than one. So it's a gross over-simplification to say that because x86 *might* need more opcodes for a given operation it's therefore slower. Frankly, it's downright ignorant.
02 April 2022, 09:38 | #88
Registered User
Join Date: Sep 2013
Location: Poland
Posts: 868
|
AFAIK the new AMD64 registers are only available in long mode (which shouldn't be surprising: 32- and 16-bit apps are already compiled against the 16/32-bit register model, and letting new 32-bit apps use the new registers would break compatibility with older processors), so they belong only to the 64-bit side of x86-64, which makes it pointless to even compare them against 32-bit mode.

And even if it weren't pointless per se, consider that most out-of-order processors use register renaming, so there are physically far more registers anyway (though those serve the OOO machinery, not developers).

I was curious how heavy the usage of the new registers is in popular apps, so I compared the latest 7zip x86 vs x86-64. R8-R15 usage is small, maybe 15% of all GPR operations. Running the built-in benchmark, the 64-bit executable is overall roughly 25% faster than the 32-bit one. What's curious is that the 64-bit executable is faster at decompressing (7.6 vs 5.6 GIPS on a single core), while the 32-bit one seems faster at compressing (5.3 vs 4.6 GIPS). Since the compressing scores are quite similar, I'd say wider registers are great for decompression but don't have much impact on compression.

Now then... why am I bringing this up? Well, why did meynaf bring a 64-bit (!!) ISA into a discussion about implementations of an old arch made in the 90s? I don't know. There must be a reason. And I can assure you, meynaf, that if I clocked my Zen 2 down to 200MHz and ran whatever benchmark of that era, you'd find it still much faster than a Pentium MMX 200MHz, despite the same number of architectural registers on the developer's side. That's how much the execution units, prefetch, decode, cache subsystem and memory controllers matter, and that's why x86 posts great performance gains year after year -- not because of 8 new GPRs in the ISA. Zen already has over a hundred physical integer registers; the new architectural registers were added to 64-bit x86 to utilize that growing pool better, since swapping those poor 8 all the time would be inefficient. (And obviously wider registers themselves can bring big performance gains in CERTAIN situations; there are plenty of applications which, compiled for a 64-bit environment, don't run much faster.)

As nonarkitten already showed, there is no great difference between the size of 68k and x86 executables (and that's with 386 code). So the more compact nature of 68k is real but also irrelevant -- not with a difference that small.
02 April 2022, 10:02 | #89
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
So it's a microscopic detail that starts to repeat so many times that it's no longer microscopic. Of course it adds up with other details. Quote:
Now it could be interesting to try again with a routine that needs many registers. That would be OT here, but feel free to open an x86 vs 68k code contest if you want. Quote:
Quote:
The 68k was designed for high-level languages too. Proof: otherwise it wouldn't have the LINK/UNLK instructions (and also CHK/TRAPV, for different reasons). It's just that this wasn't the main goal (even though I remember having been told it was). Quote:
So yes, for x86 it works on the whole register. This is rarely needed and easy to handle on 68k, and it also means nothing useful can be kept in the higher bits (quite a common thing to do on 68k, at least in asm). Quote:
Quote:
x86 does not have the CLR and TST instructions, which are incredibly common. Tell me why the OF flag doesn't go into LAHF/SAHF. 68k can do string ops in both directions easily, where x86 has that stupid DF flag. Your point is that memory operations are fast, and it's true that in modern CPUs the stack is very fast. But x86 is ill-suited even for this; 68k at least can directly do things such as add.w (a7)+,d0. Quote:
Quote:
Besides, the Pentium did not add anything significant in comparison to the 386. Nevertheless, check games that have both a PC and an Amiga version; you may well see that 50% (HOMM2, for example). As I said, the larger a program becomes, the worse it is for x86. Anyhow, compilers for 68k are notoriously weak in comparison to x86 compilers, which have had a lot more effort put into them. I can usually rewrite compiled 68k code down to 50% of its original size, often 25% or even less. So comparing compiled code for code density is biased. Quote:
More registers also mean fewer stalls, as operations on registers can execute in parallel more readily than operations on memory. Data spilling never helps performance, and it's not for nothing that today's compilers tend to pass parameters in registers rather than on the stack. No, it's just not agreeing. Quote:
So my point wasn't about 64-bit; it's about having 16 registers instead of 8. Quote:
Sure, the number of registers follows a law of diminishing returns, but 8 is clearly not enough. Quote:
68k code is denser than x86, by varying amounts, as the advantage grows with larger programs. But it's not usually seen because 68k compilers are so poor.
02 April 2022, 10:32 | #90
Guru Meditating
Join Date: Jun 2014
Location: England
Posts: 2,356
Glad to see everyone stayed on topic
02 April 2022, 10:35 | #91
Registered User
Join Date: Sep 2013
Location: Poland
Posts: 868
Quote:
Quote:
Now from 040 & 060 manual Quote:
Quote:
02 April 2022, 11:01 | #92
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
So yes, compilers sux and any comparison based on them is biased. That's not new. Quote:
Oh, and wider registers are only marginally faster, as 64-bit ops are scarce even in 64-bit programs... So at the top of the list: the new registers. Quote:
It's never the nice things that get into wide use; it seems to have always been the case (I'm not saying AArch64 is even uglier, I don't know, but I can bet it ain't very nice either).
02 April 2022, 14:22 | #93
Registered User
Join Date: Sep 2013
Location: Poland
Posts: 868
Quote:
Quote:
02 April 2022, 14:32 | #94
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
Quote:
You cannot know for sure where the improvement comes from.
02 April 2022, 18:32 | #95
Registered User
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
Yeah, all you've done, meynaf, is boast about your own skills while providing no proof. And if you had spent more than 20 seconds on the NetBSD site, you'd know the 68K code is compiled for 68020 and up and requires an MMU. The x86 code is compiled for the contemporaneous i386 instruction set.
But here's a bigger set of applications -- the core of X11. The largest, Xorg, is 2,433,980 bytes on 68K and 3,026,592 bytes on x86, representing a massive 24% increase in size. The average across this folder is 17.4%.

And like I said, you CONTINUE to ignore the point that more instructions does not always mean slower code. For example, on the 68020 a shift-and-add or shift-and-or combination is usually faster than the bitfield instructions, and breaking the 020's addressing modes down into more basic 68000 addressing modes was also often faster -- in spite of needing to fetch more instructions.

So you need to prove:
a) that x86 code is so much bigger than 68K code, and
b) that this much bigger code actually impacts performance,
without grandstanding, say-so or other nonsense. And your "taste" in code (e.g., hating how REP "looks") is irrelevant; that's beyond the point of even being subjective anymore. At this point, I suspect you're just trolling, looking to pick an argument.
02 April 2022, 18:44 | #96
Registered User
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
ON topic: I'm incredibly unimpressed with Gunnar's continued attempts at distancing the V4 from the Amiga and creating more Vampire-specific lock-in.
The whole POINT of the Vampire was to unite behind the 68K AmigaOS 3.1 code base. It was to STOP fragmenting the market, and yet here we are with yet another feature that almost no one was asking for and that only one piece of hardware supports. How about an MMU so we can debug better? How about better core performance through things like OOE, register windowing, or single-cycle division? Because at this point Gunnar doesn't want to make the AC68080 better; he wants lock-in. He wants applications that only work on the Vampire and showcase its power, because it's now about pushing the V4, not about "saving" the 68K Amiga.
02 April 2022, 19:05 | #97
Registered User
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
Given how impressive the Quake II demo on the Atari Falcon was, I would have been more impressed if the Vampire had included a DSP56K instead. If Quake II looked that good on a 32MHz DSP56K and a 16MHz 68030+882, imagine a similar engine running on an AC68080 and a dual-core DSP56300. I'm sure that for an FPGA mastermind like Gunnar, adding a DSP56K core would be a breeze next to the complexity of the 680x0.
02 April 2022, 19:40 | #98
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
Now search my name on this page. Of course it's just an example. If you have nothing to show, you should perhaps STFU. Quote:
If you want me to spend time on this, a little x86 vs 68k code contest -- but not here in this thread! -- is a good solution, as cherry-picking from random sites does not help. Quote:
Quote:
Quote:
But I think I've told you this is OT here and how to do a better test, so why do you insist? Quote:
My original point was that having more instructions to execute does. (And since individual x86 instructions are not larger, larger x86 code necessarily means more instructions.) Please do not distort what I say. Quote:
Yes, REP is ugly, but it's just one problem among many. You boasted about x86's LOOP instruction, and completely forgot that it can only use CX as the counter, so you cannot have two nested loops using it -- a very poor and very stupidly designed instruction. You want more? String ops on x86 can only use the SI/DI registers as pointers, and only a few instructions can do this; on 68k any address register can be used, by the whole instruction set. Oh, and MUL/DIV target the hardwired AX register (sometimes DX), where 68k can of course use any data register. I could continue like this all day long; I'm just fed up having to prove the obvious to you. Quote:
02 April 2022, 21:21 | #99
Registered User
Join Date: Jun 2018
Location: Calgary/Canada
Posts: 247
Quote:
Quote:
Quote:
The draw-pixel inner loop in GNGEO took ~10 cycles per pixel for AMMX (8 cycles in the inner loop plus 16 cycles per "line" for the massive palette load, amortized) and 12 cycles for 68K (which included the CLUT load). This is an "ideal" scenario where the AMMX code is mostly just AMMX, and it's not even a 20% speed-up. So AMMX in this case was "worth" about 21 MIPS (not 42 MIPS). Realistically, you can't do this for everything (not even the majority), which is why GNGEO saw only a 5-10% speed-up overall.

The DSP has its own state, its own branching, etc. It provides 1 MIPS/MHz, and even at 84MHz that's a free, continuous and parallel 84 MIPS. It could be clocked higher (FPGAs are ideal for DSP), it could be multi-core, and it can run a plethora of existing code already out there instead of being Gunnar's own toy.
Last edited by nonarkitten; 02 April 2022 at 21:28. Reason: Forgot about the palette load
02 April 2022, 22:13 | #100
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
As said, it was just an example. And you, you're supposed to be better, having shown nothing at all?
Quote:
Quote: