English Amiga Board

English Amiga Board (https://eab.abime.net/index.php)
-   Coders. Asm / Hardware (https://eab.abime.net/forumdisplay.php?f=112)
-   -   68k details (https://eab.abime.net/showthread.php?t=93770)

litwr 23 February 2021 13:15

This table shows that the 80386 is slightly faster than the 68030 at the same clock frequency. This supports my point on this matter.

Quote:

Originally Posted by meynaf (Post 1462473)
Your point of view is clear : x86 is better than 68k, arm is better than 68k, just about everything including 6502 is better than 68k. So yes, your tastes are visible. No mystery here.

Indeed, if we compare the modern x86, ARM and 68k, we can't help but agree with your words. However, I have stated quite clearly many times that the 68000 was better than the 8086. The 8086 is a 16-bit processor while the 68000 is 32-bit, and this made the 68000 the preferred choice if we could ignore prices. The 6502 or Z80 technical characteristics are far behind the 68000, that is quite clear - again, if we skip the matter of price. Only the ARM was really superior. Almost all workstation manufacturers moved to the RISC architecture from about 1984. The 68020 appeared two years later than the 80286, and despite this the 68020 couldn't generally surpass the 80286. The 68020 has some advantages over the 80286, and the 80286 has its advantages over the 68020, but overall it is rather impossible to say which was better. IMHO the 80386 and 80486 are slightly (very little) better than the 68030 and 68040 respectively. But it is my HO only. Thus, your assumptions were not true.

Quote:

Originally Posted by meynaf (Post 1462473)
Even if that cite is real, what does that mean ? I don't read all that has been published on the subject. And anyway we all know that the people behind the 68k did not see its true potential. So perhaps I know better than this guy.

What a great claim! The chief architect of the 68000 knew less about his 68000 than meynaf! :)

Quote:

Originally Posted by meynaf (Post 1462473)
It does not prove 6502 is better than z80.

It seems that you are fighting with your own delusions. I have only written that the 6502 utilizes clock cycles more effectively. It is common knowledge. Do you still miss it? ;) BTW I even know that the Z80's code density is better than the 6502's.

Quote:

Originally Posted by meynaf (Post 1462473)
6502 does not scale up very nicely, so nothing could have happened.

We can't say this, because Jack Tramiel "sold his soul" and stopped 6502 development in 1976. MOS Technology's men announced a plan to make a 16-bit 6502 just before Jack made his acquisition. We can only wonder what it could have been. The 6502 has a lot of free opcodes; they could have been used for something really amazing. I can even repeat that Bill Mensch reported he had a 6502@10MHz in 1976.

Quote:

Originally Posted by meynaf (Post 1462473)
Frankly, where are your lists of x86 and ARM quirks ? I suppose you just listed very minor ones and forgot about more important ones.
For example, you grumble about 68k's move from sr, but you forget that we have ori,andi,eori to ccr for direct flag manipulation - something that x86 and arm both lack (x86 has a few instructions but they are for single flags and give few possibilities).

It is rather one more oddity of the 68k that it has so many instructions to set flags but no general instruction to read flags. I can't agree that this is very important. You can set any desired flags on the x86 or ARM with just a few instructions. There is really no necessity to have as many ways to set flags as the 68k does. However, it is rather a minor matter. If you can show some example proving that it is really important, please share it with us.
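
To illustrate the difference concretely (my own minimal sketch, not code from any manual; 16-bit x86 assumed):

Code:

; 68k: one instruction per flag operation
	ori	#%00010,ccr	; set V (bit 1 of CCR)
	andi	#%11110,ccr	; clear C (bit 0), keep X/N/Z/V

; x86: only the carry has dedicated ops (stc/clc/cmc); touching
; V means a round trip through a FLAGS image on the stack
	pushf
	mov	bp,sp		; 8086 can't address via sp (clobbers bp)
	or	word ptr [bp],0800h	; OF is bit 11 of FLAGS
	popf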

Quote:

Originally Posted by meynaf (Post 1462473)
Moto didn't fail because of the 68020. This all started with the poor implementation of 68040 which was written in Verilog instead of being done by hand.
(And IBM having chosen x86 for their PC and attempting to gain its control back with the PPC didn't help.)

Commodore was the major PC manufacturer in 1983 and 1984, and Commodore supported the IBM PC rather than promoting its own advanced technologies. :( And I can cite my blog: "Shortcomings in the architecture of the 68k processors forced major manufacturers of computers based on these processors to look for a replacement. Sun started producing its own SPARC processors, Silicon Graphics switched to the MIPS processors, Apollo developed its own PRISM processor, HP started using its own PA-RISC processors, ATARI started working with custom RISC chips, and Apple was coerced to switch to the PowerPC processors. Interestingly, Apple was going to switch to the SPARC in the second half of the 80's, but negotiations with Sun failed. One can only wonder how poorly the management of Motorola was working, as if they themselves did not believe in the future of their processors"...
The 68020 was the beginning of the end of the 68k. The 68020 was good for a PC but too expensive until 1991, and it was too slow for the workstations, which migrated to the RISC architecture.

Quote:

Originally Posted by meynaf (Post 1462473)
What do you attempt to prove here ?

That the return to real mode is not a necessity in a theoretically sound architecture. But DOS was a reality, and Intel actively supported it from the 80386 on. They, unlike Moto, were more realistic and didn't push people the way Moto did. MOVE from SR, or the BE byte order, are classic examples of such pushing.

Quote:

Originally Posted by meynaf (Post 1462473)
It's not pure theory and fixing such a problem can not be done without breaking some software. Hopefully fixing such software is easy, on the Amiga we even have software such as Degrader which can do that automatically.

I can only repeat that such great accuracy for VM software was far-fetched in the 80s and even the 90s. I can only note that Wine - https://en.wikipedia.org/wiki/Wine_(software) - which works very well, is not a virtual machine, and similar Amiga software was not either.

Quote:

Originally Posted by meynaf (Post 1462473)
There is no reason why we would want to read that bit explicitly, but it is read nevertheless because it is located in the SR register. Perhaps software just wanted to save the IPL or another bit, but the S bit is there and will go along with the others.
At least 68k has SR with system bits on one side and user bits on another, unlike x86 which has everything mixed up in its FLAGS register.

Please read my previous post carefully, I already wrote about this case. Intel knew that useful VM software for the PC would come only in the distant future. Why ask people to pay for features that won't be useful until 20 years from now?

Quote:

Originally Posted by meynaf (Post 1462473)
The problem is that the superuser program does not even need to be bad to fail. It just has to read the SR to be in a potential failure.

You use a very good word, "potential". This failure is possible, but with probability very close to zero. You know almost every program can fail; we can only reduce the probability of such failures.

Quote:

Originally Posted by meynaf (Post 1462473)
It seems you don't get it at all. The code that would break if applying your suggestions wouldn't be just virtual machine software. ALL system software could potentially be broken !
We just wanted to read the IPL from SR from normal supervisor code but now it's move from CCR so we get a wrong value. I suppose you can imagine that this can trigger very nice bugs.

You know that system software is much closer to hardware than application programs. So when new hardware appears, it is quite normal and even routine to update system software to work with this new hardware. And this eliminates your problem. ;)

Quote:

Originally Posted by meynaf (Post 1462473)
Indeed, but it seems you're making a mountain out of a mousehole.
Funny that you criticize some opcode redundancy of 68k and fail to see that x86 has even more.

Maybe it is only you who can find any mountain here. :) For me it is just a little quirk. Anyway, I wrote about that: the x86 uses many ways of encoding the same instructions, but why did Moto invent new assembly mnemonics for this case?

Quote:

Originally Posted by meynaf (Post 1462473)
Let's try out with that code :
Code:

move sr,-(sp)
(some other code here)
move (sp)+,sr

In true supervisor mode, no problem here.
But let's execute that in a sandbox. Here we're in user mode. So the first move sr, if not caught, will write to the stack a value with S bit cleared. When we restore SR later, it will be caught but will restore S bit cleared, making the virtualization program think we want to go back to user mode, which isn't the case. We'll end up in wrong mode, and crash.

Please read my previous post more carefully. I have already written about exactly this case. Indeed it can potentially create an issue. But I repeat, the sandbox needs just to check the value and to fix it (set it to 0) before it writes this value to SR. And again, all this matter is far-fetched and strictly theoretical, because in practice system software was rapidly adjusted to the new way of using SR. So I don't find any necessity to change MOVE from SR at all; just document that reading the system information from SR is deprecated, and provide a new instruction to read only the system flags instead of the infamous MOVE from CCR (even Thomas doesn't like it!).
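
To make that fix-up concrete, here is a very rough sketch of what I mean, assuming a 68010 or later where the guest's MOVE to SR traps into the sandbox; instruction decoding and register preservation are omitted, and guest_sr is a hypothetical variable of the virtual CPU:

Code:

; the guest, running in user mode, executed "move d0,sr"
	move.w	d0,guest_sr	; keep the full image for the virtual CPU
	andi.w	#$001f,d0	; strip the system byte before it reaches hardware
	move.b	d0,1(sp)	; patch the CCR byte of the stacked SR
	addq.l	#2,2(sp)	; step past the emulated 2-byte instruction
	rte			; the guest resumes, still really in user mode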

Quote:

Originally Posted by roondar (Post 1462505)
Thank you for proving my point. The 286 division instruction isn't 'fantastic' if you take all of the information into account. (and modern processors do division in a single cycle, so no... 286 timings were definitely not kept).

Check this - https://www.agner.org/optimize/instruction_tables.pdf - can you find a 1-cycle division there? I can accept that some modern technology leap might allow reaching maybe 2-3 cycle 16-bit division, but my blog was written in 2018... So you are incorrect again. :( The division in the 80286 is really fantastic! :)

Quote:

Originally Posted by Thomas Richter (Post 1462605)
I wonder why you believe that big-endian order is "contrived". Otherwise, we would have the year 1202, and not 2021. It is one of two possible conventions, and that's all about it. Big-endian means that the magnitude of bits is monotonically declining for multi-byte numbers, and I believe that is a nice property. Unless you come from IBM and denote the most-significant bit as bit #0.

I find it bewildering that you worry about such conventions.

Indeed, the choice between LE and BE means nothing for the IBM/370 or 68020, because they have a 32-bit ALU and no 64-bit addition or subtraction. But for the 6809, 68000, 68008, and 68010, BE means slower long addition and subtraction. So the choice of BE was contrived - they weren't thinking about the performance benefits, they just blindly copied the byte order of the IBM mainframes.
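
For the record, here is how multi-word arithmetic is arranged on a big-endian 68000 (a minimal sketch of my own: a0 and a1 are assumed to point just past two 8-byte operands, the sum landing in the a1 buffer); the carry has to travel toward lower addresses, hence the predecrement forms:

Code:

	move.l	-(a0),d0
	add.l	d0,-(a1)	; low longwords first, sets X
	addx.l	-(a0),-(a1)	; high longwords plus the extend bit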

Quote:

Originally Posted by Thomas Richter (Post 1462605)
That depends on whether you push parameters on the stack or not, or keep variables on the stack or not. The 6502 is not good at such things - it's better with "everything global, everything static". Recursion is not such a new thing, but almost all higher programming languages (probably with the exception of FORTRAN) support it, so some kind of (emulated) stack is required.

I worked with a Pascal which supported recursion only as an option. By default, recursion was off. IMHO it was quite common before the 90s. Indeed, stack ops are not a strong point of the 6502; it is a cheap processor from 1975. However, in practice stack issues for the 6502 are not serious.

Quote:

Originally Posted by Thomas Richter (Post 1462605)
Except that a 1024 byte stack on a Z80 would have been easy, but it's hard on the 6502 - everything must be done manually.

My citation gives the number 10240. Indeed, we can easily make a 10K stack on the 8080 or Z80, but it consumes 1/6 of our total address space and doesn't guarantee safe recursion.

litwr 23 February 2021 13:35

Quote:

Originally Posted by Thomas Richter (Post 1462616)
Then you don't know much. The following code is not so uncommon, and allows a clear carry-over to 68K code:

Code:

if ((a = b) >= 0) {...}

would compile to a single move, followed by a "blt.s" instruction, and thus requires clearing overflow.

Thus, again, there is an orthogonality principle here that would be violated if carry or overflow flag would not be cleared. Consider that "move" would only set "Z" and "N" flags. Then you could use some branch instructions safely after a "move", namely "beq/bne/bmi/bpl", but others, you could not, like "bgt, bge, blt, ble". This would violate "orthogonality".

Thanks for this nice example. However, it saves us only one instruction in very rare cases. Therefore the gain is almost zero. Even meynaf agreed about this. Anyway, don't forget that this constant flag updating is bad for superscalar architectures... So in the 90s this tiny gain became a large loss.
And you know, the 68k is not orthogonal. Even its MOVE is not completely orthogonal. I want to have MOVE offset1(PC),offset2(PC). ;) The 68k is not a VAX or PDP-11. And even the VAX and PDP-11 are not 100% orthogonal. The best architectures (IBM mainframes, RISC, x86) just skipped all this orthogonality crap. ;) It has no practical usefulness, it is just poetry around true IT. :)
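
For readers following along, the sequence Thomas means would look something like this (a sketch, assuming a and b were allocated to d1 and d0); it works only because MOVE clears V, which is exactly his orthogonality point:

Code:

; if ((a = b) >= 0) { ... }
	move.l	d1,d0		; a = b; sets N/Z, clears V/C
	blt.s	.skip		; N xor V; with V clear, taken only when negative
; ... body of the if ...
.skip: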

Quote:

Originally Posted by Thomas Richter (Post 1462616)
Same thing again. If there is a "branch on condition X", then there should be a "branch always" as well, and since we have a "jmp always" and "jsr always", we also need "bsr always".

The argument makes some sense for BRA, but there are no conditional subroutine calls on the 68k. Thus the presence of BSR.w is plainly redundant. Of course, practically it is just a tiny oddity, nothing important.

Quote:

Originally Posted by Thomas Richter (Post 1462616)
As in, when? If you process multi-precision artithmetics, you need to iterate through a (larger) set of (lower precision) numbers, so the most useful addressing mode is ADDX register,register and ADDX -(reg),-(reg) as this is a big-endian machine and carries move towards higher magnitudes.

Practice has more variety than anyone's imagination; I needed ADDX to add a constant to a byte in a register.

Quote:

Originally Posted by Thomas Richter (Post 1462616)
Even more so as the indexed modes are for the cases where such structures are kept in arrays. In such cases, your struct ought better be short.

It is just empty theorizing. You know I still do some maintenance on the Xlife project. You may know that Eric S. Raymond made large contributions to this project in 1991-1998. His code for Xlife v5 contains the basic structures `tile' and `pattern', which are larger than 128 bytes. You know that Xlife was a student project, so it is easy to deduce that more serious software had large structures quite often. BTW Xlife structures are kept in lists, not in arrays.

Quote:

Originally Posted by Thomas Richter (Post 1462616)
The 68K is an easy target for a code generator. Two types of registers, all equal, all instructions on each register of each type possible, no "special rules" like "multiply only with one register" or "shift only with another register". That's how intel worked.

What strange logic! The presence of address registers is rather a complication. Anyway, code generation is work for good professionals, and they have proved that they are able to make quite decent code generators for the x86. Other people (99.9999999%) just use their compilers. Why bother about the 0.0000001% who are quite happy with their good job? :)

Quote:

Originally Posted by Thomas Richter (Post 1462616)
They are if the compiler creates better code.

What do you mean by better code? Faster? Can you give any link where code for the 68k is proved to be better than for the x86?

Quote:

Originally Posted by Thomas Richter (Post 1462616)
You don't handle such records, not regularly. Make the common case easy.

You are a programmer and you know that every particular case has the same importance as the general case in programming. If you ignore one particular case that just kills your software.

Quote:

Originally Posted by Thomas Richter (Post 1462616)
MOVE from CCR is something new, yes. But again, you wouldn't really use this instruction. That's nothing a compiler needs to generate. The Os needs something like that.

We have such an instruction. In some rare cases it can help: the 68k's instructions change flags too often, and the opportunity to save flags can help. Anyway, do you blame Moto for this instruction?

Quote:

Originally Posted by Thomas Richter (Post 1462616)
A LOT! Did you know that the A20 gate needs to be part of the CPU nowadays, simply to keep caches consistent? The problem with an "external" A20 gate is that the CPU cache would cache wrong data (or would lose consistency with external RAM) with the gate switched. Actually, a couple of third-party intel-compatible CPUs had defects in such a way that A20 was not rolled correctly into the CPU cache.

That's a question of design philosophy, as stated before. Moto would have thrown out the "real mode", and A20. Intel still keeps the mess alive. It has to, as the BIOS still (after so many years) depends on it.

Do you know a better way to keep compatibility with older hardware? BTW a modern PC can still boot FreeDOS. And I can repeat, all this matter has no relation to programming. The 68000 was 32-bit from its beginning and Intel had to evolve their CPU.

Quote:

Originally Posted by Thomas Richter (Post 1462616)
Nope. That's part of the CPU cache architecture. It's internal in the CPU nowadays. It's a big "puke!".

I am almost sure that there were systems which didn't follow the IBM PC architecture; they overtly competed with IBM. Check, for example, https://en.wikipedia.org/wiki/Altos_Computer_Systems for the Altos 686.

Quote:

Originally Posted by Thomas Richter (Post 1462616)
No, I protest against the quirky, contrived way how intel tried to solve the transition to wider architectures. A20, segment pointers... all rather badly engineered solutions to the problem of providing a larger address range.

Let me repeat my question. Do you know a better way to make a 16-bit processor than the way of the 8086? And let's think about the next one. Do you know a better way of transforming the 80286 into the 80386?

Quote:

Originally Posted by a/b (Post 1462722)
Source? I see different numbers, e.g. https://zsmith.co/intel_i.php#idiv states 25 and 21, respectively.
Also, you speak as if extra time for EA calc is super bad. With x86 you're limited to specific registers (dx:ax) and have no flexibility. Doing several mul/div in a row? Time for some data shuffling! It's an archaic approach from 8-bit era, along many other x86 'features'.

286 didn't have a barrel shifter? Shifts/rotates take a variable number of cycles (just like 68000/010). I'd say that's a big oopsie compared to 020.

I have given the number for the DIV instruction, not IDIV. ;) You know we had a contest - http://eab.abime.net/showpost.php?p=...&postcount=916 - in this thread; the goal was to write the best code to convert binary to decimal. This code needs division, and the 8086 beat the 68k in this case because DIV on the x86 is quite flexible - just try to code it and find out. :)
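
For anyone who wants to try: the inner loop of such a conversion on the 8086 looks roughly like this (my own sketch, not the contest code); DIV's fixed 32/16 form dx:ax / reg leaves the quotient right where the next pass needs it:

Code:

	mov	bx,10
	xor	cx,cx		; digit counter
next:	xor	dx,dx		; dividend is dx:ax
	div	bx		; ax = quotient, dx = remainder (one digit)
	push	dx
	inc	cx
	test	ax,ax
	jnz	next		; then pop cx digits and print them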

Yes, the 80286 doesn't have a barrel shifter, but the x86 has more flexible byte ops: MOV, XCHG, XLAT, ... So this 68020 advantage is rather illusory; in many cases the 80286 is just faster. Let's check the numbers:

68020: LSR/ASR #1,Dn - 6, LSR/ASR #2,Dn - 6, LSR/ASR #7,Dn - 6, LSR #8,Dn.w - 6.
80286: SAR/SHR reg,1 - 2, SAR/SHR reg,2 - 7, SAR/SHR reg,7 - 12, XOR r8l,r8l and XCHG r8l,r8h - 5.

68020: ASL/ROL/ROR #1,Dn - 8, ASL/ROL/ROR #2,Dn - 8, ASL/ROL/ROR #7,Dn - 8, ROL/ROR #8,Dn.w - 8
80286: SAL/SHL reg,1 - 2, SAL/SHL reg,2 - 7, SAL/SHL reg,7 - 12, XCHG r8l,r8h - 3.

68020: ROXL/ROXR #1,Dn - 12, ROXL/ROXR #2,Dn - 12, ROXL/ROXR #7,Dn - 12.
80286: RCL/RCR reg,1 - 2, RCL/RCR reg,2 - 7, RCL/RCR reg,7 - 12.

So it is quite clear that for the most common case (a shift by 1 bit) the 80286 is much faster, though in some less common cases the 68020 can be a bit faster. However, the x86 can do byte and word (double word since the 80386) shifts on memory, while the 68k can shift only words in memory. For word shifts the 80286 needs 5+n clocks, while the 68020 needs 6/8/12+EA, and EA is at least 4. So again the 68020 is slower for the more common cases and a bit faster for the less common ones. And the 68k is less flexible for memory shift ops. BTW what strange timings the 68020 has! LSR is faster than ASL, and ASL is faster than ROXL - just several more little oddities for the 68k collection of oddities.
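
As a concrete instance of those byte ops, here is "shift a 16-bit register right by 8" from the table above (cycle counts as quoted there):

Code:

; 80286, 5 cycles total, using byte registers
	xor	al,al		; clear the low byte
	xchg	al,ah		; old high byte falls into al, ah becomes 0

; 68020, 6 cycles
	lsr.w	#8,d0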

Quote:

Originally Posted by Thomas Richter (Post 1462800)
Another attempt at an apples to oranges comparison. The 80286 has several forms of MUL, 8x8->16, and 16x16->32. The 8x8->16 takes 13 cycles for register, 16 for memory, the comparable 16x16->32 takes 21 cycles, or 24 cycles. This is from the intel 80286 manual.

Unfortunately, it does not state whether this is "worst case", "always" or "average case". The 68020 manual states 28 cycles + EA worst case, where the case depends on the number of 1 bits in one of the multipliers IIRC, so it would typically be faster.

Sorry, my number for the 80286 MUL was wrong; I must have checked the wrong line in the manual. You are correct, it is 21, not 14. Thank you. However, my main point was that the 80286 has much faster division and multiplication. That is still quite correct, though the advantage of x86 multiplication is less significant than I initially stated. However, you are not correct about the 68020 DIVU, which takes 44+EA in the worst case and 42+EA in the best case. The case depends on cache hits (not on divisor or dividend), and the 68020 cache is rather too tiny to rely on best cases very often. Anyway, 2 cycles give almost nothing - the 80286's fantastic division is much faster in any case. ;)

Quote:

Originally Posted by Thomas Richter (Post 1462802)
Also, I find this quite interesting:

https://allaboutprocessors.blogspot....els-80286.html

I guess I can only agree with Bill Gates here. Brain-dead processor. Two modes, but can only switch to protected mode, but not back. Well, that is a "quirk".

We have discussed this matter with meynaf. Indeed, Bill wanted DOS, and he was afraid that a protected mode OS like OS/2 could make his DOS obsolete. :) So most people used the 80286 as a much faster 8088, but Xenix was also quite good and popular among more serious users.

Quote:

Originally Posted by roondar (Post 1462814)
Now, I can be wrong here as it was over two years ago. So, keep that in mind.
But as far as I can remember the discussion back then, the Intel cited cycle times for the x86 lineup excluded the time required to fetch/prefetch the instructions from memory (if any - depending on cache etc), while the M68K manuals included them, albeit for zero wait state memory.
If that is the case, which again I'm not 100% certain of, this would change the values cited even more.

You are right. The 8088/8086 is often much slower than one would expect from the instruction timings. This is because of the instruction queue and the slow 4-cycle memory access of the 8086/8088. However, since the 80286 this has not been a real issue: the 80286 has a fast memory access cycle and a larger instruction queue.

Quote:

Originally Posted by dreadnought (Post 1462911)
I don't care what Bill says. I remember playing Wolfenstein on my mate's gf's dad's 286 Turbo 20Mhz and it was the future :)

I liked this game. It was very good on my first PC (the 80286@12MHz based) in 1991. It helped me to survive without my Amiga, which I had to sell in 1990.


Quote:

Originally Posted by BippyM (Post 1464233)
IMHO maybe starting a new thread, based on the exact same topic, which is an actual continuation of the same topic is perfectly acceptable. Anyone coming into thread number 2 doesn't need to go searching for the first thread to find out what is going on. The threads are merged, they are staying merged, so please be gracious and accept that!

I must obey; I dare only express some considerations. When a book gets too big, it breaks up into volumes; when a movie gets too big, it breaks up into a series; when a cake is too big, it gets cut into pieces. :)

Quote:

Originally Posted by Bruce Abbott (Post 1464237)
Some of your advocacy comes out sounding disrespectful to Amiga users because you make a big deal about anything you see as wrong with the Amiga while hand-waving away PC issues. The truth is that no computer system is perfect, and which you think is 'best' depends on what criteria you consider more important. You should respect the fact that other people's priorities might not match yours.

Thank you for these nice words. I have never claimed that any computer architecture is perfect. You know my point: the 68000 was a very good CPU despite having a lot of odd quirks, which are common traits of all Moto's processors I know (the 680x and 68k). However, some people don't even want to know about the 68k's oddities and shortcomings. My priority in this thread is to find the complete truth about the 68k details. I am very grateful to the people here; they have helped me a lot in my task.

Quote:

Originally Posted by Bruce Abbott (Post 1464237)
If you are serious then perhaps you should open a new thread for that. We will be happy to help, so long as you stick to the task and don't go off on a rant about how bad you think the 68k is compared to other CPUs.

Thank you very much for this generous offer. I need some time to prepare materials for this new thread, I hope to finish all preparations as soon as possible. I will be very happy if the 68k shows better results.

Quote:

Originally Posted by Bruce Abbott (Post 1464237)
Mac performance doesn't interest us much, except to note that we can emulate a classic Mac faster on an Amiga if we want to (not very often). And since we really don't care we won't bother trying to check up on yet another dodgy benchmark comparison.

It is not about the Mac, it is about its CPU, which is the same as in Amigas. IMHO this benchmark has a very high quality. So it is sad to hear your objections about it.

Quote:

Originally Posted by Bruce Abbott (Post 1464237)
Many of us were disappointed with real-world performance of the original PC, considering the hype and the premium price IBM was selling it for. My Amstrad CPC664 was much nicer and better value for me, and I say that as someone who did buy a real IBM when I could get it at an acceptable price. It was quite a shock to discover that the Amstrad actually beat the PC on some tests, despite having a 'slower' 8 bit CPU. And as I found out recently, an 8MHz 8086 isn't much better.

Yes, the original PC was only slightly faster than some 8-bit computers. You know that the Xlife-8 benchmark results show that the Z80 is only about 12% slower than the 8088. It was surprising for me. However, the 8088 is a true 16-bit CPU and its instruction set is much superior to that of any 8-bit CPU. I had to work very hard to make the Z80 code fast; for the 8088 it was much easier. You may also notice that the Amstrad CPC464 appeared in 1984, 3 years after the IBM PC. In 1984, you could buy a PC-compatible computer based on the 8086/V20/V30/286 which was up to 6 times faster than the original PC. The CPC 464 didn't even have a floppy drive controller, and it was almost impossible to add more memory to the 464.

Quote:

Originally Posted by chb (Post 1464251)
IMHO also an example that it's often not the best technology that wins in the end.

The Sage was good but it had its shortcomings too: no graphics, no MMU, no FPU, ... I can't say that the Sage or Stride was generally better than the IBM PC AT.

Quote:

Originally Posted by chb (Post 1464251)
Sageandstride.org has quite some information about both Sage and Stride Micro, and a bit about Pinnacle. I agree that they were much less common than the IBM PCs/ATs, and were rather developer/custom application systems than office computers. Those 1000 units for Stride seems pretty low to me, how do you arrive at that estimate?

Thank you very much for this link to interesting material. However, you may notice that there is almost nothing about the Stride. :( Indeed, I don't know exact information about the Stride numbers. I can only repeat that Rod Coleman claimed 10,000 shipped units of the Sage. People have Sages in their collections, there are sites about the Sages, but there is almost nothing about the Strides. My conclusion is based on these facts. The Stride was a very rare computer.

Quote:

Originally Posted by chb (Post 1464251)
In itself that's a quite unimportant detail, but it's this repeated pattern of bending the facts ever-so-slightly (or not-so-slightly) in the direction you prefer that makes this discussion a bit tiresome, even though I really appreciate opinions that differ from the mainstream. So, please, keep your opinions, but mark them as opinions and apply a bit more scientific rigor to your statements. :)

Sorry again, I have to confess that I can make mistakes sometimes. However, I suspect that other people can make them too.

litwr 23 February 2021 13:41

Quote:

Originally Posted by meynaf (Post 1464289)
Now that's another story. If you want SR and CCR to be completely different registers, you have to know that saving/restoring a second register for interrupts/exceptions would have cost extra cycles for basically nothing.

It might be the same register. Just MOVE from SR reads its system part, and MOVE from CCR reads its arithmetic flags.

Quote:

Originally Posted by meynaf (Post 1464289)
How can you reject a result ? Reread this thread.
I still have the memory of your pi-spigot program where you removed features just to show shorter x86 code.
(As a side note, the exe of the full version with duration shown is 313 bytes on my VM.)

It is an unexpected claim. Maybe I should prepare a new thread on "pi-spigot" optimization. Thank you.

Quote:

Originally Posted by meynaf (Post 1464289)
There is a difference between code that can still be optimised and code that's largely sub-optimal. If i really wanted to show good code for 68k i would have to rewrite it fully.

Maybe that is true for very different primitive architectures like the Z80, 6809, 6502. But between similar advanced architectures like the x86 and 68k, it is rather rare that we need to completely rewrite our code. Anyway, there must be some hints as to what is better to change in the current program code?

Quote:

Originally Posted by meynaf (Post 1464289)
This has nothing to do with open source or piracy.
I can port Atari ST, Mac 68k, even Oric game to Amiga. First step is full disassembly and i just can't do that for e.g. a PC DOS game. For a Windows game, even worse.

Program development through disassembly of machine code is not normal. It smacks of piracy. :)

Quote:

Originally Posted by meynaf (Post 1464289)
Sorry, bad translation of the French "appel de porte". I meant, CALLM/RTM somehow look like x86 call gates, at least conceptually.
https://www.technovelty.org/arch/ia3...tem-calls.html

Thank you. I know that the Linux kernel doesn't use the dedicated x86 instructions to switch tasks; it uses a faster and more convenient sequence of instructions for this. :)

Quote:

Originally Posted by meynaf (Post 1464289)
Where in your blog do you write about the arm having as big shortcoming the inability to operate directly on memory?

Please read this sentence "The processor can refer to bytes and 4-byte words, it cannot directly access 16-bit data".

Quote:

Originally Posted by meynaf (Post 1464289)
Note that i am not sure aarch64 actually allows misaligned accesses. As most Risc cpus don't.

Quote:

The ARM v6 family supports the ARM v7 behaviour, but incorporates a configuration option in the system control register to select between this and older-style behaviour.

In general, the ARM v7 will perform the logically expected operation by breaking the memory access down into multiple memory accesses in order to load the expected data. There will be time penalties in this, not only in accessing multiple memory locations, but also in the case where cache reloads are required, page boundaries are crossed, etc.
;)

Quote:

Originally Posted by meynaf (Post 1464289)
The 68k obviously can do ordinary move from memory to a register...
A 80386 also can't do MOV RCX,[RBX], but apparently this isn't a shortcoming for you.

You again missed the point. RBX contains a 64-bit address. ;) Indeed, the 80386 can't do such instructions.

Quote:

Originally Posted by meynaf (Post 1464289)
You can not conclude anything for a single example randomly fetched on the internet.
Besides, the situation isn't the same, the PPC had onboard L2 cache.
Change compiler settings, or switch to another compiler, and you can easily get 50% speed difference.
In addition, compiled 68k code can be rewritten in asm to be at least 2x faster (my record is 14 times).

These benchmark results look very solid. They just ran very good benchmark programs. Indeed, someone can optimize them for one architecture, but someone else can optimize them for another.
Recently I made several Basic programs for a benchmark race - https://gitlab.com/retroabandon/basc.../benchmarks.md - and a man from the Atari8 world intervened; he wanted better results for his platform. He used a different Basic, a different algorithm, ... and made the result about 5 times faster. :) I can only suspect that the Amstrad CPC or Commodore people could have made the results for their machines much better too, but they just missed this race. So your point in this case is a real oddity. ;)

Quote:

Originally Posted by meynaf (Post 1464289)
These computers aren't 'PC'.
And no, the IBM PC was clearly not the best among them. It might have had a faster cpu (a still questionable situation) but the rest was poor.

Why weren't they personal computers? They are exactly PCs. What technical detail of those PCs was better than on the first IBM PC? BTW even the ZX Spectrum and BBC Micro appeared after the IBM PC...

Quote:

Originally Posted by meynaf (Post 1464289)
Perhaps because i have more interesting things to do than to prove the obvious.

Without your efforts this seems like only your personal delusion. :)

Quote:

Originally Posted by meynaf (Post 1464289)
For small routines yes, but not for a whole program. As most of a program's bulk is usually not loops.
But, feel free to disassemble them and show us why the x86 code is so big :spin

I've just compiled the Xlife v7 sources with -O3 and -Os; I got 537 KB and 341 KB respectively. It is not a large program, only about 16,000 LOC. The size of stdlib++, Xlib, etc. is quite a large common part of both programs. So this makes the difference at least 2:1.

meynaf 23 February 2021 15:23

Quote:

Originally Posted by litwr (Post 1464591)
This table shows that the 80386 is slightly faster than the 68030 at the same clock frequency. This supports my point on this matter.

Very nice: sometimes the 68020 is faster than a similarly clocked 80386 in this example. :laughing


Quote:

Originally Posted by litwr (Post 1464591)
Indeed, if we compare the modern x86, ARM and 68k, we can't help but agree with your words. However, I have stated quite clearly many times that the 68000 was better than the 8086. The 8086 is a 16-bit processor while the 68000 is 32-bit, and this made the 68000 the preferred choice if we could ignore prices. The 6502 or Z80 technical characteristics are far behind the 68000, that is quite clear - again, if we skip the matter of price. Only the ARM was really superior. Almost all workstation manufacturers moved to the RISC architecture from about 1984. The 68020 appeared two years later than the 80286, and despite this the 68020 couldn't generally surpass the 80286. The 68020 has some advantages over the 80286, and the 80286 has its advantages over the 68020, but overall it is rather impossible to say which was better. IMHO the 80386 and 80486 are slightly (very little) better than the 68030 and 68040 respectively. But it is my HO only. Thus, your assumptions were not true.

You are saying the same things over and over, but it will not make them more true.


Quote:

Originally Posted by litwr (Post 1464591)
What a great claim! The chief architect of the 68000 knew less about his 68000 than meynaf! :)

This is called real life experience. The chief architect couldn't have had this before the cpu was out.


Quote:

Originally Posted by litwr (Post 1464591)
It seems that you are fighting with your own delusions. I have only written that the 6502 utilizes clock cycles more effectively. It is common knowledge. Do you still miss it? ;) BTW I even know that the Z80's code density is better than the 6502's.

Yeah, "the common truth". I know the trick.


Quote:

Originally Posted by litwr (Post 1464591)
We can't say this, because Jack Tramiel "sold his soul" and stopped 6502 development in 1976. MOS Technology's men announced a plan to make a 16-bit 6502 just before Jack made his acquisition. We can only wonder what it could have been. The 6502 has a lot of free opcodes; they could have been used for something really amazing. I can even repeat that Bill Mensch reported he had a 6502@10MHz in 1976.

Try to enhance the 6502 to meet modern needs and then we'll speak.


Quote:

Originally Posted by litwr (Post 1464591)
It is rather one more oddity of the 68k that it has so many instructions to set flags but no general instruction to read flags.

You're wrong. There is a common instruction to read flags: Scc, something that came to x86 only with the 386.
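
A small illustration of the point (a sketch of mine): Scc turns any condition into a value without ever reading the flag register:

Code:

; 68000: materialize the carry as a value
	scs	d0		; d0.b = $ff if C is set, else $00
	neg.b	d0		; optional: normalize to 0/1

; the x86 equivalent, 80386 and later only
	setc	al		; al = 0 or 1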


Quote:

Originally Posted by litwr (Post 1464591)
I can't agree that this is very important. You can set any desired flags on the x86 or ARM with just a few instructions. There is really no necessity to have as many ways to set flags as the 68k does. However, it is rather a minor matter. If you can show some example proving that it is really important, please share it with us.

It is just what programming flexibility is.
To quote yourself talking to Thomas : "You are a programmer and you know that every particular case has the same importance as the general case in programming."

Now if you really want to have an example where flag manipulation is essential, try emulating another cpu family for a start.


Quote:

Originally Posted by litwr (Post 1464591)
Commodore was the major PC manufacturer in 1983 and 1984, and Commodore supported the IBM PC rather than promoting its own advanced technologies. :( And I can cite my blog: "Shortcomings in the architecture of the 68k processors forced major manufacturers of computers based on these processors to look for a replacement. Sun started producing its own SPARC processors, Silicon Graphics switched to the MIPS processors, Apollo developed its own PRISM processor, HP started using its own PA-RISC processors, ATARI started working with custom RISC chips, and Apple was coerced to switch to the PowerPC processors. Interestingly, Apple was going to switch to the SPARC in the second half of the 80's, but negotiations with Sun failed. One can only wonder how poorly the management of Motorola was working, as if they themselves did not believe in the future of their processors"...
The 68020 was the beginning of the end of the 68k. The 68020 was good for a PC but too expensive until 1991, and it was too slow for the workstations, which migrated to the RISC architecture.

There are reasons to produce one's own architecture, and they are not technical.
RISC is designed to ease cpu design and implementation, with no care for the programming model.


Quote:

Originally Posted by litwr (Post 1464591)
That the return to real mode is not a necessity in a theoretically sound architecture.

Since when do you care about the theory ? I wouldn't think you were the type.


Quote:

Originally Posted by litwr (Post 1464591)
But DOS was a reality, and Intel actively supported it from the 80386 on. They, unlike Moto, were more realistic and didn't push people the way Moto did.

Intel actually pushed people very much. Look at what x86 is today; you can't pretend it's a nice design.


Quote:

Originally Posted by litwr (Post 1464591)
MOVE from SR, or the BE byte order, are classic examples of such pushing.

No, they are classical examples of things done right (in the case of MOVE from SR, after a hiatus).


Quote:

Originally Posted by litwr (Post 1464591)
I can only repeat that such great accuracy for VM software was far-fetched in the 80s and even the 90s. I can only note that Wine - https://en.wikipedia.org/wiki/Wine_(software) - which works very well, is not a virtual machine, and similar Amiga software was not either.

Fixing a design mistake isn't "far-fetched".


Quote:

Originally Posted by litwr (Post 1464591)
Please read my previous post carefully, I already wrote about this case. Intel knew that useful VM software for the PC would come only in the distant future. Why ask people to pay for features that won't be useful until 20 years from now?

It is better to plan for features that will be useful, than have to support old, heavy, costly legacy.
I prefer paying for something that will be useful 20 years from now, rather than paying for something that has been useful 20-30 years ago and is now crap.


Quote:

Originally Posted by litwr (Post 1464591)
You use a very good word, "potential". This failure is possible, but with probability very close to zero. You know almost every program can fail; we can only reduce the probability of such failures.

Every program is buggy, this is the way life is: the nice overused excuse not to fix mistakes.


Quote:

Originally Posted by litwr (Post 1464591)
You know that system software is much closer to hardware than application programs. So when new hardware appears, it is quite normal and even routine to update system software to work with this new hardware. And this eliminates your problem. ;)

Except that for some applications (like games or embedded), a lot of software can be considered as system software. This does not in any manner eliminate the problem.


Quote:

Originally Posted by litwr (Post 1464591)
Maybe it is only you who can find any mountain here. :) For me it is just a little quirk. Anyway, I wrote about that: the x86 uses many ways of encoding the same instructions, but why did Moto invent new assembly mnemonics for this case?

There are two mnemonics for two instructions, and if some particular case is redundant, where is the problem?
At least if the names differ, we can choose which variant we encode.


Quote:

Originally Posted by litwr (Post 1464591)
Please read my previous post more carefully. I have already written about exactly this case. Indeed it can potentially create an issue. But I repeat, the sandbox needs just to check the value and to fix it (set it to 0) before it writes this value to SR. And again, all this matter is far-fetched and strictly theoretical, because in practice system software was rapidly adjusted to the new way of using SR. So I don't find any necessity to change MOVE from SR at all; just document that reading the system information from SR is deprecated, and provide a new instruction to read only the system flags instead of the infamous MOVE from CCR (even Thomas doesn't like it!).

No ! You do not understand.
The sandbox can not just set the value to 0.
If saving to stack normally reads $2300 from SR, in the sandbox it will be $0300. Even if it sets that back to 0, it is the wrong value - and it has no way to tell which one is right.


Quote:

Originally Posted by litwr (Post 1464601)
It might be the same register. Just MOVE from SR reads its system part, and MOVE from CCR reads its arithmetic flags.

But why would one want to do so? I see zero interest in that. It would only complicate matters.


Quote:

Originally Posted by litwr (Post 1464601)
Maybe that is true for very different primitive architectures like the Z80, 6809, 6502. But between similar advanced architectures like the x86 and 68k, it is rather rare that we need to completely rewrite our code. Anyway, there must be some hints as to what is better to change in the current program code?

Really, conversion from one cpu family to another needs a rewrite to reach good performance. We have more registers, better addressing modes, which a blind conversion does not use.


Quote:

Originally Posted by litwr (Post 1464601)
Program development through disassembly of machine code is not normal. It smacks of piracy. :)

Perhaps, but at least on 68k it's perfectly possible because the code is readable enough. Or maybe you will value x86's complexity as security by obfuscation ?


Quote:

Originally Posted by litwr (Post 1464601)
Please read this sentence "The processor can refer to bytes and 4-byte words, it cannot directly access 16-bit data".

That has little to do with on-the-fly memory operations. Something like
add al,mem.


Quote:

Originally Posted by litwr (Post 1464601)
;)

Arm v6 and v7 aren't the same as aarch64.


Quote:

Originally Posted by litwr (Post 1464601)
You again missed the point. RBX contains a 64-bit address. ;) Indeed, the 80386 can't do such instructions.

No, it's you who missed the point. If the 80386 can't do such instructions, why should a 68k be able to do them?
(Note that, actually, FPGA 68080 present in Vampire accelerators CAN do 64-bit accesses like that.)


Quote:

Originally Posted by litwr (Post 1464601)
These benchmark results look very solid. They just ran very good benchmark programs. Indeed, someone can optimize them for one architecture, but someone else can optimize them for another.
Recently I made several Basic programs for a benchmark race - https://gitlab.com/retroabandon/basc.../benchmarks.md - and a man from the Atari8 world intervened; he wanted better results for his platform. He used a different Basic, a different algorithm, ... and made the result about 5 times faster. :) I can only suspect that the Amstrad CPC or Commodore people could have made the results for their machines much better too, but they just missed this race. So your point in this case is a real oddity. ;)

You kinda prove my point. Change the code: 5 times faster.

I remember having made a basic vs basic test long ago. That was simple prime number factorization. Atari ST (gfa basic): 11.6 secs, vs PC 386 DX40 (qbasic): 9.2 secs. Same algorithm, very few changes in the code, and yes, the 40MHz 386 DX was only slightly faster than the 8MHz 68000.


Quote:

Originally Posted by litwr (Post 1464601)
Why weren't they personal computers? They are exactly PCs. What technical detail of those PCs was better than on the first IBM PC? BTW even the ZX Spectrum and BBC Micro appeared after the IBM PC...

PC designates the IBM PC line and clones, not just any personal computer.
But maybe you consider that a MacBook of today is a PC ?


Quote:

Originally Posted by litwr (Post 1464601)
Without your efforts this seems like only your personal delusion. :)

Interpret it as you wish. Why would I care. In French we say "j'ai déjà donné" ("I have already given").


Quote:

Originally Posted by litwr (Post 1464601)
I've just compiled the Xlife v7 sources with -O3 and -Os; I got 537 KB and 341 KB respectively. It is not a large program, only about 16,000 LOC. The size of stdlib++, Xlib, etc. is quite a large common part of both programs. So this makes the difference at least 2:1.

If the compiler can remove that much, then the code is probably not of good quality :p
And you still have to disassemble that code.
Alternatively you could just compress the executables, to put unrolled loops out of the equation...

Thomas Richter 23 February 2021 18:14

Quote:

Originally Posted by litwr (Post 1464591)
We can't say this, because Jack Tramiel "sold his soul" and stopped 6502 development in 1976. MOS Technology's men announced a plan to make a 16-bit 6502 just before Jack made his acquisition. We can only wonder what it could have been. The 6502 has a lot of free opcodes; they could have been used for something really amazing. I can even repeat that Bill Mensch reported he had a 6502@10MHz in 1976.

The free opcodes had been used by a variety of options, such as the 65C02.


Quote:

Originally Posted by litwr (Post 1464591)
It is rather one more oddity of the 68k that it has so many instructions to set flags but no general instruction to read flags.

Because you branch on flags, that's their purpose. And you have Scc if you need to transfer a particular condition into a value. What else do you want to read the flags for?



Quote:

Originally Posted by litwr (Post 1464591)
Please read my previous post more carefully. I have already written about exactly this case. Indeed it can potentially create an issue. But I repeat, the sandbox needs just to check the value and to fix it (set it to 0) before it writes this value to SR.

You still don't understand. The sandbox would *not* return 0 for the system part of the flags. It would provide there whatever is necessary to keep the illusion for the sandboxed program.


Quote:

Originally Posted by litwr (Post 1464591)
And again, all this matter is far-fetched and strictly theoretical, because in practice system software was rapidly adjusted to the new way of using SR. So I don't find any necessity to change MOVE from SR at all;

It was necessary to make it privileged, not to change its function.


Quote:

Originally Posted by litwr (Post 1464591)
just document that reading the system information from SR is deprecated, and provide a new instruction to read only the system flags instead of the infamous MOVE from CCR (even Thomas doesn't like it!).

Hold on. Deprecation wouldn't have made sandboxes possible. And I wouldn't state that I don't like it. It is just not necessary - I haven't really used "move from ccr" in 20 years now - or move from sr if that matters for you. I branch on conditions, or set flags on conditions, but I don't use the flags directly.


Quote:

Originally Posted by litwr (Post 1464591)
The division in the 80286 is really fantastic! :)

Ah, did you check the right instruction this time? I mean the 32/16 division?




Quote:

Originally Posted by litwr (Post 1464591)
Indeed, the choice between LE and BE means nothing for the IBM/370 or 68020, because they have a 32-bit ALU and no 64-bit addition or subtraction. But for the 6809, 68000, 68008, and 68010, BE means slower long addition and subtraction.

Hold on, no. It just means that the CPU has to fetch the higher address first to optimize throughput, that's all. Since the older 68Ks have a 16-bit bus anyhow, it just means a different order in which the RAM is read or written, and that is all.

Please, think arguments through to their very end. In fact, all the PPC did to switch between endianness was to fiddle with the lower address bits.



Quote:

Originally Posted by litwr (Post 1464591)
I worked with a Pascal which supported recursion only as an option. By default, recursion was off. IMHO it was quite common before the 90s. Indeed, stack ops are not a strong point of the 6502; it is a cheap processor from 1975. However, in practice stack issues for the 6502 are not serious.

Cough. Have you ever written something serious on the 6502? I have here a Basic interpreter, and yes, it uses a software-emulated stack for FOR-NEXT and GOSUB instead of the hardware stack. You can guess why...


Quote:

Originally Posted by litwr (Post 1464591)
My citation gives the number 10240. Indeed, we can easily make a 10K stack on the 8080 or Z80, but it consumes 1/6 of our total address space and doesn't guarantee safe recursion.

You can never guarantee "safe recursion" on a finite stack space, but there is quite some difference between a 256-byte stack and a (potential) 64K stack. While 128 recursions on the 6502 sounds like a lot, it implies that you cannot really use this itsy-bitsy stack for parameter passing.

a/b 23 February 2021 18:15

Quote:

Originally Posted by litwr (Post 1464599)
Yes, the 80286 doesn't have a barrel shifter, but the x86 has more flexible byte ops: MOV, XCHG, XLAT, ... So this 68020 advantage is rather illusory; in many cases the 80286 is just faster. Let's check the numbers:

68020: LSR/ASR #1,Dn - 6, LSR/ASR #2,Dn - 6, LSR/ASR #7,Dn - 6, LSR #8,Dn.w - 6.
80286: SAR/SHR reg,1 - 2, SAR/SHR reg,2 - 7, SAR/SHR reg,7 - 12, XOR r8l,r8l and XCHG r8l,r8h - 5.

68020: ASL/ROL/ROR #1,Dn - 8, ASL/ROL/ROR #2,Dn - 8, ASL/ROL/ROR #7,Dn - 8, ROL/ROR #8,Dn.w - 8
80286: SAL/SHL reg,1 - 2, SAL/SHL reg,2 - 7, SAL/SHL reg,7 - 12, XCHG r8l,r8h - 3.

68020: ROXL/ROXR #1,Dn - 12, ROXL/ROXR #2,Dn - 12, ROXL/ROXR #7,Dn - 12.
80286: RCL/RCR reg,1 - 2, RCL/RCR reg,2 - 7, RCL/RCR reg,7 - 12.

So it is quite clear that for the most common case (a shift by 1 bit) the 80286 is much faster, though in some less common cases the 68020 can be a bit faster. However, the x86 can do byte and word (double word since the 80386) shifts on memory, while the 68k can shift only words in memory. For word shifts the 80286 needs 5+n clocks, while the 68020 needs 6/8/12+EA, and EA is at least 4. So again the 68020 is slower for the more common cases and a bit faster for the less common ones. And the 68k is less flexible for memory shift ops. BTW what strange timings the 68020 has! LSR is faster than ASL, and ASL is faster than ROXL - just several more little oddities for the 68k collection of oddities.

Your numbers for 020 are either plain wrong or assume worst case scenario (cache miss, no overlap). I can only assume your 286 numbers are best case scenario (from https://zsmith.co/intel.php):
----
All timings are for best case and do not take into account wait states, instruction alignment, the state of the prefetch queue, DMA refresh cycles, cache hits/misses or exception processing.
----

LSR #1,Dn - 6 cycles? Did you find that in a 68020 manual written by *intel*?

Most common case? That's your assumption. Here's one of mine: in many cases I don't even have to shift because index scaling is *free*.
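
A concrete case of that free scaling (my example): indexing a word table without any explicit shift:

Code:

	move.w	(a0,d0.w*2),d1	; d1 = table[d0]; the *2 scale replaces an lsl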

Mem shift 6/8/12+EA and EA at least 4? Where did you see those numbers? They're incorrect. EA at least 4? Incorrect.
Sure, it's less flexible for mem ops, because it has lots of registers, duh. I could count on my fingers how many times in 30+ years I've used a mem shift on M68K.

And what if... I'm using 32-bit data? Pretty much multiply all 286 timings by 2. And btw, I don't care about the 386. You started with 286/020, so that's that.
Byte swapping and access to the upper byte is neat (mainly from a 000/010 perspective), I won't dispute that. Although its value has greatly diminished for a long while now.

Thomas Richter 23 February 2021 18:47

Quote:

Originally Posted by litwr (Post 1464599)
Thanks for this nice example. However, it saves us only one instruction in very rare cases.

That's not the point. The point is that instructions such as
Code:

if (a=b) { }

are a common case, and the rest follows from the orthogonality requirement.


Quote:

Originally Posted by litwr (Post 1464599)
So in the 90s this tiny gain became a large loss.

It only means that you cannot dispatch a move ahead of a branch as a superscalar pair.


Quote:

Originally Posted by litwr (Post 1464599)
And you know, the 68k is not orthogonal. Even its MOVE is not completely orthogonal. I want to have MOVE offset1(PC),offset2(PC). ;)

Argh, no! The 68K assumes that the "text" segment of a program is constant (a good idea!) and hence, you can only move *from* PC-relative addresses, not *to* them.


Quote:

Originally Posted by litwr (Post 1464599)
The 68k is not a VAX or PDP-11. And even the VAX and PDP-11 are not 100% orthogonal. The best architectures (IBM mainframes, RISC, x86) just skipped all this orthogonality crap. ;) It has no practical usefulness, it is just poetry around true IT. :)

It does have a lot of use for code generators. Or can you give me a good reason why you cannot multiply two arbitrary registers on the 8086, and what that means for the code generator?
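
To spell the constraint out (a sketch of mine, not from the manuals): on the 8086 the widening MUL is hard-wired to ax/dx, so multiplying two arbitrary registers forces a shuffle, while the 68000 does not care:

Code:

; 8086: compute bx * cx
	mov	ax,bx		; MUL insists on ax as one operand
	mul	cx		; dx:ax = bx * cx; dx is clobbered too

; 68000: any two data registers
	mulu.w	d1,d0		; d0 = d0.w * d1.w, nothing else disturbed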





Quote:

Originally Posted by litwr (Post 1464599)
The argument makes some sense for BRA, but there are no conditional subroutine calls on the 68k. Thus the presence of BSR.w is plainly redundant. Of course, practically it is just a tiny oddity, nothing important.

I have rarely had a need for conditional subroutine branches, but in principle... one could think about them.


Quote:

Originally Posted by litwr (Post 1464599)
Practice has more variety than anyone's imagination; I needed ADDX to add a constant to a byte in a register.

Pardon me? You can use "scs" to transfer the carry to a register.




Quote:

Originally Posted by litwr (Post 1464599)
It is just empty theorizing. You know I still do some maintenance on the Xlife project. You may know that Eric S. Raymond made large contributions to this project in 1991-1998. His code for Xlife v5 contains the basic structures `tile' and `pattern', which are larger than 128 bytes.

And thus, because one program uses such large structures, it must be a good design in general? I'm sorry, no. I don't buy this.



Quote:

Originally Posted by litwr (Post 1464599)
You know that Xlife was a student project, so it is easy to deduce that more serious software had large structures quite often. BTW Xlife structures are kept in lists, not in arrays.

And then why exactly do small offsets in indexed addressing modes matter?




Quote:

Originally Posted by litwr (Post 1464599)
What strange logic! The presence of address registers is rather a complication.

Ah, and how is that "more complicated" than the x86, where you can shift only by one specific register and multiply only with another?


Quote:

Originally Posted by litwr (Post 1464599)
Anyway, code generation is work for good professionals and they have proved that they are able to make a quite decent code generator for the x86.

As we say in Germany, "You can also tie a piano on your head if you like to". Yes, certainly, but it's painful.


Quote:

Originally Posted by litwr (Post 1464599)
Other people (99.9999999%) just use their compilers. Why bother about 0.0000001% who are quite happy with their good job? :)

Because it's you who argues about the CPU architecture, that's why.


Quote:

Originally Posted by litwr (Post 1464599)
What do you mean by better code? Faster? Can you give any link where code for the 68k is proved to be better than for the x86?

Readable, orthogonal. On x86, you have to shuffle values around between registers to get a particular job done. On 68K, you can allocate registers almost like local variables. Not that I program a lot in assembly these days, no.



Quote:

Originally Posted by litwr (Post 1464599)
You are a programmer and you know that every particular case has the same importance as the general case in programming.

No, I do not know that. I know that I need to optimize for 80% of the cases.





Quote:

Originally Posted by litwr (Post 1464599)
We have such an instruction. In some rare cases it can help. The 68k architecture instructions change flags too often, and the opportunity to save flags can help. Anyway do you blame Moto for this instruction?

Blame? Why blame? I don't need it.




Quote:

Originally Posted by litwr (Post 1464599)
Do you know a better way to keep compatibility with older hardware?

Yes, ditch the crap, replace by a virtual machine and a software layer to emulate the nonsense.


Quote:

Originally Posted by litwr (Post 1464599)
BTW a modern PC can still boot FreeDOS. And I can repeat: all this matter has no relation to programming.

But it does limit the progress of the architecture. intel had more or less a monopoly, so they had enough resources to keep upgrading this silly architecture, but finally there is some competition. It becomes harder and harder to carry the old junk around. More gates in the cache just for the A20 nonsense, more transistors, larger dies, higher prices, slower execution.


Quote:

Originally Posted by litwr (Post 1464599)
The 68000 was 32-bit from its beginning and Intel had to evolve their CPU.

That's the difference between a "good vision" and "fiddling around".


Quote:

Originally Posted by litwr (Post 1464599)
Let me repeat my question. Do you know a better way to make a 16-bit processor than the way of the 8086?

Yes. Don't make a 16-bit processor if the market needs a 24-bit processor.



Quote:

Originally Posted by litwr (Post 1464599)
And let's think about the next one. Do you know a better way of transforming the 80286 into the 80386?

Yes. Ditch the legacy. Replace by software, make a clean-room design. Intel started from the wrong design, went cheap, and created a Frankenstein monster of a processor. Three hearts in it, completely unorthogonal, hot-glue and duct-tape, only kept running by the immense financial income of intel.




Quote:

Originally Posted by litwr (Post 1464599)
I have given a number for the DIV instruction, not IDIV. ;) You know we had a contest - http://eab.abime.net/showpost.php?p=...&postcount=916 - in this thread: we needed to make the best code to convert binary to decimal. This code needs division, and the 8086 beat the 68k in this case because DIV on the x86 is quite flexible - just try to code it and find out. :)

Why do you need division for that?




Quote:

Originally Posted by litwr (Post 1464599)
Yes, the 80286 doesn't have a barrel shifter, but the x86 has more flexible byte ops: MOV, XCHG, XLAT, ... So this 68020 advantage is rather illusory; in many cases the 80286 is just faster.

Exotic instructions for rare cases one rarely needs. 68020 had something like PACK and UNPACK. Used 0 times in 20 years.





Quote:

Originally Posted by litwr (Post 1464599)
So it is quite clear that for the most common case (a shift by 1 bit), the 80286 is much faster.

How do you measure that this is the "common case"? By what?


Quote:

Originally Posted by litwr (Post 1464599)
Though in some less common cases the 68020 can be a bit faster. However, the x86 can do byte and word (double word since the 80386) shifts on memory, while the 68k can shift only words.

Huh? LSL.L?


Quote:

Originally Posted by litwr (Post 1464599)
Sorry, my number for the 80286 MUL was wrong - I checked the wrong line in a manual. You are correct, it is 21, not 14. Thank you. However my main point was that the 80286 has much faster division and multiplication.

Apparently, not, and I still don't know whether that's worst case or average case.



Quote:

Originally Posted by litwr (Post 1464599)
It is still quite correct. Though the advantage of the x86 multiplication is less significant than I pointed out initially. However you are not correct about the 68020 DIVU, which takes 44+EA in the worst case and 42+EA in the best case. Those cases depend on cache hits (not on the divisor or dividend), and the 68020 cache is rather too tiny to rely on hitting the best case often. Anyway 2 cycles give almost nothing - the 80286's fantastic division is much faster in any case. ;)

But you do know that intel excluded the memory cycles as well, don't you?


I remember optimizing some code for the x86 - the biggest improvement came from removing divisions and replacing them with proper multiplications and shifts. If you want high-speed algorithms, avoid divisions. Also on x86. FYI, the algorithm was a quantizer.
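As an illustration of that trick (a sketch under stated assumptions, not the actual quantizer code; x is a hypothetical memory word):
Code:

; unsigned divide by 10 without DIV: q = (x * 52429) >> 19,
; exact for every 16-bit x because 52429/2^19 exceeds 1/10 by less than 1/2621440
        mov  ax, [x]             ; load the dividend
        mov  cx, 52429           ; magic constant: ceil(2^19 / 10)
        mul  cx                  ; DX:AX = x * 52429
        mov  ax, dx              ; keeping DX alone is already a shift right by 16
        mov  cl, 3
        shr  ax, cl              ; three more: AX = x / 10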

BippyM 23 February 2021 19:19

Considerations taken on board.. Look, there's like 3/4 members interested in this thread and the last. One has now pulled out... Anyone else getting involved is either very brave, patient or a bit stupid... Either way... It's tiresome reading your considerations. It's not about obeying,
It's about keeping the shit storm in one thread...
I could just lock the thread if you prefer?





Quote:

Originally Posted by litwr (Post 1464599)
Thanks for this nice example. However it saves us one instruction only for very rare cases. Therefore the gain is almost zero. Even meynaf agreed about this. Anyway don't forget that this frequent flag changing is bad for superscalar architectures... So in the 90s this tiny gain became a large loss.
And you know, the 68k is not orthogonal. Even its MOVE is not completely orthogonal. I want to have MOVE offset1(PC),offset2(PC). ;) The 68k is not VAX or PDP-11. And even the VAX and PDP-11 are not 100% orthogonal. The best architectures (IBM mainframes, RISC, x86) just skipped all this orthogonality crap. ;) It has no practical usefulness, it is just poetry around true IT. :)



It has some little sense around BRA but there are no conditional subroutine calls on the 68k. Thus the presence of BSR.w is an overt extra. Of course, practically it is just a tiny oddity, nothing important.


Practice has more variety than anyone's imagination - I needed ADDX when adding a constant to a byte in a register.


It is just empty theorizing. You know I still do some maintenance on the Xlife project. You may know that Eric S. Raymond made large contributions to this project in 1991-1998. His code for Xlife v5 contains the basic structures `tile' and `pattern', which are larger than 128 bytes. You know that Xlife was a student project, so it is easy to deduce that more serious software had large structures quite often. BTW Xlife structures are kept in lists, not in arrays.


What strange logic! The presence of address registers is rather a complication. Anyway, code generation is work for good professionals and they have proved that they are able to make a quite decent code generator for the x86. Other people (99.9999999%) just use their compilers. Why bother about 0.0000001% who are quite happy with their good job? :)


What do you mean by better code? Faster? Can you give any link where code for the 68k is proved to be better than for the x86?


You are a programmer and you know that every particular case has the same importance as the general case in programming. If you ignore one particular case that just kills your software.


We have such an instruction. In some rare cases it can help. The 68k architecture instructions change flags too often, and the opportunity to save flags can help. Anyway do you blame Moto for this instruction?



Do you know a better way to keep compatibility with older hardware? BTW a modern PC can still boot FreeDOS. And I can repeat: all this matter has no relation to programming. The 68000 was 32-bit from its beginning and Intel had to evolve their CPU.


I am almost sure that there were systems which didn't have the IBM PC architecture and overtly competed with IBM. Check, for example, https://en.wikipedia.org/wiki/Altos_Computer_Systems for the Altos 686.



Let me repeat my question. Do you know a better way to make a 16-bit processor than the way of the 8086? And let's think about the next one. Do you know a better way of transforming the 80286 into the 80386?



I have given a number for the DIV instruction, not IDIV. ;) You know we had a contest - http://eab.abime.net/showpost.php?p=...&postcount=916 - in this thread: we needed to make the best code to convert binary to decimal. This code needs division, and the 8086 beat the 68k in this case because DIV on the x86 is quite flexible - just try to code it and find out. :)

Yes, the 80286 doesn't have a barrel shifter, but the x86 has more flexible byte ops: MOV, XCHG, XLAT, ... So this 68020 advantage is rather illusory; in many cases the 80286 is just faster. Let's check the numbers:

68020: LSR/ASR #1,Dn - 6, LSR/ASR #2,Dn - 6, LSR/ASR #7,Dn - 6, LSR #8,Dn.w - 6.
80286: SAR/SHR reg,1 - 2, SAR/SHR reg,2 - 6, SAR/SHR reg,7 - 12, XOR r8l,r8l and XCHG r8l,r8h - 5.

68020: ASL/ROL/ROR #1,Dn - 8, ASL/ROL/ROR #2,Dn - 8, ASL/ROL/ROR #7,Dn - 8, ROL/ROR #8,Dn.w - 8
80286: SAL/SHL reg,1 - 2, SAL/SHL reg,2 - 6, SAL/SHL reg,7 - 12, XCHG r8l,r8h - 3.

68020: ROXL/ROXR #1,Dn - 12, ROXL/ROXR #2,Dn - 12, ROXL/ROXR #7,Dn - 12.
80286: RCL/RCR reg,1 - 2, RCL/RCR reg,2 - 6, RCL/RCR reg,7 - 12.

So it is quite clear that for the most common case (a shift by 1 bit), the 80286 is much faster. Though in some less common cases the 68020 can be a bit faster. However, the x86 can do byte and word (double word since the 80386) shifts on memory, while the 68k can shift only words. For word shifts the 80286 needs 5+n clocks while the 68020 needs 6/8/12+EA and EA is at least 4. So again the 68020 shows that it is slower for the more common cases and a bit faster for the less common ones. And the 68k is less flexible for memory shift ops. BTW what strange timings the 68020 has! LSR is faster than ASL, and ASL is faster than ROXL - just several more little oddities for the 68k collection of oddities.



Sorry, my number for the 80286 MUL was wrong - I checked the wrong line in a manual. You are correct, it is 21, not 14. Thank you. However my main point was that the 80286 has much faster division and multiplication. It is still quite correct. Though the advantage of the x86 multiplication is less significant than I pointed out initially. However you are not correct about the 68020 DIVU, which takes 44+EA in the worst case and 42+EA in the best case. Those cases depend on cache hits (not on the divisor or dividend), and the 68020 cache is rather too tiny to rely on hitting the best case often. Anyway 2 cycles give almost nothing - the 80286's fantastic division is much faster in any case. ;)



We have discussed this matter with meynaf. Indeed, Bill wanted DOS and he was afraid that a protected-mode OS like OS/2 could make his DOS obsolete. :) So most people used the 80286 like a much faster 8088, but Xenix was also quite good and popular among more serious users.



You are right. The 8088/8086 is often much slower than one would expect on the basis of instruction timings. It is because of the instruction queue and the slow 4-cycle memory access of the 8086/8088. However, since the 80286 this has not been a real issue: the 80286 has a fast memory access cycle and a larger instruction queue.



I liked this game. It was very good on my first PC (80286@12MHz-based) in 1991. It helped me survive without my Amiga, which I had to sell in 1990.




I must obey. I dare only express some considerations. When a book gets too big it breaks up into volumes, when a movie gets too big it breaks up into a series, and when a cake is too big it gets cut into pieces. :)



Thank you for these nice words. I have never accepted that any computer architecture is perfect. You know my point: the 68000 was a very good CPU despite having a lot of odd quirks, which are common traits of all Moto's processors I know (the 680x and 68k). However some people don't even want to know about the 68k oddities and shortcomings. My priority in this thread is to find the complete truth about the 68k details. I am very grateful to the people here; they have helped me a lot in this task.



Thank you very much for this generous offer. I need some time to prepare materials for this new thread, I hope to finish all preparations as soon as possible. I will be very happy if the 68k shows better results.



It is not about the Mac, it is about its CPU, which is the same as in Amigas. IMHO this benchmark has a very high quality. So it is sad to hear your objections to it.


Yes, the original PC was only slightly faster than some 8-bit computers. You know the Xlife-8 benchmark results show that the Z80 is only about 12% slower than the 8088. It was surprising for me. However the 8088 is a true 16-bit CPU and its instruction set is much superior to that of any 8-bit CPU. I had to work very hard to make the Z80 code fast; for the 8088 it was much easier. You can also notice that the Amstrad CPC464 appeared in 1984, 3 years after the IBM PC. In 1984, you could buy a PC-compatible computer based on the 8086/V20/V30/286 which was up to 6 times faster than the original PC. The CPC 464 didn't even have a floppy drive controller, and it was almost impossible to add more memory to the 464.



The Sage was good but it had its shortcomings too: no graphics, no MMU, no FPU, ... I can't say that the Sage or Stride was generally better than the IBM PC AT.



Thank you very much for the link to this interesting material. However you can notice that there is almost nothing about the Stride. :( Indeed I don't have exact information about the Stride numbers. I can only repeat that Rod Coleman claimed 10,000 shipped units of the Sage. People have Sages in their collections, there are sites about the Sages, but there is almost nothing about the Strides. My conclusion is based on these facts. The Stride was a very rare computer.


Sorry again, I have to confess, I can make mistakes sometimes. However I suspect that other people can make them too.


meynaf 23 February 2021 20:06

One sure thing is that this thread leads nowhere.
However, our friend litwr here is a strange guy. He writes a lot of drivel all day long, but hardly ever goes into real name calling. This is something i've never seen before.
I really wonder what motivates him.

Bruce Abbott 23 February 2021 22:49

3 Attachment(s)
Quote:

Originally Posted by litwr (Post 1464599)
It is not about the Mac, it is about its CPU, which is the same as in Amigas. IMHO this benchmark has a very high quality. So it is sad to hear your objections to it.

Just because the CPU is the same doesn't mean the performance will be the same. For example, the original Mac's 68000 was clocked at 7.8336 MHz but had an effective speed of only 6 MHz.

Quote:

You can also notice that the Amstrad CPC464 appeared in 1984, 3 years after the IBM PC.
And you might notice that the original PC's base configuration was 64k RAM, CGA card (no monitor!), BASIC in ROM and storage via an external cassette tape recorder (also not included) for a much higher price! If you wanted one with more realistic specs (256k, floppy disk drive, monitor, sound card - Sound card? We ain't got no sound cards!, etc.) the price was even higher - way beyond the budget of the typical 'truck driver' that Alan Sugar made the CPC464 for.

It's pointless comparing specs or release dates without taking into account affordability. Apart from rich Americans, no home computer user in 1984 considered the PC a viable option.

Quote:

The CPC 464 didn't even have a floppy drive controller, and it was almost impossible to add more memory to the 464.
The FDD and controller were a standard option sold by Amstrad, and 256k memory expansions were also common.

I had the CPC664, which came out soon after - the first computer I owned with a built-in floppy drive. Very soon after that Amstrad released the CPC6128, which had 128k RAM for almost the same price. Many 664 users got upset about it, but I didn't. I unsoldered the 64k DRAM chips from the motherboard and replaced them with 256k chips, and made a bank-switching board compatible with the 6128 that could also map any RAM bank to the screen. I upgraded about a dozen 664s to 256k for users here in New Zealand (wonder what happened to them?).

Sadly the 664's keyboard went bad so I stupidly threw it away a few years ago (not knowing that replacement membranes were available!). However I kept a mint condition CPC6128 that somebody gave me. Some day I hope to upgrade it with a ridiculous amount of RAM so I can run SymbOS on it.

The CPC464 was released in NZ in 1985. Below is an advert from the 'Bits and Bytes' magazine issue that reviewed it (Manukau computers was a shop in Auckland run by a friend of mine, where I purchased my Amiga 1000 from a few years later). On launch the CPC464 with color monitor and floppy drive cost NZ$2190.

In the same issue there is an advert for a '100% IBM compatible' Sperry Model 20 with 128k RAM, 2 disk drives and mono screen for NZ$8100 ("other configurations up to $15,560").

And finally here is an advert from the Sept 1985 issue of 'Bits and Bytes' for the CPC664 - NZ$1895 with RGB color monitor or $1495 with green screen (which unlike the PC was still 'color', just displaying the image in shades of green). I bought the green screen model because I wanted a sharper screen for programming, and because it was cheaper!

litwr 08 March 2021 08:57

I don't know what is going on here. :( It seems that some men just want to degrade this thread and they don't want to discuss their reasons. :( I am just a mere participant. Sorry, in this situation, I can only reply to a few messages.

Quote:

Originally Posted by meynaf (Post 1464685)
One sure thing is that this thread leads nowhere.
However, our friend litwr here is a strange guy. He writes a lot of drivel all day long, but hardly ever goes into real name calling. This is something i've never seen before.

You have written a lot of drivel too. But isn't that a way of having a friendly conversation? ;)
I have checked our discussion around the pi-spigot and found out that my implementation is just much faster. So I couldn't use any code from this thread in my project. Moreover my version specially optimized for smaller size was the smallest... So I still can't figure out what you mean.
BTW I can again express my gratitude to you because you helped me to optimize code for the 68000 multiplication. I used your advice in the PDP-11 (noEIS) code. It happened several years before this thread started.

Quote:

Originally Posted by Bruce Abbott (Post 1464730)
I had the CPC664, which came out soon after - the first computer I owned with a built-in floppy drive. Very soon after that Amstrad released the CPC6128, which had 128k RAM for almost the same price. Many 664 users got upset about it, but I didn't. I unsoldered the 64k DRAM chips from the motherboard and replaced them with 256k chips, and made a bank-switching board compatible with the 6128 that could also map any RAM bank to the screen. I upgraded about a dozen 664s to 256k for users here in New Zealand (wonder what happened to them?).

It was a very interesting upgrade. I have never touched the CPC664. I started with the CPC6128 and later used the PCW series. IMHO a 6128 with 256 KB RAM was rather a rarity. Indeed the CPC and PCW were very good for their price. However if we compare them with IBM PC compatibles we have to admit that the latter had a lot of advantages. For example, let's check the Tandy 1000, which was the most popular home IBM PC. It was about 1.5-2 times faster because it ran at 4.77 MHz while the CPC/PCW has only 3.2 effective MHz. The Tandy 1000 has 320x200 16-color graphics while the CPC has 160x200 16-color graphics. The Tandy can be easily expanded to use 640 KB of memory, a hard drive, etc. The Tandy 1000 came with a free software package, Deskmate, and was 100% IBM PC compatible. The CPC/PCW had to use CP/M for professional work, and Digital Research stopped supporting CP/M in 1983. :(
It is strange that Alan Sugar stopped updating the CPC line after the 6128. IMHO he could have used a Z80 @6 or even @8 MHz in 1986 or 1987. For the Apple II, Z80 @8MHz cards were available...

meynaf 08 March 2021 11:16

Quote:

Originally Posted by litwr (Post 1468431)
You have written a lot of drivel too. But isn't that a way of having a friendly conversation? ;)
I have checked our discussion around the pi-spigot and found out that my implementation is just much faster. So I couldn't use any code from this thread in my project. Moreover my version specially optimized for smaller size was the smallest... So I still can't figure out what you mean.
BTW I can again express my gratitude to you because you helped me to optimize code for the 68000 multiplication. I used your advice in the PDP-11 (noEIS) code. It happened several years before this thread started.

All of this is leading absolutely nowhere. But i'm still wondering what really motivates you. Do you really believe in all you write ? Or are you writing the opposite of reality for some inscrutable purpose ?

Thomas Richter 08 March 2021 16:03

Quote:

Originally Posted by meynaf (Post 1468460)
Or are you writing the opposite of reality for some inscrutable purpose ?


I'm not sure that "68K is better than x86" is "reality". It is an opinion one can share or not. There are certainly merits in the x86 architecture, as in "it does still exist", and "there is a 64 bit version of it", and "it is quite powerful".


But giving arguments in favour of the overall architectural design of the x86 seems really bewildering to me. The CPU design looks like several layers of chewing gum and duct tape wrapped around an outdated 8-bit core. While I understand why intel did that - namely to keep in control of the market - I still appreciate the more orthogonal design of the 68K. Or any other processor I see on the market today.



It's really hard to make a design as unorthogonal as x86.

meynaf 08 March 2021 16:24

Quote:

Originally Posted by Thomas Richter (Post 1468551)
I'm not sure that "68K is better than x86" is "reality". It is an opinion one can share or not. There are certainly merits in the x86 architecture, as in "it does still exist", and "there is a 64 bit version of it", and "it is quite powerful".

That depends how you look at it. The "it does still exist" might well become wrong a few decades from now due to arm taking over, and in some way 68k also still exists. The "there is a 64 bit version of it" is hardly an advantage as many others have it; even 68k has a 64 bit version if you look closely at Gunnar's 68080. And the "it is quite powerful", granted, Intel chips are awesome today but it's just a matter of implementation.


Quote:

Originally Posted by Thomas Richter (Post 1468551)
But giving arguments in favour of the overall architectural design of the x86 seems really bewildering to me. The CPU design looks like several layers of chewing gum and duct tape wrapped around an outdated 8-bit core. While I understand why intel did that - namely to keep in control of the market - I still appreciate the more orthogonal design of the 68K.

I can only agree with that.


Quote:

Originally Posted by Thomas Richter (Post 1468551)
Or any other processor I see on the market today.

Here, on the other hand...


Quote:

Originally Posted by Thomas Richter (Post 1468551)
It's really hard to make a design as unorthogonal as x86.

That's sure.
Yet if i had to make that choice, i'd take x86 over mips or alpha for doing asm on, without hesitation.

Thomas Richter 08 March 2021 17:06

Quote:

Originally Posted by meynaf (Post 1468555)
That depends how you look at it. The "it does still exist" might well become wrong a few decades from now due to arm taking over, and in some way 68k also still exists.

There is no market. Nobody is selling 68K chips in volume today, let alone building computers around them. It's dead, deceased, is no more, ... this is an ex-parrot. A beautiful one, but nevertheless, nailed to the stick so it doesn't fall down.



Quote:

Originally Posted by meynaf (Post 1468555)
And the "it is quite powerful", granted, Intel chips are awesome today but it's just a matter of implementation.

Well, most likely, but not necessarily. One could suspect that a simpler architecture with fewer dependencies would provide better opportunities for optimization. Unfortunately, x86 with its coherent cache is causing lots of problems for optimization, in particular for multi-core operations. I believe that sooner or later intel will run into a dead-end with this architecture, but we'll see. You can no longer increase the clock rate, and at some point, you cannot make the architecture wider without causing cache-coherency problems.



Quote:

Originally Posted by meynaf (Post 1468555)
Yet if i had to make that choice, i'd take x86 over mips or alpha for doing asm on, without hesitation.

Yes, but you are in the minority. It is not important how much you like programming them in assembly. What is important is whether there is a good compiler supporting them, and how easy it is to write a compiler to generate fast code. Nobody writes in assembler these days.


We still had, 10 years ago, an in-house specialist that gave some heavy-duty algorithms "the final touch" by implementing them in hand-tuned assembler. We don't do that nowadays anymore. It makes no sense. We use compiler intrinsics, and we reach the same if not better performance by letting the compiler generate the code. The compiler knows better which instruction takes how long, how to unroll loops and where to inline.



Where it needs help is to get the architecture of the code right, and the vectorization (there are no good auto-vectorizing compilers at the moment, except for trivial cases, this is still better done by hand).


What you need to do is compile - look at the code - tune the source - measure the speed - reiterate.

meynaf 08 March 2021 17:30

Quote:

Originally Posted by Thomas Richter (Post 1468566)
There is no market. Nobody is selling 68K chips in volume today, let alone building computers around them. It's dead, deceased, is no more, ... this is an ex-parrot. A beautiful one, but nevertheless, nailed to the stick so it doesn't fall down.

Not in volume yes, but not zero. So perhaps it is diseased, but it's not really dead.


Quote:

Originally Posted by Thomas Richter (Post 1468566)
Well, most likely, but not necessarily. One could suspect that a simpler architecture with fewer dependencies would provide better opportunities for optimization. Unfortunately, x86 with its coherent cache is causing lots of problems for optimization, in particular for multi-core operations. I believe that sooner or later intel will run into a dead-end with this architecture, but we'll see. You can no longer increase the clock rate, and at some point, you cannot make the architecture wider without causing cache-coherency problems.

I see the problem differently. As you say, clock rate and width of architecture can't be increased much anymore. But if you have an IPC of say 4, you can hardly have something better. So you'd rather have the same work done by fewer instructions, and x86 typically does that better than "simpler" architectures. IOW if you need 3 instructions where x86 needs 1, you can't be faster.
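A small illustration of that instruction-count argument (my own sketch; counter is a hypothetical variable):
Code:

; x86: one read-modify-write instruction does the whole job
        add  [counter], eax      ; memory += eax
; a load/store RISC spends three instructions on the same work:
;       ldr  r1, [r0]            ; load
;       add  r1, r1, r2          ; modify
;       str  r1, [r0]            ; store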


Quote:

Originally Posted by Thomas Richter (Post 1468566)
Yes, but you are in the minority. It is not important how much you like programming them in assembly. What is important is whether there is a good compiler supporting them, and how easy it is to write a compiler to generate fast code. Nobody writes in assembler these days.


We still had, 10 years ago, an in-house specialist that gave some heavy-duty algorithms "the final touch" by implementing them in hand-tuned assembler. We don't do that nowadays anymore. It makes no sense. We use compiler intrinsics, and we reach the same if not better performance by letting the compiler generate the code. The compiler knows better which instruction takes how long, how to unroll loops and where to inline.



Where it needs help is to get the architecture of the code right, and the vectorization (there are no good auto-vectorizing compilers at the moment, except for trivial cases, this is still better done by hand).


What you need to do is compile - look at the code - tune the source - measure the speed - reiterate.

The compiler knows nothing about what your program is doing - the programmer does. The programmer also does not have to respect the specs and limitations of the source language - something the compiler must do.
So the programmer will always have an edge.
Hand-tuned assembler isn't just "playing compiler". It's not about converting the code, it's about converting the algorithm. And that's something a compiler cannot do.
Say what you want, but no compiler will ever beat me. Compilers being better than asm programmers is a myth.
The reason why asm isn't written anymore today is something else - it's simply because all currently available cpus are a pita in that aspect but are fast enough so it's not worth the effort.

defor 08 March 2021 17:37

Programming 68k in assembler is still popular and fun in 2021 (Amiga, ST, Megadrive). On the other hand, x86 assembler is used by a few PC intro coders only.
And that is, my friends, a testament to a good design :-)

grond 09 March 2021 11:36

Quote:

Originally Posted by Thomas Richter (Post 1468566)
I believe that sooner or later intel will run into a dead-end with this architecture, but we'll see.

Intel has been escaping dead-ends for 40 years... :D

dreadnought 09 March 2021 11:46

Quote:

Originally Posted by defor (Post 1468580)
Programming 68k in assembler is still popular and fun in 2021 (Amiga, ST, Megadrive). On the other hand, x86 assembler is used by a few PC intro coders only.
And that is, my friends, a testament to a good design :-)

Absurdisms like this one are why I love these threads :)

grond 09 March 2021 12:12

Quote:

Originally Posted by meynaf (Post 1468577)
if you have an IPC of say 4, you can hardly have something better. So you'd rather have the same work done by fewer instructions, and x86 typically does that better than "simpler" architectures. IOW if you need 3 instructions where x86 needs 1, you can't be faster.

Obviously the algorithm is still the most important part in achieving speed. With regard to what you say about IPC: today's CPUs do not have a fixed IPC in the ISA instructions you put into them. They form super-instructions from several incoming ISA instructions and issue several such super-instructions in parallel (they call it "bundles" and "bundling"). The IPC for these super-instructions is limited somewhere but since one super-instruction carries out the work of a varying number of ISA instructions, the super-instructions executed in the same clock cycle may correspond to drastically varying numbers of ISA instructions. Even if you can put complex address modes into a single instruction in a CISC such as 68k and x86, modern ARM processors can do just the same. They see the address calculations carried out using general-purpose registers and the subsequent load instruction and execute all that as one complex instruction. All there is to lament is the lower code density but the ease of detection of instruction boundaries has a lot of advantages, too.
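For the complex-address-mode point, a small sketch (illustration only): classic ARM already folds a scaled index into the load itself, much like the 68020 addressing mode:
Code:

; 68020: scale folded into the addressing mode
;       move.l (a0,d0.l*4),d1
; ARM: the shift is part of the load instruction
        ldr   r1, [r0, r2, lsl #2]  ; r1 = *(r0 + (r2 << 2))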

It often gets mentioned that Intel put RISC cores inside their CISC processors and thus could keep up with RISC's clock frequency increases (which was the original aim of RISC). The reality today is that RISCs have CISC ALUs to increase the number of instructions executed. I wouldn't be surprised if the CPUs were better at bundling super-instructions from typical compiler-generated code (that's what the CPUs are designed for) than from hand-written code optimised for low instruction count.

