14 November 2018, 21:28 | #781 | |||
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,408
|
Quote:
Which IMHO means the fair solution (given I don't feel like making a test program) is to keep it at 'uncertain', so I'll do that from now on. After all, my source could be wrong, but so could your book. It's just the way it goes sometimes. Quote:
Quote:
|
|||
14 November 2018, 23:38 | #782 |
Registered User
Join Date: Jun 2008
Location: Boston USA
Posts: 466
|
Only impressive if you forget about the cost of filling the prefetch buffer. The 8086 and 8088 are much slower than they appear to be from Intel's documentation, at least according to Zen of Assembly Language.
|
15 November 2018, 03:44 | #783 |
Registered User
Join Date: Jan 2012
Location: USA
Posts: 372
|
This might already be somewhere around here but I thought I'd post it as it seems somewhat relevant:
http://nemesis.hacking-cult.org/Mega...tion/Yacht.txt Amazing document concerning the cycle by cycle behavior of the MC68K for every instruction and even interrupts. Very interesting. |
15 November 2018, 03:48 | #784 | |
Registered User
Join Date: Jan 2012
Location: USA
Posts: 372
|
Quote:
Terje has said that you could very well estimate the speed of most code by simply counting the total number of memory accesses needed. Anyone really interested in all this should check out some of the old posts by Terje on Usenet, that now backwater of the internet. Check out comp.arch in particular. Very interesting reading. |
|
15 November 2018, 08:11 | #785 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
As memory cycles are 4 cpu cycles, it seems logical that instructions execute in multiples of 4 cycles, btw. |
|
15 November 2018, 09:37 | #786 |
Registered User
Join Date: Sep 2015
Location: Germany
Posts: 256
|
Maybe this little assembler tool will give you some answers regarding execution times on the MC68000.
Back in the day, it was published in the German "Amiga Magazin Sonderheft - Faszination Programmieren", edition 2/93. It helped me a lot, and in conjunction with the CIA registers you get really surprising results. This tool was mainly written for the MC68000; for other processors of the 68k family you have to change the value for the execution time of an empty loop and perhaps the rounding value. Just play with the values. There is no guarantee that it works properly on 68020+ machines, though. There it may only give a rough orientation because of their caches and pipelining behaviour. Have fun with it. |
15 November 2018, 11:11 | #787 | ||
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,408
|
Edit: this post has turned out a lot longer than I planned. The book frank_b shared (Zen of Assembly Language by Michael Abrash) is really quite an interesting read, and I found myself reading through most of it, though I skimmed a few chapters as well. It taught me a lot about the 8088/8086, and I've been adding insights from the book to this post during the day rather than making separate posts.
Sorry for the TL;DR vibe. I've marked my continued edits in italics so it's clearer where I've added stuff I learned from this book.
This book turns out to be available online as well. Very interesting read (including the bits about prefetch and how it makes real performance differ from the theoretical cycle counts), though the author is a total Intel fanboy.
In essence, it turns out that Motorola was just more 'complete' with their 68000 cycle numbers, because they could easily include the (admittedly hard-coded) prefetch in the cycle counts found in their manuals. They also try to do so, to a point, for the 68020 onwards by explaining best/worst/cache performance and concurrency. Intel could not* and thus didn't. This makes the 68000 look slower than it really is when compared to the Intel 8088/8086, whose cycle counts in the manuals don't include these fetches, so real code is regularly slower than stated.
Note that all this doesn't mean the Intel cycles as given are dishonest; rather, they measure a different thing and as such require a bit more work to get the complete picture. I am unaware if this situation also occurs for the 286/386/486, but it wouldn't be surprising if the same thing applies. Generally I find that manufacturers don't change their method of reporting specs unless they have a good reason to do so.
Edit: the book in question is mostly about the 8088 though, and notes the situation is much better on the 8086. There will still be situations in which the prefetch queue lags behind internal code execution (in his words: "That’s not to say that the 8086 doesn’t suffer from those cycle-eaters at all; it just suffers less than the 8088 does. Instruction fetching is certainly still a bottleneck on the 8086"), but there are fewer of them on the 8086 than on the 8088.
One interesting bit is that neither the 8088 nor the 8086 can fetch from memory faster than in 4 cycle intervals. This means that running multiple 2 cycle instructions in a row will eventually make them execute at 4 cycle intervals, because the prefetch queue runs out. Another interesting element is that the 8086 is hampered by code that accesses words that aren't aligned on 16 bits, which means that instructions with an odd number of bytes attached to them can slow the CPU down if it's dealing with 16 bit data.
Notably, the book also mentions that the prefetch queue can still lag behind significantly on both the 286 and 386 (there is no info on the 486 in this book) and is in fact more likely to do so, as the memory coupled with these CPUs tends to be slower than the CPU can fetch from it. The given example is the 3 cycle mov [WordVar],0 for the 286, which on a real life AT can actually take a full 12 cycles to execute rather than the 3 cycles claimed by the manual.
The more I read about this, the clearer it gets: the 8086, 80286 and 80386 cycle counts as given are not at all useful for determining the actual expected performance, because they all omit prefetching. This effect also neatly explains why benchmarks and real life use show the 68k equivalents (68000 vs 8086, 68020 vs 80286, 68030 vs 80386) as consistently being faster than their Intel competitors, while you might conclude the opposite by looking at the CPU cycle counts as given.
*) Considering how complicated this can get I'm not too surprised, to be honest. Just look at the MC68020 stuff in the Motorola manuals and how unclear the actual performance of individual instructions can get when some cache and internal concurrency is part of the deal. The book goes into great detail about why Intel couldn't give figures including the prefetch.
Quote:
Edit: here's one rather funny (considering the really rather long discussion about it here) quote from the book Zen of Assembly language where the author discusses segmentation. Especially since the author is both clearly an expert on the 8088/8086 and really positive about the Intel 8088/8086 throughout the book: Quote:
Last edited by roondar; 15 November 2018 at 17:02. Reason: Added a bunch of info from Zen of Assembly language |
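The 4 cycle fetch limit described in the post above can be illustrated with a very rough model. This is only a sketch under stated assumptions, not a cycle-exact simulation: it assumes identical instructions run back to back, that the bus delivers one fetch (1 byte on the 8088, 2 bytes on the 8086) every 4 cycles, and that nothing else competes for the bus. The function name `steady_state_cycles` and its parameters are made up for illustration.

```python
FETCH_CYCLES = 4  # minimum cycles per bus fetch, as stated in the post


def steady_state_cycles(exec_cycles, length_bytes, bus_bytes):
    """Approximate cycles per instruction once the prefetch queue has drained.

    exec_cycles  -- documented execution time of the instruction
    length_bytes -- size of the instruction in bytes
    bus_bytes    -- bytes delivered per fetch (assumed: 1 for 8088, 2 for 8086)
    """
    # Cycles the bus needs to deliver the bytes of one instruction.
    fetch_cycles = FETCH_CYCLES * length_bytes / bus_bytes
    # The CPU can go no faster than the slower of executing and fetching.
    return max(exec_cycles, fetch_cycles)


if __name__ == "__main__":
    for cpu, bus_bytes in (("8086", 2), ("8088", 1)):
        print(cpu, steady_state_cycles(exec_cycles=2, length_bytes=2, bus_bytes=bus_bytes))
```

In this model a documented 2 cycle, 2 byte instruction settles at 4 cycles on the 8086 and at 8 cycles on the 8088, which is the kind of gap between manual and reality the book describes.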
||
15 November 2018, 19:50 | #788 | ||||||||||||||||
Registered User
Join Date: Mar 2016
Location: Ozherele
Posts: 229
|
Quote:
Quote:
Quote:
Quote:
Anyway, companies left Motorola because its processors lost their power. Motorola tried to make a VAX-like processor fast, and that was impossible. IMHO the new addressing modes of the 68020 and some minor complexity in the original 68000 killed the 68k's prospects in about 1991. Quote:
Quote:
Code:
poll():
   0: e5913000 ldr r3, [r1]
   4: e0000003 and r0, r0, r3
   8: e1c33000 bic r3, r3, r0
   c: e5813000 str r3, [r1]
  10: e12fff1e bx lr
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
|
||||||||||||||||
15 November 2018, 20:28 | #789 | ||||
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
|
Quote:
Seems you always interpret things in a way that suits you, without any hesitation to twist reality. You have not shown any x86 superiority of any kind, actually what you've shown here is a demonstration of bad faith and outright cheating. Quote:
Quote:
And of course any language can be extended with libraries. That makes many of them end up with large, ominous frameworks. Keep it simple? Nah. Quote:
I told you how to compute from 236, so please get your maths right. Again:
- you used headerless: -38 (236 -> 198)
- you don't have dos.library: -12 (198 -> 186)
- you don't have to open that library: -12 (186 -> 174)
- you don't have to close it: -8 (174 -> 166)
ALWAYS REMEMBER THAT COUNTING OS CODE IS UNFAIR. (Do I have to write this in red and blinking for you to finally get it??)
Another proof of cheating: your code does not allocate memory properly and happily overwrites free memory. That's pretty dirty, yep.
Last edited by meynaf; 15 November 2018 at 20:39. |
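The subtraction chain in the post above is easy to re-check mechanically. The sketch below just replays the four deductions from the starting figure of 236; the labels are shortened from the post, and the unit (presumably bytes) is an assumption since the post does not state it.

```python
BASE_SIZE = 236  # starting figure quoted in the post (unit assumed to be bytes)

# (label, amount deducted), in the order given in the post
DEDUCTIONS = [
    ("headerless", 38),
    ("no dos.library", 12),
    ("no library open", 12),
    ("no library close", 8),
]


def running_sizes(base, deductions):
    """Return the size remaining after each deduction is applied in turn."""
    sizes = []
    remaining = base
    for _label, amount in deductions:
        remaining -= amount
        sizes.append(remaining)
    return sizes


if __name__ == "__main__":
    for (label, amount), size in zip(DEDUCTIONS, running_sizes(BASE_SIZE, DEDUCTIONS)):
        print(f"{label}: -{amount} -> {size}")
```

The intermediate values match the ones listed in the post, ending at 166.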
||||
15 November 2018, 21:19 | #790 |
Registered User
Join Date: Jun 2008
Location: Boston USA
Posts: 466
|
Indeed it is possible that the prefetch will stall the CPU whilst waiting for main memory to fetch opcodes.
I rewrote that for you to match what is said in the Abrash book. Add the best and worst case please. Also detail how many clocks it takes to load 16 bits into the prefetch queue. The Abrash book says 4 cycles on an 8086 and 8 on an 8088. It's not hard. Just multiply the number of bytes in each opcode by the cycle cost of memory accesses. That's your worst case. I suspect any branch or loop will flush the remainder of the prefetch and the CPU will stall. |
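A minimal sketch of the best/worst case estimate suggested above. It uses the fetch costs quoted from the book (4 cycles per 16 bits on the 8086, 8 on the 8088, so 2 and 4 cycles per byte). The instruction list is hypothetical, and taking the larger of the documented and fetch-bound cost per instruction is an added assumption so the worst case never drops below the best case.

```python
# Cycles needed to fetch one byte of code, derived from the figures in the post:
# 4 cycles per 16 bits on the 8086, 8 cycles per 16 bits on the 8088.
CYCLES_PER_BYTE = {"8086": 2, "8088": 4}


def cycle_bounds(instructions, cpu):
    """Return (best, worst) cycle estimates for a straight-line sequence.

    instructions -- list of (length_in_bytes, documented_cycles) tuples
    cpu          -- "8086" or "8088"
    """
    per_byte = CYCLES_PER_BYTE[cpu]
    # Best case: the prefetch queue is always full, the manual's numbers hold.
    best = sum(cycles for _length, cycles in instructions)
    # Worst case: every byte has to be fetched while the CPU waits.
    worst = sum(max(cycles, length * per_byte) for length, cycles in instructions)
    return best, worst


if __name__ == "__main__":
    # Hypothetical sequence: two 2 byte/2 cycle instructions, one 3 byte/4 cycle one.
    sequence = [(2, 2), (2, 2), (3, 4)]
    for cpu in ("8086", "8088"):
        print(cpu, cycle_bounds(sequence, cpu))
```

Branches are not modelled here; as the post suspects, a taken branch would flush the queue and push real code towards the worst case figure.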
15 November 2018, 21:41 | #791 | |
Registered User
Join Date: May 2014
Location: inside the emulator
Posts: 377
|
Quote:
(Will I be excommunicated after admitting I've once tried to retrofit segmentation to a 68k type processor? ) |
|
15 November 2018, 23:11 | #792 | ||||||||||||||
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,408
|
Quote:
As for hardware issues, the 68040 was not the only CPU of that 'generation' to have those. The original Intel 486 had thermal issues and compatibility problems at higher speeds. The 486DX/2 was made in large part to fix these problems and had clock for clock lower performance than the original 486 (due to the bus being run at half speed). Quote:
Quote:
But more importantly, the 68k series actually sold very well for a long time after that as well, just not in desktop PCs. Quote:
In other words, your position about ARM compilers being terrible is still false. Quote:
First off, on the A500 fast memory runs at the exact same speed as chip memory - the only difference is there can't be any custom chip activity in fast memory. Secondly, the 68000 in the A500 runs at full speed because (and this might shock you) the designers of the Amiga used memory that was much faster than needed to service the 68000 at full speed. Quote:
Anyway, I just noticed something. You repeatedly used the line drawing example to try and show speed differences, and even though I don't agree this is a valid way to go about it, I did chuckle a bit when I noticed this. You wrote "I have done some corrections to my cycles count for the line drawing algorithm main loop: ARM - 14, 80486 - 22, 80386 - 57, 80286 - 59, 8088/8086 - 98, 68000 - 63". This shows a 4.5x speed improvement of the ARM over the 68000, which is rather close to my claim of 5x speed. Heck, given the 7MHz Amiga vs the 8MHz Archimedes, it's just about spot on. Quote:
Also, the A1200 was a mass-produced, inexpensive system with a launch price below that of even cheap 386-based systems, so even if it was somewhat slow, this was kind of to be expected. Quote:
Anyway, the closest I can get is AMOS-3D. Which is no doubt less impressive and indeed does not render in a window, though it does render full flat shaded 3D Polygons in real time. It does have a nice (for the time) model of the TOS Enterprise though. Edit: thinking about this some more, without seeing the source code for the Archimedes (and the demo in action), your example by itself doesn't mean all that much. It could be all Basic code, but then again, Archimedes Basic lets you include assembly language at any point, so it could also be accelerated. Likewise, it could be a huge impressive window with a very detailed plane and it could also be a small window with a less impressive plane. Note that I'm not saying it is any of these things, just that I can't judge how 'fair' a point you're making without a great deal more information. Which is why I'd love to see a video of this. More so because it's really interesting to see these things, regardless of our discussion. Quote:
Quote:
As such, I'm not counting 68020 cycles - it's complicated and very easy to get wrong. Quote:
Quote:
By 1984, 256K+ was pretty common. This was several years before the 386. But all of the above presumes the 8086 was aimed only at consumer level 'cheap' hardware and that is clearly not true. One of the first designs was the 1978 Xerox NoteTaker which came with 256KB of RAM as standard. There was also the 1980 IBM Displaywriter, which came with anywhere from 160 to 224KB of RAM. In other words, the 8086 was used in systems with more than 128KB of memory from the very start and as such memory segmentation would've been a drawback for these systems at the very start. --- Quote:
On the topic of segmentation, I'm not against segmentation by default. I am against pretending the 8086 implementation is an objective benefit when it actually was a measure implemented to save a little bit of money on bus lines at the expense of making coding harder (obscure examples like yours set aside). It's a clear example of penny pinching at the expense of a forward outlook. Case in point, 256K+ was a fairly common amount of memory at the high end even in 1978 and the 8086 was too expensive for use in low end stuff at that time. Intel itself obviously grasped this quite early on and tried their hardest to get rid of it, hence protected mode. Alas, DOS and people buying tons of crappy 8088s made that a hard thing. Quote:
Last edited by roondar; 16 November 2018 at 16:27. |
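The 4.5x figure in the post above, and the "just about spot on" remark once clock speeds are included, can be reproduced from the quoted main loop cycle counts. This is only a sketch: the cycle counts are the ones quoted in the post, the 7 and 8 MHz clocks are the rounded values the post uses, and the function names are made up for illustration.

```python
# Main loop cycle counts for the line drawing routine, as quoted in the post.
LOOP_CYCLES = {
    "ARM": 14,
    "80486": 22,
    "80386": 57,
    "80286": 59,
    "8088/8086": 98,
    "68000": 63,
}


def cycle_ratio(slow_cpu, fast_cpu):
    """How many times fewer cycles fast_cpu needs for the loop than slow_cpu."""
    return LOOP_CYCLES[slow_cpu] / LOOP_CYCLES[fast_cpu]


def speedup(slow_cpu, fast_cpu, slow_mhz, fast_mhz):
    """Cycle ratio scaled by the clock speed ratio of the two machines."""
    return cycle_ratio(slow_cpu, fast_cpu) * fast_mhz / slow_mhz


if __name__ == "__main__":
    print(cycle_ratio("68000", "ARM"))
    # Rounded clocks from the post: 7 MHz Amiga, 8 MHz Archimedes.
    print(round(speedup("68000", "ARM", slow_mhz=7, fast_mhz=8), 2))
```

With those inputs the per-cycle ratio is 4.5 and the clock-adjusted ratio comes out a little above 5.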
||||||||||||||
16 November 2018, 02:46 | #793 | |
Registered User
Join Date: Jan 2012
Location: USA
Posts: 372
|
Quote:
It's one thing that makes the Amiga slightly less slow than the ST for certain code. There may even be (very) special cases where the Amiga is faster than the ST, probably code that uses lots of longs and branches. |
|
16 November 2018, 07:12 | #794 | |
Registered User
Join Date: Jun 2008
Location: Boston USA
Posts: 466
|
Quote:
|
|
16 November 2018, 10:46 | #795 | |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,408
|
Quote:
On the Amiga the display DMA interleaving only happens during the visible part of the screen (that is, only when pixels are actually shown). AFAIK on the Atari ST the same style interleaving happens all the time. Logically then, on the Amiga there are plenty of cycles where 'odd length' CPU instructions get to execute at their real speed, while there are none at all for the Atari ST (i.e. all CPU instructions execute at a multiple of 4 cycles). Anyway, this thread is big enough without adding ST vs Amiga to it, so maybe we should discuss this elsewhere. |
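A small sketch of the rounding effect described above, assuming (as a simplification) that interleaved bus access simply stretches every instruction to the next multiple of 4 cycles. The function name and the example cycle counts are made up for illustration; real behaviour depends on where in the instruction the bus accesses fall.

```python
BUS_SLOT = 4  # CPU cycles per memory slot, as discussed in the thread


def bus_aligned_cycles(cycles, slot=BUS_SLOT):
    """Round a documented cycle count up to the next multiple of the bus slot."""
    # Ceiling division without floats: -(-a // b) equals ceil(a / b).
    return -(-cycles // slot) * slot


if __name__ == "__main__":
    for documented in (4, 6, 8, 10):
        print(documented, "->", bus_aligned_cycles(documented))
```

Under this assumption an 'odd length' 6 cycle instruction costs 8 cycles whenever the interleaving is active, and its documented 6 cycles otherwise.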
|
16 November 2018, 11:53 | #796 |
Registered User
Join Date: Jun 2008
Location: Boston USA
Posts: 466
|
Let's just agree that the ST/Amiga would handily outrun an 8086 at the same clock
|
16 November 2018, 12:04 | #797 |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,408
|
Any day of the week!
And let's not even get into the massive performance benefit they both have compared to the 8088 based PCs that were actually used by people. After all, as Zen of Assembly Language teaches us, the 8086 is much faster than the 8088 (the book claims an average of about 30%, but states this can be much higher for some code, due to prefetching blues). |
17 November 2018, 03:25 | #798 | |
Registered User
Join Date: Jan 2012
Location: USA
Posts: 372
|
Quote:
Dang it. Now I'm getting sucked into another ST vs Amiga black hole. Ignoring that, I think the bus access timing information revealed at http://nemesis.hacking-cult.org/Mega...tion/Yacht.txt might be useful in getting the most out of every available chip access cycle. Knowing just when instruction fetch occurs might really help in coordinating CPU/Blitter/Bitplane DMA. For 3D code, for instance, blitter clears, line draws or area fills could be set to run mostly in the overscan areas, alongside CPU code full of instructions that have few or no idle CPU cycles, thus using the idle cycles left by the blitter during a clear. The part of the code that has many DIVs and MULs might be run during bitplane DMA and blitter stencil copies, with the blitter using all the CPU idle cycles. I could see this actually being effective on a 640x400, 3 bit plane frame buffer if MUL and DIV instructions are executed just before bitplane DMA fetches on each scanline. It would be tricky for sure, though. |
|
17 November 2018, 08:47 | #799 |
Registered User
Join Date: Jun 2008
Location: Boston USA
Posts: 466
|
Code containing nothing but shifts and branches executes faster from fast RAM. Even if all bitplanes are switched off. I believe I tested this on an NTSC Amiga 1000 a year or so ago.
|
17 November 2018, 08:52 | #800 |
Registered User
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,544
|
|