English Amiga Board


Old 14 November 2018, 21:28   #781
roondar
Registered User
 
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,408
Quote:
Originally Posted by Don_Adan View Post
I haven't coded for a long time, but I checked my Amiga assembler book again: add.l <ea>,Dn takes 6+ cycles, and when the ea is a data register its ea time is 0, so 6+0 = 6 cycles.
Oh, I believe you when you say that your book says it's 6 cycles. However my source (and others) say it's 8 cycles so I'll just say: it's less certain than I thought since different sources disagree.

Which IMHO means the fair solution (given I don't feel like making a test program) is to keep it at 'uncertain', so I'll do that from now on. After all, my source could be wrong, but so could your book.

It's just the way it goes sometimes
Quote:
Originally Posted by meynaf View Post
Don't ask litwr - not his code.

I'm adding address registers to data registers because I'm out of data registers and therefore some data has to go in address registers.
Using word operations is a bad idea because it's only valid for the 68000 (others don't care), adds some constraint on the other function, and - I repeat - this code isn't speed critical.
Thanks for the clarification; like I said, I hadn't really followed this part of the discussion much. Personally, I don't completely agree with the part where you say using add/sub.w is a bad idea per se, but I don't feel much like discussing it so I'll drop it. You probably had a reason to do it this way.
Quote:
Originally Posted by coder76 View Post
You also can't see the performance of a CPU by just looking at cycle counts for each instruction and comparing them against other CPUs. There are other factors, like cache performance and the number of registers available, which are also important for performance. The x86 cycle counts for instructions often seem impressive on paper, but the x86 lacked the number of CPU registers that the 680x0s have. Also, the 386/486 caches weren't as good as the 68030/68040's caches (the 386 had some sort of external cache).
I wholeheartedly agree with this statement, which is part of why I keep saying that comparing tiny bits of code is not going to give a clear answer.
roondar is offline  
Old 14 November 2018, 23:38   #782
frank_b
Registered User
 
Join Date: Jun 2008
Location: Boston USA
Posts: 466
Quote:
Originally Posted by coder76 View Post
The x86 cycle counts for instructions often seem impressive on paper, but the x86 lacked the number of CPU registers that the 680x0s have. Also, the 386/486 caches weren't as good as the 68030/68040's caches (the 386 had some sort of external cache).
Only impressive if you forget about the cost of filling the prefetch buffer. The 8086 and 8088 are much slower than they appear to be from Intel's documentation. At least according to Zen of Assembly Language.
frank_b is offline  
Old 15 November 2018, 03:44   #783
mc6809e
Registered User
 
Join Date: Jan 2012
Location: USA
Posts: 372
This might already be somewhere around here but I thought I'd post it as it seems somewhat relevant:

http://nemesis.hacking-cult.org/Mega...tion/Yacht.txt

Amazing document concerning the cycle by cycle behavior of the MC68K for every instruction and even interrupts.

Very interesting.
mc6809e is offline  
Old 15 November 2018, 03:48   #784
mc6809e
Registered User
 
Join Date: Jan 2012
Location: USA
Posts: 372
Quote:
Originally Posted by frank_b View Post
Only impressive if you forget about the cost of filling the prefetch buffer. The 8086 and 8088 are much slower than they appear to be from Intel's documentation. At least according to Zen of Assembly Language.
Yeah, you're right, of course.

Terje has said that you could very well estimate the speed of most code by simply counting the total number of memory accesses needed.

Anyone really interested in all this should check out some of the old posts by Terje on that now backwater of the internet, usenet. Check out comp.arch, in particular. Very interesting reading.
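
To illustrate Terje's heuristic on the 68000 (my own sketch, using the standard timing tables): every 16-bit bus access, whether an opcode fetch or a data read/write, costs 4 cycles, so simply counting accesses gets you remarkably close.
Code:
; estimate a loop's speed by counting its memory accesses
loop	move.w	(a0)+,d0	; 1 opcode fetch + 1 read  = 2 accesses
	add.w	d0,d1		; 1 opcode fetch           = 1 access
	move.w	d1,(a1)+	; 1 opcode fetch + 1 write = 2 accesses
	dbf	d7,loop		; 2 fetches when taken     = 2 accesses
; estimate: 7 accesses x 4 = 28 cycles per iteration
; official counts: 8 + 4 + 8 + 10 = 30 cycles -- close enough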
mc6809e is offline  
Old 15 November 2018, 08:11   #785
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by roondar View Post
Oh, I believe you when you say that your book says it's 6 cycles. However my source (and others) say it's 8 cycles so I'll just say: it's less certain than I thought since different sources disagree.
Doc says 6 but measurement on emulator says 8.
As memory cycles are 4 cpu cycles, it seems logical that instructions execute in multiples of 4 cycles, btw.
meynaf is offline  
Old 15 November 2018, 09:37   #786
dissident
Registered User
 
Join Date: Sep 2015
Location: Germany
Posts: 256
Maybe this little assembler tool will give you some answers regarding execution times on the MC68000.

Back in the day, it was published in the German "Amiga Magazin Sonderheft - Faszination Programmieren", edition 2/93.
It helped me a lot, and in conjunction with the CIA registers you get really surprising results.

This tool was mainly written for the MC68000; for other processors of the 68k family you have to change the value for the execution time of an empty loop and perhaps the rounding value. Just play with the values. But there is no guarantee it works properly on 68020+ machines; because of their caches and pipelining behaviour it can only give a rough orientation.

Have fun with it.
Attached Files
File Type: s Tactic.s (4.1 KB, 71 views)
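
For reference, the core trick behind such timing tools is simple enough to sketch. This is my own minimal illustration of the idea, not the attached Tactic.s; a real measurement should run with interrupts off (or with the timer properly allocated from the OS), and the code under test must finish within $FFFF E-clock ticks (about 92 ms):
Code:
CIAA_TALO = $BFE401           ; CIA-A timer A, low byte
CIAA_TAHI = $BFE501           ; CIA-A timer A, high byte
CIAA_CRA  = $BFEE01           ; CIA-A control register A

	move.b	#$00,CIAA_CRA     ; stop timer A
	move.b	#$FF,CIAA_TALO    ; latch = $FFFF, the timer
	move.b	#$FF,CIAA_TAHI    ; counts down from there
	move.b	#$11,CIAA_CRA     ; force-load latch, start, continuous

	; --- code under test goes here ---

	move.b	#$00,CIAA_CRA     ; stop the timer before reading it
	move.b	CIAA_TAHI,d0      ; assemble the 16-bit count
	lsl.w	#8,d0
	move.b	CIAA_TALO,d0
	not.w	d0                ; elapsed ticks = $FFFF - count
One E-clock tick is the CPU clock divided by 10 (about 709 kHz on PAL), so multiply by 10 for CPU cycles and, as the tool does, subtract the same measurement taken over an empty loop.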
dissident is offline  
Old 15 November 2018, 11:11   #787
roondar
Registered User
 
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,408
Edit: this post has turned out a lot longer than I planned. The book frank_b shared (Zen of Assembly Language by Michael Abrash) is really quite an interesting read and I found myself reading through most of it - though I skimmed a few chapters as well. It taught me a lot about the 8088/8086 and I've been adding insights from the book to this post during the day rather than making separate posts.

Sorry for the TL;DR vibe
I've marked my continued edits in italics so it's clearer where I've added stuff I learned from this book.

Quote:
Originally Posted by frank_b View Post
At least according to Zen of Assembly Language.
This book turns out to be available online as well. Very interesting read (including the bits about prefetch and how it impacts performance from the theoretical cycle counts), though the author is a total Intel fanboy

In essence, it turns out that Motorola was just more 'complete' with their 68000 cycle numbers, because they could easily include the (admittedly hard-coded) prefetch in the cycle counts found in their manuals (they also try to do so, to a point, for the 68020 onwards by explaining best/worst/cache performance and concurrency), where Intel could not* and thus didn't.

Which makes the 68000 look slower than it really is when compared to the Intel 8088/8086, whose documented cycle counts don't include these fetches, so real code regularly runs slower than stated. Note that all this doesn't mean the Intel cycle counts as given are dishonest; rather, they measure a different thing and as such require a bit more work to get the complete picture.

I don't know whether this situation also occurs for the 286/386/486, but it wouldn't be surprising if the same thing applies. Generally, manufacturers don't change their method of reporting specs unless they have a good reason to do so.

Edit: the book in question is mostly about the 8088 though and notes the situation is much better on the 8086. There still will be situations in which the prefetch queue lags behind internal code execution (in his words: "That’s not to say that the 8086 doesn’t suffer from those cycle-eaters at all; it just suffers less than the 8088 does. Instruction fetching is certainly still a bottleneck on the 8086"), but there are fewer of them on the 8086 vs the 8088.

One interesting bit is that neither the 8088 nor the 8086 can fetch from memory faster than at 4-cycle intervals. This means that running multiple 2-cycle instructions in a row will eventually make them execute at 4-cycle intervals, once the prefetch queue runs dry. Another interesting element is that the 8086 is hampered by code that accesses words that aren't aligned on 16 bits, which means that instructions with an odd number of bytes can slow the CPU down when it's dealing with 16-bit data.

Notably, the book also mentions that the prefetch queue can still lag behind significantly on both the 286 and 386 (there is no info on the 486 in this book) and is in fact more likely to do so, as the memory paired with these CPUs tends to be slower than the CPU can fetch instructions. The given example is the 3-cycle mov [WordVar],0 on the 286, which on a real-life AT can actually take a full 12 cycles to execute rather than the 3 cycles claimed by the manual.

The more I read about this, the clearer it gets: the 8086, 80286 and 80386 cycle counts as given are not at all useful for determining the actual expected performance, because they all omit prefetching. This effect also neatly explains why benchmarks and real-life use show the 68k equivalents (68000 vs 8086, 68020 vs 80286, 68030 vs 80386) as consistently faster than their Intel competitors, while you might conclude the opposite by looking at the CPU cycle counts as given.


*) Considering how complicated this can get, I'm not too surprised, to be honest - just look at the MC68020 stuff in the Motorola manuals and how unclear the actual performance of individual instructions gets once cache and internal concurrency are part of the deal. The book goes into great detail about why Intel couldn't give figures including the prefetch.
Quote:
Originally Posted by mc6809e View Post
This might already be somewhere around here but I thought I'd post it as it seems somewhat relevant:

http://nemesis.hacking-cult.org/Mega...tion/Yacht.txt

Amazing document concerning the cycle by cycle behavior of the MC68K for every instruction and even interrupts.

Very interesting.
That is a really interesting document, I'll be saving & using that for sure!


Edit: here's one rather funny (considering the really rather long discussion about it here) quote from the book Zen of Assembly Language, where the author discusses segmentation. It's especially notable since the author is both clearly an expert on the 8088/8086 and really positive about the Intel 8088/8086 throughout the book:
Quote:
Originally Posted by Zen of Assembly Language
In a nutshell, the 8088 is a 16-bit processor with an 8-bit data bus capable of addressing 1 Mb of memory in total but no more than four 64Kb byte blocks at a time and that via a remarkably awkward segmented memory scheme.
Note that he has a lot more to say about segmentation than the above and almost all of it is negative. For instance, he notes how it complicates code and can cause subtle bugs that simply don't exist on architectures without segmentation. So it seems that people who coded for the 8088/8086 professionally considered segmentation awkward after all

Last edited by roondar; 15 November 2018 at 17:02. Reason: Added a bunch of info from Zen of Assembly language
roondar is offline  
Old 15 November 2018, 19:50   #788
litwr
Registered User
 
Join Date: Mar 2016
Location: Ozherele
Posts: 229
Quote:
Originally Posted by meynaf View Post
Being typical does not make it more time critical.
But it's interesting to see you qualify it as "typical" now. So it's typical when you get good speed measurements for your beloved x86, but it's a particular case when it comes to code density? Damned, if that's not biased reasoning then what is it.
x86 is your beloved thing - you have proved this many times. My beloved thing is the pure truth only. I wasn't looking for a specific case. It was you who provided me with the code, provoking me to show the x86 superiority that is so sweet to you.

Quote:
Originally Posted by meynaf View Post
PC-relative addressing is very, very common in 68k code. It makes the code shorter and often even faster.
But as x86 does not have proper PC-relative modes, it's normal you don't see the interest
IMHO it is rather a weak point. It helps to get position-independent code, but that is not very important. It also allows a shorter address for data close to the PC, but that doesn't give any significant advantage either.

Quote:
Originally Posted by meynaf View Post
You are confusing external libraries with members of the language.
What is the difference? Some C++ keywords can't be used without a proper header. Indeed debugging is often harder with libraries. C and C++ can't be used without libraries.

Quote:
Originally Posted by roondar View Post
And your evidence for most or all of the A4000/40s being sold in Germany is what?

And remember, you claimed no more than a few hundred 68040 based machines were made worldwide. This is clearly nonsense. You also forget there were a whole bunch of Apple Macs based on the 68040. These were obviously mass produced and were part of the market for a number of years.

Barely any, there was an Amiga model and some workstations here and there but not much else. Interestingly, most of the companies that primarily used Motorola in the past didn't switch from 68k to Intel at the end of the '68k era', but chose from a variety of RISC based CPU's instead.
About Germany I only claimed that it is quite possible. The Apple Macintosh used the 68LC040, the reduced version of the 68040, and some of those models were upgradable to PowerPC. Even Apple, with its quite expensive hardware, couldn't afford to use the full 68040 and preferred to migrate to PowerPC. Early 68LC040 chips also had a very unpleasant hardware bug.

Anyway, companies left Motorola because its processors lost power. Motorola tried to make a VAX-like processor fast and it was impossible. IMHO the new addressing modes of the 68020 and some minor complexity of the original 68000 killed 68k prospects in about 1991.

Quote:
Originally Posted by roondar View Post
As I explained before, one tiny example is not nearly enough to make it into generally applicable fact. I already said that and I stand by that.
Your point is strong enough. It requires some research to beat it. Sorry, I cannot afford this. But my result (the tiny example) happened by accident with a randomly taken algorithm, so it is not as weak as you try to prove.

Quote:
Originally Posted by roondar View Post
You are also really, really, really misrepresenting the results of that blog post - the compiler did indeed originally produce code that was 100 bytes long. However, after changing the compiler flags to produce size optimised rather than non optimised code and code that was specific to the actual CPU used rather than general ARM code, it dropped down to only 16 bytes!!

Just to make sure you understand this: note that this improvement is made merely by changing some compiler flags. He did not do any assembly programming at all to reach 16 bytes. After this improvement, he changes the C code ever so slightly and gets it to compile to just 12 bytes. Again, note that all of this is without any assembly language on his part - all this was done merely by changing compiler flags and changing one line of C code.

Then he takes the 12 bytes example and manages to improve that to 10 bytes. In other words, he managed to hand optimise the code by all of 6 bytes if we include the changed line of C code as an example of 'hand optimised assembly' (which it isn't!) and if we exclude that he only managed to optimise the final result of the compiler by 2 bytes. All other optimisation was done purely by the compiler.
The author of the blog post didn't do any assembly; he only used gcc flags and modified the C source. That post is dated 2009. I tried a modern GCC and got 20 bytes without Thumb. So the quality of ARM compilers even 9 years ago was poorer than it is today.

Code:
poll():
   0:	e5913000 	ldr	r3, [r1]
   4:	e0000003 	and	r0, r0, r3
   8:	e1c33000 	bic	r3, r3, r0
   c:	e5813000 	str	r3, [r1]
  10:	e12fff1e 	bx	lr

Quote:
Originally Posted by roondar View Post
You haven't proven this 10x at all, as I've tried to explain before and tried to do again just now. Much more than a single application is needed. Emulators might simply be a better fit to the Archimedes (for instance due to the screen memory layout of the Archimedes being much closer to the PC one than the Amiga screen memory layout is).
I'm a bit surprised by your point. You have found out that according to the official benchmark sheets ARM is about 5 times faster than 68000. So just apply the information about Amiga 500 memory speed. Its slow/chip memory is about 2 times slower than its fast memory. This gives us 10x.

Quote:
Originally Posted by roondar View Post
It's really quite complicated to compare performance. In the last post, I even gave a counter example: Archimedes 3D games are not 10x the speed of the Amiga version, even though they suit the Archimedes very well. Another example: the A500 can accelerate line drawing using the Blitter. The speed increase varies with length, but for most lines it's apparently at least as fast as a high speed 68030.
It is about graphics. Our discussion is about CPU power only.

Quote:
Originally Posted by roondar View Post
1) You are not reading the benchmarks correctly. Setting the A1200 as baseline gives the A500's best number as 58% - that does not mean the A1200 is 40% faster. I've attached an image (note, I've cut away the superfluous part with an image editor) to show the results when the A500 is set as the baseline. As you can see, the A1200 averages to about 2x the speed of the A500 and is still at least 70% faster than the A500 in the worst test result.

2) It is common knowledge the 68020 in the A1200 was held back by the slow on board RAM. If you add any amount of trapdoor RAM to the A1200, it becomes about 4x the speed of the A500 - without needing to upgrade the CPU. As an example, the GVP1208 (which does nothing but add 8MB of RAM to the A1200) just about doubles the speed of the base A1200.
Thank you for these interesting details. However my point was rather about a different thing: the speed of a mass-produced, inexpensive computer. A stock A1200 has performance around that of an 80386 at 10 MHz, and it is a computer released in 1992, when even 12 MHz 80286 systems were rather obsolete.

Quote:
Originally Posted by roondar View Post
You are aware you can do this in Amiga Basic (yes, the Microsoft variant) as well (though I do admit it's not terribly intuitive), right?

Or if you prefer a simpler approach, this can also be done using Blitz Basic and AMOS PRO (though the last one is a bit of a stretch as it isn't as system friendly and thus doesn't do windows per se).

Edit: it occurs to me the above might not be quite clear enough on what I mean, so here goes:

All the above forms of BASIC can play back IFF animations (though it is easier using AMOS Pro/Blitz) and at least two of them can do so in an entirely OS friendly way. All these forms of BASIC can also create and display animations in different ways (such as using sprites, BOBs, etc). The part about the window might be more interesting - I'm 100% sure Blitz and Amiga Basic can display IFF animations on a standard Amiga screen and as such, the animation and listing can be shown at the same time.

Getting the listing to display on the same screen (or in a window on the WB screen) as an IFF animation may be more tricky as WB1.3/WB2.0 were not really designed for that, but should probably still be doable. Displaying BOBs/sprites in a window should be easier to do.
Can you give me a link to a textual Basic application which can show a 3D flying plane in an OS window on an A500?! The Archimedes doesn't just play back video data, it draws the picture!

Quote:
Originally Posted by meynaf View Post
By the way, if we take that into account with my 236 bytes version...
Indeed 236 bytes get a number less than my 168 bytes! Impeccable logic!

Quote:
Originally Posted by roondar View Post
I didn't really mean to look at this much, but the 68000 cycle counts you show there don't look to be correct to me. For instance, there are no 68000 opcodes with odd cycle counts. I've made an attempt as well, the code you show should have the following cycle counts for the 68000.
Sorry but you are wrong here. If the branch at line #2 is taken it costs 10 cycles and the next two instructions are skipped; if it is not taken, it takes 8 cycles and the next two instructions are processed in 8 more cycles. This gives us 10 cycles for one variant and 16 cycles for the other, so we get 13 cycles on average. Add.l and sub.l could be replaced by their word variants; I actually counted for the word instructions.
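
Spelled out with placeholder instructions (the real code differs; only the cycle arithmetic matters):
Code:
	bcc.s	.skip		; taken: 10 cycles, next two skipped
	add.w	d1,d0		; 4 cycles > executed only when the
	sub.w	d2,d3		; 4 cycles > branch falls through (8c)
.skip
; taken path:     10 cycles
; not-taken path:  8 + 4 + 4 = 16 cycles
; average:        (10 + 16) / 2 = 13 cycles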

Quote:
Originally Posted by roondar View Post
The 68020 is much harder to 'cycle count' for because the 68020 has a cache which means execution times start to differ depending on the code being inside or outside of the cache (stuff in cache is much faster). More so, code running from the cache can continue to run during memory accesses of prior instructions so it's possible for some opcodes to take '0 cycles' by being run during a memory access. The Motorola manual has an example like this:
This code is in a loop so it should be all in the cache.

Quote:
Originally Posted by frank_b View Post
Do your 8086 timings take into account memory accesses during prefetch?
Every opcode fetch is going to cost cycles.
Can you please add best case/worst case columns for 8086.
The x86 does prefetch as a parallel task; no additional cycles are required. Indeed it is possible for fast instructions to execute quicker than the prefetch queue can fill, but my code contains several jumps which refill the queue.

Quote:
Originally Posted by mc6809e View Post
This might already be somewhere around here but I thought I'd post it as it seems somewhat relevant:

http://nemesis.hacking-cult.org/Mega...tion/Yacht.txt

Amazing document concerning the cycle by cycle behavior of the MC68K for every instruction and even interrupts.
Thank you very much.

Quote:
Originally Posted by roondar View Post
Note that he has a lot more to say about segmentation than the above and almost all of it is negative. For instance, he notes how it complicates code and can cause subtle bugs that simply don't exist on architectures without segmentation. So it seems that people who coded for the 8088/8086 professionally considered segmentation awkward after all
Indeed, when you need a lot of memory for a single application it is no good, but the 8086 appeared in 1978, when 128 KB was quite a luxury even for a minicomputer. So until the mid-80s the 8086 scheme was very good; then the 80386 appeared.
litwr is offline  
Old 15 November 2018, 20:28   #789
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by litwr View Post
x86 is your beloved thing - you have proved this many times. My beloved thing is the pure truth only. I wasn't looking for a specific case. It was you who provided me with the code, provoking me to show the x86 superiority that is so sweet to you.
Of course x86 is NOT my beloved thing. It's yours.
Seems you always interpret things in a way that suits you, without any hesitation to twist reality.
You have not shown any x86 superiority of any kind; actually, what you've shown here is a demonstration of bad faith and outright cheating.


Quote:
Originally Posted by litwr View Post
IMHO it is rather a weak point. It helps to get position-independent code, but that is not very important. It also allows a shorter address for data close to the PC, but that doesn't give any significant advantage either.
No, it's not a weak point. It gives shorter, faster code that can be run from any place. Why, even ARM has this (and it got added in x86_64 because it became mandatory!). A quick 68000 illustration follows.
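
(My illustration, using the standard 68000 cycle counts; the label and data are of course made up:)
Code:
	lea	table,a0	; absolute long: 6 bytes, 12 cycles
	lea	table(pc),a0	; PC-relative:   4 bytes,  8 cycles
	move.w	table,d0	; absolute long: 6 bytes, 16 cycles
	move.w	table(pc),d0	; PC-relative:   4 bytes, 12 cycles
	rts
table	dc.w	0		; some data close to the code
And the PC-relative forms keep working wherever the code is loaded, which is the whole point of position independence.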


Quote:
Originally Posted by litwr View Post
What is the difference? Some C++ keywords can't be used without a proper header. Indeed debugging is often harder with libraries. C and C++ can't be used without libraries.
That's a weak point. Not everything is readily available (by the way, the C++ lib is a real horror, look at the code if you don't believe me).
And of course any language can be extended with libraries. Makes many of them end up with large, ominous frameworks. Keep it simple? Nah.


Quote:
Originally Posted by litwr View Post
Indeed 236 bytes get a number less than my 168 bytes! Impeccable logic!
And now 171 equals 168...
I told you how to compute from 236 so please get your maths right.
Again:
- you used headerless : -38 (236 -> 198)
- you don't have dos.library : -12 (198 -> 186)
- you don't have to open that library : -12 (186 -> 174)
- you don't have to close it : -8 (174 -> 166)

ALWAYS REMEMBER THAT COUNTING OS CODE IS UNFAIR.
(Do I have to write this in red and blinking for you to finally get it??)

Another proof of cheating: your code does not allocate memory properly and happily overwrites free memory.
That's pretty dirty, yep.

Last edited by meynaf; 15 November 2018 at 20:39.
meynaf is offline  
Old 15 November 2018, 21:19   #790
frank_b
Registered User
 
Join Date: Jun 2008
Location: Boston USA
Posts: 466
Indeed it is possible that the prefetch will stall the CPU whilst waiting for main memory to fetch opcodes.

I rewrote that for you to match what is said in the Abrash book.
Add the best and worst case please. Also detail how many clocks it takes to load 16 bits into the prefetch queue. The Abrash book says 4 cycles on an 8086 and 8 on an 8088. It's not hard: just multiply the number of bytes in each opcode by the cycle cost of memory accesses. That's your worst case. I suspect any branch or loop will flush the remainder of the prefetch queue and the CPU will stall.
frank_b is offline  
Old 15 November 2018, 21:41   #791
Megol
Registered User
 
Megol's Avatar
 
Join Date: May 2014
Location: inside the emulator
Posts: 377
Quote:
Originally Posted by roondar View Post
Note that he has a lot more to say about segmentation than the above and almost all of it is negative. For instance, he notes how it complicates code and can cause subtle bugs that simply don't exist on architectures without segmentation. So it seems that people who coded for the 8088/8086 professionally considered segmentation awkward after all
And this is a surprise to you? Segmentation in itself can be a useful tool; however, the x86 version isn't very good. The best use of the protected-mode version of it (where a segment number is used as a lookup and not as an address) is an obscure experimental object-oriented kernel that removed most overheads by using segmentation for protection.

(Will I be excommunicated after admitting I've once tried to retrofit segmentation to a 68k type processor? )
Megol is offline  
Old 15 November 2018, 23:11   #792
roondar
Registered User
 
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,408
Quote:
Originally Posted by litwr View Post
About Germany I only claimed that it is quite possible. The Apple Macintosh used the 68LC040, the reduced version of the 68040, and some of those models were upgradable to PowerPC. Even Apple, with its quite expensive hardware, couldn't afford to use the full 68040 and preferred to migrate to PowerPC. Early 68LC040 chips also had a very unpleasant hardware bug.
So you're just making stuff up then (about Germany). Got it. Also, stop moving the goalposts. You claimed only a few hundred 68040's were sold. This is nonsense, plain and simple. First off, the 68LC040 is a 68040 as much as the 486SX (or the 486DX/2 for that matter) is a 486. Secondly, Apple sold a whole bunch of 68040 Macs that had a full 68040 in them. No matter how much you dislike it.

As for hardware issues, the 68040 was not the only CPU of that 'generation' to have those. The original Intel 486 had thermal issues and compatibility problems at higher speeds. The 486DX/2 was made in large part to fix these problems and had clock for clock lower performance than the original 486 (due to the bus being run at half speed).

Quote:
Anyway, companies left Motorola because its processors lost power.
And when they stopped using Motorola, they chose something other than Intel. Which seems to imply Intel didn't offer 'power' either.
Quote:
Motorola tried to make a VAX-like processor fast and it was impossible. IMHO the new addressing modes of the 68020 and some minor complexity of the original 68000 killed 68k prospects in about 1991.
Your IMHO is wrong on both counts: the 68k was selling very well in the consumer PC market until the mid-1990s. Not only that, but each 68k CPU was faster than the equivalent x86 CPU. This only changed with the 486DX/2@66MHz (though it briefly flipped again with the 68060 vs the Pentium). Furthermore, the 68060 and the later ColdFire stuff showed that the performance limit for the ISA was clearly not reached with the 68040.

But more importantly, the 68k series actually sold very well for a long time after that as well, just not in desktop PC's.
Quote:
The author of the blog post didn't do any assembly; he only used gcc flags and modified the C source. That post is dated 2009. I tried a modern GCC and got 20 bytes without Thumb. So the quality of ARM compilers even 9 years ago was poorer than it is today.
But still very good as he got 24 bytes without using thumb. Which is still miles better than the 100 bytes you claimed he got - i.e. your original post about this is very misleading. Besides, his optimised 10 byte example used thumb code to make it smaller so it's hardly fair to compare a non-thumb compiled example to it (what, with thumb opcodes being 16 bits rather than the usual 32 bits for ARM opcodes).

In other words, your position about ARM compilers being terrible is still false.
Quote:
I'm a bit surprised by your point. You have found out that according to the official benchmark sheets ARM is about 5 times faster than 68000. So just apply the information about Amiga 500 memory speed. Its slow/chip memory is about 2 times slower than its fast memory. This gives us 10x.
This is a very ignorant statement to make. Mainly because it's not true.

First off, on the A500 fast memory runs at the exact same speed as chip memory - the only difference is there can't be any custom chip activity in fast memory.

Secondly, the 68000 in the A500 runs at full speed because (and this might shock you) the designers of the Amiga used memory that was much faster than needed to service the 68000 at full speed.
Quote:
It is about graphics. Our discussion is about CPU power only.
The Archimedes 3D games vs A500 3D games point I made is not about graphics, it's about CPU power. The other example is merely to show that looking at system performance requires more than just checking the CPU. But if you'd prefer to only look at the CPU that's fine. My point doesn't really change.

Anyway, I just noticed something - you repeatedly used the line-drawing example to try and show speed differences, and even though I don't agree this is a valid way to go about it, I did chuckle a bit when I noticed this. You wrote "I have done some corrections to my cycles count for the line drawing algorithm main loop: ARM - 14, 80486 - 22, 80386 - 57, 80286 - 59, 8088/8086 - 98, 68000 - 63". This shows a 4.5x speed improvement of the ARM over the 68000, which is rather close to my claims of 5x speed

Heck, given the 7MHz Amiga vs the 8MHz Archimedes, it's just about spot on

Quote:
Thank you for these interesting details. However my point was rather about a different thing: the speed of a mass-produced, inexpensive computer. A stock A1200 has performance around that of an 80386 at 10 MHz, and it is a computer released in 1992, when even 12 MHz 80286 systems were rather obsolete.
It doesn't have performance that poor because the overall system design is much better than the 386 you compare it with. This is easy to verify - A1200 games ported to x86 hardware always needed a fairly hefty system to keep up. For instance, the game Super Stardust runs on a basic A1200, yet requires a 486@33MHz with 4MB to run for the PC version (http://www.oldpcgaming.net/super-stardust-96/). Similarly, a basic A1200 could be used for productivity just fine without being anywhere near as slow in operation as a low speed 386.

Also, the A1200 was a mass-produced, inexpensive system with a launch price below even cheap 386-based systems, so even if it was somewhat slow this was kind of to be expected.
Quote:
Can you give me a link to a textual Basic application which can show a 3D flying plane in an OS window on an A500?! The Archimedes doesn't just play back video data, it draws the picture!
I agree that changes things. To be fair, this is not what I got from "8 MHz Archimedes showed animated plane flight in a window" right after a discussion on 2D performance

Anyway, the closest I can get is AMOS-3D. Which is no doubt less impressive and indeed does not render in a window, though it does render full flat shaded 3D Polygons in real time. It does have a nice (for the time) model of the TOS Enterprise though.

Edit: thinking about this some more, without seeing the source code for the Archimedes (and the demo in action), your example by itself doesn't mean all that much. It could be all Basic code, but then again, Archimedes Basic lets you include assembly language at any point, so it could also be accelerated. Likewise, it could be a huge impressive window with a very detailed plane and it could also be a small window with a less impressive plane. Note that I'm not saying it is any of these things, just that I can't judge how 'fair' a point you're making without a great deal more information.

Which is why I'd love to see a video of this. More so because it's really interesting to see these things, regardless of our discussion.

Quote:
Sorry but you are wrong here. If the branch at line #2 is taken it costs 10 cycles and the next two instructions are skipped; if it is not taken, it takes 8 cycles and the next two instructions are processed in 8 more cycles. This gives us 10 cycles for one variant and 16 cycles for the other, so we get 13 cycles on average. Add.l and sub.l could be replaced by their word variants; I actually counted for the word instructions.
Fair enough, I misread your cycle counts as applying only to the instructions they were posted next to. I didn't realise you averaged the paths out.

Quote:
This code is in a loop so it should be all in the cache.
Still not easy as the 68020 can (and does) concurrently execute several instructions - or put differently, it's possible for an instruction to take zero cycles on the 68020, depending on what the instruction before it was doing.

As such, I'm not counting 68020 cycles - it's complicated and very easy to get wrong.

Quote:
The x86 does prefetch as a parallel task; no additional cycles are required. Indeed it is possible for fast instructions to execute quicker than the prefetch queue can fill, but my code contains several jumps which refill the queue.
This is simply not true - additional cycles are required in many cases. This is one of the key points of the book we're talking about. Secondly, refilling the queue is one of the best ways to degrade performance on any x86 processor up to at least the 386, as the book explains.

Quote:
Indeed, when you need a lot of memory for a single application it is no good, but the 8086 appeared in 1978, when 128 KB was quite a luxury even for a minicomputer. So until the mid-80s the 8086 scheme was very good; then the 80386 appeared.
The 8088 and 8086 were used in computers with way more memory than you seem to think, and at much earlier dates. Case in point: in 1982 you could get a NEC PC with an 8086 and 128K up to 640KB of RAM. There were even portable PCs out by 1983 with 128K to 640K of RAM (such as the Compaq Portable).

By 1984, 256K+ was pretty common. This was several years before the 386.

But all of the above presumes the 8086 was aimed only at consumer level 'cheap' hardware and that is clearly not true. One of the first designs was the 1978 Xerox NoteTaker which came with 256KB of RAM as standard. There was also the 1980 IBM Displaywriter, which came with anywhere from 160 to 224KB of RAM.

In other words, the 8086 was used in systems with more than 128KB of memory from the very start and as such memory segmentation would've been a drawback for these systems at the very start.

---
Quote:
Originally Posted by Megol View Post
And this is a surprise to you? Segmentation in itself can be a useful tool; however, the x86 version isn't very good. The best use of the protected-mode version of it (where a segment number is used as a lookup and not as an address) is an obscure experimental object-oriented kernel that removed most overheads by using segmentation for protection.
Oh, it's not a surprise at all - all I'm doing is giving a counterpoint to the 'segmentation is great' rhetoric in this thread by showing that professionals back in the day didn't actually agree. And not just any old expert, but someone who is highly regarded in the 8088/8086 world and wrote a very in depth book about optimising for the 8088.

On the topic of segmentation, I'm not against segmentation by default. I am against pretending the 8086 implementation is an objective benefit when it actually was a measure implemented to save a little bit of money on bus lines at the expense of making coding harder (obscure examples like yours set aside).

It's a clear example of penny pinching at the expense of a forward outlook. Case in point, 256K+ was a fairly common amount of memory at the high end even in 1978 and the 8086 was too expensive for use in low end stuff at that time.

Intel itself obviously grasped this quite early on and tried their hardest to get rid of it - hence protected mode. Alas, DOS and people buying tons of crappy 8088's made that a hard thing.

Quote:
(Will I be excommunicated after admitting I've once tried to retrofit segmentation to a 68k type processor? )
Nah, 500 "hail the Galvins" and a small donation to the "Creaky old Amigas" fund will get you off the hook. This time

Last edited by roondar; 16 November 2018 at 16:27.
roondar is offline  
Old 16 November 2018, 02:46   #793
mc6809e
Registered User
 
Join Date: Jan 2012
Location: USA
Posts: 372
Quote:
Originally Posted by meynaf View Post
Doc says 6 but measurement on emulator says 8.
As memory cycles are 4 cpu cycles, it seems logical that instructions execute in multiples of 4 cycles, btw.
Except that a memory cycle can start on a 2 cycle boundary. It still takes 4 cycles to complete the access but an instruction can actually take a multiple of 2 cycles.

It's one thing that makes the Amiga slightly less slow than the ST for certain codes. There may even be (very) special cases when the Amiga is faster than the ST -- probably codes that use lots of longs and branches.
mc6809e is offline  
Old 16 November 2018, 07:12   #794
frank_b
Registered User
 
Join Date: Jun 2008
Location: Boston USA
Posts: 466
Quote:
Originally Posted by mc6809e View Post
Except that a memory cycle can start on a 2 cycle boundary. It still takes 4 cycles to complete the access but an instruction can actually take a multiple of 2 cycles.

It's one thing that makes the Amiga slightly less slow than the ST for certain codes. There may even be (very) special cases when the Amiga is faster than the ST -- probably codes that use lots of longs and branches.
Don't forget that cycle counting only happens on ST RAM/Chip RAM. ROM access is not affected on either machine. Branches and shifts tend to be faster than in chip RAM. Even for <4 planes.
frank_b is offline  
Old 16 November 2018, 10:46   #795
roondar
Registered User
 
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,408
Quote:
Originally Posted by frank_b View Post
Don't forget that cycle counting only happens on ST RAM/Chip RAM. ROM access is not affected on either machine. Branches and shifts tend to be faster than in chip RAM. Even for <4 planes.
During custom chip access of memory this is technically true, but the difference is really rather small for normal code. Anyway, isn't the Atari ST much more affected by its interleaving scheme?

On the Amiga the display DMA interleaving only happens during the visible part of the screen (that is, only when pixels are actually shown). AFAIK on the Atari ST the same style interleaving happens all the time. Logically then, on the Amiga there are plenty of cycles where 'odd length' CPU instructions get to execute at their real speed, while there are none at all for the Atari ST (i.e. all CPU instructions execute at a multiple of 4 cycles).

Anyway, this thread is big enough without adding ST vs Amiga to it, so maybe we should discuss this elsewhere.
roondar is offline  
Old 16 November 2018, 11:53   #796
frank_b
Registered User
 
Join Date: Jun 2008
Location: Boston USA
Posts: 466
Let's just agree that the ST/Amiga would handily outrun an 8086 at the same clock
frank_b is offline  
Old 16 November 2018, 12:04   #797
roondar
Registered User
 
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,408
Any day of the week!

And let's not even get into the massive performance benefit they both have compared to the 8088 based PC's that were actually used by people. After all, as Zen of Assembly language teaches us, the 8086 is much faster than the 8088 (the book claims an average of about 30%, but states this can go up to much higher for some code, due to prefetching blues).
roondar is offline  
Old 17 November 2018, 03:25   #798
mc6809e
Registered User
 
Join Date: Jan 2012
Location: USA
Posts: 372
Quote:
Originally Posted by frank_b View Post
Don't forget that cycle counting only happens on ST RAM/Chip RAM. ROM access is not affected on either machine. Branches and shifts tend to be faster than in chip RAM. Even for <4 planes.
What about for zero planes, like during the overscan and HBlank and VBlank?

Dang it. Now I'm getting sucked into another ST vs Amiga blackhole.

Ignoring that, I think the bus access timing information revealed at http://nemesis.hacking-cult.org/Mega...tion/Yacht.txt might be useful in getting the most out of every available chip access cycle.

Knowing just when instruction fetch occurs might really help in coordinating CPU/Blitter/Bitplane DMA. For 3D codes, for instance, blitter clears or line draws or area fills could be set to run mostly in the overscan areas with CPU code full of instructions that have few or no idle CPU cycles, thus using idle cycles left by the blitter during a clear.

For that part of the code that has many DIVs and MULs, these might be run during bitplane DMA and blitter stencil copies, the blitter using all the CPU idle cycles.

I could see this actually being effective on a 640x400, 3 bit plane frame buffer if MUL and DIV instructions are executed just before bitplane DMA fetches on each scanline.

It would be tricky for sure, though.
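
Still, a rough sketch of the shape it might take (my illustration only; start_blit_clear stands in for a hypothetical routine that would set up the blitter's D channel and write BLTSIZE):
Code:
CUSTOM	= $DFF000
DMACONR	= $002

	lea	CUSTOM,a5
	bsr	start_blit_clear	; hypothetical: kick off a blitter clear
.muls	muls	d1,d2			; 38-70 cycles of mostly internal
	muls	d3,d4			; work, leaving the bus to the blitter
	dbf	d7,.muls
.wait	btst	#6,DMACONR(a5)		; BBUSY (bit 14 = bit 6 of high byte)
	bne.s	.wait			; spin until the blitter is done
The MULs keep the CPU off the bus for long stretches, which is exactly the window the blitter (or bitplane DMA) can soak up.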
mc6809e is offline  
Old 17 November 2018, 08:47   #799
frank_b
Registered User
 
Join Date: Jun 2008
Location: Boston USA
Posts: 466
Code containing nothing but shifts and branches executes faster from fast RAM. Even if all bitplanes are switched off. I believe I tested this on an NTSC Amiga 1000 a year or so ago.
frank_b is offline  
Old 17 November 2018, 08:52   #800
Bruce Abbott
Registered User
 
Bruce Abbott's Avatar
 
Join Date: Mar 2018
Location: Hastings, New Zealand
Posts: 2,544
Quote:
Originally Posted by frank_b View Post
Code containing nothing but shifts and branches executes faster from fast RAM.
How much faster and with which Fast RAM?
Bruce Abbott is offline  
 

