68k details - Page 39

mc6809e · 12 November 2018, 02:56

I'm surprised no one mentioned a potentially difficult problem using x86 segments and C: pointer aliasing issues.

It's entirely possible for two pointers to point to precisely the same memory location and still be unequal.

This can make certain compiler optimizations next to impossible even if the programmer knows that two pointers can't possibly point to the same memory location. This goes for OoO processing at the processor level as well.

Dependencies are created that limit the degree to which operations can be reordered or elided.
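
A minimal sketch of the aliasing point, assuming plain 8086 real-mode translation (linear = segment * 16 + offset, wrapped to 20 bits). The function name and the sample pointers are made up for illustration:

```python
# Real-mode 8086 address translation (assumption: 20-bit wrap, no A20 handling).
def linear_address(segment: int, offset: int) -> int:
    return ((segment << 4) + offset) & 0xFFFFF

# Two far pointers with different segment:offset pairs (hypothetical values).
p = (0x1234, 0x0010)
q = (0x1235, 0x0000)

# Compared as raw segment:offset pairs they are unequal...
same_bits = p == q
# ...but both translate to the same linear address, so they alias.
same_byte = linear_address(*p) == linear_address(*q)

print(same_bits, same_byte, hex(linear_address(*p)))
```

A compiler that only sees the raw pairs cannot cheaply prove that `p` and `q` refer to different memory, which is where the extra dependencies come from.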

chb · 12 November 2018, 09:54

Quote:

Originally Posted by touko

The question is, what could the A500 do with chip RAM twice as fast as the OCS one (more or less what the Archimedes has)?
With this kind of RAM, the whole system could also run at twice the speed. Does anybody know how much this would cost compared to an Archimedes?
And what performance could we reach?

I think this is a misunderstanding; the Archimedes has 4x the Amiga 500's memory bandwidth, but its DRAM chips are not much faster rated (I think 120 ns versus 150 ns). The advantage comes from two techniques: a) obviously, 32-bit access instead of 16-bit, and b) page mode.

Page mode works because it is faster to read data from the same row (changing only the column address) than having to open a new row (RAS/CAS latency), so accessing multiple consecutive words can be faster than accessing words in random order. The Archimedes' memory controller reads four words in one go when possible, giving a 2-1-1-1 timing (four random words would be 2-2-2-2), so up to 1.6x the throughput (5 cycles instead of 8) in the ideal case.
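
The timing argument can be checked with a tiny sketch. It only adds up the cycle patterns quoted above; the variable names are illustrative:

```python
# Cycles per word for a four-word access, as quoted in the post.
page_mode = [2, 1, 1, 1]      # first access opens the row, the rest stay in it
random_access = [2, 2, 2, 2]  # every access pays the full cost

page_cycles = sum(page_mode)
random_cycles = sum(random_access)
speedup = random_cycles / page_cycles

print(page_cycles, random_cycles, speedup)
```

With these numbers the ideal-case gain is 8/5, so page mode alone does not account for the whole 4x bandwidth difference; the 32-bit bus provides the rest.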

Could this have been used in the Amiga? IMHO only with difficulties.

The chipset is not designed to use consecutive memory accesses, and it reads from a lot of different memory locations. Bitplanes, for example, come from six locations, the blitter has four channels, audio also; I think there are 25 DMA channels in total. To fully take advantage of page mode, every channel would need to read four words in one go before another channel takes over, meaning additional buffers and logic on-chip. A naive calculation would give 25 channels * 64 bits * 6 transistors/bit = 9600 additional transistors, just for buffers, without logic.
Probably half that if you used page mode only for blitter and video.
The Archimedes, on the other hand, has no blitter, and video data comes from just one address (chunky mode), so it is much easier to implement.
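
The naive buffer estimate can be written out as a short sketch. The function and its parameter names are made up here; the 10-channel case (six bitplane plus four blitter channels) is my own reading of "blitter and video", not a figure from the post:

```python
def buffer_transistors(channels, words_per_burst=4, word_bits=16, transistors_per_bit=6):
    """Transistors needed to buffer one burst per DMA channel (buffers only, no logic)."""
    return channels * words_per_burst * word_bits * transistors_per_bit

all_channels = buffer_transistors(25)          # every DMA channel buffered
blitter_and_bitplanes = buffer_transistors(10)  # assumption: 6 bitplane + 4 blitter channels

print(all_channels, blitter_and_bitplanes)
```

Under that assumption the reduced variant comes out somewhat below the "probably half" guess.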

Also, the 68000 is probably less suited for page mode than the ARM2: the ARM is a load/store architecture, so typically you'd read from memory to registers, execute some code and store registers to memory. This generates rather sequential memory accesses, especially as the ARM has conditional instructions and so does not need to branch in a lot of cases. The 68k, on the other hand, has a lot of powerful instructions and addressing modes to work directly in memory, which generates more random access patterns (instruction from one location, data access from another). It also has no conditional instructions, and at 7 MHz it cannot saturate the bus anyway. Of course you could write your code to maximize memory throughput, but compilers, for example, would have needed special adaptation. As the Archimedes was the only ARM platform for some time, I guess compilers were optimized for this memory type there.

So to sum it up, it could probably have been done at least for bitplane and blitter access (AGA does it for the former) with quite some effort, but without speeding up CPU operation.

touko · 13 November 2018, 10:10

Quote:

I think this is a misunderstanding; the Archimedes has 4x the Amiga 500's memory bandwidth, but its DRAM chips are not much faster rated (I think 120 ns versus 150 ns).

I always thought it was 280 ns RAM (because of the 7.14 MHz DMA with 1 access every 2 cycles), but if it's 150 ns you're right, you can't really do more.

EDIT: ok it's 150 of access time, but 260 of cycle time.
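
For reference, the 280 ns figure follows from the clock alone. A small sketch, assuming the 7.14 MHz clock and one 16-bit access every 2 cycles as stated above (variable names are illustrative):

```python
dma_clock_hz = 7.14e6   # clock figure quoted in the post
cycles_per_access = 2   # one memory access every 2 clock cycles
bytes_per_access = 2    # 16-bit bus

slot_time_ns = cycles_per_access / dma_clock_hz * 1e9
bytes_per_second = bytes_per_access * dma_clock_hz / cycles_per_access

print(round(slot_time_ns), round(bytes_per_second / 1e6, 2))
```

So the 280 ns is the length of a memory slot on the bus, not the rated access time of the DRAM chips.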

If this can help :
https://retrocomputing.stackexchange...mory-bandwidth

litwr · 13 November 2018, 19:06

@meynaf I can get 171 bytes for pi-spigot for 80386. Your 68020 code takes 236 bytes. However I added to my article about 68k the sentence "Additionally, as shown by eab.abime.net experts, the code density of 68k is often better than that of x86."

I have made some corrections to my cycle counts for the line drawing algorithm main loop: ARM - 14, 80486 - 22, 80386 - 57, 80286 - 59, 8088/8086 - 98, 68000 - 63. Interestingly, the 80386 has almost the same count as the 80286. IMHO 68k could have been successful, but the 68020 added too many difficult-to-maintain instructions and this was a big mistake. Sorry, I still can't calculate this number for the 68020 - it is very difficult.

I can estimate 32 but IMHO it must be a larger number. My cycle sheet is below.

Code:

                          86 286 386 486
.loop:  call putpixel     19   7   9   3
.m3:    cmp bp,7777        4   3   2   1
0       jne .l3           16  11  10   3 

.m4:    cmp bx,7777
1       je .l4

.l3:    mov ax,cx          2   2   2   1
3    shl ax,1 / add ax,ax  2   2   2   1
4       cmp ax,di          3   2   2   1
5       jl .l5            17  11  10   4

6       add cx,di
.m1:    add bp,8
.l5:    cmp cx,si          3   2   2   1
7       jg .loop          17  11  10   4

8       add cx,si
.m2:    add bx,8
9       jmp .loop         15   8   8   3 

.l4:

                         68000                   68020
.loop
1    sub.l d4,d6         4                       3?
2    bgt.s .xp           13 (10 or 8+4+4=16)     6?
3    add.l a0,d1            
4    add.l a2,d6
.xp
5    sub.l d5,d7         4                       3?
6    bgt.s .yp          13                       6?

7    add.l a1,d2
8    add.l a2,d7
.yp
9    bsr.s setpixel     18                       8?
0    dbf d0,.loop       10                       6?
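
As a cross-check of the x86 part of the sheet, the per-instruction counts can simply be summed per column. This sketch only copies the numbers listed above (instructions without a count are left out, as in the sheet); the dictionary layout is illustrative:

```python
# Cycle counts copied from the x86 sheet, top to bottom, one list per CPU.
cycles = {
    "8088/8086": [19, 4, 16, 2, 2, 3, 17, 3, 17, 15],
    "80286":     [7, 3, 11, 2, 2, 2, 11, 2, 11, 8],
    "80386":     [9, 2, 10, 2, 2, 2, 10, 2, 10, 8],
    "80486":     [3, 1, 3, 1, 1, 1, 4, 1, 4, 3],
}

totals = {cpu: sum(counts) for cpu, counts in cycles.items()}
for cpu, total in totals.items():
    print(cpu, total)
```

The column sums match the totals quoted for the x86 processors (98, 59, 57 and 22).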

Megol · 13 November 2018, 19:10

Quote:

Originally Posted by mc6809e

I'm surprised no one mentioned a potentially difficult problem using x86 segments and C: pointer aliasing issues.

It's entirely possible for two pointers to point to precisely the same memory location and still be unequal.

Yes, doing a full pointer comparison in segmented x86 isn't too nice. Not a common problem though.

Quote:

This can make certain compiler optimizations next to impossible even if the programmer knows that two pointers can't possibly point to the same memory location. This goes for OoO processing at the processor level as well.

The processor uses "pointer" comparisons but only after calculating the linear address, so that is not a problem. However, using segmentation on most OoO processors wastes performance anyway, as segmentation support is slow or very slow.

Quote:

Dependencies are created that limit the degree to which operations can be reordered or elided.

Yes but that wasn't a problem when x86 segmentation was used. The compilers of today are much more aggressive than those of the past.

litwr · 13 November 2018, 19:13

Quote:

Originally Posted by meynaf

A lot of 100% position-independent code exists (don't say relocatable, it's the wrong word for this !).

Why not relocatable? Sorry I can't catch the difference. 68k has too many addressing modes to support the code relocation. It makes loader easier and smaller but it is not much important generally. MMU can provide the code relocation for free - no need to waste the instruction space by additional instructions for this.

Quote:

Originally Posted by roondar

Yes it is: you ignored the rather positive 68000 article frankb posted (as I suggested to him you would) and only looked at the article that was negative.

Sorry it is not easy to find a text without its clear markers - https://archive.org/details/byte-mag...85-09/page/n41 - it is the proper link to the interview with Rod Coleman. That man really liked 68k because he thought that it was more theoretically correct. He said that 68000 has more promising architecture but he missed a simple idea that 80286 may be upgraded when the proper time comes. He missed practical advantages of 80286.

Sorry, I have to add more clarification to my point. For example, for me the 68000 is a bit clumsy. That is because my first x86 experience began with the 80286, which is 3 years newer and has a lot of very fast instructions. Indeed, I couldn't have used this word if I had compared the 8088 and the 68000.

Quote:

Originally Posted by meynaf

Always ready to shoot, hmm ?

I am always ready to present you one or even two shots of the best vodka.

I feel I am in your debt because of your numerous interesting and helpful remarks.

Quote:

Originally Posted by roondar

Even the (admittedly very rare) Amiga's equipped with a 68040 did a lot better than just a few hundred sold (see http://www.amigahistory.plus.com/sales.html - the A4000/40 is listed as selling 3800 units in Germany alone).

I'm not saying that there were 68040's on every street corner, but there clearly were many more made than a few hundred. Motorola would never have bothered with a 68060 if the 68040 was that much of a failure.

I am not sure, but I won't be surprised if we find out that almost all A4000/40s were sold in Germany. There was no real mass-produced 68040-based computer, and there were a lot of cheap 80486-based models.

Was there any computer based on the 68060? As far as I know, it was only available as an upgrade.

Quote:

Originally Posted by roondar

In any case, your original reply to me said the 486@25MHz was faster than the 68040 ("80486@25MHz gives slightly more"). And that is false even in the best case scenario. It's this false bit of info I've been trying to counter.

I can say sorry again. I was rather inattentive. I also gave you some justification: I consider those data to be rather approximate, so there is not much difference between 20 and 21.

Quote:

Originally Posted by roondar

The higher clock speeds of the 486 wasn't what we were debating however, we were discussing performance differences between the more 'affordable models' of the 486/68040 and the ARM2 in 1991. And secondarily, the performance difference between these chips at the same clock speed

My initial point was about 1990 and 1991, and I have to admit that it was wrong about 1991. However, the Archimedes had surprisingly affordable prices considering the power of its competitors.

Quote:

Originally Posted by roondar

What I am disputing is that hand written, optimised ARM2 code is significantly better compared to the compilers available for it at the time VS hand written, optimised 386/68030 code compared to the compilers available for them at the time.

You have a fact about the line drawing algorithm implementation. It is common to consider ARM code density significantly worse than that of x86, and it is common to consider x86-32 code the best here. However, we have 80 bytes of ARM code for this algorithm and 82 for x86. IMHO compilers still produce rather poor code for ARM - read the article - http://benno.id.au/blog/2009/01/01/s...onserving_code - it is about reducing code produced by GCC from 100 bytes to 10 by hand! I can't imagine such a thing for x86.

Quote:

Originally Posted by roondar

I'm 100% certain that comparing a single emulator is not a good way to figure this out (if only given the vastly different performance figures for different emulators emulating the same system on the same CPU* and the differences in screen memory layout). I have no doubt the performance difference here was big though - I am certainly not claiming the Archimedes CPU isn't faster.

To try and get to the bottom of this, I tried to get a better performance comparison. However, it's not so easy to find performance comparisons (that are better than comparing one emulator) between the A500 and ARM2. I have found some comparisons between the A500 and Amigas running a 33MHz 68030 though. This might be somewhat relevant as we did an earlier comparison between that and a 12MHz ARM2. I freely admit this is not the best comparison and will accept a better one if you/I can find one, but here goes:

An A2000 using a 68030@33MHz is about 10x the speed of a basic A500**. The 68030@33MHz is about 1.8x the speed of an 8MHz ARM2 according to the list we've been using. This would translate to the Archimedes being about 5.6x the speed of an A500. Which seems to be about right when looking at 3D games.

Again, this is not a great comparison and I'll gladly accept better ways to measure the difference, but it's still better than looking at a single application

*) As an example, I've had multiple C64 emulators on my expanded A1200 (I'm excluding A64 here as it has very poor compatibility). These really varied in how fast they were - one of the Amiga C64 emulators was completely unusable for me (sub 5FPS), the other worked fairly well (closer to 25FPS).

**) According to http://amiga.resource.cx/perf/aibbde.html, check the "Combo (030/33, 882/33, OCS, 3.0 in RAM)" vs the A500.
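
The chained estimate in the quoted post is a single division. A sketch using only the two ratios given there (the variable names are illustrative):

```python
a2000_030_vs_a500 = 10.0  # 68030@33MHz Amiga relative to a basic A500
a2000_030_vs_arm2 = 1.8   # 68030@33MHz relative to the ARM2, per the list used in the thread

archimedes_vs_a500 = a2000_030_vs_a500 / a2000_030_vs_arm2

print(round(archimedes_vs_a500, 1))
```

Chaining ratios from two different benchmark sources like this compounds their errors, so the result is a rough estimate at best.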

Emulators are very complex programs and therefore can give a very good impression of the performance of hardware. I used text modes and Norton Utilities for benchmarking. It is sad that there is no Archimedes emulator for Linux.

Indeed, we should take into account the quality of emulation, but I don't remember much difference except the very slow work with the Amiga 500.

I don't insist that the Archimedes is always 10 times faster, but IMHO for properly optimised programs it can be faster by about this factor. I repeat: my counts with the standard line drawing algorithm show that ARM can be 50% faster than the 80486 at the same frequency. The Amiga 500 has only slow RAM, and Archimedes RAM access is 2 or even more times faster. So theoretically the 68000 should only be 4-5 times slower, but with the Amiga it is about 10.
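
The "50% faster" and "4-5 times slower" figures correspond to ratios of the main-loop cycle totals posted earlier in this thread. A sketch under the assumption that all CPUs run at the same clock and that only this loop matters:

```python
# Main-loop cycle totals as posted earlier in the thread.
loop_cycles = {"ARM": 14, "80486": 22, "80386": 57, "80286": 59, "8088/8086": 98, "68000": 63}

# Cycles relative to the ARM figure (higher means slower per clock).
relative = {cpu: count / loop_cycles["ARM"] for cpu, count in loop_cycles.items()}

for cpu, ratio in relative.items():
    print(cpu, round(ratio, 2))
```

These ratios are per clock cycle for one small loop and say nothing about memory speed or clock frequency.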

Thanks for the interesting Amiga benchmark link. I am a bit surprised that for some tests the A1200 is only 40% faster than the A500. IMHO the 68020 should be much faster. So it sounds like a common problem of Motorola products - their theoretical specifications are much better than their practical performance.

Quote:

Originally Posted by roondar

More seriously, this is kind of the point - you can't simply say the Archimedes is a lot faster than the A500 by looking at the CPU alone. If 2D graphics get involved the difference can completely vanish.

It will certainly win vs the A500 for business/serious software or 3D graphics, but it won't usually win the '2D war' and to me that shows that CPU power alone isn't everything and thus doesn't tell the whole story.

Indeed, the Amiga can show very good results with some types of graphics, but I was almost shocked when an 8 MHz Archimedes showed an animated plane flight in one window and the Basic code for this program in another window! IMHO it was impossible to write such things in interpreted Basic; I use my Amiga experience for this estimation.

Quote:

Originally Posted by meynaf

I thought it was quite self-explanatory. You need to create general-purpose functions when they're not here at first place.

Any example?

meynaf · 13 November 2018, 19:31

Quote:

Originally Posted by litwr

@meynaf I can get 171 bytes for pi-spigot for 80386. Your 68020 code takes 236 bytes.

But my code works. Your last version doesn't. I tried it on dosbox and it failed.

Quote:

Originally Posted by litwr

However I added to my article about 68k the sentence "Additionally, as shown by eab.abime.net experts, the code density of 68k is often better than that of x86."

That's already something.

Quote:

Originally Posted by litwr

I have made some corrections to my cycle counts for the line drawing algorithm main loop: ARM - 14, 80486 - 22, 80386 - 57, 80286 - 59, 8088/8086 - 98, 68000 - 63. Interestingly, the 80386 has almost the same count as the 80286. IMHO 68k could have been successful, but the 68020 added too many difficult-to-maintain instructions and this was a big mistake. Sorry, I still can't calculate this number for the 68020 - it is very difficult.

I can estimate 32 but IMHO it must be a larger number. My cycle sheet is below.

As I already said, there is little interest in cycle counting for code that's not time critical.

Quote:

Originally Posted by litwr

Why not relocatable? Sorry I can't catch the difference.

Position-independent : code that can be run unchanged regardless of where it is located in memory.
Relocatable : code that provides the relevant information (= relocation tables) so it can be moved anywhere in memory by altering a few parts of it.
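
The difference can be illustrated with a toy model. This is only a sketch: memory is a list of words, addresses are word indices, and `load_relocatable`, `resolve_pic` and the sample images are hypothetical names, not any real loader's format:

```python
def load_relocatable(image, reloc_offsets, base):
    """Copy the image and patch every word listed in the relocation table."""
    loaded = list(image)
    for offset in reloc_offsets:
        loaded[offset] += base  # absolute address fixed up for the load address
    return loaded

def resolve_pic(base, index, image):
    """PC-relative reference: target = address of the referencing word + displacement."""
    return base + index + image[index]

# Relocatable image: word 1 holds the absolute address (assuming base 0) of the data at index 3.
relocatable_image = [0xA, 3, 0xB, 0x2A]
reloc_table = [1]

# Position-independent image: word 1 holds a displacement (+2) instead, no table needed.
pic_image = [0xA, 2, 0xB, 0x2A]

base = 0x4000
loaded = load_relocatable(relocatable_image, reloc_table, base)
print(hex(loaded[1]), hex(resolve_pic(base, 1, pic_image)))
```

Both variants end up referring to the same target, but the relocatable image had to be modified by the loader while the position-independent one runs unchanged at any base.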

Quote:

Originally Posted by litwr

68k has too many addressing modes to support the code relocation.

What a strange claim !

Quote:

Originally Posted by litwr

It makes loader easier and smaller but it is not much important generally. MMU can provide the code relocation for free - no need to waste the instruction space by additional instructions for this.

There is no additional instruction targeted at relocating code.
Besides, MMU does not exactly come for free.

Quote:

Originally Posted by litwr

Any example?

Of things that are missing or very poor in C++ ?
Aside from the already seen swap, proper dynamic string/array support is a good example.

litwr · 13 November 2018, 21:23

Quote:

Originally Posted by meynaf

But my code works. Your last version doesn't. I tried it on dosbox and it failed.

I am really sorry, I was trying to do things too fast. It was only a typo, it is corrected, the size is unchanged, it is still 171 bytes. 68k is beaten again.

Quote:

Originally Posted by meynaf

As I already said, there is little interest in cycle counting for code that's not time critical.

Why? To draw a line is a good example of a typical algorithm.

Quote:

Originally Posted by meynaf

Position-independent : code that can be run unchanged regardless of where it is located in memory.
Relocatable : code that provides the relevant information (= relocation tables) so it can be moved anywhere in memory by altering a few parts of it.

Thank you for the clarification, but IMHO ppl often use the word "relocatable" in the sense of position-independent. IMHO position-dependent code can be found only in firmware.

Quote:

Originally Posted by meynaf

What a strange claim !

Sorry, but I really don't see much necessity for position-independent code. Indeed, I meant precisely position-independent code. PC-relative addressing is what makes such code possible, but it is only useful for hardware testers and some system programmers.

Quote:

Originally Posted by meynaf

Of things that are missing or very poor in C++ ?
Aside from the already seen swap, proper dynamic string/array support is a good example.

https://en.cppreference.com/w/cpp/algorithm/swap
IMHO Headers <string>, <vector>, <deque> give proper dynamic string/array support.

meynaf · 13 November 2018, 21:48

Quote:

Originally Posted by litwr

I am really sorry, I was trying to do things too fast. It was only a typo, it is corrected, the size is unchanged, it is still 171 bytes. 68k is beaten again.

You seem to always forget that this still counts OS code.

Quote:

Originally Posted by litwr

Why? To draw a line is a good example of a typical algorithm.

Being typical does not make it more time critical.
But it's interesting to see you qualify it as "typical" now. So it's typical when you get good speed measurements for your beloved x86, but it's a particular case when it comes to code density ? Damned, if that's not biased reasoning then what is it.

Quote:

Originally Posted by litwr

Sorry, but I really don't see much necessity for position-independent code. Indeed, I meant precisely position-independent code. PC-relative addressing is what makes such code possible, but it is only useful for hardware testers and some system programmers.

PC-relative addressing is very, very common in 68k code. It makes the code shorter and often even faster.
But as x86 does not have proper PC-relative modes, it's normal that you don't see the interest.

Quote:

Originally Posted by litwr

https://en.cppreference.com/w/cpp/algorithm/swap
IMHO Headers <string>, <vector>, <deque> give proper dynamic string/array support.

You are confusing external libraries with members of the language.

roondar · 13 November 2018, 21:53

Quote:

Originally Posted by litwr

I am not sure, but I won't be surprised if we find out that almost all A4000/40s were sold in Germany. There was no real mass-produced 68040-based computer, and there were a lot of cheap 80486-based models.

And your evidence for most or all of the A4000/40s being sold in Germany is what?

And remember, you claimed no more than a few hundred 68040 based machines were made worldwide. This is clearly nonsense. You also forget there were a whole bunch of Apple Macs based on the 68040. These were obviously mass produced and were part of the market for a number of years.

Quote:

Was there any computer based on the 68060? As far as I know, it was only available as an upgrade.

Barely any; there was an Amiga model and some workstations here and there but not much else. Interestingly, most of the companies that primarily used Motorola in the past didn't switch from 68k to Intel at the end of the '68k era', but chose from a variety of RISC based CPUs instead.

Quote:

My initial point was about 1990 and 1991, and I have to admit that it was wrong about 1991. However, the Archimedes had surprisingly affordable prices considering the power of its competitors.

I have to concur that the A5000 with its 25MHz CPU was relatively cheap at GBP999/$1700. It offered about 80% of the speed of a $3000 486.

The low end fared considerably worse though: the A3000 was competing with Ataris and Amigas around the GBP250-300 mark and itself cost anywhere from GBP499 to GBP649 (I can't seem to find a price for these in 1991; the 649 is from 1989 and the 499 is from its replacement, the A3010, in 1992).

Quote:

You have a fact about the line drawing algorithm implementation. It is common to consider ARM code density significantly worse than that of x86, and it is common to consider x86-32 code the best here. However, we have 80 bytes of ARM code for this algorithm and 82 for x86. IMHO compilers still produce rather poor code for ARM - read the article - http://benno.id.au/blog/2009/01/01/s...onserving_code - it is about reducing code produced by GCC from 100 bytes to 10 by hand! I can't imagine such a thing for x86.

As I explained before, one tiny example is not nearly enough to make it into generally applicable fact. I already said that and I stand by that.

You are also really, really, really misrepresenting the results of that blog post - the compiler did indeed originally produce code that was 100 bytes long. However, after changing the compiler flags to produce size optimised rather than non optimised code and code that was specific to the actual CPU used rather than general ARM code, it dropped down to only 16 bytes!!

Quote:

Originally Posted by the blog post you referred to above

The compiler line is: $ arm-elf-gcc -c poll.c -o poll.o -Os -mcpu=arm7tdmi -mthumb.

Which produces code like:
00000000 <poll>:
poll():
0: 6803 ldr r3, [r0, #0]
2: 1c02 adds r2, r0, #0
4: 1c08 adds r0, r1, #0
6: 4018 ands r0, r3
8: d001 beq.n e <poll+0xe>
a: 4383 bics r3, r0
c: 6013 str r3, [r2, #0]
e: 4770 bx lr
So, now we are down to 16 bytes

Just to make sure you understand this: note that this improvement is made merely by changing some compiler flags. He did not do any assembly programming at all to reach 16 bytes. After this improvement, he changes the C code ever so slightly and gets it to compile to just 12 bytes. Again, note that all of this is without any assembly language on his part - all this was done merely by changing compiler flags and changing one line of C code.

Then he takes the 12 bytes example and manages to improve that to 10 bytes. In other words, he managed to hand optimise the code by all of 6 bytes if we include the changed line of C code as an example of 'hand optimised assembly' (which it isn't!) and if we exclude that he only managed to optimise the final result of the compiler by 2 bytes. All other optimisation was done purely by the compiler.

Seriously, your claim of 100 to 10 bytes is so wrong/misleading I'm starting to wonder if you even read the article. That article you linked is actually conclusive proof that compilers for the ARM are really quite good.

Quote:

Emulators are very complex programs and therefore can give a very good impression of the performance of hardware. I used text modes and Norton Utilities for benchmarking. It is sad that there is no Archimedes emulator for Linux.

Indeed, we should take into account the quality of emulation, but I don't remember much difference except the very slow work with the Amiga 500.

I have to disagree - a single application, no matter how complicated is never a good way to benchmark an entire machine. Especially when the two applications compared are not actually the same program and may cover a problem (or partial problem) that is better suited for one architecture over another.

Quote:

I don't insist that the Archimedes is always 10 times faster, but IMHO for properly optimised programs it can be faster by about this factor. I repeat: my counts with the standard line drawing algorithm show that ARM can be 50% faster than the 80486 at the same frequency. The Amiga 500 has only slow RAM, and Archimedes RAM access is 2 or even more times faster. So theoretically the 68000 should only be 4-5 times slower, but with the Amiga it is about 10.

You haven't proven this 10x at all, as I've tried to explain before and tried to do again just now. Much more than a single application is needed. Emulators might simply be a better fit to the Archimedes (for instance due to the screen memory layout of the Archimedes being much closer to the PC one than the Amiga screen memory layout is).

It's really quite complicated to compare performance. In the last post, I even gave a counter example: Archimedes 3D games are not 10x the speed of the Amiga version, even though they suit the Archimedes very well. Another example: the A500 can accelerate line drawing using the Blitter. The speed increase varies with length, but for most lines it's apparently at least as fast as a high speed 68030.

Quote:

Thanks for the interesting Amiga benchmark link. I am a bit surprised that for some tests the A1200 is only 40% faster than the A500. IMHO the 68020 should be much faster. So it sounds like a common problem of Motorola products - their theoretical specifications are much better than their practical performance.

1) You are not reading the benchmarks correctly. Setting the A1200 as baseline gives the A500's best number as 58% - that does not mean the A1200 is 40% faster. I've attached an image (note, I've cut away the superfluous part with an image editor) to show the results when the A500 is set as the baseline. As you can see, the A1200 averages to about 2x the speed of the A500 and is still at least 70% faster than the A500 in the worst test result.
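
The baseline mix-up is easy to reproduce. A sketch using only the 58% figure from the point above (the function name is illustrative):

```python
def speedup_from_relative_score(slower_as_fraction_of_faster):
    """Convert 'B scores X of A' into 'A is N times the speed of B'."""
    return 1.0 / slower_as_fraction_of_faster

a500_relative_to_a1200 = 0.58  # A500's best result with the A1200 as baseline
a1200_speedup = speedup_from_relative_score(a500_relative_to_a1200)

# Misreading: 1 - 0.58 says the A500 is about 42% slower, not that the A1200 is 40% faster.
a500_percent_slower = (1 - a500_relative_to_a1200) * 100
a1200_percent_faster = (a1200_speedup - 1) * 100

print(round(a1200_speedup, 2), round(a500_percent_slower), round(a1200_percent_faster))
```

So even in the A500's best test, the A1200 comes out roughly 72% faster, which is consistent with the "at least 70% faster" figure.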

2) It is common knowledge the 68020 in the A1200 was held back by the slow on board RAM. If you add any amount of trapdoor RAM to the A1200, it becomes about 4x the speed of the A500 - without needing to upgrade the CPU. As an example, the GVP1208 (which does nothing but add 8MB of RAM to the A1200) just about doubles the speed of the base A1200.

So your conclusion about the problem being the Motorola chip is wrong. Even the idea that the design was at fault would be wrong - the A1200 was designed as a low-end computer, not as a high end one. This may not be what most Amiga fans (including me) wanted to see, but at least it was fairly easy and cheap to double the performance of the A1200.

Quote:

Indeed, the Amiga can show very good results with some types of graphics, but I was almost shocked when an 8 MHz Archimedes showed an animated plane flight in one window and the Basic code for this program in another window! IMHO it was impossible to write such things in interpreted Basic; I use my Amiga experience for this estimation.

You are aware you can do this in Amiga Basic (yes, the Microsoft variant) as well (though I do admit it's not terribly intuitive), right?

Or if you prefer a simpler approach, this can also be done using Blitz Basic and AMOS PRO (though the last one is a bit of a stretch as it isn't as system friendly and thus doesn't do windows per se).

Edit: it occurs to me the above might not be quite clear enough on what I mean, so here goes:

All the above forms of BASIC can play back IFF animations (though it is easier using AMOS Pro/Blitz) and at least two of them can do so in an entirely OS friendly way. All these forms of BASIC can also create and display animations in different ways (such as using sprites, BOBs, etc). The part about the window might be more interesting - I'm 100% sure Blitz and Amiga Basic can display IFF animations on a standard Amiga screen and as such, the animation and listing can be shown at the same time.

Getting the listing to display on the same screen (or in a window on the WB screen) as an IFF animation may be more tricky as WB1.3/WB2.0 were not really designed for that, but should probably still be doable. Displaying BOBs/sprites in a window should be easier to do.

meynaf · 14 November 2018, 08:46

Quote:

Originally Posted by meynaf

You seem to always forget that this still counts OS code.

By the way, if we take that into account with my 236 bytes version...
There are 36 bytes of hunk overhead, actually 38 because hunks are longword aligned and there are two null padding bytes at the end. That brings us to 198.
Count 12 for the dos.library string. We're at 186.
Count 12 for actually opening the library. We're at 174.
Count 8 for closing said library. We end up at 166. 80386 is beaten.
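
The subtraction chain, written out as a sketch (the labels are paraphrased from the lines above; 171 is the 80386 size claimed earlier in the thread):

```python
total_bytes = 236  # size of the 68020 executable
overhead = [
    ("hunk overhead plus padding", 38),
    ("dos.library name string", 12),
    ("opening the library", 12),
    ("closing the library", 8),
]

remaining = total_bytes
steps = []
for label, size in overhead:
    remaining -= size
    steps.append(remaining)
    print(f"{label}: -{size} -> {remaining}")

x86_bytes = 171
print(remaining < x86_bytes)
```

Whether the comparison is fair still depends on the x86 version being stripped of its equivalent OS overhead in the same way.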

Now could we please stop that senseless game ? 68k is better than x86 in code density and you can't do anything about that. Even if in your particular example this were not true, this would still be just a single meaningless one. Try to find other code.

Leffmann · 14 November 2018, 10:04

Quote:

Originally Posted by Leffmann

If you want facts and details then they're in Motorola's own M68000 Programmer's Reference Manual and M68000 User Manual, currently at nxp.com. I mean, whatever you find elsewhere on the web is either already covered in the official documents, or will just be opinions and anecdotes.

Reading this thread is like watching old couples argue, it's 40 pages long and nobody has learned anything. I get tired from just seeing it pop up in Today's Posts.

Bruce Abbott · 14 November 2018, 11:45

Quote:

Originally Posted by Leffmann

Reading this thread is like watching old couples argue, it's 40 pages long and nobody has learned anything.

Actually some of us have learned quite a bit.

Quote:

I get tired from just seeing it pop up in Today's Posts.

But not too tired to take a dump on it, apparently.

roondar · 14 November 2018, 15:19

Quote:

Originally Posted by litwr

Code:

                         68000                   68020
.loop
1    sub.l d4,d6         4                       3?
2    bgt.s .xp           13 (10 or 8+4+4=16)     6?
3    add.l a0,d1            
4    add.l a2,d6
.xp
5    sub.l d5,d7         4                       3?
6    bgt.s .yp          13                       6?

7    add.l a1,d2
8    add.l a2,d7
.yp
9    bsr.s setpixel     18                       8?
0    dbf d0,.loop       10                       6?

I didn't really mean to look at this much, but the 68000 cycle counts you show there don't look to be correct to me. For instance, there are no 68000 opcodes with odd cycle counts. I've made an attempt as well, the code you show should have the following cycle counts for the 68000.

Code:

                         68000
.loop
1    sub.l d4,d6         8
2    bgt.s .xp           10 if taken / 8 if not
3    add.l a0,d1         8            
4    add.l a2,d6         8
.xp
5    sub.l d5,d7         8
6    bgt.s .yp           10 if taken / 8 if not

7    add.l a1,d2         8
8    add.l a2,d7         8
.yp
9    bsr.s setpixel      18
0    dbf d0,.loop        10

As you can see, the 68000 is somewhat slower than you originally calculated. I'm also not entirely clear why your cycle count examples (for all processors) don't actually count all instructions.
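To make the totals explicit, here is a minimal Python sketch that sums the counts listed above for one pass through the loop. It assumes exactly the figures in this post and counts only the call overhead of setpixel, not the routine itself. The function and constant names are made up for illustration.

```python
# 68000 cycle counts as listed in the post above (not independently measured).
SUB_L = 8          # sub.l Dn,Dn
ADD_L = 8          # add.l An,Dn
BGT_TAKEN = 10     # bgt.s, branch taken
BGT_NOT_TAKEN = 8  # bgt.s, branch not taken
BSR_S = 18         # bsr.s setpixel (call overhead only)
DBF_LOOP = 10      # dbf d0,.loop while the loop continues


def step_cycles(branch_taken: bool) -> int:
    """Cycles for one sub.l / bgt.s group, plus the two add.l if we fall through."""
    if branch_taken:
        return SUB_L + BGT_TAKEN
    return SUB_L + BGT_NOT_TAKEN + 2 * ADD_L


def loop_cycles(x_taken: bool, y_taken: bool) -> int:
    """Cycles for one loop iteration, excluding the body of setpixel."""
    return step_cycles(x_taken) + step_cycles(y_taken) + BSR_S + DBF_LOOP


if __name__ == "__main__":
    print("both branches taken:  ", loop_cycles(True, True))
    print("neither branch taken: ", loop_cycles(False, False))
```

With these figures one iteration costs between 64 cycles (both branches taken) and 92 cycles (neither taken), before setpixel's own body is counted.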

For that matter, the 68k code looks kind of odd - why are you adding address registers to data registers? I might be wrong here, but I think you mean to do the opposite. On a side note: if putpixel takes x&y coordinates the longword add/sub commands can be optimised into word add/sub commands.

That said, I've not actually looked much at the line drawing stuff you discussed, as I find it to be far too small an algorithm to be useful for comparing things accurately. So it might be correct after all.

The 68020 is much harder to 'cycle count' for because it has a cache, which means execution times differ depending on whether the code is inside or outside the cache (code in cache is much faster). Moreover, code running from the cache can continue to run during the memory accesses of prior instructions, so it's possible for some opcodes to take '0 cycles' by being run during a memory access. The Motorola manual has an example like this:

Code:

; This example assumes code is running from cache
4 cycles   move.l d4,(a1)+
0 cycles   add.l d4,d6

; This example assumes code is running from memory
4 cycles   move.l d4,(a1)+
3 cycles   add.l d4,d6

It's actually even more complicated than this (there are quite a few different cases to account for). Personally, for this reason I tend to stay away from cycle counting on processors that utilize cache and internal concurrency (like the 68020) - the results can vary quite a bit depending on the involvement of cache or not.
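As a rough illustration of why the same two instructions give different totals, here is a small Python sketch that just adds up the per-instruction figures from the example above. The list layout and the helper name are my own. It models nothing beyond the two cases quoted, so treat it as arithmetic and not as a 68020 timing model.

```python
def total_cycles(timings):
    """Sum the effective cycle cost of a list of (instruction, cycles) pairs."""
    return sum(cycles for _, cycles in timings)


# Figures copied from the example above.
from_cache = [("move.l d4,(a1)+", 4), ("add.l d4,d6", 0)]   # add.l overlaps the write
from_memory = [("move.l d4,(a1)+", 4), ("add.l d4,d6", 3)]  # no overlap

if __name__ == "__main__":
    print("from cache: ", total_cycles(from_cache))
    print("from memory:", total_cycles(from_memory))
```

The pair costs 4 cycles in one case and 7 in the other, which is why a single "cycles per instruction" column says little for the 68020.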

The 486 actually has similar problems, it also has cache memory and will run code inside the cache considerably faster than code that isn't in the cache. I can't say for certain the 486 also runs opcodes while waiting on memory access or uses internal concurrency, but it probably does have these abilities and thus likewise is fairly complicated to count for. The 386 tended to run without cache as far as I can find.

Don_Adan · 14 November 2018, 15:25

Sorry, but sub.l and add.l instructions need 6c, not 8c, on the 68000.

roondar · 14 November 2018, 16:13

Quote:

Originally Posted by Don_Adan

Sorry, but sub.l and add.l instructions need 6c, not 8c, on the 68000.

This is not correct AFAIK.

Code:

        Standard Instruction Execution Times
instruction    Size        op<ea>,An ^    op<ea>,Dn    op Dn,<M>
ADD          byte,word     8(1/0) +       4(1/0) +     8(1/1) +
              long         6(1/0) +**     6(1/0) +**   12(1/2)+

 notes:    
    + Add effective address calculation time
    ^ Word or long only
    * Indicates maximum value
    ** The base time of six clock periods is increased to eight        
       if the effective address mode is register direct or 
       immediate (effective address time should also be added)

This reads as 8 clocks for an add.l Dn,Dn command, unless I'm mistaken. I got this from: http://oldwww.nvg.ntnu.no/amiga/MC68...mstandard.HTML

The same figures can be found elsewhere as well. Strangely, the 'current' edition of the 68000 family programmer's manual does not include cycle counts per instruction.
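For what it's worth, the rule in the table notes can be written down as a small Python sketch. The base time of 6 and the "+2 for register direct or immediate" rule come from the table above. The effective-address times in the dictionary are an assumption on my part, taken from the usual 68000 long-operand EA timing table and not from the excerpt quoted here, and the function name is made up.

```python
# Effective-address calculation times for LONG operands, in clocks.
# Assumption: values from the standard 68000 EA timing table, not from the
# excerpt quoted above. Only a few addressing modes are listed.
EA_TIME_LONG = {
    "Dn": 0,    # data register direct
    "An": 0,    # address register direct
    "(An)": 8,  # address register indirect
    "#imm": 8,  # immediate
}

# Modes for which the '**' note raises the base time from 6 to 8 clocks.
REGISTER_DIRECT_OR_IMMEDIATE = {"Dn", "An", "#imm"}


def add_l_ea_dn_cycles(source_mode: str) -> int:
    """Clocks for add.l <ea>,Dn following the table and its '**' note."""
    base = 8 if source_mode in REGISTER_DIRECT_OR_IMMEDIATE else 6
    return base + EA_TIME_LONG[source_mode]


if __name__ == "__main__":
    for mode in ("Dn", "An", "(An)", "#imm"):
        print(f"add.l {mode},Dn -> {add_l_ea_dn_cycles(mode)} clocks")
```

Under those assumptions add.l d0,d1 comes out at 8 clocks, add.l (a0),d1 at 14 and add.l #imm,d1 at 16. The plain 6+0 reading only holds if the '**' note is ignored.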

frank_b · 14 November 2018, 16:28

Do your 8086 timings take into account memory accesses during prefetch?
Every opcode fetch is going to cost cycles.
Can you please add best case/worst case columns for the 8086?

Don_Adan · 14 November 2018, 17:46

Quote:

Originally Posted by roondar

This is not correct AFAIK.

Code:

        Standard Instruction Execution Times
instruction    Size        op<ea>,An ^    op<ea>,Dn    op Dn,<M>
ADD          byte,word     8(1/0) +       4(1/0) +     8(1/1) +
              long         6(1/0) +**     6(1/0) +**   12(1/2)+

 notes:    
    + Add effective address calculation time
    ^ Word or long only
    * Indicates maximum value
    ** The base time of six clock periods is increased to eight        
       if the effective address mode is register direct or 
       immediate (effective address time should also be added)

This reads as 8 clocks for an add.l Dn,Dn command, unless I'm mistaken. I got this from: http://oldwww.nvg.ntnu.no/amiga/MC68...mstandard.HTML

The same figures can be found elsewhere as well. Strangely, the 'current' edition of the 68000 family programmer's manual does not include cycle counts per instruction.

I haven't coded for a long time, so I checked my Amiga assembler book again: add.l ea,Dn takes 6+ cycles, and since the ea time for Dx is 0, that gives 6+0=6c.

meynaf · 14 November 2018, 18:09

Quote:

Originally Posted by roondar

For that matter, the 68k code looks kind of odd - why are you adding address registers to data registers? I might be wrong here, but I think you mean to do the opposite. On a side note: if putpixel takes x&y coordinates the longword add/sub commands can be optimised into word add/sub commands.

Don't ask litwr - not his code.

I'm adding address registers to data registers because I'm out of data registers, so some data has to go in address registers.
Using word operations is a bad idea because it only helps on the 68000 (the others don't care), adds a constraint on the other function, and - I repeat - this code isn't speed critical.

coder76 · 14 November 2018, 20:16

It's not easy to compare instruction cycle counts between different architectures. If you look in the MC68020 manual, there are three cases given for each instruction: best, cache and worst. Typically you would be interested in the cache case and try to optimize your code towards the best case, first by trying out various memory alignments for the start of the code and then by reordering instructions for better performance. Branches also have different cycle counts depending on whether the branch is taken or not. Variations between best and worst cases can be large, e.g. 2-3x the cycles. Also, in some cases an instruction can take 0 cycles to execute (it runs in parallel with another one) and in other cases a few cycles, so the ratio between the two is unbounded.

You also can't judge the performance of a CPU just by looking at the cycle counts for each instruction and comparing them against other CPUs. There are other factors, like cache performance and the number of registers available, which are also important for performance. The x86 cycle counts often look impressive on paper, but the x86 CPUs lacked the number of registers that the 680x0s have. Also, the 386/486 caches weren't as good as the 68030/68040's caches (the 386 had some sort of external cache).
