68k details - Page 33

litwr · 03 November 2018, 18:49

Quote:

Originally Posted by roondar

1) It only works natively using DOS - and then only because DOS isn't as much an OS as it is an interface to the disk and screen. Only working under DOS => you are showing off an OS feature, not showing off the superiority of the ISA.

I can't agree. This feature of DOS relies heavily on the segment registers, which are part of the ISA and give the 8086 some superiority. I can also mention CP/M-86, MP/M-86, ...

Quote:

Originally Posted by roondar

2) There is in fact nothing stopping you from creating similar headerless 68000 code other than the OS used. Only not working because the OS doesn't support it => you are showing off an OS feature, not showing off the superiority of the ISA.

The 68k has had a lot of OSes and none of them used a headerless format, so IMHO it was not as easy as you might think. However, I am ready to leave the header bytes of the 68020 code out of the count. Even though that is not 100% fair to x86, it is a clear handicap for 68k.

Quote:

Originally Posted by roondar

Opinions are irrelevant, what you actually need is proof - and the easily available evidence (like the MIPS figures I Googled) suggests something completely different (About 7 MIPS for the ARM2, about 21 MIPS for the 486@25MHz).

If you feel this is wrong, I'm obviously more than willing to consider any evidence you wish to provide for your claims.

Let's look at http://www.roylongbottom.org.uk/mips.htm#anchorAcorn

We can take several lines there.

Code:

System              CPU        MHz     MIPS
ARCHIMEDES          ARM2         8      4.5
MOMENTUM 21096      68020       20      6
42/40               68030       33      8
AMS/5000            80486       25     15
QI PCi              80386       25      5
VX FTserver         80486       25     15
6386E/33            80386       33      7.7
6386/25             80386       25      6.9

From these we can project the following lines, scaling the ARM figure linearly with clock speed

Code:

CPU     MHz    MIPS
ARM     12     6.8
80386   25     6.9
80386   33     7.7
68030   33     8
ARM     25     14
80486   25     15

They show that ARM is a bit slower than the 80486, and that at 12 MHz it is even slower than the 80386 @25MHz. However, IMHO these results are rather biased. There were no compilers for ARM as good as those for 68k or x86. Look at https://news.ycombinator.com/item?id=17793878 - it shows that even with FP the Archimedes can be faster. Indeed, the very fast hardware division of x86 could also change the picture. Maybe I don't have 100% proof, but I am almost sure that ARM@25MHz can outperform the 80486@25MHz on integer code without division, for example on the line drawing algorithm discussed in this thread. I am also almost sure that ARM@12MHz can outperform the 80386@33MHz. I have just made an approximate cycle count for the line drawing main loop: it takes 52 cycles on the 80386, 24 cycles on the 80486, and only 14 cycles on ARM, and some of the ARM cycles are idle S-cycles. Sorry, I am not very proficient with 68k, so I dare to ask somebody to count the 68000/68020 clocks.
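
The projection above is plain linear scaling with clock frequency, and the loop estimate can be turned into iterations per second the same way. A rough sketch, assuming performance scales linearly with MHz (which ignores memory speed and wait states) and taking the cycle counts quoted in this post at face value; the function names are made up for illustration:

```python
def project_mips(mips, mhz, target_mhz):
    """Scale a measured MIPS figure linearly to another clock (an assumption)."""
    return mips * target_mhz / mhz


def loops_per_second(mhz, cycles_per_iteration):
    """Iterations per second of a loop that costs a fixed number of cycles."""
    return mhz * 1_000_000 / cycles_per_iteration


# ARM2 is listed at 8 MHz and 4.5 MIPS in the table above.
arm_12 = project_mips(4.5, 8, 12)   # 6.75, listed as 6.8
arm_25 = project_mips(4.5, 8, 25)   # 14.0625, listed as 14

# Line drawing main loop, cycle counts as estimated in this post.
loop = {
    "80386@33": loops_per_second(33, 52),
    "80486@25": loops_per_second(25, 24),
    "ARM@12": loops_per_second(12, 14),
    "ARM@25": loops_per_second(25, 14),
}

for name, rate in loop.items():
    print(f"{name}: {rate:,.0f} iterations/s")
```

If the cycle estimates hold, the ARM at 12 MHz gets through more loop iterations per second than the 80386 at 33 MHz, which is the claim being made here. The estimates themselves are not verified by this sketch.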

Quote:

Originally Posted by meynaf

68000 does have an instruction queue. Perhaps you wanted to write "unlike 68000" ?

No, the mentioned Byte article states quite clearly

Quote:

However, the 68000 had no real queuing, and that meant that the 68008 ran half as fast as the 68000. Unfortunately, this made the 8088 look even better.

Quote:

Originally Posted by meynaf

But running code that's not code (= data files) is quite dangerous so it's why it's better to not support this nonsense today.

We are discussing systems of the 80s. 30 years have passed...

Quote:

Originally Posted by meynaf

Why don't you just do an Atari ST version of your program, btw? You would then notice that it's somewhat smaller than the Amiga version (OS calls are less powerful but simpler).

We have already removed almost all calls.

Quote:

Originally Posted by meynaf

No it's you who have started the cutting. You removed the part that asks for the number of digits. And the original code had a lot more text inside, so what you're doing here is basically removing texts (and a few features) from some code and then pretend the cpu you're using has better code density. Intellectually dishonest, to say the least.

You have replaced fair code with OS calls. Sheer trickery!

Quote:

Originally Posted by meynaf

I just told that if i could use a 68k-like cpu i designed myself, it would beat the crap out of any x86 code you might write.

Indeed! Your CPU can easily beat even Intel Xeon!

Quote:

Originally Posted by meynaf

Perhaps even Basic is (visually at least) a lot more beautiful.

Wow!

And where is your 68020 pi-spigot implementation?

meynaf · 03 November 2018, 19:26

Quote:

Originally Posted by litwr

No, the mentioned Byte article states quite clearly

Then it's wrong. Because the 68000 has a prefetch queue. Some SMC code could even fail because of this.

Quote:

Originally Posted by litwr

We are discussing systems of the 80s. 30 years have passed...

True. But it does not make your program header trick more valid.

Quote:

Originally Posted by litwr

We have already removed almost all calls.

Then let's continue until all of them are removed. This is the only way to really remove biases.

Quote:

Originally Posted by litwr

You have replaced fair code with OS calls. Sheer trickery!

Why ? It shows that good OSes can be written with 680x0, that support useful operations, hence show its superiority. Is it my fault that old DOS is so poor ?
And it was just to compensate for the fact you removed the program headers, to make things more fair.
Something you can't use is sheer trickery but something you can (and others can't) isn't ? Yes it's called cheating.

Quote:

Originally Posted by litwr

Indeed! Your CPU can easily beat even Intel Xeon!

If given the same amount of implementation efforts and resources, then yes. But don't tell it to anyone.
(Anyway i was just speaking about code density and for this it does not need to be fast.)

Quote:

Originally Posted by litwr

Wow!

And where is your 68020 pi-spigot implementation?

I think you have it already. Now where is your 180 bytes 386 version ?

litwr · 03 November 2018, 20:39

Quote:

Originally Posted by meynaf

Then it's wrong. Because the 68000 has a prefetch queue. Some SMC code could even fail because of this.

The article speaks of "no real queuing". So maybe the 68000 has something which resembles the 8086 queue, but it is rather useless.

Quote:

Originally Posted by meynaf

True. But it does not make your program header trick more valid.

Is using standard OS features a trick?!

Quote:

Originally Posted by meynaf

I think you have it already. Now where is your 180 bytes 386 version ?

It is here - enjoy the fair superiority! However this version is still 189 bytes, so it is a light handicap for 68k.

You mentioned Basic. Of course, it is rather obsolete, but some features of ancient Basics are quite interesting. For example, it is possible to use multiple NEXT statements for one FOR, which is unimaginable in a modern PL. The RESUME statement looks more powerful than the modern TRY/CATCH technique. I have written full-screen editors for several retro platforms with such Basics, for example http://litwr2.atspace.eu/bk/np4bk.html or http://litwr2.atspace.eu/notepad+4.html

Indeed I had to use some ML to get a proper speed.

Kalms · 03 November 2018, 20:55

Quote:

Originally Posted by litwr

Quote:

Originally Posted by roondar

1) It only works natively using DOS - and then only because DOS isn't as much an OS as it is an interface to the disk and screen. Only working under DOS => you are showing off an OS feature, not showing off the superiority of the ISA.

I can't agree. This feature of DOS relies heavily on the segment registers, which are part of the ISA and give the 8086 some superiority. I can also mention CP/M-86, MP/M-86, ...

Quote:

Originally Posted by litwr

Quote:

Originally Posted by roondar

2) There is in fact nothing stopping you from creating similar headerless 68000 code other than the OS used. Only not working because the OS doesn't support it => you are showing off an OS feature, not showing off the superiority of the ISA.

The 68k has had a lot of OSes and none of them used a headerless format, so IMHO it was not as easy as you might think. However, I am ready to leave the header bytes of the 68020 code out of the count. Even though that is not 100% fair to x86, it is a clear handicap for 68k.

I don't think this is about easy vs not easy. It rather has to do with the relative age of the OSes and their different heritages.

For an Intel x86 CPU in 16-bit mode, the most straightforward way to construct relocatable code is to load code+data to an address that is evenly divisible by 16, set CS and DS to the start of the chunk of memory, and then jump to cs:<start offset>.

The code must be written with the assumption that CS/DS point to the start of the loaded chunk.

For a 68000 CPU, the most straightforward way to construct relocatable code is to load code+data to any address in memory, and then jump to <start offset within the loaded chunk>.

The loaded code will need to use base-relative operations. For reads, a 16-bit displacement relative to PC will do. For writes, a separate register will need to be used as base pointer throughout the application (this can be set up by having a "lea 0(pc),a5" at the very beginning of the program), and the accesses will be 16-bit relative to A5.

This means that creating a headerless executable format is feasible for both ISAs. No ISA is advantageous or disadvantageous when it comes to running relocatable code.
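
The two loading schemes described above can be modelled in a few lines. This is only a sketch of the address arithmetic, not of any real loader: the function names, load addresses and offsets are made up, and the 68k version assumes the program itself puts its load address into A5 at startup.

```python
def x86_physical(segment, offset):
    """Real-mode 8086 address: segment * 16 + offset, wrapped to 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF


def load_x86(load_address, start_offset, data_offset):
    """Headerless load: CS = DS = load_address / 16, code offsets stay fixed."""
    assert load_address % 16 == 0, "chunk must start on a 16-byte boundary"
    seg = load_address >> 4
    return x86_physical(seg, start_offset), x86_physical(seg, data_offset)


def load_68k(load_address, start_offset, data_offset):
    """Headerless load: entry and data are reached as base + fixed displacement.

    A5 is assumed to hold the load address; data is accessed as d16(A5).
    """
    assert load_address % 2 == 0, "68000 code must be word-aligned"
    assert -32768 <= data_offset <= 32767, "d16(A5) displacement out of range"
    a5 = load_address
    return load_address + start_offset, a5 + data_offset


# The same binary image (same embedded offsets) at two different load addresses.
for base in (0x10000, 0x2A3F0):
    print(hex(base), [hex(a) for a in load_x86(base, 0x100, 0x200)])
for base in (0x10000, 0x2A3F2):
    print(hex(base), [hex(a) for a in load_68k(base, 0x000, 0x200)])
```

In both cases the offsets baked into the code never change. Only one base value differs between loads: the segment register on x86, the base register (or PC) on 68k.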

What's more relevant then is: why does DOS support a headerless executable format whereas Atari TOS / Amiga OS / others do not? I think the answer is that DOS was originally designed for a much simpler execution model, and it was explicitly designed to provide some CP/M compatibility.

The Atari TOS & Amiga OS came later. They were more complex in nature. They did not strive to run CP/M software. Features like the OS being able to load a program section by section into different memory regions, automatically clearing the BSS sections, and programs being able to have debug information embedded into the executables, and good support for >64kB executables were important. Having support for a strictly headerless executable format, in addition to the regular executable format, was not worth it given that it would complicate the rest of the OS.

And this means that... which OS happens to support a headerless executable format has little to do with the quality of the ISA. I suggest that if you want to debate relative merits of ISAs, then you leave the relative merits of OSes out of it.

litwr · 03 November 2018, 21:31

@Kalms Thank you very much for your very descriptive comment. However, its key word is relocatable code. That is a DEC PDP-11 concept which proved not to be very useful. A headerless format for 68k has to use relocatable code only. That imposes a lot of limitations and makes the code much larger. So it was never used with 68k.

meynaf · 03 November 2018, 21:31

Quote:

Originally Posted by litwr

The article speaks of "no real queuing". So maybe the 68000 has something which resembles the 8086 queue, but it is rather useless.

The 68000 has something very similar to the 8086 queue and it is not useless.

Quote:

Originally Posted by litwr

Is using standard OS features a trick?!

You said it was "trickery" when i used them

Quote:

Originally Posted by litwr

It is here - enjoy the fair superiority! However this version is still 189 bytes, so it is a light handicap for 68k.

I enjoy the fair superiority indeed, as you're beaten already.
You have headerless code ? Ok, so remove the 36-byte hunk data from my 236 bytes version and you get 200 - actually 198 as there are 2 padding bytes at the end. The code is fully position-independent so this is no loss.
You have to open dos.library ? No, so remove "dos.library" string from my program and we're at 186. And i didn't count the code to open/close said lib...

Of course you can still say you have a 180 byte version - which you didn't show - but then i could tell you that open/close library code easily removes more than 6 bytes and you're still beaten.

As you see, if we remove OS specific code - which we *have* to do to be really fair - you lose.
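
The byte counting in this post can be written out as a quick check. The numbers are the ones quoted above. The only assumption is that the library name is stored as a NUL-terminated string, which matches the 12-byte difference between 198 and 186.

```python
amiga_total = 236             # size of the Amiga executable, from the post
hunk_overhead = 36            # hunk data in the file, from the post
padding = 2                   # padding bytes at the end, from the post
lib_name = b"dos.library\0"   # assumed NUL-terminated library name

code_only = amiga_total - hunk_overhead - padding
without_name = code_only - len(lib_name)
x86_total = 189               # size litwr reports for the x86 version

print(code_only, without_name, x86_total - without_name)
```

This prints 198, 186 and 3, so with the header, padding and library name stripped the 68k version comes out 3 bytes shorter than the 189-byte x86 one, before counting the library open/close code.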

Kalms · 03 November 2018, 22:29

Quote:

Originally Posted by litwr

@Kalms Thank you very much for your very descriptive comment. However, its key word is relocatable code. That is a DEC PDP-11 concept which proved not to be very useful. A headerless format for 68k has to use relocatable code only. That imposes a lot of limitations and makes the code much larger. So it was never used with 68k.

I disagree with your analysis.

Relocatable code on 68k, where the addressing is within a 32kB memory area, is typically the same size, or smaller than, non-relocatable code. Where you would normally have a "move.l {location_of_global_variable},d0" with the relocatable code you would instead have a "move.l offset_of_global_variable(a5),d0". This saves opcode bytes and also executes quicker.

The biggest limitation with relocatable code on 68k is that the relative addressing is 16-bit displacement. If you want to write relocatable code that addresses more than ~32kB then code size will increase significantly.

This is very similar to the situation on DOS in 16-bit programs.

I still don't see that the ISA would make a major difference here.

I believe that the primary reason why headerless formats weren't implemented in Atari TOS / Amiga OS was because those OSes put the focus on supporting larger applications with more complex functionality.
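
The size and speed comparison above can be tabulated. A small sketch using the standard 68000 encoding sizes and cycle counts for MOVE.L <ea>,Dn; the table and helper names are made up for illustration, and the figures apply to a plain 68000 without wait states.

```python
OPCODE_BYTES = 2  # every 68000 instruction starts with one 16-bit opcode word

# Source addressing mode -> (extension bytes, 68000 cycles for MOVE.L <ea>,Dn)
MOVE_L_TO_DN = {
    "(xxx).L":   (4, 20),  # absolute long: the non-relocatable form
    "d16(An)":   (2, 16),  # base register + 16-bit displacement
    "d16(PC)":   (2, 16),  # PC-relative, usable for reads only
    "d8(An,Xn)": (2, 18),  # base + index, displacement limited to 8 bits
}


def size(mode):
    """Total instruction length in bytes."""
    return OPCODE_BYTES + MOVE_L_TO_DN[mode][0]


def cycles(mode):
    """Execution time in 68000 clock cycles."""
    return MOVE_L_TO_DN[mode][1]


for mode in MOVE_L_TO_DN:
    print(f"move.l {mode},d0: {size(mode)} bytes, {cycles(mode)} cycles")
```

The absolute form costs 6 bytes and 20 cycles, the A5-relative form 4 bytes and 16 cycles, as long as the displacement fits in 16 bits.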

mc6809e · 04 November 2018, 02:17

Instruction prefetch on the 68k is hard coded. Each instruction, besides performing its primary operation, will prefetch the first word of the next instruction in the same way every time. This means that MULs and DIVs, despite having many idle cycles, will only prefetch one word.

The 8086 has a more generic prefetch that operates whenever the data bus is idle. It's true that it isn't all that advantageous in some cases, but it fetches up to six bytes so the 8086 might prefetch anywhere from one to six instructions. Imagine how much faster 3d code might be if the 68k prefetched during MULs and DIVs.

And that brings up one more advantage: instruction bandwidth. Single byte instructions are a huge win when memory access is relatively slow.

Bruce Abbott · 04 November 2018, 09:23

Quote:

Originally Posted by mc6809e

Imagine how much faster 3d code might be if the 68k prefetched during MULs and DIVs.

I imagine not much.

DIVS on 68000 takes 120-156 cycles to complete. How many other instructions could realistically be queued up in that time? And they still have to be executed, which generally takes about the same time (or longer) as fetching. The only case where there might be a big speedup is when the bus is much slower than the CPU, but the 68000 was never so fast that a well designed memory subsystem couldn't keep up with it.

Quote:

And that brings up one more advantage: instruction bandwidth. Single byte instructions are a huge win when memory access is relatively slow.

Only if you don't need many of them.

Problem is you can't encode much into a single byte, so you end up needing multiple bytes to do many operations that could be encoded in a single word, and instructions become split over word boundaries which requires a deeper prefetch queue and more complex control. To be effective the opcode lengths need to be carefully chosen to match usage frequency, and that can result in a messy non-orthogonal ISA.

The Z80 is a 'good' example of messing this up. Its single byte opcode map is full of LD r,r instructions which aren't used that often and are not very powerful, while extended instructions that could do with a speedup are nobbled with prefix bytes. LD r,(IX+d) seems like a great idea until you realize that it takes 6 bytes and 10 memory cycles to load a 16 bit register, whereas the direct equivalent ld hl,(nnnn) only needs 3 bytes and 5 cycles. That kind of performance hit really takes the fun out of using index registers on the Z80!
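
The Z80 numbers in this post can be added up explicitly. A small sketch using the documented Z80 timings; "machine cycles" here are what the post calls memory cycles, although one of the five in the indexed load is an internal cycle rather than a bus access. The helper name is made up.

```python
# (bytes, machine cycles, T-states) per instruction
LD_R_IX_D = (3, 5, 19)  # LD r,(IX+d): prefix, opcode, displacement, internal, read
LD_HL_NN = (3, 5, 16)   # LD HL,(nn): opcode, addr low, addr high, read low, read high


def total(sequence):
    """Sum bytes, machine cycles and T-states over a list of instructions."""
    return tuple(sum(column) for column in zip(*sequence))


# Loading a 16-bit register pair through IX takes two 8-bit loads:
#   LD L,(IX+d) / LD H,(IX+d+1)
indexed = total([LD_R_IX_D, LD_R_IX_D])
direct = total([LD_HL_NN])

print("indexed:", indexed)
print("direct: ", direct)
```

The indexed version needs twice the bytes and machine cycles of the direct load, and 38 T-states against 16.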

litwr · 04 November 2018, 11:01

Quote:

Originally Posted by Bruce Abbott

The Z80 is a 'good' example of messing this up. Its single byte opcode map is full of LD r,r instructions which aren't used that often and are not very powerful, while extended instructions that could do with a speedup are nobbled with prefix bytes. LD r,(IX+d) seems like a great idea until you realize that it takes 6 bytes and 10 memory cycles to load a 16 bit register, whereas the direct equivalent ld hl,(nnnn) only needs 3 bytes and 5 cycles. That kind of performance hit really takes the fun out of using index registers on the Z80!

If the Z80 had a proper prefetch queue and a multiplication instruction, it would execute each LD r,r after a multiplication in only 1 cycle!

litwr · 04 November 2018, 11:54

Quote:

Originally Posted by Kalms

Relocatable code on 68k, where the addressing is within a 32kB memory area, is typically the same size, or smaller than, non-relocatable code. Where you would normally have a "move.l {location_of_global_variable},d0" with the relocatable code you would instead have a "move.l offset_of_global_variable(a5),d0". This saves opcode bytes and also executes quicker.

It is not so easy. Let's do some coding.

Code:

MOVE.L   #index,A0
MOVE.L   table(A0),D0

In the case of relocatable code we need to use something like MOVE.L table(A0,A5),D0, which is slower, and if you want to use two indices it will be no good at all.

Galahad/FLT · 04 November 2018, 12:51

Quote:

Originally Posted by litwr

It is not so easy. Let's do some coding.

Code:

MOVE.L   #index,A0
MOVE.L   table(A0),D0

In the case of relocatable code we need to use something like MOVE.L table(A0,A5),D0, which is slower, and if you want to use two indices it will be no good at all.

For a start it would be

lea index(pc),a0

Your example isn't relocatable.

plasmab · 04 November 2018, 19:53

I just want to point something out that kinda irritates me with 68K. People have said that 68K assembler is 100% portable between assemblers.. but I find that's simply not true.

For example. If you grab the code for this driver.

http://aminet.net/package/comm/misc/8n1

It has instructions that perform bit test and set operations on data registers which the coder's assembler seems to have translated to different opcodes.

For example

Code:

     bset.b  #5,d1

My assembler does not like this. When i check the manual ...

BSET ~(<bit number> of Destination) → Z; 1 → <bit number> of Destination
BSET Dn,<ea>
BSET # <data>,<ea>

I am guessing it gets translated to or.b #(1<<5), d1

But like i say.. it's not 100% between assemblers.

This is the tip of the iceberg when using a strict assembler.

Don_Adan · 04 November 2018, 20:01

Quote:

Originally Posted by litwr

It is not so easy. Let's do some coding.

Code:

MOVE.L   #index,A0
MOVE.L   table(A0),D0

In the case of relocatable code we need to use something like MOVE.L table(A0,A5),D0, which is slower, and if you want to use two indices it will be no good at all.

Every 68k Amiga code stored fully in chip or fast memory can be fully relocatable.

Don_Adan · 04 November 2018, 20:04

Quote:

Originally Posted by plasmab

I just want to point something out that kinda irritates me with 68K. People have said that 68K assembler is 100% portable between assemblers.. but I find that's simply not true.

For example. If you grab the code for this driver.

http://aminet.net/package/comm/misc/8n1

It has instructions that perform bit test and set operations on data registers which the coder's assembler seems to have translated to different opcodes.

For example

Code:

     bset.b  #5,d1

My assembler does not like this. When i check the manual ...

BSET ~(<bit number> of Destination) → Z; 1 → <bit number> of Destination
BSET Dn,<ea>
BSET # <data>,<ea>

I am guessing it gets translated to or.b #(1<<5), d1

But like i say.. it's not 100% between assemblers.

This is the tip of the iceberg when using a strict assembler.

Some assemblers don't like the size extension in bset.b; plain bset is correct. Other examples: unlk.w where unlk is fine, moveq.l where moveq is enough, etc.
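
To show why bset #5,d1 and the guessed or.b #(1<<5),d1 are close but not the same instruction, here is a small model. It relies on BSET being long-sized on a data register (bit number taken modulo 32) and setting Z from the bit before it is changed. The function names are made up, and only the Z flag is modelled.

```python
def bset_dn(reg, bit):
    """BSET #bit,Dn: returns (new register value, Z flag).

    Z is set if the tested bit was 0 before the operation.
    """
    bit %= 32
    z = ((reg >> bit) & 1) == 0
    return (reg | (1 << bit)) & 0xFFFFFFFF, z


def or_b_dn(reg, imm):
    """OR.B #imm,Dn: only the low byte changes; returns (new value, Z flag).

    Z is set if the resulting low byte is 0.
    """
    low = (reg | imm) & 0xFF
    return (reg & 0xFFFFFF00) | low, low == 0


d1 = 0x12345600
print([hex(v) if i == 0 else v for i, v in enumerate(bset_dn(d1, 5))])
print([hex(v) if i == 0 else v for i, v in enumerate(or_b_dn(d1, 1 << 5))])
```

For bit 5 both leave the same value in the register, but the Z flag differs, and for bit numbers above 7 the byte-sized OR cannot reach the bit at all. So an assembler that accepts bset.b #5,d1 most plausibly just drops the size suffix rather than emitting an OR.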

StingRay · 04 November 2018, 20:52

Quote:

Originally Posted by litwr

It is not so easy. Let's do some coding.

It's exactly like Kalms said! Also, if you want to talk about relocatable code you should at least present proper pc-relative code.

plasmab · 04 November 2018, 20:58

Quote:

Originally Posted by StingRay

It's exactly like Kalms said! Also, if you want to talk about relocatable code you should at least present proper pc-relative code.

I think 68K is pretty much entirely relocatable except for long-jumps (and jsr)? Which is what the relocation table is for in AmigaOS. Everything else is trivial to make relocatable because you can load addresses relative to the PC. Normal branches are all relative. It's the jumps to distant code that seem to screw up relocation. Although i bet there are ways round that which i've just never bothered to use because hunk took care of it for me.

StingRay · 04 November 2018, 21:05

Quote:

Originally Posted by plasmab

It's the jumps to distant code that seem to screw up relocation.

Any and all Amiga code can be made 100% relocatable. It can lead to slightly larger code though when having to do relocatable code which accesses a different section for example as a bit of hunk trickery is required.

plasmab · 04 November 2018, 21:14

Quote:

Originally Posted by StingRay

Any and all Amiga code can be made 100% relocatable. It can lead to slightly larger code though when having to do relocatable code which accesses a different section for example as a bit of hunk trickery is required.

Ok but for the sake of me being stupid.. i don't understand how you can jump more than plus or minus 32768 without something helping you out and either patching the jump or putting the destinations in a table. Either way that's not relocatable.

I'm not contesting that it can be done.. just can't see how.

StingRay · 04 November 2018, 21:22

Quote:

Originally Posted by plasmab

Ok but for the sake of me being stupid.. i don't understand how you can jump more than plus or minus 32768 without something helping you out

You can't jump directly so at least one helper instruction is needed. If dealing with different sections the segment list provides necessary info.
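
One possible shape for such a helper, within a single section, is "lea anchor(pc),a0 / add.l #(target-anchor),a0 / jmp (a0)", where the 32-bit difference is an assembly-time constant and needs no relocation. That sequence is an assumption for illustration, not necessarily what is meant above. The address arithmetic can be modelled like this, with made-up offsets:

```python
def fits_d16(displacement):
    """True if a displacement fits the 16-bit signed range of bra.w / d16(PC)."""
    return -32768 <= displacement <= 32767


def far_jump_target(load_address, anchor_offset, target_offset):
    """Model of: lea anchor(pc),a0 / add.l #(target-anchor),a0 / jmp (a0)."""
    a0 = load_address + anchor_offset       # lea anchor(pc),a0 (anchor is nearby)
    a0 += target_offset - anchor_offset     # add.l #const,a0 (constant, no reloc)
    return a0                               # jmp (a0)


anchor, target = 0x0010, 0x20010            # target is 128 KB past the anchor
print(fits_d16(target - anchor))            # too far for a direct 16-bit branch

for base in (0x40000, 0x7F3A2):
    print(hex(base), hex(far_jump_target(base, anchor, target)))
```

Wherever the code is loaded, the computed address is the load address plus the target's offset, so nothing has to be patched. Jumps into a different section still need the segment list, as noted above.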
