68k details - Page 24

meynaf · 09 September 2018, 08:22

Quote:

Originally Posted by alpine9000

The same person that posted results then edited the post because he made a mistake and wasn't running the code he thought he was ?

Isn't it better than posting code without any check ?

alpine9000 · 09 September 2018, 08:29

Quote:

Originally Posted by meynaf

Isn't it better than posting code without any check ?

This is a topic about a dead architecture that no one but a handful of us grumpy old men cares about; neither is important.

I certainly don't care if someone posts untested code as long as they say it's untested, I don't care if you make a mistake, and I don't care if litwr makes a typo in his C code and then corrects it.

By the way, GCC produces ever so slightly less compact code for litwr's version of the draw algorithm than for the last version I posted on GitHub. It also gives slightly different results (1 pixel in a different position for your [120, 100, 7, 1] test case).

litwr · 09 September 2018, 10:09

Quote:

Originally Posted by meynaf

Why the heck would I allow D4-D5 to be trashed in my routine ?

They contain DX and DY, so why save them?! That's your mistake.
However, you have won this time. The 68000 code for this case is better than I expected: it occupies only 72 bytes. I thought that the 68000 used 4-byte instructions more often, but your code consists almost entirely of 2-byte instructions. I am very impressed, because my 8086 code also consists mostly of 2-byte instructions. The 68000 has more registers, and this made its code more compact. I have written code for alpine9000's C routine and it takes 84 bytes, and the 8086 code for your routine takes as much as 86 bytes. But this is a particular case; generally the 8086 has better code density - it is a well known fact. You may remember my number pi calculation routine (http://litwr2.atspace.eu/pi/pi-spigot-benchmark.html): for the 8086 it is about 50% shorter than for the 68000.

I am almost sure that the 8086 code takes fewer cycles for the line drawing algorithm. The main loops of my two routines (ignoring the putpixel routine) take 79 and 65 cycles respectively. Could you count the cycles for the main loop of your code, without putpixel?
I have attached my sources. The assembly sources are for FASM and DOSBox.
drawline-x86.zip

meynaf · 09 September 2018, 11:02

Quote:

Originally Posted by litwr

They contain DX and DY, so why save them?! That's your mistake.

Your mistake, not mine. Sometimes they are reused, with small changes.
Why not save them anyway? What would be the benefit, aside from a few microseconds that don't count?
My reusable routines never return trash in registers, btw.
But I understand why you don't like this: there is no movem on x86.

Quote:

Originally Posted by litwr

But this is a particular case; generally the 8086 has better code density - it is a well known fact.

"well known fact"

In reality, x86 code is 1.5 times bigger than 68k code, except for very small code.
Nor is it a particular case: that is already two examples I have given you.

Quote:

Originally Posted by litwr

You may remember my number pi calculation routine (http://litwr2.atspace.eu/pi/pi-spigot-benchmark.html): for the 8086 it is about 50% shorter than for the 68000.

Wrong claim. Please post both routines here, so we can compare something concrete. Perhaps it's just your 68000 version that is far from optimal. But 50% shorter, that's simply impossible.

Maybe you're just comparing apples with oranges. An old DOS program does not have to open a library to display the result, so the Amiga version needs more code...

Quote:

Originally Posted by litwr

I am almost sure that the 8086 code takes fewer cycles for the line drawing algorithm. The main loops of my two routines (ignoring the putpixel routine) take 79 and 65 cycles respectively. Could you count the cycles for the main loop of your code, without putpixel?

Why count cycles? The point you wanted to prove was about code density, not cycle counts.
Anyway, I've already said I don't defend the 68000 implementation.
(And this routine isn't time critical.)

litwr · 09 September 2018, 11:10

Quote:

Originally Posted by grond

Intel could afford putting more transistors on a die earlier than the others and still make a ton of money.

Motorola had more resources than Intel but spent them mostly on promoting a very poor 6800, a mediocre 6809, and a 68000 that was generally good but had serious architectural defects. They should have spent more on better hardware instead. IMHO they also spent a lot on bad management instead of better engineers - they had the chance to keep the 6502 team.

Quote:

Originally Posted by idrougge

8086 is quite untypical for a 16-bit processor made after 1970. Throughout the 70s, minicomputers (and later on, microprocessors) went towards more orthogonal designs, something which cannot be said of the 8086. The PDP-11 was the golden standard for this design, but far from unique. Motorola chose this path, so did Zilog when they went 16-bit, and so did National Semiconductor with the 32016. Even the TMS9900 is quite orthogonal, albeit very different.

This made Intel different - it didn't follow bad standards, it invented better ones. The PDP-11 and TMS9900 ISAs were limited to 64 KB of addressable memory. The VAX-11 is an example of an ISA monstrosity which can be outperformed even by the simple 6502 (Acorn tests showed that a 6502 @4MHz could outperform some VAX-11 models). Motorola blindly adopted a BE architecture, which slows down their 6809 in 16-bit arithmetic and the 68000 in 32-bit arithmetic. The latter is obvious because if the 68000 used LE it would have the same timings for 32-bit MOVE and ADD instructions.

Quote:

Originally Posted by idrougge

A programmer-friendly (or compiler-friendly) ISA makes for faster and safer code. It is also fundamental when choosing an architecture – do note that almost all 16/32-bit designs before the rise of RISC used a Motorola 680x0 and that almost none used the x86.

No! For an advanced architecture, hand-written code is much worse than code generated by a proper compiler. Try to beat modern GCC with your assembly code - it is almost impossible. Even you have admitted that RISC and x86 have changed the world.

ross · 09 September 2018, 11:52

Quote:

Originally Posted by litwr

The latter is obvious because if 68000 used LE it would have the same timings for 32-bit MOVE and ADD instructions.

So you can have the same timing for a 32-bit MOVE (you can even have an internal 256-bit move on an 8-bit processor if you like) and a 32-bit ADD with a single 16-bit ALU.
Yeahh, LE miracles.

litwr · 09 September 2018, 12:11

Quote:

Originally Posted by meynaf

Wrong claim. Please post both routines here, so we can compare something concrete. Perhaps it's just your 68000 version that is far from being optimal. But 50% shorter, that's simply impossible.

Maybe you're just comparing apples with oranges. Old dos program does not have to open some library to display the result and so the amiga version needs more code...

The sources are open... I have just checked them. I claim my code is the fastest, not the shortest. So the pi-spigot algorithm shows that the fastest routine implementing it occupies 95 bytes for the 8086, 100 bytes for the 68000, and 78 bytes for the 68020. I measured only the code of the .l0 loop - it contains no system calls.

BTW why doesn't this forum engine allow using just a Greek letter for pi? IMHO we are living in the Unicode epoch now.

meynaf · 09 September 2018, 12:54

Quote:

Originally Posted by litwr

The sources are open... I have just checked them. I claim my code as the fastest one not the shortest. So pi-spigot algorithm shows that the fastest routine implementing it occupies 95 bytes for 8086, 100 bytes for 68000, and 78 bytes for 68020. I measured only code for .l0 loop - there is no any system call.

95 bytes for 8086, 100 bytes for 68000, but in your previous post you claimed the 8086 version was 50% shorter.

litwr · 09 September 2018, 19:16

Quote:

Originally Posted by meynaf

95 bytes for 8086, 100 bytes for 68000, but you claimed in previous post 8086 was 50% shorter.

Total sizes for the programs: 660 bytes for the 8086, 904 bytes for the 68000. Thus the 68000 version is "only" about 37% larger, not exactly the claimed 50%.
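For reference, the percentage depends on which program is taken as the baseline. A minimal Python sketch using only the byte counts quoted in this thread (the variable names are mine, not from the sources):

```python
# Byte counts as quoted in the posts above.
total_8086, total_68000 = 660, 904   # whole programs
loop_8086, loop_68000 = 95, 100      # .l0 loop only

# Difference relative to the 8086 size: how much larger the 68000 program is.
larger = (total_68000 - total_8086) / total_8086
# Difference relative to the 68000 size: how much shorter the 8086 program is.
shorter = (total_68000 - total_8086) / total_68000
# The same "shorter" measure for the main loop alone.
loop_shorter = (loop_68000 - loop_8086) / loop_68000

print(f"whole program: 68000 is {larger:.0%} larger")    # 37%
print(f"whole program: 8086 is {shorter:.0%} shorter")   # 27%
print(f"main loop:     8086 is {loop_shorter:.0%} shorter")  # 5%
```

So "37%" is the figure for how much larger the 68000 program is; measured as "shorter", the 8086 program comes out about 27% shorter overall and about 5% shorter for the loop alone.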

meynaf · 09 September 2018, 19:31

Quote:

Originally Posted by litwr

Total sizes for the program: 660 b - 8086, 904 b - 68000. Thus, it is about "only" 37% not exactly the claimed 50%.

And this confirms what I said earlier: OS calls on the Amiga aren't exactly economical (at least for small programs). This is what's called a bias.

litwr · 09 September 2018, 20:34

Quote:

Originally Posted by meynaf

Typical ? Who defines what is "typical" ?

IMHO the 6502 is a 100% 8-bit processor and the 8086 is a 100% 16-bit processor. VAX-11, IBM/370, 80386, 68020, 80486, 68030, 32016, ARM, ... are typical 100% 32-bit systems.
However, it is not so easy to state the proper number of bits for the 8080, z80, 68000, 68EC020, 8088, TMS9900, ... I am trying to develop a calculated value called a "bit index" (BI). IMHO the BI for the 8080 and z80 is above 10, for the 68000 above 24, for the 8088 above 14, for the TMS9900 close to 16, ...
BTW the part about the TMS9900 has just been published - https://litwr.livejournal.com/1575.html.

meynaf · 10 September 2018, 08:53

Quote:

Originally Posted by litwr

IMHO 6502 is 100% 8-bit processor. 8086 is 100% 16-bit processor. VAX-11, IBM/370, 80386, 68020, 80486, 68030, 32016, ARM, ... are typical 100% 32-bit systems.
However it is not so easy to say about proper number of bits of 8080, z80, 68000, 68EC020, 8088, TMS9900, ... I am trying to develop a calculated value called "a bit index" (BI). IMHO BI for 8080 and z80 is above 10, for 68000 is above 24, for 8088 is above 14, for TMS-9900 is close to 16, ...
BTW the part about TMS9900 is just published - https://litwr.livejournal.com/1575.html.

Defining the number of bits of a cpu really depends on how you look at it.

If you consider the address space, then 8086 isn't "typical" 16 bits - but TMS9900 is. And no 8-bit will be typical 8-bit...
If you don't take that into account, then 68EC020 is full 32 bits and there is no problem with it.

You could consider several numbers : amount of data single instructions can handle, data bus width, address bus width, size of internal data path.
With 68000 these numbers are 32,16,24,16.
With 8088 they are 16,8,20,16.
With 6502 they are 8,8,16,8.

In addition to this there are variants. For example, ARM isn't always 32 bits, as 64-bit ARM exists.
You could even consider that cpus having "compatibility modes" (like 65C816 or 80386+) are not typical of their size either.

In short, before saying something is typical or not, it's better to have a clear definition.
For me typical just means something shared among many cpu types (therefore, 8086 being quite unique in its kind, it's not typical at all in my view).
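To make the four-number view concrete, here is a small Python sketch that stores the figures listed above and shows that each CPU gets several different "bit counts" depending on which figure is picked. The class and field names are hypothetical labels of mine; only the numbers come from the post.

```python
from typing import NamedTuple

class Widths(NamedTuple):
    # Field names are illustrative labels for the four figures in the post.
    data_per_instruction: int
    data_bus: int
    address_bus: int
    internal_data_path: int

CPUS = {
    "68000": Widths(32, 16, 24, 16),
    "8088": Widths(16, 8, 20, 16),
    "6502": Widths(8, 8, 16, 8),
}

def candidate_bit_counts(w: Widths) -> list[int]:
    """Distinct widths that could each be quoted as 'the' bit count."""
    return sorted(set(w))

for name, widths in CPUS.items():
    print(name, candidate_bit_counts(widths))
```

None of the three CPUs collapses to a single number, which is the point being made: the label depends on the definition chosen.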

litwr · 11 September 2018, 15:08

Quote:

Originally Posted by meynaf

If 6502 can use 2k table, perhaps 68000 can too

Indeed but it will require much more than 14 cycles.

Quote:

Originally Posted by meynaf

You would have wanted the 68k to abandon the 32-bit registers in later versions ?

Perhaps you want the 80386 to abandon its bulky registers too ?

My point: the x86 segment registers are not registers in the full sense; they were just a temporary measure. The 68000 address registers don't have all the functionality of GPRs, but they are forever.

Quote:

Originally Posted by meynaf

You stated that :
"CPU as a piece of art must be fast - it is its main feature which affects its beauty too. Try to imagine a fat slow cheetah with a beautiful coat."
So you seemed to suggest 68k is fat slow in comparison to x86...
If it wasn't your point, explain.

It is you again who tries to suggest something. I can even begin to think that you feel something is wrong with the 68000. I can repeat my point - a processor (and a cheetah) must be fast.

Quote:

Originally Posted by meynaf

This means that with single stack pointer you need more stack space for every task, in order to account for extra space used by interrupts.

IMHO it is much better to have one bigger stack than two separate stack areas.

Quote:

Originally Posted by meynaf

They really use opcode space for both prefixes and mov to/from segment registers, so what do you mean by not polluting ?

Those instructions are separated from others and can be easily replaced.

Quote:

Originally Posted by meynaf

A temporary that still works 40 years later

Maybe for those people who like to run an old OS like MS-DOS, but modern OSes don't use them. It is even impossible to use them with x86-64.

Quote:

Originally Posted by meynaf

Pretty sure it does not ; it would have been visible in the hex-rays output.
And anyway I have located many routines already and they are individually larger than the same routine in 68k code...

It is just another indirect proof. You can't compare 1 MB of data byte by byte. It is quite possible that some 68000 routines are a bit smaller, but it is also possible that some routines are smaller for x86. However, you have almost convinced me that 68000 code density is very close to that of the 8086 and maybe even a bit better.

Quote:

Originally Posted by meynaf

According to the docs I have...
Interrupt timing : 8086 61 clocks, 68000 47 clocks.
8086 iret 24 clocks, 68000 rte 20 clocks.

The 8086 iret takes 32 cycles, but on the 80286 it takes only about 25 cycles. So the 68000 is quite fast with interrupts, much better than I expected. Thanks for the information. However, it doesn't make the 68000 as superior to the 80286 as you suggested.

Quote:

Originally Posted by meynaf

You've changed your article ?
Else what did you mean here ?

I have changed nothing. Just be more careful.

Quote:

Originally Posted by roondar

That's strange - it drew one in about 45-47 seconds for me, I timed it on my phone and everything.

Sorry for the delay. It is strange indeed, because when I tried it again it gave 45-47 seconds. Maybe I used a different setting. It's also quite possible that the BBC Micro program uses specific settings...

Quote:

Originally Posted by roondar

Your execution times for the 6502 are somewhat off (I've attached a version which hopefully does count correctly). But anyway, that code is still much slower than the 68000 version

Thank you very much! Now I am sure that a 68000 @8MHz is much faster than a 6502 @2MHz, but a 6502 @4MHz can sometimes be faster than a 68000, especially in tasks that require more register memory than the 68000 has. The 6502 should also be faster with branching like a long C switch statement. However, I have to agree that I underestimated the 68000; it was one of the best for its time. Though my article states almost the same idea...

Quote:

Originally Posted by meynaf

Defining the number of bits of a cpu really depends how you look at it.

If you consider the address space, then 8086 isn't "typical" 16 bits - but TMS9900 is. And no 8-bit will be typical 8-bit...
If you don't take that into account, then 68EC020 is full 32 bits and there is no problem with it.

You could consider several numbers : amount of data single instructions can handle, data bus width, address bus width, size of internal data path.
With 68000 these numbers are 32,16,24,16.
With 8088 they are 16,8,20,16.
With 6502 they are 8,8,16,8.

In addition to this there are variants. F.e. ARM isn't always 32 bits as 64-bit ARM exists.
You could even consider cpus having "compatibility modes" (like 65C816 or 80386+) are not typical of their size either.

In short, before saying something is typical or not, better have a clear definition.
For me typical just means something shared among many cpu types (therefore, 8086 being quite unique in its kind, it's not typical at all in my view).

It is not an easy subject. Let's examine the z80: it has a 4-bit ALU, 8-bit registers which can be combined into 16-bit registers, 8- and 16-bit operations, ... Some z80 16-bit operations are very fast. The first ARMs have a 26-bit address bus...

Sorry for my English.

I sometimes use words in a slightly shifted sense. It would probably be better to write that the 8086 is a typical example of a 16-bit processor. Thank you for the clarifications.

roondar · 11 September 2018, 15:31

Quote:

Originally Posted by litwr

Sorry for the delay. It is strange indeed, because when I tried it again it gave 45-47 seconds. Maybe I used a different setting. It's also quite possible that the BBC Micro program uses specific settings...

Yeah, that's one of the problems with Mandelbrot fractals - a large decider in speed is the number of iterations and I didn't see that stated for the BBC Micro program. Another is the resolution and that might not be the same between these two either.

Quote:

Thank you very much! Now I am sure that a 68000 @8MHz is much faster than a 6502 @2MHz, but a 6502 @4MHz can sometimes be faster than a 68000 @8MHz, especially in tasks that require more register memory than the 68000 has. The 6502 should also be faster with branching like a long C switch statement. However, I have to agree that I underestimated the 68000; it was one of the best for its time. Though my article states almost the same idea...

Without even checking any code, I'm pretty certain a 4MHz 6502 will beat the 68000 at a number of 8 bit tasks and a number of control structures - though it gets more complicated to compare when the code starts needing branches that are further than the 6502 supports (+127/-128 bytes).

As for the C style switch statement, that can often be implemented as an indirect jump table on the 68000 and as a result the number of case statements tends to be unimportant. So I'd say it's actually the other way around: the 6502 will probably do better than the 68000 for shorter switches.

Here's a 68000 example of what I mean:

Code:

     ; Switch to one of four routines based on value in D0
 4   add.w d0,d0
 4   add.w d0,d0   ; Multiply by 4
14   jmp   switch_table(pc,d0.w)

switch_table
10   bra.w case_1
10   bra.w case_2
10   bra.w case_3
10   bra.w case_4

case_1
...etc

This code will always take 32 cycles, regardless of the number of elements in the switch.
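For readers more at home in a high-level language, the same idea can be sketched in Python as a table of handlers indexed by the switch value. This is only an analogy under my own assumptions: the handler names mirror the labels in the listing, the value is taken to be 0 to 3, and Python adds an index check that the assembly version does not have.

```python
def case_1() -> str:
    return "case_1"

def case_2() -> str:
    return "case_2"

def case_3() -> str:
    return "case_3"

def case_4() -> str:
    return "case_4"

# Plays the role of switch_table: one entry per case, in order.
SWITCH_TABLE = [case_1, case_2, case_3, case_4]

def dispatch(d0: int) -> str:
    # One indexed lookup and one call, however many entries the table has.
    return SWITCH_TABLE[d0]()

print(dispatch(2))
```

Adding a fifth case only appends one table entry; the cost of reaching any case stays the same, unlike a chain of comparisons.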

ross · 11 September 2018, 16:59

Or 34 fixed cycles for a full 24/32bit memory span:

Code:

     ; Switch to one of four routines based on value in D0
 4   add.w d0,d0
 4   add.w d0,d0
18   movea.l switch_table(pc,d0.w),a0
 8   jmp (a0)

switch_table
   dc.l case_1
   dc.l case_2
   dc.l case_3
   dc.l case_4

case_1
...etc

EDIT: the listing originally showed 10 cycles for movea.l switch_table(pc,d0.w),a0; that was just a typo. It is 18 cycles, and the 34 total is correct.
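The two totals can be checked by adding up the per-instruction figures from the listings. A minimal Python sketch; the cycle counts are the ones quoted in the posts (with 18 for the movea.l, as the EDIT note states), and the list and function names are mine:

```python
# (instruction, cycles) along the executed path of each version.
BRANCH_TABLE_PATH = [
    ("add.w d0,d0", 4),
    ("add.w d0,d0", 4),
    ("jmp switch_table(pc,d0.w)", 14),
    ("bra.w case_n", 10),
]
ADDRESS_TABLE_PATH = [
    ("add.w d0,d0", 4),
    ("add.w d0,d0", 4),
    ("movea.l switch_table(pc,d0.w),a0", 18),
    ("jmp (a0)", 8),
]

def total_cycles(path: list[tuple[str, int]]) -> int:
    """Sum the cycle counts of every instruction on the path."""
    return sum(cycles for _, cycles in path)

print(total_cycles(BRANCH_TABLE_PATH))   # 32
print(total_cycles(ADDRESS_TABLE_PATH))  # 34
```

With 10 instead of 18 for the movea.l the second sum would be 26, which is why the 34 total only fits the corrected figure.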

roondar · 11 September 2018, 17:12

Quote:

Originally Posted by ross

Or 34 fixed cycles for a full 24/32bit memory span:

Code:

     ; Switch to one of four routines based on value in D0
 4   add.w d0,d0
 4   add.w d0,d0
18   movea.l switch_table(pc,d0.w),a0
 8   jmp (a0)

switch_table
   dc.l case_1
   dc.l case_2
   dc.l case_3
   dc.l case_4

case_1
...etc

Ah, yes - that's a good one. Any number of labels and a jump anywhere in memory for a fixed cost. This kind of stuff is one of many reasons why I like the 68000. That d(pc,ix) / d(an,ix) addressing mode is so useful.

It occurs to me you could also do an 8-bit branch version of this and save a few more cycles, but a range of only 127 bytes is not great. Still, it might be useful for some algorithms.

meynaf · 11 September 2018, 17:13

Have you tried :

Code:

 subq.w #1,d0
 bcs case_0
 beq case_1
 subq.w #2,d0
 bcs case_2
 beq case_3

If you can use short branches, timing becomes something like 14/22/34/42 (28 on average).
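Those four figures can be reproduced by walking each path through the compare chain. The sketch below assumes the usual 68000 timings, which are not stated in the post: 4 cycles for subq.w on a data register, 10 for a short branch that is taken and 8 for one that is not. It also assumes the four cases are equally likely when averaging.

```python
SUBQ = 4        # subq.w #n,d0 (assumed timing)
TAKEN = 10      # short Bcc, branch taken (assumed timing)
NOT_TAKEN = 8   # short Bcc, branch not taken (assumed timing)

# Instructions executed before control reaches each case label.
PATHS = {
    "case_0": [SUBQ, TAKEN],
    "case_1": [SUBQ, NOT_TAKEN, TAKEN],
    "case_2": [SUBQ, NOT_TAKEN, NOT_TAKEN, SUBQ, TAKEN],
    "case_3": [SUBQ, NOT_TAKEN, NOT_TAKEN, SUBQ, NOT_TAKEN, TAKEN],
}

totals = {label: sum(costs) for label, costs in PATHS.items()}
average = sum(totals.values()) / len(totals)

print(totals)   # {'case_0': 14, 'case_1': 22, 'case_2': 34, 'case_3': 42}
print(average)  # 28.0
```

Under these assumptions the chain beats the 32-cycle jump table on average and for the first two cases, but is slower for the last two, and every extra pair of cases adds more cycles to the worst path.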

Note : more fun on 68020.

Code:

jmp ([switch_table,pc,d0.w*4])

roondar · 11 September 2018, 17:19

Quote:

Originally Posted by meynaf

Have you tried :

Code:

 subq.w #1,d0
 bcs case_0
 beq case_1
 subq.w #2,d0
 bcs case_2
 beq case_3

If you can use short branches, timing becomes something like 14/22/34/42 (28 on average).

Indeed, that would work as well. Personally, I generally prefer the jump table as it's easier to expand later if needed and I like the flat-cost idea, but for only 4 jumps the above is definitely a viable alternative.

Quote:

Note : more fun on 68020.

Code:

jmp ([switch_table,pc,d0.w*4])

This is one of those things I routinely wish for when coding 68000, the scale factor is super useful.

ross · 11 September 2018, 17:25

Quote:

Originally Posted by meynaf

If you can use short branches, timing becomes something like 14/22/34/42 (28 on average).

Certainly convenient for a few selections!

Quote:

Note : more fun on 68020.

Yeahh, PC Indirect Scaled Preindexed (and you can also have an od field).

litwr · 11 September 2018, 17:32

Quote:

Originally Posted by roondar

This code will always take 32 cycles, regardless of number of elements in the switch.

A very impressive technique! But it is not exactly like a C switch statement, which needs to check a value and, if it does not match, go on to the next case position.
EDIT. Sorry, I missed the point - those jumps are for byte values.
