the multi-cpu code density contest - Page 8

matthey · 20 August 2017, 06:09

Quote:

Originally Posted by Photon

Intel will win because it still supports special-purpose 8-bit CPU instructions that do more than other 8-bit CPUs. 16-bit is fluffier with the exception of mul/div and 32-bit Risc are the worst. Even with Thumb they can't quite get there. There are 64-bit etc CPUs too, ofc ;-)

Intel will lose because most of those "special purpose" instructions are not common enough to have an 8 bit encoding (see Vince Weaver's 8086 code). The cases where the 8086 is good at code density are fairly specific: tiny executables, frequent use of byte size instructions, frequent use of common instructions with inferred operands, and few registers in use. The 8086 code is for DOS, which has minimal executable overhead; that disqualifies it as the smallest Linux executable, leaving the 68k with the smallest Linux executable of the 31 architectures. I have some changes pending which should drop the 68k total executable size by another 10-20 bytes.

Quote:

Originally Posted by NorthWay

Back to that peculiar obsession with the size of the LZSS decompression loop. I sat down and tinkered with how I would do it natively if size was all I cared about and I could arrange data as I like. 34 bytes:

The only problem is that we failed to get your LZSS code working. I could not test 68k Linux but submitted your previous code. Vince Weaver could not get it working though. Maybe you could include the initialization code needed? Perhaps you could download the 68k code, insert your routine and submit the changes?

ross · 20 August 2017, 11:18

Quote:

Originally Posted by NorthWay

I know, but I said it was about the loop itself. That was the only thing that was counted for some reason, and so I cut out all init etc.
If I cared about a more realistic total code size then I would arrange it differently. Chances are I would care about speed too.

And if you are willing to drop in-buffer overwrite de-compression then you can separate literals and control/length+distance bits and read out the control bits("get_bits") 16 at a time.

Sorry NorthWay, I haven't read the earlier posts, so this isn't really a criticism from me.
So I need the rules before making an attempt:
- pure 68k or 020+ allowed?
- only the loop, or do the consts used need to be defined too?
- the bit stream as in the original LZSS, or something better suited to 68k (maintaining the exact ratio and rules)?
- decompression in-place required?

Cheers,
ross

matthey · 20 August 2017, 16:18

Quote:

Originally Posted by ross

So I need the rules before making an attempt:
- pure 68k or 020+ allowed?

020 instructions and addressing modes are allowed.

Quote:

Originally Posted by ross

- only the loop, or do the consts used need to be defined too?

Vince Weaver needs to know the initialization code/consts so they should be included but separating may be helpful as the initialization code does not count for the LZSS code size.

Quote:

Originally Posted by ross

- the bit stream as in the original LZSS, or something better suited to 68k (maintaining the exact ratio and rules)?

- decompression in-place required?

The decompression code needs to remain LZSS and use the existing static data. The exact rules beyond that would be a good question for Vince to answer. His e-mail and the 68k assembly code are at the following links.

http://deater.net/weave/vmwprod/asm/ll/ll.html
http://deater.net/weave/vmwprod/asm/ll/ll.m68k.s

It sometimes takes a while for him to answer e-mails but he has been responsive to my e-mails so far. As already mentioned, I recently submitted more clean up suggestions for the 68k total executable size but there were no changes for the LZSS code.

ross · 20 August 2017, 19:07

Quote:

Originally Posted by matthey

The only problem is that we failed to get your LZSS code working. I could not test 68k Linux but submitted your previous code. Vince Weaver could not get it working though. Maybe you could include the initialization code needed? Perhaps you could download the 68k code, insert your routine and submit the changes?

Hi matthey, NorthWay's code cannot work with the Okumura LZSS bitstream.
It relies on some 68k specifics (bit flags reversed for the roxr trick, directly negated offset, ...).
The SAME decode algorithm can be made much smaller if only you could shuffle the bits (that is more x86 friendly...).
And the fact that the code is still the smallest one in any case says a lot about the quality of the ISA.

Regards,
ross

ross · 20 August 2017, 19:38

mmh, walking through the sources..

Code:

         WARNING: order of match_position and match_lenght changed!
         see lines 178 to 182
         Mofication by <stephan.walter@gmx.ch>
         Also modified to have N,F,etc, etc to be parameters, not
         hard-coded  -- vmw

Code:

#define N 1024
#define F 64
#define THRESHOLD 2
#define P_BITS 10
#define POSITION_MASK 3

It's no longer the [LZSS] Okumura bitstream (which uses N=2^12, F=2^4[+2]).
So what's the point? To accommodate a personal test and a personal result for a preferred architecture?
It does not seem very scientific.
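To make the difference between the two parameter sets concrete, here is a minimal Python sketch of the bit budget of one match token. It assumes a 16-bit token in both formats and a power-of-two window; `token_layout` is a hypothetical helper for illustration, not something from Vince Weaver's or Okumura's sources, and it only counts bits rather than modelling how they are packed into bytes.

```python
def token_layout(n: int, f: int, threshold: int) -> dict:
    """Bit budget of one 16-bit match token for a window of n bytes."""
    position_bits = n.bit_length() - 1      # n is assumed to be a power of two
    length_bits = 16 - position_bits        # remaining bits store (length - threshold - 1)
    longest = (1 << length_bits) - 1 + threshold + 1
    return {
        "position_bits": position_bits,
        "length_bits": length_bits,
        "longest_encodable": longest,       # longest match the field can express
        "f": f,                             # longest match the format allows
    }

okumura = token_layout(4096, 18, 2)         # N=2^12, F=2^4+2
modified = token_layout(1024, 64, 2)        # the #defines quoted above

print(okumura)
print(modified)
```

Under these assumptions the original format spends 12 bits on position and 4 on length, while the modified one spends 10 and 6, so a decoder written for one cannot read the other.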

NorthWay · 20 August 2017, 20:01

Oh, my last one was absolutely not compatible with the original rules, it was just to point out that the original rules were ...sub-optimal and not how you'd do it if you had 68K in mind.

I'll take a round and see if I can get my version of the original working.

ross · 20 August 2017, 20:20

Quote:

Originally Posted by NorthWay

Oh, my last one was absolutely not compatible with the original rules, it was just to point out that the original rules were ...sub-optimal and not how you'd do it if you had 68K in mind.

With 68k in mind you can make some very tight code.
See mine from a few months ago (32b with no init consts):
http://eab.abime.net/showpost.php?p=...&postcount=480
It's more effective than LZSS.

Cheers!
ross

Photon · 20 August 2017, 22:01

Question:

Quote:

Originally Posted by meynaf

...

It's about comparison of various cpu families, but, this time, not at all about performance benchmarks - rather, code size (both number of bytes and number of instructions).

This is to make comparisons merely for academic purposes

...

So it's all about writing/showing real life code samples of significant size (i'd say 20-40 instructions should be enough), and write that for as many cpus as possible.

Answer:

Quote:

Originally Posted by Photon

Intel will win because it still supports special-purpose 8-bit CPU instructions that do more than other 8-bit CPUs. 16-bit is fluffier with the exception of mul/div and 32-bit Risc are the worst. Even with Thumb they can't quite get there. There are 64-bit etc CPUs too, ofc ;-)

Counter:

Quote:

Originally Posted by matthey

Intel will lose because most of those "special purpose" instructions are not common enough to have an 8 bit encoding (see Vince Weaver's 8086 code). The cases where the 8086 is good at code density are fairly specific: tiny executables, frequent use of byte size instructions, frequent use of common instructions with inferred operands, and few registers in use.

...

Maybe for examples as short as this. But addressing the academic purpose of the question, examples should be as general as possible. Math functions, parsing/conversion, sorts, decompression algorithms are good examples. For those, 8-bit instructions are used a lot.

I can't answer why a specific coder's contest result is shorter (yet?), but common things like filling/copying n bytes in a single instruction, LUT, loop, simple ALU, RET vs RTS are all shorter on Intel CPUs since and after the 8086.

Obviously, OP is looking for an equivalent excerpt (let's say, a single function); I don't know what the (BI)OS, frameworks etc. have to do with it. Surely it must be standalone to make any comparison?

meynaf · 20 August 2017, 23:12

The code density situation for 68k vs x86 is quite simple in fact. x86 is good only on very small code samples; the bigger the code, the worse it becomes. 68k is more or less constant.
Programs that are just 100 bytes or less in size aren't very relevant. Why not something like 1MB? It would be 1MB on 68k but 1.5MB on x86, even though x86 has better compilers. I have one example of this.

matthey · 21 August 2017, 00:55

Quote:

Originally Posted by ross

It's no longer the [LZSS] Okumura bitstream (which uses N=2^12, F=2^4[+2]).
So what's the point? To accommodate a personal test and a personal result for a preferred architecture?
It does not seem very scientific.

The code was originally used as a contest to optimize for the x86. Yes, it was modified to run better on the x86. Yes, the big endian code has to do an endian swap for little endian static data. Yes, I complained about these things to Vince Weaver. It is difficult to change them now even though the results are less than scientific and perhaps not even a good study. I originally asked Vince to take down his web site as a source of misinformation. My next best option was to at least get the 68k close to where it should be instead of middle of the pack for code density. Maybe the results are half way meaningful for the architectures which are well optimized at least. I recently asked him to include more data like the following.

number of instructions
code size
average instruction length
number of memory/cache accesses
number of branches

I did a comparison of the SuperH SH-3 to the 68k using Vince's code and found the SH-3 has about 50% more instructions, 40% more cache accesses, 40% more branches but only about 15% worse code density. CPU designs can overcome many obstacles but it is difficult to imagine a fast SH-3 core if these stats were normal. Maybe Vince's code could be good enough to be a starting point for ISA comparison.
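As a rough worked example of what those percentages imply, the Python sketch below derives the relative average instruction length. It reads "15% worse code density" as "15% more code bytes", which is an assumption about the wording; the only outside fact used is that SH-3 instructions are a fixed 16 bits. The function name is hypothetical.

```python
def relative_avg_instruction_length(extra_instructions: float, extra_code_size: float) -> float:
    """Average instruction length of B relative to A, given B's excess over A."""
    return (1 + extra_code_size) / (1 + extra_instructions)

# SH-3 vs 68k figures quoted above: about 50% more instructions, about 15% more bytes (assumed reading)
ratio = relative_avg_instruction_length(0.50, 0.15)

SH3_INSTRUCTION_BYTES = 2                      # SH-3 uses fixed 16-bit instructions
implied_68k_average = SH3_INSTRUCTION_BYTES / ratio

print(f"SH-3 average instruction length is {ratio:.2f}x the 68k average")
print(f"implied 68k average instruction length: {implied_68k_average:.2f} bytes")
```

If the figures hold, the 68k averages a bit over two and a half bytes per instruction on this program, and it needs a third fewer instructions to do the same work.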

Quote:

Originally Posted by Photon

Maybe for examples as short as this. But addressing the academic purpose of the question, examples should be as general as possible. Math functions, parsing/conversion, sorts, decompression algorithms are good examples. For those, 8-bit instructions are used a lot.

Many programmers fall for the trap of thinking an 8 bit encoding will give superior code density. There is very limited space for even a register or immediate specification and there are only so many useful instructions with practically no data. The x86 ISA is good at utilizing instructions with inferred ops but this is bad for orthogonality and ends up being less general purpose. Several of the x86 instructions were replaced when moving to x86_64 for this reason (6/256 8 bit encodings were used for BCD).
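A small Python sketch of the encoding-space arithmetic behind that point. It assumes 8 registers, as on the 8086, and that each register operand is encoded directly in the single opcode byte; `one_byte_slots` is a hypothetical helper.

```python
OPCODE_SPACE = 256                      # distinct values of a one-byte opcode

def one_byte_slots(register_count: int, register_operands: int) -> int:
    """Opcode values consumed by one instruction that names its registers in the opcode byte."""
    return register_count ** register_operands

one_reg = one_byte_slots(8, 1)          # e.g. a one-byte "inc reg" style instruction
two_reg = one_byte_slots(8, 2)          # a hypothetical one-byte "op reg,reg"

print(OPCODE_SPACE // one_reg)          # how many one-register instructions would fill the byte
print(OPCODE_SPACE // two_reg)          # how many two-register instructions would fill the byte
```

With explicit operands the byte fills up after 32 one-register or 4 two-register instructions, which is why one-byte encodings lean so heavily on inferred operands.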

Quote:

Originally Posted by meynaf

The code density situation for 68k vs x86 is quite simple in fact. x86 is good only on very small code samples; the bigger the code, the worse it becomes. 68k is more or less constant. Programs that are just 100 bytes or less in size aren't very relevant.

I agree. The 68020 ISA code density degrades some at roughly 100kB executable size but it is not as bad as x86. Many RISC processors are bad about code density degrading with larger executable sizes too.

NorthWay · 21 August 2017, 02:46

And I just realized that you can keep that peculiar data format of the original compression example, but ditch the whole code structure and style it to be similar to my idealized version and make the loop 38(?) bytes (init not included). You don't need that extra 1K buffer.

matthey · 21 August 2017, 03:00

Quote:

Originally Posted by NorthWay

And I just realized that you can keep that peculiar data format of the original compression example, but ditch the whole code structure and style it to be similar to my idealized version and make the loop 38(?) bytes (init not included). You don't need that extra 1K buffer.

I e-mailed Vince Weaver the link to your post #138 on Saturday. It makes sense to have a more practical algorithm but it would be a lot of work to change it now as all architectures would need changing. It is not my decision either.

ross · 21 August 2017, 13:44

There are two [potential] bugs in the decompression loop:
[EDIT: potential because all buffers are enlarged enough]

Code:

decompression_loop:
	moveq	#7,%d7		| load a counter
	move.b	%a3@+,%d5	| load a byte, increment pointer

test_flags:
	cmp.l	%a4,%a3		| have we reached the end?
	bge.b	done_logo  	| if so, exit

The end check needs to be done *before* the flags byte is read; as it stands there is certainly a fetch past the end of the buffer.

Code:

	lea	%pc@(logo),%a3		| a3 points to logo data
	lea	%pc@(logo_end),%a4	| a4 points to logo end

The check uses the compressed stream bounds: this case arises only when the last flags byte is full (all 8 bits used).
You should use the *decompression buffer* bound, or an end token in the compressed stream, or give up some compression and produce a well-formed compressed stream.
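The ordering issue can be sketched in Python. This is a minimal model of the modified format, not a translation of the 68k code: the token split (position in the low P_BITS, length in the rest, plus THRESHOLD+1) follows the constants and shifts shown in this thread, while the initial ring position N-F, the "set bit means literal" convention and the LSB-first flag order are assumptions borrowed from Okumura's original.

```python
N = 1024          # ring buffer size (from the #defines quoted earlier)
F = 64            # maximum match length
THRESHOLD = 2     # shortest match is THRESHOLD + 1 bytes
P_BITS = 10       # position bits in a 16-bit token

def lzss_decode(data: bytes) -> bytes:
    ring = bytearray(N)
    r = N - F                         # assumed initial write position
    out = bytearray()
    i = 0
    while i < len(data):              # bound check BEFORE the flags byte is read
        flags = data[i]
        i += 1
        for _ in range(8):
            if i >= len(data):        # the last flags byte may be only partly used
                return bytes(out)
            if flags & 1:             # assumed: set bit means literal
                ring[r] = data[i]
                out.append(data[i])
                r = (r + 1) & (N - 1)
                i += 1
            else:
                if i + 1 >= len(data):
                    raise ValueError("truncated match token")
                token = data[i] | (data[i + 1] << 8)      # little-endian 16-bit token
                i += 2
                length = (token >> P_BITS) + THRESHOLD + 1
                pos = token & (N - 1)
                for k in range(length):
                    c = ring[(pos + k) & (N - 1)]
                    ring[r] = c
                    out.append(c)
                    r = (r + 1) & (N - 1)
            flags >>= 1
    return bytes(out)

# three literals, then a 3-byte match pointing back at them (position 960 = N - F)
print(lzss_decode(bytes([0x07, 0x61, 0x62, 0x63, 0xC0, 0x03])))
# a stream whose last flags byte is completely full
print(lzss_decode(bytes([0xFF]) + b"12345678"))
```

Because the outer `while` tests the bound first, the second stream ends cleanly instead of fetching a ninth flags byte that is not there.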

[ And an amusing typo:

Code:

|  There is an alternate morotolla syntax that gas can also handle

]

Cheers,
ross

Thorham · 21 August 2017, 14:25

What's with the odd syntax in the above post?

ross · 21 August 2017, 15:09

Quote:

Originally Posted by Thorham

What's with the odd syntax in the above post?

The GAS syntax, a totally unreadable one.
In fact it takes me a lot of effort to read the code.

Two more potential bugs:

Code:

|	clr.l   %d4		| (unnecessary?)
	move.w	%a3@+,%d4	| load 16-bits, increment pointer
	ror.w	#8,%d4		| unfair big-endian penalty

	move.l	%d4,%d6		| copy d4 to d6
				| no need to mask d6, as we do it
				| by default in output_loop

	lsr.l	%d0,%d4		| unsigned shift right by P_BITS
	addq.l	#(THRESHOLD+1),%d4

	add.w	%d4,%d1

With lsr.l the code works only by chance; it needs to be .w (note the commented-out clr).
[EDIT: the second is not a bug, only a contortion in the code that explains the +1 added to d4]
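Why the long shift only works by chance can be shown with a small Python model of the register. It relies on two documented 68k behaviours: `move.w` to a data register replaces only the low word, and `lsr.w` shifts only the low word. The helper names and the sample token value are made up for illustration, and the byte swap is left out.

```python
P_BITS = 10
MASK32 = 0xFFFFFFFF

def move_w(reg: int, value16: int) -> int:
    """move.w <ea>,Dn: only the low 16 bits of the register are replaced."""
    return (reg & 0xFFFF0000) | (value16 & 0xFFFF)

def lsr_l(reg: int, count: int) -> int:
    """lsr.l: logical shift of all 32 bits."""
    return (reg & MASK32) >> count

def lsr_w(reg: int, count: int) -> int:
    """lsr.w: logical shift of the low word, upper word left untouched."""
    return (reg & 0xFFFF0000) | ((reg & 0xFFFF) >> count)

token = 0x0BC0                          # length field 2, position 0x3C0 (made-up sample)

clean = move_w(0x00000000, token)       # upper word happens to be zero
dirty = move_w(0x00010000, token)       # upper word holds leftovers, no clr.l was done

print(lsr_l(clean, P_BITS) & 0xFFFF)    # correct length field
print(lsr_l(dirty, P_BITS) & 0xFFFF)    # upper-word bits shifted down into the length
print(lsr_w(dirty, P_BITS) & 0xFFFF)    # word shift is correct regardless of the upper word
```

With the `clr.l` commented out, the long shift is only right as long as nothing else has left bits in the upper word of d4.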

Cheers,
ross

ross · 21 August 2017, 22:00

Hi.

Attached a 54 byte version (with the potential bugs corrected).
Like NorthWay said, the real turning point would be to avoid using the 1k buffer.
The stream format is really unfriendly.

Regards,
ross

matthey · 23 August 2017, 00:37

Quote:

Originally Posted by ross

Attached a 54 byte version (with the potential bugs corrected).
Like NorthWay said, the real turning point would be to avoid using the 1k buffer.
The stream format is really unfriendly.

Thanks. I hope Vince will be able to use your suggestions and code. He wouldn't have people trying to rewrite the inefficient decompression code if it had been better to begin with (some 6502 guys also tried to rewrite the decompression code). I haven't received a response from Vince as of yet about the latest changes. He is busy sometimes.

Thorham · 24 August 2017, 10:06

Quote:

Originally Posted by ross

The GAS syntax, a totally unreadable one.

Very strange. Why did they think that that was a good idea? Should be pretty easy to clean up with search/replace at least (perhaps with regex).

meynaf · 24 August 2017, 11:08

While quite programmer unfriendly, the AT&T syntax (used by gcc and co) isn't worse than Intel's asm syntax

Thorham · 24 August 2017, 19:45

Quote:

Originally Posted by meynaf

While quite programmer unfriendly, the AT&T syntax (used by gcc and co) isn't worse than Intel's asm syntax
