the multi-cpu code density contest - Page 5

grond · 10 February 2017, 10:26

Quote:

Originally Posted by meynaf

The bit-field instruction is slower than a read (especially this one, which goes in dcache).

DCache? What DCache? I thought you were investigating 020 code density, not 030...

meynaf · 10 February 2017, 10:47

Quote:

Originally Posted by grond

DCache? What DCache? I thought you were investigating 020 code density, not 030...

Where did you read that I wanted to limit this thread to 020?
Anyway, my code isn't slower, even on 020.

NorthWay · 10 February 2017, 10:58

I have looked a little at LZ code a few times, and I'd have to say that this is indeed a bit unconventional.

The bit ordering seems reversed from what I find natural, and having to treat the data stream as LE is close to a bug IMO.
A minimalistic and more conventional approach would be to have a subroutine that returns 1 bit from the bitstream. The subroutine (can't remember how) detects by itself when the register is empty and fetches the next byte (or you can use word/long I guess). You then have a subroutine (or inlined code) that calls this N times when you need to fetch N bits. Or, if the number of bits for non-literals is exactly 16, you can group all control bits in separate bytes, as this code seems to do.
And do you need to keep that A5 array? Wouldn't you reference A4 with negative offsets? (If I read the code right.)

I think I made something LZ-like in around 50 bytes on a 6510.
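
For reference, here is a minimal Python model of the self-detecting one-bit reader described above. It is only a sketch, not code posted in the thread: the `BitReader` name, the LSB-first bit order and the sentinel-bit trick (keep a 1 bit above the unread data, so the register only reaches zero once every data bit has been consumed) are assumptions chosen for illustration.

```python
class BitReader:
    """Hypothetical model of a bit reader that detects an empty register."""

    def __init__(self, data):
        self.data = data
        self.pos = 0      # index of the next unread byte
        self.reg = 1      # sentinel only: the first get_bit() triggers a refill

    def get_bit(self):
        bit = self.reg & 1
        self.reg >>= 1
        if self.reg == 0:
            # Only the sentinel was left, so the bit shifted out was not data.
            byte = self.data[self.pos]
            self.pos += 1
            bit = byte & 1
            self.reg = 0x80 | (byte >> 1)   # 7 data bits with the sentinel on top
        return bit

    def get_bits(self, n):
        # Fetch n bits by calling get_bit() n times, first bit in bit 0.
        value = 0
        for i in range(n):
            value |= self.get_bit() << i
        return value
```

Reading N bits is then just N calls to `get_bit()`; with this bit order, `BitReader(bytes([0xB2])).get_bits(8)` gives back `0xB2`.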

Thorham · 10 February 2017, 11:39

That doesn't look like a very good implementation.

meynaf · 10 February 2017, 13:46

Regardless of whether this guy did a good job or not, what he did is the kind of comparison I wanted to build here.

NorthWay · 10 February 2017, 15:27

This is more like what you would do on 68K IMO. 52 bytes inner loop.
That LE idiocy would obviously also be amended so it would be shorter still.

Code:

_start:
   movea.l  #(lab_1810),a6
   movea.l  a6,a1
   move.l   #$3c0,d2
   movea.l  #(lab_db),a3
   movea.l  #(lab_1f6),a4
   movea.l  #(lab_3d0),a5

   clr.l    d4        ; necessary?
   move.w   #$3ff,d3
   clr.l    d1
   moveq.l  #10,d7
   moveq.l  #$1,d5
   bra      _entry

lzss_begin:

decompression_loop:


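string_copy:
   move.w   (a3)+,d6
   ror.w    #8,d6     ; swap bytes: the stream stores this word LE
   move.w   d6,d1     ; copy moved after the swap (was before ror.w), assuming length = top 6 bits of the LE word
   lsr.w    d7,d1
   addq.w   #3-1,d1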

output_loop:
   and.w    d3,d6
   move.b   (a5,d6.l),d4
   addq.w   #1,d6

store_byte:
   move.b   d4,(a1)+
   move.b   d4,(a5,d2.l)
   addq.w   #1,d2
   and.w    d3,d2
   dbra     d1,output_loop

_entry:
   cmpa.l   a4,a3
   bge      done_logo

get_bit:
   lsr.b     #1,d5
   bne      test_flags

get_bits:
   move.b   (a3)+,d5
   roxr.b   #1,d5

test_flags:
   bcc      string_copy

discrete_char:
   move.b   (a3)+,d4
   clr.l    d1
   bra      store_byte


lzss_end:

done_logo:

matthey · 10 February 2017, 17:22

Thanks guys. Yea, not much to start with. Bad research methodology and reporting. Poor programming skills. This is sad to see from a PhD in Electrical and Computer Engineering. I sent this Vince Weaver a (2nd) e-mail about this thread. In the first one, several years ago, I asked him to take down his misinformation/disinformation. Maybe he will accept some ideas here and make the 68k look good instead of bad.

Thorham · 10 February 2017, 18:05

Quote:

Originally Posted by matthey

Poor programming skills.

Probably not much 68k experience.

matthey · 10 February 2017, 18:23

Quote:

Originally Posted by Thorham

Probably not much 68k experience.

Obviously, but he could learn from compiled code. There is no excuse for his code being worse optimized than compiled code, especially from a 68k compiler.

Thorham · 10 February 2017, 18:33

Quote:

Originally Posted by matthey

Obviously, but he could learn from compiled code.

That sounds like a bad idea. Especially considering that compilers don't generate code for neatness.

Quote:

Originally Posted by matthey

There is no excuse for his code being worse optimized than compiled code, especially from a 68k compiler.

There is if he doesn't have much assembly language experience in general. I certainly remember the code I wrote in the beginning. The amount I've improved is ridiculous. Comes from hanging around here with you guys.

NorthWay · 10 February 2017, 18:35

If you could change the way the code is built it should be possible to make it smaller. But the next optimization that I believe you can do while staying within the spirit (without having seen the C source) is to increase the buffer sizes to 64K and drop the two AND opcodes in the loop.
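
To illustrate why a 64K buffer makes the ANDs redundant, here is a small Python model of a word-sized add on a 32-bit data register. The helper name `addq_w` is made up for this sketch; it only models the fact that a `.w` operation changes the low word of a data register and leaves the upper word alone.

```python
def addq_w(reg, imm):
    """Model addq.w #imm,Dn: only the low 16 bits of the register change."""
    return (reg & 0xFFFF0000) | ((reg + imm) & 0xFFFF)

# 1K ring buffer: the index has to be masked after every increment.
pos_1k = addq_w(0x3FF, 1) & 0x3FF

# 64K ring buffer: the word-sized add already wraps at the buffer size.
pos_64k = addq_w(0xFFFF, 1)
```

Both `pos_1k` and `pos_64k` end up at 0, but only the 1K case needed the extra AND.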

If you limit the compressed data size to (worst case) 2G you can convert from DBcc to subq.l/bpl.b and save one opcode in "discrete_char" by not touching d1.

Ideally "discrete_char" would just do

Code:

move.b (a3)+,(a1)+
bra _entry

and the other case would be something like this (assuming the offset is a negative two's complement number in the range -1 to -1024)

Code:

 move.w (a3)+,d6
 moveq.l #$3f,d1 ; #$3f (I think. split in 6+10 bits?)
 and.w d6,d1
 asr.l #6,d6    ; init d6 to #$ffffffff
 addq.w #3-1,d1
 lea (a1,d6.w),a2
copy
 move.b (a2)+,(a1)+
 dbra d1,copy
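
A Python sketch of what this match case would do, under the same tentative assumptions as the snippet above: a 16-bit token with the length in the low 6 bits (biased by 3) and the top 10 bits giving a negative offset from -1024 to -1. The function name `copy_match` and the error handling are made up for illustration.

```python
def copy_match(out, token):
    """Append one back-reference to the bytearray `out` (hypothetical token layout)."""
    length = (token & 0x3F) + 3          # and.w, addq.w #3-1, plus dbra's extra pass
    offset = (token >> 6) - 1024         # asr.l #6 with the upper word preset to $ffff
    src = len(out) + offset
    if src < 0:
        raise ValueError("offset points before the start of the output")
    for _ in range(length):
        out.append(out[src])             # byte by byte, so overlapping copies repeat data
        src += 1
    return out
```

With this layout, an offset of -1 and a length of 5 applied to a one-byte output repeats that byte five more times, which is the usual way LZ formats encode runs.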

meynaf · 11 February 2017, 12:03

Quote:

Originally Posted by Thorham

That sounds like a bad idea. Especially considering that compilers don't generate code for neatness.

Rewriting naive, suboptimal code can be good for learning. And compilers show the perfect example of what we shouldn't do.

Perhaps now it is time for a new exercise.
It's either choice of 1 in n, or bit shuffling, depending on the approach you choose.

You have a register containing any possible value and want to set it to 1,2,3,4,5,6,7 depending on the value's position in the list 0, -1, 1, -2, 2, -3, 3.
The input value is a longword and must be equal to one item in the list; if not, we branch to some error label (and then the return value is not important). The output value can be just a byte if that makes things easier.
The value can be in the same register at the end, or in another (in that case, original value can be modified).

Don't ask me what this code could be for, it's just an example.
020+ code is allowed - in fact, any existing cpu's code is allowed. Speed of code is irrelevant, only size matters. Data counts the same as code.
Phew. I hope I didn't forget some detail this time.
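
A reference model of the task in Python can be handy for checking candidate solutions. This sketch uses one possible "bit shuffling" approach (the zigzag mapping); that choice, the name `classify`, and returning `None` in place of the branch to the error label are assumptions of the sketch, not something specified in the exercise.

```python
MASK = 0xFFFFFFFF

def classify(reg):
    """Map 0,-1,1,-2,2,-3,3 (as a 32-bit register value) to 1..7, else None."""
    x = reg - (1 << 32) if reg & 0x80000000 else reg   # reinterpret as signed
    z = ((x << 1) ^ (x >> 31)) & MASK                  # zigzag: 0,-1,1,-2,... -> 0,1,2,3,...
    return z + 1 if z <= 6 else None                   # None stands in for the error branch
```

Because the zigzag mapping is one-to-one over all 32-bit values, a single unsigned range check on the result is enough to reject every invalid input.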

grond · 11 February 2017, 15:30

On ARM32 you could do that in 8 bytes.

robinsonb5 · 11 February 2017, 18:16

Quote:

Originally Posted by grond

On ARM32 you could do that in 8 bytes.

OK, great! Code for any existing CPU is allowed, so...?

Just for the hell of it, here's what it would look like on ZPU:

Code:

	loadsp 0 ; Assuming the operand is at the top of stack
	addsp 0
	loadsp 4
	im 2
	ashiftright
	xor
	im 1
	add

	loadsp 0
	im 0xfffffff8
	and
	im .ok
	eqbranch

	; error code here...

.ok:
	; success code here...

This would be between 13 and 16 bytes - depending on the location of the code the "im .ok" instruction could need continuation bytes.

grond · 11 February 2017, 19:49

OK, I messed up. It's 8 bytes for the inverse but 12 bytes for the correct conversion:

LSLS R0,R0,#1
CCADD R0,R0,#1
CSRSB R0,R0,#0

meynaf · 11 February 2017, 19:56

Quote:

Originally Posted by grond

OK, I messed up. It's 8 bytes for the inverse but 12 bytes for the correct conversion:

LSLS R0,R0,#1
CCADD R0,R0,#1
CSRSB R0,R0,#0

Remember, the code has to detect invalid values and branch to some error label in that case. I don't know these instructions but obviously they don't do that.

grond · 12 February 2017, 13:22

Then add a "CMP R0,#7" at the end which makes it 16 bytes. ARM32 has predication and thus doesn't have to branch.

meynaf · 12 February 2017, 17:11

Quote:

Originally Posted by grond

Then add a "CMP R0,#7" at the end which makes it 16 bytes.

Zero is an invalid value at the end (range is 1-7) and this cmp won't catch it.

Quote:

Originally Posted by grond

ARM32 has predication and thus doesn't have to branch.

Even ARM32 sometimes has to branch, e.g. when error code is very different to normal code. And anyway it was explicitly required from start.

grond · 12 February 2017, 18:52

The result can't be zero, thus, the CMP is perfectly enough. And now you are interpreting rules to favor 68k. ARM32 can have error-exit and normal exit in the same place, that's what predication is for. My code has an entire if-then-else without any branches. Ironically, ARM32 has some pretty good code-density here due to predication when history showed that predication wasn't really worth spending four bits in each instruction.

meynaf · 12 February 2017, 19:34

Quote:

Originally Posted by grond

The result can't be zero, thus, the CMP is perfectly enough.

You're wrong, the result can be zero. Try with input=$80000000 (LSLS gives R0=0 and a carry, CCADD isn't executed, CSRSB subtracts zero from zero). Yeah, I checked what these instructions do.
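
This can be checked with a small Python model of the three instructions. It is a sketch that assumes CCADD and CSRSB mean "add if carry clear" and "reverse subtract if carry set", and that neither of them updates the flags; the function name `arm_sequence` is made up.

```python
MASK = 0xFFFFFFFF

def arm_sequence(r0):
    """Model LSLS / add-if-carry-clear / reverse-subtract-if-carry-set on 32 bits."""
    carry = (r0 >> 31) & 1        # LSLS #1 moves bit 31 into the carry flag
    r0 = (r0 << 1) & MASK
    if carry == 0:
        r0 = (r0 + 1) & MASK      # CCADD R0,R0,#1
    else:
        r0 = (0 - r0) & MASK      # CSRSB R0,R0,#0
    return r0
```

For the seven valid inputs the model returns 1 to 7, while `arm_sequence(0x80000000)` returns 0, which a trailing `CMP R0,#7` alone would not flag as out of range.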

Quote:

Originally Posted by grond

And now you are interpreting rules to favor 68k.

I'm not interpreting rules to favor 68k. I was very clear at the start, so it's clearly you who is trying not to respect them. I wrote about incorrect values being rejected; you don't do it (at least not properly). I wrote about branching somewhere; you don't do it and write quibbles instead.
And now you dare to accuse me of interpreting the rules???

Quote:

Originally Posted by grond

ARM32 can have error-exit and normal exit in the same place, that's what predication is for. My code has an entire if-then-else without any branches.

So instead of generating a branch that will exit, you turn 100+ instructions into conditional ones ?
I wrote that the code must branch, so it must branch, ok ?

Quote:

Originally Posted by grond

Ironically, ARM32 has some pretty good code-density here due to predication when history showed that predication wasn't really worth spending four bits in each instruction.

Ironically, I could do it in 12 bytes on 68k. ARM32 requires twice that amount (your 3 instructions, two cmp to test out of range, one branch). So much for code density.

NorthWay's post of 10 February 2017, 18:35 was last edited by NorthWay; 02 March 2017 at 10:24. Reason: multipost rule + bugfix + better scheduled


