English Amiga Board


Old 10 February 2017, 10:26   #81
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,918
Quote:
Originally Posted by meynaf View Post
The bit-field instruction is slower than a read (especially this one, which goes in dcache).
DCache? What DCache? I thought you were investigating 020 code density, not 030...
Old 10 February 2017, 10:47   #82
meynaf
son of 68k
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by grond View Post
DCache? What DCache? I thought you were investigating 020 code density, not 030...
Where did you read that I wanted to limit this thread to the 020?
Anyway my code isn't slower, even on 020.
Old 10 February 2017, 10:58   #83
NorthWay
Registered User
 
Join Date: May 2013
Location: Grimstad / Norway
Posts: 839
I have looked a little at LZ code a few times, and I'd have to say that this is indeed a bit unconventional.

The bit ordering seems reversed from what I find natural, and having to treat the data stream as LE is close to a bug IMO.
A minimalistic and more conventional approach would be to have a subroutine that returns 1 bit of result to you from the bitstream. The subroutine detects by itself (can't remember how) when the register is empty and fetches the next byte (or you can use word/long I guess). You then have a subroutine (or inlined code) that calls this N times when you need to fetch N bits. Or if the number of bits for non-literals is exactly 16 then you can group all control bits in separate bytes as this seems to do.
And do you need to keep that A5 array? Wouldn't you reference A4 with negative offsets? (If I read the code right.)
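
One common way to get that self-detecting behaviour is a sentinel bit: when a byte is loaded, a 1 is planted above its data bits, and the register counts as empty once only that 1 is left. The Python sketch below models the idea. It is not necessarily the scheme meant above; the class name `BitReader`, the method names and the LSB-first bit order are all assumptions made for illustration.

```python
class BitReader:
    """LSB-first bit reader that uses a sentinel bit to detect an empty register."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0
        self.reg = 1  # only the sentinel is present: forces a fetch on first use

    def get_bit(self) -> int:
        bit = self.reg & 1
        self.reg >>= 1
        if self.reg == 0:
            # The bit just shifted out was the sentinel, so the register was
            # empty: fetch the next byte and plant a new sentinel in bit 7.
            byte = self.data[self.pos]
            self.pos += 1
            bit = byte & 1
            self.reg = (byte >> 1) | 0x80
        return bit

    def get_bits(self, n: int) -> int:
        """Fetch n bits by calling get_bit n times (first bit read is the LSB)."""
        value = 0
        for i in range(n):
            value |= self.get_bit() << i
        return value
```

No separate bit counter is needed: the test for "register empty" is just a test for zero after the shift.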

I think I made something LZ-like in around 50 bytes on a 6510.
Old 10 February 2017, 11:39   #84
Thorham
Computer Nerd
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,751
That doesn't look like a very good implementation.
Old 10 February 2017, 13:46   #85
meynaf
son of 68k
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Regardless of whether this guy did a good job or not, what he did is the kind of comparison I wanted to build here.
Old 10 February 2017, 15:27   #86
NorthWay
Registered User
 
Join Date: May 2013
Location: Grimstad / Norway
Posts: 839
This is more like what you would do on 68K IMO. 52 bytes inner loop.
That LE idiocy would obviously also be amended so it would be shorter still.

Code:
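_start:
   movea.l  #(lab_1810),a6   ; a6 = start of output buffer
   movea.l  a6,a1            ; a1 = output write pointer
   move.l   #$3c0,d2         ; d2 = write index into the ring buffer
   movea.l  #(lab_db),a3     ; a3 = compressed input pointer
   movea.l  #(lab_1f6),a4    ; a4 = end of compressed input
   movea.l  #(lab_3d0),a5    ; a5 = ring buffer base

   clr.l    d4               ; necessary?
   move.w   #$3ff,d3         ; d3 = ring buffer index mask (1024 entries)
   clr.l    d1               ; d1 = copy count - 1 (for dbra)
   moveq.l  #10,d7           ; d7 = shift that isolates the 6-bit length
   moveq.l  #$1,d5           ; d5 = flag register, 1 = empty (forces a fetch)
   bra      _entry

lzss_begin:

decompression_loop:


string_copy:
   move.w   (a3)+,d6         ; fetch the 16-bit match word
   ror.w    #8,d6            ; swap bytes first: the stream stores it little-endian
   move.w   d6,d1
   lsr.w    d7,d1            ; d1 = length field (upper 6 bits)
   addq.w   #3-1,d1          ; bias of 3, minus 1 for dbra

output_loop:
   and.w    d3,d6            ; wrap the read index
   move.b   (a5,d6.l),d4     ; read a byte from the ring buffer
   addq.w   #1,d6

store_byte:
   move.b   d4,(a1)+         ; write it to the output
   move.b   d4,(a5,d2.l)     ; and into the ring buffer
   addq.w   #1,d2
   and.w    d3,d2            ; wrap the write index
   dbra     d1,output_loop

_entry:
   cmpa.l   a4,a3            ; input exhausted?
   bge      done_logo

get_bit:
   lsr.b    #1,d5            ; next flag bit -> C (and X)
   bne      test_flags       ; sentinel still in d5, so bits remain

get_bits:
   move.b   (a3)+,d5         ; fetch the next flag byte
   roxr.b   #1,d5            ; first flag -> C, X=1 becomes the new sentinel in bit 7

test_flags:
   bcc      string_copy      ; flag 0 = match, flag 1 = literal

discrete_char:
   move.b   (a3)+,d4
   clr.l    d1               ; a single pass through store_byte
   bra      store_byte


lzss_end:

done_logo: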
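
For checking a routine like this against known data, a reference model in a high-level language can help. The Python sketch below is one reading of the format, not a statement of what the original program does: it assumes a 1024-byte ring buffer that starts zero-filled with the write index at $3c0, flag bits consumed LSB first (1 = literal, 0 = match), and a little-endian 16-bit match word with the position in the low 10 bits and length - 3 in the upper 6 bits. The function name and the early-exit guard are additions for illustration.

```python
def lzss_decode(src: bytes, n: int = 1024, start: int = 0x3C0,
                p_bits: int = 10, threshold: int = 2) -> bytes:
    """Reference model of an LZSS decoder that uses an n-byte ring buffer."""
    ring = bytearray(n)   # history window (assumed to start zero-filled)
    w = start             # ring buffer write index
    out = bytearray()
    i = 0                 # input index
    flags = 1             # sentinel only: forces a flag-byte fetch first
    while i < len(src):
        bit = flags & 1
        flags >>= 1
        if flags == 0:
            # Flag register empty: load a new byte, sentinel goes into bit 7.
            bit = src[i] & 1
            flags = (src[i] >> 1) | 0x80
            i += 1
            if i >= len(src):
                break     # guard against a trailing flag byte (illustrative)
        if bit:
            # Literal: copy one byte to the output and into the ring buffer.
            c = src[i]
            i += 1
            out.append(c)
            ring[w] = c
            w = (w + 1) & (n - 1)
        else:
            # Match: little-endian word, position low, length field high.
            word = src[i] | (src[i + 1] << 8)
            i += 2
            count = (word >> p_bits) + threshold + 1
            pos = word & (n - 1)
            for _ in range(count):
                c = ring[pos]
                pos = (pos + 1) & (n - 1)
                out.append(c)
                ring[w] = c
                w = (w + 1) & (n - 1)
    return bytes(out)
```

Because each copied byte is written back into the ring buffer before the next one is read, a match may overlap the bytes it is producing, which is how short runs get encoded.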

Last edited by NorthWay; 11 February 2017 at 07:04. Reason: 1 opcode less, 2 bugs added and removed
Old 10 February 2017, 17:22   #87
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Thanks guys. Yeah, not much to start with. Bad research methodology and reporting. Poor programming skills. This is sad to see from a PhD in Electrical and Computer Engineering. I sent this Vince Weaver a (2nd) e-mail about this thread. In the first one, several years ago, I asked him to take down his misinformation/disinformation. Maybe he will accept some ideas here and make the 68k look good instead of bad.
Old 10 February 2017, 18:05   #88
Thorham
Computer Nerd
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,751
Quote:
Originally Posted by matthey View Post
Poor programming skills.
Probably not much 68k experience.
Old 10 February 2017, 18:23   #89
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Quote:
Originally Posted by Thorham View Post
Probably not much 68k experience.
Obviously, but he could learn from compiled code. There is no excuse for his code being worse optimized than compiled code, especially from a 68k compiler.
Old 10 February 2017, 18:33   #90
Thorham
Computer Nerd
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,751
Quote:
Originally Posted by matthey View Post
Obviously, but he could learn from compiled code.
That sounds like a bad idea. Especially considering that compilers don't generate code for neatness.

Quote:
Originally Posted by matthey View Post
There is no excuse for his code being worse optimized than compiled code, especially from a 68k compiler.
There is if he doesn't have much assembly language experience in general. I certainly remember the code I wrote in the beginning. The amount I've improved is ridiculous. Comes from hanging around here with you guys.
Old 10 February 2017, 18:35   #91
NorthWay
Registered User
 
Join Date: May 2013
Location: Grimstad / Norway
Posts: 839
If you could change the way the code is built it should be possible to make it smaller. The next optimization that I believe you can do (without having seen the C source) while staying within the spirit is to increase the buffer sizes to 64K and drop the two AND opcodes in the loop.

If you limit the compressed data size to (worst case) 2G you can convert from DBcc to subq.l/bpl.b and save one opcode in "discrete_char" by not touching d1.

Ideally "discrete_char" would just do
Code:
move.b (a3)+,(a1)+
bra _entry
and the other case would be something like this (assuming the offset is a negative two's complement number in the range -1 to -1024)
Code:
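 move.w (a3)+,d6
 moveq.l #$3f,d1  ; #$3f (I think. split in 6+10 bits?)
 and.w d6,d1      ; d1 = length field (low 6 bits)
 asr.l #6,d6      ; d6.w = offset; d6 must be initialised to #$ffffffff beforehand
 addq.w #3-1,d1   ; bias of 3, minus 1 for dbra
 lea (a1,d6.w),a2 ; a2 = source inside the output already written
copy:
 move.b (a2)+,(a1)+
 dbra d1,copy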
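
As a cross-check of the field arithmetic in that fragment, here is a small Python model of one match token. It assumes the 6+10 split guessed at in the comment: length - 3 in the low 6 bits, and the upper 10 bits holding the offset so that, with the register pre-filled with ones, the arithmetic shift yields a value from -1024 to -1. The function name `copy_match` and the range check are hypothetical additions.

```python
def copy_match(out: bytearray, word: int) -> None:
    """Append one match to out, copying from the output already written."""
    length = (word & 0x3F) + 3      # low 6 bits, biased by 3
    offset = (word >> 6) - 1024     # upper 10 bits, sign-filled: -1024..-1
    src = len(out) + offset
    if src < 0:
        raise ValueError("offset points before the start of the output")
    for _ in range(length):
        # Byte-by-byte copy, so the source may overlap the bytes being written.
        out.append(out[src])
        src += 1
```

With no ring buffer there is nothing to mask or wrap: the window is simply the last 1024 bytes of the output itself.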

Last edited by NorthWay; 02 March 2017 at 10:24. Reason: multipost rule + bugfix + better scheduled
Old 11 February 2017, 12:03   #92
meynaf
son of 68k
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by Thorham View Post
That sounds like a bad idea. Especially considering that compilers don't generate code for neatness.
Rewriting naive, suboptimal code can be good for learning. And compilers show the perfect example of what we shouldn't do.


Perhaps now it is time for a new exercise.
It's either a choice of 1 in n, or bit shuffling, depending on the approach you choose.

You have a register containing any possible value and want to set it to 1,2,3,4,5,6,7 depending on the value's position in the list 0, -1, 1, -2, 2, -3, 3.
Input value is a longword and must be equal to one item in the list, if not, we branch to some error label (and then the return value is not important). Output value can be just a byte if it makes things easier.
The value can be in the same register at the end, or in another (in that case, original value can be modified).

Don't ask me what this code could be for, it's just an example.
020+ code is allowed - in fact, any existing cpu's code is allowed. Speed of code is irrelevant, only size matters. Data counts the same as code.
Phew. I hope I didn't forget some detail this time.
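
For anyone wanting to test a submission, the expected mapping can be written down as a reference model. Seen as bit shuffling, the list 0, -1, 1, -2, 2, -3, 3 is the start of the usual "zigzag" ordering of signed integers, so the position is the zigzag code plus one. The Python sketch below is only a checker, not an entry in the size contest; the function name and the use of `None` to stand in for the branch to the error label are assumptions.

```python
MASK32 = 0xFFFFFFFF

def position_in_list(x: int):
    """Map a 32-bit value to 1..7 for 0, -1, 1, -2, 2, -3, 3, else None."""
    x &= MASK32
    sign = MASK32 if x & 0x80000000 else 0   # same as an arithmetic shift right by 31
    code = ((x << 1) & MASK32) ^ sign        # zigzag: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ...
    result = (code + 1) & MASK32             # positions are 1-based
    return result if 1 <= result <= 7 else None
```

The lower bound matters: an input of $80000000 has zigzag code $ffffffff, which wraps to 0 after the increment and must be rejected like any other invalid value.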
Old 11 February 2017, 15:30   #93
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,918
On ARM32 you could do that in 8 bytes.
Old 11 February 2017, 18:16   #94
robinsonb5
Registered User
 
Join Date: Mar 2012
Location: Norfolk, UK
Posts: 1,153
Quote:
Originally Posted by grond View Post
On ARM32 you could do that in 8 bytes.
OK, great! Code for any existing CPU is allowed, so...?

Just for the hell of it, here's what it would look like on ZPU:
Code:
	loadsp 0 ; Assuming the operand is at the top of stack
	addsp 0
	loadsp 4
	im 2
	ashiftright
	xor
	im 1
	add

	loadsp 0
	im 0xfffffff8
	and
	im .ok
	eqbranch

	; error code here...

.ok:
	; success code here...
This would be between 13 and 16 bytes - depending on the location of the code the "im .ok" instruction could need continuation bytes.

Last edited by robinsonb5; 11 February 2017 at 19:19.
Old 11 February 2017, 19:49   #95
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,918
OK, I messed up. It's 8 bytes for the inverse but 12 bytes for the correct conversion:

LSLS R0,R0,#1
CCADD R0,R0,#1
CSRSB R0,R0,#0
Old 11 February 2017, 19:56   #96
meynaf
son of 68k
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by grond View Post
OK, I messed up. It's 8 bytes for the inverse but 12 bytes for the correct conversion:

LSLS R0,R0,#1
CCADD R0,R0,#1
CSRSB R0,R0,#0
Remember, the code has to detect invalid values and branch to some error label in that case. I don't know these instructions but obviously they don't do that.
Old 12 February 2017, 13:22   #97
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,918
Then add a "CMP R0,#7" at the end which makes it 16 bytes. ARM32 has predication and thus doesn't have to branch.
Old 12 February 2017, 17:11   #98
meynaf
son of 68k
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by grond View Post
Then add a "CMP R0,#7" at the end which makes it 16 bytes.
Zero is an invalid value at the end (range is 1-7) and this cmp won't catch it.


Quote:
Originally Posted by grond View Post
ARM32 has predication and thus doesn't have to branch.
Even ARM32 sometimes has to branch, e.g. when the error code is very different from the normal code. And anyway it was explicitly required from the start.
Old 12 February 2017, 18:52   #99
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,918
The result can't be zero, thus, the CMP is perfectly enough. And now you are interpreting rules to favor 68k. ARM32 can have error-exit and normal exit in the same place, that's what predication is for. My code has an entire if-then-else without any branches. Ironically, ARM32 has some pretty good code-density here due to predication when history showed that predication wasn't really worth spending four bits in each instruction.
Old 12 February 2017, 19:34   #100
meynaf
son of 68k
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by grond View Post
The result can't be zero, thus, the CMP is perfectly enough.
You're wrong, result can be zero. Try with input=$80000000 (LSLS gives R0=0 and a carry, CCADD isn't executed, CSRSB subtracts zero from zero). Yeah, I checked what these instructions do.
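
The trace can be replayed with a small Python model of the three-instruction sequence. It assumes the mnemonics in the thread are shorthand for a left shift that sets the carry, an add executed only when carry is clear, and a reverse subtract (0 - R0) executed only when carry is set; the function name is made up for the example.

```python
MASK32 = 0xFFFFFFFF

def grond_sequence(r0: int) -> int:
    """Model the shift / conditional add / conditional reverse subtract on 32 bits."""
    r0 &= MASK32
    carry = (r0 >> 31) & 1          # shifting left by 1 moves bit 31 into the carry
    r0 = (r0 << 1) & MASK32
    if not carry:
        r0 = (r0 + 1) & MASK32      # executed only when carry is clear
    else:
        r0 = (-r0) & MASK32         # 0 - R0, executed only when carry is set
    return r0
```

All seven valid inputs come out as 1 to 7, while $80000000 comes out as 0, which an unsigned comparison against 7 alone does not reject.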


Quote:
Originally Posted by grond View Post
And now you are interpreting rules to favor 68k.
I'm not interpreting rules to favor 68k. I was very clear at the start, so it's clearly you who is attempting not to respect them. I wrote about incorrect values being rejected, you don't do it (at least not properly). I wrote about branching somewhere, you don't do it and write quibbles instead.
And now you dare to accuse me of interpreting the rules???


Quote:
Originally Posted by grond View Post
ARM32 can have error-exit and normal exit in the same place, that's what predication is for. My code has an entire if-then-else without any branches.
So instead of generating a branch that will exit, you turn 100+ instructions into conditional ones ?
I wrote that the code must branch, so it must branch, ok ?


Quote:
Originally Posted by grond View Post
Ironically, ARM32 has some pretty good code-density here due to predication when history showed that predication wasn't really worth spending four bits in each instruction.
Ironically, i could do it in 12 bytes on 68k. ARM32 requires twice that amount (your 3 instructions, two cmp to test out of range, one branch). So much for code density.