English Amiga Board > Coders > Coders. Asm / Hardware
Old 07 February 2017, 17:42   #61
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by Thorham View Post
I meant that after the 5th byte you'd get a penalty similar to shifting more than 8 bits. The documentation says 14 cycles for < 5 bytes, and 22 cycles > 5 bytes.
That's for the number of bytes it actually has to access; it's not related to the data's position.
So you don't risk hitting the 5-byte case as long as you don't ask for a bitfield larger than 24 bits.
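The arithmetic behind that claim: the number of bytes a bitfield access touches is ceil((offset within its first byte + width) / 8). A minimal C sketch (the function name is mine, purely illustrative):

```c
#include <assert.h>

/* Bytes touched by a bitfield access: the field starts bit_offset bits
   into memory and is width bits wide, so it spans
   ceil((bit_offset % 8 + width) / 8) bytes. A field of at most 24 bits
   can never span more than 4 bytes; only wider fields can hit the
   5-byte case with its longer timing. */
static int bitfield_bytes_touched(int bit_offset, int width)
{
    return (bit_offset % 8 + width + 7) / 8;
}
```

Even at the worst alignment (bit offset 7 within a byte), a 24-bit field touches (7 + 24 + 7) / 8 = 4 bytes, while a 32-bit field can touch 5.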


Quote:
Originally Posted by grond View Post
Why would doing 16bit reads from fast and 32bit writes to chip be meaningless? I understand you are investigating code density but made the extra condition to use 32bit moves for the writes because they are to chipmem on a 32bit chipmem machine.
In my mind it was clear that both memory accesses were 32-bit accesses. I didn't state it clearly, and that's about all.


Quote:
Originally Posted by grond View Post
Your four word-size moves example violates this condition.
It was there only as an explanation, to show why 16-bit memory accesses must not be used (because then a much shorter version exists).


Quote:
Originally Posted by grond View Post
My code does not and shows better code density and possibly even better speed on some 68k. Too proud to admit this?
To admit what? That I forgot to specify something that was clear to me?
I've explained it now, so why do you insist? For the sake of contradicting me, maybe?

And no, your code cannot be faster. The word-sized movem completely kills the performance of your code: my movem.l costs about the same as 4 normal moves.
Your movem.w costs as much as 8 normal moves, but you removed far fewer than 8 instructions to compensate for it. The performance hit is smaller on the 68000, but still big enough.
Your code is also not good for 020/030 execution, because there are too many instructions between the fastmem read and the first chipmem write.


Quote:
Originally Posted by matthey View Post
Optimum would be every other instruction being sOEP although that is rarely possible. There are some instructions which can't be sOEP in the 68060 and don't even allow an sOEP instruction at the same time like MOVEM, SWAP (oversight/mistake as it could and should have been), MUL and DIV. There isn't much room to reschedule your code. This is just the nature of the EOR exchange algorithm which does more calculations.
I'm afraid that in this case, chipmem writes will take all the available time and we're gonna do it at copymem speed regardless of the scheduling...


Quote:
Originally Posted by matthey View Post
Meynaf's code is not much better but has more opportunities to reschedule. The 68060 has optimizations for MOVE.L which helps his code. I believe I found 2 instructions which can be removed from his code as follows.
Yeah, but those 2 instructions had already been removed by Thorham.
Your version isn't faster on the 68060 because chipmem is just too slow; however, it takes a performance hit on 020/030, so there it's better this way:
Code:
 move.w #1999,d0
.loop
 movem.l (a0)+,d1-d4
 move.l d1,d5
 swap d1
 move.w d3,d1
 move.l d1,(a2)+
 swap d3
 move.w d3,d5
 move.l d5,(a1)+
 move.l d2,d5
 swap d2
 move.w d4,d2
 move.l d2,(a4)+
 swap d4
 move.w d4,d5
 move.l d5,(a3)+
 dbf d0,.loop
That said, of course, if someone with the ability would finally listen to me and implement my proposed 68k extensions in some FPGA, the situation would be easier:
Code:
 move.w #1999,d0
.loop
 movem.l (a0)+,d1-d4
 swap d3
 exg.w d1,d3
 move.l d1,(a1)+
 swap d3
 move.l d3,(a2)+
 swap d4
 exg.w d2,d4
 move.l d2,(a3)+
 swap d4
 move.l d4,(a4)+
 dbf d0,.loop
 rts
However, who will...


Quote:
Originally Posted by litwr View Post
Theoretically it is nice but practically... Moto always has oddities like CLR, which looks fine but works like a slow RMW-type instruction. I am aware that with the 68020 CLR works properly. However this illustrates the common fact that Motorola was always too theoretical and forced users of their CPUs to use somewhat raw and bulky instructions.
Total nonsense. CLR is a very useful instruction (when I counted, I found that 2.8% of instructions were CLR). The 68000's CLR behaviour is just an implementation mistake, like the Pentium's FDIV bug.
Motorola wasn't theoretical. When they designed the 68000 they profiled real programs. Intel did not; they just blindly extended their 8080 to make it 16-bit and produced the start of a horror story.
Raw and bulky instructions are on the x86 side, or on other CPUs. The 68k is fine.


Quote:
Originally Posted by litwr View Post
17 ticks and 9 bytes.
17 ticks depends on the CPU implementation.
It's 8 bytes on the 68000 and 4 bytes on the 68020. I think the situation is clear.


Quote:
Originally Posted by litwr View Post
BTW. BE is a horror! it is even worse than octals.
Replace "BE" by "LE" in the above and it might become true.
Else it's just ad nauseam nonsense.


Quote:
Originally Posted by litwr View Post
Somebody confuses the external and internal representations.
Who does? Endianness only differs at the memory interface. Inside the CPU, everything is exactly the same.


Quote:
Originally Posted by litwr View Post
The same shame we have with Unicode.
There I might possibly agree...

Last edited by meynaf; 07 February 2017 at 17:45. Reason: oops
Old 07 February 2017, 17:43   #62
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Quote:
Originally Posted by litwr View Post
Thanks. Only this point has a real importance. But it is true for right shift division only, left shift division generally is faster.
This point is as irrelevant today as the only hardware advantage of Little Endian, which is add propagation. I would love to hear about any modern CPU of 16 bits or more that takes advantage of LE add propagation. Maybe there is some simple, tiny, cacheless CPU used for embedded purposes that does, but it is unlikely anything powerful or general purpose would.

Quote:
Originally Posted by litwr View Post
Just read the manuals.
[68000] ADDI.l #,Dn 16 cycles, MOVE.l #,Dn 12 cycles
The timing should be equal for LE.
Code:
ADD.L #<d32>,Rn ; 16 cycles (PMD says 14 cycles?)
MOVE.L #<d32>,Rn ; 12 cycles
-----------------------
diff is 16 - 12 = 4 cycles
Code:
ADD.L Rm,Rn ; 8 cycles
MOVE.L Rm,Rn ; 4 cycles
-----------------------
diff is 8 - 4 = 4 cycles
The registers are 32 bits so the 16 bit memory is never touched in the latter case yet the timing difference is the same. It looks to me like ADD.L is just slower than MOVE.L and there is no Big Endian penalty for code fetch. It would be easy to buffer (cache) a small amount of the instruction stream code to avoid the BE penalty with longwords (I vaguely recall the 68000 having just such a buffer). Memory data accesses are what you want to look at.

Code:
ADD.L (An),Dn ; 14 cycles
MOVE.L (An),Dn ; 12 cycles
-----------------------
diff is 14 - 12 = 2 cycles
I don't see any timings which indicate a penalty for ADD.L while accessing the 68000's 16-bit data bus. Please point out any I have overlooked.
Old 07 February 2017, 18:02   #63
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,918
Quote:
Originally Posted by meynaf View Post
In my mind it was clear that both memory accesses were 32-bit accesses. I didn't state it clearly, and that's about all. It was there only as an explanation, to show why 16-bit memory accesses must not be used (because then a much shorter version exists).
There is a technical reason why the destination should use only 32bit writes: it is in 32bit chipmem, and using 16bit writes would make the code take twice as long. However, there is no technical reason to exclude 16bit reads too. I came up with code that is shorter than your supposedly optimal code, still uses 32bit chipmem writes, but takes the liberty of using 16bit fastmem reads. Now you come up with "16bit reads are also forbidden" but can't provide any technical reason why they should be.
Quote:
To admit what? That I forgot to specify something that was clear to me?
To admit that your code was not the densest that met the technical requirements. This thread is about code density. But it says a lot that you now try to find weaknesses in my code that are totally unrelated to code density.
Quote:
And no, your code cannot be faster. The word-sized movem completely kills the performance of your code: my movem.l costs about the same as 4 normal moves. Your movem.w costs as much as 8 normal moves, but you removed far fewer than 8 instructions to compensate for it.
My movem does 8 reads but since yours does 4, I only need to compensate for 4 reads. At least the 030 with its burst reads should do each extra read in 2 cycles, I think.
Quote:
Your code is also not good for 020/030 execution, because there are too many instructions between the fastmem read and the first chipmem write.
Sure. Speed optimisation will vary vastly depending on whether you optimise for a 14 MHz 020 or a 50 MHz 030, with or without fastmem. If I had done one, you'd point out that it doesn't suit the other. You can easily rearrange the instructions to have only two instructions between the fastmem read and the first chipmem write. Again better than your code, which needs three.
Old 07 February 2017, 18:32   #64
meynaf
son of 68k
 
Quote:
Originally Posted by grond View Post
There is a technical reason why the destination should use only 32bit writes: it is in 32bit chipmem, and using 16bit writes would make the code take twice as long. However, there is no technical reason to exclude 16bit reads too. I came up with code that is shorter than your supposedly optimal code, still uses 32bit chipmem writes, but takes the liberty of using 16bit fastmem reads. Now you come up with "16bit reads are also forbidden" but can't provide any technical reason why they should be.
The technical reason is the same as for chipmem. Not twice as long, but nevertheless longer (I explain below why your code can't be faster, or even the same speed).


Quote:
Originally Posted by grond View Post
To admit that your code was not the densest that met the technical requirements. This thread is about code density. But it says a lot that you now try to find weaknesses in my code that are totally unrelated to code density.
Who decides the technical requirements of MY example? You or me?
Set up your own example and you will decide.


Quote:
Originally Posted by grond View Post
My movem does 8 reads but since yours does 4, I only need to compensate for 4 reads.
But 4 extra reads cost many clocks. Many more than you can save by removing even 4 simple instructions (which you didn't).


Quote:
Originally Posted by grond View Post
At least the 030 with its burst reads should do each extra read in 2 cycles, I think.
MOVEM takes no benefit from burst reads.
To benefit from them, we would need to schedule the code like this:
Code:
 move.l (a0)+,d1
 move.l d1,d5
 move.l (a0)+,d2
 swap d1
 move.l (a0)+,d3
 move.w d3,d1
 move.l (a0)+,d4
Then again, data burst isn't common in 68030 configurations and the 030 is largely fast enough...


Quote:
Originally Posted by grond View Post
Sure. Speed optimisation will vary vastly depending on whether you optimise for a 14 MHz 020 or a 50 MHz 030, with or without fastmem. If I had done one, you'd point out that it doesn't suit the other. You can easily rearrange the instructions to have only two instructions between the fastmem read and the first chipmem write. Again better than your code, which needs three.
Sorry, but your code is slower in all cases - and doesn't even work "as is" because it's 32000 you need to add for A5, not 2000.
Old 07 February 2017, 18:39   #65
grond
Registered User
 
OK, you obviously want to play Calvinball. I don't. Regarding the wrong immediate, I noticed immediately after posting it but didn't care to correct it because it is irrelevant (still within signed word-size integer). I was all lol when I saw you clutch this straw.
Old 07 February 2017, 18:50   #66
meynaf
son of 68k
 
Quote:
Originally Posted by grond View Post
OK, you obviously want to play Calvinball.
Nope. I just wasn't clear enough at the start. I've explained it three times now and you still don't get it.
Read again. I didn't change any rule that was previously written explicitly.


Quote:
Originally Posted by grond View Post
Regarding the wrong immediate, I noticed immediately after posting it but didn't care to correct it because it is irrelevant (still within signed word-size integer).
If you knew it beforehand, it's even worse than I imagined.


Quote:
Originally Posted by grond View Post
I was all lol when I saw you clutch this straw.
It wasn't an important remark, just a "by the way". But you obviously take things however they suit you.
Old 07 February 2017, 19:12   #67
Thorham
Computer Nerd
 
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,762
To matthey:

Thanks for explaining

Quote:
Originally Posted by matthey View Post
It is possible to produce fairly optimal code for 68020-68060. Code for the 68020-68030 can usually be instruction scheduled for the 68060 with little if any slow down (see below for my attempt).
Yes, but I won't sacrifice a single cycle on the 68020 and 68030 to make code faster on a CPU that doesn't need it. I'd much rather have two routines if optimizing for the 68060 makes sense. This is non-negotiable.

Quote:
Originally Posted by matthey View Post
This is just the nature of the EOR exchange algorithm which does more calculations.
It's a habit. Saves a register. Not needed in this case except (maybe) when you want to unroll more than once (read 2x6 long words).

Quote:
Originally Posted by litwr View Post
Theoretically it is nice but practically...
Bitfield instructions can be faster than doing it by hand. They're decent additions.

Quote:
Originally Posted by meynaf View Post
That's for the number of bytes it actually has to access, it's not related to the data's position.
Ah, NOW I get it

Quote:
Originally Posted by meynaf View Post
Total nonsense. CLR is very useful instruction
CLR is only needed for memory clears, and sadly it's slow for that. For the rest, clr isn't needed at all:

Code:
 clr.b dx -> eor.b dx,dx
 clr.w dx -> eor.w dx,dx
 clr.l dx -> eor.l dx,dx
 clr   ax -> sub.l ax,ax
 clra  ax -> sub.l ax,ax
Assemblers and disassemblers could easily have handled that. CLRA could certainly still be added.

There are more such redundancies; tst dx is one. Just do move dx,dx, which can also be disassembled properly.
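As a sanity check on those substitutions, the underlying identities can be expressed in C (value behaviour only; the condition-code effects of the asm equivalents are a separate matter):

```c
#include <assert.h>
#include <stdint.h>

/* x ^ x == 0 for any x, so eor.b/w/l dx,dx clears a data register;
   x - x == 0 for any x, so sub.l ax,ax clears an address register. */
static uint32_t clear_by_eor(uint32_t x) { return x ^ x; }
static uint32_t clear_by_sub(uint32_t x) { return x - x; }
```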
Old 07 February 2017, 19:40   #68
matthey
Banned
 
Quote:
Originally Posted by meynaf View Post
Yeah, but those 2 instructions had already been removed by Thorham.
Your version isn't faster on the 68060 because chipmem is just too slow; however, it takes a performance hit on 020/030, so there it's better this way:
Code:
 move.w #1999,d0 ; pOEP
.loop
 movem.l (a0)+,d1-d4 ; pOEP only
 move.l d1,d5 ; pOEP
 swap d1 ; pOEP only
 move.w d3,d1 ; pOEP
 move.l d1,(a2)+ ; pOEP (dependency)
 swap d3 ; pOEP only
 move.w d3,d5 ; pOEP
 move.l d5,(a1)+ ; pOEP (dependency)
 move.l d2,d5 ; sOEP
 swap d2 ; pOEP only
 move.w d4,d2 ; pOEP
 move.l d2,(a4)+ ; pOEP (dependency)
 swap d4 ; pOEP only
 move.w d4,d5 ; pOEP
 move.l d5,(a3)+ ; pOEP (dependency)
 dbf d0,.loop ; pOEP only
It was nice to get rid of the 2x unneeded MOVE.L, but this isn't very good scheduling for a 68060. Actually, Thorham's EOR exchange code is faster than this on the 68060. What did the 020/030 not like about my 68060 scheduling and chip memory? Sorry, I don't have much experience with chip memory optimization. I have RTG and I usually try to avoid chip mem altogether, as it is dreadfully slow.
Old 07 February 2017, 19:47   #69
meynaf
son of 68k
 
Quote:
Originally Posted by Thorham View Post
CLR is only needed for memory clears, and sadly it's slow for that. For the rest, clr isn't needed at all
The only mistake here is to have clr needlessly change the ccr. It would have been a lot more useful if it didn't.


Quote:
Originally Posted by matthey View Post
What did the 020/030 not like with my 68060 scheduling and chip memory?
Instructions that don't touch memory are "hidden" behind chipmem writes, i.e. they cost zero cycles.
Therefore, the more instructions you put between the read and the first chipmem write, the worse it becomes.
Old 07 February 2017, 19:52   #70
Megol
Registered User
 
 
Join Date: May 2014
Location: inside the emulator
Posts: 377
Quote:
Originally Posted by meynaf View Post
Ok, as i wish to recruit, i must at least give something to do.

Code:
; a0=source, a1-a4=dest
 move.w #1999,d0
.loop
 movem.l (a0)+,d1-d4
 move.l d1,d5
 swap d5
 move.w d3,d5
 move.l d5,(a2)+
 move.l d1,d5
 swap d3
 move.w d3,d5
 move.l d5,(a1)+
 move.l d2,d5
 swap d5
 move.w d4,d5
 move.l d5,(a4)+
 move.l d2,d5
 swap d4
 move.w d4,d5
 move.l d5,(a3)+
 dbf d0,.loop
 rts
The above code shows that x86 would just plain suck at doing this :
- it has post-increment but only for one source and one target,
- it can do same as swap but with a longer instruction,
- it can't mix 16 and 32 bit code without the use of prefixes, which will make the code even longer,
- the above code uses 6 data and 5 addr, a lot more than what the x86 in 32-bit can provide.

Some smart ass may want to prove me that x86 is better than 68k : just show me wrong with a shorter x86 version of the code here.

Or do it for ARM or whatever cpu. 6502 version must be very funny, it does not even have enough ram to merely make it work...

For those who want to know, this routine does ST to Amiga screen conversion in less than half a frame on plain A1200 (because of 32-bit chipmem accesses ; 68020 fails to provide that speed on ECS). A much shorter approach exists by doing direct 16-bit moves but it's quite a lot slower, especially because target of writes is in chipmem.

Oh, ok. As it's supposed to be 68k-only here (?), 68k people can do their optimizing attempts as well
(There is one way to grab a few clocks but it's for 020/030 only and it'd make the code longer.)
I can't be bothered to reverse engineer some uncommented code which, for all I know, is used not for anything realistic but solely to make the argument that the 68k is better. What is it for?
Old 07 February 2017, 20:01   #71
matthey
Banned
 
Quote:
Originally Posted by Megol View Post
I can't be bothered to reverse engineer some uncommented code which, for all I know, is used not for anything realistic but solely to make the argument that the 68k is better. What is it for?
Quote:
Originally Posted by meynaf View Post
For those who want to know, this routine does ST to Amiga screen conversion in less than half a frame on plain A1200 (because of 32-bit chipmem accesses ; 68020 fails to provide that speed on ECS). A much shorter approach exists by doing direct 16-bit moves but it's quite a lot slower, especially because target of writes is in chipmem.
Hmm.
Old 07 February 2017, 20:16   #72
Megol
Registered User
 
Quote:
Originally Posted by matthey View Post
PPC was a simplified version of IBM's POWER ISA. Apple, IBM and Motorola (AIM) agreed to adopt and proliferate it as the "next generation" ISA during the RISC hype days. Motorola abandoned the 68k and mostly developed PPC processors.

Big endian is more natural as the bytes are stored in sequential order as they appear. There are advantages of both.
Calling BE more natural doesn't make it so.

Quote:
Big Endian
+ used more for networking standards
How does that matter when the data itself is little endian? Network metadata is often big endian for legacy reasons, doesn't make much difference.

Quote:
+ division starts at the most significant end making it faster
Bullshit. I can't even comprehend how you could have such a crazy idea...

Quote:
+ magnitude of numbers can be determined more quickly
The above doesn't make sense at all without extra context _and_ is wrong in the general case. Yes one could make a contrived example where it is true just as one could make it for a mixed-endian architecture.

Quote:
+ natural order better for text handling
Strangely, there is no overhead for handling text in little endian processors. Most of the time text is handled in byte chunks, and where it isn't, the only problem is if the programmer can't understand little endian and/or the architecture has special support. The 68k doesn't, BTW, short of a contrived routine...

Quote:
+ more human readable hex/binaries and text in memory

Little Endian
+ more common
+ addition/subtraction starts at the least significant end allowing faster carry propagation to be used

Theoretically, any 68k CPU with a 16 bit data bus could be faster when adding/subtracting a 32 bit number in memory but memory was likely already fast enough (and the 68k slow enough) that it made little if any difference. I would love to hear anywhere you read that it made a difference and how much though.
The fact is that endianness makes no difference in practice. Yes, little endian can be faster in some cases, just as big endian is faster in others. But not by much, and not in general.
Old 07 February 2017, 20:20   #73
Megol
Registered User
 
Quote:
Originally Posted by matthey View Post
Hmm.
Hmm indeed. So now I have to know how Atari ST graphics work and how Amiga graphics work. It isn't a description of the algorithm.

While I do know both of those (I wrote a blitter-based converter doing the same task after a demo was released using the blitter like that), most people wouldn't...
Old 07 February 2017, 20:28   #74
meynaf
son of 68k
 
Quote:
Originally Posted by Megol View Post
Hmm indeed. So now I have to know how Atari ST graphics work and how Amiga graphics work. It isn't a description of the algorithm.
This has been explained already...
Quote:
Originally Posted by meynaf View Post
It is simple reordering, like this :
Code:
 move.w (a0)+,(a1)+
 move.w (a0)+,(a2)+
 move.w (a0)+,(a3)+
 move.w (a0)+,(a4)+
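In C terms, the reordering quoted above is just a round-robin distribution of consecutive 16-bit words to the four destination buffers (a sketch with made-up names; the optimized routine does the same thing with 32-bit chipmem writes):

```c
#include <assert.h>
#include <stdint.h>

/* Distribute consecutive source words to four destination planes,
   mirroring the move.w (a0)+,(a1)+ ... (a4)+ pattern quoted above. */
static void st_to_planes(const uint16_t *src, uint16_t *p1, uint16_t *p2,
                         uint16_t *p3, uint16_t *p4, int ngroups)
{
    for (int i = 0; i < ngroups; i++) {
        *p1++ = *src++;
        *p2++ = *src++;
        *p3++ = *src++;
        *p4++ = *src++;
    }
}
```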
Old 07 February 2017, 21:59   #75
matthey
Banned
 
Quote:
Originally Posted by Megol View Post
Calling BE more natural doesn't make it so.
Most humans would think that writing 0x0a0b0c0d to address 0 would give:

[0] = 0x0a
[1] = 0x0b
[2] = 0x0c
[3] = 0x0d

Most humans more naturally read left to right, so 0x0a is first, 0x0b second, 0x0c third and 0x0d fourth. This is learned, though. Maybe Little Endian would be more natural for Hebrew and Arabic writers, until they looked at a core dump.
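This layout claim is easy to check in C (a small sketch; the helper name is mine, and it runs on either kind of machine):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns the byte that ends up at the lowest address when a 32-bit
   value is stored. A big-endian machine stores the most significant
   byte (0x0a for 0x0a0b0c0d) first; a little-endian machine stores
   0x0d first. The register value itself is identical either way. */
static uint8_t first_byte_of(uint32_t v)
{
    uint8_t b[4];
    memcpy(b, &v, sizeof v);   /* only the store exposes the byte order */
    return b[0];
}
```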

Quote:
Originally Posted by Megol View Post
How does that matter when the data itself is little endian? Network metadata is often big endian for legacy reasons, doesn't make much difference.
Big Endian often gives a small saving in processing and slightly simpler code for networking. It is unlikely to be a factor except maybe on the lowest-end embedded devices with networking.

Quote:
Originally Posted by Megol View Post
Bullshit. I can't even comprehend how you could have such a crazy idea...
It is as valid as the LE argument about multi-part addition/subtraction with carry. Even the simplest and smallest embedded processors would be unlikely to take advantage of this, but they could. If you don't like it, go delete the division sentence from the wiki after calling B.S. there.

https://en.wikipedia.org/wiki/Endianness

Quote:
Originally Posted by Megol View Post
The above doesn't make sense at all without extra context _and_ is wrong in the general case. Yes one could make a contrived example where it is true just as one could make it for a mixed-endian architecture.
With some processors, it is less costly to examine what a pointer is pointing directly to and less costly to examine a smaller amount of data. If you don't like it, you can go petition for the wiki above to be changed some more.

Quote:
Originally Posted by Megol View Post
Strangely, there is no overhead for handling text in little endian processors. Most of the time text is handled in byte chunks, and where it isn't, the only problem is if the programmer can't understand little endian and/or the architecture has special support. The 68k doesn't, BTW, short of a contrived routine...
Sometimes text can be processed more quickly using a size other than a byte. The RISC-V ISA creator and author of the following article stated the following.

Quote:
Originally Posted by Andrew Waterman
The choice of memory system endianness is somewhat arbitrary. Some computations, such as IP packet processing and C string manipulation, favor big-endianness. We chose little-endianness because it is currently dominant in general-purpose computing: x86 is little-endian, and, while ARM is bi-endian, little-endian software is more common. While portable software should never rely on memory system endianness, much software, in practice, does. Adopting the most popular choice reduces the effort to port low-level software to RISC-V.
https://people.eecs.berkeley.edu/~kr...ECS-2016-1.pdf

He chose LE for RISC-V because of its popularity while observing that "string manipulation" could have advantages with BE. Maybe he would have some examples for you but I wouldn't accuse him of being biased toward BE.

Quote:
Originally Posted by Megol View Post
The fact is that endianness makes no difference in practice. Yes, little endian can be faster in some cases, just as big endian is faster in others. But not by much, and not in general.
I agree that there is no major technical advantage to choosing either endianness. The biggest issues with endianness are supporting legacy software which is not endian aware (a major problem for the Amiga as PPC joins the 68k in undeath) and the increased cost of endian-aware software. Hardware endian swapping (conversion) support can reduce the latter cost but usually cannot eliminate it. The RISC-V ISA did not include endian swapping, as we discussed toward the end of the following thread.

http://eab.abime.net/showthread.php?t=85525&page=3
Old 08 February 2017, 10:08   #76
meynaf
son of 68k
 
Little endian is an electronics engineers' shortcut taken in the '70s, if not before: old, archaic legacy.
And like many such shortcuts, its benefits were short-lived.

It made sense in, e.g., the 6502. If you did something like LDA $aabb,Y, it could start adding Y to $bb before it had read $aa, so $bb had to be stored first. Had it been able to do 16-bit accesses, though, there would have been zero benefit to LE.

Nowadays things are of course no longer done this way. A 32-bit add might even be split into several 16-bit adds done in parallel, with the top half computed for both carry and no carry, the final result being selected once the actual carry is known.
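That carry-select idea can be sketched in C (an illustration of the hardware trick, not a hardware description):

```c
#include <assert.h>
#include <stdint.h>

/* Carry-select addition: both 16-bit halves are added "in parallel"; the
   high half is computed twice, once assuming carry-in 0 and once assuming
   carry-in 1, and the real carry out of the low half selects the result.
   Byte order in memory is irrelevant to this trick. */
static uint32_t add32_carry_select(uint32_t a, uint32_t b)
{
    uint32_t lo  = (a & 0xffffu) + (b & 0xffffu); /* low half, yields carry */
    uint32_t hi0 = (a >> 16) + (b >> 16);         /* high half, carry = 0 */
    uint32_t hi1 = hi0 + 1;                       /* high half, carry = 1 */
    uint32_t hi  = (lo >> 16) ? hi1 : hi0;        /* select on actual carry */
    return (hi << 16) | (lo & 0xffffu);
}
```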
Old 10 February 2017, 03:10   #77
matthey
Banned
 
I talked about the amateur (somehow with a PhD in Electrical and Computer Engineering) posting his baby steps and grossly misrepresenting the code density of CPU architectures.

http://www.deater.net/weave/vmwprod/asm/ll/

The following is his 68k code for LZSS decompression.

Code:
| offsets into the results returned by the uname syscall
.equ U_SYSNAME,0
.equ U_NODENAME,65
.equ U_RELEASE,65*2
.equ U_VERSION,(65*3)
.equ U_MACHINE,(65*4)
.equ U_DOMAINNAME,65*5

| offset into the results returned by the sysinfo syscall
.equ S_TOTALRAM,16

| Syscalls
.equ SYSCALL_EXIT,	1
.equ SYSCALL_READ,	3
.equ SYSCALL_WRITE,	4
.equ SYSCALL_OPEN,	5
.equ SYSCALL_CLOSE,	6
.equ SYSCALL_SYSINFO,	116
.equ SYSCALL_UNAME,	122

|
.equ STDIN,0
.equ STDOUT,1
.equ STDERR,2

	.globl _start	
_start:


	|=========================
	| PRINT LOGO
	|=========================

| LZSS decompression algorithm implementation
| by Stephan Walter 2002, based on LZSS.C by Haruhiko Okumura 1989
| optimized some more by Vince Weaver

	move.l	#out_buffer,%a6		| buffer we are printing to
	move.l	%a6,%a1

	move.l  #(N-F),%d2		| R

	move.l	#(logo),%a3		| a3 points to logo data
	move.l	#(logo_end),%a4		| a4 points to logo end
	move.l	#text_buf,%a5		| r5 points to text buf
	

decompression_loop:
        clr.l	%d5			| clear the %d5 register
	move.b	%a3@+,%d5		| load a byte, increment pointer

	or.w	#0xff00,%d5		| load top as a hackish 8-bit counter

test_flags:
	cmp.l	%a4,%a3		| have we reached the end?
	bge	done_logo  	| if so, exit

	lsr 	#1,%d5		| shift bottom bit into carry flag
	bcs	discrete_char	| if set, we jump to discrete char

offset_length:
	clr.l   %d4
	move.b	%a3@+,%d0	| load 16-bits, increment pointer	
	move.b	%a3@+,%d4	| do it in 2 steps because our data is little-endian :(
	lsl.l	#8,%d4
	move.b	%d0,%d4

	move.l	%d4,%d6		| copy d4 to d6
				| no need to mask d6, as we do it
				| by default in output_loop

	moveq.l	#P_BITS,%d0
	lsr.l	%d0,%d4
	move.l	#(THRESHOLD+1),%d0
	add.l	%d0,%d4
	add	%d4,%d1
				| d1 = (d4 >> P_BITS) + THRESHOLD + 1
				|                       (=match_length)

output_loop:
#   	andi	#((POSITION_MASK<<8)+0xff),%d6		| mask it
   	andi	#0x3ff,%d6		| mask it

	move.b 	%a5@(0,%d6),%d4		| load byte from text_buf[]
	addq	#1,%d6			| advance pointer in text_buf

store_byte:

	move.b	%d4,%a1@+		| store a byte, increment pointer
	move.b	%d4,%a5@(0,%d2)		| store a byte to text_buf[r]
	add 	#1,%d2			| r++
	andi	#(N-1),%d2		| mask r

	dbf	%d1,output_loop		| decrement count and loop
					| if %d1 is zero or above

	bftst	%d5,16:8		| are the top bits 0?
	bne	test_flags		| if not, re-load flags

	jmp	decompression_loop

discrete_char:

	move.b	%a3@+,%d4		| load a byte, increment pointer
	clr.l	%d1			| we set d1 to zero which on m68k
					| means do the loop once

	jmp	store_byte		| and store it


| end of LZSS code

done_logo:
        ...
	rts

#===========================================================================
#	section .data
#===========================================================================
.data
data_begin:
ver_string:	.ascii	" Version \0"
compiled_string:	.ascii	", Compiled \0"
one:	.ascii	"One \0"
processor:	.ascii	" Processor, \0"
ram_comma:	.ascii	"M RAM, \0"
bogo_total:	.ascii	" Bogomips Total\n\0"

default_colors:	.ascii "\033[0m\n\n\0"
escape:		.ascii "\033[\0"
C:		.ascii "C\0"

.ifdef FAKE_PROC
cpuinfo:	.ascii	"proc/cpu.m68k\0"
.else
cpuinfo:	.ascii	"/proc/cpuinfo\0"
.endif

.include	"logo.lzss_new"


#============================================================================
#	section .bss
#============================================================================
.bss
bss_begin:
.lcomm uname_info,(65*6)
.lcomm sysinfo_buff,(64)
.lcomm ascii_buffer,10
.lcomm  text_buf, (N+F-1)
#.lcomm  text_buf, 4096

.lcomm	disk_buffer,4096	| we cheat!!!!
.lcomm	out_buffer,16384
The 68k (68020 ISA) came in at 88 bytes for this lzss decompression loop. It looks like he was only counting the code between the decompression_loop label and the done_logo label. Let me translate that to Motorola syntax.

Code:
_start:
   movea.l  #(lab_1810),a6              ; 0 : 2c7c 0000 1810
   movea.l  a6,a1                       ; 6 : 224e
   move.l   #$3c0,d2                    ; 8 : 243c 0000 03c0
   movea.l  #(lab_db),a3                ; e : 267c 0000 00db
   movea.l  #(lab_1f6),a4               ; 14 : 287c 0000 01f6
   movea.l  #(lab_3d0),a5               ; 1a : 2a7c 0000 03d0

lzss_begin:

decompression_loop:
   clr.l    d5                          ; 20 : 4285
   move.b   (a3)+,d5                    ; 22 : 1a1b
   ori.w    #$ff00,d5                   ; 24 : 0045 ff00
test_flags:
   cmpa.l   a4,a3                       ; 28 : b7cc
   bge      done_logo                   ; 2a : 6c00 0050
   lsr.w    #1,d5                       ; 2e : e24d
   bcs      discrete_char               ; 30 : 6500 0042
offset_length:
   clr.l    d4                          ; 34 : 4284
   move.b   (a3)+,d0                    ; 36 : 101b
   move.b   (a3)+,d4                    ; 38 : 181b
   lsl.l    #8,d4                       ; 3a : e18c
   move.b   d0,d4                       ; 3c : 1800
   move.l   d4,d6                       ; 3e : 2c04
   moveq    #$a,d0                      ; 40 : 700a
   lsr.l    d0,d4                       ; 42 : e0ac
   move.l   #3,d0                       ; 44 : 203c 0000 0003
   add.l    d0,d4                       ; 4a : d880
   add.w    d4,d1                       ; 4c : d244
output_loop:
   andi.w   #$3ff,d6                    ; 4e : 0246 03ff
   move.b   (a5,d6.l),d4                ; 52 : 1835 6800
   addq.w   #1,d6                       ; 56 : 5246
store_byte:
   move.b   d4,(a1)+                    ; 58 : 12c4
   move.b   d4,(a5,d2.l)                ; 5a : 1b84 2800
   addq.w   #1,d2                       ; 5e : 5242
   andi.w   #$3ff,d2                    ; 60 : 0242 03ff
   dbra     d1,output_loop              ; 64 : 51c9 ffe8
   bftst    d5{$10:8}                   ; 68 : e8c5 0408
   bne      test_flags                  ; 6c : 6600 ffba
   jmp      (decompression_loop,pc)     ; 70 : 4efa ffae

discrete_char:
   move.b   (a3)+,d4                    ; 74 : 181b
   clr.l    d1                          ; 76 : 4281
   jmp      (store_byte,pc)             ; 78 : 4efa ffde

lzss_end:

done_logo:

Perhaps this attempt deserves a D- for effort, but I have to go with an F for doing education and research a disservice by posting meaningless results (FUD) as though they mean something. Files needed to assemble and reproduce his results were also missing (I was not able to run the program for testing or obtain the total size, which supposedly requires Linux). A basic quick cleanup would give us something like the following.

Code:
_start:
   movea.l  #(lab_1810),a6
   movea.l  a6,a1
   move.l   #$3c0,d2
   movea.l  #(lab_db),a3
   movea.l  #(lab_1f6),a4
   movea.l  #(lab_3d0),a5

   moveq    #10,d7
   move.w   #$3ff,d3
   move.l   #$ff00,d0

lzss_begin:

decompression_loop:
   move.l   d0,d5
   move.b   (a3)+,d5
test_flags:
   cmpa.l   a4,a3
   bge      done_logo
   lsr.w    #1,d5
   bcs      discrete_char
   clr.l    d4		; necessary?
   move.w   (a3)+,d4
   ror.w    #8,d4	; LE->BE
   move.l   d4,d6
   lsr.l   d7,d4
   addq.l   #3,d4
   add.w    d4,d1
output_loop:
   and.w    d3,d6
   move.b   (a5,d6.l),d4
   addq.w   #1,d6
store_byte:
   move.b   d4,(a1)+
   move.b   d4,(a5,d2.l)
   addq.w   #1,d2
   and.w    d3,d2
   dbra     d1,output_loop
   bftst    d5{$10:8}
   bne      test_flags
   bra      decompression_loop

discrete_char:
   move.b   (a3)+,d4
   clr.l    d1
   bra      store_byte

lzss_end:

done_logo:
This gives 62 bytes for the LZSS decompression loop, taking the 68020 from the middle of the code density pack to 3rd best. If the data were in BE format (is it fair to convert static data to the native format?), then subtract another 2 bytes, making the 68020 2nd best. It looks like a (boolean) flag is used in the 2nd byte of d5 (the comment is "load top as a hackish 8-bit counter"); if it really is just a flag, a test could be used instead, saving another 2 bytes, or it could be moved to another register, of which the 68k has plenty here. These are basic optimizations that do not alter the structure or algorithm, where there are many more opportunities for optimization.
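To spell out how the "hackish 8-bit counter" works: OR-ing $ff00 above the eight flag bits means each lsr.w #1 drops a flag into the carry, and after exactly eight shifts the sentinel has drained out of bits 15..8, so the bftst falls through to reload the next control byte; no separate bit counter is needed. A Python model of just that trick (illustrative only, not from the original source):

```python
def flag_bits(data):
    """Yield flag bits LSB-first, 8 per control byte, using the
    $ff00 sentinel from the asm instead of a separate bit counter."""
    i = 0
    while i < len(data):
        d5 = 0xff00 | data[i]        # ori.w #$ff00,d5
        i += 1
        while (d5 >> 8) & 0xff:      # bftst d5{16:8} / bne test_flags
            yield d5 & 1             # carry out of lsr.w #1,d5
            d5 >>= 1                 # shift sentinel down one bit
```

After the eighth shift the register holds $00ff, bits 15..8 are zero, and the inner loop exits, matching the bne test_flags / fall-through structure above.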

Last edited by matthey; 11 March 2017 at 22:39.
matthey is offline  
Old 10 February 2017, 09:22   #78
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Oddly enough, the source code for the winner doesn't appear to be available on his page. I'd like to see that with the encoding.

Indeed, using inefficient code isn't the proper way to compare code densities, but this is the problem with people coming from other architectures. They didn't have the relevant tools before and code as they always have - i.e. they don't use the 68k's good features and merely translate the code.
Anyway, if he had used a decent assembler, peephole optimization alone would already have made the size fall by a fair amount...

Note: you can put the clr.l d4 outside of the loop and get down to 62.
So the 68k should appear in second place.

EDIT:
Down to 60. You can replace :
Code:
   move.w   (a3)+,d4
   ror.w    #8,d4	; LE->BE
   move.l   d4,d6
   bfextu   d4{0:22},d4	; same as lsr.l #10,d4
By :
Code:
 move.b (a3),d4
 move.w (a3)+,d6
 ror.w #8,d6
 lsr.l #2,d4
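For anyone not fluent in the bitfield instructions: on a data register, bfextu d4{0:22},d4 takes 22 bits starting at offset 0 from bit 31, i.e. the value shifted right by 10, which a plain lsr.l cannot do in one instruction with an immediate (immediate shift counts only go up to 8). A quick Python model of the register-operand case (my own helper, assuming offset + width <= 32):

```python
def bfextu(value, offset, width):
    """68020 bfextu on a 32-bit data register: extract `width` bits
    starting `offset` bits down from bit 31 (assumes offset+width <= 32)."""
    value &= 0xFFFFFFFF                 # data registers are 32 bits wide
    return (value >> (32 - offset - width)) & ((1 << width) - 1)
```

For example, bfextu(x, 0, 22) equals x >> 10 for any 32-bit x, which is why the instruction can stand in for the shift here.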

Last edited by meynaf; 10 February 2017 at 09:29. Reason: -2 bytes
meynaf is offline  
Old 10 February 2017, 10:19   #79
grond
Registered User
 
Join Date: Jun 2015
Location: Germany
Posts: 1,918
Quote:
Originally Posted by meynaf View Post
Code:
 move.b (a3),d4
 move.w (a3)+,d6
 ror.w #8,d6
 lsr.l #2,d4
But now you have doubled the number of reads, making the code slower. Thus it has become meaningless. Or have you changed the rules of the code density contest again?
grond is offline  
Old 10 February 2017, 10:22   #80
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by grond View Post
But now you have doubled the number of reads making the code slower. Thus, it has become meaningless. Or have you changed the rules of the code density contest again?
You are wrong again. The bit-field instruction is slower than a read (especially this one, which hits the dcache). And please stop trolling - the rules didn't change.
meynaf is offline  
 

