English Amiga Board


English Amiga Board > Coders > Coders. Asm / Hardware

 
 
Old 17 February 2017, 15:12   #121
buggs
Registered User
 
Join Date: May 2016
Location: Rostock/Germany
Posts: 132
Quote:
Originally Posted by meynaf View Post
Ok, as I wish to recruit, I must at least give something to do.
Code:
; a0=source, a1-a4=dest
; word k of the source stream goes to destination (k mod 4): a1,a2,a3,a4
 move.w #1999,d0
.loop
 movem.l (a0)+,d1-d4   ; d1-d4 = words w0:w1 w2:w3 w4:w5 w6:w7
 move.l d1,d5
 swap d5
 move.w d3,d5
 move.l d5,(a2)+       ; w1:w5
 move.l d1,d5
 swap d3
 move.w d3,d5
 move.l d5,(a1)+       ; w0:w4
 move.l d2,d5
 swap d5
 move.w d4,d5
 move.l d5,(a4)+       ; w3:w7
 move.l d2,d5
 swap d4
 move.w d4,d5
 move.l d5,(a3)+       ; w2:w6
 dbf d0,.loop
 rts
<snip> Or do it for ARM or whatever cpu. <snip>
With VASM out, I figured it might be time to post something for "whatever CPU".
Code:
 move #999,d0
.loop
 load (a0)+,E0          ;a0 b0 c0 d0 (.w)
 load (a0)+,E1          ;a1 b1 c1 d1
 load (a0)+,E2          ;a2 b2 c2 d2
 load (a0)+,E3          ;a3 b3 c3 d3
 transhi E0-E3,E4:E5    ;E4: a0 a1 a2 a3 E5: b0 b1 b2 b3
 translo E0-E3,E6:E7    ;E6: c0 c1 c2 c3 E7: d0 d1 d2 d3
                     ;TRANS has latency, 1 cyc lost in this example
 store E4,(a1)+         ;
 store E5,(a2)+         ;
 store E6,(a3)+         ;
 store E7,(a4)+         ;inner loop assembles to 10 * 32 Bit
 dbf   d0,.loop          ;plus move, dbf = 12 * 32 Bit
Code as shown will process 32 bytes per run in 11 cycles. Obviously, it won't be of much use in the original scenario as data keeps piling up in the write buffers (as long as A1-A4 are in Chip). But it'll perform quite nicely when A1-A4 point to a fast memory location.
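What the 68k loop quoted above computes is a 4-way word de-interleave: 16-bit word k of the source stream ends up in destination stream k mod 4 (a1..a4). A minimal C model of that data movement (the function name is mine, not from the thread):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* De-interleave a stream of 16-bit words into four destination streams:
   source word k goes to destination k % 4, as in the 68k routine above. */
static void deinterleave4(const uint16_t *src, size_t nwords,
                          uint16_t *d1, uint16_t *d2,
                          uint16_t *d3, uint16_t *d4)
{
    uint16_t *dst[4] = { d1, d2, d3, d4 };
    for (size_t k = 0; k < nwords; k++)
        *dst[k % 4]++ = src[k];
}
```

The 68k version does this 8 words (16 bytes) at a time, using movem.l for the load and swap/move.w to reassemble two words per output longword.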
buggs is offline  
Old 17 February 2017, 15:29   #122
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Unfortunately for you this thread is about code density. And your example shows no benefit at all in this area.

Last edited by prowler; 01 March 2017 at 22:11. Reason: Cleanup.
meynaf is offline  
Old 18 February 2017, 13:40   #123
DamienD
Banned
 
 
Join Date: Aug 2005
Location: London / Sydney
Age: 47
Posts: 20,420
Why do all these threads keep going downhill; it's usually the same people arguing???

I really don't have time to be reading through 7 pages of bickering between you guys... I need to prepare for a new job over the weekend so...

Closed for now until another GM has time to review.
DamienD is offline  
Old 01 March 2017, 22:16   #124
prowler
Global Moderator
 
 
Join Date: Aug 2008
Location: Sidcup, England
Posts: 10,300
Quote:
Originally Posted by DamienD View Post
Closed for now until another GM has time to review.
Done, thanks Damien!

Thread reopened. Now, let's try again, shall we, guys?

Last edited by prowler; 03 March 2017 at 22:24. Reason: typo.
prowler is offline  
Old 03 March 2017, 13:21   #125
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Ok, let's try again.

I have a new case to submit. It's a complicated one - to be done, of course, in a minimal amount of code.
This is a real life case, but discussing details would probably lead to endless OT.

Here is pseudo-code, as I guess a plain explanation wouldn't be clear enough:
Code:
flag=0
start:
struct = data           ; some rel(pc) array of structs with sizeof =8
loop:
x = struct[5] >>4       ; -- -- -- -- -- x-
v1=table[x*2]           ; attn: table too far for d8(pc,ix) but ok for d16(pc)
v2=table[x*2 +1]
if flag and v2>=0 goto skip  ; v2>=0 is bit #7 test
cc = (v2<0)                  ; set condition (passed to call func via some reg)
if flag then cc=0
var = (v1&128) + (v2&15)     ; but bits 6,5,4 of var are "don't care"
call func (struct,cc,var)
skip:                   ; value in var is unimportant if we skip
struct = struct +8      ; next item (sizeof =8)
if struct[0]<>0 or struct[1]<>0 or struct[2]<>0 or struct[3]<>0 goto loop
if flag goto error
flag = 1
goto start
error:
The function called here (actually unrelated inline code) will sometimes not return, so we don't necessarily end up at the error label.

Note : pseudo-code is of course not optimal.
Use as few registers as you can.
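For readers who prefer C to pseudo-code, here is a direct, unoptimized transliteration of the control flow above (all names are mine; func is a stub standing in for the unrelated inline code, and the error path is simply the loop exit):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct item { uint8_t b[8]; };          /* sizeof == 8, as in the spec */

static int calls;                       /* instrumentation for testing */
static void func(const struct item *s, int cc, unsigned var)
{
    (void)s; (void)cc; (void)var;       /* real code would act here */
    calls++;
}

static void run(const struct item *data, const int8_t *table)
{
    for (int flag = 0; ; flag = 1) {    /* two passes: flag = 0, then 1 */
        const struct item *s = data;
        do {
            unsigned x = s->b[5] >> 4;                /* -- -- -- -- -- x- */
            int8_t v1 = table[2*x], v2 = table[2*x + 1];
            if (!(flag && v2 >= 0)) {                 /* else: skip */
                int cc = flag ? 0 : (v2 < 0);
                unsigned var = (v1 & 128) + (v2 & 15);
                func(s, cc, var);
            }
            s++;                                      /* next item */
        } while (s->b[0] | s->b[1] | s->b[2] | s->b[3]);
        if (flag)
            break;                                    /* error: */
    }
}
```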

Last edited by meynaf; 03 March 2017 at 13:36. Reason: fixed two mistakes in pseudo code
meynaf is offline  
Old 03 May 2017, 10:22   #126
litwr
Registered User
 
Join Date: Mar 2016
Location: Ozherele
Posts: 229
Sorry for the delay - I was a bit sick.

Quote:
Originally Posted by meynaf View Post
Total nonsense. CLR is a very useful instruction (when I counted, I found that 2.8% of instructions were CLR). The 68000 bug is just an implementation mistake, like the Pentium's FDIV bug.
CLR is useful but slow. We all like more speed, don't we? I suspect the same thing about BSET: it is indeed useful, but it may be too slow. So it is good in theory, but in practice Moto used people as beta-testers of their raw products. The FDIV bug required a good scientist to reveal it, but CLR braked almost every 68000/68010 program.
I can also mention the very expensive PDP/VAX-11 ISA, which Moto used as a pattern. It is interesting that the "mighty" VAX-11/730 can be outperformed by a 6502 @4MHz! Look at the pi-spigot results for proof. Moto came too close to this madness.

Quote:
Originally Posted by meynaf View Post
17 ticks, depends on cpu implementation.
8 bytes for 68000, 4 bytes for 68020. I think the situation is clear.
And where are the exact tick counts for the 680x0? I suspect it will be more than 17 for the 68000/68010/68008.

Quote:
Originally Posted by meynaf View Post
Replace "BE" by "LE" in the above and it might become true.
Endianness is only different in the memory interface. For inside the cpu, everything is exactly the same.
My point is that the user can't feel any difference between BE and LE, but LE is faster for multi-word operations like additions and subtractions. So what is BE for? Do people like slow code?! Granted, modern CPUs don't need multi-word additions or subtractions. But, I repeat, Moto forced people to use slower code for purely abstract theoretical reasons.
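litwr's speed argument is the classic one about limb order: multi-word addition has to start at the least-significant word, because that is where the carry originates. A small C sketch of the idea (limb 0 is the least-significant word, i.e. little-endian storage order; the function name is mine). Whether this layout detail costs anything on a given CPU is exactly what the following posts dispute:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multi-word addition: the carry propagates upward from limb 0, so a
   layout storing the least-significant word first can be streamed in
   increasing address order. */
static void add_limbs(const uint16_t *a, const uint16_t *b,
                      uint16_t *sum, size_t n)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t t = (uint32_t)a[i] + b[i] + carry;
        sum[i] = (uint16_t)t;          /* low 16 bits of the partial sum */
        carry = t >> 16;               /* carry into the next limb */
    }
}
```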

Quote:
Originally Posted by meynaf View Post
The registers are 32 bits so the 16 bit memory is never touched in the latter case yet the timing difference is the same. It looks to me like ADD.L is just slower than MOVE.L and there is no Big Endian penalty for code fetch. It would be easy to buffer (cache) a small amount of the instruction stream code to avoid the BE penalty with longwords (I vaguely recall the 68000 having just such a buffer). Memory data accesses are what you want to look at.
It may be a different point: the entire 68000 architecture treats BE as the top priority, and they sacrificed some clocks for it everywhere. BE objectively creates a delay for multi-byte operations; all the other reasons are just speculative.

Quote:
Originally Posted by meynaf View Post
Ok, let's try again.
I have a new case to submit. It's a complicated one - to be done, of course, in a minimal amount of code.
It is too big to be a sport event.
litwr is offline  
Old 03 May 2017, 20:32   #127
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
Originally Posted by litwr View Post
CLR is useful but slow. We all like more speed, don't we? I suspect the same thing about BSET: it is indeed useful, but it may be too slow. So it is good in theory, but in practice Moto used people as beta-testers of their raw products.
CLR isn't slow. It's a single instruction doing the job of 2. Is STZ on 65c02 slow ?
BSET isn't slow. And you can't criticize Moto for this one - x86 has it since 386.
Starting with 68040 both are just 1 clock.


Quote:
Originally Posted by litwr View Post
The FDIV bug required a good scientist to reveal it, but CLR braked almost every 68000/68010 program.
Wrong. CLR did not break even a single program on the ST, and on Amiga it makes a difference only for write-only hardware registers (the mistake here - there is one, yes - isn't in the cpu but in these hw regs that can't be read).
x86 is full of bugs (like the ol' opcode F0 0F C7 C8 that simply hung the first Pentiums, like a $02 on the 6502). The 80286 was so buggy that it was impossible to get out of protected mode...


Quote:
Originally Posted by litwr View Post
I can also mention the very expensive PDP/VAX-11 ISA, which Moto used as a pattern. It is interesting that the "mighty" VAX-11/730 can be outperformed by a 6502 @4MHz! Look at the pi-spigot results for proof. Moto came too close to this madness.
Unlike VAX which took just about everything and more, Moto did statistical analysis on existing programs and removed what was not useful.


Quote:
Originally Posted by litwr View Post
And where are the exact tick counts for the 680x0? I suspect it will be more than 17 for the 68000/68010/68008.
I don't know. I don't care. The number of ticks depends on the implementation; you can't judge an instruction set by tick counts.


Quote:
Originally Posted by litwr View Post
My point is that the user can't feel any difference between BE and LE, but LE is faster for multi-word operations like additions and subtractions.
LE isn't faster on anything but an 8-bit CPU!
And a user looking at data files will find them very hard to read if they're in LE. I've been through that too many times.
Now say you want to check some file contents. Like a WAV. Even though the format is LE, it will still cause trouble when you compare e.g. "RIFF" (52494646). On 68k it's a simple cmpi.l #"RIFF",(a0)+. On x86 "RIFF" may well translate to 46464952 and then you get wrong code.
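meynaf's "RIFF" example in C terms: the four header bytes read as one 32-bit integer give 0x52494646 on a big-endian CPU but 0x46464952 on a little-endian one, so the naive integer compare is endian-dependent while a byte compare is not (function names are mine):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Endian-dependent: reads the 4 header bytes as a native 32-bit integer. */
static int is_riff_int(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, 4);
    return v == 0x52494646u;   /* 'R','I','F','F' - true only on big-endian */
}

/* Portable: compares bytes, independent of host byte order. */
static int is_riff_portable(const uint8_t *p)
{
    return memcmp(p, "RIFF", 4) == 0;
}
```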


Quote:
Originally Posted by litwr View Post
So what is BE for? Do people like slow code?! Granted, modern CPUs don't need multi-word additions or subtractions. But, I repeat, Moto forced people to use slower code for purely abstract theoretical reasons.
BE is for a clean architecture, a concept apparently alien to many people.


Quote:
Originally Posted by litwr View Post
It may be a different point: the entire 68000 architecture treats BE as the top priority, and they sacrificed some clocks for it everywhere. BE objectively creates a delay for multi-byte operations; all the other reasons are just speculative.
Wake up, it's not 1980 anymore. The 6800 was maybe slowed down by this - not the 68000.


Quote:
Originally Posted by litwr View Post
It is too big to be a sport event.
Big asm code shouldn't be a problem with a decent instruction set.
And this is the problem with x86's so-called good code density: it's good only for very small code - as soon as the code gets larger, it starts to suck. HOMM2 on x86 is 1.5MB of code; on 68k it's 0.9MB (even though the compiler did a really poor job - it could have been half that size).
For your "sport event", it needs to be big enough to put some pressure on the register file. What would a c2p look like on x86, for example?
meynaf is offline  
Old 05 May 2017, 09:17   #128
litwr
Registered User
 
Join Date: Mar 2016
Location: Ozherele
Posts: 229
Quote:
Originally Posted by meynaf View Post
CLR isn't slow. It's a single instruction doing the job of 2. Is STZ on 65c02 slow ?
Using RMW cycles for CLR - is that OK?! It should have been just one write.

Quote:
Originally Posted by meynaf View Post
Wrong. CLR did not break even a single program on the ST, and on Amiga
I meant brake, not break.

Quote:
Originally Posted by meynaf View Post
On x86 "RIFF" may well translate to 46464952 and then you get wrong code.
I agree some assemblers for LE may have this problem, but it is just representation. There is no problem writing "RIFF" with a properly configured x86 assembler. The machine-language level is the level of the legendary coders of the 50s...

Last edited by litwr; 05 May 2017 at 19:44.
litwr is offline  
Old 05 May 2017, 20:56   #129
meynaf
son of 68k
 
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
Originally Posted by litwr View Post
Using RMW cycles for CLR - is that OK?! It should have been just one write.
Of course it should have been just a write, but this is only an implementation problem.
I'm not saying the implementation of 68k is good. I'm just saying the instruction set is good (at least, good enough for asm use).
And anyway this was fixed in 68020 (or 68010 ? I don't remember).


Quote:
Originally Posted by litwr View Post
I meant brake, not break.
Damned english
Again, it's just an old 68000 problem.


Quote:
Originally Posted by litwr View Post
I agree some assemblers for LE may have this problem, but it is just representation. There is no problem writing "RIFF" with a properly configured x86 assembler. The machine-language level is the level of the legendary coders of the 50s...
This is just representation, of course it is. The whole thing IS a matter of representation! On BE it's obvious; on LE you may get bad surprises. This has nothing to do with writing machine language directly.

Now... perhaps I should remind you that this is a thread about code density.
So when will you write some code?
meynaf is offline  
Old 19 May 2017, 23:17   #130
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Dr. Vince Weaver finally updated his code density web site and documentation using some improvements suggested in this thread.

http://www.deater.net/weave/vmwprod/asm/ll/

The 68k has the best code density for the LZSS decoder and 2nd best for total size. Thanks to Vince, to all who contributed code, and to the 68k designers for one of the greatest CPU architectures of all time.
matthey is offline  
Old 24 May 2017, 13:55   #131
Megol
Registered User
 
 
Join Date: May 2014
Location: inside the emulator
Posts: 377
Quote:
Originally Posted by matthey View Post
Dr. Vince Weaver finally updated his code density web site and documentation using some improvements suggested in this thread.

http://www.deater.net/weave/vmwprod/asm/ll/

The 68k has the best code density for the LZSS decoder and 2nd best for total size. Thanks to Vince, to all who contributed code, and to the 68k designers for one of the greatest CPU architectures of all time.
Congratulations - see how it's better to try to help instead of just ranting?

Note that the ARM and x86 code can be improved, probably the 68k and others too.
Megol is offline  
Old 24 May 2017, 18:39   #132
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Quote:
Originally Posted by Megol View Post
Congratulations - see how it's better to try to help instead of just ranting?
Yes, of course it is better to cooperate. Some improvement is better than none even when people do things the hard (or flawed) way.

Quote:
Originally Posted by Megol View Post
Note that the ARM and x86 code can be improved, probably the 68k and others too.
I expect other architecture code can be improved too. Even the 68k code could be improved more. I submitted 3 source files with increasing optimizations. NorthWay's code was the most aggressively optimized but did not work (I could have messed up somewhere). I probably could have made it run if I could have debugged it on the Amiga but it will only run on Linux. I wasn't too worried about it when Vince asked and was happy to have the 68k in the ballpark of where it should be. I told him he should work on the others using what he learned from my submissions which are mostly like what a compiler should do. He really should have started with compiler generated code although that is no guarantee of good code quality on the 68k. Maybe we will start to see some 68k compiler improvements with bebbo's gcc changes and vbcc being sponsored.

http://eab.abime.net/showthread.php?t=85474
http://eab.abime.net/showthread.php?t=87205
matthey is offline  
Old 24 May 2017, 21:37   #133
Megol
Registered User
 
 
Join Date: May 2014
Location: inside the emulator
Posts: 377
The GCC patches are promising, I must say. I haven't looked at VBCC lately, but any improvement is nice.
Given how clean the 68k architecture is, it's strange IMHO that compilers should have any problem generating good code. A compiler that uses the (mostly legacy) quirky instructions of x86 instead of a "RISC" subset having problems - sure. But 68k? The only real "problem" is the split D and A register sets, and that's not too hard to work with...
--
I have never been good at compression/decompression code; however, the LZSS decompression code in the logo routine(s) feels odd. "Feels" is the right word, as I haven't really analysed it.
Megol is offline  
Old 28 May 2017, 00:07   #134
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Vince added RISC-V ISAs to the comparison.

http://www.deater.net/weave/vmwprod/asm/ll/

He said there was room for improvement for RISC-V code. So far the results come up a little short of the code density hype and claims although RV64C appears to have pretty good code density for a 64 bit CPU. RV64C is beating arm64 (ARMv8 AArch64). All other RISC-V variants are unimpressive in code density and make me wonder why they even bothered.

Quote:
Originally Posted by Megol View Post
The GCC patches are promising, I must say. I haven't looked at VBCC lately, but any improvement is nice.
Given how clean the 68k architecture is, it's strange IMHO that compilers should have any problem generating good code. A compiler that uses the (mostly legacy) quirky instructions of x86 instead of a "RISC" subset having problems - sure. But 68k? The only real "problem" is the split D and A register sets, and that's not too hard to work with...
There are a few things which make the 68k more challenging for compilers to generate good quality code, like the A/D register split, but they are no more difficult than x86 issues. GCC will even change alignment when a particular CPU architecture is specified for x86/x86_64. The 68k, on the other hand, has only barely-maintained old support, passed down and translated from earlier compiler versions. The 68k was abandoned at about the same time compilers were starting to get good. Popular processors get all the support, and there hasn't been a popular 68k design in decades. The innate advantage of a CPU with good code density is easily sabotaged by poor compiler support.
matthey is offline  
Old 28 May 2017, 01:25   #135
Thorham
Computer Nerd
 
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,840
Quote:
Originally Posted by matthey View Post
The innate advantage of a CPU with good code density is easily sabotaged by poor compiler support.
Just write it in asm then
Thorham is offline  
Old 29 May 2017, 12:54   #136
Megol
Registered User
 
 
Join Date: May 2014
Location: inside the emulator
Posts: 377
Quote:
Originally Posted by Thorham View Post
Just write it in asm then
Of course :P

However, even on x86 with all its support and "optimization" (quotes for a reason...), the generated code quality is generally lacking, especially for size-optimized code. Enable vectorization and increase the optimization level -> useless vectorization of scalar integer code that is extremely bloated and runs slower than the most naive integer code due to setup overheads.

I'm beginning to think I'm getting old, as I'm starting to long for less complex compilers...
Megol is offline  
Old 28 July 2017, 11:06   #137
Photon
Moderator
 
 
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,655
Intel will win because it still supports special-purpose 8-bit CPU instructions that do more than other 8-bit CPUs'. 16-bit is fluffier, with the exception of mul/div, and 32-bit RISC is the worst. Even with Thumb they can't quite get there. There are 64-bit etc. CPUs too, ofc ;-)
Photon is offline  
Old 18 August 2017, 12:23   #138
NorthWay
Registered User
 
Join Date: May 2013
Location: Grimstad / Norway
Posts: 852
Back to that peculiar obsession with the size of the LZSS decompression loop. I sat down and tinkered with how I would do it natively if size was all I cared about and I could arrange data as I like. 34 bytes:
Code:
get_bits
	move.b	(a3)+,d5
get_bit
	roxr.b	#1,d5
	beq.s	get_bits

	bcs.s	string

literal
	move.b	(a3)+,(a2)+
	bra.s	check

string
	move.w	(a3)+,d0	; 4(negative) + 12(negative)
	move.w	d0,d1
	or.w	d2,d0		; $F000
	sub.w	d3,d1		; 2<<12

	lea	(a2,d0.w),a0
copyloop
	move.b	(a0)+,(a2)+
	add.w	d4,d1		; 1<<12
	bcc.s	copyloop

check
	cmp.l	a2,a1	; swap order? might need to adjust a1 by 1
	bcc.s	get_bit
If you are willing to reduce the max copy string length from 18 to 16 bytes then you can save 2 more bytes ("sub.w d3,d1").
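For comparison, here is a plain C model of the kind of token stream a tiny LZSS loop like the one above consumes: one control bit per token (0 = literal byte, 1 = match), with each match packing a 4-bit length (3-18 bytes) and a 12-bit backward distance into one word. This is my own simplified encoding for illustration, not NorthWay's exact negative-offset trick:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode LZSS: control bits are read LSB-first from control bytes;
   a match word is big-endian, top nibble = length-3, low 12 bits =
   backward distance (must be >= 1). */
static size_t lzss_decode(const uint8_t *in, uint8_t *out, size_t out_len)
{
    size_t o = 0;
    uint8_t ctrl = 0;
    int bits = 0;
    while (o < out_len) {
        if (bits == 0) { ctrl = *in++; bits = 8; }   /* refill, cf. get_bits */
        int match = ctrl & 1;
        ctrl >>= 1;
        bits--;
        if (!match) {
            out[o++] = *in++;                        /* literal */
        } else {
            uint16_t w = (uint16_t)((in[0] << 8) | in[1]);
            in += 2;
            size_t len  = (w >> 12) + 3;             /* 3..18 bytes */
            size_t dist = w & 0x0FFF;                /* distance back */
            for (size_t i = 0; i < len && o < out_len; i++, o++)
                out[o] = out[o - dist];              /* overlapping copy */
        }
    }
    return o;
}
```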
NorthWay is offline  
Old 19 August 2017, 23:33   #139
ross
Defendit numerus
 
 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 54
Posts: 4,491
Quote:
Originally Posted by NorthWay View Post
Back to that peculiar obsession with the size of the LZSS decompression loop.
--- cut ---
If you are willing to reduce the max copy string length from 18 to 16 bytes then you can save 2 more bytes ("sub.w d3,d1").
Hi NorthWay, some remarks.
To be fair, D2, D3 and A1 need to be initialized, and that consumes code space (with an escape token, A1 can be omitted).

And this code does not work on 68000 machines

[EDIT: there is even a subtle initialization bug..]

Regards,
ross

Last edited by ross; 19 August 2017 at 23:49. Reason: []
ross is offline  
Old 20 August 2017, 02:55   #140
NorthWay
Registered User
 
Join Date: May 2013
Location: Grimstad / Norway
Posts: 852
I know, but I said it was about the loop itself. That was the only thing that was counted for some reason, and so I cut out all init etc.
If I cared about a more realistic total code size then I would arrange it differently. Chances are I would care about speed too.

And if you are willing to drop in-buffer overwrite decompression, then you can separate the literals from the control/length+distance bits and read out the control bits ("get_bits") 16 at a time.

Last edited by NorthWay; 20 August 2017 at 03:03.
NorthWay is offline  
 

