the multi-cpu code density contest - Page 2

Thorham · 06 February 2017, 17:10

Quote:

Originally Posted by litwr

We have two 64-bit unsigned integers A and B. How do we find which is bigger?

Code:

; cmp.l d1:d0,d3:d2

    cmp.l   d1,d3
    bne     .l1
    cmp.l   d0,d2
.l1
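
The same idea modelled in Python, as a sketch rather than a translation of the 68k code: compare the high halves first and look at the low halves only when the high ones are equal. The name `cmp_u64_halves`, the argument order and the -1/0/1 return convention are assumptions made for illustration.

```python
def cmp_u64_halves(a_hi, a_lo, b_hi, b_lo):
    """Compare two unsigned 64-bit values given as 32-bit halves.

    Returns -1, 0 or 1 for A < B, A == B, A > B.
    """
    if a_hi != b_hi:
        # The high halves decide unless they are equal.
        return -1 if a_hi < b_hi else 1
    if a_lo != b_lo:
        # Only now do the low halves matter.
        return -1 if a_lo < b_lo else 1
    return 0
```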

phx · 06 February 2017, 17:16

Ok, I'm not a PPC-expert, but to give you something I converted your first example for the PPC. It needs fewer instructions, but some more bytes because of RISC (4 bytes for each instruction).

Code:

# r3 source
# r4-r7 dest (-4)
        li      r12,2000
        mtctr   r12
loop:
        lmw     r28,0(r3)
        rlwinm  r11,r28,16,0,15
        rlwimi  r11,r30,0,16,31
        stwu    r11,4(r5)
        rlwinm  r11,r28,0,0,15
        rlwimi  r11,r30,16,16,31
        stwu    r11,4(r4)
        rlwinm  r11,r29,16,0,15
        rlwimi  r11,r31,0,16,31
        stwu    r11,4(r7)
        rlwinm  r11,r29,0,0,15    
        rlwimi  r11,r31,16,16,31
        stwu    r11,4(r6)
        addi    r3,r3,16
        bdnz    loop
        blr

meynaf · 06 February 2017, 17:21

Quote:

Originally Posted by litwr

This example is a bit too big. I am not sure I can find enough time for it.

Hear the great news folks : 20 lines of asm are too big to write an x86 version in a decent amount of time !

Quote:

Originally Posted by litwr

IMHO I have another, much simpler one. We have two 64-bit unsigned integers A and B. How do we find which is bigger? With x86_64 we just use CMP RAX,RBX. With x86 we should use 2 registers for every number, for example, EAX:EBX for A and ECX:EDX for B. 680x0 may use D0:D1 for A and D2:D3 for B. The registers should not change. Start!

This example is too short and meaningless. Anyway it's easy : just... hey, damn it, Thorham was faster

Quote:

Originally Posted by litwr

BTW 680x0 can't match ARM in the division algorithm. It requires only 3 ARM instructions for a loop! Indeed, 680x0 has hardware division...

This example is too short and meaningless - and considered wrong until I can see said ARM code. Please post these 3 lines here so everyone can see them.

Btw. Some recent ARM models apparently have a division instruction.

meynaf · 06 February 2017, 17:32

Quote:

Originally Posted by phx

Ok, I'm not a PPC-expert, but to give you something I converted your first example for the PPC. It needs fewer instructions, but some more bytes because of RISC (4 bytes for each instruction).

You must be a PPC expert to be able to write this. PPC asm isn't exactly easy

AnimaInCorpore · 06 February 2017, 17:33

Quote:

Originally Posted by idrougge

Assembly language has been obsolete for just as long.

I wouldn't say "obsolete" but "less important".

Thorham · 06 February 2017, 17:35

Quote:

Originally Posted by phx

Ok, I'm not a PPC-expert, but to give you something I converted your first example for the PPC. It needs fewer instructions, but some more bytes because of RISC (4 bytes for each instruction).

Those are some interesting instructions. Can't comment on how good that code is, though.

Quote:

Originally Posted by meynaf

Hear the great news folks : 20 lines of asm are too big to write an x86 version in a decent amount of time !

Yeah, I thought that was odd, too. Shouldn't take more than ten minutes.

Quote:

Originally Posted by meynaf

Anyway it's easy

Too easy.

Quote:

Originally Posted by litwr

BTW 680x0 can't match ARM in the division algorithm. It requires only 3 ARM instructions for a loop! Indeed, 680x0 has hardware division...

68k can almost do it in three instructions:

Code:

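    ; d1 = dividend, d0 = divisor (non-zero), quotient ends up in d2
    moveq   #-1,d2
.loop
    addq.l  #1,d2
    sub.l   d0,d1
    bcc     .loop       ; repeat until the subtraction borrows

It's COMPLETELY useless, of course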
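
What makes the loop useless in practice is that its running time grows with the quotient. A minimal Python sketch of division by repeated subtraction, assuming non-negative operands and a positive divisor (the function name is made up for this example):

```python
def div_by_repeated_subtraction(dividend, divisor):
    """Return (quotient, remainder) by subtracting the divisor until it no longer fits."""
    if divisor <= 0 or dividend < 0:
        raise ValueError("expects dividend >= 0 and divisor > 0")
    quotient = 0
    while dividend >= divisor:
        dividend -= divisor   # one subtraction per unit of quotient
        quotient += 1
    return quotient, dividend
```

Dividing a large number by 1 takes as many iterations as the number itself, which is why a shift-based method is preferred for anything serious.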

Thorham · 06 February 2017, 17:37

Quote:

Originally Posted by AnimaInCorpore

I wouldn't say "obsolete" but "less important".

It's probably still critically important. It's just not used to write whole applications with.

meynaf · 06 February 2017, 17:48

Quote:

Originally Posted by Thorham

It's just not used to write whole applications with.

Depends on who's writing

litwr · 06 February 2017, 18:24

@Thorham Thank you for your code. I missed that it is better to start with the high half.
20 lines of code are easy to translate directly, but there is another point: to show ISA advantages. That requires grasping the idea behind the code. BTW I am a bit disappointed by your joke about division; I wrote about fast division.
@meynaf

Code:

        MOVS R12,0,2    ;clears R12 and CY
        RSB R1,R1,0     ;NEG
   repeat 32
        ADCS R0,R0,R0
        ADCS R12,R1,R12,lsl 1
        SUBCC R12,R12,R1
   end repeat
        ADC R0,R0,R0

This is code for a 32-bit divisor and dividend: R0 = R0/R1, R12 = R0%R1.
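
For readers who don't read ARM, here is a plain shift-and-subtract (restoring) division in Python: one quotient bit per step, 32 steps for 32-bit operands. It is a sketch of the general technique only, not a line-by-line translation of the ARM code above, which appears to fold the shift, trial subtraction and correction into three instructions by using a negated divisor and the carry flag. `udiv32` is a hypothetical name.

```python
def udiv32(dividend, divisor):
    """Unsigned 32-bit restoring division; returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    quotient = 0
    remainder = 0
    for bit in range(31, -1, -1):
        # Shift the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        quotient <<= 1
        if remainder >= divisor:
            # The divisor fits: subtract it and record a 1 bit.
            remainder -= divisor
            quotient |= 1
    return quotient, remainder
```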

phx · 06 February 2017, 18:38

Quote:

Originally Posted by meynaf

You must be a PPC expert to be able to write this. PPC asm isn't exactly easy

Let's say I have more experience with the M68k than with the PPC.

Quote:

Originally Posted by Thorham

Those are some interesting instructions. Can't comment on how good that code is, though.

lmw - load multiple words. Loads up to all 32 registers in a single instruction. But always from a start register of your choice, up to the last register, r31. So it's somewhat restricted compared with the M68k movem.
rlwinm - rotate left word immediate, then AND with mask. Rotates a word left by any number of bits and masks the result. E.g. a shift-left (instead of a rotate) by two would be: rlwinm rDst,rSrc,2,0,29. It masks out the least significant two bits, which were rotated in from the top.
rlwimi - rotate left word immediate with mask insert. Similar to rlwinm, but inserts the masked bitfield from the source into the destination register while preserving the other bits outside that mask.
The destination register is always the leftmost one. All instructions allow writing the result into a register different from any source register.
Bits are numbered from MSB to LSB, unlike most other CPUs (see the mask start and end bits in the rotate instructions).
The stack is only rarely needed. Neither for passing arguments (there are sufficient registers), nor for saving/restoring the return address. A sub-routine call puts the return address into the LR (link) register. Returning is just a "blr" (branch to LR).
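
The two rotate instructions described above can be modelled in a few lines of Python. This is a sketch of their documented behaviour on 32-bit values, with bit 0 as the MSB; the helper names `rotl32` and `ppc_mask` are made up for this example.

```python
MASK32 = 0xFFFFFFFF

def rotl32(value, shift):
    """Rotate a 32-bit value left by shift bits."""
    shift &= 31
    return ((value << shift) | (value >> (32 - shift))) & MASK32

def ppc_mask(mb, me):
    """Ones from bit mb to bit me inclusive, where bit 0 is the MSB."""
    from_mb = MASK32 >> mb
    up_to_me = (MASK32 << (31 - me)) & MASK32
    # mb > me means the mask wraps around the ends of the word.
    return from_mb & up_to_me if mb <= me else from_mb | up_to_me

def rlwinm(rs, sh, mb, me):
    """Rotate left, then keep only the masked bits."""
    return rotl32(rs, sh) & ppc_mask(mb, me)

def rlwimi(ra, rs, sh, mb, me):
    """Rotate left, then insert the masked bits into ra, keeping ra's other bits."""
    m = ppc_mask(mb, me)
    return (rotl32(rs, sh) & m) | (ra & ~m & MASK32)
```

With these, `rlwinm(x, 16, 0, 15)` followed by `rlwimi(result, y, 0, 16, 31)` builds a word from the low half of `x` and the low half of `y`, which is the pattern the loop above repeats four times.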

meynaf · 06 February 2017, 18:52

Quote:

Originally Posted by litwr

This is code for a 32-bit divisor and dividend: R0 = R0/R1, R12 = R0%R1.

So it is not 3 instructions but 32*3 +3

More seriously, you can always find examples like this, but add fused branch to 68k and it's not only of similar size : there isn't a speed difference anymore.

Btw. Your code isn't like 68k div because it can not detect overflow.
Btw2. Having an ARM version of my example would be nice, too - but it's probably even more time consuming than for x86

matthey · 06 February 2017, 18:58

Quote:

Originally Posted by Thorham

You can do it a little shorter like this (if I didn't make a mistake):

Code:

    move.w  #1999,d0
.loop
    movem.l (a0)+,d1-d4

    swap    d3
    eor.w   d1,d3
    eor.w   d3,d1
    move.l  d1,(a1)+
    eor.w   d1,d3
    swap    d3
    move.l  d3,(a2)+

    swap    d4
    eor.w   d2,d4
    eor.w   d4,d2
    move.l  d2,(a3)+
    eor.w   d2,d4
    swap    d4
    move.l  d4,(a4)+

    dbra    d0,.loop
    rts

68k: 18 instructions, 42 bytes, 2.33 bytes/instruction

This code is *not* good for superscalar (meynaf's code is likely faster with superscalar).

Quote:

Originally Posted by phx

Ok, I'm not a PPC-expert, but to give you something I converted your first example for the PPC. It needs fewer instructions, but some more bytes because of RISC (4 bytes for each instruction).

Code:

# r3 source
# r4-r7 dest (-4)
        li      r12,2000
        mtctr   r12
loop:
        lmw     r28,0(r3)
        rlwinm  r11,r28,16,0,15
        rlwimi  r11,r30,0,16,31
        stwu    r11,4(r5)
        rlwinm  r11,r28,0,0,15
        rlwimi  r11,r30,16,16,31
        stwu    r11,4(r4)
        rlwinm  r11,r29,16,0,15
        rlwimi  r11,r31,0,16,31
        stwu    r11,4(r7)
        rlwinm  r11,r29,0,0,15    
        rlwimi  r11,r31,16,16,31
        stwu    r11,4(r6)
        addi    r3,r3,16
        bdnz    loop
        blr

PPC: 18 instructions, 72 bytes (71% larger code than 68k), 4 bytes/instruction

PPC would need 71% more ICache and almost 2x the instruction fetch to keep up with the 68k if these numbers were typical (actually closer to 40% more ICache and almost 2x the instruction fetch). PPC was supposed to outclock CISC, but that increases power consumption, just like cache misses from too small an ICache.
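
The density figures quoted above follow directly from the instruction and byte counts. A small Python sketch of the arithmetic (function names are made up for this example):

```python
def bytes_per_instruction(code_bytes, instructions):
    """Average encoded size of one instruction."""
    return code_bytes / instructions

def percent_larger(code_bytes, baseline_bytes):
    """How much larger one encoding is than a baseline, in percent."""
    return (code_bytes / baseline_bytes - 1) * 100

m68k_density = bytes_per_instruction(42, 18)   # 68k: 42 bytes, 18 instructions
ppc_density = bytes_per_instruction(72, 18)    # PPC: 72 bytes, 18 instructions
ppc_overhead = percent_larger(72, 42)          # PPC size relative to 68k
```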

meynaf · 06 February 2017, 19:16

Quote:

Originally Posted by litwr

20 lines of code is easy to translate directly but we have other point. To show ISA advantages. This requires to take the idea behind the code.

While this point is valid, it would only turn 5 minutes into 15.
Compile it with GCC if you don't have the time to do handwritten x86 asm

But let's try another example if you prefer.
We have an array of bits in memory. They are numbered like in the PPC : 0 is top left bit (most significant bit, aka sign bit). This array can consist of many bytes, each one giving 8 flags (numbered 0 to 7). For example, flag number 555 ($22b) is bit 555%8 (=3, giving %00010000) of byte 555/8 (+69).
Now we want to set a specific bit (number is in some register, like D0, EAX, whichever you want) and get its old state in some flag (in CCR or EFLAGS). Memory region is pointed to by some other register (like A0 or ESI).
Real life use case : check whether a newly explored cell is already known or not, while indicating it's not in the "fog of war" anymore.

Is that one short enough for you ?

litwr · 06 February 2017, 20:09

This looks more interesting but where is 680x0 code?
BTW I know very little about PPC. It has a bit odd history IMHO. I can't conceive the idea of big endian byte order. It looks contrived to me. It slows down arithmetic with 6809, 68008, 68000, 68010. So what is this oddity for?

meynaf · 06 February 2017, 21:11

Quote:

Originally Posted by litwr

This looks more interesting but where is 680x0 code?

You will be real sorry when you see it, believe me. It'll show how much other cpus are crap in comparison

But shht! It's too early for this.

This time, you show it first. Then i'll show 68k version. Ready ?

Quote:

Originally Posted by litwr

I can't conceive the idea of big endian byte order. It looks contrived to me. It slows down arithmetic with 6809, 68008, 68000, 68010. So what is this oddity for?

Oh, noes... And now you will even defend that utter crap called little endian.

That horror which makes data totally unreadable.
That completely illogical thing.
Little endian has absolutely no advantage !
Thankfully internet protocol designers were smart enough to use big endian, however too many hardware makers missed the point...

Big endian costs ZERO in hardware implementation. It does not slow down anything either. ARM even provides both options and is supposed to be simple and easy to implement.
"ABCD" is same as $41424344. You won't see "DCBA" when reading memory afterwards and wonder what happened.

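The readability point can be checked in a couple of lines of Python using the standard `int.from_bytes` and `int.to_bytes`; the variable names are only for illustration.

```python
# The four ASCII bytes "ABCD" read as one big-endian 32-bit number.
value = int.from_bytes(b"ABCD", "big")

# The same number stored back in each byte order.
stored_big = value.to_bytes(4, "big")        # bytes appear in reading order
stored_little = value.to_bytes(4, "little")  # bytes appear reversed in memory
```

Here `value` is $41424344, `stored_big` reads "ABCD" and `stored_little` reads "DCBA".
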
matthey · 06 February 2017, 22:05

Quote:

Originally Posted by litwr

This looks more interesting but where is 680x0 code?
BTW I know very little about PPC. It has a bit odd history IMHO.

PPC was a simplified version of IBM's POWER ISA. Apple, IBM and Motorola (AIM) agreed to adopt and proliferate it as the "next generation" ISA during the RISC hype days. Motorola abandoned the 68k and mostly developed PPC processors.

Quote:

Originally Posted by litwr

I can't conceive the idea of big endian byte order. It looks contrived to me. It slows down arithmetic with 6809, 68008, 68000, 68010. So what is this oddity for?

Big endian is more natural as the bytes are stored in sequential order as they appear. There are advantages of both.

Big Endian
+ used more for networking standards
+ division starts at the most significant end making it faster
+ magnitude of numbers can be determined more quickly
+ natural order better for text handling
+ more human readable hex/binaries and text in memory

Little Endian
+ more common
+ addition/subtraction starts at the least significant end allowing faster carry propagation to be used

Theoretically, any 68k CPU with a 16 bit data bus could be faster when adding/subtracting a 32 bit number in memory but memory was likely already fast enough (and the 68k slow enough) that it made little if any difference. I would love to hear anywhere you read that it made a difference and how much though.
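
The carry argument can be illustrated with a Python sketch that adds two 32-bit numbers in 16-bit pieces, as a CPU with a 16-bit data path would have to: the low halves must be added first because the high halves need their carry. This only shows the data dependency and says nothing about real timings; the function name is made up.

```python
def add32_via_16bit_halves(a, b):
    """Add two 32-bit values in two 16-bit steps, low half first."""
    low = (a & 0xFFFF) + (b & 0xFFFF)
    carry = low >> 16                      # 1 if the low halves overflowed
    high = (a >> 16) + (b >> 16) + carry   # high halves depend on that carry
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)
```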

Thorham · 07 February 2017, 03:48

To PHX:

Thanks for that

Quote:

Originally Posted by litwr

This requires to take the idea behind the code.

The idea is transposition (each 2 letters are two bytes in a register):

aabb
ccdd

becomes

aacc
bbdd
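
In Python terms, treating each register as a 32-bit integer, the regrouping looks like this. It is a sketch of the data movement only, and `transpose_halves` is a hypothetical name.

```python
def transpose_halves(x, y):
    """Regroup the 16-bit halves of two 32-bit words.

    x = aabb, y = ccdd  ->  (aacc, bbdd)
    """
    high_pair = (x & 0xFFFF0000) | (y >> 16)        # aa from x, cc from y
    low_pair = ((x & 0xFFFF) << 16) | (y & 0xFFFF)  # bb from x, dd from y
    return high_pair, low_pair
```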

Quote:

Originally Posted by litwr

BTW I am a bit disappointed by your joke about division; I wrote about fast division.

Oh, come on! Anyway, shift-subtract division on 68k wouldn't even be close to what you wrote. In terms of size, ARM wins, no contest.

Quote:

Originally Posted by matthey

This code is *not* good for superscalar

You mean 68060? Why exactly? If it's the instruction ordering, then wouldn't that be irrelevant because of the pipelining (slow chipmem writes)?

Anyway, I optimize code for 68020s and 68030s because they need it far more than 68060s, and I don't know a whole lot about 68060 optimization (I only know something about instruction ordering). Especially when a plain A1200 is your target, 68060 optimization isn't relevant anymore.

Also, it was about code size

Thorham · 07 February 2017, 04:10

Quote:

Originally Posted by meynaf

We have an array of bits in memory. They are numbered like in the PPC : 0 is top left bit (most significant bit, aka sign bit). This array can consist of many bytes, each one giving 8 flags (numbered 0 to 7). For example, flag number 555 ($22b) is bit 555%8 (=3, giving %00010000) of byte 555/8 (+69).

Wouldn't it be better to use mod 32 and just do something like this?

Code:

;
; bit array set
;
; a0 = array
; d0 = bit number
;
    move.l  d0,d1
    lsr.l   #5,d0
    bset    d1,(a0,d0.w*4)  ; note: bset on memory works on a byte (bit number mod 8)

If the order of the bits matters:

Code:

;
; bit array set
;
; a0 = array
; d0 = bit number
;
    move.l  d0,d1
    lsr.l   #5,d0
    not.l   d1              ; flip the bit number so that bit 0 is the MSB
    bset    d1,(a0,d0.w*4)
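
The `not.l` step relies on a small identity: inverting a bit number and keeping its low bits turns MSB-first numbering into LSB-first numbering, because ~n mod 32 equals 31 - (n mod 32). A Python sketch, assuming the width is a power of two (the function name is made up):

```python
def msb_first_to_lsb_first(n, width=32):
    """Map a bit number counted from the MSB to one counted from the LSB."""
    return ~n & (width - 1)
```

The same identity holds for a width of 8, which is the relevant case when the bit operation works on a single byte.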

meynaf · 07 February 2017, 08:11

Quote:

Originally Posted by Thorham

The idea is transposition (each 2 letters are two bytes in a register):

aabb
ccdd

becomes

aacc
bbdd

It is simple reordering, like this :

Code:

 move.w (a0)+,(a1)+
 move.w (a0)+,(a2)+
 move.w (a0)+,(a3)+
 move.w (a0)+,(a4)+

But the above would do twice the amount of chipmem accesses and is therefore inefficient (though perhaps faster on bare 68000, i don't know).

Quote:

Originally Posted by Thorham

In terms of size, ARM wins, no contest.

Number of instructions, yes. Size, no. For real life cases, even less.

Quote:

Originally Posted by Thorham

Wouldn't it be better to use mod 32 and just do something like this?

Would be too easy

But ok, i'll give that 68k code. The 68020 has tools that are very powerful when you know how to use them (prepare to be shocked) :

Code:

bfset (a0){d0:1}

Thorham · 07 February 2017, 09:06

Quote:

Originally Posted by meynaf

But the above would do twice the amount of chipmem accesses and is therefore inefficient (though perhaps faster on bare 68000, i don't know).

Looks like it would be faster on a 68000 because of all those 32-bit instructions. Not to mention there's no cache (not entirely sure, but highly likely).

Quote:

Originally Posted by meynaf

Number of instructions, yes. Size, no. For real life cases, even less.

Typical.

Quote:

Originally Posted by meynaf

Code:

bfset (a0){d0:1}

Yeah, that's pretty short

I keep forgetting those bitfield instructions for some reason

However, is it faster (greater than 5 byte case is 22 cycles)?

