English Amiga Board


Old 06 February 2017, 17:10   #21
Thorham
Computer Nerd
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,847
Quote:
Originally Posted by litwr View Post
We have two 64-bit unsigned integers A and B. How do we find which is bigger?
Code:
; cmp.l d1:d0,d3:d2

    cmp.l   d1,d3
    bne     .l1
    cmp.l   d0,d2
.l1
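The same idea, sketched in Python for readers who don't read 68k: compare the high halves first and only look at the low halves on a tie. This is just an illustration; `cmp64` and its parameter names are made up here and are not from the thread.

```python
def cmp64(a_hi, a_lo, b_hi, b_lo):
    """Compare two unsigned 64-bit values held as 32-bit halves.

    Returns -1, 0 or 1 for a < b, a == b, a > b.
    The low halves only matter when the high halves are equal.
    """
    if a_hi != b_hi:                      # first cmp.l, then bne skips the second one
        return -1 if a_hi < b_hi else 1
    if a_lo != b_lo:                      # second cmp.l decides on a tie
        return -1 if a_lo < b_lo else 1
    return 0
```

On the 68k the answer ends up in the condition codes, so an unsigned branch (`bhi`, `bcs` and friends) would follow the last `cmp.l`.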
Old 06 February 2017, 17:16   #22
phx
Natteravn
 
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,546
Ok, I'm not a PPC-expert, but to give you something I converted your first example for the PPC. It needs fewer instructions, but some more bytes because of RISC (4 bytes for each instruction).
Code:
# r3 = source pointer
# r4-r7 = destination pointers, each pre-biased by -4 (stwu stores at +4 and updates the register)
        li      r12,2000
        mtctr   r12
loop:
        lmw     r28,0(r3)
        rlwinm  r11,r28,16,0,15
        rlwimi  r11,r30,0,16,31
        stwu    r11,4(r5)
        rlwinm  r11,r28,0,0,15
        rlwimi  r11,r30,16,16,31
        stwu    r11,4(r4)
        rlwinm  r11,r29,16,0,15
        rlwimi  r11,r31,0,16,31
        stwu    r11,4(r7)
        rlwinm  r11,r29,0,0,15    
        rlwimi  r11,r31,16,16,31
        stwu    r11,4(r6)
        addi    r3,r3,16
        bdnz    loop
        blr

Last edited by phx; 06 February 2017 at 17:18. Reason: addi was missing
Old 06 February 2017, 17:21   #23
meynaf
son of 68k
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
Originally Posted by litwr View Post
This example is a bit too big. I am not sure I can spare enough time for it.
Hear the great news folks : 20 lines of asm are too big to write an x86 version in a decent amount of time !


Quote:
Originally Posted by litwr View Post
IMHO I have another, much simpler one. We have two 64-bit unsigned integers A and B. How do we find which is bigger? With x86_64 we just use CMP RAX,RBX. With x86 we should use 2 registers for every number, for example, EAX:EBX for A and ECX:EDX for B. 680x0 may use D0:D1 for A and D2:D3 for B. The registers should not change. Start!
This example is too short and meaningless. Anyway it's easy : just... hey, damn it, Thorham was faster


Quote:
Originally Posted by litwr View Post
BTW 680x0 can't match ARM in the division algorithm. It requires only 3 ARM instructions for a loop! Indeed, 680x0 has hardware division...
This example is too short and meaningless - and considered as wrong until I can see said ARM code. Please post these 3 lines here so everyone can see them.

Btw. Some recent ARM models apparently have a division instruction.
Old 06 February 2017, 17:32   #24
meynaf
son of 68k
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
Originally Posted by phx View Post
Ok, I'm not a PPC-expert, but to give you something I converted your first example for the PPC. It needs fewer instructions, but some more bytes because of RISC (4 bytes for each instruction).
You must be PPC expert to be able to write this. PPC asm isn't exactly very easy
Old 06 February 2017, 17:33   #25
AnimaInCorpore
Registered User
 
Join Date: Nov 2012
Location: Willich/Germany
Posts: 233
Quote:
Originally Posted by idrougge View Post
Assembly language has been obsolete for just as long.
I wouldn't say "obsolete" but "less important".
Old 06 February 2017, 17:35   #26
Thorham
Computer Nerd
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,847
Quote:
Originally Posted by phx View Post
Ok, I'm not a PPC-expert, but to give you something I converted your first example for the PPC. It needs fewer instructions, but some more bytes because of RISC (4 bytes for each instruction).
Those are some interesting instructions. Can't comment on how good that code is, though.

Quote:
Originally Posted by meynaf View Post
Hear the great news folks : 20 lines of asm are too big to write an x86 version in a decent amount of time !
Yeah, I thought that was odd, too. Shouldn't take more than ten minutes.

Quote:
Originally Posted by meynaf View Post
Anyway it's easy
Too easy.

Quote:
Originally Posted by litwr View Post
BTW 680x0 can't match ARM in the division algorithm. It requires only 3 ARM instructions for a loop! Indeed, 680x0 has hardware division...
68k can almost do it in three instructions:
Code:
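    moveq   #-1,d2          ; quotient starts at -1
.loop
    addq.l  #1,d2           ; count one more subtraction
    sub.l   d0,d1           ; d1 = d1 - d0
    bge     .loop           ; keep going while the remainder is still >= 0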
Is COMPLETELY useless of course
Old 06 February 2017, 17:37   #27
Thorham
Computer Nerd
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,847
Quote:
Originally Posted by AnimaInCorpore View Post
I wouldn't say "obsolete" but "less important".
It's probably still critically important. It's just not used to write whole applications with.
Old 06 February 2017, 17:48   #28
meynaf
son of 68k
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
Originally Posted by Thorham View Post
It's just not used to write whole applications with.
Depends on whom
Old 06 February 2017, 18:24   #29
litwr
Registered User
 
Join Date: Mar 2016
Location: Ozherele
Posts: 229
@Thorham Thank you for your code. I missed that it is better to start with the high byte.
20 lines of code are easy to translate directly, but we have another point: to show ISA advantages. This requires understanding the idea behind the code. BTW I am a bit disappointed by your joke about division; I wrote about fast division.
@meynaf
Code:
        MOVS R12,0,2    ;clears R12 and CY
        RSB R1,R1,0     ;NEG
   repeat 32
        ADCS R0,R0,R0
        ADCS R12,R1,R12,lsl 1
        SUBCC R12,R12,R1
   end repeat
        ADC R0,R0,R0
This is code for a 32-bit divisor and dividend: R0 = R0/R1, R12 = R0%R1.
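The unrolled loop looks like the classic shift-and-subtract (restoring) division: one step per dividend bit, most significant bit first. A rough Python sketch of that algorithm follows. It models the arithmetic only, not the flag tricks of the ARM code (negated divisor, carry reused as the quotient bit), and `div32` is a name invented for this illustration.

```python
MASK32 = 0xFFFFFFFF

def div32(dividend, divisor):
    """Restoring shift-and-subtract division on unsigned 32-bit values.

    Returns (quotient, remainder).
    """
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    quotient = 0
    remainder = 0
    for bit in range(31, -1, -1):          # one step per dividend bit, MSB first
        # shift the next dividend bit into the partial remainder
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        quotient <<= 1
        if remainder >= divisor:           # trial subtraction succeeds
            remainder -= divisor
            quotient |= 1
    return quotient & MASK32, remainder & MASK32
```

Each of the 32 rounds does a shift, a trial subtract and a conditional undo, which is roughly the three-instruction body repeated in the listing above.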

Last edited by litwr; 06 February 2017 at 18:30.
Old 06 February 2017, 18:38   #30
phx
Natteravn
 
Join Date: Nov 2009
Location: Herford / Germany
Posts: 2,546
Quote:
Originally Posted by meynaf View Post
You must be PPC expert to be able to write this. PPC asm isn't exactly very easy
Let's say I have more experience with the M68k than with the PPC.

Quote:
Originally Posted by Thorham View Post
Those are some interesting instructions. Can't comment on how good that code is, though.
  • lmw - load multiple words. Loads up to all 32 registers in a single instruction. But always from a start register of your choice, up to the last register, r31. So it's somewhat restricted compared with the M68k movem.
  • rlwinm - rotate left word immediate, then apply mask. Rotates a word left by any number of bits and masks the result. E.g. a shift-left (instead of a rotate) by two would be: rlwinm rDst,rSrc,2,0,29. It masks out the least significant two bits, which were rotated.
  • rlwimi - rotate left word immediate with masked insert. Similar to rlwinm, but inserts the masked bitfield from the source into the destination register while preserving the other bits outside that mask.
  • The destination register is always the leftmost one. All instructions allow writing the result into a register different from any source register.
  • Bits are numbered from MSB to LSB, unlike most other CPUs (see the mask start and end bits in the rotate instructions).
  • The stack is only rarely needed. Neither for passing arguments (there are sufficient registers), nor for saving/restoring the return address. A sub-routine call puts the return address into the LR (link) register. Returning is just a "blr" (branch to LR).
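To make the two rotate instructions a bit more concrete, here is a small Python model of them, using the MSB-first bit numbering described above. It is only a sketch for illustration: the helper names (`rotl32`, `ppc_mask`) are invented here, and the functions return the new destination value instead of writing a register.

```python
MASK32 = 0xFFFFFFFF

def rotl32(value, shift):
    """Rotate a 32-bit value left by `shift` bits."""
    shift &= 31
    return ((value << shift) | (value >> (32 - shift))) & MASK32

def ppc_mask(mb, me):
    """Mask covering bits mb..me, where bit 0 is the MSB (PPC numbering)."""
    mask = 0
    for bit in range(32):
        if mb <= me:
            selected = mb <= bit <= me
        else:                              # wrap-around mask
            selected = bit >= mb or bit <= me
        if selected:
            mask |= 0x80000000 >> bit
    return mask

def rlwinm(rs, sh, mb, me):
    """Rotate left, then AND with the mask; returns the destination value."""
    return rotl32(rs, sh) & ppc_mask(mb, me)

def rlwimi(ra, rs, sh, mb, me):
    """Rotate left, then insert the masked bits into ra, keeping ra's other bits."""
    mask = ppc_mask(mb, me)
    return (rotl32(rs, sh) & mask) | (ra & ~mask & MASK32)
```

With these, `rlwinm(x, 2, 0, 29)` behaves like a plain shift left by two, and a `rlwinm`/`rlwimi` pair merges one 16-bit half from each of two registers, which is all the loop posted earlier needs.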
Old 06 February 2017, 18:52   #31
meynaf
son of 68k
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
Originally Posted by litwr View Post
This is code for a 32-bit divisor and dividend: R0 = R0/R1, R12 = R0%R1.
So it is not 3 instructions but 32*3 +3

More seriously, you can always find examples like this, but add fused branch to 68k and it's not only of similar size : there isn't a speed difference anymore.

Btw. Your code isn't like 68k div because it can not detect overflow.
Btw2. Having an ARM version of my example would be nice, too - but it's probably even more time consuming than for x86
Old 06 February 2017, 18:58   #32
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Quote:
Originally Posted by Thorham View Post
You can do it a little shorter like this (if I didn't make a mistake):
Code:
    move.w  #1999,d0
.loop
    movem.l (a0)+,d1-d4

    swap    d3
    eor.w   d1,d3
    eor.w   d3,d1
    move.l  d1,(a1)+
    eor.w   d1,d3
    swap    d3
    move.l  d3,(a2)+

    swap    d4
    eor.w   d2,d4
    eor.w   d4,d2
    move.l  d2,(a3)+
    eor.w   d2,d4
    swap    d4
    move.l  d4,(a4)+

    dbra    d0,.loop
    rts
68k: 18 instructions, 42 bytes, 2.33 bytes/instruction

This code is *not* good for superscalar (meynaf's code is likely faster with superscalar).

Quote:
Originally Posted by phx View Post
Ok, I'm not a PPC-expert, but to give you something I converted your first example for the PPC. It needs fewer instructions, but some more bytes because of RISC (4 bytes for each instruction).
Code:
# r3 = source pointer
# r4-r7 = destination pointers, each pre-biased by -4 (stwu stores at +4 and updates the register)
        li      r12,2000
        mtctr   r12
loop:
        lmw     r28,0(r3)
        rlwinm  r11,r28,16,0,15
        rlwimi  r11,r30,0,16,31
        stwu    r11,4(r5)
        rlwinm  r11,r28,0,0,15
        rlwimi  r11,r30,16,16,31
        stwu    r11,4(r4)
        rlwinm  r11,r29,16,0,15
        rlwimi  r11,r31,0,16,31
        stwu    r11,4(r7)
        rlwinm  r11,r29,0,0,15    
        rlwimi  r11,r31,16,16,31
        stwu    r11,4(r6)
        addi    r3,r3,16
        bdnz    loop
        blr
PPC: 18 instructions, 72 bytes (71% larger code than 68k), 4 bytes/instruction
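The byte counts can be re-derived from the per-instruction sizes. The sketch below assumes the usual encodings for the 68k loop quoted above (two bytes per opcode word, plus one extension word for the immediate, the register mask and the branch displacement). The variable names are just for this illustration.

```python
# Instruction sizes in bytes for the 68k loop quoted above.
M68K_SIZES = (
    [4]          # move.w #1999,d0  (opcode + 16-bit immediate)
    + [4]        # movem.l (a0)+,d1-d4  (opcode + register mask)
    + [2] * 14   # swap / eor.w / move.l, one opcode word each
    + [4]        # dbra  (opcode + 16-bit displacement)
    + [2]        # rts
)
PPC_SIZES = [4] * 18   # every PPC instruction is 4 bytes

m68k_bytes = sum(M68K_SIZES)
ppc_bytes = sum(PPC_SIZES)
growth_percent = (ppc_bytes / m68k_bytes - 1) * 100   # how much larger the PPC code is
bytes_per_68k_instruction = m68k_bytes / len(M68K_SIZES)
```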

PPC would need 71% more ICache and almost 2x the instruction fetch to keep up with the 68k if these numbers were typical (actually closer to 40% more ICache and almost 2x the instruction fetch). PPC was supposed to out-clock CISC, but that increases power consumption, just like cache misses from too small an ICache.

Last edited by matthey; 06 February 2017 at 19:26.
Old 06 February 2017, 19:16   #33
meynaf
son of 68k
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
Originally Posted by litwr View Post
20 lines of code are easy to translate directly, but we have another point: to show ISA advantages. This requires understanding the idea behind the code.
While this point is valid, it would only turn 5 minutes into 15.
Compile it with GCC if you don't have the time to do handwritten x86 asm

But let's try another example if you prefer.
We have an array of bits in memory. They are numbered like in the PPC : 0 is top left bit (most significant bit, aka sign bit). This array can consist of many bytes, each one giving 8 flags (numbered 0 to 7). For example, flag number 555 ($22b) is bit 555%8 (=3, giving %00010000) of byte 555/8 (+69).
Now we want to set a specific bit (number is in some register, like D0, EAX, whichever you want) and get its old state in some flag (in CCR or EFLAGS). Memory region is pointed to by some other register (like A0 or ESI).
Real life use case : check whether a newly explored cell is already known or not, while indicating it's not in the "fog of war" anymore.

Is that one short enough for you ?
Old 06 February 2017, 20:09   #34
litwr
Registered User
 
Join Date: Mar 2016
Location: Ozherele
Posts: 229
This looks more interesting but where is 680x0 code?
BTW I know very little about PPC. It has a bit odd history IMHO. I can't conceive the idea of big endian byte order. It looks contrived to me. It slows down arithmetic with 6809, 68008, 68000, 68010. So what is this oddity for?

Last edited by litwr; 06 February 2017 at 20:49.
Old 06 February 2017, 21:11   #35
meynaf
son of 68k
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
Originally Posted by litwr View Post
This looks more interesting but where is 680x0 code?
You will be real sorry when you see it, believe me. It'll show how much other cpus are crap in comparison
But shht! It's too early for this.

This time, you show it first. Then i'll show 68k version. Ready ?


Quote:
Originally Posted by litwr View Post
I can't conceive the idea of big endian byte order. It looks contrived to me. It slows down arithmetic with 6809, 68008, 68000, 68010. So what is this oddity for?
Oh, noes... And now you will even defend that utter crap called little endian.
That horror which makes data totally unreadable.
That completely illogical thing.
Little endian has absolutely no advantage !
Fortunately, internet protocol designers were smart enough to use big endian; however, too many hardware makers missed the point...

Big endian costs ZERO in hardware implementation. It does not slow down anything either. ARM even provides both options and is supposed to be simple and easy to implement.
"ABCD" is same as $41424344. You won't see "DCBA" when reading memory afterwards and wonder what happened.
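The "ABCD" versus "DCBA" point is easy to reproduce with Python's standard `struct` module, which can pack the same 32-bit value in either byte order. This is only a small demonstration; the variable names are arbitrary.

```python
import struct

value = 0x41424344

big = struct.pack(">I", value)       # byte order a big-endian CPU writes to memory
little = struct.pack("<I", value)    # byte order a little-endian CPU writes to memory
```

Here `big` holds the bytes of "ABCD" in reading order, while `little` holds the same four bytes reversed.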
Old 06 February 2017, 22:05   #36
matthey
Banned
 
Join Date: Jan 2010
Location: Kansas
Posts: 1,284
Quote:
Originally Posted by litwr View Post
This looks more interesting but where is 680x0 code?
BTW I know very little about PPC. It has a bit odd history IMHO.
PPC was a simplified version of IBM's POWER ISA. Apple, IBM and Motorola (AIM) agreed to adopt and proliferate it as the "next generation" ISA during the RISC hype days. Motorola abandoned the 68k and mostly developed PPC processors.

Quote:
Originally Posted by litwr View Post
I can't conceive the idea of big endian byte order. It looks contrived to me. It slows down arithmetic with 6809, 68008, 68000, 68010. So what is this oddity for?
Big endian is more natural as the bytes are stored in sequential order as they appear. There are advantages of both.

Big Endian
+ used more for networking standards
+ division starts at the most significant end making it faster
+ magnitude of numbers can be determined more quickly
+ natural order better for text handling
+ more human readable hex/binaries and text in memory

Little Endian
+ more common
+ addition/subtraction starts at the least significant end allowing faster carry propagation to be used

Theoretically, any 68k CPU with a 16 bit data bus could be faster when adding/subtracting a 32 bit number in memory but memory was likely already fast enough (and the 68k slow enough) that it made little if any difference. I would love to hear anywhere you read that it made a difference and how much though.
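The carry argument for little endian can be sketched as a byte-wise addition: the least significant byte sits at the lowest address, so the carry moves in the same direction as the address counter. This is a toy model, not a claim about any real bus timing, and `add_little_endian` is a name invented for the example.

```python
def add_little_endian(a, b):
    """Add two equal-length little-endian byte strings (LSB at index 0).

    Returns (sum_bytes, carry_out).
    """
    result = bytearray(len(a))
    carry = 0
    for i in range(len(a)):                # ascending addresses = ascending significance
        total = a[i] + b[i] + carry
        result[i] = total & 0xFF
        carry = total >> 8
    return bytes(result), carry
```

With big-endian data the same loop has to start at the highest address and walk downwards, which is what the 68k's `addx -(Ay),-(Ax)` form is for.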

Last edited by matthey; 06 February 2017 at 22:16.
Old 07 February 2017, 03:48   #37
Thorham
Computer Nerd
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,847
To PHX:

Thanks for that

Quote:
Originally Posted by litwr View Post
This requires understanding the idea behind the code.
The idea is transposition (each 2 letters are two bytes in a register):

aabb
ccdd


becomes

aacc
bbdd


Quote:
Originally Posted by litwr View Post
BTW I am a bit disappointed by your joke about division; I wrote about fast division.
Oh, come on! Anyway, shift sub division on 68k wouldn't even be close to what you wrote. In terms of size, ARM wins, no contest.

Quote:
Originally Posted by matthey View Post
This code is *not* good for superscalar
You mean 68060? Why exactly? If it's the instruction ordering, then wouldn't that be irrelevant because of the pipelining (slow chipmem writes)?

Anyway, I optimize code for 68020s and 68030s because they need it far more than 68060s, and I don't know a whole lot about 68060 optimization (I only know something about instruction ordering). Especially when a plain A1200 is your target, 68060 optimization isn't relevant anymore.

Also, it was about code size

Last edited by Thorham; 07 February 2017 at 04:12.
Old 07 February 2017, 04:10   #38
Thorham
Computer Nerd
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,847
Quote:
Originally Posted by meynaf View Post
We have an array of bits in memory. They are numbered like in the PPC : 0 is top left bit (most significant bit, aka sign bit). This array can consist of many bytes, each one giving 8 flags (numbered 0 to 7). For example, flag number 555 ($22b) is bit 555%8 (=3, giving %00010000) of byte 555/8 (+69).
Wouldn't it be better to use mod 32 and just do something like this?
Code:
;
; bit array set
;
; a0 = array
; d0 = bit number
;
    move.l  d0,d1
    lsr.l   #3,d0           ; byte index: bset on memory is byte-wide, bit number taken mod 8
    bset    d1,(a0,d0.l)
If the order of the bits matters:
Code:
;
; bit array set
;
; a0 = array
; d0 = bit number
;
    move.l  d0,d1
    lsr.l   #3,d0           ; byte index (bset on memory is byte-wide)
    not.l   d1              ; bit 0 becomes the MSB of each byte
    bset    d1,(a0,d0.l)
Old 07 February 2017, 08:11   #39
meynaf
son of 68k
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,355
Quote:
Originally Posted by Thorham View Post
The idea is transposition (each 2 letters are two bytes in a register):

aabb
ccdd


becomes

aacc
bbdd
It is simple reordering, like this :
Code:
 move.w (a0)+,(a1)+
 move.w (a0)+,(a2)+
 move.w (a0)+,(a3)+
 move.w (a0)+,(a4)+
But the above would do twice the amount of chipmem accesses and is therefore inefficient (though perhaps faster on bare 68000, i don't know).
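Seen this way, the whole routine is a round-robin split of a word stream into four output streams. A minimal Python sketch of that view (the function name is invented here, and the words are plain list items, not memory):

```python
def deinterleave4(words):
    """Split a flat list of 16-bit words into four streams, round-robin.

    Stream k receives words k, k+4, k+8, ... just like the four
    move.w (a0)+,(aN)+ lines do when repeated.
    """
    return [words[k::4] for k in range(4)]
```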


Quote:
Originally Posted by Thorham View Post
In terms of size, ARM wins, no contest.
Number of instructions, yes. Size, no. For real life cases, even less.


Quote:
Originally Posted by Thorham View Post
Wouldn't it be better to use mod 32 and just do something like this?
Would be too easy

But ok, I'll give that 68k code. The 68020 has tools that are very powerful when you know how to use them (prepare to be shocked):
Code:
bfset (a0){d0:1}
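For anyone without a 68020 manual at hand, here is a rough Python model of what a one-bit-wide `bfset` does, as far as I understand the instruction: the offset is a signed bit offset counted from the MSB of the byte at the base address, and the N and Z flags report the field's value from before it was set. The function name and the dictionary return value are inventions of this sketch.

```python
def bfset1(memory, base, offset):
    """Model of a 1-bit-wide bfset on a bytearray.

    `base` is the index a0 points at; `offset` is a signed bit offset where
    bit 0 is the MSB of memory[base]. Returns the old state as N and Z flags.
    """
    byte_index = base + (offset >> 3)      # Python's >> floors, like an arithmetic shift
    mask = 0x80 >> (offset & 7)            # MSB-first position inside that byte
    old = bool(memory[byte_index] & mask)
    memory[byte_index] |= mask
    return {"N": old, "Z": not old}
```

So a branch on Z after the instruction tells whether the cell was already known, with no separate shift, mask or test needed.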
Old 07 February 2017, 09:06   #40
Thorham
Computer Nerd
 
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 48
Posts: 3,847
Quote:
Originally Posted by meynaf View Post
But the above would do twice the amount of chipmem accesses and is therefore inefficient (though perhaps faster on bare 68000, i don't know).
Looks like it would be faster on a 68000 because of all those 32bit instructions. Not to mention there's no cache (not entirely sure, but highly likely).

Quote:
Originally Posted by meynaf View Post
Number of instructions, yes. Size, no. For real life cases, even less.
Typical.

Quote:
Originally Posted by meynaf View Post
Code:
bfset (a0){d0:1}
Yeah, that's pretty short. I keep forgetting those bitfield instructions for some reason. However, is it faster (greater than 5 byte case is 22 cycles)?