English Amiga Board

English Amiga Board (http://eab.abime.net/index.php)
-   Coders. Asm / Hardware (http://eab.abime.net/forumdisplay.php?f=112)
-   -   index of lowest set bit (http://eab.abime.net/showthread.php?t=108775)

meynaf 05 November 2021 16:13

index of lowest set bit
 
Time for a small code contest, even though i really need to do that in my code.

If you didn't get it with the title, it's about getting the bit number of the lowest (= rightmost) bit that's equal to 1. Having a value with only that bit set is easier (when you know how) but not enough as it then serves as index.
F.e. if input value is $12345678, it must return 3. For $80000000 it returns 31. For 0, don't care - unimportant for my use. Could even loop forever :D

I have my version, but curious to see what others can come up with.
68020+ code allowed (else it's a nightmare !).

a/b 05 November 2021 17:12

Code:

        move.b        (a0,d0.w),d1
        bne.b        done
        swap        d0
        move.b        (a0,d0.w),d1
done:

Yes, this is a joke version (a0=middle of a 64kb table), it takes a while to activate 020+ systems.

meynaf 05 November 2021 17:20

Not only a joke, but actually it doesn't work. You're using same table for high and low words.

a/b 05 November 2021 17:37

Well, I was trying to supress my megalomania and settle with a single table ;p.
OK, now a bit more serious... How about:
Code:

        move.l        d0,d1
        subq.l        #1,d1
        eor.l        d1,d0
        bfffo        d0{0:32},d0
        eor.w        #31,d0

Clear all bits except the lowest one, then use bfffo to find it.

edit: EOR is my bane...

meynaf 05 November 2021 17:51

Quote:

Originally Posted by a/b (Post 1515375)
Clear all bits except the lowest one, then use bfffo to find it.

That's my current method. But bfffo is so damned slow...

ross 05 November 2021 18:34

Quote:

Originally Posted by a/b (Post 1515375)
edit: EOR is my bane...

Then do not use it :p
Code:

  move.l  d0,d1
  subq.l  #1,d1
  eor.l  d1,d0

=
Code:

    move.l  d0,d1
    neg.l  d0
    and.l  d1,d0


meynaf 05 November 2021 18:54

Now a small, interesting problem : i need to clear said bit to be able to loop on them. (Didn't think about that when posting...)
Like this :
Code:

; d2=input value, d1=tmp, d0=result
loop
 move.w d2,d1
 subq.w #1,d1
 not.w d1
 and.w d2,d1
 bfffo d1{0:32},d0
 eori.b #31,d0
(do something here with d0=bit#)
 eor.w d1,d2
 bne loop

Above eor method removes one instruction from the calculation but destroys actual value...

Photon 05 November 2021 20:18

If it's a contest and not a question, maybe specify size or cycles, and which platform? It will affect reply relevance and usefulness to you.

Simple pseudocode version is
Code:

shift right
bcs x
dbf Dn
x: move.w Dn,...

More involved would be a list of 5 mask words until 0 result.

From your last comment though, I would rather just loop from 0-31 and use bclr and act on the cc.

These are 3 major ways.

Cyprian 05 November 2021 20:30

Quote:

Originally Posted by meynaf (Post 1515357)
Time for a small code contest, even though i really need to do that in my code.


my 68000 compatible version, "move.l D2,D1" instructions can be removed if D2 does not have to be preserved:
Code:

; d2=input value, d1=tmp, d0=result
    moveq    #-1,D0
    move.l    D2,D1
Loop
        addq.l    #1,D0           
        lsr.l    #1,D1
    bcc        Loop

Quote:

Originally Posted by meynaf (Post 1515396)
Now a small, interesting problem : i need to clear said bit to be able to loop on them.


what about "bclr.l D0,D2"?

Code:

; d2=input value, d1=tmp, d0=result
    moveq    #-1,D0
    move.l    D2,D1
Loop
        addq.l    #1,D0           
        lsr.l    #1,D1
    bcc        Loop

    bclr.l    D0,D2


meynaf 05 November 2021 20:33

Quote:

Originally Posted by Photon (Post 1515421)
If it's a contest and not a question, maybe specify size or cycles, and which platform? It will affect reply relevance and usefulness to you.

It's both. A question turned into a contest.
Size ? Not too much. A few extra code words are ok, but huge table is overkill.
Cycles ? As little as possible, considering size must not explode.
Which platform ? I said 020+ was ok. Must be as agnostic as possible within this range.


Quote:

Originally Posted by Photon (Post 1515421)
From your last comment though, I would rather just loop from 0-31 and use bclr and act on the cc.

No, too many loops for values having few ones will make it even more damned slow.

meynaf 05 November 2021 20:36

Quote:

Originally Posted by Cyprian (Post 1515422)
what about "bclr.l D0,D2"?

Yes, what about being slow.
Use of bclr needs extra tst before branching, where i just have eor+branch.

a/b 05 November 2021 21:03

If you are looping through bits it's better to use dbcs:
Code:

        moveq #32-1,d1
loop:        lsr.l        #1,d0
        dbcs        d1,loop
        eor.w        #31,d1


a/b 05 November 2021 22:11

Quote:

Originally Posted by meynaf (Post 1515396)
Now a small, interesting problem : i need to clear said bit to be able to loop on them. (Didn't think about that when posting...)
...

Did you resolve this? Your code is using dec+not which is equivalent to neg (and not is inc+neg), which is also what Ross happened to post earlier, so the loop should still work but with 1 instruction less.
I optimized those two+and with sub+eor before you mentioned you need to clear the bit.

Photon 05 November 2021 22:14

Quote:

Originally Posted by meynaf (Post 1515423)
even more damned slow.

You are performing a bitwise (non-atomic) operation, like a div. This makes it serial in nature, and this in itself requires a serial solution or a successive approximation (tree) solution. In some CPUs, there's dedicated silicon for the bitwise decision tree instead of branches.

In my 3 major solutions, the 5-word table corresponds to 5 comparisons in that tree, which means you can get a result factorially quicker than linear.

This is the quickest canonical way, as in the fewest decisions barring a dedicated instruction. BFFFO scans for the highest bit, which means you must reverse bits, which is a similar desired but not implemented instruction in most CPUs. All this to say, every desire is not always implemented, and 680x0 is already a wishlist family of CISC (and later more formalized RISC) CPU.

Much can be done with 256-byte tables, if 64K tables are too big. You can reverse bits, find the last set, fill, invert, and find contours. Lowest set bit is a contour.

On different platforms, memory accesses will be more or less punishing, especially off-cache. If CPU is not specified then, there's no way you will get a single correct answer. It may very well be that the long loop in-cache with no memory access will be much faster than the shortest or fastest code with memory accesses.

For 68000 and maybe also 68EC020/68EC030, I would LUT a byte at a time backward if nonzero:

Code:

LowestBit:
        move.l d0,(a0)+                ;in: d0.l, (a0) 4 bytes workspace
        moveq #0,d0
        REPT 4
        move.b -(a0),d1
        bne.s .lu
        addq.l #8,d0
        ENDR
        RTS
.lu:
        add.b (a2,d1.w),d0        ;a2=LUT
.done:
        RTS                        ;out: d0.l=0-32 (32=no bit set)

This beats the canonical (mathematical) solution by one decision, but requires memory accesses.

Asman 05 November 2021 22:56

What about this. I just checked few cases,

Code:

        moveq        #32-1,d1
loop
        add.l        d0,d0
        dbeq        d1,loop
        addq.b        #1,d1


ross 05 November 2021 23:10

Quote:

Originally Posted by a/b (Post 1515442)
Your code is using dec+not which is equivalent to neg (and not is inc+neg), which is also what Ross happened to post earlier, so the loop should still work but with 1 instruction less.
I optimized those two+and with sub+eor before you mentioned you need to clear the bit.

Yep, I don't think on 020+ we can do a faster routine than this:

Code:

; d2=input value, d1=tmp, d0=result
loop
    move.l d2,d1
    neg.l d1
    and.l d2,d1
    bfffo d1{0:32},d0
    eori.b #31,d0
;  (do something here with d0=bit#)
    eor.l d1,d2
    bne.b loop

Different on 000 where the possibility of using a LUT *maybe* could guarantee greater speed in some cases.

Photon 06 November 2021 02:53

On 68000-68010, bit field operations are not available. BFFFO is also unusable because it gives the highest bit and OP wants lowest bit.

meynaf 06 November 2021 09:11

Quote:

Originally Posted by a/b (Post 1515431)
If you are looping through bits it's better to use dbcs:
Code:

    moveq #32-1,d1
loop:    lsr.l    #1,d0
    dbcs    d1,loop
    eor.w    #31,d1


The trick here is that looping like this will always loop 32 times. This is inefficient for cases with few ones, and these are quite common.
As an example, if value $f0000000 is given, you'll execute your loop 32 times vs 4 in my case, and nothing can be done to cure that.


Quote:

Originally Posted by a/b (Post 1515442)
Did you resolve this? Your code is using dec+not which is equivalent to neg (and not is inc+neg), which is also what Ross happened to post earlier, so the loop should still work but with 1 instruction less.
I optimized those two+and with sub+eor before you mentioned you need to clear the bit.

On my side it's "resolved" in the sense my code currently works. I just wanted to see if it could be made faster.



Quote:

Originally Posted by Photon (Post 1515444)
You are performing a bitwise (non-atomic) operation, like a div. This makes it serial in nature, and this in itself requires a serial solution or a successive approximation (tree) solution. In some CPUs, there's dedicated silicon for the bitwise decision tree instead of branches.

In my 3 major solutions, the 5-word table corresponds to 5 comparisons in that tree, which means you can get a result factorially quicker than linear.

This is the quickest canonical way, as in the fewest decisions barring a dedicated instruction. BFFFO scans for the highest bit, which means you must reverse bits, which is a similar desired but not implemented instruction in most CPUs. All this to say, every desire is not always implemented, and 680x0 is already a wishlist family of CISC (and later more formalized RISC) CPU.

It's not the only case in this program where i need to iterate through bits and do something only if '1' is found.
The (n and n-1) trick can work, but not always.


Quote:

Originally Posted by Photon (Post 1515444)
Much can be done with 256-byte tables, if 64K tables are too big. You can reverse bits, find the last set, fill, invert, and find contours. Lowest set bit is a contour.

I guess a 256-byte table is acceptable, but not if it's only marginally faster.


Quote:

Originally Posted by Photon (Post 1515444)
On different platforms, memory accesses will be more or less punishing, especially off-cache. If CPU is not specified then, there's no way you will get a single correct answer. It may very well be that the long loop in-cache with no memory access will be much faster than the shortest or fastest code with memory accesses.

Ok then, if you really want something specific as a config, consider 50Mhz 68030 with 60ns ram.


Quote:

Originally Posted by Photon (Post 1515444)
For 68000 and maybe also 68EC020/68EC030, I would LUT a byte at a time backward if nonzero:

Code:

LowestBit:
    move.l d0,(a0)+        ;in: d0.l, (a0) 4 bytes workspace
    moveq #0,d0
    REPT 4
    move.b -(a0),d1
    bne.s .lu
    addq.l #8,d0
    ENDR
    RTS
.lu:
    add.b (a2,d1.w),d0    ;a2=LUT
.done:
    RTS            ;out: d0.l=0-32 (32=no bit set)

This beats the canonical (mathematical) solution by one decision, but requires memory accesses.

Worst case here needs 4 memory accesses and 4 branches. That's much.
Also in this code 68000 is totally out of question.



Quote:

Originally Posted by Asman (Post 1515455)
What about this. I just checked few cases,

Code:

    moveq    #32-1,d1
loop
    add.l    d0,d0
    dbeq    d1,loop
    addq.b    #1,d1


Can not be made shorter and i would probably use this if target was shortest possible code. But it's not, and now we run in n² time...



Quote:

Originally Posted by ross (Post 1515457)
Yep, I don't think on 020+ we can do a faster routine than this:

Code:

; d2=input value, d1=tmp, d0=result
loop
    move.l d2,d1
    neg.l d1
    and.l d2,d1
    bfffo d1{0:32},d0
    eori.b #31,d0
;  (do something here with d0=bit#)
    eor.l d1,d2
    bne.b loop


How could I miss this... Thanks !


Quote:

Originally Posted by ross (Post 1515457)
Different on 000 where the possibility of using a LUT *maybe* could guarantee greater speed in some cases.

Here 000 is out of question. Not only because "too slow" but also due to so many misaligned memory accesses, incredible high amount of scale factors, etc.



Quote:

Originally Posted by Photon (Post 1515481)
On 68000-68010, bit field operations are not available. BFFFO is also unusable because it gives the highest bit and OP wants lowest bit.

I want highest bit too, sometimes - i do both depending on the case - but there it's easier so i didn't ask for it :
Code:

loop
 bfffo d2{0:32},d0
 eori.b #31,d0
(do something here)
 bclr d0,d2
 tst.w d2
 bne loop
 endm


Asman 06 November 2021 09:35

another approach (perhaps useless), better for 68000 (I think)

Code:

;d0 - in, d1 - out
lowest:
        moveq        #1,d2
        move.w        d0,d0
        bne.b        .go
        swap        d0
        moveq        #16,d2
.go
        moveq    #16-1,d1
.loop
        add.w    d0,d0
        dbeq    d1,.loop
        add.b    d2,d1
        rts


meynaf 06 November 2021 10:21

Quote:

Originally Posted by Asman (Post 1515515)
another approach (perhaps useless), better for 68000 (I think)

Right, but in this example it's clearly not about 68000.


I have other examples of similar code :
Code:

; pack the bits of d2 into d7, with mask=d1
 moveq #0,d7
.loop
 bfffo d2{0:32},d4
 beq.s .fin
 addq.l #1,d4
 lsl.l d4,d2
 lsl.l d4,d1
 addx.l d7,d7
 bra.s .loop
.fin

; unpacks the bits of d2 into d7, with mask=d1
 moveq #0,d7
.loop
 move.l d2,d4
 beq.s .fin
 subq.l #1,d2
 and.l d4,d2                ; n and (n-1) -> clr lower bit
 lsr.l #1,d1
 bcc.s .loop
 eor.l d2,d4
 add.l d4,d7
 bra.s .loop
.fin

I wonder if it's possible to make them faster without making them significantly bigger.
Or smaller, without making them slower.


All times are GMT +2. The time now is 14:23.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2022, vBulletin Solutions Inc.

Page generated in 0.05302 seconds with 11 queries