index of lowest set bit

meynaf · 05 November 2021, 16:13

Time for a small code contest, even though i really need to do that in my code.

If you didn't get it with the title, it's about getting the bit number of the lowest (= rightmost) bit that's equal to 1. Having a value with only that bit set is easier (when you know how) but not enough as it then serves as index.
F.e. if input value is $12345678, it must return 3. For $80000000 it returns 31. For 0, don't care - unimportant for my use. Could even loop forever

I have my version, but curious to see what others can come up with.
68020+ code allowed (else it's a nightmare !).

a/b · 05 November 2021, 17:12

Code:

	move.b	(a0,d0.w),d1
	bne.b	done
	swap	d0
	move.b	(a0,d0.w),d1
done:

Yes, this is a joke version (a0=middle of a 64kb table), it takes a while to activate 020+ systems.

meynaf · 05 November 2021, 17:20

Not only a joke, but actually it doesn't work. You're using same table for high and low words.

a/b · 05 November 2021, 17:37

Well, I was trying to supress my megalomania and settle with a single table ;p.
OK, now a bit more serious... How about:

Code:

	move.l	d0,d1
	subq.l	#1,d1
	eor.l	d1,d0
	bfffo	d0{0:32},d0
	eor.w	#31,d0

Clear all bits except the lowest one, then use bfffo to find it.

edit: EOR is my bane...

meynaf · 05 November 2021, 17:51

Quote:

Originally Posted by a/b

Clear all bits except the lowest one, then use bfffo to find it.

That's my current method. But bfffo is so damned slow...

ross · 05 November 2021, 18:34

Quote:

Originally Posted by a/b

edit: EOR is my bane...

Then do not use it

Code:

   move.l  d0,d1
   subq.l  #1,d1
   eor.l   d1,d0

=

Code:

    move.l  d0,d1
    neg.l   d0
    and.l   d1,d0

meynaf · 05 November 2021, 18:54

Now a small, interesting problem : i need to clear said bit to be able to loop on them. (Didn't think about that when posting...)
Like this :

Code:

; d2=input value, d1=tmp, d0=result
loop
 move.w d2,d1
 subq.w #1,d1
 not.w d1
 and.w d2,d1
 bfffo d1{0:32},d0
 eori.b #31,d0
(do something here with d0=bit#)
 eor.w d1,d2
 bne loop

Above eor method removes one instruction from the calculation but destroys actual value...

Photon · 05 November 2021, 20:18

If it's a contest and not a question, maybe specify size or cycles, and which platform? It will affect reply relevance and usefulness to you.

Simple pseudocode version is

Code:

shift right
bcs x
dbf Dn
x: move.w Dn,...

More involved would be a list of 5 mask words until 0 result.

From your last comment though, I would rather just loop from 0-31 and use bclr and act on the cc.

These are 3 major ways.

Cyprian · 05 November 2021, 20:30

Quote:

Originally Posted by meynaf

Time for a small code contest, even though i really need to do that in my code.

my 68000 compatible version, "move.l D2,D1" instructions can be removed if D2 does not have to be preserved:

Code:

; d2=input value, d1=tmp, d0=result
    moveq    #-1,D0
    move.l    D2,D1
Loop
        addq.l    #1,D0            
        lsr.l    #1,D1
    bcc        Loop

Quote:

Originally Posted by meynaf

Now a small, interesting problem : i need to clear said bit to be able to loop on them.

what about "bclr.l D0,D2"?

Code:

; d2=input value, d1=tmp, d0=result
    moveq    #-1,D0
    move.l    D2,D1
Loop
        addq.l    #1,D0            
        lsr.l    #1,D1
    bcc        Loop

    bclr.l    D0,D2

meynaf · 05 November 2021, 20:33

Quote:

Originally Posted by Photon

If it's a contest and not a question, maybe specify size or cycles, and which platform? It will affect reply relevance and usefulness to you.

It's both. A question turned into a contest.
Size ? Not too much. A few extra code words are ok, but huge table is overkill.
Cycles ? As little as possible, considering size must not explode.
Which platform ? I said 020+ was ok. Must be as agnostic as possible within this range.

Quote:

Originally Posted by Photon

From your last comment though, I would rather just loop from 0-31 and use bclr and act on the cc.

No, too many loops for values having few ones will make it even more damned slow.

meynaf · 05 November 2021, 20:36

Quote:

Originally Posted by Cyprian

what about "bclr.l D0,D2"?

Yes, what about being slow.
Use of bclr needs extra tst before branching, where i just have eor+branch.

a/b · 05 November 2021, 21:03

If you are looping through bits it's better to use dbcs:

Code:

	moveq #32-1,d1
loop:	lsr.l	#1,d0
	dbcs	d1,loop
	eor.w	#31,d1

a/b · 05 November 2021, 22:11

Quote:

Originally Posted by meynaf

Now a small, interesting problem : i need to clear said bit to be able to loop on them. (Didn't think about that when posting...)
...

Did you resolve this? Your code is using dec+not which is equivalent to neg (and not is inc+neg), which is also what Ross happened to post earlier, so the loop should still work but with 1 instruction less.
I optimized those two+and with sub+eor before you mentioned you need to clear the bit.

Photon · 05 November 2021, 22:14

Quote:

Originally Posted by meynaf

even more damned slow.

You are performing a bitwise (non-atomic) operation, like a div. This makes it serial in nature, and this in itself requires a serial solution or a successive approximation (tree) solution. In some CPUs, there's dedicated silicon for the bitwise decision tree instead of branches.

In my 3 major solutions, the 5-word table corresponds to 5 comparisons in that tree, which means you can get a result factorially quicker than linear.

This is the quickest canonical way, as in the fewest decisions barring a dedicated instruction. BFFFO scans for the highest bit, which means you must reverse bits, which is a similar desired but not implemented instruction in most CPUs. All this to say, every desire is not always implemented, and 680x0 is already a wishlist family of CISC (and later more formalized RISC) CPU.

Much can be done with 256-byte tables, if 64K tables are too big. You can reverse bits, find the last set, fill, invert, and find contours. Lowest set bit is a contour.

On different platforms, memory accesses will be more or less punishing, especially off-cache. If CPU is not specified then, there's no way you will get a single correct answer. It may very well be that the long loop in-cache with no memory access will be much faster than the shortest or fastest code with memory accesses.

For 68000 and maybe also 68EC020/68EC030, I would LUT a byte at a time backward if nonzero:

Code:

LowestBit:
	move.l d0,(a0)+		;in: d0.l, (a0) 4 bytes workspace
	moveq #0,d0
	REPT 4
	move.b -(a0),d1
	bne.s .lu
	addq.l #8,d0
	ENDR
	RTS
.lu:
	add.b (a2,d1.w),d0	;a2=LUT
.done:
	RTS			;out: d0.l=0-32 (32=no bit set)

This beats the canonical (mathematical) solution by one decision, but requires memory accesses.

Asman · 05 November 2021, 22:56

What about this. I just checked few cases,

Code:

	moveq	#32-1,d1
loop
	add.l	d0,d0
	dbeq	d1,loop
	addq.b	#1,d1

ross · 05 November 2021, 23:10

Quote:

Originally Posted by a/b

Your code is using dec+not which is equivalent to neg (and not is inc+neg), which is also what Ross happened to post earlier, so the loop should still work but with 1 instruction less.
I optimized those two+and with sub+eor before you mentioned you need to clear the bit.

Yep, I don't think on 020+ we can do a faster routine than this:

Code:

; d2=input value, d1=tmp, d0=result
loop
    move.l d2,d1
    neg.l d1
    and.l d2,d1
    bfffo d1{0:32},d0
    eori.b #31,d0
;   (do something here with d0=bit#)
    eor.l d1,d2
    bne.b loop

Different on 000 where the possibility of using a LUT *maybe* could guarantee greater speed in some cases.

Photon · 06 November 2021, 02:53

On 68000-68010, bit field operations are not available. BFFFO is also unusable because it gives the highest bit and OP wants lowest bit.

meynaf · 06 November 2021, 09:11

Quote:

Originally Posted by a/b

If you are looping through bits it's better to use dbcs:

Code:

    moveq #32-1,d1
loop:    lsr.l    #1,d0
    dbcs    d1,loop
    eor.w    #31,d1

The trick here is that looping like this will always loop 32 times. This is inefficient for cases with few ones, and these are quite common.
As an example, if value $f0000000 is given, you'll execute your loop 32 times vs 4 in my case, and nothing can be done to cure that.

Quote:

Originally Posted by a/b

Did you resolve this? Your code is using dec+not which is equivalent to neg (and not is inc+neg), which is also what Ross happened to post earlier, so the loop should still work but with 1 instruction less.
I optimized those two+and with sub+eor before you mentioned you need to clear the bit.

On my side it's "resolved" in the sense my code currently works. I just wanted to see if it could be made faster.

Quote:

Originally Posted by Photon

You are performing a bitwise (non-atomic) operation, like a div. This makes it serial in nature, and this in itself requires a serial solution or a successive approximation (tree) solution. In some CPUs, there's dedicated silicon for the bitwise decision tree instead of branches.

In my 3 major solutions, the 5-word table corresponds to 5 comparisons in that tree, which means you can get a result factorially quicker than linear.

This is the quickest canonical way, as in the fewest decisions barring a dedicated instruction. BFFFO scans for the highest bit, which means you must reverse bits, which is a similar desired but not implemented instruction in most CPUs. All this to say, every desire is not always implemented, and 680x0 is already a wishlist family of CISC (and later more formalized RISC) CPU.

It's not the only case in this program where i need to iterate through bits and do something only if '1' is found.
The (n and n-1) trick can work, but not always.

Quote:

Originally Posted by Photon

Much can be done with 256-byte tables, if 64K tables are too big. You can reverse bits, find the last set, fill, invert, and find contours. Lowest set bit is a contour.

I guess a 256-byte table is acceptable, but not if it's only marginally faster.

Quote:

Originally Posted by Photon

On different platforms, memory accesses will be more or less punishing, especially off-cache. If CPU is not specified then, there's no way you will get a single correct answer. It may very well be that the long loop in-cache with no memory access will be much faster than the shortest or fastest code with memory accesses.

Ok then, if you really want something specific as a config, consider 50Mhz 68030 with 60ns ram.

Quote:

Originally Posted by Photon

For 68000 and maybe also 68EC020/68EC030, I would LUT a byte at a time backward if nonzero:

Code:

LowestBit:
    move.l d0,(a0)+        ;in: d0.l, (a0) 4 bytes workspace
    moveq #0,d0
    REPT 4
    move.b -(a0),d1
    bne.s .lu
    addq.l #8,d0
    ENDR
    RTS
.lu:
    add.b (a2,d1.w),d0    ;a2=LUT
.done:
    RTS            ;out: d0.l=0-32 (32=no bit set)

This beats the canonical (mathematical) solution by one decision, but requires memory accesses.

Worst case here needs 4 memory accesses and 4 branches. That's much.
Also in this code 68000 is totally out of question.

Quote:

Originally Posted by Asman

What about this. I just checked few cases,

Code:

    moveq    #32-1,d1
loop
    add.l    d0,d0
    dbeq    d1,loop
    addq.b    #1,d1

Can not be made shorter and i would probably use this if target was shortest possible code. But it's not, and now we run in n² time...

Quote:

Originally Posted by ross

Yep, I don't think on 020+ we can do a faster routine than this:

Code:

; d2=input value, d1=tmp, d0=result
loop
    move.l d2,d1
    neg.l d1
    and.l d2,d1
    bfffo d1{0:32},d0
    eori.b #31,d0
;   (do something here with d0=bit#)
    eor.l d1,d2
    bne.b loop

How could I miss this... Thanks !

Quote:

Originally Posted by ross

Different on 000 where the possibility of using a LUT *maybe* could guarantee greater speed in some cases.

Here 000 is out of question. Not only because "too slow" but also due to so many misaligned memory accesses, incredible high amount of scale factors, etc.

Quote:

Originally Posted by Photon

On 68000-68010, bit field operations are not available. BFFFO is also unusable because it gives the highest bit and OP wants lowest bit.

I want highest bit too, sometimes - i do both depending on the case - but there it's easier so i didn't ask for it :

Code:

loop
 bfffo d2{0:32},d0
 eori.b #31,d0
(do something here)
 bclr d0,d2
 tst.w d2
 bne loop
 endm

Asman · 06 November 2021, 09:35

another approach (perhaps useless), better for 68000 (I think)

Code:

;d0 - in, d1 - out
lowest:
	moveq	#1,d2
	move.w	d0,d0
	bne.b	.go
	swap	d0
	moveq	#16,d2
.go
	moveq    #16-1,d1
.loop
	add.w    d0,d0
	dbeq    d1,.loop
	add.b    d2,d1
	rts

meynaf · 06 November 2021, 10:21

Quote:

Originally Posted by Asman

another approach (perhaps useless), better for 68000 (I think)

Right, but in this example it's clearly not about 68000.

I have other examples of similar code :

Code:

; pack the bits of d2 into d7, with mask=d1
 moveq #0,d7
.loop
 bfffo d2{0:32},d4
 beq.s .fin
 addq.l #1,d4
 lsl.l d4,d2
 lsl.l d4,d1
 addx.l d7,d7
 bra.s .loop
.fin

; unpacks the bits of d2 into d7, with mask=d1
 moveq #0,d7
.loop
 move.l d2,d4
 beq.s .fin
 subq.l #1,d2
 and.l d4,d2		; n and (n-1) -> clr lower bit
 lsr.l #1,d1
 bcc.s .loop
 eor.l d2,d4
 add.l d4,d7
 bra.s .loop
.fin

I wonder if it's possible to make them faster without making them significantly bigger.
Or smaller, without making them slower.

05 November 2021, 17:12	#2
a/b Registered User Join Date: Jun 2016 Location: europe Posts: 1,043	Code: move.b (a0,d0.w),d1 bne.b done swap d0 move.b (a0,d0.w),d1 done: Yes, this is a joke version (a0=middle of a 64kb table), it takes a while to activate 020+ systems.

05 November 2021, 17:37	#4
a/b Registered User Join Date: Jun 2016 Location: europe Posts: 1,043	Well, I was trying to supress my megalomania and settle with a single table ;p. OK, now a bit more serious... How about: Code: move.l d0,d1 subq.l #1,d1 eor.l d1,d0 bfffo d0{0:32},d0 eor.w #31,d0 Clear all bits except the lowest one, then use bfffo to find it. edit: EOR is my bane...

05 November 2021, 18:54	#7
meynaf son of 68k Join Date: Nov 2007 Location: Lyon / France Age: 51 Posts: 5,335	Now a small, interesting problem : i need to clear said bit to be able to loop on them. (Didn't think about that when posting...) Like this : Code: ; d2=input value, d1=tmp, d0=result loop move.w d2,d1 subq.w #1,d1 not.w d1 and.w d2,d1 bfffo d1{0:32},d0 eori.b #31,d0 (do something here with d0=bit#) eor.w d1,d2 bne loop Above eor method removes one instruction from the calculation but destroys actual value...

05 November 2021, 21:03	#12
a/b Registered User Join Date: Jun 2016 Location: europe Posts: 1,043	If you are looping through bits it's better to use dbcs: Code: moveq #32-1,d1 loop: lsr.l #1,d0 dbcs d1,loop eor.w #31,d1

05 November 2021, 22:56	#15
Asman 68k Join Date: Sep 2005 Location: Somewhere Posts: 828	What about this. I just checked few cases, Code: moveq #32-1,d1 loop add.l d0,d0 dbeq d1,loop addq.b #1,d1

05 November 2021, 16:13	#1
meynaf son of 68k Join Date: Nov 2007 Location: Lyon / France Age: 51 Posts: 5,335	index of lowest set bit Time for a small code contest, even though i really need to do that in my code. If you didn't get it with the title, it's about getting the bit number of the lowest (= rightmost) bit that's equal to 1. Having a value with only that bit set is easier (when you know how) but not enough as it then serves as index. F.e. if input value is $12345678, it must return 3. For $80000000 it returns 31. For 0, don't care - unimportant for my use. Could even loop forever I have my version, but curious to see what others can come up with. 68020+ code allowed (else it's a nightmare !).

05 November 2021, 17:20	#3
meynaf son of 68k Join Date: Nov 2007 Location: Lyon / France Age: 51 Posts: 5,335	Not only a joke, but actually it doesn't work. You're using same table for high and low words.

05 November 2021, 20:18	#8
Photon Moderator Join Date: Nov 2004 Location: Eksjö / Sweden Posts: 5,627	If it's a contest and not a question, maybe specify size or cycles, and which platform? It will affect reply relevance and usefulness to you. Simple pseudocode version is Code: shift right bcs x dbf Dn x: move.w Dn,... More involved would be a list of 5 mask words until 0 result. From your last comment though, I would rather just loop from 0-31 and use bclr and act on the cc. These are 3 major ways. Last edited by Photon; 05 November 2021 at 20:26.

06 November 2021, 02:53	#17
Photon Moderator Join Date: Nov 2004 Location: Eksjö / Sweden Posts: 5,627	On 68000-68010, bit field operations are not available. BFFFO is also unusable because it gives the highest bit and OP wants lowest bit. Last edited by Photon; 06 November 2021 at 03:04.

06 November 2021, 09:35	#19
Asman 68k Join Date: Sep 2005 Location: Somewhere Posts: 828	another approach (perhaps useless), better for 68000 (I think) Code: ;d0 - in, d1 - out lowest: moveq #1,d2 move.w d0,d0 bne.b .go swap d0 moveq #16,d2 .go moveq #16-1,d1 .loop add.w d0,d0 dbeq d1,.loop add.b d2,d1 rts

Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
Sprites on lowest area of PAL Screen	Tigerskunk	Coders. Asm / Hardware	6	02 December 2017 20:34
Is WinUAE optimized for lowest possible input lag?	Dr.Venom	support.WinUAE	22	03 July 2011 02:14
Trade - Blizzard 1230IV + SCSI + Cash for Lowest Spec PPC	fitzsteve	Swapshop	4	10 December 2010 14:25
Lowest Spec for WinUAE ??	Methanoid	support.WinUAE	29	20 March 2008 17:23
pure bit not set?	amigalizard	support.Apps	4	04 September 2005 05:47