English Amiga Board


Go Back   English Amiga Board > Coders > Coders. Asm / Hardware

 
 
Thread Tools
Old 05 November 2021, 16:13   #1
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
index of lowest set bit

Time for a small code contest, even though i really need to do that in my code.

If you didn't get it with the title, it's about getting the bit number of the lowest (= rightmost) bit that's equal to 1. Having a value with only that bit set is easier (when you know how) but not enough as it then serves as index.
F.e. if input value is $12345678, it must return 3. For $80000000 it returns 31. For 0, don't care - unimportant for my use. Could even loop forever

I have my version, but curious to see what others can come up with.
68020+ code allowed (else it's a nightmare !).
meynaf is offline  
Old 05 November 2021, 17:12   #2
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,039
Code:
	move.b	(a0,d0.w),d1
	bne.b	done
	swap	d0
	move.b	(a0,d0.w),d1
done:
Yes, this is a joke version (a0=middle of a 64kb table), it takes a while to activate 020+ systems.
a/b is offline  
Old 05 November 2021, 17:20   #3
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Not only a joke, but actually it doesn't work. You're using same table for high and low words.
meynaf is offline  
Old 05 November 2021, 17:37   #4
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,039
Well, I was trying to supress my megalomania and settle with a single table ;p.
OK, now a bit more serious... How about:
Code:
	move.l	d0,d1
	subq.l	#1,d1
	eor.l	d1,d0
	bfffo	d0{0:32},d0
	eor.w	#31,d0
Clear all bits except the lowest one, then use bfffo to find it.

edit: EOR is my bane...
a/b is offline  
Old 05 November 2021, 17:51   #5
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by a/b View Post
Clear all bits except the lowest one, then use bfffo to find it.
That's my current method. But bfffo is so damned slow...
meynaf is offline  
Old 05 November 2021, 18:34   #6
ross
Defendit numerus
 
ross's Avatar
 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
Quote:
Originally Posted by a/b View Post
edit: EOR is my bane...
Then do not use it
Code:
   move.l  d0,d1
   subq.l  #1,d1
   eor.l   d1,d0
=
Code:
    move.l  d0,d1
    neg.l   d0
    and.l   d1,d0
ross is offline  
Old 05 November 2021, 18:54   #7
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Now a small, interesting problem : i need to clear said bit to be able to loop on them. (Didn't think about that when posting...)
Like this :
Code:
; d2=input value, d1=tmp, d0=result
loop
 move.w d2,d1
 subq.w #1,d1
 not.w d1
 and.w d2,d1
 bfffo d1{0:32},d0
 eori.b #31,d0
(do something here with d0=bit#)
 eor.w d1,d2
 bne loop
Above eor method removes one instruction from the calculation but destroys actual value...
meynaf is offline  
Old 05 November 2021, 20:18   #8
Photon
Moderator
 
Photon's Avatar
 
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,602
If it's a contest and not a question, maybe specify size or cycles, and which platform? It will affect reply relevance and usefulness to you.

Simple pseudocode version is
Code:
shift right
bcs x
dbf Dn
x: move.w Dn,...
More involved would be a list of 5 mask words until 0 result.

From your last comment though, I would rather just loop from 0-31 and use bclr and act on the cc.

These are 3 major ways.

Last edited by Photon; 05 November 2021 at 20:26.
Photon is offline  
Old 05 November 2021, 20:30   #9
Cyprian
Registered User
 
Join Date: Jul 2014
Location: Warsaw/Poland
Posts: 171
Quote:
Originally Posted by meynaf View Post
Time for a small code contest, even though i really need to do that in my code.

my 68000 compatible version, "move.l D2,D1" instructions can be removed if D2 does not have to be preserved:
Code:
; d2=input value, d1=tmp, d0=result
    moveq    #-1,D0
    move.l    D2,D1
Loop
        addq.l    #1,D0            
        lsr.l    #1,D1
    bcc        Loop
Quote:
Originally Posted by meynaf View Post
Now a small, interesting problem : i need to clear said bit to be able to loop on them.

what about "bclr.l D0,D2"?

Code:
; d2=input value, d1=tmp, d0=result
    moveq    #-1,D0
    move.l    D2,D1
Loop
        addq.l    #1,D0            
        lsr.l    #1,D1
    bcc        Loop

    bclr.l    D0,D2
Cyprian is offline  
Old 05 November 2021, 20:33   #10
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by Photon View Post
If it's a contest and not a question, maybe specify size or cycles, and which platform? It will affect reply relevance and usefulness to you.
It's both. A question turned into a contest.
Size ? Not too much. A few extra code words are ok, but huge table is overkill.
Cycles ? As little as possible, considering size must not explode.
Which platform ? I said 020+ was ok. Must be as agnostic as possible within this range.


Quote:
Originally Posted by Photon View Post
From your last comment though, I would rather just loop from 0-31 and use bclr and act on the cc.
No, too many loops for values having few ones will make it even more damned slow.
meynaf is offline  
Old 05 November 2021, 20:36   #11
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by Cyprian View Post
what about "bclr.l D0,D2"?
Yes, what about being slow.
Use of bclr needs extra tst before branching, where i just have eor+branch.
meynaf is offline  
Old 05 November 2021, 21:03   #12
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,039
If you are looping through bits it's better to use dbcs:
Code:
	moveq #32-1,d1
loop:	lsr.l	#1,d0
	dbcs	d1,loop
	eor.w	#31,d1
a/b is offline  
Old 05 November 2021, 22:11   #13
a/b
Registered User
 
Join Date: Jun 2016
Location: europe
Posts: 1,039
Quote:
Originally Posted by meynaf View Post
Now a small, interesting problem : i need to clear said bit to be able to loop on them. (Didn't think about that when posting...)
...
Did you resolve this? Your code is using dec+not which is equivalent to neg (and not is inc+neg), which is also what Ross happened to post earlier, so the loop should still work but with 1 instruction less.
I optimized those two+and with sub+eor before you mentioned you need to clear the bit.
a/b is offline  
Old 05 November 2021, 22:14   #14
Photon
Moderator
 
Photon's Avatar
 
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,602
Quote:
Originally Posted by meynaf View Post
even more damned slow.
You are performing a bitwise (non-atomic) operation, like a div. This makes it serial in nature, and this in itself requires a serial solution or a successive approximation (tree) solution. In some CPUs, there's dedicated silicon for the bitwise decision tree instead of branches.

In my 3 major solutions, the 5-word table corresponds to 5 comparisons in that tree, which means you can get a result factorially quicker than linear.

This is the quickest canonical way, as in the fewest decisions barring a dedicated instruction. BFFFO scans for the highest bit, which means you must reverse bits, which is a similar desired but not implemented instruction in most CPUs. All this to say, every desire is not always implemented, and 680x0 is already a wishlist family of CISC (and later more formalized RISC) CPU.

Much can be done with 256-byte tables, if 64K tables are too big. You can reverse bits, find the last set, fill, invert, and find contours. Lowest set bit is a contour.

On different platforms, memory accesses will be more or less punishing, especially off-cache. If CPU is not specified then, there's no way you will get a single correct answer. It may very well be that the long loop in-cache with no memory access will be much faster than the shortest or fastest code with memory accesses.

For 68000 and maybe also 68EC020/68EC030, I would LUT a byte at a time backward if nonzero:

Code:
LowestBit:
	move.l d0,(a0)+		;in: d0.l, (a0) 4 bytes workspace
	moveq #0,d0
	REPT 4
	move.b -(a0),d1
	bne.s .lu
	addq.l #8,d0
	ENDR
	RTS
.lu:
	add.b (a2,d1.w),d0	;a2=LUT
.done:
	RTS			;out: d0.l=0-32 (32=no bit set)
This beats the canonical (mathematical) solution by one decision, but requires memory accesses.

Last edited by Photon; 05 November 2021 at 22:32.
Photon is offline  
Old 05 November 2021, 22:56   #15
Asman
68k
 
Asman's Avatar
 
Join Date: Sep 2005
Location: Somewhere
Posts: 828
What about this. I just checked few cases,

Code:
	moveq	#32-1,d1
loop
	add.l	d0,d0
	dbeq	d1,loop
	addq.b	#1,d1
Asman is offline  
Old 05 November 2021, 23:10   #16
ross
Defendit numerus
 
ross's Avatar
 
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
Quote:
Originally Posted by a/b View Post
Your code is using dec+not which is equivalent to neg (and not is inc+neg), which is also what Ross happened to post earlier, so the loop should still work but with 1 instruction less.
I optimized those two+and with sub+eor before you mentioned you need to clear the bit.
Yep, I don't think on 020+ we can do a faster routine than this:

Code:
; d2=input value, d1=tmp, d0=result
loop
    move.l d2,d1
    neg.l d1
    and.l d2,d1
    bfffo d1{0:32},d0
    eori.b #31,d0
;   (do something here with d0=bit#)
    eor.l d1,d2
    bne.b loop
Different on 000 where the possibility of using a LUT *maybe* could guarantee greater speed in some cases.
ross is offline  
Old 06 November 2021, 02:53   #17
Photon
Moderator
 
Photon's Avatar
 
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,602
On 68000-68010, bit field operations are not available. BFFFO is also unusable because it gives the highest bit and OP wants lowest bit.

Last edited by Photon; 06 November 2021 at 03:04.
Photon is offline  
Old 06 November 2021, 09:11   #18
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by a/b View Post
If you are looping through bits it's better to use dbcs:
Code:
    moveq #32-1,d1
loop:    lsr.l    #1,d0
    dbcs    d1,loop
    eor.w    #31,d1
The trick here is that looping like this will always loop 32 times. This is inefficient for cases with few ones, and these are quite common.
As an example, if value $f0000000 is given, you'll execute your loop 32 times vs 4 in my case, and nothing can be done to cure that.


Quote:
Originally Posted by a/b View Post
Did you resolve this? Your code is using dec+not which is equivalent to neg (and not is inc+neg), which is also what Ross happened to post earlier, so the loop should still work but with 1 instruction less.
I optimized those two+and with sub+eor before you mentioned you need to clear the bit.
On my side it's "resolved" in the sense my code currently works. I just wanted to see if it could be made faster.



Quote:
Originally Posted by Photon View Post
You are performing a bitwise (non-atomic) operation, like a div. This makes it serial in nature, and this in itself requires a serial solution or a successive approximation (tree) solution. In some CPUs, there's dedicated silicon for the bitwise decision tree instead of branches.

In my 3 major solutions, the 5-word table corresponds to 5 comparisons in that tree, which means you can get a result factorially quicker than linear.

This is the quickest canonical way, as in the fewest decisions barring a dedicated instruction. BFFFO scans for the highest bit, which means you must reverse bits, which is a similar desired but not implemented instruction in most CPUs. All this to say, every desire is not always implemented, and 680x0 is already a wishlist family of CISC (and later more formalized RISC) CPU.
It's not the only case in this program where i need to iterate through bits and do something only if '1' is found.
The (n and n-1) trick can work, but not always.


Quote:
Originally Posted by Photon View Post
Much can be done with 256-byte tables, if 64K tables are too big. You can reverse bits, find the last set, fill, invert, and find contours. Lowest set bit is a contour.
I guess a 256-byte table is acceptable, but not if it's only marginally faster.


Quote:
Originally Posted by Photon View Post
On different platforms, memory accesses will be more or less punishing, especially off-cache. If CPU is not specified then, there's no way you will get a single correct answer. It may very well be that the long loop in-cache with no memory access will be much faster than the shortest or fastest code with memory accesses.
Ok then, if you really want something specific as a config, consider 50Mhz 68030 with 60ns ram.


Quote:
Originally Posted by Photon View Post
For 68000 and maybe also 68EC020/68EC030, I would LUT a byte at a time backward if nonzero:

Code:
LowestBit:
    move.l d0,(a0)+        ;in: d0.l, (a0) 4 bytes workspace
    moveq #0,d0
    REPT 4
    move.b -(a0),d1
    bne.s .lu
    addq.l #8,d0
    ENDR
    RTS
.lu:
    add.b (a2,d1.w),d0    ;a2=LUT
.done:
    RTS            ;out: d0.l=0-32 (32=no bit set)
This beats the canonical (mathematical) solution by one decision, but requires memory accesses.
Worst case here needs 4 memory accesses and 4 branches. That's much.
Also in this code 68000 is totally out of question.



Quote:
Originally Posted by Asman View Post
What about this. I just checked few cases,

Code:
    moveq    #32-1,d1
loop
    add.l    d0,d0
    dbeq    d1,loop
    addq.b    #1,d1
Can not be made shorter and i would probably use this if target was shortest possible code. But it's not, and now we run in n² time...



Quote:
Originally Posted by ross View Post
Yep, I don't think on 020+ we can do a faster routine than this:

Code:
; d2=input value, d1=tmp, d0=result
loop
    move.l d2,d1
    neg.l d1
    and.l d2,d1
    bfffo d1{0:32},d0
    eori.b #31,d0
;   (do something here with d0=bit#)
    eor.l d1,d2
    bne.b loop
How could I miss this... Thanks !


Quote:
Originally Posted by ross View Post
Different on 000 where the possibility of using a LUT *maybe* could guarantee greater speed in some cases.
Here 000 is out of question. Not only because "too slow" but also due to so many misaligned memory accesses, incredible high amount of scale factors, etc.



Quote:
Originally Posted by Photon View Post
On 68000-68010, bit field operations are not available. BFFFO is also unusable because it gives the highest bit and OP wants lowest bit.
I want highest bit too, sometimes - i do both depending on the case - but there it's easier so i didn't ask for it :
Code:
loop
 bfffo d2{0:32},d0
 eori.b #31,d0
(do something here)
 bclr d0,d2
 tst.w d2
 bne loop
 endm
meynaf is offline  
Old 06 November 2021, 09:35   #19
Asman
68k
 
Asman's Avatar
 
Join Date: Sep 2005
Location: Somewhere
Posts: 828
another approach (perhaps useless), better for 68000 (I think)

Code:
;d0 - in, d1 - out
lowest:
	moveq	#1,d2
	move.w	d0,d0
	bne.b	.go
	swap	d0
	moveq	#16,d2
.go
	moveq    #16-1,d1
.loop
	add.w    d0,d0
	dbeq    d1,.loop
	add.b    d2,d1
	rts
Asman is offline  
Old 06 November 2021, 10:21   #20
meynaf
son of 68k
 
meynaf's Avatar
 
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,323
Quote:
Originally Posted by Asman View Post
another approach (perhaps useless), better for 68000 (I think)
Right, but in this example it's clearly not about 68000.


I have other examples of similar code :
Code:
; pack the bits of d2 into d7, with mask=d1
 moveq #0,d7
.loop
 bfffo d2{0:32},d4
 beq.s .fin
 addq.l #1,d4
 lsl.l d4,d2
 lsl.l d4,d1
 addx.l d7,d7
 bra.s .loop
.fin

; unpacks the bits of d2 into d7, with mask=d1
 moveq #0,d7
.loop
 move.l d2,d4
 beq.s .fin
 subq.l #1,d2
 and.l d4,d2		; n and (n-1) -> clr lower bit
 lsr.l #1,d1
 bcc.s .loop
 eor.l d2,d4
 add.l d4,d7
 bra.s .loop
.fin
I wonder if it's possible to make them faster without making them significantly bigger.
Or smaller, without making them slower.
meynaf is offline  
 


Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)
 
Thread Tools

Similar Threads
Thread Thread Starter Forum Replies Last Post
Sprites on lowest area of PAL Screen Tigerskunk Coders. Asm / Hardware 6 02 December 2017 20:34
Is WinUAE optimized for lowest possible input lag? Dr.Venom support.WinUAE 22 03 July 2011 02:14
Trade - Blizzard 1230IV + SCSI + Cash for Lowest Spec PPC fitzsteve Swapshop 4 10 December 2010 14:25
Lowest Spec for WinUAE ?? Methanoid support.WinUAE 29 20 March 2008 17:23
pure bit not set? amigalizard support.Apps 4 04 September 2005 05:47

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

Forum Jump


All times are GMT +2. The time now is 21:26.

Top

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2024, vBulletin Solutions Inc.
Page generated in 0.13811 seconds with 13 queries