Quote:
Originally Posted by meynaf
even more damned slow.
|
You are performing a bitwise (non-atomic) operation, like a div. This makes it serial in nature, and this in itself requires a serial solution or a successive approximation (tree) solution. In some CPUs, there's dedicated silicon for the bitwise decision tree instead of branches.
In my 3 major solutions, the 5-word table corresponds to 5 comparisons in that tree, which means you can get a result factorially quicker than linear.
This is the quickest canonical way, as in the fewest decisions barring a dedicated instruction. BFFFO scans for the highest bit, which means you must reverse bits, which is a similar desired but not implemented instruction in most CPUs. All this to say, every desire is not always implemented, and 680x0 is already a wishlist family of CISC (and later more formalized RISC) CPU.
Much can be done with 256-byte tables, if 64K tables are too big. You can reverse bits, find the last set, fill, invert, and find contours. Lowest set bit is a contour.
On different platforms, memory accesses will be more or less punishing, especially off-cache. If CPU is not specified then, there's no way you will get a single correct answer. It may very well be that the long loop in-cache with no memory access will be much faster than the shortest or fastest code with memory accesses.
For 68000 and maybe also 68EC020/68EC030, I would LUT a byte at a time backward if nonzero:
Code:
LowestBit:
move.l d0,(a0)+ ;in: d0.l, (a0) 4 bytes workspace
moveq #0,d0
REPT 4
move.b -(a0),d1
bne.s .lu
addq.l #8,d0
ENDR
RTS
.lu:
add.b (a2,d1.w),d0 ;a2=LUT
.done:
RTS ;out: d0.l=0-32 (32=no bit set)
This beats the canonical (mathematical) solution by one decision, but requires memory accesses.