English Amiga Board - View Single Post

Photon · 05 November 2021, 22:14

Quote:

Originally Posted by meynaf

even more damned slow.

You are performing a bitwise (non-atomic) operation, like a div. This makes it serial in nature, and this in itself requires a serial solution or a successive approximation (tree) solution. In some CPUs, there's dedicated silicon for the bitwise decision tree instead of branches.

In my 3 major solutions, the 5-word table corresponds to 5 comparisons in that tree, which means you can get a result factorially quicker than linear.

This is the quickest canonical way, as in the fewest decisions barring a dedicated instruction. BFFFO scans for the highest bit, which means you must reverse bits, which is a similar desired but not implemented instruction in most CPUs. All this to say, every desire is not always implemented, and 680x0 is already a wishlist family of CISC (and later more formalized RISC) CPU.

Much can be done with 256-byte tables, if 64K tables are too big. You can reverse bits, find the last set, fill, invert, and find contours. Lowest set bit is a contour.

On different platforms, memory accesses will be more or less punishing, especially off-cache. If CPU is not specified then, there's no way you will get a single correct answer. It may very well be that the long loop in-cache with no memory access will be much faster than the shortest or fastest code with memory accesses.

For 68000 and maybe also 68EC020/68EC030, I would LUT a byte at a time backward if nonzero:

Code:

LowestBit:
	move.l d0,(a0)+		;in: d0.l, (a0) 4 bytes workspace
	moveq #0,d0
	REPT 4
	move.b -(a0),d1
	bne.s .lu
	addq.l #8,d0
	ENDR
	RTS
.lu:
	add.b (a2,d1.w),d0	;a2=LUT
.done:
	RTS			;out: d0.l=0-32 (32=no bit set)

This beats the canonical (mathematical) solution by one decision, but requires memory accesses.