I would like to know if there is something better than the bsf instruction to search the fisrt occurence of the bit 1 in a byte / Word / DWord... Because it is slow !!!!

Are there algos / instructions / tips wich is very faster ??

Please Help me
Posted on 2002-02-21 16:53:24 by DarkEmpire
I posted here one method invented by my freind Nick Poroshin not
so long ago - search it here (~6 cloks). Based on FPU.
For other things, it depends on what you need it for.
For example I did some flag analyze with invented method
to get last flag, not ordering number but value.
Posted on 2002-02-21 17:24:03 by The Svin
Can you give me an exemple / explain me about your flag analyze ?
Posted on 2002-02-21 18:02:29 by DarkEmpire
BSF / BSR are very fast on the P6 core, they are only slow on the older cores.

There is:

mov ecx, eax
neg ecx
and eax, ecx

This will mask out all but the least significant set bit.

Posted on 2002-02-21 18:03:28 by Mirno
Yes, but this piece of code don't give me the position on the set bit ...

Or there is an other tip i have not found...

Can you help me ?
Posted on 2002-02-22 02:54:15 by DarkEmpire

[Replace Bit scan instructions] [Assembler][/][80386][+FPU]

BSF and BSR are the poorest optimized instructions on the Pentium, taking
approximately 11 + 2*n clock cycles, where n is the number of zeros
skipped. (on later processors it takes only 1 or 2) The following code
emulates BSR ECX,EAX:

test eax,eax
jz short bs1
mov [dword ptr temp],eax
mov [dword ptr temp+4],0
fild [qword ptr temp]
fstp [qword ptr temp]
wait ; WAIT only needed for compatibility with earlier processors
mov ecx,[dword ptr temp+4]
shr ecx,20
sub ecx,3ffh
test eax,eax ; clear zero flag

The following code emulates BSF ECX,EAX:

test eax,eax
jz short bs2
mov [dword ptr temp+4],ecx
sub ecx,eax
and eax,ecx
mov [dword ptr temp],eax
fild [qword ptr temp]
fstp [qword ptr temp]
wait ; WAIT only needed for compatibility with earlier processors
mov ecx,[dword ptr temp+4]
shr ecx,20
sub ecx,3ffh
test eax,eax ; clear zero flag

This gem comes from Agner Fog's "How to optimize for Pentium(tm)
microprocessor". This manual is highly recommended.
Gem writer: Agner Fog
last updated: 1998-03-16
Posted on 2002-02-22 09:26:19 by The Svin
"... (on later processors it takes only 1 or 2) ... "

Do you know after which processor bsf takes only 1 or 2 cycles ?

Can i consider that on PIII And Athlon bsf takes 1 or 2 cycle ?

Because, i don't know how i would test the number of cycle the bsf takes...

So, must i really optimize bsf with the piece of code you give me ?? or must i keep bsf ??
Posted on 2002-02-22 13:41:20 by DarkEmpire
Here is a piece of code with the rdtsc (an instruction i just discovered):

mov dwLow3, eax
mov dwHigh3, edx

mov dwLow4, eax
mov dwHigh4, edx

mov dwLow1, eax
mov dwHigh1, edx

bsf eax, nbToScan

mov BitSetPos,eax

mov dwLow2, eax
mov dwHigh2, edx

t3 = ( (__int64)dwHigh3 << 32 ) | (__int64)dwLow3;
t4 = ( (__int64)dwHigh4 << 32 ) | (__int64)dwLow4;
t1 = ( (__int64)dwHigh1 << 32 ) | (__int64)dwLow1;
t2 = ( (__int64)dwHigh2 << 32 ) | (__int64)dwLow2;
b = t2 - t1 - (t4 - t3);

b is the number of cycles to do 3 mov and bsf, if i understood what rdtsc...

The result was b = 2 cycles (with b = 4 and b = 2^31)

when i tried the piece of code you proposed, the result was more than more than 6000 cycles...

Have i done an error when i tried what you proposed me ?

In, fact i was surprised to see the result with the bsf method, because, when i read articles about bsf, all explained that bsf is slow (between 6 and 42 cycles) on PC 586 and more.
Posted on 2002-02-22 14:51:04 by DarkEmpire
From Agner... a bit down the page. It was a reading mistake I made too.

These emulation codes should not be used on the PPro, PII and PIII, where the bit scan instructions take only 1 or 2 clocks, and where the emulation codes shown above have two partial memory stalls.

So use them all you want unless your on an older machine.
Posted on 2002-02-22 15:11:03 by Mutant Slime

bsf and bsr are intructions for intel processors... Are they useable with AMD processors too ? Or is there an equivalent instruction ?

Thanks for your help
Posted on 2002-02-23 13:02:09 by DarkEmpire
Check out www.amd.com they have manuals for the processors just like Intel does. A quick look at the athlon optimization manual shows that athlon has the instruction. Is it efficient on the Athlon? Well, I'll let YOU read the book ;)

Full link to manual stuff
Posted on 2002-02-25 09:44:10 by Mutant Slime