Can I convert the following function to a faster C macro?

int GetLSB(int32 x) {
    int result;
    __asm mov eax,-1        // intended result when x is zero
    __asm bsf eax,x         // index of the lowest set bit
    __asm mov result,eax
    return result;
}

Posted on 2002-07-20 23:33:13 by doby
Doesn't your compiler support the INLINE keyword?

Might find something here:
Posted on 2002-07-21 01:44:54 by bitRAKE
bitRAKE, does your macro have a problem if the code before the macro is using the EAX register?

Using INLINE is easier, but it's not standard, and I'm not sure INLINE will be as fast as a MACRO.

As for the mov eax,-1: I don't know why, but when I test the program with x equal to zero the result is -1, so I use it.
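
For comparison, here is a portable sketch of the same contract (index of the lowest set bit, or -1 when x is zero) that does not depend on what bsf leaves in eax for a zero input. Intel documents the destination as undefined in that case, so the mov eax,-1 trick relies on observed CPU behaviour. The name GetLSB_portable and the use of uint32_t are assumptions for illustration, not something from this thread, and the loop is meant to be clear rather than fast.

```c
#include <stdint.h>

/* Hypothetical portable stand-in for GetLSB: returns the index (0..31)
   of the lowest set bit of x, or -1 if x is zero. */
static int GetLSB_portable(uint32_t x)
{
    int index = 0;

    if (x == 0)
        return -1;              /* explicit zero case, no reliance on bsf */

    while ((x & 1u) == 0) {     /* shift right until bit 0 is set */
        x >>= 1;
        ++index;
    }
    return index;
}
```

Something like this can serve as a reference to check an asm version against before timing it.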

Thanks for the link, it's very good.
Posted on 2002-07-21 02:17:12 by doby
I was wrong about the syntax. Correct syntax for VC++:

You're right about INLINE, too - it doesn't work. :)
Posted on 2002-07-21 02:38:35 by bitRAKE
bitRAKE: one of the reasons why I sometimes really hate C/C++ is that if you try to compile your perfectly logical piece of code, it won't compile. You must use single-line __asm statements inside a multi-line #define, because the \ means "treat all of this as one line" (as #define requires), while asm { } requires the individual asm instructions to be on separate lines. I hope I explained it well. ;)
C/C++ compilers tend to be f*cking annoying when it comes to throwing stupid errors.. they're really a major source of stress for me.
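
A rough sketch of what that looks like in practice, assuming 32-bit VC++ for the asm branch (untested here) and a plain C fallback everywhere else. The names GET_LSB and get_lsb_via_macro are made up for illustration. In the asm branch, x and result have to be plain int variables, not expressions.

```c
#include <stdint.h>

#if defined(_MSC_VER) && defined(_M_IX86)
/* 32-bit MSVC only: every instruction carries its own __asm keyword, so
   the whole body can sit on one logical line after the \ continuations. */
#define GET_LSB(result, x)      \
    {                           \
        __asm mov eax, -1       \
        __asm bsf eax, x        \
        __asm mov result, eax   \
    }
#else
/* Fallback for other compilers: same contract in plain C. */
#define GET_LSB(result, x)                      \
    do {                                        \
        uint32_t get_lsb_v = (uint32_t)(x);     \
        (result) = -1;                          \
        if (get_lsb_v != 0) {                   \
            (result) = 0;                       \
            while ((get_lsb_v & 1u) == 0) {     \
                get_lsb_v >>= 1;                \
                ++(result);                     \
            }                                   \
        }                                       \
    } while (0)
#endif

/* Hypothetical helper showing how the macro is used. */
static int get_lsb_via_macro(int x)
{
    int r;
    GET_LSB(r, x);
    return r;
}
```

Any comment placed inside the asm variant of the macro would need to be a /* */ comment, since a // or ; comment would swallow the rest of the joined line.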

doby: you don't have to worry about the registers you use inside inline asm, the compiler will save/restore them anyway. <irony>Very efficient</irony> (well, some better compilers will spot which registers you don't actually use, and will avoid saving/restoring them).

Now my wild guess:

__declspec(naked) int __fastcall GetLSB(int32 x) {
    __asm {
        mov eax,-1
        bsf eax,ecx     ; x arrives in ecx under __fastcall
        ret             ; a naked function needs its own ret
    }
}

So far C/C++ compilers can be tweaked to offer such optimizations, so you're lucky. There are more advanced cases where C/C++ shows its misery, though.
Posted on 2002-07-21 02:51:49 by Maverick
A logical person might also try something like:
int inline GetLSB(int val) {
    int register temp;
    __asm bsf temp,val
    return temp;
}
But the compiler doesn't have a clue.
Posted on 2002-07-21 02:56:03 by bitRAKE
In fact, I didn't bother to put an "inline" because all the C/C++ compilers I know of won't inline the function anyway if there's an asm statement inside.
Even worse, e.g. VisualC also has __forceinline, a name that sounds like a guarantee.. too bad that if there's an asm statement inside it won't inline *anyway*. ;)
Posted on 2002-07-21 07:24:19 by Maverick
VC7 does inline the function, but the compiler doesn't know what to do with the return value (i.e. the result is not stored). Something like:
int inline GetLSB(int val) {
    int temp;
    __asm {
        bsf eax,val
        mov temp,eax
    }
    return temp;
}
...does work, and only produces two instructions.
Posted on 2002-07-21 10:56:38 by bitRAKE

It is kinda hard to explain, but I noticed when building with VC7 in release mode (optimizations turned on),
even if you declared a variable (int i;), it would try to keep the variable in a register. The compiler will remove a variable that it does not need if the variable can be kept in a register for its lifetime. It might appear that it does not know what to do with it, or it might be assigning much later than you expect, or removing the variable altogether. The VC7 compiler is quite clever.
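
This is easy to trip over when testing a small function: if the result is never used, an optimizing compiler is free to delete the whole computation. One common workaround, sketched here with hypothetical names (lowest_bit, sink, run_test), is to store each result into a volatile object so that the compiler has to keep it.

```c
#include <stdint.h>

/* Hypothetical function under test: index of lowest set bit, -1 for zero. */
static int lowest_bit(uint32_t x)
{
    int i;
    if (x == 0)
        return -1;
    for (i = 0; (x & 1u) == 0; ++i)
        x >>= 1;
    return i;
}

/* Writes to a volatile object count as observable behaviour, so the
   compiler must keep the computation that feeds them. */
static volatile int sink;

/* Calls lowest_bit() on 1..n and returns the sum of the results. */
static long run_test(uint32_t n)
{
    long total = 0;
    uint32_t v;
    for (v = 1; v <= n; ++v) {
        int r = lowest_bit(v);
        sink = r;           /* forces each result to be stored */
        total += r;
    }
    return total;
}
```

Returning (and then printing or checking) the total has a similar effect, since the value is then actually used.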
Posted on 2002-07-21 11:09:38 by ThoughtCriminal
Thank you, ThoughtCriminal - it was my testing method - VC7 optimized out the code because the value wasn't used. ;) After further tests, this does work:
int inline GetLSB(int val) {
    __asm bsf eax,val   // result is left in eax, the return register
}
The VC7 compiler has been very impressive compared to VC6. In one program I had a public object variable which was the same in every instance of that object - the compiler recognized this and unrolled the inner loops of the methods. These kinds of global optimizations are very important. I don't understand why it creates the loop check code that it does, and it still will not fully use the complex addressing modes, i.e.:

; (-2,-2)
mov eax,DWORD PTR [esi-(4*2)] ; X0
; ( 0, 0)
mov edx,DWORD PTR [esi+ebx*2] ; Y
; (-2, 2)
add eax,DWORD PTR [esi+(4*2)] ; X1
; ( 2,-2)
add eax,DWORD PTR [esi+ebx*4-(4*2)] ; X2
; ( 2, 2)
add eax,DWORD PTR [esi+ebx*4+(4*2)] ; X3
lea eax,[eax+edx*8+7]
Then again most assembly language programmers miss this, too. ;) I tried Vector C and couldn't stop laughing at the MMX/SSE code produced - guess it's better than nothing. :grin:
Posted on 2002-07-21 11:18:23 by bitRAKE
Hi bitRAKE & ThoughtCriminal:
My knowledge was about VisualC 6, what you wrote about VisualC 7 is very
interesting.. maybe I should take a serious look at the new compiler, and
test how good they made it.
Posted on 2002-07-21 15:17:37 by Maverick
you should be very careful with __declspec(naked) - when I messed with it,
it assumed *NOTHING* about the function. It wouldn't preserve registers
(good), but it assumed that I didn't trash *any* registers, including the normally
trashable ones... and furthermore, it didn't use eax for return value.
__Stone told me that __declspec(naked) is "for emergencies". *sigh*.
VC7 is good, but it would be nice if it supported some of the GNUish (or
watcomish) attributes to the asm blocks.

I think the best "inline" keyword to use for VC is __inline ... as far as I
can understand from the docs, it will ignore "inline", while "__inline" will
at least make the compiler think about it (for short functions it should
have inlined anyway). __forceinline will "probably" always get your
function inlined, but there are a few circumstances where it can't (described
in the docs). There's also __declspec(noinline) - which has been useful
at least a single time :)
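
A small sketch of how these keywords are often wrapped so the same source builds on more than one compiler. The macro names FORCE_INLINE and NO_INLINE and the two example functions are made up for illustration; the GCC attributes are that compiler's equivalents, not something discussed above.

```c
/* Hypothetical wrapper macros; the keyword spellings are the compilers' own. */
#if defined(_MSC_VER)
#  define FORCE_INLINE static __forceinline
#  define NO_INLINE    __declspec(noinline)
#elif defined(__GNUC__)
#  define FORCE_INLINE static __inline__ __attribute__((always_inline))
#  define NO_INLINE    __attribute__((noinline))
#else
#  define FORCE_INLINE static
#  define NO_INLINE
#endif

/* Isolates the lowest set bit of x (returns 0 when x is 0). */
FORCE_INLINE unsigned low_bit_mask(unsigned x)
{
    return x & (0u - x);
}

/* Kept out of line on purpose, e.g. to get a clean call in a profile. */
NO_INLINE unsigned low_bit_mask_call(unsigned x)
{
    return low_bit_mask(x);
}
```

Whether a given function really ends up inlined is still worth checking in the generated assembly listing.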
Posted on 2002-07-21 15:51:25 by f0dder
Is VC7 the one that comes with Visual Studio .NET?
Posted on 2002-07-21 21:36:31 by doby
Yes, VC++7 comes with VS.Net, or can be purchased by itself, iirc.
Posted on 2002-07-21 21:38:41 by bitRAKE
Just want to report a few things.

I haven't tried VC7 yet, but I ran the experiment on VC6, BC6 and Digital Mars C/C++ (DM) with the same source code on Win98. Here are the results:

DM:  585k nodes/sec
BC6: 600k nodes/sec
VC6: 850k nodes/sec

Microsoft's compiler gives the best performance.

bitRAKE: thanks for your great optimized function, it sped up my program by 5%.
Maverick: is it true that if I have asm in my function, it won't be inlined? My performance increased by about 5k nodes/sec after putting __inline in front of bitRAKE's GetLSB function.

As for the BitCount algorithm, there is a thread in this forum claiming that Population Bitcount is faster than Table Lookup, but in my experiment Table Lookup is faster than Population. Why? Did I do something wrong? Others have gotten the same result as me, though.
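
For reference, a sketch of the two approaches side by side, assuming "Population" means the parallel bit-summing method (the other thread is not quoted here, so that is a guess). All names are hypothetical. Which one wins can depend on the compiler, the CPU and whether the table stays in cache, so this makes no speed claim.

```c
#include <stdint.h>

static unsigned char bits_in_byte[256];

/* Fill the 256-entry table once before calling popcount_table(). */
static void init_bit_table(void)
{
    int i;
    bits_in_byte[0] = 0;
    for (i = 1; i < 256; ++i)
        bits_in_byte[i] = (unsigned char)((i & 1) + bits_in_byte[i >> 1]);
}

/* Table lookup: four byte-indexed loads and three adds. */
static int popcount_table(uint32_t x)
{
    return bits_in_byte[x & 0xFF]
         + bits_in_byte[(x >> 8) & 0xFF]
         + bits_in_byte[(x >> 16) & 0xFF]
         + bits_in_byte[x >> 24];
}

/* Parallel count: no memory access, only shifts, masks and adds. */
static int popcount_parallel(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                   /* 2-bit sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);   /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                   /* 8-bit sums */
    return (int)((x * 0x01010101u) >> 24);              /* add the bytes */
}
```

Timing both on the real bitboards from the program, rather than in an empty loop, is probably the fairest comparison.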

Posted on 2002-07-27 06:08:47 by doby