Hello folks,

I'm c++ programmer and I'm needing to increase the performance of an especific function of my program.
Basically I need to count the number of common bits of two 32-bit values. The code is quite simple in c++, would be something like this:


int counter_bits()

int counter = 0;
bitset<32> bits1;
bitset<32> bits2;

for (int x = 0;  x <= 31; x++) {
if (bits1 == true)
if (bits2 == true)

return counter;


As I'm planning to call this function hundreds or thousands of times inside another loop, I would like to optimize this espefic part using asm inline.
Does anyone knows how would be this algorithm translated to assembly language?


Posted on 2010-02-03 19:29:44 by smertniki
What you need it to optimize the ALGORITHM, not the code.

You can reduce the number of comparisons from 2n to n simply but doing a logical AND of those 2 values and calculating the number of set bits in the result in ONE loop.

something like:
result = bits1 & bits2;

for (x = 0; x <32; x++) if (result) counter++;

And this is just a simple suggestion which comes to mind without giving it very much thought.

The second thing which comes to mind is NOT to use any exotic types where you can use simple ints/longs.

The third thing would be to use a LUT for 16 4-bit values, and process the result in 8 look-ups into the LUT, which would compile into 8 simple read operations. With proper caching the code would execute in -I don't know- 20 cycles (and wouldn't have any branches)? Compared to your current code, which -I estimate- takes about ~60 cycles and may have 2 branches, means roughly a 300-500 % speedup. And we are still talking about C/C++ code.

No asm optimization will help you if your algorithm is flawed. First optimize the algorithm. Then, and only then, optimize the code. And most of the time optimizing the code in ASM isn't necessary as modern compilers can produce code which performs better than code hand-written by -say- 75% of all coders. In other words: If you can't write a decent algorithm in C/C++ then you definitely can't write efficient code in ASM. No offense intended but that's how reality usually looks like.
Posted on 2010-02-03 19:33:49 by ti_mo_n
Yep, step #1 is to AND the two values together.
Now you have one value, the Set bits were 1 in both input values... count them.
Posted on 2010-02-03 19:38:38 by Homer
Thanks for the replies guys,

The algorithm I described was a pseudo code to exemplify what I was asking. I know that your version is the fastest way...
And I'm using the bits to represent combinations of numbers, their values does not matter, what matters to me is the intersection of common bits between two sets.

but don't you think that if I use asm to make this counting, would not improve the speed?
or the loop is so short that the supposed improvement would be insignificant?
Posted on 2010-02-03 19:52:22 by smertniki
I know that your version is the fastest way...

I doubt it ^^'

don't you think that if I use asm to make this counting, would not improve the speed?

It depends on the exact algorithm you want to implement and the amount of data to be processed (this is not to be confused with the frequency of execution). You say that this function is to be executed many times. Most of the time this isn't as important as the amount of data each execution is supposed to process.

And then: If each execution processes too small amount of data then you are wasting CPU cycles on function calls which, then, makes this function a good candidate for inlining.

or the loop is so short that the supposed improvement would be insignificant?

Again, depends on the amount of data. Unrolling the loop may help, a LUT may help, if it's very large then SSE may help.

There are MANY options, depending on what exactly you want to accomplish, ASM being probably the last one of them.
Posted on 2010-02-03 19:56:29 by ti_mo_n
well, what can I say? thanks for the tips :]

Posted on 2010-02-03 20:29:22 by smertniki
This is a good one for simple inline asm in C.
Go for it!!!
If you need help writing it, show your messy code.
Posted on 2010-02-03 22:50:53 by Homer
After you have ANDed the values together, couting the bits is known as popcnt. You can also check wikipedia or go hitting on google - there should be plenty of literature :)

Inline assembly is the devil - go external assembly modules, yay! :)
Posted on 2010-02-04 01:50:36 by f0dder
Inline assembly is the devil - go external assembly modules, yay! :)

That is what Microsoft must have been thinking. They removed inline assembly from their C/C++ compiler for x64 targets. However, they supply an x64 version of MASM.
So in the Microsoft x64 universe, external assembly modules are the only way.
With GCC you still have inline assembly though.
Posted on 2010-02-04 06:10:57 by Scali
Hello again guys,

ti_mo_n, you were right, first optimize the algo, then the code ;]

I know that your version is the fastest way...
I doubt it ^^'

Don't doubt, I knew it, what I did not imagine was how much faster would be, to take off the doubt I made the benchmark: (1.000.000 calls each)
Look the results:

my method  = 11189.8 milliseconds
your method = 10104.2 milliseconds
asm method  = 79.1911 milliseconds  :shock:

but there is an improvement in your method, instead to make a for loop, just call counter = result.count(); the timing went to 595.954 milliseconds

Here are the functions:

int stupid_bitcounter(bitset<32> bits1, bitset<32> bits2)

int counter = 0;

for (int x = 0; x <= 31; x++)

if (bits1 == true)
if (bits2 == true)

return counter;


int normal_bitcounter(bitset<32> bits1, bitset<32> bits2)

int counter = 0;
bitset<32> result;

result = bits1 & bits2;
counter = result.count();

// for (int x = 0; x < 32; x++)
// if (result) counter++;

return counter;

int asm_bitcounter(bitset<32> bits1, bitset<32> bits2)

int counter = 0;

_asm mov eax, bits1
_asm mov ebx, bits2
_asm and eax, ebx

_asm cLoop:
_asm mov ebx, eax
_asm dec ebx 
_asm inc counter
_asm and eax, ebx
_asm jg cLoop

return counter;


I'm not sure if the asm method is 100% correct (you know, I'm an asm noob), but as far as I tested it worked without problems...
If you find a bug in the function, please fix it, or if you know even a faster way to make this bits countings, please post here, I'll appreciate it:]

Homer, the codes are still incomplete, when I finish to write them I post here to you take a look.

f0dder, thanks for the link, I will take a look further later

well guys, thanks for the support, to tell the truth I'm not even a C++ programmer, I'm migrating from the classic VB 6.0 to C++ and already started to feel that I should have done this much more time before, but assembly really rox ;D

Posted on 2010-02-04 07:55:31 by smertniki
Where did your asm algo come from though?
It uses a completely different strategy for counting the bits than any of the presented code so far.
Instead of literally counting which bits are 1, you are now using the trick that subtracting 1 will basically do this:
xxxx10 - 1 => xxxx01
xxx100 - 1 => xxx011
xx1000 - 1 => xx0111
Then that mask is used to cancel out that one bit (which is the lowest set bit in that number at the time).
You count how often this is possible, which effectively counts the total number of bits.

In other words, it's not a translation from any of the presented C++ code to assembly, but something completely different.

While we're on the subject of different strategies... I think this one is the smallest/most elegant possible way to count bits (but not necessarily the fastest):
    adc    ecx, 0
    add    eax, eax    ; For small bitstrings, shr eax, 1 might be preferred, for it will lead to less iterations.
    jnz    @@loop

    adc    ecx, 0
Posted on 2010-02-04 08:07:21 by Scali
Hi Scali,

I coded it based on logic described in this website: http://www2.arnes.si/~ubolti/ininuga/programming.htm

One question, @@loop is the TASM sintax?
Posted on 2010-02-04 08:58:56 by smertniki

One question, @@loop is the TASM sintax?

It's MASM syntax, but should also work in TASM. The @@ prefix means that it's a local label. This way you can re-use simple names such as 'loop', so you don't have to think of a unique label for every loop you make in your program :)

By the way, you should try writing that new algo in C++, it's probably very close to the ASM version in performance...
Something like this:
x = a & b;
counter = 0;

while (x)
  x &= (x-1);
Posted on 2010-02-04 09:15:38 by Scali

it should be a rotate or shift with a carry detect, this counts off bits too.
Posted on 2010-02-04 09:25:05 by Homer


it should be a rotate or shift with a carry detect, this counts off bits too.

Which one are you talking about?
Posted on 2010-02-04 09:28:31 by Scali
Also, you don't want to mix bitset<> and assembly... if you use bitset<> methods in an algo, stick with that bitset. If you work on integers, use integers.
Posted on 2010-02-04 11:18:29 by f0dder
According to the Nasm manual, the "popcnt" instruction was added in "Nehalem" (SSE4.2).

This "population count" is apparently a common operation, and a lot of thought has gone into it. I vaguely recalled Terje Mathison had a method, so I Googled "popcnt terje". Came up with a bunch - this is the first I looked at. Seems to compare a number of methods.


Might be interesting...


Posted on 2010-02-04 13:35:17 by fbkotler
According to the Nasm manual, the "popcnt" instruction was added in "Nehalem" (SSE4.2).

Yes, POPCNT is an SSE4.2 instruction, first introduced in the Nehalem microarchitecture (Core i7, for example).
Posted on 2010-02-04 18:48:42 by ti_mo_n
The discussion here is correct regarding the poster's code.
However, the poster code does not count the number of common bits, as he said he wanted to.
It counts the number of times the two corresponding bits are set.
As I understand it "common bits" would include when the two are zero.
But I cannot know if you expressed your need ambiguously and the code is correct, or your code does not what you need, nor does the rest of the suggested methods.
Posted on 2010-05-15 01:37:08 by HeLLoWorld
The current best population count algorithms are here (with full source code):

And there is a fast SSE3 implementation here:
Posted on 2010-06-28 18:59:27 by MCoder