What is the fastest way to count the number of bits set in a 32 bit value? Many thanks
Posted on 2001-06-26 02:46:00 by Chris
One solution could be to use a 16-byte lookup table and process the value by packet of 4 bytes. For example :

.DATA
table BYTE 0, 1, 1, 2, 1 ; etc

.CODE

CountBit PROC value:DWORD USES EBX ECX EDX
  mov    ecx, value
  movzx  ebx, cl 
  movzx  edx, ch
  movzx  eax, table
  shr    ecx, 16
  add    al, table
  mov    bl, cl
  mov    dl, ch
  add    al, table
  add    al, table
CountBit ENDP
This message was edited by karim, on 6/26/2001 6:08:53 AM
Posted on 2001-06-26 03:59:00 by karim
Fastest way would be to use a 32bit lookup :D But that would leave you with no room for code, or a stack! Not quite so fast as the lookup, but smaller code wise:

mov ecx, 32
mov edx, MyNum
xor eax, eax

@@:
  rol edx, 1
  adc eax, 0
  dec ecx
  jnz @B
If you don't want to preserve the register then you can do this:

  xor eax, eax   ;Also clears the carry flag
  mov edx, MyNum ;Doesn't affect the carry flag

@@:
   adc eax, 0    ;Does nothing first time around
   shl edx, 1    ;Pop one bit off
   jnz @B        ;Loop again if edx is not empty

  adc eax, 0     ;Add the last bit to be shifted off the top
The second example will be quicker, as it is a smaller loop, requires less startup code, and can exit early if edx is empty! If you find that you deal mostly with smaller numbers, it may also be worth changing it from SHL to SHR, as smaller numbers ocupy the lower bits only, it will exit quicker. That benefit will be small though! Mirno
Posted on 2001-06-26 04:58:00 by Mirno
Thanks for the replies so far. I was hopefully (or with wishfull thinking) trying to avoid or minimise both shifts and using partial registers. This routine is right in the middle of a critical and heavily used procedure and the effect of processor stalls is quite dramatic. Actually I need to count the bits in each nibble for a 32 bit value. With a lot of experimentation together with running the code through vTune, I have improved speed quite considerably since my initial code but I cannot get away from the feeling that I still seem to be doing a lot of work for such a seemingly simple need. Also I have it in the back of my mind that I am missing some trick with Boolean logic somewhere which prompted me to post the question, in case somebody could point out the obvious which I am missing..... All suggestions are welcome. Many thanks
Posted on 2001-06-26 07:14:00 by Chris
A hybrid of my previous, and karim's suggestions:

.data
Lookup dd 0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4

.code
   mov edx, MyNumber
   xor eax, eax
   xor ecx, ecx

@@:
   add eax, 
   xor ecx, ecx
   shld edx, ecx, 4
   jnz @B

   add eax, 
Again the SHLD/SHRD thing applies depending on your average input. It uses no partial registers, and cuts down on the number of itterations of the loop. Mirno
Posted on 2001-06-26 07:45:00 by Mirno
I've found this in the AMD optimization guide :

; value in eax
MOV   EDX, EAX
SHR   EAX,1
AND   EAX,055555555h
SUB   EDX,EAX
MOV   EAX,EDX
SHR   EDX,2
AND   EAX,033333333h
AND   EDX,033333333h
ADD   EAX,EDX
MOV   EDX,EAX
SHR   EAX,4
ADD   EAX,EDX
AND   EAX,00F0F0F0Fh
IMUL  EAX,001010101h
SHR   EAX,24
Posted on 2001-06-26 08:01:00 by karim
The fast way would be parallel adders in MMX. You said you wanted the total bit counts for the eight nibbles of a DWORD. Nibbles are four bits, so the result has five states: this will require 3 bits (or 2 bits and a special case). Look in a basic electronics book for the binary logic of a Full-Adder, and duplicate that with the MMX code. This question was asked in the newsgroup (comp.lang.asm.x86) - I wish I came up with the solution myself! ;) I would like to design a macro for the general case - seems hard. :P
Posted on 2001-06-26 13:43:00 by bitRAKE
Counting bits is a 2 dimensional process, and a CPU is 1 dimensional. Humans have 2D sight (partial 3D perspection), that is why it is easier to count bits for humans; we can see the bigger picture. The AMD opt is the best way to handle this situation.
Posted on 2001-06-26 22:31:00 by eet_1024
eet_1024, that algo is for finding the number of bits that are set in a DWORD.
Actually I need to count the bits in each nibble for a 32 bit value.
Chris, it would be easier to present an optimal solution if I knew what the surrounding code was. The method I note above is only optimal if your doing this for many DWORDs, and works even better if you can reorganize the data. So, please tell us more, or post some code? This message was edited by bitRAKE, on 6/26/2001 10:57:03 PM
Posted on 2001-06-26 22:52:00 by bitRAKE
Those are just bit patterns, I believe you can modify it for 4 bit use. Also try this:

;  AL contains your nibble
  xor    cx, cx
  add    al, F8
  adc    cx, 0
  add    al, FC
  adc    cx, 0
  add    al, FE
  adc    cx, 0
  add    al, FF
  adc    cx, 0
;  CX contains the bit count
Posted on 2001-06-26 23:25:00 by eet_1024
Would:

push   eax
pop   ax
run my code
xchg  ah, al
run my code
..etc..
be fast enough?
Posted on 2001-06-26 23:27:00 by eet_1024
I have counter the number of clock cycles using the rdtscp instruction. Eet_1024's code is the fastest (about 6-7 clock cycles). AMD's method is slighty slowest (10 clock cycles). Using a table increase the number of clocks significantly (~2000 clock cycles !), probably because of the paging mechanism that slow down memory access. Mirno's method takes about 100 clocks cycles, even when the nibble is zero. Here are the results :
              
Reference :         33       33      33      33
Lookup table :      740251   2245    2205    2119
Loop         :      136      136     136     136 
AMD          :      45       45      45      45     
eet_1024     :      39       39      39      39 
It was tested on an 933Mhz Intel with 256 Mo under Windows NT 4 (my machine at work). The reference is the number of clock cycles when no code is executed (OS noise).
Posted on 2001-06-28 04:18:00 by karim
It appears that the lookup table was cached.
Posted on 2001-06-28 13:15:00 by eet_1024
Too cute not to mention:

mysub:   ;counts bits in eax, returns the count in edx
xor edx,edx         ;clc
again: adc dl,dh    ;maybe adc edx,0 is faster, dunno
add eax,eax         ;ax or al can be used instead of al
jnz again           ;so this algo is early-out
adc dl,dh
ret
Posted on 2001-06-28 20:00:00 by Larry Hammick
Larry, If you mov (into 8 diff locations) & clear dl on every 4 counts, I think that your code would be best for his situation.
Posted on 2001-06-28 20:45:00 by eet_1024
cmp eax,1 sbb edx,edx NxtBit: lea ebx,dword ptr inc edx and eax,ebx jnz NxtBit ZeroExit:
Posted on 2001-06-28 23:13:00 by The Svin
Very nice, Svin.
Posted on 2001-07-01 01:50:00 by Larry Hammick