What is the fastest way to count the number of bits set in a 32 bit value? Many thanks
One solution could be to use a 16-byte lookup table and process the value by packet of 4 bytes. For example :
This message was edited by karim, on 6/26/2001 6:08:53 AM
.DATA table BYTE 0, 1, 1, 2, 1 ; etc .CODE CountBit PROC value:DWORD USES EBX ECX EDX mov ecx, value movzx ebx, cl movzx edx, ch movzx eax, table shr ecx, 16 add al, table mov bl, cl mov dl, ch add al, table add al, table CountBit ENDP
Fastest way would be to use a 32bit lookup :D But that would leave you with no room for code, or a stack! Not quite so fast as the lookup, but smaller code wise:
If you don't want to preserve the register then you can do this:
mov ecx, 32 mov edx, MyNum xor eax, eax @@: rol edx, 1 adc eax, 0 dec ecx jnz @B
The second example will be quicker, as it is a smaller loop, requires less startup code, and can exit early if edx is empty! If you find that you deal mostly with smaller numbers, it may also be worth changing it from SHL to SHR, as smaller numbers ocupy the lower bits only, it will exit quicker. That benefit will be small though! Mirno
xor eax, eax ;Also clears the carry flag mov edx, MyNum ;Doesn't affect the carry flag @@: adc eax, 0 ;Does nothing first time around shl edx, 1 ;Pop one bit off jnz @B ;Loop again if edx is not empty adc eax, 0 ;Add the last bit to be shifted off the top
Thanks for the replies so far. I was hopefully (or with wishfull thinking) trying to avoid or minimise both shifts and using partial registers. This routine is right in the middle of a critical and heavily used procedure and the effect of processor stalls is quite dramatic. Actually I need to count the bits in each nibble for a 32 bit value. With a lot of experimentation together with running the code through vTune, I have improved speed quite considerably since my initial code but I cannot get away from the feeling that I still seem to be doing a lot of work for such a seemingly simple need. Also I have it in the back of my mind that I am missing some trick with Boolean logic somewhere which prompted me to post the question, in case somebody could point out the obvious which I am missing..... All suggestions are welcome. Many thanks
A hybrid of my previous, and karim's suggestions:
Again the SHLD/SHRD thing applies depending on your average input. It uses no partial registers, and cuts down on the number of itterations of the loop. Mirno
.data Lookup dd 0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4 .code mov edx, MyNumber xor eax, eax xor ecx, ecx @@: add eax, xor ecx, ecx shld edx, ecx, 4 jnz @B add eax,
I've found this in the AMD optimization guide :
; value in eax MOV EDX, EAX SHR EAX,1 AND EAX,055555555h SUB EDX,EAX MOV EAX,EDX SHR EDX,2 AND EAX,033333333h AND EDX,033333333h ADD EAX,EDX MOV EDX,EAX SHR EAX,4 ADD EAX,EDX AND EAX,00F0F0F0Fh IMUL EAX,001010101h SHR EAX,24
The fast way would be parallel adders in MMX. You said you wanted the total bit counts for the eight nibbles of a DWORD. Nibbles are four bits, so the result has five states: this will require 3 bits (or 2 bits and a special case). Look in a basic electronics book for the binary logic of a Full-Adder, and duplicate that with the MMX code. This question was asked in the newsgroup (comp.lang.asm.x86) - I wish I came up with the solution myself! ;) I would like to design a macro for the general case - seems hard. :P
Counting bits is a 2 dimensional process, and a CPU is 1 dimensional. Humans have 2D sight (partial 3D perspection), that is why it is easier to count bits for humans; we can see the bigger picture. The AMD opt is the best way to handle this situation.
eet_1024, that algo is for finding the number of bits that are set in a DWORD.
Actually I need to count the bits in each nibble for a 32 bit value.Chris, it would be easier to present an optimal solution if I knew what the surrounding code was. The method I note above is only optimal if your doing this for many DWORDs, and works even better if you can reorganize the data. So, please tell us more, or post some code? This message was edited by bitRAKE, on 6/26/2001 10:57:03 PM
Those are just bit patterns, I believe you can modify it for 4 bit use. Also try this:
; AL contains your nibble xor cx, cx add al, F8 adc cx, 0 add al, FC adc cx, 0 add al, FE adc cx, 0 add al, FF adc cx, 0 ; CX contains the bit count
be fast enough?
push eax pop ax run my code xchg ah, al run my code ..etc..
I have counter the number of clock cycles using the rdtscp instruction. Eet_1024's code is the fastest (about 6-7 clock cycles). AMD's method is slighty slowest (10 clock cycles). Using a table increase the number of clocks significantly (~2000 clock cycles !), probably because of the paging mechanism that slow down memory access. Mirno's method takes about 100 clocks cycles, even when the nibble is zero. Here are the results :
It was tested on an 933Mhz Intel with 256 Mo under Windows NT 4 (my machine at work). The reference is the number of clock cycles when no code is executed (OS noise).
Reference : 33 33 33 33 Lookup table : 740251 2245 2205 2119 Loop : 136 136 136 136 AMD : 45 45 45 45 eet_1024 : 39 39 39 39
It appears that the lookup table was cached.
Too cute not to mention:
mysub: ;counts bits in eax, returns the count in edx xor edx,edx ;clc again: adc dl,dh ;maybe adc edx,0 is faster, dunno add eax,eax ;ax or al can be used instead of al jnz again ;so this algo is early-out adc dl,dh ret
Larry, If you mov (into 8 diff locations) & clear dl on every 4 counts, I think that your code would be best for his situation.
cmp eax,1 sbb edx,edx NxtBit: lea ebx,dword ptr inc edx and eax,ebx jnz NxtBit ZeroExit:
Very nice, Svin.