Recently, I had to count the number of bits of a 32 bits register.
I don't remember exactly how much cycles it took, but it was originally a C routine.

My question is:
what is the fastest routine to count the number of bits of EAX ?

JC
Posted on 2002-06-01 06:47:55 by MCoder
mov ebx, eax
shr eax, 16
and ebx, 0ffffh
mov eax,
Posted on 2002-06-01 07:03:12 by bdjames

mov ebx, eax
shr eax, 16
and ebx, 0ffffh
mov eax,

That will be extremely slow.. L1 cache miss, L2 cache miss, even page fault (and load from disk) probably.

Here's an example that works much faster in real life, and no branches even:

int countbits(a)
unsigned long a; /* a: 32 1-bit tallies */
{
a = (a&0x55555555) + ((a>>1) &(0x55555555)); /* a: 16 2-bit tallies */
a = (a&0x33333333) + ((a>>2) &(0x33333333)); /* a: 8 4-bit tallies */
a = (a&0x07070707) + ((a>>4) &(0x07070707)); /* a: 4 8-bit tallies */
/* a %= 255; return(a); may replace what follows */
a = (a&0x000F000F) + ((a>>8) &(0x000F000F)); /* a: 2 16-bit tallies */
a = (a&0x0000001F) + ((a>>16)&(0x0000001F)); /* a: 1 32-bit tally */
return(a);
}
Posted on 2002-06-01 08:41:02 by Maverick
In GMP, there is a beautiful:

#define popc_limb(result, input) \
do { \
mp_limb_t __x = (input); \
__x -= (__x & 0xaaaaaaaaL) >> 1; \
__x = ((__x >> 2) & 0x33333333L) + (__x & 0x33333333L); \
__x = ((__x >> 4) + __x) & 0x0f0f0f0fL; \
__x = ((__x >> 8) + __x); \
__x = ((__x >> 16) + __x) & 0xff; \
(result) = __x; \
} while (0)

What is the fastest in ASM ?
Can a MMX function do it faster ?

JC
Posted on 2002-06-01 10:04:51 by MCoder

What is the fastest in ASM?
Can a MMX function do it faster?
MCoder, this is a popular problem (:)) - look in the AMD optimization manual for MMX version. Both C versions are really the same and convert directly to ASM. MMX does two DWORDs at once - better than half the time.
Posted on 2002-06-01 10:18:29 by bitRAKE
Posted on 2002-06-01 10:41:13 by bdjames
bdjames, that is fairly slow and the last bit can just be added - no need to shift into carry flag. :) ~16 cycles on what CPU? It should compress well! ;)
``````shr ebx, 1
;shr ebx, 1
pop ebx``````
Nice explaination
Good page of bit operations
Posted on 2002-06-01 11:24:37 by bitRAKE
What the h*ll?

I thought adc reg,imm and shr reg, 1
were 1 cycle and can pair.=>32

54 :(
Posted on 2002-06-01 11:50:43 by bdjames
No, it's not pairable, since the sbb wait for the carry of the shr.
Pairing is allowed only when the 2 instructions have nothing in common (except TEST/BRANCH, but I'm not sure).

Bitrake: Thanks for the info !
For reference, here is AMD's code for the bitcount:

mov edx, eax
shr eax, 1
and eax, 55555555h
sub edx, eax
mov eax, edx
shr edx, 2
and eax, 33333333h
and edx, 33333333h
mov edx, eax
shr eax, 4
and eax, 0F0F0F0Fh
imul eax, 01010101h
shr eax, 24

Wow !

I've not the courage to copy the MMX version here...

JC
Posted on 2002-06-01 13:43:31 by MCoder
``````
;eax in
;edx out
;speed and size compromiss
cmp eax,1
sbb edx,edx
@@:
lea ebx,[eax-1]
inc edx
and eax,ebx
jne @B
;size; in ecx, out edx
xor edx,edx
jecxz @1
@@:
lea ebx,[ecx-1]
inc edx
and ecx,ebx
jne @b
@1:
``````
Posted on 2002-06-01 16:38:43 by The Svin
On an Athlon it is possible to get below
a cycle per byte using MMX/SSE.
Posted on 2002-06-02 17:03:26 by bitRAKE

Recently, I had to count the number of bits of a 32 bits register.
I don't remember exactly how much cycles it took, but it was originally a C routine.

My question is:
what is the fastest routine to count the number of bits of EAX ?

JC

no offense but what is the use of this? :confused:

I mean it's a fixed number and all? :)
Posted on 2002-06-02 17:12:11 by Hiroshimator
Here is the single quadword version:
``````	ALIGN 64
mpn_popcount_64 PROC PUBLIC FORCENOFRAME

; eax = source
; ecx = blocks

lea eax, [eax+ecx*8]
neg ecx

mov edx,0F3355h

pxor mm2,mm2
movd mm7,edx		; ????????000F3355
pxor mm4,mm4		; total bits
punpcklbw mm7,mm7	; 00000F0F33335555

movq mm6,mm7
movq mm5,mm7
pshufw mm7,mm7,00000000y	; 5555555555555555
pshufw mm6,mm6,01010101y	; 3333333333333333
pshufw mm5,mm5,10101010y	; 0F0F0F0F0F0F0F0F

REG_55 EQU <mm7>
REG_33 EQU <mm6>
REG_0F EQU <mm5>

ALIGN 64
_2:
; 18 instructions
; 9.5 cycles per loop (1000 reps)
movq	mm1,[eax+ecx*8]

movq	mm0,mm1
psrlq	mm1,1
pand	mm1,REG_55
psubd	mm0,mm1	; bit pairs

movq	mm1,mm0
psrlq	mm0,2
pand	mm1,REG_33
pand	mm0,REG_33

movq	mm1,mm0
psrlq	mm0,4
pand	mm0,REG_0F	; bytes

psadbw	mm0,mm2	; sum of absolute byte differences

inc	ecx
jnz	_2

movd	eax,mm4
ret
mpn_popcount_64 ENDP``````
I've seen many versions of this routine, but most of them run
at 10/11 cycles - this one has been tuned for 9.5 :) The dual
quad word version is the fastest and it takes quite a bit more
work to shave off a couple of cycles per quad word.
Posted on 2002-06-02 22:34:24 by bitRAKE
Thanks !

I needed such a routine for 2 programming contests.
About queens in 3D and postage stamps problem.

JCM
Posted on 2002-06-03 01:37:48 by MCoder

I needed such a routine for 2 programming contests.
About queens in 3D and postage stamps problem.
Posted on 2002-06-03 07:33:26 by bitRAKE
mcoder,

``````
;EAX=value
sub edx,edx
@@count:
bsf eax,ecx
jz @@done
btc eax,ecx
inc edx
jmp @@count
@@done:
;EDX=number of bits set in EAX
``````

ancev
Posted on 2002-06-03 10:49:27 by ancev
This might work?
``````    xor ecx,ecx
shl eax,1
jns _1
_2: inc ecx
shl eax,2
js _2
jne _1
Posted on 2002-06-03 11:14:47 by bitRAKE
We save a byte.
``````	or edx,-1
_1:	bsf ecx,eax
btr eax,ecx
inc edx
jc _1``````
Also, interesting to note: if you just want the highest or lowest bit set, and you know at least one bit is set:
``````_1:	bsf ecx,eax ; bsr for high bit only
btc eax,ecx
jc _1``````
Posted on 2002-06-03 20:03:25 by bitRAKE

Check the AttackingQueens on Yahoogroups.

The result of the contest is here:
http://members.aol.com/bitzenbeitz/Queens/

There was some money to win ;-)
A new contest should appear soon.
If you are interested, I can make a post here...

JCM
Posted on 2002-06-05 13:46:03 by MCoder
Did you win?
\$300 total up for grabs.
...I could use a new monitor. :)
Posted on 2002-06-05 14:26:50 by bitRAKE