Recently, I had to count the number of bits of a 32 bits register.

I don't remember exactly how much cycles it took, but it was originally a C routine.

My question is:

what is the fastest routine to count the number of bits of EAX ?

JC

I don't remember exactly how much cycles it took, but it was originally a C routine.

My question is:

what is the fastest routine to count the number of bits of EAX ?

JC

mov ebx, eax

shr eax, 16

and ebx, 0ffffh

mov eax,

add eax,

mov ebx, eax

shr eax, 16

and ebx, 0ffffh

mov eax,

add eax,

That will be extremely slow.. L1 cache miss, L2 cache miss, even page fault (and load from disk) probably.

Here's an example that works much faster

int countbits(a)

unsigned long a; /* a: 32 1-bit tallies */

{

a = (a&0x55555555) + ((a>>1) &(0x55555555)); /* a: 16 2-bit tallies */

a = (a&0x33333333) + ((a>>2) &(0x33333333)); /* a: 8 4-bit tallies */

a = (a&0x07070707) + ((a>>4) &(0x07070707)); /* a: 4 8-bit tallies */

/* a %= 255; return(a); may replace what follows */

a = (a&0x000F000F) + ((a>>8) &(0x000F000F)); /* a: 2 16-bit tallies */

a = (a&0x0000001F) + ((a>>16)&(0x0000001F)); /* a: 1 32-bit tally */

return(a);

}

What is the fastest in ASM?

Can a MMX function do it faster?

Good page of bit operations

Recently, I had to count the number of bits of a 32 bits register.

I don't remember exactly how much cycles it took, but it was originally a C routine.

My question is:

what is the fastest routine to count the number of bits of EAX ?

JC

no offense but what is the use of this? :confused:

I mean it's a fixed number and all? :)

I needed such a routine for 2 programming contests.

About queens in 3D and postage stamps problem.Could you post more info about the contests?

We save a byte.

Could you post more info about the contests?

Check the AttackingQueens on Yahoogroups.

The result of the contest is here:

http://members.aol.com/bitzenbeitz/Queens/

There was some money to win ;-)

A new contest should appear soon.

If you are interested, I can make a post here...

JCM

shr eax, 16

and ebx, 0ffffh

mov eax,

add eax,

mov ebx, eax

shr eax, 16

and ebx, 0ffffh

mov eax,

add eax,

Here's an example that works much faster

*in real life*, and no branches even:int countbits(a)

unsigned long a; /* a: 32 1-bit tallies */

{

a = (a&0x55555555) + ((a>>1) &(0x55555555)); /* a: 16 2-bit tallies */

a = (a&0x33333333) + ((a>>2) &(0x33333333)); /* a: 8 4-bit tallies */

a = (a&0x07070707) + ((a>>4) &(0x07070707)); /* a: 4 8-bit tallies */

/* a %= 255; return(a); may replace what follows */

a = (a&0x000F000F) + ((a>>8) &(0x000F000F)); /* a: 2 16-bit tallies */

a = (a&0x0000001F) + ((a>>16)&(0x0000001F)); /* a: 1 32-bit tally */

return(a);

}

In GMP, there is a beautiful:

#define popc_limb(result, input) \

do { \

mp_limb_t __x = (input); \

__x -= (__x & 0xaaaaaaaaL) >> 1; \

__x = ((__x >> 2) & 0x33333333L) + (__x & 0x33333333L); \

__x = ((__x >> 4) + __x) & 0x0f0f0f0fL; \

__x = ((__x >> 8) + __x); \

__x = ((__x >> 16) + __x) & 0xff; \

(result) = __x; \

} while (0)

What is the fastest in ASM ?

Can a MMX function do it faster ?

JC

#define popc_limb(result, input) \

do { \

mp_limb_t __x = (input); \

__x -= (__x & 0xaaaaaaaaL) >> 1; \

__x = ((__x >> 2) & 0x33333333L) + (__x & 0x33333333L); \

__x = ((__x >> 4) + __x) & 0x0f0f0f0fL; \

__x = ((__x >> 8) + __x); \

__x = ((__x >> 16) + __x) & 0xff; \

(result) = __x; \

} while (0)

What is the fastest in ASM ?

Can a MMX function do it faster ?

JC

What is the fastest in ASM?

Can a MMX function do it faster?

**MCoder**, this is a

*popular*problem (:)) - look in the AMD optimization manual for MMX version. Both C versions are really the same and convert directly to ASM. MMX does two DWORDs at once - better than half the time.

Posted on 2002-06-01 10:41:13 by bdjames

**bdjames**, that is fairly slow and the last bit can just be added - no need to shift into carry flag. :) ~16 cycles on what CPU? It should compress well! ;)

```
shr ebx, 1
```

adc eax, ebx ; add two bits at once

;shr ebx, 1

;adc eax, 0

pop ebx

Nice explaination
Good page of bit operations

What the h*ll?

I thought adc reg,imm and shr reg, 1

were 1 cycle and can pair.=>32

54 :(

I thought adc reg,imm and shr reg, 1

were 1 cycle and can pair.=>32

54 :(

No, it's not pairable, since the sbb wait for the carry of the shr.

Pairing is allowed only when the 2 instructions have nothing in common (except TEST/BRANCH, but I'm not sure).

Bitrake: Thanks for the info !

For reference, here is AMD's code for the bitcount:

mov edx, eax

shr eax, 1

and eax, 55555555h

sub edx, eax

mov eax, edx

shr edx, 2

and eax, 33333333h

and edx, 33333333h

add eax, edx

mov edx, eax

shr eax, 4

add eax, edx

and eax, 0F0F0F0Fh

imul eax, 01010101h

shr eax, 24

Wow !

I've not the courage to copy the MMX version here...

JC

Pairing is allowed only when the 2 instructions have nothing in common (except TEST/BRANCH, but I'm not sure).

Bitrake: Thanks for the info !

For reference, here is AMD's code for the bitcount:

mov edx, eax

shr eax, 1

and eax, 55555555h

sub edx, eax

mov eax, edx

shr edx, 2

and eax, 33333333h

and edx, 33333333h

add eax, edx

mov edx, eax

shr eax, 4

add eax, edx

and eax, 0F0F0F0Fh

imul eax, 01010101h

shr eax, 24

Wow !

I've not the courage to copy the MMX version here...

JC

```
```

;eax in

;edx out

;speed and size compromiss

cmp eax,1

sbb edx,edx

@@:

lea ebx,[eax-1]

inc edx

and eax,ebx

jne @B

;size; in ecx, out edx

xor edx,edx

jecxz @1

@@:

lea ebx,[ecx-1]

inc edx

and ecx,ebx

jne @b

@1:

On an Athlon it is possible to get below

a cycle per byte using MMX/SSE.

a cycle per byte using MMX/SSE.

Recently, I had to count the number of bits of a 32 bits register.

I don't remember exactly how much cycles it took, but it was originally a C routine.

My question is:

what is the fastest routine to count the number of bits of EAX ?

JC

no offense but what is the use of this? :confused:

I mean it's a fixed number and all? :)

Here is the single quadword version:

at 10/11 cycles - this one has been tuned for 9.5 :) The dual

quad word version is the fastest and it takes quite a bit more

work to shave off a couple of cycles per quad word.

```
ALIGN 64
```

mpn_popcount_64 PROC PUBLIC FORCENOFRAME

; eax = source

; ecx = blocks

lea eax, [eax+ecx*8]

neg ecx

mov edx,0F3355h

pxor mm2,mm2

movd mm7,edx ; ????????000F3355

pxor mm4,mm4 ; total bits

punpcklbw mm7,mm7 ; 00000F0F33335555

movq mm6,mm7

movq mm5,mm7

pshufw mm7,mm7,00000000y ; 5555555555555555

pshufw mm6,mm6,01010101y ; 3333333333333333

pshufw mm5,mm5,10101010y ; 0F0F0F0F0F0F0F0F

REG_55 EQU <mm7>

REG_33 EQU <mm6>

REG_0F EQU <mm5>

ALIGN 64

_2:

; 18 instructions

; 9.5 cycles per loop (1000 reps)

movq mm1,[eax+ecx*8]

movq mm0,mm1

psrlq mm1,1

pand mm1,REG_55

psubd mm0,mm1 ; bit pairs

movq mm1,mm0

psrlq mm0,2

pand mm1,REG_33

pand mm0,REG_33

paddd mm0,mm1 ; nibbles

movq mm1,mm0

psrlq mm0,4

paddd mm0,mm1

pand mm0,REG_0F ; bytes

psadbw mm0,mm2 ; sum of absolute byte differences

paddd mm4,mm0 ; add to total

inc ecx

jnz _2

movd eax,mm4

ret

mpn_popcount_64 ENDP

I've seen many versions of this routine, but most of them run
at 10/11 cycles - this one has been tuned for 9.5 :) The dual

quad word version is the fastest and it takes quite a bit more

work to shave off a couple of cycles per quad word.

Thanks !

I needed such a routine for 2 programming contests.

About queens in 3D and postage stamps problem.

JCM

I needed such a routine for 2 programming contests.

About queens in 3D and postage stamps problem.

JCM

I needed such a routine for 2 programming contests.

About queens in 3D and postage stamps problem.

mcoder,

what about...

ancev

what about...

```
```

;EAX=value

sub edx,edx

@@count:

bsf eax,ecx

jz @@done

btc eax,ecx

inc edx

jmp @@count

@@done:

;EDX=number of bits set in EAX

ancev

This might work?

```
xor ecx,ecx
```

shl eax,1

jns _1

_2: inc ecx

_1: adc ecx,0

shl eax,2

js _2

jne _1

adc ecx,0

**ancev**, how about this? :)

We save a byte.

```
or edx,-1
```

_1: bsf ecx,eax

btr eax,ecx

inc edx

jc _1

Also, interesting to note: if you just want the highest or lowest bit set, and you know at least one bit is set:```
_1: bsf ecx,eax ; bsr for high bit only
```

btc eax,ecx

jc _1

Could you post more info about the contests?

Check the AttackingQueens on Yahoogroups.

The result of the contest is here:

http://members.aol.com/bitzenbeitz/Queens/

There was some money to win ;-)

A new contest should appear soon.

If you are interested, I can make a post here...

JCM

Did you win?

$300 total up for grabs.

...I could use a new monitor. :)

$300 total up for grabs.

...I could use a new monitor. :)