hi all,

let say I have : bitbuffer=101100111010b

can some one show me how to count how many bits are on(1) and how many bits are off(0)??

I am stuck on this when it come to the part with registers, since the minimum register is one 4 bits,

I am looking for something like this:

OnCounter=0

OffCounter=0

pbuf=

bit=bit ptr

while not endof bitbuffer

if bit==0 than inc OnCounter

if bit==1 than inc OffCounter

inc pbuf

endw

I hope this will help you easier to understand my question :(

thank you for the help in adv,

yanda

let say I have : bitbuffer=101100111010b

can some one show me how to count how many bits are on(1) and how many bits are off(0)??

I am stuck on this when it come to the part with registers, since the minimum register is one 4 bits,

I am looking for something like this:

OnCounter=0

OffCounter=0

pbuf=

bit=bit ptr

while not endof bitbuffer

if bit==0 than inc OnCounter

if bit==1 than inc OffCounter

inc pbuf

endw

I hope this will help you easier to understand my question :(

thank you for the help in adv,

yanda

bt instruction

*note: untested, but I think you should get the idea.

made a mistake with the jcc

```
```

;eax = bitbuffer

xor edx, edx ; edx = OnCounter

mov ecx, 31

@@:

bt eax, ecx

adc edx, 0

dec ecx

jns @B

;OffCounter = 32 - edx

*note: untested, but I think you should get the idea.

made a mistake with the jcc

thankx

I use Norbert Juffa's popcount, might be faster, never checked:

(Sorry about the first post it was formatted for GoAsm this is the masm version)

(Sorry about the first post it was formatted for GoAsm this is the masm version)

```
; Pass the value in ESI
```

; The set bit count is returned in EDX

POPCOUNT MACRO

mov edx,esi

shr edx,1

and edx,055555555h

sub esi,edx

mov edx,esi

shr edx,2

and esi,033333333h

and edx,033333333h

add esi,edx

mov edx,esi

shr edx,4

add edx,esi

and edx,00F0F0F0Fh

mov esi,edx

shr esi,8

add edx,esi

mov esi,edx

shr esi,16

add edx,esi

and edx,00000003Fh

ENDM

thank you all!

roticv:

I know that you did a bt eax, ecx for testing bit,

but I just not understand the use of adc edx, 0 instruction.

could you explain to me please?

thankx

roticv:

I know that you did a bt eax, ecx for testing bit,

but I just not understand the use of adc edx, 0 instruction.

could you explain to me please?

thankx

adc is just like add but if the carry flag is set, it is the same as add then inc.

For example, adc eax, edx

if CF set{

add eax, edx

inc eax

}ELSE{

add eax, edx

}

Something like that if I am not wrong.

In my code, adc edx, 0

means that if carry flag is not set, then nothing is done to edx. However if carry flag is set, it would be the "same" as inc edx. (I just do not want to end up doing branching)

For example, adc eax, edx

if CF set{

add eax, edx

inc eax

}ELSE{

add eax, edx

}

Something like that if I am not wrong.

In my code, adc edx, 0

means that if carry flag is not set, then nothing is done to edx. However if carry flag is set, it would be the "same" as inc edx. (I just do not want to end up doing branching)

thank you very much roticv!

now I got it!

thankx for the help!

sincerely,

now I got it!

thankx for the help!

sincerely,

Hi roticv.

I think the instruction

I think the instruction

**js @B**need be replaced to**jns @B**.Hello, this is like the one roticv did ( not as optimized as roticv?s ) but no need for BT ( slower ).

EAX < REG32 to count bits from.

EDX > Count of bits that were 1.

ECX > 0

mov ecx,32

xor edx,edx

BitShiftLoop: shr eax,1

adc edx,0

dec ecx

jnz BitShiftLoop

Bye :D

EAX < REG32 to count bits from.

EDX > Count of bits that were 1.

ECX > 0

mov ecx,32

xor edx,edx

BitShiftLoop: shr eax,1

adc edx,0

dec ecx

jnz BitShiftLoop

Bye :D

Hi roticv.

I think the instruction

**js @B**need be replaced to

**jns @B**.

hehe.. I was confused about that part. I go change it. :/

I still like Juffa's, I did some simple speed tests using Agner Fog's testing example and it came out at ~20 tics compared to around 100 for the other two. 5 times faster is quite a bit of savings for a function that is used often. I should note that I made the jns @B change to roticv's

roticv's algo = 98 tics

sheroc's = 105 tics

Juffa's POPCOUNT = 21 tics.

Average for 100,000 iterations

roticv's algo = 98 tics

sheroc's = 105 tics

Juffa's POPCOUNT = 21 tics.

Average for 100,000 iterations

Yeah, that bittrick is cool, although a bit hard to understand, but heh 5 times faster, worth a deeper look into it :D

thank you Donkey!

thank you all for the help!

thank you all for the help!

Avoiding loops is of course better, but I wonder if avoiding shifts in the loop version would bring any improvement. We could exploit that N &= (N-1) removes the lowest set bit in N, and count the bits using something like (disclaimer: not tested, not optimised, early in the morning, etc. :grin: ):

The loop runs as many times as there are 1 bits in the value.

```
; eax = bits
```

xor edx, edx ; counter = 0

test eax, eax ; if zero, return 0

jz done ;

@@:

mov ecx, eax ; temp = val

add edx, 1 ; counter++

sub ecx, 1 ; temp--;

and eax, ecx ; eax &= temp;

jnz @B ; continue if more bits left

done:

; edx = count

The loop runs as many times as there are 1 bits in the value.

This is the sort of thing where the patterns you are dealing with can give you the algorithm to choose. If certain bits are more likely to be set than others, then you can select a better algorithm depending on which bits are likely to be set.

If the distribution is uniform across the whole 32 bit range, then Juffa's code will be the best choice, but if there are likely to be few bits set, using "sh(l/r)" and "bs(f/r)" could remove a large portion of the work. Alternatively a simple loop which doesn't count all bits, but stops when a register is zero could work well if the top, or low end is likely to have bits set.

This will range from 5 to 75 clocks on a P3 (or it did for me at least), depending on the most significant set bit. Change the shr's to shl's and it will depend on the least significant set bit.

If you use a bsf/bsr to decide how many to shift, and the population of bits is sparce, then the saving will outweigh the cost of the bit-scan op.

I was also pleased by the above code because I only use 2 registers, so if it's in a macro (or inlined code) the register usage can be a factor in time consumption (although not enough to dent the 50 ticks for its worst case vs. Juffa's).

Mirno

If the distribution is uniform across the whole 32 bit range, then Juffa's code will be the best choice, but if there are likely to be few bits set, using "sh(l/r)" and "bs(f/r)" could remove a large portion of the work. Alternatively a simple loop which doesn't count all bits, but stops when a register is zero could work well if the top, or low end is likely to have bits set.

```
```

mov eax, my_number

xor ecx, ecx

shr eax, 1

@@:

adc ecx, 0

shr eax, 1

jnz @B

adc eax, ecx

; result in eax

This will range from 5 to 75 clocks on a P3 (or it did for me at least), depending on the most significant set bit. Change the shr's to shl's and it will depend on the least significant set bit.

If you use a bsf/bsr to decide how many to shift, and the population of bits is sparce, then the saving will outweigh the cost of the bit-scan op.

I was also pleased by the above code because I only use 2 registers, so if it's in a macro (or inlined code) the register usage can be a factor in time consumption (although not enough to dent the 50 ticks for its worst case vs. Juffa's).

Mirno