Please help me with this question.
It is said that only one ASM instruction is needded to do that. I couldn't figure out the trick.
It is said that only one ASM instruction is needded to do that. I couldn't figure out the trick.
xlat
For a byte, it does a lookup at ebx + al (zero extended to eax)
and returns the value from that location.
If a byte table has the count of one bits stored
for each byte value then xlat which uses al as the index
will return the count.
For a byte, it does a lookup at ebx + al (zero extended to eax)
and returns the value from that location.
If a byte table has the count of one bits stored
for each byte value then xlat which uses al as the index
will return the count.
popcnt is a SSE4 instruction that also will count the number of bits set (to 1).
In general if you want to search for this, the search term is "population count" - but dsouza123already more than hinted this :). You should be able to find a lot of references to this. I have some vague memory of it being related to chess algorithm optimization?
I included a popcount in convertlib....
Pretty fast considering though I expect there are some much faster ones out there.
Donkey
popcount FRAME dwValue
mov ecx,
mov eax,ecx
shr eax,1
and eax,055555555h
sub ecx,eax
mov eax,ecx
shr eax,2
and ecx,033333333h
and eax,033333333h
add ecx,eax
mov eax,ecx
shr eax,4
add eax,ecx
and eax,00F0F0F0Fh
mov ecx,eax
shr ecx,8
add eax,ecx
mov ecx,eax
shr ecx,16
add eax,ecx
and eax,00000003Fh
RET
ENDF
Pretty fast considering though I expect there are some much faster ones out there.
Donkey
You could probably set up a couple of tables and then use
repnz scasw
to check for an ax match
A lot of work to set it up tho...you could set it up for bytes....al instead of ax..yikes
Slightly easier with a rotate loop :D
mov ax,0F00 ;number to chek
mov bx,0 ;final bitcount store
mov cx,10
xxx
rcr ax,1
adc bx,0
loop xxx
repnz scasw
to check for an ax match
A lot of work to set it up tho...you could set it up for bytes....al instead of ax..yikes
Slightly easier with a rotate loop :D
mov ax,0F00 ;number to chek
mov bx,0 ;final bitcount store
mov cx,10
xxx
rcr ax,1
adc bx,0
loop xxx
eek: not good if you want it to be fast, though :)
how about this? (FASM syntax)
can be further optimized by loading constants to registers
but donkey's version is probably faster...
;EAX = number of bits in EBX
popcount:
and eax, 0 ;at least i am original :]
repeat 32
shr ebx, 1
adc eax, 0
end repeat
retn
can be further optimized by loading constants to registers
but donkey's version is probably faster...
Thanks for all you guys, I found two impl may be faster
1. use lookup table, each table is 256 choices, and the contents of each of them are the number of 1 bits in a byte.
2. use ( sorry, I'm not expert in ASM )
cnt contains to number of 1 bits in var
1. use lookup table, each table is 256 choices, and the contents of each of them are the number of 1 bits in a byte.
2. use ( sorry, I'm not expert in ASM )
while (var) {
var &= (var-1);
cnt++;
}
cnt contains to number of 1 bits in var
Hi,
I can't see that being too fast, I would assume that the cache would be affected and it requires 4 passes for a DWORD. A quick test shows mine at around 14 clocks (1417564 clocks for 100000 iterations) independent of the data and I am still assuming that there are faster ones out there.
Donkey
EDIT:
It doesn't help to make a statement without showing the test parameters, simple timing test...
I can't see that being too fast, I would assume that the cache would be affected and it requires 4 passes for a DWORD. A quick test shows mine at around 14 clocks (1417564 clocks for 100000 iterations) independent of the data and I am still assuming that there are faster ones out there.
Donkey
EDIT:
It doesn't help to make a statement without showing the test parameters, simple timing test...
mov edi,100000
mov ecx,edi
rdtsc
mov esi,eax
looper:
mov eax,ecx
shr eax,1
and eax,055555555h
sub ecx,eax
mov eax,ecx
shr eax,2
and ecx,033333333h
and eax,033333333h
add ecx,eax
mov eax,ecx
shr eax,4
add eax,ecx
and eax,00F0F0F0Fh
mov ecx,eax
shr ecx,8
add eax,ecx
mov ecx,eax
shr ecx,16
add eax,ecx
and eax,00000003Fh
mov ecx,edi
dec edi
jnz <looper
rdtsc
sub eax,esi
PrintDec(eax)