Hi,
I posted this on fasm forum, but (as roticv suggested) I posted it here too. Hope you will help me... :wink:

Bitstream is a small library for reading/writing on single bit. I wrote it for my LZSS compressor (it's easiest to achieve bigger compression ratio when you operate on single bits). Bitstreaming isn's a critical part of whe whole algo, but optimizing it could make whole compressor faster. Here I attach two functions: one that stores bits in a bit stream, and the other that reads them.
Could you help me in making some speed improvements to these functions? Maybe there is some other way that would be faster?

Here's a short description of the algorithm I used:
- BitStream is a structure that contains all data needed to do streaming:
struct BitStream

.buf dd ?
.ptr dd ?
.bit_ptr dd ?
ends

.buf is a pointer to the first byte of the stream
.ptr points on a current DWORD
.bit_ptr is an bit offset relative to the .ptr

function BitsPut stores given number of given bits into the stream, and then fixes [.ptr] and members. It can write at most 32 bits into the stream.
BitsGet reads given number of bits, and moves the pointer too.
Both functions firstly generate bit mask that is used to store bits, then they actually write them. The complexity comes because bits to be stored can "cross" two DWORDs.

So, could you make any improvements? I would be really grateful...
If you want to see how the bitstreams work in a real program, then just download lzss library and test code.

thanks,
decard

; puts given number of bits into stream

proc BitsPut, .stream,.bits,.bit_count
begin
push eax ebx ecx edx esi edi
; generate mask in edi:esi
xor eax,eax
dec eax
mov esi,eax
mov edi,eax
mov ebx,[.stream]
mov ecx,[.bit_count]
shl esi,cl
mov ecx,[ebx+BitStream.bit_ptr]
shld edi,esi,cl
shld esi,eax,cl
; load new bits into edx:ecx; at this point ecx==[ebx+BitStream.bit_ptr]
xor edx,edx
mov eax,[.bits]
shld edx,eax,cl
shl eax,cl
mov ecx,eax
; finally, store new bits using
mov ebx,[ebx+BitStream.ptr]
and [ebx],esi
or [ebx],ecx
add ebx,4
and [ebx],edi
or [ebx],edx
; and fix the pointers
mov ebx,[.stream]
mov eax,[.bit_count]
add [ebx+BitStream.bit_ptr],eax
cmp [ebx+BitStream.bit_ptr],32
jb .finish
sub [ebx+BitStream.bit_ptr],32
add [ebx+BitStream.ptr],4
.finish:
pop edi esi edx ecx ebx eax
return
endp

; reads given number of bits form the stream
proc BitsGet, .stream,.bit_count
begin
push ebx ecx edx esi
; generate mask (in esi)
xor esi,esi
dec esi
mov ecx,[.bit_count]
shl esi,cl
not esi
mov ebx,[.stream]
mov eax,[ebx+BitStream.ptr]
mov edx,[eax+4]
mov eax,[eax]
mov ecx,[ebx+BitStream.bit_ptr]
shrd eax,edx,cl
and eax,esi
; fix the pointers
mov ebx,[.stream]
mov edx,[.bit_count]
add [ebx+BitStream.bit_ptr],edx
cmp [ebx+BitStream.bit_ptr],32
jb .finish
sub [ebx+BitStream.bit_ptr],32
add [ebx+BitStream.ptr],4
.finish:
pop esi edx ecx ebx
return
endp
Posted on 2004-10-28 11:03:05 by decard
For the reading part, Intel has published an application note that shows how to use MMX to read bits from a bitstream. I use a modified version of this method in my jpeg decoder (see the post JPEG decoder beta) I don't even know if it's faster :)
Posted on 2004-10-28 11:57:25 by Dr. Manhattan
Bit streams are a native type on x86!

BT
BTC
BTS
BTR

...nothing else is needed (for a small number of consecutive bits). :)
Posted on 2004-10-28 12:21:52 by bitRAKE
This should be fast enough, I think:

SetBits:
push esi
mov edx,
push ebx
mov esi,
mov cl,
mov eax,edx
rol edx,cl
shl eax,cl
mov ebx,
add cl,
xor edx,eax
or ,eax
test cl,32
mov ,edx
jz nonewdword
add ebx,4
and cl,31
nonewdword:
mov ,ebx
pop ebx
mov ,cl
pop esi
ret 12

GetBits:
push esi
mov esi,
push ebx
mov ebx,
mov cl,
mov eax,
mov edx,
mov ch,
shrd eax,edx,cl
xadd ch,cl
xor edx,edx
inc edx
test ch,32
jz nonewdword2
add ebx,4
and ch,31
nonewdword2:
shl edx,cl
mov ,ebx
pop ebx
dec edx
mov ,ch
pop esi
and eax,edx
ret 8

Posted on 2004-10-28 14:13:02 by Sephiroth3
Hi Sephiroth3,
Your code doesn't work :( Something with reading arguments from stack? I will try to fix it, but proably you could find a mistake in your code faster :)

bitRAKE:
The procedures work on bigger number of bits, so BT instructions aren't optimal in this case, as they can work on single bits only...
Posted on 2004-10-28 15:39:23 by decard
Ok, now it has to work.
Posted on 2004-10-29 11:24:53 by Sephiroth3
i have posted this to flat asm board too,
i poost it here in case someone interested ( any bugs found are welcome cause i don't see any :)



; Matrix bitstream functions
; here you are, i knew xchg was slow, it could be a size optimization
; use rol cx,8 if its faster for your processor

write_bitstream:; eax = input bitstream from lsb, edi - ptr to buffer, better if aligned
xor ebx,ebx ; cl = bit offset 0-31
or ch,ch ; ch = number of bits to read\write 0-32
jz .done
mov edx,$FFFFFFFF ; 11111111111111111111
rol cx,8 ;xchg cl,ch ; put bit count in cl
shl edx,cl ; bit count in cl 11111111111111000000
not edx ; complement 00000000000000111111
rol cx,8 ;xchg cl,ch ; bit offset in cl
shl edx,cl ; offset 00000001111111000000
not edx ; complement 11111110000000111111
shld ebx,eax,cl ; ebx = upper part
shl eax,cl ; offset eax
and [edi],edx ; mask out first dword
or [edi],eax ; set bits in first dword
add cl,ch ; get second dword number of bits
cmp cl,32 ; compare second dword count with 0,
jbe .done ; if second dword not modified then skip
sub cl,32 ; get number of bits in second dword
mov edx,$FFFFFFFF ; 11111111111111111111
rol cx,8 ;xchg cl,ch ; put bit count in cl
shl edx,cl ; bit count in cl 11111111111111000000
and [edi+4],edx ; mask out first dword
or [edi+4],ebx ; set bits in first dword
.done:ret

read_bitstream:; eax = output bitstream from lsb, edi - ptr to buffer, better if aligned
or ch,ch ; cl = bit offset 0-31
jz .done ; ch = number of bits to read\write 0-32
mov eax,[edi] ; get first dword
mov ebx,[edi+4]; get second dword
shrd eax,ebx,cl; adjust by offset
mov edx,$FFFFFFFF ; 11111111111111111111
rol cx,8 ;xchg cl,ch ; put bit count in cl
shl edx,cl ; bit count in cl 11111111111111000000
not edx ; complement 00000000000000111111
and eax,edx ; mask out eax
.done:ret


here it is without shld shrd:


; Matrix bitstream functions
; here you are, i knew xchg was slow, it could be a size optimization
; use rol cx,8 if its faster for your processor
; of course note that call does a push and a pop too
; in this version i have avoided the use of shld, shrd

write_bitstream:; eax = input bitstream from lsb, edi - ptr to buffer, better if aligned
or ch,ch ; cl = bit offset 0-31
jz .done ; ch = number of bits to read\write 0-32
mov edx,$FFFFFFFF ; 11111111111111111111
rol eax,cl ; hehe
mov ebx,eax
; --- this part is for zeroing out unneeded bits ( if any ) in the upper part
rol cx,8 ;xchg cl,ch ; put bit count in cl
add cl,ch
cmp cl,32
jbe .notzeroe
sub cl,32
shl edx,cl ; bit count in cl 11111111111111111000
and [edi+4],edx ; mask out first dword
not edx ; complement 00000000000000000111
and ebx,edx
or [edi+4],ebx ; set bits in first dword
add cl,32
.notzeroe:
sub cl,ch
rol cx,8 ;xchg cl,ch ; put offset in cl
; --- this part is for zeroing out unneeded bits ( if any ) in the upper part
mov edx,$FFFFFFFF ; 11111111111111111111
shl edx,cl ; offset in cl 11111111111111000000
; if you are zeroing out use this instead
and eax,edx

rol cx,8 ;xchg cl,ch ; put bit count in cl
mov edx,$FFFFFFFF ; 11111111111111111111
shl edx,cl ; bit count in cl 11111111111111000000
rol cx,8 ;xchg cl,ch ; put offset in cl
shl edx,cl ; offset 00000001111111000000
not edx ; complement 11111110000000111111
rol cx,8 ;xchg cl,ch ; put bit count in cl
and [edi],edx ; mask out first dword
or [edi],eax ; set bits in first dword
ret

read_bitstream:; eax = output bitstream from lsb, edi - ptr to buffer, better if aligned
or ch,ch ; cl = bit offset 0-31
jz .done ; ch = number of bits to read\write 0-32
mov eax,[edi] ; get first dword
mov ebx,[edi+4]; get second dword
mov edx,$FFFFFFFF ; 11111111111111111111
shr eax,cl
shr edx,cl ; offset in cl 00000011111111111111
ror ebx,cl
not edx ; offset in cl 11111100000000000000
and ebx,edx
or eax,ebx
mov edx,$FFFFFFFF ; 11111111111111111111
rol cx,8 ;xchg cl,ch ; put bit count in cl
shl edx,cl ; bit count in cl 11111111111111000000
not edx ; complement 00000000000000111111
and eax,edx ; mask out eax
.done:ret
Posted on 2004-11-09 06:41:27 by >Matrix<