warning: large post ;)

I've been reading some PNG specifications lately to see how hard it would be to write a library for it (just like my giflib), particulary the huffman decoding.
PNG uses a combination of LZ77 (an algorithm that uses pointers to already outputted strings to reduce redundant data) and huffman encoding.

Huffman codes are bit sequences of any size, each one representing a symbol in from the alphabet (abc, bytes 0-255, anything you want). These bit codes are chosen in a way that when a sequence of bit codes is made, the decoder can still seperate the codes without the need of a seperator.
A small example is:
A: 00
B: 1
C: 011
D: 010
BCBBDA would be: 10111101000
When the code table above is known, you can split this bit sequence into the symbols it represents, although there aren't any separators between the symbols.

RFC1951, a document which describes the 'deflate' compression format, shows a method to generate huffman codes based on a frequency list for the bitlengths.
For example, the example above can be described in bit lengths as {2,1,3,3}: 2 bits for A (00), 1 bit for B (1), 3 bits for C (011) and 3 bits for D (010).
Put these frequency of each bit length in an array (bl_count) like this:

``````
bl_count = {0,1,1,2};``````

This means:
0-bit length: occurs zero times (not used)
1-bit length: occurs once (namely for B)
2-bit length: occurs once (namely for A)
3-bit length: occurs twice (for C and D)

Then put the bit lengths themselves in an array as well:
(note: the code I show here is slightly different from the one in the RFC)
``````
bit_lengths[] = {2,1,3,3}; // 2 for A, 1 for B, 3 for C, 3 for D
``````

Now you can generate the codes (bit_codes[]) for each symbol with this code:
``````
MAX_BITS = 3; // longest is 3 bits
MAX_CODE = 4; // 4 symbols in alphabet
next_code[MAX_BITS];
code = 0;
for (bits=1;bits <= MAX_BITS;bits++)
{
code = (code + bl_count[bits-1]) << 1;
next_code[bits] = code;
}

for (n=0;n<MAX_CODE;n++)
{
len = bit_lengths[n];
if (len!=0)
{
bit_codes[n] = next_code[len];
next_code[len]++;
}
}
``````

The first for loop creates the code bases for each code length. The rest of the codes for each code length are generated by just increasing the base by one for every code (the second loop).
In this case:
next_code table after first for:
``````
next_code =   00b
next_code =  010b
next_code = 1110b

(the other next_code values are filled in but not valid nor used)
``````

The actual code can be generated by taking the base (next_code) for a length, an increasing it for each new code:
``````
2 bit codes:
00
3 bit codes:
010
011
100
101
110
4 bit codes:
1110
1111
``````

That's the basic idea behind it, for more details read the RFC (1950+1951).

Now, the code stream is a sequence of these codes. I was thinking of how I could implement this in the fastest way possible. Just storing all codes in several arrays would require a lot of look-ups and memory access, which is slow. I searched for a method to determine which bit length the next code has, before doing a look-up for the corresponding value. So I examined the code generation algorithm described above closer, and found something interesting:

code = (code = bl_count) << 1;

Each code base (next_code) for a given bit length is based on the previously generated code base and the frequency of the previously generated bit length (the number of times that bit length is used).

"code base + frequency" gives one more than the maximum code for that bit length. In the example above, the bit length 3 was used 5 times. The code base was 010b. 010b + 5 = 111b.
111b is one more than the highest code generated with length 3 (110, see above).
This value is then shifted left once, a zero bit is appended:
111b SHL 1 = 1110b.

Because this are huffman codes, you should be able to distinguish a 3-bit code from a 4-bit code.
If you would treat the sequence 1110b as a 3 bit code: 111b, you will see that it does not fall into the range used by 3 bit codes (as 110 is the highest value). Each code is based on maximum previous code + 1 (and thus out of range), shifted left by one. So when you don't read enough bits (say 3 instead of 4), a simple range check will show that the code can never be a 3 bit code.

A decoder algorithm for the example codes above could be this:
``````
xor     eax, eax

shift 2 bits into eax (2 bits from input stream, shift in from the LSB)

cmp     eax, 01b    ;max 2-bit code + 1
jb      @its_a_2_bit_code

shift 1 bit more into eax

cmp     eax, 111b   ;max 3-bit code + 1
jb      @its_a_3_bit_code

shift 1 bit more into eax

@its_a_4_bit_code:
lookup value of eax into 4-bit code table

@its_a_3_bit_code:
lookup value of eax into 3-bit code table

@its_a_2_bit_code:
lookup value of eax into 2-bit code table
``````

There are several problems here:

Huffman codes are packed like this:
``````
huffman codes: ABCD, EF, GHIJKL [msb to lsb]
byte sequence:
1st byte: HGFEDCBA   [msb to lsb]
2nd byte: ??LKJIHG   [msb to lsb]
``````

So, 'shifting in a bit into eax on the right', means that you first have to get a bit from the input sequence and then shift it into eax (reversed order). Assuming ecx contains the next 32 bits from the input stream:
shr ecx, 1
rlc eax, 1

This shifts in one bit, using the carry as temporary buffer. This is okay for 1 bit, but for 5 bits you need to do this 5 times... Say for example that only 2 and 10 bit codes are used. After you've checked that it isn't a 2 bit code, you need to shift in 8 bits, in reversed order compared to the input stream.

Second problem: in the code above I used hardcoded values but of course these depend on the code tables. I was thinking about a piece of self modifying code to 'compile' the bit-detection into executable code. If I won't do this, I would still require to look up the bit frequencies.

I'd like to hear your opinions (if you made it through here ;) ) on this algorithm, especially alternatives for the bit shifting (reversing order, but for any length), and the implementation of the range checks (SMC? smart lookup tables?)

Thomas

P.S. I think this is my longest post ever :grin:
Posted on 2002-03-13 10:26:14 by Thomas
Here's another example. ecx holds the next 32 input bits, when done, edx will hold the value represented by the first huffman code in the stream, using the code table below:

``````
3-bit codes:
000
001
010
011

5-bit codes:
10000
10001
10010

8-bit codes:
10011000
10011001
10011010
10011011
10011100

10-bit codes:
1001110100
1001110101

; Assumes ecx contains the next 32 bits of input data

xor eax, eax

;shift 3 bits into eax
shr ecx, 1
rcl eax, 1
shr ecx, 1
rcl eax, 1
shr ecx, 1
rcl eax, 1

cmp eax, 011b + 1
jb  @its_a_3_bit_code

;shift 2 bits into eax
shr ecx, 1
rcl eax, 1
shr ecx, 1
rcl eax, 1

cmp eax, 10010b + 1
jb  @its_a_5_bit_code

;shift 3 bits into eax
shr ecx, 1
rcl eax, 1
shr ecx, 1
rcl eax, 1
shr ecx, 1
rcl eax, 1

cmp eax, 10011100b + 1
jb  @its_an_8_bit_code

@its_a_3_bit_code:
mov  edx, lookUp3bit[eax*4 - 4*0]      ; 0 is 3-bit code base
jmp @done

@its_a_5_bit_code:

mov  edx, lookUp5bit[eax*4 - 4*10000b] ;10000b is 5-bit code base
jmp @done

@its_an_8_bit_code:

mov  edx, lookUp8bit[eax*4 - 4*10011000b] ; 10011000b is 8-bit code base
jmp @done

@its_a_10_bit_code:

mov  edx, lookUp10bit[eax*4 - 4*1001110100b] ; 1001110100 is 10-bit code base
;jmp @done
@done:
``````
Posted on 2002-03-13 10:43:00 by Thomas
Thomas, that was a good read. PNG is asm would be a great work. I'm wondering if there is some multiply trick to reverse bits? (quite sure there is but can't do the math right now : I think 2 MUL + 2 AND) I think a look-up table has been used in the past for bit reversals. Can't wait to see more of what you got cooking!
Posted on 2002-03-13 11:40:44 by bitRAKE

Thomas, that was a good read. PNG is asm would be a great work. I'm wondering if there is some multiply trick to reverse bits? (quite sure there is but can't do the math right now : I think 2 MUL + 2 AND) I think a look-up table has been used in the past for bit reversals. Can't wait to see more of what you got cooking!

I have seen some trick for 8-bit numbers (forgot where though :) ), but the problem is that the number of bits to swap is variable.

I think a piece of self modifying code would be very fast. All known values can be hardcoded, so there won't be any memory access, even optimization is possible.
As I know how many bits I want to shift, the proper algo for that number could be inserted..

For small bit sequences (say 3 bits), maybe it's possible to do a look up inside a register (without memory access):
``````
bits	reversed
0	000	000
1	001	100
2	010	010
3	011	110
4	100	001
5	101	101
6	110	011
7	111	111

Reversed bits left padded with a zero bit:
0000
0100
0010
0110
0001
0101
0011
0111

all bits concatenated, right to left:

01110011010100010110001001000000b

; 3 bits to invert are in LS bits of ECX (all other bits are zeroed):

mov  esi, 01110011010100010110001001000000b
shl ecx, 2
shr  esi, ecx
and esi, 111b

; esi holds inverted bits

``````

however I don't know if that will be any faster than 3 SHR/RCLs, and the input stream needs to be shifted anyway.

Thomas
Posted on 2002-03-13 12:00:43 by Thomas
I found this one somewhere for 8-bits:
``````
mov ah,al        ;76543210 76543210
add al,al        ;76543210 65432100
shr ah,1         ;07654321 65432100
and ax,55AAh     ;07050301 60402000
or  al,ah        ;07050301 67452301
mov ah,al        ;67452301 67452301
ror al,2         ;67452301 01674523
rol ah,2         ;45230167 01674523
and ax,33CCh     ;00230067 01004500
or  al,ah        ;00230067 01234567
``````
Posted on 2002-03-13 12:13:34 by Thomas
I've been trying to figure out a short and tricky way to reverse a byte... to no avail.

At this point I would create a 256 byte table with each value's reverse value and use the byte itself as the index.

something like

``````
.data
_lookuptable db 00h, 80h, 40h, C0h, ......

.code
reverseDword proc _dword:DWORD

mov eax, _dword
mov edx, offset _lookuptable

movzx ecx, al
mov al, byte ptr [edx + ecx]
movzx ecx, ah
mov ah, byte ptr [edx + ecx]

bswap eax

movzx ecx, al
mov al, byte ptr [edx + ecx]
movzx ecx, ah
mov ah, byte ptr [edx + ecx]

ret

reverseDword endp
``````

256 bytes might be too much for some people though, but the tradeoff is it would be faster than doing 32 shifts and rotates.
Posted on 2002-03-13 13:32:59 by iblis
Thanks for your code .. I think the maximum number of bits I need to swap is 16.. The code I posted for switching 8 bits takes 5 cycles on my athlon, and doesn't use memory..

Thomas
Posted on 2002-03-13 13:45:27 by Thomas
I should look at the docs for PNG to see how huffman (this is my mothers family name Huffman :)) is used, but I thought I'd ask anyway:

For the SMC method, does this have to be generated for each picture only once, or is the table allowed to be reset - so that anytime in the stream you might have to re-generate the SMC code. I assume the latter, else it wouldn't be any good. :) This aasumption leads me to think SMC isn't a good idea.
Posted on 2002-03-13 14:30:00 by bitRAKE
Just in case anyone's still interested in a dword bitswap, I found this:

``````
;in  - eax
;out - eax

bswap eax
mov edx,eax

and eax,00F0F0F0Fh
xor edx,eax
rol eax,8
or eax,edx
mov edx,eax

and eax,033333333h
xor edx,eax
rol eax,4
or eax,edx
mov edx,eax

and eax,055555555h
xor edx,eax
rol eax,2
or eax,edx
ror eax,7

``````

I don't know how fast it is, but I see a lot of opportunities for optimisation.
Posted on 2002-03-13 14:32:56 by iblis
For the SMC method, does this have to be generated for each picture only once, or is the table allowed to be reset - so that anytime in the stream you might have to re-generate the SMC code. I assume the latter, else it wouldn't be any good. This aasumption leads me to think SMC isn't a good idea.

Well the compressed stream consists of blocks, each block has it's own huffman table, and blocks can be of any size. So it's up to the encoder...

I've considered this as well, it's hard to get an idea of the overhead for SMC. You need to calculate the huffman codes anyway, and decoding the stream requires looking up these codes. When the decoding code is hardcoded, executing it would be quite fast.. So I think it depends on the overhead for setting up the code.
The code is relatively easy to build, just a bunch of cmp, jump and mov opcodes, apart from some nasty problems like short/long variations of opcodes (mov eax,4 is smaller than mov eax, 100h).
Iirc, there are also some problems with executing memory which just has been written to.

Do you have any ideas about a good alternative which doesn't require much memory look-ups (or efficient ones)?

Thomas
Posted on 2002-03-13 15:50:46 by Thomas
No I don't have any bright ideas, but I haven't beat my head against the wall either. :) How big are the blocks? SMC could be the way to go if the generated code is fast enough to overcome the generation overhead. (You know this, I'm just stating the obvious :)) mov eax,4 is the same size as mov eax, 100h :)

How about you create the SMC code, and I'll do the non-SMC code?
Posted on 2002-03-13 16:04:02 by bitRAKE

No I don't have any bright ideas, but I haven't beat my head against the wall either. How big are the blocks?

That's the problem, they can be any size. There's no restriction on them (except for non-compressed blocks, they have a limit of 32kb iirc). It's up to the encoder, it probably chooses a block size that contains similair pixels.

SMC could be the way to go if the generated code is fast enough to overcome the generation overhead. (You know this, I'm just stating the obvious )

Generating it isn't to hard, the only problem is the variable sizes some opcodes have, it makes calculating relative jump offset harder. Of course I could choose to only use the larger version of each opcode but short ones are better as more can be read at the same time by the processor.

mov eax,4 is the same size as mov eax, 100h

Sorry, I meant cmp eax, 4 is different from cmp eax, 100h.. :rolleyes:

How about you create the SMC code, and I'll do the non-SMC code?

Sounds great, I'm still struggling with the decoded data.. I created some test PNG (with sizes like 10x1 or 5x1) and tried to decode them manually. The results didn't make sense to me though and some things from the PNG specs are not quite clear to me. If I would succeed in decoding a simple PNG image manually, I would be sure I understand it.

I have had a quick look at the rest of the PNG format, most of it doesn't seem hard to code, although I think I'm not going to support interlaced PNGs (maybe later), as their ouput pattern makes coding it (especially optimizing it) much harder.

I think I'm going to code some functions for huffman code generation etc., so I can play around with it. I'll mail you the code if I have created something useful.

Thomas
Posted on 2002-03-13 16:29:16 by Thomas
Maybe this .pdf could help. It's very straight forward, even though it might not be the best solution there is at least you could get some ideas. I was suppose to include this example on my btrees but maybe you could make some fast huffman decoder codes out of it. :)
Posted on 2002-03-13 16:35:43 by stryker
stryker: Thanks for the document, I'll have a look at it.. First some sleep :)

I managed to decode an image myself, I made a mistake with the bits per pixel (it was 4 instead of 8 :rolleyes: ).

Thomas
Posted on 2002-03-13 16:51:00 by Thomas
I did a quick test (with bitRAKEs code for testing small code pieces) of SMC. I commited a new page (execute, read&write access), write C3909090h to it (nop,nop,nop,retn) and call it.

When I write that dword *before* the loop, the code takes 1~2 cycles on my athlon. With the memory write *inside* the loop, it takes 785 cycles!

So there seems to be a penalty for executing memory you just wrote to. Of course in my case I only need to write it once and then loop the execution, however 785 cycles is quite some overhead.

I think SMC might not be the fastest way, unless you have really large blocks but you don't know the block size beforehand as the length is determined by the 'end of block' code which is encoded itself as well.

Also, the huffman table (in case of custom codes) is huffman encoded as well :rolleyes:, so you need to do huffman decoding even for decoding another huffman table, and these are quite small in general.

I have some thoughts on using several memory look-ups, and how to use them efficiently. I'll post some ideas later this day.

Thomas
Posted on 2002-03-14 01:51:29 by Thomas
Thomas, you might try writing a larger block of memory (greater than the cacheline size, 64 bytes) to see how the great the penalty is then. Execution will start at the begining of the SMC code I presume, and the cache latency might have cleared up by then. Hopefully, that is the only reason for the penalty - dual dirty cacheline round trip.

Another good link for Huffman and compression articles is Arturo Campos's pages. He is very knowledgable on compression and programming - nice guy too. :) I find his explainations very easy to follow.

What about programming a FSM for byte values? The programming seems complex, but it would really fly thought the data - faster than SMC, imo. I'll see if I can work out some more details. It's similar to a table look-up, but with more data in the table. Guess that would be really slow? :tongue:

Posted on 2002-03-14 09:57:12 by bitRAKE
I think I'll first try a non-SMC version. I have got an algorithm for the whole decoding in my head, and now I'm trying it out. I tried to minimize use of tables and memory, but due to shortage of registers I'll have to use a few counter variables. Not much though, probably one or two.
I have designed a smart lookup table. It consists of dwords, where the higher word of each dword is the (maximum code + 1) for a given bitsize (index in table). The lower word is a forward reference to the table right behind it, holding the values represented by the codes of that bitsize.
I'll try to work it out, I make zips of the code when each part is done so I can easily look back to see how I implemented everything.

Thanks for the links, I'll have a look there as well.

Thomas
Posted on 2002-03-14 10:49:30 by Thomas
Here's the table I worked out. It is used for both determining the length of the next huffman code (using the method described in my first post here), as for looking up the value the code represents.

``````
step 4.

lookup tables LUcodeA and LUcodeB:

(LUcodeA is sometimes referenced to as table A,
LUcodeB as table B)

LUcodeB table should be immediately after table LUcodeA.

LUcodeA table:

Array of DWORDS. The index of the DWORD in this table represents
a bit length: index 0 = bitlength 1, index 1 = bitlength 2 and so on.
Each DWORD is split into a high word (Q) and a low word (P).

The low word P equals "max_code + 1", where max_code stands for the
highest code in the huffman table for that bit length.
This value can be used to check if a huffman code of a given length
is valid for the huffman table used. If the code is equal or higher
to the value of P, it is not a code of the bit length used at that
moment.

The high word Q is a forward reference to address R, which resides
in LUcodeB. To calculate R, use this formula:

R = cur_addr +  2 * (Q  + P - code)

where cur_addr stands for the address of the DWORD containing Q and P
in table A. Remember that table B comes right after table A.
code holds the huffman code to be decoded, the LSB of the huffman code
is in the LSB of this value.

There is one special case for the value of P: when P is 0, there are
no codes of the current bitlength

LUcodeB table:

Array of WORDs, each word represents an output code. All possible output
values (one for each huffman code) are in this table. The codes are in
order of increasing bit size. Codes with the same bit size are ordered
descending by value.

----------------------------------------------------------------------
Example:

2 bit codes:
00                  represents          B
3 bit codes:
010                 represents          C
011                 represents          D
100                 represents          A
101                 represents          E
110                 represents          G
4 bit codes:
1110                represents          H
1111                represents          F

-----Table A------
offset      value
0           P: 0        Q:0     (bitsize 1, not used)
4           P: 01b      Q:5     (bitsize 2)
8           P: 111b     Q:4     (bitsize 3)
12          P: 10000b   Q:7     (bitsize 4)
-----Table B------
offset      value
16          B       [meaning of code 00]
18          G       [meaning of code 110]
20          E       [meaning of code 101]
22          A       [meaning of code 100]
24          D       [meaning of code 011]
26          C       [meaning of code 010]
28          F       [meaning of code 1111]
30          H       [meaning of code 1110]

Say for example that you know you have a 3 bit code, and the
code is 100b ( = 4 dec). The info in table A for a 3 bit
code is at index (bitlength-1) = 3 - 1 = 2.
Index 2 is at offset 8 (cur_addr = 8).
This gives:
P = 111b (7 dec)
Q = 5 dec

Using the formula below:

R = cur_addr +  2 * (Q  + P - code)
R =     8    +  2 * (4  + 7 - 4)
R = 22

At offset 24 (in table B) you will find the symbol A. This is the
value that the huffman code 100b represents.

----------------------------------------------------------------------

An asm implementation could be:

; assumes ebx points to the right index in table A.
; assumes eax holds the current huffman code to be looked up
; edx will, on return, hold the value represented by the
; huffman code.

mov     edx, [ebx]
neg     eax
and     eax, 0ffffh
shr     edx, 16
mov     edx, [ebx + 2 * eax]

Or using 16-bit regs:

mov     edx, [ebx]
neg     ax
shr     edx, 16
``````    neg     ax