Two reasons. The first is that your byte compressor is now actually dealing with nybble patterns instead of byte patterns, meaning it can compress patterns occurring at that level. The second is that by doing so, your data contains only 16 possible values (0 to 15) in each source byte, meaning that the other 240 values are available for you to use as TOKENS IN THE COMPRESSED BYTESTREAM - you can then use token bytes to identify the type and size of a wad of data in a compressed file which contains multiple compression formats aimed at different kinds of data. This makes hybrid algorithms viable: the storage format is so often dictated by the algorithm, and this frees us from that constraint at a very low cost. If there were only one wad of data in the output, under my system there would be one token byte, plus 1, 2 or 4 bytes containing the length of the data, then the data itself - the token byte identifies the compression type and, at the same time, the size of the Length field, and thus the offset from the header to the data within that token.
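Roughly, in C terms, a wad header could be parsed like this (an illustrative sketch only - the struct and function names are made up, and the placeholder length_width() stands in for a real per-token table):

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint8_t        type;    /* which compressor this wad used        */
        uint32_t       length;  /* how many data bytes follow the header */
        const uint8_t *data;    /* points at the compressed payload      */
    } wad_t;

    /* Placeholder: the token byte implies how wide its Length field is
       (1, 2 or 4 bytes).  A real decoder maps every token value.        */
    static size_t length_width(uint8_t token)
    {
        switch (token % 3) {
        case 0:  return 1;
        case 1:  return 2;
        default: return 4;
        }
    }

    /* read one wad header, return a pointer to where the next wad starts */
    static const uint8_t *read_wad(const uint8_t *p, wad_t *w)
    {
        size_t n = length_width(*p);
        w->type   = *p++;
        w->length = 0;
        for (size_t i = 0; i < n; i++)     /* assume a little-endian Length */
            w->length |= (uint32_t)*p++ << (8 * i);
        w->data = p;
        return p + w->length;
    }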

Can you see that by converting the data stream into a series of bytes all between zero and 15 that we are going to see more patterns in the values?
By forcing the potential set of values to be smaller, we are forcing the probability of repetitions within those values to increase.
The same effect (and its opposite) can be seen in lotteries and other forms of gambling - and every gambler knows it: the smaller the field, the higher the chance of winning.
Posted on 2003-05-13 11:02:26 by Homer
Why I would not use this technique:
* I would probably not see any more patterns in most files, since few files use 4-bit chunks to store each item.
* For every 8-bit byte value that was used in the file, a new rule would probably be created.
* Extra characters that could be put into the stream don't mean anything by themselves, since it is all a matter of probabilities.

I see why this would help on, for example, 16-color images (which naturally use 4-bit chunks for each pixel), but on byte-oriented files this would probably hurt compression since it just introduces a bunch of new rules. If you look at, for example, an ASCII textfile, the nibbles of the bytes in it do not individually tell you anything. You need both nibbles in each byte to make out their meaning. Sequitur would probably put the two nibbles of each occurring byte into a new rule.
Posted on 2003-05-14 00:29:06 by gliptic
It's true that the nybble approach may fare worse on some byte-based data, but on the other hand, it may fare BETTER on other byte-sized data.
We can't assume that because the data is originally 8 bits, we won't see 4-bit patterns within it - especially true of ASCII plaintext and, as you pointed out, bitmap graphics, for sure. But even if the source data is relatively RANDOM, we will still find patterns - and here I must point out that the pattern does NOT have to be linear under Sequitur, like it does for something like RLE.
Three or four extra bytes buy the ability to hybridize compression algos (and use the best compression for a given wad of data), and to detect errors in the compressed data because it is now more structured than before - the benefits seem to me to outweigh the cost.
Many compression algorithms use code sections which can be remarkably similar - I've found that hybridizing algos doesn't have to mean totally separate procedures for each algo. I've been playing with two kinds of code in this department: the first version used extra params in the compressor functions to trigger case code within them, while the newer generation uses more of a polymorphic approach (self-modifying code - the main proc modifies helper procs before calling them).

The main problem with most implementations of Sequitur is that they don't deal with the issue of replacing a chunk of plain data with a reference to a rule, and I'd like you to think about that. In the example string ABCDBC, how could you possibly create rule a=BC, and then replace instances of BC with a, when the rule token is 8 bits? You are forced into a situation where you can only have 256 possible Rules, and have no way of telling whether a data stream is already compressed or not, or whether it is corrupt!!
I'd have to say the single biggest benefit is freeing up those 240 byte values so you can use them for whatever you like, knowing they are "illegal" in the compressed stream.
Your thoughts?

(I have tokens which identify the following data types:
0  - plain data, 8-bit Length (up to 256 bytes)
1  - plain data, 16-bit Length
2  - plain data, 32-bit Length
3  - RLE, 8-bit Length
4  - RLE, 16-bit Length
5  - RLE, 32-bit Length
6  - Sequitur, 8-bit Length
7  - Sequitur, 16-bit Length
8  - Sequitur, 32-bit Length
9  - LZH, 16-bit Length
10 - LZH, 32-bit Length
11 - ARC, 16-bit Length
12 - ARC, 32-bit Length
13 - MP3, 16-bit Length
14 - MP3, 32-bit Length
15 - RULE REFERENCE TOKEN
16 to 239 - UNUSED

These values are byte zero of the token header, followed by the indicated number of Length bytes, and then the data...
except RuleReference tokens, which look like a 32-bit token (1 header byte, 4 length bytes), except that the 4 length bytes are used to store a pointer to the actual Rule, which is stored (in my case) in a linked list during compression.
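In rough C, that token-to-width mapping comes out something like this (a sketch only - the enum names are throwaway, not from my actual source):

    #include <stdint.h>
    #include <stddef.h>

    enum {
        TOK_PLAIN8, TOK_PLAIN16, TOK_PLAIN32,   /* 0, 1, 2  */
        TOK_RLE8,   TOK_RLE16,   TOK_RLE32,     /* 3, 4, 5  */
        TOK_SEQ8,   TOK_SEQ16,   TOK_SEQ32,     /* 6, 7, 8  */
        TOK_LZH16,  TOK_LZH32,                  /* 9, 10    */
        TOK_ARC16,  TOK_ARC32,                  /* 11, 12   */
        TOK_MP316,  TOK_MP332,                  /* 13, 14   */
        TOK_RULEREF                             /* 15; 16..239 unused */
    };

    /* number of bytes that follow the token byte before the payload */
    static size_t header_bytes_after_token(uint8_t tok)
    {
        switch (tok) {
        case TOK_PLAIN8:  case TOK_RLE8:  case TOK_SEQ8:
            return 1;                   /* 8-bit Length  */
        case TOK_PLAIN16: case TOK_RLE16: case TOK_SEQ16:
        case TOK_LZH16:   case TOK_ARC16: case TOK_MP316:
            return 2;                   /* 16-bit Length */
        case TOK_RULEREF:
            return 4;                   /* 4 bytes, but they hold a pointer
                                           to a Rule, not a Length          */
        default:
            return 4;                   /* 32-bit Length */
        }
    }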

What I'm aiming for is something like UPX.
UPX compresses the segments within executable files, and does so using different algos for different segments.
For example, it takes advantage of the fact that code segments generally use a very limited set of values (how many opcodes do YOU use?)
What I want is a compressor which is intelligent enough to detect common data types and then try a couple of algos against each, selecting the one which has the best real-world compression ratio for THAT data. Furthermore, where it uses dictionary compression, the dictionary must be created from, and reflect, the data being compressed.
Why shouldn't compressors become more intelligent, and what's wrong with structuring the compressed data? Have you heard of a Recovery record, or seen compressors which can "repair" a corrupt volume? Do you feel that Zip is all we should expect from our compressors? Some people are just never satisfied - and I'm one of them.
Posted on 2003-05-14 01:06:25 by Homer
Why would you limit the number of tokens to 256? You can have as many tokens as you like (okay, some arithmetic coders have a limit of course), and you can't look at the stream and see if it's corrupt without using checksums or something. To me it sounds like you wanted to store the tokens as they are (one token per byte). That's a really stupid plan if you aim for good compression.

Can you explain to me why there would be more patterns in 4-bit data when the data is originally byte oriented?
Posted on 2003-05-15 05:22:33 by gliptic
Why would you limit the number of tokens to 256?


You don't WANT to limit anything - perhaps I was not clear.
If the source data is 8-bit binary data, then the bytes in the source data, as you know, can have values of 00 through FF (0-255).
All values in that range are legal, none are special, and we have no "spare" values which we can use to denote something in the data stream - you can't even detect the difference between an embedded token and the data stream it lives in, unless you have one or more reserved values up your sleeve. You don't need 240 of them - one would be fine - but with 8-bit source data there's NO way of distinguishing anything in the data stream unless we either reserve one or more values (which we can't do), or encode the data stream SOMEHOW to give us AT LEAST ONE value which will never appear in it.

(one token per byte).

No, I don't replace single bytes with a series of bytes representing them (lol).
I replace instances of SUBSTRINGS with "referencing tokens" that are used to provide indirection: they point to nodes of a linked list of Rule tokens, which contain the substrings and/or references to other Rules (the Rule base is a Tree).

I no longer use the hex encoding method - it was something I tried early on. I now use something much like Base64 as a precompression encoding. Hex encoding inflates the precompressed source data by 2x, whereas the B64-style encoding inflates it by only about 1.3x.
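The core of it, very roughly (not my actual routine - no handling of the 1- or 2-byte tail, just the idea): every 3 source bytes become 4 output bytes in the range 0..63, so the values 64..255 never appear in the encoded stream and are free for tokens, at a size cost of about 1.33x.

    #include <stdint.h>
    #include <stddef.h>

    /* pack each 24 bits of source into four 6-bit output bytes (0..63) */
    static size_t expand_6bit(const uint8_t *src, size_t n, uint8_t *dst)
    {
        size_t i, o = 0;
        for (i = 0; i + 3 <= n; i += 3) {
            uint32_t v = ((uint32_t)src[i] << 16) | (src[i+1] << 8) | src[i+2];
            dst[o++] = (v >> 18) & 0x3F;
            dst[o++] = (v >> 12) & 0x3F;
            dst[o++] = (v >>  6) & 0x3F;
            dst[o++] =  v        & 0x3F;
        }
        /* a real version would also encode any 1- or 2-byte tail */
        return o;   /* roughly 1.33 * n */
    }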

(Can you explain to me why there would be more patterns in 4-bit data when the data is originally byte oriented?)


Yes. Let's imagine for starters a block of ascii plaintext.
Plaintext is effectively 7 bits, and we use very little of that range, so although we will see a lot of variation in the lower nybble, the upper nybble won't change much. But in terms of 8 bits, that's not why we stand to benefit from viewing the data as nybbles rather than as bytes, words, dwords, qwords or whatever the hell it really is...
To demonstrate the principle further, consider that the valid set of values for a nybble is "0123456789ABCDEF" - that is, there are only 16 possible values for a given nybble. Imagine the plaintext number "3738": in hex we'd have 33 37 33 38, and if we collapse the bytes, 33373338, and you might care to notice the 12-bit pattern 333 occurring twice. Of course, the pattern "73338" could end up being more popular - who can tell?
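To make that concrete, here's a throwaway snippet that just splits those bytes into nybbles so you can see the repeated 3,3,3 run (illustration only):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        const uint8_t text[] = { 0x33, 0x37, 0x33, 0x38 };   /* ASCII "3738" */
        uint8_t nyb[8];

        for (int i = 0; i < 4; i++) {
            nyb[2*i]   = text[i] >> 4;      /* high nybble */
            nyb[2*i+1] = text[i] & 0x0F;    /* low nybble  */
        }
        for (int i = 0; i < 8; i++)
            printf("%X", nyb[i]);           /* prints 33373338 */
        printf("\n");
        return 0;
    }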
When compression is complete, I evaluate the rulebase for rules which appear too few times to be considered of value, for example all those rules which represent a single duplication of a short string.
I'm finding that the benefit from working on a 4 bit basis varies with the source data as much as the compression probabilities do.
That means that on some data, there is little or no benefit.
On other data, there is a glorious gain over byte based compression.
If we can't lose, it can't be a bad thing, right?
(I've cut the cost from 200% to 130% of the same algo under byte length data)
Posted on 2003-05-16 00:12:34 by Homer
EvilHomer2k, then why not use binary?
It is the method I'm using. :)
Posted on 2003-05-16 00:31:53 by bitRAKE
bitRAKE,
As you are probably aware, binary compressors were almost common in the mid to late 80's on 8 bit machines, and they were incredibly efficient, although they had unrealistic compression times.
You are implementing a binary Sequitur, or some other algo?
What is the largest keyspace you search, and what does your searching algo look like? I'm very curious :)
I was tempted to write an N-bit binary compressor, but I fear segmented addressing too much - I have a flat background. Anyway, I found that multiples of 4 bits were adequate resolution for my expectations - although I must admit I am somewhat of a perfectionist, and now I hear someone else out there still thinks in bit-pattern terms, I might revise my thinking too... tell me more :)
Posted on 2003-05-16 02:03:47 by Homer

You are implementing a binary Sequitur, or some other algo?
Yes
Posted on 2003-05-16 07:50:37 by bitRAKE
:tongue:
Posted on 2003-05-17 10:22:16 by Homer
:) I'm thinking in terms of a very small decompressor with high compression ratios, using the bit instructions (BT/BTS/etc) on the x86. Speed of decompression/compression is not a concern.
Posted on 2003-05-17 10:31:42 by bitRAKE
The worst case for 4-bit decomposition, compared to 8-bit decomposition, is that we get 256 extra rules. Do you call that not losing anything? The small gain in new patterns that can be found would probably just make up for the loss.

Bit decomposition makes more sense than 4-bit decomposition and might be a good idea for data that is bit-oriented. For byte-oriented data this would probably degrade compression performance even more (I can't see why it would help compression, at least). The bit-oriented compressors that are used nowadays don't use anything like Sequitur; instead they use prediction, like PPM. I think DMC (Dynamic Markov Coding) is bit-oriented, but that coder uses prediction and not anything close to Sequitur.
Posted on 2003-05-17 12:24:32 by gliptic
Methinks you may be too wound up about data representation:
Why are you so concerned about N-orientation of the data?
ALL data is a bitstream - it matters not whether you observe it in N bits, nybbles, bytes, QWORDS or whatever manageable size run of bits you select.
Consider a frame-based algorithm which deals with frames of data that are N bits in length - these are common enough. What bitRAKE is talking about is just a system which can handle variable-bitlength frames of data - I called them tokens, but I referred simply to a wad of data N bits long. Why handle 4 bits at a time? I dunno - does it really matter? All it determines is the RESOLUTION OF THE FRAME, in the sense that my frames will always be multiples of 4 bits in length. Of course, a resolution of 1 bit would be lovely, but consider that your databasing system must now be able to track the bitlengths of the frames as well as their offsets in the UNCOMPRESSED stream - and when you are dealing with BIT offsets, you are generally going to implement some kind of indexing system to help address the start offsets of the decompressed frames. This means more overhead in your databasing system (dictionary, rule tree or whatever it may be).
Sticking to a known multiple bitlength alleviates some of that drama.
Keeping the bitlength multiplier small but constant allows us a reasonable search resolution, while reducing addressing overhead.
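Put in struct terms, the bookkeeping difference is roughly this (my own illustration, not lifted from real code):

    #include <stdint.h>

    /* 4-bit resolution: every frame is a whole number of nybbles */
    struct frame_nybble {
        uint32_t nyb_offset;   /* offset into the uncompressed stream, in nybbles */
        uint32_t nyb_length;   /* frame length, in nybbles                        */
    };

    /* 1-bit resolution: same fields, now in bits - two more low-order bits
       of range per field, and in practice an index over the bitstream so
       frame starts can be found without scanning                           */
    struct frame_bit {
        uint64_t bit_offset;
        uint32_t bit_length;
    };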
Posted on 2003-05-18 02:05:20 by Homer
Exactly, and since most data is at least 8-bit aligned, there are not many reasons to use a lower alignment, since the overhead would become too large.
I might be way off here but theoretically, I can't find any reason to use bits instead of bytes.

8-bit or more aligned data:
* ANSI text
* 256+ color images
* All machine code types that I'm aware of
* Sounds
* Most other datafiles

4-bit aligned data:
* 16-color images
* eh... help me here...

1-bit aligned data:
* Already compressed data (excluding compression schemes that use arithmetic coding)
* B/W images
Posted on 2003-05-19 03:20:35 by gliptic
I note you mention arithmetic coding but fail to mention bitwise OR compression!!
Posted on 2003-05-21 10:42:53 by Homer
You mean like huffman coding? Come on, everybody knows that huffman is suboptimal compared to arithmetic coding. Huffman is also much slower when the coding model is adaptive.
Posted on 2003-05-22 01:50:00 by gliptic
Bitwise OR compression has probably been around longer than Huffman's algos - although 1 to 1 Huffman encoding is probably his closest algo.
Anyway, I was merely pointing out that algos which compress at the binary level have been with us for more than a short while now.
Posted on 2003-05-22 02:36:52 by Homer
Can you explain what you mean with bitwise OR compression? To me, it sounds like the way prefix codes like huffman are coded into a bitstream but I guess you mean something else.
Posted on 2003-05-22 09:46:36 by gliptic
Well, in Huffman 1:1 encoding, compression is only achieved by assigning the shortest possible bitlength symbol to the most common byte value(s).
Binary OR based compression (which should rightly be called Binary And Or compression) is different in that it doesn't assume the constant data element size (byte) which Huff 1:1 does; it works on a sliding binary window whose length either grows or shrinks depending on the implementation. It's more like Markov encoding, because it works by predicting the likelihood of a repeat of a binary pattern based on patterns it has seen before. It has no knowledge of the data formatting or of the probability of byte values. Similar in some ways, different in others. It compresses by assigning the shortest possible bitlength symbol to an N-bit binary pattern.
Posted on 2003-05-23 06:52:58 by Homer
Some implementations handle the "prediction" a little differently...
In one implementation I looked at, it took frames of N bytes of data and compared one half to the other half, repeatedly halving the search size - a kind of binary space partitioning. If the halves were the same, it would replace both with a symbol N bits long. If not, it would nest, until it was comparing just two bits.
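From memory, the core of it looked roughly like this (a byte-granular sketch only - the real thing nested right down to pairs of bits, and the output side is faked here with a print):

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>
    #include <stdio.h>

    /* stand-in for whatever the real coder writes to its output stream */
    static void emit(const char *what, size_t len)
    {
        printf("%-15s %zu byte(s)\n", what, len);
    }

    /* compare the two halves of a block; if they match, one symbol can
       stand for "this half, twice", otherwise nest into each half
       (assumes power-of-two block sizes)                               */
    static void halve(const uint8_t *blk, size_t len)
    {
        if (len < 2) {
            emit("literal", len);
            return;
        }
        size_t half = len / 2;
        if (memcmp(blk, blk + half, half) == 0) {
            emit("repeated half", half);   /* one symbol covers both halves */
            halve(blk, half);              /* then describe that half       */
        } else {
            halve(blk, half);
            halve(blk + half, half);
        }
    }

    int main(void)
    {
        const uint8_t sample[8] = { 'A','B','C','D','A','B','C','D' };
        halve(sample, sizeof sample);
        return 0;
    }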
Posted on 2003-05-23 06:59:53 by Homer
Do you have any implementation/papers for this method? I have never heard about it before. Sounds a bit like LZ77 in binary. It all seems like a way to speed up the state-of-the-art prediction methods.
Posted on 2003-05-25 15:19:38 by gliptic