Here is the reason why that occurred...
The source data was two repetitions of abcdbc, i.e., abcdbcabcdbc.
After creation of the very first rule (bc), we get aAdAaAdA, as you pointed out.
However, I do not replace instances of a new Rule in source data I have not yet placed into my processing buffer - it's a look-backwards approach.
I do always search for the longest possible substring instances in the Hashing Buffer, but because I fetch data into it as I work, there is a point where the hashing buffer looks like this:
aAdAaAd
My algo recognizes the aAd length-3 repeating string.
Had I fetched that final A into there already, it would pick up the 4-length string too.
Or had I looked forward and replaced the bc with another A, likewise.

This algo seems easy enough when you read it, but just think for a moment about how you might actually manage the data in asm - for starters, how you might create a unique token which contains a stream of plaintext and references to other unique tokens... I'm using linked lists because that was what occurred to me.
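Just to illustrate (roughly - the real asm layout differs, and these field names are only for illustration), such a node might look something like this in C: each node is either a literal byte or a reference to a rule, and a rule's body is itself a list of the same nodes.

/* Hypothetical token node for the linked list described above. */
enum token_kind { TOK_LITERAL, TOK_RULE_REF };

struct rule;                      /* forward declaration */

struct token {
    enum token_kind kind;
    unsigned char   literal;      /* valid when kind == TOK_LITERAL  */
    struct rule    *rule;         /* valid when kind == TOK_RULE_REF */
    struct token   *next;         /* next token in this stream/rule  */
};

struct rule {
    int           id;             /* rule name: A, B, C, ...         */
    int           refcount;       /* how many tokens reference it    */
    struct token *body;           /* the rule's own token stream     */
};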

What I SHOULD do is replace instances ahead of my current sourcing address.
The problem with that is that at the moment I am fetching bytes and placing them into nodes of a linked list... I'd either have to create a node for every source byte up front (expensive memory-wise), or I'd have to change the way I work with single-byte nodes and manipulate strings across nodes.

Either way you are quite right to point this out, it's something I will address.
Originally I noticed that quirk, but decided that the aAd sequence could well prove more common than the aAdA sequence. That was before I realized I could use a "partial rule" method to encode aAd instances using the same rule as aAdA... which I did mention somewhere earlier, I believe, but as usual I probably lost my chain of thought and forgot what I was going to write :)
Anyhow, yes, it's an issue, and one that needs addressing.
Damn, this would be easy in a symbolic language - the algo is truly better suited to a scripting language in terms of ease of implementation, but it will be worth the effort.

So how far removed from genuine Sequitur is my algo now?
Really not far at all. I'm eager to get the darn thing working and compare its output to my original algo's output, which, with due respect, is quite slow.
Posted on 2003-03-24 05:51:54 by Homer
I don't see why your algo would compress better than Sequitur (without that partial rule tweak). Can you give me an example where it does outperform Sequitur, so we can see why it does that? Your algo is certainly not faster than Sequitur, but it is still top-down, so the compression gain is not immediately noticeable.
Posted on 2003-03-24 06:27:25 by gliptic
Immediately obvious: it outperforms Sequitur wherever the source data contains a decent amount of serially repeating bytes.
Sequitur has to create rules within rules to deal with large blocks of data.
For example, aaaaaaaa:
aaaaaaaa -> AAAA   (rule A=aa)
AAAA     -> BB     (rule B=AA)
BB becomes the top-level hash token, H=BB
Note that a longer string would make for more Rules under Sequitur.

So we have H=BB, B=AA, A=aa.
We need more than two tokens to encode this data: at the very least there will always be some Hashing Tokens (series of plain bytes and rule references), and of course the Rules themselves, which also generally contain a mix of plain-byte tokens and rule-reference tokens.

When we use RLE to preprocess the data, we get to replace that entire section of data with just ONE token - as few as 3 bytes in length, or as many as 6 bytes for an RLE token that represents run lengths of up to 32 bits.
If there are more sections like this, the gain we make when we apply something like sequitur as a SUBSEQUENT phase increases significantly.
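Roughly speaking, such a token might be laid out like this (the marker values and the little-endian count here are only illustrative, not the exact format):

#include <stddef.h>
#include <stdint.h>

#define RLE_MARK_SHORT 0xF0   /* run length fits in one byte      */
#define RLE_MARK_LONG  0xF1   /* run length needs a 32-bit count  */

/* Emit an RLE token: marker, repeated value, then 1 or 4 count bytes,
   giving a 3-byte or 6-byte token as described above. */
size_t rle_emit(uint8_t *out, uint8_t value, uint32_t run)
{
    if (run <= 0xFF) {
        out[0] = RLE_MARK_SHORT;
        out[1] = value;
        out[2] = (uint8_t)run;
        return 3;
    }
    out[0] = RLE_MARK_LONG;
    out[1] = value;
    out[2] = (uint8_t)(run & 0xFF);           /* little-endian count */
    out[3] = (uint8_t)((run >> 8)  & 0xFF);
    out[4] = (uint8_t)((run >> 16) & 0xFF);
    out[5] = (uint8_t)((run >> 24) & 0xFF);
    return 6;
}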

This holds true for ANY "obvious" binary pattern.
As a further demonstration, one RLE variety I could have included but did not use would recognize incrementing series of bytes (often quite good when compressing pixel color data)...
Incremental runs can be encoded quite easily, same for decremental.
If we wanted to get extreme we could check for all kinds of binary patterns.
The important part to realize is that I am not merely re-compressing the output data from the first stage. If that were so, I would likely be compressing the token header data as well !!! I am forming a linked list which holds tokens that represent the original data, and then I am reprocessing that token list, but only for the tokens which eluded compression in the first stage.

When I am completely happy with the performance of the raw compressor, I will apply it in a PE segment compressor with a self-decompressor bound to the front, similar to UPX.



Is the Sequitur Variant part of my code any better than the original Sequitur?
Well, I think it will be. I am implementing the revised version, which can spot phrases that are very similar to an existing Rule... it does so by defining a new kind of rule-referencing token which specifies a logical mask to apply to the Rule.
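Roughly, the idea is a reference token along these lines (treating the "mask" as a sub-range of the rule body purely for illustration; the real layout may differ):

/* Hypothetical partial-rule reference: a normal reference carries only
   the rule id; a partial one adds a selection over the rule's body. */
struct rule_ref {
    int      rule_id;
    int      partial;     /* 0 = reference the whole rule            */
    unsigned start;       /* first symbol of the rule body to keep   */
    unsigned count;       /* number of symbols to keep               */
};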
Again, I am just interested in getting the thing working at the moment.
It's still weirding out on me right now: I'm finding odd things happening inside my dynamic arrays - odd because they are in fields which I explicitly set to a given value, and more odd because the problems are intermittent.

I fail to see how my compressor can fail to be at least as good as the current crop, considering what I said about RAR-compressed files... over a 15 MB file I saved around 20 KB on average just with RLE, and I DEMAND further gain in stage 2.
In the very worst case under RLE, the source data will grow slightly.
The output data can be discarded in this event in favor of the original data (wrapped in one or more huge tokens) with no performance hit in stage 2.
If I can't write a decompressor for this in under 2kb, let alone 20, I'll eat my cd collection.

Anyways it's always fun to revisit this kind of project every few years.
Posted on 2003-03-24 11:40:05 by Homer
I'm not talking about RLE here or any other weird byte pattern matching. I'm not talking about tweaks that can be done to partially match rules and such. I'm just talking about a simple rule-tree where all rules are fully expanded. Does your algo make better rule trees then?
Finding all kinds of byte patterns is not a good idea, since no particular one of these patterns is generally more common than the others, and special-casing hundreds of patterns is seldom a good idea for a general-purpose compressor. The better thing would be to use a prediction stage if the data is image/sound or some other continuous stream.
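For example, a prediction stage for smooth data can be as simple as a delta filter (just an illustration, not a specific scheme): predict each byte as the previous one and store only the residual, which clusters near zero and codes cheaply.

#include <stddef.h>
#include <stdint.h>

/* Delta-encode in place: buf[i] becomes buf[i] - buf[i-1]. */
void delta_encode(uint8_t *buf, size_t n)
{
    uint8_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t cur = buf[i];
        buf[i] = (uint8_t)(cur - prev);   /* the prediction residual */
        prev = cur;
    }
}

/* Inverse transform: rebuild the original bytes from the residuals. */
void delta_decode(uint8_t *buf, size_t n)
{
    uint8_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        prev = (uint8_t)(buf[i] + prev);
        buf[i] = prev;
    }
}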

It seems like your compressor is not going to be general purpose. What is it aiming at? I mean long increasing and decreasing byte patterns are not that common and should not be given high probability in a general purpose compressor.

When we come down to it, it's all just probabilities. Shannon taught us that the number of bits that should be assigned to a token (pattern, rule, file or whatever) is log2(1/P) where P is the probability for the thing to appear at that place and at that time.
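In code form, just to make the formula concrete: a token that appears 32 times out of 256 has P = 1/8 and deserves 3 bits.

#include <math.h>
#include <stdio.h>

/* Ideal code length in bits for a token with probability p. */
double ideal_bits(double p)
{
    return log2(1.0 / p);                       /* equivalently -log2(p) */
}

int main(void)
{
    printf("%f\n", ideal_bits(32.0 / 256.0));   /* prints 3.000000 */
    return 0;
}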
Posted on 2003-03-24 14:40:56 by gliptic
The two algos combined in the proposed order produce better output than either algo does alone. Outside of rule tweaks, the tree produced by my algo is certainly no better than the tree produced by Sequitur proper; however, Sequitur proper does not handle very well the kind of data RLE is aimed at, and RLE doesn't do anything well BUT what it is aimed at. Together, with intent and purpose, each can address the shortcomings of the other.
I am not claiming to have broken new ground here.
I am still debugging the darn thing, and will post it in due course.
Then you may tear it to shreds.
I was kinda hoping to spark some innovative responses in this forum, possibly opening further avenues of exploration in this field.
I'll keep tinkering in the meanwhile, as I am far from convinced that we have yet achieved the most spectacular compression algo possible.
I reiterate that my approach is to seek out the pros and cons of existing algos, attempting to retain the good while addressing the bad, in the hope that I might glimpse a new algo which makes them redundant.
Also, it's been a novel experiment in linked lists for me, with some curious procedures written to deal with things like replacing N referenced tokens with a single token while supporting mixed data types.
I've certainly felt challenged by some of the coding, and so long as I feel that way, I shall continue, because to do otherwise is to stagnate.
Posted on 2003-03-28 09:29:24 by Homer
To achieve better compression than the original Sequitur, the main goal would be to code the rules better. Partial rule matching will only be advantageous in very specific cases (I have not found any yet). It rather increases the complexity and increases the size of other, non-partial rule references. I have not looked into what kind of entropy coding the normal Sequitur implementation uses, but I'm sure I can beat it. My idea was to sort the rules so that a rule only contains rules that are below it. That way, you can exclude many rule tokens for the top rules. Another idea is to use some hierarchical prediction of a rule's probability (don't ask me what that means :) ).
Posted on 2003-03-28 13:43:07 by gliptic
Interesting that you mention rule-sorting.
This occurred to me also, but during practical experiments with the algo (or at least with my variant), I have not found a single instance where it is not true... each new Rule contains either previously unseen plaintext, or some mixture of plaintext and reference(s) to EARLIER rule(s).
In some of the examples of Sequitur, each new Rule has been named with the label A, and all existing Rules relabeled.
This is misleading.
I hypothetically call my first Rule A, and invent new names for subsequent rules.

B can contain a reference to A, but A can never contain a reference to B, since B did not exist at the moment that A was created.

Kill me if I'm wrong here, but I'm pretty sure that is a feature of the algo, and it certainly seems to be a feature of my variant.

The partial-rulematch tweak has merit in that, since I already aim to find the longest possible rules even though they may not be the most common, the chance of finding a partial match within an existing rule becomes great; if we take advantage of this, we can avoid creating many short rules which appear as substrings of some other rule. In terms of overall efficiency, I am going for the bell curve of probabilities, because the value of a replacement depends on the size of the rule as well as the number of instances... this means that the most valuable replacements will probably occur at medium lengths, with very long rules being very uncommon, and very short rules being very common but yielding a low return in bytes saved.
Thus, even though short terms can be quite common, if we aimed for them we would not get a good byte saving; similarly, if we only went for long ones we would miss out, even though they have a potentially higher yield per instance.
By recording the longest possible rules and then treating them as additional source streams, we can iteratively find them without losing ground by creating a rulebase that itself contains repetitive data, along with a plethora of references to it. In the worst-case scenario, the compression yielded from the processing of the smallest possible rule length (2) could be negative, depending on the formatting of the tokens. And as I have expressed already, in the very best case it would be roughly 50%, with NO chance of it being better than that. The only way to beat the 50% threshold under pure Sequitur is to increase the length of nested rules. Yep?
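A back-of-the-envelope way to see that trade-off (a rough cost model counting one unit per symbol plus a guessed rule header, not my actual token format):

/* Rough saving from replacing `count` instances of a length-`len`
   substring with one rule: the instances collapse to one reference
   each, and we pay for the rule body plus a header. */
long rule_saving(long len, long count, long rule_header_cost)
{
    long before = len * count;                /* symbols replaced     */
    long after  = count                       /* one reference each   */
                + len + rule_header_cost;     /* rule body + header   */
    return before - after;                    /* can be negative      */
}

/* e.g. len=2, count=2: before=4, after=4+header -> a net loss,
   which is the "length 2 could be negative" case above. */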
I still haven't implemented a look-ahead method yet, either.
I got sidetracked :tongue:
cya:)
Posted on 2003-03-28 16:54:00 by Homer

And as I have expressed already, in the very best case it would be roughly 50%, with NO chance of it being better than that. The only way to beat the 50% threshold under pure Sequitur is to increase the length of nested rules. Yep?
I still haven't implemented a look-ahead method yet, either.
I got sidetracked :tongue:
cya:)


Are you still stuck on that 50% thinking? I don't know how you are thinking, but Sequitur can compress a run of N identical bytes in about log2(N) rules (including the S rule). If N were a power of 2, each rule would be 2 bytes large, so the number of tokens would be 2*log2(N), if we ignore rule lengths and such. So, if we compressed a string of, for example, 256 'A's, the size of each rule would be 2 bytes and there would be eight rules:

S -> 77
7 -> 66
6 -> 55
5 -> 44
4 -> 33
3 -> 22
2 -> 11
1 -> AA

RLE would certainly be more efficient as a preprocessor in this case, but you can see that you can easily code this in less than 128 bytes.
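A tiny sketch of the same point (illustrative only): a run of 2^k identical bytes collapses into k two-symbol rules, each expanding to two copies of the rule below it, and decoding the top rule reproduces the run.

#include <stdio.h>
#include <stddef.h>

/* Expand rule number `level`: rule 0 is the literal byte, and every
   higher rule is two copies of the rule below it. */
static void expand(int level, char byte, char *out, size_t *pos)
{
    if (level == 0) {
        out[(*pos)++] = byte;
        return;
    }
    expand(level - 1, byte, out, pos);
    expand(level - 1, byte, out, pos);
}

int main(void)
{
    char buf[256];
    size_t n = 0;
    expand(8, 'A', buf, &n);      /* 8 rules -> 2^8 = 256 'A's */
    printf("expanded %u bytes\n", (unsigned)n);
    return 0;
}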


The Sequitur algorithm does sometimes create rules so that they are in the wrong order. I don't know why it does that but sorting them is simple enough so that won't be a problem.

About probabilities: the bell curve (or normal distribution) is suitable in certain contexts, for example in prediction coders where the variance is high. In general, the Laplace distribution is the best for prediction coders, but there are no prediction errors to code after a rule-factorizing algo. The direct way would be to count the frequency of all the tokens and just use those probabilities - that is an order-0 coder. I thought of adding some sort of context to it, but I don't know what kind is good for this kind of output.
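A minimal sketch of that order-0 estimate (count token frequencies and sum the ideal code lengths; a real arithmetic coder lands close to this figure):

#include <math.h>
#include <stddef.h>

/* Estimated size, in bits, of a token stream under an order-0 model:
   each symbol costs log2(1/p), with p taken from its own frequency. */
double order0_bits(const unsigned *counts, size_t nsyms)
{
    double total = 0.0, bits = 0.0;

    for (size_t i = 0; i < nsyms; i++)
        total += counts[i];

    for (size_t i = 0; i < nsyms; i++) {
        if (counts[i] == 0)
            continue;
        double p = counts[i] / total;
        bits += counts[i] * log2(1.0 / p);
    }
    return bits;
}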
Posted on 2003-03-30 08:45:47 by gliptic
Yeps - initially I was creating a dead listing of the counts and lengths of token instances before performing any replacements.
Not anymore.
My algo no longer requires this, IMHO, for two reasons:
1. I am automatically sorting the rule tokens as they are created - since I am using a linked-list approach, I guarantee that each rule is as complex as, or more complex than, the preceding one.

2. I do not attempt to tokenize data which has not yet entered my hashing buffer.

Let's say I did count all instances of all binary terms at all lengths.
Let's say that I calculated the potential byte saving for all the replacements of each binary term.
There will be many instances of overlapping binary terms where, should we replace one, we would not be able to replace the other.
This means that our potential savings values were always just estimates, even if they did help us make educated decisions about the order in which to tokenize the binary terms.
So even if we counted X instances of a term of length Y, we cannot be sure we would save X*Y bytes, because we cannot be sure that simply making all those replacements immediately would be the most efficient option.
Therefore I no longer bother counting instances of binary terms at all.
I simply prefetch data into my hashing buffer, and then perform a rolling search of the hashing buffer substrings within the hashing buffer itself and within existing rules, going for longest terms first.
What we find is that the efficiency improves as the rulebase grows.
I'll probably head back that way when I'm happy with the current code, so that it creates a rulebase from the existing data (recording #instances), then reprocesses the data with the complete rulebase, this time tokenizing the data.
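As a naive illustration of that rolling, longest-first search (a sketch under my own assumptions: it searches only the buffer itself, not the existing rules, and skips matches that overlap the candidate substring):

#include <stddef.h>
#include <string.h>

/* Find the longest substring starting at `pos` (length >= 2, at most
   `maxlen`) that also occurs elsewhere in the buffer without
   overlapping itself. Returns the length, or 0 if none is found. */
size_t longest_repeat(const unsigned char *buf, size_t buflen,
                      size_t pos, size_t maxlen, size_t *match_at)
{
    size_t limit = (buflen - pos < maxlen) ? buflen - pos : maxlen;

    for (size_t len = limit; len >= 2; len--) {
        for (size_t i = 0; i + len <= buflen; i++) {
            if (i < pos + len && pos < i + len)
                continue;                      /* overlaps the candidate */
            if (memcmp(buf + i, buf + pos, len) == 0) {
                *match_at = i;
                return len;                    /* longest term wins */
            }
        }
    }
    return 0;
}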
What are your thoughts ??
cya.
Posted on 2003-03-31 00:15:17 by Homer
Replacing the substring that would immediately cause the highest byte saving is not always the best solution, but it's probably the best one can hope for.

My context idea kind of went down the drain when I thought a little more about it. Any trivial context is removed from the string by the rule factorizer, so the only chance is to find complex contexts, and that can only be done by intelligent software. Maybe a neural network or a GA could be used to find contexts, but as with all stochastic methods, their efficiency is hard to prove.
Posted on 2003-03-31 06:21:37 by gliptic
Well said.
I've not put much serious thought into basing a neural network around compression, but the idea does strike me. Has anyone tried anything like that?
My linked-list codebase could be given a nudge in that direction should I be so inclined; however, my personal experience with weighted nets is quite limited outside of playtime. It's something that I'd like to address. Heh, another entry in my ToDo list :tongue:
Posted on 2003-04-01 19:09:38 by Homer

The algorithm you describe is VERY costly if it is going to be done optimally. There are much faster (but not optimal) algorithms which do the same thing. Check here:
http://sequence.rutgers.edu/sequitur/

I have actually tried myself to find an algorithm that does exactly what you describe. The obvious algorithm is VERY slow and I didn't have much success in finding a better one.

BTW. Skip the RLE, please :)


Holy crap! This is way kool! Thanks for the link!
:alright:
NaN
Posted on 2003-05-07 21:04:33 by NaN
NaN, you like playing with trees, don't you ;)
What is your spin on the sequitur algo?
The best I could come up with was a couple of minor optimizations, the main one being a rule referencing node with the ability to refer to partial Rules.
I ended up keeping RLE as a preprocessing stage, but I modified the intermediate code to check whether the RLE processing was worth the effort and, if it wasn't, to use a threshold-based chopper function to subdivide the workload instead.
Posted on 2003-05-07 23:44:05 by Homer
I really wonder how well these kinds of compressors compare to, for example, the state-of-the-art PPM algorithm. On data created by humans (text, code, etc.) I think this kind of compression does pretty well, but on other kinds of data (continuous-tone images, exe files, etc.) I think other kinds of compression are better. A good statistical coder has to be developed for this compression to make it really good, though.
Posted on 2003-05-08 04:02:03 by gliptic
I happened on this thread because I was digging into a way to compress floating-point matrices. At 8 bytes per number times n times m, a matrix is a lot of data space.

I'm planning on making *big* matrices from time to time (i.e., hundreds by hundreds). Storing all this number info would take a good chunk out of my resources.

So I was looking for some fast compression algo that is not heavy on overhead. I think I found it.

The other benefit here is that there is definitely structure to the floating-point numbers. As well, there will be *a lot* of zeros in the really large matrices. So this algo will work very well.

I did a web-zip on the site and found a PDF in there. It has the performance curves you're looking for. This algo has some impressive stats (as published in the PDF), but it's close to competing compression algos. It only beats them when there is more structure to the data being compressed.

:alright:
NaN
Posted on 2003-05-08 18:44:25 by NaN
I wouldn't call the Sequitur algorithm fast. I think state-of-the-art PPM is faster (although it's one of the slowest compression algorithms). The source code I found on the Sequitur page is somehow faulty: it can compress smaller files but crashes whenever I try to compress a file above 1 KB or so. I tried to implement Sequitur myself, but it's not that easy to make it O(N), so that project is halted.

I read the PDF you talked about, and it seems like Sequitur does indeed perform very well on very structured files. I wonder if the encoding scheme used in the paper is the best. I think the terminals can be encoded in a cleverer way and the probabilities of the rules can be better estimated. However, I don't think partial rule matching is of any use (please prove me wrong!).
Posted on 2003-05-09 01:05:25 by gliptic
It will be a bit before I can... I'm still learning the internals at this point. I'm doing a literal translation to assembly. Once done, I will look for ways of squeezing the air out.

Beyond this, I do not know just how well it will work, or how much better it can be applied... for now it's an interest project ;)

:alright:
NaN
Posted on 2003-05-09 21:37:18 by NaN
During my early experiments with the algo, I found a really silly but very effective way to make a bad implementation of Sequitur swallow 8 bit binary data efficiently.
You're going to laugh , I sure did, but here it is.

Encode the 8-bit data as hex plaintext, so each 4-bit nybble becomes an 8-bit byte with a very finite range of values.
Now your source data is twice as long, but that's OK; Sequitur will create ONE extra rule, and that's the cost.
You can add code to the decompressor to automatically weld the output nybbles.
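The whole trick boils down to something like this (a plain C sketch of the expand/weld pair, not the actual implementation):

#include <stddef.h>
#include <stdint.h>

/* Expand each source byte into two hex-digit bytes before feeding
   the compressor: the alphabet shrinks to 16 symbols. */
void nybble_expand(const uint8_t *in, size_t n, uint8_t *out)
{
    static const char hex[] = "0123456789ABCDEF";
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = (uint8_t)hex[in[i] >> 4];
        out[2 * i + 1] = (uint8_t)hex[in[i] & 0x0F];
    }
}

/* Weld pairs of hex digits back into bytes after decompression. */
void nybble_weld(const uint8_t *in, size_t n2, uint8_t *out)
{
    for (size_t i = 0; i + 1 < n2; i += 2) {
        uint8_t hi = (uint8_t)(in[i]     <= '9' ? in[i]     - '0'
                                                : in[i]     - 'A' + 10);
        uint8_t lo = (uint8_t)(in[i + 1] <= '9' ? in[i + 1] - '0'
                                                : in[i + 1] - 'A' + 10);
        out[i / 2] = (uint8_t)((hi << 4) | lo);
    }
}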

:alright:
Posted on 2003-05-10 02:28:47 by Homer
Hm... that's similar to an electrical property in Ethernet communications. IEEE 802.3 uses Manchester encoding, which in essence ensures that every bit gets some repetitive change (like your zero filling)...

:alright:
NaN
Posted on 2003-05-10 10:23:56 by NaN

During my early experiments with the algo, I found a really silly but very effective way to make a bad implementation of Sequitur swallow 8 bit binary data efficiently.
You're going to laugh , I sure did, but here it is.

Encode the 8-bit data as hex plaintext, so each 4-bit nybble becomes an 8-bit byte with a very finite range of values.
Now your source data is twice as long, but that's OK; Sequitur will create ONE extra rule, and that's the cost.
You can add code to the decompressor to automatically weld the output nybbles.

:alright:


Am I totally off when I say that Sequitur would create up to 256 additional rules? Why do you want to do the hex transform anyway? It just slows everything down.
Posted on 2003-05-12 03:07:34 by gliptic