Hey... I've come up with an interesting compression algo which I'll describe here, in the hope that someone out there can see an obvious optimization.

The compression algo is a two stage job.
The first stage involves runlength encoding of the data.
This yields a list of "tokens" of two kinds: RLE and PLAIN blocks.

The big problem with RLE is that it only compresses blocks of repeated data.
So in stage 2, we will examine only the "PLAIN" blocks (which contain binary data that has no obvious pattern).

Enter the binary analysis engine.

What I wish to do is create a library of substrings from the actual data itself.
So for each block of plain data, I generate all possible substrings within it (down to a length of 5 bytes), and for each substring I seek instances of that substring in all blocks (including the current one), disregarding ONE located instance.

For anything that appears more than once, I record the length of the substring and the number of times it appears.

The idea is that we are able to "rate" these substrings on the basis of the number of bytes saved should we choose to tokenize said substring.

When analysis is complete, we can better determine the optimal order in which to ATTEMPT to tokenize these substrings.

I say ATTEMPT, because many of these substring instances actually overlap in the original data, and so not all substring instances will be tokenized.

We will tokenize substrings in the order which yields the highest savings, and such that no tokens ever overlap.

Note that even low-priority rated substrings are worth tokenizing, even if they are relatively small or only appear once.
The aim of the game is MAXIMUM compression, disregarding time.

The proof of concept application can compress and decompress the RLE tokens already, and perform the analysis. All that remains is to code the final rated substring selection stuff, where it has to actually make decisions.
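Roughly, the selection step I have in mind looks like this - a Python sketch only (the real code is asm), and the per-token overhead value is just a placeholder: rate each substring by the bytes it would save, then tokenize greedily in rating order, refusing anything that overlaps an already-claimed region.

def select_substrings(data, candidates, token_overhead=3):
    """Greedy, non-overlapping tokenization of rated substrings.
    candidates maps substring -> list of start offsets in data.
    token_overhead is an assumed per-token cost in bytes."""
    def rating(sub, offsets):
        # bytes saved: all but one copy disappears, minus the cost of the tokens
        return (len(offsets) - 1) * len(sub) - len(offsets) * token_overhead

    used = [False] * len(data)          # bytes already claimed by an earlier token
    chosen = []
    for sub, offsets in sorted(candidates.items(),
                               key=lambda kv: rating(*kv), reverse=True):
        for off in offsets:
            span = range(off, off + len(sub))
            if any(used[i] for i in span):   # would overlap - skip this instance
                continue
            for i in span:
                used[i] = True
            chosen.append((off, sub))
    return chosen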

I'm using 6 token types in my RLE: byte, word and dword lengths for each kind of block. The RLE is only seeking repeating or non-repeating data; it's not looking for any other common byte patterns. And yet just the RLE is compressing the average exe by 70% and a fully maxxed RAR file by about 1%.
The only time it seems ever to fail to compress a file in the first stage is if the data was previously compressed with the same algo !!
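To make the six-token scheme concrete, here's a rough Python sketch of stage 1 as I've described it; the minimum run length and the exact type numbering are illustrative, not gospel:

def rle_tokenize(data):
    """Stage 1: split data into ('RLE', value, count) and ('PLAIN', bytes) blocks."""
    tokens, i, plain_start, n = [], 0, 0, len(data)
    while i < n:
        j = i
        while j < n and data[j] == data[i]:
            j += 1
        if j - i >= 4:                        # assumed minimum worthwhile run
            if plain_start < i:
                tokens.append(('PLAIN', data[plain_start:i]))
            tokens.append(('RLE', data[i], j - i))
            plain_start = j
        i = j
    if plain_start < n:
        tokens.append(('PLAIN', data[plain_start:]))
    return tokens

def token_type(kind, length):
    """Map a block to one of the six token types by how wide its length field is."""
    base = 0 if kind == 'RLE' else 3          # 0-2 = RLE, 3-5 = PLAIN
    if length < 0x100:
        return base            # byte-sized length
    if length < 0x10000:
        return base + 1        # word-sized length
    return base + 2            # dword-sized length

# e.g. a PLAIN block 300 bytes long is type 4 (word-sized length field)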

Enough ranting, anyone got hints about optimizing this algo?
Posted on 2003-03-17 07:04:49 by Homer
The algorithm you describe is VERY costly if it is going to be done optimally. There are much faster (but not optimal) algorithms which do the same thing. Check here:
http://sequence.rutgers.edu/sequitur/

I have actually tried myself to find an algorithm that does exactly what you describe. The obvious algorithm is VERY slow and I didn't have much success in finding a better one.

BTW. Skip the RLE, please :)
Posted on 2003-03-17 07:17:58 by gliptic
Nah - the RLE is staying.
Why? Because it's fast, efficient, and it works.

In my mind, I am using RLE to remove repeated-byte substrings from the data stream (not actually removing them, but you get my drift)...
Why is this important to me?

Let's suppose I did as you suggest and lost the RLE stage.
I would be faced with performing a mammoth binary search as you mentioned.

The rle stage effectively breaks my data up into smaller chunks.
What is not so obvious is that in doing so, we are dramatically reducing two of the terms in the algo - the "keyspace" (or search-space), and the "keylength" (or substring max length).
That is why the algo (for me) has become not only feasible but realistic.

If you were to say "but hang on, it's still going to have running times exponential in the size of the job, which is hardly a good thing",
then yes, I would agree with you.

In the 1980's we had some hardcore "Crunchers" which could take literally days on those old boxes. The first compression algos were made popular (and I believe were developed) by talented amateurs (read "hackers"). They achieved amazing compression ratios by performing various kinds of binary searches.
They were superseded by newer algos (again you mentioned this) not because the newer ones are better at the job, but because they are faster at it.
In "The Beginning", this was a "Good Thing". Machines were slow. We had lives to live and things to do. Hardly anyone had a roomfull of old boxes.
And yet we are running these inferior algos on machines that are up to and over 2000 times the speed of the machines I refer to - why do we persist?
We already accept that the "better" compression algos are slow, and that there will always be a tradeoff between time and compression.

I just thought I'd mention once more that pure RLE is beating hard rar.
All by itself without the second stage at all !!!

OK so obviously through RLE I am tackling the problem of the large keyspace and long key(s), considering this compression algo is really a kind of bruteforcer.

What ELSE can I do to speed up the algo?
At the moment I am running my substrings up from 5 bytes in length, all the way up to the size of the data crumb which the substring lives in.
I am not checking to see if I am processing the same substring more than once (in more than one location) which would probably be smart.
Aside from that, anything else?
(imagines how long it would take to search every wordlength from 5 bytes upwards in a single 15 meg chunk of data - THAT'S why the RLE is staying)
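By "checking for the same substring more than once" I mean something like this - a Python sketch of the dedup idea only, not my actual implementation:

from collections import defaultdict

def analyse(chunks, min_len=5):
    """Count occurrences of every substring (length >= min_len) across all plain
    chunks, processing each unique substring only once."""
    processed = set()
    counts = defaultdict(int)
    for chunk in chunks:
        n = len(chunk)
        for length in range(min_len, n + 1):
            for start in range(n - length + 1):
                sub = chunk[start:start + length]
                if sub in processed:
                    continue                 # already counted this exact substring
                processed.add(sub)
                for other in chunks:         # count it everywhere, overlaps included
                    pos = other.find(sub)
                    while pos != -1:
                        counts[sub] += 1
                        pos = other.find(sub, pos + 1)
    # substrings seen only once are of no use, so drop them
    return {s: c for s, c in counts.items() if c > 1}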
Posted on 2003-03-17 08:28:11 by Homer
Forgot to mention - algo in present form can be applied in a distributed fashion - you could use a lan like a render farm to compress a single file.
How many distr0 compression algos have you heard of?
Posted on 2003-03-17 08:31:26 by Homer
Originally posted by EvilHomer2k
(imagines how long it would take to search every wordlength from 5 bytes upwards in a single 15 meg chunk of data - THATS why the RLE is staying)


Then you could just as well split the data yourself. How can you be so sure that the RLE splits the data where it is most efficient? The reason people use RLE is that it's a simple and efficient algorithm for repeated data. But what if you were going to compress, for example, text? The RLE would not work very well then. The RLE may well break this compression totally, since it might separate the data so that long repeated substrings aren't optimized away just because they happened to be on two sides of an RLE run. If you are going to use this as a general compression algorithm, I doubt that RLE would be good.

About distro. algos: you can do that with ANY algorithm by splitting up the data, but then it will not be as efficient. Compression algos generally work better on one long data stream than on many shorter ones.

BTW. Did you check the difference between the Sequitur algorithm and yours? The Sequitur algorithm is said to run in O(N) and it doesn't split up the data, and thus probably achieves better compression than algos that split up the data. The Sequitur algo is maybe 5% or so from the optimal rule tree, but finding one of those would be almost impossible if the file size is, for example, 1 MB. On text, the Sequitur algo is very efficient; I get almost three times better compression than RAR on many text files. I don't know what kind of files you are testing on if you find RLE better than RAR, but they must contain very long runs of data.
Posted on 2003-03-18 12:10:40 by gliptic
Heya.
I've found some small optimizations which have gained me some significant speed increases... foremost was removing a ListView I was using to retain analysis record data.

First off, I don't think RLE is wonderful, and I am not suggesting that it is breaking up the data at anything remotely like optimal points.
The files I have been testing on are everything from plain executables, upx-packed executables, mp3s, jpegs, rar-compressed files, bmps and textfiles.
I find less than one in 10 files grow after the RLE stage.
I find furthermore that after RLE is performed on regular executables, the nonrepeat data blocks average 50 or so bytes, with a large one being several hundred, and a very large one being a couple of thousand bytes.
Text files get terrible performance from RLE, and RAR-compressed files are similar. Sometimes NO breaks can be found in textfiles. RAR-compressed files always find breaks, but the nonrepeat blocks can be hundreds of thousands of bytes (15 meg rarfile of an svcd split).
Nonetheless, as far as my algo is concerned, RLE is something I still think is "worth trying", although in the cases where it performs poorly I think I will try to find an alternative splitting algorithm.
The substring frequency analyser is rocking along, I'm still picking over the code trying to save a cycle here and there.
I think I'll rewrite my binary search algo to use the fpu - what do you think of that concept?
I've seen sequitur before, but I just came across the permutation extension by Earl and Ladner, 2002... VERY COOL. My algo is in practice fairly similar to the original sequitur as far as the logical operations and their order are concerned.
I already had a hashing system in place too, but having read the whitepaper on this new sequitur implementation, I am going to go right back to the drawing board. My ethos on this project was to blend the better elements of various algorithms and implementations to try to find a better macro-algo.
My ultimate concern is to maximize the ratio achieved, and that the algo have an asymmetric decompression time. Even though sequitur was originally developed with plaintext databasing in mind, I find myself strangely drawn to it...
You're right man, sequitur rawks ...
now can you suggest an algo for optimally slicing up the workload? :tongue:
Damn I'm impressed with that algo :)
Posted on 2003-03-19 21:57:37 by Homer
When something is complex, the usual route to success is to split it in two. If you have ever played "mastermind" you know this is true, because you will win nine times out of ten if you fill the slots with two colors on the initial move (e.g. three black and three blue in the six slots).

This is just a suggestion, but make your algo so that parts of a file that are conducive to RLE are encoded that way. Then take the rest of the file and encode it another way... or if you have to use two methods again... do that.

I'm presently investigating a method of compression at the bit level... not the byte level. The advantage is that a repeating pattern is more likely to occur in a sequence of bits from contiguous bytes than in the values of the bytes themselves. This is conducive to RLE encoding.

For example, all odd numbers have bit 0 set. So if the sequence of contiguous bytes was 1, 3, 9, 221, 231... all would have bit 0 set.

The disadvantage is that you have to find many more repeating pattern sequences, because you are encoding bits rather than bytes, and this encoding must be in the form of bytes or nibbles rather than bits.
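To illustrate what I mean by bit level: peel the bytes into bit planes first, so a property shared by consecutive byte values (like all being odd) becomes a literal run of bits. A rough Python sketch of just the transform (not a compressor):

def bit_planes(data):
    """Split a byte sequence into 8 bit streams: planes[k][i] is bit k of data[i]."""
    planes = [[] for _ in range(8)]
    for byte in data:
        for k in range(8):
            planes[k].append((byte >> k) & 1)
    return planes

planes = bit_planes([1, 3, 9, 221, 231])
print(planes[0])    # -> [1, 1, 1, 1, 1]  (all five values are odd, so plane 0 is one long run)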
Posted on 2003-03-19 23:18:02 by IwasTitan
That kind of bit-level coding has been used before but is not that good. Data that can be compressed this way could be compressed better with prediction-based methods. The reason why this isn't good is that there is no apparent reason to use base 2 instead of any other base.
Posted on 2003-03-20 13:36:51 by gliptic
I'm afraid I'll have to disagree with gliptic's evaluation of bitpackers - the ones I saw 20 years ago achieved wonderful ratios, although they were not fast, and had long decompression times in relation to modern algos. On a 1MHz machine it could take 12 hours to compress 100kb (lol by today's standards) but could achieve well OVER 50 percent compression.

Sequitur takes a wad of data and converts it into the following: a hash string representing the encoded data stream, and a library of 2-byte replacement words. It does so "on-the-fly", and is able to reference terms forwards but not backwards from the point a term is discovered.
This means that at MOST, Sequitur can achieve roughly 50 percent compression.
Not good enough !!
The first thing I did was modify the algo to handle N-byte terms.
The next thing I did was decide not to perform live compression, but to only use this modified Sequitur to produce a "dead-listing" of terms, their lengths, and the number of times they appear.
This amounts to a linear replacement for the binary-grind code I had been using... now the algo can reference terms forwards and backwards, as well as within terms already discovered, and does so more efficiently.
Why not perform live compression with this algo?
Because we are aiming to find the term replacements with the highest byte saving, which may not necessarily be the longest or most common.
I still think that I will achieve far superior compression by analysing all possible replacement outcomes (in the style of Deep Blue) rather than by making educated guesses on the fly.

So how does N-byte term Sequitur differ from regular Sequitur? Simply, we are not just looking for 2 bytes within the existing rules and hashing string; we are now looking for (for example) 4, 3 and then 2 bytes... and we aren't keeping a hash at this point, because we aren't trying to compress yet, merely to evaluate the binary data before the actual compression takes place.

When using this algo for frequency analysis, unlike my old one, I cannot disregard single instances of a term, because more instances may yet be uncovered (I used to search the entire binary for each term).
I database my results in a 2 dimensional linked list. I have basically N linked lists of root records for terms of each length, sorted by their lengths. Under each root record are sibling-linked records for each unique term discovered at that length.
The siblings are not sorted until after analysis is complete.
So the current ethos is:
Initial RLE, hybrid Sequitur for frequency analysis, and then whole-term replacement via tokenizing, similar again to the RLE stage.
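In higher-level terms, the database I described amounts to something like this (a Python sketch of the shape only - the real thing is linked lists in asm, and the rating formula here is an assumption):

class TermDB:
    """Frequency database: one bucket per term length, one record per unique term."""
    def __init__(self):
        self.by_length = {}                  # length -> {term: count}

    def add(self, term):
        bucket = self.by_length.setdefault(len(term), {})
        bucket[term] = bucket.get(term, 0) + 1

    def rated_records(self):
        """After analysis: records ordered by estimated bytes saved, best first.
        The rating here ((count - 1) * length) is an assumption."""
        records = [(length, term, count)
                   for length, bucket in self.by_length.items()
                   for term, count in bucket.items()]
        return sorted(records, key=lambda r: (r[2] - 1) * r[0], reverse=True)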

Which algo did you base your bitpacker on?
was it an eight bit algo or one of the later 16 bit ones?
think you could modify it to use the fpu in 80 bit words?
Posted on 2003-03-20 22:51:21 by Homer

Sequitur takes a wad of data and converts it into the following: a hash string representing the encoded data stream, and a library of 2-byte replacement words. It does so "on-the-fly", and is able to reference terms forwards but not backwards from the point a term is discovered.
This means that at MOST, Sequitur can achieve roughly 50 percent compression.
Not good enough !!


I think you have totally misunderstood the Sequitur algorithm. It is hierarchical, so it doesn't have a library but rather a tree of rules, and the resulting rule tree is arithmetically compressed later, so the Sequitur algo can achieve far above 50% compression. This is an example rule tree that shows this:

S -> 11111111
1 -> "Hello "

Expanded, this tree would take 48 bytes, but if we coded it in base 257 (which is far from efficient) we could store it in a little bit above 16 bytes (with rule size bytes included). If we used arithmetic coding, and the example was a little bit bigger, we could achieve VERY high compression.
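To make the arithmetic concrete, a quick sketch (Python, just checking the byte counts):

rules = {'S': ['1'] * 8, '1': list("Hello ")}

def expand(sym):
    if sym not in rules:
        return sym
    return ''.join(expand(s) for s in rules[sym])

print(len(expand('S')))                            # 48 bytes fully expanded
print(sum(len(body) for body in rules.values()))   # 14 symbols in the tree itself
# 14 symbols at roughly one byte each in base 257, plus a size byte per rule,
# is where the "little bit above 16 bytes" comes from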

Sequitur does not need look-ahead ability since it has look-back ability, so that won't degrade its performance. Sequitur uses 2-symbol terms in order to build the rule tree; that doesn't put any limit on its highest compression.


Anyway, I hope you understand what I mean and I think you should read a little about Sequitur so you understand the algorithm before you make a modification of it. The good thing about Sequitur is that it is O(N) in both space and time, incremental, and offers almost optimal rule creation; it can thus compress a whole stream of data, without splitting, and achieve a very good result. Sure, it can be beaten by an optimal rule creator that also works on the whole stream of data, but since those are like O(N^5) or something close to that, it's not worth it. But if you split up the data, it doesn't matter if your rule creator is optimal, you won't achieve that high compression anyway. Those optimal rule creators are for the future and quantum computers. Hmm... maybe not. It's hard to make that run in parallel I think.
Posted on 2003-03-21 01:04:57 by gliptic
Homer,

From what I have seen, RLE functions as a pre-processor for other compression algorithms and sometimes improves their efficiency so I would not abandon RLE just yet. If time is not your problem, then multipass processing is no big deal and it may yield good results for you.

Regards,

hutch@movsd.com
Posted on 2003-03-22 00:14:59 by hutch--
gliptic, thanks - the Sequitur algorithm is cool - I'll see if I can whip up a binary version. If my thinking is correct the decompressor should be only a couple (<100) of bytes. :)


This is an example rule tree that shows this:

S -> 11111111
1 -> "Hello "
This tree isn't allowed! :)
It'd have to be:
S -> 33

3 -> 22
2 -> 11
1 -> "Hello "
RLE will reduce the rule depth - which could help or hinder compression.
IMHO, RLE will help in most cases.
Posted on 2003-03-22 01:08:04 by bitRAKE

gliptic, thanks - the Sequitur algorithm is cool - I'll see if I can whip up a binary version. If my thinking is correct the decompressor should be only a couple (<100) of bytes. :)

This tree isn't allowed! :)
It'd have to be:
S -> 33

3 -> 22
2 -> 11
1 -> "Hello "
RLE will reduce the rule depth - which could help or hinder compression.
IMHO, RLE will help in most cases.


That example I gave was just to show that a rule tree can compress more than 50%. I didn't get it from Sequitur.

IMHO, RLE would hinder in almost all cases (except small files). For example:

"Hello guy! aaaaaaaaaahh! Hello guy!"

Here, the RLE would break the string so that the two Hello's can't be made into a rule. The strong thing about Sequitur is that it operates on the whole data, so it doesn't matter how far away from each other two substrings are.

Sequitur would reduce this example string to:

S -> 1 2 3 3 4 h h 2 1 !
1 -> H e l l o _ g u y
2 -> ! _
3 -> 4 4
4 -> a a

RLE would do something like this:

"Hello guy! ", "a"*10, "hh! Hello guy!"

Since no rules can be formed now, the compression is limited to the small amount that the RLE can do.

And yes, bitRAKE, the decompressor can become REALLY small if it is coded right.
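For example, a decompressor for the rule tree above is only a handful of lines (sketched in Python for clarity; the asm version is what gets really small):

rules = {                                   # the rule tree from above, '_' = space
    'S': ['1', '2', '3', '3', '4', 'h', 'h', '2', '1', '!'],
    '1': list('Hello_guy'),
    '2': ['!', '_'],
    '3': ['4', '4'],
    '4': ['a', 'a'],
}

def expand(sym):
    if sym not in rules:
        return ' ' if sym == '_' else sym
    return ''.join(expand(s) for s in rules[sym])

print(expand('S'))    # -> "Hello guy! aaaaaaaaaahh! Hello guy!"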
Posted on 2003-03-22 14:24:46 by gliptic

RLE would do something like this:

"Hello guy! ", "a"*10, "hh! Hello guy!"

Since no rules can be formed now, the compression is limited to the small amount that the RLE can do.
Why wouldn't rules be able to be formed?

Let's use ~ to denote the RLE function; it takes two characters after it:

~Ka

Expands to "aaaaaaaaaa" because K is the tenth character. :)

...so RLE produces: "Hello guy! ~Kahh! Hello guy!"

...and Sequitur produces:

S -> 1 2 ~ K a h h 2 1 !
1 -> H e l l o _ g u y
2 -> ! _
Posted on 2003-03-22 14:47:41 by bitRAKE
Hmm, it seems like I misunderstood the use of RLE. So how is RLE going to speed up this algo then? And how is it going to be made distributable?
Posted on 2003-03-22 14:49:20 by gliptic
I'm just saying it'll improve compression ratio - I leave the rest to EvilHomer2k. :)
Posted on 2003-03-22 14:56:49 by bitRAKE
Yeah, I actually asked EvilHomer.

The gain in performance would be minimal and the algorithm would still be too slow to be practical on larger data sets.
Posted on 2003-03-22 14:59:33 by gliptic
And I misunderstood Sequitur - although my variant algo was right on track.
Let me describe my proposed new variant, which I am coding already using this as a blueprint. Please don't hesitate to comment on it.

===============================================================
LinkedList Sequitur Variant by EvilHomer (aka Stage 2 of Franny FatSlasher)
===============================================================
We have source data to be compressed (S).
In reality, the source data is already partially compressed.
It exists as a linkedlist of tokens of type 0, 1, 2 (all the RLE types) and 3, 4, 5 (all the PLAIN types).
When we reprocess the data in stage 2, we'll be ignoring the RLE tokens
and concentrating on the so-called plain ones. We'll dip inside each plain token's data payload, and present the data chunk to the compressor in the form of a pointer to a chunk of plain data.
We'll do this for all plain data chunks in the source.
For the rest of this document, we'll describe the processing of a single wad of data, so let's pretend we didn't know that we started with a linkedlist of tokens.

We'll compress the source data into a destination buffer we call our Hashing Buffer (H)
Our Hashing Buffer is actually a LinkedList of Tokens.
That means we have a Root Pointer to a linkedlist of "HashingBuffer Tokens" (similar to our Source data).

Whenever we fetch bytes of source data, we wrap them as a SMALL_PLAIN token (type 3).
The Hashing Buffer LinkedList can contain these type-3 tokens.
The Hashing Buffer can also contain "Reference to previously created Rule Token" tokens which we also call RuleTokenReference Tokens (type 6)
Note that type-6 tokens don't contain the rule itself, just a reference to a rule, but nonetheless are tokens.
Thus, the Hashing Buffer is a LinkedList which will contain ONLY tokens of type 3 and 6.

Rules themselves are objects which contain linkedlists of tokens, and, like the HashingBuffer LinkedList,
the linkedlist may contain any mix of type 3 and 6 tokens. Rules are NOT Tokens. They contain no Token Header.
We database our Rules in a global linkedlist, we keep the Root pointer in a global variable.
We'll keep the Rules as simple parent-children linked nodes,
and each Rule will be a container for a Root Pointer to a linked list of tokens that form the Rule.
Rules will be referenced by a RuleTokenReference token, which will be an indirect pointer to a Rule.
RuleTokenReference tokens are used in place of Instances of a Rule (and thus replace a series of Tokens),
whereas Rules are Definitions of a Series of Tokens (of type 3 and/or 6).
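For anyone who prefers structures to prose, the above boils down to roughly this (a Python sketch of the shapes only; the real thing is flat structs and pointers in asm):

PLAIN_SMALL = 3       # token carrying literal bytes
RULE_REF    = 6       # token referencing a Rule

class Token:
    """A node in a token linked list: either literal data or a reference to a Rule."""
    def __init__(self, kind, payload):
        self.kind = kind              # PLAIN_SMALL or RULE_REF
        self.payload = payload        # bytes for type 3, a Rule object for type 6
        self.prev = None
        self.next = None

class Rule:
    """A Rule is NOT a token: it owns its own linked list of type 3/6 tokens."""
    def __init__(self):
        self.first = None             # root pointer of the rule's token list

class HashingBuffer:
    """The HBLL: a doubly linked list containing only type 3 and type 6 tokens."""
    def __init__(self):
        self.first = None
        self.last = None

    def append(self, token):
        if self.first is None:
            self.first = self.last = token
        else:
            token.prev = self.last
            self.last.next = token
            self.last = token

rule_list = []                        # the global Rule database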

We start by initializing the hashing buffer linkedlist...
We fetch the first source byte, and encapsulate it in a PLAIN_SMALL token.
We increment our sourcepointer.
We initialize the HashingTokens Root Pointer with the address of this our first token.
We fetch the second source byte and we likewise wrap it up as a PLAIN_SMALL token.
We increment our sourcepointer.
We append this our second token to our new HashingBuffer LinkedList.

For the remainder of this example document, please observe the following:
S represents some uncompressed source data
H represents the HashingBuffer linkedlist of type 3 (plainsmall) and/or type 6 (rulereference) tokens
<> indicates the NEXT source byte pointed to by sourcepointer
[ ] indicates a Token and its contents
lowercase characters indicate plainsmall data substrings
uppercase characters indicate rule references
HBLL is an abbreviation for HashingBuffer LinkedList.

S=ab<c>dbcabcdbc H= We start by fetching the next source byte (into a PLAINSMALL token)
S=abc<d>bcabcdbc H= We examine the HBLL for multiple non-overlapping instances
of a series of 2 or more terms, searching for larger lengths first
We seek each series in the existing Rules - in there?
We seek each series in the rest of the HBLL - in there?
No, and No. Let's continue by wrapping up the next source byte...
S=abcd<b>cabcdbc H= We examine the HBLL for multiple non-overlapping instances
Nothing cool happens here either.. We wrap up the next source byte...
S=abcdb<c>abcdbc H= We examine the HBLL for multiple non-overlapping instances
Ho Hum... Wrap up the next source byte
S=abcdbc<a>bcdbc H= We examine the HBLL for multiple non-overlapping instances
ACTION !! appears TWICE in the HBLL !!!
We create our first Rule (A=)
We replace all instances of the Rule in the Hashing Buffer with a RuleRef
This might mean killing some nodes in the linked list !!!
We immediately continue by wrapping up the next source byte
S=abcdbca<b>cdbc H= We examine the HBLL for multiple non-overlapping instances
The series doesn't exist in Rules or in the HBLL.
We continue by wrapping up the next source byte...
S=abcdbcab<c>dbc H= We examine the HBLL for multiple non-overlapping instances
The series doesn't exist in Rules or in the HBLL.
We continue by wrapping up the next source byte...
S=abcdbcabc<d>bc H= We examine the HBLL for multiple non-overlapping instances
The series EXISTS as a Rule !!!
We replace all instances of the pair in the HBLL with the appropriate RuleRef
This might mean killing some nodes in the linked list !!!
We immediately continue by wrapping up the next source byte
S=abcdbcabcd<b>c H= WHAMMO !!! We have a Multiple Term Repeat in the HBLL
Were we watching out for that?
We create a new Rule (B=)
We replace both duplicate instance chains with a RuleReference
This might mean killing some nodes in the linked list !!!
We immediately continue by wrapping up the next source byte
S=abcdbcabcd<b>c H= We examine the HBLL for multiple non-overlapping instances
Nothing of interest, so we wrap the next (and final) source byte...
S=abcdbcabcdb<c> H= The pair EXISTS as a Rule,
so we replace the HBLL instance(s) with the correct Rule Ref token
(no source bytes left to fetch at this point)
S=abcdbcabcdbc H= Even now, we may note a pair duplication in the HBLL
We create a new Rule (C=)
We replace all instances of the pair in the HBLL
We've completed compression of the source data.
We end up with the following:
H=
A=
B=
C=

============================================================================================
From this we can conclude the following:

Our algo must check more than the last two terms in the HBLL.
We should check for duplication of any number of tokens in a series.

Our Rules must hold an arbitrary number of tokens in a linkedlist.
We can keep a counter in each Rule of how many tokens it contains,
or we can take advantage of PARTIAL MATCHES of rules.
We might have a complex Rule Q=
and we might have a HBLL searchterm which as you can see PARTIALLY matches a rule
but it is shorter, so we could have a token 7, R= which contains the length to stop at.
This would save us from creating a new Rule for when it exists as a subseries.
Let's say the HBLL searchterm is actually
Another notion might be to invent a token 8, R= which has start and length of a subseries of tokens.
We'll get the plain version working before worrying about adding those in.

=============================================================================
CHECKING THE HBLL EFFICIENTLY FOR ALL LENGTH DUPLICATES
To check for duplicate instances of subseries of tokens in the HBLL, we should use
a sliding window algo whose initial length is half the number of terms in the buffer, rounded down.
The final length to check is TWO terms. This example has 15 terms...


Half of that would be 7.5, rounded down to 7

We start with the LAST seven, and seek them at the NEXT seven down

If there's seven or more left to check, we would continue to the NEXT seven down.
We then decrement the start of the search, for as many "remainder" terms as it takes to include the root term.

We now decrement the initial search length from 7, and repeat, until the searchlength is 2.
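In code, that window scan reads roughly like this (a Python sketch over a plain list of terms; the real version walks the linked list, and the scan direction differs slightly from the description above):

def find_duplicate_series(terms, min_len=2):
    """Look for a repeated, non-overlapping series of terms,
    trying the longest window first (half the term count, rounded down)."""
    n = len(terms)
    for length in range(n // 2, min_len - 1, -1):
        for a in range(n - 2 * length, -1, -1):          # earlier candidate position
            for b in range(a + length, n - length + 1):  # later, non-overlapping position
                if terms[a:a + length] == terms[b:b + length]:
                    return length, a, b
    return None

# e.g. find_duplicate_series(list("abcdbcabcdbc")) -> (6, 0, 6), the two "abcdbc" runs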

As a visual aid, here were the last two searches of the example hashbuffer: the length-3 pass, then the length-2 pass (the bracketed terms were lost in the board markup).

The term-series comparisons are being made between two linkedlists of equal length.
The lists both contain type 3 and/or type 6 tokens.
The quickest way to check for a match is to compare the token types in the two series.
If the types of all the tokens don't match perfectly, we don't have a match.

If the types all match, we now make comparisons of pairs of tokens (we know they are the same type)
so we are comparing a type 3 with a type 3, or a type 6 with a type 6, in a loop.

Type 6 tokens are quicker to check, because they are references to something unique.
When we have Type 3 tokens, we need to compare the binary plaintext in the token pair.
Only if they match, do we have a match.
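Written out, that comparison is (a Python sketch, reusing the token shapes from the earlier sketch):

RULE_REF = 6                              # token type constant, as in the sketch above

def series_match(a, b):
    """Compare two equal-length lists of tokens: types first, then contents."""
    if len(a) != len(b):
        return False
    if any(x.kind != y.kind for x, y in zip(a, b)):
        return False                      # cheap pass: token types must line up
    for x, y in zip(a, b):
        if x.kind == RULE_REF:
            if x.payload is not y.payload:    # type 6: must reference the same Rule
                return False
        elif x.payload != y.payload:          # type 3: compare the literal bytes
            return False
    return True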

We have searched the HashBuffer completely for duplication of all length matching series of terms down to pairs.
============================================================================================

The method used to compare two equal length series of tokens is also used when
trying to identify a subseries searchterm in the existing Rules.
We just need to check first that the number of tokens in the Rule equals the number in the searchterm.
If it isn't, this Rule could never be a Match.
Then we check that all the tokens in the two series are of the same type,
and finally that they contain equivalent data.
If all these hold true, we have found a Match in the Rules, and need to replace
however many linked nodes from the HBLL searchterm are required with a Ref token.

When our second stage pass is complete, we will replace each series of single-byte plain tokens
in the HBLL with a single plain token representing a series of plain bytes,
of whatever type (3, 4 or 5) is required to express the data size.

We take our final HBLL and link it into our Output List, by replacing the source Plain token with it.
We zero our HBLL root pointer before processing the next chunk of data.
We just keep growing the Rule List.

When we have processed ALL chunks of data, we will replace each series of single-byte plain tokens
in each of the Rules with a single plain token representing a series of plain bytes,
of whatever type (3, 4 or 5) is required to express the data size.

The final stage of the compressor will be to write the Output LL to a binary file, minus any LO fields.
We will start by building a flat-memory version of the Rules.
Then we write how many of them there are to the output file.
Each time we write a Rule, we replace any indirect references to it - in all the other Rules
and in the Output LL - with the Rule's new flat dword offset.
We then write the flat Rules data, followed by the LO-stripped Output LL.

-You may now suspend your disbelief.-
EvilHomer, March 23, 2003





:alright:
Posted on 2003-03-23 03:49:52 by Homer
I just posted a beta of a beta of an application using this code in the Algos forum, under the "SkinnedButtons" thread, because someone asked, and it happens to use that piece of code also.
It contains a few messageboxes - one of my debugging techniques.
Just a humble offering, which shows the execution speed of the RLE stage, and not much more than that.
Try putting the term "abcdbcabcdbc" into a textfile and compressing it.
That's what I'm using to debug stage 2, the Sequitur Variant.
My current problem involves incorrect patching of linkedlists after replacement of a series of linked nodes with a new node.
I'm trying to delete all the nodes but one (relying on the inbuilt link-patching to maintain the links to nodes outside the deletion range), then graft that node's linkage header onto the header of the unlinked, freshly-created replacement node, then patch the links in its new parent and child to point to it (since it knows about them, having inherited those links). Finally I delete the original node and, with my replacement complete, return the address of the replacement node for indirect referencing. Replacing a series of linked nodes in our case means replacing several source nodes with a single reference to a rule - we must create the rule (returning the address of the rule), create the reference, shove the rule address into the reference, and replace the series of source nodes with the new referencing node as indicated.
When we create the rule, we need to clone the series of source nodes into it, because we will destroy them when we return.
Cloning and deletion just seemed simpler than trying to extract a section of the LL, patching the wound behind us.
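In list form, the splice I'm trying to get right is basically this (a Python sketch reusing the Token/Rule shapes from my earlier post - clone the run into the rule first, then overwrite the run with one reference node):

# Token, Rule and RULE_REF as defined in the sketch in my previous post

def clone_series_into_rule(first, last):
    """Copy the run of nodes first..last into a fresh Rule before the originals go away."""
    rule = Rule()
    prev, node = None, first
    while True:
        copy = Token(node.kind, node.payload)
        if prev is None:
            rule.first = copy
        else:
            prev.next, copy.prev = copy, prev
        prev = copy
        if node is last:
            return rule
        node = node.next

def replace_series_with_ref(hbll, first, last, rule):
    """Splice a single RULE_REF token over the run first..last (inclusive) in the HBLL."""
    ref = Token(RULE_REF, rule)
    ref.prev, ref.next = first.prev, last.next
    if first.prev is not None:
        first.prev.next = ref
    else:
        hbll.first = ref                  # run started at the root - patch the root pointer
    if last.next is not None:
        last.next.prev = ref
    else:
        hbll.last = ref
    return ref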
Posted on 2003-03-24 02:05:49 by Homer
You have to excuse me, I don't follow that procedure totally but I think I got the point. As I understand it, you are building your tree top-down like Sequitur. I think an optimal rule creator has to create it bottom-up. I'm not sure about this so please correct me if I'm wrong. For example, the stream "abcdbcabcdbc" would be processed like this in a bottom-up rule creator:

S -> abcdbcabcdbc

1.
S -> AA
A -> abcdbc

2.
S -> AA
A -> aBdB
B -> bc

Your tree was...

H=
A=
B=
C=

...and Sequitur produces...

S -> 1 1
1 -> a 2 d 2
2 -> b c

...which is in fact the same as the bottom-up approach. Why is your tree worse? It's because you have a useless rule left. If you expanded B directly into C you would get:

H=
A=
C=

Which is the same as Sequitur and the bottom-up approach.
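Here is a tiny sketch of what I mean by bottom-up (Python, completely brute force, only to show the order in which the rules get created - it reproduces the two steps above for "abcdbcabcdbc"):

def longest_repeat(seq):
    """Longest subsequence occurring at least twice without overlapping, or None."""
    n = len(seq)
    for length in range(n // 2, 1, -1):
        for a in range(n - 2 * length + 1):
            for b in range(a + length, n - length + 1):
                if seq[a:a + length] == seq[b:b + length]:
                    return seq[a:a + length]
    return None

def bottom_up(data):
    """Repeatedly pull the longest repeat out as a new rule, inside every rule body."""
    rules = {'S': list(data)}
    next_name = 'A'
    changed = True
    while changed:
        changed = False
        for name, body in list(rules.items()):
            rep = longest_repeat(body)
            if rep is None:
                continue
            rules[next_name] = rep
            new_body, i = [], 0
            while i < len(body):                      # replace every occurrence
                if body[i:i + len(rep)] == rep:
                    new_body.append(next_name)
                    i += len(rep)
                else:
                    new_body.append(body[i])
                    i += 1
            rules[name] = new_body
            next_name = chr(ord(next_name) + 1)
            changed = True
    return rules

print(bottom_up("abcdbcabcdbc"))
# produces S -> A A, A -> a B d B, B -> b c, matching steps 1 and 2 above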
Posted on 2003-03-24 03:22:27 by gliptic