In order to retain my sanity, I've been rotating between a handful of similar projects.
One old project that I picked up recently involves an 8-bit implementation of the Sequitur algorithm.
This algorithm was originally designed with the intent of extracting/inferring grammatical structure from plaintext.
I'm more interested in using it as a lossless compression algorithm - and I'm not alone.
Early results from my beta demo are promising - this is without any kind of probabilistic or other analyses.
The algorithm is very similar to the input stage of my parser - which is what interests me.
That is to say, it is based on a paradigm of Terminal and NonTerminal Symbols - the key difference being that Sequitur symbols are not unique - they're more like my concept of a TOKEN.
Sequitur works by finding repeated pairs of adjacent symbols (digrams), from which it builds 'replacement rules'.
The algorithm also detects rules which are used only once, and eliminates them by inlining them at their only remaining reference. This effectively allows rules to grow and become more complex, keeping the number of rules to a minimum while simultaneously describing more complex data structures.
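To make the rule-building idea concrete, here's a simplified, offline sketch in Python: it repeatedly replaces the most frequent repeated digram with a new rule symbol. This is closer to Re-Pair than to true Sequitur (which does the same thing online, in a single pass, and needs the rule-utility check because of that), but it demonstrates the same grammar-as-dictionary output. All names here (`build_grammar`, `expand`, the `R0`/`R1` rule labels) are my own for illustration, not from any particular implementation.

```python
def build_grammar(seq):
    """Offline sketch of Sequitur-style grammar building:
    repeatedly replace the most common repeated digram with a
    fresh nonterminal until no digram occurs twice."""
    rules = {}
    next_id = 0
    while True:
        # Count adjacent pairs (digrams) in the current sequence.
        counts = {}
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] = counts.get((a, b), 0) + 1
        best = max(counts, key=counts.get, default=None)
        if best is None or counts[best] < 2:
            break  # nothing repeats - the grammar is done
        # Mint a new nonterminal and record its replacement rule.
        sym = f"R{next_id}"
        next_id += 1
        rules[sym] = list(best)
        # Rewrite the sequence, replacing the digram left to right.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                out.append(sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

def expand(sym, rules):
    """Recursively expand a symbol back to terminals -
    demonstrates that the scheme is lossless."""
    if sym not in rules:
        return [sym]
    out = []
    for s in rules[sym]:
        out.extend(expand(s, rules))
    return out
```

Running `build_grammar(list("abcabcabc"))` compresses the nine terminals down to a short top-level sequence plus a handful of rules, and expanding every symbol reproduces the original input exactly - which is the lossless property I care about.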
So it generates a 'dictionary' a lot like a Huffman encoding - but it's not a binary tree.