First, let's settle on what a "large dictionary" means. One of the languages supported by Word Judge is Polish, whose valid word list has over 2 million entries and takes about 36MB after unpacking. Do we need to compress this data at all? Strictly speaking, no: given the average hardware spec and modern network speeds, 36MB is not much. But we can do so much better. Besides, it's fun!
On the other end of the spectrum is the "zip all the things" approach. It's not a good idea either: it puts a heavy load on the CPU and doesn't compress the data that well. The zipped Polish dictionary still takes 4.5MB.
The most important observation is that we're not compressing random white noise, but words from a real language. We can leverage this to get much better compression than a generic algorithm would. Many words share the same prefix and suffix, which means they can be efficiently represented as a directed acyclic graph with shared nodes for common prefixes and suffixes. Let's see what the graph looks like for a couple of words:
White letters on a black background mark the nodes representing the ends of words. So the graph above represents the following words:
abject, abjection, abjections, abjectly, abjectness, ablate, ablated, ablation, ablations

The nodes for the 'ab' prefix are of course shared, as are the nodes for the 'ions' and 's' suffixes. For obvious reasons we cannot share the 't' node: in one group it marks the end of a word, in the other it does not, and the child nodes of the two 't' nodes are also different. Nodes are equal, and thus can be shared, if and only if they represent the same letter, the End Of Word flag is the same for both of them, and their lists of children are exactly the same.
This type of graph contains the minimal number of nodes, but each node takes quite a lot of space. We need 8 bits for the letter, 1 bit for the EOW flag, 8 bits for the length of the children list, and some kind of reference to each child, for example a node index, which takes log2(n) bits (rounded up), where n is the number of nodes in the graph. As you can see, the major contributor to the size of a single node is the children list: for the TWL06 dictionary the number of nodes is in the 10k-100k order of magnitude, which means we need 14-17 bits per node index.
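To make the cost concrete, here's a rough sketch of such a naive node in C++ (the names are mine, not taken from any of the projects mentioned below):

```cpp
#include <vector>

// Naive node layout: the children vector dominates the size. With
// ~100k nodes we need ceil(log2(100000)) = 17 bits per child index,
// and any straightforward encoding spends a full int per child.
struct NaiveNode {
    char letter;               // 8 bits for the letter
    bool endOfWord;            // 1-bit EOW flag (padded to a byte in practice)
    std::vector<int> children; // one index per child node
};
```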
A guy called John Paul Adamovsky found a neat solution to that. Instead of keeping the full list of children, we can store the children in something similar to a singly-linked list: we keep only the index of the first child, number the nodes in such a way that siblings always have consecutive indices, and add an End Of Children List flag to each node. This way we need exactly 1 + log2(n) bits (rounded up) for the child list, and the entire node fits in a single 32-bit integer. What's the catch? We need to introduce a few more nodes.
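The bit layout could look roughly like this (just an illustration of the idea; the actual layout used by dawggenerator and by Adamovsky's code may differ):

```cpp
#include <cstdint>

// One possible packing: 8 bits letter, 1 bit End Of Word, 1 bit End Of
// Children List, 22 bits for the index of the first child (enough for
// ~4M nodes, far more than the ~100k we need).
inline uint32_t packNode(char letter, bool endOfWord, bool endOfList,
                         uint32_t firstChild) {
    return (static_cast<uint32_t>(static_cast<unsigned char>(letter)) << 24)
         | (static_cast<uint32_t>(endOfWord) << 23)
         | (static_cast<uint32_t>(endOfList) << 22)
         | (firstChild & 0x3FFFFFu);
}

inline char     letterOf(uint32_t node)     { return static_cast<char>(node >> 24); }
inline bool     isEndOfWord(uint32_t node)  { return (node >> 23) & 1u; }
inline bool     isEndOfList(uint32_t node)  { return (node >> 22) & 1u; }
inline uint32_t firstChildOf(uint32_t node) { return node & 0x3FFFFFu; }
```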
For example, in the graph above we can no longer share the 'i' node in the 'ions' suffix: if it were shared, its index would have to be one greater than the index of both the 'e' node in 'ablate' and the 'n' node in 'abjectness' (this assumes the 'i' node is the last one on the child list of both 't' nodes, but the sharing is impossible for any other node ordering too). Our graph would look like this:
The rule for node equality has to be extended: nodes are equal if they represent the same letter, the flags are the same, their children are equal, and the siblings further along the children list are equal as well. The last condition is a bit confusing, so I'll provide an example.
Let's add 'absolution' and 'absolutions' to our dictionary. We can certainly share the 'ab' prefix and the 'ons' suffix, but do we need a separate 'i' node as well? The 'i' node in 'absolution' must have the End-Of-Child-List flag set. If we arrange the child nodes in either the 'ablat-' or the 'abject-' part of the graph in such a way that the 'i' node comes last, we can share the 'i' node between that part of the graph and the newly added branch. Our graph would look like this:
This is how we can squeeze the 36MB dictionary down to 1.5MB. The structure also doesn't need any processing: it can be loaded into an int array and used directly. If you're curious how to convert a dictionary into this kind of graph, you can read the description of my dawggenerator project on GitHub (C++) or the original description by John Paul Adamovsky (pure C).
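To show how little processing is needed, here's a hypothetical lookup over such an int array, reusing the accessors sketched above (the convention that index 0 means "no children" is my assumption, not necessarily what dawggenerator does):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Checks whether 'word' is in the dictionary. 'nodes' is the packed graph
// and 'root' is the index where the top-level sibling list starts.
bool contains(const std::vector<uint32_t>& nodes, uint32_t root,
              const std::string& word) {
    uint32_t current = root;
    for (std::size_t i = 0; i < word.size(); ++i) {
        // Walk the sibling list (consecutive indices) looking for the letter.
        while (letterOf(nodes[current]) != word[i]) {
            if (isEndOfList(nodes[current])) return false; // letter not found
            ++current;
        }
        if (i + 1 == word.size()) return isEndOfWord(nodes[current]);
        current = firstChildOf(nodes[current]);
        if (current == 0) return false; // assumed sentinel: no children
    }
    return false; // the empty word is not in the dictionary
}
```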
Hello,
I've read your article carefully, and it's fine. However, I have one little doubt: could you explain what you meant by the phrase "list of children is exactly the same"?
Regards, Michal.
When constructing the graph from the first image, you start with a trie (i.e. a prefix tree). Then you reduce the shared suffixes by comparing the nodes, and for that you need some equality rule. The rule is:
Nodes are equal iff they represent the same letter, have the same end-of-word flag state, and their children lists are equal (meaning that the lists have the same length, nodeA.children[i] == nodeB.children[i] for every i in the [0..list length - 1] range, and you use this same equality rule for comparing the children).
It sounds complicated, but it really is just the formal definition of "are two branches of the tree the same".
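In code, the rule could look something like this (an illustrative sketch on a plain trie node, not the actual dawggenerator implementation; it assumes both children lists are sorted by letter):

```cpp
#include <memory>
#include <vector>

struct TrieNode {
    char letter;
    bool endOfWord;
    std::vector<std::unique_ptr<TrieNode>> children; // sorted by letter
};

// Two nodes are equal iff the letter and EOW flag match and their whole
// child subtrees match pairwise, using this same rule recursively.
bool equalNodes(const TrieNode& a, const TrieNode& b) {
    if (a.letter != b.letter || a.endOfWord != b.endOfWord) return false;
    if (a.children.size() != b.children.size()) return false;
    for (std::size_t i = 0; i < a.children.size(); ++i) {
        if (!equalNodes(*a.children[i], *b.children[i])) return false;
    }
    return true;
}
```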
Ah, you're right!
I had an idea to share common parts inside the words as well (between different prefixes and/or suffixes), and then the condition you mentioned wouldn't work.
But in fact that's impossible, because if the children differed, non-existent words could be found...
So in a DAWG we can share only prefixes and suffixes, unfortunately.
Now I wonder if there is a way to create a more complicated data structure to share even more letters...
Anyway, is there a simple algorithm to convert a trie to the **minimal** DAWG?
Nodes don't have to represent a single letter. For example, for an English dictionary it might be optimal to have 'th' and 'sh' node values; for a Polish dictionary, 'cz' and 'sz' might be useful. I don't know, I haven't tested this idea, and I don't have a clue how to find the optimal set of node values.
If you're interested in word data structures, you might want to check out another invention of John Paul Adamovsky: http://www.pathcom.com/~vadco/cwg.html.
Feel free to experiment with my code on GitHub (or with the code posted by John Paul Adamovsky, if you prefer C to C++), and please let me know if you have any interesting findings.