Subj : Re: Serializing a trie (sorting problem) To : comp.programming From : Peter Ammon Date : Sun Jul 24 2005 07:04 pm websnarf@gmail.com wrote: > Peter Ammon wrote: > >>In short, how can I order the nodes of a tree or trie so that the >>distance between a parent and its children is as small as possible? >> >>Some background: >> >>I have a trie file representing a bunch of words, and it must hold as >>much data while being as compact as possible. I have a 24 bit field in >>which to represent the character of the next node and its offset in the >>trie file; since there's 26 possible characters I can represent about >>645276 values for the location of a child (that is, ((1 << 24) - 25) / 26. >> >>Currently, this value refers to the offset of the child from the start >>of the trie file. By requiring two-byte alignment for nodes in the >>file, I can support any trie file size up to 1290552 bytes. >> >>It occurs to me that I could order the trie such that children are >>closer to their parents. If I can guarantee that no child is further >>away than 1290552 bytes, I could cram more data in even though the file >>size grows larger than my "pointer" size. >> >>So how do I sort the nodes this way, that minimizes the largest distance >>between a node and its children? To make it harder, some children are >>shared - a node can have multiple parents. > > > Sounds like a hard problem. But aren't CPU L2 caches big enough to fit > the whole thing these days? :) > > First of all, instead of location value, I think you mean *relative > offset* of the location. That, I assume, is the relevance of wanting > immediate parent-child nodes to be close to each other. > > The arbitrary suffix condition makes this quite a tough problem. You > can see without it, how this would be much easier (just sort of put > each node in the middle of its children, evenly cut in half by number > of total nodes under them, and sort them such that the children with > the most total nodes under them appear furthest away from the node, and > least closest to the original node, and repeat recursively -- either it > works or it doesn't.) > > The best (from a "definately find an answer if one exists" point of > view) I can think of with your constraints, is literally to "search" > for a sufficient solution, using some kind of heuristic search. I > don't know if this can yield a working solution in reasonable time. Since my data set won't change, it's OK if the program takes a few days to run. My current data set yields 52904 nodes, so it's not outrageously big, but I'll definitely have to do better than a try-every-possibility approach. > > The other obvious approach I would recommend you consider is provide an > "escape code encoding" that requires some nodes to chew up two > nodes-worth of space to encode delta information that requires more > than 24 bits to encode. Using 27 letters instead of 26 (where the 27th > letter is "escape") you have 621377 possible delta values, and joining > two with this escape gives you 386109376129 possible values plus the > *real* character encoding. I.e., the immediately adjacent node would > be used in the escape situation and the child array from this adjacent > node would essentially concatenate to your starting node, giving you, > sort of, a total of 48 bits used for the "escaped" entries. Unlike a usual trie, that has null pointers for missing children, I store only present children, in an array. So rather than using a direct offset, I binary search to find the child for a given character. If I go to variable-length child references, I'll probably have to drop to linear search, unless there's a clever way to binary search on non-fixed-length records. This isn't terrible, since the number of children is limited to 26, but it would still be nice to avoid that. Nevertheless, some sort of non-fixed-length encoding for child references is a very good idea and my best bet so far for extending the data. Thank you! [...] -Peter -- Pull out a splinter to reply. .