Subj : Re: Serializing a trie (sorting problem) To : comp.programming From : websnarf Date : Sat Jul 23 2005 09:24 pm Peter Ammon wrote: > In short, how can I order the nodes of a tree or trie so that the > distance between a parent and its children is as small as possible? > > Some background: > > I have a trie file representing a bunch of words, and it must hold as > much data while being as compact as possible. I have a 24 bit field in > which to represent the character of the next node and its offset in the > trie file; since there's 26 possible characters I can represent about > 645276 values for the location of a child (that is, ((1 << 24) - 25) / 26. > > Currently, this value refers to the offset of the child from the start > of the trie file. By requiring two-byte alignment for nodes in the > file, I can support any trie file size up to 1290552 bytes. > > It occurs to me that I could order the trie such that children are > closer to their parents. If I can guarantee that no child is further > away than 1290552 bytes, I could cram more data in even though the file > size grows larger than my "pointer" size. > > So how do I sort the nodes this way, that minimizes the largest distance > between a node and its children? To make it harder, some children are > shared - a node can have multiple parents. Sounds like a hard problem. But aren't CPU L2 caches big enough to fit the whole thing these days? :) First of all, instead of location value, I think you mean *relative offset* of the location. That, I assume, is the relevance of wanting immediate parent-child nodes to be close to each other. The arbitrary suffix condition makes this quite a tough problem. You can see without it, how this would be much easier (just sort of put each node in the middle of its children, evenly cut in half by number of total nodes under them, and sort them such that the children with the most total nodes under them appear furthest away from the node, and least closest to the original node, and repeat recursively -- either it works or it doesn't.) The best (from a "definately find an answer if one exists" point of view) I can think of with your constraints, is literally to "search" for a sufficient solution, using some kind of heuristic search. I don't know if this can yield a working solution in reasonable time. The other obvious approach I would recommend you consider is provide an "escape code encoding" that requires some nodes to chew up two nodes-worth of space to encode delta information that requires more than 24 bits to encode. Using 27 letters instead of 26 (where the 27th letter is "escape") you have 621377 possible delta values, and joining two with this escape gives you 386109376129 possible values plus the *real* character encoding. I.e., the immediately adjacent node would be used in the escape situation and the child array from this adjacent node would essentially concatenate to your starting node, giving you, sort of, a total of 48 bits used for the "escaped" entries. With this escape approach, you could encode the results by working backwards from the suffixes going upwards until you arive at an situation where an escape is necessary (i.e., where you cannot add in all parent nodes and trees without using the escape mechanism.) The heuristic I would recommend is simply to seperate the children evenly amongst the offsets -310688 to 310688, spread out weighted by the number of nodes under each one -- if you end up with an intersection while doing this, just scan in either direction for a different offset (but as close as possible to the one originally intended). If a scan fails (because the closest offset is outside the range [-310688,310688]) then you are forced to add in that parent with an escape code. But I imagine that you would first maximally produce these suffixes, until you just had a set of suffixes, that cannot be joined together without "escape codes", then compact them (recompute the offsets so that they still appear in the order that you've got them in, but without any spaces between adjacent entries) then try again to join them together, compact etc until you are not getting any more changes, then start joining them together (in the "sort big away from current central node" method described above) with escape codes as necessary. But that's just my first thoughts on the matter. :) -- Paul Hsieh http://www.pobox.com/~qed/ http://bstring.sf.net/ .