Subj : Re: Serializing a trie (sorting problem)
To   : comp.programming
From : websnarf
Date : Sat Jul 23 2005 09:24 pm

Peter Ammon wrote:
> In short, how can I order the nodes of a tree or trie so that the
> distance between a parent and its children is as small as possible?
>
> Some background:
>
> I have a trie file representing a bunch of words, and it must hold as
> much data while being as compact as possible.  I have a 24 bit field in
> which to represent the character of the next node and its offset in the
> trie file; since there's 26 possible characters I can represent about
> 645276 values for the location of a child (that is, ((1 << 24) - 25) / 26.
>
> Currently, this value refers to the offset of the child from the start
> of the trie file.  By requiring two-byte alignment for nodes in the
> file, I can support any trie file size up to 1290552 bytes.
>
> It occurs to me that I could order the trie such that children are
> closer to their parents.  If I can guarantee that no child is further
> away than 1290552 bytes, I could cram more data in even though the file
> size grows larger than my "pointer" size.
>
> So how do I sort the nodes this way, that minimizes the largest distance
> between a node and its children?  To make it harder, some children are
> shared - a node can have multiple parents.

Sounds like a hard problem.  But aren't CPU L2 caches big enough to fit
the whole thing these days?  :)

First of all, instead of location value, I think you mean *relative
offset* of the location.  That, I assume, is the relevance of wanting
immediate parent-child nodes to be close to each other.

The arbitrary suffix condition makes this quite a tough problem.  You
can see without it, how this would be much easier (just sort of put
each node in the middle of its children, evenly cut in half by number
of total nodes under them, and sort them such that the children with
the most total nodes under them appear furthest away from the node, and
least closest to the original node, and repeat recursively -- either it
works or it doesn't.)

The best (from a "definately find an answer if one exists" point of
view) I can think of with your constraints, is literally to "search"
for a sufficient solution, using some kind of heuristic search.  I
don't know if this can yield a working solution in reasonable time.

The other obvious approach I would recommend you consider is provide an
"escape code encoding" that requires some nodes to chew up two
nodes-worth of space to encode delta information that requires more
than 24 bits to encode.  Using 27 letters instead of 26 (where the 27th
letter is "escape") you have 621377 possible delta values, and joining
two with this escape gives you 386109376129 possible values plus the
*real* character encoding.  I.e., the immediately adjacent node would
be used in the escape situation and the child array from this adjacent
node would essentially concatenate to your starting node, giving you,
sort of, a total of 48 bits used for the "escaped" entries.

With this escape approach, you could encode the results by working
backwards from the suffixes going upwards until you arive at an
situation where an escape is necessary (i.e., where you cannot add in
all parent nodes and trees without using the escape mechanism.)  The
heuristic I would recommend is simply to seperate the children evenly
amongst the offsets -310688 to 310688, spread out weighted by the
number of nodes under each one -- if you end up with an intersection
while doing this, just scan in either direction for a different offset
(but as close as possible to the one originally intended).  If a scan
fails (because the closest offset is outside the range
[-310688,310688]) then you are forced to add in that parent with an
escape code.  But I imagine that you would first maximally produce
these suffixes, until you just had a set of suffixes, that cannot be
joined together without "escape codes", then compact them (recompute
the offsets so that they still appear in the order that you've got them
in, but without any spaces between adjacent entries) then try again to
join them together, compact etc until you are not getting any more
changes, then start joining them together (in the "sort big away from
current central node" method described above) with escape codes as
necessary.

But that's just my first thoughts on the matter.  :)

-- 
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

.