Subj : Re: Serializing a trie (sorting problem)
To   : comp.programming
From : Peter Ammon
Date : Sun Jul 24 2005 07:04 pm

websnarf@gmail.com wrote:
> Peter Ammon wrote:
> 
>>In short, how can I order the nodes of a tree or trie so that the
>>distance between a parent and its children is as small as possible?
>>
>>Some background:
>>
>>I have a trie file representing a bunch of words, and it must hold as
>>much data while being as compact as possible.  I have a 24 bit field in
>>which to represent the character of the next node and its offset in the
>>trie file; since there's 26 possible characters I can represent about
>>645276 values for the location of a child (that is, ((1 << 24) - 25) / 26.
>>
>>Currently, this value refers to the offset of the child from the start
>>of the trie file.  By requiring two-byte alignment for nodes in the
>>file, I can support any trie file size up to 1290552 bytes.
>>
>>It occurs to me that I could order the trie such that children are
>>closer to their parents.  If I can guarantee that no child is further
>>away than 1290552 bytes, I could cram more data in even though the file
>>size grows larger than my "pointer" size.
>>
>>So how do I sort the nodes this way, that minimizes the largest distance
>>between a node and its children?  To make it harder, some children are
>>shared - a node can have multiple parents.
> 
> 
> Sounds like a hard problem.  But aren't CPU L2 caches big enough to fit
> the whole thing these days?  :)
> 
> First of all, instead of location value, I think you mean *relative
> offset* of the location.  That, I assume, is the relevance of wanting
> immediate parent-child nodes to be close to each other.
> 
> The arbitrary suffix condition makes this quite a tough problem.  You
> can see without it, how this would be much easier (just sort of put
> each node in the middle of its children, evenly cut in half by number
> of total nodes under them, and sort them such that the children with
> the most total nodes under them appear furthest away from the node, and
> least closest to the original node, and repeat recursively -- either it
> works or it doesn't.)
> 
> The best (from a "definately find an answer if one exists" point of
> view) I can think of with your constraints, is literally to "search"
> for a sufficient solution, using some kind of heuristic search.  I
> don't know if this can yield a working solution in reasonable time.

Since my data set won't change, it's OK if the program takes a few days 
to run.  My current data set yields 52904 nodes, so it's not 
outrageously big, but I'll definitely have to do better than a 
try-every-possibility approach.

> 
> The other obvious approach I would recommend you consider is provide an
> "escape code encoding" that requires some nodes to chew up two
> nodes-worth of space to encode delta information that requires more
> than 24 bits to encode.  Using 27 letters instead of 26 (where the 27th
> letter is "escape") you have 621377 possible delta values, and joining
> two with this escape gives you 386109376129 possible values plus the
> *real* character encoding.  I.e., the immediately adjacent node would
> be used in the escape situation and the child array from this adjacent
> node would essentially concatenate to your starting node, giving you,
> sort of, a total of 48 bits used for the "escaped" entries.

Unlike a usual trie, that has null pointers for missing children, I 
store only present children, in an array.  So rather than using a direct 
offset, I binary search to find the child for a given character.  If I 
go to variable-length child references, I'll probably have to drop to 
linear search, unless there's a clever way to binary search on 
non-fixed-length records.  This isn't terrible, since the number of 
children is limited to 26, but it would still be nice to avoid that.

Nevertheless, some sort of non-fixed-length encoding for child 
references is a very good idea and my best bet so far for extending the 
data.  Thank you!

[...]

-Peter

-- 
Pull out a splinter to reply.

.