Subj : Re: Searching 40 Million short strings fast
To   : comp.programming
From : Peter Ammon
Date : Tue Sep 06 2005 05:05 am

M.Barren@gmail.com wrote:
> Thanks for your replies,
> 
> I believe Tries are the best choice here since I just want to check
> existance of a certain string and don't want to associate any data with
> that string (as of a hashtable).
> 
> Peter, the format I came up after reading some pages about trie
> structures , consumes 9 bytes/trie entry (4 bytes for address of child
> node, 4 bytes for the address of the next sibling entry, 1 byte of
> data).
> 1. Having strings with length 10 to 12, the data itself will take up
> about: 11 x 40M = 440MB.
> 2. The Trie data will take up about: 9 x 40M =~ 360MB (taking into
> account only the bottom childs in the structure)
> Space is not a concern in my case, but can you let me know what format
> your trie representation uses that is almost half the size of the
> actual data ?
> 
> Michael
> 

It's flattering that you asked!

The most important optimization I do is to look for duplicate suffixes. 
  Rather than store five links for "bring" and six for "string," both 
can link to a unique "ring" sequence, which reduces the total number of 
nodes from 390158 to only 52904.

I also have the notion of "degenenerate nodes," which represents a 
suffix that does not branch, except to terminate intermediate words; 
"tions" is an example of a degenerate node.  I store the suffix for a 
degenerate node in the node itself, rather than linking to each 
character in sequence.  (This is similar to the optimization used to 
construct Patricia trees.)

Finally, I cram together a lot of data into a few bytes.  I allocate 24 
bits per link.  I am only interested in the 26 letters of the alphabet, 
so I have 2**24 / 26 = 645277 values available to refer to the child 
node for each link.  I use these as an offset into the file; by 
requiring two byte alignment for each node, I can support file sizes up 
to 1.3 MB.  (If I run out of space, I can use some clever sorting to 
keep nodes close together, or use double indirection via a table to 
support more nodes.)

My file format is this:

Node:
   Header (1 Byte):
     terminates_a_word (1 bit)
     is_degenerate (1 bit)
     number_of_links (5 bits)
     unused (1 bit)

   (For non-degenerate nodes):
     Link (3 bytes):
       26 * child_node_file_offset + index_of_character_in_alphabet
     Link
     Link
     ...

   (For degenerate nodes):
     Character (1 byte):
       ASCII_value_of_character (7 bits)
       terminates_an_intermediate_word (1 bit)
     Character
     Character
     ...
     0 byte

Node
Node
....

-Peter

-- 
Pull out a splinter to reply.

.