Subj : Re: Searching 40 Million short strings fast To : comp.programming From : Peter Ammon Date : Tue Sep 06 2005 05:05 am M.Barren@gmail.com wrote: > Thanks for your replies, > > I believe Tries are the best choice here since I just want to check > existance of a certain string and don't want to associate any data with > that string (as of a hashtable). > > Peter, the format I came up after reading some pages about trie > structures , consumes 9 bytes/trie entry (4 bytes for address of child > node, 4 bytes for the address of the next sibling entry, 1 byte of > data). > 1. Having strings with length 10 to 12, the data itself will take up > about: 11 x 40M = 440MB. > 2. The Trie data will take up about: 9 x 40M =~ 360MB (taking into > account only the bottom childs in the structure) > Space is not a concern in my case, but can you let me know what format > your trie representation uses that is almost half the size of the > actual data ? > > Michael > It's flattering that you asked! The most important optimization I do is to look for duplicate suffixes. Rather than store five links for "bring" and six for "string," both can link to a unique "ring" sequence, which reduces the total number of nodes from 390158 to only 52904. I also have the notion of "degenenerate nodes," which represents a suffix that does not branch, except to terminate intermediate words; "tions" is an example of a degenerate node. I store the suffix for a degenerate node in the node itself, rather than linking to each character in sequence. (This is similar to the optimization used to construct Patricia trees.) Finally, I cram together a lot of data into a few bytes. I allocate 24 bits per link. I am only interested in the 26 letters of the alphabet, so I have 2**24 / 26 = 645277 values available to refer to the child node for each link. I use these as an offset into the file; by requiring two byte alignment for each node, I can support file sizes up to 1.3 MB. (If I run out of space, I can use some clever sorting to keep nodes close together, or use double indirection via a table to support more nodes.) My file format is this: Node: Header (1 Byte): terminates_a_word (1 bit) is_degenerate (1 bit) number_of_links (5 bits) unused (1 bit) (For non-degenerate nodes): Link (3 bytes): 26 * child_node_file_offset + index_of_character_in_alphabet Link Link ... (For degenerate nodes): Character (1 byte): ASCII_value_of_character (7 bits) terminates_an_intermediate_word (1 bit) Character Character ... 0 byte Node Node .... -Peter -- Pull out a splinter to reply. .