Subj : Re: Searching 40 Million short strings fast
To   : comp.programming
From : Arthur J. O'Dwyer
Date : Tue Sep 06 2005 03:40 pm


On Tue, 6 Sep 2005, Willem wrote:
>
> Peter wrote:
> ) The most important optimization I do is to look for duplicate suffixes.
> )   Rather than store five links for "bring" and six for "string," both
> ) can link to a unique "ring" sequence, which reduces the total number of
> ) nodes from 390158 to only 52904.
>
> And what if you want to add "brings" ?  You'd have to separate the links
> again.  Don't you need extra housekeeping to tell that one node is pointed
> to from two locations ?

   You could do that, but in Peter's case I bet we'd just add a new node 
with the data "brings". Since nodes are never split up, we don't need any 
extra housekeeping on the nodes that are pointed to multiple times.
   I haven't been following this whole thread, but has anyone mentioned
the keyword "suffix tree" yet? :)  That seems to be basically what Peter
is describing. As for the OP's problem, I agree with whoever said "hash
table."

-Arthur

.