Subj: Re: Searching 40 Million short strings fast
To: comp.programming
From: Peter Ammon
Date: Mon Sep 05 2005 11:56 pm

M.Barren@gmail.com wrote:
> Hi,
>
> I am writing a program that queries a website, extracts certain
> information from the result pages, and writes it to a file. These
> pieces of information are short strings of ASCII characters (10 to 12
> chars each), and there will eventually be around 40 million of them.
> There is a little problem, though. Since the searches don't
> necessarily return unique results, I might get certain results over
> and over again, and if I add them to my file, I will have duplicates
> which, due to the nature of that website's search mechanism, will
> double or even triple the amount of collected data.
>
> I want to be able to check very quickly whether I already have a
> certain string in my file. Indexing the strings happens to be one of
> the only (if not the only) ways of doing this, but I don't want to
> use a whole database server for a temporary data-collection stage of
> this program. SQLite is an interesting choice, but I would still
> rather use an even simpler method that I can implement myself:
> lightweight, fast, and specialized to do *just* what I want.
>
> If you have any ideas, I'll be more than happy to hear them.
> Michael

Take a look at tries. A trie has the advantage of very fast lookups
(linear in the length of the search string) and a very compact
representation, which is often smaller than the data it represents.
I've got a text file of 173,840 words that's 1.75 MB; my trie
representation of the same data is 1.15 MB.

-Peter

--
Pull out a splinter to reply.
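
[Editor's note: the following sketch is not from the original post. It is a
minimal Python illustration of the trie-based dedup Peter suggests; the
`Trie` class and its `add` method are names invented for this example. A
dict-of-dicts trie shows the idea but is not compact in memory the way
Peter's packed representation is; it does give lookups linear in the
length of the string.]

```python
# Minimal trie for deduplicating short ASCII strings.
# Illustrative sketch only -- names are not from the original post.

class Trie:
    def __init__(self):
        self.root = {}

    def add(self, s):
        """Insert s; return True if it was new, False if already seen."""
        node = self.root
        for ch in s:
            node = node.setdefault(ch, {})  # descend, creating children
        if "$" in node:                     # "$" marks end-of-string
            return False
        node["$"] = True
        return True

# Usage: write each scraped result only the first time it is seen.
seen = Trie()
for result in ["abc123def0", "xyz987abc1", "abc123def0"]:
    if seen.add(result):
        print(result)
```

Since `add` reports whether the string was new, a single pass over the
scraped results both builds the index and filters the duplicates.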