Subj : Re: Searching 40 Million short strings fast
To   : comp.programming
From : websnarf
Date : Mon Sep 05 2005 11:03 pm

M.Barren@gmail.com wrote:
> I am writing a program that queries a website and extracts certain
> information from result pages and writes them in a file. These
> informations are short strings of ascii characters (between 10 to 12
> chars) and number of them will eventually be around 40 million. There
> is a little problem though. Since the searches don't neccersily return
> unique results, I might get certain results over and over again and if
> I add them to my file, I will have duplicates which due to the nature
> of search mechanism of that certain website, will double or even triple
> the amount of collected data.
>
> I want to be able to check if I already have a certain string in my
> file very quickly. Indexing strings happens to be one of the only (if
> not the only) way of doing this. But I don't want to use a whole
> database server for a temporary data collection stage of this program.
> SqLite is an interesting choice but I still rather use even a simpler
> method that I can myself implement that is light-wight,  fast and
> specilized *just* to do what i want.
>
> If you have any ideas, I'll be more than happy to hear it.

Well, obviously you want a hash table; one that can hold 40 million
entries of course.  James Dow Allen has something that is likely to
work for this purpose:

    http://freepages.genealogy.rootsweb.com/~jamesdow/Tech/mhash.htm

Though the code is a bit messy.  As I recall, that solution is not
thread safe, and can only exist as one instance.  I don't know if that
matters to you.

Then there is the question of hashing a string, versus hashing an
encoding of the strings.  For example, if the strings themselves
average greater than 20 characters in length, then you could try
encoding them using SHA-1, and in fact use a prexfix of the SHA-1 value
as the hash key (this is sound), and just store the remaining bytes in
the hash entries.  This may have the advantage of being better for
memory consumption.  (It has the theoretical problem of false mappings,
but to date, there are no known aliased entries with respect to SHA-1.)

-- 
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

.