Subj : Re: Searching 40 Million short strings fast
To   : comp.programming
From : Boo
Date : Tue Sep 06 2005 12:11 pm

> I am writing a program that queries a website and extracts certain
> information from result pages and writes it to a file. These
> pieces of information are short strings of ASCII characters (10 to 12
> chars) and the number of them will eventually be around 40 million. There
> is a little problem though. Since the searches don't necessarily return
> unique results, I might get certain results over and over again, and if
> I add them to my file, I will have duplicates which, due to the nature
> of the search mechanism of that website, will double or even triple
> the amount of collected data.
>
> I want to be able to check very quickly whether I already have a certain
> string in my file. Indexing the strings happens to be one of the only (if
> not the only) ways of doing this. But I don't want to use a whole
> database server for a temporary data collection stage of this program.
> SQLite is an interesting choice, but I would still rather use an even simpler
> method that I can implement myself, one that is lightweight, fast and
> specialized *just* to do what I want.
>
> If you have any ideas, I'll be more than happy to hear them.

My guess is that this guy wants to extract data from, e.g., Hotmail, MSN or
ICQ home pages. Are we helping someone compile email CDs here?

-Boo