Subj: Re: Searching 40 Million short strings fast
To: comp.programming
From: Peter Ammon
Date: Mon Sep 05 2005 11:56 pm

M.Barren@gmail.com wrote:
> Hi,
>
> I am writing a program that queries a website, extracts certain
> information from the result pages, and writes it to a file. These
> pieces of information are short strings of ASCII characters (10 to 12
> chars each), and there will eventually be around 40 million of them.
> There is a little problem, though. Since the searches don't
> necessarily return unique results, I might get certain results over
> and over again, and if I add them to my file, I will have duplicates
> which, due to the nature of that website's search mechanism, will
> double or even triple the amount of collected data.
>
> I want to be able to check very quickly whether I already have a
> certain string in my file. Indexing the strings happens to be one of
> the only (if not the only) ways of doing this, but I don't want to
> use a whole database server for a temporary data-collection stage of
> this program. SQLite is an interesting choice, but I would still
> rather use an even simpler method that I can implement myself:
> lightweight, fast, and specialized to do *just* what I want.
>
> If you have any ideas, I'll be more than happy to hear them.
> Michael

Take a look at tries. A trie has the advantage of very fast lookups
(linear in the length of the search string) and a very compact
representation, which is often smaller than the data it represents.
I've got a text file of 173,840 words that's 1.75 MB; my trie
representation of the same data is 1.15 MB.

-Peter

--
Pull out a splinter to reply.
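
[Editor's note: the following sketch is not from the original post. It is a
minimal Python illustration of the trie-based dedup Peter suggests; the
`Trie` class and its `add` method are names invented for this example. A
dict-of-dicts trie shows the idea but is not compact in memory the way
Peter's packed representation is; it does give lookups linear in the
length of the string.]

```python
# Minimal trie for deduplicating short ASCII strings.
# Illustrative sketch only -- names are not from the original post.

class Trie:
    def __init__(self):
        self.root = {}

    def add(self, s):
        """Insert s; return True if it was new, False if already seen."""
        node = self.root
        for ch in s:
            node = node.setdefault(ch, {})  # descend, creating children
        if "$" in node:                     # "$" marks end-of-string
            return False
        node["$"] = True
        return True

# Usage: write each scraped result only the first time it is seen.
seen = Trie()
for result in ["abc123def0", "xyz987abc1", "abc123def0"]:
    if seen.add(result):
        print(result)
```

Since `add` reports whether the string was new, a single pass over the
scraped results both builds the index and filters the duplicates.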