Subj : String searching with a twist To : comp.programming From : Arthur J. O'Dwyer Date : Fri Aug 05 2005 08:45 pm Here's an algorithms question for the group, based on thinking about the haiku-finder program I just posted Message-ID: What is the most efficient way to find all the haikus in a given stream of words? We can forget about sentence delimiters, counting syllables, and suchlike for the moment, and concentrate on a simplified problem that I can't figure out how to explain rigorously, so I'll just give some examples: haiku_find(text of size n, pattern of size m) haiku_find("231112132", "423") returns the index of "311121" haiku_find("231112132", "243") returns the index of "111213" haiku_find("231112132", "444") returns the index of "3111213" haiku_find("231112132", "6") returns the index of "231" haiku_find("231112132", "222") returns null haiku_find("231112132", "544") returns null (We may assume that no zeros ever appear in the input --- that is, each word contains at least one syllable.) One way to solve the problem is to take a queue of maximum length equal to the sum of all the elements of the input pattern; as each element of the text is scanned, add it to the queue and check to make sure the queue is still a valid prefix of the pattern; if it's not, remove the head of the queue and try again. But the "try again" stage of this algorithm might take O(m) time, so this might be an O(mn) algorithm (well, not really, but I haven't tried to figure it out completely). Anyway, it does terribly on haiku_find("5", "1111611116..." Is there an algorithm for this kind of string searching that runs in strictly linear time (O(m+n))? It seems like there ought to be a variation on Knuth-Morris-Pratt that works, but I can't figure out what it might be. Any help? -Arthur -- And he looked over at the alarm clock, ticking on the chest of drawers --/Metamorphosis/ .