Subj : Re: String searching with a twist To : comp.programming From : niklasb Date : Mon Aug 08 2005 01:04 pm Arthur J. O'Dwyer wrote: > Here's an algorithms question for the group, based on thinking about the > haiku-finder program I just posted > Message-ID: > > What is the most efficient way to find all the haikus > in a given stream of words? > > We can forget about sentence delimiters, counting syllables, and > suchlike for the moment, and concentrate on a simplified problem that I > can't figure out how to explain rigorously, so I'll just give some > examples: > > haiku_find(text of size n, pattern of size m) > haiku_find("231112132", "423") returns the index of "311121" > haiku_find("231112132", "243") returns the index of "111213" > haiku_find("231112132", "444") returns the index of "3111213" > haiku_find("231112132", "6") returns the index of "231" > haiku_find("231112132", "222") returns null > haiku_find("231112132", "544") returns null > > (We may assume that no zeros ever appear in the input --- that is, each > word contains at least one syllable.) One option would be to convert the pattern to a DFA and then run the DFA. For example, the pattern "423" could be respresented as a regular expression, thus: ..*((1((1((11)|2))|(21)|3))|(2((11)|2))|(31)|4)((11)|2)((1((11)|2))|(21)|3) Or equivalently, using a made-up notation: pattern -> .*:4:2:3 :2 -> (11)|2 :3 -> (1:2)|(21)|3 :4 -> (1:3)|(2:2)|(31)|4 Running the DFA is O(N). The total run time, including the time to construct the DFA would be something like O(N+M) where M is the number of states in the DFA. I'm not sure off-hand how big we might expect M to get, but it seems clear that the number of syllables is a factor so M is not just a function of your m (i.e., the length of the pattern). --Nick .