Subj : Re: String searching with a twist To : comp.programming From : Roger Willcocks Date : Sat Aug 06 2005 01:38 pm <3rdtry@willets.org> wrote in message news:1123307232.095179.43830@g47g2000cwa.googlegroups.com... > Maybe you could use a more unary representation of the word lengths, so > that pattern and match lengths will be the same. If you pack the > syllables together, and mark each syllable preceded by a word break > with a 1 bit, you get a bit string with where k bits represent k > syllables, no matter how many word breaks occur therein. > > So eg if you have "2121" in your representation, it maps to "101101" in > the bitmap format. To match "3" you write it as "1001", and look for > substrings which match the 1 positions. In this case we have three > alignments: > > 101101 > 1001 > --1001 > ---1001 (special case for end-of-string) > > (That last match treats end-of-string as an implicit "1" appended at > the end, i.e. there's always a word break there.) > > I'm not sure if KMP can be modified to work with this matching, since > the overlap may depend on the bits which were ignored in pursuing the > 1's-only match. That is, finding the "next shorter" prefix match on P > is no longer just a function of position in P, since it depends on bits > which were ignored in the current prefix match. > > EG: > > Match "1001" in "1011011": > > 1011011 > 1001 - OK a match, now where to go? > ---1001 -- normal KMP says here (wrong) > --1001 -- terminal state has to "remember" where patterns in ignored > 1's were > > An FSA to encode this would be O(2^M), if I understand this > correctly(?). > Using a KMP search on a binary string isn't particularly helpful in any case since there are only two different symbols in the dictionary, so you don't get the full benefit of skipping past characters found in the target string that aren't in the search string. But the match string will always start with a one, so you can rapidly skip over zeroes in the target string before checking for a possible match. You'll then have to check that all the ones in the match string correspond to ones in the target string. The actual performance will depend on how dense the ones are in the target and match strings. In the worst case (where both are all ones), the performance will be O(TM). In the best case (where the target is all zeroes) the performance will be O(T). -- Roger .