Subj : Re: String searching with a twist To : comp.programming From : 3rdtry Date : Fri Aug 05 2005 11:47 pm Maybe you could use a more unary representation of the word lengths, so that pattern and match lengths will be the same. If you pack the syllables together, and mark each syllable preceded by a word break with a 1 bit, you get a bit string with where k bits represent k syllables, no matter how many word breaks occur therein. So eg if you have "2121" in your representation, it maps to "101101" in the bitmap format. To match "3" you write it as "1001", and look for substrings which match the 1 positions. In this case we have three alignments: 101101 1001 --1001 ---1001 (special case for end-of-string) (That last match treats end-of-string as an implicit "1" appended at the end, i.e. there's always a word break there.) I'm not sure if KMP can be modified to work with this matching, since the overlap may depend on the bits which were ignored in pursuing the 1's-only match. That is, finding the "next shorter" prefix match on P is no longer just a function of position in P, since it depends on bits which were ignored in the current prefix match. EG: Match "1001" in "1011011": 1011011 1001 - OK a match, now where to go? ---1001 -- normal KMP says here (wrong) --1001 -- terminal state has to "remember" where patterns in ignored 1's were An FSA to encode this would be O(2^M), if I understand this correctly(?). .