[HN Gopher] Knuth-Morris-Pratt string-searching algorithm: DFA-l...
       ___________________________________________________________________
        
       Knuth-Morris-Pratt string-searching algorithm: DFA-less version
        
       Author : ingve
       Score  : 94 points
       Date   : 2021-01-21 07:55 UTC (1 days ago)
        
 (HTM) web link (yurichev.com)
 (TXT) w3m dump (yurichev.com)
        
       | maweki wrote:
       | Isn't the "homebrew" result at the end exactly the state machine
       | from the DFA?
       | 
       | It has been a while, but I remember KMP calculating a prefix
       | table of some sort beforehand, to skip the repeated checking on
       | yet-unclear partial matches?
        
         | Jtsummers wrote:
         | Yes. The author discusses the relation between the homebrew
         | solution and a DFA solution in part 2, and more about partial
         | matches in part 3.
        
         | thdc wrote:
         | Yeah, back in university I've really only learned the version
         | that is presented in part 3. I mean I feel like the prefix
         | table is just a more succinct representation of the DFA but I
         | do find it interesting to see one way that that approach
         | could've been arrived at.
        
       | mlyle wrote:
       | "You see, there are two memory accesses per one character of the
       | input string."
       | 
       | No, there's one memory access on average even if your compiler is
       | maximally dumb (short circuit AND is something C compilers need
       | to know how to do to be conformant). And in a smarter compiler
       | this will be one memory access per byte anyways. Not to mention
       | on any modern architecture, even if your compiler doesn't squash
       | the second memory access, the microarchitecture will.
       | 
       | Also all the semicolons after his brackets bug me. Not to mention
       | that the thing looks past the end of the string. (We have a
       | length specified, so I assume it's not null terminated... indeed,
       | we can get a match past the end).
        
         | Jtsummers wrote:
         | "worst" versus "average". He probably could've phrased the
         | sentence better, but his conclusion isn't wrong. If the string
         | is all 'o' characters we'd hit this worst case memory access.
        
         | klodolph wrote:
         | search_ok("ooooooooooooooo", 15)
         | 
         | 30 memory accesses. length times two. The article is correct.
        
           | mlyle wrote:
           | > 30 memory accesses. length times two. The article is
           | correct.
           | 
           | vs. what I said:
           | 
           | > > No, there's one memory access on * _average*_
        
             | klodolph wrote:
             | Ah, you must have missed the part where the article says
             | "at worst".
             | 
             | Even if it didn't say that, I think we can figure out from
             | context that the article is talking about worst-case memory
             | accesses.
        
               | kyberias wrote:
               | No, it clearly says two memory accesses per character.
               | 
               | "At worst" calculation is just calculating the worst case
               | (string not found) where it has to scan the whole string
               | and... access each byte twice.
               | 
               | It doesn't matter whether we have a worst case or not.
               | They claim two memory accesses per character.
        
               | klodolph wrote:
               | > It doesn't matter whether we have a worst case or not.
               | They claim two memory accesses per character.
               | 
               | Yes, a worst case of two memory accesses per character.
               | This is obvious, no?
               | 
               | When people analyze algorithm runtime it is not always
               | explicitly stated whether the author is doing worst-case
               | or average-case analysis. So, as a reader, you sometimes
               | have to figure it out.
        
               | mlyle wrote:
               | It says at worst later, talking about if there is no
               | match. It heavily implies that there's 2 accesses per
               | byte always.
        
               | klodolph wrote:
               | > It says at worst later, talking about if there is no
               | match.
               | 
               | The article says "len*2 at worst". There is no
               | intervening text, that's a direct quote.
               | 
               | > It heavily implies that there's 2 accesses per byte
               | always.
               | 
               | This interpretation is incorrect. The article does not
               | imply this.
        
               | mlyle wrote:
               | "You see, there are two memory accesses per one character
               | of the input string."
               | 
               | It says this, DIRECTLY, not qualifying it as "up to" or
               | "worst case". The next sentence says it can be 2*len,
               | e.g. if you scan the entire length.
               | 
               | Those two sentences feel like something that someone
               | would write if they didn't understand short circuit
               | evaluation.
               | 
               | You read it differently, but your opinion of how it is
               | intended isn't sufficient to refute my interpretation of
               | the text.
        
               | klodolph wrote:
               | It seems much more plausible to me that the author does
               | understand short-circuit evaluation, but did not feel
               | like explaining it in such detail to satisfy the tedious
               | pedantry of HN comments.
               | 
               | My experience with writing answers Stack Overflow is that
               | people in the comments will always _figure out a way_ to
               | interpret an answer as being incorrect, and latch onto
               | it.
               | 
               | There is a tradeoff between clarity and pedantry. You
               | don't want to simplify something too much where it is no
               | longer insightful, and you don't want to write something
               | with irrelevant detail that obscures the point you are
               | trying to make.
        
               | mlyle wrote:
               | I really don't think the author is fluent in C. They
               | consistently put semicolons after brackets, and they have
               | off-by-1 errors in array indexing, etc (the first
               | search_ok looks past the end of string; the second
               | search_ok won't see ok in the string "ok"). That's part
               | of why I read it that way.
               | 
               | I understand the concern over excessive pedantry.
        
               | klodolph wrote:
               | My experience is that C programmers with many years of
               | experience will still make off-by-one errors. Most of the
               | examples have the correct bounds. The semicolons don't
               | belong but... so what?
               | 
               | If you wish to report errors to the author, there is an
               | email address at the bottom:
               | 
               | > Please drop me email about bug(s) and/or suggestion(s):
        
       | boxfire wrote:
       | For Part I, though I get the point to demonstrate proofs with
       | CBMC, this is a pretty contrived example. E.g. using state
       | machines for the cocos search makes an on-line very lazily
       | written implementation human-verifiable at a read. I would like
       | to see an example that doesn't have such a contrived nature.
       | 
       | Edit: Now I see in part II/III where the title actually comes to
       | fruition and there is a state machine oriented generator is made
       | and used. That still doesn't satisfy me about part I, honestly I
       | would have just skipped it in retrospect.
        
         | mlyle wrote:
         | Note that the use of CMBC here was not thorough enough to spot
         | that his reference implementation reads past the end of the
         | string.
        
           | shric wrote:
           | I don't think it does, assuming the passed len is correct.
        
             | mlyle wrote:
             | for (unsigned i=0; i<len; i++)             {
             | if (s[i]=='o' && s[i+1]=='k')
             | return i; // found             };
             | 
             | The last iteration of this loop, it accesses s[len-1] and
             | s[len]. If the string is null terminated, that's OK. If
             | it's really just a length qualified string, that's not OK.
             | 
             | The later implementations treat len differently.
             | 
             | EDIT: It's worse than I said. The later version won't
             | detect "ok" at the end of the string, because they are off
             | by one in the OPPOSITE direction.
        
               | shric wrote:
               | A string in C is, by definition, null terminated.
        
               | mlyle wrote:
               | Then why are we providing a length? Plenty of string-
               | handling functions in the c standard library result in no
               | terminating null if max length -- e.g. strncpy.
               | 
               | Note the second function is off by one in the opposite
               | direction-- it doesn't find "ok" in "ok".
        
               | makapuf wrote:
               | Thats why strncpy is deprecated for strlcpy.
        
               | shric wrote:
               | Because the writer of the function doesn't want to bother
               | to check the null terminator. I'm not saying it's a good
               | function, it just doesn't have an issue as long as the
               | input is a valid string with a correct length.
               | 
               | strncpy, despite what its name implies, does indeed not
               | necessarily produce strings.
               | 
               | Regardless, a C string is defined to have a null
               | terminator. All bets are off if you don't provide a
               | string.
        
               | mlyle wrote:
               | The issue is: the two functions that we're comparing for
               | correctness treat lengths very differently: one scans
               | len-1 bytes, the other scans len+1 bytes.
               | 
               | Even with a null terminator, there isn't a length that
               | would work for both without overrunning the buffer.
               | printf("%d %d\n", search_ok_1("mok", 2),
               | search_ok_2("mok", 2));             printf("%d %d\n",
               | search_ok_1("mok", 3), search_ok_2("mok", 3));   /*
               | Overruns without null term */             printf("%d
               | %d\n", search_ok_1("mok", 4), search_ok_2("mok", 4));
               | /* Overruns even WITH null term */
               | 
               | Prints out:                   1 2         1 3         1 1
        
               | billsnow wrote:
               | Oh I have some bad news for you, buddy.
        
       | kreeben wrote:
       | This sound interesting. I wanted to read the documentation but
       | the link in the README file returns 404.
        
       | mikewarot wrote:
       | Wouldn't it be better to do less branching?
       | 
       | Long ago, I wrote a text search that used the 8088 XLAT
       | instruction in a very short loop to do a text search, the carry
       | flag got set when there was a hit.
       | 
       | On modern machines, it would all end up in cache quickly, the
       | branch would be well predicted, and it would just zip through
       | text.
       | 
       | It handled upper/lower case (you just set the bits corresponding
       | to both upper and lower case of the letter)... but it couldn't do
       | regex.
        
         | Jtsummers wrote:
         | Part 1 is a fairly naive hand-rolled solution. Part 2 shows KMP
         | generating a DFA, which does reduce branching (specifically in
         | the search, where the branches in the first part become array
         | lookups in the second part).
        
       | tangjurine wrote:
       | I took a look at part 1, and the use of CMBC was cool!
        
       | danmg wrote:
       | Is this really DFA-less? Isn't the look-back table you construct
       | just encoding the DFA?
        
         | Jtsummers wrote:
         | It is and it isn't. The DFA is not deliberately constructed,
         | it's more a "happy accident". A renaming/reframing of the
         | `seen` variable would make it obvious that what is constructed
         | is a manually conceived DFA. As presented, `seen` is "how many
         | characters have we seen of the target string?". Reframing it as
         | "There exists a state for each character in the target, `seen`
         | represents which of these states we are presently in" makes it
         | clear that it _is_ a DFA. But as constructed it 's an
         | accidental DFA, rather than a deliberate one.
        
       | MattPalmer1086 wrote:
       | That was a fun read, I liked the use of cmbc to validate the
       | algorithm.
       | 
       | For those who are interested, there's a good tool to specifically
       | test string searching algorithms here:
       | 
       | https://github.com/smart-tool/smart
       | 
       | There are so many string searching algorithms now, with different
       | best and worst cases. Some work better on low alphabets (eg DNA),
       | some are better for text or high entropy data, some take
       | advantage of CPU instructions, some are generic. The real
       | challenge is picking the right algorithm.
       | 
       | I've implemented a few of them in java here, and extended them to
       | support multi byte matching at any position:
       | 
       | https://github.com/nishihatapalmer/byteseek
        
       | victor82 wrote:
       | Thank you for the write-up, I love the way you use CMBC to
       | construct your algorithm, seem like a way to synthesize code :)
       | Thanks also for point CMBC, I was looking for something like that
       | for years!
        
       | bambataa wrote:
       | Does avoiding memory accesses for short strings really provide
       | much performance improvement? Surely the adjacent characters,
       | which need to be read at least once anyway, will be in cache. I
       | can see the benefit when the searched-for string is long.
        
         | Jtsummers wrote:
         | It makes the algorithm (if not using a series of unoptimized
         | if-else-ifs) linear in the length of the string being searched.
         | You go from O(n*k) to O(n).
         | 
         | I mean, it may not matter _much_ , but it is more efficient. A
         | more impactful optimization is knowing how far to skip in the
         | string being searched based on which state you're in and what
         | character appears.
        
       | kuter wrote:
       | https://cp-algorithms.com/string/prefix-function.html
        
       ___________________________________________________________________
       (page generated 2021-01-22 23:01 UTC)