[HN Gopher] Knuth-Morris-Pratt string-searching algorithm: DFA-l...
___________________________________________________________________
Knuth-Morris-Pratt string-searching algorithm: DFA-less version
Author : ingve
Score : 94 points
Date : 2021-01-21 07:55 UTC (1 days ago)
(HTM) web link (yurichev.com)
(TXT) w3m dump (yurichev.com)
| maweki wrote:
| Isn't the "homebrew" result at the end exactly the state machine
| from the DFA?
|
| It has been a while, but I remember KMP calculating a prefix
| table of some sort beforehand, to skip the repeated checking on
| yet-unclear partial matches?
| Jtsummers wrote:
| Yes. The author discusses the relation between the homebrew
| solution and a DFA solution in part 2, and more about partial
| matches in part 3.
| thdc wrote:
| Yeah, back in university I've really only learned the version
| that is presented in part 3. I mean I feel like the prefix
| table is just a more succinct representation of the DFA but I
| do find it interesting to see one way that that approach
| could've been arrived at.
| mlyle wrote:
| "You see, there are two memory accesses per one character of the
| input string."
|
| No, there's one memory access on average even if your compiler is
| maximally dumb (short circuit AND is something C compilers need
| to know how to do to be conformant). And in a smarter compiler
| this will be one memory access per byte anyways. Not to mention
| on any modern architecture, even if your compiler doesn't squash
| the second memory access, the microarchitecture will.
|
| Also all the semicolons after his brackets bug me. Not to mention
| that the thing looks past the end of the string. (We have a
| length specified, so I assume it's not null terminated... indeed,
| we can get a match past the end).
| Jtsummers wrote:
| "worst" versus "average". He probably could've phrased the
| sentence better, but his conclusion isn't wrong. If the string
| is all 'o' characters we'd hit this worst case memory access.
| klodolph wrote:
| search_ok("ooooooooooooooo", 15)
|
| 30 memory accesses. length times two. The article is correct.
| mlyle wrote:
| > 30 memory accesses. length times two. The article is
| correct.
|
| vs. what I said:
|
| > > No, there's one memory access on * _average*_
| klodolph wrote:
| Ah, you must have missed the part where the article says
| "at worst".
|
| Even if it didn't say that, I think we can figure out from
| context that the article is talking about worst-case memory
| accesses.
| kyberias wrote:
| No, it clearly says two memory accesses per character.
|
| "At worst" calculation is just calculating the worst case
| (string not found) where it has to scan the whole string
| and... access each byte twice.
|
| It doesn't matter whether we have a worst case or not.
| They claim two memory accesses per character.
| klodolph wrote:
| > It doesn't matter whether we have a worst case or not.
| They claim two memory accesses per character.
|
| Yes, a worst case of two memory accesses per character.
| This is obvious, no?
|
| When people analyze algorithm runtime it is not always
| explicitly stated whether the author is doing worst-case
| or average-case analysis. So, as a reader, you sometimes
| have to figure it out.
| mlyle wrote:
| It says at worst later, talking about if there is no
| match. It heavily implies that there's 2 accesses per
| byte always.
| klodolph wrote:
| > It says at worst later, talking about if there is no
| match.
|
| The article says "len*2 at worst". There is no
| intervening text, that's a direct quote.
|
| > It heavily implies that there's 2 accesses per byte
| always.
|
| This interpretation is incorrect. The article does not
| imply this.
| mlyle wrote:
| "You see, there are two memory accesses per one character
| of the input string."
|
| It says this, DIRECTLY, not qualifying it as "up to" or
| "worst case". The next sentence says it can be 2*len,
| e.g. if you scan the entire length.
|
| Those two sentences feel like something that someone
| would write if they didn't understand short circuit
| evaluation.
|
| You read it differently, but your opinion of how it is
| intended isn't sufficient to refute my interpretation of
| the text.
| klodolph wrote:
| It seems much more plausible to me that the author does
| understand short-circuit evaluation, but did not feel
| like explaining it in such detail to satisfy the tedious
| pedantry of HN comments.
|
| My experience with writing answers Stack Overflow is that
| people in the comments will always _figure out a way_ to
| interpret an answer as being incorrect, and latch onto
| it.
|
| There is a tradeoff between clarity and pedantry. You
| don't want to simplify something too much where it is no
| longer insightful, and you don't want to write something
| with irrelevant detail that obscures the point you are
| trying to make.
| mlyle wrote:
| I really don't think the author is fluent in C. They
| consistently put semicolons after brackets, and they have
| off-by-1 errors in array indexing, etc (the first
| search_ok looks past the end of string; the second
| search_ok won't see ok in the string "ok"). That's part
| of why I read it that way.
|
| I understand the concern over excessive pedantry.
| klodolph wrote:
| My experience is that C programmers with many years of
| experience will still make off-by-one errors. Most of the
| examples have the correct bounds. The semicolons don't
| belong but... so what?
|
| If you wish to report errors to the author, there is an
| email address at the bottom:
|
| > Please drop me email about bug(s) and/or suggestion(s):
| boxfire wrote:
| For Part I, though I get the point to demonstrate proofs with
| CBMC, this is a pretty contrived example. E.g. using state
| machines for the cocos search makes an on-line very lazily
| written implementation human-verifiable at a read. I would like
| to see an example that doesn't have such a contrived nature.
|
| Edit: Now I see in part II/III where the title actually comes to
| fruition and there is a state machine oriented generator is made
| and used. That still doesn't satisfy me about part I, honestly I
| would have just skipped it in retrospect.
| mlyle wrote:
| Note that the use of CMBC here was not thorough enough to spot
| that his reference implementation reads past the end of the
| string.
| shric wrote:
| I don't think it does, assuming the passed len is correct.
| mlyle wrote:
| for (unsigned i=0; i<len; i++) {
| if (s[i]=='o' && s[i+1]=='k')
| return i; // found };
|
| The last iteration of this loop, it accesses s[len-1] and
| s[len]. If the string is null terminated, that's OK. If
| it's really just a length qualified string, that's not OK.
|
| The later implementations treat len differently.
|
| EDIT: It's worse than I said. The later version won't
| detect "ok" at the end of the string, because they are off
| by one in the OPPOSITE direction.
| shric wrote:
| A string in C is, by definition, null terminated.
| mlyle wrote:
| Then why are we providing a length? Plenty of string-
| handling functions in the c standard library result in no
| terminating null if max length -- e.g. strncpy.
|
| Note the second function is off by one in the opposite
| direction-- it doesn't find "ok" in "ok".
| makapuf wrote:
| Thats why strncpy is deprecated for strlcpy.
| shric wrote:
| Because the writer of the function doesn't want to bother
| to check the null terminator. I'm not saying it's a good
| function, it just doesn't have an issue as long as the
| input is a valid string with a correct length.
|
| strncpy, despite what its name implies, does indeed not
| necessarily produce strings.
|
| Regardless, a C string is defined to have a null
| terminator. All bets are off if you don't provide a
| string.
| mlyle wrote:
| The issue is: the two functions that we're comparing for
| correctness treat lengths very differently: one scans
| len-1 bytes, the other scans len+1 bytes.
|
| Even with a null terminator, there isn't a length that
| would work for both without overrunning the buffer.
| printf("%d %d\n", search_ok_1("mok", 2),
| search_ok_2("mok", 2)); printf("%d %d\n",
| search_ok_1("mok", 3), search_ok_2("mok", 3)); /*
| Overruns without null term */ printf("%d
| %d\n", search_ok_1("mok", 4), search_ok_2("mok", 4));
| /* Overruns even WITH null term */
|
| Prints out: 1 2 1 3 1 1
| billsnow wrote:
| Oh I have some bad news for you, buddy.
| kreeben wrote:
| This sound interesting. I wanted to read the documentation but
| the link in the README file returns 404.
| mikewarot wrote:
| Wouldn't it be better to do less branching?
|
| Long ago, I wrote a text search that used the 8088 XLAT
| instruction in a very short loop to do a text search, the carry
| flag got set when there was a hit.
|
| On modern machines, it would all end up in cache quickly, the
| branch would be well predicted, and it would just zip through
| text.
|
| It handled upper/lower case (you just set the bits corresponding
| to both upper and lower case of the letter)... but it couldn't do
| regex.
| Jtsummers wrote:
| Part 1 is a fairly naive hand-rolled solution. Part 2 shows KMP
| generating a DFA, which does reduce branching (specifically in
| the search, where the branches in the first part become array
| lookups in the second part).
| tangjurine wrote:
| I took a look at part 1, and the use of CMBC was cool!
| danmg wrote:
| Is this really DFA-less? Isn't the look-back table you construct
| just encoding the DFA?
| Jtsummers wrote:
| It is and it isn't. The DFA is not deliberately constructed,
| it's more a "happy accident". A renaming/reframing of the
| `seen` variable would make it obvious that what is constructed
| is a manually conceived DFA. As presented, `seen` is "how many
| characters have we seen of the target string?". Reframing it as
| "There exists a state for each character in the target, `seen`
| represents which of these states we are presently in" makes it
| clear that it _is_ a DFA. But as constructed it 's an
| accidental DFA, rather than a deliberate one.
| MattPalmer1086 wrote:
| That was a fun read, I liked the use of cmbc to validate the
| algorithm.
|
| For those who are interested, there's a good tool to specifically
| test string searching algorithms here:
|
| https://github.com/smart-tool/smart
|
| There are so many string searching algorithms now, with different
| best and worst cases. Some work better on low alphabets (eg DNA),
| some are better for text or high entropy data, some take
| advantage of CPU instructions, some are generic. The real
| challenge is picking the right algorithm.
|
| I've implemented a few of them in java here, and extended them to
| support multi byte matching at any position:
|
| https://github.com/nishihatapalmer/byteseek
| victor82 wrote:
| Thank you for the write-up, I love the way you use CMBC to
| construct your algorithm, seem like a way to synthesize code :)
| Thanks also for point CMBC, I was looking for something like that
| for years!
| bambataa wrote:
| Does avoiding memory accesses for short strings really provide
| much performance improvement? Surely the adjacent characters,
| which need to be read at least once anyway, will be in cache. I
| can see the benefit when the searched-for string is long.
| Jtsummers wrote:
| It makes the algorithm (if not using a series of unoptimized
| if-else-ifs) linear in the length of the string being searched.
| You go from O(n*k) to O(n).
|
| I mean, it may not matter _much_ , but it is more efficient. A
| more impactful optimization is knowing how far to skip in the
| string being searched based on which state you're in and what
| character appears.
| kuter wrote:
| https://cp-algorithms.com/string/prefix-function.html
___________________________________________________________________
(page generated 2021-01-22 23:01 UTC)