[HN Gopher] Ken Thompson's NFA regex patent (1971)
___________________________________________________________________
Ken Thompson's NFA regex patent (1971)
Author : jonstewart
Score : 62 points
Date : 2022-11-11 20:44 UTC (2 hours ago)
(HTM) web link (patents.google.com)
(TXT) w3m dump (patents.google.com)
| joedrago wrote:
| I read an article a million years ago which I thought described
| this mechanism in a super friendly way:
|
| https://perl.plover.com/Regex/article.html
|
| It uses the idea of "putting a penny down" on the graph and
| moving it around, and I found the conversational tone of the
| description to be very easy to grok. Assuming this is the exact
| same mechanism (which the author cites), it might resonate for
| some of you too.
| jonstewart wrote:
| ken also wrote a paper about his algorithm, published in CACM in
| 1968: https://dl.acm.org/doi/pdf/10.1145/363347.363387 (PDF)
| HellsMaddy wrote:
| https://en.wikipedia.org/wiki/Thompson's_construction
| Cupertino95014 wrote:
| Does anyone know if this patent was ever asserted by AT&T? (Of
| course, it's long expired.)
| ape4 wrote:
| Do current regex implementations use this algorithm?
| bla3 wrote:
| Most don't, but re2 and a few others do. The ones that do use
| it don't have exponential runtime on malicious inputs and lack
| a few features (back references, mostly).
| https://swtch.com/~rsc/regexp/ is a great resource on this.
| iron2disulfide wrote:
| > Most don't
|
| Is this true? I'm pretty sure that most of the regex engines
| I've used (grep, ripgrep, re2, hyperscan) use Thompson's
| construction or at least some NFA-based algorithm (not
| necessarily Thomopson's; Glushkov automata in particular are
| used by hyperscan).
| burntsushi wrote:
| Yes, all of those are finite automata based. Although grep
| usually uses the POSIX system regex library. GNU grep will
| use its own finite automata based approach for a subset (a
| large subset) of regexes.
|
| But most regex engines are backtracking based. Perl, PCRE
| (used by PHP and probably others), Oniguruma/Onigmo (used
| by Ruby), Javascript, Java, Python. That covers a lot of
| ground.
|
| Plus, popular reference web sites like https://www.regular-
| expressions.info/ are heavily biased toward regex engines
| that have features typically only implemented by
| backtracking engines.
| masklinn wrote:
| There's also postgres's which descends from spencer's
| (Tcl Advanced Regular Expressions) and supports
| backreferences but IME is very hard to catch into
| catastrophic backtracking.
|
| Possibly because you need to use the feature to trigger
| the NFA mode and the baseline test case for catastrophic
| backtracking doesn't, it just assumes the engine will
| backtrack.
| klodolph wrote:
| As a side note, when converting the NFA to a DFA, you get an
| exponential number of states, since each DFA state
| corresponds to a set of NFA states.
| burntsushi wrote:
| True, and that's why libraries like RE2 don't build full
| DFAs. They use lazy DFAs. DFA states are compiled as
| needed. At most one state is built per byte of input, so
| this sidesteps the exponential time bound.
|
| In all implementations I know of, memory use is bounded.
| When memory fills up, the cache of DFA states is cleared.
| If this happens too many times, RE2 quits the lazy DFA and
| falls back to a Pike VM. (Effectively the Thompson NFA
| matcher, but with capture group support.)
___________________________________________________________________
(page generated 2022-11-11 23:00 UTC)