[HN Gopher] Pne-more-re-nightmare - A fast regex compiler in Com...
___________________________________________________________________
Pne-more-re-nightmare - A fast regex compiler in Common Lisp
Author : rurban
Score : 36 points
Date : 2021-11-30 21:35 UTC (1 day ago)
(HTM) web link (applied-langua.ge)
(TXT) w3m dump (applied-langua.ge)
| ogogmad wrote:
| What about derivative-based regex?
|
| I got worried at some point that some of the purported advantages
| of derivative-based regexes might be drawbacks. One such
| advantage/drawback is that it adds the _complement_ operation
| into the language of regexes. But that gives you a way of
| deciding whether two regexes are equivalent, a problem which is
| supposed to be PSPACE-complete, and that ought to severely
| hamper finding an efficient implementation.
| kazinator wrote:
| The complement operation doesn't provide a way of deciding
| whether two regexes are equivalent, nor is any such thing
| required to implement it. Using derivatives to interpret a
| regular expression that contains a complement operator does not
| require a function for calculating equivalence of two regular
| expressions.
|
| Equivalence is required for compiling. The reason is that
| regular expressions have features like the Kleene closure which
| express iteration, and these give rise to endless derivatives:
| regular expressions whose derivative you can keep taking _ad
| nauseam_. This is fine for interpreting, because the algorithm
| terminates when you have consumed the input or hit an impasse.
| The compiler, however, has to close the loops and turn them
| into cycles in the state machine graph. To do that it has to
| check each derivative: "have I derived an expression which has
| occurred before?" If so, then just hook to the previous state.
|
| If you make the check literal, just comparing the regex syntax
| for identical structure, the graph will sometimes be larger
| than necessary, because it will contain redundant nodes that
| should be equivalent states. The smarter you make the
| equivalence function, the more compact the graphs will be.
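|
| Concretely, here is a toy sketch in Common Lisp of the loop
| just described, over a tiny regex AST (characters, :cat, :alt,
| :star). All names (NULLABLE, DERIV, SIMP, COMPILE-DFA) are
| invented for illustration, and SIMP carries only a few of the
| structural rules a real derivative-based compiler would need:
|
|     ;; Does RE match the empty string?
|     (defun nullable (re)
|       (cond ((eql re :eps) t)
|             ((eql re :null) nil)
|             ((characterp re) nil)
|             (t (ecase (first re)
|                  (:cat (and (nullable (second re))
|                             (nullable (third re))))
|                  (:alt (or (nullable (second re))
|                            (nullable (third re))))
|                  (:star t)))))
|
|     ;; Brzozowski derivative of RE with respect to character C.
|     (defun deriv (re c)
|       (cond ((member re '(:eps :null)) :null)
|             ((characterp re) (if (eql re c) :eps :null))
|             (t (ecase (first re)
|                  (:cat (let ((r (second re)) (s (third re)))
|                          (if (nullable r)
|                              `(:alt (:cat ,(deriv r c) ,s)
|                                     ,(deriv s c))
|                              `(:cat ,(deriv r c) ,s))))
|                  (:alt `(:alt ,(deriv (second re) c)
|                               ,(deriv (third re) c)))
|                  (:star `(:cat ,(deriv (second re) c) ,re))))))
|
|     ;; A few structural simplification rules; a fuller rule set
|     ;; is what guarantees the state set stays finite.
|     (defun simp (re)
|       (if (atom re)
|           re
|           (let ((r (simp (second re)))
|                 (s (and (cddr re) (simp (third re)))))
|             (case (first re)
|               (:cat (cond ((or (eql r :null) (eql s :null)) :null)
|                           ((eql r :eps) s)
|                           ((eql s :eps) r)
|                           (t `(:cat ,r ,s))))
|               (:alt (cond ((eql r :null) s)
|                           ((eql s :null) r)
|                           ((equal r s) r)
|                           (t `(:alt ,r ,s))))
|               (:star (if (eql r :null) :eps `(:star ,r)))))))
|
|     ;; The compilation loop: each distinct derivative (compared
|     ;; with EQUAL, the "literal" structural check) becomes a
|     ;; state; a derivative seen before becomes a back-edge.
|     ;; Accepting states are those whose regex is NULLABLE.
|     (defun compile-dfa (re alphabet)
|       (let ((states (make-hash-table :test #'equal))
|             (transitions '())
|             (worklist (list re)))
|         (setf (gethash re states) 0)
|         (loop while worklist do
|           (let ((r (pop worklist)))
|             (dolist (c alphabet)
|               (let ((d (simp (deriv r c))))
|                 (unless (gethash d states)
|                   (setf (gethash d states)
|                         (hash-table-count states))
|                   (push d worklist))
|                 (push (list (gethash r states) c
|                             (gethash d states))
|                       transitions)))))
|         (values states transitions)))
|
|     ;; (compile-dfa '(:star (:alt #\a #\b)) '(#\a #\b))
|     ;; closes the loop after a single state.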
| hayley-patton wrote:
| Author here - I don't think this is an issue.
|
| If you don't use submatching, you can just compute the minimal
| DFAs and compare them to test for equivalence, and you don't
| even need negation. Thus, introducing negation can't do any
| harm there.
|
| But as I generate Mealy machines for submatching, which can't
| be minimised, we need another approach. Knowing that A - B =
| A ∩ !B, one can produce a DFA for (A - B) ∪ (B - A), and if
| it has no accepting states, then A and B must be equivalent, as
| nothing can match one but not both. But DFA construction is
| O(2^n), even if one takes the RE → NFA → DFA route, so this is
| nothing new complexity-wise.
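|
| Sketched in Lisp, with REGEX-TO-DFA and DFA-ACCEPTING-STATES
| as assumed, hypothetical helpers (they are not functions from
| one-more-re-nightmare), the whole decision procedure is just:
|
|     ;; A and B are equivalent iff their symmetric difference
|     ;; (A - B) ∪ (B - A) matches nothing at all.
|     (defun regex-equivalent-p (a b)
|       (let ((diff `(:alt (:and ,a (:not ,b))    ; A - B
|                          (:and ,b (:not ,a))))) ; B - A
|         (null (dfa-accepting-states (regex-to-dfa diff)))))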
|
| Such a decision procedure is never used in a derivative-based
| regex implementation, anyway. The equivalence of regular
| expressions only needs to be approximated, and the
| approximation consists of some structural rules (see Definition
| 4.1 of [1]). Relatively few rules are needed to guarantee that
| a _finite_ machine can be generated, and only a dozen rules are
| required to get very close to a minimal DFA (table 1, page 14
| of [1]). So having the possibility of writing a precise regex
| equivalence function is completely irrelevant to producing an
| efficient regex engine.
|
| A decision procedure is handy, but you can write one for
| submatch-less regexes without negation; the complexity of such
| a procedure is typical for DFA construction, and the problem
| of deciding whether two regular expressions are equivalent is
| irrelevant.
|
| [1] Regular-expression derivatives reexamined
| https://www.ccs.neu.edu/home/turon/re-deriv.pdf
| froh wrote:
| As this is a regex->DFA implementation, I wonder how it
| compares to Google RE2 [1].
|
| There are nice performance comparisons to glibc, Rust's regex
| and cl-ppcre, but RE2 was missing from that list.
|
| [1] https://github.com/google/re2
| hyperpape wrote:
| If the benchmarks are to be believed, I think it substantially
| outperforms re2, because it also beats Hyperscan, and I think
| Hyperscan typically beats re2.
| alexott wrote:
| re2 has limited functionality. It works well for specific
| patterns, like crawling, but not for general-purpose use.
| reikonomusha wrote:
| This is demonstrative of one of the great superpowers of Lisp:
| that you can write mini native code compilers--without knowing a
| lick of assembly. (But knowing assembly certainly helps you
| optimize the heck out of it.)
|
| More specifically, you can compile syntax (of your choosing) into
| performant, low-level Lisp code, which the Lisp compiler can then
| compile into efficient machine code. To do this, you need no
| extra tools, no intermediate files, no compiler hooks, no virtual
| machines, no byte codes, and no non-standard or non-portable
| features. It has been built into Lisp since at least the 1980s.
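|
| A minimal, generic illustration (invented here, not from the
| article), using only the standard COMPILE and backquote:
|
|     ;; Build a lambda form at runtime, then let the host Lisp
|     ;; compile it straight to machine code.
|     (defun compile-adder (n)
|       (compile nil `(lambda (x)
|                       (declare (type fixnum x)
|                                (optimize (speed 3)))
|                       (+ x ,n))))
|
|     ;; (funcall (compile-adder 42) 1) => 43, as native code.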
|
| Making miniature "JIT" compilers is also a technique a quantum
| simulator in Lisp [0a, 0b] uses to optimize quantum gate
| operations. It analyzes the quantum gate matrix and, on the fly,
| compiles it into optimized numerical Lisp code that is equivalent
| to that matrix acting on vectors of tensor product spaces (i.e.,
| super huge arrays of complex double floats). Using another
| language would require me to do arduous things like write
| assembly code or learn a library like LibJIT [1] which may not
| even interface well with my host language.
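|
| A toy version of the idea (not the QVM's actual
| COMPILE-OPERATOR): given a fixed 2x2 gate matrix, generate
| code with the entries folded in as constants, then compile it:
|
|     ;; New amplitudes are M * (a0 a1)^T; a smarter version
|     ;; would also drop multiplies by 0 and 1 entirely.
|     (defun compile-gate (m00 m01 m10 m11)
|       (compile nil
|                `(lambda (a0 a1)
|                   (declare (type (complex double-float) a0 a1)
|                            (optimize (speed 3)))
|                   (values (+ (* ,m00 a0) (* ,m01 a1))
|                           (+ (* ,m10 a0) (* ,m11 a1))))))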
|
| Other simulators out there written in C use huge hand-unrolled
| loops with comments like
|
|     // THIS CODE WAS GENERATED, DO NOT TOUCH
|
| and/or in-line assembly code, and still can't optimize specific
| matrix shapes and structures, or do algebraic simplifications to
| eliminate work altogether.
|
| The regex library FTA is a great, clean example of the long-
| standing practice of compiling regexen, except it doesn't use
| any fancy VMs or JITs: just "when you see this regex,
| automatically turn it into this Common Lisp code, and let the
| Lisp compiler handle the rest."
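|
| A toy sketch of that last step (invented for illustration, not
| one-more-re-nightmare's actual output), for the degenerate
| "regex" that is a literal string:
|
|     ;; Regex in, Lisp code out: unroll the match into one
|     ;; character comparison per position, then COMPILE it.
|     (defun compile-literal-matcher (string)
|       (compile nil
|                `(lambda (input start)
|                   (declare (type simple-string input)
|                            (type fixnum start)
|                            (optimize (speed 3)))
|                   (and (<= (+ start ,(length string))
|                            (length input))
|                        ,@(loop for ch across string
|                                for i from 0
|                                collect
|                                `(char= (schar input (+ start ,i))
|                                        ,ch))))))
|
|     ;; (funcall (compile-literal-matcher "cat") "concatenate" 3)
|     ;; => T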
|
| [0a] https://github.com/quil-lang/qvm
|
| [0b] COMPILE-OPERATOR: https://github.com/quil-
| lang/qvm/blob/master/src/compile-gat...
|
| [1] https://www.gnu.org/software/libjit/
| martincmartin wrote:
| Perhaps most importantly, a good CL implementation can help
| with debugging this, by keeping track of the original CL code
| that generated the intermediate code, and pointing you to it.
| With the "use a program to generate C" approach, if there's
| some bug in the C code, the best the compiler or runtime can do
| is point you to the line of C code that's crashing. It's up to
| you to figure out where that came from.
|
| This was a problem for early C++ compilers like Cfront that
| translated C++ into C: the error messages from the C compiler
| referenced the intermediate C, not the original C++.
| gmfawcett wrote:
| How old is the #line directive, I wonder? It's been a long
| time since Cfront.
|
| https://stackoverflow.com/questions/2216540/whats-the-
| meanin...
| i_am_proteus wrote:
| I express my appreciation for the author's reference to King
| Crimson - the track "One More Red Nightmare" off of the record
| _Red_.
| irq wrote:
| I noticed this too and was hoping the author would mention it
| in his article but couldn't find a reference to it :) This is a
| great album to listen to while doing technical work.
| hayley-patton wrote:
| See the first footnote: "The name is of course a reference to
| One More Red Nightmare."
| zeruch wrote:
| I'm sensing a subtle King Crimson reference here...
| phoe-krk wrote:
| This title has a typo - should be _One-more-re-nightmare_.
| jwr wrote:
| This is a good example of why I roll my eyes at most programming
| language discussions. There is always someone that starts the
| "language X is SLOW" bandwagon, and people hop on.
|
| Languages are rarely "slow" or "fast" by themselves. Programs
| written in those languages can be slow or fast.
| regularfry wrote:
| Yes and no. Some language features can make it incredibly
| difficult to optimise. I know the JRuby folks had all sorts of
| fun trying to make method dispatch fast in the face of "well,
| this method definition can be changed at any time, arbitrarily
| often, including during execution of the method itself". I know
| CL must also have solved that problem but I presume the toolkit
| is sufficiently different that the techniques just don't map.
| pfdietz wrote:
| A nice mechanism for optimizing such things was given at the
| last European Lisp Symposium.
|
| https://european-lisp-
| symposium.org/static/proceedings/2021....
|
| (The last paper, by Robert Strandh.)
| AllegedAlec wrote:
| I also feel that for 99% of applications, it _doesn't matter_.
| Like, yeah, there are a few areas where hyper-optimized code
| in some kind of near-bare-metal language is necessary. But most
| of the time, if you're writing code, the language itself isn't
| gonna be the deciding factor in whether something is performant
| enough for your application. Furthermore, by using a "fast"
| language, you're trading away your ability to quickly and
| cleanly write human-readable code.
| simion314 wrote:
| >Languages are rarely "slow" or "fast" by themselves.
|
| For sure they are: language features will limit the
| optimizations a compiler/interpreter can do, and a language
| with many levels of abstraction will have to pay the price in
| performance somewhere.
|
| Every time someone improves the benchmark performance of a slow
| language, that person is not a regular developer, but someone
| who has a deeper understanding of what happens under the hood,
| knows how to use profiling tools, maybe knows assembly, and
| most of the time has to use tricks or advanced patterns.
|
| Btw, optimizing some slow language in a hot path with clever
| tricks or some advanced patterns is fine, it happens all the
| time; you can't just switch the project language with one
| click.
| the-alt-one wrote:
| >language features will for sure limit the optimizations a
| compiler/interpreter can do
|
| For example:
|
| Common Lisp has a full numerical tower, meaning it will
| automatically convert fixed-size numbers into bignums (and
| more cool stuff). This means that an addition has to branch
| if it can't prove that the result won't overflow from fixnum
| to bignum. Yes, C programs sometimes need bignums too, but
| there they're on an opt-in basis. In CL they're opt-out.
|
| There are more examples of this, often related to the
| dynamism of your language. The more introspection a language
| is capable of, the less a compiler designer can do to cheat
| and optimize your program.
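|
| As a sketch of what opt-out looks like (standard CL, example
| invented here):
|
|     ;; By default, arithmetic silently promotes to bignums:
|     ;; (* most-positive-fixnum 2) => a bignum, no overflow.
|     ;;
|     ;; Opting out: promise the compiler everything stays a
|     ;; fixnum so it can emit a raw machine add with no bignum
|     ;; branch. If the promise is broken, behaviour is
|     ;; undefined -- exactly C's deal.
|     (defun fast-add (x y)
|       (declare (type fixnum x y)
|                (optimize (speed 3) (safety 0)))
|       (the fixnum (+ x y)))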
___________________________________________________________________
(page generated 2021-12-01 23:01 UTC)