[HN Gopher] Adding lookbehinds to rust-lang/regex
       ___________________________________________________________________
        
       Adding lookbehinds to rust-lang/regex
        
       Author : emschwartz
       Score  : 75 points
       Date   : 2025-07-15 15:37 UTC (7 hours ago)
        
 (HTM) web link (systemf.epfl.ch)
 (TXT) w3m dump (systemf.epfl.ch)
        
       | CJefferson wrote:
       | Great! I enjoyed reading through, and I'm going to come back
       | later and read a little more carefully.
       | 
       | If anyone knows (to let me be lazy), is this the same regex
       | engine used by ripgrep? Or is that an independent implementation?
        
         | cbarrick wrote:
         | Same engine as ripgrep
        
         | flaghacker wrote:
         | Yes, the `regex` crate is also the regex engine used by
         | ripgrep, both were developed by https://github.com/burntsushi.
        
         | shilangyu wrote:
         | As others have pointed out, the regex engine is the same so the
         | benefits would trickle downstream. For example, VSCode also
         | uses ripgrep and therefore the rust-lang/regex engine.
        
           | burntsushi wrote:
           | ripgrep plugged this gap a long time ago by providing PCRE2
           | support.
        
       | singron wrote:
       | I don't think there is any discussion of the snort-2 and snort-3
       | benchmarks, where the linear engine handily beats Python's re
       | for once (70-80x faster). I'm guessing they are cases where
       | backtracking is painfully quadratic in re, but it would have been
       | nice to hear about those successes. [In the rest of the
       | benchmarks, Python's re is 2-5x faster.]
        
       | d3m0t3p wrote:
       | Nice to see a master's thesis highlighted on the research group's
       | page
        
       | RadiozRadioz wrote:
       | From a user perspective, this is extremely valuable. What an
       | amazing improvement; unbounded especially. I do hope this would
       | make it into actual RE2 & go.
       | 
       | When I use regex, I expect to be able to lookbehind, so I am
       | routinely hit by RE2's limitations in places where it's used.
       | Sometimes the software uses the entire matched string and you
       | can't use non-capturing groups to work around it.
       | 
       | I understand go's reasons, ReDoS etc, but the "purism" of RE2
       | does fly in the face of practicality to an irksome degree. This
       | is not uncommon for go.
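
A minimal sketch of the workaround problem described above, using Python's standard `re` module purely for illustration (the sample text and patterns are assumptions, not taken from the thread). With a lookbehind, the whole match is already the wanted substring; without one, the prefix leaks into the whole match and only a capture group isolates the digits, which does not help when the calling software only looks at the whole match.

```python
import re

text = "price: $42, qty: 7"

# With a lookbehind, the "$" is required but is not part of the match.
with_lookbehind = re.search(r"(?<=\$)\d+", text)
whole_match_lookbehind = with_lookbehind.group(0)

# Without a lookbehind, the "$" must be consumed, so it ends up in the match.
without_lookbehind = re.search(r"\$(\d+)", text)
whole_match_group = without_lookbehind.group(0)
captured = without_lookbehind.group(1)

print(whole_match_lookbehind)  # 42
print(whole_match_group)       # $42
print(captured)                # 42
```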
        
         | hnlmorg wrote:
         | The point of standard libraries is to provide sane default
         | behaviours. Go's regexp package is a sensible default.
         | 
         | For instances where you need something more sophisticated than
         | what's in the standard library, you reach for 3rd party
         | modules. And there are regex libraries for Go which support
         | backtracking et al.
         | 
         | There's definitely some irksome defaults in Go, but the choice
         | of regex engine in the regexp library isn't one of them.
        
         | masklinn wrote:
         | The authors' previous article (linked in this one) was about
         | doing this in re2
         | (https://systemf.epfl.ch/blog/re2-lookbehinds/), and they have
         | a fork with those changes though I don't know that they have a
         | PR.
         | 
         | > the "purism" of RE2 does fly in the face of practicality to
         | an irksome degree
         | 
         | It's not purism tho. There are very practical reasons to want
         | an FA-based engine, and if you compromise that to get
         | additional features then the engine is pointless, you could
         | have just used a backtracking engine in the first place.
        
           | ncruces wrote:
           | I couldn't find the link in that page, but the fork is here,
           | and seems to be up-to-date: https://github.com/GerHobbelt/re2
           | 
           | If you need that from Go, you can probably use that to create
           | a fork of this: https://github.com/wasilibs/go-re2
        
         | chubot wrote:
         | What are some examples of problems where you've used
         | lookbehinds?
        
         | progbits wrote:
         | While I agree this is a common golang theme, in this case I
         | believe this decision predates the golang implementation and
         | comes from the C++ RE2 days, no?
        
       | LegionMammal978 wrote:
       | > However, as a downside our lookbehinds do not support
       | containing capture groups which are a feature allowing to extract
       | a substring that matched a part of the regex pattern.
       | 
       | I wonder in what situation someone would even be tempted to put a
       | capture group into a lookbehind expression, except
       | unintentionally by using () instead of (?:) for grouping. Maybe
       | in an attempt to obtain capture groups from overlapping matches?
       | But even in that case, lookaheads would be clearer, when
       | available.
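
A small sketch of the overlapping-matches idea mentioned above, written with Python's `re` module as an illustration only (the input string is an assumption). A capture group inside a zero-width lookahead reports every overlapping two-digit window, whereas a plain consuming pattern skips the overlaps.

```python
import re

digits = "1234"

# A consuming pattern only finds non-overlapping matches.
non_overlapping = re.findall(r"\d\d", digits)

# A capture group inside a lookahead matches at every position without
# consuming input, so overlapping windows are all reported.
overlapping = re.findall(r"(?=(\d\d))", digits)

print(non_overlapping)  # ['12', '34']
print(overlapping)      # ['12', '23', '34']
```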
        
       | hu3 wrote:
       | Interesting. I have used look behind before without knowing their
       | specifics. AI generated a regex and unit tests passed so I
       | carried on with life.
       | 
       | Searching for a simple explanation of how it works, I found this
       | which also explains negative look behind and look ahead. TIL:
       | 
       | https://www.phptutorial.net/php-tutorial/regex-lookbehind/
        
       | librasteve wrote:
       | It's odd to see such a widely adopted language as Rust only just
       | getting some regex basics. Whereas Raku (https://raku.org) has
       | made a strong forward step in regex syntax over PCRE, made by the
       | same language designer with implementation of modern unicode
       | savvy features like Grapheme and Diacritic handling that are
       | essential to building consistent code to handle multilingual
       | needs.
       | 
       |     say "Cool" ~~ /<:Letter>* <:Block("Emoticons")>/; # [Cool]
       |     say "Czesc" ~~ m:ignoremark/ Czesc /;             # [Czesc]
       |     say "WEISsE" ~~ m:ignorecase/ weisse /;           # [WEISsE]
       |     say "hnuuaehmset`r" ~~ /<:Letter>+/;              # [hnuuaehmset`r]
        
         | librasteve wrote:
         | huh ... guess HN blocks emojis
        
         | burntsushi wrote:
         | It's not only just getting some "regex basics." The
         | `fancy-regex` crate has provided look-behind for years. The OP
         | is about adapting look-behind to the linear time guarantee
         | required by the `regex` crate.
         | 
         | My main focus for the `regex` crate has been on performance:
         | https://github.com/BurntSushi/rebar
         | 
         | How does Raku's regex performance compare to Perl?
        
           | kibwen wrote:
           | _> the linear time guarantee required by the `regex` crate_
           | 
           | Making sure this line isn't glossed over: the point of the
           | regex crate is that it provides linear-time guarantees for
           | arbitrary regexes, making it safe (within reason) to expose
           | the regex engine to untrusted input without running the risk
           | of trivial DoS. From what I can tell, supporting lookbehinds
           | in such a context is something that researchers have only
           | recently described.
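
One way to see why a lookbehind need not break a linear-time guarantee is to run a small automaton for the lookbehind in lockstep with the scan, so each character is examined once. The sketch below is an illustration under that assumption only; it is hand-written for the hypothetical pattern `(?<=ab+)c` and is not claimed to be how the `regex` crate or the linked work implements it. The function names are made up for this example.

```python
import re

def lookbehind_hits(text):
    """Return indices i where text[i] == 'c' and text[:i] ends with 'ab+'.

    Hypothetical single-pass evaluation of the unbounded lookbehind (?<=ab+)c.
    """
    state = 0  # 0: no progress, 1: just saw 'a', 2: text so far ends with 'ab+'
    hits = []
    for i, ch in enumerate(text):
        # The lookbehind holds exactly when the automaton is in state 2.
        if ch == "c" and state == 2:
            hits.append(i)
        # Advance the lookbehind automaton by one character.
        if ch == "a":
            state = 1
        elif ch == "b" and state in (1, 2):
            state = 2
        else:
            state = 0
    return hits

def brute_force_hits(text):
    """Reference answer: re-scan the whole prefix at every candidate position."""
    return [i for i, ch in enumerate(text)
            if ch == "c" and re.search(r"ab+\Z", text[:i])]

sample = "abbc ac abc c"
fast = lookbehind_hits(sample)
slow = brute_force_hits(sample)
print(fast)  # [3, 10]
```

The brute-force version re-reads the prefix for every candidate and can do quadratic work, while the lockstep version does a constant amount of work per character.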
        
             | dmit wrote:
             | > making it safe (within reason) to expose the regex engine
             | to untrusted input
             | 
             | Or even trusted input! https://blog.cloudflare.com/details-
             | of-the-cloudflare-outage...
        
           | librasteve wrote:
           | I stand corrected on that - I was responding to the headline
           | and did not appreciate that Rust has had library support
           | beforehand. (That said, having regex around in different
           | standard vs. crate options is not necessarily the ideal).
           | 
           | It's good to have a focus and I agree that Rust is all about
           | performance and stability for a system language.
           | 
           | I haven't seen Raku regex performance benchmarked, but I
           | would be surprised if it beats perl or Rust.
           | 
           | I wouldn't say that Raku is a good choice where speed is the
           | most important consideration since it is a scripting language
           | that runs on a VM with GC. Nevertheless the language syntax
           | includes many features (hyper operators, lazy evaluation to
           | name two) that make it amenable to performance optimisation.
        
             | masklinn wrote:
             | > That said, having regex around in different standard vs.
             | crate options is not necessarily the ideal
             | 
             | What 1: both regex and fancy-regex are crates. Regex is
             | under the rust-lang umbrella but it's not part of the
             | stdlib.
             | 
             | What 2: having different options is the point of third
             | party libraries, why would you have a third party library
             | which is the exact same thing as the standard library?
        
               | librasteve wrote:
               | so Rust has no regex in the standard library, basic/fast
               | regex under the rust-lang umbrella in a crate and fancy-
               | regex is a 3rd party crate
               | 
               | not having different options is the point of (batteries
               | included) standard libraries ;-)
        
               | burntsushi wrote:
               | We (I am on libs-api in addition to authoring the regex
               | crate) specifically eschewed a batteries included
               | standard library. The fact that `regex` was its own thing
               | was the best thing that ever happened to it. It let me
               | iterate on its API independent of the standard library.
        
           | SteveJS wrote:
           | I loved discovering that rust has O(n) guardrails on regex!
           | The so-called features that break that constraint are anti-
           | features.
           | 
           | Over the last two weeks I wrote a dialog-aware English
           | sentence splitter using Claude Code to write Rust. The
           | compile error when it stuck lookarounds in one of the regexes
           | was super useful to me.
        
         | shawn_w wrote:
         | I don't think Philip Hazel, who wrote PCRE, has anything to do
         | with perl or raku development.
        
           | librasteve wrote:
           | sorry I didn't know that Philip Hazel wrote PCRE ... and I
           | certainly credit the initiative to release Perl Compatible
           | Regular Expressions from the grip of perl
           | 
           | my main point is that PCRE was based on perl regexes, which
           | were designed by Larry Wall, so he had some experience with
           | the strengths and weaknesses of perl RE when it came to
           | designing the Raku RE syntax (ie. the language formerly known
           | as Perl 6)
        
         | quotemstr wrote:
         | This right here is one of the foundational splits in the
         | programming community. This article is all about how cool an
         | _implementation_ is. This comment is about some other engine's
         | cool _syntax_. Deep versus superficial. The two camps can't
         | stand each other.
        
           | librasteve wrote:
           | Speaking on behalf of the superficial camp, I admire the Rust
           | core regex focus on linear performance and I can well believe
           | that it is based on recent theoretical work.
           | 
           | Splitting the regex features between some core ones that meet
           | a DoS standard and some non-core modules that do other
           | "convenience" features makes sense as a trade off for Rust.
           | It would not make sense in a scripting language like Raku
           | where the weight is on coder expressiveness and making it
           | easier / faster to write working code.
           | 
           | I seem to have hit a seam of intense implementation guys -
           | and they are holding their own since they know their stuff.
           | 
           | I think there is room for improvement BOTH with new system
           | language / core performance innovation AND with advancing the
           | PCRE regex syntax (largely unchanged since the 1990s) and
           | merging it seamlessly with standard language support for
           | Grammars.
        
       ___________________________________________________________________
       (page generated 2025-07-15 23:01 UTC)