[HN Gopher] Adding lookbehinds to rust-lang/regex
___________________________________________________________________
Adding lookbehinds to rust-lang/regex
Author : emschwartz
Score : 75 points
Date : 2025-07-15 15:37 UTC (7 hours ago)
(HTM) web link (systemf.epfl.ch)
(TXT) w3m dump (systemf.epfl.ch)
| CJefferson wrote:
| Great! I enjoyed reading through, and I'm going to come back
| later and read a little more carefully.
|
| If anyone knows (to let me be lazy), is this the same regex
| engine used by ripgrep? Or is that an independent implementation?
| cbarrick wrote:
| Same engine as ripgrep
| flaghacker wrote:
| Yes, the `regex` crate is also the regex engine used by
| ripgrep, both were developed by https://github.com/burntsushi.
| shilangyu wrote:
| As others have pointed out, the regex engine is the same so the
| benefits would trickle downstream. For example, VSCode also
| uses ripgrep and therefore the rust-lang/regex engine.
| burntsushi wrote:
| ripgrep plugged this gap a long time ago by providing PCRE2
| support.
| singron wrote:
| I don't think there is any discussion of the snort-2 and snort-3
| benchmarks, where the linear engine for once handily beats Python's
| re (70-80x faster). I'm guessing they are cases where
| backtracking is painfully quadratic in re, but it would have been
| nice to hear about those successes. [In the rest of the
| benchmarks, Python's re is 2-5x faster.]
| d3m0t3p wrote:
| Nice to see a master's thesis highlighted on the research group's
| page
| RadiozRadioz wrote:
| From a user perspective, this is extremely valuable. What an
| amazing improvement, the unbounded support especially. I do hope
| this makes it into actual RE2 and Go.
|
| When I use regex, I expect to be able to lookbehind, so I am
| routinely hit by RE2's limitations in places where it's used.
| Sometimes the software uses the entire matched string and you
| can't use non-capturing groups to work around it.
|
| I understand Go's reasons (ReDoS etc.), but the "purism" of RE2
| does fly in the face of practicality to an irksome degree. This
| is not uncommon for Go.
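A small sketch of the limitation described above, using Python's `re` module only because it supports lookbehind out of the box (the input string and variable names are made up for illustration). It shows why a capture group is not a full substitute when a tool only looks at the entire match:

```python
import re

text = "total: $42"  # hypothetical input

# With a lookbehind, the "$" is checked but not consumed,
# so the whole match is just the digits.
with_lookbehind = re.search(r"(?<=\$)\d+", text)

# Without a lookbehind, the "$" has to be consumed. The digits are
# then only reachable through group 1, which does not help if the
# surrounding software reads group 0 (the entire match).
with_group = re.search(r"\$(\d+)", text)

print(with_lookbehind.group(0))
print(with_group.group(0), with_group.group(1))
```

If the consuming tool only ever uses the entire match, the second pattern leaves the "$" in the result and there is no way to strip it from inside the regex.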
| hnlmorg wrote:
| The point of standard libraries is to provide sane default
| behaviours. Go's regexp package is a sensible default.
|
| For instances where you need something more sophisticated than
| what's in the standard library, you reach for 3rd party
| modules. And there are regex libraries for Go which support
| backtracking et al.
|
| There are definitely some irksome defaults in Go, but the choice
| of regex engine in the regexp library isn't one of them
| masklinn wrote:
| The authors' previous article (linked in this one) was about
| doing this in re2
| (https://systemf.epfl.ch/blog/re2-lookbehinds/), and they have
| a fork with those changes though I don't know that they have a
| PR.
|
| > the "purism" of RE2 does fly in the face of practicality to
| an irksome degree
|
| It's not purism tho. There are very practical reasons to want
| an FA-based engine, and if you compromise that to get
| additional features then the engine is pointless, you could
| have just used a backtracking engine in the first place.
| ncruces wrote:
| I couldn't find the link in that page, but the fork is here,
| and seems to be up-to-date: https://github.com/GerHobbelt/re2
|
| If you need that from Go, you can probably use that to create
| a fork of this: https://github.com/wasilibs/go-re2
| chubot wrote:
| What are some examples of problems where you've used
| lookbehinds?
| progbits wrote:
| While I agree this is a common golang theme, in this case I
| believe this decision predates the golang implementation and
| comes from the C++ RE2 days, no?
| LegionMammal978 wrote:
| > However, as a downside our lookbehinds do not support
| containing capture groups which are a feature allowing to extract
| a substring that matched a part of the regex pattern.
|
| I wonder in what situation someone would even be tempted to put a
| capture group into a lookbehind expression, except
| unintentionally by using () instead of (?:) for grouping. Maybe
| in an attempt to obtain capture groups from overlapping matches?
| But even in that case, lookaheads would be clearer, when
| available.
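For readers unfamiliar with the overlapping-match trick mentioned above, here is a minimal sketch in Python's `re` (chosen only because it supports lookaheads; the input is made up). A capture group inside a zero-width lookahead lets each position report a match without consuming any text:

```python
import re

digits = "1234"  # hypothetical input

# A plain pattern consumes what it matches, so matches cannot overlap.
non_overlapping = re.findall(r"\d\d", digits)

# Wrapping the pattern in a lookahead makes every match zero-width;
# the capture group inside it still records the text that was seen.
overlapping = re.findall(r"(?=(\d\d))", digits)

print(non_overlapping)
print(overlapping)
```

The first call yields the two disjoint pairs, while the second also picks up the pair that straddles them.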
| hu3 wrote:
| Interesting. I have used look behind before without knowing their
| specifics. AI generated a regex and unit tests passed so I
| carried on with life.
|
| Searching for a simple explanation of how it works, I found this
| which also explains negative look behind and look ahead. TIL:
|
| https://www.phptutorial.net/php-tutorial/regex-lookbehind/
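As a quick companion to the linked tutorial, the four lookaround forms can be sketched with Python's `re` module (the sample sentence and variable names are assumptions for illustration, not taken from the tutorial):

```python
import re

text = "cost $30 or 40 EUR"  # hypothetical input

# Positive lookbehind: digits that are preceded by "$".
after_dollar = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: whole numbers that are NOT preceded by "$".
not_after_dollar = re.findall(r"(?<!\$)\b\d+", text)

# Positive lookahead: digits that are followed by " EUR".
before_eur = re.findall(r"\d+(?= EUR)", text)

# Negative lookahead: whole numbers that are NOT followed by " EUR".
not_before_eur = re.findall(r"\b\d+\b(?! EUR)", text)

print(after_dollar, not_after_dollar, before_eur, not_before_eur)
```

In every case the lookaround only tests the surrounding text; the "$" and " EUR" never become part of the match itself.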
| librasteve wrote:
| It's odd to see such a widely adopted language as Rust only just
| getting some regex basics. Raku (https://raku.org), by contrast,
| has taken a strong step forward in regex syntax over PCRE. It was
| designed by the same language designer and implements modern
| Unicode-savvy features such as grapheme and diacritic handling that
| are essential to building consistent code to handle multilingual
| needs.
|
|   say "Cool" ~~ /<:Letter>* <:Block("Emoticons")>/;  # [Cool]
|   say "Czesc" ~~ m:ignoremark/ Czesc /;              # [Czesc]
|   say "WEISsE" ~~ m:ignorecase/ weisse /;            # [WEISsE]
|   say "hnuuaehmset`r" ~~ /<:Letter>+/;               # [hnuuaehmset`r]
| librasteve wrote:
| huh ... guess HN blocks emojis
| burntsushi wrote:
| It's not only just getting some "regex basics." The `fancy-
| regex` crate has provided look-behind for years. The OP is
| about adapting look-behind to the linear time guarantee
| required by the `regex` crate.
|
| My main focus for the `regex` crate has been on performance:
| https://github.com/BurntSushi/rebar
|
| How does Raku's regex performance compare to Perl?
| kibwen wrote:
| _> the linear time guarantee required by the `regex` crate_
|
| Making sure this line isn't glossed over: the point of the
| regex crate is that it provides linear-time guarantees for
| arbitrary regexes, making it safe (within reason) to expose
| the regex engine to untrusted input without running the risk
| of trivial DoS. From what I can tell, supporting lookbehinds
| in such a context is something that researchers have only
| recently described.
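To make the linear-time point concrete, here is a toy sketch in Python comparing a naive backtracking matcher with a lockstep NFA simulation for the single pattern `(a|aa)*b`. Everything in it (function names, the hand-built transition table, the step counters) is invented for illustration. It is not how the `regex` crate is implemented and it does not handle lookbehinds; it only shows why tracking a set of states keeps the work proportional to the input length.

```python
# Toy comparison for the pattern (a|aa)*b, matched against the whole input.
# All names here are illustrative; this is not code from the regex crate.

def backtrack_match(text, i, steps):
    """Naive backtracking: try each alternative, rewind on failure."""
    steps[0] += 1
    if i + 1 == len(text) and text[i] == "b":
        return True
    if text.startswith("aa", i) and backtrack_match(text, i + 2, steps):
        return True
    if text.startswith("a", i) and backtrack_match(text, i + 1, steps):
        return True
    return False

# Hand-built NFA: state 0 = at a loop boundary, state 1 = saw the first
# "a" of an "aa" alternative, state 2 = accepted the final "b".
NFA = {
    (0, "a"): {0, 1},
    (1, "a"): {0},
    (0, "b"): {2},
}

def nfa_match(text, steps):
    """Lockstep simulation: advance every live state once per character."""
    states = {0}
    for ch in text:
        next_states = set()
        for state in states:
            steps[0] += 1
            next_states |= NFA.get((state, ch), set())
        states = next_states
    return 2 in states

text = "a" * 20  # no trailing "b", so both matchers must reject it
bt_steps, nfa_steps = [0], [0]
bt_result = backtrack_match(text, 0, bt_steps)
nfa_result = nfa_match(text, nfa_steps)
print(bt_result, bt_steps[0])
print(nfa_result, nfa_steps[0])
```

The backtracker's step count grows roughly like the Fibonacci numbers as the input gets longer, because it retries every way of splitting the run of "a"s. The simulation never has more than two live states per character, so its step count stays at most twice the input length.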
| dmit wrote:
| > making it safe (within reason) to expose the regex engine
| to untrusted input
|
| Or even trusted input! https://blog.cloudflare.com/details-
| of-the-cloudflare-outage...
| librasteve wrote:
| I stand corrected on that - I was responding to the headline
| and did not appreciate that Rust has had library support
| beforehand. (That said, having regex around in different
| standard vs. crate options is not necessarily the ideal).
|
| It's good to have a focus and I agree that Rust is all about
| performance and stability for a system language.
|
| I haven't seen Raku regex performance benchmarked, but I
| would be surprised if it beats perl or Rust.
|
| I wouldn't say that Raku is a good choice where speed is the
| most important consideration since it is a scripting language
| that runs on a VM with GC. Nevertheless the language syntax
| includes many features (hyper operators, lazy evaluation to
| name two) that make it amenable to performance optimisation.
| masklinn wrote:
| > That said, having regex around in different standard vs.
| crate options is not necessarily the ideal
|
| What 1: both regex and fancy-regex are crates. Regex is
| under the rust-lang umbrella but it's not part of the
| stdlib.
|
| What 2: having different options is the point of third-party
| libraries. Why would you have a third-party library
| which is the exact same thing as the standard library?
| librasteve wrote:
| so Rust has no regex in the standard library, basic/fast
| regex under the rust-lang umbrella in a crate and fancy-
| regex is a 3rd party crate
|
| not having different options is the point of (batteries
| included) standard libraries ;-)
| burntsushi wrote:
| We (I am on libs-api in addition to authoring the regex
| crate) specifically eschewed a batteries included
| standard library. The fact that `regex` was its own thing
| was the best thing that ever happened to it. It let me
| iterate on its API independent of the standard library.
| SteveJS wrote:
| I loved discovering that rust has O(n) guardrails on regex!
| The so-called features that break that constraint are anti-
| features.
|
| Over the last two weeks I wrote a dialog-aware English
| sentence splitter using Claude Code to write Rust. The
| compile error when it stuck lookarounds in one of the regexes
| was super useful to me.
| shawn_w wrote:
| I don't think Philip Hazel, who wrote PCRE, has anything to do
| with perl or raku development.
| librasteve wrote:
| sorry I didn't know that Philip Hazel wrote PCRE ... and I
| certainly credit the initiative to release Perl Compatible
| Regular Expressions from the grip of perl
|
| my main point is that PCRE was based on perl regexes and that
| these were designed by Larry Wall, so he had some
| experience of the strengths and weaknesses of perl RE
| when it came to designing the RE syntax of Raku (i.e. the
| language formerly known as Perl 6)
| quotemstr wrote:
| This right here is one of the foundational splits in the
| programming community. This article is all about how cool an
| _implementation_ is. This comment is about some other engine's
| cool _syntax_. Deep versus superficial. The two camps can't
| stand each other.
| librasteve wrote:
| Speaking on behalf of the superficial camp, I admire the Rust
| core regex focus on linear performance and I can well believe
| that it is based on recent theoretical work.
|
| Splitting the regex features between some core ones that meet
| a DoS standard and some non-core modules that do other
| "convenience" features makes sense as a trade off for Rust.
| It would not make sense in a scripting language like Raku
| where the weight is on coder expressiveness and making it
| easier / faster to write working code.
|
| I seem to have hit a seam of intense implementation guys -
| and they are holding their own since they know their stuff.
|
| I think there is room for improvement BOTH with new system
| language / core performance innovation AND with advancing the
| PCRE regex syntax (largely unchanged since the 1990s) and
| merging it seamlessly with standard language support for
| Grammars.
___________________________________________________________________
(page generated 2025-07-15 23:01 UTC)