[HN Gopher] Needle: A DFA Based Regex Library That Compiles to J...
___________________________________________________________________
Needle: A DFA Based Regex Library That Compiles to JVM ByteCode
Author : todsacerdoti
Score : 101 points
Date : 2024-05-08 01:38 UTC (21 hours ago)
(HTM) web link (justinblank.com)
(TXT) w3m dump (justinblank.com)
| librasteve wrote:
| looks very interesting - coming from the perl (ie the original
| PCRE) it's rather a pity that this is focused on the JVM since
| Java is not the most regex friendly authoring environment (sorry
| I do not know if other JVM denizens like Kotlin / Clojure have
| more natural regex syntax)
|
| Larry Walls version 2 of PCRE is doing some interesting
| innovation which, while continuing the model of a character level
| syntax where punctuation <[.+*]> and so on are the "verbs" as the
| regex Sub-Language (one of the Slangs of raku), now provides deep
| support for unicode (eg. a digit test looks at the unicode
| properties, not just <[0..9]>), and can be gradually unpacked
| using <token> and <rule> methods all the way up to a full blown
| composable class based Grammar ... so you can still write regex
| one-liners perl style but then evolve that to a small Grammar to
| reduce line noise and drive up code clarity
|
| Are there any plans to provide your library on the MoarVM?
| hyperpape wrote:
| Author here. I don't know much about the MoarVM, so no plans.
|
| I have very idly thought about being able to output native code
| for use in other systems, but realistically, I have enough
| other ideas to keep my nights and weekends busy for a long
| time.
| CoastalCoder wrote:
| > the Slangs of raku
|
| Sounds like the name of some race in a sci-fi story :)
| MrBuddyCasino wrote:
| Kotlin isn't too bad, you can avoid double-quoting and use
| .toRegex() on strings: val regex =
| """([\w\s]+) is (\d+) years old""".toRegex()
| neonsunset wrote:
| C# does this as well. Its OOB Regex engine can compile
| expressions to automata at runtime, or there's an alternate newer
| back-end to generate the source code at compile time, which is
| useful in AOT as you can't JIT the IL there.
|
| Similar to the blog post, this allows it to be one of the
| fastest: https://github.com/BurntSushi/rebar?tab=readme-ov-
| file#summa...
| BoingBoomTschak wrote:
| A related project that might interest some people:
| https://github.com/telekons/one-more-re-nightmare
|
| And the pretty hard to find blog post about it: https://applied-
| langua.ge/posts/omrn-compiler.html
| usrusr wrote:
| Now you had me hoping for an implementation of fitting Needle
| behind the existing Pattern/Matcher API, and a compiler plugin
| that substitutes Pattern.compile accordingly wherever the
| inputs are known at compile time.
| hyperpape wrote:
| Author here: one thing I didn't mention in the post, but
| which is mentioned in the issues/readme on GitHub is that
| while I've been working on this for awhile, it's still a
| pretty new project in many ways. So what you're describing is
| interesting, but probably premature. The generated class
| files can still be very large, and the compiler itself has
| comments like "TODO: O(n^2)" that make throwing arbitrary
| input at it a dodgy idea.
|
| Beyond those factors, there's the matter of just proving out
| that my testing is good enough, and I haven't missed edge
| cases.
| kamma4434 wrote:
| By looking at the benchmarks, it is interesting to see how much
| an advantage have non-backtracking engines versus the default
| Java one. It is also interesting that most of the regular
| expressions I happen to write are not backtracking, so it's
| likely I'm paying a ticket every time for a feature I'm not going
| to see.
|
| It would be interesting if the default Java class the implements
| regular expressions would detect whether backtracking is actually
| used in the regular expression when it is compiled and decide
| whether the full blown engine is really necessary or a
| lightweight one would be enough.
| burntsushi wrote:
| The set of regex engines being compared here is pretty small,
| and even among backtracking regex engines, Java's is pretty
| slow. See: https://github.com/BurntSushi/rebar?tab=readme-ov-
| file#summa...
|
| The backtracking engines ahead of are pcre2/jit, javascript/v8,
| d/ldc/std-regex (technically a hybrid I believe) and regress.
| Java's engine is about on par with Python's and Perl's (which
| are both written in C).
| hyperpape wrote:
| I'd definitely like to get needle into rebar (or a branch of
| it) and be able to run a broader set of comparisons.
|
| Also, I've learned a lot from your code. A lot of this
| project was stubbornly banging my head against the problem,
| but whenever I wanted to see how another engine did things,
| your library was the one I looked at.
| burntsushi wrote:
| Yeah! Please file issues or ask questions if you need help.
| :-)
|
| Hopefully it's easy enough to follow the `java` example to
| add your own.
| svilenivanov wrote:
| FWIW, there is a similar project [1] which compiles a regex to
| bytecode. The benchmarks [2] are impressive.
|
| [1] https://github.com/humio/jitrex [1]
| https://github.com/humio/jitrex?tab=readme-ov-file#performan...
| hyperpape wrote:
| Interesting--I had seen several different regex libraries for
| the JVM, but not this one.
| exabrial wrote:
| > Version 0.0.1
|
| Thank you for semver-ing!
___________________________________________________________________
(page generated 2024-05-08 23:02 UTC)