[HN Gopher] Needle: A DFA Based Regex Library That Compiles to J...
       ___________________________________________________________________
        
       Needle: A DFA Based Regex Library That Compiles to JVM ByteCode
        
       Author : todsacerdoti
       Score  : 101 points
       Date   : 2024-05-08 01:38 UTC (21 hours ago)
        
 (HTM) web link (justinblank.com)
 (TXT) w3m dump (justinblank.com)
        
       | librasteve wrote:
       | looks very interesting - coming from the perl (ie the original
       | PCRE) it's rather a pity that this is focused on the JVM since
       | Java is not the most regex friendly authoring environment (sorry
       | I do not know if other JVM denizens like Kotlin / Clojure have
       | more natural regex syntax)
       | 
       | Larry Walls version 2 of PCRE is doing some interesting
       | innovation which, while continuing the model of a character level
       | syntax where punctuation <[.+*]> and so on are the "verbs" as the
       | regex Sub-Language (one of the Slangs of raku), now provides deep
       | support for unicode (eg. a digit test looks at the unicode
       | properties, not just <[0..9]>), and can be gradually unpacked
       | using <token> and <rule> methods all the way up to a full blown
       | composable class based Grammar ... so you can still write regex
       | one-liners perl style but then evolve that to a small Grammar to
       | reduce line noise and drive up code clarity
       | 
       | Are there any plans to provide your library on the MoarVM?
        
         | hyperpape wrote:
         | Author here. I don't know much about the MoarVM, so no plans.
         | 
         | I have very idly thought about being able to output native code
         | for use in other systems, but realistically, I have enough
         | other ideas to keep my nights and weekends busy for a long
         | time.
        
         | CoastalCoder wrote:
         | > the Slangs of raku
         | 
         | Sounds like the name of some race in a sci-fi story :)
        
         | MrBuddyCasino wrote:
         | Kotlin isn't too bad, you can avoid double-quoting and use
         | .toRegex() on strings:                   val regex =
         | """([\w\s]+) is (\d+) years old""".toRegex()
        
       | neonsunset wrote:
       | C# does this as well. Its OOB Regex engine can compile
       | expressions to automata at runtime, or there's an alternate newer
       | back-end to generate the source code at compile time, which is
       | useful in AOT as you can't JIT the IL there.
       | 
       | Similar to the blog post, this allows it to be one of the
       | fastest: https://github.com/BurntSushi/rebar?tab=readme-ov-
       | file#summa...
        
       | BoingBoomTschak wrote:
       | A related project that might interest some people:
       | https://github.com/telekons/one-more-re-nightmare
       | 
       | And the pretty hard to find blog post about it: https://applied-
       | langua.ge/posts/omrn-compiler.html
        
         | usrusr wrote:
         | Now you had me hoping for an implementation of fitting Needle
         | behind the existing Pattern/Matcher API, and a compiler plugin
         | that substitutes Pattern.compile accordingly wherever the
         | inputs are known at compile time.
        
           | hyperpape wrote:
           | Author here: one thing I didn't mention in the post, but
           | which is mentioned in the issues/readme on GitHub is that
           | while I've been working on this for awhile, it's still a
           | pretty new project in many ways. So what you're describing is
           | interesting, but probably premature. The generated class
           | files can still be very large, and the compiler itself has
           | comments like "TODO: O(n^2)" that make throwing arbitrary
           | input at it a dodgy idea.
           | 
           | Beyond those factors, there's the matter of just proving out
           | that my testing is good enough, and I haven't missed edge
           | cases.
        
       | kamma4434 wrote:
       | By looking at the benchmarks, it is interesting to see how much
       | an advantage have non-backtracking engines versus the default
       | Java one. It is also interesting that most of the regular
       | expressions I happen to write are not backtracking, so it's
       | likely I'm paying a ticket every time for a feature I'm not going
       | to see.
       | 
       | It would be interesting if the default Java class the implements
       | regular expressions would detect whether backtracking is actually
       | used in the regular expression when it is compiled and decide
       | whether the full blown engine is really necessary or a
       | lightweight one would be enough.
        
         | burntsushi wrote:
         | The set of regex engines being compared here is pretty small,
         | and even among backtracking regex engines, Java's is pretty
         | slow. See: https://github.com/BurntSushi/rebar?tab=readme-ov-
         | file#summa...
         | 
         | The backtracking engines ahead of are pcre2/jit, javascript/v8,
         | d/ldc/std-regex (technically a hybrid I believe) and regress.
         | Java's engine is about on par with Python's and Perl's (which
         | are both written in C).
        
           | hyperpape wrote:
           | I'd definitely like to get needle into rebar (or a branch of
           | it) and be able to run a broader set of comparisons.
           | 
           | Also, I've learned a lot from your code. A lot of this
           | project was stubbornly banging my head against the problem,
           | but whenever I wanted to see how another engine did things,
           | your library was the one I looked at.
        
             | burntsushi wrote:
             | Yeah! Please file issues or ask questions if you need help.
             | :-)
             | 
             | Hopefully it's easy enough to follow the `java` example to
             | add your own.
        
       | svilenivanov wrote:
       | FWIW, there is a similar project [1] which compiles a regex to
       | bytecode. The benchmarks [2] are impressive.
       | 
       | [1] https://github.com/humio/jitrex [1]
       | https://github.com/humio/jitrex?tab=readme-ov-file#performan...
        
         | hyperpape wrote:
         | Interesting--I had seen several different regex libraries for
         | the JVM, but not this one.
        
       | exabrial wrote:
       | > Version 0.0.1
       | 
       | Thank you for semver-ing!
        
       ___________________________________________________________________
       (page generated 2024-05-08 23:02 UTC)