[HN Gopher] Re2c
       ___________________________________________________________________
        
       Re2c
        
       Author : todsacerdoti
       Score  : 101 points
       Date   : 2024-02-20 06:31 UTC (2 days ago)
        
 (HTM) web link (re2c.org)
 (TXT) w3m dump (re2c.org)
        
       | 414owen wrote:
       | re2c works like a charm. I'm generally confident that I couldn't
       | write faster scanners by hand if I tried.
        
       | notachatbot123 wrote:
       | > re2c is a free and open-source lexer generator for C/C++, Go
       | and Rust with a focus on generating fast code.
        
       | ginko wrote:
       | I often wonder if the existence of lexer and parser generator
       | tools like this is indicative that common general purpose
       | languages are lacking the expressiveness to implement things like
       | this directly.
       | 
       | I feel like parsing is common enough that this could be a
       | language feature or a core library making use of higher-level
       | language functionality.
        
         | megous wrote:
         | Indeed, it would be great as a language feature.
        
           | emmanueloga_ wrote:
           | See Raku's grammars [1], probably the most direct language
           | support for parsing in a mainstream language (errr... or lang
           | that _wants_ to be mainstream :-).
           | 
           | --
           | 
           | 1: https://docs.raku.org/language/grammars
        
         | dist1ll wrote:
         | I suspect that lexer generators mostly exist because that's
         | what people who want to implement compilers are being taught.
         | But there's nothing difficult about writing a lexer. It can be
         | done in less than 100 LoC, easily half that if your language
         | has good string ergonomics and support for pattern matching.
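          | 
          | For illustration, a rough sketch of the shape such a hand-
          | rolled lexer takes in C (the names and the tiny token set here
          | are invented, just to show the structure):
          | 
          |     /* Tiny hand-written lexer: skip whitespace, then switch on
          |      * the current byte to classify the next token. */
          |     #include <ctype.h>
          |     #include <stdio.h>
          | 
          |     enum tok { T_EOF, T_NUM, T_IDENT, T_PUNCT };
          | 
          |     struct lexer { const char *p; };
          | 
          |     static enum tok next(struct lexer *lx,
          |                          const char **start, const char **end)
          |     {
          |         while (isspace((unsigned char)*lx->p)) lx->p++;
          |         *start = lx->p;
          |         if (*lx->p == '\0') { *end = lx->p; return T_EOF; }
          |         if (isdigit((unsigned char)*lx->p)) {      /* integer */
          |             while (isdigit((unsigned char)*lx->p)) lx->p++;
          |             *end = lx->p; return T_NUM;
          |         }
          |         if (isalpha((unsigned char)*lx->p) || *lx->p == '_') {
          |             while (isalnum((unsigned char)*lx->p) || *lx->p == '_')
          |                 lx->p++;
          |             *end = lx->p; return T_IDENT;       /* identifier */
          |         }
          |         lx->p++; *end = lx->p;       /* one-char punctuation */
          |         return T_PUNCT;
          |     }
          | 
          |     int main(void)
          |     {
          |         struct lexer lx = { "foo + 42" };
          |         const char *s, *e;
          |         for (enum tok t; (t = next(&lx, &s, &e)) != T_EOF; )
          |             printf("%d '%.*s'\n", t, (int)(e - s), s);
          |         return 0;
          |     }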
        
           | remexre wrote:
           | Wouldn't the hand-written one be more work to write and
           | measurably slower than the re2c-generated one?
        
             | dist1ll wrote:
             | It's really not much work, and it removes another source of
             | dependency/build bloat. Also, if you care about speed,
             | hand-rolling your lexer will always be the fastest (and
             | even naive implementations can rival/outcompete
             | generators).
        
               | ceronman wrote:
               | Not only that, but a manually written lexer has much more
               | flexibility. It allows you to easily implement things
               | that lexer generators struggle with. Think of significant
               | indentation, nested comments, interpolation inside string
                | literals. Sometimes you can't do this at all with
                | generators; sometimes you can, but the code becomes a
                | mess. That's the reason why many popular programming
                | language implementations use a hand-written lexer and
                | parser.
        
               | fanf2 wrote:
               | re2c makes it easy to drop into hand-written code where
               | necessary. For example, in C backslash-newline makes a
               | huge mess of everything unless you can handle it
               | underneath the lexer, which works nicely in re2c. In C++,
               | raw string literals are not even context-free, let alone
               | regular, so they _have_ to be parsed with hand-written
               | code. This is straightforward with re2c.
        
           | o11c wrote:
           | The problem with hand-written lexers/parsers (and some
           | generated parsers, notably PEG these days) is that you have
           | no idea if you can actually trust it.
           | 
           | There are _so_ many weird edge cases for lexing and parsing
           | that I would never dare to do it without machine
           | verification.
        
         | whizzter wrote:
          | It's historically a regular expression compiler for C (hence
          | re2c); C (and, until recently, C++) didn't include a regular
          | expression engine.
         | 
          | I used an earlier version of re2c for one project since it was
          | far easier to get started with than lex/flex (which, iirc, also
          | wasn't object-local/thread-safe in the past).
         | 
          | Perl, JavaScript, Java, C#, etc. all include regexp engines
          | these days, so you can quickly build an abstraction that
          | provides roughly what re2c gives.
          | 
          | One huge benefit re2c-style generators still have, however, is
          | that they're precompiled and quite efficient compared to
          | generic regexp libraries if performance matters.
         | 
          | If you want something like re2c with the same performance, the
          | target language would first need macro or precompilation
          | facilities (iirc there are C++ regexp systems now that are
          | compiled at compile time and should have similar performance
          | to re2c).
        
           | fanf2 wrote:
            | C on Unix and POSIX has had regular expressions in the
            | standard library since the 1980s:
            | https://pubs.opengroup.org/onlinepubs/007908799/xsh/regex.h....
        
             | whizzter wrote:
              | Still not the language, though; it would be interesting to
              | know why someone pushed this into POSIX and didn't propose
              | it for the language.
              | 
              | Today Windows is the biggest non-POSIX system remaining
              | afaik (apart from Android and consoles), but back in the
              | 80s and 90s there were MacOS, STOS, AmigaOS, etc., so
              | saying it was in Unix/POSIX has little connection to the
              | language at large.
        
           | ginko wrote:
           | > If you want something like re2c with the same performance
           | the target language would first require macro or precompiled
           | facilities.
           | 
           | Yeah, I think you could build something like that in Zig with
           | comptime.
        
           | electroly wrote:
           | C# has compile-time precompiled regular expressions now, too.
           | The enabling C# feature is source generators.
        
             | neonsunset wrote:
              | They are extremely fast too (way faster than re2):
              | https://github.com/BurntSushi/rebar?tab=readme-ov-file#summa...
        
               | burntsushi wrote:
               | Note that rebar doesn't include C#'s compile time regexes
               | in the benchmark. The only reason it doesn't is because
               | nobody has implemented it yet. In rebar's model, it will
               | require writing a C# program that compiles another C#
               | program. It's pretty annoying. It's the same reason why
               | CTRE isn't in rebar either. There's nothing preventing it
               | other than the work to make it happen.
        
               | neonsunset wrote:
                | This is correct, but both source-generated and runtime-
                | compiled regexes should be within the same ballpark
                | performance-wise (if somewhat favoring the former), so
                | it's worth a mention :)
        
         | whiterknight wrote:
         | Different tasks are best served by a different mode of
         | evaluation, not just a library. This is why the lisp community
         | advocates a many DSL design approach.
        
         | chubot wrote:
         | After using re2c, I actually think it should have been built
         | into C. Why?
         | 
         | - Because Ken Thompson was co-inventor of C, and he ALSO
         | brought regular languages to computing. IIRC he basically wrote
         | the first version of grep, and compilers like re2c use the same
         | automata-based methods as the original grep.
         | 
         | https://www.oilshell.org/archive/Thompson-1968.pdf
         | 
          | - Because C is really bad at string processing. Regular
          | languages would be a huge improvement. Looking through the
          | implementation of most shells should convince you of that.
         | 
         | Thompson also wrote the very first Unix shell, of course!
         | https://www.oilshell.org/blog/2021/08/history-trivia.html
        
       | kunley wrote:
       | Just why is it called re2c for Go and Rust, where the actual tool
       | is re2go or re2rust, but you still use re2c in the syntax?
       | 
        | It's of course not beyond the intellectual capabilities of users,
        | but it just looks kind of silly. It could have been rebranded IMO
        | when the new languages were introduced.
        
       | megous wrote:
       | A beautiful thing. If my C program is working with strings in any
       | capacity larger than some copying around, this is what I use. No
       | reason to muck around with strsep, strpbrk, etc. when you can use
       | re2c and a few gotos to implement pretty much any matching
       | pattern safely. Especially useful for dealing with text
       | protocols. Making hand-written recursive descent parsers is also
       | a breeze with this as a building block.
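        | 
        | For anyone who hasn't seen it, here's roughly what that looks
        | like (an invented, minimal sketch, not code from any of my
        | projects): a re2c block is just a special comment inside an
        | ordinary C function, and re2c expands it into a direct-coded
        | state machine.
        | 
        |     /* Classify a NUL-terminated token.
        |      * Build: re2c lexer.re -o lexer.c */
        |     #include <stdbool.h>
        | 
        |     enum token { TOK_ERROR, TOK_NUMBER, TOK_IDENT };
        | 
        |     static enum token lex(const char *YYCURSOR)
        |     {
        |         const char *YYMARKER;   /* used by generated code */
        |         /*!re2c
        |         re2c:yyfill:enable = 0;
        |         re2c:define:YYCTYPE = char;
        | 
        |         end   = "\x00";
        |         num   = [0-9]+;
        |         ident = [a-zA-Z_][a-zA-Z0-9_]*;
        | 
        |         num end   { return TOK_NUMBER; }
        |         ident end { return TOK_IDENT; }
        |         *         { return TOK_ERROR; }
        |         */
        |     }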
        
       | fanf2 wrote:
       | Some years ago, I heard of several projects using the ragel state
       | machine compiler for parsing network protocols
       | http://www.colm.net/open-source/ragel/. It sounded great, but I
       | found it was too low-level for me: it seemed that to use it
       | effectively, I would have to attain oneness with the state
       | machine, whereas one of the great advantages of regular
       | expressions is that I don't have to think (much) about how they
       | are implemented.
       | 
       | On the other hand, for a small project I wanted to wire up a C
       | library (adns) to a simple parser, but adns lacks bindings to
       | higher-level languages. So I thought, why not go retro and use
       | the classic Unix swiss army knife, lex [*]. And it was actually a
       | lot nicer than I expected, and the results were great.
       | 
       | More recently I used re2c for a C/C++ lexer. Its user interface
       | is fairly similar to lex, but cleaned up for modern tastes - in
       | fact, re2c is easier and more flexible than lex, I think. And
       | like ragel, re2c produces a directly-coded state machine, instead
       | of a table-driven state machine. So the resulting code is
       | (generally speaking) bigger but faster than lex.
       | 
       | I think re2c is a great tool, and its developers are helpful and
       | responsive.
       | 
       | [*] perl was referred to as a unix swiss army chainsaw, because
       | where lex is regexes + C, perl is regexes + scripting, and perl
       | allows you to go from idea to huge mess much faster
        
         | chubot wrote:
          | Hm, as an re2c user, I always thought that Ragel vs. re2c was a
          | push/pull issue (though re2c has a --storable-state flag to
          | generate the push style -- I haven't used it).
         | 
         | But now that I look at the example on the front page:
          | http://www.colm.net/open-source/ragel/
          | 
          |     action dgt { printf("DGT: %c\n", fc); }
          |     action dec { printf("DEC: .\n"); }
          | 
          |     number = ( [0-9]+ $dgt ( '.' @dec [0-9]+ $dgt )?
         | 
         | I don't like how the dgt and dec actions are "inline" in the
         | regex.
         | 
         | This seems to be breaking the key abstraction of regexes, which
         | in my mind is "NON-DETERMINISM for free".
         | 
          | The better mechanism is submatch extraction, which re2c
          | supports, and they even wrote a 2017 paper about it -
          | https://arxiv.org/abs/1907.08837 ; also see https://re2c.org/
         | 
          | (I don't use re2c submatch extraction in
          | https://www.oilshell.org, but I just used it in a brand new
          | program, and it works great. Very easy and natural.)
         | 
         | So I think Ragel is doing something Perl-ish and recasting
         | regular languages as imperative. But that can be asymptotically
         | slower and breaks reasoning by sets, forcing you to reason by
         | stepping through code.
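          | 
          | For anyone curious, here's roughly the shape submatch
          | extraction takes (an invented, minimal sketch; run re2c with
          | --tags): tags like @t1 mark positions in the matched input,
          | and the generated code fills in the corresponding pointer
          | variables.
          | 
          |     /* Parse "<major>.<minor>" from a NUL-terminated string.
          |      * Build: re2c --tags ver.re -o ver.c */
          |     #include <stdbool.h>
          |     #include <stdlib.h>
          | 
          |     static bool parse_version(const char *YYCURSOR,
          |                               int *major, int *minor)
          |     {
          |         const char *YYMARKER;
          |         const char *t1, *t2;   /* set by the generated code */
          |         /*!re2c
          |         re2c:yyfill:enable = 0;
          |         re2c:define:YYCTYPE = char;
          | 
          |         num = [0-9]+;
          | 
          |         @t1 num "." @t2 num [\x00] {
          |             *major = atoi(t1);  /* first submatch starts at t1 */
          |             *minor = atoi(t2);  /* second submatch starts at t2 */
          |             return true;
          |         }
          |         * { return false; }
          |         */
          |     }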
        
           | ulrikrasmussen wrote:
           | Check out Kleenex which also allows you to inline actions in
           | the (regular) grammar, but which promises to only execute
           | them when it has resolved all non-determinism for the input
           | seen so far: https://kleenexlang.org/
           | 
           | The compiler translates regular grammars with embedded
           | semantic actions to fast C code.
           | 
           | Full disclosure: I am one of the authors.
        
       | tsujamin wrote:
        | Ran across some .re2c-generated functions in some software I was
        | reverse engineering recently. It probably speaks more to the
        | compiler/source used, but the generated code was incredibly
        | verbose (400-odd nested error-handling branches).
       | 
       | On the upside, it generated functions too long to decompile, so
       | that's a handy feature I guess?
        
         | UncleEntity wrote:
          | I use it for a Scheme lexer with full(?) Unicode support --
          | just for science I turned off the 'case-ranges' flag and it
          | produces a lex function that is > 27,000 loc.
        
       | jakearmitage wrote:
       | It sucks that re2c simply can't parse indentation-based formats
       | like YAML or Python. Had to resort to nom to be able to do it.
        
         | chubot wrote:
         | This is sort of a category error... re2c is a lexer generator,
         | and YAML and Python are recursive/nested formats.
         | 
         | You can definitely use re2c to lex them, including indentation,
         | but it's not the whole solution. You need a parser too.
         | 
         | I use it for everything possible in https://www.oilshell.org,
         | and it's amazing. It really reduces the amount of fiddly C code
         | you need to parse languages, and it drops in anywhere.
         | 
         | Parser generators usually have some downside, but there's not
         | much downside to the lexer generators IMO. It just saves you
         | work. And the regex syntax is better than Perl/Python because
         | literals are consistently quoted.
        
           | chubot wrote:
           | Note: rather than "lexer generator", "regular language to
           | state machine compiler" is a better description.
           | 
           | Lexers can use re2c, but it's not even the whole story. This
           | is good because it means it's "policy free" and you can use
           | it anywhere.
        
         | ww520 wrote:
          | Re2c only performs the tokenizing part, not the parsing part.
          | Re2c should be able to run a regex to recognize N spaces and
          | produce an N-space token. It will be up to the parser to use
          | that to get the indentation of a statement.
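          | 
          | Roughly like this (an invented sketch, not a complete lexer):
          | one rule eats the run of leading spaces and reports its width,
          | and the parser compares widths to emit INDENT/DEDENT.
          | 
          |     #include <stddef.h>
          | 
          |     /* Count leading spaces at the start of a line. */
          |     static size_t count_indent(const char *line)
          |     {
          |         const char *YYCURSOR = line;
          |         const char *YYMARKER;  /* may be unused for these rules */
          |         /*!re2c
          |         re2c:yyfill:enable = 0;
          |         re2c:define:YYCTYPE = char;
          | 
          |         [ ]+ { return (size_t)(YYCURSOR - line); }
          |         *    { return 0; }
          |         */
          |     }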
        
       | mattgreenrocks wrote:
       | One fun project in this vein is to DIY something similar to this.
       | To simplify things initially, you can use NFAs, along with an
       | existing library to parse the regex syntax yourself.
       | 
       | The aha moment comes when you see how regex syntax compiles down
       | to various configurations of automata. Couple that with the fact
       | that automata are made to be composed together well, and the
       | result is beautiful in a way that you rarely see in production
       | code.
       | 
       | Here's my stab at it in Rust:
       | https://github.com/mattgreen/lexer/tree/master/src/nfa
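        | 
        | To make the composition idea concrete, here's a toy sketch in C
        | (a hand-assembled Thompson-style NFA for "a(b|c)*", invented for
        | illustration): alternation and the star both turn into epsilon
        | edges, and matching is just a set-of-states simulation.
        | 
        |     #include <stdbool.h>
        |     #include <stdio.h>
        |     #include <string.h>
        | 
        |     #define NSTATES 6
        |     #define EPS 0                  /* marker for epsilon edges */
        | 
        |     typedef struct { int to; char on; } Edge;
        |     typedef struct { Edge edges[2]; int nedges; } State;
        | 
        |     static State nfa[NSTATES];
        |     static int accept_state;
        | 
        |     static void edge(int s, int t, char c) {
        |         nfa[s].edges[nfa[s].nedges++] = (Edge){t, c};
        |     }
        | 
        |     /* epsilon-closure: follow EPS edges to a fixed point */
        |     static void closure(bool *set) {
        |         for (bool changed = true; changed; ) {
        |             changed = false;
        |             for (int s = 0; s < NSTATES; s++) {
        |                 if (!set[s]) continue;
        |                 for (int i = 0; i < nfa[s].nedges; i++) {
        |                     Edge e = nfa[s].edges[i];
        |                     if (e.on == EPS && !set[e.to]) {
        |                         set[e.to] = true;
        |                         changed = true;
        |                     }
        |                 }
        |             }
        |         }
        |     }
        | 
        |     static bool match(const char *input) {
        |         bool cur[NSTATES] = { [0] = true };
        |         closure(cur);
        |         for (; *input; input++) {
        |             bool next[NSTATES] = { false };
        |             for (int s = 0; s < NSTATES; s++) {
        |                 if (!cur[s]) continue;
        |                 for (int i = 0; i < nfa[s].nedges; i++)
        |                     if (nfa[s].edges[i].on == *input)
        |                         next[nfa[s].edges[i].to] = true;
        |             }
        |             closure(next);
        |             memcpy(cur, next, sizeof cur);
        |         }
        |         return cur[accept_state];
        |     }
        | 
        |     int main(void) {
        |         edge(0, 1, 'a');                  /* 0 --a--> 1        */
        |         edge(1, 2, EPS); edge(2, 3, 'b'); /* branch for 'b'    */
        |         edge(1, 4, EPS); edge(4, 5, 'c'); /* branch for 'c'    */
        |         edge(3, 1, EPS); edge(5, 1, EPS); /* loop back = star  */
        |         accept_state = 1;
        | 
        |         printf("%d %d %d\n", match("abcb"), match("a"), match("b"));
        |         return 0;                         /* prints: 1 1 0     */
        |     }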
        
       | junon wrote:
       | This is what the Ninja build system uses, by the way. Incredibly
       | fast.
        
       | sram1337 wrote:
       | I like to think this is pronounced "reducey"
        
       ___________________________________________________________________
       (page generated 2024-02-22 23:01 UTC)