[HN Gopher] Re2c
___________________________________________________________________
Re2c
Author : todsacerdoti
Score : 101 points
Date : 2024-02-20 06:31 UTC (2 days ago)
(HTM) web link (re2c.org)
(TXT) w3m dump (re2c.org)
| 414owen wrote:
| re2c works like a charm. I'm generally confident that I couldn't
| write faster scanners by hand if I tried.
| notachatbot123 wrote:
| > re2c is a free and open-source lexer generator for C/C++, Go
| and Rust with a focus on generating fast code.
| ginko wrote:
| I often wonder if the existence of lexer and parser generator
| tools like this is indicative that common general purpose
| languages are lacking the expressiveness to implement things like
| this directly.
|
| I feel like parsing is common enough that this could be a
| language feature or a core library making use of higher-level
| language functionality.
| megous wrote:
| Indeed, it would be great as a language feature.
| emmanueloga_ wrote:
| See Raku's grammars [1], probably the most direct language
| support for parsing in a mainstream language (errr... or lang
| that _wants_ to be mainstream :-).
|
| --
|
| 1: https://docs.raku.org/language/grammars
| dist1ll wrote:
| I suspect that lexer generators mostly exist because that's
| what people who want to implement compilers are being taught.
| But there's nothing difficult about writing a lexer. It can be
| done in less than 100 LoC, easily half that if your language
| has good string ergonomics and support for pattern matching.
| remexre wrote:
| Wouldn't the hand-written one be more work to write and
| measurably slower than the re2c-generated one?
| dist1ll wrote:
| It's really not much work, and it removes another source of
| dependency/build bloat. Also, if you care about speed,
| hand-rolling your lexer will always be the fastest (and
| even naive implementations can rival/outcompete
| generators).
| ceronman wrote:
| Not only that, but a manually written lexer has much more
| flexibility. It allows you to easily implement things
| that lexer generators struggle with. Think of significant
| indentation, nested comments, interpolation inside string
| literals. Sometimes you can't do this at all with
| generators, sometimes you can, but code becomes a mess.
| That's the reason why many popular programming language
| implementations use a hand written lexer and parser.
| fanf2 wrote:
| re2c makes it easy to drop into hand-written code where
| necessary. For example, in C backslash-newline makes a
| huge mess of everything unless you can handle it
| underneath the lexer, which works nicely in re2c. In C++,
| raw string literals are not even context-free, let alone
| regular, so they _have_ to be parsed with hand-written
| code. This is straightforward with re2c.
| o11c wrote:
| The problem with hand-written lexers/parsers (and some
| generated parsers, notably PEG these days) is that you have
| no idea if you can actually trust it.
|
| There are _so_ many weird edge cases for lexing and parsing
| that I would never dare to do it without machine
| verification.
| whizzter wrote:
| It's historically a regular expression compiler for C (hence
| re2c), C (and C++ until recently) didn't include a Regular
| Expression engine.
|
| I used an earlier version of re2c for one project since it was
| a far easier to get started with compared to lex/flex (that
| iirc also wasn't object local/thread safe in the past).
|
| Perl, Javascript, Java, C#,etc all include regexp engines these
| days so you can quickly build an abstraction that provides
| roughly what re2c gives.
|
| One huge benefit of re2c style generators still have however is
| that they're precompiled and quite efficient compared to
| generic regexp libraries if performance matters.
|
| If you want something like re2c with the same performance the
| target language would first require macro or precompiled
| facilities (iirc there are C++ regexp systems now that are
| compile-time compiled and should have similar performance to
| re2c).
| fanf2 wrote:
| C on Unix and POSIX has had regular expressions in the
| standard library since the 1980s https://pubs.opengroup.org/o
| nlinepubs/007908799/xsh/regex.h....
| whizzter wrote:
| Still not the language, would be interesting to know why
| someone pushed this to POSIX and didn't propose it for the
| language.
|
| Today Windows is the biggest non-POSIX system remaning
| afaik (apart from Android and consoles) but back in the 80s
| and 90s there was MacOS, STOS, AmigaOS,etc so saying it was
| in Unix/POSIX has little connection to the language at
| large.
| ginko wrote:
| > If you want something like re2c with the same performance
| the target language would first require macro or precompiled
| facilities.
|
| Yeah, I think you could build something like that in Zig with
| comptime.
| electroly wrote:
| C# has compile-time precompiled regular expressions now, too.
| The enabling C# feature is source generators.
| neonsunset wrote:
| They are extremely fast too (way faster than re2):
| https://github.com/BurntSushi/rebar?tab=readme-ov-
| file#summa...
| burntsushi wrote:
| Note that rebar doesn't include C#'s compile time regexes
| in the benchmark. The only reason it doesn't is because
| nobody has implemented it yet. In rebar's model, it will
| require writing a C# program that compiles another C#
| program. It's pretty annoying. It's the same reason why
| CTRE isn't in rebar either. There's nothing preventing it
| other than the work to make it happen.
| neonsunset wrote:
| This is correct but both source-generated and runtime-
| compiled regexes should be within the same ballpark
| performance-wise (if somewhat favoring the former)
| therefore it's worth a mention :)
| whiterknight wrote:
| Different tasks are best served by a different mode of
| evaluation, not just a library. This is why the lisp community
| advocates a many DSL design approach.
| chubot wrote:
| After using re2c, I actually think it should have been built
| into C. Why?
|
| - Because Ken Thompson was co-inventor of C, and he ALSO
| brought regular languages to computing. IIRC he basically wrote
| the first version of grep, and compilers like re2c use the same
| automata-based methods as the original grep.
|
| https://www.oilshell.org/archive/Thompson-1968.pdf
|
| - Because C is really bad at string processing. Regular
| languages would be a huge improvement. Looking through the
| implementation of most shells should convince of you that.
|
| Thompson also wrote the very first Unix shell, of course!
| https://www.oilshell.org/blog/2021/08/history-trivia.html
| kunley wrote:
| Just why is it called re2c for Go and Rust, where the actual tool
| is re2go or re2rust, but you still use re2c in the syntax?
|
| It's of course not beyond the intelectual capabities of users,
| but just looks kind of silly. Could have been rebranded IMO when
| new languages were introduced
| megous wrote:
| A beautiful thing. If my C program is working with strings in any
| capacity larger than some copying around, this is what I use. No
| reason to muck around with strsep, strpbrk, etc. when you can use
| re2c and a few gotos to implement pretty much any matching
| pattern safely. Especially useful for dealing with text
| protocols. Making hand-written recursive descent parsers is also
| a breeze with this as a building block.
| fanf2 wrote:
| Some years ago, I heard of several projects using the ragel state
| machine compiler for parsing network protocols
| http://www.colm.net/open-source/ragel/. It sounded great, but I
| found it was too low-level for me: it seemed that to use it
| effectively, I would have to attain oneness with the state
| machine, whereas one of the great advantages of regular
| expressions is that I don't have to think (much) about how they
| are implemented.
|
| On the other hand, for a small project I wanted to wire up a C
| library (adns) to a simple parser, but adns lacks bindings to
| higher-level languages. So I thought, why not go retro and use
| the classic Unix swiss army knife, lex [*]. And it was actually a
| lot nicer than I expected, and the results were great.
|
| More recently I used re2c for a C/C++ lexer. Its user interface
| is fairly similar to lex, but cleaned up for modern tastes - in
| fact, re2c is easier and more flexible than lex, I think. And
| like ragel, re2c produces a directly-coded state machine, instead
| of a table-driven state machine. So the resulting code is
| (generally speaking) bigger but faster than lex.
|
| I think re2c is a great tool, and its developers are helpful and
| responsive.
|
| [*] perl was referred to as a unix swiss army chainsaw, because
| where lex is regexes + C, perl is regexes + scripting, and perl
| allows you to go from idea to huge mess much faster
| chubot wrote:
| Hm as an re2c user, I always thought that Ragel vs. re2c was a
| push/pull issue (though re2c flag has a --storable-state flag
| to generate the push style -- I haven't used it)
|
| But now that I look at the example on the front page:
| http://www.colm.net/open-source/ragel/ action
| dgt { printf("DGT: %c\n", fc); } action dec {
| printf("DEC: .\n"); } number = (
| [0-9]+ $dgt ( '.' @dec [0-9]+ $dgt )?
|
| I don't like how the dgt and dec actions are "inline" in the
| regex.
|
| This seems to be breaking the key abstraction of regexes, which
| in my mind is "NON-DETERMINISM for free".
|
| The better mechanism is submatch extraction, which re2c
| supports, and they even wrote a 2017 paper about -
| https://arxiv.org/abs/1907.08837 , also see https://re2c.org/
|
| (I don't use re2c submatch extraction in
| https://www.oilshell.org, but I just used in a brand new
| program, and it works great. Very easy and natural.)
|
| So I think Ragel is doing something Perl-ish and recasting
| regular languages as imperative. But that can be asymptotically
| slower and breaks reasoning by sets, forcing you to reason by
| stepping through code.
| ulrikrasmussen wrote:
| Check out Kleenex which also allows you to inline actions in
| the (regular) grammar, but which promises to only execute
| them when it has resolved all non-determinism for the input
| seen so far: https://kleenexlang.org/
|
| The compiler translates regular grammars with embedded
| semantic actions to fast C code.
|
| Full disclosure: I am one of the authors.
| tsujamin wrote:
| Ran across some .re2c generated functions in some software I was
| reverse engineering recently. It probably speaks more to the
| compiler/source used but the generated code was incredibly
| verbose (400 odd nested error handling branches).
|
| On the upside, it generated functions too long to decompile, so
| that's a handy feature I guess?
| UncleEntity wrote:
| I use it for a scheme lexer with full(?) unicode support --
| just for science I turned off the 'case-ranges' flag and it
| produces a lex function that is > 27,000 loc.
| jakearmitage wrote:
| It sucks that re2c simply can't parse indentation-based formats
| like YAML or Python. Had to resort to nom to be able to do it.
| chubot wrote:
| This is sort of a category error... re2c is a lexer generator,
| and YAML and Python are recursive/nested formats.
|
| You can definitely use re2c to lex them, including indentation,
| but it's not the whole solution. You need a parser too.
|
| I use it for everything possible in https://www.oilshell.org,
| and it's amazing. It really reduces the amount of fiddly C code
| you need to parse languages, and it drops in anywhere.
|
| Parser generators usually have some downside, but there's not
| much downside to the lexer generators IMO. It just saves you
| work. And the regex syntax is better than Perl/Python because
| literals are consistently quoted.
| chubot wrote:
| Note: rather than "lexer generator", "regular language to
| state machine compiler" is a better description.
|
| Lexers can use re2c, but it's not even the whole story. This
| is good because it means it's "policy free" and you can use
| it anywhere.
| ww520 wrote:
| Re2c only performs the tokenizing part, not the parsing part.
| Re2c should be able to run a regex to recognize N spaces and
| produces a n-space token. It will be up to the parser to use
| that to get the indentation of a statement.
| mattgreenrocks wrote:
| One fun project in this vein is to DIY something similar to this.
| To simplify things initially, you can use NFAs, along with an
| existing library to parse the regex syntax yourself.
|
| The aha moment comes when you see how regex syntax compiles down
| to various configurations of automata. Couple that with the fact
| that automata are made to be composed together well, and the
| result is beautiful in a way that you rarely see in production
| code.
|
| Here's my stab at it in Rust:
| https://github.com/mattgreen/lexer/tree/master/src/nfa
| junon wrote:
| This is what the Ninja build system uses, by the way. Incredibly
| fast.
| sram1337 wrote:
| I like to think this is pronounced "reducey"
___________________________________________________________________
(page generated 2024-02-22 23:01 UTC)