[HN Gopher] State machines are wonderful tools
___________________________________________________________________
State machines are wonderful tools
Author : chmaynard
Score : 266 points
Date : 2021-01-01 08:50 UTC (14 hours ago)
(HTM) web link (nullprogram.com)
(TXT) w3m dump (nullprogram.com)
| vesinisa wrote:
| Cool trickery, but should I ever encounter one of those hex
| tables in a program I am maintaining, I would certainly hope to
| see at least the state graph of the encoded machine as
| documentation.
| pupdogg wrote:
| I don't mind being reminded every 2-3 years how great State
| Machines are. I was first introduced to the concept back in 2002
| and my mind was blown. At that time, I was in my 20s...now I'm
| pushing past 40 and I'm starting to think my life is a state
| stuck in an event loop.
| allenu wrote:
| State machines are incredibly useful and I use them all the time
| when I write UI code.
|
 | There are a couple of benefits that come to mind with respect to
 | writing UI. First, it forces you to think through all the
 | potential "inputs" to your system, be they button presses from
 | users or network request responses. You have to be explicit about
 | what could have an effect on the system. Secondly, it makes your
 | business logic easier to reason about because you essentially
 | turn your UI code into a bunch of functional statements. All your
 | transitions are just pure functions whose inputs are the
 | previous state and an event, and whose output is the new state.
|
 | A lot of UI frameworks do it this way today, of course, but I
 | picked it up from reading about how Elm works. Whenever I write
 | new UI from scratch, I always end up falling back on the state
 | machine for guidance.
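A minimal sketch of that pure-transition pattern in Python; the state and event names here are hypothetical, not taken from any real framework:

```python
# Pure transition function: (previous state, event) -> new state.
# States and events are illustrative placeholders.

def transition(state, event):
    if state == "idle" and event == "button_pressed":
        return "loading"
    if state == "loading" and event == "response_ok":
        return "showing_data"
    if state == "loading" and event == "response_error":
        return "showing_error"
    return state  # unhandled inputs leave the state unchanged

# Fold a sequence of events through the transition function.
state = "idle"
for event in ["button_pressed", "response_ok"]:
    state = transition(state, event)
# state is now "showing_data"
```

Because the function is pure, each transition is trivial to unit-test in isolation.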
| grandinj wrote:
| FSM are fun to write but horrible to modify especially if you're
| not the one who wrote it.
| rualca wrote:
| > FSM are fun to write but horrible to modify especially if
| you're not the one who wrote it.
|
| It really depends on the tooling you're using. For example,
| Qt's support for SCXML[1] eliminates much of the pain of using
| state machines.
|
| [1] https://doc.qt.io/qt-5/qtscxml-overview.html
| xondono wrote:
 | The general issue is that FSMs require a lot of naming, since
 | you want to give every state a meaningful name, and naming is
 | hard.
|
| When the states are clear enough, modifying FSMs is very nice.
| teacpde wrote:
 | That isn't necessarily a state machine problem; a state is an
 | abstraction that maps to part of your business semantics.
 | Naming becomes hard when your state abstraction isn't clear,
 | which in turn is probably due to how the business semantics are
 | modeled. One of the most important things in implementing state
 | machines is to draw them up and make sure every state and
 | transition has a clear meaning before actually implementing
 | them. There will usually be multiple iterations of modeling
 | before arriving at a good design.
| threatofrain wrote:
| But for many this is just React with Redux, and the community
| buzz around that has been that teamwork is made easier. You'd
| have a state machine if the problem called for it anyway, but
| with redux everything is now explicit and teammates come
| prepared.
| amw-zero wrote:
| This is only true when considering a specific software design
| that implements the state machine. I'd say the most popular way
| to define a state machine in code is how Xstate does it:
| https://github.com/davidkpiano/xstate. You define your states,
| and then you define your inputs, and then you map out every
| single transition between states.
|
| This is probably the most inefficient way to design a state
| machine. There's a simple mathematical explanation for that. If
| we agree that a state diagram is a directed graph consisting of
| N nodes, then there are N^2 possible transitions (that's the
| definition of a completely connected graph). This means there's
| a large space of possible transitions relative to the number of
| states, which means there's a high probability that you will
| have to modify lots of transitions when modifying the state
| machine itself.
|
 | In this article, the author shares several alternative
 | designs for state machines, which don't appear to suffer
 | from exactly the same problem. I'm most familiar with TLA+,
| where you describe a state machine simply with one logical
| predicate: the "Next" state relation. This is the most powerful
| way I've found to define state machines, and modifying one
| function to entirely change the behavior of the state machine
| is very easy in practice.
| teacpde wrote:
 | Mind elaborating on the next-state predicate approach? By next,
 | I imagine the focus is on transitions rather than states;
 | however, if you have multiple transitions arriving at the
 | same state, what makes the next-state approach better? At the
 | end of the day, the state machine is the same regardless of
 | how it is implemented.
| nitrogen wrote:
 | _Edit:_ looks like they're talking about this:
| https://en.wikipedia.org/wiki/TLA%2B
|
| What I describe below is not related.
|
| ----
|
| Not the OP, but here's a stab at it...
|
| In state machines commonly used in embedded C code, each
| state is represented either by a function or by a struct
| that contains a function pointer. The state machine calls
| the current state function in a loop or whenever relevant
| events happen, and that function returns a pointer to the
| next state, calculated based on inputs. If there is no
| state transition, then the function returns NULL or returns
| its own state.
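That C idiom translates directly to Python, where functions are first-class values, so a plain function reference stands in for the function pointer. The toggle machine below is a hypothetical example, not code from any real firmware:

```python
# Each state is a function that inspects an event and returns the
# next state function, or None when there is no transition
# (mirroring the NULL return in the C idiom described above).

def off(event):
    return on if event == "press" else None

def on(event):
    return off if event == "press" else None

# Driver loop: call the current state function for each event.
state = off
for event in ["press", "noise", "press", "press"]:
    next_state = state(event)
    if next_state is not None:
        state = next_state
# Three presses toggle off -> on -> off -> on.
```

In C, the same shape would use a `struct` of function pointers or functions returning a `state_fn` typedef; the driver loop is identical.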
| jayd16 wrote:
 | Seems like you could just add some sugar to the definition
 | process, e.g. an "any state" wildcard or more than one source
 | state per transition.
| JadeNB wrote:
| > There's a simple mathematical explanation for that. If we
| agree that a state diagram is a directed graph consisting of
| N nodes, then there are N^2 possible transitions (that's the
| definition of a completely connected graph).
|
| A state diagram is usually richer than a directed graph--
| usually the edges are _labelled_. This multiplies the number
| of possible transitions by the number of possible labels.
|
| (Also, for anyone interested in comparing with the graph-
| theoretic terminology, the term "complete graph"
| (https://en.wikipedia.org/wiki/Complete_graph) is often used
| for undirected graphs with no self-loops, in which case there
| are only \binom n 2 possible edges.)
| lmilcin wrote:
 | The problem with modifying an FSM arises if you start by
 | modifying the code. That is the wrong approach.
 |
 | The FSM is not code but a model, and the code is just an
 | implementation of the model.
 |
 | What you want to do is recreate the process of modeling (or
 | modify the existing model) and then regenerate the code from
 | that.
|
 | What I do is keep an implementation-agnostic model of the
 | FSM that gets translated to code. That code is never modified;
 | it is only called from external modules, or calls external
 | modules.
|
| There are some cases where I would just hardcode the FSM
| without the model. Those cases are when I do not expect the
| state machine to change because it follows from the fundamental
| description of the problem. For example, I might have some
| communication protocol and the state machine describes the
| state of that communication channel. If I do my job well the
| state machine maps 1:1 to the problem and will never need to
| change even as the rest of the application changes.
| teacpde wrote:
| In your example, what happens if changes need to be made to
| the protocol? It is less frequent than business/application
| changes, but definitely can happen.
| lmilcin wrote:
 | Good protocols are usually built in layers. You typically
 | want a state machine describing the lower part of the
 | application protocol, on which you build the higher level of
 | the protocol, composed of specific functions.
 |
 | This way you can modify functionality without necessarily
 | touching the low-level implementation.
|
| I can give an example.
|
| I have worked with credit card terminals and I have
| implemented entire application from scratch.
|
| There is a protocol to communicate between the terminal and
| the acquirer's system (ISO 8583).
|
 | This protocol specifies a flow and format of messages. For
 | example, if you send a transaction request, you will get a
 | response. If you don't get a response you may or may not
 | retry it, etc. If you send a transaction advice (i.e. the
 | transaction has already happened) you must keep retrying it
 | until you get a response, and every retransmission after the
 | first must include a bit that says it is a retransmission. If
 | a request or advice has been sent and the transaction was
 | cancelled, you need to keep sending cancellations until you
 | get an acknowledgement.
|
| Now, all those messages have a huge amount of fields that
| change regularly as requirements change, but the basic flow
| of the protocol is fairly constant and has been constant
| for many years.
|
| I have implemented the protocol as a tree of state machines
| telling the application what to do based on events and I
| believe the code has not been touched in over 10 years even
| though the application was in constant development.
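A heavily simplified sketch of that advice-retry flow as a transition function. The states and events are illustrative only, loosely inspired by the description above, and are not the actual ISO 8583 message set:

```python
# Hypothetical advice-retry state machine: after the first timeout,
# every further send is a flagged retransmission; the machine keeps
# resending until a response arrives.

def next_state(state, event):
    if state == "advice_pending" and event == "send":
        return "awaiting_response"
    if state == "awaiting_response" and event == "timeout":
        return "advice_pending_retransmit"  # flag the retransmission bit
    if state == "advice_pending_retransmit" and event == "send":
        return "awaiting_response"
    if state == "awaiting_response" and event == "response":
        return "done"
    return state  # ignore irrelevant events

s = "advice_pending"
for e in ["send", "timeout", "send", "response"]:
    s = next_state(s, e)
# s is now "done"
```

The point of the layering argument is that this flow stays fixed while the message fields (which live a layer above) change freely.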
| jarirajari wrote:
| FSMs should be "written" visually, and they should have either
| hierarchical sub-states or some sort of a grouping scheme to
| keep complexity at a reasonable level.
|
| FSM examples are usually too simplified, and they don't show
| how to manage the complexities of a real, more complex state
| machine.
| blunte wrote:
| It also depends on how you implement the state machine. If
| you're constrained by memory or performance (or playing golf),
| then you may end up with code like TFA.
|
 | But if you're implementing it in a typical app, you can obviously
 | organize and write it in a way that is very understandable.
|
| I find the title of TFA misleading in its generality in
| comparison to the content of the article.
| masterofmisc wrote:
 | Yeah, I find that using a library (like Stateless for C#) that
 | makes you spell out, up front, each of the states and the
 | transitions permitted from each state makes it easier to reason
 | about the code.
| ci5er wrote:
 | A long time ago I used to auto-generate HTTP and SMTP servers and
 | message queues (similar to RabbitMQ) from very compact
 | descriptions. My in-house infrastructure build-out productivity
 | went up ~30x, but all the downstream teams hated it.
| blargmaster42_8 wrote:
| And this is news? Duuu
| [deleted]
| strenholme wrote:
 | I ended up using one in MaraDNS 2.0 to parse configuration files,
 | since the C code for parsing MaraDNS 1.0 configuration files was
 | a lot more error-prone and hard to maintain.
|
| The state machine's grammar is embedded in the Deadwood (MaraDNS
| 2 recursor) binary.
|
| Here is a full explanation of the finite state machine, along
| with the actual embedded code:
|
| https://github.com/samboy/MaraDNS/blob/master/deadwood-githu...
| carapace wrote:
 | FWIW, here's an example I wrote in Python that "imitates what
 | 'compiled' FSM code might look like in an 'unrolled' form. Most
 | FSM code uses a little driver loop and a table datastructure; the
 | code below instead acts like JMP instructions ('jump', or GOTO in
 | higher-level-but-still-low-level languages) to hard-code the
 | information in the table into a little patch of branches."
|
| https://joypy.osdn.io/notebooks/Derivatives_of_Regular_Expre...
|
 | It's part of an (incomplete, slightly broken) notebook on
 | Brzozowski's derivatives of regular expressions.
| sovietmudkipz wrote:
 | It's humbling to see a state machine implemented using
 | coroutines. I've been programming professionally for about 6
 | years but have only recently come to appreciate generators and
 | coroutines.
|
| I'm working through learn.unity.com for fun and because, in my
| estimation, there is an upcoming boom in real time interfaces for
| real work. All the people playing video games have mastered
| skills that aren't being taken full advantage of. Anyways,
| coroutines in Unity can be used to do interesting things over
| time while attached to the game loop.
|
| My point, despite my digression, is just how there are still
| basic/intermediate topics I have only been able to appreciate
| recently. Humbling.
| jayd16 wrote:
 | You should also try to grok Unity's animation controller. It's
 | a state machine with non-discrete transitions (primarily for
 | animation blending). Neat stuff.
| sali0 wrote:
| 100% agreed on the increasing importance of video game
| interfaces. Nice to see I am not alone in that thought.
| sovietmudkipz wrote:
 | Every now and again as I play RTS (and other) games I imagine
 | how the interface could be used for business, engineering,
 | or marketing. There's potential, IMO.
|
| The test (as I imagine it now) is making work fun enough that
| my 13 year old self would want to do work.
|
| At any rate, ideas are cheap. Prototypes and implementation
| are what matters.
|
 | We're speaking publicly, but what do you imagine when you
 | think about this topic?
| sixdimensional wrote:
| No post on state machines would be complete without mention of
| the superset concept of Petri nets [1]. I have used Petri nets to
| build my own workflow engine previously. Building a workflow
| engine is one of those things... nearly every enterprise app has
| one, and normally you shouldn't need to build one if you can
| source one... but it is an exercise that makes you a stronger
| engineer/developer too, deepening understanding of the patterns
| that lie beneath most systems.
|
 | Back when I was building it, openly available workflow engines,
 | BPML, BPEL, etc. were not really a thing yet, so it made a little
 | more sense to build one.
|
| [1] https://en.wikipedia.org/wiki/Petri_net
| mikewarot wrote:
 | This is my first encounter with Petri nets, and at first I
 | thought they were just like state machines, in that you're only
 | concerned with flow of control... boy was I wrong. It's going to
 | take a while to fully grok, but it seems impressively powerful
 | in what it can do. Thanks!
| nudpiedo wrote:
 | Do you have more information on how to build workflow systems,
 | and why it is better to do so with Petri nets rather than
 | abstracting the processes with classes or whatever else?
 |
 | Any information is welcome.
| dqpb wrote:
| Every program is fundamentally states and state transitions. You
| can be explicit about it or not, but the states and transitions
| are there regardless.
| amw-zero wrote:
| Absolutely. This is the definition of what computation is. Even
| in the lambda calculus, where everything is defined with pure
| functions, we build up state sequentially over time, based on
| previous values of the state. The fact that there's no mutation
| under the hood is inconsequential.
| ColinWright wrote:
| I'm reminded of a thing I wrote about seeing a state machine
| implemented in real life:
|
| https://www.solipsys.co.uk/new/StateMachineInRealLife.html?U...
| skrebbel wrote:
| That was a delightful little read, thanks!
| pardner wrote:
| Great little story!
| [deleted]
| zevv wrote:
 | Encoding parser FSMs by hand can be tedious and error-prone;
 | there are some nice tools to help you with that. I'm a huge fan
 | of Ragel: http://www.colm.net/open-source/ragel/
| 082349872349872 wrote:
| I sometimes wonder if replacing the hand-coded FSMs of UI
| ravioli callbacks with a more parser-generator-like declaration
| would be a cleaner approach.
|
| (or maybe it's already been done? I haven't done UI code since
| last century...)
| hyperpallium2 wrote:
| Now you equivalently have two problems.
| amw-zero wrote:
| These are some pretty cool examples, though a little too nitty
| gritty to communicate the practical importance of state machines.
| The truth is, state machines are so fundamental that programmers
| are not even aware that they are using them. Many, if not most,
| computer programs can be seen as state machines. In fact,
| programming can quite literally be seen as state machine creation
| and management. Programming languages are tools for creating
| complex state machines without having to manually wire them
| together.
|
 | For example, take the following factorial function in Python:
 |
 |     def factorial(n):
 |         result = 1
 |         while n > 1:
 |             result *= n
 |             n -= 1
 |         return result
|
| Here is a TLA+ spec for the same algorithm, along with the
| corresponding state diagram for calculating factorial(5):
| https://gist.github.com/amw-zero/c604f4dfd2d98e0d77fffe602c6....
| This is a simple example, but the state diagram illustrates an
| important point that the while loop doesn't: the algorithm is
| completely linear, meaning it simply builds upon previous states.
| Even though there is branching in the code, there is no branching
| in the state diagram.
|
| Of course, we're not so lucky and most state spaces are not
| linear. Here's the spec and diagram for a user interface that I
| was building a while back: https://gist.github.com/amw-
 | zero/4dc16cbf8e578d9be9e68b0a85c.... There was a subtle state-
 | based bug where a user had to perform a special sequence of
 | events to trigger it. The code seemed simple, but the state
 | diagram immediately showed what the bug was.
|
| This was in a React app, and doing this exercise made me realize
| that React apps trivially translate to state machines as well.
| Whenever you call `setState` or the corresponding `setHookValue`
| function when using hooks, you are defining a state transition.
| The set of all state variables determines the full state space of
| the application, and bugs arise when we forget to consider that
| fact within some of our transition functions.
|
| Long story short, you don't need to use a state machine design
| pattern such as Xstate to actually have a state machine. In fact,
| unless you are doing some very weird things, your application
| most likely already is a state machine under it all, but that's
| only because a state machine is a very simple, general concept
| that has tons of applications.
| arcbyte wrote:
 | Yup. Basically all enterprise software is just state machine
 | design. Any simple RESTful app with a database is just a state
 | machine. Call one endpoint to create a resource, fetch it
 | later from a second. Calling them out of order yields an error
 | state. Super simple state machine. Of course we've reinvented
 | this wheel so many times.
|
 | Something like LISP is fun because you can get both the
 | executable and (human- and machine-) readable representations of
 | a state machine out of the exact same source code. That's still
 | the future of programming.
| the_cat_kittles wrote:
 | I think this has always subconsciously confused me whenever
 | people talk about state machines, and your comment clarifies
 | what I think I've felt. What would qualify as _not_ a state
 | machine in the broad sense?
| amw-zero wrote:
| That's a great question. Combinational logic is an example of
| something that computes that isn't a state machine. You can
| compute Boolean algebra operations with no notion of time,
| they're essentially pure functions.
|
 | That's where it gets slightly confusing though. Just because
 | a function is pure does not mean that its implementation is
 | not a state machine. For example, an algorithm to compute the
 | greatest common divisor of two numbers is stateful,
 | even if it has a pure functional interface.
| loopz wrote:
 | Isn't an FSM more like writing a program consisting only of
 | GOTOs? I call this the "loopz-effect", where any single state may
 | have an effect in any other state of the program.
| ajuc wrote:
 | It's the opposite. In an FSM you have clearly defined states and
 | all possible transitions between them.
|
| If you write the same code as normal imperative code that
| modifies some variables - to understand what it does you have
| to "run it" in your brain and reverse engineer the states and
| transitions yourself.
| garethrowlands wrote:
| Yes. And not only that, but GOTOs are the normal recommended
| way of implementing an FSM.
| xondono wrote:
 | What? Where? Most sane people write FSMs as switches inside an
 | infinite loop.
| recursive wrote:
| Right here, by me. Or perhaps I'm insane.
|
| Using goto ensures that all of a particular state's logic
| is nearby in code. It also eliminates the need to check an
| exhaustive list of irrelevant switch cases at runtime.
| amw-zero wrote:
 | An FSM is simply a structure that progresses over time, where
 | the next state depends on the previous state. By your
 | usage of "goto," I feel like you're implying that this is a
 | negative thing. But state is an inherent part of computation. A
 | Turing machine can be seen as a slightly more powerful state
 | machine, and most if not all of the code that we write can be
 | easily modeled with the less powerful state machine.
|
| And no, functional programming does not actually help this.
| davemp wrote:
| It's more like having a switch statement inside of a loop.
|
| The variable you switch on is the next state and the case
| you're in is the current state.
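That switch-inside-a-loop shape can be sketched in Python, which has no switch statement, so an if/elif chain on the current state stands in for the cases. The toy machine below (hypothetical) accepts strings containing "ab":

```python
# Switch-in-a-loop FSM: the if/elif chain plays the role of the
# switch, dispatching on the current state for each input character.

def contains_ab(text):
    state = "start"
    for ch in text:
        if state == "start":
            state = "saw_a" if ch == "a" else "start"
        elif state == "saw_a":
            if ch == "b":
                state = "accept"
            elif ch != "a":
                state = "start"   # an 'a' keeps us in saw_a
        elif state == "accept":
            break                 # accepting sink state
    return state == "accept"
```

In C the outer `for` would be a loop over input events and each branch a `case` label; the structure is otherwise identical.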
| retrac wrote:
| Strictly speaking, yes. But you should think of it as a
| formalized abstraction over that; a way to reason about
| something you can easily translate into a bunch of goto
| statements.
|
| It's just like how all if/then and function call constructs
| compile down to goto, as well. Some languages will offer more
| abstraction over this, but you can still do such constructs in
| assembly language if you want to by hand-holding the machine
| through every step.
|
| If you do find yourself writing assembly language (or C, for
| that matter!) then you should do just that; implement high-
| level constructs at the low level with a bunch of jump
| statements. The resulting code is verbose and superficially
| seems gnarly, but will also be far more scalable and robust.
|
| But there are compilers for a reason! There are some examples
| of automatic FSM to C compilers elsewhere in this thread. Some
| very high-level languages have embedded domain languages for
| FSM generation as well.
| psykotic wrote:
| Some random fun things you can do with DFAs/NFAs:
|
| 1. From two machines you can build a single machine that
| corresponds to running them in lockstep/parallel. Most language
| classes are closed under unions but regular languages stand out
| by being closed under intersections, and that relies on this
| construction. You can't replicate this construction with pushdown
| automata to show that context-free languages are closed under
| intersection because the product automata would require two
| stacks. But in fact the intersection of a context-free language
| with a regular language is context-free via this construction.
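The product construction from point 1 can be sketched in a few lines of Python. The two toy DFAs below are hypothetical, over the alphabet {a, b}: one accepts strings with an even number of 'a's, the other accepts strings ending in 'b':

```python
# Product construction: run two DFAs in lockstep on the same input.
# The product machine's state is the pair (s1, s2); it accepts the
# intersection of the two languages when both components accept.

def run_product(dfa1, dfa2, start1, start2, accept1, accept2, text):
    s1, s2 = start1, start2
    for ch in text:
        s1 = dfa1[(s1, ch)]   # both machines step on the same symbol
        s2 = dfa2[(s2, ch)]
    return s1 in accept1 and s2 in accept2

# DFA 1: even number of 'a's (state = parity of 'a' count).
even_a = {("even", "a"): "odd", ("odd", "a"): "even",
          ("even", "b"): "even", ("odd", "b"): "odd"}

# DFA 2: last character is 'b'.
ends_b = {("no", "a"): "no", ("no", "b"): "yes",
          ("yes", "a"): "no", ("yes", "b"): "yes"}

result = run_product(even_a, ends_b, "even", "no",
                     {"even"}, {"yes"}, "aab")
# True: "aab" has two 'a's (even) and ends with 'b'
```

The pushdown-automata obstruction mentioned above is visible here: the product state is just a pair, but pairing two PDAs would also need to pair their stacks, and a two-stack machine is no longer a PDA.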
|
| 2. You can run them backwards. Usually works best with NFAs since
| reversing a DFA can seriously inflate the state space. E.g. if
| you're trying to match a regular expression of the form '<re1>
| string <re2>' you can search for 'string' and then match re1
| backward and re2 forward. Running machines backwards can also be
| used for computing capture groups after a match has been found
| since the fastest methods for matching don't let you compute
| capture groups on the fly. Decoupling the re1 and re2 machines
| also means you end up with two independent machines with smaller
| state spaces, which means smaller, cache-friendlier lookup
| tables. See the next item for a less conventional example of what
| you can do with small state spaces.
|
| 3. While DFA execution seems inherently serial, it can actually
| be parallelized very effectively if the state space is small
| enough, through speculative execution. Suppose you want to lex a
| string in parallel by chopping it in half. You don't know what
| state the DFA will be in when it reaches the halfway point, but
| you can process the second half for all possible states at that
| point. Then when both halves are finished you throw away all but
| one of the speculative answers for the second half based on the
| final state for the first half. This is extremely fast and
| practical with PSHUFB/TBL-like instructions if your state space
| is small enough, at most 16 states for PSHUFB. You can mux
| together multiple PSHUFBs to go beyond that limit but the cost
| increases substantially. (This is a parallel prefix algorithm
| when applied recursively. As far as I know the application to
| parallel lexing goes back to the Hillis-Steele paper on data-
| parallel algorithms from 1986.)
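A pure-Python sketch of the speculation idea from point 3 (no SIMD; the tiny DFA here, which counts 'a's mod 3, is hypothetical):

```python
# Speculative parallel DFA execution: process the second half of the
# input from EVERY possible state, then discard all but the answer
# matching the first half's final state.

STATES = [0, 1, 2]

def step(state, ch):
    return (state + 1) % 3 if ch == "a" else state

def run(state, text):
    for ch in text:
        state = step(state, ch)
    return state

text = "abaabaa"
mid = len(text) // 2
first, second = text[:mid], text[mid:]

# Speculate: run the second half for all possible midpoint states.
# (In a real implementation the two halves run concurrently.)
speculative = {s: run(s, second) for s in STATES}

# Resolve: the first half's final state selects the right speculation.
final = speculative[run(0, first)]
assert final == run(0, text)  # same answer as the serial run
```

Applied recursively this is a parallel prefix computation over the monoid of state-transition functions, which is why it generalizes beyond two chunks.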
|
| 4. There's a critical optimization for the last algorithm which
| is sometimes applicable. A log file processor might have numerous
| internal states while processing a line, so you'd rather not
| speculate on all of them. Instead you can chop the file in half
| and from the second half seek forward to the next newline and
| start there. This avoids speculation entirely if there's a unique
| state following a newline character regardless of what state you
| were in before the newline.
| sleavey wrote:
| What are DFAs/NFAs? I didn't see them defined in the article.
| psykotic wrote:
| Deterministic/non-deterministic finite automaton. It's the
| standard term for a finite state machine which processes a
| stream of characters. All the examples in his article are
| DFAs.
| sleavey wrote:
| Thanks! I'm interested in the idea of a reversible state
| machine, for parsing an application specific language
| syntax into some machine representation, then allowing the
| user to modify it and unparse back to syntax again. So far
| I've had to implement procedural code for both directions
| but I figure there must be some way for some smart code to
| both parse and unparse using some declarative parser
| rules^. I've had no luck finding any existing art so far.
|
 | ^I figure this won't be super simple, since e.g. whitespace
 | is often discarded at the tokenizer step and would need to be
 | reinjected by the unparser, but that sort of stuff is
 | probably not too difficult to handle.
| mikewarot wrote:
 | I'm interested in doing that as well; I've often thought
 | of being able to go from Pascal -> AST + symbol table ->
 | executable and back.
|
| You would want to preserve all the whitespace, to make it
| completely reversible. The ability to modify the symbol
| table to _directly_ rename a variable or procedure, would
| be quite powerful, and never wrong. 8)
| cxr wrote:
| Notwithstanding caveats about what "reversible" means
| here, I'm interested in the application. I've been
| interested in "non-destructive compilation" for a while,
| including what you might call "language skins"--
| derivations built from reversible, deterministic
| transformations on the syntax graph created from an
| arbitrary parse. With the current popularity of (one-way)
| transpilers, I'm surprised there's not more interest and
| work on making them two-way, for example, but that's just
| one possible application.
| sleavey wrote:
| > I'm interested in the application
|
| I'm part of a team developing a scientific modelling
| tool. The user defines a model via syntax similar to
| SPICE's netlists, specifying model objects and their
| parameters and connections. The nature of the type of
| simulation requires that the user runs the script over
| and over while tweaking various parameters, some by hand
| and some via optimization routines. We use the syntax as
| a way to serialize the complete state of the model, to be
| embedded in the generated results (the script size is
| negligible compared to the results). It is therefore a
| requirement that we can generate syntax from the model
| that, once re-parsed, results in the same model again -
| thus the need for forward and reverse parsing.
|
| My current approach is to store the parsed AST as part of
| the model, then start from this (updating anything that
| was changed/added/removed in the meantime) when
| generating syntax.
|
| > I'm surprised there's not more interest and work on
| making them two-way, for example, but that's just one
| possible application.
|
| I suppose it's a combination of not many users typically
| wanting such a feature, the ease of storing a program's
| state as a proprietary/binary blob (Python's pickle
| module for example), and the tight coupling such a system
| creates between syntax and objects: almost everything
| created by the parser should map 1:1 to syntax, even if
| modified by the user after parsing. One example of that
| last point: you might define some sort of axis via
| syntax, like `range(100, 200, increment=1)`. The parser
 | then creates this axis as an array of data points. But
 | the user may also modify this in memory, e.g. adding `50`
 | to the array, so it's no longer a monotonic range. When
 | you come to unparse it, you need some code to
 | identify what this array represents (is it still a range?
 | or a sequence with no order?) and generate corresponding
 | syntax. That's tricky and requires lots of code for this
 | usually niche feature.
| psykotic wrote:
| "Reversible" meant consuming the character stream in
| reverse order while matching the same language, so it's
| different from what you have in mind--automatically
| generating a printer from a parser. On the surface it
| sounds very doable if each pattern of AST node is
| generated by a unique grammar production. Then pretty
| printing is the task of matching against an unambiguous
| tree grammar. It's not really different from what the
| compiler for a language like ML or Haskell does for
| pattern matching.
| sleavey wrote:
| Ah, now I see what you mean.
|
| > On the surface it sounds very doable if each pattern of
| AST node is generated by a unique grammar production.
|
| This seems to be the key, from my work implementing this
| in my application. Luckily for me the grammar itself can
| be somewhat tweaked to help ensure a 1:1 mapping between
| productions and objects, at least until we release v1,
| but if someone is having to work with an existing grammar
| I can see the need for all sorts of hacks.
| psykotic wrote:
| Yeah, the good news is that your tool can statically
| detect the ambiguities and you can let users specify
| priorities for the conflicting productions to define a
| unique mapping. That's a standard technique in parser
| generators. There the ambiguities are in the opposite
| direction, but the same idea applies.
| progre wrote:
| https://en.wikipedia.org/wiki/Deterministic_finite_automaton
| ufo wrote:
| By any chance, do you know an example of a regex library or
| tool that uses that backwards-matching trick?
| psykotic wrote:
| I learned the specific 're1 string re2' decomposition with
| backward matching from Hyperscan but AFAIK several fast grep-
| oriented regex engines factor out both prefixes and suffixes
| and use whichever is longer/most rare as a substring anchor.
| For the suffix case you need backward matching to confirm the
| regex match after the substring match. Hyperscan actually
| decomposes the entire state machine graph into 're1 string
| re2' style fragments, which lets them select the fastest
| implementation technique for each sub-machine.
|
| I think Russ Cox talks about backward matching in his article
| series on regular expressions, so it's probably used
| somewhere in re2.
| jwilk wrote:
| Russ Cox's article in question:
|
| https://swtch.com/~rsc/regexp/regexp3.html
| burntsushi wrote:
| The grep-oriented tools have it a bit easier. They can (and
| do) extract inner literals as well. The trick is that the
| search is line oriented and lines tend to be small. So when
| you find a match for an inner literal, all you need to do
| is find the bounds of the line that contains the literal
| and run the full regex engine on just that line. It works
| extremely well when matches are somewhat rare.
|
| As for RE2, backward matching is used to find the starting
| location of a match.
| feanaro wrote:
| Thinking about point 1 makes me wonder why no popular regex
| engines have an intersection (logical AND) operator. It seems
| like it could allow writing complex regexes in a much clearer
| way.
| indentit wrote:
| Could 2 lookaheads starting at the same position not be used
| for this purpose?
| enedil wrote:
| Lookahead is what causes a language not to be regular. In
| fact, common regex libraries support broader classes of
| languages than the regular ones. Perhaps that's why there's
| no `and` operator.
| burntsushi wrote:
| Most popular regex engines use backtracking and support non-
| regular features. They aren't based on automata, so
| implementing AND is perhaps not as theoretically straight-
| forward in that context.
|
| There are some popular finite automata based engines though.
| RE2 comes to mind. In theory, things like AND and complement
| could be implemented, but in practice, they blow up the size
| of the state machine.[1, 2] The current maintainer of RE2 did
| build a tool that permits intersection and complement
| though.[3]
|
| [1] - https://dl.acm.org/doi/10.1145/2071368.2071372
|
| [2] - https://github.com/google/re2/issues/156#issuecomment-3
| 54068...
|
| [3] - https://github.com/google/redgrep
| feanaro wrote:
| This is interesting, thank you!
___________________________________________________________________
(page generated 2021-01-01 23:01 UTC)