[HN Gopher] Calculate the difference and intersection of any two...
___________________________________________________________________
Calculate the difference and intersection of any two regexes
Author : posco
Score : 177 points
Date : 2023-09-11 17:10 UTC (4 hours ago)
(HTM) web link (phylactery.org)
(TXT) w3m dump (phylactery.org)
| [deleted]
| hoten wrote:
| On mobile: are the rectangle glyphs as suffixes on the states on
| purpose or am I missing a font?
| progbits wrote:
| The states are numbered, $\alpha_0, ..., \alpha_N$ and
| $\beta_0, ...$. You might be missing the font for the digits.
| JoelJacobson wrote:
| I created a similar regex web demo that shows how a regex is
| parsed -> NFA -> DFA -> minimal DFA, and finally outputs
| LLVMIR/Javascript/WebAssembly for from the minimal DFA:
|
| http://compiler.org/reason-re-nfa/src/index.html
| less_less wrote:
| Interesting. I think this problem is actually EXPSPACE-complete
| in general? But still has a straightforward algorithm.
|
| https://en.wikipedia.org/wiki/EXPSPACE
| klysm wrote:
| Regular expressions are a great example of bundling up some
| really neat and complex mathematical theory into a valuable
| interface. Linear algebra feels similar to me.
| pishpash wrote:
| That usually means the representation is getting close to the
| truth. Good interfaces have intrinsic value, which many result-
| focused people do not appreciate.
| dhosek wrote:
| It always amazes me how given the appropriate field, so much
| math can be transformed into linear algebra. Even Mobius
| transformations on the complex plane w=(az+b)/(cz+d) can be
| turned into linear algebra.
| [deleted]
| abecedarius wrote:
| iirc connections with linear algebra come up in Conway's
| https://store.doverpublications.com/0486485838.html (which I
| only skimmed).
| Jaxan wrote:
| There is a whole field of "weighted automata" which combine
| linear algebra and automata theory.
| [deleted]
| oever wrote:
| This library can be used to create string class hierarchies.
| That, in turn, can help to use typed strings more.
|
| For example, e-mails and urls are a special syntax. Their value
| space is a subset of all non-empty string which is a subset of
| all strings.
|
| An e-mail address could be passed into a function that requires a
| non-empty string as input. When the type-system knows that an
| e-mail string is a subclass of non-empty string, it knows that an
| email address is valid.
|
| This library can be used to check the definitions and hierarchy
| of such string types. The implementation of the hierarchy differs
| per programming language (subclassing, trait boundaries, etc).
| 1-more wrote:
| In languages with tagged union types you do this a lot! Some
| Haskell pseudocode for ya module Email
| (Address, fromText, toText) where -- note we do not export the
| constructor of Address, just the type data Address
| = Address Text fromString :: Text -> Maybe Address
| fromString = -- you'd do your validation in here
| and return Nothing if it's a bad address. -- Signal
| validity out of band, not in band with the data.
| toText :: Address -> Text toText (Address addr) = addr
| -- for when you need to output it somewhere
| alexeldeib wrote:
| > Signal validity out of band, not in band with the data.
|
| Could you expand on this?
| 1-more wrote:
| Sure! Sorry that was a little too obtuse. So in this case
| we can imagine an app where we don't use any tagged unions
| and just use primitive types (your strings, booleans,
| integers, things of that nature). And we want to signal the
| validity of some data. Say a user ID and an email address.
| We store the User ID as an integer to keep space down and
| store the email address as a string. We use semaphore
| values: if the user ID is invalid we store -1 (it's JS and
| there are no unsigned numbers) and if the email address is
| invalid we store the empty string.
|
| Whenever we consume these values, we need to make sure that
| userId > 0 and email != "" I mean email !== "". We are
| testing for special values of the data. Data and "this is
| for sure not meaningful data" are the same shape! So your
| functions need to handle those cases.
|
| But with tagged unions you can check these things at the
| edge of the program and thereafter accept that the contents
| of the tagged data are valid (because you wrote good tests
| for your decoders).
|
| So your data is a different shape when it's valid vs when
| it's invalid, and you can write functions that only accept
| data that's the valid shape. If you got Json that was hit
| by cosmic rays when trying to build your User model, you
| can fail right then and not build a model and find a way to
| handle that.
|
| It's out of band because you don't guard for special values
| of your morphologically identical data.
|
| If you want examples of any specific part of this let me
| know. IDK your level of familiarity and don't want to
| overburden you with things you already get.
| _a_a_a_ wrote:
| > Their value space...
|
| wossis mean? TIA
|
| Edit: instread of downvoting try answering. I'd like to know.
| TIA{2}
| oever wrote:
| Value space is the set of values a type can have. A boolean
| has only two values in its value space. An unsigned byte has
| 256 possible values, so does a signed byte.
|
| A string enumeration has a limited number of values. E.g.
| type A ("Yes" | "No" | "Maybe") has three values and is a
| superset of type B ("Yes" | "No"). A function that accepts
| type A can also accept type B as valid input.
|
| If the value space is defined by a regular expression, as is
| often the case, the mentioned library could be used to check,
| at compile-time, which type are subsets of others.
| _a_a_a_ wrote:
| Thank you. I guess I misread.
|
| "For example, e-mails and urls are a special syntax. Their
| value space..." seemed to talk about the 'value space' of
| strings (these being e-mails and urls), not types (of
| e-mails and urls), which confused me.
| acchow wrote:
| It is bout the 'value space' of strings. Think of all
| possible strings. That is the entire value space of
| strings. Not every possible string is an email. Only a
| subset of this value space is a valid email. This subset
| is the 'value space' of strings which are valid emails.
| posco wrote:
| The amazing page computes binary relations between pairs of
| regular expressions and shows a graphical representation of the
| DFA.
|
| It's a really incredible demonstration of some highly non-trivial
| operations on regular expressions.
| [deleted]
| vintermann wrote:
| It's very cool, but also no wonder that it doesn't support all
| those features of regexes which technically make them not
| regular expressions anymore. Though, I would have thought ^ and
| $ anchors shouldn't be a problem?
| o11c wrote:
| A _lack_ of `^` is equivalent to prepending `(.*)`, then
| trimming the match span to the end of that capture. And
| similarly for a lack of `$` (but suddenly I remember how
| nasty Python was before `.fullmatch` was added ...).
|
| More interesting is word boundaries:
|
| `\b` is just `\<|\>` though that should be bubbled up and
| usually only one side will actually produce a matchable
| regex.
|
| `A\<B` is just `(A&\W)(\w&B)`, and similar for `\>`.
| rntz wrote:
| ^ and $ _are_ a problem, although one with a workaround.
|
| The standard theory of regular expressions focuses entirely
| on regex matching, rather than searching. For matching, ^ and
| $ don't really mean anything. In particular, regexp theory is
| defined in terms of the "language of" a regexp: the set of
| strings which match it. What's the set of strings that "^"
| matches? Well, it's the empty string, but only if it comes at
| the beginning of a line (or sometimes the beginning of the
| document). This beginning-of-line constraint doesn't fit
| nicely into the "a regexp is defined by its language/set of
| strings" theory, much the same way lookahead/lookbehind
| assertions don't quite fit the theory of regular expressions.
|
| The standard workaround is to augment your alphabet with
| special beginning/end-of-line characters (or beginning/end-
| of-document), and say that "^" matches the beginning-of-line
| character.
| Sharlin wrote:
| As ^ and $ are implicit, you can opt _out_ of them simply by
| affixing `.*`.
| [deleted]
| teraflop wrote:
| This page implements regex _matching_ , not searching. So in
| effect, every pattern has an implicit ^ at the beginning and
| $ at the end.
| layer8 wrote:
| I wanted to see the intersection between syntactically valid URLs
| and email addresses, but just entering the URL regex (cf. below)
| already takes too long to process for the page.
|
| [\\-a-zA-Z0-9@:%._+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\b([\\-a-zA-Z0
| -9()@:%_+.~#?&//=]*)
|
| (source: https://stackoverflow.com/a/3809435/623763)
| d66 wrote:
| expressions like (...){1,256} are very heavyweight and the
| scala JS code ends up timing out or crashing the browser.
|
| if you replace that with (...)+ then it seems to work (at least
| for me). smaller expressions like (...){1,6} should be fine.
| noduerme wrote:
| Just wondering, what is it about testing repetition
| [a-z]{1,256} with an upper bound that's so heavy? Intuitively
| it feels like greedy testing [a-z]+ should actually be worse
| since it has to work back from the end of the input.
| d66 wrote:
| the library uses a fairly simple data representation where
| x{m,n} is compiled using conjunction and disjunction. so
| x{1,4} ends up being represented as x|xx|xxx|xxxx.
|
| this simplifies the code for testing equality and
| inclusion, since logically x{n} is just xx... (n times) and
| x{m,n} is just x{m}|x{m+1}|...|x{n}.
|
| but when you have x{m,n} and n-m is large you can imagine
| what kind of problems that causes.
| yorwba wrote:
| Seems like it should at least be x(|x(|x(|x))) instead of
| x|xx|xxx|xxxx to avoid quadratic blow-up.
| d66 wrote:
| yes, that is the actual construction: the disjunction
| data type only supports a lhs and rhs, so that is the
| only possible way to represent it.
|
| i wrote it the way i did for clarity in the comments.
| rntz wrote:
| This depends heavily on how repetition is implemented.
|
| With a backtracking-search implementation of regexes,
| bounded iteration is pretty easy.
|
| But the linked webpage appears to compile regexes to finite
| state machines (it shows you their finite-state-machine,
| for instance), and eg [a-z]{1,256} will have 256 states:
| 256 times the 1 state needed for [a-z]. If [a-z] were a
| complex regex, you could get a combinatorial explosion.
|
| This alone probably isn't the issue? 256 is not a very
| large number. But I suspect there are follow-on algorithmic
| issues. This is just speculation, but I wouldn't be
| surprised if that 256-state machine were computed by
| applying DFA minimization, an algorithm with worst-case
| exponential running time, to a more naively generated
| machine.
| d66 wrote:
| you're right. inclusion/intersection/etc. aren't actually
| computed via DFA but instead are computed directly on the
| regular expression representation itself. and large
| disjunctions (with 256 branches) are what is very heavy.
|
| (it's possible to instead do these operations on DFAs but
| at the time i found it hard to get from an automata back
| to a reasonable-looking regular expression.)
| themusicgod1 wrote:
| ugh STOP USING GITHUB
| simlevesque wrote:
| Kinda related but I'm looking for something that could give me
| the number of possible matching strings for a simple regex. Does
| such a tool exist ?
| someguy101010 wrote:
| might be something like this
| https://jvns.ca/blog/2016/04/24/how-regular-expressions-go-f...
| which refs https://en.wikipedia.org/wiki/Brzozowski_derivative
| Drup wrote:
| https://regex-generate.github.io/regenerate/ (I'm one of the
| authors) enumerates all the matching (and non-matching)
| strings, which incidentally answers the question, but doesn't
| terminate in the infinite case.
| d66 wrote:
| the page actually does give these. for a := [a-z]{2,4} the page
| gives |a| = 475228.
|
| however, as others have pointed out any non-trivial use of the
| kleene star means the result will be [?]. in this case the page
| will list numbers that roughly correspond to "number of strings
| with N applications of kleene star" in addition to infinity.
| clord wrote:
| I feel like it might be possible with dataflow analysis.
| Stepping through the regex maintaining a liveness set or
| something like that. Sort of like computing exemplar inputs,
| but with repetition as permitted exemplars. Honestly probably
| end up re-encoding the regex in some other format, perhaps with
| 'optimizations applied.'
| contravariant wrote:
| I feel like it shouldn't be too hard to calculate from the
| finite automaton that encodes the regular expression, but
| surely in most cases it will simply be infinite?
| kadoban wrote:
| Maybe the number of possible matchings for a given length (or
| range of lengths) might be interesting?
| danieldk wrote:
| Say you want to compute all strings of length 5 that the
| automaton can generate. Conceptually the nicest way is to
| create an automaton that matches any five characters and
| then compute the intersection between that automaton and
| the regex automaton. Then you can generate all the strings
| in the intersection automaton. Of course, IRL, you wouldn't
| actually generate the intersection automaton (you can
| easily do this on the fly), but you get the idea.
|
| Automata are really a lost art in modern natural language
| processing. We used to do things like store a large
| vocabulary in an deterministic acyclic minimized automaton
| (nice and compact, so-called dictionary automaton). And
| then to find, say all words within Levenshtein distance 2
| of _hacker_ , create a Levenshtein automaton for _hacker_
| and then compute (on the fly) the intersection between the
| Levenshtein automaton and the dictionary automaton. The
| language of the automaton is then all words within the
| intersection automaton.
|
| I wrote a Java package a decade ago that implements some of
| this stuff:
|
| https://github.com/danieldk/dictomaton
| contravariant wrote:
| > deterministic acyclic minimized automaton
|
| That's basically a Trie right? To be fair I have only
| heard of them and know they can be used to do neat
| tricks, I've rarely used one myself.
| Someone wrote:
| If you do not plan to update it often and don't need to
| store extra data with each word, a dawg (https://en.wikip
| edia.org/wiki/Deterministic_acyclic_finite_s...) is more
| compact. You often can merge leaf nodes.
|
| For example, if you have words talk
| talked talking talks walk
| walked walking walks
|
| there's no need to repeat the "", "ed", "ing", "s" parts.
| danieldk wrote:
| No. In a minimized automaton shared string suffixes also
| share states/transitions.
| abecedarius wrote:
| That was one of the short examples in Norvig's Python
| program-design course for Udacity.
| https://github.com/darius/cant/blob/master/library/regex-
| gen... (I don't have the Python handy.)
| tetha wrote:
| This is hitting back a long time. But the algorithm - if I
| recall right - is a simple DFS on the determinstic automaton
| for the regular expression and it can output the full set of
| matching strings if you're allowed to use *s in the output.
|
| Basically, you need an accumulator of "stuff up to here". If
| you move from a node to a second node, you add the character
| annotating that edge to the accumulator. And whenever you end
| up with an edge to a visited node, you add a '*' and output
| that, and for leaf nodes, you output the accumulator.
|
| And then you add a silly jumble of parenthesis on entry and
| output to make it right. This was kinda simple to figure out
| with stuff like (a(ab)*b)* and such.
|
| This is in O(states) for R and O(2^states) for NR if I recall
| right.
| 082349872349872 wrote:
| see https://www.cs.dartmouth.edu/~doug/nfa.pdf
| pimlottc wrote:
| What's your use case?
| stvltvs wrote:
| The answer is usually an infinite number, except for very, very
| simple cases. Anything involving * for example means infinity
| is your answer.
| skulk wrote:
| I wonder if it makes sense to compute an "order type" for a
| regexp. For example, a* is omega, a*b* is 2 omega.
|
| https://en.m.wikipedia.org/wiki/Order_type
|
| https://en.wikipedia.org/wiki/Ordinal_number
| contravariant wrote:
| I think that's just the ordinal corresponding to
| lexicographic order on the words in the language, so yeah
| that should work. I wonder how easy it is to calculate...
| d66 wrote:
| normally you would use an ordinal number [1] to label
| individual elements of an infinite set while using a
| cardinal number [2] to measure the size of the set.
|
| i believe the cardinality of a set of words from a finite
| alphabet (with more than one member) is equivalent to the
| cardinality of the real numbers. this means that the
| cardinality of .* is c.
|
| unfortunately, i don't think that cardinality gets us
| very far when trying to differentiate the "complexity" of
| expressions like [ab]* from ([ab]*c[de]*)*[x-z]*.
| probably some other metric should be used (maybe
| something like kolmogorov complexity).
|
| [1] https://en.wikipedia.org/wiki/Ordinal_number
|
| [2] https://en.wikipedia.org/wiki/Cardinal_number
| contravariant wrote:
| I wouldn't say that's their 'normal' usage, I mean sure
| you can use them like that but fundamentally ordinal
| numbers are equivalence classes of ordered sets in the
| same way that cardinal numbers are equivalence classes of
| sets.
|
| As you've rightly noted the latter equivalence class gets
| us nothing so throwing away the ordering is a bit of a
| waste. Of all mathematical concepts 'size' is easily the
| most subjective so picking one that is interesting is
| better than trying to be 'correct'.
|
| In particular a*b* is exactly equivalent to o^2, since
| a^n b^m < a^x b^y iff n < x or n=x and m<y. This gives an
| order preserving isomorphism between words of the form
| a^n b^m and tuples (n,m) with lexicographic ordering.
| d66 wrote:
| interesting!
|
| what would [ab]* be? for computing an ordinal number the
| only real difficulty is how to handle kleene star: given
| ord(X) how do we calculate ord(X*)?
|
| but as you probably noticed i'm a bit out of my depth
| when dealing with ordinals.
| snoble wrote:
| Oh neat, this is scala via scalajs.
| blibble wrote:
| it always bugged me as a student that had to sit through all
| those discrete maths lectures that standard regex libraries don't
| allow you to union/intersect two "compiled" regular expression
| objects together
|
| (having to try them one an a time is pretty sad)
| haltist wrote:
| Can LLMs do this?
| vore wrote:
| I wouldn't use an LLM for anything that can be done 100%
| precisely, like this.
| haltist wrote:
| OK, just curious how LLMs are stacking up in logical tasks
| like this. I kept hearing we were close to AGI so just
| wondering how far there is to go.
| baggy_trough wrote:
| I love how it looks like a CS textbook.
| perihelions wrote:
| The graphics look identical to those in Hopcroft & Ullman's _"
| Introduction to Automata Theory, Languages, and Computation"_
| (like the convention that they use a double-circle to denote
| accepting states). I imagine they're GraphViz-based: it's very
| easy [0] to draw these in GraphViz. I don't know what Hopcroft
| & Ullman used though, because that one was published in 1979,
| and GraphViz didn't exist before 1991. Suddenly I'm curious
| what the state of the art for vector diagrams was in 1979...?
|
| [0] e.g. https://graphviz.org/Gallery/directed/fsm.html
| cobbal wrote:
| It has the look of graphviz about it, which is an excellent
| tool. Often helpful in debugging anything related to graphs.
|
| https://graphviz.org/
| pimlottc wrote:
| Suggestion: turn off auto suggest in the regex input fields to
| make it more usable on mobile.
|
| https://stackoverflow.com/questions/35513968/disable-autocor...
| _a_a_a_ wrote:
| Any def for 'difference and intersection of regexes' might
| actually mean?
|
| I guess for regexes r1 and r2 this means the diff and intersect
| of their extensional sets, expressed intensionally as a regex. I
| guess. But nothing seems defined, including what ^ is, or > or
| whatever. It's not helpful
| d66 wrote:
| negation (~a): strings not matched by a difference (a -
| b): strings matched by a but not b intersection (a & b):
| strings matched by a and b exclusive-or (a ^ b): strings
| matched by a or b but not both inclusion (a > b): does a
| matches all strings b matches? equality (a = b): do a and
| b match exactly the same strings?
| rsstack wrote:
| I used this concept once to write the validation logic for an "IP
| RegEx filter" setting. The goal was to let users configure an IP
| filter using RegEx (no, marketing people don't get CIDRs, and
| they knew RegEx's from Google Analytics). How could I define a
| valid RegEx for this? The intersection with the RegEx of "all
| IPv4 addresses" is not empty, and not equal to the RegEx of "all
| IPv4 addresses". Prevented many complaints about the filter not
| doing anything, but of course didn't prevent wrong filters from
| being entered.
| Etheryte wrote:
| Wouldn't a simpler solution work here? Instead of trying to
| validate the filter regex, show some sample IP addresses or let
| the user insert a set of addresses, and then show which ones
| the filter matches and which ones it doesn't. Also helps
| address the problem of incorrect filters.
| rsstack wrote:
| The odds of the sample addresses matching is essentially
| zero, and adding work to the user is counterproductive.
| Etheryte wrote:
| I'm not sure I agree -- most common regex editing tools
| available online include a section for adding test strings
| to verify what you actually wrote is correct. Clearly there
| is a benefit to it. In similar vein, allowing the user to
| test before they commit and then test actually reduces
| their work load, they don't have to drop and then reload
| the whole regex in their mind.
| rsstack wrote:
| Sure, I use that when authoring and editing a RegEx.
| That's not the same as entry validation.
___________________________________________________________________
(page generated 2023-09-11 22:00 UTC)