[HN Gopher] Calculate the difference and intersection of any two...
       ___________________________________________________________________
        
       Calculate the difference and intersection of any two regexes
        
       Author : posco
       Score  : 177 points
       Date   : 2023-09-11 17:10 UTC (4 hours ago)
        
 (HTM) web link (phylactery.org)
 (TXT) w3m dump (phylactery.org)
        
       | [deleted]
        
       | hoten wrote:
       | On mobile: are the rectangle glyphs as suffixes on the states on
       | purpose or am I missing a font?
        
         | progbits wrote:
         | The states are numbered, $\alpha_0, ..., \alpha_N$ and
         | $\beta_0, ...$. You might be missing the font for the digits.
        
       | JoelJacobson wrote:
       | I created a similar regex web demo that shows how a regex is
       | parsed -> NFA -> DFA -> minimal DFA, and finally outputs
       | LLVMIR/Javascript/WebAssembly for from the minimal DFA:
       | 
       | http://compiler.org/reason-re-nfa/src/index.html
        
       | less_less wrote:
       | Interesting. I think this problem is actually EXPSPACE-complete
       | in general? But still has a straightforward algorithm.
       | 
       | https://en.wikipedia.org/wiki/EXPSPACE
        
       | klysm wrote:
       | Regular expressions are a great example of bundling up some
       | really neat and complex mathematical theory into a valuable
       | interface. Linear algebra feels similar to me.
        
         | pishpash wrote:
         | That usually means the representation is getting close to the
         | truth. Good interfaces have intrinsic value, which many result-
         | focused people do not appreciate.
        
         | dhosek wrote:
         | It always amazes me how given the appropriate field, so much
         | math can be transformed into linear algebra. Even Mobius
         | transformations on the complex plane w=(az+b)/(cz+d) can be
         | turned into linear algebra.
        
           | [deleted]
        
         | abecedarius wrote:
         | iirc connections with linear algebra come up in Conway's
         | https://store.doverpublications.com/0486485838.html (which I
         | only skimmed).
        
           | Jaxan wrote:
           | There is a whole field of "weighted automata" which combine
           | linear algebra and automata theory.
        
           | [deleted]
        
       | oever wrote:
       | This library can be used to create string class hierarchies.
       | That, in turn, can help to use typed strings more.
       | 
       | For example, e-mails and urls are a special syntax. Their value
       | space is a subset of all non-empty string which is a subset of
       | all strings.
       | 
       | An e-mail address could be passed into a function that requires a
       | non-empty string as input. When the type-system knows that an
       | e-mail string is a subclass of non-empty string, it knows that an
       | email address is valid.
       | 
       | This library can be used to check the definitions and hierarchy
       | of such string types. The implementation of the hierarchy differs
       | per programming language (subclassing, trait boundaries, etc).
        
         | 1-more wrote:
         | In languages with tagged union types you do this a lot! Some
         | Haskell pseudocode for ya                   module Email
         | (Address, fromText, toText) where -- note we do not export the
         | constructor of Address, just the type              data Address
         | = Address Text              fromString :: Text -> Maybe Address
         | fromString =             -- you'd do your validation in here
         | and return Nothing if it's a bad address.             -- Signal
         | validity out of band, not in band with the data.
         | toText :: Address -> Text         toText (Address addr) = addr
         | -- for when you need to output it somewhere
        
           | alexeldeib wrote:
           | > Signal validity out of band, not in band with the data.
           | 
           | Could you expand on this?
        
             | 1-more wrote:
             | Sure! Sorry that was a little too obtuse. So in this case
             | we can imagine an app where we don't use any tagged unions
             | and just use primitive types (your strings, booleans,
             | integers, things of that nature). And we want to signal the
             | validity of some data. Say a user ID and an email address.
             | We store the User ID as an integer to keep space down and
             | store the email address as a string. We use semaphore
             | values: if the user ID is invalid we store -1 (it's JS and
             | there are no unsigned numbers) and if the email address is
             | invalid we store the empty string.
             | 
             | Whenever we consume these values, we need to make sure that
             | userId > 0 and email != "" I mean email !== "". We are
             | testing for special values of the data. Data and "this is
             | for sure not meaningful data" are the same shape! So your
             | functions need to handle those cases.
             | 
             | But with tagged unions you can check these things at the
             | edge of the program and thereafter accept that the contents
             | of the tagged data are valid (because you wrote good tests
             | for your decoders).
             | 
             | So your data is a different shape when it's valid vs when
             | it's invalid, and you can write functions that only accept
             | data that's the valid shape. If you got Json that was hit
             | by cosmic rays when trying to build your User model, you
             | can fail right then and not build a model and find a way to
             | handle that.
             | 
             | It's out of band because you don't guard for special values
             | of your morphologically identical data.
             | 
             | If you want examples of any specific part of this let me
             | know. IDK your level of familiarity and don't want to
             | overburden you with things you already get.
        
         | _a_a_a_ wrote:
         | > Their value space...
         | 
         | wossis mean? TIA
         | 
         | Edit: instread of downvoting try answering. I'd like to know.
         | TIA{2}
        
           | oever wrote:
           | Value space is the set of values a type can have. A boolean
           | has only two values in its value space. An unsigned byte has
           | 256 possible values, so does a signed byte.
           | 
           | A string enumeration has a limited number of values. E.g.
           | type A ("Yes" | "No" | "Maybe") has three values and is a
           | superset of type B ("Yes" | "No"). A function that accepts
           | type A can also accept type B as valid input.
           | 
           | If the value space is defined by a regular expression, as is
           | often the case, the mentioned library could be used to check,
           | at compile-time, which type are subsets of others.
        
             | _a_a_a_ wrote:
             | Thank you. I guess I misread.
             | 
             | "For example, e-mails and urls are a special syntax. Their
             | value space..." seemed to talk about the 'value space' of
             | strings (these being e-mails and urls), not types (of
             | e-mails and urls), which confused me.
        
               | acchow wrote:
               | It is bout the 'value space' of strings. Think of all
               | possible strings. That is the entire value space of
               | strings. Not every possible string is an email. Only a
               | subset of this value space is a valid email. This subset
               | is the 'value space' of strings which are valid emails.
        
       | posco wrote:
       | The amazing page computes binary relations between pairs of
       | regular expressions and shows a graphical representation of the
       | DFA.
       | 
       | It's a really incredible demonstration of some highly non-trivial
       | operations on regular expressions.
        
         | [deleted]
        
         | vintermann wrote:
         | It's very cool, but also no wonder that it doesn't support all
         | those features of regexes which technically make them not
         | regular expressions anymore. Though, I would have thought ^ and
         | $ anchors shouldn't be a problem?
        
           | o11c wrote:
           | A _lack_ of `^` is equivalent to prepending `(.*)`, then
           | trimming the match span to the end of that capture. And
           | similarly for a lack of `$` (but suddenly I remember how
           | nasty Python was before `.fullmatch` was added ...).
           | 
           | More interesting is word boundaries:
           | 
           | `\b` is just `\<|\>` though that should be bubbled up and
           | usually only one side will actually produce a matchable
           | regex.
           | 
           | `A\<B` is just `(A&\W)(\w&B)`, and similar for `\>`.
        
           | rntz wrote:
           | ^ and $ _are_ a problem, although one with a workaround.
           | 
           | The standard theory of regular expressions focuses entirely
           | on regex matching, rather than searching. For matching, ^ and
           | $ don't really mean anything. In particular, regexp theory is
           | defined in terms of the "language of" a regexp: the set of
           | strings which match it. What's the set of strings that "^"
           | matches? Well, it's the empty string, but only if it comes at
           | the beginning of a line (or sometimes the beginning of the
           | document). This beginning-of-line constraint doesn't fit
           | nicely into the "a regexp is defined by its language/set of
           | strings" theory, much the same way lookahead/lookbehind
           | assertions don't quite fit the theory of regular expressions.
           | 
           | The standard workaround is to augment your alphabet with
           | special beginning/end-of-line characters (or beginning/end-
           | of-document), and say that "^" matches the beginning-of-line
           | character.
        
           | Sharlin wrote:
           | As ^ and $ are implicit, you can opt _out_ of them simply by
           | affixing `.*`.
        
           | [deleted]
        
           | teraflop wrote:
           | This page implements regex _matching_ , not searching. So in
           | effect, every pattern has an implicit ^ at the beginning and
           | $ at the end.
        
       | layer8 wrote:
       | I wanted to see the intersection between syntactically valid URLs
       | and email addresses, but just entering the URL regex (cf. below)
       | already takes too long to process for the page.
       | 
       | [\\-a-zA-Z0-9@:%._+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\b([\\-a-zA-Z0
       | -9()@:%_+.~#?&//=]*)
       | 
       | (source: https://stackoverflow.com/a/3809435/623763)
        
         | d66 wrote:
         | expressions like (...){1,256} are very heavyweight and the
         | scala JS code ends up timing out or crashing the browser.
         | 
         | if you replace that with (...)+ then it seems to work (at least
         | for me). smaller expressions like (...){1,6} should be fine.
        
           | noduerme wrote:
           | Just wondering, what is it about testing repetition
           | [a-z]{1,256} with an upper bound that's so heavy? Intuitively
           | it feels like greedy testing [a-z]+ should actually be worse
           | since it has to work back from the end of the input.
        
             | d66 wrote:
             | the library uses a fairly simple data representation where
             | x{m,n} is compiled using conjunction and disjunction. so
             | x{1,4} ends up being represented as x|xx|xxx|xxxx.
             | 
             | this simplifies the code for testing equality and
             | inclusion, since logically x{n} is just xx... (n times) and
             | x{m,n} is just x{m}|x{m+1}|...|x{n}.
             | 
             | but when you have x{m,n} and n-m is large you can imagine
             | what kind of problems that causes.
        
               | yorwba wrote:
               | Seems like it should at least be x(|x(|x(|x))) instead of
               | x|xx|xxx|xxxx to avoid quadratic blow-up.
        
               | d66 wrote:
               | yes, that is the actual construction: the disjunction
               | data type only supports a lhs and rhs, so that is the
               | only possible way to represent it.
               | 
               | i wrote it the way i did for clarity in the comments.
        
             | rntz wrote:
             | This depends heavily on how repetition is implemented.
             | 
             | With a backtracking-search implementation of regexes,
             | bounded iteration is pretty easy.
             | 
             | But the linked webpage appears to compile regexes to finite
             | state machines (it shows you their finite-state-machine,
             | for instance), and eg [a-z]{1,256} will have 256 states:
             | 256 times the 1 state needed for [a-z]. If [a-z] were a
             | complex regex, you could get a combinatorial explosion.
             | 
             | This alone probably isn't the issue? 256 is not a very
             | large number. But I suspect there are follow-on algorithmic
             | issues. This is just speculation, but I wouldn't be
             | surprised if that 256-state machine were computed by
             | applying DFA minimization, an algorithm with worst-case
             | exponential running time, to a more naively generated
             | machine.
        
               | d66 wrote:
               | you're right. inclusion/intersection/etc. aren't actually
               | computed via DFA but instead are computed directly on the
               | regular expression representation itself. and large
               | disjunctions (with 256 branches) are what is very heavy.
               | 
               | (it's possible to instead do these operations on DFAs but
               | at the time i found it hard to get from an automata back
               | to a reasonable-looking regular expression.)
        
       | themusicgod1 wrote:
       | ugh STOP USING GITHUB
        
       | simlevesque wrote:
       | Kinda related but I'm looking for something that could give me
       | the number of possible matching strings for a simple regex. Does
       | such a tool exist ?
        
         | someguy101010 wrote:
         | might be something like this
         | https://jvns.ca/blog/2016/04/24/how-regular-expressions-go-f...
         | which refs https://en.wikipedia.org/wiki/Brzozowski_derivative
        
         | Drup wrote:
         | https://regex-generate.github.io/regenerate/ (I'm one of the
         | authors) enumerates all the matching (and non-matching)
         | strings, which incidentally answers the question, but doesn't
         | terminate in the infinite case.
        
         | d66 wrote:
         | the page actually does give these. for a := [a-z]{2,4} the page
         | gives |a| = 475228.
         | 
         | however, as others have pointed out any non-trivial use of the
         | kleene star means the result will be [?]. in this case the page
         | will list numbers that roughly correspond to "number of strings
         | with N applications of kleene star" in addition to infinity.
        
         | clord wrote:
         | I feel like it might be possible with dataflow analysis.
         | Stepping through the regex maintaining a liveness set or
         | something like that. Sort of like computing exemplar inputs,
         | but with repetition as permitted exemplars. Honestly probably
         | end up re-encoding the regex in some other format, perhaps with
         | 'optimizations applied.'
        
         | contravariant wrote:
         | I feel like it shouldn't be too hard to calculate from the
         | finite automaton that encodes the regular expression, but
         | surely in most cases it will simply be infinite?
        
           | kadoban wrote:
           | Maybe the number of possible matchings for a given length (or
           | range of lengths) might be interesting?
        
             | danieldk wrote:
             | Say you want to compute all strings of length 5 that the
             | automaton can generate. Conceptually the nicest way is to
             | create an automaton that matches any five characters and
             | then compute the intersection between that automaton and
             | the regex automaton. Then you can generate all the strings
             | in the intersection automaton. Of course, IRL, you wouldn't
             | actually generate the intersection automaton (you can
             | easily do this on the fly), but you get the idea.
             | 
             | Automata are really a lost art in modern natural language
             | processing. We used to do things like store a large
             | vocabulary in an deterministic acyclic minimized automaton
             | (nice and compact, so-called dictionary automaton). And
             | then to find, say all words within Levenshtein distance 2
             | of _hacker_ , create a Levenshtein automaton for _hacker_
             | and then compute (on the fly) the intersection between the
             | Levenshtein automaton and the dictionary automaton. The
             | language of the automaton is then all words within the
             | intersection automaton.
             | 
             | I wrote a Java package a decade ago that implements some of
             | this stuff:
             | 
             | https://github.com/danieldk/dictomaton
        
               | contravariant wrote:
               | > deterministic acyclic minimized automaton
               | 
               | That's basically a Trie right? To be fair I have only
               | heard of them and know they can be used to do neat
               | tricks, I've rarely used one myself.
        
               | Someone wrote:
               | If you do not plan to update it often and don't need to
               | store extra data with each word, a dawg (https://en.wikip
               | edia.org/wiki/Deterministic_acyclic_finite_s...) is more
               | compact. You often can merge leaf nodes.
               | 
               | For example, if you have words                  talk
               | talked        talking        talks        walk
               | walked        walking        walks
               | 
               | there's no need to repeat the "", "ed", "ing", "s" parts.
        
               | danieldk wrote:
               | No. In a minimized automaton shared string suffixes also
               | share states/transitions.
        
             | abecedarius wrote:
             | That was one of the short examples in Norvig's Python
             | program-design course for Udacity.
             | https://github.com/darius/cant/blob/master/library/regex-
             | gen... (I don't have the Python handy.)
        
           | tetha wrote:
           | This is hitting back a long time. But the algorithm - if I
           | recall right - is a simple DFS on the determinstic automaton
           | for the regular expression and it can output the full set of
           | matching strings if you're allowed to use *s in the output.
           | 
           | Basically, you need an accumulator of "stuff up to here". If
           | you move from a node to a second node, you add the character
           | annotating that edge to the accumulator. And whenever you end
           | up with an edge to a visited node, you add a '*' and output
           | that, and for leaf nodes, you output the accumulator.
           | 
           | And then you add a silly jumble of parenthesis on entry and
           | output to make it right. This was kinda simple to figure out
           | with stuff like (a(ab)*b)* and such.
           | 
           | This is in O(states) for R and O(2^states) for NR if I recall
           | right.
        
           | 082349872349872 wrote:
           | see https://www.cs.dartmouth.edu/~doug/nfa.pdf
        
         | pimlottc wrote:
         | What's your use case?
        
         | stvltvs wrote:
         | The answer is usually an infinite number, except for very, very
         | simple cases. Anything involving * for example means infinity
         | is your answer.
        
           | skulk wrote:
           | I wonder if it makes sense to compute an "order type" for a
           | regexp. For example, a* is omega, a*b* is 2 omega.
           | 
           | https://en.m.wikipedia.org/wiki/Order_type
           | 
           | https://en.wikipedia.org/wiki/Ordinal_number
        
             | contravariant wrote:
             | I think that's just the ordinal corresponding to
             | lexicographic order on the words in the language, so yeah
             | that should work. I wonder how easy it is to calculate...
        
               | d66 wrote:
               | normally you would use an ordinal number [1] to label
               | individual elements of an infinite set while using a
               | cardinal number [2] to measure the size of the set.
               | 
               | i believe the cardinality of a set of words from a finite
               | alphabet (with more than one member) is equivalent to the
               | cardinality of the real numbers. this means that the
               | cardinality of .* is c.
               | 
               | unfortunately, i don't think that cardinality gets us
               | very far when trying to differentiate the "complexity" of
               | expressions like [ab]* from ([ab]*c[de]*)*[x-z]*.
               | probably some other metric should be used (maybe
               | something like kolmogorov complexity).
               | 
               | [1] https://en.wikipedia.org/wiki/Ordinal_number
               | 
               | [2] https://en.wikipedia.org/wiki/Cardinal_number
        
               | contravariant wrote:
               | I wouldn't say that's their 'normal' usage, I mean sure
               | you can use them like that but fundamentally ordinal
               | numbers are equivalence classes of ordered sets in the
               | same way that cardinal numbers are equivalence classes of
               | sets.
               | 
               | As you've rightly noted the latter equivalence class gets
               | us nothing so throwing away the ordering is a bit of a
               | waste. Of all mathematical concepts 'size' is easily the
               | most subjective so picking one that is interesting is
               | better than trying to be 'correct'.
               | 
               | In particular a*b* is exactly equivalent to o^2, since
               | a^n b^m < a^x b^y iff n < x or n=x and m<y. This gives an
               | order preserving isomorphism between words of the form
               | a^n b^m and tuples (n,m) with lexicographic ordering.
        
               | d66 wrote:
               | interesting!
               | 
               | what would [ab]* be? for computing an ordinal number the
               | only real difficulty is how to handle kleene star: given
               | ord(X) how do we calculate ord(X*)?
               | 
               | but as you probably noticed i'm a bit out of my depth
               | when dealing with ordinals.
        
       | snoble wrote:
       | Oh neat, this is scala via scalajs.
        
       | blibble wrote:
       | it always bugged me as a student that had to sit through all
       | those discrete maths lectures that standard regex libraries don't
       | allow you to union/intersect two "compiled" regular expression
       | objects together
       | 
       | (having to try them one an a time is pretty sad)
        
       | haltist wrote:
       | Can LLMs do this?
        
         | vore wrote:
         | I wouldn't use an LLM for anything that can be done 100%
         | precisely, like this.
        
           | haltist wrote:
           | OK, just curious how LLMs are stacking up in logical tasks
           | like this. I kept hearing we were close to AGI so just
           | wondering how far there is to go.
        
       | baggy_trough wrote:
       | I love how it looks like a CS textbook.
        
         | perihelions wrote:
         | The graphics look identical to those in Hopcroft & Ullman's _"
         | Introduction to Automata Theory, Languages, and Computation"_
         | (like the convention that they use a double-circle to denote
         | accepting states). I imagine they're GraphViz-based: it's very
         | easy [0] to draw these in GraphViz. I don't know what Hopcroft
         | & Ullman used though, because that one was published in 1979,
         | and GraphViz didn't exist before 1991. Suddenly I'm curious
         | what the state of the art for vector diagrams was in 1979...?
         | 
         | [0] e.g. https://graphviz.org/Gallery/directed/fsm.html
        
         | cobbal wrote:
         | It has the look of graphviz about it, which is an excellent
         | tool. Often helpful in debugging anything related to graphs.
         | 
         | https://graphviz.org/
        
       | pimlottc wrote:
       | Suggestion: turn off auto suggest in the regex input fields to
       | make it more usable on mobile.
       | 
       | https://stackoverflow.com/questions/35513968/disable-autocor...
        
       | _a_a_a_ wrote:
       | Any def for 'difference and intersection of regexes' might
       | actually mean?
       | 
       | I guess for regexes r1 and r2 this means the diff and intersect
       | of their extensional sets, expressed intensionally as a regex. I
       | guess. But nothing seems defined, including what ^ is, or > or
       | whatever. It's not helpful
        
         | d66 wrote:
         | negation (~a): strings not matched by a       difference (a -
         | b): strings matched by a but not b       intersection (a & b):
         | strings matched by a and b       exclusive-or (a ^ b): strings
         | matched by a or b but not both       inclusion (a > b): does a
         | matches all strings b matches?       equality (a = b): do a and
         | b match exactly the same strings?
        
       | rsstack wrote:
       | I used this concept once to write the validation logic for an "IP
       | RegEx filter" setting. The goal was to let users configure an IP
       | filter using RegEx (no, marketing people don't get CIDRs, and
       | they knew RegEx's from Google Analytics). How could I define a
       | valid RegEx for this? The intersection with the RegEx of "all
       | IPv4 addresses" is not empty, and not equal to the RegEx of "all
       | IPv4 addresses". Prevented many complaints about the filter not
       | doing anything, but of course didn't prevent wrong filters from
       | being entered.
        
         | Etheryte wrote:
         | Wouldn't a simpler solution work here? Instead of trying to
         | validate the filter regex, show some sample IP addresses or let
         | the user insert a set of addresses, and then show which ones
         | the filter matches and which ones it doesn't. Also helps
         | address the problem of incorrect filters.
        
           | rsstack wrote:
           | The odds of the sample addresses matching is essentially
           | zero, and adding work to the user is counterproductive.
        
             | Etheryte wrote:
             | I'm not sure I agree -- most common regex editing tools
             | available online include a section for adding test strings
             | to verify what you actually wrote is correct. Clearly there
             | is a benefit to it. In similar vein, allowing the user to
             | test before they commit and then test actually reduces
             | their work load, they don't have to drop and then reload
             | the whole regex in their mind.
        
               | rsstack wrote:
               | Sure, I use that when authoring and editing a RegEx.
               | That's not the same as entry validation.
        
       ___________________________________________________________________
       (page generated 2023-09-11 22:00 UTC)