[HN Gopher] A regular expression to check for prime numbers (2007)
       ___________________________________________________________________
        
       A regular expression to check for prime numbers (2007)
        
       Author : graderjs
       Score  : 267 points
       Date   : 2022-03-05 04:34 UTC (18 hours ago)
        
 (HTM) web link (www.noulakaz.net)
 (TXT) w3m dump (www.noulakaz.net)
        
       | [deleted]
        
       | amp108 wrote:
       | Interesting, I wrote a quick script to use this:
       | 
       | #!/usr/bin/env ruby
       | 
       | def is_prime(n)                 ("1" \* n.to_i)!~
       | /^1?$|^(11+?)\1+$/
       | 
       | end
       | 
       | num = ARGV.shift
       | 
       | $stdout.puts is_prime(num)
       | 
       | ... then decided to test it on a button-mashed number:
       | 
       | $ruby ~/prime.rb 767423
       | 
       | ... and it hung.
       | 
       | Thought it was the length of the number, so I tried something
       | smaller (101). Quick "true". Then I tried progressively more
       | digits. I got to `ruby ~/prime.rb 7673121` (false) and it took
       | less than a second.
       | 
       | But it doesn't like something about the number 767423.
        
         | TrianguloY wrote:
         | The regexp basically tests all dividers of the number (from 2
         | to the number-1), if it founds one, it says false and stops, if
         | nothing is found, it says true.
         | 
         | 101 is prime, but it does "only" a hundred checks approx.
         | 
         | 7673121 is divisible by 3, so as soon as it tries to match
         | groups of 3 it succeeds and stops.
         | 
         | 767423 is prime, so it needs to do hundreds of millions of
         | checks, that's probably why it hungs.
         | 
         | Basically: the runtime of the algorithm should be equivalent to
         | the minimum divisor excluding 1 of the number.
        
       | vbezhenar wrote:
       | Simpler regexp:                   ^(11+)\1+$
       | 
       | it does not handle "1" for clarity. I also don't understand why
       | author used `+?`, because simple `+` should work just as well. At
       | least it works in Idea for me.
        
         | MauranKilom wrote:
         | Yeah, should work even with greedy matching as it just does the
         | same checks but in reverse order.
        
         | catlifeonmars wrote:
         | I wonder which one will reject composite numbers faster. The
         | non greedy version starts with the smallest (potential) factor
         | first, while the greedy version starts with the largest
         | candidate factor first.
         | 
         | I suspect that since a factor must be smaller than the square
         | root of a composite, the greedy version you proposed would
         | waste a lot of time checking numbers which could never be
         | factors of the composite under test.
         | 
         | If the number is prime, I think both variants have the same
         | performance because they will exhaustively evaluate all numbers
         | less than or equal to the composite under test.
         | 
         | Edit: this is not quite correct. One of the prime factors must
         | be smaller than the square root of the composite numbers, but
         | the other one can obviously be larger. For example: 2 * 17 =
         | 34. I'm leaving the original statement as is, because I'm
         | hoping someone with a better math background can chime in and
         | explain if the way that prime factors are distributed is
         | symmetrical or not.
        
       | johnx123-up wrote:
       | FWIW...
       | 
       | Better explanations in http://neilk.net/blog/2000/06/01/abigails-
       | regex-to-test-for-...
       | 
       | Abigail's other regular expression hacks
       | https://metacpan.org/author/ABIGAIL?sort=[[3,1]]
        
       | bawolff wrote:
       | This is interesting - i thought it would be something complex,
       | but really its just using the fundamental meaning of prime.
       | 
       | Literally what its doing is turn a number in to all 1s, if the
       | number is 1 return true. Otherwise try all possible ways to put
       | it into 2 or more equal groups, return true once we find one that
       | works.
       | 
       | You don't get closer to the definition of "prime" than that.
        
         | russfink wrote:
         | If the number of 1's itself is prime, then n is prime. That's
         | all this does. Search the other comments for 'hung' or 'hang'
         | to see examples of small numbers that take forever to resolve.
        
           | bawolff wrote:
           | Sure. its basically a less efficient form of trial division.
        
       | pxeger1 wrote:
       | Here is a regex to check for integer powers of four:
       | https://codegolf.stackexchange.com/a/223129
       | 
       | And a regex to integer-divide numbers by the square root of 2:
       | https://codegolf.stackexchange.com/a/198428
        
       | userbinator wrote:
       | It's basically https://en.wikipedia.org/wiki/Trial_division using
       | the + operator, and enforcing the bounds of the match to require
       | a zero remainder.
        
       | mmaunder wrote:
       | I prefer the source which I think is a clearer explanation. The
       | key is that perl's x operator turns the number into a string of
       | 1s that is the length of the number. The rest is a hack to use
       | regex to perform division using minimal capturing and back
       | references.
       | 
       | http://montreal.pm.org/tech/neil_kandalgaonkar.shtml
        
         | btdmaster wrote:
         | It's crazy how fast Perl's regex engine does this!
        
       | avinash wrote:
       | Aha. I wrote this post 15 years ago after explaining to my wife
       | (who is a Language major) how this "regular expression" works.
       | 
       | It has been discussed a few times on Hacker News, Reddit, etc.
       | over the past 15 years...
        
         | bmn__ wrote:
         | Credit is due to Abigail
         | <https://web.archive.org/web/2020/http://www.abigail.be/>.
         | Please give it and be a good Internet citizen by amending the
         | post.
        
         | phgn wrote:
         | @dang can we please add (2007) to the title?
        
           | beardyw wrote:
           | It doesn't seem to be past its "best before" date.
        
             | phgn wrote:
             | What do you mean? I actually love posts about old articles,
             | they're often more useful than any fleeting news.
        
         | chx wrote:
         | As an aside, a Hungarian mathematician wrote an entire book
         | explaining how mathematics works to a, I guess, a language
         | major (well, Marcell Benedek was a professor, a writer, a
         | literary historian, a translator, a theatre director).
         | 
         | https://en.wikipedia.org/wiki/Playing_with_Infinity
        
       | [deleted]
        
       | rofr999 wrote:
       | I'm going to test this on large numbers and post the results!
        
         | saagarjha wrote:
         | You'll probably not be able to finish before the 14-day reply
         | period expires.
        
         | xcambar wrote:
         | I can see you have a very efficient method for wasting time and
         | energy.
         | 
         | Why? The method works. Size doesn't matter. But it's extremely
         | inefficient. Which can be calculated rather than measured.
        
           | tmalsburg2 wrote:
           | I think OP understands this and their comment was a joke.
        
             | mrjin wrote:
             | Yep. It seems doesn't matter what, there is always people
             | trying to use it in a way it wasn't meant to.
        
             | wittycardio wrote:
             | Not sure if anything about his comment suggests that tbh.
        
           | reactspa wrote:
           | Is this a case of P != NP?
        
             | pxx wrote:
             | primality testing is in P by the AKS construction
             | (https://www.microsoft.com/en-us/research/wp-
             | content/uploads/...)
        
           | rofr999 wrote:
        
       | dhosek wrote:
       | At first I was thinking this was actually operating on a binary
       | representation of the numbers. Some interesting things become
       | apparent when you look at numbers in binary, for example, numbers
       | of the form 2^n-1 can be represented as a string of n 1s. From
       | there, it's pretty obvious that if n is composite that 2^n is
       | also composite since you can take that string of n=kp 1s and
       | write it as k groups of p 1s so obviously, 2^p-1 divides 2^n-1.
       | There's the famous quote from Leopold Kronecker, "God made the
       | integers" and I always thought that when he did, he wrote them in
       | binary.
        
         | Rerarom wrote:
         | No, in unary.
        
         | cmeacham98 wrote:
         | .
        
           | zodiac wrote:
           | From reading the proof, I think he meant "n composite => 2^n
           | - 1 composite" and it's just a typo
        
       | randv wrote:
       | while compact and looks elegant...
       | 
       | it is basically trying to group by 11 (divide by 2), if there no
       | match (via backtracking) group by 111 (divide by 3) and so on
       | 
       | not very efficient by any stretch esp with string matches.
        
         | randv wrote:
         | steps:
         | 
         | - it is turning the number into a string of 1s 3 => 111 4 =>
         | 1111 and so on
         | 
         | - then trying to simulate divisibility test by trying to find
         | perfect groups of '11' to simulate division by 2 failing which
         | (that is the backtracking part) and trying find perfect groups
         | of '111' to simulate division by 3 and so on...
         | 
         | Clever but very inefficient
        
       | dorianmariefr wrote:
       | > require "prime"         => true         > 10.prime?         =>
       | false
        
       | [deleted]
        
       | brightffw wrote:
        
       | tasty_freeze wrote:
       | What it is really doing is checking if the length of a string is
       | a prime number or not. And as coded, it is checking specifically
       | if the length of a string consisting of some number of '1'
       | characters is prime or not.
        
         | randv wrote:
         | - it is turning the number into a string of 1s 3 => 111 4 =>
         | 1111 and so on
         | 
         | - then trying to simulate divisibility test by trying to find
         | perfect groups of '11' to simulate division by 2 failing which
         | (that is the backtracking part) and trying find perfect groups
         | of '111' to simulate division by 3 and so on...
         | 
         | Clever but very inefficient
        
         | rawling wrote:
         | I've done some Codewars problems where you have to write a
         | regex to check whether an input number is divisible by a
         | specific number, so I was really interested to see how that
         | would work for primes.
         | 
         | Nope, it's just playing with lengths.
        
       | colanderman wrote:
       | I'll be "that guy" -- this is _not_ a regular expression [1]. A
       | Perl regexp, sure. But backreferences are not part of the grammar
       | of regular expressions, and moreover, using them in that
       | particular manner (to refer to a capture of unbounded length)
       | cannot be encoded as a DFA and therefore does not describe a
       | regular language.
       | 
       | (I have no clue whether unary-representation primality is
       | actually regular or not -- it's been far too long since my
       | automata class to remember how to (dis)prove that -- but I guess
       | it's at least sublinear, since primality is polynomial w/r/t the
       | logarithm of the number being tested.)
       | 
       | [1]
       | https://en.wikipedia.org/wiki/Regular_expression#Formal_defi...
        
         | dudeman13 wrote:
         | Thanks for that.
         | 
         | My initial reaction to the title was "wait, you can describe
         | prime numbers using formal languages?". I could feel it in my
         | guts that this would probably have some mind blowing Math/CS
         | implications.
         | 
         | A shame how I'm so rusty in my Mathematics and CS nowadays :(
        
           | aghilmort wrote:
           | had same immediate reaction, like whoa, huge if true, time to
           | read the comments
        
         | rurban wrote:
         | So left recursive is not regular, so not a language? Every
         | simple language is defined recursively. Regularity just defines
         | it as terminating, to avoid the simplicity of recursion.
         | 
         | DFA is the absurdity, not NFA.
        
           | colanderman wrote:
           | What are you talking about? "Not a language?" "DFA is the
           | absurdity?" I'm sorry, I don't follow.
        
             | rurban wrote:
             | read for example about the state explosion of DFA's:
             | 
             | "It is important because NFAs can be used to reduce the
             | complexity of the mathematical work required to establish
             | many important properties in the theory of computation. For
             | example, it is much easier to prove closure properties of
             | regular languages using NFAs than DFAs." https://en.m.wikip
             | edia.org/wiki/Nondeterministic_finite_auto...
             | 
             | but more importantly look at the various attempts to the
             | solve backtracking "problem" (stack overflow by unbounded
             | recursion), as done recently in perl and pcre, which led to
             | more insecure AND more unmaintainable regex code by
             | converting their simple recursion to iterations on the
             | heap. first it got much slower, as the state is now handled
             | on the heap, and second it got much more insecure as DOS
             | attacks can now exhaust the whole heap, not just around
             | 1024 stack depths. and third, it got 10x more complicated,
             | and unreadable.
             | 
             | ad language: you can of course define a language by
             | forbidding left recursion, which yacc folks would prefer.
             | but you can define it much easier with left recursive
             | parsers, like 10x smaller. of course you can always expand
             | simple NFA's to deterministic DFA's at the cost of
             | exponential state explosion and increased complexity. but
             | it is a monstrosity only impractical CS folks prefer. SW
             | engineers prefer the similar, faster and more secure NFA
             | backtracking versions, as long as you count the stack
             | depth.
        
               | colanderman wrote:
               | As I mentioned elsewhere, linear performance is a hard
               | requirement for some domains such as intrusion
               | prevention. Unbounded recursion isn't the primary
               | problem; exponential runtime is.
               | 
               | If state-space explosion is a concern, you can implement
               | a non-backtracking regular expression matcher using term
               | rewriting. For true regular expressions, the size of the
               | rewritten terms are a function of the size of the pattern
               | (and thus performance is linear). For non-regular
               | expressions, this does not hold.
        
               | jonstewart wrote:
               | You can also just implement an NFA but apply
               | determinization to much of it, with tight memory
               | allocation bounds, to get most of the benefit.
        
           | danieldk wrote:
           | _DFA is the absurdity, not NFA._
           | 
           | Every non-deterministic finite state automaton can be
           | converted into a deterministic finite state automaton that
           | accepts the same language (see powerset construction [1]), so
           | I am not sure what you are saying here.
           | 
           | [1] https://en.wikipedia.org/wiki/Powerset_construction
        
             | rurban wrote:
             | yes, and this absurdity does not make it a better language.
             | rather a worse.
             | 
             | eg tons of simple recursively defined languages were
             | converted to DFA monstrosities, and since then they became
             | slower and unmaintainable.
        
         | DyslexicAtheist wrote:
         | as the saying goes if your only tool is a hammer ...
         | perl/regexp is obviously the wrong tool for the job. Finding
         | new prime numbers, minting bitcoin, running seti@home, is what
         | C++ exceptions have been invented for.
        
           | colejohnson66 wrote:
           | Coming soon: "Mining Bitcoin at Compile Time using constexpr"
        
         | pxx wrote:
         | you can use the pumping lemma (https://en.wikipedia.org/wiki/Pu
         | mping_lemma_for_regular_lang...) to do this proof trivially.
         | it'll look something like
         | 
         | {1^x : x is prime} is not a regular language. suppose it is a
         | regular language. then by the pumping lemma there exists a p
         | such that every string of length at least p can be written as w
         | = xyz, where |y| >= 1, |xy| <= p [not an important observation
         | for this proof], and we can "pump" a new string w' = xy^nz (for
         | any n > 0) and have w' remain in L.
         | 
         | let p' be the prime immediately above p. then the string
         | 
         | w = 1^p'
         | 
         | is in the language L and must be "pumpable" for some N = |y| >
         | 0. Choose the number of times we "pump" (n in the expression
         | xy^nz) to be p'. Then that would imply
         | 
         | w' = 1^(p' + p' * N)
         | 
         | is in the language L, a contradiction, as p' + p' * N has
         | factors p' and (1 + N). Note we can also use the CFL pumping
         | lemma to prove that this particular language is not context-
         | free either; if we choose s = uvwxy, we can still just pump p'
         | times; the new string just is of size p' * (1 + |v| + |x|), and
         | |v| + |x| > 0.
        
           | thaumasiotes wrote:
           | > we can "pump" a new string w' = xy^nz (for any n > 0)
           | 
           | While this is certainly true, I would prefer to note that n
           | >= 0. n > 0 is just a special case of that.
           | 
           | (You can see clearly that 0 repetitions will always be valid
           | by considering the proof of the pumping lemma: since a
           | regular expression is recognized by a finite state machine,
           | the only way to recognize a string of length exceeding the
           | (finite!) number of states in the machine is to visit some
           | states more than once, which means traveling around a cycle
           | in the state graph. The 'pumpable substring' of the pumping
           | lemma is a path through such a cycle, and repeating it zero
           | times corresponds to traveling around the cycle zero times.
           | The rest of the path is just as valid as always.)
        
           | maweki wrote:
           | Just to note additionally that the pumping lemma may only be
           | used in that direction, not the other way.
           | 
           | The language {1^x : x is not prime} is easily pumped with
           | x=z=1 and y=11. And regular languages are closed under
           | complement.
           | 
           | No pumping Lemma -> Not regular, Yeah Pumping Lemma -> Who
           | knows
        
           | Capira wrote:
           | this is why I love hackernews
        
           | kevinventullo wrote:
           | It's been a while since I took Theory of Computation. Does
           | this imply that primes in binary or n-ary are also not
           | regular?
        
             | jwilk wrote:
             | You would need a different proof (likely a much more
             | complicated one) for n-ary primes.
        
               | jwilk wrote:
               | https://math.stackexchange.com/questions/1232463/how-to-
               | prov... has a proof for base 2. This method works for
               | other bases too.
        
           | colanderman wrote:
           | Fantastic! It's been 15 years since I've had to do those
           | proofs, I'll use your proof to refresh my memory how they
           | work.
        
         | petmon wrote:
         | "A Perl regexp, sure." is not part of the grammar of English
         | because it lacks a verb.
        
           | 867-5309 wrote:
           | perhaps you meant "does not follow". only the loose rules are
           | "part of"
           | 
           | besides, "it is" is implied in that sentence, so there's your
           | verb
        
           | xigoi wrote:
           | Natural languages don't have a formal grammar.
        
         | tromp wrote:
         | While we're nitpicking, it doesn't check for prime numbers
         | either. It checks for composite length.
        
         | nickelpro wrote:
         | I'll be "the other guy" -- no one uses the term regex or
         | regular expression in a programming context to mean a strictly
         | regular language in the theoretical computer science sense.
         | Nitpicking this point is ignorant of how the usage of the term
         | has evolved within the programming community.
         | 
         | Good. There's that obligative conversation over and done with.
        
           | firechickenbird wrote:
           | Regular expressions are Regular Languages, thus equivalent to
           | deterministic finite state automata. These languages have
           | very interesting properties due to their regularity (eg.
           | pumping lemma), and they are a very small strict subset of
           | Context-Free languages, which are also a strict subset of
           | Turing-Complete languages. If you add recursion and
           | backtracking to RegExps you easily go outside the Regular
           | Languages class, and you probably end up in the TC class. Now
           | you are dealing with an undecidable machine which is also
           | very hard to define and reason about (and you also have two
           | more problems). What's the point of calling it regular?
        
             | nickelpro wrote:
             | I'm hitting Poe's Law here, are you joking?
             | 
             | Your average coding bootcamp kid cracking away in node and
             | invoking the builtin regex engine has no idea what any of
             | those words mean.
             | 
             | To those programmers the words "regular expression" have
             | become completely separated from their original etymology
             | in lexical analysis. To them a regex means "input to a
             | regex engine", and thus the term "use a regular expression"
             | means "use a regex engine". It could be "use a froofroo
             | boingo", for all the meaning the words hold. "Regular
             | expression" just became the vernacular due to its
             | origination in lexical analysis.
        
               | firechickenbird wrote:
               | Wow, I have no idea what regexps mean? I can write a book
               | about regexps, and if you really want to learn what
               | regexps are, let me know, we can scedule a session.
               | 
               | Anyway, real REGUALR expressions belong to the regular
               | languages class, thus equivalent to DFAs. Given a regexp,
               | you can construct a DFA in O(2^n), where n is the number
               | of operators in the regexp. Now this DFA will have a
               | Linear time complexity match. If you extend regular
               | expressions with backreferences, you can't build DFAs
               | anymore, and you also lose the fundamental property of
               | regularity: the pumping lemma for regular languages.
               | Furthermore, the languages you can build with this
               | extension go even outside the context-free class. Not
               | only that: you are also dealing with an NP-complete
               | matching. Again, why do you still call them regular?
        
               | Biganon wrote:
               | I think you two are agreeing 100%, but your egos prevent
               | you from even realizing it
        
           | eutropia wrote:
           | I'll be the guy who postscripts that "hardmode" regexes would
           | have important mathematical and computer science implications
           | when/if they solve certain problems (like identifying
           | primes). So it's not pure pedantry, and there's an important
           | difference between "regexes to solve programming problems"
           | and "regexes solving math problems".
        
             | dpwm wrote:
             | "hardmode" regexes do have important computer science
             | implications: they can be matched orders of magnitude
             | faster [0].
             | 
             | Demonstrating the power of a regex engine by solving a
             | problem like this is interesting and very clever, but it's
             | hopefully not used in production.
             | 
             | [0] https://swtch.com/~rsc/regexp/regexp1.html
        
           | colanderman wrote:
           | Having worked in an industry where the linear vs. exponential
           | performance implications of regular expressions and PCREs was
           | crucial (intrusion protection systems), I'd just as soon
           | software professionals stop being sloppy with terminology.
           | 
           | Not to mention the plethora of mostly-but-not-totally
           | compatible regexp languages which exist. C++ supports 6 [1],
           | and that doesn't even include PCREs. Miscommunication about
           | which language one is actually using is a common source of
           | regexp bugs.
           | 
           | [1]
           | https://en.cppreference.com/w/cpp/regex/syntax_option_type
        
             | earleybird wrote:
             | Yes, thank you.
             | 
             | If you're sloppy with words you probably have other nasty
             | habits. (apologies to Heinlein)
             | 
             | There are reasons to employ regular languages with
             | extensions (to not quite context free languages) but be
             | clear about what you're talking about.
        
               | efitz wrote:
               | Grammar sticklers usually fail on this point: human
               | languages are not static, stuck in place from the time
               | the stickler learned the rules.
               | 
               | The current state of a human language is defined by
               | common usage. The purpose of a language is for people to
               | be able to communicate ideas. Ideas exist all up and down
               | a specificity spectrum, e.g. "regex" might mean a
               | specific syntax or one of several syntaxes depending on
               | context. Kind of like "kleenex" might refer to a specific
               | brand or to the general concept of tissue.
               | 
               | If 99.9% of people successfully (by their own measure,
               | not yours) communicate what they mean when they use the
               | term "regular expression", but you are among the 0.1% for
               | which that interaction fails, then I would argue that the
               | problem is your pedantry, not the term. I'm not saying
               | this to be mean; I have similar issues.
        
               | bawolff wrote:
               | I'd argue that if this is sloppy, then the use of "you"
               | instead of "thou" is sloppy too. Using "you" for
               | everything is confusing. How can i possibly know if you
               | mean plural or singular?
               | 
               | Sure some people might say "thou" is archaic, however so
               | is using "regex" outside of academic contexts to mean
               | regular languages. I don't see how one could justify one
               | but not the other.
        
             | lloeki wrote:
             | > Having worked in an industry where the linear vs.
             | exponential performance implications of regular expressions
             | and PCREs was crucial (intrusion protection systems)
             | 
             | same boat. that's why we use re2 in a way inspired by Go
             | (which means losing a few features but the tradeoff is
             | worth it):
             | 
             | > The regexp implementation provided by this package is
             | guaranteed to run in time linear in the size of the input.
             | (This is a property not guaranteed by most open source
             | implementations of regular expressions.) For more
             | information about this property, see
             | 
             | > https://swtch.com/~rsc/regexp/regexp1.html
             | 
             | > or any book about automata theory.
             | 
             | https://pkg.go.dev/regexp
        
               | colanderman wrote:
               | OCaml's regexps run in linear time also. They're based on
               | Ville Laurikari's Tagged REs: https://laurikari.net/tre/
        
             | skissane wrote:
             | > C++ supports 6
             | 
             | Wow, TIL. I always thought C++ had evolved into an
             | excessively complex language, but this really seems to take
             | the cake. ECMAScript (with C++-only changes), basic POSIX,
             | extended POSIX, POSIX awk, POSIX grep, and POSIX extended
             | grep. You are right there is no PCRE in there, yet.
             | (There's always C++23... and if not that, C++26 or C++30)
             | 
             | (Although ECMASCript regex was heavily inspired by
             | Perl/PCRE anyway, albeit without being 100% compatible with
             | either, so in a sense PCRE is already there.)
        
               | leni536 wrote:
               | It's unlikely that std::regex will gain new dialects.
               | Unfortunately it is very broken, it's unlikely that a
               | proposal to extend it would go through.
        
               | MauranKilom wrote:
               | Can you elaborate (or point me to sources about) how
               | std::regex is broken? I'd like to learn more.
        
               | leni536 wrote:
               | AFAIK it's not even the interface that is broken, but
               | some standard libraries poushed out slow implementations
               | that cannot be changed without an ABI break.
        
             | tomcam wrote:
             | I tried to imagine the test suite for these C++ variations
             | but failed due to a coronary event
        
           | bspammer wrote:
           | This isn't entirely a pedantic thing. Cloudflare had an
           | outage a couple of years ago which was caused by someone not
           | knowing the difference:
           | 
           | https://blog.cloudflare.com/details-of-the-cloudflare-
           | outage...
        
           | Thorrez wrote:
           | >no one uses the term regex or regular expression in a
           | programming context to mean a strictly regular language in
           | the theoretical computer science sense.
           | 
           | The RE2 library seems to:
           | 
           | >loosely speaking, a regular language is a set of strings
           | that can be matched in a single pass through the text using
           | only a fixed amount of memory.
           | 
           | https://github.com/google/re2/wiki/Syntax
           | 
           | Note that RE2 doesn't support backreferences.
        
           | jltsiren wrote:
           | There is no such thing as _the_ programming community. Only a
           | huge number of partially overlapping communities using wildly
           | different terminology. Some communities still talk about
           | regular expressions in the original meaning (because labeled
           | graphs are useful objects), just like some communities still
           | use the word  "algorithm" outside machine learning or
           | "crypto" in non-blockchain contexts.
           | 
           | I personally prefer using "regular expression" in the
           | language-theoretic sense and "regex" or "regexp" for related
           | concepts in programming languages and software.
        
             | [deleted]
        
             | dpwm wrote:
             | > I personally prefer using "regular expression" in the
             | language-theoretic sense and "regex" or "regexp" for
             | related concepts in programming languages and software.
             | 
             | This is also what Larry Wall proposes:
             | 
             | > [...] Pattern Matching, generally having to do with what
             | we call "regular expressions", which are only marginally
             | related to real regular expressions. Nevertheless, the term
             | has grown with the capabilities of our pattern matching
             | engines, so I'm not going to try to fight linguistic
             | necessity here. I will, however, generally call them
             | "regexes" (or "regexen", when I'm in an Anglo-Saxon mood).
             | 
             | https://raku.org/archive/doc/design/apo/A05.html
        
               | thaumasiotes wrote:
               | > (or "regexen", when I'm in an Anglo-Saxon mood)
               | 
               | There's some kind of idea going around that because the
               | plural of "ox" is "oxen", plurals with _-n_ must be
               | triggered by a word ending in  "x". So you see cutesy
               | references to "boxen" or similar.
               | 
               | But _-n_ is just an ordinary plural marker; the same
               | ending that makes  "oxen" the plural of "ox" also makes
               | "kine" the plural of "cow" and "children" the plural of
               | "child". (In those last two cases, the words are sporting
               | two plural markers in sequence.)
        
               | knome wrote:
               | >There's some kind of idea going around
               | 
               | It's not new, by any stretch. I remember seeing such many
               | references in the early 2000s and beyond. Doing some
               | vague poking around textfiles.com, I didn't pull any easy
               | to find 1980s 'boxen' references. It almost looks like it
               | might have been the addition of 'unix boxen' to the 'new
               | hacker dictionary' by esr which spread the joke into the
               | mid-90s communities. I wasn't online at that point
               | myself, so I can't speak from any personal experience.
               | But the various hacker zines etc are using 'boxen' by the
               | later 90s and early 2000s, which would allow for it to
               | have spread through the folks of the time finding and
               | trolling through esr's version of the 'hacker
               | dictionary'. https://www.dourish.com/goodies/jargon.html
               | complains about esr adding unix references, and does not
               | include any boxen references that are found in the http:/
               | /cd.textfiles.com/hackersencyc/ETC/MISC/JARG_331.TXT
               | version folks in the 90s would likely have spread around.
               | 
               | edit: nope. definitely there in the 80s on usenet
               | 
               | so he was just encoding existing slang usage, rather than
               | introducing it, on this point.
               | 
               | https://www.usenetarchives.com/view.php?id=net.wanted&mid
               | =PD...
               | 
               | more edit: considered the second oldest boxen reference
               | in the usenet archive is esr himself, maybe he was the
               | one pushing it :)
               | 
               | https://www.usenetarchives.com/view.php?id=comp.sys.att&m
               | id=...
        
               | thaumasiotes wrote:
               | I didn't say it was new. I said it was current. (And, um,
               | unmotivated.)
               | 
               | We might also note that from the perspective of modern
               | English, plurals in -en are more characteristic of
               | Geoffrey Chaucer than of Alfred the Great. We preserve
               | three such words: oxen, children, and brethren. But only
               | one of those is an Anglo-Saxon form; in Old English, the
               | plural forms of cild and brothor are cildru and
               | brothru[1]. The -ens appeared later.
               | 
               | [1] "Brother" is a fairly irregular word, but there's no
               | form ending in -n.
        
               | samatman wrote:
               | > _[1] "Brother" is a fairly irregular word, but there's
               | no form ending in -n._
               | 
               | It's brethren, for the record.
        
               | ectopod wrote:
               | That footnote is clearly talking about Old English.
        
               | samatman wrote:
               | Well. You're on Hacker News. So I'm going to quote the
               | Jargon File and then you'll stop complaining. Right,
               | cobber?
               | 
               | > Further, note the prevalence of certain kinds of
               | nonstandard plural forms. Some of these go back quite a
               | ways; the TMRC Dictionary includes an entry which implies
               | that the plural of `mouse' is {meeces}, and notes that
               | the defined plural of `caboose' is `cabeese'. This latter
               | has apparently been standard (or at least a standard
               | joke) among railfans (railroad enthusiasts) for many
               | years.
               | 
               | > On a similarly Anglo-Saxon note, almost anything ending
               | in `x' may form plurals in `-xen' (see {VAXen} and
               | {boxen} in the main text). Even words ending in phonetic
               | /k/ alone are sometimes treated this way; e.g., `soxen'
               | for a bunch of socks. Other funny plurals are
               | `frobbotzim' for the plural of `frobbozz' (see
               | {frobnitz}) and `Unices' and `Twenices' (rather than
               | `Unixes' and `Twenexes'; see {UNIX}, {TWENEX} in main
               | text). But note that `Unixen' and `Twenexen' are never
               | used; it has been suggested that this is because `-ix'
               | and `-ex' are Latin singular endings that attract a
               | Latinate plural. Finally, it has been suggested to
               | general approval that the plural of `mongoose' ought to
               | be `polygoose'.
               | 
               | > The pattern here, as with other hackish grammatical
               | quirks, is generalization of an inflectional rule that in
               | English is either an import or a fossil (such as the
               | Hebrew plural ending `-im', or the Anglo-Saxon plural
               | suffix `-en') to cases where it isn't normally considered
               | to apply.
        
         | polski-g wrote:
         | I wish everytime an application said "regular expression" it
         | meant PCRE. People say sed supports RE but it's so divorced
         | from PCRE I end up using `| perl -pe ....`
        
       | nvader wrote:
       | And takes space exponential to the number of bits in the input.
        
         | xcambar wrote:
         | "Take a problem. Solve it with a RegExp. Now you have two
         | problems".
        
           | asicsp wrote:
           | See https://blog.codinghorror.com/regular-expressions-now-
           | you-ha... for context around this (mis)quote.
        
             | bmn__ wrote:
             | Collapse that indirect pointer. The source is
             | http://regex.info/blog/2006-09-15/247
        
             | xcambar wrote:
             | Thx, I had actually no idea it came from there.
        
       | cqcn1991 wrote:
       | this is very useful; thanks!
        
       | rabbits77 wrote:
       | I think it's worth noting that the original author of the regex
       | primality test is Abigail, a well known Perl hacker.
       | 
       | https://www.masteringperl.org/2013/06/how-abigails-prime-num...
       | 
       | This blog article does a nice job discussing Abigail's work.
        
       | ahmed_ds wrote:
       | I feel like Charlie Kelly from IASIP reading this regex. "1.
       | Starts out with 1, then there is other stuff...."
       | 
       | https://youtu.be/AFDU_R6ShmI?t=74
        
         | Dylan16807 wrote:
         | Huh, looking at some regex explaining sites, they don't seem to
         | do a good job of simplifying.
         | 
         | Which is a shame because it's really easy to simplify away most
         | of this regex.
         | 
         | / does nothing
         | 
         | ^ and $ only exist to make it do the whole string
         | 
         | | is just saying "or", and only the part to the right really
         | matters.
         | 
         | And the ? doesn't do anything meaningful.
         | 
         | Throw all that out and all you have is (11+)\1+
         | 
         | It's still a mash of symbols, but it's _much_ simpler.
         | 
         | And changing it to (cc+)\1+ would probably help naive
         | readability too.
        
           | mordechai9000 wrote:
           | The question mark isn't meaningless - it changes the matching
           | behavior of + from greedy to non-greedy. The non-greedy
           | version will try to match shorter strings first. This should
           | be more efficient, because non-primes will fail sooner. Most
           | numbers will fail a lot sooner.
        
             | Dylan16807 wrote:
             | It takes just as long on primes, so the overall runtime
             | will be very similar. So sure it's an optimization but it's
             | not one to care about when you're trying to understand it.
        
           | aghilmort wrote:
           | my fave regex explainer gives this, idk that it adds much? ht
           | tps://regexper.com/#%2F%5E1%3F%24%7C%5E%2811%2B%3F%29%5C1%...
        
             | Dylan16807 wrote:
             | That's the best I've seen so far, at least. Much better
             | than a bunch of bullet points of various indentation.
        
       | ogogmad wrote:
       | Can you decide if a number is prime using a _true_ regular
       | expression?
       | 
       | No. This won't work because the time cost of primality testing is
       | too high (cubic in the number of binary digits?), and checking if
       | a string satisfies a regular expression takes only linear time.
       | Since cubic is bigger than linear, this won't work. Too good to
       | be true!
       | 
       | Not a proof, but a rule of thumb. No need to even use the pumping
       | lemma.
        
         | thaumasiotes wrote:
         | > This won't work because the time cost of primality testing is
         | too high
         | 
         | This is just a conjecture, though.
         | 
         | > No need to even use the pumping lemma.
         | 
         | But it's _really easy_ to prove this with the pumping lemma.
         | This could be the first problem in a textbook section that
         | introduces the pumping lemma. It would be difficult to come up
         | with a language for which the pumping lemma disproof was more
         | obvious.
        
           | chakkepolja wrote:
           | The canonical example for pumping lemma disproof is a^n b^n,
           | which is more obvious IMO.
        
             | thaumasiotes wrote:
             | Here are the disproofs:
             | 
             | -----
             | 
             | (1) a^n b^n
             | 
             | The pumping lemma tells us that for some string in this
             | language, we can divide the string into three substrings _x
             | y z_ such that _x_ _y_ ^ _n_ _z_ remains in the language
             | for all _n_.
             | 
             | There are two cases:
             | 
             | (1)(a) The number of 'a' characters in _y_ does not equal
             | the number of  'b' characters. In this case, pumping _y_
             | will violate the constraint that a string in the language
             | must have an equal number of  'a's and 'b's.
             | 
             | (1)(b) _y_ consists of _k_ 'a's followed by _k_ 'b's for
             | some _k_ > 0. In this case, pumping it will violate the
             | constraint that for every string in this language, each 'a'
             | character must precede every 'b' character.
             | 
             | Thus, the language a^n b^n is not regular.
             | 
             | -----
             | 
             | (2) a^n, where n is prime
             | 
             | The pumping lemma tells us that for some string in this
             | language, we can divide the string into three substrings _x
             | y z_ such that _x_ _y_ ^ _n_ _z_ remains in the language
             | for all _n_. [same as above]
             | 
             | But the string _x_ _y_ ^| _xz_ | _z_ has length | _xz_ | +
             | | _y_ |*| _xz_ |, which is not prime.
             | 
             | Thus, the language a^n (where n is prime) is not regular.
             | 
             | -----
             | 
             | Which of those was easier?
        
         | Yajirobe wrote:
         | > Since cubic is bigger than linear, this won't work.
         | 
         | So it actually 'works', it's just slower.
        
           | laszlokorte wrote:
           | No, the point is that the article does not use a real regular
           | expression: 1. Real regular expressiona can always be checked
           | in linear time 2. primes can not be checked in linear time
           | Ergo: primes can not be checked with regular expressions,
           | otherwise there would be a way to check primes in linear
           | time, which would contratict premise 2
        
       ___________________________________________________________________
       (page generated 2022-03-05 23:02 UTC)