[HN Gopher] Fun with regular expressions: part I
       ___________________________________________________________________
        
       Fun with regular expressions: part I
        
       Author : kulikov009
       Score  : 71 points
       Date   : 2021-08-20 08:35 UTC (14 hours ago)
        
 (HTM) web link (yurichev.com)
 (TXT) w3m dump (yurichev.com)
        
       | holoduke wrote:
       | Am I the only one who keeps forgetting even most basic
       | expressions. I am using them for a long time, but somehow not
       | enough to remember them. Always find what I need in a few minutes
       | by using Google, but still. Not proud of it :)
        
         | [deleted]
        
         | mdp2021 wrote:
         | Re-thinking them in real time is good for you, and a much
         | better exercise than memorizing them.
         | 
         | To remember notions, a necessary but less noble capacity, we
         | have invented tools well past the Memex. To be "smart", not
         | yet.
        
         | fifticon wrote:
         | On this off-topic, learning shortcuts for menus: I wish
         | applications allowed you to enable 'show-me' menus. It would be
         | a browsable GUI menu as you know them, but it would not RUN any
         | commands! Only let you see the keyboard shortcut. So you would
         | always have to _type_ the shortcut you had looked up. I naively
         | assume this would accelerate remembering the shortcut keys.
        
         | asicsp wrote:
         | Try maintaining your own cheatsheet. Every time you need to
         | look up something, make a note along with examples and
         | documentation links. This should help at least for the basic
         | usage.
        
         | bloopernova wrote:
         | I have this issue with certain Emacs commands. I don't know if
         | this will work for you, but here is what I've done to mitigate
         | that loss:
         | 
         | I use a permanent marker on a 3x5 index card, and write the
         | command and key combo in black, and then a very "simple"
         | explanation in blue. For instance, "SWIPER SEARCH" "C-c C-s"
         | "inside files". or "DEFT SEARCH" "C-c n d" "across files".
         | 
         | Those index cards get pinned on my home office wall above my
         | main monitor screen. Seems to help a bit for me, not sure if it
         | would for other folks.
         | 
         | Here's a pic for the visually inclined!
         | https://imgur.com/gallery/zlY5wYH
        
           | Zababa wrote:
           | In case you never heard about it (and for other people),
           | these cards seem like the perfect thing to put in a spaced
           | repetition [1] program like Anki [2].
           | 
           | [1]: https://en.wikipedia.org/wiki/Spaced_repetition
           | 
           | [2]: https://apps.ankiweb.net/
        
           | teddyh wrote:
           | The GNU Emacs manual includes a printable reference card,
           | available separately here:
           | https://www.gnu.org/software/emacs/refcards/
        
         | veltas wrote:
         | Just trying thinking about one thing you forget a bit longer
         | each time you encounter it, and trying to rationalise why the
         | syntax is as it is. Slowly build it up. You don't need to
         | remember the whole thing at once.
        
         | Joker_vD wrote:
         | No, you're not the only one. That's actually the main reason
         | why I personally stopped using them entirely several years ago
         | and just write straightforward loops that search for things I
         | need in the input string.
        
         | ALotOfBees wrote:
         | remembering they exist is all you need. the actual syntax is
         | just a quick google away. but yes, I can never remember almost
         | any expressions either
        
         | yissp wrote:
         | The thing that gets me is remembering the slightly different
         | rules for the various regex engines. Like which ones support
         | which operators, which operators need to be escaped, etc.
        
         | iddan wrote:
         | What helped me the most is reading and understanding the
         | explanations on https://regexr.com/
        
       | geokon wrote:
       | What I like about regex is that it's kinda "cross platform".
       | 
       | I always though that one could make a markup (like markdown)
       | based on regex . Then you could almost directly parse it in any
       | language environment. I'm not sure what the major drawbacks would
       | be though. I guess the expression could get hairy pretty fast as
       | you handle edge cases :)
        
         | JadeNB wrote:
         | > I always though that one could make a markup (like markdown)
         | based on regex . Then you could almost directly parse it in any
         | language environment. I'm not sure what the major drawbacks
         | would be though. I guess the expression could get hairy pretty
         | fast as you handle edge cases :)
         | 
         | I'm pretty sure Gruber's original Markdown parser was just
         | regex-based (which is why there are so many weird
         | underspecifications and corner cases). Or do you mean that
         | somehow the document itself is written in a flavor of regex?
         | I'm intrigued but puzzled: what would that even look like?
        
           | geokon wrote:
           | oh no, I meant the former. I guess it's just come from my
           | personally use of orgmode day to day - which is great
           | feature-wise but has no spec and I'm shackled to Emacs/Elisp.
           | It just feels like things could have been designed better
           | from the getgo. And markup is something that at least at face
           | value simple enough that you could achieve what you want
           | without a full parser/grammar
           | 
           | even though regex isn't quite crossplatform/standardized -
           | it's much more standard than any way of expressing grammars
        
             | JadeNB wrote:
             | BNF is pretty standard, no? Maybe I just haven't seen
             | enough of the ways it is surely abused in the wild ....
        
               | geokon wrote:
               | I've honestly never come across it since university - but
               | maybe that's me being a noob. I don't get the impression
               | it's as copy-n-paste as regex. and well every lang has
               | regex
        
       | esnard wrote:
       | The most fun regex I know is this one:
       | .?|(..+?)\\1+
       | 
       | Which is used for primality checking (applied to the input string
       | length).
       | 
       | It's not that hard to understand compared to some others, but
       | being able to do those types of computations with regex is really
       | mind-blowing to me.
        
         | lordnacho wrote:
         | Someone needs to do one of those "Guess if it's this or that"
         | sites with regex and q (kdb database) expressions.
        
         | nsajko wrote:
         | Note that this "regex", unlike the ones in the article, is not
         | actually a regular expression. You could perhaps call it an
         | irregular expression. There can be no such thing as a
         | backreference in a regular expression.
         | 
         | More info: https://swtch.com/~rsc/regexp/regexp1.html
        
           | derefr wrote:
           | I think you're confusing "regular expression" (RE) for
           | "deterministic finite-state automata" (DFA).
           | 
           | "Regular Expression" is just the name for the grammar/formal
           | language model that REs present. REs are a useful way to
           | _encode_ DFAs, but not all things that can be expressed using
           | RE formal-language are DFAs.
           | 
           | Some REs (i.e. the "Perl-compatible" or "extended" Regular
           | Expressions) are Nondeterministic Finite-state Automata or
           | "NFAs". This doesn't change the fact that the language used
           | to express them makes them Regular Expressions.
        
             | VenTatsu wrote:
             | The term "Regular Expression" is very often misused, it had
             | a very well defined meaning in the field of formal language
             | theory, but like most terms when barrowed by another field
             | some of that meaning is lost or transformed. Some
             | documentation has shifted to using terms like "pattern" or
             | "pattern matching expression" to convey the meaning without
             | the baggage.
             | 
             | > Some REs (i.e. the "Perl-compatible" or "extended"
             | Regular Expressions) are Nondeterministic Finite-state
             | Automata or "NFAs".
             | 
             | That those engines are implemented using an NFA or a DFA
             | does not actually matter for the question of being regular
             | or not. A given pattern may be Regular while another may
             | not be. There are multiple technical reasons these engines
             | are built on NFA's and not DFA's, supporting non-regular
             | expressions is one, but not the only, reason.
             | 
             | Ironically the library called "PCRE" or "Perl-compatible
             | Regular Expressions" is in-fact not "Perl-compatible" (nor
             | regular). It is at the same time both named "Perl-
             | compatible" and absolutely not Perl-compatible. Both PCRE
             | library and the Perl language have evolved and added
             | mutually incompatible features which results in a valid
             | PCRE matching expression failing to compile in Perl and a
             | valid Perl matching expression failing to compile in PCRE.
             | Just because that is the name doesn't make it true.
        
             | JadeNB wrote:
             | > I think you're confusing "regular expression" (RE) for
             | "deterministic finite-state automata" (DFA).
             | 
             | I think it's about descriptivism vs prescriptivism.
             | "Regular expression" did start life with a technical
             | meaning, according to which REs had equivalent expressive
             | power to DFAs. The term has since been appropriated and
             | diluted by other languages, and it is not entirely
             | unreasonable to (though I prefer not to) take "regular
             | expression" to mean "whatever programming languages present
             | as regular expression"; but I think it's not quite right to
             | say that someone is confused who believes in maintaining
             | the original distinction.
        
               | nsajko wrote:
               | I agree with you (obviously), but I'm not sure we should
               | invoke "descriptivism vs prescriptivism" here. As far as
               | I understand that distinction is mostly applicable to
               | English (or other natural language) teaching, which is
               | usually quite prescriptive, but should also have varying
               | descriptive elements. And to dictionaries, which are(?)
               | mostly descriptive (even when that means listing two
               | conflicting definitions for a single word), but have to
               | make some prescriptive editing decisions.
               | 
               | When it comes to actual semantic issues like this one, I
               | think that invoking "p vs d" just muddies the waters,
               | because such issues should be examined on a case-by-case
               | basis.
               | 
               | This issue specifically is quite akin to the "literally"
               | case: different instances of usage of the same phrase
               | have (almost) opposite meanings:
               | 
               | See https://en.wiktionary.org/wiki/literally and
               | https://en.wikipedia.org/wiki/Chomsky_hierarchy
               | 
               | I think it's fair to say the Perl-style semantics are
               | just wrong, because there's no way to use them without
               | being confusing.
        
             | charlesdaniels wrote:
             | Regular expressions, deterministic finite automata, and
             | nondeterministic finite automata are all equivalent[0][1].
             | 
             | All three of these representations are capable of
             | describing any regular language (set of symbol sequences,
             | or more intuitively a set of strings), and the fact that a
             | language can be described by an NFA, DFA, or RE implies
             | that it is regular.
             | 
             | I am not hugely familiar with Pearl's "extended regular
             | expression" system, however I was under the understanding
             | that the set of languages it can recognize is a superset of
             | the set of all regular languages. Based on [2], it would
             | appear that Perl regexes can recognize all regular
             | languages, and parts of the set of all Turing-recognizable
             | languages.
             | 
             | 0 - Introduction to the Theory of Computation 3/e, Michael
             | Sipser, Thm 1.39, pp. 55.
             | 
             | 1 - Introduction to the Theory of Computation 3/e, Michael
             | Sipser, Thm 1.54, pp. 67.
             | 
             | 2- https://www.perlmonks.org/?node_id=809842
        
               | beecafe wrote:
               | FWIW the equivalence between NFA and DFA requires an
               | exponential space increase to encode the NFA as a DFA,
               | with an exponential space blow up you can encode a lot of
               | things as DFAs (I'm pretty sure you could encode a Turing
               | machine that uses bounded space on the tape as a DFA with
               | exponentially more space, "just" make each possible
               | configuration one state in the DFA)
        
       | alex_smart wrote:
       | The whole discussion at the end of part 2 feels so strage to me.
       | 
       | >using the standard RE, it's impossible to match only DD-MM-YYYY
       | or DDMMYYYY strings without matching DD-MMYYYY or DDMM-YYYY,
       | because it's impossible to represent this in DFA form.
       | 
       | And then there is the reader comment that finds a DFA for this
       | particular problem but then wonders about another:
       | 
       | > now whether one could _also_ accept YYYY-MM-DD /YYYYMMDD format
       | with the same regexps, it might require some sort of deeper magic
       | and back-tracking
       | 
       | Regular languages are closed under union and there is even a
       | straightforward translation to regular expressions (the |
       | operator). Am I really missing something here? It has been a long
       | time since I took a compiler course, but I would be damned.
        
       | mistspeaker wrote:
       | Thanks for this, as a new learner I'll take any resource for
       | regex I can find.
        
       | gigatexal wrote:
       | Regex and fun in the same sentence? Hah
        
       | elanning wrote:
       | I recently wrote a little Regex domain specific language that's
       | translated into regular JavaScript regex. I was inspired by
       | CCGrep: https://arxiv.org/abs/2003.05615 because its syntax
       | looked so clean and yet powerful.
       | 
       | Examples of it look like `$# ($a == $1) { return $$$; }` $# -
       | match any keyword, eg `do`, `while`, etc. $a - match any
       | variable. $1 - match any literal. $$$ - match any block (greedy).
       | 
       | It's also whitespace invariant, so `if($a==$1)` is equivalent to
       | `if ($a == $1)`.
       | 
       | All that to say, I wonder if we're missing out on a variety of
       | "domain specific regexes" for various fields.
        
       | aeortiz wrote:
       | "Fun" and "regular expressions" don't belong in the same
       | sentence!
       | 
       | Of course they have their uses, but I think they should be taken
       | seriously, since sadly, corner cases are hard to weed out of
       | them.
        
       ___________________________________________________________________
       (page generated 2021-08-20 23:02 UTC)