[HN Gopher] Fun with regular expressions: part I
___________________________________________________________________
Fun with regular expressions: part I
Author : kulikov009
Score : 71 points
Date : 2021-08-20 08:35 UTC (14 hours ago)
(HTM) web link (yurichev.com)
(TXT) w3m dump (yurichev.com)
| holoduke wrote:
| Am I the only one who keeps forgetting even most basic
| expressions. I am using them for a long time, but somehow not
| enough to remember them. Always find what I need in a few minutes
| by using Google, but still. Not proud of it :)
| [deleted]
| mdp2021 wrote:
| Re-thinking them in real time is good for you, and a much
| better exercise than memorizing them.
|
| To remember notions, a necessary but less noble capacity, we
| have invented tools well past the Memex. To be "smart", not
| yet.
| fifticon wrote:
| On this off-topic, learning shortcuts for menus: I wish
| applications allowed you to enable 'show-me' menus. It would be
| a browsable GUI menu as you know them, but it would not RUN any
| commands! Only let you see the keyboard shortcut. So you would
| always have to _type_ the shortcut you had looked up. I naively
| assume this would accelerate remembering the shortcut keys.
| asicsp wrote:
| Try maintaining your own cheatsheet. Every time you need to
| look up something, make a note along with examples and
| documentation links. This should help at least for the basic
| usage.
| bloopernova wrote:
| I have this issue with certain Emacs commands. I don't know if
| this will work for you, but here is what I've done to mitigate
| that loss:
|
| I use a permanent marker on a 3x5 index card, and write the
| command and key combo in black, and then a very "simple"
| explanation in blue. For instance, "SWIPER SEARCH" "C-c C-s"
| "inside files". or "DEFT SEARCH" "C-c n d" "across files".
|
| Those index cards get pinned on my home office wall above my
| main monitor screen. Seems to help a bit for me, not sure if it
| would for other folks.
|
| Here's a pic for the visually inclined!
| https://imgur.com/gallery/zlY5wYH
| Zababa wrote:
| In case you never heard about it (and for other people),
| these cards seem like the perfect thing to put in a spaced
| repetition [1] program like Anki [2].
|
| [1]: https://en.wikipedia.org/wiki/Spaced_repetition
|
| [2]: https://apps.ankiweb.net/
| teddyh wrote:
| The GNU Emacs manual includes a printable reference card,
| available separately here:
| https://www.gnu.org/software/emacs/refcards/
| veltas wrote:
| Just trying thinking about one thing you forget a bit longer
| each time you encounter it, and trying to rationalise why the
| syntax is as it is. Slowly build it up. You don't need to
| remember the whole thing at once.
| Joker_vD wrote:
| No, you're not the only one. That's actually the main reason
| why I personally stopped using them entirely several years ago
| and just write straightforward loops that search for things I
| need in the input string.
| ALotOfBees wrote:
| remembering they exist is all you need. the actual syntax is
| just a quick google away. but yes, I can never remember almost
| any expressions either
| yissp wrote:
| The thing that gets me is remembering the slightly different
| rules for the various regex engines. Like which ones support
| which operators, which operators need to be escaped, etc.
| iddan wrote:
| What helped me the most is reading and understanding the
| explanations on https://regexr.com/
| geokon wrote:
| What I like about regex is that it's kinda "cross platform".
|
| I always though that one could make a markup (like markdown)
| based on regex . Then you could almost directly parse it in any
| language environment. I'm not sure what the major drawbacks would
| be though. I guess the expression could get hairy pretty fast as
| you handle edge cases :)
| JadeNB wrote:
| > I always though that one could make a markup (like markdown)
| based on regex . Then you could almost directly parse it in any
| language environment. I'm not sure what the major drawbacks
| would be though. I guess the expression could get hairy pretty
| fast as you handle edge cases :)
|
| I'm pretty sure Gruber's original Markdown parser was just
| regex-based (which is why there are so many weird
| underspecifications and corner cases). Or do you mean that
| somehow the document itself is written in a flavor of regex?
| I'm intrigued but puzzled: what would that even look like?
| geokon wrote:
| oh no, I meant the former. I guess it's just come from my
| personally use of orgmode day to day - which is great
| feature-wise but has no spec and I'm shackled to Emacs/Elisp.
| It just feels like things could have been designed better
| from the getgo. And markup is something that at least at face
| value simple enough that you could achieve what you want
| without a full parser/grammar
|
| even though regex isn't quite crossplatform/standardized -
| it's much more standard than any way of expressing grammars
| JadeNB wrote:
| BNF is pretty standard, no? Maybe I just haven't seen
| enough of the ways it is surely abused in the wild ....
| geokon wrote:
| I've honestly never come across it since university - but
| maybe that's me being a noob. I don't get the impression
| it's as copy-n-paste as regex. and well every lang has
| regex
| esnard wrote:
| The most fun regex I know is this one:
| .?|(..+?)\\1+
|
| Which is used for primality checking (applied to the input string
| length).
|
| It's not that hard to understand compared to some others, but
| being able to do those types of computations with regex is really
| mind-blowing to me.
| lordnacho wrote:
| Someone needs to do one of those "Guess if it's this or that"
| sites with regex and q (kdb database) expressions.
| nsajko wrote:
| Note that this "regex", unlike the ones in the article, is not
| actually a regular expression. You could perhaps call it an
| irregular expression. There can be no such thing as a
| backreference in a regular expression.
|
| More info: https://swtch.com/~rsc/regexp/regexp1.html
| derefr wrote:
| I think you're confusing "regular expression" (RE) for
| "deterministic finite-state automata" (DFA).
|
| "Regular Expression" is just the name for the grammar/formal
| language model that REs present. REs are a useful way to
| _encode_ DFAs, but not all things that can be expressed using
| RE formal-language are DFAs.
|
| Some REs (i.e. the "Perl-compatible" or "extended" Regular
| Expressions) are Nondeterministic Finite-state Automata or
| "NFAs". This doesn't change the fact that the language used
| to express them makes them Regular Expressions.
| VenTatsu wrote:
| The term "Regular Expression" is very often misused, it had
| a very well defined meaning in the field of formal language
| theory, but like most terms when barrowed by another field
| some of that meaning is lost or transformed. Some
| documentation has shifted to using terms like "pattern" or
| "pattern matching expression" to convey the meaning without
| the baggage.
|
| > Some REs (i.e. the "Perl-compatible" or "extended"
| Regular Expressions) are Nondeterministic Finite-state
| Automata or "NFAs".
|
| That those engines are implemented using an NFA or a DFA
| does not actually matter for the question of being regular
| or not. A given pattern may be Regular while another may
| not be. There are multiple technical reasons these engines
| are built on NFA's and not DFA's, supporting non-regular
| expressions is one, but not the only, reason.
|
| Ironically the library called "PCRE" or "Perl-compatible
| Regular Expressions" is in-fact not "Perl-compatible" (nor
| regular). It is at the same time both named "Perl-
| compatible" and absolutely not Perl-compatible. Both PCRE
| library and the Perl language have evolved and added
| mutually incompatible features which results in a valid
| PCRE matching expression failing to compile in Perl and a
| valid Perl matching expression failing to compile in PCRE.
| Just because that is the name doesn't make it true.
| JadeNB wrote:
| > I think you're confusing "regular expression" (RE) for
| "deterministic finite-state automata" (DFA).
|
| I think it's about descriptivism vs prescriptivism.
| "Regular expression" did start life with a technical
| meaning, according to which REs had equivalent expressive
| power to DFAs. The term has since been appropriated and
| diluted by other languages, and it is not entirely
| unreasonable to (though I prefer not to) take "regular
| expression" to mean "whatever programming languages present
| as regular expression"; but I think it's not quite right to
| say that someone is confused who believes in maintaining
| the original distinction.
| nsajko wrote:
| I agree with you (obviously), but I'm not sure we should
| invoke "descriptivism vs prescriptivism" here. As far as
| I understand that distinction is mostly applicable to
| English (or other natural language) teaching, which is
| usually quite prescriptive, but should also have varying
| descriptive elements. And to dictionaries, which are(?)
| mostly descriptive (even when that means listing two
| conflicting definitions for a single word), but have to
| make some prescriptive editing decisions.
|
| When it comes to actual semantic issues like this one, I
| think that invoking "p vs d" just muddies the waters,
| because such issues should be examined on a case-by-case
| basis.
|
| This issue specifically is quite akin to the "literally"
| case: different instances of usage of the same phrase
| have (almost) opposite meanings:
|
| See https://en.wiktionary.org/wiki/literally and
| https://en.wikipedia.org/wiki/Chomsky_hierarchy
|
| I think it's fair to say the Perl-style semantics are
| just wrong, because there's no way to use them without
| being confusing.
| charlesdaniels wrote:
| Regular expressions, deterministic finite automata, and
| nondeterministic finite automata are all equivalent[0][1].
|
| All three of these representations are capable of
| describing any regular language (set of symbol sequences,
| or more intuitively a set of strings), and the fact that a
| language can be described by an NFA, DFA, or RE implies
| that it is regular.
|
| I am not hugely familiar with Pearl's "extended regular
| expression" system, however I was under the understanding
| that the set of languages it can recognize is a superset of
| the set of all regular languages. Based on [2], it would
| appear that Perl regexes can recognize all regular
| languages, and parts of the set of all Turing-recognizable
| languages.
|
| 0 - Introduction to the Theory of Computation 3/e, Michael
| Sipser, Thm 1.39, pp. 55.
|
| 1 - Introduction to the Theory of Computation 3/e, Michael
| Sipser, Thm 1.54, pp. 67.
|
| 2- https://www.perlmonks.org/?node_id=809842
| beecafe wrote:
| FWIW the equivalence between NFA and DFA requires an
| exponential space increase to encode the NFA as a DFA,
| with an exponential space blow up you can encode a lot of
| things as DFAs (I'm pretty sure you could encode a Turing
| machine that uses bounded space on the tape as a DFA with
| exponentially more space, "just" make each possible
| configuration one state in the DFA)
| alex_smart wrote:
| The whole discussion at the end of part 2 feels so strage to me.
|
| >using the standard RE, it's impossible to match only DD-MM-YYYY
| or DDMMYYYY strings without matching DD-MMYYYY or DDMM-YYYY,
| because it's impossible to represent this in DFA form.
|
| And then there is the reader comment that finds a DFA for this
| particular problem but then wonders about another:
|
| > now whether one could _also_ accept YYYY-MM-DD /YYYYMMDD format
| with the same regexps, it might require some sort of deeper magic
| and back-tracking
|
| Regular languages are closed under union and there is even a
| straightforward translation to regular expressions (the |
| operator). Am I really missing something here? It has been a long
| time since I took a compiler course, but I would be damned.
| mistspeaker wrote:
| Thanks for this, as a new learner I'll take any resource for
| regex I can find.
| gigatexal wrote:
| Regex and fun in the same sentence? Hah
| elanning wrote:
| I recently wrote a little Regex domain specific language that's
| translated into regular JavaScript regex. I was inspired by
| CCGrep: https://arxiv.org/abs/2003.05615 because its syntax
| looked so clean and yet powerful.
|
| Examples of it look like `$# ($a == $1) { return $$$; }` $# -
| match any keyword, eg `do`, `while`, etc. $a - match any
| variable. $1 - match any literal. $$$ - match any block (greedy).
|
| It's also whitespace invariant, so `if($a==$1)` is equivalent to
| `if ($a == $1)`.
|
| All that to say, I wonder if we're missing out on a variety of
| "domain specific regexes" for various fields.
| aeortiz wrote:
| "Fun" and "regular expressions" don't belong in the same
| sentence!
|
| Of course they have their uses, but I think they should be taken
| seriously, since sadly, corner cases are hard to weed out of
| them.
___________________________________________________________________
(page generated 2021-08-20 23:02 UTC)