[HN Gopher] The Greatest Regex Trick Ever (2014)
___________________________________________________________________
The Greatest Regex Trick Ever (2014)
Author : signa11
Score : 249 points
Date : 2021-07-08 16:49 UTC (6 hours ago)
(HTM) web link (rexegg.com)
(TXT) w3m dump (rexegg.com)
| capitalbreeze wrote:
| This is awesome!! Well done "Tarzan"
| ppierald wrote:
| I used to just go ask Friedl.
| Tipewryter wrote:
| The solution... not_this|(but_this)
|
| ... is interesting. But since it returns the match in a submatch
| I would say the \K approach is better:
| (?:not_this.*?)*\Kbut_this
|
| Because usually when you try hard to accomplish something with a
| regex, you do not have the luxury to say "And then please
| disregard the match and look at the submatch instead".
| lifthrasiir wrote:
| That doesn't work. `(?:"Tarzan".*?)*\KTarzan` should behave
| identically without `\K`, and it will match `"Tarzan" "Tarzan"`
| because the ungreedy quantifier ? still allows backtracking (it
| just changes the search order). You want the possessive
| quantifier + instead; `not_this|(but_this)` is equivalent
| because regexp engines will not look back into once matched
| string.
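|
| For reference, a minimal Python sketch of the `not_this|(but_this)`
| idea discussed above. The helper name `unquoted_tarzans` and the
| sample text are made up for illustration; only the pattern itself
| comes from the article.

```python
import re

# Left alternative: consume the unwanted context ("Tarzan" in quotes).
# Right alternative: capture the wanted text in group 1.
PATTERN = re.compile(r'"Tarzan"|(Tarzan)')

def unquoted_tarzans(text):
    """Return the start offset of every Tarzan that is not wrapped in double quotes."""
    # Matches of the left alternative leave group 1 unset, so they are skipped.
    return [m.start(1) for m in PATTERN.finditer(text) if m.group(1) is not None]

sample = 'Tarzan met "Tarzan" and Tarzan'
print(unquoted_tarzans(sample))
```

| Because the engine never revisits text it has already matched, the
| quoted occurrence is consumed whole and cannot be re-matched as a
| bare Tarzan.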
| high_byte wrote:
| it's nice. I'm way more dumbfounded by the prime thing though
| rprenger wrote:
| Me too. I had to look it up. This page has pretty good
| breakdown:
|
| https://itnext.io/a-wild-way-to-check-if-a-number-is-prime-u...
|
| The main trick for me was you first have to convert the number
| to unary, which was done outside of the regex.
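|
| A small Python sketch of that prime check, assuming the commonly
| quoted form of the pattern; the function name `is_prime` is only
| illustrative, and the unary conversion is done in plain code.

```python
import re

# Matches unary strings of length 0 or 1, or of any composite length:
# (11+?) picks a divisor >= 2, and \1+ repeats it at least once more.
NOT_PRIME = re.compile(r'1?|(11+?)\1+')

def is_prime(n):
    unary = '1' * n  # the conversion to unary happens outside the regex
    return NOT_PRIME.fullmatch(unary) is None

print([n for n in range(30) if is_prime(n)])
```

| This is a curiosity rather than a practical test: the backtracking
| cost grows quickly with n.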
| asah wrote:
| speaking as an old regexp wizard from before perl5, this is
| indeed a great trick, have an upvote.
|
| sadly, this trick still requires a code comment to explain.
| Python example:
|     # match tarzan but not "tarzan"
|     # see https://news.ycombinator.com/item?id=27774584
|     if "tarzan" == re.search(r'"tarzan"|(tarzan)', myvar)[1]:
|         ...
|
| which in practice means it probably deserves a function:
| if re_search_but_exclude(r'tarzan', myvar, '"tarzan"'):
| ...
|
| I don't recommend monkeypatching re, i.e. re.search_but_exclude =
| ...
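|
| One possible shape for such a helper, as a hedged sketch:
| `re_search_but_exclude` is the hypothetical name used above, and the
| argument order follows that call. It iterates over all matches,
| since a single re.search only sees the first one, which may be the
| excluded form.

```python
import re

def re_search_but_exclude(wanted, text, excluded):
    """Return the first match of `wanted` that is not inside an `excluded` match.

    Both `wanted` and `excluded` are regex source strings. Returns None if
    every occurrence of `wanted` is covered by `excluded`.
    """
    combined = re.compile(f'(?:{excluded})|({wanted})')
    for m in combined.finditer(text):
        if m.group(1) is not None:  # only the right-hand alternative sets group 1
            return m
    return None

m = re_search_but_exclude(r'tarzan', 'say "tarzan" then tarzan', r'"tarzan"')
print(m.group(1), m.start(1))
```

| Note that any capturing groups inside `wanted` are shifted by one,
| because group 1 is taken by the wrapper.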
| dmurray wrote:
| Is there a reason you have an r-string for the first arg but
| not for the third one?
| nytgop77 wrote:
| A bit off topic, but the commented version was much clearer,
| than the version with separate function. (full sentences are
| very good at explaining things)
| 123pie123 wrote:
| of all the things ever invented in software, regex still amazes
| me.
|
| It's almost like nature, many simple rules coming together to
| make extremely clever and fairly complex ideas
| z3t4 wrote:
| It took me over 15 years until I started to willingly use
| RegExp, but now I can't live without it. It's like the curse of
| knowledge: once you learn something you lose all empathy
| and assume everyone else knows it too. It still surprises me
| though; I've had bugs like my regex matching terminal color
| sequences messing up the data if it was colored.
| usrusr wrote:
| It feels like something that was more discovered than invented,
| something that would exist even if nobody knew of its
| existence. I get the same feeling when listening to Pharrell
| Williams' Happy.
| imglorp wrote:
| Is anyone having trouble reading the page? It renders as dark
| gray on slightly darker green and is illegible.
| dorianmariefr wrote:
| Please don't
| beders wrote:
| Please don't use regular expressions to parse Dyck languages. It
| doesn't work.
| lifthrasiir wrote:
| Regexp for _tokenization_ does work. This entire essay boils
| down to the fact that you can always postprocess matches and in
| this case that corresponds to tossing unwanted tokens out.
| miloignis wrote:
| I'm not sure if any regex library exposes this, but since regular
| languages are closed over compliment and intersection you could
| theoretically do something like match("....string..",
| regex("Tarzan") - regex("\"Tarzan\"")), where the - operation is
| shorthand for intersection with the compliment. Does anyone know
| if any regex libraries expose these sorts of operations on the
| regular expression/underlying DFA?
| amenghra wrote:
| Greenery (python3) lets you manipulate regular expressions and
| do things like compute intersections:
| https://github.com/qntm/greenery
| miloignis wrote:
| This is exactly the type of thing I was thinking of, and
| seems quite fully featured - thank you!
| codeflo wrote:
| Unfortunately (or perhaps fortunately), "regexes" as commonly
| implemented in programming languages are only loosely related
| to regular expressions from automata theory. With all their
| extensions, they can recognize much, much more than just
| regular languages, and I don't think they're closed under
| complement (though I'm not sure). However, most regex engines
| have a feature called negative lookahead assertions, (?!do not
| match), which would almost work in the way you suggest.
|
| You have to be careful about inputs like this though: "Inside a
| string"Tarzan"Again inside a string"
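|
| A Python sketch of why that input is awkward. It compares a
| lookaround pattern with a variant of the trick that consumes whole
| quoted strings first; the string pattern `"[^"]*"` is an assumption
| (no escaped quotes inside strings), not something from the article.

```python
import re

tricky = '"Inside a string"Tarzan"Again inside a string"'

# Lookaround version: rejects any Tarzan that merely touches a quote,
# even though this one sits between two strings, not inside one.
lookaround_hits = re.findall(r'(?<!")Tarzan(?!")', tricky)

# Token version: consume each complete quoted string, capture Tarzan otherwise.
token_hits = [
    m.group(1)
    for m in re.finditer(r'"[^"]*"|(Tarzan)', tricky)
    if m.group(1) is not None
]

print(lookaround_hits, token_hits)
```

| The plain `"Tarzan"|(Tarzan)` pattern misses this case too, since
| the closing quote of the first string and the opening quote of the
| second happen to surround the word.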
| User23 wrote:
| Yeah, a DFA that recognizes a regular language can easily be
| implemented with O(n) worst case behavior.
|
| My attitude is generally that one should use regexes for
| matching regular languages and if one needs a stack or even
| Turing completeness then handle that in code around the
| regex.
| contravariant wrote:
| Wouldn't that end up just being the same as 'regex(Tarzan)'?
| Those regexes can't match the same thing, they can only
| overlap.
|
| What you want is something like all matches of regex("Tarzan")
| not contained in a match for regex("\"Tarzan\""), which is a
| bit trickier. That would require something like:
|
| regex("Tarzan") - all-substrings(regex("\"Tarzan\""))
|
| and I'm not sure regular languages are closed over the "all-
| substrings" operation. Actually I'm pretty sure they aren't.
| layer8 wrote:
| > compliment
|
| I'll take that as a complement.
| sixo wrote:
| Not exactly that but take a look at
| https://github.com/mtrencseni/rxe ("literate regex"). I found
| this on HN and recall the comment thread being good but I can't
| find it now.
| sodality2 wrote:
| This perhaps? second result on hn.algolia.com.
| https://news.ycombinator.com/item?id=20646174
| lifthrasiir wrote:
| My biggest gripe with regexp is that it is just compact code
| disguised as something else. It is relatively common that you
| want to scan a string with action code intermixed. There is a way
| to do that with regexp (Perl (?{...}) etc. or PCRE callouts), but
| it is always awkward to embed code in a regexp. As a result we
| typically end up with either complex code that really should
| have used a regexp but couldn't, or a contorted regexp that
| hinders understanding. The essay suggests `(*SKIP)(*FAIL)` at the
| end, which is further evidence that code and regexps don't mix
| well, so a regexp-only solution is somehow considered worthy.
| [deleted]
| 1970-01-01 wrote:
| For me, the site rendered dark gray text on a dark gray
| background and is a chore to read as-is. Outline.com fixed my
| issue with it: https://outline.com/YSYgsp
| nabilhat wrote:
| I got curious and looked back in archive.org to this page's
| initial release in 2014. The text background started out as
| good old reliable background-color: #EEEEEE, which was later
| replaced with
| background: url("http://a.yu8.us/bg-tile-parch.gif")
|
| ...because what could possibly go wrong? From the latest
| comment at the end of the page, the author would like you to
| know that the outcome is your problem, because you're using the
| wrong browser:
|
| _June 20, 2021 - 15:02_
|
| _Subject: RE: Undoing whatever is hiding this page._
|
| _Hi Allen, try a different browser. There's no strange
| shading on the page, your browser is deciding to display it in
| a weird way. Regards, -Rex_
| mmsc wrote:
| Most likely using the HTTPS Everywhere addon. That website is
| not available via HTTPS, and the user must visit the page
| first to accept the 'risk' of using the http version.
| nabilhat wrote:
| Firefox also defaults to HTTPS nowadays. Lots of
| content blockers block third party content too. Regardless,
| if _literally anything_ goes wrong with the third party
| dependency that the article's contrast depends on, the
| best case scenario here is that the text falls back on the
| body's background.
|
| Interestingly, the author also appears to control yu8.us
|
| Breaking one's own content by https-ing one site but not
| another is a great example of why to not prop up a
| website's basic legibility on a third party dependency,
| even if it's one you own and control.
| rentnorove wrote:
| It's definitely nothing to do with the following string in
| the response:
|
| > Page copy protected against web site content infringement
| by Copyscape
| extra88 wrote:
| Yes, the web author made the mistake of defining the
| <article> background-color: #EEEEEE within a min-width 960px
| media query. If the background image fails to load in a wider
| window, there's still a readable contrast between text and
| background but on a phone or other narrow screen, the dark
| background color set on the <body> is what's behind the
| article text.
| [deleted]
| dang wrote:
| " _Please don't complain about website formatting, back-button
| breakage, and similar annoyances. They're too common to be
| interesting. Exception: when the author is present. Then
| friendly feedback might be helpful._"
|
| (It's not that the annoyances aren't annoying, it's that
| they're so common that they lead to repetitive offtopicness
| that compounds into more boring threads.)
|
| https://news.ycombinator.com/newsguidelines.html
| [deleted]
| metalliqaz wrote:
| firefox shows it as black(ish) text on a light yellow
| background. I think you must be blocking something
| jrm4 wrote:
| Part of me reads these things and I'm like "neat trick", but most
| of the time they more-or-less prove to me that Regex is doomed to
| a steady and slow decline.
|
| It's just not a particularly good "interface" for the task it is
| intended to achieve, a little more ability to be "verbose" at the
| possible price of succinctness I think would go a long way. I'm
| more-or-less waiting for the "blank" in: "blank" is to Python
| what Regex is to Perl.
| gota wrote:
| I dream that we will have something like Copilot but
| exclusively for regex and working marvelously
|
| "Find every 2nd instance of a dollar amount that is not encased
| in quotes" outputting <insert regex here> would be awesome
| smnrchrds wrote:
| > The Greatest Regex Trick Ever
|
| was to convince programmers it didn't exist?
| [deleted]
| throwanem wrote:
| The greatest regex trick ever is knowing when _not_ to use one.
| IncRnd wrote:
| I've seen several regexes in various code reviews that are used
| to validate user input but do so in an exponential manner that
| can be exploited for simple DOS attacks.
| xtracto wrote:
| Ooooh or worse, I once caught someone's "email matching"
| RegEx code during a code review that was opening the door for
| some nasty SQL Injection or XSS attacks (kind of like
| validating if the text field _contained_ a valid email.. but
| not if it was ONLY a valid email).
|
| The problem with RegEx is its "obscurity". However Maybe
| someone could write a nice testing tool that would throw
| millions of known exploits into each regex it finds in your
| code to see if it is vulnerable.
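|
| The "contains" versus "is only" distinction can be sketched in
| Python as below. The email pattern is deliberately simplistic and
| the payload is invented; both are assumptions for illustration, not
| a recommended validator.

```python
import re

# Deliberately simplistic email pattern, for illustration only.
EMAIL = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'

payload = "bob@example.com'; DROP TABLE users; --"

# re.search succeeds if an email appears ANYWHERE in the input.
contains_email = re.search(EMAIL, payload) is not None

# re.fullmatch succeeds only if the ENTIRE input is an email.
is_only_email = re.fullmatch(EMAIL, payload) is not None

print(contains_email, is_only_email)
```

| Validation code almost always wants the second form (or explicit
| anchors), with parameterized queries as the actual injection
| defence.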
| CyberDildonics wrote:
| Like what? I've never thought about what regex features are
| exponential.
| llbeansandrice wrote:
| From the same site:
| https://www.rexegg.com/regex-explosive-quantifiers.html
| throwanem wrote:
| It's more a question of which ones _can't_ be. There are
| some really nasty and not very obvious gotchas here;
| https://regular-expressions.mobi/catastrophic.html has a
| good dive into how, for example, backtracking combines with
| incautious regex design to produce exponential behavior in
| the length of input.
|
| I don't have a hard and fast rule of my own about regex
| complexity, but I do have a strong intuition over what's
| now ca. 25 years of working with regexes dating back to
| initial exposure in Perl 5 as a high schooler. That
| intuition boils down more or less to the idea that, when a
| regex grows too complex to comprehend at a glance, it's
| time to start thinking hard about replacing it with a
| proper parser, especially if it's operating over (as yet)
| imperfectly sanitized user input.
|
| Sure, it's maybe a little more work up front, at least
| until you get good at writing small fast parsers - which
| doesn't take long, in my experience at least; formal
| training might make it easier still, but I've rarely felt
| the lack. In exchange for that small investment, you gain
| reliability and maintainability benefits throughout the
| lifetime of the code. Much of that comes from the simple
| source of no longer having to re-comprehend the hairball of
| punctuation that is any complex regex, before being able to
| modify it at all - something at which I was actually really
| good, as recently as a decade or so ago. The expertise has
| since expired through disuse, and that's given me no cause
| for regret; the thing about being a regex expert is that
| it's a really good skill for writing unreadable and subtly
| dangerous code, and not a skill good for much of anything
| else. Unreadable and subtly dangerous code was fine when I
| was a kid doing my own solo projects for fun, where the
| worst that'd happen is I might have to hit ^C. As an
| engineer on a team of engineers building software for
| production, it's not even something I would _want_ to be
| good at doing.
| User23 wrote:
| > That intuition boils down more or less to the idea
| that, when a regex grows too complex to comprehend at a
| glance, it's time to start thinking hard about replacing
| it with a proper parser
|
| You can get some surprisingly complex yet readable
| regexes in Perl by using qr//x[1] and decomposing the
| pieces into smaller qr//s that are then interpolated into
| the final pattern, along with proper inline comments in
| the regexes themselves.
|
| [1] https://perldoc.perl.org/perlre#/x-and-/xx
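|
| The same decomposition idea carries over to other languages. A
| rough Python analogue using re.VERBOSE and string interpolation;
| the names YEAR, MONTH, DAY and ISO_DATE are invented for this
| sketch, and the date format is just an example.

```python
import re

# Small named pieces, each readable on its own.
YEAR = r'(?P<year>\d{4})'
MONTH = r'(?P<month>0[1-9]|1[0-2])'
DAY = r'(?P<day>0[1-9]|[12]\d|3[01])'

# In VERBOSE mode, whitespace is ignored and # starts a comment.
ISO_DATE = re.compile(rf'''
    {YEAR}  -   # four-digit year
    {MONTH} -   # month 01..12
    {DAY}       # day 01..31 (no per-month length check)
''', re.VERBOSE)

m = ISO_DATE.fullmatch('2021-07-08')
print(m.group('year'), m.group('month'), m.group('day'))
```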
| throwanem wrote:
| You still have to reason about the whole thing, though.
| This doesn't make that any easier, but I bet it makes it
| _feel_ easier.
| digitalsushi wrote:
| The greatest regex /skill/ is knowing that a regex cannot
| describe everything.
| locallost wrote:
| Very verbose writing for a very succinct regex.
| kogus wrote:
| This is a great trick. It says something about RegEx syntax that
| matching a simple rule with a relatively clear expression is a
| major accomplishment.
| nytgop77 wrote:
| Yup. Regex is not a silver bullet for "match stuff", and it is
| the wrong(ish) tool for the following jobs:
|
| - context sensitive matching
|
| - matching with multi-char-exclusions
|
| (regex is happy the most, when it's used to match "regular
| language" things)
| xrayarx wrote:
| Long Page with practical regex advice for programmers, most
| likely not useful for command line warriors
|
| Lookbehind
|
| Lookahead
|
| Advanced handling of tags
|
| Replace before matching
|
| the best regex trick ever:
|
| "Tarzan"|(Tarzan)
|
| The whole site contains useful regex advice
| jandrese wrote:
| The more general tip is that a single regex isn't the only tool
| you have. You don't have to get your final product in one step.
| Almost every "disaster" regex comes from someone trying to do too
| much at once.
|
| One other solution would have been to run the regex twice, once
| to pick up all instances of Tarzan, and a second on the results
| of the first to filter out all instances of "Tarzan".
| usrusr wrote:
| A big source of trying to do too much is environments that
| offer easy regex-based transformations defined as a pair of
| regex and a single replacement string (that may contain
| references to matching groups) and make other transformations
| hard ("while find + rest"). When you have the option to provide
| a "process match" closure instead of the replacement string the
| lure of putting too much into a single regex almost collapses.
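|
| In Python, re.sub accepts a function in place of the replacement
| string, which is one form of the "process match" closure described
| above. A minimal sketch; `shout_unquoted` and its upper-casing
| behaviour are invented for illustration.

```python
import re

def shout_unquoted(text):
    """Upper-case every Tarzan that is not wrapped in double quotes."""
    def process(m):
        # Group 1 is set only for an unquoted Tarzan; a quoted one is
        # returned unchanged via the whole match, group 0.
        if m.group(1) is not None:
            return m.group(1).upper()
        return m.group(0)
    return re.sub(r'"Tarzan"|(Tarzan)', process, text)

print(shout_unquoted('Tarzan and "Tarzan"'))
```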
| dang wrote:
| One past thread:
|
| _The Greatest Regex Trick Ever (2014)_ -
| https://news.ycombinator.com/item?id=10282121 - Sept 2015 (131
| comments)
| phl wrote:
| As the examples in the article use xml, I just wanted to point
| out that applying regex to xml has a lot of limitations and
| should be avoided. See:
| https://stackoverflow.com/questions/1732348/regex-match-open...
| rascul wrote:
| I was thinking about that great answer when I was reading the
| article. Thanks for sharing it.
| ComputerGuru wrote:
| Very long build up to what is definitely a neat trick, although
| without SKIP FAIL, it might cause explosive growth in memory
| usage as it allocates space for the results you don't need
| (unless you use a streaming regex option).
|
| Speaking of lengthy: this site breaks the iOS Safari scroll bar!
| It just disappears altogether (even when scrolling up or down to
| make it show, like you have to these days to please the UX
| designers in Palo Alto).
| toxik wrote:
| The scroll bar works but for some reason it gets rendered very
| bright. Scroll all the way up to the black background in the
| header and you'll see it.
| tus89 wrote:
| Clicking on a http:// link these days feels like I have been
| tricked into clicking on a phishing link in an email.
|
| Good trick though.
| ComputerGuru wrote:
| This is why any attempt to make plain http sites throw up
| scare warnings is a horrible idea. The internet is littered
| with old websites that contain a wealth of knowledge and
| deserve to remain accessible.
|
| Just make browsers go into "read only" mode where input cannot
| be accepted on non-secure pages. But don't wall them out!
| crazygringo wrote:
| > _"Tarzan"|(Tarzan)_
|
| OK that's pretty clever (I certainly never thought of putting a
| capturing group _inside_ only _one_ side of an "or")...
|
| ...but it doesn't seem particularly useful? It probably won't
| work in most cases where this is just part of a larger
| expression. You're usually using capturing groups in a particular
| way for a good reason, and this would mess that up.
|
| In contrast, the lookbehind+lookahead way is the "proper" and
| intuitive way to write it, and works as part of any larger
| expression.
|
| So... +100 points for cleverness, but don't actually _use_ this
| please. :)
| RheingoldRiver wrote:
| > In contrast, the lookbehind+lookahead way is the "proper" and
| intuitive way to write it, and works as part of any larger
| expression.
|
| I would say, the "proper" way is to have a separate line of
| code validating what's not there :)
| crazygringo wrote:
| I'm not following?
| diarrhea wrote:
| Not GP, but I'd go a very simple and verbose way, maybe
| that's what they meant too. Match:
| (.)Tarzan(.)
|
| Then in an additional line of code assert
| (Group 1 == Group 2) [?] "
|
| This shifts the logic out of regex and into the surrounding
| programming language context. That's arguably better, but
| the resulting regex is extremely dull and unclever.
| pimlottc wrote:
| Don't forget to look out for matches at the boundaries of
| the original string. I think it should be something like:
| (^|.)Tarzan(.|$)
|
| Though I'm not 100% sure offhand what the result in the
| capturing groups would be.
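|
| For Python's re module at least, the groups can be checked
| directly. A short sketch; the helper name `neighbours` is made up,
| and other engines may report unmatched or empty groups
| differently.

```python
import re

PATTERN = re.compile(r'(^|.)Tarzan(.|$)')

def neighbours(text):
    """Return (char before, char after) the first Tarzan, or None if absent."""
    m = PATTERN.search(text)
    return m.groups() if m else None

# At a string boundary the anchor alternative matches, so the group
# holds the empty string rather than a neighbouring character.
print(neighbours('Tarzan'))
print(neighbours('"Tarzan"'))
print(neighbours('a Tarzan'))
```

| The follow-up check is then ordinary code, for example testing
| whether the result equals `('"', '"')`.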
| RheingoldRiver wrote:
| Yeah, that's more or less what I meant. Write a regex
| (plus line of code) to make sure `Tarzan` appears. Then
| write another regex and line of code to make sure
| `"Tarzan"` doesn't appear.
|
| Maybe at this point you aren't using regex even. Nice,
| you solved two problems.
|
| (I do appreciate regex and even use them a lot. But, I
| use them enough to avoid them as much as possible.)
| crazygringo wrote:
| I mean, I guess if nobody on your team understands
| regexes.
|
| But generally, once you decide to use a regex in the
| first place, you might as well put as much regular
| everyday logic as you can in it. Otherwise you might as
| well look for "Tarzan" with a dumb string search.
|
| Lookbehinds and lookaheads aren't rocket science. And you
| can always leave a comment about what they're doing if
| you're worried other team members won't grok the syntax.
| kristopolous wrote:
| The ? syntax group has to be the most unmemorable of the bunch.
| I've used it maybe over 1,000 times or so and I still have to
| look up ?: Or ?! ?< or whatever else.
|
| I used to have a laminated sheet on my wall at an office because
| it was so terribly bad.
| digitalsushi wrote:
| Let me take these PhD level regex down to elementary school
| awesome.
|
| I have a process table and I want to grep it for the phrase
| "banana":
|
| ps auxww | grep banana
|
| root 87 Jun21 0:26.78 /System/Library/CoreServices/FruitProcessor
| --core=banana
|
| mikec 456 450PM 0:00.00 grep banana
|
| Argh! It also greps for the grep for banana! Annoying!
|
| Well, I'm sure there's pgrep or some clever thing, but my
| coworker showed me this and it took me a few minutes to realize
| how it works:
|
| ps auxww | grep [b]anana
|
| root 87 Jun21 0:26.78 /System/Library/CoreServices/FruitProcessor
| --core=banana
|
| Doc Brown spoke to me: "You're just not thinking fourth
| dimensionally!" Like Marty, I have a real problem with that. But
| don't you see: [b]anana matches banana but it doesn't match 'grep
| [b]anana' as a raw string. And so I get only the process I
| wanted!
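|
| The mechanics are easy to confirm in any regex engine. A small
| Python sketch, with the two process lines abbreviated from the
| output above:

```python
import re

pattern = '[b]anana'  # a one-character class: matches the text "banana"

ps_lines = [
    '/System/Library/CoreServices/FruitProcessor --core=banana',
    'grep [b]anana',  # the grep process's own command line
]

# The grep line contains "[b]anana" literally, which has no "banana"
# substring, so the pattern does not match it.
hits = [line for line in ps_lines if re.search(pattern, line)]
print(hits)
```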
| sandreas wrote:
| This is really clever... I usually ended up with adding
| | grep -v grep
|
| like in ps auxww | grep banana | grep -v grep
| sigg3 wrote:
| _applause_
|
| Never thought of that. Nice.
| jackhalford wrote:
| but what's wrong with pgrep -f though? I don't want to search
| for clever trick every time I need to grep a process
| stonewareslord wrote:
| This almost always works, but it won't if the shell expands
| your bracketed letter. See for example:
|     $ echo [b]anana
|     [b]anana
|     $ touch banana
|     $ echo [b]anana
|     banana
|
| You can escape the bracket and it will work:
|     $ echo \[b]anana
|     [b]anana
| nick__m wrote:
| I use pgrep -laf the-wanted-string
| https://man7.org/linux/man-pages/man1/pgrep.1.html
|
| But nice regex though
|
| Edit : someone already posted that solution
| https://news.ycombinator.com/item?id=27777901
| Sniffnoy wrote:
| I dunno, the "logic" solution seems like the obvious one to me;
| if your boss really has that much trouble with propositional
| logic that they don't immediately see why it works, well, that's
| what code comments are for.
|
| (...the trick is still cool, though; I can imagine other
| situations where it would be more useful. However it does seem
| like it potentially depends on the particular regex engine being
| used, in contrast to the author's claim about it being totally
| portable; yes, it'll compile on anything, but will it _work_?)
| knodi123 wrote:
| PCRE is a pretty well-defined standard, isn't it? And it's the
| one used by most of the languages I've worked with, including
| in MariaDB.
| ComputerGuru wrote:
| It doesn't even rely on PCRE, just core regex.
| recursive wrote:
| How could it not work. I've regularly relied on order or
| matching, and never found an environment that didn't test left-
| to-right for the `|` operator in regex.
| bear8642 wrote:
| > operator in regex.
|
| regex is not regular expressions - if using NFA to match then
| you're matching all alternates simultaneously.
|
| Russ Cox has good pictures explaining idea in 'Regular
| Expression Search Algorithms' section of
| <https://swtch.com/~rsc/regexp/regexp1.html>
| recursive wrote:
| I'm talking about regex. Regex libraries in practical use
| do not use NFA. I'm talking about actual code that's
| written using normal languages. I'm familiar with the
| difference between "regular expressions" as in "regular
| languages".
| burntsushi wrote:
| Go's regexp package, Rust's regex crate and RE2 are
| examples of regex engines that are very much in practical
| use that use NFAs (among other things).
| ivegotnoaccount wrote:
| Lex/Flex, which I think we can agree is used by "actual
| code that's written using normal languages", uses DFAs,
| both inside rules and between rules, and they do not try
| '|' cases left to right (they probably could have if they
| wanted, since there is a REJECT action that already forces
| them to store the list of all the rules/texts that were
| matched):
|
|     a|ab { cout << "matched ab" << std::endl; }
|     b    { cout << "matched b" << std::endl; }
|
| if provided with "ab", will match the first rule with
| "ab", and not the first with "a" then the second with
| "b".
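|
| The difference is easy to see from Python, whose `re` module is a
| backtracking engine that tries alternatives left to right. The
| `longest_alternative` helper below is a hypothetical emulation of
| lex-style longest match for simple alternatives, not how lex is
| implemented.

```python
import re

# Leftmost-first: the first alternative that succeeds wins.
first = re.match(r'a|ab', 'ab').group()

def longest_alternative(alternatives, text):
    """Try each alternative at position 0 and keep the longest match."""
    matches = [m.group() for alt in alternatives if (m := re.match(alt, text))]
    return max(matches, key=len, default=None)

longest = longest_alternative(['a', 'ab'], 'ab')
print(first, longest)
```

| So a pattern that relies on the order of `|` alternatives can
| behave differently on a longest-match engine.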
| [deleted]
| praptak wrote:
| This trick may be thought of as a simplification of the
| systematic approach to parsing stuff, that is the lexer-parser
| division of responsibilities.
|
| The lexer uses regexes but only for splitting the input stream of
| characters into tokens. Identifiers, integers, operators,
| strings, keywords, opening brackets and whatnot - each type of
| token is defined by a regex. This part is hopefully deterministic
| and simple, although the lexer matches regexes for all kinds of
| tokens at once, which is why lexer generators are often used to
| generate lexers.
|
| The heavy lifting is done by the actual parser which tries to
| combine the tokens into something that makes sense from the point
| of the grammar.
|
| So in this trick the sub-regexes between |'s define the tokens
| (the lexer part) while the group mechanism selects the single
| token that we want to keep (a very very simple parser).
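|
| A toy version of that lexer-then-select split in Python. The token
| set and the names TOKEN and `tokenize` are assumptions chosen for
| this sketch; strings are assumed to contain no escaped quotes.

```python
import re

# One named group per token type; the lexer tries them all at each position.
TOKEN = re.compile(r'''
    (?P<STRING>"[^"]*")       |
    (?P<NUMBER>\d+)           |
    (?P<IDENT>[A-Za-z_]\w*)   |
    (?P<OP>[-+*/=()])         |
    (?P<SKIP>\s+)
''', re.VERBOSE)

def tokenize(text):
    """Split text into (type, lexeme) pairs, dropping whitespace."""
    pos, tokens = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f'unexpected character {text[pos]!r} at {pos}')
        if m.lastgroup != 'SKIP':
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

tokens = tokenize('Tarzan = "Tarzan" + 42')
# The "very simple parser": keep only the token type we care about.
identifiers = [lexeme for kind, lexeme in tokens if kind == 'IDENT']
print(identifiers)
```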
| xtracto wrote:
| This site reminded me of the times when I interviewed candidates.
| One of the interview problems was to write a function that would
| validate if a given string was a valid IPv4 address (a la
| 10.10.10.1).
|
| Some of the candidates started by saying: "I know! I'll use a
| Regular Expression", to which I replied: "Great!, now you have TWO
| problems!"
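|
| For comparison, a regex-free sketch of such a validator in Python.
| The function name `is_valid_ipv4` is invented here, and rejecting
| leading zeros is a policy assumption of this sketch rather than
| part of the original interview question.

```python
def is_valid_ipv4(address):
    """Return True for dotted-quad IPv4 strings such as 10.10.10.1."""
    parts = address.split('.')
    if len(parts) != 4:
        return False
    for part in parts:
        # Require plain ASCII digits only (rejects '', '+1', ' 1', etc.).
        if not (part.isascii() and part.isdigit()):
            return False
        # Policy choice in this sketch: reject leading zeros like '01'.
        if len(part) > 1 and part[0] == '0':
            return False
        if int(part) > 255:
            return False
    return True

print(is_valid_ipv4('10.10.10.1'), is_valid_ipv4('256.1.1.1'))
```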
___________________________________________________________________
(page generated 2021-07-08 23:00 UTC)