[HN Gopher] Regex character "$" doesn't mean "end-of-string"
       ___________________________________________________________________
        
       Regex character "$" doesn't mean "end-of-string"
        
       Author : BerislavLopac
       Score  : 380 points
       Date   : 2024-03-20 07:50 UTC (15 hours ago)
        
 (HTM) web link (sethmlarson.dev)
 (TXT) w3m dump (sethmlarson.dev)
        
       | ikiris wrote:
       | this is mostly due to the different types of regex and less about
       | it being platform dependent. $ was end of string in pcre which is
       | the "old" perl compatible regex. python has its own which has
       | quirks as mentioned, re2 is another option in go for example, and
       | i think rust has its own version as well iirc.
        
         | pjmlp wrote:
         | Indeed, there isn't any kind of universal regexp standard.
        
           | 7bit wrote:
           | We should create a new RegEx flavour that standardises RegEx
           | for good!
        
             | jasonjayr wrote:
             | https://xkcd.com/927/
        
               | rerdavies wrote:
               | https://datatracker.ietf.org/doc/rfc9485/
               | 
               | https://xkcd.com/927/
        
         | wolletd wrote:
         | The differences of the various regex "dialects" came to me over
         | the years of using regular expressions for all kinds of stuff.
         | 
         | Matching EOL feels natural for every line-based process.
         | 
         | What I find way more annoying is escaping characters and
         | writing character groups. Why can't all regex engines support
         | '\d' and '\w' and such? Why, in sed, is an unescaped '.' a
         | regex-dot matching any character, but an unescaped '(' is just
         | a regular bracket?
        
           | somat wrote:
           | > Why, in sed, is an unescaped '.' a regex-dot matching any
           | character, but an unescaped '(' is just a regular bracket?
           | 
           | It is because sed predates the very influential second
           | generation Extended Regular Expression engine and by default
           | uses the first generation Basic Regular Expression engine. So
           | really it is for backwards compatibility.
           | 
           | http://man.openbsd.org/re_format#BASIC_REGULAR_EXPRESSIONS
           | 
           | you can usually pass sed a -r flag to get it to use ERE's
           | 
           | Actually I don't really know if BRE's predate ERE's or not. I
           | assume they do based on the name but I might be wrong.
        
             | tankenmate wrote:
             | BRE and ERE were created at the same time. Prior to this
             | there wasn't a clear standard for Regex. From my memory
             | this was standardised in 1996 (IEEE Std 1003.1-1996).
             | 
             | The work originally came from work by Stephen Cole Kleene
             | in the 1950s. It was introduced into Unix fame via the QED
             | editor (which later became ed (and sed), then ex, then vi,
             | then vim; all with differing authors) when Ken Thompson
             | added regex when he ported QED to CTSS (an OS developed at
             | MIT for the IBM 709, which was later used to develop
             | Multics, and hence led to Unix).
             | 
             | Also the "grep" command got its name from "ed"; "g" (the
             | global ed command) "re" (regular expression), and "p" (the
             | print ed command). Try it in vi/vim, :g/string/p it is the
             | same thing as the grep command.
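
The g/re/p idea above can be sketched in a few lines of Python. This is only an illustration: the `grep` helper and the sample lines are made up here, and Python's `re` syntax is not the same dialect as ed's basic regular expressions.

```python
import re

def grep(pattern, lines):
    """Return the lines containing a match for pattern, like ed's g/re/p."""
    compiled = re.compile(pattern)
    # "g": visit every line, "re": test the regex, "p": keep (print) the hits.
    return [line for line in lines if compiled.search(line)]

sample = ["the cat sat", "a dog barked", "concatenate"]
matches = grep(r"cat", sample)
for line in matches:
    print(line)
```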
        
             | fsckboy wrote:
             | > _you can usually pass sed a -r flag_
             | 
             | for portability, -E is the POSIX flag for the same thing
        
         | ajsnigrutin wrote:
         | "$" could be end of string or end of line in perl, depending on
         | the setting (are you treating data as a multiline text, or each
         | line separately). (/m, /s,...)
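
A rough Python analogue of the /m switch, assuming `re.MULTILINE` plays the role of Perl's /m (the sample text is made up for illustration):

```python
import re

text = "one\ntwo\nthree"

# Default mode: with no trailing newline in this input, `$` can only
# match at the very end of the string.
default_hits = re.findall(r"\w+$", text)

# MULTILINE (the counterpart of Perl's /m): `$` also matches just
# before every newline, so each line end counts.
multiline_hits = re.findall(r"\w+$", text, flags=re.MULTILINE)

print(default_hits)    # ['three']
print(multiline_hits)  # ['one', 'two', 'three']
```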
        
           | ikiris wrote:
           | Yeah I accidentally said string when I absolutely meant to
           | say line there.
        
       | Izmaki wrote:
       | The new-line character is an actual character "at the end" of the
       | string though so it makes sense that $ would include the new-line
       | character in multi-line matching.
        
         | IshKebab wrote:
         | Yes and every implementation gets that right. The point was
         | when multi-line matching is _disabled_ and only Javascript, Go
         | and Rust get that right.
         | 
         | I'm not too surprised by PHP and Python getting it wrong. Java
         | and C# are a slight surprise though.
        
           | danbruc wrote:
           | I don't think it is correct to say some get it right and some
           | get it wrong, it is more of a design decision.
        
             | IshKebab wrote:
             | It's possible to get design decisions wrong. Clearly people
             | _expect_ `$` to only match end-of-string so they did make
             | the wrong decision. It may not have been clear it was the
             | wrong decision at the time.
        
               | danbruc wrote:
               | Things are obviously more complicated than that, lines
               | are a complicated issue for historical reasons. There are
               | two conventions, line termination and line separation. In
               | case of line termination, the newline is part of the line
               | and a string without a newline is not a [complete] line.
               | In case of line separation, the newline is not part of
               | the line but separates two lines. Also the way newlines
               | are encoded is not universal.
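
The two conventions can be seen side by side in Python. This is a small sketch with a made-up sample string: `str.split` treats the newline as a separator, while `str.splitlines` treats it as a terminator.

```python
text = "first\nsecond\n"

# Separator convention: the newline sits between lines, so a trailing
# newline yields one extra, empty "line" at the end.
as_separator = text.split("\n")

# Terminator convention: the newline belongs to the line it ends, so a
# trailing newline does not start a new line.
as_terminator = text.splitlines()

print(as_separator)   # ['first', 'second', '']
print(as_terminator)  # ['first', 'second']
```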
        
               | fauigerzigerk wrote:
               | Why is this relevant when multi-line is disabled?
        
               | danbruc wrote:
               | Because even after disabling multi-line you are still
               | dealing with line-based semantics when you use ^ or $,
               | the newline at the end is still not part of the content.
               | You have to use \A and \Z if you want to treat all
               | characters as a string instead of one or multiple lines.
        
               | burntsushi wrote:
               | > Because even after disabling multi-line you are still
               | dealing with line-based semantics when you use ^ or $
               | 
               | No, you're not, _except_ for this weird corner case where
               | `$` can match before the _last_ `\n` in a string. It's
               | not just any `\n` that non-multiline `$` can match
               | before. It's when it's the _last_ `\n` in the string.
               | See:
               | 
               |     >>> re.search('cat$', 'cat\n')
               |     <re.Match object; span=(0, 3), match='cat'>
               |     >>> re.search('cat$', 'cat\n\n')
               |     >>>
               | 
               | This is weird behavior. I assume this is why RE2 didn't
               | copy this. And it's certainly why I followed RE2 with
               | Rust's regex crate. Non-multiline `$` should only match
               | at the end of the string. It should not be line-aware. In
               | regex engines like Python where it has the behavior
               | above, it is only "partially" line-aware, and only in the
               | sense that it treats the last `\n` as special.
        
               | danbruc wrote:
               | But that is exactly what it means, the end of the line is
               | before the terminating newline or at the end of the
               | string if there is no terminating newline. Both ^ and $
               | always match at start or end of lines, \A and \Z match at
               | the start or end of the string. The difference between
               | multi-line and not is whether or not internal newlines
               | end and start lines, it does not change the semantics
               | from end of line to end of string. And if you are not in
               | multi-line mode but have internal newlines, then you
               | might also want single-line/dot-all mode.
               | 
               | One could certainly have a debate whether this behavior
               | is too strongly tied to the origins of regular
               | expressions and now does more harm than good, but I am
               | not convinced that this would be an easy and obvious
               | choice to have breaking change.
        
               | IshKebab wrote:
               | > But that is exactly what it means
               | 
               | I think you've kind of missed the point. Sure if `$` in
               | non-multiline mode means "end of line" the behaviour
               | might be reasonable. But the big error is that people DO
               | NOT EXPECT `$` to mean "end of line" in that case. They
               | expect it to mean "end of string". That's clearly the
               | least surprising and most useful behaviour.
               | 
               | The bug is not in how they have implemented "end of line"
               | matching in non-multiline mode. It's that they did it at
               | all.
        
               | burntsushi wrote:
               | re.search does not accept a "line." It accepts a
               | "string." There is no pretext in which re.search is meant
               | to only accept a single line. And giving it a `string`
               | with multiple new lines doesn't necessarily mean you want
               | to enable multi-line mode. They are orthogonal things.
               | 
               | > Both ^ and $ always match at start or end of lines
               | 
               | This is trivially not true, as I showed in my previous
               | example. The haystack `cat\n\n` contains two lines and
               | the regex `cat$` says it should match `cat` followed by
               | the "end of a line" according to your definition. Yet it
               | does not match `cat` followed by the end of a line in
               | `cat\n\n`. And it does not do so in Python or in any
               | other regex engine.
               | 
               | You're trying to square a circle here. It can't be done.
               | 
               | Can you make sense of, _historically_ , why this choice
               | of semantics was made? Sure. I bet you can. But I can
               | still evaluate the choice on its own merits today. And I
               | did when I made the regex crate.
               | 
               | > but I am not convinced that this would be an easy and
               | obvious choice to have breaking change.
               | 
               | Rust's regex crate, Go's regexp package and RE2 all
               | reject this whacky behavior. As the regex crate
               | maintainer, I don't think I've ever seen anyone complain.
               | Not once. This to me suggests that, at minimum, making
               | `$` and `\z` equivalent in non-multiline mode is a
               | reasonable choice. I would also argue it is the better
               | and more sensible approach.
               | 
               | Whether other regex engines should have a breaking change
               | or not to change the meaning of `$` is an entirely
               | different question completely. That is neither here nor
               | there. They absolutely will not be able to make such a
               | change, for many good reasons.
        
               | danbruc wrote:
               | _re.search does not accept a "line." It accepts a
               | "string." There is no pretext in which re.search is meant
               | to only accept a single line._
               | 
               | Sure, it takes a string which might be a line or multiple
               | or whatever. Does not change the fact that $ matches at
               | the end of a line. If you want the end of the string, use
               | \Z.
               | 
               |  _This is trivially not true, as I showed in my previous
               | example. The haystack `cat\n\n` contains two lines and
               | the regex `cat$` says it should match `cat` followed by
               | the "end of a line" according to your definition._
               | 
               | In multi-line mode it matches, in single-line mode it
               | does not because there is a newline between cat and the
               | end of the line. A newline is only a terminating newline
               | if it is the last character, the newline after cat is not
               | a terminating newline. You need cat\n$ or cat\n\n to
               | match.
        
               | burntsushi wrote:
               | > In multi-line mode it matches, in single-line mode it
               | does not because there is a newline between cat and the
               | end of the line. A newline is only a terminating newline
               | if it is the last character, the newline after cat is not
               | a terminating newline. You need cat\n$ or cat\n\n to
               | match.
               | 
               | This only makes sense if re.search accepted a line to
               | search. It doesn't. It accepts an arbitrary string.
               | 
               | I don't think this conversation is going anywhere. Your
               | description of the semantics seems inconsistent and
               | incomprehensible to me.
               | 
               | > A newline is only a terminating newline if it is the
               | last character, the newline after cat is not a
               | terminating newline. You need cat\n$ or cat\n\n to match.
               | 
               | The first `\n` in `cat\n\n` _is_ a terminating newline.
               | There just happens to be one after it.
               | 
               | Like I said, your description makes sense _if_ the input
               | is meant to be interpreted as a single line. And in some
               | contexts (like line oriented CLI tools), that can make
               | sense. But that's _not_ the case here. So your
               | description makes no sense at all to me.
        
               | danbruc wrote:
               | _This only makes sense if re.search accepted a line to
               | search. It doesn't. It accepts an arbitrary string._
               | 
               | Which is fine because lines are a subset of strings. And
               | whether you want your input treated as a line or a string
               | is decided by your pattern, use ^ and $ and it will be
               | treated as a line, use \A and \Z and it will be treated
               | as a string.
               | 
               |  _The first `\n` in `cat\n\n` is a terminating newline.
               | There just happens to be one after it._
               | 
               | Look at where this is coming from. You do line-based
               | stuff, there is either no newline at all or there is
               | exactly one newline at the end. You do file-based stuff,
               | there are many newlines. In both cases the behavior of ^
               | and $ makes perfect sense.
               | 
               | Now you come along with cat\n\n which clearly falls into
               | the file-based stuff category as it has more than one
               | newline in it but you also insist that it is not multiple
               | lines. If it is not multiple lines, then only the last
               | character can be a newline, otherwise it would be
               | multiple lines.
               | 
               | And I get it, yes, you can throw arbitrary strings at a
               | regular expression, this line-based processing is not
               | everything, but it explains why things behave the way
               | they do. And that is also why people added \A and \Z. And
               | I understand that ^ and $ are much nicer and much better
               | known than \A and \Z. Maybe the best option would be to
               | have a separate flag that makes them synonymous with \A
               | and \Z and this could maybe even be the default.
        
               | burntsushi wrote:
               | > And whether you want your input treated as a line or a
               | string is decided by your pattern, use ^ and $ and it
               | will be treated as a line, use \A and \Z and it will be
               | treated as a string.
               | 
               | Where is this semantic explained in the `re` module docs?
               | 
               | This is totally and completely made up as far as I can
               | tell.
               | 
               | This also seems entirely consistent with my rebuttal:
               | 
               | Me: What you're saying makes sense _if_ condition foo
               | holds.
               | 
               | You: Condition foo holds.
               | 
               | This is uninteresting to me because I see no reason to
               | believe that condition foo holds. Where condition foo is
               | "the input to re.search is expected to be a single line."
               | Or more precisely, apparently, "the input to re.search is
               | expected to be a single line when either ^ or $ appear in
               | the pattern." That is totally bonkers.
               | 
               | > but it explains why things behave the way they do
               | 
               | Firstly, I am not debating with you about the historical
               | reasoning for this. Secondly, I am providing a commentary
               | on the semantics themselves (they suck) and also on your
               | explanation of them in _today's_ context (it doesn't
               | make sense). Thirdly, I am not making a prescriptive
               | argument that established regex engines should change
               | their behavior in any way.
               | 
               | If you're looking to explain _why_ this semantic is the
               | way it is, then I'd expect writing from the original
               | implementors of it. Probably in Perl. I wouldn't at all
               | be surprised if this was an "oops" or if it was
               | implemented in a strictly-line-oriented context, and then
               | someone else decided to keep it unthinkingly when they
               | moved to a non-line-oriented context. From there,
               | compatibility takes over as a reason for why it's with us
               | today.
        
               | danbruc wrote:
               | I quoted the section from the Python module here. [1]
               | 
               | If you do not specify multi-line, bar$ matches a line
               | ending in bar, either foobar\n or foobar if the
               | terminating newline has been removed or does not exist.
               | If you specify multi-line, then it will also match at
               | every bar\n within the string. So it either treats your
               | input as a single line or as multiple lines. You can of
               | course not specify multi-line and still pass in a string
               | with additional newlines within the string, but then
               | those newlines will be treated more or less as any other
               | character, bar$ will not match bar\n\n. The exception is
               | that dot will not match them except you set the single-
               | line/dot-all flag, bar\n$ will match bar\n\n but bar.$
               | will not unless you specify the single-line/dot-all flag.
               | 
               | I would even agree with you that it seems a bit weird. If
               | you have a proper line without additional newlines in the
               | middle, then multi-line behaves exactly like not multi-
               | line. Not multi-line only behaves differently if you
               | confront it with multiple lines and I have no good idea
               | how you would end up in a situation where you have
               | multiple lines and want to treat them as one unit but
               | still treat the entire thing as if it was a line.
               | 
               | [1] https://news.ycombinator.com/item?id=39765086
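
The cases described above can be checked directly in Python. A minimal sketch, assuming the `re` module and the `bar\n\n` haystack from the comment; the variable names are only for illustration.

```python
import re

haystack = "bar\n\n"

# `$` only tolerates a newline that is the very last character, and here
# the newline right after "bar" is not the last one.
plain = re.search(r"bar$", haystack)

# Consuming the inner newline explicitly leaves `$` just before the final one.
explicit_newline = re.search(r"bar\n$", haystack)

# `.` does not match a newline by default...
dot_default = re.search(r"bar.$", haystack)

# ...but it does once DOTALL (single-line mode) is set.
dot_all = re.search(r"bar.$", haystack, flags=re.DOTALL)

for label, m in [("bar$", plain), (r"bar\n$", explicit_newline),
                 ("bar.$", dot_default), ("bar.$ + DOTALL", dot_all)]:
    print(label, "->", "match" if m else "no match")
```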
        
               | burntsushi wrote:
               | The docs do not say what you're saying. Your phrasing is
               | completely different, and the part where "if ^/$ are in
               | the pattern then the haystack is treated as a single
               | line" is completely made up. As far as I can tell, that's
               | your _rationalization_ for how to make sense of this
               | behavior. But it is not a story supported by the actual
               | regex engine docs. The actual docs say,  "^ matches only
               | at the beginning of the string, and $ matches only at the
               | end of the string and immediately before the newline (if
               | any) at the end of the string." The docs do not say, "the
               | string is treated as a single line when ^/$ are used in
               | the pattern." That's _your_ phrasing, not anyone else's.
               | That's _your_ story, not theirs.
               | 
               | I still have not seen anything from you that makes sense
               | of the behavior that `cat$` does not match `cat\n\n`.
               | Like, I realize you've tried to explain it. But your
               | explanation does not make sense. That's because the
               | behavior is _strange_.
               | 
               | The only actual way to explain the behavior of $ is what
               | the `re` docs say: it either matches at the end of the
               | string or just before a `\n` that appears at the end of
               | the string. That's it.
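
One way to make that documented rule concrete in Python is to spell non-multiline `$` out as a lookahead, "an optional newline, then end of string". This is a sketch, not how the engine implements it; the `(?=\n?\Z)` spelling and the haystack list are assumptions made here for illustration.

```python
import re

DOLLAR = re.compile(r"cat$")
# Hypothetical spelled-out form: an optional "\n", then the true end of string.
SPELLED_OUT = re.compile(r"cat(?=\n?\Z)")

haystacks = ["cat", "cat\n", "cat\n\n", "cat\ndog", "concat\n"]

# Record, for each haystack, whether each pattern finds a match.
results = [
    (h, bool(DOLLAR.search(h)), bool(SPELLED_OUT.search(h)))
    for h in haystacks
]

for haystack, dollar, spelled_out in results:
    print(repr(haystack), dollar, spelled_out)
```

On these inputs both patterns agree, including the `cat\n\n` case that does not match.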
        
               | danbruc wrote:
               | You are right, it is my wording, I replaced end of string
               | or before newline as the last character with end of line
               | because that is what this means. You could also write
               | that into the documentation but then you would have to
               | also explain what end of line means. And I will grant you
               | that I might be wrong, that the behavior is only
               | accidentally identical to matching the end of a line but
               | that the true reason for it is different.
               | 
               | cat$, the $ matches the end of the line, the second \n,
               | cat is not directly before that. I guess you want the
               | regex engine to first treat the input as a multi-line
               | input, extract cat\n as the first line, and then have
               | cat$ match successfully in that single line? What about
               | cat$ and dog$ and cat\ndog\n.
        
               | dfawcus wrote:
               | Given that in unix they sort of started as:
               | 
               |     ed -> sed
               |     ed -> grep
               | 
               | The line oriented nature makes sense.
               | 
               | There is some sed multi-line capability if one uses the
               | hold space, but it is much easier to just use awk.
        
             | tankenmate wrote:
             | Not quite, there are standards for this behaviour (formal
             | and de jure).
        
               | danbruc wrote:
               | And the ones that do not match cat\n with cat$ arguably
               | have it wrong. Both ^ and $ anchor to the start and end
               | of lines, not to the start and end of strings, whether in
               | multi-line mode or not.
        
           | noirscape wrote:
           | It's not wrong actually. It's the difference between BRE and
           | ERE, which are the two different POSIX standards that define
           | regex. In BRE the $ should always match the end of the string
           | (the spec specifically says it should match the string
           | terminator since "newlines aren't special characters"), while
           | the ERE spec says it should match until the end of the line.
           | 
           | The real issue is that no language nowadays "just" implements
           | BRE or ERE since both specs are lacking in features.
           | 
           | Most languages instead implement some variant of Perl's regex
           | instead (often called PCRE regex because of the C library
           | that brought Perl's regex to C), which as far as I can tell
           | isn't standardized, so you get these subtle differences
           | between implementations.
        
         | mnw21cam wrote:
         | The article is about when multi-line is _disabled_.
        
       | user2342 wrote:
       | I'm confused by this blog-post. In the table: what is the reg-ex
       | pattern tested and against which input?
        
         | mnw21cam wrote:
         | The input being matched is "cat\n" and the regex pattern is one
         | of:
         | 
         |     "cat$" with multiline enabled
         |     "cat$" with multiline disabled
         |     "cat\z"
         |     "cat\Z"
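
For Python only, the rows of that table can be reproduced roughly like this. `\z` is left out because, as far as I know, older versions of Python's `re` reject it; note also that Python's `\Z` is the strict end-of-string anchor, which is not true of every engine.

```python
import re

haystack = "cat\n"

rows = {
    "cat$ (multiline)": re.search(r"cat$", haystack, flags=re.MULTILINE),
    "cat$ (no multiline)": re.search(r"cat$", haystack),
    r"cat\Z": re.search(r"cat\Z", haystack),
}

# Reduce each result to a plain yes/no, as in the article's table.
outcome = {label: match is not None for label, match in rows.items()}

for label, matched in outcome.items():
    print(f"{label:20} {'match' if matched else 'no match'}")
```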
        
       | somat wrote:
       | Structural regexes as found in the sam editor are an obscure but
       | well engineered regex engine. I am far from an expert but my main
       | takeaway from them is that most regex engines have an implied
       | structure built around "lines" of text. While you can work around
       | this, it is awkward. Structural regexes allow you to explicitly
       | define the structure of a match, that is, you get to tell the
       | engine what a "line" is.
       | 
       | http://man.cat-v.org/plan_9/1/sam
        
       | xlii wrote:
       | Regexp was one of the first things I truly internalized years ago
       | when I was discovering Perl (which still lives in a cozy place in
       | my heart due to a lovely "Camel" book).
       | 
       | Today most important bit of information is knowledge that
       | implementations differ and I made a habit of pulling reference
       | sheet for a thing I work with.
       | 
       | E.g. Emacs Regexp annoyingly doesn't have word in form of "\w"
       | but uses "\s_-" (or something no reference sheet on screen) as
       | character class (but Emacs has the best documentation and
       | discoverability - a hill I'm willing to die on)
       | 
       | Some utilities require parenthesis escaping and some not.
       | Sometimes this behavior is configurable and sometimes it's not.
       | 
       | I lived through whole confusion, annoyance, denial phase and now
       | I just accept it. Concept is the same everywhere but flavor
       | changes.
        
         | ydant wrote:
         | Exactly the same here, re: Perl.
         | 
         | My brain thinks in Perl's regex language and then I have to
         | translate the inconsistent bits to the language I'm using.
         | Especially in the shell - I'm way more likely to just drop a
         | perl into the pipeline instead of trying to remember how
         | sed/grep/awk (GNU or BSD?) prefer their regex.
        
           | influx wrote:
           | GNU grep supports Perl regexp with -P
        
             | mwpmaybe wrote:
             | As does git grep!
        
             | 1letterunixname wrote:
             | Using PCRE2, which doesn't behave exactly the same as Perl
             | or PCRE1.
             | 
             | https://pcre.org/current/doc/html/pcre2compat.html
             | 
             | https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expre
             | s...
             | 
             | https://stackoverflow.com/questions/70273084/regex-
             | differenc...
        
           | mtmk wrote:
           | hah, I'm the same too, straight to 'perl -lne'. I believe
           | that was one of Larry Wall's goals when creating Perl:
           | 
           | > Perl is kind of designed to make awk and sed semi-obsolete.
           | 
           | https://github.com/Perl/perl5/commit/8d063cd8
        
         | pizzafeelsright wrote:
         | How did you internalize it? Perl looks like cat keyboarding.
        
           | mwpmaybe wrote:
           | The same way people internalize punching data and
           | instructions into stacks of cards, or internalize advanced
           | mathematical notation. Just because things aren't written in
           | plain english words doesn't mean they can't be internalized.
        
             | chongli wrote:
             | Advanced math is mostly written in plain English, actually!
        
           | ydant wrote:
           | For me, Perl hit me at exactly the right time in my
           | development. One or more of the various O'Reilly Perl books
           | caught my attention in the bookstore, the foreword and the
           | writing style was unlike anything else I'd read in
           | programming up to that point, and I read the book and just
           | felt a strong connection to how the language was structured,
           | the design concepts behind it, the power of regex being built
           | in to the language, etc. The syntax favored easy to write
           | programs without unnecessary scaffolding (of course, leading
           | to the jokes of it being write-only - also the jokes I could
           | make about me programming largely in Java today), and the
           | standard functionality plus the library set available felt
           | like magic to me at that point.
           | 
           | Learning Perl today would be a very different experience. I
           | don't think it would catch me as readily as it did back then.
           | But it doesn't matter - it's embedded into me at a deep level
           | because I learned it through a strong drive of fascination
           | and infatuation.
           | 
           | As for the regex themselves? It's powerful and solved a lot
           | of the problems I was trying to solve, was built
           | fundamentally into Perl as a language, so learning it was
           | just an easy iterative process. It didn't hurt that the
           | particular period of time when I learned Perl/regex the
           | community was really big on "leetcode" style exercises, they
           | just happened to be focused around Perl Golf, being clever in
           | how you wrote solutions to arbitrary problems, and abusive
           | levels of regex to solve problems. We were all playing and
           | play is a great way to learn.
        
       | beardyw wrote:
       | Does anyone consider RegEx to be standardised? Moving to a new
       | context is always a relearning exercise in my experience.
        
         | rusk wrote:
         | My understanding is it was standardised for Posix but the
         | variants in popular use have so many variations.
         | 
         | I consider sed to be the baseline. If you can do sed you can do
         | anything but it's seriously limited.
        
           | susam wrote:
           | POSIX specifies two flavours of regular expressions: basic
           | regular expressions (BRE) and extended regular expressions
           | (ERE). There are subtle differences between the two and ERE
           | supports more features than BRE. For example, what is written
           | as a\(bc\)\{3\}d in BRE is written as a(bc){3}d in ERE.
           | See https://pubs.opengroup.org/onlinepubs/9699919799/basedefs
           | /V1... for more details.
           | 
           | The regular expression engines available in most mainstream
           | languages go well beyond what is specified in POSIX though.
           | An interesting example is named capturing group in Python,
           | e.g., (?P<token>f[o]+).
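A minimal Python sketch of the named capturing group mentioned above. The sample input string is made up for illustration; only the pattern comes from the comment.

```python
import re

# (?P<token>...) gives the capturing group the name "token".
pattern = re.compile(r"(?P<token>f[o]+)")

match = pattern.search("xx fooo yy")  # sample input, chosen for illustration
token = match.group("token") if match else None
print(token)
```

The group can then be read by name rather than by position, which tends to survive later edits to the pattern better than numbered groups.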
        
             | tankenmate wrote:
             | Indeed, and the most common is Perl since it was the source
             | of many of the extensions.
        
               | rusk wrote:
               | I would hazard that nowadays it's Java due to its broad
               | permeation of the application space
        
               | account42 wrote:
               | If anything it would be ECMAScript (JavaScript dwarfs
               | Java use) or PCRE (the de-facto continuation of Perl
               | regular expressions written in C but used in many
               | languages).
        
               | rusk wrote:
               | Yes I think you're right actually. I'm about 10 years off
               | :)
        
             | jwilk wrote:
             | > what is written as \(f..\)\1 in BRE is written as
             | (f..)\1 in ERE
             | 
             | Oddly, there are no backreferences in POSIX EREs.
        
               | susam wrote:
               | You are right indeed. Looked at the specification again
               | and indeed there is no back-reference in POSIX ERE.
               | 
               | Quoting from <https://pubs.opengroup.org/onlinepubs/96999
               | 19799.2008edition...>:
               | 
               | > It was suggested that, in addition to interval
               | expressions, back-references ( '\n' ) should also be
               | added to EREs. This was rejected by the standard
               | developers as likely to decrease consensus.
               | 
               | Updated my comment to present a better example that
               | avoids back-references. Thanks!
        
               | GrumpySloth wrote:
               | That's because POSIX EREs are actual regular expressions
               | thank god.
        
           | psd1 wrote:
           | No gnu tool can balance brackets, afaics. So you can't do
           | everything in sed. And sed is, by design, useless for
           | matching text that spans lines, so good luck picking out
           | paragraphs with it.
        
             | rusk wrote:
             | Sorry I meant to write "if you can do it in sed you can do
             | it in anything" thereby implying it is a subset of the more
             | generally available flavours. The issue at hand however is
             | that there isn't much in the way of standardisation but 95%
             | of sed should work across all of them. Of course you should
             | get more into the specifics of whatever your solution space
             | supports.
        
             | ykonstant wrote:
             | I am pretty sure even pure Awk can do it; or am I mistaken?
             | I thought there was an even more sophisticated example in
             | the Awk book.
             | 
             | Edit: oh, you mean via regex engines available in GNU
             | tools; I am dumb. Hmm... is there no GNU extension with
             | PCRE?
        
               | colimbarna wrote:
               | "Sed" is the name of a specific tool. It is not defined
               | by the GNU tools, but has existed in some form since
               | 1974, well before Perl. GNU sed and POSIX sed both
               | support BRE and EREs, but not PCREs.
               | 
               | Maybe there's some other implementation of sed that
               | supports PCREs but that would really be an extension of
               | that implementation of sed rather than a property of sed.
               | 
               | And maybe there's some GNU tool that uses PCREs, but that
               | GNU tool would not be GNU sed, so it would not be a
               | relevant property.
               | 
               | Anyway, they probably should have said BREs or EREs
               | rather than "sed"...
        
         | telotortium wrote:
         | Languages invented after Perl will generally use some flavor of
         | Perl regex syntax, but there are always some minor differences.
         | The issue of the meaning of `$` and changing it via multi-line
         | mode is usually consistent though.
        
           | usrusr wrote:
           | I like to think of "whatever browsers do in js" as an updated
           | common baseline. Whatever your regex engine does, describe it
           | as a delta to the js precedent. That thing is just so
           | ubiquitous.
           | 
           | I do wonder though what's the highest number of different
           | regex syntaxes I've ever encountered (perhaps written?)
           | within a single line: bash, grep and sed are never not in a
           | "hold my beer" mood!
        
             | psd1 wrote:
             | Reason #2 to use powershell - consistent regex.
             | 
             | I've got "hold my beer" commits in .net - I've balanced
             | brackets. I believe that's impossible in sed and grep. If I
             | were going to write a json parser in a script, then a) stop
             | me and b) it's got to be in powershell.
        
             | layer8 wrote:
             | That seems like just a web front-end developer's
             | perspective.
        
             | Calzifer wrote:
             | Isn't JavaScripts regex one of the worst modern regex
             | implementations?
             | 
             | They seem to improve. Negative lookbehind isn't missing
             | anymore [1]. But still lack the handy \Q and \E to escape
             | stuff [2].
             | 
             | [1] https://stackoverflow.com/a/3950684
             | 
             | [2] https://stackoverflow.com/q/6318710
        
             | kstrauser wrote:
             | I'll go along with that, as long as someone ports pcre to
             | JavaScript and that's the browser syntax we land on.
        
             | mwpmaybe wrote:
             | > I do wonder though what's the highest number of different
             | regex syntaxes I've ever encountered (perhaps written?)
             | within a single line: bash, grep and sed are never not in a
             | "hold my beer" mood!
             | 
             | Your comment is missing a trigger warning, lol. But
             | seriously, this is one of my flags for "this should
             | probably be a script, or an awk or perl one-liner."
        
         | wolletd wrote:
         | At some point, I felt like I knew them all. There are probably
         | more regex dialects out there, but I don't encounter them and
         | my set of knowledge works most of the time.
         | 
         | I feel it's like driving a rental car. It behaves slightly
         | different than your own car, some features missing, some other
         | features added, but in general, most of the things are pretty
         | similar.
        
           | stanislavb wrote:
           | What a nice analogy. I'll borrow it in the future.
        
         | MattHeard wrote:
         | My working assumption has always been to check the docs of your
         | specific regexp parser, and to write some tests (either
         | automated or manually in a REPL) with specific patterns that
         | you are interested in using.
        
         | out-of-ideas wrote:
         | kind of a trick question; there is POSIX and then there is the
         | app you're using and whichever flags are enabled (albeit by
         | default or explicitly defined)
        
         | jasonjayr wrote:
         | The three big ones I know of are POSIX, Perl/PCRE(aka Perl-
         | Compatible Regular Expression), and Go came along and
         | <strike>added</strike> used re2, which is a bit different from
         | the first two.
         | 
         | A lot of systems implemented PCRE, including JavaScript, since
         | Perl extended the POSIX system with many useful extensions.
         | IIRC, re2 tries to rein in some of the performance issues
         | and quirks the original systems had, while implementing the
         | whole thing in Go.
         | 
         | edit: Did not realize re2 predated go ...
        
           | jpgvm wrote:
           | re2 predates Go and was written in C++.
        
           | foldr wrote:
           | Go's regex implementation is new in the sense that it's not
           | just a binding to the re2 C++ library, but it uses the same
           | non-backtracking algorithm.
        
           | jerf wrote:
           | POSIX and PCRE are arguably redundant. They both support
           | backreferences, which puts very significant constraints on
           | their implementations. PCRE is at least functionally a
           | superset of POSIX, whether or not there's some quirky thing
           | POSIX supports that PCRE does not.
           | 
           | re2 adds a legitimate option to the menu of using NDFAs,
           | which have the disadvantage of not supporting backreferences,
           | but have the advantage of having constrained complexity of
           | scanning a string. This does not come for free; you can
           | conceivably end up with a compiled regexp of very large size
           | with an NDFA approach, but most of the time you won't. The
           | result may be generally slower than a PCRE-type approach, but
           | it can also end up safer because you can be confident that
           | there isn't a pathological input string for a given regexp
           | that will go exponential.
           | 
           | This is one of those cases where ~99% of the time, it doesn't
           | really matter which you choose, but at the scale of the
           | Entire Programming World, both options need to be available.
           | I've got some security applications where I legitimately
           | prefer the re2 implementation in Go because it is
           | advantageous to be confident that the REs I write have no
           | pathological cases in the arbitrary input they face. PCRE can
           | be necessary in certain high-performance cases, as long as
           | you can be sure you're not going to get that pathological
           | input.
           | 
           | RE engines don't quite engender the same emotions as
           | programming languages as a whole, but this is not
           | cheerleading, this is a sober engineering assessment. I use
           | both styles in my code. I've even got one unlucky exe I've
           | been working with lately that has both, because it rather
           | irreducibly has the requirements for both. Professionally
           | annoying, but not actually a problem.
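The pathological-input case described in this comment can be sketched with Python's backtracking `re` module. The pattern and input sizes below are illustrative assumptions, not taken from the thread, and the timings will vary by machine and Python version.

```python
import re
import time

# Nested quantifiers give a backtracking engine exponentially many ways
# to split the run of "a"s before it can conclude there is no match.
pathological = re.compile(r"(a+)+$")

def time_failed_match(n):
    """Return (match, seconds) for n 'a' characters followed by one 'b'."""
    text = "a" * n + "b"
    start = time.perf_counter()
    result = pathological.match(text)
    return result, time.perf_counter() - start

sizes = (10, 14, 18)
results = [time_failed_match(n) for n in sizes]
for n, (result, seconds) in zip(sizes, results):
    print(n, result, f"{seconds:.4f}s")
```

Every call fails to match, but the time taken should grow sharply with each extra character. An automata-based engine in the RE2 style is designed to avoid this kind of growth.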
        
             | burntsushi wrote:
             | I'll add two notes to this:
             | 
             | * Finite automata based regex engines don't necessarily
             | have to be slower than backtracking engines like PCRE. Go's
             | regexp is in practice slower in a lot of cases, but this is
             | more a property of its implementation than its concept.
             | See: https://github.com/BurntSushi/rebar?tab=readme-ov-
             | file#summa... --- Given "sufficient" implementation effort
             | (~several person years of development work), backtrackers
             | and finite automata engines can both perform very well,
             | with one beating the other in some cases but not in others.
             | It depends.
             | 
             | * Fun fact is that if you're iterating over all matches in
             | a haystack (e.g., Go's `FindAll` routines), then you're
             | susceptible to O(m * n^2) search time. This applies to all
             | regex engines that implement some kind of leftmost match
             | priority. See
             | https://github.com/BurntSushi/rebar?tab=readme-ov-
             | file#quadr... for a more detailed elaboration on this
             | point.
        
               | jerf wrote:
               | Excellent, thank you.
        
             | keybored wrote:
             | > RE engines don't quite engender the same emotions as
             | programming languages as a whole, but this is not
             | cheerleading, this is a sober engineering assessment.
             | 
             | Good on you.
        
         | bregma wrote:
         | The ISO/IEC 14882 C++ standard library <regex> mandates [0]
         | implementations for six de jure standard regex grammars: IEEE
         | Std 1003.1-2008 (POSIX) [1] BRE, ERE, awk, grep, and egrep and
         | ECMA-262 EcmaScript 3 [2].
         | 
         | So, yes, at least someone (me) considers regex to be
         | standardized in several published de jure standards.
         | 
         | [0] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3690.pdf#chapter.28
         | 
         | [1] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
         | 
         | [2] https://262.ecma-international.org/14.0/#sec-regexp-regular-expression-objects
        
           | pjc50 wrote:
           | "At least six different standards" is an XKCD comic, not _a_
           | standard.
        
             | riffraff wrote:
             | "The nice thing about standards is that you have so many to
             | choose from." - Andrew Tanenbaum (or Grace Hopper)
        
           | account42 wrote:
           | <regex> is not exactly an example anyone should follow.
        
             | bregma wrote:
             | You may be prejudiced against C++, but ISO/IEC 14882 is a
             | published international standard that links to recognized
             | regex standards, so answers the question "does anyone
             | consider RegEx standardised?" very much in the affirmative.
        
         | beardyw wrote:
         | And don't get me started about find and replace, what is the
         | symbol to insert the match?
        
         | tonyg wrote:
         | Delightfully, RFC 9485
         | https://datatracker.ietf.org/doc/rfc9485/ "I-Regexp: An
         | Interoperable Regular Expression Format" was published just
         | back in October last year!
        
       | ghusbands wrote:
       | > Note: The table of data was gathered from regex101.com, I
       | didn't test using the actual runtimes.
       | 
       | Has anyone confirmed this behaviour directly against the
       | runtimes/languages? Newlines at the end of a string are certainly
       | something that could get lost in transit inside an online service
       | involving multiple runtimes.
        
         | AtNightWeCode wrote:
         | I fail to add carriage return to the test string on that site.
         | Which I guess would be an issue on Windows.
        
         | zimpenfish wrote:
         | https://go.dev/play/p/Tce1qWjfjOy matches their results.
         | 
         | I've also run that locally against "go1.22.1 darwin/arm64",
         | "go1.21.5 windows/amd64", and "go1.21.0 linux/amd64" with the
         | same result.
        
         | coldtea wrote:
         | > _Newlines at the end of a string are certainly something that
         | could get lost in transit inside an online service involving
         | multiple runtimes._
         | 
           | In what way could newlines at the end of a string "get lost
           | in transit"?
        
           | ghusbands wrote:
           | If you write it to a text file by itself and then read it
           | from that text file, each runtime can have a different
           | definition of whether a newline at the end of the file is
           | meaningful or not. Under POSIX, a newline should always be
           | present at the end of a non-empty text file and is not
           | meaningful; not everyone agrees or is aware.
           | 
           | There are plenty of other ways, too; bugs happen.
        
             | coldtea wrote:
             | Ideally no runtime should alter strings passing through
             | ("in transit") from one runtime to another - unless it does
             | some processing on them.
        
         | ghusbands wrote:
         | I've now tested C#, directly, and got the same result as the
         | article. It also documents the behavior:
         | 
         | > The ^ and $ language elements indicate the beginning and end
         | of the input string. The end of the input string can be a
         | trailing newline \n character.
        
         | burntsushi wrote:
         | Yes, and with more regex engines:
         | https://github.com/BurntSushi/rebar/blob/177f5d55e916964b9c4...
         | 
         | Beyond what's in the OP, that includes RE2, Hyperscan, D's
         | std.regex, ICU, Perl, Python's third party `regex` package, and
         | `regress`.
        
       | masswerk wrote:
       | As for the good old reference implementation (not _"Parameter
       | Efficient Reinforcement Learning"_):
       | 
       |     my $string = "cat\n";
       |     /cat$/s  -> true
       |     /cat\Z/s -> true
       |     /cat\z/s -> false
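For comparison with the Perl results above, a small Python sketch run on the same input. Note that Python's `\Z` behaves like Perl's `\z` (absolute end of string), not like Perl's `\Z`.

```python
import re

s = "cat\n"

# $ matches at the end of the string and also just before a trailing newline.
dollar = re.search(r"cat$", s) is not None      # True
# Python's \Z matches only at the very end of the string.
upper_z = re.search(r"cat\Z", s) is not None    # False
# fullmatch requires the whole string, newline included, to match.
full = re.fullmatch(r"cat", s) is not None      # False

print(dollar, upper_z, full)
```

If the goal is "the entire input must match", `re.fullmatch` or an explicit `\Z` is the safer choice in Python.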
        
       | pjc50 wrote:
       | Special misery case: Visual Studio supports regex search, where
       | '$' matches \n.
       | 
       | The end of line character is usually the standard Windows \r\n.
       | 
       | Yes, that means if you want to really match the end of line you
       | have to match "\r$". So broken.
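A similar effect can be reproduced with Python's `re` module in multi-line mode. This sketch uses Python rather than Visual Studio's own engine, so it is only an analogy; the sample text is made up.

```python
import re

text = "foo\r\nbar\r\n"  # CRLF line endings, as on Windows

# In MULTILINE mode $ matches just before "\n", but here "\r" sits
# between "foo" and "\n", so the plain pattern finds nothing.
plain = re.search(r"foo$", text, re.MULTILINE)

# Allowing an optional "\r" before $ makes the match succeed.
with_cr = re.search(r"foo\r?$", text, re.MULTILINE)

print(plain, with_cr)
```

The `\r?` form works for both LF and CRLF input, at the cost of including the carriage return in the matched text.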
        
         | skrebbel wrote:
         | FWIW, and I know this doesn't really address your complaint: I
         | use Windows and I've set all my text editors to use LF
         | exclusively years ago and Things Are Great. No more weird Git
         | autocrlf warnings, no quirks when copying files over to/from
         | people on Macs or Linuxes, etc. Even Notepad has supported LF
         | line endings for quite a long time now - in my practical
         | experience, there's little remaining in Windows that makes CRLF
         | "the OS standard line ending".
         | 
         | I bet if someday VS Code's Windows build ships with LF default
         | on new installations, people won't even notice.
         | 
         | I mean, at some point it did matter what the OS did when you
         | pressed the "Enter" button. But this isn't really the case much
         | anymore. VS Code catches that keypress, and inserts whatever
         | "files.eol" is set to. Sublime does the same. I didn't check,
         | but I assume every other IDE has this setting.
         | 
         | Similarly, the HTML spec, which is pretty nuts, makes browsers
         | normalize my enters to LF characters as I type into this
         | textarea here (I can check by reading the `value` property in
         | devtools), but when it's submitted, it converts every LF to a
         | CRLF because that's how HTML forms were once specced back in
         | the day. Again though, what my OS considers to be "the standard
         | newline" is simply not considered at all. Even CMD.EXE batch
         | files support LF.
         | 
         | I don't really type newlines all that much outside IDEs and
         | browsers (incl electron apps) and places like MS Word, all of
         | which disregard what the OS does and insert their own thing.
         | Maybe the terminal? I don't even know. I doubt it's very
         | consequential.
         | 
         | EDIT: PSA the same holds for backslashes! Do Not Use
         | Backslashes. Don't use "OS specific directory separator
         | constants". It's not 1998, just type "/" - it just works.
        
           | n_plus_1_acc wrote:
           | I could never get visual studio (not code) to not use \r\n
           | when editing a solution file via the gui
        
           | divingdragon wrote:
           | > Even CMD.EXE batch files support LF.
           | 
           | I don't know if it is the case on Windows 11, but I have
           | surely been bitten by CMD batch files using LF line endings.
           | I don't remember the exact issue but it may have been the one
           | bug affecting labels. [1]
           | 
           | [1]:
           | https://www.dostips.com/forum/viewtopic.php?t=8988#p58888
        
           | pjc50 wrote:
           | > I bet if someday VS Code's Windows build ships with LF
           | default on new installations, people won't even notice.
           | 
           | As with '/', they really ought to do this some day but won't.
        
         | jbverschoor wrote:
         | The whole \r is archaic. It doesn't even behave properly in
         | most cases. Just use \n everywhere and bite the lemon for a
         | short while to fix your problems.
         | 
         | And if you believe \r\n is the way to go, please make sure \n\r
         | also works as they should have the same results. (or
         | \r\n\r\r\r\r for that matter)
        
           | psd1 wrote:
           | There are unices that use LFCR endings... computing is an
           | endless bath in history
        
           | HideousKojima wrote:
           | But without \r how am I supposed to print to my typewriter
           | over serial cable? Only half-joking, that's the setup my
           | family had in the early 90's.
        
             | jbverschoor wrote:
             | Send BELL characters and wait for human intervention
        
           | keybored wrote:
           | Why did they even decide to use two characters for the end of
           | line? Seems bizarre. I could have imagined that `\r` and `\n`
           | was a tossup. But why both?
        
             | mnau wrote:
             | Likely compatibility bugs going back decades (70s?).
             | Probably with some terminal/teletype.
             | 
             | \r - returned teletype head to the start of a line
             | 
             | \n - move paper one line down
             | 
             | > The sequence CR+LF was commonly used on many early
             | computer systems that had adopted Teletype machines--
             | typically a Teletype Model 33 ASR--as a console device,
             | because this sequence was required to position those
             | printers at the start of a new line. The separation of
             | newline into two functions concealed the fact that the
             | print head could not return from the far right to the
             | beginning of the next line in time to print the next
             | character. Any character printed after a CR would often
             | print as a smudge in the middle of the page while the print
             | head was still moving the carriage back to the first
             | position. "The solution was to make the newline two
             | characters: CR to move the carriage to column one, and LF
             | to move the paper up."[2] In fact, it was often necessary
             | to send extra padding characters--extraneous CRs or NULs--
             | which are ignored but give the print head time to move to
             | the left margin. Many early video displays also required
             | multiple character times to scroll the display.
             | 
             | https://en.wikipedia.org/wiki/Newline
        
               | jbverschoor wrote:
               | It's similar to an old school typewriter.
               | 
               | The handle does 2 things: return and feed. You can also
               | just return by not pulling all the way or the other way
               | around depending on the design
        
               | HideousKojima wrote:
               | Which also let you do strikethrough and similar effects
               | by typing over a line you already typed
        
               | keybored wrote:
               | It is known. Why didn't Linux decide to do that though.
        
             | HideousKojima wrote:
             | Typewriters is why
        
       | onion2k wrote:
       | I can hear thousands of bad hiring managers adding 'How do you
       | match the end of a string in a regex?' to their list of 'Ha! You
       | don't know the trick!' questions designed to catch out
       | candidates.
        
         | hoc wrote:
         | "I will hire you anyway, but I will pay you less"
         | 
         | Regex, useful in any job...
        
           | username_my1 wrote:
           | regex is useful but chatgpt is amazing at it, so why spend a
           | minute keeping such useless knowledge in mind.
           | 
           | if you know where to find something no point in knowing it.
        
             | ykonstant wrote:
             | Does gpt produce efficient regex? Are there any experts
             | here that can assess the quality and correctness of gpt-
             | generated regex? I wonder how regex responses by gpt are
             | validated if the prompter does not have the knowledge to
             | read the output.
        
               | thecatspaw wrote:
               | what does gpt say how we should validate email addresses?
        
               | rhd wrote:
               | chatgpt-4:
               | 
               | ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
               | 
               | https://chat.openai.com/share/696f7046-7f43-4331-b12b-538
               | 566...
               | 
               | chatgpt-3.5:
               | 
               | ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
               | 
               | https://chat.openai.com/share/aaa09ae8-3fd9-4df7-a417-948
               | 436...
        
               | layer8 wrote:
               | ...which both excludes addresses allowed by the RFC and
               | includes addresses disallowed by the RFC. (For example,
               | the RFC disallows two consecutive dots in the local-
               | part.)
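A rough Python check of the quoted pattern against a few inputs. The test addresses are illustrative assumptions, and the last case ties back to the article's point about `$`.

```python
import re

# The pattern quoted above, written as a Python raw string.
email_re = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

# Consecutive dots in the local part are accepted by the pattern.
double_dot = email_re.match("john..kennedy@example.com") is not None

# A quoted local part is rejected by the pattern.
quoted = email_re.match('"john kennedy"@example.com') is not None

# $ matches before a trailing newline, so this input is accepted too.
trailing_nl = email_re.match("user@example.com\n") is not None

print(double_dot, quoted, trailing_nl)
```

Using `email_re.fullmatch(...)` instead of `match` would reject the trailing-newline input, though it would not change the other two results.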
        
               | KMnO4 wrote:
               | I take the descriptivist approach to email validation,
               | rather than the prescriptivist.
               | 
               | I know an email has to have a domain name after the @ so
               | I know where to send it.
               | 
               | I also know it has to have something before the @ so the
               | domain's email server knows how to handle it.
               | 
               | But do I care if the email server supports sub
               | addresses, characters outside of the commonly supported
               | range (eg quotation marks and spaces), or even characters
               | which aren't part of the RFC? I do not.
               | 
               | If the user gives me that email, I'll trust them. Worst
               | case they won't receive the verification email and will
               | need to double check it. But it's a lot better than those
               | websites who try to tell me my email is invalid because
               | their regex is too picky.
        
               | layer8 wrote:
               | I generally agree, but the two consecutive dots (or
               | leading/trailing dots) are an example that would very
               | likely be a typo and that you wouldn't particularly want
               | to send. Similar for unbalanced quotes, angle brackets,
               | and other grammar elements.
        
               | dumbo-octopus wrote:
               | I wonder whether simply (regex) replacing a sequence of
               | .'s with a single one as part of a post-processing step
               | would be effective.
        
               | layer8 wrote:
               | That would be bad form, IMO. The user may have typed
               | _john..kennedy@example.com_ by mistake instead of
               | _john.f.kennedy@example.com_ , and now you'll be sending
               | their email to _john.kennedy@example.com_. Similar for
               | leading or trailing dots. You can't just decide what a
               | user probably meant, when they type in something invalid.
        
               | wtetzner wrote:
               | Yeah, that's about as far as I've ever been comfortable
               | going in terms of validating email addresses too: some
               | stuff followed by "@" followed by more stuff.
               | 
               | Though I guess adding a check for invalid dot patterns
               | might be worthwhile.
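The "some stuff, then @, then more stuff" approach, plus the dot check suggested here, could be sketched roughly as follows. The function name and the exact rules are hypothetical; it is a typo catcher for form input, not an RFC validator.

```python
def looks_like_email(address):
    """Hypothetical permissive check: something@something, plus a dot sanity check."""
    # Split on the last "@" so the domain is whatever follows it.
    local, at, domain = address.rpartition("@")
    if not at or not local or not domain:
        return False
    # Flag dot patterns in the local part that are very likely typos.
    if ".." in local or local.startswith(".") or local.endswith("."):
        return False
    return True

print(looks_like_email("john.f.kennedy@example.com"))
print(looks_like_email("john..kennedy@example.com"))
```

Anything that passes would still need a confirmation email to establish that the address actually works.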
        
               | jcranmer wrote:
               | The HTML email regex validation [1] is probably the best
               | rule to use for validating an email address in most user
               | applications. It prohibits IP address domain literals
               | (which the emailcore people have basically said is of
               | limited utility [2]), and quoted strings in the
               | localpart. Its biggest fault is allowing multiple dots to
               | appear next to each other, which is a lot of faff to put
               | in a regex when you already have to individually spell
               | out every special character in atext.
               | 
               | [1]
               | https://html.spec.whatwg.org/multipage/input.html#email-
               | stat...
               | 
               | [2] https://datatracker.ietf.org/doc/draft-ietf-
               | emailcore-as/
        
               | marcosdumay wrote:
               | What is maybe more important to note, it completely
               | disallows the language of some 4/5 of humanity. And
               | partially disallows some 2/3 of the rest.
        
               | sebstefan wrote:
               | Actually pretty good response if the programmer bothers
               | to read all of it
               | 
               | I'd be more emphatic that you shouldn't rely on regexes
               | to validate emails and that this should only be used as
               | an "in the form validation" first step to warn of user
               | input error, but the gist is there
               | 
               | > This regex is *practical for most applications* (??),
               | striking a balance between complexity and adherence to
               | the standard. It allows for basic validation but does not
               | fully enforce the specifications of RFC 5322, which are
               | much more intricate and challenging to implement in a
               | single regex pattern.
               | 
               | ^ ("challenging"? Didn't I see that email validation
               | requires at least a grammar and not just a regex?)
               | 
               | > For example, it doesn't account for quoted strings
               | (which can include spaces) in the local part, nor does it
               | fully validate all possible TLDs. Implementing a regex
               | that fully complies with the RFC specifications is
               | impractical due to their complexity and the flexibility
               | allowed in the specifications.
               | 
               | > For applications requiring strict compliance, it's
               | often recommended to use a library or built-in function
               | for email validation provided by the programming language
               | or framework you're using, as these are more likely to
               | handle the nuances and edge cases correctly.
               | Additionally, the ultimate test of an email address's
               | validity is sending a confirmation email to it.
        
               | bonki wrote:
               | Not good at all, but a little better than expected. I use
               | + in email addresses prominently and there are so many
               | websites who don't even allow that...
        
               | zaxomi wrote:
               | Remember to first punycode the domain part of an email
               | address before trying to validate it, or it will not work
               | with internationalized domain names.
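A minimal Python sketch of the punycode step described above. It assumes the standard library's `idna` codec (which follows the older IDNA 2003 rules) is adequate for illustration. The address and the `to_ascii_address` helper are hypothetical, and the pattern is the GPT4 one quoted in this thread, assumed to use a single backslash before the dot.

```python
import re

# GPT4's pattern from this thread (assumption: single backslash before the dot).
ASCII_EMAIL = re.compile(r"[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def to_ascii_address(address: str) -> str:
    """Hypothetical helper: punycode only the domain part of an address."""
    local, _, domain = address.rpartition("@")
    return f"{local}@{domain.encode('idna').decode('ascii')}"


raw = "info@bücher.example"
converted = to_ascii_address(raw)

raw_ok = ASCII_EMAIL.fullmatch(raw) is not None
converted_ok = ASCII_EMAIL.fullmatch(converted) is not None

print(converted)
print(raw_ok, converted_ok)
```

Only the domain is converted here; the local part is left untouched, so this does not address internationalized local parts.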
        
               | jameshart wrote:
               | Support for IDN email addresses is still patchy at best.
               | Many systems can't send to them; many email hosts still
               | can't handle being configured for them.
        
               | criley2 wrote:
               | Prompt:
               | 
               | 'I'm writing a nodejs javascript application and I need a
               | regex to validate emails in my server. Can you write a
               | regex that will safely and efficiently match emails?'
               | 
               | GPT4 / Gemini Advanced / Claude 3 Sonnet
               | 
               | GPT4: `const emailRegex =
               | /^[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$/;` Full
               | answer: https://justpaste.it/cg4cl
               | 
               | Gemini Advanced: `const emailRegex = /^[a-zA-Z0-9.!#$%&'
               | *+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0
               | -9])?(?:\\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)
               | *$/;` Full answer: https://justpaste.it/589a5
               | 
               | Claude 3: `const emailRegex =
               | /^([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,})$/;`
               | Full answer: https://justpaste.it/82r2v
        
               | zaxomi wrote:
               | Still doesn't support internationalized domain names.
        
               | croemer wrote:
               | Terrible answers as far as I can tell, especially
               | ChatGPT's, which would throw out many valid email
               | addresses.
        
               | dfawcus wrote:
               | Whereas email more or less lasts forever (mailbox
               | contents), and has to be backwards compatible with older
               | versions back to (at least) RFC 821/822, or those before.
               | It also allows almost any character (when escaped at 821
               | level) in the host or domain part (domain names allow any
               | byte value).
               | 
               | So an Internet email address match pattern has to be:
               | "..*@..*", anything else can reject otherwise valid
               | addresses.
               | 
               | That however does not account for earlier source routed
               | addresses, nor the old-style UUCP bang paths. However
               | those can probably be ignored for newly generated email.
               | 
               | I regularly use an email address with a "+" in the host
               | part. When I used qmail, I often used addresses like:
               | "foo-a/b-bar-tat@DOMAIN". Mainly for auto filtering
               | received messages from mailing lists.
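A rough Python comparison of the permissive "..*@..*" pattern with two of the LLM-suggested patterns quoted earlier in the thread. The sample addresses are made up for illustration, and the quoted patterns are assumed to use a single backslash before the dot.

```python
import re

# Patterns quoted earlier in the thread (assumption: single backslash before the dot).
gpt4 = re.compile(r"[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
claude3 = re.compile(r"([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})")
# The permissive pattern: at least one character on each side of an "@".
permissive = re.compile(r"..*@..*")

# Hypothetical sample addresses.
addresses = [
    "user@example.com",
    "user+tag@example.com",
    "foo-a/b-bar-tat@example.org",
]

results = {
    addr: (
        gpt4.fullmatch(addr) is not None,
        claude3.fullmatch(addr) is not None,
        permissive.fullmatch(addr) is not None,
    )
    for addr in addresses
}

for addr, verdicts in results.items():
    print(addr, verdicts)
```

`fullmatch` is used instead of `^...$` so that Python's trailing-newline rule for `$` does not affect the comparison.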
        
               | skeaker wrote:
               | There really ought to be a regex repository of common use
               | cases like these so we don't have to reinvent the wheel
               | or dig up a random codebase that we hope is correct to
               | copy from every time.
        
               | da39a3ee wrote:
               | You don't have to be an expert; you should very rarely be
               | using regexes so complex that you can't understand them.
        
               | zacmps wrote:
               | It might not be obvious when you hit that point; bad
               | regexes can be subtle. Just see that old Cloudflare
               | postmortem.
        
               | hnlmorg wrote:
               | ...and if you can understand them then you clearly
               | understand regex enough not to need ChatGPT to write them
        
               | kaibee wrote:
               | I understand assembly too.
        
               | mnau wrote:
               | Even simple regexes can be problematic, e.g. the GitLab
               | RCE bug through ExifTool
               | 
               | https://devcraft.io/2021/05/04/exiftool-arbitrary-code-
               | execu...
               | 
               | > "a\ > ""
               | 
               | > The second quote was not escaped because in the regex
               | $tok =~ /(\\\\+)$/ the $ will match the end of a string,
               | but also match before a newline at the end of a string,
               | so the code thinks that the quote is being escaped when
               | it's escaping the newline.
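The quoted code is Perl. As a sketch only, the same trailing-newline effect can be reproduced in Python, whose `$` follows the same rule by default. The `tok` value below is a made-up token, not the actual exploit input.

```python
import re

tok = "a\\\n"  # three characters: "a", a backslash, a newline

# "$" matches at the very end of the string and also just before a
# newline that is the last character of the string.
dollar = re.search(r"(\\+)$", tok)

# "\Z" matches only at the very end of the string.
strict = re.search(r"(\\+)\Z", tok)

print(dollar.span() if dollar else None)
print(strict)
```

With `$`, the backslash is reported as trailing even though a newline follows it; with `\Z` there is no match.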
        
               | 2devnull wrote:
               | That was one of my first uh oh moments with gpt. Getting
               | code that clearly had untestable/unreadable regexen,
               | which given the source must have meant the regexes were
               | gpt generated. So much is going to go wrong, and soon.
        
             | berkes wrote:
             | > if you know where to find something no point in knowing
             | it.
             | 
             | Nonsense. And you know it.
             | 
             | First, you need to know _what_ to find, before knowing
             | _where_ to find it. And knowing _what_ to find requires
             | intricate knowledge of the thing. Not intricate
             | implementation details, but enough to point yourself in the
             | right direction.
             | 
             | Secondly, you need to know _why_ to find thing X and not
             | thing Y. If anything, ChatGPT is even worse than google or
             | stackoverflow in  "solving the XY problem for you". XY is a
             | problem you don't want solved, but instead to be told that
             | you don't want to solve it.
             | 
             | Maybe some future LLM can also push back. Maybe some future
             | LLM can guide you to the right answer for a problem. But at
             | the current state: nope.
             | 
             | Related: regexes are almost never the best answer to any
             | question. They are available and quick, so all considered,
             | maybe "the best" for this case. But overall: nah.
        
               | pksebben wrote:
               | While I agree with your point that knowing things
               | matters, it is entirely possible with the current batch
               | of LLMs to get to an answer you don't know much about.
               | It's actually one of the few things they do reliably
               | well.
               | 
               | You start with what you _do_ know, asking leading
               | questions and being clear about what you don't, and you
               | build towards deeper and deeper terminology until you get
               | to the point where there are docs to read (because you
               | still can't trust them to get the specifics right).
               | 
               | I've done this on a number of projects with pretty
               | astonishing results, building stuff that would otherwise
               | be completely out of my wheelhouse.
        
               | lolc wrote:
               | Funny for me there have been instances where the LLM did
               | push back. I had a plan of how to solve something and
               | tasked the LLM with a draft implementation. It kept
               | producing another solution which I kept rejecting and
               | specifying more details so it wouldn't stray. In the end
               | I had to accept that my solution couldn't work, and that
               | the proposed one was acceptable. It's going to happen
               | again, because it often comes up with inferior solutions
               | so I'm not very open to the reverse situation.
        
             | HumblyTossed wrote:
             | This is something ChatGPT would say.
        
       | Karellen wrote:
       | > Folks who've worked with regular expressions before might know
       | about ^ meaning "start-of-string" and correspondingly see $ as
       | "end-of-string".
       | 
       | Huh. I always think of them as "start-of-line" and "end-of-line".
       | I mean, a lot of the time when I'm working with regexes, I'm
       | working with text a line at a time so the effect is the same, but
       | that doesn't change how I think of those operators.
       | 
       | Maybe because a fair amount of the work I do with regexes (and,
       | probably, how I was introduced to them) is via `grep`, so I'm
       | often thinking of the inputs as "lines" rather than "strings"?
        
         | jamesmunns wrote:
         | Same, tho it'd be interesting to see if this behavior holds if
         | the file ends without a trailing newline and your match is on
         | the final newline-less line.
        
           | fooofw wrote:
           | Fortunately, it's pretty simple to test.
           | 
           |     $ printf 'Line with EOL\nLine without EOL' | grep 'EOL$'
           |     Line with EOL
           |     Line without EOL
           |     $ grep --version | head -n1
           |     grep (GNU grep) 3.8
        
             | romwell wrote:
             | The line does end with the file, so it's logically
             | consistent.
             | 
             | It's not matching the newline character after all.
        
               | colimbarna wrote:
               | Yes exactly, they match the end of a line, not a newline
               | character. Some examples from documentation:
               | 
               | man 7 regex: '$' (matching the null string at the end of
               | a line)
               | 
               | pcre2pattern: The circumflex and dollar metacharacters
               | are zero-width assertions. That is, they test for a
               | particular condition being true without consuming any
               | characters from the subject string. These two
               | metacharacters are concerned with matching the starts and
               | ends of lines. ... The dollar character is an assertion
               | that is true only if the current matching point is at the
               | end of the subject string, or immediately before a
               | newline at the end of the string (by default), unless
               | PCRE2_NOTEOL is set. Note, however, that it does not
               | actually match the newline. Dollar need not be the last
               | character of the pattern if a number of alternatives are
               | involved, but it should be the last item in any branch in
               | which it appears. Dollar has no special meaning in a
               | character class.
        
             | jamesmunns wrote:
             | Thanks! I was AFK and didn't have a grep (or a shell) handy
             | on my phone.
        
         | antegamisou wrote:
         | _Maybe because a fair amount of the work I do with regexes
         | (and, probably, how I was introduced to them) is via `grep`, so
         | I'm often thinking of the inputs as "lines" rather than
         | "strings"?_
         | 
         | Vim is what did that for me.
        
         | wccrawford wrote:
         | It's kind of driving me nuts that the article says ^ is "start
         | of string" when it's actually "start of line", just like $ is
         | "end of line". \A is apparently "start of string" like \Z is
         | "end of string".
        
           | masklinn wrote:
           | It's not start of line though, unless the engine is in
           | multiline mode. Here is the documentation for Python's re for
           | instance:
           | 
           | > Matches the start of the string, and in MULTILINE mode also
           | matches immediately after each newline.
           | 
           | Or JavaScript:
           | 
           | > An input boundary is the start or end of the string; or, if
           | the m flag is set, the start or end of a line.
           | 
           | \A and \Z are start/end of input regardless of mode... when
           | they're available, that's not the case of all engines.
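A small Python sketch of the distinction described above, using a made-up two-line string: `^` and `$` only become per-line anchors under `re.MULTILINE`, while `\A` and `\Z` stay tied to the whole input.

```python
import re

text = "first\nsecond\n"

# Default mode: "^" is only the start of the string, and "$" is only the end
# (or just before a final newline), so a single word cannot span the input.
default_hits = re.findall(r"^\w+$", text)

# MULTILINE: "^" and "$" also match after and before every newline.
multiline_hits = re.findall(r"^\w+$", text, re.MULTILINE)

# "\A" and "\Z" ignore MULTILINE and still refer to the whole input.
anchored_hits = re.findall(r"\A\w+\Z", text, re.MULTILINE)

print(default_hits, multiline_hits, anchored_hits)
```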
        
             | eastbound wrote:
             | Probably a vulnerability issue. Programmers would leave
             | multiline mode on by mistake, then validate that some
             | string matches ^[a-zA-Z]*$... only for the string to have
             | an \n and an SQL injection on the second line.
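A hedged Python illustration of that failure mode. The payload is invented, `[a-zA-Z]` stands in for the character class in the comment, and `re.fullmatch` is shown as one possible stricter check rather than the only fix.

```python
import re

# Hypothetical input: letters on the first line, something nasty on the second.
payload = "abc\n'; DROP TABLE users; --"

# With MULTILINE left on, "^...$" is satisfied by the first line alone.
loose = re.search(r"^[a-zA-Z]*$", payload, re.MULTILINE)

# fullmatch requires the pattern to cover the entire string.
strict = re.fullmatch(r"[a-zA-Z]*", payload)

print(loose.group(0) if loose else None)
print(strict)
```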
        
               | masklinn wrote:
               | > Probably a vulnerability issue.
               | 
               | No? It's a semantics decision.
        
             | danbruc wrote:
             | It is start and end of line. [1]
             | 
             |  _Usually ^ matches only at the beginning of the string,
             | and $ matches only at the end of the string and immediately
             | before the newline (if any) at the end of the string. When
             | this flag is specified, ^ matches at the beginning of the
             | string and at the beginning of each line within the string,
             | immediately following each newline. Similarly, the $
             | metacharacter matches either at the end of the string and
             | at the end of each line (immediately preceding each
             | newline)._
             | 
             | In single-line [2] mode, the line starts at the start of
             | the string and ends at the end of the line where the end of
             | the line is either the end of the string if there is no
             | terminating newline or just before the final newline if
             | there is a terminating newline.
             | 
             | In multi-line mode a new line starts at the start of the
             | string and after each newline and ends before each newline
             | or at the end of the string if the last line has no
             | terminating newline.
             | 
             | The confusion is that people think that they are in string-
             | mode if they are not in multi-line mode but they are not,
             | they are in single-line mode, ^ and $ still use the
             | semantics of lines and a terminating newline, if present,
             | is still not part of the content of the line.
             | 
             | With \n\n\n in single-line mode the non-greedy ^(\n+?)$
             | will capture only two of the newlines, the third one will
             | be eaten by the $. If you make it greedy ^(\n+)$ will
             | capture all three newlines. So arguably the implementations
             | that do not match cat\n with cat$ are the broken ones.
             | 
             | [1] https://docs.python.org/3/howto/regex.html#more-
             | metacharacte...
             | 
             | [2] I am using single-line to mean not multi-line for
             | convenience even though single-line already has a different
             | meaning.
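The \n\n\n example above can be checked with a short Python sketch (Python being the engine whose documentation is quoted here); nothing beyond the standard `re` module is assumed.

```python
import re

text = "\n\n\n"

# Non-greedy: stops as soon as "$" can match, which is just before the
# final newline, so only two newlines are captured.
lazy = re.search(r"^(\n+?)$", text)

# Greedy: consumes all three newlines, and "$" matches at the very end.
greedy = re.search(r"^(\n+)$", text)

# "cat$" matches "cat\n" because "$" may sit before a final newline.
cat = re.search(r"cat$", "cat\n")

print(len(lazy.group(1)), len(greedy.group(1)), cat is not None)
```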
        
               | masklinn wrote:
               | > It is start and end of line.
               | 
               | You seem to have redefined "line" as "not a line".
               | 
               | > The confusion
               | 
               | I'm sure redefining "line" as "nothing like what anyone
               | reasonable would interpret as a line" will help a lot and
               | right clear up the confusion.
        
               | danbruc wrote:
               | The POSIX definition of a line is a sequence of non-
               | newline characters - possibly zero - followed by a
               | newline. Everything that does not end with a newline is
               | not a [complete] line. So strictly speaking it would even
               | be correct that cat$ does not match cat because there is
               | no terminating newline, it should only match cat\n. But
               | as lines missing a terminating newline is a thing, it
               | seems reasonable to be less strict.
        
               | masklinn wrote:
               | > a line is a sequence of non-newline characters
               | 
               | Works for me.
               | 
               | How do you square that with your assertion that in your
               | invention of "single-line mode" you implicitly define
               | "line" as matching \n\n?
        
               | danbruc wrote:
               | If you are not in multi-line mode, then a single line is
               | expected and consequently there is at most one newline at
               | the end of the string. You can of course pick an input
               | that violates this, run it against a multi-line string
               | with several newlines in it. cat\n\n will not match cat$
               | because there is something between cat and the end of the
               | line, it just happens to be a newline but without any
               | special meaning because it is not the last character and
               | you did not say that the input is multi-line.
        
               | sltkr wrote:
               | Python violates that definition however, by allowing
               | internal newlines in strings. For example /^c[^a]t$/
               | matches "c\nt\n", but according to POSIX that's not a
               | line.
               | 
               | I suspect the real reason for Python's behavior starts
               | with the early decision to include the terminating
               | newline in the string returned by IOBase.readline().
               | 
               | Python's peculiar choice has some minor advantages: you
               | can distinguish between files that do and don't end with
               | a terminating newline (the latter are invalid according
               | to POSIX, but common in practice, especially on Windows),
               | and you can reconstruct the original file by simply
               | concatenating the line strings, which is occasionally
               | useful.
               | 
               | The downside of this choice is that as a caller you have
               | to deal with strings that may-or-may-not contain a
               | terminating newline character, which is annoying (I often
               | end up calling rstrip() or strip() on every line returned
               | by readline(), just to get rid of the newlines;
               | read().splitlines() is an option too if you don't mind
               | reading the entire file into memory upfront).
               | 
               | My guess is that Python's behavior is just a hack to make
               | re.match() easier to use with readline(), rather than
               | based on any principled belief about what lines are.
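A minimal sketch of the readline() behavior discussed above, using an in-memory `io.StringIO` as a stand-in for a real file; the file contents are invented.

```python
import io
import re

# Stand-in for a file whose final line has no terminating newline.
stream = io.StringIO("cat\ndog")

# readline() keeps the newline when there is one and returns "" at EOF.
lines = list(iter(stream.readline, ""))

# Concatenating the pieces reconstructs the original content exactly.
rebuilt = "".join(lines)

# "$" tolerates the newline that readline() left on the first line.
first_matches = re.match(r"^cat$", lines[0]) is not None

# Callers who want bare content have to strip the terminator themselves.
stripped = [line.rstrip("\n") for line in lines]

# An internal newline is treated as an ordinary character by "[^a]".
internal = re.search(r"^c[^a]t$", "c\nt\n") is not None

print(lines, rebuilt == "cat\ndog", first_matches, stripped, internal)
```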
        
               | danbruc wrote:
               | Python's behavior is not a hack, it is the common
               | behavior. $ matches at the end of the string or before
               | the last character if that is a newline, which is
               | logically the same as the end of a single line. But as
               | you said, you can have additional newlines inside of the
               | string which is also the common behavior and not specific
               | to python. Personally I think of this as you just assume
               | that the string is a single line and match $ accordingly,
               | either at the end of the string or before a terminating
               | newline, if there are additional newlines, you treat them
               | mostly as normal characters, with the exception of dot
               | not matching newlines unless you set the single-line/dot-
               | all flag.
        
               | sltkr wrote:
               | > Python's behavior [..] is the common behavior.
               | 
               | The very post we're commenting on shows that that's not
               | true: PHP, Python, Java and .NET (C#) share one behavior
               | (accept "\n" as "$"), and ECMAScript (Javascript),
               | Golang, and Rust share another behavior (do not accept
               | "\n" as $).
               | 
               | Let's not argue about which is "the most common"; all of
               | these languages are sufficiently common to say that there
               | is no single common behavior.
               | 
               | > $ matches at the end of the string or before the last
               | character if that is a newline, which is logically the
               | same as the end of a single line.
               | 
               | Yes, that is Python's behavior (and PHP's, Java's, etc.).
               | You're just describing it; not motivating why it has to
               | work that way or why it's more correct than the obvious
               | alternative of only matching the end of the string.
               | 
               | Subjectively, I find it odd that /^cat$/ matches not just
               | the obvious string "cat" but also the string "cat\n". And
               | I think historically, it didn't. I tried several common
               | tools that predate Python:
               | 
               |   - awk 'BEGIN { print ("cat\n" ~ /^cat$/) }' prints 0
               |   - in GNU ed, /^M/ does not match any lines
               |   - in vim, /^M/ does not match any lines
               |   - sed -n '/\n/p' does not print any lines
               |   - grep -P '\n' does not match any lines
               |   - (I wanted to try `grep -E` too but I don't know how
               |     to escape a newline)
               |   - perl -e 'print ("cat\n" =~ /^cat$/)' prints 1
               | 
               | So the consensus seems to be that the classic UNIX line-
               | based tools match the regex against the line excluding
               | the newline terminator (which makes sense since it isn't
               | part of the content of that line) and therefore $ only
               | needs to match the end of the string.
               | 
               | The odd one out is Perl: it seems to have introduced the
               | idea that $ can match a newline at the end of the string,
               | probably for similar reasons as Python. All of this
               | suggests to me that allowing $ to match both "\n" and ""
               | at the end of the string was a hack designed to make it
               | easier to deal with strings without control characters
               | and strings that end with a single newline.
        
               | Bjartr wrote:
               | The line delimiter is a newline.
               | 
               | If you have `A\nB\nC` in a file, the file is three lines
               | long.
               | 
               | I guess it could be argued that a file containing
               | `A\nB\nC\n` has four lines, with the fourth having zero
               | length.
               | 
               | That a regex is applying to an in memory string vs a file
               | doesn't feel to me like it should have different
               | semantics.
               | 
               | Digging into the history a little, it looks like regexes
               | were popularized in text editors and other file oriented
               | tooling. In those contexts I imagine it would be far more
               | common to want to discard or ignore the trailing zero
               | length line than to process it like every other line in a
               | file.
        
               | akdev1l wrote:
               | Technically the "newline" character is actually a line
               | _terminator_. Hence "A\n" is one line, not two. The "\n"
               | is always at the end of a line by definition.
        
               | wtetzner wrote:
               | So if you have "A" in a file with no newline, there are
               | no lines in that file?
        
               | jepler wrote:
               | Yes, that is a file with zero lines that ends with an
               | "incomplete line". Processing of such files by standard
               | line-oriented utilities is undefined in the opengroup
               | spec. So, for instance, the effect of "grep"ping such a
               | file is not defined. Heck, even "cat"ting such a file
               | gives non-ideal results, such as colliding with the
               | regular shell prompt. For this reason, a lot of software
               | projects I work on check and correct this condition
               | whenever creating a commit.
               | 
               | https://pubs.opengroup.org/onlinepubs/9699919799/basedefs
               | /V1... ("text file")
        
               | rovr138 wrote:
               | > Yes, that is a file with zero lines that ends with an
               | "incomplete line".
               | 
               | It's a file with zero complete lines. But it has 1 line,
               | that's incomplete, right?
               | 
               | The file starts empty. Anything in it starts "a line". So
               | it's 1 incomplete line.
               | 
               | I hate weird states.
        
               | xyzzy_plugh wrote:
               | No, it is valid for a file to have content but no lines.
               | 
               | Semantically many libraries treat that as a line because
               | while \n<EOF> means "the end of the last line" having
               | just <EOF> adds additional complexity the user has to
               | handle to read the remaining input. But by the book it's
               | not "a line".
               | 
               | If I said "ten buckets of water" does that mean ten full
               | buckets? Or does a bucket with a drop in it count as "a
               | bucket of water?" If I asked for ten buckets of water and
               | you brought me nine and one half-full, is that
               | acceptable? What about ten half-full buckets?
               | 
               | A line ends in a newline. A file with no newlines in it
               | has no lines.
        
               | joshjje wrote:
               | That's beyond ridiculous. Most languages when you are
               | reading a line from a file, and it doesn't have a \n
               | terminator, it's going to give you that line, not say,
               | oops, this isn't a line sorry.
        
               | LK5ZJwMwgBbHuVI wrote:
               | That's a relatively recent invention compared to tools
               | like `wc` (or your favorite `sh` for that matter). See
               | also: https://perldoc.perl.org/functions/chop wherein the
               | norm was "just cut off the last character of the line, it
               | will always be a newline"
        
               | squeaky-clean wrote:
               | Most languages but not all. I've even been bit by this
               | recently in cron.
               | 
               | Assuming that EOF is identical to \nEOF will end up
               | causing trouble for you one day, because it's not
               | actually identical.
        
               | int_19h wrote:
               | I don't think you can meaningfully generalize to "most
               | languages" here. To give an example, two extremely
               | popular languages are C and Python. Both have a standard
               | library function to read a line from a text stream -
               | fgets() for C, readline() for Python. In both cases, the
               | behavior is to read up to _and including_ the newline
               | character, but also to stop if EOF is encountered before
               | then. Which means that the return value is different for
               | terminated vs unterminated final lines in both languages
               | - in particular, if there's no \n before EOF, the value
               | returned is _not a line_ (as it does not end with a
               | newline), and you have to explicitly write your code to
               | accommodate that.
        
               | nativeit wrote:
               | I get this is largely a semantic debate, but find it a
               | little ironic so many programmers seem put off with the
               | idea of a line count that starts at "0".
        
               | akdev1l wrote:
               | No, a line is defined as a sequence of characters
               | (bytes?) with a line terminator at the end.
               | 
               | Technically as per posix a file as you describe is
               | actually a binary file without any lines. Basically just
               | random binary data that happens to kind of look like a
               | line.
        
               | mort96 wrote:
               | It's a file with 0 lines and some trailing garbage.
        
               | DougBTX wrote:
               | Another way to look at it is that concatenating files
               | should sum the line count. Concatenating two empty files
               | produces an empty file, so 0 + 0 = 0. If "incomplete
               | lines" are not counted as lines, then the maths still
               | works out. If they counted as lines, it would end up as 1
               | + 1 = 1.
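The arithmetic above can be sketched in Python. Both counting functions are hypothetical: `count_terminated` counts newline terminators (the convention `wc -l` is shown following elsewhere in this thread), while `count_with_incomplete` also counts an unterminated final piece.

```python
def count_terminated(data: bytes) -> int:
    """Count lines as newline-terminated units."""
    return data.count(b"\n")


def count_with_incomplete(data: bytes) -> int:
    """Count an unterminated final piece as a line too."""
    return len(data.splitlines())


a = b"A"  # hypothetical file contents without a trailing newline
b = b"B"

# Terminator counting is additive under concatenation: 0 + 0 == 0.
additive = count_terminated(a) + count_terminated(b) == count_terminated(a + b)

# Counting incomplete lines is not: 1 + 1 != 1, since a + b is just b"AB".
parts = (count_with_incomplete(a), count_with_incomplete(b), count_with_incomplete(a + b))

print(additive, parts)
```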
        
               | coryrc wrote:
               | Pedantically, if it doesn't end with a newline, it's
               | considered a binary file and not a text file. Binary
               | files don't have lines.
               | 
               | In practice, most utilities expecting text files will
               | still operate on it.
        
               | PaulDavisThe1st wrote:
               | No file has lines.
               | 
               | "Lines" are a convention established by (or not) software
               | reading a data stream.
        
               | coryrc wrote:
               | Ackshully
        
               | rerdavies wrote:
               | The opengroup spec says no such thing.
        
               | simonh wrote:
               | 3.206 Line
               | 
               | A sequence of zero or more non- <newline> characters plus
               | a terminating <newline> character.
               | 
               | See also '3.403 Text File' for the definition of a text
               | file. No new line characters, no lines. No lines, not a
               | text file.
        
               | mbrubeck wrote:
               |     $ echo -n "A" | wc --lines
               |     0
        
               | keybored wrote:
               | Yep, since wc(1) apparently strictly adheres to what a
               | newline-terminated text file is. This is why plaintext
               | files should end with a newline. :)
               | 
               | See: https://stackoverflow.com/a/25322168/1725151
        
               | LK5ZJwMwgBbHuVI wrote:
               | Why don't you go ask?
               | 
               |     $ echo -n foo | wc -l
               |     0
        
               | Gormo wrote:
               | Suddenly the DOS/Windows solution of using \r\n instead
               | of just \n seems to offer some advantages.
        
               | samatman wrote:
               | This does precisely nothing to solve the ambiguity issue
               | when a final line lacks a newline. The representation of
               | that newline isn't relevant to the problem.
        
               | Izkata wrote:
               | It's actually slightly worse: Windows defines newline as
               | a delimiter, not a terminator. So this:
               | foo\nbar\n
               | 
               | Would be 2 lines in *nix and 3 lines in windows.
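The two conventions can be contrasted with a short Python sketch. It only illustrates delimiter versus terminator counting on the string from the comment and makes no claim about any particular Windows API.

```python
text = "foo\nbar\n"

# Newline as a delimiter (separator): the trailing "\n" starts a third,
# empty line.
as_delimiter = text.split("\n")

# Newline as a terminator: the trailing "\n" ends the second line and
# nothing follows it.
as_terminator = text.splitlines()

print(len(as_delimiter), as_delimiter)
print(len(as_terminator), as_terminator)
```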
        
               | deaddodo wrote:
               | The "Windows way" is the "right way" for a few reasons.
               | 
               | This is definitely _not_ one of them.
        
               | int_19h wrote:
               | Which are the valid reasons, legacy meanings of those
               | characters aside?
        
               | rerdavies wrote:
               | Technically, that is one of two possible interpretations,
               | and you seem to have invented a "by definition" out of
               | thin air.
               | 
               | Very very technically a "newline" character indicates the
               | start of a new line, which is why it is not called the
               | "end-of-line" character.
        
               | cortesoft wrote:
               | I mean, the person you are responding to didn't invent
               | the definition out of thin air... the POSIX standard did:
               | 
               | 3.206 Line A sequence of zero or more non- <newline>
               | characters plus a terminating <newline> character.
               | 
               | https://pubs.opengroup.org/onlinepubs/9699919799.2018edition...
        
               | nomel wrote:
               | Posix getline() includes EOF as a line terminator:
               | 
               |     getline() reads an entire line from stream, storing
               |     the address of the buffer containing the text into
               |     *lineptr. The buffer is null-terminated and includes
               |     the newline character, if one was found.
               |     ...
               |     ... a delimiter character is not added if one was not
               |     present in the input before end of file was reached.
               | 
               | EOF seems same as end-of-string.
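Python's file objects behave much like the getline() description quoted above, which makes the point easy to try out. The sketch below uses an in-memory `io.StringIO` stream as a stand-in for a real file; the sample text is an assumption chosen for the example.

```python
import io

# The last line has no terminating newline.
stream = io.StringIO("first\nsecond")
lines = stream.readlines()

# The final, unterminated chunk is still returned as a line, and no newline
# is added to it.
print(lines)  # ['first\n', 'second']
```

End of input acts as an implicit terminator here, so nothing is dropped even though the last "line" is incomplete by the POSIX definition.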
        
               | mabster wrote:
               | I don't know why no-one here sees this as a bad design...
               | 
               | If a line is missing a newline then we just disregard
               | it?!
               | 
               | A way better way to deal with newline is it's a separator
               | like comma. And like in modern languages we allow a final
               | separator, but ignore it so that it is easier for tools to
               | generate files.
               | 
               | Now all combinations of characters, including newline
               | characters, have an interpretation without dropping
               | anything.
        
               | LK5ZJwMwgBbHuVI wrote:
               | It doesn't indicate the start of a new line, or files
               | would _start_ with it. Files _end_ with it, which is why
               | it is a line terminator. And it is by definition: by the
               | standard, by the way cat and/or your shell and/or your
               | terminal work together, and by the way standard utilities
               | like `wc` treat the file.
        
               | joshjje wrote:
               | "A\n" is two lines.
        
               | LK5ZJwMwgBbHuVI wrote:
               | Factually incorrect.
        
             | f1shy wrote:
             | Matches the EMPTY STRING at the beginning of the line is
             | the correct definition.
        
           | tangus wrote:
           | That gives the author space for another article ;)
        
           | amelius wrote:
           | What is driving me nuts is that we have Unicode now, so there
           | is no need to use common characters like $ or ^ to denote
           | special regex state transitions.
        
             | knome wrote:
             | the idea of changing a decades old convention to instead
             | use, as I assume you are implying, some character that
             | requires special entry, is beyond silly.
        
               | FranOntanaya wrote:
               | I don't think anyone that writes regex would feel
               | specially challenged by using the Alt+ | Ctrl+Shift+u key
               | combos for unicode entry. Having to escape less things in
               | a pattern would be nice.
        
               | amelius wrote:
               | Also, code is read more often than it is written.
        
               | cortesoft wrote:
               | People say this all the time, but is it really always
               | true? I have a ton of code that I wrote, that just works,
               | and I never really look at it again, at least not with
               | the level of inspection that requires parsing the regex
               | in my head.
        
               | cortesoft wrote:
               | I write regexes all the time, and I don't know if I would
               | be CHALLENGED by that, but it would be annoying. Escaping
               | things is trivial, and since you do it all the time it is
               | not anything extra to learn. Having to remember bespoke
               | keystrokes for each character is a lot more to learn.
        
               | keybored wrote:
               | ASCII restriction begets ASCII toothpick soup. Either
               | lift that restriction or use balanced delimiters for
               | strings in ASCII like backtick and single quote.
               | 
               | ("But backtick is annoying to type" said the Europeans.)
        
               | int_19h wrote:
               | Regexes are one case where I think it's already extremely
               | unbalanced wrt being easy to write but hard to read.
               | Using stuff like special Unicode chars for this would
               | make them harder to write but easier to read, which
               | sounds like a fair deal to me. In general, I'd say that
               | regexes _should_ take time and effort to write, just
               | because it's oh-so-easy to write something that kinda
               | sorta works but has massive footguns.
               | 
               | I would also imagine that, if this became the norm, IDEs
               | would quickly standardize around common notation -
               | probably actually based on existing regex symbols and
               | escapes - to quickly input that, similar to TeX-like
               | notation for inputting math. So if you're inside a regex
               | literal, you'd type, say, \A, and the editor itself would
               | automatically replace it with the Unicode sigil for
               | beginning-of-string.
        
               | keybored wrote:
               | It's not that silly. You constantly get into escape
               | conundrums because you need to use a metacharacter which
               | is also a metacharacter three levels deep in some
               | embedding.
               | 
               | (But that might not solve that problem? Maybe the problem
               | is mostly about using same-character delimiters for
               | strings.)
               | 
               | And I guess that's why Perl is so flexible with regards
               | to delimiters and such.
        
               | LK5ZJwMwgBbHuVI wrote:
               | Yes, languages really need some sort of "raw string"
               | feature like Python (or make regex literals their own
               | syntax like Perl does). That's the solution here, not
               | using weird characters...
        
             | Yujf wrote:
             | Why not? Common characters are easier to type and presumably
             | if you are using regex on a unicode string they might
             | include these special characters anyway so what have you
             | gained?
        
               | amelius wrote:
               | In theory yes, in practice no.
               | 
               | What you have gained is that the regex is now much easier
               | to read.
        
               | knome wrote:
               | It's easy to read now.
        
               | LK5ZJwMwgBbHuVI wrote:
               | > In theory yes, in practice no.
               | 
               | That's like "in theory we need 4 bytes to represent
               | Unicode, but in practice 3 bytes is fine" ( _glances at
               | universally-maligned utf8mb3_ )
        
               | int_19h wrote:
               | It's not really an issue if the string you're matching
               | might have those characters. It's an issue if the _regex_
               | you are matching that string might need to _match_ those
               | characters verbatim. Which is actually pretty common with
               | ()[]$ when you're matching phone numbers, prices etc -
               | so you end up having to escape a lot, and regex is less
               | readable especially if it also has to use those same
               | characters as regex operators. On the other hand, it
               | would be very uncommon to want to literally match, say,
               | or [[?].
        
             | yjftsjthsd-h wrote:
             | If we were willing to ignore the ability to actually type
             | it, you don't need Unicode for that; ASCII has a whole
             | block of control characters at the beginning; I think ASCII
             | 25 ("End of medium") works here.
        
             | codethatwerks wrote:
             | The problem with using an eggplant to denote end of string
             | is backwards compatibility.
        
           | davidw wrote:
           | What with unicode, it'd be fun to have A and O available to
           | make our regexps that much more readable...
        
         | kqr wrote:
         | I'm the same, but now that I try in Perl, sure enough, $ seems
         | to default to being a positive lookahead assertion for the end
         | of the string. It does not match and consume an EOL character.
         | 
         | Only in multiline mode does it match EOL characters, but it
         | does still not appear to consume them. In fact, I cannot
         | construct a regex that captures the last character of one line,
         | then consumes the newline, and then captures the first
         | character of the next line, while using $. The capture group
         | simply ends at $.
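For comparison, here is how the same experiment looks in Python's `re` module rather than Perl. It is a sketch of one engine's behaviour only, and Perl may differ; the sample text is made up for the example.

```python
import re

text = "ab\ncd"

# "$" is zero-width: it asserts a position and consumes nothing, so the
# newline has to be matched explicitly before "^" can succeed on the next line.
m = re.search(r"(.)$\n^(.)", text, re.MULTILINE)
print(m.groups())  # ('b', 'c')
print(m.span())    # (1, 4)

# On its own, "$" matches an empty span just before the final newline.
print(re.search(r"$", "ab\n").span())  # (2, 2)
```

In this engine the capture across the line break works as long as the `\n` itself is written into the pattern between `$` and `^`.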
        
           | singingfish wrote:
           | To get the newline captured as well you need to add the `/s`
           | modifier too
        
         | absoluteunit1 wrote:
         | I've always thought that as well; mostly due to Vim though.
         | 
         | ^ - takes you to start of line
         | $ - takes you to end of line
        
           | Izkata wrote:
           | ^ actually takes you to the first non-whitespace character in
           | the line in vim. For start of line you want 0
        
             | kataklasm wrote:
             | I don't have (n)vi(m) open right now but I think this only
             | applies to prepending spaces. For prepending tabs, 0 will
             | take you to the first non-tab character as well.
        
               | qu4z-2 wrote:
               | Vim takes me to the first character in the line (the
               | first tab), but displays the cursor on the last
               | gridsquare the tab's width covers.
        
         | alphazard wrote:
         | This must be the "second problem" everyone talks about with
         | regular expressions.
        
         | Izkata wrote:
         | Same here; when I saw the title I was like "well obviously not,
         | where did you hear that?"
         | 
         | In nearly two decades of using regex I think this might be the
         | first time I've heard of $ being end of string. It's always
         | been end of line for me.
        
           | frame_ranger wrote:
           | You couldn't write a post like this if you didn't start with
           | a strawman.
        
           | michaelt wrote:
           | Take a look at, for example, these stackoverflow answers
           | about a regex to validate an e-mail address:
           | https://stackoverflow.com/a/8829363
           | 
           | These people are I think not intending to say a newline
           | character is permitted at the end of an e-mail address.
           | 
           | (Of course people using 'grep' would have different
           | expectations for obvious reasons)
        
             | Izkata wrote:
             | Even disregarding whether or not end-of-string is also an
             | end-of-line or not (see all the other comments below), $
             | doesn't match the newline, similar to zero-width matches
             | like \b, so the newline wouldn't be included in the matched
             | text either way.
             | 
             | I think this series of comments might be clearest:
             | https://news.ycombinator.com/item?id=39764385
        
               | LK5ZJwMwgBbHuVI wrote:
               | Problem is, plenty of software doesn't actually look at
               | the match but rather just validates that there _was_ a
               | match (and then continues to use the input to that
               | match).
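The "validate, then keep using the raw input" pattern described above might look like this in Python. Both function names are hypothetical and exist only for this sketch.

```python
import re

def looks_like_number_loose(value: str) -> bool:
    # Only checks that *a* match exists; "$" also matches before a final "\n".
    return re.match(r"^\d+$", value) is not None

def looks_like_number_strict(value: str) -> bool:
    # fullmatch requires the pattern to cover the entire string.
    return re.fullmatch(r"\d+", value) is not None

user_input = "42\n"
print(looks_like_number_loose(user_input))   # True
print(looks_like_number_strict(user_input))  # False
```

The loose check passes, yet the caller is still holding `"42\n"`, newline included, which is exactly the value that gets used afterwards.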
        
         | notnmeyer wrote:
         | i feel like this perspective will be split between folks who
         | use regex in code with strings and more sysadmin folks who are
         | used to consuming lines from files in scripts and at the cli.
         | 
         | but yeah seems like a real misunderstanding from "start/end of
         | string" people
        
         | cerved wrote:
         | In `sed` it's end of string.
         | 
         | String is usually end of line, but not if you use stuff like
         | `N`, to manipulate multi-line strings
        
       | hans_castorp wrote:
       | Fun fact: in Postgres, 'cat\n' matches 'cat$' when the so called
       | "weird" newline matching is enabled :)
       | 
       | https://www.postgresql.org/docs/current/functions-matching.h...
        
       | AtNightWeCode wrote:
       | There are many differences between implementations of regex. To
       | name a few. Lookbehind, atomic groups, named capturing groups,
       | recursion, timeouts and my favorite interop problem, unicode.
        
       | wruza wrote:
       | _By default, '$' only matches at the end of the string and
       | immediately before the newline (if any) at the end of the
       | string._
       | 
       | The rationale was probably "it should be easier to match input
       | strings" and now it's harder for everyone.
        
       | febeling wrote:
       | Seriously, just write one unit test for your regex.
        
         | mannykannot wrote:
         | Indeed, one should test any regex one puts any trust in, but
         | the problem is that if you take as a fact something that is
         | actually a false assumption (as the author did here), your test
         | may well fail to find errors which may cause faults when the
         | regex is put to use.
         | 
         | This, in a nutshell, is the sort of problem which renders
         | fallacious the notion that you can unit-test your way to
         | correct software.
        
       | PuffinBlue wrote:
       | This seems like the perfect opportunity to introduce those
       | unfamiliar to Robert Elder. He makes cool YouTube[0] and blog
       | content[1] and has a series on regular expressions[2] and does
       | some quite deep dives into the differing behaviour of the
       | different tools that implement the various versions.
       | 
       | His latest on the topic is cool too:
       | https://www.youtube.com/watch?v=ys7yUyyQA-Y
       | 
       | He has quite a lot of content that HN folks might be interested
       | in I think, like the reality and woes of consulting[3]
       | 
       | [0] https://www.youtube.com/@RobertElderSoftware
       | 
       | [1] https://blog.robertelder.org/
       | 
       | [2] https://blog.robertelder.org/regular-expressions/
       | 
       | [3] https://www.youtube.com/watch?v=cK87ktENPrI
        
         | aquariusDue wrote:
         | I'm glad to see someone else that has stumbled over his
         | content. Seconding the recommendation.
        
         | CatchSwitch wrote:
         | He has so many favorite Linux commands lol
        
       | teknopaul wrote:
       | Tldr;
       | 
       | $ does not mean end of string in Python.
        
       | frou_dh wrote:
       | Something I found really surprising about Python's regexp
       | implementation is that it doesn't support the typical character
       | classes like [:alnum:] etc.
       | 
       | It must be some kind of philosophical objection because there's
       | no way something with as much water under the bridge as Python
       | simply hasn't got around to it.
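Since Python's `re` has no `[:alnum:]`, the usual workaround is to spell the class out. The two patterns below are approximations chosen for this sketch, not exact POSIX equivalents, and the variable names are made up.

```python
import re

# ASCII-only stand-in for POSIX [[:alnum:]].
ascii_alnum = re.compile(r"[A-Za-z0-9]+")

# Unicode-aware approximation: "\w" minus the underscore, written as a
# negated class of "not a word character, or underscore".
unicode_alnum = re.compile(r"[^\W_]+")

print(bool(ascii_alnum.fullmatch("abc123")))         # True
print(bool(ascii_alnum.fullmatch("caf\u00e9")))      # False
print(bool(unicode_alnum.fullmatch("caf\u00e9")))    # True
print(bool(unicode_alnum.fullmatch("snake_case")))   # False
```

Which of the two is the right substitute depends on whether the input is expected to be ASCII only.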
        
       | k3vinw wrote:
       | Another poor soul trying to solve one problem using regex and now
       | they have two... ;)
        
       | croes wrote:
       | Isn't a string with a newline character automatically multiline?
       | 
       | The new line is just empty but not the first line anymore.
        
         | Joker_vD wrote:
         | No, it is not.
         | 
         |     3.195 Incomplete Line
         |     A sequence of one or more non-<newline> characters at the
         |     end of the file.
         | 
         |     3.206 Line
         |     A sequence of zero or more non-<newline> characters plus a
         |     terminating <newline> character.
         | 
         | courtesy of [0]. See also [1] for rationale on "text file":
         | 
         |     Text File
         |     [...] The definition of "text file" has caused controversy.
         |     The only difference between text and binary files is that
         |     text files have lines of less than {LINE_MAX} bytes, with no
         |     NUL characters, each terminated by a <newline>. The
         |     definition allows a file with a single <newline>, or a
         |     totally empty file, to be called a text file. If a file ends
         |     with an incomplete line it is not strictly a text file by
         |     this definition. [...]
         | 
         | [0]
         | https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...
         | 
         | [1]
         | https://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd...
        
           | croes wrote:
           | Not everything uses POSIX maybe that's a reason for the
           | different results.
        
       | perlgeek wrote:
       | Raku (formerly Perl 6) has picked ^ and $ for start-of-string and
       | end-of-string, and has introduced ^^ and $$ for start-of-line and
       | end-of-line. No multi line mode is available or necessary.
       | (There's also \h for horizontal and \v for vertical whitespace)
       | 
       | That's one of the benefits of a complete rethink/rewrite, you can
       | learn from the fact that the old behavior surprised people.
        
         | Terretta wrote:
         | And this is why this curmudgeon can't use Perl 6[^1]. It
         | randomly shuffles the line noise we learned over decades.
         | 
         | It seems so obvious that's the opposite of what they should
         | have defaulted to, that it clearly should have been ^ and $ for
         | lines, and ^^ and $$ for the string, since like ((1)(2)(3)):
         | 
         | ^^line1$\n^line2$\n^line3$\n$
         | 
         | [1]: That, and it's not anywhere, while Perl 5 is everywhere.
        
         | richardwhiuk wrote:
         | Think I would have picked exactly the reverse (i.e. ^^ being
         | more "starty" than "^").
        
           | lcnPylGDnU4H9OF wrote:
           | Reminds me of verbosity flags in some cli utilities. Often,
           | -v is "verbose" and -vv is "very verbose" and -vvv... etc.
        
       | wodenokoto wrote:
       | > So if you're trying to match a string without a newline at the
       | end, you can't only use $ in Python! My expectation was having
       | multiline mode disabled wouldn't have had this newline-matching
       | behavior, but that isn't the case.
       | 
       | A reproducible example would be nice. I don't understand what it
       | is he cannot do. `re.search('$', 'no new lines')` returns a
       | match.
        
         | iainmerrick wrote:
         | This unexpectedly matches:
         | 
         | re.match('^bob$', 'bob\n')
         | 
         | I didn't want the trailing newline to be included.
        
           | wodenokoto wrote:
           | But that string does have a new line at the end.
        
             | iainmerrick wrote:
             | re.match('^bob$', 'bob') - yes
             | 
             | re.match('^bob$', 'bobs') - no
             | 
             | Most people would expect 'bob\n' _not_ to match, because I
             | used  '$' and it has an extra character at the end, just
             | like 'bobs'. In Python it does match because '\n' is a
             | special case.
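The three cases above can be checked side by side, together with `\Z`, which in Python's `re` only matches at the very end of the string. A small sketch; the `results` dictionary is just a convenience for the example.

```python
import re

results = {}
for candidate in ["bob", "bobs", "bob\n"]:
    dollar = re.match(r"^bob$", candidate) is not None
    big_z = re.match(r"^bob\Z", candidate) is not None
    results[candidate] = (dollar, big_z)
    print(repr(candidate), dollar, big_z)

# 'bob'   True  True
# 'bobs'  False False
# 'bob\n' True  False
```

Only the trailing-newline input separates the two anchors. `re.fullmatch("bob", candidate)` gives the same answers as the `\Z` column.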
        
               | rerdavies wrote:
               | ... for some arbitrary definition of "most people".
        
       | danbruc wrote:
       | People are confused about strings and lines. A string is a
       | sequence of characters, a line can be two different things. If
       | you consider the newline a line terminator, then a line is a
       | sequence of non-newline characters - possibly zero - plus a
       | newline. If there is no new-line at the end, then it is not a
       | [complete] line. That is what POSIX uses. If you consider the
       | newline a line separator, then a line is a sequence of non-
       | newline characters - possibly zero. In either case, the content
       | of the line ends before the newline, either because the newline
       | terminates the line or because it separates the line from the
       | next. [1]
       | 
       | The semantics of ^ and $ is based on lines - whether single-line
       | or multi-line mode. For string based semantics - which you could
       | also think of as entire file if you are dealing with files - use
       | \A and \Z or their equivalents.
       | 
       | [1] Both interpretations have their merits. If you transmit text
       | over a serial connection, it is useful to have a newline as line
       | terminator so that you know when you received a complete line. If
       | you put text into text files, it might arguably be easier to look
       | at a newline as a line separator because then you cannot have an
       | invalid last line. On the other hand having line terminators in
       | text files allows you to detect incompletely written lines.
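The line-based versus string-based distinction above can be seen directly in Python. One caveat for this sketch: Python's `\Z` means the absolute end of the string, which other engines spell `\z`. The sample text is an assumption.

```python
import re

text = "one\ntwo\nthree\n"

# Default mode: "^" only at the start of the string, "$" only at the end or
# just before a final newline, so a pattern spanning one line finds nothing.
default_mode = re.findall(r"^\w+$", text)

# MULTILINE: "^" and "$" apply at every line boundary.
multiline = re.findall(r"^\w+$", text, re.MULTILINE)

# "\A" and "\Z" ignore the flag and always refer to the whole string.
start_only = re.findall(r"\A\w+", text, re.MULTILINE)
end_only = re.search(r"\w+\Z", text, re.MULTILINE)

print(default_mode)  # []
print(multiline)     # ['one', 'two', 'three']
print(start_only)    # ['one']
print(end_only)      # None
```

The last result is `None` because the string ends with a newline, and `\Z` does not skip over it the way `$` does.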
        
       | Existing4190 wrote:
       | perlre Metacharacters documentation states: $ Match the end of
       | the string (or before newline at the end of the string; or before
       | any newline if /m is used)
       | 
       | (/m enables multiline mode)
        
       | mdavid626 wrote:
       | Is this a bug?
        
       | humanlity wrote:
       | Interesting
        
       | m0rissette wrote:
       | Why isn't Perl anywhere on that chart when mentioning regex?
        
         | burntsushi wrote:
         | Because they're using regex101 to easily test the semantics of
         | different regex engines and Perl isn't available on regex101.
         | PCRE is though, which is a decent approximation. And indeed,
         | Perl and PCRE behave the same for this particular case.
        
           | account42 wrote:
           | Why isn't Perl available on regex101 when it's all about
           | regex?
        
             | burntsushi wrote:
             | I dunno. Maybe because nobody has contributed it? Maybe
             | because Perl isn't as widely used as it once was? Maybe
             | because it's hard to compile Perl to WASM? Maybe some other
             | reason?
        
       | tyingq wrote:
       | Seems odd to leave Perl off the list, given it's regex related.
       | 
       | Here's the explanation for $ in the perlre docs:
       |     $   Match the end of the string
       |         (or before newline at the end of the string;
       |         or before any newline if /m is used)
        
         | toyg wrote:
         | Yeah, omitting what is arguably the language most associated
         | with regexes seems a bit of an oversight. I guess it shows how
         | far off the radar Perl currently is.
        
           | demondemidi wrote:
           | Perl perfected the simplicity and flexibility of regex syntax
           | from POSIX and it seems every other language after has just
           | made it harder.
        
           | TillE wrote:
           | PHP uses PCRE, so it more or less serves as a stand-in for
           | Perl in this case.
        
       | homakov wrote:
       | This led to a few serious bugs in Ruby-based apps. Always use
       | \A\z
       | 
       | https://homakov.blogspot.com/2012/05/saferweb-injects-in-var...
       | 
       | https://sakurity.com/blog/2015/02/28/openuri.html
       | 
       | https://sakurity.com/blog/2015/06/04/mongo_ruby_regexp.html
        
       | Scubabear68 wrote:
       | In 30 years of developing software I don't think I ever used
       | multi-line regexp even once.
        
         | thrdbndndn wrote:
         | Definitely not common, but if you are parsing a text file
         | you're going to use it a lot (say, you're writing a JS parser).
        
           | marcosdumay wrote:
           | You really shouldn't use a lot of regexes for parsing code.
           | 
           | They go only on the tokenizer, if they go somewhere at all.
        
             | thrdbndndn wrote:
             | Agreed, it's more about quick and dirty ad hoc capture than
             | full-fledged parser though (like when you want to extract
             | certain object when scraping).
        
         | Terretta wrote:
         | > _In 30 years of developing software I don't think I ever used
         | multi-line regexp even once._
         | 
         | As long as sharing anecdata, in 30 years, it's almost the only
         | way I use it.
         | 
         | It's incredible for slicing and dicing repetitious text into
         | structure. You generally want some sort of Practical Extraction
         | and Reporting Language, the core of which is something like a
         | regular expression, generally able to handle the, well,
         | _irregularity_.
         | 
         | Most recent example (I did this last week) was extracting
         | Apple's app store purchases from an OCR of the purchase history
         | available through Apple's Music app's Account page that lets
         | you see all purchases across all digital offerings, but only as
         | a long scrolling dialog box (reading that dialog's contents
         | through accessibility hooks only retrieves the first few pages,
         | unfortunately).
         | 
         | Each purchase contains one or more items and each item has one
         | or more vertical lines, and if logos contain text they add
         | arbitrary lines per logo.
         | 
         | A good match and sub match multi-line regex folds that mess
         | back into a CSV. In this case, the regex for this was less than
         | an 80 char line of code and worked in the find replace of
         | Sublime Text which has multiline matching, subgroups, and back
         | references.
         | 
         | Another way to do this is something like a state match/case
         | machine, but why write a program when you can just write a
         | regular expression?
        
       | nebulous1 wrote:
       | The fact that there are so many different peculiarities in
       | different regex systems has always raised the hairs on the back
       | of my neck. As in when a tool accepts a regex and I have to
       | trawl the manual to find out exactly what regex is acceptable to
       | it.
        
       | silent_cal wrote:
       | I think there's a big opportunity to re-write Regex as a SQL-type
       | language. It's too bad I don't feel like trying.
        
       | nunez wrote:
       | You can also use (?m) to enable multiline processing on PCRE-
       | compatible regexp engines.
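Python's `re` is not PCRE, but it accepts the same inline `(?m)` syntax at the start of a pattern. A minimal sketch with made-up sample text, showing that the inline form and the flag argument are interchangeable there:

```python
import re

text = "alpha\nbeta\n"

# Inline flag at the start of the pattern.
with_inline = re.findall(r"(?m)^\w+$", text)

# Equivalent flag passed as an argument.
with_flag = re.findall(r"^\w+$", text, re.MULTILINE)

print(with_inline)  # ['alpha', 'beta']
print(with_flag)    # ['alpha', 'beta']
```

The inline form is handy when the pattern comes from a config file or a tool that offers no separate place to set flags.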
        
       | raldi wrote:
       | Cmd-F perl
       | 
       |  _no matches_
        
       | weinzierl wrote:
       | The table in the article makes this look complicated, but it
       | really isn't. All the cases in the article can be grouped into
       | two families:
       | 
       | - The JS/Go/Rust family, which treats $ like \z and does not
       | support \Z at all
       | 
       | - The Java, .NET, PHP, Python family, which treats $ like \Z and
       | may or may not (Python) support \z.
       | 
       | \Z does away with \n before the end of the string, while \z
       | treats \n as a regular character. For multiline $ the distinction
       | doesn't matter, because \n _is_ the end.
       | 
       | Really the only deviation from the rule is Python's \Z, which is
       | indeed weird.
        
       | gorjusborg wrote:
       | If you really want to learn regex, you'll have a hard time
       | piecing it all together via blog posts.
       | 
       | Jeffrey Friedl's Mastering Regular Expressions is a good book to
       | read if you want to stop being surprised/lost.
       | 
       | I'll admit I stopped at the dive into DFA/NFA engine details.
        
       | jewel wrote:
       | This has security implications! Example exploitable ruby code:
       | 
       |     unless person_id =~ /^\d+$/
       |       abort "Bad person ID"
       |     end
       |     sql = "select * from people where person_id = #{person_id}"
       | 
       | In addition to injection attacks, this also can bite people when
       | parsing headers, where a bad header is allowed to sneak past a
       | filter.
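A hedged Python sketch of the same pitfall and one way around it follows. The table, the `find_person` helper, and the sample row are all hypothetical. `re.MULTILINE` is used to reproduce the per-line meaning of `^` and `$` that the Ruby snippet relies on, and the safer path combines a whole-string match with a bound query parameter.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table people (person_id integer, name text)")
conn.execute("insert into people values (25, 'Ada')")

def find_person(person_id: str):
    # The digits must cover the whole string; a trailing "\n" or an extra
    # line is rejected.
    if re.fullmatch(r"[0-9]+", person_id) is None:
        raise ValueError("Bad person ID")
    # The value is bound as a parameter rather than interpolated into the SQL.
    return conn.execute(
        "select name from people where person_id = ?", (int(person_id),)
    ).fetchall()

# With per-line anchors, this hostile input passes the weak check.
hostile = "25\n; delete from people"
weak_check_passes = re.search(r"^\d+$", hostile, re.MULTILINE) is not None

print(weak_check_passes)   # True
print(find_person("25"))   # [('Ada',)]
```

Calling `find_person(hostile)` raises `ValueError`, and even a value that slipped past validation would be sent to the database as data, not as SQL text.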
        
         | jfhufl wrote:
         | Unsure what you mean?
         | 
         |     $ ruby -e 'x = "25" ; if x =~ /^\d+$/ ; puts "yes" ; else ; puts "no" ; end'
         |     yes
         |     $ ruby -e 'x = "25\n" ; if x =~ /^\d+$/ ; puts "yes" ; else ; puts "no" ; end'
         |     yes
         |     $ ruby -e 'x = "a25\n" ; if x =~ /^\d+$/ ; puts "yes" ; else ; puts "no" ; end'
         |     no
         | 
         | Also, you'd want to use something that parameterizes the query
         | with '?' (I use the Sequel gem) instead of just stuffing it
         | into a sql string.
        
           | jfhufl wrote:
           | Well, learned something today after reading a bit further in
           | the thread:
           | 
           |     ruby -e 'x = "a\n25\n" ; if x =~ /^\d+$/ ; puts "yes" ; else ; puts "no" ; end'
           |     yes
           | 
           | Good to know.
        
           | halostatue wrote:
           | You need to make your regex multi-line (`/^\d+$/m`), but that
           | isn't the problem shown. Your query will be searching for
           | `25\n`, not `25` _despite_ your pre-check that it's a good
           | value.
           | 
           | The second line _should always be no_ , which if you use
           | `\A\d+\z`, it will be.
        
             | jfhufl wrote:
             | Yep, makes sense, thanks!
        
           | dr-smooth wrote:
           |     $ ruby -e 'x = "25\n; delete from people" ; if x =~ /^\d+$/ ; puts "yes" ; else ; puts "no" ; end'
           |     yes
        
         | mnau wrote:
         | Practical Gitlab RCE that involved end of line regex in
         | ExifTools:
         | 
         | https://devcraft.io/2021/05/04/exiftool-arbitrary-code-execu...
        
       | SAI_Peregrinus wrote:
       | POSIX regexes and Python regexes are different. In general, you
       | need to reference the regex documentation for _your
       | implementation_, since the syntax is not universal.
       | 
       | Per POSIX chapter 9[1]:
       | 
       | 9.2 ... "The use of regular expressions is generally associated
       | with text processing. REs (BREs and EREs) operate on text
       | strings; that is, zero or more characters followed by an end-of-
       | string delimiter (typically NUL). Some utilities employing
       | regular expressions limit the processing to lines; that is, zero
       | or more characters followed by a <newline>."
       | 
       | and 9.3.8 ... "A <dollar-sign> ( '$' ) shall be an anchor when
       | used as the last character of an entire BRE. The implementation
       | may treat a <dollar-sign> as an anchor when used as the last
       | character of a subexpression. The <dollar-sign> shall anchor the
       | expression (or optionally subexpression) to the end of the string
       | being matched; the <dollar-sign> can be said to match the end-of-
       | string following the last character."
       | 
       | combine to mean that $ may match the end of string OR the end of
       | the line, and it's up to the utility (or mode) to define which.
       | Most of the common utilities (grep, sed, awk, Python, etc) treat
       | it as end of line by default, since they operate on lines by
       | default.
       | 
       | THERE IS NO SINGLE UNIVERSAL REGULAR EXPRESSION SYNTAX. You
       | cannot reliably read or write regular expressions without knowing
       | which language & options are being used.
       | 
       | [1]
       | https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...
        
       | javier_e06 wrote:
       | I would hold a code review hostage if any file does not end with
       | an empty new line.
       | 
       | My reasoning would be that if the file is transmitted and gets
       | truncated, nobody would know for sure if it does not end with a
       | new line. Brownie points if the code ends with a comment saying
       | that the file ends there.
       | 
       | The article calls computer languages platforms, but they are
       | computer languages. Bash is not included. Weird. I believe the
       | most common use of regular expressions is the use of grep or
       | egrep with bash or some other shell but, who knows. Maybe I am
       | hanging with the wrong crowd.
        
       | vitiral wrote:
       | In Lua it's only the start/end of the string
       | 
       | > A pattern is a sequence of pattern items. A caret '^' at the
       | beginning of a pattern anchors the match at the beginning of the
       | subject string. A '$' at the end of a pattern anchors the match
       | at the end of the subject string. At other positions, '^' and '$'
       | have no special meaning and represent themselves.
       | 
       | https://www.lua.org/manual/5.3/manual.html#6.4.1
       | 
       | Lua's pattern matching is much simpler than regexes though.
       | 
       | > Unlike several other scripting languages, Lua does not use
       | POSIX regular expressions (regexp) for pattern matching. The main
       | reason for this is size: A typical implementation of POSIX regexp
       | takes more than 4,000 lines of code. This is bigger than all Lua
       | standard libraries together. In comparison, the implementation of
       | pattern matching in Lua has less than 500 lines.
       | 
       | https://www.lua.org/pil/20.1.html
        
         | denzquix wrote:
         | > In Lua it's only the start/end of the string
         | 
         | There's an additional caveat: if you use the optional "init"
         | parameter to specify an offset into the string to start
         | matching, the ^ anchor will match _at that offset_, which may
         | or may not be what you expect.
        
           | vitiral wrote:
           | That is a good point, and something I've actually
           | (personally) used quite a bit when writing parsers
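
As a point of comparison, Python's `re` makes the opposite choice from the Lua behavior described above. In this small sketch, which uses only the standard `re` module, `text` and the variable names are arbitrary. With a `pos` argument, `^` still refers to the real start of the string, while `Pattern.match()` is anchored at `pos`.

```python
import re

anchored = re.compile(r"^abc")
unanchored = re.compile(r"abc")
text = "xxabc"

# pos=2 starts the scan at index 2, but ^ still means the real start of
# the string, so the anchored search finds nothing.
anchored_from_offset = anchored.search(text, 2)

# Pattern.match() is itself anchored at pos, so this succeeds.
matched_at_offset = unanchored.match(text, 2)

print(anchored_from_offset, matched_at_offset.span())
```

For parser-style code that walks a string by offset, `Pattern.match(text, pos)` is therefore the closer Python analogue of a Lua pattern anchored with `^` plus `init`.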
        
       | cpeterso wrote:
       | $ is the regex's "the buck stops here" symbol. Here at the end of
       | the line. :)
        
       | nurtbo wrote:
       | Totally get the desire, but also feels like last two paragraphs
       | are solvable with
       | 
       | ``` re.match(pattern, text).group().rstrip("\n") ```
        
       | menacingly wrote:
       | Of course it's line. How could it be the end of the string when
       | the matter at hand is defining the string?
        
       | pksebben wrote:
       | Regex would really benefit from a comprehensive industrial
       | standard. It's such a powerful tool that you have to keep
       | relearning whenever you switch contexts.
        
       | aftbit wrote:
       | Wait, in non-multiline mode, it only matches _one_ trailing
       | newline? And not any other whitespace, including \r or \r\n? That
       | is indeed surprising behavior. Why? Why not just make it end of
       | string like the author expected?
       | 
       |     >>> import re
       |     >>> bool(re.search('abc$', 'abc'))
       |     True
       |     >>> bool(re.search('abc$', 'abc\n'))
       |     True
       |     >>> bool(re.search('abc$', 'abc\n\n'))
       |     False
       |     >>> bool(re.search('abc$', 'abc '))
       |     False
       |     >>> bool(re.search('abc$', 'abc\t'))
       |     False
       |     >>> bool(re.search('abc$', 'abc\r'))
       |     False
       |     >>> bool(re.search('abc$', 'abc\r\n'))
       |     False
        
       | mmh0000 wrote:
       | > So if you're trying to match a string without a newline at the
       | end, you can't only use $ in Python! My expectation was having
       | multiline mode disabled wouldn't have had this newline-matching
       | behavior, but that isn't the case.
       | 
       | I would argue this is correct behavior, a "line" isn't a "line"
       | if it doesn't end with \n.[1]
       | 
       | > 3.206 Line - A sequence of zero or more non- <newline>
       | characters plus a terminating <newline> character.
       | 
       | [1]
       | https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...
        
       | librasteve wrote:
       | I am surprised that the OP does not include perl5 in their table.
       | 
       | In Raku (aka Perl 6), regexes were reinvented by Larry Wall (the
       | creator of Perl, which made perlRE the de facto regex standard).
       | 
       | Here's what he does with $:
       | 
       | (https://docs.raku.org/language/regexes#Start_of_string_and_e...)
       | 
       | * The $ anchor only matches at the end of the string
       | 
       | * The $$ anchor matches at the end of a logical line. That is,
       | before a newline character, or at the end of the string when the
       | last character is not a newline character.
        
       | ary wrote:
       | Was any regex documentation unclear on this? Some libraries have
       | modes that change the semantics of ^ and $ but I've always found
       | their use to be rather clear. It's the grouping and look
       | ahead/behind modifiers that I've always found hard to understand
       | (at times).
        
       | pmarreck wrote:
       | The results did not surprise me. The fact that everyone is in
       | agreement that "cat$" matches "cat" and not "cat\n" if multiline
       | is off did not surprise me. \n is implicitly a multiline-
       | contextual character to me. In other words, if you didn't have
       | any \n, you'd just have an array of lines (without linefeeds),
       | same as if you were reading lines from a file one at a time or
       | splitting a binary on \n.
       | 
       | The other results that differ across engines seem to be because
       | people either don't understand regex or because the POSIX
       | description of how to deal with such an input and config was ill-
       | defined.
        
       | 1letterunixname wrote:
       | Ugh. Whenever I hear people talk about regular expressions as a
       | singular language or standard, I die a little inside.
       | 
       | PSA: Regex security is particular to each implementation flavor.
       | Please know the nuances of a particular kind and be unambiguously
       | precise.
        
       | callwhendone wrote:
       | it's end of line right?
        
       | smlacy wrote:
       | It's easy to get the canonical answer:
       | 
       | $ man pcre2syntax
       | 
       | Where you'll find the following block under ANCHORS AND SIMPLE
       | ASSERTIONS:
       | 
       |     $           end of subject
       |                   also before newline at end of subject
       |                   also before internal newline in multiline mode
       | 
       | So all the cases of "newline at/before end of subject" are
       | covered here. Then, the question becomes "what is a subject?" Is
       | it line-by-line? Are newlines included? What if we want multiline
       | matching? That's where re.MULTILINE comes from, it's not
       | "multiline matching" (sort of) it's "what is the subject of the
       | regular expression that we're matching against"
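
The three cases in that man page excerpt can be seen by listing where `$` matches. The sketch below uses Python's `re` module, not PCRE2 itself, on the assumption that it behaves analogously for these cases. `subject` and `dollar_positions` are hypothetical names chosen for the example.

```python
import re

subject = "a\nb\n"  # indexes: a=0, \n=1, b=2, \n=3, end of string=4

def dollar_positions(pattern: str, flags: int = 0) -> list[int]:
    # Collect the start index of every (empty) match of the anchor.
    return [m.start() for m in re.finditer(pattern, subject, flags)]

default_positions = dollar_positions(r"$")                  # [3, 4]
multiline_positions = dollar_positions(r"$", re.MULTILINE)  # [1, 3, 4]
strict_positions = dollar_positions(r"\Z")                  # [4]

print(default_positions, multiline_positions, strict_positions)
```

Index 4 is "end of subject", index 3 is "before newline at end of subject", and index 1 only appears in multiline mode. `\Z` in Python matches at the very end only.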
        
       ___________________________________________________________________
       (page generated 2024-03-20 23:01 UTC)