[HN Gopher] Regex character "$" doesn't mean "end-of-string"
___________________________________________________________________
Regex character "$" doesn't mean "end-of-string"
Author : BerislavLopac
Score : 380 points
Date : 2024-03-20 07:50 UTC (15 hours ago)
(HTM) web link (sethmlarson.dev)
(TXT) w3m dump (sethmlarson.dev)
| ikiris wrote:
| this is mostly due to the different types of regex and less about
| it being platform dependent. $ was end of string in pcre which is
| the "old" perl compatible regex. python has its own which has
| quirks as mentioned, re2 is another option in go for example, and
| i think rust has its own version as well iirc.
| pjmlp wrote:
| Indeed, there isn't any kind of universal regexp standard.
| 7bit wrote:
| We should create a new RegEx flavour that standardises RegEx
| for good!
| jasonjayr wrote:
| https://xkcd.com/927/
| rerdavies wrote:
| https://datatracker.ietf.org/doc/rfc9485/
|
| https://xkcd.com/927/
| wolletd wrote:
| The differences of the various regex "dialects" came to me over
| the years of using regular expressions for all kinds of stuff.
|
| Matching EOL feels natural for every line-based process.
|
| What I find way more annoying is escaping characters and
| writing character groups. Why can't all regex engines support
| '\d' and '\w' and such? Why, in sed, is an unescaped '.' a
| regex-dot matching any character, but an unescaped '(' is just
| a regular bracket?
| somat wrote:
| > Why, in sed, is an unescaped '.' a regex-dot matching any
| character, but an unescaped '(' is just a regular bracket?
|
| It is because sed predates the very influential second
| generation Extended Regular Expression engine and by default
| uses the first generation Basic Regular Expression engine. So
| really it is for backwards compatibility.
|
| http://man.openbsd.org/re_format#BASIC_REGULAR_EXPRESSIONS
|
| you can usually pass sed a -r flag to get it to use ERE's
|
| Actually I don't really know if BRE's predate ERE's or not. I
| assume they do based on the name but I might be wrong.
| tankenmate wrote:
| BRE and ERE were created at the same time. Prior to this
| there wasn't a clear standard for Regex. From my memory
| this was standardised in 1996 (IEEE Std 1003.1-1996).
|
| The work originally came from work by Stephen Cole Kleene
| in the 1950s. It was introduced into Unix fame via the QED
| editor (which later became ed (and sed), then ex, then vi,
| then vim; all with differing authors) when Ken Thompson
| added regex when he ported QED to CTSS (an OS developed at
| MIT for the IBM 709, which was later used to develop
| Multics, and hence led to Unix).
|
| Also the "grep" command got its name from "ed"; "g" (the
| global ed command) "re" (regular expression), and "p" (the
| print ed command). Try it in vi/vim, :g/string/p it is the
| same thing as the grep command.
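The g/re/p idea described above can be sketched in a few lines of Python. This is only an illustration: the function name `grep` and the sample lines are made up here, and Python's `re` syntax is not the same dialect as ed's basic regular expressions.

```python
import re


def grep(pattern, lines):
    """Return the lines containing a match for pattern, in the spirit of ed's g/re/p."""
    compiled = re.compile(pattern)
    # "g": visit every line; "re": test the regex; "p": keep the line for printing.
    return [line for line in lines if compiled.search(line)]


sample = ["the cat sat", "a dog barked", "concatenate"]
for line in grep(r"cat", sample):
    print(line)
```

Because `search` looks for the pattern anywhere in the line, both "the cat sat" and "concatenate" are kept, which is also how grep behaves without anchors.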
| fsckboy wrote:
| > _you can usually pass sed a -r flag_
|
| for portability, -E is the POSIX flag for the same thing
| ajsnigrutin wrote:
| "$" could be end of string or end of line in perl, depending on
| the setting (are you treating data as a multiline text, or each
| line separately). (/m, /s,...)
| ikiris wrote:
| Yeah I accidentally said string when I absolutely meant to
| say line there.
| Izmaki wrote:
| The new-line character is an actual character "at the end" of the
| string though so it makes sense that $ would include the new-line
| character in multi-line matching.
| IshKebab wrote:
| Yes and every implementation gets that right. The point was
| when multi-line matching is _disabled_ and only Javascript, Go
| and Rust get that right.
|
| I'm not too surprised by PHP and Python getting it wrong. Java
| and C# are a slight surprise though.
| danbruc wrote:
| I don't think it is correct to say some get it right and some
| get it wrong, it is more of a design decision.
| IshKebab wrote:
| It's possible to get design decisions wrong. Clearly people
| _expect_ `$` to only match end-of-string so they did make
| the wrong decision. It may not have been clear it was the
| wrong decision at the time.
| danbruc wrote:
| Things are obviously more complicated than that, lines
| are a complicated issue for historical reasons. There are
| two conventions, line termination and line separation. In
| case of line termination, the newline is part of the line
| and a string without a newline is not a [complete] line.
| In case of line separation, the newline is not part of
| the line but separates two lines. Also the way newlines
| are encoded is not universal.
| fauigerzigerk wrote:
| Why is this relevant when multi-line is disabled?
| danbruc wrote:
| Because even after disabling multi-line you are still
| dealing with line-based semantics when you use ^ or $,
| the newline at the end is still not part of the content.
| You have to use \A and \Z if you want to treat all
| characters as a string instead of one or multiple lines.
| burntsushi wrote:
| > Because even after disabling multi-line you are still
| dealing with line-based semantics when you use ^ or $
|
| No, you're not, _except_ for this weird corner case where
| `$` can match before the _last_ `\n` in a string. It's
| not just any `\n` that non-multiline `$` can match
| before. It's when it's the _last_ `\n` in the string.
| See:
|
|     >>> re.search('cat$', 'cat\n')
|     <re.Match object; span=(0, 3), match='cat'>
|     >>> re.search('cat$', 'cat\n\n')
|     >>>
|
| This is weird behavior. I assume this is why RE2 didn't
| copy this. And it's certainly why I followed RE2 with
| Rust's regex crate. Non-multiline `$` should only match
| at the end of the string. It should not be line-aware. In
| regex engines like Python where it has the behavior
| above, it is only "partially" line-aware, and only in the
| sense that it treats the last `\n` as special.
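The corner case described above can be checked directly with Python's `re` module. The helper name `span` is introduced only for this sketch; note that in Python's `re`, `\Z` matches only at the true end of the string.

```python
import re


def span(pattern, text, flags=0):
    """Return the (start, end) span of the first match, or None if there is none."""
    m = re.search(pattern, text, flags)
    return m.span() if m else None


# Non-multiline `$` matches at the very end, or just before one final "\n".
print(span(r"cat$", "cat"))                       # (0, 3)
print(span(r"cat$", "cat\n"))                     # (0, 3)

# With two trailing newlines, the "\n" after "cat" is no longer the last character.
print(span(r"cat$", "cat\n\n"))                   # None

# `\Z` ignores the trailing-newline special case.
print(span(r"cat\Z", "cat\n"))                    # None

# re.MULTILINE makes `$` match before every "\n".
print(span(r"cat$", "cat\n\n", re.MULTILINE))     # (0, 3)
```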
| danbruc wrote:
| But that is exactly what it means, the end of the line is
| before the terminating newline or at the end of the
| string if there is no terminating newline. Both ^ and $
| always match at start or end of lines, \A and \Z match at
| the start or end of the string. The difference between
| multi-line and not is whether or not internal newlines
| end and start lines, it does not change the semantics
| from end of line to end of string. And if you are not in
| multi-line mode but have internal newlines, then you
| might also want single-line/dot-all mode.
|
| One could certainly have a debate whether this behavior
| is too strongly tied to the origins of regular
| expressions and now does more harm than good, but I am
| not convinced that this would be an easy and obvious
| choice to have breaking change.
| IshKebab wrote:
| > But that is exactly what it means
|
| I think you've kind of missed the point. Sure if `$` in
| non-multiline mode means "end of line" the behaviour
| might be reasonable. But the big error is that people DO
| NOT EXPECT `$` to mean "end of line" in that case. They
| expect it to mean "end of string". That's clearly the
| least surprising and most useful behaviour.
|
| The bug is not in how they have implemented "end of line"
| matching in non-multiline mode. It's that they did it at
| all.
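Where this expectation gap tends to bite is input validation. The sketch below assumes a hypothetical username rule (lowercase letters only); the function names are invented for illustration. It shows how a `^...$` pattern in Python accepts a trailing newline, while `re.fullmatch` does not.

```python
import re

# Hypothetical validation rule: one or more lowercase ASCII letters.
LOOSE = re.compile(r"^[a-z]+$")


def is_valid_loose(name):
    # `$` also matches just before a final "\n", so "admin\n" slips through.
    return LOOSE.match(name) is not None


def is_valid_strict(name):
    # fullmatch requires the pattern to consume the entire string.
    return re.fullmatch(r"[a-z]+", name) is not None


print(is_valid_loose("admin\n"))   # True
print(is_valid_strict("admin\n"))  # False
print(is_valid_strict("admin"))    # True
```

Using `\Z` in place of `$` would be another way to get the strict behaviour in Python's `re`.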
| burntsushi wrote:
| re.search does not accept a "line." It accepts a
| "string." There is no pretext in which re.search is meant
| to only accept a single line. And giving it a `string`
| with multiple new lines doesn't necessarily mean you want
| to enable multi-line mode. They are orthogonal things.
|
| > Both ^ and $ always match at start or end of lines
|
| This is trivially not true, as I showed in my previous
| example. The haystack `cat\n\n` contains two lines and
| the regex `cat$` says it should match `cat` followed by
| the "end of a line" according to your definition. Yet it
| does not match `cat` followed by the end of a line in
| `cat\n\n`. And it does not do so in Python or in any
| other regex engine.
|
| You're trying to square a circle here. It can't be done.
|
| Can you make sense of, _historically_, why this choice
| of semantics was made? Sure. I bet you can. But I can
| still evaluate the choice on its own merits today. And I
| did when I made the regex crate.
|
| > but I am not convinced that this would be an easy and
| obvious choice to have breaking change.
|
| Rust's regex crate, Go's regexp package and RE2 all
| reject this whacky behavior. As the regex crate
| maintainer, I don't think I've ever seen anyone complain.
| Not once. This to me suggests that, at minimum, making
| `$` and `\z` equivalent in non-multiline mode is a
| reasonable choice. I would also argue it is the better
| and more sensible approach.
|
| Whether other regex engines should have a breaking change
| or not to change the meaning of `$` is an entirely
| different question completely. That is neither here nor
| there. They absolutely will not be able to make such a
| change, for many good reasons.
| danbruc wrote:
| _re.search does not accept a "line." It accepts a
| "string." There is no pretext in which re.search is meant
| to only accept a single line._
|
| Sure, it takes a string which might be a line or multiple
| or whatever. Does not change the fact that $ matches at
| the end of a line. If you want the end of the string, use
| \Z.
|
| _This is trivially not true, as I showed in my previous
| example. The haystack `cat\n\n` contains two lines and
| the regex `cat$` says it should match `cat` followed by
| the "end of a line" according to your definition._
|
| In multi-line mode it matches, in single-line mode it
| does not because there is a newline between cat and the
| end of the line. A newline is only a terminating newline
| if it is the last character, the newline after cat is not
| a terminating newline. You need cat\n$ or cat\n\n to
| match.
| burntsushi wrote:
| > In multi-line mode it matches, in single-line mode it
| does not because there is a newline between cat and the
| end of the line. A newline is only a terminating newline
| if it is the last character, the newline after cat is not
| a terminating newline. You need cat\n$ or cat\n\n to
| match.
|
| This only makes sense if re.search accepted a line to
| search. It doesn't. It accepts an arbitrary string.
|
| I don't think this conversation is going anywhere. Your
| description of the semantics seems inconsistent and
| incomprehensible to me.
|
| > A newline is only a terminating newline if it is the
| last character, the newline after cat is not a
| terminating newline. You need cat\n$ or cat\n\n to match.
|
| The first `\n` in `cat\n\n` _is_ a terminating newline.
| There just happens to be one after it.
|
| Like I said, your description makes sense _if_ the input
| is meant to be interpreted as a single line. And in some
| contexts (like line oriented CLI tools), that can make
| sense. But that's _not_ the case here. So your
| description makes no sense at all to me.
| danbruc wrote:
| _This only makes sense if re.search accepted a line to
| search. It doesn't. It accepts an arbitrary string._
|
| Which is fine because lines are a subset of strings. And
| whether you want your input treated as a line or a string
| is decided by your pattern, use ^ and $ and it will be
| treated as a line, use \A and \Z and it will be treated
| as a string.
|
| _The first `\n` in `cat\n\n` is a terminating newline.
| There just happens to be one after it._
|
| Look at where this is coming from. You do line-based
| stuff, there is either no newline at all or there is
| exactly one newline at the end. You do file-based stuff,
| there are many newlines. In both cases the behavior of ^
| and $ makes perfect sense.
|
| Now you come along with cat\n\n which clearly falls into
| the file-based stuff category as it has more than one
| newline in it but you also insist that it is not multiple
| lines. If it is not multiple lines, then only the last
| character can be a newline, otherwise it would be
| multiple lines.
|
| And I get it, yes, you can throw arbitrary strings at a
| regular expression, this line-based processing is not
| everything, but it explains why things behave the way
| they do. And that is also why people added \A and \Z. And
| I understand that ^ and $ are much nicer and much better
| known than \A and \Z. Maybe the best option would be to
| have a separate flag that makes them synonymous with \A
| and \Z and this could maybe even be the default.
| burntsushi wrote:
| > And whether you want your input treated as a line or a
| string is decided by your pattern, use ^ and $ and it
| will be treated as a line, use \A and \Z and it will be
| treated as a string.
|
| Where is this semantic explained in the `re` module docs?
|
| This is totally and completely made up as far as I can
| tell.
|
| This also seems entirely consistent with my rebuttal:
|
| Me: What you're saying makes sense _if_ condition foo
| holds.
|
| You: Condition foo holds.
|
| This is uninteresting to me because I see no reason to
| believe that condition foo holds. Where condition foo is
| "the input to re.search is expected to be a single line."
| Or more precisely, apparently, "the input to re.search is
| expected to be a single line when either ^ or $ appear in
| the pattern." That is totally bonkers.
|
| > but it explains why things behave the way they do
|
| Firstly, I am not debating with you about the historical
| reasoning for this. Secondly, I am providing a commentary
| on the semantics themselves (they suck) and also on your
| explanation of them in _today's_ context (it doesn't
| make sense). Thirdly, I am not making a prescriptive
| argument that established regex engines should change
| their behavior in any way.
|
| If you're looking to explain _why_ this semantic is the
| way it is, then I'd expect writing from the original
| implementors of it. Probably in Perl. I wouldn't at all
| be surprised if this was an "oops" or if it was
| implemented in a strictly-line-oriented context, and then
| someone else decided to keep it unthinkingly when they
| moved to a non-line-oriented context. From there,
| compatibility takes over as a reason for why it's with us
| today.
| danbruc wrote:
| I quoted the section from the Python module here. [1]
|
| If you do not specify multi-line, bar$ matches a line
| ending in bar, either foobar\n or foobar if the
| terminating newline has been removed or does not exist.
| If you specify multi-line, then it will also match at
| every bar\n within the string. So it either treats your
| input as a single line or as multiple lines. You can of
| course not specify multi-line and still pass in a string
| with additional newlines within the string, but then
| those newlines will be treated more or less as any other
| character, bar$ will not match bar\n\n. The exception is
| that dot will not match them except you set the single-
| line/dot-all flag, bar\n$ will match bar\n\n but bar.$
| will not unless you specify the single-line/dot-all flag.
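The cases in the paragraph above can be tried in Python's `re`. This is a sketch using the same `bar` examples; the variable names are chosen here for illustration.

```python
import re

text = "bar\n\n"

# Without MULTILINE, `$` only matches at the end or before the final "\n".
plain = re.search(r"bar$", text)                  # no match
with_newline = re.search(r"bar\n$", text)         # matches "bar\n"

# `.` does not match "\n" unless DOTALL is set.
dot = re.search(r"bar.$", text)                   # no match
dotall = re.search(r"bar.$", text, re.DOTALL)     # matches "bar\n"

# With MULTILINE, `$` matches before every "\n" in the string.
multi = re.findall(r"bar$", "foobar\nbar\nbaz", re.MULTILINE)

print(plain, with_newline.span(), dot, dotall.span(), multi)
```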
|
| I would even agree with you that it seems a bit weird. If
| you have a proper line without additional newlines in the
| middle, then multi-line behaves exactly like not multi-
| line. Not multi-line only behaves differently if you
| confront it with multiple lines and I have no good idea
| how you would end up in a situation where you have
| multiple lines and want to treat them as one unit but
| still treat the entire thing as if it was a line.
|
| [1] https://news.ycombinator.com/item?id=39765086
| burntsushi wrote:
| The docs do not say what you're saying. Your phrasing is
| completely different, and the part where "if ^/$ are in
| the pattern then the haystack is treated as a single
| line" is completely made up. As far as I can tell, that's
| your _rationalization_ for how to make sense of this
| behavior. But it is not a story supported by the actual
| regex engine docs. The actual docs say, "^ matches only
| at the beginning of the string, and $ matches only at the
| end of the string and immediately before the newline (if
| any) at the end of the string." The docs do not say, "the
| string is treated as a single line when ^/$ are used in
| the pattern." That's _your_ phrasing, not anyone else's.
| That's _your_ story, not theirs.
|
| I still have not seen anything from you that makes sense
| of the behavior that `cat$` does not match `cat\n\n`.
| Like, I realize you've tried to explain it. But your
| explanation does not make sense. That's because the
| behavior is _strange_.
|
| The only actual way to explain the behavior of $ is what
| the `re` docs say: it either matches at the end of the
| string or just before a `\n` that appears at the end of
| the string. That's it.
| danbruc wrote:
| You are right, it is my wording, I replaced end of string
| or before newline as the last character with end of line
| because that is what this means. You could also write
| that into the documentation but then you would have to
| also explain what end of line means. And I will grant you
| that I might be wrong, that the behavior is only
| accidentally identical to matching the end of a line but
| that the true reason for it is different.
|
| cat$, the $ matches the end of the line, the second \n,
| cat is not directly before that. I guess you want the
| regex engine to first treat the input as a multi-line
| input, extract cat\n as the first line, and then have
| cat$ match successfully in that single line? What about
| cat$ and dog$ and cat\ndog\n.
| dfawcus wrote:
| Given that in unix they sort of started as:
|
|     ed -> sed
|     ed -> grep
|
| The line-oriented nature makes sense.
|
| There is some sed multi-line capability if one uses the
| hold space, but it is much easier to just use awk.
| tankenmate wrote:
| Not quite, there are standards for this behaviour (formal
| and de jure).
| danbruc wrote:
| And the ones that do not match cat\n with cat$ arguably
| have it wrong. Both ^ and $ anchor to the start and end
| of lines, not to the start and end of strings, whether in
| multi-line mode or not.
| noirscape wrote:
| It's not wrong actually. It's the difference between BRE and
| ERE, which are the two different POSIX standards that define
| regex. In BRE the $ should always match the end of the string
| (the spec specifically says it should match the string
| terminator since "newlines aren't special characters"), while
| the ERE spec says it should match until the end of the line.
|
| The real issue is that no language nowadays "just" implements
| BRE or ERE since both specs are lacking in features.
|
| Most languages instead implement some variant of Perl's regex
| instead (often called PCRE regex because of the C library
| that brought Perl's regex to C), which as far as I can tell
| isn't standardized, so you get these subtle differences
| between implementations.
| mnw21cam wrote:
| The article is about when multi-line is _disabled_.
| user2342 wrote:
| I'm confused by this blog-post. In the table: what is the reg-ex
| pattern tested and against which input?
| mnw21cam wrote:
| The input being matched is "cat\n" and the regex pattern is one
| of:
|
|     "cat$" with multiline enabled
|     "cat$" with multiline disabled
|     "cat\z"
|     "cat\Z"
| somat wrote:
| Structural regexes as found in the sam editor are an obscure but
| well engineered regex engine. I am far from an expert but my main
| takeaway from them is that most regex engines have an implied
| structure built around "lines" of text. While you can work around
| this, it is awkward. Structural regexes allow you to explicitly
| define the structure of a match, that is, you get to tell the
| engine what a "line" is.
|
| http://man.cat-v.org/plan_9/1/sam
| xlii wrote:
| Regexp was one of the first things I truly internalized years ago
| when I was discovering Perl (which still lives in a cozy place in
| my heart due to a lovely "Camel" book).
|
| Today the most important bit of information is the knowledge that
| implementations differ, and I have made a habit of pulling up a
| reference sheet for whatever I work with.
|
| E.g. Emacs Regexp annoyingly doesn't have a word class in the form
| of "\w" but uses "\s_-" (or something, no reference sheet on screen) as
| character class (but Emacs has the best documentation and
| discoverability - a hill I'm willing to die on)
|
| Some utilities require parenthesis escaping and some not.
| Sometimes this behavior is configurable and sometimes it's not.
|
| I lived through the whole confusion, annoyance, denial phase and now
| I just accept it. Concept is the same everywhere but flavor
| changes.
| ydant wrote:
| Exactly the same here, re: Perl.
|
| My brain thinks in Perl's regex language and then I have to
| translate the inconsistent bits to the language I'm using.
| Especially in the shell - I'm way more likely to just drop a
| perl into the pipeline instead of trying to remember how
| sed/grep/awk (GNU or BSD?) prefer their regex.
| influx wrote:
| GNU grep supports Perl regexp with -P
| mwpmaybe wrote:
| As does git grep!
| 1letterunixname wrote:
| Using PCRE2, which doesn't behave exactly the same as Perl
| or PCRE1.
|
| https://pcre.org/current/doc/html/pcre2compat.html
|
| https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expre
| s...
|
| https://stackoverflow.com/questions/70273084/regex-
| differenc...
| mtmk wrote:
| hah, I'm the same too, straight to 'perl -lne'. I believe
| that was one of Larry Wall's goals when creating Perl:
|
| > Perl is kind of designed to make awk and sed semi-obsolete.
|
| https://github.com/Perl/perl5/commit/8d063cd8
| pizzafeelsright wrote:
| How did you internalize it? Perl looks like cat keyboarding.
| mwpmaybe wrote:
| The same way people internalize punching data and
| instructions into stacks of cards, or internalize advanced
| mathematical notation. Just because things aren't written in
| plain english words doesn't mean they can't be internalized.
| chongli wrote:
| Advanced math is mostly written in plain English, actually!
| ydant wrote:
| For me, Perl hit me at exactly the right time in my
| development. One or more of the various O'Reilly Perl books
| caught my attention in the bookstore, the foreword and the
| writing style was unlike anything else I'd read in
| programming up to that point, and I read the book and just
| felt a strong connection to how the language was structured,
| the design concepts behind it, the power of regex being built
| in to the language, etc. The syntax favored easy to write
| programs without unnecessary scaffolding (of course, leading
| to the jokes of it being write-only - also the jokes I could
| make about me programming largely in Java today), and the
| standard functionality plus the library set available felt
| like magic to me at that point.
|
| Learning Perl today would be a very different experience. I
| don't think it would catch me as readily as it did back then.
| But it doesn't matter - it's embedded into me at a deep level
| because I learned it through a strong drive of fascination
| and infatuation.
|
| As for the regex themselves? It's powerful and solved a lot
| of the problems I was trying to solve, was built
| fundamentally into Perl as a language, so learning it was
| just an easy iterative process. It didn't hurt that the
| particular period of time when I learned Perl/regex the
| community was really big on "leetcode" style exercises, they
| just happened to be focused around Perl Golf, being clever in
| how you wrote solutions to arbitrary problems, and abusive
| levels of regex to solve problems. We were all playing and
| play is a great way to learn.
| beardyw wrote:
| Does anyone consider RegEx to be standardised? Moving to a new
| context is always a relearning exercise in my experience.
| rusk wrote:
| My understanding is it was standardised for Posix but the
| variants in popular use have so many variations.
|
| I consider sed to be the baseline. If you can do sed you can do
| anything but it's seriously limited.
| susam wrote:
| POSIX specifies two flavours of regular expressions: basic
| regular expressions (BRE) and extended regular expressions
| (ERE). There are subtle differences between the two and ERE
| supports more features than BRE. For example, what is written
| as a\(bc\)\{3\}d in BRE is written as a(bc){3}d in ERE.
| See https://pubs.opengroup.org/onlinepubs/9699919799/basedefs
| /V1... for more details.
|
| The regular expression engines available in most mainstream
| languages go well beyond what is specified in POSIX though.
| An interesting example is named capturing group in Python,
| e.g., (?P<token>f[o]+).
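As a small illustration of the syntax mentioned above, the following Python sketch exercises the ERE-style pattern `a(bc){3}d` and the named group `(?P<token>f[o]+)`. The sample input strings are made up here. Python's `re` follows the ERE convention for grouping, so a backslashed parenthesis is a literal character, the opposite of BRE.

```python
import re

# Unescaped parentheses group and {3} is an interval, as in ERE.
interval = re.fullmatch(r"a(bc){3}d", "abcbcbcd")
print(interval is not None)               # True

# In Python, "\(" and "\)" are literal parentheses (the reverse of BRE).
literal_parens = re.search(r"a\(bc\)", "xa(bc)y")
print(literal_parens.group(0))            # a(bc)

# Named capturing group, a Python extension beyond POSIX.
m = re.search(r"(?P<token>f[o]+)", "say foo twice")
print(m.group("token"), m.span("token"))  # foo (4, 7)
```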
| tankenmate wrote:
| Indeed, and the most common is Perl since it was the source
| of many of the extensions.
| rusk wrote:
| I would hazard that nowadays it's Java due to its broad
| permeation of the application space
| account42 wrote:
| If anything it would be ECMAScript (JavaScript dwarfs
| Java use) or PCRE (the de-facto continuation of Perl
| regular expressions written in C but used in many
| languages).
| rusk wrote:
| Yes I think you're right actually. I'm about 10 years off
| :)
| jwilk wrote:
| > what is written as \(f..\)\1 in BRE is written as
| (f..)\1 in ERE
|
| Oddly, there are no backreferences in POSIX EREs.
| susam wrote:
| You are right indeed. Looked at the specification again
| and indeed there is no back-reference in POSIX ERE.
|
| Quoting from <https://pubs.opengroup.org/onlinepubs/96999
| 19799.2008edition...>:
|
| > It was suggested that, in addition to interval
| expressions, back-references ( '\n' ) should also be
| added to EREs. This was rejected by the standard
| developers as likely to decrease consensus.
|
| Updated my comment to present a better example that
| avoids back-references. Thanks!
| GrumpySloth wrote:
| That's because POSIX EREs are actual regular expressions
| thank god.
| psd1 wrote:
| No gnu tool can balance brackets, afaics. So you can't do
| everything in sed. And sed is, by design, useless for
| matching text that spans lines, so good luck picking out
| paragraphs with it.
| rusk wrote:
| Sorry I meant to write "if you can do it in sed you can do
| it in anything" thereby implying it is a subset of the more
| generally available flavours. The issue at hand however is
| that there isn't much in the way of standardisation but 95%
| of sed should work across all of them. Of course you should
| get more into the specifics of whatever your solution space
| supports.
| ykonstant wrote:
| I am pretty sure even pure Awk can do it; or am I mistaken?
| I thought there was an even more sophisticated example in
| the Awk book.
|
| Edit: oh, you mean via regex engines available in GNU
| tools; I am dumb. Hmm... is there no GNU extension with
| PCRE?
| colimbarna wrote:
| "Sed" is the name of a specific tool. It is not defined
| by the GNU tools, but has existed in some form since
| 1974, well before Perl. GNU sed and POSIX sed both
| support BRE and EREs, but not PCREs.
|
| Maybe there's some other implementation of sed that
| supports PCREs but that would really be an extension of
| that implementation of sed rather than a property of sed.
|
| And maybe there's some GNU tool that uses PCREs, but that
| GNU tool would not be GNU sed, so it would not be a
| relevant property.
|
| Anyway, they probably should have said BREs or EREs
| rather than "sed"...
| telotortium wrote:
| Languages invented after Perl will generally use some flavor of
| Perl regex syntax, but there are always some minor differences.
| The issue of the meaning of `$` and changing it via multi-line
| mode is usually consistent though.
| usrusr wrote:
| I like to think of "whatever browsers do in js" as an updated
| common baseline. Whatever your regex engine does, describe it
| as a delta to the js precedent. That thing is just so
| ubiquitous.
|
| I do wonder though what's the highest number of different
| regex syntaxes I've ever encountered (perhaps written?)
| within a single line: bash, grep and sed are never not in a
| "hold my beer" mood!
| psd1 wrote:
| Reason #2 to use powershell - consistent regex.
|
| I've got "hold my beer" commits in .net - I've balanced
| brackets. I believe that's impossible in sed and grep. If I
| were going to write a json parser in a script, then a) stop
| me and b) it's got to be in powershell.
| layer8 wrote:
| That seems like just a web front-end developer's
| perspective.
| Calzifer wrote:
| Isn't JavaScript's regex one of the worst modern regex
| implementations?
|
| They seem to improve. Negative lookbehind isn't missing
| anymore [1]. But still lack the handy \Q and \E to escape
| stuff [2].
|
| [1] https://stackoverflow.com/a/3950684
|
| [2] https://stackoverflow.com/q/6318710
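For comparison, Python's `re` module also has no `\Q...\E`; its usual substitute is `re.escape`, which backslash-escapes regex metacharacters in a literal string. A minimal sketch, with a made-up literal:

```python
import re

literal = "1+1=2 (really?)"

# re.escape quotes the metacharacters so the text matches itself literally.
quoted = re.compile(re.escape(literal))
print(quoted.fullmatch(literal) is not None)        # True

# Used unescaped, "+", "(", ")" and "?" are operators, so the raw text
# does not match itself.
print(re.fullmatch(literal, literal) is not None)   # False
```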
| kstrauser wrote:
| I'll go along with that, as long as someone ports pcre to
| JavaScript and that's the browser syntax we land on.
| mwpmaybe wrote:
| > I do wonder though what's the highest number of different
| regex syntaxes I've ever encountered (perhaps written?)
| within a single line: bash, grep and sed are never not in a
| "hold my beer" mood!
|
| Your comment is missing a trigger warning, lol. But
| seriously, this is one of my flags for "this should
| probably be a script, or an awk or perl one-liner."
| wolletd wrote:
| At some point, I felt like I knew them all. There are probably
| more regex dialects out there, but I don't encounter them and
| my set of knowledge works most of the time.
|
| I feel it's like driving a rental car. It behaves slightly
| different than your own car, some features missing, some other
| features added, but in general, most of the things are pretty
| similar.
| stanislavb wrote:
| What a nice analogy. I'll borrow it in the future.
| MattHeard wrote:
| My working assumption has always been to check the docs of your
| specific regexp parser, and to write some tests (either
| automated or manually in a REPL) with specific patterns that
| you are interested in using.
| out-of-ideas wrote:
| kind of a trick question; there is POSIX and then there is the
| app you're using and whichever flags are enabled (albeit by
| default or explicitly defined)
| jasonjayr wrote:
| The three big ones I know of are POSIX, Perl/PCRE (aka Perl-
| Compatible Regular Expressions), and Go came along and used
| re2, which is a bit different from the first two.
|
| A lot of systems implemented PCRE, including JavaScript, since
| Perl extended the POSIX system with many useful extensions.
| IIRC, re2 tries to rein in some of the performance issues
| and quirks the original systems had, while implementing the
| whole thing in Go.
|
| edit: Did not realize re2 predated go ...
| jpgvm wrote:
| re2 predates Go and was written in C++.
| foldr wrote:
| Go's regex implementation is new in the sense that it's not
| just a binding to the re2 C++ library, but it uses the same
| non-backtracking algorithm.
| jerf wrote:
| POSIX and PCRE are arguably redundant. They both support
| backreferences, which puts very significant constraints on
| their implementations. PCRE is at least functionally a
| superset of POSIX, whether or not there's some quirky thing
| POSIX supports that PCRE does not.
|
| re2 adds a legitimate option to the menu of using NDFAs,
| which have the disadvantage of not supporting backreferences,
| but have the advantage of having constrained complexity of
| scanning a string. This does not come for free; you can
| conceivably end up with a compiled regexp of very large size
| with an NDFA approach, but most of the time you won't. The
| result may be generally slower than a PCRE-type approach, but
| it can also end up safer because you can be confident that
| there isn't a pathological input string for a given regexp
| that will go exponential.
|
| This is one of those cases where ~99% of the time, it doesn't
| really matter which you choose, but at the scale of the
| Entire Programming World, both options need to be available.
| I've got some security applications where I legitimately
| prefer the re2 implementation in Go because it is
| advantageous to be confident that the REs I write have no
| pathological cases in the arbitrary input they face. PCRE can
| be necessary in certain high-performance cases, as long as
| you can be sure you're not going to get that pathological
| input.
|
| RE engines don't quite engender the same emotions as
| programming languages as a whole, but this is not
| cheerleading, this is a sober engineering assessment. I use
| both styles in my code. I've even got one unlucky exe I've
| been working with lately that has both, because it rather
| irreducibly has the requirements for both. Professionally
| annoying, but not actually a problem.
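|
| A minimal sketch of the kind of pathological input described
| above, using Python's built-in backtracking `re` module. The
| pattern `(a+)+$` and the helper name `time_failed_match` are
| illustrative assumptions, not taken from the thread, and the
| timings will vary by machine and Python version.

```python
import re
import time

# Nested quantifiers give a backtracking engine exponentially many ways to
# split a run of "a"s before it can conclude that there is no match.
PATHOLOGICAL = re.compile(r"(a+)+$")


def time_failed_match(n):
    """Return (match, seconds) for n 'a's followed by a 'b' that spoils the match."""
    text = "a" * n + "b"
    start = time.perf_counter()
    match = PATHOLOGICAL.match(text)
    return match, time.perf_counter() - start


for n in (10, 14, 18):
    match, seconds = time_failed_match(n)
    print(f"n={n:2d} match={match} seconds={seconds:.4f}")
```

| The match fails every time; only the time taken to find that
| out grows with each extra "a". An automaton-based engine such
| as RE2 is designed to avoid this growth.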
| burntsushi wrote:
| I'll add two notes to this:
|
| * Finite automata based regex engines don't necessarily
| have to be slower than backtracking engines like PCRE. Go's
| regexp is in practice slower in a lot of cases, but this is
| more a property of its implementation than its concept.
| See: https://github.com/BurntSushi/rebar?tab=readme-ov-
| file#summa... --- Given "sufficient" implementation effort
| (~several person years of development work), backtrackers
| and finite automata engines can both perform very well,
| with one beating the other in some cases but not in others.
| It depends.
|
| * Fun fact is that if you're iterating over all matches in
| a haystack (e.g., Go's `FindAll` routines), then you're
| susceptible to O(m * n^2) search time. This applies to all
| regex engines that implement some kind of leftmost match
| priority. See
| https://github.com/BurntSushi/rebar?tab=readme-ov-
| file#quadr... for a more detailed elaboration on this
| point.
| jerf wrote:
| Excellent, thank you.
| keybored wrote:
| > RE engines don't quite engender the same emotions as
| programming languages as a whole, but this is not
| cheerleading, this is a sober engineering assessment.
|
| Good on you.
| bregma wrote:
| The ISO/IEC 14882 C++ standard library <regex> mandates [0]
| implementations for six de jure standard regex grammars: IEEE
| Std 1003.1-2008 (POSIX) [1] BRE, ERE, awk, grep, and egrep and
| ECMA-262 EcmaScript 3 [2].
|
| So, yes, at least someone (me) considers regex to be
| standardized in several published de jure standards.
|
| [0] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3690.pdf#chapter.28
|
| [1] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
|
| [2] https://262.ecma-international.org/14.0/#sec-regexp-regular-expression-objects
| pjc50 wrote:
| "At least six different standards" is an XKCD comic, not _a_
| standard.
| riffraff wrote:
| "The nice thing about standards is that you have so many to
| choose from." - Andrew Tanenbaum (or Grace Hopper)
| account42 wrote:
| <regex> is not exactly an example anyone should follow.
| bregma wrote:
| You may be prejudiced against C++, but ISO/IEC 14882 is a
| published international standard that links to recognized
| regex standards, so answers the question "does anyone
| consider RegEx standardised?" very much in the affirmative.
| beardyw wrote:
| And don't get me started about find and replace, what is the
| symbol to insert the match?
| tonyg wrote:
| Delightfully, RFC 9485
| https://datatracker.ietf.org/doc/rfc9485/ "I-Regexp: An
| Interoperable Regular Expression Format" was published just
| back in October last year!
| ghusbands wrote:
| > Note: The table of data was gathered from regex101.com, I
| didn't test using the actual runtimes.
|
| Has anyone confirmed this behaviour directly against the
| runtimes/languages? Newlines at the end of a string are certainly
| something that could get lost in transit inside an online service
| involving multiple runtimes.
| AtNightWeCode wrote:
| I fail to add carriage return to the test string on that site.
| Which I guess would be an issue on Windows.
| zimpenfish wrote:
| https://go.dev/play/p/Tce1qWjfjOy matches their results.
|
| I've also run that locally against "go1.22.1 darwin/arm64",
| "go1.21.5 windows/amd64", and "go1.21.0 linux/amd64" with the
| same result.
| coldtea wrote:
| > _Newlines at the end of a string are certainly something that
| could get lost in transit inside an online service involving
| multiple runtimes._
|
| In what way could newlines at the end of a string "get
| lost in transit"?
| ghusbands wrote:
| If you write it to a text file by itself and then read it
| from that text file, each runtime can have a different
| definition of whether a newline at the end of the file is
| meaningful or not. Under POSIX, a newline should always be
| present at the end of a non-empty text file and is not
| meaningful; not everyone agrees or is aware.
|
| There are plenty of other ways, too; bugs happen.
| coldtea wrote:
| Ideally no runtime should alter strings passing through
| ("in transit") from one runtime to another - unless it does
| some processing on them.
| ghusbands wrote:
| I've now tested C#, directly, and got the same result as the
| article. It also documents the behavior:
|
| > The ^ and $ language elements indicate the beginning and end
| of the input string. The end of the input string can be a
| trailing newline \n character.
| burntsushi wrote:
| Yes, and with more regex engines:
| https://github.com/BurntSushi/rebar/blob/177f5d55e916964b9c4...
|
| Beyond what's in the OP, that includes RE2, Hyperscan, D's
| std.regex, ICU, Perl, Python's third party `regex` package, and
| `regress`.
| masswerk wrote:
| As for the good old reference implementation (not _"Parameter
| Efficient Reinforcement Learning"_):
|
|     my $string = "cat\n";
|     /cat$/s  -> true
|     /cat\Z/s -> true
|     /cat\z/s -> false
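|
| For comparison, a small sketch of the same check in Python's
| built-in `re` module. Note the naming difference: Python's `\Z`
| matches only at the very end of the string, which is what Perl
| spells as lowercase `\z`. The variable names are illustrative.

```python
import re

s = "cat\n"

# `$` also matches just before a trailing newline, so this finds "cat".
dollar = re.search(r"cat$", s) is not None

# In Python's re, `\Z` matches only at the very end of the string,
# which corresponds to Perl's `\z`.
strict = re.search(r"cat\Z", s) is not None

print(dollar, strict)  # True False
```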
| pjc50 wrote:
| Special misery case: Visual Studio supports regex search, where
| '$' matches \n.
|
| The end of line character is usually the standard Windows \r\n.
|
| Yes, that means if you want to really match the end of line you
| have to match "\r$". So broken.
| skrebbel wrote:
| FWIW, and I know this doesn't really address your complaint: I
| use Windows and I've set all my text editors to use LF
| exclusively years ago and Things Are Great. No more weird Git
| autocrlf warnings, no quirks when copying files over to/from
| people on Macs or Linuxes, etc. Even Notepad supports LF line
| endings for quite a long time now - to my practical experience,
| there's little remaining in Windows that makes CRLF "the OS
| standard line ending".
|
| I bet if someday VS Code's Windows build ships with LF default
| on new installations, people won't even notice.
|
| I mean, at some point it did matter what the OS did when you
| pressed the "Enter" button. But this isn't really the case much
| anymore. VS Code catches that keypress, and inserts whatever
| "files.eol" is set to. Sublime does the same. I didn't check,
| but I assume every other IDE has this setting.
|
| Similarly, the HTML spec, which is pretty nuts, makes browsers
| normalize my enters to LF characters as I type into this
| textarea here (I can check by reading the `value` property in
| devtools), but when it's submitted, it converts every LF to a
| CRLF because that's how HTML forms were once specced back in
| the day. Again though, what my OS considers to be "the standard
| newline" is simply not considered at all. Even CMD.EXE batch
| files support LF.
|
| I don't really type newlines all that much outside IDEs and
| browsers (incl electron apps) and places like MS Word, all of
| which disregard what the OS does and insert their own thing.
| Maybe the terminal? I don't even know. I doubt it's very
| consequential.
|
| EDIT: PSA the same holds for backslashes! Do Not Use
| Backslashes. Don't use "OS specific directory separator
| constants". It's not 1998, just type "/" - it just works.
| n_plus_1_acc wrote:
| I could never get visual studio (not code) to not use \r\n
| when editing a solution file via the gui
| divingdragon wrote:
| > Even CMD.EXE batch files support LF.
|
| I don't know if it is the case on Windows 11, but I have
| surely been bitten by CMD batch files using LF line endings.
| I don't remember the exact issue but it may have been the one
| bug affecting labels. [1]
|
| [1]:
| https://www.dostips.com/forum/viewtopic.php?t=8988#p58888
| pjc50 wrote:
| > I bet if someday VS Code's Windows build ships with LF
| default on new installations, people won't even notice.
|
| As with '/', they really ought to do this some day but won't.
| jbverschoor wrote:
| The whole \r is archaic. It doesn't even behave properly in
| most cases. Just use \n everywhere and bite the lemon for a
| short while to fix your problems.
|
| And if you believe \r\n is the way to go, please make sure \n\r
| also works as they should have the same results. (or
| \r\n\r\r\r\r for that matter)
| psd1 wrote:
| There are unices that use LFCR endings... computing is an
| endless bath in history
| HideousKojima wrote:
| But without \r how am I supposed to print to my typewriter
| over serial cable? Only half-joking, that's the setup my
| family had in the early 90's.
| jbverschoor wrote:
| Send BELL characters and wait for human intervention
| keybored wrote:
| Why did they even decide to use two characters for the end of
| line? Seems bizarre. I could have imagined that `\r` and `\n`
| was a tossup. But why both?
| mnau wrote:
| Likely compatibility bugs going back decades (70s?).
| Probably with some terminal/teletype.
|
| \r - returned teletype head to the start of a line
|
| \n - move paper one line down
|
| > The sequence CR+LF was commonly used on many early
| computer systems that had adopted Teletype machines--
| typically a Teletype Model 33 ASR--as a console device,
| because this sequence was required to position those
| printers at the start of a new line. The separation of
| newline into two functions concealed the fact that the
| print head could not return from the far right to the
| beginning of the next line in time to print the next
| character. Any character printed after a CR would often
| print as a smudge in the middle of the page while the print
| head was still moving the carriage back to the first
| position. "The solution was to make the newline two
| characters: CR to move the carriage to column one, and LF
| to move the paper up."[2] In fact, it was often necessary
| to send extra padding characters--extraneous CRs or NULs--
| which are ignored but give the print head time to move to
| the left margin. Many early video displays also required
| multiple character times to scroll the display.
|
| https://en.wikipedia.org/wiki/Newline
| jbverschoor wrote:
| It's similar to an old school typewriter.
|
| The handle does 2 things: return and feed. You can also
| just return by not pulling all the way or the other way
| around depending on the design
| HideousKojima wrote:
| Which also let you do strikethrough and similar effects
| by typing over a line you already typed
| keybored wrote:
| It is known. Why didn't Linux decide to do that though.
| HideousKojima wrote:
| Typewriters is why
| onion2k wrote:
| I can hear thousands of bad hiring managers adding 'How do you
| match the end of a string in a regex?' to their list of 'Ha! You
| don't know the trick!' questions designed to catch out
| candidates.
| hoc wrote:
| "I will hire you anyway, but I will pay you less"
|
| Regex, useful in any job...
| username_my1 wrote:
| regex is useful but chatgpt is amazing at it, so why spend a
| minute keeping such useless knowledge in mind.
|
| if you know where to find something no point in knowing it.
| ykonstant wrote:
| Does gpt produce efficient regex? Are there any experts
| here that can assess the quality and correctness of gpt-
| generated regex? I wonder how regex responses by gpt are
| validated if the prompter does not have the knowledge to
| read the output.
| thecatspaw wrote:
| what does gpt say how we should validate email addresses?
| rhd wrote:
| chatgpt-4:
|
| ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$
|
| https://chat.openai.com/share/696f7046-7f43-4331-b12b-538
| 566...
|
| chatgpt-3.5:
|
| ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$
|
| https://chat.openai.com/share/aaa09ae8-3fd9-4df7-a417-948
| 436...
| layer8 wrote:
| ...which both excludes addresses allowed by the RFC and
| includes addresses disallowed by the RFC. (For example,
| the RFC disallows two consecutive dots in the local-
| part.)
| KMnO4 wrote:
| I take the descriptivist approach to email validation,
| rather than the prescriptivist.
|
| I know an email has to have a domain name after the @ so
| I know where to send it.
|
| I also know it has to have something before the @ so the
| domain's email server knows how to handle it.
|
| But do I care if the email server supports sub-
| addresses, characters outside of the commonly supported
| range (eg quotation marks and spaces), or even characters
| which aren't part of the RFC? I do not.
|
| If the user gives me that email, I'll trust them. Worst
| case they won't receive the verification email and will
| need to double check it. But it's a lot better than those
| websites who try to tell me my email is invalid because
| their regex is too picky.
| layer8 wrote:
| I generally agree, but the two consecutive dots (or
| leading/trailing dots) are an example that would very
| likely be a typo and that you wouldn't particularly want
| to send. Similar for unbalanced quotes, angle brackets,
| and other grammar elements.
| dumbo-octopus wrote:
| I wonder whether simply (regex) replacing a sequence of
| .'s with a single one as part of a post-processing step
| would be effective.
| layer8 wrote:
| That would be bad form, IMO. The user may have typed
| _john..kennedy@example.com_ by mistake instead of
| _john.f.kennedy@example.com_ , and now you'll be sending
| their email to _john.kennedy@example.com_. Similar for
| leading or trailing dots. You can't just decide what a
| user probably meant, when they type in something invalid.
| wtetzner wrote:
| Yeah, that's about as far as I've ever been comfortable
| going in terms of validating email addresses too: some
| stuff followed by "@" followed by more stuff.
|
| Though I guess adding a check for invalid dot patterns
| might be worthwhile.
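|
| A rough sketch of the "stuff, then @, then more stuff" check
| with the dot-pattern test bolted on. The function name
| `plausible_email` is hypothetical, and the dot rule assumes
| unquoted local parts; a quoted local part could legally
| contain consecutive dots, so treat a failure as a hint to the
| user rather than proof of an invalid address.

```python
def plausible_email(address):
    """Loose check: something@something, plus a few dot-pattern sanity checks."""
    # Split on the last "@" so the domain is whatever follows it.
    local, at, domain = address.rpartition("@")
    if not at or not local or not domain:
        return False
    for part in (local, domain):
        # Leading, trailing, or doubled dots are most likely typos.
        if part.startswith(".") or part.endswith(".") or ".." in part:
            return False
    return True


print(plausible_email("john.f.kennedy@example.com"))  # True
print(plausible_email("john..kennedy@example.com"))   # False
```

| Sending a confirmation email remains the real test.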
| jcranmer wrote:
| The HTML email regex validation [1] is probably the best
| rule to use for validating an email address in most user
| applications. It prohibits IP address domain literals
| (which the emailcore people have basically said is of
| limited utility [2]), and quoted strings in the
| localpart. Its biggest fault is allowing multiple dots to
| appear next to each other, which is a lot of faff to put
| in a regex when you already have to individually spell
| out every special character in atext.
|
| [1]
| https://html.spec.whatwg.org/multipage/input.html#email-
| stat...
|
| [2] https://datatracker.ietf.org/doc/draft-ietf-
| emailcore-as/
| marcosdumay wrote:
| What is maybe more important to note, it completely
| disallows the language of some 4/5 of the humanity. And
| partially disallows some 2/3 of the rest.
| sebstefan wrote:
| Actually pretty good response if the programmer bothers
| to read all of it
|
| I'd be more emphatic that you shouldn't rely on regexes
| to validate emails and that this should only be used as
| an "in the form validation" first step to warn of user
| input error, but the gist is there
|
| > This regex is *practical for most applications* (??),
| striking a balance between complexity and adherence to
| the standard. It allows for basic validation but does not
| fully enforce the specifications of RFC 5322, which are
| much more intricate and challenging to implement in a
| single regex pattern.
|
| ^ ("challenging"? Didn't I see that email validation
| requires at least a grammar and not just a regex?)
|
| > For example, it doesn't account for quoted strings
| (which can include spaces) in the local part, nor does it
| fully validate all possible TLDs. Implementing a regex
| that fully complies with the RFC specifications is
| impractical due to their complexity and the flexibility
| allowed in the specifications.
|
| > For applications requiring strict compliance, it's
| often recommended to use a library or built-in function
| for email validation provided by the programming language
| or framework you're using, as these are more likely to
| handle the nuances and edge cases correctly.
| Additionally, the ultimate test of an email address's
| validity is sending a confirmation email to it.
| bonki wrote:
| Not good at all, but a little better than expected. I use
| + in email addresses prominently and there are so many
| websites who don't even allow that...
| zaxomi wrote:
| Remember to first punycode the domain part of an email
| address before trying to validate it, or it will not work
| with internationalized domain names.
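|
| A minimal sketch of that conversion step using Python's
| built-in "idna" codec. The helper name `to_ascii_address` is
| hypothetical, and the codec implements the older IDNA 2003
| rules, so a production system may want a dedicated IDNA 2008
| library instead.

```python
def to_ascii_address(address):
    """Convert the domain part of an address to its ASCII (punycode) form."""
    local, at, domain = address.rpartition("@")
    if not at:
        raise ValueError("no @ in address")
    # The built-in "idna" codec encodes each label of the domain separately.
    ascii_domain = domain.encode("idna").decode("ascii")
    return f"{local}@{ascii_domain}"


# "\u00fc" is the letter u with a diaeresis, as in the German word for books.
print(to_ascii_address("user@b\u00fccher.example"))  # user@xn--bcher-kva.example
```

| The local part is left untouched here; only the domain is
| converted.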
| jameshart wrote:
| Support for IDN email addresses is still patchy at best.
| Many systems can't send to them; many email hosts still
| can't handle being configured for them.
| criley2 wrote:
| Prompt:
|
| 'I'm writing a nodejs javascript application and I need a
| regex to validate emails in my server. Can you write a
| regex that will safely and efficiently match emails?'
|
| GPT4 / Gemini Advanced / Claude 3 Sonnet
|
| GPT4: `const emailRegex =
| /^[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$/;` Full
| answer: https://justpaste.it/cg4cl
|
| Gemini Advanced: `const emailRegex = /^[a-zA-Z0-9.!#$%&'
| *+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0
| -9])?(?:\\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)
| *$/;` Full answer: https://justpaste.it/589a5
|
| Claude 3: `const emailRegex =
| /^([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,})$/;`
| Full answer: https://justpaste.it/82r2v
| zaxomi wrote:
| Still doesn't support internationalized domain names.
| croemer wrote:
| Terrible answers as far as I can tell, especially
| ChatGPT's, which would throw out many valid email addresses.
| dfawcus wrote:
| Whereas email more or less lasts forever (mailbox
| contents), and has to be backwards compatible with older
| versions back to (at least) RFC 821/822, or those before.
| It also allows almost any character (when escaped at 821
| level) in the host or domain part (domain names allow any
| byte value).
|
| So an Internet email address match pattern has to be:
| "..*@..*", anything else can reject otherwise valid
| addresses.
|
| That however does not account for earlier source routed
| addresses, nor the old style UUCP bang paths. However
| those can probably be ignored for newly generated email.
|
| I regularly use an email address with a "+" in the host
| part. When I used qmail, I often used addresses like:
| "foo-a/b-bar-tat@DOMAIN". Mainly for auto filtering
| received messages from mailing lists.
| skeaker wrote:
| There really ought to be a regex repository of common use
| cases like these so we don't have to reinvent the wheel
| or dig up a random codebase that we hope is correct to
| copy from every time.
| da39a3ee wrote:
| You don't have to be an expert; you should very rarely be
| using regexes so complex that you can't understand them.
| zacmps wrote:
| It might not be obvious when you hit that point, bad
| regexes can be subtle, just see that old cloudflare
| postmortem.
| hnlmorg wrote:
| ...and if you can understand them then you clearly
| understand regex enough not to need ChatGPT to write them
| kaibee wrote:
| I understand assembly too.
| mnau wrote:
| Even simple regexs can be problematic, e.g. Gitlab RCE
| bug through ExifTools
|
| https://devcraft.io/2021/05/04/exiftool-arbitrary-code-
| execu...
|
| > "a\
| > ""
|
| > The second quote was not escaped because in the regex
| $tok =~ /(\\\\+)$/ the $ will match the end of a string,
| but also match before a newline at the end of a string,
| so the code thinks that the quote is being escaped when
| it's escaping the newline.
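|
| The effect is easy to reproduce. The sketch below uses
| Python's `re` rather than Perl, on the assumption that `$`
| behaves the same way here (it also matches before a trailing
| newline); the pattern has the same shape as the one quoted
| above, and the names are illustrative.

```python
import re

# Same shape as the quoted pattern: one or more trailing backslashes.
TRAILING_BACKSLASHES = re.compile(r"(\\+)$")

token = "a\\\n"  # the characters: a, backslash, newline

# `$` accepts the position before the final newline, so the backslash
# is treated as if it ended the string.
loose = TRAILING_BACKSLASHES.search(token)

# `\Z` in Python only matches at the true end of the string.
strict = re.search(r"(\\+)\Z", token)

print(loose is not None, strict is not None)  # True False
```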
| 2devnull wrote:
| That was one of my first uh oh moments with gpt. Getting
| code that clearly had untestable/unreadable regexen,
| which given the source must have meant the regex were gpt
| generated. So much is going to go wrong, and soon.
| berkes wrote:
| > if you know where to find something no point in knowing
| it.
|
| Nonsense. And you know it.
|
| First, you need to know _what_ to find, before knowing
| _where_ to find it. And knowing _what_ to find requires
| intricate knowledge of the thing. Not intricate
| implementation details, but enough to point yourself in the
| right direction.
|
| Secondly, you need to know _why_ to find thing X and not
| thing Y. If anything, ChatGPT is even worse than google or
| stackoverflow in "solving the XY problem for you". XY is a
| problem you don't want solved, but instead to be told that
| you don't want to solve it.
|
| Maybe some future LLM can also push back. Maybe some future
| LLM can guide you to the right answer for a problem. But at
| the current state: nope.
|
| Related: regexes are almost never the best answer to any
| question. They are available and quick, so all considered,
| maybe "the best" for this case. But overall: nah.
| pksebben wrote:
| While I agree with your point that knowing things
| matters, it is entirely possible with the current batch
| of LLMs to get to an answer you don't know much about.
| It's actually one of the few things they do reliably
| well.
|
| You start with what you _do_ know, asking leading
| questions and being clear about what you don't, and you
| build towards deeper and deeper terminology until you get
| to the point where there are docs to read (because you
| still can't trust them to get the specifics right).
|
| I've done this on a number of projects with pretty
| astonishing results, building stuff that would otherwise
| be completely out of my wheelhouse.
| lolc wrote:
| Funny for me there have been instances where the LLM did
| push back. I had a plan of how to solve something and
| tasked the LLM with a draft implementation. It kept
| producing another solution which I kept rejecting and
| specifying more details so it wouldn't stray. In the end
| I had to accept that my solution couldn't work, and that
| the proposed one was acceptable. It's going to happen
| again, because it often comes up with inferior solutions
| so I'm not very open to the reverse situation.
| HumblyTossed wrote:
| This is something ChatGPT would say.
| Karellen wrote:
| > Folks who've worked with regular expressions before might know
| about ^ meaning "start-of-string" and correspondingly see $ as
| "end-of-string".
|
| Huh. I always think of them as "start-of-line" and "end-of-line".
| I mean, a lot of the time when I'm working with regexes, I'm
| working with text a line at a time so the effect is the same, but
| that doesn't change how I think of those operators.
|
| Maybe because a fair amount of the work I do with regexes (and,
| probably, how I was introduced to them) is via `grep`, so I'm
| often thinking of the inputs as "lines" rather than "strings"?
| jamesmunns wrote:
| Same, tho it'd be interesting to see if this behavior holds if
| the file ends without a trailing newline and your match is on
| the final newline-less line.
| fooofw wrote:
| Fortunately, it's pretty simple to test.
|
|     $ printf 'Line with EOL\nLine without EOL' | grep 'EOL$'
|     Line with EOL
|     Line without EOL
|     $ grep --version | head -n1
|     grep (GNU grep) 3.8
| romwell wrote:
| The line does end with the file, so it's logically
| consistent.
|
| It's not matching the newline character after all.
| colimbarna wrote:
| Yes exactly, they match the end of a line, not a newline
| character. Some examples from documentation:
|
| man 7 regex: '$' (matching the null string at the end of
| a line)
|
| pcre2pattern: The circumflex and dollar metacharacters
| are zero-width assertions. That is, they test for a
| particular condition being true without consuming any
| characters from the subject string. These two
| metacharacters are concerned with matching the starts and
| ends of lines. ... The dollar character is an assertion
| that is true only if the current matching point is at the
| end of the subject string, or immediately before a
| newline at the end of the string (by default), unless
| PCRE2_NOTEOL is set. Note, however, that it does not
| actually match the newline. Dollar need not be the last
| character of the pattern if a number of alternatives are
| involved, but it should be the last item in any branch in
| which it appears. Dollar has no special meaning in a
| character class.
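|
| The zero-width point can be checked directly. A small sketch
| in Python's `re` (not PCRE2 itself, though the behavior
| described matches here):

```python
import re

m = re.search(r"cat$", "cat\n")

# The match covers only "cat"; `$` asserted a position and consumed nothing.
print(repr(m.group()), m.span())  # 'cat' (0, 3)
```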
| jamesmunns wrote:
| Thanks! I was AFK and didn't have a grep (or a shell) handy
| on my phone.
| antegamisou wrote:
| _Maybe because a fair amount of the work I do with regexes
| (and, probably, how I was introduced to them) is via `grep`, so
| I'm often thinking of the inputs as "lines" rather than
| "strings"?_
|
| Vim is what did that for me.
| wccrawford wrote:
| It's kind of driving me nuts that the article says ^ is "start
| of string" when it's actually "start of line", just like $ is
| "end of line". \A is apparently "start of string" like \Z is
| "end of string".
| masklinn wrote:
| It's not start of line though, unless the engine is in
| multiline mode. Here is the documentation for Python's re for
| instance:
|
| > Matches the start of the string, and in MULTILINE mode also
| matches immediately after each newline.
|
| Or JavaScript:
|
| > An input boundary is the start or end of the string; or, if
| the m flag is set, the start or end of a line.
|
| \A and \Z are start/end of input regardless of mode... when
| they're available, that's not the case of all engines.
| eastbound wrote:
| Probably a vulnerability issue. Programmers would leave
| multiline mode on by mistake, then validate that some
| string only contains ^[a-zA-Z]*$... only for the string to have
| an \n and an SQL injection on the second line.
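|
| A sketch of that failure mode in Python, with a made-up
| payload. With the MULTILINE flag the anchored pattern only has
| to hold for one line, whereas `re.fullmatch` has to account for
| every character.

```python
import re

payload = "abc\n'; DROP TABLE users; --"

# With MULTILINE, `^...$` only has to hold for one line of the input.
loose = re.search(r"^[a-zA-Z]*$", payload, re.MULTILINE) is not None

# fullmatch requires the whole string to consist of letters.
strict = re.fullmatch(r"[a-zA-Z]*", payload) is not None

print(loose, strict)  # True False
```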
| masklinn wrote:
| > Probably a vulnerability issue.
|
| No? It's a semantics decision.
| danbruc wrote:
| It is start and end of line. [1]
|
| _Usually ^ matches only at the beginning of the string,
| and $ matches only at the end of the string and immediately
| before the newline (if any) at the end of the string. When
| this flag is specified, ^ matches at the beginning of the
| string and at the beginning of each line within the string,
| immediately following each newline. Similarly, the $
| metacharacter matches either at the end of the string and
| at the end of each line (immediately preceding each
| newline)._
|
| In single-line [2] mode, the line starts at the start of
| the string and ends at the end of the line where the end of
| the line is either the end of the string if there is no
| terminating newline or just before the final newline if
| there is a terminating newline.
|
| In multi-line mode a new line starts at the start of the
| string and after each newline and ends before each newline
| or at the end of the string if the last line has no
| terminating newline.
|
| The confusion is that people think that they are in string-
| mode if they are not in multi-line mode but they are not,
| they are in single-line mode, ^ and $ still use the
| semantics of lines and a terminating newline, if present,
| is still not part of the content of the line.
|
| With \n\n\n in single-line mode the non-greedy ^(\n+?)$
| will capture only two of the newlines, the third one will
| be eaten by the $. If you make it greedy ^(\n+)$ will
| capture all three newlines. So arguably the implementations
| that do not match cat\n with cat$ are the broken ones.
|
| [1] https://docs.python.org/3/howto/regex.html#more-
| metacharacte...
|
| [2] I am using single-line to mean not multi-line for
| convenience even though single-line already has a different
| meaning.
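|
| The three-newline example can be run as-is in Python; a short
| sketch (variable names are illustrative):

```python
import re

s = "\n\n\n"

# Lazy: stops as soon as `$` can succeed, i.e. just before the last newline.
lazy = re.match(r"^(\n+?)$", s)

# Greedy: takes all three newlines and `$` succeeds at the very end.
greedy = re.match(r"^(\n+)$", s)

print(len(lazy.group(1)), len(greedy.group(1)))  # 2 3
```

| Strictly, `$` consumes nothing; the third newline is simply
| left outside the match in the lazy case.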
| masklinn wrote:
| > It is start and end of line.
|
| You seem to have redefined "line" as "not a line".
|
| > The confusion
|
| I'm sure redefining "line" as "nothing like what anyone
| reasonable would interpret as a line" will help a lot and
| right clear up the confusion.
| danbruc wrote:
| The POSIX definition of a line is a sequence of non-
| newline characters - possibly zero - followed by a
| newline. Everything that does not end with a newline is
| not a [complete] line. So strictly speaking it would even
| be correct that cat$ does not match cat because there is
| no terminating newline, it should only match cat\n. But
| as lines missing a terminating newline is a thing, it
| seems reasonable to be less strict.
| masklinn wrote:
| > a line is a sequence of non-newline characters
|
| Works for me.
|
| How do you square that with your assertion that in your
| invention of "single-line mode" you implicitly define
| "line" as matching \n\n?
| danbruc wrote:
| If you are not in multi-line mode, then a single line is
| expected and consequently there is at most one newline at
| the end of the string. You can of course pick an input
| that violates this, run it against a multi-line string
| with several newlines in it. cat\n\n will not match cat$
| because there is something between cat and the end of the
| line, it just happens to be a newline but without any
| special meaning because it is not the last character and
| you did not say that the input is multi-line.
| sltkr wrote:
| Python violates that definition however, by allowing
| internal newlines in strings. For example /^c[^a]t$/
| matches "c\nt\n", but according to POSIX that's not a
| line.
|
| I suspect the real reason for Python's behavior starts
| with the early decision to include the terminating
| newline in the string returned by IOBase.readline().
|
| Python's peculiar choice has some minor advantages: you
| can distinguish between files that do and don't end with
| a terminating newline (the latter are invalid according
| to POSIX, but common in practice, especially on Windows),
| and you can reconstruct the original file by simply
| concatenating the line strings, which is occasionally
| useful.
|
| The downside of this choice is that as a caller you have
| to deal with strings that may-or-may-not contain a
| terminating newline character, which is annoying (I often
| end up calling rstrip() or strip() on every line returned
| by readline(), just to get rid of the newlines;
| read().splitlines() is an option too if you don't mind
| reading the entire file into memory upfront).
|
| My guess is that Python's behavior is just a hack to make
| re.match() easier to use with readline(), rather than
| based on any principled belief about what lines are.
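|
| A short sketch of the readline() behavior described above,
| using an in-memory `io.StringIO` as a stand-in for a real file:

```python
import io

f = io.StringIO("first\nsecond\nlast without newline")

# readline() keeps the terminator, so a final line without one stands out.
raw = [f.readline() for _ in range(3)]

# Joining the raw lines reconstructs the original text exactly.
rebuilt = "".join(raw)

# Stripping the terminator gives the line content alone.
stripped = [line.rstrip("\n") for line in raw]

print(raw)
print(stripped)
```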
| danbruc wrote:
| Python's behavior is not a hack, it is the common
| behavior. $ matches at the end of the string or before
| the last character if that is a newline, which is
| logically the same as the end of a single line. But as
| you said, you can have additional newlines inside of the
| string which is also the common behavior and not specific
| to python. Personally I think of this as you just assume
| that the string is a single line and match $ accordingly,
| either at the end of the string or before a terminating
| newline, if there are additional newlines, you treat them
| mostly as normal characters, with the exception of dot
| not matching newlines unless you set the single-line/dot-
| all flag.
| sltkr wrote:
| > Python's behavior [..] is the common behavior.
|
| The very post we're commenting on shows that that's not
| true: PHP, Python, Java and .NET (C#) share one behavior
| (accept "\n" as "$"), and ECMAScript (Javascript),
| Golang, and Rust share another behavior (do not accept
| "\n" as $).
|
| Let's not argue about which is "the most common"; all of
| these languages are sufficiently common to say that there
| is no single common behavior.
|
| > $ matches at the end of the string or before the last
| character if that is a newline, which is logically the
| same as the end of a single line.
|
| Yes, that is Python's behavior (and PHP's, Java's, etc.).
| You're just describing it; not motivating why it has to
| work that way or why it's more correct than the obvious
| alternative of only matching the end of the string.
|
| Subjectively, I find it odd that /^cat$/ matches not just
| the obvious string "cat" but also the string "cat\n". And
| I think historically, it didn't. I tried several common
| tools that predate Python:
|
| - awk 'BEGIN { print ("cat\n" ~ /^cat$/) }' prints 0
| - in GNU ed, /^M/ does not match any lines
| - in vim, /^M/ does not match any lines
| - sed -n '/\n/p' does not print any lines
| - grep -P '\n' does not match any lines
| - (I wanted to try `grep -E` too but I don't know how to
|   escape a newline)
| - perl -e 'print ("cat\n" =~ /^cat$/)' prints 1
|
| So the consensus seems to be that the classic UNIX line-
| based tools match the regex against the line excluding
| the newline terminator (which makes sense since it isn't
| part of the content of that line) and therefore $ only
| needs to match the end of the string.
|
| The odd one out is Perl: it seems to have introduced the
| idea that $ can match a newline at the end of the string,
| probably for similar reasons as Python. All of this
| suggests to me that allowing $ to match both "\n" and ""
| at the end of the string was a hack designed to make it
| easier to deal with strings without control characters
| and strings that end with a single newline.
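|
| A rough Python sketch of the line-based approach described
| above, where the terminator is stripped before the regex ever
| sees the line. `grep_lines` is a hypothetical helper, not any
| real tool's implementation, and `str.splitlines()` also splits
| on a few separators other than "\n":
|
```python
import re

def grep_lines(pattern, text):
    """Return the lines of `text` whose content matches `pattern`."""
    regex = re.compile(pattern)
    matched = []
    # splitlines() drops the line terminators, so the regex only ever
    # sees the content of each line.
    for line in text.splitlines():
        if regex.search(line):
            matched.append(line)
    return matched

text = "cat\ncats\nconcat\n"
print(grep_lines(r"^cat$", text))   # only the line whose content is exactly "cat"
print(grep_lines(r"\n", text))      # no line content contains a newline
```
|
| With the newline gone before matching, "$" never needs a
| special case for it.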
| Bjartr wrote:
| The line delimiter is a newline.
|
| If you have a file containing `A\nB\nC`, the
| file is three lines long.
|
| I guess it could be argued that a file containing
| `A\nB\nC\n` has four lines, with the fourth having zero
| length.
|
| That a regex is applying to an in memory string vs a file
| doesn't feel to me like it should have different
| semantics.
|
| Digging into the history a little, it looks like regexes
| were popularized in text editors and other file oriented
| tooling. In those contexts I imagine it would be far more
| common to want to discard or ignore the trailing zero
| length line than to process it like every other line in a
| file.
| akdev1l wrote:
| Technically the "newline" character is actually a line
| _terminator_. Hence "A\n" is one line, not two. The "\n"
| is always at the end of a line by definition.
| wtetzner wrote:
| So if you have "A" in a file with no newline, there are
| no lines in that file?
| jepler wrote:
| Yes, that is a file with zero lines that ends with an
| "incomplete line". Processing of such files by standard
| line-oriented utilities is undefined in the opengroup
| spec. So, for instance, the effect of "grep"ping such a
| file is not defined. Heck, even "cat"ting such a file
| gives non-ideal results, such as colliding with the
| regular shell prompt. For this reason, a lot of software
| projects I work on check and correct this condition
| whenever creating a commit.
|
| https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...
| ("text file")
| rovr138 wrote:
| > Yes, that is a file with zero lines that ends with an
| "incomplete line".
|
| It's a file with zero complete lines. But it has 1 line,
| that's incomplete, right?
|
| The file starts empty. Anything in it starts "a line". So
| it's 1 incomplete line.
|
| I hate weird states.
| xyzzy_plugh wrote:
| No, it is valid for a file to have content but no lines.
|
| Semantically many libraries treat that as a line because
| while \n<EOF> means "the end of the last line" having
| just <EOF> adds additional complexity the user has to
| handle to read the remaining input. But by the book it's
| not "a line".
|
| If I said "ten buckets of water" does that mean ten full
| buckets? Or does a bucket with a drop in it count as "a
| bucket of water?" If I asked for ten buckets of water and
| you brought me nine and one half-full, is that
| acceptable? What about ten half-full buckets?
|
| A line ends in a newline. A file with no newlines in it
| has no lines.
| joshjje wrote:
| That's beyond ridiculous. Most languages, when you are
| reading a line from a file and it doesn't have a \n
| terminator, are going to give you that line, not say,
| oops, this isn't a line, sorry.
| LK5ZJwMwgBbHuVI wrote:
| That's a relatively recent invention compared to tools
| like `wc` (or your favorite `sh` for that matter). See
| also: https://perldoc.perl.org/functions/chop wherein the
| norm was "just cut off the last character of the line, it
| will always be a newline"
| squeaky-clean wrote:
| Most languages but not all. I've even been bit by this
| recently in cron.
|
| Assuming that EOF is identical to \nEOF will end up
| causing trouble for you one day, because it's not
| actually identical.
| int_19h wrote:
| I don't think you can meaningfully generalize to "most
| languages" here. To give an example, two extremely
| popular languages are C and Python. Both have a standard
| library function to read a line from a text stream -
| fgets() for C, readline() for Python. In both cases, the
| behavior is to read up to _and including_ the newline
| character, but also to stop if EOF is encountered before
| then. Which means that the return value is different for
| terminated vs unterminated final lines in both languages
| - in particular, if there's no \n before EOF, the value
| returned is _not a line_ (as it does not end with a
| newline), and you have to explicitly write your code to
| accommodate that.
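|
| A small sketch of that difference, assuming Python's
| `io.StringIO` as a stand-in for a text file (`read_lines` is
| an illustrative helper, not a standard function):
|
```python
import io

def read_lines(stream):
    """Collect (content, was_terminated) pairs using readline()."""
    lines = []
    while True:
        raw = stream.readline()
        if raw == "":              # readline() returns "" only at EOF
            break
        terminated = raw.endswith("\n")
        content = raw[:-1] if terminated else raw
        lines.append((content, terminated))
    return lines

complete = read_lines(io.StringIO("a\nb\n"))
incomplete = read_lines(io.StringIO("a\nb"))
print(complete)
print(incomplete)
```
|
| The caller has to check for the terminator explicitly to tell
| a complete final line from an unterminated one.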
| nativeit wrote:
| I get this is largely a semantic debate, but find it a
| little ironic so many programmers seem put off with the
| idea of a line count that starts at "0".
| akdev1l wrote:
| No, a line is defined as a sequence of characters
| (bytes?) with a line terminator at the end.
|
| Technically as per posix a file as you describe is
| actually a binary file without any lines. Basically just
| random binary data that happens to kind of look like a
| line.
| mort96 wrote:
| It's a file with 0 lines and some trailing garbage.
| DougBTX wrote:
| Another way to look at it is that concatenating files
| should sum the line count. Concatenating two empty files
| produces an empty file, so 0 + 0 = 0. If "incomplete
| lines" are not counted as lines, then the maths still
| works out. If they counted as lines, it would end up as 1
| + 1 = 1.
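|
| The arithmetic can be checked with a short Python sketch. Both
| helper names are made up for illustration; `count_lines`
| counts newline characters, which is also what `wc -l` reports:
|
```python
def count_lines(data):
    # Terminator view: one line per newline character.
    return data.count("\n")

def count_pieces(data):
    # Separator view: the newline splits the text into pieces.
    return len(data.split("\n"))

a = "one\ntwo\n"
b = "three\n"

# Terminator counting is additive under concatenation.
additive = count_lines(a) + count_lines(b) == count_lines(a + b)

# Separator counting is not: two empty files give 1 + 1 pieces,
# but their concatenation is still a single (empty) piece.
empty_sum = count_pieces("") + count_pieces("")
empty_concat = count_pieces("" + "")

print(additive, empty_sum, empty_concat)
```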
| coryrc wrote:
| Pedantically, if it doesn't end with a newline, it's
| considered a binary file and not a text file. Binary
| files don't have lines.
|
| In practice, most utilities expecting text files will
| still operate on it.
| PaulDavisThe1st wrote:
| No file has lines.
|
| "Lines" are a convention established by (or not) software
| reading a data stream.
| coryrc wrote:
| Ackshully
| rerdavies wrote:
| The opengroup spec says no such thing.
| simonh wrote:
| 3.206 Line
|
| A sequence of zero or more non-<newline> characters plus
| a terminating <newline> character.
|
| See also '3.403 Text File' for the definition of a text
| file. No new line characters, no lines. No lines, not a
| text file.
| mbrubeck wrote:
|     $ echo -n "A" | wc --lines
|     0
| keybored wrote:
| Yep. since wc(1) apparently strictly adheres to what a
| newline-terminated text file is. This is why plaintext
| files should end with a newline. :)
|
| See: https://stackoverflow.com/a/25322168/1725151
| LK5ZJwMwgBbHuVI wrote:
| Why don't you go ask?
|
|     $ echo -n foo | wc -l
|     0
| Gormo wrote:
| Suddenly the DOS/Windows solution of using \r\n instead
| of just \n seems to offer some advantages.
| samatman wrote:
| This does precisely nothing to solve the ambiguity issue
| when a final line lacks a newline. The representation of
| that newline isn't relevant to the problem.
| Izkata wrote:
| It's actually slightly worse: Windows defines newline as
| a delimiter, not a terminator. So this:
| foo\nbar\n
|
| Would be 2 lines in *nix and 3 lines in windows.
| deaddodo wrote:
| The "Windows way" is the "right way" for a few reasons.
|
| This is definitely _not_ one of them.
| int_19h wrote:
| Which are the valid reasons, legacy meanings of those
| characters aside?
| rerdavies wrote:
| Technically, that is one of two possible interpretations,
| and you seem to have invented a "by definition" out of
| thin air.
|
| Very very technically a "newline" character indicates the
| start of a new line, which is why it is not called the
| "end-of-line" character.
| cortesoft wrote:
| I mean, the person you are responding to didn't invent
| the definition out of thin air... the POSIX standard did:
|
| 3.206 Line
|
| A sequence of zero or more non-<newline> characters plus a
| terminating <newline> character.
|
| https://pubs.opengroup.org/onlinepubs/9699919799.2018edition...
| nomel wrote:
| Posix getline() includes EOF as a line terminator:
|
|     getline() reads an entire line from stream, storing the
|     address of the buffer containing the text into *lineptr.
|     The buffer is null-terminated and includes the newline
|     character, if one was found. ...
|
|     ... a delimiter character is not added if one was not
|     present in the input before end of file was reached.
|
| EOF seems same as end-of-string.
| mabster wrote:
| I don't know why no-one here sees this as a bad design...
|
| If a line is missing a newline then we just disregard
| it?!
|
| A way better way to deal with newline is to treat it as a
| separator, like a comma. And like in modern languages we allow
| a final separator, but ignore it so that it is easier for
| tools to generate files.
|
| Now all combinations of characters, including newline
| characters, have an interpretation without dropping
| anything.
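|
| One way to read this proposal, sketched in Python under the
| assumption that only "\n" counts as a separator. The function
| name is hypothetical:
|
```python
def split_allowing_final_separator(data):
    """Split on newlines, ignoring a single trailing separator."""
    pieces = data.split("\n")
    if pieces[-1] == "":
        pieces.pop()  # a final separator does not open a new line
    return pieces

with_final = split_allowing_final_separator("a\nb\n")
without_final = split_allowing_final_separator("a\nb")
blank_last = split_allowing_final_separator("a\n\n")
empty = split_allowing_final_separator("")

print(with_final, without_final, blank_last, empty)
```
|
| Every input gets an interpretation, at the cost that "a\nb"
| and "a\nb\n" become indistinguishable after splitting.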
| LK5ZJwMwgBbHuVI wrote:
| It doesn't indicate the start of a new line, or files
| would _start_ with it. Files _end_ with it, which is why
| it is a line terminator. And it is by definition: by the
| standard, by the way cat and/or your shell and/or your
| terminal work together, and by the way standard utilities
| like `wc` treat the file.
| joshjje wrote:
| "A\n" is two lines.
| LK5ZJwMwgBbHuVI wrote:
| Factually incorrect.
| f1shy wrote:
| Matches the EMPTY STRING at the beginning of the line is
| the correct definition.
| tangus wrote:
| That gives the author space for another article ;)
| amelius wrote:
| What is driving me nuts is that we have Unicode now, so there
| is no need to use common characters like $ or ^ to denote
| special regex state transitions.
| knome wrote:
| the idea of changing a decades old convention to instead
| use, as I assume you are implying, some character that
| requires special entry, is beyond silly.
| FranOntanaya wrote:
| I don't think anyone that writes regex would feel
| specially challenged by using the Alt+ | Ctrl+Shift+u key
| combos for unicode entry. Having to escape less things in
| a pattern would be nice.
| amelius wrote:
| Also, code is read more often than it is written.
| cortesoft wrote:
| People say this all the time, but is it really always
| true? I have a ton of code that I wrote, that just works,
| and I never really look at it again, at least not with
| the level of inspection that requires parsing the regex
| in my head.
| cortesoft wrote:
| I write regexes all the time, and I don't know if I would
| be CHALLENGED by that, but it would be annoying. Escaping
| things is trivial, and since you do it all the time it is
| not anything extra to learn. Having to remember bespoke
| keystrokes for each character is a lot more to learn.
| keybored wrote:
| ASCII restriction begets ASCII toothpick soup. Either
| lift that restriction or use balanced delimiters for
| strings in ASCII like backtick and single quote.
|
| ("But backtick is annoying to type" said the Europeans.)
| int_19h wrote:
| Regexes are one case where I think it's already extremely
| unbalanced wrt being easy to write but hard to read.
| Using stuff like special Unicode chars for this would
| make them harder to write but easier to read, which
| sounds like a fair deal to me. In general, I'd say that
| regexes _should_ take time and effort to write, just
| because it's oh-so-easy to write something that kinda
| sorta works but has massive footguns.
|
| I would also imagine that, if this became the norm, IDEs
| would quickly standardize around common notation -
| probably actually based on existing regex symbols and
| escapes - to quickly input that, similar to TeX-like
| notation for inputting math. So if you're inside a regex
| literal, you'd type, say, \A, and the editor itself would
| automatically replace it with the Unicode sigil for
| beginning-of-string.
| keybored wrote:
| It's not that silly. You constantly get into escape
| conundrums because you need to use a metacharacter which
| is also a metacharacter three levels deep in some
| embedding.
|
| (But that might not solve that problem? Maybe the problem
| is mostly about using same-character delimiters for
| strings.)
|
| And I guess that's why Perl is so flexible with regards
| to delimiters and such.
| LK5ZJwMwgBbHuVI wrote:
| Yes, languages really need some sort of "raw string"
| feature like Python (or make regex literals their own
| syntax like Perl does). That's the solution here, not
| using weird characters...
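|
| For illustration, here is how Python's raw string literals cut
| down on backslash doubling. The price pattern is an invented
| example:
|
```python
import re

escaped = "^\\d+\\.\\d{2}$"   # ordinary literal: every backslash doubled
raw = r"^\d+\.\d{2}$"         # raw literal: backslashes reach re untouched

same_pattern = escaped == raw
matches_price = re.search(raw, "19.99") is not None

# Matching one literal backslash: two characters raw, four otherwise.
raw_backslash = re.search(r"\\", "C:\\temp") is not None
doubled_backslash = re.search("\\\\", "C:\\temp") is not None

print(same_pattern, matches_price, raw_backslash, doubled_backslash)
```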
| Yujf wrote:
| Why not? Common characters are easier to type and presumably
| if you are using regex on a unicode string they might
| include these special characters anyway so what have you
| gained?
| amelius wrote:
| In theory yes, in practice no.
|
| What you have gained is that the regex is now much easier
| to read.
| knome wrote:
| It's easy to read now.
| LK5ZJwMwgBbHuVI wrote:
| > In theory yes, in practice no.
|
| That's like "in theory we need 4 bytes to represent
| Unicode, but in practice 3 bytes is fine" ( _glances at
| universally-maligned utf8mb3_ )
| int_19h wrote:
| It's not really an issue if the string you're matching
| might have those characters. It's an issue if the _regex_
| you are matching that string might need to _match_ those
| characters verbatim. Which is actually pretty common with
| ()[]$ when you're matching phone numbers, prices etc -
| so you end up having to escape a lot, and regex is less
| readable especially if it also has to use those same
| characters as regex operators. On the other hand, it
| would be very uncommon to want to literally match, say,
| or [[?].
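|
| A short Python sketch of the escaping burden described above,
| using `re.escape` from the standard library on a made-up
| price string:
|
```python
import re

literal = "$5.00 (approx.) [sale]"

# Used as-is, the text is read as regex syntax: "$" is an anchor,
# "(...)" a group and "[...]" a character class, so it does not
# even match itself.
unescaped_hit = re.search(literal, literal)

# re.escape() backslash-escapes the metacharacters for us.
pattern = re.escape(literal)
escaped_hit = re.fullmatch(pattern, literal)

print(unescaped_hit is None, escaped_hit is not None)
print(pattern)
```
|
| The escaped pattern is correct but noticeably harder to read
| than the text it matches.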
| yjftsjthsd-h wrote:
| If we were willing to ignore the ability to actually type
| it, you don't need Unicode for that; ASCII has a whole
| block of control characters at the beginning; I think ASCII
| 25 ("End of medium") works here.
| codethatwerks wrote:
| The problem with using an eggplant to denote end of string
| is backwards compatibility.
| davidw wrote:
| What with unicode, it'd be fun to have A and O available to
| make our regexps that much more readable...
| kqr wrote:
| I'm the same, but now that I try in Perl, sure enough, $ seems
| to default to being a positive lookahead assertion for the end
| of the string. It does not match and consume an EOL character.
|
| Only in multiline mode does it match EOL characters, but it
| does still not appear to consume them. In fact, I cannot
| construct a regex that captures the last character of one line,
| then consumes the newline, and then captures the first
| character of the next line, while using $. The capture group
| simply ends at $.
| singingfish wrote:
| To get the newline captured as well you need to add the `/s`
| modifier too
| absoluteunit1 wrote:
| I've always thought that as well; mostly due to Vim though.
|
| ^ - takes you to start of line
| $ - takes you to end of line
| Izkata wrote:
| ^ actually takes you to the first non-whitespace character in
| the line in vim. For start of line you want 0
| kataklasm wrote:
| I don't have (n)vi(m) open right now but I think this only
| applies to prepending spaces. For prepending tabs, 0 will
| take you to the first non-tab character as well.
| qu4z-2 wrote:
| Vim takes me to the first character in the line (the
| first tab), but displays the cursor on the last
| gridsquare the tab's width covers.
| alphazard wrote:
| This must be the "second problem" everyone talks about with
| regular expressions.
| Izkata wrote:
| Same here; when I saw the title I was like "well obviously not,
| where did you hear that?"
|
| In nearly two decades of using regex I think this might be the
| first time I've heard of $ being end of string. It's always
| been end of line for me.
| frame_ranger wrote:
| You couldn't write a post like this if you didn't start with
| a strawman.
| michaelt wrote:
| Take a look at, for example, these stackoverflow answers
| about a regex to validate an e-mail address:
| https://stackoverflow.com/a/8829363
|
| These people are I think not intending to say a newline
| character is permitted at the end of an e-mail address.
|
| (Of course people using 'grep' would have different
| expectations for obvious reasons)
| Izkata wrote:
| Even disregarding whether or not end-of-string is also an
| end-of-line or not (see all the other comments below), $
| doesn't match the newline, similar to zero-width matches
| like \b, so the newline wouldn't be included in the matched
| text either way.
|
| I think this series of comments might be clearest:
| https://news.ycombinator.com/item?id=39764385
| LK5ZJwMwgBbHuVI wrote:
| Problem is, plenty of software doesn't actually look at
| the match but rather just validates that there _was_ a
| match (and then continues to use the input to that
| match).
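|
| A sketch of that failure mode in Python. The e-mail pattern is
| deliberately simplistic and both function names are
| hypothetical:
|
```python
import re

# Deliberately simplistic pattern, for illustration only.
EMAILISH = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")

def is_valid_loose(value):
    # Only asks "was there a match?" and ignores what was matched.
    return EMAILISH.search(value) is not None

def is_valid_strict(value):
    # fullmatch() requires the pattern to consume the entire input.
    return re.fullmatch(r"[\w.+-]+@[\w-]+\.[\w.-]+", value) is not None

user_input = "bob@example.com\n"
match = EMAILISH.search(user_input)

matched_text = match.group()          # the zero-width "$" excludes the newline
loose_ok = is_valid_loose(user_input)
strict_ok = is_valid_strict(user_input)

print(repr(matched_text), loose_ok, strict_ok)
```
|
| The matched text is clean, but code that keeps using
| `user_input` after the loose check still carries the newline.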
| notnmeyer wrote:
| i feel like this perspective will be split between folks who
| use regex in code with strings and more sysadmin folks who are
| used to consuming lines from files in scripts and at the cli.
|
| but yeah seems like a real misunderstanding from "start/end of
| string" people
| cerved wrote:
| In `sed` it's end of string.
|
| String is usually end of line, but not if you use stuff like
| `N`, to manipulate multi-line strings
| hans_castorp wrote:
| Fun fact: in Postgres, 'cat\n' matches 'cat$' when the so called
| "weird" newline matching is enabled :)
|
| https://www.postgresql.org/docs/current/functions-matching.h...
| AtNightWeCode wrote:
| There are many differences between implementations of regex. To
| name a few. Lookbehind, atomic groups, named capturing groups,
| recursion, timeouts and my favorite interop problem, unicode.
| wruza wrote:
| _By default, '$' only matches at the end of the string and
| immediately before the newline (if any) at the end of the
| string._
|
| The rationale was probably "it should be easier to match input
| strings" and now it's harder for everyone.
| febeling wrote:
| Seriously, just write one unit test for your regex.
| mannykannot wrote:
| Indeed, one should test any regex one puts any trust in, but
| the problem is that if you take as a fact something that is
| actually a false assumption (as the author did here), your test
| may well fail to find errors which may cause faults when the
| regex is put to use.
|
| This, in a nutshell, is the sort of problem which renders
| fallacious the notion that you can unit-test your way to
| correct software.
| PuffinBlue wrote:
| This seems like the perfect opportunity to introduce those
| unfamiliar to Robert Elder. He makes cool YouTube[0] and blog
| content[1] and has a series on regular expressions[2] and does
| some quite deep dives into the differing behaviour of the
| different tools that implement the various versions.
|
| His latest on the topic is cool too:
| https://www.youtube.com/watch?v=ys7yUyyQA-Y
|
| He has quite a lot of content that HN folks might be interested
| in I think, like the reality and woes of consulting[3]
|
| [0] https://www.youtube.com/@RobertElderSoftware
|
| [1] https://blog.robertelder.org/
|
| [2] https://blog.robertelder.org/regular-expressions/
|
| [3] https://www.youtube.com/watch?v=cK87ktENPrI
| aquariusDue wrote:
| I'm glad to see someone else that has stumbled over his
| content. Seconding the recommendation.
| CatchSwitch wrote:
| He has so many favorite Linux commands lol
| teknopaul wrote:
| Tldr;
|
| $ does not mean end of string in Python.
| frou_dh wrote:
| Something I found really surprising about Python's regexp
| implementation is that it doesn't support the typical character
| classes like [:alnum:] etc.
|
| It must be some kind of philosophical objection because there's
| no way something with as much water under the bridge as Python
| simply hasn't got around to it.
| k3vinw wrote:
| Another poor soul trying to solve one problem using regex and now
| they have two... ;)
| croes wrote:
| Isn't a string with a newline character automatically multiline?
|
| The new line is just empty but not the first line anymore.
| Joker_vD wrote:
| No, it is not.
|
|     3.195 Incomplete Line
|     A sequence of one or more non-<newline> characters at the
|     end of the file.
|
|     3.206 Line
|     A sequence of zero or more non-<newline> characters plus a
|     terminating <newline> character.
|
| courtesy of [0]. See also [1] for rationale on "text file":
|
|     Text File [...] The definition of "text file" has caused
|     controversy. The only difference between text and binary
|     files is that text files have lines of less than {LINE_MAX}
|     bytes, with no NUL characters, each terminated by a
|     <newline>. The definition allows a file with a single
|     <newline>, or a totally empty file, to be called a text
|     file. If a file ends with an incomplete line it is not
|     strictly a text file by this definition. [...]
|
| [0]
| https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...
|
| [1]
| https://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd...
| croes wrote:
| Not everything uses POSIX maybe that's a reason for the
| different results.
| perlgeek wrote:
| Raku (formerly Perl 6) has picked ^ and $ for start-of-string and
| end-of-string, and has introduced ^^ and $$ for start-of-line and
| end-of-line. No multi line mode is available or necessary.
| (There's also \h for horizontal and \v for vertical whitespace)
|
| That's one of the benefits of a complete rethink/rewrite, you can
| learn from the fact that the old behavior surprised people.
| Terretta wrote:
| And this is why this curmudgeon can't use Perl 6[^1]. It
| randomly shuffles the line noise we learned over decades.
|
| It seems so obvious that's the opposite of what they should
| have defaulted to, that it clearly should have been ^ and $ for
| lines, and ^^ and $$ for the string, since like ((1)(2)(3)):
|
| ^^line1$\n^line2$\n^line3$\n$
|
| [1]: That, and it's not anywhere, while Perl 5 is everywhere.
| richardwhiuk wrote:
| Think I would have picked exactly the reverse (i.e. ^^ being
| more "starty" than "^").
| lcnPylGDnU4H9OF wrote:
| Reminds me of verbosity flags in some cli utilities. Often,
| -v is "verbose" and -vv is "very verbose" and -vvv... etc.
| wodenokoto wrote:
| > So if you're trying to match a string without a newline at the
| end, you can't only use $ in Python! My expectation was having
| multiline mode disabled wouldn't have had this newline-matching
| behavior, but that isn't the case.
|
| A reproducible example would be nice. I don't understand what it
| is he cannot do. `re.search('$', 'no new lines')` returns a
| match.
| iainmerrick wrote:
| This unexpectedly matches:
|
| re.match('^bob$', 'bob\n')
|
| I didn't want the trailing newline to be included.
| wodenokoto wrote:
| But that string does have a new line at the end.
| iainmerrick wrote:
| re.match('^bob$', 'bob') - yes
|
| re.match('^bob$', 'bobs') - no
|
| Most people would expect 'bob\n' _not_ to match, because I
| used '$' and it has an extra character at the end, just
| like 'bobs'. In Python it does match because '\n' is a
| special case.
| rerdavies wrote:
| ... for some arbitrary definition of "most people".
| danbruc wrote:
| People are confused about strings and lines. A string is a
| sequence of characters, a line can be two different things. If
| you consider the newline a line terminator, then a line is a
| sequence of non-newline characters - possibly zero - plus a
| newline. If there is no new-line at the end, then it is not a
| [complete] line. That is what POSIX uses. If you consider the
| newline a line separator, then a line is a sequence of non-
| newline characters - possibly zero. In either case, the content
| of the line ends before the newline, either because the newline
| terminates the line or because it separates the line from the
| next. [1]
|
| The semantics of ^ and $ is based on lines - whether single-line
| or multi-line mode. For string based semantics - which you could
| also think of as entire file if you are dealing with files - use
| \A and \Z or their equivalents.
|
| [1] Both interpretations have their merits. If you transmit text
| over a serial connection, it is useful to have a newline as line
| terminator so that you know when you received a complete line. If
| you put text into text files, it might arguably be easier to look
| at a newline as a line separator because then you cannot have an
| invalid last line. On the other hand having line terminators in
| text files allows you to detect incompletely written lines.
| Existing4190 wrote:
| perlre Metacharacters documentation states: $ Match the end of
| the string (or before newline at the end of the string; or before
| any newline if /m is used)
|
| (/m enables multiline mode)
| mdavid626 wrote:
| Is this a bug?
| humanlity wrote:
| Interesting
| m0rissette wrote:
| Why isn't Perl anywhere on that chart when mentioning regex?
| burntsushi wrote:
| Because they're using regex101 to easily test the semantics of
| different regex engines and Perl isn't available on regex101.
| PCRE is though, which is a decent approximation. And indeed,
| Perl and PCRE behave the same for this particular case.
| account42 wrote:
| Why isn't Perl available on regex101 when it's all about
| regex?
| burntsushi wrote:
| I dunno. Maybe because nobody has contributed it? Maybe
| because Perl isn't as widely used as it once was? Maybe
| because it's hard to compile Perl to WASM? Maybe some other
| reason?
| tyingq wrote:
| Seems odd to leave Perl off the list, given it's regex related.
|
| Here's the explanation for $ in the perlre docs:
| $ Match the end of the string (or
| before newline at the end of the string; or
| before any newline if /m is used)
| toyg wrote:
| Yeah, omitting what is arguably the language most associated
| with regexes seems a bit of an oversight. I guess it shows how
| far off the radar Perl currently is.
| demondemidi wrote:
| Perl perfected the simplicity and flexibility of regex syntax
| from POSIX and it seems every other language after has just
| made it harder.
| TillE wrote:
| PHP uses PCRE, so it more or less serves as a stand-in for
| Perl in this case.
| homakov wrote:
| This led to a few serious bugs in Ruby-based apps. Always use
| \A\z
|
| https://homakov.blogspot.com/2012/05/saferweb-injects-in-var...
|
| https://sakurity.com/blog/2015/02/28/openuri.html
|
| https://sakurity.com/blog/2015/06/04/mongo_ruby_regexp.html
| Scubabear68 wrote:
| In 30 years of developing software I don't think I ever used
| multi-line regexp even once.
| thrdbndndn wrote:
| Definitely not common, but if you are parsing a text file
| you're going to use it a lot (say, you're writing a JS parser).
| marcosdumay wrote:
| You really shouldn't use a lot of regexes for parsing code.
|
| They go only on the tokenizer, if they go somewhere at all.
| thrdbndndn wrote:
| Agreed, it's more about quick and dirty ad hoc capture than
| full-fledged parser though (like when you want to extract
| certain object when scraping).
| Terretta wrote:
| > _In 30 years of developing software I don't think I ever used
| multi-line regexp even once._
|
| As long as sharing anecdata, in 30 years, it's almost the only
| way I use it.
|
| It's incredible for slicing and dicing repetitious text into
| structure. You generally want some sort of Practical Extraction
| and Reporting Language, the core of which is something like a
| regular expression, generally able to handle the, well,
| _irregularity_.
|
| Most recent example (I did this last week) was extracting
| Apple's app store purchases from an OCR of the purchase history
| available through Apple's Music app's Account page that lets
| you see all purchases across all digital offerings, but only as
| a long scrolling dialog box (reading that dialog's contents
| through accessibility hooks only retrieves the first few pages,
| unfortunately).
|
| Each purchase contains one or more items and each item has one
| or more vertical lines, and if logos contain text they add
| arbitrary lines per logo.
|
| A good match and sub match multi-line regex folds that mess
| back into a CSV. In this case, the regex for this was less than
| an 80 char line of code and worked in the find replace of
| Sublime Text which has multiline matching, subgroups, and back
| references.
|
| Another way to do this is something like a state match/case
| machine, but why write a program when you can just write a
| regular expression?
| nebulous1 wrote:
| The fact that there are so many different peculiarities in
| different regex systems has always raised the hairs on the back
| of my neck. As in when a tool accepts a regex and I have to
| trawl the manual to find out exactly what regex is acceptable to
| it.
| silent_cal wrote:
| I think there's a big opportunity to re-write Regex as a SQL-type
| language. It's too bad I don't feel like trying.
| nunez wrote:
| You can also use (?m) to enable multiline processing on PCRE-
| compatible regexp engines.
| raldi wrote:
| Cmd-F perl
|
| _no matches_
| weinzierl wrote:
| The table in the article makes this look complicated, but it
| really isn't. All the cases in the article can be grouped into
| two families:
|
| - The JS/Go/Rust family, which treats $ like \z and does not
| support \Z at all
|
| - The Java, .NET, PHP, Python family, which treats $ like \Z and
| may or may not (Python) support \z.
|
| \Z does away with \n before the end of the string, while \z
| treats \n as a regular character. For multiline $ the distinction
| doesn't matter, because \n _is_ the end.
|
| Really the only deviation from the rule is Python's \Z, which is
| indeed weird.
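|
| To illustrate the Python oddity, here is a sketch assuming
| CPython's `re` module. The `OTHER_Z` lookahead is my own rough
| emulation of what \Z means in the second family, handling only
| "\n" as the terminator:
|
```python
import re

# Python's \Z matches only at the very end of the string, which is
# what the other engines spell \z.
py_exact = re.search(r"cat\Z", "cat")
py_newline = re.search(r"cat\Z", "cat\n")

# Rough emulation of the "\Z allows one final newline" meaning.
OTHER_Z = r"(?=\n?\Z)"
emu_exact = re.search(r"cat" + OTHER_Z, "cat")
emu_newline = re.search(r"cat" + OTHER_Z, "cat\n")
emu_two = re.search(r"cat" + OTHER_Z, "cat\n\n")

print(bool(py_exact), bool(py_newline))
print(bool(emu_exact), bool(emu_newline), bool(emu_two))
```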
| gorjusborg wrote:
| If you really want to learn regex, you'll have a hard time
| piecing it all together via blog posts.
|
| Jeffrey Friedl's Mastering Regular Expressions is a good book to
| read if you want to stop being surprised/lost.
|
| I'll admit I stopped at the dive into DFA/NFA engine details.
| jewel wrote:
| This has security implications! Example exploitable ruby code:
|
|     unless person_id =~ /^\d+$/
|       abort "Bad person ID"
|     end
|     sql = "select * from people where person_id = #{person_id}"
|
| In addition to injection attacks, this also can bite people when
| parsing headers, where a bad header is allowed to sneak past a
| filter.
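|
| For comparison, a Python sketch of the same kind of check. In
| Ruby, ^ and $ are line anchors even without flags, so
| `re.MULTILINE` is used here to approximate it; all three
| function names are hypothetical:
|
```python
import re

def loose_check(value):
    # Default Python "$": also accepts one trailing "\n".
    return re.search(r"^\d+$", value) is not None

def ruby_like_check(value):
    # re.MULTILINE approximates Ruby's always-on line anchors.
    return re.search(r"^\d+$", value, re.MULTILINE) is not None

def strict_check(value):
    # fullmatch() must consume the whole string; [0-9] also rules
    # out non-ASCII digits that \d would accept.
    return re.fullmatch(r"[0-9]+", value) is not None

samples = ["25", "25\n", "25\n; delete from people", "a\n25\n"]
results = {s: (loose_check(s), ruby_like_check(s), strict_check(s))
           for s in samples}

for sample, flags in results.items():
    print(repr(sample), flags)
```
|
| Only the strict check rejects every input except the plain
| digits.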
| jfhufl wrote:
| Unsure what you mean?
|
|     $ ruby -e 'x = "25" ; if x =~ /^\d+$/ ; puts "yes" ; else ; puts "no" ; end'
|     yes
|     $ ruby -e 'x = "25\n" ; if x =~ /^\d+$/ ; puts "yes" ; else ; puts "no" ; end'
|     yes
|     $ ruby -e 'x = "a25\n" ; if x =~ /^\d+$/ ; puts "yes" ; else ; puts "no" ; end'
|     no
|
| Also, you'd want to use something that parameterizes the query
| with '?' (I use the Sequel gem) instead of just stuffing it
| into a sql string.
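|
| A minimal sketch of the parameterized approach, using Python's
| built-in `sqlite3` module rather than the Sequel gem. The
| table layout and the `find_person` helper are invented for the
| example:
|
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table people (person_id text primary key, name text)")
conn.execute("insert into people values ('25', 'Ada')")

def find_person(connection, person_id):
    # The "?" placeholder passes the value separately from the SQL
    # text, so it is never parsed as SQL.
    cursor = connection.execute(
        "select name from people where person_id = ?", (person_id,)
    )
    return cursor.fetchall()

rows_ok = find_person(conn, "25")
rows_hostile = find_person(conn, "25\n; delete from people")
remaining = conn.execute("select count(*) from people").fetchone()[0]

print(rows_ok, rows_hostile, remaining)
```
|
| The hostile value is simply compared as data and finds
| nothing; the table is left intact.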
| jfhufl wrote:
| Well, learned something today after reading a bit further in
| the thread:
|
|     ruby -e 'x = "a\n25\n" ; if x =~ /^\d+$/ ; puts "yes" ; else ; puts "no" ; end'
|     yes
|
| Good to know.
| halostatue wrote:
| You need to make your regex multi-line (`/^\d+$/m`), but that
| isn't the problem shown. Your query will be searching for
| `25\n`, not `25` _despite_ your pre-check that it's a good
| value.
|
| The second line _should always be no_ , which if you use
| `\A\d+\z`, it will be.
| jfhufl wrote:
| Yep, makes sense, thanks!
| dr-smooth wrote:
|     $ ruby -e 'x = "25\n; delete from people" ; if x =~ /^\d+$/ ; puts "yes" ; else ; puts "no" ; end'
|     yes
| mnau wrote:
| Practical Gitlab RCE that involved end of line regex in
| ExifTools:
|
| https://devcraft.io/2021/05/04/exiftool-arbitrary-code-execu...
| SAI_Peregrinus wrote:
| POSIX regexes and Python regexes are different. In general, you
| need to reference the regex documentation for _your
| implementation_ , since the syntax is not universal.
|
| Per POSIX chapter 9[1]:
|
| 9.2 ... "The use of regular expressions is generally associated
| with text processing. REs (BREs and EREs) operate on text
| strings; that is, zero or more characters followed by an end-of-
| string delimiter (typically NUL). Some utilities employing
| regular expressions limit the processing to lines; that is, zero
| or more characters followed by a <newline>."
|
| and 9.3.8 ... "A <dollar-sign> ( '$' ) shall be an anchor when
| used as the last character of an entire BRE. The implementation
| may treat a <dollar-sign> as an anchor when used as the last
| character of a subexpression. The <dollar-sign> shall anchor the
| expression (or optionally subexpression) to the end of the string
| being matched; the <dollar-sign> can be said to match the end-of-
| string following the last character."
|
| combine to mean that $ may match the end of string OR the end of
| the line, and it's up to the utility (or mode) to define which.
| Most of the common utilities (grep, sed, awk, Python, etc) treat
| it as end of line by default, since they operate on lines by
| default.
|
| THERE IS NO SINGLE UNIVERSAL REGULAR EXPRESSION SYNTAX. You
| cannot reliably read or write regular expressions without knowing
| which language & options are being used.
|
| [1]
| https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...
| javier_e06 wrote:
| I would hold a code review hostage if any file does not end with
| an empty new line.
|
| My reasoning is that if a file is transmitted and gets truncated,
| nobody would know for sure if it does not end with a newline.
| Brownie points if the code ends with a comment saying that the
| file ends there.
|
| The article calls computer languages platforms, but they are
| computer languages. Bash is not included. Weird. I believe the
| most common use of regular expressions is the use of grep or
| egrep with bash or some other shell but, who knows. Maybe I am
| hanging with the wrong crowd.
| vitiral wrote:
| In Lua it's only the start/end of the string
|
| > A pattern is a sequence of pattern items. A caret '^' at the
| beginning of a pattern anchors the match at the beginning of the
| subject string. A '$' at the end of a pattern anchors the match
| at the end of the subject string. At other positions, '^' and '$'
| have no special meaning and represent themselves.
|
| https://www.lua.org/manual/5.3/manual.html#6.4.1
|
| Lua's pattern matching is much simpler than regexes though.
|
| > Unlike several other scripting languages, Lua does not use
| POSIX regular expressions (regexp) for pattern matching. The main
| reason for this is size: A typical implementation of POSIX regexp
| takes more than 4,000 lines of code. This is bigger than all Lua
| standard libraries together. In comparison, the implementation of
| pattern matching in Lua has less than 500 lines.
|
| https://www.lua.org/pil/20.1.html
| denzquix wrote:
| > In Lua it's only the start/end of the string
|
| There's an additional caveat: if you use the optional "init"
| parameter to specify an offset into the string to start
| matching, the ^ anchor will match _at that offset_, which may
| or may not be what you expect.
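|
| For contrast, a small Python sketch of how the `pos` argument of a
| compiled pattern treats `^`. Per the `re` documentation, `^` still
| refers to the real start of the string (or a line start in
| multiline mode), not to the offset. The sample string and offset
| are made up for illustration.

```python
import re

text = "key=value"
offset = 4  # index where "value" starts

caret = re.compile(r"^value")
plain = re.compile(r"value")

# ^ is tied to the real start of the string, so searching from an
# offset does not let it match there.
print(caret.search(text, offset))            # None

# match() with a position anchors the attempt at that position.
print(plain.match(text, offset).group())     # value

# Slicing creates a new string, so ^ matches at its start.
print(caret.search(text[offset:]).group())   # value
```

| So the Lua behaviour described above corresponds more closely to
| `match(text, pos)` or to slicing than to `^` combined with `pos`.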
| vitiral wrote:
| That is a good point, and something I've actually
| (personally) used quite a bit when writing parsers
| cpeterso wrote:
| $ is the regex's "the buck stops here" symbol. Here at the end of
| the line. :)
| nurtbo wrote:
| Totally get the desire, but also feels like last two paragraphs
| are solvable with
|
| ``` re.match(text).extract().rstrip("\n") ```
| menacingly wrote:
| Of course it's line. How could it be the end of the string when
| the matter at hand is defining the string?
| pksebben wrote:
| Regex would really benefit from a comprehensive industrial
| standard. It's such a powerful tool that you have to keep
| relearning whenever you switch contexts.
| aftbit wrote:
| Wait, in non-multiline mode, it only matches _one_ trailing
| newline? And not any other whitespace, including \r or \r\n? That
| is indeed surprising behavior. Why? Why not just make it end of
| string like the author expected?
|
|     >>> import re
|     >>> bool(re.search('abc$', 'abc'))
|     True
|     >>> bool(re.search('abc$', 'abc\n'))
|     True
|     >>> bool(re.search('abc$', 'abc\n\n'))
|     False
|     >>> bool(re.search('abc$', 'abc '))
|     False
|     >>> bool(re.search('abc$', 'abc\t'))
|     False
|     >>> bool(re.search('abc$', 'abc\r'))
|     False
|     >>> bool(re.search('abc$', 'abc\r\n'))
|     False
| mmh0000 wrote:
| > So if you're trying to match a string without a newline at the
| end, you can't only use $ in Python! My expectation was having
| multiline mode disabled wouldn't have had this newline-matching
| behavior, but that isn't the case.
|
| I would argue this is correct behavior, a "line" isn't a "line"
| if it doesn't end with \n.[1]
|
| > 3.206 Line - A sequence of zero or more non-<newline> characters
| plus a terminating <newline> character.
|
| [1]
| https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...
| librasteve wrote:
| I am surprised that the OP does not include perl5 in their table.
|
| In Raku (aka Perl 6), regexes were reinvented by Larry Wall (the
| creator of Perl, which made perlre the de facto regex standard).
|
| Here's what he does with $:
|
| (https://docs.raku.org/language/regexes#Start_of_string_and_e...)
|
| * The $ anchor only matches at the end of the string
|
| * The $$ anchor matches at the end of a logical line. That is,
| before a newline character, or at the end of the string when the
| last character is not a newline character.
| ary wrote:
| Was any regex documentation unclear on this? Some libraries have
| modes that change the semantics of ^ and $ but I've always found
| their use to be rather clear. It's the grouping and look
| ahead/behind modifiers that I've always found hard to understand
| (at times).
| pmarreck wrote:
| The results did not surprise me. The fact that everyone is in
| agreement that "cat$" matches "cat" and not "cat\n" if multiline
| is off did not surprise me. \n is implicitly a multiline-
| contextual character to me. In other words, if you didn't have
| any \n, you'd just have an array of lines (without linefeeds),
| same as if you were reading lines from a file one at a time or
| splitting a binary on \n.
|
| The other results that differ across engines seem to be because
| people either don't understand regex or because the POSIX
| description of how to deal with such an input and config was ill-
| defined.
| 1letterunixname wrote:
| Ugh. Whenever I hear people talk about regular expressions as a
| singular language or standard, I die a little inside.
|
| PSA: Regex security is particular to each implementation flavor.
| Please know the nuances of a particular kind and be unambiguously
| precise.
| callwhendone wrote:
| it's end of line right?
| smlacy wrote:
| It's easy to get the canonical answer:
|
|     $ man pcre2syntax
|
| Where you'll find the following block under ANCHORS AND SIMPLE
| ASSERTIONS:
|
|     $           end of subject
|                   also before newline at end of subject
|                   also before internal newline in multiline mode
|
| So all the cases of "newline at/before end of subject" are
| covered here. Then, the question becomes "what is a subject?" Is
| it line-by-line? Are newlines included? What if we want multiline
| matching? That's where re.MULTILINE comes from: it's not
| "multiline matching" (sort of), it's "what is the subject of the
| regular expression that we're matching against".
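|
| Python's `re` documents the same three cases for `$`, and they can
| be observed directly by listing every index where a bare `$`
| matches. This is a small sketch; `dollar_positions` is a
| hypothetical helper written for this example, not a library
| function.

```python
import re


def dollar_positions(text, flags=0):
    """Return every index at which a bare `$` matches in `text`."""
    return [m.start() for m in re.finditer(r"$", text, flags)]


# End of subject.
print(dollar_positions("cat"))                        # [3]

# Also before a newline at the end of the subject.
print(dollar_positions("cat\n"))                      # [3, 4]

# Internal newlines are ignored without MULTILINE...
print(dollar_positions("cat\ndog\n"))                 # [7, 8]

# ...and matched before when MULTILINE is set.
print(dollar_positions("cat\ndog\n", re.MULTILINE))   # [3, 7, 8]
```

| Swapping `$` for `\Z` in the helper leaves only the final index in
| each case, which is the "end of string" behaviour the article's
| author expected.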
___________________________________________________________________
(page generated 2024-03-20 23:01 UTC)