[HN Gopher] The dangers of single line regular expressions
       ___________________________________________________________________
        
       The dangers of single line regular expressions
        
       Author : thunderbong
       Score  : 67 points
       Date   : 2024-04-22 17:28 UTC (5 hours ago)
        
 (HTM) web link (greg.molnar.io)
 (TXT) w3m dump (greg.molnar.io)
        
       | jlv2 wrote:
       | More like "the danger of thinking you can trivially validate
       | user-supplied input" before evaluating the string.
        
         | cratermoon wrote:
         | Even non-trivially validating it can go wrong. See Log4Shell,
         | e.g.
         | 
         | The bigger problem here is executing user input.
        
       | wodenokoto wrote:
       | I think it is a surprise that a partial match returns true.
       | 
       | But I guess this is why Python has so many ways of matching a
       | pattern against a string (match, find, findall, I think - they
       | are hard to remember)
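
For reference, a short sketch of how the entry points in Python's `re` module differ in anchoring. Note that there is no `re.find`; the scanning function is `re.search`. The helper name `anchoring_demo` and the sample strings are made up for illustration.

```python
import re

PATTERN = r"[a-z]+"

def anchoring_demo(text):
    """Report which of the three entry points accept `text`."""
    return {
        # re.match is anchored at the start of the string only
        "match": re.match(PATTERN, text) is not None,
        # re.search accepts a match anywhere in the string
        "search": re.search(PATTERN, text) is not None,
        # re.fullmatch requires the pattern to cover the whole string
        "fullmatch": re.fullmatch(PATTERN, text) is not None,
    }

for sample in ("abc", "abc123", "123abc"):
    print(sample, anchoring_demo(sample))
```

For validation, `re.fullmatch` is usually the one that expresses the intent: a partial match is not enough.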
        
         | roywiggins wrote:
         | I have to look it up _every time_.
        
       | sfink wrote:
       | Alternatively, don't validate and then use the original. Instead,
       | pull out the acceptable input and use that.
       | 
       | Even better, compare that to the original and fail validation if
       | they're not identical, but that requires maintaining a higher
       | level of paranoia than may be reasonable to expect.
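
A minimal Python sketch of this "extract, then compare" idea. The function name `extract_safe` and the allowed character set are assumptions chosen for illustration, not taken from the article.

```python
import re

# Assumed allowlist: lowercase letters, digits and spaces
ALLOWED = re.compile(r"[a-z0-9 ]+")

def extract_safe(raw):
    """Return the extracted text, or None unless it equals the whole input."""
    m = ALLOWED.match(raw)
    if m is None:
        return None
    cleaned = m.group(0)  # use what the pattern matched, not the original
    # Paranoid step: fail validation if the extracted text differs from the input
    return cleaned if cleaned == raw else None

print(extract_safe("glow with the flow"))
print(extract_safe("hello\n<%= File.open('flag.txt').read %>"))
```

Because the pattern has no `^` or `$`, the result does not depend on how a given engine interprets those anchors.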
        
         | Akronymus wrote:
         | > Alternatively, don't validate and then use the original.
         | Instead, pull out the acceptable input and use that.
         | 
         | Parse, don't validate:
         | https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...
        
           | sfink wrote:
           | Heh. I wrote up my comment, and then thought "hey, I bet
           | that's what that 'Parse don't validate' article meant, the
           | one I never quite got around to reading." So I pulled it up
           | -- great article! -- but then didn't post the link because it
           | uses the type system to record the results of the parse.
           | Whereas here, you'd probably parse from a string into another
           | string.
           | 
           | But philosophically I agree, that's exactly the relevant
           | advice.
        
             | Akronymus wrote:
             | parsing from a string to a string runs the risk of
             | erroneously assigning the original value to the new string.
             | Which kinda defeats the whole parsing, not validating.
             | 
             | What would work is having a small object holding a readonly
             | string which parses the original on creation, then becomes
             | immutable.
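
One way such a small immutable holder could look in Python, as a sketch only. The class name `SafeTitle` and the allowed character set are hypothetical; the check runs once at construction and the frozen dataclass rejects later reassignment.

```python
import re
from dataclasses import dataclass

# Assumed allowlist: lowercase letters, digits and spaces
_SAFE = re.compile(r"[a-z0-9 ]+")

@dataclass(frozen=True)
class SafeTitle:
    """Holds a string that was checked once, at construction time."""
    value: str

    def __post_init__(self):
        # fullmatch must cover the entire string, including any trailing newline
        if _SAFE.fullmatch(self.value) is None:
            raise ValueError("title contains characters outside [a-z0-9 ]")

title = SafeTitle("glow with the flow")
print(title.value)
```

Code that accepts a `SafeTitle` rather than a plain `str` cannot be handed the unchecked original by accident, which is the risk described above.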
        
       | neilk wrote:
       | In my experience `$` does reliably mean end of string for regular
       | expressions, unless you specifically ask for "multiline" mode.
       | 
       | Ruby seems to be in multiline mode all the time?
       | 
       |     $ python -c 'import re; print "yes" if re.match(r"^[a-z ]+$", "foobar") else "no"'
       |     yes
       |     $ python -c 'import re; print "yes" if re.match(r"^[a-z ]+$", "foo\nbar") else "no"'
       |     no
       |     $ python -c 'import re; print "yes" if re.match(r"^[a-z ]+$", "foo\nbar", re.M) else "no"'
       |     yes
       | 
       |     $ perl -le 'print "foobar" =~ /^[a-z ]+$/ ? "yes" : "no"'
       |     yes
       |     $ perl -le 'print "foo\nbar" =~ /^[a-z ]+$/ ? "yes" : "no"'
       |     no
       |     $ perl -le 'print "foo\nbar" =~ /^[a-z ]+$/m ? "yes" : "no"'
       |     yes
       | 
       |     $ node -e 'console.log(/^[a-z ]+$/.test("foobar") ? "yes" : "no")'
       |     yes
       |     $ node -e 'console.log(/^[a-z ]+$/.test("foo\nbar") ? "yes" : "no")'
       |     no
       |     $ node -e 'console.log(/^[a-z ]+$/m.test("foo\nbar") ? "yes" : "no")'
       |     yes
       | 
       |     $ ruby -e 'if "foobar" =~ /^[0-9a-z ]+$/i then puts "yes" else puts "no" end'
       |     yes
       |     $ ruby -e 'if "foo\nbar" =~ /^[0-9a-z ]+$/i then puts "yes" else puts "no" end'
       |     yes
       | 
       | EDIT: this is documented behavior for Ruby. What other languages
       | call multiline mode is the default; you're supposed to use \A and
       | \Z instead. They do have an `/m` but it only affects the
       | interpretation of `.`
       | 
       | https://docs.ruby-lang.org/en/master/Regexp.html#class-Regex...
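
As an analogy only (this is Python, not Ruby's engine), the line-anchored behavior can be reproduced with `re.MULTILINE`. The payload is the decoded form of the URL-encoded title quoted elsewhere in this thread; the variable names are made up for illustration.

```python
import re

payload = "hello\n<%= File.open('flag.txt').read %>"

# With re.MULTILINE, ^ and $ anchor to individual lines, so the first line
# alone is enough to satisfy the check
line_anchored = re.search(r"^[a-z0-9 ]+$", payload, re.MULTILINE)

# \A and \Z anchor to the whole string (in Python, \Z is the strict end of
# the string, with no allowance for a trailing newline)
string_anchored = re.search(r"\A[a-z0-9 ]+\Z", payload)

print(bool(line_anchored), bool(string_anchored))  # True False
```

The first check passes even though the second line contains template code, which is the failure mode the article describes.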
        
         | interroboink wrote:
         | Yeah, my takeaway from this was more "the dangers of Ruby"
         | rather than "the dangers of single line regular expressions" (:
         | 
         | I think the simplest fix would be to use "\Z" rather than "$",
         | which means "match end of input" rather than "end of line."
         | This is also Perl-compatible. So weird that the "$" default
         | meaning is different in Ruby.
         | 
         | I guess one could argue that Ruby's way is better since "$" has
         | a fixed meaning, rather than being context-dependent.
         | 
         | > Ruby seems to be in multiline mode all the time?
         | 
         | Ruby does have a "/m" for multiline mode, but it just makes "."
         | match newline, rather than changing the meaning of "$", it
         | seems.
         | 
         | [1] https://ruby-doc.org/3.2.2/Regexp.html#class-Regexp-label-An...
         | 
         | [2] https://perldoc.perl.org/perlre#Metacharacters
        
           | neilk wrote:
           | looks like we both updated our answers as we looked up the
           | docs :)
        
           | Borg3 wrote:
           | In the case of Ruby, the best approach would be to actually use the
           | result of the match for further computation, like this:
           | 
           |     m = /^[a-z0-9 ]+$/.match(str)
           |     return "Bad Input" unless m
           |     str = m[0]  # continue with only the matched text
        
         | js2 wrote:
         | The potential trouble with $ (even in single-line mode) is that
         | it matches the end of a string BOTH _with_ AND _without_ a
         | newline at the end. If you're using it to ensure the string
         | has no newline before doing something with it, this can lead to
         | trouble.
         | 
         |     $ python3 -c 'import re; print("yes" if re.search(r"^foo$", "foo") else "no")'
         |     yes
         |     $ python3 -c 'import re; print("yes" if re.search(r"^foo$", "foo\n") else "no")'
         |     yes
         |     $ python3 -c 'import re; print("yes" if re.search(r"\Afoo\Z", "foo") else "no")'
         |     yes
         |     $ python3 -c 'import re; print("yes" if re.search(r"\Afoo\Z", "foo\n") else "no")'
         |     no
         | 
         | Even if the newline is not problematic, using \A and \Z makes
         | your intentions clearer to the reader, especially if you add
         | re.X and place comments into the pattern.
         | 
         | Asides:
         | 
         | 1. Based on syntax, you appear to be testing with python2.
         | 
         | 2. With python, re.match is implicitly anchored to the start,
         | so the ^ is redundant. Use re.search or omit the ^.
        
           | medstrom wrote:
           | Correct me if I'm wrong, but if you extract a capture group
           | (^foo$), you would get "foo" without the "\n", right?
           | 
           | If so, it is not "matching the end of a string" at all. Just
           | end of line. That's exactly as expected in single-line mode,
           | so it's good. May mismatch your expectations in multi-line
           | mode though.
        
         | dwheeler wrote:
         | False. "$" does NOT mean end-of-string in Perl, Python, PHP,
         | Ruby, Java, or .NET. In particular, a trailing newline (at
         | least) is accepted in those languages.
         | 
         | A $ does mean end-of-string in Javascript, POSIX, Rust (if
         | using its usual package), and Go.
         | 
         | I'm working with the OpenSSF best practices working group to
         | create some guidance on this stuff. It's a very common
         | misconception. Stay tuned.
         | 
         | If anyone knows of vulnerabilities caused by this, let me know.
        
           | sjrd wrote:
           | `$` _does_ mean end of input in Java, unless you explicitly
           | ask for multiline mode. In the latter case it means
           | `(?=$|\n)` if also in Unix-lines mode, and the horrible
           | `(?=$|(?<!\r)\n|[\r\u0085\u2028\u2029])` otherwise.
           | 
           | I wrote a compiler from Java regex to JavaScript RegExp, in
           | which you'll find that particular compilation scheme [1].
           | 
           | Edit: also quoting from [2]:
           | 
           | > By default, the regular expressions ^ and $ ignore line
           | terminators and only match at the beginning and the end,
           | respectively, of the entire input sequence. If MULTILINE mode
           | is activated then ^ matches at the beginning of input and
           | after any line terminator except at the end of input. When in
           | MULTILINE mode $ matches just before a line terminator or the
           | end of the input sequence.
           | 
           | [1] https://github.com/scala-js/scala-js/blob/eb160f1ef113794999...
           | 
           | [2] https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pa...
        
             | sjrd wrote:
             | OK it seems they changed the doc since. In the docs for JDK
             | 21 we read instead [1]:
             | 
             | > If MULTILINE mode is not activated, the regular
             | expression ^ ignores line terminators and only matches at
             | the beginning of the entire input sequence. The regular
             | expression $ matches at the end of the entire input
             | sequence, but also matches just before the last line
             | terminator if this is not followed by any other input
             | character. Other line terminators are ignored, including
             | the last one if it is followed by other input characters.
             | 
             | Looks like I have some code to fix.
             | 
             | [1] https://docs.oracle.com/en/java/javase/21/docs/api...
        
           | phyzome wrote:
           | Interesting that a trailing newline is accepted. Not as bad
           | as what's in the post, at least. Definitely worth breaking
           | out which languages do which of those, though! Python, for
           | instance, only accepts a trailing newline but not additional
           | chars beyond that.
           | 
           | I don't think Java should be in your first list, though?
           | Pattern.matches("^foo$", "foo\n") returns false.
        
         | brobinson wrote:
         | Note that Ruby also has \z which is what you generally want
         | instead of \Z.
         | 
         | (\Z allows a trailing newline, \z does not)
        
       | wrsh07 wrote:
       | This was interesting and new to me, but as other commenters
       | indicate, part of the problem is that we're trying to find the
       | bad thing rather than trying to verify it is the good thing
       | 
       | There's a related concept of "failing open vs failing closed"
       | (fail open: fire exit, fail closed: ranch gate)
       | 
       | In Jurassic park (amazing book/film to understand system
       | failures), when the power goes out, the fence is functionally an
       | open gate
       | 
       | In this case, we shouldn't assume that we can enumerate all
       | possible bad strings (even with a regex)
        
         | tsimionescu wrote:
         | I don't think this is a good example, because the regex does
         | just that: it doesn't try to filter out bad input, it
         | specifically only accepts known good input. If the regex did
         | what it was meant to do, only allowing strings composed of
         | ascii letters and numbers, and space, then the code would not have
         | been exploitable.
        
           | floxy wrote:
           | Still seems like that is broken. Shouldn't they be escaping
           | whatever control characters? Like if your user wanted to
           | highlight "Now 75% off". Seems like it is reasonable to want
           | to allow that.
        
             | tsimionescu wrote:
             | That's a completely different problem: it may be too
             | closed. But it's definitely not a fail open system. It's a
             | fail closed system with a bug.
        
               | floxy wrote:
               | Yeah, but the real bug is trying to roll your own instead of
               | using an `ERB.escape_tainted_input` method or somesuch.
               | Either that method doesn't exist, which seems like a major
               | misfeature, or the author didn't know about it, or didn't
               | want to use it.
        
               | wrsh07 wrote:
               | Whoops you are absolutely right!! Good point, I totally
               | misread the if/else.
        
       | cedws wrote:
       | I pretty much always consider regex expressions as the wrong
       | solution. They're notoriously hard to get right.
       | 
       | There's a whole lot of faulty expressions out there for
       | validating email addresses. I prefer to do _less_ validation and
       | let it fail. If the email address is wrong, whatever service you're
       | using for sending emails will just reject it. If you really
       | do need to validate email addresses, use something somebody else
       | wrote that does it properly.
       | 
       | If you're working with some exotic format for which there isn't
       | already an open source library, do what this guy says: parse it,
       | don't try to validate it with regex:
       | https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...
        
         | htek wrote:
         | Here, you have good advice: "I ... consider regex expressions
         | as the wrong solution. They're notoriously hard to get right."
         | 
         | However, the conclusion of "use something somebody else wrote
         | that does it properly", while valid, is asking a lot. As regex
         | is hard to get right, don't assume the code you find on the web
         | or book or via some other means works correctly.
         | 
         | My rule is if I didn't write it and can't wrap my head around
         | the code to convince myself it is the right solution, I don't
         | use it. And as I think others have written, there are some
         | interactive online tests for regex expressions that can help.
        
           | cedws wrote:
           | I think the average developer has a better chance of finding
           | a robust, battle tested library to do what they need than
           | cooking up some regex of their own. Preferably, the library
           | does not use regex at all and checks data more intelligently.
        
         | scarmig wrote:
         | Sometimes valid email addresses will be rejected as invalid,
         | and sometimes invalid email addresses are still successfully
         | delivered. Validation guarantees nothing, and at most it should
         | be a UI cue.
        
         | tsimionescu wrote:
         | Regex works very well for what it was originally designed:
         | describing/validating regular languages. It can work ok if your
         | language is simple and _almost_ regular. It works very badly
         | for validating non-regular languages, even when extensions are
         | added Perl-style to support that. And, unfortunately, most
         | structured formats you might care to validate are in fact not
         | regular languages at all.
         | 
         | Email addresses in particular are surprisingly complicated and
         | far from being regular languages. I don't know how commonly
         | real servers support the full feature set, but even if they
         | just support non-ascii names they quickly become a pain.
        
       | librasteve wrote:
       | Raku (perl6) was a chance for Larry Wall to fix some of the
       | limitations of the perl regex syntax. As you would expect from
       | the perl heritage, it behaves similarly.
       | 
       |     ~ > raku -e 'say "foobar"   ~~ /^ <[a..z ]> +$/ ?? "yes" !! "no"'
       |     yes
       |     ~ > raku -e 'say "foo\nbar" ~~ /^ <[a..z ]> +$/ ?? "yes" !! "no"'
       |     no
       |     ~ > raku -e 'say "foo\nbar" ~~ /^^<[a..z ]>+$$/ ?? "yes" !! "no"'
       |     yes
       | 
       | - ^^ and $$ are the raku flavour of multiline mode
       | 
       | - ~~ the smartmatch operator binds the regex to the matchee and
       | much more
       | 
       | - character classes are now <[...]> (plain [...] does what (...)
       | does in math)
       | 
       | - perl's triadic x ? y : z becomes x ?? y !! z
       | 
       | We can have whitespace in our regexen now (and comments and
       | multiline regexen)
       | 
       |     my $regex = rx/ \d ** 4            #`(match the year YYYY)
       |                     '-'
       |                     \d ** 2            # ...the month MM
       |                     '-'
       |                     \d ** 2 /;         # ...and the day DD
       | 
       |     say '2015-12-25'.match($regex);    # OUTPUT: <<[2015-12-25]>>
        
         | btilly wrote:
         | Perl has supported whitespace and comments in regular
         | expressions since approximately forever. Just use the /x
         | modifier. All that Raku did was make that flag a default.
         | 
         | The same thing is available in many other languages. They
         | copied it when they copied from Perl. For example Python's
         | https://docs.python.org/3/library/re.html#flags documents that
         | re.X, also called re.VERBOSE, does the same exact thing.
         | 
         | The fact that people don't use it is because few people care to
         | learn regular expressions well enough to even know that it is
         | an option. One of my favorite examples of astounding people
         | with this was when I was writing a complex stored procedure in
         | PostgreSQL. I read
         | https://www.postgresql.org/docs/current/functions-matching.h....
         | I looked for flags. And yup, there is an x flag.
         | It turns on "extended syntax". Which does the same exact thing.
         | I needed a complex regular expression that I knew my coworkers
         | couldn't have written themselves. So I commented the heck out
         | of it. They couldn't believe that that was even a thing that
         | you could do!
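
A small Python illustration of the `re.VERBOSE` (`re.X`) flag mentioned above, mirroring the date pattern from the Raku example. The group names and the name `ISO_DATE` are assumptions added for readability.

```python
import re

# In verbose mode, unescaped whitespace in the pattern is ignored and
# "#" starts a comment that runs to the end of the line
ISO_DATE = re.compile(
    r"""
    \A
    (?P<year>\d{4})    # the year YYYY
    -
    (?P<month>\d{2})   # the month MM
    -
    (?P<day>\d{2})     # the day DD
    \Z
    """,
    re.VERBOSE,
)

m = ISO_DATE.search("2015-12-25")
print(m.group("year"), m.group("month"), m.group("day"))  # 2015 12 25
```

To match a literal space or `#` in this mode, escape it or place it inside a character class.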
        
       | ec109685 wrote:
       | Escape the output based on the context a string is being used in
       | versus trying to sanitize for all use cases on input.
       | 
       | This will guarantee that you're safe no matter how a piece of
       | content is used tomorrow (just need a new escaping function for
       | that content type), and prevent awkward things like not letting
       | users use "unsafe" strings as input. JSX and XHP are example
       | templating systems that understand context and escape
       | appropriately.
       | 
       | If a user wants their title to be "hello%0a%3C%25%3D%20File.open%
       | 28%27flag.txt%27%29.read%20%25%3E", so be it.
       | 
       | Use input validation / parsing to ensure data types aren't
       | violated, but not as an output safety mechanism.
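
A rough Python sketch of escaping per output context rather than sanitizing on input. The three standard-library functions are picked here as examples of context-specific escapers; they are not mentioned in the thread, and a real application would normally rely on its templating or query layer for this.

```python
import html
import json
import shlex

# The decoded form of the title quoted in the comment above
title = "hello\n<%= File.open('flag.txt').read %>"

# The stored value is unchanged; each output context gets its own escaping
as_html = html.escape(title)    # HTML text and quoted attribute values
as_json = json.dumps(title)     # a JSON string literal
as_shell = shlex.quote(title)   # a single POSIX shell argument

print(as_html)
print(as_json)
print(as_shell)
```

Each escaper is only valid for its own context: HTML escaping does nothing useful for a shell argument, which is why the choice is made at output time.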
        
         | Jerrrry wrote:
         | >If a user wants their title to be "hello%0a%3C%25%3D%20File.op
         | en%28%27flag.txt%27%29.read%20%25%3E", so be it.
         | 
         | that's a good way to horizontally propagate/reflect XSS and
         | other Code As Data vulnerabilities.
         | 
         | better to strip the known-bad/problematic characters
         | 
         | https://en.wikipedia.org/wiki/Code_as_data
        
           | ec109685 wrote:
           | The known problematic characters are different in json, xml,
           | css, html content, html attributes, MySQL, etc. Unless you
           | have output escaping, it is hard to ensure everything gets
           | caught, no matter how the data enters the system.
        
             | tsimionescu wrote:
             | Sure, but there is a common set of safe characters that are
             | guaranteed not to cause problems in any of these: the set
             | described by the regex [a-zA-Z0-9 -]. If you _can_ limit
             | user input to this set, you'll drastically reduce the risk
             | of code injection regardless of the stack below you.
        
           | phyzome wrote:
           | And that's how you end up pissing off users with apostrophes
           | in their names.
        
             | Terr_ wrote:
             | "Alright. If you're gonna go ahead with it, I want to make
             | sure you get one thing right. It's "O'Neill," with two L's.
             | There is another Colonel O'Neil with only one L and he has
             | no sense of humor at all."
        
         | tsimionescu wrote:
         | The output is not the problem here, it is the input. And, if
         | you can get away with it, accepting a small set of known-safe
         | characters is _much_ safer than accepting any character and
         | hoping it will be properly escaped at every level.
         | 
         | When the user hands you a string and you then pass this down to
         | other bits of code, you can't know if it will be used in an SQL
         | query, a regex, in an error message that will be rendered into
         | HTML, etc.
         | 
         | Ideally all layers of your code would handle user input with
         | the utmost care, but that is often very hard to achieve. If you
         | take user input and use it in a regex, it's easy to regex-
         | escape it, but it's much harder to remember that now this whole
         | regex is user input and can't be safely used to, say, construct
         | an SQL query. And even if you remember to properly escape it in
         | the SQL query, it may show up in the returned result, and now
         | if you display that result, you need to be careful to escape
         | _it_ before passing it to some HTML engine.
         | 
         | But then none of this works if you did intend to have some SQL
         | syntax in the regex, or some HTML snippets in the DB: you'd
         | need to make all of these technologies aware of which parts of
         | the expressions are safe and which are tainted by user input.
         | 
         | And this is all just to prevent code injection type attacks. I
         | haven't even discussed more subtle attacks, like using Unicode
         | look-alike characters to confuse other users.
        
       | ufmace wrote:
       | Seems to me this is more about the danger of passing anything
       | derived from user input into the TEMPLATE side of a templating
       | engine. Why in the world would you ever do that?!?
       | 
       | Obviously if you pass data into the variable side of the engine,
       | you hardly have to worry about it at all, since it's already
       | going into a place that was designed for handling arbitrary and
       | possibly-hostile input and been battle-tested at doing it
       | correctly in Production for many years. If you pass it into the
       | template side, you're betting that you can be as good as dozens
       | of templating engine writers working for a decade at doing that,
       | in exchange for, well, I can't really think of any possible
       | legitimate advantage for doing that.
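
The template side versus variable side distinction can be sketched in Python with `string.Template` as a stand-in. This is an assumption made for illustration: ERB can execute code while `string.Template` only substitutes names, so the consequence here is a leaked value rather than code execution. The function names are hypothetical.

```python
import html
from string import Template

# Fixed template text, written by the developer
PAGE = Template("<h1>$title</h1>")

def render(user_title):
    # User input is only ever a substituted value, never template text
    return PAGE.substitute(title=html.escape(user_title))

def render_unsafe(user_title, secret):
    # Anti-pattern: user input becomes part of the template itself,
    # so any placeholder it contains gets expanded
    return Template("<h1>" + user_title + "</h1>").substitute(secret=secret)

print(render("$secret <b>hi</b>"))          # <h1>$secret &lt;b&gt;hi&lt;/b&gt;</h1>
print(render_unsafe("$secret", "s3cr3t"))   # <h1>s3cr3t</h1>
```

In `render`, the substituted value is not scanned again for placeholders, so `$secret` stays inert text.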
        
         | klysm wrote:
         | What if you want to allow users to regex search their
         | documents?
        
           | ses1984 wrote:
           | Do it on the client side?
           | 
           | Do it in a sandbox and have aggressive timeouts.
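
A sketch of the timeout half of that advice in Python, assuming it is acceptable to start one child interpreter per search. The helper name `search_with_timeout` is hypothetical, and this only bounds run time; it is not a full sandbox (no memory or filesystem restrictions).

```python
import subprocess
import sys

# Code run in the child process: pattern and text arrive as arguments
_WORKER = (
    "import re, sys\n"
    "pattern, text = sys.argv[1], sys.argv[2]\n"
    "print('match' if re.search(pattern, text) else 'no match')\n"
)

def search_with_timeout(pattern, text, seconds=1.0):
    """Run a user-supplied regex in a child process.

    Returns True or False for a completed search, or None if the search
    timed out or the pattern was invalid.
    """
    try:
        done = subprocess.run(
            [sys.executable, "-c", _WORKER, pattern, text],
            capture_output=True,
            text=True,
            timeout=seconds,
        )
    except subprocess.TimeoutExpired:
        return None  # subprocess.run kills the child when the timeout expires
    if done.returncode != 0:
        return None  # for example, the pattern failed to compile
    return done.stdout.strip() == "match"

print(search_with_timeout("fo+", "xfoo", seconds=10.0))
# A classic catastrophic-backtracking pattern, expected to hit the timeout
print(search_with_timeout("(a+)+$", "a" * 40 + "b", seconds=1.0))
```

Starting a process per search is slow; it is used here because a regex running inside the same Python interpreter cannot be reliably interrupted from another thread.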
        
       | JonChesterfield wrote:
       | Regular expressions make me sad about our industry.
       | 
       | If you read the early papers, you get a very clear language for
       | pattern matching on sequences. They have really nice properties -
       | the compilation to finite automata gives you decidable equality
       | and decidable minimisation. As in you can compile equivalent
       | regex to exactly the same state machine however they were
       | expressed.
       | 
       | At some point perl happened and that seems to have sent us down a
       | path to encoding the regular expression in an illegible subset of
       | ascii. The backtracking implementation cost us negation and
       | intersection. What should be linear time matching becomes
       | exponential.
       | 
       | Emacs will let you write regex in s-expressions at which point
       | they're much easier to read. Everywhere else has gone with "looks
       | like Perl but has different semantics, which we kind of document,
       | be lucky".
       | 
       | I started writing tests to check that regex I'd begrudgingly
       | converted to the perl style behaved the same under different
       | engines and the divergence is rough. Granted I was parsing regex
       | with regex which is possibly a path to insanity but things like a
       | literal [ were a real puzzle to match on different
       | implementations.
       | 
       | I don't _know_ that the horrible syntax on semantic beauty is due
       | to perl but it looks likely from a superficial standpoint.
        
         | jacobolus wrote:
         | If you read the early papers you get a very limiting
         | mathematical tool of mainly theoretical interest. At some point
         | perl happened and regular expressions became a ubiquitous
         | practical tool saving programmers collectively millions of
         | hours of labor.
        
         | wlesieutre wrote:
         | Swift's RegexBuilder DSL from a couple years ago gets away from
         | the illegible subset of ASCII.
         | 
         | Easy to explode into a lot of lines, but I'd rather have a 50 line
         | RegexBuilder implementation than try to keep track of what the
         | equivalent single-line version is doing. Especially if you ever
         | have to come back to it later and understand it again.
         | 
         | And if you ever make revisions in RegexBuilder you have useful
         | diffs instead of "the one line that does everything is
         | different than before."
         | 
         | https://developer.apple.com/documentation/regexbuilder
         | 
         | Are there similar tools in any other languages?
        
           | JonChesterfield wrote:
           | An alternative to seeking better language APIs.
           | 
           | Parsing regex then pretty-printing the parse tree as
           | s-expressions is very legible. You can also print the parse
           | tree as the original syntax. Postfix will work better for
           | some people, I like the lispy look for parse trees.
           | 
           | Most regex are similar syntax over a parse tree with
           | different parts missing, if you keep track of roughly what
           | features the current engine has in your head the sema
           | checking a real compiler should do could be deferred or
           | incomplete.
           | 
           | Some coding standards will want redundant escapes because
           | that is considered more readable, could put that logic in the
           | pretty-printer.
           | 
           | That's sort of suggesting using your IDE to translate the
           | thing back and forth on the fly instead of persuading
           | colleagues to stop writing in the obfuscated format.
        
         | phyzome wrote:
         | Am I just unusual in really liking the usual regex syntax? (I
         | mean, other than how every engine has a slightly different
         | variation on it.) This might just be a matter of familiarity,
         | but I find the s-expression versions harder to read, despite
         | having worked in a Lisp for more than 10 years.
        
       | hombre_fatal wrote:
       | If this is sufficient for rendering the text as neon:
       | 
       |     @neon = "Glow With The Flow"
       |     erb :'index'
       | 
       | What exactly is `@neon = ERB.new(params[:neon]).result(binding)`
       | even supposed to be doing?
       | 
       | Why wouldn't it just be:
       | 
       |     @neon = params[:neon]
       |     erb :'index'
        
       | ezekg wrote:
       | Ruby 4 should do what every other sane programming language does
       | and require users to opt into multi-line mode via the /m flag.
       | 
       | The fact that Ruby has this behavior at all is a major security
       | issue.
        
       ___________________________________________________________________
       (page generated 2024-04-22 23:01 UTC)