[HN Gopher] The dangers of single line regular expressions
___________________________________________________________________
The dangers of single line regular expressions
Author : thunderbong
Score : 67 points
Date : 2024-04-22 17:28 UTC (5 hours ago)
(HTM) web link (greg.molnar.io)
(TXT) w3m dump (greg.molnar.io)
| jlv2 wrote:
| More like "the danger of thinking you can trivially validate
| user-supplied input" before evaluating the string.
| cratermoon wrote:
| Even non-trivially validating it can go wrong. See Log4Shell,
| e.g.
|
| The bigger problem here is executing user input.
| wodenokoto wrote:
| I think it is a surprise that a partial match returns true.
|
| But I guess this is why Python has so many ways of matching a
| pattern against a string (match, find, findall, I think - they
| are hard to remember)
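|
| A minimal sketch of how the three main Python entry points differ.
| The pattern and the `describe` helper are illustrative, not taken
| from the article: `re.search` matches anywhere, `re.match` anchors
| only at the start, and `re.fullmatch` requires the whole string to
| match.
|
```python
import re

# Illustrative allowlist pattern: lowercase letters and spaces.
PATTERN = r"[a-z ]+"


def describe(text: str) -> dict:
    """Report which of the three matching functions accept `text`."""
    return {
        "search": re.search(PATTERN, text) is not None,        # match anywhere
        "match": re.match(PATTERN, text) is not None,          # match at the start only
        "fullmatch": re.fullmatch(PATTERN, text) is not None,  # entire string must match
    }


print(describe("foobar"))
print(describe("foo\n<%= evil %>"))  # only fullmatch rejects this
print(describe("123 foo"))
```
|
| For validation, `re.fullmatch` is usually the one that says what
| you mean.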
| roywiggins wrote:
| I have to look it up _every time_.
| sfink wrote:
| Alternatively, don't validate and then use the original. Instead,
| pull out the acceptable input and use that.
|
| Even better, compare that to the original and fail validation if
| they're not identical, but that requires maintaining a higher
| level of paranoia than may be reasonable to expect.
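|
| One way this could look in Python, assuming an allowlist of
| lowercase letters, digits and spaces. The helper names
| `extract_acceptable` and `strict_extract` are hypothetical.
|
```python
import re
from typing import Optional

# Assumed allowlist: lowercase letters, digits and spaces.
ALLOWED = re.compile(r"[a-z0-9 ]+")


def extract_acceptable(raw: str) -> Optional[str]:
    """Return the leading acceptable part of `raw`, or None if there is none."""
    m = ALLOWED.match(raw)
    return m.group(0) if m else None


def strict_extract(raw: str) -> Optional[str]:
    """The paranoid variant: fail unless the extracted part equals the input."""
    extracted = extract_acceptable(raw)
    return extracted if extracted == raw else None


hostile = "glow\n<%= File.read('flag.txt') %>"
print(repr(extract_acceptable(hostile)))  # only the safe prefix is ever used
print(repr(strict_extract(hostile)))      # rejected outright
```
|
| Either way, the code downstream only ever sees the extracted
| string, never the original.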
| Akronymus wrote:
| > Alternatively, don't validate and then use the original.
| Instead, pull out the acceptable input and use that.
|
| Parse, don't validate:
| https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...
| sfink wrote:
| Heh. I wrote up my comment, and then thought "hey, I bet
| that's what that 'Parse don't validate' article meant, the
| one I never quite got around to reading." So I pulled it up
| -- great article! -- but then didn't post the link because it
| uses the type system to record the results of the parse.
| Whereas here, you'd probably parse from a string into another
| string.
|
| But philosophically I agree, that's exactly the relevant
| advice.
| Akronymus wrote:
| Parsing from a string to a string runs the risk of
| erroneously assigning the original value to the new string,
| which kind of defeats the whole point of parsing rather than
| validating.
|
| What would work is having a small object holding a readonly
| string which parses the original on creation, then becomes
| immutable.
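|
| A rough Python sketch of such an object, assuming the same kind of
| allowlist discussed in the thread. `SafeText` and `_SAFE` are
| made-up names, and a frozen dataclass is just one way to get the
| read-only behaviour.
|
```python
import re
from dataclasses import dataclass

# Assumed allowlist: lowercase letters, digits and spaces.
_SAFE = re.compile(r"[a-z0-9 ]+")


@dataclass(frozen=True)
class SafeText:
    """A string that was checked once, at construction, and cannot change."""

    value: str

    def __post_init__(self) -> None:
        # fullmatch: the entire input must consist of allowed characters.
        if _SAFE.fullmatch(self.value) is None:
            raise ValueError("input contains characters outside the allowlist")


title = SafeText("glow with the flow")
print(title.value)
```
|
| Functions that accept a `SafeText` rather than a `str` make it
| harder to pass the raw input by accident, at least when a type
| checker is in use.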
| neilk wrote:
| In my experience `$` does reliably mean end of string for regular
| expressions, unless you specifically ask for "multiline" mode.
|
| Ruby seems to be in multiline mode all the time?
|
|     $ python -c 'import re; print "yes" if re.match(r"^[a-z ]+$", "foobar") else "no"'
|     yes
|     $ python -c 'import re; print "yes" if re.match(r"^[a-z ]+$", "foo\nbar") else "no"'
|     no
|     $ python -c 'import re; print "yes" if re.match(r"^[a-z ]+$", "foo\nbar", re.M) else "no"'
|     yes
|     $ perl -le 'print "foobar" =~ /^[a-z ]+$/ ? "yes" : "no"'
|     yes
|     $ perl -le 'print "foo\nbar" =~ /^[a-z ]+$/ ? "yes" : "no"'
|     no
|     $ perl -le 'print "foo\nbar" =~ /^[a-z ]+$/m ? "yes" : "no"'
|     yes
|     $ node -e 'console.log(/^[a-z ]+$/.test("foobar") ? "yes" : "no")'
|     yes
|     $ node -e 'console.log(/^[a-z ]+$/.test("foo\nbar") ? "yes" : "no")'
|     no
|     $ node -e 'console.log(/^[a-z ]+$/m.test("foo\nbar") ? "yes" : "no")'
|     yes
|     $ ruby -e 'if "foobar" =~ /^[0-9a-z ]+$/i then puts "yes" else puts "no" end'
|     yes
|     $ ruby -e 'if "foo\nbar" =~ /^[0-9a-z ]+$/i then puts "yes" else puts "no" end'
|     yes
|
| EDIT: this is documented behavior for Ruby. What other languages
| call multiline mode is the default; you're supposed to use \A and
| \Z instead. They do have an `/m` but it only affects the
| interpretation of `.`
|
| https://docs.ruby-lang.org/en/master/Regexp.html#class-Regex...
| interroboink wrote:
| Yeah, my takeaway from this was more "the dangers of Ruby"
| rather than "the dangers of single line regular expressions" (:
|
| I think the simplest fix would be to use "\Z" rather than "$",
| which means "match end of input" rather than "end of line."
| This is also Perl-compatible. So weird that the "$" default
| meaning is different in Ruby.
|
| I guess one could argue that Ruby's way is better since "$" has
| a fixed meaning, rather than being context-dependent.
|
| > Ruby seems to be in multiline mode all the time?
|
| Ruby does have a "/m" for multiline mode, but it just makes "."
| match newline, rather than changing the meaning of "$", it
| seems.
|
| [1] https://ruby-doc.org/3.2.2/Regexp.html#class-Regexp-label-An...
|
| [2] https://perldoc.perl.org/perlre#Metacharacters
| neilk wrote:
| looks like we both updated our answers as we looked up the
| docs :)
| Borg3 wrote:
| In the case of Ruby, the best approach would be to actually use
| the result of the match for further computation, like this:
|
|     return "Bad Input" unless (m = /^[a-z0-9 ]+$/.match(str))
|     str = m[0]
| js2 wrote:
| The potential trouble with $ (even in single-line mode) is that
| it matches the end of a string BOTH _with_ AND _without_ a
| newline at the end. If you're using it to ensure the string
| has no newline before doing something with it, this can lead to
| trouble.
|
|     $ python3 -c 'import re; print("yes" if re.search(r"^foo$", "foo") else "no")'
|     yes
|     $ python3 -c 'import re; print("yes" if re.search(r"^foo$", "foo\n") else "no")'
|     yes
|     $ python3 -c 'import re; print("yes" if re.search(r"\Afoo\Z", "foo") else "no")'
|     yes
|     $ python3 -c 'import re; print("yes" if re.search(r"\Afoo\Z", "foo\n") else "no")'
|     no
|
| Even if the newline is not problematic, using \A and \Z makes
| your intentions clearer to the reader, especially if you add
| re.X and place comments into the pattern.
|
| Asides:
|
| 1. Based on syntax, you appear to be testing with python2.
|
| 2. With python, re.match is implicitly anchored to the start,
| so the ^ is redundant. Use re.search or omit the ^.
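|
| A small sketch of the `\A`/`\Z` plus `re.X` style suggested above,
| next to the `^`/`$` version for comparison. The names `STRICT`,
| `LOOSE` and `accepts` are illustrative only.
|
```python
import re

# Verbose pattern: under re.X, whitespace and # comments in the pattern are ignored.
STRICT = re.compile(
    r"""
    \A          # start of the string, regardless of flags
    [a-z]+      # one or more lowercase letters
    \Z          # very end of the string (no trailing newline allowed)
    """,
    re.X,
)

# The common form: $ also matches just before a final newline.
LOOSE = re.compile(r"^[a-z]+$")


def accepts(pattern: re.Pattern, text: str) -> bool:
    """True if `pattern` is found in `text`."""
    return pattern.search(text) is not None


for sample in ("foo", "foo\n"):
    print(repr(sample), accepts(LOOSE, sample), accepts(STRICT, sample))
```
|
| Note that Python's `\Z` behaves like Ruby's `\z`; Python's `re`
| has no equivalent of Ruby's newline-tolerant `\Z`.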
| medstrom wrote:
| Correct me if I'm wrong, but if you extract a capture group
| (^foo$), you would get "foo" without the "\n", right?
|
| If so, it is not "matching the end of a string" at all. Just
| end of line. That's exactly as expected in single-line mode,
| so it's good. May mismatch your expectations in multi-line
| mode though.
| dwheeler wrote:
| False. "$" does NOT mean end-of-string in Perl, Python, PHP,
| Ruby, Java, or .NET. In particular, a trailing newline (at
| least) is accepted in those languages.
|
| A $ does mean end-of-string in Javascript, POSIX, Rust (if
| using its usual package), and Go.
|
| I'm working with the OpenSSF best practices working group to
| create some guidance on this stuff. It's a very common
| misconception. Stay tuned.
|
| If anyone knows of vulnerabilities caused by this, let me know.
| sjrd wrote:
| `$` _does_ mean end of input in Java, unless you explicitly
| ask for multiline mode. In the latter case it means
| `(?=$|\n)` if also in Unix-lines mode, and the horrible
| `(?=$|(?<!\r)\n|[\r\u0085\u2028\u2029])` otherwise.
|
| I wrote a compiler from Java regex to JavaScript RegExp, in
| which you'll find that particular compilation scheme [1].
|
| Edit: also quoting from [2]:
|
| > By default, the regular expressions ^ and $ ignore line
| terminators and only match at the beginning and the end,
| respectively, of the entire input sequence. If MULTILINE mode
| is activated then ^ matches at the beginning of input and
| after any line terminator except at the end of input. When in
| MULTILINE mode $ matches just before a line terminator or the
| end of the input sequence.
|
| [1] https://github.com/scala-js/scala-js/blob/eb160f1ef113794999...
|
| [2] https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pa...
| sjrd wrote:
| OK it seems they changed the doc since. In the docs for JDK
| 21 we read instead [1]:
|
| > If MULTILINE mode is not activated, the regular
| expression ^ ignores line terminators and only matches at
| the beginning of the entire input sequence. The regular
| expression $ matches at the end of the entire input
| sequence, but also matches just before the last line
| terminator if this is not followed by any other input
| character. Other line terminators are ignored, including
| the last one if it is followed by other input characters.
|
| Looks like I have some code to fix.
|
| [1] https://docs.oracle.com/en%2Fjava%2Fjavase%2F21%2Fdocs%2Fapi...
| phyzome wrote:
| Interesting that a trailing newline is accepted. Not as bad
| as what's in the post, at least. Definitely worth breaking
| out which languages do which of those, though! Python, for
| instance, only accepts a trailing newline but not additional
| chars beyond that.
|
| I don't think Java should be in your first list, though?
| Pattern.matches("^foo$", "foo\n") returns false.
| brobinson wrote:
| Note that Ruby also has \z which is what you generally want
| instead of \Z.
|
| (\Z allows a trailing newline, \z does not)
| wrsh07 wrote:
| This was interesting and new to me, but as other commenters
| indicate, part of the problem is that we're trying to find the
| bad thing rather than trying to verify it is the good thing
|
| There's a related concept of "failing open vs failing closed"
| (fail open: fire exit, fail closed: ranch gate)
|
| In Jurassic Park (an amazing book/film for understanding system
| failures), when the power goes out, the fence is functionally an
| open gate.
|
| In this case, we shouldn't assume that we can enumerate all
| possible bad strings (even with a regex)
| tsimionescu wrote:
| I don't think this is a good example, because the regex does
| just that: it doesn't try to filter out bad input, it
| specifically only accepts known good input. If the regex did
| what it was meant to do, only allowing strings composed of
| ASCII letters, numbers, and spaces, then the code would not
| have been exploitable.
| floxy wrote:
| Still seems like that is broken. Shouldn't they be escaping
| whatever control characters? Like if your user wanted to
| highlight "Now 75% off". Seems like it is reasonable to want
| to allow that.
| tsimionescu wrote:
| That's a completely different problem: it may be too
| closed. But it's definitely not a fail open system. It's a
| fail closed system with a bug.
| floxy wrote:
| Yeah, but the real bug is trying to roll your own instead of
| using an `ERB.escape_tainted_input` method or somesuch.
| Either that method doesn't exist, which seems like a major
| misfeature, or the author didn't know about it, or didn't
| want to use it.
| wrsh07 wrote:
| Whoops you are absolutely right!! Good point, I totally
| misread the if/else.
| cedws wrote:
| I pretty much always consider regex expressions as the wrong
| solution. They're notoriously hard to get right.
|
| There's a whole lot of faulty expressions out there for
| validating email addresses. I prefer to do _less_ validation and
| let it fail. If the email address is wrong, whatever service
| you're using for sending emails will just reject it. If you really
| do need to validate email addresses, use something somebody else
| wrote that does it properly.
|
| If you're working with some exotic format for which there isn't
| already an open source library, do what this guy says: parse it,
| don't try to validate it with regex:
| https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...
| htek wrote:
| Here, you have good advice: "I ... consider regex expressions
| as the wrong solution. They're notoriously hard to get right."
|
| However, the conclusion of "use something somebody else wrote
| that does it properly", while valid, is asking a lot. As regex
| is hard to get right, don't assume the code you find on the web
| or book or via some other means works correctly.
|
| My rule is if I didn't write it and can't wrap my head around
| the code to convince myself it is the right solution, I don't
| use it. And as I think others have written, there are some
| interactive online tests for regex expressions that can help.
| cedws wrote:
| I think the average developer has a better chance of finding
| a robust, battle tested library to do what they need than
| cooking up some regex of their own. Preferably, the library
| does not use regex at all and checks data more intelligently.
| scarmig wrote:
| Sometimes valid email addresses will be rejected as invalid,
| and sometimes invalid email addresses are still successfully
| delivered. Validation guarantees nothing, and at most it should
| be a UI cue.
| tsimionescu wrote:
| Regex works very well for what it was originally designed for:
| describing/validating regular languages. It can work OK if your
| language is simple and _almost_ regular. It works very badly
| for validating non-regular languages, even when Perl-style
| extensions are added to support that. And, unfortunately, most
| structured formats you might care to validate are in fact not
| regular languages at all.
|
| Email addresses in particular are surprisingly complicated and
| far from being regular languages. I don't know how commonly
| real servers support the full feature set, but even if they
| just support non-ascii names they quickly become a pain.
| librasteve wrote:
| Raku (perl6) was a chance for Larry Wall to fix some of the
| limitations of the perl regex syntax. As you would expect from
| the perl heritage, it behaves similarly:
|
|     ~ > raku -e 'say "foobar" ~~ /^ <[a..z ]> +$/ ?? "yes" !! "no"'
|     yes
|     ~ > raku -e 'say "foo\nbar" ~~ /^ <[a..z ]> +$/ ?? "yes" !! "no"'
|     no
|     ~ > raku -e 'say "foo\nbar" ~~ /^^<[a..z ]>+$$/ ?? "yes" !! "no"'
|     yes
|
| - ^^ and $$ are the raku flavour of multiline mode
|
| - ~~ the smartmatch operator binds the regex to the matchee and
| much more
|
| - character classes are now <[...]> (plain [...] does what (...)
| does in math)
|
| - perl's triadic x ? y : z becomes x ?? y !! z
|
| We can have whitespace in our regexen now (and comments and
| multiline regexen):
|
|     my $regex = rx/ \d ** 4            #`(match the year YYYY)
|                     '-'
|                     \d ** 2            # ...the month MM
|                     '-'
|                     \d ** 2 /;         # ...and the day DD
|
|     say '2015-12-25'.match($regex);    # OUTPUT: <<[2015-12-25]>>
| btilly wrote:
| Perl has supported whitespace and comments in regular
| expressions since approximately forever. Just use the /x
| modifier. All that Raku did was make that flag a default.
|
| The same thing is available in many other languages. They
| copied it when they copied from Perl. For example Python's
| https://docs.python.org/3/library/re.html#flags documents that
| re.X, also called re.VERBOSE, does the same exact thing.
|
| The fact that people don't use it is because few people care to
| learn regular expressions well enough to even know that it is
| an option. One of my favorite examples of astounding people
| with this was when I was writing a complex stored procedure in
| PostgreSQL. I read
| https://www.postgresql.org/docs/current/functions-matching.h....
| I looked for flags. And yup, there is an x flag.
| It turns on "extended syntax". Which does the same exact thing.
| I needed a complex regular expression that I knew my coworkers
| couldn't have written themselves. So I commented the heck out
| of it. They couldn't believe that that was even a thing that
| you could do!
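|
| For reference, a short sketch of what a commented pattern looks
| like with Python's `re.VERBOSE`, using the same YYYY-MM-DD shape as
| the Raku example earlier in the thread. The group names and the
| `parse_date` helper are made up for illustration.
|
```python
import re
from typing import Optional, Tuple

DATE = re.compile(
    r"""
    (?P<year>  \d{4} )   # the year, YYYY
    -
    (?P<month> \d{2} )   # the month, MM
    -
    (?P<day>   \d{2} )   # the day, DD
    """,
    re.VERBOSE,
)


def parse_date(text: str) -> Optional[Tuple[int, int, int]]:
    """Return (year, month, day) if the whole string has the YYYY-MM-DD shape."""
    m = DATE.fullmatch(text)
    if m is None:
        return None
    return int(m["year"]), int(m["month"]), int(m["day"])


print(parse_date("2015-12-25"))
print(parse_date("2015-12-25\n"))  # trailing newline is rejected by fullmatch
```
|
| This only checks the shape of the string; it does not verify that
| the month or day is in a valid range.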
| ec109685 wrote:
| Escape the output based on the context a string is being used in
| versus trying to sanitize for all use cases on input.
|
| This will guarantee that you're safe no matter how a piece of
| content is used tomorrow (just need a new escaping function for
| that content type), and prevent awkward things like not letting
| users use "unsafe" strings as input. JSX and XHP are example
| templating systems that understand context and escape
| appropriately.
|
| If a user wants their title to be
| "hello%0a%3C%25%3D%20File.open%28%27flag.txt%27%29.read%20%25%3E",
| so be it.
|
| Use input validation / parsing to ensure data types aren't
| violated, but not as an output safety mechanism.
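|
| A minimal Python sketch of escaping per output context, using only
| standard-library calls. The helper names and the `signs` table are
| hypothetical; real applications would normally lean on their
| template engine and database driver to do this.
|
```python
import html
import json
import sqlite3


def for_html(text: str) -> str:
    """Escape for an HTML text node or a quoted attribute value."""
    return html.escape(text, quote=True)


def for_json(text: str) -> str:
    """Encode as a JSON string literal."""
    return json.dumps(text)


def store_and_load(text: str) -> str:
    """Round-trip through SQLite with a bound parameter, not string concatenation."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE signs (title TEXT)")
        conn.execute("INSERT INTO signs (title) VALUES (?)", (text,))
        return conn.execute("SELECT title FROM signs").fetchone()[0]
    finally:
        conn.close()


title = "hello\n<%= File.open('flag.txt').read %>"
print(for_html(title))
print(for_json(title))
print(store_and_load(title) == title)
```
|
| The stored value stays exactly what the user typed; each sink
| applies its own encoding at the point of use.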
| Jerrrry wrote:
| > If a user wants their title to be
| "hello%0a%3C%25%3D%20File.open%28%27flag.txt%27%29.read%20%25%3E",
| so be it.
|
| that's a good way to horizontally propagate/reflect XSS and
| other Code As Data vulnerabilities.
|
| better to strip the known-bad/problematic characters
|
| https://en.wikipedia.org/wiki/Code_as_data
| ec109685 wrote:
| The known problematic characters are different in json, xml,
| css, html content, html attributes, MySQL, etc. Unless you
| have output escaping, it is hard to ensure everything gets
| caught, no matter how the data enters the system.
| tsimionescu wrote:
| Sure, but there is a common set of safe characters that are
| guaranteed not to cause problems in any of these: the set
| described by the regex [a-zA-Z0-9 -]. If you _can_ limit
| user input to this set, you'll drastically reduce the risk
| of code injection regardless of the stack below you.
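|
| As a rough illustration in Python, with `is_safe` being a made-up
| helper, the check is a one-liner, and it is strict enough to reject
| some perfectly ordinary input:
|
```python
import re

# The character set from the comment above, applied to the whole string.
SAFE = re.compile(r"[a-zA-Z0-9 -]+")


def is_safe(text: str) -> bool:
    """True only if every character of a non-empty `text` is in the allowlist."""
    return SAFE.fullmatch(text) is not None


for sample in ("Glow With The Flow", "Now 75% off", "O'Neill", "glow\n"):
    print(repr(sample), is_safe(sample))
```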
| phyzome wrote:
| And that's how you end up pissing off users with apostrophes
| in their names.
| Terr_ wrote:
| "Alright. If you're gonna go ahead with it, I want to make
| sure you get one thing right. It's "O'Neill," with two L's.
| There is another Colonel O'Neil with only one L and he has
| no sense of humor at all."
| tsimionescu wrote:
| The output is not the problem here, it is the input. And, if
| you can get away with it, accepting a small set of known-safe
| characters is _much_ safer than accepting any character and
| hoping it will be properly escaped at every level.
|
| When the user hands you a string and you then pass this down to
| other bits of code, you can't know if it will be used in an SQL
| query, a regex, in an error message that will be rendered into
| HTML, etc.
|
| Ideally all layers of your code would handle user input with
| the utmost care, but that is often very hard to achieve. If you
| take user input and use it in a regex, it's easy to regex-
| escape it, but it's much harder to remember that now this whole
| regex is user input and can't be safely used to, say, construct
| an SQL query. And even if you remember to properly escape it in
| the SQL query, it may show up in the returned result, and now
| if you display that result, you need to be careful to escape
| _it_ before passing it to some HTML engine.
|
| But then none of this works if you did intend to have some SQL
| syntax in the regex, or some HTML snippets in the DB: you'd
| need to make all of these technologies aware of which parts of
| the expressions are safe and which are tainted by user input.
|
| And this is all just to prevent code injection type attacks. I
| haven't even discussed more subtle attacks, like using Unicode
| look-alike characters to confuse other users.
| ufmace wrote:
| Seems to me this is more about the danger of passing anything
| derived from user input into the TEMPLATE side of a templating
| engine. Why in the world would you ever do that?!?
|
| Obviously if you pass data into the variable side of the engine,
| you hardly have to worry about it at all, since it's already
| going into a place that was designed for handling arbitrary and
| possibly-hostile input and has been battle-tested at doing it
| correctly in Production for many years. If you pass it into the
| template side, you're betting that you can be as good as dozens
| of templating engine writers working for a decade at doing that,
| in exchange for, well, I can't really think of any possible
| legitimate advantage for doing that.
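|
| To make the distinction concrete, here is a small sketch using
| Python's `string.Template` as a stand-in for ERB. It is far less
| powerful than ERB, and `PAGE`, `CONTEXT` and both helpers are
| invented for the example, but the shape of the mistake is the same.
|
```python
import html
from string import Template

# Fixed, developer-authored template: the only text ever parsed as a template.
PAGE = Template("<h1>$neon</h1>")

# Hypothetical server-side value that must never reach the user.
CONTEXT = {"secret": "hunter2"}


def render(user_text: str) -> str:
    """Variable side: user text is escaped and substituted, never re-parsed."""
    return PAGE.substitute(CONTEXT, neon=html.escape(user_text))


def render_unsafely(user_text: str) -> str:
    """Template side (anti-pattern): user text becomes template source."""
    return Template("<h1>" + user_text + "</h1>").substitute(CONTEXT)


print(render("$secret"))           # placeholder syntax stays inert text
print(render_unsafely("$secret"))  # placeholder is expanded from CONTEXT
```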
| klysm wrote:
| What if you want to allow users to regex search their
| documents?
| ses1984 wrote:
| Do it on the client side?
|
| Do it in a sandbox and have aggressive timeouts.
| JonChesterfield wrote:
| Regular expressions make me sad about our industry.
|
| If you read the early papers, you get a very clear language for
| pattern matching on sequences. They have really nice properties -
| the compilation to finite automata gives you decidable equality
| and decidable minimisation. As in you can compile equivalent
| regex to exactly the same state machine however they were
| expressed.
|
| At some point perl happened and that seems to have sent us down a
| path to encoding the regular expression in an illegible subset of
| ascii. The backtracking implementation cost us negation and
| intersection. What should be linear time matching becomes
| exponential.
|
| Emacs will let you write regex in s-expressions at which point
| they're much easier to read. Everywhere else has gone with "looks
| like Perl but has different semantics, which we kind of document,
| be lucky".
|
| I started writing tests to check that regex I'd begrudgingly
| converted to the perl style behaved the same under different
| engines and the divergence is rough. Granted I was parsing regex
| with regex which is possibly a path to insanity but things like a
| literal [ were a real puzzle to match on different
| implementations.
|
| I don't _know_ that the horrible syntax on semantic beauty is due
| to perl but it looks likely from a superficial standpoint.
| jacobolus wrote:
| If you read the early papers you get a very limiting
| mathematical tool of mainly theoretical interest. At some point
| perl happened and regular expressions became a ubiquitous
| practical tool saving programmers collectively millions of
| hours of labor.
| wlesieutre wrote:
| Swift's RegexBuilder DSL from a couple years ago gets away from
| the illegible subset of ASCII.
|
| Easy to explode into a lot of lines, but I'd rather have a 50 line
| RegexBuilder implementation than try to keep track of what the
| equivalent single-line version is doing. Especially if you ever
| have to come back to it later and understand it again.
|
| And if you ever make revisions in RegexBuilder you have useful
| diffs instead of "the one line that does everything is
| different than before."
|
| https://developer.apple.com/documentation/regexbuilder
|
| Are there similar tools in any other languages?
| JonChesterfield wrote:
| An alternative to seeking better language APIs.
|
| Parsing regex then pretty-printing the parse tree as
| s-expressions is very legible. You can also print the parse
| tree as the original syntax. Postfix will work better for
| some people, I like the lispy look for parse trees.
|
| Most regex are similar syntax over a parse tree with
| different parts missing, if you keep track of roughly what
| features the current engine has in your head the sema
| checking a real compiler should do could be deferred or
| incomplete.
|
| Some coding standards will want redundant escapes because
| that is considered more readable, could put that logic in the
| pretty-printer.
|
| That's sort of suggesting using your IDE to translate the
| thing back and forth on the fly instead of persuading
| colleagues to stop writing in the obfuscated format.
| phyzome wrote:
| Am I just unusual in really liking the usual regex syntax? (I
| mean, other than how every engine has a slightly different
| variation on it.) This might just be a matter of familiarity,
| but I find the s-expression versions harder to read, despite
| having worked in a Lisp for more than 10 years.
| hombre_fatal wrote:
| If this is sufficient for rendering the text as neon:
|
|     @neon = "Glow With The Flow"
|     erb :'index'
|
| What exactly is `@neon = ERB.new(params[:neon]).result(binding)`
| even supposed to be doing?
|
| Why wouldn't it just be:
|
|     @neon = params[:neon]
|     erb :'index'
| ezekg wrote:
| Ruby 4 should do what every other sane programming language does
| and require users to opt into multi-line mode via the /m flag.
|
| The fact that Ruby has this behavior at all is a major security
| issue.
___________________________________________________________________
(page generated 2024-04-22 23:01 UTC)