[HN Gopher] Don't try to sanitize input, escape output (2020)
___________________________________________________________________
Don't try to sanitize input, escape output (2020)
Author : maple3142
Score : 105 points
Date : 2022-01-13 14:48 UTC (8 hours ago)
(HTM) web link (benhoyt.com)
(TXT) w3m dump (benhoyt.com)
| gumby wrote:
| Since you don't know where your output will end up how could you
| possibly know the syntax to escape it?
|
| And how can the consumer of an arbitrary string trust that every
| input will have been properly escaped?
| whoopdedo wrote:
| Sounds like a restatement of Postel's robustness principle[1].
| Did it go out of style to "be conservative in what you send, be
| liberal in what you accept" and we need to relearn it again?
|
| Well, perhaps it did. History has shown the dangers of not
| handling malformed input well. Postel's principle has received
| scrutiny[2] for reinforcing those mistakes by creating a mistaken
| belief in robustness. More recent recommendations have been to be
| stricter in handling of inputs[3].
|
| But I think there is some confusion between robustness and
| defensiveness. "Be liberal in what you accept" may be confused
| with "don't sanitize your inputs" when not sanitizing is the less
| liberal action. Robustness means the program should not fail if
| it receives input it didn't expect. A program that crashes,
| hangs, executes unintended shell code, mangles the data, changes
| the thermostat, or other undefined behavior is not being robust.
| To prevent that, data must be sanitized at input so that it
| can be processed without those side effects. The
| examples of programs failing robustness have been because they
| were insufficiently defensive.
|
| The bigger issue is that robustness doesn't scale easily. You may
| know how your bit of code will deal with malformed data, but what
| about every other library you use? Or other systems you
| communicate with? It becomes a backstage problem, where once
| someone has gained access to a restricted area it's assumed they
| are authorized to be there. The further down the tech stack you
| go the less likely the code will be defensive. That puts a burden
| on the public-facing sanity checks to anticipate how relaxed they
| can be about the input.
|
| If you change the definition of output to include internal-
| outputs, then Postel's principle gets new life. That is, try not
| to program the entire system and ecosystem at once, but treat
| each software component as an island. Be liberal not only with
| the data you receive from the end-user, but also with return
| values from functions. Be conservative and escape not only your
| generated HTML, but also the SQL statements you dispatch to the
| backend. This is what input sanitizing is actually about, it's
| keeping the promise to the other parts of your program that your
| code isn't going to give them bad data. That's also what the
| linked article is saying, because the HTML being generated is
| itself one component in a chain of programs that includes the
| end-user's browser.
|
| [1] https://en.wikipedia.org/wiki/Robustness_principle
|
| [2]
| https://programmingisterrible.com/post/42215715657/postels-p...
|
| [3] https://datatracker.ietf.org/doc/html/draft-iab-protocol-
| mai...
| gkoberger wrote:
| This solution doesn't match the problem. Even the SQL injection
| example shows him sanitizing the input, which is at odds with the
| title of the post. Log4J is a more recent example of it being too
| late/useless to escape the output.
| ashearer wrote:
| This is an example of why the term "sanitize" just brings
| confusion and leads to incorrect software. If we say "escape"
| (for concatenation) or "parameterize" (for discrete arguments)
| instead, then there's no confusion: we know that it should be
| done at the point of use, because the procedure for doing so
| depends on that use.
|
| Calling it "sanitization" implies that the data is somehow
| dirty, so naturally it should be cleaned as soon as possible,
| and after that it's safe. But all that accomplishes in general
| is corrupting the data, often in an unrecoverable way, and then
| opening up security vulnerabilities because the specific use
| doesn't happen to exactly match the sanitization done in
| advance.
|
| It's great to validate the data on input and make it conform to
| the correct domain of values, but conflating this with output
| formats and expecting this to take care of downstream security
| as well just leads to incorrect data along with security
| vulnerabilities.
|
| PHP's long-ago-removed magic quotes feature was an example of
| this confusion in action. It not only mangled incoming strings
| containing single quotes in an effort to prevent SQL injection,
| but did so in a way that left some databases completely
| exposed, depending on their quoting syntax.
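
A minimal sketch of the "escape at the point of use" idea from the comment above, using only the Python standard library. The sample value `name` is made up for illustration. The same stored string needs a different encoding for each destination, which is why a single up-front "sanitize" step cannot cover all of them.

```python
import html
import shlex
from urllib.parse import quote

# Hypothetical user-supplied value, stored verbatim.
name = "O'Brien <admin>"

# Each output context has its own set of dangerous characters,
# so the same value is encoded differently at each point of use.
as_html = html.escape(name)      # for an HTML text node or quoted attribute
as_url = quote(name, safe="")    # for a URL path segment or query value
as_shell = shlex.quote(name)     # for a single POSIX shell argument

print(as_html)
print(as_url)
print(as_shell)
```

None of the three encodings is valid for the other two contexts, and the original value is recoverable from each of them.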
| brodouevencode wrote:
| Yeah little Bobby Droptables is still a thing.
| marcosdumay wrote:
| What?
|
| SQL injection is avoided at the point of usage. Trying to
| sanitize your input against it is an extremely bad practice.
| The same is true about HTML injection (whether you call it XSS
| or something else).
|
| Log4j is an example of not interpreting text that the developer
| was never aware was code. It's kind of the extreme
| opposite of escaping your text on usage.
| gkoberger wrote:
| The article says DON'T sanitize when putting it into the
| database. I think contextual escaping counts as "sanitizing
| input", so the solution of "don't try to sanitize input" is
| undermined.
| bcrosby95 wrote:
| For a long while, input sanitization in the web world was
| about modifying inputs to strip the problem areas. As such
| many consider escaping and sanitization to be completely
| different practices.
|
| It seems like this article is using this differentiation.
| In my experience, it's very common. It's not worth arguing
| about.
| shawnz wrote:
| I interpreted the message as not sanitizing inputs at the
| point they are received, a la PHP magic quotes. Instead,
| escape at the output (the output to the database engine).
| gkoberger wrote:
| Nowhere in the article do they use "output" to mean from
| the database engine; they use it to mean "outputting
| HTML".
| shawnz wrote:
| The article doesn't explicitly say the words "outputting
| SQL to the database engine", but that's because the focus
| is on XSS attacks and the part about SQL injection is
| just an aside. Clearly it's what they were trying to
| imply with language like this:
|
| > The only code that knows what characters are dangerous
| is the code that's outputting in a given context. And of
| course use your SQL engine's parameterized query features
| so it properly escapes variables when building SQL: ...
| This is sometimes called "contextual escaping".
|
| The "context" is that you are outputting to the database
| engine.
| dotancohen wrote:
| > your SQL engine's parameterized query features so
| > it properly escapes variables when building SQL
|
| This is wrong. Parameterized queries do not build an SQL
| string by escaping the input. The input is actually sent
| to the database separately from the SQL.
|
| Well, in all sane implementations, anyway. PHP has a
| PDO::ATTR_EMULATE_PREPARES option that does build SQL
| from a parameterized query. And, of course, WordPress has
| $wpdb->prepare() that returns an SQL string with the
| parameter escaped. Also, so far as I know, one cannot run
| a prepared statement from the SQLite CLI, so no
| parameterized queries there either:
|
| https://stackoverflow.com/questions/20065990/how-to-
| prepare-...
| shawnz wrote:
| Sure, maybe it does not literally send a substituted SQL
| string, but in order to send the parameters "separately"
| from the query, do they not still eventually get
| concatenated into a single binary string of some form to
| be sent across the wire? In spirit I think the same
| arguments apply there, it's just that the format of the
| data is not strictly SQL. It's actually the wire format
| of the database protocol.
| dotancohen wrote:
| You are correct that the parameters go across the wire,
| obviously, but I've never heard of an attack in which the
| parameters caused any type of compromise in the wire
| protocol. I would highly appreciate examples if any
| exist.
| shawnz wrote:
| It probably wouldn't result in an attack (unless you were
| dealing with a really sophisticated attacker), it's just
| necessary for correctness. Which is also true of all
| these examples: for example, people won't appreciate
| having backslashes wrongly inserted around legitimate
| characters of their names or other personal information,
| or having the software fail to process their request due
| to the characters in their name. It's not _just_ a
| security concern.
|
| In the general case there are certainly many examples of
| security vulnerabilities created by wrong serialization
| of data into the wire protocols of services, but maybe
| not specifically for this situation of query parameters.
| But maybe there are, I have no idea really. Either way,
| it's not the application developer's responsibility at
| that point, it's the responsibility of the people who
| developed the database driver.
| Arnavion wrote:
| >This is wrong. Parameterized queries do not build an SQL
| string by escaping the input. The input is actually sent
| to the database separately from the SQL.
|
| Your blanket observation is not necessarily true of all
| databases or database drivers. You found three counter-
| examples yourself, but there's no reason to not consider
| them "sane". It's not less correct than for databases
| that do support prepared statements in the driver
| protocol.
| marcosdumay wrote:
| > a la PHP magic quotes
|
| Up to this day, the official way to deal with XSS in .Net
| is by doing sanitization at the receiving point. I
| imagine the article is directed at that.
| shawnz wrote:
| That sounds pretty terrible, do you have an example of
| some docs which demonstrate that practice?
| [deleted]
| marcosdumay wrote:
| If the user says his name is "Bob'; drop tables students
| --", that is what you should store on your database.
| Unless, of course it's not a valid name for the rest of the
| system.
|
| That's such old and obvious advice that I'm surprised people
| keep posting here and upvoting. And even more surprised
| when people keep disagreeing here.
| ehutch79 wrote:
| If you're storing "Bob'; drop tables students --" in the
| database, you had to have sanitized your inputs, or there
| would be no students table.
|
| The article title says NOT to sanitize inputs. Perhaps
| it's that nuance doesn't fit in a headline, but eh...
| wvenable wrote:
| The confusion is what is input and what is output. The
| string "Bob'; drop tables students --" should not be
| sanitized/encoded on *input* _to the application_.
| However, if you're not using parameterized queries, it
| should be encoded on *output* _to the database_.
|
| Data should only be sanitized in transit and not stored
| in a sanitized form. That's what the article is really
| saying.
| IshKebab wrote:
| No you don't. You use a parameterized query:
| execute("INSERT INTO foo VALUES (?)", user_input)
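
For a concrete version of the parameterized query above, here is a small sketch using Python's built-in `sqlite3` module and an in-memory database. The `students` table is an assumption made for the example. The user's string is passed as a bound parameter, so it is stored verbatim and never parsed as SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT)")

user_input = "Bob'; drop tables students --"

# The value travels as a bound parameter, separate from the SQL text.
conn.execute("INSERT INTO students (name) VALUES (?)", (user_input,))

# Reading it back returns exactly what the user typed.
stored = conn.execute("SELECT name FROM students").fetchone()[0]
row_count = conn.execute("SELECT count(*) FROM students").fetchone()[0]
print(stored == user_input, row_count)
```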
| hombre_fatal wrote:
| What are you referring to? The SQL injection example is showing
| what not to do.
| gkoberger wrote:
| "So the better approach is to store whatever [data] the user
| enters verbatim, and then have the template system HTML-
| escape when outputting HTML"
|
| With this logic, someone could use a SQL injection. It
| wouldn't be sanitized as the INSERT is happening, so the SQL
| injection would be executed.
|
| EDIT: I know he goes on to talk about escaping characters,
| but the title of the post is "Don't try to sanitize input".
| My point is simply that SQL injections happen on input, not
| output. His example of escaping the SQL is at odds with the
| title of the post.
| hombre_fatal wrote:
| They show the solution of using parameterized queries to
| store the user input verbatim. What is an example of the
| attack you have in mind?
| jerf wrote:
| Most SQL systems have bind parameters for this sort of
| thing. That is a form of encoding the input. You have to
| encode the SQL values as well. You're basically saying if
| you don't use the suggested technique, the suggested
| technique doesn't work. Well, yeah. It has to be used
| consistently, all the time, every time.
|
| Unfortunately, that's just life. There's no way around it.
| One way or another you're going to be doing something or
| you're going to get owned.
| kam wrote:
| They're calling the SQL query "output" (from the app to the
| DB server). The point is that the "bad characters" depend
| on the context, so it's the step where you combine trusted
| and untrusted data that you need to think about escaping or
| validating.
| gkoberger wrote:
| No they're not. They're using the word "output" to mean
| "back into the HTML".
|
| "So the better approach is to store whatever name the
| user enters verbatim, and then have the template system
| HTML-escape when outputting HTML, or properly escape JSON
| when outputting JSON and JavaScript."
| kam wrote:
| The sentence immediately after that is "And of course use
| your SQL engine's parameterized query features so it
| properly escapes variables when building SQL"
| vlovich123 wrote:
| Sorry, how does this happen if you're using DB parameters in
| the query string?
| amalcon wrote:
| The article is specifically about sanitizing inputs to prevent
| XSS attacks. Sanitizing input isn't a great defense against
| that; you need a defense that better matches the attack.
|
| Validating or sanitizing input is a reasonably good
| defense against certain other things. E.g. zeroes in values
| you'll later divide by, when it's too late to return an error;
| multi-gigabyte names; information that you want to avoid
| storing like credit card numbers. That sort of use case doesn't
| really have a whole lot to do with the article, though.
| wnoise wrote:
| Compare with https://lexi-lambda.github.io/blog/2019/11/05/parse-
| don-t-va...
| kevincox wrote:
| Both are good advice. I have seen really funny bugs where
| Java accepted non-ascii numbers in an IP address but the C++
| control plane very much did not. If the re-serialized version
| was sent to the backend this wouldn't have been an issue.
|
| But the domains are different. Data validation is ensuring that
| the information is something that your system accepts. Data
| encoding is used when you are serializing information. You
| should very likely validate on input, but not "sanitize" or
| encode. You do your encoding on output.
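
The IP address anecdote above can be illustrated in Python (an assumption of this sketch; the original bug involved Java and C++). `int()` accepts non-ASCII decimal digits, while the standard `ipaddress` module rejects them, and re-serializing the parsed value yields one canonical text form to pass downstream. The helper name `parse_ip` is hypothetical.

```python
import ipaddress

def parse_ip(text: str) -> str:
    """Validate an IP address and return its canonical text form.

    Raises ValueError if the input is not a valid address.
    """
    return str(ipaddress.ip_address(text))

# The digits 1, 2, 3 written in Arabic-Indic script.
arabic_indic = "\u0661\u0662\u0663"

# A lenient number parser happily accepts them.
print(int(arabic_indic))

# Valid input is re-serialized into a canonical form.
print(parse_ip("2001:DB8:0:0:0:0:0:1"))

# The stricter address parser refuses the same digits.
rejected = False
try:
    parse_ip("192.168.0." + arabic_indic)
except ValueError:
    rejected = True
print("rejected:", rejected)
```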
| dnautics wrote:
| I think the domain models are different. "parse-don't-validate"
| is great when your users are internal and trusted (e.g. a
| library that does codegen - the operators of the parser are
| already in the codebase). When your users are potentially
| hostile, you should at some level have a separate validate and
| eject strategy.
| mjw1007 wrote:
| I think "parse, don't validate" is an improvement on what the
| author of this article recommends:
|
| << So in cases where you do need to "echo" raw user input,
| carefully filter input based on a restrictive whitelist, and
| store the result in the database. When you come to output it,
| output it as stored without escaping. >>
|
| I think the "parse, don't validate" approach comes out as
| follows:
|
| - take the list of things you would have included on your
| whitelist
|
| - add nodes for them to your internal representation for
| parsed markdown
|
| - extend your markdown parser to convert html-like input into
| those nodes
|
| - implement output for those nodes in a similar way to normal
| markdown
|
| This way, given the "escape output" they recommend, it's
| harder for any variant of the input that you hadn't
| considered to have harmful effects.
| yakshaving_jgt wrote:
| No, the _parse, don't validate_ idea is completely
| unrelated. It's about leveraging a type system like in
| Haskell. It's about parsing a value into a type with a
| narrower domain which in turn minimises the amount of
| control flow needed to implement a sufficiently correct
| program.
| mjw1007 wrote:
| I agree with "It's about parsing a value into a type with
| a narrower domain", but I don't see how you get to the
| first sentence.
|
| In their example of a markdown renderer, the internal-
| representation node is the type with the narrower domain.
| yakshaving_jgt wrote:
| The idea of parsing over validation is just as applicable
| with untrusted input as with trusted input. The idea is more
| about system design rather than the prevention of security
| vulnerabilities.
| dnautics wrote:
| come again? I didn't say you can't validate while parsing
| for untrusted input, I said for untrusted input you will
| probably STILL need additional separate validation methods.
| Key emphasis on negating _don't_ absolute imperative in
| the original aphorism.
| yakshaving_jgt wrote:
| I didn't say that that's what you said. I tried to
| communicate that whether or not the input is trusted is
| beside the point.
|
| Unless I'm deeply confused, the idea in _parse, don't
| validate_ is about doing something like this:
|
|     parseFoo :: Text -> Maybe Foo
|     parseFoo t =
|       if textIsAFoo t then Just (Foo t) else Nothing
|
|     f :: Foo -> IO ()
|     f = _
|
| Rather than something like this (which seems to be more
| common):
|
|     f :: Text -> IO ()
|     f t =
|       when (textIsAFoo t) g
|       where g = _
| dnautics wrote:
| I thought the idea of parse, don't validate is that your
| parser should contain validation logic.
|
| So instead of
|
| text -> generic json parser -> validate json -> ...
|
| you would do
|
| text -> custom json parser that stops if it encounters
| "incorrect content" -> ...
| yakshaving_jgt wrote:
| You can think of the parser as containing validation
| logic -- it can parse the value into a more constrained
| type if it conforms to validation rules, or it can fail.
|
| The point is that once your value is in a more principled
| type, the rest of the system is free from having to make
| assumptions (and guard against potential failures) about
| the breadth of that value's domain.
|
| As the article mentions, this is really only relevant to
| languages with proper type systems like Haskell.
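
The same shape can be approximated in Python, with the caveat that the guarantee is only checked if a static type checker such as mypy is used; nothing is enforced at runtime. This is a rough sketch, and the `Username` type and its rule (1 to 32 ASCII letters or digits) are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Username:
    """A username that has already passed parsing (hypothetical rules)."""
    value: str

def parse_username(text: str) -> Optional[Username]:
    # Hypothetical rule for illustration: 1 to 32 ASCII letters or digits.
    if 1 <= len(text) <= 32 and text.isascii() and text.isalnum():
        return Username(text)
    return None

def greet(user: Username) -> str:
    # No re-checking here: the parameter type records that parsing succeeded.
    return f"hello, {user.value}"

parsed = parse_username("alice")
if parsed is not None:
    print(greet(parsed))
```

Unlike the Haskell version, nothing stops a caller from constructing `Username("anything")` directly, so the convention only holds if `parse_username` is treated as the sole entry point.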
| marcosdumay wrote:
| Two different pieces of advice for two different things. That one is
| about data validation, making sure it is coherent and fits your
| data quality rules. This one is about data encoding, making
| sure it fits a different system's rules.
| ffhhj wrote:
| sanitize (client side) => confirm with user => trim+escape
| (server side) => insert
| InitialBP wrote:
| Alternatively...
|
| validate (client side) => insert using sql parameterization
| (escaping) => escape per context when outputting
|
| Sanitizing is the idea that you are cleaning dangerous things
| from the original input (different from validating, which is
| disallowing users from entering characters that don't conform
| to what your program expects).
|
| One BIG issue here is that validation is generally clear to the
| user ("That is an invalid email address") whereas sanitization
| normally doesn't consult or inform the user that there were
| changes and may result in unexpected things happening from a
| user perspective.
|
| From article: name is "John O'Brien" now displays as "John
| OBrien" (this is a trivial example but still an issue)
|
| The name thing is a great example of things you might not
| expect your users to do but are still totally valid use cases.
| Sanitization can be extremely frustrating from a user
| perspective.
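
To make the validate-versus-sanitize distinction above concrete, here is a small Python sketch. Both functions and their rules are hypothetical. The sanitizer silently changes the data, while the validator leaves it alone and returns messages that can be shown to the user.

```python
import re
from typing import List

def sanitize_name(name: str) -> str:
    # Silently strips anything outside a narrow character set (lossy).
    return re.sub(r"[^A-Za-z ]", "", name)

def validate_name(name: str) -> List[str]:
    # Returns human-readable problems instead of changing the data.
    errors = []
    if not name.strip():
        errors.append("Name must not be empty.")
    if len(name) > 200:
        errors.append("Name must be at most 200 characters.")
    return errors

name = "John O'Brien"
print(sanitize_name(name))   # the apostrophe is gone and the user was never told
print(validate_name(name))   # no errors: the name is accepted unchanged
```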
| dagss wrote:
| Why "escape"? Just insert. Using SQL parameters.
| InitialBP wrote:
| Insert using SQL parameters is "escaping". The
| parameterization ensures that the data being passed gets
| interpreted by the DB as the expected data type by ensuring
| special characters aren't interpreted as "special" in that
| context.
| 1970-01-01 wrote:
| Why not both?
| pwdisswordfish9 wrote:
| Because if someone wants to register with the name O'Malley,
| you should not refuse them, or worse, mangle their name.
| 1970-01-01 wrote:
| If someone wants to register with the username O'Malley and
| password O`Malley, would you let them?
| simonw wrote:
| That's not sanitization, that's validation. You implement a
| validation rule that says that the password and the
| username can't be the same thing, then use that to
| redisplay the registration form with an error message.
| 1970-01-01 wrote:
| Sometimes, you just sanitize input for user reasons. The
| username and password above are different ASCII strings.
| A non-tech-savvy (very elderly) user does not know the
| difference between "`" and "'", which means they get
| locked out. This results in phone calls to support, which
| in volume result in "fix the password field".
|
| See https://www.cl.cam.ac.uk/~mgk25/ucs/apostrophe.html
| simonw wrote:
| I'd call that normalization rather than sanitization, but
| that's my own personal terminology, not necessarily
| terminology that's widely used.
| Avamander wrote:
| Then that sanitization is incorrect. I don't think this
| discussion would have any merit if we're speaking about
| incorrect implementations.
| pjerem wrote:
| Ok, and imagine we are on an Internet forum, talking about
| what is correct sanitization, and I want to make the
| following example : <script>alert(42)</script>
|
| Will HN remove the <> characters and make my comment
| incomprehensible or will it escape it on output, preserving
| all the meaning ?
|
| (Well, I'll know after hitting Reply)
|
| Edit : good boy, HN
| nybble41 wrote:
| You can't sanitize "correctly" if you don't know where the
| data will be used. This is exactly why the article
| advocates for escaping _output_ (e.g. immediately before
| inserting a string into a SQL query) rather than sanitizing
| _input_ (e.g. by deleting single-quotes or other
| potentially problematic characters from strings as soon as
| they're received).
| chriswarbo wrote:
| The fundamental problem is attempting to conflate a bunch of
| semantically-distinct things, just because they might happen to
| (sometimes) be represented in memory by similar byte sequences.
|
| Such 'byte coincidences' lead to lazy, non-sensical operations,
| like "append this user-provided name to that SQL statement";
| implemented by munging together a bunch of bytes, without thought
| for how they'll be interpreted.
|
| A much better solution is to ignore whether things might just-so-
| happen to be represented in a similar way in memory; and instead
| keep things distinct if they have different semantic meanings
| (like "name", "SQL statement", "HTML source", "shell command",
| "form input", etc.). That way, if we try to do non-sensical
| things like appending user input to HTML, we'll get an
| informative error message that there is no such operation.
|
| This isn't hard; but it requires more careful thought about APIs.
| Unfortunately many languages (and now frameworks) have APIs
| littered with "String"; ignoring any distinctions between values,
| and hence allowing anything to be plugged into anything else (AKA
| injection vulnerabilities)
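
One way to picture the "keep semantically distinct things distinct" idea is a tiny wrapper type, sketched here in Python. The `Html` class is hypothetical and far simpler than real libraries that do this. Plain strings cannot be appended to it, so forgetting to escape becomes a `TypeError` rather than an injection.

```python
import html

class Html:
    """Markup that is already safe to place in an HTML document body."""

    def __init__(self, markup: str) -> None:
        # Intended only for trusted literals written by the developer.
        self._markup = markup

    @classmethod
    def from_text(cls, text: str) -> "Html":
        # The only route from untrusted text to Html goes through escaping.
        return cls(html.escape(text))

    def __add__(self, other: object) -> "Html":
        if not isinstance(other, Html):
            # Plain str is refused; the caller must convert it first.
            return NotImplemented
        return Html(self._markup + other._markup)

    def __str__(self) -> str:
        return self._markup

user_name = "<script>alert(42)</script>"
page = Html("<p>") + Html.from_text(user_name) + Html("</p>")
print(page)

try:
    Html("<p>") + user_name   # mixing types is an error, not an injection
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)
```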
| [deleted]
| dang wrote:
| Discussed at the time:
|
| _Don't try to sanitize input - escape output_ -
| https://news.ycombinator.com/item?id=22431022 - Feb 2020 (280
| comments)
| parhamn wrote:
| It's cool to see how these posts are becoming less and less
| important in the wake of today's frameworks/tools protecting devs
| by default.
|
| From ORMs escaping SQL, to FE frameworks escaping html/js, to
| browsers starting to default to same-site=lax. It feels like
| we've slowly pulled ourselves out of OWASP hell. Pretty nice to
| see!
|
| Obviously it's still important (see log4j) to know it all
| especially when it's not so clear cut, but still good progress.
| erosenbe0 wrote:
| I think we really failed in earlier eras to get it right due to
| the momentum of the frameworks.
|
| I would liken it to some of the crap building materials that were
| allowed in the past as new, cheap alternatives but subsequently
| showed failure or hazards after short service lives.
| Contractors were tasked with implementing these materials to
| stay within budget and everyone suffered the effects later.
| nostrademons wrote:
| I think a better way to think of this may be in terms of
| _canonicalization_. Inside your application, you should decide on
| a single canonical way to represent data, one which fits the type
| of processing and expected use of the application. For example,
| you might decide that all strings should be UTF8, and should be
| interpreted (and stored) as whatever the user initially wrote.
| You might decide that any structured data should be parsed and
| then stored as protobufs in a BigTable. Or you might decide that
| an RDBMS is your native datastore and use whatever the native
| string encoding is for it, as well as parse & normalize data
| into tables upon input.
|
| Then, whenever you take input, your job is to _validate_ and
| _encode_ it. If you get a Windows-1252 string, you should re-
| encode it to utf8 for further storage. If it has data that are
| invalid UTF-8 codepoints, you should either strip, replace with a
| replacement character, or notify the user with a validation
| failure. Same with structured data that fails your normalization
| rules - you should usually notify the user.
|
| And when you send _output_, you should escape based on the
| intended output device. If you're putting it in an HTML page,
| HTML-escape it. If it's a URL, url-encode it. If it's a database
| query, SQL escape it. If it's a CSV, quote it.
|
| Thinking in these terms keeps the internal logic of your
| application simple (there are no format conversions except at
| system boundaries), and it also gives you a lot of flexibility to
| preserve the user's intent and add new output formats later.
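
A rough Python sketch of the flow described above: decode to one canonical in-memory form at the boundary, then encode per output device. The sample bytes and the choice of Windows-1252 as the incoming encoding are assumptions for the example. SQL is omitted here because bound parameters cover that case.

```python
import csv
import html
import io
from urllib.parse import quote

# Input boundary: bytes arrive in Windows-1252 and are decoded once.
raw = b"Caf\xe9 <b>&</b> Co, Ltd"
text = raw.decode("cp1252")   # canonical form inside the application: str

# Output boundaries: one stored value, a different encoding per destination.
html_out = html.escape(text)
url_out = quote(text, safe="")

buf = io.StringIO()
csv.writer(buf).writerow([text, "2022"])
csv_out = buf.getvalue()

print(html_out)
print(url_out)
print(csv_out, end="")
```

Because `text` itself is never modified, adding another output format later only needs one more encoder at the edge.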
| platz wrote:
| so you would prevent stored XSS attacks by escaping on the
| output step instead of the _canonicalization_ step
| simonw wrote:
| Right - the way to avoid XSS is to escape on output.
|
| Most good template languages these days implement auto-
| escaping of variables that are interpolated into HTML.
|
| You still have to be careful embedding content into non-HTML
| contexts. One classic example there is outputting a blob of
| JSON inside a <script> tag - you need to make sure that you
| handle the case where a string could contain
| "</script><script>evil_code_here()</script>".
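
One common way to handle the JSON-inside-`<script>` case is to serialize normally and then replace the characters that matter to the HTML parser with JSON `\uXXXX` escapes. The helper name `json_for_script` is hypothetical, and this is a sketch of the idea rather than a vetted library routine.

```python
import json

def json_for_script(value) -> str:
    """Serialize value as JSON that contains no '<', '>' or '&'."""
    return (
        json.dumps(value)
        .replace("<", "\\u003c")
        .replace(">", "\\u003e")
        .replace("&", "\\u0026")
    )

payload = {"bio": "</script><script>evil_code_here()</script>"}
embedded = json_for_script(payload)
print(embedded)

# The escapes are ordinary JSON, so the data round-trips unchanged.
print(json.loads(embedded) == payload)
```

These characters can only occur inside JSON string literals, so replacing them with Unicode escapes does not change the parsed value.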
| colejohnson66 wrote:
| React (technically JSX) has a nice feature where all output
| is escaped. So this doesn't work:
|
|     const evil = "<script>alert('')</script>";
|     ..
|     <div>{evil}</div>
|
| That'll output:
|
|     <div>&lt;script&gt;alert('')&lt;/script&gt;</div>
|
| If you _must_ output a raw HTML string, they make you
| acknowledge that you're aware of what you're doing:
|
|     <div dangerouslySetInnerHTML={{__html: evil}} />
| zelphirkalt wrote:
| Also, in languages, which do not treat HTML as a simple
| string (looking at PHP and many others) or have libraries
| for doing exactly that, using any kind of data inside any
| HTML element, where it is put as text, will automatically
| make it escaped as text, with no overhead for the
| developer.
| scotty79 wrote:
| I'm really surprised by the discussion here. It's so obviously
| true, and I realized this when the correct PHP function to escape
| a string for SQL was named mysql_real_escape_string.
| joering2 wrote:
| Every online form where a user can interact and send data back to a
| server is always a nightmare in terms of security. I do utilize
| mod_security, but with my next project, I have an idea of doing
| "base64" on everything in client's browser via javascript then
| sending it to server and checking on backend if content is a
| valid base64. Is that a good concept?
| afavour wrote:
| Unfortunately that wouldn't help with a whole lot. The danger
| with input is that it could be used to e.g. escape a SQL query
| and delete your database. Which is why we now have
| parameterised queries and such to help alleviate those worries.
|
| If you think about it the process you're describing already
| happens: the browser sends the user's input as (usually) UTF8
| string data, then the server decodes it. Changing that process
| to base64 wouldn't change much.
| [deleted]
| scotty79 wrote:
| Only if you never decode it from base64. :)
| adrr wrote:
| Wouldn't base64ing your inputs bypass mod_security?
| joering2 wrote:
| ok thank you everyone for your responses (+1s) - I was researching
| on this idea and couldn't find anything online - now I know
| why!
| lesquivemeau wrote:
| Wouldn't prevent XSS afaik
| justinsaccount wrote:
| That could work if you are just going to store things as
| base64.
|
| It accomplishes nothing if you are going to decode the base64
| on the backend and then use the original value as-is. If
| anything it's worse than nothing, because now mod_security will
| just see the base64 content and might fail to detect certain
| attacks.
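
A short Python illustration of why the base64 idea changes nothing once the server decodes: the payload comes out byte-for-byte identical, while any filter inspecting the wire format sees only base64 characters. The payload string is an arbitrary example.

```python
import base64

payload = "'; DROP TABLE users; --"

# What the browser would send under the proposed scheme.
wire = base64.b64encode(payload.encode("utf-8")).decode("ascii")

# What the backend has after checking that it is valid base64 and decoding.
decoded = base64.b64decode(wire, validate=True).decode("utf-8")

print(wire)
print(decoded == payload)
```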
| blibble wrote:
| guess I'll just put that 2gb "first name" directly into my
| database then
| AtNightWeCode wrote:
| No, garbage in, garbage out. Sure, things like log or SQL
| injections should not only be solved by sanitizing. You solve it
| by separating data and code. A lot of times you really want to
| store data in a structured canonical way. Usernames for instance.
| It is bad if Unicode trickery lets someone create multiple
| usernames that look the same. Product descriptions: it is bad if
| your ML needs to handle HTML, and so on.
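
For the username case mentioned above, one partial measure is to compare names in a normalized form. The sketch below uses Python's `unicodedata` module. The function name is hypothetical, and the approach is only a starting point: NFKC plus case folding merges compatibility variants such as fullwidth letters but does not catch look-alikes from other scripts.

```python
import unicodedata

def canonical_username(name: str) -> str:
    # NFKC folds compatibility characters (e.g. fullwidth letters);
    # casefold() removes case distinctions.
    return unicodedata.normalize("NFKC", name).casefold()

fullwidth = "\uff21lice"   # starts with FULLWIDTH LATIN CAPITAL LETTER A
cyrillic = "\u0430lice"    # starts with CYRILLIC SMALL LETTER A

print(canonical_username(fullwidth) == canonical_username("Alice"))
print(canonical_username(cyrillic) == canonical_username("alice"))
```

The second comparison is false, which shows that cross-script confusables need a separate check beyond normalization.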
| kevincox wrote:
| This is wrong. If I leave a comment `'; DROP TABLE users; --`
| You should display it back in the app as exactly that. If you
| put it into an HTML attribute you escape the `'` and if you
| stick it in SQL you use parametrized statements.
|
| There is nothing "wrong" with that initial input. What is wrong
| is pasting it into an SQL string, HTML element, HTML attribute,
| URL parameter or anywhere else without properly encoding it.
|
| This is the main reason you can't "sanitize" input. You need to
| know what the output format is to properly encode it. There are
| different requirements if you are pasting it into a sed
| replacement command vs HTML attribute vs HTML element body. You
| can strip everything except a-zA-Z and cross your fingers but
| even that isn't necessarily sufficient for all output formats.
| ehutch79 wrote:
| using parameterized statements is sanitizing inputs into the
| database.
| kevincox wrote:
| The database is "outside" of your application server. You
| communicate with the database using statements and when you
| get the value back from the database it is unchanged. The
| encoding was just for transfer, no data has actually been
| changed.
| AtNightWeCode wrote:
| Maybe a better way to put it is that you should be smart about
| why, when, and where to sanitize your data. A comment on a
| forum should not remove "'; DO BAD THINGS;". Why would it? It
| is just text in probably some UTF8 encoding. No viable web
| framework will write it out in a raw format if you do not
| explicitly ask for it. In SQL you use parameters. But as I
| wrote in my original comment, there are several scenarios (if
| you work with the web, probably most cases) where you really
| want to make sure that what you have stored is a clean,
| structured, canonical data representation, not only for your
| security but also for third-party consumers and analysis.
|
| I understand that everybody who sells NoSQL solutions
| disagrees.
| hamilyon2 wrote:
| Sanitizing inputs is not what you realistically want. You should
| prohibit certain types of input. Whitelisting strings is what
| I would call it.
|
| You should escape outputs, of course (not that anyone in 2022
| thinks otherwise).
|
| Escaping outputs alone won't work because user inputs will be
| stored in some database and you can't realistically predict how,
| when, or where they will be used, years in the future. A user name
| could be used as a filename once, opening up the possibility of a
| shell-based exploit. It could trigger a little-known spreadsheet
| formula vulnerability when exported for analysis. Novel,
| interesting XSS attacks are common and produced every day. It
| might not even be your code, but code that your client or partner
| organisation runs. You just never know.
|
| One common defence is that user names (and other freeform fields)
| should not be allowed to be arbitrary bytes.
|
| That is defence in depth, an established practice.
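|
| A hedged sketch of that kind of allowlist in Python. The exact
| policy (3 to 32 ASCII letters, digits, "_" or "-") and the
| function name are assumptions for illustration. Note that it
| rejects bad input instead of silently rewriting it.

```python
import re

# Hypothetical policy: 3-32 characters, ASCII letters, digits, "_" or "-".
USERNAME_RE = re.compile(r"[A-Za-z0-9_-]{3,32}")


def validate_username(name: str) -> str:
    """Return name unchanged if it is allowed; raise ValueError otherwise."""
    if not USERNAME_RE.fullmatch(name):
        raise ValueError(
            "username must be 3-32 characters of A-Z, a-z, 0-9, _ or -"
        )
    return name
```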
| HWR_14 wrote:
| If you are echoing a user's input back to them, what's the
| threat model that requires you to sanitize the output?
|
| That said, it's obviously not worth building a "don't sanitize
| this" filter for that case.
| wongarsu wrote:
| That works well for things you can limit to alphanumeric, which
| is pretty much only usernames. For everything else there will
| be an exploit in some context without proper escaping. You can
| decrease the attack surface, but you have to weigh that against
| the false sense of security it might give developers.
| InitialBP wrote:
| Agree and disagree. Sanitization has its place, but from a
| user perspective it's better to just outright reject (through
| validation) inputs that aren't valid.
|
| There are often unexpected ways that data gets into the system
| (IT manually adding data, internal support tool to help
| customers add data, etc.) You need to ensure that you're
| properly sanitizing your input at every single input faucet and
| your sanitization has to predict how, when, and where it will
| be used by sanitizing for dangerous characters in filenames,
| shell, spreadsheet formula vulns, and XSS attacks.
|
| Instead, (Or In addition to) just make the assumption that data
| in the database is dangerous, and ensure that you properly
| escape for your use case when using that data.
|
| Using a username to create a new file? Escape for filenames
| based on which OS/language you're using.
|
| Using birthdates in an excel file? Escape for excel formulas.
|
| Using bio on an HTML page? HTML Escape.
|
| Using username as part of a URL path? URL Escape.
|
| And finally, circle back to the fact that sanitization where you
| change user input without their knowledge (like the "O'brien"
| -> "Obrien" example in the article) makes for a frustrating
| user experience.
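|
| For the spreadsheet case above, one commonly suggested mitigation
| is to prefix cells that start with a formula trigger character
| with a single quote at export time. The sketch below assumes that
| approach; the prefix list and function names are illustrative,
| and it will also prefix legitimate values such as "-5", so treat
| it as a starting point and not a complete defence.

```python
import csv
import io

# Characters that spreadsheet applications may treat as starting a formula.
FORMULA_PREFIXES = ("=", "+", "-", "@", "\t", "\r")


def neutralize_formula(cell: str) -> str:
    """Prefix a leading formula character so the cell is read as text."""
    if cell.startswith(FORMULA_PREFIXES):
        return "'" + cell
    return cell


def export_csv(rows) -> str:
    """Write rows to CSV text, escaping each cell for spreadsheet use."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow([neutralize_formula(str(value)) for value in row])
    return buf.getvalue()
```

| The stored data stays verbatim; only the spreadsheet export is
| escaped.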
| hamilyon2 wrote:
| I agree, when your app does exporting, use escaping and be
| happy. Nobody ever challenged that. But that is not enough.
| You should do defence in depth. My point is that you
| can't realistically escape for every use, because
|
| 1) once it is stored, it is usually outside of your control.
| You simply do not know where your data will end up, due to
| e.g. new integrations that will be developed in future.
|
| 2) you may not even know the proper escaping rules for the
| document types you are producing, due to software obscurity.
| Nobody I can think of escapes CSV files for excel-2001
| vulnerabilities. This is just one example of software where
| those files can actually end up opened.
|
| What is more economical/rational to change, your input
| validation or every csv/excel exporter/converter ever in
| existence?
| swlkr wrote:
| A strong content security policy also helps with xss
| iou wrote:
| Do both pls.
| hombre_fatal wrote:
| If you're doing both, I'd ask you what you think you're
| accomplishing by sanitizing input, especially when you're
| already escaping output.
|
| All you're doing is corrupting the data with a ritual that
| seems like it's securing something, and it tends to make you
| think that your data is now ready to be rendered anywhere
| without issue.
| pydry wrote:
| >If you're doing both, I'd ask you what you think you're
| accomplishing by sanitizing input, especially when you're
| already escaping output.
|
| https://en.m.wikipedia.org/wiki/Defence_in_depth_(non-
| milita...
| hombre_fatal wrote:
| I'd argue that sanitization makes things worse from that
| standpoint.
|
| What exactly was transformed in some given data and for
| what context? What needs to be done to reverse the
| sanitization process if you want to see the verbatim data,
| if that's even possible? Now that you want to escape the
| output, how can you reverse the sanitization transform so
| that you aren't double-escaping? What were the assumptions
| being made when this data was sanitized and what _was_ that
| transform?
|
| In other words, it's simpler to hold the verbatim data and
| then ask "ok, how does it need to be escaped for this
| context?" than having to ask that same question with
| arbitrarily mangled data while worrying if the data was
| sufficiently escaped for this context at input-time some
| point in the past.
|
| Even beginners get almost all mileage from parameterized
| SQL queries + using an HTML templating library that escapes
| by default which is almost all of them these days.
|
| I think knee-jerk sanitization is a relic of the days where
| that wasn't common, namely <?php echo $username ?>, which
| wasn't necessarily the worst advice when you otherwise had
| to remember to echo htmlEscape($username) every single
| time. Fortunately, things have improved since those days.
| pydry wrote:
| I've used a bunch of sanitizers and never had any issues
| with any of them. I'm sure there are exceptions but IME
| they tend to mangle the kind of text which the user
| really has no legitimate need to enter most of the time.
|
| Far from being a relic the recent log4j vulnerability
| highlighted just how much value there is in this kind of
| defense in depth.
|
| Obviously knee jerk decisions in tech are usually bad
| news.
| AnonHP wrote:
| The data store may be one, but the teams and apps working on
| the inputs and the outputs may be disparate and different.
| Relying on other teams all the time to do things correctly
| may not be a wise approach.
| jerf wrote:
| I can't emphasize this enough. This isn't a matter of taste,
| like, maybe you sanitize, maybe you escape on the way out,
| it's all good, it all works, it's just a matter of opinion.
|
| Sanitizing the input is _wrong_. Actively, objectively,
| unrecoverably wrong. Once you've destroyed your data you
| can't get it back. Huge amounts of effort have been wasted by
| people trying to fix and recover data that was destroyed by
| systems "helpfully" "sanitizing" data. God help you if you
| have a sequence of these systems in a row each doing their
| own "sanitization" before you get the data.
|
| Do not "sanitize" your inputs. Do not tell other developers
| to sanitize their inputs. Do not sagely spout off on HN about
| the importance of sanitizing your inputs. It is _wrong_.
|
| The only "sanitization" that should be done is that when
| encoding to the output there are sometimes things that should
| simply be removed. For instance, a good HTML escaping
| function probably ought to entirely drop nulls, not even
| encoding them as &#0; or anything, just drop them. Some of
| the other ASCII characters are straight-up illegal in HTML as
| well, even encoded. But all that sort of "sanitization"
| should be in the escaping step. If you want to reject null
| characters at input time, that's part of _validation_, not
| sanitization.
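|
| A rough Python illustration of that split, with hypothetical
| function names: dropping NUL happens inside the escaping step,
| while rejecting NUL is a separate validation step at input time.
| Neither one rewrites the stored data.

```python
import html


def escape_html_text(value: str) -> str:
    """Escape for HTML output, dropping NUL characters along the way."""
    cleaned = value.replace("\x00", "")
    return html.escape(cleaned, quote=True)


def validate_no_nul(value: str) -> str:
    """Input-time validation: reject NUL outright instead of removing it."""
    if "\x00" in value:
        raise ValueError("input must not contain NUL characters")
    return value
```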
| asplake wrote:
| Validate inputs, escape outputs
| Buttons840 wrote:
| Yes, but remember in a lot of cases nearly anything is
| valid input.
| talideon wrote:
| _Some_ sanitisation is fine. For instance, stripping
| leading and trailing space in some fields, case
| normalisation, automatic insertion of spaces in credit card
| numbers, that kind of thing. That is to say, you should
| sanitise as an affordance to the user. Given the choice
| between presenting an error to the user and automatic
| sanitisation, the latter is preferable. It's something that
| should be done carefully, but it's still good.
|
| Thoughtless sanitisation is a whole different kettle.
| nybble41 wrote:
| To me that sounds more like canonicalization than
| sanitization. Depending on your requirements it might be
| fine to convert the input to a canonical form before
| processing. If you do this, be certain to do it _before_
| validation so that you don't accidentally "canonicalize"
| validated input into something which wouldn't pass the
| validation checks.
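|
| A sketch of that ordering in Python, using card numbers as the
| example. The function names and the 12 to 19 digit rule are
| assumptions for illustration. Canonicalization runs first, and
| validation only ever sees the canonical form.

```python
import re


def canonicalize_card_number(raw: str) -> str:
    """Remove the spaces and hyphens people type for readability."""
    return re.sub(r"[ -]", "", raw.strip())


def validate_card_number(canonical: str) -> str:
    """Accept only 12-19 ASCII digits; raise ValueError otherwise."""
    if not re.fullmatch(r"[0-9]{12,19}", canonical):
        raise ValueError("card number must be 12-19 digits")
    return canonical


def parse_card_number(raw: str) -> str:
    # Canonicalize first, then validate the result, never the reverse.
    return validate_card_number(canonicalize_card_number(raw))
```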
|
| A key aspect of canonicalization compared to sanitization
| is that the result should be something that the user
| would consider equivalent to their original input. The
| most common offender in my experience is the abuse of
| case normalization, especially for data like email
| addresses which are not defined as case-insensitive (at
| least for the mailbox name) even if many servers treat
| them that way. If you don't preserve the original case
| (and other parts such as "+" labels whose meaning is
| defined by the mail server) the address may not work at
| all, or may result in sending messages to the wrong user.
|
| Names, as an intimate part of the user's identity, are
| another area where case normalization can sometimes prove
| annoying or even offensive. If some legacy system
| requires names to be entered as all-caps US-ASCII
| characters, fine, but at least don't turn "O'Conner" or
| "MacDouglas" into "O'conner" or "Macdouglas" in some
| misguided attempt to ensure that just the first letter is
| capitalized. (And in some situations the first letter
| _shouldn't_ be capitalized, e.g. the "dos Santos" in
| "Giovani dos Santos Ramirez"[0]--which is a single
| surname, not two names.)
|
| [0] https://en.wikipedia.org/wiki/Giovani_dos_Santos
| talideon wrote:
| Oh, believe me. As somebody with a name that includes
| accents, and a surname that contains two words, with
| relatives whose names include internal capitalisation and
| apostrophes, I know _all_ about that.
|
| The thing is that canonicalisation is a kind of
| sanitisation. As you mentioned, I personally prefer it to
| be done in real time. Sometimes it can't be, however, and you
| have to resort to munging, which is on the nastier end of
| sanitisation. Here's a short story:
|
| AFNIC run the .fr registry, and they, unlike other
| registries, expect you to provide a contact's given name
| and surname separately. The joys of French bureaucracy.
| At my previous job (hosting provider and domain
| registrar), I built the company's domain management
| system. The systems in front of that didn't care about
| the form a person's name took so long as it was present,
| and most other domain registries are the same. There was
| no sensible way to get the applicant to enter them
| previously (this data was taken from the billing system).
| This necessitated that I build a library that could parse
| people's names, and I ended up developing a rather large
| number of heuristics for doing so as accurately as
| possible. It only covered the Latin alphabet, as that's
| all AFNIC would accept at the time, but it worked.
|
| The problem is that most don't put that kind of thought
| into data sanitisation, and do things such as those you
| mentioned. And that's why we can't have nice things.
| jerf wrote:
| I agree that cleanup is acceptable, and there's certainly
| some wiggle room in what people call cleanup vs.
| sanitization and such.
|
| But when people chant "sanitize your inputs" and expect
| it to be treated as sage wisdom, it's in a security
| context, and it is _wrong_ in that context. Sanitization
| is not a valid security tool. Mind you, you might be
| forced into it if your back is against the wall and you're
| working on other code that is broken and you can't
| fix that other code's broken failure to escape or
| whatever. But it's still wrong, just a wrong thing you
| were forced to do.
|
| A richer point of view is more "don't destroy data you
| don't 100% mean to destroy". Whitespace in the wrong
| place or stray nulls can meet that bar. Removing
| characters for "security" reasons doesn't. Destroying
| data to prevent security issues downstream is not a good
| idea.
| serious_habit wrote:
| If I'm reviewing code and someone is implementing escaping
| that's an immediate, massive, red flag. It's SO HARD to get
| right and there are many MANY libraries for doing it
| correctly. The scary thing is how many bugs still make it
| into these libraries.
|
| Strongly prefer using an established library and see
| designs such as https://web.dev/trusted-types.
| AnonHP wrote:
| > Sanitizing the input is wrong. Actively, objectively,
| unrecoverably wrong.
|
| I agree on the "unrecoverably" (sic) part, but strongly
| disagree on words like "objectively". It can be bad only if
| the input sanitization is poorly done. If that's poorly
| done, then it's also likely that the output sanitization
| may be poorly done. One cannot then say that output
| sanitization is objectively bad because someone doesn't
| know or care enough to do it properly.
|
| This is a complex topic that deserves more attention, not
| hand waving away with claims that cannot stand on their
| own.
| serious_habit wrote:
| Even better: never sanitize your data.
|
| You should only use templating systems which safely handle
| user data. Don't use innerHTML assignments, don't concatenate
| user data into SQL queries. Use existing, validated libraries
| for generating HTML and SQL.
| JxLS-cpgbe0 wrote:
| [deleted]
| ipaddr wrote:
| Instead of sanitizing input, you create an unsafe datastore which
| might be used in other applications later. Sanitize as soon as
| possible.
| frontiersummit wrote:
| I think it cuts both ways, as anyone who has needed to mine an
| existing data set for a new purpose can attest. Having the data
| sanitized can make your parsing job infinitely easier, while it
| can simultaneously destroy data which would have been extremely
| helpful to the new project.
| ncc-erik wrote:
| I think what makes this hard for folks is tracking what the
| expected form of data is at each step of its lifecycle,
| especially considering people working with new and unfamiliar
| codebases or splitting focus on multiple projects.
|
| There are some frameworks that try using types to solve the
| problem. Alternatively, the developers could throw in a comment
| that looks something like:
|
| // client == submits raw data ==> web_server == inserts raw data
| (param. sql stmt) ==> db_server ==> returns query with raw data
| ==> our_function == returns html-escaped data ==> client
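|
| A minimal sketch of the type-based idea in Python. The SafeHtml
| class and both functions are hypothetical names, not taken from
| any particular framework. Raw strings and already-escaped strings
| get different types, so escaping happens exactly once.

```python
import html


class SafeHtml(str):
    """Marker type: text that is already escaped for an HTML element body."""


def escape(raw: str) -> SafeHtml:
    """Escape raw text once; values already marked safe pass through."""
    if isinstance(raw, SafeHtml):
        return raw
    return SafeHtml(html.escape(raw))


def render_comment(body: str) -> SafeHtml:
    """Wrap a comment body in a paragraph, escaping it if needed."""
    return SafeHtml("<p>" + escape(body) + "</p>")
```

| Because escape() is idempotent on SafeHtml values, passing data
| through it twice does not double-escape.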
| billpg wrote:
| Shameless plug: NEVER Sanitize Your Inputs (by me, 2013)
| https://billpg.com/never-sanitize-your-inputs/
| Sebb767 wrote:
| > The parallel for SQL injection might be if you're building a
| data charting tool that allows users to enter arbitrary SQL
| queries. You might want to allow them to enter SELECT queries but
| not data-modification queries. In these cases you're best off
| using a proper SQL parser [...] to ensure it's a well-formed
| SELECT query - but doing this correctly is not trivial, so be
| sure to get security review.
|
| If you are ever in this situation, you should actually use a
| dedicated read-only user that can only access the relevant data.
| If you need to hide columns, use views. Trying to parse SQL can
| easily go very wrong, especially when someone (ab-)uses the edge
| cases of your DB.
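|
| A rough illustration of the same idea, with one substitution:
| SQLite has no user accounts, so a read-only connection (mode=ro
| in the URI) stands in here for a database user that only has
| SELECT rights. The table and function names are made up.

```python
import os
import sqlite3
import tempfile


def open_read_only(path: str) -> sqlite3.Connection:
    # mode=ro in a SQLite URI makes the connection reject all writes.
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)


def demo():
    """Create a small database, then query it through a read-only handle."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "charts.db")

        # Set up sample data with a normal read-write connection.
        rw = sqlite3.connect(path)
        rw.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
        rw.execute("INSERT INTO sales VALUES ('north', 10)")
        rw.commit()
        rw.close()

        # User-supplied queries would only ever run on this connection.
        ro = open_read_only(path)
        rows = ro.execute("SELECT region, amount FROM sales").fetchall()
        try:
            ro.execute("DELETE FROM sales")
            write_blocked = False
        except sqlite3.OperationalError:
            write_blocked = True
        ro.close()
        return rows, write_blocked
```

| The database itself refuses the write, so no SQL parsing is
| needed to tell SELECT apart from data-modification queries.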
___________________________________________________________________
(page generated 2022-01-13 23:01 UTC)