[HN Gopher] Don't try to sanitize input, escape output (2020)
       ___________________________________________________________________
        
       Author : maple3142
       Score  : 105 points
       Date   : 2022-01-13 14:48 UTC (8 hours ago)
        
 (HTM) web link (benhoyt.com)
 (TXT) w3m dump (benhoyt.com)
        
       | gumby wrote:
       | Since you don't know where your output will end up how could you
       | possibly know the syntax to escape it?
       | 
       | And how can the consumer of an arbitrary string trust that every
       | input will have been properly escaped?
        
       | whoopdedo wrote:
       | Sounds like a restatement of Postel's robustness principle[1].
       | Did it go out of style to "be conservative in what you send, be
       | liberal in what you accept" and we need to relearn it again?
       | 
       | Well, perhaps it did. History has shown the dangers of not
       | handling malformed input well. Postel's principle has received
       | scrutiny[2] for reinforcing those mistakes by creating a mistaken
       | belief in robustness. More recent recommendations have been to be
       | stricter in handling of inputs[3].
       | 
       | But I think there is some confusion between robustness and
       | defensiveness. "Be liberal in what you accept" may be confused
       | with "don't sanitize your inputs" when not sanitizing is the less
       | liberal action. Robustness means the program should not fail if
       | it receives input it didn't expect. A program that crashes,
       | hangs, executes unintended shell code, mangles the data, changes
       | the thermostat, or other undefined behavior is not being robust.
       | To prevent that from happening, data must be sanitized at
       | input so that it can be processed without those side-effects. The
       | examples of programs failing robustness have been because they
       | were insufficiently defensive.
       | 
       | The bigger issue is that robustness doesn't scale easily. You may
       | know how your bit of code will deal with malformed data, but what
       | about every other library you use? Or other systems you
       | communicate with? It becomes a backstage problem, where once
       | someone has gained access to a restricted area it's assumed they
       | are authorized to be there. The further down the tech stack you
       | go the less likely the code will be defensive. That puts a burden
       | on the public-facing sanity checks to anticipate how relaxed they
       | can be about the input.
       | 
       | If you change the definition of output to include internal-
       | outputs, then Postel's principle gets new life. That is, try not
       | to program the entire system and ecosystem at once, but treat
       | each software component as an island. Be liberal not only with
       | the data you receive from the end-user, but also with return
       | values from functions. Be conservative and escape not only your
       | generated HTML, but also the SQL statements you dispatch to the
       | backend. This is what input sanitizing is actually about, it's
       | keeping the promise to the other parts of your program that your
       | code isn't going to give them bad data. That's also what the
       | linked article is saying, because the HTML being generated is
       | itself one component in a chain of programs that includes the
       | end-user's browser.
       | 
       | [1] https://en.wikipedia.org/wiki/Robustness_principle
       | 
       | [2]
       | https://programmingisterrible.com/post/42215715657/postels-p...
       | 
       | [3] https://datatracker.ietf.org/doc/html/draft-iab-protocol-
       | mai...
        
       | gkoberger wrote:
       | This solution doesn't match the problem. Even the SQL injection
       | example shows him sanitizing the input, which is at odds with the
       | title of the post. Log4J is a more recent example of it being too
       | late/useless to escape the output.
        
         | ashearer wrote:
         | This is an example of why the term "sanitize" just brings
         | confusion and leads to incorrect software. If we say "escape"
         | (for concatenation) or "parameterize" (for discrete arguments)
         | instead, then there's no confusion: we know that it should be
         | done at the point of use, because the procedure for doing so
         | depends on that use.
         | 
         | Calling it "sanitization" implies that the data is somehow
         | dirty, so naturally it should be cleaned as soon as possible,
         | and after that it's safe. But all that accomplishes in general
         | is corrupting the data, often in an unrecoverable way, and then
         | opening up security vulnerabilities because the specific use
         | doesn't happen to exactly match the sanitization done in
         | advance.
         | 
         | It's great to validate the data on input and make it conform to
         | the correct domain of values, but conflating this with output
         | formats and expecting this to take care of downstream security
         | as well just leads to incorrect data along with security
         | vulnerabilities.
         | 
         | PHP's long-ago-removed magic quotes feature was an example of
         | this confusion in action. It not only mangled incoming strings
         | containing single quotes in an effort to prevent SQL injection,
         | but did so in a way that left some databases completely
         | exposed, depending on their quoting syntax.
        
         | brodouevencode wrote:
         | Yeah little Bobby Droptables is still a thing.
        
         | marcosdumay wrote:
         | What?
         | 
         | SQL injection is avoided at the point of usage. Trying to
         | sanitize your input against it is an extremely bad practice.
         | The same is true of HTML injection (whether you call it XSS
         | or something else).
         | 
         | Log4j is a case of interpreting text that the developer never
         | realized would be treated as code. It's kind of the extreme
         | opposite of escaping your text at the point of use.
        
           | gkoberger wrote:
           | The article says DON'T sanitize when putting it into the
           | database. I think contextual escaping counts as "sanitizing
           | input", so the solution of "don't try to sanitize input" is
           | undermined.
        
             | bcrosby95 wrote:
             | For a long while, input sanitization in the web world was
             | about modifying inputs to strip the problem areas. As such
             | many consider escaping and sanitization to be completely
             | different practices.
             | 
             | It seems like this article is using this differentiation.
             | In my experience, it's very common. It's not worth arguing
             | about.
        
             | shawnz wrote:
             | I interpreted the message as not sanitizing inputs at the
             | point they are received, a la PHP magic quotes. Instead,
             | escape at the output (the output to the database engine).
        
               | gkoberger wrote:
               | Nowhere in the article do they use "output" to mean from
               | the database engine; they use it to mean "outputting
               | HTML".
        
               | shawnz wrote:
               | The article doesn't explicitly say the words "outputting
               | SQL to the database engine", but that's because the focus
               | is on XSS attacks and the part about SQL injection is
               | just an aside. Clearly it's what they were trying to
               | imply with language like this:
               | 
               | > The only code that knows what characters are dangerous
               | is the code that's outputting in a given context. And of
               | course use your SQL engine's parameterized query features
               | so it properly escapes variables when building SQL: ...
               | This is sometimes called "contextual escaping".
               | 
               | The "context" is that you are outputting to the database
               | engine.
        
               | dotancohen wrote:
               | > your SQL engine's parameterized query features so
               | > it properly escapes variables when building SQL
               | 
               | This is wrong. Parameterized queries do not build an SQL
               | string by escaping the input. The input is actually sent
               | to the database separately from the SQL.
               | 
               | Well, in all sane implementations, anyway. PHP has a
               | PDO::ATTR_EMULATE_PREPARES option that does build SQL
               | from a parameterized query. And, of course, Wordpress has
               | $wpdb->prepare() that returns an SQL string with the
               | parameter escaped. Also, so far as I know, one cannot run
               | a prepared statement from the SQLite CLI, so no
               | parameterized queries there either:
               | 
               | https://stackoverflow.com/questions/20065990/how-to-
               | prepare-...
        
               | shawnz wrote:
               | Sure, maybe it does not literally send a substituted SQL
               | string, but in order to send the parameters "separately"
               | from the query, do they not still eventually get
               | concatenated into a single binary string of some form to
               | be sent across the wire? In spirit I think the same
               | arguments apply there, it's just that the format of the
               | data is not strictly SQL. It's actually the wire format
               | of the database protocol.
        
               | dotancohen wrote:
               | You are correct that the parameters go across the wire,
               | obviously, but I've never heard of an attack in which the
               | parameters caused any type of compromise in the wire
               | protocol. I would highly appreciate examples if any
               | exist.
        
               | shawnz wrote:
               | It probably wouldn't result in an attack (unless you were
               | dealing with a really sophisticated attacker), it's just
               | necessary for correctness. Which is also true of all
               | these examples: for example, people won't appreciate
               | having backslashes wrongly inserted around legitimate
               | characters of their names or other personal information,
               | or having the software fail to process their request due
               | to the characters in their name. It's not _just_ a
               | security concern.
               | 
               | In the general case there are certainly many examples of
               | security vulnerabilities created by wrong serialization
               | of data into the wire protocols of services, but maybe
               | not specifically for this situation of query parameters.
               | But maybe there are, I have no idea really. Either way,
               | it's not the application developer's responsibility at
               | that point, it's the responsibility of the people who
               | developed the database driver.
        
               | Arnavion wrote:
               | >This is wrong. Parameterized queries do not build an SQL
               | string by escaping the input. The input is actually sent
               | to the database separately from the SQL.
               | 
               | Your blanket observation is not necessarily true of all
               | databases or database drivers. You found three counter-
               | examples yourself, but there's no reason to not consider
               | them "sane". It's not less correct than for databases
               | that do support prepared statements in the driver
               | protocol.
        
               | marcosdumay wrote:
               | > a la PHP magic quotes
               | 
               | Up to this day, the official way to deal with XSS in .Net
               | is by doing sanitization at the receiving point. I
               | imagine the article is directed at that.
        
               | shawnz wrote:
               | That sounds pretty terrible, do you have an example of
               | some docs which demonstrate that practice?
        
             | marcosdumay wrote:
             | If the user says his name is "Bob'; drop tables students
             | --", that is what you should store on your database.
             | Unless, of course it's not a valid name for the rest of the
             | system.
             | 
             | That's such old and obvious advice that I'm surprised people
             | keep posting here and upvoting. And even more surprised
             | when people keep disagreeing here.
        
               | ehutch79 wrote:
               | If you're storing "Bob'; drop tables students --" in the
               | database, you had to have sanitized your inputs, or there
               | would be no students table.
               | 
               | The article title says NOT to sanitize inputs. Perhaps
               | it's that nuance doesn't fit in a headline, but eh...
        
               | wvenable wrote:
               | The confusion is what is input and what is output. The
               | string "Bob'; drop tables students --" should not be
               | sanitized/encoded on *input* _to the application_.
               | However, if you're not using parameterized queries, it
               | should be encoded on *output* _to the database_.
               | 
               | Data should only be sanitized in transit and not stored
               | in a sanitized form. That's what the article is really
               | saying.
        
               | IshKebab wrote:
               | No you don't. You use a parameterized query:
               | execute("INSERT INTO foo VALUES (?)", user_input)
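
The one-line `execute(...)` call above can be expanded into a minimal runnable sketch. It assumes Python's built-in `sqlite3` module; the `students` table and the `store_name` helper are hypothetical names chosen for illustration, not anything from the article.

```python
import sqlite3


def store_name(conn: sqlite3.Connection, name: str) -> None:
    # The "?" placeholder hands `name` to the driver as a bound value,
    # so its contents are never parsed as SQL text.
    conn.execute("INSERT INTO students (name) VALUES (?)", (name,))


# Hypothetical in-memory database for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT)")

hostile = "Bob'; DROP TABLE students; --"
store_name(conn, hostile)

# The name comes back exactly as entered, and the table still exists.
stored = conn.execute("SELECT name FROM students").fetchone()[0]
row_count = conn.execute("SELECT count(*) FROM students").fetchone()[0]
```

Nothing was stripped or escaped on the way in; the application stores the string verbatim and leaves HTML or JSON escaping to whichever code later outputs it.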
        
         | hombre_fatal wrote:
         | What are you referring to? The SQL injection example is showing
         | what not to do.
        
           | gkoberger wrote:
           | "So the better approach is to store whatever [data] the user
           | enters verbatim, and then have the template system HTML-
           | escape when outputting HTML"
           | 
           | With this logic, someone could use a SQL injection. It
           | wouldn't be sanitized as the INSERT is happening, so the SQL
           | injection would be executed.
           | 
           | EDIT: I know he goes on to talk about escaping characters,
           | but the title of the post is "Don't try to sanitize input".
           | My point is simply that SQL injections happen on input, not
           | output. His example of escaping the SQL is at odds with the
           | title of the post.
        
             | hombre_fatal wrote:
             | They show the solution of using parameterized queries to
             | store the user input verbatim. What is an example of the
             | attack you have in mind?
        
             | jerf wrote:
             | Most SQL systems have bind parameters for this sort of
             | thing. That is a form of encoding the input. You have to
             | encode the SQL values as well. You're basically saying if
             | you don't use the suggested technique, the suggested
             | technique doesn't work. Well, yeah. It has to be used
             | consistently, all the time, every time.
             | 
             | Unfortunately, that's just life. There's no way around it.
             | One way or another you're going to be doing something or
             | you're going to get owned.
        
             | kam wrote:
             | They're calling the SQL query "output" (from the app to the
             | DB server). The point is that the "bad characters" depend
             | on the context, so it's the step where you combine trusted
             | and untrusted data that you need to think about escaping or
             | validating.
        
               | gkoberger wrote:
               | No they're not. They're using the word "output" to mean
               | "back into the HTML".
               | 
               | "So the better approach is to store whatever name the
               | user enters verbatim, and then have the template system
               | HTML-escape when outputting HTML, or properly escape JSON
               | when outputting JSON and JavaScript."
        
               | kam wrote:
               | The sentence immediately after that is "And of course use
               | your SQL engine's parameterized query features so it
               | properly escapes variables when building SQL"
        
             | vlovich123 wrote:
             | Sorry, how does this happen if you're using DB parameters in
             | the query string?
        
         | amalcon wrote:
         | The article is specifically about sanitizing inputs to prevent
         | XSS attacks. Sanitizing input isn't a great defense against
         | that; you need a defense that better matches the attack.
         | 
         | Validating or sanitizing input is a reasonably good
         | defense against certain other things. E.g. zeroes in values
         | you'll later divide by, when it's too late to return an error;
         | multi-gigabyte names; information that you want to avoid
         | storing like credit card numbers. That sort of use case doesn't
         | really have a whole lot to do with the article, though.
        
       | wnoise wrote:
       | Compare with https://lexi-lambda.github.io/blog/2019/11/05/parse-
       | don-t-va...
        
         | kevincox wrote:
         | Both are good advice. I have seen really funny bugs where
         | Java accepted non-ascii numbers in an IP address but the C++
         | control plane very much did not. If the re-serialized version
         | was sent to the backend this wouldn't have been an issue.
         | 
         | But the domains are different. Data validation is ensuring that
         | the information is something that your system accepts. Data
         | encoding is used when you are serializing information. You
         | should very likely validate on input, but not "sanitize" or
         | encode. You do your encoding on output.
        
         | dnautics wrote:
         | I think the domain models are different. "parse-don't-validate"
         | is great when your users are internal and trusted (e.g. a
         | library that does codegen - the operators of the parser are
         | already in the codebase). When your users are potentially
         | hostile, you should at some level have a separate validate and
         | eject strategy.
        
           | mjw1007 wrote:
           | I think "parse, don't validate" is an improvement on what the
           | author of this article recommends:
           | 
           | << So in cases where you do need to "echo" raw user input,
           | carefully filter input based on a restrictive whitelist, and
           | store the result in the database. When you come to output it,
           | output it as stored without escaping. >>
           | 
           | I think the "parse, don't validate" approach comes out as
           | follows:
           | 
           | - take the list of things you would have included on your
           | whitelist
           | 
           | - add nodes for them to your internal representation for
           | parsed markdown
           | 
           | - extend your markdown parser to convert html-like input into
           | those nodes
           | 
           | - implement output for those nodes in a similar way to normal
           | markdown
           | 
           | This way, given the "escape output" they recommend, it's
           | harder for any variant of the input that you hadn't
           | considered to have harmful effects.
        
             | yakshaving_jgt wrote:
             | No, the _parse, don't validate_ idea is completely
             | unrelated. It's about leveraging a type system like in
             | Haskell. It's about parsing a value into a type with a
             | narrower domain which in turn minimises the amount of
             | control flow needed to implement a sufficiently correct
             | program.
        
               | mjw1007 wrote:
               | I agree with "It's about parsing a value into a type with
               | a narrower domain", but I don't see how you get to the
               | first sentence.
               | 
               | In their example of a markdown renderer, the internal-
               | representation node is the type with the narrower domain.
        
           | yakshaving_jgt wrote:
           | The idea of parsing over validation is just as applicable
           | with untrusted input as with trusted input. The idea is more
           | about system design rather than the prevention of security
           | vulnerabilities.
        
             | dnautics wrote:
             | come again? I didn't say you can't validate while parsing
             | for untrusted input, I said for untrusted input you will
             | probably STILL need additional separate validation methods.
             | The key point is negating the absolute imperative _don't_ in
             | the original aphorism.
        
               | yakshaving_jgt wrote:
               | I didn't say that that's what you said. I tried to
               | communicate that whether or not the input is trusted is
               | beside the point.
               | 
               | Unless I'm deeply confused, the idea in _parse, don't
               | validate_ is about doing something like this:
               | 
               |     parseFoo :: Text -> Maybe Foo
               |     parseFoo t =
               |       if textIsAFoo t
               |         then Just (Foo t)
               |         else Nothing
               | 
               |     f :: Foo -> IO ()
               |     f = _
               | 
               | Rather than something like this (which seems to be more
               | common):
               | 
               |     f :: Text -> IO ()
               |     f t =
               |       when (textIsAFoo t) g
               |       where g = _
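
For readers who don't write Haskell, the same shape can be sketched in Python. This is an illustration only: `text_is_a_foo` is a hypothetical rule standing in for `textIsAFoo` (here, non-empty ASCII digits), and Python's type hints are enforced only by external checkers such as mypy, so the guarantee is weaker than in the Haskell version.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Foo:
    """A string that has already passed the Foo rule."""
    value: str


def text_is_a_foo(t: str) -> bool:
    # Hypothetical rule: non-empty and made of ASCII digits only.
    return t.isascii() and t.isdigit()


def parse_foo(t: str) -> Optional[Foo]:
    # Parse once at the boundary: return the narrower type, or None.
    return Foo(t) if text_is_a_foo(t) else None


def f(foo: Foo) -> str:
    # Takes Foo rather than str, so the rule is not re-checked here.
    return f"processing {foo.value}"


parsed = parse_foo("123")
rejected = parse_foo("12a")
# Arabic-Indic digits satisfy str.isdigit() but are not ASCII.
non_ascii = parse_foo("\u0661\u0662\u0663")
```

Code that only ever receives a `Foo` has one fewer failure case to guard against, which is the point made about control flow elsewhere in this thread.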
        
               | dnautics wrote:
               | I thought the idea of parse, don't validate is that your
               | parser should contain validation logic.
               | 
               | So instead of
               | 
               | text -> generic json parser -> validate json -> ...
               | 
               | you would do
               | 
               | text -> custom json parser that stops if encounters
               | "incorrect content" -> ...
        
               | yakshaving_jgt wrote:
               | You can think of the parser as containing validation
               | logic -- it can parse the value into a more constrained
               | type if it conforms to validation rules, or it can fail.
               | 
               | The point is that once your value is in a more principled
               | type, the rest of the system is free from having to make
               | assumptions (and guard against potential failures) about
               | the breadth of that value's domain.
               | 
               | As the article mentions, this is really only relevant to
               | languages with proper type systems like Haskell.
        
         | marcosdumay wrote:
         | Two different pieces of advice for two different things. That one is
         | about data validation, making sure it is coherent and fits your
         | data quality rules. This one is about data encoding, making
         | sure it fits a different system's rules.
        
       | ffhhj wrote:
       | sanitize (client side) => confirm with user => trim+escape
       | (server side) => insert
        
         | InitialBP wrote:
         | Alternatively...
         | 
         | validate (client side) => insert using sql parameterization
         | (escaping) => escape per context when outputting
         | 
         | Sanitizing is the idea that you are cleaning dangerous things
         | from the original input (different from validating, which is
         | preventing users from entering characters that don't conform to
         | what your program expects).
         | 
         | One BIG issue here is that validation is generally clear to the
         | user ("That is an invalid email address") whereas sanitization
         | normally doesn't consult or inform the user that there were
         | changes and may result in unexpected things happening from a
         | user perspective.
         | 
         | From article: name is "John O'Brien" now displays as "John
         | OBrien" (this is a trivial example but still an issue)
         | 
         | The name thing is a great example of things you might not
         | expect your users to do but are still totally valid use cases.
         | Sanitization can be extremely frustrating from a user
         | perspective.
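
The difference between the two approaches can be made concrete with a small sketch. Both functions and the 200-character limit are hypothetical, chosen only to mirror the "John O'Brien" example from the article.

```python
import re
from typing import Optional


def sanitize_name(name: str) -> str:
    # Silent sanitization: deletes "dangerous" characters and tells nobody.
    return re.sub(r"['\"<>;]", "", name)


def validate_name(name: str) -> Optional[str]:
    """Return an error message for the user, or None if the name is fine."""
    stripped = name.strip()
    if not stripped:
        return "Name must not be empty."
    if len(stripped) > 200:  # hypothetical limit
        return "Name must be at most 200 characters."
    return None


mangled = sanitize_name("John O'Brien")
error = validate_name("John O'Brien")
empty_error = validate_name("   ")
```

Here `sanitize_name` quietly turns the name into "John OBrien", while `validate_name` accepts it unchanged and only reports problems the user can see and fix.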
        
         | dagss wrote:
         | Why "escape"? Just insert. Using SQL parameters.
        
           | InitialBP wrote:
           | Insert using SQL parameters is "escaping". The
           | parameterization ensures that the data being passed gets
           | interpreted by the DB as the expected data type by ensuring
           | special characters aren't interpreted as "special" in that
           | context.
        
       | 1970-01-01 wrote:
       | Why not both?
        
         | pwdisswordfish9 wrote:
         | Because if someone wants to register with the name O'Malley,
         | you should not refuse them, or worse, mangle their name.
        
           | 1970-01-01 wrote:
           | If someone wants to register with the username O'Malley and
           | password O`Malley, would you let them?
        
             | simonw wrote:
             | That's not sanitization, that's validation. You implement a
             | validation rule that says that the password and the
             | username can't be the same thing, then use that to
             | redisplay the registration form with an error message.
        
               | 1970-01-01 wrote:
               | Sometimes, you just sanitize input for user reasons. The
               | username and password above are different ASCII strings.
               | The non-tech savvy (very elder) user does not know the
               | difference between "`" and "'" which means they are
               | locked-out. This results in phone calls to support, which
               | in volume result in "fix the password field"
               | 
               | See https://www.cl.cam.ac.uk/~mgk25/ucs/apostrophe.html
        
               | simonw wrote:
               | I'd call that normalization rather than sanitization, but
               | that's my own personal terminology, not necessarily
               | terminology that's widely used.
        
           | Avamander wrote:
           | Then that sanitization is incorrect. I don't think this
           | discussion would have any merit if we're speaking about
           | incorrect implementations.
        
             | pjerem wrote:
             | Ok, and imagine we are on an Internet forum, talking about
             | what is correct sanitization, and I want to make the
             | following example : <script>alert(42)</script>
             | 
             | Will HN remove the <> characters and make my comment
             | incomprehensible or will it escape it on output, preserving
             | all the meaning ?
             | 
             | (Well, I'll know after hitting Reply)
             | 
             | Edit : good boy, HN
        
             | nybble41 wrote:
             | You can't sanitize "correctly" if you don't know where the
             | data will be used. This is exactly why the article
             | advocates for escaping _output_ (e.g. immediately before
             | inserting a string into a SQL query) rather than sanitizing
             | _input_ (e.g. by deleting single-quotes or other
             | potentially problematic characters from strings as soon as
             | they're received).
        
       | chriswarbo wrote:
       | The fundamental problem is attempting to conflate a bunch of
       | semantically-distinct things, just because they might happen to
       | (sometimes) be represented in memory by similar byte sequences.
       | 
       | Such 'byte coincidences' lead to lazy, non-sensical operations,
       | like "append this user-provided name to that SQL statement";
       | implemented by munging together a bunch of bytes, without thought
       | for how they'll be interpreted.
       | 
       | A much better solution is to ignore whether things might just-so-
       | happen to be represented in a similar way in memory; and instead
       | keep things distinct if they have different semantic meanings
       | (like "name", "SQL statement", "HTML source", "shell command",
       | "form input", etc.). That way, if we try to do non-sensical
       | things like appending user input to HTML, we'll get an
       | informative error message that there is no such operation.
       | 
       | This isn't hard; but it requires more careful thought about APIs.
       | Unfortunately many languages (and now frameworks) have APIs
       | littered with "String"; ignoring any distinctions between values,
       | and hence allowing anything to be plugged into anything else (AKA
       | injection vulnerabilities)
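
One way to sketch the "keep semantically distinct things distinct" idea is a wrapper type that refuses to mix with plain strings. Everything here is hypothetical (`Html`, `text`, `tag`); the only library call is the standard `html.escape`.

```python
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class Html:
    """Markup that is already safe to send to a browser."""
    markup: str

    def __add__(self, other: object) -> "Html":
        if not isinstance(other, Html):
            # A plain str is untrusted text; there is no such operation.
            return NotImplemented
        return Html(self.markup + other.markup)


def text(value: str) -> Html:
    """The only route from untrusted text to Html: escape it."""
    return Html(html.escape(value))


def tag(name: str, body: Html) -> Html:
    # `name` is assumed to be a trusted, developer-supplied tag name.
    return Html(f"<{name}>") + body + Html(f"</{name}>")


page = tag("p", text("<script>alert(42)</script>"))

try:
    Html("<p>") + "user input"
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
```

Appending a raw `str` to `Html` raises a `TypeError` instead of silently producing injectable markup, so the mistake surfaces as an error rather than a vulnerability.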
        
       | dang wrote:
       | Discussed at the time:
       | 
       |  _Don't try to sanitize input - escape output_ -
       | https://news.ycombinator.com/item?id=22431022 - Feb 2020 (280
       | comments)
        
       | parhamn wrote:
       | It's cool to see how these posts are becoming less and less
       | important in the wake of today's frameworks/tools protecting devs
       | by default.
       | 
       | From ORMs escaping SQL, to FE frameworks escaping html/js, to
       | browsers starting to default to same-site=lax. It feels like
       | we've slowly pulled ourselves out of OWASP hell. Pretty nice to
       | see!
       | 
       | Obviously it's still important (see log4j) to know it all
       | especially when it's not so clear cut, but still good progress.
        
         | erosenbe0 wrote:
         | I think we really failed in earlier eras to get it right due to
         | the momentum of the frameworks.
         | 
         | I would liken it to some of the crap building materials that were
         | allowed in the past as new, cheap alternatives but subsequently
         | showed failure or hazards after short service lives.
         | Contractors were tasked with implementing these materials to
         | stay within budget and everyone suffered the effects later.
        
       | nostrademons wrote:
       | I think a better way to think of this may be in terms of
       | _canonicalization_. Inside your application, you should decide on
       | a single canonical way to represent data, one which fits the type
       | of processing and expected use of the application. For example,
       | you might decide that all strings should be UTF8, and should be
       | interpreted (and stored) as whatever the user initially wrote.
       | You might decide that any structured data should be parsed and
       | then stored as protobufs in a BigTable. Or you might decide that
       | an RDBMS is your native datastore and use whatever the native
       | string encoding is for it, as well as parse & normalize data
       | into tables upon input.
       | 
       | Then, whenever you take input, your job is to _validate_ and
       | _encode_ it. If you get a Windows-1252 string, you should re-
       | encode it to utf8 for further storage. If it has data that are
       | invalid UTF-8 codepoints, you should either strip, replace with a
       | replacement character, or notify the user with a validation
       | failure. Same with structured data that fails your normalization
       | rules - you should usually notify the user.
       | 
       | And when you send _output_, you should escape based on the
       | intended output device. If you're putting it in an HTML page,
       | HTML-escape it. If it's a URL, url-encode it. If it's a database
       | query, SQL escape it. If it's a CSV, quote it.
       | 
       | Thinking in these terms keeps the internal logic of your
       | application simple (there are no format conversions except at
       | system boundaries), and it also gives you a lot of flexibility to
       | preserve the user's intent and add new output formats later.
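
A minimal sketch of the "escape for the output device" step described above, using only the Python standard library. The sample string and the example.com URL are made up for illustration; the stored value stays verbatim and each output context gets its own encoding.

```python
import csv
import html
import io
from urllib.parse import quote

# Verbatim user input, stored exactly as written (illustrative value).
user_text = 'Tom & "Jerry" <b>/?x=1'

# HTML element body or attribute: escape &, <, > and quotes.
html_out = "<p>" + html.escape(user_text, quote=True) + "</p>"

# URL query value: percent-encode everything outside the unreserved set.
url_out = "https://example.com/search?q=" + quote(user_text, safe="")

# CSV: let the csv module decide when and how to quote the field.
buf = io.StringIO()
csv.writer(buf).writerow(["id-1", user_text])
csv_out = buf.getvalue()

print(html_out)
print(url_out)
print(csv_out, end="")
```

The same stored string produces three different encodings, which is why the encoding cannot be chosen at input time.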
        
         | platz wrote:
         | so you would prevent stored XSS attacks by escaping on the
         | output step instead of the _canonicalization_ step
        
           | simonw wrote:
           | Right - the way to avoid XSS is to escape on output.
           | 
           | Most good template languages these days implement auto-
           | escaping of variables that are interpolated into HTML.
           | 
           | You still have to be careful embedding content into non-HTML
           | contexts. One classic example there is outputting a blob of
           | JSON inside a <script> tag - you need to make sure that you
           | handle the case where a string could contain
           | "</script><script>evil_code_here()</script>".
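
One way to handle the JSON-inside-`<script>` case mentioned above, sketched in Python. The helper name `json_for_script` is hypothetical; the idea is to replace `<`, `>` and `&` with their `\uXXXX` JSON escapes so the serialized text can never contain `</script>`. Those characters only occur inside JSON strings, so the result is still valid JSON with the same meaning.

```python
import json

def json_for_script(value):
    """Serialize value as JSON that is safe to place inside <script>."""
    text = json.dumps(value)
    # Swap characters that could close the script block or open an HTML
    # comment for \uXXXX escapes, which JSON parsers decode back.
    return (text.replace("<", "\\u003c")
                .replace(">", "\\u003e")
                .replace("&", "\\u0026"))

payload = {"bio": "</script><script>evil_code_here()</script>"}
embedded = json_for_script(payload)
page = "<script>var data = " + embedded + ";</script>"
print(page)
```

Many template engines ship an equivalent filter; prefer that over a hand-rolled helper where one is available.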
        
             | colejohnson66 wrote:
             | React (technically JSX) has a nice feature where all output
             | is escaped. So this doesn't work:
             | 
             |     const evil = "<script>alert('')</script>";
             |     ..
             |     <div>{evil}</div>
             | 
             | That'll output:
             | 
             |     <div>&lt;script&gt;alert('')&lt;/script&gt;</div>
             | 
             | If you _must_ output a raw HTML string, they make you
             | acknowledge that you're aware of what you're doing:
             | 
             |     <div dangerouslySetInnerHTML={{__html: evil}} />
        
             | zelphirkalt wrote:
             | Also, in languages that do not treat HTML as a simple string
             | (looking at PHP and many others), or that have libraries
             | for exactly that, any data placed inside an HTML element as
             | text is automatically escaped as text, with no overhead for
             | the developer.
        
       | scotty79 wrote:
       | I'm really surprised by the discussion here. It's so obviously
       | true, and I realized this when the correct PHP function to escape
       | a string for SQL was named mysql_real_escape_string.
        
       | joering2 wrote:
       | Every online form where a user can interact and send data back to
       | a server is always a nightmare in terms of security. I do utilize
       | mod_security, but with my next project, I have an idea of doing
       | "base64" on everything in the client's browser via JavaScript,
       | then sending it to the server and checking on the backend whether
       | the content is valid base64. Is that a good concept?
        
         | afavour wrote:
         | Unfortunately that wouldn't help with a whole lot. The danger
         | with input is that it could be used to e.g. escape a SQL query
         | and delete your database. Which is why we now have
         | parameterised queries and such to help alleviate those worries.
         | 
         | If you think about it the process you're describing already
         | happens: the browser sends the user's input as (usually) UTF8
         | string data, then the server decodes it. Changing that process
         | to base64 wouldn't change much.
        
         | [deleted]
        
         | scotty79 wrote:
         | Only if you never decode it from base64. :)
        
         | adrr wrote:
         | Wouldn't base64ing your inputs bypass mod_security?
        
         | joering2 wrote:
         | ok thank you everyone for your responses (+1s) - I was
         | researching this idea and couldn't find anything online - now I
         | know why!
        
         | lesquivemeau wrote:
         | Wouldn't prevent XSS afaik
        
         | justinsaccount wrote:
         | That could work if you are just going to store things as
         | base64.
         | 
         | It accomplishes nothing if you are going to decode the base64
         | on the backend and then use the original value as-is. If
         | anything it's worse than nothing, because now mod_security will
         | just see the base64 content and might fail to detect certain
         | attacks.
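
A small sketch of why the base64 idea in this subthread does not add protection, assuming the backend decodes the value before using it. The payload is the illustrative one used elsewhere in the thread.

```python
import base64

attack = "'; DROP TABLE users; --"

# What the proposed client-side script would put on the wire.
wire = base64.b64encode(attack.encode("utf-8")).decode("ascii")

# The backend check "is this valid base64?" passes for any payload,
# and decoding hands back exactly the original bytes.
decoded = base64.b64decode(wire, validate=True).decode("utf-8")

print(wire)
print(decoded == attack)
```

The encoded form contains none of the characters a filter would look for, which is the point made above about request inspection seeing only base64.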
        
       | blibble wrote:
       | guess I'll just put that 2gb "first name" directly into my
       | database then
        
       | AtNightWeCode wrote:
       | No, garbage in, garbage out. Sure, things like log or SQL
       | injections should not be solved only by sanitizing. You solve
       | those by separating data and code. But a lot of the time you
       | really want to store data in a structured, canonical way.
       | Usernames, for instance: it is bad if Unicode trickery lets you
       | create multiple usernames that look the same. Or product
       | descriptions: it is bad if your ML needs to handle HTML, and so
       | on.
        
         | kevincox wrote:
         | This is wrong. If I leave a comment `'; DROP TABLE users; --`
         | You should display it back in the app as exactly that. If you
         | put it into an HTML attribute you escape the `'` and if you
         | stick it in SQL you use parametrized statements.
         | 
         | There is nothing "wrong" with that initial input. What is wrong
         | is pasting it into an SQL string, HTML element, HTML attribute,
         | URL parameter or anywhere else without properly encoding it.
         | 
         | This is the main reason you can't "sanitize" input. You need to
         | know what the output format is to properly encode it. There are
         | different requirements if you are pasting it into a sed
         | replacement command vs HTML attribute vs HTML element body. You
         | can strip everything except a-zA-Z and cross your fingers but
         | even that isn't necessarily sufficient for all output formats.
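
A minimal sketch of the parametrized-statement point above, using Python's built-in `sqlite3` with an in-memory database. The table names are made up for illustration. The comment text is passed as a bound parameter, so it is stored and returned unchanged and is never parsed as SQL.

```python
import sqlite3

comment = "'; DROP TABLE users; --"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("CREATE TABLE comments (body TEXT)")

# The ? placeholder sends the value separately from the SQL text,
# so the quote and semicolon are treated as data only.
conn.execute("INSERT INTO comments (body) VALUES (?)", (comment,))

stored = conn.execute("SELECT body FROM comments").fetchone()[0]
tables = {row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")}

print(stored)
print(sorted(tables))
```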
        
           | ehutch79 wrote:
           | using parameterized statements is sanitizing inputs into the
           | database.
        
             | kevincox wrote:
             | The database is "outside" of your application server. You
             | communicate with the database using statements and when you
             | get the value back from the database it is unchanged. The
             | encoding was just for transfer, no data has actually been
             | changed.
        
           | AtNightWeCode wrote:
           | Maybe a better way to put it is that you should be smart about
           | why, when, and where to sanitize your data. A comment on a
           | forum should not remove "'; DO BAD THINGS;". Why would it? It
           | is just text in probably some UTF8 encoding. No viable web
           | framework will write it out in a raw format if you do not
           | explicitly ask for it. In SQL you use parameters. But as I
           | wrote in my original comment, there are several scenarios
           | (if you work with the web, probably most cases) where you
           | really want to make sure that what you have stored is a clean,
           | structured, canonical data representation. Not only for your
           | security but also for third-party consumers and analysis.
           | 
           | I understand that everybody who sells NoSQL solutions
           | disagrees.
        
       | hamilyon2 wrote:
       | Sanitizing inputs is not what you realistically want. You should
       | prohibit certain types of input. Whitelisting strings is what I
       | would call it.
       | 
       | You should escape outputs, of course (not that anyone in 2022
       | thinks otherwise).
       | 
       | Why escaping outputs alone won't work is because user inputs will
       | be stored in some database and you can't realistically predict
       | how, when, where it will be used. Years in the future. User name
       | could be used as a filename once, opening up possibility of
       | shell-based exploit. It could trigger a little-known spreadsheet
       | formula vulnerability when exported for analysis. Novel,
       | interesting xss attacks are common and produced every day. It
       | might not even be your code, but code that your client or partner
       | organisation runs. You just never know.
       | 
       | One common defence is user names (and other freeform fields)
       | should not be allowed to be arbitrary bytes.
       | 
       | That is defence in depth, an established practice.
        
         | HWR_14 wrote:
         | If you are echoing a user's input back to them, what's the
         | threat model that requires you to sanitize the output?
         | 
         | That said, it's obviously not worth building a "don't sanitize
         | this" filter for that case.
        
         | wongarsu wrote:
         | That works well for things you can limit to alphanumeric, which
         | is pretty much only usernames. For everything else there will
         | be an exploit in some context without proper escaping. You can
         | decrease the attack surface, but you have to weigh that against
         | the false sense of security it might give developers.
        
         | InitialBP wrote:
         | Agree and disagree. Sanitization has its place, but from a
         | user perspective it's better to just outright reject (through
         | validation) inputs that aren't valid.
         | 
         | There are often unexpected ways that data gets into the system
         | (IT manually adding data, internal support tool to help
         | customers add data, etc.) You need to ensure that you're
         | properly sanitizing your input at every single input faucet and
         | your sanitization has to predict how, when, and where it will
         | be used by sanitizing for dangerous characters in filenames,
         | shell, spreadsheet formula vulns, and XSS attacks.
         | 
         | Instead, (Or In addition to) just make the assumption that data
         | in the database is dangerous, and ensure that you properly
         | escape for your use case when using that data.
         | 
         | Using a username to create a new file? Escape for filenames
         | based on which OS/language you're using.
         | 
         | Using birthdates in an excel file? Escape for excel formulas.
         | 
         | Using bio on an HTML page? HTML Escape.
         | 
         | Using username as part of a URL path? URL Escape.
         | 
         | And finally circle back to the fact that sanitization where you
         | change user input without their knowledge (like the "O'brien"
         | -> "Obrien" example in the article) makes for a frustrating
         | user experience.
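
A hedged sketch of the "escape for excel formulas" case from the list above, done at export time so the stored data stays verbatim. The helper name `neutralize_formula` is hypothetical. The rule used here (prefix a single quote when a cell starts with `=`, `+`, `-`, `@`, tab or carriage return) follows commonly cited CSV-injection guidance and is an assumption about what the target spreadsheet needs, not a complete defence.

```python
import csv
import io

# Leading characters that spreadsheet software may treat as a formula.
FORMULA_PREFIXES = ("=", "+", "-", "@", "\t", "\r")

def neutralize_formula(cell):
    """Prefix cells that a spreadsheet might evaluate as a formula."""
    if cell.startswith(FORMULA_PREFIXES):
        return "'" + cell
    return cell

rows = [
    ["name", "note"],
    ["O'Brien", '=HYPERLINK("http://evil.example")'],
]

buf = io.StringIO()
writer = csv.writer(buf)
for row in rows:
    writer.writerow([neutralize_formula(cell) for cell in row])
exported = buf.getvalue()
print(exported, end="")
```

Note the trade-off: a legitimate value such as `-5` also gets the prefix, which is acceptable in an export meant for spreadsheets but would be data corruption if applied at input time.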
        
           | hamilyon2 wrote:
           | I agree, when your app does exporting, use escaping and be
           | happy. Nobody ever challenged that. But that is not enough.
           | You should do defence in depth. My point is that you can't
           | realistically escape for every use, because
           | 
           | 1) once it is stored, it is usually outside of your control.
           | You simply do not know where your data will end up, due to
           | e.g. new integrations that will be developed in the future.
           | 
           | 2) you may not even know the proper escaping rules for the
           | document types you are producing, due to software obscurity.
           | Nobody I can think of escapes csv files for excel-2001
           | vulnerabilities. This is just one example of software where
           | those files can actually end up being opened.
           | 
           | What is more economical/rational to change, your input
           | validation or every csv/excel exporter/converter ever in
           | existence?
        
       | swlkr wrote:
       | A strong content security policy also helps with xss
        
       | iou wrote:
       | Do both pls.
        
         | hombre_fatal wrote:
         | If you're doing both, I'd ask you what you think you're
         | accomplishing by sanitizing input, especially when you're
         | already escaping output.
         | 
         | All you're doing is corrupting the data with a ritual that
         | seems like it's securing something, and it tends to make you
         | think that your data is now ready to be rendered anywhere
         | without issue.
        
           | pydry wrote:
           | >If you're doing both, I'd ask you what you think you're
           | accomplishing by sanitizing input, especially when you're
           | already escaping output.
           | 
           | https://en.m.wikipedia.org/wiki/Defence_in_depth_(non-
           | milita...
        
             | hombre_fatal wrote:
             | I'd argue that sanitization makes things worse from that
             | standpoint.
             | 
             | What exactly was transformed in some given data and for
             | what context? What needs to be done to reverse the
             | sanitization process if you want to see the verbatim data,
             | if that's even possible? Now that you want to escape the
             | output, how can you reverse the sanitization transform so
             | that you aren't double-escaping? What were the assumptions
             | being made when this data was sanitized and what _was_ that
             | transform?
             | 
             | In other words, it's simpler to hold the verbatim data and
             | then ask "ok, how does it need to be escaped for this
             | context?" than having to ask that same question with
             | arbitrarily mangled data while worrying if the data was
             | sufficiently escaped for this context at input-time some
             | point in the past.
             | 
             | Even beginners get almost all mileage from parameterized
             | SQL queries + using an HTML templating library that escapes
             | by default which is almost all of them these days.
             | 
             | I think knee-jerk sanitization is a relic of the days where
             | that wasn't common, namely <?php echo $username ?>, which
             | wasn't necessarily the worst advice when you otherwise had
             | to remember to echo htmlEscape($username) every single
             | time. Fortunately, things have improved since those days.
        
               | pydry wrote:
               | I've used a bunch of sanitizers and never had any issues
               | with any of them. I'm sure there are exceptions but IME
               | they tend to mangle the kind of text which the user
               | really has no legitimate need to enter most of the time.
               | 
               | Far from being a relic the recent log4j vulnerability
               | highlighted just how much value there is in this kind of
               | defense in depth.
               | 
               | Obviously knee jerk decisions in tech are usually bad
               | news.
        
           | AnonHP wrote:
           | The data store may be one, but the teams and apps working on
           | the inputs and the outputs may be disparate and different.
           | Relying on other teams all the time to do things correctly
           | may not be a wise approach.
        
           | jerf wrote:
           | I can't emphasize this enough. This isn't a matter of taste,
           | like, maybe you sanitize, maybe you escape on the way out,
           | it's all good, it all works, it's just a matter of opinion.
           | 
           | Sanitizing the input is _wrong_. Actively, objectively,
           | unrecoverably wrong. Once you've destroyed your data you
           | can't get it back. Huge amounts of effort have been wasted by
           | people trying to fix and recover data that was destroyed by
           | systems "helpfully" "sanitizing" data. God help you if you
           | have a sequence of these systems in a row each doing their
           | own "sanitization" before you get the data.
           | 
           | Do not "sanitize" your inputs. Do not tell other developers
           | to sanitize their inputs. Do not sagely spout off on HN about
           | the importance of sanitizing your inputs. It is _wrong_.
           | 
           | The only "sanitization" that should be done is that when
           | encoding to the output there are sometimes things that should
           | simply be removed. For instance, a good HTML escaping
           | function probably ought to entirely drop nulls, not even
           | encoding them as &#00; or anything, just drop them. Some of
           | the other ASCII characters are straight-up illegal in HTML as
           | well, even encoded. But all that sort of "sanitization"
           | should be in the escaping step. If you want to reject null
           | characters at input time, that's part of _validation_, not
           | sanitization.
        
             | asplake wrote:
             | Validate inputs, escape outputs
        
               | Buttons840 wrote:
               | Yes, but remember in a lot of cases nearly anything is
               | valid input.
        
             | talideon wrote:
             | _Some_ sanitisation is fine. For instance, stripping
             | leading and trailing space in some fields, case
             | normalisation, automatic insertion of spaces in credit card
             | numbers, that kind of thing. That is to say, you should
             | sanitise as an affordance to the user. Given the choice
             | between presenting an error to the user and automatic
             | sanitisation, the latter is preferable. It's something that
             | should be done carefully, but it's still good.
             | 
             | Thoughtless sanitisation is a whole different kettle.
        
               | nybble41 wrote:
               | To me that sounds more like canonicalization than
               | sanitation. Depending on your requirements it might be
               | fine to convert the input to a canonical form before
               | processing. If you do this, be certain to do it _before_
               | validation so that you don't accidentally "canonicalize"
               | validated input into something which wouldn't pass the
               | validation checks.
               | 
               | A key aspect of canonicalization compared to sanitation
               | is that the result should be something that the user
               | would consider equivalent to their original input. The
               | most common offender in my experience is the abuse of
               | case normalization, especially for data like email
               | addresses which are not defined as case-insensitive (at
               | least for the mailbox name) even if many servers treat
               | them that way. If you don't preserve the original case
               | (and other parts such as "+" labels whose meaning is
               | defined by the mail server) the address may not work at
               | all, or may result in sending messages to the wrong user.
               | 
               | Names, as an intimate part of the user's identity, are
               | another area where case normalization can sometimes prove
               | annoying or even offensive. If some legacy system
               | requires names to be entered as all-caps US-ASCII
               | characters, fine, but at least don't turn "O'Conner" or
               | "MacDouglas" into "O'conner" or "Macdouglas" in some
               | misguided attempt to ensure that just the first letter is
               | capitalized. (And in some situations the first letter
               | _shouldn't_ be capitalized, e.g. the "dos Santos" in
               | "Giovani dos Santos Ramirez"[0]--which is a single
               | surname, not two names.)
               | 
               | [0] https://en.wikipedia.org/wiki/Giovani_dos_Santos
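
A rough sketch of the "canonicalize before validation" ordering discussed above, for a hypothetical display-name field. The function names, the 50-character limit and the specific rules are assumptions for illustration. Canonicalization here is limited to changes a user would consider equivalent (Unicode NFC and trimming); case is left alone, and anything else is rejected instead of being silently repaired.

```python
import unicodedata

MAX_LEN = 50  # hypothetical limit for a display-name field

def canonicalize(raw):
    """Convert input to one canonical form the user would see as equal."""
    return unicodedata.normalize("NFC", raw).strip()

def validate(name):
    """Reject, rather than silently repair, anything outside the rules."""
    if not name:
        raise ValueError("name is empty")
    if len(name) > MAX_LEN:
        raise ValueError("name is too long")
    if any(unicodedata.category(ch) == "Cc" for ch in name):
        raise ValueError("name contains control characters")
    return name

def accept_display_name(raw):
    # Canonicalize first, then validate the canonical form, so the value
    # that gets stored is exactly the value that was checked.
    return validate(canonicalize(raw))

print(accept_display_name("  Giovani dos Santos  "))
```

Running the steps in the other order would let a whitespace-only name pass the emptiness check and then be trimmed to nothing.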
        
               | talideon wrote:
               | Oh, believe me. As somebody with a name that includes
               | accents, and a surname that contains two words, with
               | relatives whose names include internal capitalisation and
               | apostrophes, I know _all_ about that.
               | 
               | The thing is that canonicalisation is a kind of
               | sanitisation. As you mentioned, I personally prefer it to
               | be done in real time. Sometimes it can't, however, and you
               | have to resort to munging, which is on the nastier end of
               | sanitisation. Here's a short story:
               | 
               | AFNIC run the .fr registry, and they, unlike other
               | registries, expect you to provide a contact's given name
               | and surname separately. The joys of French bureaucracy.
               | At my previous job (hosting provider and domain
               | registrar), I built the company's domain management
               | system. The systems in front of that didn't care about
               | the form a person's name took so long as it was present,
               | and most other domain registries are the same. There was
               | no sensible way to get the applicant to enter them
               | previously (this data was taken from the billing system).
               | This necessitated that I build a library that could parse
               | people's names, and I ended up developing a rather large
               | number of heuristics for doing so as accurately as
               | possible. It only covered the Latin alphabet, as that's
               | all AFNIC would accept at the time, but it worked.
               | 
               | The problem is that most don't put that kind of thought
               | into data sanitisation, and do things such as those you
               | mentioned. And that's why we can't have nice things.
        
               | jerf wrote:
               | I agree that cleanup is acceptable, and there's certainly
               | some wiggle room in what people call cleanup vs.
               | sanitization and such.
               | 
               | But when people chant "sanitize your inputs" and expect
               | it to be treated as sage wisdom, it's in a security
               | context, and it is _wrong_ in that context. Sanitization
               | is not a valid security tool. Mind you, you might be
               | forced into it if your back is against the wall and you're
               | working on other code that is broken and you can't
               | fix that other code's broken failure to escape or
               | whatever. But it's still wrong, just a wrong thing you
               | were forced to do.
               | 
               | A richer point of view is more "don't destroy data you
               | don't 100% mean to destroy". Whitespace in the wrong
               | place or stray nulls can meet that bar. Removing
               | characters for "security" reasons doesn't. Destroying
               | data to prevent security issues downstream is not a good
               | idea.
        
             | serious_habit wrote:
             | If I'm reviewing code and someone is implementing escaping
             | that's an immediate, massive, red flag. It's SO HARD to get
             | right and there are many MANY libraries for doing it
             | correctly. The scary thing is how many bugs still make it
             | into these libraries.
             | 
             | Strongly prefer using an established library and see
             | designs such as https://web.dev/trusted-types.
        
             | AnonHP wrote:
             | > Sanitizing the input is wrong. Actively, objectively,
             | unrecoverably wrong.
             | 
             | I agree on the "unrecoverably" (sic) part, but strongly
             | disagree on words like "objectively". It can be bad only if
             | the input sanitization is poorly done. If that's poorly
             | done, then it's also likely that the output sanitization
             | may be poorly done. One cannot then say that output
             | sanitization is objectively bad because someone doesn't
             | know or care enough to do it properly.
             | 
             | This is a complex topic that deserves more attention, not
             | hand waving away with claims that cannot stand on their
             | own.
        
           | serious_habit wrote:
           | Even better- never sanitize your data.
           | 
           | You should only use templating systems which safely handle
           | user data. Don't use innerHTML assignments, don't concatenate
           | user data into SQL queries. Use existing, validated libraries
           | for generating HTML and SQL.
        
             | JxLS-cpgbe0 wrote:
        
         | [deleted]
        
       | ipaddr wrote:
       | If you don't sanitize input, you create an unsafe datastore which
       | might be used in other applications later. Sanitize as soon as
       | possible.
        
         | frontiersummit wrote:
         | I think it cuts both ways, as anyone who has needed to mine an
         | existing data set for a new purpose can attest. Having the data
         | sanitized can make your parsing job infinitely easier, while it
         | can simultaneously destroy data which would have been extremely
         | helpful to the new project.
        
       | ncc-erik wrote:
       | I think what makes this hard for folks is tracking what the
       | expected form of data is at each step of its lifecycle,
       | especially considering people working with new and unfamiliar
       | codebases or splitting focus on multiple projects.
       | 
       | There are some frameworks that try using types to solve the
       | problem. Alternatively, the developers could throw in a comment
       | that looks something like:
       | 
       | // client == submits raw data ==> web_server == inserts raw data
       | (param. sql stmt) ==> db_server ==> returns query with raw data
       | ==> our_function == returns html-escaped data ==> client
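
A toy sketch of the "use types to track the form of the data" idea mentioned above. `SafeHtml`, `escape` and `render_comment` are hypothetical names, not the API of any particular framework; the point is only that a marker type records which strings are already escaped, so raw strings get escaped once and escaped strings are not escaped twice.

```python
import html

class SafeHtml(str):
    """Marker type: a string that is already HTML-escaped."""

def escape(value):
    """Escape raw values; pass through values already marked safe."""
    if isinstance(value, SafeHtml):
        return value
    return SafeHtml(html.escape(str(value), quote=True))

def render_comment(author, body):
    # Raw data is escaped only here, at the point of HTML output.
    return SafeHtml(
        "<p><b>" + escape(author) + "</b>: " + escape(body) + "</p>")

print(render_comment("a<b", "x & y"))
```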
        
       | billpg wrote:
       | Shameless plug: NEVER Sanitize Your Inputs (by me, 2013)
       | https://billpg.com/never-sanitize-your-inputs/
        
       | Sebb767 wrote:
       | > The parallel for SQL injection might be if you're building a
       | data charting tool that allows users to enter arbitrary SQL
       | queries. You might want to allow them to enter SELECT queries but
       | not data-modification queries. In these cases you're best off
       | using a proper SQL parser [...] to ensure it's a well-formed
       | SELECT query - but doing this correctly is not trivial, so be
       | sure to get security review.
       | 
       | If you are ever in this situation, you should actually use a
       | dedicated read-only user that can only access the relevant data.
       | If you need to hide columns, use views. Trying to parse SQL can
       | easily go very wrong, especially when someone (ab-)uses the edge
       | cases of your DB.
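
A loose analogue of the read-only-user suggestion above, sketched with SQLite, which has no user accounts: the connection that runs user-supplied SQL is opened in read-only mode, so the database itself refuses writes and no SQL parsing is needed. The table, file name and `run_user_query` helper are made up for illustration; on a server database the equivalent would be a dedicated account with only SELECT privileges.

```python
import os
import sqlite3
import tempfile
from pathlib import Path

path = os.path.join(tempfile.mkdtemp(), "charts.db")

# Set up some data with a normal read-write connection.
admin = sqlite3.connect(path)
admin.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
admin.execute("INSERT INTO sales VALUES ('north', 10), ('south', 20)")
admin.commit()
admin.close()

# Connection used for user-supplied queries: opened read-only via a URI.
readonly = sqlite3.connect(Path(path).as_uri() + "?mode=ro", uri=True)

def run_user_query(sql):
    """Run arbitrary SQL; writes fail because the connection is read-only."""
    return readonly.execute(sql).fetchall()

print(run_user_query("SELECT region, amount FROM sales ORDER BY region"))
```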
        
       ___________________________________________________________________
       (page generated 2022-01-13 23:01 UTC)