[HN Gopher] Parsix: Parse Don't Validate
___________________________________________________________________
Parsix: Parse Don't Validate
Author : Iazel
Score : 105 points
Date : 2021-05-15 15:38 UTC (7 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| ledauphin wrote:
| I've been looking for a solid Typescript implementation of "parse
| don't validate" that performs runtime parsing using semantics
| attached to the defined Typescript types themselves. In other
| words, much like attrs for Python, I want to be able to define a
| low/no-boilerplate type, and then register parsers for those
| types that will work recursively to parse my data, resulting in
| the specified Typescript type.
|
| Has anyone seen or written something like this?
| iddan wrote:
| It is definitely possible as Flowtype got it right. I hope one
| day it will come to TypeScript as well
| brundolf wrote:
| Use io-ts: https://github.com/gcanti/io-ts
|
| You define a decoder schema, and then the resulting TS type
| gets automatically derived for you. You can then run data
| through the decoder, it will err if there's a mismatch, or
| return a value of the inferred type otherwise.
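|
| Roughly, usage looks like this (the User shape and rawJson are
| just for illustration):
|
|     import * as t from 'io-ts'
|     import { isRight } from 'fp-ts/Either'
|
|     const rawJson = '{"name":"Ada","age":36}'  // stand-in for untrusted input
|
|     // Runtime decoder; the static type is derived from it.
|     const User = t.type({ name: t.string, age: t.number })
|     type User = t.TypeOf<typeof User>
|
|     // decode() returns an Either: a Left of errors on mismatch,
|     // a Right holding a value of the inferred type otherwise.
|     const result = User.decode(JSON.parse(rawJson))
|     if (isRight(result)) {
|       const user: User = result.right
|     }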
| bmuon wrote:
| I've been using this small library inspired by Elm/Swift
| decoders [1]. It works, but it's not low boilerplate.
|
| I'm gravitating towards GraphQL now because strict parsing is
| built into it, so there is no need for all this boilerplate.
|
| [1]: https://www.npmjs.com/package/@mojotech/json-type-validation
| ledauphin wrote:
| we use GraphQL for this purpose as well, but I'd also like to
| be able to validate across other boundaries.
|
| However, as I'm saying this, I wonder if I've been looking at
| this problem wrong. Since we already generate types from
| GraphQL schemas, maybe I should figure out how to use the
| same client side parser that's already in my GraphQL client,
| define a GraphQL schema for the types I'm interested in, and
| then just generate and use those types.
|
| One thing that approach doesn't necessarily give me is the ability
| define custom parsers corresponding to custom types. At
| least, I think most of that sort of thing is usually done
| server side with GraphQL.
|
| So, thank you for the link and also the inspiration for
| considering an alternative.
| renke1 wrote:
| Not exactly what you want, I think, but there is zod [0].
|
| I really would like to see nominal typing support in
| TypeScript. Currently, it's hard to validate a piece of data
| (or parse for that matter) once and have other functions only
| operate on that validated data. There are (ugly?) workarounds
| though [1].
|
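| The usual workaround is "branding" (a rough sketch; the brand
| property exists only at the type level, nothing is added to the
| value at runtime):
|
|     type ValidEmail = string & { readonly __brand: 'ValidEmail' }
|
|     function parseEmail(raw: string): ValidEmail | null {
|       return raw.includes('@') ? (raw as ValidEmail) : null
|     }
|
|     function send(to: ValidEmail) { console.log('sending to', to) }
|
|     // send('oops')                   // type error: plain string
|     const email = parseEmail('a@b.com')
|     if (email !== null) send(email)   // ok: email is ValidEmail
|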
| [0]: https://github.com/colinhacks/zod
| [1]: https://gist.github.com/dcolthorp/aa21cf87d847ae9942106435bf...
| brundolf wrote:
| Similar thing for TypeScript: https://github.com/gcanti/io-ts
| billytetrud wrote:
| To me this just looks like they're arguing for using class types
| rather than raw strings. The parsing seems kind of orthogonal and
| a special case of the kinds of validation you might want to do.
|
| It's also misleading in that the code is still doing validation,
| just in a different place.
| Iazel wrote:
| Yes, the idea is basically to prove that some data has been
| validated by encoding this proof in a specific type, like
| Email :) We want to popularize this idea and make it easier to
| work with by offering some nice, type-safe abstractions.
| didibus wrote:
| Yeah, but I think it's even more so, they're arguing that you
| should model the fact that something has been validated or not,
| and functions should indicate if they expect a validated form
| of input or not.
|
| In that sense, using types is only one way to do this, but you
| could model that in other ways. For example:
|     var foo = "123"
|     foo = validFoo(foo)
|     print(foo)
|     > {"value" : "123", "valid?" : true}
|
| And now you could have:
|
|     function bar(validFoo) {
|       if (!validFoo.get("valid?"))
|         throw new InvalidInputException(
|           "Foo must be validated prior to calling bar.")
|       ...
|     }
|
| Now, types are a convenient way to do this that also gives you
| static checking, but I believe the idea is more broadly to model
| that things were validated, and to have functions either expect
| validated input or fail.
|
| That allows you to push all validation to the boundary, and
| make sure that no one ever forgets to validate the input,
| because if they do, the inner functions will fail, reminding the
| caller: Please remember to validate this!
| billytetrud wrote:
| Makes sense. It just seems like parsing is kind of a separate
| issue and shouldn't be entangled with the concept of input
| validation.
| alserio wrote:
| I mean, yes, the point is that it is a better place to do the
| validation step. Also, "parse" is used generically here to mean
| going from one representation to a more structured one.
| samatman wrote:
| As a minor point of order, the exact phrase "parse, don't
| validate" has been conventional wisdom in langsec circles since I
| got involved, so 2014 at the earliest.
|
| I asked around on the work Matrix as to who actually coined it,
| but it's the weekend.
|
| This is not to take anything away from @lexi_lambda, who cited
| her sources and documented an interesting type-theoretic approach
| to applying the principle. She did a great job!
|
| If anyone wants to do a deeper dive, look into langsec, language-
| theoretic security. There's a lot of prior art to explore.
| alex_duf wrote:
| I think this can be summarised by "model your domain by using
| types, then let the compiler ensure you're not doing anything
| silly"
| Waterluvian wrote:
| I developed a pattern in typescript (I'm sure it's not original)
| where I have an interface describing an API entity and a class of
| the same name with only static methods, one of which is
| Foo.fromApi() that validates and parses.
|
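| Roughly sketched (the Foo fields here are invented; TS merges the
| interface and the class declared under the same name):
|
|     interface Foo { id: number; name: string }
|
|     // Class of the same name, holding only static helpers.
|     class Foo {
|       static fromApi(raw: unknown): Foo {
|         const r = raw as { id?: unknown; name?: unknown } | null
|         if (!r || typeof r.id !== 'number' || typeof r.name !== 'string') {
|           throw new Error('invalid Foo payload')
|         }
|         return { id: r.id, name: r.name }
|       }
|     }
|
|     // const foo = Foo.fromApi(JSON.parse(body))  // body: raw API response
|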
| I haven't seen any need to bring a library in to handle this.
| Though it would be nice to marry the worlds of TS, API, and Json
| Schema.
| lhnz wrote:
| It doesn't use json schema but you might be interested in
| something like this: https://gcanti.github.io/io-ts/
|
| (You can define runtime encoder/decoders which produce typed
| values.)
| brundolf wrote:
| io-ts is fantastic (I linked it myself above). The killer
| feature is that it infers the static types of your runtime
| schemas for you, so you don't have to define them twice. If you
| make a change to the schema, the rest of your code will
| typecheck against it.
| smnrchrds wrote:
| Fun fact: Parsix was the name of a Linux distro optimized for
| Persian speakers.
| didibus wrote:
| I have to be honest, I'm not seeing what problem this is trying
| to solve. Anyone can enlighten me?
|
| Edit: Ok I think I understand...
|
| It seems the problem is this: you're implementing a function
| that takes a user email as a string, and that function lives in
| a lower layer of the application, like the data access layer. It
| is difficult at that point to know whether the email string you
| are passed as input has already been validated or not. Thus you
| might be tempted to re-implement validation at your level,
| inside this function as well, with an assertValidEmail check.
|
| This can lead to a littering of validation throughout the code
| base, as each implemented function worries that the input isn't
| validated and re-validates it, possibly using slightly different
| rules each time.
|
| Furthermore, if you decide not to validate it again, you might be
| left wondering: am I sure it will have been validated before?
| How can I be sure? Someone in the future could easily start
| calling my function and forget to validate the email before
| calling it. This could eventually lead to a security issue or
| just a bug, by introducing a code path that never validates the
| email string.
|
| Thus if instead you rewrite your function so it takes the email
| as a ValidEmail type (or object), and not as a string, you force
| the caller to validate the email first. And you can also safely
| assume that if you're getting an email as a ValidEmail type, it
| has been validated. It also lets you localize the validation
| logic in the ValidEmail type's constructor, avoiding duplicate
| attempts at validating emails with different rules.
|
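| In code, a minimal sketch of that idea might look like this (the
| names and the email check are just for illustration):
|
|     class ValidEmail {
|       private constructor(readonly value: string) {}
|
|       // The only way to obtain a ValidEmail is through this check.
|       static parse(raw: string): ValidEmail | null {
|         return /^[^@\s]+@[^@\s]+$/.test(raw) ? new ValidEmail(raw) : null
|       }
|     }
|
|     // Lower layers accept only the parsed type, so an unvalidated
|     // string can never reach them.
|     function saveUserEmail(email: ValidEmail) { console.log(email.value) }
|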
| And it seems the author calls the latter style "Parsing" and the
| former "Validating", in the sense that the validating function
| returns a modified structure, so it has "parsed" its input: a
| string became a ValidEmail, i.e. the string was parsed into a
| ValidEmail, as opposed to simply being validated as an email.
|
| And finally, this is a little library to help make use of this
| pattern in Kotlin.
| jinwoo68 wrote:
| As they said in the README, it's inspired by Alex King's Parse,
| don't validate [1].
|
| Basically, rather than write a validation function, write a
| parser that returns a result of a specific type and use that
| type everywhere else. Then you can make sure the raw inputs are
| always validated.
|
| [1] https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-
| va...
| anaerobicover wrote:
| Small correction: the author is named Alex _is_
| jinwoo68 wrote:
| Whoops. Thanks for correcting.
| steventhedev wrote:
| There's an entire class of vulnerabilities caused by having
| separate verification and parsing logic, typically around fields
| where usually only one is used but the format supports multiple:
| the verifier checks the first occurrence while the parser uses
| the last one.
| Smaug123 wrote:
| If you mean "why Parse, Don't Validate", you should read the
| original blog post, linked at the top of the article. It's...
| transformative, if you aren't already aware of the principle.
| https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...
|
| If you mean "why this library", well, I guess parser
| combinators are nice! Some may say that a declarative statement
| of the parsing restrictions is better than a procedural
| implementation, on general principles.
| finnh wrote:
| It encourages people to use strongly typed classes rather than
| primitives, even if the type simply wraps a primitive.
|
| As a result you can't pass an invalid (say) accountID deep into
| your code, bc validity is guaranteed to be checked early when
| you "parse" an input string into the "AccountId" type.
|
| So: internal interfaces defined using non-primitive types, so
| internal methods don't need to keep validating their input.
| Conversion to said types happens early and predictably,
| catching bad values before they (eg) hit the database.
| coderintherye wrote:
| The linked blog post explains it pretty well. Essentially, it
| seems to be solving for unexpected cases or incorrect
| validation by using static typing and passing the expected
| type back in the return value rather than a boolean. I'm not sure
| I've encountered enough issues with validation functions to use
| this pattern, but it does seem like a more robust way of
| writing them.
| jpeloquin wrote:
| Paraphrased from the repo's readme: Suppose you have a program
| that consumes user input. Users often give bad input so the
| program needs to validate user input before acting on it. One
| way to validate is to call a function (e.g., `check_input`) on
| the user input and if it doesn't raise an error the input is
| safe for consumption by the rest of the program. The repo
| author considers this approach to be risky because the
| programmer can inadvertently omit or bypass `check_input` and
| the program still compiles and runs without complaint.
|
| The repo presents an alternative validation approach, which is
| to parse the user input into a data type (or, not quite
| equivalently, into a class). The parsing process serves as
| validation. Consumer functions are written such that they only
| accept the parsed data type. Therefore it is now impossible for
| the programmer to inadvertently omit or bypass validation of
| user input.
|
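| Sketched in a few lines (the names below are made up, not the
| repo's API):
|
|     type Input = { readonly text: string }   // illustrative parsed type
|
|     // Style 1: validation as a side effect; forgetting to call
|     // checkInput still compiles.
|     function checkInput(raw: string): void {
|       if (raw.trim() === '') throw new Error('empty input')
|     }
|
|     // Style 2: validation as parsing; consumers take Input, so they
|     // can only be reached through parseInput.
|     function parseInput(raw: string): Input | null {
|       return raw.trim() !== '' ? { text: raw } : null
|     }
|     function consume(input: Input) { console.log(input.text) }
|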
| The library is a set of convenience functions for actually
| writing these parsing / validation functions.
| _jal wrote:
| Reminds me a little of taint checking in Perl and Ruby, in
| reverse.
| atoav wrote:
| So in short: instead of representing user input (e.g. an email
| address) as a string - which you can forget to validate - the
| idea here is to create a dedicated data type for it, and use the
| validation step to create said data type.
|
| The rest of your program then works with this data type
| instead of the string and this way you will get a type error
| whenever you accidentally use unvalidated data.
|
| A nice idea that goes in a similar direction is to expand
| on this and create more types for different levels of trust.
| E.g. you could have the data types ValidatedEmail,
| VerifiedEmail and TrustedEmail and define precisely how one
| becomes the other. This way your type system will already tell
| you what is valid and what is not, and you can't accidentally
| mix them up.
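|
| A rough sketch of those levels (the brands and the promotion
| step are made up; "verified" could mean a confirmation link was
| clicked):
|
|     type ValidatedEmail = string & { readonly __level: 'validated' }
|     type VerifiedEmail = string & { readonly __level: 'verified' }
|
|     function validate(raw: string): ValidatedEmail | null {
|       return raw.includes('@') ? (raw as ValidatedEmail) : null
|     }
|
|     // The only path from Validated to Verified is this promotion,
|     // called once the confirmation step has succeeded.
|     function markVerified(email: ValidatedEmail): VerifiedEmail {
|       return email as string as VerifiedEmail
|     }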
| TeMPOraL wrote:
| You can also further generalize this idea by noticing you
| can encode all kinds of life cycle information in your type
| system. As you transform some data in a sequence of steps,
| you can use types to document and enforce the steps are
| always executed in order.
|
| In this example, the user input validation step is
| f(String) -> ValidatedEmail, then the process of verifying
| it is f(ValidatedEmail) -> VerifiedEmail. But the same
| principle can apply to e.g. append() operation being
| f(List[T], T) -> NonEmptyList[T], and you can write code
| accepting NonEmptyList to save yourself an emptiness check.
| Or, take a multi-step algorithm that gets a list of users,
| filters them by some criterion, sorts the list, and sends
| these users e-mails. Type-wise, it's a flow of Users ->
| EligibleUsers -> SortedEligibleUsers ->
| ContactedEligibleUsers.
|
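| (For the NonEmptyList part, today's TypeScript can already get
| close; a rough sketch, with toNonEmpty/head as made-up names:)
|
|     type NonEmpty<T> = [T, ...T[]]
|
|     function toNonEmpty<T>(xs: T[]): NonEmpty<T> | null {
|       return xs.length > 0 ? (xs as NonEmpty<T>) : null
|     }
|
|     // Callers that accept NonEmpty<T> can skip the emptiness check.
|     function head<T>(xs: NonEmpty<T>): T {
|       return xs[0]
|     }
|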
| And then, why should types be singular anyway? You should
| be able to tag data with properties, and then filter on or
| transform a subtag of them. This is the area of theory I'm
| not familiar with yet, but I imagine you _should_ be able
| to do things like:
|
| List[User] -> List[User, NonEmpty] -> List[User[Eligible],
| NonEmpty] -> List[User[Eligible], NonEmpty, Sorted[Asc]] ->
| List[User[Contacted], Sorted[Asc]].
|
| Or,
|
| Email -> Email[Validated] -> Email[Validated, Verified] ->
| Email[Validated, Verified, Trusted].
|
| I'm sure there's a programming language that does that, and
| then there's probably lots of reasons that this doesn't
| work in practice. I'd love to know about them, as I haven't
| encountered anything like it in practice, except bits and
| pieces of compiler code that can sometimes propagate such
| information "in the background", for optimization and
| correctness checking.
| _greim_ wrote:
| To keep building on this, I think the word "parsing" is just
| the tip of the iceberg. Parsing is one way to port data
| across a type boundary, where the source and dest types are
| optimized for different use cases (e.g. serialization vs
| type-safe representation). Since the semantic Venn diagrams
| of any two types might have areas of non-overlap, parse-
| don't-validate means establishing clear boundaries in your
| program where those translations happen, then defining the
| types on either side of the boundary to rule out the
| possibility of nonsense states elsewhere throughout the
| program. The idea of nonsense states is closely related and
| discussed more here[0] and here[1].
|
| [0] http://blog.jenkster.com/2016/06/how-elm-slays-a-ui-
| antipatt...
|
| [1] https://kentcdodds.com/blog/make-impossible-states-
| impossibl...
| StreamBright wrote:
| Interesting naming. Strongly typed languages (especially in the
| ML family) have best practices that include using types instead
| of strings as function parameters. An Email type by itself is
| enough to skip validation in each function accepting that
| particular type.
|
| I think this is a great first step using functional-language
| ideas, but you can go much, much deeper than that.
|
| https://www.slideshare.net/ScottWlaschin/the-power-of-compos...
| cle wrote:
| There are lots of siblings explaining why "parse don't
| validate".
|
| But also, it's not always wise to take this to an extreme. I've
| seen over the years many scenarios where dev teams were over-
| enthusiastic about this and parsed themselves into a corner by
| making system components over-strict and enforcing invariants
| that weren't necessary to enforce, making them much harder to
| change later.
|
| The right answer is, of course, somewhere in the middle, and
| depends on your domain and situation.
| Iazel wrote:
| hi, @cle! Curious to hear more about that, were they actually
| running validation/assertions in constructors?
| cle wrote:
| That can be one case of it, yeah. Using your example, a lot
| of devs might use that email parsing logic in various
| independent components of the same system. Eg if you have a
| reporting component that sends you business reports, that
| component really shouldn't be validating the structure of
| email addresses...if you need to refine the parsing logic
| now you've got to do coordinated deployments, possibly
| backfills, etc., whereas if you just treated it as an
| opaque string in that system you'd be better off.
|
| This isn't really a criticism of the approach, it's super
| useful, just that it needs to be applied judiciously.
| "Parse all the things" isn't always the best advice.
| Iazel wrote:
| Cool to see you got the point perfectly in the end! I wonder
| though, were you confused by the README? What made it clear for
| you?
| didibus wrote:
| Hum, it was the people here who replied to my question, and
| also reading the linked article.
|
| I think my confusion was in trying to frame things as parsing
| vs validating. While I now appreciate that use of the word, now
| that I understand it, it was also my biggest source of
| confusion.
|
| That's because I think most people think of parsing as
| conversion, like turning a String into an Int. Whereas in your
| case, you just want to tag a type as having been validated: you
| don't really convert the underlying value, you wrap it in
| another type, because the language offers no other way to
| attach that meta-information for the compiler to check
| statically.
|
| So because it seemed more like you're just wrapping the input,
| while all the code still uses the input value as-is by
| extracting it out of the wrapper type, the idea that you were
| "Parsing" and not "Validating" just confused me.
| mirekrusin wrote:
| Imagine you're writing a TypeScript project. You type everything
| and have type safety. This type safety is an illusion at I/O
| boundaries - whenever e.g. JSON.parse(...) on data from a
| file/websocket/http source happens. To preserve type safety, you want
| to use something like [0] to do runtime type assertions. Once
| i/o boundaries are parsing unknown types at runtime into what
| is defined as static types, your type safety is guaranteed.
|
| [0] https://github.com/appliedblockchain/assert-combinators
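|
| A generic sketch of the idea in plain TypeScript narrowing (not
| the linked library's API; the User shape is made up):
|
|     type User = { name: string }
|
|     function isUser(x: unknown): x is User {
|       return typeof x === 'object' && x !== null &&
|         typeof (x as { name?: unknown }).name === 'string'
|     }
|
|     const payload = '{"name":"Ada"}'      // stand-in for data from I/O
|     const data: unknown = JSON.parse(payload)
|     if (isUser(data)) {
|       // data is a User from here on; the rest of the app stays typed
|       console.log(data.name)
|     }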
| rdedev wrote:
| I find this approach combined with phantom data types really
| cool. Now you can easily introduce a semantic differentiation
| between two instances of the same data type but without much
| overhead
| GordonS wrote:
| If it helps, here's a related blog post but with a C# slant:
|
| https://andrewlock.net/using-strongly-typed-entity-ids-to-av...
|
| The author refers to using primitives everywhere as "primitive
| obsession", and proposes using types instead.
| dmux wrote:
| Similar to the idea of "microtypes" (I've most often seen it
| used in Java circles):
|
| https://www.markhneedham.com/blog/2009/03/10/oo-micro-types/
| matheusmoreira wrote:
| This also has security implications. The input handling layer
| is critical. Bugs in parsing and validation code are
| responsible for a huge number of vulnerabilities.
|
| More details: http://langsec.org/
|
| > The Language-theoretic approach (LANGSEC) regards the
| Internet insecurity epidemic as a consequence of _ad hoc_
| programming of input handling at all layers of network stacks,
| and in other kinds of software stacks.
|
| > LANGSEC posits that the only path to trustworthy software
| that takes untrusted inputs is treating all valid or expected
| inputs as a formal language, and the respective input-handling
| routines as a _recognizer_ for that language.
| TheAceOfHearts wrote:
| Refining types so they encode all desired constraints before
| use. This is explained in the linked article: Parse, don't
| validate [0].
|
| It helps reduce the risk of using invalid inputs by
| representing constraints over the value as part of the type.
|
| For example: a common problem in web development security is
| that query parameters aren't properly validated which can lead
| to denial of service attacks. As a trivial example of this,
| consider a web server which paginates some data using "offset"
| and "limit" by passing those parameters directly to a database
| query; an attacker could set "limit" to some incredibly high
| value and cause the server to crash. If you're just doing
| validation on your inputs it's possible that some usage could
| end up being overlooked.
|
| [0] https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-
| va...
| gregors wrote:
| So real question - in the "offset"/"limit" example, what makes
| it any safer if at first the programmer just sets those types
| to be integers? The same problem persists, does it not?
|
| Does the explicit creation of a type add this introspection?
| I'm not convinced that it does. Now once you fix this bug,
| encoding it in a type prevents it from creeping into other
| parts of the code. This seems more like DRY principles in
| action.
| TheAceOfHearts wrote:
| Apologies if I did a poor job of explaining, what you wrote
| seems in agreement with what I was attempting to convey.
|
| If one were only using integer types then the same problem
| would persist, that's correct. The problem would be solved
| by defining our limit type to only represent positive
| integers up to a specific safe value.
|
| Type refinement is done on the input boundaries of the
| system during runtime to prevent errors from propagating.
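|
| Concretely, something like this (the 1..100 range and the names
| are just an example):
|
|     class Limit {
|       private constructor(readonly value: number) {}
|
|       // Reject anything outside the range the backend can safely handle.
|       static parse(raw: number): Limit | null {
|         return Number.isInteger(raw) && raw >= 1 && raw <= 100
|           ? new Limit(raw)
|           : null
|       }
|     }
|
|     // The query layer only accepts a parsed Limit, never a raw number.
|     function fetchPage(offset: number, limit: Limit) {
|       console.log(offset, limit.value)
|     }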
| didibus wrote:
| Yeah, it seems to be more about guarantees as a code base
| grows larger and more people touch it.
|
| If there's a Limit class whose constructor and setters all
| check that the range is between, say, 5 and 100, and all
| existing code that needs the limit uses an instance of
| Limit, it just becomes less likely a code change is made
| that uses the limit input as it was directly provided by
| the user (and thus possibly out of range).
|
| But you'd still need to have had someone be smart enough to
| make sure the Limit class does prevent limits that could
| cause DB crashes.
|
| In practice I'm thinking, ok, so someone must have
| thought... Hey we should validate this user input and put
| in some logic for it.
|
| So I think what this says is: validation works by having
| all external input validated as it is received. But it
| can be easy to make a code change at the boundary where you
| forget to add proper validation. If all existing functions
| in the lower layers, like in the data access layer, are
| designed to take a Limit object, the person who took a
| limit as external input and was about to pass it to the
| query function will get a compile error and realize... Oh I
| need to first parse my integer limit into a Limit, which
| reminds them to use the thing that enforces the valid
| range.
|
| If instead the code had a util function called
| assertValidLimit, and the query function took a limit as an
| integer, it would be easy for that person to forget to add a
| call to assertValidLimit when getting the limit from the user
| and then pass that unvalidated value to the query and possibly
| cause a vulnerability.
|
| And lastly, it seems they argue that if you were to validate
| inside the query function itself, it wouldn't matter if others
| forget to validate, since the place where it matters would do
| it; but it is hard to fail at that layer, since you might have
| already made other changes, and that can leave your state
| corrupted.
|
| So basically it seems the argument is:
|
| "It is best to validate external input at the boundary as
| soon as it is received, but it can be easy to forget to do
| so and that's dangerous. So to help you not forget, have
| all implementing functions take a different type then the
| type of the external input, which will remind people... Oh
| right I need to parse this thing first and in doing so
| assert it's valid as well.
| Iazel wrote:
| Well said! I would only like to add that I highly
| discourage adding validations/assertions in the actual
| data class; this often makes it hard to work with and
| reuse. It is better to have this parsing logic as a
| simple function, perhaps at the factory level if you prefer
| that kind of flavor :)
| mbildner wrote:
| This is not yet possible in Typescript, but imagine if you
| could define a numerical subtype that requires your input to
| be below some threshold, e.g.:
|
| `type Limit = 0..100;`
|
| See discussion here:
| https://github.com/Microsoft/TypeScript/issues/15480
| twic wrote:
| Great, but why do you need a library for this? I just write
| classes with a fallible static parse method and a private
| constructor.
|
| It looks like this library was written by someone labouring under
| the mistaken belief that it's better to build and use a DSL to
| create the illusion of declarativity than to just write a line or
| two of normal code (eg the focusedParse stuff).
|
| Also, I demur somewhat at calling this parsing. It's tracking
| validation using typestate.
| skybrian wrote:
| This library seems to be providing a framework and doesn't
| include any interesting parsers. (There is no email address
| parser, despite the example.) It seems to allow for some
| composition of parsers, but the basic idea is a design pattern
| that's simple enough that it doesn't obviously require a
| framework.
|
| So it seems like most of the value comes from standardizing on
| domain types like Username, Email, and so on. Using a framework
| doesn't get you there, and it adds a dependency on the framework.
| Iazel wrote:
| Hi skybrian, would you mind explaining why you see this as a
| framework?
|
| About the missing interesting parsers, you are right: for now
| only the core part is done. Based on community interest, we will
| work on complementary packages, like more common parsers, easy
| integration with a web framework like ktor, effectful parsers
| based on coroutines, etc...
|
| Lots of work ahead :D
| throwawayboise wrote:
| I do as much of this as I can with database constraints. Foreign
| key constraints, or check constraints, or even triggers if
| necessary (though I do try to avoid them).
|
| Databases tend to outlive application code, or may be fronted by
| different applications (internal vs external for example).
| Keeping the constraints with the data is the best way to ensure
| that your data remains consistent within itself.
| Iazel wrote:
| I see, this is also an interesting approach and it definitely
| has its uses. Thinking about it, though, it has its own
| limitations when it comes to scalability and to business
| requirements that naturally fall outside the database box, e.g.:
| how would you ensure an S3 file reference is actually valid and
| that the file does exist?
| jhardy54 wrote:
| I do this too, but I'm always frustrated by the mismatch
| between database constraints and application constraints. For
| example, when using Django you can declare a field as
| varchar(32) but that constraint isn't checked until you
| actually insert the row into the database. I suppose maybe
| that's not a problem in languages with more mature type safety
| ecosystems?
| Iazel wrote:
| Yeah, I've also worked with weak type systems in the past
| (PHP, Ruby, JS), so I can definitely share the pain! I
| learned the hard way how much easier it is to build complex
| systems when you have a compiler helping you ;)
| jhardy54 wrote:
| What are you building with now? Rust/Go/something snazzy?
___________________________________________________________________
(page generated 2021-05-15 23:00 UTC)