[HN Gopher] Parsing JSON Is a Minefield (2018)
       ___________________________________________________________________
        
       Parsing JSON Is a Minefield (2018)
        
       Author : fanf2
       Score  : 77 points
       Date   : 2024-06-02 16:42 UTC (6 hours ago)
        
 (HTM) web link (seriot.ch)
 (TXT) w3m dump (seriot.ch)
        
       | chrisjj wrote:
       | Nice.
       | 
       | Feedback:
       | 
       | > I wrote yet another JSON parser (section 6)
       | 
       | Link defunct.
        
         | douglee650 wrote:
         | ```
         | 
         | One day a student came to Moon and said: "I understand how to
         | make a better garbage collector. We must keep a reference count
         | of the pointers to each cons."
         | 
         | Moon patiently told the student the following story:
         | 
         | "One day a student came to Moon and said: 'I understand how to
         | make a better garbage collector...
         | 
         | ```
        
       | jsnell wrote:
       | (2016). Previous significant discussions:
       | 
       | https://news.ycombinator.com/item?id=12796556
       | 
       | https://news.ycombinator.com/item?id=20724672
       | 
       | https://news.ycombinator.com/item?id=28826600
        
         | dang wrote:
         | Thanks! Macroexpanded:
         | 
         |  _Parsing JSON Is a Minefield (2016)_ -
         | https://news.ycombinator.com/item?id=28826600 - Oct 2021 (173
         | comments)
         | 
         |  _Parsing JSON Is a Minefield (2018)_ -
         | https://news.ycombinator.com/item?id=20724672 - Aug 2019 (178
         | comments)
         | 
         |  _Parsing JSON is a Minefield_ -
         | https://news.ycombinator.com/item?id=16897061 - April 2018 (246
         | comments)
         | 
         |  _Parsing JSON is a Minefield_ -
         | https://news.ycombinator.com/item?id=12796556 - Oct 2016 (292
         | comments)
        
       | sureglymop wrote:
       | Wrote a json parser recently and did not think this hard about it
       | (because the spec is so simple). Time to revisit
        
       | rendaw wrote:
       | The primitive types JSON specifies are redundant and generally
       | only lead to issues. Almost all JSON consumers are either
       | deserializing to a spec that already contains type information,
       | frequently richer, with even more variety of types (url,
       | telephone number, UUID, not just "string"), and even without a
       | spec code will be written to need a specific type (i.e. you're
       | not going to write code to accept an integer when you want a
       | person's name).
       | 
       | It would be much simpler if all primitives were strings, and it'd
       | probably save a few people from accidentally doing the wrong
       | thing while dealing with prices.
        
         | aftbit wrote:
         | Perhaps. I've often wished that JSON supported some sort of
         | custom types or type annotations, or failing that, at least
         | datetimes. Some other nice extensions would be support for
         | comments and optional trailing commas.
         | 
         | There is something very nice and expressive about the existing
         | JSON types. Just 6 types (null, boolean, string, number, array,
         | and dictionary) are enough to cover a ton of use cases, and as
         | you suggest, one can always fall back to "stringly typed"
         | alternatives by implementing one's own serialization and
         | deserialization for extra types.
        
           | ooterness wrote:
           | You may be interested in CBOR (IETF RFC 8949).
           | 
           | CBOR features are almost one-to-one with JSON, except that
           | the encoding is more size-efficient, it supports a few
           | additional types (e.g., integers and floats are separate),
           | and it allows semantic tags.
           | 
           | https://en.wikipedia.org/wiki/CBOR
        
             | murmansk wrote:
             | While it might be great in theory, CBOR has its own separate
             | set of dragons waiting for you.
             | 
             | Expectation: tags in CBOR allow you to pass semantics.
             | Reality: the multitude of tags and the absence of strict rules
             | for them make it a pain in the ass.
        
             | zzo38computer wrote:
             | There are some benefits of CBOR (having a separate integer
             | type is good, and a byte string type is good, and they have
             | typed numeric arrays which is good also, etc), but also
             | some problems. For example, I might have preferred that
             | Unicode is a tag rather than a type (other tags can be used
             | for other character sets), and base64-encoded strings also
             | seems unnecessary (since it is a binary format anyways, you
             | should just use the binary data directly instead), and I
             | think it would be better for a MIME message to be treated
             | as a byte string instead of Unicode (fortunately the
             | specification allows that, but it seems to just be "added
             | on" afterward due to a lack of consideration), and it
             | might be better to disallow arrays and maps as key
             | types.
             | 
             | However, some of the things I mentioned above, do have
             | benefits for interoperability with JSON, although they
             | aren't good for a general-purpose use; I think that it
             | would generally be better to make a good format rather than
             | trying to work only with the bad ideas of other
             | specifications. (Fortunately, I think what I described
             | above could be implemented using a subset of CBOR.)
             | 
             | However, using these formats (whether CBOR or JSON) is
             | often more complicated than should be needed for a specific
             | use anyways.
        
         | VMG wrote:
         | Disagree. The typical ad hoc funcs for parsing string to bool
         | make me despair (uppercase, lowercase, true, yes, y, 1, .. )
        
           | sgarland wrote:
           | Python's distutils had a strtobool() function that was very
           | handy for this, but the module has been removed. It's trivial
           | to re-implement, but still slightly annoying to have to do.
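
A minimal sketch of such a re-implementation, assuming the same accepted spellings as the removed `distutils.util.strtobool` (y, yes, t, true, on, 1 and n, no, f, false, off, 0). Returning a real `bool` and stripping surrounding whitespace are choices made for this sketch, not behavior of the original function.

```python
def strtobool(value: str) -> bool:
    """Convert a truthy/falsy string to bool, raising ValueError otherwise."""
    truthy = {"y", "yes", "t", "true", "on", "1"}
    falsy = {"n", "no", "f", "false", "off", "0"}
    # Assumption for this sketch: ignore case and surrounding whitespace.
    normalized = value.strip().lower()
    if normalized in truthy:
        return True
    if normalized in falsy:
        return False
    # Refuse anything else instead of guessing.
    raise ValueError(f"invalid truth value {value!r}")
```

Rejecting unknown spellings, rather than treating them as false, keeps a typo in a config value from silently flipping a flag.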
        
         | kibwen wrote:
         | Let's make a distinction here between serialization formats and
         | configuration formats. Because JSON is often used for both,
         | these two use cases often get conflated.
         | 
         | For _configuration_ formats, I 100% agree with you. I do not
         | want _any_ data type except a string and a hashmap (maybe an
         | array if you're being luxurious). Not an int, not a float, not
         | a boolean, not a datetime (looking at you, TOML). For
         | configuration formats I am always _immediately_ feeding those
         | files into a language with a richer type system that will
         | actually parse them; my program and its embedded types _are_
         | the schema. (Users of dynamically-typed languages may
         | reasonably disagree.)
         | 
         | However, for the serialization use case, I'm not so sure.
         | There's an argument that having a schema against which to do
         | lightweight validation at several points in the pipeline isn't
         | the worst idea, and built-in primitives get you halfway to a
         | half-decent schema. I'm ambivalent at worst.
        
           | troupo wrote:
           | > my program and its embedded types are the schema.
           | 
           | They are not. Configuration is a very tiny subset of a more
           | general problem that you also mention: serialization.
           | 
           | Your config file will be de-serialized by your program and
           | parsed into some specific types. Including numbers (tons of
           | edge cases), dates (tons of edge cases), strings (tons of
           | edge cases) etc.
           | 
           | It becomes worse when your program is used by more people
           | than just you: which field is a date? In which format? Do you
           | handle floats? What precision? What's the decimal separator?
           | Do you do string normalization? What are valid and invalid
           | characters, if any?
           | 
           | You can't pretend that your config is "just strings". They
           | are not
        
             | wruza wrote:
             | But most configs are just strings and it's okay. How does
             | it get so bad just itt?
             | 
             | Human input is full of tradeoffs, that's why it's bash and
             | not typescript in your shell path column. And you'll meet a
             | great resistance from users if you make your config fully
             | typed and require to refer to schema dtd ns or whatever bs
             | xml had.
        
             | mike_hock wrote:
             | I kind of took away the opposite from the parent post. Of
             | course, your config isn't just strings, but it also isn't
             | just a limited set of primitive types that the inventor of
             | some one-size-fits-all configuration language envisioned.
             | 
             | You can't build a generic schema validator that will accept
             | _exactly_ the valid configs for some program and nothing
             | else anyway, so forget the half-assed type checking
             | attempts and just provide the hierarchical structure. It's
             | up to the application to define the valid grammar and
             | semantics of each config option and parse it into an
             | application-specific type.
        
             | hgyjnbdet wrote:
             | I would say all configs should be treated as castable
             | strings. That's why for config files I much prefer the INI
             | format.
        
             | nevermore24 wrote:
             | The strings are strings. I don't care how people handle
             | their dates, that's between them and their god.
        
         | crazygringo wrote:
         | > _Almost all JSON consumers are either deserializing to a spec
         | that already contains type information_
         | 
         | But different languages interpret different strings in
         | different ways by default.
         | 
         | This leads to major bugs.
         | 
         | One of the great strengths of JSON is that parsing a number is
         | well-defined.
         | 
         | The way you're suggesting would lead to people emitting JSON
         | with leading zeros sometimes, and then some languages end up
         | interpreting certain numbers as octal.
         | 
         | No thank you.
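
The leading-zero concern can be illustrated with a small sketch using Python's standard `json` module. The input `012` is an example chosen here; the point is only that the JSON grammar rejects it, while the same digits in a string mean whatever the converting code decides.

```python
import json

# JSON's number grammar forbids leading zeros, so the parser rejects this input.
try:
    json.loads("012")
    rejected = False
except json.JSONDecodeError:
    rejected = True

# The same digits carried as a string leave the base up to the consumer.
as_decimal = int("012")    # read as base 10
as_octal = int("012", 8)   # read as base 8
```

Here the two conversions give 12 and 10 respectively, which is the kind of disagreement between consumers the comment warns about.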
        
           | anonymoushn wrote:
           | JSON numbers are just certain strings, but some tools that
           | deal with json such as jq feel a need to mangle the numbers
           | anyway
        
             | crazygringo wrote:
             | I don't know what you mean.
             | 
             | JSON numbers are far more restrictive than strings and
             | carry precisely defined meaning in a way that arbitrary
             | strings don't. They're only "just certain strings" in the
             | same way anything can be serialized to a string, which
             | doesn't really mean anything.
             | 
             | What does jq do to them?
        
         | IshKebab wrote:
         | It's extremely common in dynamically typed languages to
         | deserialise JSON without a spec. What you're asking for is
         | basically XML and it's definitely nicer to get at least basic
         | types (string, bool, int, etc.) "for free".
        
         | kemitchell wrote:
         | I've long used a toylike "Lists and Maps of Strings" format for
         | personal recordkeeping and automation.
         | https://www.npmjs.com/package/lamos
         | 
         | I've never gone back to formalize the grammar or otherwise
         | mature it. But it's served me well as-is, and it's been easy to
         | convert "up" to JSON or YAML or XML or what-have-you, once the
         | case for an interface beyond plain text proves worthwhile.
        
         | fuzztester wrote:
         | >It would be much simpler if all primitives were strings,
         | 
         | TCLON?
        
       | aftbit wrote:
       | I'm a little sad to see that no implementations of JavaScript
       | were on the tested parser list. I'd be interested to see where
       | browsers and nodejs `JSON.parse` as well as `eval` parsers fall.
       | As the author mentioned, some of the JSON features are not valid
       | JavaScript but I wonder which of these test cases fail `eval`.
       | 
       | Note, just so nobody reminds me, don't parse JSON with eval for
       | security reasons. I'm just curious how it would work from a
       | parser completeness point of view.
        
       | Thaxll wrote:
       | XML was indeed better.
        
         | IshKebab wrote:
         | It absolutely wasn't, primarily because the XML data model is
         | so mismatched with the object structures you find in
         | programming languages.
         | 
         | It does at least support comments though. Biggest flaw in JSON
         | by far.
        
         | Tao3300 wrote:
         | Apples and oranges.
        
       | ryjo wrote:
       | Writing a JSON parser is a good way to teach yourself better
       | programming practices. I attribute my understanding of pointer
       | arithmetic and i/o streams to my own efforts in
       | parsing/generating JSON.
        
       | jwells89 wrote:
       | JSON has its issues, but modern languages including facilities to
       | work with it (fewer dependencies to wrangle is always great) and
       | the way typesafe (de)serializers can be synthesized automatically
       | with a little thoughtfulness in design instead of needing to be
       | manually written (see Swift's Codable and Kotlin/Java's Moshi,
       | for example) can in my opinion make it compelling enough to
       | overlook its warts. It doesn't fit everywhere of course but it's
       | more than good enough for a vast range of applications.
        
       | theamk wrote:
       | This is interesting, but seems pretty irrelevant for the real
       | world (kinda like "i = ++i + ++i;" C puzzle). The answer to those
       | dangers is "don't do it then". Use your stdlib to emit json,
       | don't use string functions to modify json, assume any number is
       | no better than float64, and base64 your binary data - and you will
       | never have to worry about this "minefield".
       | 
       | (the only possible problem is if you are designing a security
       | system, but even then, since all the ambiguity is whether to
       | reject the string, it will cause DOS at worst)
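
A short sketch of two of these rules in Python: treating numbers as float64 at best, and base64-encoding binary data. The field names `id` and `blob` are hypothetical, and sending the large integer as a string is one possible workaround assumed here, not something the comment prescribes.

```python
import base64
import json

# 2**53 + 1 is the smallest positive integer that float64 cannot represent exactly.
big = 2**53 + 1
survives_float64 = float(big) == big

# Stdlib encoder for output; large integer as a string, binary data as base64 text.
raw = b"\x00\xff\x10"
payload = json.dumps({
    "id": str(big),
    "blob": base64.b64encode(raw).decode("ascii"),
})

# The receiver reverses both conversions explicitly.
decoded = json.loads(payload)
restored_id = int(decoded["id"])
restored_blob = base64.b64decode(decoded["blob"])
```

Python's own parser would keep the integer exact, but a consumer that maps every JSON number to a double would not, which is why the sketch avoids relying on it.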
        
       | mariusor wrote:
       | Funny how Baader-Meinhof works, I just finished writing a JSON
       | toy parser earlier today. I guess I'll add the mentioned corner
       | cases to the testsuite, and watch them fail. :D
        
         | hughw wrote:
         | And so I just now learned that the Baader-Meinhof Gang of the
         | 1970s gave its name to the phenomenon of frequency illusion,
         | where once you hear about a thing you notice many more
         | references to it.
        
       | RedShift1 wrote:
       | Parsing any format is a minefield though...
        
       | thecleaner wrote:
       | Btw if we use parser generators like antlr for this purpose, is
       | it still a minefield? Can someone point to some vulnerabilities I
       | can study?
        
       | zzo38computer wrote:
       | There are problems with JSON, such as:
       | 
       | - Numbers are floating point, but cannot be Infinity or NaN.
       | There is no integer type, so long integers might not work
       | properly. (There are other problems with numbers too, as
       | mentioned in that article.)
       | 
       | - Strings are Unicode. Non-Unicode data (including binary data)
       | isn't handled properly, and even Unicode can have problems (some
       | of which are mentioned in that article, but there are others too).
       | 
       | - Keys are only strings, not numbers.
       | 
       | - The syntax isn't very convenient, e.g. it doesn't have comments,
       | optional trailing commas, etc.
       | 
       | - The format is difficult for reasons explained in that article,
       | too.
       | 
       | One possible alternative would be a format based on a subset of
       | PostScript (instead of JavaScript), e.g. (part of an example
       | from Wikipedia):
       | 
       |     <<
       |       /first_name (John)
       |       /last_name (Smith)
       |       /is_alive true
       |       /age 27
       |       /phone_numbers [
       |         <<
       |           /type (home)
       |           /number (212 555-1234)
       |         >>
       |         <<
       |           /type (office)
       |           /number (646 555-4567)
       |         >>
       |       ]
       |       /spouse null
       |     >>
       | 
       | PostScript also has a binary format, comments (with a percent
       | sign), hex string literals, etc. (And commas are not used, so
       | the problem with trailing commas does not apply.)
       | 
       | (Nevertheless, I did write a JSON parser (and also a JSON writer)
       | in PostScript.)
       | 
       | It is also possible to use binary formats, CSV, etc, depending on
       | what exactly is needed by the program; for many reasons, one
       | format cannot solve everything.
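
Two of the points listed in the comment above (no Infinity or NaN, string-only keys) can be observed with Python's standard `json` module. This is a sketch of one implementation's behavior, not a statement about parsers in general.

```python
import json

# By default Python emits NaN and Infinity literals, which are not valid JSON.
lenient = json.dumps({"x": float("nan"), "y": float("inf")})  # '{"x": NaN, "y": Infinity}'

# allow_nan=False makes the encoder refuse such values instead.
try:
    json.dumps(float("nan"), allow_nan=False)
    strict_ok = True
except ValueError:
    strict_ok = False

# Integer keys are coerced to strings on output, so a round trip changes the key type.
round_tripped = json.loads(json.dumps({1: "one"}))
```

After the round trip the key is the string `"1"`, so code that looks up the integer `1` will miss it.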
        
         | BugsJustFindMe wrote:
         | > _The numbers is floating points...long integers might not
         | work properly_
         | 
         | I personally hate the usual interpretation as float and see it
         | as a common but extremely-implementation-induced failure. It's
         | far better interpreted as an arbitrary precision numeric type,
         | not float or int. The spec even says as much and only says that
         | implementations mostly suck so watch out. IMO precision myopia
         | is why we end up with e.g. Python's refusal-by-default to
         | (de)serialize from/to Decimal.
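
The Decimal behavior mentioned here can be sketched with Python's standard `json` module. The `price` field and its value are made up for illustration, and using `default=str` for output is only one possible fallback (it writes a JSON string rather than a JSON number).

```python
import json
from decimal import Decimal

text = '{"price": 12345678901234567890.12}'

# Default behavior: the number becomes a float and the low digits are lost.
as_float = json.loads(text)["price"]

# parse_float receives the original digit string, so Decimal keeps it exactly.
as_decimal = json.loads(text, parse_float=Decimal)["price"]

# Serializing a Decimal raises TypeError unless a fallback is supplied.
try:
    json.dumps({"price": as_decimal})
    default_works = True
except TypeError:
    default_works = False

# One fallback: emit the Decimal as a string.
encoded = json.dumps({"price": as_decimal}, default=str)
```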
        
         | nurettin wrote:
         | Why not make non-strict parsers that will handle unicodes,
         | longs, binary, ignore comments and allow trailing commas? If
         | you set bend_over_backwards=true, it will do strict parsing for
         | the poor souls who need that.
         | 
         | edit: I didn't mention integer keys, because object members
         | canonically start with a letter.
        
         | agys wrote:
         | _> Syntax convenience isn't so well, e.g. doesn't have
         | comments, optional trailing commas, etc._
         | 
         | I never understood these two choices in the spec as they are
         | totally against the "human-readable" goal...
        
       | stevejb wrote:
       | I definitely agree with a lot of the comments here, especially the
       | ones in the vein of "don't do dangerous things with json." If you
       | have control of the sender and the receiver, it makes sense to
       | have fields that add a bit of extra type information, e.g. this
       | is an integer or this is a float with this much precision.
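
One way the extra type information could look is sketched below, assuming both sender and receiver agree on it. The envelope shape (`type` and `value` fields) and the names `DECODERS`, `encode`, and `decode` are hypothetical, invented for this illustration.

```python
import json
from decimal import Decimal

# Hypothetical convention: every value travels as a string next to a type tag.
DECODERS = {"int": int, "decimal": Decimal, "str": str}

def encode(type_name, value):
    """Wrap a value in a tagged envelope; the value itself is sent as a string."""
    return {"type": type_name, "value": str(value)}

def decode(field):
    """Rebuild the value using the decoder named by the tag."""
    return DECODERS[field["type"]](field["value"])

message = json.dumps({
    "count": encode("int", 2**64),
    "rate": encode("decimal", Decimal("0.10")),
})
parsed = {key: decode(field) for key, field in json.loads(message).items()}
```

Because no value crosses the wire as a bare JSON number, neither the 64-bit overflow nor the trailing zero in `0.10` depends on how the receiving parser handles numbers.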
        
         | acheong08 wrote:
         | At that point just use protobuf
        
       ___________________________________________________________________
       (page generated 2024-06-02 23:02 UTC)