[HN Gopher] Parsing JSON Is a Minefield (2018)
___________________________________________________________________
Parsing JSON Is a Minefield (2018)
Author : fanf2
Score : 77 points
Date : 2024-06-02 16:42 UTC (6 hours ago)
(HTM) web link (seriot.ch)
(TXT) w3m dump (seriot.ch)
| chrisjj wrote:
| Nice.
|
| Feedback:
|
| > I wrote yet another JSON parser (section 6)
|
| Link defunct.
| douglee650 wrote:
| ```
|
| One day a student came to Moon and said: "I understand how to
| make a better garbage collector. We must keep a reference count
| of the pointers to each cons."
|
| Moon patiently told the student the following story:
|
| "One day a student came to Moon and said: 'I understand how to
| make a better garbage collector...
|
| ```
| jsnell wrote:
| (2016). Previous significant discussions:
|
| https://news.ycombinator.com/item?id=12796556
|
| https://news.ycombinator.com/item?id=20724672
|
| https://news.ycombinator.com/item?id=28826600
| dang wrote:
| Thanks! Macroexpanded:
|
| _Parsing JSON Is a Minefield (2016)_ -
| https://news.ycombinator.com/item?id=28826600 - Oct 2021 (173
| comments)
|
| _Parsing JSON Is a Minefield (2018)_ -
| https://news.ycombinator.com/item?id=20724672 - Aug 2019 (178
| comments)
|
| _Parsing JSON is a Minefield_ -
| https://news.ycombinator.com/item?id=16897061 - April 2018 (246
| comments)
|
| _Parsing JSON is a Minefield_ -
| https://news.ycombinator.com/item?id=12796556 - Oct 2016 (292
| comments)
| sureglymop wrote:
| Wrote a json parser recently and did not think this hard about it
| (because the spec is so simple). Time to revisit
| rendaw wrote:
| The primitive types JSON specifies are redundant and generally
| only lead to issues. Almost all JSON consumers are either
| deserializing to a spec that already contains type information,
| frequently richer, with even more variety of types (url,
| telephone number, UUID, not just "string"), and even without a
| spec code will be written to need a specific type (i.e. you're
| not going to write code to accept an integer when you want a
| person's name).
|
| It would be much simpler if all primitives were strings, and it'd
| probably save a few people from accidentally doing the wrong
| thing while dealing with prices.
| aftbit wrote:
| Perhaps. I've often wished that JSON supported some sort of
| custom types or type annotations, or failing that, at least
| datetimes. Some other nice extensions would be support for
| comments and optional trailing commas.
|
| There is something very nice and expressive about the existing
| JSON types. Just 6 types (null, boolean, string, number, array,
| and dictionary) are enough to cover a ton of use cases, and as
| you suggest, one can always fall back to "stringly typed"
| alternatives by implementing one's own serialization and
| deserialization for extra types.
| ooterness wrote:
| You may be interested in CBOR (IETF RFC 8949).
|
| CBOR features are almost one-to-one with JSON, except that
| the encoding is more size-efficient, it supports a few
| additional types (e.g., integers and floats are separate),
| and it allows semantic tags.
|
| https://en.wikipedia.org/wiki/CBOR
| murmansk wrote:
| While it might be great in theory, CBOR has its own separate
| set of dragons waiting for you.
|
| Expectation: tags in CBOR allow you to pass semantics.
| Reality: the multitude of tags and the absence of strict rules for
| the tags make it a pain in the ass.
| zzo38computer wrote:
| There are some benefits of CBOR (having a separate integer
| type is good, and a byte string type is good, and they have
| typed numeric arrays which is good also, etc), but also
| some problems. For example, I might have preferred that
| Unicode is a tag rather than a type (other tags can be used
| for other character sets), and base64-encoded strings also
| seems unnecessary (since it is a binary format anyways, you
| should just use the binary data directly instead), and I
| think it would be better for a MIME message to be treated
| as a byte string instead of Unicode (fortunately the
| specification allows that, but it seems to just be "added
| on" afterward due to a lack of consideration), and possibly
| maybe it might be better to disallow the types of keys to
| be arrays and maps.
|
| However, some of the things I mentioned above, do have
| benefits for interoperability with JSON, although they
| aren't good for a general-purpose use; I think that it
| would generally be better to make a good format rather than
| trying to work only with the bad ideas of other
| specifications. (Fortunately, I think what I described
| above could be implemented using a subset of CBOR.)
|
| However, using these formats (whether CBOR or JSON) is
| often more complicated than should be needed for a specific
| use anyways.
| VMG wrote:
| Disagree. The typical ad hoc funcs for parsing string to bool
| make me despair (uppercase, lowercase, true, yes, y, 1, .. )
| sgarland wrote:
| Python's distutils had a strtobool() function that was very
| handy for this, but the module has been removed. It's trivial
| to re-implement, but still slightly annoying to have to do.
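A minimal sketch of such a re-implementation, assuming the same accepted spellings as the removed distutils helper. Unlike the original, this version returns a real bool rather than 1 or 0 and strips surrounding whitespace; both are choices made for this sketch, not behaviour of distutils.

```python
def strtobool(value: str) -> bool:
    """Map common truthy/falsy spellings to a bool; reject anything else."""
    truthy = {"y", "yes", "t", "true", "on", "1"}
    falsy = {"n", "no", "f", "false", "off", "0"}
    # Assumption of this sketch: ignore case and surrounding whitespace.
    normalized = value.strip().lower()
    if normalized in truthy:
        return True
    if normalized in falsy:
        return False
    # Failing loudly avoids silently treating typos as False.
    raise ValueError(f"invalid truth value: {value!r}")
```

Raising on unknown input is the important part: an ad hoc parser that defaults to False will quietly accept "ture".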
| kibwen wrote:
| Let's make a distinction here between serialization formats and
| configuration formats. Because JSON is often used for both,
| these two use cases often get conflated.
|
| For _configuration_ formats, I 100% agree with you. I do not
| want _any_ data type except a string and a hashmap (maybe an
| array if you're being luxurious). Not an int, not a float, not
| a boolean, not a datetime (looking at you, TOML). For
| configuration formats I am always _immediately_ feeding those
| files into a language with a richer type system that will
| actually parse them; my program and its embedded types _are_
| the schema. (Users of dynamically-typed languages may
| reasonably disagree.)
|
| However, for the serialization use case, I'm not so sure.
| There's an argument that having a schema against which to do
| lightweight validation at several points in the pipeline isn't
| the worst idea, and built-in primitives get you halfway to a
| half-decent schema. I'm ambivalent at worst.
| troupo wrote:
| > my program and its embedded types are the schema.
|
| They are not. Configuration is a very tiny subset of a more
| general problem that you also mention: serialization.
|
| Your config file will be de-serialized by your program and
| parsed into some specific types. Including numbers (tons of
| edge cases), dates (tons of edge cases), strings (tons of
| edge cases) etc.
|
| It becomes worse when your program is used by more people
| than just you: which field is a date? In which format? Do you
| handle floats? What precision? What's the decimal separator?
| Do you do string normalization? What are valid and invalid
| characters, if any?
|
| You can't pretend that your config is "just strings". It is
| not.
| wruza wrote:
| But most configs are just strings and it's okay. How does
| it get so bad just itt?
|
| Human input is full of tradeoffs, that's why it's bash and
| not typescript in your shell path column. And you'll meet a
| great resistance from users if you make your config fully
| typed and require to refer to schema dtd ns or whatever bs
| xml had.
| mike_hock wrote:
| I kind of took away the opposite from the parent post. Of
| course, your config isn't just strings, but it also isn't
| just a limited set of primitive types that the inventor of
| some one-size-fits-all configuration language envisioned.
|
| You can't build a generic schema validator that will accept
| _exactly_ the valid configs for some program and nothing
| else anyway, so forget the half-assed type checking
| attempts and just provide the hierarchical structure. It's
| up to the application to define the valid grammar and
| semantics of each config option and parse it into an
| application-specific type.
| hgyjnbdet wrote:
| I would say all configs should be treated as castable
| strings. That's why for config files I much prefer the INI
| format.
| nevermore24 wrote:
| The strings are strings. I don't care how people handle
| their dates, that's between them and their god.
| crazygringo wrote:
| > _Almost all JSON consumers are either deserializing to a spec
| that already contains type information_
|
| But different languages interpret different strings in
| different ways by default.
|
| This leads to major bugs.
|
| One of the great strengths of JSON is that parsing a number is
| well-defined.
|
| The way you're suggesting would lead to people emitting JSON
| with leading zeros sometimes, and then some languages end up
| interpreting certain numbers as octal.
|
| No thank you.
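The leading-zero point can be illustrated with Python's standard json module. This is only a sketch: `parse_json_number` is a hypothetical helper name, and other languages' string-to-integer conversions will behave differently, which is exactly the concern raised above.

```python
import json


def parse_json_number(text: str):
    """Return the parsed value, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


# The JSON grammar forbids leading zeros, so a conforming parser rejects "012".
ok = parse_json_number("12")
rejected = parse_json_number("012")

# Once the value travels as a plain string, the meaning depends on the consumer.
as_decimal = int("012")    # read as base 10
as_octal = int("012", 8)   # read as base 8, as some languages do by default
```

Here the same three characters yield two different integers depending on the base the consumer picks, while the JSON number form is simply refused.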
| anonymoushn wrote:
| JSON numbers are just certain strings, but some tools that
| deal with json such as jq feel a need to mangle the numbers
| anyway
| crazygringo wrote:
| I don't know what you mean.
|
| JSON numbers are far more restrictive than strings and
| carry precisely defined meaning in a way that arbitrary
| strings don't. They're only "just certain strings" in the
| same way anything can be serialized to a string, which
| doesn't really mean anything.
|
| What does jq do to them?
| IshKebab wrote:
| It's extremely common in dynamically typed languages to
| deserialise JSON without a spec. What you're asking for is
| basically XML and it's definitely nicer to get at least basic
| types (string, bool, int, etc.) "for free".
| kemitchell wrote:
| I've long used a toylike "Lists and Maps of Strings" format for
| personal recordkeeping and automation.
| https://www.npmjs.com/package/lamos
|
| I've never gone back to formalize the grammar or otherwise
| mature it. But it's served me well as-is, and it's been easy to
| convert "up" to JSON or YAML or XML or what-have-you, once the
| case for an interface beyond plain text proves worthwhile.
| fuzztester wrote:
| >It would be much simpler if all primitives were strings,
|
| TCLON?
| aftbit wrote:
| I'm a little sad to see that no implementations of JavaScript
| were on the tested parser list. I'd be interested to see where
| browsers and nodejs `JSON.parse` as well as `eval` parsers fall.
| As the author mentioned, some of the JSON features are not valid
| JavaScript but I wonder which of these test cases fail `eval`.
|
| Note, just so nobody reminds me, don't parse JSON with eval for
| security reasons. I'm just curious how it would work from a
| parser completeness point of view.
| Thaxll wrote:
| XML was indeed better.
| IshKebab wrote:
| It absolutely wasn't, primarily because the XML data model is
| so mismatched with the object structures you find in
| programming languages.
|
| It does at least support comments though. Biggest flaw in JSON
| by far.
| Tao3300 wrote:
| Apples and oranges.
| ryjo wrote:
| Writing a JSON parser is a good way to teach yourself better
| programming practices. I attribute my understanding of pointer
| arithmetic and i/o streams to my own efforts in
| parsing/generating JSON.
| jwells89 wrote:
| JSON has its issues, but modern languages including facilities to
| work with it (fewer dependencies to wrangle is always great) and
| the way typesafe (de)serializers can be synthesized automatically
| with a little thoughtfulness in design instead of needing to be
| manually written (see Swift's Codable and Kotlin/Java's Moshi,
| for example) can in my opinion make it compelling enough to
| overlook its warts. It doesn't fit everywhere of course but it's
| more than good enough for a vast range of applications.
| theamk wrote:
| This is interesting, but seems pretty irrelevant for the real
| world (kinda like "i = ++i + ++i;" C puzzle). The answer to those
| dangers is "don't do it then". Use your stdlib to emit json,
| don't use string functions to modify json, assume any number is
| no better than float64, and base64 your binary data - and you will
| never have to worry about this "minefield"
|
| (the only possible problem is if you are designing a security
| system, but even then, since all the ambiguity is whether to
| reject the string, it will cause DOS at worst)
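A rough sketch of that advice in Python: let the standard library do the escaping and carry binary data as base64 text. The function names and the `payload_b64` field name are made up for illustration.

```python
import base64
import json


def encode_record(name: str, payload: bytes) -> str:
    """Serialize with json.dumps so quoting and escaping are handled for us."""
    return json.dumps({
        "name": name,
        # Binary data is not representable in JSON strings, so ship it as base64.
        "payload_b64": base64.b64encode(payload).decode("ascii"),
    })


def decode_record(text: str) -> tuple[str, bytes]:
    """Parse the record and recover the original bytes."""
    record = json.loads(text)
    return record["name"], base64.b64decode(record["payload_b64"])
```

Building the same document with string concatenation would require getting the escaping of quotes, backslashes, and control characters right by hand.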
| mariusor wrote:
| Funny how Baader-Meinhof works, I just finished writing a JSON
| toy parser earlier today. I guess I'll add the mentioned corner
| cases to the testsuite, and watch them fail. :D
| hughw wrote:
| And so I just now learned that the Baader-Meinhof Gang of the
| 1970s gave its name to the phenomenon of frequency illusion,
| where once you hear about a thing you notice many more
| references to it.
| RedShift1 wrote:
| Parsing any format is a minefield though...
| thecleaner wrote:
| Btw if we use parser generators like antlr for this purpose, is
| it still a minefield? Can someone point to some vulnerabilities I
| can study?
| zzo38computer wrote:
| There are problems with JSON, such as:
|
| - The numbers is floating points, but cannot be Infinity and NaN.
| It is not a integer type, so long integers might not work
| properly. (There are other problems with numbers too, as
| mentioned in that article.)
|
| - The strings is Unicode. Non-Unicode (including binary data)
| doesn't do properly, and even Unicode can have problems (some of
| which are mentioned in that article, but there are others too).
|
| - Keys are only strings, not numbers.
|
| - Syntax convenience isn't so well, e.g. doesn't have comments,
| optional trailing commas, etc.
|
| - The format is difficult for reasons explained in that article,
| too.
|
| One possible alternative would be a format based on a subset of
| PostScript (instead of JavaScript), e.g. (a part of a example
| from Wikipedia):
|
|     <<
|       /first_name (John)
|       /last_name (Smith)
|       /is_alive true
|       /age 27
|       /phone_numbers [
|         << /type (home) /number (212 555-1234) >>
|         << /type (office) /number (646 555-4567) >>
|       ]
|       /spouse null
|     >>
|
| PostScript also has binary format, comments (with a percentage
| sign), hex string literals, etc. (And, commas are not used, so
| the problem with trailing commas also does not apply.)
|
| (Nevertheless, I did write a JSON parser (and also a JSON writer)
| in PostScript.)
|
| It is also possible to use binary formats, CSV, etc, depending on
| what exactly is needed by the program; for many reasons, one
| format cannot solve everything.
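The number caveats from the first bullet of the comment above can be seen with Python's standard json module. This is a sketch of one implementation's behaviour, not a statement about all parsers; `dumps_strict` is a hypothetical helper name.

```python
import json

# Python's json module emits the non-standard token NaN unless told otherwise.
lenient = json.dumps(float("nan"))


def dumps_strict(value) -> str:
    """Serialize value, raising ValueError on NaN or Infinity."""
    return json.dumps(value, allow_nan=False)


# 2**53 + 1 is the smallest positive integer a float64 cannot represent.
big = 2**53 + 1
text = json.dumps(big)

as_int = json.loads(text)                     # Python keeps the exact integer
as_float = json.loads(text, parse_int=float)  # simulates a float64-only consumer
```

In this sketch `as_int` round-trips exactly, while `as_float` lands on the neighbouring representable value, which is the "long integers might not work properly" problem in miniature.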
| BugsJustFindMe wrote:
| > _The numbers is floating points...long integers might not
| work properly_
|
| I personally hate the usual interpretation as float and see it
| as a common but extremely-implementation-induced failure. It's
| far better interpreted as an arbitrary precision numeric type,
| not float or int. The spec even says as much and only says that
| implementations mostly suck so watch out. IMO precision myopia
| is why we end up with e.g. Python's refusal-by-default to
| (de)serialize from/to Decimal.
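For reference, a small sketch of how this looks in Python's standard json module: parsing to Decimal is opt-in through `parse_float`, and serializing a Decimal fails unless a `default` hook is supplied. The `decimal_default` helper is hypothetical, and emitting the value as a JSON string is a workaround chosen for this sketch, since it changes the JSON type on the wire.

```python
import json
from decimal import Decimal

# Opt in to exact decimal parsing instead of binary floats.
price = json.loads('{"price": 0.1}', parse_float=Decimal)["price"]


def decimal_default(obj):
    """Fallback for json.dumps: encode Decimal as a string (sketch only)."""
    if isinstance(obj, Decimal):
        return str(obj)
    raise TypeError(f"not JSON serializable: {type(obj).__name__}")


def dumps_with_decimal(value) -> str:
    """Serialize value, routing Decimal through decimal_default."""
    return json.dumps(value, default=decimal_default)
```

Without the hook, `json.dumps({"price": price})` raises TypeError, which is the refusal-by-default mentioned above.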
| nurettin wrote:
| Why not make non-strict parsers that will handle unicodes,
| longs, binary, ignore comments and allow trailing commas? If
| you set bend_over_backwards=true, it will do strict parsing for
| the poor souls who need that.
|
| edit: I didn't mention integer keys, because object members
| canonically start with a letter.
| agys wrote:
| _> Syntax convenience isn't so well, e.g. doesn't have
| comments, optional trailing commas, etc._
|
| I never understood these two choices in the spec as they are
| totally against the "human-readable" goal...
| stevejb wrote:
| I definitely agree with a lot of the comments here, especially the
| ones in the vein of "don't do dangerous things with json." If you
| have control of the sender and the receiver, it makes sense to have
| fields that add a bit of extra type information, e.g. "this is an
| integer" or "this is a float with this much precision".
| At that point just use protobuf
___________________________________________________________________
(page generated 2024-06-02 23:02 UTC)