[HN Gopher] Parsing JSON Is a Minefield (2016)
       ___________________________________________________________________
        
       Parsing JSON Is a Minefield (2016)
        
       Author : todsacerdoti
       Score  : 167 points
       Date   : 2021-10-11 09:57 UTC (13 hours ago)
        
 (HTM) web link (seriot.ch)
 (TXT) w3m dump (seriot.ch)
        
       | ChrisArchitect wrote:
       | Surely something newer on this since 2016
       | 
       | Plenty of previous discussion:
       | 
       | 2 years ago https://news.ycombinator.com/item?id=20724672
       | 
       | 3 years ago https://news.ycombinator.com/item?id=16897061
       | 
       | 5 years ago https://news.ycombinator.com/item?id=12796556
        
       | kstenerud wrote:
       | Safety and security are two big reasons why I developed Concise
       | Encoding [1]. The computing and networking landscape today is
       | MUCH more hostile compared to the JSON and XML heyday (with state
       | actors and organized crime now getting in on the action), and
       | it's time to retire them in favor of more secure and predictable
       | formats that are also human-friendly.
       | 
       | [1] https://concise-encoding.org
        
       | Decabytes wrote:
       | I'm a data scientist so I work with JSON and csv all the time.
       | It's amazing how the back bone of data serialization and
       | reporting are so ambiguous.
       | 
       | But I wonder if I'm part of the probably. Know one notices all
       | the inconsistencies because so much of my job is ironing it out.
        
       | EdwardDiego wrote:
       | Far easier than parsing Markdown at least.
        
       | [deleted]
        
       | q3k wrote:
       | Some other fun facts about JSON, its mainstream implementations
       | and using it reliably:
       | 
       | 1. json.dump(s) in Python by default emits non-standards-
       | compliant JSON, ie. will happily serialize NaN/Inf/-Inf. You want
       | to set allow_nan=False to be compliant. Otherwise this _will_
       | annoy someone who has to consume your shoddy pseudo-JSON from a
       | standards-compliant library.
       | 
       | 2. JSON allows for duplicate/repeated keys, and allows for the
       | parser to basically do anything when that happens. Do you know
       | how the parser implementation you use handles this? Are you sure
       | there's no differences between that implementation and other
       | implementations used in your system (eg. between execution and
       | validation)? What about other undefined behaviour, like permitted
       | number ranges?
       | 
       | 3. Do you pass around user-provided JSON data accross your
       | system? How many JSON nesting levels does your implementation
       | allow? What happens if it's exceeded? What happens if different
       | parts of your processing system have different limits? What about
       | other unspecified limits like serialized size, string length?
       | 
       | My general opinion is that it's extremely hard to reliably use
       | JSON as an interchange format reliably when multiple systems
       | and/or parser implementations are involved. It's based on a set
       | of underdefined specifications that leaves critical behaviour
       | undefined, effectively making it impossible to have 100%
       | interoperable implementations. It doesn't help that one of the
       | mainstream implementations (in Python) is just non-compliant by
       | default.
       | 
       | I highly encourage any greenfield project to look into well
       | designed and better specified alternatives.
        
         | zelphirkalt wrote:
         | Some good points there. And now imagine people wanting to
         | needlessly use YAML for configuration, which adds loads of edge
         | cases on top of that.
        
         | magicalhippo wrote:
         | > My general opinion is that it's extremely hard to reliably
         | use JSON as an interchange format reliably when multiple
         | systems and/or parser implementations are involved.
         | 
         | XML is very precisely defined in comparison to JSON. Yet we've
         | had one customer who had a system that couldn't handle XML
         | files with newlines in them at all, and several which
         | _sometimes_ sends ISO 8859-1 (Latin 1) encoded data in _some
         | fields_ of a XML file with encoding= "UTF-8" in the header...
         | 
         | We of course also have some nice fixed-field integrations,
         | based on customer's specs, where the system suddenly sends
         | multiple mangled characters if any non-ASCII character is
         | present, causing the fields to suddenly not be so fixed
         | anymore... It behaves very much like UTF-8 interpreted as
         | Latin-1, except with something else than Latin-1.
         | 
         | Anyway, I've given up trying to be strict at this point. We
         | will have to wash incoming data, it's apparently inevitable.
        
           | spookthesunset wrote:
           | I mean even if it is well defined that doesn't mean the devs
           | are using the languages native parser library. I've encounter
           | at least two projects where the devs rolled their own XML
           | "parser" using regex and "substring" functions. Why? "The xml
           | library was too bloated... much easier to write it ourself".
           | Suffice to say, they had tons to problems.
        
         | stinos wrote:
         | _You want to set allow_nan=False to be compliant. Otherwise
         | this _will_ annoy someone who has to consume your shoddy
         | pseudo-JSON from a standards-compliant library_
         | 
         | Funny (well, not really) thing is NaN and Inf are perfectly
         | valid floating point numbers acoording to most (?) standards
         | used on computers. To the point that I don't understand why it
         | was left out of JSON. So unless you're 100% sure you won't
         | encounter these numbers the choice is between not being able to
         | use JSON, or finding hacks around (and using null isn't one of
         | them since you have 3 numbers to represent), or just using non-
         | compliant-yet-often-accepted JSON and possibly annoying someone
         | whos parser doesn't handle it.
         | 
         | And for me there have been quite a lot of cases were I just
         | quickly needed something simple to interface between components
         | so when finding out they all support JSON+Nan/Inf then the
         | choice is usually made quickly.
        
           | MathMonkeyMan wrote:
           | From a practical standpoint, defining numbers in JSON to be
           | "whatever double precision binary floating point does, or
           | optionally something more precise" would have been good
           | enough, and capture what we end up having anyway.
           | 
           | Still, I prefer Crockford's choice: that JSON numbers are
           | defined to be _numbers_. Infinity and the flavors of NaN
           | are... not numbers.
           | 
           | In an extensible data interchange format, like [edn][1],
           | people could define conventions about more specific
           | interpretations of numbers, e.g.
           | #ieee754/b64 45.6653 ; this is a double
           | 
           | We could build such a format on top of JSON (there are
           | probably multiple), but I again agree with Crockford that
           | this sort of thing does not belong in JSON.
           | 
           | Makes for a bunch of headaches, though, for sure.
           | 
           | One example is a data scientist I used to work with. He was
           | working with lots of machine learning libraries that liked to
           | use NaN to mean "nothing to see here." A fellow developer
           | ended up writing code that used some sort of convention to
           | work around it, e.g. number := decimal | {"magic-uuid":
           | "NaN"}. I can see why some people are of the opinion "this is
           | stupid, just allow NaNs." I disagree.
           | 
           | [1]: https://github.com/edn-format/edn
        
           | dragonwriter wrote:
           | > Funny (well, not really) thing is NaN and Inf are perfectly
           | valid floating point numbers acoording to most (?) standards
           | used on computers. To the point that I don't understand why
           | it was left out of JSON.
           | 
           | There are all kinds of ways to encode that in JSON, but
           | (contrary to JS, where "numbers" or IEEE doubles, which
           | include various things which are either not numbers or not
           | finite), JSON numbers are generic finite (both in size or
           | decimal representation) numbers, so "as JSON numbers" is not
           | one of them. (And there's no explicit way defined in JSON, so
           | if you want it to be unambiguous, you need externally defined
           | semantics, but you need that for most real uses anyway.)
        
           | nomel wrote:
           | > To the point that I don't understand why it was left out of
           | JSON
           | 
           | I think you're forgetting the birthplace of JSON. Who deals
           | with the concept of infinity and NaN in the context of web
           | front ends?
        
             | lifthrasiir wrote:
             | Ranges are pretty common in APIs and both -Infinity and
             | Infinity can naturally arise from one-sided ranges. Since
             | they are absent in JSON, they are frequently replaced with
             | null, ad-hoc sentinel values with uncoded assumptions (e.g.
             | timestamps should be always positive) and missing fields.
        
             | stinos wrote:
             | I get that, but to go from "oh this won't be very common"
             | to willingly "let's just leave this out" is something else.
             | At least in my mind :) Or was it an oversight?
        
               | mst wrote:
               | I suspect it was a bet on worse is better.
               | 
               | Whether it was a _good_ bet is debatable, but given
               | Crockford 's focus on "try and leave out as much as
               | possible" I can certainly see it making sense at the
               | time.
        
           | josefx wrote:
           | > To the point that I don't understand why it was left out of
           | JSON
           | 
           | Because JSON has generic numbers that just happen to be able
           | to represent every numeric IEEE floating point double value.
           | In theory you could have an implementation that uses a
           | BigDecimal class or something similar to represent numeric
           | values. Which is of course completely incompatible with every
           | other JSON implementation and just asks for badly tested edge
           | cases to rear their ugly head.
        
         | EdwardDiego wrote:
         | > How many JSON nesting levels does your implementation allow?
         | What happens if it's exceeded
         | 
         | Haha, I've met a few stack overflows in this area.
        
           | tehbeard wrote:
           | While there's a lot of issues with JSON, this one also
           | applies to any other interchange format that supports
           | nesting, including the much beloved XML. Protobuf might also
           | have this, idk if it does any static analysis for infinite
           | depth.
        
             | q3k wrote:
             | The problem doesn't really exist in Protobuf, as protobuf
             | (de)serialization is performed based on an IDL definition
             | of the message type. Whatever that IDL specifies, a
             | corresponding typed definition and (de)serialization
             | function will be generated for your programming language,
             | and that implementation will ignore any fields that weren't
             | part of the IDL. The (de)serializing code is statically
             | generated ahead of time, and is treated like any other code
             | that operates on potentially nested data structures.
             | 
             | What this means is that if your IDL specifies deep nesting
             | (or recursive nesting), then it means your application is
             | expected to handle this "by contract", and attempts to
             | deserialize will rightfully fail in case of out-of-memory /
             | stack overflow errors. There's no danger of an
             | implementation 'accidentally' deserializing something
             | nested that was passsed from the outside, as anything
             | unknown to the IDL is simply ignored.
             | 
             | Finally, there's no XML-like self-references in Protobuf,
             | so it's not possible to have an infinitely deep structure,
             | or a combinatorial explosion like with billion laughs -
             | just a very deeply nested one, and only if allowed in the
             | IDL, and only up to whatever message size limit you're
             | allowing.
        
               | tehbeard wrote:
               | Thank you for the 2nd + 3rd paragraphs, those were parts
               | of protobufs design I wasn't really aware of from a
               | cursory glance.
               | 
               | I'm a little suprised to learn there's no self-reference
               | support in protobuf, as I wouldn't have assumed parsing
               | that would be an issue (as all it really is is a pointer
               | to an existing object in the message to say, put a ref.
               | to it here), though I guess it might be a problem in
               | supporting certain languages.
        
               | q3k wrote:
               | > I'm a little suprised to learn there's no self-
               | reference support in protobuf, as I wouldn't have assumed
               | parsing that would be an issue (as all it really is is a
               | pointer to an existing object in the message to say, put
               | a ref. to it here), though I guess it might be a problem
               | in supporting certain languages.
               | 
               | That's a tradeoff more designs should have, IMO: reduce
               | the feature set as much as possible, but in return make
               | the implementation vastly simpler. :)
               | 
               | I assume it's not only about support in programming
               | languages, but also exactly to eliminate the entire class
               | of bugs that stems from back/forward-references in
               | serialized data, and to generally keep the wire format as
               | simple (to parse and to implement a parser for) as
               | possible. The few usecases that could make use of
               | references are not worth the pain inflicted on everyone
               | if they were implemented.
        
             | ChrisMarshallNY wrote:
             | _> the much beloved XML_
             | 
             | I can't quite resolve "beloved" and "XML" in the same
             | sentence...
             | 
             | That said, I have used XML _a lot_ , pretty much because of
             | XML Schema.
             | 
             | I don't like it. No sir. Not one bit. Uh-uh...
             | 
             | But there's really no viable substitute.
             | 
             | When I design an API, I generally start with an object
             | model, and use native converters to create XML and JSON
             | from it.
             | 
             | I will provide an XML Schema with the XML variant. I often
             | have to do this by hand, which sucks. There are tools to
             | create Schema from dumps, but these are pretty limited. I
             | may use them to "get me in the ballpark," but there's
             | always lots of elbow grease.
             | 
             | I'll use the XML for testing, but I will usually use the
             | JSON format for the actual implementation.
        
               | nocman wrote:
               | > I can't quite resolve "beloved" and "XML" in the same
               | sentence...
               | 
               | You mean, it's possible to take that as being _NON-
               | sarcastic_??? If so, I share your lack of resolution.
        
           | thechao wrote:
           | I had a little non-recursive JSON parser hanging around. When
           | you have "nested" levels you've really only got two choices:
           | object, or array. That implies that to track nesting, you
           | just need an array of 1b values. In order to shave the yak
           | _properly_ , I built "nesting compressor" that detected runs
           | of array/object and represented them using a 64b RLE; or, it
           | bailed out, and then just used on-the-fly compression with
           | zstd.
           | 
           | Obviously, any sort of JSON file that fit on a disk I can
           | afford can be parsed into memory in a tiny fraction of its
           | on-disk representation. I modified `yes` to just stream `[`
           | out; the JSON parser handled it just fine -- it takes a while
           | to roll a 64b counter.
        
         | lifthrasiir wrote:
         | And all these problems trace back to Douglas Crockford. He
         | didn't know how to make a proper serialization format [1] and
         | also an interoperable standard (for the latter, Tim Bray tried
         | very hard to make it slightly better [2]). He just noticed that
         | a (supposed) subset of JavaScript can be easily turned into a
         | serialization format with `eval` and went to publicize it, only
         | noticing the issues later _and still pursuring its
         | standardization as is_. I hate him.
         | 
         | [1] My additional complaints:
         | https://news.ycombinator.com/item?id=24953981
         | 
         | [2]
         | https://www.tbray.org/ongoing/When/201x/2014/03/05/RFC7159-J...
        
           | gmac wrote:
           | I was using JSON before it was 'invented', as was basically
           | anyone sending data to the browser in JS format.
           | 
           | Holding specific people responsible is pretty absurd.
        
             | lifthrasiir wrote:
             | JSON before the standardization had an obvious data model
             | and specification: ECMAScript. (I don't think JSON was
             | widely used outside of JavaScript back then.) ECMAScript is
             | particularly strictly defined even compared to other
             | language standards, so it should have been possible to
             | extract the relevant portions of ECMAScript into a proper
             | standard. Crockford didn't. JSON as specified by Crockford
             | was not even a proper subset of ECMAScript until ECMAScript
             | itself retrofitted its syntax.
        
         | makeitdouble wrote:
         | It's a pyramid.
         | 
         | At the bottom you have CSV which is popular beyond belief, and
         | has no real specification, with common cases wildly handled
         | differently across libraries.
         | 
         | In the middle you have JSON which isn't 100% interoperable, but
         | goes 98% of the way.
         | 
         | And you have XML and protobufs at the top tip, who have strong
         | mechanisms available for interoperability but at an operational
         | cost that rarely justifies the upgrade from JSON.
         | 
         | I suspect it will take a lot more that "well designed" and
         | "better specified" to justify moving away from JSON as the
         | default stepup from chaotic CSV like formats.
        
           | maple3142 wrote:
           | I am not sure if parsing XML is better than parsing JSON.
           | Many languages or libraries' XML parser are dangerous by
           | default. You usually need to manually configure your XML
           | parser to be secure from XML-related attacks. Fortunately,
           | some languages and libraries are going to make XML have a
           | securer defaults, this is a good change. IMO, I think XML
           | shouldn't have include many questionable features from
           | security perspective.
        
             | pdimitar wrote:
             | Agreed, and I like the libraries I saw in the past that
             | deliberately only support a small subset of all XML
             | extensions (sadly now I can't remember the names). Reducing
             | attack surface _and_ increasing sanity in one stroke is a
             | policy that much more open-source software has to adopt.
        
             | makeitdouble wrote:
             | You are right, and XML parsers can have a very large attack
             | surface due to the sheer amount of specs to adhere to.
             | 
             | I see XML as better in the expressiveness it has, and more
             | mature out of the box options to validate and transform it.
             | Security and bugs remain an issue, but at the scale it can
             | be used, there is a fighting chance to have experts dealing
             | with the hardening of it all.
             | 
             | Swagger like format definitions are still pretty lax in my
             | option in comparison. Now I wouldn't want to get back to
             | XML land, I just think it occupies a pretty solid niche
             | that is hard to match with anything more simple.
        
             | Sohcahtoa82 wrote:
             | > Many languages or libraries' XML parser are dangerous by
             | default.
             | 
             | Seriously, XML External Entities is an incredibly dumb
             | feature. To have it enabled by default makes it even worse.
        
             | tannhaeuser wrote:
             | To be fair, XML wasn't intended as a data exchange format
             | but as simplified SGML subset for use as delivery format on
             | the web. While that largely hasn't happened, XML with XSD
             | (sans rarely used feats) remains a strong exchange format
             | for coarsely-grained inter-party traffic such as payment
             | systems, taxes and other public/private data, etc.
             | 
             | I'm guessing the security deficits you mention are XML
             | entity attacks. Well, SGML has CAPACITY ENTLVL in the SGML
             | declaration to limit expansion depth. And a markup
             | authoring or delivery format without entities/text macros
             | is quite useless, even though HTML, when seen as a stand-
             | alone markup language rather than SGML vocabulary, lacks
             | it.
        
               | goodpoint wrote:
               | > XML wasn't intended as a data exchange format but as
               | simplified SGML subset for use as delivery format on the
               | web
               | 
               | You cannot deliver web content without... exchanging
               | data.
               | 
               | And you cannot trust servers not to attack browsers.
        
               | HWR_14 wrote:
               | > And you cannot trust servers not to attack browsers.
               | 
               | Interesting. I normally see it expressed the other way
               | (trusting the server and not the client). Obviously, both
               | are important.
        
               | mcv wrote:
               | I guess trust needs to be a two-way street. Even between
               | computers.
        
             | ievans wrote:
             | For those unfamiliar with these attack vectors, there code
             | injection and denial-of-service issues that in previous
             | version of Python, were exploitable by default. Projects
             | like https://pypi.org/project/defusedxml/ were designed to
             | be secure against these issues by default, rather than
             | requiring the library user to opt in.
             | 
             | The defusedxml project has an excellent matrix showing
             | viability of the attack types against various python XML
             | implementations:
             | https://pypi.org/project/defusedxml/#python-xml-libraries
        
           | mst wrote:
           | When faced with a case where _SV is a natural fit, I 've long
           | been in the habit of (ass-u-ming I get to make that call, of
           | course) specifying PostgreSQL COPY style TSV as the
           | interchange format, and using more 'normal' TSV to make it
           | easy to get data exports into Excel and friends.
           | 
           | That's turned out to be rather less annoying than any other
           | approach to _SV I've tried over the years.
        
           | mumblemumble wrote:
           | I would argue that, in the long run, gRPC/protobuf has a
           | lower operational cost than JSON _as long as you don 't need
           | to talk to it from a browser._
           | 
           | (Consuming it from a browser is a hassle because client-side
           | JavaScript code is unable to speak the full gRPC API, so you
           | need to fuss with reverse proxies to get everything working.)
           | 
           | What it doesn't have is a short learning curve. In order to
           | get started, you need to learn the *.proto format, and how to
           | use the code generator, and all the design implementations of
           | the different data types it supports, and all of that.
           | 
           | But, once you get over that hump, it makes a lot of the hard
           | stuff much, much easier to get right.
           | 
           | What I keep wishing for is some sort of "gRPC-lite" that
           | doesn't include quite as many questionable micro-
           | optimizations as protobuf/gRPC, but does include all of the
           | really good ideas like specification-first API development,
           | service reflection, and a clean logical separation between
           | HTTP semantics and the semantics of the API that's being
           | implemented on top of it.
        
             | nawgz wrote:
             | > gRPC/protobuf has a lower operational cost than JSON as
             | long as you don't need to talk to it from a browser.
             | 
             | So, reading this the other way (browsers are king)... you
             | claim gRPC is so useless as to not be able to power your
             | entire system and requires you to standup duplicate
             | interchange systems for different use cases?
             | 
             | Yikes.
        
               | mumblemumble wrote:
               | Browsers are king for some people, not others. For the
               | stuff I'm working on, the browser is just the tip of the
               | iceberg. For everything below the waterline, the
               | operational benefits (strong static typing, well-defined
               | backward- and forward-compatibility semantics, better
               | throughput and latency characteristics) greatly outweigh
               | the, "but we have to use a lightweight Envoy reverse
               | proxy to expose some things to the browser," problem.
               | 
               | I also have a tendency to consider that Envoy proxy to be
               | more of a feature than a bug, anyway. It's pretty easy to
               | set up, all told. We want to gatekeep the edge, anyway,
               | for various reasons, so it's not like there was ever a
               | reality in which we weren't going to be fussing with a
               | reverse proxy. And it serves as a nice opportunity to
               | stop and be thoughtful about exactly what we're choosing
               | to expose to the Internet.
               | 
               | Speaking purely as a developer, I do find it to be an
               | annoyance. But I also acknowledge that inconveniencing
               | developers for the sake of the greater good can be a wise
               | move.
        
               | bob_roberts wrote:
               | For a cloud-based app, that might literally just be at
               | the application gateway. Beyond that, everything could be
               | whatever protocol.
        
             | Cloudef wrote:
             | protobuf incurs lots of codebloat (especially with google's
             | runtime / compilers) and the the serialization format is
             | not really that nice IMO. I don't think it's possible to
             | come up with 100% ideal format for all the use cases.
        
             | throwaway894345 wrote:
             | I tried to get into gRPC/protobuf in Go on a Mac for a
             | little hobby project, but man the effort just to get protoc
             | up and running and then generate the stubs was insane. I'm
             | sure somehow or another, it's user error, but when the
             | barrier of entry is so high, it's hard to justify the
             | effort when JSON-slinging is so rarely the bottleneck.
        
           | recursive wrote:
           | > you have CSV which is popular beyond belief, and has no
           | real specification
           | 
           | What about RFC 4180? Works for me.
        
           | jedimastert wrote:
           | >CSV...has no real specification
           | 
           | Bite your tongue sir! It has a GLORIOUS specification!
           | 
           | https://datatracker.ietf.org/doc/html/rfc4180
        
           | heresie-dabord wrote:
           | I would describe it as a _pyramid of bounded viability_ ,
           | from the minimally viable to the feature-burdened maximum.
           | 
           | CSV excels (ha) as a minimally viable exchange format for
           | data. Combined with awk, grep, sed, bash, and Perl, and some
           | simple SVG or D3 with SVG, the analytical solution is fast,
           | scalable, and automatable.
           | 
           | But CSV has limits. The column headers are the schema. Beyond
           | these bounds, we have the other formats.
           | 
           | JSON is messier. Its strength is in network/browser
           | encapsulations and operations. As I have seen it used, people
           | insert an array as a kind of schema, and they stay away from
           | complex nesting where scalability starts failing and other
           | difficult tooling must be summoned to compensate (parsing
           | tools such as jq).
           | 
           | Beyond JSON's bounds, we have XML and associated tooling. XML
           | is versatile and expressive.
           | 
           | XML and JSON can be written simply but both can be abused by
           | programmers who aren't thinking beyond their own cursor.
           | 
           | This is a rich set of tooling for data representation.
           | 
           | In the end, one of the main advantages of CSV is that it
           | remains a format that brings little tooling baggage
           | ("ecosystem") to the task.
        
           | petschge wrote:
           | I'd argue the top of the pyramid is actually formats such as
           | HDF5. That format was started to store voyager data and we
           | can still read it after more than 40 years. It makes the
           | format of entries extremely clear ("this is an 3d array of
           | floating point numbers in IEEE755 format, with 64, 2 and 17
           | entries per dimension") and encourages further meat data ("it
           | is in statV per centimeter and came out of channel of of the
           | intrument") in addition to naming the data set "electric
           | field". Compared to horrible piles of binary data that used
           | to be common (and still are!), it's a breeze of fresh air to
           | work with.
        
         | Cloudef wrote:
         | Any JSON parser that tries to handle numbers without big number
         | support is broken. This is why I always raise eyebrows if I see
         | json library that doesn't allow me to deal with the number
         | myself by retrieving it as a string.
        
         | kortex wrote:
         | > I highly encourage any greenfield project to look into well
         | designed and better specified alternatives.
         | 
         | Like what?
        
           | q3k wrote:
           | My preference is Protobuf, but really anything that's not
           | JSON and which also comes with some IDL gets my approval.
        
             | kortex wrote:
             | I like protobuf for some use-casess (namely grpc) but a)
             | it's a binary format and sometimes (often times) it's nice
             | to have a text protocol
             | 
             | b) protobuf libraries and protoc have given me way more
             | grief overall than json (python, js, c++)
             | 
             | If your workflow already supports it, I can see it being
             | useful, but it's got a pretty steep learning curve to be
             | honest, certainly more than json, despite the ill-
             | implemented libs out there. If I wanted a binary format,
             | IMHO I'd go for msgpack first, and reach for protobuf if
             | that didn't work for me.
        
               | elteto wrote:
               | > I like protobuf for some use-casess (namely grpc) but
               | a) it's a binary format and sometimes (often times) it's
               | nice to have a text protocol
               | 
               | Protobuf (and flatbuffers) supports parsing messages from
               | JSON instead of a binary blob. Best of both worlds IMO.
        
             | avmich wrote:
             | Can you use JSON Schema? Generating classes from it, if you
             | want native objects?
        
               | q3k wrote:
               | You can use whatever you want :).
               | 
               | I personally would rather still go with Protobuf if I'm
               | going to put in the effort to add a schema and codegen.
               | It gives me other nice-to-have features (faster
               | [de]serialization, smaller messages, field numbers and
               | schema evolution, nicer IDL [not JSON!], gRPC, ...) and
               | does away with some problems intrinsic to JSON that no
               | schema system will fix (terrible number type, lack of
               | binary type, slow parsing). It also has some interop with
               | JSON in the rare case you absolutely positively need to
               | convert to/from it (which is IMO the only upside of using
               | JSON Schema in case you need that interop).
        
           | NavinF wrote:
           | https://capnproto.org/
        
           | 0xbadcafebee wrote:
           | YAML. Of course implementations of this go all over the place
           | too, but you could say the same of XML parsers to a certain
           | extent.
           | 
           | I still pine for binary-only data formats. They're easier to
           | program, and nobody makes the mistake of trying to edit them
           | manually or compose them in a shell script. Parsing data
           | shouldn't be hard, but it also shouldn't be so easy that
           | people hang themselves by accident.
           | 
           | Of course, the reason why we largely have text data formats
           | is because it's insanely simpler to troubleshoot systems that
           | use them. Some things should just be easier to manipulate.
           | But for general purpose work, I miss binary data formats.
           | 
           | Zip is probably my favorite general-purpose binary data
           | format. It's old, well defined, works with any kind of data,
           | and you can immediately seek to data in very large archives
           | rather than having to parse the entire thing first. And then
           | there's that whole compression thing. If you wanted to
           | distribute a thousand tiny blobs of CSV, JSON, YAML, and XML,
           | all in one container, you could do much worse than Zip.
        
             | richardwhiuk wrote:
             | YAML has all of the problems of JSON with some of the
             | problems of XML, and some new ones thrown in. Avoid.
        
             | rjh29 wrote:
             | I've had a number of negative experiences with yaml, enough
             | to put me off using it. For example the implicit parsing of
             | 'yes' and 'no' into bools rather than strings (including
             | the NO country code for Norway)
             | <https://hitchdev.com/strictyaml/why/implicit-typing-
             | removed/>, the no-quote rules allowing accidental creation
             | of inline hashes/arrays
             | <https://hitchdev.com/strictyaml/why/flow-style-removed/>,
             | multiline string syntax so complex that it needs a helper
             | tool <http://yaml-multiline.info/>, and powerful extensions
             | that invite your program to be exploited
             | <https://www.sitepoint.com/anatomy-of-an-exploit-an-in-
             | depth-...>
             | 
             | It manages to be both a poor data interchange language
             | compared to JSON, and also a bad human-friendly langage due
             | to the above ambiguities.
             | 
             | Unfortunately it's still the _best_ human-friendly
             | configuration language in wide use, so I use strictyaml
             | (https://hitchdev.com/strictyaml/) instead.
        
               | kortex wrote:
               | NO is not even the worst of it. `on`, like in github
               | actions, is interpreted as True by PyYaml by default. You
               | have to either quote it, "on", or set certain configs I
               | haven't bothered with just yet.
               | 
               | I fully agree YAML is just...not good as a
               | transport/interchange serde.
               | 
               | Personally, I actually really like HCL as a human-
               | friendly config language, but it's got challenges in
               | writing it, and thus support in most languages, if even
               | present, is read-only.
               | 
               | Will look into strictyaml!
        
             | kortex wrote:
             | Zip isn't a binary data structure protocol though, it just
             | provides a compression protocol. I'd argue that while zip
             | is technically older than gzip (3 years, 89 vs 92), it was
             | proprietary for much of its history, and thus gz is an
             | older "standard".
        
             | BerislavLopac wrote:
             | TOML?
        
             | prionassembly wrote:
             | Is anyone sending sqlite binary blobs over the wire?
             | Foreign keys as a replacement for recursive arrays sounds
             | like a win...
        
           | jerf wrote:
           | Part of the problem is that there's at least half-a-dozen
           | high quality answers out of the gate (gRPC, FlatBuffers,
           | Protocol Buffers, XML in some cases, Thrift), and an even-
           | longer long tail after that. It's made harder when four
           | different teams who deeply loathe JSON and independently
           | decide to use something "better" can legitimately use four
           | completely different technologies if they don't communicate
           | with each other.
        
             | 35fbe7d3d5b9 wrote:
             | To your comment above - you can bodge around interop
             | problems with JSON in ways that you cannot with some of
             | these other technologies.
             | 
             | I like to joke that I invented ndjson over a decade ago
             | when I accidentally forgot to put things in an array before
             | `json.dumps`, I just wasn't smart enough to call it a
             | standard. But when you do end up with ndjson when you
             | wanted an array of results, or vice versa, JSON makes it
             | easy to munge things to where you need.
             | 
             | Compare that to something like protobuf: it's not a self-
             | synchronizing stream, so if you send someone multiple
             | messages without framing them (prefix by length or
             | delimited are popular approaches), they're going to decode
             | a single message that doesn't make much sense on the other
             | end. And they won't be able to fix it at all.
             | 
             | So I guess JSON is New Jersey style design[1].
             | 
             | [1]: https://dreamsongs.com/RiseOfWorseIsBetter.html
        
               | kortex wrote:
               | Well, you invented one of the best things since sliced
               | bread! I love NDjson, being able to parse a sequence of
               | {} objects as an array is just frankly more natural. A
               | coworker got some absurd speedup going from some massive
               | json array to ndjson.
               | 
               | Honestly if json had as part of its spec line-delimited
               | arrays, and accepting NaN, it'd be close to perfect. Oh
               | and native ints, but that is JS's problem.
               | 
               | Well, and a single, canonical spec. And a hard limit
               | (however high) on nesting depth. And some other things.
               | Ok, maybe it's far from perfect.
        
               | q3k wrote:
               | > Compare that to something like protobuf: it's not a
               | self-synchronizing stream, so if you send someone
               | multiple messages without framing them (prefix by length
               | or delimited are popular approaches), they're going to
               | decode a single message that doesn't make much sense on
               | the other end. And they won't be able to fix it at all.
               | 
               | FWIW, this is a conscious design decision with Protobuf:
               | it allows for easy upsert operations on serialized
               | messages by appending another message with the updated
               | field values. This is very useful for middleware that
               | wants to either just add its own context to a message it
               | doesn't even parse [1], or for middleware that might
               | handle protobuf messages serialized with unknown fields.
               | 
               | On the other hand, 'newline delimited protobuf' is much
               | less useful day-to-day than ndjson, as gRPC provides
               | message streaming, which solves the issue of wanting to
               | stream small elements of a long response (which is the
               | general usecase of ndjson from my experience). For on-
               | disk storage of sequential protobufs (or any other data,
               | really), you should be using something like riegeli [2],
               | as it provides critical features like seek offsets,
               | compression and corruption resiliency.
               | 
               | [1] - eg. passing a Request message from some web server
               | frontend, through request routers, logging, ACL and
               | ratelimit systems up to the actual service handling the
               | request.
               | 
               | [2] - https://github.com/google/riegeli
        
             | syncsynchalt wrote:
             | > teams who deeply loathe JSON
             | 
             | In the current world this seems like a lifestyle choice
             | that sets yourself up for constant self-punishment.
             | 
             | I might be a curmudgeon but I'll take JSON for data interop
             | any day over anything that _requires_ tooling (protobuf,
             | gRPC). And I'll take it over the XML ecosystem too.
             | 
             | The faults of JSON seem, in practice, to be less harmful
             | than the faults of other formats.
        
         | throw_m239339 wrote:
         | Most of your problems aren't problems
         | 
         | > 3. Do you pass around user-provided JSON data accross your
         | system? How many JSON nesting levels does your implementation
         | allow? What happens if it's exceeded? What happens if different
         | parts of your processing system have different limits? What
         | about other unspecified limits like serialized size, string
         | length?
         | 
         | XML has the same issue, that's why SAX exists, it works the
         | same way with JSON.
         | 
         | > 2. JSON allows for duplicate/repeated keys, and allows for
         | the parser to basically do anything when that happens. Do you
         | know how the parser implementation you use handles this? Are
         | you sure there's no differences between that implementation and
         | other implementations used in your system (eg. between
         | execution and validation)? What about other undefined
         | behaviour, like permitted number ranges?
         | 
         | A parser should... parse and not interpret data or it isn't a
         | parser. it's a deserializer. Well how many languages allow
         | duplicate keys for maps anyway? this isn't an issue in
         | practice.
         | 
         | Basically, the answer to all your problems is to use an evented
         | parser instead of a deserializer.
        
           | kortex wrote:
           | > this isn't an issue in practice.
           | 
           | It absolutely is an issue in practice. If system A handles
           | dupes by accepting the first and ignoring the rest, and
           | system B implements last-key-wins, then that's a potential
           | source of bugs. The system might not fully parse to a map.
           | 
           | It may, for example, do string-level modification of json
           | strings. Is that disgusting and wrong? Yes. Have I seen it in
           | prod? Also yes.
        
             | throw_m239339 wrote:
             | > It absolutely is an issue in practice. If system A
             | handles dupes by accepting the first and ignoring the rest,
             | and system B implements last-key-wins, then that's a
             | potential source of bugs. The system might not fully parse
             | to a map.
             | 
             | But the system shouldn't be automatically be parsing a
             | "json map" to a map at first place:
             | {"foo":"bar","foo":"baz","foo":"qix","fiz":"buzz"}
             | 
             | Shouldn't be deserialized into a map. but a
             | Array<Map<string,string>> like structure.
             | 
             | A SAX style parser for JSON can help do that.
             | 
             | Thus the issue is the choice of parser indeed. Not JSON.
        
               | q3k wrote:
               | > Shouldn't be deserialized into a map. but a
               | Array<Map<string,string>> like structure.
               | 
               | But that's the thing: you might actually expect/want a
               | Map<string,string>, but a malicious/broken system might
               | emit something that cannot be deserialized into a
               | Map<string,string>. It's then the JSON
               | parser's/deserializer's job to figure out what to do, as
               | the standards say to do whatever. That in turn causes
               | different parsers/deserializers to behave differently
               | (whatever the implementer thought makes sense), which is
               | a source of interoperability bugs.
        
               | dragonwriter wrote:
               | > But that's the thing: you might actually expect/want a
               | Map<string,string>
               | 
               | Yes, but that's not the semantics of a bare JSON object;
               | if you want the ability to commubicate that you intend
               | that, then you use a schema language like JSON schema,
               | which lets you say that the JSON map _in this element_
               | doesn 't allow duplicate keys and requires the values to
               | be strings, at which point tools that read the schema
               | language no it is safe to deserialize as Map<string,
               | string>.
        
               | throw_m239339 wrote:
               | > But that's the thing: you might actually expect/want a
               | Map<string,string>, but a malicious/broken system might
               | emit something that cannot be deserialized into a
               | Map<string,string>. It's then the JSON
               | parser's/deserializer's job to figure out what to do, as
               | the standards say to do whatever. That in turn causes
               | different parsers/deserializers to behave differently
               | (whatever the implementer thought makes sense), which is
               | a source of interoperability bugs.
               | 
               | I disagree, people are mixing up parsing and
               | deserializing. The JSON spec isn't at fault here. The
               | JSON spec is only concerned with defining the parsing,
               | not the deserialization, because obviously, a JSON array
               | isn't a PHP array or a Ruby array, a JSON map isn't a PHP
               | object or a Go map at first place.
               | 
               | The problem isn't with JSON but how some JSON
               | deserializers work. Again, a deserializer isn't a parser.
        
               | q3k wrote:
               | > The problem isn't with JSON but how some JSON
               | deserializers work.
               | 
               | That makes no observable difference to the end-user of
               | JSON wishing to use it as an interchange format. The
               | standard might as well be perfect, but if nearly all of
               | its implementations (yes, extending that into
               | deserialization, not just parsing - because that's how
               | most people use JSON!) are problematic, then the standard
               | is effectively also problematic. This is why I also
               | always include Python's broken implementation in my JSON
               | rant - it's not indicative of the standard(s) being bad,
               | but the ecosystem being bad.
        
               | throw_m239339 wrote:
               | > That makes no observable difference to the end-user of
               | JSON wishing to use it as an interchange format. The
               | standard might as well be perfect, but if nearly all of
               | its implementations (yes, extending that into
               | deserialization, not just parsing - because that's how
               | most people use JSON!) are problematic, then the standard
               | is effectively also problematic. This is why I also
               | always include Python's broken implementation in my JSON
               | rant - it's not indicative of the standard(s) being bad,
               | but the ecosystem being bad.
               | 
               | Yes it does makes a difference to the end user. Otherwise
               | why single out JSON? XML or YAML would suffer from the
               | exact same issue.
               | 
               | Deserializers are an anti-pattern if they don't follow a
               | strict schema. The problem again isn't the JSON spec,
               | it's some deserializers making assumptions about JSON
               | types.
               | 
               | In practice data have specs and schemas so JSON/XML/...
               | payloads should also have schemas.
        
           | detaro wrote:
           | > _Basically, the answer to all your problems is to use an
           | evented parser instead of a deserializer._
           | 
           | Which "nobody" does, so it is a problem in practice.
        
             | throw_m239339 wrote:
             | > Which "nobody" does, so it is a problem in practice.
             | 
             | who's nobody? if developers care about performances, they
             | obviously do. What if the json file is 500MB of logs?
             | Furthermore, all these JSON deserialization lib tricks
             | might work in some languages that are dynamic or support
             | runtime reflection, it doesn't for other languages where
             | using a proper evented parser is mandatory.
        
           | recursive wrote:
           | > use an evented parser
           | 
           | I've never heard of this. A google search isn't particularly
           | illuminating. What is an "evented parser"?
        
             | throw_m239339 wrote:
             | Google "event-based parser"
        
             | dragonwriter wrote:
             | > What is an "evented parser"?
             | 
             | Also knowns as a "streaming parser", its a parser that
             | takes in a data stream and produces a stream of events
             | which client code can handle; it allows more flexible
             | handling than deserializers, including ability to handle
             | arbitrarily large input. SAX is a streaming/evented parser
             | API for XML, and there are similar ones for other formats.
        
             | Smaug123 wrote:
             | Just a parser which fires events you can listen on when its
             | internal state machine changes state.
        
         | jerf wrote:
         | "My general opinion is that it's extremely hard to reliably use
         | JSON as an interchange format reliably when multiple systems
         | and/or parser implementations are involved."
         | 
         | I suspect one of the reasons that JSON has been so successful
         | is precisely this fuzziness, though. Every language can do
         | something a little slightly different and it'll work at first
         | when you send it to somebody else. You get up and off the
         | ground really quickly, and can fix up issues as you go.
         | 
         | If you try to specify something with a stronger schema right
         | off the bat, I find a number of problems immediately emerge
         | that tend to slow the process down. It may be foreign to
         | programmers on HN who have embraced a strong static type
         | mindset, or dynamic programmers who have learned the hard way
         | that sometimes you need to be more precise about your types,
         | but there's still a lot of programmers out there who will
         | wonder why you're asking them whether this is an int or a float
         | is relevant. I came in to work this morning to an alert system
         | telling me that a field that a particular system has been
         | sending as an integer for a couple of months now over many
         | thousands of pushes, "number of bytes transferred", is
         | apparently capable of being a float once every several thousand
         | times for some reason. There's a lot of programmers who will
         | send a string, or a null, or maybe a float, or maybe it's
         | always an integer, and deeply don't understand why you care
         | what it's getting serialized as.
         | 
         | And that's just an example of some of the issues, not a
         | complete list. Trying to specify with some stronger system
         | moves a lot of these issues up front.
         | 
         | (If your organization has internalized that's just how it has
         | to be done, great! I bet you encountered a lot of these bumps
         | on the way, though.)
         | 
         | This isn't a celebration of JSON per se... this is really a
         | rather cynical take. I don't know that we need to type
         | everything to the n'th degree in the first meeting, but "why
         | can't we just let our dynamically-typed language send this
         | number as a string sometimes?" is definitely something I've had
         | to discuss. (Now, I don't get a lot of resistance per se, but
         | it's something I have to bring up.) I'm not presenting this as
         | a good thing, but as a theory that JSON's success is actually
         | in large part _because_ of its loosey-gooseyness, and not
         | despite it, regardless of how we may feel about it.
        
           | dec0dedab0de wrote:
           | _I suspect one of the reasons that JSON has been so
           | successful is precisely this fuzziness, though. Every
           | language can do something a little slightly different and it
           | 'll work at first when you send it to somebody else. You get
           | up and off the ground really quickly, and can fix up issues
           | as you go._
           | 
           | I agree. Sort of how xhtml never really caught on because it
           | was too strict. I never understood the desire to make things
           | break when it's often less effort to make them work.
           | 
           | Though I think the biggest benefit of JSON is that it is so
           | simple, at least compared to XML. It makes it harder to just
           | dump your internal data structures as is. Which forced people
           | to actually serialize their data. Though with time people
           | have overcomplicated it with objects that have "type" and
           | "value" fields, basically designing their own standard.
           | 
           | * There's a lot of programmers who will send a string, or a
           | null, or maybe a float, or maybe it's always an integer, and
           | deeply don't understand why you care what it's getting
           | serialized as.*
           | 
           | As far as changing the type depending on the situation, I
           | kind of wish that was more common. I like the idea of
           | conveying meaning based on type, but for it to work well it
           | would need more standard types, plus anyone using a static
           | language would be mad at you.
        
             | q3k wrote:
             | > Though I think the biggest benefit of JSON is that it is
             | so simple, at least compared to XML.
             | 
             | Or more precisely, that it appears simple at first glance,
             | and that it is very easy to get started with. TFA (or just
             | practical experience trying to build an interoperable JSON-
             | based API) should convince anyone that it is not simple in
             | the long term :).
        
             | dwaite wrote:
             | > Though I think the biggest benefit of JSON is that it is
             | so simple, at least compared to XML. It makes it harder to
             | just dump your internal data structures as is. Which forced
             | people to actually serialize their data. Though with time
             | people have overcomplicated it with objects that have
             | "type" and "value" fields, basically designing their own
             | standard.
             | 
             | XML is a document language with features like mixed content
             | to represent concepts like subsections of formatted text.
             | IMHO quite a few of XML's failings were in the "data
             | format" crowd being a separate camp, and the two never
             | really pushing for good middle ground.
             | 
             | For the crowd that wanted a common scaffolding for document
             | formats, having the rules between say namespace usage in
             | XHTML vs Docbook-XML would not be a problem. For instance,
             | HTML states you should ignore unrecognized tags and instead
             | just show the text contents.
             | 
             | That all came back to bite hard when the data model people
             | started to try to do canonicalization and document signing.
             | 
             | A "strict" variant of JSON fits on a napkin - basically,
             | reject documents with multiple identical keys in an object,
             | represent native numbers using IEEE double-precision
             | floating point, reject documents which do not meet the
             | grammar.
        
             | mbeex wrote:
             | > Though I think the biggest benefit of JSON is that it is
             | so simple
             | 
             | Still, I wish there was an option to insert comments.
        
           | lifthrasiir wrote:
           | I'm not convinced. There are a lot less JSON
           | implementatations than JSON users, so we should have been
           | possible to guide implementations with a means of proper
           | specification and test suites. Note that the OP is possibly
           | the first ever complete test suite for JSON after 15 full
           | years. It is not like seeding initial implementations (that
           | can serve as models for future implementors) is particularly
           | hard either; Douglas Crockford himself wrote two
           | implementations in C and JavaScript.
        
         | 35fbe7d3d5b9 wrote:
         | > I highly encourage any greenfield project to look into well
         | designed and better specified alternatives.
         | 
         | By way of recommendation: I reach for protobufs to do data
         | interchange between polyglot systems and have yet to be
         | disappointed. Even if you aren't getting into gRPC, having data
         | interchange backed by codegen and an IDL removes a lot of the
         | risk you get with data interchange.
        
           | theamk wrote:
           | In my experience, protobuf has a minimum project complexity
           | threshold before it starts to make sense.
           | 
           | Yes, if both sides of your interchange are systems which have
           | build infra setup, it provides a better experience. But if
           | you need to access data from outside of your usual projects,
           | or from shell, or from random data analysis notebooks, It
           | becomes a major pain.
           | 
           | Recent example: we've had an orchestrator script which was
           | written in "python with stdlib only" - no build step,
           | download an archive, extract and run. This script had to talk
           | to third-party program which would export protobuf only. This
           | was a major pain as yon can imagine.
        
           | avmich wrote:
           | In my experience JSON allowed absence of codegen and superior
           | schema definition capabilities to protobuf, and also nice
           | transformations with parts of jq built into JSON libraries.
           | Try to limit structure complexity to something which can be
           | verified before usage, yes. YMMV.
        
       | dlsa wrote:
       | So many standards, for sure. But... parsing json is actually
       | simple enough. You require those who send you data to comply with
       | specific libraries during export and import. If they send a file
       | which can't be imported then they sent a corrupted file. Bonus if
       | you lock the version. Be as specific as you need to be.
       | 
       | There are people who will quibble around "there are thousands of
       | libraries". No there aren't. There's just the N you support.
       | 
       | We specify all sorts of details for other aspects of computing.
       | Why wouldn't we specify the data format as well? Change control /
       | configuration management are very useful.
       | 
       | This is how you reduce pointless complexity. Nip it in the bud as
       | early as possible.
       | 
       | EDIT: Not sure why people disagree with this comment. This is
       | basic data management. Are people really asserting that we are
       | NOT allowed to set a minimum standard? This is also called
       | "setting boundaries".
        
       | belter wrote:
       | Previous discussion:
       | 
       | 2016: https://news.ycombinator.com/item?id=12796556
       | 
       | 2018: https://news.ycombinator.com/item?id=16897061
       | 
       | 2019: https://news.ycombinator.com/item?id=20724672
        
       | [deleted]
        
       | benibela wrote:
       | That is why I maintain my own JSON parser. First I started with
       | the parse from FreePascal's standard library. Then I ran test
       | cases on it, and there were lots of issues I had to patch.
       | 
       | First it was accepting all kinds of numbers, so I rewrote it to
       | only accept the numbers from the spec
       | 
       | Then it was removing invalid \u escapes, while I needed it to
       | replace them with U+FFFD.
       | 
       | Then I needed the unchanged input. Besides the test cases from
       | the article, I ran test cases from the W3C XPath test suite. The
       | W3C has a very odd understanding of JSON. Besides the normal
       | numbers and Unicode U+FFFD replacement, the JSON parser must be
       | able to parse it unchanged. That means, if the input number is
       | like 100 or 1e2, the parser must be able to return that as string
       | "100" or "1e2". Those are different numbers. And there must be a
       | user defined replacement of invalid \u, like you set the
       | replacement to identity and the input is "\uDEAD\u002D\udead",
       | then the parser must parse that as "\uDEAD-\udead" while keeping
       | the case.
        
       | qualudeheart wrote:
       | Can copilot parse json?
        
       | eatonphil wrote:
       | On the topic of JSON and minefields, what is your experience
       | using JSON5? I'm considered moving to it for configuration files
       | in an application I'm building.
        
         | AnthonBerg wrote:
         | I find it much more pleasant to work with.
        
         | lifthrasiir wrote:
         | JSON5 mostly extends JSON's syntax, not its data model (it
         | still doesn't outlaw duplicate object keys for example).
        
           | eatonphil wrote:
           | This article is about parsing though so I am mostly asking
           | about that.
        
             | lifthrasiir wrote:
             | "Parsing" can mean wildly different things indeed. In this
             | case though the article does check duplicate keys and
             | numeric range & precision, so the data model is definitely
             | in question.
        
       | [deleted]
        
       | jmull wrote:
       | This is a big problem for people writing general JSON
       | processors/parsers.
       | 
       | But it's not too bad an issue for specific applications/systems
       | using JSON...
       | 
       | They need their JSON to be in the correct form to represent their
       | "business objects" (or whatever you want to call your application
       | or system-specific data types), which is already a very
       | restricted subset of JSON that a standard can't help with, and
       | only rarely need to bump up against the oddness JSON has around
       | the edges.
       | 
       | (Not that people won't bump up against these issues more than
       | they really need to -- e.g, I recently saw someone trying to rely
       | on multiple keys to mean something specific, which is a
       | fun/interesting idea but is crazy to want to put into
       | production... but good specs won't stop people from wanting to do
       | crazy things.)
        
       | cryptica wrote:
       | It seems like all the 'problematic' edge cases mentioned can
       | easily be dealt with using runtime type validation and are not
       | the concern of an interchange format like JSON which is (and
       | should be) optimized for maximum flexibility/interoperability.
       | The server should not trust the data inside JSON objects sent by
       | remote clients; there should be some kind of runtime type
       | validation; it's expected that different programming languages
       | might interpret the content of the same JSON object slightly
       | differently for certain unusual edge cases. IMO, as an
       | interchange format, JSON should be allowed to evolve over time;
       | JavaScript has already proven this model to be effective; you can
       | always add features and add flexibility but cannot take away
       | features or remove flexibility.
        
       | onion2k wrote:
       | Most (all?) the complaints here appear to be that specific
       | libraries fail to implement the JSON spec in the way that the
       | author has interpreted it. Some libraries try to 'help' by
       | parsing things that they shouldn't, and some fail to parse things
       | they probably should.
       | 
       | This is why we end up with so many JSON parsing libraries I
       | guess, but it's not _really_ a problem with the format itself,
       | beyond the fact that clearer specs might disambiguate things and
       | lead to less deviation.
        
         | q3k wrote:
         | > but it's not really a problem with the format itself, beyond
         | the fact that clearer specs might disambiguate things and lead
         | to less deviation.
         | 
         | It is a problem, because it's not a spec that can be
         | implemented reliably. Different parsers behave differently on
         | various corner cases not only because of implementation
         | blunders, but also because the standard(s) just let them do
         | whatever. This spectacularly breaks systems that use more than
         | one parser implementation, each slightly implementing the
         | standard slightly differently. One part of some
         | processing/parsing pipeline will let some payload through,
         | while another one will reject it, or even parse it differently.
        
           | horsawlarway wrote:
           | I disagree (at least mostly).
           | 
           | This is a case where the spec is intentionally loose to allow
           | compatibility with a much larger number of machines and use
           | cases.
           | 
           | You'll notice most of the cases where the behavior is
           | implementation defined have resource requirements. example:
           | how deep you want to allow nesting depends a _LOT_ on the
           | capabilities of the machine running the code. A sane value
           | for a modern browser is going to be unworkable on an arduino
           | /ESP32/embedded other.
           | 
           | Also... if these ambiguities bother you, you probably haven't
           | read the full http spec either. It's riddled with cases where
           | behavior is implementation defined, for exactly the same
           | reasons (resources are required, and you can't assume
           | everyone has the same amount available). Want to take a guess
           | at the maximum length for a url?
        
             | q3k wrote:
             | > This is a case where the spec is intentionally loose to
             | allow compatibility with a much larger number of machines
             | and use cases.
             | 
             | There's plenty that could've been specified with little
             | detriment to small systems: strings are UTF-8 with a well-
             | defined escape sequence set, numbers are always IEEE-754
             | doubles, messages cannot be nested by more than 128 levels
             | (or some other arbitrary number in this range), repeated
             | fields are not permitted, everything non-compliant must
             | fail the entire parse. Then the only thing left to handle
             | is a maximum serialized size (which can be explicitly
             | implementation or user defined). Set the maximum string
             | length to maximum payload length defined earlier and you're
             | golden. That is then your only difference between
             | implementations.
             | 
             | This will work on your Ryzen server and on your ESP8266 or
             | ESP32, and can even be handled on your washing machine
             | microcontroller^W^W^WArduino (with a slowdown for dealing
             | with floating point numbers, but you already have to deal
             | with that).
             | 
             | Finally, the spec isn't loose because of some design choice
             | to allow interoperability with more machines: it's loose
             | because it was historically loose (see: JSON business card
             | 'specification' chutzpah, which itself is based on a mess
             | of a programming language that is/was JS), and before it
             | could be formalized to something sensible it got
             | implemented haphazardly by different languages. That doomed
             | the format to forever be underdefined, as anything more
             | strict would render existing implementations non-compliant.
        
               | horsawlarway wrote:
               | But that attempt at strictness harms implementation
               | value.
               | 
               | Even your own requirement set that you've claimed will
               | work on everything is... bad - Sure I can parse every
               | number as a double, if I'm willing to spend at least
               | 64bits on every number in the payload.
               | 
               | I just finished building a PH Autodoser for a hydroponics
               | system I run - it sends JSON payloads with sensor data
               | and receives JSON commands to do things like dispense
               | PHDown/PHUp solution, toggle on water cooling. I have
               | _very_ little spare working memory on the device doing
               | the actual monitoring. having to hold 64 bits per number
               | would push me into having to buy a more expensive
               | microcontroller.
               | 
               | Instead - I have an informal contract that almost all
               | fields are plain unsigned bytes (0 to 255) which works
               | fine for my use-case, requiring just 1/8th the space.
               | 
               | And to go the other direction - I have a desktop running
               | some financial software, I pass around json payloads
               | there, but a double is NOT ENOUGH. I want a BigInt field
               | for numbers there instead, because rounding errors that
               | would be a-ok for a ph sensor are absolutely not ok for
               | calculating financial data.
               | 
               | ----
               | 
               | Basically - I want the flexibility to chose the correct
               | interpretation for my data.
               | 
               | And this: "everything non-compliant must fail the entire
               | parse." Is just fucking insanity. It's the literal
               | antithesis of the robustness principle:
               | 
               | "be conservative in what you send, be liberal in what you
               | accept"
        
               | karmakaze wrote:
               | What you're describing here is a schema-specific parser.
               | Even if the parser succeeded, you would reject the input
               | as the values are out of range. Making a custom parser
               | for this is fine, but call it what it is a parser for a
               | subset of JSON--it would fail for a value of 0.1 or -1.
        
               | horsawlarway wrote:
               | Sure. The problem is that many things that a very strict
               | spec might require have real resource requirements.
               | 
               | There's a reason the URL length in http is undefined,
               | it's because the machine accepting the request doesn't
               | have infinite memory. Even the latest spec is a simple
               | recommendation to accept a request line of at least 8k
               | octets.
               | 
               | You can say "We must support nesting depth of N" in json,
               | but the reality of the situation is that parsers can and
               | will just ignore you. Are they non-compliant? Sure. Are
               | they useful? Sure.
               | 
               | Will people still use them? Fuck yes they will. Because
               | utility trumps strictness in most cases.
        
               | q3k wrote:
               | > Instead - I have an informal contract that almost all
               | fields are plain unsigned bytes (0 to 255) which works
               | fine for my use-case, requiring just 1/8th the space.
               | 
               | Right, but that informal contract is at the detriment of
               | everyone else having to also specify the expected limits
               | of numbers they work with. It makes your particular
               | usecase easier, but it doesn't make the standard better
               | in the grand scheme of things.
               | 
               | > And to go the other direction - I have a desktop
               | running some financial software, I pass around json
               | payloads there, but a double is NOT ENOUGH. I want a
               | BigInt field for numbers there instead, because rounding
               | errors that would be a-ok for a ph sensor are absolutely
               | not ok for calculating financial data.
               | 
               | And JSON doesn't guarantee you that, you have to shop
               | around for languages and implementations that permit
               | this. If you then have to make work with an
               | implementation that always deserializes to doubles (which
               | is a compliant behaviour) or bytes (which is a compliant
               | behaviour), you're screwed. Again, this might work for
               | the simple case of you controlling both ends of the
               | serialization, but it's terrible for trying to work with
               | an end that you don't control (ie. when actually using
               | JSON as an interchange format).
               | 
               | > And this: "everything non-compliant must fail the
               | entire parse." Is just fucking insanity. It's the literal
               | antithesis of the robustness principle: "be conservative
               | in what you send, be liberal in what you accept"
               | 
               | The Robustness Principle followed blindly is known to be
               | harmful when dealing with long-term standards, evolving
               | implementations and the human element of software
               | engineering [1]. My opinion is that an interchange
               | format's job is to transfer some data reliably and
               | atomically: the deserialized data should be either be
               | 100% correct or the deserialization should be rejected.
               | Anything else can and will lead to bugs, and bugs that
               | are then difficult to solve (as at that point it's
               | difficult to agree whether the serialization was not
               | conservative enough, or the deserialization not liberal
               | enough).
               | 
               | [1] - https://datatracker.ietf.org/doc/html/draft-iab-
               | protocol-mai...
        
               | horsawlarway wrote:
               | Ok - so now you have a very strict protocol, that never
               | gains traction because the strictness you value hampers
               | utility.
               | 
               | And yes - I'm aware of the "Bug for bug compatibility"
               | problems that draft tries to highlight, but it's fairly
               | clear that utility is paramount:
               | 
               | > As [SUCCESS] demonstrates, success or failure of a
               | protocol depends far more on factors like usefulness than
               | on on technical excellence. Timely publication of
               | protocol specifications, even with the potential for
               | flaws, likely contributed significantly to the eventual
               | success of the Internet.
        
         | [deleted]
        
       | Ygg2 wrote:
       | This is from 2016, no? Why was it reposted? Did something changed
       | significantly?
        
         | MrBuddyCasino wrote:
         | Old articles that consistently do well are periodically re-
         | submitted by accounts that want to farm karma points. Why, I
         | don't know.
        
           | kergonath wrote:
           | Resubmitting is one thing, but if it is upvoted, it means
           | that at least some people find or interesting or valuable. If
           | some people do, then resubmitting it was useful.
        
             | MrBuddyCasino wrote:
             | This erodes the quality over time, as the platform becomes
             | less useful to regulars and long-time members, and thus
             | dis-incentives investment and care. Every open platform
             | without rules devolves into a porn distributor, so strictly
             | going by ,,what's popular" is problematic.
        
               | kergonath wrote:
               | I hear you, but that's the whole point of HN. Nobody
               | takes editorial decisions.
               | 
               | It's great if your interests are aligned with the
               | community and as long as the noise is manageable by the
               | voting system.
               | 
               | Besides, there is quite a bit of turnover on the front
               | page. A post that you find useless today will probably be
               | gone from the front page before tomorrow.
        
           | account-5 wrote:
           | Cynical and for some maybe true. Or it could be people are
           | genuinely posting something they have just read for the first
           | time and thought others might find it interesting. I count
           | myself in this group; posting and finding this interesting. I
           | do search before posting though others may not.
        
           | petee wrote:
           | Ive resubmitted a post that i knew was on HN a few years
           | prior, but forgot just how eye opening it was at the time and
           | thought that surely some missed it and would appreciate it
           | again. You're probably right some people do it for points,
           | but more likely its just more people fascinated by a specific
           | topic, so more submissions.
           | 
           | My repost in particular was the ASCII/binary 4-column
           | representation rather than the typical 3-col, which makes a
           | big difference in understanding
        
             | MrBuddyCasino wrote:
             | I should have been more elaborate. Your case is of course
             | fine. But I noticed accounts with a very high karma count
             | that submit a huge volume of articles, but hardly any
             | comments. I don't know exactly what's going on, but I find
             | it slightly weird.
        
         | GuB-42 wrote:
         | Sometimes, articles are reposted because _nothing_ changed
         | significantly.
         | 
         | We sometimes get articles from the 19th century that are still
         | relevant today, and it is interesting to see our ancestors
         | perspective on the problem, and an old article on a problem we
         | still have today is a good indication that there is no easy fix
         | and it won't go away anytime soon.
        
           | Ygg2 wrote:
           | Ok, but nothing points this is still relevant. Were tests re-
           | run or something? Last update was 3 years ago.
        
             | coldtea wrote:
             | We don't need special tests and metrics to point us that
             | this is still relevant. We know it is, and no, nothing has
             | changed since.
             | 
             | Beyond this particular case, this is a social link-voting
             | website. If people submit and vote for an older article, it
             | will be in the front page, end of story. Doesn't matter if
             | it still holds or not - it's enough that people found it
             | still interesting to submit and upvote. Some of the better
             | discussions here happen the nth time the same post is on
             | the front page (and some posts get on the top page 5-10
             | times in a decade). There's also a handy link on HN to show
             | previous submissions of the same post, and the discussions
             | that ensued.
        
               | Ygg2 wrote:
               | Ok, but what is the proof nothing has changed? I just see
               | a repost, of a really good article.
               | 
               | No test-suit runs, not even a glib message saying "It's
               | year 2021, and nothing in test suite has changed".
        
               | coldtea wrote:
               | > _Ok, but what is the proof nothing has changed?_
               | 
               | It's the so-called experimental proof. We see it every
               | day in practice.
               | 
               | (This is not a research lab).
        
               | kergonath wrote:
               | It's here because someone posted it, and enough people
               | upvoted it, and not enough flagged it. There is no
               | conspiracy, things just show up on the front page
               | depending on what we collectively want to read.
               | 
               | And it is an interesting article, and I hadn't read it
               | before, so I upvoted it as well.
               | 
               | The rule of thumb is that de-posting is acceptable after
               | ~1 year.
        
         | IggleSniggle wrote:
         | Reposting doesn't have the same negative connotation here as it
         | might other places. Sometimes valuable insights come from older
         | works. Sometimes a conversation is worth having on HN with a
         | contemporary context. If it ends up on front page, folks are
         | finding it valuable to discuss.
        
           | Ygg2 wrote:
           | I don't understand the reasoning behind it, and my genuine
           | question has been flagged.
           | 
           | Why is it posted now, rather than a year or two years ago?
        
             | peterkelly wrote:
             | Only the person who posted it knows exactly why they chose
             | this particular day to do so. Probably they came across it
             | in the course of their work and thought it might be
             | useful/interesting to have a discussion about it.
             | Apparently a lot of other people agreed because they
             | upvoted it.
             | 
             | Revisiting old articles every few years can be useful,
             | because the set of people participating in the discussion
             | is likely to be substantially different from those who
             | commented on it the last time it appeared. Those people may
             | have insights or information to share that weren't
             | discussed previously. Maybe some of the people commenting
             | here hadn't even started working in the industry when the
             | original post was made. And even people who were part of
             | the first discussion may have new thoughts on the topic.
             | 
             | As an example, while I was certainly aware of and using
             | JSON at the time this article was written, and recall
             | reading it at the time, it is actually much more relevant
             | to me now because I am working on a project that uses JSON
             | in a different way than what I'd done previously.
             | Specifically, we rely on the fact that the same piece of
             | data will always serialize to the exact same string, which
             | we hash and use for later comparisons. We've run into
             | issues relating to interoperability problems between
             | different implementations exactly because of the issues
             | discussed in the article (and yes I'd prefer to use an
             | alternative format for these reasons). This is something I
             | wouldn't have been concerned with at all on previous
             | projects where it didn't matter if there were slight
             | differences. That's just a datapoint on why I personally
             | have renewed interest in this discussion.
             | 
             | If you look at a lot of other sites like Reddit (where
             | reposts of articles are often discouraged), you'll often
             | find that on question-based subs, many different people ask
             | the same few kinds of questions with extreme regularity.
             | Subs like /r/askreddit and /r/relationships are full of
             | examples. HN is as much about discussion (if not more) than
             | the actual articles themselves, and as mentioned above such
             | discussions can offer something new each time. So as long
             | as they're not repeated too often, reposts can still have
             | value.
        
             | IggleSniggle wrote:
             | I suspect you were flagged for seeming to complain that the
             | submission was inappropriate. See the HN guidelines:
             | https://news.ycombinator.com/newsguidelines.html
             | 
             | Posted now because it is still relevant. Previous
             | posting/discussion:
             | 
             | 5 years ago: https://news.ycombinator.com/item?id=12796556
             | 
             | 3 years ago: https://news.ycombinator.com/item?id=16897061
             | 
             | 2 years ago: https://news.ycombinator.com/item?id=20724672
        
       | dec0dedab0de wrote:
       | Edit: After seeing the comments, I checked my REPL history, and
       | the bad data was still there. luckily with the spaces displayed
       | as \040. Turns out the offending space was \240, which makes more
       | sense. Please disregard this comment.
       | 
       | If you're curious, the problem stems from stuffing JSON in the
       | description field of an external system to do our own tagging.
       | Someone (me) must have copy/pasted from a screwy source. We were
       | pulling out our hair trying to figure out what was wrong with it,
       | and I just stuck it in the REPL, and saw the offending character
       | was a space. Manually deleted the extra space and it was fine. A
       | quick google showed space was part of the convention, and we were
       | like "woah that's weird how did we never stumble across that
       | before." I am embarrassed that I posted this now. My only excuse
       | is that I just got over Covid, so I'm going with that :-).
       | 
       | This was my original comment:
       | 
       | After about a decade of using JSON I just discovered the hard way
       | that you can only have one space after the colon between a key
       | and value. Atleast with the python JSON library.
        
         | lifthrasiir wrote:
         | Which version of Python and/or JSON library? Since Python's
         | built-in `json` module was first introduced in 2.6 and I can't
         | see any evidence of this bug throughout the relevant code
         | (either pure Python or C implementations).
        
           | dec0dedab0de wrote:
           | See my edit, it was just a stupid mistake on my part. Thanks
           | for pointing this out.
        
         | pythonthecware wrote:
         | Yeah python is not exactly a language for the web. I know fresh
         | grads and inexperienced professors claim otherwise but its not.
        
           | Eikon wrote:
           | What is "a language for the web"?
        
             | pythonthecware wrote:
             | Any language that decodes json without issue. I mean it's
             | _the_ data format of the web ain't it?
        
               | [deleted]
        
             | [deleted]
        
           | lcrz wrote:
           | What a dumb take.
        
             | pythonthecware wrote:
             | Says she while having issues decoding json as of 2021.
        
               | sebzim4500 wrote:
               | I don't see any evidence that python has issues decoding
               | json.
        
           | valparaiso wrote:
           | Lmao half of SV startups are on python
        
         | tyingq wrote:
         | $ python -c 'import
         | json,sys;print(json.dumps(json.loads(sys.argv[1])))' '{"a":
         | "b"}'       {"a": "b"}
         | 
         | Maybe it's been fixed? That's Python 3.8.10. What version is it
         | broken in?
        
           | lcrz wrote:
           | Python 2.7.16 (default, May  8 2021, 11:48:02)          >>>
           | import json         >>> json.loads('{"a"  :     "b"    }')
           | {u'a': u'b'}
        
             | dec0dedab0de wrote:
             | In case either of you check your threads page, it was just
             | a stupid mistake on my part. See my edit. Thank you for
             | correcting me.
        
       | SavantIdiot wrote:
       | I deleted my snarky comment because I want to be more serious.
       | 
       | You never know when you hack something quick if it may become a
       | future standard for trillions of handshakes!
       | 
       | JSON looks so clean and tidy on that business card, but when you
       | look at both RFCs you realize: ZOMG there is a lot of stuff that
       | needs to be thought out!!
       | 
       | There are some purists in this thread who claim it is the
       | parsers' fault. I can almost get behind that, but not 100%
       | because you really do need to be more clear about several things
       | (different types of numbers, different character sets, minimum
       | requirements -and- maximum requirements...)
       | 
       | I agree with OP that not including implicit version numbers was
       | an oversight: it looks ugly, but if you're not going to put in
       | _all_ the thought in at the start, at least make a version number
       | required so you can ignore mistakes.
       | 
       | Let this document be a lesson to anyone who writes a data
       | schema/grammar that eventually replaces JSON.
        
         | admax88qqq wrote:
         | I dunno man, serialization in general is fraught with peril. At
         | least the JSON grammar is short and simple. I can't believe
         | people in here are genuinely recommending XML as an
         | alternative. Try standardizinh XML in a 14 page RFC.
         | 
         | Yes some things need to be though out, but I can sit down and
         | read the JSON RFC start to finish easily.
        
           | Ginden wrote:
           | And many people choose to ignore that number parsing issues
           | can happen in XML too.
        
           | SavantIdiot wrote:
           | > I dunno man, serialization in general is fraught with
           | peril.
           | 
           | I completely agree.
           | 
           | And I'm certain I'm using JSON in an unsafe way... somewhere.
           | :)
        
       | bob1029 wrote:
       | For JSON contracts that are of any reasonable level of complexity
       | (many levels of nesting), I prefer to have the same serializer &
       | strong type system on both ends. A common use case here is
       | serializing dynamic business types as JSON in and out of blob
       | columns.
       | 
       | For what its worth, I have had maybe 2 hours total worth of
       | struggles with JSON serialization over the last as many years,
       | and we use it for pretty much _everything_. The biggest pain
       | point for us is implementation-specific. Refactors of namespaces
       | & dependent assembly names can cause trouble with polymorphic
       | serialization (which can absolutely be secure if used
       | responsibly). The only other pain point experienced is with
       | regard to nullable vs non-nullable fields - again only a problem
       | after a change takes place relative to pre-existing JSON
       | documents.
        
       | tinus_hn wrote:
       | It isn't too bad, considering how many people on this site are
       | still pining for the worst format in the world: csv
        
       ___________________________________________________________________
       (page generated 2021-10-11 23:01 UTC)