[HN Gopher] Parsing JSON Is a Minefield (2016)
___________________________________________________________________
Parsing JSON Is a Minefield (2016)
Author : todsacerdoti
Score : 167 points
Date : 2021-10-11 09:57 UTC (13 hours ago)
(HTM) web link (seriot.ch)
(TXT) w3m dump (seriot.ch)
| ChrisArchitect wrote:
| Surely something newer on this since 2016
|
| Plenty of previous discussion:
|
| 2 years ago https://news.ycombinator.com/item?id=20724672
|
| 3 years ago https://news.ycombinator.com/item?id=16897061
|
| 5 years ago https://news.ycombinator.com/item?id=12796556
| kstenerud wrote:
| Safety and security are two big reasons why I developed Concise
| Encoding [1]. The computing and networking landscape today is
| MUCH more hostile compared to the JSON and XML heyday (with state
| actors and organized crime now getting in on the action), and
| it's time to retire them in favor of more secure and predictable
| formats that are also human-friendly.
|
| [1] https://concise-encoding.org
| Decabytes wrote:
| I'm a data scientist so I work with JSON and csv all the time.
| It's amazing how the back bone of data serialization and
| reporting are so ambiguous.
|
| But I wonder if I'm part of the probably. Know one notices all
| the inconsistencies because so much of my job is ironing it out.
| EdwardDiego wrote:
| Far easier than parsing Markdown at least.
| [deleted]
| q3k wrote:
| Some other fun facts about JSON, its mainstream implementations
| and using it reliably:
|
| 1. json.dump(s) in Python by default emits non-standards-
| compliant JSON, ie. will happily serialize NaN/Inf/-Inf. You want
| to set allow_nan=False to be compliant. Otherwise this _will_
| annoy someone who has to consume your shoddy pseudo-JSON from a
| standards-compliant library.
|
| 2. JSON allows for duplicate/repeated keys, and allows for the
| parser to basically do anything when that happens. Do you know
| how the parser implementation you use handles this? Are you sure
| there's no differences between that implementation and other
| implementations used in your system (eg. between execution and
| validation)? What about other undefined behaviour, like permitted
| number ranges?
|
| 3. Do you pass around user-provided JSON data accross your
| system? How many JSON nesting levels does your implementation
| allow? What happens if it's exceeded? What happens if different
| parts of your processing system have different limits? What about
| other unspecified limits like serialized size, string length?
|
| My general opinion is that it's extremely hard to reliably use
| JSON as an interchange format reliably when multiple systems
| and/or parser implementations are involved. It's based on a set
| of underdefined specifications that leaves critical behaviour
| undefined, effectively making it impossible to have 100%
| interoperable implementations. It doesn't help that one of the
| mainstream implementations (in Python) is just non-compliant by
| default.
|
| I highly encourage any greenfield project to look into well
| designed and better specified alternatives.
| zelphirkalt wrote:
| Some good points there. And now imagine people wanting to
| needlessly use YAML for configuration, which adds loads of edge
| cases on top of that.
| magicalhippo wrote:
| > My general opinion is that it's extremely hard to reliably
| use JSON as an interchange format reliably when multiple
| systems and/or parser implementations are involved.
|
| XML is very precisely defined in comparison to JSON. Yet we've
| had one customer who had a system that couldn't handle XML
| files with newlines in them at all, and several which
| _sometimes_ sends ISO 8859-1 (Latin 1) encoded data in _some
| fields_ of a XML file with encoding= "UTF-8" in the header...
|
| We of course also have some nice fixed-field integrations,
| based on customer's specs, where the system suddenly sends
| multiple mangled characters if any non-ASCII character is
| present, causing the fields to suddenly not be so fixed
| anymore... It behaves very much like UTF-8 interpreted as
| Latin-1, except with something else than Latin-1.
|
| Anyway, I've given up trying to be strict at this point. We
| will have to wash incoming data, it's apparently inevitable.
| spookthesunset wrote:
| I mean even if it is well defined that doesn't mean the devs
| are using the languages native parser library. I've encounter
| at least two projects where the devs rolled their own XML
| "parser" using regex and "substring" functions. Why? "The xml
| library was too bloated... much easier to write it ourself".
| Suffice to say, they had tons to problems.
| stinos wrote:
| _You want to set allow_nan=False to be compliant. Otherwise
| this _will_ annoy someone who has to consume your shoddy
| pseudo-JSON from a standards-compliant library_
|
| Funny (well, not really) thing is NaN and Inf are perfectly
| valid floating point numbers acoording to most (?) standards
| used on computers. To the point that I don't understand why it
| was left out of JSON. So unless you're 100% sure you won't
| encounter these numbers the choice is between not being able to
| use JSON, or finding hacks around (and using null isn't one of
| them since you have 3 numbers to represent), or just using non-
| compliant-yet-often-accepted JSON and possibly annoying someone
| whos parser doesn't handle it.
|
| And for me there have been quite a lot of cases were I just
| quickly needed something simple to interface between components
| so when finding out they all support JSON+Nan/Inf then the
| choice is usually made quickly.
| MathMonkeyMan wrote:
| From a practical standpoint, defining numbers in JSON to be
| "whatever double precision binary floating point does, or
| optionally something more precise" would have been good
| enough, and capture what we end up having anyway.
|
| Still, I prefer Crockford's choice: that JSON numbers are
| defined to be _numbers_. Infinity and the flavors of NaN
| are... not numbers.
|
| In an extensible data interchange format, like [edn][1],
| people could define conventions about more specific
| interpretations of numbers, e.g.
| #ieee754/b64 45.6653 ; this is a double
|
| We could build such a format on top of JSON (there are
| probably multiple), but I again agree with Crockford that
| this sort of thing does not belong in JSON.
|
| Makes for a bunch of headaches, though, for sure.
|
| One example is a data scientist I used to work with. He was
| working with lots of machine learning libraries that liked to
| use NaN to mean "nothing to see here." A fellow developer
| ended up writing code that used some sort of convention to
| work around it, e.g. number := decimal | {"magic-uuid":
| "NaN"}. I can see why some people are of the opinion "this is
| stupid, just allow NaNs." I disagree.
|
| [1]: https://github.com/edn-format/edn
| dragonwriter wrote:
| > Funny (well, not really) thing is NaN and Inf are perfectly
| valid floating point numbers acoording to most (?) standards
| used on computers. To the point that I don't understand why
| it was left out of JSON.
|
| There are all kinds of ways to encode that in JSON, but
| (contrary to JS, where "numbers" or IEEE doubles, which
| include various things which are either not numbers or not
| finite), JSON numbers are generic finite (both in size or
| decimal representation) numbers, so "as JSON numbers" is not
| one of them. (And there's no explicit way defined in JSON, so
| if you want it to be unambiguous, you need externally defined
| semantics, but you need that for most real uses anyway.)
| nomel wrote:
| > To the point that I don't understand why it was left out of
| JSON
|
| I think you're forgetting the birthplace of JSON. Who deals
| with the concept of infinity and NaN in the context of web
| front ends?
| lifthrasiir wrote:
| Ranges are pretty common in APIs and both -Infinity and
| Infinity can naturally arise from one-sided ranges. Since
| they are absent in JSON, they are frequently replaced with
| null, ad-hoc sentinel values with uncoded assumptions (e.g.
| timestamps should be always positive) and missing fields.
| stinos wrote:
| I get that, but to go from "oh this won't be very common"
| to willingly "let's just leave this out" is something else.
| At least in my mind :) Or was it an oversight?
| mst wrote:
| I suspect it was a bet on worse is better.
|
| Whether it was a _good_ bet is debatable, but given
| Crockford 's focus on "try and leave out as much as
| possible" I can certainly see it making sense at the
| time.
| josefx wrote:
| > To the point that I don't understand why it was left out of
| JSON
|
| Because JSON has generic numbers that just happen to be able
| to represent every numeric IEEE floating point double value.
| In theory you could have an implementation that uses a
| BigDecimal class or something similar to represent numeric
| values. Which is of course completely incompatible with every
| other JSON implementation and just asks for badly tested edge
| cases to rear their ugly head.
| EdwardDiego wrote:
| > How many JSON nesting levels does your implementation allow?
| What happens if it's exceeded
|
| Haha, I've met a few stack overflows in this area.
| tehbeard wrote:
| While there's a lot of issues with JSON, this one also
| applies to any other interchange format that supports
| nesting, including the much beloved XML. Protobuf might also
| have this, idk if it does any static analysis for infinite
| depth.
| q3k wrote:
| The problem doesn't really exist in Protobuf, as protobuf
| (de)serialization is performed based on an IDL definition
| of the message type. Whatever that IDL specifies, a
| corresponding typed definition and (de)serialization
| function will be generated for your programming language,
| and that implementation will ignore any fields that weren't
| part of the IDL. The (de)serializing code is statically
| generated ahead of time, and is treated like any other code
| that operates on potentially nested data structures.
|
| What this means is that if your IDL specifies deep nesting
| (or recursive nesting), then it means your application is
| expected to handle this "by contract", and attempts to
| deserialize will rightfully fail in case of out-of-memory /
| stack overflow errors. There's no danger of an
| implementation 'accidentally' deserializing something
| nested that was passsed from the outside, as anything
| unknown to the IDL is simply ignored.
|
| Finally, there's no XML-like self-references in Protobuf,
| so it's not possible to have an infinitely deep structure,
| or a combinatorial explosion like with billion laughs -
| just a very deeply nested one, and only if allowed in the
| IDL, and only up to whatever message size limit you're
| allowing.
| tehbeard wrote:
| Thank you for the 2nd + 3rd paragraphs, those were parts
| of protobufs design I wasn't really aware of from a
| cursory glance.
|
| I'm a little suprised to learn there's no self-reference
| support in protobuf, as I wouldn't have assumed parsing
| that would be an issue (as all it really is is a pointer
| to an existing object in the message to say, put a ref.
| to it here), though I guess it might be a problem in
| supporting certain languages.
| q3k wrote:
| > I'm a little suprised to learn there's no self-
| reference support in protobuf, as I wouldn't have assumed
| parsing that would be an issue (as all it really is is a
| pointer to an existing object in the message to say, put
| a ref. to it here), though I guess it might be a problem
| in supporting certain languages.
|
| That's a tradeoff more designs should have, IMO: reduce
| the feature set as much as possible, but in return make
| the implementation vastly simpler. :)
|
| I assume it's not only about support in programming
| languages, but also exactly to eliminate the entire class
| of bugs that stems from back/forward-references in
| serialized data, and to generally keep the wire format as
| simple (to parse and to implement a parser for) as
| possible. The few usecases that could make use of
| references are not worth the pain inflicted on everyone
| if they were implemented.
| ChrisMarshallNY wrote:
| _> the much beloved XML_
|
| I can't quite resolve "beloved" and "XML" in the same
| sentence...
|
| That said, I have used XML _a lot_ , pretty much because of
| XML Schema.
|
| I don't like it. No sir. Not one bit. Uh-uh...
|
| But there's really no viable substitute.
|
| When I design an API, I generally start with an object
| model, and use native converters to create XML and JSON
| from it.
|
| I will provide an XML Schema with the XML variant. I often
| have to do this by hand, which sucks. There are tools to
| create Schema from dumps, but these are pretty limited. I
| may use them to "get me in the ballpark," but there's
| always lots of elbow grease.
|
| I'll use the XML for testing, but I will usually use the
| JSON format for the actual implementation.
| nocman wrote:
| > I can't quite resolve "beloved" and "XML" in the same
| sentence...
|
| You mean, it's possible to take that as being _NON-
| sarcastic_??? If so, I share your lack of resolution.
| thechao wrote:
| I had a little non-recursive JSON parser hanging around. When
| you have "nested" levels you've really only got two choices:
| object, or array. That implies that to track nesting, you
| just need an array of 1b values. In order to shave the yak
| _properly_ , I built "nesting compressor" that detected runs
| of array/object and represented them using a 64b RLE; or, it
| bailed out, and then just used on-the-fly compression with
| zstd.
|
| Obviously, any sort of JSON file that fit on a disk I can
| afford can be parsed into memory in a tiny fraction of its
| on-disk representation. I modified `yes` to just stream `[`
| out; the JSON parser handled it just fine -- it takes a while
| to roll a 64b counter.
| lifthrasiir wrote:
| And all these problems trace back to Douglas Crockford. He
| didn't know how to make a proper serialization format [1] and
| also an interoperable standard (for the latter, Tim Bray tried
| very hard to make it slightly better [2]). He just noticed that
| a (supposed) subset of JavaScript can be easily turned into a
| serialization format with `eval` and went to publicize it, only
| noticing the issues later _and still pursuring its
| standardization as is_. I hate him.
|
| [1] My additional complaints:
| https://news.ycombinator.com/item?id=24953981
|
| [2]
| https://www.tbray.org/ongoing/When/201x/2014/03/05/RFC7159-J...
| gmac wrote:
| I was using JSON before it was 'invented', as was basically
| anyone sending data to the browser in JS format.
|
| Holding specific people responsible is pretty absurd.
| lifthrasiir wrote:
| JSON before the standardization had an obvious data model
| and specification: ECMAScript. (I don't think JSON was
| widely used outside of JavaScript back then.) ECMAScript is
| particularly strictly defined even compared to other
| language standards, so it should have been possible to
| extract the relevant portions of ECMAScript into a proper
| standard. Crockford didn't. JSON as specified by Crockford
| was not even a proper subset of ECMAScript until ECMAScript
| itself retrofitted its syntax.
| makeitdouble wrote:
| It's a pyramid.
|
| At the bottom you have CSV which is popular beyond belief, and
| has no real specification, with common cases wildly handled
| differently across libraries.
|
| In the middle you have JSON which isn't 100% interoperable, but
| goes 98% of the way.
|
| And you have XML and protobufs at the top tip, who have strong
| mechanisms available for interoperability but at an operational
| cost that rarely justifies the upgrade from JSON.
|
| I suspect it will take a lot more that "well designed" and
| "better specified" to justify moving away from JSON as the
| default stepup from chaotic CSV like formats.
| maple3142 wrote:
| I am not sure if parsing XML is better than parsing JSON.
| Many languages or libraries' XML parser are dangerous by
| default. You usually need to manually configure your XML
| parser to be secure from XML-related attacks. Fortunately,
| some languages and libraries are going to make XML have a
| securer defaults, this is a good change. IMO, I think XML
| shouldn't have include many questionable features from
| security perspective.
| pdimitar wrote:
| Agreed, and I like the libraries I saw in the past that
| deliberately only support a small subset of all XML
| extensions (sadly now I can't remember the names). Reducing
| attack surface _and_ increasing sanity in one stroke is a
| policy that much more open-source software has to adopt.
| makeitdouble wrote:
| You are right, and XML parsers can have a very large attack
| surface due to the sheer amount of specs to adhere to.
|
| I see XML as better in the expressiveness it has, and more
| mature out of the box options to validate and transform it.
| Security and bugs remain an issue, but at the scale it can
| be used, there is a fighting chance to have experts dealing
| with the hardening of it all.
|
| Swagger like format definitions are still pretty lax in my
| option in comparison. Now I wouldn't want to get back to
| XML land, I just think it occupies a pretty solid niche
| that is hard to match with anything more simple.
| Sohcahtoa82 wrote:
| > Many languages or libraries' XML parser are dangerous by
| default.
|
| Seriously, XML External Entities is an incredibly dumb
| feature. To have it enabled by default makes it even worse.
| tannhaeuser wrote:
| To be fair, XML wasn't intended as a data exchange format
| but as simplified SGML subset for use as delivery format on
| the web. While that largely hasn't happened, XML with XSD
| (sans rarely used feats) remains a strong exchange format
| for coarsely-grained inter-party traffic such as payment
| systems, taxes and other public/private data, etc.
|
| I'm guessing the security deficits you mention are XML
| entity attacks. Well, SGML has CAPACITY ENTLVL in the SGML
| declaration to limit expansion depth. And a markup
| authoring or delivery format without entities/text macros
| is quite useless, even though HTML, when seen as a stand-
| alone markup language rather than SGML vocabulary, lacks
| it.
| goodpoint wrote:
| > XML wasn't intended as a data exchange format but as
| simplified SGML subset for use as delivery format on the
| web
|
| You cannot deliver web content without... exchanging
| data.
|
| And you cannot trust servers not to attack browsers.
| HWR_14 wrote:
| > And you cannot trust servers not to attack browsers.
|
| Interesting. I normally see it expressed the other way
| (trusting the server and not the client). Obviously, both
| are important.
| mcv wrote:
| I guess trust needs to be a two-way street. Even between
| computers.
| ievans wrote:
| For those unfamiliar with these attack vectors, there code
| injection and denial-of-service issues that in previous
| version of Python, were exploitable by default. Projects
| like https://pypi.org/project/defusedxml/ were designed to
| be secure against these issues by default, rather than
| requiring the library user to opt in.
|
| The defusedxml project has an excellent matrix showing
| viability of the attack types against various python XML
| implementations:
| https://pypi.org/project/defusedxml/#python-xml-libraries
| mst wrote:
| When faced with a case where _SV is a natural fit, I 've long
| been in the habit of (ass-u-ming I get to make that call, of
| course) specifying PostgreSQL COPY style TSV as the
| interchange format, and using more 'normal' TSV to make it
| easy to get data exports into Excel and friends.
|
| That's turned out to be rather less annoying than any other
| approach to _SV I've tried over the years.
| mumblemumble wrote:
| I would argue that, in the long run, gRPC/protobuf has a
| lower operational cost than JSON _as long as you don 't need
| to talk to it from a browser._
|
| (Consuming it from a browser is a hassle because client-side
| JavaScript code is unable to speak the full gRPC API, so you
| need to fuss with reverse proxies to get everything working.)
|
| What it doesn't have is a short learning curve. In order to
| get started, you need to learn the *.proto format, and how to
| use the code generator, and all the design implementations of
| the different data types it supports, and all of that.
|
| But, once you get over that hump, it makes a lot of the hard
| stuff much, much easier to get right.
|
| What I keep wishing for is some sort of "gRPC-lite" that
| doesn't include quite as many questionable micro-
| optimizations as protobuf/gRPC, but does include all of the
| really good ideas like specification-first API development,
| service reflection, and a clean logical separation between
| HTTP semantics and the semantics of the API that's being
| implemented on top of it.
| nawgz wrote:
| > gRPC/protobuf has a lower operational cost than JSON as
| long as you don't need to talk to it from a browser.
|
| So, reading this the other way (browsers are king)... you
| claim gRPC is so useless as to not be able to power your
| entire system and requires you to standup duplicate
| interchange systems for different use cases?
|
| Yikes.
| mumblemumble wrote:
| Browsers are king for some people, not others. For the
| stuff I'm working on, the browser is just the tip of the
| iceberg. For everything below the waterline, the
| operational benefits (strong static typing, well-defined
| backward- and forward-compatibility semantics, better
| throughput and latency characteristics) greatly outweigh
| the, "but we have to use a lightweight Envoy reverse
| proxy to expose some things to the browser," problem.
|
| I also have a tendency to consider that Envoy proxy to be
| more of a feature than a bug, anyway. It's pretty easy to
| set up, all told. We want to gatekeep the edge, anyway,
| for various reasons, so it's not like there was ever a
| reality in which we weren't going to be fussing with a
| reverse proxy. And it serves as a nice opportunity to
| stop and be thoughtful about exactly what we're choosing
| to expose to the Internet.
|
| Speaking purely as a developer, I do find it to be an
| annoyance. But I also acknowledge that inconveniencing
| developers for the sake of the greater good can be a wise
| move.
| bob_roberts wrote:
| For a cloud-based app, that might literally just be at
| the application gateway. Beyond that, everything could be
| whatever protocol.
| Cloudef wrote:
| protobuf incurs lots of codebloat (especially with google's
| runtime / compilers) and the the serialization format is
| not really that nice IMO. I don't think it's possible to
| come up with 100% ideal format for all the use cases.
| throwaway894345 wrote:
| I tried to get into gRPC/protobuf in Go on a Mac for a
| little hobby project, but man the effort just to get protoc
| up and running and then generate the stubs was insane. I'm
| sure somehow or another, it's user error, but when the
| barrier of entry is so high, it's hard to justify the
| effort when JSON-slinging is so rarely the bottleneck.
| recursive wrote:
| > you have CSV which is popular beyond belief, and has no
| real specification
|
| What about RFC 4180? Works for me.
| jedimastert wrote:
| >CSV...has no real specification
|
| Bite your tongue sir! It has a GLORIOUS specification!
|
| https://datatracker.ietf.org/doc/html/rfc4180
| heresie-dabord wrote:
| I would describe it as a _pyramid of bounded viability_ ,
| from the minimally viable to the feature-burdened maximum.
|
| CSV excels (ha) as a minimally viable exchange format for
| data. Combined with awk, grep, sed, bash, and Perl, and some
| simple SVG or D3 with SVG, the analytical solution is fast,
| scalable, and automatable.
|
| But CSV has limits. The column headers are the schema. Beyond
| these bounds, we have the other formats.
|
| JSON is messier. Its strength is in network/browser
| encapsulations and operations. As I have seen it used, people
| insert an array as a kind of schema, and they stay away from
| complex nesting where scalability starts failing and other
| difficult tooling must be summoned to compensate (parsing
| tools such as jq).
|
| Beyond JSON's bounds, we have XML and associated tooling. XML
| is versatile and expressive.
|
| XML and JSON can be written simply but both can be abused by
| programmers who aren't thinking beyond their own cursor.
|
| This is a rich set of tooling for data representation.
|
| In the end, one of the main advantages of CSV is that it
| remains a format that brings little tooling baggage
| ("ecosystem") to the task.
| petschge wrote:
| I'd argue the top of the pyramid is actually formats such as
| HDF5. That format was started to store voyager data and we
| can still read it after more than 40 years. It makes the
| format of entries extremely clear ("this is an 3d array of
| floating point numbers in IEEE755 format, with 64, 2 and 17
| entries per dimension") and encourages further meat data ("it
| is in statV per centimeter and came out of channel of of the
| intrument") in addition to naming the data set "electric
| field". Compared to horrible piles of binary data that used
| to be common (and still are!), it's a breeze of fresh air to
| work with.
| Cloudef wrote:
| Any JSON parser that tries to handle numbers without big number
| support is broken. This is why I always raise eyebrows if I see
| json library that doesn't allow me to deal with the number
| myself by retrieving it as a string.
| kortex wrote:
| > I highly encourage any greenfield project to look into well
| designed and better specified alternatives.
|
| Like what?
| q3k wrote:
| My preference is Protobuf, but really anything that's not
| JSON and which also comes with some IDL gets my approval.
| kortex wrote:
| I like protobuf for some use-casess (namely grpc) but a)
| it's a binary format and sometimes (often times) it's nice
| to have a text protocol
|
| b) protobuf libraries and protoc have given me way more
| grief overall than json (python, js, c++)
|
| If your workflow already supports it, I can see it being
| useful, but it's got a pretty steep learning curve to be
| honest, certainly more than json, despite the ill-
| implemented libs out there. If I wanted a binary format,
| IMHO I'd go for msgpack first, and reach for protobuf if
| that didn't work for me.
| elteto wrote:
| > I like protobuf for some use-casess (namely grpc) but
| a) it's a binary format and sometimes (often times) it's
| nice to have a text protocol
|
| Protobuf (and flatbuffers) supports parsing messages from
| JSON instead of a binary blob. Best of both worlds IMO.
| avmich wrote:
| Can you use JSON Schema? Generating classes from it, if you
| want native objects?
| q3k wrote:
| You can use whatever you want :).
|
| I personally would rather still go with Protobuf if I'm
| going to put in the effort to add a schema and codegen.
| It gives me other nice-to-have features (faster
| [de]serialization, smaller messages, field numbers and
| schema evolution, nicer IDL [not JSON!], gRPC, ...) and
| does away with some problems intrinsic to JSON that no
| schema system will fix (terrible number type, lack of
| binary type, slow parsing). It also has some interop with
| JSON in the rare case you absolutely positively need to
| convert to/from it (which is IMO the only upside of using
| JSON Schema in case you need that interop).
| NavinF wrote:
| https://capnproto.org/
| 0xbadcafebee wrote:
| YAML. Of course implementations of this go all over the place
| too, but you could say the same of XML parsers to a certain
| extent.
|
| I still pine for binary-only data formats. They're easier to
| program, and nobody makes the mistake of trying to edit them
| manually or compose them in a shell script. Parsing data
| shouldn't be hard, but it also shouldn't be so easy that
| people hang themselves by accident.
|
| Of course, the reason why we largely have text data formats
| is because it's insanely simpler to troubleshoot systems that
| use them. Some things should just be easier to manipulate.
| But for general purpose work, I miss binary data formats.
|
| Zip is probably my favorite general-purpose binary data
| format. It's old, well defined, works with any kind of data,
| and you can immediately seek to data in very large archives
| rather than having to parse the entire thing first. And then
| there's that whole compression thing. If you wanted to
| distribute a thousand tiny blobs of CSV, JSON, YAML, and XML,
| all in one container, you could do much worse than Zip.
| richardwhiuk wrote:
| YAML has all of the problems of JSON with some of the
| problems of XML, and some new ones thrown in. Avoid.
| rjh29 wrote:
| I've had a number of negative experiences with yaml, enough
| to put me off using it. For example the implicit parsing of
| 'yes' and 'no' into bools rather than strings (including
| the NO country code for Norway)
| <https://hitchdev.com/strictyaml/why/implicit-typing-
| removed/>, the no-quote rules allowing accidental creation
| of inline hashes/arrays
| <https://hitchdev.com/strictyaml/why/flow-style-removed/>,
| multiline string syntax so complex that it needs a helper
| tool <http://yaml-multiline.info/>, and powerful extensions
| that invite your program to be exploited
| <https://www.sitepoint.com/anatomy-of-an-exploit-an-in-
| depth-...>
|
| It manages to be both a poor data interchange language
| compared to JSON, and also a bad human-friendly langage due
| to the above ambiguities.
|
| Unfortunately it's still the _best_ human-friendly
| configuration language in wide use, so I use strictyaml
| (https://hitchdev.com/strictyaml/) instead.
| kortex wrote:
| NO is not even the worst of it. `on`, like in github
| actions, is interpreted as True by PyYaml by default. You
| have to either quote it, "on", or set certain configs I
| haven't bothered with just yet.
|
| I fully agree YAML is just...not good as a
| transport/interchange serde.
|
| Personally, I actually really like HCL as a human-
| friendly config language, but it's got challenges in
| writing it, and thus support in most languages, if even
| present, is read-only.
|
| Will look into strictyaml!
| kortex wrote:
| Zip isn't a binary data structure protocol though, it just
| provides a compression protocol. I'd argue that while zip
| is technically older than gzip (3 years, 89 vs 92), it was
| proprietary for much of its history, and thus gz is an
| older "standard".
| BerislavLopac wrote:
| TOML?
| prionassembly wrote:
| Is anyone sending sqlite binary blobs over the wire?
| Foreign keys as a replacement for recursive arrays sounds
| like a win...
| jerf wrote:
| Part of the problem is that there's at least half-a-dozen
| high quality answers out of the gate (gRPC, FlatBuffers,
| Protocol Buffers, XML in some cases, Thrift), and an even-
| longer long tail after that. It's made harder when four
| different teams who deeply loathe JSON and independently
| decide to use something "better" can legitimately use four
| completely different technologies if they don't communicate
| with each other.
| 35fbe7d3d5b9 wrote:
| To your comment above - you can bodge around interop
| problems with JSON in ways that you cannot with some of
| these other technologies.
|
| I like to joke that I invented ndjson over a decade ago
| when I accidentally forgot to put things in an array before
| `json.dumps`, I just wasn't smart enough to call it a
| standard. But when you do end up with ndjson when you
| wanted an array of results, or vice versa, JSON makes it
| easy to munge things to where you need.
|
| Compare that to something like protobuf: it's not a self-
| synchronizing stream, so if you send someone multiple
| messages without framing them (prefix by length or
| delimited are popular approaches), they're going to decode
| a single message that doesn't make much sense on the other
| end. And they won't be able to fix it at all.
|
| So I guess JSON is New Jersey style design[1].
|
| [1]: https://dreamsongs.com/RiseOfWorseIsBetter.html
| kortex wrote:
| Well, you invented one of the best things since sliced
| bread! I love NDjson, being able to parse a sequence of
| {} objects as an array is just frankly more natural. A
| coworker got some absurd speedup going from some massive
| json array to ndjson.
|
| Honestly if json had as part of its spec line-delimited
| arrays, and accepting NaN, it'd be close to perfect. Oh
| and native ints, but that is JS's problem.
|
| Well, and a single, canonical spec. And a hard limit
| (however high) on nesting depth. And some other things.
| Ok, maybe it's far from perfect.
| q3k wrote:
| > Compare that to something like protobuf: it's not a
| self-synchronizing stream, so if you send someone
| multiple messages without framing them (prefix by length
| or delimited are popular approaches), they're going to
| decode a single message that doesn't make much sense on
| the other end. And they won't be able to fix it at all.
|
| FWIW, this is a conscious design decision with Protobuf:
| it allows for easy upsert operations on serialized
| messages by appending another message with the updated
| field values. This is very useful for middleware that
| wants to either just add its own context to a message it
| doesn't even parse [1], or for middleware that might
| handle protobuf messages serialized with unknown fields.
|
| On the other hand, 'newline delimited protobuf' is much
| less useful day-to-day than ndjson, as gRPC provides
| message streaming, which solves the issue of wanting to
| stream small elements of a long response (which is the
| general usecase of ndjson from my experience). For on-
| disk storage of sequential protobufs (or any other data,
| really), you should be using something like riegeli [2],
| as it provides critical features like seek offsets,
| compression and corruption resiliency.
|
| [1] - eg. passing a Request message from some web server
| frontend, through request routers, logging, ACL and
| ratelimit systems up to the actual service handling the
| request.
|
| [2] - https://github.com/google/riegeli
| syncsynchalt wrote:
| > teams who deeply loathe JSON
|
| In the current world this seems like a lifestyle choice
| that sets yourself up for constant self-punishment.
|
| I might be a curmudgeon but I'll take JSON for data interop
| any day over anything that _requires_ tooling (protobuf,
| gRPC). And I'll take it over the XML ecosystem too.
|
| The faults of JSON seem, in practice, to be less harmful
| than the faults of other formats.
| throw_m239339 wrote:
| Most of your problems aren't problems
|
| > 3. Do you pass around user-provided JSON data accross your
| system? How many JSON nesting levels does your implementation
| allow? What happens if it's exceeded? What happens if different
| parts of your processing system have different limits? What
| about other unspecified limits like serialized size, string
| length?
|
| XML has the same issue, that's why SAX exists, it works the
| same way with JSON.
|
| > 2. JSON allows for duplicate/repeated keys, and allows for
| the parser to basically do anything when that happens. Do you
| know how the parser implementation you use handles this? Are
| you sure there's no differences between that implementation and
| other implementations used in your system (eg. between
| execution and validation)? What about other undefined
| behaviour, like permitted number ranges?
|
| A parser should... parse and not interpret data or it isn't a
| parser. it's a deserializer. Well how many languages allow
| duplicate keys for maps anyway? this isn't an issue in
| practice.
|
| Basically, the answer to all your problems is to use an evented
| parser instead of a deserializer.
| kortex wrote:
| > this isn't an issue in practice.
|
| It absolutely is an issue in practice. If system A handles
| dupes by accepting the first and ignoring the rest, and
| system B implements last-key-wins, then that's a potential
| source of bugs. The system might not fully parse to a map.
|
| It may, for example, do string-level modification of json
| strings. Is that disgusting and wrong? Yes. Have I seen it in
| prod? Also yes.
| throw_m239339 wrote:
| > It absolutely is an issue in practice. If system A
| handles dupes by accepting the first and ignoring the rest,
| and system B implements last-key-wins, then that's a
| potential source of bugs. The system might not fully parse
| to a map.
|
| But the system shouldn't be automatically be parsing a
| "json map" to a map at first place:
| {"foo":"bar","foo":"baz","foo":"qix","fiz":"buzz"}
|
| Shouldn't be deserialized into a map. but a
| Array<Map<string,string>> like structure.
|
| A SAX style parser for JSON can help do that.
|
| Thus the issue is the choice of parser indeed. Not JSON.
| q3k wrote:
| > Shouldn't be deserialized into a map. but a
| Array<Map<string,string>> like structure.
|
| But that's the thing: you might actually expect/want a
| Map<string,string>, but a malicious/broken system might
| emit something that cannot be deserialized into a
| Map<string,string>. It's then the JSON
| parser's/deserializer's job to figure out what to do, as
| the standards say to do whatever. That in turn causes
| different parsers/deserializers to behave differently
| (whatever the implementer thought makes sense), which is
| a source of interoperability bugs.
| dragonwriter wrote:
| > But that's the thing: you might actually expect/want a
| Map<string,string>
|
| Yes, but that's not the semantics of a bare JSON object;
| if you want the ability to commubicate that you intend
| that, then you use a schema language like JSON schema,
| which lets you say that the JSON map _in this element_
| doesn 't allow duplicate keys and requires the values to
| be strings, at which point tools that read the schema
| language no it is safe to deserialize as Map<string,
| string>.
| throw_m239339 wrote:
| > But that's the thing: you might actually expect/want a
| Map<string,string>, but a malicious/broken system might
| emit something that cannot be deserialized into a
| Map<string,string>. It's then the JSON
| parser's/deserializer's job to figure out what to do, as
| the standards say to do whatever. That in turn causes
| different parsers/deserializers to behave differently
| (whatever the implementer thought makes sense), which is
| a source of interoperability bugs.
|
| I disagree, people are mixing up parsing and
| deserializing. The JSON spec isn't at fault here. The
| JSON spec is only concerned with defining the parsing,
| not the deserialization, because obviously, a JSON array
| isn't a PHP array or a Ruby array, a JSON map isn't a PHP
| object or a Go map at first place.
|
| The problem isn't with JSON but how some JSON
| deserializers work. Again, a deserializer isn't a parser.
| q3k wrote:
| > The problem isn't with JSON but how some JSON
| deserializers work.
|
| That makes no observable difference to the end-user of
| JSON wishing to use it as an interchange format. The
| standard might as well be perfect, but if nearly all of
| its implementations (yes, extending that into
| deserialization, not just parsing - because that's how
| most people use JSON!) are problematic, then the standard
| is effectively also problematic. This is why I also
| always include Python's broken implementation in my JSON
| rant - it's not indicative of the standard(s) being bad,
| but the ecosystem being bad.
| throw_m239339 wrote:
| > That makes no observable difference to the end-user of
| JSON wishing to use it as an interchange format. The
| standard might as well be perfect, but if nearly all of
| its implementations (yes, extending that into
| deserialization, not just parsing - because that's how
| most people use JSON!) are problematic, then the standard
| is effectively also problematic. This is why I also
| always include Python's broken implementation in my JSON
| rant - it's not indicative of the standard(s) being bad,
| but the ecosystem being bad.
|
| Yes it does makes a difference to the end user. Otherwise
| why single out JSON? XML or YAML would suffer from the
| exact same issue.
|
| Deserializers are an anti-pattern if they don't follow a
| strict schema. The problem again isn't the JSON spec,
| it's some deserializers making assumptions about JSON
| types.
|
| In practice data have specs and schemas so JSON/XML/...
| payloads should also have schemas.
| detaro wrote:
| > _Basically, the answer to all your problems is to use an
| evented parser instead of a deserializer._
|
| Which "nobody" does, so it is a problem in practice.
| throw_m239339 wrote:
| > Which "nobody" does, so it is a problem in practice.
|
| who's nobody? if developers care about performances, they
| obviously do. What if the json file is 500MB of logs?
| Furthermore, all these JSON deserialization lib tricks
| might work in some languages that are dynamic or support
| runtime reflection, it doesn't for other languages where
| using a proper evented parser is mandatory.
| recursive wrote:
| > use an evented parser
|
| I've never heard of this. A google search isn't particularly
| illuminating. What is an "evented parser"?
| throw_m239339 wrote:
| Google "event-based parser"
| dragonwriter wrote:
| > What is an "evented parser"?
|
| Also knowns as a "streaming parser", its a parser that
| takes in a data stream and produces a stream of events
| which client code can handle; it allows more flexible
| handling than deserializers, including ability to handle
| arbitrarily large input. SAX is a streaming/evented parser
| API for XML, and there are similar ones for other formats.
| Smaug123 wrote:
| Just a parser which fires events you can listen on when its
| internal state machine changes state.
| jerf wrote:
| "My general opinion is that it's extremely hard to reliably use
| JSON as an interchange format reliably when multiple systems
| and/or parser implementations are involved."
|
| I suspect one of the reasons that JSON has been so successful
| is precisely this fuzziness, though. Every language can do
| something a little slightly different and it'll work at first
| when you send it to somebody else. You get up and off the
| ground really quickly, and can fix up issues as you go.
|
| If you try to specify something with a stronger schema right
| off the bat, I find a number of problems immediately emerge
| that tend to slow the process down. It may be foreign to
| programmers on HN who have embraced a strong static type
| mindset, or dynamic programmers who have learned the hard way
| that sometimes you need to be more precise about your types,
| but there's still a lot of programmers out there who will
| wonder why you're asking them whether this is an int or a float
| is relevant. I came in to work this morning to an alert system
| telling me that a field that a particular system has been
| sending as an integer for a couple of months now over many
| thousands of pushes, "number of bytes transferred", is
| apparently capable of being a float once every several thousand
| times for some reason. There's a lot of programmers who will
| send a string, or a null, or maybe a float, or maybe it's
| always an integer, and deeply don't understand why you care
| what it's getting serialized as.
|
| And that's just an example of some of the issues, not a
| complete list. Trying to specify with some stronger system
| moves a lot of these issues up front.
|
| (If your organization has internalized that's just how it has
| to be done, great! I bet you encountered a lot of these bumps
| on the way, though.)
|
| This isn't a celebration of JSON per se... this is really a
| rather cynical take. I don't know that we need to type
| everything to the n'th degree in the first meeting, but "why
| can't we just let our dynamically-typed language send this
| number as a string sometimes?" is definitely something I've had
| to discuss. (Now, I don't get a lot of resistance per se, but
| it's something I have to bring up.) I'm not presenting this as
| a good thing, but as a theory that JSON's success is actually
| in large part _because_ of its loosey-gooseyness, and not
| despite it, regardless of how we may feel about it.
| dec0dedab0de wrote:
| _I suspect one of the reasons that JSON has been so
| successful is precisely this fuzziness, though. Every
| language can do something a little slightly different and it
| 'll work at first when you send it to somebody else. You get
| up and off the ground really quickly, and can fix up issues
| as you go._
|
| I agree. Sort of how xhtml never really caught on because it
| was too strict. I never understood the desire to make things
| break when it's often less effort to make them work.
|
| Though I think the biggest benefit of JSON is that it is so
| simple, at least compared to XML. It makes it harder to just
| dump your internal data structures as is. Which forced people
| to actually serialize their data. Though with time people
| have overcomplicated it with objects that have "type" and
| "value" fields, basically designing their own standard.
|
| * There's a lot of programmers who will send a string, or a
| null, or maybe a float, or maybe it's always an integer, and
| deeply don't understand why you care what it's getting
| serialized as.*
|
| As far as changing the type depending on the situation, I
| kind of wish that was more common. I like the idea of
| conveying meaning based on type, but for it to work well it
| would need more standard types, plus anyone using a static
| language would be mad at you.
| q3k wrote:
| > Though I think the biggest benefit of JSON is that it is
| so simple, at least compared to XML.
|
| Or more precisely, that it appears simple at first glance,
| and that it is very easy to get started with. TFA (or just
| practical experience trying to build an interoperable JSON-
| based API) should convince anyone that it is not simple in
| the long term :).
| dwaite wrote:
| > Though I think the biggest benefit of JSON is that it is
| so simple, at least compared to XML. It makes it harder to
| just dump your internal data structures as is. Which forced
| people to actually serialize their data. Though with time
| people have overcomplicated it with objects that have
| "type" and "value" fields, basically designing their own
| standard.
|
| XML is a document language with features like mixed content
| to represent concepts like subsections of formatted text.
| IMHO quite a few of XML's failings were in the "data
| format" crowd being a separate camp, and the two never
| really pushing for good middle ground.
|
| For the crowd that wanted a common scaffolding for document
| formats, having the rules between say namespace usage in
| XHTML vs Docbook-XML would not be a problem. For instance,
| HTML states you should ignore unrecognized tags and instead
| just show the text contents.
|
| That all came back to bite hard when the data model people
| started to try to do canonicalization and document signing.
|
| A "strict" variant of JSON fits on a napkin - basically,
| reject documents with multiple identical keys in an object,
| represent native numbers using IEEE double-precision
| floating point, reject documents which do not meet the
| grammar.
| mbeex wrote:
| > Though I think the biggest benefit of JSON is that it is
| so simple
|
| Still, I wish there was an option to insert comments.
| lifthrasiir wrote:
| I'm not convinced. There are a lot less JSON
| implementatations than JSON users, so we should have been
| possible to guide implementations with a means of proper
| specification and test suites. Note that the OP is possibly
| the first ever complete test suite for JSON after 15 full
| years. It is not like seeding initial implementations (that
| can serve as models for future implementors) is particularly
| hard either; Douglas Crockford himself wrote two
| implementations in C and JavaScript.
| 35fbe7d3d5b9 wrote:
| > I highly encourage any greenfield project to look into well
| designed and better specified alternatives.
|
| By way of recommendation: I reach for protobufs to do data
| interchange between polyglot systems and have yet to be
| disappointed. Even if you aren't getting into gRPC, having data
| interchange backed by codegen and an IDL removes a lot of the
| risk you get with data interchange.
| theamk wrote:
| In my experience, protobuf has a minimum project complexity
| threshold before it starts to make sense.
|
| Yes, if both sides of your interchange are systems which have
| build infra setup, it provides a better experience. But if
| you need to access data from outside of your usual projects,
| or from shell, or from random data analysis notebooks, It
| becomes a major pain.
|
| Recent example: we've had an orchestrator script which was
| written in "python with stdlib only" - no build step,
| download an archive, extract and run. This script had to talk
| to third-party program which would export protobuf only. This
| was a major pain as yon can imagine.
| avmich wrote:
| In my experience JSON allowed absence of codegen and superior
| schema definition capabilities to protobuf, and also nice
| transformations with parts of jq built into JSON libraries.
| Try to limit structure complexity to something which can be
| verified before usage, yes. YMMV.
| dlsa wrote:
| So many standards, for sure. But... parsing json is actually
| simple enough. You require those who send you data to comply with
| specific libraries during export and import. If they send a file
| which can't be imported then they sent a corrupted file. Bonus if
| you lock the version. Be as specific as you need to be.
|
| There are people who will quibble around "there are thousands of
| libraries". No there aren't. There's just the N you support.
|
| We specify all sorts of details for other aspects of computing.
| Why wouldn't we specify the data format as well? Change control /
| configuration management are very useful.
|
| This is how you reduce pointless complexity. Nip it in the bud as
| early as possible.
|
| EDIT: Not sure why people disagree with this comment. This is
| basic data management. Are people really asserting that we are
| NOT allowed to set a minimum standard? This is also called
| "setting boundaries".
| belter wrote:
| Previous discussion:
|
| 2016: https://news.ycombinator.com/item?id=12796556
|
| 2018: https://news.ycombinator.com/item?id=16897061
|
| 2019: https://news.ycombinator.com/item?id=20724672
| [deleted]
| benibela wrote:
| That is why I maintain my own JSON parser. First I started with
| the parse from FreePascal's standard library. Then I ran test
| cases on it, and there were lots of issues I had to patch.
|
| First it was accepting all kinds of numbers, so I rewrote it to
| only accept the numbers from the spec
|
| Then it was removing invalid \u escapes, while I needed it to
| replace them with U+FFFD.
|
| Then I needed the unchanged input. Besides the test cases from
| the article, I ran test cases from the W3C XPath test suite. The
| W3C has a very odd understanding of JSON. Besides the normal
| numbers and Unicode U+FFFD replacement, the JSON parser must be
| able to parse it unchanged. That means, if the input number is
| like 100 or 1e2, the parser must be able to return that as string
| "100" or "1e2". Those are different numbers. And there must be a
| user defined replacement of invalid \u, like you set the
| replacement to identity and the input is "\uDEAD\u002D\udead",
| then the parser must parse that as "\uDEAD-\udead" while keeping
| the case.
| qualudeheart wrote:
| Can copilot parse json?
| eatonphil wrote:
| On the topic of JSON and minefields, what is your experience
| using JSON5? I'm considered moving to it for configuration files
| in an application I'm building.
| AnthonBerg wrote:
| I find it much more pleasant to work with.
| lifthrasiir wrote:
| JSON5 mostly extends JSON's syntax, not its data model (it
| still doesn't outlaw duplicate object keys for example).
| eatonphil wrote:
| This article is about parsing though so I am mostly asking
| about that.
| lifthrasiir wrote:
| "Parsing" can mean wildly different things indeed. In this
| case though the article does check duplicate keys and
| numeric range & precision, so the data model is definitely
| in question.
| [deleted]
| jmull wrote:
| This is a big problem for people writing general JSON
| processors/parsers.
|
| But it's not too bad an issue for specific applications/systems
| using JSON...
|
| They need their JSON to be in the correct form to represent their
| "business objects" (or whatever you want to call your application
| or system-specific data types), which is already a very
| restricted subset of JSON that a standard can't help with, and
| only rarely need to bump up against the oddness JSON has around
| the edges.
|
| (Not that people won't bump up against these issues more than
| they really need to -- e.g, I recently saw someone trying to rely
| on multiple keys to mean something specific, which is a
| fun/interesting idea but is crazy to want to put into
| production... but good specs won't stop people from wanting to do
| crazy things.)
| cryptica wrote:
| It seems like all the 'problematic' edge cases mentioned can
| easily be dealt with using runtime type validation and are not
| the concern of an interchange format like JSON which is (and
| should be) optimized for maximum flexibility/interoperability.
| The server should not trust the data inside JSON objects sent by
| remote clients; there should be some kind of runtime type
| validation; it's expected that different programming languages
| might interpret the content of the same JSON object slightly
| differently for certain unusual edge cases. IMO, as an
| interchange format, JSON should be allowed to evolve over time;
| JavaScript has already proven this model to be effective; you can
| always add features and add flexibility but cannot take away
| features or remove flexibility.
| onion2k wrote:
| Most (all?) the complaints here appear to be that specific
| libraries fail to implement the JSON spec in the way that the
| author has interpreted it. Some libraries try to 'help' by
| parsing things that they shouldn't, and some fail to parse things
| they probably should.
|
| This is why we end up with so many JSON parsing libraries I
| guess, but it's not _really_ a problem with the format itself,
| beyond the fact that clearer specs might disambiguate things and
| lead to less deviation.
| q3k wrote:
| > but it's not really a problem with the format itself, beyond
| the fact that clearer specs might disambiguate things and lead
| to less deviation.
|
| It is a problem, because it's not a spec that can be
| implemented reliably. Different parsers behave differently on
| various corner cases not only because of implementation
| blunders, but also because the standard(s) just let them do
| whatever. This spectacularly breaks systems that use more than
| one parser implementation, each slightly implementing the
| standard slightly differently. One part of some
| processing/parsing pipeline will let some payload through,
| while another one will reject it, or even parse it differently.
| horsawlarway wrote:
| I disagree (at least mostly).
|
| This is a case where the spec is intentionally loose to allow
| compatibility with a much larger number of machines and use
| cases.
|
| You'll notice most of the cases where the behavior is
| implementation defined have resource requirements. example:
| how deep you want to allow nesting depends a _LOT_ on the
| capabilities of the machine running the code. A sane value
| for a modern browser is going to be unworkable on an arduino
| /ESP32/embedded other.
|
| Also... if these ambiguities bother you, you probably haven't
| read the full http spec either. It's riddled with cases where
| behavior is implementation defined, for exactly the same
| reasons (resources are required, and you can't assume
| everyone has the same amount available). Want to take a guess
| at the maximum length for a url?
| q3k wrote:
| > This is a case where the spec is intentionally loose to
| allow compatibility with a much larger number of machines
| and use cases.
|
| There's plenty that could've been specified with little
| detriment to small systems: strings are UTF-8 with a well-
| defined escape sequence set, numbers are always IEEE-754
| doubles, messages cannot be nested by more than 128 levels
| (or some other arbitrary number in this range), repeated
| fields are not permitted, everything non-compliant must
| fail the entire parse. Then the only thing left to handle
| is a maximum serialized size (which can be explicitly
| implementation or user defined). Set the maximum string
| length to maximum payload length defined earlier and you're
| golden. That is then your only difference between
| implementations.
|
| This will work on your Ryzen server and on your ESP8266 or
| ESP32, and can even be handled on your washing machine
| microcontroller^W^W^WArduino (with a slowdown for dealing
| with floating point numbers, but you already have to deal
| with that).
|
| Finally, the spec isn't loose because of some design choice
| to allow interoperability with more machines: it's loose
| because it was historically loose (see: JSON business card
| 'specification' chutzpah, which itself is based on a mess
| of a programming language that is/was JS), and before it
| could be formalized to something sensible it got
| implemented haphazardly by different languages. That doomed
| the format to forever be underdefined, as anything more
| strict would render existing implementations non-compliant.
| horsawlarway wrote:
| But that attempt at strictness harms implementation
| value.
|
| Even your own requirement set that you've claimed will
| work on everything is... bad - Sure I can parse every
| number as a double, if I'm willing to spend at least
| 64bits on every number in the payload.
|
| I just finished building a PH Autodoser for a hydroponics
| system I run - it sends JSON payloads with sensor data
| and receives JSON commands to do things like dispense
| PHDown/PHUp solution, toggle on water cooling. I have
| _very_ little spare working memory on the device doing
| the actual monitoring. having to hold 64 bits per number
| would push me into having to buy a more expensive
| microcontroller.
|
| Instead - I have an informal contract that almost all
| fields are plain unsigned bytes (0 to 255) which works
| fine for my use-case, requiring just 1/8th the space.
|
| And to go the other direction - I have a desktop running
| some financial software, I pass around json payloads
| there, but a double is NOT ENOUGH. I want a BigInt field
| for numbers there instead, because rounding errors that
| would be a-ok for a ph sensor are absolutely not ok for
| calculating financial data.
|
| ----
|
| Basically - I want the flexibility to chose the correct
| interpretation for my data.
|
| And this: "everything non-compliant must fail the entire
| parse." Is just fucking insanity. It's the literal
| antithesis of the robustness principle:
|
| "be conservative in what you send, be liberal in what you
| accept"
| karmakaze wrote:
| What you're describing here is a schema-specific parser.
| Even if the parser succeeded, you would reject the input
| as the values are out of range. Making a custom parser
| for this is fine, but call it what it is a parser for a
| subset of JSON--it would fail for a value of 0.1 or -1.
| horsawlarway wrote:
| Sure. The problem is that many things that a very strict
| spec might require have real resource requirements.
|
| There's a reason the URL length in http is undefined,
| it's because the machine accepting the request doesn't
| have infinite memory. Even the latest spec is a simple
| recommendation to accept a request line of at least 8k
| octets.
|
| You can say "We must support nesting depth of N" in json,
| but the reality of the situation is that parsers can and
| will just ignore you. Are they non-compliant? Sure. Are
| they useful? Sure.
|
| Will people still use them? Fuck yes they will. Because
| utility trumps strictness in most cases.
| q3k wrote:
| > Instead - I have an informal contract that almost all
| fields are plain unsigned bytes (0 to 255) which works
| fine for my use-case, requiring just 1/8th the space.
|
| Right, but that informal contract is at the detriment of
| everyone else having to also specify the expected limits
| of numbers they work with. It makes your particular
| usecase easier, but it doesn't make the standard better
| in the grand scheme of things.
|
| > And to go the other direction - I have a desktop
| running some financial software, I pass around json
| payloads there, but a double is NOT ENOUGH. I want a
| BigInt field for numbers there instead, because rounding
| errors that would be a-ok for a ph sensor are absolutely
| not ok for calculating financial data.
|
| And JSON doesn't guarantee you that, you have to shop
| around for languages and implementations that permit
| this. If you then have to make work with an
| implementation that always deserializes to doubles (which
| is a compliant behaviour) or bytes (which is a compliant
| behaviour), you're screwed. Again, this might work for
| the simple case of you controlling both ends of the
| serialization, but it's terrible for trying to work with
| an end that you don't control (ie. when actually using
| JSON as an interchange format).
|
| > And this: "everything non-compliant must fail the
| entire parse." Is just fucking insanity. It's the literal
| antithesis of the robustness principle: "be conservative
| in what you send, be liberal in what you accept"
|
| The Robustness Principle followed blindly is known to be
| harmful when dealing with long-term standards, evolving
| implementations and the human element of software
| engineering [1]. My opinion is that an interchange
| format's job is to transfer some data reliably and
| atomically: the deserialized data should be either be
| 100% correct or the deserialization should be rejected.
| Anything else can and will lead to bugs, and bugs that
| are then difficult to solve (as at that point it's
| difficult to agree whether the serialization was not
| conservative enough, or the deserialization not liberal
| enough).
|
| [1] - https://datatracker.ietf.org/doc/html/draft-iab-
| protocol-mai...
| horsawlarway wrote:
| Ok - so now you have a very strict protocol, that never
| gains traction because the strictness you value hampers
| utility.
|
| And yes - I'm aware of the "Bug for bug compatibility"
| problems that draft tries to highlight, but it's fairly
| clear that utility is paramount:
|
| > As [SUCCESS] demonstrates, success or failure of a
| protocol depends far more on factors like usefulness than
| on on technical excellence. Timely publication of
| protocol specifications, even with the potential for
| flaws, likely contributed significantly to the eventual
| success of the Internet.
| [deleted]
| Ygg2 wrote:
| This is from 2016, no? Why was it reposted? Did something changed
| significantly?
| MrBuddyCasino wrote:
| Old articles that consistently do well are periodically re-
| submitted by accounts that want to farm karma points. Why, I
| don't know.
| kergonath wrote:
| Resubmitting is one thing, but if it is upvoted, it means
| that at least some people find or interesting or valuable. If
| some people do, then resubmitting it was useful.
| MrBuddyCasino wrote:
| This erodes the quality over time, as the platform becomes
| less useful to regulars and long-time members, and thus
| dis-incentives investment and care. Every open platform
| without rules devolves into a porn distributor, so strictly
| going by ,,what's popular" is problematic.
| kergonath wrote:
| I hear you, but that's the whole point of HN. Nobody
| takes editorial decisions.
|
| It's great if your interests are aligned with the
| community and as long as the noise is manageable by the
| voting system.
|
| Besides, there is quite a bit of turnover on the front
| page. A post that you find useless today will probably be
| gone from the front page before tomorrow.
| account-5 wrote:
| Cynical and for some maybe true. Or it could be people are
| genuinely posting something they have just read for the first
| time and thought others might find it interesting. I count
| myself in this group; posting and finding this interesting. I
| do search before posting though others may not.
| petee wrote:
| Ive resubmitted a post that i knew was on HN a few years
| prior, but forgot just how eye opening it was at the time and
| thought that surely some missed it and would appreciate it
| again. You're probably right some people do it for points,
| but more likely its just more people fascinated by a specific
| topic, so more submissions.
|
| My repost in particular was the ASCII/binary 4-column
| representation rather than the typical 3-col, which makes a
| big difference in understanding
| MrBuddyCasino wrote:
| I should have been more elaborate. Your case is of course
| fine. But I noticed accounts with a very high karma count
| that submit a huge volume of articles, but hardly any
| comments. I don't know exactly what's going on, but I find
| it slightly weird.
| GuB-42 wrote:
| Sometimes, articles are reposted because _nothing_ changed
| significantly.
|
| We sometimes get articles from the 19th century that are still
| relevant today, and it is interesting to see our ancestors
| perspective on the problem, and an old article on a problem we
| still have today is a good indication that there is no easy fix
| and it won't go away anytime soon.
| Ygg2 wrote:
| Ok, but nothing points this is still relevant. Were tests re-
| run or something? Last update was 3 years ago.
| coldtea wrote:
| We don't need special tests and metrics to point us that
| this is still relevant. We know it is, and no, nothing has
| changed since.
|
| Beyond this particular case, this is a social link-voting
| website. If people submit and vote for an older article, it
| will be in the front page, end of story. Doesn't matter if
| it still holds or not - it's enough that people found it
| still interesting to submit and upvote. Some of the better
| discussions here happen the nth time the same post is on
| the front page (and some posts get on the top page 5-10
| times in a decade). There's also a handy link on HN to show
| previous submissions of the same post, and the discussions
| that ensued.
| Ygg2 wrote:
| Ok, but what is the proof nothing has changed? I just see
| a repost, of a really good article.
|
| No test-suit runs, not even a glib message saying "It's
| year 2021, and nothing in test suite has changed".
| coldtea wrote:
| > _Ok, but what is the proof nothing has changed?_
|
| It's the so-called experimental proof. We see it every
| day in practice.
|
| (This is not a research lab).
| kergonath wrote:
| It's here because someone posted it, and enough people
| upvoted it, and not enough flagged it. There is no
| conspiracy, things just show up on the front page
| depending on what we collectively want to read.
|
| And it is an interesting article, and I hadn't read it
| before, so I upvoted it as well.
|
| The rule of thumb is that de-posting is acceptable after
| ~1 year.
| IggleSniggle wrote:
| Reposting doesn't have the same negative connotation here as it
| might other places. Sometimes valuable insights come from older
| works. Sometimes a conversation is worth having on HN with a
| contemporary context. If it ends up on front page, folks are
| finding it valuable to discuss.
| Ygg2 wrote:
| I don't understand the reasoning behind it, and my genuine
| question has been flagged.
|
| Why is it posted now, rather than a year or two years ago?
| peterkelly wrote:
| Only the person who posted it knows exactly why they chose
| this particular day to do so. Probably they came across it
| in the course of their work and thought it might be
| useful/interesting to have a discussion about it.
| Apparently a lot of other people agreed because they
| upvoted it.
|
| Revisiting old articles every few years can be useful,
| because the set of people participating in the discussion
| is likely to be substantially different from those who
| commented on it the last time it appeared. Those people may
| have insights or information to share that weren't
| discussed previously. Maybe some of the people commenting
| here hadn't even started working in the industry when the
| original post was made. And even people who were part of
| the first discussion may have new thoughts on the topic.
|
| As an example, while I was certainly aware of and using
| JSON at the time this article was written, and recall
| reading it at the time, it is actually much more relevant
| to me now because I am working on a project that uses JSON
| in a different way than what I'd done previously.
| Specifically, we rely on the fact that the same piece of
| data will always serialize to the exact same string, which
| we hash and use for later comparisons. We've run into
| issues relating to interoperability problems between
| different implementations exactly because of the issues
| discussed in the article (and yes I'd prefer to use an
| alternative format for these reasons). This is something I
| wouldn't have been concerned with at all on previous
| projects where it didn't matter if there were slight
| differences. That's just a datapoint on why I personally
| have renewed interest in this discussion.
|
| If you look at a lot of other sites like Reddit (where
| reposts of articles are often discouraged), you'll often
| find that on question-based subs, many different people ask
| the same few kinds of questions with extreme regularity.
| Subs like /r/askreddit and /r/relationships are full of
| examples. HN is as much about discussion (if not more) than
| the actual articles themselves, and as mentioned above such
| discussions can offer something new each time. So as long
| as they're not repeated too often, reposts can still have
| value.
| IggleSniggle wrote:
| I suspect you were flagged for seeming to complain that the
| submission was inappropriate. See the HN guidelines:
| https://news.ycombinator.com/newsguidelines.html
|
| Posted now because it is still relevant. Previous
| posting/discussion:
|
| 5 years ago: https://news.ycombinator.com/item?id=12796556
|
| 3 years ago: https://news.ycombinator.com/item?id=16897061
|
| 2 years ago: https://news.ycombinator.com/item?id=20724672
| dec0dedab0de wrote:
| Edit: After seeing the comments, I checked my REPL history, and
| the bad data was still there. luckily with the spaces displayed
| as \040. Turns out the offending space was \240, which makes more
| sense. Please disregard this comment.
|
| If you're curious, the problem stems from stuffing JSON in the
| description field of an external system to do our own tagging.
| Someone (me) must have copy/pasted from a screwy source. We were
| pulling out our hair trying to figure out what was wrong with it,
| and I just stuck it in the REPL, and saw the offending character
| was a space. Manually deleted the extra space and it was fine. A
| quick google showed space was part of the convention, and we were
| like "woah that's weird how did we never stumble across that
| before." I am embarrassed that I posted this now. My only excuse
| is that I just got over Covid, so I'm going with that :-).
|
| This was my original comment:
|
| After about a decade of using JSON I just discovered the hard way
| that you can only have one space after the colon between a key
| and value. Atleast with the python JSON library.
| lifthrasiir wrote:
| Which version of Python and/or JSON library? Since Python's
| built-in `json` module was first introduced in 2.6 and I can't
| see any evidence of this bug throughout the relevant code
| (either pure Python or C implementations).
| dec0dedab0de wrote:
| See my edit, it was just a stupid mistake on my part. Thanks
| for pointing this out.
| pythonthecware wrote:
| Yeah python is not exactly a language for the web. I know fresh
| grads and inexperienced professors claim otherwise but its not.
| Eikon wrote:
| What is "a language for the web"?
| pythonthecware wrote:
| Any language that decodes json without issue. I mean it's
| _the_ data format of the web ain't it?
| [deleted]
| [deleted]
| lcrz wrote:
| What a dumb take.
| pythonthecware wrote:
| Says she while having issues decoding json as of 2021.
| sebzim4500 wrote:
| I don't see any evidence that python has issues decoding
| json.
| valparaiso wrote:
| Lmao half of SV startups are on python
| tyingq wrote:
| $ python -c 'import
| json,sys;print(json.dumps(json.loads(sys.argv[1])))' '{"a":
| "b"}' {"a": "b"}
|
| Maybe it's been fixed? That's Python 3.8.10. What version is it
| broken in?
| lcrz wrote:
| Python 2.7.16 (default, May 8 2021, 11:48:02) >>>
| import json >>> json.loads('{"a" : "b" }')
| {u'a': u'b'}
| dec0dedab0de wrote:
| In case either of you check your threads page, it was just
| a stupid mistake on my part. See my edit. Thank you for
| correcting me.
| SavantIdiot wrote:
| I deleted my snarky comment because I want to be more serious.
|
| You never know when you hack something quick if it may become a
| future standard for trillions of handshakes!
|
| JSON looks so clean and tidy on that business card, but when you
| look at both RFCs you realize: ZOMG there is a lot of stuff that
| needs to be thought out!!
|
| There are some purists in this thread who claim it is the
| parsers' fault. I can almost get behind that, but not 100%
| because you really do need to be more clear about several things
| (different types of numbers, different character sets, minimum
| requirements -and- maximum requirements...)
|
| I agree with OP that not including implicit version numbers was
| an oversight: it looks ugly, but if you're not going to put in
| _all_ the thought in at the start, at least make a version number
| required so you can ignore mistakes.
|
| Let this document be a lesson to anyone who writes a data
| schema/grammar that eventually replaces JSON.
| admax88qqq wrote:
| I dunno man, serialization in general is fraught with peril. At
| least the JSON grammar is short and simple. I can't believe
| people in here are genuinely recommending XML as an
| alternative. Try standardizinh XML in a 14 page RFC.
|
| Yes some things need to be though out, but I can sit down and
| read the JSON RFC start to finish easily.
| Ginden wrote:
| And many people choose to ignore that number parsing issues
| can happen in XML too.
| SavantIdiot wrote:
| > I dunno man, serialization in general is fraught with
| peril.
|
| I completely agree.
|
| And I'm certain I'm using JSON in an unsafe way... somewhere.
| :)
| bob1029 wrote:
| For JSON contracts that are of any reasonable level of complexity
| (many levels of nesting), I prefer to have the same serializer &
| strong type system on both ends. A common use case here is
| serializing dynamic business types as JSON in and out of blob
| columns.
|
| For what its worth, I have had maybe 2 hours total worth of
| struggles with JSON serialization over the last as many years,
| and we use it for pretty much _everything_. The biggest pain
| point for us is implementation-specific. Refactors of namespaces
| & dependent assembly names can cause trouble with polymorphic
| serialization (which can absolutely be secure if used
| responsibly). The only other pain point experienced is with
| regard to nullable vs non-nullable fields - again only a problem
| after a change takes place relative to pre-existing JSON
| documents.
| tinus_hn wrote:
| It isn't too bad, considering how many people on this site are
| still pining for the worst format in the world: csv
___________________________________________________________________
(page generated 2021-10-11 23:01 UTC)