[HN Gopher] Internet Object - A JSON alternative data serializat...
       ___________________________________________________________________
        
       Internet Object - A JSON alternative data serialization format
        
       Author : Starz0r
       Score  : 76 points
       Date   : 2021-10-24 12:39 UTC (10 hours ago)
        
 (HTM) web link (internetobject.org)
 (TXT) w3m dump (internetobject.org)
        
       | kabes wrote:
       | The example schema has:                 > age:{int, min:20}
       | 
       | Why would a data serialization format bother with data validation
       | like the minimum value here?
        
         | williamtwild wrote:
          | Front-end validation, perhaps?
        
           | kabes wrote:
           | Shouldn't be a concern of a serialization library
        
             | SV_BubbleTime wrote:
              | It's part of the schema, not the serialization, right?
              | I don't disagree that you parse first and validate
              | after, but in a schema that defines the type of the
              | data, it's not unreasonable to put limits on values.
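              | 
              | For comparison, JSON Schema can already express exactly
              | this kind of constraint. A minimal sketch using the
              | third-party jsonschema package (assuming it is
              | installed):
              | 
              |     from jsonschema import validate, ValidationError
              | 
              |     # Same constraint as the example: age is an integer >= 20.
              |     schema = {
              |         "type": "object",
              |         "properties": {"age": {"type": "integer", "minimum": 20}},
              |         "required": ["age"],
              |     }
              | 
              |     validate({"age": 25}, schema)  # passes silently
              |     try:
              |         validate({"age": 12}, schema)
              |     except ValidationError as e:
              |         print(e.message)  # 12 is less than the minimum of 20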
        
       | tom_ wrote:
       | Where is the spec? Why are there spaces after the commas? Why
       | does the example not include a string with commas in it?
        
       | deadfish wrote:
        | I see a benchmark for the data size... but as other comments
        | have suggested, gzip should remove the majority of that
        | difference.
       | 
       | I'd be more interested to know about serialisation and
       | deserialisation time.
        
       | hiyer wrote:
       | 60% savings won't really count for much when the traffic is
       | compressed, which is the case for most of JSON's uses. For real
       | savings I think you'd have to go with a binary format like
       | protobuf or thrift.
       | 
       | Edit: 50 -> 60
        
       | only_as_i_fall wrote:
        | If you follow the link that says "Read the Story Here", they
        | have a JSON example with a list of employees and then info
        | about the pagination of that list. The caption is this:
       | 
       | >If you look closely, this JSON document mixes the data employees
       | with other non-data keys (headers) such as count, currentPage,
       | and pageSize in the same response.
       | 
        | But they don't explain at all how changing the data format
        | fixes the underlying issue of mixed concerns in one data
        | object.
        
       | barelysapient wrote:
       | They lost me when they declared an address type.
       | 
        | Addresses are so varied in structure and meaning that baking
        | a single address type into the format is frankly ridiculous.
        
       | codeulike wrote:
       | How are commas or speechmarks in strings escaped?
        
       | __m wrote:
        | How do you have an array with different types of objects? You
        | either have to repeat the schema or reference it.
        
       | danfritz wrote:
        | Looks like CSV with a description attached. Gzip the JSON and
        | the biggest selling point goes away.
        
         | samhw wrote:
         | Precisely. Many people don't realise how exceptionally well
         | JSON compresses. Provided you're using it the way most do, to
         | send arrays of objects which share the same set of keys (or
         | some subset thereof), then all the keys will end up dictionary-
         | coded away, thus totally eroding the space advantage that this
         | format notionally has.
         | 
         | Plus JSON's exceptionally wide support means you can benefit
         | from SIMD-assisted decoders which will absolutely blow this out
         | of the water - and much, much more besides. I wish people would
         | devote their time to something more useful than 'yet another
         | competing standard'.
         | 
         | Edit: Sorry, I want to be clear, this is an impressive and cool
         | personal project. I hope it's a step on an exciting journey for
         | the person who wrote it. It just doesn't actually have enough
         | strengths to replace JSON - which would be a tall order for
         | _any_ new format.
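          | 
          | To make that concrete, a minimal sketch using only
          | Python's standard library (exact sizes will vary with the
          | data and the gzip level):
          | 
          |     import gzip, json
          | 
          |     # 1000 objects sharing the same keys: the typical JSON
          |     # API payload.
          |     rows = [{"name": f"user{i}", "email": f"user{i}@example.com",
          |              "age": 20 + i % 50} for i in range(1000)]
          |     raw = json.dumps(rows).encode()
          |     packed = gzip.compress(raw)
          |     # DEFLATE back-references the repeated keys, so the
          |     # compressed size is a small fraction of the raw size.
          |     print(len(raw), len(packed))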
        
       | [deleted]
        
       | dorongrinstein wrote:
       | Looks neat. I don't see a formal spec. Question: if I have two
       | optional fields of the same type and the first one isn't
       | provided, how does a parser know which field is provided? The
       | optional fields seem unclear to me.
        
       | cyberpsybin wrote:
       | No.
        
       | Gys wrote:
       | For me a JSON alternative should at the very least offer some
       | spec for adding comments anywhere.
        
         | random478101 wrote:
         | One of the variants that permit comments:
         | https://github.com/tailscale/hujson
        
       | random478101 wrote:
        | As far as I can see, "IO" addresses the size issue, which is
        | indeed a compression issue for the most part.
        | 
        | For a broader take on an alternative, there is Concise
        | Encoding [1][2], which I believe addresses a few more
       | issues with existing encodings (clear spec, schema not an
       | afterthought, native support for a variety of data structures,
       | security, ...).
       | 
       | [1] https://concise-encoding.org/ [2] The author gave a
       | presentation on it here:
       | https://www.youtube.com/watch?v=_dIHq4GJE14
        
         | wffurr wrote:
         | People keep saying "just use gzip and JSON is plenty small" but
         | gzip isn't free. It takes time and power to do all the
         | compression and decompression. The uncompressed size of the
         | data takes up memory on client and server.
         | 
         | A smaller data format requires less compression time and power
         | and you can fit more of it in memory at either end.
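          | 
          | That overhead is easy to measure. A rough sketch (absolute
          | numbers are machine-dependent; the point is that the cost
          | is nonzero and paid per message):
          | 
          |     import gzip, json, time
          | 
          |     payload = json.dumps(
          |         [{"id": i, "value": i * 2} for i in range(100_000)]).encode()
          | 
          |     t0 = time.perf_counter()
          |     packed = gzip.compress(payload)
          |     t1 = time.perf_counter()
          |     gzip.decompress(packed)
          |     t2 = time.perf_counter()
          |     print(f"compress {t1 - t0:.3f}s, decompress {t2 - t1:.3f}s")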
        
           | petre wrote:
            | There's MessagePack, CBOR, and FlatBuffers. All of them
            | are faster and smaller than any text-based format.
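            | 
            | For instance, a quick sketch with the third-party msgpack
            | package (assuming it is installed):
            | 
            |     import json
            |     import msgpack  # pip install msgpack
            | 
            |     obj = {"name": "John Doe", "age": 25, "active": True}
            |     packed = msgpack.packb(obj)
            |     # The binary encoding is smaller than the JSON text.
            |     print(len(json.dumps(obj).encode()), len(packed))
            |     assert msgpack.unpackb(packed) == obj  # lossless round trip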
        
       | nikeee wrote:
       | Since strings don't need to be quoted, what happens during
       | deserialization if you want the string "T"? Does this lead to the
       | equivalent of the Norway-Problem of YAML [0]?
       | 
       | Is the space between the key and the type necessary? If not, how
       | to distinguish between objects and types?
       | 
       | Does the validation offer some form of unions or mutual
       | exclusion?
       | 
       | [0]: https://hitchdev.com/strictyaml/why/implicit-typing-removed/
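        | 
        | For reference, the Norway problem reproduces directly with
        | PyYAML, which implements YAML 1.1 implicit typing:
        | 
        |     import yaml  # pip install pyyaml
        | 
        |     print(yaml.safe_load("country: NO"))    # {'country': False}
        |     print(yaml.safe_load("country: 'NO'"))  # {'country': 'NO'}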
        
         | lifthrasiir wrote:
         | It seems to be a typed CSV, so whether `T` is interpreted as a
         | string or a boolean presumably depends on the schema. That
         | sounds slightly better than YAML, though it can easily break
         | when you allow heterogeneous types (say, string or boolean).
        
           | petre wrote:
            | T is quite dumb. The author should have at least used #t
            | and #f from Scheme.
        
         | cookiengineer wrote:
         | YAML and its "Arrays" are really broken. The problem I see with
          | Internet Object is that it implies the same kind of
          | mechanism.
         | 
         | Every time I read about new formats, they seem to get either
         | the 1-n relations or the n-n relations implemented well, but
         | not both. I guess that's what's so hard about map/reduce...
         | 
          | Regarding YAML: somebody on HN mentioned their project DIXY
          | a couple of years ago, and it's much, much _much_ easier to
          | parse than YAML. [1] I'm using it instead of YAML pretty
          | much everywhere now.
         | 
         | [1] https://github.com/kuyawa/Dixy
        
           | colejohnson66 wrote:
           | I'll admit that YAML has its quirks, but a good syntax
           | highlighter can take care of that in my experience. What's
           | wrong with YAML's arrays?
        
             | cookiengineer wrote:
             | > What's wrong with YAML's arrays?
             | 
              | That there are multiple ways to define arrays: "- item",
              | "-\n\titem", "\titem" or "item, item", for starters.
              | Parsing YAML into arrays requires the surrounding
              | context.
             | 
             | Without the previous context, you cannot know what type of
             | data you're parsing when you are at a "-" at the beginning
             | of a line or a "," in the middle of a line.
             | 
             | This is just unnecessary parser complexity and human
             | ambiguity in my opinion.
             | 
             | As a question to you in case you disagree: What happens
             | when you write down an indented/nested "\t- name: John,
             | Doe"? It's pretty much unpredictable without the previously
             | parsed data structures or their history in YAML.
             | 
             | (I don't wanna start the discussion of "<<" and how it
             | influences the parsing context of YAML data structures. I
             | think the merge key also has no place in a data
             | serialization format.)
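              | 
              | A quick PyYAML session illustrates the context
              | dependence (the comma only separates items in flow
              | context):
              | 
              |     import yaml  # pip install pyyaml
              | 
              |     print(yaml.safe_load("- item1\n- item2"))   # ['item1', 'item2']
              |     print(yaml.safe_load("[item1, item2]"))     # same list, flow style
              |     print(yaml.safe_load("- name: John, Doe"))  # [{'name': 'John, Doe'}]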
        
           | cabalamat wrote:
           | > YAML and its "Arrays" are really broken.
           | 
           | Agreed. YAML does have some use cases. I find it useful when
           | I want to manually write lots of JSON data for test scripts.
            | But the format, because it tries to be concise, ends up
            | being hard to parse manually.
           | 
           | I don't consider YAML a good serialisation format.
        
           | BiteCode_dev wrote:
            | YAML has so many problems. Python 3.10 brought a new one
            | to my attention, when the core devs realized their arrays
            | of versions contained 3.1 twice and no 3.10. Indeed, if
            | you write unquoted ASCII, YAML gives you strings. Except
            | when it can cast them to a number, that is.
            | 
            | TOML is better, but it still has more gotchas than
            | necessary. So many that I find it easier to just edit a
            | Python file.
            | 
            | I'm thinking of giving CUE a try. Any feedback?
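            | 
            | The failure mode is easy to reproduce with PyYAML:
            | 
            |     import yaml  # pip install pyyaml
            | 
            |     print(yaml.safe_load("versions: [3.9, 3.10, 3.11]"))
            |     # {'versions': [3.9, 3.1, 3.11]} -- 3.10 became the float 3.1
            |     print(yaml.safe_load('versions: ["3.9", "3.10", "3.11"]'))
            |     # quoting keeps them as strings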
        
           | irq-1 wrote:
           | Dixy looks easy, but "There is only one simple rule. In Dixy,
           | everything is a dictionary [string:string]" isn't accurate or
           | helpful.
           | 
           | It's also [string:dictionary] and [string:?] where ? means
           | nil. White space matters, and tab is fixed at 4 spaces wide.
           | When creating text from a dictionary it adds "# Dixy 1.0\n\n"
           | which means loading and saving will change the file every
            | time! Not sure what other issues there are, but I noticed
            | this line:
            | 
            |     // TODO: if key is numeric, parse as Array
           | 
           | It does look simple though. It'd be nice if someone made
           | strict rules and addressed the corner cases.
        
           | 29athrowaway wrote:
           | The annoyance of YAML is the possibility of doing things in
           | different ways.
        
         | DemocracyFTW wrote:
         | The so-called "Norway Problem" of YAML is really the No-Way
         | Problem of YAML. /s
        
       | kesor wrote:
       | ffs please don't add yet another stupid standard. this looks like
       | a complicated version of csv, which is horrible, and this also
       | looks quite horrible.
        
       | account-5 wrote:
       | I've been looking at data serialisation formats recently.
       | 
        | - JSON
        | - TOML
        | - CSON
        | - INI
        | - ENO
        | - XML
       | 
       | I like CSV for tabular data obviously. This looks, as others have
       | mentioned, like CSV with better metadata.
       | 
       | I like INI for its simplicity. JSON is good for more complicated
       | data, but I have to say I like CSON.
        
       | jFriedensreich wrote:
        | When a project has more inspirational quotes than tech facts
        | and relation to prior art, that's often a red flag. Also,
        | JSON is inherently schemaless and non-binary; this is not a
        | flaw but critical for many use cases. If you want schemas,
        | there are many proven alternatives like Protobuf, Avro, Cap'n
        | Proto, and MessagePack.
        
       | typingmonkey wrote:
       | So the plain data is smaller because some information comes from
        | the schema instead of the object. Guess what: you can already
        | do the same with JSON [1].
       | 
       | [1] https://github.com/pubkey/jsonschema-key-compression
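        | 
        | The idea is roughly this (a toy sketch of the technique, not
        | the linked library's actual API):
        | 
        |     # Both sides derive a shared key table from the schema,
        |     # so the wire format only carries short aliases.
        |     KEYS = {"name": "a", "email": "b", "age": "c"}
        |     INV = {v: k for k, v in KEYS.items()}
        | 
        |     def shrink(obj):
        |         return {KEYS[k]: v for k, v in obj.items()}
        | 
        |     def grow(obj):
        |         return {INV[k]: v for k, v in obj.items()}
        | 
        |     row = {"name": "John Doe", "email": "jd@example.com", "age": 25}
        |     assert grow(shrink(row)) == row  # lossless given a shared table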
        
       | antihero wrote:
       | Is there much space saving after response compression?
        
       | Azsy wrote:
        | Whatever the pros and cons are here... what the ** does this
        | mean?
        | 
        | > Name, Email
        | 
        | > Remain updated, we'll email you when it is available.
        | 
        | Why do this? Should I take it that the format isn't ready? Is
        | there going to be a mailing list of format enthusiasts? Are
        | you planning on releasing a V2022 next year and every year
        | after? More use-case-specific derivatives?
       | 
       | All a format needs is 3 short examples, a language definition,
       | and a link to an implementation.
       | 
       | Everything else lowers my expectation and its appeal.
        
       | Hurtak wrote:
       | The whole thing seems to be dead. There is one blog post from
        | 2019 (https://internetobject.org/the-story/), and the Twitter
        | account was likewise only active in 2019
        | (https://twitter.com/InternetObject).
        
         | jdsampayo wrote:
         | Moderator should add (2019) to the title, as there has not been
         | any update.
        
       | wly_cdgr wrote:
       | Chuck Severance has a nice interview about JSON with Doug
       | Crockford where Crockford argues that one of the main reasons his
       | baby has been so successful is that it's unversioned. No new
        | versions, no new features, no bloat, no compatibility issues.
        
       | ishche wrote:
       | Are comments allowed in this format?
        
         | SV_BubbleTime wrote:
         | Good one, I'd also want to see hex format! It's just a pain to
         | show all integers as decimal in JSON.
        
         | galaxyLogic wrote:
         | I would vote for that feature.
         | 
          | Also, field names that don't contain whitespace shouldn't
          | need to be quoted.
        
       | Waterluvian wrote:
       | We are paying a cost in clarity, human editability, and further
       | splintering of formats.
       | 
       | Everything is a trade off. So what do we get in trade for those
       | rather large costs?
       | 
       | 40% bandwidth savings might be worth it. But what are the gzipped
       | comparisons?
        
       | 29athrowaway wrote:
       | It is less human readable than JSON.
       | 
        | Human readability is one of the most important aspects of JSON.
       | Without that requirement you could use a binary serialization.
        
       | mccanne wrote:
       | This is a very real problem being addressed here and I am
       | intrigued by all the great comments in this thread.
       | 
       | In the Zed project, we've been thinking about and iterating on a
       | better data model for serialization for a few years, and have
       | concluded that schemas kind of get in the way (e.g., the way
       | Parquet, Avro, and JSON Schema define a schema then have a set of
       | values that adhere to the schema). In Zed, a modern and fine-
       | grained type system allows for a structure that is a superset of
       | both the JSON and the relational models, where a schema is simply
       | a special case of the type system (i.e., a named record type).
       | 
       | If you're interested, you can check out the Zed formats here...
       | https://github.com/brimdata/zed/tree/main/docs/formats
        
         | mccanne wrote:
         | Also, if any of you find problems with the Zed spec(s), we'd
         | love to hear about them. "Now" would be a good time to make
         | changes / fix flaws.
        
           | petre wrote:
            | I'd like to see more examples, and preferably some real
            | data serialized as Zed.
        
             | mccanne wrote:
             | There are a few examples in the ZSON spec...
             | 
              | https://github.com/brimdata/zed/blob/main/docs/formats/zson...
             | 
             | And you can easily see whatever data you'd like formatted
             | as ZSON using the "zq" CLI tool, but I just made this gist
              | (with some data from the brimdata/zed-sample-data repo)
              | so you can have a quick look (the bstring stuff is a
              | little noisy and an artifact of the data source being
              | Zeek)...
              | https://gist.github.com/mccanne/94865d557ca3de8abfd3eb09e8ac...
        
       | beardyw wrote:
       | > age:{int, min:20}, address: {street, city, state}
       | 
        | Unless the space after the colon is significant, it seems we
        | just have to "know" that int introduces a type definition
        | rather than a structure.
       | 
       | Also
       | 
        | > Schema Details: JSON doesn't have built-in schema support!
       | 
        | seems a little disingenuous. JSON identifies the type of each
        | value, so there is mostly no need for the schema when viewing
        | the data. And there is a JSON Schema definition.
        
         | kabes wrote:
         | Yeah, this format looks really badly designed
        
       | mofosyne wrote:
       | Hope it supports semantic tagging like in CBOR
        
         | SV_BubbleTime wrote:
         | Are you using your own tags in CBOR? What is the use case?
         | 
          | I figured that because I need to describe the tag anyway,
          | it was just as easy to skip tags and describe the elements
          | that would make one up.
        
       | flqn wrote:
       | I'm sceptical about the value proposition of this without seeing
       | much more than a simple example that offers little over existing
       | hypermedia+json/csv practices.
       | 
       | If a compact columnar representation is what you're after to
       | avoid having to repeat every field name in an array of objects
       | (which CSV is good for) but you don't want to give up the ability
          | to include metadata in your JSON, there are a ton of
          | different ways to structure your document to solve this
          | issue without inventing new document formats.
       | 
          | Also, this example is unclear (possibly ambiguous?): how is
          | "int" as a type for the "age" column distinguished from
          | "street", "city", etc., which I assume are field names?
        
         | samhw wrote:
         | > If a compact columnar representation is what you're after to
         | avoid having to repeat every field name in an array of objects
         | (which CSV is good for)
         | 
         | Plus, as I wrote elsewhere, gzipping your JSON will result in
         | essentially "avoiding having to repeat every field name" by
         | dictionary coding it. The only case in which that wouldn't be
         | true is when dealing with extremely unusual and heteromorphic
         | data, but then this format doesn't seem to support such data
         | _at all_.
         | 
         | I'm also mystified that the author claims this is readable. It
         | looks eminently _unreadable_ compared with JSON, if you have
         | anything beyond one row of very simple data with all optional
          | fields present. And, in that case, it's basically just
          | 'JSON with the keys on a different row'.
         | 
          | (Congrats to the author, but this is more of a fun personal
          | project than something to seriously present as a 'JSON
          | killer'. If you _do_ present it as a JSON killer, then you
          | have to expect a rigorous review.)
        
           | fstrthnscnd wrote:
           | > Plus, as I wrote elsewhere, gzipping your JSON will result
           | in essentially "avoiding having to repeat every field name"
           | by dictionary coding it.
           | 
            | Gzipping indeed claws back most of the space taken by the
            | field names, but a parser will still have to parse those
            | strings. On a large document, this might have a
            | performance impact.
            | 
            | One upside of having the field names, however, is that
            | they can be reordered ad lib.
        
         | tomrod wrote:
         | I agree. CSV + Metadata/field types (which JSON can handle)
         | plus zipping (dictionary coding) takes care of, what, 99.9999%
         | of the issues folks have with one type or the other?
        
         | Someone wrote:
          | Looked for a spec, but couldn't find it, so here's a
          | _guess_: there's significant whitespace between the colon
          | and the opening brace:
          | 
          |     age:{int, min:20}, address: {street, city, state}
         | 
          | Alternatively, there may be a set of forbidden field names,
          | including _bool_, _int_ and _string_.
         | 
         | Of these two, I like neither, but would opt for the latter.
         | 
          | I also considered that _min:20_ implied the preceding token
          | had to be a type, but I don't see how that's consistent with
         | active?:bool
         | 
         | and                 tags?:[string]
        
       | snidane wrote:
        | JSON is a good format for representing the results of
        | aggregation queries (GROUP BY in SQL) using nesting, and for
        | storing the data in a single file.
       | 
        | Without that, you would need to either:
        | 
        |     1. Store multiple non-nested (tabular, e.g. CSV) files
        |        and join them at the time of use.
        |     2. Denormalize all these CSVs into a single big CSV,
        |        duplicating the same values over and over. Compression
        |        should handle this at storage time, but you still pay
        |        the cost when reading.
        |     3. Store values by columns, not by rows, adding various
        |        RLE and dict encodings to compress repeated values in
        |        columns, making the files not human friendly.
        |     4. Once you store it in columns and make it unreadable,
        |        just store it as binary instead of text. You get
        |        Parquet.
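        | 
        | For instance, the nested form stores each group key once,
        | while the flat form from option 2 repeats it per row (a
        | minimal sketch):
        | 
        |     import json
        | 
        |     nested = {"dept": "Sales", "employees": [
        |         {"name": "Ann", "age": 25}, {"name": "Bob", "age": 30}]}
        |     flat = [{"dept": "Sales", "name": "Ann", "age": 25},
        |             {"dept": "Sales", "name": "Bob", "age": 30}]  # dept repeats
        |     print(json.dumps(nested, indent=2))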
       | 
        | JSON and CSV are simple, and for that reason they won and
        | will stay with us no matter how hard you try to add features
        | to them.
       | 
        | That said, I think adding trailing commas and comments to
        | JSON wouldn't be a big stretch.
       | 
        | The battle will be for the best columnar binary format.
        | Parquet is the closest to a standard, but it seems to be used
        | only as a storage standard; big data systems still decompress
        | it and work with their own representation. The holy grail is
        | a columnar format good enough that big data systems use it as
        | their underlying data representation instead of coming up
        | with their own. I suspect such a format will come from
        | something like an open-sourced Snowflake, ClickHouse, or
        | Chaossearch (systems with battle-tested, performant
        | algorithms), rather than from a committee design like
        | Parquet.
        
         | liuliu wrote:
         | You mean, Apache Arrow?
        
           | snidane wrote:
           | Partially.
           | 
            | The problem with Apache Arrow and Parquet is that you
            | have two formats, one for storage and one for
            | computation, but in the end you only want one for both.
            | You want to run fast algorithms on memory-mapped
            | compressed columns, not do this stupid deserialization
            | from Parquet to Arrow.
           | 
            | Parquet and Arrow are designed by committee and try to
            | accomplish too much. While that's good for some cases, my
            | prediction is that there will exist a data processing
            | system in the future whose file format supports this and
            | is good enough for most data-intensive applications. It
            | will not be feature complete, but, like JSON, it will be
            | good enough. Some devs will complain about adding this or
            | that feature to the format, but the majority will be as
            | happy as they are now with JSON. Such a format can only
            | come from industry, not from a committee.
        
             | liuliu wrote:
              | Right. That's why I am more interested in Arrow than
              | Parquet. Going from a pure compressed storage format to
              | incorporating computation would be more difficult than
              | going from a memory-mapped / computation format to
              | long-term storage. Arrow has already made some good
              | choices regarding data exchange over the wire; these
              | are translatable to data exchange over time.
             | 
              | Of course, I am only dealing with a few hundred GiB of
              | data; I'm not sure whether Arrow fails at larger scale.
        
         | throwaway81523 wrote:
         | > That said I think adding a trailing comma and comments to
         | json wouldn't be a big stretch.
         | 
          | Sadly, JSON's designers suffered from the same hubris as
          | the designers of Markdown and Gemini when they decided not
          | to include a version number in the file format. So you are
          | kind of hosed if you want to make a change like that.
         | 
          | Before JSON there was XML (ugh), but before XML there were
          | Lisp S-expressions, which seem to have handled all these
          | issues perfectly well 50 years ago. Yet we keep reinventing
          | them. Greenspun's tenth rule is still with us.
        
           | snidane wrote:
            | It's just a matter of parser implementation. These
            | changes are backwards compatible. If Python decided to
            | add support for comments and trailing commas in
            | json.loads, that would become the new standard, at least
            | for data scientists if not for web devs. All the other
            | parsers would then follow.
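            | 
            | A toy sketch of that kind of lenient loading (naive: a
            | real implementation must tokenize so it doesn't touch
            | "//" or "," inside string literals):
            | 
            |     import json, re
            | 
            |     def loads_relaxed(text):
            |         text = re.sub(r"//[^\n]*", "", text)        # strip // comments
            |         text = re.sub(r",(\s*[}\]])", r"\1", text)  # drop trailing commas
            |         return json.loads(text)
            | 
            |     print(loads_relaxed('{"a": 1, // note\n "b": [1, 2,]}'))
            |     # {'a': 1, 'b': [1, 2]}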
        
             | throwaway81523 wrote:
             | Now whatever generates your data has to know what parser is
             | going to read the data. The parser can't tell right away
             | whether the data has those trailing commas. They are
             | optional, so they might not start appearing until after
             | gigabytes of output have gone by. So you can't count on a
             | quick error message in the event of a version mismatch.
        
               | samhw wrote:
               | If you have gigabytes of handwritten JSON (if it's not
               | handwritten, trailing vs non-trailing commas surely don't
               | matter), then I feel like you're doing something wrong.
               | 
               | Though I'm sure someone's going to step in and say "Have
               | you not heard of [stupendously niche use case]? Are you
               | living under a rock!?" etc etc ;)
        
       | HKH2 wrote:
       | What advantages do commas have over semicolons?
        
       | [deleted]
        
       | charles_f wrote:
        | a) Why would you want to remove the field names? This makes
        | it so much harder to debug and very brittle, since now you're
        | dependent on the order of fields. No mention of how you
        | handle versioning, either. Back to CSV.
       | 
       | > However, this time, something felt wrong; I realized that with
       | the JSON, we were exchanging a huge amount of unnecessary
       | information to and from the server
       | 
        | b) Text size really ain't an issue given that we're typically
        | talking about just a few KB on gzipped protocols over
        | hundreds-of-Mbps connections. Compactness sounds like a bad
        | argument to me.
       | 
       | c) "json doesn't have schema built in is a really dubious
       | argument". If you want schemas you can still get them using json-
       | schema, and if you don't you can still understand the message
       | using the field names, which makes for a degraded schema ; which
       | doesn't exist in the case of internet objects. If you don't have
       | the schema, go figure what's in there
       | 
        | What really settles it for me is the comparison at the bottom
        | between Internet Object and JSON; JSON looks better to me.
        | 
        | Looks like an idea executed on a bad premise.
        
       | dang wrote:
       | A couple small past threads:
       | 
       |  _JSON Alternative - Internet Object_ -
       | https://news.ycombinator.com/item?id=21220405 - Oct 2019 (12
       | comments)
       | 
       |  _Show HN: Internet Object - a thin, robust and schema oriented
       | JSON alternative_ - https://news.ycombinator.com/item?id=20982180
       | - Sept 2019 (8 comments)
        
       ___________________________________________________________________
       (page generated 2021-10-24 23:02 UTC)