[HN Gopher] Internet Object - A JSON alternative data serializat...
___________________________________________________________________
Internet Object - A JSON alternative data serialization format
Author : Starz0r
Score : 76 points
Date : 2021-10-24 12:39 UTC (10 hours ago)
(HTM) web link (internetobject.org)
(TXT) w3m dump (internetobject.org)
| kabes wrote:
| The example schema has: > age:{int, min:20}
|
| Why would a data serialization format bother with data validation
| like the minimum value here?
| williamtwild wrote:
| front end validation perhaps?
| kabes wrote:
| Shouldn't be a concern of a serialization library
| SV_BubbleTime wrote:
| It's part of the schema part, not the serialization part
| right? I don't disagree you parse then validate, but in a
| schema that defines type of data, it's not unreasonable to
| put limits on values.
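| To illustrate the split, a minimal parse-then-validate sketch in
| Python, using the third-party jsonschema library (not anything
| Internet Object itself ships):
|
|     import json
|     from jsonschema import ValidationError, validate
|
|     # Parse first (serialization concern)...
|     record = json.loads('{"name": "Alice", "age": 19}')
|
|     # ...then validate against a schema carrying value limits
|     # (validation concern).
|     schema = {
|         "type": "object",
|         "properties": {"age": {"type": "integer", "minimum": 20}},
|     }
|     try:
|         validate(instance=record, schema=schema)
|     except ValidationError as err:
|         print("rejected:", err.message)
|         # rejected: 19 is less than the minimum of 20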
| tom_ wrote:
| Where is the spec? Why are there spaces after the commas? Why
| does the example not include a string with commas in it?
| deadfish wrote:
| I see a benchmark for the data size... But as other comments
| have suggested, gzip should remove the majority of that
| difference.
|
| I'd be more interested to know about serialisation and
| deserialisation time.
| hiyer wrote:
| 60% savings won't really count for much when the traffic is
| compressed, which is the case for most of JSON's uses. For real
| savings I think you'd have to go with a binary format like
| protobuf or thrift.
|
| Edit: 50 -> 60
| only_as_i_fall wrote:
| If you follow that link which says "Read the Story Here" they
| have this json example which has a list of employees and then
| info about the pagination of that list. The caption is this
|
| >If you look closely, this JSON document mixes the data employees
| with other non-data keys (headers) such as count, currentPage,
| and pageSize in the same response.
|
| But they don't explain at all how changing the data format fixes
| the underlying issue of mixed concerns in one data object.
| barelysapient wrote:
| They lost me when they declared an address type.
|
| Addresses are so varied in implementation and meaning that it's
| frankly ridiculous.
| codeulike wrote:
| How are commas or speech marks in strings escaped?
| __m wrote:
| How do you have an array with different types of objects? You
| either have to repeat the schema or have to reference the schema.
| danfritz wrote:
| Looks like CSV with a schema description. Gzip the json and the
| biggest selling point is gone.
| samhw wrote:
| Precisely. Many people don't realise how exceptionally well
| JSON compresses. Provided you're using it the way most do, to
| send arrays of objects which share the same set of keys (or
| some subset thereof), then all the keys will end up dictionary-
| coded away, thus totally eroding the space advantage that this
| format notionally has.
|
| Plus JSON's exceptionally wide support means you can benefit
| from SIMD-assisted decoders which will absolutely blow this out
| of the water - and much, much more besides. I wish people would
| devote their time to something more useful than 'yet another
| competing standard'.
|
| Edit: Sorry, I want to be clear, this is an impressive and cool
| personal project. I hope it's a step on an exciting journey for
| the person who wrote it. It just doesn't actually have enough
| strengths to replace JSON - which would be a tall order for
| _any_ new format.
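| For instance, a self-contained sketch of the effect (exact sizes
| will vary with the data and gzip level):
|
|     import gzip, json
|
|     # 1,000 records sharing the same keys: the repeated key
|     # strings are exactly what DEFLATE's back-references
|     # squeeze out.
|     rows = [{"firstName": "A", "lastName": "B", "age": i % 90}
|             for i in range(1000)]
|     raw = json.dumps(rows).encode()
|     print(len(raw), "->", len(gzip.compress(raw)))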
| [deleted]
| dorongrinstein wrote:
| Looks neat. I don't see a formal spec. Question: if I have two
| optional fields of the same type and the first one isn't
| provided, how does a parser know which field is provided? The
| optional fields seem unclear to me.
| cyberpsybin wrote:
| No.
| Gys wrote:
| For me a JSON alternative should at the very least offer some
| spec for adding comments anywhere.
| random478101 wrote:
| One of the variants that permit comments:
| https://github.com/tailscale/hujson
| random478101 wrote:
| As far as I can see, "IO" addresses the size issue, which is indeed a
| compression issue for the most part.
|
| For a broader take on an alternative, there is Concise Encoding
| [1][2], which I believe addresses a few more
| issues with existing encodings (clear spec, schema not an
| afterthought, native support for a variety of data structures,
| security, ...).
|
| [1] https://concise-encoding.org/ [2] The author gave a
| presentation on it here:
| https://www.youtube.com/watch?v=_dIHq4GJE14
| wffurr wrote:
| People keep saying "just use gzip and JSON is plenty small" but
| gzip isn't free. It takes time and power to do all the
| compression and decompression. The uncompressed size of the
| data takes up memory on client and server.
|
| A smaller data format requires less compression time and power
| and you can fit more of it in memory at either end.
| petre wrote:
| There's Messagepack and CBOR and Flat Buffers. All of them
| are faster and smaller than any text based format.
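| For example, with the Python msgpack bindings (byte counts are
| for this particular record, just to show the direction):
|
|     import json
|     import msgpack  # pip install msgpack
|
|     row = {"name": "Alice", "age": 30, "active": True}
|     print(len(json.dumps(row).encode()))  # 44 bytes of text
|     print(len(msgpack.packb(row)))        # 25 bytes of binary
|     assert msgpack.unpackb(msgpack.packb(row)) == row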
| nikeee wrote:
| Since strings don't need to be quoted, what happens during
| deserialization if you want the string "T"? Does this lead to the
| equivalent of the Norway-Problem of YAML [0]?
|
| Is the space between the key and the type necessary? If not, how
| to distinguish between objects and types?
|
| Does the validation offer some form of unions or mutual
| exclusion?
|
| [0]: https://hitchdev.com/strictyaml/why/implicit-typing-removed/
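| For reference, [0]'s problem as PyYAML (which implements YAML
| 1.1 implicit typing) reproduces it:
|
|     import yaml
|
|     # Unquoted scalars are silently resolved to other types:
|     print(yaml.safe_load("country: NO"))    # {'country': False}
|     print(yaml.safe_load("country: 'NO'"))  # {'country': 'NO'}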
| lifthrasiir wrote:
| It seems to be a typed CSV, so whether `T` is interpreted as a
| string or a boolean presumably depends on the schema. That
| sounds slightly better than YAML, though it can easily break
| when you allow heterogeneous types (say, string or boolean).
| petre wrote:
| T is quite dumb. The author should have at least used #t and
| #f from Scheme.
| cookiengineer wrote:
| YAML and its "Arrays" are really broken. The problem I see with
| Internet Object is that it's also implying this kind of
| mechanism.
|
| Every time I read about new formats, they seem to get either
| the 1-n relations or the n-n relations implemented well, but
| not both. I guess that's what's so hard about map/reduce...
|
| Regarding YAML: somebody on HN mentioned his project DIXY a
| couple years ago, and it's much much _much_ easier to parse
| than YAML. [1] I'm using this over YAML pretty much everywhere
| now.
|
| [1] https://github.com/kuyawa/Dixy
| colejohnson66 wrote:
| I'll admit that YAML has its quirks, but a good syntax
| highlighter can take care of that in my experience. What's
| wrong with YAML's arrays?
| cookiengineer wrote:
| > What's wrong with YAML's arrays?
|
| That there are multiple ways to define Arrays: "- item",
| "-\n\titem", "\titem" or "item, item" for starters. Parsing
| YAML into Arrays requires context of its surroundings.
|
| Without the previous context, you cannot know what type of
| data you're parsing when you are at a "-" at the beginning
| of a line or a "," in the middle of a line.
|
| This is just unnecessary parser complexity and human
| ambiguity in my opinion.
|
| As a question to you in case you disagree: What happens
| when you write down an indented/nested "\t- name: John,
| Doe"? It's pretty much unpredictable without the previously
| parsed data structures or their history in YAML.
|
| (I don't wanna start the discussion of "<<" and how it
| influences the parsing context of YAML data structures. I
| think the merge key also has no place in a data
| serialization format.)
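| A small PyYAML session showing the context-dependence:
|
|     import yaml
|
|     # Two spellings of the same array:
|     print(yaml.safe_load("items:\n  - a\n  - b"))  # block style
|     print(yaml.safe_load("items: [a, b]"))         # flow style
|
|     # The comma's meaning depends on context:
|     print(yaml.safe_load("- name: John, Doe"))
|     # -> [{'name': 'John, Doe'}]  (comma kept in the scalar)
|     print(yaml.safe_load("[name: John, Doe]"))
|     # -> [{'name': 'John'}, 'Doe']  (comma splits the sequence)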
| cabalamat wrote:
| > YAML and its "Arrays" are really broken.
|
| Agreed. YAML does have some use cases. I find it useful when
| I want to manually write lots of JSON data for test scripts.
| But the format, because it tries to be concise, ends up being
| hard to parse manually.
|
| I don't consider YAML a good serialisation format.
| BiteCode_dev wrote:
| Yaml has so many problems. Python 3.10 brought a new one to my
| attention, when the core devs realized their arrays of versions
| contained 3.1 twice and no 3.10. Indeed, if you write unquoted
| ascii, yaml gives you strings. Except when it can cast it to a
| number, that is.
|
| TOML is better, but it still has more gotchas than necessary. So
| much so that I find it easier to just edit a python file.
|
| I'm thinking of giving cue a try. Any feedback?
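| The 3.10 gotcha above is easy to reproduce with PyYAML:
|
|     import yaml
|
|     # Unquoted 3.10 resolves as a float and collapses into 3.1:
|     print(yaml.safe_load("versions: [3.9, 3.10, '3.10']"))
|     # {'versions': [3.9, 3.1, '3.10']}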
| irq-1 wrote:
| Dixy looks easy, but "There is only one simple rule. In Dixy,
| everything is a dictionary [string:string]" isn't accurate or
| helpful.
|
| It's also [string:dictionary] and [string:?] where ? means
| nil. White space matters, and tab is fixed at 4 spaces wide.
| When creating text from a dictionary it adds "# Dixy 1.0\n\n",
| which means loading and saving will change the file every
| time! Not sure what other issues there are, but I noticed
| this line:
|
|     // TODO: if key is numeric, parse as Array
|
| It does look simple though. It'd be nice if someone made
| strict rules and addressed the corner cases.
| 29athrowaway wrote:
| The annoyance of YAML is the possibility of doing things in
| different ways.
| DemocracyFTW wrote:
| The so-called "Norway Problem" of YAML is really the No-Way
| Problem of YAML. /s
| kesor wrote:
| ffs please don't add yet another stupid standard. this looks like
| a complicated version of csv, which is horrible, and this also
| looks quite horrible.
| account-5 wrote:
| I've been looking at data serialisation formats recently:
|
|     - JSON
|     - TOML
|     - CSON
|     - INI
|     - ENO
|     - XML
|
| I like CSV for tabular data obviously. This looks, as others have
| mentioned, like CSV with better metadata.
|
| I like INI for its simplicity. JSON is good for more complicated
| data, but I have to say I like CSON.
| jFriedensreich wrote:
| When a project has more inspirational quotes than tech facts and
| relation to prior art, that's often a red flag. Also, json is
| inherently schemaless and non-binary; this is not a flaw but
| critical for many use cases. If you want schemas there are many
| proven alternatives like Protobuf, Avro, Cap'n Proto, and
| MessagePack.
| typingmonkey wrote:
| So the plain data is smaller because some information comes from
| the schema instead of the object. Guess what, you can do the same
| with json already [1]
|
| [1] https://github.com/pubkey/jsonschema-key-compression
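| The idea in a few lines of Python (a hypothetical sketch of the
| technique, not the linked library's actual API):
|
|     import json
|
|     # Both sides derive the key table from the shared schema,
|     # so the wire format only carries values.
|     KEYS = ["firstName", "lastName", "age"]
|
|     def compress(record):
|         return [record.get(k) for k in KEYS]
|
|     def decompress(values):
|         return dict(zip(KEYS, values))
|
|     wire = json.dumps(compress(
|         {"firstName": "Alice", "lastName": "Smith", "age": 30}))
|     print(wire)                          # ["Alice", "Smith", 30]
|     print(decompress(json.loads(wire)))  # original record back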
| antihero wrote:
| Is there much space saving after response compression?
| Azsy wrote:
| Whatever the pros and cons are here... what the ** does this
| mean?
|
| > Name, Email
|
| > Remain updated, we'll email you when it is available.
|
| Why do this? Should I read this as meaning the format isn't
| ready? Is there going to be a mailing list of format
| enthusiasts? Are you planning on releasing a V2022 next year and
| every year after? More use-case specific derivatives?
|
| All a format needs is 3 short examples, a language definition,
| and a link to an implementation.
|
| Everything else lowers my expectation and its appeal.
| Hurtak wrote:
| The whole thing seems to be dead. There is one blog post from
| 2019 (https://internetobject.org/the-story/) and the Twitter
| account also was active only in 2019
| (https://twitter.com/InternetObject).
| jdsampayo wrote:
| Moderator should add (2019) to the title, as there has not been
| any update.
| wly_cdgr wrote:
| Chuck Severance has a nice interview about JSON with Doug
| Crockford where Crockford argues that one of the main reasons his
| baby has been so successful is that it's unversioned. No new
| versions, no new features, no bloat, no compatibility issues
| ishche wrote:
| Are comments allowed in this format?
| SV_BubbleTime wrote:
| Good one, I'd also want to see hex format! It's just a pain to
| show all integers as decimal in JSON.
| galaxyLogic wrote:
| I would vote for that feature.
|
| Also field-names which don't contain whitespace should not need
| to be quoted.
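| One workaround today: carry hex values as strings and convert at
| the edges. A rough sketch (the field name and the "0x" string
| convention are just for illustration):
|
|     import json
|
|     def hexify(obj):
|         return {k: (f"0x{v:X}" if isinstance(v, int)
|                     and not isinstance(v, bool) else v)
|                 for k, v in obj.items()}
|
|     def dehexify(obj):
|         return {k: (int(v, 16) if isinstance(v, str)
|                     and v.startswith("0x") else v)
|                 for k, v in obj.items()}
|
|     wire = json.dumps(hexify({"flags": 0xDEADBEEF}))
|     print(wire)  # {"flags": "0xDEADBEEF"}
|     print(json.loads(wire, object_hook=dehexify))
|     # {'flags': 3735928559}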
| Waterluvian wrote:
| We are paying a cost in clarity, human editability, and further
| splintering of formats.
|
| Everything is a trade off. So what do we get in trade for those
| rather large costs?
|
| 40% bandwidth savings might be worth it. But what are the gzipped
| comparisons?
| 29athrowaway wrote:
| It is less human readable than JSON.
|
| Human readability is one of the most important aspects of JSON.
| Without that requirement you could use a binary serialization.
| mccanne wrote:
| This is a very real problem being addressed here and I am
| intrigued by all the great comments in this thread.
|
| In the Zed project, we've been thinking about and iterating on a
| better data model for serialization for a few years, and have
| concluded that schemas kind of get in the way (e.g., the way
| Parquet, Avro, and JSON Schema define a schema then have a set of
| values that adhere to the schema). In Zed, a modern and fine-
| grained type system allows for a structure that is a superset of
| both the JSON and the relational models, where a schema is simply
| a special case of the type system (i.e., a named record type).
|
| If you're interested, you can check out the Zed formats here...
| https://github.com/brimdata/zed/tree/main/docs/formats
| mccanne wrote:
| Also, if any of you find problems with the Zed spec(s), we'd
| love to hear about them. "Now" would be a good time to make
| changes / fix flaws.
| petre wrote:
| I'd like to see more examples and probably data serialized as
| zed.
| mccanne wrote:
| There are a few examples in the ZSON spec...
|
| https://github.com/brimdata/zed/blob/main/docs/formats/zson...
|
| And you can easily see whatever data you'd like formatted
| as ZSON using the "zq" CLI tool, but I just made this gist
| (with some data from the brimdata/zed-sample-data repo)
| so you can have a quick look (the bstring stuff is a little
| noisy and an artifact of the data source being Zeek)...
| https://gist.github.com/mccanne/94865d557ca3de8abfd3eb09e8ac...
| beardyw wrote:
| > age:{int, min:20}, address: {street, city, state}
|
| Unless the space after the colon is significant it seems we have
| to just "know" that int introduces a type definition instead of a
| structure.
|
| Also
|
| > Schema Details: JSON doesn't have built-in schema support!
|
| seems a little disingenuous. JSON provides a name for each type
| of value, so there is mostly no need for the schema when viewing
| the data. There is a JSON Schema definition.
| kabes wrote:
| Yeah, this format looks really badly designed
| mofosyne wrote:
| Hope it supports semantic tagging like in CBOR
| SV_BubbleTime wrote:
| Are you using your own tags in CBOR? What is the use case?
|
| I figured that because I need to describe the tag, it was just
| as easy to not use tags and describe the elements that would
| make one up.
| flqn wrote:
| I'm sceptical about the value proposition of this without seeing
| much more than a simple example that offers little over existing
| hypermedia+json/csv practices.
|
| If a compact columnar representation is what you're after to
| avoid having to repeat every field name in an array of objects
| (which CSV is good for) but you don't want to give up the ability
| to include metadata in your JSON, there are a ton of different
| ways to structure your document to solve this issue without
| inventing new document formats.
|
| Also this example is unclear (possibly ambiguous?); how is "int"
| as a type for the "age" column distinguished from "street",
| "city", etc as what I assume are field names?
| samhw wrote:
| > If a compact columnar representation is what you're after to
| avoid having to repeat every field name in an array of objects
| (which CSV is good for)
|
| Plus, as I wrote elsewhere, gzipping your JSON will result in
| essentially "avoiding having to repeat every field name" by
| dictionary coding it. The only case in which that wouldn't be
| true is when dealing with extremely unusual and heteromorphic
| data, but then this format doesn't seem to support such data
| _at all_.
|
| I'm also mystified that the author claims this is readable. It
| looks eminently _unreadable_ compared with JSON, if you have
| anything beyond one row of very simple data with all optional
| fields present. And, in that case, it's basically just 'JSON
| with the keys on a different row'.
|
| (Congrats to the author, but this is more of a fun personal
| project than something to seriously present as a 'JSON
| killer'. If you _do_ present it as a JSON killer, then you have
| to expect a rigorous review.)
| fstrthnscnd wrote:
| > Plus, as I wrote elsewhere, gzipping your JSON will result
| in essentially "avoiding having to repeat every field name"
| by dictionary coding it.
|
| Gzipping indeed recovers most of the space taken by the field
| names, but a parser will still have to parse these strings. On
| a large document, this might have a performance impact.
|
| One good side of having the field names, however, is that one
| can reorder them ad lib.
| tomrod wrote:
| I agree. CSV + Metadata/field types (which JSON can handle)
| plus zipping (dictionary coding) takes care of, what, 99.9999%
| of the issues folks have with one type or the other?
| Someone wrote:
| Looked for a spec, but couldn't find it, so here's a _guess_:
| there's significant whitespace between the colon and the
| opening brace:
|
|     age:{int, min:20}, address: {street, city, state}
|
| Alternatively, there may be a set of forbidden field names,
| including _bool_ , _int_ and _string_.
|
| Of these two, I like neither, but would opt for the latter.
|
| I also considered that _min:20_ implied the previous had to be
| a type, but I don't see how that's consistent with
| active?:bool
|
| and tags?:[string]
| snidane wrote:
| Json is a good format to represent results of aggregation queries
| (group by in sql) using nesting and storing data in a single
| file.
|
| Without that you would need to either (see the sketch after this
| list):
|
|     1. store multiple non-nested (tabular, e.g. csv) files and
|        join them at the time of use.
|
|     2. denormalize all these csvs into a single big csv,
|        duplicating the same values over and over. Compression
|        should handle this at storage time, but you still pay the
|        cost when reading.
|
|     3. store values by columns, not by rows, adding various RLE
|        and dict encodings to compress repeated values in columns,
|        making the files not human friendly.
|
|     4. once you store it in columns and make it unreadable, just
|        store it as binary instead of text. You get parquet.
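| The sketch mentioned above, i.e. what nesting buys you
| (illustrative data):
|
|     # Nest the 1-n relation instead of denormalizing it:
|     grouped = {
|         "customers": [
|             {"name": "Alice",
|              "orders": [{"id": 1, "total": 9.5},
|                         {"id": 2, "total": 3.0}]},
|             {"name": "Bob",
|              "orders": [{"id": 3, "total": 7.25}]},
|         ]
|     }
|     # The flat csv equivalent repeats "Alice"/"Bob" on every
|     # order row (option 2 above).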
|
| Json and csv are simple, and for that reason they won and will
| stay with us no matter how hard you try to add features to them.
|
| That said I think adding a trailing comma and comments to json
| wouldn't be a big stretch.
|
| The battle will be for the best columnar binary format. Parquet
| is the closest to a standard, but it seems to be used only as a
| storage standard. Big data systems still uncompress it and
| work with their own representation. The holy grail is when you
| get a columnar format which is good enough that big data systems
| use it as their underlying data representation instead of coming
| up with their own. I suspect such a format will come from
| something like an open-sourced Snowflake, Clickhouse or
| Chaossearch, which has battle tested performant algorithms,
| instead of being designed by committee, like parquet.
| liuliu wrote:
| You mean, Apache Arrow?
| snidane wrote:
| Partially.
|
| The problem with Apache Arrow and Parquet is that you have
| two - one for storage and one for computation - but in the
| end you only want one for both. You want to run fast
| algorithms on memory mapped compressed columns. Not doing
| this stupid deserialization from parquet to arrow.
|
| Parquet and arrow are designed by committee and try to
| accomplish too much for that matter. While that's good for
| some cases, my prediction is that there will exist a data
| processing system in the future whose file format will
| support that and be good enough for most data intensive
| applications. It will not be feature complete, like json, but
| will be good enough. Some devs from then on will complain
| about adding this and that feature to that format, but
| majority will be happy as they are now with json. Such format
| can only come from industry, not from a committee.
| liuliu wrote:
| Right. That's why I am more interested in arrow than
| parquet. Going from a pure compressed storage format to
| incorporate computation would be more difficult than going
| from memory-mapped / computation format to long-term
| storage. Arrow already made some good choices regarding
| data exchange over the wire; these are translatable to data
| exchange over time.
|
| Of course, I am only dealing with a few hundreds GiB data,
| not sure at larger scale whether arrow fails.
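| For the curious, the two-format round trip sketched with
| pyarrow (toy-sized data):
|
|     import pyarrow as pa
|     import pyarrow.ipc
|     import pyarrow.parquet as pq
|
|     table = pa.table({"user": ["alice", "bob"],
|                       "bytes": [512, 2048]})
|
|     # Parquet: compact on disk, but must be decoded into Arrow
|     # (or some other in-memory layout) before computing on it.
|     pq.write_table(table, "data.parquet")
|     decoded = pq.read_table("data.parquet")
|
|     # Arrow IPC file: memory-mappable, read with near-zero
|     # copies.
|     with pa.ipc.new_file("data.arrow", table.schema) as writer:
|         writer.write_table(table)
|     mapped = pa.ipc.open_file(
|         pa.memory_map("data.arrow")).read_all()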
| throwaway81523 wrote:
| > That said I think adding a trailing comma and comments to
| json wouldn't be a big stretch.
|
| Sadly, json's designers suffered from the same hubris as the
| designers of markdown and gemini, when they decided to not
| include a version number in the file format. So you are kind of
| hosed if you want to make a change like that.
|
| Before json there was xml (ugh), but before xml there were Lisp
| S-expressions, which seem to have handled all these issues
| perfectly well 50 years ago. Yet we keep re-inventing them.
| Greenspun's tenth rule is still with us.
| snidane wrote:
| It's just a matter of parser implementation. These changes
| are backwards compatible. If python decided to add support
| for comments and trailing commas in json.loads, that would
| become the new standard, at least for data scientists, not
| for web devs. All the other ones would then follow.
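| A naive sketch of such a lenient wrapper (the regexes don't
| respect string literals, so this is illustration only):
|
|     import json, re
|
|     def loads_lenient(text):
|         text = re.sub(r"//[^\n]*", "", text)        # comments
|         text = re.sub(r",\s*([}\]])", r"\1", text)  # trailing ,
|         return json.loads(text)
|
|     print(loads_lenient('{"a": 1,  // a comment\n "b": [2, 3,],}'))
|     # {'a': 1, 'b': [2, 3]}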
| throwaway81523 wrote:
| Now whatever generates your data has to know what parser is
| going to read the data. The parser can't tell right away
| whether the data has those trailing commas. They are
| optional, so they might not start appearing until after
| gigabytes of output have gone by. So you can't count on a
| quick error message in the event of a version mismatch.
| samhw wrote:
| If you have gigabytes of handwritten JSON (if it's not
| handwritten, trailing vs non-trailing commas surely don't
| matter), then I feel like you're doing something wrong.
|
| Though I'm sure someone's going to step in and say "Have
| you not heard of [stupendously niche use case]? Are you
| living under a rock!?" etc etc ;)
| HKH2 wrote:
| What advantages do commas have over semicolons?
| [deleted]
| charles_f wrote:
| a) Why would you want to remove the field names? This makes it
| so much harder to debug and very brittle, since now you're
| dependent on the order of fields. No mention of how you handle
| versioning either. Back to csv.
|
| > However, this time, something felt wrong; I realized that with
| the JSON, we were exchanging a huge amount of unnecessary
| information to and from the server
|
| b) Text size really ain't an issue given that we're talking about
| typically just a few kb on gzipped protocols over hundreds of
| mbps connections. Compactness sounds like a bad argument to me.
|
| c) "json doesn't have schema built in is a really dubious
| argument". If you want schemas you can still get them using json-
| schema, and if you don't you can still understand the message
| using the field names, which makes for a degraded schema ; which
| doesn't exist in the case of internet objects. If you don't have
| the schema, go figure what's in there
|
| What really gives it to me is the comparison at the bottom
| between internet objects anf json; json looks better to me.
|
| Looks like it's an idea executed over a bad premise
| dang wrote:
| A couple small past threads:
|
| _JSON Alternative - Internet Object_ -
| https://news.ycombinator.com/item?id=21220405 - Oct 2019 (12
| comments)
|
| _Show HN: Internet Object - a thin, robust and schema oriented
| JSON alternative_ - https://news.ycombinator.com/item?id=20982180
| - Sept 2019 (8 comments)
___________________________________________________________________
(page generated 2021-10-24 23:02 UTC)