[HN Gopher] Amazon Ion - A richly-typed, self-describing, hierar...
       ___________________________________________________________________
        
       Amazon Ion - A richly-typed, self-describing, hierarchical
       serialization format
        
       Author : gjvc
       Score  : 367 points
       Date   : 2021-11-20 00:28 UTC (22 hours ago)
        
 (HTM) web link (amzn.github.io)
 (TXT) w3m dump (amzn.github.io)
        
       | kats wrote:
       | It's just another format not better than any of the others.
        
       | 95th wrote:
        | Reminds me of the bencoding used in torrents.
        
       | silvestrov wrote:
       | This is what JSON should have been extended to.
       | 
        | But Douglas Crockford just doesn't want to innovate anything,
        | just like Gruber didn't want to write a proper specification of
        | the Markdown format.
        | 
        | Sometimes people hold innovation back. Fortunately this did not
        | happen with HTML.
       | 
       | The main thing missing from the text format is a magic and
       | version number. At least the binary format has it.
        
         | kwertyoowiyop wrote:
         | The dominance of JSON shows that Crockford made some good
         | decisions, even though we may not agree with them on any given
         | day.
        
           | seanclayton wrote:
           | The dominance of JSON just shows that JS is dominant.
        
             | usrusr wrote:
              | Even JS stopped parsing JSON as a subset of JS a long
              | time ago. JSON's lineage has been irrelevant to its
              | popularity ever since people stopped doing var jsonobj =
              | eval(jsonstring);
        
             | indymike wrote:
             | > The dominance of JSON just shows that JS is dominant.
             | 
              | I don't know that that's the case... I've used JSON in
              | lots of non-JS languages because it just works, and
              | errors are rarely caused by mismatches in how JSON
              | behaves in language X and language Y. A lot of that is
              | because it is simple and rigid.
        
       | chromatin wrote:
       | Check out Ilya Yaroshenko's Ion library for D, part of the larger
       | 'mir' library:
       | 
       | http://mir-ion.libmir.org/
       | 
       | https://github.com/libmir/mir-ion
        
         | hatf0 wrote:
         | Weird to see the library I work on show up in HN --- Mir Ion is
         | a pretty complicated library (and admittedly our documentation
         | needs work -- I'm working on that!), but I'm very proud of our
         | work.
         | 
         | Some fun things about Mir Ion:
         | 
         | - We can fully deserialize Ion at compile-time (via D's CTFE
         | functionality)
         | 
         | - We're one of the fastest JSON parsing libraries (and one of
         | the most memory efficient too -- we actually store all JSON
         | data in memory as Ion data, which is vastly more efficient)
         | 
          | - We're nearly 100% compliant with all of the upstream test
          | cases (our main issue is that we're often _too_ lax about the
          | spec, and let invalid files through)
         | 
         | - The entire library is (nearly) all `@nogc`, thanks to the Mir
         | standard library
         | 
         | If anyone has any questions on Mir Ion, feel free to shoot me a
         | line at harrison (at) 0xcc.pw
        
       | CyanLite4 wrote:
       | How does this compare with MsgPack?
        
       | tootie wrote:
       | No schema validation?
        
         | landonxjames wrote:
         | I believe that is provided by the Ion Schema Language
         | https://amzn.github.io/ion-schema/docs/spec.html
        
       | programd wrote:
       | "Zero and negative dates are not valid, so the earliest instant
       | in time that can be represented as a timestamp is Jan 01, 0001"
       | 
       | That seems to be...a problem? How do you deal with archeological
       | dates, of which there are many, in Ion?
        
         | pdpi wrote:
         | That's an interesting question. On the one hand, it feels weird
         | that you can't represent those dates at all.
         | 
         | On the other hand, representability of a given date becomes
         | progressively less useful the further back in time you go, and
         | stuff becomes really gnarly once you go past the Julian
         | calendar in 45BC.
         | 
         | Also, simplifying to "no dates before Jan 1 0001" has very
         | little impact on applications dealing with the modern-ish world
         | (with "modern" generously defined as "anything after the
         | collapse of the Roman Empire"), and I can only assume
         | applications dealing with earlier times could do with a more
         | specialised representation for dates anyway.
        
           | biztos wrote:
           | Just to give one example, in Thailand right now it's the year
           | 2564.
           | 
           | 1 BC for some is not "-1" for everyone.
        
         | elteto wrote:
         | What modern tech service (of the kind that would have use for
         | Ion) is dealing with archaeological dates _at scale_? Honest
         | question.
        
       | rjzzleep wrote:
       | I feel like a lot of file formats came out of companies, but even
       | protocol buffers isn't calling itself google protocol buffers.
       | What is it with modern companies putting their name everywhere
       | they can?
        
         | rp1 wrote:
         | Grpc?
        
           | tjpnz wrote:
           | The g is (allegedly) not for Google.
        
             | rp1 wrote:
             | What's it for?
        
               | [deleted]
        
               | moltenguardian wrote:
               | https://grpc.github.io/grpc/core/md_doc_g_stands_for.html
        
               | Rebelgecko wrote:
               | gRPC
        
               | travisd wrote:
                | > What does gRPC stand for?
                | > gRPC Remote Procedure Calls, of course!
               | 
               | https://grpc.io/docs/what-is-grpc/faq/
        
         | SavantIdiot wrote:
         | It's funny, I didn't realize protobuf was a Google thing for a
         | long time because of that. At least `protobuf` is a reasonably-
         | specific search term. `ion` returns too much noise. Almost a
         | good reason to name things weirder, like `iyon`. But then
          | they'd get laughed at. EDIT: oh, it's a Tagalog name too, and a
         | light company.
        
         | jsnell wrote:
         | Disambiguation. There is one thing called protobufs. There are
         | hundreds called "ion", a lot of which are more notable than an
         | internal file format.
         | 
         | Edit: I was going to paste in a relevant quote from Zarf (i.e.
         | Andrew Plotkin) on naming. Some of his most important programs
         | have total nonsense names like "glulx", and the reasoning was
         | that at least it would be easy to search for when the name is
         | unique. But ironically, "Zarf" is so common a term that I can't
         | find the quote.
        
         | ix101 wrote:
         | Natural file extension will be .ai despite having no relation
         | to AI
        
           | seniorsassycat wrote:
            | I've seen .ion and .10n for text and binary Ion files. I
            | think "Amazon Ion" is like "Golang" - used to clarify
            | meaning, not for branding
        
       | syspec wrote:
       | Surprised I have not heard of this before, I'd love something to
       | come along and give JSON a kick in the pants.
       | 
        | I do think JSON is the de facto standard, and it really does
        | get the job done, but for some more advanced uses something
        | like this could really shine.
        
       | petilon wrote:
       | I like the fact that you can annotate objects as well, not just
        | literals. So this is valid:
        | 
        |     animal: Tiger:: {
        |       gender: 'F',
        |       weight: 450
        |     }
       | 
       | This solves the inheritance problem, i.e., if you have multiple
       | subclasses how do you know which type to deserialize as?
        
         | plandis wrote:
          | I believe this is exactly how the Jackson Ion serializer
          | handles subtype polymorphism.
        
       | quiffledwerg wrote:
       | I just feel deeply disinclined towards supporting anything Amazon
       | because they've developed a reputation as such a poor community
       | member.
        
       | setheron wrote:
       | Wow I remember using Ion back at Amazon in 2012. I can't remember
       | but I think the order data warehouse was using it ...
       | 
       | I also now remember back to using something that was akin to FaaS
       | but wasn't called that. I could give them a JAR of some code that
       | would execute on some Ion data for the order data when it
       | changed. Basically FaaS for an ETL pipeline...
       | 
       | Crazy how ahead of the times some companies were.
        
         | vineyardmike wrote:
         | I wonder why it took 10+ years to share then?
        
           | timdorr wrote:
           | Actually, it only took them 4-ish years:
           | https://amzn.github.io/ion-docs/news/2016/04/21/amazon-
           | open-...
        
         | User23 wrote:
         | That was a golden age for Amazon engineering. I assume they're
         | still great, but that stretch from 2004 to 2014 was some
         | incredible advancement.
        
       | dorianmariefr wrote:
       | Pretty neat, but isn't it like *two* formats: one binary and one
       | textual?
        
         | OJFord wrote:
         | Consider that binary, binary coded decimal, Gray code,
         | hexadecimal, octal, etc. are all 'formats' expressing the same
         | (numerical) idea.
         | 
         | You can't say the same of, for example, YAML & JSON, since the
         | former (if not the latter?) has constructs unrepresentable in
         | the other.
         | 
         | It's slightly confused because an application might 'serialise
         | to' JSON or YAML or Ion equivalently - but really that's saying
         | the application's data being serialised fits a model that's a
         | subset of the intersection between those formats.
         | 
         | You could call Ion two, but it's more than that in that it's
         | also a promise that they're 1:1 (err, and onto if you like) -
         | their intersection is their union.
        
         | echelon wrote:
         | One data model, two serializations of it.
        
         | seniorsassycat wrote:
         | Two representations of the same data structures.
         | 
          | Ion text is like JSON; in fact, all JSON is valid Ion text.
          | Ion text has comments, trailing commas, dates, and unquoted
          | keys. It's a really good alternative to JSON, YAML, or TOML.
         | 
         | Ion binary is compact and fast to parse. Values are length
         | prefixed so the parser can skip over unneeded fields or
         | structs, saving time parsing and memory allocated. Common
         | string values, like struct keys and enum values, are given
         | numeric ids and stored once in a header table.
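          | 
          | For a taste, a small hand-written sketch of Ion text (mine,
          | not from the docs):
          | 
          |     // comments and trailing commas are allowed
          |     order::{              // 'order::' is an annotation
          |       id: 42,             // unquoted field names
          |       total: 19.99,       // exact decimal, not a float
          |       placed: 2021-11-20T00:28:00Z,  // native timestamp
          |     }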
        
           | ComputerGuru wrote:
           | Do comments persist in binary serialization or is that a
           | lossy one-way operation?
        
             | the_girabbit wrote:
             | Comments don't persist in binary. Like white space, they
             | are explicitly not part of the data model.
        
             | seniorsassycat wrote:
              | I think the Ion Java library includes an AST parser that
              | preserves comments, but the Ion data model doesn't. The
              | binary format cannot include comments.
              | 
              | I think many text formats are missing libraries that can
              | edit documents in place, preserving formatting and
              | comments.
        
       | jscholes wrote:
       | The latest format for Kindle eBooks, KFX, is based on this.
        
         | loeg wrote:
         | Yep:
         | https://github.com/apprenticeharper/DeDRM_tools/blob/master/...
        
       | yayitswei wrote:
       | Reminds me of Clojure's transit.
        
         | cmancini wrote:
         | That was the first thing I thought of! Big fan of transit.
         | Seems very similar.
        
       | Zamicol wrote:
       | Am I the only one that doesn't like base 64?
       | 
       | Hex for when efficiency isn't paramount.
       | 
       | Base 85 or BasE91 for when efficiency is more of a concern.
       | http://base91.sourceforge.net/
        
         | hackcasual wrote:
          | You want to use hex whenever byte-aligned data is going to be
          | compressed. Base64 turns every 3 bytes into 4 symbols and
          | breaks byte alignment, which hurts the compressor.
        
         | ralusek wrote:
         | I understand the case for Base91, but why hex over Base64?
         | Base64 for readability and sticking to multiples of two, Base91
         | for maximum efficiency with readable ASCII.
        
           | Zamicol wrote:
           | Base 64 is good at nothing and bad at some things.
           | 
           | - Hex is human readable, case insensitive, not that
           | "inefficient", and always aligns to bytes.
           | 
           | - Base 85 and basE91 are efficient.
           | 
           | - Bitcoin uses Base58 because they thought base 64 was too
           | human unreadable. Ethereum uses Hex.
           | 
           | - Base 256 (bytes) is efficient and the native language of
           | computers.
           | 
           | Base 64 is not efficient, not human readable, and not easy to
           | encode.
           | 
           | The biggest problem with base 64 is that base 64 is not base
           | 64. Are you doing base 64 with padding? Are you doing base 64
           | with URL safe characters or URL unsafe characters? Are you
           | following the standard RFC 4648 bucket encoding, or are you
           | using iterative divide by radix? I think a great place where
           | the cracks show is JOSE, where for things like thumbprints
           | there's a ton of conversion steps (UTF-8 key -> base 64 ->
           | ASCII bytes -> digest (bytes) -> base 64 thumbprint).
           | 
            | My personal advice: the 90% of projects considering base 64
            | should just use hex or bytes. If you need human
            | readability, use hex. Otherwise use binary.
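            | 
            | A quick Python illustration of the "which base 64?"
            | problem (the variants really do disagree on the same
            | bytes):
            | 
            |     import base64
            |     
            |     data = bytes([0xfb, 0xef, 0xff])
            |     print(data.hex())                      # fbefff
            |     print(base64.b64encode(data))          # b'++//'
            |     print(base64.urlsafe_b64encode(data))  # b'--__'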
        
         | Aeolun wrote:
         | I like base64 because it's the de-facto standard, and data size
         | (in places where I'd use base64) isn't a main concern for me.
        
           | stjohnswarts wrote:
           | Yeah I get tired of reinvention of everything for tiny gains
           | in size/performance.
        
       | transfire wrote:
        | `years::4`? I don't know. Why not `4::years`?
       | 
       | Also, symbols converted to integers means the receiving end has
       | to already know exactly what they are.
        
         | re wrote:
         | Putting annotations before values is likely to be more useful
         | for streaming parsers than putting them after. Imagine the case
         | where the annotation represents a class that you want to
         | deserialize a large object into.
        
         | sokoloff wrote:
         | There is provision for encoding a local symbol table:
         | https://amzn.github.io/ion-docs/docs/symbols.html#processing...
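          | 
          | Roughly, a local symbol table is itself just an annotated
          | struct at the start of the stream (a sketch from my reading
          | of that page; the ids assume the Ion 1.0 system table):
          | 
          |     $ion_symbol_table::{
          |       symbols: ["username", "email"]
          |     }
          |     // later values can use $10 for 'username' and
          |     // $11 for 'email'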
        
       | stevefan1999 wrote:
        | How does that differ from the likes of MessagePack and CBOR?
        
       | Eelongate wrote:
       | Did anything ever become of the lispy language that was being
       | built using Ion as its homoiconic syntax? I'm afraid I can't
       | recall what it was called. Fusion maybe?
        
         | kayamon wrote:
         | Dunno about that one but if you like that sort of thing, check
         | out Rebol.
        
         | throwaway_sZntK wrote:
         | Yeah, Fusion was the name. Last I heard, they discontinued it,
         | saying essentially "If you really want a full Lisp, there's
         | already Clojure." S-exps continued to be used in Ion for
         | embedded 1-liners but they only supported a handful of
         | operators, not a full language.
        
         | garmaine wrote:
         | I hope you get an answer, because this sounds very intriguing
         | but google is failing me in finding any references to it.
        
       | seniorsassycat wrote:
        | Ion's text format is a nice JSON alternative, while its binary
        | format is very compact and allows for efficient sparse parsing.
       | Fields are prefixed with their length so you can skip over
       | unneeded fields or structs while only creating objects for values
       | you'll use.
        
       | grouphugs wrote:
       | fuck amazon, and fuck everyone that won't stop promoting them.
        
       | nprateem wrote:
       | Shame there's no PHP lib :(
        
         | mpfundstein wrote:
         | your chance
        
         | jonwilsdon wrote:
         | Disclosure: I manage the Ion and PartiQL teams at Amazon.
         | 
         | If you want to create an issue for it (the best repo is
         | probably the ion-docs one: https://github.com/amzn/ion-
         | docs/issues) that will help to show us there is demand for it.
         | Providing information on your use case helps us prioritize.
        
       | clhodapp wrote:
       | It's staggering to me that people keep making these "rich" data
       | formats without sum types. At least to me, the "ors" are just as
       | important as the "ands" in domain modeling. Apart from that,
        | while you can always sort of fake it with a bunch of optional
        | fields, I believe you kind of need a native tagged-union
        | encoding if you want to avoid bloating your messages.
        
         | spenczar5 wrote:
         | Others have mentioned Protobuf and Capnproto's support. Avro
         | has them too, they're called Union.
         | 
         | It seems that sum types are the norm, actually.
        
           | clhodapp wrote:
           | Those do now but I _believe_ that all of them added support
           | years after their initial versions
        
             | spenczar5 wrote:
             | I think you're incorrect:
             | 
              | Avro had unions in version 1.0 [0], which is from 2009.
             | 
             | Capnproto had unions back in 2013 [1]. That's from the v0.1
             | days, or maybe even earlier.
             | 
             | Protobuf has had oneof support for about 7 years. They were
             | added in version 2.6.0, from 2014-08-15 [2]. That's still 6
             | years after the initial public release in 2008, though, so
             | this is maybe what you were thinking of? I don't know too
             | many people who were using protobuf in those days outside
             | of Google, though.
             | 
             | ---
             | 
             | [0] https://avro.apache.org/docs/1.0.0/spec.html#Unions
             | 
             | [1] https://github.com/capnproto/capnproto/commit/eb8404a15
             | 7e074...
             | 
             | [2] https://github.com/protocolbuffers/protobuf/blob/master
             | /CHAN...
        
               | clhodapp wrote:
               | Thanks for the references, friend!
               | 
               | And yes, I definitely am primarily thinking of protobuf,
               | as I struggled with this back with version 2.5. I had the
               | (apparently mistakenly) impression that Avro and Cap'n
               | Proto (which I think actually first came out in this
               | timeframe) were about on par.
        
         | the_girabbit wrote:
         | Genuine question--why would you need a sum type in a self-
         | describing data format?
        
           | valenterry wrote:
           | Well, there are already sumtypes, just only specific builtin
           | ones, not custom ones. E.g. booleans are sumtypes (true |
           | false). Everything else that is nullable is also a sumtype
           | (e.g. number | null).
           | 
            | I think it should be pretty obvious how these are helpful
            | and why they are needed, no?
        
             | the_girabbit wrote:
             | Yeah, but it's a schema-less, self-describing data format.
             | It's not like a specific position in a data stream has a
             | requirement to be a specific type.
             | 
             | I can see why sum types would be useful in a schema or for
             | the elements of a collection that is required to be
             | homogeneous (ie. List<Foo|Bar>).
             | 
             | For what use case would one use custom sum types in a
             | schema-less data format?
        
         | seniorsassycat wrote:
         | ion schema is a type system that can validate ion values and it
         | supports sum types.
         | 
         | https://amzn.github.io/ion-schema/docs/spec.html#union
         | 
         | The ion data model doesn't describe a schema or type system.
         | It's a data structure where values are of a known type. In the
         | binary format values are preceded by a type id, in the text
         | format the syntax declares the type - "" for string, {} for
         | struct. The data model doesn't declare what types a value could
         | have, only the type it does have.
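          | 
          | An ISL union looks something like this (a sketch based on my
          | reading of the linked spec; check it for the exact syntax):
          | 
          |     type::{
          |       name: int_or_string,
          |       one_of: [int, string],
          |     }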
        
         | dastbe wrote:
         | data interchange formats try to encode as little backwards
         | incompatible information as possible. in this case, it would be
         | the restriction that something is a sum type when it could have
         | multiple fields set in the future. another example is protobuf
         | moving to all fields being optional by default.
         | 
         | as for the wire format, a variant struct where you've only
         | instantiated a single field will encode down to just about the
         | minimum amount of information required.
        
           | nly wrote:
            | Avro went the opposite way to most and just makes the
            | concept of an optional field implementable via a union with
            | null.
            | 
            | Non-union fields can even be upgraded to unions later.
            | 
            | Personally I find the protobuf "everything is optional!"
            | behaviour fucking insane and awful to deal with, but it is
            | true to the semantics of its underlying wire format.
        
           | valenterry wrote:
            | That's not a contradiction, though.
           | 
           | One can always choose not to use (native) sumtypes if they
           | are interested in extreme performance or compatibility.
           | 
           | But logically speaking, it is _good_ that it's a restriction
           | that a sumtype can't just turn into a multiple-fields type.
           | Because while my software (as the consumer) might still be
            | able to deserialize it, the assumption that only one field
            | is set would be broken, and my logic would now potentially
            | be broken. Much better if that happens at deserialization
            | time than later on, when I find out that my data is
            | incorrect/corrupt.
        
           | vlovich123 wrote:
           | Have you looked at cap'n'proto. It does sum types in a very
           | sane way.
        
         | joshlemer wrote:
         | Doesn't it trivially have "sum types" since it's just arbitrary
         | self-describing data? i.e. nobody is stopping you from passing
         | around objects in such a way:
         | 
         | {a:1} {a:{b:2}} {a:4} {a:{b:4}}
         | 
         | There's no static type layer over top of this, so it's
         | inherently up to interpretation and whatever type system you
         | want to use to describe this data, to be able to express that
         | the values of `a` can be of type `number | {b: number}`
        
           | valenterry wrote:
           | > There's no static type layer over top of this
           | 
           | Yeah, that's the problem. I mean, hey, why json? We could
           | just use unstructured plaintext for everything and now we are
           | free to do everything. But obviously that has its own
           | drawbacks.
           | 
           | Having built-in support for sumtypes means better and more
           | ergonomic support from libraries, it means there is one
           | standard and not different ways to encode things and it also
           | means better performance and tooling.
        
             | joshlemer wrote:
             | The point is that there's no reason to single out sumtypes
             | here. Insofar as ions/json has support for
             | arrays/objects/strings/numbers, it has exactly the same
             | support for sumtypes, as in the example I showed above.
             | Here is a list of "sumtype" `string | number | object`:
             | 
             | [{}, "hi", 1, 2, 3, "yo", {a: "bc"}]
        
               | valenterry wrote:
               | No, that is not a sumtype, that's an array.
               | 
                | In the same sense, "1e-12" is not a number, it's a
                | string. Yes, it's a string that encodes a number in a
                | certain notation, but to all the tooling, the IDE, the
                | libraries, etc. it will stay a string.
        
               | joshlemer wrote:
               | What I mean is, it is an array of a sumtype `number |
               | string | object`. So precisely, you could call it a
               | `list<number | string | object>`
        
               | dunefox wrote:
               | That's a list union[number, string, object] or list[Any],
               | not a sum type, no? This
               | 
               | `data X = A | B
               | 
               | [A, B, ...]`
               | 
               | Is a list containing a sum type: list[X]
        
               | joshlemer wrote:
               | There is no such thing in JSON or Ions as defining this
               | "X" schema somewhere. So I may as well say that your
               | [A,B,...] is a list[Any].
               | 
               | Now, I wouldn't actually call it a list of any, I would
               | say you proved my point for me. Your example is
               | functionally the same as mine. I would give this example:
               | 
               | `[A, B, ...]`
               | 
               | and say that that is a list of sum types. You may say "no
               | no no! Only now is it a list of sum types!":
               | 
               | `data X = A | B
               | 
               | [A, B, ...]`
               | 
               | But my point is that there is no JSON/Ion equivalent of
               | your `data X = A | B`. Everyone in this comment tree is
               | confusing the data itself with out-of-band schema over
               | that data. "Sumtype" is nothing more than a fiction, or a
               | schema. Saying that JSON/Ions don't support sumtypes is
               | like saying JSON doesn't support "NonNegativeInteger"
               | type. Sure it does! Here are some: 1, 2, 3, 10. What
               | tooling or type system you use outside of the data itself
               | to enforce constraints on the data types is orthogonal to
               | the data format itself.
        
               | ImprobableTruth wrote:
               | Sum types =/= union types. Sum types are also called
               | 'tagged' or 'discriminable' unions because they have some
               | way to discriminate between them. That is, if you have an
               | element a of type A, a is _not_ part of the sum type A +
               | B because it 's missing a tag.
               | 
               | [5,"hello",3] has the type list (int [?] string), not
               | list (int + string). You _can_ emulate the latter by
               | manually adding a tag, but native support is much
               | preferable.
        
               | joshlemer wrote:
               | I know the differences between untagged and tagged
               | unions, I'm trying to provide a minimal example without
               | distracting details but sure we can talk about tagged
               | unions. Here is a list of tagged unions, so I once again
               | point out that sum types are "supported" in JSON/ions
                | just as much as any other data type:
                | 
                |     [
                |       {tag: "a", foo: 1},
                |       {tag: "b", bar: "hi", baz: 2},
                |       {tag: "a", foo: 3},
                |       {tag: "a", foo: 4},
                |       {tag: "a", foo: 5},
                |       {tag: "b", bar: "yo", baz: 6}
                |     ]
        
               | quantumspandex wrote:
                | His point was about type support and a standard way of
                | doing things. By your argument we'd just need a string
                | type to represent everything.
        
         | yakkityyak wrote:
         | You should look into https://cuelang.org
        
         | vlovich123 wrote:
         | Cap'n'proto has native sum types.
        
         | jsolson wrote:
         | Protobuf supports sum types in the higher-level generated
         | descriptors and languages -- on the wire they're just encoded
         | as, well... oneof a number of possible options.
        
           | ricardobeat wrote:
           | Which results in very painful inconsistencies when you're
           | dealing with the same schema on different platforms.
        
             | xyzzy_plugh wrote:
             | Are you referring to different language
             | implementations/runtimes? I don't follow your point about
             | inconsistencies.
        
               | [deleted]
        
       | tlocke wrote:
       | One problem with Ion is that it doesn't have a map type, but
       | instead a struct type that allows duplicate keys. I created Zish
       | https://github.com/tlocke/zish as a serialization format that
       | addresses the shortcomings of JSON and Ion. Any comments /
       | criticisms welcome.
        
       | n8ta wrote:
       | I recently implemented a similar (simpler) format
       | https://baremessages.org/ in ruby.
       | 
       | First thoughts are:
       | 
        | ION pros:
        | 
        | - easy to skip around while reading a file
        | - no need to write a schema
        | - backed by amazon so major langs will have impls
        | - good date support
        | - better concatenation, probably better suited to logging than
        |   bare
        | 
        | ION cons:
        | 
        | - what's the text format even for?
        | 
        | BARE pros:
        | 
        | - schemas keep things tightly versioned
        | - smaller binaries (not self describing like ion)
        | - simpler to implement so tons of devs have impl'ed for their
        |   favorite lang
        | - better suited to small messages (think REST json api)
        | 
        | BARE cons:
        | 
        | - no skip read
        | - no date support
       | 
       | I might do an ion ruby implementation too, to really feel out the
       | difference.
        
         | ozzythecat wrote:
          | Ion text is helpful so you can convert Ion binary to text for
          | debugging.
        
         | seniorsassycat wrote:
          | Ion text is a good contender for JSON, YAML, and TOML use
          | cases. It's also a good way to present the binary format to
          | humans.
        
         | imiric wrote:
         | > what's the text format even for?
         | 
         | Configuration files?
         | 
         | Not sure if that's an intended use case, but being more
         | flexible than JSON and stricter than YAML seems ideal for
         | configuration.
        
         | the_girabbit wrote:
         | Ion will be even better for (structured) logging if this
         | proposal for templates ever happens.
         | https://github.com/amzn/ion-docs/pull/104
         | 
         | Looks like no one's even so much as commented on it in the last
         | year, so it might have been abandoned.
        
           | n8ta wrote:
           | Ion is already a little too complex for my taste. It'd be a
           | shame to see it go the same way as yaml where it's so complex
           | that most major implementations are not safely interoperable.
        
           | jonwilsdon wrote:
           | Disclosure: I manage the Ion and PartiQL teams at Amazon.
           | 
           | This proposal hasn't been abandoned. We hope to post an
           | update soon!
        
       | sirk390 wrote:
        | Timestamps and decimals are the two most useful additions
        | compared to JSON. They would be nice to add to JSON if that is
        | somehow possible.
        
         | nly wrote:
          | JSON numbers, just like in all human-readable formats, _are_
          | decimal... it's not like binary double values are printed
          | into JSON in hex or base64.
          | 
          | Sure, 99% of decoders convert them to and from binary
          | doubles, but that's purely an implementation choice.
        
           | indymike wrote:
           | > JSON numbers, just like all human readable formats, are
           | decimal...
           | 
           | All JSON numbers are implemented as integers or floating
           | point, and as a result, have to be cast as a decimal (a
           | decimal type is generally something that meets this
           | specification: http://speleotrove.com/decimal/) when you
           | import them.
           | 
            | Decimal types differ from floating point types in three
            | ways: they are exact, they honor rounding rules, and they
            | track precision. Decimal math is slower, can have greater
            | precision, and is better suited to domains where finite
           | precision is needed. Floating point is faster, but is not as
           | precise, so it's good for some scientific uses... or where
           | perfect precision isn't important but speed is... say 3d
           | graphics.
           | 
           | I've billed lots of hours over the years fixing code where a
           | developer used floats where they should have used decimals.
           | For example, if you are dealing with money, you probably want
           | decimal. It's one of those problems like trying to parse
           | email addresses with a regex or rolling your own crypto... it
            | will kinda work until someone finds out it really doesn't
           | (think accounting going, our numbers are off by random
           | amounts, WTF?).
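            | 
            | The classic Python demonstration of why floats and money
            | don't mix:
            | 
            |     from decimal import Decimal
            |     
            |     print(0.1 + 0.2)             # 0.30000000000000004
            |     print(Decimal("0.1") + Decimal("0.2"))  # 0.3, exact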
        
             | nly wrote:
             | A binary double can hold any decimal value to 15 digits of
             | precision, so as a _serialisation format_ it 's a bit of a
             | non-issue... you just need to convert to decimal and round
             | appropriately before doing any arithmetic where it matters.
             | 
             | And you're confusing JSON the format with typical
             | implementations. Open a JSON file and you see _decimal_
             | digits. There is no limit to the number of the digits in
             | the grammar. Parsing these digits and converting them to
             | binary doubles, for example, is actually _slower_ than
             | parsing them as decimals, because you have to do the latter
             | anyway to accomplish the former. Almost all JSON libraries
             | convert to binary (e.g. doubles) because of their
              | ubiquitous hardware and software support... but some
             | libraries like RapidJSON expose raw numeric strings out of
             | the parser if you want to plug in a decimal library
        
       | hirundo wrote:
       | It seems like an odd choice to make the type "metadata" a prefix
       | to the value, rather than a separate field. It feels like
       | overloading. What's the advantage?
        
         | re wrote:
         | Not sure I understand exactly what "a separate field" would
         | look like, but:
         | 
         | 1. Considering that a goal of Ion is to be a strict superset of
         | JSON, separate syntax ensures that any JSON value can be parsed
         | without misinterpreting some field as an annotation--there are
         | no reserved/"magic" field names.
         | 
         | 2. Annotations can be applied to any type of value, not just
         | objects, which are the only type that have fields.
        
         | [deleted]
        
         | indymike wrote:
         | It tells you how to load the value and can be human readable
         | for audit purposes. example: degrees::'celsius'::100
        
       | wisty wrote:
       | I scanned the docs, and can't see what happens if you alter your
       | data schema. Anyone know?
        
         | travisd wrote:
         | Seems like you have to handle that yourself. The serialized
         | data includes the type, so your app code might have to have
         | logic a la "if type1: ... else: ..." after parsing it.
        
           | wisty wrote:
           | OK, so it's one of the more flexible ones (like those binary
           | jsons) rather than something like protobuf. I guess that
           | should have been obvious from "self-describing".
        
       | otabdeveloper4 wrote:
       | Nice! This thing is actually sane and thought through. A first
       | for serialization formats. They're usually a shitshow.
       | 
       | (Should have gone with 'rational' instead of 'decimal', though.
        | Decimal will be too painful to implement across languages and
       | implementations. Java bias?)
        
         | sirk390 wrote:
          | But decimals are way more useful, as they can represent
          | currency amounts. It would be strange to show a currency
          | amount like "3/4" or "11/12". Personally, the two datatypes I
          | have always been adding manually to JSON are datetimes and
          | decimals (from Python).
        
           | otabdeveloper4 wrote:
           | A currency amount is just a rational number with "1000000" as
           | a denominator.
           | 
           | This is the correct representation, and how Google or the
           | blockchain do it.
        
       | mgamache wrote:
       | msgpack is near the top for speed and size... Readability is
       | nice. Are there other advantages?
       | 
       | https://msgpack.org/index.html
        
       | AtlasBarfed wrote:
        | This is JSON with relaxed Jackson parsing: quote-optional keys,
        | comments, all doable with Jackson OOTB for years now.
        
         | quda wrote:
         | Another useless transfer data format. It will be forgotten
         | within a year or two.
        
       | hliyan wrote:
       | This reminded me of a tight-packed binary format we used in the
       | trading systems domain almost 20 years ago. Instead of including
       | metadata/field names in each message, it had a central message
       | dictionary that every client/server would first download a copy
        | of. Messages had only type IDs, followed by binary packed data
       | in the correct field order. Because of microsecond latency
       | requirements, we even avoided the serialization/deserialization
       | process by making the memory format of the message and the wire
       | format one and the same. The message class contained the same
       | buffer that you would send/store. The GetInt(fieldID) method of
       | the class simply points to the right place in the buffer and does
       | a cast to int. Application logs contained these messages, rather
       | than plain text. There was a special reader to read logs.
       | Messages were exchanged over raw TCP. They contained their own
       | application layer sequence number so that streams could resume
       | after disconnection.
       | 
        | In that world, latencies were so low that the response to your
       | order submission would land in your front-end before you've had
       | time to lift your finger off the enter key. I now work with web
       | based systems. On days like this, I miss the old ways.
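        | 
        | A toy Python version of that "wire format is the memory
        | format" idea (the offsets and layout are made up for
        | illustration):
        | 
        |     import struct
        |     
        |     class Message:
        |         # field id -> byte offset in a fixed toy layout
        |         OFFSETS = {1: 0, 2: 4}
        |     
        |         def __init__(self, buf):
        |             self.buf = memoryview(buf)  # no copy of the bytes
        |     
        |         def get_int(self, field_id):
        |             # reinterpret 4 bytes in place as a little-endian
        |             # int, like the original GetInt(fieldID)
        |             return struct.unpack_from("<i", self.buf,
        |                                       self.OFFSETS[field_id])[0]
        |     
        |     wire = struct.pack("<ii", 100, 250)  # "received" bytes
        |     print(Message(wire).get_int(2))      # 250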
        
         | atlgator wrote:
         | We did the same in high fidelity flight simulators for a lot
         | less money I'm sure.
        
         | lordnacho wrote:
         | Same here, I wrote an exchange core that did this using SBE.
         | Basically you don't serialize in the classical sense, because
         | you're simply taking whatever bytes are at your pointer and
         | using them as some natural type. The internals of the exchange
         | also simply used the same layout, so there was minimal copying
         | and interpreting. On the way out it was the same, all you had
         | to do was mask a few fields that you didn't want everyone to
         | see and ship it onto the network.
         | 
         | Even an unoptimized version of this managed to get throughput
         | in the 300K/s range.
         | 
         | Somehow it's the endpoint of my journey into serialization.
         | Basically, avoid it if you need to be super fast. For most
         | things though, it's useful to have something that you can read
         | by eye, so if you're not in that HFT bracket it might be nicer
         | to just use JSON or whatever.
        
         | secondcoming wrote:
         | I assume you were using C++? I'm not sure what you describe is
         | possible these days due to UB. At the very least just casting
         | bytes received over the wire to a type is UB, so you
         | technically need a memcpy() and hope that the compiler
         | optimises it out.
        
           | hliyan wrote:
           | Yes, it was C++. I was unfamiliar with the acronym "UB" so
           | did a Google search. Does it mean "Undefined Behavior"? If I
           | remember correctly, primitive types other than strings are
           | memcpy'd. GetStr basically returned a char* to the right
           | place in the buffer.
        
             | secondcoming wrote:
             | Apologies, yes Undefined Behaviour
        
         | sattoshi wrote:
         | Apache Thrift works on the same principle of separating
         | structure from data.
        
         | mrlemke wrote:
         | Very neat and similar to a project I am starting for packet
         | radio. I went further with the dictionary concept so that it
         | contains common data. This way, your message contains only a
         | few dictionary "pointers" (integers in base 64). This makes it
         | easier to fit messages in ASCII for 300 baud links.
        
         | oandrew wrote:
         | Interesting. Confluent Avro + Schema registry + Kafka uses
         | exactly the same approach - binary serialized Avro datums are
         | prefixed with schema id which can be resolved via Schema
         | registry
        
         | angstrom wrote:
         | And to top it off you could fit the entire message into
         | whatever the MTU of your network supported. Cap it at 1500
         | bytes and subtract the overhead for the frame headers and you
         | get an extremely tight TCP/IP sequence stream that buffers
         | through 16MB without needing to boil the ocean for a compound
         | command sequence.
         | 
         | Having been in industry only 2 decades it amuses me how many
         | times this gets rediscovered.
        
           | kwertyoowiyop wrote:
           | Every multiplayer game programmer from the 1990s agrees with
           | you!
        
           | hliyan wrote:
           | That just reminded me of the most mysterious scaling issue I
           | ever faced. We had a message to disseminate market data for
           | multiple markets (e.g. IBM: 100/100.12 @ NYSE, 101/102 @
           | NASDAQ etc.). The system performed admirably under load
           | testing (think 50,000 messages per second). One day we
           | onboarded a single new regional exchange and the whole market
           | data load test collapsed. We searched high and low for days
           | without success, until someone figured out that the new
           | market addition had caused the market data message to exceed
           | the Ethernet frame size for the first time. Problem was not
           | at the application layer or the transport, it was data link
           | layer fragmentation! Figuring that out felt like solving a
           | murder mystery (I wasn't the one who figured it out though).
        
             | jkhdigital wrote:
             | _Classic_ example of a leaky abstraction, and the principle
             | that implementation details inevitably become undocumented
             | API behavior.
        
               | kabdib wrote:
               | A lot of "transparent RPC" systems are like this. "It's
               | just like a normal function call, it's _sooo_ convenient
               | " . . . until it isn't, because it involves the network
               | hardware and configuration, routing environment,
               | firewalls, equipment failure . . .
        
               | andylynch wrote:
               | I've worked on systems like this too - the max packet
               | size is very well documented. Then post trade it all gets
               | turned into FIXML which somehow manages to be both more
               | verbose and less readable.
        
             | angstrom wrote:
              | Yeah, that's part of the trick for large listing
              | responses to be spread across frames. Usually with some
              | indicator like a "more" flag, so the client can say "get
              | me the next sequence" by requesting the next index in the
              | list with the prior btree index. People do this all the
              | time with large databases and it's a very similar use
              | case.
        
               | vendiddy wrote:
               | This was a fun back and forth to read!
        
             | elcritch wrote:
             | Ouch that's rough. One nice bit of IPv6 is that it doesn't
                | allow fragmentation. It's often much nicer to get no
                | message or an error than subtly missing data.
        
               | depereo wrote:
               | IPv6 does allow fragmentation.
        
               | elcritch wrote:
               | Ah yah that's right. I'm just learning more of ipv6 and
               | get it mixed up. It appears what I had in my mind was
               | about intermediate routers: "Unlike in IPv4, IPv6 routers
               | (intermediate nodes) never fragment IPv6 packets."
               | (Wikipedia). To the previous point, it looks like ipv6
               | does require networks to send 1280 byte or smaller
               | packets unfragmented.
        
           | agumonkey wrote:
           | Smells like engineering
        
         | porker wrote:
         | Fab story, thank you! I understood up to "Messages were
         | exchanged over raw TCP. They contained their own application
         | layer sequence number so that streams could resume after
         | disconnection." Can you go into more details about how the
         | sequence number and resuming after disconnection worked?
        
           | mtrovo wrote:
            | The server used a global sequence number for all messages
            | it transmitted. Clients are stateful, so they know exactly
            | what the latest message they processed was, and would send
            | that id when creating a new connection. This was very
            | important, as a lot of the message types used delta values,
            | one of the most important ones being the order book. So in
            | order to apply a new message you had to make sure that your
            | internal state was at the correct sequence id; failing to
            | do so would make your state go bonkers, especially when
            | you're talking about hundreds of messages being received
            | per second. It's scary, but you had a special message type
            | that would send you a snapshot of the expected state along
            | with the sequence id it corresponds to. So your error
            | handling code would fetch one of these and then ask for all
            | the messages newer than that.
        
             | hliyan wrote:
              | This is exactly right. It was almost always deltas rather
              | than snapshots. One of the downsides was that sometimes,
             | debugging an issue required replaying the entire market up
             | to the point of the crash/bug.
        
           | hliyan wrote:
           | Pretty basic. The receiving process usually has an input
           | thread that just puts the messages into a queue. Then a
           | processing thread processes (maybe logic, maybe disk writes,
           | maybe send) the messages and queues up periodic batch acks to
           | the sender. The sender uses these acks to clear its own
           | queue. The receiver persists the last acked sequence number,
           | so that in case of a restart, it can tell upstream senders to
           | restart sending messages from that point.
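            | 
            | In Python, the receiver side amounts to something like this
            | (a sketch, not the original code):
            | 
            |     class Receiver:
            |         def __init__(self, last_acked=0):
            |             self.last_acked = last_acked  # persisted
            |     
            |         def resume_point(self):
            |             # what we ask upstream to resend from after
            |             # a reconnect
            |             return self.last_acked + 1
            |     
            |         def handle(self, seq, msg):
            |             if seq != self.last_acked + 1:
            |                 return  # gap/duplicate; real code resyncs
            |             # ... process msg, batch acks, etc. ...
            |             self.last_acked = seq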
        
             | mianos wrote:
             | The number of times people have "invented" ASN.1 now is
             | ridiculous.
        
         | erenon wrote:
         | We do something very similar in binlog:
         | https://github.com/morganstanley/binlog
         | 
         | Serialization is platform-dependent (to make it a simple memcpy
         | most of the time), and the schema is sent up front (but can be
         | updated later, with in-bound messages at will). See the User
         | Guide (http://binlog.org/UserGuide.html) and the Internals
         | (http://binlog.org/Internals.html) for more.
        
         | ktzar wrote:
         | Is it FIX messages?
         | https://en.wikipedia.org/wiki/Financial_Information_eXchange
         | It's a good idea, extensible (ranges available for banks to
         | implement their own codes), and fast.
        
           | nly wrote:
           | Old school texty FIX is incredibly slow. FAST FIX is faster
           | but not fun to use. Largely SBE has won adoption on the
           | market data side, with huge platforms like Euronext (biggest
            | in Europe) using it.
        
             | mtrovo wrote:
              | I stopped working in the area in the age of FAST FIX,
              | which was extremely good for the time. Do you know how it
              | differs from SBE?
        
               | nly wrote:
               | I guess I'm biased based on experience at the companies
               | I've worked at but FAST never seemed to have good
               | libraries or tooling
        
           | hliyan wrote:
           | It was a proprietary messaging middleware library. We
           | actually found even FAST FIX slow.
        
             | o_bender wrote:
             | FAST FIX protocol is terrible performance-wise, its format
             | requires multiple branching at every field parsing. Even
             | "high-performance" libraries like mFAST are slow: I
             | recently helped a client to optimize parsing for several
             | messages and got 8x speed improvement over mFAST (which is
             | a big deal in HFT space).
        
         | makotobestgirl wrote:
         | Sounds like Google's flatbuffers [0], which indexes directly
         | into a byte buffer using the field size prefix.
         | 
         | [0] https://google.github.io/flatbuffers/
        
         | armchairhacker wrote:
         | I don't understand why serialization formats that separate
         | structure and content aren't more popular.
         | 
            | Imagine a system where every message is a UID or DID
         | (https://www.w3.org/TR/did-core/) followed by raw binary data.
         | The UID completely describes the shape of the rest of the
         | message. You can also transmit messages to define new UIDs:
         | these messages' UID is a shared global UID that everyone knows
         | about.
         | 
         | Once a client learns a UID, messages are about as compact as
         | possible. And the data defining UIDs can be much more
         | descriptive than e.g. property names in JSON. You can send
         | documentation and other excess data when defining the UID,
         | because you don't have to worry about size, because you're only
         | sending the UID once. And UIDs can reference other UIDs to
         | reduce duplication.
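          | 
          | A bare-bones Python sketch of the idea (the 8-byte UID and
          | the struct formats are arbitrary choices of mine):
          | 
          |     import struct
          |     
          |     registry = {}  # uid -> payload shape, learned at runtime
          |     
          |     def define(uid, fmt):
          |         registry[uid] = fmt
          |     
          |     def decode(message):
          |         uid, payload = message[:8], message[8:]
          |         return struct.unpack(registry[uid], payload)
          |     
          |     define(b"\x01" * 8, "<id")  # uid means int32 + float64
          |     msg = b"\x01" * 8 + struct.pack("<id", 7, 2.5)
          |     print(decode(msg))          # (7, 2.5)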
        
           | dboreham wrote:
           | This is protocol buffers + a global type registry. I worked
           | on such a system.
        
             | pcarolan wrote:
             | Is it public? Id love to learn more about it.
        
               | jenny91 wrote:
               | If you read the protobuf source, you can see a bunch of
               | places where you can hook in custom type-fetching code,
               | e.g. in the google.protobuf.Any type.
               | 
               | After studying it a bit, I'm certain this is how it's
               | used inside Google (might also be mentioned elsewhere).
               | 
               | All you'd really need to do is to compile all protos into
               | a repository (you can spit out the binary descriptors
               | from protoc), then fetch those and decode in the client.
               | 
                | It'd actually be quite straightforward to set up.
        
           | mtrovo wrote:
           | I think the system OP is describing is a little bit more
           | complex. You're not just describing message types, you also
           | have message templates; a template declares a message type
           | and a set of prefilled fields. You save data by just sending
           | the subset of fields that are actually changing, which is a
           | very good abstraction for market data. The template is
           | hydrated on the protocol parsing layer so your code only has
           | to deal with message types itself.
        
           | NavinF wrote:
           | You just described protobufs and all its successors.
           | 
           | See the "@0xdbb9ad1f14bf0b36" at the top of this capnproto
           | file for example: https://capnproto.org/language.html
           | 
           | It's a 64bit random number so it'll never have unintentional
           | collisions.
           | 
           | Also note that a capnp schema is natively represented as a
           | capnp message. Pretty convenient for the "You can also
           | transmit messages to define new UIDs" part of your scheme :)
        
             | infogulch wrote:
             | Interesting. Ids in particular are described here:
             | https://capnproto.org/language.html#unique-ids
             | 
             | I wonder if giving it a name based on the hash of the
             | definition has been explored; like Unison [0] where all
             | code is content addressable, but for just capnproto
             | definitions. Is there a reason not to?
             | 
             | [0]: https://www.unisonweb.org
        
               | NavinF wrote:
               | Capnp uses the name of your message, but not its full
               | definition because that would make it impossible to
               | extend protocols in a backwards compatible way. Without
               | the ability to add new fields, making changes to your
               | protocol would be impossible in large orgs.
        
             | boxfire wrote:
             | MD5 is a 128 bit random number no one would ever have
             | thought would collide. 64 bits is peanuts especially when
             | message types are being defined dynamically
        
               | remram wrote:
               | MD5 is safe against unintentional collisions.
        
               | NavinF wrote:
               | Dude that's why I said "unintentional collisions".
               | 
               | Of course you can get intentional collisions. The
               | security model here assumes that anyone that wants to
               | know your message's ID can just ask.
               | 
               | Did you know that the Internet Protocol uses a 4-bit
               | header to specify the format (v4 or v6) of the rest of
               | the message? They should have used 128 bits. What a bunch
               | of fools.
        
               | [deleted]
        
             | garmaine wrote:
             | > It's a 64-bit random number, so it'll never have
             | unintentional collisions.
             | 
             | It'll have unintentional collisions if you ever generate
             | more than 4 billion of these random numbers. That's not
             | inconceivable.
        
               | logicchains wrote:
               | >It'll have unintentional collisions if you ever generate
               | more than 4 billion of these random numbers.
               | 
               | If it's 64-bit, doesn't that mean you'd need to
               | generate ~18,446,744,073,709,551,616 (2^64) of those
               | numbers to have a collision, not 2^32?
        
               | tomerv wrote:
               | If you generate randomly then, due to the birthday
               | paradox, after generating sqrt(N) values you have a
               | reasonable chance of collision.
               | 
               | The birthday paradox is named after the non-intuitive
               | fact that with just 32 people in a room you have a
               | greater than 50% chance of 2 people having a birthday
               | on the same day of the year.
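               | 
               | For these 64-bit IDs, a quick back-of-the-envelope
               | using the usual approximation p ~ 1 - e^(-n^2/2N):
               | 
               |     import math
               | 
               |     def p_collision(n, bits=64):
               |         # Birthday bound for n random values drawn
               |         # from a space of size 2**bits.
               |         return -math.expm1(-n * n / (2.0 * 2**bits))
               | 
               |     print(p_collision(4_000_000_000))  # ~0.35
               |     print(p_collision(2**32))          # ~0.39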
        
               | ratorx wrote:
               | Does birthday paradox apply here? It's about any pair of
               | people having the same birthday, whereas in this case you
               | need someone else with a specific birthday.
               | 
               | For example, if you generate 2 numbers and they are the
               | same, but are different to the capnproto number, that's a
               | collision but doesn't actually matter.
               | 
               | EDIT: It does apply, I misunderstood what the number was
               | being used for.
        
               | elcritch wrote:
               | It does apply, according to
               | https://www.johndcook.com/blog/2017/01/10/probability-of-
               | sec...
        
               | ratorx wrote:
               | You're right, I misunderstood what the magic number was
               | being used for.
        
               | lozenge wrote:
               | But if my application only uses 100 schemas, I only care
               | about a collision if it's with one of those 100.
        
               | gpderetta wrote:
               | You have a collision if any two schemas share the id, not
               | if a specific schema collides with any of the others. So
               | it is exactly like the birthday paradox.
        
               | heavenlyblue wrote:
               | Yeah, but that collision probably doesn't matter because
               | there's a bunch of other variables that need to come
               | together for it to be an issue at all.
        
               | gpderetta wrote:
               | If the schema id is the message id, in principle it
               | could be an issue, as the protocol on the wire would
               | be ambiguous. Then again, you should be able to detect
               | any collisions when you register a schema with the
               | schema repo and deal with them at that time.
        
               | [deleted]
        
               | adwn wrote:
               | > _32 people_
               | 
               | Slight correction: only 23 people, actually. So in every
               | second football ("soccer") game, you have two people on
               | the field with the same birthday.
        
               | doo_daa wrote:
               | I think it's 23 people in a room. The canonical example
               | is people on a football (soccer) pitch. With 11 per side
               | plus the referee there's a 50% chance that two will share
               | the same birthday.
        
               | [deleted]
        
               | remram wrote:
               | When you reach the 4 billionth version of your protocol?
        
               | kentonv wrote:
               | All versions of the same protocol have the same ID. That
               | is the point of IDs -- to link together different
               | versions of the protocol.
        
               | remram wrote:
               | You're right! That makes collisions even less likely
               | then.
        
               | heavenlyblue wrote:
               | I don't understand your maths here: how is generating
               | 4 billion of them any different from generating 3
               | billion, except for a slight rise in the probability
               | measure?
        
               | NavinF wrote:
               | Yes it is. Message schemas are made by humans. Most of
               | these messages will be extended in a backwards compatible
               | manner over the life of a project rather than replaced
               | entirely so their IDs don't change. That's kinda the
               | point of protobufs and its successors.
               | 
               | I've probably generated 100 IDs over my lifetime.
        
               | garmaine wrote:
               | Which puts it on the same order of magnitude as the
               | number of people on the planet. If every person alive
               | generated a schema (or if 1/100th of all people generate
               | 100 IDs each like you) then we'd have a small number of
               | collisions. More likely, you'd get large numbers of
               | schemas like that if there's a widespread application
               | of a protocol compiler that generates new schemas
               | programmatically, e.g. to achieve domain separation,
               | and is then applied at scale. I'm not saying that's
               | likely, just that it is not, as is claimed,
               | _inconceivable_.
        
               | kentonv wrote:
               | It's only really a problem if you use the IDs in the same
               | system. It's highly unlikely that you'd link 4B schemas
               | into a single binary. And anyway, if you do have a
               | conflict, you'll get a linker error.
               | 
               | Cap'n Proto type IDs are not really intended to be used
               | in any sort of global database where you look up types by
               | ID. Luckily no one really wants to do that anyway. In
               | practice you always have a more restricted set of schemas
               | you're interested in for your particular project.
               | 
               | (Plus if you actually created a global database, then
               | you'd find out if there were any collisions...)
        
               | heavenlyblue wrote:
               | If you have 4 billion of them generated, there's about
               | a one-in-four-billion chance that the next one you
               | generate is a duplicate.
               | 
               | On top of that, you would not only need to generate
               | the same ID, you would need to USE it in the same
               | system, somewhere it could have some semantic effect
               | rather than just causing an error.
        
             | nly wrote:
             | Protobufs is a boring old tag-length-value format. It's
             | kind of the worst of both worlds, because it has no
             | type information encoded into it, meaning it's useless
             | without the schema, while still having quite a bit of
             | overhead.
             | 
             | Cap'n Proto is more like a formalization of C structs,
             | in that new fields are only added to the end. If memory
             | serves, on the wire there is no tag, type or length
             | info (for fixed-size field types), and everything is
             | rooted at fixed offsets.
        
               | kentonv wrote:
               | Mostly right. Allow me to provide some wonky details.
               | 
               | Protobuf uses tag-type-values, i.e. each field is encoded
               | with a tag specifying the field number _and_ some basic
               | type info before the value. The type info is only just
               | enough information to be able to skip the field if you
               | don't recognize it, e.g. it specifies "integer" vs.
               | "byte blob". Some types (such as byte blob) also have a
               | length, some (integer) do not. Nested messages are
               | usually encoded as byte blobs with a length, but there's
               | an alternate encoding where they have a start tag and an
               | end tag instead ("start group" and "end group" are two of
               | the basic types). On one hand, having a length for nested
               | messages seems better because it means you can skip the
               | message during deserialization if you aren't interested
               | in it. On the other hand, it means that during
               | serialization, you have to compute the length of the sub-
               | message before actually serializing it, meaning the whole
               | tree has to be traversed twice, which kind of sucks,
               | especially when the message tree is larger than the L1/L2
               | cache. Ironically, most Protobuf decoders don't actually
               | support skipping parsing of nested messages so the length
               | that was so expensive to compute ends up being largely
               | unused. Yet, most decoders only support length-delimited
               | nested messages and therefore that's what everyone has to
               | produce. Whoops.
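               | 
               | (For the curious, a minimal sketch of that tag
               | encoding - not the real protobuf code:)
               | 
               |     def encode_varint(n):
               |         out = bytearray()
               |         while True:
               |             b = n & 0x7F
               |             n >>= 7
               |             out.append(b | 0x80 if n else b)
               |             if not n:
               |                 return bytes(out)
               | 
               |     def encode_tag(field_number, wire_type):
               |         # 0=varint, 1=64-bit, 2=length-delimited,
               |         # 3/4=group markers, 5=32-bit
               |         return encode_varint(field_number << 3 | wire_type)
               | 
               |     # field 1, varint value 150 -> b'\x08\x96\x01'
               |     print(encode_tag(1, 0) + encode_varint(150))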
               | 
               | Now on to Cap'n Proto. In a given Cap'n Proto "struct",
               | there is a data section and a pointer section. Primitive
               | types (integers, booleans, etc.) go into the data
               | section. This is the part that looks like a C struct --
               | fields are identified solely by their offset from the
               | start of the data section. Since new fields can be added
               | over time, if you're reading old data, you may find the
               | data section is too small. So, any fields that are out-
               | of-bounds must be assumed to have default values. Fields
               | that have complex variable-width types, like strings or
               | nested structs, go into the pointer section. Each pointer
               | is 64 bits, but does not work like a native pointer. Half
               | of the pointer specifies an _offset_ of the pointed-to
               | object, relative to the location of the pointer. The
               | other half contains... type information! The pointer
               | encodes enough information for you to know the basic size
               | and shape of the destination object -- just enough
               | information to make a copy of it even if you don't know
               | the schema. This turns out to be super-important in
               | practice for proxy servers and such that need to pass
               | messages through without necessarily knowing the details
               | of the application schema.
               | 
               | In short, both formats actually contain type information
               | on the wire! But, not a full schema -- only the minimal
               | information needed to deal with version skew and make
               | copying possible without data loss.
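               | 
               | (A sketch of pulling that size-and-shape info out of
               | a struct pointer, going by the published encoding
               | spec - not production code:)
               | 
               |     import struct
               | 
               |     def decode_struct_pointer(word_bytes):
               |         (word,) = struct.unpack("<Q", word_bytes)
               |         kind = word & 0b11              # 0 = struct
               |         offset = (word >> 2) & 0x3FFFFFFF
               |         if offset & 0x20000000:         # 30-bit signed
               |             offset -= 0x40000000
               |         data_words = (word >> 32) & 0xFFFF
               |         ptr_words = (word >> 48) & 0xFFFF
               |         return kind, offset, data_words, ptr_words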
        
               | nly wrote:
               | I wouldn't call what protobuf encodes type
               | information. If I recall correctly, all the group
               | stuff is deprecated, so what's left basically boils
               | down to 3 types: 32-bit values, 64-bit values, and
               | length-prefixed values, which covers strings and
               | sub-messages. Without the schema you can't even
               | distinguish strings from sub-objects, as they are
               | both length-prefixed, as you described.
               | 
               | Can you even distinguish floats and ints without a schema
               | in protobufs? I don't remember.
               | 
               | I really enjoy capnproto, flatbuffers and Avro and bounce
               | between them depending on the task at hand.
        
         | mtrovo wrote:
         | I don't know if you're describing the original FIX itself,
         | with the TCP connection. FAST FIX got rid of the TCP
         | connection: market data was sent over UDP using several
         | parallel connections, data was reordered on the client side
         | at consumption time, and a TCP connection was used only to
         | recover data when a sequence gap was found.
        
           | hliyan wrote:
           | Actually, even FAST was too slow for us. This was a
           | proprietary messaging middleware library. And this particular
           | market data feed was the direct one into the matching engine
           | itself. For the rest of the system, we used a sort of
           | reliable multicast using UDP for the first transmission and
           | TCP for missed messages. We initially tried out a
           | Gossip/Epidemic protocol but that didn't work out too well.
        
         | Aeolun wrote:
         | > In that world, latencies were so low that the response to
         | your order submission would land in your front-end before
         | you've had time to lift your finger off the enter key.
         | 
         | If the order submission process depends on a manual press
         | of the enter key (+/- 50ms), is there any point to that,
         | though?
        
           | sodality2 wrote:
           | It was probably used for high-frequency trading, so fully
           | automated unless you happened to be testing it manually.
        
           | hliyan wrote:
           | Despite all the algorithms we employed, the concept of a
           | manual trade never went away. Also, when the front-end was
           | taken out of the equation, the latencies were in the
           | microsecond range. 50ms would be excruciatingly slow for an
           | algorithm.
        
           | danachow wrote:
           | OT, but keyboard latency can be, and often is, far below
           | 50ms - more like 1ms. It seems to be a common
           | misconception that debouncing mandates increased lag.
        
             | formerly_proven wrote:
             | That's because a lot of input hardware uses moronic
             | debouncing.
        
             | dan-robertson wrote:
             | Is this a number that came from an actual benchmark or from
             | some marketing material from a keyboard maker? I ask this
             | because [1] finds latency (measured from touching the key
             | to the usb packet arriving) of 15ms with the fastest
             | keyboard and around 50ms with others, though apparently
             | some manufacturers have since improved. Or are you talking
             | about midi keyboards where I guess latency is more
             | noticeable to users?
             | 
             | [1] https://danluu.com/keyboard-latency/
        
               | danachow wrote:
               | From the countless review and small time YouTube channels
               | that test these things regularly.
               | 
               | I think that post must be a few years out of date -
               | and moreover, by its own admission, it hardly tests
               | any "gaming" keyboards. There is a tremendous amount
               | of competition in keyboards that has been building
               | for the past 10 years.
               | 
               | Input latency is now a marketing thing like horsepower,
               | and there are reasonably reputable [1] places and
               | countless small time YouTube reviewers that test these
               | things. It's not like it is difficult to improve latency,
               | and now that it is something that is competitively
               | marketed it is delivered on.
               | 
               | [1] https://www.rtings.com/keyboard/tests/latency
               | 
               | Personally I think it's a bit ridiculous. This
               | fetishization of minimizing latency, now to sub-ms
               | levels, doesn't necessarily lead to better
               | performance, as many top-level gamers do not use the
               | lowest-latency keyboards. But that doesn't change the
               | fact that modern mainstream gaming keyboards can hit
               | a latency far below 50ms.
        
               | dan-robertson wrote:
               | The link I posted was 2017. The site you link gives quite
               | different ratings. I assume it is partly different
               | methodology (the site you link tries to account for
               | key travel somehow, and they use a display and try to
               | account for display latency rather than using a logic
               | analyzer), but I'm not really sure. For some
               | keyboards in common:
               | 
               | - apple magic keyboard (? vs 2017) 15ms vs 27ms
               | 
               | - das keyboard (3 vs S professional/4 Professional) 25 vs
               | 11/10ms
               | 
               | - razer ornata (chroma vs chroma/chroma 2) 35 vs
               | 11.4/10.1ms
               | 
               | Interestingly it is not some simple uniform difference:
               | the Apple keyboard does much worse in the rtings test,
               | perhaps getting not much of a bonus from key travel
               | compensation. But the das keyboard vs the razer that are
               | 10ms apart on my link perform equally on rtings (but
               | maybe I found the wrong model). I don't have a good
               | explanation for that discrepancy.
        
             | Aeolun wrote:
             | I was thinking more of the time a human finger needs to
             | push the button down.
        
         | mendigou wrote:
         | This is exactly how it's done for spacecraft telemetry and
         | telecommand too, but in this case it's to save bytes rather
         | than processing time.
         | 
         | I also miss working on those systems.
        
         | nly wrote:
         | What you're describing is exactly what still takes place in
         | trading platforms, although a few I've seen now use SBE for
         | consistency's sake (it's very common on the market data
         | side).
        
         | FpUser wrote:
         | I had exactly the same implementation, except that the type
         | / version belonged to the whole message and would map to an
         | appropriate binary buffer in memory. No real
         | de/serialization was needed.
         | 
         | I still use it in my UDP game servers, with an added packet
         | id if a message exceeds the max datagram length and has to
         | be split.
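         | 
         | Something like this, as a Python sketch (the header layout
         | and field names are made up):
         | 
         |     import struct
         | 
         |     # Fixed header: type, version, payload length. The
         |     # type/version pair maps the payload onto a known
         |     # in-memory layout, so no per-field deserialization.
         |     HEADER = struct.Struct("<HHI")
         | 
         |     def pack(msg_type, version, payload):
         |         return HEADER.pack(msg_type, version,
         |                            len(payload)) + payload
         | 
         |     def unpack(buf):
         |         t, v, n = HEADER.unpack_from(buf)
         |         return t, v, buf[HEADER.size:HEADER.size + n]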
        
           | ericbarrett wrote:
           | The one concern I'd have with this format is a length field
           | getting corrupted in transit and causing an out-of-bounds
           | memory access. The network protocols' checksums won't save
           | you 100% of the time, especially if there's bad hardware in
           | the loop. If every field is fixed length this is less of a
           | concern, of course; you might get bad data but you won't get
           | e.g. a string with length 64M.
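           | 
           | E.g., reusing the <HHI> header sketch above, the fix is
           | to never trust a length field farther than the buffer:
           | 
           |     import struct
           | 
           |     HEADER = struct.Struct("<HHI")
           |     MAX_LEN = 64 * 1024  # sane per-message upper bound
           | 
           |     def unpack_checked(buf):
           |         t, v, n = HEADER.unpack_from(buf)
           |         if n > MAX_LEN or HEADER.size + n > len(buf):
           |             raise ValueError("bad length field")
           |         return t, v, buf[HEADER.size:HEADER.size + n]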
        
             | hliyan wrote:
             | In our system, if the message didn't unpack properly, the
             | application would send a retransmit request with that
             | message's sequence number. But in practice, this scenario
             | never occurred because TCP already did this for us.
        
             | FpUser wrote:
             | I do not remember it ever happening, but being
             | semi-paranoid I had the length in 2 places: the
             | beginning and the end of the message.
        
         | amitport wrote:
         | Well that's just like using C structs. The best serialization
         | protocol :).
        
           | nly wrote:
           | Some finance software systems do that too. It tends to be
           | a nightmare, because people end up adding new message
           | types just to add a single field.
        
       | cma wrote:
       | > Ion supports comments.
       | 
       | Thank god. JSON for config files without comments is so
       | awful.
        
       | michalkrupa wrote:
       | Yes, we are all still very excited about JSON. (Edit: and BSON)
        
       | oandrew wrote:
       | So basically it's Amazon's version of Apache Avro. Avro
       | supports binary/JSON serialization, schema evolution,
       | logical types (e.g. timestamp) and other cool stuff.
       | 
       | https://avro.apache.org/docs/current/spec.html
        
         | fnord77 wrote:
         | I wanted to see what the differences are between Ion and Avro.
         | 
         | Unlike Avro, Ion doesn't require a schema.
        
         | whimsicalism wrote:
         | ... or thrift ... or protobuf
         | 
         | https://xkcd.com/927/
        
         | joshka wrote:
         | Avro didn't exist when Ion started development.
        
       | jsnell wrote:
       | Previous discussions:
       | 
       | https://news.ycombinator.com/item?id=11546098
       | 
       | https://news.ycombinator.com/item?id=23921610
        
         | dang wrote:
         | Thanks! Macroexpanded:
         | 
         |  _Amazon Ion_ - https://news.ycombinator.com/item?id=23921610 -
         | July 2020 (110 comments)
         | 
         |  _Amazon open-sources Ion - a binary and text interchangable,
         | typed JSON-superset_ -
         | https://news.ycombinator.com/item?id=11546098 - April 2016 (163
         | comments)
        
           | throwoutway wrote:
           | What do you use for the macroexpansion? There are a hundred
           | odd tasks like this that I need to create macros for!
        
             | dang wrote:
             | I mean that metaphorically but I do have a bunch of
             | keyboard shortcuts (in a browser extension) that make
             | finding these, and formatting the comments, much faster.
        
       | trinovantes wrote:
       | I wonder what the performance is relative to native JSON
       | parsers?
        
         | jonwilsdon wrote:
         | Disclosure: I manage the Ion and PartiQL teams at Amazon.
         | 
         | We have done some work on performance comparisons with the ion-
         | java-benchmark-cli tool (https://github.com/amzn/ion-java-
         | benchmark-cli). Right now you can compare JSON serialized with
         | Jackson and there is a pull request
         | (https://github.com/amzn/ion-java-benchmark-cli/pull/27) for
         | comparing against CBOR that should be merged soon.
         | 
         | We are always happy to hear suggestions for what is useful in
         | this area.
        
         | seniorsassycat wrote:
         | Parsing Ion text should be similar to JSON; it has the same
         | characteristics. All JSON is valid Ion text, so you can
         | even parse JSON with an Ion parser.
         | 
         | The binary parser is much faster. All fields are length-
         | prefixed, so a parser doesn't have to scan forward for the
         | next syntax element.
         | 
         | The Ion parsers (lexers? not sure of the right vocab) I've
         | worked with have a `JSON.parse` equivalent that returns a
         | fully realized object (a Map, Array, Int, etc.), but they
         | also have a streaming parser that yields value by value.
         | You can skip over values you don't need, and step over or
         | into structs without creating a Map or Array. That can be
         | much faster.
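         | 
         | For instance, with the amazon.ion Python package (a small
         | sketch; simpleion is the eager, JSON.parse-style API):
         | 
         |     import amazon.ion.simpleion as ion
         | 
         |     # JSON is valid Ion text...
         |     doc = ion.loads('{"a": 1, "b": [1, 2]}')
         | 
         |     # ...and Ion text adds types JSON lacks, such as
         |     # timestamps and unquoted symbols as field names.
         |     doc2 = ion.loads("{when: 2021-11-20T00:28:00Z, n: 1.5e0}")
         |     print(doc2["when"])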
        
       ___________________________________________________________________
       (page generated 2021-11-20 23:01 UTC)