[HN Gopher] Sick: Indexed deduplicated binary storage for JSON-l...
       ___________________________________________________________________
        
       Sick: Indexed deduplicated binary storage for JSON-like data
       structures
        
       Author : pshirshov
       Score  : 102 points
       Date   : 2025-10-28 13:22 UTC (9 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | gethly wrote:
        | It is a bit confusing that JSON is mentioned so much when in
        | reality this has nothing to do with it - except to showcase that
        | JSON is not suitable for streaming whereas this format is.
        | 
        | Secondly, I fail to see the advantages here, as the claim is that
        | it allows streaming for partial processing, whereas JSON has to
        | be fully loaded in order to be parseable. Mainly because the
        | values must be streamed first, before their location/pointers, in
        | order for the structure to make sense and be usable for
        | processing, but that also means we need all the parent pointers
        | as well in order to know where to place the children in the root.
        | So all in all, I just do not see why this is an advantageous
        | format over JSON (as that is its main complaint here), since you
        | can stream JSON just as easily: you can detect the { and }, [ and
        | ], " and , delimiters and know when your token is complete, then
        | process it, without having to wait for the whole structure to
        | finish being streamed or for the SICK pointers to arrive in full
        | so you can build the structure.
       | 
       | Or, I am just not getting it at all...
        
         | pshirshov wrote:
         | > when in reality this has nothing to do with it
         | 
         | It's a specific representation of JSON-like data structures,
         | with an indexed deduplicated binary format and JSON encoders
         | and decoders. Why "nothing"? It's all about it.
         | 
         | Mostly it's not about streaming. More efficient streaming is a
         | byproduct of the representation.
         | 
          | > because you can detect { and }, [ and ], " and ,
         | 
          | You need a pushdown automaton for that. In the case of SICK
          | you don't need potentially unbounded accumulation for many
          | (not all) use cases.
         | 
         | > the values must be streamed first, before their
         | location/pointers
         | 
         | Only to avoid accumulation. If you are fine with (some)
         | accumulation, you can reorder. Also think about the updates.
         | 
         | But again, streaming is a byproduct. This tool is an indexed
         | binary deduplicating storage which does not require parsing and
         | provides amortized O(1) access time.
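          | 
          | To illustrate the access pattern (a minimal sketch, assuming
          | a hypothetical fixed-width table, not the actual EBA layout):
          | when every entry has a constant size, entry i starts at
          | i * entrySize, so lookup is one multiplication and one read,
          | with no parsing.
          | 
          |   import java.nio.ByteBuffer
          | 
          |   // Hypothetical table of constant-size entries.
          |   final class FixedTable(buf: ByteBuffer, entrySize: Int) {
          |     // O(1): jump straight to the i-th entry.
          |     def entry(i: Int): ByteBuffer = {
          |       val view = buf.duplicate()
          |       view.position(i * entrySize)
          |       view.limit(i * entrySize + entrySize)
          |       view
          |     }
          |   }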
        
           | aaronblohowiak wrote:
           | I think this is a generational thing. To a bunch of people
           | now practicing, JSON isn't just an encoding but has come to
           | mean "nested basic data types".
        
         | 8organicbits wrote:
         | I think this quote explains the efficient streaming:
         | 
         | > There is an interesting observation: when a stream does not
         | contain removal entries it can be safely reordered.
         | 
         | So if I'm understanding, the example in the readme could be
         | sent in reverse, allowing the client to immediately use root:0
         | and then string:2 while the rest streams in.
         | 
         | I was looking for something like this, but my use case exceeds
         | the 65k key limit for objects.
        
           | pshirshov wrote:
           | > 65k key limit for objects
           | 
            | The limit comes from the 2-byte element pointer size. That
            | can be adjusted. We don't have an implementation with larger
            | pointers, but it can be done easily.
           | 
           | > while the rest streams in
           | 
            | Yes, there are many use cases where you can use some chunks
            | of the data or rebuild some parts of the structures
            | immediately, without any accumulation. The problem is that we
            | don't have a nice streaming abstraction which would suit
            | everyone for every use case.
            | 
            | SICK as a library is an efficient indexed binary storage for
            | JSON with the listed limitations.
            | 
            | SICK as a concept is much more, but you might need your own
            | implementation tailored to your use case.
        
           | gethly wrote:
            | If the stream contains removal entries, then that is not a
            | data stream but a command stream and, again, has nothing to
            | do with JSON itself. We can extrapolate this to some kind of
            | json-rpc-ish stream if we want to keep it aligned with JSON.
            | 
            | Overall, I think that mentioning JSON here at all is simply a
            | mistake. It would be better to just introduce this as a
            | streaming protocol/framework for data structures. But then we
            | can do the same thing with literally any format and syntax.
        
             | pshirshov wrote:
             | > has nothing to do with JSON itself.
             | 
             | It's literally a deduplicated indexed binary storage for
             | JSON (plus an approach to JSON representation more suitable
             | for streaming than serialized JSON).
             | 
             | > we can do the same thing
             | 
              | I would highly encourage doing the same thing! For some
              | reason people love to fully parse a 50 MiB JSON document
              | when they need just 0.1% of the data in it.
        
         | xenadu02 wrote:
          | Most existing JSON parsers don't support streaming, but that's
          | not inherent in the format. It is definitely easy to stream
          | writes, and it's just as possible to stream parsing.
        
       | qixxiq wrote:
        | Current implementation has the following limitations:
        | 
        |   Maximum object size: 65534 keys
        |   The order of object keys is not preserved
        |   ...
        | 
        | These limitations may be lifted by using more bytes to store
        | offset pointers and counts on binary level. Though it's hard to
        | imagine a real application which would need that.
       | 
        | I've worked on _many_ applications which have needed those
        | features. Object key ordering is a per-implementation detail, but
        | failing at 65k keys seems like a problem people would likely hit
        | if this were to be used at larger scales.
        
         | pshirshov wrote:
          | In our use case, for which we created the library, we made this
          | tradeoff to save several bytes per pointer and keep the binary
          | form more compact. The application splits large objects into
          | smaller chunks. 99% of the structures there are relatively
          | small, but there are tons of them. Most likely you can do the
          | same - just split large structures into smaller ones.
          | 
          | If you need support for larger structures, you may create your
          | own implementation or extend ours (and I would really like to
          | hear about your use case).
          | 
          | SICK as a concept is simple. SICK as a library was created to
          | cover some particular use cases and may not be suitable for
          | everyone. We would welcome any contributions.
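          | 
          | The external chunking mentioned above could look roughly like
          | this (a minimal sketch with hypothetical names, not part of
          | the library):
          | 
          |   // Split a large map into sub-maps that each stay
          |   // under the 65534-key limit, keyed by chunk index.
          |   def chunk[K, V](
          |       big: Map[K, V],
          |       maxKeys: Int = 65534
          |   ): Map[Int, Map[K, V]] =
          |     big.toSeq.grouped(maxKeys).zipWithIndex.map {
          |       case (kvs, i) => i -> kvs.toMap
          |     }.toMap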
        
           | duped wrote:
           | If you use varints for pointers you can have the best of both
           | worlds, and achieve even better compaction for the smallest
           | objects.
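            | 
            | A minimal sketch of one common varint scheme (unsigned
            | LEB128; an illustration of the idea, not SICK's format):
            | small values take a single byte, larger ones grow as needed.
            | 
            |   import java.io.OutputStream
            | 
            |   // 7 payload bits per byte; the high bit marks
            |   // "more bytes follow".
            |   def writeVarint(value: Long, out: OutputStream): Unit = {
            |     var v = value
            |     while ((v & ~0x7fL) != 0L) {
            |       out.write(((v & 0x7f) | 0x80).toInt)
            |       v >>>= 7
            |     }
            |     out.write((v & 0x7f).toInt)
            |   }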
        
             | pshirshov wrote:
              | Yep, it can be done, but varints aren't free because of
              | their variable size. With constant-size pointers I can
              | access const-sized elements directly; with varints I would
              | have to do some rituals.
              | 
              | I have another binary encoding for different purposes
              | (https://github.com/7mind/baboon) which relies on varints;
              | in the case of SICK I decided to go with pointers of
              | constant size to save some pennies on access efficiency.
        
         | halayli wrote:
          | I don't know what kind of data you are dealing with, but it's
          | illogical and against all best practices to have this many keys
          | in a single object. It's equivalent to saying that having
          | tables with 65k columns is very common.
          | 
          | On the other hand, most database decisions are about finding
          | the sweet-spot compromise tailored toward the common use case
          | they are aiming for, but your comment sounds like you are
          | expecting a magic trick.
        
           | xienze wrote:
            | > I don't know what kind of data you are dealing with, but
            | it's illogical and against all best practices to have this
            | many keys in a single object.
            | 
            | The whole point of this project is to efficiently parse
            | "huge" JSON documents. If 65K keys is considered outrageously
            | large, surely you can make do with a regular JSON parser.
        
             | pshirshov wrote:
             | > If 65K keys is considered outrageously large
             | 
              | You can split it yourself. If you can't, replace Shorts
              | with Ints in the implementation and it would just work, but
              | I would be very happy to know your use case.
              | 
              | Just bumping the pointer size to cover relatively rare use
              | cases is wasteful. It can be partially mitigated with more
              | tags and tricks, but it would still be wasteful. A tiny
              | chunking layer is easy to implement, and I don't see any
              | downsides to that.
        
               | cogman10 wrote:
               | How wasteful?
               | 
               | Presumably 4 bytes dedicated to the keys would be dwarfed
               | by any strings thrown into the dataset.
               | 
                | Regardless, other than complexity, would there be any
                | reason not to support a dynamic key size? You could
                | dedicate the first 2 bits of the key to the length of the
                | key: 1 byte would work if there are only 64 keys, 2 bytes
                | would give you 16k keys, and 3 bytes 4M. And if you
                | wanted to, you could use a frequency table to order the
                | pointers such that more frequently used keys are smaller
                | values in the dictionary.
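                | 
                | A minimal sketch of that 2-bit scheme (hypothetical
                | helper, not SICK's actual encoding): the top two bits
                | of the first byte say how many extra bytes follow,
                | leaving 6, 14, 22 or 30 payload bits.
                | 
                |   import java.io.OutputStream
                | 
                |   def writePtr(p: Int, out: OutputStream): Unit = {
                |     val extra =
                |       if (p < (1 << 6)) 0
                |       else if (p < (1 << 14)) 1
                |       else if (p < (1 << 22)) 2
                |       else 3
                |     val bits = 6 + extra * 8
                |     // First byte: 2-bit length tag + top payload bits.
                |     out.write((extra << 6) | (p >>> (bits - 6)))
                |     // Remaining payload bytes, most significant first.
                |     var shift = bits - 14
                |     while (shift >= 0) {
                |       out.write((p >>> shift) & 0xff)
                |       shift -= 8
                |     }
                |   }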
        
               | pshirshov wrote:
                | Most of the data the library was originally written for
                | consists of small objects and arrays with high levels of
                | duplication (think the state of the world in a video game
                | with tons of slightly varying objects). Pointer sizes
                | really matter.
        
           | duped wrote:
           | A pattern I've seen is to take something like `{ "users": [{
           | "id": string, ... }]}` and flatten it into `{ "user_id": {
           | ... } }` so you can deserialize directly into a hashmap. In
           | that case I can see 65k+ keys easily, although for an
           | individual query you would usually limit it.
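            | 
            | Roughly (hypothetical field names, just to illustrate the
            | shape):
            | 
            |   case class User(id: String, age: Int)
            | 
            |   val users = List(User("u1", 21), User("u2", 34))
            | 
            |   // One key per user id, so a large query can easily
            |   // produce 65k+ keys in a single object.
            |   val byId: Map[String, User] =
            |     users.map(u => u.id -> u).toMap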
        
             | mpweiher wrote:
             | Hmm...would all the user id's be known beforehand in this
             | use-case?
        
               | duped wrote:
               | I wouldn't get worked up about the actual names of things
               | I used here, but there's no difference between having the
               | key contained in the user data versus lifted up to the
               | containing object... every language supports iterating
               | objects by (key, value).
               | 
               | You would do a query like "give me all users with age
               | over 18" or something and return a `{ [id: string]: User
               | }`
        
           | zarzavat wrote:
            | Let's say you have a localization map: the keys are the
            | localization keys and the values are the localized strings.
            | 65k is a lot, but it's not out of the question.
            | 
            | You could store this as two columnar arrays, but that is
            | annoying and hardly anyone does that.
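            | 
            | For illustration (hypothetical keys), the same data in both
            | layouts:
            | 
            |   // Map form: one object key per localization key.
            |   val asMap: Map[String, String] =
            |     Map("app.title" -> "My App", "app.quit" -> "Quit")
            | 
            |   // Columnar form: two parallel arrays, where index i
            |   // of one corresponds to index i of the other.
            |   val keys   = Vector("app.title", "app.quit")
            |   val values = Vector("My App", "Quit")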
        
           | kevincox wrote:
           | You seem to be assuming that a JSON object is a "struct" with
           | a fixed set of application-defined keys. Very often it can
           | also be used as a "map". So the number of keys is essentially
           | unbounded and just depends on the size of the data.
        
             | mpweiher wrote:
              | Erm, yes, structs seem to be the use case this is
              | consciously and very deliberately aiming at:
             | 
             |  _SICK: Streams of Independent Constant Keys_
             | 
             | And "maps" seems to be a use case it is deliberately not
             | aiming at.
        
               | zamadatix wrote:
                | Isn't the example in the readme just a smaller map-style
                | object instead of a larger one?
        
           | jerf wrote:
           | Every pathological case you can imagine is something someone
           | somewhere has done.
           | 
           | Sticking data into the keys is definitely a thing I've seen.
           | 
           | One I've done personally is dump large portions of a Redis DB
           | into a JSON object. I could guarantee for my use case it
           | would fit into the relevant memory and resource constraints
           | but I would also have been able to guarantee it would exceed
           | 64K keys by over an order of magnitude. "Best practices"
           | didn't matter to me because this wasn't an API call result or
           | something.
           | 
           | There are other things like this you'll find in the wild.
           | Certainly some sort of "keyed by user" dump value is not
           | unheard of and you can easily have more than 64K users, and
           | there's nothing _a priori_ wrong with that. It may be a bad
           | solution for some _specific_ reason, and I think it often is,
            | but it is not automatically _a priori_ wrong. I've written
            | streaming support for both directions, so while JSON may not
            | be optimal it is not necessarily a guarantee of badness.
            | Plus, with the computers we have nowadays, sometimes "just
            | deserialize the 1GB of JSON into RAM" is a perfectly valid
            | solution for some cases. You don't want to do that a thousand
           | times per second, but not every problem is a "thousand times
           | per second" problem.
        
             | Groxx wrote:
              | Redis is a good point; I've made MANY >64k-key maps there
              | in the past, some up to half a million (and likely more if
              | we hadn't rearchitected before we got bigger).
        
             | pests wrote:
             | re: storing data in keys
             | 
              | FoundationDB makes extensive use of this pattern, sometimes
              | with no data stored under the key at all.
        
           | paulddraper wrote:
           | That's like saying it's illogical to have 65k elements in an
           | array.
           | 
           | What is the difference?
        
             | pshirshov wrote:
             | If the limitation affects your usecase, you can chunk your
             | structures.
             | 
             | The limitation comes with benefits.
        
               | paulddraper wrote:
               | Fair enough. Implementation details matter.
               | 
                | I was just responding to the "X is an absurd way to do
                | JSON" claim, which seemed to single out objects vs.
                | arrays.
                | 
                | Maybe that's true in this case, but I don't see a reason
                | to make it a general statement.
        
             | jiggawatts wrote:
             | I do not miss having to use "near" and "far" pointers in
             | 16-bit mode C++ programming!
        
           | MangoToupe wrote:
           | Data shape often outlives the original intentions.
        
         | nine_k wrote:
         | I'd say that it's generally unwise to use fixed-width integers
         | in a data structure where this width can vary widely, and has
         | no logical upper limit. Arbitrary-size integers are well known,
         | used in practice, and not hard to implement.
        
           | pshirshov wrote:
            | Even if it's "generally unwise", it was a well-thought-out
            | decision in this particular case. See my other comments. An
            | array of elements of constant size is indexed for free; an
            | array of elements of varying size needs a separate index.
            | SICK's binary representation (EBA) was created with several
            | particular use cases in mind. I needed the most compact
            | representation and the fastest access (for very underpowered
            | devices); large objects were not a big concern, as they can
            | be chunked externally.
        
         | eknkc wrote:
          | .NET has a polymorphic serializer where the output JSON
          | contains a $type field for the deserializer to choose the
          | concrete type.
          | 
          | It needs to be the very first key in the object. I've been
          | bitten by this because PostgreSQL's jsonb also does not
          | preserve key ordering.
          | 
          | I believe the latest .NET release addresses this, but key
          | ordering does matter sometimes.
        
           | pshirshov wrote:
            | When order is important, it can be maintained by an external
            | layer with, e.g., an encoding into a list of pairs.
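            | 
            | A minimal sketch of that external layer (an assumption, not
            | part of SICK): encode the ordered object as a list of pairs,
            | so the order survives even though object keys are unordered.
            | 
            |   // Ordered view, kept as a JSON array of two-element
            |   // arrays rather than as an object.
            |   val ordered: List[(String, String)] =
            |     List("$type" -> "Dog", "name" -> "Rex")
            | 
            |   // After decoding, the original key order is still there.
            |   val keysInOrder: List[String] = ordered.map(_._1)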
        
         | ericmcer wrote:
         | Isn't the order of JSON keys not guaranteed by the official
         | spec? I don't remember when I learned that but I have always
         | behaved as if that cannot be relied upon.
        
           | btschaegg wrote:
           | Since RFC 4627 (the original):
           | 
           | > An object is an unordered collection of zero or more
           | name/value pairs, [...]
           | 
           | Further, since RFC 7159:
           | 
           | > JSON parsing libraries have been observed to differ as to
           | whether or not they make the ordering of object members
           | visible to calling software. Implementations whose behavior
           | does not depend on member ordering will be interoperable in
           | the sense that they will not be affected by these
           | differences.
           | 
           | Both are in the current version (RFC 8259).
           | 
           | OTOH, I find the "but the order is not supposed to be
           | guaranteed!" debate _REALLY_ stupid when it comes to software
            | where it's clear that at some point, a human will have to
           | look at the content and correlate it with another system.
           | 
           | There's nothing more evil than re-shuffling JSON just for the
           | fun of it and making everyone who has to look at the result
           | miserable. Yes, I'm talking about you, ELK devs.
           | 
           | Edit: (And/or whoever wrote the underlying Java/Go libs they
           | use for JSON that don't allow developers to patch ordering
           | in. I remember reading GitHub issues about this.)
        
             | harrall wrote:
             | Less than 1% of the hash maps I use have ever needed order.
             | 
              | The underlying data structures of the two are different.
              | Ordered hash maps use more memory, are slower, and are
              | more complicated.
             | 
             | Knowing CS fundamentals, using an ordered hash map should
             | be a deliberate choice like renting a box truck when you
             | need to move a lot of stuff. Don't just drive a box truck
             | everywhere because you might pick up a couch from a thrift
             | store one day.
        
               | btschaegg wrote:
               | All well and true if all you have to do is process the
               | data programmatically.
               | 
               | And yet, as I said, if the same thinking gets applied to
               | e.g. a store of JSON documents (like ELK), chances are
               | good the thing will ruin the UX for countless people who
               | have to deal with the result. Note that you need exactly
               | _no_ hash maps to store the JSON as it is _text_.
               | 
               | To expand your analogy: ...and yet roads are built so
               | that you can drive your regular car _or_ a box car over
               | them, depending on your use case. You make the choice. A
                | JSON library that doesn't afford such choices (and isn't
               | hyper focused on performance) isn't a good one in my
               | book.
               | 
               | Edit: As a sidenote: Or do you mean a freight train
               | wagon? Then replace "road" with "rails" and "car" with
               | "draisine" :)
        
           | pcthrowaway wrote:
           | An object is unordered, but most (nearly all?) parsing
           | software will preserve the order, and some software may rely
           | on that behaviour, so it's worth mentioning.
        
           | pshirshov wrote:
           | No, according to the spec the order is not preserved but most
           | parsers preserve the order (or maintain some order based on
           | side effects) and engineers rely on that (and sometimes that
           | backfires).
           | 
           | Essentially, SICK also maintains some strange order based on
           | values of some homegrown trivial hash function but the only
           | right approach to JSON objects is to treat their keys as an
           | unordered set.
        
       | noobcoder wrote:
        | It could be really useful for cases where you're repeatedly
        | processing similar JSON structures, like analytical events. But
        | any plans for language bindings beyond the current
        | implementation?
        
         | pshirshov wrote:
         | > any plans for language bindings beyond the current
         | implementation?
         | 
          | If we need more bindings for the projects we work on, we will
          | implement and open-source them. E.g. recently we added
          | rudimentary JS support (no cursors, just an encoder/decoder).
          | 
          | For many reasons, we avoid working on things we don't use
          | ourselves and aren't paid for. But your contributions are very
          | welcome. Also, we would be happy to have you as a paying
          | client.
        
       | slashdave wrote:
       | So... isn't this just a database, and a bespoke one at that?
        
         | pshirshov wrote:
          | Yes, it's a specific JSON representation and a tiny database
          | based on it.
        
       | IshKebab wrote:
        | This sounds quite similar to Amazon Ion, which is one of the few
        | binary JSON formats that allow sparse reads and deduplicated
        | keys.
       | 
       | However I found that in cases where you have the requirements of
       | streaming/random access and every entry is the same... SQLite is
       | a really great choice. It's way faster and more space efficient
       | than it has any right to be, and it gets you proper random access
       | (not just efficient streaming), and there are nice GUIs for it.
       | And you can query it.
        
         | pshirshov wrote:
          | SQLite was assessed for our use case. SQLite representations of
          | our data were much bigger, it was much slower with our access
          | patterns, and we didn't need all the power of SQL.
        
       | kordlessagain wrote:
       | Use for vectors and stream smaller projections first?
        
       | Retr0id wrote:
       | I started on something a bit like this, but using sqlite instead
       | of a custom serialisation format:
       | https://github.com/DavidBuchanan314/dag-sqlite (it's intended to
       | store DAG-CBOR objects but it is trivial to represent a JSON
       | object as DAG-CBOR)
        
         | pshirshov wrote:
          | SQLite was a bit too heavy for our use case, unfortunately. We
          | tried it as one of the first options.
        
       ___________________________________________________________________
       (page generated 2025-10-28 23:00 UTC)