[HN Gopher] Sick: Indexed deduplicated binary storage for JSON-l...
___________________________________________________________________
Sick: Indexed deduplicated binary storage for JSON-like data
structures
Author : pshirshov
Score : 102 points
Date : 2025-10-28 13:22 UTC (9 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| gethly wrote:
| It is a bit confusing that JSON is mentioned so much when in
| reality this has nothing to do with it - except to showcase that
| JSON is not suitable for streaming whereas this format is.
|
| Secondly, I fail to see the advantages here, as the claim is
| that it allows streaming for partial processing, compared to
| JSON that has to be fully loaded in order to be parseable.
| Mainly because the values must be streamed first, before their
| location/pointers, in order for the structure to make sense and
| be usable for processing - but that also means we need all the
| parent pointers as well in order to know where to place the
| children in the root. So all in all, I just do not see why this
| format is advantageous over JSON (as that is its main complaint
| here), because you can detect { and } and [ and ] and " and ,
| delimiters and know when your token is complete, and then
| process it without having to wait for the whole structure to
| finish being streamed or wait for the SICK pointers to arrive
| in full so you can build the structure.
|
| Or, I am just not getting it at all...
| pshirshov wrote:
| > when in reality this has nothing to do with it
|
| It's a specific representation of JSON-like data structures,
| with an indexed deduplicated binary format and JSON encoders
| and decoders. Why "nothing"? It's all about JSON.
|
| Mostly it's not about streaming. More efficient streaming is a
| byproduct of the representation.
|
| > because you can detect { and } and [ and ] and " and ,
|
| You need a pushdown automaton for that. In the case of SICK you
| don't need potentially unbounded accumulation for many (though
| not all) use cases.
|
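| A depth counter (the degenerate form of that pushdown stack) is
| enough to find token boundaries in a stream of objects/arrays,
| but it still forces buffering. A minimal sketch (hypothetical
| code, not from the SICK codebase):
|
|       // Splits a character stream of concatenated JSON
|       // objects/arrays into complete top-level values. The
|       // StringBuilder is exactly the accumulation in question:
|       // a value can only be handed off once fully buffered.
|       def splitTopLevel(chars: Iterator[Char]): Iterator[String] =
|         new Iterator[String] {
|           def hasNext = chars.hasNext
|           def next(): String = {
|             val sb = new StringBuilder
|             var depth = 0
|             var inString = false
|             var escaped = false
|             var done = false
|             while (!done && chars.hasNext) {
|               val c = chars.next()
|               sb += c
|               if (inString) {
|                 if (escaped) escaped = false
|                 else if (c == '\\') escaped = true
|                 else if (c == '"') inString = false
|               } else c match {
|                 case '"'       => inString = true
|                 case '{' | '[' => depth += 1
|                 case '}' | ']' =>
|                   depth -= 1
|                   if (depth == 0) done = true
|                 case _         =>
|               }
|             }
|             sb.toString
|           }
|         }
|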
| > the values must be streamed first, before their
| location/pointers
|
| Only to avoid accumulation. If you are fine with (some)
| accumulation, you can reorder. Also, think about updates.
|
| But again, streaming is a byproduct. This tool is an indexed
| binary deduplicating storage which does not require parsing and
| provides amortized O(1) access time.
| aaronblohowiak wrote:
| I think this is a generational thing. To a bunch of people
| now practicing, JSON isn't just an encoding but has come to
| mean "nested basic data types".
| 8organicbits wrote:
| I think this quote explains the efficient streaming:
|
| > There is an interesting observation: when a stream does not
| contain removal entries it can be safely reordered.
|
| So if I'm understanding, the example in the readme could be
| sent in reverse, allowing the client to immediately use root:0
| and then string:2 while the rest streams in.
|
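| That reading matches the name (Streams of Independent Constant
| Keys): each entry carries its own index, so a decoder can file
| it away the moment it arrives. A sketch with a made-up entry
| type (SICK's actual wire layout may differ):
|
|       final case class Entry(table: String, index: Int,
|                              payload: String)
|
|       // Entries are self-contained, so with no removal entries
|       // any arrival order builds the same tables:
|       // load(stream) == load(stream.reverse)
|       def load(stream: Seq[Entry]): Map[(String, Int), String] =
|         stream.map(e => (e.table, e.index) -> e.payload).toMap
|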
| I was looking for something like this, but my use case exceeds
| the 65k key limit for objects.
| pshirshov wrote:
| > 65k key limit for objects
|
| The limit comes from the 2-byte element pointer size. That can
| be adjusted. We don't have an implementation with larger
| pointers, but it can be done easily.
|
| > while the rest streams in
|
| Yes, there are many use cases where you can consume some chunks
| of the data and rebuild some parts of the structures
| immediately, without any accumulation. The problem is that we
| don't have a nice streaming abstraction which would suit
| everyone for every use case.
|
| SICK as a library is an efficient indexed binary storage for
| JSON, with the listed limitations.
|
| SICK as a concept is much more, but you might need your own
| implementation tailored to your use case.
| gethly wrote:
| If the stream contains removal entries, then that is not a
| data stream but a command stream and, again, has nothing to do
| with JSON itself. We can extrapolate this to some kind of
| json-rpc-ish stream if we want to keep it aligned with JSON.
|
| Overall, I think that mentioning JSON here at all is simply a
| mistake. It would be better to just introduce this as a
| streaming protocol/framework for data structures. But then we
| could do the same thing with literally any format and syntax.
| pshirshov wrote:
| > has nothing to do with JSON itself.
|
| It's literally a deduplicated indexed binary storage for JSON
| (plus an approach to representing JSON that is more suitable
| for streaming than serialized JSON).
|
| > we can do the same thing
|
| I would highly encourage doing the same thing! For some reason
| people love to fully parse a 50MiB JSON document when they
| need just 0.1% of the data in it.
| xenadu02 wrote:
| Most existing JSON parsers don't support streaming, but that's
| not inherent to the format. It is definitely possible to
| stream writes easily, and it's just as possible to stream
| parsing.
| qixxiq wrote:
|       Current implementation has the following limitations:
|
|         - Maximum object size: 65534 keys
|         - The order of object keys is not preserved
|         - ...
|
|       These limitations may be lifted by using more bytes to
|       store offset pointers and counts on binary level. Though
|       it's hard to imagine a real application which would need
|       that.
|
| I've worked on _many_ applications which have needed those
| features. Object key ordering is a per-implementation detail,
| but failing at 65k keys seems like a problem people would
| likely hit if this were to be used at larger scales.
| pshirshov wrote:
| In our use case, for which we created the library, we made this
| tradeoff to save several bytes per pointer and keep the binary
| form more compact. The application splits large objects into
| smaller chunks. 99% of the structures there are relatively
| small, but there are tons of them. Most likely you can do the
| same: just split large structures into smaller ones (see the
| sketch below).
|
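| Such a chunking layer can be tiny. A sketch (hypothetical
| helper, assuming string keys; the chunk naming scheme is made
| up):
|
|       // Splits an oversized object into numbered sub-objects,
|       // each safely below the 65534-key limit.
|       def chunk[V](obj: Map[String, V],
|                    max: Int = 65000): Map[String, Map[String, V]] =
|         obj.toSeq
|           .grouped(max)
|           .zipWithIndex
|           .map { case (kvs, i) => s"chunk-$i" -> kvs.toMap }
|           .toMap
|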
| If you need support for larger structures, you may create
| your own implementation or extend ours (and I would really
| like to hear about your use case).
|
| SICK as a concept is simple. SICK as the library was created
| to cover some particular use cases and may not be suitable
| for everyone. We would welcome any contributions.
| duped wrote:
| If you use varints for pointers you can have the best of both
| worlds, and achieve even better compaction for the smallest
| objects.
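|
| For reference, a minimal LEB128-style varint, as used by e.g.
| Protocol Buffers (a sketch, not SICK's format):
|
|       import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
|
|       // Unsigned LEB128: 7 payload bits per byte; the high bit
|       // is set while more bytes follow. Small values take 1 byte.
|       def writeVarint(n: Int, out: ByteArrayOutputStream): Unit = {
|         var v = n
|         while ((v & ~0x7f) != 0) {
|           out.write((v & 0x7f) | 0x80)
|           v >>>= 7
|         }
|         out.write(v)
|       }
|
|       def readVarint(in: ByteArrayInputStream): Int = {
|         var result = 0
|         var shift = 0
|         var b = in.read()
|         while ((b & 0x80) != 0) {
|           result |= (b & 0x7f) << shift
|           shift += 7
|           b = in.read()
|         }
|         result | (b << shift)
|       }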
| pshirshov wrote:
| Yep, it can be done, but varints aren't free because of their
| variable size. With constant-size pointers I can access
| const-sized elements directly; with varints I would have to
| do some rituals.
|
| I have another binary encoding for different purposes
| (https://github.com/7mind/baboon) which relies on varints; in
| the case of SICK I decided to go with pointers of constant
| size to save some pennies on access efficiency.
| halayli wrote:
| I don't know what kind of data you are dealing with, but it's
| illogical and against all best practices to have this many
| keys in a single object. It's equivalent to saying that tables
| with 65k columns are very common.
|
| On the other hand, most database decisions are about finding
| the sweet-spot compromise tailored toward the common use case
| they are aiming for, but your comment sounds like you are
| expecting a magic trick.
| xienze wrote:
| > I don't know what kind of data you are dealing with but its
| illogical and against all best practices to have this many
| keys in a single object.
|
| The whole point of this project is to efficiently parse
| "huge" JSON documents. If 65K keys is considered outrageously
| large, surely you can make do with a regular JSON parser.
| pshirshov wrote:
| > If 65K keys is considered outrageously large
|
| You can split it yourself. If you can't, replace Shorts with
| Ints in the implementation and it would just work, but I
| would be very happy to know your use case.
|
| Just bumping the pointer size to cover relatively rare use
| cases is wasteful. It can be partially mitigated with more
| tags and tricks, but it still would be wasteful. A tiny
| chunking layer is easy to implement and I don't see any
| downsides to that.
| cogman10 wrote:
| How wasteful?
|
| Presumably 4 bytes dedicated to the keys would be dwarfed
| by any strings thrown into the dataset.
|
| Regardless, other than complexity, would there be any reason
| not to support a dynamic key size? You could dedicate the
| first 2 bits of the key to its length: 1 byte would work if
| there are only 64 keys, 2 bytes would give you 16k keys, and
| 3 bytes 4M. And if you wanted to, you could use a frequency
| table to order the pointers such that more frequently used
| keys get smaller values in the dictionary.
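|
| A sketch of that scheme (hypothetical, not something SICK
| implements): the top two bits of the first byte select the
| pointer width, leaving 6, 14 or 22 payload bits.
|
|       // Variable-width pointer: a 2-bit tag in the first byte
|       // selects 1, 2 or 3 bytes total.
|       def encodePtr(n: Int): Array[Byte] =
|         if (n < (1 << 6))
|           Array(n.toByte)                  // tag 0: 64 keys
|         else if (n < (1 << 14))
|           Array(((n >> 8) | 0x40).toByte,  // tag 1: 16k keys
|                 n.toByte)
|         else
|           Array(((n >> 16) | 0x80).toByte, // tag 2: 4M keys
|                 (n >> 8).toByte, n.toByte)
|
| The tradeoff, as discussed elsewhere in the thread, is that
| entries are no longer constant-sized, so free direct indexing
| is lost.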
| pshirshov wrote:
| Most of the data the library was originally written for
| consists of small objects and arrays with high levels of
| duplication (think the state of the world in a videogame,
| with tons of slightly varying objects). Pointer sizes really
| matter.
| duped wrote:
| A pattern I've seen is to take something like `{ "users": [{
| "id": string, ... }]}` and flatten it into `{ "user_id": {
| ... } }` so you can deserialize directly into a hashmap. In
| that case I can see 65k+ keys easily, although for an
| individual query you would usually limit it.
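|
| A sketch of that flattening (names illustrative):
|
|       final case class User(id: String, age: Int)
|
|       // From { "users": [ { "id": ..., ... }, ... ] }
|       // to   { "<id>": { ... }, ... } for direct map lookup.
|       def byId(users: Seq[User]): Map[String, User] =
|         users.map(u => u.id -> u).toMap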
| mpweiher wrote:
| Hmm... would all the user ids be known beforehand in this use
| case?
| duped wrote:
| I wouldn't get worked up about the actual names of things
| I used here, but there's no difference between having the
| key contained in the user data versus lifted up to the
| containing object... every language supports iterating
| objects by (key, value).
|
| You would do a query like "give me all users with age
| over 18" or something and return a `{ [id: string]: User
| }`
| zarzavat wrote:
| Let's say you have a localization map: the keys are
| localization keys and the values are the localized strings.
| 65k is a lot but it's not out of the question.
|
| You could store this as two columnar arrays, but that is
| annoying and hardly anyone does that.
| kevincox wrote:
| You seem to be assuming that a JSON object is a "struct" with
| a fixed set of application-defined keys. Very often it can
| also be used as a "map". So the number of keys is essentially
| unbounded and just depends on the size of the data.
| mpweiher wrote:
| Erm, yes, structs seem to be the use case this is consciously
| and very deliberately aiming at:
|
| _SICK: Streams of Independent Constant Keys_
|
| And "maps" seem to be a use case it is deliberately not
| aiming at.
| zamadatix wrote:
| Isn't the example in the readme just a smaller map-style
| object instead of a larger one?
| jerf wrote:
| Every pathological case you can imagine is something someone
| somewhere has done.
|
| Sticking data into the keys is definitely a thing I've seen.
|
| One I've done personally is dump large portions of a Redis DB
| into a JSON object. I could guarantee for my use case it
| would fit into the relevant memory and resource constraints
| but I would also have been able to guarantee it would exceed
| 64K keys by over an order of magnitude. "Best practices"
| didn't matter to me because this wasn't an API call result or
| something.
|
| There are other things like this you'll find in the wild.
| Certainly some sort of "keyed by user" dump value is not
| unheard of and you can easily have more than 64K users, and
| there's nothing _a priori_ wrong with that. It may be a bad
| solution for some _specific_ reason, and I think it often is,
| but it is not automatically _a priori_ wrong. I've written
| streaming support for both directions, so while JSON may not
| be optimal it is not necessarily a guarantee of badness. Plus
| with the computers we have nowadays sometimes "just
| deserialize the 1GB of JSON into RAM" is a perfectly valid
| solution for some case. You don't want to do that a thousand
| times per second, but not every problem is a "thousand times
| per second" problem.
| Groxx wrote:
| Redis is a good point; I've made MANY >64k-key maps there in
| the past, some up to half a million (and likely more if we
| hadn't rearchitected before we got bigger).
| pests wrote:
| re: storing data in keys
|
| FoundationDB makes extensive use of this pattern, sometimes
| with no data on the key at all.
| paulddraper wrote:
| That's like saying it's illogical to have 65k elements in an
| array.
|
| What is the difference?
| pshirshov wrote:
| If the limitation affects your use case, you can chunk your
| structures.
|
| The limitation comes with benefits.
| paulddraper wrote:
| Fair enough. Implementation details matter.
|
| I was just responding to the "X is an absurd way to do
| JSON". Which seemed to single out objects vs arrays.
|
| Like in this case maybe, but I don't see a reason to make
| that general statement.
| jiggawatts wrote:
| I do not miss having to use "near" and "far" pointers in
| 16-bit mode C++ programming!
| MangoToupe wrote:
| Data shape often outlives the original intentions.
| nine_k wrote:
| I'd say that it's generally unwise to use fixed-width integers
| in a data structure where this width can vary widely, and has
| no logical upper limit. Arbitrary-size integers are well known,
| used in practice, and not hard to implement.
| pshirshov wrote:
| Even if it's "generally unwise", it was a well-considered
| decision in this particular case. See my other comments. An
| array of elements of constant size is indexed for free; an
| array of elements of varying size needs a separate index.
| SICK's binary representation (EBA) was created with several
| particular use cases in mind. I needed the most compact
| representation and the fastest access (for very underpowered
| devices); large objects were not a big concern, as they can
| be chunked externally.
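|
| The "indexed for free" part is plain offset arithmetic. A
| sketch (assuming fixed 2-byte entries; not SICK's actual
| layout):
|
|       // With constant-size entries the i-th element sits at a
|       // computable offset: no scan and no separate index.
|       def entryOffset(base: Int, i: Int, entrySize: Int = 2): Int =
|         base + i * entrySize
|
| With variable-size entries, the offset of element i depends
| on the sizes of all elements before it, so you either scan or
| store a separate index.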
| eknkc wrote:
| .NET has a polymorphic serializer where the output JSON
| contains a $type field for the deserializer to choose the
| concrete type.
|
| It needs to be the very first key in the object. I've been
| bitten by this because PostgreSQL's jsonb also does not
| preserve key ordering.
|
| I believe the latest .NET release addresses this, but key
| ordering does matter sometimes.
| pshirshov wrote:
| When order is important it can be maintained by an external
| layer, e.g. with an encoding into a list of pairs.
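|
| E.g. a thin wrapper that rewrites every object into an array
| of [key, value] pairs, since arrays do preserve order. A
| sketch with a made-up mini-AST (not SICK's API):
|
|       sealed trait Js
|       final case class JStr(s: String)                  extends Js
|       final case class JNum(n: Double)                  extends Js
|       final case class JArr(items: List[Js])            extends Js
|       final case class JObj(fields: List[(String, Js)]) extends Js
|
|       // {"b": 1, "a": 2}  ->  [["b", 1], ["a", 2]]
|       def ordered(v: Js): Js = v match {
|         case JObj(fields) =>
|           JArr(fields.map { case (k, x) =>
|             JArr(List(JStr(k), ordered(x))) })
|         case JArr(items) => JArr(items.map(ordered))
|         case other       => other
|       }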
| ericmcer wrote:
| Isn't the order of JSON keys not guaranteed by the official
| spec? I don't remember when I learned that but I have always
| behaved as if that cannot be relied upon.
| btschaegg wrote:
| Since RFC 4627 (the original):
|
| > An object is an unordered collection of zero or more
| name/value pairs, [...]
|
| Further, since RFC 7159:
|
| > JSON parsing libraries have been observed to differ as to
| whether or not they make the ordering of object members
| visible to calling software. Implementations whose behavior
| does not depend on member ordering will be interoperable in
| the sense that they will not be affected by these
| differences.
|
| Both are in the current version (RFC 8259).
|
| OTOH, I find the "but the order is not supposed to be
| guaranteed!" debate _REALLY_ stupid when it comes to software
| where it's clear that at some point, a human will have to
| look at the content and correlate it with another system.
|
| There's nothing more evil than re-shuffling JSON just for the
| fun of it and making everyone who has to look at the result
| miserable. Yes, I'm talking about you, ELK devs.
|
| Edit: (And/or whoever wrote the underlying Java/Go libs they
| use for JSON that don't allow developers to patch ordering
| in. I remember reading GitHub issues about this.)
| harrall wrote:
| Less than 1% of the hash maps I use have ever needed order.
|
| The underlying data structures between both are different.
| Ordered hash maps use more memory, are slower, and are more
| complicated.
|
| Knowing CS fundamentals, using an ordered hash map should
| be a deliberate choice like renting a box truck when you
| need to move a lot of stuff. Don't just drive a box truck
| everywhere because you might pick up a couch from a thrift
| store one day.
| btschaegg wrote:
| All well and true if all you have to do is process the
| data programmatically.
|
| And yet, as I said, if the same thinking gets applied to
| e.g. a store of JSON documents (like ELK), chances are
| good the thing will ruin the UX for countless people who
| have to deal with the result. Note that you need exactly
| _no_ hash maps to store the JSON as it is _text_.
|
| To expand your analogy: ...and yet roads are built so
| that you can drive your regular car _or_ a box car over
| them, depending on your use case. You make the choice. A
| JSON library that doesn't afford such choices (and isn't
| hyper focused on performance) isn't a good one in my
| book.
|
| Edit: As a sidenote: Or do you mean a freight train
| wagon? Then replace "road" with "rails" and "car" with
| "draisine" :)
| pcthrowaway wrote:
| An object is unordered, but most (nearly all?) parsing
| software will preserve the order, and some software may rely
| on that behaviour, so it's worth mentioning.
| pshirshov wrote:
| No, according to the spec the order is not guaranteed, but
| most parsers preserve it (or maintain some order based on
| side effects) and engineers rely on that (and sometimes that
| backfires).
|
| Essentially, SICK also maintains some strange order, based on
| the values of a homegrown trivial hash function, but the only
| right approach to JSON objects is to treat their keys as an
| unordered set.
| noobcoder wrote:
| It could be really useful for cases where you're repeatedly
| processing similar JSON structures, as with analytical
| events. But are there any plans for language bindings beyond
| the current implementation?
| pshirshov wrote:
| > any plans for language bindings beyond the current
| implementation?
|
| If we need more bindings for the projects we work on, we will
| implement and open-source them. E.g. recently we added
| rudimentary JS support (no cursors, just an encoder/decoder).
|
| For many reasons, we avoid working on things we don't use
| ourselves and are not paid for. But your contributions are
| very welcome. Also, we would be happy to have you as a paying
| client.
| slashdave wrote:
| So... isn't this just a database, and a bespoke one at that?
| pshirshov wrote:
| Yes, it's a specific JSON representation and a tiny database
| based on it.
| IshKebab wrote:
| This sounds quite similar to Amazon Ion, which is one of the
| few binary JSON formats that allow sparse reads and
| deduplicated keys.
|
| However I found that in cases where you have the requirements of
| streaming/random access and every entry is the same... SQLite is
| a really great choice. It's way faster and more space efficient
| than it has any right to be, and it gets you proper random access
| (not just efficient streaming), and there are nice GUIs for it.
| And you can query it.
| pshirshov wrote:
| SQLite was assessed for our use case. SQLite representations
| of our data were much bigger, it was much slower with our
| access patterns, and we didn't need all the power of SQL.
| kordlessagain wrote:
| Use for vectors and stream smaller projections first?
| Retr0id wrote:
| I started on something a bit like this, but using sqlite instead
| of a custom serialisation format:
| https://github.com/DavidBuchanan314/dag-sqlite (it's intended to
| store DAG-CBOR objects but it is trivial to represent a JSON
| object as DAG-CBOR)
| pshirshov wrote:
| SQLite was a bit too heavy for our use case, unfortunately.
| We tried it as one of the first options.
___________________________________________________________________
(page generated 2025-10-28 23:00 UTC)