[HN Gopher] Succinct Data Structures
___________________________________________________________________
Succinct Data Structures
Author : pavel_lishin
Score : 198 points
Date : 2025-03-06 17:48 UTC (5 hours ago)
(HTM) web link (blog.startifact.com)
(TXT) w3m dump (blog.startifact.com)
| svachalek wrote:
| Wow, this is really fascinating. I guess it all comes down to how
| it's doing select and rank in constant time, which is probably
| some clever bit arithmetic. I'll have to look into how that
| works.
| zellyn wrote:
| Some of it is moving or has moved down to the instruction sets:
| https://vaibhavsagar.com/blog/2019/09/08/popcount/
| judofyr wrote:
| I can speak a bit about one of the approaches: "Practical
| Entropy-Compressed Rank/Select Dictionary" by Daisuke Okanohara
| and Kunihiko Sadakane. This presents two different variants: one
| for dense bitsets (i.e. more than 50% of the bits are set) and
| one for sparse ones.
|
| The dense implementation is based on partitioning the bitset
| into "blocks" of a given size, so that you can find the right
| block with `block[idx / block_size]`. Each block is then further
| grouped into sub-blocks, which narrow the position down even
| more. All of these are auxiliary data structures (very much like
| an index) which are stored next to the regular bitset. You use
| the blocks/sub-blocks to find roughly where in the bitset you
| are and then use an algorithm for finding the value within a
| given machine-word.
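|
| A minimal sketch of the dense idea in Go (hypothetical names,
| one level of blocks only; the paper adds sub-blocks and packs
| everything much tighter):
|
|     package main
|
|     import (
|         "fmt"
|         "math/bits"
|     )
|
|     // One level of 512-bit (8-word) blocks over the bitset.
|     type RankBitset struct {
|         words     []uint64 // the bitset itself
|         blockRank []uint32 // 1-bits before each block
|     }
|
|     func NewRankBitset(words []uint64) *RankBitset {
|         r := &RankBitset{words: words}
|         var total uint32
|         for w, word := range words {
|             if w%8 == 0 {
|                 r.blockRank = append(r.blockRank, total)
|             }
|             total += uint32(bits.OnesCount64(word))
|         }
|         return r
|     }
|
|     // Rank1 counts the 1-bits strictly before position i,
|     // assuming 0 <= i < 64*len(words).
|     func (r *RankBitset) Rank1(i int) int {
|         w := i / 64
|         rank := int(r.blockRank[w/8])
|         for j := (w / 8) * 8; j < w; j++ {
|             rank += bits.OnesCount64(r.words[j])
|         }
|         mask := uint64(1)<<(uint(i)%64) - 1
|         return rank + bits.OnesCount64(r.words[w]&mask)
|     }
|
|     func main() {
|         b := NewRankBitset([]uint64{0b10110})
|         fmt.Println(b.Rank1(5)) // 3: bits 1, 2, 4 set below 5
|     }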
|
| The sparse implementation treats the bitset as an ordered list
| of numbers (e.g. 100001001 is interpreted as 0, 5, 8) and then
| stores those numbers using Elias-Fano encoding. The upper bits
| of the Elias-Fano encoding happen to be dense, so we can reuse
| the dense implementation on those upper bits and then combine
| the result with the lower bits.
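|
| A toy version of that split in Go (assuming one low bit per
| value; real encoders choose about log2(u/n) low bits and pack
| both halves into machine words):
|
|     package main
|
|     import "fmt"
|
|     func main() {
|         values := []int{0, 5, 8} // 100001001 as positions
|         const l = 1              // low bits per value
|         var lows []int
|         high := make([]byte, 8) // dense, unary-coded gaps
|         for i, v := range values {
|             lows = append(lows, v&(1<<l-1))
|             high[(v>>l)+i] = 1 // set bit (v >> l) + i
|         }
|         fmt.Println(lows, high)
|         // [0 1 0] [1 0 0 1 0 0 1 0]
|         // Decode: v_i = (select1(high, i) - i) << l | lows[i],
|         // so dense select on `high` recovers the positions.
|     }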
|
| I'm also aware of https://arxiv.org/abs/1706.00990 which is
| more about how to do this most efficiently at a machine-word
| level.
|
| "Engineering Compact Data Structures for Rank and Select
| Queries on Bit Vectors" is another quite recent paper which I
| haven't fully digested yet.
| kccqzy wrote:
| I first heard of the concept of succinct data structures from
| Edward Kmett, a famous Haskeller behind many popular Haskell
| libraries. He gave a talk on succinct data structures a long time
| ago: http://youtu.be/uA0Z7_4J7u8
| tux3 wrote:
| I really like the article, but it would benefit from some numbers
| or complexity estimates to get some intuitive sense of what the
| cost is.
|
| Am I paying 30% overhead for this particular index or that
| wavelet matrix? Is it double the memory use? Or is it O(log N)?
| No idea! "doesn't use much more space" could mean a lot of
| different things!
| judofyr wrote:
| "Succinct data structure" does have a very strict definition
| which probably answers some of your questions:
| https://en.wikipedia.org/wiki/Succinct_data_structure. It's all
| about being close to the theoretical minimum.
|
| > Am I paying 30% overhead for this particular index or that
| wavelet matrix?
|
| Nope! That would not fit the definition. That would be a
| "compact data structure" according to this definition.
| jltsiren wrote:
| You should be careful when using asymptotic bounds with more
| precision than about O(sqrt(n)). The bounds ignore constant
| factors, and the constant factors could be more significant
| than slowly growing non-constant factors for reasonable
| values of n.
|
| It's also very common in algorithm design that improving the
| asymptotic bounds and making the algorithm faster (or the
| data structure smaller) are orthogonal (or even opposite)
| goals. Real computers have complex performance
| characteristics and fixed word lengths, and it's rarely a
| good idea to implement a theoretical algorithm exactly as
| described.
|
| Succinct data structures often have a number of internal
| parameters. In a theoretical parameterization, those
| parameters may be described as being O(log n), O(log^2 n), or
| O(log log n). In a concrete implementation, it may be a good
| idea to use constant approximations for some nontrivial (such
| as 2x) performance gains. O(log n) could become 32 or 64,
| O(log^2 n) could become 1024 (or a power-of-2 multiple), and
| O(log log n) could become 4 or 8.
|
| And then, if you parameterize the succinct data structure
| with these constants, the space overhead becomes a constant
| fraction.
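|
| To make that last point concrete: fixing the rank counters at
| 32 bits per 512-bit block costs 32/512 = 6.25% extra space,
| independent of n.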
| judofyr wrote:
| Succinct data structures are very fun! If anyone is interested,
| I've implemented some of this in Zig:
| https://github.com/judofyr/zini. The main thing this implements
| is a minimal perfect hash function which uses less than 4 bits
| per element (and can respond to queries in ~50 ns). As a part of
| that I ended up implementing one of these indexes for constant-
| time select(n):
| https://github.com/judofyr/zini/blob/main/src/darray.zig.
|
| It feels kinda magic implementing these algorithms because
| everything becomes _so tiny_!
| thomasmg wrote:
| For Java, C++, and Rust there is also https://sux.di.unimi.it/
| maintained by Sebastiano Vigna, a professor from Italy.
|
| Together with his student (I also helped a bit), he wrote a
| paper about RecSplit, a minimal perfect hash function (MPHF)
| algorithm I invented: https://arxiv.org/abs/1910.06416 - that
| algorithm uses around 1.56 bits per key, but it is quite slow. I
| presented the paper at a conference in January 2020, right
| before the pandemic.
|
| The algorithm with the least memory usage (and much faster as
| well) is now Consensus-RecSplit:
| https://arxiv.org/abs/2502.05613 - it can do 1.44 bits per key,
| which is right at the theoretical minimum (meaning, there is no
| good way to shrink it further). At least a small part of my
| original invention is still used there, nice. The fastest
| current MPHF is probably PtrHash
| https://arxiv.org/abs/2502.15539 - both papers were published
| last month (February 2025) by the way.
| rurban wrote:
| I'm working on making pthash faster and more practical. I can
| compile the data and code to C++, and efficiently store the keys
| as well, to be able to eliminate false positives.
|
| https://github.com/rurban/pthash
| mzs wrote:
| The word count seems artificially increased in the post. Here's a
| succinct explanation:
| https://www.eecs.tufts.edu/~aloupis/comp150/projects/Succinc...
| pegasus wrote:
| I didn't think that at all. In fact I found it very readable
| and prefer it over the drier presentation you linked. To each
| their own, I guess, but there's really no need to infer ulterior
| motives.
| abetusk wrote:
| My go-to library for succinct data structures is SDSL-Lite [0].
|
| [0] https://github.com/simongog/sdsl-lite
| qazxcvbnm wrote:
| Note that succinct data structures may not be faster than
| conventional structures if your dataset fits in memory
| http://www.cs.cmu.edu/~huanche1/slides/FST.pdf . Of course, for
| large datasets where storage access times dominate, succinct data
| structures win all around. In any case, succinct trees are works
| of art (if I recall, https://arxiv.org/pdf/1805.11255 was a good
| exposition; just look at how that RMQ works!)
| yvdriess wrote:
| True, but it depends on what you mean by fitting in memory.
|
| Succinct data structures are used in genomics (e.g. bwa, the
| megahit exome sequencer) because N is so large that you're
| actually hitting asymptotic behavior.
|
| For memory latency, it can help by making your memory footprint
| fit in the LLC or a single node; cross-node NUMA latencies are
| typically enough to absolutely tank performance.
|
| It can theoretically also help in massively parallel access
| situations where bandwidth becomes the limiting concern,
| although I suspect we would need near-memory hardware to perform
| the rank+select. Otherwise the latency of the multiple dependent
| accesses will kill your performance again, cf. the previous
| point.
|
| With a lot of parallel accesses, bandwidth could also be an
| issue in conventional structures.
| lostmsu wrote:
| Way better detailed explanation:
| https://stackoverflow.com/questions/72580828/what-is-a-succi...
| jbreckmckye wrote:
| That IS excellent - thank you
| topspin wrote:
| Yes, that's great. This part:
|
| "Intuitively, a succinct data structure is one whose space
| usage equals the space needed to write out the data, plus
| something that grows more slowly than that. If you're familiar
| with little-o notation, a succinct data structure is one whose
| space usage is X + o(X), where X is the number of bits needed
| to write out the data itself."
|
| Brings to mind COBS encoding, which does this for streams of
| bytes containing arbitrary-length "packets" or similar.
| cxie wrote:
| Just spent my morning diving into succinct data structures after
| seeing this. The memory efficiency is incredible - especially the
| balanced parentheses tree representing a full node tree in just 2
| bits per node! I've been working on a project parsing large
| (10GB+) XML files for scientific data analysis, and our current
| approach burns through RAM like crazy. Has anyone here
| successfully implemented these structures in production systems?
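|
| For intuition, here is the parentheses shape and one navigation
| primitive as a toy Go sketch (a naive scan; real succinct trees
| answer findClose in O(1) via a small auxiliary index):
|
|     package main
|
|     import "fmt"
|
|     // Each node is one "(...)" pair: 2 bits per node.
|     // findClose returns the ')' matching the '(' at pos.
|     func findClose(bp string, pos int) int {
|         depth := 0
|         for i := pos; i < len(bp); i++ {
|             if bp[i] == '(' {
|                 depth++
|             } else {
|                 depth--
|             }
|             if depth == 0 {
|                 return i
|             }
|         }
|         return -1
|     }
|
|     func main() {
|         // Root with two children; the first has one child.
|         bp := "((())())"
|         fmt.Println(findClose(bp, 0)) // 7: whole tree
|         fmt.Println(findClose(bp, 1)) // 4: first subtree
|     }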
|
| The wavelet matrix concept seems particularly promising for our
| text-heavy workloads. I'm curious if the constant-time operations
| actually hold up under real-world conditions or if there are
| hidden performance cliffs.
|
| This feels like one of those CS concepts that should be more
| widely known but somehow got overlooked by mainstream
| programming. Kind of like how bloom filters were obscure until
| suddenly every system was using them.
| MortyWaves wrote:
| When I've dealt with huge files in .NET, the usual approach is
| to stream the file such that only some of it is in memory at
| once. This way you can process files hundreds of GBs. Of
| course, if you truly need them all in memory at once for some
| reason I genuinely can't think of, then you'd need something
| else.
|
| Does your language have the concept of streaming files?
| crazygringo wrote:
| If you're streaming something row-based like a CSV, or a
| zipped CSV, then that's usually easy.
|
| But when you get to hierarchical data structures like
| JSON/protobuf there very often simply isn't a streaming
| library available. There's a library function to decode the
| whole thing into an object in memory, and that's all.
|
| Nothing prevents streaming in theory, it's just far more
| complicated to write that library.
| dilap wrote:
| protobuf sure, but streaming libraries for json (and xml,
| as in the parent) are extremely common. not harder (maybe
| even easier) than non-streaming to write, tho more
| cumbersome to use, so something you'd reach for only if you
| specifically need it ('cuz of memory constraints)
|
| e.g. standard go json library
| https://pkg.go.dev/encoding/json#example-Decoder.Decode-
| Stre...
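|
| Roughly, the decoder pattern looks like this (a minimal sketch
| reading an array one element at a time):
|
|     package main
|
|     import (
|         "encoding/json"
|         "fmt"
|         "strings"
|     )
|
|     func main() {
|         const data = `[{"id":1},{"id":2},{"id":3}]`
|         dec := json.NewDecoder(strings.NewReader(data))
|         dec.Token() // consume the opening '['
|         for dec.More() {
|             var item struct {
|                 ID int `json:"id"`
|             }
|             if err := dec.Decode(&item); err != nil {
|                 panic(err)
|             }
|             fmt.Println(item.ID) // one element in memory
|         }
|         dec.Token() // consume the closing ']'
|     }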
| crazygringo wrote:
| Yup. I don't remember streaming JSON being common in the
| early days but now it is. But the absence of streaming
| protobuf is what has killed me, when dealing with
| gigantic protobuf files from government agencies (ugh).
| MortyWaves wrote:
| > This is a field that seems to have emerged in computer science
| relatively recently; many of the important data structures were
| invented in the last 25 years.
|
| This is crazy!
| sujayakar wrote:
| I really love this space: Navarro's book is an excellent survey.
|
| Erik Demaine has a few great lectures on succinct data structures
| too: L17 and L18 on
| https://courses.csail.mit.edu/6.851/spring12/lectures/
| yurivish wrote:
| I also emailed Gonzalo Navarro once to ask a question, and we had
| a great discussion and ended up writing a paper together about
| the answer. [1]
|
| Another paper of his that I really like combines a few elegant
| ideas into a simple implementation of bitvector rank/select:
| https://users.dcc.uchile.cl/~gnavarro/ps/sea12.1.pdf
|
| During this time I got really excited about succinct data
| structures and wrote a Rust library implementing many bitvector
| types and a wavelet matrix. [2]
|
| My interest came from a data visualization perspective -- I was
| curious if space-efficient data structures could fundamentally
| improve the interactive exploration of large datasets on the
| client side. Happy to chat about that if anyone's curious.
|
| [1] Paper:
| https://archive.yuri.is/pdfing/weighted_range_quantile_queri...
| though it's pretty hard to understand without some background
| context. I've been meaning to write a blog post explaining the
| core contribution, which is a simple tweak to one of Navarro's
| textbook data structures.
|
| [2] The Rust version is here: https://github.com/yurivish/made-
| of-bits/tree/main/rust-play... and an earlier pure-JS
| implementation is here: https://github.com/yurivish/made-of-
| bits/tree/main
| sitkack wrote:
| Reading a Gonzalo Navarro paper is like going for a walk, taking
| a shower and having a wonderful coffee. It literally sets the
| mind on fire.
|
| https://dblp.org/pid/n/GonzaloNavarro.html
| eqvinox wrote:
| There's a relative of this in making in-memory node
| representation efficient for large structs that have a bunch of
| rarely-used fields: chunk up memory in units (most reasonably 8
| bytes/pointer size), allocate offsets for rarely-used fields in
| ascending order, and then use bitmasks to indicate which fields
| are present. (Note the bits correspond to units, not fields; a
| 16-byte field would use 2 adjacent bits in the bitmask.)
|
| The trick is that masking & popcount (both low-cycle CPU
| instructions on most modern CPUs [1]) make this quite fast to
| access and thus suitable for in-memory representation.
|
| The intended use is when presence of optional fields is known at
| allocation time and doesn't change afterwards, i.e. some object
| is built up into a dynamically allocated buffer whose size is
| shrunk down by omitting fields. Changing which fields are present
| afterwards requires reallocating the thing, which tends to make
| the entire pattern pointless.
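|
| In Go, the access path looks roughly like this (a hypothetical
| layout, just to illustrate the popcount trick):
|
|     package main
|
|     import (
|         "fmt"
|         "math/bits"
|     )
|
|     type SparseObj struct {
|         mask  uint64   // bit i set => unit i is present
|         units []uint64 // present units, in ascending order
|     }
|
|     // Get returns unit i and whether it is present.
|     func (o *SparseObj) Get(i uint) (uint64, bool) {
|         if o.mask&(1<<i) == 0 {
|             return 0, false
|         }
|         // Offset = number of set bits below bit i.
|         idx := bits.OnesCount64(o.mask & (1<<i - 1))
|         return o.units[idx], true
|     }
|
|     func main() {
|         // Units 0, 3, 5 present; 1, 2, 4 omitted.
|         o := &SparseObj{mask: 0b101001,
|             units: []uint64{10, 30, 50}}
|         fmt.Println(o.Get(3)) // 30 true
|         fmt.Println(o.Get(4)) // 0 false
|     }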
|
| [1] The real annoyance here is that _almost_ all x86_64 CPUs
| have POPCNT, except the absolute earliest ones, but if you build
| e.g. some package for a Linux distribution without any CPU
| optimization flags it'll use a library fallback popcount routine
| rather than the instruction :(
| bo1024 wrote:
| With advanced CS topics, it often works to search for "<topic>
| lecture notes".
___________________________________________________________________
(page generated 2025-03-06 23:00 UTC)