[HN Gopher] Succinct Data Structures
___________________________________________________________________
Succinct Data Structures
Author : pavel_lishin
Score : 198 points
Date : 2025-03-06 17:48 UTC (5 hours ago)
(HTM) web link (blog.startifact.com)
(TXT) w3m dump (blog.startifact.com)
| svachalek wrote:
| Wow, this is really fascinating. I guess it all comes down to how
| it's doing select and rank in constant time, which is probably
| some clever bit arithmetic. I'll have to look into how that
| works.
| zellyn wrote:
| Some of it is moving or has moved down to the instruction sets:
| https://vaibhavsagar.com/blog/2019/09/08/popcount/
| judofyr wrote:
| I can speak a bit about one of the approaches: "Practical
| Entropy-Compressed Rank/Select Dictionary" by Daisuke Okanohara
| and Kunihiko Sadakane. This presents two different variants: one
| for dense bitsets (i.e. more than 50% of the bits are set) and
| one for sparse ones.
|
| The dense implementation is based on partitioning the bitset
| into "blocks" of a given size, so that you can find the right
| block with `block[idx / block_size]`. Each block is then further
| grouped into sub-blocks, which narrow the position down even
| more. All of these are auxiliary data structures (very much like
| an index) which are stored next to the regular bitset. You use
| the blocks/sub-blocks to find roughly where in the bitset you
| are and then use an algorithm for finding the value within a
| given machine-word.
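|
| A minimal sketch of the dense idea in Go (hypothetical names,
| one level of blocks only; the paper adds sub-blocks and packs
| everything much tighter):
|
|     package main
|
|     import (
|         "fmt"
|         "math/bits"
|     )
|
|     // One level of 512-bit (8-word) blocks over the bitset.
|     type RankBitset struct {
|         words     []uint64 // the bitset itself
|         blockRank []uint32 // 1-bits before each block
|     }
|
|     func NewRankBitset(words []uint64) *RankBitset {
|         r := &RankBitset{words: words}
|         var total uint32
|         for w, word := range words {
|             if w%8 == 0 {
|                 r.blockRank = append(r.blockRank, total)
|             }
|             total += uint32(bits.OnesCount64(word))
|         }
|         return r
|     }
|
|     // Rank1 counts the 1-bits strictly before position i,
|     // assuming 0 <= i < 64*len(words).
|     func (r *RankBitset) Rank1(i int) int {
|         w := i / 64
|         rank := int(r.blockRank[w/8])
|         for j := (w / 8) * 8; j < w; j++ {
|             rank += bits.OnesCount64(r.words[j])
|         }
|         mask := uint64(1)<<(uint(i)%64) - 1
|         return rank + bits.OnesCount64(r.words[w]&mask)
|     }
|
|     func main() {
|         b := NewRankBitset([]uint64{0b10110})
|         fmt.Println(b.Rank1(5)) // 3: bits 1, 2, 4 set below 5
|     }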
|
| The sparse implementation treats the bitset as an ordered list
| of numbers (e.g. 100001001 is interpreted as 0, 5, 8) and then
| stores those numbers using Elias-Fano encoding. The upper bits
| of the Elias-Fano encoding happen to be dense, so we can reuse
| the dense implementation on those upper bits and then combine
| the result with the lower bits.
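|
| A toy version of that split in Go (assuming one low bit per
| value; real encoders choose about log2(u/n) low bits and pack
| both halves into machine words):
|
|     package main
|
|     import "fmt"
|
|     func main() {
|         values := []int{0, 5, 8} // 100001001 as positions
|         const l = 1              // low bits per value
|         var lows []int
|         high := make([]byte, 8) // dense, unary-coded gaps
|         for i, v := range values {
|             lows = append(lows, v&(1<<l-1))
|             high[(v>>l)+i] = 1 // set bit (v >> l) + i
|         }
|         fmt.Println(lows, high)
|         // [0 1 0] [1 0 0 1 0 0 1 0]
|         // Decode: v_i = (select1(high, i) - i) << l | lows[i],
|         // so dense select on `high` recovers the positions.
|     }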
|
| I'm also aware of https://arxiv.org/abs/1706.00990 which is
| more about how to do this most efficiently at a machine-word
| level.
|
| "Engineering Compact Data Structures for Rank and Select
| Queries on Bit Vectors" is another quite recent paper which I
| haven't fully digested yet.
| kccqzy wrote:
| I first heard of the concept of succinct data structures from
| Edward Kmett, a famous Haskeller behind many popular Haskell
| libraries. He gave a talk on succinct data structures a long time
| ago: http://youtu.be/uA0Z7_4J7u8
| tux3 wrote:
| I really like the article, but it would benefit from some numbers
| or complexity estimates to get some intuitive sense of what the
| cost is.
|
| Am I paying 30% overhead for this particular index or that
| wavelet matrix? Is it double the memory use? Or is it O(log N)?
| No idea! "doesn't use much more space" could mean a lot of
| different things!
| judofyr wrote:
| "Succinct data structure" does have a very strict definition
| which probably answers some of your questions:
| https://en.wikipedia.org/wiki/Succinct_data_structure. It's all
| about being close to the theoretical minimum.
|
| > Am I paying 30% overhead for this particular index or that
| wavelet matrix?
|
| Nope! That would not fit the definition. That would be a
| "compact data structure" according to this definition.
| jltsiren wrote:
| You should be careful when using asymptotic bounds with more
| precision than about O(sqrt(n)). The bounds ignore constant
| factors, and the constant factors could be more significant
| than slowly growing non-constant factors for reasonable
| values of n.
|
| It's also very common in algorithm design that improving the
| asymptotic bounds and making the algorithm faster (or the
| data structure smaller) are orthogonal (or even opposite)
| goals. Real computers have complex performance
| characteristics and fixed word lengths, and it's rarely a
| good idea to implement a theoretical algorithm exactly as
| described.
|
| Succinct data structures often have a number of internal
| parameters. In a theoretical parameterization, those
| parameters may be described as being O(log n), O(log^2 n), or
| O(log log n). In a concrete implementation, it may be a good
| idea to use constant approximations for some nontrivial (such
| as 2x) performance gains. O(log n) could become 32 or 64,
| O(log^2 n) could become 1024 (or a power-of-2 multiple), and
| O(log log n) could become 4 or 8.
|
| And then, if you parameterize the succinct data structure
| with these constants, the space overhead becomes a constant
| fraction.
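|
| To make that last point concrete: fixing the rank counters at
| 32 bits per 512-bit block costs 32/512 = 6.25% extra space,
| independent of n.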
| judofyr wrote:
| Succinct data structures are very fun! If anyone is interested,
| I've implemented some of this in Zig:
| https://github.com/judofyr/zini. The main thing this implements
| is a minimal perfect hash function which uses less than 4 bits
| per element (and can respond to queries in ~50 ns). As a part of
| that I ended up implementing one of these indexes for constant-
| time select(n):
| https://github.com/judofyr/zini/blob/main/src/darray.zig.
|
| It feels kinda magic implementing these algorithms because
| everything becomes _so tiny_!
| thomasmg wrote:
| For Java, C++, and Rust there is also https://sux.di.unimi.it/
| maintained by Sebastiano Vigna, a professor from Italy.
|
| Together with his student (I also helped a bit), he wrote a
| paper about RecSplit, a minimal perfect hash function (MPHF)
| algorithm I invented: https://arxiv.org/abs/1910.06416 - that
| algorithm uses around 1.56 bits per key, but it is quite slow. I
| presented the paper at a conference in January 2020, right
| before the pandemic.
|
| The algorithm with the least memory usage (and much faster as
| well) is now Consensus-RecSplit:
| https://arxiv.org/abs/2502.05613 - it can do 1.44 bits per key,
| which is right at the theoretical minimum (meaning, there is no
| good way to shrink it further). At least a small part of my
| original invention is still used there, nice. The fastest
| current MPHF is probably PtrHash
| https://arxiv.org/abs/2502.15539 - both papers were published
| last month (February 2025) by the way.
| rurban wrote:
| I'm working on making pthash faster and more practical. I can
| compile the data and code to C++, and efficiently store the keys
| as well, to be able to eliminate false positives.
|
| https://github.com/rurban/pthash
| mzs wrote:
| The word count seems artificially increased in the post. Here's a
| succinct explanation:
| https://www.eecs.tufts.edu/~aloupis/comp150/projects/Succinc...
| pegasus wrote:
| I didn't think that at all. In fact I found it very readable
| and prefer it over the drier presentation you linked. To each
| their own, I guess, but there's really no need to infer ulterior
| motives.
| abetusk wrote:
| My go-to library for succinct data structures is SDSL-Lite [0].
|
| [0] https://github.com/simongog/sdsl-lite
| qazxcvbnm wrote:
| Note that succinct data structures may not be faster than
| conventional structures if your dataset fits in memory
| http://www.cs.cmu.edu/~huanche1/slides/FST.pdf . Of course, for
| large datasets where storage access times dominate, succinct data
| structures win all around. In any case, succinct trees are works
| of art (if I recall, https://arxiv.org/pdf/1805.11255 was a good
| exposition; just look at how that RMQ works!)
| yvdriess wrote:
| True, but it depends on what you mean by fitting in memory.
|
| Succinct data structures are used in genomics (e.g. bwa, the
| megahit exome sequencer) because N is so large that you're
| actually hitting asymptotic behavior.
|
| For memory latency, it can help by making your memory footprint
| fit in the LLC or a single node; cross-node NUMA latencies are
| typically enough to absolutely tank performance.
|
| It can theoretically also help in massively parallel access
| situations where bandwidth becomes the limiting concern,
| although I suspect we would need near-memory hardware to perform
| the rank+select. Otherwise the latency of the multiple dependent
| accesses will kill your performance again, cf. the previous
| point.
|
| With a lot of parallel accesses, bandwidth could also be an
| issue in conventional structures.
| lostmsu wrote:
| Way better detailed explanation:
| https://stackoverflow.com/questions/72580828/what-is-a-succi...
| jbreckmckye wrote:
| That IS excellent - thank you
| topspin wrote:
| Yes, that's great. This part:
|
| "Intuitively, a succinct data structure is one whose space
| usage equals the space needed to write out the data, plus
| something that grows more slowly than that. If you're familiar
| with little-o notation, a succinct data structure is one whose
| space usage is X + o(X), where X is the number of bits needed
| to write out the data itself."
|
| Brings to mind COBS encoding, which does this for streams of
| bytes containing arbitrary-length "packets" or similar.
| cxie wrote:
| Just spent my morning diving into succinct data structures after
| seeing this. The memory efficiency is incredible - especially the
| balanced parentheses tree representing a full node tree in just 2
| bits per node! I've been working on a project parsing large
| (10GB+) XML files for scientific data analysis, and our current
| approach burns through RAM like crazy. Has anyone here
| successfully implemented these structures in production systems?
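|
| For intuition, here is the parentheses shape and one navigation
| primitive as a toy Go sketch (a naive scan; real succinct trees
| answer findClose in O(1) via a small auxiliary index):
|
|     package main
|
|     import "fmt"
|
|     // Each node is one "(...)" pair: 2 bits per node.
|     // findClose returns the ')' matching the '(' at pos.
|     func findClose(bp string, pos int) int {
|         depth := 0
|         for i := pos; i < len(bp); i++ {
|             if bp[i] == '(' {
|                 depth++
|             } else {
|                 depth--
|             }
|             if depth == 0 {
|                 return i
|             }
|         }
|         return -1
|     }
|
|     func main() {
|         // Root with two children; the first has one child.
|         bp := "((())())"
|         fmt.Println(findClose(bp, 0)) // 7: whole tree
|         fmt.Println(findClose(bp, 1)) // 4: first subtree
|     }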
|
| The wavelet matrix concept seems particularly promising for our
| text-heavy workloads. I'm curious if the constant-time operations
| actually hold up under real-world conditions or if there are
| hidden performance cliffs.
|
| This feels like one of those CS concepts that should be more
| widely known but somehow got overlooked by mainstream
| programming. Kind of like how bloom filters were obscure until
| suddenly every system was using them.
| MortyWaves wrote:
| When I've dealt with huge files in .NET, the usual approach is
| to stream the file such that only some of it is in memory at
| once. This way you can process files hundreds of GBs. Of
| course, if you truly need them all in memory at once for some
| reason I genuinely can't think of, then you'd need something
| else.
|
| Does your language have the concept of streaming files?
| crazygringo wrote:
| If you're streaming something row-based like a CSV, or a
| zipped CSV, then that's usually easy.
|
| But when you get to hierarchical data structures like
| JSON/protobuf there very often simply isn't a streaming
| library available. There's a library function to decode the
| whole thing into an object in memory, and that's all.
|
| Nothing prevents streaming in theory, it's just far more
| complicated to write that library.
| dilap wrote:
| protobuf sure, but streaming libraries for json (and xml,
| as in the parent) are extremely common. not harder (maybe
| even easier) than non-streaming to write, tho more
| cumbersome to use, so something you'd reach for only if you
| specifically need it ('cuz of memory constraints)
|
| e.g. standard go json library
| https://pkg.go.dev/encoding/json#example-Decoder.Decode-
| Stre...
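|
| Roughly, the decoder pattern looks like this (a minimal sketch
| reading an array one element at a time):
|
|     package main
|
|     import (
|         "encoding/json"
|         "fmt"
|         "strings"
|     )
|
|     func main() {
|         const data = `[{"id":1},{"id":2},{"id":3}]`
|         dec := json.NewDecoder(strings.NewReader(data))
|         dec.Token() // consume the opening '['
|         for dec.More() {
|             var item struct {
|                 ID int `json:"id"`
|             }
|             if err := dec.Decode(&item); err != nil {
|                 panic(err)
|             }
|             fmt.Println(item.ID) // one element in memory
|         }
|         dec.Token() // consume the closing ']'
|     }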
| crazygringo wrote:
| Yup. I don't remember streaming JSON being common in the
| early days but now it is. But the absence of streaming
| protobuf is what has killed me, when dealing with
| gigantic protobuf files from government agencies (ugh).
| MortyWaves wrote:
| > This is a field that seems to have emerged in computer science
| relatively recently; many of the important data structures were
| invented in the last 25 years.
|
| This is crazy!
| sujayakar wrote:
| I really love this space: Navarro's book is an excellent survey.
|
| Erik Demaine has a few great lectures on succinct data structures
| too: L17 and L18 on
| https://courses.csail.mit.edu/6.851/spring12/lectures/
| yurivish wrote:
| I also emailed Gonzalo Navarro once to ask a question, and we had
| a great discussion and ended up writing a paper together about
| the answer. [1]
|
| Another paper of his that I really like combines a few elegant
| ideas into a simple implementation of bitvector rank/select:
| https://users.dcc.uchile.cl/~gnavarro/ps/sea12.1.pdf
|
| During this time I got really excited about succinct data
| structures and wrote a Rust library implementing many bitvector
| types and a wavelet matrix. [2]
|
| My interest came from a data visualization perspective -- I was
| curious if space-efficient data structures could fundamentally
| improve the interactive exploration of large datasets on the
| client side. Happy to chat about that if anyone's curious.
|
| [1] Paper:
| https://archive.yuri.is/pdfing/weighted_range_quantile_queri...
| though it's pretty hard to understand without some background
| context. I've been meaning to write a blog post explaining the
| core contribution, which is a simple tweak to one of Navarro's
| textbook data structures.
|
| [2] The Rust version is here: https://github.com/yurivish/made-
| of-bits/tree/main/rust-play... and an earlier pure-JS
| implementation is here: https://github.com/yurivish/made-of-
| bits/tree/main
| sitkack wrote:
| Reading a Gonzalo Navarro paper is like going for a walk, taking
| a shower and having a wonderful coffee. It literally sets the
| mind on fire.
|
| https://dblp.org/pid/n/GonzaloNavarro.html
| eqvinox wrote:
| There's a relative of this in making in-memory node
| representation efficient for large structs that have a bunch of
| rarely-used fields: chunk up memory in units (most reasonably 8
| bytes/pointer size), allocate offsets for rarely-used fields in
| ascending order, and then use bitmasks to indicate which fields
| are present. (Note the bits correspond to units, not fields; a
| 16-byte field would use 2 adjacent bits in the bitmask.)
|
| The trick is that masking & popcount (both low-cycle CPU
| instructions on most modern CPUs [1]) make this quite fast to
| access and thus suitable for in-memory representation.
|
| The intended use is when presence of optional fields is known at
| allocation time and doesn't change afterwards, i.e. some object
| is built up into a dynamically allocated buffer whose size is
| shrunk down by omitting fields. Changing which fields are present
| afterwards requires reallocating the thing, which tends to make
| the entire pattern pointless.
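|
| In Go, the access path looks roughly like this (a hypothetical
| layout, just to illustrate the popcount trick):
|
|     package main
|
|     import (
|         "fmt"
|         "math/bits"
|     )
|
|     type SparseObj struct {
|         mask  uint64   // bit i set => unit i is present
|         units []uint64 // present units, in ascending order
|     }
|
|     // Get returns unit i and whether it is present.
|     func (o *SparseObj) Get(i uint) (uint64, bool) {
|         if o.mask&(1<<i) == 0 {
|             return 0, false
|         }
|         // Offset = number of set bits below bit i.
|         idx := bits.OnesCount64(o.mask & (1<<i - 1))
|         return o.units[idx], true
|     }
|
|     func main() {
|         // Units 0, 3, 5 present; 1, 2, 4 omitted.
|         o := &SparseObj{mask: 0b101001,
|             units: []uint64{10, 30, 50}}
|         fmt.Println(o.Get(3)) // 30 true
|         fmt.Println(o.Get(4)) // 0 false
|     }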
|
| [1] The real annoyance here is that _almost_ all x86_64 CPUs
| have POPCNT, except the absolute earliest ones, but if you build
| e.g. some package for a Linux distribution without any CPU
| optimization flags it'll use a library fallback popcount routine
| rather than the instruction :(
| bo1024 wrote:
| With advanced CS topics, it often works to search for "<topic>
| lecture notes".
___________________________________________________________________
(page generated 2025-03-06 23:00 UTC)