[HN Gopher] Roaring bitmaps: what they are and how they work
___________________________________________________________________
Roaring bitmaps: what they are and how they work
Author : voberoi
Score : 122 points
Date : 2022-09-02 15:35 UTC (1 day ago)
(HTM) web link (vikramoberoi.com)
(TXT) w3m dump (vikramoberoi.com)
| celeritascelery wrote:
| I see this common theme among very fast practical algorithms
| (like timsort or pdqsort) where there is not some secret math or
| algorithm, rather they are just a bunch of special cases based on
| heuristics. Often they involve specific knowledge of the hardware
| instead of treating software as abstract. To me this is the big
| difference between "computer science" and "computer engineering".
| coldtea wrote:
| > _I see this common theme among very fast practical algorithms
| (like timsort or pdqsort) where there is not some secret math
| or algorithm, rather they are just a bunch of special cases
| based on heuristics_
|
| The core here, bitmap storage and the basic optimization, is
| simple but solid and general. Mathematically it takes less
| space to store data as positions in a bitmap than as fully
| spelled-out associations, and it takes even less space to
| segregate the bitmaps into chunks.
|
| This will hold, and be smaller and faster, on any computer. So
| it's not a case of special-casing based on heuristics, whether
| tied to the frequencies or sample characteristics of some
| particular set of data, or to specific CPU peculiarities, or
| whatever.
|
| More exotic optimizations piled on top, sure. But "compressed
| bitmaps" themselves as a concept, not.
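A back-of-the-envelope sketch of that space argument (plain Python; the numbers and chunk size are illustrative, not any particular library's layout):

```python
# Space to store a set of n distinct 32-bit integers, three ways.

def explicit_list_bytes(n):
    # each member spelled out as a 4-byte integer
    return n * 4

def plain_bitmap_bytes(universe_size):
    # one bit per POSSIBLE value, regardless of how many are present
    return universe_size // 8

def chunked_bitmap_bytes(members, chunk_bits=2**16):
    # only allocate an 8 KiB bitmap for chunks that hold >= 1 member
    chunks = {m // chunk_bits for m in members}
    return len(chunks) * (chunk_bits // 8)

members = list(range(0, 1_000_000, 100))   # 10,000 sparse members
print(explicit_list_bytes(len(members)))   # 40000 bytes
print(plain_bitmap_bytes(2**32))           # 536870912 bytes (512 MiB!)
print(chunked_bitmap_bytes(members))       # 131072 bytes: 16 chunks
```

Chunking alone already beats the plain bitmap by orders of magnitude here; storing sparse chunks as arrays instead of bitmaps (the Roaring trick) shrinks them further still.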
| voberoi wrote:
| I had the same takeaway when I read these papers. It felt like
| a game of whack-a-mole.
|
| There's a massive performance benefit to doing this at the cost
| of implementation complexity. I haven't studied the
| implementations or tried my hand at one, but I get the
| impression that these are tough to implement correctly in a way
| that takes full advantage of the hardware.
|
| (In that sense, it's awesome that the researchers also did the
| legwork to implement and maintain a library!)
| jandrewrogers wrote:
| Much of this is just threshold effects for algorithm
| performance, not special cases per se.
|
| You don't even need to have specific knowledge of the hardware
| as long as you can identify cross-over points where one
| algorithm starts significantly outperforming others. An old HPC
| trick is to write software that thoroughly measures several
| algorithm strategies on your specific hardware environment and
| then code-gens an algorithm that has an optimal set of
| strategies and strategy-switching thresholds. The meta-
| algorithm is mostly focused on cheaply detecting these
| thresholds at runtime.
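A toy version of that calibrate-then-generate trick (a sketch only; real HPC versions measure far more carefully and emit specialized code rather than a Python function):

```python
import bisect
import timeit

# Two candidate strategies for membership in a sorted list.
def linear_contains(sorted_xs, target):
    for x in sorted_xs:
        if x == target:
            return True
        if x > target:
            return False
    return False

def binary_contains(sorted_xs, target):
    i = bisect.bisect_left(sorted_xs, target)
    return i < len(sorted_xs) and sorted_xs[i] == target

def find_crossover(sizes=(4, 16, 64, 256, 1024, 4096)):
    # Smallest n at which binary search beats the linear scan
    # on THIS machine (noisy, but illustrates the idea).
    for n in sizes:
        xs = list(range(n))
        t_lin = timeit.timeit(lambda: linear_contains(xs, n - 1), number=200)
        t_bin = timeit.timeit(lambda: binary_contains(xs, n - 1), number=200)
        if t_bin < t_lin:
            return n
    return sizes[-1]

THRESHOLD = find_crossover()

def contains(sorted_xs, target):
    # The "generated" meta-algorithm: switch strategy at the
    # measured threshold instead of a hard-coded one.
    if len(sorted_xs) < THRESHOLD:
        return linear_contains(sorted_xs, target)
    return binary_contains(sorted_xs, target)
```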
| fuckstick wrote:
| No this is still comp sci. Comp sci doesn't mean that practical
| considerations in algorithm design are ignored. Computer
| engineering has to do with the actual design and construction
| of computing machinery. It's an accredited engineering field
| and the undergraduate coursework includes electronics,
| semiconductors, computer architecture and generally a lot of
| overlap with electrical engineering. I had to take one dumb
| programming course - all the data structures and algorithms
| stuff I had to learn elsewhere.
| jerven wrote:
| I love roaring bitmaps and maybe even overuse this library. On
| the other hand, it made a lot of things possible for
| legacy.uniprot.org over the years that would have been
| unaffordable otherwise. Quick faceting using permanent filters
| over 500+ million documents would have been really hard
| otherwise.
| voberoi wrote:
| I'm really curious about this. Are you able to share a bit
| about your use case and how you use roaring bitmaps outright
| (vs. through search infra like Solr or ElasticSearch)?
| pvillano wrote:
| I thought the trick was going to be indirection.
|
| sorted-list/bitmap/runs, in a two level tree. cool.
|
| it's technically missing a sorted complement-list, i.e. only
| the items that are missing, although worst-case runs only use
| twice as much space, so it's more complexity without much
| savings, esp. considering real workloads
|
| performs better with sequential ids than random uuids because you
| use fewer pages
|
| further research
|
| investigating the effect of adding more layers to the tree
|
| using additional set representations e.g. balanced tree
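The two-level sorted-list/bitmap scheme described above can be sketched in a few dozen lines. This is a toy, assuming 32-bit values and omitting the run container entirely:

```python
import bisect

ARRAY_LIMIT = 4096  # above this, a 2^16-bit bitmap is the smaller container

class ToyRoaring:
    """Toy two-level set: high 16 bits pick the chunk, low 16 bits
    live in either a sorted array or a 65536-bit bitmap."""

    def __init__(self):
        # high 16 bits -> sorted list of low values, or an int
        # used as a 65536-bit bitmap
        self.chunks = {}

    def add(self, value):
        hi, lo = value >> 16, value & 0xFFFF
        c = self.chunks.get(hi)
        if c is None:
            self.chunks[hi] = [lo]            # new chunk starts as an array
        elif isinstance(c, list):
            i = bisect.bisect_left(c, lo)
            if i == len(c) or c[i] != lo:
                c.insert(i, lo)
                if len(c) > ARRAY_LIMIT:      # convert array -> bitmap
                    bits = 0
                    for v in c:
                        bits |= 1 << v
                    self.chunks[hi] = bits
        else:
            self.chunks[hi] = c | (1 << lo)   # bitmap: just set the bit

    def __contains__(self, value):
        hi, lo = value >> 16, value & 0xFFFF
        c = self.chunks.get(hi)
        if c is None:
            return False
        if isinstance(c, list):
            i = bisect.bisect_left(c, lo)
            return i < len(c) and c[i] == lo
        return bool((c >> lo) & 1)
```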
| nattaylor wrote:
| I enjoyed this walkthrough. I'm interested in RB because it's
| used in Lucene but never dug in and assumed (without much
| thought) that it had to do with consecutive runs of 1s and 0s.
| Wrong!
| LouisSayers wrote:
| Compressing consecutive 1's and 0's was the first idea that
| popped into my head as well. Then when they started to talk
| about inserting into the 800,000th position my brain went "What
| if they just reversed the array and stored some metadata as the
| type of array they're dealing with?"
|
| It's funny that in the end Roaring bitmaps are essentially
| just a modified data structure and process.
|
| The way things are going, I imagine someone will at some point
| take all the different tricks we have of inserting, sorting,
| finding, deleting data, take millions of datasets and run some
| machine learning type process to create libraries that can
| perform these operations optimally.
| nattaylor wrote:
| I was thinking a similar thought that 4096 seemed quite magic
| (although it seems to be chosen based on research) and that
| RB perf could probably be tuned based on the workload
| nicoburns wrote:
| I don't think 4096 is arbitrary. It's an array of 16bit
| integers, and 4096 * 16 = 65536. So 4096 represents the
| boundary point below which an array of integers uses less
| memory than a traditional bitmap.
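That boundary falls straight out of the arithmetic:

```python
# A chunk's low values are 16-bit integers, while a full bitmap over
# all 2^16 possible low values is a fixed 8 KiB, so the sorted array
# wins only below 65536 / 16 = 4096 elements.

BITMAP_BYTES = 2**16 // 8  # 8192 bytes, regardless of cardinality

def array_bytes(n):
    return n * 2  # n sorted 16-bit integers

crossover = (BITMAP_BYTES * 8) // 16
print(crossover)                         # 4096
print(array_bytes(4095) < BITMAP_BYTES)  # True: array still smaller
print(array_bytes(4097) > BITMAP_BYTES)  # True: bitmap now smaller
```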
| nattaylor wrote:
| Oops, thanks for pointing that out
| TacticalCoder wrote:
| > When a set is sparse, traditional bitmaps compress poorly.
|
| They waste space. But if you were to compress them, they'd
| compress _very_ well. As in, for example, if you were to compress
| them using roaring bitmaps.
| nullc wrote:
| Those who do not know judy1 are doomed to reinvent it.
|
| http://judy.sourceforge.net/
| latchkey wrote:
| http://nothings.org/computer/judy/
| froh wrote:
| which possibly is a good thing
|
| the Judy project seems inactive to me:
|
| https://sourceforge.net/p/judy/bugs/
|
| this short HN thread resonates with the intuition that judy1
| is stalled; its last optimistic comment points to an orphaned
| comment in the big list above
|
| https://news.ycombinator.com/item?id=32188204
| Xeoncross wrote:
| Can someone explain how this is laid out in memory and what it
| would look like serialized? It's easy to have an array of N size
| that fits all bits and just write to disk. You can also mmap a
| file as a large array of bytes and just mutate it and let the OS
| handle the disk syncs if you're okay with some data loss of
| recent state changes if the machine crashes.
|
| What would an array of pointers to X containers which are N size
| each look like on disk?
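One hypothetical answer (a sketch of the general idea only, not the actual interoperable Roaring serialization format, which is specified separately): replace the pointers with byte offsets, write a small directory of (key, container type, offset) entries, then append each container's payload. Only array containers are shown here for brevity:

```python
import struct

# Hypothetical on-disk layout (NOT the official Roaring format):
#   uint32 container count
#   per container: uint16 key, uint16 type, uint32 byte offset
#   payloads: uint16 length followed by sorted uint16 values
# type 0 = sorted uint16 array (the only type this sketch handles).

def serialize(chunks):
    # chunks: dict of high-16-bit key -> sorted list of low 16-bit values
    header = struct.pack("<I", len(chunks))
    entries, payloads = b"", b""
    base = 4 + len(chunks) * 8  # directory size once all entries exist
    for key in sorted(chunks):
        lows = chunks[key]
        payload = struct.pack("<H", len(lows))
        payload += struct.pack(f"<{len(lows)}H", *lows)
        entries += struct.pack("<HHI", key, 0, base + len(payloads))
        payloads += payload
    return header + entries + payloads

def deserialize(blob):
    (count,) = struct.unpack_from("<I", blob, 0)
    chunks = {}
    for i in range(count):
        key, _type, off = struct.unpack_from("<HHI", blob, 4 + i * 8)
        (n,) = struct.unpack_from("<H", blob, off)
        chunks[key] = list(struct.unpack_from(f"<{n}H", blob, off + 2))
    return chunks
```

Because every container is reachable via an offset in the directory, a reader (or an mmap'd view) can seek straight to one chunk without parsing the rest.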
| extesy wrote:
| This article has a section titled "how Roaring bitmaps are
| represented in memory". I suggest you actually read the
| article, it will likely answer all your questions.
| pvillano wrote:
| I'm reading this on a railway with terrible internet and the
| figures slowly load to reveal... text :/
| superjan wrote:
| It looks a bit too simplistic: blocks with up to 4k items are
| stored as sorted lists. Having to insert into a sorted list of
| up to 4k items seems very expensive to me.
| jasonwatkinspdx wrote:
| The threshold was chosen empirically. Modern processors are
| _very_ fast when operating on memory sequentially, as the
| prefetcher achieves the best case scenario. The cost of a cache
| line miss is so substantial this ends up dominating things
| until "n" is in the 1000s. A lot of programmers' intuition is
| off the mark on this point.
| jandrewrogers wrote:
| CPUs have a strong performance bias toward sequential memory
| access and there are large threshold effects at work here. The
| block size used is not arbitrary. Improvements in prefetching
| and cache line utilization can have such large performance
| benefits that it more than justifies any apparent increase in
| computational cost because of how the code is organized to
| obtain those improvements.
|
| Most developers do not have a good intuition for the efficiency
| and scalability of sequential brute-force on modern
| architectures. There is a cross-over threshold but I think many
| people would be surprised at how high it is in practice.
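A crude way to feel that bias (plain Python, whose interpreter overhead blunts the effect; the sequential-vs-random gap is far larger in C on real hardware):

```python
import random
import timeit

# Sum the same million elements twice: once walking the indices in
# order (prefetcher-friendly), once in shuffled order (cache-hostile).

data = list(range(1_000_000))
seq_order = list(range(len(data)))
rand_order = seq_order[:]
random.shuffle(rand_order)

def sum_in_order(order):
    total = 0
    for i in order:
        total += data[i]
    return total

t_seq = timeit.timeit(lambda: sum_in_order(seq_order), number=3)
t_rand = timeit.timeit(lambda: sum_in_order(rand_order), number=3)
print(f"sequential: {t_seq:.3f}s  shuffled: {t_rand:.3f}s")
```

Both calls compute the identical sum; only the access pattern differs, and the shuffled walk is typically noticeably slower even through the interpreter.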
| foota wrote:
| The 4k items are only 512 bytes though, so expensive yes but
| not overly so.
| skitter wrote:
| They are two bytes per item; it's the bitmaps (>4k entries)
| that use 1 bit per item.
| Veserv wrote:
| No. The person you are replying to is talking about the
| sparse leaf case where the contained integers are actually
| being stored (to be more precise the low bits are being
| stored) in a sorted structure. The switch to the bitmap
| occurs when the sparse structure takes the same number of
| bits to directly encode the sparsely contained items. In this
| case they use a 2^16 bitmap and encode sparse elements using
| 16-bits, so it occurs at ((2^16 / 16) == 2^12) elements.
___________________________________________________________________
(page generated 2022-09-03 23:00 UTC)