[HN Gopher] Bitmap Indexes in Go: Search Speed (2019)
___________________________________________________________________
Bitmap Indexes in Go: Search Speed (2019)
Author : kordlessagain
Score : 138 points
Date : 2022-09-22 12:29 UTC (10 hours ago)
(HTM) web link (habr.com)
(TXT) w3m dump (habr.com)
| iampims wrote:
| Bitmaps can also be used for computing funnels:
|
| https://vikramoberoi.com/using-bitmaps-to-run-interactive-re...
| JAA1337 wrote:
| I think it's fair to say that the use cases for bitmaps have
| not yet been fully explored. I believe their use in areas of
| analytics and OLAP/RTOLAP will continue to grow as the desire
| to infer and target demo's across households increases.
| eurasiantiger wrote:
| Isn't Lucene already a household name?
| clowen wrote:
| Probably not, first thought I had was Lucerne dairy
| products.
| JAA1337 wrote:
| Not sure I understand your point. Can you clarify some?
| eurasiantiger wrote:
| Lucene uses roaring bitmaps for filters.
| JAA1337 wrote:
| Sure, I completely get it. However, does that mean
| bitmaps have been fully explored? For example, OLAP is
| only growing as a market as the need for realtime
| analysis on the most fresh data. I mean, we never going
| to have 'less' data nor are we ever going to seek results
| 'slower'. Where I am going with this, utilizing
| technologies like bitmaps in OLAP services is an area I
| believe will see continued growth. While Lucene is a
| known commodity, I think we can agree it hasn't solved
| the sector, right?
| gthrone wrote:
| I didn't realize Lucene used their own flavor of roaring
| bitmaps! Super Cool. I certainly think of search when it
| comes to Lucene, where as the cited material from Vikram &
| FeatureBase are targeting more general analytics.
| kordlessagain wrote:
| Lucene is fast, but probably can't handle analytical type
| questions on massive amounts of data as fast as a pure
| binary index can. I could be wrong about that though. I do
| know Solr has it because Lucene does...
| eurasiantiger wrote:
| Elasticsearch (another household name) is built on Lucene
| too, so I wouldn't count it out.
| skingkong wrote:
| 100%, and are freaking fast...and pretty efficient, especially
| for high throughput applications tracking tons of changing
| data. We've found bitmaps handle updates, inserts, and schema
| changes really elegantly - and quickly (like sub-second).
| Depending on the use case, looking at cohorts MoM or YoY that
| freshness doesn't entirely matter, but computing cohorts in
| real-time hour-over-hour, or even minute-over-minute, for in-
| app actions/personalization is insanely valuable to keep users
| engaged (or what ever entity type you are analyzing and taking
| action on, whether it's devices, nodes across a network, etc).
| pknerd wrote:
| A bit off-topic but is there some website/directory where I can
| learn about modern data structures or unknown/less popular data
| structures?
| manimino wrote:
| Writing an object indexer is a fun project. My Python indexer,
| ducks [1], went through a lot of the same development stages,
| such as supporting only exact-value matches at first. I tried
| binning too, but ultimately using a BTree was simpler and
| performed well enough (500ns for a log-n tree traversal, vs 50ns
| hash).
|
| Still looking for a way to apply bitmap indexing in ducks; it
| would surely speed up multi-index queries, but I haven't found an
| implementation that's worth the tradeoffs yet.
|
| [1] https://ducks.readthedocs.io/en/latest/
| skingkong wrote:
| This is neat! Curious what type of use case you use ducks for
| most often? Also...what tradeoffs are keeping you from
| implementing bitmaps in ducks?
| manimino wrote:
| As far as the bitmaps holdup, well. Let's see.
|
| The slowest step in ducks right now is combining the results
| of a multi-index query. Suppose I get 1000 objects matching
| one attribute and another 1000 objects from another, I've now
| got to intersect those two sets. Right now I'm using sets, or
| in some cases, sorted lists that act as sets [1]. But
| intuitively, a bitwise "and" _ought_ to be faster, due to
| SIMD etc.
|
| It's really just a matter of thinking about how to write it.
| This video was quite helpful as it mentioned that Postgres
| uses a similar approach, so probably I'll have a look at
| their implementation first and take inspiration from there.
|
| [1] https://ducks.readthedocs.io/en/latest/how_it_works.html#
| sor...
| marginalia_nu wrote:
| > Suppose I get 1000 objects matching one attribute and
| another 1000 objects from another, I've now got to
| intersect those two sets. Right now I'm using sets, or in
| some cases, sorted lists that act as sets [1]. But
| intuitively, a bitwise "and" ought to be faster, due to
| SIMD etc.
|
| I worked on this exact problem today.
|
| This is probably super obvious and I bet you're already
| doing something similar (or smarter still), but I'm excited
| about it so I'm telling you anyway.
|
| I was doing the naive O(m log n) algorithm where you fetch
| one set of items and then test them each against the other
| index one by one (both implemented as B-trees).
|
| I got an enormous speedup[1] after I realized that you can
| get a discount on index look-ups by testing a sorted list
| of multiple items in the same tree traversal operation. Not
| only does each item tested allow you to reduce the search
| scope, you can often omit repeating several of the
| calculations closer to the root.
|
| [1] Eventually 10x, but that's in part due to naively
| parallelizing the workload. The single threaded approach
| was still something like 4-5x compared to the original
| approach. 20k ops/s for thousands of items tested against a
| 800 Mb index.
| manimino wrote:
| Ducks, the story:
|
| I was using Python in-memory vector search engine called
| Annoy [1] to do semantic search on various kinds of data. It
| worked great for finding "similar" objects. Story A has
| similar text to story B, image A looks like image B, etc.
|
| So Annoy solved the hard part. But doing basic metadata
| lookups was surprisingly hard in Python. How do I get all
| images matching some criteria (say, size range, or tags)? I'd
| have to serialize them all into a DB, and use a DB index.
| Databases are great, but they add code bloat and overhead;
| I'm usually working Jupyter notebooks and I like keeping as
| few external dependencies as possible.
|
| So I wrote ducks as a quick, convenient way to index
| anything.
|
| There's lots of other usage patterns of course, it's very
| generic. It makes a great Wordle / crossword solver too.
| "Find me words where the first letter is A and the fifth
| letter is L" is very fast in ducks.
|
| Indexing is just one of those things you always need. Python
| didn't have a good way to do it, and now it does!
|
| Source code's here if you're curious:
| https://github.com/manimino/ducks
|
| [1] Annoy: https://github.com/spotify/annoy
| skingkong wrote:
| Huh, that's really cool and makes a lot of sense. Thanks
| for sharing more about it
| jorgeleo wrote:
| Did the exact same thing 8 years ago and implemented it on C#.
| Glad to see that more people implemented it.
|
| Here is the link for a different explanation:
| https://www.linkedin.com/pulse/why-column-oriented-database-...
| marginalia_nu wrote:
| I'm not sure I follow what it is they're benchmarking (although I
| struggle reading text that has too many images so I may be
| confused as to what they're even doing).
|
| I'd expect the bottleneck in a database index to be your disk or
| RAM access patterns. It seems very counterintuitive to see a
| real-world speed-up from SIMD if that is the case. The fact that
| they're seeing one in the benchmark would suggest maybe the
| benchmark isn't entirely capturing the right thing.
| JustG wrote:
| Disk and Memory can both be bottle necks, however with
| compression on Roaring bitmaps you typically keep your entire
| index in memory and compute in a compressed state, avoiding
| slow disk. You do in fact become CPU bound on queries, even
| intelligent multi-threading, and CPU efficiency comes into play
| (especially so as a bitmap begins expanding across several
| shards).
| jaffee wrote:
| Generally speaking, the nice thing about bitmap indexes is that
| you're able to access the data in a very granular way. If you
| have a WHERE clause that's calling out specific values, you
| only access the data which is pertinent to those values within
| a column, you don't have to scan the whole column. This is
| simply due to the structure of a bitmap index where you have a
| separate bitmap for each value in the domain of a column.
|
| Furthermore, access patterns for bitmaps tend to be very linear
| and cache/prefetch friendly.
|
| I think it's very feasible that adding SIMD could result in a
| real-world speedup in an otherwise well-optimized in-memory
| system. I agree if you need to go to disk, that will likely
| dominate the overall performance of a single query, but it may
| still be overall more efficient which can still help in a
| multi-user situation.
| toddgruben wrote:
| Looks like Pilosa mentioned in the article is now called
| FeatureBase and all the new activity is happening at
| https://github.com/FeatureBaseDB/featurebase
| kordlessagain wrote:
| Yes. Molecula, the company behind Pilosa, has rebranded itself
| to FeatureBase. It's offering both a managed cloud/serverless
| option and a downloadable Open Source build for implementing
| binary indexes. Both ARM and Intel versions are available (but
| not for Windows, yet). I just joined the team and am working on
| some interesting demos for machine learning applications.
|
| While FeatureBase is great for doing analytics on massive data
| sets, it's also well suited for being used as a model's feature
| store, for both training and inference.
|
| We have a Discord here, if anyone is interested:
| https://discord.com/invite/bSBYjDbUUb?utm_campaign=FeatureBa...
| todd8 wrote:
| This is an interesting article, and I'll probably take a look at
| the go language profiling capabilities and built-in machine
| independent assembler.
|
| The idea of using bitmaps for retrieval on multiple indexes isn't
| new. Not surprisingly Knuth has a good discussion of the
| technique [Knuth], it was explained very well with a nice
| illustration of using notched recipe cards that is worth taking a
| peek at.
|
| -- straying off topic below this line --
|
| One surprising note, Calvin N. Mooers invented zatocoding in
| 1947. This technique used the notched cards for information
| retrieval and cited by Knuth under the subsection _superimposed
| coding_. Calvin N. Mooers went on to invent the "Reactive
| Typewriter" in the 1960s. This was an visionary implementation of
| how users might interact with computers using a terminal device
| (like a teletype). Mooers' Reactive Typewriter was programmed
| using a language he invented named TRAC.
|
| TRAC is a macro based language. It, like the roughly
| contemporaneous GPM of Christopher Strachey (1965), is a Turing
| complete language based on macros that can define other macros,
| etc. I learned about TRAC in the mid 1970s from Ted Nelson's book
| _Computer Lib /Dream Machines_ [Nelson]. It was one of three
| languages featured in that book. A fun little book, _Etudes for
| Programmers_ , which I bought in 1978 inspired me to implement a
| version of TRAC. It's a nice exercise.
|
| Many years later (I believe it was 1991), during an Open Software
| Foundation meeting in Cambridge, Massachusetts, I got to meet
| Mooers in person, he was sitting in the front row during a
| presentation, and I got to speak with him briefly afterwards.
|
| By the way, Christopher Strachey wrote the first paper on the
| concept of time-sharing in 1959 and in the 1970's with Dana Scott
| is credited with coming up with the important development in the
| theory of programming languages known as Denotational Semantics.
|
| [Knuth] Donald Knuth. (1973).The Art of Computer Programming, Vol
| 3/Sorting and Searching. Section 6.5, Retrieval on Secondary
| Keys. Addison-Wesley.
|
| [Nelson] Theodor H Nelson. (1974). Computer lib : you can and
| must understand computers now. ISBN 0893470023. OCLC 11182412.
| https://en.wikipedia.org/wiki/Computer_Lib/Dream_Machines
|
| [Strachey]
| https://en.wikipedia.org/wiki/Christopher_Strachey#cite_note...
|
| [Wetherell] Charles Wetherell, Etudes for Programmers, Prentice
| Hall, 1977, https://www.amazon.com/Etudes-Programmers-Charles-
| Wetherell/...
| maycotte wrote:
| Love that! Early categorical/bitmap storage using
| notches...punched or not punched!
| mrjn wrote:
| Wonder how this compares to Sroar [1]
|
| [1]: https://dgraph.io/blog/post/serialized-roaring-bitmaps-
| golan...
| JAA1337 wrote:
| Thank you for sharing this. As someone who is working to learn
| more about bitmaps, id like to ask how you came across this? Is
| there a specific community or other that you follow? Thanks!
| jimbokun wrote:
| Very approachable introduction to this data structure.
|
| The assembly implementation, though, reminded me of the stone
| soup story. Where a stranger is boiling a rock in a pot of water,
| and the townspeople intrigued by the idea of stone soup bring
| various ingredients to add to it and everyone is shocked that
| "stone soup" could taste so good.
|
| The "Go" implementation ends up being pretty much all assembly by
| the end, in an article supposedly about a fast bitmap index
| implementation written in Go.
| clowen wrote:
| The story as I know it:
|
| Three soldiers wander into a town and watch as the townspeople
| shutter themselves into their homes.
|
| The first two soldiers, out of rations, start to knock on doors
| and asking if anyone has any food, but no one answers. The
| third says: "I have an idea, start a cooking fire in the town
| square and get some water boiling. I've got to find the right
| ingredient."
|
| They do as they were told, and the third makes a show of trying
| to find just the right stone. Making sure the town hears and
| sees them. They bring the "perfect" stone back, and place it in
| the pot of boiling water. The other two soldiers start to
| question, but the third tells them to just shut up and watch.
|
| After boiling a stone for 10 to 15 minutes, one of the
| villagers comes out to ask what the soldiers are doing. "Making
| stone soup." replies the soldier, "Though it is usually better
| with some potatoes or carrots."
|
| The villager lights up, "I have carrots!" They run to their
| home, come back with a handful of carrots, and add them to the
| soup. Seeing this, other villagers begin to emerge from their
| homes, and each suggesting a new ingredient.
|
| Over and over this happens until someone tries to put raw
| salmon in and is thrown out of the town. I mean, until they
| have a full and hardy stew going, with more than enough to feed
| the entire town.
|
| The morale of the story being to always have enough rations.
| pjmlp wrote:
| Going off-topic, a typical Portuguese soup actually, worth a
| visit to a "Loja das Sopas" in a nearby shopping mall when in
| Portugal.
| JustG wrote:
| Some of the best seafood ever is in Lisbon, Portugal, not to
| mention the Algrave..
| pjmlp wrote:
| As northern Portuguese I beg to differ. :)
___________________________________________________________________
(page generated 2022-09-22 23:01 UTC)