hngopher.com

       [HN Gopher] Bitmap Indexes in Go: Search Speed (2019)
       ___________________________________________________________________
        
       Bitmap Indexes in Go: Search Speed (2019)
        
       Author : kordlessagain
       Score  : 138 points
       Date   : 2022-09-22 12:29 UTC (10 hours ago)
        
 (HTM) web link (habr.com)
 (TXT) w3m dump (habr.com)
        
       | iampims wrote:
       | Bitmaps can also be used for computing funnels:
       | 
       | https://vikramoberoi.com/using-bitmaps-to-run-interactive-re...
        
         | JAA1337 wrote:
         | I think it's fair to say that the use cases for bitmaps have
         | not yet been fully explored. I believe their use in areas of
         | analytics and OLAP/RTOLAP will continue to grow as the desire
         | to infer and target demo's across households increases.
        
           | eurasiantiger wrote:
           | Isn't Lucene already a household name?
        
             | clowen wrote:
             | Probably not, first thought I had was Lucerne dairy
             | products.
        
             | JAA1337 wrote:
             | Not sure I understand your point. Can you clarify some?
        
               | eurasiantiger wrote:
               | Lucene uses roaring bitmaps for filters.
        
               | JAA1337 wrote:
               | Sure, I completely get it. However, does that mean
               | bitmaps have been fully explored? For example, OLAP is
               | only growing as a market as the need for realtime
               | analysis on the most fresh data. I mean, we never going
               | to have 'less' data nor are we ever going to seek results
               | 'slower'. Where I am going with this, utilizing
               | technologies like bitmaps in OLAP services is an area I
               | believe will see continued growth. While Lucene is a
               | known commodity, I think we can agree it hasn't solved
               | the sector, right?
        
             | gthrone wrote:
             | I didn't realize Lucene used their own flavor of roaring
             | bitmaps! Super Cool. I certainly think of search when it
             | comes to Lucene, where as the cited material from Vikram &
             | FeatureBase are targeting more general analytics.
        
             | kordlessagain wrote:
             | Lucene is fast, but probably can't handle analytical type
             | questions on massive amounts of data as fast as a pure
             | binary index can. I could be wrong about that though. I do
             | know Solr has it because Lucene does...
        
               | eurasiantiger wrote:
               | Elasticsearch (another household name) is built on Lucene
               | too, so I wouldn't count it out.
        
         | skingkong wrote:
         | 100%, and are freaking fast...and pretty efficient, especially
         | for high throughput applications tracking tons of changing
         | data. We've found bitmaps handle updates, inserts, and schema
         | changes really elegantly - and quickly (like sub-second).
         | Depending on the use case, looking at cohorts MoM or YoY that
         | freshness doesn't entirely matter, but computing cohorts in
         | real-time hour-over-hour, or even minute-over-minute, for in-
         | app actions/personalization is insanely valuable to keep users
         | engaged (or what ever entity type you are analyzing and taking
         | action on, whether it's devices, nodes across a network, etc).
        
       | pknerd wrote:
       | A bit off-topic but is there some website/directory where I can
       | learn about modern data structures or unknown/less popular data
       | structures?
        
       | manimino wrote:
       | Writing an object indexer is a fun project. My Python indexer,
       | ducks [1], went through a lot of the same development stages,
       | such as supporting only exact-value matches at first. I tried
       | binning too, but ultimately using a BTree was simpler and
       | performed well enough (500ns for a log-n tree traversal, vs 50ns
       | hash).
       | 
       | Still looking for a way to apply bitmap indexing in ducks; it
       | would surely speed up multi-index queries, but I haven't found an
       | implementation that's worth the tradeoffs yet.
       | 
       | [1] https://ducks.readthedocs.io/en/latest/
        
         | skingkong wrote:
         | This is neat! Curious what type of use case you use ducks for
         | most often? Also...what tradeoffs are keeping you from
         | implementing bitmaps in ducks?
        
           | manimino wrote:
           | As far as the bitmaps holdup, well. Let's see.
           | 
           | The slowest step in ducks right now is combining the results
           | of a multi-index query. Suppose I get 1000 objects matching
           | one attribute and another 1000 objects from another, I've now
           | got to intersect those two sets. Right now I'm using sets, or
           | in some cases, sorted lists that act as sets [1]. But
           | intuitively, a bitwise "and" _ought_ to be faster, due to
           | SIMD etc.
           | 
           | It's really just a matter of thinking about how to write it.
           | This video was quite helpful as it mentioned that Postgres
           | uses a similar approach, so probably I'll have a look at
           | their implementation first and take inspiration from there.
           | 
           | [1] https://ducks.readthedocs.io/en/latest/how_it_works.html#
           | sor...
        
             | marginalia_nu wrote:
             | > Suppose I get 1000 objects matching one attribute and
             | another 1000 objects from another, I've now got to
             | intersect those two sets. Right now I'm using sets, or in
             | some cases, sorted lists that act as sets [1]. But
             | intuitively, a bitwise "and" ought to be faster, due to
             | SIMD etc.
             | 
             | I worked on this exact problem today.
             | 
             | This is probably super obvious and I bet you're already
             | doing something similar (or smarter still), but I'm excited
             | about it so I'm telling you anyway.
             | 
             | I was doing the naive O(m log n) algorithm where you fetch
             | one set of items and then test them each against the other
             | index one by one (both implemented as B-trees).
             | 
             | I got an enormous speedup[1] after I realized that you can
             | get a discount on index look-ups by testing a sorted list
             | of multiple items in the same tree traversal operation. Not
             | only does each item tested allow you to reduce the search
             | scope, you can often omit repeating several of the
             | calculations closer to the root.
             | 
             | [1] Eventually 10x, but that's in part due to naively
             | parallelizing the workload. The single threaded approach
             | was still something like 4-5x compared to the original
             | approach. 20k ops/s for thousands of items tested against a
             | 800 Mb index.
        
           | manimino wrote:
           | Ducks, the story:
           | 
           | I was using Python in-memory vector search engine called
           | Annoy [1] to do semantic search on various kinds of data. It
           | worked great for finding "similar" objects. Story A has
           | similar text to story B, image A looks like image B, etc.
           | 
           | So Annoy solved the hard part. But doing basic metadata
           | lookups was surprisingly hard in Python. How do I get all
           | images matching some criteria (say, size range, or tags)? I'd
           | have to serialize them all into a DB, and use a DB index.
           | Databases are great, but they add code bloat and overhead;
           | I'm usually working Jupyter notebooks and I like keeping as
           | few external dependencies as possible.
           | 
           | So I wrote ducks as a quick, convenient way to index
           | anything.
           | 
           | There's lots of other usage patterns of course, it's very
           | generic. It makes a great Wordle / crossword solver too.
           | "Find me words where the first letter is A and the fifth
           | letter is L" is very fast in ducks.
           | 
           | Indexing is just one of those things you always need. Python
           | didn't have a good way to do it, and now it does!
           | 
           | Source code's here if you're curious:
           | https://github.com/manimino/ducks
           | 
           | [1] Annoy: https://github.com/spotify/annoy
        
             | skingkong wrote:
             | Huh, that's really cool and makes a lot of sense. Thanks
             | for sharing more about it
        
       | jorgeleo wrote:
       | Did the exact same thing 8 years ago and implemented it on C#.
       | Glad to see that more people implemented it.
       | 
       | Here is the link for a different explanation:
       | https://www.linkedin.com/pulse/why-column-oriented-database-...
        
       | marginalia_nu wrote:
       | I'm not sure I follow what it is they're benchmarking (although I
       | struggle reading text that has too many images so I may be
       | confused as to what they're even doing).
       | 
       | I'd expect the bottleneck in a database index to be your disk or
       | RAM access patterns. It seems very counterintuitive to see a
       | real-world speed-up from SIMD if that is the case. The fact that
       | they're seeing one in the benchmark would suggest maybe the
       | benchmark isn't entirely capturing the right thing.
        
         | JustG wrote:
         | Disk and Memory can both be bottle necks, however with
         | compression on Roaring bitmaps you typically keep your entire
         | index in memory and compute in a compressed state, avoiding
         | slow disk. You do in fact become CPU bound on queries, even
         | intelligent multi-threading, and CPU efficiency comes into play
         | (especially so as a bitmap begins expanding across several
         | shards).
        
         | jaffee wrote:
         | Generally speaking, the nice thing about bitmap indexes is that
         | you're able to access the data in a very granular way. If you
         | have a WHERE clause that's calling out specific values, you
         | only access the data which is pertinent to those values within
         | a column, you don't have to scan the whole column. This is
         | simply due to the structure of a bitmap index where you have a
         | separate bitmap for each value in the domain of a column.
         | 
         | Furthermore, access patterns for bitmaps tend to be very linear
         | and cache/prefetch friendly.
         | 
         | I think it's very feasible that adding SIMD could result in a
         | real-world speedup in an otherwise well-optimized in-memory
         | system. I agree if you need to go to disk, that will likely
         | dominate the overall performance of a single query, but it may
         | still be overall more efficient which can still help in a
         | multi-user situation.
        
       | toddgruben wrote:
       | Looks like Pilosa mentioned in the article is now called
       | FeatureBase and all the new activity is happening at
       | https://github.com/FeatureBaseDB/featurebase
        
         | kordlessagain wrote:
         | Yes. Molecula, the company behind Pilosa, has rebranded itself
         | to FeatureBase. It's offering both a managed cloud/serverless
         | option and a downloadable Open Source build for implementing
         | binary indexes. Both ARM and Intel versions are available (but
         | not for Windows, yet). I just joined the team and am working on
         | some interesting demos for machine learning applications.
         | 
         | While FeatureBase is great for doing analytics on massive data
         | sets, it's also well suited for being used as a model's feature
         | store, for both training and inference.
         | 
         | We have a Discord here, if anyone is interested:
         | https://discord.com/invite/bSBYjDbUUb?utm_campaign=FeatureBa...
        
       | todd8 wrote:
       | This is an interesting article, and I'll probably take a look at
       | the go language profiling capabilities and built-in machine
       | independent assembler.
       | 
       | The idea of using bitmaps for retrieval on multiple indexes isn't
       | new. Not surprisingly Knuth has a good discussion of the
       | technique [Knuth], it was explained very well with a nice
       | illustration of using notched recipe cards that is worth taking a
       | peek at.
       | 
       | -- straying off topic below this line --
       | 
       | One surprising note, Calvin N. Mooers invented zatocoding in
       | 1947. This technique used the notched cards for information
       | retrieval and cited by Knuth under the subsection _superimposed
       | coding_. Calvin N. Mooers went on to invent the  "Reactive
       | Typewriter" in the 1960s. This was an visionary implementation of
       | how users might interact with computers using a terminal device
       | (like a teletype). Mooers' Reactive Typewriter was programmed
       | using a language he invented named TRAC.
       | 
       | TRAC is a macro based language. It, like the roughly
       | contemporaneous GPM of Christopher Strachey (1965), is a Turing
       | complete language based on macros that can define other macros,
       | etc. I learned about TRAC in the mid 1970s from Ted Nelson's book
       | _Computer Lib /Dream Machines_ [Nelson]. It was one of three
       | languages featured in that book. A fun little book, _Etudes for
       | Programmers_ , which I bought in 1978 inspired me to implement a
       | version of TRAC. It's a nice exercise.
       | 
       | Many years later (I believe it was 1991), during an Open Software
       | Foundation meeting in Cambridge, Massachusetts, I got to meet
       | Mooers in person, he was sitting in the front row during a
       | presentation, and I got to speak with him briefly afterwards.
       | 
       | By the way, Christopher Strachey wrote the first paper on the
       | concept of time-sharing in 1959 and in the 1970's with Dana Scott
       | is credited with coming up with the important development in the
       | theory of programming languages known as Denotational Semantics.
       | 
       | [Knuth] Donald Knuth. (1973).The Art of Computer Programming, Vol
       | 3/Sorting and Searching. Section 6.5, Retrieval on Secondary
       | Keys. Addison-Wesley.
       | 
       | [Nelson] Theodor H Nelson. (1974). Computer lib : you can and
       | must understand computers now. ISBN 0893470023. OCLC 11182412.
       | https://en.wikipedia.org/wiki/Computer_Lib/Dream_Machines
       | 
       | [Strachey]
       | https://en.wikipedia.org/wiki/Christopher_Strachey#cite_note...
       | 
       | [Wetherell] Charles Wetherell, Etudes for Programmers, Prentice
       | Hall, 1977, https://www.amazon.com/Etudes-Programmers-Charles-
       | Wetherell/...
        
         | maycotte wrote:
         | Love that! Early categorical/bitmap storage using
         | notches...punched or not punched!
        
       | mrjn wrote:
       | Wonder how this compares to Sroar [1]
       | 
       | [1]: https://dgraph.io/blog/post/serialized-roaring-bitmaps-
       | golan...
        
         | JAA1337 wrote:
         | Thank you for sharing this. As someone who is working to learn
         | more about bitmaps, id like to ask how you came across this? Is
         | there a specific community or other that you follow? Thanks!
        
       | jimbokun wrote:
       | Very approachable introduction to this data structure.
       | 
       | The assembly implementation, though, reminded me of the stone
       | soup story. Where a stranger is boiling a rock in a pot of water,
       | and the townspeople intrigued by the idea of stone soup bring
       | various ingredients to add to it and everyone is shocked that
       | "stone soup" could taste so good.
       | 
       | The "Go" implementation ends up being pretty much all assembly by
       | the end, in an article supposedly about a fast bitmap index
       | implementation written in Go.
        
         | clowen wrote:
         | The story as I know it:
         | 
         | Three soldiers wander into a town and watch as the townspeople
         | shutter themselves into their homes.
         | 
         | The first two soldiers, out of rations, start to knock on doors
         | and asking if anyone has any food, but no one answers. The
         | third says: "I have an idea, start a cooking fire in the town
         | square and get some water boiling. I've got to find the right
         | ingredient."
         | 
         | They do as they were told, and the third makes a show of trying
         | to find just the right stone. Making sure the town hears and
         | sees them. They bring the "perfect" stone back, and place it in
         | the pot of boiling water. The other two soldiers start to
         | question, but the third tells them to just shut up and watch.
         | 
         | After boiling a stone for 10 to 15 minutes, one of the
         | villagers comes out to ask what the soldiers are doing. "Making
         | stone soup." replies the soldier, "Though it is usually better
         | with some potatoes or carrots."
         | 
         | The villager lights up, "I have carrots!" They run to their
         | home, come back with a handful of carrots, and add them to the
         | soup. Seeing this, other villagers begin to emerge from their
         | homes, and each suggesting a new ingredient.
         | 
         | Over and over this happens until someone tries to put raw
         | salmon in and is thrown out of the town. I mean, until they
         | have a full and hardy stew going, with more than enough to feed
         | the entire town.
         | 
         | The morale of the story being to always have enough rations.
        
         | pjmlp wrote:
         | Going off-topic, a typical Portuguese soup actually, worth a
         | visit to a "Loja das Sopas" in a nearby shopping mall when in
         | Portugal.
        
           | JustG wrote:
           | Some of the best seafood ever is in Lisbon, Portugal, not to
           | mention the Algrave..
        
             | pjmlp wrote:
             | As northern Portuguese I beg to differ. :)
        
       ___________________________________________________________________
       (page generated 2022-09-22 23:01 UTC)