[HN Gopher] Bloom filters are good for search that does not scale
___________________________________________________________________
Bloom filters are good for search that does not scale
Author : birdculture
Score : 160 points
Date : 2025-11-04 09:25 UTC (13 hours ago)
(HTM) web link (notpeerreviewed.com)
(TXT) w3m dump (notpeerreviewed.com)
| sanskarix wrote:
| the beautiful thing about bloom filters is they let you say
| "definitely not here" without checking everything. that asymmetry
| is weirdly powerful for specific problems.
|
| I've seen them save startups real money in caching layers -
| checking "did we already process this event" before hitting the
| database. false positives are fine because you just check the
| database anyway, but true negatives save you thousands of
| queries.
|
| the trick is recognizing when false positives don't hurt you.
| most engineers learn about them in theory but never find that
| practical use case where they're actually the right tool. same
| with skip lists and a bunch of other algorithms - brilliant for
| 2% of problems, overkill for the rest.
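|
| a rough sketch of that caching pattern in python (the bloom
| filter is hand-rolled for illustration; `db.exists` is a
| hypothetical stand-in for whatever lookup you actually do):
|
|     import hashlib
|
|     class Bloom:
|         def __init__(self, m_bits, k):
|             self.m, self.k = m_bits, k
|             self.bits = bytearray(m_bits // 8 + 1)
|
|         def _positions(self, item):
|             # derive k bit positions from one digest
|             # (Kirsch-Mitzenmacher double hashing)
|             d = hashlib.sha256(item.encode()).digest()
|             h1 = int.from_bytes(d[:8], "big")
|             h2 = int.from_bytes(d[8:16], "big")
|             return [(h1 + i * h2) % self.m for i in range(self.k)]
|
|         def add(self, item):
|             for p in self._positions(item):
|                 self.bits[p // 8] |= 1 << (p % 8)
|
|         def might_contain(self, item):
|             return all(self.bits[p // 8] & (1 << (p % 8))
|                        for p in self._positions(item))
|
|     seen = Bloom(m_bits=1 << 20, k=4)
|
|     def already_processed(event_id, db):
|         if not seen.might_contain(event_id):
|             return False            # "definitely not here": no query
|         return db.exists(event_id)  # maybe a false positive: verify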
| adamzwasserman wrote:
| Exactly. A 1.2% false positive rate means unnecessary reads
| 1.2% of the time vs 100% without the filter. Even at 10% FP
| rate, you skip 90% of I/O.
|
| This asymmetry works great for I/O-bound workloads (skip-
| indexes) but fails for TFA's approach where every document
| needs its own filter.
|
| In practice, you combine both: inverted index for the
| dictionary (amortizes across documents), then bloom filters per
| chunk of the index (amortizes across chunks). This two-level
| approach handles scale much better than TFA's one-filter-per-
| document design. It's bloom filters as an optimization layer,
| not a replacement.
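|
| A sketch of that two-level shape in Python (illustrative, not
| TFA's code; the chunk sizing and hash choice are arbitrary):
|
|     import hashlib
|
|     def positions(term, m=4096, k=3):
|         # k bit positions for a term, via double hashing
|         d = hashlib.blake2b(term.encode(), digest_size=16).digest()
|         h1 = int.from_bytes(d[:8], "big")
|         h2 = int.from_bytes(d[8:], "big")
|         return [(h1 + i * h2) % m for i in range(k)]
|
|     class IndexChunk:
|         """A chunk of the inverted index plus its own filter."""
|         def __init__(self):
|             self.postings = {}  # term -> list of doc ids
|             self.bloom = 0      # filter bits packed into an int
|
|         def add(self, term, doc_id):
|             self.postings.setdefault(term, []).append(doc_id)
|             for p in positions(term):
|                 self.bloom |= 1 << p
|
|         def lookup(self, term):
|             # "definitely not in this chunk": skip the postings
|             if any(not (self.bloom >> p) & 1
|                    for p in positions(term)):
|                 return []
|             return self.postings.get(term, [])
|
|     def search(chunks, term):
|         return [doc for c in chunks for doc in c.lookup(term)]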
| munchbunny wrote:
| > that asymmetry is weirdly powerful for specific problems.
|
| 100% agree when it works in your favor. We use it for exactly
| that situation where a non-zero false positive rate is fine and
| you can choose how much memory to devote to getting it closer
| to zero.
|
| There have been a couple of times though where we've needed a
| "keep everything not on this list" operation, and unfortunately
| bloom filters don't work well for that situation. There are
| alternatives, but none as elegantly compact as bloom filters.
| MattPalmer1086 wrote:
| May be true for offline full text search, but not true for online
| string search.
|
| I invented a _very_ fast string search algorithm based on bloom
| filters. Our paper [1] was accepted to the Symposium on
| Experimental Algorithms 2024 [2]. Code can be found here [3].
|
| [1]
| https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.S...
|
| [2] https://sea2024.univie.ac.at/accepted-papers/
|
| [3] https://github.com/nishihatapalmer/HashChain
| lll-o-lll wrote:
| Looks really interesting! If I wanted to try one of these,
| which of the family would you recommend to start with?
| MattPalmer1086 wrote:
| Hashchain is the easiest to implement, but like many search
| algorithms can suffer from quadratic performance on really
| bad data and patterns (e.g. searching for a long sequence of
| zero bytes in a text of zero bytes). In practice this is very
| rare.
|
| If you want guaranteed linear performance in the worst case
| too, then LinearHashchain is the one to use. It is slightly
| more complex to implement as it builds in a KMP style
| verifier to make it linear (so almost two search algorithms
| are needed). It is generally about as fast as HashChain in
| the average case, so you don't lose out.
|
| The others are either experimental or quite niche and not
| suitable for most purposes. SentinelHashchain is actually the
| fastest, but relies on being able to add a copy of the
| pattern at the end of the search text. This won't be possible
| in most search contexts, unless you control all memory
| allocations.
|
| So I'd start with HashChain, and maybe play with the linear
| version later - most of it is the same; it just needs a bit
| more added.
| majke wrote:
| Bing uses bloom filters for the most-recent index:
|
| https://dl.acm.org/doi/pdf/10.1145/3077136.3080789
|
| https://bitfunnel.org/strangeloop/
| susam wrote:
| When I worked at RSA over a decade ago, we developed Bloom
| filter-based indexing to speed up querying on a proprietary
| database that was specialised for storing petabytes of network
| events and packet data. I implemented the core Bloom filter-based
| indexer based on MurmurHash2 functions and I was quite proud of
| the work I did back then. The resulting improvement in query
| performance looked impressive to our customers. I remember the
| querying speed went up from roughly 49,000 records per second to
| roughly 1,490,000 records per second, so nearly a 30-fold
| increase.
|
| However, the performance gain is not surprising at all since
| Bloom filters allow the querying engine to skip large blocks of
| data with certainty when the blocks do not contain the target
| data. False negatives are impossible. False positives occur but
| the rate of false positives can be made very small with well-
| chosen parameters and trade-offs.
|
| With 4 hash functions (k = 4), 10007 bits per Bloom filter
| (m = 10007) and a new Bloom filter for every 1000 records
| (n = 1000), we achieved a theoretical false-positive rate of
| only 1.18% ((1 - e^(-k * n / m)) ^ k ≈ 0.0118). In practice,
| over a period of 5
| years, we found that the actual false positive rate varied
| between 1.13% and 1.29%.
|
| The only downside of a false positive is that it makes the query
| engine read a data block unnecessarily to verify whether the
| target data is present. This affects performance but not
| correctness; much like how CPU branch misprediction affects
| performance but not correctness.
|
| A 30-fold increase in querying speed with just 1.25 kB of
| overhead per data block of 1000 records (each block roughly 1 MB
| to 2 MB in size) was, in my view, an excellent trade-off. It made
| a lot of difference to the customer experience, turning what used
| to be a 2 minute wait for query results into a wait of just about
| 5 seconds, or in larger queries, reducing a 30 minute wait to
| about 1 minute.
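|
| For anyone who wants to check the arithmetic, the quoted rate
| falls straight out of the standard formula:
|
|     from math import exp
|
|     k, m, n = 4, 10007, 1000  # hashes, bits/filter, records/filter
|     fp_rate = (1 - exp(-k * n / m)) ** k
|     print(f"{fp_rate:.4f}")   # -> 0.0118, i.e. about 1.18%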
| susam wrote:
| I just noticed the m = 10007 value in my comment above and I
| thought I should clarify it. The number of bits per bloom
| filter does not need to be a prime number if the hash functions
| have uniform distribution. Murmur2 hash functions do have
| uniform distribution, so m was _not_ chosen to be prime in
| order to reduce collisions in the Bloom filter's bit
| positions. The reason for using a prime value was more mundane.
|
| This was a fairly large project, with roughly 3 million lines
| of C and C++ code which had numerous constants with special
| values defined throughout the code. So instead of using a less
| interesting number like 10000, I chose 10007 so that if we ever
| came across the value 10007 (decimal) or 0x2717 (hexadecimal)
| while inspecting a core dump in a debugger, we would
| immediately recognise it as the constant defining the number of
| bits per Bloom filter.
| anonymars wrote:
| Ha, collision avoidance all the way down
| gkfasdfasdf wrote:
| Interesting trick about the constant value, and thank you for
| the detailed write up!
| 6510 wrote:
| I'm not an expert at all; I learned they existed after making
| something similar to search a few thousand blog posts.
|
| Rather than one hash per file I made a file for each 2-letter
| combination, like aa.raw, ab.raw, etc., where each bit in the
| file represents a record (bit 0 is record 0, etc.). You could
| of course do 3 or 4 letters too.
|
| A query is split into 2-letter combinations (1st + 2nd, 2nd +
| 3rd, etc.), the matching files are loaded, and a bitwise AND
| is done across them.
|
| a search for "bloom" would only load bl.raw, lo.raw, oo.raw,
| om.raw
|
| The index is really hard to update but adding new records is
| easy. New records are first added as false positives until we
| have enough bits to push a byte to the end of each raw file.
|
| I then got lost pondering what creative letter combinations
| would yield the best results. Things like xx.raw and an.raw
| are pretty useless. Words could be treated as if made of
| unique characters.
|
| Characters (or other bytes) could be combined, like s=z, k=c,
| x=y or i=e.
|
| Calculating which combination is best for the static data set
| was too hard for me. One could look at the hit ratio to see
| whether a given letter combination is worth having.
|
| But it works, and loading a handful of files or doing an AND
| is amazingly fast.
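|
| In code the whole idea is small. A Python sketch (the .raw
| files are modeled as Python ints here instead of on-disk
| bitsets):
|
|     def bigrams(word):
|         return {word[i:i + 2] for i in range(len(word) - 1)}
|
|     class BigramIndex:
|         def __init__(self):
|             self.files = {}  # "bl" -> bitset; bit i = record i
|             self.ndocs = 0
|
|         def add(self, text):
|             doc = self.ndocs
|             self.ndocs += 1
|             for word in text.lower().split():
|                 for bg in bigrams(word):
|                     self.files[bg] = self.files.get(bg, 0) | (1 << doc)
|             return doc
|
|         def candidates(self, query):
|             # AND the bitsets for every bigram in the query;
|             # survivors are candidates (false positives possible,
|             # e.g. bigrams matched across different words)
|             mask = (1 << self.ndocs) - 1
|             for bg in bigrams(query.lower()):
|                 mask &= self.files.get(bg, 0)
|             return [i for i in range(self.ndocs)
|                     if (mask >> i) & 1]
|
|     idx = BigramIndex()
|     idx.add("bloom filters for search")
|     idx.add("inverted index basics")
|     print(idx.candidates("bloom"))  # -> [0]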
| susam wrote:
| What you've described here is an n-gram inverted index (with
| n = 2) represented as a bitset. We could call it a _bigram
| bitset inverted index_. Glad to know you designed and
| implemented all of this from first principles, and that it
| serves you well!
| hinkley wrote:
| I had a problem where we needed to compare large data sets
| between machines for keys that existed in both, and the
| bandwidth cost just wasn't mathing for the median result set
| size. I was trying to figure out how to send a fingerprint
| from machine A to B, then have machine B send the hits back.
| Or how many round trips I could do based on set size to
| minimize bandwidth + latency. I ended up with a calculus
| problem nobody could help me solve because of an n^5 term.
|
| My boss was generally pretty good with obscure data
| structures but neither of us had encountered Bloom filters.
| This was about three years after Google published their paper
| on how they were using Bloom filters, but that company would
| be bankrupt before I figured it out.
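|
| For the record, the Bloom filter version of that exchange
| would have been a single round trip. A sketch (network layer
| elided; sizes arbitrary):
|
|     import hashlib
|
|     M, K = 1 << 20, 4  # filter bits (~128 KiB) and hash count
|
|     def positions(key):
|         d = hashlib.sha256(key.encode()).digest()
|         h1 = int.from_bytes(d[:8], "big")
|         h2 = int.from_bytes(d[8:16], "big")
|         return [(h1 + i * h2) % M for i in range(K)]
|
|     def build_filter(keys):
|         bits = 0
|         for key in keys:
|             for p in positions(key):
|                 bits |= 1 << p
|         return bits
|
|     # machine A: ship a fixed-size filter, not the key set
|     a_keys = {"user:1", "user:2", "user:3"}
|     summary = build_filter(a_keys)
|
|     # machine B: send back only keys that might be shared
|     b_keys = {"user:2", "user:3", "user:9"}
|     candidates = [k for k in b_keys
|                   if all((summary >> p) & 1 for p in positions(k))]
|
|     # machine A: one exact pass removes B's false positives
|     shared = [k for k in candidates if k in a_keys]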
| UltraSane wrote:
| Splunk uses bloom filters to make searching for rare events
| fast. Rare events are usually the most interesting.
| hinkley wrote:
| I've only used Splunk with one set of devs and maybe we were
| doing it wrong, but it didn't feel fast to me.
|
| Several of us were working hard to move everything that made
| any sense to be in Prometheus out of Splunk.
|
| Notably, any time we had a production issue where it was
| unclear which team was responsible, Splunk became the
| bottleneck because we started exceeding quotas immediately.
| UltraSane wrote:
| Splunk is some of the best software I've ever used, but it
| HAS to be used with very fast storage to be effective. I've
| only used it on enterprise grade storage arrays and servers
| with lots of RAM for caches. On modern PCIe 5.0 NVMe drives
| it is stupid fast.
|
| I'm not sure what you mean by exceeding quotas, because
| Splunk is normally licensed on GB ingested per day. This can
| lead to bitter fights between teams over how it is
| allocated.
|
| The good thing about this license model is that you can use
| as much hardware as you want for no extra license cost.
| hinkley wrote:
| > used with very fast storage
|
| That sounds like self-hosting, which is not the only
| product they offer. But you still have hardware that can
| only run so many queries at once and then starts queuing
| any additional requests, yeah? Once you had a dozen
| people on a call it went to shit. We only occasionally ran
| into problems like this with Graphite, but you need a lot
| of people looking at a very large dashboard to start
| feeling refresh delays.
| hijinks wrote:
| is there a better way than bloom filters to handle needle-in-
| the-haystack type searches where the haystack might be
| terabytes of data and you only want a few lines?
| philipkglass wrote:
| There are a lot of "better than Bloom" filters that work
| similarly in some aspects. I have used Cuckoo [1] and Ribbon
| [2] filters for Bloom-type applications. If you have an
| application where you do a lot of one kind of searching, it may
| also be worth implementing a specialized variant of a data
| structure. I needed a Cuckoo-type filter on the JVM but only
| for 64 bit integers and I was able to make a smaller, faster
| code base that was specialized to this data type instead of
| handling generic objects.
|
| You need to know up front whether you need to be able to
| dynamically add entries to the filter or if your application
| can tolerate rebuilding the filter entirely whenever the
| underlying data changes. In the latter case you have more
| freedom to choose data structures; many of the modern "better
| than Bloom" filters are more compact but don't support dynamic
| updates.
|
| [1] https://en.wikipedia.org/wiki/Cuckoo_filter
|
| [2] https://engineering.fb.com/2021/07/09/core-infra/ribbon-
| filt...
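|
| For a feel of how the cuckoo side works, here is a minimal
| Python sketch (4-slot buckets, 8-bit fingerprints; a real
| implementation packs buckets into flat arrays, and the table
| size must be a power of two for the XOR trick below):
|
|     import random, hashlib
|
|     class CuckooFilter:
|         def __init__(self, n_buckets=1 << 16):
|             self.n = n_buckets  # power of two
|             self.buckets = [[] for _ in range(n_buckets)]
|
|         def _fp_and_index(self, item):
|             d = hashlib.sha256(str(item).encode()).digest()
|             fp = d[0]  # 8-bit fingerprint
|             return fp, int.from_bytes(d[1:9], "big") % self.n
|
|         def _alt(self, i, fp):
|             # partial-key cuckoo hashing: the other bucket is
|             # derived from the index and the fingerprint alone,
|             # and the mapping is its own inverse
|             h = int.from_bytes(
|                 hashlib.sha256(bytes([fp])).digest()[:8], "big")
|             return i ^ (h % self.n)
|
|         def add(self, item, max_kicks=500):
|             fp, i = self._fp_and_index(item)
|             for b in (i, self._alt(i, fp)):
|                 if len(self.buckets[b]) < 4:
|                     self.buckets[b].append(fp)
|                     return True
|             for _ in range(max_kicks):  # evict and relocate
|                 j = random.randrange(len(self.buckets[i]))
|                 fp, self.buckets[i][j] = self.buckets[i][j], fp
|                 i = self._alt(i, fp)
|                 if len(self.buckets[i]) < 4:
|                     self.buckets[i].append(fp)
|                     return True
|             return False  # table too full; caller should resize
|
|         def might_contain(self, item):
|             fp, i = self._fp_and_index(item)
|             return (fp in self.buckets[i]
|                     or fp in self.buckets[self._alt(i, fp)])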
| hijinks wrote:
| thanks.. i'll read up into these.. always amazes me that
| companies like datadog somehow made log search quick
| hinkley wrote:
| I wonder how often in the wild people are tuning for a 1%
| false positive rate versus a much lower one, like .1%. You do
| quickly reach data set sizes where even 1% introduces some
| strain on resources or responsiveness.
|
| Cuckoo claims 70% of the size of bloom for the same error
| rate, and the space is logarithmic in the error rate. Looks
| like about 6.6 bits per record versus 9.56 bits for bloom at
| 1%. But at a .5% error rate a cuckoo is 7.6 bpr. In fact you
| can get to about a .13% error rate for a cuckoo only a hair
| larger than the equivalent bloom filter (2^9.567 ≈ 758.5,
| i.e. roughly a 1-in-758 error rate).
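|
| The bloom side of that math is just the space-optimal sizing
| formula, m/n = -ln(p) / (ln 2)^2 for a target false positive
| rate p:
|
|     from math import log
|
|     def bloom_bits_per_record(fp_rate):
|         return -log(fp_rate) / (log(2) ** 2)
|
|     for eps in (0.01, 0.005, 0.0013):
|         print(f"{eps:.4%}: {bloom_bits_per_record(eps):.2f} bits")
|     # 1.0000%: 9.59 bits
|     # 0.5000%: 11.03 bits
|     # 0.1300%: 13.83 bits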
| adamzwasserman wrote:
| The "no sharing between filters" insight clicked for me on a
| different problem.
|
| I needed to filter items by tags. Bloom filter per item seemed
| clever - quick membership checks. But with thousands of items
| sharing dozens of tags, each filter re-encodes the same
| vocabulary. Pure waste.
|
| Switched to an inverted index (tag → item list) with bloom
| filters per chunk of the index. Now the tag vocabulary is shared,
| and bloom filters just speed up chunk-skipping when the index
| grows large.
|
| TFA's mistake is using bloom filters _instead_ of an inverted
| index rather than on top of one. The amortization patterns
| stack; they don't compete.
| hinkley wrote:
| Why do these "inverted indexes" just look like indexes to me?
| Too much time with databases perhaps?
| adamzwasserman wrote:
| A non-unique index, yes.
| hinkley wrote:
| Which is most indexes.
| pi_22by7 wrote:
| The key insight about bloom filters lacking synergy is excellent.
| The ~7K document crossover point makes sense because inverted
| indexes amortize dictionary storage across all documents while
| bloom filters must encode it linearly per document.
| hinkley wrote:
| But doesn't that depend on the cardinality of the indexes
| versus the document count? I've seen systems with a stupid
| number of tag values.
| pauldix wrote:
| I believe you could do this effectively with COBS (COmpact Bit
| Sliced signature index): https://panthema.net/2019/1008-COBS-A-
| Compact-Bit-Sliced-Sig...
|
| It's a pretty neat algorithm from a paper in 2019 for the
| application "to index k-mers of DNA samples or q-grams from text
| documents". You can take a collection of bloom filters built for
| documents and then combine them together to have a single filter
| that will tell you which docs it maps to. Like an inverted index
| meets a bloom filter.
|
| I'm using it in a totally different domain for an upcoming
| release in InfluxDB (time series database).
|
| There's also code online here: https://github.com/bingmann/cobs
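|
| The core trick is transposing the filters. A rough Python
| sketch (this skips the "compact" part of COBS, which groups
| documents into size classes so filter lengths can vary):
|
|     import hashlib
|
|     class BitSlicedIndex:
|         """One same-size Bloom filter per document, stored
|         row-wise, so a lookup ANDs k rows instead of probing
|         every document's filter."""
|         def __init__(self, m=4096, k=3):
|             self.m, self.k = m, k
|             self.rows = [0] * m  # rows[r] bit d <=> doc d set bit r
|             self.ndocs = 0
|
|         def _rows_for(self, term):
|             d = hashlib.blake2b(term.encode(),
|                                 digest_size=16).digest()
|             h1 = int.from_bytes(d[:8], "big")
|             h2 = int.from_bytes(d[8:], "big")
|             return [(h1 + i * h2) % self.m for i in range(self.k)]
|
|         def add_document(self, terms):
|             doc = self.ndocs
|             self.ndocs += 1
|             for term in terms:
|                 for r in self._rows_for(term):
|                     self.rows[r] |= 1 << doc
|             return doc
|
|         def candidates(self, term):
|             mask = (1 << self.ndocs) - 1
|             for r in self._rows_for(term):
|                 mask &= self.rows[r]
|             return [d for d in range(self.ndocs)
|                     if (mask >> d) & 1]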
| KevBurnsJr wrote:
| Reminds me of @danthegoodman's project: bloomsearch [1]
|
| [1] https://github.com/danthegoodman1/bloomsearch
| taeric wrote:
| I will forever think of Bloom filters as "bouncer filters." Could
| go with concierge filter. Basically, it is the equivalent of
| every movie where the detective is asking the front desk various
| attributes of who they are looking for.
|
| It is not hard to see how you could start asking the front desk
| to track every obscure attribute, and how that would fall over
| for various reasons.
| hinkley wrote:
| > Fun fact: There is a nice implementation of this exact
| algorithm that is still used in the wild.
|
| I thought that was going to be a link to Google.com
| pncnmnp wrote:
| When my friends and I were undergrads (3rd year, I believe), we
| had an absolute blast exploring this exact topic - the
| intersection of Bloom Filters and client side searching. So much
| so that it became part of our undergrad thesis.
|
| It all started when Stavros's blog was circulated on Hacker News!
| The way we approached the search part was by using "Spectral
| Bloom Filters" - https://theory.stanford.edu/~matias/papers/sbf-
| sigmod-03.pdf - which is based on a paper by Saar Cohen and Yossi
| Matias from the early 2000s - it's basically an iteration on
| counting bloom filters. We used the minimal selection and
| minimal increase algorithms from the paper for insertion and
| ranking of results.
|
| I wrote a blog on it too -
| https://pncnmnp.github.io/blogs/spectral-bloom-filters.html
|
| Some slides - https://pncnmnp.github.io/blogs/sthir-talk-2020.pdf
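|
| For context, the two rules from the paper are tiny. A sketch
| as I remember them (counters in a flat list; parameters
| arbitrary):
|
|     import hashlib
|
|     class SpectralBloom:
|         def __init__(self, m=4096, k=4):
|             self.m, self.k = m, k
|             self.counters = [0] * m
|
|         def _positions(self, item):
|             d = hashlib.sha256(item.encode()).digest()
|             h1 = int.from_bytes(d[:8], "big")
|             h2 = int.from_bytes(d[8:16], "big")
|             return [(h1 + i * h2) % self.m for i in range(self.k)]
|
|         def add(self, item):
|             # minimal increase: bump only the counters that sit
|             # at the current minimum
|             pos = self._positions(item)
|             low = min(self.counters[p] for p in pos)
|             for p in pos:
|                 if self.counters[p] == low:
|                     self.counters[p] += 1
|
|         def estimate(self, item):
|             # minimal selection: the smallest counter is the
|             # tightest available bound on the true count
|             return min(self.counters[p]
|                        for p in self._positions(item))
|
| Frequency estimates like these are what we ranked results with.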
___________________________________________________________________
(page generated 2025-11-04 23:00 UTC)