[HN Gopher] Bloom filters are good for search that does not scale
___________________________________________________________________
Bloom filters are good for search that does not scale
Author : birdculture
Score : 160 points
Date : 2025-11-04 09:25 UTC (13 hours ago)
(HTM) web link (notpeerreviewed.com)
(TXT) w3m dump (notpeerreviewed.com)
| sanskarix wrote:
| the beautiful thing about bloom filters is they let you say
| "definitely not here" without checking everything. that asymmetry
| is weirdly powerful for specific problems.
|
| I've seen them save startups real money in caching layers -
| checking "did we already process this event" before hitting the
| database. false positives are fine because you just check the
| database anyway, but true negatives save you thousands of
| queries.
|
| the trick is recognizing when false positives don't hurt you.
| most engineers learn about them in theory but never find that
| practical use case where they're actually the right tool. same
| with skip lists and a bunch of other algorithms - brilliant for
| 2% of problems, overkill for the rest.
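|
| a rough sketch of that caching pattern in python (the bloom
| filter is hand-rolled for illustration; `db.exists` is a
| hypothetical stand-in for whatever lookup you actually do):
|
|     import hashlib
|
|     class Bloom:
|         def __init__(self, m_bits, k):
|             self.m, self.k = m_bits, k
|             self.bits = bytearray(m_bits // 8 + 1)
|
|         def _positions(self, item):
|             # derive k bit positions from one digest
|             # (Kirsch-Mitzenmacher double hashing)
|             d = hashlib.sha256(item.encode()).digest()
|             h1 = int.from_bytes(d[:8], "big")
|             h2 = int.from_bytes(d[8:16], "big")
|             return [(h1 + i * h2) % self.m for i in range(self.k)]
|
|         def add(self, item):
|             for p in self._positions(item):
|                 self.bits[p // 8] |= 1 << (p % 8)
|
|         def might_contain(self, item):
|             return all(self.bits[p // 8] & (1 << (p % 8))
|                        for p in self._positions(item))
|
|     seen = Bloom(m_bits=1 << 20, k=4)
|
|     def already_processed(event_id, db):
|         if not seen.might_contain(event_id):
|             return False            # "definitely not here": no query
|         return db.exists(event_id)  # maybe a false positive: verify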
| adamzwasserman wrote:
| Exactly. A 1.2% false positive rate means unnecessary reads
| 1.2% of the time vs 100% without the filter. Even at 10% FP
| rate, you skip 90% of I/O.
|
| This asymmetry works great for I/O-bound workloads (skip-
| indexes) but fails for TFA's approach where every document
| needs its own filter.
|
| In practice, you combine both: inverted index for the
| dictionary (amortizes across documents), then bloom filters per
| chunk of the index (amortizes across chunks). This two-level
| approach handles scale much better than TFA's one-filter-per-
| document design. It's bloom filters as an optimization layer,
| not a replacement.
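|
| A sketch of that two-level shape in Python (illustrative, not
| TFA's code; the chunk sizing and hash choice are arbitrary):
|
|     import hashlib
|
|     def positions(term, m=4096, k=3):
|         # k bit positions for a term, via double hashing
|         d = hashlib.blake2b(term.encode(), digest_size=16).digest()
|         h1 = int.from_bytes(d[:8], "big")
|         h2 = int.from_bytes(d[8:], "big")
|         return [(h1 + i * h2) % m for i in range(k)]
|
|     class IndexChunk:
|         """A chunk of the inverted index plus its own filter."""
|         def __init__(self):
|             self.postings = {}  # term -> list of doc ids
|             self.bloom = 0      # filter bits packed into an int
|
|         def add(self, term, doc_id):
|             self.postings.setdefault(term, []).append(doc_id)
|             for p in positions(term):
|                 self.bloom |= 1 << p
|
|         def lookup(self, term):
|             # "definitely not in this chunk": skip the postings
|             if any(not (self.bloom >> p) & 1
|                    for p in positions(term)):
|                 return []
|             return self.postings.get(term, [])
|
|     def search(chunks, term):
|         return [doc for c in chunks for doc in c.lookup(term)]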
| munchbunny wrote:
| > that asymmetry is weirdly powerful for specific problems.
|
| 100% agree when it works in your favor. We use it for exactly
| that situation where a non-zero false positive rate is fine and
| you can choose how much memory to devote to getting it closer
| to zero.
|
| There have been a couple of times though where we've needed a
| "keep everything not on this list" operation, and unfortunately
| bloom filters don't work well for that situation. There are
| alternatives, but none as elegantly compact as bloom filters.
| MattPalmer1086 wrote:
| May be true for offline full text search, but not true for online
| string search.
|
| I invented a _very_ fast string search algorithm based on bloom
| filters. Our paper [1] was accepted to the Symposium on
| Experimental Algorithms 2024 [2]. Code can be found here [3].
|
| [1]
| https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.S...
|
| [2] https://sea2024.univie.ac.at/accepted-papers/
|
| [3] https://github.com/nishihatapalmer/HashChain
| lll-o-lll wrote:
| Looks really interesting! If I wanted to try one of these,
| which of the family would you recommend to start with?
| MattPalmer1086 wrote:
| Hashchain is the easiest to implement, but like many search
| algorithms can suffer from quadratic performance on really
| bad data and patterns (e.g. searching for a long sequence of
| zero bytes in a text of zero bytes). In practice this is very
| rare.
|
| If you want guaranteed linear performance in the worst case
| too, then LinearHashchain is the one to use. It is slightly
| more complex to implement as it builds in a KMP style
| verifier to make it linear (so almost two search algorithms
| are needed). It is generally about as fast as HashChain in
| the average case, so you don't lose out.
|
| The others are either experimental or quite niche and not
| suitable for most purposes. SentinelHashchain is actually the
| fastest, but relies on being able to add a copy of the
| pattern at the end of the search text. This won't be possible
| in most search contexts, unless you control all memory
| allocations.
|
| So I'd start with HashChain, and maybe play with the linear
| version later - most of it is the same; it just needs a bit
| more added.
| majke wrote:
| Bing uses bloom filters for the most-recent index:
|
| https://dl.acm.org/doi/pdf/10.1145/3077136.3080789
|
| https://bitfunnel.org/strangeloop/
| susam wrote:
| When I worked at RSA over a decade ago, we developed Bloom
| filter-based indexing to speed up querying on a proprietary
| database that was specialised for storing petabytes of network
| events and packet data. I implemented the core Bloom filter-based
| indexer based on MurmurHash2 functions and I was quite proud of
| the work I did back then. The resulting improvement in query
| performance looked impressive to our customers. I remember the
| querying speed went up from roughly 49,000 records per second to
| roughly 1,490,000 records per second, so nearly a 30-fold
| increase.
|
| However, the performance gain is not surprising at all since
| Bloom filters allow the querying engine to skip large blocks of
| data with certainty when the blocks do not contain the target
| data. False negatives are impossible. False positives occur but
| the rate of false positives can be made very small with well-
| chosen parameters and trade-offs.
|
| With 4 hash functions (k = 4), 10007 bits per Bloom filter
| (m = 10007) and a new Bloom filter for every 1000 records
| (n = 1000), we achieved a theoretical false-positive rate of
| only 1.18% ((1 - e^(-k * n / m)) ^ k ≈ 0.0118). In practice,
| over a period of 5
| years, we found that the actual false positive rate varied
| between 1.13% and 1.29%.
|
| The only downside of a false positive is that it makes the query
| engine read a data block unnecessarily to verify whether the
| target data is present. This affects performance but not
| correctness; much like how CPU branch misprediction affects
| performance but not correctness.
|
| A 30-fold increase in querying speed with just 1.25 kB of
| overhead per data block of 1000 records (each block roughly 1 MB
| to 2 MB in size) was, in my view, an excellent trade-off. It made
| a lot of difference to the customer experience, turning what used
| to be a 2 minute wait for query results into a wait of just about
| 5 seconds, or in larger queries, reducing a 30 minute wait to
| about 1 minute.
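|
| For anyone who wants to check the arithmetic, the quoted rate
| falls straight out of the standard formula:
|
|     from math import exp
|
|     k, m, n = 4, 10007, 1000  # hashes, bits/filter, records/filter
|     fp_rate = (1 - exp(-k * n / m)) ** k
|     print(f"{fp_rate:.4f}")   # -> 0.0118, i.e. about 1.18%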
| susam wrote:
| I just noticed the m = 10007 value in my comment above and I
| thought I should clarify it. The number of bits per bloom
| filter does not need to be a prime number if the hash functions
| have uniform distribution. Murmur2 hash functions do have
| uniform distribution, so m was _not_ chosen to be prime in
| order to reduce collisions in the Bloom filter's bit
| positions. The reason for using a prime value was more mundane.
|
| This was a fairly large project, with roughly 3 million lines
| of C and C++ code which had numerous constants with special
| values defined throughout the code. So instead of using a less
| interesting number like 10000, I chose 10007 so that if we ever
| came across the value 10007 (decimal) or 0x2717 (hexadecimal)
| while inspecting a core dump in a debugger, we would
| immediately recognise it as the constant defining the number of
| bits per Bloom filter.
| anonymars wrote:
| Ha, collision avoidance all the way down
| gkfasdfasdf wrote:
| Interesting trick about the constant value, and thank you for
| the detailed write up!
| 6510 wrote:
| I'm not an expert at all; I learned they existed after making
| something similar to search a few thousand blog posts.
|
| Rather than one hash per file I made a file for each 2-letter
| combination, like aa.raw, ab.raw, etc., where each bit in the
| file represents a record (bit 0 is record 0, etc.). You could
| of course do 3 or 4 letters too.
|
| A query is split into 2-letter combinations (1st + 2nd, 2nd +
| 3rd, etc.), the matching files are loaded, and a bitwise AND
| is done across them.
|
| a search for "bloom" would only load bl.raw, lo.raw, oo.raw,
| om.raw
|
| The index is really hard to update but adding new records is
| easy. New records are first added as false positives until we
| have enough bits to push a byte to the end of each raw file.
|
| I then got lost pondering what creative letter combinations
| would yield the best results. Things like xx.raw and an.raw
| are pretty useless. Words could be treated as if made of
| unique characters.
|
| Characters (or other bytes) could be combined, like s=z, k=c,
| x=y or i=e.
|
| Calculating which combination is best for the static data set
| was too hard for me. One could look at the hit ratio to see
| whether a given letter combination is worth having.
|
| But it works, and loading a handful of files or doing an AND
| is amazingly fast.
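|
| In code the whole idea is small. A Python sketch (the .raw
| files are modeled as Python ints here instead of on-disk
| bitsets):
|
|     def bigrams(word):
|         return {word[i:i + 2] for i in range(len(word) - 1)}
|
|     class BigramIndex:
|         def __init__(self):
|             self.files = {}  # "bl" -> bitset; bit i = record i
|             self.ndocs = 0
|
|         def add(self, text):
|             doc = self.ndocs
|             self.ndocs += 1
|             for word in text.lower().split():
|                 for bg in bigrams(word):
|                     self.files[bg] = self.files.get(bg, 0) | (1 << doc)
|             return doc
|
|         def candidates(self, query):
|             # AND the bitsets for every bigram in the query;
|             # survivors are candidates (false positives possible,
|             # e.g. bigrams matched across different words)
|             mask = (1 << self.ndocs) - 1
|             for bg in bigrams(query.lower()):
|                 mask &= self.files.get(bg, 0)
|             return [i for i in range(self.ndocs)
|                     if (mask >> i) & 1]
|
|     idx = BigramIndex()
|     idx.add("bloom filters for search")
|     idx.add("inverted index basics")
|     print(idx.candidates("bloom"))  # -> [0]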
| susam wrote:
| What you've described here is an n-gram inverted index (with
| n = 2) represented as a bitset. We could call it a _bigram
| bitset inverted index_. Glad to know you designed and
| implemented all of this from first principles, and that it
| serves you well!
| hinkley wrote:
| I had a problem where we needed to compare large data sets
| between machines for keys that existed in both, and the
| bandwidth cost just wasn't mathing for the median result set
| size. I was trying to figure out how to send a fingerprint
| from machine A to B, then have machine B send the hits back.
| Or how many round trips I could do based on set size to
| minimize bandwidth + latency. I ended up with a calculus
| problem nobody could help me solve because of an n^5 term.
|
| My boss was generally pretty good with obscure data
| structures but neither of us had encountered Bloom filters.
| This was about three years after Google published their paper
| on how they were using Bloom filters, but that company would
| be bankrupt before I figured it out.
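|
| For the record, the Bloom filter version of that exchange
| would have been a single round trip. A sketch (network layer
| elided; sizes arbitrary):
|
|     import hashlib
|
|     M, K = 1 << 20, 4  # filter bits (~128 KiB) and hash count
|
|     def positions(key):
|         d = hashlib.sha256(key.encode()).digest()
|         h1 = int.from_bytes(d[:8], "big")
|         h2 = int.from_bytes(d[8:16], "big")
|         return [(h1 + i * h2) % M for i in range(K)]
|
|     def build_filter(keys):
|         bits = 0
|         for key in keys:
|             for p in positions(key):
|                 bits |= 1 << p
|         return bits
|
|     # machine A: ship a fixed-size filter, not the key set
|     a_keys = {"user:1", "user:2", "user:3"}
|     summary = build_filter(a_keys)
|
|     # machine B: send back only keys that might be shared
|     b_keys = {"user:2", "user:3", "user:9"}
|     candidates = [k for k in b_keys
|                   if all((summary >> p) & 1 for p in positions(k))]
|
|     # machine A: one exact pass removes B's false positives
|     shared = [k for k in candidates if k in a_keys]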
| UltraSane wrote:
| Splunk uses bloom filters to make searching for rare events
| fast. Rare events are usually the most interesting.
| hinkley wrote:
| I've only used Splunk with one set of devs and maybe we were
| doing it wrong, but it didn't feel fast to me.
|
| Several of us were working hard to move everything that made
| any sense to be in Prometheus out of Splunk.
|
| Notably, any time we had a production issue where it was
| unclear which team was responsible, Splunk became the
| bottleneck because we started exceeding quotas immediately.
| UltraSane wrote:
| Splunk is some of the best software I've ever used, but it
| HAS to be used with very fast storage to be effective. I've
| only used it on enterprise grade storage arrays and servers
| with lots of RAM for caches. On modern PCIe 5.0 NVMe drives
| it is stupid fast.
|
| I'm not sure what you mean by exceeding quotas, because
| Splunk is normally licensed on GB ingested per day. This can
| lead to bitter fights between teams over how it is
| allocated.
|
| The good thing about this license model is that you can use
| as much hardware as you want for no extra license cost.
| hinkley wrote:
| > used with very fast storage
|
| That sounds like self-hosting, which is not the only
| product they offer. But you still have hardware that can
| only run so many queries at once and then starts queuing
| any additional requests, yeah? Once you had a dozen
| people on a call it went to shit. We only occasionally ran
| into problems like this with Graphite, but you need a lot
| of people looking at a very large dashboard to start
| feeling refresh delays.
| hijinks wrote:
| is there a better way than bloom filters to handle needle-in-
| the-haystack type searches where the haystack might be
| terabytes of data and you only want a few lines?
| philipkglass wrote:
| There are a lot of "better than Bloom" filters that work
| similarly in some aspects. I have used Cuckoo [1] and Ribbon
| [2] filters for Bloom-type applications. If you have an
| application where you do a lot of one kind of searching, it may
| also be worth implementing a specialized variant of a data
| structure. I needed a Cuckoo-type filter on the JVM but only
| for 64 bit integers and I was able to make a smaller, faster
| code base that was specialized to this data type instead of
| handling generic objects.
|
| You need to know up front whether you need to be able to
| dynamically add entries to the filter or if your application
| can tolerate rebuilding the filter entirely whenever the
| underlying data changes. In the latter case you have more
| freedom to choose data structures; many of the modern "better
| than Bloom" filters are more compact but don't support dynamic
| updates.
|
| [1] https://en.wikipedia.org/wiki/Cuckoo_filter
|
| [2] https://engineering.fb.com/2021/07/09/core-infra/ribbon-
| filt...
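|
| For a feel of how the cuckoo side works, here is a minimal
| Python sketch (4-slot buckets, 8-bit fingerprints; a real
| implementation packs buckets into flat arrays, and the table
| size must be a power of two for the XOR trick below):
|
|     import random, hashlib
|
|     class CuckooFilter:
|         def __init__(self, n_buckets=1 << 16):
|             self.n = n_buckets  # power of two
|             self.buckets = [[] for _ in range(n_buckets)]
|
|         def _fp_and_index(self, item):
|             d = hashlib.sha256(str(item).encode()).digest()
|             fp = d[0]  # 8-bit fingerprint
|             return fp, int.from_bytes(d[1:9], "big") % self.n
|
|         def _alt(self, i, fp):
|             # partial-key cuckoo hashing: the other bucket is
|             # derived from the index and the fingerprint alone,
|             # and the mapping is its own inverse
|             h = int.from_bytes(
|                 hashlib.sha256(bytes([fp])).digest()[:8], "big")
|             return i ^ (h % self.n)
|
|         def add(self, item, max_kicks=500):
|             fp, i = self._fp_and_index(item)
|             for b in (i, self._alt(i, fp)):
|                 if len(self.buckets[b]) < 4:
|                     self.buckets[b].append(fp)
|                     return True
|             for _ in range(max_kicks):  # evict and relocate
|                 j = random.randrange(len(self.buckets[i]))
|                 fp, self.buckets[i][j] = self.buckets[i][j], fp
|                 i = self._alt(i, fp)
|                 if len(self.buckets[i]) < 4:
|                     self.buckets[i].append(fp)
|                     return True
|             return False  # table too full; caller should resize
|
|         def might_contain(self, item):
|             fp, i = self._fp_and_index(item)
|             return (fp in self.buckets[i]
|                     or fp in self.buckets[self._alt(i, fp)])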
| hijinks wrote:
| thanks.. i'll read up into these.. always amazes me that
| companies like datadog somehow made log search quick
| hinkley wrote:
| I wonder how often in the wild people are tuning for a 1%
| false positive rate versus a much lower one, like .1%. You do
| quickly reach data set sizes where even 1% introduces some
| strain on resources or responsiveness.
|
| Cuckoo claims 70% of the size of bloom for the same error
| rate, and the space is logarithmic in the error rate. Looks
| like about 6.6 bits per record versus 9.56 bits for bloom at
| 1%. But at a .5% error rate a cuckoo is 7.6 bpr. In fact you
| can get to about a .13% error rate for a cuckoo only a hair
| larger than the equivalent bloom filter (2^9.567 ≈ 758.5,
| i.e. roughly a 1-in-758 error rate).
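|
| The bloom side of that math is just the space-optimal sizing
| formula, m/n = -ln(p) / (ln 2)^2 for a target false positive
| rate p:
|
|     from math import log
|
|     def bloom_bits_per_record(fp_rate):
|         return -log(fp_rate) / (log(2) ** 2)
|
|     for eps in (0.01, 0.005, 0.0013):
|         print(f"{eps:.4%}: {bloom_bits_per_record(eps):.2f} bits")
|     # 1.0000%: 9.59 bits
|     # 0.5000%: 11.03 bits
|     # 0.1300%: 13.83 bits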
| adamzwasserman wrote:
| The "no sharing between filters" insight clicked for me on a
| different problem.
|
| I needed to filter items by tags. Bloom filter per item seemed
| clever - quick membership checks. But with thousands of items
| sharing dozens of tags, each filter re-encodes the same
| vocabulary. Pure waste.
|
| Switched to an inverted index (tag → item list) with bloom
| filters per chunk of the index. Now the tag vocabulary is shared,
| and bloom filters just speed up chunk-skipping when the index
| grows large.
|
| TFA's mistake is using bloom filters _instead_ of an inverted
| index rather than on top of one. The amortization patterns
| stack; they don't compete.
| hinkley wrote:
| Why do these "inverted indexes" just look like indexes to me?
| Too much time with databases perhaps?
| adamzwasserman wrote:
| A non-unique index, yes.
| hinkley wrote:
| Which is most indexes.
| pi_22by7 wrote:
| The key insight about bloom filters lacking synergy is excellent.
| The ~7K document crossover point makes sense because inverted
| indexes amortize dictionary storage across all documents while
| bloom filters must encode it linearly per document.
| hinkley wrote:
| But doesn't that depend on the cardinality of the indexes
| versus the document count? I've seen systems with a stupid
| number of tag values.
| pauldix wrote:
| I believe you could do this effectively with COBS (COmpact Bit
| Sliced signature index): https://panthema.net/2019/1008-COBS-A-
| Compact-Bit-Sliced-Sig...
|
| It's a pretty neat algorithm from a paper in 2019 for the
| application "to index k-mers of DNA samples or q-grams from text
| documents". You can take a collection of bloom filters built for
| documents and then combine them together to have a single filter
| that will tell you which docs it maps to. Like an inverted index
| meets a bloom filter.
|
| I'm using it in a totally different domain for an upcoming
| release in InfluxDB (time series database).
|
| There's also code online here: https://github.com/bingmann/cobs
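|
| The core trick is transposing the filters. A rough Python
| sketch (this skips the "compact" part of COBS, which groups
| documents into size classes so filter lengths can vary):
|
|     import hashlib
|
|     class BitSlicedIndex:
|         """One same-size Bloom filter per document, stored
|         row-wise, so a lookup ANDs k rows instead of probing
|         every document's filter."""
|         def __init__(self, m=4096, k=3):
|             self.m, self.k = m, k
|             self.rows = [0] * m  # rows[r] bit d <=> doc d set bit r
|             self.ndocs = 0
|
|         def _rows_for(self, term):
|             d = hashlib.blake2b(term.encode(),
|                                 digest_size=16).digest()
|             h1 = int.from_bytes(d[:8], "big")
|             h2 = int.from_bytes(d[8:], "big")
|             return [(h1 + i * h2) % self.m for i in range(self.k)]
|
|         def add_document(self, terms):
|             doc = self.ndocs
|             self.ndocs += 1
|             for term in terms:
|                 for r in self._rows_for(term):
|                     self.rows[r] |= 1 << doc
|             return doc
|
|         def candidates(self, term):
|             mask = (1 << self.ndocs) - 1
|             for r in self._rows_for(term):
|                 mask &= self.rows[r]
|             return [d for d in range(self.ndocs)
|                     if (mask >> d) & 1]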
| KevBurnsJr wrote:
| Reminds me of @danthegoodman's project: bloomsearch [1]
|
| [1] https://github.com/danthegoodman1/bloomsearch
| taeric wrote:
| I will forever think of Bloom filters as "bouncer filters." Could
| go with concierge filter. Basically, it is the equivalent of
| every movie where the detective is asking the front desk various
| attributes of who they are looking for.
|
| It is not hard to see how you could start asking the front desk
| to track every obscure attribute, and how that would fall over
| for various reasons.
| hinkley wrote:
| > Fun fact: There is a nice implementation of this exact
| algorithm that is still used in the wild.
|
| I thought that was going to be a link to Google.com
| pncnmnp wrote:
| When my friends and I were undergrads (3rd year, I believe), we
| had an absolute blast exploring this exact topic - the
| intersection of Bloom Filters and client side searching. So much
| so that it became part of our undergrad thesis.
|
| It all started when Stavros's blog was circulated on Hacker News!
| The way we approached the search part was by using "Spectral
| Bloom Filters" - https://theory.stanford.edu/~matias/papers/sbf-
| sigmod-03.pdf - which is based on a paper by Saar Cohen and Yossi
| Matias from the early 2000s - it's basically an iteration on
| counting bloom filters. We used the minimal selection and
| minimal increase algorithms from the paper for insertion and
| ranking of results.
|
| I wrote a blog on it too -
| https://pncnmnp.github.io/blogs/spectral-bloom-filters.html
|
| Some slides - https://pncnmnp.github.io/blogs/sthir-talk-2020.pdf
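|
| For context, the two rules from the paper are tiny. A sketch
| as I remember them (counters in a flat list; parameters
| arbitrary):
|
|     import hashlib
|
|     class SpectralBloom:
|         def __init__(self, m=4096, k=4):
|             self.m, self.k = m, k
|             self.counters = [0] * m
|
|         def _positions(self, item):
|             d = hashlib.sha256(item.encode()).digest()
|             h1 = int.from_bytes(d[:8], "big")
|             h2 = int.from_bytes(d[8:16], "big")
|             return [(h1 + i * h2) % self.m for i in range(self.k)]
|
|         def add(self, item):
|             # minimal increase: bump only the counters that sit
|             # at the current minimum
|             pos = self._positions(item)
|             low = min(self.counters[p] for p in pos)
|             for p in pos:
|                 if self.counters[p] == low:
|                     self.counters[p] += 1
|
|         def estimate(self, item):
|             # minimal selection: the smallest counter is the
|             # tightest available bound on the true count
|             return min(self.counters[p]
|                        for p in self._positions(item))
|
| Frequency estimates like these are what we ranked results with.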
___________________________________________________________________
(page generated 2025-11-04 23:00 UTC)