[HN Gopher] Bloom Filters by Example
___________________________________________________________________
Bloom Filters by Example
Author : ibobev
Score : 163 points
Date : 2025-06-29 11:56 UTC (11 hours ago)
(HTM) web link (llimllib.github.io)
(TXT) w3m dump (llimllib.github.io)
| marginalia_nu wrote:
| I have a trick I like:
|
| For sets that are plausibly sometimes going to be small where
| you're going to do a lot of membership checks, you can
| speculatively add a 64 bit bloom filter with a trivial hash
| function.
|
| This sounds really stupid, but the cost of doing this is so small
| you can do it as a gamble. If it doesn't work out you've added
| like 10ns to your insertions and membership checks, but when it
| does work out, you can save an incredible amount of work.
| Sesse__ wrote:
| Chromium does this in a bunch of places; the article only links
| to Safe Browsing using murmur, but the renderer (Blink)
| generally uses rapidhash and has some of these micro-filters
| which it uses for e.g.: - querySelector() in
| certain cases - Prefiltering hash lookups in CSS buckets
| - Rapid reject of elements when looking for certain Aria
| attributes (for accessibility)
|
| It's surprising that such tiny filters (32 or 64 bits) work at
| all, but they often do. There are also some larger Bloom
| filters around.
|
| (I added some of these)
| marginalia_nu wrote:
| They just have a really unintuitive economy where they
| basically only need to work once or twice to make up for the
| cost of all the times they don't contribute any benefit.
| Sesse__ wrote:
| For extra fun, you sometimes can make ideal filters with no
| false positives, if you know your possible elements ahead
| of time and you don't insert too many of them. (E.g., for
| 20 elements, you can construct a 12-bit code where there
| are guaranteed no false positives as long as you insert at
| most two elements.)
| alienbaby wrote:
| This article is aimed squarely at people like me. I'd heard of
| them. I kept meaning to look them up everytime I saw them
| mentioned. I finally did when I saw your articale and it was the
| perfect intro that I was looking for :)
| vayup wrote:
| Same with me
| konsalexee wrote:
| Another one bloom filter post I really appreciated from Eli
| Bendersky if anyone wants to read more:
| https://eli.thegreenplace.net/2025/bloom-filters/
| 256bit wrote:
| Another visualisation of Bloom filters can be found at the end of
| this page: https://www.chrislaux.com/hashtable.html
| verytrivial wrote:
| The overlap in concepts required to understand Bloom filters,
| sets and hash tables is about 95% IMHO. A set is a hash table
| used for membership tests where you only care about the key, not
| the value. And a Bloom filter is just a set that exploits the
| fact that many-to-one hashing 'compresses' the key-space with
| collisions. It deliberately uses a very collide-y hash function.
| If a specific key was ever hashed, you WILL get a hit, but there
| might be other keys that produced the same hash. It's a feature,
| not a bug.
| marginalia_nu wrote:
| If you've grokked bloom filters, you're very close to also
| understanding both random projection and certain
| implementations of locality-sensitive hashes.
| cherrycherry98 wrote:
| Glad to know I'm not alone in my mental modeling of Bloom
| filters as just hash tables that only track the buckets which
| have data but not the actual data itself.
| cortesoft wrote:
| I think the main bit your explanation is missing is how a bloom
| filter uses multiple hash functions to reduce collisions. For
| example, a bloom filter might have 3 hashes, and all of them
| have to hit for a key to be known to be in the set. This
| reduces the likelihood of false positive collisions while
| keeping the no false negatives guarantee.
| anon-3988 wrote:
| I have a specific use case where I know from startup the list of
| words that I want to find and this will not change for the
| duration of the program. Can anyone think of a low latency
| solution to this? I have tried a lot of variations of bloom
| filter, perfect hash, linear lookup, binary search, set search
| etc
|
| It appears that perfect hash is the one that works the best for
| my use case.
| jerf wrote:
| You're saying you can use a perfect hash also implies you know
| you will _only_ find those values? If so, then yes, the name is
| accurate and is probably a very good choice.
|
| But if you put things into the perfect hash function it is not
| expecting, some fraction of them will collide.
|
| If you're searching for a fixed set, look at the Ragel library.
| Compile-time generation of the search in a way that is very
| hard to beat.
| b0a04gl wrote:
| i got into bloom filters while debugging cassandra read spikes
| ,lot of sstable lookups even when key not exist ,didnt make sense
| at first ,then realised bloom filter on each sstable meant to
| skip disk ,but default fp rate was high like 0.1 or so ,too much
| for our case ,most reads were cache miss anyway so those false
| positives were killing us ,changed it to 0.01 ,bit more memory it
| consumed but way less useless reads ,lbrought p99 read latency by
| good 16-18%
| costco wrote:
| I had used bloom filters in the past without really understanding
| how they worked. Then one day I decided to implement them just
| going off the Wikipedia article with the 32-bit MurmurHash
| function and was surprised at how simple it was. If you're using
| C++ you can use std::vector<bool> (or as of C++23, std::bitset)
| to make it even easier to store the bits in a space efficient
| way.
| kridsdale3 wrote:
| I wrote a Bloom Filter for college in CUDA in 2009. My advisor
| was a former Nvidia guy. I then went on to not do any GPU
| programming at all in my career.
|
| I probably could have made $100,000,000 if I had made a different
| choice there.
| Kranar wrote:
| Could have also bought Bitcoin and made a lot more... just
| saying.
| ricardo81 wrote:
| Improbable considering it was a CS idea in 1970. Surely every
| idea for GPGPU was fair game.
|
| I wrote a hashcash implementation on a GPU 10 years ago. Pretty
| sure it's valueless now.
| deryilz wrote:
| It's always great to see interactive posts. I also appreciated
| the list of where Bloom filters were used in popular programs.
| clnhlzmn wrote:
| I recently used a bloom filter to achieve a log message anti-spam
| feature. In the logger I hashed the messages and inserted into
| the filter. If the entry was present I didn't output the
| messages. Then every few seconds I would iterate over the filter
| and clear all the bits. It worked out nicely that I didn't have
| to worry about atomically clearing all the bits in the filter, if
| messages were coming in and any of their bits had been cleared
| that was sufficient to cause them to be logged again. This was
| much more efficient than the previous implementation which kept a
| count of messages seen and would saturate at N and had the effect
| that if a particular message was being repeatedly logged it would
| be seen, just at most at the rate at which the filter was being
| cleared.
|
| After being aware of bloom filters for a while it was quite
| satisfying to organically find a real use for one that was such
| an improvement.
| jedberg wrote:
| I love me a good bloom filter. Bloom filters are one of the
| concepts I try to make sure everyone I talk to about tech knows
| -- the others being Random Weight Hashing (aka Rendezvous
| Hashing, aka Highest Random Weight Hashing) and Cumulative flow
| diagrams.
|
| All three concepts are key in understanding operations of complex
| distributed systems.
| sureglymop wrote:
| What about distributed hash table architectures in general?
| circle, chord, CAN and kademlia, etc.
| jedberg wrote:
| There's only so many things you can put on your "I must tell
| everyone I meet about this" list. :)
|
| But yes, those are pretty important too.
| jedberg wrote:
| Note to the author -- I love the interactive part. One suggestion
| to really drive the point home: Give the user an example of two
| strings that have a hash collision, and tell them to enter one
| into the box, and then test for the other in the second box.
|
| It will demonstrate why the answer is always "maybe it's in the
| set!" instead of "yes".
| thomascountz wrote:
| "bloom" and "demonstrators " (mind the trailing space
| character)
|
| Both collide with: fnv: 7 murmur: 12
___________________________________________________________________
(page generated 2025-06-29 23:00 UTC)