[HN Gopher] Bloom Filters by Example
       ___________________________________________________________________
        
       Bloom Filters by Example
        
       Author : ibobev
       Score  : 163 points
       Date   : 2025-06-29 11:56 UTC (11 hours ago)
        
 (HTM) web link (llimllib.github.io)
 (TXT) w3m dump (llimllib.github.io)
        
       | marginalia_nu wrote:
       | I have a trick I like:
       | 
       | For sets that are plausibly sometimes going to be small where
       | you're going to do a lot of membership checks, you can
       | speculatively add a 64 bit bloom filter with a trivial hash
       | function.
       | 
       | This sounds really stupid, but the cost of doing this is so small
       | you can do it as a gamble. If it doesn't work out you've added
       | like 10ns to your insertions and membership checks, but when it
       | does work out, you can save an incredible amount of work.
        
         | Sesse__ wrote:
         | Chromium does this in a bunch of places; the article only links
         | to Safe Browsing using murmur, but the renderer (Blink)
         | generally uses rapidhash and has some of these micro-filters
         | which it uses for e.g.:                 - querySelector() in
         | certain cases       - Prefiltering hash lookups in CSS buckets
         | - Rapid reject of elements when looking for certain Aria
         | attributes (for accessibility)
         | 
         | It's surprising that such tiny filters (32 or 64 bits) work at
         | all, but they often do. There are also some larger Bloom
         | filters around.
         | 
         | (I added some of these)
        
           | marginalia_nu wrote:
           | They just have a really unintuitive economy where they
           | basically only need to work once or twice to make up for the
           | cost of all the times they don't contribute any benefit.
        
             | Sesse__ wrote:
             | For extra fun, you sometimes can make ideal filters with no
             | false positives, if you know your possible elements ahead
             | of time and you don't insert too many of them. (E.g., for
             | 20 elements, you can construct a 12-bit code where there
             | are guaranteed no false positives as long as you insert at
             | most two elements.)
        
       | alienbaby wrote:
       | This article is aimed squarely at people like me. I'd heard of
       | them. I kept meaning to look them up everytime I saw them
       | mentioned. I finally did when I saw your articale and it was the
       | perfect intro that I was looking for :)
        
         | vayup wrote:
         | Same with me
        
       | konsalexee wrote:
       | Another one bloom filter post I really appreciated from Eli
       | Bendersky if anyone wants to read more:
       | https://eli.thegreenplace.net/2025/bloom-filters/
        
       | 256bit wrote:
       | Another visualisation of Bloom filters can be found at the end of
       | this page: https://www.chrislaux.com/hashtable.html
        
       | verytrivial wrote:
       | The overlap in concepts required to understand Bloom filters,
       | sets and hash tables is about 95% IMHO. A set is a hash table
       | used for membership tests where you only care about the key, not
       | the value. And a Bloom filter is just a set that exploits the
       | fact that many-to-one hashing 'compresses' the key-space with
       | collisions. It deliberately uses a very collide-y hash function.
       | If a specific key was ever hashed, you WILL get a hit, but there
       | might be other keys that produced the same hash. It's a feature,
       | not a bug.
        
         | marginalia_nu wrote:
         | If you've grokked bloom filters, you're very close to also
         | understanding both random projection and certain
         | implementations of locality-sensitive hashes.
        
         | cherrycherry98 wrote:
         | Glad to know I'm not alone in my mental modeling of Bloom
         | filters as just hash tables that only track the buckets which
         | have data but not the actual data itself.
        
         | cortesoft wrote:
         | I think the main bit your explanation is missing is how a bloom
         | filter uses multiple hash functions to reduce collisions. For
         | example, a bloom filter might have 3 hashes, and all of them
         | have to hit for a key to be known to be in the set. This
         | reduces the likelihood of false positive collisions while
         | keeping the no false negatives guarantee.
        
       | anon-3988 wrote:
       | I have a specific use case where I know from startup the list of
       | words that I want to find and this will not change for the
       | duration of the program. Can anyone think of a low latency
       | solution to this? I have tried a lot of variations of bloom
       | filter, perfect hash, linear lookup, binary search, set search
       | etc
       | 
       | It appears that perfect hash is the one that works the best for
       | my use case.
        
         | jerf wrote:
         | You're saying you can use a perfect hash also implies you know
         | you will _only_ find those values? If so, then yes, the name is
         | accurate and is probably a very good choice.
         | 
         | But if you put things into the perfect hash function it is not
         | expecting, some fraction of them will collide.
         | 
         | If you're searching for a fixed set, look at the Ragel library.
         | Compile-time generation of the search in a way that is very
         | hard to beat.
        
       | b0a04gl wrote:
       | i got into bloom filters while debugging cassandra read spikes
       | ,lot of sstable lookups even when key not exist ,didnt make sense
       | at first ,then realised bloom filter on each sstable meant to
       | skip disk ,but default fp rate was high like 0.1 or so ,too much
       | for our case ,most reads were cache miss anyway so those false
       | positives were killing us ,changed it to 0.01 ,bit more memory it
       | consumed but way less useless reads ,lbrought p99 read latency by
       | good 16-18%
        
       | costco wrote:
       | I had used bloom filters in the past without really understanding
       | how they worked. Then one day I decided to implement them just
       | going off the Wikipedia article with the 32-bit MurmurHash
       | function and was surprised at how simple it was. If you're using
       | C++ you can use std::vector<bool> (or as of C++23, std::bitset)
       | to make it even easier to store the bits in a space efficient
       | way.
        
       | kridsdale3 wrote:
       | I wrote a Bloom Filter for college in CUDA in 2009. My advisor
       | was a former Nvidia guy. I then went on to not do any GPU
       | programming at all in my career.
       | 
       | I probably could have made $100,000,000 if I had made a different
       | choice there.
        
         | Kranar wrote:
         | Could have also bought Bitcoin and made a lot more... just
         | saying.
        
         | ricardo81 wrote:
         | Improbable considering it was a CS idea in 1970. Surely every
         | idea for GPGPU was fair game.
         | 
         | I wrote a hashcash implementation on a GPU 10 years ago. Pretty
         | sure it's valueless now.
        
       | deryilz wrote:
       | It's always great to see interactive posts. I also appreciated
       | the list of where Bloom filters were used in popular programs.
        
       | clnhlzmn wrote:
       | I recently used a bloom filter to achieve a log message anti-spam
       | feature. In the logger I hashed the messages and inserted into
       | the filter. If the entry was present I didn't output the
       | messages. Then every few seconds I would iterate over the filter
       | and clear all the bits. It worked out nicely that I didn't have
       | to worry about atomically clearing all the bits in the filter, if
       | messages were coming in and any of their bits had been cleared
       | that was sufficient to cause them to be logged again. This was
       | much more efficient than the previous implementation which kept a
       | count of messages seen and would saturate at N and had the effect
       | that if a particular message was being repeatedly logged it would
       | be seen, just at most at the rate at which the filter was being
       | cleared.
       | 
       | After being aware of bloom filters for a while it was quite
       | satisfying to organically find a real use for one that was such
       | an improvement.
        
       | jedberg wrote:
       | I love me a good bloom filter. Bloom filters are one of the
       | concepts I try to make sure everyone I talk to about tech knows
       | -- the others being Random Weight Hashing (aka Rendezvous
       | Hashing, aka Highest Random Weight Hashing) and Cumulative flow
       | diagrams.
       | 
       | All three concepts are key in understanding operations of complex
       | distributed systems.
        
         | sureglymop wrote:
         | What about distributed hash table architectures in general?
         | circle, chord, CAN and kademlia, etc.
        
           | jedberg wrote:
           | There's only so many things you can put on your "I must tell
           | everyone I meet about this" list. :)
           | 
           | But yes, those are pretty important too.
        
       | jedberg wrote:
       | Note to the author -- I love the interactive part. One suggestion
       | to really drive the point home: Give the user an example of two
       | strings that have a hash collision, and tell them to enter one
       | into the box, and then test for the other in the second box.
       | 
       | It will demonstrate why the answer is always "maybe it's in the
       | set!" instead of "yes".
        
         | thomascountz wrote:
         | "bloom" and "demonstrators " (mind the trailing space
         | character)
         | 
         | Both collide with: fnv: 7 murmur: 12
        
       ___________________________________________________________________
       (page generated 2025-06-29 23:00 UTC)