[HN Gopher] Word-Aligned Bloom Filters
___________________________________________________________________
Word-Aligned Bloom Filters
Author : zdw
Score : 192 points
Date : 2021-10-03 15:26 UTC (7 hours ago)
(HTM) web link (lemire.me)
(TXT) w3m dump (lemire.me)
| Rendello wrote:
| I find these optimizations fascinating. Anyone familiar with
| Lemire likely knows about this, but I listened to a podcast
| episode[1] with him a few days ago and learned about `simdjson`,
| the tool he authored that parses JSON at 25x the speed of the
| standard C++ library.[2][3] It's worth looking at if you're into
| this sort of thing.
|
| 1. https://corecursive.com/frontiers-of-performance-with-
| daniel...
|
| 2. https://www.youtube.com/watch?v=wlvKAT7SZIQ
|
| 3. https://arxiv.org/pdf/1902.08318.pdf
| blondin wrote:
| yep, great fan of his work here. thanks for sharing that
| podcast.
|
| that kind of optimization requires you to know your machine
| architecture quite well. SIMD optimizations aren't new. but
| it's always amazing to see these performance increases on a
| single machine!
|
| our CPUs and GPUs are quite amazing. we have decided, as a
| field, that we can get enough virtual CPUs, GPUs, or RAM on-
| demand. and that we shouldn't concern ourselves with things
| happening at that level.
|
| it's actually a situation that saddens me sometimes.
| devwastaken wrote:
| Simd convinced me to take college courses in algorithms and
| to learn higher maths. Things like image decoding rely
| heavily on doing transformations, and you have to know the
| math behind them precisely to be able to turn scalar code
| into vector code effectively.
|
| On top of this you have to identify what can and cannot be
| vectorized and how it can be integrated.
|
| Working in simd isn't too hard in itself once you get down to
| the assembly and toss out all the extra stuff compilers add.
| If you look at how ffmpeg does it, they just assemble the
| hand-written assembly as the C function itself. Arm64 is very
| nice to do this in because it has an elegant way of defining
| vectors and instructions for them.
| Rendello wrote:
| After listening to the JSON episode, I've been spending the
| last few days coding up a SIMD implementation of LZ77
| substring matching, for use in DEFLATE.
|
| I used Zig, which has first-class SIMD support. As in, no
| need to go down to the assembly or use intrinsics, or even
| use a library. I just got it working and haven't had time
| to profile it, however (I'm new to profiling code).
| ignoramous wrote:
| Thanks! Any books / blogs you'd recommend?
| phkahler wrote:
| I was recently disappointed and frustrated to learn that GCC
| will be enabling vectorization at -O2 soon. The realization
| that you have to specify both -O3 AND an architecture to get
| it to use AVX basically invalidated a bunch of benchmarking
| and testing I had done.
|
| What's the point of building with x86-64-v3 if all your code
| is built at -O2 without vectorization enabled? Doh!
| vvanders wrote:
| If you like this stuff, data oriented design[1] is a nice
| framework that isn't arch-specific (but can be extended if
| you need to make further arch-aware gains). Back when I
| worked in the PS3/X360 days, engines written for the PS3 had
| better cache locality (SPUs forced hard limits vs. hitting a
| cache miss) and as a result ran faster on the X360 when
| ported.
|
| You can do fun things like using radix sort with bucket
| sizes that line up well with L2/L3 caches (saw a multi-order-
| of-magnitude speedup for that one) and data-aware layouts
| that net 10-30x speedups for the same data. Many RDBMSs use
| similar approaches from what I understand as well.
|
| [1] https://en.m.wikipedia.org/wiki/Data-oriented_design
| phendrenad2 wrote:
| For some reason, writing a less-than-completely-naive JSON
| parser for C++ seems to be a rite of passage for all C++
| developers who eventually go on to great things.
| [deleted]
| colmmacc wrote:
| I'm really suspicious about whether this would really work out in
| most workloads.
|
| On modern architectures, it's the random memory reference that
| kills performance. Predictive pre-fetching in the pipeline has
| an impossible time figuring out what to get ready in the caches
| with random lookups; it's not like a linear sweep. This style of
| bloom filter needs to be quite large, and it pulls in only one
| cache line per key lookup; both of these properties make any
| opportunistic (non-prefetched) caching less effective.
|
| I wonder if real-world benchmarks would favor a more memory-
| dense lookup that pulls in multiple lines, just because it keeps
| the cache lines hotter and less likely to be evicted. On the
| other hand, writes to this kind of bloom filter invalidate only
| one cache line, so that's good. I'd guess it's all very
| workload dependent: ultimately it comes down to cache pressure,
| how over-subscribed the cache is, the ratio of writes to reads,
| and the distribution of keys; an implementation might also
| fare very differently on a system like Graviton2 (which has huge
| caches) than on a small Xeon.
| csense wrote:
| > I'm really suspicious about whether this would really work
| out in most workloads.
|
| What kinds of workloads are those? Workloads where the Bloom
| filter itself fits in cache?
|
| > On modern architectures, it's the random memory reference
| that kills performance
|
| Yeah, the whole point of the article is that moving from a
| traditional Bloom filter to a block Bloom filter improves things
| from N random accesses per query/update to 1 random access per
| query/update.
|
| > Ultimately it depends on cache pressure and how over-
| subscribed the cache is
|
| The unspoken assumption in the article is that the Bloom filter
| is much bigger than cache.
|
| If your Bloom filter fits in cache, or you're writing for
| something with superfast RAM like a GPU or a Xeon Phi or
| something, block Bloom filters are wasteful for you.
|
| But I'm pretty sure the common case is writing normal
| applications running on reasonable general-purpose CPUs with
| Bloom filter sizes in the hundreds of MB or more. [1] [2]
|
| [1] If the Bloom filter's smaller than hundreds of MB, your
| data's probably small enough that you decide Bloom filters
| aren't worth the hassle and just use a HashMap or language
| equivalent.
|
| [2] If you're using a Bloom filter to do some sort of network
| stuff, like figuring out the set difference of two peers'
| objects in some p2p protocol, you probably don't care too much
| about block Bloom filters, because in this application memory
| access time is probably going to be dominated by network
| latency.
| willvarfar wrote:
| Excellent thought-provoking article, but all the times I've
| brought in a dumb cache or bloom filter it has been to save
| calling a microservice or hitting a DB or something so
| fantastically expensive (in computer time) that even a
| staggeringly inefficient bloom filter, or just keeping lists of
| recently-seen values, is a massive win. So there's no big
| pressing need to micro-optimise the bloom filter or whatever,
| and all the real profile-guided optimizations point at other
| low-hanging fruit.
| flavius29663 wrote:
| Same, I was thinking about implementing a bloom filter myself,
| but when I put in the hit ratio and the frequencies, a simple
| LRU cache was going to be close enough. It's hard to justify
| the complexity sometimes, which is a pity because they are so
| cool.
| phkahler wrote:
| I suspect a bloom filter should use a separate array of bits for
| each hash function. This came up in a previous discussion where
| the stated probability of collisions was "wrong", but not if it
| were actually implemented with separate bit arrays.
| [deleted]
| marginalia_nu wrote:
| This seems a bit like a solution in search of a use case. There
| are simpler solutions for this particular use case.
|
| Let's take an extreme case: You have 250 million paying customers
| you want to cache access to. Yeah, that's _way_ more than you
| probably have, but the point is that you can keep 250 million
| integers in memory, no problem whatsoever. It fits in a
| gigabyte. There are raspberry pis with eight times that amount of
| RAM.
|
| You can just put the ids in a sorted array and you can check for
| existence with a binary search without having to deal with false
| positives. Adding a bunch more is as easy as a merge of sorted
| lists, and removing items is a great deal easier than doing the
| same in a Bloom filter. You're looking at log_2(250 million) =~
| 28 iterations. Cache-wise it's not great, but it's faster than a
| database access by a mile and you don't have to worry about hash
| functions.
|
| Maybe if you have several billion customers it starts to become a
| bit much to keep in memory, but that is a weird corner case
| indeed.
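|
| A sketch of that approach in C++ (illustrative only; assumes
| 32-bit customer ids and pre-sorted input batches):
|
|   #include <algorithm>
|   #include <cstdint>
|   #include <iterator>
|   #include <vector>
|
|   // Membership test on a sorted array: ~28 probes for 250M ids,
|   // no false positives, and deletion is straightforward.
|   bool isCustomer(const std::vector<uint32_t>& ids, uint32_t id) {
|       return std::binary_search(ids.begin(), ids.end(), id);
|   }
|
|   // Bulk insert: merge a sorted batch of new ids in one pass.
|   void addCustomers(std::vector<uint32_t>& ids,
|                     const std::vector<uint32_t>& newIds) {
|       std::vector<uint32_t> merged;
|       merged.reserve(ids.size() + newIds.size());
|       std::merge(ids.begin(), ids.end(), newIds.begin(), newIds.end(),
|                  std::back_inserter(merged));
|       ids.swap(merged);
|   }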
| PragmaticPulp wrote:
| Obviously you don't need bloom filters if lookups are cheap,
| but you're ignoring all of the cases where lookups aren't
| cheap.
|
| You can't always put the entire database of everything in
| memory. Consider cases where the check is done client-side and
| requires a network request. A bloom filter could allow you to
| optimize away a network request in some, though not all, cases.
| marginalia_nu wrote:
| I'm curious if you have a concrete example of a case where
| lookups aren't cheap and a bloom filter can solve the
| problem.
|
| In general, bloom filters are mainly useful for the sort of
| data where lookups _are_ cheap, that is, testing for
| membership of a set, and that is typically one of the
| operations a database does really well.
|
| In most cases where you just want to save a network
| roundtrip, an LRU cache is often a lot easier to get right and
| is much easier to keep consistent against mutable data.
| jefftk wrote:
| What about https://developers.google.com/safe-browsing?
|
| In the normal case, when a website is not matched by the
| filter, you save a network request. This is good for both
| latency and privacy.
| Gh0stRAT wrote:
| Imagine you're making a web browser plugin that blocks ads,
| or malicious sites. Let's assume the blocklist is a hundred
| megs (easily fits in ram) and your millions of users need
| to get the latest data at least hourly in order to keep up
| with the latest URLs that you want to block.
|
| Rather than distributing the entire blocklist to your
| userbase, you can instead send a bloom filter + an
| allowlist of the small handful of sites which have a hash
| collision with one of the blocked sites.
|
| As a bonus, computing the hashes will have great branch
| prediction characteristics and you'll have fewer cache
| misses because the bloom filter is tiny and frequently
| accessed, so your plugin will not add any perceptible
| slowdown to the user's experience.
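|
| Something like this, roughly (toy C++; the filter sizing, the
| salted std::hash trick, and all names are made up for
| illustration):
|
|   #include <bitset>
|   #include <functional>
|   #include <string>
|   #include <unordered_set>
|
|   // Client-side blocklist check: the Bloom filter answers
|   // "definitely not blocked" for almost every URL; a small
|   // allowlist of known false positives cancels the rest.
|   struct UrlBloom {
|       std::bitset<1 << 20> bits;  // sized arbitrarily for the sketch
|       static size_t h(const std::string& u, int salt) {
|           return std::hash<std::string>{}(u + char('0' + salt)) % (1 << 20);
|       }
|       void add(const std::string& u) {
|           for (int i = 0; i < 5; i++) bits.set(h(u, i));
|       }
|       bool maybeContains(const std::string& u) const {
|           for (int i = 0; i < 5; i++)
|               if (!bits.test(h(u, i))) return false;
|           return true;
|       }
|   };
|
|   bool isBlocked(const UrlBloom& blocklist,
|                  const std::unordered_set<std::string>& fpAllowlist,
|                  const std::string& url) {
|       if (!blocklist.maybeContains(url)) return false; // common case
|       return fpAllowlist.count(url) == 0;  // hit: consult exceptions
|   }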
| marginalia_nu wrote:
| Couldn't you just send deltas if that is the case? Surely
| the hourly updates wouldn't be hundreds of megabytes? I
| just tested extracting 5 million URLs from my web crawler
| and it was like 150 MB in plain text. That's ignoring how
| easy it is to create compression schemes for URLs that
| slash their memory footprint by something like 80%.
| Gh0stRAT wrote:
| Yes, you could do it with deltas instead. The tradeoffs
| are that it won't be as fast and will cost you a LOT more
| in bandwidth. (and cost your users more bandwidth as
| well. They might be on a very slow/limited data plan)
| Maybe you don't care about bandwidth and would prefer to
| avoid the complexity and maintenance overhead of adding a
| bloom filter.
|
| As with anything, there are tradeoffs and your
| requirements can change over time. Maybe the ad networks
| or malware creators start using new domains every 10
| minutes to counter your blocking system so now you have
| to store more data and disseminate it more frequently.
|
| As engineers, it's our job to weigh the tradeoffs between
| different solutions given the resources and constraints
| of the situation. For the situation I've outlined above,
| I'd at least strongly consider a bloom filter but it's
| certainly not the only way to do it.
| marginalia_nu wrote:
| > As engineers, it's our job to weigh the tradeoffs
| between different solutions given the resources and
| constraints of the situation. For the situation I've
| outlined above, I'd at least strongly consider a bloom
| filter but it's certainly not the only way to do it.
|
| We should also not gloss over the fact that a bloom filter
| can't rule out false positives, and it's really not
| feasible to figure out which they are. Due to the nature
| of hashing, you won't be able to easily come up with a
| list of false positives. Finding just one hash collision
| in a wide hash is computationally stupidly hard.
|
| > As with anything, there are tradeoffs and your
| requirements can change over time. Maybe the ad networks
| or malware creators start using new domains every 10
| minutes to counter your blocking system so now you have
| to store more data and disseminate it more frequently.
|
| Domains cost quite a lot of money so that is still pretty
| unrealistic. Sure you can have CN wildcards, but you can
| also do wildcard matching.
|
| Actually this whole scenario is unrealistic, since you
| can just serve ads off a random URL. The way you would
| create a decent ad filter is to look for characteristics
| in the script itself (a bit like an antivirus program),
| not base it off the URL.
| teraflop wrote:
| > We should also not gloss over that with a bloom filter
| can't rule out false positives, and it's really not
| feasible to figure out which they are.
|
| Nothing says a bloom filter has to be the _only_ data
| structure you use. It 's a performance optimization; even
| if 1% of the time you have to consult a more expensive
| data structure to confirm your result, it can still save
| you a lot of computation in the long run.
|
| > Finding just one hash collision in a wide hash is
| computationally stupidly hard.
|
| But bloom filters don't use wide hashes; the range of
| the hash function is the number of bits in the filter.
|
| > Domains cost quite a lot of money so that is still
| pretty unrealistic.
|
| In the case of malware, the cost of _buying_ domains
| isn't that relevant, because you can compromise existing
| domains using automated attacks.
| jameshart wrote:
| Additionally, this method avoids you having to distribute
| a list of dubious sites to your users.
| la_fayette wrote:
| Bitcoin light wallet addresses are registered with bloom
| filters on full nodes.
| spenczar5 wrote:
| I use them in astronomical data indexing when doing cross-
| match searches. That is, "could there be a star or other
| moving object within 0.5 arcsec of this point in the
| sky?" Sky catalogs have many billions of objects but the
| density per solid angle of the sky is extremely variable,
| and moving objects also make lookups really complicated.
| Sky catalogs are big enough that they are not storable in
| memory, and may be in big Parquet tables on the network, so
| the full lookup is quite expensive.
|
| Bloom filters are great for this because the backing
| dataset changes extremely slowly, like a few times a year
| when a survey publishes a new data release.
| marginalia_nu wrote:
| Then we're indeed back in the territory of useful use
| cases, typically characterized by data that is far bigger
| than the system memory.
| gpderetta wrote:
| Bloom filters generally are much smaller than the data
| they represent. So if your set fits on disk, the bf might
| fit in RAM; if your data fits in RAM, your bf might fit
| in L3, and so on.
|
| Edit: in particular you might want to use bloom filters
| for early rejects, when most of the set membership queries
| are expected to fail.
| sethev wrote:
| I think you missed the point of the example. Lemire's example
| isn't "the use-case" for this optimization, it's just a trivial
| motivating example. The use-case is more like, "assume you're
| currently using a Bloom filter in an appropriate context,
| here's a way that might be more efficient".
|
| The point of the post is the optimization technique, not trying
| to describe when a Bloom filter would be appropriate.
| ak39 wrote:
| Why wouldn't "isPayingCustomer" not be part of the Customer
| object graph immediately after logging in to begin with? (Maybe
| I didn't understand the use case properly)
| marcosdumay wrote:
| This is a very bad example, but the underlying concepts are
| useful.
|
| It's more realistic if you are storing a set of revoked
| access keys or altered files in your cache, subscribed from
| some channel that is not on your main request - database -
| response loop.
|
| Normally you won't ever need to store many of those before
| they reach their expiration time or you update your cache, and
| you will want a lot of performance for the common case (which
| the small memory requirements help ensure).
| marginalia_nu wrote:
| I think fundamentally bloom filters are mainly useful for
| astronomical data sizes. If you have a petabyte of data
| partitioned between several datacenters, and you want to
| know which shards have entries with certain properties,
| bloom filters are amazing. A bloom filter that is just a few
| dozen megabytes can save you an enormous amount of
| computation in such a scenario.
|
| To cache network requests it really isn't that great. A
| basic LRU cache typically performs just as well, and is a
| lot easier to get right.
| k2xl wrote:
| a good use case is malicious URL detection for browsers. they
| can look up locally and get a sense of whether the url is bad
| and then double-check against the webserver.
| kwillets wrote:
| A Bloom filter with 1% FPR would fit in 285 MB and need 7
| accesses per key.
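|
| (Those numbers follow from the standard sizing formulas,
| assuming the 250 million keys from the example above:
|
|     bits/key = -ln(0.01) / (ln 2)^2  =~  9.6
|     total    = 250e6 * 9.6 bits      =~  2.4e9 bits  =~  285 MiB
|     k        = 9.6 * ln 2            =~  6.6, rounded up to 7 )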
| hinkley wrote:
| And you can consistent hash your way out of that problem by
| having two servers for each customer ID.
|
| But I think the bigger problem is figuring out if John Smith is
| already in our system, and IDs don't help here.
| bugzz wrote:
| The main use case I've always seen is to use bloom filters on
| the client side to reduce traffic to the server looking things
| up. As you said, you could cache 250 million integers in a
| gigabyte - but you don't want to bloat your client side
| implementation by a gigabyte for every bloom filter you use.
| Also, many times the items are a lot larger than an integer.
| For example, storing a list of malicious URLs, a common use
| case.
| bqmjjx0kac wrote:
| I'm so confused by this use case (the traffic-saving one, not
| the malicious URL classifier). Why not store the "is-paying-
| customer" bit in a cookie?
|
| What are we using as the user identifier? Where does it come
| from, if not a cookie?
|
| Also, this client-side bloom filter kind of leaks your user
| database, supposing it's keyed on email addresses and your
| adversary has a gigantic list of email addresses, or is
| patient enough to enumerate them.
| [deleted]
| lelandfe wrote:
| At the risk of finally exposing myself as an impostor, does
| anyone have a good link for an explanation of bloom filters?
|
| I've tried, unsuccessfully, to wrap my head around the concept
| before, and this article seems really interesting!
| [deleted]
| a11r wrote:
| The original paper by Burton H. Bloom is only 5 pages and very
| readable. https://dl.acm.org/doi/10.1145/362686.362692
| ghostly_s wrote:
| https://sci-hub.st/https://doi.org/10.1145/362686.362692
| dr_kiszonka wrote:
| I will take an imposter over a grandiose narcissist any day!
| (But then I am an imposter too.)
| robertlagrant wrote:
| > If everyone is an imposter, no one is!
|
| - Syndrome
| rattlesnakedave wrote:
| You want to check if an element is probably in a dataset. You
| want speed, and the possibility of a false positive doesn't
| scare you that much.
|
| Say you want to look up if a username is in a database. That
| could take a while if you have a really large db of usernames,
| but you want to tell a user quickly if a name is taken.
|
| So what you can do is make a bloom filter, which is basically
| just an array of n bits. Let's say n = 10 for our
| example.
|
| Then gather some hash functions (h1, h2, and h3).
|
| When someone inputs a username, you run it through the hash
| functions, and modulo by the number of bits, and get an integer
| result.
|
| So:
|
| h1("rattlesnakedave") % 10 = 3 h2("rattlesnakedave") % 10 = 7
| h3("rattlesnakedave") % 10 = 2
|
| Then you check the bits at positions 3, 7, and 2 in the array.
|
| If any of the bits are set to 0, we haven't seen the
| username before.
|
| If they're all set to 1, we have possibly seen the username
| before.
|
| Let's say they're all set to 0, so we probably haven't seen it.
| We can present the message that the username is available, and
| allow registration. During registration, we can do a slower
| more expensive verification that the username isn't actually in
| the dataset.
|
| Now that the username is in the dataset, we can flip the bits
| on our bloom filter at positions 3, 7, and 2 to 1. So the next
| time we look up that username we can tell quickly that it's
| probably taken.
|
| There is no way to delete from the bloom filter. The more
| values you store relative to its size, the more false positives
| you get.
|
| Hopefully that makes sense, and is correct.
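|
| A minimal sketch of the above in C++ (the salted-std::hash
| "hash functions" and all names are made up for illustration,
| not production quality):
|
|   #include <bitset>
|   #include <functional>
|   #include <string>
|
|   // Toy Bloom filter with 10 bits and 3 hash functions, as in
|   // the example above.
|   struct TinyBloom {
|       std::bitset<10> bits;
|
|       // Derive 3 "hash functions" by salting std::hash.
|       static size_t h(const std::string& s, int salt) {
|           return std::hash<std::string>{}(s + char('0' + salt)) % 10;
|       }
|
|       void add(const std::string& name) {
|           for (int i = 0; i < 3; i++) bits.set(h(name, i));
|       }
|
|       // false -> definitely not seen; true -> possibly seen.
|       bool maybeContains(const std::string& name) const {
|           for (int i = 0; i < 3; i++)
|               if (!bits.test(h(name, i))) return false;
|           return true;
|       }
|   };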
| DougBTX wrote:
| Usernames are a nice example. Just as a demo, we could make
| things simpler:
|
| Look at all the usernames in the input, keep track of all the
| letters you've seen. When you get a new name, if you've seen
| all the letters before, you might have seen the name before.
| If there's a new letter, it must be a new name.
|
| Obviously that isn't a great hash function, some letters are
| more common than others, so real Bloom filters use better
| ones, but it works basically the same way.
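|
| As a toy version of that in C++ (hypothetical names; one bit
| per letter a-z instead of real hash functions):
|
|   #include <cstdint>
|   #include <string>
|
|   struct LetterFilter {
|       uint32_t seen = 0;  // 26 bits, one per letter
|       static uint32_t mask(const std::string& name) {
|           uint32_t m = 0;
|           for (char c : name)
|               if (c >= 'a' && c <= 'z') m |= 1u << (c - 'a');
|           return m;
|       }
|       void add(const std::string& name) { seen |= mask(name); }
|       // false -> definitely a new name; true -> possibly seen.
|       bool maybeSeen(const std::string& name) const {
|           uint32_t m = mask(name);
|           return (seen & m) == m;
|       }
|   };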
| Taywee wrote:
| Not exactly correct. If all of the bits are 1, then the value
| is possibly in the set (not "probably", but "possibly", with
| the likelihood being affected by the sizes of the set and the
| array).
|
| If any of the bits is zero, the value definitely isn't in the
| set (also not "probably haven't", but definitely not).
| rattlesnakedave wrote:
| Good catch! Thank you.
|
| Was trying to keep the initial explanation simple by
| leaving out the false positives scaling relative to the
| size of the array and set, and then clarifying at the end.
|
| The second part (any 0 bits means it hasn't been seen) is
| true; a goof-up on my part.
| actually_a_dog wrote:
| This article has a pretty good explanation, IMO:
| https://freecontent.manning.com/all-about-bloom-filters/
| stevenguh wrote:
| Try watching this video: https://youtu.be/Bay3X9PAX5k
| avinassh wrote:
| try this? - https://llimllib.github.io/bloomfilter-tutorial/
| 6510 wrote:
| let me butcher the concept for you :)
|
| say through numerology you reduce the character values of a
| name to a number 0-10. Lots of names will reduce to the same
| number. We take 11 bits, and for "jim" we set the first bit. We
| have only one name in our data set so all other bits are 0. Now
| if someone types "joe" in the search box and it reduces to 2 we
| look at the second bit, see it is a zero and know 100% *for
| sure* that this name is not in the data set. If "jack" reduces
| to 1 and we look him up we see 1 is set so this name *might be*
| there.
| cma wrote:
| Isn't this just similar to hash prefixes and not a true bloom
| filter?
| gscott wrote:
| Telling the GoogleBot to not index your site doesn't necessarily
| remove your site from Google. They can and often do still show
| your site, but they replace your description with a generic statement
| that they can't show it because you have decided not to include
| your site in their index. It is crazy but true.
| perihelions wrote:
| Oh, that's elegant.
|
| If I math right, setting 5 bits in a 64-bit word extends the
| effective length of the hash function by approximately (5-1) *
| log_2(64) = (5-1) * 6 = 24 bits. In a simple bloom filter of
| storage size _m_ bits, the hash functions have length log_2(m).
| All else being equal, the number of hash functions used -- hence
| the number of memory reads per lookup -- gets reduced by a factor
| of log_2(m) / [24 + log_2(m)].
|
| (But it's not exactly equal; the storage size _m_ has to go up by
| a factor of 1.2 (in the OP parameter set), for constant false-
| positive rate).
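|
| For concreteness, the scheme looks roughly like this (a C++
| sketch of the idea, not Lemire's actual code; names are mine):
|
|   #include <cstdint>
|   #include <vector>
|
|   // Word-aligned Bloom filter: one 64-bit hash per key; the high
|   // bits pick a word, the low bits pick 5 positions inside it.
|   struct WordBloom {
|       std::vector<uint64_t> words;
|       explicit WordBloom(size_t n) : words(n) {}
|
|       // Build a mask with (up to) 5 set bits, 6 hash bits each.
|       static uint64_t mask(uint64_t h) {
|           uint64_t m = 0;
|           for (int i = 0; i < 5; i++)
|               m |= uint64_t(1) << ((h >> (6 * i)) & 63);
|           return m;
|       }
|
|       void add(uint64_t h) {
|           words[(h >> 32) % words.size()] |= mask(h);
|       }
|       // One cache line touched per query.
|       bool maybeContains(uint64_t h) const {
|           uint64_t m = mask(h);
|           return (words[(h >> 32) % words.size()] & m) == m;
|       }
|   };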
| eternalban wrote:
| Each word in this design is a little bloom filter. It has
| number of bits (m) = 64, number of hashes (k) = 5, and goes for a
| 1% false positive rate, which per the formula in [1] (with n being
| the number of items in the filter) is:
|
|     fpr = (1 - e^(-k*n/m))^k
|
| OP solved for 1% and got n =~ 4.
|
| There is exactly 1 hash function (to compute the 64 bit key). 5
| bits are pseudo-randomly selected (giving us k=5) and written
| to an array element indexed off of the 64 bit key itself.
|
| So sizing this [naively] is a matter of taking your n (say 4
| million items) and dividing by 4, getting an array of 1 million
| words.
|
| What is not discussed in detail is block selection. Given the
| segmented nature of the array based data structure,
| distributing exactly 4 items per word is impossible for an on-
| line filter (as that requires perfect hashing). So, the actual
| 'load' capacity of this filter is, in array form, a fraction of
| 4 x array-len. Precisely, given an array of size S words, with
| capacity of Sx4, well before we reach Sx4, our keys will
| resolve to array elements, the little micro filters, that
| already have 4 registrations. Inserting the full allocated
| capacity (Sx4) will certainly result in a significant number of
| words having n greater than 4, which increases the false positive
| rate: by the formula above, at 5 items fpr is ~0.35%, and at 6 it
| is ~0.73%, climbing steeply from there. Equally, some array
| elements (words) will have fewer than 4 items, getting exceptional
| fprs. For example, at 2 entries, fpr is 0.0063%! If we load the
| full capacity, we can no longer make uniform statements about "1%"
| false positive rates, as OP does, it should be noted.
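|
| A quick way to check those per-word numbers (minimal C++,
| evaluating the formula from [1] at m = 64, k = 5):
|
|   #include <cmath>
|   #include <cstdio>
|
|   int main() {
|       const double m = 64.0, k = 5.0;
|       for (int n = 2; n <= 8; n++) {
|           double fpr = std::pow(1.0 - std::exp(-k * n / m), k);
|           std::printf("n = %d  fpr = %.4f%%\n", n, 100.0 * fpr);
|       }
|       // Prints roughly: n=2 -> 0.0063%, n=5 -> 0.35%, n=6 -> 0.73%.
|   }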
|
| So, you either have to keep track of elements assigned to words
| (3 bits / word to count 0..4), or you compute a probable value
| to minimize overloaded array elements, keeping max writes to a
| word to 4 items. My guess would be ~70%ish utilization range of
| allocated memory. In this case, we get our desired 1% (max) but
| get bonuses of fortunate keys that land in underloaded words
| and get super high assurances at the cost of eating a bit of
| space inefficiency. /g
|
| So this is the classic balls-into-bins problem [2]. For hashes,
| the surprising but powerful two-choice approach gets us load
| factors in the high 90%s. I'd recommend that. Yes, it doubles the
| hashing and cache-line costs, but the load factor will be in the
| high 90s range. Here OP should examine two candidate words, and
| pick the one with the fewest items. [On lookups, we have 2
| filters (words), one of which asserts it doesn't have it, and
| another that is 99% sure it does. Or both disclaim having it.]
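|
| Sketched out (toy C++; the derived second index and byte-wide
| counters are illustrative shortcuts, not a tuned design):
|
|   #include <cstdint>
|   #include <vector>
|
|   struct TwoChoiceBloom {
|       std::vector<uint64_t> words;
|       std::vector<uint8_t> counts;  // 3 bits/word would suffice
|       explicit TwoChoiceBloom(size_t n) : words(n), counts(n) {}
|
|       static uint64_t mask(uint64_t h) {  // 5 bits per key
|           uint64_t m = 0;
|           for (int i = 0; i < 5; i++)
|               m |= uint64_t(1) << ((h >> (6 * i)) & 63);
|           return m;
|       }
|       size_t idx1(uint64_t h) const { return (h >> 32) % words.size(); }
|       size_t idx2(uint64_t h) const {
|           // A real implementation would use an independent hash.
|           return ((h >> 48) | (h << 16)) % words.size();
|       }
|
|       void add(uint64_t h) {
|           size_t a = idx1(h), b = idx2(h);
|           size_t w = counts[a] <= counts[b] ? a : b;  // lighter word
|           words[w] |= mask(h);
|           if (counts[w] < 255) counts[w]++;
|       }
|       bool maybeContains(uint64_t h) const {  // check both candidates
|           uint64_t m = mask(h);
|           return (words[idx1(h)] & m) == m || (words[idx2(h)] & m) == m;
|       }
|   };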
|
| [1]:
| https://en.wikipedia.org/wiki/Bloom_filter#Probability_of_fa...
|
| [2]: https://en.wikipedia.org/wiki/Balls_into_bins_problem
|
| p.s. to OP: you could also return the fpr if you keep track of
| items per word so the call site has insight into the
| assurances. That would be useful for a probabilistic data
| structure.
___________________________________________________________________
(page generated 2021-10-03 23:00 UTC)