[HN Gopher] Word-Aligned Bloom Filters
___________________________________________________________________
Word-Aligned Bloom Filters
Author : zdw
Score : 192 points
Date : 2021-10-03 15:26 UTC (7 hours ago)
(HTM) web link (lemire.me)
(TXT) w3m dump (lemire.me)
| Rendello wrote:
| I find these optimizations fascinating. Anyone familiar with
| Lemire likely knows about this, but I listened to a podcast
| episode[1] with him a few days ago and learned about `simdjson`,
| the tool he authored that parses JSON at 25x the speed of the
| standard C++ library.[2][3] It's worth looking at if you're into
| this sort of thing.
|
| 1. https://corecursive.com/frontiers-of-performance-with-
| daniel...
|
| 2. https://www.youtube.com/watch?v=wlvKAT7SZIQ
|
| 3. https://arxiv.org/pdf/1902.08318.pdf
| blondin wrote:
| yep, great fan of his work here. thanks for sharing that
| podcast.
|
| that kind of optimization requires you to know your machine
| architecture quite well. SIMD optimizations aren't new. but
| it's always amazing to see these performance increases on a
| single machine!
|
| our CPUs and GPUs are quite amazing. we have decided, as a
| field, that we can get enough virtual CPUs, GPUs, or RAM on-
| demand. and that we shouldn't concern ourselves with things
| happening at that level.
|
| it's actually a situation that saddens me sometimes.
| devwastaken wrote:
| Simd convinced me to take college courses in algorithms and
| to learn higher maths. Things like image decoding rely
| heavily on doing transformations, and you have to know the
| math behind them precisely to be able to turn scalar code
| into vector code effectively.
|
| On top of this you have to identify what can and cannot be
| vectorized and how it can be integrated.
|
| Working in simd isn't too hard in itself once you get down to
| the assembly and toss out all the extra stuff compilers add.
| If you look at how ffmpeg does it, they just assemble the
| hand-written assembly as the C function itself. Arm64 is very
| nice to do this in because it has an elegant way of defining
| vectors and instructions for them.
| Rendello wrote:
| After listening to the JSON episode, I've been spending the
| last few days coding up a SIMD implementation of LZ77
| substring matching, for use in DEFLATE.
|
| I used Zig, which has first-class SIMD support. As in, no
| need to go down to the assembly or use intrinsics, or even
| use a library. I just got it working and haven't had time
| to profile it, however (I'm new to profiling code).
| ignoramous wrote:
| Thanks! Any books / blogs you'd recommend?
| phkahler wrote:
| I was recently disappointed and frustrated to learn that GCC
| will be enabling vectorization at -O2 soon. The realization
| that you have to specify both -O3 AND an architecture to get
| it to use AVX basically invalidated a bunch of benchmarking
| and testing I had done.
|
| What's the point of building with x86-64-v3 if all your code
| is built at -O2 without vectorization enabled? Doh!
| vvanders wrote:
| If you like this stuff, data oriented design[1] is a nice
| framework that isn't arch-specific (but can be extended if
| you need to make further arch-aware gains). Back when I
| worked in the PS3/X360 days, engines written for the PS3 had
| better cache locality (SPUs forced hard limits vs. hitting a
| cache miss) and as a result ran faster on the X360 when
| ported.
|
| You can do fun things like using radix sort with bucket
| sizes that line up well with L2/L3 caches (saw a multi-order-
| of-magnitude speedup for that one) and data-aware layouts
| that net 10-30x speedups for the same data. Many RDBMSs use
| similar approaches from what I understand as well.
|
| [1] https://en.m.wikipedia.org/wiki/Data-oriented_design
| phendrenad2 wrote:
| For some reason, writing a less-than-completely-naive JSON
| parser for C++ seems to be a rite of passage for all C++
| developers who eventually go on to great things.
| [deleted]
| colmmacc wrote:
| I'm really suspicious about whether this would really work out in
| most workloads.
|
| On modern architectures, it's the random memory reference that
| kills performance. Predictive pre-fetching in the pipeline has
| an impossible time figuring out what to get ready in the caches
| with random lookups; it's not like a linear sweep. This style of
| bloom filter needs to be quite large, and it pulls in only one
| cache line per key lookup; both of these properties make any
| opportunistic (non-prefetched) caching less effective.
|
| I wonder if real-world benchmarks would favor a more memory-
| dense lookup that pulls in multiple lines, just because it keeps
| the cache lines hotter and less likely to be evicted. On the
| other hand, writes to this kind of bloom filter invalidate only
| one cache line, so that's good. I'd guess it's all very
| workload dependent: ultimately it comes down to cache pressure,
| how over-subscribed the cache is, the ratio of writes to reads,
| and the distribution of keys; an implementation might also
| fare very differently on a system like Graviton2 (which has huge
| caches) than on a small Xeon.
| csense wrote:
| > I'm really suspicious about whether this would really work
| out in most workloads.
|
| What kinds of workloads are those? Workloads where the Bloom
| filter itself fits in cache?
|
| > On modern architectures, it's the random memory reference
| that kills performance
|
| Yeah, the whole point of the article is that moving from a
| traditional Bloom filter to a block Bloom filter improves things
| from N random accesses per query/update to 1 random access per
| query/update.
|
| > Ultimately it depends on cache pressure and how over-
| subscribed the cache is
|
| The unspoken assumption in the article is that the Bloom filter
| is much bigger than cache.
|
| If your Bloom filter fits in cache, or you're writing for
| something with superfast RAM like a GPU or a Xeon Phi or
| something, block Bloom filters are wasteful for you.
|
| But I'm pretty sure the common case is writing normal
| applications running on reasonable general-purpose CPUs with
| Bloom filter sizes in the hundreds of MB or more. [1] [2]
|
| [1] If the Bloom filter's smaller than hundreds of MB, your
| data's probably small enough that you decide Bloom filters
| aren't worth the hassle and just use a HashMap or language
| equivalent.
|
| [2] If you're using a Bloom filter to do some sort of network
| stuff, like figuring out the set difference of two peers'
| objects in some p2p protocol, you probably don't care too much
| about block Bloom filters, because in this application memory
| access time is probably going to be dominated by network
| latency.
| willvarfar wrote:
| Excellent thought-provoking article, but all the times I've
| brought in a dumb cache or bloom filter it has been to save
| calling a microservice or hitting a DB or something so
| fantastically expensive (in computer time) that even a
| staggeringly inefficient bloom filter, or just keeping lists of
| recently-seen values, is a massive win. So there's no big
| pressing need to micro-optimise the bloom filter or whatever,
| and all the real profile-guided optimizations point at other
| low-hanging fruit.
| flavius29663 wrote:
| Same, I was thinking about implementing a bloom filter myself,
| but when I put in the hit ratio and the frequencies, a simple
| LRU cache was going to be close enough. It's hard to justify
| the complexity sometimes, which is a pity because they are so
| cool.
| phkahler wrote:
| I suspect a bloom filter should use a separate array of bits for
| each hash function. This came up in a previous discussion where
| the stated probability of collisions was "wrong", but not if it
| were actually implemented with separate bit arrays.
| [deleted]
| marginalia_nu wrote:
| This seems a bit like a solution in search of a use case. There
| are simpler solutions for this particular use case.
|
| Let's take an extreme case: You have 250 million paying customers
| you want to cache access to. Yeah, that's _way_ more than you
| probably have, but the point is that you can keep 250 million
| integers in memory, no problem whatsoever. It fits in a
| gigabyte. There are raspberry pis with eight times that amount of
| RAM.
|
| You can just put the ids in a sorted array and you can check for
| existence with a binary search without having to deal with false
| positives. Adding a bunch more is as easy as a merge of sorted
| lists, and removing items is a great deal easier than doing the
| same in a Bloom filter. You're looking at log_2(250 million) =~
| 28 iterations. Cache-wise it's not great, but it's faster than a
| database access by a mile and you don't have to worry about hash
| functions.
|
| Maybe if you have several billion customers it starts to become a
| bit much to keep in memory, but that is a weird corner case
| indeed.
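|
| A sketch of that approach in C++ (illustrative only; assumes
| 32-bit customer ids and pre-sorted input batches):
|
|   #include <algorithm>
|   #include <cstdint>
|   #include <iterator>
|   #include <vector>
|
|   // Membership test on a sorted array: ~28 probes for 250M ids,
|   // no false positives, and deletion is straightforward.
|   bool isCustomer(const std::vector<uint32_t>& ids, uint32_t id) {
|       return std::binary_search(ids.begin(), ids.end(), id);
|   }
|
|   // Bulk insert: merge a sorted batch of new ids in one pass.
|   void addCustomers(std::vector<uint32_t>& ids,
|                     const std::vector<uint32_t>& newIds) {
|       std::vector<uint32_t> merged;
|       merged.reserve(ids.size() + newIds.size());
|       std::merge(ids.begin(), ids.end(), newIds.begin(), newIds.end(),
|                  std::back_inserter(merged));
|       ids.swap(merged);
|   }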
| PragmaticPulp wrote:
| Obviously you don't need bloom filters if lookups are cheap,
| but you're ignoring all of the cases where lookups aren't
| cheap.
|
| You can't always put the entire database of everything in
| memory. Consider cases where the check is done client-side and
| requires a network request. A bloom filter could allow you to
| optimize away a network request in some, though not all, cases.
| marginalia_nu wrote:
| I'm curious if you have a concrete example of a case where
| lookups aren't cheap and a bloom filter can solve the
| problem.
|
| In general, bloom filters are mainly useful for the sort of
| data where lookups _are_ cheap, that is, testing for
| membership of a set, and that is typically one of the
| operations a database does really well.
|
| In most cases where you just want to save a network
| roundtrip, an LRU cache is often a lot easier to get right and
| is much easier to keep consistent against mutable data.
| jefftk wrote:
| What about https://developers.google.com/safe-browsing?
|
| In the normal case, when a website is not matched by the
| filter, you save a network request. This is good for both
| latency and privacy.
| Gh0stRAT wrote:
| Imagine you're making a web browser plugin that blocks ads,
| or malicious sites. Let's assume the blocklist is a hundred
| megs (easily fits in ram) and your millions of users need
| to get the latest data at least hourly in order to keep up
| with the latest URLs that you want to block.
|
| Rather than distributing the entire blocklist to your
| userbase, you can instead send a bloom filter + an
| allowlist of the small handful of sites which have a hash
| collision with one of the blocked sites.
|
| As a bonus, computing the hashes will have great branch
| prediction characteristics and you'll have fewer cache
| misses because the bloom filter is tiny and frequently
| accessed, so your plugin will not add any perceptible
| slowdown to the user's experience.
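|
| Something like this, roughly (toy C++; the filter sizing, the
| salted std::hash trick, and all names are made up for
| illustration):
|
|   #include <bitset>
|   #include <functional>
|   #include <string>
|   #include <unordered_set>
|
|   // Client-side blocklist check: the Bloom filter answers
|   // "definitely not blocked" for almost every URL; a small
|   // allowlist of known false positives cancels the rest.
|   struct UrlBloom {
|       std::bitset<1 << 20> bits;  // sized arbitrarily for the sketch
|       static size_t h(const std::string& u, int salt) {
|           return std::hash<std::string>{}(u + char('0' + salt)) % (1 << 20);
|       }
|       void add(const std::string& u) {
|           for (int i = 0; i < 5; i++) bits.set(h(u, i));
|       }
|       bool maybeContains(const std::string& u) const {
|           for (int i = 0; i < 5; i++)
|               if (!bits.test(h(u, i))) return false;
|           return true;
|       }
|   };
|
|   bool isBlocked(const UrlBloom& blocklist,
|                  const std::unordered_set<std::string>& fpAllowlist,
|                  const std::string& url) {
|       if (!blocklist.maybeContains(url)) return false; // common case
|       return fpAllowlist.count(url) == 0;  // hit: consult exceptions
|   }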
| marginalia_nu wrote:
| Couldn't you just send deltas if that is the case? Surely
| the hourly updates wouldn't be hundreds of megabytes? I
| just tested extracting 5 million URLs from my web crawler
| and it was like 150 MB in plain text. That's ignoring how
| easy it is to create compression schemes for URLs that
| slash their memory footprint by something like 80%.
| Gh0stRAT wrote:
| Yes, you could do it with deltas instead. The tradeoffs
| are that it won't be as fast and will cost you a LOT more
| in bandwidth. (and cost your users more bandwidth as
| well. They might be on a very slow/limited data plan)
| Maybe you don't care about bandwidth and would prefer to
| avoid the complexity and maintenance overhead of adding a
| bloom filter.
|
| As with anything, there are tradeoffs and your
| requirements can change over time. Maybe the ad networks
| or malware creators start using new domains every 10
| minutes to counter your blocking system so now you have
| to store more data and disseminate it more frequently.
|
| As engineers, it's our job to weigh the tradeoffs between
| different solutions given the resources and constraints
| of the situation. For the situation I've outlined above,
| I'd at least strongly consider a bloom filter but it's
| certainly not the only way to do it.
| marginalia_nu wrote:
| > As engineers, it's our job to weigh the tradeoffs
| between different solutions given the resources and
| constraints of the situation. For the situation I've
| outlined above, I'd at least strongly consider a bloom
| filter but it's certainly not the only way to do it.
|
| We should also not gloss over the fact that a bloom filter
| can't rule out false positives, and it's really not
| feasible to figure out which they are. Due to the nature
| of hashing, you won't be able to easily come up with a
| list of false positives. Finding just one hash collision
| in a wide hash is computationally stupidly hard.
|
| > As with anything, there are tradeoffs and your
| requirements can change over time. Maybe the ad networks
| or malware creators start using new domains every 10
| minutes to counter your blocking system so now you have
| to store more data and disseminate it more frequently.
|
| Domains cost quite a lot of money so that is still pretty
| unrealistic. Sure you can have CN wildcards, but you can
| also do wildcard matching.
|
| Actually this whole scenario is unrealistic, since you
| can just serve ads off a random URL. The way you would
| create a decent ad filter is to look for characteristics
| in the script itself (a bit like an antivirus program),
| not base it off the URL.
| teraflop wrote:
| > We should also not gloss over that with a bloom filter
| can't rule out false positives, and it's really not
| feasible to figure out which they are.
|
| Nothing says a bloom filter has to be the _only_ data
| structure you use. It 's a performance optimization; even
| if 1% of the time you have to consult a more expensive
| data structure to confirm your result, it can still save
| you a lot of computation in the long run.
|
| > Finding just one hash collision in a wide hash is
| computationally stupidly hard.
|
| But bloom filters don't use wide hashes; the range of
| the hash function is the number of bits in the filter.
|
| > Domains cost quite a lot of money so that is still
| pretty unrealistic.
|
| In the case of malware, the cost of _buying_ domains
| isn't that relevant, because you can compromise existing
| domains using automated attacks.
| jameshart wrote:
| Additionally, this method avoids you having to distribute
| a list of dubious sites to your users.
| la_fayette wrote:
| Bitcoin light wallet addresses are registered with bloom
| filters on full nodes.
| spenczar5 wrote:
| I use them in astronomical data indexing when doing cross-
| match searches. That is, "could there be a star or other
| moving object within 0.5 arcsec of this point in the
| sky?" Sky catalogs have many billions of objects but the
| density per solid angle of the sky is extremely variable,
| and moving objects also make lookups really complicated.
| Sky catalogs are big enough that they are not storable in
| memory, and may be in big Parquet tables on the network, so
| the full lookup is quite expensive.
|
| Bloom filters are great for this because the backing
| dataset changes extremely slowly, like a few times a year
| when a survey publishes a new data release.
| marginalia_nu wrote:
| Then we're indeed back in the territory of useful use
| cases, typically characterized by data that is far bigger
| than the system memory.
| gpderetta wrote:
| Bloom filters generally are much smaller than the data
| they represent. So if your set fits on disk, the bf might
| fit in RAM; if your data fits in RAM, your bf might fit
| in L3, and so on.
|
| Edit: in particular you might want to use bloom filters
| for early rejects, when most of the set membership queries
| are expected to fail.
| sethev wrote:
| I think you missed the point of the example. Lemire's example
| isn't "the use-case" for this optimization, it's just a trivial
| motivating example. The use-case is more like, "assume you're
| currently using a Bloom filter in an appropriate context,
| here's a way that might be more efficient".
|
| The point of the post is the optimization technique, not trying
| to describe when a Bloom filter would be appropriate.
| ak39 wrote:
| Why wouldn't "isPayingCustomer" not be part of the Customer
| object graph immediately after logging in to begin with? (Maybe
| I didn't understand the use case properly)
| marcosdumay wrote:
| This is a very bad example, but the underlying concepts are
| useful.
|
| It's more realistic if you are storing a set of revoked
| access keys or altered files in your cache, subscribed from
| some channel that is not on your main request - database -
| response loop.
|
| Normally you won't ever need to store many of those before
| they reach their expiration time or you update your cache, and
| you will want a lot of performance for the common case (which
| the small memory requirements help ensure).
| marginalia_nu wrote:
| I think fundamentally bloom filters are mainly useful for
| astronomical data sizes. If you have a petabyte of data
| partitioned between several datacenters, and you want to
| know which shards have entries with certain properties,
| bloom filters are amazing. A bloom filter that is just a few
| dozen megabytes can save you an enormous amount of
| computation in such a scenario.
|
| To cache network requests it really isn't that great. A
| basic LRU cache typically performs just as well, and is a
| lot easier to get right.
| k2xl wrote:
| a good use case is malicious URL detection for browsers. they
| can look up locally and get a sense of whether the url is bad
| and then double-check against the webserver.
| kwillets wrote:
| A Bloom filter with 1% FPR would fit in 285 MB and need 7
| accesses per key.
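|
| (Those numbers follow from the standard sizing formulas,
| assuming the 250 million keys from the example above:
|
|     bits/key = -ln(0.01) / (ln 2)^2  =~  9.6
|     total    = 250e6 * 9.6 bits      =~  2.4e9 bits  =~  285 MiB
|     k        = 9.6 * ln 2            =~  6.6, rounded up to 7 )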
| hinkley wrote:
| And you can consistent hash your way out of that problem by
| having two servers for each customer ID.
|
| But I think the bigger problem is figuring out if John Smith is
| already in our system, and IDs don't help here.
| bugzz wrote:
| The main use case I've always seen is to use bloom filters on
| the client side to reduce traffic to the server looking things
| up. As you said, you could cache 250 million integers in a
| gigabyte - but you don't want to bloat your client side
| implementation by a gigabyte for every bloom filter you use.
| Also, many times the items are a lot larger than an integer.
| For example, storing a list of malicious URLs, a common use
| case.
| bqmjjx0kac wrote:
| I'm so confused by this use case (the traffic-saving one, not
| the malicious URL classifier). Why not store the "is-paying-
| customer" bit in a cookie?
|
| What are we using as the user identifier? Where does it come
| from, if not a cookie?
|
| Also, this client-side bloom filter kind of leaks your user
| database, supposing it's keyed on email addresses and your
| adversary has a gigantic list of email addresses, or is
| patient enough to enumerate them.
| [deleted]
| lelandfe wrote:
| At the risk of finally exposing myself as an impostor, does
| anyone have a good link for an explanation of bloom filters?
|
| I've tried, unsuccessfully, to wrap my head around the concept
| before, and this article seems really interesting!
| [deleted]
| a11r wrote:
| The original paper by Burton H. Bloom is only 5 pages and very
| readable. https://dl.acm.org/doi/10.1145/362686.362692
| ghostly_s wrote:
| https://sci-hub.st/https://doi.org/10.1145/362686.362692
| dr_kiszonka wrote:
| I will take an imposter over a grandiose narcissist any day!
| (But then I am an imposter too.)
| robertlagrant wrote:
| > If everyone is an imposter, no one is!
|
| - Syndrome
| rattlesnakedave wrote:
| You want to check if an element is probably in a dataset. You
| want speed, and the possibility of a false positive doesn't
| scare you that much.
|
| Say you want to look up if a username is in a database. That
| could take a while if you have a really large db of usernames,
| but you want to tell a user quickly if a name is taken.
|
| So what you can do is make a bloom filter, which is basically
| just an array of n bits. Let's say n = 10 for our
| example.
|
| Then gather some hash functions (h1, h2, and h3).
|
| When someone inputs a username, you run it through the hash
| functions, and modulo by the number of bits, and get an integer
| result.
|
| So:
|
| h1("rattlesnakedave") % 10 = 3 h2("rattlesnakedave") % 10 = 7
| h3("rattlesnakedave") % 10 = 2
|
| Then you check the bits at positions 3, 7, and 2 in the array.
|
| If any of the bits are set to 0, we haven't seen the
| username before.
|
| If they're all set to 1, we have possibly seen the username
| before.
|
| Let's say they're all set to 0, so we probably haven't seen it.
| We can present the message that the username is available, and
| allow registration. During registration, we can do a slower
| more expensive verification that the username isn't actually in
| the dataset.
|
| Now that the username is in the dataset, we can flip the bits
| on our bloom filter at positions 3, 7, and 2 to 1. So the next
| time we look up that username we can tell quickly that it's
| probably taken.
|
| There is no way to delete from the bloom filter. The more
| values you store relative to its size, the more false positives
| you get.
|
| Hopefully that makes sense, and is correct.
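|
| A minimal sketch of the above in C++ (the salted-std::hash
| "hash functions" and all names are made up for illustration,
| not production quality):
|
|   #include <bitset>
|   #include <functional>
|   #include <string>
|
|   // Toy Bloom filter with 10 bits and 3 hash functions, as in
|   // the example above.
|   struct TinyBloom {
|       std::bitset<10> bits;
|
|       // Derive 3 "hash functions" by salting std::hash.
|       static size_t h(const std::string& s, int salt) {
|           return std::hash<std::string>{}(s + char('0' + salt)) % 10;
|       }
|
|       void add(const std::string& name) {
|           for (int i = 0; i < 3; i++) bits.set(h(name, i));
|       }
|
|       // false -> definitely not seen; true -> possibly seen.
|       bool maybeContains(const std::string& name) const {
|           for (int i = 0; i < 3; i++)
|               if (!bits.test(h(name, i))) return false;
|           return true;
|       }
|   };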
| DougBTX wrote:
| Usernames are a nice example. Just as a demo, we could make
| things simpler:
|
| Look at all the usernames in the input, keep track of all the
| letters you've seen. When you get a new name, if you've seen
| all the letters before, you might have seen the name before.
| If there's a new letter, it must be a new name.
|
| Obviously that isn't a great hash function, some letters are
| more common than others, so real Bloom filters use better
| ones, but it works basically the same way.
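|
| As a toy version of that in C++ (hypothetical names; one bit
| per letter a-z instead of real hash functions):
|
|   #include <cstdint>
|   #include <string>
|
|   struct LetterFilter {
|       uint32_t seen = 0;  // 26 bits, one per letter
|       static uint32_t mask(const std::string& name) {
|           uint32_t m = 0;
|           for (char c : name)
|               if (c >= 'a' && c <= 'z') m |= 1u << (c - 'a');
|           return m;
|       }
|       void add(const std::string& name) { seen |= mask(name); }
|       // false -> definitely a new name; true -> possibly seen.
|       bool maybeSeen(const std::string& name) const {
|           uint32_t m = mask(name);
|           return (seen & m) == m;
|       }
|   };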
| Taywee wrote:
| Not exactly correct. If all of the bits are 1, then the value
| is possibly in the set (not "probably", but "possibly", with
| the likelihood being affected by the sizes of the set and the
| array).
|
| If any of the bits is zero, the value definitely isn't in the
| set (also not "probably haven't", but definitely not).
| rattlesnakedave wrote:
| Good catch! Thank you.
|
| Was trying to keep the initial explanation simple by
| leaving out the false positives scaling relative to the
| size of the array and set, and then clarifying at the end.
|
| The second part (any 0 bits means it hasn't been seen) is
| true; a goof-up on my part.
| actually_a_dog wrote:
| This article has a pretty good explanation, IMO:
| https://freecontent.manning.com/all-about-bloom-filters/
| stevenguh wrote:
| Try watching this video: https://youtu.be/Bay3X9PAX5k
| avinassh wrote:
| try this? - https://llimllib.github.io/bloomfilter-tutorial/
| 6510 wrote:
| let me butcher the concept for you :)
|
| say through numerology you reduce the character values of a
| name to a number 0-10. Lots of names will reduce to the same
| number. We take 11 bits, and for "jim" we set the first bit. We
| have only one name in our data set so all other bits are 0. Now
| if someone types "joe" in the search box and it reduces to 2 we
| look at the second bit, see it is a zero and know 100% *for
| sure* that this name is not in the data set. If "jack" reduces
| to 1 and we look him up we see 1 is set so this name *might be*
| there.
| cma wrote:
| Isn't this just similar to hash prefixes and not a true bloom
| filter?
| gscott wrote:
| Telling the GoogleBot to not index your site doesn't necessarily
| remove your site from Google. They can and often do still show
| your site, but they replace your description with a generic statement
| that they can't show it because you have decided not to include
| your site in their index. It is crazy but true.
| perihelions wrote:
| Oh, that's elegant.
|
| If I math right, setting 5 bits in a 64-bit word extends the
| effective length of the hash function by approximately (5-1) *
| log_2(64) = (5-1) * 6 = 24 bits. In a simple bloom filter of
| storage size _m_ bits, the hash functions have length log_2(m).
| All else being equal, the number of hash functions used -- hence
| the number of memory reads per lookup -- gets reduced by a factor
| of log_2(m) / [24 + log_2(m)].
|
| (But it's not exactly equal; the storage size _m_ has to go up by
| a factor of 1.2 (in the OP parameter set), for constant false-
| positive rate).
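|
| For concreteness, the scheme looks roughly like this (a C++
| sketch of the idea, not Lemire's actual code; names are mine):
|
|   #include <cstdint>
|   #include <vector>
|
|   // Word-aligned Bloom filter: one 64-bit hash per key; the high
|   // bits pick a word, the low bits pick 5 positions inside it.
|   struct WordBloom {
|       std::vector<uint64_t> words;
|       explicit WordBloom(size_t n) : words(n) {}
|
|       // Build a mask with (up to) 5 set bits, 6 hash bits each.
|       static uint64_t mask(uint64_t h) {
|           uint64_t m = 0;
|           for (int i = 0; i < 5; i++)
|               m |= uint64_t(1) << ((h >> (6 * i)) & 63);
|           return m;
|       }
|
|       void add(uint64_t h) {
|           words[(h >> 32) % words.size()] |= mask(h);
|       }
|       // One cache line touched per query.
|       bool maybeContains(uint64_t h) const {
|           uint64_t m = mask(h);
|           return (words[(h >> 32) % words.size()] & m) == m;
|       }
|   };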
| eternalban wrote:
| Each word in this design is a little bloom filter. It has
| number of bits (m) = 64, number of hashes (k) = 5, and goes for a
| 1% false positive rate, which per the formula in [1] (with n being
| the number of items in the filter) is:
|
|     fpr = (1 - e^(-k*n/m))^k
|
| OP solved for 1% and got n =~ 4.
|
| There is exactly 1 hash function (to compute the 64 bit key). 5
| bits are pseudo-randomly selected (giving us k=5) and written
| to an array element indexed off of the 64 bit key itself.
|
| So sizing this [naively] is a matter of taking your n (say 4
| million items) and dividing by 4, getting an array of 1 million
| words.
|
| What is not discussed in detail is block selection. Given the
| segmented nature of the array based data structure,
| distributing exactly 4 items per word is impossible for an on-
| line filter (as that requires perfect hashing). So, the actual
| 'load' capacity of this filter is, in array form, a fraction of
| 4 x array-len. Precisely, given an array of size S words, with
| capacity of Sx4, well before we reach Sx4, our keys will
| resolve to array elements, the little micro filters, that
| already have 4 registrations. Inserting the full allocated
| capacity (Sx4) will certainly result in a significant number of
| words having n greater than 4, which increases the false positive
| rate: by the formula above, at 5 items fpr is ~0.35%, and at 6 it
| is ~0.73%, climbing steeply from there. Equally, some array
| elements (words) will have fewer than 4 items, getting exceptional
| fprs. For example, at 2 entries, fpr is 0.0063%! If we load the
| full capacity, we can no longer make uniform statements about "1%"
| false positive rates, as OP does, it should be noted.
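|
| A quick way to check those per-word numbers (minimal C++,
| evaluating the formula from [1] at m = 64, k = 5):
|
|   #include <cmath>
|   #include <cstdio>
|
|   int main() {
|       const double m = 64.0, k = 5.0;
|       for (int n = 2; n <= 8; n++) {
|           double fpr = std::pow(1.0 - std::exp(-k * n / m), k);
|           std::printf("n = %d  fpr = %.4f%%\n", n, 100.0 * fpr);
|       }
|       // Prints roughly: n=2 -> 0.0063%, n=5 -> 0.35%, n=6 -> 0.73%.
|   }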
|
| So, you either have to keep track of elements assigned to words
| (3 bits / word to count 0..4), or you compute a probable value
| to minimize overloaded array elements, keeping max writes to a
| word to 4 items. My guess would be ~70%ish utilization range of
| allocated memory. In this case, we get our desired 1% (max) but
| get bonuses of fortunate keys that land in underloaded words
| and get super high assurances at the cost of eating a bit of
| space inefficiency. /g
|
| So this is the classic balls-into-bins problem [2]. For hashes,
| the surprising but powerful two-choice approach gets us load
| factors in the high 90%s. I'd recommend that. Yes, it doubles the
| hashing and cache-line costs, but the load factor will be in the
| high 90s range. Here OP should examine two candidate words, and
| pick the one with the fewest items. [On lookups, we have 2
| filters (words), one of which asserts it doesn't have it, and
| another that is 99% sure it does. Or both disclaim having it.]
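|
| Sketched out (toy C++; the derived second index and byte-wide
| counters are illustrative shortcuts, not a tuned design):
|
|   #include <cstdint>
|   #include <vector>
|
|   struct TwoChoiceBloom {
|       std::vector<uint64_t> words;
|       std::vector<uint8_t> counts;  // 3 bits/word would suffice
|       explicit TwoChoiceBloom(size_t n) : words(n), counts(n) {}
|
|       static uint64_t mask(uint64_t h) {  // 5 bits per key
|           uint64_t m = 0;
|           for (int i = 0; i < 5; i++)
|               m |= uint64_t(1) << ((h >> (6 * i)) & 63);
|           return m;
|       }
|       size_t idx1(uint64_t h) const { return (h >> 32) % words.size(); }
|       size_t idx2(uint64_t h) const {
|           // A real implementation would use an independent hash.
|           return ((h >> 48) | (h << 16)) % words.size();
|       }
|
|       void add(uint64_t h) {
|           size_t a = idx1(h), b = idx2(h);
|           size_t w = counts[a] <= counts[b] ? a : b;  // lighter word
|           words[w] |= mask(h);
|           if (counts[w] < 255) counts[w]++;
|       }
|       bool maybeContains(uint64_t h) const {  // check both candidates
|           uint64_t m = mask(h);
|           return (words[idx1(h)] & m) == m || (words[idx2(h)] & m) == m;
|       }
|   };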
|
| [1]:
| https://en.wikipedia.org/wiki/Bloom_filter#Probability_of_fa...
|
| [2]: https://en.wikipedia.org/wiki/Balls_into_bins_problem
|
| p.s. to OP: you could also return the fpr if you keep track of
| items per word so the call site has insight into the
| assurances. That would be useful for a probabilistic data
| structure.
___________________________________________________________________
(page generated 2021-10-03 23:00 UTC)