https://lemire.me/blog/2024/03/31/fast-and-concise-probabilistic-filters-in-python/ Skip to content Daniel Lemire's blog Daniel Lemire is a computer science professor at the Data Science Laboratory of the Universite du Quebec (TELUQ) in Montreal. His research is focused on software performance and data engineering. He is a techno-optimist and a free-speech advocate. Menu and widgets * My home page * My papers * My software Join over 12,500 email subscribers: [ ][Go!] You can follow this blog on telegram. You can find me on twitter as @lemire or on Mastodon. Search for: [ ] [Search] Support my work! I do not accept any advertisement. However, you can you can sponsor my open-source work on GitHub. Recent Posts * Fast and concise probabilistic filters in Python * Passing recursive C++ lambdas as function pointers * Measuring your system's performance using software (Go edition) * How to read files quickly in JavaScript * How many political parties rule Canada? Fun with statistics Recent Comments * Sandor DARGO on Passing recursive C++ lambdas as function pointers * Daniel Lemire on Passing recursive C++ lambdas as function pointers * Peter on Passing recursive C++ lambdas as function pointers * Timofey on Passing recursive C++ lambdas as function pointers * Denis Fedotov on Passing recursive C++ lambdas as function pointers Pages * A short history of technology * About me * Book recommendations * Cognitive biases * Interviews and talks * My bets * My favorite articles * My favorite quotes * My rules * Newsletter * Predictions * Privacy Policy * Recommended video games * Terms of use * Write good papers Archives Archives [Select Month ] Boring stuff * Log in * Entries feed * Comments feed * WordPress.org Fast and concise probabilistic filters in Python Sometimes you need to filter out or filter in data quickly. Suppose that your employer maintains a list of forbidden passwords or URLs or words. You may store them in a relational database and query them as needed. Unfortunately, this process can be slow and inefficient. A better approach might be to use a probabilistic filter. A probabilistic filter is a sort of 'approximate set'. You can ask it whether a key is present in the set, and if it is present, then you will always get 'true' (the correct answer). However, when the key is not present, you may still get 'true', although with a low probability. So the probabilistic filter is sometimes wrong. Why would you accept a data structure that is sometimes wrong? Because it can be several times smaller and faster than querying directly the actual set. The best known probabilistic filter is the Bloom filter, but there are many others. For example, we recently presented the binary fuse filters which are smaller and faster than Bloom filters, for large and immutable sets. The pyxorfilter module in Python is an implementation of the binary fuse filters. It provides support for several filter types but both Fuse8 and Fuse16 are interesting if you have fairly large sets. They provide respectively a 0.39% and 0.0015% false probability rate. So a Fuse16 filter is almost nearly correct. Why would you prefer Fuse8? Because it uses half the memory. We can construct a probabilistic filter in Python like so with the pyxorfilter filter: from pyxorfilter import Fuse8 data = [uuid.uuid4() for i in range(2000000)] filter = Fuse8(len(data)) filter.populate(data) Once it is done, you can be certain that all the content of your initial set is in the filter: for d in data: assert filter.contains(d) You can save the filter on disk and check how much memory is used... f = open(filename, 'wb') f.write(filter.serialize()) f.close() print(os.path.getsize(filename)*8.0/len(data)) If your set is large enough (say 1000,000 elements), you will find that the memory usage is about 9 bits per entry. It grows a bit larger per entry as the set gets smaller. For smaller sets, the pyxorfilter module offers an alternative (Xor8) that can be slightly more efficient in these cases. How do you know if you can trust the filter? Just query random inputs (highly likely not to be present) and see how many falsely appear to be in the set: # estimate false positive rate N = 1000000 count = 0 for i in range(N): count += filter.contains(uuid.uuid4()) fpp = count/N*100.0 As I already implied, if you replace Fuse8 by Fuse16, then the memory usage per element goes up to about 18 bits, but the false positive rate is far lower: 0.00200%. I produced a small benchmark. On my laptop, I get that you get over 1 million queries per second (each time checking the presence of a string). On an Intel-based server, I get a lower number, so about half a million per second. For binary fuse filters, it does not matter whether the element is in the set or not as far as performance goes, so I use random inputs. When using a Bloom filter (say), you would typically get worse performance when the elements are in the set. The pyxorfilter module was created Amey Narkhede. It still early days. I expect you can install pyxorfilter the usual way (pip install pyxorfilter) under x64 Linux. Unfortunately, under macOS and Windows, there are issues getting the module installed (see issue 19 and issue 10). However, if you are a Python hacker, you can build it from source relatively easily: git clone --recurse-submodules https://github.com/glitzflitz/pyxorfilter cd pyxorfilter python setup.py build_ext python setup.py install I am sure Amey would appreciate it if experienced Python hackers could help resolve the few remaining issues. There are functionalities that pyxorfilter misses. Currently, you can save the filter to disk and recover it later. Sadly, to use it, you need to load the whole filter in memory. That is not needed. It might be more suitable to use memory file mapping or even other lower-level input-output operations. What if you do not care about Python? You can use the xor and binary fuse filters in Go, C or C++, Zig, Rust, etc. I just love working in Python when I can. Published by [2ca999] Daniel Lemire A computer science professor at the University of Quebec (TELUQ). View all posts by Daniel Lemire Posted on March 31, 2024March 31, 2024Author Daniel LemireCategories Leave a Reply Cancel reply Your email address will not be published. [ ] [ ] [ ] [ ] [ ] [ ] [ ] Comment * [ ] Name * [ ] Email * [ ] Website [ ] [ ] Save my name, email, and website in this browser for the next time I comment. [Post Comment] [ ] [ ] [ ] [ ] [ ] [ ] [ ] D[ ] You may subscribe to this blog by email. Post navigation Previous Previous post: Passing recursive C++ lambdas as function pointers Terms of use Proudly powered by WordPress