https://lemire.me/blog/2024/03/31/fast-and-concise-probabilistic-filters-in-python/

Skip to content

Daniel Lemire's blog

Daniel Lemire is a computer science professor at the Data Science
Laboratory of the Universite du Quebec (TELUQ) in Montreal. His
research is focused on software performance and data engineering. He
is a techno-optimist and a free-speech advocate.

Menu and widgets

  * My home page
  * My papers
  * My software

Join over 12,500 email subscribers:

[                    ][Go!]
You can follow this blog on telegram.
You can find me on twitter as @lemire or on Mastodon.
Search for: [                    ] [Search]
Support my work!

I do not accept any advertisement. However, you can you can sponsor
my open-source work on GitHub.

Recent Posts

  * Fast and concise probabilistic filters in Python
  * Passing recursive C++ lambdas as function pointers
  * Measuring your system's performance using software (Go edition)
  * How to read files quickly in JavaScript
  * How many political parties rule Canada? Fun with statistics

Recent Comments

  * Sandor DARGO on Passing recursive C++ lambdas as function
    pointers
  * Daniel Lemire on Passing recursive C++ lambdas as function
    pointers
  * Peter on Passing recursive C++ lambdas as function pointers
  * Timofey on Passing recursive C++ lambdas as function pointers
  * Denis Fedotov on Passing recursive C++ lambdas as function
    pointers

Pages

  * A short history of technology
  * About me
  * Book recommendations
  * Cognitive biases
  * Interviews and talks
  * My bets
  * My favorite articles
  * My favorite quotes
  * My rules
  * Newsletter
  * Predictions
  * Privacy Policy
  * Recommended video games
  * Terms of use
  * Write good papers

Archives

Archives [Select Month        ]

Boring stuff

  * Log in
  * Entries feed
  * Comments feed
  * WordPress.org

Fast and concise probabilistic filters in Python

Sometimes you need to filter out or filter in data quickly. Suppose
that your employer maintains a list of forbidden passwords or URLs or
words. You may store them in a relational database and query them as
needed. Unfortunately, this process can be slow and inefficient.

A better approach might be to use a probabilistic filter. A
probabilistic filter is a sort of 'approximate set'. You can ask it
whether a key is present in the set, and if it is present, then you
will always get 'true' (the correct answer). However, when the key is
not present, you may still get 'true', although with a low
probability. So the probabilistic filter is sometimes wrong. Why
would you accept a data structure that is sometimes wrong? Because it
can be several times smaller and faster than querying directly the
actual set.

The best known probabilistic filter is the Bloom filter, but there
are many others. For example, we recently presented the binary fuse
filters which are smaller and faster than Bloom filters, for large
and immutable sets.

The pyxorfilter module in Python is an implementation of the binary
fuse filters. It provides support for several filter types but both
Fuse8 and Fuse16 are interesting if you have fairly large sets. They
provide respectively a 0.39% and 0.0015% false probability rate. So a
Fuse16 filter is almost nearly correct. Why would you prefer Fuse8?
Because it uses half the memory.

We can construct a probabilistic filter in Python like so with the
pyxorfilter filter:

from pyxorfilter import Fuse8
data = [uuid.uuid4() for i in range(2000000)]

filter = Fuse8(len(data))
filter.populate(data)


Once it is done, you can be certain that all the content of your
initial set is in the filter:

for d in data:
  assert filter.contains(d)

You can save the filter on disk and check how much memory is used...

f = open(filename, 'wb')
f.write(filter.serialize())
f.close()
print(os.path.getsize(filename)*8.0/len(data))


If your set is large enough (say 1000,000 elements), you will find
that the memory usage is about 9 bits per entry. It grows a bit
larger per entry as the set gets smaller. For smaller sets, the
pyxorfilter module offers an alternative (Xor8) that can be slightly
more efficient in these cases.

How do you know if you can trust the filter? Just query random inputs
(highly likely not to be present) and see how many falsely appear to
be in the set:

# estimate false positive rate
N = 1000000
count = 0
for i in range(N):
count += filter.contains(uuid.uuid4())
fpp = count/N*100.0


As I already implied, if you replace Fuse8 by Fuse16, then the memory
usage per element goes up to about 18 bits, but the false positive
rate is far lower: 0.00200%.

I produced a small benchmark. On my laptop, I get that you get over 1
million queries per second (each time checking the presence of a
string). On an Intel-based server, I get a lower number, so about
half a million per second.

For binary fuse filters, it does not matter whether the element is in
the set or not as far as performance goes, so I use random inputs.
When using a Bloom filter (say), you would typically get worse
performance when the elements are in the set.

The pyxorfilter  module was created Amey Narkhede. It still early
days. I expect you can install pyxorfilter the usual way (pip install
pyxorfilter) under x64 Linux. Unfortunately, under macOS and Windows,
there are issues getting the module installed (see issue 19 and issue
10). However, if you are a Python hacker, you can build it from
source relatively easily:

git clone --recurse-submodules https://github.com/glitzflitz/pyxorfilter
cd pyxorfilter
python setup.py build_ext
python setup.py install

I am sure Amey would appreciate it if experienced Python hackers
could help resolve the few remaining issues.

There are functionalities that pyxorfilter misses. Currently, you can
save the filter to disk and recover it later. Sadly, to use it, you
need to load the whole filter in memory. That is not needed. It might
be more suitable to use memory file mapping or even other lower-level
input-output operations.

What if you do not care about Python? You can use the xor and binary
fuse filters in Go, C or C++, Zig, Rust, etc. I just love working in
Python when I can.

Published by

[2ca999]

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).
View all posts by Daniel Lemire

Posted on March 31, 2024March 31, 2024Author Daniel LemireCategories 
 

Leave a Reply Cancel reply

Your email address will not be published.

          [                                             ]
          [                                             ]
          [                                             ]
          [                                             ]
          [                                             ]
          [                                             ]
          [                                             ]
Comment * [                                             ]

Name * [                              ]

Email * [                              ]

Website [                              ]

[ ] Save my name, email, and website in this browser for the next
time I comment.

[Post Comment] 

 [                                             ] 
 [                                             ] 
 [                                             ] 
 [                                             ] 
 [                                             ] 
 [                                             ] 
 [                                             ] 
D[                                             ] 

You may subscribe to this blog by email.

Post navigation

Previous Previous post: Passing recursive C++ lambdas as function
pointers
Terms of use Proudly powered by WordPress