https://lemire.me/blog/2024/02/04/how-fast-is-rolling-karp-rabin-hashing/ Skip to content Daniel Lemire's blog Daniel Lemire is a computer science professor at the Data Science Laboratory of the Universite du Quebec (TELUQ) in Montreal. His research is focused on software performance and data engineering. He is a techno-optimist and a free-speech advocate. Menu and widgets * My home page * My papers * My software Join over 12,500 email subscribers: [ ][Go!] You can follow this blog on telegram. You can find me on twitter as @lemire or on Mastodon. Search for: [ ] [Search] Support my work! I do not accept any advertisement. However, you can you can sponsor my open-source work on GitHub. Recent Posts * How fast is rolling Karp-Rabin hashing? * C23: a slightly better C * How much memory bandwidth do large Amazon instances offer? * Estimating your memory bandwidth * Implementing the missing sign instruction in AVX-512 Recent Comments * Daniel Lemire on Optimizing polymorphic code in Java * Polat Alemdart on Optimizing polymorphic code in Java * niconiconi on Estimating your memory bandwidth * Daniel Lemire on How much memory bandwidth do large Amazon instances offer? * Daniel Lemire on How much memory bandwidth do large Amazon instances offer? Pages * A short history of technology * About me * Book recommendations * Cognitive biases * Interviews and talks * My bets * My favorite articles * My favorite quotes * My rules * Newsletter * Predictions * Privacy Policy * Recommended video games * Terms of use * Write good papers Archives Archives [Select Month ] Boring stuff * Log in * Entries feed * Comments feed * WordPress.org How fast is rolling Karp-Rabin hashing? A hash function maps values (e.g., strings) into a fixed number of strings, typically smaller than the original. It is useful to compare quickly two long strings, for example. Instead of comparing the strings, you may compare the hash values. A simple hash function consists in repeatedly multiplying the hash value by some constant B (e.g., you may pick a number such as 31): uint32_t hash = 0; for (size_t i = 0; i < len; i++) { hash = hash * B + data[i]; } return hash; I credit Karp-Rabin for this type of hash functions. It is an example of recursive hashing: the hash function is computed by taking the hash value and updating it with the next character. Given a long strings, you may want to hash all sequences of N characters. A naive approach might be as follows: for(size_t i = 0; i < len-N; i++) { uint32_t hash = 0; for(size_t j = 0; j < N; j++) { hash = hash * B + data[i+j]; } //... } You are visiting each character value N times. If N is large, it is inefficient. You can do better using a rolling hash function: instead of recomputing the hash function anew each time, you just update it. It is possible to only access each character twice: uint32_t BtoN = 1; for(size_t i = 0; i < N; i++) { BtoN *= B; } uint32_t hash = 0; for(size_t i = 0; i < N; i++) { hash = hash * B + data[i]; } // ... for(size_t i = N; i < len; i++) { hash = hash * B + data[i] - BtoN * data[i-N]; // ... } The most expensive component of this routine is the line with two character accesses and two multiplications. I wrote a simple benchmark in C++ to see how fast a straight-forward implementation could be. I use a set window of size 8 for the purpose of this analysis, but the larger the window, the less competitive the naive implementation becomes. rolling hashing 0.75 GB/s 13 instructions/byte naive hashing 0.18 GB/s 61 instructions/byte Karp-Rabin is not the only type hash functions. Tabulated hashing is another approach that would be much more compute efficient. However, I suspect we could greatly improve my naive implementation of the Karp-Rabin rolling hash function. Further reading: * The universality of iterated hashing over variable-length strings , Discrete Applied Mathematics 160 (4-5), 2012 * Recursive n-gram hashing is pairwise independent, at best Computer Speech & Language 24 (4), pages 698-710, 2010 * Strongly universal string hashing is fast, Computer Journal 57 (11), 2014 Possibly relevant software: * C library implementing the ridiculously fast CLHash hashing function * Rolling Hash C++ Library Published by [2ca999] Daniel Lemire A computer science professor at the University of Quebec (TELUQ). View all posts by Daniel Lemire Posted on February 4, 2024Author Daniel LemireCategories Leave a Reply Cancel reply Your email address will not be published. To create code blocks or other preformatted text, indent by four spaces: This will be displayed in a monospaced font. The first four spaces will be stripped off, but all other whitespace will be preserved. Markdown is turned off in code blocks: [This is not a link](http://example.com) To create not a block, but an inline code span, use backticks: Here is some inline `code`. For more help see http://daringfireball.net/projects/markdown/syntax [ ] [ ] [ ] [ ] [ ] [ ] [ ] Comment * [ ] Name * [ ] Email * [ ] Website [ ] [ ] Save my name, email, and website in this browser for the next time I comment. Receive Email Notifications? [no, do not subscribe ] [instantly ] Or, you can subscribe without commenting. [Post Comment] [ ] [ ] [ ] [ ] [ ] [ ] [ ] D[ ] You may subscribe to this blog by email. Post navigation Previous Previous post: C23: a slightly better C Terms of use Proudly powered by WordPress