https://lemire.me/blog/2024/02/04/how-fast-is-rolling-karp-rabin-hashing/

Skip to content

Daniel Lemire's blog

Daniel Lemire is a computer science professor at the Data Science
Laboratory of the Universite du Quebec (TELUQ) in Montreal. His
research is focused on software performance and data engineering. He
is a techno-optimist and a free-speech advocate.

Menu and widgets

  * My home page
  * My papers
  * My software

Join over 12,500 email subscribers:

[                    ][Go!]
You can follow this blog on telegram.
You can find me on twitter as @lemire or on Mastodon.
Search for: [                    ] [Search]
Support my work!

I do not accept any advertisement. However, you can you can sponsor
my open-source work on GitHub.

Recent Posts

  * How fast is rolling Karp-Rabin hashing?
  * C23: a slightly better C
  * How much memory bandwidth do large Amazon instances offer?
  * Estimating your memory bandwidth
  * Implementing the missing sign instruction in AVX-512

Recent Comments

  * Daniel Lemire on Optimizing polymorphic code in Java
  * Polat Alemdart on Optimizing polymorphic code in Java
  * niconiconi on Estimating your memory bandwidth
  * Daniel Lemire on How much memory bandwidth do large Amazon
    instances offer?
  * Daniel Lemire on How much memory bandwidth do large Amazon
    instances offer?

Pages

  * A short history of technology
  * About me
  * Book recommendations
  * Cognitive biases
  * Interviews and talks
  * My bets
  * My favorite articles
  * My favorite quotes
  * My rules
  * Newsletter
  * Predictions
  * Privacy Policy
  * Recommended video games
  * Terms of use
  * Write good papers

Archives

Archives [Select Month        ]

Boring stuff

  * Log in
  * Entries feed
  * Comments feed
  * WordPress.org

How fast is rolling Karp-Rabin hashing?

A hash function maps values (e.g., strings) into a fixed number of
strings, typically smaller than the original. It is useful to compare
quickly two long strings, for example. Instead of comparing the
strings, you may compare the hash values.

A simple hash function consists in repeatedly multiplying the hash
value by some constant B (e.g., you may pick a number such as 31):

uint32_t hash = 0;
for (size_t i = 0; i < len; i++) {
  hash = hash * B + data[i];
}
return hash;


I credit Karp-Rabin for this type of hash functions. It is an example
of recursive hashing: the hash function is computed by taking the
hash value and updating it with the next character.

Given a long strings, you may want to hash all sequences of N
characters. A naive approach might be as follows:

for(size_t i = 0; i < len-N; i++) {
  uint32_t hash = 0;
  for(size_t j = 0; j < N; j++) {
    hash = hash * B + data[i+j];
  }
  //...
}


You are visiting each character value N times. If N is large, it is
inefficient.

You can do better using a rolling hash function: instead of
recomputing the hash function anew each time, you just update it. It
is possible to only access each character twice:

uint32_t BtoN = 1;
for(size_t i = 0; i < N; i++) { BtoN *= B; }

uint32_t hash = 0;
for(size_t i = 0; i < N; i++) {
  hash = hash * B + data[i];
}
// ...
for(size_t i = N; i < len; i++) {
  hash = hash * B + data[i] - BtoN * data[i-N];
  // ...
}


The most expensive component of this routine is the line with two
character accesses and two multiplications.

I wrote a simple benchmark in C++ to see how fast a straight-forward
implementation could be. I use a set window of size 8 for the purpose
of this analysis, but the larger the window, the less competitive the
naive implementation becomes.

rolling hashing 0.75 GB/s 13 instructions/byte
naive hashing   0.18 GB/s 61 instructions/byte

Karp-Rabin is not the only type hash functions. Tabulated hashing is
another approach that would be much more compute efficient.

However, I suspect we could greatly improve my naive implementation
of the Karp-Rabin rolling hash function.

Further reading:

  * The universality of iterated hashing over variable-length strings
    , Discrete Applied Mathematics 160 (4-5), 2012
  * Recursive n-gram hashing is pairwise independent, at best
    Computer Speech & Language 24 (4), pages 698-710, 2010
  * Strongly universal string hashing is fast, Computer Journal 57
    (11), 2014

Possibly relevant software:

  * C library implementing the ridiculously fast CLHash hashing
    function
  * Rolling Hash C++ Library

Published by

[2ca999]

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).
View all posts by Daniel Lemire

Posted on February 4, 2024Author Daniel LemireCategories  

Leave a Reply Cancel reply

Your email address will not be published.

To create code blocks or other preformatted text, indent by four
spaces:

    This will be displayed in a monospaced font. The first four
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax

          [                                             ]
          [                                             ]
          [                                             ]
          [                                             ]
          [                                             ]
          [                                             ]
          [                                             ]
Comment * [                                             ]

Name * [                              ]

Email * [                              ]

Website [                              ]

[ ] Save my name, email, and website in this browser for the next
time I comment.

Receive Email Notifications? [no, do not subscribe      ] 
[instantly    ]
Or, you can subscribe without commenting.

[Post Comment] 

 [                                             ] 
 [                                             ] 
 [                                             ] 
 [                                             ] 
 [                                             ] 
 [                                             ] 
 [                                             ] 
D[                                             ] 

You may subscribe to this blog by email.

Post navigation

Previous Previous post: C23: a slightly better C
Terms of use Proudly powered by WordPress