[HN Gopher] Show HN: Rain hashes - well designed, simple and fas...
       ___________________________________________________________________
        
       Show HN: Rain hashes - well designed, simple and fast variable
       sized hashes
        
       Author : keepamovin
       Score  : 62 points
       Date   : 2024-12-13 15:46 UTC (7 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | keepamovin wrote:
       | This is my hash included in the best hash testing library out
       | there "SMHasher3" maintained by Frank T Wojcik:
       | https://gitlab.com/fwojcik/smhasher3 This test lib has many
       | improvements over SMHasher 1 & 2, listed on the previous link.
       | Results are:
       | https://gitlab.com/fwojcik/smhasher3/-/blob/main/results/REA...
       | 
       | Rainbow is the fastest 128-bit hash, and the fastest 256-bit hash
       | (non-crypto). The 8th fastest 64-bit hash (by family, or 9th
       | fastest overall, and 13-th fatest overall if you include 32-bit
       | hashes). The advantage of rainbow is its easily scalable output
       | size (64, 128 and 256), its high quality (passes all the tests),
       | and its utter simplicity: it is under 140 source lines of code,
       | easily readable
       | 
       | The random constants are primes that were selected based on their
       | avalanche quality under a multiply modulo operation. This part of
       | the development was interesting different primes had very
       | distinctly different avalanche qualities. The highest quality
       | primes caused good avalanche (~ 50% bit flip probability) across
       | the widest possible set of bits, on average. These are quite
       | rare. A lot of large primes only avalanche across a narrow range
       | of bits, even in a 128-bit space. The search program took a
       | couple of days to discover all these primes running on a 2020s
       | era MBP. Primes are chosen because they give a complete residue
       | set under the modulus, ensuring a long cycle length at least
       | regarding the nominal test operation.
       | 
       | The rest of the hash was developed by trial and error using my
       | intuition on developing hashes arising from long experience of
       | doing so, and using SMHasher3 to evaluate the results, by
       | iterating to improve and re-testing, over a period of a couple of
       | weeks in the holidays a few years ago. I started making hash
       | functions in my teens as a fun hobby.
       | 
       | There's also a fun little wasm powered dashboard you can play
       | with, here: https://dosaygo-research.github.io/rain/
        
         | mulle_nat wrote:
         | I am using fnv-1a to hash Objective-C method selectors, which
         | are generally just identifier characters and zero or multiple
         | ':'. At the time of my research, fnv-1a had the least
         | collisions over my set of "real life" selectors. I think, it
         | could be worthwhile some time, to try out other constants for
         | maybe even less collisions. Is your list of good primes
         | available ? (And maybe also those that are not quite perfect)
        
       | zokier wrote:
       | The benchmarks are bogus at the least, you are essentially just
       | measuring the startup time of the executable/runtime and not your
       | hash function.
        
         | keepamovin wrote:
         | Right, how to improve that? That nodejs bench script is meant
         | to compare between cpp and wasm.
         | 
         | For hash speed consult the smasher3 results.
        
       | tptacek wrote:
       | Broadly: I don't understand the design space of "cryptographic-
       | ish" hashes. There's an intensely competitive design space of
       | cryptographic hashes. What does it mean to be "more secure" than
       | an insecure hash function? When would it ever be appropriate to
       | use such a hash?
        
         | dan_manges wrote:
         | Checksums are a good use case, where you're not worried about
         | any sort of forged collisions and you want to calculate the
         | checksum as fast as possible. Although cryptographic hashes are
         | fast enough that there likely aren't too many use cases where
         | the difference in speed is meaningful.
        
           | loeg wrote:
           | If you don't care about forged collisions, what makes a
           | secure-ish checksum bette than a faster insecure checksum?
           | Both are likely to result in a different value for bit-
           | flipped input, which is the point.
        
             | nepthar wrote:
             | That's a good question. A cryptographic hash must not leak
             | any information other than "yes" or "no". A checksum-type
             | hash might have an additional property where data that only
             | has a single bit flipped might hash to something close to
             | the original hash, allowing you to refine your collision
             | attack by knowing if you're hot or cold.
             | 
             | For a checksum, this isnt a bad thing and actually could be
             | good.
        
             | cstrahan wrote:
             | Note that you said "likely" and not "equally likely".
             | Therein lies your answer.
             | 
             | There are applications where hashes are used as
             | identifiers, where it would be impossible to use the
             | original data to resolve possible collisions. One example
             | would be in RTTI (Run Time Type Information), when you want
             | to check if two objects (of unknown-at-compile-time type)
             | are instances of the same type. The only performant way to
             | achieve this is if the compiler emits each type's RTTI with
             | a unique integer key to efficiently compare against, and
             | this key must be unique across all object files and shared
             | objects. In practice, this is done using hashing. If there
             | is a collision, then the program behavior is undefined, so
             | it is ideal to minimize probability of collision. You can
             | see this issue in the Rust compiler project where there is
             | ongoing discussion about how to handle this hashing, trying
             | to balance compiler speed and collision likeliness:
             | 
             | https://github.com/rust-lang/rust/issues/129014
             | 
             | Another scenario would be like what Backtrace does: a hash
             | is used to identify certain classes of errors in
             | application code. The cost of analyzing each error is quite
             | high (you have to load objects into memory, source maps (if
             | available), perform symbolification, etc), so comparing any
             | two errors every time you want to perform any metrics or
             | retrieve associated data (like issue tracker identifiers)
             | would reduce the performance to a point that any such
             | application would be too slow to be useful. So a lot of
             | thought was given to producing fast, non-cryptographic
             | hashes (these hashes aren't shared across tenants, and no
             | customer is going to go out of their way to work out
             | initialization parameters so they can be their own
             | adversary) with a known likely hood of collision. You can
             | read about the hash (UMASH) here:
             | 
             | https://engineering.backtrace.io/2020-08-24-umash-fast-
             | enoug...
        
               | loeg wrote:
               | Well, also notice that I said "bit-flipped input." I.e.,
               | the use case I believed to be talking about and
               | responding to was data corruption detection (which is
               | also associated with "checksum" as a term of art, as
               | opposed to a generic "hash"). I agree that hashing for
               | identity is a different use case than checksumming for
               | data integrity.
               | 
               | FWIW I think commonly used checksums are basically ideal
               | ("equally likely" to come to a different value) on
               | single-bit flipped inputs, but don't quote me on that.
        
         | jnwatson wrote:
         | Hash tables and related data structures.
        
           | loeg wrote:
           | If the distribution of bits isn't better, you would prefer a
           | faster hash for a table that doesn't care about adversarial
           | inputs. (And if you do care about adversarial inputs, I think
           | you need an actually secure hash? Medium confidence on that.)
        
             | bcoates wrote:
             | For adversarial inputs, even a cryptographic hash won't
             | save you, the cost of brute force searching for
             | pathological keys is based the number of buckets not the
             | cost of finding full collisions.
        
               | tptacek wrote:
               | There's a whole literature of cryptographic hashes
               | designed for this problem.
        
         | zamalek wrote:
         | Content addressable storage. For example, the Git DAG would be
         | one place to use it (depending on which camp you're in: should
         | the DAG be cryptographically secure, or should the security be
         | out-of-band e.g. signatures).
        
           | loeg wrote:
           | What operations would be faster in Git from using an insecure
           | hash instead of SHA256? Commit is already reasonably fast.
        
         | iainmerrick wrote:
         | I have the same question.
         | 
         | Might be useful to clarify for people replying with suggested
         | use cases: this repo has two hash functions, describing them as
         | "General-purpose, non-crypto hash" (OK) and "Experimental
         | cryptographic-like hashing" (???)
         | 
         | The question is, what is that second one for? Do they just mean
         | it could eventually become a cryptographic hash?
        
       | thadt wrote:
       | So, questions for people that know more than I: What's up with
       | the 'passing' and 'failing' hashes in SMHasher3?
       | 
       | Specifically, what happened to xxHash[1] to kick it down to the
       | 'failing' category? I'm not certain about SMHasher's testing
       | criteria, but comparing the benchmark numbers posed in Rain vs
       | xxHash show xxH3 rather faster? Not that a have a personally
       | vested intersted in xxHash, but it seems to be a rather widely
       | used (e.g. RocksDB [2]) with a number of implementations - it
       | would be useful to understand the criteria that make it no longer
       | a passing SMHasher3 entry.
       | 
       | [1] https://github.com/Cyan4973/xxHash
       | 
       | [2] https://rocksdb.org/blog/2022/07/18/per-key-value-
       | checksum.h...
        
         | zimpenfish wrote:
         | > Specifically, what happened to xxHash to kick it down to the
         | 'failing' category?
         | 
         | Looking at [1], it seems to collisions (search for "!!!!!")
         | 
         | [1]
         | https://gitlab.com/fwojcik/smhasher3/-/blob/main/results/raw...
        
       | mrbluecoat wrote:
       | The fastest 128-bit and 256-bit non-crypto hash, passes all
       | tests, and under 140 source lines of code
       | 
       | Nice!
        
       ___________________________________________________________________
       (page generated 2024-12-13 23:00 UTC)