[HN Gopher] Show HN: Rain hashes - well designed, simple and fas...
___________________________________________________________________
Show HN: Rain hashes - well designed, simple and fast variable
sized hashes
Author : keepamovin
Score : 62 points
Date : 2024-12-13 15:46 UTC (7 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| keepamovin wrote:
| This is my hash included in the best hash testing library out
| there "SMHasher3" maintained by Frank T Wojcik:
| https://gitlab.com/fwojcik/smhasher3 This test lib has many
| improvements over SMHasher 1 & 2, listed on the previous link.
| Results are:
| https://gitlab.com/fwojcik/smhasher3/-/blob/main/results/REA...
|
| Rainbow is the fastest 128-bit hash, and the fastest 256-bit hash
| (non-crypto). The 8th fastest 64-bit hash (by family, or 9th
| fastest overall, and 13-th fatest overall if you include 32-bit
| hashes). The advantage of rainbow is its easily scalable output
| size (64, 128 and 256), its high quality (passes all the tests),
| and its utter simplicity: it is under 140 source lines of code,
| easily readable
|
| The random constants are primes that were selected based on their
| avalanche quality under a multiply modulo operation. This part of
| the development was interesting different primes had very
| distinctly different avalanche qualities. The highest quality
| primes caused good avalanche (~ 50% bit flip probability) across
| the widest possible set of bits, on average. These are quite
| rare. A lot of large primes only avalanche across a narrow range
| of bits, even in a 128-bit space. The search program took a
| couple of days to discover all these primes running on a 2020s
| era MBP. Primes are chosen because they give a complete residue
| set under the modulus, ensuring a long cycle length at least
| regarding the nominal test operation.
|
| The rest of the hash was developed by trial and error using my
| intuition on developing hashes arising from long experience of
| doing so, and using SMHasher3 to evaluate the results, by
| iterating to improve and re-testing, over a period of a couple of
| weeks in the holidays a few years ago. I started making hash
| functions in my teens as a fun hobby.
|
| There's also a fun little wasm powered dashboard you can play
| with, here: https://dosaygo-research.github.io/rain/
| mulle_nat wrote:
| I am using fnv-1a to hash Objective-C method selectors, which
| are generally just identifier characters and zero or multiple
| ':'. At the time of my research, fnv-1a had the least
| collisions over my set of "real life" selectors. I think, it
| could be worthwhile some time, to try out other constants for
| maybe even less collisions. Is your list of good primes
| available ? (And maybe also those that are not quite perfect)
| zokier wrote:
| The benchmarks are bogus at the least, you are essentially just
| measuring the startup time of the executable/runtime and not your
| hash function.
| keepamovin wrote:
| Right, how to improve that? That nodejs bench script is meant
| to compare between cpp and wasm.
|
| For hash speed consult the smasher3 results.
| tptacek wrote:
| Broadly: I don't understand the design space of "cryptographic-
| ish" hashes. There's an intensely competitive design space of
| cryptographic hashes. What does it mean to be "more secure" than
| an insecure hash function? When would it ever be appropriate to
| use such a hash?
| dan_manges wrote:
| Checksums are a good use case, where you're not worried about
| any sort of forged collisions and you want to calculate the
| checksum as fast as possible. Although cryptographic hashes are
| fast enough that there likely aren't too many use cases where
| the difference in speed is meaningful.
| loeg wrote:
| If you don't care about forged collisions, what makes a
| secure-ish checksum bette than a faster insecure checksum?
| Both are likely to result in a different value for bit-
| flipped input, which is the point.
| nepthar wrote:
| That's a good question. A cryptographic hash must not leak
| any information other than "yes" or "no". A checksum-type
| hash might have an additional property where data that only
| has a single bit flipped might hash to something close to
| the original hash, allowing you to refine your collision
| attack by knowing if you're hot or cold.
|
| For a checksum, this isnt a bad thing and actually could be
| good.
| cstrahan wrote:
| Note that you said "likely" and not "equally likely".
| Therein lies your answer.
|
| There are applications where hashes are used as
| identifiers, where it would be impossible to use the
| original data to resolve possible collisions. One example
| would be in RTTI (Run Time Type Information), when you want
| to check if two objects (of unknown-at-compile-time type)
| are instances of the same type. The only performant way to
| achieve this is if the compiler emits each type's RTTI with
| a unique integer key to efficiently compare against, and
| this key must be unique across all object files and shared
| objects. In practice, this is done using hashing. If there
| is a collision, then the program behavior is undefined, so
| it is ideal to minimize probability of collision. You can
| see this issue in the Rust compiler project where there is
| ongoing discussion about how to handle this hashing, trying
| to balance compiler speed and collision likeliness:
|
| https://github.com/rust-lang/rust/issues/129014
|
| Another scenario would be like what Backtrace does: a hash
| is used to identify certain classes of errors in
| application code. The cost of analyzing each error is quite
| high (you have to load objects into memory, source maps (if
| available), perform symbolification, etc), so comparing any
| two errors every time you want to perform any metrics or
| retrieve associated data (like issue tracker identifiers)
| would reduce the performance to a point that any such
| application would be too slow to be useful. So a lot of
| thought was given to producing fast, non-cryptographic
| hashes (these hashes aren't shared across tenants, and no
| customer is going to go out of their way to work out
| initialization parameters so they can be their own
| adversary) with a known likely hood of collision. You can
| read about the hash (UMASH) here:
|
| https://engineering.backtrace.io/2020-08-24-umash-fast-
| enoug...
| loeg wrote:
| Well, also notice that I said "bit-flipped input." I.e.,
| the use case I believed to be talking about and
| responding to was data corruption detection (which is
| also associated with "checksum" as a term of art, as
| opposed to a generic "hash"). I agree that hashing for
| identity is a different use case than checksumming for
| data integrity.
|
| FWIW I think commonly used checksums are basically ideal
| ("equally likely" to come to a different value) on
| single-bit flipped inputs, but don't quote me on that.
| jnwatson wrote:
| Hash tables and related data structures.
| loeg wrote:
| If the distribution of bits isn't better, you would prefer a
| faster hash for a table that doesn't care about adversarial
| inputs. (And if you do care about adversarial inputs, I think
| you need an actually secure hash? Medium confidence on that.)
| bcoates wrote:
| For adversarial inputs, even a cryptographic hash won't
| save you, the cost of brute force searching for
| pathological keys is based the number of buckets not the
| cost of finding full collisions.
| tptacek wrote:
| There's a whole literature of cryptographic hashes
| designed for this problem.
| zamalek wrote:
| Content addressable storage. For example, the Git DAG would be
| one place to use it (depending on which camp you're in: should
| the DAG be cryptographically secure, or should the security be
| out-of-band e.g. signatures).
| loeg wrote:
| What operations would be faster in Git from using an insecure
| hash instead of SHA256? Commit is already reasonably fast.
| iainmerrick wrote:
| I have the same question.
|
| Might be useful to clarify for people replying with suggested
| use cases: this repo has two hash functions, describing them as
| "General-purpose, non-crypto hash" (OK) and "Experimental
| cryptographic-like hashing" (???)
|
| The question is, what is that second one for? Do they just mean
| it could eventually become a cryptographic hash?
| thadt wrote:
| So, questions for people that know more than I: What's up with
| the 'passing' and 'failing' hashes in SMHasher3?
|
| Specifically, what happened to xxHash[1] to kick it down to the
| 'failing' category? I'm not certain about SMHasher's testing
| criteria, but comparing the benchmark numbers posed in Rain vs
| xxHash show xxH3 rather faster? Not that a have a personally
| vested intersted in xxHash, but it seems to be a rather widely
| used (e.g. RocksDB [2]) with a number of implementations - it
| would be useful to understand the criteria that make it no longer
| a passing SMHasher3 entry.
|
| [1] https://github.com/Cyan4973/xxHash
|
| [2] https://rocksdb.org/blog/2022/07/18/per-key-value-
| checksum.h...
| zimpenfish wrote:
| > Specifically, what happened to xxHash to kick it down to the
| 'failing' category?
|
| Looking at [1], it seems to collisions (search for "!!!!!")
|
| [1]
| https://gitlab.com/fwojcik/smhasher3/-/blob/main/results/raw...
| mrbluecoat wrote:
| The fastest 128-bit and 256-bit non-crypto hash, passes all
| tests, and under 140 source lines of code
|
| Nice!
___________________________________________________________________
(page generated 2024-12-13 23:00 UTC)