[HN Gopher] GxHash is a fast and robust non-cryptographic hashin...
___________________________________________________________________
GxHash is a fast and robust non-cryptographic hashing algorithm
Author : vlovich123
Score : 66 points
Date : 2024-05-13 15:42 UTC (7 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| vlovich123 wrote:
| Wanted to highlight it since it looks like it's a significant
| Pareto frontier jump from existing techniques.
| not_a_dane wrote:
| how do we know it is a robust one ?
| ted_dunning wrote:
| Did you read the article?
| rurban wrote:
| Haven't checked it in smhasher yet. just saw this, claiming to
| be the very best. Well, maybe a PR will come soon. The rust
| framework for hashes is prepared, there just haven't been any
| good rust hashes yet.
| llimllib wrote:
| i know nothing about this algorithm, but I went and checked
| that already; it does have a PR for smhasher:
| https://github.com/rurban/smhasher/pull/279
|
| The note in there that the output of the algorithm is
| different for different SIMD widths seems to indicate that
| this isn't so much an algorithm as a family of related
| algorithms
| vlovich123 wrote:
| That's a bit weird since it says on the front page that:
|
| > All generated hashes for a given version of GxHash are
| stable, meaning that for a given input the output hash will
| be the same across all supported platforms.
|
| The github repo is more recent so maybe it wasn't stable at
| the time they tried to get it into smhasher from last year?
| aidenn0 wrote:
| Looks like the original failed smhasher, since the PR
| modified the compress function
| ted_dunning wrote:
| The article quotes results from smhasher runs.
|
| It passes.
| AlotOfReading wrote:
| Honestly, the paper is pretty confusing. It's a weird mix of
| explaining things people reading a hash paper should already
| know (e.g. avalanche effect) and not explaining the important
| bits. The mixing stage looks a lot like a Davies-meyer hash,
| but it's hard to be sure without reading code. If so, this will
| be susceptible to fixed point attacks (among others). It
| doesn't particularly matter because this isn't presented as
| secure, but worth noting.
|
| The paper also doesn't exhaustively verify numbers when
| domain=codomain, which is a super common scenario and a weird
| omission. 32 bits is small enough to check everything.
| vlovich123 wrote:
| The hash internally is a 128bit hash & can output 32 or 64
| bit as well. Can you speak more to how fixed point attacks
| are relevant? The only reference I found online was in a
| cryptographic context which wouldn't be relevant here. Are
| other hashes like SipHash-1-3 (Rust stdlib hashmap) or XXH3
| susceptible to it?
| AlotOfReading wrote:
| A fixed point attack would allow creating collisions.
| Whether that's an issue or not depends on your use case,
| but the answer is going to be yes surprisingly often in my
| experience.
| vlovich123 wrote:
| I don't know about this at all, but ChatGPT suggests that
| this attack is thwarted by a secure compression function
| and block function. As you point out elsewhere, it uses
| AES for compression but not sure if that's enough. Do you
| know?
| AlotOfReading wrote:
| ChatGPT would be wrong in this case. It applies to (and
| was originally identified with) hash functions built from
| cryptographically secure hash functions. Note that this
| _isn 't_ as strong as AES. It's just built with some of
| the same machinery as AES, it doesn't carry over its
| guarantees.
|
| Still an interesting construction I'll play around with
| some more when I have time.
| singron wrote:
| Can you realistically create fixed-point collisions of a
| randomly seeded hash function? E.g. to defeat the DOS-
| resistance of a hash table? I think the papers on fixed-
| point attacks of Davies-Meyer hashes require selecting
| the IV or knowing the IV in order to carry out the
| attack.
| AlotOfReading wrote:
| Not sure where the IV would come in, but both the
| plaintext and the initial key seem to be taken directly
| from the input as the first two blocks in any
| "compression group". Let's say you want intermediate key
| X after the first two steps, the first 3 blocks should be
| something like m_2 = D(X, E(m_1, m_0)) because we're
| missing the xor step from davies-meyer.
|
| Let me know if I'm wrong though, strictly an amateur
| here.
| ltbarcly3 wrote:
| I would like to see the collision distribution on
| /usr/share/dict/words. I have found that 'fast hash algorithms'
| can have very unequal hash distributions on short strings.
| barkbyte wrote:
| Deleted
| glitchc wrote:
| AES is more than just sboxes. The multiple rounds (including
| XORs) are integral to collision resistance.
| zimpenfish wrote:
| Using
| https://gist.github.com/rjp/3bbf33911aaf9ed7a07be417adb4b8f1 on
| Arch Linux `words` 2.1-8 (which defaults to the `usa` word
| list) gives collisions for the 32 bit hash (but none for the 64
| bit or the 128 bit variants.) >
| ./target/debug/gxwords | cut -d' ' -f1 | sort | uniq -c | awk
| '$1>1' 2 28273056 2 8e264f8
| 2 f7bce777
|
| Which then gives us > ./target/debug/gxwords
| | egrep '28273056|8e264f8|f7bce777' | sort 28273056
| counterclaiming 28273056 oncogene's 8e264f8
| ingrates 8e264f8 toadstool's f7bce777 Carib's
| f7bce777 stemming
| bravura wrote:
| Strange that all three collisions include "'s"
| judofyr wrote:
| Considering that 43% of the words (in my dictionary at
| least) include "s" it's not so strange: $
| cat /usr/share/dict/words | wc -l 235976
| $ cat /usr/share/dict/words | grep s | wc -l
| 101882
| filleokus wrote:
| GP meant that all collisions contained 's - a possessive
| apostrophe (if I remember my grammar correctly?).
| oncogene's, toadstool's, Carib's.
|
| The /usr/share/dict/words on my macOS machine doesn't
| seem to have any 's words in it.
| vlovich123 wrote:
| counterclaiming Doesn't have any s and stemming doesn't
| have any possessive. I think drawing conclusions from a
| 32bit hash against a 123k word list is fruitless. You'd
| expect at least 1 collision due simply to birthday
| paradox.
| airstrike wrote:
| oncogene's <> counterclaiming ingrates <>
| toadstool's Carib's <> stemming
|
| all 3 pairs have one word ending in "'s", which I think
| is what the GGGP (?) was pointing out
| bravura wrote:
| Yes. But I didn't realize that 43% of the words in that
| dictionary contain "'s".
|
| In that case, the probability that, for 3 collisions,
| there is at least one element in every collision contains
| "'s" is 31%. That seems reasonable to me.
| airstrike wrote:
| but 43% of the words contain s, not 's
| zimpenfish wrote:
| > counterclaiming Doesn't have any s and stemming doesn't
| have any possessive.
|
| But "oncogene's" (collides with counterclaiming) and
| "Carib's" (collides with stemming) do.
|
| Also I did point out that the 64 and 128 bit variants
| have no collisions.
| vlovich123 wrote:
| Wouldn't that be true for any 32-bit hash? The birthday
| paradox would suggest you only need 2^16 input values to
| generate a collision right?
| zimpenfish wrote:
| xxhash32 does give a couple of collisions on the same file,
| yep. 734c8981 abloom 734c8981
| sating ef336a7d contravene ef336a7d
| seducers
| ted_dunning wrote:
| The article emulates that result by synthesizing words. The
| problem with using /user/dict/words is that there aren't enough
| to get a really good statistical estimate of collisions at more
| than very small scale.
|
| Also the core primitive is AES which should calm most such
| worries.
| vlovich123 wrote:
| Not to mention you should be using the 64-bit variant for
| collision resistance and the 128-bit variant for integrity.
| woodruffw wrote:
| It would be interesting to have AES itself (with the same number
| of rounds) in the performance comparison, given that GxHash
| appears to be built on top of AES.
| vlovich123 wrote:
| AES is only used for the final mixing from their paper. AHash
| is the pure AES hash using AES-NI.
| AlotOfReading wrote:
| Am I missing something? The paper indicates that AES-NI is
| used for both compression and mixing.
|
| If you look at the source code, it's using it in both parts
| too. Interestingly, there's a discrepancy between the paper
| and the code. The paper says the seed is passed to the first
| round of the 3-round finalizer/mixer, but the code has a
| separate 0th round that gets the seed instead.
| vlovich123 wrote:
| Probably in response to
| https://github.com/ogxd/gxhash/issues/25?
| ComputerGuru wrote:
| Is the output stable across machines in the same architecture
| (i.e. with various simd features enabled/disabled, compiled in
| 32-bit vs 64-bit modes, etc) and across architectures (x86_64 to
| aarch64)?
| fl0ki wrote:
| From the page itself:
|
| > All generated hashes for a given version of GxHash are
| stable, meaning that for a given input the output hash will be
| the same across all supported platforms.
| ComputerGuru wrote:
| Thanks, I missed that.
| msk-lywenn wrote:
| how does it fare against murmurhash?
|
| edit: I did the work myself as I should have in the first place,
| although I'm tired. % cargo bench --bench
| throughput Compiling gxhash v3.1.1
| (/Users/daniel/Developer/gxhash) Compiling fastmurmur3
| v0.2.0 Finished `bench` profile [optimized] target(s) in
| 1.03s Running benches/throughput/main.rs
| (target/release/ deps/throughput-c9580307c88ca541) gxhash
| | 4 > 3980.55 | 8 > 7961.11 | 16 > 15922.22
| | 32 > 17986.93 | 64 > 25083.88 | 128 > 24548.44
| | 256 > 32485.82 | 512 > 40386.43 | 1024 >
| 45980.96 | 2048 > 49396.93 | 4096 > 51314.47
| | 8192 > 52324.80 | 16384 > 51536.52 | 32768 >
| 52342.91 murmur3 | 4 > 547.33 | 8 >
| 910.5 1 | 16 > 1752.21 | 32 > 2676.02 |
| 64 > 3552.02 | 128 > 4045.35 | 256 > 3975.75
| | 512 > 3748.31 | 1024 > 3560.22 | 2048 > 3439.41
| | 4096 > 3411.45 | 8192 > 3389.27 | 16384 >
| 3376.18 | 32768 > 3379.46
| cornstalks wrote:
| What do the numbers on the right mean? Time? Throughput?
| ashvardanian wrote:
| It says "throughput" in the command line :)
| huhtenberg wrote:
| Just to be 100% fair, they may want to retest against xxHash3 for
| that "fastest" claim as it's ~2.5x faster than xxHash that they
| benched against.
| vlovich123 wrote:
| Their graph is just poorly labelled. It's against XX3. I even
| tried switching their benchmark harness from twox_hash to
| xxhash-rust which has SSE support - their benchmark harness
| doesn't show a major difference for xxhash-rust although maybe
| I somehow forgot to set native-cpu for the benchmark & it
| wasn't testing the right thing.
| Genbox wrote:
| I've done some security testing against this hash function. Does
| anyone want to follow up and see if they can extend the attacks I
| describe[1]?
|
| [1]https://github.com/ogxd/gxhash/issues/25
| Quarrel wrote:
| I may well be missing something, but isn't the point of the
| title:
|
| "GxHash is a fast and robust non-cryptographic hashing
| algorithm"
|
| Exactly that I should use it when I want a fast hash, but not
| when I want a robust cryptographic hash? In which case I can
| ignore your attack? (Not to discount your work, I'm just trying
| to understand the scope here)
|
| So I can use it in my hash-map, or whatever O(1) lookup, but I
| shouldn't use it in my rewrite of SSL?
| TacticalCoder wrote:
| > So I can use it in my hash-map, or whatever O(1) lookup...
|
| BTW in the past (20 years ago?) there have been attacks on
| non-cryptographic hashes used for hash maps: denial of
| service by creating crafted requests to HTTP servers... The
| parameteres would be picked maliciously and would be all
| ending in the same "buckets". That attack worked on both Java
| and PHP servers. IIRC it's been solved by adding random
| seeding (and I noticed that GxHash mentions it's seedable:
| not saying it'd counter every attack but it's already
| something).
___________________________________________________________________
(page generated 2024-05-13 23:01 UTC)