[HN Gopher] GxHash is a fast and robust non-cryptographic hashin...
       ___________________________________________________________________
        
       GxHash is a fast and robust non-cryptographic hashing algorithm
        
       Author : vlovich123
       Score  : 66 points
       Date   : 2024-05-13 15:42 UTC (7 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | vlovich123 wrote:
       | Wanted to highlight it since it looks like it's a significant
       | Pareto frontier jump from existing techniques.
        
         | not_a_dane wrote:
         | how do we know it is a robust one ?
        
           | ted_dunning wrote:
           | Did you read the article?
        
         | rurban wrote:
         | Haven't checked it in smhasher yet. just saw this, claiming to
         | be the very best. Well, maybe a PR will come soon. The rust
         | framework for hashes is prepared, there just haven't been any
         | good rust hashes yet.
        
           | llimllib wrote:
           | i know nothing about this algorithm, but I went and checked
           | that already; it does have a PR for smhasher:
           | https://github.com/rurban/smhasher/pull/279
           | 
           | The note in there that the output of the algorithm is
           | different for different SIMD widths seems to indicate that
           | this isn't so much an algorithm as a family of related
           | algorithms
        
             | vlovich123 wrote:
             | That's a bit weird since it says on the front page that:
             | 
             | > All generated hashes for a given version of GxHash are
             | stable, meaning that for a given input the output hash will
             | be the same across all supported platforms.
             | 
             | The github repo is more recent so maybe it wasn't stable at
             | the time they tried to get it into smhasher from last year?
        
             | aidenn0 wrote:
             | Looks like the original failed smhasher, since the PR
             | modified the compress function
        
           | ted_dunning wrote:
           | The article quotes results from smhasher runs.
           | 
           | It passes.
        
         | AlotOfReading wrote:
         | Honestly, the paper is pretty confusing. It's a weird mix of
         | explaining things people reading a hash paper should already
         | know (e.g. avalanche effect) and not explaining the important
         | bits. The mixing stage looks a lot like a Davies-meyer hash,
         | but it's hard to be sure without reading code. If so, this will
         | be susceptible to fixed point attacks (among others). It
         | doesn't particularly matter because this isn't presented as
         | secure, but worth noting.
         | 
         | The paper also doesn't exhaustively verify numbers when
         | domain=codomain, which is a super common scenario and a weird
         | omission. 32 bits is small enough to check everything.
        
           | vlovich123 wrote:
           | The hash internally is a 128bit hash & can output 32 or 64
           | bit as well. Can you speak more to how fixed point attacks
           | are relevant? The only reference I found online was in a
           | cryptographic context which wouldn't be relevant here. Are
           | other hashes like SipHash-1-3 (Rust stdlib hashmap) or XXH3
           | susceptible to it?
        
             | AlotOfReading wrote:
             | A fixed point attack would allow creating collisions.
             | Whether that's an issue or not depends on your use case,
             | but the answer is going to be yes surprisingly often in my
             | experience.
        
               | vlovich123 wrote:
               | I don't know about this at all, but ChatGPT suggests that
               | this attack is thwarted by a secure compression function
               | and block function. As you point out elsewhere, it uses
               | AES for compression but not sure if that's enough. Do you
               | know?
        
               | AlotOfReading wrote:
               | ChatGPT would be wrong in this case. It applies to (and
               | was originally identified with) hash functions built from
               | cryptographically secure hash functions. Note that this
               | _isn 't_ as strong as AES. It's just built with some of
               | the same machinery as AES, it doesn't carry over its
               | guarantees.
               | 
               | Still an interesting construction I'll play around with
               | some more when I have time.
        
               | singron wrote:
               | Can you realistically create fixed-point collisions of a
               | randomly seeded hash function? E.g. to defeat the DOS-
               | resistance of a hash table? I think the papers on fixed-
               | point attacks of Davies-Meyer hashes require selecting
               | the IV or knowing the IV in order to carry out the
               | attack.
        
               | AlotOfReading wrote:
               | Not sure where the IV would come in, but both the
               | plaintext and the initial key seem to be taken directly
               | from the input as the first two blocks in any
               | "compression group". Let's say you want intermediate key
               | X after the first two steps, the first 3 blocks should be
               | something like m_2 = D(X, E(m_1, m_0)) because we're
               | missing the xor step from davies-meyer.
               | 
               | Let me know if I'm wrong though, strictly an amateur
               | here.
        
       | ltbarcly3 wrote:
       | I would like to see the collision distribution on
       | /usr/share/dict/words. I have found that 'fast hash algorithms'
       | can have very unequal hash distributions on short strings.
        
         | barkbyte wrote:
         | Deleted
        
           | glitchc wrote:
           | AES is more than just sboxes. The multiple rounds (including
           | XORs) are integral to collision resistance.
        
         | zimpenfish wrote:
         | Using
         | https://gist.github.com/rjp/3bbf33911aaf9ed7a07be417adb4b8f1 on
         | Arch Linux `words` 2.1-8 (which defaults to the `usa` word
         | list) gives collisions for the 32 bit hash (but none for the 64
         | bit or the 128 bit variants.)                   >
         | ./target/debug/gxwords | cut -d' ' -f1 | sort | uniq -c | awk
         | '$1>1'               2 28273056               2 8e264f8
         | 2 f7bce777
         | 
         | Which then gives us                   > ./target/debug/gxwords
         | | egrep '28273056|8e264f8|f7bce777' | sort         28273056
         | counterclaiming         28273056 oncogene's         8e264f8
         | ingrates         8e264f8 toadstool's         f7bce777 Carib's
         | f7bce777 stemming
        
           | bravura wrote:
           | Strange that all three collisions include "'s"
        
             | judofyr wrote:
             | Considering that 43% of the words (in my dictionary at
             | least) include "s" it's not so strange:                   $
             | cat /usr/share/dict/words | wc -l           235976
             | $ cat /usr/share/dict/words | grep s |  wc -l
             | 101882
        
               | filleokus wrote:
               | GP meant that all collisions contained 's - a possessive
               | apostrophe (if I remember my grammar correctly?).
               | oncogene's, toadstool's, Carib's.
               | 
               | The /usr/share/dict/words on my macOS machine doesn't
               | seem to have any 's words in it.
        
               | vlovich123 wrote:
               | counterclaiming Doesn't have any s and stemming doesn't
               | have any possessive. I think drawing conclusions from a
               | 32bit hash against a 123k word list is fruitless. You'd
               | expect at least 1 collision due simply to birthday
               | paradox.
        
               | airstrike wrote:
               | oncogene's <> counterclaiming              ingrates <>
               | toadstool's              Carib's <> stemming
               | 
               | all 3 pairs have one word ending in "'s", which I think
               | is what the GGGP (?) was pointing out
        
               | bravura wrote:
               | Yes. But I didn't realize that 43% of the words in that
               | dictionary contain "'s".
               | 
               | In that case, the probability that, for 3 collisions,
               | there is at least one element in every collision contains
               | "'s" is 31%. That seems reasonable to me.
        
               | airstrike wrote:
               | but 43% of the words contain s, not 's
        
               | zimpenfish wrote:
               | > counterclaiming Doesn't have any s and stemming doesn't
               | have any possessive.
               | 
               | But "oncogene's" (collides with counterclaiming) and
               | "Carib's" (collides with stemming) do.
               | 
               | Also I did point out that the 64 and 128 bit variants
               | have no collisions.
        
           | vlovich123 wrote:
           | Wouldn't that be true for any 32-bit hash? The birthday
           | paradox would suggest you only need 2^16 input values to
           | generate a collision right?
        
             | zimpenfish wrote:
             | xxhash32 does give a couple of collisions on the same file,
             | yep.                   734c8981 abloom         734c8981
             | sating         ef336a7d contravene         ef336a7d
             | seducers
        
         | ted_dunning wrote:
         | The article emulates that result by synthesizing words. The
         | problem with using /user/dict/words is that there aren't enough
         | to get a really good statistical estimate of collisions at more
         | than very small scale.
         | 
         | Also the core primitive is AES which should calm most such
         | worries.
        
           | vlovich123 wrote:
           | Not to mention you should be using the 64-bit variant for
           | collision resistance and the 128-bit variant for integrity.
        
       | woodruffw wrote:
       | It would be interesting to have AES itself (with the same number
       | of rounds) in the performance comparison, given that GxHash
       | appears to be built on top of AES.
        
         | vlovich123 wrote:
         | AES is only used for the final mixing from their paper. AHash
         | is the pure AES hash using AES-NI.
        
           | AlotOfReading wrote:
           | Am I missing something? The paper indicates that AES-NI is
           | used for both compression and mixing.
           | 
           | If you look at the source code, it's using it in both parts
           | too. Interestingly, there's a discrepancy between the paper
           | and the code. The paper says the seed is passed to the first
           | round of the 3-round finalizer/mixer, but the code has a
           | separate 0th round that gets the seed instead.
        
             | vlovich123 wrote:
             | Probably in response to
             | https://github.com/ogxd/gxhash/issues/25?
        
       | ComputerGuru wrote:
       | Is the output stable across machines in the same architecture
       | (i.e. with various simd features enabled/disabled, compiled in
       | 32-bit vs 64-bit modes, etc) and across architectures (x86_64 to
       | aarch64)?
        
         | fl0ki wrote:
         | From the page itself:
         | 
         | > All generated hashes for a given version of GxHash are
         | stable, meaning that for a given input the output hash will be
         | the same across all supported platforms.
        
           | ComputerGuru wrote:
           | Thanks, I missed that.
        
       | msk-lywenn wrote:
       | how does it fare against murmurhash?
       | 
       | edit: I did the work myself as I should have in the first place,
       | although I'm tired.                 % cargo bench --bench
       | throughput        Compiling gxhash v3.1.1
       | (/Users/daniel/Developer/gxhash)        Compiling fastmurmur3
       | v0.2.0         Finished `bench` profile [optimized] target(s) in
       | 1.03s          Running benches/throughput/main.rs
       | (target/release/ deps/throughput-c9580307c88ca541)       gxhash
       | | 4 > 3980.55         | 8 > 7961.11         | 16 > 15922.22
       | | 32 > 17986.93         | 64 > 25083.88         | 128 > 24548.44
       | | 256 > 32485.82         | 512 > 40386.43         | 1024 >
       | 45980.96         | 2048 > 49396.93         | 4096 > 51314.47
       | | 8192 > 52324.80         | 16384 > 51536.52         | 32768 >
       | 52342.91             murmur3         | 4 > 547.33          | 8 >
       | 910.5 1         | 16 > 1752.21         | 32 > 2676.02         |
       | 64 > 3552.02         | 128 > 4045.35         | 256 > 3975.75
       | | 512 > 3748.31         | 1024 > 3560.22         | 2048 > 3439.41
       | | 4096 > 3411.45         | 8192 > 3389.27         | 16384 >
       | 3376.18         | 32768 > 3379.46
        
         | cornstalks wrote:
         | What do the numbers on the right mean? Time? Throughput?
        
           | ashvardanian wrote:
           | It says "throughput" in the command line :)
        
       | huhtenberg wrote:
       | Just to be 100% fair, they may want to retest against xxHash3 for
       | that "fastest" claim as it's ~2.5x faster than xxHash that they
       | benched against.
        
         | vlovich123 wrote:
         | Their graph is just poorly labelled. It's against XX3. I even
         | tried switching their benchmark harness from twox_hash to
         | xxhash-rust which has SSE support - their benchmark harness
         | doesn't show a major difference for xxhash-rust although maybe
         | I somehow forgot to set native-cpu for the benchmark & it
         | wasn't testing the right thing.
        
       | Genbox wrote:
       | I've done some security testing against this hash function. Does
       | anyone want to follow up and see if they can extend the attacks I
       | describe[1]?
       | 
       | [1]https://github.com/ogxd/gxhash/issues/25
        
         | Quarrel wrote:
         | I may well be missing something, but isn't the point of the
         | title:
         | 
         | "GxHash is a fast and robust non-cryptographic hashing
         | algorithm"
         | 
         | Exactly that I should use it when I want a fast hash, but not
         | when I want a robust cryptographic hash? In which case I can
         | ignore your attack? (Not to discount your work, I'm just trying
         | to understand the scope here)
         | 
         | So I can use it in my hash-map, or whatever O(1) lookup, but I
         | shouldn't use it in my rewrite of SSL?
        
           | TacticalCoder wrote:
           | > So I can use it in my hash-map, or whatever O(1) lookup...
           | 
           | BTW in the past (20 years ago?) there have been attacks on
           | non-cryptographic hashes used for hash maps: denial of
           | service by creating crafted requests to HTTP servers... The
           | parameteres would be picked maliciously and would be all
           | ending in the same "buckets". That attack worked on both Java
           | and PHP servers. IIRC it's been solved by adding random
           | seeding (and I noticed that GxHash mentions it's seedable:
           | not saying it'd counter every attack but it's already
           | something).
        
       ___________________________________________________________________
       (page generated 2024-05-13 23:01 UTC)