[HN Gopher] Hashing
       ___________________________________________________________________
        
       Hashing
        
       Author : signa11
       Score  : 253 points
       Date   : 2023-06-20 09:48 UTC (13 hours ago)
        
 (HTM) web link (samwho.dev)
 (TXT) w3m dump (samwho.dev)
        
       | sudobash1 wrote:
       | Off-topic, but it just occurred to me that the "dialog" in this
       | article between the author and the "Haskie" dog is a very
       | traditional way of writing pedagogical works.
       | 
       | Some historical examples of books written in this way are
       | Galileo's "Dialogue Concerning the Two Chief World Systems" (the
        | one that landed him in hot water), and "The Study of
        | Counterpoint" by Johann Joseph Fux (published in 1725).
       | 
       | A small excerpt from "The Study of Counterpoint":
       | 
       | >Joseph.-- Why did you leave out B between A and C?
       | 
       | > Aloys.-- Because it has no perfect fifth and therefore cannot
       | be the final of a mode -- which we shall discuss more fully in
       | its proper place.
       | 
       | Both of these books are written as an entire discussion or
       | argument, as was commonplace in teaching books. I honestly often
       | find myself disliking it, but I think it is a great way to learn
       | for many people.
        
         | samwho wrote:
         | I didn't know this. I created Haskie during the writing of my
         | Memory Allocation post because I found myself trying to pre-
         | empt reader questions a lot and it was weird to read. Creating
         | a proxy for the reader in Haskie made it flow a lot better.
         | 
         | I've noticed in this post that a 2nd character that's a proxy
         | for an "expert" in a topic would also be handy. Taking
         | suggestions for a good dog breed to represent this character.
        
        | ComputerGuru wrote:
        | Great intro to hashing, but if the concept of murmur3's seed is
        | going to be brought up, I think it's only natural to mention its
        | design limitations and why you still need something actually
        | DoS-resistant like SipHash (even if we don't get into the
        | details of the latter).
        | 
        | Also, it's important to distinguish between hashing, hashing,
        | and hashing. That is, hashing to map input to buckets, hashing
        | to obscure the original string, and hashing to find similarities
        | in the input data. They're all called hashing, but they have
        | different (and conflicting!) requirements. There's a reason you
        | want murmur3 to be fast but scrypt to be slow, and a reason why
        | you want murmur3 to avalanche but certainly don't want your
        | perceptual hashing algo of choice to do the same.
        
         | Solvency wrote:
         | One thing I wish the original article would explain is the use
         | case of buckets. I'm sort of imagining weird use cases where
         | the buckets represent different servers or storage systems for
         | the purpose of spreading out data efficiently...but I'm totally
         | guessing here. Because I would assume a proper load balancer
         | would be based on the dynamics of real-time server performance,
          | not because of some simple hash function. What are common
          | hash-to-bucket use cases?
        
           | samwho wrote:
           | The buckets, in the context of hash maps, are typically
           | lists. Either arrays or linked lists. They're used to store
           | the key-value pairs.
           | 
           | Some load balancers do use hashing in much the same way hash
           | maps do. Usually they'll take a combination of: source IP,
           | source port, destination IP, destination port, and hash it.
           | They'll then use that hash to pick a server. The practical
           | impact of this is that each user always gets mapped to the
           | same server. This is typically called "session sticky load
           | balancing" because it means session information about that
           | user can live on the server, safe in the knowledge that the
           | user will always end up on that server and not get routed to
           | any others.
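
To make the "session sticky" idea concrete, here is a minimal sketch in
JavaScript. The server names and the toy hash are made up for
illustration; a real load balancer would hash the tuple with something
like murmur3:

```javascript
// Toy string hash, standing in for a real one like murmur3.
function toyHash(s) {
  let h = 0;
  for (const c of s) h = (Math.imul(h, 31) + c.charCodeAt(0)) >>> 0;
  return h;
}

const servers = ["backend-1", "backend-2", "backend-3"]; // hypothetical pool

// Hash the connection 4-tuple to pick a server. The same tuple always
// maps to the same index, which is what makes the session "sticky".
function pickServer(srcIp, srcPort, dstIp, dstPort) {
  const key = `${srcIp}:${srcPort}->${dstIp}:${dstPort}`;
  return servers[toyHash(key) % servers.length];
}
```

Because the choice depends only on the hash of the tuple, repeated
requests from the same connection land on the same backend.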
        
            | koromak wrote:
            | The idea of "buckets" here is purely to speed up search
            | time, at a very low level. Don't think about it in the
            | context of a web server or database storage; it's way more
            | basic than that. It's literally just arrays in memory.
           | 
            | The idea is: if you don't have a Hash or Map already
            | implemented in your language, how would you build a fast
            | one? You can't write my_object['my_key']; that doesn't
            | exist, you don't have key-value storage. You instead need to
            | somehow store those pieces of information, and find them
            | later.
           | 
           | Obviously, you could just stick every value inside one big
           | array. Then when you call MyHash.get('key'), you simply do an
           | array search. But that would be slow.
           | 
           | Instead, you can hash the 'key', stick it into a smaller
           | bucket based on the hash, and then more quickly search for it
           | later. In the future, you know the hash of 'key', so you know
           | which bucket to look in.
           | 
           | The author does make it confusing, since in their example
           | each bucket contains Entry (entry['value']), meaning they are
           | already using a JS HashMap implementation in their rebuilding
           | of a HashMap, but you could rewrite the example to do it
           | without any objects. The code would be harder to read though.
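
A minimal sketch of the bucketed map being described, in JavaScript,
using 2-element [key, value] arrays rather than objects (illustrative
only, not the article's actual code):

```javascript
class SimpleMap {
  constructor(numBuckets = 16) {
    // Each bucket is a plain array of [key, value] pairs.
    this.buckets = Array.from({ length: numBuckets }, () => []);
  }

  // Toy string hash; a real implementation would use murmur3 or similar.
  _bucketFor(key) {
    let h = 0;
    for (const c of key) h = (Math.imul(h, 31) + c.charCodeAt(0)) >>> 0;
    return this.buckets[h % this.buckets.length];
  }

  set(key, value) {
    const bucket = this._bucketFor(key);
    for (const pair of bucket) {
      if (pair[0] === key) { pair[1] = value; return; } // overwrite
    }
    bucket.push([key, value]);
  }

  get(key) {
    // Only one small bucket is searched, not the whole collection.
    for (const pair of this._bucketFor(key)) {
      if (pair[0] === key) return pair[1];
    }
    return undefined;
  }
}
```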
        
             | samwho wrote:
             | This is my eternal struggle: be correct, or be simple.
             | 
              | I wrote a version that used 2-element arrays, but the
              | code became more dense and I worried about losing people.
              | I was hoping that the ease of translating it to not use
              | objects would give me a pass here, but apparently not :D
        
         | aappleby wrote:
          | If I ever write Murmur4, it's going to be the smallest
          | function that I can prove (via SAT solver) to have no
          | seed-independent collisions :D
        
           | ComputerGuru wrote:
            | I approve of the use of relative metrics, Austin! Just be
            | prepared for murmur4 to be a few megabytes in size ;)
        
         | samwho wrote:
         | I did have the distinction between cryptographic and non-
         | cryptographic in there originally but found when junior folks
         | read it they'd get confused about which use-case was being
         | discussed. So I decided to focus on just one. With more time, I
         | would have liked to cover at least scrypt.
         | 
         | I'll be honest, I don't know what the design limitations of the
         | seeding in murmur3 are. What I wanted to show is the concept of
         | seeding and what it's there to prevent. I'm hoping that comes
         | across, even without any deeper exploration of seeding.
        
           | ComputerGuru wrote:
           | Seeding is kind of a hack. It doesn't guarantee you'll avoid
           | a (or even the same) collision, and quite a few "popular"
           | hashes only change the output (but don't change whether or
           | not there is a collision) when you change the seed.
           | 
           | Thanks for the article, though!
        
             | samwho wrote:
             | Ahh I see, the limitation you're talking about is that
             | there isn't any inherent guarantee that a different seed
             | will mean two values no longer collide? It's just likely
             | (in the case of murmur3 at least).
        
               | ComputerGuru wrote:
               | Yes, but it's not just that there's only "a chance" that
               | the collision is avoided - that would be enough if it was
               | actually just a random probability (after all, everything
               | about hashing and collisions - in the best textbook case
               | - is just chance). The problem is that there are methods
               | for obtaining what we call "seed-independent collisions"
               | where by analyzing the hashing algorithm itself you can
               | actually determine a priori values that will collide
               | regardless of the seed.
               | 
               | If you have half an hour to spare, I really recommend you
               | take the time to read this:
               | http://emboss.github.io/blog/2012/12/14/breaking-murmur-
               | hash...
        
               | samwho wrote:
               | Oh damn, I didn't know that. Have bookmarked that post,
               | thank you so much for your comments. I've been very
               | fortunate in my writing that the comment sections are
               | always full of people with awesome insight that I missed
               | during my own reading on the topic.
        
       | edvards wrote:
       | Amazingly well done article, thanks for sharing and making it!
        
         | samwho wrote:
         | Thank you! I absolutely love making these, and have plans for
         | several more.
        
       | mcdonje wrote:
       | Between Sam and Bartosz, I think it's time to bite the bullet and
       | get an RSS reader set up. Anyone got any good app recommendations
       | for Android?
        
         | samwho wrote:
         | To be mentioned in the same breath as Bartosz Ciechanowski is
         | an incredible compliment, thank you <3
         | 
         | I will say up front: my posts don't read well in an RSS reader.
         | Sorry about that. But at least you'll get a notification when
         | new ones come out.
        
       | marginalia_nu wrote:
        | > The Avalanche Effect
       | 
       | > Another way hash functions get evaluated is on something called
       | the "avalanche effect." This refers to how many bits in the
       | output value change when just a single bit of the input changes.
       | To say that a hash function has a good avalanche effect, a single
        | bit flip in the input should result in an average of 50% of the
        | output bits flipping.
       | 
        | I think it's important to note that this isn't a property that
        | is necessary for a hash function. There are hash functions that
        | deliberately try to minimize the avalanche effect, such as in
        | locality-sensitive hashing, which is another form of hash
        | function with different use cases that are also very cool.
        | 
        | You can use LSHs to, for example, remove near-duplicate search
        | results in a search engine without actually having to compare
        | the texts, even tolerating single-word differences; or to help
        | with nearest neighbor searches.
       | 
        | A lot of what's written about hash functions tends to assume you
        | want to use them for whatever thing the author has in mind. It's
        | not a unique fault of the author -- many textbooks have the same
        | problem -- adding all these properties that are accidental to
        | what a hash function actually is. It's really just a mapping
        | function to a fixed-width representation with an even
        | distribution across the domain.
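
The contrast is easy to demonstrate. Assuming the article's stringSum
is, as the name suggests, just a sum of character codes (a guess at its
implementation), a one-bit change in the input barely moves the output:

```javascript
// A guess at the article's toy hash: the sum of character codes.
function stringSum(s) {
  let h = 0;
  for (const c of s) h = (h + c.charCodeAt(0)) >>> 0;
  return h;
}

// Count how many bits differ between two 32-bit values.
function bitsChanged(a, b) {
  let x = (a ^ b) >>> 0, n = 0;
  while (x) { n += x & 1; x >>>= 1; }
  return n;
}

// 'h' (0x68) and 'i' (0x69) differ by exactly one bit, and only one
// output bit changes - the opposite of a good avalanche effect.
bitsChanged(stringSum("hello"), stringSum("iello")); // → 1
```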
        
         | quickthrower2 wrote:
         | So technically truncating and then padding is a hash function
        
           | [deleted]
        
           | rurban wrote:
           | Division by size is also very popular. 1:1 mapping ditto
        
         | Solvency wrote:
         | Is there a world in which hashes could be used as an
         | alternative to word2vec style vector embeddings? Where you
         | basically convert text inputs to an ideal fixed width
         | representation and then calculate the "similarity" between them
         | to derive some kind of meaning?
        
           | marginalia_nu wrote:
           | Adjacent algorithms are used all over that space.
           | 
            | In general, random projection, which is used for dimension
            | reduction in vector spaces, can also be used to construct
            | LSHs, for example.
        
         | dahart wrote:
         | > You can use LSHs [...] to help with nearest neighbor
         | searches.
         | 
          | Yes! A good example of this in graphics is a spatial nearest
          | neighbor search. Space-filling curves like the Morton and
          | Hilbert curves are often used as hash functions that can
          | preserve some amount of locality for 2-D / 3-D / N-D points. A
          | locality-sensitive hash function can basically be used to
         | provide a sort key for data that doesn't otherwise have a key,
         | while increasing the odds that any two keys that are similar
         | point to data that are also similar. This is useful on GPUs in
         | order to improve coherence/performance of threads operating on
         | spatial points.
         | 
         | https://en.wikipedia.org/wiki/Z-order_curve
         | 
         | https://en.wikipedia.org/wiki/Hilbert_curve
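
A sketch of a 2-D Morton (Z-order) encoder, to show what a
locality-preserving "hash" looks like in code:

```javascript
// Spread the low 16 bits of n apart so that bit k moves to bit 2k.
function part1by1(n) {
  n &= 0xffff;
  n = (n | (n << 8)) & 0x00ff00ff;
  n = (n | (n << 4)) & 0x0f0f0f0f;
  n = (n | (n << 2)) & 0x33333333;
  n = (n | (n << 1)) & 0x55555555;
  return n;
}

// Interleave the bits of x and y into a single Morton code. Points that
// are close together in 2-D tend to get numerically close codes, which
// makes the code usable as a locality-preserving sort key.
function morton2d(x, y) {
  return ((part1by1(y) << 1) | part1by1(x)) >>> 0;
}

morton2d(0, 0); // → 0
morton2d(3, 3); // → 15 (binary 1111: x and y bits interleaved)
```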
        
           | Solvency wrote:
           | The OP mentioned murmur3 due to its speed and high avalanche
           | effect.
           | 
           | So what's the best LSH function for this use case?
        
             | dahart wrote:
              | Hi, I'm not sure I understand the question. Avalanche and
              | locality sensitivity are opposing goals, so typically the
              | more avalanche, the less locality-sensitive, and vice
              | versa. You could even state the avalanche goal as being
              | locality-insensitive - the whole idea of avalanche is that
              | one bit change in the input produces a maximally different
              | and uniformly distributed output.
        
               | Solvency wrote:
                | I'm basically asking for the most efficient and
                | unavalanched hash map library recommended for this use
                | case.
        
               | marginalia_nu wrote:
               | I don't think you're gonna get one-size-fits-all answers
               | the same way you will with avalanching hash functions.
               | They can be one-size-fits-all because the requirement is
               | that there is no (discernible) correlation between the
               | input and output.
               | 
               | Locality-sensitive hashes depend on what sort of thing
               | you are hashing. Is the data ordered, is it a sequence,
                | is it points in space? The requirement is that
                | similarity in the input, under one metric, implies
                | similarity in the output under another metric.
               | 
               | There are no universal locality sensitive hash functions
               | because the function depends on in what sense you're
               | trying to preserve locality.
        
               | dahart wrote:
               | Ah, got it. I don't even know of one, TBH, let alone have
               | a best one in mind. Every time I've used an LSH, the
               | hashing code has been ad-hoc and tailored to pretty
               | specific data and optimization goals. (An added layer of
               | difficulty is that "locality" isn't universally defined.)
               | 
               | I'm not sure but it might be simplest to just look for a
               | decent Map library in your favorite language, and keep
               | the hashing part separate. It's generally very easy to
               | insert your own hash transform wherever you need, whether
               | it's on the fly, or saved as a sort key in memory. This
               | way all you would need is to find a Morton order
               | function, for example, if you want to hash with a z-order
               | curve.
        
           | marginalia_nu wrote:
           | Yeah there's a ton of really cool things you can do with
           | LSHs. They're really overlooked outside of a few niches.
        
         | samwho wrote:
         | I didn't know that, but it makes perfect sense after reading
         | your explanation of it. Thank you, TIL!
        
           | marginalia_nu wrote:
           | If you're interested, I built a stupid simple deduplication
           | algorithm[1]. It ironically uses the avalanche effect of
           | another hash function to implement random projection onto a
           | locality preserving bit vector. It sounds like real computer
           | science bingo but it's actually surprisingly simple and it
           | works much better than it has any business doing.
           | 
           | [1] https://github.com/MarginaliaSearch/MarginaliaSearch/blob
           | /ma...
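
The general trick - using an ordinary avalanching hash to build a
locality-preserving signature - can be sketched as a simhash-style
function. This is an illustration of the idea, not Marginalia's actual
code; FNV-1a stands in for whatever hash is used in practice:

```javascript
// FNV-1a, standing in for any ordinary avalanching string hash.
function hash32(s) {
  let h = 0x811c9dc5;
  for (const c of s) {
    h ^= c.charCodeAt(0);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// Each signature bit is the sign of a sum of +1/-1 votes, one vote per
// word. Word sets that mostly overlap agree on most positions, so
// similar documents end up with signatures at a small Hamming distance.
function simhash(words, bits = 16) {
  const acc = new Array(bits).fill(0);
  for (const w of words) {
    const h = hash32(w);
    for (let i = 0; i < bits; i++) acc[i] += ((h >>> i) & 1) ? 1 : -1;
  }
  let sig = 0;
  for (let i = 0; i < bits; i++) if (acc[i] > 0) sig |= 1 << i;
  return sig >>> 0;
}
```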
        
         | mxkopy wrote:
         | I feel like there's a more general concept of "bijective index"
          | that subsumes LSHs and cryptographic hashes. When I think of a
          | hash I usually think of modulos, randomness, etc., when in
          | general they're just ways of creating a set that's 1:1 to your
          | inputs.
        
       | aappleby wrote:
       | Murmur3 author here, very nice writeup.
        
         | samwho wrote:
         | Thank you! That puts to bed some of my anxieties around whether
          | I've misrepresented the qualities of murmur3.
        
       | thecoppinger wrote:
       | This guide, just like your others, gave me the gift of several
       | 'ah-ha' moments that not only helped hash maps click, but also
       | several other linked concepts.
       | 
       | Thank you so much for making the time and effort to create such
       | quality content.
       | 
       | Please keep producing more!
        
       | anayak wrote:
       | Fantastic article! Love the visualizations and your explanations
       | are concise and very clear.
        
         | samwho wrote:
         | Thank you! I'm really glad you liked it.
        
       | vlovich123 wrote:
        | I'm surprised xxhash wasn't mentioned. Isn't it faster and more
        | collision-resistant than murmur?
        | 
        | I think I read a very good website on another hash proposal that
        | can supersede xxhash (roughly similar performance, but more
        | methodical in construction, and with other nice properties), but
        | I can't recall what it's called (it's not in the SMHasher tests
        | yet, if I recall correctly).
        
         | samwho wrote:
         | This was purely due to familiarity. I knew murmur3 was
         | considered a good, current hashing algorithm. I was able to
         | find a good implementation of it in JS, so I didn't really look
         | any further.
         | 
         | Thanks for exposing me to xxhash! I'll store that away as an
         | alternative to murmur3 if I ever need one in future. :)
        
       | koromak wrote:
        | It's kind of confusing that the author uses JS objects (aka
        | hashes) as entries inside the HashMap he's building. I think
        | building the concept of a hash from the ground up would be more
        | illuminating.
        
         | samwho wrote:
         | I agonised over this. I did originally have a version that used
         | 2-element arrays instead, but the resulting code felt more
         | dense and hard to follow.
         | 
         | I figured that people who know enough to point this out already
         | know what's up. People who don't will benefit from the code
         | being easier to read.
        
       | arek_nawo wrote:
       | That's a good, visual primer! Congrats.
        
         | samwho wrote:
         | Thank you <3
        
       | michael_leachim wrote:
        | Cool, I enjoyed reading the article and playing with the
        | visualizations. I liked it a lot.
        | 
        | How can I learn to design a hash function? It is possible to
        | understand that stringSum is bad compared to murmur3 by
        | evaluating it against test cases, but what properties make it
        | bad? Is it the summation, compared to the xoring in murmur3? I
        | intuit that summation is kinda lossy, but of course there is
        | much more rigorous work put into it. It would be really cool to
        | learn more about this. Thank you!
        
         | samwho wrote:
         | I'm actually not sure, to be honest. The post explicitly
         | doesn't touch on the implementation of murmur3, and part of the
         | reason is that while I can see what it's doing, I have no idea
         | why it's doing it.
         | 
         | The initial announcement of murmur
         | (https://tanjent.livejournal.com/756623.html) makes it seem
         | like it's trial and error.
        
           | aappleby wrote:
           | Murmur author here, it is a lot of trial and error but in
           | general you're trying to pack as much "nonlinearity" as
           | possible into as few instructions as possible. Multiplication
           | is commutative, xor is commutative, but if you mix the two
           | together you get functions that are strongly non-linear (in
           | the algebra sense) because the mathematical "spaces" of
           | integers and bits aren't connected together the same way.
           | Ensuring that the nonlinearity applies to all the bits
           | roughly equally is hard and requires trial and error.
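
For a concrete example of this multiply/xor mixing, here is murmur3's
32-bit finalizer, with constants from the public-domain MurmurHash3
source, written in JavaScript (Math.imul keeps the multiplications in
32-bit integer arithmetic):

```javascript
// murmur3's fmix32: alternating xor-shifts and multiplies.
function fmix32(h) {
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0;
}
```

Neither step mixes well on its own: the multiplies are linear over the
integers and the xor-shifts are linear over bit vectors, but alternating
the two is not linear in either sense. Each step is also invertible, so
the finalizer is a bijection on 32-bit values and cannot introduce new
collisions.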
        
         | moonchild wrote:
          | Perhaps look at some of what Paul Khuong has written about
         | umash; it is dense, but good, and includes good references.
        
         | dahart wrote:
         | The famous Bob Jenkins has lots of info on hash function
         | design:
         | 
         | http://www.burtleburtle.net/bob/index.html
         | 
         | http://www.burtleburtle.net/bob/hash/index.html
        
       | mpartel wrote:
        | Excellent work, again! Your intros somehow bring out the best in
        | HN too: the comments are always full of interesting angles and
        | variations.
       | 
       | Wanna do quaternions next? :) Many have tried..
        
         | samwho wrote:
         | There's a name I haven't heard for a long time.
         | 
         | Thank you, sir! As soon as I can reliably spell "quaternions" I
         | will think about trying it.
         | 
         | Funnily, one of the ideas I have cooking in my head is going to
         | require a foray into 3D.
        
       | [deleted]
        
       | d--b wrote:
       | Good stuff.
       | 
       | I'd perhaps add a paragraph on bcrypt and why it must be slow.
        
       | photochemsyn wrote:
       | Very clear explanation! There's also a famous answer on SO that
       | goes into collisions and has similar visual maps for viewing how
       | random the hashing distributions are for a variety of different
       | non-cryptographic hashing algorithms (top answer):
       | 
       | https://softwareengineering.stackexchange.com/questions/4955...
        
         | samwho wrote:
         | I saw this during the writing of the post! Extremely cool, I
         | mention it a little in my behind-the-scenes post.
        
       ___________________________________________________________________
       (page generated 2023-06-20 23:01 UTC)