[HN Gopher] Hashing
___________________________________________________________________
Hashing
Author : signa11
Score : 253 points
Date : 2023-06-20 09:48 UTC (13 hours ago)
(HTM) web link (samwho.dev)
(TXT) w3m dump (samwho.dev)
| sudobash1 wrote:
| Off-topic, but it just occurred to me that the "dialog" in this
| article between the author and the "Haskie" dog is a very
| traditional way of writing pedagogical works.
|
| Some historical examples of books written in this way are
| Galileo's "Dialogue Concerning the Two Chief World Systems" (the
| one that landed him in hot water), and "The Study of
| Counterpoint" by Johann Joseph Fux (from the early 18th century).
|
| A small excerpt from "The Study of Counterpoint":
|
| >Joseph.-- Why did you leave out B between A and C?
|
| > Aloys.-- Because it has no perfect fifth and therefore cannot
| be the final of a mode -- which we shall discuss more fully in
| its proper place.
|
| Both of these books are written as an entire discussion or
| argument, as was commonplace in teaching books. I honestly often
| find myself disliking it, but I think it is a great way to learn
| for many people.
| samwho wrote:
| I didn't know this. I created Haskie during the writing of my
| Memory Allocation post because I found myself trying to pre-
| empt reader questions a lot and it was weird to read. Creating
| a proxy for the reader in Haskie made it flow a lot better.
|
| I've noticed in this post that a 2nd character that's a proxy
| for an "expert" in a topic would also be handy. Taking
| suggestions for a good dog breed to represent this character.
| ComputerGuru wrote:
| Great intro to hashing, but if the concept of mmhash3's seed is
| going to be brought up, I think it's only natural to mention its
| design limitations and why you still need something actually
| DDoS-resistant like SipHash (even if we don't get into the
| details of the latter).
|
| Also, it's important to distinguish between hashing, hashing, and
| hashing. That is, hashing to map input to buckets, hashing to
| obscure the original string, and hashing to find similarities in
| the input data. They're all called hashing but they have
| different (and conflicting!) requirements. There's a reason you
| want mmhash3 to be fast but scrypt to be slow, and a reason why
| you want mmhash3 to avalanche but certainly don't want your
| perceptual hashing algo of choice to do the same.
| Solvency wrote:
| One thing I wish the original article would explain is the use
| case of buckets. I'm sort of imagining weird use cases where
| the buckets represent different servers or storage systems for
| the purpose of spreading out data efficiently...but I'm totally
| guessing here. Because I would assume a proper load balancer
| would be based on the dynamics of real-time server performance,
| not on some simple hash function. What are common
| hash-to-bucket use cases?
| samwho wrote:
| The buckets, in the context of hash maps, are typically
| lists. Either arrays or linked lists. They're used to store
| the key-value pairs.
|
| Some load balancers do use hashing in much the same way hash
| maps do. Usually they'll take a combination of: source IP,
| source port, destination IP, destination port, and hash it.
| They'll then use that hash to pick a server. The practical
| impact of this is that each user always gets mapped to the
| same server. This is typically called "session sticky load
| balancing" because it means session information about that
| user can live on the server, safe in the knowledge that the
| user will always end up on that server and not get routed to
| any others.
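|
| For illustration, a minimal sketch of that idea (not from the
| article), assuming a 32-bit string hash like murmur3 is
| available as a hypothetical hash32(str) helper that returns an
| unsigned integer:
|
|     // Pick a backend by hashing the connection 4-tuple. The
|     // same 4-tuple always maps to the same server ("sticky"
|     // sessions).
|     function pickServer(srcIp, srcPort, dstIp, dstPort, servers) {
|       const key = srcIp + ":" + srcPort + "->" + dstIp + ":" + dstPort;
|       return servers[hash32(key) % servers.length];
|     }
|
|     // pickServer("10.0.0.7", 49152, "93.184.216.34", 443,
|     //            ["app-1", "app-2", "app-3"]);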
| koromak wrote:
| The idea of "buckets" here is purely to speed up search time,
| at a very low level. Don't think about it in the context of a
| web server or database storage, it's way more basic than that.
| It's literally just arrays in memory.
|
| The idea is: if you don't have a Hash or Map already
| implemented in your language, how would you build a fast one?
| You can't write my_object['my_key'] -- that doesn't exist; you
| don't have key-value storage. Instead, you need to somehow
| store those pieces of information, and find them later.
|
| Obviously, you could just stick every value inside one big
| array. Then when you call MyHash.get('key'), you simply do an
| array search. But that would be slow.
|
| Instead, you can hash the 'key', stick it into a smaller
| bucket based on the hash, and then more quickly search for it
| later. In the future, you know the hash of 'key', so you know
| which bucket to look in.
|
| The author does make it confusing, since in their example
| each bucket contains Entry (entry['value']), meaning they are
| already using a JS HashMap implementation in their rebuilding
| of a HashMap, but you could rewrite the example to do it
| without any objects. The code would be harder to read though.
| samwho wrote:
| This is my eternal struggle: be correct, or be simple.
|
| I wrote a version that used 2-element arrays but the code
| became more dense and I worried about losing people. I was
| hoping that the ease of translating it to not use objects would
| give me a pass here, but apparently not :D
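|
| For anyone curious, here's a rough sketch of what that
| 2-element-array version might look like (not the post's actual
| code, and again assuming a hypothetical hash32 string hash that
| returns an unsigned 32-bit integer):
|
|     // Buckets are plain arrays of [key, value] pairs. Only the
|     // bucket a key hashes to gets scanned, not the whole map.
|     class SimpleHashMap {
|       constructor(numBuckets = 16) {
|         this.buckets = Array.from({ length: numBuckets }, () => []);
|       }
|       bucketFor(key) {
|         return this.buckets[hash32(key) % this.buckets.length];
|       }
|       set(key, value) {
|         const bucket = this.bucketFor(key);
|         for (const entry of bucket) {
|           if (entry[0] === key) { entry[1] = value; return; }
|         }
|         bucket.push([key, value]);
|       }
|       get(key) {
|         for (const entry of this.bucketFor(key)) {
|           if (entry[0] === key) return entry[1];
|         }
|         return undefined;
|       }
|     }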
| aappleby wrote:
| If I ever write Murmur4 it's going to be the smallest function
| that I can prove (via sat solver) to have no seed-independent
| collisions :D
| ComputerGuru wrote:
| I approve of the use of relative metrics, Austin! Just be
| prepared for mmhash4 to be a few megabytes in size ;)
| samwho wrote:
| I did have the distinction between cryptographic and non-
| cryptographic in there originally but found when junior folks
| read it they'd get confused about which use-case was being
| discussed. So I decided to focus on just one. With more time, I
| would have liked to cover at least scrypt.
|
| I'll be honest, I don't know what the design limitations of the
| seeding in murmur3 are. What I wanted to show is the concept of
| seeding and what it's there to prevent. I'm hoping that comes
| across, even without any deeper exploration of seeding.
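|
| For anyone wondering what seeding looks like mechanically, here
| is a toy seeded hash (an FNV-1a-style sketch, not murmur3's
| actual scheme) where the seed perturbs the starting state, so
| the same key lands in a different bucket under a different seed:
|
|     function seededHash32(key, seed) {
|       let h = (2166136261 ^ seed) >>> 0; // FNV offset basis mixed with seed
|       for (let i = 0; i < key.length; i++) {
|         h ^= key.charCodeAt(i);
|         h = Math.imul(h, 16777619) >>> 0; // FNV prime
|       }
|       return h;
|     }
|
|     // seededHash32("hello", 1) and seededHash32("hello", 2) will
|     // (almost always) differ, so an attacker who doesn't know the
|     // seed can't precompute keys that all land in one bucket.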
| ComputerGuru wrote:
| Seeding is kind of a hack. It doesn't guarantee you'll avoid
| a collision (or even that same collision), and quite a few
| "popular" hashes only change the output (but don't change
| whether or not there is a collision) when you change the seed.
|
| Thanks for the article, though!
| samwho wrote:
| Ahh I see, the limitation you're talking about is that
| there isn't any inherent guarantee that a different seed
| will mean two values no longer collide? It's just likely
| (in the case of murmur3 at least).
| ComputerGuru wrote:
| Yes, but it's not just that there's only "a chance" that
| the collision is avoided - that would be enough if it was
| actually just a random probability (after all, everything
| about hashing and collisions - in the best textbook case
| - is just chance). The problem is that there are methods
| for obtaining what we call "seed-independent collisions"
| where by analyzing the hashing algorithm itself you can
| actually determine a priori values that will collide
| regardless of the seed.
|
| If you have half an hour to spare, I really recommend you
| take the time to read this:
| http://emboss.github.io/blog/2012/12/14/breaking-murmur-
| hash...
| samwho wrote:
| Oh damn, I didn't know that. Have bookmarked that post,
| thank you so much for your comments. I've been very
| fortunate in my writing that the comment sections are
| always full of people with awesome insight that I missed
| during my own reading on the topic.
| edvards wrote:
| Amazingly well done article, thanks for sharing and making it!
| samwho wrote:
| Thank you! I absolutely love making these, and have plans for
| several more.
| mcdonje wrote:
| Between Sam and Bartosz, I think it's time to bite the bullet and
| get an RSS reader set up. Anyone got any good app recommendations
| for Android?
| samwho wrote:
| To be mentioned in the same breath as Bartosz Ciechanowski is
| an incredible compliment, thank you <3
|
| I will say up front: my posts don't read well in an RSS reader.
| Sorry about that. But at least you'll get a notification when
| new ones come out.
| marginalia_nu wrote:
| > The Avalanche Effect
|
| > Another way hash functions get evaluated is on something called
| the "avalanche effect." This refers to how many bits in the
| output value change when just a single bit of the input changes.
| To say that a hash function has a good avalanche effect, a single
| bit flip in the input should result in an average of 50% of the
| output bits flipping.
|
| I think it's important to note that this isn't a necessary
| property for a hash function. There are hash functions that
| deliberately try to minimize the avalanche effect, such as in
| locality-sensitive hashing, which is another form of hash
| function with different use cases that are also very cool.
|
| You can use LSHs to, for example, remove near-duplicate search
| results in a search engine without actually having to compare
| the texts, even tolerating single-word differences, or to help
| with nearest-neighbor searches.
|
| A lot of what's written about hash functions tends to assume you
| want to use them for whatever the author has in mind. It's not a
| fault unique to this author -- many textbooks have the same
| problem -- and they attach all these properties that are
| accidental to what a hash function actually is: really just a
| mapping function to a fixed-width representation with an even
| distribution across the domain.
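|
| As an aside, the 50% figure quoted above can be estimated
| empirically. A rough sketch, assuming a hash32 function that
| takes an array of bytes and returns an unsigned 32-bit value:
|
|     function popcount32(x) {
|       let count = 0;
|       while (x) { count += x & 1; x >>>= 1; }
|       return count;
|     }
|
|     // Flip each input bit in turn and count how many of the 32
|     // output bits change; a strong avalanche averages near 16.
|     function avalancheScore(bytes) {
|       const base = hash32(bytes);
|       const totalInputBits = bytes.length * 8;
|       let flipped = 0;
|       for (let i = 0; i < totalInputBits; i++) {
|         const mutated = bytes.slice();
|         mutated[i >> 3] ^= 1 << (i & 7); // flip one input bit
|         flipped += popcount32((hash32(mutated) ^ base) >>> 0);
|       }
|       return flipped / totalInputBits; // avg output bits flipped
|     }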
| quickthrower2 wrote:
| So technically truncating and then padding is a hash function
| [deleted]
| rurban wrote:
| Division by size is also very popular. 1:1 mapping ditto
| Solvency wrote:
| Is there a world in which hashes could be used as an
| alternative to word2vec style vector embeddings? Where you
| basically convert text inputs to an ideal fixed width
| representation and then calculate the "similarity" between them
| to derive some kind of meaning?
| marginalia_nu wrote:
| Adjacent algorithms are used all over that space.
|
| In general, random projection, which is used for dimension
| reduction in vector spaces, can also be used to construct
| LSHs, for example.
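|
| A minimal sketch of that random-projection construction
| (sign-of-dot-product bits over random hyperplanes); vectors
| pointing in similar directions tend to agree on most bits:
|
|     function makeRandomProjectionLsh(dimensions, numBits) {
|       // One random hyperplane per output bit.
|       const planes = Array.from({ length: numBits }, () =>
|         Array.from({ length: dimensions }, () => Math.random() * 2 - 1)
|       );
|       return function (vector) {
|         let signature = 0;
|         for (let b = 0; b < numBits; b++) {
|           let dot = 0;
|           for (let d = 0; d < dimensions; d++) {
|             dot += planes[b][d] * vector[d];
|           }
|           if (dot >= 0) signature |= 1 << b;
|         }
|         return signature >>> 0; // up to 32 bits in this sketch
|       };
|     }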
| dahart wrote:
| > You can use LSHs [...] to help with nearest neighbor
| searches.
|
| Yes! A good example of this in graphics is a spatial nearest
| neighbor search. Space filling curves like the Morton and
| Hilbert curves are often used as hash functions that can
| preserve some amount of locality for 2-D / 3-D / N-D points. A
| locality-sensitive hash function can basically be used to
| provide a sort key for data that doesn't otherwise have a key,
| while increasing the odds that any two keys that are similar
| point to data that are also similar. This is useful on GPUs in
| order to improve coherence/performance of threads operating on
| spatial points.
|
| https://en.wikipedia.org/wiki/Z-order_curve
|
| https://en.wikipedia.org/wiki/Hilbert_curve
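|
| As a concrete sketch of that kind of locality-preserving key,
| here's the standard 2-D Morton (Z-order) encoding, which
| interleaves the bits of two 16-bit coordinates; nearby (x, y)
| points tend to get nearby codes:
|
|     // Spread the low 16 bits of v so a zero sits between each.
|     function part1By1(v) {
|       v &= 0x0000ffff;
|       v = (v | (v << 8)) & 0x00ff00ff;
|       v = (v | (v << 4)) & 0x0f0f0f0f;
|       v = (v | (v << 2)) & 0x33333333;
|       v = (v | (v << 1)) & 0x55555555;
|       return v;
|     }
|
|     // Interleave 16-bit x and y into a 32-bit Z-order key.
|     function morton2D(x, y) {
|       return ((part1By1(y) << 1) | part1By1(x)) >>> 0;
|     }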
| Solvency wrote:
| The OP mentioned murmur3 due to its speed and high avalanche
| effect.
|
| So what's the best LSH function for this use case?
| dahart wrote:
| Hi I'm not sure I understand the question. Avalanche and
| local sensitivity are opposing goals, so typically the more
| avalanche, the less locally sensitive, and vice versa. You
| could even state the avalanche goal as being locally
| insensitive - the whole idea of avalanche is one bit change
| in the input produces a maximally different and uniformly
| distributed output.
| Solvency wrote:
| I'm basically asking for the most efficient and
| unavalanched hash map library recommended for the
| purposes of this use case.
| marginalia_nu wrote:
| I don't think you're gonna get one-size-fits-all answers
| the same way you will with avalanching hash functions.
| They can be one-size-fits-all because the requirement is
| that there is no (discernible) correlation between the
| input and output.
|
| Locality-sensitive hashes depend on what sort of thing
| you are hashing. Is the data ordered, is it a sequence,
| is it points in space? The requirement is that similarity in
| the input under one metric implies similarity in the output
| under another metric.
|
| There are no universal locality sensitive hash functions
| because the function depends on in what sense you're
| trying to preserve locality.
| dahart wrote:
| Ah, got it. I don't even know of one, TBH, let alone have
| a best one in mind. Every time I've used an LSH, the
| hashing code has been ad-hoc and tailored to pretty
| specific data and optimization goals. (An added layer of
| difficulty is that "locality" isn't universally defined.)
|
| I'm not sure but it might be simplest to just look for a
| decent Map library in your favorite language, and keep
| the hashing part separate. It's generally very easy to
| insert your own hash transform wherever you need, whether
| it's on the fly, or saved as a sort key in memory. This
| way all you would need is to find a Morton order
| function, for example, if you want to hash with a z-order
| curve.
| marginalia_nu wrote:
| Yeah there's a ton of really cool things you can do with
| LSHs. They're really overlooked outside of a few niches.
| samwho wrote:
| I didn't know that, but it makes perfect sense after reading
| your explanation of it. Thank you, TIL!
| marginalia_nu wrote:
| If you're interested, I built a stupid simple deduplication
| algorithm[1]. It ironically uses the avalanche effect of
| another hash function to implement random projection onto a
| locality preserving bit vector. It sounds like real computer
| science bingo but it's actually surprisingly simple and it
| works much better than it has any business doing.
|
| [1] https://github.com/MarginaliaSearch/MarginaliaSearch/blob
| /ma...
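|
| For anyone who wants the flavour of that idea, here's a generic
| SimHash-style sketch (not the linked Marginalia code): hash each
| token with an ordinary avalanching hash, let every hash bit cast
| a +1/-1 vote, and keep the sign of each vote total.
|
|     // hash32 is an assumed avalanching 32-bit hash (e.g. murmur3).
|     // Documents sharing most of their tokens end up with
|     // fingerprints that differ in only a few bits.
|     function simhash(tokens) {
|       const votes = new Array(32).fill(0);
|       for (const token of tokens) {
|         const h = hash32(token);
|         for (let bit = 0; bit < 32; bit++) {
|           votes[bit] += ((h >>> bit) & 1) ? 1 : -1;
|         }
|       }
|       let fingerprint = 0;
|       for (let bit = 0; bit < 32; bit++) {
|         if (votes[bit] > 0) fingerprint |= 1 << bit;
|       }
|       return fingerprint >>> 0;
|     }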
| mxkopy wrote:
| I feel like there's a more general concept of "bijective index"
| that subsumes LSHs and cryptographic hashes. When I think of a
| hash I usually think of modulos, randomness, etc., when in
| general they're just ways of creating a set that's 1:1 with your
| inputs.
| aappleby wrote:
| Murmur3 author here, very nice writeup.
| samwho wrote:
| Thank you! That puts to bed some of my anxieties around whether
| I've misrepresented the qualities of murmur3.
| thecoppinger wrote:
| This guide, just like your others, gave me the gift of several
| 'ah-ha' moments that not only helped hash maps click, but also
| several other linked concepts.
|
| Thank you so much for making the time and effort to create such
| quality content.
|
| Please keep producing more!
| anayak wrote:
| Fantastic article! Love the visualizations and your explanations
| are concise and very clear.
| samwho wrote:
| Thank you! I'm really glad you liked it.
| vlovich123 wrote:
| I'm surprised xxhash wasn't mentioned. Isn't it faster and more
| hash collision resistant than murmur?
|
| I think I read a very good website on another hash proposal that
| could supersede xxhash (roughly similar performance, but more
| methodical in construction and with other nice properties), but
| I can't recall what it's called (it's not in the SMHasher tests
| yet, if I recall correctly).
| samwho wrote:
| This was purely due to familiarity. I knew murmur3 was
| considered a good, current hashing algorithm. I was able to
| find a good implementation of it in JS, so I didn't really look
| any further.
|
| Thanks for exposing me to xxhash! I'll store that away as an
| alternative to murmur3 if I ever need one in future. :)
| koromak wrote:
| It's kind of confusing that the author uses JS objects (aka
| hashes) as entries inside the HashMap he's building. I think
| building
| the concept of a hash from the ground up would be more
| illuminating.
| samwho wrote:
| I agonised over this. I did originally have a version that used
| 2-element arrays instead, but the resulting code felt more
| dense and hard to follow.
|
| I figured that people who know enough to point this out already
| know what's up. People who don't will benefit from the code
| being easier to read.
| arek_nawo wrote:
| That's a good, visual primer! Congrats.
| samwho wrote:
| Thank you <3
| michael_leachim wrote:
| Cool, I enjoyed reading the article and playing with the
| visualizations. I liked it a lot.
|
| How can I learn to design a hash function? It is possible to
| understand that stringSum is bad compared to murmur3 by
| evaluating it against test cases, but what properties make it
| bad? Is it the summation, compared to the xoring in murmur3? I
| intuit that summation is kinda lossy, but of course there is
| much more rigorous work put into it. It would be really cool to
| learn more about this. Thank you.
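|
| To make that "kinda lossy" intuition concrete (assuming
| stringSum simply adds character codes, as the name suggests):
| any two anagrams collide, no matter how many buckets you use.
|
|     function stringSum(s) {
|       let sum = 0;
|       for (let i = 0; i < s.length; i++) sum += s.charCodeAt(i);
|       return sum;
|     }
|
|     // "listen", "silent", and "enlist" all sum to the same value,
|     // so they always collide; murmur3 would almost certainly give
|     // three different hashes.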
| samwho wrote:
| I'm actually not sure, to be honest. The post explicitly
| doesn't touch on the implementation of murmur3, and part of the
| reason is that while I can see what it's doing, I have no idea
| why it's doing it.
|
| The initial announcement of murmur
| (https://tanjent.livejournal.com/756623.html) makes it seem
| like it's trial and error.
| aappleby wrote:
| Murmur author here, it is a lot of trial and error but in
| general you're trying to pack as much "nonlinearity" as
| possible into as few instructions as possible. Multiplication
| is commutative, xor is commutative, but if you mix the two
| together you get functions that are strongly non-linear (in
| the algebra sense) because the mathematical "spaces" of
| integers and bits aren't connected together the same way.
| Ensuring that the nonlinearity applies to all the bits
| roughly equally is hard and requires trial and error.
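|
| For a concrete taste of that mixing, this is the published
| 32-bit finalizer ("fmix32") from MurmurHash3, sketched here in
| JS: just alternating xor-shifts and multiplies.
|
|     function fmix32(h) {
|       // Constants and shifts from the reference MurmurHash3.
|       h ^= h >>> 16;
|       h = Math.imul(h, 0x85ebca6b);
|       h ^= h >>> 13;
|       h = Math.imul(h, 0xc2b2ae35);
|       h ^= h >>> 16;
|       return h >>> 0;
|     }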
| moonchild wrote:
| Perhaps look at some of what paul khuong has written about
| umash; it is dense, but good, and includes good references.
| dahart wrote:
| The famous Bob Jenkins has lots of info on hash function
| design:
|
| http://www.burtleburtle.net/bob/index.html
|
| http://www.burtleburtle.net/bob/hash/index.html
| mpartel wrote:
| Excellent work, again! Your intros somehow elicit the best out of
| HN too: the comments are always full of interesting angles and
| variations.
|
| Wanna do quaternions next? :) Many have tried..
| samwho wrote:
| There's a name I haven't heard for a long time.
|
| Thank you, sir! As soon as I can reliably spell "quaternions" I
| will think about trying it.
|
| Funnily, one of the ideas I have cooking in my head is going to
| require a foray into 3D.
| [deleted]
| d--b wrote:
| Good stuff.
|
| I'd perhaps add a paragraph on bcrypt and why it must be slow.
| photochemsyn wrote:
| Very clear explanation! There's also a famous answer on SO that
| goes into collisions and has similar visual maps for viewing how
| random the hashing distributions are for a variety of different
| non-cryptographic hashing algorithms (top answer):
|
| https://softwareengineering.stackexchange.com/questions/4955...
| samwho wrote:
| I saw this during the writing of the post! Extremely cool, I
| mention it a little in my behind-the-scenes post.
___________________________________________________________________
(page generated 2023-06-20 23:01 UTC)