[HN Gopher] JVM Anatomy Quark #26: Identity Hash Code
___________________________________________________________________
JVM Anatomy Quark #26: Identity Hash Code
Author : fniephaus
Score : 64 points
Date : 2021-07-19 08:23 UTC (1 days ago)
(HTM) web link (shipilev.net)
(TXT) w3m dump (shipilev.net)
| teknopaul wrote:
| "It is a frequent source of bugs to change the object in such a
| way that its hashCode changes after it was used"
|
| (citation needed)
|
| I have never seen that. 20 years in the game.
|
| I can't remember ever having seen anyone use a mutable object for
| a hashmap key.
| firebaze wrote:
| Just to add another anecdotal data point: me neither. Neither
| in java, nor in .Net, and surely not in C++ (there were lots of
| other things to mess up), having around ~15 years of experience
| under my belt. To me it is quite elementary, almost muscle
| memory, to never mutate objects used as keys - and while I
| observed quite a few rookie mistakes (some of them from the "I"
| perspective), it'd never occur to me to change a key in a map.
|
| I'd probably learn a lot from stories explaining the
| environment of such situations, so if you have one, please
| share.
| elric wrote:
| > I can't remember ever having seen anyone use a mutable object
| for a hashmap key.
|
| HashMap keys aren't the only issue, sadly. In my experience,
| this is way more common in HashSets when trying reduce a
| collection to a distinct set of items. If any of those items is
| shared and modified in another thread, you're in for a
| wonderful time debugging.
|
| Mutability makes things so much harder to reason about. It'd be
| great if there were a language-assisted way to enforce
| immutability for collection members.
| rzzzt wrote:
| HashSets are using a HashMap behind the scenes, they put a
| "present" placeholder value in the map when adding an element
| to the set: https://github.com/openjdk-
| mirror/jdk7u-jdk/blob/f4d80957e89...
| xxs wrote:
| >I have never seen that. 20 years in the game.
|
| I have seen that a bit too many times, it's pretty terrible.
| People debug and print HashMap contents, and complain the thing
| is broken... as get(Object) just returns null, finding an empty
| bucket.
|
| I guess, you have been a bit more lucky than I was. Definitely
| seen too many mutable keys.
| teknopaul wrote:
| to be honest I can't bring to mimd any code that changes a
| value without recalling put(), code that does that would smell
| so bad you would spot it by the @author
| mumblemumble wrote:
| It's all private code, so I can't give you a citation, but I've
| often helped junior devs with bugs due to this.
|
| The most common way to get there starts with creating a "bean"
| object. And then, since it's a bean, it has mutable getters and
| setters. On everything, because that's just the standard way
| that people are (or at least were) taught to do these things in
| Java. And then you ask your IDE to give you equals and hashCode
| implementations, and yeah sure whatever just use all the
| fields. And then, later on, you need some sort of lookup table
| that cross-references the data modeled by this bean with some
| outside source of information. So of course that's a HashMap.
| And then you go through your objects and update them all based
| on that information. Maybe you just edit the description or the
| lastUsed field, who knows. All sorts of possibilities here. In
| any case, there's your bug.
| cogman10 wrote:
| I've seen it as well.
|
| Usually it takes the form of "I added object x to a HashSet,
| then while processing things I update a field in object x
| used in the hashcode. Object x no longer appears in the set".
| andersonvieira wrote:
| I have seen it happen many times. A lot of developers I've
| worked with do not understand the basics of a HashMap/HashSet.
| They just use the IDE to automatically create hashCode
| functions with mutable attributes or use classes created by
| other people and do not check if the hashCode implementation is
| correct.
| MauranKilom wrote:
| I don't see how
|
| > "Not suprisingly, [...] global counters [...] are poorly
| distributed as well."
|
| is true.
|
| I mean, yes, they are not uniformly distributed, but that was
| never the requirement. As the article itself states, the desired
| property is that "the values for distinct objects are more or
| less distinct". With a global counter, you get _maximally_
| distinct hash codes. _More_ distinct than any of the other
| approaches (and not less than any user-implemented function), at
| least until 2^31 object allocations.
|
| Yes, after 2^31 objects you will get repeated values, but that is
| trivially the case for _any_ pattern of assigning 31 bit hash
| codes to distinct objects (and any of the pseudo-random
| approaches will get birthday-paradoxed _much_ sooner, and _much_
| harder). The only case where this could matter is in an
| adversarial scenario where someone is trying to DoS your hash map
| with crafted inputs. But according to the article itself, it
| would take 4+ minutes (120 ns * 2^31) of only allocating objects
| for each global counter wrapping. If an adversary can reliably
| achieve that already, what 's the point in slowing down a hash
| map by an epsilon every four minutes?
| nlitened wrote:
| > the values for distinct objects are more or less distinct
|
| I think these words of author understate the requirement of
| good distribution of hash codes. As far as I understand,
| ideally the hash codes for different objects should be as
| distinct as practically possible, so that they are often put
| into separate buckets of a hash table.
|
| Consecutively allocated objects will have almost all bits of
| their addresses equal.
| MauranKilom wrote:
| I wouldn't trust any hash table where a hash function
| yielding consecutive integers would inherently lead to bad
| behavior. Can you tell me a popular hash-to-bucket reduction
| that performs badly for this case?
|
| I will concede that there are plenty of schemes where
| insufficient entropy in _lower_ bits causes problems.
| Combining those with the global counter hash and e.g. only
| inserting every 64th allocated object could be a failure case
| indeed. But this is still simple to defend against in the
| reduction scheme.
| dragontamer wrote:
| I think its about the naive "rep_array[hashcode(obj) %
| bucket_size] = object" which is all too common.
|
| If your rep_array is doing linear probing and/or robin-hood
| hasing, then incremental hashcodes (such as 1, 2, 3, 4,
| 5...) is a bad thing. Especially if you're doing both
| inserts and removals: this sort of incremental pattern
| would lead to many "runs" where linear probing would
| perform poorly.
|
| Of course, it isn't very hard to do
| rep_array[(hashcode(obj) * large_constant_odd_number) %
| bucket_size] instead and get a good distribution. But the
| question is whether or not people know about those kinds of
| steps.
| jffry wrote:
| One neat trick could be to use a linear congruential
| generator to iterate through the 32-bit integers without
| repetition, as long as you choose the right parameters.
| Fortunately, these are relatively simple to pick [1] when
| we're dealing with powers of two.
|
| This would give you fairly easy-to-generate IDs which still
| have a period of 2^32 but where subsequent allocations have
| an ID that shares fewer bits on average with its predecessor.
|
| [1] https://en.wikipedia.org/wiki/Linear_congruential_generat
| or#...
| chrisseaton wrote:
| > The only case where this could matter is in an adversarial
| scenario where someone is trying to DoS your hash map with
| crafted inputs.
|
| That happens in practice.
| MauranKilom wrote:
| Does it also happen in practice where the crafted inputs are
| given _at minimum_ 4 minutes apart, on the precondition that
| those minutes are filled with finely-controlled-by-the-
| attacker spinning?
| barosl wrote:
| I think it was a mistake to put hashCode (and to an extent,
| equals) to all objects by default. It makes Objects bloat, and
| because of that, those who will never be put to a hash table also
| need to provide their own hashCode implementations, causing
| problems like the one mentioned in the article. There should have
| been a separate, dedicated "Hashable" interface. Strangely, C#
| shares the same caveat. Is there any benefit to making objects
| hashable by default?
| admax88q wrote:
| Where is the bloat? There are default behaviours for equals and
| hashCode, so you never need to define them if they're not
| useful for your Object.
| titzer wrote:
| Did you read the article? The JVM reserves ~4 bytes in the
| object header of _every object_ to store the identity hash
| code, regardless if you override .hashCode() or not.
| t-writescode wrote:
| Compared to all the other bloat in Java, this is pennies
| titzer wrote:
| If you have an object that has only one field, it still
| consumes at least 3 words of memory; 2 for the header,
| and 1 for the field. That's a 200% memory overhead. Of
| course that's extreme, but I've seen estimates as high as
| 40% of memory space the heap in large Java applications
| is just object headers. You probably need at least 1
| header word in any scheme, but 2 is wasteful IMHO.
| vlovich123 wrote:
| Moreover the Hashable interface _really_ needs to take a
| `HashState` object that can be given blocks of memory regions.
| Then you could plug in arbitrary algorithms rather than having
| each object hard-coded to use some half-baked implementation
| that 's usually wrong & definitely wrong for anything that
| needs cryptography/DOS protection in the hash table. The reason
| is that Hash(m1 + m2 + m3) is what you want as it's
| computationally optimal & lets you plug in crypto hashing
| algorithms, but the old technique that Java bakes into the
| language forces you to write Hash(Hash(m1) + Hash(m2) +
| Hash(m3)).
|
| http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n398...
| which I'm sad hasn't been adopted by C++ yet. It looks like
| Rust has taken note of this paper.
| moltenguardian wrote:
| It doesn't cost as much as you might think. The word it's
| stored in is overloaded with the object lock and garbage
| collection bits.
| breischl wrote:
| I don't know about bloat per se, but I think putting `equals`
| on all objects is a design flaw. This implicitly assumes that
| every object has a single implementation that works for every
| case, which is frequently false. Some objects have no
| semantically meaningful concept of "equals". Others have
| situation-specific definitions.
|
| A better design (which C# has made some steps towards) is
| defining interfaces for Equatable and Hashable, and requiring
| the object to implement methods that return them. Then they can
| return them or not, or have multiple implementations of each if
| needed. Or users of the objects can define their own, custom
| implementations easily.
| [deleted]
| cogman10 wrote:
| I'm not sure what you mean by bloat. Adding methods doesn't
| really affect individual object size, just class size. However,
| actual code weight (rarely) really impacts anything.
|
| In the case of Java/C#, hashcode and equals are optional
| anyways. So if you don't override them, there's no extra space
| consumed.
| titzer wrote:
| The storage of the identity hash code takes ~4 bytes in the
| object header, and that is independent of whether the method
| is overridden or not. It's explained in the article.
| usrusr wrote:
| But assuming that the other bit in those 4 bytes is
| required and there isn't some other place available where
| you could sneak it in, those ~4 bytes would still be
| allocated (for that one other bit) and only take less
| memory under some compression scheme.
|
| And chances are that even if you did get by without that
| bit (or found another place for it) you'd in many cases
| spend far more memory on workarounds for stuff where the
| creator was too stingy for implementing "IHashable" but you
| actually need it
| titzer wrote:
| Well, in Virgil, a class without a superclass is a new
| root class, with no assumed methods. To use an object of
| that class as a key in a hashtable, one provides a hash
| function to the hashtable constructor. (The hash function
| can be a class method, or a top-level function; with
| generics, a hashtable key could as well be any type,
| including primitives). Of course it depends on how your
| write your code, but most of my objects do not end up as
| keys in a hashtable, so don't need a hash function nor
| storage for an identity hashcode.
|
| So the smallest object header for a Virgil object is 4
| bytes: it's just a type ID that is used for dynamic casts
| and as an index into a table for GC scanning. Arrays have
| two 4 byte header fields: the type ID and a 32-bit
| length. That's considerably more memory efficient than 2
| or 3 64-bit header words for Java objects and arrays,
| respectively.
| chrisseaton wrote:
| They don't mean _code_ weight - they mean _field_ weight.
| cmckn wrote:
| Yes, seems it was just for hash tables. From the JDK 1.0 doc:
|
| """
|
| Returns a hash code value for the object. This method is
| supported for the benefit of hashtables such as those provided
| by <code>java.util.Hashtable</code>.
|
| The general contract of <code>hashCode</code> is:
|
| Whenever it is invoked on the same object more than once during
| an execution of a Java application, the <code>hashCode</code>
| method must consistently return the same integer. This integer
| need not remain consistent from one execution of an application
| to another execution of the same application.
|
| If two objects are equal according to the <code>equals</code>
| method, then calling the <code>hashCode</code> method on each
| of the two objects must produce the same integer result.
|
| """
|
| hashCode is also related to object comparison (see the last
| line) and this was 2 Java releases before `Comparable` existed.
| exabrial wrote:
| Given the algorithm used, I don't really understand what bloat
| you are referring to. Compared to assembly? I guess maybe. But
| this is a managed language with near-cpp performance. Small
| price to pay.
|
| Also, there's utility for hashcode/equals far beyond the
| collections framework... having an identity in a toString() for
| instance is of great value.
| matheusmoreira wrote:
| Love to read about these implementation details!
| xxs wrote:
| There is a caveat the article doesn't mention, it's not
| possible to use both synchronized/lock elision and
| System.identityHashCode. The 4bytes in the object header are
| either used for the hashCode or the threadId(not
| java.lang.Thread.getId()), holding the monitor. +2 control
| bits.
|
| Also the stuff has not changed for the past... 14y or so. It's
| pretty old news.
| _old_dude_ wrote:
| It's changing now
| https://wiki.openjdk.java.net/display/lilliput
___________________________________________________________________
(page generated 2021-07-20 23:00 UTC)