[HN Gopher] JVM Anatomy Quark #26: Identity Hash Code
       ___________________________________________________________________
        
       JVM Anatomy Quark #26: Identity Hash Code
        
       Author : fniephaus
       Score  : 64 points
       Date   : 2021-07-19 08:23 UTC (1 days ago)
        
 (HTM) web link (shipilev.net)
 (TXT) w3m dump (shipilev.net)
        
       | teknopaul wrote:
       | "It is a frequent source of bugs to change the object in such a
       | way that its hashCode changes after it was used"
       | 
       | (citation needed)
       | 
       | I have never seen that. 20 years in the game.
       | 
       | I can't remember ever having seen anyone use a mutable object for
       | a hashmap key.
        
         | firebaze wrote:
         | Just to add another anecdotal data point: me neither. Neither
         | in java, nor in .Net, and surely not in C++ (there were lots of
         | other things to mess up), having around ~15 years of experience
         | under my belt. To me it is quite elementary, almost muscle
         | memory, to never mutate objects used as keys - and while I
         | observed quite a few rookie mistakes (some of them from the "I"
         | perspective), it'd never occur to me to change a key in a map.
         | 
         | I'd probably learn a lot from stories explaining the
         | environment of such situations, so if you have one, please
         | share.
        
         | elric wrote:
         | > I can't remember ever having seen anyone use a mutable object
         | for a hashmap key.
         | 
         | HashMap keys aren't the only issue, sadly. In my experience,
         | this is way more common in HashSets when trying reduce a
         | collection to a distinct set of items. If any of those items is
         | shared and modified in another thread, you're in for a
         | wonderful time debugging.
         | 
         | Mutability makes things so much harder to reason about. It'd be
         | great if there were a language-assisted way to enforce
         | immutability for collection members.
        
           | rzzzt wrote:
           | HashSets are using a HashMap behind the scenes, they put a
           | "present" placeholder value in the map when adding an element
           | to the set: https://github.com/openjdk-
           | mirror/jdk7u-jdk/blob/f4d80957e89...
        
         | xxs wrote:
         | >I have never seen that. 20 years in the game.
         | 
         | I have seen that a bit too many times, it's pretty terrible.
         | People debug and print HashMap contents, and complain the thing
         | is broken... as get(Object) just returns null, finding an empty
         | bucket.
         | 
         | I guess, you have been a bit more lucky than I was. Definitely
         | seen too many mutable keys.
        
         | teknopaul wrote:
         | to be honest I can't bring to mimd any code that changes a
         | value without recalling put(), code that does that would smell
         | so bad you would spot it by the @author
        
         | mumblemumble wrote:
         | It's all private code, so I can't give you a citation, but I've
         | often helped junior devs with bugs due to this.
         | 
         | The most common way to get there starts with creating a "bean"
         | object. And then, since it's a bean, it has mutable getters and
         | setters. On everything, because that's just the standard way
         | that people are (or at least were) taught to do these things in
         | Java. And then you ask your IDE to give you equals and hashCode
         | implementations, and yeah sure whatever just use all the
         | fields. And then, later on, you need some sort of lookup table
         | that cross-references the data modeled by this bean with some
         | outside source of information. So of course that's a HashMap.
         | And then you go through your objects and update them all based
         | on that information. Maybe you just edit the description or the
         | lastUsed field, who knows. All sorts of possibilities here. In
         | any case, there's your bug.
        
           | cogman10 wrote:
           | I've seen it as well.
           | 
           | Usually it takes the form of "I added object x to a HashSet,
           | then while processing things I update a field in object x
           | used in the hashcode. Object x no longer appears in the set".
        
         | andersonvieira wrote:
         | I have seen it happen many times. A lot of developers I've
         | worked with do not understand the basics of a HashMap/HashSet.
         | They just use the IDE to automatically create hashCode
         | functions with mutable attributes or use classes created by
         | other people and do not check if the hashCode implementation is
         | correct.
        
       | MauranKilom wrote:
       | I don't see how
       | 
       | > "Not suprisingly, [...] global counters [...] are poorly
       | distributed as well."
       | 
       | is true.
       | 
       | I mean, yes, they are not uniformly distributed, but that was
       | never the requirement. As the article itself states, the desired
       | property is that "the values for distinct objects are more or
       | less distinct". With a global counter, you get _maximally_
       | distinct hash codes. _More_ distinct than any of the other
       | approaches (and not less than any user-implemented function), at
       | least until 2^31 object allocations.
       | 
       | Yes, after 2^31 objects you will get repeated values, but that is
       | trivially the case for _any_ pattern of assigning 31 bit hash
       | codes to distinct objects (and any of the pseudo-random
       | approaches will get birthday-paradoxed _much_ sooner, and _much_
       | harder). The only case where this could matter is in an
       | adversarial scenario where someone is trying to DoS your hash map
       | with crafted inputs. But according to the article itself, it
       | would take 4+ minutes (120 ns * 2^31) of only allocating objects
       | for each global counter wrapping. If an adversary can reliably
       | achieve that already, what 's the point in slowing down a hash
       | map by an epsilon every four minutes?
        
         | nlitened wrote:
         | > the values for distinct objects are more or less distinct
         | 
         | I think these words of author understate the requirement of
         | good distribution of hash codes. As far as I understand,
         | ideally the hash codes for different objects should be as
         | distinct as practically possible, so that they are often put
         | into separate buckets of a hash table.
         | 
         | Consecutively allocated objects will have almost all bits of
         | their addresses equal.
        
           | MauranKilom wrote:
           | I wouldn't trust any hash table where a hash function
           | yielding consecutive integers would inherently lead to bad
           | behavior. Can you tell me a popular hash-to-bucket reduction
           | that performs badly for this case?
           | 
           | I will concede that there are plenty of schemes where
           | insufficient entropy in _lower_ bits causes problems.
           | Combining those with the global counter hash and e.g. only
           | inserting every 64th allocated object could be a failure case
           | indeed. But this is still simple to defend against in the
           | reduction scheme.
        
             | dragontamer wrote:
             | I think its about the naive "rep_array[hashcode(obj) %
             | bucket_size] = object" which is all too common.
             | 
             | If your rep_array is doing linear probing and/or robin-hood
             | hasing, then incremental hashcodes (such as 1, 2, 3, 4,
             | 5...) is a bad thing. Especially if you're doing both
             | inserts and removals: this sort of incremental pattern
             | would lead to many "runs" where linear probing would
             | perform poorly.
             | 
             | Of course, it isn't very hard to do
             | rep_array[(hashcode(obj) * large_constant_odd_number) %
             | bucket_size] instead and get a good distribution. But the
             | question is whether or not people know about those kinds of
             | steps.
        
           | jffry wrote:
           | One neat trick could be to use a linear congruential
           | generator to iterate through the 32-bit integers without
           | repetition, as long as you choose the right parameters.
           | Fortunately, these are relatively simple to pick [1] when
           | we're dealing with powers of two.
           | 
           | This would give you fairly easy-to-generate IDs which still
           | have a period of 2^32 but where subsequent allocations have
           | an ID that shares fewer bits on average with its predecessor.
           | 
           | [1] https://en.wikipedia.org/wiki/Linear_congruential_generat
           | or#...
        
         | chrisseaton wrote:
         | > The only case where this could matter is in an adversarial
         | scenario where someone is trying to DoS your hash map with
         | crafted inputs.
         | 
         | That happens in practice.
        
           | MauranKilom wrote:
           | Does it also happen in practice where the crafted inputs are
           | given _at minimum_ 4 minutes apart, on the precondition that
           | those minutes are filled with finely-controlled-by-the-
           | attacker spinning?
        
       | barosl wrote:
       | I think it was a mistake to put hashCode (and to an extent,
       | equals) to all objects by default. It makes Objects bloat, and
       | because of that, those who will never be put to a hash table also
       | need to provide their own hashCode implementations, causing
       | problems like the one mentioned in the article. There should have
       | been a separate, dedicated "Hashable" interface. Strangely, C#
       | shares the same caveat. Is there any benefit to making objects
       | hashable by default?
        
         | admax88q wrote:
         | Where is the bloat? There are default behaviours for equals and
         | hashCode, so you never need to define them if they're not
         | useful for your Object.
        
           | titzer wrote:
           | Did you read the article? The JVM reserves ~4 bytes in the
           | object header of _every object_ to store the identity hash
           | code, regardless if you override .hashCode() or not.
        
             | t-writescode wrote:
             | Compared to all the other bloat in Java, this is pennies
        
               | titzer wrote:
               | If you have an object that has only one field, it still
               | consumes at least 3 words of memory; 2 for the header,
               | and 1 for the field. That's a 200% memory overhead. Of
               | course that's extreme, but I've seen estimates as high as
               | 40% of memory space the heap in large Java applications
               | is just object headers. You probably need at least 1
               | header word in any scheme, but 2 is wasteful IMHO.
        
         | vlovich123 wrote:
         | Moreover the Hashable interface _really_ needs to take a
         | `HashState` object that can be given blocks of memory regions.
         | Then you could plug in arbitrary algorithms rather than having
         | each object hard-coded to use some half-baked implementation
         | that 's usually wrong & definitely wrong for anything that
         | needs cryptography/DOS protection in the hash table. The reason
         | is that Hash(m1 + m2 + m3) is what you want as it's
         | computationally optimal & lets you plug in crypto hashing
         | algorithms, but the old technique that Java bakes into the
         | language forces you to write Hash(Hash(m1) + Hash(m2) +
         | Hash(m3)).
         | 
         | http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n398...
         | which I'm sad hasn't been adopted by C++ yet. It looks like
         | Rust has taken note of this paper.
        
         | moltenguardian wrote:
         | It doesn't cost as much as you might think. The word it's
         | stored in is overloaded with the object lock and garbage
         | collection bits.
        
         | breischl wrote:
         | I don't know about bloat per se, but I think putting `equals`
         | on all objects is a design flaw. This implicitly assumes that
         | every object has a single implementation that works for every
         | case, which is frequently false. Some objects have no
         | semantically meaningful concept of "equals". Others have
         | situation-specific definitions.
         | 
         | A better design (which C# has made some steps towards) is
         | defining interfaces for Equatable and Hashable, and requiring
         | the object to implement methods that return them. Then they can
         | return them or not, or have multiple implementations of each if
         | needed. Or users of the objects can define their own, custom
         | implementations easily.
        
         | [deleted]
        
         | cogman10 wrote:
         | I'm not sure what you mean by bloat. Adding methods doesn't
         | really affect individual object size, just class size. However,
         | actual code weight (rarely) really impacts anything.
         | 
         | In the case of Java/C#, hashcode and equals are optional
         | anyways. So if you don't override them, there's no extra space
         | consumed.
        
           | titzer wrote:
           | The storage of the identity hash code takes ~4 bytes in the
           | object header, and that is independent of whether the method
           | is overridden or not. It's explained in the article.
        
             | usrusr wrote:
             | But assuming that the other bit in those 4 bytes is
             | required and there isn't some other place available where
             | you could sneak it in, those ~4 bytes would still be
             | allocated (for that one other bit) and only take less
             | memory under some compression scheme.
             | 
             | And chances are that even if you did get by without that
             | bit (or found another place for it) you'd in many cases
             | spend far more memory on workarounds for stuff where the
             | creator was too stingy for implementing "IHashable" but you
             | actually need it
        
               | titzer wrote:
               | Well, in Virgil, a class without a superclass is a new
               | root class, with no assumed methods. To use an object of
               | that class as a key in a hashtable, one provides a hash
               | function to the hashtable constructor. (The hash function
               | can be a class method, or a top-level function; with
               | generics, a hashtable key could as well be any type,
               | including primitives). Of course it depends on how your
               | write your code, but most of my objects do not end up as
               | keys in a hashtable, so don't need a hash function nor
               | storage for an identity hashcode.
               | 
               | So the smallest object header for a Virgil object is 4
               | bytes: it's just a type ID that is used for dynamic casts
               | and as an index into a table for GC scanning. Arrays have
               | two 4 byte header fields: the type ID and a 32-bit
               | length. That's considerably more memory efficient than 2
               | or 3 64-bit header words for Java objects and arrays,
               | respectively.
        
           | chrisseaton wrote:
           | They don't mean _code_ weight - they mean _field_ weight.
        
         | cmckn wrote:
         | Yes, seems it was just for hash tables. From the JDK 1.0 doc:
         | 
         | """
         | 
         | Returns a hash code value for the object. This method is
         | supported for the benefit of hashtables such as those provided
         | by <code>java.util.Hashtable</code>.
         | 
         | The general contract of <code>hashCode</code> is:
         | 
         | Whenever it is invoked on the same object more than once during
         | an execution of a Java application, the <code>hashCode</code>
         | method must consistently return the same integer. This integer
         | need not remain consistent from one execution of an application
         | to another execution of the same application.
         | 
         | If two objects are equal according to the <code>equals</code>
         | method, then calling the <code>hashCode</code> method on each
         | of the two objects must produce the same integer result.
         | 
         | """
         | 
         | hashCode is also related to object comparison (see the last
         | line) and this was 2 Java releases before `Comparable` existed.
        
         | exabrial wrote:
         | Given the algorithm used, I don't really understand what bloat
         | you are referring to. Compared to assembly? I guess maybe. But
         | this is a managed language with near-cpp performance. Small
         | price to pay.
         | 
         | Also, there's utility for hashcode/equals far beyond the
         | collections framework... having an identity in a toString() for
         | instance is of great value.
        
       | matheusmoreira wrote:
       | Love to read about these implementation details!
        
         | xxs wrote:
         | There is a caveat the article doesn't mention, it's not
         | possible to use both synchronized/lock elision and
         | System.identityHashCode. The 4bytes in the object header are
         | either used for the hashCode or the threadId(not
         | java.lang.Thread.getId()), holding the monitor. +2 control
         | bits.
         | 
         | Also the stuff has not changed for the past... 14y or so. It's
         | pretty old news.
        
           | _old_dude_ wrote:
           | It's changing now
           | https://wiki.openjdk.java.net/display/lilliput
        
       ___________________________________________________________________
       (page generated 2021-07-20 23:00 UTC)