hngopher.com

       [HN Gopher] Why is hash(-1) == hash(-2) in Python?
       ___________________________________________________________________
        
       Why is hash(-1) == hash(-2) in Python?
        
       Author : alexmolas
       Score  : 78 points
       Date   : 2025-01-07 22:23 UTC (3 days ago)
        
 (HTM) web link (omairmajid.com)
 (TXT) w3m dump (omairmajid.com)
        
       | Terr_ wrote:
       | In a way this is an argument for languages where it's normal to
       | have result-types [0] or exceptions, instead of -1 etc. It just
       | feels _wrong_ to have any potential collision or confusion
       | between normal results and error results.
       | 
       | [0] https://en.wikipedia.org/wiki/Result_type
        
         | kevinventullo wrote:
         | We all know that types are not Pythonic.
         | 
         | (I'm only mostly joking)
        
           | folkrav wrote:
           | Ackchyually, Python has pretty strong typing, as far as
           | dynamic languages go.
        
             | bhickey wrote:
             | Sure. Everything is a PyObject.
        
             | Insanity wrote:
             | If you use the actual type system and something like mypy,
             | Python is a joy work with. (For my definition of joy, which
             | includes static typing).
        
               | IshKebab wrote:
               | Actually writing Python is reasonably nice in that case.
               | 
               | But dealing with Python infrastructure is so awful as to
               | make the whole experience just bad.
               | 
               | uv fixes a lot of that, but I think it will be some time
               | before it's used everywhere, and I have zero hope that
               | the Python devs will ever do the right thing and
               | officially replace Pip with uv.
        
               | arccy wrote:
               | you also have to trawl through the flags to make sure it
               | actually checks types, e.g. check_untyped_defs
        
         | kstrauser wrote:
         | Not really, I don't think. Python code doesn't ever really use
         | hash() for anything where specific outputs are expected. Even
         | if hash(a)==hash(b), it's not implied that a==b or a is b.
        
           | TRiG_Ireland wrote:
           | Yes, but having result types would avoid the need for this
           | special casing.
        
           | tmvphil wrote:
           | But -2 seems like just a bad choice here. -1 and -2 are very
           | "common" integers. Seems they could have just picked a quasi-
           | random large integer to reduce the likelihood of collision. I
           | expect hash functions to have collisions, but I expect it to
           | be "unlikely" in a way that scales with the size of the hash.
        
             | kstrauser wrote:
             | I don't think that matters here, though. This specific hash
             | function is literally used just for putting values in dict
             | buckets. The docs say:
             | 
             | > Hash values are integers. They are used to quickly
             | compare dictionary keys during a dictionary lookup.
             | 
             | They're not to be used as a crypto digest. The downside
             | here, then is that if -1 and -2 are used as dict keys,
             | they'll end up bucketed together. Apparently that hasn't
             | been enough of an issue over the years to bother changing
             | it. One advantage is that those values are common enough
             | that it might break code that incorrectly expects hash() to
             | have predictable outputs. If it used 0xffffffff as the
             | value, lots of busted tests may never stumble across it.
        
               | itishappy wrote:
               | > The downside here, then is that if -1 and -2 are used
               | as dict keys, they'll end up bucketed together.
               | 
               | Note that they'll only end up bucketed together in the
               | underlying hashmap. The dictionary will still
               | disambiguate the keys.                   >>> a = -1
               | >>> b = -2         >>> {a: a, b: b}         {-1: -1, -2:
               | -2}
        
               | kstrauser wrote:
               | Yep, absolutely. End programmers will never see that
               | unless they go digging into the CPython code. It's
               | otherwise invisible.
        
             | nneonneo wrote:
             | I bet -2 is _way_ less common than -1 in a typical
             | codebase, particularly a C one.
        
         | nayuki wrote:
         | In other words, this is a case of
         | https://en.wikipedia.org/wiki/In-band_signaling versus out-of-
         | band signaling
         | https://en.wikipedia.org/wiki/Signaling_(telecommunications)...
         | .
        
           | EricRiese wrote:
           | _Clears throat_ the precise wiki page you 're looking for is
           | the Semipredicate Problem
           | 
           | https://en.wikipedia.org/wiki/Semipredicate_problem
        
       | kstrauser wrote:
       | Because the hash function is only reliably useful for laying out
       | dict keys, and its outputs are totally opaque and coincidental.
       | 
       | The pigeonhole principle says there are an infinite number of
       | inputs with the same hash output. Trying to figure out _how_ that
       | happens can be fun and enlightening, but _why_? "Because it just
       | does, that's all."
        
         | Etheryte wrote:
         | I think this is not constructive criticism. There's a good,
         | clear cut reason for this specific overlap (error handling),
         | saying it collides just because is not really useful.
        
           | kstrauser wrote:
           | But even that's an implementation detail that happens to be
           | true today, but could easily disappear tomorrow. There's
           | nothing in the docs saying it has to be that way, or that you
           | can infer anything from a hash output other than that it's
           | going to be consistent within a single execution of that
           | Python program.
        
             | tooltower wrote:
             | This article is not about the API contract of the hash
             | function, or the abstraction it provides. If you are just
             | trying to hash things, you don't need any info here.
             | 
             | It's very much trying to go _under_ the abstraction layer
             | to investigate its behavior. Because it's interesting.
             | 
             | This is very similar to how people investigate performance
             | quirks or security issues.
        
               | kstrauser wrote:
               | I read the article and it's interesting! I learned
               | something from it. From an end user POV it explains the
               | mechanism behind how hash(-1)==hash(-2), which is neat.
               | There's not really a _why_ behind it, though. It wasn 't
               | really planned or designed for -1 to have an unexpected
               | hash value. That's just the value the devs randomly
               | picked as a flag. It could've been -837 just as easily.
        
             | sedatk wrote:
             | It could have been a bug. There's nothing wrong with the
             | question and seeking answers to it. It was an interesting
             | read too.
        
             | alain94040 wrote:
             | If the article title had been "why does the hash function
             | never return -1", you would agree it's not a random
             | behavior and serves a very clear purpose. It just so
             | happens that the original author didn't know that until
             | they did all the digging (nice job by the way).
        
       | pmarreck wrote:
       | A better question might be, why are people still using OO dynamic
       | languages without any typing?
        
         | kstrauser wrote:
         | Not sure. Is anyone using a language like that? Python
         | certainly isn't one.
        
           | worik wrote:
           | > Python certainly isn't one.
           | 
           | I am not a Python expert, but...
           | 
           | Python is described as OO and I thought is is dynamically
           | typed
           | 
           | Not quite "without any typing" but close
        
             | kstrauser wrote:
             | Not really. Python's strongly typed, but dynamically. You
             | can think of Python variables as untyped references to
             | strongly typed values: the _values_ are typed, but their
             | _names_ are not.
        
               | Dylan16807 wrote:
               | I agree with your last sentence, but I think we shouldn't
               | say python itself is "strongly typed".
        
               | kstrauser wrote:
               | Why, specifically? It's commonly described as strongly
               | typed and I'm not aware of an argument against that.
        
               | Dylan16807 wrote:
               | Is the specific reason not clear? You just said it. The
               | variables are untyped.
               | 
               | Strong versus weak is not super solidly defined as
               | referring only to values and not variables.
        
               | kstrauser wrote:
               | > Is the specific reason not clear?
               | 
               | Correct.
               | 
               | > Strong versus weak is not super solidly defined as
               | referring only to values and not variables.
               | 
               | You're right. It's not super solidly defined as referring
               | only to variables and not values, either. If dynamic
               | typing is the same as weak typing, there'd be no point in
               | having both phrases.
        
               | Dylan16807 wrote:
               | >> Is the specific reason not clear?
               | 
               | > Correct.
               | 
               | Is it still unclear with the clarification you didn't
               | quote? I can't tell if I need to explain more or not.
               | 
               | > It's not super solidly defined as referring only to
               | variables and not values, either.
               | 
               | I didn't mean to suggest it was.
               | 
               | > If dynamic typing is the same as weak typing, there'd
               | be no point in having both phrases.
               | 
               | It's not the same. There's a bunch of things referred to
               | as "weak".
        
               | tuveson wrote:
               | > untyped references to strongly typed values
               | 
               | I would call that weakly typed since there is no step
               | before your code runs that validates (or even tries to
               | validate) that your program satisfies a "type system".
               | But since the definition of "strong" and "weak" typing
               | has always been vague, I will just say that it is
               | stupidly typed[1], and that I am a pedantic nerd.
               | 
               | 1. https://danieltuveson.github.io/programming/languages/
               | stupid...
        
               | wruza wrote:
               | This is just hair splitting. What everyone means by
               | "typing" today is typing members in structures and co.
               | Every time you reach for python after a typed language,
               | it feels like going through a barbed minefield
               | blindfolded. The fact that its "world" almost entirely
               | consists of little moving subclasses in subnamespaces
               | doesn't help either. All you can think of python is how
               | to write as few lines of it as possible to return to DX
               | faster. It is my only use case for AI-based development
               | because I cannot stand an extra minute in this
               | abomination without cursing into the editor.
        
               | dragonwriter wrote:
               | > Every time you reach for python after a typed language,
               | it feels like going through a barbed minefield
               | blindfolded.
               | 
               | Using both Python and mandatorily statically typed
               | languages regularly professsional (my main working
               | languages being C#, Python, and a mix of JS and TS),
               | that's not my experience at all.
               | 
               | (Of course, Python has a variety of optional static
               | typecheckers with differing degrees of type inference, as
               | well as supporting explicit type specifications; its not
               | untyped unless you choose to use it that way.)
        
               | kragen wrote:
               | It's not hairsplitting, and it's not what _everyone_
               | means, just the people _you_ talk to. Languages without
               | typing are things like assembly language(s), which
               | underlie everything you do on a computer and is (are)
               | still in the TIOBE Index 's top 20
               | https://www.tiobe.com/tiobe-index, as well as much more
               | obscure languages. The distinction between untyped
               | languages and dynamically typed languages is night and
               | day.
               | 
               | I'm pretty competent in Python, C, Java, JavaScript,
               | assembly, and a number of other languages+, and my
               | experience in Python is nothing like what you're
               | describing, although I do have my own frustrations with
               | it. It's possible that when you're more experienced
               | you'll have a different perspective, but it sounds like
               | you're stuck inside a pretty small bubble right now.
               | 
               | ______
               | 
               | + Last year I also programmed in C++, Lua, C#, bash, Tcl,
               | Emacs Lisp, Forth, Golang, OCaml, Scheme, Perl, Common
               | Lisp, and the ngspice scripting language.
        
               | wruza wrote:
               | I have a similar list, no need to worry about my YoE in
               | various things. I _left_ full-time python many years ago.
               | You 're right about my current bubble though. I find it
               | productive and useful, compared to python which I have to
               | touch from time to time. My worst gripe is that ML is my
               | new interest and it is _all_ python. I 'm basically
               | screwed.
        
               | worik wrote:
               | > Not really. Python's strongly typed, but dynamically
               | 
               | Yea, nah!
               | 
               | You cannot have both.
        
               | kstrauser wrote:
               | That's an interesting and uncommon claim.
        
             | dragonwriter wrote:
             | There are different definitions of most of the terms around
             | typing which make most typing conversations confusing as
             | people talk past each other.
             | 
             | Particularly, there is a common definition of "typed" which
             | is exactly equivalent to "statically typed" under which all
             | so-called "dynamically typed" languages are actually
             | "untyped", and within that system strong and weak typing
             | are either a meaningless distinction or one _within_ the
             | set of statically types languages.
             | 
             | There's also, of course, a common usage within which
             | "dynamically typed" is meaningful and strong vs. weak
             | typing is a separate distinction, usually mostly discussed
             | within languages with dynamic typing, though languages like
             | C being both _statically_ typed and _weakly_ typed has been
             | discussed.
        
         | nayuki wrote:
         | > without any typing?                   >>> "abc" + 321
         | Traceback (most recent call last):           File "<stdin>",
         | line 1, in <module>         TypeError: can only concatenate str
         | (not "int") to str
         | 
         | I rest my case.
        
         | homebrewer wrote:
         | TaPL should be required reading for self proclaimed
         | "engineers". Do untyped languages even exist, or could they?
         | maybe something purely theoretical for ivory tower academics?
         | Even POSIX shell and assembler type their values.
        
           | nayuki wrote:
           | > Do untyped languages even exist, or could they? Even POSIX
           | shell and assembler type their values.
           | 
           | I program in assembly language. Memory is just a big array of
           | bytes. Values don't have types, but each instruction chooses
           | what type to interpret memory as - e.g. uint8, int16,
           | float32, x86 real-mode pointer, long-mode pointer, etc.
        
           | int_19h wrote:
           | I think one can reasonably argue that, if the language only
           | has a single type, it is effectively "untyped".
           | 
           | That would mean languages like Tcl. Or, for that matter, B.
        
       | ks2048 wrote:
       | The next k for which hash(k) != k: 2**61-1
       | print(hash(2**61 - 2))         2305843009213693950
       | print(hash(2**61 - 1))         0
        
       | TRiG_Ireland wrote:
       | I enjoyed this, because it seems to be written by an interested
       | and interesting human, at a level that I could actually follow.
        
       | jiggawatts wrote:
       | This seems like a loaded footgun: anyone writing a custome hash
       | function has to know about -1 being treated like an error and
       | handle it in a similar way.
       | 
       | I can't think of any other language where this kind of thing
       | happens, which means other developers won't expect it either.
       | 
       | I can see the bug report now: "certain records cause errors,
       | occurs only for about one in a few billion."
        
         | krab wrote:
         | I don't think so. This is true for people writing cpython. But
         | if you implement custom `__hash__` method, you should be fine.
        
         | dagw wrote:
         | If you're writing a custom hash function in python for a custom
         | python class, then this won't affect you since it's only
         | treated as an error in the underlying C code.
         | 
         | If you're adding a new type to the core python language then
         | you have to be aware of this, but if you're hacking the C
         | implementation to change the core language then you're probably
         | pretty well versed in the cpython internals, or at least
         | surrounded by people that are.
        
           | dagw wrote:
           | interestingly enough though, it doesn't seem to be possible
           | to write a custom hash function that returns -1
           | >>> class Foo:              def __hash__(self):
           | return -1             >>> f = Foo()        >>> print(hash(f))
           | -2
           | 
           | So if you're doing something where you have a custom __hash__
           | function that you're expecting to return -1 for a certain
           | value and then are testing for value of the hash of an object
           | rather than testing the property of the object directly, then
           | this might bite you. But I cannot think of any reasonable
           | case where you might want to do that.
        
             | omoikane wrote:
             | Perhaps the worry was that because the observed behavior up
             | until now was that hash never returns -1, if it were
             | possible possible to write a custom hash function that
             | returns -1, that might break some unsuspecting code due to
             | Hyrum's Law.
        
               | dragonwriter wrote:
               | If the _built-in_ hash function returned -1 when a custom
               | ___hash__ method_ did so, then custom numeric classes
               | following the hashing guidelines in the Python Language
               | Reference designed to maintain the x==y = >
               | hash(x)==hash(y) invariant would fail to maintain that
               | invariant on the reference implementation of Python due
               | to a CPython implementation detail, which would be a Bad
               | Thing(tm).
        
             | dragonwriter wrote:
             | You can write a custom hash method that returns -1 (and the
             | one you wrote verifiably does.)
             | 
             | The standard (not custom) hash function, in CPython, which
             | calls your custom method and returns the actual value
             | CPython uses for bucketing, will return -2 in that case,
             | though.
             | 
             | Which is necessary, so that, e.g., following the single
             | mathematical function for hashing of numeric types
             | published in the Python Language Reference will preserve
             | x==y implies hash(x)==hash(y) for custom and build in
             | numeric types in CPython where hash(-1) does not follow
             | that pattern.
             | 
             | Even though for -1 itself, the divergence from the rule is
             | in __hash__ itself. This is also consistent with the basic
             | rule that dunder methods are for communication with the
             | standard machinery, but external consumers should use the
             | standard machinery, not the dunder method.
        
       | intalentive wrote:
       | >>> d = {-2: None}
       | 
       | >>> -1 in d
       | 
       | False
       | 
       | What gives?
        
         | krab wrote:
         | Because -1 is not equal to -2. They will fall into the same
         | bucket of the hash map, but the hash collisions are expected
         | and must be accounted for.
        
         | Arnavion wrote:
         | If hashmaps only looked at hash(k1) == hash(k2) and not also k1
         | == k2 they would be quite useless as maps.
        
         | hyperpape wrote:
         | Since hashes aren't guaranteed to be unique, a dictionary
         | should use equality to actually confirm that the object with
         | the given hash matches the key present in the dictionary.
        
         | nayuki wrote:
         | > The __hash__() method should return an integer. The only
         | required property is that objects which compare equal have the
         | same hash value; it is advised to mix together the hash values
         | of the components of the object that also play a part in
         | comparison of objects ...
         | 
         | --
         | https://docs.python.org/3/reference/datamodel.html#object.__...
         | 
         | > The general contract of hashCode is:
         | 
         | > * If two objects are equal according to the equals method,
         | then calling the hashCode method on each of the two objects
         | must produce the same integer result.
         | 
         | > * It is not required that if two objects are unequal
         | according to the equals method, then calling the hashCode
         | method on each of the two objects must produce distinct integer
         | results. However, the programmer should be aware that producing
         | distinct integer results for unequal objects may improve the
         | performance of hash tables.
         | 
         | --
         | https://docs.oracle.com/en/java/javase/23/docs/api/java.base...
        
       | Galanwe wrote:
       | > So, in C, when you design a function, and you want your
       | function to be able to indicate "there was an error", it has to
       | return that error as its return value.
       | 
       | Well, you can also use an errno-like system. It has its own set
       | of drawbacks as well, but it removes the "I need to reserve a
       | sentinel value" problem.
        
         | secondcoming wrote:
         | Out params are thing too.
        
         | kragen wrote:
         | Also you can longjmp().
        
         | sgerenser wrote:
         | Functions using errno typically still use a return value of -1
         | to indicate an error has occurred, then errno just stores the
         | specific error code. Otherwise, how do you even know that you
         | need to check errno?
        
           | BenjiWiebe wrote:
           | You could always check errno and reset it to zero before any
           | function call.
        
       | snickerbockers wrote:
       | >CPython is written in C, and unlike Python, C doesn't have
       | exceptions. So, in C, when you design a function, and you want
       | your function to be able to indicate "there was an error", it has
       | to return that error as its return value
       | 
       | This is just wrong. C doesn't feature 'error handling' as a
       | dedicated form of branching the way many higher level languages
       | do and you're in no way required to use return codes as an error
       | signal. This is a case of bad API design and is entirely python's
       | fault.
        
       | nneonneo wrote:
       | Naturally, this extends to custom types:                 >>>
       | class X:       ...     def __hash__(self): return -1       ...
       | >>> hash(X())       -2
        
       | daxfohl wrote:
       | Another question is why do ints largely hash to themselves? Most
       | hashing guides recommend hashes be far more distributed IIUC.
        
         | woodruffw wrote:
         | Python's `__hash__` used primarily for hash map bucketing. In
         | that context, the identity of an integer is a perfectly
         | cromulent bucketing value.
        
           | zarzavat wrote:
           | If your keys are multiples of the table size then all your
           | keys will hash to the same bucket, in a naive hash table at
           | least.
        
       | vhcr wrote:
       | Somewhat related, hash() returns a result %
       | sys.hash_info.modulus, so if you build a set or a dict with keys
       | multiple of sys.hash_info.modulus, you run into quadratic
       | behavior.                   >>> {i * sys.hash_info.modulus for i
       | in range(5000)}         0.40s         >>> {i *
       | sys.hash_info.modulus for i in range(50000)}         29.34s
       | >>> {i for i in range(50000)}         0.06s
        
       | binary132 wrote:
       | The fact that a C function really only has its return value for
       | communicating success or error status is why most fallible C
       | functions use mutable parameters, also known as output arguments,
       | for their result value or values. This is like, elementary C
       | program design.
        
       ___________________________________________________________________
       (page generated 2025-01-10 23:01 UTC)