[HN Gopher] Why is hash(-1) == hash(-2) in Python?
___________________________________________________________________
Why is hash(-1) == hash(-2) in Python?
Author : alexmolas
Score : 78 points
Date : 2025-01-07 22:23 UTC (3 days ago)
(HTM) web link (omairmajid.com)
(TXT) w3m dump (omairmajid.com)
| Terr_ wrote:
| In a way this is an argument for languages where it's normal to
| have result-types [0] or exceptions, instead of -1 etc. It just
| feels _wrong_ to have any potential collision or confusion
| between normal results and error results.
|
| [0] https://en.wikipedia.org/wiki/Result_type
| kevinventullo wrote:
| We all know that types are not Pythonic.
|
| (I'm only mostly joking)
| folkrav wrote:
| Ackchyually, Python has pretty strong typing, as far as
| dynamic languages go.
| bhickey wrote:
| Sure. Everything is a PyObject.
| Insanity wrote:
| If you use the actual type system and something like mypy,
| Python is a joy work with. (For my definition of joy, which
| includes static typing).
| IshKebab wrote:
| Actually writing Python is reasonably nice in that case.
|
| But dealing with Python infrastructure is so awful as to
| make the whole experience just bad.
|
| uv fixes a lot of that, but I think it will be some time
| before it's used everywhere, and I have zero hope that
| the Python devs will ever do the right thing and
| officially replace Pip with uv.
| arccy wrote:
| you also have to trawl through the flags to make sure it
| actually checks types, e.g. check_untyped_defs
| kstrauser wrote:
| Not really, I don't think. Python code doesn't ever really use
| hash() for anything where specific outputs are expected. Even
| if hash(a)==hash(b), it's not implied that a==b or a is b.
| TRiG_Ireland wrote:
| Yes, but having result types would avoid the need for this
| special casing.
| tmvphil wrote:
| But -2 seems like just a bad choice here. -1 and -2 are very
| "common" integers. Seems they could have just picked a quasi-
| random large integer to reduce the likelihood of collision. I
| expect hash functions to have collisions, but I expect it to
| be "unlikely" in a way that scales with the size of the hash.
| kstrauser wrote:
| I don't think that matters here, though. This specific hash
| function is literally used just for putting values in dict
| buckets. The docs say:
|
| > Hash values are integers. They are used to quickly
| compare dictionary keys during a dictionary lookup.
|
| They're not to be used as a crypto digest. The downside
| here, then is that if -1 and -2 are used as dict keys,
| they'll end up bucketed together. Apparently that hasn't
| been enough of an issue over the years to bother changing
| it. One advantage is that those values are common enough
| that it might break code that incorrectly expects hash() to
| have predictable outputs. If it used 0xffffffff as the
| value, lots of busted tests may never stumble across it.
| itishappy wrote:
| > The downside here, then is that if -1 and -2 are used
| as dict keys, they'll end up bucketed together.
|
| Note that they'll only end up bucketed together in the
| underlying hashmap. The dictionary will still
| disambiguate the keys. >>> a = -1
| >>> b = -2 >>> {a: a, b: b} {-1: -1, -2:
| -2}
| kstrauser wrote:
| Yep, absolutely. End programmers will never see that
| unless they go digging into the CPython code. It's
| otherwise invisible.
| nneonneo wrote:
| I bet -2 is _way_ less common than -1 in a typical
| codebase, particularly a C one.
| nayuki wrote:
| In other words, this is a case of
| https://en.wikipedia.org/wiki/In-band_signaling versus out-of-
| band signaling
| https://en.wikipedia.org/wiki/Signaling_(telecommunications)...
| .
| EricRiese wrote:
| _Clears throat_ the precise wiki page you 're looking for is
| the Semipredicate Problem
|
| https://en.wikipedia.org/wiki/Semipredicate_problem
| kstrauser wrote:
| Because the hash function is only reliably useful for laying out
| dict keys, and its outputs are totally opaque and coincidental.
|
| The pigeonhole principle says there are an infinite number of
| inputs with the same hash output. Trying to figure out _how_ that
| happens can be fun and enlightening, but _why_? "Because it just
| does, that's all."
| Etheryte wrote:
| I think this is not constructive criticism. There's a good,
| clear cut reason for this specific overlap (error handling),
| saying it collides just because is not really useful.
| kstrauser wrote:
| But even that's an implementation detail that happens to be
| true today, but could easily disappear tomorrow. There's
| nothing in the docs saying it has to be that way, or that you
| can infer anything from a hash output other than that it's
| going to be consistent within a single execution of that
| Python program.
| tooltower wrote:
| This article is not about the API contract of the hash
| function, or the abstraction it provides. If you are just
| trying to hash things, you don't need any info here.
|
| It's very much trying to go _under_ the abstraction layer
| to investigate its behavior. Because it's interesting.
|
| This is very similar to how people investigate performance
| quirks or security issues.
| kstrauser wrote:
| I read the article and it's interesting! I learned
| something from it. From an end user POV it explains the
| mechanism behind how hash(-1)==hash(-2), which is neat.
| There's not really a _why_ behind it, though. It wasn 't
| really planned or designed for -1 to have an unexpected
| hash value. That's just the value the devs randomly
| picked as a flag. It could've been -837 just as easily.
| sedatk wrote:
| It could have been a bug. There's nothing wrong with the
| question and seeking answers to it. It was an interesting
| read too.
| alain94040 wrote:
| If the article title had been "why does the hash function
| never return -1", you would agree it's not a random
| behavior and serves a very clear purpose. It just so
| happens that the original author didn't know that until
| they did all the digging (nice job by the way).
| pmarreck wrote:
| A better question might be, why are people still using OO dynamic
| languages without any typing?
| kstrauser wrote:
| Not sure. Is anyone using a language like that? Python
| certainly isn't one.
| worik wrote:
| > Python certainly isn't one.
|
| I am not a Python expert, but...
|
| Python is described as OO and I thought is is dynamically
| typed
|
| Not quite "without any typing" but close
| kstrauser wrote:
| Not really. Python's strongly typed, but dynamically. You
| can think of Python variables as untyped references to
| strongly typed values: the _values_ are typed, but their
| _names_ are not.
| Dylan16807 wrote:
| I agree with your last sentence, but I think we shouldn't
| say python itself is "strongly typed".
| kstrauser wrote:
| Why, specifically? It's commonly described as strongly
| typed and I'm not aware of an argument against that.
| Dylan16807 wrote:
| Is the specific reason not clear? You just said it. The
| variables are untyped.
|
| Strong versus weak is not super solidly defined as
| referring only to values and not variables.
| kstrauser wrote:
| > Is the specific reason not clear?
|
| Correct.
|
| > Strong versus weak is not super solidly defined as
| referring only to values and not variables.
|
| You're right. It's not super solidly defined as referring
| only to variables and not values, either. If dynamic
| typing is the same as weak typing, there'd be no point in
| having both phrases.
| Dylan16807 wrote:
| >> Is the specific reason not clear?
|
| > Correct.
|
| Is it still unclear with the clarification you didn't
| quote? I can't tell if I need to explain more or not.
|
| > It's not super solidly defined as referring only to
| variables and not values, either.
|
| I didn't mean to suggest it was.
|
| > If dynamic typing is the same as weak typing, there'd
| be no point in having both phrases.
|
| It's not the same. There's a bunch of things referred to
| as "weak".
| tuveson wrote:
| > untyped references to strongly typed values
|
| I would call that weakly typed since there is no step
| before your code runs that validates (or even tries to
| validate) that your program satisfies a "type system".
| But since the definition of "strong" and "weak" typing
| has always been vague, I will just say that it is
| stupidly typed[1], and that I am a pedantic nerd.
|
| 1. https://danieltuveson.github.io/programming/languages/
| stupid...
| wruza wrote:
| This is just hair splitting. What everyone means by
| "typing" today is typing members in structures and co.
| Every time you reach for python after a typed language,
| it feels like going through a barbed minefield
| blindfolded. The fact that its "world" almost entirely
| consists of little moving subclasses in subnamespaces
| doesn't help either. All you can think of python is how
| to write as few lines of it as possible to return to DX
| faster. It is my only use case for AI-based development
| because I cannot stand an extra minute in this
| abomination without cursing into the editor.
| dragonwriter wrote:
| > Every time you reach for python after a typed language,
| it feels like going through a barbed minefield
| blindfolded.
|
| Using both Python and mandatorily statically typed
| languages regularly professsional (my main working
| languages being C#, Python, and a mix of JS and TS),
| that's not my experience at all.
|
| (Of course, Python has a variety of optional static
| typecheckers with differing degrees of type inference, as
| well as supporting explicit type specifications; its not
| untyped unless you choose to use it that way.)
| kragen wrote:
| It's not hairsplitting, and it's not what _everyone_
| means, just the people _you_ talk to. Languages without
| typing are things like assembly language(s), which
| underlie everything you do on a computer and is (are)
| still in the TIOBE Index 's top 20
| https://www.tiobe.com/tiobe-index, as well as much more
| obscure languages. The distinction between untyped
| languages and dynamically typed languages is night and
| day.
|
| I'm pretty competent in Python, C, Java, JavaScript,
| assembly, and a number of other languages+, and my
| experience in Python is nothing like what you're
| describing, although I do have my own frustrations with
| it. It's possible that when you're more experienced
| you'll have a different perspective, but it sounds like
| you're stuck inside a pretty small bubble right now.
|
| ______
|
| + Last year I also programmed in C++, Lua, C#, bash, Tcl,
| Emacs Lisp, Forth, Golang, OCaml, Scheme, Perl, Common
| Lisp, and the ngspice scripting language.
| wruza wrote:
| I have a similar list, no need to worry about my YoE in
| various things. I _left_ full-time python many years ago.
| You 're right about my current bubble though. I find it
| productive and useful, compared to python which I have to
| touch from time to time. My worst gripe is that ML is my
| new interest and it is _all_ python. I 'm basically
| screwed.
| worik wrote:
| > Not really. Python's strongly typed, but dynamically
|
| Yea, nah!
|
| You cannot have both.
| kstrauser wrote:
| That's an interesting and uncommon claim.
| dragonwriter wrote:
| There are different definitions of most of the terms around
| typing which make most typing conversations confusing as
| people talk past each other.
|
| Particularly, there is a common definition of "typed" which
| is exactly equivalent to "statically typed" under which all
| so-called "dynamically typed" languages are actually
| "untyped", and within that system strong and weak typing
| are either a meaningless distinction or one _within_ the
| set of statically types languages.
|
| There's also, of course, a common usage within which
| "dynamically typed" is meaningful and strong vs. weak
| typing is a separate distinction, usually mostly discussed
| within languages with dynamic typing, though languages like
| C being both _statically_ typed and _weakly_ typed has been
| discussed.
| nayuki wrote:
| > without any typing? >>> "abc" + 321
| Traceback (most recent call last): File "<stdin>",
| line 1, in <module> TypeError: can only concatenate str
| (not "int") to str
|
| I rest my case.
| homebrewer wrote:
| TaPL should be required reading for self proclaimed
| "engineers". Do untyped languages even exist, or could they?
| maybe something purely theoretical for ivory tower academics?
| Even POSIX shell and assembler type their values.
| nayuki wrote:
| > Do untyped languages even exist, or could they? Even POSIX
| shell and assembler type their values.
|
| I program in assembly language. Memory is just a big array of
| bytes. Values don't have types, but each instruction chooses
| what type to interpret memory as - e.g. uint8, int16,
| float32, x86 real-mode pointer, long-mode pointer, etc.
| int_19h wrote:
| I think one can reasonably argue that, if the language only
| has a single type, it is effectively "untyped".
|
| That would mean languages like Tcl. Or, for that matter, B.
| ks2048 wrote:
| The next k for which hash(k) != k: 2**61-1
| print(hash(2**61 - 2)) 2305843009213693950
| print(hash(2**61 - 1)) 0
| TRiG_Ireland wrote:
| I enjoyed this, because it seems to be written by an interested
| and interesting human, at a level that I could actually follow.
| jiggawatts wrote:
| This seems like a loaded footgun: anyone writing a custome hash
| function has to know about -1 being treated like an error and
| handle it in a similar way.
|
| I can't think of any other language where this kind of thing
| happens, which means other developers won't expect it either.
|
| I can see the bug report now: "certain records cause errors,
| occurs only for about one in a few billion."
| krab wrote:
| I don't think so. This is true for people writing cpython. But
| if you implement custom `__hash__` method, you should be fine.
| dagw wrote:
| If you're writing a custom hash function in python for a custom
| python class, then this won't affect you since it's only
| treated as an error in the underlying C code.
|
| If you're adding a new type to the core python language then
| you have to be aware of this, but if you're hacking the C
| implementation to change the core language then you're probably
| pretty well versed in the cpython internals, or at least
| surrounded by people that are.
| dagw wrote:
| interestingly enough though, it doesn't seem to be possible
| to write a custom hash function that returns -1
| >>> class Foo: def __hash__(self):
| return -1 >>> f = Foo() >>> print(hash(f))
| -2
|
| So if you're doing something where you have a custom __hash__
| function that you're expecting to return -1 for a certain
| value and then are testing for value of the hash of an object
| rather than testing the property of the object directly, then
| this might bite you. But I cannot think of any reasonable
| case where you might want to do that.
| omoikane wrote:
| Perhaps the worry was that because the observed behavior up
| until now was that hash never returns -1, if it were
| possible possible to write a custom hash function that
| returns -1, that might break some unsuspecting code due to
| Hyrum's Law.
| dragonwriter wrote:
| If the _built-in_ hash function returned -1 when a custom
| ___hash__ method_ did so, then custom numeric classes
| following the hashing guidelines in the Python Language
| Reference designed to maintain the x==y = >
| hash(x)==hash(y) invariant would fail to maintain that
| invariant on the reference implementation of Python due
| to a CPython implementation detail, which would be a Bad
| Thing(tm).
| dragonwriter wrote:
| You can write a custom hash method that returns -1 (and the
| one you wrote verifiably does.)
|
| The standard (not custom) hash function, in CPython, which
| calls your custom method and returns the actual value
| CPython uses for bucketing, will return -2 in that case,
| though.
|
| Which is necessary, so that, e.g., following the single
| mathematical function for hashing of numeric types
| published in the Python Language Reference will preserve
| x==y implies hash(x)==hash(y) for custom and build in
| numeric types in CPython where hash(-1) does not follow
| that pattern.
|
| Even though for -1 itself, the divergence from the rule is
| in __hash__ itself. This is also consistent with the basic
| rule that dunder methods are for communication with the
| standard machinery, but external consumers should use the
| standard machinery, not the dunder method.
| intalentive wrote:
| >>> d = {-2: None}
|
| >>> -1 in d
|
| False
|
| What gives?
| krab wrote:
| Because -1 is not equal to -2. They will fall into the same
| bucket of the hash map, but the hash collisions are expected
| and must be accounted for.
| Arnavion wrote:
| If hashmaps only looked at hash(k1) == hash(k2) and not also k1
| == k2 they would be quite useless as maps.
| hyperpape wrote:
| Since hashes aren't guaranteed to be unique, a dictionary
| should use equality to actually confirm that the object with
| the given hash matches the key present in the dictionary.
| nayuki wrote:
| > The __hash__() method should return an integer. The only
| required property is that objects which compare equal have the
| same hash value; it is advised to mix together the hash values
| of the components of the object that also play a part in
| comparison of objects ...
|
| --
| https://docs.python.org/3/reference/datamodel.html#object.__...
|
| > The general contract of hashCode is:
|
| > * If two objects are equal according to the equals method,
| then calling the hashCode method on each of the two objects
| must produce the same integer result.
|
| > * It is not required that if two objects are unequal
| according to the equals method, then calling the hashCode
| method on each of the two objects must produce distinct integer
| results. However, the programmer should be aware that producing
| distinct integer results for unequal objects may improve the
| performance of hash tables.
|
| --
| https://docs.oracle.com/en/java/javase/23/docs/api/java.base...
| Galanwe wrote:
| > So, in C, when you design a function, and you want your
| function to be able to indicate "there was an error", it has to
| return that error as its return value.
|
| Well, you can also use an errno-like system. It has its own set
| of drawbacks as well, but it removes the "I need to reserve a
| sentinel value" problem.
| secondcoming wrote:
| Out params are thing too.
| kragen wrote:
| Also you can longjmp().
| sgerenser wrote:
| Functions using errno typically still use a return value of -1
| to indicate an error has occurred, then errno just stores the
| specific error code. Otherwise, how do you even know that you
| need to check errno?
| BenjiWiebe wrote:
| You could always check errno and reset it to zero before any
| function call.
| snickerbockers wrote:
| >CPython is written in C, and unlike Python, C doesn't have
| exceptions. So, in C, when you design a function, and you want
| your function to be able to indicate "there was an error", it has
| to return that error as its return value
|
| This is just wrong. C doesn't feature 'error handling' as a
| dedicated form of branching the way many higher level languages
| do and you're in no way required to use return codes as an error
| signal. This is a case of bad API design and is entirely python's
| fault.
| nneonneo wrote:
| Naturally, this extends to custom types: >>>
| class X: ... def __hash__(self): return -1 ...
| >>> hash(X()) -2
| daxfohl wrote:
| Another question is why do ints largely hash to themselves? Most
| hashing guides recommend hashes be far more distributed IIUC.
| woodruffw wrote:
| Python's `__hash__` used primarily for hash map bucketing. In
| that context, the identity of an integer is a perfectly
| cromulent bucketing value.
| zarzavat wrote:
| If your keys are multiples of the table size then all your
| keys will hash to the same bucket, in a naive hash table at
| least.
| vhcr wrote:
| Somewhat related, hash() returns a result %
| sys.hash_info.modulus, so if you build a set or a dict with keys
| multiple of sys.hash_info.modulus, you run into quadratic
| behavior. >>> {i * sys.hash_info.modulus for i
| in range(5000)} 0.40s >>> {i *
| sys.hash_info.modulus for i in range(50000)} 29.34s
| >>> {i for i in range(50000)} 0.06s
| binary132 wrote:
| The fact that a C function really only has its return value for
| communicating success or error status is why most fallible C
| functions use mutable parameters, also known as output arguments,
| for their result value or values. This is like, elementary C
| program design.
___________________________________________________________________
(page generated 2025-01-10 23:01 UTC)