In Java, every object has a method hashCode that is simple to understand but still it’s sometimes forgotten or misused. Here are three things to keep in mind to avoid the common pitfalls.
An object’s hash code allows algorithms and data structures to put objects into compartments, just like letter types in a printer’s type case. The printer puts all “A” types into the compartment for “A”, and he looks for an “A” only in this one compartment. This simple system lets him find types much faster than searching in an unsorted drawer. That’s also the idea of hash-based collections, such as HashMap and HashSet.
Source: Wikimedia Commons
In order to make your class work properly with hash-based collections and other algorithms that rely on hash codes, all hashCode implementations must stick to a simple contract.
The hashCode contract
The contract is explained in the hashCode method’s JavaDoc. It can be roughly summarized with this statement:
Objects that are equal must have the same hash code within a running process
Please note that this does not imply the following common misconceptions:
- Unequal objects must have different hash codes – WRONG!
- Objects with the same hash code must be equal – WRONG!
The contract allows for unequal objects to share the same hash code, such as the “A“ and “µ” objects in the sketch above. In math terms, the mapping from objects to hash codes doesn’t have to be injective or even bijective. This is obvious because the number of possible distinct objects is usually bigger than the number of possible hash codes (232).
Edit: In an earlier version, I mistakenly stated that the hashCode mapping must be injective, but doesn’t have to be bijective, which is obviously wrong. Thanks Lucian for pointing out this mistake!
This contract directly leads to the first rule:
Whenever you implement equals, you MUST also implement hashCode
If you fail to do so, you will end up with broken objects. Why? An object’s hashCode method must take the same fields into account as its equals method. By overriding the equals method, you’re declaring some objects as equal to other objects, but the original hashCode method treats all objects as different. So you will have equal objects with different hash codes. For example, calling contains() on a HashMap will return false, even though the object has been added.
How to write a good hashCode function is beyond the scope of this article, it is perfectly explained in Joshua Bloch’s popular book Effective Java, which should not be missing in a Java developer’s bookshelf.
To be on the safe side, let the Eclipse IDE generate the equals and hashCode functions as a pair: Source > Generate hashCode() and equals()….
To protect yourself, you can also configure Eclipse to detect violations of this rule and display errors for classes that implement equals but not hashCode. Unfortunately, this options is set to “Ignore” by default: Preferences > Java > Compiler > Errors/Warnings, then use the quick filter to search for “hashcode”:
Whenever two different objects have the same hash code, we call this a collision. A collision is nothing critical, it just means that there is more than one object in a single bucket, so a HashMap lookup has to look again to find the right object. A lot of collisions will degrade the performance of a system, but they won’t lead to incorrect results.
But if you mistake the hash code for a unique handle to an object, e.g use it as a key in a Map, then you will sometimes get the wrong object. Because even though collisions are rare, they are inevitable. For example, the Strings
"BB" produce the same hashCode:
Never misuse hashCode as a key
You may object that, unlike the printer’s type case, in Java there are 4,294,967,296 compartments (232 possible int values). With 4 billion slots, collisions seem to be extremely unlikely, right?
Turns out that it’s not so unlikely. Here’s the surprising math of collisions: Please imagine 23 random people in a room. How would you estimate the odds of finding two fellows with the same birthday among them? Pretty low, because there are 365 days in a year? In fact, the odds are about 50%! And with 50 people it’s a save bet. This phenomenon is called the Birthday paradox. Transferred to hash codes, this means that with 77,163 different objects, you have a 50/50 chance for a collision – given that you have an ideal hashCode function, that evenly distributes objects over all available buckets.
The Enron email dataset contains 520,924 emails. Computing the String hash codes of the email contents, I found 50 pairs (and even 2 triples) of different emails with the same hash code. For half a million strings, this is a pretty good result. But the message here is: if you have many data items, collisions will occur. If you were using the hashCode as a key here, you would not immediately notice your mistake. But a few people would get the wrong mail.
HashCodes can change
Finally, there’s one important detail in the hashCode contract that can be quite surprising: hashCode does not guarantee the same result in different executions. Let’s have a look at the JavaDoc:
Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.
This is uncommon, in fact, some classes in the class library even specify the exact formula they use to calculate hash codes (e.g. String). For these classes, the hash code will always be the same. But while most of the hashCode implementations provide stable values, you must not rely on it. As this article points out, there are Java libraries that actually return different hashCode values in different processes and this tends to confuse people. Google’s Protocol Buffers is an example.
Therefore, you should not use the hash code in distributed applications. A remote object may have a different hash code than a local one, even if the two are equal.
Do not use hashCode in distributed applications
Moreover, you should be aware that the implementation of a hashCode function may change from one version to another. Therefore your code should not depend on any particular hash code values. For example, your should not use the hash code to persist state. Next time you run the application, the hash codes of the “same” objects may be different.
The best advice is probably: don’t use hashCode at all, except when you create hash-based algorithms.
An alternative: SHA1
You may know that cryptographic hash codes such as SHA1 are sometimes used to identify objects (Git does this, for example). Is this also unsafe? No. SHA1 uses 160-bit keys, which makes collisions virtually impossible. Even with a gigantic number of objects, the odds of a collision in this space are far below the odds of a meteor crashing the computer that runs your program. This article has a great overview of collision probabilities.
There’s probably more to say about hash codes, but these seem to be the most important things. If there’s anything I’ve missed, I’m happy to hear about it!