hashCode, equals and toString

In many programming languages we are urged to define the methods hashCode, equals and toString. These are their names in Java and in many other JVM languages; elsewhere similar names are used. Some languages, like Perl and Scala, provide decent mechanisms for the language to figure these out itself. In Java we mostly achieve the same by letting the IDE generate them for us or by using a library. This is not as good as having it done without polluting our source code and without resorting to mechanisms like reflection, but it is usually the best choice we have in Java. It does have one advantage: it gives us some control over how the methods are defined, if we are willing to exercise this control.

So why should we bother? equals is an "obvious" concept that we need all the time anyway. hashCode we need when we put objects into a HashMap, a HashSet or something similar. It is very important to follow the basic contract that hashCode and equals must be compatible, that is

$\forall a, b : a.\mathrm{equals}(b) \implies a.\mathrm{hashCode}() == b.\mathrm{hashCode}()$

And equals of course needs to be an equivalence relation.
An earlier article in this blog, "Can hashCodes impose a security risk?", covers aspects that are not repeated here.
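To make the contract concrete, here is a minimal sketch. The class Isbn and its single field are hypothetical and only serve as an illustration; the point is that equals and hashCode are derived from the same field, so equal objects always share a hash code.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class ContractDemo {
    public static void main(String[] args) {
        Isbn a = new Isbn("978-3-16-148410-0");
        Isbn b = new Isbn("978-3-16-148410-0");
        Set<Isbn> set = new HashSet<>();
        set.add(a);
        System.out.println(a.equals(b));                  // true
        System.out.println(a.hashCode() == b.hashCode()); // true
        System.out.println(set.contains(b));              // true
    }
}

// Hypothetical immutable value class, used only for illustration.
final class Isbn {
    private final String value;

    Isbn(String value) {
        this.value = Objects.requireNonNull(value);
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Isbn)) return false;
        return value.equals(((Isbn) other).value);
    }

    @Override
    public int hashCode() {
        // Based on the same field as equals, so a.equals(b) implies equal hash codes.
        return value.hashCode();
    }
}
```

If hashCode were left at the default from Object here, the lookup of b in the set would typically fail although a and b are equal.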

An important observation is that these methods do not fit together with mutability very well. When we mutate objects, their hashCode and equals methods may yield different results than before, but HashSet and HashMap assume that they remain constant. This is not too bad, because usually we use immutable objects like Strings and wrapped primitive numbers as keys for Maps. But as soon as we write hashCode and equals ourselves, this implies that we consider objects of this type as members of HashSets or as keys of HashMaps, and the mutability question arises. One very ugly case is the object that we put into the database using Hibernate or something similar. Usually there is an ID field whose value is generated when we insert into the database, for example from a sequence. It is good to use a sequence from the database, because it provides the most robust and reliable mechanism for creating unique IDs. This ID then becomes the most plausible basis for hashCode, but it is null in the beginning. I have not yet found any really satisfying solution, other than avoiding Hibernate and JPA. Seriously, I do think that plain JDBC or a framework with less "magic", like MyBatis or Slick, is a better approach. But that is just a special case of a more general issue. So for objects that have not yet made the round trip to the database, hashCode and equals should be considered dangerous.
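The problem with mutable keys can be reproduced in a few lines. This is only a sketch: MutablePoint is a hypothetical class with deliberately public mutable fields, and the behaviour shown is what java.util.HashSet in OpenJDK does when the hash code of a stored element changes.

```java
import java.util.HashSet;
import java.util.Set;

class MutableKeyDemo {
    public static void main(String[] args) {
        MutablePoint p = new MutablePoint(1, 2);
        Set<MutablePoint> set = new HashSet<>();
        set.add(p);
        System.out.println(set.contains(p)); // true

        p.x = 5; // mutate the object while it is stored in the set

        // The set still holds p, but under its old hash code.
        System.out.println(set.contains(p)); // false
        System.out.println(set.size());      // 1
    }
}

// Hypothetical mutable class, used only for illustration.
final class MutablePoint {
    int x;
    int y;

    MutablePoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof MutablePoint)) return false;
        MutablePoint that = (MutablePoint) other;
        return x == that.x && y == that.y;
    }

    @Override
    public int hashCode() {
        return 31 * x + y; // changes whenever x or y changes
    }
}
```

An entity whose hashCode is based on a generated ID behaves the same way: the ID changing from null to a real value after the insert is exactly such a mutation.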

Next there is the question of which attributes equality should be based on. It can be optimized for hashing by basing it on a minimal unique subset of attributes. Or it can express equality of all attributes, possibly excluding volatile caching attributes where such things exist. With large hash tables the choice makes a difference, because the second variant has to look into many more attributes, which do not change the result, at least for every comparison that succeeds. The order in which the attributes are compared also matters. It is usually good to start with the attributes that are most likely to differ, so that in case of inequality only one or a few comparisons are needed.
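A small sketch of the ordering idea follows. The class Customer and the assumption about which of its fields is most selective are hypothetical; in a real project the order would follow from what is known about the data.

```java
import java.util.Objects;

class EqualsOrderDemo {
    public static void main(String[] args) {
        Customer a = new Customer(1001L, "CH", "Meier");
        Customer b = new Customer(1001L, "CH", "Meier");
        Customer c = new Customer(1002L, "CH", "Meier");
        System.out.println(a.equals(b)); // true
        System.out.println(a.equals(c)); // false, decided by the first comparison
    }
}

// Hypothetical class, used only for illustration.
final class Customer {
    private final long customerNumber; // assumed to be almost unique
    private final String countryCode;  // assumed to have only few distinct values
    private final String name;

    Customer(long customerNumber, String countryCode, String name) {
        this.customerNumber = customerNumber;
        this.countryCode = Objects.requireNonNull(countryCode);
        this.name = Objects.requireNonNull(name);
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Customer)) return false;
        Customer that = (Customer) other;
        // && short-circuits: the cheap and most selective comparison comes first,
        // the attribute that rarely distinguishes two objects comes last.
        return customerNumber == that.customerNumber
                && name.equals(that.name)
                && countryCode.equals(that.countryCode);
    }

    @Override
    public int hashCode() {
        // Uses only the most selective attribute, which keeps the contract with equals.
        return Long.hashCode(customerNumber);
    }
}
```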

For hashCode it is not wrong to base it on the same attributes that are used for the equals comparison, using the usual pattern of calculating the hash codes of the parts and multiplying them by different powers of some two-digit prime number before adding them. It is often wise to choose a subset of these attributes that differs most of the time and provides high selectivity. Then collisions are rare and the calculation of the hash code is efficient.
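The pattern with powers of a prime can be sketched like this. The class Address, the choice of 31 as the prime and the choice of zip and street as the selective subset are assumptions made for the example.

```java
import java.util.Objects;

class HashCodeDemo {
    public static void main(String[] args) {
        Address a = new Address("Main St 1", "8000", "Zurich");
        Address b = new Address("Main St 1", "8000", "Zurich");
        Address c = new Address("Main St 1", "8000", "Zuerich");
        System.out.println(a.equals(b) && a.hashCode() == b.hashCode()); // true
        // c differs only in an attribute that hashCode ignores: a collision,
        // which is allowed, because equals still tells them apart.
        System.out.println(a.hashCode() == c.hashCode()); // true
        System.out.println(a.equals(c));                  // false
    }
}

// Hypothetical class, used only for illustration.
final class Address {
    private final String street;
    private final String zip;
    private final String city;

    Address(String street, String zip, String city) {
        this.street = Objects.requireNonNull(street);
        this.zip = Objects.requireNonNull(zip);
        this.city = Objects.requireNonNull(city);
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Address)) return false;
        Address that = (Address) other;
        return zip.equals(that.zip) && street.equals(that.street) && city.equals(that.city);
    }

    @Override
    public int hashCode() {
        // Subset of the equals attributes: 31 * hash(zip) + hash(street).
        int h = zip.hashCode();
        h = 31 * h + street.hashCode();
        return h;
    }
}
```

Because hashCode uses only attributes that equals also compares, equal objects still get equal hash codes.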

Now the third method in the "club" is usually toString(). I have a habit of defining toString, because it is useful for logging and sometimes even for debugging. I recommend making it short and expressive. So I prefer the format
className(attr1=val1 attr2=val2 attr3=val3)
with className being the name of the actual class of this object without the package, as obtained by
getClass().getSimpleName()
and only including attributes that are of real interest. Commas are not necessary and should be avoided; they are just useless noise. It does not matter if the parentheses are "()" or "[]" or "{}" or "«»", but why not make it consistent within the project? If attribute values are strings and contain spaces, it might be a good idea to quote them. If they contain non-printable characters or quotation marks, escaping may be a good idea. For a really complete representation with all attributes, a method toLongString() can be defined. Usually log files are already too cluttered, so it is good to keep them concise and avoid noise.
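A toString in this format could look like the following sketch. The class Order, its fields and the helper quote are hypothetical; the helper only handles spaces, quotation marks and backslashes and leaves other non-printable characters alone.

```java
import java.util.Objects;

class ToStringDemo {
    public static void main(String[] args) {
        System.out.println(new Order(42L, "Jane Doe")); // Order(id=42 customer="Jane Doe")
        System.out.println(new Order(7L, "Bob"));       // Order(id=7 customer=Bob)
    }
}

// Hypothetical class, used only for illustration.
final class Order {
    private final long id;
    private final String customer;

    Order(long id, String customer) {
        this.id = id;
        this.customer = Objects.requireNonNull(customer);
    }

    @Override
    public String toString() {
        // Simple class name without package, attributes separated by spaces, no commas.
        return getClass().getSimpleName() + "(id=" + id + " customer=" + quote(customer) + ")";
    }

    // Quotes a value only if it contains a space or a quotation mark;
    // backslashes and quotation marks inside a quoted value are escaped.
    private static String quote(String s) {
        if (s.indexOf(' ') < 0 && s.indexOf('"') < 0) {
            return s;
        }
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```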
