Can hashCodes impose a security risk?

This may come as a surprise, but attackers can assume that software is running in one of the common languages with their standard library. This calculates the hashcode of a string in a predictable way. For that reason it is possible, to create a large number of entries that result in strings having the same hashcode. If this software relies on hashmaps using this string as a key, then lookups will regularly use linear time instead of almost constant time. This might slow down the system to such an extent that it might be used for a denial of service attack.

The question is, what we can do about this. First of all it is necessary to understand, where are places that can be used by more or less unknown users to enter data into the system, for example registration of new users or some upload of information of users that are registered already. What would happen in case of such an attack?

In the end of the day we do want to allow legitimate usage of the system. Of course it is possible, to discover and stop abusive usage, but these detectors have a tendency to be accurate and create both „false positives“ and „false negatives“. This is something that a regular security team can understand and address. We need to remember, that maybe even the firewalls itself can be attacked by such an attack. So it is up to the developers to harden it against such an attack, which I hope they do.

From the developer point of view, we should look at another angle. There could be legitimate data that is hard to distinguish from abusive data, so we could just make our application powerful enough to handle this regularly. We need to understand the areas of our software that are vulnerable by this kind of attack. Where do we have external data that needs to be hashed. Now we can create a hashcode h as h(x)=f(\mathrm{md5}(g(x))) or h(x)=f(\mathrm{sha1}(g(x))), where we prepend the string with some „secret“ that is created a startup of the software, then apply the cryptographic hash and in the end apply a function that reduces the sha1 or sha256 or md5 hash to an integer. Since hash maps only need to remain valid during the runtime of the software, it is possible to change the „secret“ at startup time, thus making it reasonably hard for attackers to create entries that result in the same hashcode, even if they know the workings of the software, but do not know the „secret“. A possible way could be to have a special variant of hash map, that uses strings as keys, but uses its own implementation of hashcode instead of String’s .hashCode()-method. This would allow creating a random secret at construction time.

I have only become aware of the weakness of predictable hashcodes, but I do not know any established answers to this question, so here you can read what I came up with to address this issue. I think that it might be sufficient to have a simple hashcode function that just uses some secret as an input. Just prepending the string with a secret and then calculating the ordinary .hashCode() will not help, because it will make the hashcode unpredictable, but the same pairs of strings will still result in collisions. So it is necessary to have a hashcode h(x, s) with x the input string and s the secret such that for each x, y, s with x \ne y \wedge h(x, s)=h(y, s) there exists a t with h(x, t) \ne h(y, t), so the colliding pairs really depend on the choice of the secret and cannot be predicted without knowing the secret.

What do you think about this issue and how it can be addressed from the developer side? Please let me know in the comments section.

Share Button

Schreibe einen Kommentar

Deine E-Mail-Adresse wird nicht veröffentlicht. Erforderliche Felder sind mit * markiert.