Can hashCodes pose a security risk?

This may come as a surprise, but attackers can assume that a piece of software is written in one of the common languages and uses its standard library, which calculates the hashcode of a string in a predictable way. For that reason it is possible to create a large number of entries whose strings all have the same hashcode. If the software relies on hash maps with these strings as keys, then lookups will regularly take linear instead of almost constant time. This can slow down the system to such an extent that it becomes usable for a denial-of-service attack.
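To make this concrete, here is a small sketch in Java. The strings "Aa" and "BB" happen to have the same hashCode (2112), and because String's hash is polynomial, concatenating such blocks yields exponentially many distinct strings that all share one hashcode. The class name and method are of course just made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class HashCollisions {

    // "Aa".hashCode() == "BB".hashCode() == 2112 in Java.
    // Concatenating n such blocks gives 2^n distinct strings
    // that all end up with the same hashcode.
    static List<String> collidingStrings(int blocks) {
        List<String> result = new ArrayList<>();
        result.add("");
        for (int i = 0; i < blocks; i++) {
            List<String> next = new ArrayList<>();
            for (String s : result) {
                next.add(s + "Aa");
                next.add(s + "BB");
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> keys = collidingStrings(10); // 1024 distinct strings
        int h = keys.get(0).hashCode();
        for (String k : keys) {
            if (k.hashCode() != h) throw new AssertionError("unexpected hashcode");
        }
        System.out.println(keys.size() + " strings, all with hashCode " + h);
    }
}
```

Inserting all of these as keys into a hash map would put them into the same bucket, which is exactly the degenerate case the attacker is after.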

The question is what we can do about this. First of all it is necessary to understand where more or less unknown users can enter data into the system, for example when new users register or when already registered users upload information. What would happen in case of such an attack?

At the end of the day we do want to allow legitimate usage of the system. Of course it is possible to detect and stop abusive usage, but such detectors are never fully accurate and produce both „false positives“ and „false negatives“. This is something that a regular security team can understand and address. We need to remember that even the firewall itself can be hit by such an attack, so it is up to its developers to harden it against this, which I hope they do.

From the developer's point of view, we should look at it from another angle. There could be legitimate data that is hard to distinguish from abusive data, so we could just make our application robust enough to handle such input routinely. We need to understand which areas of our software are vulnerable to this kind of attack: where do we hash data that comes from outside? There we can use a hashcode h of the form h(x)=f(\mathrm{md5}(g(x))) or h(x)=f(\mathrm{sha1}(g(x))), where g prepends the string with some „secret“ that is created at startup of the software, then the cryptographic hash is applied, and finally a function f reduces the md5, sha1 or sha256 hash to an integer. Since hash maps only need to remain valid during the runtime of the software, it is possible to change the „secret“ at each startup, thus making it reasonably hard for attackers to create entries that result in the same hashcode, even if they know the workings of the software but do not know the „secret“. A possible way would be a special variant of hash map that uses strings as keys, but uses its own hash function instead of String’s .hashCode() method. This would allow creating a random secret at construction time.
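A minimal sketch of such a keyed hash in Java could look like this, using sha256 as the cryptographic hash; the class name, the 16-byte secret length and the choice of folding the first four digest bytes into the integer are my own assumptions, not a fixed recipe:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

public class KeyedStringHash {

    // fresh random secret, created once at construction (i.e. at startup
    // or when the hash map is created)
    private final byte[] secret = new byte[16];

    public KeyedStringHash() {
        new SecureRandom().nextBytes(secret);
    }

    // h(x) = f(sha256(g(x))):
    // g prepends the secret, sha256 is the cryptographic hash,
    // f folds the 32-byte digest down to an int.
    public int hash(String x) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(secret);                                       // g
            byte[] digest = md.digest(x.getBytes(StandardCharsets.UTF_8));
            return ((digest[0] & 0xff) << 24)                        // f
                 | ((digest[1] & 0xff) << 16)
                 | ((digest[2] & 0xff) << 8)
                 |  (digest[3] & 0xff);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

With the secret unknown, an attacker can no longer precompute strings that collide under this hash, even with full knowledge of the code.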

I have only recently become aware of the weakness of predictable hashcodes, and I do not know of any established answers to this question, so here you can read what I came up with to address it. A simple hashcode function that takes some secret as an additional input might already be sufficient. Just prepending the string with a secret and then calculating the ordinary .hashCode() will not help: it makes the hashcode values unpredictable, but the same pairs of strings will still collide. So it is necessary to have a hashcode h(x, s), with x the input string and s the secret, such that for each x, y, s with x \ne y \wedge h(x, s)=h(y, s) there exists a t with h(x, t) \ne h(y, t). Then the colliding pairs really depend on the choice of the secret and cannot be predicted without knowing the secret.
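The point that prepending a secret does not help can be demonstrated directly. Java's String hash is polynomial, so h(s + x) = h(s) \cdot 31^{\mathrm{len}(x)} + h(x), and for equal-length strings x, y with h(x) = h(y) any prefix s leaves them colliding. A small sketch (class and method names invented for illustration):

```java
public class PrefixDoesNotHelp {

    // Java's String.hashCode is polynomial:
    //   h(s + x) = h(s) * 31^len(x) + h(x)
    // so for equal-length x, y with h(x) == h(y), any prefix s
    // yields h(s + x) == h(s + y): the pair still collides.
    static boolean stillCollides(String secret, String x, String y) {
        return (secret + x).hashCode() == (secret + y).hashCode();
    }

    public static void main(String[] args) {
        // "Aa" and "BB" collide, and keep colliding under any prefix.
        System.out.println(stillCollides("myRandomSecret", "Aa", "BB"));
    }
}
```

This is why the secret has to enter the hash in a non-linear way, as with the cryptographic construction above, rather than as a mere prefix to the ordinary hash.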

What do you think about this issue, and how can it be addressed from the developer side? Please let me know in the comments section.


Unit Tests as Specifications

Quite often I hear the idea that one team should specify a software system and have it developed by another team. Only good documentation and good API contracts are needed, and beyond that no further cooperation and communication is required. Often a complete set of unit tests is recommended as a way, or as a supplement, to specify the requirements. Sounds great.

We have to look twice, though. A good software project invests about 30 to 50 percent of its development effort into unit tests, at least when we program functionality. For GUIs it is another story: real coverage would require much more than 50%, so we usually settle for lower coverage and automate the testing later, when the software is at least somewhat stable and the tests do not have to be rewritten too often.

For a library or some piece of code that is a bit less trivial, the effort for good unit test coverage is more like 60% of the total development, so actually more than the functionality itself.

For APIs that are shared across teams, investing even a bit more is a good idea, because complete coverage is more important there.

So the unit tests alone would in this case require about 50% of the development effort. Adding documentation and the other material around them, it becomes even a bit more than 50%.

In practice this will not work without intensive cooperation, unless the APIs have been in use by many other organizations for a long time and have proven to be stable, useful and well documented.

So what exactly do we gain by letting another team write the code while we just provide a good unit test suite?
