hashCode, equals and toString

In many programming languages we are urged to define the methods hashCode, equals and toString. They have these names in Java and in many JVM languages, and other languages use similar names. Some languages like Perl and Scala provide decent mechanisms that let the language figure these out by itself. In Java we achieve something similar most of the time by letting the IDE generate the methods or by using a library. This is not quite as good as having it done without polluting our source code and without resorting to mechanisms like reflection, but it is usually the best choice we have in Java. It does have an advantage: it gives us some control over how the methods are defined, if we are willing to exercise this control.

So why should we bother? equals is an „obvious“ concept that we need all the time in its own right. hashCode is needed whenever we put objects into HashMaps, HashSets or similar structures. It is very important to follow the basic contract that hashCode and equals must be compatible, that is

    \[\forall a, b : a.\mathrm{equals}(b) \implies a.\mathrm{hashCode}() == b.\mathrm{hashCode}()\]

And equals of course needs to be an equivalence relation.
An earlier article in this blog, „Can hashCodes impose a security risk?“, covers aspects that are not repeated here.

An important observation is that these methods do not fit together well with mutability. When we mutate objects, their hashCode and equals methods can yield different results than before, but HashSet and HashMap assume that they remain constant while the object is stored. This is not too bad, because we usually use immutable objects like Strings and wrapped primitive numbers as keys for Maps. But as soon as we write hashCode and equals ourselves, we imply that objects of this type are considered as members of HashSets or as keys of HashMaps, and the mutability question arises. One very ugly case is the object that we put into the database using Hibernate or something similar. Usually there is an ID field that is generated when we insert into the database, for example from a sequence. It is good to use a database sequence, because it provides the most robust and reliable mechanism for creating unique ids. This id then becomes the most plausible basis for hashCode, but it is null in the beginning. I have not yet found any really satisfying solution, other than avoiding Hibernate and JPA. Seriously, I do think that plain JDBC or a framework with less „magic“, like MyBatis or Slick, is a better approach. But that is just a special case of a more general issue. So for objects that have not yet made the roundtrip to the database, hashCode and equals should be considered dangerous.
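The mutability problem can be made concrete with a small sketch. The class MutablePoint and its fields are hypothetical and exist only for this illustration. After the object has been added to a HashSet and then mutated, the set looks for it under the new hash code and no longer finds it, although it is still stored.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class MutableKeyDemo {
    // Returns whether the set still finds the element after it was mutated.
    static boolean foundAfterMutation() {
        Set<MutablePoint> set = new HashSet<>();
        MutablePoint p = new MutablePoint(1, 2);
        set.add(p);
        p.x = 42; // changes hashCode while p is still filed under the old one
        return set.contains(p);
    }

    public static void main(String[] args) {
        System.out.println("found after mutation: " + foundAfterMutation());
    }
}

// Hypothetical mutable class, used only to illustrate the problem.
final class MutablePoint {
    int x;
    int y;

    MutablePoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof MutablePoint)) return false;
        MutablePoint other = (MutablePoint) o;
        return x == other.x && y == other.y;
    }

    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }
}
```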

Now we have a choice of how to define equality. It can be optimized for hashing by basing it on a minimal unique subset of attributes. Or it can express equality of all attributes, possibly excluding volatile caching attributes if such things exist. When working with large hash tables this makes a difference: comparing all attributes means looking into many more of them, and for every comparison that succeeds the additional attributes do not change the result. The order in which attributes are compared also matters. It is usually good to start with the attributes that are most likely to differ, so that in case of inequality only one or a few comparisons are needed.

For hashCode it is reasonable to base it on the same attributes that are used for the equals comparison, following the usual pattern of calculating the hash codes of the parts, multiplying them with different powers of some two-digit prime number and adding them up. It is often a wise choice to use only a subset of these attributes that differs most of the time and provides high selectivity. Then collisions are rare and the calculation of the hash code is efficient.
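As a minimal sketch of these two ideas, consider a hypothetical class Customer whose field names are pure assumptions for illustration. equals compares the attribute that is most likely to differ first, and hashCode uses only a selective subset of the attributes compared in equals, combined with the prime 31.

```java
import java.util.Objects;

// Hypothetical value class; all names are assumptions for this example.
final class Customer {
    private final long customerNumber; // highly selective and cheap to compare
    private final String name;
    private final String address;

    Customer(long customerNumber, String name, String address) {
        this.customerNumber = customerNumber;
        this.name = name;
        this.address = address;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Customer)) return false;
        Customer other = (Customer) o;
        // The attribute most likely to differ comes first, so that
        // unequal objects are usually rejected after one comparison.
        return customerNumber == other.customerNumber
                && Objects.equals(name, other.name)
                && Objects.equals(address, other.address);
    }

    @Override
    public int hashCode() {
        // Only a subset of the attributes used in equals; this keeps the
        // contract, because equal objects agree on these attributes.
        int h = Long.hashCode(customerNumber);
        h = 31 * h + Objects.hashCode(name);
        return h;
    }

    public static void main(String[] args) {
        Customer a = new Customer(17L, "Ann", "Main Street 1");
        Customer b = new Customer(17L, "Ann", "Main Street 1");
        System.out.println(a.equals(b) + " " + (a.hashCode() == b.hashCode()));
    }
}
```

Two customers that differ only in the address get the same hash code here. That is allowed by the contract and only costs an additional equals call in the rare case that it happens.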

Now the third method in the „club“ is usually toString(). I have a habit of defining toString, because it is useful for logging and sometimes even for debugging. I recommend making it short and expressive. So I prefer the format
className(attr1=val1 attr2=val2 attr3=val3)
with className the name of the actual class of this object without package, as received by
and only including attributes that are of real interest. Commas are not necessary and should be avoided, because they are just useless noise. It does not matter if the parentheses are „()“ or „[]“ or „{}“ or „«»“, but they should be consistent within the project. If attribute values are strings and contain spaces, it might be a good idea to quote them. If they contain non-printable characters or quotation marks, escaping may be a good idea. For a really complete representation with all attributes, a method toLongString() can be defined. Log files are usually already cluttered with too much noise, so it is good to keep them concise.
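A possible sketch of such a toString could look as follows. The class Person, its fields and the helper quote are hypothetical, and the use of getClass().getSimpleName() for the class name without package is an assumption of this example. Only spaces, quotation marks and backslashes are handled; escaping of non-printable characters is left out to keep it short.

```java
// Hypothetical class; names and fields are assumptions for this example.
final class Person {
    private final String name;
    private final int age;

    Person(String name, int age) {
        this.name = name;
        this.age = age;
    }

    @Override
    public String toString() {
        // Simple class name without package, attributes separated by spaces
        return getClass().getSimpleName() + "(name=" + quote(name) + " age=" + age + ")";
    }

    // Quotes a value only if it contains a space or a quotation mark.
    private static String quote(String s) {
        if (s == null) return "null";
        if (s.indexOf(' ') < 0 && s.indexOf('"') < 0) return s;
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(new Person("Ann", 30));
        System.out.println(new Person("Ann Lee", 30));
    }
}
```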


Can hashCodes impose a security risk?

This may come as a surprise, but attackers can assume that software is written in one of the common languages and uses its standard library, which calculates the hashcode of a string in a predictable way. For that reason it is possible to create a large number of entries whose strings all have the same hashcode. If the software relies on hashmaps using these strings as keys, lookups will regularly take linear time instead of almost constant time. This might slow down the system to such an extent that it can be used for a denial of service attack.

The question is what we can do about this. First of all it is necessary to understand where more or less unknown users can enter data into the system, for example the registration of new users or uploads of information by users that are registered already. What would happen in case of such an attack?

At the end of the day we do want to allow legitimate usage of the system. Of course it is possible to detect and stop abusive usage, but such detectors tend to be inaccurate and produce both „false positives“ and „false negatives“. This is something that a regular security team can understand and address. We need to remember that maybe even the firewalls themselves can be hit by such an attack. So it is up to their developers to harden them against it, which I hope they do.

From the developer point of view, we should look at it from another angle. There could be legitimate data that is hard to distinguish from abusive data, so we could just make our application robust enough to handle both routinely. We need to understand which areas of our software are vulnerable to this kind of attack: where do we have external data that needs to be hashed? There we can create a hashcode \(h\) as \(h(x)=f(\mathrm{md5}(g(x)))\) or \(h(x)=f(\mathrm{sha1}(g(x)))\), where \(g\) prepends the string with some „secret“ that is created at startup of the software, then the cryptographic hash is applied, and in the end \(f\) reduces the sha1 or sha256 or md5 hash to an integer. Since hash maps only need to remain valid during the runtime of the software, it is possible to change the „secret“ at each startup, thus making it reasonably hard for attackers to create entries that result in the same hashcode, even if they know the workings of the software but not the „secret“. A possible way could be a special variant of hash map that uses strings as keys, but uses its own implementation of the hashcode instead of String’s .hashCode() method. This would allow creating a random secret at construction time.
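The construction described above could be sketched roughly like this. The class name SecretHasher, the 16 byte secret and the choice of SHA-256 together with the first four digest bytes as the reducing function are assumptions of this example, not an established recipe.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

// Hypothetical helper: hashes strings with a per-instance secret.
final class SecretHasher {
    private final byte[] secret;

    // Creates a hasher with a fresh random secret, e.g. once per startup.
    SecretHasher() {
        this(randomSecret());
    }

    SecretHasher(byte[] secret) {
        this.secret = secret.clone();
    }

    private static byte[] randomSecret() {
        byte[] s = new byte[16];
        new SecureRandom().nextBytes(s);
        return s;
    }

    int hash(String x) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(secret); // g: prepend the secret to the input
            byte[] d = md.digest(x.getBytes(StandardCharsets.UTF_8));
            // f: reduce the digest to an int by taking its first four bytes
            return ((d[0] & 0xff) << 24) | ((d[1] & 0xff) << 16)
                    | ((d[2] & 0xff) << 8) | (d[3] & 0xff);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SecretHasher hasher = new SecretHasher();
        System.out.println(hasher.hash("Aa") + " " + hasher.hash("BB"));
    }
}
```

A hash map variant as mentioned above would hold one such hasher per instance and use hash(key) where it would otherwise call key.hashCode(). The cryptographic hash is considerably slower than String’s own method, so this is only worthwhile for maps that are filled from untrusted input.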

I have only recently become aware of the weakness of predictable hashcodes, and I do not know any established answers to this question, so here you can read what I came up with to address this issue. I think that it might be sufficient to have a simple hashcode function that just uses some secret as an input. Just prepending the string with a secret and then calculating the ordinary .hashCode() will not help: it makes the hashcode unpredictable, but the same pairs of strings will still collide. So it is necessary to have a hashcode \(h(x, s)\) with \(x\) the input string and \(s\) the secret such that for each \(x, y, s\) with \(x \ne y \wedge h(x, s)=h(y, s)\) there exists a \(t\) with \(h(x, t) \ne h(y, t)\), so the colliding pairs really depend on the choice of the secret and cannot be predicted without knowing the secret.
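The point about prepending a secret can be checked with a short sketch. Java’s String.hashCode() is a polynomial in 31 over the characters, so for two strings of equal length with equal hash codes, any common prefix changes both hash codes in the same way. The strings "Aa" and "BB" are a well-known colliding pair; the class name and the prefix value are just assumptions for this example.

```java
class PrefixCollisionDemo {
    // Checks whether a and b still collide after prepending the same prefix.
    static boolean collide(String prefix, String a, String b) {
        return (prefix + a).hashCode() == (prefix + b).hashCode();
    }

    public static void main(String[] args) {
        // "Aa" and "BB" both hash to 65*31+97 = 66*31+66 = 2112
        System.out.println(collide("", "Aa", "BB"));
        // The prefix changes the values, but not the fact that they are equal
        System.out.println(collide("some-secret-", "Aa", "BB"));
    }
}
```

So a plain prefix does not satisfy the condition on \(h(x, s)\) above, at least for colliding pairs of equal length.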

What do you think about this issue and how it can be addressed from the developer side? Please let me know in the comments section.


Unit Tests as Specifications

Quite often I hear the idea that one team should specify a software system and get the development done by another team. Only good documentation and good API contracts are needed, and beyond that no further cooperation and communication. Often a complete set of unit tests is recommended as a way to specify the requirements, or as a supplement to the specification. Sounds great.

We have to look twice, though. A good software project invests about 30 to 50 percent of its development effort into unit tests, at least if we program functionality. For GUIs it is another story, real coverage would be so much more than 50%, so that we usually go for lower coverage and later automated testing, when the software is at least somewhat stable and the tests do not have to be rewritten too often.

For a library or some piece of code that is a bit less trivial, the effort for a good unit test coverage would be more around 60% of the total development, so actually more than the functionality itself.

Also for APIs that are shared across teams, maybe a bit more is a good idea, because complete coverage is more important.

So the unit tests alone would in this case need about 50% of the development effort. Adding documentation and some other stuff around it, it will become even a bit more than 50%.

In practice it will not work out without intensive cooperation, unless the APIs have been in use by tons of other organizations for a long time and proven to be stable and useful and well documented.

So what exactly are we gaining by this approach of letting another team write the code and just providing a good unit test suite?



Microservices are a hype, but they have their pros and cons. Sometimes people say that they are the magic tool to solve all problems. They hear it in conference talks, read it on the internet or even in books. It is not the first time and it won’t be the last time that we hear a promise like that. It has been made for a lot of new or newly sold technologies, like object oriented languages, scripting languages, functional languages, spiral instead of waterfall, agile instead of RUP, client server instead of dumb terminals, web applications instead of fat clients, fat JavaScript clients instead of server-side rendered pages, cloud technologies, serverless, zero-something and now microservices. And of course a lot more. Usually we are promised an efficiency gain by a factor of two or three. So by now we should be a million times faster in developing a functionality than in the good old days with assembly language on punch cards. Are we?

Such promises should not stop us from being critical and analyzing pros and cons.

But in order to achieve any benefit from microservices it is crucial to understand them well enough and to apply the concept well enough. Badly implemented microservice architectures just add the disadvantages of microservices to what we already have.

It is often heard that microservices should be made so small that it is „trivial“ to rewrite them. That may be hard to achieve and there are some good reasons why.

If we make microservices so small that they are easy to rewrite and we need only a few of them, our application is probably so trivial that we should question whether it is necessary at all to split it into microservices, or whether we should rather build a monolith and structure it internally. On the other hand, if we have a big application, in the best case it can be composed of a huge number of such „trivial“ microservices. And we get a lot of complexity in the combination. Just imagine you are getting a house built, and the construction company just dumps a few truckloads of Lego blocks on the construction site. They are all well designed, well tested and of high quality, and you just need to plug them together. I cannot imagine that this will ever be trivial for a huge number of services and a non-trivial application.

Another problem is that there are typically one or more spots of the application where the real complexity resides. This is not a complexity made by us, but it is there because of the business logic, no matter how well we look at the requirements, understand them, work on them to get something better that is easier to build. Either these microservices tend to become bigger or they are more connected with other parts of the system than would be desirable for a microservice. So we end up with one or a few relatively fat „micro“services and some smaller ones, the smallest ones actually relatively trivial to rewrite. But we can keep everything non-essential out of these central services and make them at least as simple as reasonably possible.

Now we do have issues. How do the services communicate? How do they share data? By the book each service should have its own database schema. And they should not access each other’s DB schemas, because services should be independent from each other and should not use the DB as an integration layer. In practice we see this rule applied, but in the end there is one database server that runs all the schemas for all the microservices. But we could split that up and move the database closer to the service.

Now there is some data that needs to be shared between services. There are several ideas how to accomplish this. The most basic principle is that services should be cut in such a way that little data needs to be shared. But there is also the pattern of having a microservice that operates like a daemon and a companion service that is used for configuring the daemon service. There are some advantages in splitting this up, because the optimizations are totally different, the deployment patterns are different and the availability requirements are different. And it can be a good idea to build a configuration on the config service, test it on a test system and then publish it to the productive service when it is complete, consistent and tested. Still these two services closely belong together, so sharing between them is well understood and works well. It gets more difficult when it comes to data shared between really different services. We assume that we have tried to eliminate such access to data from other services and that it only remains in a few cases. There can be data that really needs to be in one place and the same for everybody. We can for example think of the accounts used to log in. They should be the same throughout the system, without any delays when changes occur. This can be accomplished by having one service with really good availability. In other cases we can consider one service as the owner of the data and publish the data to other services via mechanisms like Kafka. This is powerful, but setting up a good and reliable Kafka infrastructure for high performance and throughput is not trivial. When it comes to multimaster data and bidirectional synchronization, it gets really hard. We should avoid that, but it can be done when absolutely needed.

When services talk to each other, we should prefer mechanisms like JMS or Kafka over REST or SOAP calls, because it avoids tight coupling and works better, if one service is temporarily not available. Also we should avoid letting one service wait for the response from another service. Sometimes REST calls are of course the only reasonable choice and then they should be used. We need to address issues as to what happens if the service is not available and how to find the service.

When moving to a microservice architecture it is very easy to „prove“ that this approach „does not work“ or has „only disadvantages“ over a monolithic architecture. But microservices are a legitimate way to create large applications and we should give this approach a chance to succeed if we have decided to follow this road. It is important to embrace the new style, not to try to copy what we would have done in a monolith, but to do things in the microservice way. And it is important to find the right balance, the right distribution and the right places for some allowed exceptions like REST instead of JMS or data sharing.



ScalaUA 2019

In March 2019 I visited ScalaUA in Kiev.

It was interesting. I attended the following talks:



50 years of internet

According to some, RFC 1, which was published on 1969-04-07, marks the start of the internet. Many RFCs have been published since then, and they describe the standards of the internet.

The early internet did contain functionality to communicate between computers, but it did of course not include HTTP, HTML and the WWW, which were introduced much later, in the early nineties.


www.it-sky-consulting.com now https only

I have converted my company site www.it-sky-consulting.com to always use https.

This is something all sites should do in the next few months.


Weird blackmailing via email from „Hacker“

I got a few emails that looked like this (see at the bottom).

I replaced all references to myself with xxxx. The source of the email indicates, that a mailserver „nmail.brlp.in“ has been used for this.

The fact that the email seems to come from my own mail address is not proof that this guy hacked into my system. With low-level email software it is quite easy to set header fields to any valid value, and this includes the From field of the email.

So, if you get such emails, here is what you can do: report it to the police. This person or organization is criminal and is stealing money from people who do not understand well enough what is happening here. Maybe the police can track down the criminal through international cooperation, maybe not. I uploaded one of these emails to the Swiss federal police, who have a form for such uploads. They gave polite advice, basically asking me not to pay.

And that is important: PLEASE DO NOT PAY. The „person“ or „script“ is just pretending to have access to my system. Even what he claims to have observed is not true, but the headers of the email also give him away as using some mail server and changing the From-line.

I included the whole text, so it is possible to search for it.

Hi, this account is hacked! Modify the password right away!
You might not know anything about me and you obviously are probably wondering why you are receiving this letter, right?
I’mhacker who openedyour emailand OSa few months ago.
Do not waste your time and try out to talk to me or find me, it is definitely hopeless, because I directed you a letter from YOUR own hacked account.
I’ve created special program on the adult videos (porn) website and suppose you spent time on this site to have a good time (you know what I want to say).
During you have been taking a look at videos, your internet browser began to act like a RDP (Remote Control) with a keylogger which gave me the ability to access your monitor and web camera.
Consequently, my softwareaquiredall information.
You wrote passwords on the sites you visited, and I intercepted all of them.
Surely, you’ll be able to modify them, or have already modified them.
Even so it does not matter, my malware renews needed data every time.
What did I do?
I compiled a backup of your system. Of all files and contacts.
I got a dual-screen video recording. The 1 screen presents the clip you had been watching (you have a very good preferences, ha-ha…), and the second screen presents the recording from your own web camera.
What actually do you have to do?
Great, in my view, 1000 USD is a inexpensive amount of money for this little riddle. You will make your payment by bitcoins (in case you don’t understand this, go searching “how to buy bitcoin” in Google).
My bitcoin wallet address:
(It is cAsE sensitive, so copy and paste it).
You have 48 hours in order to make the payment. (I put an exclusive pixel to this message, and at the moment I know that you’ve read this email).
To monitorthe reading of a letterand the actionswithin it, I usea Facebook pixel. Thanks to them. (Everything thatcan be usedfor the authorities may also helpus.)

If I do not get bitcoins, I’ll undoubtedly transfer your recording to each of your contacts, such as family members, co-workers, etc?

The source of the EMail looked like this (shortened a bit):

Received: from xxxxxxxx.xxxxxxxx.com ([xx.xx.xx.xx]) by mx-ha.gmx.net
(mxgmx017 []) with ESMTPS (Nemesis) id 1MeSc2-1hZOnl0zR6-00aZJW
for ; Tue, 05 Mar 2019 14:49:21 +0100
X-Greylist: delayed 440 seconds by postgrey-1.34 at dd29014; Tue, 05 Mar 2019 14:49:18 CET
X-policyd-weight: using cached result; rate: -6.1
Received: from nmail.brlp.in (nmail.brlp.in [])
by xxxxxxxx.xxxxxxxx.com (Postfix) with ESMTPS id DDCCD63C255E
for ; Tue, 5 Mar 2019 14:49:18 +0100 (CET)
Received: from localhost (localhost [])
by nmail.brlp.in (Postfix) with ESMTP id D49CD45242ED
for ; Tue, 5 Mar 2019 19:11:55 +0530 (IST)
Received: from nmail.brlp.in ([])
by localhost (nmail.brlp.in []) (amavisd-new, port 10032)
with ESMTP id yaoBiyeSpTXg for ;
Tue, 5 Mar 2019 19:11:55 +0530 (IST)
Received: from localhost (localhost [])
by nmail.brlp.in (Postfix) with ESMTP id 11F0F452430F
for ; Tue, 5 Mar 2019 19:11:55 +0530 (IST)
X-Virus-Scanned: amavisd-new at brlp.in
Received: from nmail.brlp.in ([])
by localhost (nmail.brlp.in []) (amavisd-new, port 10026)
with ESMTP id ZRHfjiakcy7Q for ;
Tue, 5 Mar 2019 19:11:54 +0530 (IST)
Received: from [216.subnet110-136-205.speedy.telkom.net.id] (unknown [])
by nmail.brlp.in (Postfix) with ESMTPSA id D2C1345242C8
for ; Tue, 5 Mar 2019 19:11:53 +0530 (IST)
Subject: xxxxxxxxxx
To: xxxxx@xxxxx.com
X-aid: 6812375433
Date: Tue, 5 Mar 2019 14:41:53 +0100
X-Complaints-To: abuse@mailer.brlp.in
Organization: Rprgtkvvr
Content-Transfer-Encoding: base64
Content-Type: text/plain; charset=UTF-8
X-GMX-Antispam: 0 (Mail was not recognized as spam); Detail=V3;
X-Spam-Flag: NO
X-UI-Filterresults: notjunk:1;V03:K0:QH4Z6L3Srwk=:mzSkXH/rOihoavgPXEhMTWJI56







New research has analyzed the concept of transactions from a very theoretical point of view. An interesting result of this research was the concept of cisactions, which are in some way the opposite of transactions. A duality between cisactions and transactions has been proven. This means that in principle every application that is based on transactions can also be written with cisactions. But there are some challenges:

  1. currently there is no database that supports cisactions
  2. cisactions are even harder to understand than transactions
  3. the whole program has to be written once again in a totally different way
  4. even the business logic has to be redefined in a totally different way, but eventually the same results can be achieved
  5. the paradigm change from transactions to cisactions is much harder than the change from object-oriented to functional programming
  6. mixing of cisactions and transactions in the same program is almost impossible

But we will see applications that do use cisactions properly in the future. And they will perform about 1000 times faster than programs using transactions.

The interesting question is:

When will the first cisactional databases become available? How will they work? Which programming languages will support cisactions? Or will we rather have to invent totally new programming languages for proper support of cisactions?

A lot of questions still have to be answered, and this is still theoretical research. But it is so promising for high performance usage that we absolutely must expect that this will become an important way to solve real high performance development tasks.


Unicode and C

It is a common practice in C to use arrays of char as strings. The 0 is used as end marker.

The whole thing was created like that in the 1970s and at that time it was kind of cool to get away with one less language feature and to express it in terms of others instead. And people did not think enough about the necessity to express more than ISO 646 IRV (commonly called ASCII) as string content.

This extended out of the box to 8 bit character sets like ISO 8859-1 or KOI-8, which are identical to ISO 646 in the lower 128 characters and contain an extension in the upper 128 characters. But fortunately we have moved ahead and now have Unicode with its encodings UTF-8, UTF-16 and UTF-32.

How can we deal with this in C?

UTF-8 just works out of the box, because the byte 0 is only used to encode the code point U+0000. So the null termination can be kept as it is and a lot of functionality remains valid. Some issues arise, because operations like finding the logical length of a string (as opposed to its memory consumption) or finding the nth code point (as opposed to the nth byte) require UTF-8 logic: the string has to be parsed at least from the beginning to the desired position, or an indexing facility has to be used. So a lot of non-trivial string functionality of the standard library is not as easy as people thought in the 1970s and consequently does not work as needed. Libraries for better UTF-8 support in C can be found, leaving the „native“ C strings with UTF-8 content only for interfaces that require them. I have not yet explored such libraries, but it would be interesting to find out how powerful and useful they are.
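The difference between byte length and logical length can be sketched in a few lines. The example is written in Java to stay in line with the other examples in this blog; the same loop works in C over a char array. It assumes well-formed UTF-8 and counts every byte that is not a continuation byte of the form 10xxxxxx. The class and method names are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

class Utf8Length {
    // Counts code points in well-formed UTF-8 by skipping continuation bytes.
    static int codePoints(byte[] utf8) {
        int count = 0;
        for (byte b : utf8) {
            if ((b & 0xC0) != 0x80) { // not of the form 10xxxxxx
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // "Grüße €" written with escapes to be independent of the source encoding
        byte[] bytes = "Gr\u00fc\u00dfe \u20ac".getBytes(StandardCharsets.UTF_8);
        System.out.println(bytes.length + " bytes, " + codePoints(bytes) + " code points");
    }
}
```

The loop has to visit every byte, which is exactly the linear scan mentioned above.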

At the time when Unicode came out, it seemed to be sufficient to have 16 bits per character instead of 8. Java was built on this assumption. C added wchar_t to allow for this and just required it to be „long enough“. So Linux uses 32 bits and MS-Windows 16 bits. This is not too bad, because programming in C for MS-Windows and for Linux is quite different anyway, unless we abstract the differences into a library, which would then also include a common string definition and string handling functionality. While the Linux wchar_t is sufficient, it really wastes a lot of memory, which is often undesirable if we make the extra effort to program in C in order to gain performance. The Windows wchar_t is „kind of sufficient“, as are the Java Strings, because we can really do a lot by assuming that Unicode is only 16 bit, or by using UTF-16 and ignoring its complexities. These are in principle the same as for UTF-8, but most of the time they can be ignored with fewer disadvantages. The good news is that wchar_t is well supported by standard library functions.
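What „kind of sufficient“ means for Java Strings can be seen with a code point outside the 16 bit range. The following sketch uses U+1D11E (musical symbol G clef), which UTF-16 stores as a pair of two 16 bit units; the class name is an assumption of this example.

```java
class Utf16Demo {
    // U+1D11E, stored in UTF-16 as the surrogate pair D834 DD1E
    static final String CLEF = "\uD834\uDD1E";

    public static void main(String[] args) {
        // length() counts 16 bit units, not code points
        System.out.println(CLEF.length());
        // codePointCount() applies the UTF-16 logic and counts code points
        System.out.println(CLEF.codePointCount(0, CLEF.length()));
        System.out.println(Integer.toHexString(CLEF.codePointAt(0)));
    }
}
```

As long as all characters fit into 16 bits, both counts agree, which is why the simplification works most of the time.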

Another way is to use char16_t and char32_t, that have a clear definition of their length, but much less library support.

Probably these facilities are sufficient for software whose string handling is relatively trivial. For more ambitious string handling in terms of functionality and performance, it will be necessary to find third party libraries or to write them.

