Unit Testing in a non-perfect World

Test Driven Development

We all know how good test-driven development is and that we should move in that direction.

How much coverage

There are some serious obstacles. Most of all, we have some obligation to actually finish software and the resources are usually kind of limited. If they were not limited by money and time constraints, they would hit the limit of efficient team sizes and organizational structures.

We can just look at a simple application that does „CRUD“ operations. Ideally we start with a known data set and reset the database to exactly this content before starting the tests, maybe even before each single test. If we have a huge and well-managed server farm to run the tests, this may be possible.

For the „read“ methods we need to write some tests that succeed in reading, probably performing a few reads with a single read method to cover different outcomes of the successful read or different parameter combinations. Then there are unsuccessful reads that just do not find anything and return null or an empty result collection, or even those that fail with an exception. It is also of interest to check the maximum and minimum allowed values, if there are such limits. So we end up writing five to a few dozen test methods for a single read method. And this is the simple case.

For delete and update we should create our own data at the beginning of the test. There are probably dependencies and constraints in conjunction with other data, so it is necessary to cover these as well. Create and update actually need a variety of at least two values for each of the simplest attributes of the created data object, and they have to deal with not-null constraints. Usually we have more constraints on attributes, concerning lengths, value ranges and some kind of compatibility with other data. So there will be up to around ten tests for each attribute of the created or updated entity, and we have successful and unsuccessful operations that we expect. So we will end up writing hundreds or even thousands of unit test methods just to obtain the most basic coverage for a relatively simple „CRUD“ application.

Writing many similar tests is not so difficult, and it would be interesting to explore ways to cut down on the repetitive work involved: using less verbose languages for writing the tests, creating them partially with scripts, or simply writing very powerful helper methods in the test class that just get called with slightly different parameters to do all the tests. Writing the tests will be a lot of work anyway. I think 60% of the time for the unit tests and 40% for the actual code is a reasonable number for a relatively fair coverage of most of the code.
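As a minimal sketch of such helper-driven tests, assume Python with its standard unittest and sqlite3 modules. The person table, its 20 character limit and the functions create_person and read_person are invented for this illustration and stand in for a real application. Each test starts from the same known data set, and one test method covers several parameter combinations via subTest:

```python
import sqlite3
import unittest

# Hypothetical schema and seed data, invented for this sketch.
SCHEMA = (
    "CREATE TABLE person ("
    "id INTEGER PRIMARY KEY, "
    "name TEXT NOT NULL CHECK (length(name) <= 20))"
)
SEED = [(1, "Alice"), (2, "Bob")]


def fresh_db():
    """Return an in-memory database reset to the known data set."""
    db = sqlite3.connect(":memory:")
    db.execute(SCHEMA)
    db.executemany("INSERT INTO person (id, name) VALUES (?, ?)", SEED)
    return db


def create_person(db, name):
    """The 'create' operation under test; returns the new id."""
    cur = db.execute("INSERT INTO person (name) VALUES (?)", (name,))
    return cur.lastrowid


def read_person(db, person_id):
    """The 'read' operation under test; returns the name or None."""
    row = db.execute(
        "SELECT name FROM person WHERE id = ?", (person_id,)
    ).fetchone()
    return None if row is None else row[0]


class PersonCrudTest(unittest.TestCase):
    def setUp(self):
        # Every single test starts from exactly the same content.
        self.db = fresh_db()

    def tearDown(self):
        self.db.close()

    def test_read(self):
        # Successful and unsuccessful reads in one parameterized test.
        cases = [(1, "Alice"), (2, "Bob"), (3, None), (-1, None)]
        for person_id, expected in cases:
            with self.subTest(person_id=person_id):
                self.assertEqual(read_person(self.db, person_id), expected)

    def test_create_valid(self):
        # Shortest and longest allowed name.
        for name in ["C", "x" * 20]:
            with self.subTest(name=name):
                new_id = create_person(self.db, name)
                self.assertEqual(read_person(self.db, new_id), name)

    def test_create_invalid(self):
        # Violates NOT NULL and the length limit, respectively.
        for name in [None, "x" * 21]:
            with self.subTest(name=name):
                with self.assertRaises(sqlite3.IntegrityError):
                    create_person(self.db, name)
```

Saved as a module, this can be run with python -m unittest. Adding another boundary value then costs one list entry rather than one more test method.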

In practice we should really prioritize our unit testing efforts, because spending 2.5 times as much time for the whole thing as for the code itself is simply not always possible. On the other hand, the time we save in the long run with good testing is even more than what we spend, if we do the unit test development well.

But there are some aspects to think of:

  • Which parts of the application are fairly stable?
  • Which parts of the application are used and relied on heavily by other parts of the application?
  • Which parts of the application are used a lot by end users?
  • Which parts of the application are high risk because they have more inner complexity?
  • Which parts of the application actually showed errors? Fix the errors by writing a test to expose them first.
  • Which parts of the application are high risk in terms of reputation, money loss or data loss if they go wrong?
  • Which parts of the application are undergoing internal changes, while retaining the API?
  • Which parts of the application are migrated to another platform, OS, DB, architecture …, while retaining the API?

It is good to focus primarily on areas based on these questions and to do reduced testing for areas that are less critical.
The first question is quite delicate, because it exposes a contradiction we need to cope with. We should be agile and change the application easily when the requirements or the architecture are understood better. But with tests, even this effort multiplies by 2.5 or so, because we have to update the unit tests as well. Or even worse, it leads to disabling the unit tests or to a loss of agility. For areas that change quickly it may be better to write the complete set of tests only once they have become relatively stable.


The next issue is the database. Typical organizations like to provide one DB instance and schema for the whole development team, because database instances and schemata are seen as expensive resources. They are hard to maintain, and for various reasons it is often difficult to install a local database on each of the developers' machines. If it is Oracle, DB2 or MS SQL Server, some know-how is needed to install it and there may even be constraints in terms of the OS. MariaDB and PostgreSQL seem to be somewhat easier to install and there are fewer license issues involved, but even that is still an effort. This can be overcome by virtualization. An image with the DB setup can be developed once and then copied to each team member. There are interesting and good ways to do something like this. So it is becoming less of an issue, but it is still very unusual to have that.

Now there are two ways out of this. One way is to use a different DB product for development than for production. This is somewhat dangerous, because databases are so different that today's common abstractions do not hide the differences, and we also might pay a high price in terms of performance if we do not use DB-specific features. So it requires extra development effort to support both DB types. And it is very important to run the tests against the DB that is used in production on a regular basis anyway. It may be helpful to move part of the tests to such a similar-but-not-equal local environment.

The other way is to stay with the regular development DB, which is unfortunately often shared between many developers. Now if tests run simultaneously from different development machines against the same DB, they will usually interfere with each other and some tests will probably fail just because of that. Not all the time, but sporadically. This can be avoided by some team organization and some kind of reservation of the DB, but that is painful, so we just run the tests and if they fail, we assume that someone else testing at the same time caused the failure.

It is possible to write the tests in such a way that they can withstand this, but this is a lot of extra effort. Compared to the effort of using a virtual image with a working DB instance, it is not justifiable.

So what we should aim for is a dedicated DB schema for each developer. Ideally it should use the same DB software product that is used in production. It can be local, on some DB server or in a virtual image.


Modular Arithmetic

We have some articles in this blog about integers of typical programming languages and how they work. Time to introduce the underlying mathematical concepts, that have been covered implicitly until now, since they are also interesting in many other aspects. And besides, this is a very beautiful area of mathematics.

Mathematics that we learn in school is mostly inspired by what is needed for physics. This was quite a good choice 100 years ago, because it gave some motivation for why we do certain things and physics was the area where math was applied. Of course there were also chemistry and engineering, but these use somewhat similar areas of mathematics as physics. Now physics and chemistry make use of quite interesting areas of mathematics like group theory or non-Euclidean geometry, but these are rather advanced areas beyond what we typically learn in school, at least in the countries where I went to school. So it is about real numbers, some trigonometry, real analysis (calculus) and maybe complex numbers.

For more than 50 years mathematics has been heavily used in informatics as well. If we abstract informatics away from computers, it has been even longer, because for example algorithms and cryptography have been used for several thousand years already, but that was a small niche and only became mainstream through the existence of computers. And for informatics and computer science we need different areas of mathematics. Analysis is not so important, though not irrelevant. One area is information theory, which is based on probability theory and statistics. Numerical calculations have to a great extent remained a domain of mathematics itself, so this connection may be strong, but it is applied mathematicians using computers and using knowledge from IT to program them better, not the other way round. Still numerical analysis is somewhat important, but not really what most of us need very often.

The areas of mathematics that are really interesting for informatics are discrete mathematics, algebra and number theory. There is enough material about this on the web, but for now we will deal with modular arithmetic, which is kind of in the intersection of discrete mathematics, algebra and number theory.

We start with the integral numbers:

    \[{\Bbb Z} = \{\ldots,-3, -2, -1, 0, 1, 2, 3, 4,\ldots\}\]

Now we take any positive integral number m \in {\Bbb N} with m \ge 2.
We say that two integral numbers x and y are congruent modulo m:

    \[x \equiv y \pmod m\]

if and only if x-y can be divided by m. We might also say that there is a k \in {\Bbb Z} such that y = x + k m.
Now we can make interesting observations:
We assume, that we have pairs of numbers such that

    \[u \equiv v \pmod m\]


    \[x \equiv y \pmod m\]

Then we can observe that also

    \[u+x \equiv v+y \pmod m\]

    \[u-x \equiv v-y \pmod m\]

    \[u\cdot x \equiv v\cdot y \pmod m\]

This can be proven easily.
We assume as above

    \[y = x + k \cdot m\]

and similarly

    \[v = u + l \cdot m\]

Then we have

    \[y+v = x+u+(k+l)m \equiv x+u \pmod m\]

    \[y-v = x-u+(k-l)m \equiv x-u \pmod m\]

    \[yv = (x+km)(u+lm) = xu + kum + lxm + klm^2 \equiv xu \pmod m\]

We call a set of all numbers of \Bbb Z that are congruent to each other a remainder class and write this as

    \[\bar x = x + m{\Bbb Z}\]

There are exactly m remainder classes modulo m and usually we use a representation system of

    \[0,1,\ldots,m-1\]

or for even m we often use

    \[-\frac{m}{2}, -\frac{m}{2}+1,\ldots,-1,0,1,\dots,\frac{m}{2}-1\]

or for odd m we often use

    \[-\frac{m-1}{2}, -\frac{m-1}{2}+1,\ldots,-1,0,1,\dots,\frac{m-1}{2}\]

We observe these representation systems when we do division with remainder, written as % in many programming languages, but it is necessary to do some quick research on which representation system % uses and which one we want to use and possibly adjust the result. The corresponding division may not be /, but we can obtain it by subtracting our remainder from the dividend and dividing that, which should be an exact division.
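To make the adjustment concrete, here is a small sketch in Python, where % with a positive modulus always yields the non-negative representation 0, 1, …, m-1. Java and C, by contrast, give -1 for -7 % 3. The function names are chosen for this illustration only:

```python
def mod_nonneg(x, m):
    """Representative of x in the system 0, 1, ..., m-1."""
    return x % m  # in Python the result takes the sign of the divisor m


def mod_symmetric(x, m):
    """Representative of x in the system centered around 0.

    For even m this is -m/2 .. m/2-1, for odd m it is -(m-1)/2 .. (m-1)/2.
    """
    r = x % m
    return r - m if r >= (m + 1) // 2 else r


def quotient_for(x, m, r):
    """The division that matches remainder r: (x - r) / m is exact."""
    assert (x - r) % m == 0
    return (x - r) // m


for x in (-7, 7, 12):
    r = mod_symmetric(x, 8)
    q = quotient_for(x, 8, r)
    # x == q * 8 + r holds for whichever representation system we picked
    print(x, "=", q, "* 8 +", r)
```

The last function is the subtraction trick from the text: once the remainder has been chosen, the corresponding quotient follows from an exact division.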

Now we need to define a ring. A ring R is a set with operations + and \cdot such that the following rules apply:

  1. For any members x, y \in R we also have x+y \in R, x-y \in R and x\cdot y \in R. This is usually not mentioned, because it is part of how we define these operations in the first place in most mathematical texts.
  2. Addition is commutative: For any members x, y \in R we have x+y=y+x.
  3. Addition has a neutral element 0: For any member x \in R we have x+0=0+x=x.
  4. Addition has inverse elements: For any member x \in R we have a member x'\in R such that x+x'=x'+x=0. Usually we write -x for this inverse element of x and we write x-y instead of x+(-y).
  5. Addition is associative: For any members x, y, z \in R we have (x+y)+z=x+(y+z). We can omit the parentheses here and write x+y+z instead.
  6. Multiplication has a neutral element 1: For any member x \in R we have x\cdot 1=1\cdot x=x.
  7. Multiplication is associative: For any members x, y, z \in R we have (x\cdot y)\cdot z=x\cdot (y\cdot z). We can omit the parentheses here and write x\cdot y\cdot z or even xyz instead.
  8. Multiplication in conjunction with addition is distributive: For any members x, y, z \in R we have (x + y)\cdot z = x\cdot z + y\cdot z and z\cdot (x+y)=z\cdot x + z\cdot y.

If the multiplication is also commutative, we call it a commutative ring. If there is a multiplicative inverse for any element other than 0, we call it a skew field. And if both conditions hold, we call it a field.

Now we can see that \Bbb Z is actually a commutative ring.

And these remainder classes modulo m also form a ring. We call it {\Bbb Z}/m{\Bbb Z} or sometimes also {\Bbb Z}_m, but I do not use the second form, because it is ambiguous with something else (p-adic numbers). If m=p is a prime number, then {\Bbb Z}/p{\Bbb Z} is actually a field and in this case we may write {\Bbb F}_p instead of {\Bbb Z}/p{\Bbb Z}. Or GF(p) in some literature, if you prefer that. Why is it a field?

Now we have an extension of the Euclidean algorithm that calculates the gcd of two numbers and also yields numbers u and v such that g=\gcd(x,y)=ux+vy. So these numbers exist. For a prime number p and a remainder class \bar x \ne \bar 0 we know that x is not a multiple of p, and since p is prime we know that

    \[\gcd(x,p) = 1 = ux + vp\]

for some u, v \in {\Bbb Z}. This yields a multiplicative inverse \bar u for \bar x because

    \[u\cdot x \equiv 1 \pmod p\]
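The argument can be tried out directly. The following Python sketch implements the iterative form of the extended Euclidean algorithm and uses it to find inverses modulo p. The function names are chosen for this illustration:

```python
def extended_gcd(x, y):
    """Return (g, u, v) with g = gcd(x, y) = u*x + v*y for x, y >= 0."""
    old_r, r = x, y
    old_u, u = 1, 0
    old_v, v = 0, 1
    while r != 0:
        q = old_r // r
        # Invariant: old_r = old_u*x + old_v*y and r = u*x + v*y
        old_r, r = r, old_r - q * r
        old_u, u = u, old_u - q * u
        old_v, v = v, old_v - q * v
    return old_r, old_u, old_v


def mod_inverse(x, m):
    """Return the inverse of x modulo m, or raise if gcd(x, m) != 1."""
    g, u, _ = extended_gcd(x % m, m)
    if g != 1:
        raise ValueError("no multiplicative inverse: gcd is not 1")
    return u % m


p = 13
for x in range(1, p):
    print(x, "*", mod_inverse(x, p), "= 1 mod", p)
```

For a prime p every x from 1 to p-1 gets an inverse. For a non-prime modulus the same function raises an error for some x, for example for x=2 and m=4.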


Now we often see m as a power of 2 and the modular arithmetic, at least +, -, *, is what is sold to us as integer arithmetic of Java, C or C#.

On the other hand it can be interesting and useful to use modular arithmetic for other values of m. Mostly prime numbers are interesting, which can be relatively small like 2, 3 or 5, but also really big. For non-primes we have zero divisors, that is numbers x, y \not\equiv 0 \pmod m such that x\cdot y \equiv 0 \pmod m. This breaks a fundamental property that we are used to from integers and fields, but it is perfectly correct for this modular ring.
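A brute-force sketch in Python shows which remainder classes multiply to zero with some other nonzero class. The function name is chosen for this illustration, and the search is only meant for small m:

```python
def zero_divisors(m):
    """All x in 1..m-1 for which some y in 1..m-1 gives x*y = 0 mod m."""
    return sorted(
        {x for x in range(1, m) for y in range(1, m) if (x * y) % m == 0}
    )


for m in (5, 6, 7, 8, 12):
    print(m, zero_divisors(m))
```

For the primes 5 and 7 the list is empty, for 6 it contains 2, 3 and 4, and for 12 exactly those numbers that share a factor with 12.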

In our daily life modular arithmetic is actually quite common. We have the week days with m=7, the hours of the clock with m=12 or m=24, the minutes and seconds of the clock with m=60 and quite a bit of m=2, which we do not really see as modular arithmetic, but maybe as boolean arithmetic with + being the „exclusive or“, \cdot being the „and“ etc.
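The case m=2 can be checked exhaustively, since there are only four combinations. A short Python sketch, with function names chosen for this illustration:

```python
def add_mod2(a, b):
    """Addition in Z/2Z."""
    return (a + b) % 2


def mul_mod2(a, b):
    """Multiplication in Z/2Z."""
    return (a * b) % 2


for a in (0, 1):
    for b in (0, 1):
        # ^ is the bitwise exclusive or, & the bitwise and
        print(a, b, add_mod2(a, b) == a ^ b, mul_mod2(a, b) == a & b)
```

All four lines end in True True, so addition modulo 2 is the exclusive or and multiplication modulo 2 is the and.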


Collection Libraries

The standard libraries of newer programming languages usually contain so called collection libraries.

Collections can usually be Lists, Sets, Maps or specializations of these.

They cover quite a lot, and we start seeing variants that are built on immutability and variants that allow mutability. And as always there is the hybrid in Ruby, which combines both and does an irreversible transition using the freeze method.

There are some interesting collection types other than these. Most often we find the Bag as the fourth member of the club, and then more complex and more specific collections.

What they all have in common is storing a finite number of elements in a certain structure.

Some languages like Clojure, Haskell or Perl 6 use so called lazy collections. That can mean that the members are not actually stored, but that there are methods to calculate them on demand. This allows for very interesting, expressive and beautiful programming, if used properly. Typically a Range of integers is provided as a lazy collection. But there can also be quite interesting lazy collections that are a little bit more sophisticated. Some allow random access to the nth element, like arrays or vectors or arrayLists, some only via iteration.

Interesting lazy collections could be multi-dimensional ranges. Assume we have an array of integers [n_0, n_1, n_2, ...., n_{m-1}] where even m is only known at runtime. Then it is a challenge that sometimes occurs to do a loop like this:

for (i_0 = 0\ldots n_0-1) {
    for (i_1 = 0\ldots n_1-1) {
        for (i_2 = 0\ldots n_2-1) {
            ...
        }
    }
}

Which is kind of hard to write, because we cannot nest the loops if we do not know how deeply they need to be nested.

But if we have a multi-range collection and do something like this

Collection<List<Integer>> mr = new MultiRange([n_0, n_1, n_2, ..., n_{m-1}]);
for (List<Integer> li : mr) {
    ...
}

and this beast becomes quite approachable.
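A minimal sketch of such a multi-range in Python could look as follows. The class name MultiRange is taken from the pseudocode above, everything else is an assumption of this sketch. It stores only the limits and one current index tuple and counts like an odometer, so the nesting depth m can be decided at runtime:

```python
class MultiRange:
    """Lazily iterates over all index tuples (i_0, ..., i_{m-1})
    with 0 <= i_k < limits[k], last position changing fastest."""

    def __init__(self, limits):
        self.limits = list(limits)

    def __iter__(self):
        if any(n <= 0 for n in self.limits):
            return  # an empty range in any dimension means no tuples at all
        index = [0] * len(self.limits)
        while True:
            yield tuple(index)
            # Increment like an odometer: bump the last position and
            # carry to the left whenever a position overflows.
            pos = len(index) - 1
            while pos >= 0:
                index[pos] += 1
                if index[pos] < self.limits[pos]:
                    break
                index[pos] = 0
                pos -= 1
            if pos < 0:
                return  # every position overflowed: iteration is complete


for li in MultiRange([2, 3]):
    print(li)
```

The standard library function itertools.product over range objects produces the same tuples, so a hand-written class is mainly useful when additional operations on the collection are wanted.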

A similar one, that is sometimes needed, is a lazy collection containing all the permutations of the n numbers \left\{0\ldots n-1\right\}. Again we only want to iterate over it and possibly not complete the iteration.
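In Python the standard library already provides this: itertools.permutations yields the permutations one at a time in lexicographic order, so an iteration can be stopped early without ever building the full list. The helper name first_permutations is chosen for this sketch:

```python
from itertools import islice, permutations


def first_permutations(n, count):
    """Return only the first `count` permutations of 0..n-1."""
    return list(islice(permutations(range(n)), count))


print(first_permutations(3, 4))
# For n = 20 there are 20! permutations, but taking two of them is cheap.
print(first_permutations(20, 2))
```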

Another interesting idea is to perform the set operations like union, intersection and difference lazily. That means that we have a collection class Union, that implements the union of its members. Testing for membership is trivial, iteration does involve some additional structure to avoid duplicates. Intersection and difference are even easier, because they cannot produce duplicates.
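As a sketch of this idea in Python, the following classes only keep references to their members and compute everything on demand. The class names are chosen for this illustration, and the members are assumed to be set-like, that is without internal duplicates and with a reasonably fast membership test. Instead of an additional structure, this union avoids duplicates by asking the earlier members whether they have already produced an element:

```python
class LazyUnion:
    def __init__(self, *members):
        self.members = members

    def __contains__(self, item):
        return any(item in member for member in self.members)

    def __iter__(self):
        for i, member in enumerate(self.members):
            for item in member:
                # Skip anything that an earlier member already yielded.
                if not any(item in earlier for earlier in self.members[:i]):
                    yield item


class LazyIntersection:
    def __init__(self, first, *others):
        self.first = first
        self.others = others

    def __contains__(self, item):
        return item in self.first and all(item in o for o in self.others)

    def __iter__(self):
        # Iterating one member and filtering cannot produce duplicates.
        return (x for x in self.first if all(x in o for o in self.others))


class LazyDifference:
    def __init__(self, base, removed):
        self.base = base
        self.removed = removed

    def __contains__(self, item):
        return item in self.base and item not in self.removed

    def __iter__(self):
        return (x for x in self.base if x not in self.removed)


u = LazyUnion({1, 2, 3}, {3, 4}, range(4, 7))
print(sorted(u), 5 in u, 7 in u)
```

Because range objects are lazy themselves, a union or intersection of huge ranges costs no memory beyond the objects shown here.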

Also interesting are sets built from intervals. Intervals can be defined in any base set {\mathrm T} (type) that supports comparisons like <, <=, ... We have

  • an open interval (a,b)=\left\{x \in {\mathrm T} : a < x < b\right\}
  • a left half-open interval (a,b]=\left\{x \in {\mathrm T} : a < x \le b\right\}
  • a right half-open interval [a,b)=\left\{x \in {\mathrm T} : a \le x < b\right\}
  • a closed interval [a,b]=\left\{x \in {\mathrm T} : a \le x \le b\right\}

Of these we can create unions and intersections, and in the end we can always reduce this to unions of intervals. Adjacent intervals can sometimes be merged, overlapping intervals always. If {\mathrm T} supports the concept of successors, then even closed intervals with different limits can be discovered to be adjacent, for example [1,2] and [3,4] for {\mathrm T}={\Bbb Z}. Often this cannot be assumed, for example if we are working with rational numbers with arbitrarily long integers as numerator and denominator.
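For the case {\mathrm T}={\Bbb Z} with closed intervals, the reduction to a union of disjoint intervals can be sketched in Python like this. The function name and the representation of an interval [a,b] as a tuple (a, b) are assumptions of this sketch:

```python
def merge_closed_int_intervals(intervals):
    """Reduce closed integer intervals (a, b) with a <= b to a sorted
    list of disjoint, non-adjacent intervals with the same union."""
    merged = []
    for a, b in sorted(intervals):
        # b_prev + 1 is the successor of the previous upper limit, so
        # a <= b_prev + 1 covers both overlapping and adjacent intervals.
        if merged and a <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], b))
        else:
            merged.append((a, b))
    return merged


print(merge_closed_int_intervals([(3, 4), (1, 2)]))
print(merge_closed_int_intervals([(1, 2), (4, 5)]))
```

The first call merges [1,2] and [3,4] into [1,4], while the second keeps two intervals because 3 is missing. For a type without successors the + 1 has to go, and only overlapping intervals can be merged this way.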

So these are three concepts to get memory saving, easy to use lazy collections.


Alpine Perl Workshop

On 2016-09-02 and 2016-09-03 I was able to visit the Alpine Perl Workshop. This was a Perl conference with around 50 participants, among them core members of the Perl community. We had mostly one track, so the documented information about the talks that were given is actually quite closely correlated to the list of talks that I have actually visited.

We had quite a diverse set of talks about technical issues but also about the role of Perl in projects and in general. The speeches were in English and German…

Perl 6 is now a reality. It can be used together with Perl 5, there are ways to embed them within each other and they seem to work reasonably well. This fills some of the gaps of Perl 6, since its set of modules is by far not as complete as that of Perl 5.

For quite a few years now Perl 5 has had a time-boxed release schedule. Each year a new major release is shipped. The previous two releases are supported with bug fixes. The danger that major Linux distributions remain on older releases has been averted. Python 3 was released in 2008, and still in 2016 Python 2.7 is what is usually used and shipped with major Linux distributions. It looks like Perl 5 is here to stay and will not be replaced by Perl 6, which is a quite different language that just shares the name and the community. But the recent versions are actually adopted, and the incompatible changes are so small that they usually do not hurt too much. An advantage of Perl is the CPAN repository for libraries. It is possible to test new versions against a ton of such libraries and to find out where they might break, or even to provide fixes for the library.

An interesting issue is testing of software. For continuous integration we can now find servers and they will run against a configurable set of Perl versions. But using different Linux distributions or even non-Linux-systems becomes a more elaborate issue. People willing to test new versions of Perl or of libraries on exotic hardware and OS are still welcome and often they discover a weakness that might be of interest even for the mainstream platforms in the long run.

I will leave it at this. You can find more information on the website of the conference.

And some of the talks are on YouTube already.

It was fun to go there, I learned a lot and met nice people. It would be great to be able to visit a similar event again…


Error Messages in Web Applications

In applications that are used by non-technical end users, which are these days very often web applications, we have to deal with the issue that an unexpected error occurs.

There are the two extremes:

We can just show a screen telling in a nice design that the application does not work and that is all. In terms of user experience that is a good approach and often the recommended way. But there is no reasonable path to recover, other than retrying.

The other extreme, which we actually see quite a lot, is this:

Stack-trace of exception shown to the user in the browser

That may be ok in some rare case, during beta test or for an internal application that is only used by the development team.

This was a production application, apparently developed in some MS-dotnet-language, most likely C# or ASP-dotnet, but the same issues are valid for PHP, Java/JSF/JEE, Ruby, Perl, Perl 6, Python, Scala, Clojure, C, C++ and whatever you like.

Imagine Google would display such information because of a bug in their search engine and there were a contact mail address or phone number and millions of people would call their call center every day with such bugs… In this case it is probably better to hide the exception from the users and probably write the software well enough that such issues do not occur too often, because too many people rely on this every day.

The other side is that there are probably some log files and the exceptions can be found there. Now the log files can be monitored manually, which becomes a bad idea as soon as there are actually one or more full-time users, because the logs become huge, but tools like grep, simple Perl scripts or tools like Splunk can help to deal with this. Since applications tend to be distributed, we have to deal with the fact that the same single instance of an error, and even more so the same kind of error, will occur in many different logs, and we need to match them to make sense of this and to understand the problem.

Reading logs and especially stack traces is particularly hard in framework worlds, where there are hundreds of levels in the stack traces coming from the frameworks. Often this is where the error actually is, but even more often it is in the application, and anyway we are more comfortable doing a workaround rather than fixing the framework, which we could do at least if it is open source. And we should actually send a bug report with as much information as possible, but avoid interpretation of what the bug could be. This can be added as a comment to the bug report, maybe even with a hint how to fix it or a patch, but it should be kept separate.

Anyway, usually we assume the error is not in the framework and this is usually true. So it is a challenge to read this and again tools or scripts should be used to do this. It is also possible and usually necessary to find occurrences of the same kind of error. Often this is hard, because the root cause does not manifest itself and we get consequential errors much later. That is why we are IT experts, so we can find even such hidden bugs. 🙂

Now it is possible to make life a little bit easier. We can give exceptions unique IDs, which can be something like this:

    eeeeee-cccc-hhhh-tttt…-nnnn…

where each block consists of digits and upper case letters.

  • eeeeee encodes the type of the exception, for example by skipping all lower case letters. It does not have to be perfect, just give a hint.
  • cccc is an error code, if this is used. If and when exceptions should include error codes is an interesting issue by itself.
  • hhhh encodes the host where it originally occurred.
  • tttt… is the time stamp. If we use msec since 1970 and use base 36 to encode it, it can be shorter.
  • nnnn… is a number from some counter.

This is just an idea. You could use UUIDs or do something along these lines, but different. Using base 36 is actually a good idea, it makes these codes shorter.
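A minimal sketch of such an ID generator in Python follows. Joining the blocks with hyphens, hashing the host name with CRC32 down to four characters, the truncation of the type block to six characters and all function and parameter names are assumptions of this sketch, not a fixed format:

```python
import itertools
import socket
import time
import zlib

DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def base36(n):
    """Encode a non-negative integer with digits and upper case letters."""
    if n == 0:
        return "0"
    out = []
    while n > 0:
        n, d = divmod(n, 36)
        out.append(DIGITS[d])
    return "".join(reversed(out))


_counter = itertools.count(1)  # process-wide counter for the last block


def exception_id(exc, error_code=0, host=None, now_ms=None):
    """Build an ID of the form type-code-host-timestamp-counter."""
    # Type block: keep only upper case letters and digits of the class name.
    type_name = type(exc).__name__
    type_part = "".join(c for c in type_name if c in DIGITS)[:6] or "X"
    # Host block: a short hash of the host name, padded to four characters.
    if host is None:
        host = socket.gethostname()
    host_part = base36(zlib.crc32(host.encode("utf-8")) % 36**4).rjust(4, "0")
    # Time block: milliseconds since 1970 in base 36.
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return "-".join([
        type_part,
        base36(error_code).rjust(4, "0"),
        host_part,
        base36(now_ms),
        base36(next(_counter)),
    ])


try:
    1 // 0
except ZeroDivisionError as e:
    print(exception_id(e, error_code=42))
```

The ID would be created once, where the exception is first caught, and then written with every log entry that belongs to it.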

Anyway, having such an ID in each exception in the log makes it easier to find the different log entries that belong to exactly the same exception. And yes, they do occur and that is OK. Such a code could also be displayed on the screen of the end user if it is an application where users actually have access and contact to some support team. Then they can read it out to the support team. Again, aim to make it short and unique, but don’t make the whole mechanism too fragile, otherwise we deal with finding the exceptions in the exception handling framework itself and that is not desirable.

What is important: We should actually fix bugs when we find them. Free some time for it, write unit tests that prove the bug, fix it and make sure it does not come up again by retaining the unit test. Yes, it is work, but it is worth it. Whether the bug justifies an immediate deployment or whether we have to wait for a deployment window is another issue, but it should be fixed at least with the next deployment. Whether regular deployments should be done twice a year or daily is an interesting issue by itself. There should always be ways to do an „emergency deployment“ in case of a critical bug, but it is good to have a strong regular mechanism so the emergency deployment can remain an exception.


Integers in Perl 6

The language Perl 6 has been announced to be production ready by the beginning of this year. Its implementation is Rakudo, while Perl 6 itself is an abstract language definition that allows any language implementation that passes the test suite to call itself a Perl 6 implementation. The idea is not totally new, we see the Ruby language being implemented more than once (Ruby, JRuby, Rubinius, IronRuby), but we can also learn from the Ruby guys that it is a challenge to keep this up to date, and eventually it is likely that one implementation will fall behind or go its own way at some point in time.

Perl 6 is also called „Perl“ as part of its name, but quite different from its sister language Perl, which is sometimes called „Perl 5“ to emphasize the distinction, so it is absolutely necessary to call it „Perl 6“ or maybe „Rakudo“, but not just „Perl“.

Even though many things can be written in a similar way, a major change from Perl 5 is the way of dealing with numeric types. You can find an article describing Numeric Types in Perl [5]. So now we will see how to do the same things in Perl 6.

Dealing with numeric types in Perl 6 is neither like in Perl 5 nor like what we are used to in many other languages.

So when we just use numbers in a naïve way, we get long integers automatically:

my $f = 2_000_000_000;
my $p = 1;
loop (my Int $i = 0; $i < 10; $i++) {
    say($i, " ", $p);
    $p *= $f;
}

creates this output:

0 1
1 2000000000
2 4000000000000000000
3 8000000000000000000000000000
4 16000000000000000000000000000000000000
5 32000000000000000000000000000000000000000000000
6 64000000000000000000000000000000000000000000000000000000
7 128000000000000000000000000000000000000000000000000000000000000000
8 256000000000000000000000000000000000000000000000000000000000000000000000000
9 512000000000000000000000000000000000000000000000000000000000000000000000000000000000

This is a nice default, similar to what Ruby, Clojure and many other Lisps use, but most languages have made a choice that is weird for application development.

Now we can also statically type this:

my Int $f = 2_000_000_000;
my Int $p = 1;
loop (my Int $i = 0; $i < 10; $i++) {
    say($i, " ", $p);
    $p *= $f;
}

and we get the exact same result:

0 1
1 2000000000
2 4000000000000000000
3 8000000000000000000000000000
4 16000000000000000000000000000000000000
5 32000000000000000000000000000000000000000000000
6 64000000000000000000000000000000000000000000000000000000
7 128000000000000000000000000000000000000000000000000000000000000000
8 256000000000000000000000000000000000000000000000000000000000000000000000000
9 512000000000000000000000000000000000000000000000000000000000000000000000000000000000

Now we can actually use low-level machine integers which do an arithmetic modulo powers of 2, usually 2^{32} or 2^{64}:

my int $f = 2_000_000_000;
my int $p = 1;
loop (my Int $i = 0; $i < 10; $i++) {
    say($i, " ", $p);
    $p *= $f;
}

and we get the same kind of results that we would get in Java or C with (signed) long, if we are on a typical 64-bit environment:

0 1
1 2000000000
2 4000000000000000000
3 -106958398427234304
4 3799332742966018048
5 7229403301836488704
6 -8070450532247928832
7 0
8 0
9 0

We can try it in Java. I was lazy and changed as little as possible and the "$" is allowed as part of the variable name by the language, but of course not by the coding standards:

public class JavaInt {
    public static void main(String[] args) {
        long $f = 2_000_000_000;
        long $p = 1;
        for (int $i = 0; $i < 10; $i++) {
            System.out.println($i + " " +  $p);
            $p *= $f;
        }
    }
}

We get this output:

0 1
1 2000000000
2 4000000000000000000
3 -106958398427234304
4 3799332742966018048
5 7229403301836488704
6 -8070450532247928832
7 0
8 0
9 0

And we see, with C# we get the same result:

using System;

public class CsInt {

    public static void Main(string[] args) {
        long f = 2000000000;
        long p = 1;
        for (int i = 0; i < 10; i++) {
            Console.WriteLine(i + " " +  p);
            p *= f;
        }
    }
}

gives us:

0 1
1 2000000000
2 4000000000000000000
3 -106958398427234304
4 3799332742966018048
5 7229403301836488704
6 -8070450532247928832
7 0
8 0
9 0

If you like, you can try the same in C using signed long long (or whatever is 64 bits), and you will get the exact same result.

Now we can simulate this in Perl 6 also using Int, to understand what int is really doing to us. The idea has already been shown with Ruby before:

my Int $MODULUS = 0x10000000000000000;
my Int $LIMIT   =  0x8000000000000000;
sub mul($x, $y) {
    my Int $result = ($x * $y) % $MODULUS;
    if ($result >= $LIMIT) {
        $result -= $MODULUS;
    } elsif ($result < - $LIMIT) {
        $result += $MODULUS;
    }
    return $result;
}

my Int $f = 2_000_000_000;
my Int $p = 1;
loop (my Int $i = 0; $i < 10; $i++) {
    say($i, " ", $p);
    $p = mul($p, $f);
}

and we get the same again:

0 1
1 2000000000
2 4000000000000000000
3 -106958398427234304
4 3799332742966018048
5 7229403301836488704
6 -8070450532247928832
7 0
8 0
9 0

The good thing is that the default has been chosen correctly as Int and that Int makes it easy to do integer arithmetic with arbitrary precision.

Now the question is, how we actually get floating point numbers. This will be covered in another blog posting, because it is a longer story of its own interest.



In the late 1990s there was a real hype about XML. Tons of standards evolved and it was a big deal to acquire sound knowledge of it.

It has been somewhat of a success, because it is still around and very common almost 20 years later.

I would say that the idea of having a human readable and editable text format has mostly failed. Trivial XML can be edited manually without too much of a risk of breaking it, but then again simpler formats like JSON or even Java properties files or something along these lines would be sufficient and easier to deal with, unless it is the 1001st slightly different format that needs to be learned again. XML is different each time anyway, because it depends on the schema, so we have the problem on that side, but of course the general idea is well known.

For complex XML manual reading and editing becomes a nightmare, it is just so much harder to read for humans than any reasonably common programming languages of our time. It is text, but so involved that it feels like half binary. And who knows, maybe we can also edit binary files with a hex-editor. And real magicians, actually people with too much time in this case, can do so and keep the binary file correct and uncorrupted, at least for some binary formats. And they can do so in XML as well… But it is actually better to have a tool or a script to create and change non-trivial XML-configuration files.

Where XML is strong is for data exchange between systems. This is mostly transfer in space between different systems, but it can also be transfer in time, that is for storing information to be retrieved later. It gives a format that allows for some „type safety“, that is very versatile and that provides a lot of tool and script support around it. Even here we have to acknowledge that there are some drawbacks. Maintaining an XML interface involves some work on the human side for maintaining the schema files and adapting the software. It requires some CPU overhead on the sending and mostly on the receiving side for creating and parsing XML. The libraries have been optimized, but they still take a little bit of time. And then on the network side we transmit a multiple of the amount of data, if it is densely packed with tags.
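To get a feeling for the size overhead, here is a small sketch in Python that serializes the same records once as compact JSON and once as XML with one element per attribute. The sample records and the element names are made up for this illustration, and the actual ratio depends heavily on the schema and on the length of the tag names.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample records, made up for this illustration.
RECORDS = [{"id": i, "name": f"item{i}", "price": i * 1.5} for i in range(3)]


def to_json(records) -> bytes:
    """Serialize the records as compact JSON."""
    return json.dumps(records, separators=(",", ":")).encode("utf-8")


def to_xml(records) -> bytes:
    """Serialize the records as XML with one element per attribute."""
    root = ET.Element("items")
    for record in records:
        item = ET.SubElement(root, "item")
        for key, value in record.items():
            ET.SubElement(item, key).text = str(value)
    return ET.tostring(root, encoding="utf-8")


def from_xml(data: bytes):
    """Parse the XML back; without a schema every value is just a string."""
    return [{child.tag: child.text for child in item} for item in ET.fromstring(data)]


if __name__ == "__main__":
    print(len(to_json(RECORDS)), "bytes as JSON")
    print(len(to_xml(RECORDS)), "bytes as XML")
```

Note that `from_xml` returns all values as strings. The „type safety“ only comes into play when a schema is used for validation and conversion, which this sketch does not do.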

But it is a format that is well understood, that works on pretty much any platform, over the network and also usually allows us to support different versions of the same interface simultaneously. For debugging it is good to have a format that is at least human readable, even if not very pleasant. Ideally the schema is defined in a way that is self documenting.

I wonder why approaches like in WML have not become more common. WML had a customized compressed format that was more friendly to low bandwidth cell phones.

XML is good for many purposes, but as always it is good to know other tools, like JSON and to decide when it is a case for XML and when not.

Some positive side effects of XML are that it helped some other standards to become more mainstream. UTF-8 was the default encoding for XML from the beginning and it is now a common standard encoding for any text. And with XML Schema it became common to encode dates within XML in the ISO format (ISO 8601), which helped this format become generally known and commonly used in cases where one date format should work independently of the origin of the reader.
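As a brief illustration of that date format, this is how it looks in Python's standard library. The concrete date is arbitrary and only serves as an example.

```python
from datetime import date, datetime, timezone

# An arbitrary example date, written year-month-day with zero padding.
some_day = date(2015, 3, 7)
iso_text = some_day.isoformat()

# Parsing the text gives back the same date (Python 3.7 or newer).
parsed = date.fromisoformat(iso_text)

# Timestamps use the same idea, with "T" between date and time
# and an explicit offset from UTC.
stamp = datetime(2015, 3, 7, 14, 30, tzinfo=timezone.utc).isoformat()

if __name__ == "__main__":
    print(iso_text)
    print(stamp)
```

Because the fields are ordered from the largest to the smallest unit, such strings also sort chronologically when sorted as plain text.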


Running a Large Number of Servers

These days we often have to run a large number of servers, and the times when we could afford to manually log into each one to do system administration tasks are mostly over.

It turns out that there are always different approaches to deal with this. In most cases we are talking about virtual hosts, so the virtualization layer in between can help us. We can have a number of master images and create virtual images from those, even on demand, in a matter of a minute. In the case of MS-Windows there is the issue that it has some internal UUID as host-id, which should be unique and which is heavily spread throughout the image, but this issue can be ignored if we do not worry about Windows domains. Usually we do, and I leave dealing with these issues to MS-Windows experts.

Talking about Linux, we only need to make sure that the address of the network interface is unique, which it is if we use hardware and do not mess around with it, but not necessarily if we use virtualization and virtual network devices. This issue needs to be addressed, but it is well supported by common virtualization tools. Another point is the host name. This is not too hard, because we only need to change it in one or two places, which can easily be done by a script. We can mount the image and do the change. Now the image can contain a start-up script that discovers on boot that it is a fresh copy and uses its host name to retrieve further setup from some server. And we just have to maintain there which host has which setup. This can be automated to a very high extent. Then we can for example request a certain number of servers with certain software and configuration via a web interface. This creates new host names, stores the setup with these host names in its setup table, creates the virtual images, deploys them on any available hardware server and once they have started they retrieve their setup from the server. We can also have master images that already contain certain predefined setups so that this second step is only needed for minor adjustments. We have to assume that these exist. Yes, this is called cloud technology.
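The lookup that such a start-up script performs can be sketched in a few lines of Python. Everything here is hypothetical: the host names, the roles and the idea of keeping the table in a dictionary are only for illustration, and a real setup table would live on a central server.

```python
import socket

# Hypothetical setup table; in practice it would be kept on a central server.
SETUP_TABLE = {
    "web-001": {"role": "webserver", "packages": ["apache2"]},
    "db-001": {"role": "database", "packages": ["postgresql"]},
}

# Fallback for hosts that have not been registered yet.
DEFAULT_SETUP = {"role": "unconfigured", "packages": []}


def setup_for(hostname: str, table=SETUP_TABLE) -> dict:
    """Return the setup for a host, or the default for unknown hosts."""
    short_name = hostname.split(".")[0]  # ignore any domain suffix
    return table.get(short_name, DEFAULT_SETUP)


if __name__ == "__main__":
    # On a freshly booted copy, the host name is the only thing we know.
    print(setup_for(socket.gethostname()))
```

The point of the design is that the image itself stays generic and the host name is the only key needed to find everything else.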

If we keep the data somewhere else, these servers can be discarded and new ones can be created, so there is no need to do anything too complex on them. Of course we want to run our software on them. So a day-long procedure to install our software is not attractive any more. We need mechanisms that can be automated.

Running real hardware is a bit more demanding and for larger servers that might even be justifiable, because they do a lot of work for us. Quite often it is possible to use mechanisms quite similar to those of the virtual world even on real hardware. It is possible to boot the machine from a USB stick, which fetches an image and copies it to disk. Again only the host name needs to be provided and then the rest can be automated. Another approach is to initially boot via the network, which is an option that most of us rarely use, but which is supported by the hardware. For running a large server farm such a hardware and BIOS setting can just be the initial default and from there machines can install and reconfigure themselves. In this case we probably need to use the Ethernet address of the network device as a key to our setup table and we need to know what Ethernet addresses are in use. It is a big deal to set up such an environment, but once it is running, it is tremendously efficient. Homogeneous hardware is of course essential, maybe a small number of hardware setups, but not a new model with each delivery. It is not enough that the new hardware is named the same as the old one, it needs to be able to run the same images without manual customization. It is possible to have a small number of images, but since we already have to supply different images for different server setups, multiplying their number by the number of hardware setups can grow out of control if one of the numbers or both become too large.

Now we also need ways to actually access our servers. There have been tools to run a shell simultaneously on n hosts to do exactly the same thing on all of them at once. This is fine if they are exactly the same, but this is something we need to enforce or we need to accept that servers deviate. There are tools around to deal with these issues, but it is actually quite reasonable to take a script based approach. What we do is use ssh-key-exchange to make sure that we can log into the servers from some admin server without a password. We can then define a subset of the set of our servers, which can be one, a couple, a large fraction, all or all with a few exceptions, for example. Then we distribute a script with scp to all the target machines in a loop. We run this on each target machine using ssh and parse the outputs to see which have been successful and which not. Here it is usually a good idea to have a farm of test servers to try this out first and then start on a small number of servers before running it on all of them.
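The script based approach can be sketched as follows in Python, assuming that `scp` and `ssh` are installed on the admin server and that key based login already works. The function names, the remote path and the host names are hypothetical, and the `runner` parameter only exists so that the loop can be tried out without touching real servers.

```python
import subprocess


def select_hosts(all_hosts, exclude=()):
    """Pick the target subset: here simply all hosts minus a few exceptions."""
    excluded = set(exclude)
    return [host for host in all_hosts if host not in excluded]


def run_everywhere(hosts, script, remote_path="/tmp/admin-task.sh",
                   runner=subprocess.run):
    """Copy `script` to each host with scp, run it via ssh, collect results."""
    succeeded, failed = [], []
    for host in hosts:
        # Step 1: distribute the script to the target machine.
        copy = runner(["scp", "-q", script, f"{host}:{remote_path}"],
                      capture_output=True, text=True)
        if copy.returncode != 0:
            failed.append((host, copy.stderr))
            continue
        # Step 2: run it there and keep the output for later inspection.
        run = runner(["ssh", host, "sh", remote_path],
                     capture_output=True, text=True)
        if run.returncode == 0:
            succeeded.append((host, run.stdout))
        else:
            failed.append((host, run.stderr))
    return succeeded, failed


if __name__ == "__main__":
    # Hypothetical host names; start with a small subset of test servers.
    targets = select_hosts(["test-01", "test-02", "test-03"], exclude=["test-03"])
    ok, bad = run_everywhere(targets, "admin-task.sh")
    print("succeeded:", [host for host, _ in ok])
    print("failed:", [host for host, _ in bad])
```

This sketch only looks at the exit code of the remote script. Parsing the collected output for more detailed success criteria would be the next step.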

The big bang philosophy of applying a change twice a year to the whole server landscape is not really a good idea here, because we can lose all our servers if we make a mistake and this can be hard to recover from, although we still have the same tools and scripts even for that, unless we really screw things up. So in these scenarios software that supports the interfaces of the previous version for its communication partners is useful, because it allows a smooth migration.

Just to give you a few hints: During some coffee break I suggested that Google has around a million servers. Even though there is no hard evidence for this, because this number seems to be confidential and only known to Google employees, I would say that this is a reasonable number. For sure they cannot afford a million system administrators. The whole process needs to be very streamlined. Or take the hosting provider that this site is running on. It is possible to have virtual web hosts, in this case multiple sites running on the same virtual or physical machine sharing the same Apache instance with just different directories attached to different URL patterns. This is available for very little money, again suggesting that they are tremendously efficient.


Brass Music and what we can learn for IT

The English term „music“ refers to what we actually listen to, but also to how we write it down on paper, like this:

Music handwritten by Johann Sebastian Bach


This musical notation is actually like a programming language, because it allows us to write down complex musical pieces on paper.

But there is of course more to it than just mechanically playing what is on the paper and dealing with the inconveniences of the musical instruments. What makes it pleasant is the interpretation and that requires skill and intuition and experience and feelings. Since this is not a music blog, I will leave this as it is and stick with the relatively irrelevant side issue of the musical notation language, how we write music.

Generally there might be issues that it is hard to read, because things look too similar, but on the other hand musicians just see it immediately and at least fast enough to work efficiently with it, so I guess that this way of writing music is generally OK.

Now we have the possibility to cover a certain range, slightly more than two octaves, efficiently. Beyond that it gets hard to count the ledger lines. To cover different instruments, at least three kinds of clefs are in use and the same note usually means the same tone. I think that there are ways to shift the whole system by one octave, at least for beginners, but usually with the three clefs that is not necessary for the whole piece of music.

Now for some brass instruments we have different sizes, as for other instruments as well. So the same way to play it yields a different tone on different sizes of the instrument. Just take the recorder, which has five common sizes. They are based on different f and c notes and when you play an f-based recorder you have to adapt to this by playing an f when reading an f-note, for example by closing all holes. On a c-based recorder you read a c (the deepest you can regularly play) and close all holes to play this. Normally people know this and can deal with this. For brass instruments a different approach was chosen. For situations where you actually want to hear an F and might actually write an F in the notes for one size of the instrument, the larger or smaller instruments just call something an „F“ which is actually not an F at all for the rest of the musical world. So for these instruments „F“ does not mean the tone that you hear, but, put simply, the grip combination that you use to achieve the tone. It was supposedly meant to make it easier for relatively unskilled musicians to adapt to different sizes of their instruments, but now even professionals have to live with this.

So they invented a new mechanism to simplify things, which in the overall view makes things a lot more complicated and simplifies just something trivial that even average skilled musicians can easily learn.

I respect the musicians for what they do and I guess since they can deal with this irregularity, it is kind of OK. Or at least up to the musicians to decide if they want to fix this or not.

But we can learn a lot for IT solutions from it.

We often have the situation that we need to adapt a software for another related, but slightly different use case. And we often get the request to simplify things.

It is important to think carefully about the level at which we do the adaptation, so that it will make sense in the long run.

And we should simplify things, but there is no point in trying to make things simpler than they actually are. This simply cannot work and will backfire.


Some Thoughts about Incompleteness of Libraries

Self-written Util Libraries

Today we have really good libraries with our programming languages and they cover a lot of things. The funny thing is that we usually end up writing some Util-classes like StringUtil, CollectionUtil, NumberUtil etc. that cover some common tasks that are not found in the libraries that we use. Usually it is no big deal and the methods are trivial to write. But then again, not having them in the library results in several slightly different ad hoc solutions for the same problem, sometimes flawless, sometimes somewhat weak, that are spread throughout the code and maybe eventually some „tools“, „utils“ or „helper“ classes that unify them and cover them in a somewhat reasonable way.

Imposing Util Libraries on all Developers

In the worst case these self-written library classes really suck, but are imposed on the developers. Many years ago it was „company standard“ to use a common library for localizing strings. The concept was kind of nice, but it had its flaws. First there was a company wide database for localizing strings in order to save on translation costs, but the overhead was considerable and there was a real probability that the same short string meant something different in the context of different applications. This could be addressed by just creating a label that somehow included the application ID and bypassing this overhead whenever a collision was detected. What was worse, the new string made it into a header file and that caused the whole application to be recompiled, unless a hand written make file skipped this dependency. This was of course against company policy as well and it meant a lot of work. In those days compilation of the whole application took about 8 (eight!) hours. Maybe seven. So after adding one string it took 8 hours of compile time to continue working with it. Anyway, there was another implementation of the same concept for another operating system, which used hash tables and did not require recompilation. It had the risk of runtime errors because of non-defined strings, but it was at least reasonable to work with. I ported this library to the operating system that I was using and used it, and during each meeting I had to commit to the long term goal of changing to the broken library, which of course never happened, because there were always higher priorities.

I think the lesson we can already learn is that such libraries that are written internally and imposed on all developers should really be done very well. Senior developers should be involved and if the company does not have them, they should be hired externally for the development. Not to do the whole development, but to help do it right.

Need for Util libraries

So why not just go with the given libraries? Or download some more? Depending on the language there are really good libraries around. Sometimes that is the way to go. Sometimes it is good to write a good util-library internally. But then it is important to do it well, to include only stuff that is actually needed or reasonably likely to be needed and to avoid major effort for reinventing the wheel. Some obscure libraries actually become obsolete when the main default library gets improved.

Example: Trigonometric and other Mathematical Functions

Most of us do not do a lot of floating point arithmetic and consequently we do not need the trigonometric functions like \sin and \cos, other transcendental functions like \exp and \log or functions like cube root (\sqrt[3]{x}) a lot. Where the default set of these functions ends is somewhat arbitrary, but of course we need to go to special libraries at some point for more special functions. We can look at what early calculators used to have and what advanced math text books in schools cover. We have to consider the fact that the commonly used set of trigonometric functions differs from country to country. Americans tend to use six of them, \sin, \cos, \tan, \cot, \sec and \csc, which is kind of beautiful, because it really completes the set. Germans tend to use only \sin, \cos, \tan and \cot, which is not as beautiful, but at least avoids the division by zero and the issue of transforming \tan to \cot. Calculators usually had only \sin, \cos and \tan. But they offered them in three flavors, with modes of „DEG“, „RAD“ and „GRAD“. The third one was kind of an attempt to metricize degrees by having 100 {\rm gon} instead of 90^\circ for a right angle, which seems to be a dead idea. Of course in advanced mathematics and physics the „RAD“ mode, which uses \frac{\pi}{2} instead of 90^\circ, is common and that is what all programming languages that I know use, unlike the calculators. Just to explain the functions for those who are not familiar with the whole set, we can express the last four in terms of \sin and \cos:

  • \tan(x) = \frac{\sin(x)}{\cos(x)} (tangent)
  • \cot(x) = \frac{\cos(x)}{\sin(x)} (cotangent)
  • \sec(x) = \frac{1}{\cos(x)} (secant)
  • \csc(x) = \frac{1}{\sin(x)} (cosecant)
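Written as code, these four definitions and the „DEG“ and „GRAD“ flavors are only a few lines. The following is a minimal sketch in Python on top of the standard `math` module. The function names are chosen for this illustration and no special care is taken for arguments near the poles.

```python
import math


def cot(x: float) -> float:
    """Cotangent: cos(x) / sin(x), undefined where sin(x) is zero."""
    return math.cos(x) / math.sin(x)


def sec(x: float) -> float:
    """Secant: 1 / cos(x)."""
    return 1.0 / math.cos(x)


def csc(x: float) -> float:
    """Cosecant: 1 / sin(x)."""
    return 1.0 / math.sin(x)


def gon_to_rad(g: float) -> float:
    """Convert gon to radians: 200 gon correspond to pi."""
    return g * math.pi / 200.0


def sin_deg(d: float) -> float:
    """Sine in the DEG flavor, using the standard degree conversion."""
    return math.sin(math.radians(d))


if __name__ == "__main__":
    print(cot(math.pi / 4), sec(0.0), csc(math.pi / 2))
    print(sin_deg(30.0), math.sin(gon_to_rad(100.0)))
```

The other functions get their degree and gon flavors in the same way, by converting the argument first.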

Then we have the inverse trigonometric functions, which can be denoted with something like \arcsin or \sin^{-1} for all six trigonometric functions. There is an irregularity to keep in mind. We write \sin^n(x) instead of (\sin(x))^n for n=2,3,4,\ldots, which is the product of that number of \sin(x) terms. And we use \sin^{-1}(x) to apply the function „\sin“ „-1 times“, which is actually the inverse function. Mathematicians have invented this irregularity and usually it is convenient, but it confuses those who do not know it. Of these functions many programming languages offer only \tan^{-1}, assuming the other five can be created from that. This is true, but cumbersome, because it needs to distinguish a lot of cases using something like if, so there are likely to be many bugs in software doing this. Also these ad hoc implementations lose some precision.
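To illustrate the case distinctions, here is a sketch in Python of how the arcsine could be built from the arctangent alone, followed by the three inverse functions that are usually missing. The function names are made up for this illustration, and the range chosen for `acot` is only one of the conventions in use.

```python
import math


def asin_from_atan(x: float) -> float:
    """Arcsine built only from atan, with the case distinctions it needs."""
    if x < -1.0 or x > 1.0:
        raise ValueError("asin is only defined on [-1, 1]")
    if x == 1.0:
        return math.pi / 2  # avoid dividing by zero below
    if x == -1.0:
        return -math.pi / 2
    return math.atan(x / math.sqrt(1.0 - x * x))


def acot(x: float) -> float:
    """Inverse cotangent with values in (0, pi); other conventions exist."""
    return math.pi / 2 - math.atan(x)


def asec(x: float) -> float:
    """Inverse secant, defined for |x| >= 1."""
    return math.acos(1.0 / x)


def acsc(x: float) -> float:
    """Inverse cosecant, defined for |x| >= 1."""
    return math.asin(1.0 / x)


if __name__ == "__main__":
    print(asin_from_atan(0.5), math.asin(0.5))
    print(acot(1.0), asec(2.0), acsc(2.0))
```

Even this small example already needs three special cases for the arcsine, and the term `1.0 - x * x` is where precision can be lost when x is close to 1 or -1.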

It was also common to have a conversion from polar coordinates to rectangular coordinates (p2r) and vice versa (r2p), which is kind of cool and again easy, but not too trivial to do ad hoc. Something like atan2 in FORTRAN, which does the essence of the harder r2p operation, would work also, depending on how convenient it is to deal with multiple return values. We can then do r2p using r=\sqrt{x^2+y^2}, \phi ={\rm atan2}(x, y) and p2r by x=r \sin(\phi) and y = r \cos(\phi).
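With atan2 available, both conversions become very short. The following Python sketch uses the convention that is most common in libraries, where the angle is measured from the positive x-axis, that is \phi = {\rm atan2}(y, x), x = r \cos(\phi) and y = r \sin(\phi). Swapping the roles of x and y gives the variant that measures the angle from the y-axis. The names `r2p` and `p2r` are taken from the text, the rest is an assumption for this illustration.

```python
import math


def r2p(x: float, y: float):
    """Rectangular to polar: returns (r, phi), phi measured from the x-axis."""
    return math.hypot(x, y), math.atan2(y, x)


def p2r(r: float, phi: float):
    """Polar to rectangular: returns (x, y)."""
    return r * math.cos(phi), r * math.sin(phi)


if __name__ == "__main__":
    r, phi = r2p(-3.0, 4.0)
    print(r, phi)
    print(p2r(r, phi))
```

Python can return both values as a tuple, so there is no inconvenience with multiple return values here. `math.hypot` is used instead of a hand written square root of the sum of squares, because it avoids overflow for large arguments.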

The hyperbolic functions like \sinh and their inverses like \arsinh or \sinh^{-1} are rarely used, but we find them on the calculator and in the math book, so we should have them in the standard floating point library. There is only one flavor of them.
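The hyperbolic functions that are typically missing can be derived from the ones that are usually present. This is a minimal sketch in Python, assuming that sinh, cosh and their inverses come from the standard `math` module. The function names are chosen for this illustration and the precision for extreme arguments is not optimized.

```python
import math


def coth(x: float) -> float:
    """Hyperbolic cotangent, undefined at 0."""
    return math.cosh(x) / math.sinh(x)


def sech(x: float) -> float:
    """Hyperbolic secant."""
    return 1.0 / math.cosh(x)


def csch(x: float) -> float:
    """Hyperbolic cosecant, undefined at 0."""
    return 1.0 / math.sinh(x)


def acoth(x: float) -> float:
    """Inverse of coth, defined for |x| > 1."""
    return math.atanh(1.0 / x)


def asech(x: float) -> float:
    """Inverse of sech, defined for 0 < x <= 1, returns the value >= 0."""
    return math.acosh(1.0 / x)


def acsch(x: float) -> float:
    """Inverse of csch, defined for x != 0."""
    return math.asinh(1.0 / x)


if __name__ == "__main__":
    print(coth(1.0), sech(1.0), csch(1.0))
    print(acoth(coth(2.0)), asech(sech(1.5)), acsch(csch(0.7)))
```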

Logarithms and exponential functions are found in two flavors on calculators: \log(x)=\log_{10}(x)=\lg(x) and \ln(x)=\log_{e}(x) and 10^x and e^x=\exp(x). The log is kind of confusing, because in mathematics and physics and in most current programming languages we mean \log(x)=\log_{e}(x) (natural logarithm). This is just a wrong naming on calculators, even if they all made the same mistake across all vendors and probably still do in the scientific calculator app on the phone or on the desktop. As IT people we tend to like the base two logarithm {\rm ld}(x)=\log_2(x), so I would tend to add that to the list. Just to make the confusion complete, in some informatics text books and lectures the term „\log“ refers to the base two logarithm. It is a bad habit and at least laziness should favor writing the correct „{\rm ld}“.
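Python happens to follow the naming suggested here, so it can serve as an illustration. In its standard `math` module `log` with one argument is the natural logarithm, and `log10` and `log2` carry the base in the name. The helpers `exp10` and `exp2` below are not part of that module, they are simple assumptions for this sketch.

```python
import math


def exp10(x: float) -> float:
    """10 to the power of x, the inverse of log10."""
    return 10.0 ** x


def exp2(x: float) -> float:
    """2 to the power of x, the inverse of log2."""
    return 2.0 ** x


if __name__ == "__main__":
    x = 1000.0
    print("natural logarithm (ln):", math.log(x))
    print("base 10 logarithm (lg):", math.log10(x))
    print("base 2 logarithm (ld): ", math.log2(x))
    print("round trips:", math.exp(math.log(x)), exp10(math.log10(x)), exp2(math.log2(x)))
```

With the base in the function name there is no need to remember what lg, ln and ld stand for.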

Then we usually have power functions x^y, which surprisingly many programming languages do not have. If they do, it is usually written as x ** y or pow(x, y). Then there are square root, square and maybe cube root and cube. Even though the square root and the cube root can be expressed as powers using \sqrt{x}=x^\frac{1}{2} and \sqrt[3]{x}=x^\frac{1}{3}, it is better to have them as dedicated functions, because they are used much more frequently than any other power with non-integral exponents and it is possible to write optimized implementations that run faster and more reliably than the generic power, which usually needs to go via log and exp. Internal optimization of power functions is usually a good idea for integral exponents and can easily be achieved, at least if the exponent is actually of an integer type.
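Both points can be sketched in Python. The cube root shows why a dedicated function is more reliable: in Python 3 the generic power of a negative float with a fractional exponent yields a complex number instead of the real cube root. The integer power uses repeated squaring, which is one common way to optimize integral exponents and not necessarily what any particular library does. The names `cbrt` and `int_pow` are chosen for this illustration.

```python
import math


def cbrt(x: float) -> float:
    """Real cube root that also works for negative arguments."""
    return math.copysign(abs(x) ** (1.0 / 3.0), x)


def int_pow(x: float, n: int) -> float:
    """x to the power of an integer n, computed by repeated squaring."""
    if n < 0:
        return 1.0 / int_pow(x, -n)
    result = 1.0
    base = x
    while n > 0:
        if n & 1:          # this bit of the exponent is set
            result *= base
        base *= base       # square for the next bit
        n >>= 1
    return result


if __name__ == "__main__":
    print(cbrt(-8.0))              # real cube root
    print((-8.0) ** (1.0 / 3.0))   # generic power: a complex number
    print(int_pow(2.0, 10), int_pow(2.0, -2))
```

Repeated squaring needs a number of multiplications proportional to the number of bits of the exponent, not to the exponent itself.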

Factorial and binomial coefficient are usually used for integers, which is not part of this discussion. Extensions for floating point numbers can be defined, but they are beyond the scope of advanced school mathematics and of common scientific calculators. I do not think that they are needed in a standard floating point library. What could be in an „advanced math library“ is an interesting question in its own right, but \sec and \tanh^{-1} and {\rm ld} for sure belong in the base math library.

That’s it. It would be easy to add all these to the standard library of any programming language that does floating point arithmetic at all and it would be helpful for those who work with this and not hurt at all those who do not use it, because this stuff is really small compared to most of our libraries. So this would be the list:

  • sin, cos, tan, cot, sec, csc in two flavors
  • asin, acos, atan, acot, asec, acsc (standing for \sin^{-1}…) in two flavors
  • p2r, r2p (polar coordinates to rectangular and reverse) or atan2
  • sinh, cosh, tanh, coth, sech, csch
  • asinh, acosh, atanh, acoth, asech, acsch (for \sinh^{-1}…)
  • exp, log (for e^x and logarithm base e)
  • exp10, exp2, log10, log2 (base 10 and base 2, I would not rely on knowledge that ld and lg stand for log2 and log10, respectively, but name them like this)
  • sqrt, cbrt (for \sqrt{x} and \sqrt[3]{x})
  • ** or pow with double exponent
  • ** or pow with integer exponent (maybe the function with double exponent is sufficient)
  • \frac{1}{x}, x^2, x^3, x^\frac{1}{y} are maybe actually not needed, because we can just write them using ** and /

Actually pretty much every standard library contains sin, cos, tan, atan, exp, log and sqrt.


Java is actually not so bad in this area. It contains the atan2, sinh, cosh, tanh, asin, acos, atan, log10 and cbrt functions, beyond what pretty much every library contains. And it contains conversions from degrees to radians and vice versa. And as you can see here in the source code of pow, the calculations are actually quite sophisticated and done in C. It seems to be inspired by GNU-classpath, which did a similar implementation in Java. It is typical that a function that has a uniform mathematical definition gets very complicated internally with many cases, because depending on the parameters different ways of calculation provide the best precision. It would be quite possible that this function is so good that calling it with an integer as a second parameter, which is then converted to a double, would actually be good enough and leave no need for a specific function with an integer exponent. I would tend to assume that that is the case.

In this github project we can see what a library could look like that completes the list above, includes unit tests and also works for the edge cases, which ad hoc solutions often do not. What could be improved is providing the best possible precision for any legitimate parameters, which I would see as an area of further investigation and improvement. The general idea is applicable to almost any programming language.

Two areas that have been known for a great need of such additional libraries are collections and Date&Time. I would say that a lot of what I would wish for from a decent collection library has been addressed by Guava. Getting date and time right is surprisingly hard, but just think of the year-2000-problem to see the significance of this issue. I would say Java had this one messed up, but Joda Time was a good solution and its design has made it into the standard distribution of Java 8 as java.time.


This may serve as an example. There are usually some functions missing for collections, strings, dates, integers etc. I might write about them as well, but they are less obvious, so I would like to collect some input before writing about that.

libc on Linux seems to contain sin, cos, tan, asin, acos, atan, atan2, sinh, cosh, tanh, asinh, acosh, atanh, sqrt, cbrt, log10, log2, exp, log, exp10, exp2. Surprisingly Java does not make use of these functions, but comes up with its own.

Actually a lot of functionality is already in the CPU-hardware. IEEE-recommendations suggest quite an impressive set of functions, but they are all optional and sometimes the accuracy is poor.

But standard libraries should be slightly more complete and ideally there would be no need to write a „generic“ util-library. Such libraries should only be needed for application specific code that is somewhat generic across some projects of the organization or when doing a really demanding application that needs more powerful functionality than can easily be provided in the standard library. Ideally these can be donated to the developers of the standard library and included in future releases, if they are generic enough. We should not forget that even programming languages that are mainstream and used by thousands of developers all over the world are usually maintained by quite small teams, sometimes only working part time on this. But usually it is hard for an outsider to get even a good improvement into their code base.

So what functions do you usually miss in the standard libraries?
