Karl Brodowsky's IT-Blog – Seite 10 – IT Sky Consulting GmbH

Poor mans profiling with LOGs

We have professional profiling tools and we should use them.. They give really useful and extensive information.

So why bother about doing „poor man’s profiling“?

About ten years ago, running profilers was kind of constrained to very small examples and computers with really huge memory. We have this huge memory now on every development machine. But we can only run such tests on development and test systems.

Sometimes it is possible to copy data from the productive system to the test system and really simulate what is happening on the productive system. Often this is quite far away from what is really happening on the productive system and we often only know about a problem when it has already occurred.

What we usually do have are log files. Each entry has a time stamp and can somehow be connected to a line of code where it is created. We always know the class or the file or something like that from where the log statement was issued and since there is only a limited number of log statements, they can often be recognized. Sometimes it is a good idea to actually help finding where it was logged. For example could we artificially put the source code line number or some uniq number or a short random string into the source code for each log statement. Then we can find several pieces of information in the log statement:

The timestamp (please use ISO-format!!! And at least milliseconds)
The source code location (line number, class name, random string, relative position in file, meaningful description like entering/leaving methods etc.)
The thread and/or process
The payload information of the log

Now log statements analyzed by thread. The first step is to look from which pairs of locations in the source code we observe successive log statements from the same thread, ignoring all logging from other threads and processes. We can also estimate the time that was spent for going from the first to the second location. Aggregating this by basically adding up thes timestamp deltas for each such pair over all threads and the whole log file will already indicate some hotspots where time is spent. Sometimes it is even interesting to look at sequences of three log statements (from the same thread). This was useful, when DB-commits, i.e. one such pair like „begin commit“ to „commit done“ used around 50% of the time, if not more. This was not interesting without knowing the statement that was committed, which was found out by looking at the previous log entry of the same thread just before the commit.

Such analysis can be done with scripts for example in Perl, Ruby or Python. I use Perl for this kind of task. Maybe log analysis tools provide this out of the box or with a little bit of tweaking. A concept about what entries in log files look like that make recognizing the location of the program that created the log entry, are very helpful. Sometimes we actually want to see, with which kins of data the problem occurred, which means looking also into the payload.

Rounding to Rational Numbers

Usually we think of rounding as a way of approximately expressing numbers with many decimal places by numbers with fewer decimal places.

Regular readers of this blog have already encountered, that this concept can be extended and generalized, as is mentioned in the article about Geometric and Harmonic Rounding and Residue Class Rounding or Rounding with Sum.
Now computers and humans can deal with rational numbers and sometimes that is the best way.

Idea 1: Read the Double as Rational

Most typical non-integer computer numbers are actually already rational numbers of the form $\frac{n}{2^m}$ with a relatively large power of two in the denominator. But as soon as we perform divisions, we leave the accurate world and round, usually in the way that the machine throws in front of our feet out of the box. But all other floating point operations can result in rounding as well. So rational numbers can be a way to go. So we can just naturally convert a double or float number into a rational number and continue working with that.

Idea 2: Go for human readable fractions

Let us think of human readers. We are kind of comfortable with fractions like $\frac{1}{12}, \frac{1}{10}, \frac{1}{8}$ and multiples of them, so basically every fraction that can be expressed with a denominator of 1, 2, 3, 4, 5, 6, 8, 10 or 12. The 12 is because of the omnipresence of the analog clock, the 10 because it is just one more decimal digit and the 8 because it is about the number of successive halving that we can still easily cope with. The idea would be to round to the nearest such fraction and to prefer the lower denominator when two adjacent values are equally far away. This works out well, because our set of allowed fractions between 0 and 1 is just

$\{0, \frac{1}{12}, \frac{1}{10}, \frac{1}{8}, \frac{1}{6}, \frac{1}{5}, \frac{1}{4}, \frac{1}{3}, \frac{3}{8}, \frac{2}{5}, \frac{5}{12}, \frac{1}{2}, \frac{7}{12}, \frac{3}{5}, \frac{5}{8}, \frac{2}{3}, \frac{3}{4}, \frac{4}{5}, \frac{5}{6}, \frac{7}{8}, \frac{9}{10}, \frac{11}{12}, 1\}$

and it is more or less trivial to program such a rounding algorithm, by just hard coding the limits, normalizing to the interval $[0,1)$ and finding the right slot by binary search, for example. This is what we usually want to make numbers human readable and understandable. If we want more accuracy we can often just use the trick of going to % or ‰ or just shifting units by multiples of 1000, depending on what we are measuring or counting. This is often a bit better than just decimal numbers, and we can more often solve the issue of rounding with sum when some of the values we want to round are the same and just won’t come out the same of our rounding procedure.

Idea 3: Use continuous fractions

With continuous fractions it is possible to express any real number in the form

$a_0 + \cfrac{1}{a_1 + \cfrac{1}{a_2 + \cfrac{1}{ \ddots + \cfrac{1}{a_n} }}}$

$a_0 + \cfrac{1}{a_1 + \cfrac{1}{a_2 + \cfrac{1}{a_3 + \cfrac{1}{\ddots}}}}$

For integers this is trivial, just use $a_0$ . For negative numbers we just take the negative of the continuous fraction of the absolute value, so we can assume a non-integral positive number $r_0$ .
This allows determining $a_0$ by just taking the integer part of it.
Then we continue with $r_1 = \frac{1}{r_0 - a_0}$ and so on. Either we end up with an integer $a_n$ at some point, which is the case for rational numbers or the continuous fraction continues infinitely, which is the case for irrational numbers. Now taking the continuous fraction just up to the first $n$ elements, as in the first form and ignoring what comes after that, can in turn be converted into a rational number, by just doing some trivial rational arithmetic. It turns out, that this is a very good approximation for that size of the denominator. The funny thing is that this works even for irrational numbers under certain conditions. For example could we assume that we are dealing with numbers of the form $x+y\sqrt{D}$ for some fixed rational $D$ and rational numbers $x$ and $y$ . It is relatively easy to approximate $x+y\sqrt{D}=x_0+y_0\sqrt{D}$ with the largest integer $a_0$ that is less or equal. Now we can calculate

$\frac{1}{x_k+y_k\sqrt{D}-a_k}=\frac{x_k-a_k-y_k\sqrt{D}}{(x_k-a_k)^2-y_k^2\cdot D}=x_{k+1}+y_{k+1}\sqrt{D}$

and thus use this algorithm as another way to calculate rational approximations of square roots. In this case the continuous fraction becomes periodic and there is a surprising lot of interesting mathematics behind it, if you like to dig deeper.

Accident Languages

Some commonly used languages have been quite well designed or at least would have been considered so at the time when they appeared. Even if they have their weaknesses, they should be good for some purposes.

Now beauty of programming languages is highly subjective. So I do not claim any universal truth to this. But I consider the popularity of certain languages as an accident, at least for the time when they became highly popular. This does not preclude that good stuff has been done with them or that they have matured to some extent or even that their potential replacements of the time of their appearance have lost relevance. I do not mention the classics COBOL and Fortran, because they were early pioneers of high level languages. And looking at these two languages, one of them does not deserve to be mentioned together with the other one.

Anyway, here we go:

Pascal

Pascal was in certain areas popular before C took off. To a great extent this was true because it was used as a teaching language and because Turbo Pascal ran well on a PC with 4.77 MHz and 256k Memory and Floppy Disks only in the early 1980s. This was the hardware that many people could afford with some pain (prices around 5000 DEM in Germany). And it transported some interesting concepts to a wider public. People who knew Algol 68 used to consider it a step backwards. Anyway, in the later 1980s C was available to a broader public, because affordable hardware that could run C-compilers reasonably fast and conveniently came into existence. C does have its flaws, but it is a well designed language and I personable consider the step from Pascal to C one of the most positive ever encountered.

Python

Python is a good language with a lot of power. But I see it in a similar space as Ruby, which to me appears as one of the most beautifully designed languages. So why use Python if Ruby could serve the same purpose. While Ruby took off with Rails for some time, Python kind of took over where people were looking for a replacement for Fortran and in Data science. It now has really superior libraries in many areas and has achieved a very strong position in Devops, where it has taken over to a large extent from Perl and Ruby. So Python has really gained traction and I do recommend learning it.

PHP

PHP came out as an easier to use Perl clone for web development. In the 1990s and early 2000s CGI was a common way to do web application development and Perl was strong in this area. I still consider Perl the better designed language, even though it does have its flaws and weaknesses. A more critical view would be that PHP was an inferior clone that copied only what they understood. In terms of building a great open source community it would have been more desirable, if the PHP guys had contributed to developing frameworks and libraries that would have made Perl better. But competition is a good thing also, even in open source. And PHP has clearly won the competition for the web space against Perl and now has good library and framework support. Some of the negative reputation of PHP comes from the fact, that unskilled developers write a lot of PHP code, because they came from the web design space and just added a bit of PHP to their HTML code. This is not to blame on the language. On the positive side I would mention that a lot of really great software has been written in PHP, some of which I use on a daily basis, like Wikipedia and its underlying software MediaWiki, WordPress and a lot of others.

VBA

VBA comes as a development and scripting language for MS-Office, which is a useful thing. Like PHP it attracts non-developers who write ugly code, because they use MS-Office and enrich it with a bit of VBA, which is not to blame on the language. I have not really tried it enough, so I let others speak:

We do have to admit, that sometimes non-developers „get the job done“ in a fraction of the time that professional developers need using VBA and Excel. The drawback is, that this solution is really limited, because it depends on some excel sheets on drive C: of a certain team member. Or of course shared ones, but who is using it when? How many copies are there around? Data and program are together in one file…
Anyway, a lot of useful stuff is done with VBA and as long as we do not miss the point when it is time to switch to something else, this might not even be a bad idea.

JavaScript

JavaScript was developed in 1995 and gave some functionality to the browser. The original idea was to enable applets, but applets have lost their relevance long ago and JavaScript itself can now do everything that applets could do and much more than that. JavaScript did transport interesting concepts to a wider public, like lambda expressions or anonymous functions, that had been around for ages in Lisp, but only been known to a small fraction of the software developers. But the legitimate question is, why a new language had to be invented for this, instead of adding an existing and mature language like Perl or Tcl to the browser. They were good candidates for this in 1995, I would not recommend to add them now, because we do have more adequate choices now. Anyway, JavaScript lacked and to my knowledge still lacks decent integer types. All numbers are double precision floating point, which does express integers of up to 53 bits accurately, but for counting pixels some kind of sufficiently long integer would have been a useful thing. There was some hope coming up that Dart might be established as a successor to JavaScript. But not even Google with Chrome was influential enough to get this established. It did survive as one of many languages that can be „compiled“ or „transpiled“ to JavaScript, as TypeScript, CoffeeScript, Clojure, Scala and many other languages can now be transpiled to JavaScript. I was told that even JavaScript itself is transpiled in order to support different languages of the language from one source code base. New hope is coming with WebAssembly, which will establish a common binary language (like JavaVM) on all browsers and allow compiling languages to this target instead of JavaScript. Again, as PHP and VBA, many non developers use JavaScript, again coming from the web design background. And again, we have today very good web applications that heavily rely on browser side logic, which is implemented in JavaScript or compiled to JavaScript. This allows for much better user experience and saves power on the server which no longer needs to do the rendering. Moving JavaScript to the server side with Node.js looks like a weird idea. We have better languages to work with on the server side than JavaScript. But again, competition between concepts, languages and technologies is also a good thing.

Conclusion

All these languages have somehow reached a relatively high popularity, even though they were not really the most advanced choices. But a lot of good stuff has been done with them and is still being done for example with JavaScript, Python and PHP today. That is what counts most.

UUIDs revisited

UUIDs have proven useful in many circumstances.
We have basically two main variants:

The UUID is calculated as a combination of the Ethernet-MAC-address, the timestamp and a counter.
The UUID is calculated using a good random number generator

While variant 1 provides for a good uniqueness, there are some issues with it. Today we use mostly virtualized servers, which means that the MAC-address is coming from a configuration file and no longer guaranteed to be world wide unique. And we give away some information with the UUID, that we do not necessarily want to give away.

Variant 2 can be proven to have an acceptably low risk of collisions, but this is only true when using really good random number generators, which cannot always be guaranteed. Also it introduces an uncertainty in an area where we do not need it. We need to worry about this uniqueness, at least a little but, which is unnecessary.

So the question is, if we can rethink variant 1.

Assuming, that our software runs on our server farm. There may be a few hundred or thousand or even millions of virtual or physical servers. Now the organization does have a way to uniquely identify their servers. Of course we only need to consider the servers that are relevant for the application. Maybe an ID for the service instance instead of the server is even better. We may assume a numerical ID or something or have a table to map IP addresses and the like to such an ID. Some thinking is still required on how to do this. We can fill the digits that we do not need with random numbers.

Putting this ID instead of the MAC address solves the issue of configurable MAC address.

The next problem, that timestamps can be abused to find out something that should not be found out could be resolved by running the timestamp part or even the ID part (including a random number) through a symmetric encryption or simply some bijective function that is kept as a secret.

In many circumstances there is nothing wrong with customizing the UUID-generation to some „local“ standard, if this is well understood and carefully implemented.

How to calculate Square Roots and Cubic Roots

The functions sqrt and sometimes even cbrt are commonly available, but it is nice to see how they can be calculated.

There are several approaches, but the most popular ones are Newton’s method and an algorithmic formulation of how roots are taken manually, for those old enough to still have learned it in school. Earlier measurements that I did many years ago showed that the Newton approximation is slower, but it would be worth to do newer measurements.

So we have an equation $y = x^2$ or $y=x^3$ and want to find $x$ or a well defined approximation of $x$ when we know $y$ . Mathematically speaking we want to assume that $y$ is constant and we want to find an $x$ for which $f(x)=x^2-y=0$ or $g(x)=x^3-y=0$ . If we guess such an $x$ and then draw the tangent at the curve of the function at the point $(x, f(x))$ or $(x, g(x))$ , then the intersection point of the tangent can be used as the next approximation. This method converges in the case of these two functions (and some others) and is reasonably fast. Now the tangent has the linear equation

$\frac{y-y_0}{x-x_0}=f'(x_0)$

where $y_0=f(x_0)$ and $f'(x)=\frac{df(x)}{dx}$ is the derivative of $f(x)$ . We want to solve this equation for $y=0$ and thus we get

$x-x_0 = -\frac{f(x_0)}{f'(x_0)}$

and thus

$x=x_0-\frac{f(x_0)}{f'(x_0)}$

As an iteration rule

$\bigwedge_{i=0}^\infty x_{i+1}=x_i-\frac{f(x_i)}{f'(x_i)}$

In case of the sqare root we can just start with an estimation by shifting half the length to the right, but avoiding zero, which is important because of the division. Then we get for an appropriate n

$x_0 = 2^{-n}y$

$x_{i+1}=x_i-\frac{x_i^2-y}{2x_i}=\frac{x_i+y/x_i}{2}$

The last form is quite intuitive, even without calculus. As I said this converges usefully fast and there is tons of math around to describe the behavior, speed, precision and convergence of the calculations performed in this algorithm. Written in Ruby just for integers, this is quite simple. Convergence is simply discovered by the fact that the result does not change any more, which may fail in some cases, where intermediate results oscillate between two values, but just for the purpose of benchmarking it seems to be sufficient:
def sqrt_newton(x) if (x == 0) then return 0 end y0 = x u0 = x while (u0 > 0) do y0 >>= 1 u0 >>= 2 end y0 = [1, y0].max yi = y0 yi_minus_1 = -1 loop do yi_plus_1 = (yi + x/yi) >> 1; if (yi_minus_1 == yi_plus_1) then return [yi, yi_minus_1].min elsif (yi == yi_plus_1) then return yi end yi_minus_1 = yi yi = yi_plus_1 end end

The newton algorithm tends to oscillate between two approximations, so this termination criteria takes into account $y_{i-1}$ , $y_i$ and $y_{i+1}$ and uses the lower of the two oscillating values. This results in calculating the largest integer $y$ such that $y^2\le x$ and $(y+1)^2 > x$ .

For the third root we get for an appropriate n

$x_0 = 2^{-n}y$

$x_{i+1}=x_i-\frac{x_i^3-y}{3x_i^2}=\frac{x_i+y/x_i^2}{3}$

Again this is a useful way and there is math around on when to stop for a desired precision.
Read Wikipedia for the convergence issues.

There is another approach, that people used to know when doing calculations on paper was more important than today. For the decimal system it works like this:
1. 7 3 2 0 5 ---------------------- / 3.00 00 00 00 00 /\/ 1 = 20*0*1+1^2 - 2 00 1 89 = 20*1*7+7^2 ---- 11 00 10 29 = 20*17*3+3^2 ----- 71 00 69 24 = 20*173*2+2^2 ----- 1 76 00 0 = 20*1732*0+0^2 ------- 1 76 00 00 1 73 20 25 = 20*17320*5+5^2 ---------- 2 79 75(source Wikipedia)
We group the digits to the left and to the right of the decimal point in groups of two. The highest possible square of an integral number that is below or equal to the leftmost group (03 in the example above) is used for the first digit of the result (1 in the example above). This square is subtracted and the next group is appended (200 in the example). Assuming that $y_n$ is the result already calculated and $x_n$ is what we have achieved after the subtraction and the appending of the next group, we search for a digit $z_n$ such that $u_n = 20\cdot y_n\cdot z_n + z_n^2 \le x_n$ . $z_n$ is chosen in such a way that it yields the maximum possible $u_n$ wich is still $\le x_n$ . Subtracting $u_n$ from $x_n$ and appending the next group allows for the next iteration.

Now this can be turned into an algorithm. The first approach is to just switch from decimal system to binary system. Then for each iteration step we have to deal just with the possible values of $0$ and $1$ , which greatly simplifies the algorithm. Here is a simple ruby program that would do this:
def split_to_words(x, word_len) bit_pattern = (1 << word_len) - 1 words = [] while (x != 0 || words.length == 0) do w = x & bit_pattern x = x >> word_len words.unshift(w) end words end

def sqrt_bin(x) if (x == 0) then return 0 end xwords = split_to_words(x, 2) xi = xwords[0] - 1 yi = 1 1.upto(xwords.length-1) do |i| xi = (xi << 2) + xwords[i] d0 = (yi << 2) + 1 r = xi - d0 b = 0 if (r >= 0) then b = 1 xi = r end yi = (yi << 1) + b end return yi end

It seems that the two solutions yield the same results, but the sqrt_newton outperforms sqrt_bin by a factor of two.

Now we should reconsider, if base 2 is really the best choice. Actually we can use any power of 2 as a base and efficiently work with that. Apart from the initial first step, which is done by using an extended version of sqrt_bin, the next steps are estimated by division and trying neighboring values to get the exact result. This makes use of the fact that the equation we need to solve
$u_n = 2\cdot b\cdot y_n\cdot z_n + z_n^2 \le x_n$ with the maximum $z_n$ fullfilling this equation, where $b$ is the base to which we are working, witch was $10$ or $2$ above and could now be a power of $2$ . As soon as $y_n\cdot b$ has a certain size, the influence of $z_n^2$ becomes less relevant. We can consider the maximum posible value for z_n, which is $b-1$ and thus solve $2\cdot b\cdot y_n\cdot z_n\le x_n$ and $2\cdot b\cdot y_n\cdot z_n\le x_n-(b-1)^2$ , each for the maximum $z_n$ fullfilling the equation. This can be calculated by simple division. If the range between the two solutions is small enough, then each value in the range can be tried to find the actual accurate solution for $2\cdot b\cdot y_n\cdot z_n + z_n^2 \le x_n$ and this is more efficient than working just bitwise. This method sqrt_word seems to outperform sqrt_newton for longer numbers, for example around 60 decimal digits with word_length=16. So the most promising approach seems to be to optimize the implementation and parameters of sqrt_word. The issue of termination, which has been properly addressed in the newton implementation, is already dealt with in this implementation. For more serious analysis it would be interesting to implement the algorithms in C or even in assembly language. So this is the final result for square roots, with some checks added:
def check_is_nonneg_int(x, name) raise TypeError, "#{name}=#{x.inspect} must be Integer" unless (x.kind_of? Integer) && x >= 0 end


def check_word_len(word_len, name="word_len")

  unless ((word_len.kind_of? Integer) && word_len > 0 && word_len <= 1024)

    raise TypeError, "#{name} must be a positive number <= 1024"

  end

end
def split_to_words(x, word_len)

  check_is_nonneg_int(x, "x")

  check_word_len(word_len)

  bit_pattern = (1 << word_len) - 1

  words = []

  while (x != 0 || words.length == 0) do

    w = x & bit_pattern

    x = x >> word_len

    words.unshift(w)

  end

  words

end
def sqrt_bin(x)

  yy = sqrt_bin_with_remainder(x)

  yy[0]

end
def sqrt_bin_with_remainder(x)

  check_is_nonneg_int(x, "x")

  if (x == 0) then

    return [0, 0]

  end
  xwords = split_to_words(x, 2)

  xi = xwords[0] - 1

  yi = 1
  1.upto(xwords.length-1) do |i|

    xi = (xi << 2) + xwords[i]

    d0 = (yi << 2) + 1

    r  = xi - d0

    b  = 0

    if (r >= 0) then

      b  = 1

      xi = r

    end

    yi = (yi << 1) + b

  end

  return [yi, xi]

end
def sqrt_word(x, n = 16)

  check_is_nonneg_int(x, "x")

  check_is_nonneg_int(n, "n")
  n2 = n << 1

  n1 = n+1

  check_word_len(n2, "2*n")

  if (x == 0) then

    return 0

  end
  xwords = split_to_words(x, n2)

  if (xwords.length == 1) then

    return sqrt_bin(xwords[0])

  end
  xi = (xwords[0] << n2) + xwords[1]

  a  = sqrt_bin_with_remainder(xi)

  yi = a[0]

  if (xwords.length <= 2) then

    return yi

  end
  xi = a[1]

  2.upto(xwords.length-1) do |i|

    xi = (xi << n2) + xwords[i]

    d0 = (yi << n1)

    q  = (xi / d0).to_i

    j  = 10

    was_negative = false

    while (true) do

      d = d0 + q

      r = xi - (q * d)

      break if (0 <= r && (r < d || was_negative))

      if (r < 0) then

        was_negative = true

        q = q-1

      else

        q = q+1

      end

      j -= 1

      if (j <= 0) then

        break

      end

    end

    xi = r

    yi = (yi << n) + q

  end

  return yi

end

def sqrt_newton(x) check_is_nonneg_int(x, "x") if (x == 0) then return 0 end y0 = x u0 = x while (u0 > 0) do y0 >>= 1 u0 >>= 2 end y0 = [1, y0].max yi = y0 yi_minus_1 = -1 loop do yi_plus_1 = (yi + x/yi) >> 1; if (yi_minus_1 == yi_plus_1) then return [yi, yi_minus_1].min elsif (yi == yi_plus_1) then return yi end yi_minus_1 = yi yi = yi_plus_1 end end

This is the approach that has been built into the LongDecimal library, ignoring Newton. The examples have been added to github.

The algorithms can be extended to cubic roots or any higher roots. In this case, the $n$ th root of $\sum_{j=0}^{m}a_j b^{n(m-j)}$ is calculated by starting with the maximal integral number $z_0$ with $z_0^n \le a_0$ and the subsequently finding numbers z_j fullfilling an equation of the form ${n}\choose{k}\sum_{k=1}^n (by_j)^{n-k}z_j^k \le x_i$ . This is always easy to handle for base two, by just testing the two possible solutions. For higher bases and $n=3$ it involves solving an quadratic equation, once the numbers are high enough to neglect the term $z_j^3$ . For $n=4$ it is just possible to take the square root of the square root. For higher values of $n$ and bases other than $2$ it becomes really difficult to tame this algorithm. So I intend to constrain myself to square roots and cube roots. I have not explored, if it is useful to calculate the cube root with a higher base than $2$ and which approach provides the best performance for cube roots. Even the square root calculation can possibly be tuned a bit. Maybe this will be addressed in another article.

Guava-Collections in Java-APIs

When we write APIs, that have parameters or as return values, it is a good idea to consider relying on immutable objects only. This applies also when collections are involved directly or indirectly as content of the classes that occur as return values or parameters. Changing what is given through the API in either direction can create weird side effects. It even causes different behavior, depending on weather the API works locally or via the network, because changes of the parameters are usually not brought back to the caller via some hidden back channel, unless we run locally. I use Java as an example, but it is quite an universal concept and applies to many languages. If we talk about Ruby, for example, the freeze method might be our friend, but it goes only one level deep and we actually want to deep-freeze.. Another story, maybe…

Now we can think of using Java’s own Collections.unmodifiableList and likewise. But that is not really ideal. First of all, these collections can still be modified by just working on the inner of the two collections:
List list = new ArrayList<>(); list.add("a"); list.add("b"); list.add("c"); List ulist = Collections.unmodifiableList(list); list.add("d"); ulist.get(3) -> "d"
Now list in the example above may already come from a source that we don’t really control or know, so it may be modified there, because it is just „in use“. Unless we actually copy the collection and put it into an unmodifiableXXX after that and only retain the unmodifiable variant of it, this is not really a good guarantee against accidental change and weird effects. Let’s not get into the issue, that this is just controlled at runtime and not at compile time. There are languages or even libraries for Java, that would allow to require an immutable List at compile time. While this is natural in languages like Scala, you have to leave the usual space of interfaces in Java, because it is also really a good idea to have the „standard“ interfaces in the APIs. But at least use immutable implementations.

When we copy and wrap, we still get the weird indirection of working through the methods of two classes and carrying around two instances. Normally that is not an issue, but it can be. I find it more elegant to use Guava in these cases and just copy into a natively immutable collection in one step.

For new projects or projects undergoing a major refactoring, I strongly recommend this approach. But it is of course only a good idea, if we still keep our APIs consistent. Doing one of 100 APIs right is not worth an additional inconsistency. It is a matter of taste, to use standard Java interfaces implemented by Guava or Guava interfaces and classes itself for the API. But it should be consistent.

Scala Days 2019

Scala Days had a tenth anniversary in 2019. It is an annual conference about Scala, which I like to visit, when possible.

There is some confusion as to what is meant:

Where this the 9th, 10th or 11th Scala Days?
Where the first Scala Days 9, 10 or 11 years ago?

The truth is, that the first Scala Days were organized in 2010 and took place in Lausanne. So it was 9 years ago, but the tenth event. For this reason it was done in Lausanne again, but in a much larger venue.

Just for the completeness:

This is the opening keynote by Martin Odersky. A video will be on youtube soon.

I visited the following talks:

And on the last day I visited the following talks and the diversity panel and the closing panel:

Sustaining open source digital infrastructure by Bogdan Vasilescu
Run Scala Faster with GraalVM on any Platform by Vojin Jovanovic
Pick your number type with Spire by Denis Rosset
Performance tuning Twitter services with Graal and ML by Chris Thalinger
Scala best practices I wish someone’d told me about by Nicolas Rinaudo
Immutable Sequential Maps – Keeping order while hashed by Odd Möller

Here is the whole schedule.

Btw., links might expire when the next Scala Days will start using the site in early 2020. Where it will be in 2020 is still unknown, but since 2014 it has been a habit to do the even years in Berlin. I will try to come, if there are Scala Days 2020 in Europe.

hashCode, equals and toString

In many programming languages we are urged to define methods hashCode, equals and toString. They are named like this in Java and in many JVM languages or they use similar names. Some languages like Perl and Scala provide decent mechanisms for the language to figure these out itself, which we do most of the time in Java as well by letting the IDE create it for us or by using a library. This solution is not really as good as having it done without polluting our source code and without using mechanisms like reflection, but it is usually the best choice we have in Java. It does have advantages, because it gives us some control over how to define it, if we are willing to exercise this control.

So why should we bother? equals is an „obvious“ concept that we need all the time by itself. And hashCode we need, when we put something into HashMaps or HashSets or something like that. It is very important to follow the basic contract, that hashCode and equals must be compatible, that is

$\forall a, b : a.\mathrm{equals}(b) \implies a.\mathrm{hashCode}() == b.\mathrm{hashCode}()$

And equals of course needs to be an equivalence relation.
There has been an article in this blog about „Can hashCodes impose a security risk?„, which covers aspects that are not covered here again.

An important observation is that these do not fit together with mutability very well. When we mutate objects, their hashCode and equals methods yield different results than before, but the HashSet and HashMap assume that they remain constant. This is not too bad, because usually we actually use very immutable objects like Strings and wrapped primitive numbers as keys for Maps. But as soon as we actually write hashCode and equals, this implies that we are considering the objects of this type to be members of HashMaps or HashSets as keys and the mutability question arises. One very ugly case is the object that we put into the database using Hibernate or something similar. Usually there is an ID field, which is generated, while we insert into the database using a sequence, for example. It is good to use a sequence from the database, because it provides the most robust and reliable mechanism for creating unique ids. This id becomes then the most plausible basis for hashCode, but it is null in the beginning. I have not yet found any really satisfying solution, other than avoiding Hibernate and JPAx. Seriously, I do think, that plain JDBC or any framework like MyBatis or Slick with less „magic“ is a better approach. But that is just a special case of a more general issue. So for objects that have not yet made the roundtrip to the database, hashCode and equals should be considered dangerous.

Now we have the issue that equality can be optimized for hashing, which would be accomplished by basing it on a minimal unique subset of attributes. Or it could be used to express an equality of all attributes, excluding maybe some kind of volatile caching attributes, if such things apply. When working with large hash tables, it does make a difference, because the comparison needs to look into a lot more attributes, which do not change the actual result at least for each comparison that succeeds. It also makes a difference, in which order the attributes are compared for equality. It is usually good to look into attributes that have a larger chance of yielding inequality, so that in case of inequality only one or only few comparisons are needed.

For the hashCode it is not very wrong to base it on the same attributes that are used for the equals-comparison, with this usual pattern of calculating hash codes of the parts and multiplying them with different powers of the some two-digit prime number before adding them. It is often a wise choice to chose a subset of these attributes that makes a difference most of the time and provides high selectivity. The collisions are rare and the calculation of the hash code is efficient.

Now the third method in the „club“ is usually toString(). I have a habit of defining toString, because it is useful for logging and sometimes even for debugging. I recommend making it short and expressive. So I prefer the format
className(attr1=val1 attr2=val2 att3=val3)
with className the name of the actual class of this object without package, as received by
getClass().getSimpleName()
and only including attributes that are of real interest. Commas are not necessary and should be avoided, they are just useless noise. It does not matter if the parantheses are „()“ or „[]“ or „{}“ or „«»“, but why not make it consistent within the project. If attribute values are strings and contain spaces, it might be a good idea to quote them. If they contain non-printable characters or quotation marks, maybe escaping is a good idea. For a real complete representation with all attributes a method toLongString() can be defined. Usually log files are already too much cluttered with noise and it is good to keep them consise and avoid noise.

Can hashCodes impose a security risk?

This may come as a surprise, but attackers can assume that software is running in one of the common languages with their standard library. This calculates the hashcode of a string in a predictable way. For that reason it is possible, to create a large number of entries that result in strings having the same hashcode. If this software relies on hashmaps using this string as a key, then lookups will regularly use linear time instead of almost constant time. This might slow down the system to such an extent that it might be used for a denial of service attack.

The question is, what we can do about this. First of all it is necessary to understand, where are places that can be used by more or less unknown users to enter data into the system, for example registration of new users or some upload of information of users that are registered already. What would happen in case of such an attack?

In the end of the day we do want to allow legitimate usage of the system. Of course it is possible, to discover and stop abusive usage, but these detectors have a tendency to be accurate and create both „false positives“ and „false negatives“. This is something that a regular security team can understand and address. We need to remember, that maybe even the firewalls itself can be attacked by such an attack. So it is up to the developers to harden it against such an attack, which I hope they do.

From the developer point of view, we should look at another angle. There could be legitimate data that is hard to distinguish from abusive data, so we could just make our application powerful enough to handle this regularly. We need to understand the areas of our software that are vulnerable by this kind of attack. Where do we have external data that needs to be hashed. Now we can create a hashcode $h$ as $h(x)=f(\mathrm{md5}(g(x)))$ or $h(x)=f(\mathrm{sha1}(g(x)))$ , where we prepend the string with some „secret“ that is created a startup of the software, then apply the cryptographic hash and in the end apply a function that reduces the sha1 or sha256 or md5 hash to an integer. Since hash maps only need to remain valid during the runtime of the software, it is possible to change the „secret“ at startup time, thus making it reasonably hard for attackers to create entries that result in the same hashcode, even if they know the workings of the software, but do not know the „secret“. A possible way could be to have a special variant of hash map, that uses strings as keys, but uses its own implementation of hashcode instead of String’s .hashCode()-method. This would allow creating a random secret at construction time.

I have only become aware of the weakness of predictable hashcodes, but I do not know any established answers to this question, so here you can read what I came up with to address this issue. I think that it might be sufficient to have a simple hashcode function that just uses some secret as an input. Just prepending the string with a secret and then calculating the ordinary .hashCode() will not help, because it will make the hashcode unpredictable, but the same pairs of strings will still result in collisions. So it is necessary to have a hashcode $h(x, s)$ with $x$ the input string and $s$ the secret such that for each $x, y, s$ with $x \ne y \wedge h(x, s)=h(y, s)$ there exists a $t$ with $h(x, t) \ne h(y, t)$ , so the colliding pairs really depend on the choice of the secret and cannot be predicted without knowing the secret.

What do you think about this issue and how it can be addressed from the developer side? Please let me know in the comments section.

Unit Tests as Specifications

Quite often I hear the idea, that one team should specify a software system and get the development done by another team. Only good documentation is needed and good API contracts and beyond that, no further cooperation and communication is needed. Often a complete set of unit tests is recommended as a way or as a supplement to specify the requirements. Sounds great.

We have to look twice, though. A good software project invests about 30 to 50 percent of its development effort into unit tests, at least if we program functionality. For GUIs it is another story, real coverage would be so much more than 50%, so that we usually go for lower coverage and later automated testing, when the software is at least somewhat stable and the tests do not have to be rewritten too often.

For a library or some piece of code that is a bit less trivial, the effort for a good unit test coverage would be more around 60% of the total development, so actually more then the functionality itself.

Also for APIs that are shared across teams, maybe a bit more is a good idea, because complete coverage is more important.

So the unit tests alone would in this case need about 50% of the development effort. Adding documentation and some other stuff around it, it will become even a bit more than 50%.

In practice it will not work out without intensive cooperation, unless the APIs have been in use by tons of other organizations for a long time and proven to be stable and useful and well documented.

So what exactly are we gaining by this approach of letting another team write the code and just providing a good unit test suite?