Indexing of Arrays and Lists

We index arrays with integers, and the same holds for lists, at least those that allow random access. The sizes of collections are integers as well.
This allows for 2^{31}-1=2147483647 entries in Java and typical JVM languages, because int is a signed 32-bit type. We could actually squeeze in one more entry by using the indices 0..2147483647, but then the size could no longer be expressed as a signed integer. This is built quite deeply into the language, so it is not easy to break out of it. And 2’000’000’000 entries are a lot and take a lot of time to process; at least they were a lot in the first few years of Java. An unsigned variant of integers, which Java lacks, would allow for around 4’000’000’000 entries when indexing by an uint32, but that would not really solve the problem. C, by contrast, uses 64-bit integers (size_t) for indexing arrays on 64-bit systems.

It turns out that we would like to be able to index arrays with long instead of int. But changing Java's arrays so that they could be indexed by long would break a lot of compatibility. I think this is practically impossible, because Java promises very good backward compatibility, and this reliability of both the language and the JVM has been a major advantage. A second kind of array, indexed by long, could be added instead. But that would add even more complexity to APIs like reflection, which have to deal with all parameter cases and where it already hurts that primitives are not objects and arrays are only half-objects. So it will be interesting to see what happens in this area in the future.

For practical use the situation is a bit easier. We can already be quite happy with a second set of collections, let us call them BigCollections, whose sizes can only be expressed with a long and which are indexed by long where applicable. It is not too hard to program a BigList by internally using an array of arrays (or an array of arrays of arrays) and doing some arithmetic to calculate the internal indices from the long (int64) index given in the API. We can actually gain some performance when resizing happens, because this structure, if done well, allows more efficient resizing. On this basis all kinds of big collections can be built.
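As a minimal sketch of the index arithmetic such a structure could use, here is a hypothetical long-indexed array of longs built on a two-level array of arrays, with inner arrays of 2^20 elements (all names and the chunk size are illustrative, not a real library API):

```java
// Sketch of long-based indexing over an array of arrays (illustrative only).
public class BigArray {
    private static final int INNER_BITS = 20;             // inner arrays of 2^20 elements
    private static final int INNER_SIZE = 1 << INNER_BITS;
    private static final int INNER_MASK = INNER_SIZE - 1;

    private final long[][] data;
    private final long size;

    public BigArray(long size) {
        this.size = size;
        int outer = (int) ((size + INNER_SIZE - 1) >>> INNER_BITS);  // ceiling division
        data = new long[outer][];
        for (int i = 0; i < outer; i++) {
            long remaining = size - ((long) i << INNER_BITS);
            data[i] = new long[(int) Math.min(INNER_SIZE, remaining)];
        }
    }

    public long get(long index) {
        return data[(int) (index >>> INNER_BITS)][(int) (index & INNER_MASK)];
    }

    public void set(long index, long value) {
        data[(int) (index >>> INNER_BITS)][(int) (index & INNER_MASK)] = value;
    }

    public long size() {
        return size;
    }
}
```

The outer index is the upper bits of the long index, the inner index the lower bits; growing the structure only requires reallocating the small outer array and appending inner arrays, which is where the resizing advantage mentioned above comes from.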


Tip: how to make Thunderbird use ISO Date Format

After an upgrade, my Thunderbird started to use a weird US date format and forgot my setting for this. I did not find any setting for it either. But the following seems to work fine:

Create a shell script

#!/bin/bash

# LC_TIME controls date and time formatting; the sv_SE locale uses the ISO 8601 format.
export LC_TIME=sv_SE.utf8

exec /usr/bin/thunderbird

and make sure that this script gets called when Thunderbird is started, instead of /usr/bin/thunderbird directly.


Clojure Exchange 2018

I visited Clojure Exchange 2018 in London.
Since there was only one track and I attended all talks, it is easy to just refer to the schedule.

Interesting topics that came up multiple times in different flavors were immutability, stories of building real-life applications, music, Java and Clojure and the transition between them, ClojureScript, Emacs and CIDER…

I did a lightning talk myself about Some Thoughts about Immutability and its Limits.

Links

  • Clojure
  • Clojure Exchange 2018
  • Clojure Exchange 2016

Devoxx Kiev 2018

Towards the end of 2018 the number of conferences is kind of high. A great highlight is Devoxx BE in Antwerp. But it now has five partner conferences in London, Paris, Krakow, Morocco and Kiev. So I decided to have a look at the one in Kiev.

How was it in comparison to the one in Belgium? What was better in Kiev: the food was way better, the drinks on the first evening (whisky and long drinks vs. Belgian beer) might be considered better, and more people were engaged to help the organizers…
What was better in Belgium: there were still a few more talks. While the location in Kiev was really great, the rooms in Belgium were much better suited for providing a projection visible to everybody and for doing a video recording that did not disturb the audience.
The quality of the talks was mostly great in both locations. In Kiev they gamified the event a bit more…

Generally there was a wide range of topics and the talks were sorted into the following thematic groups:

  • Methodology & Culture
  • JVM Languages
  • Server Side
  • Architecture & Security
  • Mobile & IoT
  • Machine Learning & AI
  • Big Data & Data Mining
  • Cloud, Containers & Infrastructure
  • Modern Web & UX

See the schedule for the distribution…

I attended on Friday:

I attended on Saturday:

A lot to learn.


Devoxx Antwerp 2018

In 2018 I am visiting a few conferences. A great highlight is Devoxx BE in Antwerp, which I had the privilege of visiting in 2012, 2013, 2014, 2015, 2016 and 2017.

As it should be, it is not just the same every year, but content and speakers change a bit from year to year.

Some topics that got a lot of attention were functional programming, artificial intelligence, Big Data, machine learning, clouds, JVMs and Kotlin.

There was less about other JVM languages (apart from Kotlin): Scala, Clojure, Groovy and Ceylon were covered little or not at all, and Android used to be more present in other years. I would say that Ceylon has become irrelevant, probably because Kotlin was too similar, came out at the same time and won. Groovy has its niche, Clojure has its niche, and Scala and Kotlin have become mature and are now the two mainstream alternatives to Java, though themselves much smaller than Java. This was reflected in the conference, taking into account that Scala has its own large conferences, like Scala Days, Scala Exchange, Scala World and many more.

Some side issues that might worry some of us did come up occasionally. Was it bad that IBM bought Red Hat? They paid around 34’000’000’000 USD, which is more than 2’500’000 USD per employee. There are probably no other assets in terms of buildings, patents, hardware or whatever that would justify this price, so IBM will probably have an interest in keeping a large number of these employees and not scaring them away with too much "IBM culture". We will see, but there is no reason to get worried immediately. Oracle wants money for running their JVM in production after more than six months. This can be avoided by always switching to the newest version or by relying on the JDKs offered by alternative sources like Amazon, Red Hat…

Microsoft was a sponsor and had a booth. Their topic was not MS-Windows and MS-Office and MS-SQL-Server, but Azure, which can be used with Linux and Java and PostgreSQL, for example. The company did change a bit since the days of Steve Ballmer and we will see if this is an excursion or a continuous direction.

And James Gosling was there at the opening, as a surprise.

Generally there was a wide range of topics and the talks were sorted into the following thematic groups:

  • Methodology & Culture
  • Java Language
  • Programming languages
  • Architecture & Security
  • Big Data & Machine Learning
  • Mind the Geek
  • Server Side Java
  • Modern Web & UX
  • Cloud, Containers & Infrastructure
  • Mobile & IoT

See the schedule for the distribution…

I attended on Wednesday:

I attended on Thursday:

I attended on Friday:

It was a great conference. A lot of new ideas.


Logging


Software often contains logging functionality. Usually entries of one or sometimes multiple lines are appended to a file, written to syslog or written to stdout, from where they are redirected into a file. They tell us something about what the software is doing. Usually we can ignore all of it, but as soon as something like "ERROR" or, worse and more visible, stack traces can be found, we should investigate. Unfortunately software is often not so good in this respect, which can be due to libraries, frameworks or our own code. Then stack traces and errors are so common that it is hard to find the ones that are really worth looking into. Or there is simply no complete process in place to watch the log files. Sometimes the error shows up much later than it actually occurred, and stack traces do not really lead us to the right spot. More often than we think, logging actually introduces runtime errors that were otherwise not present. This is related to a more general concept, the observer effect, where logging actually changes the behavior of the business logic.

It is nice when log files adhere to some format. Usually they start with a timestamp in ISO format, often to the millisecond. Please add trailing zeros to always have three digits after the decimal point in this case. It is preferable to use UTC, but people tend to stick to local dates and time zones, including the issues that come with switching to and from daylight saving time. Usually we have several processes or threads that run simultaneously. This can result in a wild mix of log entries. As long as even multiline entries stay together and the beginning and end of a multiline entry can easily be recognized, this can be dealt with. Tools like Splunk or simple Perl, Ruby or Python scripts can help us to follow threads separately. We could actually have separate logs for each thread in the first place, but this is not a common practice, and it might hit OS limitations on the number of open files if we have many threads or even thousands of actors as in Erlang or Akka. Keeping log entries together can be achieved by using an atomic write, like the write system call on Linux and other POSIX systems. Another way is to queue the log entries and to have a logger thread that processes the queue.
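The queue-based variant can be sketched as follows: producers enqueue complete (possibly multiline) entries into a BlockingQueue, and a single logger thread writes them out, so that entries from different threads never interleave. The class and the sentinel value are made up for illustration:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: a single logger thread drains the queue, so each complete
// (possibly multiline) entry is written without interleaving.
public class QueuedLogger {
    private static final String POISON = "\u0000STOP";   // sentinel to shut down the worker
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Thread worker;

    public QueuedLogger(java.io.PrintStream out) {
        worker = new Thread(() -> {
            try {
                for (String entry = queue.take(); !entry.equals(POISON); entry = queue.take()) {
                    out.println(entry);   // one write per complete entry
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
    }

    public void log(String entry) {
        queue.add(entry);
    }

    public void close() throws InterruptedException {
        queue.add(POISON);
        worker.join();
    }
}
```

Since the queue preserves order per producer, entries from one thread also stay in their original order.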

Overall this area has become very complex and hard to tame. In the Java world there used to be Log4j with a configuration file that, at least in the earlier versions, was a simple properties file. It was so good that other languages copied it and created some log4X. Later the config file was replaced by XML, and more logging frameworks were added, quite a few of them just for the purpose of abstracting over the large zoo of logging frameworks and providing a uniform interface for all of them. The result, of course, was one more framework to deal with.

It is a good question how much logic for handling log files we really want to see in our software. Does the software have to know into which file it should log or how to do log rotation? If a configuration determines this, but the configuration is compiled into the jar file, it does have to know… We can keep our code a bit cleaner by relying on program functionality without code, but this still keeps it part of the software.

Log files have to please the system administrator, or whoever replaced him in a pure DevOps shop. And in the end developers have to be able to work with the information provided by the logs to find issues in the code, or to explain what is happening if the system administrator cannot resolve an issue by himself. Should this system administrator have to deal with a different, special, complex logging setup for each piece of software he is running? Or should it be necessary to call for developer support to get a new version of the software with just another log setting, because the configurations are hard-coded in the deployment artifacts? It is also interesting what happens when we use PaaS, where we have application servers, databases etc., but the software can easily move to another server, which might result in losing the logs. Moving logs to another server or logging across the network is expensive, maybe more expensive than the rest of this infrastructure.

Is it maybe a good idea to just log to stdout, maintaining a decent format, and to run the software in such a way that stdout is piped into a log manager? This can be the same for all software, and there is only one way to configure it. The same means not only the same for all Java programs, but actually the same for all programs in all languages that comply with a minimal standard. This could be achieved using named pipes in conjunction with whatever hard-coded log file the software wants to use. But this is a dangerous path unless we really know what the software is doing with its log files. Just think of what weird errors might happen if the software tries to apply log rotation to the named pipe by renaming it, deleting it, creating new files and so on. A common trick to stop software from logging into a place where we do not want it is to create a directory with the name of the file that the software usually uses and to write-protect this directory and its parent directory against the software. Please find out how to do this in detail, depending on your environment.

What about software that is a filter by itself, so that its main functionality is to write useful data to stdout? Usually smaller programs and scripts work like this. Often they do not need to log, and often they are well-tested, reliable parts of our software installation. Where are the log files of cp, ls, rm, mv, grep, sort, cat, less, …? Yes, they do write to stderr if real errors occur. Where needed, programs can turn on logging with a log file provided on the command line, which is also quite an operations-friendly approach. Named pipes can help here.

And we have had a good logging framework in place for many years. It is called syslog, and it is still around, at least on Linux.

A last thought: we spend a lot of effort to get well-performing software, using multiple processes, threads or even clusters. And then we forget that logging might become the bottleneck.


Some thoughts about String equality

Of course strings today are in some way Unicode. In this article we assume code points as the building blocks of strings. In the Java world that means that for typical European languages using the Latin, Greek or Cyrillic alphabets, including their extensions, one code point is comprised of one Java character. But when moving to Asian languages, a code point can also consist of two Java characters, and there are strings that are illegal from a Unicode perspective, because they contain characters that should be combined in a way that cannot be combined properly. So here we assume that strings consist of sequences of bytes or two-byte characters or whatever encoding properly expresses a sequence of code points. There are many interesting issues when dealing with some Asian languages that we will not cover here today.

Now there are a lot of possibilities to create strings that look the same but are actually different. We are not talking about "0" and "O" or "1", "l" and "I", which might look similar in some fonts but should not, because we actually depend on their distinctness, even on their visual distinctness. Unfortunately we have the bad habit of using traditional typewriter fonts, which make it hard to distinguish these, for source code, where it would be so crucial. But for today we just assume that we always look hard enough to solve this issue.

The classical example of what looks the same is whitespace. We have the ordinary space " " and the no-break space " ", which are meant to look exactly the same but to expose slightly different behavior. There are tons of possibilities to create exactly the same look with different combinations of whitespace. But this is a special case, because whitespace often carries little semantic information and we want to disregard it to some extent when comparing strings. Typical examples are stripping leading and trailing whitespace of the string or of the lines contained within it, and replacing tabulators with the equivalent number of spaces. Or even replacing any amount of adjacent whitespace within a line by a single space. Again, handling of different whitespace code points might require different rules, so it is good to be careful not to put in too much ad-hoc logic, and better to rely on a library, so that at least exactly the same rules are applied in equivalent situations.
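Two of these rules can be sketched in Java as follows. Note that \s in Java regular expressions only matches Unicode whitespace such as the no-break space when the UNICODE_CHARACTER_CLASS flag (inline: (?U)) is enabled, which is exactly the kind of subtlety best left to one shared, well-tested piece of code; the class and its tab width of 8 are illustrative choices:

```java
// Sketch of simple whitespace canonicalization rules (illustrative).
public class WhitespaceRules {
    // Collapse any run of Unicode whitespace (including no-break space, thanks
    // to the (?U) flag) to a single space and trim the ends.
    public static String collapse(String s) {
        return s.replaceAll("(?U)\\s+", " ").trim();
    }

    // Replace tabulators by a fixed number of spaces (here: 8).
    public static String expandTabs(String s) {
        return s.replace("\t", "        ");
    }
}
```

Without (?U), the no-break space would survive collapse() untouched, silently breaking string comparisons.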

Another example that we actually might know is that certain characters look the same or almost the same in the Cyrillic, Greek and Latin alphabets. I try to give an idea of the meaning of the Greek and Cyrillic characters, but they depend on the language, the dialect and even the word, the word form or the actual occurrence of the letter in the word…

Latin | Cyrillic | Greek | meaning of Cyrillic letter | meaning of Greek letter
A     | А        | Α     | like Latin                 | like Latin
B     | В        | Β     | like Latin V               | Beta (like V in new Greek)
C     | С        |       | like Latin S               |
E     | Е        | Ε     | like Latin                 | Epsilon (like Latin E)
      | Г        | Γ     | like Latin G               | Gamma (like Latin G)
H     | Н        | Η     | like Latin N               | Eta (like Latin I in new Greek)
J     | Ј        |       | Serbian Ј, like German J   |
K     | К        | Κ     | like Latin                 | Kappa (like Latin K)
M     | М        | Μ     | like Latin                 | Mu (like Latin M)
N     |          | Ν     |                            | Nu (like Latin N)
O     | О        | Ο     | like Latin                 | Omikron (like Latin O)
P     | Р        | Ρ     | like Latin R               | Rho (like Latin R)
      | П        | Π     | like Latin P               | Pi (like Latin P)
T     | Т        | Τ     | like Latin                 | Tau (like Latin T)
      | Ф        | Φ     | like Latin F               | Phi (like Latin F)
X     | Х        | Χ     | like German CH             | Chi (like German CH)
Y     | У        | Υ     | like Latin U               | Upsilon (like Latin U)
Z     |          | Ζ     |                            | Zeta (like German Z)
I     | І        | Ι     | Ukrainian I                | Iota (like Latin I)

In this case we usually want the characters to look the same or at least very similar, because that is how to correctly display them, but we do want them to be different when comparing strings.

While these examples are kind of obvious, there is another one that we tend to ignore, but that will eventually catch us. There are so-called combining characters, which should actually be named "combining code points", but here we go. We can put them after a letter and they will combine with it to form a letter with diacritical marks. A typical example is the letter "U" combined with the combining diaeresis (two dots, " ̈") to form an "Ü", which looks the same as the "Ü" that consists of a single code point. It is meant to look the same, and it also has the same meaning, at least for most purposes. What we see is the glyph. We see the difference when we prefix each code point with a minus or a space: the two-code-point "Ü" becomes "-U-̈" or " U ̈", while the single-code-point "Ü" becomes "-Ü" or " Ü", as we would expect.

While this way of expressing a glyph with two code points is not very well known and thus not very common, we can actually see it today in Wikipedia articles. In some languages, where the pronunciation is ambiguous, it can be clarified by putting an accent mark on one vowel. The article Кириллица, for example, puts an accent mark on the term at its beginning: "Кири́ллица". Since accent marks are unfortunately not used in normal Cyrillic writing, it comes in handy that the combining accent also works with Cyrillic letters. With minus signs between the code points it looks like this: "К-и-р-и-́-л-л-и-ц-а", or with spaces like this: "К и р и ́ л л и ц а". So strings that we encounter in our programs will contain these combining characters in the future. While we could prohibit them, it is better to embrace this, and it is actually not too hard if we use decent libraries. Java has the Normalizer class in its standard library, which can convert to one or the other convention of expressing such glyphs, allowing comparison in the way that we actually mean.
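With the built-in java.text.Normalizer this can be sketched in a few lines: the composed and the decomposed spelling of "Ü" differ as raw strings but agree once both are brought to a common normalization form (here NFC, the composed form; the helper class name is made up):

```java
import java.text.Normalizer;

// Sketch: comparing strings up to Unicode normalization.
// NFC is the composed form; NFD would work equally well as the common form.
public class NormalizedCompare {
    public static boolean equalsNormalized(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }
}
```

For example, "\u00DC" (Ü as one code point) and "U\u0308" (U followed by the combining diaeresis) are unequal as plain strings, of different lengths even, but equal under this comparison.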

Unfortunately issues like semantic lengths of strings or semantic positions become even harder than they already were after moving from characters to code points. And we can be sure that Unicode has still more to offer to complicate things if we dig deeper. The typical answer on most websites discussing these issues is something like: "The length of strings and positions within strings are surprisingly irrelevant to most programs."

At the end of the day, jobs that were trivial in the past are now becoming a big deal, and we need to learn to think of comparison, length, position, regular expressions, sorting and all kinds of string functionality with bytes, characters, code points and glyphs in mind.

What can our current libraries already do for us, what are we missing in them, considering different programming languages, databases, text files and network transmission?


Program Functionality without Code

Depending on the programming language and the frameworks we use, it is possible to have program functionality that does not happen in actual code that we write in that language. It seems weird, but it is actually something that we have been doing for decades, and it has been sold to us as being extremely powerful and useful, which it sometimes actually is. Aspect-oriented programming is mostly based on this idea…

Typical examples are things that we want to be taken care of, but that we do not want to actually write ourselves…

Some of these look really great, and who wants to deal with memory management today? Unless we do real-time programming or special security code, where information may not stay in memory longer than the actual processing, this is just what we do all the time, successfully and without too much pain.

While some of these look really great and have become more or less commonplace, there is also some danger in very powerful implicit functionality, like transaction management. While it looks tempting to delegate transaction management to a framework, because it is annoying and not really well understood by most application developers, some danger comes with it. This is even worse if it is used in conjunction with something like JPA or Hibernate… Assume we have a framework that wraps methods marked with an annotation like "@Transactional", meaning that the method call should be wrapped into a transaction (Java-like pseudo-code):

@Transactional
public X myMethod(Y y) {
    X result = do_something(y);
    return result;
}

being roughly equivalent to

public X myMethod(Y y) {
    TransactionContext ctx = getTransactionContext();
    try {
        ctx.beginTransaction();
        X result = do_something(y);
        ctx.commit();
        return result;
    } catch (Exception ex) {
        ctx.rollback();
        throw ex;
    }
}

Yes, it is more elegant to just annotate it.
But now we program something like this:

@Transactional
public Function myMethod(Y y) {
    ...
}

where we actually enclose something into the function and return it. Now when calling the function later, we might get an error, because it encloses state from the time when the transaction was still open, while the transaction has been committed by the time the function is actually called. So in frameworks that force the usage of such annotated transaction handling, such beautiful functional-style programming patterns may not work and need to be avoided, or at least constrained to the cases that do still work. This can be a reasonable price to pay, but it is important to understand the constraints that come with this implicit functionality.
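The problem can be demonstrated without any real framework. In the following simulation the "transaction" is just a flag that the wrapper method closes before the returned function ever runs, so the closure fails exactly as described above (all class and method names are made up for illustration):

```java
import java.util.function.Supplier;

// Simulation of a closure escaping its transaction (illustrative only).
public class EscapingClosure {
    static class TransactionContext {
        private boolean open = false;
        void begin()  { open = true; }
        void commit() { open = false; }
        String read() {
            if (!open) {
                throw new IllegalStateException("transaction already committed");
            }
            return "data";
        }
    }

    // Stands in for the @Transactional wrapper: begin, build result, commit.
    static Supplier<String> transactionalMethod(TransactionContext ctx) {
        ctx.begin();
        Supplier<String> result = () -> ctx.read();  // encloses ctx, but runs later
        ctx.commit();                                 // transaction is closed here
        return result;
    }
}
```

Calling get() on the returned Supplier throws, because by then the enclosed context is no longer open, the essence of why lazy values and transaction-wrapping annotations mix badly.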

Another interesting area that comes with a lot of implicit functionality is authorization. Assume we have a company that sells services or products, and key account managers who use the software we have written. Now, for whatever reasons, they should only be able to see the data of their own customers, and possibly the data of customers for whose key account manager they are the deputy. Or if they are the boss of some key account managers, maybe they can see all of their data…

Now a function

List<Customer> listCustomers() {
    ...
}

gives different results depending on who is using it. This introduces an implicit, invisible parameter. And however smart the user of this software is, he only sees what he is supposed to see, unless the software has vulnerabilities, which it probably has.

So whenever we read such code that we have not written ourselves, or not written yesterday, there may be surprises about what it does. It is an interesting question how to test this with good coverage of all constellations of implicit parameters. Anyway, we have to get used to it and embrace it; it is an integral part of our software ecosystem. But it is also important to use these powerful mechanisms only where they are really so helpful that they are worth the loss in clarity and explicitness.

While annotations can at least be found in place, there are also other ways. Typically XML files can be used to configure such things. Or it can be done programmatically in a totally different place in the software, by setting up some hooks, for example. Without good documentation or good information flow within the team, this may be hard to find.


Intervals

Intervals are subsets of a universe, that are defined by upper and lower boundaries. Typically we think about real numbers, but any totally ordered universe allows the definition of intervals.

Intervals are defined by lower and upper boundaries, which can be a limiting number or unlimited, typically written as \infty for the upper bound and -\infty for the lower bound. The boundaries can be included or excluded. So the following combinations exist for a universe X:

  • unlimited: (-\infty, \infty) = X
  • half open, lower unlimited: (-\infty, a] = \{ x \in X : x \le a\}
  • open, lower unlimited: (-\infty, a) = \{ x \in X : x < a\}
  • half open, upper unlimited: [b, \infty) = \{ x \in X : x \ge b\}
  • open, upper unlimited: (b, \infty) = \{ x \in X : x > b\}
  • open: (c, d) = \{ x \in X : c < x < d\}
  • half open: [c, d) = \{ x \in X : c \le x < d\}
  • half open: (c, d] = \{ x \in X : c < x \le d\}
  • closed: [c, d] = \{ x \in X : c \le x \le d\}
  • the empty set \emptyset, which it is sometimes useful to consider as an interval as well

The words „open“ and „closed“ refer to our usual topology of real numbers, but they do not necessarily retain their topological meaning when we extend the concept to our typical data types. a, b, c and d in the notation above do not have to be members of X, as long as the comparison is defined between them and all members of X. So we could for example meaningfully define for X=\mathbb{Q} the interval (\sqrt{2}, \pi) = \{ x\in\mathbb{Q} : \sqrt{2} < x < \pi\}.

As soon as we do not imply X=\mathbb{R} we always have to make this clear… And \mathbb{R} is kind of hard to really work with in software on computers with physically limited memory and CPU power.

Intervals have some relevance in software systems.

We sometimes have business logic that actually relies on them, and instead of programming around it somehow, it is clearer and cleaner to actually work with intervals. For example, a public transport scheduling system deals with certain time intervals in which different schedules apply than during the rest of the day. Or a system that records downtimes of servers and services, which are quite naturally expressed as intervals of some date-time datatype. It is usually healthy to consider all the cases mentioned above, rather than to ignore the question because a boundary has probability zero of actually occurring, or to use ugly interval limits like 22:59:59.999.

The other case is interval arithmetic. This means, we do floating point calculations by taking into account that we have an inaccuracy. So instead of numbers we have intervals I. When we add two intervals, we get I+J=\{x+y : x\in I \wedge y\in J\}. In the same way we can multiply and subtract and even divide, as long as we can stay clear of zero in the denominator. Or more generally we can define f(I_1, I_2, \ldots, I_n)=\{f(x_1,x_2,\ldots,x_n) : x_1\in I_1 \wedge x_2 \in I_2 \wedge\ldots\wedge x_n\in I_n\}.
It does of course require some mathematical thinking to understand whether the result is an interval again, or at least something we can deal with reasonably. Actually we are usually happy with replacing the result by an interval that is possibly a superset of the exact result, ideally the minimal superset that can be expressed with our boundary type.
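For closed intervals [lo, hi] with finite double boundaries, the basic operations can be sketched as below. A serious interval arithmetic library would additionally round the lower bound towards -\infty and the upper bound towards +\infty after each operation to preserve the enclosure property under floating point rounding; that outward rounding is omitted here, and the class is purely illustrative:

```java
// Sketch of interval arithmetic on closed intervals [lo, hi] (no outward rounding).
public class Interval {
    final double lo;
    final double hi;

    Interval(double lo, double hi) {
        if (lo > hi) throw new IllegalArgumentException("lo must not exceed hi");
        this.lo = lo;
        this.hi = hi;
    }

    // [a,b] + [c,d] = [a+c, b+d]
    Interval add(Interval o) {
        return new Interval(lo + o.lo, hi + o.hi);
    }

    // [a,b] - [c,d] = [a-d, b-c]
    Interval subtract(Interval o) {
        return new Interval(lo - o.hi, hi - o.lo);
    }

    // For multiplication all four boundary products are candidates
    // for the new minimum and maximum, because signs may flip.
    Interval multiply(Interval o) {
        double a = lo * o.lo, b = lo * o.hi, c = hi * o.lo, d = hi * o.hi;
        return new Interval(Math.min(Math.min(a, b), Math.min(c, d)),
                            Math.max(Math.max(a, b), Math.max(c, d)));
    }
}
```

Division works the same way as multiplication with the reciprocal interval, as long as the denominator interval does not contain zero.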

At this point we will probably discover a desire to expand the concept of intervals in a meaningful way to complex numbers. We can do this by working with open disks like \{ z \in \mathbb{C} :  |z-z_0| < d\} or closed disks like \{ z \in \mathbb{C} :  |z-z_0| \le d\}. Or with rectangles based on two intervals I and J like \{ z = x+iy \in \mathbb{C} : x \in I \wedge y \in J \}.

These two areas are quite interesting and sometimes useful. Libraries have been written for both of them.

Often we discover that intervals alone are not quite enough. We would like to do set operations with intervals, that is:

  • union A \cup B
  • intersection A \cap B
  • set difference A \setminus B

While the intersection works just fine, as long as we include the empty set \emptyset as an interval, unions and differences lead us out of the realm of intervals. It turns out that interval-unions, i.e. sets that can be expressed as a union of a finite number of intervals, are a useful generalization, and actually what we want to work with rather than plain intervals. In this case we can drop the empty set as an interval and just express it as the union of zero intervals.
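The core idea behind a canonical form for such interval-unions can be sketched for closed double intervals: sort the member intervals by lower boundary and merge overlapping ones, so that equal unions always end up with the identical minimal representation. Intervals are represented as bare double[2] pairs here purely for brevity:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch: canonical form of a union of closed intervals [lo, hi] over double.
// Sorting and merging overlapping intervals yields a minimal representation
// that can be compared element by element.
public class IntervalUnion {
    public static List<double[]> normalize(List<double[]> intervals) {
        List<double[]> sorted = new ArrayList<>(intervals);
        sorted.sort(Comparator.comparingDouble(i -> i[0]));
        List<double[]> result = new ArrayList<>();
        for (double[] iv : sorted) {
            double[] last = result.isEmpty() ? null : result.get(result.size() - 1);
            if (last != null && iv[0] <= last[1]) {
                last[1] = Math.max(last[1], iv[1]);   // overlap: extend the last interval
            } else {
                result.add(new double[] { iv[0], iv[1] });
            }
        }
        return result;
    }
}
```

A full implementation over arbitrary Comparable types with open and unlimited boundaries follows the same sort-and-merge pattern, only the overlap test becomes more involved.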

There are some questions coming up that are interesting to deal with:

  • normalization: can we normalize interval-unions to some canonical form that allows safe and reliable comparison for equality?
  • adjacency: is our universe X actually discrete, so that we can express all limited boundaries as closed boundaries?
  • interval lengths: do we have a meaningful and useful way to measure the length of an interval or the total length of an interval-union, as long as they are limited? Or even for unlimited intervals?
  • collection interfaces: do we want to implement a Set interface in languages that have sets and an understanding of sets that would fit intervals?
  • implementation: how can we implement this ourselves?
  • implementation: can we find useful existing implementations?

Having once written a Java library in a project to support interval-unions on arbitrary Comparable types, and having heard a talk about an interval library in Scala that ended up using interval-unions in a pretty equivalent way, it might be interesting to write in the future about how to do this, or about what can be found in different languages to support us. For interval arithmetic, some work on extensions and libraries for C and Fortran was already being done while I was a student. So this is pretty old stuff, and interesting mostly for the concepts, even if we are not going to move to Fortran because of this.

If there is interest I will write more about actual implementations and issues to address when using or writing them.


LF and CR LF in git and svn

Typically we observe that line endings of text files turn out to be "linefeed" (LF or "\n" or 0x0a or Ctrl-J) on Linux and Unix-like systems (including MacOSX), while they are "carriage return and linefeed" (CR LF or "\r\n" or 0x0d 0x0a or Ctrl-M Ctrl-J) on MS-Windows. See the little obstacles of interoperability.

This can become annoying, and there is little reason for it. Most tools for editing, compiling and working with these files understand both variants. It does become annoying when diffs are created between files with different endings, and even more so when scripts turn out to have the "CR LF" ending and the script interpreter given in the first line is not found, because the system tries to find one that has the otherwise invisible "CR" in its file name. It also becomes kind of messy when multiple CR characters are present. CR characters are annoying even on MS-Windows itself as soon as we use Cygwin. Since in most cases the target system is a Linux system anyway and we just waste space with the unnecessary CR character, it is in most cases actually a good idea to agree on not having these CR characters at all in certain kinds of files.

The easiest way is to just set up git or subversion to change files from CR LF to LF on commit. So the repository only contains „clean“ files, at least from that moment onward.

This is accomplished in subversion by applying the following command on the files that we want to be kept on „LF“ only:

svn propset svn:eol-style LF

and then committing.

Git has a way to achieve this using the .gitattributes configuration file. So the .gitattributes file can contain something like this:

* text=auto eol=lf

Remark

Of course I recommend using git instead of svn for new projects and considering a migration of existing projects from svn to git. But for this particular aspect svn provides a slightly more powerful and more intuitive tool than git.
