TruffleRuby

The language Ruby is one of the most beautiful languages. A lot of things can be done, it has a good level of abstraction, it has chosen some very good defaults, has provided some great ideas that I have not discovered in any other language that I know well and provides a lot of flexibility. But I could no longer recommend it for projects that might require a good performance. I won’t go into the issue of static typing vs. dynamic here. Ruby is following the dynamic typing path and if you think that is a bad idea at all, then it will never become your favorite. But this is an issue with pros and cons. The big disadvantage of Ruby is that it is not very good in terms of performance. The single threaded performance is somewhat better in many reasonable languages like Java, Python, Scala, Clojure, C, C#, F# and some others .. And it gets worse when we want to use multiple cores, because Ruby does not run them simultaneously, but uses a global lock which ensures that only one thread at a time is running. Or in case of JRuby just crashes or yields wrong results in certain mulithreaded programs that we could write.

One approach is to go for immutability as a default, which allows quite painless multithreading. Scala and Clojure follow this route, for example. It is hard to write good code with this constraint or to make good use of very local mutability without leaking it outside, but under these conditions multiple threads are just working fine without deadlocks, crashes or falsified results. Another approach is to just copy structures and leave its own copy to each thread. There are ways to do a lot on this path, but the copying costs a lot of memory and performance and it is not always a gain.

Now Ruby heavily relies on mutable structures for strings and collections. It is not reasonable to go for a total paradigm change in this aspect. But there are some ways to get good and safe and fast operations on these collections and strings without breaking this. One idea is to work with chunks of collections or strings. For strings, the string that we are working with is described as a concatenation of such strings. Many operations can be made by just concatenating multiple strings together and possibly replacing one of them with a copy that can be made as needed. This is called Rope. A similar approach can be applied to collections. Then a smart locking mechanism is applied to the shorter string or collections when needed, but many operations can avoid locking or block much less of the structure.

Also the compiler can analyze the program and simplify it to a great extent, compile it to the JVM, which in conjunction with hot spot optimizations will make it run really fast. Now this TruffleRuby is much faster than other Rubies, by a factor of about 10. It uses GraalVM and it actually supports a lot of C-extensions for libraries through the feature of GraalVM that they can be eventually compiled to the JVM. It does not work if extensions rely on implementation details of the Ruby structures in C and it often does not work for C-extensions that go to low level OS functionality. The current version of TruffleRuby is not really ready to use in conjunction with Ruby on Rails, which is kind of a no go, because Ruby is usually used in conjunction with Rails. My impression is that it will be possible to use it with Rails in a year or two.

Hearing of this in a talk by Benoit Daloze in the local Rails user group in Zürich was a great and positive surprise. Ruby gets interesting again.

Links

Share Button

How to draw lines, circles and other curves

These ideas were developed more than 30 years without knowing that they were already known at that time…

Today the graphics cards can easily do things like this in very little time. And today’s CPUs are of course really good at multiplying. So this has lost a lot of its immediate relevance, but it is a fun topic and why not have some fun…

Let us assume we have a two dimensional coordinate system and a visible area that goes from x_{\min} to x_{\max} and y_{\min} to y_{\max}. Coordinates are discrete.

In this world we can easily measure an angle against a (directed) line parallel to the x-axis, for example up to an accuracy of 45^\circ=\frac{\pi}{4}:

  • y=0 \wedge x > 0 \implies \alpha = 0 (= 0^\circ)
  • 0 < y < x \implies 0 < \alpha < \frac{\pi}{4}(=45^\circ)
  • 0 < y = x \implies \alpha = \frac{\pi}{4}
  • 0 < x < y \implies \frac{\pi}{4} < \alpha < \frac{\pi}{2}(=90^\circ)</li> <li>x = 0 \land y > 0\implies \alpha = \frac{\pi}{2}</li> <li>x < 0 \land y > 0 \land |x| < |y|\implies \frac{\pi}{2}< \alpha < \frac{3\pi}{4}(=135^\circ)
  • x < 0 \land y > 0 \land -x = y\implies \alpha = \frac{3\pi}{4}(=135^\circ)

So let us assume we have a curve that is described by a polynomial function in two variables x and y, like this:

    \[f(x, y) = \sum_{j=0}^m\sum_{k=0}^n a_{j,k}x^jy^k = 0\]

We have to apply some math to understand that the curve behaves nicely in the sense that it does not behave to chaotic in scales that are below our accuracy, that it is connected etc. We might possibly scale and move it a bit by substituting something like c_1u+c_2 for x and c_3v+c_4 for y.

For example we may think of

  • line: f(x,y)=ax+by+c
  • circle: f(x, y)=x^2+y^2-r^2
  • eclipse: f(x, y)=\frac{x^2}{a^2}+\frac{y^2}{b^2}-1

We can assume our drawing is done with something like a king of chess. We need to find a starting point that is accurately on the curve or at least as accurately as possible. You could use knights or other chess figures or even fictive chess figures..

Now we have a starting point (x_0, y_0) which lies ideally exactly on the curve. We have a deviation from the curve, which is f(x_0, y_0)=d_0. So we have f(x_n, y_n)=d_n. Than we move to x_{n+1}=x_n + s and y_{n+1}=y_n+t with s, t = \{-1, 0, 1\}. Often only two or three combinations of (s, t) need to be considered. When calculating d_{n+1} from d_n for the different variants, it shows that for calculating d_{n+1}-d_n the difference becomes a polynomial with lower degree, because the highest terms cancel out. So drawing a line between two points or a circle with a given radius around a given point or an ellipse or a parabola or a hyperbola can be drawn without any multiplications… And powers of n-th powers of x can always be calculated with additions and subtractions only from the previous x-values, by using successive differences:
d_{m,1}=(x-m)^n-(x-m-1)^n
d_{m,l+1}=d_{m+1,l}-d_{m,l}
These become constant for l=n, just as the lth derivatives, so by using this triangle, successive powers can be calculated with some preparational work using just additions.
It was quite natural to program these in assembly language, even in 8-bit assembly languages that are primitive by today’s standards. And it was possible to draw such figures reasonably fast with only one MHz (yes, not GHz).

We don’t need this stuff any more. Usually the graphics card is much better than anything we can with reasonable effort program. Usually the performance is sufficient when we just program in high level languages and use standard libraries.

But occasionally situations occur where we need to think about how to get the performance we need:
Make it work,
make it right,
make it fast,
but don’t stop after the first of those.

It is important that we choose our steps wisely and use adequate methods to solve our problem. Please understand this article as a fun issue about how we could write software some decades ago, but also as an inspiration to actually look into bits and bytes when it is really helping to get the necessary performance without defeating the maintainability of the software.

Share Button

2019 — Happy New Year

Gott nytt år! — Godt nytt år! — Felice anno nuovo! — Καλή Χρονια! — Щасливого нового року! — Срећна нова година! — С новым годом! — Feliĉan novan jaron! — Bonne année! — FELIX SIT ANNUS NOVUS — Gullukkig niuw jaar! — Un an nou fericit! — Frohes neues Jahr! — Happy new year! — ¡Feliz año nuevo! — Onnellista uutta vuotta! — عام سعيد

This was created by a Java-program:

import java.util.Random;
import java.util.List;
import java.util.Arrays;
import java.util.Collections;

public class HappyNewYear {

    public static void main(String[] args) {
        List list = Arrays.asList("Frohes neues Jahr!",
                                          "Happy new year!",
                                          "Gott nytt år!", 
                                          "¡Feliz año nuevo!",
                                          "Bonne année!", 
                                          "FELIX SIT ANNUS NOVUS", 
                                          "С новым годом!",
                                          "عام سعيد",
                                          "Felice anno nuovo!",
                                          "Godt nytt år!", 
                                          "Gullukkig niuw jaar!", 
                                          "Feliĉan novan jaron!",
                                          "Onnellista uutta vuotta!",
                                          "Срећна нова година!",
                                          "Un an nou fericit!",
                                          "Щасливого нового року!", 
                                          "Καλή Χρονια!");
        Collections.shuffle(list);
        System.out.println(String.join(" — ", list));
    }
}

Share Button

Christmas 2018


Feliĉan Kristnaskon! — Frohe Weihnachten! — God Jul! — Merry Christmas! — Joyeux Noël! — クリスマスおめでとう ; メリークリスマス — Срећан Божић! — Buon Natale! — Hyvää Joulua! — З Рiздвом Христовим! — ميلاد مجيد — С Рождеством! — Crăciun fericit! — ¡Feliz Navidad! — καλά Χριστούγεννα! — Natale hilare! — God Jul! — Prettige Kerstdagen!

As I said, I am learning some Python, so let’s use it. I created the message above with this program:

#!/usr/bin/python3
import random
arr = [
    "Frohe Weihnachten!",
    "Merry Christmas!",
    "God Jul!",
    "¡Feliz Navidad!",
    "Joyeux Noël!",
    "Natale hilare!",
    "С Рождеством!",
    "ميلاد مجيد",
    "Buon Natale!",
    "God Jul!",
    "Prettige Kerstdagen!",
    "Feliĉan Kristnaskon!",
    "Hyvää Joulua!",
    "クリスマスおめでとう ; メリークリスマス",
    "Срећан Божић!",
    "Crăciun fericit!",
    "З Рiздвом Христовим!",
    "καλά Χριστούγεννα!"
]
random.shuffle(arr)
print(" — ".join(arr))
print("\n")

Share Button

Indexing of Arrays and Lists

We index arrays with integers. Lists also, at least the ones that allow random access. And sizes of collections are also integers.
This allows for 2^{31}-1=2147483647 entries in Java and typical JVM languages, because integers are actually considered to be 32bit. Actually we could think of one more entry, using indices 0..2147483647, but then we would not be able to express the size in an signed integer. This stuff is quite deeply built into the language, so it is not so easy to break out of this. And 2’000’000’000 entries are a lot and take a lot of time to process. At least it was a lot in the first few years of Java. There should have been an unsigned variant of integers, which would in this case allow for 4’000’000’000 entries, when indexing by an uint32, but that would not really solve this problem. C uses 64 bit integers for indexing of arrays on 64 bit systems.

It turns out that we would like to be able to index arrays using long instead of int. Now changing the java arrays in a way that they could be indexed by long instead of int would break a lot of compatibility. I think this is impossible, because Java claims to retain very good backward compatibility and this reliability of both the language and the JVM has been a major advantage. Now a second type of arrays, indexed by long, could be added. This would imply even more complexity for APIs like reflection, that have to deal with all cases for parameters, where it already hurts that the primitives are no objects and arrays are such half-objects. So it will be interesting, what we can find in this area in the future.

For practical use it is a bit easier. We can already be quite happy with a second set of collections, let them be called BigCollections, that have sizes that can only be expressed with long and that are indexed in cases where applicable with longs. Now it is not too hard to program a BigList by internally using an array of arrays or an array of arrays of arrays and doing some arithmetic to calculate the internal indices from the long (int64) index given in the API. Actually we can buy some performance gain when resizing happens, because this structure, if well done, allows for more efficient resizing. Based on this all kinds of big collections could be built.

Share Button

Clojure Exchange 2018

I visited Clojure Exchange 2018 in London.
Since there was only one track and I attended all talks, it is easy to just refer to the schedule.

Interesting topics, that came up multiple times in different flavors where immutability, stories of building real life applications, music, java and clojure and the transition, clojure script, emacs and cider…

I did a lightning talk myself about Some Thoughts about Immutability and its Limits.

Links

  • Clojure
  • Clojure Exchange 2018
  • Clojure Exchange 2016
  • Clojure
  • Share Button

Devoxx Kiew 2018

In the end of 2018 the number of conferences is kind of high. A great highlight is the Devoxx BE in Antwerp. But it has now five partner conferences in London, Paris, Krakow, Morocco and Kiev. So I decided to have a look at the one in Kiev.

How was it in comparison to the one in Belgium? What was better in Kiev: The food was way better, the drinks in the first evening (Whisky and Long Drinks vs. Belgium Beer) might be considered better, there were more people engaged to help the organizers…
What was better in Belgium: There were still a bit more speeches. While the location in Kiev was really great, in Belgium the rooms were way better for the purpose of providing a projection visible for everybody and doing a video recording that did not disturb the audience.
The quality of the speeches was mostly great in both locations. In Kiev they gamified the event a bit more..

Generally there was a wide range of topics and the talks were sorted into the following thematic groups:

  • Methodology & Culture
  • JVM Languages
  • Server Side
  • Architecture & Security
  • Mobile & IoT
  • Machine Learning & AI
  • Big Data & Data Mining
  • Cloud, Containers & Infrastructure
  • Modern Web & UX

See the schedule for the distribution…

I attended on Friday:

I attended on Saturday:

A lot to learn.

Share Button

Devoxx Antwerp 2018

In 2018 I am visiting a few conferences. A great highlight is the Devoxx BE in Antwerp, which I had the privilege of visiting 2012, 2013, 2014, 2015, 2016 and 2017.

As it should be, it is not just the same every year, but content and speakers change a bit from year to year.

Some topics that got a lot of attention were functional programming, artificial intelligence, Big Data, Machine Learning, clouds, JVMs, Kotlin

There was less about other JVM languages (apart from Kotlin), so Scala, Clojure, Groovy or Ceylon were covered little or not at all and Android used to be more present in other years. I would say that Ceylon has become irrelevant, probably because Kotlin was too similar and came out the same time and won. Groovy has its niche, Clojure has its niche, Scala and Kotlin have become mature and are now the two mainstream alternatives to Java, but themselves much smaller than Java. This was represented in the conference, taking into account that Scala has its own large conferences, like Scala Days, Scala Exchange, Scala World and a lot more.

Some side issues that might worry some of us did come up occasionally. Was it bad, that IBM bought Red Hat? At least they paid around 34’000’000’000 USD, which is more than 2’500’000 USD per employee. There are probably no other assets in terms of buildings, patents, hardware or whatever, that would justify this price, so IBM probably will have an interest to keep a large number of these employees and not scare them away by too much „IBM-culture“. We will see, but no reason to get immediately worried. Oracle wants money for running their JVM in production after more than 6 months. This can be avoided by always switching to the newest version or by relying on the JDKs offered by alternative sources like Amazon, RedHat…

Microsoft was a sponsor and had a booth. Their topic was not MS-Windows and MS-Office and MS-SQL-Server, but Azure, which can be used with Linux and Java and PostgreSQL, for example. The company did change a bit since the days of Steve Ballmer and we will see if this is an excursion or a continuous direction.

And James Gosling was there at the opening, as a surprise.

Generally there was a wide range of topics and the talks were sorted into the following thematic groups:

  • Methodology & Culture
  • Java Language
  • Programming languages
  • Architecture & Security
  • Big Data & Machine Learning
  • Mind the Geek
  • Server Side Java
  • Modern Web & UX
  • Cloud, Containers & Infrastructure
  • Mobile & IoT

See the schedule for the distribution…

I attended on Wednesday:

I attended on Thursday:

I attended on Friday:

It was a great conference. A lot of new ideas.

Share Button

Logging

Deutsch

Software often contains a logging functionality. Usually entries one or sometimes multiple lines are appended to a file, written to syslog or to stdout, from where they are redirected into a file. They are telling us something about what the software is doing. Usually we can ignore all of it, but as soon as something with „ERROR“ or worse and more visible stack traces can be found, we should investigate this. Unfortunately software is often not so good, which can be due to libraries, frameworks or our own code. Then stack traces and errors are so common that it is hard to look into or to find the ones that are really worth looking into. Or there is simply no complete process in place to watch the log files. Sometimes the error shows up much later than it actually occurred and stack traces do not really lead us to the right spot. More often than we think logging actually introduces runtime errors, that were otherwise not present. This is related to a more general concept, which is called observer effect, where logging actually changes the business logic.

It is nice that log files keep to some format. Usually they start with a time stamp in ISO-format, often to the millisecond. Please add trailing zeros to always have 3 digits after the decimal point in this case. It is preferable to use UTC, but people tend to stick to local date and time zones, including the issues that come with switching to and from daylight saving time. Usually we have several processes or threads that run simultaneously. This can result in a wild mix of logging entries. As long as even multiline entries stay together and as long as beginning and end of one multiline entry can easily be recognized, this can be dealt with. Tools like splunk or simple Perl, Ruby or Python scripts can help us to follow threads separately. We could actually have separate logs for each thread in the first place, but this is not a common practice and it might hit OS-limitations on the number of open files, if we have many threads or even thousands of actors as in Erlang or Akka. Keeping log entries together can be achieved by using an atomic write, like the write system call in Linux and other Posix systems. Another way is to queue the log entries and to have a logger thread that processes the queue.

Overall this area has become very complex and hard to tame. In the Java world there used to be log4j with a configuration file that was a simple properties file, at least in the earlier version. This was so good that other languages copied it and created some log4X. Later the config file was replaced by XML and more logging frame works were added. Of course quite a lot of them just for the purpose of abstracting from the large zoo of logging frameworks and providing a unique interface for all of them. So the result was, that there was one more to deal with.

It is a good question, how much logic for handling of log files do we really want to see in our software. Does the software have to know, into which file it should log or how to do log rotation? If a configuration determines this, but the configuration is compiled into the jar file, it does have to know… We can keep our code a bit cleaner by relying on program functionality without code, but this still keeps it as part of the software.

Log files have to please the system administrator or whoever replaced them in a pure devops shop. And in the end developers will have to be able to work with the information provided by the logs to find issues in the code or to explain what is happening, if the system administrator cannot resolve an issue by himself. Should this system administrator have to deal with a different special complex setup for the logging for each software he is running? Or should it be necessary to call for developer support to get a new version of the software with just another log setting, because the configurations are hard coded in the deployment artifacts? Interesting is also, what happens when we use PAAS, where we have application server, database etc., but the software can easily move to another server, which might result in losing the logs. Moving logs to another server or logging across the network is expensive, maybe more expensive than the rest of this infrastructure.

Is it maybe a good idea to just log to stdout, maintaining a decent format and to run the software in such a way that stdout is piped into a log manager? This can be the same for all software and there is one way to configure it. The same means not only the same for all the java programs, but actually the same for all programs in all languages that comply to a minimal standard. This could be achieved using named pipes in conjunction with any hard coded log file that the software wants to use. But this is a dangerous path unless we really know what the software is doing with its log files. Just think of what weird errors might happen if the software tries to apply log rotation to the named pipe by renaming, deleting, creating new files and so on. A common trick to stop software from logging into a place where we do not want this is to create a directory with the name of the file that the software usually uses and to write protect this directory and its parent directory for the software. Please find out how to do it in detail, depending on your environment.

What about software, that is a filter by itself, so its main functionality is to actually write useful data to stdout? Usually smaller programs and scripts work like this. Often they do not need to log and often they are well tested relyable parts of our software installation. Where are the log files of cp, ls, rm, mv, grep, sort, cat, less,…? Yes, they do tend to write to stderr, if real errors occur. Where needed, programs can turn on logging with a log file provided on the command line, which is also a quite operations friendly approach. Named pipes can help here.

And we had a good logging framework in place for many years. It was called syslog and it is still around, at least on Linux.

A last thought: We spend really a lot of effort to get well performing software, using multiple processes, threads or even clusters. And then we forget about the fact that logging might become the bottle neck.

Share Button

Some thoughts about String equality

Of course Strings are today in some way Unicode. In this article we assume code points as the building blocks of Strings. That means for example in the Java-world, that we are talking about one code point being comprised of one Java character for typical European languages, using Latin, Greek or Cyrillic alphabets including extensions to support all languages typically using these alphabets, for example. But when moving to Asian languages, a code point can also consist of two Java characters and there are Strings that are illegal from Unicode perspective, because they contain characters that should be combined in a way that cannot be combined properly. So here we assume, that Strings consist of sequences of bytes or two-byte characters or whatever encoding that properly express a sequence of code points. There are many interesting issues when dealing with some Asian languages that we will not cover here today.

Now there are a lot of possibilities to create Strings, that look the same, but are actually different. We are not talking about „0“ and „O“ or „1“ and „l“ and „I“ that might look similar in some fonts, but should not look similar, because we actually depend on their distinctness, even on their visual distinctness. Unfortunately we have the bad habit of using traditional typewriter fonts, that make it hard to distinguish these, for source code, where it would be so crucial. But for today, we just assume that we always look hard enough to solve this issue.

The classical example of what looks the same is whitespace. We have ordinary space “ “ and no break space “ „, that are meant to look exactly the same, but to expose a slightly different behavior. There are tons of possibilities to create exactly the same look with different combinations of whitespace. But this is kind of a special case, because in terms of semantics often carries little information and we want to disregard it to some extent when comparing strings. Typical examples are stripping of leading and trailing whitespace of the string or of the lines contained within it and replacing tabulators with the number of spaces that would be equivalent. Or even to replace any amount of adjacent whitespace within a line by a single space. Again, handling of different whitespace code points might require different rules, so it is good to be careful in not putting to much logic and it is better to rely on a library to at least apply exactly the same rules in equivalent situations.

Another example that we actually might know is that certain characters look the same or almost the same in the Cyrillic, Greek and Latin alphabets. I try to give an idea of the meaning of the Greek and Cyrillic characters, but they depend on the language, the dialect and even the word, the word form or the actual occurrence of the letter in the word…

LatinCyrillicGreekmeaning of Cyrillic Lettermeaning of Greek letter
AАAlike Latinlike Latin
BВBlike Latin VBeta (like V in new Greek)
CСlike Latin S
EЕElike LatinEpsilon (like Latin E)
ГHlike Latin GGamma (like Latin G)
HНΗlike Latin NEta (like Latin I in new Greek)
JЈSerbian Ј, like German J
KКΚlike LatinKappa (like Latin K)
MМΜlike LatinMu (like Latin M)
NΝNu (like Latin N)
OОΟlike LatinOmikron (like Latin O)
PРΡlike Latin RRho (like Latin R)
ПΠlike Latin PPi (like Latin P)
TТΤlike LatinTau (like Latin T)
ФΦlike Latin FPhi (like Latin F)
XХΧlike German CHChi (like German CH)
YУΥlike Latin UUpsilon (like Latin U)
ZΖZeta (like German Z)
IІΙUkrainian IIota (like Latin I)

In this case we usually want the characters to look the same or at least very similar, because that is how to correctly display them, but we do want them to be different when comparing strings.

While these examples are kind of obvious, there is another one that we tend to ignore, but that will eventually catch us. There are so called combining characters, that should actually be named „combining code points“, but here we go. That means that we can put them after a letter and they will combine to form a letter with diacritical marks. A typical example is the letter „U“ that can be combined with two dots “ ̈ ̈“ to form an „Ü“, which looks the same as the „Ü“ that is composed of one code point. It is meant to look the same, but it also has the same meaning, at least for most purposes. What we see is the Glyph. We see the difference when we prefix each code point with a minus or a space: „Ü“ -> „-U-̈“ or “ U ̈“, while the second one is transformed like this: „Ü“ -> „-Ü“ or “ Ü“, as we would expect.

While the way to express the Glyph in such a way with two code points is not very well known and thus not very common, we actually see it already today when we look at Wikipedia articles. In some languages, where the pronunciations is ambiguous, it can be made clear by putting an accent mark on one vowel, as for example Кириллица, which puts an accent mark on the term in the beginning of the article like this: „Кири́ллица“. Since in Cyrillic Alphabet accent marks are unfortunately not used in normal writing, it comes in handy that the combining accent also works with cyrillic letter. When putting minus-signs between the code points it looks like this: „К-и-р-и-́-л-л-и-ц-а“ or with spaces like this: „К и р и ́ л л и ц а“. So Strings that we encounter in our programs will contain these combining characters in the future. While we can prohibit them, it is better to embrace this and it is actually not too hard, if we use decent libraries. Java has the Normalizer class in its built in library, that can convert to one or the other convention of expressing such glyphs and then allowing comparison in the way that we actually mean.

Unfortunately issues like semantic lengths of strings or semantic positions become even harder than they already are after moving from characters to code points. And we can be sure that Unicode has still more to offer to complicate things, if we dig deeper. The typical answer that we get on most web sites that talk about these issues is something like: „The length of strings and positions within strings are surprisingly irrelevant to most programs.“

In the end of the day, jobs that have been trivial in the past are now becoming a big deal and we need to learn to think of comparison, length, position, regular expressions, sorting and all kinds of string functionality with bytes, characters, code points and glyphs in mind.

What can our current libraries already do for us, what are we missing in them, considering different programming languages, databases, text files and network transmission?

Links

Share Button