Not all projects are on ideal paths II (Tim Finnerty)

This is another story of a project, that did not go as well as it could have gone while I was there. From unsuccessful projects we can learn a lot, so there will be stories like this once in a while. The first one was about Tom Rocket.

Tim Finnerty

It was the time, when all the cool companies tried to introduce Java. And some of the new Java projects failed, causing the companies to go back to C, which again scared other companies from doing this step. But some companies did not get scared by this. They embraced the new Java-fashion at a time when it still was not clear whether or not this was a good idea. What could possibly go wrong?

Well, in those days the experienced guys did not want to move to Java. It was slow, it was unreliable, not mature,… Maybe for Applets, maybe more generally for GUIs to get rid of VisualBasic, but Java on the server was considered a bad joke. For the server real people used real C, of course on Unix or maybe Linux, which was not really such a bad idea in those days. But there were the young people. Or the ones who had stayed young. They often just had finished their education and firmly believed that by just using technology „xyz“ everything would become great. xyz can be a development method (spiral model in those days, agile today), an architecture (microservices), a paragdigm („OO“, „functional“), a framework („enterprise edition“), a tool or a programming language (yeah, Java).. Often this first enthusiasm ends in a disaster: The money has been spent, the developers are leaving and the software is further away from being useful than anybody would like to admit. In lucky cases there is still some money left to do it right. Maybe even to do it right in Java.

That is were we are coming in. I do not really know the earlier history, but according to the management it was a total disaster. Now Tim Finnerty (real name known to the author) was the new technical lead, team architect or whatever this role is called. He embraced the new technology, but promised to not overstrech it. A bit of old school. Sounds good, because it is exactly what managers want to hear. No more risky experiements, but this time it needs to become a success.

So Tim Finnerty defined, how we had to work. He knew Java, he knew databases and especially Oracle, he knew the web, he even knew Perl. And he knew OO. Better than anybody else, so we did the real thing. Great head start. And everybody had to program according to Tim’s rules.

Of course we were using Java enterprise edition. That meant, that we were programming against some Weblogic application server, that was hard to install, hard to run, required a few minutes of startup time for each minor change of the software that we were writing and forced to a very archaic and primitive programming model. But that was cool and it was the future, which unfortunately proved to be true. Even though it has at least become usable by now. So far nothing to blame on Tim, because it is kind of the stack that everybody used.

Now to the OO and the database. Each database table represented a „Business Object“, with a name like XyzBO. So most of the time, we wrote a class XyzBO plus a few more to fulfill the greed and need for boilerplate code of the old J2EE-world. XyzBO was a enterprise java bean. A stateless session bean, to be accurate. Which meant, that we wrote methods of this EJB, which were basically procedures of the pre-OO-world. But within we could of course use the whole OO-toolset. Which we did. So the class to represent any data from the database was actually called Data. It was a minor subset of the standard Perl data structures, which meant that Data was a list of hash maps, which could behave just like a hash map if it had only one entry. Database queries returned Data, or of course null, if nothing was found. Nobody would ever want to use an empty collection. Pretty much the opposite of what we are now doing the hard way by introducing Optional or Option to avoid the null. But it was easy to just write
if (data == null) { data = new Data(); }
at least for the ones who new this trick.
So data resembled the content of the database or of the query result, with the column names as keys and the values as objects. When working with these, it was really easy. Just know the attribute name accurately. Get the value from data. See if it is null. If not, cast it to the real type, and voila….

The database was designed according to Tim’s advice, he had to review every table. It was mandatory to have as unique key and as first column a string of around 700 characters, which was called HANDLE. Each table had a business primary key, which was always consisting of several columns. Because the system allowed multiple instances of the software to run on the same database, there was always one column called „SITE“ in this logical primary key. But there were no primary keys defined in the database. The unique HANDLE was enough. It contained the name of the BO, like XyzBO, followed by a colon and followed by the concatenation of all the logical primary keys, separated with commas. The date had to be converted to a string using a local US format, not ISO, of course. All foreign keys were defined using HANDLE. In the end more than half of the DB space was wasted for this stupidity. But each single handle value started with an XyzBO:, to remind us that we were programming OO.

And now booleans. It was forbidden to use the boolean type of Java. All booleans were strings containing the words „true“ and „false“. This went like that all the way to the web interface.

At that time web frameworks did not yet exist or were at least unknown. So the way to go was to write JSPs, which contained kind of dynamic web pages and to write servlets to control the flow and access the EJBs. Now according to Tim it was too hard to learn servlets as well, so it was forbidden to use them and instead a connection-JSP had to be written, which did not display anything, but only contained a <% in the beginning and a %> in the end and the code between.

A lot of small and larger stupidities, that were forced on the team. Most people were new to Java and to OO and did not even realize that there was anything wrong, apart from the fact, that it was kind of hard to get stuff done. Some stupidities were due to the fact that the early J2EE really sucked, but mostly it was Tim, who forced everyone to his level. This story happened a long time ago. Tim has already retired. I would say he is one of the guys, who retire as a Junior.

There is (almost) nothing wrong with stupidity. And there is (almost) nothing with arrogance. But the combination really sucks. Especially if it it taken serious by the manager or has to be taken serious by the team.

Share Button

2019 — Happy New Year

Gott nytt år! — Godt nytt år! — Felice anno nuovo! — Καλή Χρονια! — Щасливого нового року! — Срећна нова година! — С новым годом! — Feliĉan novan jaron! — Bonne année! — FELIX SIT ANNUS NOVUS — Gullukkig niuw jaar! — Un an nou fericit! — Frohes neues Jahr! — Happy new year! — ¡Feliz año nuevo! — Onnellista uutta vuotta! — عام سعيد

This was created by a Java-program:

import java.util.Random;
import java.util.List;
import java.util.Arrays;
import java.util.Collections;

public class HappyNewYear {

    public static void main(String[] args) {
        List list = Arrays.asList("Frohes neues Jahr!",
                                          "Happy new year!",
                                          "Gott nytt år!", 
                                          "¡Feliz año nuevo!",
                                          "Bonne année!", 
                                          "FELIX SIT ANNUS NOVUS", 
                                          "С новым годом!",
                                          "عام سعيد",
                                          "Felice anno nuovo!",
                                          "Godt nytt år!", 
                                          "Gullukkig niuw jaar!", 
                                          "Feliĉan novan jaron!",
                                          "Onnellista uutta vuotta!",
                                          "Срећна нова година!",
                                          "Un an nou fericit!",
                                          "Щасливого нового року!", 
                                          "Καλή Χρονια!");
        Collections.shuffle(list);
        System.out.println(String.join(" — ", list));
    }
}

Share Button

Indexing of Arrays and Lists

We index arrays with integers. Lists also, at least the ones that allow random access. And sizes of collections are also integers.
This allows for 2^{31}-1=2147483647 entries in Java and typical JVM languages, because integers are actually considered to be 32bit. Actually we could think of one more entry, using indices 0..2147483647, but then we would not be able to express the size in an signed integer. This stuff is quite deeply built into the language, so it is not so easy to break out of this. And 2’000’000’000 entries are a lot and take a lot of time to process. At least it was a lot in the first few years of Java. There should have been an unsigned variant of integers, which would in this case allow for 4’000’000’000 entries, when indexing by an uint32, but that would not really solve this problem. C uses 64 bit integers for indexing of arrays on 64 bit systems.

It turns out that we would like to be able to index arrays using long instead of int. Now changing the java arrays in a way that they could be indexed by long instead of int would break a lot of compatibility. I think this is impossible, because Java claims to retain very good backward compatibility and this reliability of both the language and the JVM has been a major advantage. Now a second type of arrays, indexed by long, could be added. This would imply even more complexity for APIs like reflection, that have to deal with all cases for parameters, where it already hurts that the primitives are no objects and arrays are such half-objects. So it will be interesting, what we can find in this area in the future.

For practical use it is a bit easier. We can already be quite happy with a second set of collections, let them be called BigCollections, that have sizes that can only be expressed with long and that are indexed in cases where applicable with longs. Now it is not too hard to program a BigList by internally using an array of arrays or an array of arrays of arrays and doing some arithmetic to calculate the internal indices from the long (int64) index given in the API. Actually we can buy some performance gain when resizing happens, because this structure, if well done, allows for more efficient resizing. Based on this all kinds of big collections could be built.

Share Button

Devoxx Kiew 2018

In the end of 2018 the number of conferences is kind of high. A great highlight is the Devoxx BE in Antwerp. But it has now five partner conferences in London, Paris, Krakow, Morocco and Kiev. So I decided to have a look at the one in Kiev.

How was it in comparison to the one in Belgium? What was better in Kiev: The food was way better, the drinks in the first evening (Whisky and Long Drinks vs. Belgium Beer) might be considered better, there were more people engaged to help the organizers…
What was better in Belgium: There were still a bit more speeches. While the location in Kiev was really great, in Belgium the rooms were way better for the purpose of providing a projection visible for everybody and doing a video recording that did not disturb the audience.
The quality of the speeches was mostly great in both locations. In Kiev they gamified the event a bit more..

Generally there was a wide range of topics and the talks were sorted into the following thematic groups:

  • Methodology & Culture
  • JVM Languages
  • Server Side
  • Architecture & Security
  • Mobile & IoT
  • Machine Learning & AI
  • Big Data & Data Mining
  • Cloud, Containers & Infrastructure
  • Modern Web & UX

See the schedule for the distribution…

I attended on Friday:

I attended on Saturday:

A lot to learn.

Share Button

Devoxx Antwerp 2018

In 2018 I am visiting a few conferences. A great highlight is the Devoxx BE in Antwerp, which I had the privilege of visiting 2012, 2013, 2014, 2015, 2016 and 2017.

As it should be, it is not just the same every year, but content and speakers change a bit from year to year.

Some topics that got a lot of attention were functional programming, artificial intelligence, Big Data, Machine Learning, clouds, JVMs, Kotlin

There was less about other JVM languages (apart from Kotlin), so Scala, Clojure, Groovy or Ceylon were covered little or not at all and Android used to be more present in other years. I would say that Ceylon has become irrelevant, probably because Kotlin was too similar and came out the same time and won. Groovy has its niche, Clojure has its niche, Scala and Kotlin have become mature and are now the two mainstream alternatives to Java, but themselves much smaller than Java. This was represented in the conference, taking into account that Scala has its own large conferences, like Scala Days, Scala Exchange, Scala World and a lot more.

Some side issues that might worry some of us did come up occasionally. Was it bad, that IBM bought Red Hat? At least they paid around 34’000’000’000 USD, which is more than 2’500’000 USD per employee. There are probably no other assets in terms of buildings, patents, hardware or whatever, that would justify this price, so IBM probably will have an interest to keep a large number of these employees and not scare them away by too much „IBM-culture“. We will see, but no reason to get immediately worried. Oracle wants money for running their JVM in production after more than 6 months. This can be avoided by always switching to the newest version or by relying on the JDKs offered by alternative sources like Amazon, RedHat…

Microsoft was a sponsor and had a booth. Their topic was not MS-Windows and MS-Office and MS-SQL-Server, but Azure, which can be used with Linux and Java and PostgreSQL, for example. The company did change a bit since the days of Steve Ballmer and we will see if this is an excursion or a continuous direction.

And James Gosling was there at the opening, as a surprise.

Generally there was a wide range of topics and the talks were sorted into the following thematic groups:

  • Methodology & Culture
  • Java Language
  • Programming languages
  • Architecture & Security
  • Big Data & Machine Learning
  • Mind the Geek
  • Server Side Java
  • Modern Web & UX
  • Cloud, Containers & Infrastructure
  • Mobile & IoT

See the schedule for the distribution…

I attended on Wednesday:

I attended on Thursday:

I attended on Friday:

It was a great conference. A lot of new ideas.

Share Button

Logging

Deutsch

Software often contains a logging functionality. Usually entries one or sometimes multiple lines are appended to a file, written to syslog or to stdout, from where they are redirected into a file. They are telling us something about what the software is doing. Usually we can ignore all of it, but as soon as something with „ERROR“ or worse and more visible stack traces can be found, we should investigate this. Unfortunately software is often not so good, which can be due to libraries, frameworks or our own code. Then stack traces and errors are so common that it is hard to look into or to find the ones that are really worth looking into. Or there is simply no complete process in place to watch the log files. Sometimes the error shows up much later than it actually occurred and stack traces do not really lead us to the right spot. More often than we think logging actually introduces runtime errors, that were otherwise not present. This is related to a more general concept, which is called observer effect, where logging actually changes the business logic.

It is nice that log files keep to some format. Usually they start with a time stamp in ISO-format, often to the millisecond. Please add trailing zeros to always have 3 digits after the decimal point in this case. It is preferable to use UTC, but people tend to stick to local date and time zones, including the issues that come with switching to and from daylight saving time. Usually we have several processes or threads that run simultaneously. This can result in a wild mix of logging entries. As long as even multiline entries stay together and as long as beginning and end of one multiline entry can easily be recognized, this can be dealt with. Tools like splunk or simple Perl, Ruby or Python scripts can help us to follow threads separately. We could actually have separate logs for each thread in the first place, but this is not a common practice and it might hit OS-limitations on the number of open files, if we have many threads or even thousands of actors as in Erlang or Akka. Keeping log entries together can be achieved by using an atomic write, like the write system call in Linux and other Posix systems. Another way is to queue the log entries and to have a logger thread that processes the queue.

Overall this area has become very complex and hard to tame. In the Java world there used to be log4j with a configuration file that was a simple properties file, at least in the earlier version. This was so good that other languages copied it and created some log4X. Later the config file was replaced by XML and more logging frame works were added. Of course quite a lot of them just for the purpose of abstracting from the large zoo of logging frameworks and providing a unique interface for all of them. So the result was, that there was one more to deal with.

It is a good question, how much logic for handling of log files do we really want to see in our software. Does the software have to know, into which file it should log or how to do log rotation? If a configuration determines this, but the configuration is compiled into the jar file, it does have to know… We can keep our code a bit cleaner by relying on program functionality without code, but this still keeps it as part of the software.

Log files have to please the system administrator or whoever replaced them in a pure devops shop. And in the end developers will have to be able to work with the information provided by the logs to find issues in the code or to explain what is happening, if the system administrator cannot resolve an issue by himself. Should this system administrator have to deal with a different special complex setup for the logging for each software he is running? Or should it be necessary to call for developer support to get a new version of the software with just another log setting, because the configurations are hard coded in the deployment artifacts? Interesting is also, what happens when we use PAAS, where we have application server, database etc., but the software can easily move to another server, which might result in losing the logs. Moving logs to another server or logging across the network is expensive, maybe more expensive than the rest of this infrastructure.

Is it maybe a good idea to just log to stdout, maintaining a decent format and to run the software in such a way that stdout is piped into a log manager? This can be the same for all software and there is one way to configure it. The same means not only the same for all the java programs, but actually the same for all programs in all languages that comply to a minimal standard. This could be achieved using named pipes in conjunction with any hard coded log file that the software wants to use. But this is a dangerous path unless we really know what the software is doing with its log files. Just think of what weird errors might happen if the software tries to apply log rotation to the named pipe by renaming, deleting, creating new files and so on. A common trick to stop software from logging into a place where we do not want this is to create a directory with the name of the file that the software usually uses and to write protect this directory and its parent directory for the software. Please find out how to do it in detail, depending on your environment.

What about software, that is a filter by itself, so its main functionality is to actually write useful data to stdout? Usually smaller programs and scripts work like this. Often they do not need to log and often they are well tested relyable parts of our software installation. Where are the log files of cp, ls, rm, mv, grep, sort, cat, less,…? Yes, they do tend to write to stderr, if real errors occur. Where needed, programs can turn on logging with a log file provided on the command line, which is also a quite operations friendly approach. Named pipes can help here.

And we had a good logging framework in place for many years. It was called syslog and it is still around, at least on Linux.

A last thought: We spend really a lot of effort to get well performing software, using multiple processes, threads or even clusters. And then we forget about the fact that logging might become the bottle neck.

Share Button

Some thoughts about String equality

Of course Strings are today in some way Unicode. In this article we assume code points as the building blocks of Strings. That means for example in the Java-world, that we are talking about one code point being comprised of one Java character for typical European languages, using Latin, Greek or Cyrillic alphabets including extensions to support all languages typically using these alphabets, for example. But when moving to Asian languages, a code point can also consist of two Java characters and there are Strings that are illegal from Unicode perspective, because they contain characters that should be combined in a way that cannot be combined properly. So here we assume, that Strings consist of sequences of bytes or two-byte characters or whatever encoding that properly express a sequence of code points. There are many interesting issues when dealing with some Asian languages that we will not cover here today.

Now there are a lot of possibilities to create Strings, that look the same, but are actually different. We are not talking about „0“ and „O“ or „1“ and „l“ and „I“ that might look similar in some fonts, but should not look similar, because we actually depend on their distinctness, even on their visual distinctness. Unfortunately we have the bad habit of using traditional typewriter fonts, that make it hard to distinguish these, for source code, where it would be so crucial. But for today, we just assume that we always look hard enough to solve this issue.

The classical example of what looks the same is whitespace. We have ordinary space “ “ and no break space “ „, that are meant to look exactly the same, but to expose a slightly different behavior. There are tons of possibilities to create exactly the same look with different combinations of whitespace. But this is kind of a special case, because in terms of semantics often carries little information and we want to disregard it to some extent when comparing strings. Typical examples are stripping of leading and trailing whitespace of the string or of the lines contained within it and replacing tabulators with the number of spaces that would be equivalent. Or even to replace any amount of adjacent whitespace within a line by a single space. Again, handling of different whitespace code points might require different rules, so it is good to be careful in not putting to much logic and it is better to rely on a library to at least apply exactly the same rules in equivalent situations.

Another example that we actually might know is that certain characters look the same or almost the same in the Cyrillic, Greek and Latin alphabets. I try to give an idea of the meaning of the Greek and Cyrillic characters, but they depend on the language, the dialect and even the word, the word form or the actual occurrence of the letter in the word…

LatinCyrillicGreekmeaning of Cyrillic Lettermeaning of Greek letter
AАAlike Latinlike Latin
BВBlike Latin VBeta (like V in new Greek)
CСlike Latin S
EЕElike LatinEpsilon (like Latin E)
ГHlike Latin GGamma (like Latin G)
HНΗlike Latin NEta (like Latin I in new Greek)
JЈSerbian Ј, like German J
KКΚlike LatinKappa (like Latin K)
MМΜlike LatinMu (like Latin M)
NΝNu (like Latin N)
OОΟlike LatinOmikron (like Latin O)
PРΡlike Latin RRho (like Latin R)
ПΠlike Latin PPi (like Latin P)
TТΤlike LatinTau (like Latin T)
ФΦlike Latin FPhi (like Latin F)
XХΧlike German CHChi (like German CH)
YУΥlike Latin UUpsilon (like Latin U)
ZΖZeta (like German Z)
IІΙUkrainian IIota (like Latin I)

In this case we usually want the characters to look the same or at least very similar, because that is how to correctly display them, but we do want them to be different when comparing strings.

While these examples are kind of obvious, there is another one that we tend to ignore, but that will eventually catch us. There are so called combining characters, that should actually be named „combining code points“, but here we go. That means that we can put them after a letter and they will combine to form a letter with diacritical marks. A typical example is the letter „U“ that can be combined with two dots “ ̈ ̈“ to form an „Ü“, which looks the same as the „Ü“ that is composed of one code point. It is meant to look the same, but it also has the same meaning, at least for most purposes. What we see is the Glyph. We see the difference when we prefix each code point with a minus or a space: „Ü“ -> „-U-̈“ or “ U ̈“, while the second one is transformed like this: „Ü“ -> „-Ü“ or “ Ü“, as we would expect.

While the way to express the Glyph in such a way with two code points is not very well known and thus not very common, we actually see it already today when we look at Wikipedia articles. In some languages, where the pronunciations is ambiguous, it can be made clear by putting an accent mark on one vowel, as for example Кириллица, which puts an accent mark on the term in the beginning of the article like this: „Кири́ллица“. Since in Cyrillic Alphabet accent marks are unfortunately not used in normal writing, it comes in handy that the combining accent also works with cyrillic letter. When putting minus-signs between the code points it looks like this: „К-и-р-и-́-л-л-и-ц-а“ or with spaces like this: „К и р и ́ л л и ц а“. So Strings that we encounter in our programs will contain these combining characters in the future. While we can prohibit them, it is better to embrace this and it is actually not too hard, if we use decent libraries. Java has the Normalizer class in its built in library, that can convert to one or the other convention of expressing such glyphs and then allowing comparison in the way that we actually mean.

Unfortunately issues like semantic lengths of strings or semantic positions become even harder than they already are after moving from characters to code points. And we can be sure that Unicode has still more to offer to complicate things, if we dig deeper. The typical answer that we get on most web sites that talk about these issues is something like: „The length of strings and positions within strings are surprisingly irrelevant to most programs.“

In the end of the day, jobs that have been trivial in the past are now becoming a big deal and we need to learn to think of comparison, length, position, regular expressions, sorting and all kinds of string functionality with bytes, characters, code points and glyphs in mind.

What can our current libraries already do for us, what are we missing in them, considering different programming languages, databases, text files and network transmission?

Links

Share Button

Serialization

Deutsch

Serialization allows us to store objects in a lossless way or to transfer them over the network. This could be done before, but it was necessary to program the serialization mechanism, which was a lot of work. Of course, they were not yet called objects in those days…

Java suddenly had such a serialization for (almost) all object without any additional programming effort. This does not mean that this automatic serialization did not exist before, but it was made popular with Java, because frameworks started to heavily rely on it. To use it ourselves, we just had to use ObjectOutputStream and ObjectInputStream and Objects could be stored and written across the network or just be cloned. It was even able to handle circular references, which most serialization mechanisms cannot do. The idea was not really new, as other languages had something like this already before, but nobody became aware of it.

But there are some drawbacks, that were discovered when it was already too late and that should at least be mentioned.

  • Marking with a Serializiable-Interface is conceptionally quite a bad solution, because it assumes that serialization never gets lost by deriving classes, which is just not true. An Unserializable interface would have been a much better solution, if not almost ideal solution for this, because trivial objects are always serializable and they loose this when something non-serializable is added. Then again, how about collections… Today possibly some annotation could also be helpful.
  • This serialVersionUID creates a lot of pain. Should we change it, whenever the interface changes? We talk about the implicit serialization interface, not about an explicit interface that we can easily see. Should we trust automatic mechanisms? In any case issues with incompatible versions remain that are not really solved well and cannot even easily be solved well.
  • Serialization introduces an additional invisible constructor.
  • Serialization undermines the idea of private and protected, because suddenly private and protected member attributes become part of the interface
  • Funny effects happen with serializable non-static inner classes, because there serialized version bakes in the containing outer object. Yes it has to…
  • The object indentity gets lost, when an object is serialized. It is easy to create several copies of the same object.
  • Sometimes it is still necessary to manually write serialization code, for example for singletons. It is easy to forget this, because everything seems to work just fine automatically.
  • Java’s Serialization is quite slow.
  • The format is binary and cannot easily be read. A pluggable serialization format that could allow more human readable data files like JSON, XML,… would have been better…
  • Serialization creates a temptation to use this format for communication, which again forces a tight coupling that might not be necessary otherweise.
  • Serialization creates a temptation to use it as a storage format instead of mature database technologies. Very bad: the second level cache of hibernate…

There were some advantages in having this serialization in the past and for some purposes it kind of works. But it is important to question this and to consider other, more solid approaches, even if they require slightly more work. Generally it is today considered one of the larges fallacies of Java to introduce this serialization mechanism in this way. There are now better ways to do serialization, that require a bit more work, but avoid some of the terrible short comings of the native Java-Serialization.

For the serialVersionUID there are several approaches that can work. A statical method, that extracts from an „$Id$“-string that is managed by svn, can be a way. It will avoid compatibility between even slightly different versions, which is probably the best we can get. With git it is a bit harder, but it can be done as well.

Usually it is the best choice by far to leave serialVersionUID empty and rely on Java’s automatic mechanisms. They are not perfect, but better than 99% of the manually badly maintained serialVersionUIDs. If you want to manage your serialVersionUIDs yourself, there needs to be a checklist on what to do to release a new version of a file, a library or a whole software system. This is usually sick, because it creates a lot of work, even more errors and should really be done only with good reasons, good discipline and a very good concept. If you like anyway to use serialVersionUID or if you are forced to do so by the project, here is a script to create them randomly:

#!/usr/bin/perl
use bigint;
use Math::Random::Secure qw(irand);
my $r = (irand() << 32) + irand();
printf "%20d\n", $r;

This is still better than using the IDE-generated value and keeping it forever or starting with a 1 or 0 and keeping it forever, because updating this serialVersionUID is not really on our agenda. And it shouldn't be.

Share Button

Meaningless Whitespace in Textfiles

We use different file formats that are more or less tolerant to certain changes. Most well known is white space in text files.

In some programming languages white space (space, newline, carriage return, form feed, tabulator, vertical tab) has no meaning, as long as any whitespace is present. Examples for this are Java, Perl, Lisp or C. Whitespace, that is somehow part of String content is always significant, but white space that is used within the program can be combination of one or more of the white space characters that are in the lower 128 positions (ISO-646, often referred to as ASCII or 7bit ASCII. It is of course recommended to have a certain coding standard, which gives some guidelines of when to use newlines, if tabs or spaces are preferred (please spaces) and how to indent. But this is just about human readability and the compiler does not really care. Line numbers are a bit meaningful in compiler and runtime error messages and stack traces, so putting everything into one line would harm beyond readability, but there is a wide range of ways that are all correct and equivalent. Btw. many teams limit lines to 80 characters, which was a valid choice 30 years ago, when some terminals were only 80 characters wide and 132 character wide terminals where just coming up. But as a hard limit it is a joke today, because not many of us would be able to work with a vt100 terminal efficiently anyway. Very long lines might be harder to read, so anything around 120 or 160 might still be a reasonable idea about line lengths…

Languages like Ruby and Scala put slightly more meaning into white space, because in most cases a semicolon can be skipped if it is followed by a newline and not just horizontal white space. And Perl (Perl 5) is for sure so hard to compile that only its own implementation can properly format or even recognize which white space is part of a literal string. Special cases like having the language in a string and parsing and then executing that should be ignored here.

Now we put this program files into a source code management system, usually Git. Some teams still use legacy systems like subversion, source safe, clear case or CVS, while there are some newer systems that are probably about as powerful as git, but I never saw them in use. Git creates an MD5 hash of each file, which implies that any minor change will result in a new version, even if it is just white space. Now this does not hurt too much, if we agree on the same formatting and on the same line ending (hopefully LF only, not CR LF, even on MS-Windows). But our tooling does not make any difference between significant changes and insignificant formatting only changes. This gets worse, if users have different IDEs, which they should have, because everyone should use the IDE or editor, with which he or she is most efficient and the formal description of the preferred formatting is not shared between editors or differs slightly.

I think that each programming language should come with a command line diff tool and a command line formatting tool, that obey a standard interface for calling and can be plugged into editors and into source code management systems like git. Then the same mechanisms work for C, Java, C#, Ruby, Python, Fortran, Clojure, Perl, F#, Scala, Lua or your favorite programming language.

I can imaging two ways of working: Either we have a standard format and possibly individual formats for each developer. During „git commit“ the file is brought into the standard format before it is shown to git. Meaning less whitespace changes disappear. During checkout the file can optionally be brought into the preferred format of the developer. And yes, there are ways to deal with deliberate formatting, that for some reason should be kept verbatim and for dealing differently with comments and of course all kinds of string literals. Remember, the formatting tool comes from the same source as the compiler and fully understands the language.

The other approach leaves the formatting up to the developer and only creates a new version, when the diff tool of the language signifies that there is a relevant change.

I think that we should strive for this approach. It is no rocket science, the kind of tools were around for many decades as diff and as formatting tools, it would just be necessary to go the extra mile and create sister diff and formatting tools for the compiler (or interpreter) and to actually integrate these into build environments, IDEs, editors and git. It would save a lot of time and leave more time for solving real problems.

Is there any programming language that actually does this already?

How to handle XML? Is XML just the new binary with a bit more bloat? Can we do a generic handling of all XML or should it depend on the Schema?

Share Button

Loops with unknown nesting depth

We often encounter nested loops, like

for (i = 0; i < n; i++) {
    for (j = 0; j < m; j++) {
        doSomething(i, j);
    }
}

This can be nested to a few more levels without too much pain, as long as we observe that the number of iterations for each level need to be multiplied to get the number of iterations for the whole thing and that total numbers of iterations beyond a few billions (10^9, German: Milliarden, Russian Миллиарди) become unreasonable no matter how fast the doSomethings(...) is. Just looking at this example program

public class Modular {
    public static void main(String[] args) {
        long n = Long.parseLong(args[0]);
        long t = System.currentTimeMillis();
        long m = Long.parseLong(args[1]);
        System.out.println("n=" + n + " t=" + t + " m=" + m);
        long prod = 1;
        long sum  = 0;
        for (long i = 0; i < n; i++) {
            long j = i % m;
            sum += j;
            sum %= m;
            prod *= (j*j+1) % m;
            prod %= m;
        }
        System.out.println("sum=" + sum + " prod=" + prod + " dt=" + (System.currentTimeMillis() - t));
    }
}

which measures it net run time and runs 0 msec for 1000 iterations and almost three minutes for 10 billions (10^{10}):

> java Modular 1000 1001 # 1'000
--> sum=1 prod=442 dt=0
> java Modular 10000 1001 # 10'000
--> sum=55 prod=520 dt=1
> java Modular 100000 1001 # 100'000
--> sum=45 prod=299 dt=7
> java Modular 1000000 1001 # 1'000'000
--> sum=0 prod=806 dt=36
> java Modular 10000000 1001 # 10'000'000
--> sum=45 prod=299 dt=344
> java Modular 100000000 1001 # 100'000'000
--> sum=946 prod=949 dt=3314
> java Modular 1000000000 1001 # 1'000'000'000
--> sum=1 prod=442 dt=34439
> java Modular 10000000000 1001 # 10'000'000'000
--> sum=55 prod=520 dt=332346

As soon as we do I/O, network access, database access or simply a bit more serious calculation, this becomes of course easily unbearably slow. But today it is cool to deal with big data and to at least call what we are doing big data, even though conventional processing on a laptop can do it in a few seconds or minutes... And there are of course ways to process way more iterations than this, but it becomes worth thinking about the system architecture, the hardware, parallel processing and of course algorithms and software stacks. But here we are in the "normal world", which can be a "normal subuniverse" of something really big, so running on one CPU and using a normal language like Perl, Java, Ruby, Scala, Clojure, F# or C.

Now sometimes we encounter situations where we want to nest loops, but the depth is unknown, something like

for (i_0 = 0; i_0 < n_0; i_0++) {
  for (i_1 = 0; i_1 < n_1; i_1++) {
    \cdots
      for (i_m = 0; i_m < n_m; i_m++) {
        dosomething(i_0, i_1,\ldots, i_m);
      }
    \cdots
  }
}

Now our friends from the functional world help us to understand what a loop is, because in some of these more functional languages the classical C-Style loop is either missing or at least not recommended as the everyday tool. Instead we view the set of values we iterate about as a collection and iterate through every element of the collection. This can be a bad thing, because instantiating such big collections can be a show stopper, but we don't. Out of the many features of collections we just pick the iterability, which can very well be accomplished by lazy collections. In Java we have the Iterable, Iterator, Spliterator and the Stream interfaces to express such potentially lazy collections that are just used for iterating.

So we could think of a library that provides us with support for ordinary loops, so we could write something like this:

Iterable range = new LoopRangeExcludeUpper<>(0, n);
for (Integer i : range) {
    doSomething(i);
}

or even better, if we assume 0 as a lower limit is the default anyway:

Iterable range = new LoopRangeExcludeUpper<>(n);
for (Integer i : range) {
    doSomething(i);
}

with the ugliness of boxing and unboxing in terms of runtime overhead, memory overhead, and additional complexity for development. In Scala, Ruby or Clojure the equivalent solution would be elegant and useful and the way to go...
I would assume, that a library who does something like LoopRangeExcludeUpper in the code example should easily be available for Java, maybe even in the standard library, or in some common public maven repository...

Now the issue of loops with unknown nesting depth can easily be addressed by writing or downloading a class like NestedLoopRange, which might have a constructor of the form NestedLoopRange(int ... ni) or NestedLoopRange(List li) or something with collections that are more efficient with primitives, for example from Apache Commons. Consider using long instead of int, which will break some compatibility with Java-collections. This should not hurt too much here and it is a good thing to reconsider the 31-bit size field of Java collections as an obstacle for future development and to address how collections can grow larger than 2^{31}-1 elements, but that is just a side issue here. We broke this limit with the example iterating over 10'000'000'000 values for i already and it took only a few minutes. Of course it was just an abstract way of dealing with a lazy collection without the Java interfaces involved.

So, the code could just look like this:

Iterable range = new NestedLoopRange(n_0, n_1, \ldots, n_m);
for (Tuple t : range) {
    doSomething(t);
}

Btw, it is not too hard to write it in the classical way either:

        long[] n = new long[] { n_0, n_1, \ldots, n_m };
        int m1 = n.length;
        int m  = m1-1; // just to have the math-m matched...
        long[] t = new long[m1];
        for (int j = 0; j < m1; j++) {
            t[j] = 0L;
        }
        boolean done = false;
        for (int j = 0; j < m1; j++) {
            if (n[j] <= 0) {
                done = true;
                break;
            }
        }
        while (! done) {
            doSomething(t);
            done = true;
            for (int j = 0; j < m1; j++) {
                t[j]++;
                if (t[j] < n[j]) {
                    done = false;
                    break;
                }
                t[j] = 0;
            }
        }

I have written this kind of loop several times in my life in different languages. The first time was on C64-basic when I was still in school and the last one was written in Java and shaped into a library, where appropriate collection interfaces were implemented, which remained in the project or the organization, where it had been done, but it could easily be written again, maybe in Scala, Clojure or Ruby, if it is not already there. It might even be interesting to explore, how to write it in C in a way that can be used as easily as such a library in Java or Scala. If there is interest, please let me know in the comments section, I might come back to this issue in the future...

In C it is actually quite possible to write a generic solution. I see an API like this might work:

struct nested_iteration {
  /* implementation detail */
};

void init_nested_iteration(struct nested_iteration ni, size_t m1, long *n);
void dispose_nested_iteration(struct nested_iteration ni);
int nested_iteration_done(struct nested_iteration ni); // returns 0=false or 1=true
void nested_iteration_next(struct nested_iteration ni);

and it would be called like this:

struct nested_iteration ni;
int n[] = { n_0, n_1, \ldots, n_m };
for (init_nested_iteration(ni, m+1, n); 
     ! nested_iteration_done(ni); 
     nested_iteration_next(ni)) {
...
}

So I guess, it is doable and reasonably easy to program and to use, but of course not quite as elegant as in Java 8, Clojure or Scala.
I would like to leave this as a rough idea and maybe come back with concrete examples and implementations in the future.

Links

Share Button