How to rename files according to a pattern

We often encounter situations where a large number of files needs to be copied, renamed or moved in some systematic way.
This can be done on the Linux command line, and it works in almost the same way on the Unix/Linux/Cygwin command line of newer MS-Windows or MacOS-X systems.

Now people routinely do that and they have developed several ways of doing it, which are all valid and useful.

I will show how I do things like that. It works and it is not the only way to do it.

So in the simplest case, all files in a directory ending in ‚.a‘ should be renamed to end in ‚.b‘.

What I do is:

ls *.a \
|perl -p -e 'chomp;$x = $_;s/\.a$/.b/;$y = $_; s/.+/mv $x $y\n/;' \
|egrep '^mv ' \
|sh

You can run it without the last |sh, to check if it really does what you want.

So I use the file names as input to a short Perl script and create shell commands from them. It would also be possible to do this directly in Perl itself, without piping into a shell:

ls *.b \
|perl -n -e 'chomp;$x = $_;s/\.b$/.c/;$y=$_;rename $x, $y;'

You could also read the directory from Perl itself, which is quite easy, but for quickly doing stuff I prefer getting the input from some ls.
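For completeness, a minimal sketch of the same renaming done entirely in Perl, reading the directory with glob (readdir or File::Find would work just as well):

#!/usr/bin/perl
use strict;
use warnings;

# rename all *.a files in the current directory to *.b
for my $x (glob('*.a')) {
    (my $y = $x) =~ s/\.a$/.b/;
    rename $x, $y or warn "could not rename $x to $y: $!";
}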

To descend into subdirectories, you can use find:

find . -name '*.c' -type f -print \
| perl -n -e 'chomp;$x = $_;s/\.c$/.d/;$y=$_;rename $x, $y;'

You can also rename all the files that contain a certain string:

find . -name '*.html' -type f -print \
|xargs egrep -l form \
|perl -n -e 'chomp; $x=$_;s/\.html$/.form/;$y=$_;rename $x, $y;'

So you can combine this with all kinds of shell commands and really do a lot of things in one line.

Of course you can use Raku, Ruby, Python or your favorite scripting language instead, as long as it allows some simple pattern matching and an efficient implicit iteration over the lines.

For such simple tasks there are also ways to do it directly in the shell, like this:

for f in *.d ; do mv $f `basename $f .d`.e; done

And you can always use sed, possibly in conjunction with awk, instead of perl for such simple tasks.
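For example, a sed-based variant of the rename above can follow the same generate-commands-and-pipe-to-sh pattern (a sketch, not hardened against exotic file names):

ls *.a \
|sed 's/\(.*\)\.a$/mv \1.a \1.b/' \
|sh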

Another approach is to just pipe the file names into a text editor that is powerful enough and to create a one-time script using powerful editing commands.
On Linux and Unix servers we almost always use vi, even people like me, who prefer Emacs on their own computer:

ls *.e > tmpscript
vi tmpscript

and then in vi

:1,$s/\(.*\)\(\.e\)$/mv \1\2 \1.f/

and then

sh tmpscript
rm tmpscript

So, there are many ways to achieve this goal and they are flexible and powerful enough to do really a lot more than just such simple pattern renaming.

If you work in a team and put these things into scripts, it might be necessary to follow a team policy about which scripting languages are preferred and which patterns are preferred. And you need to know the stuff that you write yourself, but also the stuff that your colleagues write.

Please, do not do

mv *.a *.b

It won’t work for good reasons.
On Linux and Unix systems the shell (usually bash) expands the glob expression (the stuff with the stars) into a list of strings and then starts mv with these strings as parameters. So when mv is called with a mixture of file names ending in .a and .b, it cannot have any idea what to do. When called with more than two parameters, the last one needs to be a directory to which the other files are moved, so usually it will just refuse to work.


Comparing Images

A practical problem that I have is to sort my digital photos. Some of them have been taken with analog cameras and they have been scanned. Some of them have been scanned several times, using different resolution, different providers or different technologies.

So one issue that occurs is having two directories which contain more or less the same images from different scans.

Some of them have been sorted or enriched with useful information or just rotated correctly, others have better resolution. Maybe it is good to use different scans to create a better image than the best scan.

For a few hundred films doing that manually is possible, but a lot of work. So I thought about automating it, at least partially.

There are several issues that make it more difficult. Scans are not very accurate, so they cut off a tiny random bit at the borders. So to match two images exactly it is necessary to scale and crop them.

Colors look quite different. And then of course resolutions are different. And sometimes they have been turned by an angle of 90 or 270 degrees. Some scans miss a few images that another one has.

So, how to start?

First all images are scaled to thumbnails of approximately 65536 pixels. It turns out to be 204×307, but could of course be anything around that size retaining the rough aspect ratio.

Now all thumbnail images from the two directories are read into memory. 80 thumbnail images are no big deal…

Images in portrait orientation are rotated by 90 and 270 degrees and both variants are used. So from here on all images are in landscape format. Upside down is not supported.

All images are scaled to exactly 204×307 in memory to allow comparison.

And the average r-, g- and b-values over all images of each directory are calculated. The r, g and b values of each pixel are then multiplied or divided by a factor, which is the square root of the quotient of these averages, and the result is constrained to the range 0..255. So this partially neutralizes the effect of different colors.

Now each thumbnail from the first directory is compared with each thumbnail from the second directory. This is done by calculating for each pixel the sum of the squares of the differences between the r, g and b values of the two images.

These values are added up and divided by the number of pixels (204×307 in my case). The square root of this is the comparison result. For images that are actually the same, comparison results between 30 and 120 are possible. Now some heuristic is used to find matches based on these values. Also an „interpolation“ is used if consecutive images occur in both directories and only the middle one is missing a match. This usually brings good results. In some cases manual interaction is needed. So it is possible to mark images in a web interface and then the match that was found for them is revoked. Also it is possible to provide „hints“, which means that these images should be matched.
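Written as a formula (this is my reading of the description above), the comparison value for two thumbnails A and B with N = 204·307 pixels is

d(A, B) = \sqrt{\frac{1}{N}\sum_{p=1}^{N}\left((r_A(p)-r_B(p))^2 + (g_A(p)-g_B(p))^2 + (b_A(p)-b_B(p))^2\right)}

where r_A(p), g_A(p) and b_A(p) are the color values of pixel p of image A after the color correction described above.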

It took about a day and a half to write this and it works sufficiently well for the purpose.

But there are much more sophisticated algorithms that really recognize objects. I have a program called hugin that combines a few images into a panorama. Usually it works quite well and it can even rotate, shift and scale images and does amazing things most of the time. Sometimes it just does not work with certain images. Also Google of course has very powerful software to recognize images, compare them and even create panoramas. If we think about face recognition… There is really good stuff around. But the first approach with some improvements gave sufficiently good results and the program will only run a couple of hundred times, then it will not be needed anymore.

This is heavily inspired by the Perl library Image::Compare, but since I am doing something slightly different, I do not use this library any more. Consequently it has also been written in Perl. I will happily provide my source code, but I have baked into it assumptions and conventions concerning the file names and directory structures of the images that I have introduced myself and that are not universal.


Perl Scripts for editing

Even though we do have IDEs with quite powerful refactoring mechanisms for many languages, it is still sometimes useful to have another automated editing mechanism.

Why would that be the case?

Some examples:

In some cases there was an SQL script to create the database tables, which had been written first. From that, classes and even CRUD operations in something like JDBC or DBI can be generated. Even though most Java projects that I have seen recently prefer using Hibernate (or more generally JPA2), there are some benefits in doing just plain old JDBC. This requires some discipline in writing the SQL in a uniform way and it was sometimes even necessary to have „magical comments“ that were ignored by the DB, but used by the script. This can save a lot of work and errors.

In some cases there were large numbers of HTML files. Now a similar kind of change had to be applied to each of them. Using a script to parse the HTML file and to apply the changes can save a lot of work and provide consistency that is often badly needed. And if the whole thing happens in the context of an agile project, then the stakeholders might want to see the outcome and come up with further ideas. No big deal, just change the transformation scripts and look at the result again. I recommend thinking in terms of larger transformation steps. Then I would retain the original files and apply the script for the step until the outcome is ok. Then this can be committed to git and the next step can be worked on. If the intermediate results are not useful for anybody else, just wait with the push or better work on a branch.

Sometimes refactorings have to be done that are not easily supported by the IDE. For example a whole bunch of classes needs to be moved to different packages according to some new naming rule. Just find all the classes with their old package names, then move them to the new directory structure and rename imports and package declarations to match the new structure. This can easily be done with a script. Always remember to be careful if Reflection is involved; it will most likely break all refactorings, no matter whether they are done by IDE or by script.

Also it is often useful to use scripts to analyze code and to find occurrences of certain patterns.

These things can be done with scripts in Ruby, Python, Perl or Raku. All of these are valid options. Ruby and Raku are somewhat more sophisticated languages than the other two. And Perl and Raku have the most sophisticated regex capabilities. I would assume that Raku is the best choice if you start from scratch; it even supports grammars out of the box, which might be a way to address such issues. Perl has them as add-on libraries. Of course it is also useful to work with what you know. But sometimes it is worth learning the right tool instead of just using the golden hammer to put in screws.

So in my case the tool for this is currently Perl. It does the job, is available more or less everywhere and I know it well enough. It might be worth moving to Raku in the future.

Some things that often work for simple scripts are the following:

Try to start with normalizing the input files; this can be the first transformation step. Remove trailing spaces, replace tabs with spaces, maybe normalize indentation, replace line endings by LF only without CR. In HTML replace HTML-entities for Unicode characters by the appropriate Unicode characters, convert everything to UTF-8 etc.
If you have data like phone numbers and dates in the file that are meaningful for your further steps, bring them to a standard format. Of course this always depends on your local circumstances, but it is usually the right way to go. This might be partially done with an external tool like xmllint.
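A minimal sketch of such a normalization step as a Perl filter (just to illustrate the idea; the details always depend on the project):

#!/usr/bin/perl
use strict;
use warnings;

# read from stdin, write to stdout: remove CRs, expand tabs, strip trailing spaces
while (my $line = <STDIN>) {
    $line =~ s/\r//g;          # line endings become LF only
    $line =~ s/\t/        /g;  # crude tab expansion to eight spaces
    $line =~ s/ +$//;          # remove trailing spaces
    print $line;
}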

For the next steps we need to consider the issue that we have multiple lines. There are regex variants that do not stop at line boundaries. But if we read the content line by line, it can still be a bit more work. Sometimes it is easier to just replace line feeds by some marker string that does not otherwise occur in your files (check this before!) and then apply the usual regexes to this longer line.
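A sketch of this marker trick, assuming here that the marker string does not occur in the input (which should be checked first):

#!/usr/bin/perl
use strict;
use warnings;

local $/ = undef;                  # slurp mode: read the whole file at once
my $content = <STDIN>;
my $marker  = "\x{0001}LF\x{0001}";

$content =~ s/\n/$marker/g;        # replace line feeds by the marker

# now ordinary regexes match across the original line boundaries,
# for example a <title> element that was split over several lines
if ($content =~ /<title>(.*?)<\/title>/) {
    (my $title = $1) =~ s/\Q$marker\E/ /g;
    print "title: $title\n";
}

$content =~ s/\Q$marker\E/\n/g;    # put the line feeds back if needed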

What we often need is a regex that finds the shortest and not the longest match. If we write something like /a.*b/, this will look for a sequence that starts with an „a“ and ends with the last „b“ that can be found. Often we want to end with the first „b“. This can be achieved by /a.*?b/.

Another pattern that is often useful for such scripts is to work with a state machine. If a certain pattern is discovered, we react to it depending on the state and possibly change the state. So we can apply changes to something only if it occurs in a certain context.
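A small sketch of such a state machine as a Perl filter, here with made-up BEGIN-SKIP and END-SKIP markers, dropping everything in between:

#!/usr/bin/perl
use strict;
use warnings;

my $state = 'COPY';
while (my $line = <STDIN>) {
    if ($state eq 'COPY') {
        if ($line =~ /BEGIN-SKIP/) {
            $state = 'SKIP';      # pattern found: change state, stop copying
        } else {
            print $line;
        }
    } elsif ($state eq 'SKIP') {
        if ($line =~ /END-SKIP/) {
            $state = 'COPY';      # back to normal copying
        }
    }
}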


JSON instead of Java Serialization: The solution?

We are starting to recognize that Serialization is not such a good idea.

It is cool and can really work on a wide range of objects, even including complex and cyclic reference graphs. And it was essential for some older Java frameworks like EJB and RMI, which allowed remote access to Java objects and classes.

But it is no longer the future: Oracle will soon deprecate it and later remove it. And it will happen this time, even though they really keep stuff around for a long time due to compatibility requirements.

Just to recap: it opens up security problems, it opens up hidden behavior and makes it harder to reason about code, it creates tight coupling between remote components and it can result in bugs that only occur at runtime and cannot be discovered at compile time. In short, it is not resilient.

So we need something else. Obvious candidates are XML, YAML and JSON. XML is of course an option and is powerful enough to do many things, but it is often a bit too clumsy and involves too much boilerplate, so we try to move away from it. YAML and JSON do roughly the same thing, but it seems that JSON is winning the race, so we all need to know JSON and many of us tend to skip YAML.

So why not use JSON? It is easy, it has good libraries and we can even find databases that work with JSON.

What JSON can express very well are scalars, lists and maps and combinations of these. This is quite exactly what we have in Perl, JavaScript or Clojure as basic building blocks. These languages support object oriented programming, but for simple stuff we go with these basic building blocks. And objects can be modelled as (hash-)maps, with the attribute names as keys. Actually JSON is valid JavaScript code.

We do have to change our thinking when moving from Java Serialization to JSON. JSON does not store any serializable object but just data. Maybe that is enough and that is what we actually want. It totally works in heterogeneous environments, where we are using different programming languages or different implementations.

There are good libraries. I have tried two, Jackson and Gson, which both work well; recently I have mostly used Jackson. It is important to think in terms of Clojure, JavaScript or Perl data structures, without objects. So we lose type information, which can be considered good or bad, but if we can arrange ourselves with it, we avoid the tight coupling. A JavaBean is expressed exactly the same way as a HashMap with the attribute names as keys. We can provide the top level class when deserializing, but at the child levels the library will not be able to figure that out if it relies on runtime information.

Example Code

Here I have tried it out. The full example code can be found on github.

A class that contains all kinds of stuff. Not prepared for really putting in nulls, but it is just experimental code…

package net.itsky.jackson;

import java.util.List;
import java.util.Map;
import java.util.Set;

public class TestObject {
    private Long l;
    private String s;
    private Boolean b;
    private Set set;
    private List list;
    private Map map;

    public TestObject(Long l, String s, Boolean b, Set set, List list, Map map) {
        this.l = l;
        this.s = s;
        this.b = b;
        this.set = set;
        this.list = list;
        this.map = map;
    }

    public TestObject() {
        // only for framework purposes
    }

    public Long getL() {
        return l;
    }

    public String getS() {
        return s;
    }

    public Boolean getB() {
        return b;
    }

    public Set getSet() {
        return set;
    }

    public List getList() {
        return list;
    }

    public Map getMap() {
        return map;
    }

    @Override
    public String toString() {
        return getClass().getSimpleName() + "("
                + "l=" + l + " (" + l.getClass() + ") "
                + " s=\"" + s + "\" (" + s.getClass() + ") "
                + " b=" + b + " (" + b.getClass() + ") "
                + " set=" + set  + " (" + set.getClass() + ") "
                + " list=" + list + " (" + list.getClass() + ") "
                + " map=" + map + " (" + map.getClass() + "))";
    }
}

And this is used for running everything. To play around more, it should probably be moved to tests..

package net.itsky.jackson;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.ObjectWriter;
import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableMap;
import com.google.common.collect.ImmutableSet;

import java.io.StringReader;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class App {

    public static void main(String[] args) {
        try {
            Set s1 = ImmutableSet.of(1, 2, 3);
            Set s2 = ImmutableSet.of(1, 2, 3);
            Map m1 = ImmutableMap.of("A", "abc", "B", 3L, "C", s1);
            List l1 = ImmutableList.of("i", "e", "a", "o", "u");
            TestObject t1 = new TestObject(30303L, "uv", true, s2, l1, m1);
            Map m2 = ImmutableMap.of("r", "r101", "s", 202, "t", t1);
            List l2 = ImmutableList.of("ä", "ö", "ü", "å", "ø");
            Set s3 = ImmutableSet.of("x", "y", "z");
            TestObject t2 = new TestObject(40404L, "ijk", false, s3, l2, m2);
            ObjectMapper mapper = new ObjectMapper();
            ObjectWriter writer = mapper.writerWithDefaultPrettyPrinter();
            System.out.println("t2=" + t2);
            String json = writer.writeValueAsString(t2);
            System.out.println("json=" + json);
            StringReader stringReader = new StringReader(json);
            TestObject t3 = mapper.readValue(stringReader, TestObject.class);
            System.out.println("t3=" + t3);
        } catch (Exception ex) {
            RuntimeException rex;
            if (ex instanceof RuntimeException) {
                rex = (RuntimeException) ex;
            } else {
                rex = new RuntimeException(ex);
            }
            throw rex;
        }
    }
}

And here is the output:

t2=TestObject(l=40404 (class java.lang.Long)  s="ijk" (class java.lang.String)
  b=false (class java.lang.Boolean)  
set=[x, y, z] (class
  list=[ä, ö, ü, å, ø] (class
  map={r=r101, s=202, 
t=TestObject(l=30303 (class java.lang.Long)
  s="uv" (class java.lang.String)  b=true (class java.lang.Boolean)
  set=[1, 2, 3] (class
  list=[i, e, a, o, u] (class
  map={A=abc, B=3, C=[1, 2, 3]} (class}
  "l" : 40404,
  "s" : "ijk",
  "b" : false,
  "set" : [ "x", "y", "z" ],
  "list" : [ "ä", "ö", "ü", "å", "ø" ],
  "map" : {
    "r" : "r101",
    "s" : 202,
    "t" : {
      "l" : 30303,
      "s" : "uv",
      "b" : true,
      "set" : [ 1, 2, 3 ],
      "list" : [ "i", "e", "a", "o", "u" ],
      "map" : {
        "A" : "abc",
        "B" : 3,
        "C" : [ 1, 2, 3 ]
t3=TestObject(l=40404 (class java.lang.Long)  s="ijk" (class java.lang.String)  
b=false (class java.lang.Boolean)  
set=[x, y, z] (class java.util.HashSet)  
list=[ä, ö, ü, å, ø] (class java.util.ArrayList)  
map={r=r101, s=202, 
t={l=30303, s=uv, b=true, 
set=[1, 2, 3], 
list=[i, e, a, o, u], 
map={A=abc, B=3, C=[1, 2, 3]}}}
  (class java.util.LinkedHashMap))

Process finished with exit code 0

So the immediate object and its immediate attributes were deserialized properly to what we provided. But everything inside went to maps, lists and scalars.

The intermediate JSON does not carry the type information at all, so this is the best that can be done.
Often this is exactly what we want. If not, we need to find something else or see if we can tweak JSON to carry type information.

It will be interesting to explore other serialization protocols…



Perl and Scala: what can they learn from each other?

Ironically, Scala at first drew my interest because I discovered that about ten years ago there was no really good understanding of how to do a good multithreading concept for Perl 6. I thought that exploring how they do it in Scala, which was already known to be good at that at the time, would give a more general understanding of this issue. At that time Perl 6 (now named „Raku“) was intended to rather go without multithreading capabilities than to do them badly. In the end I got dragged into Scala and found it by itself more interesting than the original issue. And the Perl 6 community eventually found good answers for providing multithreading capabilities anyway.

So while there are technical concepts in both of these languages that are interesting and could possibly be applied in some way to the other one, there is also an interesting parallel.

Both Scala and Perl have been „cool languages“ that were really strong in one area or even in a broader range of application areas. Both of them found a competitor that was kind of an „inferior clone“ of them. PHP in its early versions was very similar to Perl, but „simplified“ and kind of a subset of what Perl provided. At that time Perl had a real boom, because the first web applications came up and the only reasonable way to go was CGI, and of course it was done with Perl. There were some early alternatives like Cold Fusion and ASP, but they never really became mainstream, at least not outside of their respective communities. Now PHP eventually took over most of Perl’s CGI territory and has become a major building block of our current WWW. Wikipedia and this blog run on PHP. Perl eventually also lost its leading position as a system administration scripting language to Ruby and even more to Python and some others, but it is still there and has strong string parsing capabilities and a very useful ecosystem of libraries called CPAN.

Now Scala has found Kotlin to be a similar competitor. Besides being somewhat simpler, Kotlin also shines with good tooling support. It comes from the same organization as IntelliJ IDEA, which is the usual IDE for most JVM languages for people who rely neither on Emacs nor on vi. So Kotlin support in IntelliJ is always going to be a high priority. And Kotlin is officially supported by Google as a programming language for Android apps. It seems to work well, allows for more modern development than the supported Java versions and has conceptually a lot of similarity with Swift, which is the most modern programming language supported by Apple for iOS apps. There have been heroic and admirable approaches to allow for Android app development using other JVM languages, especially Scala. But they all suffer from the same set of problems. In order to avoid installing too much language specific code in the app, dynamic language features that would require a compile capability, as commonly used in Groovy or Clojure, have to be avoided. And the excessive use of the language’s libraries has to be avoided, because they are not on the phone already, but a copy of them has to be shipped with the app, for each app. So the storage usage is much higher than for Kotlin and Java apps. And then we see attempts to reduce the size of the libraries by only including what is needed. That is necessary, but it looks too fragile to really trust it. So, for mobile apps, it is Kotlin. Period. And then Kotlin is already there, so why not use it on the server as well. Yes, I do believe Scala is better, but that is not what everyone thinks, and it needs to be much better to justify the additional language, where app development for Android is already happening.

Now both Perl and Scala had some problems. To some extent they even share exactly the same problem. It was the possibility to write really „cool“ code that was very smart, very short and could not be read by anybody else without very much time and very much knowledge. This can be done in any language, but Perl is number one for this and I would put Scala as number two and C++ and C as numbers three and four, of the languages that I know. It is a good idea to use some coding standards that allow for clean Scala or Perl code. But please remain reasonable and do not let bureaucrats take charge of the coding standards and create a monster that drains all creativity. Allow using powerful features, but use them in a decent and readable way.

Now in both cases there was an effort to write a new version of the language that was meant to be slightly incompatible, cleaned up some of the weaknesses and brought some improvements. In the case of Perl this was Perl 6. It was developed for around 20 years and came out a few years ago. Eventually it turned out to be so different that it was renamed to Raku. For Scala, a new language called „Dotty“ was developed. It was decided to make this the next major version, Scala 3. Even though it is much closer to Scala 2 than Raku is to Perl 5, it is still incompatible and requires an effort to rewrite code. It can already be seen that large Perl 5 projects are hardly moving to Raku, so Perl 5 is there to stay and Raku is just a second language within the same community. This will probably not happen like that with Scala, and the core language team will probably at some point in time concentrate on Scala 3. But large organizations that have heavily invested in Scala cannot easily migrate, simply because it needs a lot of time and money. So we will probably also see some long term coexistence of Scala 2 and Scala 3. Maybe Scala 2 will be forked by major adopters. Or it will be supported for money from Lightbend.


How to get rid of these HTML-entities in Files

It has been written here that HTML-entities (these &auml; etc.) should be avoided, with the exception of those that we need due to the HTML syntax like &lt;, &gt;, &amp; and maybe &quot; and &nbsp;. They were already mostly obsolete more than 20 years ago, but in those days we still did not automatically use UTF-8 or UTF-16, but often an 8-bit character encoding that could express only up to 256 characters, in reality around 200 due to control characters. At least these 200 could be used. That was enough for web pages in those days, and texts in German, French, Russian, Greek, Hebrew, Arabic and many other languages could well be written, as long as only one language or a few similar languages were used. For the rare occasions that required some characters that were not in this character set, it was an option to rely on these HTML-entities. Or for typing HTML pages on a US keyboard without any good tool support.

But now Unicode has been around for more than 25 years and more than 90% of the web pages use UTF-8.

Now some people think that these HTML-entities are kind of necessary or at least „safer“, and I still see people writing HTML code with them these days. Or tools by relatively well-known companies that produced such output not so long ago… It is a good thing to have some courage and to change something like this to a readable and natural format. Or more generally to try out whether a simpler or better solution works. Reasonable courage is good for this; too much of something good can go bad, as so often…

So, please teach your colleagues not to use these ugly HTML-entities where UTF-8 characters are the better option.

And here is a Perl script that converts the HTML-entities, with the exceptions mentioned above, to UTF-8. In the project conversion-utils some more such scripts might be added. The script is a bit too long to be pasted inline in a code block, so it is better to find the current version on github.

Then you can do something like this:

git commit
for file in *.html ; do
    echo $file
    mv $file ${file}~entities~
    html2utf8 < ${file}~entities~ > $file
    echo /$file
done
git diff

to convert all the files in a directory. I assume that you are using Linux or at least have bash available, for example in Cygwin.
There are other tools to do the same thing, I am sure. Just use anything that works for you to get away from this unreadable crap.
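Just to illustrate the idea (this is not the script from the repository mentioned above, only a rough sketch using the CPAN module HTML::Entities): decode all entities and re-encode only the few characters that HTML syntax really needs:

#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities encode_entities);

binmode(STDIN,  ':encoding(UTF-8)');
binmode(STDOUT, ':encoding(UTF-8)');

while (my $line = <STDIN>) {
    $line = decode_entities($line);         # &auml; etc. become real characters
    $line = encode_entities($line, '<>&"'); # keep only the syntactically needed entities
    print $line;
}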


Exceptions to implement Program Logic

Sometimes it is convenient to use exceptions for implementing the regular program logic.

Assume we want to find some data and then process them. When no data are found, this step can be skipped, because there is nothing to do. So we could program something like this:

public Data readData(Param param) {
    Data data = ....;
    if (data.isEmpty()) {
        throw new NotFoundException("nothing found");
    }
    return data;
}

public ProcessedData doWork(Param param) {
    try {
        Data input = readData(param);
        ProcessedData result = ....;
        return result;
    } catch (NotFoundException nfex) {
        return ProcessedData.empty();
    }
}

And some other exceptions could also be handled in a similar way.

Of course some people say that this is not good and an abuse of exceptions. But sometimes it is tempting.

So is this bad? And if so, why? Let’s find out.

This is some kind of weird obfuscation of the control flow, because throwing and catching of exceptions can be far apart and it can become quite unclear from where in the stack which exceptions can be thrown. So there are good reasons to recommend using exceptions only for what they are meant for by their name. The goto has never made it into Java and we are discouraged from using it in many other languages, like C. But languages like Java, C, Perl, Ruby and some others provide quite a rich control flow relying neither on goto nor on exceptions, by allowing „return“ anywhere in a function, method or subroutine, leaving loops with „break“ or „last“ or going to the next iteration with „next“ or „continue“. Perl and Java even allow specifying which of several nested loops to leave with break or last. These mechanisms are very powerful and there is no urgent need to add exceptions or even gotos just to support the control flow.

Once we move to newer languages like Scala, much of this is gone or at least strongly discouraged in a purely functional programming style. This makes programming Scala harder, but comes with benefits that might be worth the extra effort.

But in Java these functional purists have not become very strong yet, so using „break“, „continue“, „return“ etc. is still ok and quite powerful.

In Java there is another very major problem with exceptions. Many, if not most, Java programs run in a framework or container like Spring, EJB/JEE or JBoss Fuse, for example. Now a piece of software becomes a software component that can interact with other components through the framework. And exceptions are noticed by the framework. In many cases they have the effect that an ongoing transaction is marked as „rollback only“. So the whole processing continues normally, and when all the code from the components is finally done, the framework performs a rollback instead of a commit.

As long as exceptions are only used for handling errors or unusual situations, in which case the rollback is probably the way to go anyway, everything is fine. But if we for example look up something and base the further processing on the outcome of this, then a NotFoundException will result in very counterintuitive behavior.
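For comparison, a sketch of the same logic without exceptions, in the same fragmentary style as the code above, using java.util.Optional to express „not found“ as a regular return value:

public Optional<Data> readData(Param param) {
    Data data = ....;
    if (data.isEmpty()) {
        return Optional.empty();
    }
    return Optional.of(data);
}

public ProcessedData doWork(Param param) {
    return readData(param)
        .map(input -> ....)               // the processing step from above
        .orElse(ProcessedData.empty());
}

This keeps the „nothing found“ case out of the exception machinery, so no framework will mistake it for an error.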

So the original rule of not abusing exceptions is actually not such a bad idea.




Logging

Software often contains a logging functionality. Usually entries of one or sometimes multiple lines are appended to a file, written to syslog or to stdout, from where they are redirected into a file. They tell us something about what the software is doing. Usually we can ignore all of it, but as soon as something with „ERROR“ or worse shows up, or even more visibly stack traces, we should investigate it. Unfortunately software is often not so good, which can be due to libraries, frameworks or our own code. Then stack traces and errors are so common that it is hard to look into them or to find the ones that are really worth looking into. Or there is simply no complete process in place to watch the log files. Sometimes the error shows up much later than it actually occurred and stack traces do not really lead us to the right spot. More often than we think, logging actually introduces runtime errors that were otherwise not present. This is related to a more general concept called the observer effect, where logging actually changes the business logic.

It is nice that log files keep to some format. Usually they start with a time stamp in ISO-format, often to the millisecond. Please add trailing zeros to always have 3 digits after the decimal point in this case. It is preferable to use UTC, but people tend to stick to local date and time zones, including the issues that come with switching to and from daylight saving time. Usually we have several processes or threads that run simultaneously. This can result in a wild mix of logging entries. As long as even multiline entries stay together and as long as beginning and end of one multiline entry can easily be recognized, this can be dealt with. Tools like splunk or simple Perl, Ruby or Python scripts can help us to follow threads separately. We could actually have separate logs for each thread in the first place, but this is not a common practice and it might hit OS-limitations on the number of open files, if we have many threads or even thousands of actors as in Erlang or Akka. Keeping log entries together can be achieved by using an atomic write, like the write system call in Linux and other Posix systems. Another way is to queue the log entries and to have a logger thread that processes the queue.

Overall this area has become very complex and hard to tame. In the Java world there used to be log4j with a configuration file that was a simple properties file, at least in the earlier versions. This was so good that other languages copied it and created some log4X. Later the config file was replaced by XML and more logging frameworks were added. Quite a lot of them exist just for the purpose of abstracting from the large zoo of logging frameworks and providing a uniform interface for all of them. So the result was that there was one more framework to deal with.

It is a good question how much logic for handling log files we really want to see in our software. Does the software have to know into which file it should log or how to do log rotation? If a configuration determines this, but the configuration is compiled into the jar file, it does have to know… We can keep our code a bit cleaner by moving this into configuration rather than code, but it still remains part of the software.

Log files have to please the system administrator or whoever has replaced that role in a pure devops shop. And in the end developers will have to be able to work with the information provided by the logs to find issues in the code or to explain what is happening, if the system administrator cannot resolve an issue by himself. Should this system administrator have to deal with a different, complex, special logging setup for each piece of software he is running? Or should it be necessary to call for developer support to get a new version of the software with just another log setting, because the configurations are hard-coded in the deployment artifacts? It is also interesting what happens when we use PaaS, where we have application servers, databases etc., but the software can easily move to another server, which might result in losing the logs. Moving logs to another server or logging across the network is expensive, maybe more expensive than the rest of this infrastructure.

Is it maybe a good idea to just log to stdout, maintaining a decent format, and to run the software in such a way that stdout is piped into a log manager? This can be the same for all software and there is one way to configure it. The same means not only the same for all the Java programs, but actually the same for all programs in all languages that comply with a minimal standard. This could be achieved using named pipes in conjunction with any hard-coded log file that the software wants to use. But this is a dangerous path unless we really know what the software is doing with its log files. Just think of what weird errors might happen if the software tries to apply log rotation to the named pipe by renaming, deleting and creating new files and so on. A common trick to stop software from logging into a place where we do not want it to is to create a directory with the name of the file that the software usually uses and to write protect this directory and its parent directory for the software. Please find out how to do it in detail, depending on your environment.
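For illustration, this directory trick could look roughly like this on a Linux system (the path /opt/myapp/logs/myapp.log and the user running the software are made up for this example):

# occupy the file name with a directory, so the log file cannot be created
mkdir /opt/myapp/logs/myapp.log
# write protect the directory and its parent for the user running the software
chown root:root /opt/myapp/logs /opt/myapp/logs/myapp.log
chmod 555 /opt/myapp/logs /opt/myapp/logs/myapp.log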

What about software that is a filter by itself, whose main functionality is to actually write useful data to stdout? Usually smaller programs and scripts work like this. Often they do not need to log and often they are well tested, reliable parts of our software installation. Where are the log files of cp, ls, rm, mv, grep, sort, cat, less,…? Yes, they do tend to write to stderr, if real errors occur. Where needed, programs can turn on logging with a log file provided on the command line, which is also a quite operations-friendly approach. Named pipes can help here.

And we had a good logging framework in place for many years. It was called syslog and it is still around, at least on Linux.

A last thought: We spend really a lot of effort to get well performing software, using multiple processes, threads or even clusters. And then we forget about the fact that logging might become the bottleneck.


Meaningless Whitespace in Textfiles

We use different file formats that are more or less tolerant to certain changes. Most well known is white space in text files.

In some programming languages white space (space, newline, carriage return, form feed, tabulator, vertical tab) has no meaning, as long as any whitespace is present. Examples for this are Java, Perl, Lisp or C. Whitespace that is somehow part of string content is always significant, but white space that is used within the program can be any combination of one or more of the white space characters that are in the lower 128 positions (ISO-646, often referred to as ASCII or 7-bit ASCII). It is of course recommended to have a certain coding standard, which gives some guidelines on when to use newlines, whether tabs or spaces are preferred (please spaces) and how to indent. But this is just about human readability and the compiler does not really care. Line numbers are a bit meaningful in compiler and runtime error messages and stack traces, so putting everything into one line would harm beyond readability, but there is a wide range of ways that are all correct and equivalent. Btw. many teams limit lines to 80 characters, which was a valid choice 30 years ago, when some terminals were only 80 characters wide and 132 character wide terminals were just coming up. But as a hard limit it is a joke today, because not many of us would be able to work with a vt100 terminal efficiently anyway. Very long lines might be harder to read, so anything around 120 or 160 might still be a reasonable idea for line lengths…

Languages like Ruby and Scala put slightly more meaning into white space, because in most cases a semicolon can be skipped if it is followed by a newline and not just horizontal white space. And Perl (Perl 5) is for sure so hard to compile that only its own implementation can properly format or even recognize which white space is part of a literal string. Special cases like having the language in a string and parsing and then executing that should be ignored here.

Now we put these program files into a source code management system, usually Git. Some teams still use legacy systems like Subversion, SourceSafe, ClearCase or CVS, while there are some newer systems that are probably about as powerful as git, but I never saw them in use. Git creates a hash (SHA-1) of each file, which implies that any minor change will result in a new version, even if it is just white space. Now this does not hurt too much if we agree on the same formatting and on the same line ending (hopefully LF only, not CR LF, even on MS-Windows). But our tooling does not make any difference between significant changes and insignificant formatting-only changes. This gets worse if users have different IDEs, which they should have, because everyone should use the IDE or editor with which he or she is most efficient, and the formal description of the preferred formatting is not shared between editors or differs slightly.

I think that each programming language should come with a command line diff tool and a command line formatting tool that obey a standard interface for calling and can be plugged into editors and into source code management systems like git. Then the same mechanisms work for C, Java, C#, Ruby, Python, Fortran, Clojure, Perl, F#, Scala, Lua or your favorite programming language.

I can imagine two ways of working: Either we have a standard format and possibly individual formats for each developer. During „git commit“ the file is brought into the standard format before it is shown to git. Meaningless whitespace changes disappear. During checkout the file can optionally be brought into the preferred format of the developer. And yes, there are ways to deal with deliberate formatting that for some reason should be kept verbatim and for dealing differently with comments and of course all kinds of string literals. Remember, the formatting tool comes from the same source as the compiler and fully understands the language.

The other approach leaves the formatting up to the developer and only creates a new version, when the diff tool of the language signifies that there is a relevant change.

I think that we should strive for this approach. It is no rocket science, these kinds of tools have been around for many decades as diff and formatting tools; it would just be necessary to go the extra mile and create sister diff and formatting tools for the compiler (or interpreter) and to actually integrate these into build environments, IDEs, editors and git. It would save a lot of time and leave more time for solving real problems.

Is there any programming language that actually does this already?

How to handle XML? Is XML just the new binary with a bit more bloat? Can we do a generic handling of all XML or should it depend on the Schema?


Loops with unknown nesting depth

We often encounter nested loops, like

for (i = 0; i < n; i++) {
    for (j = 0; j < m; j++) {
        doSomething(i, j);
    }
}

This can be nested to a few more levels without too much pain, as long as we observe that the numbers of iterations for each level have to be multiplied to get the number of iterations for the whole thing and that total numbers of iterations beyond a few billions (10^9, German: Milliarden, Russian: миллиарды) become unreasonable, no matter how fast the doSomething(...) is. Just looking at this example program

public class Modular {
    public static void main(String[] args) {
        long n = Long.parseLong(args[0]);
        long t = System.currentTimeMillis();
        long m = Long.parseLong(args[1]);
        System.out.println("n=" + n + " t=" + t + " m=" + m);
        long prod = 1;
        long sum  = 0;
        for (long i = 0; i < n; i++) {
            long j = i % m;
            sum += j;
            sum %= m;
            prod *= (j*j+1) % m;
            prod %= m;
        }
        System.out.println("sum=" + sum + " prod=" + prod + " dt=" + (System.currentTimeMillis() - t));
    }
}

which measures its net run time and needs 0 msec for 1000 iterations and more than five minutes for 10 billion (10^{10}) iterations:

> java Modular 1000 1001 # 1'000
--> sum=1 prod=442 dt=0
> java Modular 10000 1001 # 10'000
--> sum=55 prod=520 dt=1
> java Modular 100000 1001 # 100'000
--> sum=45 prod=299 dt=7
> java Modular 1000000 1001 # 1'000'000
--> sum=0 prod=806 dt=36
> java Modular 10000000 1001 # 10'000'000
--> sum=45 prod=299 dt=344
> java Modular 100000000 1001 # 100'000'000
--> sum=946 prod=949 dt=3314
> java Modular 1000000000 1001 # 1'000'000'000
--> sum=1 prod=442 dt=34439
> java Modular 10000000000 1001 # 10'000'000'000
--> sum=55 prod=520 dt=332346

As soon as we do I/O, network access, database access or simply a bit more serious calculation, this becomes of course easily unbearably slow. But today it is cool to deal with big data and to at least call what we are doing big data, even though conventional processing on a laptop can do it in a few seconds or minutes... And there are of course ways to process way more iterations than this, but it becomes worth thinking about the system architecture, the hardware, parallel processing and of course algorithms and software stacks. But here we are in the "normal world", which can be a "normal subuniverse" of something really big, so running on one CPU and using a normal language like Perl, Java, Ruby, Scala, Clojure, F# or C.

Now sometimes we encounter situations where we want to nest loops, but the depth is unknown, something like

for (i_0 = 0; i_0 < n_0; i_0++) {
  for (i_1 = 0; i_1 < n_1; i_1++) {
    ...
      for (i_m = 0; i_m < n_m; i_m++) {
        doSomething(i_0, i_1, ..., i_m);
      }
    ...
  }
}

Now our friends from the functional world help us to understand what a loop is, because in some of these more functional languages the classical C-style loop is either missing or at least not recommended as the everyday tool. Instead we view the set of values we iterate over as a collection and iterate through every element of the collection. This could be a bad thing, because instantiating such big collections can be a show stopper, but we do not actually instantiate them. Out of the many features of collections we just pick the iterability, which can very well be accomplished by lazy collections. In Java we have the Iterable, Iterator, Spliterator and Stream interfaces to express such potentially lazy collections that are just used for iterating.

So we could think of a library that provides us with support for ordinary loops, so we could write something like this:

Iterable<Integer> range = new LoopRangeExcludeUpper<>(0, n);
for (Integer i : range) {
    doSomething(i);
}

or even better, if we assume 0 as a lower limit is the default anyway:

Iterable<Integer> range = new LoopRangeExcludeUpper<>(n);
for (Integer i : range) {
    doSomething(i);
}

with the ugliness of boxing and unboxing in terms of runtime overhead, memory overhead, and additional complexity for development. In Scala, Ruby or Clojure the equivalent solution would be elegant and useful and the way to go...
I would assume that a library that does something like LoopRangeExcludeUpper in the code example should easily be available for Java, maybe even in the standard library, or in some common public maven repository…

Now the issue of loops with unknown nesting depth can easily be addressed by writing or downloading a class like NestedLoopRange, which might have a constructor of the form NestedLoopRange(int ... ni) or NestedLoopRange(List li) or something with collections that are more efficient with primitives, for example from Apache Commons. Consider using long instead of int, which will break some compatibility with Java-collections. This should not hurt too much here and it is a good thing to reconsider the 31-bit size field of Java collections as an obstacle for future development and to address how collections can grow larger than 2^{31}-1 elements, but that is just a side issue here. We broke this limit with the example iterating over 10'000'000'000 values for i already and it took only a few minutes. Of course it was just an abstract way of dealing with a lazy collection without the Java interfaces involved.

So, the code could just look like this:

Iterable<Tuple> range = new NestedLoopRange(n_0, n_1, ..., n_m);
for (Tuple t : range) {
    doSomething(t);
}

Btw, it is not too hard to write it in the classical way either:

        long[] n = new long[] { n_0, n_1, ..., n_m };
        int m1 = n.length;
        int m  = m1-1; // just to have the math-m matched...
        long[] t = new long[m1];
        for (int j = 0; j < m1; j++) {
            t[j] = 0L;
        }
        boolean done = false;
        for (int j = 0; j < m1; j++) {
            if (n[j] <= 0) {
                done = true;
                break;
            }
        }
        while (! done) {
            doSomething(t);
            // advance t like an odometer
            done = true;
            for (int j = 0; j < m1; j++) {
                t[j]++;
                if (t[j] < n[j]) {
                    done = false;
                    break;
                }
                t[j] = 0;
            }
        }

I have written this kind of loop several times in my life in different languages. The first time was in C64 Basic when I was still at school, and the last one was written in Java and shaped into a library, where appropriate collection interfaces were implemented. It remained in the project or the organization where it had been done, but it could easily be written again, maybe in Scala, Clojure or Ruby, if it is not already there. It might even be interesting to explore how to write it in C in a way that can be used as easily as such a library in Java or Scala. If there is interest, please let me know in the comments section, I might come back to this issue in the future…
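To give an idea, here is a minimal sketch of such a NestedLoopRange in Java (not the library mentioned above, just written down from scratch and without the primitive-collection optimizations discussed earlier):

import java.util.Iterator;
import java.util.NoSuchElementException;

public class NestedLoopRange implements Iterable<long[]> {
    private final long[] n;

    public NestedLoopRange(long... n) {
        this.n = n.clone();
    }

    @Override
    public Iterator<long[]> iterator() {
        return new Iterator<long[]>() {
            private final long[] t = new long[n.length];
            private boolean done = anyEmpty();

            private boolean anyEmpty() {
                for (long bound : n) {
                    if (bound <= 0) {
                        return true;   // one empty dimension makes the whole range empty
                    }
                }
                return false;
            }

            @Override
            public boolean hasNext() {
                return !done;
            }

            @Override
            public long[] next() {
                if (done) {
                    throw new NoSuchElementException();
                }
                long[] result = t.clone();
                // advance like an odometer; the last index is the innermost loop
                done = true;
                for (int j = n.length - 1; j >= 0; j--) {
                    t[j]++;
                    if (t[j] < n[j]) {
                        done = false;
                        break;
                    }
                    t[j] = 0;
                }
                return result;
            }
        };
    }
}

It can then be used in the spirit of the loops above:

for (long[] t : new NestedLoopRange(n_0, n_1, ..., n_m)) {
    doSomething(t);
}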

In C it is actually quite possible to write a generic solution. I think an API like this might work:

struct nested_iteration {
  /* implementation detail */
};

void init_nested_iteration(struct nested_iteration *ni, size_t m1, long *n);
void dispose_nested_iteration(struct nested_iteration *ni);
int nested_iteration_done(struct nested_iteration *ni); // returns 0=false or 1=true
void nested_iteration_next(struct nested_iteration *ni);

and it would be called like this:

struct nested_iteration ni;
long n[] = { n_0, n_1, ..., n_m };
for (init_nested_iteration(&ni, m+1, n);
     ! nested_iteration_done(&ni);
     nested_iteration_next(&ni)) {
    /* ... */
}

So I guess it is doable and reasonably easy to program and to use, but of course not quite as elegant as in Java 8, Clojure or Scala.
I would like to leave this as a rough idea and maybe come back with concrete examples and implementations in the future.

