Logging

Deutsch

Software often contains a logging functionality. Usually entries one or sometimes multiple lines are appended to a file, written to syslog or to stdout, from where they are redirected into a file. They are telling us something about what the software is doing. Usually we can ignore all of it, but as soon as something with „ERROR“ or worse and more visible stack traces can be found, we should investigate this. Unfortunately software is often not so good, which can be due to libraries, frameworks or our own code. Then stack traces and errors are so common that it is hard to look into or to find the ones that are really worth looking into. Or there is simply no complete process in place to watch the log files. Sometimes the error shows up much later than it actually occurred and stack traces do not really lead us to the right spot. More often than we think logging actually introduces runtime errors, that were otherwise not present. This is related to a more general concept, which is called observer effect, where logging actually changes the business logic.

It is nice that log files keep to some format. Usually they start with a time stamp in ISO-format, often to the millisecond. Please add trailing zeros to always have 3 digits after the decimal point in this case. It is preferable to use UTC, but people tend to stick to local date and time zones, including the issues that come with switching to and from daylight saving time. Usually we have several processes or threads that run simultaneously. This can result in a wild mix of logging entries. As long as even multiline entries stay together and as long as beginning and end of one multiline entry can easily be recognized, this can be dealt with. Tools like splunk or simple Perl, Ruby or Python scripts can help us to follow threads separately. We could actually have separate logs for each thread in the first place, but this is not a common practice and it might hit OS-limitations on the number of open files, if we have many threads or even thousands of actors as in Erlang or Akka. Keeping log entries together can be achieved by using an atomic write, like the write system call in Linux and other Posix systems. Another way is to queue the log entries and to have a logger thread that processes the queue.

Overall this area has become very complex and hard to tame. In the Java world there used to be log4j with a configuration file that was a simple properties file, at least in the earlier version. This was so good that other languages copied it and created some log4X. Later the config file was replaced by XML and more logging frame works were added. Of course quite a lot of them just for the purpose of abstracting from the large zoo of logging frameworks and providing a unique interface for all of them. So the result was, that there was one more to deal with.

It is a good question, how much logic for handling of log files do we really want to see in our software. Does the software have to know, into which file it should log or how to do log rotation? If a configuration determines this, but the configuration is compiled into the jar file, it does have to know… We can keep our code a bit cleaner by relying on program functionality without code, but this still keeps it as part of the software.

Log files have to please the system administrator or whoever replaced them in a pure devops shop. And in the end developers will have to be able to work with the information provided by the logs to find issues in the code or to explain what is happening, if the system administrator cannot resolve an issue by himself. Should this system administrator have to deal with a different special complex setup for the logging for each software he is running? Or should it be necessary to call for developer support to get a new version of the software with just another log setting, because the configurations are hard coded in the deployment artifacts? Interesting is also, what happens when we use PAAS, where we have application server, database etc., but the software can easily move to another server, which might result in losing the logs. Moving logs to another server or logging across the network is expensive, maybe more expensive than the rest of this infrastructure.

Is it maybe a good idea to just log to stdout, maintaining a decent format and to run the software in such a way that stdout is piped into a log manager? This can be the same for all software and there is one way to configure it. The same means not only the same for all the java programs, but actually the same for all programs in all languages that comply to a minimal standard. This could be achieved using named pipes in conjunction with any hard coded log file that the software wants to use. But this is a dangerous path unless we really know what the software is doing with its log files. Just think of what weird errors might happen if the software tries to apply log rotation to the named pipe by renaming, deleting, creating new files and so on. A common trick to stop software from logging into a place where we do not want this is to create a directory with the name of the file that the software usually uses and to write protect this directory and its parent directory for the software. Please find out how to do it in detail, depending on your environment.

What about software, that is a filter by itself, so its main functionality is to actually write useful data to stdout? Usually smaller programs and scripts work like this. Often they do not need to log and often they are well tested relyable parts of our software installation. Where are the log files of cp, ls, rm, mv, grep, sort, cat, less,…? Yes, they do tend to write to stderr, if real errors occur. Where needed, programs can turn on logging with a log file provided on the command line, which is also a quite operations friendly approach. Named pipes can help here.

And we had a good logging framework in place for many years. It was called syslog and it is still around, at least on Linux.

A last thought: We spend really a lot of effort to get well performing software, using multiple processes, threads or even clusters. And then we forget about the fact that logging might become the bottle neck.

Share Button