Program Functionality without Code

Depending on the programming language and the frameworks we use, it is possible to have program functionality that is not happening in actual code that we write in that language. It seems weird, but actually it is something that we have been doing for decades and it has been sold to us as being extremely powerful and useful and sometimes it actually is. Aspect oriented programming is mostly based on this idea…

Typical examples are things we want to be taken care of but we do not want to actually write them ourselves..

Some of these look really great and who wants to deal with memory management today? Unless we do real time programming or special security code where information may not exist in the memory for longer than the actual processing, this is just what we successfully and without too much pain are doing all the time.

While some of these look really great, and have become more or less om there is also some danger in having some very powerful implicit functionality, like transaction management. While it looks tempting to delegate transaction management to a framework, because it is annoying and it is not really understood very well by most application developers, there comes some danger with it. This is even worse if it is used in conjunction with something like JPA or Hibernate… Assuming we have a framework that wraps methods marked with an annotation like „@Transactional“, meaning that this method call should be wrapped into a transaction (java-like pseudo-code):

@Transactional
public X myMethod(Y y)  {
   X result = do_something(y);
   return result;
}

being roughly equivalent to

public  X myMethod(Y y) {
  TransactionContext ctx = getTransactionContext();
  try {
      ctx.beginTransaction();
      X result = do_something(y);
      ctx.commit();
      return result;
   } catch (Exception ex) {
      ctx.rollback();
      throw ex;
   }
}

Yes, it is more elegant to just annotate it.
But now we program something like this:

@Transactional
public Function myMethod(Y y) {
      ....
}

where we actually enclose something into the function and give it back. Now when calling the function, we might get an error, because it encloses stuff from the time, when the transaction was actually still open, while it has been committed by the time, the function is actually called. So in frameworks that force the usage of such annotated transaction handling, such beautiful functional style programming patterns may actually not work and need to be avoided or at least constrained to the cases that do still work. This can be a reasonable price to pay, but it is important, to understand the constraints, that come with this implicit functionality.

Another interesting area that comes with a lot of potential functionality is correlated with authorization. Assuming we have a company that sells some services or products and we have key account managers that use the software we have written. Now for whatever reasons, they should only be able to see the data about their own customers, possibly data for customers for whose key account manager they are the deputy. Or if they are the boss of some key account managers, maybe they can see all of their data…

Now a function

List listCustomers() {
...
}

gives different results, depending on who is using it. This introduces an implicit invisible parameter. And however smart the user of this software is, he only sees what he is supposed to see, unless the software has some vulnerabilities, which it probably has.

So whenever we read such code that we have not written ourselves and have not written yesterday, there may be surprises about what it does. It is an interesting question how to test this with a good coverage of all constellations for implicit parameters. Anyway, we have to get used to it and embrace it, it is an integral part of our software ecosystem. But it is also important to use these powerful mechanisms only where they are really so helpful that it is worth the loss in clarity and explicitness.

While annotations are at least in place to be found, there are also other ways. Typically xml files can be used to configure such stuff. Or it can be done programmatically in a totally different place of the software by setting up some hooks, for example. Without good documentation or good information flow within the team, this may be hard to find.

Share Button

Intervals

Intervals are subsets of a universe, that are defined by upper and lower boundaries. Typically we think about real numbers, but any totally ordered universe allows the definition of intervals.

Intervals are defined by lower and upper boundaries, which can be a limiting number or unlimited, typically written as \infty for the upper bound and -\infty for the lower bound. The boundaries can be included or excluded. So the following combinations exist for a universe X:

(-\infty, \infty)=X
unlimited
(-\infty, a] = \{ x \in X : x \le a\}
half open, lower unlimited
(-\infty, a) = \{ x \in X : x < a\}
open, lower unlimited
[b, \infty) = \{ x \in X : x \ge b\}
half open, upper unlimited
(b, \infty) = \{ x \in X : x > b\}
open, upper unlimited
(c, d) = \{ x \in X : c < x < d\}
open
[c, d) = \{ x \in X : c \le x < d\}
half open
(c, d] = \{ x \in X : c < x \le d\}
half open
[c, d] = \{ x \in X : c \le x \le d\}
closed
\emptyset
it is sometimes useful to consider the empty set as an interval as well

The words „open“ and „closed“ refer to our usual topology of real numbers, but they do not necessarily retain their topological meaning when we extend the concept to our typical data types. a, b, c and d in the notation above do not have to be members of X, as long as the comparison is defined between them and all members of X. So we could for example meaningfully define for X=\mathbb{Q} the interval (\sqrt{2}, \pi) = \{ x\in\mathbb{Q} : \sqrt{2} < x < \pi\}.

As soon as we do not imply X=\mathbb{R} we always have to make this clear… And \mathbb{R} is kind of hard to really work with in software on computers with physically limited memory and CPU power.

Intervals have some relevance in software systems.

We sometimes have a business logic that actually relies on them and instead programming somehow around it, it is clearer and cleaner to actually work with intervals. For example, we can have a public transport scheduling system and we deal with certain time intervals in which different schedules apply than during the rest of the day. Or we have a system that records downtimes of servers and services and these are quite naturally expressed as intervals of some date-time datatype. It is usually healthy to consider all the cases mentioned above rather than ignoring the question if the boundary with probability zero of actually happening or having ugly interval limits like 22:59:59.999.

The other case is interval arithmetic. This means, we do floating point calculations by taking into account that we have an inaccuracy. So instead of numbers we have intervals I. When we add two intervals, we get I+J=\{x+y : x\in I \wedge y\in J\}. In the same way we can multiply and subtract and even divide, as long as we can stay clear of zero in the denominator. Or more generally we can define f(I_1, I_2, \ldots, I_n)=\{f(x_1,x_2,\ldots,x_n) : x_1\in I_1 \wedge x_2 \in I_2 \wedge\ldots\wedge x_n\in I_n\}.
It does of course require some mathematical thinking to understand, if the result is an interval again or at least something we can deal with reasonably. Actually we are usually happy with replacing the result by an interval that is possibly a superset of the real result, ideally the minimal superset that can be expressed with our boundary type.

At this point we will probably discover a desire to expand the concept of intervals in a meaningful way to complex numbers. We can do this by working with open disks like \{ z \in \mathbb{C} :  |z-z_0| < d\} or closed disks like \{ z \in \mathbb{C} :  |z-z_0| \le d\}. Or with rectangles based on two intervals I and J like \{ z = x+iy \in \mathbb{C} : x \in I \wedge y \in J \}.

These two areas are quite interesting and sometimes useful. Libraries have been written for both of them.

Often we discover, that intervals alone are not quite enough. We would like to do set operations with intervals, that is

union
A\cup B
intersection
A \cap B
set difference
A \setminus B

While the intersection works just fine, as long as we include the empty set \emptyset as an interval, unions and differences lead us to non-intervals. It turns out that interval-unions, sets that can be expressed as a union of a finite number of intervals, turn out to be a useful generalization, that is actually what we want to work with rather than with intervals. In this case we can drop the empty set as interval and just express it as the union of zero intervals.

There are some questions coming up, that are interesting to deal with:

normalization
Can we normalize interval-unions to some canonical form that allows safe and relyable comparison for equality?
adjacency
is our universe X actually discrete, so we can express all unlimited boundaries with closed boundaries?
interval lengths
Do we have a meaningful and useful way to measure the length of an interval or the total length of an interval-union, as long as they are limited? Or even for unlimited intervals?
collection interfaces
Do we want to implement a Set-interface in languages that have sets and an understanding of sets that would fit for intervals
implementation
How can we implement this ourselves?
implementation
Can we find useful implementations?

Having written a java library to support interval-unions on arbitrary Comparable types once in a project and having heard a speech about an interval library in Scala that ended up in using interval-unions in a pretty equivalent way, it might be interesting to write in the future about how to do this or what can be found in different languages to support us. For interval arithmetic some work has been done to create extensions or libraries for C and Fortran, that support this, while I was a student. So this is pretty old stuff and interesting mostly for the concepts, even if we are not going to move to Fortran because of this.

If there is interest I will write more about actual implementations and issues to address when using or writing them.

Links

Share Button

LF and CR LF in git and svn

Typically we observe that line endings of text file turn out to be „linefeed“ (LF or „\n“ or 0x0a or Ctrl-J) on Linux and Unix-like systems (including MacOSX), while they are „carriage return and linefeed“ (CR LF or „\r\n“ or 0x0d 0x0a or Ctrl-M Ctrl-J) on MS-Windows. See the little obstacles of interoperability.

This can become annoying and there is little reason for this. Most tools for editing, compiling and working with these files just understand both variants. It does become annoying when diffs are created between different files and even more so when scripts are turning out to have the „CR LF“ ending and the script interpreter given in the first line is not found, because the system tries to find one that has the otherwise invisible „CR“ in its file name. It also becomes kind of messy, when multiple CR-characters are present. CR-characters are annoying even on MS-Windows itself as soon as we use cygwin. Since in most cases the target system is a Linux system anyway and we just waste space with the unnecessary CR-character, it is actually in most cases a good idea to agree on not having these CR-characters at all in certain kinds of files.

The easiest way is to just set up git or subversion to change files from CR LF to LF on commit. So the repository only contains „clean“ files, at least from that moment onward.

This is accomplished in subversion by applying the following command on the files that we want to be kept on „LF“ only:

svn pset svn:eol LF

and then committing.

Git has a way to achieve this using the .gitattributes configuration file. So the .gitattributes file can contain something like this:

* text=auto eol=lf

Remark

Of course I recommend to use git instead of svn for new projects and to consider migrating existing projects from svn to git. But for this special aspect svn provides a slightly more powerful and more intuitive tool than git.

Links

Share Button

Word Documents vs. Wiki

It was this typical clash of cultures when a real IT guy, scientist, mathematician or engineer joined a company and suddenly an office package had to be used. It could have been any of them, think LibreOffice or MS-Office, but there were some others, that have meanwhile lost relevance. We liked to use text files or HTML or LaTeX or even Literate Programming, part of which has conceptionally made it into our time as JavaDoc and similar concepts for most modern languages. The idea is to keep source code and documentation together and in sync, to work with a similar technology stack for software and documentation and to be able to use typical Linux/Unix-tools like grep, Perl, Python or Ruby on the document files to search, to create them, to transform them, to parse them etc. Do not use awk and sed for these purpose any more, they have been superseded by scripting languages.

So now all tools for automatizing things and for working efficiently or at least comfortably where gone, maybe the office even forced us to use such a non-developer-system as MS-Windows for developing software that should eventually run on Linux– or Unix-Servers. Working together on the same file was a nightmare, in spite of the existence of some software that claims to support this. And yes, I know that these office packages contain scripting languages, like VBA or LibreOffice Basic that might not be our favorite, but could eventually be powerful if we took the effort and learned them. But actually the people who learned this were mostly on the „business side“ and sometimes they were faster getting a functionality up and running than we professionals. Up and running as long as their C:-drive was up and running…

So now after the anger is gone or hopefully at least a bit reduced, it is good to actually take a look. The „classical“ office suite contained a word processor, a spread sheet and a presentation program, other components like a database have become optional and I see them rarely in use.

I think that there is no serious doubt about wanting to have a spread sheet. And a presentation program is mostly useful as well, even though there are serious alternatives to be seen like doing presentations in modern HTML. The arguments against the word processor of typical office suites have proven to somewhat survive through all the years. While the word processor is useful for exchanging documents with people who are heavily working with them and cannot easily be taught otherwise, it has pretty much disappeared from our daily work. Since around 2008 the main documentation tool in the overwhelming majority of projects that I have seen was a private Wiki, like MediaWiki or Confluence or Redmine, for example.

This allowed for easier collaboration, it was a web application that worked with any modern browser and any OS and it was a bit more natural to use and somewhat easier to read and write and search. The Wikis do have a text format, that can be processed in situations where we want to generate documentation from code or code from documentation. And it could be more easily controlled than Word-documents, that were mailed around and changed.

Document management systems or enterprise content management systems allow us to do a lot of stuff even with „classical“ office files. I know that the more advanced systems are very powerful. But I have rarely seen it in the context of software development and architecture projects. So I am not going to write too much about them, but leave this area to the experts of this field.

Generally I find it much more pleasant to work with Wikis than it was before with Office-Documents, so I would consider this a positive development.

Share Button

SQL: DROP, CREATE,… How to avoid errors?

In SQL we often want to create or drop an object (TABLE, VIEW, SEQUENCE, INDEX, SYNONYM, DATABASE, USER, SCHEMA,….)

Now we do not know if this thing exists or not. We would like to write scripts, that work in either case.

So something like

CREATE TABLE IF NOT EXISTS XYZ (...);

or

DROP TABLE IF EXISTS XYZ;
CREATE TABLE XYZ (...);

would be desirable.

This seems to be quite hard, depending on the database.

Some databases support the „IF EXISTS“ or even the slightly less useful „IF NOT EXISTS“ pattern for some of the CREATE and DROP statements:

COMMANDPostgreSQLMS-SQL-ServerOracleDB2MySQL/MariaDB
DROP DATABASE IF EXISTSxx--x
DROP FUNCTION IF EXISTS xx-xx
DROP INDEX IF EXISTSxx-xx
DROP MATERIALIZED VIEW IF EXISTSxMaterialized view not found in documentation-Materialized view not found in documentationno materialized views supported
DROP ROLE IF EXISTSxx-xx
DROP SCHEMA IF EXISTSxxno DROP SCHEMA, Oracle uses User to express the concept of a Schema? (no DROP SCHEMA found in documentation)Schema not supported.
DROP SEQUENCE IF EXISTSxx-xx
DROP TABLE IF EXISTSxx-xx
DROP TABLESPACE IF EXISTSxNot found in documentation-? (no DROP TABLESPACE found in documentation)TABLESPACE not supported
DROP TRIGGER IF EXISTSxx-xx
DROP USER IF EXISTSxx--x
DROP VIEW IF EXISTSxx-xx
CREATE OR REPLACE DATABASE----x
CREATE DATABASE IF NOT EXISTS---xx
CREATE FUNCTION IF NOT EXISTS---x-
CREATE OR REPLACE FUNCTIONxx (CREATE OR ALTER FUNCTION)x-x
CREATE OR REPLACE INDEX----x
CREATE INDEX IF NOT EXISTSx--xx
CREATE MATERIALIZED VIEW IF NOT EXISTSxMaterialized view not found in documentation-Materialized view not found in documentationno materialized views supported
CREATE OR REPLACE ROLE----x
CREATE ROLE IF NOT EXISTS---xx
CREATE SCHEMA IF NOT EXISTSx-Oracle ties the schema closely to a user. So slightly different meaning of CREATE SCHEMA...Oracle ties the schema closely to a user. So slightly different meaning of CREATE SCHEMA...Schema not supported.
CREATE OR REPLACE SEQUENCE----x
CREATE SEQUENCE IF NOT EXISTSx--xx
CREATE OR REPLACE TABLE----x
CREATE TABLE IF NOT EXISTSx--xx
CREATE TABLESPACE IF NOT EXISTS-Not found in documentation-TABLESPACE apparently not supportedTABLESPACE not supported
CREATE OR REPLACE TRIGGER-x (CREATE OR ALTER TRIGGER)x-x
CREATE TRIGGER IF NOT EXISTS---xx
CREATE OR REPLACE USER----x
CREATE USER IF NOT EXISTS----x
CREATE OR REPLACE VIEWxx (CREATE OR ALTER VIEW)x-x
CREATE VIEW IF NOT EXISTS---xx

Usually we can happily ignore the missing „IF (NOT) EXISTS“ conditions and just DROP the object and then create it. By ignoring the error message this works.

But increasingly we want to manage the database with tools that run scripts automatically and cannot distinguish as well as a DBA between „good error messages“ and „bad error messages“.

So we should find our ways around this.

The trick is to use the procedural extension language of SQL (like PL/SQL for Oracle, because in case of Oracle we need it most often). Then we write a block and find by a select in the data dictionary if the object we are creating or dropping exists. Then we put the select result into a variable and put an IF-condition around our CREATE or DROP statement. This approach can also be useful for dropping all objects of a certain type that fullfill some SQL-query-accessible condition.

DECLARE
   CNT INT;
BEGIN
   SELECT COUNT(*) INTO CNT 
     FROM USER_TABLES 
     WHERE TABLE_NAME = 'MY_TABLE';
   IF CNT >= 1 THEN
      EXECUTE IMMEDIATE 'DROP TABLE MY_TABLE';
   END IF;
END;
/

Or to catch the kind of exception that we want to ignore (example for Oracle):

BEGIN
   EXECUTE IMMEDIATE 'DROP TABLE MY_TABLE';
EXCEPTION
   WHEN OTHERS THEN
      IF SQLCODE != -942 THEN
         RAISE;
      END IF;
END;
/

It is a bit tricky to find the reference pages for the SQL dialects of the different DB products. I used these:

MariaDB is a fork of mySQL that was created, when mySQL came to Oracle. I recommend it as the successor of mySQL, if you are into mySQL-like databases. And I strongly recommend looking into PostgreSQL, if you intend to use a free/opensource database, because it is the same price and just has richer features. Generally all five databases are good and have their areas of strength and their weaknesses. I won’t go deeper into this in this article…

Share Button

Serialization

Deutsch

Serialization allows us to store objects in a lossless way or to transfer them over the network. This could be done before, but it was necessary to program the serialization mechanism, which was a lot of work. Of course, they were not yet called objects in those days…

Java suddenly had such a serialization for (almost) all object without any additional programming effort. This does not mean that this automatic serialization did not exist before, but it was made popular with Java, because frameworks started to heavily rely on it. To use it ourselves, we just had to use ObjectOutputStream and ObjectInputStream and Objects could be stored and written across the network or just be cloned. It was even able to handle circular references, which most serialization mechanisms cannot do. The idea was not really new, as other languages had something like this already before, but nobody became aware of it.

But there are some drawbacks, that were discovered when it was already too late and that should at least be mentioned.

  • Marking with a Serializiable-Interface is conceptionally quite a bad solution, because it assumes that serialization never gets lost by deriving classes, which is just not true. An Unserializable interface would have been a much better solution, if not almost ideal solution for this, because trivial objects are always serializable and they loose this when something non-serializable is added. Then again, how about collections… Today possibly some annotation could also be helpful.
  • This serialVersionUID creates a lot of pain. Should we change it, whenever the interface changes? We talk about the implicit serialization interface, not about an explicit interface that we can easily see. Should we trust automatic mechanisms? In any case issues with incompatible versions remain that are not really solved well and cannot even easily be solved well.
  • Serialization introduces an additional invisible constructor.
  • Serialization undermines the idea of private and protected, because suddenly private and protected member attributes become part of the interface
  • Funny effects happen with serializable non-static inner classes, because there serialized version bakes in the containing outer object. Yes it has to…
  • The object indentity gets lost, when an object is serialized. It is easy to create several copies of the same object.
  • Sometimes it is still necessary to manually write serialization code, for example for singletons. It is easy to forget this, because everything seems to work just fine automatically.
  • Java’s Serialization is quite slow.
  • The format is binary and cannot easily be read. A pluggable serialization format that could allow more human readable data files like JSON, XML,… would have been better…
  • Serialization creates a temptation to use this format for communication, which again forces a tight coupling that might not be necessary otherweise.
  • Serialization creates a temptation to use it as a storage format instead of mature database technologies. Very bad: the second level cache of hibernate…

There were some advantages in having this serialization in the past and for some purposes it kind of works. But it is important to question this and to consider other, more solid approaches, even if they require slightly more work. Generally it is today considered one of the larges fallacies of Java to introduce this serialization mechanism in this way. There are now better ways to do serialization, that require a bit more work, but avoid some of the terrible short comings of the native Java-Serialization.

For the serialVersionUID there are several approaches that can work. A statical method, that extracts from an „$Id$“-string that is managed by svn, can be a way. It will avoid compatibility between even slightly different versions, which is probably the best we can get. With git it is a bit harder, but it can be done as well.

Usually it is the best choice by far to leave serialVersionUID empty and rely on Java’s automatic mechanisms. They are not perfect, but better than 99% of the manually badly maintained serialVersionUIDs. If you want to manage your serialVersionUIDs yourself, there needs to be a checklist on what to do to release a new version of a file, a library or a whole software system. This is usually sick, because it creates a lot of work, even more errors and should really be done only with good reasons, good discipline and a very good concept. If you like anyway to use serialVersionUID or if you are forced to do so by the project, here is a script to create them randomly:

#!/usr/bin/perl
use bigint;
use Math::Random::Secure qw(irand);
my $r = (irand() << 32) + irand();
printf "%20d\n", $r;

This is still better than using the IDE-generated value and keeping it forever or starting with a 1 or 0 and keeping it forever, because updating this serialVersionUID is not really on our agenda. And it shouldn't be.

Share Button

Why am I learning Python

To be honest, what can be done with Python can also be done with Perl or Ruby. I am not working in areas, where there is much better library support for Python than for Perl or Ruby. And I like Perl and Ruby very much and I am somewhat skeptical about Python. But there are some points that make it worth knowing Python in addition to Perl and Ruby, not instead of them.

I strongly recommend using real programming languages like Perl, Ruby, Python and you can add some more instead of Bash scripts, where a certain complexity is exceeded. Try reading the pure bash scripts that are used to start Maven, Tomcat or other useful software. Often there is a CMD-script as well, that is the real pure horror. Python serves this purpose well enough, the other two of course as well.

It is always good to learn new languages once in a while, because they extend our horizon and help us even to be better with our more preferred languages. And why not challenge the preferences…

There is a good point in allowing for a tool box of languages, not „only Java“ or „only C#“ or „only C“ or even „only Perl“, whatever you like… Combining a useful toolbox of several languages is the right way to go. This would be the case with a toolbox containing A and B, where A ∈ { C, C++, Java, C#, F#, Scala, Clojure, …} and B ∈ { Perl, Perl6, Ruby, Lua, Python,…}. Usually it is a good idea to make it slightly larger, but it is also good to find a consensus on which set of languages to concentrate. I would for example discourage using sed and awk, because they can quite easily be replaced by Perl and limit bash to very trivial scripts. There are some cases in which the awk or sed scripting is a bit shorter than it is with doing the same in Perl or Ruby, but this does not justify maintaining the extra knowledge, while Perl on top of Java does justify this a lot. So the toolbox should be big enough to cover everything, but it does not have to be too redundant and there can be preferences what tool is recommended to use for a certain class of purposes, if this recommendation is reasonable. This makes it easier to maintain each others code. Now there are many projects, where the spot of B is taken by Python. So in order to be a good team player it might be useful to be able to work with the python scripts, write in this language and contribute instead of spending too much time talking about why Perl or Ruby or Lua or whatever is better. Which it might be. Or which might be more a matter of taste. Here is what big sites are using as A, B, C,….

Now out of these scripting languages, Python is for sure a successful contender. This results in good libraries, but also in higher likelyhood of Python occupying the spot B.

Now we have tools like Jenkins, Kubernetes, Docker, Cloud computing, Spark and simply certain Linux distributions, which might come along with their preferred set of scripting languages that are well supported for performing certain tasks. This can be delegated to one or two guys in the team or kept to a minimum, but this might become a factor of increasing importance. It might force us to have multiple „Bs“ or multiple „As“.

And there are certain areas, where Python is simply strong and has become the language of choice. It seems to have become the successor of Fortran for many if not most numerical calculation areas, even though there will probably always be a niche for powerful compiled languages like Fortran and C for the ultimative performance. But so the library is written in C with Python bindings and we get most of the performance as well. Also Data Science seems to mostly opt for Python as the general purpose language besides R and SQL and SAS. Even Bioinformatics, which was a stronghold of Perl for many years is now preferring Python… Yes, it does hurt someone who likes Perl, but it is true… So to be able to work in many interesting areas, it is useful to know some Python. So I started learning it. I am using a Russian translation of the book Programming Python.

I might write a bit more about the language, once I have some more experience with it.

Share Button

Meaningless Whitespace in Textfiles

We use different file formats that are more or less tolerant to certain changes. Most well known is white space in text files.

In some programming languages white space (space, newline, carriage return, form feed, tabulator, vertical tab) has no meaning, as long as any whitespace is present. Examples for this are Java, Perl, Lisp or C. Whitespace, that is somehow part of String content is always significant, but white space that is used within the program can be combination of one or more of the white space characters that are in the lower 128 positions (ISO-646, often referred to as ASCII or 7bit ASCII. It is of course recommended to have a certain coding standard, which gives some guidelines of when to use newlines, if tabs or spaces are preferred (please spaces) and how to indent. But this is just about human readability and the compiler does not really care. Line numbers are a bit meaningful in compiler and runtime error messages and stack traces, so putting everything into one line would harm beyond readability, but there is a wide range of ways that are all correct and equivalent. Btw. many teams limit lines to 80 characters, which was a valid choice 30 years ago, when some terminals were only 80 characters wide and 132 character wide terminals where just coming up. But as a hard limit it is a joke today, because not many of us would be able to work with a vt100 terminal efficiently anyway. Very long lines might be harder to read, so anything around 120 or 160 might still be a reasonable idea about line lengths…

Languages like Ruby and Scala put slightly more meaning into white space, because in most cases a semicolon can be skipped if it is followed by a newline and not just horizontal white space. And Perl (Perl 5) is for sure so hard to compile that only its own implementation can properly format or even recognize which white space is part of a literal string. Special cases like having the language in a string and parsing and then executing that should be ignored here.

Now we put this program files into a source code management system, usually Git. Some teams still use legacy systems like subversion, source safe, clear case or CVS, while there are some newer systems that are probably about as powerful as git, but I never saw them in use. Git creates an MD5 hash of each file, which implies that any minor change will result in a new version, even if it is just white space. Now this does not hurt too much, if we agree on the same formatting and on the same line ending (hopefully LF only, not CR LF, even on MS-Windows). But our tooling does not make any difference between significant changes and insignificant formatting only changes. This gets worse, if users have different IDEs, which they should have, because everyone should use the IDE or editor, with which he or she is most efficient and the formal description of the preferred formatting is not shared between editors or differs slightly.

I think that each programming language should come with a command line diff tool and a command line formatting tool, that obey a standard interface for calling and can be plugged into editors and into source code management systems like git. Then the same mechanisms work for C, Java, C#, Ruby, Python, Fortran, Clojure, Perl, F#, Scala, Lua or your favorite programming language.

I can imaging two ways of working: Either we have a standard format and possibly individual formats for each developer. During „git commit“ the file is brought into the standard format before it is shown to git. Meaning less whitespace changes disappear. During checkout the file can optionally be brought into the preferred format of the developer. And yes, there are ways to deal with deliberate formatting, that for some reason should be kept verbatim and for dealing differently with comments and of course all kinds of string literals. Remember, the formatting tool comes from the same source as the compiler and fully understands the language.

The other approach leaves the formatting up to the developer and only creates a new version, when the diff tool of the language signifies that there is a relevant change.

I think that we should strive for this approach. It is no rocket science, the kind of tools were around for many decades as diff and as formatting tools, it would just be necessary to go the extra mile and create sister diff and formatting tools for the compiler (or interpreter) and to actually integrate these into build environments, IDEs, editors and git. It would save a lot of time and leave more time for solving real problems.

Is there any programming language that actually does this already?

How to handle XML? Is XML just the new binary with a bit more bloat? Can we do a generic handling of all XML or should it depend on the Schema?

Share Button

Loops with unknown nesting depth

We often encounter nested loops, like

for (i = 0; i < n; i++) {
    for (j = 0; j < m; j++) {
        doSomething(i, j);
    }
}

This can be nested to a few more levels without too much pain, as long as we observe that the number of iterations for each level need to be multiplied to get the number of iterations for the whole thing and that total numbers of iterations beyond a few billions (10^9, German: Milliarden, Russian Миллиарди) become unreasonable no matter how fast the doSomethings(...) is. Just looking at this example program

public class Modular {
    public static void main(String[] args) {
        long n = Long.parseLong(args[0]);
        long t = System.currentTimeMillis();
        long m = Long.parseLong(args[1]);
        System.out.println("n=" + n + " t=" + t + " m=" + m);
        long prod = 1;
        long sum  = 0;
        for (long i = 0; i < n; i++) {
            long j = i % m;
            sum += j;
            sum %= m;
            prod *= (j*j+1) % m;
            prod %= m;
        }
        System.out.println("sum=" + sum + " prod=" + prod + " dt=" + (System.currentTimeMillis() - t));
    }
}

which measures it net run time and runs 0 msec for 1000 iterations and almost three minutes for 10 billions (10^{10}):

> java Modular 1000 1001 # 1'000
--> sum=1 prod=442 dt=0
> java Modular 10000 1001 # 10'000
--> sum=55 prod=520 dt=1
> java Modular 100000 1001 # 100'000
--> sum=45 prod=299 dt=7
> java Modular 1000000 1001 # 1'000'000
--> sum=0 prod=806 dt=36
> java Modular 10000000 1001 # 10'000'000
--> sum=45 prod=299 dt=344
> java Modular 100000000 1001 # 100'000'000
--> sum=946 prod=949 dt=3314
> java Modular 1000000000 1001 # 1'000'000'000
--> sum=1 prod=442 dt=34439
> java Modular 10000000000 1001 # 10'000'000'000
--> sum=55 prod=520 dt=332346

As soon as we do I/O, network access, database access or simply a bit more serious calculation, this becomes of course easily unbearably slow. But today it is cool to deal with big data and to at least call what we are doing big data, even though conventional processing on a laptop can do it in a few seconds or minutes... And there are of course ways to process way more iterations than this, but it becomes worth thinking about the system architecture, the hardware, parallel processing and of course algorithms and software stacks. But here we are in the "normal world", which can be a "normal subuniverse" of something really big, so running on one CPU and using a normal language like Perl, Java, Ruby, Scala, Clojure, F# or C.

Now sometimes we encounter situations where we want to nest loops, but the depth is unknown, something like

for (i_0 = 0; i_0 < n_0; i_0++) {
  for (i_1 = 0; i_1 < n_1; i_1++) {
    \cdots
      for (i_m = 0; i_m < n_m; i_m++) {
        dosomething(i_0, i_1,\ldots, i_m);
      }
    \cdots
  }
}

Now our friends from the functional world help us to understand what a loop is, because in some of these more functional languages the classical C-Style loop is either missing or at least not recommended as the everyday tool. Instead we view the set of values we iterate about as a collection and iterate through every element of the collection. This can be a bad thing, because instantiating such big collections can be a show stopper, but we don't. Out of the many features of collections we just pick the iterability, which can very well be accomplished by lazy collections. In Java we have the Iterable, Iterator, Spliterator and the Stream interfaces to express such potentially lazy collections that are just used for iterating.

So we could think of a library that provides us with support for ordinary loops, so we could write something like this:

Iterable range = new LoopRangeExcludeUpper<>(0, n);
for (Integer i : range) {
    doSomething(i);
}

or even better, if we assume 0 as a lower limit is the default anyway:

Iterable range = new LoopRangeExcludeUpper<>(n);
for (Integer i : range) {
    doSomething(i);
}

with the ugliness of boxing and unboxing in terms of runtime overhead, memory overhead, and additional complexity for development. In Scala, Ruby or Clojure the equivalent solution would be elegant and useful and the way to go...
I would assume, that a library who does something like LoopRangeExcludeUpper in the code example should easily be available for Java, maybe even in the standard library, or in some common public maven repository...

Now the issue of loops with unknown nesting depth can easily be addressed by writing or downloading a class like NestedLoopRange, which might have a constructor of the form NestedLoopRange(int ... ni) or NestedLoopRange(List li) or something with collections that are more efficient with primitives, for example from Apache Commons. Consider using long instead of int, which will break some compatibility with Java-collections. This should not hurt too much here and it is a good thing to reconsider the 31-bit size field of Java collections as an obstacle for future development and to address how collections can grow larger than 2^{31}-1 elements, but that is just a side issue here. We broke this limit with the example iterating over 10'000'000'000 values for i already and it took only a few minutes. Of course it was just an abstract way of dealing with a lazy collection without the Java interfaces involved.

So, the code could just look like this:

Iterable range = new NestedLoopRange(n_0, n_1, \ldots, n_m);
for (Tuple t : range) {
    doSomething(t);
}

Btw, it is not too hard to write it in the classical way either:

        long[] n = new long[] { n_0, n_1, \ldots, n_m };
        int m1 = n.length;
        int m  = m1-1; // just to have the math-m matched...
        long[] t = new long[m1];
        for (int j = 0; j < m1; j++) {
            t[j] = 0L;
        }
        boolean done = false;
        for (int j = 0; j < m1; j++) {
            if (n[j] <= 0) {
                done = true;
                break;
            }
        }
        while (! done) {
            doSomething(t);
            done = true;
            for (int j = 0; j < m1; j++) {
                t[j]++;
                if (t[j] < n[j]) {
                    done = false;
                    break;
                }
                t[j] = 0;
            }
        }

I have written this kind of loop several times in my life in different languages. The first time was on C64-basic when I was still in school and the last one was written in Java and shaped into a library, where appropriate collection interfaces were implemented, which remained in the project or the organization, where it had been done, but it could easily be written again, maybe in Scala, Clojure or Ruby, if it is not already there. It might even be interesting to explore, how to write it in C in a way that can be used as easily as such a library in Java or Scala. If there is interest, please let me know in the comments section, I might come back to this issue in the future...

In C it is actually quite possible to write a generic solution. I see an API like this might work:

struct nested_iteration {
  /* implementation detail */
};

void init_nested_iteration(struct nested_iteration ni, size_t m1, long *n);
void dispose_nested_iteration(struct nested_iteration ni);
int nested_iteration_done(struct nested_iteration ni); // returns 0=false or 1=true
void nested_iteration_next(struct nested_iteration ni);

and it would be called like this:

struct nested_iteration ni;
int n[] = { n_0, n_1, \ldots, n_m };
for (init_nested_iteration(ni, m+1, n); 
     ! nested_iteration_done(ni); 
     nested_iteration_next(ni)) {
...
}

So I guess, it is doable and reasonably easy to program and to use, but of course not quite as elegant as in Java 8, Clojure or Scala.
I would like to leave this as a rough idea and maybe come back with concrete examples and implementations in the future.

Links

Share Button

Not all projects are on ideal paths

It is nice to write about positive things, how things have been done well, how they should be done well and how good we are and how good we could be if we were just applying the right technology and methodology and process and management…

Tom Rocket

Let’s leave that for today and write about some projects that did not go too well. I picked an example that is more than ten years ago and I am not going to mention any names. Maybe someone who knew this projects as well will recognize something, maybe not. There would be more than one article to write about something like this, and I guess almost anybody who has been around in IT for a couple of years would agree. If not, enjoy being so lucky. 🙂 I change all the names, of course. Otherwise, I hope it is enjoyable to read about others who did funny stuff and to find some anti-patterns between the lines.

So project dinosaur was taking place in the same rooms where I was working. I had some low volume consulting task for this team, but was mostly absorbed by totally different projects.

Tom Rocket is the project manager. There are some more project managers who represent the customers side, but Tom Rocket is running the show and is present in the room. It is easy to recognize that he is the boss, because he calls his team members like the way somebody treats his doggy, who happens for some unintelligible reasons to have a dog, even though he does not particularly like dogs. The team is very engaged, they work hard. Not every day very motivated, but who can understand that. Of course some parts of the development is done by another team of the same company, at another location. The managers of both locations are not really best friends. Donald Peak, the manager of this location, does not care too much, as long as he can sit in his office, smoke cigars, enjoy his newspaper, do some strategy work and manage his car collection. Having been a communist in his younger days, he still knows very well how to be a capitalist. He has his people to deal with the operational issues. They are the bosses of Tom Rocket. And they do not like the other office, where they seem to do interesting and good work. Anyway, to put the pieces together, a complicated server and firewall infrastructure is needed. The other team can dump its code in some neutral zone (DMZ) from where it is fetched and integrated by Tom’s team. And the really cool stuff is done by Tom’s team.

It was a bit before web applications where the thing, at least in a time when people like Tom Rocket did not know and did not care about web applications. The client was developed in MS-Access-Basic. This resulted in interesting opportunities concerning the data. It was possible to store some part of the data in the client database and some part of the data in the server database.

The project was really important. A big company in the country where this happened did an advertising campaign for their end users concerning the services provided by the product developed by Tom Rocket. It was seen a lot in cinemas, in news papers, on the streets. Everybody knew about it. It was coming, it was there already.

We know, there are some organizations that use IT and that actually perform backups. Tom Rocket and the company where he was working did not belong to this group. Backups were not yet introduced, but talking about the possibility of having backups had already begun.

Some IT development projects use source code management. We are talking about the times of source safe and RCS. CVS was already on the horizon, but not yet widely adopted. But Tom’s project did not employ source code management. Whenever a bug in production occurred, they had to get their current development version ready for production, fix the bug there and try to get it installed. Certain end of month operations were causing big pressure on the team each month, because they did not really work fully automatically. The data quality was kind of poor and it was hard for developers to find anybody responsible for this and Tom Rocket was not at all helpful. So they tried to find workarounds that allowed the software to kind of work with such poor data quality, but this was very hard and not really a success story. The introduction of source code management was on the radar, but since the whole team always worked overtime to get the most urgent things done, this never became a serious priority.

A new developer was added to the team. They tried to copy the sources to his home directory, but that did not work. So the solution was, that the senior developer had to give his login and password to the new colleague in order to let him participate in the development.

There was a happy end to all of this. Most if not all people in Tom’s team left the company. The project was stopped, even though it was a bit embarrassing after all the advertisements. And the mother company bought another company in the same country and somehow merged these too operations.

Lessons

We can look at this story from different angles. As a distant spectator it is kind of funny, maybe even as a spectator working on something else in the same room, if you have that kind of humor.

As a team member: It might be hard to enjoy working in such an environment and it would not get better by itself. So options are to see if something can be done about Tom becoming a better team leader or being replaced by a better team leader. Or as most of the guys did, just find a better job.

As a project leader in a project that is so deep in the mud? Treat the people working for it even worse, so they will be encouraged to work harder and solve the problems, that can’t be solved at their level? I guess, it is important to admit the situation and either get enough air to put the project on a good track or even risk that it is abandoned. In the end that happened anyway.

As a manager of the company or the office where this took place? I think beyond all cigars, car collections, stock options and strategy it is important to know what is going on in the company, how people are working, how people are feeling, how people are treating each other and how chances are to get to a success that will encourage customers to do another project in the same way. Just ignoring this or demanding „just a bit more“, because the project is in danger and it is no time for addressing any long term issues did not work out. The teams are the real value of an IT company, not the hardware, the furniture and the rooms. Burning the real values like coal, as the word „resources“ might suggest, might not be the recipe for long term success. I would mention in favor of Donald Peak, that I have good reasons to assume that he was a decent and probably good manager in earlier times, but the time of Tom’s project was probably not one the best years of his carreer.

As a customer? I think it is important to have an idea about the team, how they are working and what kind of people they are. In this case it was most likely the right decision to stop the whole thing.

As far as I know there was a happy end, at least for the team members and for the mother company.

Share Button