# Loops with unknown nesting depth

We often encounter nested loops, like

for (i = 0; i < n; i++) {
for (j = 0; j < m; j++) {
doSomething(i, j);
}
}


This can be nested to a few more levels without too much pain, as long as we observe that the number of iterations for each level need to be multiplied to get the number of iterations for the whole thing and that total numbers of iterations beyond a few billions (, German: Milliarden, Russian Миллиарди) become unreasonable no matter how fast the doSomethings(...) is. Just looking at this example program

public class Modular {
public static void main(String[] args) {
long n = Long.parseLong(args[0]);
long t = System.currentTimeMillis();
long m = Long.parseLong(args[1]);
long prod = 1;
long sum  = 0;
for (long i = 0; i < n; i++) {
sum += i;
sum %= m;
prod *= (i*i+1);
prod %= m;
}
System.out.println("--> sum=" + sum + " prod=" + prod + " dt=" + (System.currentTimeMillis() - t));
}
}


which measures it net run time and runs 0 msec for 1000 iterations and almost three minutes for 10 billions ():

> java Modular 10000 1001 # 10'000
--> sum=55 prod=520 dt=1
> java Modular 100000 1001 # 100'000
--> sum=45 prod=299 dt=4
> java Modular 1000000 1001 # 1'000'000
--> sum=0 prod=806 dt=20
> java Modular 10000000 1001 # 10'000'000
--> sum=45 prod=299 dt=161
> java Modular 100000000 1001 # 100'000'000
--> sum=946 prod=-364 dt=1570
> java Modular 1000000000 1001 # 1'000'000'000
--> sum=1 prod=0 dt=15635
> java Modular 10000000000 1001 # 10'000'000'000
--> sum=55 prod=0 dt=160365


As soon as we do I/O, network access, database access or simply a bit more serious calculation, this becomes of course easily unbearably slow. But today it is cool to deal with big data and to at least call what we are doing big data, even though conventional processing on a laptop can do it in a few seconds or minutes... And there are of course ways to process way more iterations than this, but it becomes worth thinking about the system architecture, the hardware, parallel processing and of course algorithms and software stacks. But here we are in the "normal world", which can be a "normal subuniverse" of something really big, so running on one CPU and using a normal language like Perl, Java, Ruby, Scala, Clojure, F# or C.

Now sometimes we encounter situations where we want to nest loops, but the depth is unknown, something like

for ( = 0;  < ; ++) {
for ( = 0;  < ; ++) {

for ( = 0;  < ; ++) \{\cr
dosomething();
}

}
}


Now our friends from the functional world help us to understand what a loop is, because in some of these more functional languages the classical C-Style loop is either missing or at least not recommended as the everyday tool. Instead we view the set of values we iterate about as a collection and iterate through every element of the collection. This can be a bad thing, because instantiating such big collections can be a show stopper, but we don't. Out of the many features of collections we just pick the iterability, which can very well be accomplished by lazy collections. In Java we have the Iterable, Iterator, Spliterator and the Stream interfaces to express such potentially lazy collections that are just used for iterating.

So we could think of a library that provides us with support for ordinary loops, so we could write something like this:

Iterable range = new LoopRangeExcludeUpper<>(0, n);
for (Integer i : range) {
doSomething(i);
}


or even better, if we assume 0 as a lower limit is the default anyway:

Iterable range = new LoopRangeExcludeUpper<>(n);
for (Integer i : range) {
doSomething(i);
}


with the ugliness of boxing and unboxing in terms of runtime overhead, memory overhead, and additional complexity for development. In Scala, Ruby or Clojure the equivalent solution would be elegant and useful and the way to go...
I would assume, that a library who does something like LoopRangeExcludeUpper in the code example should easily be available for Java, maybe even in the standard library, or in some common public maven repository...

Now the issue of loops with unknown nesting depth can easily be addressed by writing or downloading a class like NestedLoopRange, which might have a constructor of the form NestedLoopRange(int ... ni) or NestedLoopRange(List li) or something with collections that are more efficient with primitives, for example from Apache Commons. Consider using long instead of int, which will break some compatibility with Java-collections. This should not hurt too much here and it is a good thing to reconsider the 31-bit size field of Java collections as an obstacle for future development and to address how collections can grow larger than elements, but that is just a side issue here. We broke this limit with the example iterating over 10'000'000'000 values for i already and it took only a few minutes. Of course it was just an abstract way of dealing with a lazy collection without the Java interfaces involved.

So, the code could just look like this:

Iterable range = new NestedLoopRange();
for (Tuple t : range) {
doSomething(t);
}


Btw, it is not too hard to write it in the classical way either:

        long[] n = new long[] {  };
int m1 = n.length;
int m  = m1-1; // just to have the math- matched...
long[] t = new long[m1];
for (int j = 0; j < m1; j++) {
t[j] = 0L;
}
boolean done = false;
for (int j = 0; j < m1; j++) {
if (n[j] <= 0) {
done = true;
break;
}
}
while (! done) {
doSomething(t);
done = true;
for (int j = 0; j < m1; j++) {
t[j]++;
if (t[j] < n[j]) {
done = false;
break;
}
t[j] = 0;
}
}


I have written this kind of loop several times in my life in different languages. The first time was on C64-basic when I was still in school and the last one was written in Java and shaped into a library, where appropriate collection interfaces were implemented, which remained in the project or the organization, where it had been done, but it could easily be written again, maybe in Scala, Clojure or Ruby, if it is not already there. It might even be interesting to explore, how to write it in C in a way that can be used as easily as such a library in Java or Scala. If there is interest, please let me know in the comments section, I might come back to this issue in the future...

# Java Properties Files and UTF-8

Java uses a nice pragmatic file format for simple configuration tasks and for internationalization of applications. It is called Java properties file or simply „.properties file“. It contains simple key value pairs. For most configuration task this is useful and easy to read and edit. Nested configurations can be expressed by simple using dots („.“) as part of the key. This was introduced already in Java 1.0. For internationalization there is a simple way to create properties files with almost the same name, but a language code just before the .properties-suffix. The concept is called „resource bundle“. Whenever a language specific string is needed, the program just knows a unique key and performs a lookup.

The unpleasant part of this is that these files are in the style of the 1990es encoded in ISO-8859-1, which is only covering a few languages in western, central and northern Europe. For other languages as a workaround an \u followed by the 4 digit hex code can be used to express UTF-16 encoding, but this is not in any way readable or easy to edit. Usually we want to use UTF-8 or in some cases real UTF-16, without this \u-hack.

A way to deal with this is using the native2ascii-converter, that can convert UTF-8 or UTF-16 to the format of properties files. By using some .uproperties-files, which are UTF-8 and converting them to .properties-files using native2ascee as part of the build process this can be addressed. It is still a hack, but properly done it should not hurt too much, apart from the work it takes to get this working. I would strongly recommend to make sure the converted and unconverted files never get mixed up. This is extremely important, because this is not easily detected in case of UTF-8 with typical central European content, but it creates ugly errors that we are used to see like „sch�ner Zeichensalat“ instead of „schöner Zeichensalat“. But we only discover it, when the files are already quite messed up, because at least in German the umlaut characters are only a small fraction of the text, but still annoying if messed up. So I would recommend another suffix to make this clear.

The bad thing is that most JVM-languages have been kind of „lazy“ (which is a good thing, usually) and have used some of Java’s infrastructures for this, thus inherited the problem from Java.

Another way to deal with this is to use XML-files, which are actually by default in UTF-8 and which can be configured to be UTF-16. With some work on development or search of existing implementations there should be ways to do the internationalization this way.

Typically some process needs to be added, because translators are often non-IT-people who use some tool that displays the texts in the original languages and accepts the translation. For good translations, the translator should actually use the software to see the context, but this is another topic for the future. Possibly there needs to be some conversion from the data provided by the translator into XML, uproperties, .properties or whatever is used. These should be automated by scripts or even by the build process and merge new translations properly with existing ones.

Anyway, Java 9 Java 9 will be helpful in this issue. Finally Java-9-properties that are used as resource bundles for internationalization can be UTF-8.

# Collection Initializiation in Java

There is this so called „double brace“ pattern for initializing collection. We will see if it should be a pattern or an anti-pattern later on…

The idea is that we should consider the whole initializion of a collection one big operation. In other languages we write something like
[element1 element2 element3]
or
[element1, element2, element3]
for array-like collections and
{key1 val1, key2 val2, key3 val3}
or
{key1 => val1, key2 => val2, key3 => val3}.
Java could not do it so well until Java 9, but actually there was a way to construct sets and lists:
Arrays.asList(element1, element2, element3);
or
new HashSet<>(Arrays.asList(element1, element2, element3));.
Do not ask about immutability (or unmodifyability), which is not very well solved in the standard java library until now, unless you are willing to take a look into Guava, which we will in another article… Let us stick with Java’s own facilities for today.

So the double brace pattern would be something like this:
 import java.util.*;

 public class D {     public static void main(String[] args) {         List<String> l = new ArrayList<String>() {{                 add("abc");                 add("def");                 add("uvw");             }};         System.out.println("l=" + l);         Set<String> s = new HashSet<String>() {{                 add("1A2");                 add("2B707");                 add("3DD");             }};         System.out.println("s=" + s); 

        Map<String, String> m = new HashMap<String, String>() {{                 put("k1", "v1");                 put("k2", "v2");                 put("k3", "v3");             }};         System.out.println("m=" + m);     } } 
What does this do?

First of all having an opening brace after the new XXX() creates an anonymous class extending XXX. Then we open the body of the extended class. What is well known to many is that there can be a static {....} section, that is called exactly once for each class. The same applies for a non-static section, which is achieved by omitting the static keyword. This is of course called once for each instance of the class, so in this case it will be called after the constructor of the base class and serves kind of as a replacement for the constructor. To make it look cooler the two pairs of braces are placed together.

It is not so magic, but it creates a lot of overhead by creating anonymous classes with no real additional functionality just for the sake of an initialization. It is even worse, because these anonymous inner classes are not static, so they actually can refer to their surrounding instance. They do not make use of this, but anyway they carry a reference to their surrounding class which might be a very serious problem for serialization, if that is used. And for garbage collection. So please consider the double-brace-initialization as an anti-pattern. Others have blogged about this too…

There are more legitimate ways to group the initialization together. You can put the initialization into a static method and call that. Or you could group it with single braces, just to indicate the grouping. This is a bit unusual, but at least correct:
 import java.util.*;

 public class E {     public static void main(String[] args) {         List<String> l = new ArrayList<String>();         {             l.add("abc");             l.add("def");             l.add("uvw");         }         System.out.println("l=" + l);         Set<String> s = new HashSet<String>();         {             s.add("1A2");             s.add("2B707");             s.add("3DD");         }         System.out.println("s=" + s); 

        Map<String, String> m = new HashMap<String, String>();         {             m.put("k1", "v1");             m.put("k2", "v2");             m.put("k3", "v3");         }         System.out.println("m=" + m);     } } 

While the first two can somehow be written using Arrays.asList(...), now in Java 9 there are nicer ways for writing all three using List.of("abc", "def", "uvw");, Set.of("1A2", "2B707", "3DD"); and Map.of("k1", "v1", "k2", "v2", "k3", "v3");, which is recommended over any other way because there are some additional runtime and compile time checks and because these are efficient immutable collections. This has been blogged about too.

The aspect of immutability which we should consider today, is not very well covered by the java collections (apart from the new internal one for the new factory methods. Wrapping in Collections.unmodifyableXXX(...) is a bit of overhead in terms of code, memory and CPU-usage but it does not give a guarantee that the collection wrapped into this is actually not being modified elsewhere.

# Is Java becoming non-free?

We are kind of used to the fact that Java is „free“.
It has been free in the sense of „free beer“ pretty much forever.
And more recently also „free“ in the sense of „free speech“.

In spite of the fact that we read that „Oracle is going to monetize on Java“, as can be read in articles like this, it is remaining like that, at least for now. This is also written in the article.
But it seems that they are looking for loopholes. For example we download and install Java SE including X, Y and Z, because it comes like that. Agree to hundred pages of license text and confirm having read and understood everything, as always… Now we really need X, which is the JDK, which is actually free. But we just accidentally also install Y and Z, which we do not need, but which has a price tag on which they are trying to get us.

Even if nothing will really happen, issues like that help undermining the trust in the platform in general, not only for Java, but also for other JVM-languages. Eventually there could be forks like we have seen with LibreOffice vs. OpenOffice or with mariaDB vs. mySQL, which kind of took over by avoiding the ties to Oracle. Solaris seems to have a similar fork, but in this case people are just moving to Linux anyway, so the issue is less relevant.

These prospects are not desirable, but I think we do not have to panic, because there are ways to solve this that are going to be pursued if necessary. Maybe it is a good idea to be more careful when installing software. And to think twice when starting a new project if Oracle or PostgreSQL is the right DB product in the long term, taking into consideration Oracle’s attitude towards loyal long term customers.

It is regrettable. Oracle has great technology from their own history and from SUN in databases, Java including the surrounding universe, Solaris and hardware. Let us hope that they will stay reasonable at least with Java.

# JMS

Java has always not just been a language, but it brought us libraries and frameworks. Some of them proved to be bad ideas, some become hyped without having any obvious advantages, but some were really good.

In the JEE-stack, messaging (JMS) was included pretty much from the beginning. In those days, when Java belonged to Sun Microsystems and Sun did not belong to Oracle, an aim was to support databases, which was in those days mostly Oracle, via JDBC and so called Message oriented middleware, which was available in the IBM-world via JMS. JMS is a common interface for messaging, that is like sending micro-email-message not between human, but between software components. It can be used within one JVM, but even between geographically distant servers, provided a safe network connection exists. Since we all know EMail this is in principle not too hard to understand, but the question is, what it really means and if it brings us something that we do not already have otherwise.

We do have web services as an established way to communicate between different servers across the network and of course they can also be used locally, if desired. Web services are neither the first nor the only way to communicate between servers nor are they the most efficient way. But I would say that they are the way how we do it in typical distributed applications that are not tied to any legacy. In principal web services are network capable and synchronous. This is well understood and works fine for many applications. But it also forces us to block processes or threads while waiting for responses, thus occupying valuable resources. And we tend to loose responsiveness, because of the waiting for the response. It needs to be observed that DB-access is typically only available synchronously. In a way understandable because of the transactions, but it also blocks resources to a huge extent, because we know that the performance of many applications is DB driven.

Now message based software architectures think mostly asynchronously. Sending a message is a „fire and forget“. There is such a thing as making message transactional, but this has to be understood correctly. There is one transaction for sending the message. It is guaranteed that the message is sent. Delivery guarantees can only be given to a limited extent, because we do not know anything about the other side and if it is at all working. This is not checked as part of the transaction. We can imagine though that the messaging system has its own transactional database and stores the message there within the transaction. It then retries delivering it forever, until it succeeds. Then it is deleted from this store as part of the receiving transaction. Both these transactions can be part of a distributed transaction and thus be combined with other transactions, usually against databases, for a combined transaction. This is what we usually have in mind when talking about this. I have to mention that the distributed transaction, usually based on the so called two phase commit, is not quite as water proof as we might hope, but it can be broken by construction of a worst case scenario regarding the timing of failures of network and systems. But it is for practical purposes reasonable good to use.

While it is extremely interesting to investigate purely message based architectures, especially in conjunction with functional paradigm, this may not be the only choice. Often it is a good option to use a combination of messaging with synchronous services.

We should observe that messaging is a more abstract concept. It can be implemented by some middle ware and even be accessible by a standardized kind of interface like JMS. But it can also be more abstract as a queuing system or as something like Akka uses for its internal communication. And messaging is not limited to Java or JVM languages. Interoperability does impose some constraints on how to use it, because it bans usage of Object-messages which store serialized Java objects, but there are ways to address this by using JSON or BSON or XML or Protocol Buffers as message contents.

What is interesting about JMS and messaging in general are two major communication modes. We can have queues, which are point to point connections. Or we can have „topics“, which are channels into which messages are sent. They are then received by all current subscribers of the topic. This is interesting to notify different components about an event happening in the system, while possibly details about the event must be queried via synchronous services or requested by further messaging via queues.

Generally JMS in Java has different implementations, usually there are those coming with the application servers and there are also some standalone implementations. They can be operated via the same interface, at least as long as we constrain us to the common set of functionality. So we can exchange the JMS implementation for the whole platform (which is a nightmare in real life), but we cannot mix them, because the wire protocol is usually incompatible. There is now something like a standard network protocol for messaging, which is followed by some, but not all implementations.

As skeptical as I am against Java Enterprise edition, I do find the JMS part of enterprise Java very interesting and worthwhile exploring for projects that have a size and characteristics justifying this.

# Some Thoughts about Incompleteness of Libraries

## Selfwritten Util Libraries

Today we have really good libraries with our programming languages and they cover a lot of things. The funny thing is, that we usually end up writing some Util-classes like StringUtil, CollectionUtil, NumberUtil etc. that cover some common tasks that are not found in the libraries that we use. Usually it is no big deal and the methods are trivial to write. But then again, not having them in the library results in several slightly different ad hoc solutions for the same problem, sometimes flawless, sometimes somewhat weak, that are spread throughout the code and maybe eventually some „tools“, „utils“ or „helper“ classes that unify them and cover them in a somewhat reasonable way.

## Imposing Util Libraries on all Developers

In the worst case these self written library classes really suck, but are imposed on the developers. Many years ago it was „company standard“ to use a common library for localizing strings. The concept was kind of nice, but it had its flaws. First there was a company wide database for localizing strings in order to save on translation costs, but the overhead was so much and the probability that the same short string means something different in the context of different applications was there. This could be addressed by just creating a label that somehow included the application ID and bypassing this overhead, whenever a collision was detected. What was worse, the new string made it into a header file and that caused the whole application to be recompiled, unless a hand written make file skipped this dependency. This was of course against company policy as well and it meant a lot of work. In those days compilation of the whole application took about 8 (eight!) hours. Maybe seven. So after adding one string it took 8 hours of compile time to continue working with it. Anyway, there was another implementation for the same concept for another operating system, that used hash tables and did not require recompilation. It had the risk of runtime errors because of non-defined strings, but it was at least reasonable to work with it. I ported this library to the operating system that I was using and used it and during each meeting I had do commit to the long term goal of changing to the broken library, which of course never happened, because there were always higher priorities.

I thing the lesson we can already learn is that such libraries that are written internally and imposed on all developers should be really done very well. Senior developers should be involved and if the company does not have them, hired externally for the development. Not to do the whole development, but to help doing it right.

## Need for Util libraries

So why not just go with the given libraries? Or download some more? Depending on the language there are really good libraries around. Sometimes that is the way to go. Sometimes it is good to write a good util-libarary internally. But then it is important to do it well, to include only stuff that is actually needed or reasonably likely needed and to avoid major effort for reinventing the wheel. Some obscure libraries actually become obsolete when the main default library gets improved.

## Example: Trigonometric and other Mathematical Functions

Most of us do not do a lot of floating point arithmetic and subsequentially we do not need the trigonometric functions like and , other transcendental functions like and or functions like cube root () a lot. Where the default set of these functions ends is somewhat arbitrary, but of course we need to go to special libraries at some point for more special functions. We can look what early calculators used to have and what advanced math text books in schools cover. We have to consider the fact, that the commonly used set of trigonometric functions differs from country to country. Americans tend to use six of them, , , , , and , which is kind of beautiful, because it really completes the set. Germans tend to use only , , and , which is not as beautiful, but at least avoids the division by zero and issue of transforming to .  Calculators usually had only , and . But they offered them in three flavors, with modes of „DEG“, „RAD“ and „GRAD“. The third one was kind of an attempt to metricize degrees by having instead of for an right angle, which seems to be a dead idea.  Of course in advanced mathematics and physics the „RAD“, which uses instead of is common and that is what all programming languages that I know use, apart from the calculators. Just to explain the functions for those who are not familiar with the whole set, we can express the last four in terms of and :

• (tangent)
• (cotangent)
• (secans)
• (cosecans)

Then we have the inverse trigonometric functions, that can be denoted with something like or for all six trigonometric functions. There is an irregularity to keep in mind. We write instead of for , which is the multiplication of that number of terms. And we use to apply the function „ time, which is actually the inverse function. Mathematicians have invented this irregularity and usually it is convenient, but it confuses those who do not know it. From these functions many programming languages offer only the assuming the others five can be created from that. This is true, but cumbersome, because it needs to differentiate a lot of cases using something like if, so there are likely to be many bugs in software doing this. Also these ad hoc implementations loose some precision.

It was also common to have a conversion from polar coordinates to rectangular (p2r) coordinates and vice versa (r2p), which is kind of cool and again easy, but not too trivial to do ad hoc. Something like atan2 in FORTRAN, which does the essence of the harder r2p operation, would work also, depending on hon convenient it is to deal with multiple return values. We can then do r2p using , and p2r by and .

The hyperbolic functions like , their inverses like or are rarely used, but we find them on the calculator and in the math book, so we should have them in the standard floating point library. There is only one flavor of them.

Logarithms and exponential functions are found in two flavors on calculators: and and and . The log is kind of confusing, because in mathematics and physics and in most current programming language we mean (natural logarithm). This is just a wrong naming on calculators, even if they all did the same mistake across all vendors and probably still do in the scientific calculator app on the phone or on the desktop. As IT people we tend to like the base two logarithm , so I would tend to add that to the list. Just to make the confusion complete, in some informatics text books and lectures the term „“ refers to the base two logarithm. It is a bad habit and at least the laziness should favor writing the correct „„.

Then we usually have power functions , which surprisingly many programming languages do not have. If they do, it is usually written as x ** y or pow(x, y), square root, square and maybe cube root and cube.  Even though the square root and the cube root can be expressed as powers using and it is better to do them as dedicated functions, because they are used much more frequently than any other power with non-integral exponents and it is possible to write optimized implementations that run faster and more reliably then the generic power which usually needs to go via log and exp. Internal optimization of power functions is usually a good idea for integral exponents and can easily be achieved, at least if the exponent is actually of an integer type.

Factorial and binomial coefficient are usually used for integers, which is not part of this discussion. Extensions for floating point numbers can be defined, but they are beyond the scope of advanced school mathematics and of common scientific calculators. I do not think that they are needed in a standard floating point library. It is of its own interest what could be in an „advanced math library“, but and and for sure belong into the base math library.

That’s it. It would be easy to add all these into the standard library of any programming language that does floating point arithmetic at all and it would be helpful for those who work with this and not hurt at all those who do not use it, because this stuff is really small compared to most of our libraries. So this would be the list

• sin, cos, tan, cot, sec, csc in two flavors
• asin, acos, atan, acot, asec, acsc (standing for …) in two flavors
• p2r, r2p (polar coordinates to rectangular and reverse) or atan2
• sinh, cosh, tanh, coth, sech, csch
• asinh, acosh, atanh, acoth, asech, acsch (for …)
• exp, log (for and logarithm base e)
• exp10, exp2, log10, log2 (base 10 and base 2, I would not rely on knowledge that ld and lg stand for log2 and log10, respectively, but name them like this)
• sqrt, cbrt (for and )
• ** or pow with double exponent
• ** or pow with integer exponent (maybe the function with double exponent is sufficient)
• , , , are maybe actually not needed, because we can just write them using ** and /

Actually pretty much every standard library contains sin, cos, tan, atan, exp, log and sqrt.

## Java

Java is actually not so bad in this area. It contains the tan2, sinh, cosh, tanh, asin, acos, atan, log10 and cbrt functions, beyond what any library contains. And it contains conversions from degree to radiens and vice versa. And as you can see here in the source code of pow, the calculations are actually quite sophisticated and done in C. It seems to be inspired by GNU-classpath, which did a similar implementation in Java. It is typical that a function that has a uniform mathematical definition gets very complicated internally with many cases, because depending on the parameters different ways of calculation provide the best precision. It would be quite possible that this function is so good that calling it with an integer as a second parameter, which is then converted to a double, would actually be good enough and leave no need for a specific function with an integer exponent. I would tend to assume that that is the case.

In this github project we can see what a library could look like that completes the list above, includes unit tests and works also for the edge cases, which ad hoc solution often do not. What could be improved is providing the optimal possible precision for any legitimate parameters, which I would see as an area of further investigation and improvement. The general idea is applicable to almost any programming language.

Two areas that have been known for a great need of such additional libraries are collections and Date&Time. I would say that really a lot what I would wish from a decent collection library has been addressed by Guava. Getting Date and time right is surprisingly hard, but just thing of the year-2000-problem to see the significance of this issue. I would say Java had this one messed up, but Joda Time was a good solution and has made it into the standard distribution of Java 8.

## Summary

This may serve as an example. There are usually some functions missing for collections, strings, dates, integers etc. I might write about them as well, but they are less obvious, so I would like to collect some input before writing about that.

libc on Linux seem to contain sin, cos, tan, asin, acos, atan, atan2, sinh, cosh, tanh, asinh, acosh, atanh, sqrt, cbrt, log10, log2, exp, log, exp10, exp2. Surprisingly Java does not make use of these functions, but comes up with its own.

Actually a lot of functionality is already in the CPU-hardware. IEEE-recommendations suggest quite an impressive set of functions, but they are all optional and sometimes the accuracy is poor.

But standard libraries should be slightly more complete and ideally there would be no need to write a „generic“ util-library.  Such libraries should only be needed for application specific code that is somewhat generic across some projects of the organization or when doing a real demanding application that needs more powerful functionality than can easily be provided in the standard library. Ideally these can be donated to the developers of the standard library and included in future releases, if they are generic enough. We should not forget, even programming languages that are main stream and used by thousands of developers all over the world are usually maintained by quite small teams, sometimes only working part time on this. But usually it is hard to get even a good improvement into their code base for an outsider.

So what functions do you usually miss in the standard libraries?

When Java was created, the concept of operator overloading was already present in C++. I would say that it was generally well done in C++, but it kind of breaks the object oriented polymorphism patterns of C++ and the usual way was to have several overloaded functions to allow for all n² combinations.

In the early days of C++ people jumped on this feature and used it for all kinds of stuff that has nothing to do with the original concept of numeric operators, like adding dialog boxes to strings and multiplying that with events. We get somewhere a little bit towards what APL was, which had only operators and a special charset to allow for all the language features, requiring even a special keyboard:

APL example

You can find an article in Scott Locklin’s Blog about APL and other almost forgotten languages and the potential loss of some achievements that they tried to bring to us.

We see the same with some people in Scala who create a lot of operators using interesting Unicode characters. This is not necessarily wrong, but I think operators should only be used for something that is really important. Not in the sense: „I wrote functionality XYZ for library UVW, and this is really important“, but in the sense that this functionality is so commonly used that people have no problem remembering the operator. Or the operator is already known to us, like „+“, „-„, „*“, … for numeric types, but I still have no idea what adding a string to an event would mean.

In C++ it got even worse because it was possible to overload „->“ or new and thus digging deep into the language, which can be interesting when used carefully and skillfully by developers who really know what they are doing, but disastrous otherwise.

Now Java has opted not to support this operator overloading, which was wrong in even at that time, but understandable, because at that time we were still more in the mindset to count bits and live with the deficiencies of int and long and we ware also seeing the weird abuses of operator overloading in C++. Maybe it was also the lack of time to design a sound mechanism for this in Java. Unfortunately this decision that was made in a context more than 20 years ago has kind of become religious. Interestingly James Gosling, when asked in an interview for the 20 years anniversary of Java, mentioned operator overloading for numeric types as the first thing that he would have made better. (It is around minute 9.) So I hope that this undoes the religious aspect of this topic.

An interesting idea will probably be included in future versions of Scala. An operator is in principal defined as a method of the left operand, which is quite logical, but it would imply writing something like e = (a.*(b)).+(c.*(d)), possibly with fewer parentheses. Now this is recognized as a operator-method, so the dots can go away as well as the parentheses and the common operator precedence applies, so e = a * b + c * d works as well and is what we find natural. Ruby and Scala are very similar in this aspect. Now some future version of Scala, maybe Scala 3, will introduce an annotation that allows the „infix“-notation for these methods and that adds a descriptive name. Now error messages and even IDE-support could give us access to the descriptive name and we would be able to search for it, while searching for something like „+“ or „-“ or „*“ would not really be helpful. I think that this idea would be useful for other languages as well.

These examples demonstrate the BigInteger types of Java, C#, Scala, Clojure and Ruby, respectively:

import java.math.BigInteger;

public class JavaBigInt {

public static void main(String[] args) {
BigInteger f = BigInteger.valueOf(2_000_000_000L);
BigInteger p = BigInteger.ONE;
for (int i = 0; i < 8; i++) {
System.out.println(i + " " +  p);
p = p.multiply(f);
}
}
}


gives this output:

0 1
1 2000000000
2 4000000000000000000
3 8000000000000000000000000000
4 16000000000000000000000000000000000000
5 32000000000000000000000000000000000000000000000
6 64000000000000000000000000000000000000000000000000000000
7 128000000000000000000000000000000000000000000000000000000000000000


And the C#-version

using System;
using System.Numerics;

public class CsInt {

public static void Main(string[] args) {
BigInteger f = 2000000000;
BigInteger p = 1;
for (int i = 0; i < 8; i++) {
Console.WriteLine(i + " " +  p);
p *= f;
}
}
}


give exactly the same output:

0 1
1 2000000000
2 4000000000000000000
3 8000000000000000000000000000
4 16000000000000000000000000000000000000
5 32000000000000000000000000000000000000000000000
6 64000000000000000000000000000000000000000000000000000000
7 128000000000000000000000000000000000000000000000000000000000000000


Or the Scala version

object ScalaBigInt {

def main(args: Array[String]): Unit = {
val f : BigInt = 2000000000;
var p : BigInt = 1;
for (i  <- 0 until 8) {
println(i + " " + p);
p *= f;
}
}
}

0 1
1 2000000000
2 4000000000000000000
3 8000000000000000000000000000
4 16000000000000000000000000000000000000
5 32000000000000000000000000000000000000000000000
6 64000000000000000000000000000000000000000000000000000000
7 128000000000000000000000000000000000000000000000000000000000000000


Or in Clojure it looks like this, slightly shorter than then Java and C#:

(reduce (fn [x y] (println y x) (*' 2000000000 x)) 1 (range 8))


with the same output again, but a much shorter program. Please observe that the multiplication needs to use the "*'" instead of "*" in order to outexpand from fixed length integers to big-integers.

0 1
1 2000000000
2 4000000000000000000
3 8000000000000000000000000000N
4 16000000000000000000000000000000000000N
5 32000000000000000000000000000000000000000000000N
6 64000000000000000000000000000000000000000000000000000000N
7 128000000000000000000000000000000000000000000000000000000000000000N


Or in Ruby it is also quite short:

f = 2000000000
p = 1
8.times do |i|
puts "#{i} #{p}"
p *= f;
end


same result, without any special effort, because integers are always expanding to the needed size:

0 1
1 2000000000
2 4000000000000000000
3 8000000000000000000000000000
4 16000000000000000000000000000000000000
5 32000000000000000000000000000000000000000000000
6 64000000000000000000000000000000000000000000000000000000
7 128000000000000000000000000000000000000000000000000000000000000000


So I suggest to leave the IT-theology behind. So the pragmatic issues should be considered now.

In Java we have primitive numeric types, that are basically inadequate for application development, because they tacitly overflow and because application developers have usually no idea how to deal with rounding issues of float and double. We have good numeric types like BigInteger and BigDecimal to support arbitrarily long integral numbers, which do not overflow unless we exceed memory or addressaility issues with numbers of several billion digits. BigDecimal allows for controlled rounding, and also arbitrary precision.

Now we have to write
 e = a.multiply(b).add(c.multiply(d)) 
 e = a * b + c * d 
The latter is readable, it is exactly what we mean. The former is not readable at all and the likelihood of making mistakes is very high.
I would be happy with something like this:
 e = a (*) b (+) c (*) d 
where overloaded operators are surrounded with () or [] or something like that.

At some point of time a major producer of electronic calculators made us believe that it is more natural to express it like this
 e a b * c d * + = 
Maybe this way of writing math would be better, but it is not what we do outside of our computers and calculators. At least it was more natural to have this pattern for those who created the calculators, because it was much easier to implement in a clean way on limited hardware. We still have the opposite in Lisp, which is still quite alive as Clojure, so I use the Clojure syntax:
 (def x (+ (* a b) (* c d))) 
which is relatively readable after some learning and allows for a very simple and regular and powerful syntax. But even this is not how we write Math outside of our computer.

Btw., please use elementary school math skills and do not write
 e = (a * b) + (c * d) 
That is just noise. I do not recommend to memorize all the 10 to 25 levels of operator precedence of a typical programming languages, but it is good to know the basic ones, that almost any serious current programming language supports:
* binary * /
* binary + -
* == != <= >= < >
* &&
* ||
Some use "and" and "or" instead of "&&" and "||".

Now using overloaded operators should be no problem.

We do have an issue when implementing it.

Imagine you have a language with five built in numeric types. Now you add a sixth one. "+" is probably already defined for 25 combinations. With the sixth type we get a total of 36 combinations, of which we have to provide the missing 11 and a mechanism to dispatch the program flow to these. In C++ we just add 11 operator-functions and that does everything. In Ruby we add a method for the left side of the operator. Now this does not know our new type for the existing types, but it deals with it by calling coerce of the right operand with the left operand as parameter. This is actually powerful enough to deal with this situation.

It gets even more tricky when we use different libraries that do not know of each other and each of them adds numeric types. Possibly we cannot add these with each other or we can do so in a degraded manner by just falling back to double or float or rational or something like that.

The numeric types that we usually use can be added with each other, but we could hit situations where that is not the case, for example when having p-adic numbers, which can be added with rational number, but not with real numbers. Or finite fields, whose members can be added with integral numbers or with numbers of the same field, but not necessarily with numbers of another finite field. Fortunately these issues should occur only to people who understand them while writing libraries. Using the libraries should not be hard, if they are properly done.

# Primitives, Objects and Autoboxing

The type system in Java makes a difference between so called „primitives“, which are boolean, byte, char, int, long, float and double and Objects, which are anything derived from Object in object oriented philosophy, including the special case of arrays, which I will not discuss today.

Primitive types have many operations that are kind of natural to perform on them, like arithmetic. They behave as values, so they are actually copied, which is no big deal, because they are at most 64 bits in size, which is in modern java implementations the size of a pointer when using references. Now a major benefit of object orientation is arguable the polymorphism and this has been heavily used when implementing useful libraries like the collection classes, which were based mostly on Object and thus able to handle anything derived from Object. This has not changed with generics, they are just another way of writing this and adding some compile time checks and casts in a more readable way, as long as the complexity of the generics constructions remains simple and under control. Actually I like this approach and find it much more healthy than templates in C++, but this is a IT-theological discussion that is not too relevant for this article.

Now there is a necessity of using collections for numeric types. Even though I do recommend to thoroughly think about using types like BigInteger and BigDecimal, there are absolutely legitimate uses of long, int, boolean, double, char and less frequently short, byte and float. The only one that is really flawless of these is boolean, while the floating point numbers, the fixed size integral numbers (also this) and the Strings and chars in Java have serious flaws, some of which I have discussed in the linked articles.

Now we need to use the wrapper types Integer, Long, Double and Boolean instead of int, long, double and boolean to store them in collections. This comes with some overhead, because these wrappers use some additional memory and the wrapping and unwrapping costs some time. Usually this does not impose a problem and using these wrappers is often an acceptable approach. Now we would be tempted to just work with the wrappers, but that is impossible, because the natural operations for the underlying boolean and numeric types just do not work with the wrappers, so we have to unwrap (or unbox) them.

Now Java includes a feature called „autoboxing and autounboxing“ which tries to create a wrapper object around a primitive when in an object context and which extracts the primitive when in a primitive context. This can be enforced by casting, to be sure.

There are some dangers in using this feature. The most interesting case is the „==“-operator. For objects and also for the wrappers of the primitives this always compares object identity based on the pointer address. For primitives that is simply impossible and the comparison compares the value. I think that it was a mistake to define the „==“-operator like that and it should do a semantic comparison and there should be something else for object identity, but that cannot be changed any more for Java. So we get some confusion when comparing boxed primitives with == or even worse when comparing boxed and unboxed primitives. Another confusion occurs, when using autounboxing and the wrapper object is null. This creates of course a NullPointerException, but it is kind of hard to spot where it actually comes from.

So I do see some value in using explicit boxing and unboxing to make things clearer. It is a good thing to talk about this in the team and find a common way. Now the interesting question is how boxing and unboxing are done. We are tempted to use something like this:
 int x = ...; Integer xObj = new Integer(x); 
This works, but it is not good, because it creates too many objects. We can reuse them and java provides for this and reuses them for some small numbers. The recommended way for explicit boxing is this:
 int x = ... Integer xObj = Integer.valueOf(x); 
This can reuse values. If we are using this a lot and know that our range of commonly used numbers is reasonably small but still beyond what Java assumes, it is not too hard to write something like „IntegerUtil“ and use it:
 int x = ...; Integer xObj = IntegerUtil.valueOf(x); 
Look if you can find an implementation that fits your needs, instead of writing it. But it is no pain to write it.
Unboxing is also easy:
 Integer xObj = ....; int x = xObj.intValue(); 
The methods intValue(), longValue(), doubleValue(),… are actually in the base class Number, so it is possible to unbox and cast in one step with these.

Decide how much readability you want.

It is useful to look at the static methods of the wrapper classes even for converting numbers to Strings and Strings to numbers. Avoid using constructors, they are rarely necessary and some neat optimizations that the Java libraries give us for free only work when we use the right methods. This does not make a huge difference, but doing it right does not hurt, but rather makes code more readable.

It is also interesting how the extended numeric types like BigInteger and BigDecimal work similar to the wrapper types and to use them right.

Another interesting issue is to use actually specific collection implementations for primitives. This may add to the complexity of our code, because it gives up another piece of polymorphism, but they can really save our day by giving a better performance. And in cases where we actually know for sure that the data is always belonging to a certain primitive type, I find this even idiomatic.

Other languages have solved the issues discussed here in a more elegant way by avoiding this two sided world of primitives and wrappers or by making the conversions less dangerous and more natural. They have operator overloading for numeric types and they use a more consistent concept of equality than Java.

# UTF-16 Strings in Java

Deutsch

Strings in Java and many other JVM-languages consist of Unicode content and are encoded as utf-16. It was fantastic to already consider Unicode when introducing Java in the 90es and to make it the only reasonable way to use strings, so there is no temptation to start with a „US-ASCII“-version of a software that never really gets enhanced to deal properly with non-English content and data. And it also avoids having to deal with many kinds of String encodings within the software. For outside storage in databases and files and for communication with other processes and over the network, these issues of course remain. But Java strings are always encoded utf-16. We can be sure. Most common languages can make use of these Strings easily and handling common letter based languages from Europe and western Asia is quite strait forward. Rendering may still be a challenge, but that is another issue. A major drawback of this approach is that more memory is used. This usually does not hurt too much. Memory is not so expensive and a couple of Strings even with utf-16 will not be too big. With 64-bit Java, which should be used these days, the memory limitations of the JVM are not relevant any more, they can basically use as much memory as provided.

But some applications to hit the memory limits. And since usually most of the data we are dealing with is ultimately strings and combinations of relatively long strings with some reference pointers, some integer numbers and more strings, we can say that in typical memory intensive applications strings actually consume a large fraction of the memory. This leads to the temptation of using or even writing a string library that uses utf-8 or some other more condensed internal format, while still being able to express any Unicode content. This is possible and I have done it. Unfortunately it is very painful, because Strings are quite deeply integrated into the language and explicit conversions need to be added in many places to deal with this. But it is possible and can save a lot of memory. In my case we were able to abandon this approach, because other optimizations, that were less painful, proved to be sufficient.

An interesting idea is to compress strings. If they are long enough, algorithms like gzip work on a single string. As with utf-8, selectively accessing parts of the string becomes expensive, because it can only be achieved by parsing the string from the beginning or by adding indexing structures. We just do not know which byte to go to for accessing the n-th character, even with utf-8. In reality we often do not have long strings, but rather many relatively short strings. They do not compress well by themselves. If we know our data and have a rough idea about the content of our complete set of strings, custom compression algorithm can be generated. This allows good results even for relatively short strings, as long as they are within the „language“ that we are prepared for. This is more or less similar to the step from utf-16 to utf-8, because we replace common byte sequences by shorter byte sequences and less common sequences may even get replaced by something longer. There is no gain in utf-8, if we have mostly strings that are in non-Latin alphabets. Even Cyrillic or Greek, that are alphabets similar to the Latin alphabet, will end up needing two bytes for each letter, which is not at all better than utf-16. For other alphabets it will even become worse, because three or four bytes are needed for one symbol that could easily be expressed with two bytes in utf-16. But if we know our data well enough, the approach with the specific compression will work fine. The „dictionary“ for the compression needs to be stored only once, maybe hard-coded in the software, and not in each string. It might be of interest to consider building the dictionary dynamically at run-time, like it is done with gzip, but keeping it in a common place for all strings and thus sharing it. The custom strings that I used where actually using a hard coded compression algorithm generated using a large amount of typical data. It worked fine, but was just too clumsy to use because Java is not prepared to replace String with something else without really messing around in the standard run-time libraries, which I would neither recommend nor want.

It is important to consider the following issues:

1. Is the memory consumption of the strings really a problem?
2. Are there easier optimizations that solve the problem?
3. Can it just be solved by adding more hardware? Yes, this is a legitimate question.
4. Are there solutions for the problem in the internet or even within the current organization?
5. A new String class is so fundamental that excellent testing is absolutely mandatory. The unit tests should be very extensive and complete.

# Will Java, C, C++ and C# be the new Cobols?

A few decades ago most programming was performed in Cobol (I do not want to shout it), Fortran, Rexx and some typical main frame languages, that hardly made it to the Linux-, Unix- or MS-Windows-world. They are still present, but mostly used for maintenance and extension of existing software, but less often for writing new software from scratch.
In these days languages like C, C++, Java and to a slightly lesser extent C# dominate the list of most commonly used languages. I would assume that JavaScript is also quite prominent in the list, because it has become more popular to write rich web clients using frameworks like Angular JS. And there are tons of them and some really good stuff. Some people like to see JavaScript also on the server side and in spite of really interesting frameworks like Node-JS I do not really consider this a good idea. If you like you may add Objective C to this list, which I do not know very much at all, even though it has been part of my gcc since my first Linux installation in the early 1990es.

Anyway, C goes back to the 1970es and I think that it was a great language to create at that time and it still is for a certain set of purposes. When writing operating systems, database engines, compilers and interpreters for other languages, editors, or embedded software, everything that is very close to the hardware, explicit control and direct access to very powerful OS-APIs are features that prove to be useful. It has been said that Java runs as fast as C, which is at least close to the truth, but only if we do not take into account the memory usage. C has some short comings that could be done better without sacrificing its strengths in the areas where it is useful, but it does not seem to be happening.

C++ has been the OO-extension of C, but I would say that it has evolved to be a totally different language for different purposes, even though there is some overlap, it is relatively easy to call functionality written in C from C++ and a little bit harder the other way round… I have not used it very much recently, so I will refrain from commenting further on it.

Java has introduced an infrastructure that is very common now with its virtual machine. The JVM is running on a large number of servers and any JVM-language can be used there. The platform independence is an advantage, but I think that its importance on servers has diminished a little bit. There used to be all kinds of servers with different operating systems and different CPU-architectures. But now we are moving towards servers being mostly Linux with Intel-compatible CPUs, so it is becomeing less of an issue. This may change in the future again, of course.

With Mono C# can be used in ways similar to Java, at least that is what the theory says and what seems to be quite true at least up to a certain level. It seems to be a bit ahead of Java with some language features, just think of operator overloading, undeclared exceptions, properties, generics or lambdas, which have been introduced earlier or in a more elegant way or we are still waiting in Java. I think the case of lambdas also shows the limitations, because they seem to behave differently than you would expect from real closures, which is the way lambdas should be done and are done in more functionally oriented languages or even in the Ruby programming language, in the Perl programming language or typical Lisps.
Try this
 List<Func<int>> actions = new List<Func<int>>();

 int variable = 0; while (variable < 5) {     actions.Add(() => variable * 2);     ++ variable; } 

foreach (var act in actions) {     Console.WriteLine(act.Invoke()); } 
We would expect the output 0, 2, 4, 6, 8, but we are getting 10, 10, 10, 10, 10 (one number in a line, respectively).
But it can be fixed:
 List<Func<int>> actions = new List<Func<int>>();

 int variable = 0; while (variable < 5) {     int copy = variable;     actions.Add(() => copy * 2);     ++ variable; } 

foreach (var act in actions) {     Console.WriteLine(act.Invoke()); } 
I would say that the concept of inner classes is better in Java, even though what is static there should be the default, but having lambdas this is less important…
You find more issues with class loader, which are kind of hard to tame in java, but extremely powerful.

Anyway, I think that all of these languages tend to be similar in their syntax, at least within a method or function and require a lot of boiler plate code. Another issue that I see is that the basic types, which include Strings, even if they are seen as basic types by the language design, are not powerful enough or full of pitfalls.

While the idea to use just null terminated character arrays as strings in C has its beauty, I think it is actually not really good enough and for more serious C applications a more advanced string library would be good, with the disadvantage that different libraries will end up using different string libraries… Anyway, for stuff that is legitimately done with C now, this is not so much of an issue and legacy software has anyway its legacy how to deal with strings, and possible painful limitations in conjunction with Unicode. Java and also C# have been introduced at a time when Unicode was already around and the standard already claimed to use more than 65536 code points (characters in Unicode-speak), but at that time 65536 seemed to be quite ok to cover the needs for all common languages and so utf-16 was picked as an encoding. This blows up the memory, because strings occupy most of the memory in typical application software, but it still leaves us with uncertainties about length and position, because code points can be one or two 16-bit-„characters“ long, which can only be seen by actually iterating through the string, which leaves us where we were with null terminated strings in C. And strings are really hard to replace or enhance in this aspect, because they are used everywhere.

Numbers are not good either. As an application developer we should not care about counting bits, unless we are in an area that needs to be specifically optimized. We are using mostly integer types in application development, at least we should. These overflow silently. Just to see it in C#:
 int i = 0; int s = 1; for (i = 0; i < 20; i++) {     s *= 7;     Console.WriteLine("i=" + i + " s=" + s); } 
which gives us:
 i=0 s=7 i=1 s=49 i=2 s=343 i=3 s=2401 i=4 s=16807 i=5 s=117649 i=6 s=823543 i=7 s=5764801 i=8 s=40353607 i=9 s=282475249 i=10 s=1977326743 i=11 s=956385313 i=12 s=-1895237401 i=13 s=-381759919 i=14 s=1622647863 i=15 s=-1526366847 i=16 s=-2094633337 i=17 s=-1777531471 i=18 s=442181591 i=19 s=-1199696159 
So it silently overflows, or just takes the remainder modulo with the representation system . Java, C and C++ behave exactly the same way, only that we need to know what „int“ means for our C-compiler, but if we use 32-bit-ints, it is the same. This should throw an exception or switch to some unlimited long integer. Clojure offers both options, depending on whether you use * or *‘ as operator. So as application developers we should not have to care about these bits and most developers do not think about it. Usually it goes well, but a lot of software bugs are around due to this pattern. It is just wrong in C#, Java, and C++. In C I find it more acceptable, because the typical area for using C for new software actually is quite related to bits and bytes, so the developers need to be aware of such issues all the time anyway.