Lazy Collections, Strings or Numbers

The idea is, that we have data that is obtained or calculated to give us on demand as much of it as we request. But it is not necessarily initially present. This concept is quite common in the functional world, where we in a way hide the deprecated concept of state in such structures, by the way in a way that lets use retain the benefits that led to the desire for statelessness.

Actually the concept is quite old. We have it for I/O in Unix and hence in Linux since the 1970ies. „Everything is a file“, at least as long as we constrain ourselves to a universal subset of possible file operations. It can be keyboard input, a named or anonymous pipe, an actual file, a TCP-connection, to name the most important cases. These are „lazy“ files, behave more or less like files as far as sequential reading is concerned, but not for random access reading. The I/O-concept has been done in such a way that it takes the case into account that we want to read n bytes, but get only m < n bytes. This can happen with files when we reach their end, but then we can obtain an indication that we reached the end of the file, while it is perfectly possible that we read less then we want in one access, but eventually get \ge n bytes including subsequent reads. Since the API has been done right, but by no means ideal, it generalizes well to the different cases that exist in current OS environments.

We could consider a File as an array of bytes. There is actually a way to access it in this way by memory-mapping it, but this assumes a physically present file. Now we could assume that we think of the array as a list that is optimized for sequential access and iterating, but not for random access. Both list types actually exist in languages like Java. Actually the random access structure can be made lazy as well, within certain constraints. If the source is actually sequential, we can just assume that the data is obtained up to the point where we actually read. The information about the total length of the stream may or may not be available, it is always available somehow in the case of structures that are completely available in memory. This random access on lazy collections works fine if the reason of laziness is to actually save us from doing expensive operations to obtain data that we do not actually need or to obtain them in parallel to the computation that processes the data. But we loose another potential drawback in this case. If the data is truly sequential, we can actually process data that is way beyond our memory capacity.

So the concept transfers easily from I/O-streams to lists and even arrays, most naturally to iterables that can be iterated only once. But we can easily imagine that this also applies to Strings, which can be seen a sequence of characters. If we do not constrain us to what a String is in C or Java or Ruby, but consider String to be a more abstract concept, again possibly dropping the idea of knowing the length or having a finite length. Just think of the output of the Unix command „yes“ or „cat /dev/zero“, which is infinite, in a theoretical way, but the computer won’t last forever in real life, of course. And we always interrupt the output at some time, usually be having the consumer shut down the connection.

Even numbers can be infinite. For real numbers this can happen only after the decimal point, for p-adic numbers it happens only before the decimal point, if you like to look into that. Since we rarely program with p-adic numbers this is more or less an edge case that is not part of our daily work, unless we actually do math research. But we could have integers with so many digits that we actually obtain and process them sequentially.

Reactive programming, which is promoted by lightbend in the Reactive Manifesto relies heavily on lazy structures, in this case data streams. An important concept is the so called „backpressure“, that allows the consumer to slow down the producer, if it cannot read the data fast enough.

Back to the collections, we can observe different approaches. Java 8 has introduced streams as lazy collections and we need to transform collections into streams and after the operation a stream back into a collection, at least in many real life situations. But putting all into one structure has some drawbacks as well. But looking at it from an abstract point of view this does not matter. The java8-streams to not implement a collection interface, but they are lazy collections from a more abstract point of view.

It is interesting that this allows us to relatively easily write nested loops where the depth of the nesting is a parameter that is not known at compile time. We just need a lazy collections of n-tuples, where n is the actual depth of the nesting and the contents are according to what the loops should iterate through. In this case we might or might not know the size of the collection, possibly not fitting into a 32-bit-integer. We might be able to produce a random member of the collection. And for sure we can iterate through it and stop the iteration wherever it is, once the desired calculation has been completed.

Share Button

Is Java becoming non-free?

We are kind of used to the fact that Java is „free“.
It has been free in the sense of „free beer“ pretty much forever.
And more recently also „free“ in the sense of „free speech“.

In spite of the fact that we read that „Oracle is going to monetize on Java“, as can be read in articles like this, it is remaining like that, at least for now. This is also written in the article.
But it seems that they are looking for loopholes. For example we download and install Java SE including X, Y and Z, because it comes like that. Agree to hundred pages of license text and confirm having read and understood everything, as always… Now we really need X, which is the JDK, which is actually free. But we just accidentally also install Y and Z, which we do not need, but which has a price tag on which they are trying to get us.

Even if nothing will really happen, issues like that help undermining the trust in the platform in general, not only for Java, but also for other JVM-languages. Eventually there could be forks like we have seen with LibreOffice vs. OpenOffice or with mariaDB vs. mySQL, which kind of took over by avoiding the ties to Oracle. Solaris seems to have a similar fork, but in this case people are just moving to Linux anyway, so the issue is less relevant.

These prospects are not desirable, but I think we do not have to panic, because there are ways to solve this that are going to be pursued if necessary. Maybe it is a good idea to be more careful when installing software. And to think twice when starting a new project if Oracle or PostgreSQL is the right DB product in the long term, taking into consideration Oracle’s attitude towards loyal long term customers.

It is regrettable. Oracle has great technology from their own history and from SUN in databases, Java including the surrounding universe, Solaris and hardware. Let us hope that they will stay reasonable at least with Java.

Share Button

JMS

Java has always not just been a language, but it brought us libraries and frameworks. Some of them proved to be bad ideas, some become hyped without having any obvious advantages, but some were really good.

In the JEE-stack, messaging (JMS) was included pretty much from the beginning. In those days, when Java belonged to Sun Microsystems and Sun did not belong to Oracle, an aim was to support databases, which was in those days mostly Oracle, via JDBC and so called Message oriented middleware, which was available in the IBM-world via JMS. JMS is a common interface for messaging, that is like sending micro-email-message not between human, but between software components. It can be used within one JVM, but even between geographically distant servers, provided a safe network connection exists. Since we all know EMail this is in principle not too hard to understand, but the question is, what it really means and if it brings us something that we do not already have otherwise.

We do have web services as an established way to communicate between different servers across the network and of course they can also be used locally, if desired. Web services are neither the first nor the only way to communicate between servers nor are they the most efficient way. But I would say that they are the way how we do it in typical distributed applications that are not tied to any legacy. In principal web services are network capable and synchronous. This is well understood and works fine for many applications. But it also forces us to block processes or threads while waiting for responses, thus occupying valuable resources. And we tend to loose responsiveness, because of the waiting for the response. It needs to be observed that DB-access is typically only available synchronously. In a way understandable because of the transactions, but it also blocks resources to a huge extent, because we know that the performance of many applications is DB driven.

Now message based software architectures think mostly asynchronously. Sending a message is a „fire and forget“. There is such a thing as making message transactional, but this has to be understood correctly. There is one transaction for sending the message. It is guaranteed that the message is sent. Delivery guarantees can only be given to a limited extent, because we do not know anything about the other side and if it is at all working. This is not checked as part of the transaction. We can imagine though that the messaging system has its own transactional database and stores the message there within the transaction. It then retries delivering it forever, until it succeeds. Then it is deleted from this store as part of the receiving transaction. Both these transactions can be part of a distributed transaction and thus be combined with other transactions, usually against databases, for a combined transaction. This is what we usually have in mind when talking about this. I have to mention that the distributed transaction, usually based on the so called two phase commit, is not quite as water proof as we might hope, but it can be broken by construction of a worst case scenario regarding the timing of failures of network and systems. But it is for practical purposes reasonable good to use.

While it is extremely interesting to investigate purely message based architectures, especially in conjunction with functional paradigm, this may not be the only choice. Often it is a good option to use a combination of messaging with synchronous services.

We should observe that messaging is a more abstract concept. It can be implemented by some middle ware and even be accessible by a standardized kind of interface like JMS. But it can also be more abstract as a queuing system or as something like Akka uses for its internal communication. And messaging is not limited to Java or JVM languages. Interoperability does impose some constraints on how to use it, because it bans usage of Object-messages which store serialized Java objects, but there are ways to address this by using JSON or BSON or XML or Protocol Buffers as message contents.

What is interesting about JMS and messaging in general are two major communication modes. We can have queues, which are point to point connections. Or we can have „topics“, which are channels into which messages are sent. They are then received by all current subscribers of the topic. This is interesting to notify different components about an event happening in the system, while possibly details about the event must be queried via synchronous services or requested by further messaging via queues.

Generally JMS in Java has different implementations, usually there are those coming with the application servers and there are also some standalone implementations. They can be operated via the same interface, at least as long as we constrain us to the common set of functionality. So we can exchange the JMS implementation for the whole platform (which is a nightmare in real life), but we cannot mix them, because the wire protocol is usually incompatible. There is now something like a standard network protocol for messaging, which is followed by some, but not all implementations.

As skeptical as I am against Java Enterprise edition, I do find the JMS part of enterprise Java very interesting and worthwhile exploring for projects that have a size and characteristics justifying this.

Share Button

UTF-16 Strings in Java

Deutsch

Strings in Java and many other JVM-languages consist of Unicode content and are encoded as utf-16. It was fantastic to already consider Unicode when introducing Java in the 90es and to make it the only reasonable way to use strings, so there is no temptation to start with a „US-ASCII“-version of a software that never really gets enhanced to deal properly with non-English content and data. And it also avoids having to deal with many kinds of String encodings within the software. For outside storage in databases and files and for communication with other processes and over the network, these issues of course remain. But Java strings are always encoded utf-16. We can be sure. Most common languages can make use of these Strings easily and handling common letter based languages from Europe and western Asia is quite strait forward. Rendering may still be a challenge, but that is another issue. A major drawback of this approach is that more memory is used. This usually does not hurt too much. Memory is not so expensive and a couple of Strings even with utf-16 will not be too big. With 64-bit Java, which should be used these days, the memory limitations of the JVM are not relevant any more, they can basically use as much memory as provided.

But some applications to hit the memory limits. And since usually most of the data we are dealing with is ultimately strings and combinations of relatively long strings with some reference pointers, some integer numbers and more strings, we can say that in typical memory intensive applications strings actually consume a large fraction of the memory. This leads to the temptation of using or even writing a string library that uses utf-8 or some other more condensed internal format, while still being able to express any Unicode content. This is possible and I have done it. Unfortunately it is very painful, because Strings are quite deeply integrated into the language and explicit conversions need to be added in many places to deal with this. But it is possible and can save a lot of memory. In my case we were able to abandon this approach, because other optimizations, that were less painful, proved to be sufficient.

An interesting idea is to compress strings. If they are long enough, algorithms like gzip work on a single string. As with utf-8, selectively accessing parts of the string becomes expensive, because it can only be achieved by parsing the string from the beginning or by adding indexing structures. We just do not know which byte to go to for accessing the n-th character, even with utf-8. In reality we often do not have long strings, but rather many relatively short strings. They do not compress well by themselves. If we know our data and have a rough idea about the content of our complete set of strings, custom compression algorithm can be generated. This allows good results even for relatively short strings, as long as they are within the „language“ that we are prepared for. This is more or less similar to the step from utf-16 to utf-8, because we replace common byte sequences by shorter byte sequences and less common sequences may even get replaced by something longer. There is no gain in utf-8, if we have mostly strings that are in non-Latin alphabets. Even Cyrillic or Greek, that are alphabets similar to the Latin alphabet, will end up needing two bytes for each letter, which is not at all better than utf-16. For other alphabets it will even become worse, because three or four bytes are needed for one symbol that could easily be expressed with two bytes in utf-16. But if we know our data well enough, the approach with the specific compression will work fine. The „dictionary“ for the compression needs to be stored only once, maybe hard-coded in the software, and not in each string. It might be of interest to consider building the dictionary dynamically at run-time, like it is done with gzip, but keeping it in a common place for all strings and thus sharing it. The custom strings that I used where actually using a hard coded compression algorithm generated using a large amount of typical data. It worked fine, but was just too clumsy to use because Java is not prepared to replace String with something else without really messing around in the standard run-time libraries, which I would neither recommend nor want.

It is important to consider the following issues:

  1. Is the memory consumption of the strings really a problem?
  2. Are there easier optimizations that solve the problem?
  3. Can it just be solved by adding more hardware? Yes, this is a legitimate question.
  4. Are there solutions for the problem in the internet or even within the current organization?
  5. A new String class is so fundamental that excellent testing is absolutely mandatory. The unit tests should be very extensive and complete.
Share Button

Laziness

In conservative circles, laziness has a slightly negative connotation, but in IT it should be seen positive, if we understand it right. If we as humans get our job done in a more efficient way, by working less, this laziness is a good thing. So let’s be lazy by becoming more efficient. For now, just remember, that laziness is something good.

Laziness in software refers to performing a certain operation only when needed. Imagine we store a value that is the result of some calculation in a data structure. We can only have read access to this value, speaking in the Java-world it is a „getter“. Under certain circumstances it is functionally equivalent to store the function that calculates the value instead and to call it whenever the getter is called. Usually this is combined with the memoize pattern, so the value is retained, once it has been calculated. This whole idea breaks down if the function to calculate the value has side effects, is influenced by side effects or generally does not yield the same result on the same inputs whenever it is called. Obviously for implementing such a lazy value the parameters for the function have to be retained as well, either by storing them in the structure as well or by wrapping the function and the parameters into a parameter-less function that includes them. Obviously these parameters have to be immutable for this approach to work.

Now the question is, what do we gain from this kind of laziness?
At first glance we gain nothing, because the calculation has to be done anyway and we are just procrastinating it, so the eventual calculation of whatever we are doing will be expensive. Actually this might not be true. It is quite possible, that the value is never needed, so it would be wasteful to calculate it in advance.

Actually many of us have seen both approaches in text books for Java or C++, when the singleton pattern is explained. The typical implementation looks like this:


public class S {
    private static S instance = null;

    public static synchronized S getInstance() {
        if (instance == null) {
            instance = new S();
        }
        return instance;
    }

    public S() {
        // ...
    }

    //...
}

This „synchronized“ is a bit of a pain, but necessary to ensure that the instance is created only once. Smart people came up with the „double-check-pattern“, so it looked like this:

public class S {
    private static S instance = null;

    public static S getInstance() {
        if (instance == null) {
            synchronized (S.class) {
                if (instance == null) {
                    instance = new S();
                }
            }
        }
        return instance;
    }

    public S() {
        // ...
    }

    //...
}

It is kind of cute, because it takes into account that after the first check and before acquiring the synchronized-lock some other thread already initializes it. It is supposed to work with newer Java versions, but for sure going to fail in older versions. This is due to the memory management model. Anyway it is kind of scary, because we observe that something that really looks correct won’t work anyway. Or even worse, it will work fine during testing and miserably fail on production when we don’t expect it. But it would be so easy with eager initialization:

public class S {
    private static S instance = new S();

    public static S getInstance() {
        return instance;
    }

    public S() {
        // ...
    }

    //...
}

Or use even an enum to implement the singleton.

Other language like Scala support this kind of laziness in a more natural way.

case class S(s : String) {
  lazy val t : String = (s + s);
  def tt() : String = { t }
}

In this case t would be calculated when tt is called the first time.

In this example it does not help a lot, because the calculation of t is so cheap. But imagine we wanted to create a huge collection and then start doing something with the elements until we reach a certain goal. Eagerly initializing the collection would blow up our program, but we may be able to iterate through it or work with the first n elements where n is significantly smaller than the size of the collection would be. Just think of it as an iterate-only-collection or as a collection that is too big to keep it in memory completely.

Just do this in Clojure:

(def r (range 1000000000000N))

It would define a collection with 10^{12} elements, which would most likely exceed our memory. But it is accepted and we can do stuff with it as long as we are dealing only with elements near to the beginning. It could know its size, but it does not, so

(count r)

will fail or take forever. Or maybe work in some future version of Clojure…

So laziness allows us to write more elegant, expressive programs that have a good efficiency. Off course this can all be written by ourselves without such help, but it is a good laziness to rely on programming languages, libraries or tools that allow us to do it in such an elegant way.

Now we can find out that Hibernate and JPA use some kind of laziness. They have to, because objects tend to be connected and fetching one would require to fetch a really big bunch. Unfortunately the laziness is simply not correct, because we do have side effects, that influence the outcome. That is what databases are… So data may change, transactions may have been committed or rolled back or whatever. We get „LazyLoadingException“, when we try to access some data. And when we try to adopt a beautiful programming style that mimics functional patters in Java with non-static inner classes (prior to Java 8) in conjunction with Hibernate, it will be bound to fail miserably unless we apply absolute care about this issue.

While we are always moving on thin ice with this side-effect-dependent laziness of Hibernate and JPA, it will be really powerful in functional languages like Scala, Clojure or Haskell if we are going the extra mile to ensure that the function does not have side effects and does not depend on side effects or get influenced by side effects.

Share Button

Clojure

Functional programming languages have become a bit of a hype.

But the ideas are not really so new.
The first languages beyond Assembly language that have maintained some relevance up to today were FORTRAN, COBOL and Lisp. Indirectly also Algol, because it inspired pretty much any modern mainstream programming language in some way through some intermediate ancestors. The early Algol Variants itself have disappeared .

It can be argued if the early Lisp Dialects were really functional languages, but they did support some functional patterns and made them popular at least in certain communities. I would say that popular scripting languages like Perl, Ruby, Lua and especially JavaScript brought them to the main stream.

Lisp has always remained in its niche. But the question arose on creating a new Lisp that follows more strictly the functional paradigm and is somewhat more modern, cleaner and simpler than the traditional Lisps. It was done and it is called Clojure.

So anybody who has never used any Lisp will at first be lost, because it is a jungle of parentheses „((((())))()()()(…)“ with some minor stuff in between…
Actually that is an issue, when we move from today’s common languages to Clojure. But it is not that bad. The infix-notation is familiar to us, but it has its benefits to use one simple syntax for almost everything.

An expression that consists of a function call is written like this (function-name param1 param2 parm3...). +, -, *,…. are just functions like anything else, so if we want to write 3\cdot4 + 5\cdot6 we just write (+ (* 3 4) (* 5 6)).

In the early days of calculators it was easier to build something that works with a notion called „RPN“, so there we would write 3 ENTER 4 * 5 ENTER 6 * +, which is similar to the Lisp way, but just the other way round.

It is easy to add a different number of values:
* (+) -> 0
* (+ 7) -> 7
* (+ 1 2 3 4 5 6 7) -> 28

In Clojure functions are just normal values like numbers, arrays, lists,… that can be passed around.. It is good programming practice to rely on this where it helps. And with more experience it will be helpful more often.

Immutability is king. Most of the default structures of Clojure are immutable. They can be passed around without the fear that they might change once they have been constructed. This helps for multithreading.

Clojure provides lists, arrays, sets, hashmaps, and the sorted variants of the latter. These can be written easily:
* List: (list 1 2 3) -> (1 2 3) (entries are evaluated in this case)
* List: '(1 2 3) -> (1 2 3) (entries are not evaluated in this case)
* Array: [1 2 3] (entries are evaluated in this case)
* Set: #{1 2 3} (entries are evaluated in this case)
* Map: {1 2, 3 4} (entries are evaluated in this case. key-value-Pairs are usually grouped with a comma, which is whitespace for Clojure)

All of these are immutable. So methods that change collections, always create a copy that contains the changes. The copy can be done lazily or share data with the original.

Actually I can teach Clojure in course of two to five days duration depending on the experience of the participants and the goals they want to achieve.

There is much more to write about Clojure…

Share Button

ScalaX 2015

It was possible to arrange a visit of Scala Exchange 2015 in London, short #ScalaX.

I visited the following events:

What do +, – and * with Integer do?

When using integers in C, Java or Scala, we often use what is called int.

It is presented to us as the default.

And it is extremely fast.

Ruby uses by default arbitrary length integers.

But what do +, – and * mean?

We can rebuild them, in Ruby, kind of artificially restrict the integers to what we have in other programming langauges as int:


MODULUS = 0x100000000;
LIMIT   =  0x80000000;

def normalize(x)
  r = x % MODULUS;
  if (r < -LIMIT) then     return r + MODULUS;   elsif (r >= LIMIT) 
    return r - MODULUS;
  else
    return r;
  end
end

def intPlus(x, y)
  normalize(x+y);
end

def intMinus(x, y)
  normalize(x-y);
end

def intTimes(x, y)
  normalize(x*y);
end

x = 0x7fffffff;
y = intPlus(x, x);
z = intPlus(x, x);
puts("x=#{x} y=#{y} z=#{z}");

What is the outcome?

Exactly what you get when doing 32-Bit-Ints in C, Java, Scala or C#:

x=2147483647 y=-2 z=-2

int is always calculated modulo a power of two, usually 2^{32}. That is the
x % MODULUS in normalize(). The Rest of the function is just for normalizing the result to the range -2^{31} \ldots 2^{31}-1.

So we silently get this kind of result when an overflow situation occurs, without any notice.
The overflow is not trivial to discover, but it can be done.
For addition I have described how to do it.

See also Overflow of Integer Types, an earlier post about these issues…

I gave a talk that included this content adapted for Scala at Scala Exchange 2015.

Share Button

Devoxx 2015

This year I have had the pleasure to visit the Devoxx-Conference in Antwerp in Belgium.

I have visited the following talks:

There is a lot to write about this and I will get back to specific interesting topics in the future…

My previous visits in 2012, 2013 (part 1), 2013 (part 2), and 2014 have their own blog articles.

Share Button

Using Collections

When Java came out about 20 years, it was great to have a decent and quite extensive collection library available as part of the standard setup and ready to use.

Before that we often had to develop our own or find one of many interesting collection libraries and when writing and using APIs it was not a good idea to rely on them as part of the API.

Since Java 2 (technical name „Java 1.2“) collection interfaces have been added and now the implementation is kind of detached, because we should use the interfaces as much as possible and make the implementation exchangeable.

An interesting question arose in conjunction with concurrency. The early Java 1 default collections where synchronized all the way. In Java 2 non synchronized variants were added and became the default. Synchronization can be achieved by wrapping them or by using the old collections (that do implement the interfaces as well since Java 2).

This was a performance improvement, because most of the time it is expensive and unnecessary overhead to synchronize collections. As a matter of fact special care should be used anyway to know who is accessing a collection in what way. Even if the collection itself does not get broken by simultaneous access, your application most likely is, unless you really know what you are doing. Are you?

Now it is usually a good idea to control changes of a collection. This is achieved by wrapping it with some Collections.umodifyableXXX-method. The result is that accessing the wrapped collection with set or put will cause an exception. It was a good approach, as a first shot, but not where we want to be now.

Of the wrapped collection still references to the inner, non-wrapped collection can be around, so it can still change while being accessed. If you can easily afford it, just copy collections when taking them in or giving them out. Or go immutable all the way and wrap your own in an umnodifiable-wrapper, if that works.

What I would like to see is something along the following lines:

  • We have two kinds of collection interfaces, those that are immutable and those that are mutable.
  • The immutable should be the default.
  • We have implementations of the collections and construction facilities for the immutable collections
  • The immutable implementation is off course the default.

I do not want to advocate going immutable collections only, because that does come at a high price in terms of efficiency. The usual pattern is to still have methods that modify a collection, but these leave the original collection as it is and just create a modified copy. Usually these implementations have been done in such a smart way that they share a lot, which is no pain, because they are all immutable. No matter how smart and admirable these tricks are, I strongly doubt that they can reach the performance of modifiable collections, if modifications are actually used a lot, at least in a purely single threaded environment.

Ruby has taken an interesting approach. Collections have a method freeze that can be called to make them immutable. That is adding runtime checks, which is a good match for Ruby. Java should check this at compile time, because it is so important. Having different interfaces would do that.

I recommend checking out the guava-collection library from google. It does come with most of the issues described here addressed and I think it is the best bet at the moment for that purpose. There are some other collection libraries to explore. Maybe one is actually better then guava.

Share Button