Operator Overloading

When Java was created, the concept of operator overloading was already present in C++. I would say that it was generally well done in C++, but it kind of breaks the object-oriented polymorphism patterns of C++: the usual way was to write several overloaded functions to cover all n² combinations of operand types.

In the early days of C++ people jumped on this feature and used it for all kinds of stuff that had nothing to do with the original concept of numeric operators, like adding dialog boxes to strings and multiplying that with events. This takes us a little bit towards what APL was, which had only operators and a special character set to express all the language features, even requiring a special keyboard:

[Image: APL example]

You can find an article in Scott Locklin’s Blog about APL and other almost forgotten languages and the potential loss of some achievements that they tried to bring to us.

We see the same with some people in Scala who create a lot of operators using interesting Unicode characters. This is not necessarily wrong, but I think operators should only be used for something that is really important. Not in the sense: „I wrote functionality XYZ for library UVW, and this is really important“, but in the sense that this functionality is so commonly used that people have no problem remembering the operator. Or the operator is already known to us, like „+“, „-“, „*“, … for numeric types, but I still have no idea what adding a string to an event would mean.

In C++ it got even worse because it was possible to overload „->“ or new and thus dig deep into the language, which can be interesting when done carefully and skillfully by developers who really know what they are doing, but disastrous otherwise.

Now Java has opted not to support this operator overloading, which was wrong even at that time, but understandable, because at that time we were still more in the mindset to count bits and live with the deficiencies of int and long, and we were also seeing the weird abuses of operator overloading in C++. Maybe it was also the lack of time to design a sound mechanism for this in Java. Unfortunately this decision, which was made in a context more than 20 years ago, has kind of become religious. Interestingly James Gosling, when asked in an interview for the 20th anniversary of Java, mentioned operator overloading for numeric types as the first thing that he would have made better. (It is around minute 9.) So I hope that this undoes the religious aspect of this topic.

An interesting idea will probably be included in future versions of Scala. An operator is in principle defined as a method of the left operand, which is quite logical, but it would imply writing something like e = (a.*(b)).+(c.*(d)), possibly with fewer parentheses. Now this is recognized as an operator method, so the dots can go away as well as the parentheses and the common operator precedence applies, so e = a * b + c * d works as well and is what we find natural. Ruby and Scala are very similar in this aspect. Now some future version of Scala, maybe Scala 3, will introduce an annotation that allows the „infix“ notation for these methods and that adds a descriptive name. Error messages and even IDE support could then give us access to the descriptive name and we would be able to search for it, while searching for something like „+“ or „-“ or „*“ would not really be helpful. I think that this idea would be useful for other languages as well.

These examples demonstrate the BigInteger types of Java, C#, Scala, Clojure and Ruby, respectively:

import java.math.BigInteger;

public class JavaBigInt {

    public static void main(String[] args) {
        BigInteger f = BigInteger.valueOf(2_000_000_000L);
        BigInteger p = BigInteger.ONE;
        for (int i = 0; i < 8; i++) {
            System.out.println(i + " " +  p);
            p = p.multiply(f);
        }
    }
}

gives this output:

0 1
1 2000000000
2 4000000000000000000
3 8000000000000000000000000000
4 16000000000000000000000000000000000000
5 32000000000000000000000000000000000000000000000
6 64000000000000000000000000000000000000000000000000000000
7 128000000000000000000000000000000000000000000000000000000000000000

And the C#-version

using System;
using System.Numerics;

public class CsInt {

    public static void Main(string[] args) {
        BigInteger f = 2000000000;
        BigInteger p = 1;
        for (int i = 0; i < 8; i++) {
            Console.WriteLine(i + " " +  p);
            p *= f;
        }
    }
}

gives exactly the same output:

0 1
1 2000000000
2 4000000000000000000
3 8000000000000000000000000000
4 16000000000000000000000000000000000000
5 32000000000000000000000000000000000000000000000
6 64000000000000000000000000000000000000000000000000000000
7 128000000000000000000000000000000000000000000000000000000000000000

Or the Scala version

object ScalaBigInt {

  def main(args: Array[String]): Unit = {
    val f : BigInt = 2000000000;
    var p : BigInt = 1;
    for (i  <- 0 until 8) {
      println(i + " " + p);
      p *= f;
    }
  }
}

again with the same output:

0 1
1 2000000000
2 4000000000000000000
3 8000000000000000000000000000
4 16000000000000000000000000000000000000
5 32000000000000000000000000000000000000000000000
6 64000000000000000000000000000000000000000000000000000000
7 128000000000000000000000000000000000000000000000000000000000000000

Or in Clojure it looks like this, considerably shorter than the Java and C# versions:

(reduce (fn [x y] (println y x) (*' 2000000000 x)) 1 (range 8))

with the same values again; Clojure merely marks the big integers with a trailing N. Please observe that the multiplication needs to use "*'" instead of "*" in order to expand from fixed length integers to big integers.

0 1
1 2000000000
2 4000000000000000000
3 8000000000000000000000000000N
4 16000000000000000000000000000000000000N
5 32000000000000000000000000000000000000000000000N
6 64000000000000000000000000000000000000000000000000000000N
7 128000000000000000000000000000000000000000000000000000000000000000N

Or in Ruby it is also quite short:

f = 2000000000
p = 1
8.times do |i|
  puts "#{i} #{p}"
  p *= f;
end

This gives the same result without any special effort, because Ruby integers always expand to the needed size:

0 1
1 2000000000
2 4000000000000000000
3 8000000000000000000000000000
4 16000000000000000000000000000000000000
5 32000000000000000000000000000000000000000000000
6 64000000000000000000000000000000000000000000000000000000
7 128000000000000000000000000000000000000000000000000000000000000000

So I suggest leaving the IT theology behind and considering the pragmatic issues now.

In Java we have primitive numeric types that are basically inadequate for application development, because they tacitly overflow and because application developers usually have no idea how to deal with the rounding issues of float and double. We have good numeric types like BigInteger and BigDecimal. BigInteger supports arbitrarily long integral numbers, which do not overflow unless we exceed memory or run into addressability issues with numbers of several billion digits. BigDecimal allows for controlled rounding and also arbitrary precision.
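To make these points tangible, here is a minimal sketch. The class and method names are made up for illustration; only the JDK calls themselves are real.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;

// Illustrative only: class and method names are invented for this sketch.
class NumericPitfalls {

    // int multiplication wraps around silently: 4 * 10^18 does not fit into 32 bits
    static int silentOverflow() {
        int f = 2_000_000_000;
        return f * f;
    }

    // BigInteger grows as needed and yields the exact product
    static BigInteger exactProduct() {
        BigInteger f = BigInteger.valueOf(2_000_000_000L);
        return f.multiply(f);
    }

    // double cannot represent 0.1 or 0.2 exactly, so the comparison is false
    static boolean doubleSumIsExact() {
        return 0.1 + 0.2 == 0.3;
    }

    // BigDecimal built from strings is exact; scale and rounding are requested explicitly
    static BigDecimal controlledRounding() {
        BigDecimal sum = new BigDecimal("0.1").add(new BigDecimal("0.2"));
        return sum.divide(new BigDecimal("3"), 4, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        System.out.println("int product:        " + silentOverflow());
        System.out.println("BigInteger product: " + exactProduct());
        System.out.println("0.1 + 0.2 == 0.3 ?  " + doubleSumIsExact());
        System.out.println("(0.1 + 0.2) / 3:    " + controlledRounding());
    }
}
```

The int result is simply the low 32 bits of the exact product, which is why nothing warns us when it happens.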

Now we have to write

e = a.multiply(b).add(c.multiply(d))

instead of

e = a * b + c * d

The latter is readable, it is exactly what we mean. The former is not readable at all and the likelihood of making mistakes is very high.
I would be happy with something like this:

e = a (*) b (+) c (*) d

where overloaded operators are surrounded with () or [] or something like that.
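As a small illustration of how easily the method-call form goes wrong, here is a sketch with made-up class and method names. The second method differs from the first only in where one parenthesis closes, yet it computes something else.

```java
import java.math.BigInteger;

// Illustrative only: class and method names are invented for this sketch.
class MethodCallArithmetic {

    // e = a * b + c * d, written the only way Java allows for BigInteger
    static BigInteger sumOfProducts(BigInteger a, BigInteger b, BigInteger c, BigInteger d) {
        return a.multiply(b).add(c.multiply(d));
    }

    // an easy slip: this computes (a * b + c) * d instead
    static BigInteger misplacedParenthesis(BigInteger a, BigInteger b, BigInteger c, BigInteger d) {
        return a.multiply(b).add(c).multiply(d);
    }

    public static void main(String[] args) {
        BigInteger a = BigInteger.valueOf(2);
        BigInteger b = BigInteger.valueOf(3);
        BigInteger c = BigInteger.valueOf(4);
        BigInteger d = BigInteger.valueOf(5);
        System.out.println(sumOfProducts(a, b, c, d));        // 2 * 3 + 4 * 5
        System.out.println(misplacedParenthesis(a, b, c, d)); // (2 * 3 + 4) * 5
    }
}
```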

At some point in time a major producer of electronic calculators made us believe that it is more natural to express it like this

e a b * c d * + =

Maybe this way of writing math would be better, but it is not what we do outside of our computers and calculators. At least it was more natural to have this pattern for those who created the calculators, because it was much easier to implement in a clean way on limited hardware. We still have the opposite in Lisp, which is still quite alive as Clojure, so I use the Clojure syntax:

(def x (+ (* a b) (* c d)))

which is relatively readable after some learning and allows for a very simple and regular and powerful syntax. But even this is not how we write Math outside of our computer.

Now the good news is that Java will add "value types" in the future and considers revisiting the operator overloading issue for these value types. This may or may not solve the issue in a distant future. We should have an idea what a numeric type is. A numeric type can be more than just real and integral numbers. Just think of rational numbers, complex numbers, but even of polynomials, rational functions (quotients of polynomials), finite fields, p-adic numbers and more. We just need to talk about rings and fields in the mathematical sense and possibly subsets that do not quite follow the field semantics like Double, but that are still inspired by the field they aim to represent. Anyway, for the moment Java not having operator overloading is a degradation from something that other languages had already done well before.

Btw., please use elementary school math skills and do not write

e = (a * b) + (c * d)

That is just noise. I do not recommend memorizing all the 10 to 25 levels of operator precedence of a typical programming language, but it is good to know the basic ones that almost any serious current programming language supports, here from the strongest to the weakest binding:
* binary * /
* binary + -
* == != <= >= < >
* &&
* ||
Some use "and" and "or" instead of "&&" and "||".
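A tiny sketch of these basic levels in Java (the class and method names are invented for illustration):

```java
// Illustrative only: class and method names are invented for this sketch.
class PrecedenceDemo {

    // * binds tighter than +, and + binds tighter than ==, so the parentheses on the right add nothing
    static boolean sameWithoutParentheses(int a, int b, int c, int d) {
        return a * b + c * d == (a * b) + (c * d);
    }

    // && binds tighter than ||, so this reads as p || (q && r)
    static boolean orOfAnd(boolean p, boolean q, boolean r) {
        return p || q && r;
    }

    public static void main(String[] args) {
        System.out.println(sameWithoutParentheses(2, 3, 4, 5));
        // true: a reading as (p || q) && r would have given false
        System.out.println(orOfAnd(true, false, false));
    }
}
```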

Now using overloaded operators should be no problem.

We do have an issue when implementing it.

Imagine you have a language with five built-in numeric types. Now you add a sixth one. "+" is probably already defined for 25 combinations. With the sixth type we get a total of 36 combinations, of which we have to provide the missing 11 and a mechanism to dispatch the program flow to these. In C++ we just add 11 operator functions and that does everything. In Ruby we add a method for the left side of the operator. The methods of the existing types do not know our new type, but they deal with it by calling coerce on the right operand with the left operand as parameter. This is actually powerful enough to deal with this situation.
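Java has neither operator overloading nor Ruby's coerce protocol, so the following is only a loose analogy, not how any of these languages implements it. A hypothetical Fraction type plays the role of the newly added numeric type. It lifts the types it knows about into itself, so that one entry point handles the mixed combinations instead of one method per pair. All names here are assumptions made for this sketch.

```java
import java.math.BigInteger;

// Hypothetical "sixth numeric type": an exact fraction, not normalized for brevity.
final class Fraction {
    final BigInteger num;
    final BigInteger den;

    Fraction(BigInteger num, BigInteger den) {
        this.num = num;
        this.den = den;
    }

    static Fraction of(long num, long den) {
        return new Fraction(BigInteger.valueOf(num), BigInteger.valueOf(den));
    }

    // Rough analogue of coerce: turn an operand of a known type into a Fraction
    static Fraction lift(Object x) {
        if (x instanceof Fraction) {
            return (Fraction) x;
        }
        if (x instanceof Integer || x instanceof Long) {
            return of(((Number) x).longValue(), 1);
        }
        if (x instanceof BigInteger) {
            return new Fraction((BigInteger) x, BigInteger.ONE);
        }
        throw new IllegalArgumentException("cannot coerce " + x.getClass().getName());
    }

    // Addition within the new type
    Fraction add(Fraction o) {
        return new Fraction(num.multiply(o.den).add(o.num.multiply(den)), den.multiply(o.den));
    }

    // Mixed-type addition: works whichever side the Fraction is on
    static Fraction sum(Object left, Object right) {
        return lift(left).add(lift(right));
    }

    @Override
    public String toString() {
        return num + "/" + den;
    }

    public static void main(String[] args) {
        System.out.println(sum(1, of(1, 2)));  // Integer on the left, Fraction on the right
        System.out.println(sum(of(1, 3), 2L)); // Fraction on the left, Long on the right
    }
}
```

Two libraries that do not know each other would both fail in lift when given each other's types, which is exactly the problem described in the following paragraphs.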

It gets even more tricky when we use different libraries that do not know of each other and each of them adds numeric types. Possibly we cannot add these with each other or we can do so in a degraded manner by just falling back to double or float or rational or something like that.

The numeric types that we usually use can be added with each other, but we could hit situations where that is not the case, for example when having p-adic numbers, which can be added with rational numbers, but not with real numbers. Or finite fields, whose members can be added with integral numbers or with numbers of the same field, but not necessarily with numbers of another finite field. Fortunately these issues should occur only to people who understand them while writing libraries. Using the libraries should not be hard, if they are properly done.


Usability „Pearl“

I just found this usability pearl:

After entering a credit card number as usual with spaces between the groups of four digits, the web page complained like this:

[Screenshot: web page insisting that the form be refilled because the credit card number contained spaces]

Yes, it is easy to allow spaces. Just match the following regex
/^\s*\d{4}\s*\d{4}\s*\d{4}\s*\d{4}\s*$/
and then remove the spaces when processing it, but do not force the user to re-enter the number without spaces. That is just ridiculous.
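A minimal sketch of this in Java, using the regex from above. The class and method names are made up, and a real form would of course do further validation.

```java
import java.util.regex.Pattern;

// Illustrative only: class and method names are invented for this sketch.
class CardNumberInput {

    // The pattern from the text: four groups of four digits, whitespace allowed around each group
    private static final Pattern CARD =
            Pattern.compile("^\\s*\\d{4}\\s*\\d{4}\\s*\\d{4}\\s*\\d{4}\\s*$");

    // Returns the 16 digits without any whitespace, or null if the input does not match
    static String normalize(String input) {
        if (!CARD.matcher(input).matches()) {
            return null;
        }
        return input.replaceAll("\\s", "");
    }

    public static void main(String[] args) {
        System.out.println(normalize("1234 5678 9012 3456"));
        System.out.println(normalize("  1234567890123456 "));
        System.out.println(normalize("1234 5678 9012")); // too short, rejected
    }
}
```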


Virtual machines

We all know that Java uses a „virtual machine“, that is, it simulates a non-existing hardware which is the same independent of the real hardware, thus helping to achieve the well known platform independence of Java. Btw. this is not about virtualization like VMWare, VirtualBox, Qemu, Xen, Docker and similar tools, but about byte code interpreters like the Java-VM.

We tend to believe that this is the major innovation of Java, but actually the concept of virtual machines is very old. Lisp, UCSD-Pascal, Eumel/Elan, the Perl programming language and many other systems have used this concept long before Java. The Java guys have been good at selling this, and they deserve the credit for bringing it at the right time and getting it into the mainstream.

Earlier implementations were kind of cool, but the virtual machine technology and the hardware were too slow, so that they were not really attractive, at least not for high performance applications, which are now actually a domain of Java and other JVM languages. Some suggest that Java or other efficient JVM languages like Scala would run even faster than C++. While it may be possible to show this in examples, and the hotspot optimization gives some theoretical evidence how optimization that takes place during run time can be better than static optimization at compile time, I do not generally trust this. I doubt that well written C code for an application that is adequate for both C and Java will be outperformed by Java. But we have to take two more aspects into account, which tend to be considered kind of unlimited for many such comparisons to make them possible at all.

The JVM has two weaknesses in terms of performance. The start-up time is relatively long. This is addressed in those comparisons, because the claim to be fast is only maintained for long running server applications, where start-up time is not relevant. The hotspot optimization requires anyway a long running application in order to show its advantages. Another aspect that is very relevant is that Java uses a lot of memory. I do not really know why, because more high level languages like Perl or Ruby get along with less memory, but experience shows that this is true. So if we have a budget X to buy hardware and then put software written in C on it, we can just afford to buy more CPUs because we save on the memory or we can make use of the memory that the JVM would otherwise just use up to make our application faster. When we view the achievable performance with a given hardware budget, I am quite sure that well written C outperforms well written Java.

The other aspect is in favor of Java. We have implicitly assumed until now that the budget for development is unlimited. In practice that is not the case. While we fight with interesting, but time consuming low level issues in C, we already get work done in Java. A useful application in Java is usually finished faster than in C, again if it is in a domain that can reasonably be addressed with either of the two languages and if we do not get lost in the framework world. So if the Java application is good enough in terms of performance, which it often is, even for very performance critical applications, then we might be better off using Java instead of C to get the job done faster and to have time for optimization, documentation, testing and unit testing. Yes, I am in a perfect world now, but we should always aim for that. You could argue that the same argument is valid in terms of using a more high-level language than Java, like Ruby, Perl, Perl 6, Clojure, Scala, F#,… I’ll leave this argument to other articles in the future and in the past.

What Java has really been good at is bringing the VM technology to a level that allows real world high performance server applications and bringing it to the mainstream.
That is already a great achievement. Interestingly there have never been serious and successful efforts to actually build the JavaVM as hardware CPU and put that as a co-processor into common PCs or servers. It would have been an issue with the upgrade to Java 8, because that was an incompatible change, but other than that the JavaVM remained pretty stable. As we see, the hotspot optimization is now so good that the urge for such hardware is not so strong.

Now the JVM has been built around the Java language, which was quite legitimate, because that was the only goal in the beginning. It is even started using the command line tool java (or sometimes javaw on MS-Windows 32/64 systems). The success of Java made the JVM widespread and efficient, so it became attractive to run other languages on it. There are more than 100 languages on the JVM. Most of them are not very relevant. A couple of them are part of the Java world, because they are or used to be specific micro languages closely related to Java to achieve certain goals in the JEE world, like the now almost obsolete JSP, JavaFX, …

Relevant languages are Scala, Clojure, JRuby, Groovy and JavaScript. I am not sure about Jython, Ceylon and Kotlin. There are interesting ideas coming up here and there, like running Haskell under the name Frege on the JVM. And I would love to see a language become mainstream that just adds operator overloading to Java, using some preprocessor that translates for example „(+)“ in infix syntax to „.add(..)“, to allow seriously using numeric types in Java.

Now Perl 6 started its development around 2000. They were at that time assuming that the JVM is not a good target for a dynamic language to achieve good performance. So they started developing Parrot as their own VM. The goal was to share Parrot between many dynamic languages like Ruby, Python, Scheme and Perl 6, which would have made inter-language interoperation more easily achievable, including using libraries from one of these languages in one of the others. It would not have been trivial, because I am quite sure that we would have come across the issue that each language has another set of basic types, so strings and numbers would have to be converted to the strings and numbers of the library language when calling, but it would have been interesting.

In the end Parrot was a very interesting project, theoretically very sound, and it looked like for example the Ruby guys went for it even faster than the Perl guys, resulting in an implementation called Cardinal. But the relevant Perl 6 implementation, Rakudo, eventually went for its own VM, Moar. Ruby also built itself a new, better VM. Many other languages, including Ruby and JavaScript, also went for the JVM, at least as one implementation variant. Eventually the JVM proved to be successful even in this area. The argument to start Parrot in the first place was that the JVM is not good for dynamic languages. I believe that this was true around 2000. But the JVM has vastly improved since then, even resulting in Java being a serious alternative to C for many high performance server applications. And it has been improved for dynamic languages, mostly by adding the „invokedynamic“ feature, which also proved to be useful for implementing Java 8 lambdas. The experience in translating dynamic languages to the JVM and executing them there has grown. So in the end Parrot has become kind of obsolete; it seems to be maintained, but is hardly used for any mainstream projects. In the end we have Perl 6 now and Parrot was an important stepping stone on this path, even if it becomes obsolete. The question of interoperability between different scripting languages remains interesting…


Primitives, Objects and Autoboxing

The type system in Java makes a difference between so called „primitives“, which are boolean, byte, char, short, int, long, float and double, and objects, which are anything derived from Object in object-oriented philosophy, including the special case of arrays, which I will not discuss today.

Primitive types have many operations that are kind of natural to perform on them, like arithmetic. They behave as values, so they are actually copied, which is no big deal, because they are at most 64 bits in size, which is in modern Java implementations the size of a pointer when using references. Now a major benefit of object orientation is arguably the polymorphism and this has been heavily used when implementing useful libraries like the collection classes, which were based mostly on Object and thus able to handle anything derived from Object. This has not changed with generics, they are just another way of writing this and adding some compile time checks and casts in a more readable way, as long as the generics constructions remain simple and under control. Actually I like this approach and find it much more healthy than templates in C++, but this is an IT-theological discussion that is not too relevant for this article.

Now there is a necessity of using collections for numeric types. Even though I do recommend to thoroughly think about using types like BigInteger and BigDecimal, there are absolutely legitimate uses of long, int, boolean, double, char and less frequently short, byte and float. The only one of these that is really flawless is boolean, while the floating point numbers, the fixed size integral numbers and the Strings and chars in Java have serious flaws, some of which I have discussed in other articles.

Now we need to use the wrapper types Integer, Long, Double and Boolean instead of int, long, double and boolean to store them in collections. This comes with some overhead, because these wrappers use some additional memory and the wrapping and unwrapping costs some time. Usually this does not pose a problem and using these wrappers is often an acceptable approach. Now we would be tempted to just work with the wrappers, but that is impossible, because the natural operations for the underlying boolean and numeric types just do not work with the wrappers, so we have to unwrap (or unbox) them.

Now Java includes a feature called „autoboxing and autounboxing“ which tries to create a wrapper object around a primitive when in an object context and which extracts the primitive when in a primitive context. This can be enforced by casting, to be sure.

There are some dangers in using this feature. The most interesting case is the „==“-operator. For objects and also for the wrappers of the primitives this always compares object identity based on the pointer address. For primitives that is simply impossible and the comparison compares the value. I think that it was a mistake to define the „==“-operator like that and it should do a semantic comparison and there should be something else for object identity, but that cannot be changed any more for Java. So we get some confusion when comparing boxed primitives with == or even worse when comparing boxed and unboxed primitives. Another confusion occurs, when using autounboxing and the wrapper object is null. This creates of course a NullPointerException, but it is kind of hard to spot where it actually comes from.
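Both dangers can be reproduced with a few lines. This is a sketch with made-up names. Note that the identity comparison of the large values depends on the JVM's wrapper cache, so I only claim the result for the small values, for which sharing is guaranteed, and for equals.

```java
// Illustrative only: class and method names are invented for this sketch.
class BoxedEquality {

    // Small values come from the wrapper cache, so both references point to the same object
    static boolean smallBoxesShareIdentity() {
        Integer a = Integer.valueOf(100);
        Integer b = Integer.valueOf(100);
        return a == b;
    }

    // == compares references here; outside the cached range this is typically false
    static boolean largeBoxesShareIdentity() {
        Integer a = Integer.valueOf(1000);
        Integer b = Integer.valueOf(1000);
        return a == b;
    }

    // equals compares the wrapped values and is what we usually mean
    static boolean largeBoxesAreEqual() {
        Integer a = Integer.valueOf(1000);
        Integer b = Integer.valueOf(1000);
        return a.equals(b);
    }

    // Auto-unboxing a null wrapper fails at the assignment, with no visible method call
    static boolean unboxingNullThrows() {
        Integer missing = null;
        try {
            int x = missing; // effectively missing.intValue()
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("100 == 100 (boxed):    " + smallBoxesShareIdentity());
        System.out.println("1000 == 1000 (boxed):  " + largeBoxesShareIdentity());
        System.out.println("1000 equals 1000:      " + largeBoxesAreEqual());
        System.out.println("null unboxing throws:  " + unboxingNullThrows());
    }
}
```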

So I do see some value in using explicit boxing and unboxing to make things clearer. It is a good thing to talk about this in the team and find a common way. Now the interesting question is how boxing and unboxing are done. We are tempted to use something like this:

int x = ...;
Integer xObj = new Integer(x);

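This works, but it is not good, because it creates too many objects, and the constructor has been deprecated since Java 9. We can reuse the wrapper objects, and Java provides for this by reusing them for some small numbers (by default the values from -128 to 127). The recommended way for explicit boxing is this: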

int x = ...;
Integer xObj = Integer.valueOf(x);

This can reuse values. If we are using this a lot and know that our range of commonly used numbers is reasonably small but still beyond what Java assumes, it is not too hard to write something like „IntegerUtil“ and use it:

int x = ...;
Integer xObj = IntegerUtil.valueOf(x);

Look if you can find an implementation that fits your needs, instead of writing it. But it is no pain to write it.
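For illustration, a minimal sketch of what such an „IntegerUtil“ could look like. The cached range is an arbitrary assumption and would have to be chosen to fit the application.

```java
// Hypothetical helper as mentioned in the text; the range limits are assumptions.
final class IntegerUtil {

    private static final int LOW = -1_000;
    private static final int HIGH = 100_000;

    // One shared wrapper object per value in [LOW, HIGH], created once at class loading
    private static final Integer[] CACHE = new Integer[HIGH - LOW + 1];

    static {
        for (int i = 0; i < CACHE.length; i++) {
            CACHE[i] = Integer.valueOf(LOW + i);
        }
    }

    private IntegerUtil() {
    }

    // Returns the shared wrapper inside the range and falls back to Integer.valueOf outside
    static Integer valueOf(int x) {
        if (x >= LOW && x <= HIGH) {
            return CACHE[x - LOW];
        }
        return Integer.valueOf(x);
    }

    public static void main(String[] args) {
        Integer a = IntegerUtil.valueOf(50_000);
        Integer b = IntegerUtil.valueOf(50_000);
        System.out.println(a == b); // same object, because both come from the cache
    }
}
```

The price is the memory for the array, which is paid once, whether or not all values are ever used.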
Unboxing is also easy:

Integer xObj = ...;
int x = xObj.intValue();

The methods intValue(), longValue(), doubleValue(),… are actually in the base class Number, so it is possible to unbox and cast in one step with these.

Decide how much readability you want.

It is useful to look at the static methods of the wrapper classes even for converting numbers to Strings and Strings to numbers. Avoid using constructors, they are rarely necessary and some neat optimizations that the Java libraries give us for free only work when we use the right methods. This does not make a huge difference, but doing it right does not hurt, but rather makes code more readable.
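A few of these conversions side by side, as a small sketch. The wrapping class and its method names are made up; the calls into Integer and Number are the standard ones.

```java
// Illustrative only: class and method names are invented for this sketch.
class WrapperConversions {

    // Unbox and widen in one step, using a method inherited from Number
    static long widen(Integer boxed) {
        return boxed.longValue();
    }

    // String to primitive without creating a wrapper object
    static int parse(String s) {
        return Integer.parseInt(s);
    }

    // String to wrapper; valueOf may hand out a cached object
    static Integer parseBoxed(String s) {
        return Integer.valueOf(s);
    }

    // Primitive to String without string concatenation tricks
    static String format(int x) {
        return Integer.toString(x);
    }

    public static void main(String[] args) {
        System.out.println(widen(Integer.valueOf(42)));
        System.out.println(parse("123") + 1);
        System.out.println(parseBoxed("123"));
        System.out.println(format(-7));
    }
}
```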

It is also interesting to see how the extended numeric types like BigInteger and BigDecimal work similarly to the wrapper types, and how to use them right.

Another interesting option is to use collection implementations that are specific to primitives. This may add to the complexity of our code, because it gives up another piece of polymorphism, but they can really save our day by giving better performance. And in cases where we actually know for sure that the data always belongs to a certain primitive type, I find this even idiomatic.

Other languages have solved the issues discussed here in a more elegant way by avoiding this two sided world of primitives and wrappers or by making the conversions less dangerous and more natural. They have operator overloading for numeric types and they use a more consistent concept of equality than Java.


Numeric types in Perl

Dealing with numeric types in Perl is not as straightforward as in other programming languages. We can use „scalars“ out of the box, but then we get floating point numbers, more precisely what is called „double“ in most programming languages. This is kind of ok for trivial programs, but we should make a deliberate choice on what to use.

Actually the Perl programming language gives us (at least) two more choices. We can use 64-bit integers (or 32-bit on some platforms) by just adding

use integer;

somewhere in the beginning of the file. This causes Perl to work mostly with integers instead of floating point numbers, but the rules for this are not so obvious. You may read about them in the official documentation or in other explanations.

Now we do want to control this on a more fine-grained basis than the whole program. There may be legitimate programs that use both floating point numbers and integers. This can be achieved in Perl as well. We can turn this off using:

no integer;

More likely we want to use another approach, that looks more natural and more robust most of the time. We just have to use blocks:

#!/usr/bin/perl -w                                  
                                                                                        
use strict;                                                                             
                                                                                        
my $f1 = 2_000_000_000;                                                                
my $f2 = $f1 * $f1;                                                                  
my $f3 = $f1 * $f2;                                                                  
my $f4 = $f1 * $f3;                                                                  
my $f5 = $f1 * $f4;                                                                  
                                                                                        
my @f = (1, $f1, $f2, $f3, $f4, $f5);                                              
for (my $i = 0; $i <= 5; $i++) {
    print($i, " ", $f[$i], "\n");
}

my $n2x;
{
    use integer;
    my $n1 = 2_000_000_000;
    my $n2 = $n1 * $n1;
    my $n3 = $n1 * $n2;
    my $n4 = $n1 * $n3;
    my $n5 = $n1 * $n4;

    my @n = (1, $n1, $n2, $n3, $n4, $n5);
    for (my $i = 0; $i <= 5; $i++) {
        print($i, " ", $n[$i], "\n");
    }
    $n2x = $n2;
}

print "n2x=$n2x\n";

my $g1 = 2_000_000_000;
my $g2 = $g1 * $g1;
my $g3 = $g1 * $g2;
my $g4 = $g1 * $g3;
my $g5 = $g1 * $g4;

my @g = (1, $g1, $g2, $g3, $g4, $g5);
for (my $i = 0; $i <= 5; $i++) {
    print($i, " ", $g[$i], "\n");
}
                                                                                 
This will output:                                                                       
                                                                                  
0 1                                                                                     
1 2000000000                                                                            
2 4000000000000000000                                                                   
3 8e+27                                                                                 
4 1.6e+37                                                                               
5 3.2e+46                                                                               
0 1                                                                                     
1 2000000000                                                                            
2 4000000000000000000                                                                   
3 -106958398427234304                                                                   
4 3799332742966018048                                                                   
5 7229403301836488704                                                                   
n2x=4000000000000000000                                                                 

So we see that the integer mode is constrained to the block. And we see that the results for 3, 4 and 5 went wrong...

So it may be a little bit tricky to do this, but we can. These integers have the same flaw as integers in many popular programming languages, because they silently overflow by taking the remainder modulo 2^{64} that lies in the interval [-2^{63}, 2^{63}-1] or modulo 2^{32} that lies in the interval [-2^{31}, 2^{31}-1]. I do not think that is really what we usually want and just hoping that our numbers remain within the safe range may go well in the 64-bit-case, but we have to be sure and explain this in a comment, when we work like this. Usually we do not want to think about this and spending a few extra bits costs less than hunting obscure bugs where everything looks so correct.
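The same wrap-around can be reproduced with Java's long, which also uses 64-bit two's complement arithmetic. The following sketch, with made-up names, computes 2000000000³ once with silent overflow and once exactly, and then reduces the exact value modulo 2^{64} into the interval [-2^{63}, 2^{63}-1] by hand. Both yield -106958398427234304, the value that the Perl run above shows for the third power.

```java
import java.math.BigInteger;

// Illustrative only: class and method names are invented for this sketch.
class WrapAround {

    // 64-bit product as long computes it: silently reduced modulo 2^64
    static long wrappedCube() {
        long f = 2_000_000_000L;
        return f * f * f;
    }

    // Exact product, then reduced modulo 2^64 into [-2^63, 2^63 - 1] explicitly
    static BigInteger reducedCube() {
        BigInteger two64 = BigInteger.ONE.shiftLeft(64);
        BigInteger two63 = BigInteger.ONE.shiftLeft(63);
        BigInteger r = BigInteger.valueOf(2_000_000_000L).pow(3).mod(two64); // in [0, 2^64)
        return r.compareTo(two63) >= 0 ? r.subtract(two64) : r;
    }

    // Math.multiplyExact refuses to overflow and throws instead
    static boolean exactMultiplyDetectsOverflow() {
        long f = 2_000_000_000L;
        try {
            Math.multiplyExact(Math.multiplyExact(f, f), f);
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(wrappedCube());
        System.out.println(reducedCube());
        System.out.println(exactMultiplyDetectsOverflow());
    }
}
```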

Our friend is

use bigint;

which switches to arbitrary precision integers.

#!/usr/bin/perl -w

use strict;

my $f1 = 2_000_000_000;
my $f2 = $f1 * $f1;
my $f3 = $f1 * $f2;
my $f4 = $f1 * $f3;
my $f5 = $f1 * $f4;

my @f = (1, $f1, $f2, $f3, $f4, $f5);
for (my $i = 0; $i <= 5; $i++) {
    print($i, " ", $f[$i], "\n");
}

my $b2x;
{
    use bigint;
    my $b1 = 2_000_000_000;
    my $b2 = $b1 * $b1;
    my $b3 = $b1 * $b2;
    my $b4 = $b1 * $b3;
    my $b5 = $b1 * $b4;

    my @b = (1, $b1, $b2, $b3, $b4, $b5);
    for (my $i = 0; $i <= 5; $i++) {
        print($i, " ", $b[$i], "\n");
    }
    $b2x = $b2;
}

print "b2x=$b2x\n";

my $g1 = 2_000_000_000;
my $g2 = $g1 * $g1;
my $g3 = $g1 * $g2;
my $g4 = $g1 * $g3;
my $g5 = $g1 * $g4;

my @g = (1, $g1, $g2, $g3, $g4, $g5);
for (my $i = 0; $i <= 5; $i++) {
    print($i, " ", $g[$i], "\n");
}

This gives us the output:

0 1
1 2000000000
2 4000000000000000000
3 8e+27
4 1.6e+37
5 3.2e+46
0 1
1 2000000000
2 4000000000000000000
3 8000000000000000000000000000
4 16000000000000000000000000000000000000
5 32000000000000000000000000000000000000000000000
b2x=4000000000000000000
0 1
1 2000000000
2 4000000000000000000
3 8e+27
4 1.6e+37
5 3.2e+46

So it is again constrained to the block, but it allows us to use arbitrary lengths of integers, as long as our memory is sufficient.

A less commonly used, but interesting approach is to work with rational numbers:

#!/usr/bin/perl -w                                      
                                                                                       
use strict;                                                                            
use bigrat;                                                                            
                                                                                       
my $x = 3/4;                                                                          
my $y = 4/5;                                                                          
my $z = 5/6;                                                                          
print("x=$x y=$y z=$z\n");                                                          
                                                                                       
my $sum = $x+$y+$z;                                                                
my $diff = $x - $y;                                                                 
my $prod = $x * $x * $z;                                                           
my $quot = $x / $y;                                                                 
print("sum=$sum diff=$diff prod=$prod quot=$quot\n");                              

This gives us:

x=3/4 y=4/5 z=5/6
sum=143/60 diff=-1/20 prod=15/32 quot=15/16

That is kind of cool...
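
For comparison, the same calculation can be sketched in Python, where rational numbers come from the standard library module fractions instead of a pragma. This is only meant as an illustration in another language, not as a replacement for the Perl code above.

```python
from fractions import Fraction

x = Fraction(3, 4)
y = Fraction(4, 5)
z = Fraction(5, 6)

total = x + y + z   # exact rational arithmetic, no rounding
diff = x - y
prod = x * x * z
quot = x / y
print(f"sum={total} diff={diff} prod={prod} quot={quot}")
```

The results are reduced to lowest terms automatically, just like with bigrat.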

There is also something like Math::BigFloat which can be used most easily by having

use bignum;

Find the documentation about "use bignum" and about Math::BigFloat...

You will find more numeric types, like Math::Decimal and Math::Complex.

While I would say that using good numeric types in Perl is not quite as easy as it should be, particularly if we want to mix them, we do at least have the means to use adequate numeric types. And it is way better than in Java.


Perl 6

Perl 6 quietly reached its first production-ready release on Christmas 2015, called v6.c. It will be interesting to explore what this language can do, which features it offers and how it compares in different aspects to existing relevant and interesting languages like Java, C, Ruby, Perl (5), Clojure, Scala, F#, C++, Python, PHP and others. It looks like Perl 5 is here to stay and will be continued. Perl 6 should actually be considered a different programming language than Perl 5, so the name is somewhat misleading, because it suggests slightly more similarity than there really is. On the other hand, it was done by the same people, and good ideas and concepts from Perl 5 were retained, so it does look somewhat similar.

Today I will just provide some links

* Main web page
* Documentation
* Documentation II
* Rakudo: Currently the major implementation
* Wikipedia
* Wikipedia (Russian)
* Wikipedia (Spanish)
* Wikipedia (French)
* Wikipedia (Norwegian)

Maybe I will write more about this in the future…


UTF-16 Strings in Java

Strings in Java and many other JVM languages consist of Unicode content and are encoded as utf-16. It was fantastic to already consider Unicode when introducing Java in the 1990s and to make it the only reasonable way to use strings, so there is no temptation to start with a „US-ASCII“ version of a software that never really gets enhanced to deal properly with non-English content and data. It also avoids having to deal with many kinds of string encodings within the software. For outside storage in databases and files and for communication with other processes and over the network, these issues of course remain. But Java strings are always encoded as utf-16. We can be sure. Most common languages can make use of these strings easily, and handling common letter-based languages from Europe and western Asia is quite straightforward. Rendering may still be a challenge, but that is another issue. A major drawback of this approach is that more memory is used. This usually does not hurt too much. Memory is not so expensive, and a couple of strings, even with utf-16, will not be too big. With 64-bit Java, which should be used these days, the memory limitations of the JVM are not relevant any more, because it can basically use as much memory as is provided.

But some applications do hit the memory limits. And since most of the data we are dealing with is ultimately strings, or combinations of relatively long strings with some reference pointers, some integer numbers and more strings, we can say that in typical memory intensive applications strings actually consume a large fraction of the memory. This leads to the temptation of using or even writing a string library that uses utf-8 or some other more condensed internal format, while still being able to express any Unicode content. This is possible and I have done it. Unfortunately it is very painful, because strings are quite deeply integrated into the language and explicit conversions need to be added in many places to deal with this. But it is possible and can save a lot of memory. In my case we were able to abandon this approach, because other optimizations that were less painful proved to be sufficient.
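
How much a more condensed encoding can save depends on the content. A quick way to get a feeling for this is to encode a few sample strings both ways and count the bytes. The following Python sketch does that; the sample strings and the function name encoded_sizes are arbitrary choices for this illustration.

```python
samples = {
    "Latin": "hello",
    "Cyrillic": "привет",
    "Japanese": "日本語",
}


def encoded_sizes(text):
    """Return (code points, UTF-8 bytes, UTF-16 bytes) for a string."""
    return (len(text), len(text.encode("utf-8")), len(text.encode("utf-16-le")))


for label, text in samples.items():
    print(label, encoded_sizes(text))
```

The Latin sample needs half the bytes in utf-8, the Cyrillic one needs the same number of bytes in both encodings, and the Japanese one is larger in utf-8 than in utf-16.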

An interesting idea is to compress strings. If they are long enough, algorithms like gzip work on a single string. As with utf-8, selectively accessing parts of the string becomes expensive, because it can only be achieved by parsing the string from the beginning or by adding indexing structures. We just do not know which byte to go to for accessing the n-th character, even with utf-8. In reality we often do not have long strings, but rather many relatively short strings. They do not compress well by themselves. If we know our data and have a rough idea about the content of our complete set of strings, a custom compression algorithm can be generated. This allows good results even for relatively short strings, as long as they are within the „language“ that we are prepared for. This is more or less similar to the step from utf-16 to utf-8, because we replace common byte sequences by shorter byte sequences, and less common sequences may even get replaced by something longer. There is no gain in utf-8 if we have mostly strings that are in non-Latin alphabets. Even Cyrillic or Greek, which are alphabets similar to the Latin alphabet, will end up needing two bytes for each letter, which is not at all better than utf-16. For other alphabets it will even become worse, because three or four bytes are needed for one symbol that could easily be expressed with two bytes in utf-16. But if we know our data well enough, the approach with the specific compression will work fine. The „dictionary“ for the compression needs to be stored only once, maybe hard-coded in the software, and not in each string. It might be of interest to consider building the dictionary dynamically at run-time, like it is done with gzip, but keeping it in a common place for all strings and thus sharing it. The custom strings that I used were actually using a hard-coded compression algorithm generated using a large amount of typical data. It worked fine, but was just too clumsy to use, because Java is not prepared to replace String with something else without really messing around in the standard run-time libraries, which I would neither recommend nor want.
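
The idea of a dictionary that is stored once and shared by all strings can be tried out with the preset dictionary feature of zlib. The following Python sketch is not the custom algorithm described above, just a stand-in technique; the dictionary content and the function names are hypothetical.

```python
import zlib

# Hypothetical shared dictionary: byte sequences we expect to be common in our data.
# It is stored once in the program, not in each compressed string.
SHARED_DICTIONARY = b"customer_name customer_address order_number"


def compress_with_dictionary(text):
    """Compress a short string using the shared preset dictionary."""
    compressor = zlib.compressobj(zdict=SHARED_DICTIONARY)
    return compressor.compress(text.encode("utf-8")) + compressor.flush()


def decompress_with_dictionary(data):
    """Restore a string that was compressed with the same preset dictionary."""
    decompressor = zlib.decompressobj(zdict=SHARED_DICTIONARY)
    return (decompressor.decompress(data) + decompressor.flush()).decode("utf-8")


short_string = "customer_address order_number"
with_dictionary = compress_with_dictionary(short_string)
without_dictionary = zlib.compress(short_string.encode("utf-8"))
print(len(short_string), len(without_dictionary), len(with_dictionary))
```

For a short string like this, plain compression does not help, while the version with the shared dictionary becomes noticeably smaller, as long as the string is within the „language“ that the dictionary was prepared for.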

It is important to consider the following issues:

  1. Is the memory consumption of the strings really a problem?
  2. Are there easier optimizations that solve the problem?
  3. Can it just be solved by adding more hardware? Yes, this is a legitimate question.
  4. Are there solutions for the problem on the internet or even within the current organization?
  5. A new String class is so fundamental that excellent testing is absolutely mandatory. The unit tests should be very extensive and complete.

Scanning, sorting and processing large numbers of photos

I guess for most of us this is more an issue of private life than something done professionally, and those who do this for money should already have answers for everything… But the IT aspects of this are interesting anyway…

So some of us, including me, have hundreds or thousands of photographs that have been created using analog photography. I am still using it, because I have good equipment and because the prices and availability of films, prints and scanning of the negatives to a CD are still good. I am neither willing to give that up nor to make a major investment. That will come some day in the future, and I expect that within five to ten years the reasonably priced and ubiquitous offers for handling negative films and prints will disappear.

Anyway it is a good idea to scan all the slides and negatives, at least the ones that are of any interest. It is then easier and cheaper to copy them, to get prints and to do some improvements with software like Gimp prior to creating prints. It is also possible to use and share them online.

Scanning with a flatbed scanner is not an option for negatives. It works with prints, but I think that it is too slow and I do not like the loss in quality due to the unnecessary intermediate step. This leaves two options: getting a negative scanner myself or using a service. So for now let us assume that they are already scanned. I organize the photos in a directory structure. The names should contain only 7-bit ASCII characters and no spaces, so that they are more easily accessible by scripts and on the shell. I have scripts to rename them to this pattern, for directories and for files. They can be found under my github project „photo processing scripts“ with names:
* rename-canonical
* rename-dir-canonical
* rename-dirs-radically
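
To illustrate what such a canonical name could look like, here is a minimal Python sketch. It is not taken from the scripts listed above, and the rules it applies (strip accents, drop everything that is not ASCII, replace spaces and other unusual characters by an underscore) are assumptions made for this example.

```python
import re
import unicodedata


def canonical_name(name):
    """Return a 7-bit ASCII variant of a file name without spaces (assumed rules)."""
    # Split accented letters into base letter plus combining mark.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop everything that is not ASCII, including the combining marks.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Replace runs of spaces and other unusual characters by a single underscore.
    return re.sub(r"[^A-Za-z0-9._-]+", "_", ascii_only)


print(canonical_name("Été à Zürich 01.jpg"))
```

Letters without a decomposition, like „ß“, are simply dropped by this sketch, so a real script would probably want a transliteration table for them.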
Another interesting issue is finding and removing duplicates, but since the name of the file and its position in the file system do have some meaning, this needs some attention. When two identical files A and B are found, there are five resolutions:
* rm A (remove A, leave B)
* rm B (remove B, leave A)
* rm A ; ln -s B A (make A a softlink to B)
* rm B ; ln -s A B (make B a softlink to A)
* rm B ; ln -f A B (make B a hardlink of A. Apart from the inode number this is equivalent to the opposite direction)
Which of these is actually preferred? My script picks the last option, but does not actually perform it. Instead it just creates shell commands as output, which can be piped to a file or directly to sh, in which case they are immediately executed. Otherwise it is possible to edit the commands, filter them or even change them with a one-liner in the Perl programming language. This can be found here:
* find-dups
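
The approach of printing commands instead of executing them can be sketched as follows in Python. This is not the find-dups script itself, only an illustration of the idea; the function names are hypothetical, and identical files are detected here by comparing SHA-256 digests, which is an assumption of this sketch.

```python
import hashlib
import os
import shlex
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def duplicate_commands(root):
    """Return shell commands that would turn duplicates below root into hardlinks."""
    by_digest = defaultdict(list)
    for directory, subdirs, files in os.walk(root):
        subdirs.sort()  # deterministic traversal order
        for name in sorted(files):
            path = os.path.join(directory, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_digest[file_digest(path)].append(path)
    commands = []
    for paths in by_digest.values():
        keep = paths[0]
        for other in paths[1:]:
            if os.path.samefile(keep, other):
                continue  # already hardlinks of each other
            commands.append("rm {0} ; ln -f {1} {0}".format(
                shlex.quote(other), shlex.quote(keep)))
    return commands


if __name__ == "__main__":
    # Nothing is changed here; the output can be reviewed, edited or piped to sh.
    for command in duplicate_commands("."):
        print(command)
```

Keeping the decision and the execution separate like this makes it easy to review what would happen before any file is touched.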
For viewing the photos in the browser, I have added another script, that is called
* create-foto-index
It searches the current directory and all subdirectories recursively, except those starting with a dot („.“). For each image file a thumbnail image is needed, which is either found in the .thumbs directory or created using the script
* scale-image
Then an index.html file is created in each directory, having links to its child directories, the neighboring directories and the parent directory. For each image the thumbnail is included and it is a link to the full sized image. With this it is easily possible to view the whole album in a browser locally.
Some images know their orientation already from the camera or phone, but they appear wrong anyway. These can be fixed automatically by running the script
* auto-rotate
in the directory.
I have a web server and a CGI-script running:
* cgi/mark-images.cgi
which allows me to mark images with a checkbox or with a string. I use the letters „D“ for delete, „R“ for rotate right (90 degrees clockwise), „L“ for rotate left (90 degrees counter clockwise) and „F“ for flip (rotate 180 degrees) and then press the OK button.
Afterwards I run the script
* rotate-checked
which deletes and rotates the images according to the choices in the form.

This is already quite a useful situation. Images that are needed for prints or for the web might need some processing with GIMP:
* possibly rotate them in such a way that the horizon is horizontal and vertical lines are vertical, at least in the middle of the image.
* possibly correct perspective
* possibly sharpen
* possibly correct contrast and brightness
* possibly correct color saturation and colors
* cut out what is really interesting
* save it under a different name
* call create-foto-index again.
The webform and the CGI-script can be used for picking which images to edit. After having pressed OK it will be done like this:

gimp `egrep 'jpg$' </var/lib/wwwrun/mark-fotos/marked.dat` &

In a similar way images from a directory can be selected in indexf.html and then extracted to a ZIP:

zip my-archive.zip `egrep 'jpg$' </var/lib/wwwrun/mark-fotos/marked.dat`

which can be given to somebody or uploaded for creating prints or just unpacked in another directory to have only the good images.

There are some more issues, which I might address in another article.


Will Java, C, C++ and C# be the new Cobols?

A few decades ago most programming was performed in Cobol (I do not want to shout it), Fortran, Rexx and some typical mainframe languages that hardly made it to the Linux, Unix or MS-Windows world. They are still present, but they are mostly used for maintenance and extension of existing software and less often for writing new software from scratch.

These days languages like C, C++, Java and to a slightly lesser extent C# dominate the list of most commonly used languages. I would assume that JavaScript is also quite prominent in the list, because it has become more popular to write rich web clients using frameworks like AngularJS. And there are tons of them and some really good stuff. Some people like to see JavaScript also on the server side, and in spite of really interesting frameworks like Node.js I do not really consider this a good idea. If you like you may add Objective C to this list, which I do not know very well at all, even though it has been part of my gcc since my first Linux installation in the early 1990s.

Anyway, C goes back to the 1970s and I think that it was a great language to create at that time, and it still is for a certain set of purposes. When writing operating systems, database engines, compilers and interpreters for other languages, editors, or embedded software, everything that is very close to the hardware, explicit control and direct access to very powerful OS-APIs are features that prove to be useful. It has been said that Java runs as fast as C, which is at least close to the truth, but only if we do not take into account the memory usage. C has some shortcomings that could be done better without sacrificing its strengths in the areas where it is useful, but it does not seem to be happening.

C++ has been the OO-extension of C, but I would say that it has evolved into a totally different language for different purposes, even though there is some overlap. It is relatively easy to call functionality written in C from C++ and a little bit harder the other way round… I have not used it very much recently, so I will refrain from commenting further on it.

Java has introduced an infrastructure that is very common now with its virtual machine. The JVM is running on a large number of servers and any JVM language can be used there. The platform independence is an advantage, but I think that its importance on servers has diminished a little bit. There used to be all kinds of servers with different operating systems and different CPU architectures. But now we are moving towards servers being mostly Linux with Intel-compatible CPUs, so it is becoming less of an issue. This may change in the future again, of course.

With Mono, C# can be used in ways similar to Java; at least that is what the theory says, and it seems to be quite true up to a certain level. C# seems to be a bit ahead of Java with some language features. Just think of operator overloading, undeclared exceptions, properties, generics or lambdas, which have been introduced earlier or in a more elegant way, or for which we are still waiting in Java. I think the case of lambdas also shows the limitations, because they seem to behave differently than you would expect from real closures, which is the way lambdas should be done and are done in more functionally oriented languages or even in the Ruby programming language, in the Perl programming language or in typical Lisps.

Try this:

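using System;
using System.Collections.Generic;

// Each lambda captures the variable itself, not its value at that moment.
List<Func<int>> actions = new List<Func<int>>();

int variable = 0;
while (variable < 5)
{
    actions.Add(() => variable * 2);
    ++variable;
}

// The lambdas are only invoked after the loop has finished.
foreach (var act in actions)
{
    Console.WriteLine(act.Invoke());
}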

We would expect the output 0, 2, 4, 6, 8, but we are getting 10, 10, 10, 10, 10 (one number per line, respectively). All five lambdas refer to the same variable, which has the value 5 by the time they are invoked.
But it can be fixed:

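using System;
using System.Collections.Generic;

List<Func<int>> actions = new List<Func<int>>();

int variable = 0;
while (variable < 5)
{
    // A fresh variable per iteration, so each lambda captures its own copy.
    int copy = variable;
    actions.Add(() => copy * 2);
    ++variable;
}

foreach (var act in actions)
{
    Console.WriteLine(act.Invoke());
}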
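
The same effect can be observed in other languages whose lambdas capture variables rather than values. As a small cross-check, here is a Python sketch of both variants; the function names late_bound and value_bound are made up for this illustration.

```python
def late_bound():
    """All lambdas share the one variable and see its final value."""
    actions = []
    variable = 0
    while variable < 5:
        actions.append(lambda: variable * 2)
        variable += 1
    return [act() for act in actions]


def value_bound():
    """A default argument freezes the current value for each lambda."""
    actions = []
    variable = 0
    while variable < 5:
        actions.append(lambda copy=variable: copy * 2)
        variable += 1
    return [act() for act in actions]


print(late_bound())   # [10, 10, 10, 10, 10]
print(value_bound())  # [0, 2, 4, 6, 8]
```

In both languages the fix is the same idea: give each lambda its own copy of the value at the time it is created.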

I would say that the concept of inner classes is better in Java, even though what is static there should be the default, but with lambdas this is less important…
You find more issues with class loaders, which are kind of hard to tame in Java, but extremely powerful.

Anyway, I think that all of these languages tend to be similar in their syntax, at least within a method or function, and require a lot of boilerplate code. Another issue that I see is that the basic types, which include strings, even if they are not seen as basic types by the language design, are not powerful enough or full of pitfalls.

While the idea to use just null terminated character arrays as strings in C has its beauty, I think it is actually not really good enough, and for more serious C applications a more advanced string library would be good, with the disadvantage that different libraries will end up using different string libraries… Anyway, for stuff that is legitimately done with C now, this is not so much of an issue, and legacy software has its own legacy ways of dealing with strings anyway, possibly with painful limitations in conjunction with Unicode. Java and also C# were introduced at a time when Unicode was already around and the standard already claimed to use more than 65536 code points (characters in Unicode-speak), but at that time 65536 seemed to be quite ok to cover the needs of all common languages, and so utf-16 was picked as an encoding. This blows up the memory, because strings occupy most of the memory in typical application software, and it still leaves us with uncertainties about length and position, because code points can be one or two 16-bit „characters“ long. This can only be seen by actually iterating through the string, which leaves us where we were with null terminated strings in C. And strings are really hard to replace or enhance in this aspect, because they are used everywhere.
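
The uncertainty about length and position can be made visible with a string that contains one code point outside the first 65536. The following Python sketch spells out the utf-16 code units explicitly; the helper names are hypothetical and the scan assumes well-formed utf-16.

```python
def utf16_units(text):
    """Return the list of 16-bit code units of the utf-16 encoding of text."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def code_unit_offset(units, n):
    """Return the index of the code unit where the n-th code point (0-based) starts."""
    offset = 0
    for _ in range(n):
        # A high surrogate announces a code point that occupies two code units.
        offset += 2 if 0xD800 <= units[offset] <= 0xDBFF else 1
    return offset


text = "a\U0001F600b"           # three code points, the middle one needs a surrogate pair
units = utf16_units(text)
print(len(text), len(units))     # 3 code points, 4 code units
print(code_unit_offset(units, 2))
```

Finding the position of the n-th code point takes a scan from the beginning, which is exactly the cost that a fixed-width encoding was supposed to avoid.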

Numbers are not good either. As application developers we should not care about counting bits, unless we are in an area that needs to be specifically optimized. We are using mostly integer types in application development, or at least we should. These overflow silently. Just to see it in C#:

using System;

int i = 0;
int s = 1;
// Multiply by 7 twenty times; 7^12 no longer fits into a 32-bit int.
for (i = 0; i < 20; i++)
{
    s *= 7;
    Console.WriteLine("i=" + i + " s=" + s);
}

which gives us:

i=0 s=7
i=1 s=49
i=2 s=343
i=3 s=2401
i=4 s=16807
i=5 s=117649
i=6 s=823543
i=7 s=5764801
i=8 s=40353607
i=9 s=282475249
i=10 s=1977326743
i=11 s=956385313
i=12 s=-1895237401
i=13 s=-381759919
i=14 s=1622647863
i=15 s=-1526366847
i=16 s=-2094633337
i=17 s=-1777531471
i=18 s=442181591
i=19 s=-1199696159

So it silently overflows, or more precisely it takes the remainder modulo 2^{32} in the representation system \{-2^{31}, \ldots, 2^{31}-1\}. Java, C and C++ behave exactly the same way, only that we need to know what „int“ means for our C compiler, but if we use 32-bit ints, it is the same. This should throw an exception or switch to some unlimited long integer. Clojure offers both options, depending on whether you use * or *' as operator. So as application developers we should not have to care about these bits, and most developers do not think about it. Usually it goes well, but a lot of software bugs are around due to this pattern. It is just wrong in C#, Java, and C++. In C I find it more acceptable, because the typical area for using C for new software actually is quite related to bits and bytes, so the developers need to be aware of such issues all the time anyway.
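
The two desirable behaviours, throwing an exception or switching to an unlimited integer, can be sketched in Python, where integers are unlimited by default. The checked multiplication below is a hypothetical helper for a 32-bit range, written only to show at which step of the loop above the first overflow would be reported.

```python
INT32_MIN = -2 ** 31
INT32_MAX = 2 ** 31 - 1


def checked_mul_int32(a, b):
    """Multiply two integers and raise OverflowError if the result needs more than 32 bits."""
    result = a * b
    if not INT32_MIN <= result <= INT32_MAX:
        raise OverflowError("{} * {} does not fit into 32 bits".format(a, b))
    return result


def first_overflow(base=7, steps=20):
    """Return the loop index at which s *= base first leaves the 32-bit range, or None."""
    s = 1
    for i in range(steps):
        try:
            s = checked_mul_int32(s, base)
        except OverflowError:
            return i
    return None


print(first_overflow())  # index of the first step that no longer fits
print(7 ** 20)           # the unlimited integer simply keeps growing
```

With a checked operation the problem shows up at i=11, exactly where the C# output above silently starts to show wrong values.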

I would consider it desirable to move to more expressive languages like Clojure, Scala, F#, Ruby or Perl for application development. Ruby and Perl have better strings. Clojure and Scala inherit them from the JVM, and F# has the same strings as C#. Ruby and Clojure have a good way to deal with integers; Scala, Perl and F# can do it right if we actually want to do so, but not by default. Perl and Ruby are very weak when it comes to multithreading. As compared to Java this can be dealt with by just using more processes instead of threads, because the overhead of a Ruby or Perl process is much less than the overhead of a Java process, but I would see this as a major drawback. C, C#, Java and C++ offer good facilities to use multithreading, but the issue of avoiding typical multithreading bugs is a big deal and actually too hard for a large fraction of typical application developers. Or at least too far away from their point of focus. Moving to a more functional paradigm might be a way to go. Java enterprise edition is a failure if the goal is to get multithreading done well without having to worry about it, because the overhead is too much. On the other hand, if you are willing to go the extra mile, having more explicit access to the multithreading mechanism and using it correctly is extremely powerful, for example in C with pthreads or with a deliberate usage of processes, shared memory and threads together. For which kind of projects do we have the time and the team for this? I am not talking about multithreaded applications that work well on the developer’s laptop, but fail during some high load processing in production with some concurrent modification issues a few months after the deployment. Thinking cannot be replaced by testing.

So now we have a lot of software in C, C++, Java and C# and a lot of new software is written in these languages, even from scratch. We could do better, sometimes we do, sometimes we don’t. It is possible to write excellent application software with Java, C++, C# and even C. It just takes a bit longer, but if we use them with care, it will be ok. Some companies are very conservative and want to use stuff that has been around for a long time. This is sometimes right and sometimes wrong. And since none of the more modern languages has really picked up so much speed that it can be considered a new mainstream, it is understandable that some organizations are scared of marching into a dead end.

On the other hand, many businesses can differentiate themselves by providing services that are only possible by having a very innovative IT. Banks like UBS and Credit Suisse in Switzerland are not likely to be there, while banks like ING are on that road. As long as they compete for totally different customer bases and as long as the business has enough strengths that are not depending so heavily on an innovative IT, but just on a working robust IT, this will be fine. But time moves on and innovation will eventually out-compete conservative businesses.


Oracle buys the NSA

Wonderful coincidences have been discovered.

The NSA has excellent technological knowledge, especially in the area of storing and processing huge amounts of data. And the US government wants to move one step ahead with privatization of state run activities. So the US government and the Oracle corporation have agreed to sell the NSA to Oracle. Oracle will be able to be the leading DB vendor, even with higher prices than today, because of unprecedented technological advances. The NSA will become more efficient when run as part of a private company. And the whole country becomes more efficient, because less lobbying is needed for companies if they control important organizations directly and not indirectly through the US government. Another important synergy is that additional backups of databases will now be available.
