Karl Brodowsky's IT-Blog – Page 22 – IT Sky Consulting GmbH

Scala Days 2016

I visited Scala Days in Berlin from 2016-06-15 to 2016-06-17. A little remark on the format might be of interest. The conference is scheduled for three days. On the first day there is only one talk, the first keynote, some time in the late afternoon. During Scala Days 2015 the rest of that day was put to use by organizing a Scala training session, where volunteers could teach Scala to other volunteers who wanted to learn it. But I think two or three sessions on the first day would be better and would still leave room for the first keynote in the late afternoon. The venue and of course Berlin were great and I enjoyed the whole event.

The talks that I visited were:

Wednesday 2016-06-15

First keynote: Scala’s Road Ahead by Martin Odersky about the future of Scala. Very interesting ideas for future versions that are currently explored in dotty.

Thursday 2016-06-16

Friday 2016-06-17

Summary

The whole event was great, I got a lot of inspiration and met great people. Looking forward to the next event.

I might write more on some topics, where I consider it interesting, but for the moment this summary should be sufficient.

LinkedIn will probably be bought by Microsoft

The social business online network LinkedIn will probably be bought by Microsoft for around 26’200’000’000 USD.

Even if Microsoft has become a bit more trustworthy with Nadella than it was with Ballmer, it remains an interesting question how much we should trust them concerning our data. The same issue arose with Skype and other acquisitions in the past. Deleting an account probably does not change much, because it might just delete the access point to the account, not actually the data.

And anyway the NSA could query LinkedIn just as well as Microsoft.

In any case it is interesting to know about this acquisition.

Links

Find more links yourself, it won’t be hard.

Operator Overloading

When Java was created, the concept of operator overloading was already present in C++. I would say that it was generally well done in C++, but it kind of breaks the object oriented polymorphism patterns of C++ and the usual way was to have several overloaded functions to allow for all n² combinations.

In the early days of C++ people jumped on this feature and used it for all kinds of stuff that has nothing to do with the original concept of numeric operators, like adding dialog boxes to strings and multiplying that with events. This gets us a little bit towards what APL was, which had only operators and a special charset to allow for all the language features, requiring even a special keyboard.

You can find an article in Scott Locklin’s Blog about APL and other almost forgotten languages and the potential loss of some achievements that they tried to bring to us.

We see the same with some people in Scala who create a lot of operators using interesting Unicode characters. This is not necessarily wrong, but I think operators should only be used for something that is really important. Not in the sense: „I wrote functionality XYZ for library UVW, and this is really important“, but in the sense that this functionality is so commonly used that people have no problem remembering the operator. Or the operator is already known to us, like „+“, „-“, „*“, … for numeric types, but I still have no idea what adding a string to an event would mean.

In C++ it got even worse, because it was possible to overload „->“ or new and thus to dig deep into the language, which can be interesting when used carefully and skillfully by developers who really know what they are doing, but disastrous otherwise.

Now Java has opted not to support this operator overloading, which was wrong even at that time, but understandable, because at that time we were still more in the mindset to count bits and live with the deficiencies of int and long, and we were also seeing the weird abuses of operator overloading in C++. Maybe it was also the lack of time to design a sound mechanism for this in Java. Unfortunately this decision, which was made in a context more than 20 years ago, has kind of become religious. Interestingly James Gosling, when asked in an interview for the 20-year anniversary of Java, mentioned operator overloading for numeric types as the first thing that he would have made better. (It is around minute 9.) So I hope that this undoes the religious aspect of this topic.

An interesting idea will probably be included in future versions of Scala. An operator is in principle defined as a method of the left operand, which is quite logical, but it would imply writing something like e = (a.*(b)).+(c.*(d)), possibly with fewer parentheses. Now this is recognized as an operator method, so the dots can go away as well as the parentheses, and the common operator precedence applies, so e = a * b + c * d works as well and is what we find natural. Ruby and Scala are very similar in this aspect. Now some future version of Scala, maybe Scala 3, will introduce an annotation that allows the „infix“ notation for these methods and that adds a descriptive name. Error messages and even IDE support could then give us access to the descriptive name and we would be able to search for it, while searching for something like „+“ or „-“ or „*“ would not really be helpful. I think that this idea would be useful for other languages as well.

These examples demonstrate the BigInteger types of Java, C#, Scala, Clojure and Ruby, respectively:

import java.math.BigInteger;

public class JavaBigInt {

    public static void main(String[] args) {
        BigInteger f = BigInteger.valueOf(2_000_000_000L);
        BigInteger p = BigInteger.ONE;
        for (int i = 0; i < 8; i++) {
            System.out.println(i + " " +  p);
            p = p.multiply(f);
        }
    }
}

gives this output:

0 1
1 2000000000
2 4000000000000000000
3 8000000000000000000000000000
4 16000000000000000000000000000000000000
5 32000000000000000000000000000000000000000000000
6 64000000000000000000000000000000000000000000000000000000
7 128000000000000000000000000000000000000000000000000000000000000000

And the C#-version

using System;
using System.Numerics;

public class CsInt {

    public static void Main(string[] args) {
        BigInteger f = 2000000000;
        BigInteger p = 1;
        for (int i = 0; i < 8; i++) {
            Console.WriteLine(i + " " +  p);
            p *= f;
        }
    }
}

gives exactly the same output:

0 1
1 2000000000
2 4000000000000000000
3 8000000000000000000000000000
4 16000000000000000000000000000000000000
5 32000000000000000000000000000000000000000000000
6 64000000000000000000000000000000000000000000000000000000
7 128000000000000000000000000000000000000000000000000000000000000000

Or the Scala version

object ScalaBigInt {

  def main(args: Array[String]): Unit = {
    val f: BigInt = 2000000000
    var p: BigInt = 1
    for (i <- 0 until 8) {
      println(s"$i $p")
      p *= f
    }
  }
}

0 1
1 2000000000
2 4000000000000000000
3 8000000000000000000000000000
4 16000000000000000000000000000000000000
5 32000000000000000000000000000000000000000000000
6 64000000000000000000000000000000000000000000000000000000
7 128000000000000000000000000000000000000000000000000000000000000000

Or in Clojure it looks like this, considerably shorter than the Java and C# versions:

(reduce (fn [x y] (println y x) (*' 2000000000 x)) 1 (range 8))

with the same output again, apart from the N suffix that Clojure prints for big integers, but a much shorter program. Please observe that the multiplication needs to use "*'" instead of "*" in order to expand from fixed length integers to big integers.

0 1
1 2000000000
2 4000000000000000000
3 8000000000000000000000000000N
4 16000000000000000000000000000000000000N
5 32000000000000000000000000000000000000000000000N
6 64000000000000000000000000000000000000000000000000000000N
7 128000000000000000000000000000000000000000000000000000000000000000N

Or in Ruby it is also quite short:

f = 2000000000
p = 1
8.times do |i|
  puts "#{i} #{p}"
  p *= f
end

Same result, without any special effort, because integers always expand to the needed size:

0 1
1 2000000000
2 4000000000000000000
3 8000000000000000000000000000
4 16000000000000000000000000000000000000
5 32000000000000000000000000000000000000000000000
6 64000000000000000000000000000000000000000000000000000000
7 128000000000000000000000000000000000000000000000000000000000000000

So I suggest leaving the IT theology behind and considering the pragmatic issues now.

In Java we have primitive numeric types that are basically inadequate for application development, because they tacitly overflow and because application developers usually have no idea how to deal with the rounding issues of float and double. We have good numeric types like BigInteger and BigDecimal to support arbitrarily long integral numbers, which do not overflow unless we run into memory or addressability issues with numbers of several billion digits. BigDecimal allows for controlled rounding and also for arbitrary precision.
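The overflow and rounding behaviour mentioned here can be reproduced in a few lines. This is only a minimal sketch; the class name NumericPitfalls and its method names are made up for illustration.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;

public class NumericPitfalls {

    // int arithmetic wraps around silently on overflow
    static int silentOverflow() {
        return Integer.MAX_VALUE + 1;
    }

    // Math.multiplyExact reports the overflow instead of hiding it
    static boolean overflowIsDetected() {
        try {
            Math.multiplyExact(2_000_000_000, 2);
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    // BigInteger simply grows to the needed size
    static BigInteger bigProduct() {
        return BigInteger.valueOf(2_000_000_000L).pow(3);
    }

    // BigDecimal lets us choose scale and rounding mode explicitly
    static BigDecimal roundedQuotient() {
        return new BigDecimal("2").divide(new BigDecimal("3"), 4, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        System.out.println(silentOverflow());       // -2147483648
        System.out.println(overflowIsDetected());   // true
        System.out.println(bigProduct());           // 8000000000000000000000000000
        System.out.println(0.1 + 0.2 == 0.3);       // false, binary rounding of double
        System.out.println(new BigDecimal("0.1").add(new BigDecimal("0.2"))); // 0.3
        System.out.println(roundedQuotient());      // 0.6667
    }
}
```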

Now we have to write
e = a.multiply(b).add(c.multiply(d))
instead of
e = a * b + c * d
The latter is readable, it is exactly what we mean. The former is not readable at all and the likelihood of making mistakes is very high.
I would be happy with something like this:
e = a (*) b (+) c (*) d
where overloaded operators are surrounded with () or [] or something like that.
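To make the readability argument concrete, here is a small sketch with actual values. The class and method names are hypothetical; the second method shows how easily a chain of method calls ends up computing a different expression than intended.

```java
import java.math.BigInteger;

public class BigExpression {

    // e = a * b + c * d, written with method calls
    static BigInteger sumOfProducts(BigInteger a, BigInteger b, BigInteger c, BigInteger d) {
        return a.multiply(b).add(c.multiply(d));
    }

    // An easy slip: this computes (a * b + c) * d, not a * b + c * d
    static BigInteger misplacedParenthesis(BigInteger a, BigInteger b, BigInteger c, BigInteger d) {
        return a.multiply(b).add(c).multiply(d);
    }

    public static void main(String[] args) {
        BigInteger a = BigInteger.valueOf(2);
        BigInteger b = BigInteger.valueOf(3);
        BigInteger c = BigInteger.valueOf(4);
        BigInteger d = BigInteger.valueOf(5);
        System.out.println(sumOfProducts(a, b, c, d));        // 2*3 + 4*5 = 26
        System.out.println(misplacedParenthesis(a, b, c, d)); // (2*3 + 4)*5 = 50
    }
}
```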

At some point in time a major producer of electronic calculators made us believe that it is more natural to express it like this
e a b * c d * + =
Maybe this way of writing math would be better, but it is not what we do outside of our computers and calculators. At least it was more natural to have this pattern for those who created the calculators, because it was much easier to implement in a clean way on limited hardware. We still have the opposite in Lisp, which is still quite alive as Clojure, so I use the Clojure syntax:
(def x (+ (* a b) (* c d)))
which is relatively readable after some learning and allows for a very simple and regular and powerful syntax. But even this is not how we write Math outside of our computer.

Now the good news is that Java will add "value types" in the future and will consider revisiting the operator overloading issue for these value types. This may or may not solve the issue in a distant future. We should have an idea what a numeric type is. A numeric type can be more than just real and integral numbers. Just think of rational numbers, complex numbers, but even of polynomials, rational functions (quotients of polynomials), finite fields, p-adic numbers and more. We just need to talk about rings and fields in the mathematical sense and possibly subsets that do not quite follow the field semantics, like Double, but that are still inspired by the field they aim to represent. Anyway, for the moment Java not having operator overloading is a degradation from something that other languages had already done well before.

Btw., please use elementary school math skills and do not write
e = (a * b) + (c * d)
That is just noise. I do not recommend memorizing all the 10 to 25 levels of operator precedence of a typical programming language, but it is good to know the basic ones, which almost any serious current programming language supports:
* binary * /
* binary + -
* == != <= >= < >
* &&
* ||
Some use "and" and "or" instead of "&&" and "||".
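The basic precedence levels listed above behave like this in Java. This is a small sketch with made-up method names; the parenthesized variant is only there to show that it computes the same thing.

```java
public class Precedence {

    // multiplication binds tighter than addition
    static int plain(int a, int b, int c, int d) {
        return a * b + c * d;
    }

    // same result, the parentheses add nothing
    static int noisy(int a, int b, int c, int d) {
        return (a * b) + (c * d);
    }

    // comparisons bind tighter than &&, which binds tighter than ||
    static boolean inRangeOrForced(int x, int lo, int hi, boolean forced) {
        return lo <= x && x <= hi || forced;
    }

    public static void main(String[] args) {
        System.out.println(plain(2, 3, 4, 5));                 // 26
        System.out.println(noisy(2, 3, 4, 5));                 // 26
        System.out.println(inRangeOrForced(5, 1, 10, false));  // true
        System.out.println(inRangeOrForced(11, 1, 10, false)); // false
        System.out.println(inRangeOrForced(11, 1, 10, true));  // true
    }
}
```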

Now using overloaded operators should be no problem.

We do have an issue when implementing it.

Imagine you have a language with five built-in numeric types. Now you add a sixth one. "+" is probably already defined for 25 combinations. With the sixth type we get a total of 36 combinations, of which we have to provide the missing 11 and a mechanism to dispatch the program flow to these. In C++ we just add 11 operator functions and that does everything. In Ruby we add a method for the left side of the operator. The methods of the existing types do not know our new type, but they deal with it by calling coerce on the right operand with the left operand as parameter. This is actually powerful enough to deal with this situation.
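The coerce idea can be imitated in Java, even without operators. The following is only a sketch under assumptions: the interface Num and the classes Int and Rational are hypothetical, Int stands for an existing type that knows nothing about later additions, and the method coerce is modelled loosely on what Ruby does, not copied from it.

```java
public class CoerceSketch {

    /** Hypothetical common interface for all numeric types in this sketch. */
    interface Num {
        Num plus(Num right);
        /** Lifts a value of another type into the type of this object. */
        Num coerce(Num left);
    }

    /** Stands for an existing type that does not know the new type. */
    static final class Int implements Num {
        final long value;

        Int(long value) { this.value = value; }

        @Override public Num plus(Num right) {
            if (right instanceof Int) {
                return new Int(Math.addExact(value, ((Int) right).value));
            }
            // Unknown right operand: let it lift us into its own type, then add there.
            return right.coerce(this).plus(right);
        }

        @Override public Num coerce(Num left) {
            if (left instanceof Int) {
                return left;
            }
            throw new ArithmeticException("cannot coerce " + left + " to Int");
        }

        @Override public String toString() { return Long.toString(value); }
    }

    /** The newly added type; only this class has to know about Int. */
    static final class Rational implements Num {
        final long num;
        final long den;

        Rational(long num, long den) {
            if (den == 0) {
                throw new ArithmeticException("zero denominator");
            }
            long g = gcd(Math.abs(num), Math.abs(den));
            long sign = den < 0 ? -1 : 1;
            this.num = sign * num / g;
            this.den = sign * den / g;
        }

        private static long gcd(long a, long b) { return b == 0 ? a : gcd(b, a % b); }

        @Override public Num plus(Num right) {
            Rational r = (Rational) coerce(right);
            return new Rational(num * r.den + r.num * den, den * r.den);
        }

        @Override public Num coerce(Num left) {
            if (left instanceof Rational) {
                return left;
            }
            if (left instanceof Int) {
                return new Rational(((Int) left).value, 1);
            }
            throw new ArithmeticException("cannot coerce " + left + " to Rational");
        }

        @Override public String toString() { return num + "/" + den; }
    }

    public static void main(String[] args) {
        System.out.println(new Int(1).plus(new Rational(1, 2)));  // 3/2
        System.out.println(new Rational(1, 2).plus(new Int(1)));  // 3/2
    }
}
```

The point of the design is that Int never mentions Rational: the 11 missing combinations are all handled inside the new class.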

It gets even more tricky when we use different libraries that do not know of each other and each of them adds numeric types. Possibly we cannot add these with each other or we can do so in a degraded manner by just falling back to double or float or rational or something like that.

The numeric types that we usually use can be added with each other, but we could hit situations where that is not the case, for example when having p-adic numbers, which can be added with rational numbers, but not with real numbers. Or finite fields, whose members can be added with integral numbers or with numbers of the same field, but not necessarily with numbers of another finite field. Fortunately these issues should occur only to people who understand them while writing libraries. Using the libraries should not be hard, if they are properly done.

Usability „Pearl“

I just found this usability pearl:

After entering a credit card number as usual with spaces between the groups of four digits, the web page complained like this:

[Screenshot: web page insisting that the form be refilled with the credit card number without spaces]

Yes, it is easy to allow spaces. Just match the following regex
/^\s*\d{4}\s*\d{4}\s*\d{4}\s*\d{4}\s*$/
and then remove the spaces when processing it, but do not force the user to enter the number again without spaces. That is just ridiculous.
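In Java this could look roughly like the following sketch. The class name CardNumberInput and the method normalize are made up for illustration, and the code only deals with the spaces; it does not check whether the number is a valid card number.

```java
import java.util.Optional;
import java.util.regex.Pattern;

public class CardNumberInput {

    // The regex from the text: four groups of four digits, optional whitespace around them
    private static final Pattern CARD =
            Pattern.compile("^\\s*\\d{4}\\s*\\d{4}\\s*\\d{4}\\s*\\d{4}\\s*$");

    // Returns the 16 digits without whitespace, or empty if the input does not match
    static Optional<String> normalize(String input) {
        if (input == null || !CARD.matcher(input).matches()) {
            return Optional.empty();
        }
        return Optional.of(input.replaceAll("\\s", ""));
    }

    public static void main(String[] args) {
        System.out.println(normalize(" 1234 5678 9012 3456 ")); // Optional[1234567890123456]
        System.out.println(normalize("1234567890123456"));      // Optional[1234567890123456]
        System.out.println(normalize("1234 5678 9012"));        // Optional.empty
    }
}
```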

Virtual machines

We all know that Java uses a „virtual machine“, that is, it simulates a non-existing hardware which is the same independent of the real hardware, thus helping to achieve the well known platform independence of Java. Btw. this is not about virtualization like VMWare, VirtualBox, Qemu, Xen, Docker and similar tools, but about byte code interpreters like the Java-VM.

We tend to believe that this is the major innovation of Java, but actually the concept of virtual machines is very old. Lisp, UCSD-Pascal, Eumel/Elan, the Perl programming language and many other systems used this concept long before Java. The Java guys were good at selling it, and they deserve the credit for bringing it at the right time and getting it into the mainstream.

Earlier implementations were kind of cool, but the virtual machine technology and the hardware were too slow, so that they were not really attractive, at least not for high performance applications, which are now actually a domain of Java and other JVM languages. Some suggest that Java or other efficient JVM languages like Scala would run even faster than C++. While it may be possible to show this in examples, and the hotspot optimization gives some theoretical evidence how optimization that takes place during run time can be better than static optimization at compile time, I do not generally trust this. I doubt that well written C code for an application that is adequate for both C and Java will be outperformed by Java. But we have to take two more aspects into account, which tend to be considered kind of unlimited for many such comparisons to make them possible at all.

The JVM has two weaknesses in terms of performance. The start-up time is relatively long. This is addressed in those comparisons, because the claim to be fast is only maintained for long running server applications, where start-up time is not relevant. The hotspot optimization requires anyway a long running application in order to show its advantages. Another aspect that is very relevant is that Java uses a lot of memory. I do not really know why, because more high level languages like Perl or Ruby get along with less memory, but experience shows that this is true. So if we have a budget X to buy hardware and then put software written in C on it, we can just afford to buy more CPUs because we save on the memory or we can make use of the memory that the JVM would otherwise just use up to make our application faster. When we view the achievable performance with a given hardware budget, I am quite sure that well written C outperforms well written Java.

The other aspect is in favor of Java. We have implicitly assumed until now that the budget for development is unlimited. In practice that is not the case. While we fight with interesting, but time consuming low level issues in C, we already get work done in Java. A useful application in Java is usually finished faster than in C, again if it is in a domain that can reasonably be addressed with either of the two languages and if we do not get lost in the framework world. So if the Java application is good enough in terms of performance, which it often is, even for very performance critical applications, then we might be better off using Java instead of C to get the job done faster and to have time for optimization, documentation, testing, unit testing… Yes, I am in a perfect world now, but we should always aim for that. You could argue that the same argument is valid in terms of using a more high-level language than Java, like Ruby, Perl, Perl 6, Clojure, Scala, F#,… I’ll leave this argument to other articles in the future and in the past.

What Java has really been good at is bringing the VM technology to a level that allows real world high performance server applications and bringing it to the mainstream.
That is already a great achievement. Interestingly there have never been serious and successful efforts to actually build the Java VM as a hardware CPU and put that as a co-processor into common PCs or servers. It would have been an issue with the upgrade to Java 8, because that was an incompatible change, but other than that the Java VM remained pretty stable. As we see, the hotspot optimization is now so good that the urge for such hardware is not so strong.

Now the JVM has been built around the Java language, which was quite legitimate, because that was the only goal in the beginning. It is even started using the command line tool java (or sometimes javaw on MS-Windows 32/64 systems). The success of Java made the JVM widespread and efficient, so it became attractive to run other languages on it. There are more than 100 languages on the JVM. Most of them are not very relevant. A couple of them are part of the Java world, because they are or used to be specific micro languages closely related to Java to achieve certain goals in the JEE world, like the now almost obsolete JSP, JavaFX, …

Relevant languages are Scala, Clojure, JRuby, Groovy and JavaScript. I am not sure about Jython, Ceylon and Kotlin. There are interesting ideas coming up here and there, like running Haskell under the name Frege on the JVM. And I would love to see a language become mainstream that just adds operator overloading and provides some preprocessor to achieve this by translating for example „(+)“ in infix syntax to „.add(..)“, to allow seriously using numeric types in Java.

Now Perl 6 started its development around 2000. At that time they were assuming that the JVM was not a good target for a dynamic language to achieve good performance. So they started developing Parrot as their own VM. The goal was to share Parrot between many dynamic languages like Ruby, Python, Scheme and Perl 6, which would have made inter-language interoperation more easily achievable and allowed using libraries from one of these languages in one of the others. It would not have been trivial, because I am quite sure that we would have come across the issue that each language has another set of basic types, so strings and numbers would have to be converted to the strings and numbers of the library language when calling, but it would have been interesting.

In the end Parrot was a very interesting project, theoretically very sound, and it looked like for example the Ruby guys went for it even faster than the Perl guys, resulting in an implementation called Cardinal. But the relevant Perl 6 implementation, Rakudo, eventually went for its own VM, MoarVM. Ruby also built itself a new, better VM. Many other languages, including Ruby and JavaScript, also went for the JVM, at least as one implementation variant. Eventually the JVM proved to be successful even in this area. The argument to start Parrot in the first place was that the JVM is not good for dynamic languages. I believe that this was true around 2000. But the JVM has vastly improved since then, even resulting in Java being a serious alternative to C for many high performance server applications. And it has been improved for dynamic languages, mostly by adding the „invokedynamic“ feature, which also proved to be useful for implementing Java 8 lambdas. The experience in transforming dynamic languages to the JVM and executing them there has grown. So in the end Parrot has become kind of obsolete and seems to be maintained, but hardly used for any mainstream projects. In the end we have Perl 6 now and Parrot was an important stepping stone on this path, even if it becomes obsolete. The question of interoperability between different scripting languages remains interesting…

Primitives, Objects and Autoboxing

The type system in Java makes a difference between so called „primitives“, which are boolean, byte, short, char, int, long, float and double, and objects, which are anything derived from Object in object oriented philosophy, including the special case of arrays, which I will not discuss today.

Primitive types have many operations that are kind of natural to perform on them, like arithmetic. They behave as values, so they are actually copied, which is no big deal, because they are at most 64 bits in size, which in modern Java implementations is the size of a pointer when using references. Now a major benefit of object orientation is arguably the polymorphism, and this has been heavily used when implementing useful libraries like the collection classes, which were based mostly on Object and thus able to handle anything derived from Object. This has not changed with generics; they are just another way of writing this and adding some compile time checks and casts in a more readable way, as long as the complexity of the generics constructions remains simple and under control. Actually I like this approach and find it much more healthy than templates in C++, but this is an IT-theological discussion that is not too relevant for this article.

Now there is a necessity of using collections for numeric types. Even though I do recommend thinking thoroughly about using types like BigInteger and BigDecimal, there are absolutely legitimate uses of long, int, boolean, double, char and less frequently short, byte and float. The only one of these that is really flawless is boolean, while the floating point numbers, the fixed size integral numbers and the Strings and chars in Java have serious flaws, some of which I have discussed in other articles.

Now we need to use the wrapper types Integer, Long, Double and Boolean instead of int, long, double and boolean to store them in collections. This comes with some overhead, because these wrappers use some additional memory and the wrapping and unwrapping costs some time. Usually this does not pose a problem and using these wrappers is often an acceptable approach. Now we would be tempted to just work with the wrappers, but that is impossible, because the natural operations for the underlying boolean and numeric types just do not work with the wrappers, so we have to unwrap (or unbox) them.

Now Java includes a feature called „autoboxing and autounboxing“ which tries to create a wrapper object around a primitive when in an object context and which extracts the primitive when in a primitive context. This can be enforced by casting, to be sure.

There are some dangers in using this feature. The most interesting case is the „==“-operator. For objects and also for the wrappers of the primitives this always compares object identity based on the pointer address. For primitives that is simply impossible and the comparison compares the value. I think that it was a mistake to define the „==“-operator like that and it should do a semantic comparison and there should be something else for object identity, but that cannot be changed any more for Java. So we get some confusion when comparing boxed primitives with == or even worse when comparing boxed and unboxed primitives. Another confusion occurs, when using autounboxing and the wrapper object is null. This creates of course a NullPointerException, but it is kind of hard to spot where it actually comes from.
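The two traps described here can be shown in a short sketch. The class and method names are made up; the result of comparing two boxed values of 1000 with == is deliberately only described as typical, because it depends on how far the JVM caches wrapper objects.

```java
import java.util.Objects;

public class BoxedComparison {

    // Null-safe semantic comparison of two wrappers
    static boolean sameValue(Integer a, Integer b) {
        return Objects.equals(a, b);
    }

    // Auto-unboxing a null wrapper throws a NullPointerException
    static boolean unboxingNullFails() {
        Integer missing = null;
        try {
            int x = missing; // implicit call of missing.intValue()
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Integer small1 = 127;
        Integer small2 = 127;
        Integer large1 = 1000;
        Integer large2 = 1000;
        System.out.println(small1 == small2);       // true: boxed values from -128 to 127 are cached
        System.out.println(large1 == large2);       // typically false: two distinct objects
        System.out.println(large1.equals(large2));  // true: compares the values
        System.out.println(sameValue(large1, null)); // false, and no exception
        System.out.println(unboxingNullFails());    // true
    }
}
```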

So I do see some value in using explicit boxing and unboxing to make things clearer. It is a good thing to talk about this in the team and find a common way. Now the interesting question is how boxing and unboxing are done. We are tempted to use something like this:
int x = ...;
Integer xObj = new Integer(x);
This works, but it is not good, because it creates a new object every time. Wrapper objects can be reused, and Java provides for this and reuses them for some small numbers. The recommended way for explicit boxing is this:
int x = ...;
Integer xObj = Integer.valueOf(x);
This can reuse values. If we are using this a lot and know that our range of commonly used numbers is reasonably small but still beyond what Java assumes, it is not too hard to write something like „IntegerUtil“ and use it:
int x = ...;
Integer xObj = IntegerUtil.valueOf(x);
Look if you can find an implementation that fits your needs, instead of writing it. But it is no pain to write it.
Unboxing is also easy:
Integer xObj = ...;
int x = xObj.intValue();
The methods intValue(), longValue(), doubleValue(),… are actually in the base class Number, so it is possible to unbox and cast in one step with these.
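A possible „IntegerUtil“ could look like the following sketch. The cached range from -1000 to 100000 is purely an assumption for illustration and would have to be adjusted to the numbers an application really uses; the example also shows explicit unboxing.

```java
public final class IntegerUtil {

    // Assumed range of frequently used values; adjust to the application
    private static final int LOW = -1_000;
    private static final int HIGH = 100_000;
    private static final Integer[] CACHE = new Integer[HIGH - LOW + 1];

    static {
        for (int i = 0; i < CACHE.length; i++) {
            CACHE[i] = Integer.valueOf(LOW + i);
        }
    }

    private IntegerUtil() {
        // utility class, no instances
    }

    // Boxing: reuse a cached wrapper inside the range, delegate to the JDK outside of it
    public static Integer valueOf(int x) {
        if (x >= LOW && x <= HIGH) {
            return CACHE[x - LOW];
        }
        return Integer.valueOf(x);
    }

    public static void main(String[] args) {
        Integer a = IntegerUtil.valueOf(50_000);
        Integer b = IntegerUtil.valueOf(50_000);
        System.out.println(a == b);   // true: both refer to the same cached object
        int x = a.intValue();         // explicit unboxing
        long y = a.longValue();       // unboxing and widening in one step
        System.out.println(x + " " + y);
    }
}
```

The price is the memory for the cache array, so this only pays off when the same values are boxed again and again.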

Decide how much readability you want.

It is useful to look at the static methods of the wrapper classes even for converting numbers to Strings and Strings to numbers. Avoid using constructors, they are rarely necessary and some neat optimizations that the Java libraries give us for free only work when we use the right methods. This does not make a huge difference, but doing it right does not hurt, but rather makes code more readable.
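A few of these static methods in action, as a small sketch with hypothetical helper names:

```java
public class WrapperConversions {

    // String -> int without creating a wrapper object
    static int toInt(String s) {
        return Integer.parseInt(s.trim());
    }

    // String -> Integer; valueOf may hand out a cached object
    static Integer toInteger(String s) {
        return Integer.valueOf(s.trim());
    }

    // int -> String with a radix
    static String toText(int x, int radix) {
        return Integer.toString(x, radix);
    }

    public static void main(String[] args) {
        System.out.println(toInt(" 42 ") + 1);             // 43
        System.out.println(toInteger("42"));               // 42
        System.out.println(toText(255, 16));               // ff
        System.out.println(Long.parseLong("4000000000"));  // 4000000000
        System.out.println(Double.parseDouble("2.5"));     // 2.5
    }
}
```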

It is also interesting to see how the extended numeric types like BigInteger and BigDecimal work similarly to the wrapper types, and to learn how to use them right.
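One example of using them right is the choice between factory methods and constructors for BigDecimal. The following sketch uses made-up method names; it relies on the fact that BigDecimal.valueOf(double) goes through the decimal string representation of the double, while the constructor taking a double keeps its exact binary value.

```java
import java.math.BigDecimal;

public class BigFactories {

    // Uses the canonical string form of the double, so this is exactly 0.1
    static BigDecimal tenthFromValueOf() {
        return BigDecimal.valueOf(0.1);
    }

    // Keeps the exact binary value of the double, which is slightly more than 0.1
    static BigDecimal tenthFromConstructor() {
        return new BigDecimal(0.1);
    }

    public static void main(String[] args) {
        BigDecimal exact = new BigDecimal("0.1");
        System.out.println(exact.equals(tenthFromValueOf()));      // true
        System.out.println(exact.equals(tenthFromConstructor()));  // false
        // equals also compares the scale, compareTo only the numeric value
        System.out.println(new BigDecimal("1.0").equals(new BigDecimal("1.00")));          // false
        System.out.println(new BigDecimal("1.0").compareTo(new BigDecimal("1.00")) == 0);  // true
    }
}
```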

Another interesting issue is to use actually specific collection implementations for primitives. This may add to the complexity of our code, because it gives up another piece of polymorphism, but they can really save our day by giving a better performance. And in cases where we actually know for sure that the data is always belonging to a certain primitive type, I find this even idiomatic.
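To give an idea of what such a collection for primitives does, here is a deliberately minimal sketch of a growable list of int values. The class IntList is hypothetical; real libraries offer much more complete implementations.

```java
import java.util.Arrays;

/** Hypothetical minimal growable list of int values without any boxing. */
public class IntList {

    private int[] data = new int[16];
    private int size = 0;

    // Appends a value, doubling the backing array when it is full
    public void add(int value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, 2 * data.length);
        }
        data[size++] = value;
    }

    public int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return data[index];
    }

    public int size() {
        return size;
    }

    // Sum as long, so that the result does not overflow as easily
    public long sum() {
        long s = 0;
        for (int i = 0; i < size; i++) {
            s += data[i];
        }
        return s;
    }

    public static void main(String[] args) {
        IntList list = new IntList();
        for (int i = 0; i < 100; i++) {
            list.add(i);
        }
        System.out.println(list.size() + " " + list.sum()); // 100 4950
    }
}
```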

Other languages have solved the issues discussed here in a more elegant way by avoiding this two sided world of primitives and wrappers or by making the conversions less dangerous and more natural. They have operator overloading for numeric types and they use a more consistent concept of equality than Java.

Numeric types in Perl

Dealing with numeric types in Perl is not as straightforward as in other programming languages. We can use „scalars“ out of the box, but then we get floating point numbers, more precisely what is called „double“ in most programming languages. This is kind of ok for trivial programs, but we should make a deliberate choice on what to use.

Actually the Perl programming language gives us (at least) two more choices. We can use 64-bit integers (or 32-bit on some platforms) by just adding
use integer;
somewhere in the beginning of the file. This causes Perl to work mostly with integers instead of floating point numbers, but the rules for this are not so obvious. You may read about them in the official documentation or in other explanations.

Now we do want to control this on a more fine granular basis than the whole program. There may be legitimate programs that use both floating point and integers. This can be achieved in Perl as well. We can turn this off using:
no integer;
More likely we want to use another approach, that looks more natural and more robust most of the time. We just have to use blocks:
#!/usr/bin/perl -w
use strict;

my $f1 = 2_000_000_000;
my $f2 = $f1 * $f1;
my $f3 = $f1 * $f2;
my $f4 = $f1 * $f3;
my $f5 = $f1 * $f4;

my @f = (1, $f1, $f2, $f3, $f4, $f5);
for (my $i = 0; $i <= 5; $i++) {
    print($i, " ", $f[$i], "\n");
}

my $n2x;
{
    use integer;
    my $n1 = 2_000_000_000;
    my $n2 = $n1 * $n1;
    my $n3 = $n1 * $n2;
    my $n4 = $n1 * $n3;
    my $n5 = $n1 * $n4;

    my @n = (1, $n1, $n2, $n3, $n4, $n5);
    for (my $i = 0; $i <= 5; $i++) {
        print($i, " ", $n[$i], "\n");
    }
    $n2x = $n2;
}

print "n2x=$n2x\n";

my $g1 = 2_000_000_000;
my $g2 = $g1 * $g1;
my $g3 = $g1 * $g2;
my $g4 = $g1 * $g3;
my $g5 = $g1 * $g4;

my @g = (1, $g1, $g2, $g3, $g4, $g5);
for (my $i = 0; $i <= 5; $i++) {
    print($i, " ", $g[$i], "\n");
}
This will output:
0 1
1 2000000000
2 4000000000000000000
3 8e+27
4 1.6e+37
5 3.2e+46
0 1
1 2000000000
2 4000000000000000000
3 -106958398427234304
4 3799332742966018048
5 7229403301836488704
n2x=4000000000000000000
0 1
1 2000000000
2 4000000000000000000
3 8e+27
4 1.6e+37
5 3.2e+46
So we see that the integer mode is constrained to the block. And we see that the results for 3, 4 and 5 went wrong...

So it may be a little bit tricky to do this, but we can. These integers have the same flaw as integers in many popular programming languages: they silently overflow, taking the remainder modulo $2^{64}$ that lies in the interval $[-2^{63}, 2^{63}-1]$, or modulo $2^{32}$ that lies in the interval $[-2^{31}, 2^{31}-1]$. I do not think that this is really what we usually want. Just hoping that our numbers remain within the safe range may go well in the 64-bit case, but we have to be sure and explain this in a comment when we work like this. Usually we do not want to think about this, and spending a few extra bits costs less than hunting obscure bugs in code where everything looks correct.
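For comparison, here is a minimal sketch of the same wrap-around in Java, whose `long` is one example of such a silently overflowing 64-bit integer. The class and method names are made up for this illustration. It cubes 2'000'000'000 in three ways: with plain `long` arithmetic, with `Math.multiplyExact`, which throws instead of wrapping, and with `BigInteger`.

```java
import java.math.BigInteger;

final class OverflowDemo {
    // Plain long multiplication: wraps around silently modulo 2^64.
    static long wrappingCube(long x) {
        return x * x * x;
    }

    // Math.multiplyExact throws ArithmeticException instead of wrapping.
    static boolean cubeOverflows(long x) {
        try {
            Math.multiplyExact(Math.multiplyExact(x, x), x);
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    // BigInteger gives the exact result at the cost of some speed and memory.
    static BigInteger exactCube(long x) {
        return BigInteger.valueOf(x).pow(3);
    }

    public static void main(String[] args) {
        long g = 2_000_000_000L;
        System.out.println(wrappingCube(g));   // -106958398427234304
        System.out.println(cubeOverflows(g));  // true
        System.out.println(exactCube(g));      // 8000000000000000000000000000
    }
}
```

The wrapped value is the same one that Perl prints for the third power inside the `use integer` block, because 8'000'000'000'000'000'000'000'000'000 - 433'680'869 * 2^64 = -106'958'398'427'234'304.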

Our friend is
use bigint;
which switches to arbitrary precision integers.
#!/usr/bin/perl -w
use strict;

my $f1 = 2_000_000_000;
my $f2 = $f1 * $f1;
my $f3 = $f1 * $f2;
my $f4 = $f1 * $f3;
my $f5 = $f1 * $f4;

my @f = (1, $f1, $f2, $f3, $f4, $f5);
for (my $i = 0; $i <= 5; $i++) {
    print($i, " ", $f[$i], "\n");
}

my $b2x;
{
    use bigint;
    my $b1 = 2_000_000_000;
    my $b2 = $b1 * $b1;
    my $b3 = $b1 * $b2;
    my $b4 = $b1 * $b3;
    my $b5 = $b1 * $b4;

    my @b = (1, $b1, $b2, $b3, $b4, $b5);
    for (my $i = 0; $i <= 5; $i++) {
        print($i, " ", $b[$i], "\n");
    }
    $b2x = $b2;
}

print "b2x=$b2x\n";

my $g1 = 2_000_000_000;
my $g2 = $g1 * $g1;
my $g3 = $g1 * $g2;
my $g4 = $g1 * $g3;
my $g5 = $g1 * $g4;

my @g = (1, $g1, $g2, $g3, $g4, $g5);
for (my $i = 0; $i <= 5; $i++) {
    print($i, " ", $g[$i], "\n");
}
This gives us the output:
0 1
1 2000000000
2 4000000000000000000
3 8e+27
4 1.6e+37
5 3.2e+46
0 1
1 2000000000
2 4000000000000000000
3 8000000000000000000000000000
4 16000000000000000000000000000000000000
5 32000000000000000000000000000000000000000000000
b2x=4000000000000000000
0 1
1 2000000000
2 4000000000000000000
3 8e+27
4 1.6e+37
5 3.2e+46
So it is again constrained to the block, but it allows us to use arbitrary lengths of integers, as long as our memory is sufficient.

A less commonly used, but interesting approach is to work with rational numbers:
#!/usr/bin/perl -w
use strict;
use bigrat;

my $x = 3/4;
my $y = 4/5;
my $z = 5/6;
print("x=$x y=$y z=$z\n");

my $sum = $x+$y+$z;
my $diff = $x - $y;
my $prod = $x * $x * $z;
my $quot = $x / $y;
print("sum=$sum diff=$diff prod=$prod quot=$quot\n");
This gives us:
x=3/4 y=4/5 z=5/6
sum=143/60 diff=-1/20 prod=15/32 quot=15/16
That is kind of cool...

There is also something like Math::BigFloat which can be used most easily by having
use bignum;
Find the documentation about "use bignum" and about Math::BigFloat...

You will find more numeric types, like Math::Decimal and Math::Complex.

While I would say that using good numeric types in Perl is not quite as easy as it should be, especially if we want to mix them, at least we have the means to use adequate numeric types. And it is way better than in Java.

Perl 6

Perl 6 has silently reached its first production-ready release on Christmas 2015, called v6.c. It will be interesting to explore what this language can do, which features it offers and how it compares in different aspects to existing relevant and interesting languages like Java, C, Ruby, Perl (5), Clojure, Scala, F#, C++, Python, PHP and others. It looks like Perl 5 is here to stay and will be continued. Perl 6 should actually be considered a different programming language than Perl 5, so the name is somewhat misleading, because it suggests slightly more similarity than there really is. On the other hand, it was done by the same people, and good ideas and concepts from Perl 5 were retained, so it does look somewhat similar.

Today I will just provide some links:

* Main web page
* Documentation
* Documentation II
* Rakudo: Currently the major implementation
* Wikipedia
* Wikipedia (Russian)
* Wikipedia (Spanish)
* Wikipedia (French)
* Wikipedia (Norwegian)

Maybe I will write more about this in the future…

UTF-16 Strings in Java


Strings in Java and many other JVM languages consist of Unicode content and are encoded as utf-16. It was fantastic to already consider Unicode when introducing Java in the 1990s and to make it the only reasonable way to use strings, so there is no temptation to start with a „US-ASCII“ version of a software that never really gets enhanced to deal properly with non-English content and data. It also avoids having to deal with many kinds of string encodings within the software. For outside storage in databases and files and for communication with other processes and over the network, these issues of course remain. But Java strings are always encoded as utf-16. We can be sure. Most common languages can make use of these strings easily, and handling common letter-based languages from Europe and western Asia is quite straightforward. Rendering may still be a challenge, but that is another issue. A major drawback of this approach is that more memory is used. This usually does not hurt too much. Memory is not so expensive, and a couple of strings will not be too big even with utf-16. With 64-bit Java, which should be used these days, the memory limitations of the JVM are not relevant any more; it can basically use as much memory as is provided.

But some applications do hit the memory limits. Since most of the data we are dealing with is ultimately strings, or combinations of relatively long strings with some reference pointers, some integer numbers and more strings, we can say that in typical memory-intensive applications strings actually consume a large fraction of the memory. This leads to the temptation of using or even writing a string library that uses utf-8 or some other more condensed internal format, while still being able to express any Unicode content. This is possible and I have done it. Unfortunately it is very painful, because strings are quite deeply integrated into the language and explicit conversions need to be added in many places to deal with this. But it is possible and can save a lot of memory. In my case we were able to abandon this approach, because other optimizations that were less painful proved to be sufficient.
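To get a feeling for how much a more condensed encoding could save, the following small sketch counts the bytes that the same text needs in utf-8 and in utf-16. The class name and the sample strings are just chosen for illustration, and the non-ASCII samples are written as Unicode escapes to keep the source file pure ASCII. Note that this only measures the encoded content, not the additional per-object overhead of a Java `String`.

```java
import java.nio.charset.StandardCharsets;

final class EncodingSizes {
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // UTF_16LE is used because plain UTF_16 would prepend a 2-byte byte order mark.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    public static void main(String[] args) {
        String[][] samples = {
            {"Latin, 5 letters", "Hello"},
            {"Cyrillic, 6 letters", "\u041f\u0440\u0438\u0432\u0435\u0442"},
            {"Japanese, 5 symbols", "\u3053\u3093\u306b\u3061\u306f"},
            {"one symbol outside the BMP", "\uD83D\uDE00"},
        };
        for (String[] sample : samples) {
            System.out.println(sample[0]
                + ": utf-8=" + utf8Bytes(sample[1])
                + " utf-16=" + utf16Bytes(sample[1]));
        }
    }
}
```

For these samples the counts are 5 versus 10 bytes for the Latin text, 12 versus 12 for the Cyrillic text, 15 versus 10 for the Japanese text and 4 versus 4 for the single symbol outside the Basic Multilingual Plane. So utf-8 only pays off when the data is dominated by ASCII characters.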

An interesting idea is to compress strings. If they are long enough, algorithms like gzip work on a single string. As with utf-8, selectively accessing parts of the string becomes expensive, because it can only be achieved by parsing the string from the beginning or by adding indexing structures. We just do not know which byte to go to for accessing the n-th character, even with utf-8. In reality we often do not have long strings, but rather many relatively short strings. They do not compress well by themselves. If we know our data and have a rough idea about the content of our complete set of strings, a custom compression algorithm can be generated. This allows good results even for relatively short strings, as long as they are within the „language“ that we are prepared for. This is more or less similar to the step from utf-16 to utf-8, because we replace common byte sequences by shorter byte sequences, and less common sequences may even get replaced by something longer. There is no gain in utf-8 if we have mostly strings that are in non-Latin alphabets. Even Cyrillic or Greek, which are alphabets similar to the Latin alphabet, will end up needing two bytes for each letter, which is not at all better than utf-16. For many other scripts it will even become worse, because three bytes are needed for one symbol that could easily be expressed with two bytes in utf-16. Symbols outside the Basic Multilingual Plane need four bytes in both encodings. But if we know our data well enough, the approach with the specific compression will work fine. The „dictionary“ for the compression needs to be stored only once, maybe hard-coded in the software, and not in each string. It might be of interest to consider building the dictionary dynamically at run-time, like it is done with gzip, but keeping it in a common place for all strings and thus sharing it. The custom strings that I used were actually using a hard-coded compression algorithm generated using a large amount of typical data. It worked fine, but was just too clumsy to use, because Java is not prepared to replace String with something else without really messing around in the standard run-time libraries, which I would neither recommend nor want.
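The idea of a dictionary that is stored only once and shared by all strings can be sketched with the preset dictionary support of `java.util.zip.Deflater` and `Inflater`. This is not the generated compression algorithm described above, just a stand-in technique from the standard library to show the principle. The class name and the dictionary content are hypothetical, and a real dictionary would be derived from a large sample of typical data.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class SharedDictionaryCompression {
    // Hypothetical dictionary: substrings we expect to be frequent in our data.
    // It lives once in the program, not in each compressed string.
    static final byte[] DICTIONARY =
        "Strasse Street Avenue Berlin Zurich GmbH".getBytes(StandardCharsets.UTF_8);

    static byte[] compress(String s) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setDictionary(DICTIONARY);
        deflater.setInput(s.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    static String decompress(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                out.write(buffer, 0, n);
                if (inflater.needsDictionary()) {
                    // The stream header says a preset dictionary was used.
                    inflater.setDictionary(DICTIONARY);
                } else if (n == 0 && !inflater.finished() && inflater.needsInput()) {
                    throw new IllegalArgumentException("truncated data");
                }
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt data", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

With the zlib format used here, each compressed string carries about ten bytes of header, dictionary id and checksum, so this sketch shows the mechanism rather than a format that is already competitive for very short strings.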

It is important to consider the following issues:

* Is the memory consumption of the strings really a problem?
* Are there easier optimizations that solve the problem?
* Can it just be solved by adding more hardware? Yes, this is a legitimate question.
* Are there solutions for the problem on the internet or even within the current organization?
* A new String class is so fundamental that excellent testing is absolutely mandatory. The unit tests should be very extensive and complete.

Scanning, sorting and processing large numbers of photos

I guess for most of us this is more an issue of private life than something done professionally, and those who do this for money should already have answers for everything. But the IT aspects of this are interesting anyway.

So some of us, including me, have hundreds or thousands of photographs that have been created using analog photography. I am still using it, because I have good equipment and because the prices and availability of films, prints and scanning of the negatives to a CD are still good. I am neither willing to give up this equipment nor to make a major investment. That will come some day in the future, and I expect that within five to ten years the reasonably priced and ubiquitous offers for handling negative films and prints will disappear.

Anyway, it is a good idea to scan all the slides and negatives, at least the ones that are of any interest. It is easier and cheaper to copy them, to get prints and to do some improvements with software like Gimp prior to creating prints. It is also possible to use and share them online.

Scanning with a flatbed scanner is not an option for negatives. It works with prints, but I think that it is too slow, and I do not like the loss in quality due to the unnecessary intermediate step. This leaves two options: getting a negative scanner myself or using a service. So for now let us assume that the photos are already scanned. I organize the photos in a directory structure. The names should contain only 7-bit ASCII characters and no spaces, so that they are easier to access from scripts and on the shell. I have scripts to rename directories and files to this pattern. They can be found in my GitHub project „photo processing scripts“ with the names:
* rename-canonical
* rename-dir-canonical
* rename-dirs-radically
Another interesting issue is finding and removing duplicates, but since the name of the file and its position in the file system do have some meaning, this needs some attention. When two identical files A and B are found, there are five resolutions:
* rm A (remove A, leave B)
* rm B (remove B, leave A)
* rm A ; ln -s B A (make A a softlink to B)
* rm B ; ln -s A B (make B a softlink to A)
* rm B ; ln -f A B (make B a hardlink of A. Apart from the inode number this is equivalent to the opposite direction)
Which of these is actually preferred? My script picks the last option, but does not actually perform it. Instead it just prints the shell commands, which can be redirected to a file or piped directly to sh, in which case they are immediately executed. Otherwise it is possible to edit the commands, filter them or even change them with a one-liner in the Perl programming language. This can be found here:
* find-dups
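The find-dups script itself is not reproduced here. As a rough sketch of the principle, and not of the actual script, the following Java program compares files by their SHA-256 digest and prints one shell command per duplicate instead of executing anything. All names in it are made up for this illustration. It assumes that the file names are already canonical as described above, so no quoting for the shell is done, and it treats equal digests as equal content.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.stream.Stream;

final class DuplicateCommands {
    // Hex-encoded SHA-256 digest of the file content.
    static String digest(Path file) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(Files.readAllBytes(file))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Collects path -> digest for all regular files below root, sorted by path.
    static SortedMap<String, String> scan(Path root) {
        SortedMap<String, String> digestByPath = new TreeMap<>();
        try (Stream<Path> walk = Files.walk(root)) {
            walk.filter(p -> Files.isRegularFile(p, LinkOption.NOFOLLOW_LINKS))
                .forEach(p -> digestByPath.put(p.toString(), digest(p)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return digestByPath;
    }

    // For every file whose digest was seen before, emit (not run) a command
    // that replaces it by a hardlink to the first file with that digest.
    // Assumes canonical names without spaces, so no shell quoting is done.
    static List<String> linkCommands(SortedMap<String, String> digestByPath) {
        Map<String, String> firstPathByDigest = new HashMap<>();
        List<String> commands = new ArrayList<>();
        for (Map.Entry<String, String> e : digestByPath.entrySet()) {
            String original = firstPathByDigest.putIfAbsent(e.getValue(), e.getKey());
            if (original != null) {
                commands.add("rm " + e.getKey() + " ; ln -f " + original + " " + e.getKey());
            }
        }
        return commands;
    }

    public static void main(String[] args) {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        linkCommands(scan(root)).forEach(System.out::println);
    }
}
```

Because only text is printed, the output can be inspected, filtered or edited before it is passed to sh. Files that are already hardlinks of each other are reported again by this sketch, which is harmless but redundant.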
For viewing the photos in the browser, I have added another script, that is called
* create-foto-index
It recursively searches the current directory and all subdirectories, except those starting with a dot („.“). For each image file a thumbnail image is needed, which is either found in the .thumbs directory or created using the script
* scale-image
Then an index.html file is created in each directory, with links to its child directories, the neighboring directories and the parent directory. For each image the thumbnail is included, and it is a link to the full-sized image. With this it is easily possible to view the whole album in a browser locally.
Some images know their orientation already from the camera or phone, but they appear wrong anyway. These can be fixed automatically by running the script
* auto-rotate
in the directory.
I have a web server and a CGI-script running:
* cgi/mark-images.cgi
which allows me to mark images with a checkbox or with a string. I use the letters „D“ for delete, „R“ for rotate right (90 degrees clockwise), „L“ for rotate left (90 degrees counterclockwise) and „F“ for flip (rotate 180 degrees) and then press the OK button.
After that, running the script
* rotate-checked
will delete and rotate the images according to the choices in the form.

This is already quite a useful situation. Images that are needed for prints or for the web might need some processing with GIMP:
* possibly rotate them in such a way that the horizon is horizontal and vertical lines are vertical, at least in the middle of the image.
* possibly correct perspective
* possibly sharpen
* possibly correct contrast and brightness
* possibly correct color saturation and colors
* cut out what is really interesting
* save it under a different name
* call create-foto-index again.
The webform and the CGI-script can be used for picking which images to edit. After having pressed OK it will be done like this:
gimp `egrep 'jpg$' </var/lib/wwwrun/mark-fotos/marked.dat` &
In a similar way images from a directory can be selected in indexf.html and then extracted to a ZIP:
zip my-archive.zip `egrep 'jpg$' </var/lib/wwwrun/mark-fotos/marked.dat`
which can be given to somebody, uploaded for creating prints or just unpacked in another directory to have only the good images.

There are some more issues, which I might address in another article.