Some Thoughts about Incompleteness of Libraries

Selfwritten Util Libraries

Today we have really good libraries with our programming languages and they cover a lot of things. The funny thing is, that we usually end up writing some Util-classes like StringUtil, CollectionUtil, NumberUtil etc. that cover some common tasks that are not found in the libraries that we use. Usually it is no big deal and the methods are trivial to write. But then again, not having them in the library results in several slightly different ad hoc solutions for the same problem, sometimes flawless, sometimes somewhat weak, that are spread throughout the code and maybe eventually some „tools“, „utils“ or „helper“ classes that unify them and cover them in a somewhat reasonable way.

Imposing Util Libraries on all Developers

In the worst case these self written library classes really suck, but are imposed on the developers. Many years ago it was „company standard“ to use a common library for localizing strings. The concept was kind of nice, but it had its flaws. First there was a company wide database for localizing strings in order to save on translation costs, but the overhead was so much and the probability that the same short string means something different in the context of different applications was there. This could be addressed by just creating a label that somehow included the application ID and bypassing this overhead, whenever a collision was detected. What was worse, the new string made it into a header file and that caused the whole application to be recompiled, unless a hand written make file skipped this dependency. This was off course against company policy as well and it meant a lot of work. In those days compilation of the whole application took about 8 (eight!) hours. Maybe seven. So after adding one string it took 8 hours of compile time to continue working with it. Anyway, there was another implementation for the same concept for another operating system, that used hash tables and did not require recompilation. It had the risk of runtime errors because of non-defined strings, but it was at least reasonable to work with it. I ported this library to the operating system that I was using and used it and during each meeting I had do commit to the long term goal of changing to the broken library, which of course never happened, because there were always higher priorities.

I thing the lesson we can already learn is that such libraries that are written internally and imposed on all developers should be really done very well. Senior developers should be involved and if the company does not have them, hired externally for the development. Not to do the whole development, but to help doing it right.

Need for Util libraries

So why not just go with the given libraries? Or download some more? Depending on the language there are really good libraries around. Sometimes that is the way to go. Sometimes it is good to write a good util-libarary internally. But then it is important to do it well, to include only stuff that is actually needed or reasonably likely needed and to avoid major effort for reinventing the wheel. Some obscure libraries actually become obsolete when the main default library gets improved.

Example: Trigonometric and other Mathematical Functions

Most of us do not do a lot of floating point arithmetic and subsequentially we do not need the trigonometric functions like \sin and \cos, other transcendental functions like \exp and \log or functions like cube root (\sqrt[3]{x}) a lot. Where the default set of these functions ends is somewhat arbitrary, but of course we need to go to special libraries at some point for more special functions. We can look what early calculators used to have and what advanced math text books in schools cover. We have to consider the fact, that the commonly used set of trigonometric functions differs from country to country. Americans tend to use six of them, \sin, \cos, \tan, \cot, \sec and \csc, which is kind of beautiful, because it really completes the set. Germans tend to use only \sin, \cos, \tan and \cot, which is not as beautiful, but at least avoids the division by zero and issue of transforming \tan to \cot.  Calculators usually had only \sin, \cos and \tan. But they offered them in three flavors, with modes of „DEG“, „RAD“ and „GRAD“. The third one was kind of an attempt to metricize degrees by having 100 {\rm gon} instead of 90^\circ for an right angle, which seems to be a dead idea.  Off course in advanced mathematics and physics the „RAD“, which uses \frac{\pi}{2} instead of 90^\circ is common and that is what all programming languages that I know use, apart from the calculators. Just to explain the functions for those who are not familiar with the whole set, we can express the last four in terms of \sin and \cos:

  • \tan(x) = \frac{\sin(x)}{\cos(x)} (tangent)
  • \cot(x) = \frac{\cos(x)}{\sin(x)} (cotangent)
  • \sec(x) = \frac{1}{\cos(x)} (secans)
  • \csc(x) = \frac{1}{\sin(x)} (cosecans)

Then we have the inverse trigonometric functions, that can be denoted with something like \arcsin or \sin^{-1} for all six trigonometric functions. There is an irregularity to keep in mind. We write \sin^n(x) instead of (\sin(x))^n for n=2,3,4,\ldots, which is the multiplication of that number of \sin(x) terms. And we use \sin^{-1}(x) to apply the function „\sin-1 time, which is actually the inverse function. Mathematicians have invented this irregularity and usually it is convenient, but it confuses those who do not know it. From these functions many programming languages offer only the \tan^{-1} assuming the others five can be created from that. This is true, but cumbersome, because it needs to differentiate a lot of cases using something like if, so there are likely to be many bugs in software doing this. Also these ad hoc implementations loose some precision.

It was also common to have a conversion from polar coordinates to rectangular (p2r) coordinates and vice versa (r2p), which is kind of cool and again easy, but not too trivial to do ad hoc. Something like atan2 in FORTRAN, which does the essence of the harder r2p operation, would work also, depending on hon convenient it is to deal with multiple return values. We can then do r2p using r=\sqrt{x^2+y^2}, \phi ={\rm atan2}(x, y) and p2r by x=r \sin(\phi) and y = r \cos(\phi).

The hyperbolic functions like \sinh, their inverses like \arsinh or \sinh^{-1} are rarely used, but we find them on the calculator and in the math book, so we should have them in the standard floating point library. There is only one flavor of them.

Logarithms and exponential functions are found in two flavors on calculators: \log(x)=\log_{10}(x)=\lg(x) and \ln(x)=\log_{e}(x) and 10^x and e^x=\exp(x). The log is kind of confusing, because in mathematics and physics and in most current programming language we mean \log(x)=\log_{e}(x) (natural logarithm). This is just a wrong naming on calculators, even if they all did the same mistake across all vendors and probably still do in the scientific calculator app on the phone or on the desktop. As IT people we tend to like the base two logarithm {\rm ld}(x)=\log_2(x), so I would tend to add that to the list. Just to make the confusion complete, in some informatics text books and lectures the term „\log“ refers to the base two logarithm. It is a bad habit and at least the laziness should favor writing the correct „{\rm ld}„.

Then we usually have power functions x^y, which surprisingly many programming languages do not have. If they do, it is usually written as x ** y or pow(x, y), square root, square and maybe cube root and cube.  Even though the square root and the cube root can be expressed as powers using \sqrt(x)=x^\frac{1}{2} and \sqrt[3](x)=x^\frac{1}{3} it is better to do them as dedicated functions, because they are used much more frequently than any other power with non-integral exponents and it is possible to write optimized implementations that run faster and more reliably then the generic power which usually needs to go via log and exp. Internal optimization of power functions is usually a good idea for integral exponents and can easily be achieved, at least if the exponent is actually of an integer type.

Factorial and binomial coefficient are usually used for integers, which is not part of this discussion. Extensions for floating point numbers can be defined, but they are beyond the scope of advanced school mathematics and of common scientific calculators. I do not think that they are needed in a standard floating point library. It is of its own interest what could be in an „advanced math library“, but \sec and \tanh^{-1} and {\rm ld} for sure belong into the base math library.

That’s it. It would be easy to add all these into the standard library of any programming language that does floating point arithmetic at all and it would be helpful for those who work with this and not hurt at all those who do not use it, because this stuff is really small compared to most of our libraries. So this would be the list

  • sin, cos, tan, cot, sec, csc in two flavors
  • asin, acos, atan, acot, asec, acsc (standing for \sin^{-1}…) in two flavors
  • p2r, r2p (polar coordinates to rectangular and reverse) or atan2
  • sinh, cosh, tanh, coth, sech, csch
  • asinh, acosh, atanh, acoth, asech, acsch (for \sinh^{-1}…)
  • exp, log (for e^x and logarithm base e)
  • exp10, exp2, log10, log2 (base 10 and base 2, I would not rely on knowledge that ld and lg stand for log2 and log10, respectively, but name them like this)
  • sqrt, cbrt (for \sqrt{x} and \sqrt[3]{x})
  • ** or pow with double exponent
  • ** or pow with integer exponent (maybe the function with double exponent is sufficient)
  • \frac{1}{x}, x^2, x^3, x^\frac{1}{y} are maybe actually not needed, because we can just write them using ** and /

Actually pretty much every standard library contains sin, cos, tan, atan, exp, log and sqrt.

Java

Java is actually not so bad in this area. It contains the tan2, sinh, cosh, tanh, asin, acos, atan, log10 and cbrt functions, beyond what any library contains. And it contains conversions from degree to radiens and vice versa. And as you can see here in the source code of pow, the calculations are actually quite sophisticated and done in C. It seems to be inspired by GNU-classpath, which did a similar implementation in Java. It is typical that a function that has a uniform mathematical definition gets very complicated internally with many cases, because depending on the parameters different ways of calculation provide the best precision. It would be quite possible that this function is so good that calling it with an integer as a second parameter, which is then converted to a double, would actually be good enough and leave no need for a specific function with an integer exponent. I would tend to assume that that is the case.

In this github project we can see what a library could look like that completes the list above, includes unit tests and works also for the edge cases, which ad hoc solution often do not. What could be improved is providing the optimal possible precision for any legitimate parameters, which I would see as an area of further investigation and improvement. The general idea is applicable to almost any programming language.

Two areas that have been known for a great need of such additional libraries are collections and Date&Time. I would say that really a lot what I would wish from a decent collection library has been addressed by Guava. Getting Date and time right is surprisingly hard, but just thing of the year-2000-problem to see the significance of this issue. I would say Java had this one messed up, but Joda Time was a good solution and has made it into the standard distribution of Java 8.

Summary

This may serve as an example. There are usually some functions missing for collections, strings, dates, integers etc. I might write about them as well, but they are less obvious, so I would like to collect some input before writing about that.

libc on Linux seem to contain sin, cos, tan, asin, acos, atan, atan2, sinh, cosh, tanh, asinh, acosh, atanh, sqrt, cbrt, log10, log2, exp, log, exp10, exp2. Surprisingly Java does not make use of these functions, but comes up with its own.

Actually a lot of functionality is already in the CPU-hardware. IEEE-recommendations suggest quite an impressive set of functions, but they are all optional and sometimes the accuracy is poor.

But standard libraries should be slightly more complete and ideally there would be no need to write a „generic“ util-library.  Such libraries should only be needed for application specific code that is somewhat generic across some projects of the organization or when doing a real demanding application that needs more powerful functionality than can easily be provided in the standard library. Ideally these can be donated to the developers of the standard library and included in future releases, if they are generic enough. We should not forget, even programming languages that are main stream and used by thousands of developers all over the world are usually maintained by quite small teams, sometimes only working part time on this. But usually it is hard to get even a good improvement into their code base for an outsider.

So what functions do you usually miss in the standard libraries?

Share Button

Scala Days 2016

I have visited Scala Days in Berlin 2016-06-15 to 2016-06-17. A little remark on the format might be of interest. The conference is scheduled for 3 days. On the first day, there is only one speech, the first keynote, some time in the late afternoon. During Scala Days 2015 the rest of the day was put into use by organizing a Scala training session, where volunteers could teach Scala to other volunteers who wanted to learn it. But I think two or three sessions on the first day would be better and would still allow starting in the late afternoon with the first keynote. The venue and off course Berlin were great and I enjoyed the whole event.

The talks that I visited were:

Wednesday 2016-06-15

  • First keynote: Scala’s Road Ahead by Martin Odersky about the future of Scala. Very interesting ideas for future versions that are currently explored in dotty.

Thursday 2016-06-16

Friday 2016-06-17

Summary

The whole event was great, I got a lot of inspiration and met great people. Looking forward to the next event.

I might write more on some topics, where I consider it interesting, but for the moment this summary should be sufficient.

Links

Share Button

Linkedin will probably be bought by Microsoft

The social business online network LinkedIn will probably be bought by Microsoft for around 26’200’000’000 USD.

Even if Microsoft has become a bit more trustworthy with Nadella then it was with Ballmer it remains an interesting question how much we should trust them concerning our data. The same issue arouse with Skype and other acquisitions in the past. Deleting an account probably does not change much, because it might just delete the access point to the account, not actually the data.

And anyway the NSA could query LinkedIn just as well es Microsoft.

In any case it is interesting to know about this acqusition.

Links:
* Wikipedia
* NZZ (German)
* 20min (German)

Find more links yourself, it won’t be hard.

Share Button

Operator Overloading

When Java was created, the concept of operator overloading was already present in C++. I would say that it was generally well done in C++, but it kind of breaks the object oriented polymorphism patterns of C++ and the usual way was to have several overloaded functions to allow for all n² combinations.

In the early days of C++ people jumped on this feature and used it for all kinds of stuff that has nothing to do with the original concept of numeric operators, like adding dialog boxes to strings and multiplying that with events. We get somewhere a little bit towards what APL was, which had only operators and a special charset to allow for all the language features, requiring even a special keyboard:

APL example

APL example


You can find an article in Scott Locklin’s Blog about APL and other almost forgotten languages and the potential loss of some achievements that they tried to bring to us.

We see the same with some people in Scala who create a lot of operators using interesting Unicode characters. This is not necessarily wrong, but I think operators should only be used for something that is really important. Not in the sense: „I wrote functionality XYZ for library UVW, and this is really important“, but in the sense that this functionality is so commonly used that people have no problem remembering the operator. Or the operator is already known to us, like „+“, „-„, „*“, … for numeric types, but I still have no idea what adding a string to an event would mean.

In C++ it got even worse because it was possible to overload „->“ or new and thus digging deep into the language, which can be interesting when used carefully and skillfully by developers who really know what they are doing, but disastrous otherwise.

Now Java has opted not to support this operator overloading, which was wrong in even at that time, but understandable, because at that time we were still more in the mindset to count bits and live with the deficiencies of int and long and we ware also seeing the weird abuses of operator overloading in C++. Maybe it was also the lack of time to design a sound mechanism for this in Java. Unfortunately this decision that was made in a context more than 20 years ago has kind of become religious. Interestingly James Gosling, when asked in an interview for the 20 years anniversary of Java, mentioned operator overloading for numeric types as the first thing that he would have made better. (It is around minute 9.) So I hope that this undoes the religious aspect of this topic.

An interesting idea will probably be included in future versions of Scala. An operator is in principal defined as a method of the left operand, which is quite logical, but it would imply writing something like e = (a.*(b)).+(c.*(d)), possibly with fewer parentheses. Now this is recognized as a operator-method, so the dots can go away as well as the parentheses and the common operator precedence applies, so e = a * b + c * d works as well and is what we find natural. Ruby and Scala are very similar in this aspect. Now some future version of Scala, maybe Scala 3, will introduce an annotation that allows the „infix“-notation for these methods and that adds a descriptive name. Now error messages and even IDE-support could give us access to the descriptive name and we would be able to search for it, while searching for something like „+“ or „-“ or „*“ would not really be helpful. I think that this idea would be useful for other languages as well.

These examples demonstrate the BigInteger types of Java, C#, Scala, Clojure and Ruby, respectively:

import java.math.BigInteger;

public class JavaBigInt {

    public static void main(String[] args) {
        BigInteger f = BigInteger.valueOf(2_000_000_000L);
        BigInteger p = BigInteger.ONE;
        for (int i = 0; i < 8; i++) {
            System.out.println(i + " " +  p);
            p = p.multiply(f);
        }
    }
}

gives this output:

0 1
1 2000000000
2 4000000000000000000
3 8000000000000000000000000000
4 16000000000000000000000000000000000000
5 32000000000000000000000000000000000000000000000
6 64000000000000000000000000000000000000000000000000000000
7 128000000000000000000000000000000000000000000000000000000000000000

And the C#-version

using System;
using System.Numerics;

public class CsInt {

    public static void Main(string[] args) {
        BigInteger f = 2000000000;
        BigInteger p = 1;
        for (int i = 0; i < 8; i++) {
            Console.WriteLine(i + " " +  p);
            p *= f;
        }
    }
}

give exactly the same output:

0 1
1 2000000000
2 4000000000000000000
3 8000000000000000000000000000
4 16000000000000000000000000000000000000
5 32000000000000000000000000000000000000000000000
6 64000000000000000000000000000000000000000000000000000000
7 128000000000000000000000000000000000000000000000000000000000000000

Or the Scala version

object ScalaBigInt {

  def main(args: Array[String]): Unit = {
    val f : BigInt = 2000000000;
    var p : BigInt = 1;
    for (i  <- 0 until 8) {
      println(i + " " + p);
      p *= f;
    }
  }
}
0 1
1 2000000000
2 4000000000000000000
3 8000000000000000000000000000
4 16000000000000000000000000000000000000
5 32000000000000000000000000000000000000000000000
6 64000000000000000000000000000000000000000000000000000000
7 128000000000000000000000000000000000000000000000000000000000000000

Or in Clojure it looks like this, slightly shorter than then Java and C#:

(reduce (fn [x y] (println y x) (*' 2000000000 x)) 1 (range 8))

with the same output again, but a much shorter program. Please observe that the multiplication needs to use the "*'" instead of "*" in order to outexpand from fixed length integers to big-integers.

0 1
1 2000000000
2 4000000000000000000
3 8000000000000000000000000000N
4 16000000000000000000000000000000000000N
5 32000000000000000000000000000000000000000000000N
6 64000000000000000000000000000000000000000000000000000000N
7 128000000000000000000000000000000000000000000000000000000000000000N

Or in Ruby it is also quite short:

f = 2000000000
p = 1
8.times do |i|
  puts "#{i} #{p}"
  p *= f;
end

same result, without any special effort, because integers are always expanding to the needed size:

0 1
1 2000000000
2 4000000000000000000
3 8000000000000000000000000000
4 16000000000000000000000000000000000000
5 32000000000000000000000000000000000000000000000
6 64000000000000000000000000000000000000000000000000000000
7 128000000000000000000000000000000000000000000000000000000000000000

So I suggest to leave the IT-theology behind. So the pragmatic issues should be considered now.

In Java we have primitive numeric types, that are basically inadequate for application development, because they tacitly overflow and because application developers have usually no idea how to deal with rounding issues of float and double. We have good numeric types like BigInteger and BigDecimal to support arbitrarily long integral numbers, which do not overflow unless we exceed memory or addressaility issues with numbers of several billion digits. BigDecimal allows for controlled rounding, and also arbitrary precision.

Now we have to write

e = a.multiply(b).add(c.multiply(d))

instead of

e = a * b + c * d

The latter is readable, it is exactly what we mean. The former is not readable at all and the likelihood of making mistakes is very high.
I would be happy with something like this:

e = a (*) b (+) c (*) d

where overloaded operators are surrounded with () or [] or something like that.

At some point of time a major producer of electronic calculators made us believe that it is more natural to express it like this

e a b * c d * + =

Maybe this way of writing math would be better, but it is not what we do outside of our computers and calculators. At least it was more natural to have this pattern for those who created the calculators, because it was much easier to implement in a clean way on limited hardware. We still have the opposite in Lisp, which is still quite alive as Clojure, so I use the clojure syntax:

(def x (+ (* a b) (* c d)))

which is relatively readable after some learning and allows for a very simple and regular and powerful syntax. But even this is not how we write Math outside of our computer.

Now the good news is that Java will add "value types" in the future and consider to revisit the operator overloading issue for these value types. This may or may not solve the issue in a distant future. We should have an idea what a numeric type is. A numeric type can be more than just real and integral numbers. Just think of rational numbers, complex numbers, but even of polynomials, rational functions (quotients of polynomials), finite fields, p-adic numbers and more. We just need to talk about rings and fields in the mathematical sense and possibly subsets that do not quite follow the field semantics like Double, but that are still inspired by the field they aim to represent. Anyway, for the moment Java not having operator overloading is a degradation from something that other languages had already done well before.

Btw., please use elementary school math skills and do not write

e = (a * b) + (c * d)

That is just noise. I do not recommend to memorize all the 10 to 25 levels of operator precedence of a typical programming languages, but it is good to know the basic ones, that almost any serious current programming language supports:
* binary * /
* binary + -
* == != <= >= < >
* &&
* ||
Some use "and" and "or" instead of "&&" and "||".

Now using overloaded operators should be no problem.

We do have an issue when implementing it.

Imagine you have a language with five built in numeric types. Now you add a sixth one. "+" is probably already defined for 25 combinations. With the sixth type we get a total of 36 combinations, of which we have to provide the missing 11 and a mechanism to dispatch the program flow to these. In C++ we just add 11 operator-functions and that does everything. In Ruby we add a method for the left side of the operator. Now this does not know our new type for the existing types, but it deals with it by calling coerce of the right operand with the left operand as parameter. This is actually powerful enough to deal with this situation.

It gets even more tricky when we use different libraries that do not know of each other and each of them adds numeric types. Possibly we cannot add these with each other or we can do so in a degraded manner by just falling back to double or float or rational or something like that.

The numeric types that we usually use can be added with each other, but we could hit situations where that is not the case, for example when having p-adic numbers, which can be added with rational number, but not with real numbers. Or finite fields, whose members can be added with integral numbers or with numbers of the same field, but not necessarily with numbers of another finite field. Fortunately these issues should occur only to people who understand them while writing libraries. Using the libraries should not be hard, if they are properly done.

Share Button

Usability „Pearl“

I just found this usability pearl:

After entering a credit card number as usual with spaces between the groups of four digits, the web page complained like this:

Credit Card Number without Spaces

Web page of insisting to refill a form because of spaces

Yes, it is easy to allow spaces. Just match the following regex
/^\s*\d{4}\s*\d{4}\s*\d{4}\s*\d{4}\s*$/
and then remove the spaces when processing it, but do not let the user enter the number without spaces. That is just ridiculous.

Share Button