Scala Exchange 2017

I visited Scala Exchange („#ScalaX“) in London on 2017-12-14 and 2017-12-15. It was great, better than 2015 in my opinion. In 2016 I skipped Scala Exchange in favor of Clojure Exchange.

This time there were a lot of talks about category theory and of course its application to Scala. Spark, big data and Slick were covered less heavily this time. Lightbend (formerly Typesafe), the company behind Scala, showed some presence, but less than in other years. Still, 800 attendees is an impressive number in itself, and some of the talks about category theory were really great.

While I have always had a hard time accepting why we need this „Über-mathematics“ called category theory for such a finite task as programming, I am starting to see its point and usefulness. Functors and categories provide a meta layer that is actually accessible in Scala, and the theories behind them are rich enough to remain useful even when constrained to a less infinite universe. They even help with understanding things in Java. I will leave the details to another post, or forget about them until the next Scala conference.

So the talks that I attended were:

  • Keynote: The Maths Behind Types [Bartosz Milewski]
  • Free Monad or Tagless Final? How Not to Commit to a Monad Too Early [Adam Warski]
  • A Pragmatic Introduction to Category Theory [Daniela Sfregola]
  • Keynote: Architectural patterns in Building Modular Domain Models [Debasish Ghosh]
  • Automatic Parallelisation and Batching of Scala Code [James Belsey and Gjeta Gjyshinca]
  • The Path to Generic Endpoints Using Shapeless [Maria-Livia Chiorean]
  • Lightning talk – Optic Algebras: Beyond Immutable Data Structures [Jesus Lopez Gonzalez]
  • Lightning Talk – Exploring Phantom Types: Compile-Time Checking of Resource Patterns [Joey Capper]
  • Lightning Talk – Leave Jala Behind: Better Exception Handling in Just 15 Mins [Netta Doron]
  • Keynote: The Magic Behind Spark [Holden Karau]
  • A Practical Introduction to Reactive Streams with Monix [Jacek Kunicki]
  • Building Scalable, Back Pressured Services with Akka [Christopher Batey]
  • Deep Learning data pipeline with TensorFlow, Apache Beam and Scio [Vincent Van Steenbergen]
  • Serialization Protocols in Scala: a Shootout [Christian Uhl]
  • Don’t Call Me Frontend Framework! A Quick Ride on Akka.Js [Andrea Peruffo]
  • Keynote: Composing Programs [Rúnar Bjarnason]

Collection Initialization in Java

There is this so-called „double brace“ pattern for initializing collections. We will see later on whether it should be considered a pattern or an anti-pattern…

The idea is that we should consider the whole initialization of a collection as one big operation. In other languages we write something like
[element1 element2 element3]
or
[element1, element2, element3]
for array-like collections and
{key1 val1, key2 val2, key3 val3}
or
{key1 => val1, key2 => val2, key3 => val3}.
Java could not do this so nicely before Java 9, but there has always been a way to construct lists and sets:
Arrays.asList(element1, element2, element3);
or
new HashSet<>(Arrays.asList(element1, element2, element3));.
Do not ask about immutability (or unmodifiability), which is still not very well solved in the standard Java library, unless you are willing to take a look into Guava, which we will do in another article… Let us stick with Java’s own facilities for today.

So the double brace pattern would be something like this:

import java.util.*;

public class D {
    public static void main(String[] args) {
        List<String> l = new ArrayList<String>() {{
                add("abc");
                add("def");
                add("uvw");
            }};
        System.out.println("l=" + l);

        Set<String> s = new HashSet<String>() {{
                add("1A2");
                add("2B707");
                add("3DD");
            }};
        System.out.println("s=" + s);

        Map<String, String> m = new HashMap<String, String>() {{
                put("k1", "v1");
                put("k2", "v2");
                put("k3", "v3");
            }};
        System.out.println("m=" + m);
    }
}

What does this do?

First of all, an opening brace after the new XXX() creates an anonymous class extending XXX. Then we open the body of this extended class. As many know, a class can contain a static {....} section, which is executed exactly once when the class is loaded. The same applies to a non-static section, which is achieved by omitting the static keyword. This is executed once for each instance of the class, after the constructor of the base class, so it serves as a kind of replacement for a constructor. To make it look cooler, the two pairs of braces are placed together.
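A tiny demo of the two kinds of initializer blocks (the class name is made up, not from the original post):

public class InitializerDemo {
    static {
        // runs exactly once, when the class is loaded
        System.out.println("static initializer");
    }

    {
        // runs for every new instance, after the superclass constructor
        // and before the body of this class's own constructor
        System.out.println("instance initializer");
    }

    public InitializerDemo() {
        System.out.println("constructor body");
    }

    public static void main(String[] args) {
        new InitializerDemo();
        new InitializerDemo();
        // output: "static initializer" once, then twice
        // "instance initializer" followed by "constructor body"
    }
}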

It is not so much magic, but it creates a lot of overhead by generating anonymous classes with no real additional functionality, just for the sake of an initialization. It is even worse, because these anonymous inner classes are not static, so they can refer to their surrounding instance. They do not make use of this, but they still carry a reference to their surrounding instance, which can be a very serious problem for serialization, if that is used, and for garbage collection. So please consider the double-brace initialization an anti-pattern. Others have blogged about this too…
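To see why this matters, here is a small made-up sketch (class names are hypothetical) where the hidden reference to the enclosing instance prevents serialization and drags a large object along:

import java.io.*;
import java.util.*;

public class Holder {
    // some large, non-serializable state in the enclosing object
    private final byte[] bigBuffer = new byte[10_000_000];

    // the double-brace list is an anonymous inner class and keeps an
    // implicit reference to the enclosing Holder instance
    public List<String> createList() {
        return new ArrayList<String>() {{
            add("abc");
            add("def");
        }};
    }

    public static void main(String[] args) throws IOException {
        List<String> l = new Holder().createList();
        // the list carries the whole Holder (including bigBuffer) along;
        // serializing it fails because Holder is not Serializable
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(l);
        } catch (NotSerializableException e) {
            System.out.println("cannot serialize: " + e.getMessage());
        }
    }
}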

There are more legitimate ways to group the initialization together. You can put the initialization into a static method and call that. Or you could group it with single braces, just to indicate the grouping. This is a bit unusual, but at least correct:

import java.util.*;

public class E {
    public static void main(String[] args) {
        List<String> l = new ArrayList<String>();
        {
            l.add("abc");
            l.add("def");
            l.add("uvw");
        }
        System.out.println("l=" + l);

        Set<String> s = new HashSet<String>();
        {
            s.add("1A2");
            s.add("2B707");
            s.add("3DD");
        }
        System.out.println("s=" + s);

        Map<String, String> m = new HashMap<String, String>();
        {
            m.put("k1", "v1");
            m.put("k2", "v2");
            m.put("k3", "v3");
        }
        System.out.println("m=" + m);
    }
}

While the first two can somehow be written using Arrays.asList(...), Java 9 now provides nicer ways of writing all three: List.of("abc", "def", "uvw");, Set.of("1A2", "2B707", "3DD"); and Map.of("k1", "v1", "k2", "v2", "k3", "v3");. This is recommended over any other way, because there are some additional runtime and compile time checks and because these factory methods create efficient immutable collections. This has been blogged about too.
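For illustration, here is how the same three collections from the examples above might look with the Java 9 factory methods (the class name F is just chosen to match the naming of the previous examples):

import java.util.*;

public class F {
    public static void main(String[] args) {
        // Java 9+ factory methods create compact immutable collections
        List<String> l = List.of("abc", "def", "uvw");
        Set<String> s = Set.of("1A2", "2B707", "3DD");
        Map<String, String> m = Map.of("k1", "v1", "k2", "v2", "k3", "v3");

        System.out.println("l=" + l);
        System.out.println("s=" + s);
        System.out.println("m=" + m);

        // any attempt to modify them fails at runtime
        try {
            l.add("xyz");
        } catch (UnsupportedOperationException e) {
            System.out.println("l is immutable");
        }
    }
}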

The aspect of immutability, which we should consider today, is not very well covered by the Java collections (apart from the new internal implementations behind the new factory methods). Wrapping in Collections.unmodifiableXXX(...) adds a bit of overhead in terms of code, memory and CPU usage, but it does not guarantee that the wrapped collection is not being modified elsewhere.
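A small sketch of that pitfall (the class name G is again arbitrary): the unmodifiable view rejects changes made through itself, but still reflects changes made through the original list:

import java.util.*;

public class G {
    public static void main(String[] args) {
        List<String> original = new ArrayList<>(Arrays.asList("abc", "def"));
        List<String> wrapped = Collections.unmodifiableList(original);

        // the wrapper itself rejects modifications ...
        try {
            wrapped.add("uvw");
        } catch (UnsupportedOperationException e) {
            System.out.println("wrapped is unmodifiable");
        }

        // ... but whoever holds the original list can still change
        // what the wrapper shows
        original.add("uvw");
        System.out.println("wrapped=" + wrapped);  // prints [abc, def, uvw]
    }
}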


Perl 5 and Perl 6

We now have two Perls: Perl 5, which has been around for more than 20 years simply as „the Perl programming language“, and Perl 6, which has been in development for more than a decade and of which stable versions now exist.

The fact that they are both called „Perl“ is a bit misleading. They are two different and incompatible programming languages, but they share the same community, and Perl conferences usually cover both languages.

So this raises the question about the differences, and about which of the two Perls to use.

Here are some differences:

  • Perl 5 is well established and many people know it. Perl 6 has to be learned, even if it is relatively easy to learn for someone with a Perl 5 background.
  • Perl 5 runs about three times faster than Perl 6
  • Perl 6 programs are a bit shorter than Perl 5 programs
  • Perl 6 regular expressions are even better than Perl 5’s regular expressions
  • Perl 6 is more logical than Perl 5
  • Perl 6 uses by default better numerical types
  • Perl 6 makes it easier and more natural to do object oriented programming and functional programming
  • Perl 6 has come up with a useful approach for doing multithreading.
  • Perl 5 has so many cool libraries on CPAN, Perl 6 just a few.



Swiss Perl Workshop 2017

I have attended the Swiss Perl Workshop.
We were a group of about 40 people, with one track and some very interesting talks, including some by Damian Conway.
I gave a regular talk and a lightning talk myself.
The content of my talk might go into another Blog post in the future.
The Perl programming language is still interesting, and of course it was covered in both variants: Perl 5 and Perl 6.
But many of the talks were about general issues like security and architecture and just exemplified by Perl.

Video recording of the talks was optional. Here are those that have been recorded and uploaded so far: YouTube: Swiss Perl Workshop


Shell Scripts

Shell scripts can be useful for small stuff like combining a few commands into pipes or doing a bit of „back ticking“. Even simple loops and if-conditions are possible. And if we want, it is almost a full programming language. A bit hard to tame, maybe, but quite a lot of stuff is possible. Those who would like to know more about this may look into the startup scripts of typical Java software. Often a .bat and a .sh file are provided, in which the right JVM is found and the classpath, the execution path and maybe some other environment are put together. In the end the .sh file is quite a long and unreadable horror story, and the .bat file is much worse, because the cmd language is just a lot more primitive, less capable, and requires even uglier hacks.

There are ways to make shell scripts more readable, which by themselves are truly admirable, but I think that route is wrong. We can learn all the shell functionality and understand, bit by bit, ever more complex shell scripts, but I think that for non-trivial shell scripts it is time to switch to real programming languages instead, scripting languages of course, for example Perl, Ruby, Python or Lua. We may still execute „shell commands“, which are actually programs in /bin, /usr/bin or /usr/local/bin, where they are more powerful and concise than doing the same thing purely in the scripting language. But magic for putting together a classpath is much cleaner in Perl than in pure bash (or, worse, cmd/bat).

This is of course another example of the golden hammer anti-pattern. We should balance our toolbox: not add specific tools that make some minor task a bit easier at the expense of supporting one more tool, but keep a broad range of tools that in conjunction are very powerful. For example, I would retire awk and sed and use either Perl or Ruby instead. We only have to keep them around because a lot of system tools that are just there still rely on them, but within a team I would deprecate awk and sed for new scripts, or even for enhancing existing scripts. Bash is OK only for small scripts; you can invent a maximum line count or a maximum complexity, but for very short scripts I think bash is a legitimate tool.

Switch to Perl, Perl 6, Ruby or the like when you encounter any of the following:

  • The script is getting kind of long (>= 100 lines)
  • You find yourself modularizing it with functions
  • You find yourself using non-trivial Perl, Ruby, sed or awk within the script, for example regex stuff
  • The script needs interaction
  • The script needs arrays, numbers or other types
  • More than one or two trivial if-statements or loop-statements are needed
  • Database access is done by the script (SQL or NoSQL)
  • String encoding becomes relevant
  • Quoting levels become an issue

This post was inspired by a similar post on the Isoblog by Kris. And Google’s Shell Style Guide is quite good, especially in limiting the area where shell scripts are acceptable.


Powerful API Functions or Specific API Functions?

When designing APIs we should confront ourselves with the question of what they should look like, what they should contain and what not. This is mostly not a question of the development effort for the API itself, but of creating a good API that can actually be used and that saves us development effort elsewhere.

There are always simple answers, but in the end we should balance certain partially contradicting desires to create something great.

One aspect will be discussed here. Some of us know functions in the libc of certain systems that we use when programming in C. Favorite candidates are ioctl and fcntl. These functions cover a wide range of functionality and actually do quite different things depending on their parameters. Primarily there is one parameter that selects the function to perform, and then, depending on this parameter, there are several additional parameters whose meaning totally depends on the first one.

I truly admire the libc and the POSIX API, because of what they can do, how that is accomplished and how clever the concepts are. But putting loosely related stuff into one catch-all function and using a parameter to select which function to actually execute is just wrong, and it was wrong even in the days when it was created. Now there is possibly some argument in favor of this design, because these functions are system calls, which are special in that they go directly into the OS kernel. Depending on the implementation of the OS there might be limits on the total number of system calls that the OS can support, and it might be hard to change the interface between the OS and libc too often, so a flexible system call comes in handy. In this concrete example it is impossible to change it directly, because the POSIX API has been standardized, and this is one of the few standards that have remained relatively stable for 25 years while still offering great functionality. Linux, which follows this standard closely, is by far the most widespread operating system today, especially on servers, on mobile devices (Android) and on devices that we perceive as just hardware, like network routers, firewalls, … It is simply too valuable that programs written for the POSIX API, and of course using only the defined functionality, keep running on newer Linux versions.

But there is a lesson to learn for our own APIs: we should avoid putting too many different things into one API function. I do not think that many of us will try to write a universal API function like ioctl, but more subtle examples are quite common.

A typical pattern is this:

findPerson(name, email, phone_number)

We can provide a name, a phone number, an email address or a combination of these, and then search for entries that match all of the criteria that we have provided. This is still quite clear, but now we could also provide a list of phone numbers, a list of email addresses etc…

Independent of the actual preference, it should be considered that these are really 7 functions. We can include or exclude each of the three parameters, but the case where all of them are null is probably not supported. Or it is the eighth case, the one that finds everything.

When we are talking about 1, 2, 3 or maybe 4 parameters, it is still possible to create API functions for all the combinations, like

findPersonByName(name)
findPersonByEmail(email)
findPersonByPhoneNumber(phone_number)
findPersonByNameAndEmail(name, email)
...
findPersonByNameAndEmailAndPhoneNumber(name, email, phone_number)

This will be clearer. When writing exhaustive automatic tests, which will probably be „integration tests“, not „unit tests“, they have to be written against these seven variants anyway, no matter whether it is one function or seven. The implementation might also internally use „if“s, or do the equivalent at the query level by doing something like

SELECT * FROM PERSON P
WHERE
(:name IS NULL OR P.NAME = :name)
AND (:email IS NULL OR P.EMAIL = :email)
AND (:phone_number IS NULL OR P.PHONE_NUMBER = :phone_number);

which actually has eight paths that need to be covered by tests, including the case where all three parameters are null, if that is not blocked by application code.
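For illustration, here is a hedged JDBC sketch of how such a catch-all query might be bound. The table and columns follow the SQL above; everything else (class name, returned column) is made up, and whether this is a good idea is exactly the question discussed here:

import java.sql.*;
import java.util.*;

public class PersonDao {
    private final Connection connection;

    public PersonDao(Connection connection) {
        this.connection = connection;
    }

    // one query covers all seven (or eight) combinations; each parameter
    // is bound twice because it appears twice in the WHERE clause
    public List<String> findPerson(String name, String email, String phoneNumber)
            throws SQLException {
        String sql = "SELECT P.NAME FROM PERSON P"
            + " WHERE (? IS NULL OR P.NAME = ?)"
            + " AND (? IS NULL OR P.EMAIL = ?)"
            + " AND (? IS NULL OR P.PHONE_NUMBER = ?)";
        try (PreparedStatement stmt = connection.prepareStatement(sql)) {
            // note: some JDBC drivers prefer setNull(...) over setString(i, null)
            stmt.setString(1, name);
            stmt.setString(2, name);
            stmt.setString(3, email);
            stmt.setString(4, email);
            stmt.setString(5, phoneNumber);
            stmt.setString(6, phoneNumber);
            List<String> result = new ArrayList<>();
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    result.add(rs.getString(1));
                }
            }
            return result;
        }
    }
}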

This also shows the limits of the classical approach when the multitude of queries gets really complex. That might require a more generic approach, which is actually quite well exemplified by SQL or its embedded forms like JDBC. For typical IT projects I would recommend not to go there and develop such a generic query DSL as part of the project. This usually leads to disaster, because the skills for designing a good language or a good generic framework are usually not available in the team, and if we talk about budget, quality and schedule, it will usually blow up anyway. So the reasonable approaches are either to use an existing, well-proven solution for the generic API, or to just find out which functionality is actually needed and provide exactly that.

Some examples show the opposite, like Ruby on Rails, which was developed as part of a project effort. Another example is a relatively big company that developed a framework quite similar to Spring, before Spring was available. But these successes cannot easily be duplicated in our projects.


ScalaUA 2017

About a month ago I visited the conference ScalaUA in Kiev.

This was the schedule.

It was a great conference and I really enjoyed everything, including the food, which is quite unusual for an IT conference. 🙂

I listened to the following talks:
First day:

  • Kappa Architecture, Juantomás García Molina
  • 50 shades of Scala Compiler, Krzysztof Romanowski
  • Functional programming techniques in real world microservices, András Papp
  • Scala Refactoring: The Good the Bad and the Ugly, Matthias Langer
  • ScalaMeta and the Future of Scala, Alexander Nemish
  • ScalaMeta semantics API, Eugene Burmako

I gave these talks:

  • Some thoughts about immutability, exemplified by sorting large amounts of data
  • Lightning talk: Rounding

Day 2:

  • Mastering Optics in Scala with Monocle, Shimi Bandiel
  • Demystifying type-class derivation in Shapeless, Yurii Ostapchuk
  • Reactive Programming in the Browser with Scala.js and Rx, Luka Jacobowitz
  • Don’t call me frontend framework! A quick ride on Akka.Js, Andrea Peruffo
  • Flawors of streaming, Ruslan Shevchenko
  • Rewriting Engine for Process Algebra, Anatolii Kmetiuk

Find recordings of all the talks here:
https://www.scalaua.com/speakers-speeches-at-scalaua2017/


Observer Effect

Scientists have to deal with the observer effect, which means that observing something actually changes it. Typically we think of quantum physics, where this effect is very strong and surprising and closely related to the Heisenberg uncertainty principle, but in a more abstract sense it is present in a multitude of situations. Just think of human interaction: if we want to find out about people, we can ask them. But this conversation actually changes the people, sometimes in a way that we can neglect or tolerate.

But we also have this in IT. If we have a piece of software and we want to observe whether it behaves well, we need ways to observe it. Very often we use logging, sometimes monitoring tools, and sometimes debugging or even profiling. We assume that they do not hurt us, apart from using resources, but we have to be quite careful. The example of logging is a good one, because it is quite common and usually something that we do a lot without wasting too many thoughts on it.

That logging slows our application down is already known. We tend to use a slightly less noisy log level, because terabytes of logs are still a pain, even today. But usually the messages are calculated anyway and then discarded by the logging framework. With functional language features there are quite elegant ways to deal with this, by passing a function that calculates the message on demand instead of passing the message itself. This has always been possible, but too clumsy to actually do, unless the logging framework can rely on macro facilities, even such simple ones as the C preprocessor. The deferred evaluation has its dangers as well: if an object that is passed as an ingredient for a potential message changes while the message is being created, we might get funny effects. Maybe only in the log, but maybe it could crash the application or stop the main program flow from doing its work. We need to be careful, unless the object is immutable.
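As a small sketch, java.util.logging has supported this deferred style via Supplier arguments since Java 8, and other frameworks offer similar lambda-based variants; the expensive method here is of course made up:

import java.util.logging.*;

public class LazyLoggingExample {
    private static final Logger LOGGER =
        Logger.getLogger(LazyLoggingExample.class.getName());

    public static void main(String[] args) {
        // eager: expensiveDescription() is always called,
        // even if the FINE level is not logged at all
        LOGGER.fine("state=" + expensiveDescription());

        // deferred: the Supplier lambda is only evaluated
        // if the FINE level is actually enabled
        LOGGER.fine(() -> "state=" + expensiveDescription());
    }

    private static String expensiveDescription() {
        // stands in for a costly toString, DB access or big concatenation
        return "expensive";
    }
}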

In the case of Hibernate or JPA or similar frameworks this can be especially interesting, even with eager message calculation. Accessing attributes of an entity can actually lead to database operations. These can fail, they can create load, maybe deadlocks, and they may have lost their transaction. A lot of things can happen in places far away from where we assumed the DB work to be done. This actually changes the objects. Do we want such operations to occur during logging? Maybe differently depending on the log level? Immutability is our friend, especially in conjunction with JPA, but that is a long story. We may at least be lucky enough to have some tables that we only read. We can make those objects „pseudo-immutable“, but still the JPA magic must mutate them at least during the read operation. It is tempting to let tools generate the toString methods of objects, but it is very dangerous here. We should avoid including any potentially lazily loaded attributes in the toString output, because otherwise they will be loaded during logging, or even worse, loaded or not depending on how we log.
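A hedged sketch of that toString advice, with made-up JPA entity and attribute names: the lazily loaded collection is deliberately left out, so logging the entity never triggers a hidden database access:

import javax.persistence.*;
import java.util.List;

@Entity
public class PurchaseOrder {
    @Id
    private Long id;

    private String orderNumber;

    // collections are lazily loaded by default in JPA
    @OneToMany(mappedBy = "purchaseOrder")
    private List<OrderItem> items;

    // toString deliberately leaves out the lazy collection, so that
    // logging a PurchaseOrder never triggers a hidden database access
    // (or a LazyInitializationException outside of a session)
    @Override
    public String toString() {
        return "PurchaseOrder{id=" + id + ", orderNumber=" + orderNumber + "}";
    }
}

@Entity
class OrderItem {
    @Id
    private Long id;

    @ManyToOne
    private PurchaseOrder purchaseOrder;

    private String product;
}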

The next thing is the NullPointerException during logging, which is quite common in Java, for example. We do not want to burden our program logic with NullPointerExceptions from the logging, especially not with those that occur only sporadically. So it is a good idea to be careful and to test well; only the combination of both is probably good enough.

Modern times create more demand for real multithreading, not in the JEE sense with a couple of EJBs that can run in parallel, but with massively parallel operations. Even though we have a multitude of logging frameworks and unifying logging facades, and ever more of them, there is a common weakness that they tend to share: writing into one single target is achieved by some kind of synchronization, which can slow our application down and change the timing behavior in ways that we did not desire. Asynchronous logging could help, but in a way it only shifts the problem a bit.


Clojure-Art

It is an interesting idea to generate colorful images or music with programs. In both areas Clojure seems to be quite attractive. Not having explored the music side, I did find the idea of creating images fun and inspiring. It also shows us something about the functions we are working with, if we learn to read the images right, but that will come or not, depending on the circumstances. It is useful not to be too scared of some mathematics when reading this.

Now the challenge is to create an image on a two-dimensional array of points, for example 1000×1000 pixels, with x- and y-coordinates ranging from 0 to 999. Each pixel needs to be colored. While it is very interesting to explore different color models, we can for simplicity assume that we need 3 numbers, each ranging from 0 to 255, for the red, green and blue channels. This is how most displays work, more or less. Now the goal is to create something that looks good and that is reasonable to program; otherwise we could just color a million points individually using, for example, GIMP, but a million is a lot.

Now we can apply any functions to x and y and play around with functions like exp, log, sqrt, sin, cos, tan, sec, csc, sinh, … and of course the basic operations +, -, * and /. It turns out that in most cases we do not get interesting images, but experience shows what is promising to explore. I tried to create pictures by keeping the three channels fairly independent, but this did not work so well; it seems better to keep some connection between them. One approach that actually works quite well is to consider the pair (x,y) as a complex number z = x+iy and to apply just one complex function to it; again exp, log, sqrt, sin, … are good building blocks. Now these complex functions have a tendency to grow to infinity somewhere. While real functions can avoid this issue by constraining themselves to one straight line in the plane, complex functions almost have to go to infinity somewhere. By making the square small enough or by changing the scale we can avoid this, but it imposes quite severe constraints. The Riemann sphere allows us to map any complex number to a point on the surface of a sphere. With some scaling we can get from there to RGB space and obtain coordinates that use, but do not exceed, the desired range. There are more ways to visualize complex numbers, but this is a possibility worth exploring.
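As a small illustration of the Riemann sphere idea, here is a sketch in Java rather than Clojure, for consistency with the other code examples on this blog; the stereographic projection formulas are standard, and the scaling to RGB is just one possible choice:

public class RiemannSphereColor {
    // stereographic projection of the complex number x + i*y onto the
    // unit sphere; each resulting coordinate lies in [-1, 1]
    static double[] toSphere(double x, double y) {
        double d = 1 + x * x + y * y;
        return new double[] {
            2 * x / d,
            2 * y / d,
            (x * x + y * y - 1) / d
        };
    }

    // scale a sphere coordinate from [-1, 1] to an RGB channel in [0, 255]
    static int toChannel(double coordinate) {
        return (int) Math.round(127.5 * (coordinate + 1));
    }

    public static void main(String[] args) {
        double[] p = toSphere(0.5, -1.25);
        System.out.printf("r=%d g=%d b=%d%n",
            toChannel(p[0]), toChannel(p[1]), toChannel(p[2]));
    }
}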

Another way is to use functions that calculate a real number and to apply a sine to it. With some shifting and scaling the values stay between 0 and 255, and there are no abrupt changes in color, unless the function we calculated is very steep or very chaotic. Using phase shifts of 2π/3 and 4π/3, the three color channels can be served and we get nice rainbow waves like the following:

Clojure Art: angle + log(r)
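Here is a minimal sketch of this rainbow-wave idea, again in Java rather than Clojure for consistency with the other code examples; the concrete function (angle + log(r)), the scaling and the output file name are assumptions for illustration:

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class RainbowWaves {
    // some real-valued function of the pixel coordinates; here angle + log(r)
    // as in the image above, but any function can be plugged in
    static double f(double x, double y) {
        double r = Math.hypot(x, y) + 1e-9;          // avoid log(0)
        return Math.atan2(y, x) + Math.log(r);
    }

    // map a real value to 0..255 using sin with a phase shift
    static int channel(double value, double phase) {
        return (int) Math.round(127.5 * (1 + Math.sin(value + phase)));
    }

    public static void main(String[] args) throws IOException {
        int size = 1000;
        BufferedImage img = new BufferedImage(size, size, BufferedImage.TYPE_INT_RGB);
        for (int py = 0; py < size; py++) {
            for (int px = 0; px < size; px++) {
                // center and scale the coordinates
                double x = (px - size / 2.0) / 100.0;
                double y = (py - size / 2.0) / 100.0;
                double v = f(x, y);
                int red = channel(v, 0.0);
                int green = channel(v, 2 * Math.PI / 3);
                int blue = channel(v, 4 * Math.PI / 3);
                img.setRGB(px, py, (red << 16) | (green << 8) | blue);
            }
        }
        ImageIO.write(img, "png", new File("rainbow.png"));
    }
}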

Another experiment was to assume the HSV model and to use the calculated function value as the H part. But this ended up looking like plastic and I did not like it too much.

An important issue to observe is that functions may end up throwing exceptions. I wrapped the functions so that they do not stop the calculation of the image halfway through, but instead provide default values in cases where an exception occurred.

It can also be fun to explore bitwise functions like bitxor, or even functions like the p-adic exponential function, which yield totally different kinds of images.

I have put some of the code from my experiments on GitHub and licensed it under the GPL, so you can use it as a starting point. Others have worked with this as well, for example Clojure Art on Tumblr, the Clojure Art Collective on GitHub, another „clojure art“ on GitHub, or creative computing with Clojure on O’Reilly’s blog.

Enjoy it and learn some Clojure. I sometimes use this when teaching Clojure.


DB Persistence without UPDATE and DELETE

When exploring the usage of databases for persistence, the easiest case is a database that does only SELECT. We can cache as much as we like and it is more or less the functional immutable world brought to the database. For working on fixed data and analyzing data this can sometimes be useful.

Usually our data actually changes in some way. It has already been discussed in this blog that it would be possible to extend the idea of immutability to the database, which would be achieved by allowing only INSERT and SELECT. Since data can be correlated, an INSERT into a table that the application understands as a sub-entity via a one-to-many relationship actually mutates the containing entity. So it is necessary to look at this in terms of the actual OR mapping of all applications that are running on that DB schema.

Life can be simple if we have self-contained data, as with MongoDB or with a JSON column in PostgreSQL, for example. Then inter-table relations are eliminated, but of course this does not even follow the first normal form. That can be OK or not, but at least there are good reasons why best practices were introduced in the relational DB world, and we should be careful about breaking them. Another approach is to avoid the concept of sub-entities and to work only with IDs that are foreign keys; we can query them explicitly when needed.

An interesting approach is to have two ID columns. One is an ID that is unique within the DB table and increasing for newly created data. The other is the entity ID, which is shared between several records referring to different generations of the same object. A new record is created each time we change something and persist the changes, and in a simple approach we just consider the newest record with a given entity ID to be the valid one. This can of course be enhanced with validFrom and validTo columns. Then each access to the database also includes a timestamp, usually close to the current time, but kept constant across a transaction. Only records for which validFrom <= timestamp < validTo are considered, and among those the newest. The validFrom and validTo can form disjoint intervals, but it is up to the application logic whether that is needed or not. It is also possible to select the entry with the highest ID among the records with a given entity ID that satisfy the timestamp condition. Deleting records can be simulated as well, by allowing a way to express a "deleted" record, which means that if we find such a deleted record by our rules, we pretend not to have found anything at all. But referential integrity is still possible, because the pre-deletion data are still there. This concept of having two IDs was inspired by a talk that I saw at Clojure Exchange 2017: Immutable back to front.
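A hedged sketch of what such a lookup could look like, with made-up table and column names (ID as the per-record key, ENTITY_ID shared between generations, VALID_FROM/VALID_TO and a DELETED marker); the row-limiting syntax varies between databases:

import java.sql.*;
import java.util.Optional;

public class VersionedPersonRepository {
    private final Connection connection;

    public VersionedPersonRepository(Connection connection) {
        this.connection = connection;
    }

    // find the newest record of an entity that is valid at the given timestamp;
    // a record marked as deleted means: pretend nothing was found
    public Optional<String> findName(long entityId, Timestamp timestamp) throws SQLException {
        String sql = "SELECT NAME, DELETED FROM PERSON"
            + " WHERE ENTITY_ID = ?"
            + " AND VALID_FROM <= ? AND ? < VALID_TO"
            + " ORDER BY ID DESC"
            + " FETCH FIRST 1 ROWS ONLY";   // or LIMIT 1, depending on the database
        try (PreparedStatement stmt = connection.prepareStatement(sql)) {
            stmt.setLong(1, entityId);
            stmt.setTimestamp(2, timestamp);
            stmt.setTimestamp(3, timestamp);
            try (ResultSet rs = stmt.executeQuery()) {
                // DELETED is assumed to be a boolean or a 0/1 column here
                if (rs.next() && !rs.getBoolean("DELETED")) {
                    return Optional.of(rs.getString("NAME"));
                }
                return Optional.empty();
            }
        }
    }
}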
