Karl Brodowsky's IT-Blog – Seite 5 – IT Sky Consulting GmbH

Spline Approximation (Mathematics)

The goal of spline approximation has already been explained in the previous article „Spline Approximation (Introduction)“.

This article will cover the mathematics behind this approximation and develop an approach. If you do not care about the mathematics, just skip this article and read the „Spline Approximation (Cookbook)“, which will come soon.

Spline Interpolation

We have points

$(x_0, y_0), (x_1, y_1), \ldots, (x_n, y_n)$

such that

(0.1) $x_0 < x_1 < x_2 < \ldots < x_n$

and want a function f such that

$(1)\thickspace \bigwedge_{i=0}^n f(x_i) = y_i$

and f is continuous. Usually the requirement goes further, so the first and second derivatives should be continuous too.

The way this is accomplished is by defining cubic polynomials $f_j$ on each interval $[x_j, x_{j+1}]$ such that

$(2)\thickspace \bigwedge_{j=1}^{n-1} f'_{j-1}(x_j)=f'_j(x_j)$

$(3)\thickspace \bigwedge_{j=1}^{n-1} f''_{j-1}(x_j)=f''_j(x_j)$

Now this gives us $4n$ unknowns and, together with the initial condition, $2(n-1)+2n$ equations. So this is underdetermined, which is usually resolved by adding two more or less arbitrary conditions. A lot of material about this can be found on the internet, in papers and in books.

Spline Approximation

Now a more interesting case is that we actually have many more given points than spline intervals. So we have interval borders at points

$x_0,\ldots,x_n$

and we have given pairs

$(\xi_1,\eta_1), \ldots, (\xi_N, \eta_N)$

with $N$ much larger than $n$. The exact condition will become clear later. The values $\xi_i$ may contain duplicates, but in that case the number of different values for $\xi_i$ should also be much larger than $n$.

Btw. this is nothing new. Papers about this topic exist, but it is not as commonly found on the internet as the interpolation.
From here onwards, it is assumed that the intervals all have the same length, i.e. there is some positive real number $h$ such that $x_i=x_0+ih$ for all $i$.

So we want the conditions (2) and (3) to be fulfilled and a weaker condition

$(1a)\thickspace \bigwedge_{j=1}^{n-1} f_{j-1}(x_j)=f_j(x_j)$

and we want $f(\xi_i)$ to be „somewhat close“ to $\eta_i$ for all points $(\xi_i, \eta_i)$. More precisely it should be as close as possible on „average“, where the quadratic mean is used as „average“. That is common practice, allows for smooth formulas and works. To just minimize the quadratic mean, taking the square root and dividing by $N$ can be omitted. So this can be made explicit by requiring the sum of the squares of the differences to be minimal, i.e.

$(4)\thickspace \sum_{i=1}^N (f(\xi_i)-\eta_i)^2 \text{ is minimal}$

Btw. this can be done perfectly well with complex valued functions, we just need to replace the squares by the squares of the absolute values. So on the „ $\eta$ -side“ we can have complex numbers. Allowing complex numbers on the „ $\xi$ -side“ is a bit more involved, because being differentiable twice implies that the function would be holomorphic, so combining different functions is impossible. And even complex valued functions would become discontinuous at the glue lines if we simply apply them to the whole complex plane. So, for the time being, real numbers are assumed.

Now the valid spline functions on the given set of intervals obviously form a vector space. Conditions (1a), (2) and (3) remain valid when we multiply by a constant or add two such functions. Having $4n$ parameters and $3(n-1)$ independent conditions, its dimension should be $n+3$. This can be proved by induction. For $n=1$ any polynomial of degree $\le 3$ can be used. These form a 4-dimensional vector space. Assuming that for $n$ subintervals the valid spline functions form a vector space of dimension $n+3$, then for $n+1$ subintervals the additional subinterval $[x_n, x_{n+1}]$ is added. In this subinterval, the function can be expressed as

$f(x)=a+b(x-x_n)+c(x-x_n)^2+d(x-x_n)^3\text{.}$

Conditions (1a), (2) and (3) already fix the values for $a$, $b$ and $c$, while $d$ can be chosen freely. Thus the dimension is exactly one higher and the assumption is proved.

Now a basis for this vector space should be found. Ideally functions that are only non-zero in a small range, because they are easier to handle and easier to calculate.

This can be accomplished by a bell-shaped function.

Note that this is not the Gaussian function curve, which never actually becomes zero. The function we are looking for should actually be 0 outside of a given range. So we assume $f(x)=0$ for $|x| > A$ and $f(x)>0$ for $|x| < A$ for some constant $A>0$. This implies that the value and the first and second derivative are $0$ for $x=-A$. So in the subinterval starting at $-A$ it needs to be a cubic polynomial of the form $a(x+A)^3$ with $a>0$. Further subintervals are needed to return to $0$. For reasons of symmetry there should be a subinterval ending at $A$ in which the function takes the form $a(A-x)^3$. Using a third subinterval $[-B, B]$ with $0<B<A$ for the whole middle part would imply that this has to be an even function, thus of the form $g(x)=b+cx^2$. $b$ could be determined as $b=a(A-B)^3-cB^2$. According to the first derivative condition we would have $3a(A-B)^2 = g'(-B) = -2cB$, thus $c=-\frac{3a(A-B)^2}{2B}$. According to the second derivative condition we would have $6a(A-B)=g''(-B)=2c=-\frac{3a(A-B)^2}{B}$, thus $2B=-(A-B)$, thus $B=-A$, which contradicts $0<B<A$. The cubic piece is convex at $-B$, while an even parabola with its maximum at $0$ is concave. So three subintervals are not adequate, even before the requirement of subintervals of equal length is considered.

Using a total of four subintervals actually works. In this case for the subinterval $[-B,0]$ four conditions are given to determine the four coefficients of the cubic function.

For readability it will be assumed that $A=2$ and $B=1$, so the subintervals are $[-2,-1], [-1, 0], [0,1], [1,2]$. The function can be chosen as

$f(x)= \begin{cases} 0 &\text{for } x \le -2\\ (x+2)^3&\text{for } -2 < x \le -1 \end{cases}$

Now

$f(x)=a+b(x+1)+c(x+1)^2+d(x+1)^3$

needs to be defined in [-1, 0] such that

$f(-1)=1$

$f'(-1)=3$

$f''(-1)=6$

$f'(0)=0$

Thus $a=1$ , $b=3$ , $c=3$ and

$0=f'(0)=b+2c+3d=9+3d$

Thus $d=-3$ .

So the prototype function is

$(5)\thickspace f(x)= \begin{cases} 0 &\text{for } x \le -2\\ (x+2)^3&\text{for } -2 < x \le -1\\ 1+3(x+1)+3(x+1)^2-3(x+1)^3&\text{for } -1 < x \le 0\\ 1+3(1-x)+3(1-x)^2-3(1-x)^3&\text{for } 0 < x \le 1\\ (2-x)^3&\text{for } 1 < x \le 2\\ 0&\text{for } x > 2 \end{cases}$
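
The prototype function (5) can be written down directly and checked numerically. The following is only a minimal sketch: the names `prototype` and `second_difference` are made up for this illustration, and the finite-difference step `eps` is an arbitrary choice.

```python
def prototype(x: float) -> float:
    """Prototype function (5): a piecewise cubic that vanishes outside (-2, 2)."""
    if x <= -2.0 or x > 2.0:
        return 0.0
    if x <= -1.0:
        return (x + 2.0) ** 3
    if x <= 1.0:
        # Both middle pieces of (5) share one formula in u = 1 - |x|.
        u = 1.0 - abs(x)
        return 1.0 + 3.0 * u + 3.0 * u * u - 3.0 * u ** 3
    return (2.0 - x) ** 3


def second_difference(x: float, eps: float = 1e-4) -> float:
    """Central finite-difference estimate of the second derivative at x."""
    return (prototype(x + eps) - 2.0 * prototype(x) + prototype(x - eps)) / (eps * eps)


# Values at the glue points and estimates of the second derivative there.
for knot in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(knot, prototype(knot), second_difference(knot))
```

The estimate at $x=-1$ comes out close to $6$, which is the value required by $f''(-1)=6$ above, so the two neighbouring pieces agree there up to the second derivative.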

A basis for this vector space can be found using functions $f_i$ for $i=-1,\ldots,n+1$. For readability purposes we define

$x_{j}=x_0+jh$

even for negative $j$ and $j>n$ .

The functions $f_i$ are defined such that

$f_i(x)=f\left(\frac{x-x_i}{h}\right) \text{ for } i=-1,\ldots,n+1$

These functions fulfill conditions (1a), (2) and (3), because they inherit that from $f$ .
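
As a small illustration of the shifted and scaled functions $f_i$, the sketch below evaluates them for a given position and lists which of them are non-zero there. The helper names `basis` and `active_indices` and the sample values for $x_0$, $h$ and $n$ are assumptions made only for this example.

```python
def prototype(t: float) -> float:
    """Prototype function (5)."""
    if t <= -2.0 or t > 2.0:
        return 0.0
    if t <= -1.0:
        return (t + 2.0) ** 3
    if t <= 1.0:
        u = 1.0 - abs(t)
        return 1.0 + 3.0 * u + 3.0 * u * u - 3.0 * u ** 3
    return (2.0 - t) ** 3


def basis(i: int, x: float, x0: float, h: float) -> float:
    """f_i(x) = f((x - x_i) / h) with x_i = x0 + i * h."""
    return prototype((x - (x0 + i * h)) / h)


def active_indices(x: float, x0: float, h: float, n: int) -> list:
    """Indices i in -1..n+1 for which f_i(x) is non-zero."""
    return [i for i in range(-1, n + 2) if basis(i, x, x0, h) != 0.0]


# Hypothetical setup: four subintervals of length 1 starting at 0.
print(active_indices(2.5, 0.0, 1.0, 4))
```

Inside a subinterval only four of the $n+3$ functions are non-zero, which is what keeps the sums in the following steps short.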

By induction it can be proved that they are linearly independent. It is true for $\{f_{-1}\}$ alone. If it is true for $\{f_{-1},\ldots,f_{i-1}\}$, it is also true for $\{f_{-1},\ldots,f_i\}$, because

$f_i\left(x_i+\frac{3}{2}h\right) >0$

and

$\bigwedge_{j=-1}^{i-1}f_j\left(x_i+\frac{3}{2}h\right)=0\text{.}$

Since

$\{f_{-1},\ldots,f_{n+1}\}$

contains exactly $n+3$ elements, it is a vector space basis.

That means that we are searching for a function

$(6)\thickspace g(x) = \sum_{i} a_i f_i(x)$

such that the minimality condition (4) holds.
This is accomplished by substituting (6) into (4) and calculating the partial derivatives with respect to each $a_i$:

$(4a)\thickspace S(a_{-1},\ldots,a_{n+1}) = \sum_{j=1}^N \left(g\left(\xi_j\right)-\eta_j\right)^2$

$= \sum_{j=1}^N ( \sum_{i} a_i f_i(\xi_j)-\eta_j)^2$

Thus

$(4b) \thickspace\bigwedge_{k=-1}^{n+1} 0 = \frac{\partial}{\partial a_k}\sum_{j=1}^N \left(g\left(\xi_j\right)-\eta_j\right)^2$

$= \frac{\partial}{\partial a_k}\sum_{j=1}^N \left( \sum_{i} a_i f_i\left(\xi_j\right)-\eta_j\right)^2$

$=\sum_{j=1}^N \left(2 f_k\left(\xi_j\right)\left( \sum_{i} a_i f_i\left(\xi_j\right)-\eta_j\right)\right)$

$=2\sum_{i} a_i \sum_{j=1}^N f_k\left(\xi_j\right) f_i\left(\xi_j\right)-2\sum_{j=1}^N f_k\left(\xi_j\right)\eta_j$

So it comes down to solving the linear equation system

$\sum_{i=-1}^{n+1} a_i \sum_{j=1}^N f_k\left(\xi_j\right) f_i\left(\xi_j\right) = \sum_{j=1}^N f_k\left(\xi_j\right)\eta_j\thickspace\text{ for }k=-1,\ldots,n+1$
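
To make the linear equation system concrete, here is a minimal sketch in plain Python that builds the matrix and the right-hand side from the samples and solves for the $a_i$. All function names are made up for this example, and the solver uses ordinary partial pivoting as a stand-in, not the pivot rule of this article.

```python
def prototype(t):
    """Prototype function (5)."""
    if t <= -2.0 or t > 2.0:
        return 0.0
    if t <= -1.0:
        return (t + 2.0) ** 3
    if t <= 1.0:
        u = 1.0 - abs(t)
        return 1.0 + 3.0 * u + 3.0 * u * u - 3.0 * u ** 3
    return (2.0 - t) ** 3


def design_row(x, x0, h, n):
    """Values f_{-1}(x), ..., f_{n+1}(x) for one sample position x."""
    return [prototype((x - x0) / h - i) for i in range(-1, n + 2)]


def normal_equations(xi, eta, x0, h, n):
    """Matrix sum_j f_k(xi_j) f_i(xi_j) and right-hand side sum_j f_k(xi_j) eta_j."""
    m = n + 3
    matrix = [[0.0] * m for _ in range(m)]
    rhs = [0.0] * m
    for x, y in zip(xi, eta):
        row = design_row(x, x0, h, n)
        for k in range(m):
            if row[k] == 0.0:
                continue  # f_k vanishes at this sample
            rhs[k] += row[k] * y
            for i in range(m):
                matrix[k][i] += row[k] * row[i]
    return matrix, rhs


def solve(matrix, rhs):
    """Gaussian elimination with plain partial pivoting (stand-in solver)."""
    m = len(rhs)
    a = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            raise ValueError("singular system: too few distinct sample positions")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, m):
            factor = a[r][col] / a[col][col]
            for c in range(col, m + 1):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * m
    for r in range(m - 1, -1, -1):
        tail = sum(a[r][c] * x[c] for c in range(r + 1, m))
        x[r] = (a[r][m] - tail) / a[r][r]
    return x


def evaluate(coeffs, x, x0, h):
    """g(x) = sum_i a_i f_i(x); coeffs[0] holds a_{-1}."""
    n = len(coeffs) - 3
    return sum(a * v for a, v in zip(coeffs, design_row(x, x0, h, n)))
```

A typical call would be `coeffs = solve(*normal_equations(xi, eta, x0, h, n))` followed by `evaluate(coeffs, x, x0, h)`. Since at most four $f_i$ are non-zero at any sample, the matrix is banded; this sketch does not exploit that.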

This can be solved using a variant of the Gaussian elimination algorithm. Since this is a numerical problem, it is important to deal with the issue of rounding. Generally it is recommended to choose the pivot element wisely.
In this case the approach is to iterate through the columns. For each column the row is chosen in which the element in that column has the largest absolute value relative to the cubic mean of the absolute values of the other entries in the row.
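
The pivot rule described above could be sketched as follows. This is one possible reading, not the actual implementation: the „other entries“ are taken to be the coefficient entries to the right of the current column (excluding the right-hand side), and the names `cubic_mean`, `pivot_score` and `solve_scaled` are made up for this example.

```python
def cubic_mean(values):
    """Cubic mean of the absolute values; 0.0 for an empty sequence."""
    values = [abs(v) for v in values]
    if not values:
        return 0.0
    return (sum(v ** 3 for v in values) / len(values)) ** (1.0 / 3.0)


def pivot_score(row, col, size):
    """|row[col]| relative to the cubic mean of the entries right of col (assumption)."""
    entry = abs(row[col])
    if entry == 0.0:
        return 0.0
    others = cubic_mean(row[c] for c in range(col + 1, size))
    return float("inf") if others == 0.0 else entry / others


def solve_scaled(matrix, rhs):
    """Gaussian elimination, choosing in each column the row with the best score."""
    size = len(rhs)
    a = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(size):
        best = max(range(col, size), key=lambda r: pivot_score(a[r], col, size))
        if a[best][col] == 0.0:
            raise ValueError("matrix is singular")
        a[col], a[best] = a[best], a[col]
        for r in range(col + 1, size):
            factor = a[r][col] / a[col][col]
            for c in range(col, size + 1):
                a[r][c] -= factor * a[col][c]
    # Back substitution on the upper triangular system.
    x = [0.0] * size
    for r in range(size - 1, -1, -1):
        tail = sum(a[r][c] * x[c] for c in range(r + 1, size))
        x[r] = (a[r][size] - tail) / a[r][r]
    return x
```

Compared with plain partial pivoting, a row whose entry is large only because the whole row is scaled up does not win the selection.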

When actually using the spline function a lot, it is probably a good idea to consolidate the linear combinations of different $f_i$ s within each subinterval into a cubic polynomial of the form

$f(x)=a+b(x-A)+c(x-A)^2+d(x-A)^3.$

This can be based on the starting point of the interval, the end point or some point in the middle, probably the arithmetic mean of the interval borders. These choices of $A$ have the advantage that the terms that need to be added become smaller in absolute value. Since the exact end result is the same anyway, this helps to avoid rounding errors, which can go terribly wrong when adding (or subtracting) terms with large absolute values where the result is much smaller than the terms. So the arithmetic mean of the subinterval borders might be the best choice.
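
The consolidation can be sketched as follows. On $[x_j, x_{j+1}]$ only $f_{j-1}, f_j, f_{j+1}, f_{j+2}$ are non-zero, and with $s=(x-x_j)/h$ their pieces from (5) are $(1-s)^3$, $4-6s^2+3s^3$, $1+3s+3s^2-3s^3$ and $s^3$. The function names, the list layout (`coeffs[0]` holds $a_{-1}$) and the parameter `s0` (relative position of $A$ in the subinterval, `0.5` for the midpoint) are assumptions of this example.

```python
def prototype(t):
    """Prototype function (5), used here only for the comparison."""
    if t <= -2.0 or t > 2.0:
        return 0.0
    if t <= -1.0:
        return (t + 2.0) ** 3
    if t <= 1.0:
        u = 1.0 - abs(t)
        return 1.0 + 3.0 * u + 3.0 * u * u - 3.0 * u ** 3
    return (2.0 - t) ** 3


def direct_value(coeffs, x, x0, h):
    """g(x) as the plain linear combination of the f_i."""
    indices = range(-1, len(coeffs) - 1)
    return sum(a * prototype((x - x0) / h - i) for i, a in zip(indices, coeffs))


def local_cubic(coeffs, j, x0, h, s0=0.5):
    """Return (A, a, b, c, d) with g(x) = a + b(x-A) + c(x-A)^2 + d(x-A)^3
    on the subinterval [x_j, x_{j+1}], where A = x_j + s0 * h."""
    am, a0, a1, a2 = coeffs[j], coeffs[j + 1], coeffs[j + 2], coeffs[j + 3]
    # Coefficients of the polynomial in s = (x - x_j) / h.
    p0 = am + 4.0 * a0 + a1
    p1 = 3.0 * (a1 - am)
    p2 = 3.0 * (am - 2.0 * a0 + a1)
    p3 = -am + 3.0 * a0 - 3.0 * a1 + a2
    # Re-expand around s0: value, first derivative, half of the second derivative.
    q0 = ((p3 * s0 + p2) * s0 + p1) * s0 + p0
    q1 = (3.0 * p3 * s0 + 2.0 * p2) * s0 + p1
    q2 = 3.0 * p3 * s0 + p2
    return (x0 + (j + s0) * h, q0, q1 / h, q2 / h ** 2, p3 / h ** 3)
```

Evaluating the returned polynomial and `direct_value` at the same position should agree up to rounding, which is a cheap sanity check for the coefficients.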

The actual formulas and a program will be added in one or two articles in the near future.

Spline Approximation (Introduction)

We sometimes encounter a situation where a number of points with coordinates $(x,y)$ are given and we want to find a function such that for all of these points we have $f(x)=y$ (interpolation) or $f(x) \approx y$ (approximation). Most often we say that we want $|f(x) - y|$ to be as small as possible on average, and for whatever reasons the quadratic mean is usually used as average. The simplest and best known approximation is probably linear regression, where a straight line is found that tries to approximate the points. This can be extended to other functions, for example polynomials of a fixed maximum order. For something that is supposed to be periodic, linear combinations of functions of the form $f(t)=\sin(at)$ and $f(t)=\cos(at)$ might be useful.

Now it may not be easy to express the whole extent by one function. So the interval, in which the x-coordinates lie, might be subdivided into subintervals and linear regression or whatever is being used can be performed in each subinterval separately. This results in a polygon-like curve or worse in a curve that „jumps“ at the interval borders. This can well be good enough and it is relatively easy to implement. With the additional constraint, that it should be continuous at the interval borders, it becomes a bit more difficult.

Now there is some preference for smooth curves. For example it might be desirable that the function is continuous (i.e. it does not jump at interval borders) and even its first and second derivative should be continuous. This roughly resembles a mechanical spline as it was used in the old days for drawing and constructing. Kind of an elastic ruler.

This is where Splines are often used to interpolate a smooth curve that passes through some given points. More precisely cubic splines, but the concept is of course more general. A lot of material can be found on the internet about spline interpolation.

But they can also be used for approximation. So we want a curve that is smooth and that approximates our given points and that is expressible as a simple third degree polynomial in each subinterval.

Just to make it clear, the subintervals are not used as a „divide et impera“ strategy, but we consider all the points at once, just give ourselves the freedom to have different functions in different subintervals to get a combined function that behaves better than polynomials of very high degree. We do need to think also a bit about the inaccuracy of floating point arithmetic. So polynomials of degree three are still somewhat precise within the subinterval, but a higher order polynomial that is applied to a large interval will become less accurate with normal floating point arithmetic („double“).

I will leave this as a starting point for thinking. In some of the next articles, the spline approximation will be derived mathematically and then there will be a cookbook on how to use it programmatically. My advice is to experiment with the implementation and its parameters until you are confident that it gives sufficiently precise results, to have a look at the math behind it to understand the question of the precision, or to have someone else have a look. With floating point arithmetic it is always a bad idea just to program something that looks right and totally ignore the rounding errors, which can have a huge impact on the result.

There is a lot of material on spline interpolation on the internet. On spline approximation there have been some papers, but very little can be found on the internet.

A followup article covering the mathematics behind spline approximation has been written.

Addresses

Postal addresses seem to be easy:
– Name and/or Company
– Street and/or PostBox
– ZIP Code
– Municipality
– Country

This is true in Germany or Switzerland and a few other countries. I would like to add, that in some large buildings it can be a good idea to add the apartment number, but these buildings are rare in these two countries and the name is really almost always sufficient.

For relational databases please keep in mind that these fields may get just a bit or even a lot longer than we anticipated. US-ZIP-Codes are now something like NY-11713-5532. And others may be longer. How long are names, street names and names of villages and towns? Even the country part, which seems to be relatively easy, brings some challenges. Countries like Switzerland and Germany and Canada are no problem. But what about semi-independent countries like Guernsey, Jersey, …? And what about areas that are de jure part of another country, but de facto an independent country? Or an independent country that is not recognized by every country? There are lists of countries that we can find and usually they also include the „semi-independent“ and „semi-accepted“ countries. But it is a good idea to check if the list is complete enough for our purpose. I would not recommend defining it yourself.

But now the other parts of the address: If we want to describe the location where a person lives, and this person does not live in a housing area, but for example has a nomadic life style, it becomes a bit harder. Or we might observe a different number of lines for the address. Or even that streets are not named consistently and the only reliable address is a postbox. Many countries do not write the names on the door and on the letterbox, but just an apartment number. So this number with some word for „apartment“ that is understood by the local post officer with possibly limited language skills is needed. Also in some countries there are a lot of buildings for „streetname 3053A“, so we need to add the building number also.

My point is that postal addresses are not as easy as they might seem. So I recommend doing some research on the internet to find a library or documentation that handles this or can be used as a basis, instead of trying to reinvent the wheel with something that will probably be incomplete and suffer from its insufficiencies at some point. It is sometimes more important to recognize which seemingly trivial problems are actually harder than to be able to invent solutions for such problems. We should invest our energy in solving problems that have not been solved with publicly available documentation or even libraries and that make up our business. In this case the actual implementation is rather trivial, but the specification of the requirements is the hard part, so it is enough to find some useful documentation for this. It might of course be sufficient to handle only, for example, French addresses if the system will never be dealing with foreign addresses, but experience shows that at least some thinking about what it means to extend the system later is a good idea.

And please, handle too long entries properly instead of displaying a stack trace on the end user's screen or even providing a spot to attack the server.

Happy New Year 2021

Un an nou fericit! — Happy new year! — Срећна нова година! — Frohes neues Jahr! — Onnellista uutta vuotta! — С новым годом! — Gullukkig niuw jaar! — FELIX SIT ANNUS NOVUS — ¡Feliz año nuevo! — Feliĉan novan jaron! — عام سعيد — Щасливого нового року! — Gott nytt år! — Bonne année! — Καλή Χρονια! — Felice anno nuovo! — Godt nytt år!

This was generated with JavaScript using Rhino:
a = ["Frohes neues Jahr!", "Happy new year!", "Gott nytt år!", "¡Feliz año nuevo!", "Bonne année!", "FELIX SIT ANNUS NOVUS", "С новым годом!", "عام سعيد", "Felice anno nuovo!", "Godt nytt år!", "Gullukkig niuw jaar!", "Feliĉan novan jaron!", "Onnellista uutta vuotta!", "Срећна нова година!", "Un an nou fericit!", "Щасливого нового року!", "Καλή Χρονια!"];
b = a.map(function(x) { return Math.floor(1000000000 + Math.random()*1000000) + " " + x; });
b.sort();
c = b.map(function(x) { return x.replace(/^\d+\s+/, ""); });
print(c.join(" — "));

Christmas 2020

¡Feliz Navidad! — καλά Χριστούγεννα! — Buon Natale! — З Рiздвом Христовим! — クリスマスおめでとう ; メリークリスマス — Natale hilare! — Merry Christmas! — ميلاد مجيد — Hyvää Joulua! — С Рождеством! — Joyeux Noël! — God Jul! — Feliĉan Kristnaskon! — Crăciun fericit! — God Jul! — Frohe Weihnachten! — Срећан Божић! — Prettige Kerstdagen!

This was generated by a bash script. I am using Perl instead of sed, but not for program logic:
#!/bin/bash
set -e

mkfifo /tmp/tmp_pipeA-$$
mkfifo /tmp/tmp_pipeB-$$

cat << EOTXT |head -18 > /tmp/tmp_pipeA-$$ &
С Рождеством!
Hyvää Joulua!
καλά Χριστούγεννα!
Buon Natale!
Prettige Kerstdagen!
З Рiздвом Христовим!
Merry Christmas!
Срећан Божић!
God Jul!
¡Feliz Navidad!
ميلاد مجيد
クリスマスおめでとう ; メリークリスマス
Natale hilare!
Joyeux Noël!
God Jul!
Frohe Weihnachten!
Crăciun fericit!
Feliĉan Kristnaskon!
EOTXT

od -x /dev/urandom \
|head -18 \
|perl -p -e 's/^\d+\s//;' > /tmp/tmp_pipeB-$$ &

paste /tmp/tmp_pipeB-$$ /tmp/tmp_pipeA-$$ \
|sort \
|cut -f 2 \
|perl -p -e 's/\n/ — /g;' \
|perl -p -e 's/ — $/\n/;'

rm -f /tmp/tmp_pipeA-$$
rm -f /tmp/tmp_pipeB-$$

Functional Scala

I participated online in the conference „Functional Scala 2020“ in London. That it was in London mattered mostly in one respect, which was the time zone. There was no physical location and all talks were done online. An interesting idea was a virtual location. It consisted of rooms and we could move a dot representing ourselves around. Each room consisted of a beautiful landscape as a map of a different climate zone. I could hear what others said when I moved my dot closer to them, as in real life, and have some nice conversations like that.

A lot of things were said about Scala 3, which will be a big step forward, but also a big step, because it is not compatible with Scala 2. So some work will be necessary to move on to Scala 3, but we will gain a better language for beginners, intermediates and advanced Scala developers.

I am really looking forward to Functional Scala 2021, hopefully in London.

How to disable touchpad (on Linux/X11)

For me it is much better to use an external mouse than the touchpad, which I sometimes touch accidentally.

So, here is how to disable it with a short Perl-Script. A bash script with a bit of Perl would do the same, btw.

#!/usr/bin/perl
# find the xinput line that describes the touchpad
my $tp = `xinput list | egrep -i touch`;
chomp $tp;
# reduce that line to the numeric device id
$tp =~ s/.+id=(\d+).+/$1/;
system "xinput set-prop $tp 'Device Enabled' 0";
print "Touchpad disabled\n";

Devoxx UA 2020 (talks)

I watched the conference online and picked the following talks:

The Effective Developer – Work Smarter, Not Harder by Sven Peters
Polyglot, Fault-Tolerant Event-Driven Programming with Kafka, Kubernetes and gRPC by Natan Silnitsky
Reacting to an Event-Driven World by Grace Jansen
Cause of Death: Hibernate by Thorben Janssen
NoSQL Endgame by Otavio Santana, Thodoris Bais, Werner Keil
Local Variable Type Inference: Friend or Foe by Simon Ritter
Dungeons, Dragons, and Graph Databases by Guy Royse
Bootiful RSocket by Josh Long

And on the second day:

Devoxx UA

Most conferences have been cancelled, since it is difficult to hold a conference these days. The idea to move the conference online has obviously been around, but it was rejected by most organizers, because it is not the same and the all-important chance to meet other people is lost. So the Devoxx in Antwerp, which I like to visit every year, did not happen. But Devoxx Ukraine decided to go online.

So how did it work: There were three tracks. Each track was represented by a youtube channel, on which the live talk was transmitted. Before and after the talks, professional moderators appeared in these channels, announced the speakers and did what moderators do in normal conferences. The devoxx App worked on my cell phone and that seemed to be the most up to date schedule.

Some talks were really excellent. I enjoyed them a lot even online. For talks that are not so good, it requires more discipline to stay tuned.

The discussion was done in zoom channels that belonged to the three tracks. So discussions could technically last until the end of the next talk. The discussions in zoom also worked quite well, which was a surprise.

I think it is probably the right reaction to have fewer conferences than usual in a year, to cancel some and to move some online, exactly as it is happening. I think DevoxxUA had around 10’000 visitors, so they absorbed the audiences of several conferences. Also some regular speakers at Devoxx conferences were not giving talks, so I assume that some of the speakers also decided that the online format is not ideal for them.

About the contents I will write in another Blog article.

How to rename files according to a pattern

We often encounter situations, where a large number of files should be copied or renamed or moved or something like that.
This can be done on the Linux command line, but it should be possible in almost the same way on the Unix/Linux/Cygwin-command line of newer MS-Windows or MacOS-X.

Now people routinely do that and they have developed several ways of doing it, which are all valid and useful.

I will show how I do things like that. It works and it is not the only way to do it.

So in the most simple case, all files in a directory ending in ‚.a‘ should be renamed to ‚.b‘.

What I do is:

ls *.a \
|perl -p -e 'chomp;$x = $_;s/\.a$/.b/;$y = $_; s/.+/mv $x $y\n/;' \
|egrep '^mv ' \
|sh

You can run it without the last |sh, to check if it really does what you want.

So I use the files as input to a short perl script and create shell commands. It would be possible to do this actually in Perl itself, without piping it into a shell:

ls *.b \
|perl -n -e 'chomp;$x = $_;s/\.b$/.c/;$y=$_;rename $x, $y;'

You could also read the directory from perl, it is quite easy, but for just quickly doing stuff, I prefer getting the input from some ls.

To go into sub directories, you can use find:

find . -name '*.c' -type f -print \
| perl -n -e 'chomp;$x = $_;s/\.c$/.d/;$y=$_;rename $x, $y;'

You can also rename all the files that contain a certain string:
find . -name '*.html' -type f -print \
|xargs egrep -l form \
|perl -n -e 'chomp; $x=$_;s/\.html$/.form/;$y=$_;rename $x, $y;'

So you can combine with all kinds of shell commands and do really a lot of things in one line.

Of course you can use Raku, Ruby, Python or your favorite scripting language instead, as long as it allows some simple pattern matching and an efficient implicit iteration over the lines.
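
As an illustration of the same idea in Python, here is a minimal sketch using only the standard library. The function name `rename_suffix` and its parameters are made up for this example, and the dry run default mirrors the habit of leaving off the final `|sh` first.

```python
from pathlib import Path


def rename_suffix(directory, old, new, recursive=False, dry_run=True):
    """Rename files ending in `old` so that they end in `new`.

    Returns the list of (source, target) pairs; with dry_run=True nothing
    is renamed, so the plan can be inspected first.
    """
    if not old:
        raise ValueError("old suffix must not be empty")
    root = Path(directory)
    pattern = "*" + old
    candidates = root.rglob(pattern) if recursive else root.glob(pattern)
    planned = []
    # sorted() reads the whole listing before any file is renamed.
    for source in sorted(candidates):
        if not source.is_file():
            continue
        target = source.with_name(source.name[: -len(old)] + new)
        planned.append((source, target))
        if not dry_run:
            source.rename(target)
    return planned
```

For example `rename_suffix(".", ".a", ".b")` only lists what would happen, and adding `dry_run=False` performs the renaming. Note that on Linux and Unix an existing target file is silently replaced, so checking the plan first is worthwhile.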

For such simple tasks there are also ways to do it directly in the shell like this:
for f in *.d ; do mv "$f" "$(basename "$f" .d).e"; done

And you can always use sed, possibly in conjunction with awk instead of perl for such simple tasks.

Another approach is to just pipe the files into a text editor that is powerful enough and create a one time script using powerful editing commands.
On Linux and Unix servers we almost always use vi, even people like me, who prefer Emacs on their own computer:
ls *.e > tmpscript
vi tmpscript
and then in vi

:0,$s/\(.*\)\(.e\)$/mv \1\2 \1.f/
ZZ
and then
sh tmpscript
rm tmpscript

So, there are many ways to achieve this goal and they are flexible and powerful enough to do really a lot more than just such simple pattern renaming.

If you work in a team and put these things into scripts, it might be necessary to follow a team policy about which scripting languages are preferred and which patterns are preferred. And you need to know the stuff that you write yourself, but also the stuff that your colleagues write.

Please, do not do
mv *.a *.b
It won’t work for good reasons.
On Linux and Unix systems the shell (usually bash) expands the glob expression (the stuff with the stars) into a list of strings and then starts mv with these strings as parameters. So when mv is called with some file names ending in .a and .b, it cannot have any idea what to do. When called with more than two parameters, the last one needs to be a directory where to move the stuff, so usually it will just refuse to work.
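
What mv actually receives can be simulated with a few lines of Python. This is only a rough sketch of the default bash behaviour (a pattern without matches is passed on literally, and the real sort order depends on the locale); the function names are made up for this example.

```python
import glob
import os


def expand_like_shell(directory, pattern):
    """Expand one glob pattern roughly the way bash does by default."""
    matches = sorted(glob.glob(os.path.join(glob.escape(directory), pattern)))
    if not matches:
        return [pattern]  # no match: the pattern itself is passed on
    return [os.path.basename(m) for m in matches]


def mv_arguments(directory, *patterns):
    """The argument list that mv would see for the given patterns."""
    args = []
    for pattern in patterns:
        args.extend(expand_like_shell(directory, pattern))
    return args
```

In a directory containing only `x.a` and `y.a`, `mv_arguments(".", "*.a", "*.b")` yields `x.a`, `y.a` and the literal `*.b`, which is not a directory. In the special case of exactly one `.a` file and one `.b` file, mv would get two file names and overwrite the second with the first, which is hardly what was intended.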
