Scanning, sorting and processing large numbers of photos

I guess for most of us this is more an issue of their private life rather than done professionally, and those woo do this for money should already have answers for everything…. But the IT aspects of this are interesting anyway…

So some of us, including me, have hundreds or thousands of photographs that have been created using analog photography. I am still using it, because I have a good equipment, the prices and availability for films and prints and scanning of the negatives to a CD are still good. My equipment is good and I am neither willing to give that up nor to do a major investment. It will come some day in the future and I expect that within five to ten years the reasonably priced and ubiquitous offers for handling of negative films and prints will disappear.

Anyway it is a good idea to scan all the slides and negatives, at least the ones that are of any interest. It is easier and cheaper to copy them, to get prints and to do some improvements with software like Gimp prior to creating prints. Also it is also possible to use and share them online.

Scanning with a flat bed scanner is not an option for negatives, it works with prints, but I think that it is too slow and I do not like the loss in quality due to the unnecessary intermediate step. This leaves two options, getting a negative scanner myself or using a service. So it is good to assume that they are already scanned for now. I organize the photos in a directory structure. The names should contain only 7-Bit-ASCII-characters, but no spaces, to be easier accessible by scripts and on the shell. I have scripts to rename them to this pattern, for directories and for files. They can be found under my github project „photo processing scripts“ with names:
* rename-canonical
* rename-dir-canonical
* rename-dirs-radically
Another interesting issue is finding and removing duplicates, but since the name of the file and its position int the file system do have some meaning, this needs some attention. When two identical files A and B are found, there are five resolutions:
* rm A (remove A, leave B)
* rm B (remove B, leave A)
* rm A ; ln -s B A (make A a softlink to B)
* rm B ; ln -s A B (make B a softlink to A)
* rm B ; ln -f A B (make B a hardlink of A. Apart from the inode number this is equivalent to the opposition direction)
Which of these is actually prefered? My scripts picks the last option, but does not actually perform it. Instead it just create output of the shell commands, which can be piped to a file or directly to sh, in which case they are immediately executed. Otherwise it is possible to edit the command, filter them or even change them with a one-liner in the Perl programming language. This can be found here:
* find-dups
For viewing the photos in the browser, I have added another script, that is called
* create-foto-index
It searches the current directory and all sub directories, except those starting with a dot („.“) recursively. For each image file a thumb nail image is needed, which is eather found in the .thumbs directory or created using the script
* scale-image
Then an index.html file is created in each dictory having links to its child directories, the neighboring directories and the parent directory. For each image the thumbnail is included and it is a link to the full sized image. With this it is easily possible to vieww the whole album in a browser locally.
Some images know their orientation already from the camera or phone, but they appear wrong anyway. These can be fixed automatically running the script
* auto-rotate
in the directory.
I have a web server and a CGI-script running:
* cgi/mark-images.cgi
which allows me to mark images with a checkbox or with a string. Using letters „D“ for delete, „R“ for rotate right (90 degree clockwise), „L“ for rotate left (90 degries counter clockwise) and „F“ for flip (rotate 180 degrees) and then press the OK button.
Running the script
* rotate-checked
which will delete and rotate the images according to the choices in the form.

This is already quite a useful situation. Images that are needed for prints or for the web might need some processing with GIMP:
* possibly rotate them in such a way that the horizon is horizontal and vertical lines are vertical, at least in the middle of the image.
* possibly correct perspective
* possibly sharpen
* possibly correct contrast and brightness
* possibly correct color saturation and colors
* cut out what is really interesting
* save it under a different name
* call create-foto-index again.
The webform and the CGI-script can be used for picking which images to edit. After having pressed OK it will be done like this:

gimp `egrep 'jpg$' </var/lib/wwwrun/mark-fotos/marked.dat` &

In a similar way images from a directory can be selected in indexf.html and then extracted to a ZIP:

zip my-archive.zip `egrep 'jpg$' </var/lib/wwwrun/mark-fotos/marked.dat`

which can be given to somebody or uploaded for creating prints or just unpacked in anther directory to have only the good images.

There are some more issues, which I might address in another article.

Share Button

Will Java, C, C++ and C# be the new Cobols?

A few decades ago most programming was performed in Cobol (I do not want to shout it), Fortran, Rexx and some typical main frame languages, that hardly made it to the Linux-, Unix- or MS-Windows-world. They are still present, but mostly used for maintenance and extension of existing software, but less often for writing new software from scratch.
In these days languages like C, C++, Java and to a slightly lesser extent C# dominate the list of most commonly used languages. I would assume that JavaScript is also quite prominent in the list, because it has become more popular to write rich web clients using frameworks like Angular JS. And there are tons of them and some really good stuff. Some people like to see JavaScript also on the server side and in spite of really interesting frameworks like Node-JS I do not really consider this a good idea. If you like you may add Objective C to this list, which I do not know very much at all, even though it has been part of my gcc since my first Linux installation in the early 1990es.

Anyway, C goes back to the 1970es and I think that it was a great language to create at that time and it still is for a certain set of purposes. When writing operating systems, database engines, compilers and interpreters for other languages, editors, or embedded software, everything that is very close to the hardware, explicit control and direct access to very powerful OS-APIs are features that prove to be useful. It has been said that Java runs as fast as C, which is at least close to the truth, but only if we do not take into account the memory usage. C has some short comings that could be done better without sacrificing its strengths in the areas where it is useful, but it does not seem to be happening.

C++ has been the OO-extension of C, but I would say that it has evolved to be a totally different language for different purposes, even though there is some overlap, it is relatively easy to call functionality written in C from C++ and a little bit harder the other way round… I have not used it very much recently, so I will refrain from commenting further on it.

Java has introduced an infrastructure that is very common now with its virtual machine. The JVM is running on a large number of servers and any JVM-language can be used there. The platform independence is an advantage, but I think that its importance on servers has diminished a little bit. There used to be all kinds of servers with different operating systems and different CPU-architectures. But now we are moving towards servers being mostly Linux with Intel-compatible CPUs, so it is becomeing less of an issue. This may change in the future again, of course.

With Mono C# can be used in ways similar to Java, at least that is what the theory says and what seems to be quite true at least up to a certain level. It seems to be a bit ahead of Java with some language features, just think of operator overloading, undeclared exceptions, properties, generics or lambdas, which have been introduced earlier or in a more elegant way or we are still waiting in Java. I think the case of lambdas also shows the limitations, because they seem to behave differently than you would expect from real closures, which is the way lambdas should be done and are done in more functionally oriented languages or even in the Ruby programming language, in the Perl programming language or typical Lisps.
Try this

List<Func<int>> actions = new List<Func<int>>();

int variable = 0;
while (variable < 5)
{
    actions.Add(() => variable * 2);
    ++ variable;
}

foreach (var act in actions)
{
    Console.WriteLine(act.Invoke());
}

We would expect the output 0, 2, 4, 6, 8, but we are getting 10, 10, 10, 10, 10 (one number in a line, respectively).
But it can be fixed:

List<Func<int>> actions = new List<Func<int>>();

int variable = 0;
while (variable < 5)
{
    int copy = variable;
    actions.Add(() => copy * 2);
    ++ variable;
}

foreach (var act in actions)
{
    Console.WriteLine(act.Invoke());
}

I would say that the concept of inner classes is better in Java, even though what is static there should be the default, but having lambdas this is less important…
You find more issues with class loader, which are kind of hard to tame in java, but extremely powerful.

Anyway, I think that all of these languages tend to be similar in their syntax, at least within a method or function and require a lot of boiler plate code. Another issue that I see is that the basic types, which include Strings, even if they are seen as basic types by the language design, are not powerful enough or full of pitfalls.

While the idea to use just null terminated character arrays as strings in C has its beauty, I think it is actually not really good enough and for more serious C applications a more advanced string library would be good, with the disadvantage that different libraries will end up using different string libraries… Anyway, for stuff that is legitimately done with C now, this is not so much of an issue and legacy software has anyway its legacy how to deal with strings, and possible painful limitations in conjunction with Unicode. Java and also C# have been introduced at a time when Unicode was already around and the standard already claimed to use more than 65536 code points (characters in Unicode-speak), but at that time 65536 seemed to be quite ok to cover the needs for all common languages and so utf-16 was picked as an encoding. This blows up the memory, because strings occupy most of the memory in typical application software, but it still leaves us with uncertainties about length and position, because code points can be one or two 16-bit-„characters“ long, which can only be seen by actually iterating through the string, which leaves us where we were with null terminated strings in C. And strings are really hard to replace or enhance in this aspect, because they are used everywhere.

Numbers are not good either. As an application developer we should not care about counting bits, unless we are in an area that needs to be specifically optimized. We are using mostly integer types in application development, at least we should. These overflow silently. Just to see it in C#:

int i = 0;
int s = 1;
for (i = 0; i < 20; i++)
{
    s *= 7;
    Console.WriteLine("i=" + i + " s=" + s);
}

which gives us:

i=0 s=7
i=1 s=49
i=2 s=343
i=3 s=2401
i=4 s=16807
i=5 s=117649
i=6 s=823543
i=7 s=5764801
i=8 s=40353607
i=9 s=282475249
i=10 s=1977326743
i=11 s=956385313
i=12 s=-1895237401
i=13 s=-381759919
i=14 s=1622647863
i=15 s=-1526366847
i=16 s=-2094633337
i=17 s=-1777531471
i=18 s=442181591
i=19 s=-1199696159

So it silently overflows, or just takes the remainder modulo 2^{32} with the representation system \{-2^{31} \ldots 2^{31}-1\}. Java, C and C++ behave exactly the same way, only that we need to know what „int“ means for our C-compiler, but if we use 32-bit-ints, it is the same. This should throw an exception or switch to some unlimited long integer. Clojure offers both options, depending on whether you use * or *‘ as operator. So as application developers we should not have to care about these bits and most developers do not think about it. Usually it goes well, but a lot of software bugs are around due to this pattern. It is just wrong in C#, Java, and C++. In C I find it more acceptable, because the typical area for using C for new software actually is quite related to bits and bytes, so the developers need to be aware of such issues all the time anyway.

I would consider it desirable to move to more expressive languages like Clojure, Scala, F#, Ruby or Perl for application development. Ruby and Perl have better Strings. Clojure and Scala inherit them from the JVM, and F# has the same strings as C#. Ruby and Clojure have a good way to deal with integers, Scala, Perl and F# can do it right if we actually want to do so, but not by default. Perl and Ruby are very weak when it comes to multithreading. As compared to Java this can be dealt with by just using more processes instead of threads, because the overhead of a Ruby or Perl process is much less than the overhead of a Java process, but I would see this as a major drawback. C, C#, Java and C++ offer good facilities to use multithreading, but the issue of avoiding typical multithreading bugs is a big deal and actually too hard for a large fraction of typical application developers. Or at least too far away from there point of focus. Moving to a more functional paradigm might be a way to go. Java enterprise edition is a failure if the goal is to get multithreading, done well without having to worry about it, because the overhead is too much. On the other hand, if you are willing to go the extra mile, having more explicit access to the multithreading mechanism and using it correctly is extremely powerful, for example in C with pthreads or with a deliberate usage of processes, shared memory and threads together. For which kind of projects do we have the time and the team for this? I am not talking about multithreaded applications that work well on the developer’s laptop, but fail during some high load processing in production with some concurrent modification issues a few months after the deployment. Thinking cannot be replaced by testing.

So now we have a lot of software in C, C++, Java and C# and a lot of new software is written in these languages, even from scratch. We could do better, sometimes we do, sometimes we don’t. It is possible to write excellent application software with Java, C++, C# and even C. It just takes a bit longer, but if we use them with care, it will be ok. Some companies are very conservative and want to use stuff that has been around for a long time. This is sometimes right and sometimes wrong. And since none of the more modern languages has really picked up so much speed that it can be considered a new main stream, it is understandable that some organizations are scared about marching into a dead end road.

On the other hand, many businesses can differentiate themselves by providing services that are only possible by having a very innovative IT. Banks like UBS and Credit Suisse in Switzerland are not likely to be there, while banks like ING are on that road. As long as they compete for totally different customer bases and as long as the business has enough strengths that are not depending so heavily on an innovative IT, but just on a working robust IT, this will be fine. But time moves on and innovation will eventually out-compete conservative businesses.

Share Button

Frameworks for Unit Testing and Mocking

Unit testing has fortunately become an important issue in many software projects. The idea of automatic software based unit and integration tests is actually quite old. The typical Linux software that is downloaded as source code and then built with steps like

tar xfzvv «software-name-with-version».tar.gz
cd «software-name-with-version»
./configure
make
sudo make install

often allows a step

make test

or

make check

or even both before the

make install

It was like that already in the 1990s, when the word „unit test“ was unknown and the whole concept had not been popularized to the main stream.

What we need is to write those automated tests to an extent that we have good confidence that the software will be reliable enough in terms of bugs if it passes the test suite. The tests can be written in any language and I do encourage you to think about using other languages, in order to be less biased and more efficient for writing the tests. We may choose to write a software in C, C++ or Java for the sake of efficiency or easier integration into the target platform. But these languages are efficient in their usages of CPU power, but not at all efficient in using developer time to write a lot of functionality. This is ok for most projects, because the effort it takes to develop with these languages is accepted in exchange for the anticipated benefits. For testing it is another issue.

On the other hand there are of course advantages in using actually the same language for writing the tests, because it is easier to access the APIs and even internal functionalities during the tests. So it may very well be that Unit tests are written in the same language as the software and this is actually what I am doing most of the time. But do think twice about your choice.

Now writing automated tests is actually no magic. It does not really need frameworks, but is quite easy to accomplish manually. All we need is kind of two areas in our source code tree. One area that goes into the production code and one area that is only used for the tests and remains on the development and continuous integration machines. Since writing automated tests without frameworks is not really a big deal, we should only look at frameworks that are really simple and easy to use or maybe give us really good features that we actually need. This is the case with many such frameworks, so the way to go is to actually use them and save some time and make the structure more accessible to other team members, who know the same testing framework. Writing and running unit tests should be really easy, otherwise it is not done or the unit tests are disabled and loose contact to the actual software and become worthless.

Bugs are much more expensive, the later they are discovered. So we should try to find as many of them while developing. Writing unit tests and automated integrated tests is a good thing and writing them early is even better. The pure test driven approach does so before actually writing the code. I recommend this for bug fixing, whenever possible.

There is one exception to this rule. When writing GUIs, automated testing is possible, but quite hard. Now we should have UX guys involved and we should present them with some early drafts of the software. If we had already developed elaborate selenium tests by then, it would be painful to change the software according to the advice of the UX guy and rewrite the tests. So I would keep it flexible until we are on the same page as the UX guys and add the tests later in this area.

Frameworks that I like are actually CUnit for C, JUnit for Java, where TestNG would be a viable alternative, and Google-Test for C++. CUnit works extremely well on Linux and probably on other Unix-like systems like Solaris, Aix, MacOSX, BSD etc. There is no reason why it should not work on MS-Windows. With cygwin actually it is extremely easy to use it, but with native Win32/Win64 it seems to need an effort to get this working, probably because MS-Windows is no priority for the developers of CUnit.

Now we should use our existing structures, but there can be reasons to mock a component or functionality. It can be because during the development a component does not exist. Maybe we want to see if the component is accessed the right way and this is easier to track with a mock that records the calls than with the real thing that does some processing and gives us only the result. Or maybe we have a component with is external and not always available or available, but too time consuming for most of our tests.

Again mocking is no magic and can be done without tools and frameworks. So the frameworks should again be very easy and friendly to use, otherwise they are just a pain in the neck. Early mocking frameworks were often too ambitious and too hard to use and I would have avoided them whenever possible. In Java mocking manually is quite easy. We just need an interface of the mocked component and create an implementing class. Then we need to add all missing methods, which tools like eclipse would do for us, and change some of them. That’s it. Now we have mockito for Java and Google-Mock, which is now part of Google-Test, for C++. In C++ we create a class that behaves similar to a Java interface by having all methods pure virtual with keyword „virtual“ and „=0“ instead of the implementation. The destructor is virtual with an empty implementation. They are so easy to use and they provide useful features, so they are actually good ways to go.

For C the approach is a little bit harder. We do not have the interfaces. So the way to go is to create a library of the code that we want to test and that should go to production. Then we write one of more c-files for the test, that will and up in an executable that actually runs the test. In these .c-files we can provide a mock-implementation for any function and it takes precedence of the implementation from the library. For complete tests we will need to have more than one executable, because in each case the set of mocked functions is fixed within one executable. There are tools in the web to help with this. I find the approach charming to generate the C-code for the mocked functions from the header files using scripts in the a href=“https://en.wikipedia.org/wiki/Ruby_(programming_language)“>Ruby programming language or in the Perl programming language.

Automated testing is so important that I strongly recommend to do changes to the software in order to make it accessible to tests, of course within reason. A common trick is to make certain Java methods package private and have the tests in the same package, but a different directory. Document why they are package private.

It is important to discuss and develop the automated testing within the team and find and improve a common approach. Laziness is a good thing. But laziness means running many automated tests and avoid some manual testing, not being too lazy to write them and eventually spending more time on manual repetitive activities.

I can actually teach this in a two-day or three-day course.

Share Button

How to create ISO Date String

It is a more and more common task that we need to have a date or maybe date with time as String.

There are two reasonable ways to do this:
* We may want the date formatted in the users Locale, whatever that is.
* We want to use a generic date format, that is for a broader audience or for usage in data exchange formats, log files etc.

The first issue is interesting, because it is not always trivial to teach the software to get the right locale and to use it properly… The mechanisms are there and they are often used correctly, but more often this is just working fine for the locale that the software developers where asked to support.

So now the question is, how do we get the ISO-date of today in different environments.

Linux/Unix-Shell (bash, tcsh, …)

date "+%F"

TeX/LaTeX


\def\dayiso{\ifcase\day \or
01\or 02\or 03\or 04\or 05\or 06\or 07\or 08\or 09\or 10\or% 1..10
11\or 12\or 13\or 14\or 15\or 16\or 17\or 18\or 19\or 20\or% 11..20
21\or 22\or 23\or 24\or 25\or 26\or 27\or 28\or 29\or 30\or% 21..30
31\fi}
\def\monthiso{\ifcase\month \or
01\or 02\or 03\or 04\or 05\or 06\or 07\or 08\or 09\or 10\or 11\or 12\fi}
\def\dateiso{\def\today{\number\year-\monthiso-\dayiso}}
\def\todayiso{\number\year-\monthiso-\dayiso}

This can go into a file isodate.sty which can then be included by \include or \input Then using \todayiso in your TeX document will use the current date. To be more precise, it is the date when TeX or LaTeX is called to process the file. This is what I use for my paper letters.

LaTeX

(From Fritz Zaucker, see his comment below):

\usepackage{isodate} % load package
\isodate % switch to ISO format
\today % print date according to current format

Oracle


SELECT TO_CHAR(SYSDATE, 'YYYY-MM-DD') FROM DUAL;

On Oracle Docs this function is documented.
It can be chosen as a default using ALTER SESSION for the whole session. Or in SQL-developer it can be configured. Then it is ok to just call

SELECT SYSDATE FROM DUAL;

Btw. Oracle allows to add numbers to dates. These are days. Use fractions of a day to add hours or minutes.

PostreSQL

(From Fritz Zaucker, see his comment):

select current_date;
—> 2016-01-08


select now();
—> 2016-01-08 14:37:55.701079+01

Emacs

In Emacs I like to have the current Date immediately:

(defun insert-current-date ()
"inserts the current date"
(interactive)
(insert
(let ((x (current-time-string)))
(concat (substring x 20 24)
"-"
(cdr (assoc (substring x 4 7)
cmode-month-alist))
"-"
(let ((y (substring x 8 9)))
(if (string= y " ") "0" y))
(substring x 9 10)))))
(global-set-key [S-f5] 'insert-current-date)

Pressing Shift-F5 will put the current date into the cursor position, mostly as if it had been typed.

Emacs (better Variant)

(From Thomas, see his comment below):

(defun insert-current-date ()
"Insert current date."
(interactive)
(insert (format-time-string "%Y-%m-%d")))

Perl

In the Perl programming language we can use a command line call

perl -e 'use POSIX qw/strftime/;print strftime("%F", localtime()), "\n"'

or to use it in larger programms

use POSIX qw/strftime/;
my $isodate_of_today = strftime("%F", localtime());

I am not sure, if this works on MS-Windows as well, but Linux-, Unix- and MacOS-X-users should see this working.

If someone has tried it on Windows, I will be interested to hear about it…
Maybe I will try it out myself…

Perl 5 (second suggestion)

(From Fritz Zaucker, see his comment below):

perl -e 'use DateTime; use 5.10.0; say DateTime->now->strftime(„%F“);‘

Perl 6

(From Fritz Zaucker, see his comment below):

say Date.today;

or

Date.today.say;

Ruby

This is even more elegant than Perl:

ruby -e 'puts Time.new.strftime("%F")'

will do it on the command line.
Or if you like to use it in your Ruby program, just use

d = Time.new
s = d.strftime("%F")

Btw. like in Oracle SQL it is possible add numbers to this. In case of Ruby, you are adding seconds.

It is slightly confusing that Ruby has two different types, Date and Time. Not quite as confusing as Java, but still…
Time is ok for this purpose.

C on Linux / Posix / Unix


#include
#include
#include

main(int argc, char **argv) {

char s[12];
time_t seconds_since_1970 = time(NULL);
struct tm local;
struct tm gmt;
localtime_r(&seconds_since_1970, &local);
gmtime_r(&seconds_since_1970, &gmt);
size_t l1 = strftime(s, 11, "%Y-%m-%d", &local);
printf("local:\t%s\n", s);
size_t l2 = strftime(s, 11, "%Y-%m-%d", &gmt);
printf("gmt:\t%s\n", s);
exit(0);
}

This speeks for itself..
But if you like to know: time() gets the seconds since 1970 as some kind of integer.
localtime_r or gmtime_r convert it into a structur, that has seconds, minutes etc as separate fields.
stftime formats it. Depending on your C it is also possible to use %F.

Scala


import java.util.Date
import java.text.SimpleDateFormat
...
val s : String = new SimpleDateFormat("YYYY-MM-dd").format(new Date())

This uses the ugly Java-7-libraries. We want to go to Java 8 or use Joda time and a wrapper for Scala.

Java 7


import java.util.Date
import java.text.SimpleDateFormat

...
String s = new SimpleDateFormat("YYYY-MM-dd").format(new Date());

Please observe that SimpleDateFormat is not thread safe. So do one of the following:
* initialize it each time with new
* make sure you run only single threaded, forever
* use EJB and have the format as instance variable in a stateless session bean
* protect it with synchronized
* protect it with locks
* make it a thread local variable

In Java 8 or Java 7 with Joda time this is better. And the toString()-method should have ISO8601 as default, but off course including the time part.

Summary

This is quite easy to achieve in many environments.
I could provide more, but maybe I leave this to you in the comments section.
What could be interesting:
* better ways for the ones that I have provided
* other databases
* other editors (vim, sublime, eclipse, idea,…)
* Office packages (Libreoffice and MS-Office)
* C#
* F#
* Clojure
* C on MS-Windows
* Perl and Ruby on MS-Windows
* Java 8
* Scala using better libraries than the Java-7-library for this
* Java using better libraries than the Java-7-library for this
* C++
* PHP
* Python
* Cobol
* JavaScript
* …
If you provide a reasonable solution I will make it part of the article with a reference…
See also Date Formats

Share Button

Conversion of ASCII-graphics to PNG or JPG

Images are usually some obscure binary files. Their most common formats, PNG, SVG, JPEG and GIF are well documented and supported by many software tools. Libraries and APIs exist for accessing these formats, but also a phantastic free interactive software like Gimp. The compression rate that can reasonably be achieved when using these format is awesome, especially when picking the right format and the right settings. Tons of good examples can be found how to manipulate these image formats in C, Java, Scala, F#, Ruby, Perl or any other popular language, often by using language bindings for Image Magick.

There is another approach worth exploring. You can use a tool called convert to just convert an image from PNG, JPG or GIF to XPM. The other direction is also possible. Now XPM is a text format, which basically represents the image in ASCII graphics. It is by the way also valid C-code, so it can be included directly in C programms and used from there, when an image needs to be hard coded into a program. It is not generally recommended to use this format, because it is terribly inefficient because it uses no compression at all, but as intermediate format for exploring additional ways for manipulating images it is of interest.
An interesting option is to create the XPM-file using ERB in Ruby and then converting it to PNG or JPG.

Share Button

Find the next entry in a sequence

In Facebook, Xing, Google+, Vk.com, Linkedin and other of these social media networks we are often encountered with a trivial question like this:

1->2
2->8
3->18
4->32
5->50
6->72
7->?

There are some easy patterns. Either it is some polynomial formula or some trick with the digits.
But the point is, that any such sequence can easily be fullfilled by a polynomial formula. That means we can put any value for 7 and make it work. Or any answer is correct. So what would probably be the real question is the most simple function to full-fill the given constraints. Simplicity can be measured in some way… If the solution is unique is unclear, but let us just look at the polynomial solution.

A function is needed that takes as parameter a list of key-value-pairs (or a hash map) and that yields a function such that the function of any of the key is the associated value.

Assuming a polynomial function in one variable we can make use of the chinese remainder theorem, which can be applied to univariate polynomials over a field F as well as to integral numbers. For a polynomial p(X) we have

    \[p(x) \equiv p(X) \mod X-x\]

where X is the polynomial variable and x\in F is a concrete value.

We are looking for a polynomial p(X) such that for given values x_0,\ldots x_{n-1}, y_0,\ldots,y_{n-1} \in F we have

    \[\bigwedge_{i=0}^{n-1} p(x_i) = y_i\]

or in another way

    \[\bigwedge_{i=0}^{n-1} p(X) \equiv y_i \mod X-x_i\]

which is exactly the Chinese remainder theorem.
Let

    \[I=\{0,\ldots,n-1\}\]

and

    \[\bigwedge_{j=0}^{n-1} I_j = I \setminus \{j\}\]

We can see that for all i \in I the polynomials

    \[e_i = \prod_{j \in I_j} \frac{X-x_j}{x_i-x_j}\]

have the properties

    \[e_i(x_i)=1\]

    \[\bigwedge_{j \in I_i} e_i(x_j)=0\]

or

    \[\bigwedge_{i \in I}\bigwedge_{j \in J} e_i(x_j)=\delta_{i,j}\]

where \delta_{i,j} is the Kronecker symbol, which is 0 if the two indices differ and 1 if they are equal.
Or as congruence:

    \[\bigwedge_{i \in I}\bigwedge_{j \in J} e_i(X)\equiv \delta_{i,j} \mod X-x_j\]

Then we can just combine this and use

    \[p(X) =\sum_{i \in I} y_i e_i(X)\]

This can easily be written as a Ruby function

def fun_calc(pairs)
  n = pairs.size
  result = lambda do |x|
    y = 0
    n.times do |i|
      p_i = pairs[i]
      x_i = p_i[0].to_r
      y_i = p_i[1].to_r
      z = y_i
      n.times do |j|
        if (j != i)
          p_j = pairs[j]
          x_j = p_j[0]
          z *= (x - x_j) / (x_i - x_j)
        end
      end
      y += z
    end
    y
  end
  result
end

This takes a list of pairs as a parameter and returns the polynomial function als lambda.
It can be used like this:

lop = [[0, 0], [1, 1], [2, 4], [3, 9], [4, 16], [5, 25], [6, 36], [7, 64]]

f = fun_calc(lop)

20.times do |x|
  y = f.call(x)
  puts sprintf("%6d -> %6d", x, y)
end

Put this together into a ruby program and add some parsing for the list of pairs or change the program each time you use it and all these „difficult“ questions „that 99.9% fail to solve“ are not just easy, but actually soluble automatically.

This is interesting for more useful applications. I assume that there will always be situations where a function is needed that meets certain exact values a certain inputs and is an interpolation or extrapolation of this.

Please observe that there are other interesting and useful ways to approach this:

  • Use a „best“ approximation from a set of functions, for example polynomials with a given maximum degree
  • use cubic splines, which are cubic polynomials within each section between two neighboring input values such that at the input values the two adjacent functions have the same value (y_i, of course), the same first derivative and the same second derivative.

For highway and railroad construction other curves are used, because the splines are making an assumption on what is the x-axis and what is the y-axis, which does not make sense for transport facilities. They are using a curve called Clothoid.

Use Java, C, Perl, Scala, F# or the programming language of your choice to do this. You only need Closures, which are available in Java 8, F#, Scala, Perl, Ruby and any decent Lisp dialect. In Java 7 they can be done with an additional interface as anonymous inner classes. And for C it has been described in this blog how to do closures.

Share Button

Binary Data

Having discussed to some extent about strings and text data, it is time to look at the other case, binary data.

Usually we think of arrays of bytes or sequences of bytes stored in some media.

Why bytes? The 8-bit-computers are no longer so common, but the byte as a typical unit of binary data is still present. Actually this is how files on typical operating systems work, they have a length in bytes and that number of bytes as content. When we use the file system, we have to deal with this.

But assuming a perfect lossless world, the typical smallest unit of information is not a byte, but a bit. So we could actually assume that we have an array of bits of a given length l and the length does not need to be a multiple of 8. This might be subtle, but it does describe the most basic case. Padding with 0-bits does not usually do the job, but there need to be other provisions to deal with the bit-length and the storage length. In case of files this could be achieved by storing the last three bits b of the bit length in the first byte of the file and using the file size s the bit length l can be calculated as 8(s-1)+b.

How about memory? Often the array of bytes is just fine. The approach to have an array of booleans as bits usually hurts, because typical programming languages reserve 32 bit for each boolean and we actually want a packed array. On the other hand we have 64bit-computers and we would like to use this when moving around data, so actually thinking of an array of 64-bit-integers plus information about how many of the bits of the last entry actually count might be better. This does bring up the issue of endianness at some point.

But we actually want to use the data. In the good old C-days we could use quite a powerful trick: We read an array of bytes and casted the pointer to a pointer to a struct. Then we could access the struct. Poor mans serialization, but it was efficient and worked especially well for records with fixed size, for both reading and writing. Off course variable length records can be dealt with as well by imposing rules that say something about the type of the next record based on the contents of the current record.

The Perl programming language and subsequently the Ruby programming language had much more powerful and flexible mechanisms using methods of functions called pack and unpack.

Share Button

Why I still like Ruby

Some years ago Ruby in conjunction with rails was an absolute hype. In the Rails User Group in Zürich we had meetups with 30 people every two weeks. The meetings every two weeks have been retained, but often we are just five to ten people now. Is the great time of Ruby over or is it just a temporary decline? Is Perl 6 now taking over, since it is ready?

Perl 6 is cool and could challenge Ruby and it is now getting more and more towards production readiness. This will happen next year. This is what we will say next year. But Perl 6 is or will be very cool, stay tuned… What about Perl 5? At least Allison Randall still likes Perl (5). But it has other areas of strength than Ruby, so I will leave my opinionated writing focused on Ruby.

But the hype is right now in the area of JavaScript.

There are approaches to use server side JavaScript, which are interesting, because JavaScript needs to be dealt with on the client anyway and then it is tempting to have the same language on both sides.

There are also approaches to move back from the server to the client and put more functionality into the client and make the server less relevant, at least the percentage of development effort of the client gets more and of the server gets less.

There are cool frameworks for the client development in JavaScript and in conjunction with HTML5 and these frameworks a lot can be achieved.

On the other hand we are exploring NoSQL-databases like MongoDB and MongoDB is using JavaScript with some extension library as query language.

The general hype is toward functional programming and voila, JavaScript is functional, it is the most functional of the languages that have been established for a long time and brought the functional paradigm to every developer and even to the more sophisticated web designer, everything under the radar and long time before we knew how cool functional programming is. On the other hand, Ruby (and by the way Perl and especially Perl6) allow programming according to the functional paradigm as well, if used appropriately. But in JavaScript it is slightly more natural.

Then we have more contenders. Java is not dying so soon and it is still strong on its web frameworks, JavaServerFaces and one million others, not so bad actually.

And the idea of rails or at least its name, has been taken over by the groovy guys to develop groovy on grails, which integrates nicely with a Java enterprise backend.

And the more functional languages like F#, Scala, Haskell, Erlang, Elixir (and some others for the guys that don’t find Haskell theoretical enough) are around and gaining or retaining some popularity. Will Scala support the Erlang-VM as alternative backend, like JavaScript and JVM are supported today (and dotnet-CLI in the past)? And Scala has a promising web framework with play, that does WebServices right, which is what is actually needed for the modern client side JavaScript frameworks, just as an example.

PHP was always the ugly baby, why did they implement a second Perl based on the subset of Perl they understood? Well, I am not advocating PHP at all, but there are some pretty decent PHP-applications that are working really well with high load, like Wikipedia.

The unequal twin of Ruby, Python, is doing pretty well as well on attempting to get a web framework called Django. I have not investigated it at all, but it seems to exist.

So, where is ruby and rails? Ruby is a beautiful language, that has done many things better than any of the languages mentioned here:

It is easy to learn.

It has a nice and complete and consequent concept for object orientation.

It allows all the functional programming, most of it pretty natural.

It has decent numerical types by default. Integer grow to arbitrary length, Rationals are included and Complex numbers are included. JavaScript does not even have integers, it just has one numerical type: double precision floating point.

It allows extending, even operator overloading for new numerical types.

And it has a useful and easily accessible web framework like rails and actually some contender as alternative web frameworks, like camping, sinatra and some others.

And the development effort and the amount of code needed for a certain functionality is still lower than in Java or C# or COBOL.

I do believe that the web applications that are mostly server based and are using just HTML or minimal JavaScript still have their place and will be discovered again as useful for many application areas.

To conclude, I think that Ruby did not make it to become the replacement for Java, that some saw in it and that was due according to the history of replacement of other mainstream languages that have moved to legacy in the past. But I do believe that it will play an important role for some years to come and beyond the current JavaScript hype. Off course it will be challenged again in the future and maybe something will replace the main stream and will make Ruby on Rails just another legacy area for web applications. But I don’t even have an idea what that will be. I don’t think JavaScript. Perl6, Scala, F# or Clojure do have technical potential to do so, but I do not see that happening in the near future. Maybe something new will come up? Or something old? Just remember, Ruby is about as old as Java.

We will see.

Read more in this discussion on Quora.

For those who are interested, I actually teach Ruby and it is possible to arrange trainings for two to five days, depending on the previous knowledge of the participants and the goals.

Share Button

Logging

Software enthält häufig eine Log-Funktionalität. Üblicherweise werden dort ein- oder mehrzeilige Einträge in eine Datei, nach syslog oder in die Standardausgabe geschrieben (und letzlich in eine Datei umgeleitet), die etwas darüber sagen, was die Software so macht. Normalerweise kann man das alles ignorieren, aber sobald dort etwas mit „ERROR“ auftritt oder schlimmer noch sogenannte Stacktraces, ist eigentlich angesagt, dass man dem Problem auf den Grund geht. Da nun leider Software häufig von minderer Qualität ist, was durchaus von den Unzulänglichkeiten der verwendeten Libraries und Frameworks kommen kann, sieht man das leider recht oft, teilweise so oft, dass man denen nicht mehr wirklich nachgeht. Ärgerlich ist vor allem, dass der eigentliche Fehler oft vorher an einer ganz anderen Stelle aufgetreten ist, sich aber in der Log-Datei nicht nachvollziehen lässt.

Nun ist es schön, dass die Log-Dateien einen gewissen Aufbau der Einträge einhalten. Meist beginnen sie mit einer Zeitangabe im ISO-Format, oft auf die Millisekunde genau. Da in der Regel mehrere Prozesse oder mehrere Threads gleichzeitig laufen, werden deren Einträge natürlich mehr oder weniger wild gemischt. Das ist gut erträglich, wenn auch mehrzeilige Log-Einträge (z.B. Stack-Traces) zusammen bleiben und wenn sich Anfang und Ende von mehrzeiligen Log-Einträgen gut erkennen lassen. Man kann dann mit Splunk oder mit Perl- oder Ruby-Scripten die Log-Dateien analysieren und jeweils für einzelne Threads die Abläufe nachvollziehen. Das Zusammenhalten mehrzeiliger Einträge lässt sich realisieren, indem man ein atomares write verwendet oder indem man einen „Logger-Thread“-hat und alle Log-Einträge diesem Thread in einer Queue zur Verfügung stellt, die dieser dann abarbeitet und herausschreibt.

Insgesamt ist das ganze Thema aber unübersichtlich und unhandhabbar geworden. In der Java-Welt hatte man früher log4j und eine relativ einfache Konfigurations-Datei im properties-Format. Das wurde dann irgendwann durch XML ersetzt und es kamen andere Logging-Frameworks dazu und jeweils immer wieder etwas, was alle diese vereinheitlichte. Letztlich wurde es dadurch komplizierter und unübersichtlicher.

Es stellt sich die Frage, wie viel von der Logik für die Handhabung der Log-Dateien in die Software selbst gehört. Muss die Software wissen, in welche Datei geloggt wird? Muss sie Log-Rotation kennen? Muss es sein, dass eine Software überhaupt nicht startet, nur weil man es nicht schafft, die komplizierte Log-Konfiguration richtig einzustellen? Letztlich müssen die Log-Dateien am Ende den Systemadministrator zufriedenstellen, der die Software auf dem Produktivsystem betreibt. Soll man diesem zumuten, für jede Software eine spezielle Konfiguration des Logging zu pflegen? Oder für diesen Zweck jeweils von den Entwickerln eine neue Software-Version anzufordern, weil die Konfiguration als Teil des Deployments „hardcodiert“ ist? Interessant wird es auch dann, wenn man auf Technologien wie „Platform as a Service“ (PAAS) setzt, wo man Applikationsserver, Framework, Datenbank u.s.w. zur Verfügung hat, aber wo die Software ohne weiteres den Server wechseln kann und damit Dateien verloren gehen.

Ist es vielleicht einfach die bessere Lösung, unter Einhaltung eines vernünftigen Formats nach stdout zu loggen die Software so laufen zu lassen, dass deren Standardausgabe (stdout oder stderr) in einen logmanager geleitet wird? Dieser kann dann für alle Software, die auf dem System läuft, verwendet werden und vom Systemadministrator einheitlich konfiguriert werden. Einheitlich heißt nicht nur, für alle Java-Programme gleich, sondern für alle Software, die sich dazu bringen lässt, ihre Log-Ausgabe nach stdout zu schreiben und gewisse Mindestanforderungen an das Format einzuhalten. Im Prinzip ließe sich mit named-Pipes auch eine beliebig hard-codierte Log-Datei unterstützen, aber das macht die Dinge nur komplizierter. Wenn dann das Logging-Framework der Software selbst noch Log-Rotation unterstützt, kann das natürlich durcheinandergeraten, wenn etwa nach n geschriebenen Bytes oder m Sekunden die named-pipe geschlossen, umbenannt und eine neue Datei angelegt wird, was je nach Schreibrechten in dem Verzeichnis zu verschiedenen Fehler führt.

Was ist jetzt mit Software, die selbst als Filter fungiert, wo also stdout Teil der Funktionalität ist? Solche Software findet man üblicherweise in Form kleinerer Programme oder Skripte, die nicht unbedingt Log-Dateien schreiben müssen oder in Form von gut ausgetesteter Software, die Teil der Installation ist. Oft kann man hier mit stderr für Log-Ausgaben, in diesem Fall meist für Fehlermeldungen, auskommen. Oder man könnte eine Log-Datei auf der Kommandozeile angeben, was dann auch wieder für den Systemadminstrator leicht zugänglich wäre und auf eine named-pipe umleiten könnte.

Share Button

Rounding of Money Amounts

Deutsch

Many numerical calculations deal with amounts of money. It is always nice if these calculations are somewhat correct or if we can at least rest assured that we do not loose money due to such inaccuracies. Off course there are calculations that may legitimately deal with approximations, for example when calculating profits as percentages of the investment (return on investment). For this kind of calculations floating point numbers (double, Float or something like that) come in handy, but off course dealing with the numeric subtleties can be quite a challenge and rounding errors can grow huge if not dealt with properly.

It is often necessary to do calculations with exact amounts. It is quite easy to write something like 3.23 as floating point number, but unfortunately these are internally expressed in binary format (base 2), so the fraction with a power of 10 in the denominator needs to be expressed as a fraction with a power of two in the denominator. We can just give it a try and divide x=323_{10}=101000011_{2} by y=100_{10}=1100100_2, doing the math in base 2 arithmetic. We get something like z=x/y=11.0011101011100001010001111010111000010100011110101\ldots_{2}, more precisely z=11.00\overline{11101011100001010001} or as fraction z=11_2+\frac{11101011100001010001_2}{100_2*(100000000000000000000_2-1_2)} or in base 10 z=3_{10}+\frac{964689_{10}}{4_{10}\cdot1048575_{10}}.

It can be seen that such a harmless number with only two digits after the decimal point ends up having an infinite number of digits after the decimal point. Our usual Double and Float types are usually limited to 64 Bits, so some digits have to be discarded, causing a rounding error. Fortunately this usually works well and the stored number is still shown as 3.23. But working a little bit with these floating point numbers sooner or later results like 3.299999999999 or 3.2300000001 will appear instead of 3.23, even when doing only addition and subtraction. When doing larger sums it might even end up as 3.22 or 3.24. For many applications that is unacceptable.

A simple and often useful solution is to use integral numbers and storing the amounts in cents instead of dollars. Then the rounding problem of pure additions and subtractions is under control and for other operations it might at least become easier to deal with the issue. A common mistake that absolutely needs to be avoided is mixing amountInDollar and amountInCent. In Scala it would be a good idea to use different types to make this distinction, so that such errors can be avoided at compile time. In any case it is very important to avoid such numeric types as int of Java that have a weird and hardly controllable overflow behavior where adding positive numbers can end up negative. How absurd, modular arithmetic is for sure not what we want for our money, it is a little bit too socialistic. 😉 Estimations about the upper bound of money amounts of persons and companies and other organizations are problematic, because there can be inflation and there can be some accumulation of money in somebody’s accounts… Even though databases tend to force us to such assumption, but we can make them really huge. So some languages might end up using something like BigInteger or BigNum or BigInt. Unfortunately Java shows one of its insufficiencies here, which makes quite ugly to use for financial applications, because calculations like a = b + c * d for BigInteger appear like this: a = b.{\rm add}(c.{\rm multiply}(d)). The disadvantage is that the formula cannot be seen at one glance, which leads to errors. In principal this problem can be solved using a preprocessor for Java. Maybe a library doing some kind of RPN-notation would possible, writing code like this:

Calculation calc = new Calculation();
calc.push(b)
calc.push(c)
calc.push(d)
calc.add()
calc.multiply()
a = calc.top()

Those who still know old HP calculators (like HP 25 and HP 67 😉 in the good old days) or Forth might like this, but for most of us this is not really cutting it.

Common and useful is actually the usage of some decimal fixed point type. In Java this is BigDecimal, in Ruby it is LongDecimal.
And example in Ruby:

> sudo gem install long-decimal
Successfully installed long-decimal-1.00.01
1 gem installed
....
> irb
irb(main):001:0> require "long-decimal"
=> true
irb(main):002:0> x = LongDecimal("3.23")
=> LongDecimal(323, 2)
irb(main):003:0> y = LongDecimal("7.68")
=> LongDecimal(768, 2)
irb(main):004:0> z = LongDecimal("3.9291")
=> LongDecimal(39291, 4)
irb(main):005:0> x+y
=> LongDecimal(1091, 2)
irb(main):006:0> (x+y).to_s
=> "10.91"
irb(main):007:0> x+y*z
=> LongDecimal(33405488, 6)
irb(main):008:0> (x+y*z).to_s
=> "33.405488"
irb(main):009:0> 

It is interesting to see that the number of digits remains the same under addition and subtraction if both sides have the same number of digits. But the longer number of digits wins otherwise. During multiplication, division and off course during more complex operations many decimal places can become necessary. It becomes important to do rounding and to do it right and in a controlled way. LongDecimal supports the following rounding modes:

ROUND_UP
Round away from 0.
ROUND_DOWN
Round towards 0.
ROUND_CEILING
Round to more positive, less negative numbers.
ROUND_FLOOR
Round to more negative, less positive numbers.
ROUND_HALF_UP
Round the middle and from above up (away from 0), everything below down (towards 0).
ROUND_HALF_DOWN
Round the middle and from above down (towards 0), everything below up (away from 0).
ROUND_HALF_CEILING
Round from the middle onward towards infinity, otherwise towards negative infinity.
ROUND_HALF_FLOOR
Round up to and including the middle towards negative infinity, otherwise towards infinity.
ROUND_HALF_EVEN
Round the middle in such a way that the last digit becomes even.
ROUND_HALF_ODD
Round the middle in such a way that the last digit becomes odd (will be added to long-decimal in next release).
ROUND_UNNECESSARY
Do not round, just discard trailing 0. If that does not work, raise an exception.

Which of these to use should be decided with domain knowledge in mind. The above example could be continued as follows:

irb(main):035:0> t=(x+y*z)
=> LongDecimal(33405488, 6)
irb(main):036:0> 
irb(main):037:0* t.round_to_scale(2, LongDecimal::ROUND_UP).to_s
=> "33.41"
irb(main):038:0> t.round_to_scale(2, LongDecimal::ROUND_DOWN).to_s
=> "33.40"
irb(main):039:0> t.round_to_scale(2, LongDecimal::ROUND_CEILING).to_s
=> "33.41"
irb(main):040:0> t.round_to_scale(2, LongDecimal::ROUND_FLOOR).to_s
=> "33.40"
irb(main):041:0> t.round_to_scale(2, LongDecimal::ROUND_HALF_UP).to_s
=> "33.41"
irb(main):042:0> t.round_to_scale(2, LongDecimal::ROUND_HALF_DOWN).to_s
=> "33.41"
irb(main):043:0> t.round_to_scale(2, LongDecimal::ROUND_HALF_CEILING).to_s
=> "33.41"
irb(main):044:0> t.round_to_scale(2, LongDecimal::ROUND_HALF_FLOOR).to_s
=> "33.41"
irb(main):045:0> t.round_to_scale(2, LongDecimal::ROUND_HALF_EVEN).to_s
=> "33.41"
irb(main):046:0> t.round_to_scale(2, LongDecimal::ROUND_UNNECESSARY).to_s
ArgumentError: mode ROUND_UNNECESSARY not applicable, remainder 5488 is not zero
        from /usr/local/lib/ruby/gems/1.9.1/gems/long-decimal-1.00.01/lib/long-decimal.rb:507:in `round_to_scale_helper'
        from /usr/local/lib/ruby/gems/1.9.1/gems/long-decimal-1.00.01/lib/long-decimal.rb:858:in `round_to_scale'
        from (irb):46
        from /usr/local/bin/irb:12:in `
' irb(main):047:0>

A specialty is that some countries do not use the coins with one of the smallest unit, like 1 cent in the US. In Switzerland the smallest common coin is 5 Rp (5 cent=0.05 CHF). It might be possible to make the bank do a transfer for amounts not ending in 0 or 5, but usually invoices apply this kind of rounding and avoid such exact amounts that could not possibly be paid in cash. This can be dealt with by multiplying the amount by 20, rounding it to 0 digits after the decimal point and divide it by 20 and round the result to 2 digits after the point. In Ruby there is a better way using an advanced feature of LongDecimal called remainder rounding (using the method round_to_allowed_remainders(…) ). Assuming we want to round a number x with n decimal places in such a way that, 10^n\cdot x belongs to one of the residue classes \overline{0} \mod 10 or \overline{5} \mod 10. In this case we are just talking about the last digit, but the mechanism has been implemented in a more general way allowing any set of allowed residues and even a base other than 10. If 0 is not in the set of allowed residues, it may be unclear how 0 should be rounded and this needs to be actually defined with a parameter. For common practical uses the last digit of 0 is allowed, so things work out of the box:


irb(main):003:0> t.round_to_allowed_remainders(2, [0, 5], 10, LongDecimal::ROUND_UP).to_s
=> "33.45"
irb(main):005:0> t.round_to_allowed_remainders(2, [0, 5], 10, LongDecimal::ROUND_DOWN).to_s
=> "33.40"
irb(main):006:0> t.round_to_allowed_remainders(2, [0, 5], 10, LongDecimal::ROUND_CEILING).to_s
=> "33.45"
irb(main):007:0> t.round_to_allowed_remainders(2, [0, 5], 10, LongDecimal::ROUND_FLOOR).to_s
=> "33.40"
irb(main):008:0> t.round_to_allowed_remainders(2, [0, 5], 10, LongDecimal::ROUND_HALF_UP).to_s
=> "33.40"
irb(main):009:0> t.round_to_allowed_remainders(2, [0, 5], 10, LongDecimal::ROUND_HALF_DOWN).to_s
=> "33.40"
irb(main):010:0> t.round_to_allowed_remainders(2, [0, 5], 10, LongDecimal::ROUND_HALF_CEILING).to_s
=> "33.40"
irb(main):011:0> t.round_to_allowed_remainders(2, [0, 5], 10, LongDecimal::ROUND_HALF_FLOOR).to_s
=> "33.40"

In an finance application the rounding methods should probably be defined for each currency, possible using one formula and some currency specific parameters.

LongDecimal can do even more than that. It is possible to calculate logarithms, exponential functions, square root, cube root all to a desired number of decimal places. The algorithms have been tuned for speed without sacrificing precision.

Share Button