UTF-16 Strings in Java

Deutsch

Strings in Java and many other JVM-languages consist of Unicode content and are encoded as utf-16. It was fantastic to already consider Unicode when introducing Java in the 90es and to make it the only reasonable way to use strings, so there is no temptation to start with a „US-ASCII“-version of a software that never really gets enhanced to deal properly with non-English content and data. And it also avoids having to deal with many kinds of String encodings within the software. For outside storage in databases and files and for communication with other processes and over the network, these issues of course remain. But Java strings are always encoded utf-16. We can be sure. Most common languages can make use of these Strings easily and handling common letter based languages from Europe and western Asia is quite strait forward. Rendering may still be a challenge, but that is another issue. A major drawback of this approach is that more memory is used. This usually does not hurt too much. Memory is not so expensive and a couple of Strings even with utf-16 will not be too big. With 64-bit Java, which should be used these days, the memory limitations of the JVM are not relevant any more, they can basically use as much memory as provided.

But some applications to hit the memory limits. And since usually most of the data we are dealing with is ultimately strings and combinations of relatively long strings with some reference pointers, some integer numbers and more strings, we can say that in typical memory intensive applications strings actually consume a large fraction of the memory. This leads to the temptation of using or even writing a string library that uses utf-8 or some other more condensed internal format, while still being able to express any Unicode content. This is possible and I have done it. Unfortunately it is very painful, because Strings are quite deeply integrated into the language and explicit conversions need to be added in many places to deal with this. But it is possible and can save a lot of memory. In my case we were able to abandon this approach, because other optimizations, that were less painful, proved to be sufficient.

An interesting idea is to compress strings. If they are long enough, algorithms like gzip work on a single string. As with utf-8, selectively accessing parts of the string becomes expensive, because it can only be achieved by parsing the string from the beginning or by adding indexing structures. We just do not know which byte to go to for accessing the n-th character, even with utf-8. In reality we often do not have long strings, but rather many relatively short strings. They do not compress well by themselves. If we know our data and have a rough idea about the content of our complete set of strings, custom compression algorithm can be generated. This allows good results even for relatively short strings, as long as they are within the „language“ that we are prepared for. This is more or less similar to the step from utf-16 to utf-8, because we replace common byte sequences by shorter byte sequences and less common sequences may even get replaced by something longer. There is no gain in utf-8, if we have mostly strings that are in non-Latin alphabets. Even Cyrillic or Greek, that are alphabets similar to the Latin alphabet, will end up needing two bytes for each letter, which is not at all better than utf-16. For other alphabets it will even become worse, because three or four bytes are needed for one symbol that could easily be expressed with two bytes in utf-16. But if we know our data well enough, the approach with the specific compression will work fine. The „dictionary“ for the compression needs to be stored only once, maybe hard-coded in the software, and not in each string. It might be of interest to consider building the dictionary dynamically at run-time, like it is done with gzip, but keeping it in a common place for all strings and thus sharing it. The custom strings that I used where actually using a hard coded compression algorithm generated using a large amount of typical data. It worked fine, but was just too clumsy to use because Java is not prepared to replace String with something else without really messing around in the standard run-time libraries, which I would neither recommend nor want.

It is important to consider the following issues:

  1. Is the memory consumption of the strings really a problem?
  2. Are there easier optimizations that solve the problem?
  3. Can it just be solved by adding more hardware? Yes, this is a legitimate question.
  4. Are there solutions for the problem in the internet or even within the current organization?
  5. A new String class is so fundamental that excellent testing is absolutely mandatory. The unit tests should be very extensive and complete.
Share Button

Will Java, C, C++ and C# be the new Cobols?

A few decades ago most programming was performed in Cobol (I do not want to shout it), Fortran, Rexx and some typical main frame languages, that hardly made it to the Linux-, Unix- or MS-Windows-world. They are still present, but mostly used for maintenance and extension of existing software, but less often for writing new software from scratch.
In these days languages like C, C++, Java and to a slightly lesser extent C# dominate the list of most commonly used languages. I would assume that JavaScript is also quite prominent in the list, because it has become more popular to write rich web clients using frameworks like Angular JS. And there are tons of them and some really good stuff. Some people like to see JavaScript also on the server side and in spite of really interesting frameworks like Node-JS I do not really consider this a good idea. If you like you may add Objective C to this list, which I do not know very much at all, even though it has been part of my gcc since my first Linux installation in the early 1990es.

Anyway, C goes back to the 1970es and I think that it was a great language to create at that time and it still is for a certain set of purposes. When writing operating systems, database engines, compilers and interpreters for other languages, editors, or embedded software, everything that is very close to the hardware, explicit control and direct access to very powerful OS-APIs are features that prove to be useful. It has been said that Java runs as fast as C, which is at least close to the truth, but only if we do not take into account the memory usage. C has some short comings that could be done better without sacrificing its strengths in the areas where it is useful, but it does not seem to be happening.

C++ has been the OO-extension of C, but I would say that it has evolved to be a totally different language for different purposes, even though there is some overlap, it is relatively easy to call functionality written in C from C++ and a little bit harder the other way round… I have not used it very much recently, so I will refrain from commenting further on it.

Java has introduced an infrastructure that is very common now with its virtual machine. The JVM is running on a large number of servers and any JVM-language can be used there. The platform independence is an advantage, but I think that its importance on servers has diminished a little bit. There used to be all kinds of servers with different operating systems and different CPU-architectures. But now we are moving towards servers being mostly Linux with Intel-compatible CPUs, so it is becomeing less of an issue. This may change in the future again, of course.

With Mono C# can be used in ways similar to Java, at least that is what the theory says and what seems to be quite true at least up to a certain level. It seems to be a bit ahead of Java with some language features, just think of operator overloading, undeclared exceptions, properties, generics or lambdas, which have been introduced earlier or in a more elegant way or we are still waiting in Java. I think the case of lambdas also shows the limitations, because they seem to behave differently than you would expect from real closures, which is the way lambdas should be done and are done in more functionally oriented languages or even in the Ruby programming language, in the Perl programming language or typical Lisps.
Try this

List<Func<int>> actions = new List<Func<int>>();

int variable = 0;
while (variable < 5)
{
    actions.Add(() => variable * 2);
    ++ variable;
}

foreach (var act in actions)
{
    Console.WriteLine(act.Invoke());
}

We would expect the output 0, 2, 4, 6, 8, but we are getting 10, 10, 10, 10, 10 (one number in a line, respectively).
But it can be fixed:

List<Func<int>> actions = new List<Func<int>>();

int variable = 0;
while (variable < 5)
{
    int copy = variable;
    actions.Add(() => copy * 2);
    ++ variable;
}

foreach (var act in actions)
{
    Console.WriteLine(act.Invoke());
}

I would say that the concept of inner classes is better in Java, even though what is static there should be the default, but having lambdas this is less important…
You find more issues with class loader, which are kind of hard to tame in java, but extremely powerful.

Anyway, I think that all of these languages tend to be similar in their syntax, at least within a method or function and require a lot of boiler plate code. Another issue that I see is that the basic types, which include Strings, even if they are seen as basic types by the language design, are not powerful enough or full of pitfalls.

While the idea to use just null terminated character arrays as strings in C has its beauty, I think it is actually not really good enough and for more serious C applications a more advanced string library would be good, with the disadvantage that different libraries will end up using different string libraries… Anyway, for stuff that is legitimately done with C now, this is not so much of an issue and legacy software has anyway its legacy how to deal with strings, and possible painful limitations in conjunction with Unicode. Java and also C# have been introduced at a time when Unicode was already around and the standard already claimed to use more than 65536 code points (characters in Unicode-speak), but at that time 65536 seemed to be quite ok to cover the needs for all common languages and so utf-16 was picked as an encoding. This blows up the memory, because strings occupy most of the memory in typical application software, but it still leaves us with uncertainties about length and position, because code points can be one or two 16-bit-„characters“ long, which can only be seen by actually iterating through the string, which leaves us where we were with null terminated strings in C. And strings are really hard to replace or enhance in this aspect, because they are used everywhere.

Numbers are not good either. As an application developer we should not care about counting bits, unless we are in an area that needs to be specifically optimized. We are using mostly integer types in application development, at least we should. These overflow silently. Just to see it in C#:

int i = 0;
int s = 1;
for (i = 0; i < 20; i++)
{
    s *= 7;
    Console.WriteLine("i=" + i + " s=" + s);
}

which gives us:

i=0 s=7
i=1 s=49
i=2 s=343
i=3 s=2401
i=4 s=16807
i=5 s=117649
i=6 s=823543
i=7 s=5764801
i=8 s=40353607
i=9 s=282475249
i=10 s=1977326743
i=11 s=956385313
i=12 s=-1895237401
i=13 s=-381759919
i=14 s=1622647863
i=15 s=-1526366847
i=16 s=-2094633337
i=17 s=-1777531471
i=18 s=442181591
i=19 s=-1199696159

So it silently overflows, or just takes the remainder modulo 2^{32} with the representation system \{-2^{31} \ldots 2^{31}-1\}. Java, C and C++ behave exactly the same way, only that we need to know what „int“ means for our C-compiler, but if we use 32-bit-ints, it is the same. This should throw an exception or switch to some unlimited long integer. Clojure offers both options, depending on whether you use * or *‘ as operator. So as application developers we should not have to care about these bits and most developers do not think about it. Usually it goes well, but a lot of software bugs are around due to this pattern. It is just wrong in C#, Java, and C++. In C I find it more acceptable, because the typical area for using C for new software actually is quite related to bits and bytes, so the developers need to be aware of such issues all the time anyway.

I would consider it desirable to move to more expressive languages like Clojure, Scala, F#, Ruby or Perl for application development. Ruby and Perl have better Strings. Clojure and Scala inherit them from the JVM, and F# has the same strings as C#. Ruby and Clojure have a good way to deal with integers, Scala, Perl and F# can do it right if we actually want to do so, but not by default. Perl and Ruby are very weak when it comes to multithreading. As compared to Java this can be dealt with by just using more processes instead of threads, because the overhead of a Ruby or Perl process is much less than the overhead of a Java process, but I would see this as a major drawback. C, C#, Java and C++ offer good facilities to use multithreading, but the issue of avoiding typical multithreading bugs is a big deal and actually too hard for a large fraction of typical application developers. Or at least too far away from there point of focus. Moving to a more functional paradigm might be a way to go. Java enterprise edition is a failure if the goal is to get multithreading, done well without having to worry about it, because the overhead is too much. On the other hand, if you are willing to go the extra mile, having more explicit access to the multithreading mechanism and using it correctly is extremely powerful, for example in C with pthreads or with a deliberate usage of processes, shared memory and threads together. For which kind of projects do we have the time and the team for this? I am not talking about multithreaded applications that work well on the developer’s laptop, but fail during some high load processing in production with some concurrent modification issues a few months after the deployment. Thinking cannot be replaced by testing.

So now we have a lot of software in C, C++, Java and C# and a lot of new software is written in these languages, even from scratch. We could do better, sometimes we do, sometimes we don’t. It is possible to write excellent application software with Java, C++, C# and even C. It just takes a bit longer, but if we use them with care, it will be ok. Some companies are very conservative and want to use stuff that has been around for a long time. This is sometimes right and sometimes wrong. And since none of the more modern languages has really picked up so much speed that it can be considered a new main stream, it is understandable that some organizations are scared about marching into a dead end road.

On the other hand, many businesses can differentiate themselves by providing services that are only possible by having a very innovative IT. Banks like UBS and Credit Suisse in Switzerland are not likely to be there, while banks like ING are on that road. As long as they compete for totally different customer bases and as long as the business has enough strengths that are not depending so heavily on an innovative IT, but just on a working robust IT, this will be fine. But time moves on and innovation will eventually out-compete conservative businesses.

Share Button

Frameworks for Unit Testing and Mocking

Unit testing has fortunately become an important issue in many software projects. The idea of automatic software based unit and integration tests is actually quite old. The typical Linux software that is downloaded as source code and then built with steps like

tar xfzvv «software-name-with-version».tar.gz
cd «software-name-with-version»
./configure
make
sudo make install

often allows a step

make test

or

make check

or even both before the

make install

It was like that already in the 1990s, when the word „unit test“ was unknown and the whole concept had not been popularized to the main stream.

What we need is to write those automated tests to an extent that we have good confidence that the software will be reliable enough in terms of bugs if it passes the test suite. The tests can be written in any language and I do encourage you to think about using other languages, in order to be less biased and more efficient for writing the tests. We may choose to write a software in C, C++ or Java for the sake of efficiency or easier integration into the target platform. But these languages are efficient in their usages of CPU power, but not at all efficient in using developer time to write a lot of functionality. This is ok for most projects, because the effort it takes to develop with these languages is accepted in exchange for the anticipated benefits. For testing it is another issue.

On the other hand there are of course advantages in using actually the same language for writing the tests, because it is easier to access the APIs and even internal functionalities during the tests. So it may very well be that Unit tests are written in the same language as the software and this is actually what I am doing most of the time. But do think twice about your choice.

Now writing automated tests is actually no magic. It does not really need frameworks, but is quite easy to accomplish manually. All we need is kind of two areas in our source code tree. One area that goes into the production code and one area that is only used for the tests and remains on the development and continuous integration machines. Since writing automated tests without frameworks is not really a big deal, we should only look at frameworks that are really simple and easy to use or maybe give us really good features that we actually need. This is the case with many such frameworks, so the way to go is to actually use them and save some time and make the structure more accessible to other team members, who know the same testing framework. Writing and running unit tests should be really easy, otherwise it is not done or the unit tests are disabled and loose contact to the actual software and become worthless.

Bugs are much more expensive, the later they are discovered. So we should try to find as many of them while developing. Writing unit tests and automated integrated tests is a good thing and writing them early is even better. The pure test driven approach does so before actually writing the code. I recommend this for bug fixing, whenever possible.

There is one exception to this rule. When writing GUIs, automated testing is possible, but quite hard. Now we should have UX guys involved and we should present them with some early drafts of the software. If we had already developed elaborate selenium tests by then, it would be painful to change the software according to the advice of the UX guy and rewrite the tests. So I would keep it flexible until we are on the same page as the UX guys and add the tests later in this area.

Frameworks that I like are actually CUnit for C, JUnit for Java, where TestNG would be a viable alternative, and Google-Test for C++. CUnit works extremely well on Linux and probably on other Unix-like systems like Solaris, Aix, MacOSX, BSD etc. There is no reason why it should not work on MS-Windows. With cygwin actually it is extremely easy to use it, but with native Win32/Win64 it seems to need an effort to get this working, probably because MS-Windows is no priority for the developers of CUnit.

Now we should use our existing structures, but there can be reasons to mock a component or functionality. It can be because during the development a component does not exist. Maybe we want to see if the component is accessed the right way and this is easier to track with a mock that records the calls than with the real thing that does some processing and gives us only the result. Or maybe we have a component with is external and not always available or available, but too time consuming for most of our tests.

Again mocking is no magic and can be done without tools and frameworks. So the frameworks should again be very easy and friendly to use, otherwise they are just a pain in the neck. Early mocking frameworks were often too ambitious and too hard to use and I would have avoided them whenever possible. In Java mocking manually is quite easy. We just need an interface of the mocked component and create an implementing class. Then we need to add all missing methods, which tools like eclipse would do for us, and change some of them. That’s it. Now we have mockito for Java and Google-Mock, which is now part of Google-Test, for C++. In C++ we create a class that behaves similar to a Java interface by having all methods pure virtual with keyword „virtual“ and „=0“ instead of the implementation. The destructor is virtual with an empty implementation. They are so easy to use and they provide useful features, so they are actually good ways to go.

For C the approach is a little bit harder. We do not have the interfaces. So the way to go is to create a library of the code that we want to test and that should go to production. Then we write one of more c-files for the test, that will and up in an executable that actually runs the test. In these .c-files we can provide a mock-implementation for any function and it takes precedence of the implementation from the library. For complete tests we will need to have more than one executable, because in each case the set of mocked functions is fixed within one executable. There are tools in the web to help with this. I find the approach charming to generate the C-code for the mocked functions from the header files using scripts in the a href=“https://en.wikipedia.org/wiki/Ruby_(programming_language)“>Ruby programming language or in the Perl programming language.

Automated testing is so important that I strongly recommend to do changes to the software in order to make it accessible to tests, of course within reason. A common trick is to make certain Java methods package private and have the tests in the same package, but a different directory. Document why they are package private.

It is important to discuss and develop the automated testing within the team and find and improve a common approach. Laziness is a good thing. But laziness means running many automated tests and avoid some manual testing, not being too lazy to write them and eventually spending more time on manual repetitive activities.

I can actually teach this in a two-day or three-day course.

Share Button

How to create ISO Date String

It is a more and more common task that we need to have a date or maybe date with time as String.

There are two reasonable ways to do this:
* We may want the date formatted in the users Locale, whatever that is.
* We want to use a generic date format, that is for a broader audience or for usage in data exchange formats, log files etc.

The first issue is interesting, because it is not always trivial to teach the software to get the right locale and to use it properly… The mechanisms are there and they are often used correctly, but more often this is just working fine for the locale that the software developers where asked to support.

So now the question is, how do we get the ISO-date of today in different environments.

Linux/Unix-Shell (bash, tcsh, …)

date "+%F"

TeX/LaTeX


\def\dayiso{\ifcase\day \or
01\or 02\or 03\or 04\or 05\or 06\or 07\or 08\or 09\or 10\or% 1..10
11\or 12\or 13\or 14\or 15\or 16\or 17\or 18\or 19\or 20\or% 11..20
21\or 22\or 23\or 24\or 25\or 26\or 27\or 28\or 29\or 30\or% 21..30
31\fi}
\def\monthiso{\ifcase\month \or
01\or 02\or 03\or 04\or 05\or 06\or 07\or 08\or 09\or 10\or 11\or 12\fi}
\def\dateiso{\def\today{\number\year-\monthiso-\dayiso}}
\def\todayiso{\number\year-\monthiso-\dayiso}

This can go into a file isodate.sty which can then be included by \include or \input Then using \todayiso in your TeX document will use the current date. To be more precise, it is the date when TeX or LaTeX is called to process the file. This is what I use for my paper letters.

LaTeX

(From Fritz Zaucker, see his comment below):

\usepackage{isodate} % load package
\isodate % switch to ISO format
\today % print date according to current format

Oracle


SELECT TO_CHAR(SYSDATE, 'YYYY-MM-DD') FROM DUAL;

On Oracle Docs this function is documented.
It can be chosen as a default using ALTER SESSION for the whole session. Or in SQL-developer it can be configured. Then it is ok to just call

SELECT SYSDATE FROM DUAL;

Btw. Oracle allows to add numbers to dates. These are days. Use fractions of a day to add hours or minutes.

PostreSQL

(From Fritz Zaucker, see his comment):

select current_date;
—> 2016-01-08


select now();
—> 2016-01-08 14:37:55.701079+01

Emacs

In Emacs I like to have the current Date immediately:

(defun insert-current-date ()
"inserts the current date"
(interactive)
(insert
(let ((x (current-time-string)))
(concat (substring x 20 24)
"-"
(cdr (assoc (substring x 4 7)
cmode-month-alist))
"-"
(let ((y (substring x 8 9)))
(if (string= y " ") "0" y))
(substring x 9 10)))))
(global-set-key [S-f5] 'insert-current-date)

Pressing Shift-F5 will put the current date into the cursor position, mostly as if it had been typed.

Emacs (better Variant)

(From Thomas, see his comment below):

(defun insert-current-date ()
"Insert current date."
(interactive)
(insert (format-time-string "%Y-%m-%d")))

Perl

In the Perl programming language we can use a command line call

perl -e 'use POSIX qw/strftime/;print strftime("%F", localtime()), "\n"'

or to use it in larger programms

use POSIX qw/strftime/;
my $isodate_of_today = strftime("%F", localtime());

I am not sure, if this works on MS-Windows as well, but Linux-, Unix- and MacOS-X-users should see this working.

If someone has tried it on Windows, I will be interested to hear about it…
Maybe I will try it out myself…

Perl 5 (second suggestion)

(From Fritz Zaucker, see his comment below):

perl -e 'use DateTime; use 5.10.0; say DateTime->now->strftime(„%F“);‘

Perl 6

(From Fritz Zaucker, see his comment below):

say Date.today;

or

Date.today.say;

Ruby

This is even more elegant than Perl:

ruby -e 'puts Time.new.strftime("%F")'

will do it on the command line.
Or if you like to use it in your Ruby program, just use

d = Time.new
s = d.strftime("%F")

Btw. like in Oracle SQL it is possible add numbers to this. In case of Ruby, you are adding seconds.

It is slightly confusing that Ruby has two different types, Date and Time. Not quite as confusing as Java, but still…
Time is ok for this purpose.

C on Linux / Posix / Unix


#include
#include
#include

main(int argc, char **argv) {

char s[12];
time_t seconds_since_1970 = time(NULL);
struct tm local;
struct tm gmt;
localtime_r(&seconds_since_1970, &local);
gmtime_r(&seconds_since_1970, &gmt);
size_t l1 = strftime(s, 11, "%Y-%m-%d", &local);
printf("local:\t%s\n", s);
size_t l2 = strftime(s, 11, "%Y-%m-%d", &gmt);
printf("gmt:\t%s\n", s);
exit(0);
}

This speeks for itself..
But if you like to know: time() gets the seconds since 1970 as some kind of integer.
localtime_r or gmtime_r convert it into a structur, that has seconds, minutes etc as separate fields.
stftime formats it. Depending on your C it is also possible to use %F.

Scala


import java.util.Date
import java.text.SimpleDateFormat
...
val s : String = new SimpleDateFormat("YYYY-MM-dd").format(new Date())

This uses the ugly Java-7-libraries. We want to go to Java 8 or use Joda time and a wrapper for Scala.

Java 7


import java.util.Date
import java.text.SimpleDateFormat

...
String s = new SimpleDateFormat("YYYY-MM-dd").format(new Date());

Please observe that SimpleDateFormat is not thread safe. So do one of the following:
* initialize it each time with new
* make sure you run only single threaded, forever
* use EJB and have the format as instance variable in a stateless session bean
* protect it with synchronized
* protect it with locks
* make it a thread local variable

In Java 8 or Java 7 with Joda time this is better. And the toString()-method should have ISO8601 as default, but off course including the time part.

Summary

This is quite easy to achieve in many environments.
I could provide more, but maybe I leave this to you in the comments section.
What could be interesting:
* better ways for the ones that I have provided
* other databases
* other editors (vim, sublime, eclipse, idea,…)
* Office packages (Libreoffice and MS-Office)
* C#
* F#
* Clojure
* C on MS-Windows
* Perl and Ruby on MS-Windows
* Java 8
* Scala using better libraries than the Java-7-library for this
* Java using better libraries than the Java-7-library for this
* C++
* PHP
* Python
* Cobol
* JavaScript
* …
If you provide a reasonable solution I will make it part of the article with a reference…
See also Date Formats

Share Button

What do +, – and * with Integer do?

When using integers in C, Java or Scala, we often use what is called int.

It is presented to us as the default.

And it is extremely fast.

Ruby uses by default arbitrary length integers.

But what do +, – and * mean?

We can rebuild them, in Ruby, kind of artificially restrict the integers to what we have in other programming langauges as int:


MODULUS = 0x100000000;
LIMIT   =  0x80000000;

def normalize(x)
  r = x % MODULUS;
  if (r < -LIMIT) then     return r + MODULUS;   elsif (r >= LIMIT) 
    return r - MODULUS;
  else
    return r;
  end
end

def intPlus(x, y)
  normalize(x+y);
end

def intMinus(x, y)
  normalize(x-y);
end

def intTimes(x, y)
  normalize(x*y);
end

x = 0x7fffffff;
y = intPlus(x, x);
z = intPlus(x, x);
puts("x=#{x} y=#{y} z=#{z}");

What is the outcome?

Exactly what you get when doing 32-Bit-Ints in C, Java, Scala or C#:

x=2147483647 y=-2 z=-2

int is always calculated modulo a power of two, usually 2^{32}. That is the
x % MODULUS in normalize(). The Rest of the function is just for normalizing the result to the range -2^{31} \ldots 2^{31}-1.

So we silently get this kind of result when an overflow situation occurs, without any notice.
The overflow is not trivial to discover, but it can be done.
For addition I have described how to do it.

See also Overflow of Integer Types, an earlier post about these issues…

I gave a talk that included this content adapted for Scala at Scala Exchange 2015.

Share Button

Devoxx 2015

This year I have had the pleasure to visit the Devoxx-Conference in Antwerp in Belgium.

I have visited the following talks:

There is a lot to write about this and I will get back to specific interesting topics in the future…

My previous visits in 2012, 2013 (part 1), 2013 (part 2), and 2014 have their own blog articles.

Share Button

Java: Using Enums for Singletons

This singleton pattern is not really a big deal, but overused because it gives us the coolness of knowing about design patters without learning too much stuff for that… It does have its uses in some programming languages, while more or less implicit or obsolete in others, depending on the point of view.

I have already discussed the generalization of the singleton.
Actually it can be seen that the enum of Java and of languages who do something like Java’s enum in a similar way can be used as a way to build a singleton by just limiting it to one instance. Having been skeptical about this as an abuse of enums that might confuse people who read this I would now suggest to embrace this as a possible and useful way to write singletons and to be familiar with it when reading other people’s code or even to use it when writing code.

The enum way to write a singleton is the most elegant way in Java, because it uses least boiler plate code and includes some benefits that are usually ignored, especially when it comes to serialization that has to be addressed in singletons quite often, but is quite often ignored, resulting in multiple instances of the „singleton“ to float around…

So here we go:


enum Singleton  {
    INSTANCE;

    private long count;

    private Singleton() {
        count = 0;
    }

    public synchronized long nextValue() {
        return count++;
    }
}

Remarks:

  • For real programs it would be important to check if long is sufficient or BigInteger is needed.
  • For real programs using an AtomicLong could be a better way than using synchronized.

Some links:

Share Button

Design Patterns: Singleton

Deutsch

This Singleton Pattern has the advantage to be easy to memorize.

The only really interesting aspect of it is the issue of initialization („lazy or „eager“) and maybe the dependencies between multiple singletons.

But I would like to mention two generalizations.

A Singleton exists once in the whole program. Generalizations can address the uniqueness in two ways. Either the number of instances can be increased from one to a small number or the universe („whole program“) can be changed. What is the universe in a big software system that is running on multiple servers with many processes and components?

Changing from one to a small fixed finite number is quite a routine thing. Java developers are calling this enum, but other programming language often have similar structures or at least allow to build them easily by restricting the generation of objects and creating the whole set of instances statically and making them accessible. I am not talking about the enum of C, C++ and C#.

The other, slightly more complex generalization looks at the „universe“. An application can be distributed and have several parallel processes, not only threads. This would mean dealing with a larger universe, which can become slightly tricky, but some examples of that are actually routine. Think of a database that exists once in the application to allow operating on the same data and keeping them consistent. Frameworks sometimes allow for an application server wide more or less classical singleton that exists once in the whole server landscape but can be accessed more or less transparently. An interesting case that is needed quite often is a global counter. It can be done using a DB-sequence. Or giving disjoint subsets of the numbers that could be generated to different processes and server. Or by using something like a UUID and praying that no collisions will occur. But an application wide singleton might be useful in some cases. Some frameworks are talking about a „bean“ with „application scope“. On the other hand smaller universes are even more useful. A typical example is the „session scope“.

Maybe just understanding these as a proper generalization of the singleton pattern will help in making these frameworks more understandable and more useful. The investment of having started with the most simple design pattern might actually pay off. Looking into some of the other patterns could be a worth while task for the future… 😉

Share Button

Using Collections

When Java came out about 20 years, it was great to have a decent and quite extensive collection library available as part of the standard setup and ready to use.

Before that we often had to develop our own or find one of many interesting collection libraries and when writing and using APIs it was not a good idea to rely on them as part of the API.

Since Java 2 (technical name „Java 1.2“) collection interfaces have been added and now the implementation is kind of detached, because we should use the interfaces as much as possible and make the implementation exchangeable.

An interesting question arose in conjunction with concurrency. The early Java 1 default collections where synchronized all the way. In Java 2 non synchronized variants were added and became the default. Synchronization can be achieved by wrapping them or by using the old collections (that do implement the interfaces as well since Java 2).

This was a performance improvement, because most of the time it is expensive and unnecessary overhead to synchronize collections. As a matter of fact special care should be used anyway to know who is accessing a collection in what way. Even if the collection itself does not get broken by simultaneous access, your application most likely is, unless you really know what you are doing. Are you?

Now it is usually a good idea to control changes of a collection. This is achieved by wrapping it with some Collections.umodifyableXXX-method. The result is that accessing the wrapped collection with set or put will cause an exception. It was a good approach, as a first shot, but not where we want to be now.

Of the wrapped collection still references to the inner, non-wrapped collection can be around, so it can still change while being accessed. If you can easily afford it, just copy collections when taking them in or giving them out. Or go immutable all the way and wrap your own in an umnodifiable-wrapper, if that works.

What I would like to see is something along the following lines:

  • We have two kinds of collection interfaces, those that are immutable and those that are mutable.
  • The immutable should be the default.
  • We have implementations of the collections and construction facilities for the immutable collections
  • The immutable implementation is off course the default.

I do not want to advocate going immutable collections only, because that does come at a high price in terms of efficiency. The usual pattern is to still have methods that modify a collection, but these leave the original collection as it is and just create a modified copy. Usually these implementations have been done in such a smart way that they share a lot, which is no pain, because they are all immutable. No matter how smart and admirable these tricks are, I strongly doubt that they can reach the performance of modifiable collections, if modifications are actually used a lot, at least in a purely single threaded environment.

Ruby has taken an interesting approach. Collections have a method freeze that can be called to make them immutable. That is adding runtime checks, which is a good match for Ruby. Java should check this at compile time, because it is so important. Having different interfaces would do that.

I recommend checking out the guava-collection library from google. It does come with most of the issues described here addressed and I think it is the best bet at the moment for that purpose. There are some other collection libraries to explore. Maybe one is actually better then guava.

Share Button

Find the next entry in a sequence

In Facebook, Xing, Google+, Vk.com, Linkedin and other of these social media networks we are often encountered with a trivial question like this:

1->2
2->8
3->18
4->32
5->50
6->72
7->?

There are some easy patterns. Either it is some polynomial formula or some trick with the digits.
But the point is, that any such sequence can easily be fullfilled by a polynomial formula. That means we can put any value for 7 and make it work. Or any answer is correct. So what would probably be the real question is the most simple function to full-fill the given constraints. Simplicity can be measured in some way… If the solution is unique is unclear, but let us just look at the polynomial solution.

A function is needed that takes as parameter a list of key-value-pairs (or a hash map) and that yields a function such that the function of any of the key is the associated value.

Assuming a polynomial function in one variable we can make use of the chinese remainder theorem, which can be applied to univariate polynomials over a field F as well as to integral numbers. For a polynomial p(X) we have

    \[p(x) \equiv p(X) \mod X-x\]

where X is the polynomial variable and x\in F is a concrete value.

We are looking for a polynomial p(X) such that for given values x_0,\ldots x_{n-1}, y_0,\ldots,y_{n-1} \in F we have

    \[\bigwedge_{i=0}^{n-1} p(x_i) = y_i\]

or in another way

    \[\bigwedge_{i=0}^{n-1} p(X) \equiv y_i \mod X-x_i\]

which is exactly the Chinese remainder theorem.
Let

    \[I=\{0,\ldots,n-1\}\]

and

    \[\bigwedge_{j=0}^{n-1} I_j = I \setminus \{j\}\]

We can see that for all i \in I the polynomials

    \[e_i = \prod_{j \in I_j} \frac{X-x_j}{x_i-x_j}\]

have the properties

    \[e_i(x_i)=1\]

    \[\bigwedge_{j \in I_i} e_i(x_j)=0\]

or

    \[\bigwedge_{i \in I}\bigwedge_{j \in J} e_i(x_j)=\delta_{i,j}\]

where \delta_{i,j} is the Kronecker symbol, which is 0 if the two indices differ and 1 if they are equal.
Or as congruence:

    \[\bigwedge_{i \in I}\bigwedge_{j \in J} e_i(X)\equiv \delta_{i,j} \mod X-x_j\]

Then we can just combine this and use

    \[p(X) =\sum_{i \in I} y_i e_i(X)\]

This can easily be written as a Ruby function

def fun_calc(pairs)
  n = pairs.size
  result = lambda do |x|
    y = 0
    n.times do |i|
      p_i = pairs[i]
      x_i = p_i[0].to_r
      y_i = p_i[1].to_r
      z = y_i
      n.times do |j|
        if (j != i)
          p_j = pairs[j]
          x_j = p_j[0]
          z *= (x - x_j) / (x_i - x_j)
        end
      end
      y += z
    end
    y
  end
  result
end

This takes a list of pairs as a parameter and returns the polynomial function als lambda.
It can be used like this:

lop = [[0, 0], [1, 1], [2, 4], [3, 9], [4, 16], [5, 25], [6, 36], [7, 64]]

f = fun_calc(lop)

20.times do |x|
  y = f.call(x)
  puts sprintf("%6d -> %6d", x, y)
end

Put this together into a ruby program and add some parsing for the list of pairs or change the program each time you use it and all these „difficult“ questions „that 99.9% fail to solve“ are not just easy, but actually soluble automatically.

This is interesting for more useful applications. I assume that there will always be situations where a function is needed that meets certain exact values a certain inputs and is an interpolation or extrapolation of this.

Please observe that there are other interesting and useful ways to approach this:

  • Use a „best“ approximation from a set of functions, for example polynomials with a given maximum degree
  • use cubic splines, which are cubic polynomials within each section between two neighboring input values such that at the input values the two adjacent functions have the same value (y_i, of course), the same first derivative and the same second derivative.

For highway and railroad construction other curves are used, because the splines are making an assumption on what is the x-axis and what is the y-axis, which does not make sense for transport facilities. They are using a curve called Clothoid.

Use Java, C, Perl, Scala, F# or the programming language of your choice to do this. You only need Closures, which are available in Java 8, F#, Scala, Perl, Ruby and any decent Lisp dialect. In Java 7 they can be done with an additional interface as anonymous inner classes. And for C it has been described in this blog how to do closures.

Share Button