Unicode and C

It is a common practice in C to use arrays of char as strings. The 0 is used as end marker.

The whole thing was created like that in the 1970s and at that time it was kind of cool to get away with one less language feature and to express it in terms of others instead. And people did not think enough about the necessity to express more than ISO 646 IRV (commonly called ASCII) as string content.

This extended out of the box to 8 bit character sets like ISO 8859-1 or KOI-8, that are identical to ISO 646 in the lower 128 characters and contain an extension in the upper 128 characters. But fortunately we have moved ahead and now Unicode with its encodings UTF-8, UTF-16 and UTF-32.

How can we deal with this in C?

UTF-8 just works out of the box, because the byte 0 is only used to encode the code point U+0000. So the null termination can be kept as it is and a lot of functionality remains valid. Some issues arise, because in UTF-8 things like finding the logical length of a string, not its memory consumption or finding the nth code point, not the nth byte, require UTF-8-logic to be applied and to parse the whole string at least from the beginning to the desired position or the usage of an indexing facility. So a lot of non-trivial string functionality of the standard library will just not be as easy as people thought it would be in the 1970s and subsequently not work as needed. Libraries for better UTF-8-support in C can be found, leaving the „native“ C strings with UTF-8 content only for usage in interfaces that require them. I have not yet explored such libraries, but it would be interesting to find out how powerful and useful they are.

At the time when Unicode came out, it seemed to be sufficient to have 16 bits per character instead of 8. Java was built on this assumption. C added a wchar_t to allow for this and just required it to be „long enough“. So Linux uses 32 bits and MS-Windows 16 bits. This is not too bad, because programming in C for MS-Windows and for Linux is anyway quite different, unless we abstract the differences into a library, which would then also include a common string definition and string handling functionality. While the Linux wchar_t is sufficient, it really wastes a lot of memory, which is often undesirable, if we go the extra effort to program in C in order to gain performance. The Windows-wchar_t is „kind of sufficient“, as are the Java-Strings, because we can really do a lot with assuming that Unicode is only 16 bit or with UTF-16 and ignoring the complexities of that, that are in principal the same as for UTF-8, but can be ignored with less disadvantages most of the time. The good news is, that wchar_t is well supported by standard library functions.

Another way is to use char16_t and char32_t, that have a clear definition of their length, but much less library support.

Probably these facilities are sufficient for software whose string handling is relatively trivial. For more ambitious string handling in terms of functionality and performance, it will be necessary to find third party libraries or to write them.

Links

Share Button

Flashsort in Ruby

Deutsch

There is a simple implementation of Flashsort in Ruby, after having already provided an implementation in C. The C-implementation is typically faster than the libc-function qsort, but this depends always on the data and on how well the metric-function has been written, that is needed on top of the comparison function for Flashsort. You can think of this metric function as some kind of monotonic hash function. So we have

    \[\bigwedge_{a,b: a\le b} m(a) \le m(b) \]

This additionally needed function of method is not really there, apart from numerical values, so we really have to invest some time into writing it. This makes the use of Flashsort a bit harder. A good metric function is crucial for good performance, but for typical text files quite trivial implentations already outperform classical O(n \log n) algorithms like Heapsort and Quicksort and Mergesort for larger amounts of data.

This blog article shows other sorting algorithms for Ruby.

Share Button

How to draw lines, circles and other curves

These ideas were developed more than 30 years without knowing that they were already known at that time…

Today the graphics cards can easily do things like this in very little time. And today’s CPUs are of course really good at multiplying. So this has lost a lot of its immediate relevance, but it is a fun topic and why not have some fun…

Let us assume we have a two dimensional coordinate system and a visible area that goes from x_{\min} to x_{\max} and y_{\min} to y_{\max}. Coordinates are discrete.

In this world we can easily measure an angle against a (directed) line parallel to the x-axis, for example up to an accuracy of 45^\circ=\frac{\pi}{4}:

  • y=0 \wedge x > 0 \implies \alpha = 0 (= 0^\circ)
  • 0 < y < x \implies 0 < \alpha < \frac{\pi}{4}(=45^\circ)
  • 0 < y = x \implies \alpha = \frac{\pi}{4}
  • 0 < x < y \implies \frac{\pi}{4} < \alpha < \frac{\pi}{2}(=90^\circ)
  • x = 0 \land y > 0\implies \alpha = \frac{\pi}{2}
  • x < 0 \land y > 0 \land |x| < |y|\implies \frac{\pi}{2} < \alpha < \frac{3\pi}{4}(=135^\circ)
  • x < 0 \land y > 0 \land -x = y\implies \alpha = \frac{3\pi}{4}(=135^\circ)

So let us assume we have a curve that is described by a polynomial function in two variables x and y, like this:

    \[f(x, y) = \sum_{j=0}^m\sum_{k=0}^n a_{j,k}x^jy^k = 0\]

We have to apply some math to understand that the curve behaves nicely in the sense that it does not behave to chaotic in scales that are below our accuracy, that it is connected etc. We might possibly scale and move it a bit by substituting something like c_1u+c_2 for x and c_3v+c_4 for y.

For example we may think of

  • line: f(x,y)=ax+by+c
  • circle: f(x, y)=x^2+y^2-r^2
  • eclipse: f(x, y)=\frac{x^2}{a^2}+\frac{y^2}{b^2}-1

We can assume our drawing is done with something like a king of chess. We need to find a starting point that is accurately on the curve or at least as accurately as possible. You could use knights or other chess figures or even fictive chess figures..

Now we have a starting point (x_0, y_0) which lies ideally exactly on the curve. We have a deviation from the curve, which is f(x_0, y_0)=d_0. So we have f(x_n, y_n)=d_n. Than we move to x_{n+1}=x_n + s and y_{n+1}=y_n+t with s, t = \{-1, 0, 1\}. Often only two or three combinations of (s, t) need to be considered. When calculating d_{n+1} from d_n for the different variants, it shows that for calculating d_{n+1}-d_n the difference becomes a polynomial with lower degree, because the highest terms cancel out. So drawing a line between two points or a circle with a given radius around a given point or an ellipse or a parabola or a hyperbola can be drawn without any multiplications… And powers of n-th powers of x can always be calculated with additions and subtractions only from the previous x-values, by using successive differences:
d_{m,1}=(x-m)^n-(x-m-1)^n
d_{m,l+1}=d_{m+1,l}-d_{m,l}
These become constant for l=n, just as the lth derivatives, so by using this triangle, successive powers can be calculated with some preparational work using just additions.
It was quite natural to program these in assembly language, even in 8-bit assembly languages that are primitive by today’s standards. And it was possible to draw such figures reasonably fast with only one MHz (yes, not GHz).

We don’t need this stuff any more. Usually the graphics card is much better than anything we can with reasonable effort program. Usually the performance is sufficient when we just program in high level languages and use standard libraries.

But occasionally situations occur where we need to think about how to get the performance we need:
Make it work,
make it right,
make it fast,
but don’t stop after the first of those.

It is important that we choose our steps wisely and use adequate methods to solve our problem. Please understand this article as a fun issue about how we could write software some decades ago, but also as an inspiration to actually look into bits and bytes when it is really helping to get the necessary performance without defeating the maintainability of the software.

Share Button

Logging

Deutsch

Software often contains a logging functionality. Usually entries one or sometimes multiple lines are appended to a file, written to syslog or to stdout, from where they are redirected into a file. They are telling us something about what the software is doing. Usually we can ignore all of it, but as soon as something with „ERROR“ or worse and more visible stack traces can be found, we should investigate this. Unfortunately software is often not so good, which can be due to libraries, frameworks or our own code. Then stack traces and errors are so common that it is hard to look into or to find the ones that are really worth looking into. Or there is simply no complete process in place to watch the log files. Sometimes the error shows up much later than it actually occurred and stack traces do not really lead us to the right spot. More often than we think logging actually introduces runtime errors, that were otherwise not present. This is related to a more general concept, which is called observer effect, where logging actually changes the business logic.

It is nice that log files keep to some format. Usually they start with a time stamp in ISO-format, often to the millisecond. Please add trailing zeros to always have 3 digits after the decimal point in this case. It is preferable to use UTC, but people tend to stick to local date and time zones, including the issues that come with switching to and from daylight saving time. Usually we have several processes or threads that run simultaneously. This can result in a wild mix of logging entries. As long as even multiline entries stay together and as long as beginning and end of one multiline entry can easily be recognized, this can be dealt with. Tools like splunk or simple Perl, Ruby or Python scripts can help us to follow threads separately. We could actually have separate logs for each thread in the first place, but this is not a common practice and it might hit OS-limitations on the number of open files, if we have many threads or even thousands of actors as in Erlang or Akka. Keeping log entries together can be achieved by using an atomic write, like the write system call in Linux and other Posix systems. Another way is to queue the log entries and to have a logger thread that processes the queue.

Overall this area has become very complex and hard to tame. In the Java world there used to be log4j with a configuration file that was a simple properties file, at least in the earlier version. This was so good that other languages copied it and created some log4X. Later the config file was replaced by XML and more logging frame works were added. Of course quite a lot of them just for the purpose of abstracting from the large zoo of logging frameworks and providing a unique interface for all of them. So the result was, that there was one more to deal with.

It is a good question, how much logic for handling of log files do we really want to see in our software. Does the software have to know, into which file it should log or how to do log rotation? If a configuration determines this, but the configuration is compiled into the jar file, it does have to know… We can keep our code a bit cleaner by relying on program functionality without code, but this still keeps it as part of the software.

Log files have to please the system administrator or whoever replaced them in a pure devops shop. And in the end developers will have to be able to work with the information provided by the logs to find issues in the code or to explain what is happening, if the system administrator cannot resolve an issue by himself. Should this system administrator have to deal with a different special complex setup for the logging for each software he is running? Or should it be necessary to call for developer support to get a new version of the software with just another log setting, because the configurations are hard coded in the deployment artifacts? Interesting is also, what happens when we use PAAS, where we have application server, database etc., but the software can easily move to another server, which might result in losing the logs. Moving logs to another server or logging across the network is expensive, maybe more expensive than the rest of this infrastructure.

Is it maybe a good idea to just log to stdout, maintaining a decent format and to run the software in such a way that stdout is piped into a log manager? This can be the same for all software and there is one way to configure it. The same means not only the same for all the java programs, but actually the same for all programs in all languages that comply to a minimal standard. This could be achieved using named pipes in conjunction with any hard coded log file that the software wants to use. But this is a dangerous path unless we really know what the software is doing with its log files. Just think of what weird errors might happen if the software tries to apply log rotation to the named pipe by renaming, deleting, creating new files and so on. A common trick to stop software from logging into a place where we do not want this is to create a directory with the name of the file that the software usually uses and to write protect this directory and its parent directory for the software. Please find out how to do it in detail, depending on your environment.

What about software, that is a filter by itself, so its main functionality is to actually write useful data to stdout? Usually smaller programs and scripts work like this. Often they do not need to log and often they are well tested relyable parts of our software installation. Where are the log files of cp, ls, rm, mv, grep, sort, cat, less,…? Yes, they do tend to write to stderr, if real errors occur. Where needed, programs can turn on logging with a log file provided on the command line, which is also a quite operations friendly approach. Named pipes can help here.

And we had a good logging framework in place for many years. It was called syslog and it is still around, at least on Linux.

A last thought: We spend really a lot of effort to get well performing software, using multiple processes, threads or even clusters. And then we forget about the fact that logging might become the bottle neck.

Share Button

Meaningless Whitespace in Textfiles

We use different file formats that are more or less tolerant to certain changes. Most well known is white space in text files.

In some programming languages white space (space, newline, carriage return, form feed, tabulator, vertical tab) has no meaning, as long as any whitespace is present. Examples for this are Java, Perl, Lisp or C. Whitespace, that is somehow part of String content is always significant, but white space that is used within the program can be combination of one or more of the white space characters that are in the lower 128 positions (ISO-646, often referred to as ASCII or 7bit ASCII. It is of course recommended to have a certain coding standard, which gives some guidelines of when to use newlines, if tabs or spaces are preferred (please spaces) and how to indent. But this is just about human readability and the compiler does not really care. Line numbers are a bit meaningful in compiler and runtime error messages and stack traces, so putting everything into one line would harm beyond readability, but there is a wide range of ways that are all correct and equivalent. Btw. many teams limit lines to 80 characters, which was a valid choice 30 years ago, when some terminals were only 80 characters wide and 132 character wide terminals where just coming up. But as a hard limit it is a joke today, because not many of us would be able to work with a vt100 terminal efficiently anyway. Very long lines might be harder to read, so anything around 120 or 160 might still be a reasonable idea about line lengths…

Languages like Ruby and Scala put slightly more meaning into white space, because in most cases a semicolon can be skipped if it is followed by a newline and not just horizontal white space. And Perl (Perl 5) is for sure so hard to compile that only its own implementation can properly format or even recognize which white space is part of a literal string. Special cases like having the language in a string and parsing and then executing that should be ignored here.

Now we put this program files into a source code management system, usually Git. Some teams still use legacy systems like subversion, source safe, clear case or CVS, while there are some newer systems that are probably about as powerful as git, but I never saw them in use. Git creates an MD5 hash of each file, which implies that any minor change will result in a new version, even if it is just white space. Now this does not hurt too much, if we agree on the same formatting and on the same line ending (hopefully LF only, not CR LF, even on MS-Windows). But our tooling does not make any difference between significant changes and insignificant formatting only changes. This gets worse, if users have different IDEs, which they should have, because everyone should use the IDE or editor, with which he or she is most efficient and the formal description of the preferred formatting is not shared between editors or differs slightly.

I think that each programming language should come with a command line diff tool and a command line formatting tool, that obey a standard interface for calling and can be plugged into editors and into source code management systems like git. Then the same mechanisms work for C, Java, C#, Ruby, Python, Fortran, Clojure, Perl, F#, Scala, Lua or your favorite programming language.

I can imaging two ways of working: Either we have a standard format and possibly individual formats for each developer. During „git commit“ the file is brought into the standard format before it is shown to git. Meaning less whitespace changes disappear. During checkout the file can optionally be brought into the preferred format of the developer. And yes, there are ways to deal with deliberate formatting, that for some reason should be kept verbatim and for dealing differently with comments and of course all kinds of string literals. Remember, the formatting tool comes from the same source as the compiler and fully understands the language.

The other approach leaves the formatting up to the developer and only creates a new version, when the diff tool of the language signifies that there is a relevant change.

I think that we should strive for this approach. It is no rocket science, the kind of tools were around for many decades as diff and as formatting tools, it would just be necessary to go the extra mile and create sister diff and formatting tools for the compiler (or interpreter) and to actually integrate these into build environments, IDEs, editors and git. It would save a lot of time and leave more time for solving real problems.

Is there any programming language that actually does this already?

How to handle XML? Is XML just the new binary with a bit more bloat? Can we do a generic handling of all XML or should it depend on the Schema?

Share Button

Loops with unknown nesting depth

We often encounter nested loops, like

for (i = 0; i < n; i++) {
    for (j = 0; j < m; j++) {
        doSomething(i, j);
    }
}

This can be nested to a few more levels without too much pain, as long as we observe that the number of iterations for each level need to be multiplied to get the number of iterations for the whole thing and that total numbers of iterations beyond a few billions (10^9, German: Milliarden, Russian Миллиарди) become unreasonable no matter how fast the doSomethings(...) is. Just looking at this example program

public class Modular {
    public static void main(String[] args) {
        long n = Long.parseLong(args[0]);
        long t = System.currentTimeMillis();
        long m = Long.parseLong(args[1]);
        System.out.println("n=" + n + " t=" + t + " m=" + m);
        long prod = 1;
        long sum  = 0;
        for (long i = 0; i < n; i++) {
            long j = i % m;
            sum += j;
            sum %= m;
            prod *= (j*j+1) % m;
            prod %= m;
        }
        System.out.println("sum=" + sum + " prod=" + prod + " dt=" + (System.currentTimeMillis() - t));
    }
}

which measures it net run time and runs 0 msec for 1000 iterations and almost three minutes for 10 billions (10^{10}):

> java Modular 1000 1001 # 1'000
--> sum=1 prod=442 dt=0
> java Modular 10000 1001 # 10'000
--> sum=55 prod=520 dt=1
> java Modular 100000 1001 # 100'000
--> sum=45 prod=299 dt=7
> java Modular 1000000 1001 # 1'000'000
--> sum=0 prod=806 dt=36
> java Modular 10000000 1001 # 10'000'000
--> sum=45 prod=299 dt=344
> java Modular 100000000 1001 # 100'000'000
--> sum=946 prod=949 dt=3314
> java Modular 1000000000 1001 # 1'000'000'000
--> sum=1 prod=442 dt=34439
> java Modular 10000000000 1001 # 10'000'000'000
--> sum=55 prod=520 dt=332346

As soon as we do I/O, network access, database access or simply a bit more serious calculation, this becomes of course easily unbearably slow. But today it is cool to deal with big data and to at least call what we are doing big data, even though conventional processing on a laptop can do it in a few seconds or minutes... And there are of course ways to process way more iterations than this, but it becomes worth thinking about the system architecture, the hardware, parallel processing and of course algorithms and software stacks. But here we are in the "normal world", which can be a "normal subuniverse" of something really big, so running on one CPU and using a normal language like Perl, Java, Ruby, Scala, Clojure, F# or C.

Now sometimes we encounter situations where we want to nest loops, but the depth is unknown, something like

for (i_0 = 0; i_0 < n_0; i_0++) {
  for (i_1 = 0; i_1 < n_1; i_1++) {
    \cdots
      for (i_m = 0; i_m < n_m; i_m++) {
        dosomething(i_0, i_1,\ldots, i_m);
      }
    \cdots
  }
}

Now our friends from the functional world help us to understand what a loop is, because in some of these more functional languages the classical C-Style loop is either missing or at least not recommended as the everyday tool. Instead we view the set of values we iterate about as a collection and iterate through every element of the collection. This can be a bad thing, because instantiating such big collections can be a show stopper, but we don't. Out of the many features of collections we just pick the iterability, which can very well be accomplished by lazy collections. In Java we have the Iterable, Iterator, Spliterator and the Stream interfaces to express such potentially lazy collections that are just used for iterating.

So we could think of a library that provides us with support for ordinary loops, so we could write something like this:

Iterable range = new LoopRangeExcludeUpper<>(0, n);
for (Integer i : range) {
    doSomething(i);
}

or even better, if we assume 0 as a lower limit is the default anyway:

Iterable range = new LoopRangeExcludeUpper<>(n);
for (Integer i : range) {
    doSomething(i);
}

with the ugliness of boxing and unboxing in terms of runtime overhead, memory overhead, and additional complexity for development. In Scala, Ruby or Clojure the equivalent solution would be elegant and useful and the way to go...
I would assume, that a library who does something like LoopRangeExcludeUpper in the code example should easily be available for Java, maybe even in the standard library, or in some common public maven repository...

Now the issue of loops with unknown nesting depth can easily be addressed by writing or downloading a class like NestedLoopRange, which might have a constructor of the form NestedLoopRange(int ... ni) or NestedLoopRange(List li) or something with collections that are more efficient with primitives, for example from Apache Commons. Consider using long instead of int, which will break some compatibility with Java-collections. This should not hurt too much here and it is a good thing to reconsider the 31-bit size field of Java collections as an obstacle for future development and to address how collections can grow larger than 2^{31}-1 elements, but that is just a side issue here. We broke this limit with the example iterating over 10'000'000'000 values for i already and it took only a few minutes. Of course it was just an abstract way of dealing with a lazy collection without the Java interfaces involved.

So, the code could just look like this:

Iterable range = new NestedLoopRange(n_0, n_1, \ldots, n_m);
for (Tuple t : range) {
    doSomething(t);
}

Btw, it is not too hard to write it in the classical way either:

        long[] n = new long[] { n_0, n_1, \ldots, n_m };
        int m1 = n.length;
        int m  = m1-1; // just to have the math-m matched...
        long[] t = new long[m1];
        for (int j = 0; j < m1; j++) {
            t[j] = 0L;
        }
        boolean done = false;
        for (int j = 0; j < m1; j++) {
            if (n[j] <= 0) {
                done = true;
                break;
            }
        }
        while (! done) {
            doSomething(t);
            done = true;
            for (int j = 0; j < m1; j++) {
                t[j]++;
                if (t[j] < n[j]) {
                    done = false;
                    break;
                }
                t[j] = 0;
            }
        }

I have written this kind of loop several times in my life in different languages. The first time was on C64-basic when I was still in school and the last one was written in Java and shaped into a library, where appropriate collection interfaces were implemented, which remained in the project or the organization, where it had been done, but it could easily be written again, maybe in Scala, Clojure or Ruby, if it is not already there. It might even be interesting to explore, how to write it in C in a way that can be used as easily as such a library in Java or Scala. If there is interest, please let me know in the comments section, I might come back to this issue in the future...

In C it is actually quite possible to write a generic solution. I see an API like this might work:

struct nested_iteration {
  /* implementation detail */
};

void init_nested_iteration(struct nested_iteration ni, size_t m1, long *n);
void dispose_nested_iteration(struct nested_iteration ni);
int nested_iteration_done(struct nested_iteration ni); // returns 0=false or 1=true
void nested_iteration_next(struct nested_iteration ni);

and it would be called like this:

struct nested_iteration ni;
int n[] = { n_0, n_1, \ldots, n_m };
for (init_nested_iteration(ni, m+1, n); 
     ! nested_iteration_done(ni); 
     nested_iteration_next(ni)) {
...
}

So I guess, it is doable and reasonably easy to program and to use, but of course not quite as elegant as in Java 8, Clojure or Scala.
I would like to leave this as a rough idea and maybe come back with concrete examples and implementations in the future.

Links

Share Button

Carry Bit, Overflow Bit and Signed Integers

It has already been explained how the Carry Bit works for addition. Now there was interest in a comment about how it would work for negative numbers.

The point is, that the calculation of the carry bit does not have any dependency on the sign. The nature of the carry bit is that it is meant to be used for the less significant parts of the addition. So assuming we add two numbers x and y that are having k and l words, respectively. We assume that n=\max(k,l) and make sure that x and y are both n words long by just providing the necessary number of 0-words in the most significant positions. Now the addition is performed as described by starting with a carry bit of 0 and adding with carry x[0]+y[0], then x[1]+y[1] and so on up to x[n-1]+y[n-1], assuming that x[0] is the least significant word and x[n-1] the most significant word, respectively. Each addition includes the carry bit from the previous addition. Up to this point, it does not make any difference, if the numbers are signed or not.

Now for the last addition, we need to consider the question, if our result still fits in n words or if we need one more word. In the case of unsigned numbers we just look at the last carry bit. If it is 1, we just add one more word in the most significant position with the value of 1, otherwise we are already done with n words.

In case of signed integers, we should investigate what can possibly happen. The input for the last step is two signed words and possibly a carry bit from the previous addition. Assuming we have m-Bit-words, we are adding numbers between -2^{m-1} and 2^{m-1}-1 plus an optional carry bit c. If the numbers have different signs, actually an overflow cannot occur and we can be sure that the final result fits in at most n words.

If both are not-negative, the most significant bits of x[n-1] and y[n-1] are both 0. An overflow is happening, if and only if the sum x[n-1]+y[n-1]+c \ge 2^{n-1}, which means that the result „looks negative“, although both summands were not-negative. In this case another word with value 0 has to be provided for the most significant position n to express that the result is \ge 0 while maintaining its already correctly calculated result. It cannot happen that real non-zero bits are going into this new most significant word. Consequently the carry bit can never become 1 in this last addition step.

If both are negative, the most significant bits of x[n-1] and y[n-1] are both 1. An overflow is happening, if and only if the sum x[n-1]+y[n-1]+c \lt 2^{n-1}, which means that the result „looks positive or 0“, although both summands were negative. In this case another word with value 2^n-1 or -1, depending on the viewpoint, has to be prepended as new most significant word. In this case of two negative summands the carry bit is always 1.

Now typical microprocessors provide an overflow flag (called „O“ or more often „V“) to deal with this. So the final addition can be left as it is in n words, if the overflow bit is 0. If it is 1, we have to signal an overflow or we can just provided one more word. Depending on the carry flag it is 0 for C=0 or all bits 1 (2^n-1 or -1, depending on the view point) for C=1.

The overflow flag can be calculated by o=\mathrm{signbit}(x) = \mathrm{signbit}(y) \land \mathrm{signbit}(x+y\mod 2^n) \ne \mathrm{signbit}(x).
There are other ways, but they lead to the same results with approximately the same or more effort.

The following table shows the possible combinations and examples for 8-Bit arithmetic and n=1:

x<0 or x≥0y<0 or y≥ 0(x+y)%2^8 < 0 or ≥ 0Overflow BitCarry Bitadditional word neededvalue additional wordExamples (8bit)
x≥0y≥0≥000no-0+0
63+64
x≥0y≥0<010yes064+64
127+127
x≥0y<0≥000 or 1no-65+(-1)
127+(-127)
x≥0y<0<000 or 1no-7+(-8)
127+(-128)
0+(-128)
x<0y≥0≥000 or 1no--9 + 12
-1 + 127
-127+127
x<0y≥0<000 or 1no--128+127
-128+0
-1 + 0
x<0y<0≥011yes-1-64 + (-65)
-128+(-128)
x<0y<0<001no--1 + (-1)
-1 + (-127)
-64 + (-64)

If you like, you can try out examples that include the carry bit and see that the concepts still work out as described.

Links

Share Button

Will Java, C, C++ and C# be the new Cobols?

A few decades ago most programming was performed in Cobol (I do not want to shout it), Fortran, Rexx and some typical main frame languages, that hardly made it to the Linux-, Unix- or MS-Windows-world. They are still present, but mostly used for maintenance and extension of existing software, but less often for writing new software from scratch.
In these days languages like C, C++, Java and to a slightly lesser extent C# dominate the list of most commonly used languages. I would assume that JavaScript is also quite prominent in the list, because it has become more popular to write rich web clients using frameworks like Angular JS. And there are tons of them and some really good stuff. Some people like to see JavaScript also on the server side and in spite of really interesting frameworks like Node-JS I do not really consider this a good idea. If you like you may add Objective C to this list, which I do not know very much at all, even though it has been part of my gcc since my first Linux installation in the early 1990es.

Anyway, C goes back to the 1970es and I think that it was a great language to create at that time and it still is for a certain set of purposes. When writing operating systems, database engines, compilers and interpreters for other languages, editors, or embedded software, everything that is very close to the hardware, explicit control and direct access to very powerful OS-APIs are features that prove to be useful. It has been said that Java runs as fast as C, which is at least close to the truth, but only if we do not take into account the memory usage. C has some short comings that could be done better without sacrificing its strengths in the areas where it is useful, but it does not seem to be happening.

C++ has been the OO-extension of C, but I would say that it has evolved to be a totally different language for different purposes, even though there is some overlap, it is relatively easy to call functionality written in C from C++ and a little bit harder the other way round… I have not used it very much recently, so I will refrain from commenting further on it.

Java has introduced an infrastructure that is very common now with its virtual machine. The JVM is running on a large number of servers and any JVM-language can be used there. The platform independence is an advantage, but I think that its importance on servers has diminished a little bit. There used to be all kinds of servers with different operating systems and different CPU-architectures. But now we are moving towards servers being mostly Linux with Intel-compatible CPUs, so it is becomeing less of an issue. This may change in the future again, of course.

With Mono C# can be used in ways similar to Java, at least that is what the theory says and what seems to be quite true at least up to a certain level. It seems to be a bit ahead of Java with some language features, just think of operator overloading, undeclared exceptions, properties, generics or lambdas, which have been introduced earlier or in a more elegant way or we are still waiting in Java. I think the case of lambdas also shows the limitations, because they seem to behave differently than you would expect from real closures, which is the way lambdas should be done and are done in more functionally oriented languages or even in the Ruby programming language, in the Perl programming language or typical Lisps.
Try this

List<Func<int>> actions = new List<Func<int>>();

int variable = 0;
while (variable < 5)
{
    actions.Add(() => variable * 2);
    ++ variable;
}

foreach (var act in actions)
{
    Console.WriteLine(act.Invoke());
}

We would expect the output 0, 2, 4, 6, 8, but we are getting 10, 10, 10, 10, 10 (one number in a line, respectively).
But it can be fixed:

List<Func<int>> actions = new List<Func<int>>();

int variable = 0;
while (variable < 5)
{
    int copy = variable;
    actions.Add(() => copy * 2);
    ++ variable;
}

foreach (var act in actions)
{
    Console.WriteLine(act.Invoke());
}

I would say that the concept of inner classes is better in Java, even though what is static there should be the default, but having lambdas this is less important…
You find more issues with class loader, which are kind of hard to tame in java, but extremely powerful.

Anyway, I think that all of these languages tend to be similar in their syntax, at least within a method or function and require a lot of boiler plate code. Another issue that I see is that the basic types, which include Strings, even if they are seen as basic types by the language design, are not powerful enough or full of pitfalls.

While the idea to use just null terminated character arrays as strings in C has its beauty, I think it is actually not really good enough and for more serious C applications a more advanced string library would be good, with the disadvantage that different libraries will end up using different string libraries… Anyway, for stuff that is legitimately done with C now, this is not so much of an issue and legacy software has anyway its legacy how to deal with strings, and possible painful limitations in conjunction with Unicode. Java and also C# have been introduced at a time when Unicode was already around and the standard already claimed to use more than 65536 code points (characters in Unicode-speak), but at that time 65536 seemed to be quite ok to cover the needs for all common languages and so utf-16 was picked as an encoding. This blows up the memory, because strings occupy most of the memory in typical application software, but it still leaves us with uncertainties about length and position, because code points can be one or two 16-bit-„characters“ long, which can only be seen by actually iterating through the string, which leaves us where we were with null terminated strings in C. And strings are really hard to replace or enhance in this aspect, because they are used everywhere.

Numbers are not good either. As an application developer we should not care about counting bits, unless we are in an area that needs to be specifically optimized. We are using mostly integer types in application development, at least we should. These overflow silently. Just to see it in C#:

int i = 0;
int s = 1;
for (i = 0; i < 20; i++)
{
    s *= 7;
    Console.WriteLine("i=" + i + " s=" + s);
}

which gives us:

i=0 s=7
i=1 s=49
i=2 s=343
i=3 s=2401
i=4 s=16807
i=5 s=117649
i=6 s=823543
i=7 s=5764801
i=8 s=40353607
i=9 s=282475249
i=10 s=1977326743
i=11 s=956385313
i=12 s=-1895237401
i=13 s=-381759919
i=14 s=1622647863
i=15 s=-1526366847
i=16 s=-2094633337
i=17 s=-1777531471
i=18 s=442181591
i=19 s=-1199696159

So it silently overflows, or just takes the remainder modulo 2^{32} with the representation system \{-2^{31} \ldots 2^{31}-1\}. Java, C and C++ behave exactly the same way, only that we need to know what „int“ means for our C-compiler, but if we use 32-bit-ints, it is the same. This should throw an exception or switch to some unlimited long integer. Clojure offers both options, depending on whether you use * or *‘ as operator. So as application developers we should not have to care about these bits and most developers do not think about it. Usually it goes well, but a lot of software bugs are around due to this pattern. It is just wrong in C#, Java, and C++. In C I find it more acceptable, because the typical area for using C for new software actually is quite related to bits and bytes, so the developers need to be aware of such issues all the time anyway.

I would consider it desirable to move to more expressive languages like Clojure, Scala, F#, Ruby or Perl for application development. Ruby and Perl have better Strings. Clojure and Scala inherit them from the JVM, and F# has the same strings as C#. Ruby and Clojure have a good way to deal with integers, Scala, Perl and F# can do it right if we actually want to do so, but not by default. Perl and Ruby are very weak when it comes to multithreading. As compared to Java this can be dealt with by just using more processes instead of threads, because the overhead of a Ruby or Perl process is much less than the overhead of a Java process, but I would see this as a major drawback. C, C#, Java and C++ offer good facilities to use multithreading, but the issue of avoiding typical multithreading bugs is a big deal and actually too hard for a large fraction of typical application developers. Or at least too far away from there point of focus. Moving to a more functional paradigm might be a way to go. Java enterprise edition is a failure if the goal is to get multithreading, done well without having to worry about it, because the overhead is too much. On the other hand, if you are willing to go the extra mile, having more explicit access to the multithreading mechanism and using it correctly is extremely powerful, for example in C with pthreads or with a deliberate usage of processes, shared memory and threads together. For which kind of projects do we have the time and the team for this? I am not talking about multithreaded applications that work well on the developer’s laptop, but fail during some high load processing in production with some concurrent modification issues a few months after the deployment. Thinking cannot be replaced by testing.

So now we have a lot of software in C, C++, Java and C# and a lot of new software is written in these languages, even from scratch. We could do better, sometimes we do, sometimes we don’t. It is possible to write excellent application software with Java, C++, C# and even C. It just takes a bit longer, but if we use them with care, it will be ok. Some companies are very conservative and want to use stuff that has been around for a long time. This is sometimes right and sometimes wrong. And since none of the more modern languages has really picked up so much speed that it can be considered a new main stream, it is understandable that some organizations are scared about marching into a dead end road.

On the other hand, many businesses can differentiate themselves by providing services that are only possible by having a very innovative IT. Banks like UBS and Credit Suisse in Switzerland are not likely to be there, while banks like ING are on that road. As long as they compete for totally different customer bases and as long as the business has enough strengths that are not depending so heavily on an innovative IT, but just on a working robust IT, this will be fine. But time moves on and innovation will eventually out-compete conservative businesses.

Share Button

Frameworks for Unit Testing and Mocking

Unit testing has fortunately become an important issue in many software projects. The idea of automatic software based unit and integration tests is actually quite old. The typical Linux software that is downloaded as source code and then built with steps like

tar xfzvv «software-name-with-version».tar.gz
cd «software-name-with-version»
./configure
make
sudo make install

often allows a step

make test

or

make check

or even both before the

make install

It was like that already in the 1990s, when the word „unit test“ was unknown and the whole concept had not been popularized to the main stream.

What we need is to write those automated tests to an extent that we have good confidence that the software will be reliable enough in terms of bugs if it passes the test suite. The tests can be written in any language and I do encourage you to think about using other languages, in order to be less biased and more efficient for writing the tests. We may choose to write a software in C, C++ or Java for the sake of efficiency or easier integration into the target platform. But these languages are efficient in their usages of CPU power, but not at all efficient in using developer time to write a lot of functionality. This is ok for most projects, because the effort it takes to develop with these languages is accepted in exchange for the anticipated benefits. For testing it is another issue.

On the other hand there are of course advantages in using actually the same language for writing the tests, because it is easier to access the APIs and even internal functionalities during the tests. So it may very well be that Unit tests are written in the same language as the software and this is actually what I am doing most of the time. But do think twice about your choice.

Now writing automated tests is actually no magic. It does not really need frameworks, but is quite easy to accomplish manually. All we need is kind of two areas in our source code tree. One area that goes into the production code and one area that is only used for the tests and remains on the development and continuous integration machines. Since writing automated tests without frameworks is not really a big deal, we should only look at frameworks that are really simple and easy to use or maybe give us really good features that we actually need. This is the case with many such frameworks, so the way to go is to actually use them and save some time and make the structure more accessible to other team members, who know the same testing framework. Writing and running unit tests should be really easy, otherwise it is not done or the unit tests are disabled and loose contact to the actual software and become worthless.

Bugs are much more expensive, the later they are discovered. So we should try to find as many of them while developing. Writing unit tests and automated integrated tests is a good thing and writing them early is even better. The pure test driven approach does so before actually writing the code. I recommend this for bug fixing, whenever possible.

There is one exception to this rule. When writing GUIs, automated testing is possible, but quite hard. Now we should have UX guys involved and we should present them with some early drafts of the software. If we had already developed elaborate selenium tests by then, it would be painful to change the software according to the advice of the UX guy and rewrite the tests. So I would keep it flexible until we are on the same page as the UX guys and add the tests later in this area.

Frameworks that I like are actually CUnit for C, JUnit for Java, where TestNG would be a viable alternative, and Google-Test for C++. CUnit works extremely well on Linux and probably on other Unix-like systems like Solaris, Aix, MacOSX, BSD etc. There is no reason why it should not work on MS-Windows. With cygwin actually it is extremely easy to use it, but with native Win32/Win64 it seems to need an effort to get this working, probably because MS-Windows is no priority for the developers of CUnit.

Now we should use our existing structures, but there can be reasons to mock a component or functionality. It can be because during the development a component does not exist. Maybe we want to see if the component is accessed the right way and this is easier to track with a mock that records the calls than with the real thing that does some processing and gives us only the result. Or maybe we have a component with is external and not always available or available, but too time consuming for most of our tests.

Again mocking is no magic and can be done without tools and frameworks. So the frameworks should again be very easy and friendly to use, otherwise they are just a pain in the neck. Early mocking frameworks were often too ambitious and too hard to use and I would have avoided them whenever possible. In Java mocking manually is quite easy. We just need an interface of the mocked component and create an implementing class. Then we need to add all missing methods, which tools like eclipse would do for us, and change some of them. That’s it. Now we have mockito for Java and Google-Mock, which is now part of Google-Test, for C++. In C++ we create a class that behaves similar to a Java interface by having all methods pure virtual with keyword „virtual“ and „=0“ instead of the implementation. The destructor is virtual with an empty implementation. They are so easy to use and they provide useful features, so they are actually good ways to go.

For C the approach is a little bit harder. We do not have the interfaces. So the way to go is to create a library of the code that we want to test and that should go to production. Then we write one of more c-files for the test, that will and up in an executable that actually runs the test. In these .c-files we can provide a mock-implementation for any function and it takes precedence of the implementation from the library. For complete tests we will need to have more than one executable, because in each case the set of mocked functions is fixed within one executable. There are tools in the web to help with this. I find the approach charming to generate the C-code for the mocked functions from the header files using scripts in the a href=“https://en.wikipedia.org/wiki/Ruby_(programming_language)“>Ruby programming language or in the Perl programming language.

Automated testing is so important that I strongly recommend to do changes to the software in order to make it accessible to tests, of course within reason. A common trick is to make certain Java methods package private and have the tests in the same package, but a different directory. Document why they are package private.

It is important to discuss and develop the automated testing within the team and find and improve a common approach. Laziness is a good thing. But laziness means running many automated tests and avoid some manual testing, not being too lazy to write them and eventually spending more time on manual repetitive activities.

I can actually teach this in a two-day or three-day course.

Share Button

How to create ISO Date String

It is a more and more common task that we need to have a date or maybe date with time as String.

There are two reasonable ways to do this:
* We may want the date formatted in the users Locale, whatever that is.
* We want to use a generic date format, that is for a broader audience or for usage in data exchange formats, log files etc.

The first issue is interesting, because it is not always trivial to teach the software to get the right locale and to use it properly… The mechanisms are there and they are often used correctly, but more often this is just working fine for the locale that the software developers where asked to support.

So now the question is, how do we get the ISO-date of today in different environments.

Linux/Unix-Shell (bash, tcsh, …)

date "+%F"

TeX/LaTeX


\def\dayiso{\ifcase\day \or
01\or 02\or 03\or 04\or 05\or 06\or 07\or 08\or 09\or 10\or% 1..10
11\or 12\or 13\or 14\or 15\or 16\or 17\or 18\or 19\or 20\or% 11..20
21\or 22\or 23\or 24\or 25\or 26\or 27\or 28\or 29\or 30\or% 21..30
31\fi}
\def\monthiso{\ifcase\month \or
01\or 02\or 03\or 04\or 05\or 06\or 07\or 08\or 09\or 10\or 11\or 12\fi}
\def\dateiso{\def\today{\number\year-\monthiso-\dayiso}}
\def\todayiso{\number\year-\monthiso-\dayiso}

This can go into a file isodate.sty which can then be included by \include or \input Then using \todayiso in your TeX document will use the current date. To be more precise, it is the date when TeX or LaTeX is called to process the file. This is what I use for my paper letters.

LaTeX

(From Fritz Zaucker, see his comment below):

\usepackage{isodate} % load package
\isodate % switch to ISO format
\today % print date according to current format

Oracle


SELECT TO_CHAR(SYSDATE, 'YYYY-MM-DD') FROM DUAL;

On Oracle Docs this function is documented.
It can be chosen as a default using ALTER SESSION for the whole session. Or in SQL-developer it can be configured. Then it is ok to just call

SELECT SYSDATE FROM DUAL;

Btw. Oracle allows to add numbers to dates. These are days. Use fractions of a day to add hours or minutes.

PostreSQL

(From Fritz Zaucker, see his comment):

select current_date;
—> 2016-01-08


select now();
—> 2016-01-08 14:37:55.701079+01

Emacs

In Emacs I like to have the current Date immediately:

(defun insert-current-date ()
"inserts the current date"
(interactive)
(insert
(let ((x (current-time-string)))
(concat (substring x 20 24)
"-"
(cdr (assoc (substring x 4 7)
cmode-month-alist))
"-"
(let ((y (substring x 8 9)))
(if (string= y " ") "0" y))
(substring x 9 10)))))
(global-set-key [S-f5] 'insert-current-date)

Pressing Shift-F5 will put the current date into the cursor position, mostly as if it had been typed.

Emacs (better Variant)

(From Thomas, see his comment below):

(defun insert-current-date ()
"Insert current date."
(interactive)
(insert (format-time-string "%Y-%m-%d")))

Perl

In the Perl programming language we can use a command line call

perl -e 'use POSIX qw/strftime/;print strftime("%F", localtime()), "\n"'

or to use it in larger programms

use POSIX qw/strftime/;
my $isodate_of_today = strftime("%F", localtime());

I am not sure, if this works on MS-Windows as well, but Linux-, Unix- and MacOS-X-users should see this working.

If someone has tried it on Windows, I will be interested to hear about it…
Maybe I will try it out myself…

Perl 5 (second suggestion)

(From Fritz Zaucker, see his comment below):

perl -e 'use DateTime; use 5.10.0; say DateTime->now->strftime(„%F“);‘

Perl 6

(From Fritz Zaucker, see his comment below):

say Date.today;

or

Date.today.say;

Ruby

This is even more elegant than Perl:

ruby -e 'puts Time.new.strftime("%F")'

will do it on the command line.
Or if you like to use it in your Ruby program, just use

d = Time.new
s = d.strftime("%F")

Btw. like in Oracle SQL it is possible add numbers to this. In case of Ruby, you are adding seconds.

It is slightly confusing that Ruby has two different types, Date and Time. Not quite as confusing as Java, but still…
Time is ok for this purpose.

C on Linux / Posix / Unix


#include
#include
#include

main(int argc, char **argv) {

char s[12];
time_t seconds_since_1970 = time(NULL);
struct tm local;
struct tm gmt;
localtime_r(&seconds_since_1970, &local);
gmtime_r(&seconds_since_1970, &gmt);
size_t l1 = strftime(s, 11, "%Y-%m-%d", &local);
printf("local:\t%s\n", s);
size_t l2 = strftime(s, 11, "%Y-%m-%d", &gmt);
printf("gmt:\t%s\n", s);
exit(0);
}

This speeks for itself..
But if you like to know: time() gets the seconds since 1970 as some kind of integer.
localtime_r or gmtime_r convert it into a structur, that has seconds, minutes etc as separate fields.
stftime formats it. Depending on your C it is also possible to use %F.

Scala


import java.util.Date
import java.text.SimpleDateFormat
...
val s : String = new SimpleDateFormat("YYYY-MM-dd").format(new Date())

This uses the ugly Java-7-libraries. We want to go to Java 8 or use Joda time and a wrapper for Scala.

Java 7


import java.util.Date
import java.text.SimpleDateFormat

...
String s = new SimpleDateFormat("YYYY-MM-dd").format(new Date());

Please observe that SimpleDateFormat is not thread safe. So do one of the following:
* initialize it each time with new
* make sure you run only single threaded, forever
* use EJB and have the format as instance variable in a stateless session bean
* protect it with synchronized
* protect it with locks
* make it a thread local variable

In Java 8 or Java 7 with Joda time this is better. And the toString()-method should have ISO8601 as default, but off course including the time part.

Summary

This is quite easy to achieve in many environments.
I could provide more, but maybe I leave this to you in the comments section.
What could be interesting:
* better ways for the ones that I have provided
* other databases
* other editors (vim, sublime, eclipse, idea,…)
* Office packages (Libreoffice and MS-Office)
* C#
* F#
* Clojure
* C on MS-Windows
* Perl and Ruby on MS-Windows
* Java 8
* Scala using better libraries than the Java-7-library for this
* Java using better libraries than the Java-7-library for this
* C++
* PHP
* Python
* Cobol
* JavaScript
* …
If you provide a reasonable solution I will make it part of the article with a reference…
See also Date Formats

Share Button