Meaningless Whitespace in Textfiles

We use different file formats that are more or less tolerant to certain changes. Most well known is white space in text files.

In some programming languages white space (space, newline, carriage return, form feed, tabulator, vertical tab) has no meaning beyond separating tokens: where white space is needed, any white space will do. Examples are Java, Perl, Lisp and C. White space that is part of string content is always significant, but white space used within the program text can be any combination of one or more of the white space characters in the lower 128 positions (ISO-646, often referred to as ASCII or 7-bit ASCII). It is of course recommended to have a coding standard that gives guidelines on when to use newlines, whether tabs or spaces are preferred (please spaces) and how to indent. But this is just about human readability and the compiler does not really care. Line numbers do carry some meaning in compiler and runtime error messages and stack traces, so putting everything into one line would do harm beyond readability, but there is a wide range of ways that are all correct and equivalent. Btw. many teams limit lines to 80 characters, which was a valid choice 30 years ago, when some terminals were only 80 characters wide and 132 character wide terminals were just coming up. But as a hard limit it is a joke today, because not many of us would be able to work efficiently with a vt100 terminal anyway. Very long lines can be harder to read, so something around 120 or 160 might still be a reasonable line length…

Languages like Ruby and Scala put slightly more meaning into white space, because in most cases a semicolon can be skipped if it is followed by a newline and not just horizontal white space. And Perl (Perl 5) is for sure so hard to compile that only its own implementation can properly format or even recognize which white space is part of a literal string. Special cases like having the language in a string and parsing and then executing that should be ignored here.

Now we put these program files into a source code management system, usually Git. Some teams still use legacy systems like Subversion, SourceSafe, ClearCase or CVS, while there are some newer systems that are probably about as powerful as Git, but I never saw them in use. Git creates a SHA-1 hash of each file's content, which implies that any minor change will result in a new version, even if it is just white space. Now this does not hurt too much if we agree on the same formatting and on the same line ending (hopefully LF only, not CR LF, even on MS-Windows). But our tooling does not distinguish between significant changes and insignificant formatting-only changes. This gets worse if users have different IDEs, which they should have, because everyone should use the IDE or editor with which he or she is most efficient, and if the formal description of the preferred formatting is not shared between editors or differs slightly.
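To illustrate why a pure white space change counts as a new version, here is a minimal sketch that computes a Git-style blob id in Java. It assumes the classic SHA-1 object format (the header "blob", the byte length and a NUL byte, followed by the content); the class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class GitBlobHash {
    /** Returns the hex SHA-1 over "blob <length>\0<content>", as Git does for a blob. */
    static String blobId(String content) {
        try {
            byte[] body = content.getBytes(StandardCharsets.UTF_8);
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            sha1.update(("blob " + body.length + "\0").getBytes(StandardCharsets.UTF_8));
            sha1.update(body);
            StringBuilder hex = new StringBuilder();
            for (byte b : sha1.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required to be present on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Same program for the compiler, two different ids for Git
        System.out.println(blobId("int x=1;\n"));
        System.out.println(blobId("int x = 1;\n"));
    }
}
```

The two printed ids differ, although a Java or C compiler would treat both lines as the same statement.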

I think that each programming language should come with a command line diff tool and a command line formatting tool, that obey a standard interface for calling and can be plugged into editors and into source code management systems like git. Then the same mechanisms work for C, Java, C#, Ruby, Python, Fortran, Clojure, Perl, F#, Scala, Lua or your favorite programming language.

I can imagine two ways of working. In the first, we have a standard format and possibly individual formats for each developer. During "git commit" the file is brought into the standard format before it is shown to Git, so meaningless white space changes disappear. During checkout the file can optionally be brought into the preferred format of the developer. And yes, there are ways to deal with deliberate formatting that for some reason should be kept verbatim, and to deal differently with comments and of course all kinds of string literals. Remember, the formatting tool comes from the same source as the compiler and fully understands the language.

The other approach leaves the formatting up to the developer and only creates a new version, when the diff tool of the language signifies that there is a relevant change.
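As a rough sketch of what such a language-aware comparison could look like, the following Java class splits two sources into tokens and compares the token lists instead of the raw text. Everything here is a simplifying assumption: the class name, the token rules (words, quoted literals, single bracket and separator characters, runs of operator characters) and the fact that comments are not handled at all. It is deliberately conservative, so "x =-1" and "x = -1" are reported as different. A real tool would reuse the lexer of the compiler.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenDiff {
    // Characters that always form a token of their own (assumption for this sketch)
    private static final String SINGLE = "(){}[];,";

    private static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    private static boolean isOperatorChar(char c) {
        return !Character.isWhitespace(c) && !isWordChar(c)
                && c != '"' && c != '\'' && SINGLE.indexOf(c) < 0;
    }

    /** Splits source text into tokens; white space outside literals is dropped. */
    static List<String> tokens(String src) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                   // insignificant separator
                continue;
            }
            int start = i;
            if (c == '"' || c == '\'') {
                i++;                                   // quoted literal, kept verbatim
                while (i < src.length() && src.charAt(i) != c) {
                    if (src.charAt(i) == '\\') i++;    // skip the escaped character
                    i++;
                }
                i++;                                   // closing quote
            } else if (isWordChar(c)) {
                while (i < src.length() && isWordChar(src.charAt(i))) i++;
            } else if (SINGLE.indexOf(c) >= 0) {
                i++;
            } else {
                while (i < src.length() && isOperatorChar(src.charAt(i))) i++;
            }
            out.add(src.substring(start, Math.min(i, src.length())));
        }
        return out;
    }

    /** True if both sources consist of the same token sequence. */
    static boolean sameTokens(String a, String b) {
        return tokens(a).equals(tokens(b));
    }

    public static void main(String[] args) {
        System.out.println(sameTokens("int x=1;", "int  x =\n\t1 ;"));
        System.out.println(sameTokens("s=\"a b\";", "s=\"a  b\";"));
    }
}
```

With such a check in place, a new version would only be created when sameTokens returns false for the old and the new content.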

I think that we should strive for this approach. It is no rocket science; tools of this kind have been around for many decades as diff and formatting tools. It would just be necessary to go the extra mile and create sister diff and formatting tools for the compiler (or interpreter) and to actually integrate these into build environments, IDEs, editors and Git. It would save a lot of time and leave more time for solving real problems.

Is there any programming language that actually does this already?

How to handle XML? Is XML just the new binary with a bit more bloat? Can we do a generic handling of all XML or should it depend on the Schema?


Loops with unknown nesting depth

We often encounter nested loops, like

for (i = 0; i < n; i++) {
    for (j = 0; j < m; j++) {
        doSomething(i, j);
    }
}

This can be nested to a few more levels without too much pain, as long as we observe that the numbers of iterations of all levels need to be multiplied to get the number of iterations for the whole thing, and that total numbers of iterations beyond a few billion (10^9, German: Milliarden, Russian: миллиарды) become unreasonable no matter how fast doSomething(...) is. Just look at this example program

public class Modular {
    public static void main(String[] args) {
        long n = Long.parseLong(args[0]);
        long t = System.currentTimeMillis();
        long m = Long.parseLong(args[1]);
        System.out.println("n=" + n + " t=" + t + " m=" + m);
        long prod = 1;
        long sum  = 0;
        for (long i = 0; i < n; i++) {
            long j = i % m;
            sum += j;
            sum %= m;
            prod *= (j*j+1) % m;
            prod %= m;
        }
        System.out.println("sum=" + sum + " prod=" + prod + " dt=" + (System.currentTimeMillis() - t));
    }
}

which measures its net run time and runs 0 msec for 1000 iterations and about five and a half minutes for 10 billion (10^{10}) iterations:

> java Modular 1000 1001 # 1'000
--> sum=1 prod=442 dt=0
> java Modular 10000 1001 # 10'000
--> sum=55 prod=520 dt=1
> java Modular 100000 1001 # 100'000
--> sum=45 prod=299 dt=7
> java Modular 1000000 1001 # 1'000'000
--> sum=0 prod=806 dt=36
> java Modular 10000000 1001 # 10'000'000
--> sum=45 prod=299 dt=344
> java Modular 100000000 1001 # 100'000'000
--> sum=946 prod=949 dt=3314
> java Modular 1000000000 1001 # 1'000'000'000
--> sum=1 prod=442 dt=34439
> java Modular 10000000000 1001 # 10'000'000'000
--> sum=55 prod=520 dt=332346

As soon as we do I/O, network access, database access or simply a bit more serious calculation, this becomes of course easily unbearably slow. But today it is cool to deal with big data and to at least call what we are doing big data, even though conventional processing on a laptop can do it in a few seconds or minutes... And there are of course ways to process way more iterations than this, but it becomes worth thinking about the system architecture, the hardware, parallel processing and of course algorithms and software stacks. But here we are in the "normal world", which can be a "normal subuniverse" of something really big, so running on one CPU and using a normal language like Perl, Java, Ruby, Scala, Clojure, F# or C.

Now sometimes we encounter situations where we want to nest loops, but the depth is unknown, something like

for (i_0 = 0; i_0 < n_0; i_0++) {
  for (i_1 = 0; i_1 < n_1; i_1++) {
    \cdots
      for (i_m = 0; i_m < n_m; i_m++) {
        dosomething(i_0, i_1,\ldots, i_m);
      }
    \cdots
  }
}

Now our friends from the functional world help us to understand what a loop is, because in some of these more functional languages the classical C-style loop is either missing or at least not recommended as the everyday tool. Instead we view the set of values we iterate over as a collection and iterate through every element of the collection. This could be a bad thing, because instantiating such big collections can be a show stopper, but we do not need to instantiate them. Out of the many features of collections we just pick the iterability, which can very well be accomplished by lazy collections. In Java we have the Iterable, Iterator, Spliterator and Stream interfaces to express such potentially lazy collections that are just used for iterating.

So we could think of a library that provides us with support for ordinary loops, so we could write something like this:

Iterable<Integer> range = new LoopRangeExcludeUpper<>(0, n);
for (Integer i : range) {
    doSomething(i);
}

or even better, if we assume 0 as a lower limit is the default anyway:

Iterable<Integer> range = new LoopRangeExcludeUpper<>(n);
for (Integer i : range) {
    doSomething(i);
}

with the ugliness of boxing and unboxing in terms of runtime overhead, memory overhead, and additional complexity for development. In Scala, Ruby or Clojure the equivalent solution would be elegant and useful and the way to go...
I would assume that a library that provides something like LoopRangeExcludeUpper in the code example should easily be available for Java, maybe even in the standard library, or in some common public Maven repository...
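For illustration, here is a minimal sketch of such a lazy range in Java. The nested class IntRange is a hypothetical, non-generic stand-in for LoopRangeExcludeUpper and not an existing library class. For comparison, the second method uses IntStream.range from the Java 8 standard library, which iterates over the same half-open range without boxing.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.stream.IntStream;

public class RangeDemo {
    /** Hypothetical lazy range [lower, upper); no collection is ever materialized. */
    static final class IntRange implements Iterable<Integer> {
        private final int lower;
        private final int upper;

        IntRange(int lower, int upper) {
            this.lower = lower;
            this.upper = upper;
        }

        IntRange(int upper) {
            this(0, upper);              // 0 as the default lower limit
        }

        @Override
        public Iterator<Integer> iterator() {
            return new Iterator<Integer>() {
                private int current = lower;

                @Override
                public boolean hasNext() {
                    return current < upper;
                }

                @Override
                public Integer next() {
                    if (current >= upper) {
                        throw new NoSuchElementException();
                    }
                    return current++;    // the int is boxed to Integer here
                }
            };
        }
    }

    /** Sums 0..n-1 with the for-each loop over the hypothetical range. */
    static int sumWithIterable(int n) {
        int sum = 0;
        for (int i : new IntRange(n)) {
            sum += i;
        }
        return sum;
    }

    /** Sums 0..n-1 with the primitive stream from the standard library. */
    static int sumWithStream(int n) {
        return IntStream.range(0, n).sum();
    }

    public static void main(String[] args) {
        System.out.println(sumWithIterable(5) + " " + sumWithStream(5));
    }
}
```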

Now the issue of loops with unknown nesting depth can easily be addressed by writing or downloading a class like NestedLoopRange, which might have a constructor of the form NestedLoopRange(int ... ni) or NestedLoopRange(List li) or something with collections that are more efficient with primitives, for example from Apache Commons. Consider using long instead of int, which will break some compatibility with Java-collections. This should not hurt too much here and it is a good thing to reconsider the 31-bit size field of Java collections as an obstacle for future development and to address how collections can grow larger than 2^{31}-1 elements, but that is just a side issue here. We broke this limit with the example iterating over 10'000'000'000 values for i already and it took only a few minutes. Of course it was just an abstract way of dealing with a lazy collection without the Java interfaces involved.

So, the code could just look like this:

Iterable<Tuple> range = new NestedLoopRange(n_0, n_1, \ldots, n_m);
for (Tuple t : range) {
    doSomething(t);
}

Btw, it is not too hard to write it in the classical way either:

        long[] n = new long[] { n_0, n_1, \ldots, n_m };
        int m1 = n.length;
        int m  = m1-1; // just to have the math-m matched...
        long[] t = new long[m1];
        for (int j = 0; j < m1; j++) {
            t[j] = 0L;
        }
        boolean done = false;
        for (int j = 0; j < m1; j++) {
            if (n[j] <= 0) {
                done = true;
                break;
            }
        }
        while (! done) {
            doSomething(t);
            done = true;
            for (int j = 0; j < m1; j++) {
                t[j]++;
                if (t[j] < n[j]) {
                    done = false;
                    break;
                }
                t[j] = 0;
            }
        }
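The same counting logic can be wrapped into a lazy Iterable. The following is only a sketch under some assumptions: it uses a plain long[] as the tuple type instead of a dedicated Tuple class, it hands out a fresh copy of the counter array on every step, and like the loop above it lets index 0 vary fastest.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class NestedLoopRange implements Iterable<long[]> {
    private final long[] n;              // upper limits (exclusive) per nesting level

    public NestedLoopRange(long... n) {
        this.n = n.clone();
    }

    /** True if any level has no iterations, so the whole nest is empty. */
    private boolean isEmpty() {
        for (long limit : n) {
            if (limit <= 0) {
                return true;
            }
        }
        return false;
    }

    @Override
    public Iterator<long[]> iterator() {
        return new Iterator<long[]>() {
            private final long[] t = new long[n.length];   // current counters, all 0
            private boolean done = isEmpty();

            @Override
            public boolean hasNext() {
                return !done;
            }

            @Override
            public long[] next() {
                if (done) {
                    throw new NoSuchElementException();
                }
                long[] result = t.clone();                 // caller may keep the tuple
                // advance like an odometer: increment, carry on overflow
                done = true;
                for (int j = 0; j < t.length; j++) {
                    t[j]++;
                    if (t[j] < n[j]) {
                        done = false;
                        break;
                    }
                    t[j] = 0;
                }
                return result;
            }
        };
    }

    public static void main(String[] args) {
        for (long[] t : new NestedLoopRange(2, 3)) {
            System.out.println(java.util.Arrays.toString(t));
        }
    }
}
```

Copying the array on each step costs some allocation; a variant that reuses one array would be faster, but callers would then have to promise not to keep references to it.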

I have written this kind of loop several times in my life in different languages. The first time was on C64-basic when I was still in school and the last one was written in Java and shaped into a library, where appropriate collection interfaces were implemented, which remained in the project or the organization, where it had been done, but it could easily be written again, maybe in Scala, Clojure or Ruby, if it is not already there. It might even be interesting to explore, how to write it in C in a way that can be used as easily as such a library in Java or Scala. If there is interest, please let me know in the comments section, I might come back to this issue in the future...

In C it is actually quite possible to write a generic solution. I think an API like this might work:

struct nested_iteration {
  /* implementation detail */
};

void init_nested_iteration(struct nested_iteration *ni, size_t m1, const long *n);
void dispose_nested_iteration(struct nested_iteration *ni);
int nested_iteration_done(const struct nested_iteration *ni); /* returns 0=false or 1=true */
void nested_iteration_next(struct nested_iteration *ni);

and it would be called like this:

struct nested_iteration ni;
long n[] = { n_0, n_1, \ldots, n_m };
for (init_nested_iteration(&ni, m+1, n);
     ! nested_iteration_done(&ni);
     nested_iteration_next(&ni)) {
...
}
dispose_nested_iteration(&ni);

So I guess, it is doable and reasonably easy to program and to use, but of course not quite as elegant as in Java 8, Clojure or Scala.
I would like to leave this as a rough idea and maybe come back with concrete examples and implementations in the future.


How to recover the Carry Bit

As frequent readers might have observed, I like the concept of the carry bit, as it allows for efficient implementations of long integer arithmetic, which I would like to use as the default integer type for most application development. Unfortunately such facilities are not available in high level languages like C and Java. But it is possible to recover the carry bit from what is available in C or Java, at some extra cost in performance, which may be negligible if the compiler optimizes this well. We might assume gcc on a 64-bit Linux. It should be possible to do similar things on other platforms.

So we add two unsigned 64-bit integers x and y to a result

    \[z\equiv x+y \mod 2^{64}\]

with

    \[0 \le z < 2^{64}\]

using the typical "unsigned long long" of C. We assume that

    \[x=2^{63}x_h+x_l\]

where

    \[x_h \in \{0,1\}\]

and

    \[0 \le x_l < 2^{63}.\]

In the same way we assume y=2^{63}y_h + y_l and z=2^{63}z_h + z_l with the same kind of conditions for x_h, y_h, z_h and x_l, y_l, z_l, respectively.

Now we have

    \[0 \le x_l+y_l \le 2^{64}-2\]

and we can see that

    \[x_l + y_l = 2^{63}u + z_l\]

for some

    \[u\in \{0,1\}.\]

And we have

    \[x+y = 2^{64}c+z\]

where

    \[c\in\{0,1\}\]

is the carry bit.
When looking just at the highest visible bit and the carry bit, this boils down to

    \[2c+z_h = x_h + y_h + u\]

This leaves us with eight cases to observe for the combination of x_h, y_h and u:

x_h  y_h  u   z_h  c
 0    0   0    0   0
 1    0   0    1   0
 0    1   0    1   0
 1    1   0    0   1
 0    0   1    1   0
 1    0   1    0   1
 0    1   1    0   1
 1    1   1    1   1

We can check all eight cases and find that we always have

    \[c = x_h \wedge\neg z_h \vee y_h \wedge\neg z_h \vee x_h \wedge y_h \wedge z_h\]

or

    \[c = (x_h \vee y_h) \wedge\neg z_h \vee x_h \wedge y_h \wedge z_h.\]

So the result does not depend on u anymore, which allows us to calculate it by temporarily casting x, y and z to (signed long long) and using their sign.
We can express this as "use x_h \wedge y_h if z_h = 1 and use x_h \vee y_h if z_h = 0".

An incoming carry bit d does not change this, it just allows for x_l + y_l + d < 2^{64}, which is sufficient for making the previous calculations work.

In a similar way subtraction can be dealt with.
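As a small sketch of how this looks in Java, where long is signed but wraps modulo 2^{64}, the carry can be read off bit 63 of a bitwise expression that mirrors the formula above. The class and method names are invented for this example. The borrow formula for subtraction is not derived in this article; it is the analogous high-bit expression and should be treated as an assumption, which is why it is worth checking against Long.compareUnsigned.

```java
public class CarryBit {
    /**
     * Carry out of x + y + d, with x and y read as unsigned 64-bit values
     * and d an incoming carry that must be 0 or 1.
     */
    static long carryOfAdd(long x, long y, long d) {
        long z = x + y + d;                        // wraps modulo 2^64
        // c = (x_h OR y_h) AND NOT z_h  OR  x_h AND y_h AND z_h, taken at bit 63
        return (((x | y) & ~z) | (x & y & z)) >>> 63;
    }

    /** Borrow of x - y with x and y read as unsigned: 1 exactly if x < y. */
    static long borrowOfSub(long x, long y) {
        long z = x - y;                            // wraps modulo 2^64
        // analogous high-bit expression (assumption, not derived in the text)
        return ((~x & y) | ((~x | y) & z)) >>> 63;
    }

    public static void main(String[] args) {
        // -1L is 2^64 - 1 when read as unsigned, so adding 1 overflows
        System.out.println(carryOfAdd(-1L, 1L, 0L));
        System.out.println(carryOfAdd(1L, 2L, 0L));
        System.out.println(borrowOfSub(0L, 1L));
    }
}
```

The unsigned shift >>> 63 moves the highest bit into the lowest position, so the result is always 0 or 1 and can be fed directly into the next addition as d.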

The basic operations add, adc, sub, sbb, mul, xdiv (div is not available) have been implemented in this library for C. Feel free to use it according to the license (GPL). Addition and subtraction could be implemented in a similar way in Java, with the weirdness of declaring signed longs and using them as unsigned. For multiplication and division, native code would be needed, because Java lacks 128bit-integers. So the C-implementation is cleaner.
