In spite of working mostly for server software and server setup using powerful non-graphic command line tools and scripting languages, it is sometimes fun to work with something very graphical. I did talk about Clojure Art, which is fun and creates interesting visual results and helps getting into the phantastic language Clojure. But more than twenty years ago I have discovered GIMP, which is the main image editor on Linux computers. I keep hearing that Photoshop is even a bit better, but it does not work on my computers, so I do not really care too much about it.

To be clear, I am not a professional image editing specialist, I just do it a bit for fun and without the claim of putting in all the knowledge about colors and their visual appearance, the functionality of gimp and image editing in general… I am just experimenting and finding out what looks interesting or good to me and how to work efficiently. Actually it brings together my three interests, programming, photography and bicycle touring, the combination of the latter two being the major source of my input material.

Now you start working with layers and with tools that increase or decrease the brightness, contrast and saturation of an image or of the layer being worked on. Now I would like to explore how certain functions can be brought into this. Either by putting them into my fork of gimp or into a plugin or into a script within gimp or a standalone script or program.

Some things that I find interesting and would like to explore: additional functions for merging layers. The function has the input of n (for example n=2) pixels from the same position and different layers and it produces another pixel. These are the twenty or thirty layer modes, that describe how a layer is seen on top of the next lower visible layer. So two images of the same size or two layers could be merged. It could be as nicely as in Gimp or just desctructively to make things easier. If it is worth anything for anybody else, maybe making it work as the current modes would be a noble goal. But for the moment it is more interesting what to do. A very logical thing to do is taking just the average of the two layers. So for rgb it could be the arithmetic mean of the r, g and b values of the n layers (or images) belonging to the same x-y-position. Now what would alpha values mean? I would think that they are weighting the average and the new alpha value could be the average of the input alpha values. Now we could use geometric, quadratic and cubic means and with some care concerning the 0 even harmonic means. Very funny effects could be created by combining these byte-values with functions like xor.

When working with any functions, it is always annoying that the r-g-b-values are always between 0 and 255 (or something like that). So this can be changed to real numbers, by doing something like

    \[s = \tan(\frac{(r-127.5)\pi}{256})\]

or to the non-negative real numbers by doing something like

    \[s = \tan(\frac{r\pi}{512})\]

Then some functions can be applied to these double values and in the end the inverse function will just be applied and result in a rounded and limited integral value from 0 to 255.

Do we need this? I do not know. But I think it would be fun to be able play around with some functions on one pixel, a vicinity of pixels, the same pixel-position from different layers and the like.

The nice thing is: we can see the result and like it or throw it away. Which function is correct or useful can be discussed and disagreed on. That is more fun than formally proven correctness, assuming of course, that the functions itself are implemented correctly.

I have not looked into the source code of plugins to tell what can reasonably be done without too much effort. But if someone reading this has some ideas, this would be interesting to hear about.

And finally we have one more advantage of GIMP, because it is open source and it is possible to make changes to it.

Share Button

Combining multiple scans

When images are scanned multiple times, maybe there is a way to construct an image that is better than any of the scans from them.

In this case it is assumed, that one scan has a higher resolution, but another scan got the colors better.

It has already been found out, which two scans belong to the same original.

Also one of them has already been rotated to the desired position.

The rotation can easily be found. Since images are roughly rectangular and width is roughly 1.5 times the height, simply two rotations have to be tried. Images are compared pixelwise and the one that has less deviation is probably the right one.

Now the orientation is already the same. And when scaled to the same size, the images should be exactly the same, apart from inaccuracies. But that is not the case. They are scaled slightly differently and they are shifted a bit.

Now the shift and scaling can be expressed by four numbers as x=a*u+b and y=c*v+d, where (x,y) and (u,v) are coordinates of the two images. It could be a matrix, if rotation is also involved and this would work the same way, but professional scans do have shift and scaling problems, but much less rotation problems.

So to estimate a, b, c and d, it is possible to go through all the pixels (u,v) of one image. Now the pixels (x,y) = (u+du, v+dv) are compared for combinations of small du and dv. The result is a sum of squares of differences s in a range between 0 and 3*65536. Now low values are good, that means that the coordiates (u,v,x,y) are a good match and need to be recorded with a high weight. This is achieved by weighting them with (3*65536-s)^2. Now it is just two linear regressions to calculate a,b,c and d. Of course, the points near ues the borders need special care, but it is possible to ommit a bit near the boarder to avoid this issue.

Now the points (u,v) of one image are iterated and the approximate match (x,y) on the other image is found. Now for (r,g,b) there can be a function to transform them. To start, it can be linear again, so this will be three linear regressions for r, g, and b. Or it could involve a matrix multiplied to the input vector. Or some function, then a matrix multiplication then vector addition and then the inverse function. The result has to be constrained to RGB-values between 0 and 255.

Now in the end, for each pixel in one image there is a function that changes the colors more to what the other image has. This can be applied to the full resolution image. Also, a weighted average of the original values and the calculated values, to find the best results. Then this can be applied to the whole series.

A few experiments with the function and a bit research on good functions for this might be done, to improve. In the end of the day there is a solution that is good enough and works. Or maybe a few variants and then some manual work to pick the best.

Share Button

Ranges of Dates and Times

In Software we often deal with ranges of dates and times.

Let us look at it from the perspective of an end user.

When we say something like „from 2020-03-07 to 2019-03-10“ we mean the set of all timestamps t such that

    \[\text{2019-03-07} \le d < \text{2019-03-11}\]

or more accurately:

    \[\text{2019-03-07T00:00:00}+TZ \le d < \text{2019-03-11T00:00:00}+TZ\]

Important is, that we mean to include the whole 24 hour day of 2019-03-10. Btw. please try to get used to the ISO-date even when writing normal human readable texts, it just makes sense…

Now when we are not talking about dates, but about times or instants of time, the interpretation is different.
When we say sonmething like „from 07:00 to 10:00“ or „from 2020-03-10T07:00:00+TZ to 2020-04-11T09:00:00+TZ“, we actually mean the set of all timestamps t such that

    \[givenDate\text{T07:00:00}+TZ \le t < givenDate\text{T10:00:00}+TZ\]


    \[\text{2020-03-10T07:00:00}+TZ \le t < \text{2020-04-11T09:00:00}+TZ,\]

respectively. It is important that we have to add one in case of date only (accuracy to one day) and we do not in case of finer grained date/time information. The question if the upper bound is included or not is not so important in our everyday life, but it proves that commonly the most useful way is not to include the upper bound. If you prefer to have all options, it is a better idea to employ an interval library, i.e. to find one or to write one. But for most cases it is enough to exclude the upper limit. This guarantees disjoint adjacent intervals which is usually what we want. I have seen people write code that adds 23:59:59.999 to a date and compares with \le instead of <, but this is an ugly hack that needs a lot of boiler plate code and a lot of time to understand. Use the exclusive upper limit, because we have it.

Now the requirement is to add one day to the upper limit to get from the human readable form of date-only ranges to something computers can work with. It is a good thing to agree on where this transformation is made. And to do it in such a way that it even behaves correctly on those dates where daylight saving starts or ends, because adding one day might actually mean „23 hours“ or „25 hours“. If we need to be really very accurate, sometimes switch seconds need to be added.

Just another issue has come up here. Local time is much harder than UTC. We need to work with local time on all kinds of user interfaces for humans, with very few exceptions like for pilots, who actually work with UTC. But local date and time is ambiguous for one hour every year and at least a bit special to handle for these two days where daylight saving starts and ends. Convert dates to UTC and work with that internally. And convert them to local date on all kinds of user interfaces, where it makes sense, including documents that are printed or provided as PDFs, for example. When we work with dates without time, we need to add one day to the upper limit and then round it to the nearest some-date\text{T00:00:00}+TZ for our timezone TZ or know when to add 23, 24 or 25 hours, respectively, which we do not want to know, but we need to use modern time libraries like the java.time.XXX stuff in Java, for example.

Working with date and time is hard. It is important to avoid making it harder than it needs to be. Here some recommendations:

  • Try to use UTC for the internal use of the software as much as possible
  • Use local date or time or date and time in all kinds of user interfaces (with few exceptions)
  • add one day to the upper limit and round it to the nearest midnight of local time exactly once in the stack
  • exclude the upper limit in date ranges
  • Use ISO-date formats even in the user interfaces, if possible


Share Button

How to recover the Borrow Bit

In a similar way as the carry bit for addition it is possible to recover the borrow bit for substraction, just based on the highest bits of three numbers that we deal with during the operation.

With this program, a subtraction operation of an 8-bit CPU can be simulated exhaustively


my $x, $x, $bi;

my %tab = ();

for ($bi = 0; $bi <= 1; $bi++) {     for ($x = 0; $x < 256; $x++) {         for ($y = 0; $y < 256; $y++) {             my $zz = $x - $y - $bi;             my $b  = $zz < 0 ? 1 : 0;             my $c  = 1 - $b;             my $z = ($zz + 256) & 0xff;             my $xs = $x >> 7;
            my $ys = $y >> 7;
            my $zs = $z >> 7;
            my $key = "$xs:$ys:$zs";
            $tab{$key} //= $b;
            my $bb = $tab{$key};
            if ($bb != $b) {
                print "b=$b bb=$bb c=$c xs=$xs ys=$ys zs=$zs x=$x y=$y z=$z zz=$zz bi=$bi\n";

for my $key (sort keys %tab) {
    $key =~ m/(\d+):(\d+):(\d+)/;
    $b =$tab{$key};
    $c = 1 - $b;
    $bb = $xs & $ys & $zs | !$xs & ($ys | $zs);
    print "b=$b bb=$bb c=$c xs=$xs ys=$ys zs=$zs\n";

This gives an idea, what is happening. But in real life, probably a 64bit-CPU is used, but the concepts would work with longer or shorter CPU words the same way.

So we subtract two unsigned 64-bit integers x and y and an incoming borrow bit i\in\{0, 1\} to a result

    \[z\equiv x-y-i \mod 2^{64}\]


    \[0 \le z < 2^{64}\]

using the typical „long long“ of C. We assume that



    \[x_h \in \{0,1\}\]


    \[0 \le x_l < 2^{63}.\]

In the same way we assume y=2^{63}y_h + y_l and z=2^{63}z_h + z_l with the same kind of conditions for x_h, y_h, z_h or x_l, y_l, z_l, respectively.

Now we have

    \[-2^{63} \le  x_l-y_l-i \le 2^{63}-1\]

and we can see that

    \[x_l - y_l - i = z_l-2^{63}u\]

for some

    \[u\in \{0,1\}\]

And we have

    \[x-y-i = z-2^{64}b\]



is the borrow bit.
When combining we get

    \[2^{63}x_h - 2^{63}y_h + z_l - 2^{63}u = 2^{63}x_h + x_l -2^{63}y_h-x_l -i = x-y-i = z - 2^{64}b = 2^{63}z_h + z_l -2^{64}b\]

When looking just at the highest visible bit and the borrow bit, this boils down to

    \[z_h-2b = x_h - y_h - u\]

This leaves us with eight cases to observe for the combination of x_h, y_h and u:


Or we can check all eight cases and find that we always have

    \[b = x_h \wedge y_h \wedge z_h \vee \neg x_h \wedge (y_h \vee z_h)\]

So the result does not depend on u anymore, allowing to calculate the borrow bit by temporarily casting x, y and z to (signed long long) and using their sign.
We can express this as „use y_h \wedge z_h if x_h=1 and use y_h \vee z_h if x_h = 0„.

The incoming borrow bit i does not change this, it just allows for x_l - y_l - d \ge -2^{64}, which is sufficient for making the previous calculations work.

The basic operations add, adc, sub, sbb, mul, xdiv (div is not available) have been implemented in this library for C. Feel free to use it according to the license (GPL). Addition and subtraction could be implemented in a similar way in Java, with the weirdness of declaring signed longs and using them as unsigned. For multiplication and division, native code would be needed, because Java lacks 128bit-integers. So the C-implementation is cleaner.

Share Button

Borrow and Carry Bit for Subtraction

Similar to the usage of the carry bit when adding there are mechanisms for subtracting that allow to integrate the result of subtraction of the lower bits into the subtraction of the next higher block of bits, where necessary.

There are two ways to do this, that are trivially equivalent by a simple not operation:

  • borrow bit (also borrow flag)
  • carry bit (also carry flag)

Often CPUs use what is the carry bit for addition and interpret it as borrow bit for subtraction.

Here are the two ways:

Borrow Bit

It is assumed, that the CPU-word is n bits long, so calculations are performed modulo N=2^n. Further it is assumed, that the range 0\ldots N-1 is preferred, so all sign issues are excluded for the moment.

When subtracting two numbers x, y from each other like x-y and y > x, the provided result is

    \[x-y+N \equiv x-y \mod N\]

and the borrow bit is set (b=1), to indicate that the subtraction caused „underflow“, which had to be corrected by added N in order to get into the desired range.

In the „normal“ case where y \le x, the provided result is simply


and the borrow bit is not set (b=0).

The next round of subtraction takes the borrow bit into account and calculates x-y-b, where the condition becomes y+b > x and the result is

    \[x-y-b+N \equiv x-y \mod N\]



respectively. This is how some of the older readers used to do it in school on paper, but of course with N=10.

Now the typical integer arithmetic of current CPUs uses Two’s complement, which means that -y=(\mathrm{NOT}\; y)+1. Combining this with the previous results in calculating

    \[x-y = x + (\mathrm{NOT}\; y) + 1 - b \mod N\]

At this point some CPU-designers found it more natural, to use the carry bit c=1-b instead of the borrow bit b.

Carry Bit

When subtracting two numbers x, y from each other like x-y and we have y > x, the provided result is

    \[x-y+N \equiv x-y \mod N\]

and the carry bit is not set (c=0), to indicate that the subtraction caused „underflow“, which had to be corrected by added N in order to get into the desired range.

In the „normal“ case where y \le x, the provided result is simply


and the carry bit is set (c=1).

The next round of subtraction takes the borrow bit into account and calculates x-y-1+c, where the condition becomes y+1-c > x and the result is

    \[x-y-1+c+N \equiv x-y \mod N\]




Now two’s complement with -y=(\mathrm{NOT}\; y)+1 this can be written as

    \[x-y = x + (\mathrm{NOT}\; y) + 1 - b \mod N\]

or with c=1-b

    \[x-y = x + (\mathrm{NOT}\; y) + c \mod N\]

These two ways are really equivalent and easily transformed into each other. Neither of them provides big advantages, apart from the fact that we have the unnecessary confusion, because it depends on the CPU design, which of the two variants is used.

Recovery of Borrow or Carry bit

The borrow bit is calculated and used during subtractions that are done at assembly language level, but higher languages like C or Java do provide access to this information. It is relatively easy to recover the carry bit in the case of addition based on x, y and x+y \mod N.

This possible as well for the subtraction. Quite easily the comparison between x and y or y+b could be done before the subtraction. This would work, but it is kind of inefficient, because under the hood the comparison is just a subtraction with discarding the result and keeping the flags. So the subtraction is performed twice. This might not be such a bad idea, because a compiler could recognize it or the work of subtracting twice could be neglectable compared to the logic for an „optimized“ recovery, based on some logic expression of certain bits from x, y and x-y \mod N.


Share Button

Rounding to Rational Numbers

Usually we think of rounding as a way of approximately expressing numbers with many decimal places by numbers with fewer decimal places.

Regular readers of this blog have already encountered, that this concept can be extended and generalized, as is mentioned in the article about Geometric and Harmonic Rounding and Residue Class Rounding or Rounding with Sum.
Now computers and humans can deal with rational numbers and sometimes that is the best way.

Idea 1: Read the Double as Rational

Most typical non-integer computer numbers are actually already rational numbers of the form \frac{n}{2^m} with a relatively large power of two in the denominator. But as soon as we perform divisions, we leave the accurate world and round, usually in the way that the machine throws in front of our feet out of the box. But all other floating point operations can result in rounding as well. So rational numbers can be a way to go. So we can just naturally convert a double or float number into a rational number and continue working with that.

Idea 2: Go for human readable fractions

Let us think of human readers. We are kind of comfortable with fractions like \frac{1}{12}, \frac{1}{10}, \frac{1}{8} and multiples of them, so basically every fraction that can be expressed with a denominator of 1, 2, 3, 4, 5, 6, 8, 10 or 12. The 12 is because of the omnipresence of the analog clock, the 10 because it is just one more decimal digit and the 8 because it is about the number of successive halving that we can still easily cope with. The idea would be to round to the nearest such fraction and to prefer the lower denominator when two adjacent values are equally far away. This works out well, because our set of allowed fractions between 0 and 1 is just

    \[\{0, \frac{1}{12}, \frac{1}{10}, \frac{1}{8}, \frac{1}{6}, \frac{1}{5}, \frac{1}{4}, \frac{1}{3}, \frac{3}{8}, \frac{2}{5}, \frac{5}{12}, \frac{1}{2}, \frac{7}{12}, \frac{3}{5}, \frac{5}{8}, \frac{2}{3}, \frac{3}{4}, \frac{4}{5}, \frac{5}{6}, \frac{7}{8}, \frac{9}{10}, \frac{11}{12}, 1\}\]

and it is more or less trivial to program such a rounding algorithm, by just hard coding the limits, normalizing to the interval [0,1) and finding the right slot by binary search, for example. This is what we usually want to make numbers human readable and understandable. If we want more accuracy we can often just use the trick of going to % or or just shifting units by multiples of 1000, depending on what we are measuring or counting. This is often a bit better than just decimal numbers, and we can more often solve the issue of rounding with sum when some of the values we want to round are the same and just won’t come out the same of our rounding procedure.

Idea 3: Use continuous fractions

With continuous fractions it is possible to express any real number in the form

    \[a_0 + \cfrac{1}{a_1 + \cfrac{1}{a_2 + \cfrac{1}{ \ddots + \cfrac{1}{a_n} }}}\]


    \[a_0 + \cfrac{1}{a_1 + \cfrac{1}{a_2 + \cfrac{1}{a_3 + \cfrac{1}{\ddots}}}}\]

For integers this is trivial, just use a_0. For negative numbers we just take the negative of the continuous fraction of the absolute value, so we can assume a non-integral positive number r_0.
This allows determining a_0 by just taking the integer part of it.
Then we continue with r_1 = \frac{1}{r_0 - a_0} and so on. Either we end up with an integer a_n at some point, which is the case for rational numbers or the continuous fraction continues infinitely, which is the case for irrational numbers. Now taking the continuous fraction just up to the first n elements, as in the first form and ignoring what comes after that, can in turn be converted into a rational number, by just doing some trivial rational arithmetic. It turns out, that this is a very good approximation for that size of the denominator. The funny thing is that this works even for irrational numbers under certain conditions. For example could we assume that we are dealing with numbers of the form x+y\sqrt{D} for some fixed rational D and rational numbers x and y. It is relatively easy to approximate x+y\sqrt{D}=x_0+y_0\sqrt{D} with the largest integer a_0 that is less or equal. Now we can calculate

    \[\frac{1}{x_k+y_k\sqrt{D}-a_k}=\frac{x_k-a_k-y_k\sqrt{D}}{(x_k-a_k)^2-y_k^2\cdot D}=x_{k+1}+y_{k+1}\sqrt{D}\]

and thus use this algorithm as another way to calculate rational approximations of square roots. In this case the continuous fraction becomes periodic and there is a surprising lot of interesting mathematics behind it, if you like to dig deeper.

Share Button

How to calculate Square Roots and Cubic Roots

The functions sqrt and sometimes even cbrt are commonly available, but it is nice to see how they can be calculated.

There are several approaches, but the most popular ones are Newton’s method and an algorithmic formulation of how roots are taken manually, for those old enough to still have learned it in school. Earlier measurements that I did many years ago showed that the Newton approximation is slower, but it would be worth to do newer measurements.

So we have an equation y = x^2 or y=x^3 and want to find x or a well defined approximation of x when we know y. Mathematically speaking we want to assume that y is constant and we want to find an x for which f(x)=x^2-y=0 or g(x)=x^3-y=0. If we guess such an x and then draw the tangent at the curve of the function at the point (x, f(x)) or (x, g(x)), then the intersection point of the tangent can be used as the next approximation. This method converges in the case of these two functions (and some others) and is reasonably fast. Now the tangent has the linear equation


where y_0=f(x_0) and f'(x)=\frac{df(x)}{dx} is the derivative of f(x). We want to solve this equation for y=0 and thus we get

    \[x-x_0 = -\frac{f(x_0)}{f'(x_0)}\]

and thus


As an iteration rule

    \[\bigwedge_{i=0}^\infty x_{i+1}=x_i-\frac{f(x_i)}{f'(x_i)}\]

In case of the sqare root we can just start with an estimation by shifting half the length to the right, but avoiding zero, which is important because of the division. Then we get for an appropriate n

    \[x_0 = 2^{-n}y\]


The last form is quite intuitive, even without calculus. As I said this converges usefully fast and there is tons of math around to describe the behavior, speed, precision and convergence of the calculations performed in this algorithm. Written in Ruby just for integers, this is quite simple. Convergence is simply discovered by the fact that the result does not change any more, which may fail in some cases, where intermediate results oscillate between two values, but just for the purpose of benchmarking it seems to be sufficient:

def sqrt_newton(x)
  if (x == 0) then
    return 0
  y0 = x
  u0 = x
  while (u0 > 0) do
    y0 >>= 1
    u0 >>= 2
  y0 = [1, y0].max
  yi = y0
  yi_minus_1 = -1
  loop do
    yi_plus_1 = (yi + x/yi) >> 1;
    if (yi_minus_1 == yi_plus_1) then
      return [yi, yi_minus_1].min
    elsif (yi == yi_plus_1) then
      return yi
    yi_minus_1 = yi
    yi = yi_plus_1

The newton algorithm tends to oscillate between two approximations, so this termination criteria takes into account y_{i-1}, y_i and y_{i+1} and uses the lower of the two oscillating values. This results in calculating the largest integer y such that y^2\le x and (y+1)^2 > x.

For the third root we get for an appropriate n

    \[x_0 = 2^{-n}y\]


Again this is a useful way and there is math around on when to stop for a desired precision.
Read Wikipedia for the convergence issues.

There is another approach, that people used to know when doing calculations on paper was more important than today. For the decimal system it works like this:

1. 7 3 2 0 5
/ 3.00 00 00 00 00
/\/ 1 = 20*0*1+1^2
2 00
1 89 = 20*1*7+7^2
11 00
10 29 = 20*17*3+3^2
71 00
69 24 = 20*173*2+2^2
1 76 00
0 = 20*1732*0+0^2
1 76 00 00
1 73 20 25 = 20*17320*5+5^2
2 79 75
(source Wikipedia)
We group the digits to the left and to the right of the decimal point in groups of two. The highest possible square of an integral number that is below or equal to the leftmost group (03 in the example above) is used for the first digit of the result (1 in the example above). This square is subtracted and the next group is appended (200 in the example). Assuming that y_n is the result already calculated and x_n is what we have achieved after the subtraction and the appending of the next group, we search for a digit z_n such that u_n = 20\cdot y_n\cdot z_n + z_n^2 \le x_n. z_n is chosen in such a way that it yields the maximum possible u_n wich is still \le x_n. Subtracting u_n from x_n and appending the next group allows for the next iteration.

Now this can be turned into an algorithm. The first approach is to just switch from decimal system to binary system. Then for each iteration step we have to deal just with the possible values of 0 and 1, which greatly simplifies the algorithm. Here is a simple ruby program that would do this:

def split_to_words(x, word_len)
  bit_pattern = (1 << word_len) - 1   words = []   while (x != 0 || words.length == 0) do     w = x & bit_pattern     x = x >> word_len

def sqrt_bin(x)
  if (x == 0) then
    return 0
  xwords = split_to_words(x, 2)
  xi = xwords[0] - 1
  yi = 1
  1.upto(xwords.length-1) do |i|
    xi = (xi << 2) + xwords[i]     d0 = (yi << 2) + 1     r  = xi - d0     b  = 0     if (r >= 0) then
      b  = 1
      xi = r
    yi = (yi << 1) + b   end   return yi end

It seems that the two solutions yield the same results, but the sqrt_newton outperforms sqrt_bin by a factor of two.

Now we should reconsider, if base 2 is really the best choice. Actually we can use any power of 2 as a base and efficiently work with that. Apart from the initial first step, which is done by using an extended version of sqrt_bin, the next steps are estimated by division and trying neighboring values to get the exact result. This makes use of the fact that the equation we need to solve
u_n = 2\cdot b\cdot y_n\cdot z_n + z_n^2 \le x_n with the maximum z_n fullfilling this equation, where b is the base to which we are working, witch was 10 or 2 above and could now be a power of 2. As soon as y_n\cdot b has a certain size, the influence of z_n^2 becomes less relevant. We can consider the maximum posible value for z_n, which is b-1 and thus solve 2\cdot b\cdot y_n\cdot z_n\le x_n and 2\cdot b\cdot y_n\cdot z_n\le x_n-(b-1)^2, each for the maximum z_n fullfilling the equation. This can be calculated by simple division. If the range between the two solutions is small enough, then each value in the range can be tried to find the actual accurate solution for 2\cdot b\cdot y_n\cdot z_n + z_n^2 \le x_n and this is more efficient than working just bitwise. This method sqrt_word seems to outperform sqrt_newton for longer numbers, for example around 60 decimal digits with word_length=16. So the most promising approach seems to be to optimize the implementation and parameters of sqrt_word. The issue of termination, which has been properly addressed in the newton implementation, is already dealt with in this implementation. For more serious analysis it would be interesting to implement the algorithms in C or even in assembly language. So this is the final result for square roots, with some checks added:

def check_is_nonneg_int(x, name)
  raise TypeError, "#{name}=#{x.inspect} must be Integer" unless (x.kind_of? Integer) && x >= 0

def check_word_len(word_len, name="word_len")
  unless ((word_len.kind_of? Integer) && word_len > 0 && word_len <= 1024)
    raise TypeError, "#{name} must be a positive number <= 1024"

def split_to_words(x, word_len)
  check_is_nonneg_int(x, "x")
  bit_pattern = (1 << word_len) - 1
  words = []
  while (x != 0 || words.length == 0) do
    w = x & bit_pattern
    x = x >> word_len

def sqrt_bin(x)
  yy = sqrt_bin_with_remainder(x)

def sqrt_bin_with_remainder(x)
  check_is_nonneg_int(x, "x")
  if (x == 0) then
    return [0, 0]

  xwords = split_to_words(x, 2)
  xi = xwords[0] - 1
  yi = 1

  1.upto(xwords.length-1) do |i|
    xi = (xi << 2) + xwords[i]
    d0 = (yi << 2) + 1
    r  = xi - d0
    b  = 0
    if (r >= 0) then
      b  = 1
      xi = r
    yi = (yi << 1) + b
  return [yi, xi]

def sqrt_word(x, n = 16)
  check_is_nonneg_int(x, "x")
  check_is_nonneg_int(n, "n")

  n2 = n << 1
  n1 = n+1
  check_word_len(n2, "2*n")
  if (x == 0) then
    return 0

  xwords = split_to_words(x, n2)
  if (xwords.length == 1) then
    return sqrt_bin(xwords[0])

  xi = (xwords[0] << n2) + xwords[1]
  a  = sqrt_bin_with_remainder(xi)
  yi = a[0]
  if (xwords.length <= 2) then
    return yi

  xi = a[1]
  2.upto(xwords.length-1) do |i|
    xi = (xi << n2) + xwords[i]
    d0 = (yi << n1)
    q  = (xi / d0).to_i
    j  = 10
    was_negative = false
    while (true) do
      d = d0 + q
      r = xi - (q * d)
      break if (0 <= r && (r < d || was_negative))
      if (r < 0) then
        was_negative = true
        q = q-1
        q = q+1
      j -= 1
      if (j <= 0) then
    xi = r
    yi = (yi << n) + q
  return yi

def sqrt_newton(x)
  check_is_nonneg_int(x, "x")
  if (x == 0) then
    return 0
  y0 = x
  u0 = x
  while (u0 > 0) do
    y0 >>= 1
    u0 >>= 2
  y0 = [1, y0].max
  yi = y0
  yi_minus_1 = -1
  loop do
    yi_plus_1 = (yi + x/yi) >> 1;
    if (yi_minus_1 == yi_plus_1) then
      return [yi, yi_minus_1].min
    elsif (yi == yi_plus_1) then
      return yi
    yi_minus_1 = yi
    yi = yi_plus_1

This is the approach that has been built into the LongDecimal library, ignoring Newton. The examples have been added to github.

The algorithms can be extended to cubic roots or any higher roots. In this case, the nth root of \sum_{j=0}^{m}a_j b^{n(m-j)} is calculated by starting with the maximal integral number z_0 with z_0^n \le a_0 and the subsequently finding numbers z_j fullfilling an equation of the form {n}\choose{k}\sum_{k=1}^n (by_j)^{n-k}z_j^k \le x_i. This is always easy to handle for base two, by just testing the two possible solutions. For higher bases and n=3 it involves solving an quadratic equation, once the numbers are high enough to neglect the term z_j^3. For n=4 it is just possible to take the square root of the square root. For higher values of n and bases other than 2 it becomes really difficult to tame this algorithm. So I intend to constrain myself to square roots and cube roots. I have not explored, if it is useful to calculate the cube root with a higher base than 2 and which approach provides the best performance for cube roots. Even the square root calculation can possibly be tuned a bit. Maybe this will be addressed in another article.

Share Button

How to draw lines, circles and other curves

These ideas were developed more than 30 years without knowing that they were already known at that time…

Today the graphics cards can easily do things like this in very little time. And today’s CPUs are of course really good at multiplying. So this has lost a lot of its immediate relevance, but it is a fun topic and why not have some fun…

Let us assume we have a two dimensional coordinate system and a visible area that goes from x_{\min} to x_{\max} and y_{\min} to y_{\max}. Coordinates are discrete.

In this world we can easily measure an angle against a (directed) line parallel to the x-axis, for example up to an accuracy of 45^\circ=\frac{\pi}{4}:

  • y=0 \wedge x > 0 \implies \alpha = 0 (= 0^\circ)
  • 0 < y < x \implies 0 < \alpha < \frac{\pi}{4}(=45^\circ)
  • 0 < y = x \implies \alpha = \frac{\pi}{4}
  • 0 < x < y \implies \frac{\pi}{4} < \alpha < \frac{\pi}{2}(=90^\circ)
  • x = 0 \land y > 0\implies \alpha = \frac{\pi}{2}
  • x < 0 \land y > 0 \land |x| < |y|\implies \frac{\pi}{2} < \alpha < \frac{3\pi}{4}(=135^\circ)
  • x < 0 \land y > 0 \land -x = y\implies \alpha = \frac{3\pi}{4}(=135^\circ)

So let us assume we have a curve that is described by a polynomial function in two variables x and y, like this:

    \[f(x, y) = \sum_{j=0}^m\sum_{k=0}^n a_{j,k}x^jy^k = 0\]

We have to apply some math to understand that the curve behaves nicely in the sense that it does not behave to chaotic in scales that are below our accuracy, that it is connected etc. We might possibly scale and move it a bit by substituting something like c_1u+c_2 for x and c_3v+c_4 for y.

For example we may think of

  • line: f(x,y)=ax+by+c
  • circle: f(x, y)=x^2+y^2-r^2
  • eclipse: f(x, y)=\frac{x^2}{a^2}+\frac{y^2}{b^2}-1

We can assume our drawing is done with something like a king of chess. We need to find a starting point that is accurately on the curve or at least as accurately as possible. You could use knights or other chess figures or even fictive chess figures..

Now we have a starting point (x_0, y_0) which lies ideally exactly on the curve. We have a deviation from the curve, which is f(x_0, y_0)=d_0. So we have f(x_n, y_n)=d_n. Than we move to x_{n+1}=x_n + s and y_{n+1}=y_n+t with s, t = \{-1, 0, 1\}. Often only two or three combinations of (s, t) need to be considered. When calculating d_{n+1} from d_n for the different variants, it shows that for calculating d_{n+1}-d_n the difference becomes a polynomial with lower degree, because the highest terms cancel out. So drawing a line between two points or a circle with a given radius around a given point or an ellipse or a parabola or a hyperbola can be drawn without any multiplications… And powers of n-th powers of x can always be calculated with additions and subtractions only from the previous x-values, by using successive differences:
These become constant for l=n, just as the lth derivatives, so by using this triangle, successive powers can be calculated with some preparational work using just additions.
It was quite natural to program these in assembly language, even in 8-bit assembly languages that are primitive by today’s standards. And it was possible to draw such figures reasonably fast with only one MHz (yes, not GHz).

We don’t need this stuff any more. Usually the graphics card is much better than anything we can with reasonable effort program. Usually the performance is sufficient when we just program in high level languages and use standard libraries.

But occasionally situations occur where we need to think about how to get the performance we need:
Make it work,
make it right,
make it fast,
but don’t stop after the first of those.

It is important that we choose our steps wisely and use adequate methods to solve our problem. Please understand this article as a fun issue about how we could write software some decades ago, but also as an inspiration to actually look into bits and bytes when it is really helping to get the necessary performance without defeating the maintainability of the software.

Share Button


Intervals are subsets of a universe, that are defined by upper and lower boundaries. Typically we think about real numbers, but any totally ordered universe allows the definition of intervals.

Intervals are defined by lower and upper boundaries, which can be a limiting number or unlimited, typically written as \infty for the upper bound and -\infty for the lower bound. The boundaries can be included or excluded. So the following combinations exist for a universe X:

(-\infty, \infty)=X
(-\infty, a] = \{ x \in X : x \le a\}
half open, lower unlimited
(-\infty, a) = \{ x \in X : x < a\}
open, lower unlimited
[b, \infty) = \{ x \in X : x \ge b\}
half open, upper unlimited
(b, \infty) = \{ x \in X : x > b\}
open, upper unlimited
(c, d) = \{ x \in X : c < x < d\}
[c, d) = \{ x \in X : c \le x < d\}
half open
(c, d] = \{ x \in X : c < x \le d\}
half open
[c, d] = \{ x \in X : c \le x \le d\}
it is sometimes useful to consider the empty set as an interval as well

The words „open“ and „closed“ refer to our usual topology of real numbers, but they do not necessarily retain their topological meaning when we extend the concept to our typical data types. a, b, c and d in the notation above do not have to be members of X, as long as the comparison is defined between them and all members of X. So we could for example meaningfully define for X=\mathbb{Q} the interval (\sqrt{2}, \pi) = \{ x\in\mathbb{Q} : \sqrt{2} < x < \pi\}.

As soon as we do not imply X=\mathbb{R} we always have to make this clear… And \mathbb{R} is kind of hard to really work with in software on computers with physically limited memory and CPU power.

Intervals have some relevance in software systems.

We sometimes have a business logic that actually relies on them and instead programming somehow around it, it is clearer and cleaner to actually work with intervals. For example, we can have a public transport scheduling system and we deal with certain time intervals in which different schedules apply than during the rest of the day. Or we have a system that records downtimes of servers and services and these are quite naturally expressed as intervals of some date-time datatype. It is usually healthy to consider all the cases mentioned above rather than ignoring the question if the boundary with probability zero of actually happening or having ugly interval limits like 22:59:59.999.

The other case is interval arithmetic. This means, we do floating point calculations by taking into account that we have an inaccuracy. So instead of numbers we have intervals I. When we add two intervals, we get I+J=\{x+y : x\in I \wedge y\in J\}. In the same way we can multiply and subtract and even divide, as long as we can stay clear of zero in the denominator. Or more generally we can define f(I_1, I_2, \ldots, I_n)=\{f(x_1,x_2,\ldots,x_n) : x_1\in I_1 \wedge x_2 \in I_2 \wedge\ldots\wedge x_n\in I_n\}.
It does of course require some mathematical thinking to understand, if the result is an interval again or at least something we can deal with reasonably. Actually we are usually happy with replacing the result by an interval that is possibly a superset of the real result, ideally the minimal superset that can be expressed with our boundary type.

At this point we will probably discover a desire to expand the concept of intervals in a meaningful way to complex numbers. We can do this by working with open disks like \{ z \in \mathbb{C} :  |z-z_0| < d\} or closed disks like \{ z \in \mathbb{C} :  |z-z_0| \le d\}. Or with rectangles based on two intervals I and J like \{ z = x+iy \in \mathbb{C} : x \in I \wedge y \in J \}.

These two areas are quite interesting and sometimes useful. Libraries have been written for both of them.

Often we discover, that intervals alone are not quite enough. We would like to do set operations with intervals, that is

A\cup B
A \cap B
set difference
A \setminus B

While the intersection works just fine, as long as we include the empty set \emptyset as an interval, unions and differences lead us to non-intervals. It turns out that interval-unions, sets that can be expressed as a union of a finite number of intervals, turn out to be a useful generalization, that is actually what we want to work with rather than with intervals. In this case we can drop the empty set as interval and just express it as the union of zero intervals.

There are some questions coming up, that are interesting to deal with:

Can we normalize interval-unions to some canonical form that allows safe and relyable comparison for equality?
is our universe X actually discrete, so we can express all unlimited boundaries with closed boundaries?
interval lengths
Do we have a meaningful and useful way to measure the length of an interval or the total length of an interval-union, as long as they are limited? Or even for unlimited intervals?
collection interfaces
Do we want to implement a Set-interface in languages that have sets and an understanding of sets that would fit for intervals
How can we implement this ourselves?
Can we find useful implementations?

Having written a java library to support interval-unions on arbitrary Comparable types once in a project and having heard a speech about an interval library in Scala that ended up in using interval-unions in a pretty equivalent way, it might be interesting to write in the future about how to do this or what can be found in different languages to support us. For interval arithmetic some work has been done to create extensions or libraries for C and Fortran, that support this, while I was a student. So this is pretty old stuff and interesting mostly for the concepts, even if we are not going to move to Fortran because of this.

If there is interest I will write more about actual implementations and issues to address when using or writing them.


Share Button

Carry Bit, Overflow Bit and Signed Integers

It has already been explained how the Carry Bit works for addition. Now there was interest in a comment about how it would work for negative numbers.

The point is, that the calculation of the carry bit does not have any dependency on the sign. The nature of the carry bit is that it is meant to be used for the less significant parts of the addition. So assuming we add two numbers x and y that are having k and l words, respectively. We assume that n=\max(k,l) and make sure that x and y are both n words long by just providing the necessary number of 0-words in the most significant positions. Now the addition is performed as described by starting with a carry bit of 0 and adding with carry x[0]+y[0], then x[1]+y[1] and so on up to x[n-1]+y[n-1], assuming that x[0] is the least significant word and x[n-1] the most significant word, respectively. Each addition includes the carry bit from the previous addition. Up to this point, it does not make any difference, if the numbers are signed or not.

Now for the last addition, we need to consider the question, if our result still fits in n words or if we need one more word. In the case of unsigned numbers we just look at the last carry bit. If it is 1, we just add one more word in the most significant position with the value of 1, otherwise we are already done with n words.

In case of signed integers, we should investigate what can possibly happen. The input for the last step is two signed words and possibly a carry bit from the previous addition. Assuming we have m-Bit-words, we are adding numbers between -2^{m-1} and 2^{m-1}-1 plus an optional carry bit c. If the numbers have different signs, actually an overflow cannot occur and we can be sure that the final result fits in at most n words.

If both are not-negative, the most significant bits of x[n-1] and y[n-1] are both 0. An overflow is happening, if and only if the sum x[n-1]+y[n-1]+c \ge 2^{n-1}, which means that the result „looks negative“, although both summands were not-negative. In this case another word with value 0 has to be provided for the most significant position n to express that the result is \ge 0 while maintaining its already correctly calculated result. It cannot happen that real non-zero bits are going into this new most significant word. Consequently the carry bit can never become 1 in this last addition step.

If both are negative, the most significant bits of x[n-1] and y[n-1] are both 1. An overflow is happening, if and only if the sum x[n-1]+y[n-1]+c \lt 2^{n-1}, which means that the result „looks positive or 0“, although both summands were negative. In this case another word with value 2^n-1 or -1, depending on the viewpoint, has to be prepended as new most significant word. In this case of two negative summands the carry bit is always 1.

Now typical microprocessors provide an overflow flag (called „O“ or more often „V“) to deal with this. So the final addition can be left as it is in n words, if the overflow bit is 0. If it is 1, we have to signal an overflow or we can just provided one more word. Depending on the carry flag it is 0 for C=0 or all bits 1 (2^n-1 or -1, depending on the view point) for C=1.

The overflow flag can be calculated by o := \mathrm{signbit}(x) = \mathrm{signbit}(y) \land \mathrm{signbit}(x+y\mod 2^n) \ne \mathrm{signbit}(x).
There are other ways, but they lead to the same results with approximately the same or more effort.

The following table shows the possible combinations and examples for 8-Bit arithmetic and n=1:

x<0 or x≥0y<0 or y≥ 0(x+y)%2^8 < 0 or ≥ 0Overflow BitCarry Bitadditional word neededvalue additional wordExamples (8bit)
x≥0y<0≥000 or 1no-65+(-1)
x≥0y<0<000 or 1no-7+(-8)
x<0y≥0≥000 or 1no--9 + 12
-1 + 127
x<0y≥0<000 or 1no--128+127
-1 + 0
x<0y<0≥011yes-1-64 + (-65)
x<0y<0<001no--1 + (-1)
-1 + (-127)
-64 + (-64)

If you like, you can try out examples that include the carry bit and see that the concepts still work out as described.


Share Button