How to multiply long numbers

Assume we already have multiplication of n-bit numbers available. Unlike typical compiled languages, which assume that the result of a multiplication fits into the same size as the two factors, we should take a closer look. Unsigned n-bit numbers as factors x and y fulfill

    \[0 \le x \le 2^n-1\]

    \[0 \le y \le 2^n-1\]

So we have for the product z=x*y

    \[0 \le z \le (2^n-1)^2 \le 2^{2n}-1\]

So we should be prepared for a result 2n bits long. Assembly language provides for this. In Java we are somewhat out of luck, because the 64-bit long is the largest primitive type, so we cannot make use of the 64-bit capabilities of our CPU by getting a 128-bit result. Even worse, we do not have unsigned primitive types. For addition and subtraction it is possible to cheat by treating the longs as unsigned, because addition is almost the same for signed and unsigned numbers, but for multiplication this is slightly more complex.

In C we have better chances, but we need to become compiler- and OS-specific to some extent. gcc on 64-bit platforms has the type __uint128_t, and we can hope that something like this

__uint128_t mul(unsigned long long x, unsigned long long y) {
    __uint128_t xx = (__uint128_t) x;
    __uint128_t yy = (__uint128_t) y;
    __uint128_t z = xx * yy;
    return z;
}
is optimized by the compiler to one assembly language instruction that multiplies two unsigned 64-bit-integers into one unsigned 128-bit integer.

Multiplication of longer numbers is achieved by writing the factors as

    \[ x= \sum_{j=0}^{m-1}2^{nj}x_j, y= \sum_{j=0}^{m-1}2^{nj}y_j\]


    \[ \bigwedge_{j=0}^{m-1} 0\le x_j, y_j <2^n.\]

Now the product is

    \[ z = \sum_{j=0}^{m-1}\sum_{k=0}^{m-1}2^{n(j+k)}x_j y_k \]

where each x_j y_k summand needs to be split into a lower and an upper part such that

    \[x_j y_k = 2^n h(x_j y_k) + l(x_j y_k).\]

All these half products need to be added while propagating the carry bits to the next higher position.
This can be organized as a long sequence of additions with carry, finally giving the desired result.

For reasonably short numbers this is a good approach, if it is done well, but it can be replaced by better algorithms for larger numbers.

The most obvious next step is to move to the Karatsuba algorithm.
Assuming we have two factors x = x_l + 2^m x_h and y = y_l + 2^m y_h, the naïve approach of calculating the four products
x_h y_h, x_h y_l, x_l y_h, x_l y_l can be replaced by calculating only three products x_h y_h, (x_l + x_h)(y_l + y_h), x_l y_l, because what is actually needed is the sum x_h y_l + x_l y_h, which can be obtained by

    \[(x_l + x_h)(y_l+ y_h)-x_l y_l-x_h y_h=x_h y_l+x_l y_h.\]

That can be iterated by successively subdividing the numbers into halves until a size is reached that can be handled well enough by the naïve approach.
An issue that needs to be observed is that x_l + x_h and y_l + y_h may need one bit more than their summands.

For even larger numbers the so-called Toom-Cook algorithm, an extension of the idea of the Karatsuba algorithm to more than two summands, can be useful, and for numbers larger than that the Schönhage-Strassen multiplication proves to be faster. It is implemented in some libraries for long integer arithmetic, like GMP. Since around 2007 this can be improved by using the Fürer algorithm, which is asymptotically faster, but I do not know of any useful implementations, and for practical purposes the improvement does not seem to be very relevant.


How to recover the Carry Bit

As frequent readers might have observed, I like the concept of the carry bit, as it allows for efficient implementations of long integer arithmetic, which I would like to use as the default integer type for most application development. Unfortunately such facilities are not available in high-level languages like C and Java. But it is possible to recover the carry bit from what is available in C or Java, at some extra cost in performance, which is maybe negligible if the compiler does a good optimization on this. We assume gcc on 64-bit Linux. It should be possible to do similar things on other platforms.

So we add two unsigned 64-bit integers x and y to a result

    \[z\equiv x+y \mod 2^{64}\]


    \[0 \le z < 2^{64}\]

using the typical „long long“ of C. We assume that

    \[x = 2^{63}x_h + x_l\]

with

    \[x_h \in \{0,1\}\]

and

    \[0 \le x_l < 2^{63}.\]

In the same way we assume y=2^{63}y_h + y_l and z=2^{63}z_h + z_l with the same kind of conditions for y_h, z_h and y_l, z_l, respectively.

Now we have

    \[0 \le x_l+y_l \le 2^{64}-2\]

and we can see that

    \[x_l + y_l = 2^{63}u + z_l\]

for some

    \[u\in \{0,1\}.\]

And we have

    \[x+y = 2^{64}c+z\]

where

    \[c \in \{0,1\}\]

is the carry bit.
When looking just at the highest visible bit and the carry bit, this boils down to

    \[2c+z_h = x_h + y_h + u\]

This leaves us with eight cases to observe for the combination of x_h, y_h and u:

    x_h  y_h  u  |  c  z_h
     0    0   0  |  0   0
     0    0   1  |  0   1
     0    1   0  |  0   1
     0    1   1  |  1   0
     1    0   0  |  0   1
     1    0   1  |  1   0
     1    1   0  |  1   0
     1    1   1  |  1   1
Checking all eight cases, we find that we always have

    \[c = x_h \wedge\neg z_h \vee y_h \wedge\neg z_h \vee x_h \wedge y_h \wedge z_h\]

or equivalently

    \[c = (x_h \vee y_h) \wedge\neg z_h \vee x_h \wedge y_h \wedge z_h.\]

So the result does not depend on u anymore, which allows us to calculate it by temporarily casting x, y and z to (signed long long) and using their signs.
We can express this as „use x_h \wedge y_h if z_h=1 and use x_h \vee y_h if z_h=0“.

An incoming carry bit d does not change this, it just allows for x_l + y_l + d < 2^{64}, which is sufficient for making the previous calculations work.

In a similar way subtraction can be dealt with.

The basic operations add, adc, sub, sbb, mul, xdiv (div is not available) have been implemented in this library for C. Feel free to use it according to the license (GPL). Addition and subtraction could be implemented in a similar way in Java, with the weirdness of declaring signed longs and using them as unsigned. For multiplication and division, native code would be needed, because Java lacks 128-bit integers. So the C implementation is cleaner.



When using the arguably stupid American date format, today’s date looks like „03/14/15“ instead of 2015-03-14, which are the first few digits of \pi.

Now we all know how to calculate \pi: we just use the Taylor series of \arcsin or \arctan, which unfortunately does not converge too well when used naïvely. But there are tricks to make it converge by calculating \arctan(x) or \arcsin(x) for some x with 0 < x < \frac{1}{2}, resulting in a fraction of \pi that can easily be multiplied to get \pi from it. This approach is quite OK and understandable with a relatively basic mathematical background, when allowing for some inaccuracy and intuition in understanding the Taylor theorem. But we can actually do better. \pi is somehow weird, for being so irrational and even non-algebraic, and also in some other aspects, when investigating it.

So here are some cool approaches to calculating \pi that are efficient and beautiful, but require a little more advanced math or trust in the ones who provided it (which should be OK for most people, but a no-go for mathematicians):


Use the arithmetic-geometric mean, also mentioned briefly in Geometric and Harmonic Rounding:

The arithmetic-geometric mean is included in long-decimal, and long-decimal uses the AGM-method for calculating pi.

Elliptic Integrals

Elliptic integrals are somewhat weird, because they are easy to write down, but not really easy to solve. But numerical approximations to their calculation can also help finding the value of \pi to some desired precision. Since elliptic functions, elliptic curves and elliptic integrals have fascinated mathematicians (and, at least in the case of elliptic curves over finite fields, also computer scientists) a lot, this must be cool.


If you just want to calculate 10000 digits on your computer (should be Linux or something with Ruby preinstalled), just type:

gem install long-decimal
ruby -e 'require "long-decimal";puts(LongMath.pi(10000))'

That should give you some bedtime reading for the \pi-day. Enjoy it.
And please, do not memorize \pi to 10000 digits. It is possible, but there is stuff more worthwhile memorizing.

And please, do use a decent date format, even if you are American.


Geometric and Harmonic Rounding

Most of our rounding methods are based on arithmetic rounding.
Please refer to the nomenclature described in Residue Class Rounding. We can base our rounding on a metric, and the most obvious choice is the absolute value of the difference of the two parameters, d(x,y)=|x-y|. That is what everybody does, and nothing else really comes to mind when we are talking about rounding. But this approach is based on the additive structure of our field or ring containing the numbers that we want to round. Another way of looking at this is that (in case we are talking of real numbers or some subset thereof) the choice of the rounded value is mostly based on seeing on which side of a boundary between the two candidates for the rounded value the original value lies, with special handling if it is exactly equal to this boundary. Now this boundary is the arithmetic mean of the two candidates, therefore we call this arithmetic rounding. Just to give you an example:

We look at the „middle“ between two allowed numbers, for example usually something like 2.5 when rounding a number between 2 and 3 to integers and automatically assume that above that middle we round up, below we round down and exactly on the boundary we need additional rules. This middle is the arithmetic mean of the two neighboring allowed numbers after rounding and hence we can call this arithmetic rounding. All numbers x with 2 \le x < 2.5 are rounded to two, all numbers x with 2.5 < x \le 3 are rounded to three.


We assume M=\Bbb R and N=\Bbb Z, so each real number should be rounded to an integral number. Also we assume d(x,y)=|x-y|.

Now assume we have a non-integral number x \in \Bbb R and an integral number n\in \Bbb Z such that n < x < n+1. For convenience we write n'=n+1. Then we round x to n if x < n+0.5 and to n' if x > n+0.5, so n+0.5 = \frac{n+n'}{2} serves as boundary. Now there are many other ways to define a mean between two numbers n and n'. Some of the most common are

  • the arithmetic mean a(n,n')=\frac{n+n'}{2}
  • the geometric mean g(n,n')=\sqrt{nn'}
  • the harmonic mean h(n,n')=\frac{2nn'}{n+n'}
  • the quadratic mean q(n,n')=\sqrt{\frac{n^2+n'^2}{2}}
  • the cubic mean c(n,n')=\sqrt[3]{\frac{n^3+n'^3}{2}}
  • the power mean m_p(n,n')=\left(\frac{n^p+n'^p}{2}\right)^{\frac{1}{p}}, with m_0 defined as the limit for p\to 0, which turns out to be the geometric mean.

We could also define an arithmetic-harmonic mean in a similar way, but that is just the geometric mean. It is an interesting way to approach the geometric mean of matrices, which is somewhat hard to define otherwise because of the non-commutative multiplication.

For the users of spreadsheets, like those included in LibreOffice, OpenOffice or MS-Office, you can find the arithmetic mean as the function average, the geometric mean as geomean and the harmonic mean as harmean.
Other means can be defined along these lines. And we can see that:

  • a(n,n')=m_1(n,n')
  • g(n,n')=m_0(n,n')
  • h(n,n')=m_{-1}(n,n')
  • q(n,n')=m_2(n,n')
  • c(n,n')=m_3(n,n')
  • \min(n,n')=m_{-\infty}(n,n')
  • \max(n,n')=m_\infty(n,n')

Here is a visual explanation of arithmetic, geometric and harmonic mean:


The line x from K to L has the harmonic mean of a and c as a length. The geometric mean would be reached by moving that line to a position that the upper and lower part have the same shape, apart from scaling. And the arithmetic mean would be reached by moving K and L to the middle points of the lines AD and BC, respectively.

The ruby-gem long-decimal supports all of these methods for calculating a mean.

Using means to define rounding methods

Just to give you an idea how this can be extended further, the geometric and harmonic means are now considered.
As can be seen, negative and zero values for n and n' cause some trouble or require some care. Actually we can deal with this by splitting our unrounded and rounded worlds into separate parts; here we take the numbers < 0, the set \{0\} and the numbers > 0 as three different sets and do the rounding within them:

    \[M=M_1\cup M_2\cup M_3\]


    \[\bigwedge_{i\ne j}M_i\cap M_j = \emptyset\]

where we define

    \[M_1=\{m\in M : m < 0\}\]

    \[M_2=\{0\}\]

    \[M_3=\{m\in M : m > 0\}.\]

In a similar way we define N_1, N_2 and N_3 and define metrics d_1, d_2, d_3 separately. Then we just look at one partition i and consider rounding to be a mapping

    \[r_i: M_i\rightarrow N_i, x\mapsto r_i(x),\]

such that d_i(x, r_i(x)) is minimal.

The obvious assumption of rounding 0 to 0 relieves us from dealing with logarithms of zero or division by zero. We see that any number between 0 and 1 is rounded to 1 for both harmonic and geometric rounding, because the boundary is 0, if we pick the right variant of the formula for calculating the mean. Likewise, between -1 and 0 everything is rounded to -1 in our example. Otherwise the calculation of the boundary can be made to work, because then both factors are negative, so we take the square root of a positive product, but we need to replace the square root by its negative counterpart.

So for geometric rounding we define that x is rounded to n if x < g(n,n'), rounded to n' if x > g(n,n') and rounding for x=g(n,n') has to be defined by additional rules. But we define g(n,n') for negative values of n and n' to be -\sqrt{nn'}. This can also be expressed in terms of a metric:

    \[d_1(x,y)=|\log(-x)-\log(-y)|\;\; (x, y < 0),\]

    \[d_2(x,y)=0 \;\;(x=y=0)\]


    \[d_3(x,y)=|\log(x)-\log(y)| \;\; (x,y > 0).\]

Harmonic rounding is definable in a similar way:
We define that x is rounded to n if x < h(n,n'), rounded to n' if x > h(n,n') and rounding for x=h(n,n') has to be defined by additional rules. This can also be expressed in terms of a metric:

    \[d_1(x,y)=|\frac{1}{x}-\frac{1}{y}|\;\; (x, y < 0),\]

    \[d_2(x,y)=0 \;\;(x=y=0)\]


    \[d_3(x,y)=|\frac{1}{x}-\frac{1}{y}| \;\; (x,y > 0).\]

Quadratic and Cubic rounding can be defined in a similar way. It would even be possible to go for arithmetic-geometric rounding or harmonic-geometric rounding, but that would imply having to do an approximative iterative calculation of the boundary each time it is needed, while the rounding modes described above can still be achieved with relatively simple multiplication and addition operations.

In any case it is necessary to define the rule how to round numbers that lie exactly on the boundary, unless we can prove that the boundary is irrational.

That is the case for geometric, quadratic and cubic rounding, because if we have adjacent integers n and n+1, then we see that g(n, n+1)=\sqrt{n(n+1)} is irrational for n>0, because n and n+1 cannot both be squares, so there is one prime number that divides either n or n+1 with an odd multiplicity, which already ensures that the square root is irrational. For the quadratic mean we have q(n,n+1)=\sqrt{\frac{n^2+(n+1)^2}{2}}, where the numerator of the fraction is odd, so two occurs with an odd multiplicity, which makes it irrational again. A similar argument applies for the cubic mean. For the harmonic mean we have h(n, n+1)=\frac{2n(n+1)}{2n+1}, where 2n+1 can be a power of 5, in which case we have to deal with the boundary issue.

The ruby-gem long-decimal supports geometric, harmonic, quadratic and cubic rounding in addition to the regular arithmetic rounding, which is referred to as „half“.

Harmonic and geometric rounding are important for some of the mechanisms of rounding with sum.
On this basis these algorithms will look more natural and no longer like some weird approach that needs to be copied from some untrusted source just to get past the nightmare of having to solve that issue.


Rounding with sum


Often we have to round some numbers in such a way that the sum is correct afterwards. The most obvious case is maybe with percentage values, where a sum other than 100% is confusing for readers. But we may even have cases where a calculation becomes incorrect if some constraint is violated. Anyway, it can be useful to do some more advanced stuff behind the scenes to fulfill such a constraint. Now we could hope for the rounding operation to be an endomorphism of the additive group, that means \bigwedge_{x,y\in M} r(x+y)=r(x)+r(y). At first glance it makes sense, errors often compensate each other to make this true, but we can find counterexamples for our usual rounding methods. Just think of the numbers x=1.4, y=2.3, z=3.3. We have x+y+z=7. Now we use a typical rounding method and we get r(x)=1 \wedge r(y)=2 \wedge r(z)=3 \Rightarrow r(x)+r(y)+r(z)=6 \ne 7. So it is no endomorphism; one counterexample is enough.

Is this any kind of new concept? Obviously we find printed media or web pages that show percentage values in such a way that they cheated on the partial values to get the anticipated sum. But it shows that such things happen officially every four years. We have events called „elections“, in which parties or lists can be voted for. The parliament usually has a fixed number of members, and if we disregard specialties that make the whole game more complex in some countries, in order to keep it accessible to logic, it is quite the same thing. Now parliament membership should be distributed according to the percentages of the popular votes. If we have n seats in the parliament, we have to round the exact percentages (as fractions of long integers) to multiples of \frac{100}{n} in such a way that the sum is correct. There are many established procedures for achieving this, and depending on the procedure the result may differ a little bit. Did I mention the word „cheating“ above? No, we do not talk about cheating by deliberately counting wrong, of course nobody ever wanted to do something like that, but about ways to enforce an end result that uses up exactly the given number of parliament seats. Which procedure is considered the fairest changes almost every other year a little bit. Interestingly, an equivalent issue occurs when distributing legislative seats to regions. Even in the United States, where congressmen are elected by getting the plurality of votes in their district, this is applied to deciding how many congressmen will represent each state, in this case with the constraint that each state should have at least one congressional seat (plus two senators).

If we do not need to write election software, the question is often less delicate. For example, a text should contain the sum of some rounded values in a reasonably accurate way. The majority of the readers will be happy. The very critical reader, who knows anyway what I am writing here, will be happy as well, if it has been done well enough, with the inevitable cheating confined to the last digits. It should be observed that this seemingly trivial issue is quite involved, for example see the apportionment paradox.

As described in residue class rounding, we define sets M and N \subseteq M with a metric d: M\times M \rightarrow {\Bbb R_+}, (m, n) \mapsto d(m,n) \ge 0 and want to deal with maps r : M \rightarrow N having d(x, r(x)) kept small and possibly some other constraints. But since the correction („cheating“) should be part of the mapping, we are here rather looking at a mapping
s: M^n \rightarrow N^n, (m_1,\ldots,m_n) \mapsto (n_1,\ldots n_n) with \sum_{i=1}^n m_i = \sum_{i=1}^n n_i and for all i the value of d(m_i,n_i) is „small“. We could say, that the sum has to be rounded as well, but often it is already rounded. Think about the details of your special case when they occur and read the concepts… Purely rounding up or down will not work unless we have an exact edge case. We could desire to minimize error squares, which has the advantage that larger deviations have a larger effect, encouraging equal distribution of the corrections, for example we want
\sum_{i=1}^n (d(m_i, n_i))^2 minimal. It is possible to find examples that are non-unique. In this case a rule can be defined or random numbers can be used to resolve.
Another approach would be to minimize relative errors, like
\sum_{i=1: m_i \ne 0}^n (\frac{d(m_i,n_i)}{d(m_i,m_i)})^2, skipping summand with m_i=0 because they do not create any issues if we have 0\in N. Usually we do want to round 0 to 0.
A good alternative is to just look at the Apportionment methods for parliaments and the underlying algorithms and adjust them for the required usage:

Often our requirements may be different and especially lower in terms of the quality of the rounding than for parliament apportionment, but it is a good idea to think twice about this prior to just doing something naïve without thinking.

Question: how are the apportionment methods behaving in terms of absolute and relative error square optimization?

If there is serious interest, I might provide an implementation in some real programming language.

Often we encounter problems similar to this, or problems that can be reduced to this. Who would have thought that parliament apportionment is actually an act of rounding?


Here are some links to interesting pages and papers about algorithms and other topics related to this issue.


Residue Class Rounding


If you do not know what residue classes are, just read on, it will be explained to the extent needed later on.

The concept of rounding seems familiar, but let us try to grab it a little bit more systematically.

Let us assume a set M of numbers and a subset N \subseteq M of it, whose elements can be represented in the programming or software environment we have in mind. We want to use a metric d : M \times M \rightarrow \Bbb R_+, (x, y) \mapsto d(x,y) \ge 0. Usually we have three requirements for d:

  • Identity of indiscernibles: d(x,y) = 0 \iff x = y
  • Symmetry: d(x,y)=d(y,x)
  • Triangle inequality: d(x,z) \le d(x,y) + d(y,z)

Usually the numbers that we are dealing with can be considered to be within the world of real or complex numbers, so we can assume that M \subseteq \Bbb C, often even M \subseteq \Bbb R or, if we are honest, M \subseteq \Bbb Q. Then we are usually working with d(x,y) = |x-y|, which is kind of implicitly clear. If you do not know complex numbers, just think of real or even rational numbers, which is the most common case anyway. Of course the concepts for rounding of p-adic numbers are really interesting and beautiful, but since I do not want to explain p-adic numbers here, I will not extend on this issue.

What is our intuitive understanding of rounding?
Maybe just a map
r: M \rightarrow N, x \mapsto r(x),
that is chosen with certain constraints for each x in such a way that d(x, r(x)) is minimal.
We need the constraints to enforce uniqueness if multiple values y\in N with minimal d(x,y) exist. The classical case of rounding to integral numbers has to answer the question of how to round 0.5. If N is a subset of the real numbers, which is „usually“ the case, we have an ordering. We can choose between the following constraints:

  • |r(x)|\ge |x|, e.g. r(0.4)=1 and r(0.5)=1 and r(-0.5)=-1
  • |r(x)|\le |x|, e.g. r(0.6)=0 and r(0.5)=0 and r(-0.5)=0
  • r(x) \ge x, e.g. r(0.4)=1 and r(0.5)=1 and r(-0.5)=0
  • r(x) \le x, e.g. r(0.6)=0 and r(0.5)=0 and r(-0.5)=-1
  • Minimize d(r(x),x), but if multiple optimal values exist for r(x), pick the one furthest away from 0, e.g. r(0.4)=0 and r(0.6)=1 and r(0.5)=1 and r(-0.5)=-1
  • Minimize d(r(x),x), but if multiple optimal values exist for r(x), pick the one closest to 0, e.g. r(0.4)=0 and r(0.6)=1 and r(0.5)=0 and r(-0.5)=0
  • Minimize d(r(x),x), but if multiple optimal values exist for r(x), pick the largest, e.g. r(0.4)=0 and r(0.6)=1 and r(0.5)=1 and r(-0.5)=0
  • Minimize d(r(x),x), but if multiple optimal values exist for r(x), pick the smallest, e.g. r(0.4)=0 and r(0.6)=1 and r(0.5)=0 and r(-0.5)=-1
  • Minimize d(r(x),x), but if multiple optimal values exist for r(x), pick the one with an even last digit. Please observe that this constraint is useful in the classical case, but it cannot be generalized. For example: r(0.4)=0 and r(0.6)=1 and r(0.5)=0 and r(-0.5)=0 and r(1.5)=2
  • This constraint does not work in the mathematical sense (or only with ugly abusive math formulas), but programmatically we can do it: we try r(x)=x and throw an exception if not x\in N.

Usually we are thinking in the decimal system (even though our computers prefer something else, the computer should serve us and not vice versa…). So we pick a power of ten 10^n with n \in \Bbb N_0, i.e. n \ge 0. Now we simply define
N = \{ x \in M : 10^n x \in \Bbb Z\},
which means that N contains all numbers of M with a maximum of n digits after the decimal point. This rounding works quite well with something like LongDecimal in Ruby or BigDecimal in Scala or Java, but BigDecimal offers fewer rounding modes than LongDecimal for Ruby.

Now we look at the residue class rounding. We assume such a power of ten 10^n. Then we need an integral number m \ge 2 and a set of numbers R \subseteq \{0,...,m-1\}, we call them residues. Now we define N = \{ x \in M : 10^n x \in {\Bbb Z} \wedge \bigvee_{r \in R} 10^n x \equiv r \mod{m}\}.
That means that if we fill the number with zeros to the given number of places after the decimal point, remove the decimal point and perform division with remainder of this number by m, the remainder lies in R. In this case ROUND_HALF_EVEN can become difficult to implement and ambiguous, so we might need to sacrifice it. But even worse, we have to deal with the case that 0 is not in R and provide rules for how to round 0. The candidates have self-explanatory names:


An important application of this is m=10 and R=\{0, 5\}. This is used in Switzerland and probably other currency regions for rounding money amounts to multiples of 5 Rappen (0.05 CHF). This has already been described in „Rounding of Money Amounts“. If we have m=10 and R=\{0,1,2,3,4,5,6,7,8,9\}, we have the usual non-CHF rounding case. Maybe there are even cases for m=10 and R=\{0,2,4,6,8\}. But the concept is more general, if you like to use it.


Shift and Rotation Functions


In a comment to „Carry Bit: How does it work“ a question about shift and rotation functions was asked. Which ones exist and how exactly they work depends of course on the CPU architecture, but I will try to give a high-level overview anyway.

The following aspects have to be considered:

  • 8, 16, 32, 64, … bits: How many bits are affected by the operation?
  • Shift vs. rotate: Shift moves bits to the left or right, filling in 0 or 1 in the positions that are freed by this. The bits that get shifted out usually go to the carry bit, so the last one of these wins if the shift is done by more than one bit. Rotation means what the English word suggests: the bits shifted out are moved back in on the other side.
  • ROL and ROR vs. RCL and RCR: Rotation through carry (RCL or RCR) means that the carry bit participates as bit 9, 17, 33 or 65. ROL and ROR leave the carry bit alone, or at least do not use it as input.
  • Logical vs. arithmetic shift: This deals with the „sign“. For an arithmetic shift the number is considered to be signed in two’s complement. Therefore a right shift does not necessarily introduce a 0 as the new most significant bit, but replicates the sign bit, introducing a 0 or a 1 depending on the sign.

Current CPUs have barrel shifters or something like that built into the hardware, so shift and rotate operations by more than one bit position are much faster than sequences of single-bit shifts. This technology has been around on better CPUs for decades and is not at all new.

What the commands are called exactly, and possibly some details of their operation, depend on the CPU architecture.

Examples (8 bit) for shifting and rotating by one bit position:

Abbreviations: ROL = rotate left, RCL = rotate left through carry, ROR = rotate right, RCR = rotate right through carry, SHL = arithmetic/logical shift left, SHR = logical shift right, ASR = arithmetic shift right. The carry bit is given in parentheses; the first column shows the value before the operation.

    Before        ROL           RCL           ROR           RCR           SHL           SHR           ASR
    00000000 (0)  00000000 (0)  00000000 (0)  00000000 (0)  00000000 (0)  00000000 (0)  00000000 (0)  00000000 (0)
    00000000 (1)  00000000 (1)  00000001 (0)  00000000 (1)  10000000 (0)  00000000 (0)  00000000 (0)  00000000 (0)
    00000001 (0)  00000010 (0)  00000010 (0)  10000000 (0)  00000000 (1)  00000010 (0)  00000000 (1)  00000000 (1)
    00000001 (1)  00000010 (1)  00000011 (0)  10000000 (1)  10000000 (1)  00000010 (0)  00000000 (1)  00000000 (1)
    10000000 (0)  00000001 (0)  00000000 (1)  01000000 (0)  01000000 (0)  00000000 (1)  01000000 (0)  11000000 (0)
    10000000 (1)  00000001 (1)  00000001 (1)  01000000 (1)  11000000 (0)  00000000 (1)  01000000 (0)  11000000 (0)
    11111111 (0)  11111111 (0)  11111110 (1)  11111111 (0)  01111111 (1)  11111110 (1)  01111111 (1)  11111111 (1)
    11111111 (1)  11111111 (1)  11111111 (1)  11111111 (1)  11111111 (1)  11111110 (1)  01111111 (1)  11111111 (1)

These shift and rotate operations are needed to express multiplications by powers of two and divisions by powers of two. In C, C++, C#, Perl, Java and some other programming languages shift operations are included and written as „<<“ (for SHL/ASL) and „>>“. In Java and JavaScript „>>“ is the arithmetic shift right and „>>>“ the logical shift right, while in C the behavior of „>>“ depends on whether the operand is signed or unsigned.



Rounding of Money Amounts


Many numerical calculations deal with amounts of money. It is always nice if these calculations are somewhat correct, or if we can at least rest assured that we do not lose money due to such inaccuracies. Of course there are calculations that may legitimately deal with approximations, for example when calculating profits as percentages of the investment (return on investment). For this kind of calculation, floating point numbers (double, Float or something like that) come in handy, but of course dealing with the numeric subtleties can be quite a challenge, and rounding errors can grow huge if not dealt with properly.

It is often necessary to do calculations with exact amounts. It is quite easy to write something like 3.23 as floating point number, but unfortunately these are internally expressed in binary format (base 2), so the fraction with a power of 10 in the denominator needs to be expressed as a fraction with a power of two in the denominator. We can just give it a try and divide x=323_{10}=101000011_{2} by y=100_{10}=1100100_2, doing the math in base 2 arithmetic. We get something like z=x/y=11.0011101011100001010001111010111000010100011110101\ldots_{2}, more precisely z=11.00\overline{11101011100001010001} or as fraction z=11_2+\frac{11101011100001010001_2}{100_2*(100000000000000000000_2-1_2)} or in base 10 z=3_{10}+\frac{964689_{10}}{4_{10}\cdot1048575_{10}}.

It can be seen that such a harmless number with only two digits after the decimal point ends up having an infinite number of digits after the binary point. Our usual Double and Float types are limited to 64 bits, so some digits have to be discarded, causing a rounding error. Fortunately this usually works well, and the stored number is still shown as 3.23. But when working a little with these floating point numbers, sooner or later results like 3.2299999999999 or 3.2300000001 will appear instead of 3.23, even when doing only addition and subtraction. When doing larger sums it might even end up as 3.22 or 3.24. For many applications that is unacceptable.

A simple and often useful solution is to use integral numbers, storing the amounts in cents instead of dollars. Then the rounding problem of pure additions and subtractions is under control, and for other operations it at least becomes easier to deal with the issue. A common mistake that absolutely needs to be avoided is mixing amountInDollar and amountInCent. In Scala it would be a good idea to use different types to make this distinction, so that such errors can be avoided at compile time. In any case it is very important to avoid numeric types like Java’s int that have a weird and hardly controllable overflow behavior, where adding positive numbers can end up negative. How absurd, modular arithmetic is for sure not what we want for our money, it is a little bit too socialistic. 😉 Estimations about the upper bound of money amounts of persons, companies and other organizations are problematic, because there can be inflation and there can be some accumulation of money in somebody’s accounts… Even though databases tend to force us to such assumptions, we can at least make the bounds really huge. So some languages might end up using something like BigInteger or BigNum or BigInt. Unfortunately Java shows one of its insufficiencies here, which makes it quite ugly to use for financial applications, because calculations like a = b + c * d appear for BigInteger like this: a = b.{\rm add}(c.{\rm multiply}(d)). The disadvantage is that the formula cannot be seen at one glance, which leads to errors. In principle this problem could be solved using a preprocessor for Java. Maybe a library doing some kind of RPN notation would be possible, writing code like this:

// hypothetical RPN-style API, no such library is implied to exist
Calculation calc = new Calculation();
a = calc.push(b).push(c).push(d).multiply().add().result();

Those who still know old HP calculators (like HP 25 and HP 67 😉 in the good old days) or Forth might like this, but for most of us this is not really cutting it.

What is actually common and useful is some decimal fixed-point type. In Java this is BigDecimal, in Ruby it is LongDecimal.
An example in Ruby:

> sudo gem install long-decimal
Successfully installed long-decimal-1.00.01
1 gem installed
> irb
irb(main):001:0> require "long-decimal"
=> true
irb(main):002:0> x = LongDecimal("3.23")
=> LongDecimal(323, 2)
irb(main):003:0> y = LongDecimal("7.68")
=> LongDecimal(768, 2)
irb(main):004:0> z = LongDecimal("3.9291")
=> LongDecimal(39291, 4)
irb(main):005:0> x+y
=> LongDecimal(1091, 2)
irb(main):006:0> (x+y).to_s
=> "10.91"
irb(main):007:0> x+y*z
=> LongDecimal(33405488, 6)
irb(main):008:0> (x+y*z).to_s
=> "33.405488"

It is interesting to see that the number of digits after the decimal point remains the same under addition and subtraction if both sides have the same number of digits; otherwise the longer number of digits wins. During multiplication, division and of course during more complex operations many additional decimal places can become necessary. So it becomes important to do rounding, and to do it right and in a controlled way. LongDecimal supports the following rounding modes:

ROUND_UP: Round away from 0.
ROUND_DOWN: Round towards 0.
ROUND_CEILING: Round to more positive, less negative numbers.
ROUND_FLOOR: Round to more negative, less positive numbers.
ROUND_HALF_UP: Round the middle and everything above it up (away from 0), everything below it down (towards 0).
ROUND_HALF_DOWN: Round the middle and everything below it down (towards 0), everything above it up (away from 0).
ROUND_HALF_CEILING: Round the middle and everything above it towards infinity, everything below it towards negative infinity.
ROUND_HALF_FLOOR: Round up to and including the middle towards negative infinity, everything above it towards infinity.
ROUND_HALF_EVEN: Round the middle in such a way that the last digit becomes even, everything else to the nearest.
ROUND_HALF_ODD: Round the middle in such a way that the last digit becomes odd, everything else to the nearest (will be added to long-decimal in the next release).
ROUND_UNNECESSARY: Do not round, just discard trailing zeros. If that does not work, raise an exception.
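To make the tie-breaking rules concrete, here is a minimal plain-Ruby sketch (my own helper, not part of LongDecimal) of half-even rounding when dropping one decimal digit from a non-negative integer-scaled value:

```ruby
# Drop the last digit of a non-negative integer-scaled value, rounding
# ties (last digit 5, nothing behind it) to the even neighbor.
def drop_digit_half_even(v)
  q, r = v.divmod(10)
  round_up = r > 5 || (r == 5 && q.odd?)
  round_up ? q + 1 : q
end

drop_digit_half_even(25)  # => 2 (tie, rounds to even 2)
drop_digit_half_even(35)  # => 4 (tie, rounds to even 4)
drop_digit_half_even(26)  # => 3 (no tie, just nearest)
```

Half-even rounding is popular in finance because over many operations the ties do not accumulate a bias in either direction.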

Which of these to use should be decided with domain knowledge in mind. The above example could be continued as follows:

irb(main):035:0> t=(x+y*z)
=> LongDecimal(33405488, 6)
irb(main):037:0* t.round_to_scale(2, LongDecimal::ROUND_UP).to_s
=> "33.41"
irb(main):038:0> t.round_to_scale(2, LongDecimal::ROUND_DOWN).to_s
=> "33.40"
irb(main):039:0> t.round_to_scale(2, LongDecimal::ROUND_CEILING).to_s
=> "33.41"
irb(main):040:0> t.round_to_scale(2, LongDecimal::ROUND_FLOOR).to_s
=> "33.40"
irb(main):041:0> t.round_to_scale(2, LongDecimal::ROUND_HALF_UP).to_s
=> "33.41"
irb(main):042:0> t.round_to_scale(2, LongDecimal::ROUND_HALF_DOWN).to_s
=> "33.41"
irb(main):043:0> t.round_to_scale(2, LongDecimal::ROUND_HALF_CEILING).to_s
=> "33.41"
irb(main):044:0> t.round_to_scale(2, LongDecimal::ROUND_HALF_FLOOR).to_s
=> "33.41"
irb(main):045:0> t.round_to_scale(2, LongDecimal::ROUND_HALF_EVEN).to_s
=> "33.41"
irb(main):046:0> t.round_to_scale(2, LongDecimal::ROUND_UNNECESSARY).to_s
ArgumentError: mode ROUND_UNNECESSARY not applicable, remainder 5488 is not zero
        from /usr/local/lib/ruby/gems/1.9.1/gems/long-decimal-1.00.01/lib/long-decimal.rb:507:in `round_to_scale_helper'
        from /usr/local/lib/ruby/gems/1.9.1/gems/long-decimal-1.00.01/lib/long-decimal.rb:858:in `round_to_scale'
        from (irb):46
        from /usr/local/bin/irb:12:in `<main>'
irb(main):047:0>

A specialty is that some countries do not use coins for the smallest unit of their currency, like 1 cent in the US. In Switzerland the smallest common coin is 5 Rp (5 cents = 0.05 CHF). It might be possible to make the bank transfer amounts not ending in 0 or 5, but usually invoices apply this kind of rounding and avoid exact amounts that could not possibly be paid in cash. This can be dealt with by multiplying the amount by 20, rounding it to 0 digits after the decimal point, dividing it by 20 and rounding the result to 2 digits after the point. In Ruby there is a better way, using an advanced feature of LongDecimal called remainder rounding (using the method round_to_allowed_remainders(…)). Assume we want to round a number x with n decimal places in such a way that 10^n\cdot x belongs to one of the residue classes \overline{0} \mod 10 or \overline{5} \mod 10. In this case we are just talking about the last digit, but the mechanism has been implemented in a more general way, allowing any set of allowed residues and even a base other than 10. If 0 is not in the set of allowed residues, it may be unclear how 0 should be rounded, and this needs to be defined with an additional parameter. For common practical uses the last digit 0 is allowed, so things work out of the box:

irb(main):003:0> t.round_to_allowed_remainders(2, [0, 5], 10, LongDecimal::ROUND_UP).to_s
=> "33.45"
irb(main):005:0> t.round_to_allowed_remainders(2, [0, 5], 10, LongDecimal::ROUND_DOWN).to_s
=> "33.40"
irb(main):006:0> t.round_to_allowed_remainders(2, [0, 5], 10, LongDecimal::ROUND_CEILING).to_s
=> "33.45"
irb(main):007:0> t.round_to_allowed_remainders(2, [0, 5], 10, LongDecimal::ROUND_FLOOR).to_s
=> "33.40"
irb(main):008:0> t.round_to_allowed_remainders(2, [0, 5], 10, LongDecimal::ROUND_HALF_UP).to_s
=> "33.40"
irb(main):009:0> t.round_to_allowed_remainders(2, [0, 5], 10, LongDecimal::ROUND_HALF_DOWN).to_s
=> "33.40"
irb(main):010:0> t.round_to_allowed_remainders(2, [0, 5], 10, LongDecimal::ROUND_HALF_CEILING).to_s
=> "33.40"
irb(main):011:0> t.round_to_allowed_remainders(2, [0, 5], 10, LongDecimal::ROUND_HALF_FLOOR).to_s
=> "33.40"

In a finance application the rounding methods should probably be defined for each currency, possibly using one formula with some currency-specific parameters.

LongDecimal can do even more than that. It is possible to calculate logarithms, exponential functions, square roots and cube roots, all to a desired number of decimal places. The algorithms have been tuned for speed without sacrificing precision.


Carry Bit: How does it work?


Most of us know from elementary school how to add multi-digit numbers on paper. Usage of the carry bit is the same concept, but not for base 10, not even for base 2, but for base 256 (in the old 8-bit-days), base 65536 (in the almost as old 16-bit-days), base 4294967296 (32 bit) or base 18446744073709551616 (64 bit), whatever is the word width of the CPU. Always using powers of two is common today, but it is quite possible that this will change from bits to trits (having three possible values, -1, 0 and 1) in the far future.

I do not think that application development should be dealing with low level stuff like bits and bytes, but currently common programming languages like Java, C, C++, C# and more will not let you get away without it: you have to be aware of the bits underlying their numeric types to some extent. So it is a good idea to spend some effort on understanding this. Unfortunately all of these languages lack access to the carry bit, but it is useful to understand the concept anyway.

I have been writing software since the early 1980s. The computers available to me at that time were 8-bit machines with a 6502 or 6510 CPU and 1 MHz clock speed. Yes, it was 1 MHz, not 1 GHz. It was possible to program them in some BASIC dialect, but that was quite useless for many purposes, because it was simply too slow. Compiled languages existed, but were too clumsy and too big to be handled properly on those computers, at least the ones that I have seen. So assembly language was the way to go. In later years I also learned the 680x0 and the 80x86 assembly languages, but after the mid 90s that has not happened any more. An 8-bit CPU can add two 8-bit numbers and yield an 8-bit result. For this, two variants need to be distinguished, namely signed and unsigned integers. For signed numbers it is common to use two’s complement. That means that the highest bit encodes the sign. So all numbers from 0 to 127 are positive integers as expected. 127 has the 8-bit representation 01111111. Now it would be tempting to assume that 10000000 stands for the next number, which would be +128, but it does not. Having the highest bit set to 1 makes this a negative number, so it is -128. Those who are familiar with modular arithmetic should find this easily understandable; it is just a matter of choosing the representatives for the residue classes. But this should not disturb you: if you have no recent experience with modular arithmetic, just accept the fact that 10000000 stands for -128. Further increments make the number less negative, so 10000001 stands for -127 and 11111111 for -1. For unsigned numbers, the 8 bits are used to express any number from 0 to 255.
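The two’s complement interpretation can be sketched in a few lines of Ruby (the helper name is mine):

```ruby
# Interpret an 8-bit pattern either as unsigned (0..255) or as a
# two's complement signed number (-128..127): patterns with the
# highest bit set are shifted down by 256.
def to_signed_8bit(byte)
  byte >= 0b10000000 ? byte - 0x100 : byte
end

to_signed_8bit(0b01111111)  # => 127
to_signed_8bit(0b10000000)  # => -128
to_signed_8bit(0b11111111)  # => -1
```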

For introducing the carry bit let us start with unsigned integral numbers. The possible values of a word are 0 to 2^n-1, where n is the word width in bits, which would be 8 in our example. Current CPUs have of course 64-bit word width, but that does not change the principle, so we stick with 8 bits to make it more readable. Just use your imagination to extend this to 32, 64, 96 or 128 bits.

So now the bit sequence 11111111 stands for 255. Using an assembly language instruction that is usually called ADD or something similar, it is possible to add two such numbers. This addition can typically be performed by the CPU within one or two clock cycles. The sum of two 8-bit numbers is in the range from 0 through 510 (111111110 in binary), which is a little bit too much for one byte. One bit more would be sufficient to express this result. The workaround is to accept the lower 8 bits as the result, but to retain the ninth bit, which can be 0 or 1, in the so-called carry bit or carry flag. It is possible to query it and to choose a different program flow depending on it, for example for handling overflows, in case successive operations cannot handle more than 8 bits. But there is also an elegant solution for adding numbers that are several bytes (or several machine words) long. From the second addition onwards a so-called ADC („add with carry“) instruction is used. The carry bit is included as a third summand. This can create results from 0 to 511 (111111111 in binary). Again we get a carry bit. This can be continued until all bytes of both summands have been processed, just using 0 if one summand is shorter than the other one. If the final carry bit is not 0, one more addition with both summands 0 and the carry bit has to be performed, yielding a result that is one word longer than the longer summand. This last step can of course also be achieved by just appending a 1, but that is really an implementation detail.
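The ADD/ADC scheme just described can be sketched in Ruby, emulating an 8-bit machine with numbers stored as byte arrays, least significant byte first (the function name and representation are mine):

```ruby
# Multi-byte addition with a carry bit, as an 8-bit CPU would do it.
# xs and ys are arrays of bytes (0..255), least significant byte first.
def add_long(xs, ys)
  result = []
  carry = 0
  [xs.length, ys.length].max.times do |i|
    sum = (xs[i] || 0) + (ys[i] || 0) + carry  # ADC: add with carry
    result << (sum & 0xff)                     # keep the lower 8 bits
    carry = sum >> 8                           # the ninth bit is the new carry
  end
  result << carry if carry != 0                # result longer than both summands
  result
end

add_long([0xff, 0x01], [0x01, 0x02])  # => [0x00, 0x04]        (0x01ff + 0x0201 = 0x0400)
add_long([0xff, 0xff], [0x01])        # => [0x00, 0x00, 0x01]  (0xffff + 1 = 0x10000)
```

The first addition in a real CPU would use plain ADD (or ADC with a cleared carry flag); here the carry variable starts at 0, which amounts to the same thing.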

So it is possible to write a simple long integer addition in assembly language. One of the most painful design mistakes of current programming languages, especially of C, is not providing convenient facilities to access the carry bit, so a lot of weird coding is used to work around this when writing long integer arithmetic. Usually 64-bit arithmetic is used to do 32-bit calculations, and the upper 32 bits are used for the carry bit. Actually it is not that hard to recover the carry bit, but it is a bit annoying anyway.
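The workaround mentioned here — doing 32-bit additions in a wider type and fishing the carry out of bit 32 — looks like this as a Ruby sketch (in C one would do the same with uint64_t):

```ruby
# Add two 32-bit words plus an incoming carry. The sum fits comfortably
# in a wider integer, so the outgoing carry can be recovered from bit 32.
def add32_with_carry(x, y, carry_in = 0)
  sum = x + y + carry_in
  [sum & 0xffff_ffff, sum >> 32]   # [32-bit result, carry out]
end

add32_with_carry(0xffff_ffff, 1)  # => [0, 1]
add32_with_carry(2, 3)            # => [5, 0]
```

This is the "weird coding": every addition costs an extra mask and shift, which is roughly where the factor of two in runtime comes from.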

Subtraction of long integers can be done in a quite similar way, using something like SBC („subtract with carry“) or SBB („subtract with borrow“), depending on how the carry bit is interpreted when subtracting.
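Subtraction with borrow can be sketched the same way as the addition above (my own helper, assuming the minuend is at least as large as the subtrahend):

```ruby
# Multi-byte subtraction with a borrow bit (SBB-style). xs >= ys as numbers,
# both arrays of bytes (0..255), least significant byte first.
def sub_long(xs, ys)
  result = []
  borrow = 0
  xs.length.times do |i|
    diff = xs[i] - (ys[i] || 0) - borrow  # SBB: subtract with borrow
    borrow = diff < 0 ? 1 : 0             # borrow from the next byte if negative
    result << (diff & 0xff)               # wrap into 0..255
  end
  result
end

sub_long([0x00, 0x04], [0x01, 0x02])  # => [0xff, 0x01]  (0x0400 - 0x0201 = 0x01ff)
```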

For signed integers special care has to be taken for the highest bit of the highest word of each summand, which is the sign. Often a so-called overflow bit comes in handy, which makes it possible to recognize that an additional machine word is needed for the result.

Within the CPU of current 64-bit hardware it would theoretically be possible to do the 64-bit addition internally bit-wise or byte-wise, one step after the other. I do not really know the implementation details of ARM, Intel and AMD CPUs, but I assume that much more parallelism is used to perform such an operation within one CPU cycle for all 64 bits. There are algorithms for long integer addition that make use of parallel computation and that can run much faster than what has been described here. They work for the bits and bytes within the CPU, but they can also be used for very long numbers on a large number of CPUs, most typically in a SIMD fashion as available on graphics devices (mis)used for doing calculations. I might be willing to write about this, if readers indicate interest.

It is quite interesting to look at how multiplication, division, square roots, cube roots and more are calculated (or approximated). I have a lot of experience with that, so it would be possible to write about it. In short, these operations can be done quite easily on modern CPUs, because they already have quite sophisticated multiplication and division instructions at the assembly language level, but I have of course been able to write such operations even for 8-bit CPUs lacking multiplication and division commands. Even that I could cover, but that would be more for nostalgic reasons. Again there are much better algorithms than the naïve ones for multiplication of very long integers.



Overflow of Integer Types


Handling of integral numbers has always been one of the basic capabilities of our computing devices. Any common programming language since the fifties provides this in some way.

There are some significant differences between the ways integral numbers are handled in different programming languages.

Dealing with the overflow

There are several approaches:

  1. Integral numbers can have any number of digits, as far as the memory allows. Results of calculations that need more bytes to be stored than the input are stored in as many bytes as needed to be represented completely. There is no need to deal with overflow, unless the numbers are so big and so many that they eat up the available memory. Examples: Common Lisp, Ruby
  2. Integral numbers have a fixed number of bits. If a calculation results in a number that cannot be expressed in these many bits, the upper bits are tacitly discarded, unless each operation is accompanied by elaborate indirect checks for overflow. Examples: C, Java
  3. Integral numbers have a fixed number of bits. If a calculation results in a number that cannot be expressed in these many bits, an exception is thrown. I do not have any examples, maybe someone can suggest an example.
  4. Integral numbers have a fixed number of bits. For results of a multiplication, a type with twice as many bits as the two factors is used. For results of addition, a type with as many bits as the summands is used, but a carry bit is provided, too. This is the extra bit that is needed to express the result in all possible cases. Using this carry bit and feeding it into the next addition, it is possible to add numbers of arbitrary length, as needed for solution 1. Multiplication, Division and Subtraction can be implemented in a similar manner. Example: Assembly language
  5. No integral types are available; floating point numbers have to be used instead. Example: Lua
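The difference between approaches 1 and 2 can be shown in Ruby, which itself follows approach 1; the helper (my own) emulates Java’s 64-bit long wraparound:

```ruby
# Ruby integers grow as needed (approach 1), so 2**62 + 2**62 is simply 2**63.
# Java's long (approach 2) tacitly wraps around; we can emulate that by
# masking to 64 bits and reinterpreting the result as two's complement.
def to_java_long(n)
  m = n & 0xffff_ffff_ffff_ffff
  m >= 2**63 ? m - 2**64 : m
end

2**62 + 2**62                # => 9223372036854775808  (the correct sum)
to_java_long(2**62 + 2**62)  # => -9223372036854775808 (positive + positive = negative)
```

The second result is exactly what adding two Java longs of value 2^62 yields: Long.MIN_VALUE, with no error or warning of any kind.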


Solution 1 is exactly what is needed for application development. Counting of bits, checking for overflow and such practices are just absurd during the development of business logic. This is quite a strong argument in favor of using Ruby (or Lisp) for application development.

Solution 2 is simply crap. I consider this the most serious design bug of C and Java. Rounding is something that we can accept in many cases, but rounding discards the lower digits; here we tacitly discard the upper bits, which changes the value completely. Average software developers do not really want to deal with these issues, and sometimes they are simply ignored, because the bugs do not occur frequently and are not discovered during the testing phase. Throwing away the carry bit, which contains very important information for implementing solution 1 or for just recognizing overflows, is not a good idea. There are workarounds for this, but they double the runtime of certain calculation-intensive software.

Solution 3 would be acceptable, but I consider solution 4 better. Low level code should leave it to the developer whether an exception is thrown or whether the situation can be handled.

Solution 4 is quite acceptable and useful for low level programming, but not for typical application development. Sometimes access to bits and bytes is needed explicitly, for example to talk to external hardware, to implement the lower layers of network protocols or similar issues. It can be useful for very performance critical calculations, where it can be guaranteed that no overflows occur. The possibility to deal with a carry bit at least optionally cannot do any harm. Unfortunately this is currently only (or almost only?) possible in assembly languages. For implementing integer arithmetic for languages like Ruby or Lisp, some implementation that works with solution 2 by doing crappy workarounds needs to be provided, but for common CPU-architectures it is possible to provide an implementation in assembly language based on solution 4, that will run about twice as fast.

Solution 5 is just an unnecessary restriction.


Many software developers think that this does not interest them or is not their problem. I recommend taking this issue seriously. The bugs caused by tacit overflow are hard to find, because they can occur anywhere and they might not show up during testing. But then they occur with some weird combination of data on the production system. Who wants to hunt for the cause of a weird bug that surfaces far away from its real cause? Only solution 1 allows the application developer to safely ignore these problems.
