Indexing of long UTF-8 files or strings

The UTF-8 format has the disadvantage that the length of characters and code points varies. Accessing a given position, counted in characters, is only possible by scanning from the beginning or by providing an indexing structure. It is a good idea to find a balance between size and speed, so indexing blocks of several kilobytes and scanning within one block to reach the exact desired position should achieve both goals. The ideas described here can also be applied to compressing a long string or file, by compressing each of these multi-kilobyte blocks separately. In a way, UTF-8 is itself a compression scheme, because it allows storing many of the four-byte code points in one, two or three bytes. So let us just assume we have characters and bytes; it could just as well be arbitrary-length numbers and bytes. And let us talk about compressed and uncompressed blocks.
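
A minimal sketch in Ruby of such an indexing structure (the function names and the block size are illustrative, not any standard API): the index records the byte offset of every `block_chars`-th character, so finding a character position takes one table lookup plus a scan of at most one block.

```ruby
# Build a table: index[k] = byte offset where character number k * block_chars
# starts. A byte starts a code point unless it is a continuation byte
# (0b10xxxxxx), which is what the mask test below checks.
def build_char_index(s, block_chars = 4096)
  index = [0]
  chars = 0
  byte_off = 0
  s.each_byte do |b|
    if (b & 0xC0) != 0x80            # first byte of a code point
      index << byte_off if chars > 0 && chars % block_chars == 0
      chars += 1
    end
    byte_off += 1
  end
  index
end

# Translate a character position into a byte offset using the index,
# scanning through at most one block.
def char_to_byte(s, index, pos, block_chars = 4096)
  block = pos / block_chars
  byte_off = index[block]
  chars = block * block_chars
  while chars < pos
    byte_off += 1                    # step over the lead byte ...
    # ... and over any continuation bytes of the same code point
    byte_off += 1 while byte_off < s.bytesize &&
                        (s.getbyte(byte_off) & 0xC0) == 0x80
    chars += 1
  end
  byte_off
end
```

With a block size of 4096 characters the index costs one integer per block, and locating a position scans at most one block instead of the whole string.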

Grouping and indexing can be done by putting approximately a certain number of bytes into each compressed block, preferably the largest number of whole characters that fit into the block. Blocks can then start at multiples of their maximum size, and for each block we keep meta information about its position in the chain of blocks, its uncompressed size and the position of its first character in the totally uncompressed file or string. If the indexing structure is kept in memory only, this is quite easy to achieve. If it is stored in the file, one of several approaches might be chosen: add a separate index file next to each „compressed“ text file, put the meta information at the beginning of the file, thus limiting the number of blocks at creation time, or append the meta information to the end of the file. On some file systems the meta information could go into a second stream, which seems to be the cleanest approach, but it is only applicable on file systems that are not mainstream in the Unix and Linux world.

The other way around, the uncompressed data can be grouped into blocks of a fixed number of characters and the compressed blocks stored in a dense way. Then the meta information needs to contain the address of the first byte of each block. Both approaches seem reasonable.

It should be observed that random access reading is possible with this approach, but random access writing is doomed to fail, so it cannot easily be supported, unless the blocks can be stored anywhere and the meta information contains everything needed: current block size in bytes, current block size in characters, current location of the first byte, position of the block in the chain and number of the first character. When characters are inserted, all meta information needs to be updated and the block in which the change happens needs to be rewritten completely, but there could be ways to achieve this.
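
A sketch of that per-block meta information as a Ruby struct (the field names are made up here for illustration); since the first-character numbers increase along the chain, the block containing a given character can be found by binary search:

```ruby
# All the meta information listed above, per block. Field names are illustrative.
BlockMeta = Struct.new(
  :chain_pos,  # position of the block in the chain of blocks
  :first_byte, # current location of the block's first byte in the file
  :byte_size,  # current block size in bytes
  :char_size,  # current block size in characters
  :first_char  # number of the first character stored in the block
)

# Binary search over the chain (sorted by first_char) for the block
# that contains character number pos.
def block_for_char(metas, pos)
  metas.bsearch { |m| pos < m.first_char + m.char_size }
end
```

An insert would then rewrite one block and update the `first_byte` and `first_char` fields of all following blocks.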

The question arises whether this has been standardized in some way. I have not heard of such a standard, but I observe that dealing with UTF-8 is often a mess or that the issue is simply neglected. Databases have of course answered these kinds of questions quite well in their own way, but I think there could be a useful standard for compressed random access text files, maybe even supported by the libc or by the POSIX standard.

If a typical compression is applied, it is possible to keep a common dictionary for frequently occurring content and its encoding in one place of the file and to store only the encoded data in the actual blocks, avoiding duplication. That would require a third area in the file for storing this dictionary.
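
Zlib supports this idea directly via preset dictionaries. A sketch in Ruby (the dictionary content would come from analyzing the blocks; here it is just an illustration) compresses each block against a dictionary that would be stored once in that shared third area:

```ruby
require 'zlib'

# Compress one block against a shared preset dictionary.
def compress_block(data, dict)
  z = Zlib::Deflate.new
  z.set_dictionary(dict)
  out = z.deflate(data, Zlib::FINISH)
  z.close
  out
end

# Decompress one block; zlib signals Zlib::NeedDict when the stream
# was produced with a dictionary, so we supply it and retry.
def decompress_block(blob, dict)
  z = Zlib::Inflate.new
  begin
    out = z.inflate(blob)
  rescue Zlib::NeedDict
    z.set_dictionary(dict)
    retry
  end
  z.close
  out
end
```

Every block shares the references into the dictionary, so repeated phrases are stored only once across the whole file.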


Geometric and Harmonic Rounding

Most of our rounding methods are based on arithmetic rounding.
Please refer to the nomenclature described in Residue Class Rounding. We base our rounding on a metric, and the most obvious choice is the absolute value of the difference of the two parameters, d(x,y)=|x-y|. That is what everybody does, and nothing else really comes to mind when we are talking about rounding. But this approach is based on the additive structure of the field or ring containing the numbers that we want to round. Another way of looking at it is that (in case we are talking about real numbers or some subset thereof) the choice of the rounded value is mostly based on which side of a boundary between the two candidates for the rounded value the original value lies, with special handling if it is exactly equal to this boundary. Now this boundary is the arithmetic mean of the two candidates, therefore we call this arithmetic rounding. Just to give you an example:

We look at the „middle“ between two allowed numbers, usually something like 2.5 when rounding a number between 2 and 3 to integers, and automatically assume that above that middle we round up, below it we round down, and exactly on the boundary we need additional rules. This middle is the arithmetic mean of the two neighboring allowed numbers. All numbers x with 2 \le x < 2.5 are rounded to two, all numbers x with 2.5 < x \le 3 are rounded to three.


We assume M=\Bbb R and N=\Bbb Z, so each real number should be rounded to an integral number. Also we assume d(x,y)=|x-y|.

Now assume we have a non-integral number x \in \Bbb R and an integral number n\in \Bbb Z such that n< x< n+1. For convenience we write n'=n+1. Then we round x to n if x < n+0.5 and to n' if x > n+0.5, so n+0.5 = \frac{n+n'}{2} serves as the boundary. Now there are many other ways to define a mean of two numbers n and n'. Some of the most common are the arithmetic mean a(n,n')=\frac{n+n'}{2}, the geometric mean g(n,n')=\sqrt{nn'}, the harmonic mean h(n,n')=\frac{2nn'}{n+n'}, the quadratic mean q(n,n')=\sqrt{\frac{n^2+n'^2}{2}}, the cubic mean c(n,n')=\sqrt[3]{\frac{n^3+n'^3}{2}} and, more generally, the power mean m_p(n,n')=\left(\frac{n^p+n'^p}{2}\right)^{\frac{1}{p}}, where m_0 is defined as the limit for p\to 0, which turns out to be the geometric mean.

We could also define an arithmetic-harmonic mean in a similar way, but that is just the geometric mean. This is actually an interesting way to approach the geometric mean of matrices, which is otherwise somewhat hard to define because of the non-commutative multiplication.

For the users of spreadsheets, such as those included in LibreOffice, OpenOffice or MS Office, the arithmetic mean is available as the function AVERAGE, the geometric mean as GEOMEAN and the harmonic mean as HARMEAN.
Other means can be defined along these lines. And we can see that:

  • a(n,n')=m_1(n,n')
  • g(n,n')=m_0(n,n')
  • h(n,n')=m_{-1}(n,n')
  • q(n,n')=m_2(n,n')
  • c(n,n')=m_3(n,n')
  • \min(n,n')=m_{-\infty}(n,n')
  • \max(n,n')=m_\infty(n,n')
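
These identities are easy to check numerically. A small Ruby sketch of the power mean m_p for two positive numbers, with m_0 taken as its limit case, the geometric mean:

```ruby
# Power mean of two positive numbers; p = 0 is the limit case,
# which is the geometric mean.
def power_mean(p, n, m)
  return Math.sqrt(n * m) if p == 0
  ((n**p + m**p) / 2.0)**(1.0 / p)
end

def arithmetic_mean(n, m)
  (n + m) / 2.0
end

def geometric_mean(n, m)
  Math.sqrt(n * m)
end

def harmonic_mean(n, m)
  2.0 * n * m / (n + m)
end
```

For 0 < n < n' the means are strictly ordered, h < g < a < q < c, which is why the boundaries and hence the rounding results of the rounding methods below differ.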

Here is a visual explanation of arithmetic, geometric and harmonic mean:


The line x from K to L has the harmonic mean of a and c as its length. The geometric mean would be reached by moving that line to a position where the upper and the lower part have the same shape, apart from scaling. And the arithmetic mean would be reached by moving K and L to the middle points of the lines AD and BC, respectively.

The ruby-gem long-decimal supports all of these methods for calculating a mean.

Using means to define rounding methods

Just to give you an idea that this can be extended further, we now consider the geometric and the harmonic mean.
As can be seen, negative and zero values for n and n' cause some trouble or require some care. We can deal with this by splitting our unrounded and rounded worlds into separate parts; here we take the numbers < 0, the set \{0\} and the numbers > 0 as three different sets and do the rounding within each of them:

    \[M=M_1\cup M_2\cup M_3\]


    \[\bigwedge_{i\ne j}M_i\cap M_j = \emptyset\]

where we define

    \[M_1=\{m\in M : m < 0\},\]

    \[M_2=\{0\}\]

    \[M_3=\{m\in M : m > 0\}.\]

In a similar way we define N_1, N_2 and N_3 and define metrics d_1, d_2, d_3 separately. Then we just look at one partition i and consider rounding to be a mapping

    \[r_i: M_i\rightarrow N_i, x\mapsto r_i(x),\]

such that d_i(x, r_i(x)) is minimal.

The obvious assumption of rounding 0 to 0 relieves us from dealing with logarithms of zero or division by zero. We will see that any number between 0 and 1 is rounded to 1 for both harmonic and geometric rounding, because the boundary is 0, if we pick the right variant of the formula for calculating the mean. Likewise, between -1 and 0 everything is rounded to -1 in our example. Otherwise the calculation of the boundary can be made to work: for two negative candidates both factors are negative, so we take the square root of a positive product, but we need to replace the square root by its negative counterpart.

So for geometric rounding we define that x is rounded to n if x < g(n,n'), rounded to n' if x > g(n,n'), and rounding for x=g(n,n') has to be defined by additional rules. For negative values of n and n' we define g(n,n') to be -\sqrt{nn'}. This can also be expressed in terms of a metric:

    \[d_1(x,y)=|\log(-x)-\log(-y)|\;\; (x, y < 0),\]

    \[d_2(x,y)=0 \;\;(x=y=0)\]


    \[d_3(x,y)=|\log(x)-\log(y)| \;\; (x,y > 0).\]
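
A sketch of geometric rounding to integers along these lines (the handling of a value exactly on the boundary is arbitrarily chosen here to round away from zero):

```ruby
# Geometric rounding of x to an integer: the boundary between n and n' = n + 1
# is g(n, n') = sqrt(n * n'); for n = 0 the boundary is 0, so all of (0, 1)
# rounds to 1. Negative numbers are handled by symmetry, matching the
# convention g(n, n') = -sqrt(n * n') for negative candidates.
def geometric_round(x)
  return 0 if x == 0
  return -geometric_round(-x) if x < 0
  n = x.floor
  return n if n == x                 # already an integer
  boundary = Math.sqrt(n * (n + 1))
  x < boundary ? n : n + 1           # x on the boundary rounds up here
end
```

Note the difference to arithmetic rounding: 2.45 lies above \sqrt{6}\approx 2.449, so it is geometrically rounded to 3, while arithmetic rounding gives 2.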

Harmonic rounding is definable in a similar way:
We define that x is rounded to n if x < h(n,n'), rounded to n' if x > h(n,n') and rounding for x=h(n,n') has to be defined by additional rules. This can also be expressed in terms of a metric:

    \[d_1(x,y)=|\frac{1}{x}-\frac{1}{y}|\;\; (x, y < 0),\]

    \[d_2(x,y)=0 \;\;(x=y=0)\]


    \[d_3(x,y)=|\frac{1}{x}-\frac{1}{y}|\;\; (x,y > 0).\]
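
Harmonic rounding can be sketched the same way; the boundary between n and n'=n+1 is the harmonic mean h(n,n')=\frac{2nn'}{n+n'}, which is 0 for n=0:

```ruby
# Harmonic rounding of x to an integer: boundary h(n, n+1) = 2n(n+1)/(2n+1);
# for n = 0 the boundary is 0, so all of (0, 1) rounds to 1, and negative
# numbers are handled by symmetry.
def harmonic_round(x)
  return 0 if x == 0
  return -harmonic_round(-x) if x < 0
  n = x.floor
  return n if n == x
  boundary = 2.0 * n * (n + 1) / (2 * n + 1)
  x < boundary ? n : n + 1           # x on the boundary rounds up here
end
```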

Quadratic and cubic rounding can be defined in a similar way. It would even be possible to go for arithmetic-geometric or harmonic-geometric rounding, but that would imply an approximative, iterative calculation of the boundary each time it is needed, while the rounding modes described above can still be achieved with relatively simple multiplication and addition operations.
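
For positive values, quadratic and cubic rounding only differ in the exponent, so a generic sketch using the power-mean boundary covers both (p=1 reproduces arithmetic rounding, p=2 quadratic, p=3 cubic):

```ruby
# Rounding with a power-mean boundary ((n**p + n'**p) / 2)**(1/p);
# zero maps to zero and negative numbers are handled by symmetry,
# ties on the boundary round away from zero here.
def power_mean_round(x, p)
  return 0 if x == 0
  return -power_mean_round(-x, p) if x < 0
  n = x.floor
  return n if n == x
  boundary = ((n**p + (n + 1)**p) / 2.0)**(1.0 / p)
  x < boundary ? n : n + 1
end
```

The larger p is, the higher the boundary lies between two candidates, so 2.5 is rounded to 3 arithmetically, but to 2 both quadratically (boundary \sqrt{6.5}\approx 2.550) and cubically (boundary \sqrt[3]{17.5}\approx 2.596).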

In any case it is necessary to define the rule how to round numbers that lie exactly on the boundary.

The ruby-gem long-decimal supports geometric, harmonic, quadratic and cubic rounding in addition to the regular arithmetic rounding, which is referred to as „half“.

Harmonic and geometric rounding are important for some of the mechanisms of rounding with sum.
On this basis these algorithms will look more natural and no longer like some weird approach that needs to be copied from some untrusted source just to get over the nightmare of having to solve the issue.
