# DB Persistence without UPDATE and DELETE

When exploring the usage of databases for persistence, the easiest case is a database that does only SELECT. We can cache as much as we like and it is more or less the functional immutable world brought to the database. For working on fixed data and analyzing data this can sometimes be useful.

Usually our data actually changes in some way. It has been discussed in this Blog already, that it would be possible to extend the idea of immutability to the database, which would be achieved by allowing only INSERT and SELECT. Since data can correlate, an INSERT in a table that is understood as a sub-entity via a one-to-many-relationship by the application actually is mutating the containing entity. So it is necessary to look at this in terms of the actual OR-mapping of all applications that are running on that DB schema.

Life can be simple, if we actually have self contained data as with MongoDB or by having a JSON-column in PostgreSQL, for example. Then inter-table-relations are eliminated, but of course it is not even following the first normal form. This can be OK or not, but at least there are good reasons why best practices have been introduced in the relational DB world and we should be careful about that. Another approach is to avoid the concept of sub entities and only work with IDs that are foreign keys. We can query them explicitly when needed.

An interesting approach is to have two ID-columns. One is an id, that is unique in the DB-table and increasing for newly created data. One is the entity-ID. This is shared between several records referring to different generations of the same object. New of them are generated each time we change something and persist the changes and in a simple approach we just consider the newest record with that entity-ID valid. It can of course be enhanced with validFrom and validTo. Then each access to the database also includes a timestamp, usually close to current time, but kept constant across a transaction. Only records for which validFrom <= timestamp < validTo are considered, and within these the newest. The validFrom and validTo can form disjoint intervals, but it is up to the application logic if that is needed or not. It is also possible to select the entry with the highest ID among the records with a given entityID and timestamp-validTo/From-condition. Deleting records can be simulated by this as well, by allowing a way to express a "deleted" record, which means that in case we find this deleted record by our rules, we pretend not having found anything at all. But still referential integrity is possible, because the pre-deletion-data are still there. This concept of having two IDs has been inspired by a talk on that I saw during Clojure Exchange 2017: Immutable back to front.

# Lazy Collections, Strings or Numbers

The idea is, that we have data that is obtained or calculated to give us on demand as much of it as we request. But it is not necessarily initially present. This concept is quite common in the functional world, where we in a way hide the deprecated concept of state in such structures, by the way in a way that lets use retain the benefits that led to the desire for statelessness.

Actually the concept is quite old. We have it for I/O in Unix and hence in Linux since the 1970ies. „Everything is a file“, at least as long as we constrain ourselves to a universal subset of possible file operations. It can be keyboard input, a named or anonymous pipe, an actual file, a TCP-connection, to name the most important cases. These are „lazy“ files, behave more or less like files as far as sequential reading is concerned, but not for random access reading. The I/O-concept has been done in such a way that it takes the case into account that we want to read bytes, but get only bytes. This can happen with files when we reach their end, but then we can obtain an indication that we reached the end of the file, while it is perfectly possible that we read less then we want in one access, but eventually get bytes including subsequent reads. Since the API has been done right, but by no means ideal, it generalizes well to the different cases that exist in current OS environments.

We could consider a File as an array of bytes. There is actually a way to access it in this way by memory-mapping it, but this assumes a physically present file. Now we could assume that we think of the array as a list that is optimized for sequential access and iterating, but not for random access. Both list types actually exist in languages like Java. Actually the random access structure can be made lazy as well, within certain constraints. If the source is actually sequential, we can just assume that the data is obtained up to the point where we actually read. The information about the total length of the stream may or may not be available, it is always available somehow in the case of structures that are completely available in memory. This random access on lazy collections works fine if the reason of laziness is to actually save us from doing expensive operations to obtain data that we do not actually need or to obtain them in parallel to the computation that processes the data. But we loose another potential drawback in this case. If the data is truly sequential, we can actually process data that is way beyond our memory capacity.

So the concept transfers easily from I/O-streams to lists and even arrays, most naturally to iterables that can be iterated only once. But we can easily imagine that this also applies to Strings, which can be seen a sequence of characters. If we do not constrain us to what a String is in C or Java or Ruby, but consider String to be a more abstract concept, again possibly dropping the idea of knowing the length or having a finite length. Just think of the output of the Unix command „yes“ or „cat /dev/zero“, which is infinite, in a theoretical way, but the computer won’t last forever in real life, of course. And we always interrupt the output at some time, usually be having the consumer shut down the connection.

Even numbers can be infinite. For real numbers this can happen only after the decimal point, for p-adic numbers it happens only before the decimal point, if you like to look into that. Since we rarely program with p-adic numbers this is more or less an edge case that is not part of our daily work, unless we actually do math research. But we could have integers with so many digits that we actually obtain and process them sequentially.

Reactive programming, which is promoted by lightbend in the Reactive Manifesto relies heavily on lazy structures, in this case data streams. An important concept is the so called „backpressure“, that allows the consumer to slow down the producer, if it cannot read the data fast enough.

Back to the collections, we can observe different approaches. Java 8 has introduced streams as lazy collections and we need to transform collections into streams and after the operation a stream back into a collection, at least in many real life situations. But putting all into one structure has some drawbacks as well. But looking at it from an abstract point of view this does not matter. The java8-streams to not implement a collection interface, but they are lazy collections from a more abstract point of view.

It is interesting that this allows us to relatively easily write nested loops where the depth of the nesting is a parameter that is not known at compile time. We just need a lazy collections of -tuples, where is the actual depth of the nesting and the contents are according to what the loops should iterate through. In this case we might or might not know the size of the collection, possibly not fitting into a 32-bit-integer. We might be able to produce a random member of the collection. And for sure we can iterate through it and stop the iteration wherever it is, once the desired calculation has been completed.

# How to create ISO Date String

It is a more and more common task that we need to have a date or maybe date with time as String.

There are two reasonable ways to do this:
* We may want the date formatted in the users Locale, whatever that is.
* We want to use a generic date format, that is for a broader audience or for usage in data exchange formats, log files etc.

The first issue is interesting, because it is not always trivial to teach the software to get the right locale and to use it properly… The mechanisms are there and they are often used correctly, but more often this is just working fine for the locale that the software developers where asked to support.

So now the question is, how do we get the ISO-date of today in different environments.

## Linux/Unix-Shell (bash, tcsh, …)

date "+%F"

## TeX/LaTeX

 \def\dayiso{\ifcase\day \or 01\or 02\or 03\or 04\or 05\or 06\or 07\or 08\or 09\or 10\or% 1..10 11\or 12\or 13\or 14\or 15\or 16\or 17\or 18\or 19\or 20\or% 11..20 21\or 22\or 23\or 24\or 25\or 26\or 27\or 28\or 29\or 30\or% 21..30 31\fi} \def\monthiso{\ifcase\month \or 01\or 02\or 03\or 04\or 05\or 06\or 07\or 08\or 09\or 10\or 11\or 12\fi} \def\dateiso{\def\today{\number\year-\monthiso-\dayiso}} \def\todayiso{\number\year-\monthiso-\dayiso} 
This can go into a file isodate.sty which can then be included by \include or \input Then using \todayiso in your TeX document will use the current date. To be more precise, it is the date when TeX or LaTeX is called to process the file. This is what I use for my paper letters.

## LaTeX

(From Fritz Zaucker, see his comment below):
 \usepackage{isodate} % load package \isodate % switch to ISO format \today % print date according to current format 

## Oracle

 SELECT TO_CHAR(SYSDATE, 'YYYY-MM-DD') FROM DUAL; 
On Oracle Docs this function is documented.
It can be chosen as a default using ALTER SESSION for the whole session. Or in SQL-developer it can be configured. Then it is ok to just call
 SELECT SYSDATE FROM DUAL; 

Btw. Oracle allows to add numbers to dates. These are days. Use fractions of a day to add hours or minutes.

## PostreSQL

(From Fritz Zaucker, see his comment):
 select current_date; —> 2016-01-08 
 select now(); —> 2016-01-08 14:37:55.701079+01 

## Emacs

In Emacs I like to have the current Date immediately:
 (defun insert-current-date () "inserts the current date" (interactive) (insert (let ((x (current-time-string))) (concat (substring x 20 24) "-" (cdr (assoc (substring x 4 7) cmode-month-alist)) "-" (let ((y (substring x 8 9))) (if (string= y " ") "0" y)) (substring x 9 10))))) (global-set-key [S-f5] 'insert-current-date) 
Pressing Shift-F5 will put the current date into the cursor position, mostly as if it had been typed.

## Emacs (better Variant)

(From Thomas, see his comment below):
 (defun insert-current-date () "Insert current date." (interactive) (insert (format-time-string "%Y-%m-%d"))) 

## Perl

In the Perl programming language we can use a command line call
 perl -e 'use POSIX qw/strftime/;print strftime("%F", localtime()), "\n"' 
or to use it in larger programms
 use POSIX qw/strftime/; my \$isodate_of_today = strftime("%F", localtime()); 
I am not sure, if this works on MS-Windows as well, but Linux-, Unix- and MacOS-X-users should see this working.

If someone has tried it on Windows, I will be interested to hear about it…
Maybe I will try it out myself…

## Perl 5 (second suggestion)

(From Fritz Zaucker, see his comment below):
 perl -e 'use DateTime; use 5.10.0; say DateTime->now->strftime(„%F“);‘ 

## Perl 6

(From Fritz Zaucker, see his comment below):
 say Date.today; 
or
 Date.today.say; 

## Ruby

This is even more elegant than Perl:
 ruby -e 'puts Time.new.strftime("%F")' 
will do it on the command line.
Or if you like to use it in your Ruby program, just use
 d = Time.new s = d.strftime("%F") 

Btw. like in Oracle SQL it is possible add numbers to this. In case of Ruby, you are adding seconds.

It is slightly confusing that Ruby has two different types, Date and Time. Not quite as confusing as Java, but still…
Time is ok for this purpose.

## C on Linux / Posix / Unix

 #include #include #include 

 main(int argc, char **argv) { 

 char s[12]; time_t seconds_since_1970 = time(NULL); struct tm local; struct tm gmt; localtime_r(&seconds_since_1970, &local); gmtime_r(&seconds_since_1970, &gmt); size_t l1 = strftime(s, 11, "%Y-%m-%d", &local); printf("local:\t%s\n", s); size_t l2 = strftime(s, 11, "%Y-%m-%d", &gmt); printf("gmt:\t%s\n", s); exit(0); } 
This speeks for itself..
But if you like to know: time() gets the seconds since 1970 as some kind of integer.
localtime_r or gmtime_r convert it into a structur, that has seconds, minutes etc as separate fields.
stftime formats it. Depending on your C it is also possible to use %F.

## Scala

 import java.util.Date import java.text.SimpleDateFormat ... val s : String = new SimpleDateFormat("YYYY-MM-dd").format(new Date()) 
This uses the ugly Java-7-libraries. We want to go to Java 8 or use Joda time and a wrapper for Scala.

## Java 7

 import java.util.Date import java.text.SimpleDateFormat

 

... String s = new SimpleDateFormat("YYYY-MM-dd").format(new Date()); 
Please observe that SimpleDateFormat is not thread safe. So do one of the following:
* initialize it each time with new
* make sure you run only single threaded, forever
* use EJB and have the format as instance variable in a stateless session bean
* protect it with synchronized
* protect it with locks
* make it a thread local variable

In Java 8 or Java 7 with Joda time this is better. And the toString()-method should have ISO8601 as default, but off course including the time part.

## Summary

This is quite easy to achieve in many environments.
I could provide more, but maybe I leave this to you in the comments section.
What could be interesting:
* better ways for the ones that I have provided
* other databases
* other editors (vim, sublime, eclipse, idea,…)
* Office packages (Libreoffice and MS-Office)
* C#
* F#
* Clojure
* C on MS-Windows
* Perl and Ruby on MS-Windows
* Java 8
* Scala using better libraries than the Java-7-library for this
* Java using better libraries than the Java-7-library for this
* C++
* PHP
* Python
* Cobol
* JavaScript
* …
If you provide a reasonable solution I will make it part of the article with a reference…

# Creating Unique Numbers

Many software systems rely on some kind of unique numbers. Uniqueness is always a question in what universe this uniqueness is required. We do see the different kinds of universes in the case of the IP-addresses. In theory they are world wide unique. In practice we have mechanisms in place like NAT, that use certain dedicated IP-ranges for an intranet and map them to publicly available addresses for internet traffic outside the intranet. So we already have two kinds of universes… This is typical.

A common case are database IDs, that are often used as primary keys in databases. I do challenge this by the question, if there is a natural key already available in the data, which might make this db internal id unnecessary, but more often than not DB tables do have these ID columns as primary keys. They have to be unique within the DB table. Which can mean across several servers, because somewhat distributed databases are common.

Other examples are message ids of emails. They may be quite long, a short line of text is acceptable and they should be world wide unique. Combining the fully qualified publicly accessible hostname and a time stamp and a counter is usually good enough. They look like this 5AE.630049D.2EA050907@gmx.net or 5AE.630049D.2EA050907@ms-93424.gmx.net, where the part after the „@“ stands for the mail server and the part in front is a unique id for the message created by the mail server and looking slightly different depending on its software.

Often the length is not arbitrary and the UUIDs are a good compromise for this. They have 128 bits, some of which are used to specify the type of UUID. One type combines a hostname and timestamp like the message ids. But since in some contexts the generation of the UUID should not reveal the hostname and the time, some implementations prefer random UUIDs. It can very well be argued that with good random numbers duplicates of such random UUIDs are less likely than events that would bother us much more than having a duplicate. For randomly generated UUIDs six bits are used up for expressing the version and the fact that it is a random UUID, leaving 122 bits, which is a total of different possible values. Generating billions of UUIDs for many years leaves the risk of creating duplicates acceptably low.

But the issue of the quality of the random generator and the issue of potential duplicates remain something that needs attention. So it is worth to consider the path of using the host and timestamp. Now the host can not be identified by an IP address or fully qualified domain and host name, because these tend to be either too long or not unique enough. The MAC address used to be a good possibility. But I would not be so sure about this any more. Most server systems are virtualized these days and the MAC address is configured by software, so duplicates can occur accidentally or deliberately. Using a time stamp by itself can be a problem too because sooner or later it will happen that two IDs are generated at the same time, within the given granularity. Machines have several processors, run several processes and several threads within each process.

So achieving the goal of a real globally unique UUID value remains a difficult question. Following a more local uniqueness within an application or application landscape might be more reasonable. The number of servers may be large and may vary, but it should be possible to assign numbers to virtual or real servers. In case there are multiple different processes on the same (virtual) machine to assign numbers to these as well. This can be used as a replacement for the host part of the UUID. If it does not use up all the bits, these can be filled up with random numbers.

Timestamps can be obtained easily and relatively reliably for a granularity of msec (Milliseconds). The UUID timestamp allows up to a granularity of 100 nsec, which is 10’000 sub divisions of the msec. A thread safe counter that may reset during program start or with its overflow can be used to count and its positive remainder modulo 10’000 can be used instead of the 100 nsec part in conjunction with the msec.

Often a uniqueness within an application or application landscape can be achieved by using some kind of unique counter. The best choice is often the sequence of a database, which is good in this task and well tested. It is not too hard to create such a functionality. Handling of multiple processes and threads needs to be addressed. For persistence, it can be an improvement to reserve blocks of 100 or 1000 numbers and persist less often. This will result in skipping some numbers when restarting, but otherwise work out well. The same idea can also be applied for a distributed unique number generator, where each instances gets ranges from some master generator and gets new ranges, when they are used up.

Such unique numbers or identifiers are needed quite often. It is usually best to use something that works reliably, like the DB sequence. But it can be developed with adequate care, if there is a need. Testing and especially automated testing is off course very important, but only sufficient if the whole implementation is conceptionally sound and robust.

# PDF and PDF/A formats

The PDF format has experienced a success story on its way from being a quasi proprietary format that could only be dealt with using Adobe tools to a format that is specified and standardized and can be dealt with using open source tools and tools from different vendors. It has become accepted that PDF is primarily a print format and that for web content HTML is the better choice, which was not clear 15 years ago, when people coming from print layout who just considered themselves trivially capable of adding web to their portfolio just wanted to build whole web pages by just using PDF instead of HTML.

Now the format did change over time and there are always PDF files that use specific features that do not work in certain PDF viewers.

But there are requirements for maintaining documents over a long period of time. Just consider long term contracts that have a duration of 50-100 years. The associated documents usually need to be retained for that duration plus ten years. Alone the issue of storing data for such a long time and being able to read it physically is a challenge, but assuming that this issue is addressed and files can still be read in 110 years, the file format should be readable.

Now companies disappear. A lot of them in 100 years, probably even big ones like Adobe, Apple, Microsoft, Oracle and others. We do not know which companies will disappear, only that it is very likely that some companies that are big now will disappear. Proprietary software may make it to another vendor when shutting down the company, to pay the salaries of the former employees for some more days. But it might eventually disappear as well. Open source software has a better chance of being available in 100 years, but that cannot be absolutely guaranteed either, unless special attention is given to that software over such a long time. And if software is not maintained, it is highly unlikely that it will be able to run on the platforms that are common in 100 years.

So it is good to create a stable and simplified standard for long term archiving. Software for accessing that can be written from scratch based on that specification. And it is more likely to remain available, if continuous need for it can be seen.

The idea is a format called PDF/A, where A stands for „archive“, which is an option for storing PDF files over a very long period of time. Many cool features of PDF have been removed from PDF/A and make it more robust and easy to use. Important is also not to rely on additional data sources, for example for passwords of encrypted PDF files or for fonts. Encryption with password protection is a bad thing because it is quite likely that the password is gone in 100 years. Fonts need to be included, because finding them in 100 years might not be trivial. This usually means that proprietary fonts have to be avoided, unless the licensing allows inclusion of the fonts into the PDF file and unlimited reading. Including JavaScript, Video, Audio or Forms is also a bad idea. Video should be archived separately and it has the same issues as PDF for long term archiving.

# Find the next entry in a sequence

In Facebook, Xing, Google+, Vk.com, Linkedin and other of these social media networks we are often encountered with a trivial question like this:
 1->2 2->8 3->18 4->32 5->50 6->72 7->? 
There are some easy patterns. Either it is some polynomial formula or some trick with the digits.
But the point is, that any such sequence can easily be fullfilled by a polynomial formula. That means we can put any value for 7 and make it work. Or any answer is correct. So what would probably be the real question is the most simple function to full-fill the given constraints. Simplicity can be measured in some way… If the solution is unique is unclear, but let us just look at the polynomial solution.

A function is needed that takes as parameter a list of key-value-pairs (or a hash map) and that yields a function such that the function of any of the key is the associated value.

Assuming a polynomial function in one variable we can make use of the chinese remainder theorem, which can be applied to univariate polynomials over a field as well as to integral numbers. For a polynomial p(X) we have

where is the polynomial variable and is a concrete value.

We are looking for a polynomial such that for given values we have

or in another way

which is exactly the Chinese remainder theorem.
Let

and

We can see that for all the polynomials

have the properties

or

where is the Kronecker symbol, which is 0 if the two indices differ and 1 if they are equal.
Or as congruence:

Then we can just combine this and use

This can easily be written as a Ruby function
 def fun_calc(pairs)   n = pairs.size   result = lambda do |x|     y = 0     n.times do |i|       p_i = pairs[i]       x_i = p_i[0].to_r       y_i = p_i[1].to_r       z = y_i       n.times do |j|         if (j != i)           p_j = pairs[j]           x_j = p_j[0]           z *= (x - x_j) / (x_i - x_j)         end       end       y += z     end     y   end   result end 
This takes a list of pairs as a parameter and returns the polynomial function als lambda.
It can be used like this:
 lop = [[0, 0], [1, 1], [2, 4], [3, 9], [4, 16], [5, 25], [6, 36], [7, 64]]

 f = fun_calc(lop) 

20.times do |x|   y = f.call(x)   puts sprintf("%6d -> %6d", x, y) end 
Put this together into a ruby program and add some parsing for the list of pairs or change the program each time you use it and all these „difficult“ questions „that 99.9% fail to solve“ are not just easy, but actually soluble automatically.

This is interesting for more useful applications. I assume that there will always be situations where a function is needed that meets certain exact values a certain inputs and is an interpolation or extrapolation of this.

Please observe that there are other interesting and useful ways to approach this:

• Use a „best“ approximation from a set of functions, for example polynomials with a given maximum degree
• use cubic splines, which are cubic polynomials within each section between two neighboring input values such that at the input values the two adjacent functions have the same value (, of course), the same first derivative and the same second derivative.

For highway and railroad construction other curves are used, because the splines are making an assumption on what is the -axis and what is the -axis, which does not make sense for transport facilities. They are using a curve called Clothoid.

Use Java, C, Perl, Scala, F# or the programming language of your choice to do this. You only need Closures, which are available in Java 8, F#, Scala, Perl, Ruby and any decent Lisp dialect. In Java 7 they can be done with an additional interface as anonymous inner classes. And for C it has been described in this blog how to do closures.

# Binary Data

Having discussed to some extent about strings and text data, it is time to look at the other case, binary data.

Usually we think of arrays of bytes or sequences of bytes stored in some media.

Why bytes? The 8-bit-computers are no longer so common, but the byte as a typical unit of binary data is still present. Actually this is how files on typical operating systems work, they have a length in bytes and that number of bytes as content. When we use the file system, we have to deal with this.

But assuming a perfect lossless world, the typical smallest unit of information is not a byte, but a bit. So we could actually assume that we have an array of bits of a given length and the length does not need to be a multiple of 8. This might be subtle, but it does describe the most basic case. Padding with 0-bits does not usually do the job, but there need to be other provisions to deal with the bit-length and the storage length. In case of files this could be achieved by storing the last three bits of the bit length in the first byte of the file and using the file size the bit length can be calculated as .

How about memory? Often the array of bytes is just fine. The approach to have an array of booleans as bits usually hurts, because typical programming languages reserve 32 bit for each boolean and we actually want a packed array. On the other hand we have 64bit-computers and we would like to use this when moving around data, so actually thinking of an array of 64-bit-integers plus information about how many of the bits of the last entry actually count might be better. This does bring up the issue of endianness at some point.

But we actually want to use the data. In the good old C-days we could use quite a powerful trick: We read an array of bytes and casted the pointer to a pointer to a struct. Then we could access the struct. Poor mans serialization, but it was efficient and worked especially well for records with fixed size, for both reading and writing. Off course variable length records can be dealt with as well by imposing rules that say something about the type of the next record based on the contents of the current record.

The Perl programming language and subsequently the Ruby programming language had much more powerful and flexible mechanisms using methods of functions called pack and unpack.

# Indexing of long utf-8 files or strings

The UTF-8 format has the disadvantage that the length of characters and code points varies. Accessing a given position counted in characters is only possible by starting from the beginning or by providing an indexing structure. It is a good idea to find a balance between size and speed. So indexing blocks of several kilobyte size and scanning through them to reach the exact desired position should achieve both goals. The ideas described here can also be applied for compressing a long string or file by using the compression for these multi-kilobyte-blocks separately. In a way utf-8 is a compression scheme for allowing to store many of the four byte long code points in one or two or three bytes. Let us just assume we have characters and bytes. It could be arbitrary length numbers and bytes. And let us talk about compressed and uncompressed blocks.

Grouping and indexing can be done by having approximately a certain number of bytes in the compressed block, probably the largest number of characters that fit into the block. Now blocks can start at multiples of their maximum size and for each block we have meta information about its position in the chain of blocks, its uncompressed size and the position of its first entry in a totally uncompressed file or string. If the indexing structure is kept in memory only, this is quite easy to achieve. If it is stored in the file, some approach might be chosen: Add an additional indexing file to each „compressed“ text file, add the meta information in the beginning of the file, thus limiting the number of blocks at create time, or add the meta information to the end of the file. Or in some file systems add it into a second stream, which seems to be the cleanest approach, but only applicable for some file systems that are not main stream in the Unix- and Linux-world.

The other way around the uncompressed data can be grouped into blocks and then stored in a dense way. Then the meta information needs to contain the address of the first byte of the block. Both approaches seem to be reasonable.

It should be observed that random access reading is possible with this approach, but random access writing is doomed to fail, so it cannot be easily supported, unless the blocks can be stored anywhere and the meta information is used to contain all the information, current block size in bytes, current block size in characters, current location of first byte, position of block in the chain, number of the first character. When inserting characters, all metainformation needs to be updated and the block in which the change happens needs to be rewritten completely, but there could be ways to achieve this.

The question arises if this has been standardized in some way. I have not heard about such a standard, but I am observing that dealing with UTF-8 is always a mess or that this issue is just neglected. Databases have off course answered these kind of questions in their way quite well, but I think there could be a useful standard for compressed random access text files, maybe even supported by the libc or by the Posix standard.

If a typical compression is applied, it is possible to have a common dictionary for frequently occuring blocks and their encoding in one place of the file and just store the encoded data in the actual blocks, avoiding duplication. That would require a third area for storing data.

# Random-Access and UTF-8

Deutsch

It is a nice thing to be able to use random access files and to have the possibility to efficiently move to any byte position for reading or writing.

This is even true for text files that have a fixed number of bytes per character, for example exactly one, exactly two or exactly four bytes per character. Maybe we should prefer the term „code point“ here.

Now the new standard for text files is UTF-8. In many languages each character is usually just one byte. But when moving to an arbitrary byte position this may be in the middle of a byte sequence comprising just one character (code point) and not at the beginning of a character. How should that be handled without reading the whole file up to that position?

It is not that bad, because UTF-8 is self synchronizing. It can be seen if a particular byte is the first byte of a byte sequence or a successive one. First bytes start with 0 or 11, successive bytes start with 10. So by moving forward or backward a little bit that can be handled. So going to a rough position is quite possible, when knowledge of the average number of bytes per character for that language is around.
But when we do not want to go to a rough character position or to an almost exact byte position, but to an exact character position, which is usually the requirement that we have, then things get hard.

We either need to read the file from the beginning to be sure or we need to use an indexing structure which helps starting in the middle and only completely reading a small section of the file. This can be done with extra effort as long as data is only appended to the end, but never overwritten in the middle of the file.

But dealing with UTF-8 and random access is much harder than with bytes. Indexing structures need to be maintained in memory when accessing the file or even as part of the file.

# Development of Hardware: Parallelism

Deutsch

Until recently we could just rely on the fact that the CPU frequencies doubled at least every year, which has stopped a couple of years ago. So we can no longer compensate the inefficiencies of our software by just waiting for the next hardware release, which was no big deal, because software was often delayed anyway by a couple of months. Off course the power of hardware depends on many factors, even on the number of instructions that can be done within one clock cycle or the number of clock cycles needed for instructions. Everyone who has dealt with performance issues knows that providing enough physical memory is usually a good idea and certain optimizations in the circuits and the design of the chips can help to make the computer run faster, even though we usually do not care. But the power of the single CPU core has almost stagnated now for some years, but it is easy to get chips that provide multiple cores. An interesting link: The Free Lunch is Over.

Now we have the challenge of making use of these multiple CPU cores for building resource hungry applications, which is basically achieved by having multiple threads or processes running simultaneously. Unfortunately we encounter some issues. The most obvious problem is that it is easy to find developers who say that they are capable of developing such applications, but there are only very few who can really do it well enough to build reliable and stable software. So the software might work well under ideal circumstances, for example when testing it on the developer’s machine, but it will eventually fail in the productive environment, when run under load, creating errors that are very hard to pin down. Or the threads and processes spend so much time waiting for each other that the system does not actually make use of the parallel capabilities of the hardware. Or we even get dead locks. What do we learn from this?

For this kind of architecture excellent developers are needed, who can imagine the parallel computations and who have enough experience with this kind of development. And it is usually better to do development that uses the parallelism to a reasonable extent, without loosing robustness. Obviously it is important to test with reasonable data and load on test systems that are like the productive systems.

Another approach is the use of frameworks. There are some good lightweight frameworks, but common frameworks like JEE (earlier called J2EE) are using so many resources for themselves and restrict the developer so much that the advantage of easier multithreading gets lost by this, because the framework itself uses most of the CPU power and the main memory. There are many cases where using frameworks with JEE applications servers is a good idea, but high performance applications should done differently.

The problem is always that data structures that need to be manipulated by multiple threads or processes cause problems. These may be handled, but create a lot of difficulties in practice.

• avoid commonly used data structures
• usage of immutable data structures

The first approach is quite logical for development with C or Ruby or Perl, where the processes need relatively little memory, so that it is possible to run multiple processes simultaneously. Using POSIX-IPC (or whatever your OS offers instead) or TCP/IP the processes can communicate with each other. That works well, if there are several relatively independent processes that do not need to communicate very much. But it needs excellent developers as well, because they really need to know the IPC mechanisms, unless the sub tasks are so independent that they do not need to communicate with each other at all. Maybe Erlang has implemented this idea in a practicable way, allowing a huge number of parallel processes with totally separate data stores that communicate with each other through some message passing mechanism.

The other idea, to have all shared data structures immutable, is followed by Scala and Clojure. The disadvantage of having to create a copy with some changes applied instead of modifying the object itself can be reduced by internal optimizations within the standard libraries that use references to the original and just store the changes instead of really copying huge data structures for each change. Even Java uses such mechanisms when creating a substring of an immutable String.

In any case it is necessary to deal with dependencies between processes in order to avoid deadlocks. In the Scala and Clojure world it is reasonable to build lightweight frameworks that help dealing with multiple parallel threads because the promise of immutability eliminates many of the problems of shared objects. Twitter uses Scala internally and has been able to cope with the load even during events that cause a heavy communications load.

A principal problem remains whenever heavy communication between processes is required. In a huge system it is impossible to optimize all communication paths. Assuming n parallel processors we have communication pairs, which is growing . So we need to compromise as soon as gets really huge. A bus architecture with one common channel get congested and for separate point to point connections it will be necessary to provide these only for immediate neighbors instead of all possible connections. To really imagine huge, think of an application that is running on several locations, each having several racks, each containing several machines, each containing several CPU chips, each containing several CPU cores, possibly even with hyper-threading. Using sophisticated hardware architecture it is possible that CPU cores communicate with other CPU cores in their vicinity through very fast mechanisms, but it is only possible to place a limited number of CPU cores in this vicinity.

An interesting idea was to put a large number of boards containing this number of CPUs and cores that can communicate with each other efficiently into a topological hypercube. Having boards, each board has neighbors that can be reached directly through a relatively short communication channel. The boards represent the vertices of an -dimensional hypercube. This architecture allows reaching another board in steps and even to aggregate a result from all or a subset of all boards in steps. Having a wired-or for synchronization is very helpful for enhancing the performance for many typical types of tasks. Does anybody know how current super computers are built?

In any case it is good to be able to run sub tasks with as little communication with other sub tasks as possible, because the overhead of communication can eat up the gain of parallelism.