# Cassandra DB

The large zoo of NoSQL databases needs to be considered thoroughly.

While the major transactional SQL databases (Oracle, PostgreSQL, MS-SQL-Server, DB2 and to a limited extent MariaDB/MySQL) can be used more or less interchangeably, if we are talking about the beginning of the project, the NoSQL databases have totally different features and must be chosen wisely and adequately. While the typical pattern of many NoSQL databases is trading speed for complex transactions, this is by no means compulsory. There are SQL databases like Teradata, that have reduced support for transactions, but feature a full set of SQL for queries. This can be used for data warehousing, where the data changing operations take place in a more controlled manor, because the data is already known and present elsewhere, but queries are the important thing. On the other hand there are NoSQL-databases like Neo4j that support full ACID-transactions, but organize the data in a totally different way that might be more adequate than a relational database for many purposes. This shows the other motivation to go for NoSQL-databases. While the relational model is somewhat complete and there are always ways to express whatever is desired with it, it may not be a natural or efficient or convenient way to model certain data.

Some databases, like MongoDB allow additional indexing and thus efficient queries without full table scans and additional uniqueness constraints, which is a must have feature for a good transactional SQL database, but not so much for NoSQL databases in general.

Some NoSQL-databases are typically run in memory. While there are some SQL databases that run in memory as well, like H2, this kind of defeats the idea of ACID, so it is not really a full transactional SQL database anyway, at least not in this mode.

SQL databases are well known to many people and can do many jobs quite well. If the data model can be expressed well in the SQL database and the performance is sufficient, this may be a good way to go. The powerful query and data manipulation language is usually extremely helpful and usually you get only a shadow of what a powerful SQL dialect like PostgreSQL or Oracle can do when you move to a NoSQL database.

Now NoSQL databases are somewhat scary. The data is „not safe“ in them, because there are no transactions. But what about the transactional relational SQL database? If we use it through some implicit or explicit caching layers, we get rid of this feature without being aware of it. JPA or hibernate tend to do implicit caching and can even use file system persistency outside of the database as a „second level cache“. I think this is just so broken, but it is used and good applications are built with it. But the feature of the good old transactional SQL database that serves as argument to use it has silently been thrown into the garbage can in this project.

So we should think if and where we need serious transactions and then really write our software in a way that does not defeat them by caching, preferably by avoiding JPA and Hibernate. Or we should pick our database solution for the purpose based on the modelling features and performance features.

Cassandra comes in as one of the NoSQL products with very good write performance and scalability. The query language kind of looks like SQL, but it is a tiny subset. If we can live with that subset and it is relevant to get more performance than we would get from a PostgreSQL or Oracle, this might be a good way to go. An important observation is that tables in Cassandra do have primary keys. But additional unique keys cannot be defined and instead lookup tables. This works like this:

CREATE TABLE IF NOT EXISTS „A“ (
„X“ text,
„Y“ text,
„Z“ text,
„T“ text,
PRIMARY KEY(„X“)
);

CREATE TABLE IF NOT EXISTS „A_LOOKUP_BY_Y“ (
„Y“ text,
„X“ text,
PRIMARY KEY(„Y“)
);

Lookups by X can be done directly in A. Lookups by Y have to be done in two steps. First the lookup table is used to determine the value for X and the table A is used for the actual lookup of the data.

There are ways to do interesting and useful things that seem to be missing and it is usually worth investigating how they can be done. When knowing how the different tools that are available can be used efficiently and properly, it may be a good moment to actually decide what tool to use. But a serious comparison should compare well done solutions of both technologies and not just a junior’s first day program in one technology against a professional solution done by a senior.

It is always important to have a look at the different NoSQL databases to decide which ones are going to be used in the team or in the organization. Cassandra can be a useful database for this approach. Often it is good to also have a SQL database in this set of supported databases. It is easier to model for example accounts, booking and payments with them, while a other data may be better stored in adequate NoSQL databases.

By the way, while major transactional SQL databases may be somewhat interchangable in the beginning of a project, it is quite a pain to change them later or to support multiple databases. Also there are of course licensing issues, teams are needed to operate the database, databases may run better on certain operating systems and there may be some additional features that might make them more or less desirable. But the basic functionality differs less than in the case of NoSQL databases.

# Observer Effect

Scientists have to deal with the observer effect, which means that observing something actually changes it. Typically we think of quantum physics, where this effect is very strong and surprising and closely related to the Heisenberg Uncertainty Principle, but it is actually something that in a more abstract sense is present in a multitude of situtations. Just think of human interaction. If we want to find out about people, we can ask them. But this conversation actually changes the people, sometimes in a way that we can neglect or tolerate.

But we also have this in the case of IT. If we think of a software and we want to observe if the software behaves well, we need ways to observe the software. Very often we use logging, sometimes monitoring tools, and sometimes debugging or even profiling. We think that they do not hurt us, apart from using resources, but we have to be quite careful. The example of logging is quite good, because it is quite common and usually something that we do a lot, without wasting too much thoughts about it.

Now logging slows our application down that is known already. Now we tend to use a slightly less noisy log level, because terabytes of log are still a pain, even today. But usually the messages are calculated and then discarded by the logging framework. With functional language features there are quite elegant ways to deal with this, by just passing a function that calculates the message on demand instead of passing the message. It has always been possible, but too clumsy to actually do it, unless the logging framework can rely on macro facilities, even such simple ones as the C-preprocessor. The deferred evaluation has its dangers as well. If an object that is passed as an ingrediant for a potential message already changes, while the message is created, we might get funny effects. Maybe only in the log, but maybe it could crash the application or stop the main program flow from doing its work. We need to be careful, unless the object is immutable.

In case of Hibernate or JPA or similar frameworks this can be specially interesting, even with eager message calculation. Accessing attributes of the object can actually lead to database operations. They can fail. They can create load, maybe deadlocks. They can have lost their transaction. A lot of things can happen in places far away from where we assumed to do the DB-work. This actually changes the objects. Do we want such operations to occur during logging? Maybe differently depending on the log level? Immutability is our friend, especially in conjunction with JPA, but that is a long story. We may at least be lucky that we actually have some tables that we only read. We can make the objects „pseudo-immutable“, but still JPA-magic must mutate them at least during the read operation. It is tempting to let tools generate the toString-methods of objects, but it is very dangerous here. We should avoid including any potentially lazily loaded attributes in the toString-output, because otherwise they will be loaded during the logging or even worse differently depending on how we log.

The next thing is the NullPointerException during logging. It is quite common in Java, for example. And we do not want to burden our program logic with NullPointerExceptions from the logging, especially not with those that occur only sporadically. So it is a good idea to be careful and to test well. Only the combination is possibly good enough.

Modern times create more demand for some kind of real multi threading, not in the JEE-sense with a couple of EJBs that can run in parallel, but with massively parallel operations. Even though we have a multitude of logging frameworks and unifying logging frameworks and even more of them, there is a common weakness that they tend to share. Writing into one single target is achieved by some kind of synchronizing, which can slow our application down and change the timing behavior in ways that we did not desire. Asynchronous logging could be good, but in a way this only shifts the problem a bit.

# DB Persistence without UPDATE and DELETE

When exploring the usage of databases for persistence, the easiest case is a database that does only SELECT. We can cache as much as we like and it is more or less the functional immutable world brought to the database. For working on fixed data and analyzing data this can sometimes be useful.

Usually our data actually changes in some way. It has been discussed in this Blog already, that it would be possible to extend the idea of immutability to the database, which would be achieved by allowing only INSERT and SELECT. Since data can correlate, an INSERT in a table that is understood as a sub-entity via a one-to-many-relationship by the application actually is mutating the containing entity. So it is necessary to look at this in terms of the actual OR-mapping of all applications that are running on that DB schema.

Life can be simple, if we actually have self contained data as with MongoDB or by having a JSON-column in PostgreSQL, for example. Then inter-table-relations are eliminated, but of course it is not even following the first normal form. This can be OK or not, but at least there are good reasons why best practices have been introduced in the relational DB world and we should be careful about that. Another approach is to avoid the concept of sub entities and only work with IDs that are foreign keys. We can query them explicitly when needed.

An interesting approach is to have two ID-columns. One is an id, that is unique in the DB-table and increasing for newly created data. One is the entity-ID. This is shared between several records referring to different generations of the same object. New of them are generated each time we change something and persist the changes and in a simple approach we just consider the newest record with that entity-ID valid. It can of course be enhanced with validFrom and validTo. Then each access to the database also includes a timestamp, usually close to current time, but kept constant across a transaction. Only records for which validFrom <= timestamp < validTo are considered, and within these the newest. The validFrom and validTo can form disjoint intervals, but it is up to the application logic if that is needed or not. It is also possible to select the entry with the highest ID among the records with a given entityID and timestamp-validTo/From-condition. Deleting records can be simulated by this as well, by allowing a way to express a "deleted" record, which means that in case we find this deleted record by our rules, we pretend not having found anything at all. But still referential integrity is possible, because the pre-deletion-data are still there. This concept of having two IDs has been inspired by a talk on that I saw during Clojure Exchange 2017: Immutable back to front.

# Microsoft SQL Server will be available for Linux in 2017

Deutsch

Microsoft has officially announced that their database MS SQL Server will become available for Linux in 2017.

I think the time has come for this. Since the departure of Steve Ballmer Microsoft has become a little bit less religous and more pragmatic. There are good reasons to be skeptical about companies like Microsoft and Oracle, but having more competition and more choice is a good thing. Maybe the database product from Oracle is slightly better than MS SQL Server, but there are very few projects where this difference really matters. So now we have three important relational DB products: DB2, Oracle, MS-SQL-Server, PostgreSQL and MariaDB (the successor of mySQL). When starting a new project with no specific constraints about the DB I would usually look at PostgreSQL first, because it is a feature rich and powerful open source database. Since database products are usually something that cannot reasonably be changed within one software system for decades this is a good thing, because we never know what the big companies want to do in such a long time scale. If the migration to another DB product is easy, then the software does not really make use of the power and the features of the DB. And it will not be easy anyway.

There are a lot of cases where the combination of MS-SQL-Server with Linux will make a lot of sense. Since there are software systems that make use of this DB product, it gives the flexibility to run the DB on Linux servers. And maybe avoid an expensive migration to another DB product. As I already said it gives one more choice. In development environments where MS-products are commonly used, it gives one more combination. And eventually it will encourage Oracle and IBM a little bit to refrain from excessive price increases.

# Slick

Almost every serious programming language has to deal with database access, if not out of love, then at least out of practical necessity.

The theoretical background of a functional programming language is somewhat hostile to this, because pure functional langauges tend to dislike state and a database has the exact purpose to preserve state for a long time. Treating the database as somewhat external and describing its access with monads will at least please theoretical purists to some extent.

In case of Scala things are not so strict, it is allowed to leave the purely functional path where adequate. But there are reasons to follow it and to leave it only in well defined, well known, well understood and well contained exceptions… So the database itself may be acceptable.

Another intellectual and practical challenge is the fact that modern Scala architectures like to be reactive, like to uncouple things, become asynchronous. This is possible with databases, but it is very unusual for database drivers to support this. And it is actually quite contrary to the philosophy of database transactions. It can be addressed with future database drivers, but I do not expect this to be easy to use properly in conjunction with transactions.

It is worthwhile to think about
immutability and databases.
For now we can assume that the database is there and can be used.

So the question is, why not integrate existing and established persistence frameworks like JPA, Hibernate, Eclipslink and other OR-mapping-based systems into Scala, with Scala language binding?

Actually I am quite happy they did not go this path. Hibernate is broken and buggy, but the generic concept behind it is broken in my opinion. I do not think that it is theoretically correct, especially if we should program with waterproof transactions and actually defeat them by JPA-caching without even being aware of it? But for practical purposes, I actually doubt that taming JPA for non trivial applications is really easier than good old JDBC.

Anyway, Scala took the approach of building something in the lines of JDBC, but making it nicer. It is called Slick, currently Slick 3. Some links:
* Slick
* Slick 3.1.1
* Documentation
* Github
* Assynchronous DB access
* Recording of the Slick Workshop at ScalaX

What Slick basically does is building queries using Scala language. It does pack the results into Scala structures. Important is, that the DB is not hidden. So we just do the same as with JDBC, but in a more advanced way.

# How to create ISO Date String

It is a more and more common task that we need to have a date or maybe date with time as String.

There are two reasonable ways to do this:
* We may want the date formatted in the users Locale, whatever that is.
* We want to use a generic date format, that is for a broader audience or for usage in data exchange formats, log files etc.

The first issue is interesting, because it is not always trivial to teach the software to get the right locale and to use it properly… The mechanisms are there and they are often used correctly, but more often this is just working fine for the locale that the software developers where asked to support.

So now the question is, how do we get the ISO-date of today in different environments.

## Linux/Unix-Shell (bash, tcsh, …)

date "+%F"

## TeX/LaTeX

 \def\dayiso{\ifcase\day \or 01\or 02\or 03\or 04\or 05\or 06\or 07\or 08\or 09\or 10\or% 1..10 11\or 12\or 13\or 14\or 15\or 16\or 17\or 18\or 19\or 20\or% 11..20 21\or 22\or 23\or 24\or 25\or 26\or 27\or 28\or 29\or 30\or% 21..30 31\fi} \def\monthiso{\ifcase\month \or 01\or 02\or 03\or 04\or 05\or 06\or 07\or 08\or 09\or 10\or 11\or 12\fi} \def\dateiso{\def\today{\number\year-\monthiso-\dayiso}} \def\todayiso{\number\year-\monthiso-\dayiso} 
This can go into a file isodate.sty which can then be included by \include or \input Then using \todayiso in your TeX document will use the current date. To be more precise, it is the date when TeX or LaTeX is called to process the file. This is what I use for my paper letters.

## LaTeX

(From Fritz Zaucker, see his comment below):
 \usepackage{isodate} % load package \isodate % switch to ISO format \today % print date according to current format 

## Oracle

 SELECT TO_CHAR(SYSDATE, 'YYYY-MM-DD') FROM DUAL; 
On Oracle Docs this function is documented.
It can be chosen as a default using ALTER SESSION for the whole session. Or in SQL-developer it can be configured. Then it is ok to just call
 SELECT SYSDATE FROM DUAL; 

Btw. Oracle allows to add numbers to dates. These are days. Use fractions of a day to add hours or minutes.

## PostreSQL

(From Fritz Zaucker, see his comment):
 select current_date; —> 2016-01-08 
 select now(); —> 2016-01-08 14:37:55.701079+01 

## Emacs

In Emacs I like to have the current Date immediately:
 (defun insert-current-date () "inserts the current date" (interactive) (insert (let ((x (current-time-string))) (concat (substring x 20 24) "-" (cdr (assoc (substring x 4 7) cmode-month-alist)) "-" (let ((y (substring x 8 9))) (if (string= y " ") "0" y)) (substring x 9 10))))) (global-set-key [S-f5] 'insert-current-date) 
Pressing Shift-F5 will put the current date into the cursor position, mostly as if it had been typed.

## Emacs (better Variant)

(From Thomas, see his comment below):
 (defun insert-current-date () "Insert current date." (interactive) (insert (format-time-string "%Y-%m-%d"))) 

## Perl

In the Perl programming language we can use a command line call
 perl -e 'use POSIX qw/strftime/;print strftime("%F", localtime()), "\n"' 
or to use it in larger programms
 use POSIX qw/strftime/; my \$isodate_of_today = strftime("%F", localtime()); 
I am not sure, if this works on MS-Windows as well, but Linux-, Unix- and MacOS-X-users should see this working.

If someone has tried it on Windows, I will be interested to hear about it…
Maybe I will try it out myself…

## Perl 5 (second suggestion)

(From Fritz Zaucker, see his comment below):
 perl -e 'use DateTime; use 5.10.0; say DateTime->now->strftime(„%F“);‘ 

## Perl 6

(From Fritz Zaucker, see his comment below):
 say Date.today; 
or
 Date.today.say; 

## Ruby

This is even more elegant than Perl:
 ruby -e 'puts Time.new.strftime("%F")' 
will do it on the command line.
Or if you like to use it in your Ruby program, just use
 d = Time.new s = d.strftime("%F") 

Btw. like in Oracle SQL it is possible add numbers to this. In case of Ruby, you are adding seconds.

It is slightly confusing that Ruby has two different types, Date and Time. Not quite as confusing as Java, but still…
Time is ok for this purpose.

## C on Linux / Posix / Unix

 #include #include #include 

 main(int argc, char **argv) { 

 char s[12]; time_t seconds_since_1970 = time(NULL); struct tm local; struct tm gmt; localtime_r(&seconds_since_1970, &local); gmtime_r(&seconds_since_1970, &gmt); size_t l1 = strftime(s, 11, "%Y-%m-%d", &local); printf("local:\t%s\n", s); size_t l2 = strftime(s, 11, "%Y-%m-%d", &gmt); printf("gmt:\t%s\n", s); exit(0); } 
This speeks for itself..
But if you like to know: time() gets the seconds since 1970 as some kind of integer.
localtime_r or gmtime_r convert it into a structur, that has seconds, minutes etc as separate fields.
stftime formats it. Depending on your C it is also possible to use %F.

## Scala

 import java.util.Date import java.text.SimpleDateFormat ... val s : String = new SimpleDateFormat("YYYY-MM-dd").format(new Date()) 
This uses the ugly Java-7-libraries. We want to go to Java 8 or use Joda time and a wrapper for Scala.

## Java 7

 import java.util.Date import java.text.SimpleDateFormat

 

... String s = new SimpleDateFormat("YYYY-MM-dd").format(new Date()); 
Please observe that SimpleDateFormat is not thread safe. So do one of the following:
* initialize it each time with new
* make sure you run only single threaded, forever
* use EJB and have the format as instance variable in a stateless session bean
* protect it with synchronized
* protect it with locks
* make it a thread local variable

In Java 8 or Java 7 with Joda time this is better. And the toString()-method should have ISO8601 as default, but off course including the time part.

## Summary

This is quite easy to achieve in many environments.
I could provide more, but maybe I leave this to you in the comments section.
What could be interesting:
* better ways for the ones that I have provided
* other databases
* other editors (vim, sublime, eclipse, idea,…)
* Office packages (Libreoffice and MS-Office)
* C#
* F#
* Clojure
* C on MS-Windows
* Perl and Ruby on MS-Windows
* Java 8
* Scala using better libraries than the Java-7-library for this
* Java using better libraries than the Java-7-library for this
* C++
* PHP
* Python
* Cobol
* JavaScript
* …
If you provide a reasonable solution I will make it part of the article with a reference…

# Creating Unique Numbers

Many software systems rely on some kind of unique numbers. Uniqueness is always a question in what universe this uniqueness is required. We do see the different kinds of universes in the case of the IP-addresses. In theory they are world wide unique. In practice we have mechanisms in place like NAT, that use certain dedicated IP-ranges for an intranet and map them to publicly available addresses for internet traffic outside the intranet. So we already have two kinds of universes… This is typical.

A common case are database IDs, that are often used as primary keys in databases. I do challenge this by the question, if there is a natural key already available in the data, which might make this db internal id unnecessary, but more often than not DB tables do have these ID columns as primary keys. They have to be unique within the DB table. Which can mean across several servers, because somewhat distributed databases are common.

Other examples are message ids of emails. They may be quite long, a short line of text is acceptable and they should be world wide unique. Combining the fully qualified publicly accessible hostname and a time stamp and a counter is usually good enough. They look like this 5AE.630049D.2EA050907@gmx.net or 5AE.630049D.2EA050907@ms-93424.gmx.net, where the part after the „@“ stands for the mail server and the part in front is a unique id for the message created by the mail server and looking slightly different depending on its software.

Often the length is not arbitrary and the UUIDs are a good compromise for this. They have 128 bits, some of which are used to specify the type of UUID. One type combines a hostname and timestamp like the message ids. But since in some contexts the generation of the UUID should not reveal the hostname and the time, some implementations prefer random UUIDs. It can very well be argued that with good random numbers duplicates of such random UUIDs are less likely than events that would bother us much more than having a duplicate. For randomly generated UUIDs six bits are used up for expressing the version and the fact that it is a random UUID, leaving 122 bits, which is a total of different possible values. Generating billions of UUIDs for many years leaves the risk of creating duplicates acceptably low.

But the issue of the quality of the random generator and the issue of potential duplicates remain something that needs attention. So it is worth to consider the path of using the host and timestamp. Now the host can not be identified by an IP address or fully qualified domain and host name, because these tend to be either too long or not unique enough. The MAC address used to be a good possibility. But I would not be so sure about this any more. Most server systems are virtualized these days and the MAC address is configured by software, so duplicates can occur accidentally or deliberately. Using a time stamp by itself can be a problem too because sooner or later it will happen that two IDs are generated at the same time, within the given granularity. Machines have several processors, run several processes and several threads within each process.

So achieving the goal of a real globally unique UUID value remains a difficult question. Following a more local uniqueness within an application or application landscape might be more reasonable. The number of servers may be large and may vary, but it should be possible to assign numbers to virtual or real servers. In case there are multiple different processes on the same (virtual) machine to assign numbers to these as well. This can be used as a replacement for the host part of the UUID. If it does not use up all the bits, these can be filled up with random numbers.

Timestamps can be obtained easily and relatively reliably for a granularity of msec (Milliseconds). The UUID timestamp allows up to a granularity of 100 nsec, which is 10’000 sub divisions of the msec. A thread safe counter that may reset during program start or with its overflow can be used to count and its positive remainder modulo 10’000 can be used instead of the 100 nsec part in conjunction with the msec.

Often a uniqueness within an application or application landscape can be achieved by using some kind of unique counter. The best choice is often the sequence of a database, which is good in this task and well tested. It is not too hard to create such a functionality. Handling of multiple processes and threads needs to be addressed. For persistence, it can be an improvement to reserve blocks of 100 or 1000 numbers and persist less often. This will result in skipping some numbers when restarting, but otherwise work out well. The same idea can also be applied for a distributed unique number generator, where each instances gets ranges from some master generator and gets new ranges, when they are used up.

Such unique numbers or identifiers are needed quite often. It is usually best to use something that works reliably, like the DB sequence. But it can be developed with adequate care, if there is a need. Testing and especially automated testing is off course very important, but only sufficient if the whole implementation is conceptionally sound and robust.

# Indexing of Database Tables II (additional indices)

Additional indices („indexes“ in Oracle’s English) apart from the primary key are extremely important for the performance of databases.
There is this magic behind it, when a complicated query with multiple joins runs slowly and magically becomes faster when adding the right index and the right hint or does not become faster at all, even though the index being added could be so useful.. I leave this to the DBAs.

Some thoughts about indexing in general. Each index costs a little bit of storage and a little bit of performance when writing to the table, especially when inserting or when updating columns that are included in the index. This is more, if the amount of data per row in the index is more. And the helpfulness of the index is more, if it allows drilling down to a much smaller number of rows, not necessarily one. I would call this selectiveness of the index.

An interesting issue arises when combining multiple indices and there is interest in selecting by any one of them or their conjunction. This leads to Z-curve based indices, an issue worth an article by itself, maybe in the future.

Generally it is desirable to only include fields in the index that contribute to selecting the data. That means that adding this column to the where criteria of a select significantly reduces the number of records found. If that is not the case, the overhead of maintaining this field as part of the index and blowing up the index operations for both read and write access is probably not worth it. Sometimes a field with very little selectiveness is included in an index that is supporting a unique constraint. That may be ok, if the table is small, but for larger tables, rethinking the database design should be considered. I have seen this happening when several quite different kinds of records where stored in the same table. Splitting up to different tables would have resolved this. Other cases require different answers.

An interesting issue is also a situation, where many selects contain columns A and B in the WHERE-clause. By itself column A as well as column B are quite selective. But they are quite strongly correlated, so adding the other of the two only marginally contributes to the selectiveness of the index. In this case it should be considered, which of the three options, having A, B or both in the index is best. Try it out, discuss it with a good DBA. Yes, I have seen many people calling themselves DBA who were not really good in their area and I have learned what really good DBAs can do. Get one for a few days, when designing a serious database…

# Indexing of Database Tables I (Primary Keys)

Any useful databases with non trivial amounts of data have some indexing mechanism in place that helps finding data based on some key values, because a full table scane or something like that is too expensive and too slow to do it often.

The most typical situation is having a key and one or more values behind that. This applies to relational and NoSQL databases.
Typical database access layers for programming languages like ActiveRecord for Ruby or JPA/Hibernate for Java favor the pattern of having one index column called „id“ that is numeric and gets successive values assigned when new entries are created.

This is good, because the id is always unique and it can be handled and stored efficiently, especially when being used for relationships between different tables. It is off course a good idea to think about the size of integers to use for this, so that the number range is not exhausted. Depending on the DB product this can produced automatically by the DB when inserting if the column is declared as identity column or autoincrement or something like that (MySQL, MS-SQLServer) or it can be created by using a sequence (Oracle, PostgreSQL). It could be considered to use just one global sequence instead of one per table, so the id is unique across tables, which might have some advantages, but it kind of unusual.

Another common approach is to use UUIDs instead of such numeric ids. They have the advantage of being kind of globally unique. They have 128 bits and there are two approaches for obtaining these 128 bit numbers. Some bits are used to express version and type of UUID, but most of them can be used for differentiating from other UUIDs. They can be generated using random numbers or using some identity of the current host combined with time stamp, counter and possibly thread and process numbers and filling some digits with random numbers… When having around a billion (German „Milliarde“, ) entries, the probability of having a collision with a single additional randomly generated UUID is less than . So the risk of being struck by a lightning is higher than the risk of getting a collision by creating the same random UUID twice, if the random number generator is really good. It is quite possible to build big systems based on that and run them for a long time without problems by duplicate UUIDs. If that is good enough or not is a philosophical question, but I prefer the other way and finding a way to generate a UUID that is unique across hosts in the current and expected future system environment. Quite honestly, I do not see the benefit of using UUIDs as ID fields and primary keys in databases, because the numeric IDs can be handled more efficiently and mechanisms of generating them uniquely are reasonably easy to implement.

It is often forgotten that data actually might contain internal unique keys that can be used. Often a autoincrement numeric ID is used as a primary key and these semantic keys are declared as unique constraint or at least with a non unique index to allow searching. This has the advantage that changing these values is possible by accessing only one table and without breaking referential consistency. But for large tables I do recommend to consider using these natural keys instead of the ID as primary key for the sake of efficiency. Serious databases can also deal with combined keys consisting of several columns that are unique only in their combination.

An obvious stupidity that I am only mentioning because I have actually seen it, is the concatenation of all the fields forming the logical primary key to some column called „HANDLE“ and using that as a primary key. A 40 year old junior developer with the authority of a senior lead architect could impose such a thing on the database of a project that I have worked in in the past and he could degrade the whole software by this, because it just wasted half of the database space and a lot of performance for a product that should have performed really well to be useful to its customers.

I would also resist the temptation to use something like oid in PostgreSQL or the ROW_ID of Oracle as key. These values can change when moving data to other table spaces or restoring a backup or even upgrading to another version of the DBMS product. They might be good for temporary use, probably within a transaction or within a set of closely related transactions occurring within a few minutes or hours.

Additional indexes are an interesting issue, which I might deal with in another article in the future.

# Transaction Isolation Levels

Deutsch

This blog post is based on a talk that I have given in the Ruby on Rails User Group.

## Who knows Transactions?

Who knows really well what transactions are?

# What is a Transaction?

Just common sense what transactions mean:

• We open an transaction (implicitely or explicitely)
• We perform a certain number of operations (for example SQL-statements)
• We decide in the end to either „roll-back“ or „commit“
• Roll-back restores the state before the transaction
• Commit accepts the changes for good

## That’s it?

• So, that’s it, we are done…
• Just scrap these stupid confusing advanced stuff, do it simple…

## Simple world

• In the simple world, only one DB-user at a time exists
• Then we are really almost done..

## Complicated World

• In a complicated world we have parallel activities taking place:
• Multiple threads, processes or external database clients simultanously using the database

## Complications of Complication

• What happens now?
• When things get parallel, everybody is good in doing that, but it usually does not work.
• So DB-developers are really good, because databases do actually sometimes work quite well, even under load…
• What happens with incomplete transactions?
• What happens with complete transactions during one complicated read query?

## Naïve Requirement

• Other clients should see an transaction either completely or not at all.
• Is that enough?
• Think of an auditorium for a boring lecture: 90 min, people continuously leaving the room (so boring), but at the end it is still quite full.
• How many people are in the room?
• Counting is hard, because more people leave during the count.

## Snapshot too old

• Oracle has solved that problem of counting the students and similar ones…
• The SELECT-statement gets for its duration a consistent data content, based on what was in the DB when the statement started.
• Transactions that have been committed successfully after starting the SELECT are not taken into account
• This is called snapshot. Long SELECTs that work on data that is actively changed sometimes fail with „Snapshot too old“

## Now the hardcore stuff

• Transaction Isolation Levels
• ACID

• „Dirty read“ refers to uncommitted data being available for reading by other transactions. Sounds cool, right? Intuitive?
• Interesting question: can you read your own uncommitted data within the running transaction?

• „Non-repeatable read“ refers to a reading operations on the same piece of data that happen during the same transaction (or SELECT statement in the Oracle world) and give different results.
• With Oracle this should not happen, we get snapshot too old instead.

• Data records are the same
• But the set of records may change

• I do not know of any case where this is done
• You can configure it and may get „better“ than that

• Read data only from completed transaction
• Default for many databases, like PostgreSQL

• During one transaction data does not change
• Phantom reads are still possible
• Default for many databases, for example Oracle

## Isolation Level: Serializable

• Pretend all transactions are queued and only one at a time is happening

## Real databases

• They have some default
• But may always be „better“ than that
• So mixtures are possible
• For example Oracle has „read-only“ as isolation level…

## Locking vs. Multiversion

• Two mechanisms:
• Locking of parts of the data
• Multiversion: multiple copies (maybe copy on write?) around, for each transaction… Have fun when merging. 😉

## Different DB products

• Different DB products use the freedom to interpret the standard quite liberally
• Oracle is the standard, they only pretend to care about SQL-standard…
• Make a wise decision which DB to support with your software in the beginning
• Remember: some commercial DB vendors might not have any customers, they just have hostages

## ACID

ACIDs in chemistry are substances with a pH-value < 7.
Here acronym for

• Atomicity
• Consistency
• Isolation
• Durability