Microsoft SQL Server will be available for Linux in 2017


Microsoft has officially announced that their database MS SQL Server will become available for Linux in 2017.

I think the time has come for this. Since the departure of Steve Ballmer, Microsoft has become a little bit less religious and more pragmatic. There are good reasons to be skeptical about companies like Microsoft and Oracle, but having more competition and more choice is a good thing. Maybe the database product from Oracle is slightly better than MS SQL Server, but there are very few projects where this difference really matters. So now we have five important relational DB products: DB2, Oracle, MS SQL Server, PostgreSQL and MariaDB (the successor of MySQL). When starting a new project with no specific constraints about the DB, I would usually look at PostgreSQL first, because it is a feature-rich and powerful open source database. Since the database product is usually something that cannot reasonably be changed within one software system for decades, this is a good thing, because we never know what the big companies will do on such a long time scale. If the migration to another DB product is easy, then the software does not really make use of the power and the features of the DB. And it will not be easy anyway.

There are a lot of cases where the combination of MS SQL Server with Linux will make a lot of sense. For software systems that already make use of this DB product, it gives the flexibility to run the DB on Linux servers, and maybe to avoid an expensive migration to another DB product. As I already said, it gives one more choice. In development environments where MS products are commonly used, it gives one more combination. And it may eventually encourage Oracle and IBM a little bit to refrain from excessive price increases.

* MS official: Announcing SQL Server on Linux


Slick

Almost every serious programming language has to deal with database access, if not out of love, then at least out of practical necessity.

The theoretical background of a functional programming language is somewhat hostile to this, because pure functional languages tend to dislike state, and a database has the exact purpose of preserving state for a long time. Treating the database as somewhat external and describing its access with monads will at least please theoretical purists to some extent.

In the case of Scala things are not so strict; it is allowed to leave the purely functional path where adequate. But there are reasons to follow it and to leave it only in well-defined, well-known, well-understood and well-contained exceptions… So the database itself may be acceptable.

Another intellectual and practical challenge is the fact that modern Scala architectures like to be reactive, like to decouple things and become asynchronous. This is possible with databases, but it is very unusual for database drivers to support it. And it is actually quite contrary to the philosophy of database transactions. It can be addressed with database drivers that work with futures, but I do not expect this to be easy to use properly in conjunction with transactions.

It is worthwhile to think about immutability and databases. For now we can assume that the database is there and can be used.

So the question is: why not integrate existing and established persistence frameworks like JPA, Hibernate, EclipseLink and other OR-mapping-based systems into Scala, with a Scala language binding?

Actually I am quite happy they did not go down this path. Hibernate is broken and buggy, but beyond that, the generic concept behind it is broken in my opinion. I do not think that it is theoretically correct: we are supposed to program with waterproof transactions, yet JPA caching can actually defeat them without us even being aware of it. And for practical purposes, I actually doubt that taming JPA for non-trivial applications is really easier than good old JDBC.

Anyway, Scala took the approach of building something along the lines of JDBC, but making it nicer. It is called Slick, currently Slick 3. Some links:
* Slick
* Slick 3.1.1
* Documentation
* Github
* Asynchronous DB access
* Recording of the Slick Workshop at ScalaX

What Slick basically does is build queries using the Scala language and pack the results into Scala structures. It is important that the DB is not hidden. So we do just the same as with JDBC, but in a more advanced way.
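
To give an idea, here is a minimal sketch of my own (not from the Slick documentation; table, column and configuration names are made up) of how such a query is built and run asynchronously with Slick 3:

import slick.driver.H2Driver.api._   // with newer Slick versions: slick.jdbc.H2Profile.api._
import scala.concurrent.Await
import scala.concurrent.duration._

object SlickSketch extends App {
  // hypothetical table definition mapping to a table PERSON(ID, NAME)
  class Persons(tag: Tag) extends Table[(Long, String)](tag, "PERSON") {
    def id   = column[Long]("ID", O.PrimaryKey)
    def name = column[String]("NAME")
    def *    = (id, name)
  }
  val persons = TableQuery[Persons]

  val db = Database.forConfig("mydb")                 // configuration key is made up
  val query = persons.filter(_.name.startsWith("A"))  // only builds the SQL, nothing runs yet
  val future = db.run(query.result)                   // runs asynchronously, returns a Future of the rows
  println(Await.result(future, 10.seconds))
  db.close()
}

The query value is just a description of the SQL; only db.run actually talks to the database and it does so asynchronously, which fits the reactive style mentioned above.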

This article is inspired by my visit to ScalaX.


How to create ISO Date String

It is a more and more common task that we need a date, or maybe a date with time, as a String.

There are two reasonable ways to do this:
* We may want the date formatted in the user's locale, whatever that is.
* We may want to use a generic date format, for a broader audience or for usage in data exchange formats, log files etc.

The first issue is interesting, because it is not always trivial to teach the software to get the right locale and to use it properly… The mechanisms are there and they are often used correctly, but more often this just works fine for the locales that the software developers were asked to support.

So now the question is: how do we get the ISO date of today in different environments?

Linux/Unix-Shell (bash, tcsh, …)

date "+%F"

TeX/LaTeX


\def\dayiso{\ifcase\day \or
01\or 02\or 03\or 04\or 05\or 06\or 07\or 08\or 09\or 10\or% 1..10
11\or 12\or 13\or 14\or 15\or 16\or 17\or 18\or 19\or 20\or% 11..20
21\or 22\or 23\or 24\or 25\or 26\or 27\or 28\or 29\or 30\or% 21..30
31\fi}
\def\monthiso{\ifcase\month \or
01\or 02\or 03\or 04\or 05\or 06\or 07\or 08\or 09\or 10\or 11\or 12\fi}
\def\dateiso{\def\today{\number\year-\monthiso-\dayiso}}
\def\todayiso{\number\year-\monthiso-\dayiso}

This can go into a file isodate.sty, which can then be included with \include or \input. Then using \todayiso in your TeX document will print the current date. To be more precise, it is the date when TeX or LaTeX is called to process the file. This is what I use for my paper letters.

LaTeX

(From Fritz Zaucker, see his comment below):

\usepackage{isodate} % load package
\isodate % switch to ISO format
\today % print date according to current format

Oracle


SELECT TO_CHAR(SYSDATE, 'YYYY-MM-DD') FROM DUAL;

This function is documented in the Oracle docs.
The format can be chosen as a default for the whole session using ALTER SESSION, or it can be configured in SQL Developer. Then it is OK to just call

SELECT SYSDATE FROM DUAL;

By the way, Oracle allows adding numbers to dates; these are interpreted as days. Use fractions of a day to add hours or minutes.

PostgreSQL

(From Fritz Zaucker, see his comment):

select current_date;
—> 2016-01-08


select now();
—> 2016-01-08 14:37:55.701079+01

Emacs

In Emacs I like to have the current date available immediately:

(defvar cmode-month-alist
  '(("Jan" . "01") ("Feb" . "02") ("Mar" . "03") ("Apr" . "04")
    ("May" . "05") ("Jun" . "06") ("Jul" . "07") ("Aug" . "08")
    ("Sep" . "09") ("Oct" . "10") ("Nov" . "11") ("Dec" . "12"))
  "Maps the month abbreviations of `current-time-string' to two-digit numbers.")

(defun insert-current-date ()
  "Insert the current date in ISO format."
  (interactive)
  (insert
   (let ((x (current-time-string)))
     (concat (substring x 20 24)
             "-"
             (cdr (assoc (substring x 4 7)
                         cmode-month-alist))
             "-"
             (let ((y (substring x 8 9)))
               (if (string= y " ") "0" y))
             (substring x 9 10)))))
(global-set-key [S-f5] 'insert-current-date)

Pressing Shift-F5 will insert the current date at the cursor position, just as if it had been typed.

Emacs (better Variant)

(From Thomas, see his comment below):

(defun insert-current-date ()
"Insert current date."
(interactive)
(insert (format-time-string "%Y-%m-%d")))

Perl

In Perl we can use a command line call

perl -e 'use POSIX qw/strftime/;print strftime("%F", localtime()), "\n"'

or, to use it in larger programs:

use POSIX qw/strftime/;
my $isodate_of_today = strftime("%F", localtime());

I am not sure if this works on MS Windows as well, but Linux, Unix and MacOS X users should see it working.

If someone has tried it on Windows, I will be interested to hear about it…
Maybe I will try it out myself…

Perl 5 (second suggestion)

(From Fritz Zaucker, see his comment below):

perl -e 'use DateTime; use 5.10.0; say DateTime->now->strftime("%F");'

Perl 6

(From Fritz Zaucker, see his comment below):

say Date.today;

or

Date.today.say;

Ruby

This is even more elegant than Perl:

ruby -e 'puts Time.new.strftime("%F")'

will do it on the command line.
Or if you like to use it in your Ruby program, just use

d = Time.new
s = d.strftime("%F")

By the way, like in Oracle SQL, it is possible to add numbers to this. In the case of Ruby, you are adding seconds.

It is slightly confusing that Ruby has two different types, Date and Time. Not quite as confusing as Java, but still…
Time is ok for this purpose.

C on Linux / Posix / Unix


#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv) {

  char s[12];
  time_t seconds_since_1970 = time(NULL);
  struct tm local;
  struct tm gmt;
  localtime_r(&seconds_since_1970, &local);
  gmtime_r(&seconds_since_1970, &gmt);
  size_t l1 = strftime(s, 11, "%Y-%m-%d", &local);
  printf("local:\t%s\n", s);
  size_t l2 = strftime(s, 11, "%Y-%m-%d", &gmt);
  printf("gmt:\t%s\n", s);
  exit(0);
}

This speaks for itself…
But if you like to know: time() returns the seconds since 1970 as some kind of integer.
localtime_r or gmtime_r convert it into a structure that has seconds, minutes etc. as separate fields.
strftime formats it. Depending on your C library it is also possible to use %F.

Scala


import java.util.Date
import java.text.SimpleDateFormat
...
val s : String = new SimpleDateFormat("yyyy-MM-dd").format(new Date()) // "yyyy" (year), not "YYYY" (week year)

This uses the ugly Java 7 libraries. We want to go to Java 8 or use Joda Time and a wrapper for Scala.
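
With Java 8 the java.time classes do this more cleanly; a small sketch of what that could look like from Scala (java.time is part of the JDK, not a Scala library):

import java.time.LocalDate

val s : String = LocalDate.now().toString   // LocalDate.toString is already ISO 8601, e.g. 2016-01-08
// or explicitly: LocalDate.now().format(java.time.format.DateTimeFormatter.ISO_LOCAL_DATE)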

Java 7


import java.util.Date
import java.text.SimpleDateFormat

...
String s = new SimpleDateFormat("yyyy-MM-dd").format(new Date()); // "yyyy" (year), not "YYYY" (week year)

Please observe that SimpleDateFormat is not thread safe. So do one of the following:
* initialize it each time with new
* make sure you run only single threaded, forever
* use EJB and have the format as instance variable in a stateless session bean
* protect it with synchronized
* protect it with locks
* make it a thread local variable (see the sketch below)
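
For illustration, a minimal sketch of the last option; I write it in Scala here, matching the Scala section above, but the same idea works in Java with ThreadLocal:

import java.text.SimpleDateFormat
import java.util.Date

object IsoDateFormat {
  // one SimpleDateFormat per thread, so the non-thread-safe instance is never shared
  private val formatHolder = new ThreadLocal[SimpleDateFormat] {
    override def initialValue(): SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd")
  }

  def format(d: Date): String = formatHolder.get().format(d)
}

Calling IsoDateFormat.format(new Date()) is then safe from any thread.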

In Java 8, or in Java 7 with Joda Time, this is better. And the toString() method gives ISO 8601 by default, though of course including the time part.

Summary

This is quite easy to achieve in many environments.
I could provide more, but maybe I leave this to you in the comments section.
What could be interesting:
* better ways for the ones that I have provided
* other databases
* other editors (vim, sublime, eclipse, idea,…)
* Office packages (Libreoffice and MS-Office)
* C#
* F#
* Clojure
* C on MS-Windows
* Perl and Ruby on MS-Windows
* Java 8
* Scala using better libraries than the Java-7-library for this
* Java using better libraries than the Java-7-library for this
* C++
* PHP
* Python
* Cobol
* JavaScript
* …
If you provide a reasonable solution I will make it part of the article with a reference…
See also Date Formats


Creating Unique Numbers

Many software systems rely on some kind of unique numbers. Uniqueness always raises the question of the universe in which this uniqueness is required. We see the different kinds of universes in the case of IP addresses. In theory they are world-wide unique. In practice we have mechanisms in place like NAT that use certain dedicated IP ranges for an intranet and map them to publicly available addresses for internet traffic outside the intranet. So we already have two kinds of universes… This is typical.

A common case is database IDs, which are often used as primary keys in databases. I do challenge this with the question whether there is a natural key already available in the data, which might make this DB-internal ID unnecessary, but more often than not DB tables do have these ID columns as primary keys. They have to be unique within the DB table, which can mean across several servers, because somewhat distributed databases are common.

Other examples are message IDs of emails. They may be quite long (a short line of text is acceptable) and they should be world-wide unique. Combining the fully qualified publicly accessible hostname, a timestamp and a counter is usually good enough. They look like 5AE.630049D.2EA050907@gmx.net or 5AE.630049D.2EA050907@ms-93424.gmx.net, where the part after the "@" stands for the mail server and the part in front of it is a unique ID for the message created by the mail server, looking slightly different depending on its software.

Often an arbitrary length is not acceptable, and UUIDs are a good compromise for this. They have 128 bits, some of which are used to specify the type of UUID. One type combines a hostname and timestamp like the message IDs. But since in some contexts the generation of the UUID should not reveal the hostname and the time, some implementations prefer random UUIDs. It can very well be argued that with good random numbers, duplicates of such random UUIDs are less likely than events that would bother us much more than having a duplicate. For randomly generated UUIDs six bits are used up for expressing the version and the fact that it is a random UUID, leaving 122 bits, which gives a total of 2^{122} = 5316911983139663491615228241121378304 \approx 5.317\cdot 10^{36} different possible values. Generating billions of UUIDs for many years leaves the risk of creating duplicates acceptably low.
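
As a small illustration (my own, not from the article): the JDK can generate such random UUIDs directly, and the version and variant bits mentioned above can be inspected:

import java.util.UUID

val u = UUID.randomUUID()
println(u)            // e.g. 7d444840-9dc0-41a8-98a1-... (the third group starts with the version digit 4)
println(u.version())  // 4 for randomly generated UUIDs
println(u.variant())  // 2 for the usual RFC 4122 (Leach-Salz) variant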

But the issues of the quality of the random generator and of potential duplicates remain something that needs attention. So it is worth considering the path of using the host and a timestamp. Now the host cannot be identified by an IP address or a fully qualified domain and host name, because these tend to be either too long or not unique enough. The MAC address used to be a good possibility, but I would not be so sure about this any more. Most server systems are virtualized these days and the MAC address is configured by software, so duplicates can occur accidentally or deliberately. Using a timestamp by itself can be a problem too, because sooner or later it will happen that two IDs are generated at the same time, within the given granularity. Machines have several processors and run several processes and several threads within each process.

So achieving the goal of a really globally unique UUID value remains a difficult question. Aiming for a more local uniqueness within an application or application landscape might be more reasonable. The number of servers may be large and may vary, but it should be possible to assign numbers to virtual or real servers. In case there are multiple different processes on the same (virtual) machine, numbers can be assigned to these as well. This can be used as a replacement for the host part of the UUID. If it does not use up all the bits, these can be filled up with random numbers.

Timestamps can be obtained easily and relatively reliably with a granularity of milliseconds (msec). The UUID timestamp allows a granularity of up to 100 nsec, which is 10'000 subdivisions of the msec. A thread-safe counter that may reset during program start or on overflow can be used, and its positive remainder modulo 10'000 can be used instead of the 100 nsec part, in conjunction with the msec.
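
A minimal sketch of this idea (my own, not from the article): the millisecond timestamp is combined with a thread-safe counter whose positive remainder modulo 10'000 stands in for the 100 nsec field; node or process identification is left out here:

import java.util.concurrent.atomic.AtomicLong

object TimestampWithCounter {
  private val counter = new AtomicLong(0L)

  // current time in units of 100 nsec "ticks", made unique per JVM by the counter
  def nextTimestamp(): Long = {
    val millis = System.currentTimeMillis()
    val sub = (counter.getAndIncrement() % 10000 + 10000) % 10000 // positive remainder, also after overflow
    millis * 10000 + sub
  }
}

Within one JVM this stays unique as long as fewer than 10'000 values are requested per millisecond.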

Often uniqueness within an application or application landscape can be achieved by using some kind of unique counter. The best choice is often a database sequence, which is good at this task and well tested. It is not too hard to create such a functionality oneself, but the handling of multiple processes and threads needs to be addressed. For persistence, it can be an improvement to reserve blocks of 100 or 1000 numbers and to persist less often. This will result in skipping some numbers when restarting, but otherwise works out well. The same idea can also be applied for a distributed unique number generator, where each instance gets ranges from some master generator and gets new ranges when they are used up.
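
A sketch of the block reservation idea (my own, not from the article); how the reservation is persisted (a master table, a DB sequence with a large increment, …) is hidden behind a function parameter here:

class BlockAllocatingIdGenerator(blockSize: Long, reserveBlock: Long => Long) {
  // reserveBlock(blockSize) persists the reservation and returns the first number of the new block
  private var next: Long = 0L
  private var limit: Long = 0L

  def nextId(): Long = synchronized {
    if (next >= limit) {
      next = reserveBlock(blockSize)
      limit = next + blockSize
    }
    val id = next
    next += 1
    id
  }
}

After a restart the numbers between the last id handed out and the persisted upper bound are skipped, as described above.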

Such unique numbers or identifiers are needed quite often. It is usually best to use something that works reliably, like the DB sequence. But it can be developed with adequate care, if there is a need. Testing, and especially automated testing, is of course very important, but it is only sufficient if the whole implementation is conceptually sound and robust.


Indexing of Database Tables II (additional indices)

Additional indices („indexes“ in Oracle’s English) apart from the primary key are extremely important for the performance of databases.
There is some magic behind it: a complicated query with multiple joins runs slowly and magically becomes faster when the right index and the right hint are added, or it does not become faster at all, even though the index being added looks so useful… I leave this to the DBAs.

Some thoughts about indexing in general: each index costs a little bit of storage and a little bit of performance when writing to the table, especially when inserting or when updating columns that are included in the index. This cost grows with the amount of data per row in the index. And the index is more helpful if it allows drilling down to a much smaller number of rows, not necessarily one. I would call this the selectiveness of the index.

An interesting issue arises when combining multiple indices and there is interest in selecting by any one of them or their conjunction. This leads to Z-curve based indices, an issue worth an article by itself, maybe in the future.

Generally it is desirable to only include fields in the index that contribute to selecting the data. That means that adding this column to the WHERE criteria of a select significantly reduces the number of records found. If that is not the case, the overhead of maintaining this field as part of the index and blowing up the index operations for both read and write access is probably not worth it. Sometimes a field with very little selectiveness is included in an index that supports a unique constraint. That may be OK if the table is small, but for larger tables, rethinking the database design should be considered. I have seen this happen when several quite different kinds of records were stored in the same table. Splitting them up into different tables would have resolved this. Other cases require different answers.

An interesting issue is also the situation where many selects contain columns A and B in the WHERE clause. By themselves, column A as well as column B are quite selective. But they are quite strongly correlated, so adding the other of the two only marginally contributes to the selectiveness of the index. In this case it should be considered which of the three options (having A, B or both in the index) is best. Try it out, discuss it with a good DBA. Yes, I have seen many people calling themselves DBA who were not really good in their area, and I have learned what really good DBAs can do. Get one for a few days when designing a serious database…


Indexing of Database Tables I (Primary Keys)

Any useful database with non-trivial amounts of data has some indexing mechanism in place that helps find data based on some key values, because a full table scan or something like that is too expensive and too slow to do often.

The most typical situation is having a key and one or more values behind that. This applies to relational and NoSQL databases.
Typical database access layers for programming languages like ActiveRecord for Ruby or JPA/Hibernate for Java favor the pattern of having one index column called „id“ that is numeric and gets successive values assigned when new entries are created.

This is good, because the ID is always unique and can be handled and stored efficiently, especially when being used for relationships between different tables. It is of course a good idea to think about the size of the integers used for this, so that the number range is not exhausted. Depending on the DB product, such an ID can be produced automatically by the DB when inserting, if the column is declared as an identity column or autoincrement or something like that (MySQL, MS SQL Server), or it can be created by using a sequence (Oracle, PostgreSQL). It could be considered to use just one global sequence instead of one per table, so the ID is unique across tables, which might have some advantages, but it is kind of unusual.
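
As a sketch (my own; PostgreSQL syntax and made-up names, accessed with plain JDBC from Scala), a sequence is created once and then asked for the next value whenever a new row is inserted:

import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:postgresql://localhost/mydb", "user", "password")
val stmt = conn.createStatement()
stmt.execute("CREATE SEQUENCE person_seq")                  // done once, typically in a migration script
val rs = stmt.executeQuery("SELECT nextval('person_seq')")  // unique even with many concurrent clients
rs.next()
val id = rs.getLong(1)                                      // use this value as the ID of the new row
conn.close()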

Another common approach is to use UUIDs instead of such numeric IDs. They have the advantage of being kind of globally unique. They have 128 bits and there are two approaches for obtaining these 128-bit numbers. Some bits are used to express the version and type of the UUID, but most of them can be used for differentiating from other UUIDs. They can be generated using random numbers, or using some identity of the current host combined with a timestamp, a counter and possibly thread and process numbers, filling the remaining digits with random numbers… When having around a billion (German „Milliarde", 10^9) entries, the probability of having a collision with a single additional randomly generated UUID is less than 10^{-27}. So the risk of being struck by lightning is higher than the risk of getting a collision by creating the same random UUID twice, if the random number generator is really good. It is quite possible to build big systems based on that and run them for a long time without problems caused by duplicate UUIDs. Whether that is good enough or not is a philosophical question, but I prefer the other way: finding a way to generate a UUID that is unique across hosts in the current and expected future system environment. Quite honestly, I do not see the benefit of using UUIDs as ID fields and primary keys in databases, because the numeric IDs can be handled more efficiently and mechanisms for generating them uniquely are reasonably easy to implement.
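
For illustration, the quoted bound follows from a simple union-bound estimate (my own calculation): with n = 10^9 existing UUIDs, the probability that one additional random UUID collides with any of them is at most n / 2^{122} = 10^9 / (5.317\cdot 10^{36}) \approx 1.9\cdot 10^{-28}, which is indeed less than 10^{-27}.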

It is often forgotten that the data actually might contain internal unique keys that can be used. Often an autoincrement numeric ID is used as a primary key, and these semantic keys are declared as a unique constraint, or at least with a non-unique index to allow searching. This has the advantage that changing these values is possible by accessing only one table and without breaking referential consistency. But for large tables I do recommend considering the use of these natural keys instead of the ID as primary key for the sake of efficiency. Serious databases can also deal with combined keys consisting of several columns that are unique only in their combination.

An obvious stupidity, which I am only mentioning because I have actually seen it, is concatenating all the fields forming the logical primary key into some column called „HANDLE" and using that as the primary key. A 40-year-old junior developer with the authority of a senior lead architect could impose such a thing on the database of a project that I have worked in in the past, and he could degrade the whole software by this, because it just wasted half of the database space and a lot of performance for a product that should have performed really well to be useful to its customers.

I would also resist the temptation to use something like the oid in PostgreSQL or the ROWID of Oracle as a key. These values can change when moving data to other tablespaces, when restoring a backup or even when upgrading to another version of the DBMS product. They might be good for temporary use, probably within a transaction or within a set of closely related transactions occurring within a few minutes or hours.

Additional indexes are an interesting issue, which I might deal with in another article in the future.


Transaction Isolation Levels


This blog post is based on a talk that I have given in the Ruby on Rails User Group.

Who knows Transactions?

Who knows really well what transactions are?

What is a Transaction?

Just common sense what transactions mean:

  • We open a transaction (implicitly or explicitly)
  • We perform a certain number of operations (for example SQL-statements)
  • We decide in the end to either „roll-back“ or „commit“
  • Roll-back restores the state before the transaction
  • Commit accepts the changes for good

That’s it?

  • So, that’s it, we are done…
  • Just scrap this stupid, confusing advanced stuff, keep it simple…

Simple world

  • In the simple world, only one DB-user at a time exists
  • Then we are really almost done..

Complicated World

  • In a complicated world we have parallel activities taking place:
  • Multiple threads, processes or external database clients simultaneously using the database

Complications of Complication

  • What happens now?
  • When things get parallel, everybody claims to be good at doing that, but it usually does not work.
  • So DB-developers are really good, because databases do actually sometimes work quite well, even under load…
  • What happens with incomplete transactions?
  • What happens with complete transactions during one complicated read query?

Naïve Requirement

  • Other clients should see a transaction either completely or not at all.
  • Is that enough?
  • Think of an auditorium for a boring lecture: 90 min, people continuously leaving the room (so boring), but at the end it is still quite full.
  • How many people are in the room?
  • Counting is hard, because more people leave during the count.

Snapshot too old

  • Oracle has solved that problem of counting the students and similar ones…
  • The SELECT statement gets, for its duration, a consistent view of the data, based on what was in the DB when the statement started.
  • Transactions that have been committed successfully after starting the SELECT are not taken into account
  • This is called a snapshot. Long SELECTs that work on data that is actively being changed sometimes fail with „Snapshot too old"

Now the hardcore stuff

  • Read phenomena
  • Transaction Isolation Levels
  • ACID

Read Phenomena: Dirty Read

  • „Dirty read“ refers to uncommitted data being available for reading by other transactions. Sounds cool, right? Intuitive?
  • Interesting question: can you read your own uncommitted data within the running transaction?

Read Phenomena: Non-repeatable Read

  • „Non-repeatable read" refers to reading operations on the same piece of data that happen during the same transaction (or the same SELECT statement in the Oracle world) and give different results.
  • With Oracle this should not happen; we get „snapshot too old" instead.

Read Phenomena: Phantom reads

  • The individual data records stay the same
  • But the set of records matching a query may change

Isolation Level: Read Uncommitted

  • Allows „dirty Read“
  • I do not know of any case where this is done
  • You can configure it and may get „better“ than that

Isolation Level: Read Committed

  • Read data only from completed transactions
  • Non-repeatable read is possible
  • Default for many databases, like PostgreSQL

Isolation Level: Repeatable Reads

  • During one transaction data does not change
  • Phantom reads are still possible
  • Default for some databases, for example MySQL/InnoDB (not Oracle, whose default is read committed)

Isolation Level: Serializable

  • Pretend all transactions are queued and only one at a time is happening
  • No phantom reads possible

Real databases

  • They have some default
  • But may always be „better" than that (see the JDBC sketch below)
  • So mixtures are possible
  • For example Oracle has „read-only“ as isolation level…
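
In JDBC-based code the desired level can at least be requested per connection; a minimal Scala sketch (connection details made up; whether the database honors the request or silently upgrades it depends on the product, as noted above):

import java.sql.{Connection, DriverManager}

val conn: Connection = DriverManager.getConnection("jdbc:postgresql://localhost/mydb", "user", "password")
conn.setAutoCommit(false)
conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE) // or TRANSACTION_READ_COMMITTED, ...
// ... run the statements of the transaction ...
conn.commit()
conn.close()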

Locking vs. Multiversion

  • Two mechanisms:
  • Locking of parts of the data
  • Multiversion: multiple copies (maybe copy on write?) around, for each transaction… Have fun when merging. 😉

Different DB products

  • Different DB products use the freedom to interpret the standard quite liberally
  • Oracle considers itself the standard; they only pretend to care about the SQL standard…
  • Make a wise decision which DB to support with your software in the beginning
  • Remember: some commercial DB vendors might not have any customers, they just have hostages

ACID

ACIDs in chemistry are substances with a pH value < 7.
Here it is an acronym for

  • Atomicity
  • Consistency
  • Isolation
  • Durability

Some Links


Databases and Immutable Objects


A common approach in functional programming and even a good practice in object oriented programming is to prefer having objects immutable.

Especially in applications with multithreading this is extremely helpful, but in principle it helps keep the information flow under control and avoids unexpected side effects. I do not want to postulate this as a dogma, since there are actually legitimate uses of mutable objects, but it is a good idea to keep in mind what mutability means and implies where it is used and to question whether it is really worth it in the particular case.

It is important to differentiate between really immutable objects and those that are handed out via a wrapper or a second interface in an immutable way while still being accessible for changes by those parts of the code that control the mutable interface.

Example:

import java.util.*;

public class A {
    private List<String> l = new ArrayList<>();

    public void addStr(String s) {
        l.add(s);
    }

    public List<String> getList() {
        return Collections.unmodifiableList(l);
    }
}

Even though the list provided by getList() is unmodifiable by itself, it can still be changed via the addStr(…) method of A.

The whole story gets more interesting when involving databases.

Database contents are in some way mapped to objects. Whether this mapping is done via automatisms like OR mappers or ActiveRecord, or explicitly by using mechanisms like JDBC or DBI, is secondary for the moment. These objects can be based on code specifically written for the particular usage, on generic DB content objects or on just plain old collections, which should be considered an implementation detail for the moment. What is important is that such objects or data structures expressing DB content exist in the software. For simplicity's sake they will be referred to as objects throughout this text.

Now DB content can change. This can happen due to activities of the software we are dealing with right now. But there can also be accesses to the database by other software or by a DB administrator, which can cause changes in the database content while the software is running. This already results in some complications, because the objects processed by the software may already be outdated while being processed. This would be acceptable if there were no caching mechanisms for database content on the software and framework level, leaving the caching to the database software. Unlike framework and software caches, the caches of the database itself can be truly conformant to transactions. Of course it is good to have the DB, in terms of network connectivity, not too far away from the application, maybe even on the same machine, depending on the access patterns. If the network connectivity is poor, this approach ends up having the cache on the wrong side of the network connection.

If the objects within the software have a lifetime that is so short that they cannot get out of sync with the database contents they are representing, things should be fine. This can be achieved by using the transaction isolation level "serializable", or one that prevents phantom reads, and by discarding all these objects before the end of the transaction. Within the transaction, objects that have been read during the same transaction are guaranteed to be up to date, as a result of the concept of transaction isolation used here. If certain OR-mapping patterns needed to map collections to database contents are not involved, it is possible that "repeatable read" is already enough to guarantee this. As long as objects are only read from the database (SELECT or READ or FIND), immutable objects work just fine.

Even deleting (DELETE) can be imagined, but in this case it is important to ensure that the corresponding object is not in use in other parts of the software even though the database content represented by it has already been deleted. When ensuring that objects representing database content do not live longer than the corresponding transaction, this should be possible to deal with. Also creating new objects (INSERT or CREATE) and the corresponding database content should be possible.

What is most problematic is the changing of data (UPDATE). Common OR mappers do this by reading the object, changing it and then saving it, hoping that the framework will figure out the differences and run the appropriate update, possibly multiple updates in conjunction with inserts and deletes. This becomes difficult with immutable objects, because it blocks the typical approach just described. Ways can be found to accomplish this anyway, but they tend to get this done at the expense of the elegance of the framework. Even more important, it becomes relevant to deal with the situation that objects expire even within the transaction and need to be replaced by new objects representing the changed database contents.

The provocative question is whether updates are needed at all. Of course they are needed, since update is one of the four basic database operations. But it is worth taking a closer look at the cases in which updates really make sense and the cases in which other approaches are attractive alternatives.

A classic is an application that is dealing with some kind of bookings and some kind of subjects to which accounts are attached; typically these are persons or companies or other entities, but we can go more abstract than that. For simplicity we can assume persons as an example. This can be banking software with accounts, account owners and whatever is needed on top of that. Or the billing system of a phone company that calculates, creates and manages invoices to customers based on their phone usage and subscriptions. It is commonly seen that there is such a booking table to which records can be added, but never deleted or changed, because that would falsify the bookkeeping. For canceling a booking, a cancel entry can be added, which is actually a booking in the reverse direction marked as a cancellation of the original booking. Maybe there is an archiving process that moves old data to a data warehouse, thus actually performing a delete in the booking table and replacing the old bookings by some summary entry, but that is really a detail not relevant for this article, as long as this archiving process is done in a good way that does not compromise the regular usage of the software and database, maybe by actually taking the system down for maintenance during this archiving process once a year.

Usually some kind of balance is needed quite often. In principle this can easily be calculated at any time by just adding up the bookings of that particular account, which is in line with normalization requirements. Unfortunately this will slow down the software to an extent that it is not useful for any practical purposes, even though in theory this approach is extremely beautiful. So a more efficient way of finding balances is needed. Possibly each booking could contain the resulting balance, reducing the task to looking for the newest booking of a particular account. That can be dangerous if two bookings generated simultaneously refer to the same previous balance. With a transaction isolation of "serializable" that would not happen, but this might slow down the database quite a bit, because it reduces the performance of access to the largest table and thus worsens the bottleneck that might already exist there. It could be solved on the database level, for example with triggers or materialized views, so that each insert into the booking table influences the balance automatically. This should work fine for all kinds of accesses to the system by any software, as long as it is constrained from doing any delete or update on the booking table. Actually, granting only INSERT and SELECT rights on the booking table to the user under which the software is running is a very good idea. It could be seen as a disadvantage that part of the business logic is moved to the database. Maybe this is acceptable by just understanding the balance being the sum of the bookings of the account as some kind of high-level constraint which is actively enforced by the database. Another approach would be to provide a functionality in the software that inserts a booking, or more likely a set of bookings comprising a transaction, and that always keeps the balance updated.

This can go wrong if simultaneous bookings for the same account can happen, meaning that a transaction for a second booking starts while the transaction of a first booking of the same account has not completed. This seems simple to avoid by just providing some queue for each account and working through the queue, but since the transactions usually include bookings on several different accounts, this can become quite messy to implement while still avoiding deadlocks. It must be quite interesting to build such a system in a way that it will work fine even under high load. In any case, objects containing the balance of an account tend to age extremely fast, so they have to be handled with care as short-lived objects within a transaction, or even shorter if that account is touched by bookings in the same transaction.

On the other hand, the table with persons and addresses is commonly kept up to date using updates whenever data changes. People marry, change their phone number or move to another address. But is this really the best approach? In reality, moving to another address could be known in advance, and maybe it is not a good requirement that this change of address has to be entered on the exact date when the person actually moves. Or who wants to spend his wedding day updating all kinds of name and address fields in all kinds of web applications? OK, this can be cheated on, because it is usually not a big deal if the change is done a little bit too late. But for some purposes exact data is really desirable. And then again, is this really the right approach? Sometimes it is required to be able to answer questions like: what were the name and address of the customer while booking 23571113a was being performed? Maybe it is a good idea to store address changes as some kind of bookings as well, with a valid-since field. For finding a customer's current address it is sufficient to look for the newest entry for that particular customer that is not in the future. With this approach new entries need to be created for these changes, and they can even be prepared in advance. Even deleting a customer can be accomplished by adding a "deleted" entry with a particular valid-since timestamp to that table.
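
A sketch of what the lookup of the current address could then look like (my own illustration; table and column names are made up, and LIMIT is PostgreSQL/MySQL syntax), here with plain JDBC from Scala:

import java.sql.{DriverManager, Timestamp}

val now = new Timestamp(System.currentTimeMillis()) // frozen once, see the following paragraph
val conn = DriverManager.getConnection("jdbc:postgresql://localhost/mydb", "user", "password")
val ps = conn.prepareStatement(
  "SELECT name, street, city FROM customer_address " +
  "WHERE customer_id = ? AND valid_since <= ? " +
  "ORDER BY valid_since DESC LIMIT 1")                // newest entry that is not in the future
ps.setLong(1, 42L)
ps.setTimestamp(2, now)
val rs = ps.executeQuery()
if (rs.next()) println(rs.getString("street"))
conn.close()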

The beautiful part is that now certain data in the database can be considered immutable, since no updates can occur to it. So dealing with it as immutable in the software becomes a viable approach and actually also a must. These objects can even be kept in the system longer than the duration of a transaction. It remains important to be careful and to understand the OR mapping, since complex objects representing collections can change even due to inserts. An example would be an account object containing the account and all its bookings. The customer address needs to be dealt with in conjunction with the timestamp that was used for acquiring it. So it is a good practice to freeze the "now" timestamp at the beginning of a large operation and to consistently use it throughout that operation to have a consistent view of the data. If this timestamp gets too old, it might be a good idea to refresh it and to reread all data attached to it. Other than for bookings, the times of address changes usually need not be accurate to the microsecond, as long as consistency is guaranteed and no weird mixture of old and new addresses ever occurs.


Oracle Bug with empty strings


Oracle databases seem to consider empty strings and null the same. This contradicts the SQL specification and is therefore a bug.

Unfortunately, it would be almost impossible for Oracle to fix this, because too many applications using Oracle for decades rely on this behavior.

It is therefore crucial to know about this bug when dealing with Oracle databases and to take it into account. As long as the database is accessed explicitly using SQL, SQL*Plus, JDBC, DBI, Pro*C etc., it is easier to deal with. But it becomes tricky when using frameworks like Hibernate, where weird errors might be observed. If you have a persisted entity with a string attribute, it should be legitimate to assign the empty string to it. Once this object travels through Hibernate and gets persisted, this empty string has become null instead, resulting in unexpected NullPointerExceptions. Or if the attribute has a NOT-NULL constraint in the database, this will result in an Oracle error because of a constraint violation. On top of this it might become a semantic issue when empty strings and null are actually considered to have different meanings.

An ugly, but practical way to deal with it is to change the getter and setter methods of the string attributes accordingly. Maybe some kind of „magic string“ can be defined, which will never ever occur as a value of this string attribute. The getters and setters use this magic string to internally represent the empty string. It is necessary to annotate the attribute, not the getter, for making this work with JPA or Hibernate.

Example in Java (without JPA annotations, synchronization, toString(), hashCode(), equals() etc. Would be similar for other languages)


public class MyEntity {
  private static final String MAGIC_VALUE = "_._._«NULL»_._._";
  private String value;

  public String getValue() {
    if (MAGIC_VALUE.equals(value)) {
      return "";
    } else {
      return value;
    }
  }

  public void setValue(String newValue) {
    if (MAGIC_VALUE.equals(newValue)) {
      throw new IllegalArgumentException("value="+newValue+" not allowed");
    } else if ("".equals(newValue)) {
      this.value = MAGIC_VALUE;
    } else {
      this.value = newValue;
    }
  }
}

To make this work even with short strings, defined as VARCHAR2(1) or VARCHAR2(2) in the database, it is perhaps a good idea to use control characters or UTF-8 characters that are not going to be used otherwise. But it remains an unsatisfactory and ugly solution.

Does anyone have a better idea?

Are there any other databases that have the same issue?


Latency of DB-connections


Applications that use a database often run slower than one would expect from the power of the hardware and software. Both the DB server and the application server are running on powerful machines that are not even close to their limits.

The problem is the connection between the two servers. This connection is of course very fast, which can be quite impressive when huge amounts of data need to be transferred all at once; ideally think of the huge result set of one query.

Typical database access operations need multiple accesses to the DB server. Each one of these works according to a communication protocol, which usually implies some round trips. So many small accesses to the database tend to slow down the whole application, because it is mostly waiting for the connection.

What can be done about this?
First of all the real bottleneck should be identified before optimizing something that is not the problem.

It does make a lot of sense to optimize the application for larger and fewer queries, so that more data is transferred at once. This may not be easy with frameworks, especially if EclipseLink or Hibernate is used. A typical anti-pattern is to read a large number of records from the database and then fetch detail information for each record from another table, with one request for each record. This can usually be avoided by using complex SQL queries, and it is sometimes worth the effort. As always, optimizations should only be done where they are either not very intrusive or where the benefit is likely to justify them.
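
To make the anti-pattern concrete, a sketch of my own (made-up table names, plain JDBC from Scala): the first variant needs one round trip per record, the second transfers the same information in a single round trip:

import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:postgresql://localhost/mydb", "user", "password")

// anti-pattern: one query for the list plus one query per record ("N+1 selects")
val orders = conn.createStatement().executeQuery("SELECT id, customer_id FROM orders")
while (orders.next()) {
  val ps = conn.prepareStatement("SELECT name FROM customer WHERE id = ?")
  ps.setLong(1, orders.getLong("customer_id"))
  val detail = ps.executeQuery() // one extra round trip for every order
  // ...
}

// better: one larger query with a join
val joined = conn.createStatement().executeQuery(
  "SELECT o.id, c.name FROM orders o JOIN customer c ON c.id = o.customer_id")
conn.close()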

Another, more general optimization is to put the application and the database on the same server. This is often not easy, because many system administrators do not like this too well. But if sufficient performance can only be achieved by putting everything on one server, this is recommended. Maybe it is an advantage of NoSQL databases that system administrators do not know any best practices that suggest installing database and application on different servers, so they can run on the same server, thus eliminating most (but not all) of the latency.
