UUIDs revisited

UUIDs have proven useful in many circumstances.
We have basically two main variants:

  • The UUID is calculated as a combination of the Ethernet-MAC-address, the timestamp and a counter.
  • The UUID is calculated using a good random number generator

While variant 1 provides good uniqueness, there are some issues with it. Today we mostly use virtualized servers, which means that the MAC address comes from a configuration file and is no longer guaranteed to be worldwide unique. And we give away information with the UUID that we do not necessarily want to give away.

Variant 2 can be proven to have an acceptably low risk of collisions, but this is only true when using really good random number generators, which cannot always be guaranteed. Also it introduces an uncertainty in an area where we do not need it. We need to worry about this uniqueness, at least a little bit, which is unnecessary.

So the question is whether we can rethink variant 1.

Assume that our software runs on our own server farm. There may be a few hundred, a few thousand or even millions of virtual or physical servers. The organization does have a way to uniquely identify its servers. Of course we only need to consider the servers that are relevant for the application. Maybe an ID for the service instance instead of the server is even better. We may assume a numerical ID or have a table to map IP addresses and the like to such an ID. Some thinking is still required on how to do this. The digits that we do not need can be filled with random numbers.

Putting this ID in place of the MAC address solves the issue of configurable MAC addresses.

The next problem, that timestamps can be abused to find out something that should not be found out, could be resolved by running the timestamp part or even the ID part (including a random number) through a symmetric encryption or simply some bijective function that is kept secret.
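
As a rough illustration, here is a minimal sketch in Java of what such a generator could look like. The bit layout, the service instance ID coming from configuration and the XOR mask standing in for a real secret bijective function are assumptions made only for this example, not any standard:

import java.security.SecureRandom;
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of "variant 1 rethought": a service instance ID instead of the MAC address,
// the timestamp scrambled by a secret function (here just an XOR mask as a placeholder).
public final class ServiceUuidGenerator {

    private static final SecureRandom RANDOM = new SecureRandom();
    private static final AtomicLong COUNTER = new AtomicLong();

    private final long serviceInstanceId; // assumed to come from configuration, unique per instance
    private final long secretMask;        // placeholder for a real keyed bijection or encryption

    public ServiceUuidGenerator(long serviceInstanceId, long secretMask) {
        this.serviceInstanceId = serviceInstanceId;
        this.secretMask = secretMask;
    }

    public UUID next() {
        long scrambledTime = System.currentTimeMillis() ^ secretMask; // hide the real timestamp
        long counter = COUNTER.incrementAndGet() & 0xFFFFL;           // 16-bit rollover counter
        long mostSig = (scrambledTime << 16) | counter;
        long leastSig = (serviceInstanceId << 32) | (RANDOM.nextInt() & 0xFFFFFFFFL); // fill with randomness
        return new UUID(mostSig, leastSig);
    }
}

Two UUIDs generated within the same millisecond on the same instance still differ in the counter bits, and the service instance ID keeps UUIDs from different instances apart.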

In many circumstances there is nothing wrong with customizing the UUID-generation to some „local“ standard, if this is well understood and carefully implemented.


Cisactions

New research has analyzed the concept of transactions from a very theoretical point of view. An interesting result of this research was the concept of cisactions, which are in some way the opposite of transactions. A duality between cisactions and transactions has been proven. This means that in principle every application that is based on transactions can also be written with cisactions. But there are some challenges:

  1. currently there is no database that supports cisactions
  2. cisactions are even harder to understand than transactions
  3. the whole program has to be written once again in a totally different way.
  4. even the business logic has to be redefined in a totally different way, but eventually the same results can be achieved
  5. The paradigm change from transactions to cisactions is much harder than the change from object-oriented to functional programming
  6. mixing of cisactions and transactions in the same program is almost impossible

But we will see applications that do use cisactions properly in the future. And they will perform about 1000 times faster than programs using transactions.

The interesting question is:

When will the first cisactional databases become available? How will they work? Which programming languages will support cisactions? Or will we rather have to invent totally new programming languages for proper support of cisactions?

A lot of questions still have to be answered, and this is still theoretical research. But it is so promising for high performance usage that we absolutely must expect that this will become an important way to solve real high performance development tasks.


Not all projects are on ideal paths II (Tim Finnerty)

This is another story of a project that did not go as well as it could have gone while I was there. From unsuccessful projects we can learn a lot, so there will be stories like this once in a while. The first one was about Tom Rocket.

Tim Finnerty

It was the time, when all the cool companies tried to introduce Java. And some of the new Java projects failed, causing the companies to go back to C, which again scared other companies from doing this step. But some companies did not get scared by this. They embraced the new Java-fashion at a time when it still was not clear whether or not this was a good idea. What could possibly go wrong?

Well, in those days the experienced guys did not want to move to Java. It was slow, it was unreliable, not mature,… Maybe for Applets, maybe more generally for GUIs to get rid of VisualBasic, but Java on the server was considered a bad joke. For the server real people used real C, of course on Unix or maybe Linux, which was not really such a bad idea in those days. But there were the young people. Or the ones who had stayed young. They often just had finished their education and firmly believed that by just using technology „xyz“ everything would become great. xyz can be a development method (spiral model in those days, agile today), an architecture (microservices), a paradigm („OO“, „functional“), a framework („enterprise edition“), a tool or a programming language (yeah, Java). Often this first enthusiasm ends in a disaster: the money has been spent, the developers are leaving and the software is further away from being useful than anybody would like to admit. In lucky cases there is still some money left to do it right. Maybe even to do it right in Java.

That is where we came in. I do not really know the earlier history, but according to the management it was a total disaster. Now Tim Finnerty (real name known to the author) was the new technical lead, team architect or whatever this role is called. He embraced the new technology, but promised not to overstretch it. A bit of old school. Sounds good, because it is exactly what managers want to hear. No more risky experiments, but this time it needs to become a success.

So Tim Finnerty defined how we had to work. He knew Java, he knew databases and especially Oracle, he knew the web, he even knew Perl. And he knew OO. Better than anybody else, so we did the real thing. Great head start. And everybody had to program according to Tim’s rules.

Of course we were using Java enterprise edition. That meant that we were programming against some Weblogic application server, which was hard to install, hard to run, required a few minutes of startup time for each minor change of the software that we were writing and forced us into a very archaic and primitive programming model. But that was cool and it was the future, which unfortunately proved to be true. Even though it has at least become usable by now. So far nothing to blame on Tim, because it is kind of the stack that everybody used.

Now to the OO and the database. Each database table represented a „Business Object“, with a name like XyzBO. So most of the time, we wrote a class XyzBO plus a few more to fulfill the greed and need for boilerplate code of the old J2EE-world. XyzBO was an enterprise java bean. A stateless session bean, to be accurate. Which meant that we wrote methods of this EJB, which were basically procedures of the pre-OO-world. But within we could of course use the whole OO-toolset. Which we did. So the class to represent any data from the database was actually called Data. It was a minor subset of the standard Perl data structures, which meant that Data was a list of hash maps, which could behave just like a hash map if it had only one entry. Database queries returned Data, or of course null, if nothing was found. Nobody would ever want to use an empty collection. Pretty much the opposite of what we are now doing the hard way by introducing Optional or Option to avoid the null. But it was easy to just write
if (data == null) { data = new Data(); }
at least for the ones who knew this trick.
So data resembled the content of the database or of the query result, with the column names as keys and the values as objects. When working with these, it was really easy. Just know the attribute name accurately. Get the value from data. See if it is null. If not, cast it to the real type, and voila….

The database was designed according to Tim’s advice; he had to review every table. It was mandatory to have, as unique key and as first column, a string of around 700 characters, which was called HANDLE. Each table had a business primary key, which always consisted of several columns. Because the system allowed multiple instances of the software to run on the same database, there was always one column called „SITE“ in this logical primary key. But there were no primary keys defined in the database. The unique HANDLE was enough. It contained the name of the BO, like XyzBO, followed by a colon and followed by the concatenation of all the logical primary keys, separated with commas. The date had to be converted to a string using a local US format, not ISO, of course. All foreign keys were defined using HANDLE. In the end more than half of the DB space was wasted for this stupidity. But each single handle value started with an XyzBO:, to remind us that we were programming OO.

And now booleans. It was forbidden to use the boolean type of Java. All booleans were strings containing the words „true“ and „false“. This went like that all the way to the web interface.

At that time web frameworks did not yet exist or were at least unknown. So the way to go was to write JSPs, which contained kind of dynamic web pages, and to write servlets to control the flow and access the EJBs. Now according to Tim it was too hard to learn servlets as well, so it was forbidden to use them, and instead a connection-JSP had to be written, which did not display anything, but only contained a <% in the beginning and a %> in the end and the code between.

A lot of small and larger stupidities were forced on the team. Most people were new to Java and to OO and did not even realize that there was anything wrong, apart from the fact that it was kind of hard to get stuff done. Some stupidities were due to the fact that the early J2EE really sucked, but mostly it was Tim, who forced everyone to his level. This story happened a long time ago. Tim has already retired. I would say he is one of the guys who retire as a Junior.

There is (almost) nothing wrong with stupidity. And there is (almost) nothing wrong with arrogance. But the combination really sucks. Especially if it is taken seriously by the manager or has to be taken seriously by the team.


Devoxx Kiev 2018

At the end of 2018 the number of conferences is kind of high. A great highlight is the Devoxx BE in Antwerp. But it now has five partner conferences in London, Paris, Krakow, Morocco and Kiev. So I decided to have a look at the one in Kiev.

How was it in comparison to the one in Belgium? What was better in Kiev: the food was way better, the drinks on the first evening (whisky and long drinks vs. Belgian beer) might be considered better, there were more people engaged to help the organizers…
What was better in Belgium: there were still a few more talks. While the location in Kiev was really great, in Belgium the rooms were way better for the purpose of providing a projection visible for everybody and doing a video recording that did not disturb the audience.
The quality of the talks was mostly great in both locations. In Kiev they gamified the event a bit more.

Generally there was a wide range of topics and the talks were sorted into the following thematic groups:

  • Methodology & Culture
  • JVM Languages
  • Server Side
  • Architecture & Security
  • Mobile & IoT
  • Machine Learning & AI
  • Big Data & Data Mining
  • Cloud, Containers & Infrastructure
  • Modern Web & UX

See the schedule for the distribution…

I attended on Friday:

I attended on Saturday:

A lot to learn.


Some thoughts about String equality

Of course Strings are today in some way Unicode. In this article we assume code points as the building blocks of Strings. In the Java world that means, for example, that for typical European languages using the Latin, Greek or Cyrillic alphabets (including their extensions) one code point is comprised of one Java character. But when moving to some Asian languages, a code point can also consist of two Java characters, and there are Strings that are illegal from the Unicode perspective, because they contain characters that should be combined in a way that cannot be combined properly. So here we assume that Strings consist of sequences of bytes or two-byte characters or whatever encoding that properly expresses a sequence of code points. There are many interesting issues when dealing with some Asian languages that we will not cover here today.
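
A small illustration in Java of the difference between counting Java characters and counting code points, using a character outside the Basic Multilingual Plane as an example:

public class CodePointDemo {
    public static void main(String[] args) {
        String latin = "Käse";         // typical European text: one Java char per code point
        String clef = "\uD834\uDD1E";  // U+1D11E MUSICAL SYMBOL G CLEF, needs a surrogate pair

        System.out.println(latin.length());                          // 4
        System.out.println(latin.codePointCount(0, latin.length())); // 4
        System.out.println(clef.length());                           // 2 Java chars ...
        System.out.println(clef.codePointCount(0, clef.length()));   // ... but only 1 code point
    }
}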

Now there are a lot of possibilities to create Strings that look the same, but are actually different. We are not talking about „0“ and „O“ or „1“ and „l“ and „I“ that might look similar in some fonts, but should not look similar, because we actually depend on their distinctness, even on their visual distinctness. Unfortunately we have the bad habit of using traditional typewriter fonts for source code, where distinguishing these would be so crucial, and those fonts make it hard. But for today, we just assume that we always look hard enough to solve this issue.

The classical example of what looks the same is whitespace. We have the ordinary space “ “ and the no-break space “ „, which are meant to look exactly the same, but to expose a slightly different behavior. There are tons of possibilities to create exactly the same look with different combinations of whitespace. But this is kind of a special case, because whitespace in terms of semantics often carries little information and we want to disregard it to some extent when comparing strings. Typical examples are stripping of leading and trailing whitespace of the string or of the lines contained within it and replacing tabulators with the number of spaces that would be equivalent. Or even replacing any amount of adjacent whitespace within a line by a single space. Again, handling of different whitespace code points might require different rules, so it is good to be careful not to put in too much logic, and it is better to rely on a library to at least apply exactly the same rules in equivalent situations.
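
A minimal sketch of such whitespace normalization in Java (11 or newer); the concrete rules chosen here (strip the ends, treat tabs like spaces, collapse runs of spaces and no-break spaces) are just one possible convention, assumed for illustration:

public class WhitespaceNormalizer {

    // Normalize a string before comparison: strip leading/trailing whitespace,
    // treat tabs like spaces and collapse runs of spaces and no-break spaces.
    public static String normalize(String s) {
        return s.strip()
                .replace('\t', ' ')
                .replaceAll("[ \u00A0]+", " ");
    }

    public static void main(String[] args) {
        System.out.println(normalize("Hello   world").equals(normalize(" Hello\u00A0world "))); // true
    }
}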

Another example that we actually might know is that certain characters look the same or almost the same in the Cyrillic, Greek and Latin alphabets. I try to give an idea of the meaning of the Greek and Cyrillic characters, but they depend on the language, the dialect and even the word, the word form or the actual occurrence of the letter in the word…

Latin | Cyrillic | Greek | meaning of Cyrillic letter | meaning of Greek letter
A | А | A | like Latin | like Latin
B | В | B | like Latin V | Beta (like V in new Greek)
C | С |   | like Latin S |
E | Е | E | like Latin | Epsilon (like Latin E)
  | Г | Γ | like Latin G | Gamma (like Latin G)
H | Н | Η | like Latin N | Eta (like Latin I in new Greek)
J | Ј |   | Serbian Ј, like German J |
K | К | Κ | like Latin | Kappa (like Latin K)
M | М | Μ | like Latin | Mu (like Latin M)
N |   | Ν |   | Nu (like Latin N)
O | О | Ο | like Latin | Omikron (like Latin O)
P | Р | Ρ | like Latin R | Rho (like Latin R)
  | П | Π | like Latin P | Pi (like Latin P)
T | Т | Τ | like Latin | Tau (like Latin T)
  | Ф | Φ | like Latin F | Phi (like Latin F)
X | Х | Χ | like German CH | Chi (like German CH)
Y | У | Υ | like Latin U | Upsilon (like Latin U)
Z |   | Ζ |   | Zeta (like German Z)
I | І | Ι | Ukrainian I | Iota (like Latin I)

In this case we usually want the characters to look the same or at least very similar, because that is how to correctly display them, but we do want them to be different when comparing strings.

While these examples are kind of obvious, there is another one that we tend to ignore, but that will eventually catch us. There are so called combining characters, that should actually be named „combining code points“, but here we go. That means that we can put them after a letter and they will combine to form a letter with diacritical marks. A typical example is the letter „U“ that can be combined with two dots “ ̈ ̈“ to form an „Ü“, which looks the same as the „Ü“ that is composed of one code point. It is meant to look the same, but it also has the same meaning, at least for most purposes. What we see is the Glyph. We see the difference when we prefix each code point with a minus or a space: „Ü“ -> „-U-̈“ or “ U ̈“, while the second one is transformed like this: „Ü“ -> „-Ü“ or “ Ü“, as we would expect.

While the way to express the glyph with two code points like this is not very well known and thus not very common, we actually see it already today when we look at Wikipedia articles. In some languages, where the pronunciation is ambiguous, it can be made clear by putting an accent mark on one vowel, as for example Кириллица, which puts an accent mark on the term in the beginning of the article like this: „Кири́ллица“. Since accent marks are unfortunately not used in normal Cyrillic writing, it comes in handy that the combining accent also works with Cyrillic letters. When putting minus signs between the code points it looks like this: „К-и-р-и-́-л-л-и-ц-а“ or with spaces like this: „К и р и ́ л л и ц а“. So Strings that we encounter in our programs will contain these combining characters in the future. While we could prohibit them, it is better to embrace this, and it is actually not too hard if we use decent libraries. Java has the Normalizer class in its built-in library, which can convert to one or the other convention of expressing such glyphs, allowing comparison in the way that we actually mean.
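
A small sketch of how this looks with the java.text.Normalizer class; the two strings render identically, but only compare as equal after both are normalized to the same form:

import java.text.Normalizer;

public class NormalizerDemo {
    public static void main(String[] args) {
        String composed = "\u00DC";    // Ü as a single code point
        String decomposed = "U\u0308"; // U followed by the combining diaeresis

        System.out.println(composed.equals(decomposed)); // false

        // Bring both to the composed form (NFC) before comparing.
        String a = Normalizer.normalize(composed, Normalizer.Form.NFC);
        String b = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(a.equals(b)); // true
    }
}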

Unfortunately issues like semantic lengths of strings or semantic positions become even harder than they already were after moving from characters to code points. And we can be sure that Unicode still has more to offer to complicate things, if we dig deeper. The typical answer that we get on most web sites that talk about these issues is something like: „The length of strings and positions within strings are surprisingly irrelevant to most programs.“

At the end of the day, jobs that were trivial in the past are now becoming a big deal and we need to learn to think of comparison, length, position, regular expressions, sorting and all kinds of string functionality with bytes, characters, code points and glyphs in mind.

What can our current libraries already do for us, what are we missing in them, considering different programming languages, databases, text files and network transmission?



SQL: DROP, CREATE,… How to avoid errors?

In SQL we often want to create or drop an object (TABLE, VIEW, SEQUENCE, INDEX, SYNONYM, DATABASE, USER, SCHEMA,….)

Now we do not know whether this thing exists or not. We would like to write scripts that work in either case.

So something like

CREATE TABLE IF NOT EXISTS XYZ (...);

or

DROP TABLE IF EXISTS XYZ;
CREATE TABLE XYZ (...);

would be desirable.

This seems to be quite hard, depending on the database.

Some databases support the „IF EXISTS“ or even the slightly less useful „IF NOT EXISTS“ pattern for some of the CREATE and DROP statements:

COMMAND | PostgreSQL | MS-SQL-Server | Oracle | DB2 | MySQL/MariaDB
DROP DATABASE IF EXISTS | x | x | - | - | x
DROP FUNCTION IF EXISTS | x | x | - | x | x
DROP INDEX IF EXISTS | x | x | - | x | x
DROP MATERIALIZED VIEW IF EXISTS | x | Materialized view not found in documentation | - | Materialized view not found in documentation | no materialized views supported
DROP ROLE IF EXISTS | x | x | - | x | x
DROP SCHEMA IF EXISTS | x | x | no DROP SCHEMA, Oracle uses User to express the concept of a Schema | ? (no DROP SCHEMA found in documentation) | Schema not supported.
DROP SEQUENCE IF EXISTS | x | x | - | x | x
DROP TABLE IF EXISTS | x | x | - | x | x
DROP TABLESPACE IF EXISTS | x | Not found in documentation | - | ? (no DROP TABLESPACE found in documentation) | TABLESPACE not supported
DROP TRIGGER IF EXISTS | x | x | - | x | x
DROP USER IF EXISTS | x | x | - | - | x
DROP VIEW IF EXISTS | x | x | - | x | x
CREATE OR REPLACE DATABASE | - | - | - | - | x
CREATE DATABASE IF NOT EXISTS | - | - | - | x | x
CREATE FUNCTION IF NOT EXISTS | - | - | - | x | -
CREATE OR REPLACE FUNCTION | x | x (CREATE OR ALTER FUNCTION) | x | - | x
CREATE OR REPLACE INDEX | - | - | - | - | x
CREATE INDEX IF NOT EXISTS | x | - | - | x | x
CREATE MATERIALIZED VIEW IF NOT EXISTS | x | Materialized view not found in documentation | - | Materialized view not found in documentation | no materialized views supported
CREATE OR REPLACE ROLE | - | - | - | - | x
CREATE ROLE IF NOT EXISTS | - | - | - | x | x
CREATE SCHEMA IF NOT EXISTS | x | - | Oracle ties the schema closely to a user. So slightly different meaning of CREATE SCHEMA... | Oracle ties the schema closely to a user. So slightly different meaning of CREATE SCHEMA... | Schema not supported.
CREATE OR REPLACE SEQUENCE | - | - | - | - | x
CREATE SEQUENCE IF NOT EXISTS | x | - | - | x | x
CREATE OR REPLACE TABLE | - | - | - | - | x
CREATE TABLE IF NOT EXISTS | x | - | - | x | x
CREATE TABLESPACE IF NOT EXISTS | - | Not found in documentation | - | TABLESPACE apparently not supported | TABLESPACE not supported
CREATE OR REPLACE TRIGGER | - | x (CREATE OR ALTER TRIGGER) | x | - | x
CREATE TRIGGER IF NOT EXISTS | - | - | - | x | x
CREATE OR REPLACE USER | - | - | - | - | x
CREATE USER IF NOT EXISTS | - | - | - | - | x
CREATE OR REPLACE VIEW | x | x (CREATE OR ALTER VIEW) | x | - | x
CREATE VIEW IF NOT EXISTS | - | - | - | x | x

Usually we can happily ignore the missing „IF (NOT) EXISTS“ conditions and just DROP the object and then create it. By ignoring the error message this works.

But increasingly we want to manage the database with tools that run scripts automatically and cannot distinguish as well as a DBA between „good error messages“ and „bad error messages“.

So we should find our ways around this.

The trick is to use the procedural extension language of SQL (like PL/SQL for Oracle, because in the case of Oracle we need it most often). We write a block and find out by a select against the data dictionary whether the object we are creating or dropping exists. Then we put the select result into a variable and put an IF-condition around our CREATE or DROP statement. This approach can also be useful for dropping all objects of a certain type that fulfill some SQL-query-accessible condition.

DECLARE
   CNT INT;
BEGIN
   SELECT COUNT(*) INTO CNT 
     FROM USER_TABLES 
     WHERE TABLE_NAME = 'MY_TABLE';
   IF CNT >= 1 THEN
      EXECUTE IMMEDIATE 'DROP TABLE MY_TABLE';
   END IF;
END;
/

Or to catch the kind of exception that we want to ignore (example for Oracle):

BEGIN
   EXECUTE IMMEDIATE 'DROP TABLE MY_TABLE';
EXCEPTION
   WHEN OTHERS THEN
      IF SQLCODE != -942 THEN
         RAISE;
      END IF;
END;
/

It is a bit tricky to find the reference pages for the SQL dialects of the different DB products. I used these:

MariaDB is a fork of MySQL that was created when MySQL came to Oracle. I recommend it as the successor of MySQL, if you are into MySQL-like databases. And I strongly recommend looking into PostgreSQL, if you intend to use a free/open-source database, because it is the same price and just has richer features. Generally all five databases are good and have their areas of strength and their weaknesses. I won’t go deeper into this in this article…


PostgreSQL

Almost every non-trivial application uses a database in some way.

For many years this was usually Oracle, DB2 or MS-SQL-Server, depending mostly on the habits and on the religious orientation of the organization that developed or ran the application. These days all three are available for Linux and MS-Windows. DB2 is also available on z/OS. The „home platforms“ of these three are probably Linux, z/OS and MS-Windows, respectively (2018).

We saw Teradata as an alternative to DB2 and Oracle for data warehouses. They run on huge amounts of data, but are really invisible to most of us. Maybe the data warehouse is the old „big data“, before the invention of the term.

We saw a big hype about NoSQL databases and some interesting DB products from this group that could successfully establish themselves.

We saw MySQL (and its fork MariaDB) mostly for database-installations that had somewhat lower requirements on the DB-product in terms of features or in terms of performance. Actually Wikipedia runs on MySQL or MariaDB and that is quite a big installation with heavy user load, but it is mostly about reading.

PostgreSQL was often positioned „somewhere between Oracle and MySQL“.

PostgreSQL 10 just came out. The most important new features were replication on a per-table basis, better partitioning of large tables and better support for clustering.

I have worked with all of the database technologies listed here and have even given trainings for MongoDB, Oracle and PostgreSQL.

So where is PostgreSQL positioned really in this landscape?

It is a good database product for a large and growing class of applications. I find it slightly more pleasant to work with than the other four SQL databases mentioned here, because the SQL implementation and its extensions are powerful, clean and behave more or less as expected. Some minor positive points are the default usage of the ISO date format, the distinction between null and empty string, and on the other hand that most stuff that works in Oracle at the SQL level can easily be transferred to PostgreSQL. The psql shell works like typical Linux shells in terms of command line editing and history. So a lot of minor details are just pleasant or as they should be.

Comparing to the three groups of contenders:

NoSQL

NoSQL databases kind of leave the mainstream of transactional relational SQL databases and provide us either some interesting special features or promise us performance gains or support of huge database sizes. The price for this is that we lose an extremely mature, clever and powerful query language, which SQL is. I would go for a NoSQL product if the additional feature of this NoSQL DB product cannot reasonably be duplicated in PostgreSQL or other SQL DBs and if it is really useful for the job. I would go for a NoSQL DB product if the required data sizes and performance cannot reasonably be achieved using an SQL product like PostgreSQL with good tuning of hardware, OS, database and application logic, but can actually be achieved with the NoSQL product. These applications exist and it is important to pick the right NoSQL DB for the project. It should be observed that PostgreSQL has a lot of features beyond what normal SQL databases have and looking into this area might be useful… A typical strength of some NoSQL databases (like CassandraDB and MongoDB) is that a powerful replication is kind of trivial to set up, while it is a really big story for typical transactional SQL databases… This is due to the transactional feature, which adds complexity, difficulty and a performance penalty to some kinds of replication…

MariaDB/MySQL

I do not count it against MySQL that it belongs to Oracle, because MariaDB is an independent fork outside of Oracle and can be used instead of MySQL.
I do think that MySQL does not quite have the level of PostgreSQL in terms of features and cleanness. So we can get PostgreSQL for the same price as MySQL or MariaDB. So why not go for the better product? Even if MariaDB perfectly fits today, the application will grow and it will at some point prove useful to be based on PostgreSQL. I came across the issue of nested transactions some years ago. They were easily supported by PostgreSQL, but not at all by MariaDB. Issues like that are more likely to come up this way than the other way around.

Oracle, DB2, MS-SQL-Server

Especially Oracle makes many long-term loyal customers run away due to their pricing and licensing practices. While it is extremely hard to change the database of a non-trivial database-based application, at least new applications in many organizations are discouraged from using Oracle, unless they can make a point why they really need it. MS-SQL-Server might absorb some of these, especially since it is now available on Linux servers. But what Oracle does now might very well be the policy of Microsoft or IBM in a few years, so it makes perfect sense to have a serious look at PostgreSQL. A reasonably well tuned PostgreSQL will work pretty much as well as a reasonably well tuned Oracle, DB2 or MS-SQL-Server. Features that are missing now are being added with new releases. Some interesting features make it just a bit more pleasant to use than for example Oracle. It just feels more modern and more Linux-like.

By the way, there were some more contenders in the space of commercial transactional SQL databases, like Adabas D, Sybase and Informix. The database product Adabas D was bought by SAP in 1997 and the whole Sybase company in 2010, in two more or less unsuccessful attempts to have their own database and not have to use a competitor’s product as database; but they seem to have some success in using HANA now. Informix has been bought by IBM and is still offered as an alternative to DB2. I would say that they have lost their relevance.

PostgreSQL

So I do recommend to seriously consider PostgreSQL as a DB product. It is currently my favorite in this space, but there is no universal tool that fits everything.

Some random aspects to keep in mind when moving from Oracle to PostgreSQL are mentioned here…

Types CLOB and BLOB do not exist. They can mostly be replaced by the types TEXT and BYTEA, but it is not exactly the same. The type TEXT, which is a somewhat unlimited variable-length string, can easily be used for columns where we would try to use VARCHAR2 in Oracle, which gives us the advantage that we do not have to worry about defining a maximum length or exceeding the 4k limit that Oracle imposes on VARCHAR2.

Empty Strings are not the same as NULL in PostgreSQL, they are in Oracle.

PostgreSQL has a boolean type. Please use it and get rid of the workaround using CHAR, VARCHAR2 or NUMBER as replacement.

Oracle only had one kind of transaction isolation that was really well supported and I think this is still the way to go. It is an excellent choice and is very close to „repeatable read“, while PostgreSQL uses by default „read committed“, but it can be brought to use „repeatable read“. Please keep this in mind to avoid very unpleasant surprises and use the transaction isolation level appropriately.
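
As a small illustration, assuming the application talks to the database over plain JDBC, the isolation level can be requested per connection like this (URL and credentials are just placeholders for this sketch):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class IsolationDemo {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "myuser", "secret")) {
            // PostgreSQL defaults to READ COMMITTED; ask for REPEATABLE READ explicitly
            // if the application logic relies on it.
            con.setTransactionIsolation(Connection.TRANSACTION_REPEATABLE_READ);
            con.setAutoCommit(false);
            // ... queries that should see one consistent snapshot ...
            con.commit();
        }
    }
}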

The structuring of PostgreSQL consists of DB instances, usually only one on a virtual or physical server, which somewhat resembles what a database is in Oracle. Within a DB instance it is possible to define a number of databases without much pain. This was totally not the case with Oracle in earlier years and it was best practice to rely on schemas, but now we can easily afford to put up more virtual servers each running Oracle (or PostgreSQL), if the licensing does not prohibit it in the case of Oracle. And since Oracle 12 there is the concept of the pluggable database, which splits an Oracle database into sub-databases, somewhat behaving like separate databases without the overhead of DB instances. It seems to be quite equivalent to what PostgreSQL does, apart from the naming and many details about how to set it up and how to use it. Schema and User are more separate concepts in PostgreSQL; a Schema can be defined totally independently of Users, but there is a way to define Schema names that match the user names to support this way of working. So we can do pretty well what we want, but the details of how to work it out are quite different.

Each database has its own programming language to write triggers, stored procedures and the like. They seem to be somewhat similar between different DB products (we are talking about MS-SQL-Server, Oracle, PostgreSQL and DB2), but different enough that we need to rewrite triggers and stored procedures from scratch. This is not as painful as it used to be, since the approach of accessing DB tables for read access only via views and for write access only via stored procedures seems to have lost some popularity. Having written a lot of the business logic in PL/SQL, the pain of migrating to another DB product is really enormous, while business logic in Java, Scala, C, C++, Perl, Ruby, C# or Clojure can be ported more easily to a different OS and a different DB. But it is in no way for free.

One remark for development: some teams like to use in-memory databases for development and then trust that deployment on PostgreSQL or Oracle or whatever will more or less work. I strongly recommend not to follow this route. It is totally not trivial to support one more DB product, or usually a second DB product, and it is quite easy to set up a virtual OS with the DB product that is being used and with test data. PostgreSQL, Oracle, MS-SQL-Server, MongoDB and whatever you like can be configured to use more memory and perform pretty much like these in-memory DBs, if we set them up for development and are willing to risk data loss. This is no problem, because the image can be trivially copied from the master image when needed. Yes, a really good network and SSDs of sufficient size, speed and quality are needed for working efficiently like this, and it is possible and worthwhile to have that.

I can give training about PostgreSQL and MongoDB and about SQL in different dialects. Find contact information here.

And please: comments, corrections and additional information are always welcome…



Cassandra DB

The large zoo of NoSQL databases needs to be considered thoroughly.

While the major transactional SQL databases (Oracle, PostgreSQL, MS-SQL-Server, DB2 and to a limited extent MariaDB/MySQL) can be used more or less interchangeably, at least if we are talking about the beginning of a project, the NoSQL databases have totally different features and must be chosen wisely and adequately. While the typical pattern of many NoSQL databases is trading speed for complex transactions, this is by no means compulsory. There are SQL databases like Teradata that have reduced support for transactions, but feature a full set of SQL for queries. This can be used for data warehousing, where the data-changing operations take place in a more controlled manner, because the data is already known and present elsewhere, but queries are the important thing. On the other hand there are NoSQL databases like Neo4j that support full ACID transactions, but organize the data in a totally different way that might be more adequate than a relational database for many purposes. This shows the other motivation to go for NoSQL databases. While the relational model is somewhat complete and there are always ways to express whatever is desired with it, it may not be a natural or efficient or convenient way to model certain data.

Some databases, like MongoDB, allow additional indexes and thus efficient queries without full table scans, as well as additional uniqueness constraints, which is a must-have feature for a good transactional SQL database, but not so much for NoSQL databases in general.

Some NoSQL-databases are typically run in memory. While there are some SQL databases that run in memory as well, like H2, this kind of defeats the idea of ACID, so it is not really a full transactional SQL database anyway, at least not in this mode.

SQL databases are well known to many people and can do many jobs quite well. If the data model can be expressed well in the SQL database and the performance is sufficient, this may be a good way to go. The powerful query and data manipulation language is usually extremely helpful and usually you get only a shadow of what a powerful SQL dialect like PostgreSQL or Oracle can do when you move to a NoSQL database.

Now NoSQL databases are somewhat scary. The data is „not safe“ in them, because there are no transactions. But what about the transactional relational SQL database? If we use it through some implicit or explicit caching layers, we get rid of this feature without being aware of it. JPA or Hibernate tend to do implicit caching and can even use file system persistence outside of the database as a „second level cache“. I think this is just so broken, but it is used and good applications are built with it. But then the feature of the good old transactional SQL database that serves as the argument to use it has silently been thrown into the garbage can in such a project.

So we should think about whether and where we need serious transactions and then really write our software in a way that does not defeat them by caching, preferably by avoiding JPA and Hibernate. Or we should pick our database solution for the purpose based on the modelling features and performance features.

Cassandra comes in as one of the NoSQL products with very good write performance and scalability. The query language kind of looks like SQL, but it is a tiny subset. If we can live with that subset and it is relevant to get more performance than we would get from a PostgreSQL or Oracle, this might be a good way to go. An important observation is that tables in Cassandra do have primary keys. But additional unique keys cannot be defined; instead lookup tables are used. This works like this:

-- main table, addressable by its primary key X
CREATE TABLE IF NOT EXISTS "A" (
    "X" text,
    "Y" text,
    "Z" text,
    "T" text,
    PRIMARY KEY("X")
);

-- lookup table to find the X belonging to a given Y
CREATE TABLE IF NOT EXISTS "A_LOOKUP_BY_Y" (
    "Y" text,
    "X" text,
    PRIMARY KEY("Y")
);

Lookups by X can be done directly in A. Lookups by Y have to be done in two steps: first the lookup table is used to determine the value of X, then the table A is used for the actual lookup of the data.
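
As a sketch of how this two-step lookup could look from Java, assuming the DataStax Java driver 4.x and an already opened CqlSession on the right keyspace; the class and method names are from that driver, the table layout is the one above:

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class TwoStepLookup {

    // Look up a row of "A" by the non-primary-key column "Y",
    // going through the lookup table "A_LOOKUP_BY_Y" first.
    public static Row findAByY(CqlSession session, String y) {
        Row lookup = session.execute(
                SimpleStatement.newInstance("SELECT \"X\" FROM \"A_LOOKUP_BY_Y\" WHERE \"Y\" = ?", y)).one();
        if (lookup == null) {
            return null; // no entry for this Y
        }
        String x = lookup.getString("X");
        return session.execute(
                SimpleStatement.newInstance("SELECT * FROM \"A\" WHERE \"X\" = ?", x)).one();
    }
}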

There are ways to do interesting and useful things that seem to be missing and it is usually worth investigating how they can be done. When knowing how the different tools that are available can be used efficiently and properly, it may be a good moment to actually decide what tool to use. But a serious comparison should compare well done solutions of both technologies and not just a junior’s first day program in one technology against a professional solution done by a senior.

It is always important to have a look at the different NoSQL databases to decide which ones are going to be used in the team or in the organization. Cassandra can be a useful database for this approach. Often it is good to also have a SQL database in this set of supported databases. It is easier to model for example accounts, bookings and payments with it, while other data may be better stored in adequate NoSQL databases.

By the way, while major transactional SQL databases may be somewhat interchangeable in the beginning of a project, it is quite a pain to change them later or to support multiple databases. Also there are of course licensing issues, teams are needed to operate the database, databases may run better on certain operating systems and there may be some additional features that make them more or less desirable. But the basic functionality differs less than in the case of NoSQL databases.


Observer Effect

Scientists have to deal with the observer effect, which means that observing something actually changes it. Typically we think of quantum physics, where this effect is very strong and surprising and closely related to the Heisenberg uncertainty principle, but in a more abstract sense it is present in a multitude of situations. Just think of human interaction. If we want to find out about people, we can ask them. But this conversation actually changes the people, sometimes in a way that we can neglect or tolerate.

But we also have this in IT. If we think of a piece of software and we want to observe whether it behaves well, we need ways to observe it. Very often we use logging, sometimes monitoring tools, and sometimes debugging or even profiling. We think that they do not hurt us, apart from using resources, but we have to be quite careful. The example of logging is quite good, because it is quite common and usually something that we do a lot, without wasting too much thought on it.

Now logging slows our application down; that is known already. So we tend to use a slightly less noisy log level, because terabytes of log are still a pain, even today. But usually the messages are calculated and then discarded by the logging framework. With functional language features there are quite elegant ways to deal with this, by just passing a function that calculates the message on demand instead of passing the message. It has always been possible, but too clumsy to actually do, unless the logging framework could rely on macro facilities, even such simple ones as the C preprocessor. The deferred evaluation has its dangers as well. If an object that is passed as an ingredient for a potential message changes while the message is being created, we might get funny effects. Maybe only in the log, but maybe it could crash the application or stop the main program flow from doing its work. We need to be careful, unless the object is immutable.
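
A minimal sketch of this deferred evaluation with the JDK's own java.util.logging, where the lambda is only evaluated if the log level is actually enabled:

import java.util.logging.Level;
import java.util.logging.Logger;

public class DeferredLoggingDemo {

    private static final Logger LOG = Logger.getLogger(DeferredLoggingDemo.class.getName());

    static String expensiveDescription(Object o) {
        // imagine an expensive toString-like computation here
        return "state=" + o;
    }

    public static void main(String[] args) {
        Object entity = "some entity";

        // Eager: the message is built even if FINE is disabled.
        LOG.fine("processing " + expensiveDescription(entity));

        // Deferred: the lambda is only evaluated if FINE is enabled.
        LOG.log(Level.FINE, () -> "processing " + expensiveDescription(entity));
    }
}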

In the case of Hibernate or JPA or similar frameworks this can be especially interesting, even with eager message calculation. Accessing attributes of the object can actually lead to database operations. They can fail. They can create load, maybe deadlocks. They can have lost their transaction. A lot of things can happen in places far away from where we assumed the DB work would be done. This actually changes the objects. Do we want such operations to occur during logging? Maybe differently depending on the log level? Immutability is our friend, especially in conjunction with JPA, but that is a long story. We may at least be lucky enough to actually have some tables that we only read. We can make the objects „pseudo-immutable“, but still JPA magic must mutate them at least during the read operation. It is tempting to let tools generate the toString methods of objects, but it is very dangerous here. We should avoid including any potentially lazily loaded attributes in the toString output, because otherwise they will be loaded during the logging, or even worse, differently depending on how we log.
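
A hedged sketch of what that can mean in practice, using javax.persistence annotations; the entities and attributes here are purely hypothetical and only illustrate keeping lazily loaded attributes out of toString:

import javax.persistence.Entity;
import javax.persistence.FetchType;
import javax.persistence.Id;
import javax.persistence.JoinColumn;
import javax.persistence.OneToMany;
import java.util.List;

@Entity
public class PurchaseOrder {

    @Id
    private Long id;

    private String customerName;

    // Lazily loaded; touching this in toString() could trigger a database query
    // or fail if the session or transaction is already gone.
    @OneToMany(fetch = FetchType.LAZY)
    @JoinColumn(name = "order_id")
    private List<OrderItem> items;

    @Override
    public String toString() {
        // Deliberately leaves out the lazy collection, so logging this object
        // never causes additional database access.
        return "PurchaseOrder{id=" + id + ", customerName=" + customerName + "}";
    }
}

@Entity
class OrderItem {
    @Id
    private Long id;
    private String description;
}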

The next thing is the NullPointerException during logging. It is quite common in Java, for example. And we do not want to burden our program logic with NullPointerExceptions from the logging, especially not with those that occur only sporadically. So it is a good idea to be careful and to test well. Only the combination is possibly good enough.

Modern times create more demand for some kind of real multi threading, not in the JEE-sense with a couple of EJBs that can run in parallel, but with massively parallel operations. Even though we have a multitude of logging frameworks and unifying logging frameworks and even more of them, there is a common weakness that they tend to share. Writing into one single target is achieved by some kind of synchronizing, which can slow our application down and change the timing behavior in ways that we did not desire. Asynchronous logging could be good, but in a way this only shifts the problem a bit.


DB Persistence without UPDATE and DELETE

When exploring the usage of databases for persistence, the easiest case is a database that does only SELECT. We can cache as much as we like and it is more or less the functional immutable world brought to the database. For working on fixed data and analyzing data this can sometimes be useful.

Usually our data actually changes in some way. It has already been discussed in this blog that it would be possible to extend the idea of immutability to the database, which would be achieved by allowing only INSERT and SELECT. Since data can correlate, an INSERT into a table that is understood by the application as a sub-entity via a one-to-many relationship actually mutates the containing entity. So it is necessary to look at this in terms of the actual OR mapping of all applications that are running on that DB schema.

Life can be simple if we actually have self-contained data, as with MongoDB or by having a JSON column in PostgreSQL, for example. Then inter-table relations are eliminated, but of course this does not even follow the first normal form. This can be OK or not, but at least there are good reasons why best practices have been introduced in the relational DB world and we should be careful about that. Another approach is to avoid the concept of sub-entities and only work with IDs that are foreign keys. We can query them explicitly when needed.

An interesting approach is to have two ID columns. One is an id that is unique in the DB table and increasing for newly created data. One is the entity-ID. This is shared between several records referring to different generations of the same object. New generations are created each time we change something and persist the changes, and in a simple approach we just consider the newest record with that entity-ID valid. It can of course be enhanced with validFrom and validTo. Then each access to the database also includes a timestamp, usually close to the current time, but kept constant across a transaction. Only records for which validFrom <= timestamp < validTo are considered, and within these the newest. The validFrom and validTo can form disjoint intervals, but it is up to the application logic if that is needed or not. It is also possible to select the entry with the highest ID among the records with a given entity-ID and timestamp-validTo/From-condition. Deleting records can be simulated by this as well, by allowing a way to express a "deleted" record, which means that in case we find this deleted record by our rules, we pretend not having found anything at all. But still referential integrity is possible, because the pre-deletion data are still there. This concept of having two IDs was inspired by a talk that I saw during Clojure Exchange 2017: Immutable back to front.
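
A small sketch in Java (16+ records) of the selection rule described above; the names and the in-memory representation are just illustrative, in practice this would of course be a query against the versioned table:

import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class VersionedLookup {

    // One generation of an entity: id is unique and increasing,
    // entityId is shared by all generations of the same logical object.
    record Version(long id, long entityId, Instant validFrom, Instant validTo,
                   boolean deleted, String payload) {
    }

    // Pick the record valid at the given timestamp: validFrom <= t < validTo,
    // and among those the one with the highest id; a "deleted" marker means "not found".
    static Optional<Version> findCurrent(List<Version> versions, long entityId, Instant t) {
        return versions.stream()
                .filter(v -> v.entityId() == entityId)
                .filter(v -> !v.validFrom().isAfter(t) && t.isBefore(v.validTo()))
                .max(Comparator.comparingLong(Version::id))
                .filter(v -> !v.deleted());
    }
}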
