Cassandra DB

The large zoo of NoSQL databases needs to be considered thoroughly.

While the major transactional SQL databases (Oracle, PostgreSQL, MS-SQL-Server, DB2 and to a limited extent MariaDB/MySQL) can be used more or less interchangeably, if we are talking about the beginning of the project, the NoSQL databases have totally different features and must be chosen wisely and adequately. While the typical pattern of many NoSQL databases is trading speed for complex transactions, this is by no means compulsory. There are SQL databases like Teradata, that have reduced support for transactions, but feature a full set of SQL for queries. This can be used for data warehousing, where the data changing operations take place in a more controlled manor, because the data is already known and present elsewhere, but queries are the important thing. On the other hand there are NoSQL-databases like Neo4j that support full ACID-transactions, but organize the data in a totally different way that might be more adequate than a relational database for many purposes. This shows the other motivation to go for NoSQL-databases. While the relational model is somewhat complete and there are always ways to express whatever is desired with it, it may not be a natural or efficient or convenient way to model certain data.

Some databases, like MongoDB allow additional indexing and thus efficient queries without full table scans and additional uniqueness constraints, which is a must have feature for a good transactional SQL database, but not so much for NoSQL databases in general.

Some NoSQL-databases are typically run in memory. While there are some SQL databases that run in memory as well, like H2, this kind of defeats the idea of ACID, so it is not really a full transactional SQL database anyway, at least not in this mode.

SQL databases are well known to many people and can do many jobs quite well. If the data model can be expressed well in the SQL database and the performance is sufficient, this may be a good way to go. The powerful query and data manipulation language is usually extremely helpful and usually you get only a shadow of what a powerful SQL dialect like PostgreSQL or Oracle can do when you move to a NoSQL database.

Now NoSQL databases are somewhat scary. The data is „not safe“ in them, because there are no transactions. But what about the transactional relational SQL database? If we use it through some implicit or explicit caching layers, we get rid of this feature without being aware of it. JPA or hibernate tend to do implicit caching and can even use file system persistency outside of the database as a „second level cache“. I think this is just so broken, but it is used and good applications are built with it. But the feature of the good old transactional SQL database that serves as argument to use it has silently been thrown into the garbage can in this project.

So we should think if and where we need serious transactions and then really write our software in a way that does not defeat them by caching, preferably by avoiding JPA and Hibernate. Or we should pick our database solution for the purpose based on the modelling features and performance features.

Cassandra comes in as one of the NoSQL products with very good write performance and scalability. The query language kind of looks like SQL, but it is a tiny subset. If we can live with that subset and it is relevant to get more performance than we would get from a PostgreSQL or Oracle, this might be a good way to go. An important observation is that tables in Cassandra do have primary keys. But additional unique keys cannot be defined and instead lookup tables. This works like this:

CREATE TABLE IF NOT EXISTS „A“ (
„X“ text,
„Y“ text,
„Z“ text,
„T“ text,
PRIMARY KEY(„X“)
);

CREATE TABLE IF NOT EXISTS „A_LOOKUP_BY_Y“ (
„Y“ text,
„X“ text,
PRIMARY KEY(„Y“)
);

Lookups by X can be done directly in A. Lookups by Y have to be done in two steps. First the lookup table is used to determine the value for X and the table A is used for the actual lookup of the data.

There are ways to do interesting and useful things that seem to be missing and it is usually worth investigating how they can be done. When knowing how the different tools that are available can be used efficiently and properly, it may be a good moment to actually decide what tool to use. But a serious comparison should compare well done solutions of both technologies and not just a junior’s first day program in one technology against a professional solution done by a senior.

It is always important to have a look at the different NoSQL databases to decide which ones are going to be used in the team or in the organization. Cassandra can be a useful database for this approach. Often it is good to also have a SQL database in this set of supported databases. It is easier to model for example accounts, booking and payments with them, while a other data may be better stored in adequate NoSQL databases.

By the way, while major transactional SQL databases may be somewhat interchangable in the beginning of a project, it is quite a pain to change them later or to support multiple databases. Also there are of course licensing issues, teams are needed to operate the database, databases may run better on certain operating systems and there may be some additional features that might make them more or less desirable. But the basic functionality differs less than in the case of NoSQL databases.

Share Button