Databases and Immutable Objects


A common approach in functional programming, and a good practice even in object oriented programming, is to prefer immutable objects.

This is extremely helpful especially in multithreaded applications, but in principle it also helps to keep the information flow under control and to avoid unexpected side effects. I do not want to postulate this as a dogma, since there are actually legitimate uses of mutable objects, but it is a good idea to keep in mind what mutability means and implies where it is used and to question whether it is really worth it in the particular case.

It is important to differentiate between truly immutable objects and those that are handed out via a wrapper or a second interface in an immutable way, while still being accessible for changes by those parts of the code that control the mutable interface.

Example:

import java.util.*;

public class A {
    // mutable backing list, only changeable through addStr(..)
    private List<String> l = new ArrayList<>();

    public void addStr(String s) {
        l.add(s);
    }

    // hands out an unmodifiable view of the backing list
    public List<String> getList() {
        return Collections.unmodifiableList(l);
    }
}

Even though the list provided by getList() cannot be modified through its own interface, its contents can still be changed by the addStr(..) method of A.
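A short usage sketch makes the effect visible (the class name Demo is just for illustration): a modification through A shows up in the supposedly immutable view, while modifying the view directly fails.

import java.util.*;

public class Demo {
    public static void main(String[] args) {
        A a = new A();
        List<String> view = a.getList();
        System.out.println(view.size()); // prints 0
        a.addStr("hello");
        System.out.println(view.size()); // prints 1: the view reflects the change
        view.add("world");               // throws UnsupportedOperationException
    }
}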

The whole story gets more interesting when involving databases.

Database contents are in some way mapped to objects. Whether this mapping happens via automatisms like OR mappers or Active Record, or whether it is done explicitly using mechanisms like JDBC or DBI, is secondary for the moment. These objects can be based on code specifically written for the particular usage, on generic DB content objects, or just on plain old collections; this should be considered an implementation detail for the moment. What is important is that such objects or data structures expressing DB content exist in the software. For simplicity’s sake they will be referred to as objects throughout this text.

Now DB content can change. This can happen due to activities of the software we are dealing with right now. But the database can also be accessed by other software or by a DB administrator, causing changes in the database content while our software is running. This already results in some complications, because the objects processed by the software can be outdated while being processed. This would be acceptable if there were no caching mechanisms for database content on the software and framework level, that is, if caching were left entirely to the database software. Unlike framework and software caches, the caches of the database itself can be truly transaction conformant. Of course it is good to have the DB not too far away from the application in terms of network connectivity, maybe even on the same machine, depending on the access patterns. If the network connectivity is poor, this approach ends up having the cache on the wrong side of the network connection.

If the objects within the software have a lifetime so short that they cannot get out of sync with the database contents they are representing, things should be fine. This can be achieved by using the transaction isolation level „serializable“, which excludes phantom reads, and by discarding all these objects before the end of the transaction. Within the transaction, objects that have been read during that same transaction are guaranteed to be up to date, as a result of the transaction isolation concept used here. If certain OR mapping patterns needed to map collections to database contents are not involved, it is possible that „repeatable read“ is already enough to guarantee this. As long as objects are only read from the database (SELECT or READ or FIND), immutable objects work just fine.
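With plain JDBC this pattern could look like the following sketch; the table name booking is just a hypothetical example.

import java.sql.*;

public class ShortLivedRead {
    // sketch: read DB content within one serializable transaction and let
    // the objects die with the transaction
    public static long countBookings(String url, String user, String password)
            throws SQLException {
        try (Connection conn = DriverManager.getConnection(url, user, password)) {
            conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
            conn.setAutoCommit(false);
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM booking")) {
                rs.next();
                long count = rs.getLong(1);
                conn.commit();
                return count;
            }
        }
    }
}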

Even deleting (DELETE) can be imagined, but in this case it is important to ensure that the corresponding object is not in use in other parts of the software even though the database content represented by it has already been deleted. When ensuring that objects representing database content do not live longer than the corresponding transaction, this should be possible to deal with. Creating new objects (INSERT or CREATE) together with the corresponding database content should be possible as well.

What is most problematic is the changing of data (UPDATE). Common OR mappers do this by reading the object, changing it and then saving it, hoping that the framework will figure out the differences and run the appropriate update, possibly multiple updates in conjunction with inserts and deletes. This becomes difficult with immutable objects, because immutability blocks the typical approach just described. Ways can be found to accomplish this anyway, but they tend to come at the expense of the elegance of the framework. Even more importantly, it becomes necessary to deal with the situation that objects expire even within the transaction and need to be replaced by new objects representing the changed database contents.
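One conceivable way to reconcile updates with immutability, shown here only as a sketch and not as the mechanism of any particular framework, is to derive a changed copy of the immutable object and hand that copy to the persistence layer, which then issues the UPDATE; the original object is stale from that moment on and must be discarded, as described above. The entity Customer is hypothetical.

// sketch of an immutable entity; a change produces a new object
public final class Customer {
    private final long id;
    private final String name;

    public Customer(long id, String name) {
        this.id = id;
        this.name = name;
    }

    public long getId() { return id; }
    public String getName() { return name; }

    // "wither": returns a changed copy instead of mutating this object
    public Customer withName(String newName) {
        return new Customer(id, newName);
    }
}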

The provocative question is whether updates are needed at all. Of course they are needed, since UPDATE is one of the four basic database operations. But it is worth taking a closer look at the cases in which updates really make sense and the cases in which other approaches are attractive alternatives.

A classic is an application that is dealing with some kind of bookings and some kind of subjects to which accounts are attached, typically persons or companies or other entities, but we can go more abstract than that. For simplicity we can assume persons as an example. This can be a banking software with accounts, account owners and whatever is needed on top of that. Or the billing system of a phone company that calculates, creates and manages invoices to customers based on their phone usage and subscriptions.

It is commonly seen that there is a booking table to which records can be added, but never deleted or changed. Deleting or changing would falsify the bookkeeping. For canceling a booking, a cancel entry can be added, which is actually a booking in the reverse direction marked as a cancellation of the original booking. Maybe there is an archiving process that moves old data to a data warehouse, thus actually performing a delete in the booking table and replacing the old bookings by some summary entry, but that is really a detail not relevant for this article, as long as this archiving process is done in a good way that does not compromise the regular usage of the software and database, maybe by actually taking the system down for maintenance during this archiving process once a year.

Usually some kind of balance is needed quite often. In principle this can easily be calculated at any time by just adding up the bookings of that particular account. This is in line with normalization requirements. Unfortunately it will slow down the software to an extent that makes it useless for any practical purposes, but in theory this approach is extremely beautiful. So a more efficient way of finding balances is needed. Possibly each booking could contain the resulting balance, reducing the task to looking for the newest booking of a particular account. That can be dangerous if two bookings generated simultaneously refer to the same previous balance. With a transaction isolation level of „serializable“ that would not happen, but it might slow down the database quite a bit, because it reduces the performance of access to the largest table and thus worsens the bottleneck that might already exist there.

It could be solved on the database level, for example with triggers or materialized views, so that each insert into the booking table influences the balance automatically. This should work fine for all kinds of accesses to the system by any software, as long as it is constrained from doing any delete or update on the booking table. Actually, granting only INSERT and SELECT rights on the booking table to the user under which the software is running is a very good idea. It could be seen as a disadvantage that part of the business logic is moved to the database. Maybe this is acceptable by just understanding the balance, being the sum of the bookings of the account, as some kind of high level constraint which is actively enforced by the database. Another approach would be to provide a functionality in the software that inserts a booking, or more likely a set of bookings comprising a transaction, and that always keeps the balance updated.
This can go wrong if simultaneous bookings for the same account can happen, meaning that a transaction for a second booking starts while the transaction of a first booking for the same account has not completed. This seems simple to avoid by just providing a queue for each account and working through that queue, but since transactions usually include bookings on several different accounts, this can become quite messy to implement while still avoiding deadlocks. It must be quite interesting to build such a system in a way that works fine even under high load. In any case, objects containing the balance of an account tend to age extremely fast, so they have to be handled with care as short-lived objects within a transaction, or even shorter if that account is touched by bookings in the same transaction.
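To make this more concrete, here is a minimal JDBC sketch, assuming hypothetical tables ACCOUNT(ID, BALANCE) and BOOKING(ACCOUNT_ID, AMOUNT, BOOKED_AT). Locking the account row serializes simultaneous bookings for the same account; to avoid the deadlocks mentioned above, a real implementation would also have to lock all accounts of a transaction in a consistent order.

import java.math.BigDecimal;
import java.sql.*;

public class BookingDao {
    // sketch: insert a booking and keep the balance updated in one transaction
    public void insertBooking(Connection conn, long accountId, BigDecimal amount)
            throws SQLException {
        conn.setAutoCommit(false);
        try {
            // lock the account row so that simultaneous bookings for the
            // same account are serialized by the database
            BigDecimal balance;
            try (PreparedStatement lock = conn.prepareStatement(
                    "SELECT balance FROM account WHERE id = ? FOR UPDATE")) {
                lock.setLong(1, accountId);
                try (ResultSet rs = lock.executeQuery()) {
                    rs.next();
                    balance = rs.getBigDecimal(1);
                }
            }
            try (PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO booking (account_id, amount, booked_at)"
                    + " VALUES (?, ?, CURRENT_TIMESTAMP)")) {
                insert.setLong(1, accountId);
                insert.setBigDecimal(2, amount);
                insert.executeUpdate();
            }
            try (PreparedStatement update = conn.prepareStatement(
                    "UPDATE account SET balance = ? WHERE id = ?")) {
                update.setBigDecimal(1, balance.add(amount));
                update.setLong(2, accountId);
                update.executeUpdate();
            }
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}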

On the other hand, the table with persons and addresses is commonly kept up to date using updates whenever data changes. People marry, change their phone number or move to another address. But is this really the best approach? In reality, moving to another address can be known in advance, and maybe it is not a good requirement that this change of address has to be entered on the exact date when the person actually moves. Or who wants to spend his wedding day updating all kinds of name and address fields in all kinds of web applications? OK, this can be cheated on, because it is usually not a big deal if the change is done a little bit too late. But for some purposes exact data is really desirable. And then again, is this really the right approach? Sometimes it is required to be able to answer questions like what the name and address of the customer were while booking 23571113a was being performed. Maybe it is a good idea to store address changes as some kind of bookings as well, with a valid-since field. For finding a customer’s current address, it is sufficient to look for the newest entry for that particular customer that is not in the future, as the sketch below shows. With this approach, new entries need to be created for these changes, and they can even be prepared in advance. Even deleting a customer can be accomplished by adding a „deleted“-entry with a particular valid-since timestamp to that table.
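As a sketch, assuming a hypothetical table ADDRESS_ENTRY(CUSTOMER_ID, VALID_SINCE, ADDRESS), the lookup could be done like this in JDBC. The FETCH FIRST clause is standard SQL; older databases would use LIMIT or ROWNUM instead.

import java.sql.*;

public class AddressDao {
    // sketch: the current address is the newest entry that is not in the future
    public String findCurrentAddress(Connection conn, long customerId, Timestamp now)
            throws SQLException {
        String sql = "SELECT address FROM address_entry"
                + " WHERE customer_id = ? AND valid_since <= ?"
                + " ORDER BY valid_since DESC"
                + " FETCH FIRST 1 ROW ONLY";
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setLong(1, customerId);
            stmt.setTimestamp(2, now);
            try (ResultSet rs = stmt.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }
}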

The beautiful part is that now certain data in the database can be considered immutable, since no updates can occur to it. So dealing with it as immutable in the software becomes a viable approach and actually also a must. These objects can even be kept in the system longer than the duration of a transaction. It remains important to be careful and to understand the OR mapping, since complex objects representing collections can still change due to inserts. An example would be an account object containing the account and all its bookings. The customer address needs to be dealt with in conjunction with the timestamp that was used for acquiring it. So it is a good practice to freeze the „now“-timestamp at the beginning of a large operation and to use it consistently throughout that operation in order to have a consistent view of the data. If this timestamp gets too old, it might be a good idea to refresh it and to reread all data attached to it. Unlike for bookings, the times of address changes usually need not be accurate to the microsecond, as long as consistency is guaranteed and no weird mixture of old and new addresses ever occurs.
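A small sketch of this practice, reusing the hypothetical findCurrentAddress(..) from above:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Instant;

public class InvoiceService {
    private final AddressDao addressDao = new AddressDao();

    // sketch: freeze "now" once at the beginning of the operation and use it
    // consistently, so that all lookups see the data as of the same instant
    public void createInvoice(Connection conn, long customerId) throws SQLException {
        Timestamp frozenNow = Timestamp.from(Instant.now());
        String billingAddress = addressDao.findCurrentAddress(conn, customerId, frozenNow);
        // ... all further lookups within this operation reuse frozenNow ...
    }
}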


Oracle Bug with empty strings


Oracle databases treat the empty string and null as the same value. This contradicts the SQL specification and is therefore a bug.

Unfortunately, it would be almost impossible for Oracle to fix this, because too many applications that have been using Oracle for decades rely on this behavior.

It is therefore crucial to know about this bug when dealing with Oracle databases and to take it into account. As long as the database is accessed explicitly using SQL, SQL*Plus, JDBC, DBI, Pro*C etc., it is relatively easy to deal with. But it becomes tricky when using frameworks like Hibernate, where weird errors might be observed. If you have an entity that is persisted and has a string attribute, it should be legitimate to assign the empty string to it. But once this object travels through Hibernate and gets persisted, this empty string has become null, resulting in unexpected NullPointerExceptions. Or, if the attribute has a NOT-NULL constraint in the database, this will result in an Oracle error because of the constraint violation. On top of this it might become a semantic issue, when empty strings and null are actually considered to have different meanings.

An ugly, but practical way to deal with it is to change the getter and setter methods of the string attributes accordingly. Some kind of „magic string“ can be defined which will never ever occur as a value of this string attribute. The getters and setters use this magic string to internally represent the empty string. It is necessary to annotate the attribute, not the getter, to make this work with JPA or Hibernate.

Example in Java (without JPA annotations, synchronization, toString(), hashCode(), equals() etc.; it would be similar in other languages):


public class MyEntity {
  // magic string that must never occur as a real value of the attribute
  private static final String MAGIC_VALUE = "_._._«NULL»_._._";
  private String value;

  public String getValue() {
    // translate the internal magic representation back to the empty string
    if (MAGIC_VALUE.equals(value)) {
      return "";
    } else {
      return value;
    }
  }

  public void setValue(String newValue) {
    if (MAGIC_VALUE.equals(newValue)) {
      // the magic value itself must be rejected as a regular value
      throw new IllegalArgumentException("value=" + newValue + " not allowed");
    } else if ("".equals(newValue)) {
      // store the magic value internally so Oracle never sees an empty string
      this.value = MAGIC_VALUE;
    } else {
      this.value = newValue;
    }
  }
}

To make this work even with short strings, defined as VARCHAR2(1) or VARCHAR2(2) in the database, it is perhaps a good idea to use control characters or UTF-8 characters that are not going to be used otherwise. But it remains an unsatisfactory and ugly solution.

Does anyone have a better idea?

Are there any other databases that have the same issue?


Latency of DB-connections


Applications that use a database often run slower than one would expect given the power of the hardware and software. Both the DB server and the application server run on powerful machines that are nowhere near their limits.

The problem is the connection between the two servers. This connection is of course very fast in terms of bandwidth, which can be quite impressive when huge amounts of data need to be transferred at once, ideally the huge result set of a single query.

Typical database access operations need multiple accesses to the DB server. Each of these works according to a communication protocol, which usually implies some round trips. So many small accesses to the database tend to slow down the whole application, because it spends most of its time waiting for the connection.
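To get a rough sense of the scale: assuming a round-trip latency of 0.5 ms between the two servers, a use case that issues 10 000 small queries sequentially spends about 5 seconds just waiting for the network, no matter how fast both servers are.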

What can be done about this?
First of all the real bottleneck should be identified before optimizing something that is not the problem.

It does make a lot of sense to optimize the application towards larger and fewer queries, so that more data is transferred at once. This may not be easy with frameworks, especially if EclipseLink or Hibernate is used. A typical anti-pattern, often called the N+1 select problem, is to read a large number of records from the database and then fetch detail information for each record from another table, with one request per record. This can usually be avoided by using more complex SQL queries, as the sketch below illustrates, and it is sometimes worth the effort. As always, optimizations should only be done where they are either not very intrusive or where the benefit is likely to justify them.
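As an illustration of what this can look like with JPA, here is a sketch assuming hypothetical entities Invoice and Item with a one-to-many relation Invoice.items:

import java.util.List;
import javax.persistence.EntityManager;

public class InvoiceQueries {
    // N+1 anti-pattern: one query for the invoices, then one more query
    // per invoice as soon as its lazily loaded items are touched
    public List<Invoice> loadSlowly(EntityManager em) {
        List<Invoice> invoices =
            em.createQuery("SELECT i FROM Invoice i", Invoice.class).getResultList();
        for (Invoice invoice : invoices) {
            invoice.getItems().size(); // triggers a separate SELECT each time
        }
        return invoices;
    }

    // better: one single query fetching invoices together with their items
    public List<Invoice> loadWithJoinFetch(EntityManager em) {
        return em.createQuery(
                "SELECT DISTINCT i FROM Invoice i JOIN FETCH i.items",
                Invoice.class).getResultList();
    }
}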

Another, more general optimization is to put the application and the database on the same server. This is often not easy, because many system administrators do not like it. But if sufficient performance can only be achieved by putting everything on one server, this is recommended. Maybe it is an advantage of NoSQL databases that system administrators do not know any best practices suggesting to install database and application on different servers, so they can run on the same server, thus eliminating most (but not all) of the latency.
