Databases and Immutable Objects

Deutsch

A common approach in functional programming and even a good practice in object oriented programming is to prefer having objects immutable.

Especially in applications with multithreading this is extremely helpful, but in principal it helps having the information flow under control and avoiding unexpected side effects. I do not want to postulate this as a dogma, since there are actually legitimate uses of mutable objects, but it is a good idea to keep in mind what mutability means and implies where it is used and to question if it is really worth it in the particular case.

It is important to differentiate between really immutable objects and those that are handed out via a wrapper or a second interface in an immutable way while still being accessible for changes by those parts of the code that control the mutable interface.

Example:
import java.util.*;


public class A {

    List l = new ArrayList();

    

    public void addStr(String s) {

        l.add(s);

    }

public List getList() { return Collections.unmodifiableList(l); } }

Even though the list provided by getList() is by itself immutable, it can still be changed by the addStr(..)-method of A.

The whole story gets more interesting when involving databases.

Database contents are in some way mapped to objects. If this mapping is done via automatisms like OR-mappers or Active-Record or if it is done explicitely by using mechanisms like JDBC or DBI is secondary for the moment. These objects can be based on specifically written code for the particular usage or generic DB-content-objects or just plain old collections, which should be considered as an implementation detail for the moment. Important is that such objects or data structures expressing DB content exist in the software. For simplicity’s sake they will be referred to as objects throughout this text.

Now DB content can change. This can happen due to activities of the software with which we are dealing right now. But there can also be accesses to the database by other software or by an DB administrator which can cause changes in the database content while the software is running. This already results in some complications, because the objects processed by the software are already outdated while being processed. This would be acceptable if there were no caching mechanisms for database content on software and framework level, leaving the caching to the database software. Other than framework and software caches the caches of the database itself can be truly transaction conform. Off course it is good having the DB in terms of network connectivity not too far away from the application, maybe even on the same machine, depending on the access patterns. If the network connectivity is poor, this approach ends up having the cache on the wrong side of the network connection.

If the objects within the software have a life time that is so short that it can be avoided that they are out of sync with the database contents they are representing, things should be fine. This can be achieved by using the transaction isolation of „serializable“ or „phantom-read“ and by discarding all these objects before the end of the transaction. Within the transaction these objects that have been read during the same transaction are guaranteed to be up to date, as a result of the concept of transaction isolation used here. If certain OR-mapping patterns needed to map collections to database contents are not involved it is possible that „repeatable-read“ is already enough to guarantee this. As long as objects are only read from the database (SELECT or READ or FIND) immutable objects work just fine.

Even deleting (DELETE) can be imagined, but in this case it is important to ensure that the corresponding object is not in use in other parts of the software even though the database content represented by it have already been deleted. When ensuring that objects representing database content do not live longer than the corresponding transaction this should be possible to deal with. Also creating new objects (INSERT or CREATE) and the corresponding database content should be possible.

What is most problematic is changing of data (UPDATE). Common OR mapper do this by reading the object, changing it and then saving it, hoping that the framework will figure out the differences and run the appropriate update, possibly multiple updates in conjunction with inserts and deletes. This becomes difficult with immutable objects, because that blocks the typical approach just described. Ways can be found to accomplish this anyway, but they tend to get this done at the expense of the elegance of the framework. Even more important it becomes relevant to deal with the situation that objects expire even within the transaction and need to be replaced by new objects representing the changed database contents.

The provocative question is if updates are at all needed. Off course they are needed, since update is one of the four basic database operations. But it is worth taking a closer look in which cases updates really make sense and in which cases other approaches are attractive alternatives. A classic is an application that is dealing with some kind of bookings and some kind of subjects to which accounts are attached, typically this can be persons or companies or other entities, but we can go more abstract than that. For simplicity we can assume persons as an example. This can be a banking software with accounts, account owners and whatever is needed on top of that. Or a billing system of a phone company that calculates, creates and manages invoices to customers based on their phone usage and subscriptions. It is commonly seen that there is such a booking table to which records can be added, but never deleted or changed. That would falsify the book keeping. For canceling a booking, a cancel-entry can be added which is actually a booking in the reverse direction marked as cancel for the original booking. Maybe there is an archiving process that moves old data to a data warehouse, thus actually performing a delete in the booking table and replacing the old bookings by some summary entry, but that is really a detail not relevant for this article as long as this archiving process is done in a good way not compromising the regular usage of the software and database, maybe by actually taking the system down for maintenance during this archiving process once a year. Usually some kind of balance is needed quite often. In principal this can easily be calculated any time by just adding up the bookings of that particular account. This is in line with normalization requirements. Unfortunately this will slow down software to an extent that it is not useful for any practical purposes, but in theory this approach is extremely beautiful. So a more efficient way of finding balances is needed. Possibly each booking could contain the resulting balance reducing the task to looking for the newest booking of a particular account. That can be dangerous if two bookings generated simultaneously refer to the same previous balance. With transaction isolation of „serializable“ that would not happen, but this might slow down the database quite a bit because it reduces the performance of access to the largest table and thus worsening the bottle neck that might already exist there. It could be solved on database level for example with triggers or materialized views so that each insert to the booking table influences the balance automatically. This should work find for all kinds of accesses to the system by any software as long as it is constrained from doing any delete and update to the booking table. Actually granting only INSERT and SELECT rights to the booking table for the user under which the software is running is a very good idea. It could be seen as a disadvantage that part of the business logic is moved to the database. Maybe this is acceptable by just understanding the balance being the sum of the bookings of the account as some kind of high level constraint which is actively enforced by the database. Another approach would be to provide a functionality in the software that inserts a booking or more likely a set of bookings comprising a transaction and that always keeps the balance updated. This can go wrong if simultaneous bookings for the same account can happen, meaning that a transaction for a second booking start while the transaction of a first booking of the same account has not completed. This seems to be simple to avoid by just providing some queue for each account and working through the queue, but since the transactions usually include bookings on several different account this can become quite messy to implement while still avoiding deadlocks. It must be quite interesting to build such a system in a way that it will work fine even under high load. In any case objects containing the balance of an account tend to age extremely fast so they have to be handled with care as short lived objects within a transaction or even shorter, if that account is touched by bookings to that account in the same tranaction.

On the other hand the table with persons and addresses is commonly kept up to date using updates whenever data changes. People marry, change their phone number or move to another address. But is this really the best approach? In reality moving to another address could be known in advance and maybe it is not a good requirement that this change of address has to be entered on the exact date when the person actually moves. Or who wants to spend his wedding day updating all kind of name and address fields in all kinds of web applications? Ok, this can be cheated on because it is usually not a big deal if the change is done a little bit too late. But for some purposes exact data are really desirable. And then again, is this really the right approach? Sometimes it is required to be able to answer questions like name and address of the customer while booking 23571113a was being performed. Maybe it is a good idea to store address changes as some kind of bookings as well, with a valid-since field. For finding a customer’s current address it is sufficient to look for the newest entry for that particular customer that is not in the future. With this approach new entries need to be created for these changes and they can even be prepared in advance. Even deleting a customer can be accomplished by added a „deleted“-entry with a particular valid-since timestamp to that table.

The beatiful part is that now certain data in the database can be considered immutable since no updates can occur to them. So dealing with them as immutable in the software becomes a valable approach and actually also a must. These objects can even be kept in the system longer than the duration of a transaction. It remains important to be careful and to understand the OR-mapping since complex objects represeting collections can even change due to inserts. An example would be an account object containing the account and all its bookings. Customer address needs to be dealt with in conjunction with the timestamp that was used for acquiring it. So it is a good practice to freeze the „now“-timestamp at the beginning of a large operation and to consistently use it throughout that operation to have a consistent view of the data. If this timestamp gets too old, it might be a good idea to refresh it and to reread all data attached to the timestamp. Other than for bookings the times of address changes usually need not be accurate to the micro second as long as consistency is guaranteed and no weird mixture of old and new addresses ever occurs.

Beteilige dich an der Unterhaltung

Schreibe einen Kommentar

Antworten abbrechen