Bug in Oracle Databases with Empty Strings


It seems that all common versions of the Oracle database do not distinguish between empty strings and null. This contradicts the SQL specification and is therefore a bug.

Unfortunately, it would not even be easy for Oracle to remove this bug, because too many applications have been using Oracle for decades and have come to rely on this behavior.

The crucial first step is to know this bug and to take it into account when using Oracle databases. As long as one accesses the database explicitly with SQL, i.e. via SQL*Plus, JDBC, DBI, Pro*C and the like, it is still relatively easy to deal with, but it becomes tricky when using frameworks like Hibernate, where strange errors show up. If, for example, one has an entity that is persisted and one of its attributes is a string, it is perfectly legitimate to assign the empty string to that attribute. As soon as this object passes through Hibernate and is persisted, the attribute suddenly contains null instead, and one gets unexpected NullPointerExceptions. Or the attribute has a NOT-NULL constraint in the database and one gets Oracle errors when writing. Beyond that, it is of course a semantic problem: one might want to be able to store both empty strings and null there and treat them with different meanings.
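
The effect can be demonstrated with plain JDBC. A minimal sketch, assuming an Oracle database behind a hypothetical connection URL with hypothetical credentials and a hypothetical table demo_table with columns id and value:

import java.sql.*;

public class EmptyStringDemo {
  public static void main(String[] args) throws SQLException {
    try (Connection conn = DriverManager.getConnection(
        "jdbc:oracle:thin:@//localhost:1521/XEPDB1", "scott", "tiger")) {
      try (PreparedStatement ps = conn.prepareStatement(
          "INSERT INTO demo_table (id, value) VALUES (?, ?)")) {
        ps.setInt(1, 1);
        ps.setString(2, ""); // write an empty string...
        ps.executeUpdate();
      }
      try (Statement st = conn.createStatement();
           ResultSet rs = st.executeQuery("SELECT value FROM demo_table WHERE id = 1")) {
        rs.next();
        // ...and read back null: on Oracle this prints true.
        System.out.println(rs.getString(1) == null);
      }
    }
  }
}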

An ugly but workable way to deal with this is to change the getters and setters of the string attributes accordingly. One can define a "magic string" that is guaranteed never to be stored there as a real value. The getter can then check whether the attribute has this string as its value and return the empty string in that case. And the setter first checks whether the new value is this magic string, which is rejected with an exception, and otherwise whether the new value is the empty string, which is replaced by the magic string.

Example in Java (without JPA annotations, synchronization, toString(), hashCode(), equals(), etc.; works analogously in other programming languages):


public class MyEntity {
  // Placeholder that is stored instead of the empty string. It must be a
  // value that is guaranteed never to occur as legitimate data.
  private static final String MAGIC_VALUE = "_._._«NULL»_._._";
  private String value;

  public String getValue() {
    // Translate the placeholder back into the empty string.
    if (MAGIC_VALUE.equals(value)) {
      return "";
    } else {
      return value;
    }
  }

  public void setValue(String newValue) {
    if (MAGIC_VALUE.equals(newValue)) {
      // The magic value itself must never be stored as real data.
      throw new IllegalArgumentException("value=" + newValue + " not allowed");
    } else if ("".equals(newValue)) {
      // Replace the empty string with the placeholder so that Oracle
      // cannot turn it into null.
      this.value = MAGIC_VALUE;
    } else {
      this.value = newValue;
    }
  }
}
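
Used like this, the entity preserves the distinction, because internally MAGIC_VALUE is stored and would be written to the database column instead of the empty string:

MyEntity entity = new MyEntity();
entity.setValue("");
System.out.println(entity.getValue().isEmpty()); // prints true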

For the whole thing to work with attributes that are defined as VARCHAR(1) or VARCHAR(2) in the database, it might be a good idea to use control characters or UTF-8 characters from a script that one never intends to support. But it remains an unsatisfying solution.
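
For example, the magic value in the sketch above could be shrunk to a single character; assuming that control characters never occur in the legitimate data, something like this would still fit into a VARCHAR(1) column:

// Hypothetical one-character magic value (an ASCII control character):
private static final String MAGIC_VALUE = "\u0001";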

Does anyone have a better idea for this?

Are there other databases that share this bug with Oracle?


Data Quality


Very often we experience software that is not working well. Often this is a problem of the software itself, which we all know quite well.

Experience shows, however, that even more often the problem lies in the data on which the software operates. In short: junk in, junk out.

In organizations that use software, it is a good idea to keep an eye on the underlying data. For software we are used to development processes that at least pretend to take testing and bug-fixing seriously. Formalized processes exist for this, and in serious IT projects they prove helpful for reducing the number of software bugs. The issue of data quality often lies outside these processes, because data content is often provided by the business side, where such quality processes have not always been established. Often data is also exchanged between different systems, which can introduce quality problems of its own.

Some questions about this:

  • Do we know the path of data through the various systems?
  • Is any person responsible for data quality?
  • Is the data quality checked?
  • Is there an efficient process to correct data?

Such questions often have less clear answers than similar questions about software quality, where the development process is well defined, at least initially, responsibilities are known, and at least some rudimentary review of quality has become common practice.

Here are some examples. I try to write about them in such a way that it cannot be concluded in which organization they occurred:

Data should represent reality. As an example, consider the stock of furniture in an office location. Occasionally pieces of furniture are replaced, removed, or added. If the data is not properly adjusted according to such changes, we eventually lose touch with reality and the application working with this data becomes useless.

Data should be accurate. As an example, my name had been misspelled somewhere, and the email address had been defined based on that misspelled name. That might not sound too bad at first glance, apart from being embarrassing. But it is a problem. The name is used in so many places where the exact spelling is mandatory that an inaccuracy cannot be tolerated any more. In the old days the mailman could deliver paper mail even with slightly inaccurate addresses. Inaccurate email addresses simply do not work. In the case of my misspelled name, I was able to find someone who could correct it in some data store within a short time. But after a week or so, the misspelled name was back. No one really knew how to fix it for good. By chance I found the person responsible for the master data system a few months later, and he could fix it for good. As it seems, data was replicated from the master system to other systems, but the data flow was not well documented, or at least not well known.

A common result of inaccuracies are duplicates. They often occur because of minor misspellings when entering names in data fields, or even when scanning names from printed text, because OCR software is not perfect. Duplicates can even arise from transferring data from one system to another without checking for matches with records already present in the target system.
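
A simple plausibility test for such similarities could compare names by edit distance. A minimal sketch, assuming a threshold of two edits is a reasonable cut-off and leaving out the normalization of case, whitespace, and accents that a real system would need:

public class DuplicateCheck {

  // Classic Levenshtein edit distance: the number of single-character
  // insertions, deletions, and substitutions needed to turn a into b.
  static int editDistance(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j;
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
      }
      int[] tmp = prev;
      prev = curr;
      curr = tmp;
    }
    return prev[b.length()];
  }

  // Flag two names as probable duplicates if they differ by at most two edits.
  static boolean probableDuplicate(String name1, String name2) {
    return editDistance(name1, name2) <= 2;
  }

  public static void main(String[] args) {
    System.out.println(probableDuplicate("Brodowsky", "Brodowski")); // true
    System.out.println(probableDuplicate("Miller", "Schmidt"));      // false
  }
}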

Interesting effects can also be observed when attributes change, for example a person's name. When data is transferred between different systems and the records are not linked well, it is possible that records for the same person exist with the old and the new name. Or, to put the question differently: in how many places in the IT landscape does such a change need to be performed?

A good example of such a weakness can be found in MS-Windows-NT, at least in version 5.1 (Windows XP), but probably in any other version as well. It is possible to boot a PC with a working installation from a USB stick with Linux and copy the whole disk content byte by byte to another PC. If the hardware is identical, the other PC will boot and work quite well, as long as dynamic IP addresses are used and no NT domains are involved. But there is one problem. If NT domains are used, which is the case in all organizations that use MS-Windows PCs for a large part of their office workers, an internal identity of the PC becomes relevant. This is generated at some point and stored in the registry, and maybe also in the file system, in so many places that it is hard to keep track of it. Having two PCs in the same network with the same internal ID is asking for trouble. So this simple approach, which would make life easy for system administrators, does not work.

Many of these problems can be treated, at least partially, by paying attention to the following issues when creating an IT landscape:

  • How do the systems communicate with each other? Which system is the "master" for the data? Or is there a serious, conceptually correct "multi-master" architecture in place?
  • Can obviously incorrect data be detected and rejected (see the sketch after this list)?
  • How are records of data linked together? How robust are these links against changes?
  • Are there workflows that make it easier to keep data consistent and accurate?
  • How stable are interfaces to other systems?
  • Are tests for plausibility of data in place, especially for similarities (duplicates)?
  • Is a comparison with other (possibly more reliable) data sources performed?
  • How are changes in the reality described by the data detected and used to adjust the data content?
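
For the second point, a check at the system boundary could reject records that are obviously incorrect. A minimal sketch with hypothetical fields and rules; the actual rules depend entirely on the domain:

import java.time.Year;
import java.util.ArrayList;
import java.util.List;

public class PersonRecordValidator {

  // Collects human-readable reasons why an incoming record is obviously incorrect.
  public static List<String> validate(String name, String email, int birthYear) {
    List<String> problems = new ArrayList<>();
    if (name == null || name.trim().isEmpty()) {
      problems.add("name is missing");
    }
    if (email == null || !email.matches("[^@\\s]+@[^@\\s]+\\.[^@\\s]+")) {
      problems.add("email address is not well-formed: " + email);
    }
    if (birthYear < 1880 || birthYear > Year.now().getValue()) {
      problems.add("birth year is implausible: " + birthYear);
    }
    return problems;
  }

  public static void main(String[] args) {
    // A record like this should be rejected or routed to manual correction,
    // not silently written into the target system.
    System.out.println(validate("", "karl(at)example.com", 1875));
  }
}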

Today a lot could be done, and it makes sense to perform some automatic tests of data during productive operation. But it is also important that the people who supply the data work accurately and that the processes are taken seriously, so that all parties involved are working to ensure good data quality. Concerning the example of the furniture: those who are responsible for entering information about the furniture into the system must not be overloaded with other activities of higher priority; otherwise keeping the furniture database up to date will not be done, will be done too late, or will be done half-heartedly. Without good data, the expenses for maintaining the IT application are wasted.
