We often experience that software does not work well. Sometimes the problem lies in the software itself, a situation we all know quite well.
Experience shows, however, that more often the problem lies in the data on which the software operates. In short: junk in, junk out.
In organizations that use software, it is a good idea to keep an eye on the underlying data. For software we are used to development processes that at least pretend to take testing and bug-fixing seriously. Formalized processes exist for this, and in serious IT projects they prove helpful in reducing the number of software bugs. The issue of data quality often falls outside these processes, because data content is usually provided by the business side, where such quality processes have not always been established. In addition, data is often exchanged between different systems, which can introduce quality problems of its own.
Some questions about this:
- Do we know the path the data takes through the various systems?
- Is any person responsible for data quality?
- Is the data quality checked?
- Is there an efficient process to correct data?
Such questions are often answered less clearly than similar questions about software quality, where the development process is well defined (at least initially), responsibilities are known, and at least some rudimentary quality review has become common practice.
Here are some examples. I describe them in such a way that it cannot be concluded in which organization they occurred:
Data should represent reality. As an example, consider the stock of furniture in an office location. Occasionally pieces of furniture are replaced, removed, or added. If the data is not adjusted to reflect such changes, it eventually loses touch with reality and the application working with it becomes useless.
Data should be accurate. As an example, my name had been misspelled somewhere and an email address had been defined based on that misspelled name. At first glance that might not sound too bad, apart from being embarrassing, but it is a real problem. The name is used in so many places where the exact spelling is mandatory that an inaccuracy can no longer be tolerated. In the old days the mailman could deliver paper mail even with a slightly inaccurate address; an inaccurate email address simply does not work. In the case of my misspelled name, I was able to find someone who could correct it in some data store within a short time. But after a week or so, the misspelled name was back. No one really knew how to fix it for good. By chance I found the person responsible for the master data system a few months later, and he could fix it permanently. Apparently the data was copied from the master system to other systems, but the data flow was not well documented, or at least not well known.
A common result of inaccuracies is duplicates. They often occur because of minor misspellings when names are typed into data fields, or even when names are scanned from printed text, because OCR software is not perfect. Another source of duplicates is the transfer of data from one system to another without checking for matches with records already present in the target system.
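The following is a minimal sketch, in Python with only the standard difflib module, of how such near-matches could be flagged before records are merged or imported; the names and the similarity threshold are invented for illustration, and a real system would need a more careful matching strategy:

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(name.lower().split())

def likely_duplicates(names: list[str], threshold: float = 0.85) -> list[tuple[str, str]]:
    """Return pairs of names that are suspiciously similar and deserve a manual look."""
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold:
                pairs.append((a, b))
    return pairs

# A typo like "Millar" instead of "Miller" is caught, clearly different names are not.
print(likely_duplicates(["Anna Miller", "Anna Millar", "Bernd Schulz"]))
```

Flagging such pairs for a human to review is usually better than merging them automatically, because some similar-looking records really do belong to different people.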
Interesting problems can also arise when attributes change, for example a person's name. When data is transferred between different systems and the records are not linked well, it is possible that records for the same person exist under both the old and the new name. Or, to put the question differently: how many places are there in the IT landscape where such a change needs to be performed?
A good example of such a weakness can be found in MS-Windows NT, at least in version 5.1 (Windows XP), but probably in other versions as well. It is possible to boot a PC that has a working installation from a USB stick with Linux and copy the whole disk content byte by byte to another PC. If the hardware is identical, the other PC will boot and work quite well, as long as dynamic IP addresses are used and no NT domains are involved. But there is one problem. If NT domains are used, which is the case in all organizations that use MS-Windows PCs for a large part of their office workers, an internal identity of the PC becomes relevant. It is generated at some point and stored in the registry, and possibly also in the file system, in so many places that it is hard to keep track of it. Having two PCs in the same network with the same internal ID is asking for trouble. So this simple approach, which would make life easy for system administrators, does not work.
Many of these problems can be treated, at least partially, by paying attention to the following issues when designing an IT landscape:
- How do the systems communicate with each other? Which system is the "master" for the data? Or is there a sound, conceptually correct "multi-master" architecture in place?
- Can obviously incorrect data be detected and rejected? (A small sketch of such a check follows after this list.)
- How are data records linked together? How robust are these links when things change?
- Are there workflows that make it easier to keep data consistent and accurate?
- How stable are interfaces to other systems?
- Are plausibility tests for the data in place, especially checks for similarities (duplicates)?
- Is a comparison with other (possibly more reliable) data sources performed?
- How are changes in the reality described by the data detected and used to adjust the data content?
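For the second point, the detection and rejection of obviously incorrect data, here is a minimal sketch; the record layout and the individual rules are hypothetical and would have to be derived from the actual data model:

```python
import re
from datetime import date

# Hypothetical record layout for illustration: a simple employee entry.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate(record: dict) -> list[str]:
    """Return the reasons why a record is obviously incorrect (empty if it looks plausible)."""
    errors = []
    if not record.get("name", "").strip():
        errors.append("name is empty")
    if not EMAIL_PATTERN.match(record.get("email", "")):
        errors.append("email address is malformed")
    hired = record.get("hired")
    if hired is not None and hired > date.today():
        errors.append("hire date lies in the future")
    return errors

# Records that fail the checks are rejected or flagged instead of silently stored.
incoming = {"name": "Anna Meier", "email": "anna.meier@example", "hired": date(2020, 3, 1)}
problems = validate(incoming)
if problems:
    print("rejected:", problems)
```

The important point is that a record failing such checks never silently ends up in the database; it is either rejected at the interface or routed to someone who can correct it.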
Today a lot can be done, and it makes sense to perform some automated tests of the data during productive operation (a small sketch of such a test follows below). But it is also important that the people who supply the data work accurately and that the processes are taken seriously, so that all parties involved work towards good data quality. Concerning the furniture example: those who are responsible for entering information about the furniture into the system must not be overloaded with other activities of higher priority, otherwise keeping the furniture database up to date will not be done at all, will be done too late, or will be done half-heartedly. Without good data, the expenses for maintaining the IT application are wasted.
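As a sketch of what such an automated test during operation could look like, here is a small reconciliation check for the furniture example; the inventory numbers and the second, more reliable source (for example a physical inventory list) are assumptions made purely for illustration:

```python
def reconcile(database: dict[str, str], inventory: dict[str, str]) -> None:
    """Print discrepancies between the application's data and a more reliable source."""
    db_ids, real_ids = set(database), set(inventory)
    for missing in sorted(real_ids - db_ids):
        print(f"{missing}: exists in reality, missing in the database")
    for stale in sorted(db_ids - real_ids):
        print(f"{stale}: still in the database, but no longer exists")
    for item in sorted(db_ids & real_ids):
        if database[item] != inventory[item]:
            print(f"{item}: database says '{database[item]}', inventory says '{inventory[item]}'")

# Hypothetical inventory numbers; such a check could run periodically
# and feed a report to whoever maintains the data.
reconcile(
    {"INV-001": "desk", "INV-002": "chair"},
    {"INV-001": "desk", "INV-003": "cabinet"},
)
```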