Development of Hardware: Parallelism


For a long time we could rely on CPU clock frequencies increasing substantially with every hardware generation, but that trend ended a couple of years ago. So we can no longer compensate for the inefficiencies of our software by just waiting for the next hardware release, which used to be no big deal, because software was often delayed by a couple of months anyway. Of course the power of hardware depends on many factors, such as the number of instructions that can be executed within one clock cycle or the number of clock cycles needed per instruction. Everyone who has dealt with performance issues knows that providing enough physical memory is usually a good idea, and certain optimizations in the circuits and in the design of the chips can make the computer run faster, even though we usually do not care about such details. But the power of a single CPU core has almost stagnated for some years now, while it has become easy to get chips that provide multiple cores. An interesting link: The Free Lunch is Over.

Now we have the challenge of making use of these multiple CPU cores when building resource-hungry applications, which is basically achieved by having multiple threads or processes running simultaneously. Unfortunately we encounter some issues. The most obvious problem is that it is easy to find developers who say that they are capable of developing such applications, but only very few can really do it well enough to build reliable and stable software. So the software might work well under ideal circumstances, for example when testing it on the developer’s machine, but eventually fail in the production environment when run under load, creating errors that are very hard to pin down. Or the threads and processes spend so much time waiting for each other that the system does not actually make use of the parallel capabilities of the hardware. Or we even get deadlocks. What do we learn from this?

This kind of architecture needs excellent developers who can reason about the parallel computations and who have enough experience with this kind of development. It is usually better to use the parallelism only to a reasonable extent, without losing robustness. Obviously it is important to test with realistic data and load on test systems that resemble the production systems.

Another approach is the use of frameworks. There are some good lightweight frameworks, but common frameworks like JEE (earlier called J2EE) use so many resources for themselves and restrict the developer so much that the advantage of easier multithreading gets lost, because the framework itself consumes most of the CPU power and the main memory. There are many cases where using frameworks with JEE application servers is a good idea, but high-performance applications should be built differently.

The core problem is always the same: data structures that need to be manipulated by multiple threads or processes. This can be handled, but it creates a lot of difficulties in practice.
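
To illustrate the point, here is a minimal sketch in Java (the class name and the numbers are made up for this example): two threads increment a shared, unsynchronized counter, and because the increment is not atomic, updates get lost and the final value is usually well below the expected 2,000,000. Code like this typically looks fine in a quick test and only misbehaves under real concurrency, which is exactly the kind of hard-to-pin-down error mentioned above.

    import java.util.concurrent.*;

    public class LostUpdateDemo {
        static int counter = 0; // shared mutable state, no synchronization

        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(2);
            Runnable task = () -> {
                for (int i = 0; i < 1_000_000; i++) {
                    counter++; // read-modify-write, not atomic
                }
            };
            pool.submit(task);
            pool.submit(task);
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
            // usually prints a value noticeably smaller than 2000000
            System.out.println("expected 2000000, got " + counter);
        }
    }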

Some radical approaches are:

  • avoid shared data structures
  • use immutable data structures

The first approach is quite natural for development with C or Ruby or Perl, where the processes need relatively little memory, so that it is possible to run many processes simultaneously. Using POSIX IPC (or whatever your OS offers instead) or TCP/IP the processes can communicate with each other. That works well if there are several relatively independent processes that do not need to communicate very much. But it needs excellent developers as well, because they really need to know the IPC mechanisms, unless the subtasks are so independent that they do not need to communicate with each other at all. Erlang seems to have implemented this idea in a practicable way, allowing a huge number of parallel processes with totally separate data stores that communicate with each other through a message-passing mechanism.
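
Here is a minimal sketch of this style in Java (names invented for this example): real inter-process communication would go through POSIX IPC or TCP/IP, but the pattern can already be shown with two in-process workers that share no data structures and only exchange messages over queues.

    import java.util.concurrent.*;

    public class MessagePassingSketch {
        public static void main(String[] args) throws Exception {
            BlockingQueue<String> requests = new LinkedBlockingQueue<>();
            BlockingQueue<String> replies  = new LinkedBlockingQueue<>();

            // the worker owns its own state and only communicates via messages
            Thread worker = new Thread(() -> {
                try {
                    while (true) {
                        String msg = requests.take();      // blocks until a message arrives
                        if (msg.equals("stop")) break;
                        replies.put("processed: " + msg);  // answer by message, not shared state
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            worker.start();

            requests.put("task-1");
            requests.put("task-2");
            System.out.println(replies.take());
            System.out.println(replies.take());
            requests.put("stop");
            worker.join();
        }
    }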

The other idea, to have all shared data structures immutable, is followed by Scala and Clojure. The disadvantage of having to create a copy with some changes applied, instead of modifying the object itself, can be reduced by internal optimizations within the standard libraries that keep references to the original and just store the changes instead of really copying huge data structures for each change. Even Java used such a mechanism in its older versions, where creating a substring of an immutable String shared the underlying character array instead of copying it.
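
The following sketch shows the idea of such structural sharing in Java (the class PersistentList is invented for this example and is of course far simpler than the real Scala or Clojure collections): "adding" an element creates a new head node that references the unchanged rest of the list, so the old version stays valid and nothing gets copied or modified.

    public final class PersistentList<T> {
        private final T head;
        private final PersistentList<T> tail;   // shared with all older versions

        private PersistentList(T head, PersistentList<T> tail) {
            this.head = head;
            this.tail = tail;
        }

        public static <T> PersistentList<T> empty() {
            return null;   // the empty list is represented as null for brevity
        }

        public static <T> PersistentList<T> prepend(T value, PersistentList<T> list) {
            return new PersistentList<>(value, list);   // O(1), shares the whole tail
        }

        public static void main(String[] args) {
            PersistentList<String> v1 = prepend("a", PersistentList.<String>empty());
            PersistentList<String> v2 = prepend("b", v1);   // v1 is untouched and still usable
            System.out.println(v2.head + " -> " + v2.tail.head);  // prints b -> a
        }
    }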

In any case it is necessary to deal with dependencies between processes in order to avoid deadlocks. In the Scala and Clojure world it is reasonable to build lightweight frameworks that help with handling many parallel threads, because the promise of immutability eliminates many of the problems of shared objects. Twitter uses Scala internally and has been able to cope even during events that cause a heavy communication load.

A principal problem remains whenever heavy communication between processes is required. In a huge system it is impossible to optimize all communication paths. Assuming n parallel processors, we have n(n-1)/2 possible communication pairs, which grows as O(n²). So we need to compromise as soon as n gets really large. A bus architecture with one common channel gets congested, and for separate point-to-point connections it will be necessary to provide these only for immediate neighbors instead of all possible pairs. To get an idea of what huge means, think of an application that is running in several locations, each having several racks, each containing several machines, each containing several CPU chips, each containing several CPU cores, possibly even with hyper-threading. Using a sophisticated hardware architecture it is possible that CPU cores communicate with other CPU cores in their vicinity through very fast mechanisms, but only a limited number of CPU cores can be placed in this vicinity.
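
A quick back-of-the-envelope calculation (the values of n are just examples) shows how fast the number of possible pairs grows:

    public class CommunicationPairs {
        public static void main(String[] args) {
            for (int n : new int[] {8, 64, 1024, 1_000_000}) {
                long pairs = (long) n * (n - 1) / 2;   // n(n-1)/2 possible pairs
                System.out.println(n + " processors -> " + pairs + " pairs");
            }
        }
    }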

An interesting idea was to take a large number of boards, each containing as many CPUs and cores as can still communicate with each other efficiently, and connect them in a hypercube topology. With 2^m boards, each board has m neighbors that can be reached directly through a relatively short communication channel; the boards represent the vertices of an m-dimensional hypercube. This architecture allows reaching any other board in at most m steps and even aggregating a result from all boards, or a subset of them, in m steps. Having a wired-or for synchronization is very helpful for enhancing the performance of many typical kinds of tasks. Does anybody know how current supercomputers are built?
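
To make the hypercube idea more concrete, here is a small Java sketch (the board numbers and m are just example values): with 2^m boards numbered 0 to 2^m-1, the direct neighbors of a board are the numbers that differ from it in exactly one bit, and the number of routing steps to any other board is the Hamming distance of the two numbers, which is at most m.

    import java.util.Arrays;

    public class HypercubeSketch {
        // the m direct neighbors of a board are obtained by flipping one bit each
        static int[] neighbors(int board, int m) {
            int[] result = new int[m];
            for (int bit = 0; bit < m; bit++) {
                result[bit] = board ^ (1 << bit);
            }
            return result;
        }

        // routing steps needed between two boards = Hamming distance of their numbers
        static int hops(int from, int to) {
            return Integer.bitCount(from ^ to);
        }

        public static void main(String[] args) {
            int m = 4;                                               // 2^4 = 16 boards
            System.out.println(Arrays.toString(neighbors(5, m)));    // [4, 7, 1, 13]
            System.out.println(hops(5, 10));                         // 0101 vs 1010 -> 4 hops
        }
    }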

In any case it is good to structure the work so that subtasks need as little communication with other subtasks as possible, because the overhead of communication can eat up the gain of parallelism.


Data Quality


Very often we experience software that is not working well. Often this is a problem of the software itself, a situation we all know quite well.

Experience shows, however, that more often the problem lies in the data on which the software operates. In short: junk in, junk out.

In organizations that use software, it is a good idea to keep an eye on the underlying data. For software we are used to development processes that at least pretend to take testing and bug-fixing seriously. Formalized processes exist for this, and in serious IT projects they prove helpful for reducing the number of software bugs. The issue of data quality is often outside these processes, because data content is often provided by the business side, where such quality processes have not always been established. Often data is also exchanged between different systems, where quality problems can arise as well.

Some questions about this:

  • Do we know the path of the data through the various systems?
  • Is any person responsible for data quality?
  • Is the data quality checked?
  • Is there an efficient process to correct data?

Such questions are often less clearly answered than similar questions about software quality, where the development process is well defined, at least initially, responsibilities are known and at least some rudimentary review of quality has become common practice.

Here are some examples. I try to write about them in such a way that it cannot be concluded in which organization they occurred:

Data should represent reality. As an example we can take the stock of furniture in an office location. Occasionally pieces of furniture are replaced, removed, or added. If the data is not properly adjusted according to such changes, it eventually loses touch with reality and the application working with this data becomes useless.

Data should be accurate. As an example, my name had been misspelled somewhere and the email address had been defined based on that misspelled name. At first glance that might not sound too bad, apart from being a bit embarrassing. But it is a problem. The name is used in so many places where the exact spelling is mandatory that an inaccuracy cannot be tolerated any more. In the old days the mailman could deliver paper mail even with slightly inaccurate addresses; inaccurate email addresses simply do not work. In the case of my misspelled name, I was able to find someone who could correct this in some data store within a short time. But after a week or so, the misspelled name was back. No one really knew how to fix it for good. By chance I found the person responsible for the master data system a few months later and he could fix it for good. As it turned out, data was copied from the master system to other systems, but the data flow was not well documented or at least not well known.

A common result of inaccuracies is duplicates. They often occur because of minor misspellings when entering names in data fields or even when scanning names from printed text, because OCR software is not perfect. Another source of duplicates is the transfer of data from one system to another without checking for matches with records already present in the target system.
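
As a sketch of how such a check could look, the following Java snippet normalizes names and flags pairs with a small edit (Levenshtein) distance as possible duplicates; the names, the threshold and the normalization rules are made-up examples, and a real system would of course compare further attributes such as addresses or birth dates.

    import java.util.List;
    import java.util.Locale;

    public class DuplicateCheckSketch {
        // classic dynamic-programming edit distance between two strings
        static int editDistance(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                       d[i - 1][j - 1] + cost);
                }
            }
            return d[a.length()][b.length()];
        }

        static String normalize(String name) {
            return name.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
        }

        public static void main(String[] args) {
            List<String> names = List.of("John Smith", "Jon Smith", "Johanna Smythe");
            for (int i = 0; i < names.size(); i++) {
                for (int j = i + 1; j < names.size(); j++) {
                    int dist = editDistance(normalize(names.get(i)), normalize(names.get(j)));
                    if (dist <= 2) {   // small distance -> flag as possible duplicate
                        System.out.println("possible duplicate: " + names.get(i)
                            + " / " + names.get(j) + " (distance " + dist + ")");
                    }
                }
            }
        }
    }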

Interesting issues also arise when attributes change, for example a person’s name. When data is transferred between different systems and the records are not linked well, it is possible that records for the same person exist with the old and the new name. Or, to ask it differently: how many places are there in the IT landscape where such a change needs to be performed?

A good example of such a weakness can be found with MS-Windows-NT, at least for version 5.1 (Windows XP), but probably for other versions as well. It is possible to boot a PC with a working installation from a USB stick with Linux and copy the whole disk content byte by byte to another PC. If the hardware is identical, the other PC will boot and work quite well, as long as dynamic IP addresses are used and no NT domains are involved. But there is one problem. If NT domains are used, which is the case in all organizations that use MS-Windows PCs for a large part of their office workers, an internal identity of the PC becomes relevant. This identity is generated at some point and stored in the registry, and maybe also in the file system, in so many places that it is hard to keep track of it. Having two PCs in the same network with the same internal ID is asking for trouble. So this simple approach that would make life easy for system administrators does not work.

Many of these problems can be treated, at least partially, by paying attention to the following issues when creating an IT landscape:

  • How do the systems communicate with each other? Which system is the "master" for the data? Or is there a serious, conceptually correct "multi-master" architecture in place?
  • Can obviously incorrect data be detected and rejected? (A small sketch of such a check follows after this list.)
  • How are records of data linked together? How robust are these connections for changes?
  • Are there work-flows that make it easier to keep data consistent and accurate?
  • How stable are interfaces to other systems?
  • Are tests for plausibility of data in place, especially for similarities (duplicates)?
  • Is a comparison with other (possibly more reliable) data sources performed?
  • How are changes in the reality that is expressed by the data detected and used for adjusting the data content?
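
As an illustration of the second and the sixth point, a plausibility check can be as simple as the following Java sketch, which rejects records with obviously incorrect content before they enter the system; the record type, field names and rules are invented for this example and would of course depend on the actual data model.

    import java.time.LocalDate;
    import java.util.ArrayList;
    import java.util.List;

    public class PlausibilityCheckSketch {

        // a made-up record type, just for this example
        record Person(String name, String email, LocalDate birthDate) {}

        static List<String> check(Person p) {
            List<String> problems = new ArrayList<>();
            if (p.name() == null || p.name().isBlank()) {
                problems.add("name is missing");
            }
            if (p.email() == null || !p.email().matches("[^@\\s]+@[^@\\s]+\\.[^@\\s]+")) {
                problems.add("email address looks invalid");
            }
            if (p.birthDate() == null
                    || p.birthDate().isAfter(LocalDate.now())
                    || p.birthDate().isBefore(LocalDate.of(1900, 1, 1))) {
                problems.add("birth date is implausible");
            }
            return problems;   // an empty list means the record passed the checks
        }

        public static void main(String[] args) {
            Person p = new Person("John Smith", "john.smith(at)example.com",
                                  LocalDate.of(2200, 1, 1));
            // print the detected problems instead of silently storing the record
            check(p).forEach(System.out::println);
        }
    }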

Today a lot could be done, and it makes sense to perform some automatic checks of the data during production operation. But it is also important that the people who supply the data work accurately and that the processes are taken seriously, so that all parties involved work to ensure good data quality. Concerning the example of the furniture: those who are responsible for entering information about the furniture into the system must not be overloaded with other activities of higher priority, otherwise keeping the furniture database up to date will be done too late, half-heartedly, or not at all. Without good data the expenses for maintaining the IT application are wasted.
