Development of Hardware: Parallelism


Until recently we could just rely on the fact that the CPU frequencies doubled at least every year, which has stopped a couple of years ago. So we can no longer compensate the inefficiencies of our software by just waiting for the next hardware release, which was no big deal, because software was often delayed anyway by a couple of months. Off course the power of hardware depends on many factors, even on the number of instructions that can be done within one clock cycle or the number of clock cycles needed for instructions. Everyone who has dealt with performance issues knows that providing enough physical memory is usually a good idea and certain optimizations in the circuits and the design of the chips can help to make the computer run faster, even though we usually do not care. But the power of the single CPU core has almost stagnated now for some years, but it is easy to get chips that provide multiple cores. An interesting link: The Free Lunch is Over.

Now we have the challenge of making use of these multiple CPU cores for building resource hungry applications, which is basically achieved by having multiple threads or processes running simultaneously. Unfortunately we encounter some issues. The most obvious problem is that it is easy to find developers who say that they are capable of developing such applications, but there are only very few who can really do it well enough to build reliable and stable software. So the software might work well under ideal circumstances, for example when testing it on the developer’s machine, but it will eventually fail in the productive environment, when run under load, creating errors that are very hard to pin down. Or the threads and processes spend so much time waiting for each other that the system does not actually make use of the parallel capabilities of the hardware. Or we even get dead locks. What do we learn from this?

For this kind of architecture excellent developers are needed, who can imagine the parallel computations and who have enough experience with this kind of development. And it is usually better to do development that uses the parallelism to a reasonable extent, without loosing robustness. Obviously it is important to test with reasonable data and load on test systems that are like the productive systems.

Another approach is the use of frameworks. There are some good lightweight frameworks, but common frameworks like JEE (earlier called J2EE) are using so many resources for themselves and restrict the developer so much that the advantage of easier multithreading gets lost by this, because the framework itself uses most of the CPU power and the main memory. There are many cases where using frameworks with JEE applications servers is a good idea, but high performance applications should done differently.

The problem is always that data structures that need to be manipulated by multiple threads or processes cause problems. These may be handled, but create a lot of difficulties in practice.

Some radical approaches are:

  • avoid commonly used data structures
  • usage of immutable data structures

The first approach is quite logical for development with C or Ruby or Perl, where the processes need relatively little memory, so that it is possible to run multiple processes simultaneously. Using POSIX-IPC (or whatever your OS offers instead) or TCP/IP the processes can communicate with each other. That works well, if there are several relatively independent processes that do not need to communicate very much. But it needs excellent developers as well, because they really need to know the IPC mechanisms, unless the sub tasks are so independent that they do not need to communicate with each other at all. Maybe Erlang has implemented this idea in a practicable way, allowing a huge number of parallel processes with totally separate data stores that communicate with each other through some message passing mechanism.

The other idea, to have all shared data structures immutable, is followed by Scala and Clojure. The disadvantage of having to create a copy with some changes applied instead of modifying the object itself can be reduced by internal optimizations within the standard libraries that use references to the original and just store the changes instead of really copying huge data structures for each change. Even Java uses such mechanisms when creating a substring of an immutable String.

In any case it is necessary to deal with dependencies between processes in order to avoid deadlocks. In the Scala and Clojure world it is reasonable to build lightweight frameworks that help dealing with multiple parallel threads because the promise of immutability eliminates many of the problems of shared objects. Twitter uses Scala internally and has been able to cope with the load even during events that cause a heavy communications load.

A principal problem remains whenever heavy communication between processes is required. In a huge system it is impossible to optimize all communication paths. Assuming n parallel processors we have {n(n-1)\over2} communication pairs, which is growing O(n^2). So we need to compromise as soon as n gets really huge. A bus architecture with one common channel get congested and for separate point to point connections it will be necessary to provide these only for immediate neighbors instead of all possible connections. To really imagine huge, think of an application that is running on several locations, each having several racks, each containing several machines, each containing several CPU chips, each containing several CPU cores, possibly even with hyper-threading. Using sophisticated hardware architecture it is possible that CPU cores communicate with other CPU cores in their vicinity through very fast mechanisms, but it is only possible to place a limited number of CPU cores in this vicinity.

An interesting idea was to put a large number of boards containing this number of CPUs and cores that can communicate with each other efficiently into a topological hypercube. Having 2^m boards, each board has m neighbors that can be reached directly through a relatively short communication channel. The boards represent the vertices of an m-dimensional hypercube. This architecture allows reaching another board in m steps and even to aggregate a result from all or a subset of all boards in m steps. Having a wired-or for synchronization is very helpful for enhancing the performance for many typical types of tasks. Does anybody know how current super computers are built?

In any case it is good to be able to run sub tasks with as little communication with other sub tasks as possible, because the overhead of communication can eat up the gain of parallelism.

Share Button

Why are Laptop displays having so poor resolutions?


On google+ you can find a post Linus Torvalds himself about this issue.
The post is in English, not in his native language (Swedish).

P.S. I will continue to write a blog entry about every Friday in German and translate some (but not all) of these to English.

Share Button