1. Technical Field
The invention generally relates to computers and, more particularly, to symmetric multiprocessing (SMP).
2. Description of the Related Art
Over the last decade, the shared memory concept became the architecture of choice for general-purpose multi-processor machines. One of the reasons was the simplicity of the programming model. On the one hand, the number of processors in symmetric multiprocessing (SMP) systems is steadily growing; on the other hand, smaller SMP configurations are already common as workstations and are about to enter the domain of personal computers.
A description will now be given regarding scalability limitations in current SMPs.
There are different types of limitations to scalability in parallel shared memory programs. Intrinsic limitations are a property of the algorithm and result from data dependencies. The intrinsic limitations of an algorithm define the amount of coordination between parallel tasks and, thus, an upper bound for the amount of parallelism that can be achieved. A second set of limitations results from the system executing the parallel program, and these can be classified into explicit and implicit scalability limitations. The impact of explicit and implicit limitations depends on the system executing the program, not on the algorithm.
Explicit scalability limitations are a result of the time required by a coordinating operation in the program. An example is the time between the release of a lock (also called a mutex) by one thread and the successful acquisition of that lock by another thread that was waiting for it.
Implicit scalability limitations are the result of coordination operations between parallel tasks that are not stated in the program, but which are part of the architecture of the system. One of the important implicit scalability limitations is the maintenance of cache coherence in an SMP system.
Amdahl's law expresses how serialization of even a small fraction of a computation impacts scalability:
$$\mathrm{Speedup} = \frac{1}{\dfrac{\mathrm{Fraction}_{\mathrm{enhanced}}}{\mathrm{Speedup}_{\mathrm{enhanced}}} + \left(1 - \mathrm{Fraction}_{\mathrm{enhanced}}\right)}$$
Coordination overhead often contributes directly to the unenhanced fraction.
For an algorithm which partitions the workload into 2% that is serial, and 98% that is perfectly parallelized, the intrinsic limitations allow a speedup of at most 50. The serial workload may include the task of partitioning the data into packages that can be processed independently. Here the total compute cost required by the serialized version of the algorithm is presumed to be identical to the total compute cost of the parallel algorithm, i.e. coordinating operations such as thread creation or lock acquisitions are presumed to be instantaneous.
The explicit scalability limitations add another contribution, e.g., the creation of threads is not instantaneous. If it is presumed that the scheduling of threads involving a condition variable and a lock requires 1% of the total cost of the algorithm, then the speedup is limited to 33.3. For example, on an IBM S80 (24 processors, 450 MHz each), the lock transfer between two processors through the load-reserve/store-conditional primitive requires roughly between 500 and 1000 clock cycles, depending on contention. This cost was determined through a micro benchmark (in assembler) and also within the context of an application using the pthread library. For the experiments, rescheduling of “spinning” threads was disabled to minimize the impact of context switches. The program ran in the 1:1 thread model (i.e., each thread has its own kernel thread). The compute cost of explicit coordination operations increases the total compute cost of the parallel algorithm compared to the serial algorithm, while potentially reducing the elapsed time due to the use of multiple processors. In our example, the total compute cost of the parallel program is 101% of the serial execution.
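To put the quoted lock-transfer cost into wall-clock terms, the cycle counts can be converted using the stated 450 MHz clock. The following is only an arithmetic sketch; the function name `cycles_to_microseconds` is a hypothetical helper introduced here for illustration.

```python
def cycles_to_microseconds(cycles, clock_hz):
    """Convert a clock-cycle count to microseconds for a given clock rate."""
    return cycles / clock_hz * 1e6


CLOCK_HZ = 450e6  # 450 MHz, as stated for the IBM S80 processors

for cycles in (500, 1000):
    # 500 cycles is about 1.1 microseconds, 1000 cycles about 2.2 microseconds
    print(f"{cycles} cycles = {cycles_to_microseconds(cycles, CLOCK_HZ):.2f} us")
```

Under these figures, a single lock handoff costs on the order of one to two microseconds, during which the waiting processor does no useful work.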
Implicit scalability limitations further increase the total compute cost without being visible in the program. Implicit scalability limitations include, for example, the overhead of coherence traffic. If it is presumed that implicit scalability limitations add another 1% of serial execution, the speedup for our example is already limited to 25.
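The three speedup limits in this example (50, 33.3 and 25) follow from Amdahl's law when each additional 1% of overhead is treated as part of the unenhanced (serial) fraction. The sketch below reproduces the arithmetic under that assumption; the function names are hypothetical and chosen for illustration.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Amdahl's law: overall speedup for a given enhanced (parallel) fraction."""
    return 1.0 / (fraction_enhanced / speedup_enhanced + (1.0 - fraction_enhanced))


def speedup_limit(serial_fraction):
    """Upper bound on speedup as the number of processors grows without bound."""
    return 1.0 / serial_fraction


# Assumption: explicit and implicit overheads each add 1% to the serial fraction.
cases = [
    ("intrinsic only (2% serial)", 0.02),
    ("plus explicit overhead (3%)", 0.03),
    ("plus implicit overhead (4%)", 0.04),
]
for label, serial in cases:
    print(f"{label}: limit {speedup_limit(serial):.1f}")
# Prints limits of 50.0, 33.3 and 25.0 respectively.
```

For a finite machine, `amdahl_speedup(1 - serial, n)` gives the corresponding bound for `n` processors, which is lower than the asymptotic limit.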
FIGS. 1A through 1C are plots illustrating results (runtimes) 100, 110, and 120, respectively, for a benchmark derived from a practical task, VLSI net building. Net building finds electrically connected components in a very large-scale integration (VLSI) layout. A VLSI design includes multiple so-called cells in which nets are built independently. Additionally, cells interact with each other. The benchmark neglects these interactions to create independent tasks so that parallelism is not influenced by data dependencies of the problem; this minimizes intrinsic scalability limitations. Thus, this benchmark has no significant intrinsic and no significant explicit scalability limitations.
FIG. 1A illustrates the runtimes 100 using 1, 8, 12 and 24 POSIX threads for different memory managers, all of which stop improving beyond 8 processors. The curve 110 is the result using 1, 8, 12 and 24 processes (using fork), and the curve 120 shows the theoretical optimum derived from a sequential run. As the result for multiple processes compared to the theoretical optimum shows, locality and cache capacity are not an issue for this particular workload, although it used a few hundred megabytes (MB) of memory per task (maximum speedup of 22.3 with separate processes compared to a theoretical optimum of 24). The limitation of scalability to 8 CPUs or fewer for multi-threaded execution results in this case predominantly from maintaining cache coherence, an implicit scalability limitation.
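The measured speedup of 22.3 on 24 processors can be expressed as a parallel efficiency and as an implied serial fraction. The sketch below uses the Karp-Flatt metric for the latter; that metric is not part of the original description and is brought in here only as an illustration, and the function names are hypothetical.

```python
def parallel_efficiency(speedup, processors):
    """Fraction of the ideal (linear) speedup that was actually achieved."""
    return speedup / processors


def karp_flatt_serial_fraction(speedup, processors):
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    return (1.0 / speedup - 1.0 / processors) / (1.0 - 1.0 / processors)


# Figures quoted above for the run with separate processes.
measured_speedup = 22.3
processors = 24

print(f"efficiency: {parallel_efficiency(measured_speedup, processors):.3f}")
print(f"implied serial fraction: "
      f"{karp_flatt_serial_fraction(measured_speedup, processors):.4f}")
```

With these inputs the efficiency is roughly 93% and the implied serial fraction is roughly 0.3%, which is consistent with the observation that the process-based run is close to the theoretical optimum.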
A description will now be given regarding typical SMP cache coherence implementations.
Multiple processors with local caches and shared memory are often connected by a shared bus or fabric and use a “snooping” protocol to maintain cache coherence. A processor that writes to a cache line invalidates all other copies of the memory location covered by the altered cache line that reside in other caches. In a snooping protocol, the writing processor broadcasts the invalidation on the bus or fabric, and all caches holding a copy of the affected memory location invalidate it by setting an “invalidation bit”.
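The write-invalidate behavior described above can be sketched as a toy model. This is a deliberately simplified illustration, not a model of any particular hardware: it assumes a write-through memory, one value per address instead of multi-word cache lines, and hypothetical class names `SnoopingCache` and `SharedBus`.

```python
class SnoopingCache:
    """Toy local cache: maps an address to a (value, valid) pair."""

    def __init__(self, name):
        self.name = name
        self.lines = {}

    def snoop_invalidate(self, address):
        # Mark the local copy invalid if this cache holds one.
        if address in self.lines:
            value, _ = self.lines[address]
            self.lines[address] = (value, False)


class SharedBus:
    """Toy shared bus that serializes writes and broadcasts invalidations."""

    def __init__(self):
        self.memory = {}
        self.caches = []
        self.broadcasts = 0

    def attach(self, cache):
        self.caches.append(cache)

    def read(self, cache, address):
        entry = cache.lines.get(address)
        if entry is not None and entry[1]:
            return entry[0]  # valid local copy: cache hit
        value = self.memory.get(address, 0)  # miss: fetch from shared memory
        cache.lines[address] = (value, True)
        return value

    def write(self, cache, address, value):
        # Every write broadcasts one invalidation that all other caches snoop.
        self.broadcasts += 1
        for other in self.caches:
            if other is not cache:
                other.snoop_invalidate(address)
        cache.lines[address] = (value, True)
        self.memory[address] = value  # assumption: write-through memory


bus = SharedBus()
caches = [SnoopingCache(f"cpu{i}") for i in range(3)]
for c in caches:
    bus.attach(c)
    bus.read(c, 0x40)            # all three caches now hold a valid copy

bus.write(caches[0], 0x40, 7)    # cpu0 writes; cpu1 and cpu2 are invalidated
print(caches[1].lines[0x40][1])  # False: the copy in cpu1 is now invalid
print(bus.read(caches[1], 0x40)) # 7: cpu1 misses and refetches the new value
```

In this model every write costs one bus broadcast regardless of how many caches actually hold a copy, which is where contention on the shared bus arises.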
Access to the shared memory typically uses write serialization based on interconnect fabric atomicity. In larger systems, the same principle is applied hierarchically if several central processing units (CPUs) share a cache. In IBM's Power 4 systems, 2 processor cores share a level 2 cache, and groups of processors on a module share a processor bus and a level 3 cache. The modules are connected with an additional level of busses.
Depending on the form of implementation, there are two structural limitations to scalability. A shared bus provides a mechanism for serialization, but the serialization of access to the shared bus is a point of contention. More complex fabrics avoid the single point of contention, but increase the latency of broadcasting operations.
Directory based systems do not broadcast invalidations. The directory (which may be distributed itself) serves as an arbitrator and routes invalidations to nodes that have copies. Here, the latency of broadcasting operations is also a limiting factor.
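The difference from a broadcast-based scheme can be sketched with a toy directory that tracks which nodes hold a copy of each address and sends invalidations only to those nodes. The class name `Directory` and its methods are hypothetical, and the model ignores message latency, distribution of the directory, and cache line granularity.

```python
class Directory:
    """Toy coherence directory: tracks sharers per address."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.sharers = {}            # address -> set of node ids holding a copy
        self.invalidations_sent = 0

    def record_read(self, node, address):
        # A read adds the node to the sharer set for that address.
        self.sharers.setdefault(address, set()).add(node)

    def record_write(self, node, address):
        # Route invalidations only to the other nodes that hold a copy.
        targets = self.sharers.get(address, set()) - {node}
        self.invalidations_sent += len(targets)
        self.sharers[address] = {node}  # the writer becomes the only holder
        return sorted(targets)


directory = Directory(num_nodes=24)
directory.record_read(1, 0x80)
directory.record_read(5, 0x80)

invalidated = directory.record_write(1, 0x80)
print(invalidated)                   # [5]: only node 5 held another copy
print(directory.invalidations_sent)  # 1 message instead of a 24-node broadcast
```

The sketch shows only the message count; each routed invalidation still has to travel through the directory, so its latency remains on the critical path of the write.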
Current cache consistency/coherence models have two common properties. The time at which a local cache observes an invalidation depends on how fast it is transmitted through the bus or fabric, not directly on program semantics. The instruction set architecture provides a form of synchronizing instruction that establishes a barrier at which all processors have a common opinion on the state of the memory. Such a barrier ensures completion of operations, but does not prevent premature invalidations.