1. Technical Field
This invention relates to pipelining processes in a multiprocessor computing environment. More specifically, the invention relates to a method and system for improving throughput based upon ordering constraints for shared memory operations.
2. Description of the Prior Art
Multiprocessor systems contain multiple processors (also referred to herein as CPUs) that can execute multiple processes, or multiple threads within a single process, simultaneously in a manner known as parallel computing. In general, multiprocessor systems execute multiple processes or threads faster than conventional single processor systems, such as a personal computer, that execute programs sequentially. The actual performance advantage is a function of a number of factors, including the degree to which parts of a multithreaded process and/or multiple distinct processes can be executed in parallel and the architecture of the particular multiprocessor system. The degree to which processes can be executed in parallel depends, in part, on the extent to which they compete for exclusive access to shared memory resources.
Shared memory multiprocessor systems offer a common physical memory address space that all processors can access. Multiple processes therein, or multiple threads within a process, can communicate through shared variables in memory, which allow the processes to read from or write to the same memory location in the computer system. In order to increase operating efficiency in a multiprocessor system, it is important to increase the speed at which a processor executes a program. One way to achieve this goal is to execute more than one operation at the same time. This approach is generally referred to as parallelism. A known technique for supporting parallel programming and managing memory access operations in a multiprocessor is pipelining. Pipelining is a technique in which the execution of an operation is partitioned into a series of independent, sequential steps called pipeline segments. Each segment in the pipeline completes a part of an instruction, and different segments of different instructions may operate in parallel. Accordingly, pipelining is a form of instruction level parallelism that allows more than one operation to be processed in a pipeline at a given point in time.
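The benefit of overlapping pipeline segments can be illustrated with a small calculation. The following C sketch is illustrative only and assumes an idealized pipeline in which every segment takes exactly one cycle and no stalls occur; the function names are hypothetical and do not appear in the figures.

```c
#include <assert.h>

/* Cycles needed when each instruction passes through every segment
 * before the next instruction is allowed to start (no overlap). */
static unsigned long cycles_sequential(unsigned long instructions,
                                       unsigned long segments)
{
    return instructions * segments;
}

/* Cycles needed in an idealized pipeline (assumption: one cycle per
 * segment, no stalls). The first instruction needs `segments` cycles
 * to drain; every later instruction completes one cycle after its
 * predecessor. */
static unsigned long cycles_pipelined(unsigned long instructions,
                                      unsigned long segments)
{
    assert(segments > 0);
    if (instructions == 0)
        return 0;
    return segments + instructions - 1;
}
```

Under these assumptions, 100 instructions in a five-segment pipeline take 104 cycles rather than 500. Real pipelines fall short of this bound whenever an operation must wait, which is the situation that memory barriers create.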
In a cache-coherent system, multiple processors see a consistent view of memory. Several memory-consistency models may be implemented. The most straightforward model is called sequential consistency. Sequential consistency requires that the result of any execution be the same as if the accesses executed by each processor were kept in order and the accesses among different processors were interleaved. The simplest way to implement sequential consistency is to require a processor to delay the completion of any memory access until all invalidations caused by that access have completed. However, sequential consistency is generally inefficient.
FIGS. 1a-c outline the process of adding a new element 30 to a data structure 5 in a sequential consistency model. FIG. 1a is an illustration of a sequential consistency memory model for a data structure prior to initializing a new element 30 or adding it to the data structure 5. The data structure 5 includes a first element 10 and a second element 20. The first element 10 has three fields 12, 14 and 16, and the second element 20 has three fields 22, 24 and 26. In order to add a new element 30 to the data structure 5 such that the CPUs in the multiprocessor environment can concurrently search the data structure, the new element 30 must first be initialized. This ensures that CPUs searching the linked data structure do not see fields in the new element filled with corrupted data. Following initialization of the new element's 30 fields 32, 34 and 36, the new element may be added to the data structure 5. FIG. 1b is an illustration of the new element 30 following initialization of each of its fields 32, 34 and 36, and prior to adding the new element 30 to the data structure 5. Finally, FIG. 1c illustrates the addition of the new, third element 30 to the data structure 5 following the initialization of the fields 32, 34 and 36. Accordingly, in a sequential consistency memory model, execution of each step in the process must occur in a pre-specified order.
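The two-step order of FIGS. 1b and 1c, initialize first and link second, can be sketched in plain C. This is a minimal illustration rather than the structure of the figures themselves: the struct layout, field names, and helper functions are assumptions, and the code contains no ordering primitives, so it is only safe where the hardware already keeps writes in program order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: three fields per element, with the first field
 * serving as the pointer to the next element. */
struct element {
    struct element *next;  /* first field: link to the next element */
    char name[8];          /* second field: character string */
    int value;             /* third field: number */
};

/* Step 1 (FIG. 1b): fill in every field of the new element while it
 * is still unreachable from the data structure. */
static void init_element(struct element *e, const char *name, int value)
{
    e->next = NULL;
    strncpy(e->name, name, sizeof e->name - 1);
    e->name[sizeof e->name - 1] = '\0';
    e->value = value;
}

/* Step 2 (FIG. 1c): make the initialized element reachable by storing
 * a pointer to it in the preceding element's first field. */
static void append_element(struct element *tail, struct element *e)
{
    assert(tail != NULL && e != NULL);
    tail->next = e;
}

/* Walk the list the way a searching CPU would. */
static size_t list_length(const struct element *head)
{
    size_t n = 0;
    for (const struct element *e = head; e != NULL; e = e->next)
        n++;
    return n;
}
```

A searching CPU that reaches the new element through `next` sees valid fields only if the stores in `init_element` become visible before the store in `append_element`.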
The process of FIGS. 1a-c is only effective on CPUs that use a sequentially consistent memory model. For example, the process may fail under weaker memory models, in which other CPUs may see write operations from a given CPU happening in different orders. FIG. 2 is an illustration of a weak memory-consistency model for adding a new element to a data structure. In this example, the write operation to the second element's 20 next field 22 becomes visible to other CPUs before the write operation to the new element's 30 first field 32. A CPU searching the data structure may therefore see the first field 32 of the third element 30 before that field has been initialized, and thus read corrupted data. The searching CPU may then attempt to use the data ascertained from the field 32 as a pointer, and most likely this would result in a program failure or a system crash. Accordingly, it is desirable to place some form of memory barrier instruction to be executed prior to storing a pointer from the second element in the data structure to the new element in the data structure.
FIG. 3 is a block diagram 40 illustrating the segregation of instructions into groups, wherein one group of instructions occurs before the memory barrier and another group of instructions occurs after the memory barrier. This diagram follows the linked data structure example of FIGS. 1 and 2. There are essentially four levels of operation. The first level includes the following operations: storing a NULL pointer into the new element's first field 42, storing the character string “IJKL” into the new element's second field 44, and storing the number “9012” into the new element's third field 46. Following this group of write operations, a memory barrier 50 is executed. The memory barrier ensures that each of the write operations 42, 44 and 46 occurs prior to any other computations. Following the execution of the write operations 42, 44 and 46 and the execution of the memory barrier 50, the address of the second element may be computed 52. Step 52 is a local memory operation, and it may involve a plurality of write operations to the CPU's local memory. Finally, following step 52, a pointer to the new element is stored in the second element's first field 54. Although the memory barrier instruction 50 prevents the memory write operations 42, 44 and 46 from appearing to have occurred later than the memory write operation at 54, it needlessly prevents the write operations 42, 44 and 46 from appearing to have occurred later than the computation of the address 52. Accordingly, the prior uses of memory barrier instructions as shown in FIG. 3 result in an inefficient use of the CPU's resources and a delayed execution of the program.
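The four levels of FIG. 3 can be approximated in portable C11. The sketch below is an assumption-laden illustration, not the instruction sequence of any particular CPU: it stands in for the memory barrier 50 with `atomic_thread_fence(memory_order_release)`, which is the closest standard primitive but is not identical to a hardware barrier instruction, and the struct and function names are hypothetical. Comments carry the reference numerals of FIG. 3.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical element layout; the first field is the link. */
struct node {
    _Atomic(struct node *) next;
    char name[8];
    int value;
};

static void publish_after_barrier(struct node *second, struct node *new_node)
{
    assert(second != NULL && new_node != NULL);

    /* Level 1 (42, 44, 46): initialize the new element's fields. */
    atomic_store_explicit(&new_node->next, (struct node *)NULL,
                          memory_order_relaxed);
    strcpy(new_node->name, "IJKL");
    new_node->value = 9012;

    /* Level 2 (50): barrier. Stand-in only: a C11 release fence. */
    atomic_thread_fence(memory_order_release);

    /* Level 3 (52): compute the address of the second element's first
     * field. This touches only local state, yet it is sequenced after
     * the barrier in this arrangement. */
    _Atomic(struct node *) *slot = &second->next;

    /* Level 4 (54): store the pointer to the new element. */
    atomic_store_explicit(slot, new_node, memory_order_relaxed);
}

/* Reader side: an acquire load pairs with the release fence above. */
static struct node *read_next(struct node *n)
{
    return atomic_load_explicit(&n->next, memory_order_acquire);
}
```

Only the store at 54 needs to be ordered after the writes at 42, 44 and 46. The address computation at 52 involves no shared memory, which is why holding it behind the barrier is wasted work.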
FIG. 4 is a block diagram 60 similar to the example shown in FIG. 3 without the memory barrier instruction. This diagram follows the linked data structure example of FIGS. 1 and 2. In this example, the memory barrier instruction 50 is removed, and as such there are two levels of operation. The first level includes the following operations: storing a NULL pointer into the new element's first field 42, storing the character string “IJKL” into the new element's second field 44, and storing the number “9012” into the new element's third field 46. Following the write operations 42, 44 and 46, the address of the second element may be computed 52. At the same time, a pointer to the new element is stored in the second element's first field 54. The removal of the memory barrier instruction thus allows the address of the second element to be computed 52 at the same time as the pointer to the new element is stored in the second element's first field 54, which increases the efficiency of operation of the program. However, without the barrier, other CPUs may temporarily see corrupted data in the new element. Accordingly, there is a need for an efficient pipelining model that maintains data integrity while improving operating efficiency.
One programming model that allows a more efficient implementation is synchronization. A program is synchronized if all access to shared data is ordered by synchronization operations. In addition to synchronizing programs, there is also a need to define the ordering of memory operations. There are two types of restrictions on memory ordering: write barriers and read barriers. In general, barriers act as boundaries, forcing the processor to order read operations and write operations with respect to the barrier. Barriers are fixed points in a computation that ensure that no read operation or write operation is moved across the barrier. For example, a write barrier executed by a processor A ensures that all write operations by A prior to the write barrier operation have completed, and that no write operations that occur after the write barrier in A are initiated before the barrier operation. In sequential consistency, all read operations are read barriers, and all write operations are write barriers. This limits the ability of the hardware to optimize accesses, since order must be strictly maintained. The typical effect of a write barrier is to cause the program execution to stall until all outstanding writes have completed, including the delivery of any associated invalidations.
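The pairing of a write barrier on the writing processor with a read barrier on the reading processor can be sketched with C11 fences. This is an approximation under stated assumptions: a C11 release fence orders earlier reads as well as earlier writes, so it is somewhat stronger than a pure write barrier, and an acquire fence similarly stands in for a read barrier. The variable and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static int payload;        /* shared data written before the flag */
static atomic_bool ready;  /* flag announcing that payload is valid */

/* Writer (processor A): the fence keeps the payload write from being
 * observed after the flag write. */
static void producer(int v)
{
    payload = v;
    atomic_thread_fence(memory_order_release);  /* write barrier stand-in */
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

/* Reader (processor B): the fence keeps the payload read from being
 * performed before the flag read. Returns false if nothing has been
 * published yet. */
static bool consumer(int *out)
{
    assert(out != NULL);
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return false;
    atomic_thread_fence(memory_order_acquire);  /* read barrier stand-in */
    *out = payload;
    return true;
}
```

Neither fence constrains accesses to memory that only one processor uses, which is the distinction between shared and non-shared accesses that the remainder of this section turns on.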
In an attempt to increase performance, it has become known to reorder the execution of instructions. However, when instructions are reordered, special synchronizing instructions are required in order to specify to the CPU which accesses may not be reordered. Accordingly, there is a need for a computer system comprising multiple processors that maximizes CPU performance by placing constraints on shared memory accesses while removing constraints on non-shared memory accesses.