1. Field of the Invention
This invention relates to the field of multiprocessor computer systems and, more particularly, to mechanisms and methods for optimizing shared memory write operations within multiprocessor computer systems.
2. Description of the Related Art
A popular architecture in commercial multiprocessing computer systems is a distributed shared memory architecture. A distributed shared memory architecture includes multiple nodes within which processors and memory reside. The multiple nodes communicate via a network coupled between them. When considered as a whole, the memory included within the multiple nodes forms the shared memory for the computer system.
Distributed shared memory systems are scalable, overcoming various limitations associated with shared bus architectures. Since many of the processor accesses are completed within a node, nodes typically have much lower bandwidth requirements upon the network than a shared bus architecture must provide upon its shared bus to attain comparable performance. The nodes may operate at high clock frequency and bandwidth, accessing the network only when needed. Additional nodes may be added to the network without affecting the local bandwidth of the nodes. Instead, only the network bandwidth is affected.
One complication associated with distributed memory in multiprocessing computer systems relates to maintaining the coherency of program data shared across multiple nodes. In general, the system may implement an ordering policy that defines an order of operations initiated by different sources. During the execution of a system's workload, cache lines may often move between various system nodes. This movement needs to be performed such that operations on the cache lines occur in a manner that is consistent with the ordering model. Without a coordination mechanism, one node may perform an update that is not properly reflected in another node. Maintaining a unified, coherent view of shared memory locations is thus essential from the standpoint of program correctness.
One technique for handling coherence in shared memory systems employs hardware interfaces between nodes that track the coherency state of each cache line and perform coherence operations depending upon desired operations. Typically, the coherency state of each cache line is tracked in a directory structure. When a processor initiates a write to a particular cache line, if the node in which the processor resides does not already have a write access right to the cache line, the hardware interfaces may respond by invoking coherence operations to provide the requesting node with an exclusive, writable copy of the data. These coherence operations may include functionality to cause the owner of the cache line to provide the cache line to the requesting node, and functionality to cause shared copies of the cache line in other nodes to be invalidated before allowing the requesting node to commence the write operation.
Similarly, when a processor initiates a read from a particular cache line, if the node in which the processor resides does not already have a read access right to the line, the hardware interfaces may respond by invoking coherence operations to provide the requesting node with a shared copy of the data. Typically, this involves causing the owner of the cache line to provide the cache line to the requesting node.
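The write and read cases described above may be illustrated by the following minimal sketch. It is an illustration under stated assumptions, not a description of any particular hardware interface: the `DirectoryEntry` record, its fields, and the action names `fetch_from_owner` and `invalidate` are hypothetical, and the sketch assumes that a previous owner relinquishes its copy when it supplies the line for a write.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryEntry:
    """Hypothetical per-cache-line directory record (names are assumptions)."""
    owner: int                                  # node that supplies the line's data
    sharers: set = field(default_factory=set)   # other nodes holding read-only copies
    exclusive: bool = False                     # True while the owner's copy is writable


def handle_write(entry, requester):
    """Give `requester` an exclusive, writable copy; return the actions taken."""
    actions = []
    if entry.exclusive and entry.owner == requester:
        return actions  # write access right already held, nothing to do
    if entry.owner != requester:
        # Assumption: the old owner gives up its copy when it supplies the line.
        actions.append(("fetch_from_owner", entry.owner))
    for node in sorted(entry.sharers - {requester}):
        actions.append(("invalidate", node))    # remove shared copies elsewhere
    entry.owner = requester
    entry.sharers = set()
    entry.exclusive = True
    return actions


def handle_read(entry, requester):
    """Give `requester` a shared copy; return the actions taken."""
    actions = []
    if requester == entry.owner or requester in entry.sharers:
        return actions  # read access right already held
    actions.append(("fetch_from_owner", entry.owner))
    entry.exclusive = False                     # owner keeps the line, now shared
    entry.sharers.add(requester)
    return actions
```

In this sketch, a write by a node that holds no copy produces one fetch from the owner followed by one invalidation per sharer, whereas a read produces only the fetch.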
Other techniques for handling coherence in shared memory systems employ software methodologies that perform functions similar to those of the hardware interfaces described above. More particularly, prior to performing an operation on a given line, the software may be configured to access a directory entry corresponding to the cache line and to perform corresponding coherence operations similar to those discussed in the hardware context above. In some implementations, other data structures such as MTAGs may also be maintained that indicate access rights to cache lines stored within each node. The MTAG for a given cache line may be accessed to determine whether coherence operations to carry out a given operation are necessary.
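By way of example only, the software check against a per-node access-right structure may be sketched as follows. The three access states, the dictionary representation of the MTAG, and the function name `needs_coherence` are assumptions made for illustration; the description above does not specify how MTAG entries are encoded.

```python
from enum import IntEnum


class Access(IntEnum):
    """Hypothetical access rights a node may hold for a cache line."""
    INVALID = 0
    READ_ONLY = 1
    READ_WRITE = 2


def needs_coherence(mtag, line, op):
    """Return True if `op` ("load" or "store") on `line` requires
    coherence operations before it may proceed on this node."""
    required = Access.READ_WRITE if op == "store" else Access.READ_ONLY
    # A line with no MTAG entry is treated as INVALID (an assumption).
    return mtag.get(line, Access.INVALID) < required
```

Under these assumptions, a store to a line held read-only reports that coherence operations are necessary, while a load from the same line does not.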
To avoid race conditions, the directory and/or MTAG entries may be “locked” via atomic operations. The locking of the directory and/or MTAG entries prevents other processors or nodes from modifying the entries and performing coherence operations with respect to a cache line that is already being operated upon by a processor that has acquired the lock. Thus, possessing a lock on the directory and/or MTAG entry may be a necessary precondition for performing a given operation (e.g., a store and/or a load) on a cache line. After performing the operation on the cache line or coherence operations relating thereto, the processor may release the lock, thereby allowing another processor to acquire the lock.
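The acquire and release sequence may be sketched as follows. This is a software model only: the class `EntryLock` and its methods are hypothetical, and a `threading.Lock` stands in for the atomicity that a hardware atomic instruction would provide. The model makes visible that an acquisition attempt reads the lock word and then conditionally writes it.

```python
import threading


class EntryLock:
    """Hypothetical lock word attached to a directory or MTAG entry."""

    def __init__(self):
        self._held_by = None
        # Stand-in for hardware atomicity (an assumption of this model).
        self._atomic = threading.Lock()

    def try_acquire(self, processor_id):
        """Atomically test the lock word and set it if it is free."""
        with self._atomic:
            if self._held_by is None:          # load sub-operation
                self._held_by = processor_id   # store sub-operation
                return True
            return False

    def release(self, processor_id):
        """Clear the lock word; only the current holder may do so."""
        with self._atomic:
            if self._held_by != processor_id:
                raise RuntimeError("lock released by a non-holder")
            self._held_by = None
```

A processor would call `try_acquire` before operating on the cache line, retry or back off on failure, and call `release` once the operation and any related coherence operations have completed.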
The atomic operations required to obtain a lock include both load and store sub-operations. Unfortunately, these lock acquisition functions can add significant latency, thus degrading overall system performance. In addition, if a cache line is alternately written to by processors of different nodes, the cache line and the corresponding locks may migrate frequently between those nodes, further limiting overall system performance.
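The effect of the write pattern on migration may be illustrated with a simple count. The function `count_migrations` is a hypothetical model, not a measurement: it assumes that each write by a node other than the current holder moves the cache line, and its lock, once, and it ignores the initial acquisition.

```python
def count_migrations(writer_nodes):
    """Count how many times the line changes node over a sequence of writes.

    `writer_nodes` lists the node issuing each successive write.
    """
    migrations = 0
    holder = None
    for node in writer_nodes:
        if holder is not None and node != holder:
            migrations += 1   # line and lock move to the new writer's node
        holder = node
    return migrations
```

For six writes issued alternately by two nodes, `count_migrations([0, 1, 0, 1, 0, 1])` returns 5, whereas the same six writes grouped by node, `count_migrations([0, 0, 0, 1, 1, 1])`, returns 1.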