1. Field of the Invention
This invention relates to the field of multiprocessor computer systems and, more particularly, to mechanisms and methods for optimizing operations within multiprocessor computer systems having distributed shared memory architectures.
2. Description of the Relevant Art
Multiprocessing computer systems include two or more processors which may be employed to perform computing tasks. A particular computing task may be performed upon one processor while other processors perform unrelated computing tasks. Alternatively, components of a particular computing task may be distributed among multiple processors to decrease the time required to perform the computing task as a whole. Generally speaking, a processor is a device configured to perform an operation upon one or more operands to produce a result. The operation is performed in response to an instruction executed by the processor.
A popular architecture in commercial multiprocessing computer systems is the symmetric multiprocessor (SMP) architecture. Typically, an SMP computer system comprises multiple processors connected through a cache hierarchy to a shared bus. Additionally connected to the bus is a memory, which is shared among the processors in the system. Access to any particular memory location within the memory occurs in a similar amount of time as access to any other particular memory location. Since each location in the memory may be accessed in a uniform manner, this structure is often referred to as a uniform memory architecture (UMA).
Processors are often configured with internal caches, and one or more caches are typically included in the cache hierarchy between the processors and the shared bus in an SMP computer system. Multiple copies of data residing at a particular main memory address may be stored in these caches. In order to maintain the shared memory model, in which a particular address stores exactly one data value at any given time, shared bus computer systems employ cache coherency. Generally speaking, an operation is coherent if the effects of the operation upon data stored at a particular memory address are reflected in each copy of the data within the cache hierarchy. For example, when data stored at a particular memory address is updated, the update may be supplied to the caches which are storing copies of the previous data. Alternatively, the copies of the previous data may be invalidated in the caches such that a subsequent access to the particular memory address causes the updated copy to be transferred from main memory. For shared bus systems, a snoop bus protocol is typically employed. Each coherent transaction performed upon the shared bus is examined (or "snooped") against data in the caches. If a copy of the affected data is found, the state of the cache line containing the data may be updated in response to the coherent transaction.
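The invalidation alternative described above can be illustrated with a toy model. The following is a minimal sketch only, not an actual bus protocol: the names `Cache`, `SnoopBus`, `snoop`, `read`, and `write` are hypothetical, cache line states are reduced to "present" or "absent", and writes are assumed to go straight through to main memory to keep the model small.

```python
class Cache:
    """Toy private cache: maps address -> cached value (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.lines = {}

    def snoop(self, op, addr):
        # Invalidate the local copy when another cache announces a write.
        if op == "write" and addr in self.lines:
            del self.lines[addr]


class SnoopBus:
    """Toy shared bus: every write is observed by all other attached caches."""

    def __init__(self, memory):
        self.memory = memory
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def read(self, cache, addr):
        if addr not in cache.lines:          # miss: fetch from main memory
            cache.lines[addr] = self.memory[addr]
        return cache.lines[addr]

    def write(self, cache, addr, value):
        for other in self.caches:            # all other caches snoop the write
            if other is not cache:
                other.snoop("write", addr)
        cache.lines[addr] = value
        self.memory[addr] = value            # assumption: write-through
```

In this sketch, if two caches both read an address and one of them then writes it, the other cache loses its copy and its next read misses and fetches the updated value from main memory.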
Unfortunately, shared bus architectures suffer from several drawbacks which limit their usefulness in multiprocessing computer systems. A bus has a finite peak bandwidth (e.g., the number of bytes per second which may be transferred across the bus). As additional processors are attached to the bus, the bandwidth required to supply the processors with data and instructions may exceed the peak bus bandwidth. Since some processors are forced to wait for available bus bandwidth, performance of the computer system suffers when the bandwidth requirements of the processors exceed the available bus bandwidth.
Additionally, adding more processors to a shared bus increases the capacitive loading on the bus and may even cause the physical length of the bus to be increased. The increased capacitive loading and extended bus length increase the delay in propagating a signal across the bus. Due to the increased propagation delay, transactions may take longer to perform. Therefore, the peak bandwidth of the bus may decrease as more processors are added.
These problems are further magnified by the continued increase in operating frequency and performance of processors. The increased performance enabled by the higher frequencies and more advanced processor microarchitectures results in higher bandwidth requirements than previous processor generations, even for the same number of processors. Therefore, buses which previously provided sufficient bandwidth for a multiprocessing computer system may be insufficient for a similar computer system employing the higher performance processors.
Another structure for multiprocessing computer systems is a distributed shared memory architecture. A distributed shared memory architecture includes multiple nodes within which processors and memory reside. The multiple nodes communicate via a network coupled therebetween. When considered as a whole, the memory included within the multiple nodes forms the shared memory for the computer system. Typically, directories are used to identify which nodes have cached copies of data corresponding to a particular address. Coherency activities may be generated via examination of the directories.
Distributed shared memory systems are scalable, overcoming the limitations of the shared bus architecture. Since many of the processor accesses are completed within a node, nodes typically have much lower bandwidth requirements upon the network than a shared bus architecture must provide upon its shared bus. The nodes may operate at high clock frequency and bandwidth, accessing the network when needed. Additional nodes may be added to the network without affecting the local bandwidth of the nodes. Instead, only the network bandwidth is affected.
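The role of a directory can be sketched as follows. This is an illustrative model under stated assumptions, not a description of any particular system: the class `Directory` and its methods `record_read` and `record_write` are hypothetical names, and a directory entry is reduced to the set of node identifiers holding a copy.

```python
class Directory:
    """Toy directory: address -> set of node ids holding a cached copy."""

    def __init__(self):
        self.sharers = {}

    def record_read(self, node, addr):
        # A node obtained a shared copy; remember it as a sharer.
        self.sharers.setdefault(addr, set()).add(node)

    def record_write(self, node, addr):
        # Examining the entry yields the coherency activity needed:
        # every other sharer must have its copy invalidated.
        to_invalidate = self.sharers.get(addr, set()) - {node}
        self.sharers[addr] = {node}
        return sorted(to_invalidate)
```

For example, if nodes 0 and 2 have read an address and node 1 then writes it, `record_write` returns the list of nodes 0 and 2 as the copies to invalidate, and node 1 becomes the only recorded holder.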
Despite their advantages, multiprocessing computer systems having distributed shared memory architectures can suffer severe performance degradation as a result of spin-lock operations. In general, spin-lock operations are associated with software locks which are used by programs to ensure that only one parallel process at a time can access a critical region of memory. A variety of lock implementations have been developed, ranging from simple spin-locks to advanced queue-based locks. Although simple spin-lock implementations can create very bursty traffic as described below, they are still the most commonly used software locks within computer systems.
Systems employing spin-lock implementations typically require that a given process perform an atomic operation to obtain access to a critical memory region. For example, an atomic test-and-set operation is commonly used. The test-and-set operation is performed to determine whether a lock bit associated with the memory region is cleared and to atomically set the lock bit. That is, the test allows the process to determine whether the memory region is free of a lock by another process, and the set operation allows the process to achieve the lock if the lock bit is cleared. If the test of the lock bit indicates that the memory region is currently locked, the process initiates a software loop wherein the lock bit is continuously read until the lock bit is detected as cleared, at which time the process reinitiates the atomic test-and-set operation.
Spin-locks may be implemented using either optimistic or pessimistic spin-lock algorithms. An optimistic spin-lock is depicted by the following algorithm:
    top:  atomic_test&set          ; RTO
          if failed
          begin
              while busy spin      ; spin on RTS
              goto top
          end
For the optimistic spin-lock algorithm shown above, the process first performs an atomic test-and-set operation upon the lock bit corresponding to the memory region for which access is sought. Since the atomic test-and-set operation includes a write, it is treated as a read-to-own (RTO) operation in shared memory systems. The system will thus place the coherency unit containing the lock bit in a modified state in response to the atomic test-and-set operation. If the atomic test-and-set operation fails, the process reads the lock bit in a repetitive fashion until the lock bit is cleared by another process. The process then reinitiates the atomic test-and-set operation.
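The optimistic algorithm can be exercised with the following sketch. It is an emulation only: Python has no hardware test-and-set, so a `threading.Lock` is assumed to stand in for the atomicity the hardware would provide, and the names `LockBit`, `acquire_optimistic`, and `run_demo` are hypothetical.

```python
import threading
import time


class LockBit:
    """Emulated lock bit; a threading.Lock stands in for hardware atomicity."""

    def __init__(self):
        self._guard = threading.Lock()
        self.busy = False

    def test_and_set(self):
        # Atomically return the old value and set the bit (the RTO step).
        with self._guard:
            old = self.busy
            self.busy = True
            return old

    def clear(self):
        self.busy = False


def acquire_optimistic(lock):
    while lock.test_and_set():      # attempt the atomic operation first
        while lock.busy:            # on failure, spin on plain reads (RTS)
            time.sleep(0)           # yield so other threads can run


def run_demo(n_threads=4, iterations=200):
    lock = LockBit()
    total = [0]

    def worker():
        for _ in range(iterations):
            acquire_optimistic(lock)
            total[0] += 1           # critical region
            lock.clear()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

Because only one thread at a time can observe the lock bit as clear in `test_and_set`, the counter returned by `run_demo` equals the number of threads multiplied by the number of iterations.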
A pessimistic spin-lock is depicted by the following algorithm:
    top:  while busy spin          ; spin on RTS
          atomic_test&set          ; RTO
          if failed
          begin
              goto top
          end
For the pessimistic spin-lock algorithm, the process first reads the lock bit corresponding to the memory region for which access is sought in a repetitive fashion until the lock bit is cleared. The read of the lock bit is treated as a read-to-share (RTS) operation in shared memory systems. When the process determines that the lock bit is clear in accordance with the read operation(s), the process performs an atomic test-and-set operation to lock and gain access to the memory region. If the atomic test-and-set operation fails, the process again repetitively reads the lock bit until it is cleared.
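The order of transactions produced by the pessimistic algorithm can be traced with a single-threaded sketch. The class `TracedLockBit`, its `release_after_reads` parameter, and the function `acquire_pessimistic` are hypothetical; the holder of the lock is assumed to release it after a fixed number of reads by the spinning process, purely so that the trace is deterministic.

```python
class TracedLockBit:
    """Lock bit that logs each access and is released after a set number of reads."""

    def __init__(self, busy, release_after_reads):
        self.busy = busy
        self.reads_until_release = release_after_reads
        self.log = []

    def read(self):
        self.log.append("RTS")
        value = self.busy
        if self.busy:
            self.reads_until_release -= 1
            if self.reads_until_release == 0:
                self.busy = False       # assumed: the holder releases here
        return value

    def test_and_set(self):
        self.log.append("RTO")
        old = self.busy
        self.busy = True
        return old


def acquire_pessimistic(lock):
    while True:
        while lock.read():              # spin on reads until the bit is clear
            pass
        if not lock.test_and_set():     # only then attempt the atomic operation
            return
```

With a lock that starts busy and is released after three reads, the log shows four RTS entries followed by a single RTO: no write transaction is issued while the lock is observed as held.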
For both implementations, when a memory region corresponding to a contended spin-lock is released, all N spinning processors will generate RTS transactions bound for the cache line. In a distributed shared memory architecture, N RTS requests will therefore be queued at the home node, and will be serviced one at a time.
The first processor to receive a data reply detects the free lock and will generate an RTO transaction. The RTO transaction will be queued at the home node behind the earlier RTS requests. Since the processor of each of the remaining RTS requests will similarly receive an indication that the lock is free, each of these processors will also generate an RTO transaction. When the first RTO transaction is ultimately serviced by the home node, the processor issuing that transaction will lock and gain access to the memory region. The test-and-set operations corresponding to the RTO requests of the remaining processors will therefore fail, and each of these processors will resume spinning with RTS requests.
From the above discussion, it is evident that when several spinning processors contend for access to the same memory region, a relatively large number of transaction requests will occur when the lock is released. Due to this, the latency from the release of a lock until the next contender can acquire the lock is relatively high (i.e., on the order of N times the latency for an RTS). The large number of transactions can further limit the maximum frequency at which ownership of the lock can migrate from node to node. Finally, since only one of the spinning processors will achieve the lock, the failed test-and-set operations of the remaining processors result in undesirable read-to-own requests on the network. The coherency unit in which the lock is stored undesirably migrates from processor to processor and node to node, invalidating other copies. Network traffic is thereby further increased despite the fact that the lock is set. A mechanism is therefore desirable for optimizing operations of a multiprocessor system during spin-locks to reduce the number of transaction requests resulting from a released lock, thereby improving overall system performance.
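The burst of transactions following a lock release can be counted with a small queue model. This is a sketch under simplifying assumptions: the home node is modeled as a single first-in, first-out queue, all N spinning processors are assumed to have their RTS requests queued at the moment of release, and the model stops before the losing processors resume spinning. The function name `release_burst` and the counter labels are hypothetical.

```python
from collections import deque


def release_burst(n_spinners):
    """Count home-node transactions after a contended lock is released."""
    queue = deque(("RTS", p) for p in range(n_spinners))
    locked = False
    serviced = 0
    serviced_before_acquire = None
    counts = {"RTS": 0, "RTO_ok": 0, "RTO_failed": 0}
    while queue:
        kind, proc = queue.popleft()    # requests are serviced one at a time
        serviced += 1
        if kind == "RTS":
            counts["RTS"] += 1
            if not locked:              # reply shows the lock free
                queue.append(("RTO", proc))
        elif locked:
            counts["RTO_failed"] += 1   # test-and-set fails; lock already taken
        else:
            locked = True               # first RTO serviced acquires the lock
            counts["RTO_ok"] += 1
            serviced_before_acquire = serviced - 1
    return counts, serviced_before_acquire
```

For four spinning processors the model services four RTS requests before the winning RTO, followed by three failed RTO requests, which matches the estimate that the hand-off latency grows on the order of N RTS latencies.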
Maximizing transaction throughput is an important consideration in multiprocessing systems employing distributed shared memory architectures, both during spin-lock operations and during transactions involving other operations. Still further, systems employing distributed shared memory architectures should be configured to avoid coherency failures due to race conditions. In addition, in some situations many CPUs may access the same cache lines. This may occur at startup, for example, when many CPUs execute identical code. It may also occur for some barrier synchronization implementations where all the waiting CPUs are spinning on the same variable having a "wait" value. When the variable is changed to its "go" value, the local copies in all of the CPUs' caches are invalidated and all CPUs issue global read requests to obtain the new value. In such situations, the system may force the CPUs to access the variable sequentially with no accesses overlapping. This has the effect of delaying the last CPU, before it may proceed, by an amount equal to the latency for one CPU to access the variable multiplied by the number of waiting CPUs.
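The delay described for the barrier case is simple arithmetic, sketched below. The function name `serialized_completion_times` and the latency unit are hypothetical; the only assumption taken from the text is that accesses are serviced strictly one after another with no overlap.

```python
def serialized_completion_times(n_cpus, access_latency):
    """Time at which each waiting CPU obtains the variable when accesses do not overlap."""
    return [access_latency * (i + 1) for i in range(n_cpus)]
```

With eight waiting CPUs and an access latency of two time units, the last entry of the returned list is sixteen time units, i.e., the single-access latency multiplied by the number of waiting CPUs.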