1. Field of the Invention
This invention relates to the field of multiprocessor computer systems and, more particularly, to mechanisms and methods for optimizing spin-lock operations within multiprocessor computer systems.
2. Description of the Related Art
A popular architecture in commercial multiprocessing computer systems is a distributed shared memory architecture. A distributed shared memory architecture includes multiple nodes within which processors and memory reside. The multiple nodes communicate via a network coupled therebetween. When considered as a whole, the memory included within the multiple nodes forms the shared memory for the computer system. Typically, directories are used to identify which nodes have cached copies of data corresponding to a particular address. Coherency activities may be generated via examination of the directories.
Distributed shared memory systems are scalable, overcoming various limitations associated with shared bus architectures. Since many of the processor accesses are completed within a node, nodes typically have much lower bandwidth requirements upon the network in comparison to the bandwidth requirements a shared bus architecture must provide upon its shared bus to attain comparable performance. The nodes may operate at high clock frequency and bandwidth, accessing the network when needed. Additional nodes may be added to the network without affecting the local bandwidth of the nodes. Instead, only the network bandwidth is affected.
Many distributed shared memory architectures have non-uniform access time to the shared memory. Such architectures are known as non-uniform memory architectures (NUMA). Most systems that form NUMA architectures also have the characteristic of a non-uniform communication architecture (NUCA), in which the access time from a processor to the other processors' caches varies greatly depending on their placement. In particular, node-based NUMA systems, where a group of processors has a much shorter access time to each other's caches than to the other caches, are common. Recently, technology trends have made it attractive to run more than one thread per chip, using the chip multiprocessor (CMP) approach, the simultaneous multi-threading (SMT) approach, or both. Large servers, built from several such chips, can therefore be expected to form NUCA architectures, since collocated threads will most likely share an on-chip cache at some level.
Due to the popularity of NUMA systems, optimizations directed to such architectures have attracted much attention in the past. For example, optimizations involving the migration and replication of data in NUMA systems have demonstrated a great performance improvement in many applications. In addition, since many of today's applications exhibit a large fraction of cache-to-cache misses, optimizations which consider the NUCA nature of a system may also lead to significant performance enhancements.
One particular problem associated with multiprocessing computer systems having distributed shared memory architectures relates to spin-lock operations. In general, spin-lock operations are associated with software locks which are used by programs to ensure that only one parallel process at a time can access a critical region of memory. A variety of lock implementations have been proposed, ranging from simple spin-locks to advanced queue-based locks. Although simple spin-lock implementations can create very bursty traffic as described below, they are still the most commonly used software lock within computer systems.
Systems employing spin-lock implementations typically require that a given process perform an atomic operation to obtain access to a critical memory region. For example, an atomic test-and-set operation is commonly used. The test-and-set operation is performed to determine whether a lock bit associated with the memory region is cleared and to atomically set the lock bit. That is, the test allows the thread to determine whether the memory region is free of a lock by another thread, and the set operation allows the thread to acquire the lock if the lock bit is cleared. If the test of the lock bit indicates that the memory region is currently locked, the thread initiates a software loop wherein the lock bit is repeatedly read until the lock bit is detected as cleared, at which time the thread reinitiates the atomic test-and-set operation.
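The acquire loop described above may be sketched as follows. This is a minimal illustration only, assuming C11 atomics; the type name tas_lock_t and the function names are hypothetical and do not appear in the original text, and an integer word stands in for the lock bit.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: a single word, 0 = cleared (free), 1 = set (held). */
typedef struct {
    atomic_int locked;
} tas_lock_t;

static void tas_lock_init(tas_lock_t *l) {
    atomic_init(&l->locked, 0);
}

/* One atomic test-and-set: writes 1 and returns true only if the
 * previous value was 0, i.e. the lock was free and is now ours. */
static bool tas_lock_try_acquire(tas_lock_t *l) {
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0;
}

/* Acquire loop: attempt the atomic test-and-set; on failure, spin with
 * plain reads until the word is seen as cleared, then retry. */
static void tas_lock_acquire(tas_lock_t *l) {
    while (!tas_lock_try_acquire(l)) {
        while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0) {
            /* busy-wait: read-only spinning on the lock word */
        }
    }
}

/* Release: clear the lock word so that a spinning thread can observe it. */
static void tas_lock_release(tas_lock_t *l) {
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

A thread would call tas_lock_acquire before entering the critical region and tas_lock_release on leaving it. Because the inner loop only reads, the spinning itself may be satisfied from a locally cached copy; the atomic exchange is what requires exclusive access to the coherency unit holding the lock.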
When several spinning processors contend for access to the same memory region, a relatively large number of transaction requests may be generated. As a result, the latency from the release of a lock until the next contender can acquire the lock may be relatively high. The large number of transactions can further limit the maximum frequency at which ownership of the lock can migrate from node to node. Finally, since only one of the spinning processors will acquire the lock, the failed test-and-set operations of the remaining processors result in undesirable requests on the network. The coherency unit in which the lock is stored undesirably migrates from processor to processor and node to node, invalidating other copies. Network traffic is thereby further increased even though the lock is already set.
Other spin-lock implementations have therefore been proposed to improve performance and reduce network traffic when contention for a lock exists. For example, in some implementations, the burst of refill traffic when a lock is released may be reduced by using an exponential back-off delay in which, after failing to obtain a lock, the requester waits for successively longer periods of time before initiating additional lock operations. In other implementations, queue-based locking methodologies have been employed to reduce network traffic. In a system that implements a queue-based lock, requesting processors contending for a lock are queued in an order. A contending processor generates transactions to acquire the lock only if it is the next-in-line contender. Numerous variations of queue-based lock implementations are known.
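The two approaches described above may be illustrated by the following sketch, again assuming C11 atomics. All names and the constants BACKOFF_MIN and BACKOFF_MAX are hypothetical tuning choices, not values from any particular implementation. The queue-based example is a ticket lock, chosen here only because it is one of the simplest locks that grants ownership in arrival order; it is not necessarily the variant contemplated by any particular prior implementation.

```c
#include <stdatomic.h>

/* Hypothetical back-off bounds, in iterations of an empty delay loop. */
enum { BACKOFF_MIN = 4, BACKOFF_MAX = 1024 };

typedef struct { atomic_int locked; } backoff_lock_t;

static void backoff_lock_init(backoff_lock_t *l) { atomic_init(&l->locked, 0); }

/* Crude delay: spin for the given number of iterations. */
static void backoff_pause(unsigned iterations) {
    for (volatile unsigned i = 0; i < iterations; i++) { }
}

/* Double the delay after each failure, saturating at BACKOFF_MAX. */
static unsigned backoff_next(unsigned delay) {
    return delay >= BACKOFF_MAX / 2 ? BACKOFF_MAX : delay * 2;
}

/* After each failed test-and-set, wait successively longer before retrying. */
static void backoff_lock_acquire(backoff_lock_t *l) {
    unsigned delay = BACKOFF_MIN;
    while (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) != 0) {
        backoff_pause(delay);
        delay = backoff_next(delay);
    }
}

static void backoff_lock_release(backoff_lock_t *l) {
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}

/* Ticket lock: contenders are ordered by the ticket they draw. */
typedef struct {
    atomic_uint next_ticket;  /* next ticket to hand out */
    atomic_uint now_serving;  /* ticket currently allowed to hold the lock */
} ticket_lock_t;

static void ticket_lock_init(ticket_lock_t *l) {
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}

/* Draw a ticket once, then wait with plain reads until it is served. */
static void ticket_lock_acquire(ticket_lock_t *l) {
    unsigned mine = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                              memory_order_relaxed);
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != mine) {
        /* busy-wait until our turn */
    }
}

/* Only the holder writes now_serving, so a load followed by a store suffices. */
static void ticket_lock_release(ticket_lock_t *l) {
    unsigned next =
        atomic_load_explicit(&l->now_serving, memory_order_relaxed) + 1;
    atomic_store_explicit(&l->now_serving, next, memory_order_release);
}
```

In the ticket lock, each contender performs a single atomic operation to join the queue and thereafter only reads, so failed test-and-set traffic is avoided. All waiters nevertheless poll the same now_serving word; queue-based locks in which each waiter spins on its own location reduce traffic further, at the cost of a more involved implementation.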
While the various optimizations for spin-lock implementations have in some instances led to enhanced performance, most solutions do not consider or exploit the NUCA characteristics of a distributed shared memory computer system. In addition, many implementations have resulted in relatively high latencies for uncontended locks. A mechanism is therefore desirable that may exploit the NUCA nature of a multiprocessing system to optimize spin-lock operations without introducing significant latencies for uncontended locks.