This invention relates generally to multiprocessor computers that comprise a number of separate but interconnected processor nodes. More particularly, this invention relates to a method for efficiently granting a lock to requesting processors while maintaining fairness among the processor nodes.
Multiprocessor computers by definition contain multiple processors that can execute multiple parts of a computer program or multiple distinct programs simultaneously, in a manner known as parallel computing. In general, multiprocessor computers execute multithreaded programs or multiple concurrent single-threaded programs faster than conventional single processor computers, such as personal computers (PCs), that must execute programs sequentially. The actual performance advantage is a function of a number of factors, including the degree to which parts of a multithreaded program and/or multiple distinct programs can be executed in parallel and the architecture of the particular multiprocessor computer at hand.
Multiprocessor computers may be classified by how they share information among the processors. Shared memory multiprocessor computers offer a common physical memory address space that all processors can access. Multiple processes or multiple threads within the same process can communicate through shared variables in memory that allow them to read or write to the same memory location in the computer. Message passing multiprocessor computers, in contrast, have a separate memory space for each processor, requiring processes in such a system to communicate through explicit messages to each other.
Shared memory multiprocessor computers may further be classified by how the memory is physically organized. In distributed shared memory (DSM) machines, the memory is divided into modules physically placed near each processor. Although all of the memory modules are globally accessible, a processor can access memory placed nearby faster than memory placed remotely. Because the memory access time differs based on memory location, distributed shared memory systems are also called non-uniform memory access (NUMA) machines. In centralized shared memory computers, on the other hand, the memory is physically in one location. Centralized shared memory computers are called uniform memory access (UMA) machines because the memory is equidistant in time from each of the processors. Both forms of memory organization typically use high-speed cache memory in conjunction with main memory to reduce execution time. An alternative form of memory organization in an UMA machine involves groups of processors sharing a cache. In such an arrangement, even though the memory is equidistant from all processors, data can circulate among the processors sharing a cache with lower latency than among processors not sharing a cache.
Multiprocessor computers with distributed shared memory are organized into nodes with one or more processors per node. Also included in the node are local memory for the processors, a remote cache for caching data obtained from memory in other nodes, and logic for linking the node with other nodes in the computer. A processor in a node communicates directly with the local memory and communicates indirectly with memory on other nodes through the node's remote cache.
For example, if the desired data is in local memory, a processor obtains the data directly from a block (or line) of local memory. But if the desired data is stored in memory in another node, the processor must access its remote cache to obtain the data. Further information on multiprocessor computer systems in general and NUMA machines in particular can be found in a number of works including Computer Architecture: A Quantitative Approach (2nd Ed. 1996), by D. Patterson and J. Hennessy, which is incorporated herein by reference.
Although the processors can often execute in parallel, it is sometimes desirable to restrict execution of certain tasks to a single processor. For example, two processors might execute program instructions to add one to a counter. Specifically, the instructions could be the following:
1. Read the counter into a register.
2. Add one to the register.
3. Write the register to the counter.
If two processors were to execute these instructions in parallel, the first processor might read the counter (e.g., "5") and add one to it (resulting in "6"). Since the second processor is executing in parallel with the first processor, the second processor might also read the counter (still "5") and add one to it (resulting in "6"). One of the processors would then write its register (containing a "6") to the counter, and the other processor would do the same. Although two processors have executed instructions to add one to the counter, the counter is only one greater than its original value.
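The interleaving described above can be replayed deterministically. The following is a minimal sketch rather than real parallel execution; the names reg_a, reg_b, and lost_update_result are illustrative and do not come from the original text.

```python
# Deterministic replay of the interleaving described above: both processors
# read the counter before either one writes it back.
counter = 5

reg_a = counter      # processor A, step 1: read the counter (5)
reg_b = counter      # processor B, step 1: read the counter (still 5)
reg_a += 1           # processor A, step 2: add one to the register (6)
reg_b += 1           # processor B, step 2: add one to the register (6)
counter = reg_a      # processor A, step 3: write the register (6)
counter = reg_b      # processor B, step 3: write the register (6 again)

lost_update_result = counter
print(lost_update_result)  # 6, although two increments were executed
```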
To avoid such an undesirable result, some computer systems provide a mechanism called a lock for protecting sections of programs. When a processor requests the lock, it is granted to that processor exclusively. Other processors desiring the lock must wait until the processor with the lock releases it. A common arrangement is to require possession of a particular lock before allowing access to a designated section of a program; the processor then releases the lock when it is finished with the section. The section is thereby protected by the lock.
Accordingly, when a processor has acquired the lock, the processor is guaranteed that it is the sole processor executing the protected code.
To solve the add-one-to-a-counter scenario described above, both the first and the second processors would request the lock. Whichever processor first acquires the lock would then read the counter, increment the register, and write to the counter before releasing the lock. The remaining processor would have to wait until the first processor finishes, acquire the lock, perform its operations on the counter, and release the lock. In this way, the lock guarantees the counter is incremented twice if the instructions are run twice, even if processors running in parallel execute them.
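As a minimal sketch of this arrangement, the three counter instructions can be wrapped in a lock using Python's standard threading module. Threads stand in for processors here, and the function names add_one and run are illustrative assumptions, not part of the original text.

```python
import threading

counter = 0
counter_lock = threading.Lock()

def add_one(times):
    """Increment the shared counter `times` times, holding the lock each time."""
    global counter
    for _ in range(times):
        with counter_lock:          # acquire the lock; released on block exit
            register = counter      # 1. read the counter into a register
            register += 1           # 2. add one to the register
            counter = register      # 3. write the register to the counter

def run(num_threads, times):
    """Run add_one on several threads and return the final counter value."""
    global counter
    counter = 0
    threads = [threading.Thread(target=add_one, args=(times,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

Because only the lock holder can execute the three steps, run(4, 1000) yields 4000 regardless of how the threads are scheduled.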
Program instructions requiring exclusive execution are grouped into a section of program code called a critical section or critical region. Typically, the operating system handles the details of granting and releasing the lock associated with the critical section, but critical sections can also be implemented using user-level functions.
Accordingly, when code in a critical section is executing, the lock guarantees no other processors are executing the same code. To prevent the add-one-to-a-counter problem in the above example, the program instructions for manipulating the counter could be grouped into a critical section.
Locks are useful for solving a wide variety of other problems such as restricting concurrent data structure access to a single processor to avoid data inconsistency. For more information on locks and related topics, see "Process Synchronization and Interprocess Communication" in The Computer Science and Engineering Handbook (1996) by A. Tucker, CRC Press, pages 1725-1746, which is incorporated herein by reference.
A typical scheme for managing lock requests is to use a first-come-first-served queued lock design. Under this design, the operating system grants a lock to the first requesting processor, queues subsequent requests by other processors until the first processor finishes with the lock, and grants the lock to processors waiting in the queue in order. However, a first-come-first-served scheme has certain drawbacks relating to performance.
For example, if the architecture of the multiprocessor system groups processors into nodes, communication latencies between processors in two different nodes are typically greater than those between processors in the same node. Accordingly, it is typically more expensive in terms of processing resources to move the lock from a processor in one node to a processor in another node. However, if a multiprocessor system implements a first-come-first-served queued lock scheme, each lock request might result in a lock trip between the nodes under certain conditions. Since each inter-node lock trip is expensive, lock synchronization can consume tremendous processing resources, leaving fewer resources for completing program tasks. As a result, a first-come-first-served scheme may exhibit poor performance.
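The cost described above can be illustrated by counting inter-node hand-offs for a given grant order. The sketch below assumes a hypothetical arrival pattern in which requests alternate between two nodes; the processor labels and the helper count_lock_trips are illustrative only.

```python
def count_lock_trips(grants):
    """Count hand-offs that move the lock from one node to another.

    `grants` is a list of (cpu, node) pairs in the order the lock is granted.
    """
    return sum(1 for prev, nxt in zip(grants, grants[1:]) if prev[1] != nxt[1])

# Hypothetical arrival order: requests alternate between node 0 and node 1.
arrivals = [("p0", 0), ("p4", 1), ("p1", 0), ("p5", 1), ("p2", 0), ("p6", 1)]

# A first-come-first-served scheme grants the lock in arrival order.
fifo_grants = list(arrivals)
fifo_trips = count_lock_trips(fifo_grants)
print(fifo_trips)  # 5: every hand-off crosses between nodes
```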
To reduce lock synchronization overhead, a variation of the queued lock scheme grants the lock to the next processor in the queue at the same node at which the lock was most recently released. In other words, the lock is kept in the node until all processors in the node with outstanding lock requests have been granted the lock. In this way, the scheme avoids some inter-node lock trips. In certain applications, critical sections are rare enough that this variation works well. However, under some circumstances, a particular node can become saturated with lock requests. Processors outside the saturated node are unable to acquire the lock in a reasonable time, so certain tasks or programs perform poorly while others enjoy prolonged access to the lock. The processors unable to acquire the lock are said to be subjected to starvation due to unfair allocation of the lock.
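A minimal sketch of this variation follows, assuming a fixed set of waiting requests and hypothetical processor labels. The function lingering_order is an illustrative name; it prefers the oldest waiter at the node that last held the lock and falls back to the oldest waiter overall.

```python
from collections import deque

def lingering_order(requests, first_node):
    """Grant order when the lock stays at a node while that node has waiters.

    `requests` is a list of (cpu, node) pairs in arrival order.
    """
    waiting = deque(requests)
    current = first_node
    order = []
    while waiting:
        # Prefer the oldest waiter at the node that most recently held the lock.
        pick = next((r for r in waiting if r[1] == current), None)
        if pick is None:
            pick = waiting[0]   # nobody local is waiting: take the oldest waiter
        waiting.remove(pick)
        order.append(pick)
        current = pick[1]
    return order

arrivals = [("p0", 0), ("p4", 1), ("p1", 0), ("p5", 1), ("p2", 0), ("p6", 1)]
order = lingering_order(arrivals, first_node=0)
trips = sum(1 for a, b in zip(order, order[1:]) if a[1] != b[1])
print([cpu for cpu, _ in order])  # ['p0', 'p1', 'p2', 'p4', 'p5', 'p6']
print(trips)                      # 1 inter-node trip
```

In this example p4 requested the lock second but is granted it fourth. If node 0 kept producing new requests, p4 would be passed over indefinitely, which is the starvation described above.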
In accordance with the invention, a method of granting a lock to requesting processors tends to keep the lock at a particular node but maintains fairness among the nodes. When a lock is released by a first processor, the lock is kept at the node if there is an outstanding lock request by a second processor at the same node, even if other processors at other nodes requested the lock before the second processor. However, fairness control prevents starvation of the other nodes by limiting how the lock is kept at the node according to some criterion (e.g., by limiting the number of consecutive lock grants at a node or limiting the time a lock can be kept at a node).
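One of the criteria mentioned above, a limit on consecutive grants at a node, might be sketched as follows. This is an illustration under stated assumptions and not the claimed method: the names pick_next, grant_order, and max_consecutive are hypothetical, and the time-based criterion is not modeled.

```python
def pick_next(waiting, current_node, consecutive, max_consecutive):
    """Choose the next waiter; return (index into waiting, new consecutive count).

    `waiting` is a non-empty list of (cpu, node) pairs in arrival order.
    """
    if consecutive < max_consecutive:
        for i, (_, node) in enumerate(waiting):
            if node == current_node:
                return i, consecutive + 1      # keep the lock at this node
    # Limit reached, or no local waiter: oldest waiter at another node, if any.
    for i, (_, node) in enumerate(waiting):
        if node != current_node:
            return i, 1
    return 0, consecutive + 1                  # only local waiters remain

def grant_order(waiting, current_node, consecutive, max_consecutive):
    """Return the order in which the waiting processors receive the lock."""
    waiting = list(waiting)
    order = []
    while waiting:
        i, consecutive = pick_next(waiting, current_node, consecutive,
                                   max_consecutive)
        cpu, current_node = waiting.pop(i)
        order.append(cpu)
    return order

# The lock is held at node 0 (one grant so far); p4 at node 1 asked first.
waiters = [("p4", 1), ("p1", 0), ("p2", 0), ("p3", 0), ("p5", 0)]
print(grant_order(waiters, current_node=0, consecutive=1, max_consecutive=3))
# ['p1', 'p2', 'p4', 'p3', 'p5']
```

With a limit of three consecutive grants, p1 and p2 bypass p4, but the lock then moves to node 1 before p3 and p5 are served.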
In one aspect of the invention, logic for handling lock requests of interrupted processors is incorporated into the scheme. For example, lock requests by interrupted processors are ignored when determining whether to keep the lock at a node.
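A minimal sketch of such a check is shown below, assuming that the set of interrupted processors is known at the time of the decision. The function name keep_lock_at_node and its parameters are hypothetical.

```python
def keep_lock_at_node(waiting, node, interrupted):
    """Return True if a processor at `node` is waiting and not interrupted.

    `waiting` is a list of (cpu, node) pairs; `interrupted` is a set of cpus
    currently servicing an interrupt, whose requests are ignored here.
    """
    return any(n == node and cpu not in interrupted for cpu, n in waiting)

waiting = [("p1", 0), ("p4", 1)]
print(keep_lock_at_node(waiting, 0, set()))    # True: p1 can take the lock
print(keep_lock_at_node(waiting, 0, {"p1"}))   # False: p1 is interrupted
```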
In another aspect of the invention, a specialized data structure may be used to represent lock requests. The data structure can be placed in a queue, and the fields are arranged to prevent corrupting other fields in the data structure when atomic operations related to locking are performed on the data structure. A spin state field is crafted to fit within 32 bits.
In yet another aspect of the invention, before a lock is requested, data structures representing requests for the lock are preallocated to avoid having to allocate structures when the lock is requested.
In yet another aspect of the invention, the locking scheme avoids excess remote memory accesses by the processors by allowing processors to spin on a field local to the node, thereby enhancing performance.
In still another aspect of the invention, threads are blocked if they spin on the lock for more than a predetermined maximum amount of time. When the lock is circulated to a node, the processors at the node are unblocked.
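The spin-then-block behavior might be sketched as follows, using Python threads in place of processors. The per-node wakeup event, the function name wait_for_lock, and the max_spin_seconds parameter are assumptions made for illustration.

```python
import threading
import time

def wait_for_lock(try_acquire, node_wakeup, max_spin_seconds):
    """Spin on try_acquire() for a bounded time, then block on the node's event.

    Returns "spun" if the lock was acquired while spinning, or "blocked" if
    the caller had to wait on `node_wakeup` first.
    """
    deadline = time.monotonic() + max_spin_seconds
    while time.monotonic() < deadline:
        if try_acquire():
            return "spun"
    while True:
        node_wakeup.wait()      # sleep until the lock is circulated to this node
        node_wakeup.clear()
        if try_acquire():
            return "blocked"

lock = threading.Lock()
wakeup = threading.Event()
first = wait_for_lock(lambda: lock.acquire(blocking=False), wakeup, 1.0)
print(first)  # spun: the lock was free
```

A releasing thread would release the lock and then call wakeup.set() for the node that receives the lock next.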
The detailed description sets forth two detailed illustrative embodiments: a kernel-level locking implementation and a user-level locking implementation. In the kernel-level embodiment, a method for queuing lock requests generally keeps the lock at a node while maintaining fairness among the nodes by tracking lock requests in a specialized queue. A processor requesting the lock can acquire a preemptive position in the queue if another processor at the same node has placed a queue element in the queue and the queue element has not been used more than a predetermined maximum number of times. The queuing method handles interrupted processors, requeuing them if passed over while servicing an interrupt.
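The preemptive-position idea might be sketched as follows. This is a simplified illustration, not the kernel-level embodiment itself: it assumes one queue element per node visit, treats each processor that joins an element as one use of that element, and does not model interrupts. The names QueueElement, enqueue, and max_uses are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class QueueElement:
    node: int
    cpus: list = field(default_factory=list)   # processors served by this element

def enqueue(queue, cpu, node, max_uses):
    """Join an existing element for `node` if it has uses left, else append one."""
    for elem in queue:
        if elem.node == node and len(elem.cpus) < max_uses:
            elem.cpus.append(cpu)               # preemptive position in the queue
            return
    queue.append(QueueElement(node, [cpu]))     # new element at the tail

def grant_sequence(queue):
    """Order in which processors receive the lock, element by element."""
    return [cpu for elem in queue for cpu in elem.cpus]

queue = []
for cpu, node in [("p0", 0), ("p4", 1), ("p1", 0), ("p2", 0), ("p3", 0), ("p5", 1)]:
    enqueue(queue, cpu, node, max_uses=3)
print(grant_sequence(queue))  # ['p0', 'p1', 'p2', 'p4', 'p5', 'p3']
```

Here p1 and p2 join the element placed by p0 and so move ahead of p4, while p3 finds that element used up and must queue behind node 1.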
In the user-level embodiment, a method of keeping the lock at a node employs a round-robin lock scheme that circulates the lock among the nodes, generally keeping the lock at a node while maintaining fairness among the nodes. A specialized data structure tracks which nodes have processors holding the lock or attempting to acquire the lock.
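A round-robin choice of the next node might be sketched as follows, assuming the tracking structure is a bitmask with one bit per node. The bitmask representation and the name next_node are assumptions for illustration; the text above does not specify the layout of the data structure.

```python
def next_node(request_mask, current_node, num_nodes):
    """Return the next node in cyclic order whose request bit is set.

    Bit i of `request_mask` is set when node i has a processor attempting to
    acquire the lock. The current node is considered last, so the lock returns
    to it only when no other node is waiting. Returns None if no bit is set.
    """
    for step in range(1, num_nodes + 1):
        candidate = (current_node + step) % num_nodes
        if request_mask & (1 << candidate):
            return candidate
    return None

print(next_node(0b1010, 1, 4))  # 3: node 2 has no waiters, node 3 does
print(next_node(0b0010, 1, 4))  # 1: only the current node has waiters
```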
Although the illustrated embodiments portray the invention implemented in a NUMA machine in which each node has local memory, a lingering lock scheme can be applied to machines having other memory organization designs. Any processor interconnection design wherein processors are grouped so processors within a group have significantly lower communications latencies (e.g., an UMA machine in which processors are grouped to share cache) can benefit from the described lingering lock scheme. The term "node" includes any such grouping of processors as well as a NUMA machine node.
Additional aspects and advantages of the invention will become apparent with reference to the following description, which proceeds with reference to the accompanying drawings.