Given the continually increasing reliance on computers in contemporary society, computer technology has had to advance on many fronts to keep up with increased demand. One particular subject of significant research and development efforts is parallelism, i.e., the performance of multiple tasks in parallel.
A number of computer software and hardware technologies have been developed to facilitate increased parallel processing. From a hardware standpoint, computers increasingly rely on multiple microprocessors to provide increased workload capacity. Furthermore, some microprocessors have been developed that support the ability to execute multiple threads in parallel, effectively providing many of the same performance gains attainable through the use of multiple microprocessors. From a software standpoint, multithreaded operating systems and kernels have been developed, which permit computer programs to concurrently execute in multiple threads so that multiple tasks can essentially be performed at the same time.
While parallelism effectively increases system performance by virtue of the ability to perform multiple tasks at once, one side effect of parallelism is increased system complexity due to the need to synchronize the operation of multiple concurrent processes or threads, particularly with regard to data structures and other system resources that are capable of being accessed by multiple processes or threads. Separate processes or threads that are capable of accessing specific shared data structures are typically not aware of the activities of other threads or processes. As such, a risk exists that one thread might access a specific data structure in an unexpected manner relative to another thread, creating indeterminate results and potential system errors.
As an example, the possibility exists that one thread may retrieve data from a data structure, while another thread may later change the data structure in some manner, resulting in each thread seeing a different state for the data structure. Efforts must therefore be made to ensure that the state of a data structure is consistent when viewed by different threads; otherwise, indeterminate results can occur.
To address these concerns, a serialization mechanism such as a lock (sometimes also referred to as a semaphore, although a semaphore may more generally admit more than one holder at a time) may be used to limit the access to a shared data structure or other shared resource to one process or thread at a time. A lock is essentially a “token” that can be obtained exclusively by a process or thread in a multithreaded environment to access a particular shared resource. Before a process or thread can access a resource, it must first obtain the token from the system. If another process or thread currently possesses the token, the former process or thread is not permitted to access the resource until the token is released by the other process or thread. In this manner, the accesses to the resource are effectively “serialized” to prevent indeterminate operations from occurring.
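As a minimal illustration of the token model described above, the following Python sketch serializes updates to a shared counter using the standard library's threading.Lock. The SharedCounter class and the run_workers helper are hypothetical names introduced only for this example; they do not come from the text.

```python
import threading


class SharedCounter:
    """A shared resource whose updates are serialized by a lock (hypothetical example)."""

    def __init__(self):
        self._lock = threading.Lock()  # the "token" guarding the counter
        self.value = 0

    def increment(self):
        # Obtain the token; any other thread blocks here until it is released.
        with self._lock:
            self.value += 1  # the read-modify-write happens while holding the lock


def run_workers(num_threads=4, increments_per_thread=1000):
    """Run several threads that all update the same counter and return the total."""
    counter = SharedCounter()

    def work():
        for _ in range(increments_per_thread):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

Because every increment runs while the lock is held, run_workers(4, 1000) returns 4000 regardless of how the threads are interleaved.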
While locks enable a programmer to ensure complete serialization of a data structure or other shared resource, it has been found that the operations associated with checking the status of locks, acquiring locks, and waiting on locks can add significant overhead and, as a result, have an adverse impact on system performance. Consequently, significant efforts have been directed toward optimizing the processing of locks to minimize the impact of such locks and maximize system performance.
The process of attempting to acquire a lock may be performed using a number of different methods; however, typically no one lock acquisition method is optimal for all situations, as some methods are more efficient for lightly contended locks, while others are more efficient for more heavily contended locks. Furthermore, these different methods of attempting to acquire a lock may be chained together to progressively handle lock acquisitions, i.e., so that methods that are more efficient for lightly contended locks will be tried before attempting those methods that are more efficient for more heavily contended locks.
As an example, one method that may be used to attempt to acquire a lock is an inline “fast path” lock acquisition, which simply attempts to acquire a lock on an object directly, and which typically succeeds when there is little or no contention on that lock. If successful, the inline call receives a “locked” result that indicates that the lock was acquired. If unsuccessful, however, a call is typically made to an external service function to wait for an existing lock on the object to be released. One method that may be used in an external service function is spinning or looping, which places the thread in a wait loop, stalling the thread and periodically checking the status of the lock to see if the lock has been released. In addition, in some designs, spinning may give way to yielding, whereby after spinning for a designated period of time, a thread yields the remainder of its allocated slice of processor resources for use by another thread that can make productive use of those yielded processor resources.
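The progression from fast path to spinning to yielding can be sketched as follows. This is only an illustration under stated assumptions: the function name acquire_fast_spin_yield and the spin_tries and yield_tries limits are hypothetical, a non-blocking threading.Lock.acquire call stands in for the inline lock check, and time.sleep(0) is used as an approximation of giving up the rest of the time slice.

```python
import threading
import time


def acquire_fast_spin_yield(lock, spin_tries=100, yield_tries=100):
    """Try to acquire `lock` in stages; return the stage that succeeded, or None.

    Hypothetical sketch: the stage limits are arbitrary illustration values.
    """
    # Fast path: a single non-blocking attempt, cheap when the lock is free.
    if lock.acquire(blocking=False):
        return "fast path"

    # Spinning: re-check the lock in a tight loop without giving up the processor.
    for _ in range(spin_tries):
        if lock.acquire(blocking=False):
            return "spin"

    # Yielding: let other threads run between checks (sleep(0) approximates a yield).
    for _ in range(yield_tries):
        time.sleep(0)
        if lock.acquire(blocking=False):
            return "yield"

    # Not acquired; a caller might now fall back to suspending the thread.
    return None
```

On an uncontended lock the call returns "fast path" immediately; if the lock stays held for the whole attempt, it returns None and the caller still does not own the lock.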
Yet another method that may be used in an external service function, e.g., if spinning and yielding do not result in a successful lock acquisition, is to suspend, or enter a long wait phase, whereby the thread informs a task dispatcher to put the thread to sleep until the lock at issue has been released. Typically, the thread being put to sleep informs the task dispatcher that the thread is waiting on a particular lock such that when another thread releases the lock, the task dispatcher will awaken the sleeping thread and thereby enable the lock to finally be acquired.
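The suspend-and-awaken behavior described above can be modeled with a condition variable. In the following sketch, which is an assumption-laden toy rather than how any particular task dispatcher works, the hypothetical SuspendingLock class puts waiters to sleep with threading.Condition.wait and wakes one of them when the holder releases the lock.

```python
import threading


class SuspendingLock:
    """Toy lock whose waiters sleep until the holder releases it (hypothetical)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._held = False

    def acquire(self):
        with self._cond:
            # Sleep instead of spinning; wait() releases the internal
            # condition lock while the thread is suspended.
            while self._held:
                self._cond.wait()
            self._held = True

    def release(self):
        with self._cond:
            self._held = False
            self._cond.notify()  # awaken one sleeping waiter, if any


def demo():
    """Show that a waiter proceeds only after the holder releases the lock."""
    lock = SuspendingLock()
    events = []
    lock.acquire()  # main thread holds the lock

    def worker():
        lock.acquire()  # suspends until the main thread releases
        events.append("worker acquired")
        lock.release()

    t = threading.Thread(target=worker)
    t.start()
    events.append("main releasing")
    lock.release()
    t.join()
    return events
```

Calling demo() returns ["main releasing", "worker acquired"], since the worker cannot record its event until the main thread has released the lock.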
For lightly contended locks, inline or fast path lock acquisition is often the most efficient, since the probability is relatively high that the lock will be acquired when it is first accessed. For more heavily contended locks, however, an inline or fast path lock acquisition is often a wasted effort, as in most instances an external service function will have to be called.
While an inline lock acquisition that does not result in a successful lock acquisition is often relatively inexpensive in terms of the number of processing cycles required to perform the check, the insertion of inline lock acquisition code in a program may result in suboptimal register allocation if an external service function is required to be called. Register allocation is a process performed during compilation to assign variables and other data required by a program to the finite set of registers available in a processor at execution time. As the set of registers is finite, often the data in registers must be replaced with other data as it is needed, with the data that is being replaced either discarded (often referred to as “killed”) or saved for later retrieval (often via the insertion of spill and unspill instructions into the program code). External calls, in particular, typically require the data in several registers to be replaced when the call is made, and then restored once the call returns to the original method.
Typically, whenever an external call is expected, the optimal register allocation for the external call is to save and restore registers before and after the external call. As such, for lock acquisition attempts that require an external call, saving and restoring registers is often an optimal register allocation strategy. On the other hand, where lock acquisition attempts do not require an external call, saving and restoring registers often proves unnecessary and adversely impacts performance, so saving and restoring registers may not be used where inline lock acquisition code is inserted into a program. In those instances where a lock is more heavily contended and an external function call is needed after all, however, the register allocation chosen for the inline lock acquisition code may not be optimal for that call, and may thus lead to lower performance.
In addition, even with more heavily contended locks, the strategy used in an external service function may not prove to be optimal for all circumstances. Spinning typically provides the quickest acquisition once a lock is released by another thread, since the thread that is spinning will typically check the lock on a relatively frequent basis. However, spinning in a loop occupies processing bandwidth that could be put to other uses by other threads. Conversely, with a suspension, the processing bandwidth that would otherwise be used by a thread may be utilized by other threads, but with the drawback that the time required to suspend the thread, and the time required to acquire a lock and resume execution of the thread after the lock has been released by another thread, is often longer, thus slowing the response of the thread. In general, therefore, spinning is often more efficient for moderately contended locks, or locks that are often acquired for relatively short periods of time, while suspending is often more efficient for highly contended locks or locks that are often acquired for relatively long periods of time.
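The rules of thumb above can be written down as a small decision function. This sketch is purely illustrative and is not a method taken from the text: the function name choose_strategy, its inputs (the fraction of acquisition attempts that found the lock held, and the mean hold time in microseconds), and all three threshold values are hypothetical assumptions that a real system would have to measure and tune.

```python
def choose_strategy(contention_rate, mean_hold_us,
                    light_contention=0.05,   # hypothetical threshold
                    heavy_contention=0.50,   # hypothetical threshold
                    long_hold_us=100.0):     # hypothetical threshold
    """Pick a lock acquisition strategy from coarse, assumed statistics.

    contention_rate: fraction (0.0 to 1.0) of attempts that found the lock held.
    mean_hold_us: average time the lock is held, in microseconds.
    """
    # Lightly contended: the first attempt usually succeeds, so try inline.
    if contention_rate < light_contention:
        return "inline"
    # Heavily contended or held for long periods: free the processor and sleep.
    if contention_rate >= heavy_contention or mean_hold_us >= long_hold_us:
        return "suspend"
    # Moderately contended with short hold times: spinning reacquires quickly.
    return "spin"
```

For example, under these assumed thresholds a lock that is contended on 20% of attempts and held for about 5 microseconds maps to "spin", while the same contention with a 500 microsecond hold time maps to "suspend".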
Whether a particular lock acquisition strategy will be optimal for a particular lock is often unknown during development of a program, and furthermore, the optimal lock acquisition strategy may vary in different runtime environments depending upon factors such as the number of processors and hardware threads supported by such runtime environments. Accordingly, a need exists in the art for a manner of improving the selection of an optimal lock acquisition strategy for acquiring a lock.