Many contemporary data processing systems have adopted multiprocessing architectures in an attempt to improve performance. Various techniques have been used, from simply installing multiple central processing units (“CPUs”) to building individual CPUs with multiple execution cores that still share some support circuitry. Combinations of these techniques have also been tried.
Unfortunately, system performance rarely scales linearly with the number of CPUs or execution cores available. Part of the shortfall is due to the inevitable overhead of synchronizing and coordinating the operations of multiple processors that share resources such as main memory and hardware peripherals. Coordination overhead may be relatively unaffected by the number of processors managed (i.e., it may represent a fixed cost). However, as the number of processors grows, contention among multiple threads of execution (either actual threads, which share a memory space, or processes, which do not) for shared, single-access resources can consume an increasing share of processing power.
Inter-thread (or inter-process) synchronization in multithreaded programs is usually based on low-level functions supported by the hardware and called, aptly enough, synchronization primitives. One of the simplest primitives is the test-and-set (“TAS”) instruction. A TAS instruction can be wrapped in a loop to form a simple test-and-set lock, which protects a shared, single-access resource against simultaneous attempts by different threads to change the state of the resource. However, simple test-and-set locks may not provide adequately sophisticated semantics for complex programs. For example, if many threads use such a lock to protect a highly contended resource, scheduling and timing vagaries may allow predominantly (or only) one of the threads to use the resource for extended periods of time. Moreover, the memory traffic generated by simple test-and-set locks can reduce the execution speed of the entire program. Synchronization facilities built on lower-level synchronization primitives, or on more complex intrinsic atomic operations that a processor can perform, can provide the sophisticated semantics a program may need, and efficient implementations of those facilities may permit a multithreaded program that uses them to run faster.
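By way of illustration only, a simple test-and-set lock of the kind described above might be sketched in C as follows. The sketch assumes a C11 compiler, where the standard atomic_flag_test_and_set_explicit function stands in for the hardware TAS instruction, and POSIX threads for the demonstration. The names tas_lock_t, tas_lock_acquire, tas_lock_release and run_counter_demo are hypothetical and do not come from any particular system.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical lock type: one flag, clear while the lock is free. */
typedef struct {
    atomic_flag held;
} tas_lock_t;

#define TAS_LOCK_INIT { ATOMIC_FLAG_INIT }

void tas_lock_acquire(tas_lock_t *lock)
{
    /* Test-and-set atomically sets the flag and returns its previous
     * value. A previous value of true means another thread already
     * holds the lock, so the loop simply tries again. */
    while (atomic_flag_test_and_set_explicit(&lock->held,
                                             memory_order_acquire)) {
        /* busy-wait */
    }
}

void tas_lock_release(tas_lock_t *lock)
{
    /* Clearing the flag lets the next test-and-set succeed. */
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}

/* Demonstration: several threads increment one shared counter. */
struct worker_args {
    tas_lock_t *lock;
    long *counter;
    long iterations;
};

static void *worker(void *p)
{
    struct worker_args *args = p;
    for (long i = 0; i < args->iterations; i++) {
        tas_lock_acquire(args->lock);
        ++*args->counter;           /* the protected, single-access resource */
        tas_lock_release(args->lock);
    }
    return NULL;
}

/* Runs nthreads workers (at most 16) and returns the final counter
 * value, or -1 if a thread could not be created. */
long run_counter_demo(int nthreads, long iterations)
{
    enum { MAX_THREADS = 16 };
    pthread_t threads[MAX_THREADS];
    tas_lock_t lock = TAS_LOCK_INIT;
    long counter = 0;
    struct worker_args args = { &lock, &counter, iterations };
    int started = 0;

    assert(nthreads >= 0 && nthreads <= MAX_THREADS);

    while (started < nthreads &&
           pthread_create(&threads[started], NULL, worker, &args) == 0) {
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
    }
    return started == nthreads ? counter : -1;
}
```

With the lock in place, a call such as run_counter_demo(4, 10000) would be expected to return 40000, since no increment is lost. The sketch also exhibits both drawbacks noted above: nothing in the loop decides which waiting thread acquires the lock next, and every failed attempt performs an atomic write to the shared flag, which is a source of the memory traffic mentioned.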