A modern computer system has at least a microprocessor and some form of memory. Generally, the processor retrieves data stored in the memory, processes/uses the retrieved data to obtain a result, and stores the result in the memory.
One type of computer system uses a single processor to perform the operations of the computer system. In such a single processor (or “uniprocessor”) computer system, incoming requests to memory occur serially. However, as described below with reference to FIG. 1, in a computer system that uses multiple processors, at least partly in order to increase data throughput through parallel processing (i.e., simultaneous processing by two or more processors), memory shared by the multiple processors may receive multiple memory requests that overlap in both time and space.
FIG. 1 shows a typical multiprocessor system 100. In FIG. 1, multiple processors 102a, 102b share a memory 106, where the memory 106 includes memory locations 106a-106n. An important consideration in multiprocessor system design involves the potential for two or more processors to attempt to access and/or store data in the same memory location at the same time. Thus, designers have implemented, using software and/or hardware, various “synchronization” techniques to address the issue of threads (i.e., sequences of instructions being processed by a processor) concurrently attempting to access the same memory location.
Synchronization can be implemented by a processor “blocking” other processors from accessing or storing data to a particular memory location, i.e., a processor maintains exclusive, uninterruptible ownership of a particular memory location. However, maintaining exclusive ownership of a memory location results in a high number of failures and deadlocks, particularly for large-scale multiprocessor systems (e.g., systems having thousands of processors running in parallel). Because of the increased delays and fluctuations in communication time, and the effects of fast context switching, that are typical of large-scale multiprocessor systems, such systems tend to require higher levels of robustness and tolerance than those provided by blocking synchronization techniques.
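For illustration only, blocking synchronization can be sketched in software with a lock that stands in for exclusive ownership of one memory location. The names `memory`, `location_lock`, `add_to_location`, and `run_blocking_demo` are hypothetical and do not appear in the figures; threads play the role of the processors 102a, 102b.

```python
import threading

# Hypothetical shared memory with a single location (index 0).
memory = [0]
# One lock models exclusive ownership of that memory location.
location_lock = threading.Lock()


def add_to_location(increments):
    """Each thread plays the role of one processor updating location 0."""
    for _ in range(increments):
        # Acquire exclusive ownership; every other thread blocks here
        # until the current owner releases the location.
        with location_lock:
            value = memory[0]      # read
            memory[0] = value + 1  # modify and write back


def run_blocking_demo(num_threads=4, increments=1000):
    """Run several 'processors' against the same location and return its value."""
    memory[0] = 0
    threads = [
        threading.Thread(target=add_to_location, args=(increments,))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return memory[0]
```

In this sketch no update is lost, but if the thread holding `location_lock` is delayed or switched out, every other thread waiting on the same location is stalled for the same duration, which is the kind of sensitivity described above.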
At least partly in order to address the drawbacks of blocking synchronization techniques, “non-blocking” synchronization techniques have emerged that allow multiple processors to access concurrent objects in a non-mutually exclusive manner to meet the increased performance requirements of large-scale multiprocessor systems. The concept of non-blocking may be implemented through hardware and software components in a variety of ways. For example, in the multiprocessor system shown in FIG. 2, a combination of instruction primitives and registers is used to achieve non-blocking synchronization.
In FIG. 2, a processor 102a sends a Load-Linked request to a controller 104a to load a value from a memory location (e.g., 106a). The controller 104a, in turn, sets a bit associated with the memory location and issues a response to the request indicating that the value of the memory location has been successfully loaded. Once the value has been successfully loaded, the processor 102a executes one or more instructions to manipulate the loaded value. The processor 102a then issues a Store-Conditional request that attempts to store the manipulated value back to the memory location (e.g., 106a). However, the value is only stored to that memory location if the associated bit in the controller 104a has not been unset (i.e., if no other processor has written to the memory location since the Load-Linked request). If the Store-Conditional request succeeds, this indicates that all three steps (load, manipulate, and store) occurred atomically (i.e., as a single, uninterrupted sequence). On the other hand, if the Store-Conditional request fails, the data is not stored in that memory location and the sequence must be retried, beginning with the Load-Linked request.
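The Load-Linked/Store-Conditional sequence described above can be sketched as a small single-threaded software model. This is a simplified illustration under stated assumptions, not the controller 104a itself: the class `Controller`, its methods, and the helper `atomic_increment` are hypothetical names, and the per-location bit is modeled as a set of processor ids whose links are still valid.

```python
class Controller:
    """Toy model of a controller supporting Load-Linked/Store-Conditional."""

    def __init__(self, size):
        self.memory = [0] * size
        # Assumption: for each location, track which processors hold a valid
        # link. This stands in for the bit associated with the location.
        self.links = [set() for _ in range(size)]

    def load_linked(self, pid, addr):
        """Record a link for processor `pid` and return the stored value."""
        self.links[addr].add(pid)
        return self.memory[addr]

    def store_conditional(self, pid, addr, value):
        """Store only if no write has hit `addr` since pid's Load-Linked."""
        if pid not in self.links[addr]:
            return False  # link was invalidated; nothing is stored
        self.memory[addr] = value
        # A successful store invalidates every outstanding link on addr.
        self.links[addr].clear()
        return True


def atomic_increment(ctrl, pid, addr):
    """Retry the load/modify/store sequence until the store succeeds."""
    while True:
        value = ctrl.load_linked(pid, addr)
        if ctrl.store_conditional(pid, addr, value + 1):
            return value + 1
```

As a usage example under the same assumptions: if processors 0 and 1 both call `load_linked` on the same address and processor 1 stores first, the later `store_conditional` from processor 0 returns `False`, and processor 0 must start again from `load_linked`.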
The implementation of the Load-Linked/Store-Conditional primitives in non-blocking synchronization has two distinct features. First, all Load-Linked requests are required to succeed. Second, all Load-Linked requests require some sort of recording (or tracking).
Recording Load-Linked requests may require that a controller notify all processors that initiated Load-Linked requests whenever a Store-Conditional request invalidates them, essentially mimicking a cache coherence protocol. Alternatively, a record may be maintained in each controller for every initiated Load-Linked request. In this case, the Load-Linked request is only removed from the record of the controller once a successful Store-Conditional request occurs. Because the completion of a Store-Conditional request cannot be forecasted, the latter option requires support for lists of unbounded size, which complicates controller design and creates performance bottlenecks whenever a Load-Linked request is initiated.
Another type of non-blocking synchronization technique involves the use of Compare&Swap primitives. A Compare&Swap operation typically accepts three values, or quantities: a memory address A, a comparison value C, and a new value N. The operation fetches and examines the contents V of memory at address A. If those contents V are equal to C, then N is stored into the memory location at address A, replacing V. A boolean return value indicates whether the replacement occurred. Whether or not V matches C, V is returned or saved in a register for later inspection (possibly replacing either C or N depending on the implementation).
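The semantics of a Compare&Swap operation can be sketched as follows. The function name `compare_and_swap` and the list-based `memory` are hypothetical, and this sketch assumes the variant that returns V alongside the boolean result. In hardware the whole operation is a single atomic step; the function below only models its outcome in a single-threaded setting.

```python
def compare_and_swap(memory, address, compare, new):
    """Model of Compare&Swap on address A with comparison value C and new value N.

    Returns (swapped, current), where `current` is the contents V found at
    the address before any replacement.
    """
    current = memory[address]       # fetch V from address A
    swapped = (current == compare)  # examine V against C
    if swapped:
        memory[address] = new       # store N, replacing V
    return swapped, current
```

For example, with `memory = [5]`, calling `compare_and_swap(memory, 0, 5, 9)` stores 9 and reports a successful replacement, while a second call with the same comparison value 5 leaves the location unchanged and reports that the value found was 9.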
The Load-Linked/Store-Conditional and Compare&Swap operations described above are recognized as types of Read-Modify-Write operations, which are generally operations that read a value of a memory location (e.g., a single word having a size that is system specific), modify the value, and write the modified value back to the memory location. Typical Read-Modify-Write operations do not hold ownership and must optimistically check to make sure they were not interrupted, thereby possibly introducing implementation and user-level problems that require costly solutions and/or weakened semantics. Further, these non-blocking synchronization implementations put the burden of coordination on the threads and are typically incompatible with fast context-switching, which is an important technology often used in hiding memory access latencies in large-scale multiprocessor systems.
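The optimistic check-and-retry pattern of such Read-Modify-Write operations can be illustrated with a minimal sketch. The class `Word` and the functions `fetch_and_add` and `run_demo` are hypothetical names. The internal lock exists only to model the atomicity that hardware would give a single Compare&Swap; the calling threads never hold ownership of the word across the read, modify, and write steps.

```python
import threading


class Word:
    """A single memory word offering a Compare&Swap-style primitive."""

    def __init__(self, value=0):
        self._value = value
        # Assumption: this guard only models the atomicity of one CAS step.
        self._guard = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, compare, new):
        with self._guard:
            if self._value != compare:
                return False
            self._value = new
            return True


def fetch_and_add(word, delta):
    """Read, modify, then write back only if the word was not changed."""
    retries = 0
    while True:
        old = word.load()                             # read
        if word.compare_and_swap(old, old + delta):   # modify and write
            return old, retries
        retries += 1                                  # interrupted: retry


def run_demo(num_threads=4, increments=500):
    """Have several threads update one word and return its final value."""
    word = Word(0)

    def worker():
        for _ in range(increments):
            fetch_and_add(word, 1)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return word.load()
```

The retry loop lives in `fetch_and_add`, that is, in the thread's own code, which reflects the point above that the burden of coordination falls on the threads rather than on the memory system.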