A typical computer system includes processing circuitry, memory and an input/output (I/O) interface. The processing circuitry includes a set of processors (i.e., one or more processors) which is configured to run code stored in the memory (e.g., an operating system, command scripts, high level applications, other software constructs, etc.). The memory typically includes both random access memory (e.g., volatile semiconductor memory) as well as relatively slower non-volatile memory (e.g., disk drive memory). The I/O interface allows communications into and out of the computer system to enable external access to the computer system (e.g., user access, network communications with external devices, etc.).
Some computer systems enable multiple threads or processes (hereinafter generally referred to as threads) to share access to certain computer resources such as shared memory. These threads are configured to run simultaneously on the processing circuitry and share access to the shared memory (e.g., for inter-process communications). To prevent the threads from accessing the same shared memory data structure (e.g., a single location, a complex data structure such as a linked list involving many locations, etc.) at the same time and thus inadvertently corrupting data within that shared data structure, software developers typically employ one or more synchronization approaches which enable the simultaneously-running threads to coordinate their access to shared memory. Such approaches provide mutual exclusion, where at most a single thread of the multiple threads running in parallel is permitted access to protected code or data at any time.
In one conventional synchronization approach (hereinafter referred to as the atomic instruction approach), the computer platform provides atomic operations or instructions. Examples include compare-and-swap (CAS), load-locked and store-conditional, exchange and fetch-and-add operations. The Intel® IA32 Architecture, which is offered by Intel Corporation of Santa Clara, Calif., provides CAS instructions under the name “cmpxchg”.
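For illustration purposes only, the atomic instruction approach can be sketched as a spin lock built on a single compare-and-swap operation. The sketch below uses the C++ standard library's std::atomic rather than a raw "cmpxchg" instruction; the type name CasLock and the encoding of the lock word (0 for free, 1 for held) are assumptions made for this example and do not come from the description above.

```cpp
#include <atomic>

// Hypothetical lock built on compare-and-swap.
// Assumed encoding of the lock word: 0 = free, 1 = held.
struct CasLock {
    std::atomic<int> word{0};

    // One CAS attempt: atomically replace 0 with 1.
    // Returns true if this caller acquired the lock.
    bool try_lock() {
        int expected = 0;
        return word.compare_exchange_strong(expected, 1);
    }

    // Spin until a CAS attempt succeeds.
    void lock() {
        while (!try_lock()) {
            // busy-wait
        }
    }

    // Release the lock by storing 0 back into the lock word.
    void unlock() {
        word.store(0);
    }
};
```

Because the comparison and the update happen as one indivisible step, two threads that attempt the CAS at the same moment cannot both observe the word as free and both claim it.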
In another conventional synchronization approach (hereinafter referred to as the simple load-store approach), the computer system provides a set of common memory locations, and each thread is configured to set and test the contents of these memory locations to determine whether that thread has access to a critical section. Classic examples of conventional load-store based synchronization mechanisms include those of Dekker, Dijkstra, Lamport and Peterson. For illustration purposes only, a short explanation of a simplified Dekker mechanism will now be provided.
Suppose that there are two threads running on a computer system. Both threads synchronize their execution in order to share access to a critical section of code using commonly accessible memory variables T1 and T2 which are initially zero. When the first thread is ready to access the critical section, the first thread stores a non-zero value into the memory variable T1, and loads the value of the memory variable T2. If the value of the memory variable T2 is non-zero, the first thread is blocked from accessing the critical section due to the second thread having a “lock” on the critical section. Accordingly, the first thread then clears the memory variable T1 and retries. However, if the value of the memory variable T2 is zero, the first thread obtains a lock on the critical section, accesses the critical section, and then sets the memory variable T1 back to zero.
Similarly, when the second thread is ready to access the critical section, the second thread stores a non-zero value into the memory variable T2, and loads the value of the memory variable T1. If the value of the memory variable T1 is non-zero, the second thread is blocked from accessing the critical section due to the first thread having a lock on the critical section. In response, the second thread clears the memory variable T2 and retries. However, if the value of the memory variable T1 is zero, the second thread obtains a lock on the critical section, accesses the critical section, and then clears the memory variable T2.
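The entry and exit steps just described for the two threads can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function names try_enter and leave are hypothetical, and the sketch uses std::atomic with its default sequentially consistent ordering, which is an assumption of the example that keeps the store and the load in program order.

```cpp
#include <atomic>

// The two commonly accessible variables, initially zero.
std::atomic<int> T1{0};
std::atomic<int> T2{0};

// One entry attempt (hypothetical helper).
// `mine` is the caller's own variable, `other` is the peer's variable.
// Returns true if the caller now holds the lock on the critical section.
bool try_enter(std::atomic<int>& mine, std::atomic<int>& other) {
    mine.store(1);             // announce intent with a non-zero value
    if (other.load() != 0) {   // peer holds or is requesting the lock
        mine.store(0);         // clear own variable; caller is expected to retry
        return false;
    }
    return true;               // peer's variable is zero: lock obtained
}

// Leave the critical section by setting the caller's variable back to zero.
void leave(std::atomic<int>& mine) {
    mine.store(0);
}
```

The first thread would call try_enter(T1, T2) and the second thread try_enter(T2, T1), each retrying until the call returns true and calling leave on its own variable afterwards.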
It should be understood that the above-provided explanation is simplified for illustration purposes and is vulnerable to “livelock” where both the first thread and the second thread attempt to enter the critical section simultaneously and then perpetually spin, retrying. In practice, code developers augment the mutual exclusion mechanism with additional logic so that the two threads take turns to ensure progress and avoid livelock.
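One well-known form of such turn-taking logic is Peterson's mechanism, which adds a shared "turn" variable to break ties. The sketch below is offered as an example of the general idea only and is not asserted to be the augmentation any particular developer uses. The names PetersonLock and run_two_threads are hypothetical, and the sketch again assumes std::atomic with default sequentially consistent ordering.

```cpp
#include <atomic>
#include <thread>

// Hypothetical two-thread lock with a tie-breaking turn variable.
struct PetersonLock {
    std::atomic<bool> wants[2];
    std::atomic<int> turn{0};

    PetersonLock() {
        wants[0].store(false);
        wants[1].store(false);
    }

    // `me` must be 0 or 1 and identifies the calling thread.
    void lock(int me) {
        int other = 1 - me;
        wants[me].store(true);   // announce intent
        turn.store(other);       // give the peer priority on a tie
        // Wait only while the peer wants in AND it is the peer's turn.
        while (wants[other].load() && turn.load() == other) {
            std::this_thread::yield();
        }
    }

    void unlock(int me) {
        wants[me].store(false);
    }
};

// Hypothetical demo: two threads increment a plain counter under the lock
// and the final count is returned.
long run_two_threads(int iterations) {
    PetersonLock lock;
    long counter = 0;
    auto worker = [&](int me) {
        for (int i = 0; i < iterations; ++i) {
            lock.lock(me);
            ++counter;           // critical section
            lock.unlock(me);
        }
    };
    std::thread a(worker, 0);
    std::thread b(worker, 1);
    a.join();
    b.join();
    return counter;
}
```

When both threads request entry at once, whichever thread wrote the turn variable last is the one that waits, so the two threads cannot spin against each other indefinitely.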
It should be further understood that certain processor architectures do not guarantee that, when multiple threads are running in parallel, each thread will be able to accurately view operations of the other threads in correct order. Rather, by not making such a guarantee, these processor architectures are able to enjoy certain optimizations (e.g., processor design optimizations, interconnect optimizations, etc.) which offer the potential to improve overall system performance. In particular, in the context of the above-described simplified Dekker mechanism, the store and load operations of each thread may be presented to the other thread out of order. Such reordering typically arises from out-of-order execution or by virtue of a processor's store buffer construct. For example, even though a thread may perform a store operation ahead of a load operation, the processor may place the store operation in a store buffer while making the subsequent load operation immediately visible to other threads on a communications bus thus showing the other threads the load operation before the store operation in an incorrect order. Unfortunately, if the system makes the store and load operations visible in the wrong order, the Dekker mechanism can fail and permit two threads to access the same critical section at one time. This is commonly termed an exclusion failure and is extremely undesirable as the data within the critical section can become inconsistent.
Examples of processor architectures which do not guarantee that threads will be able to accurately view operations of other threads in correct order are the SPARC® Architecture and the Intel® IA32 Architecture. The SPARC® Architecture is offered by SPARC® International, Inc. of San Jose, Calif. The Intel® IA32 Architecture is offered by Intel Corporation of Santa Clara, Calif.
To prevent exclusion failures, software developers utilize memory barrier (MEMBAR) instructions which provide certain guarantees regarding instruction order. In particular, a typical SPARC processor implements a MEMBAR instruction by delaying execution until the processor completely drains its store buffer to memory so that any stores within the store buffer become visible to other processors. At this point, the operation of the processor is considered to be “serialized” because the effects of all previously executed and committed instructions are now visible to other processors. Accordingly, a software developer implementing the Dekker mechanism within a thread places a MEMBAR instruction between the initial store instruction and the subsequent load instruction to serialize the store and load instructions from the perspective of other threads.
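The effect of placing a barrier between a store and a subsequent load can be illustrated with a small test in which two threads each store to one variable and then load the other. The sketch below is a hedged illustration only: it uses std::atomic_thread_fence with sequentially consistent ordering as a portable stand-in for a MEMBAR instruction (an assumption, since the instruction actually emitted depends on the compiler and platform), and the function name count_both_zero is hypothetical.

```cpp
#include <atomic>
#include <thread>

// Runs the two-thread store-then-load test `iterations` times and returns
// how many runs ended with BOTH threads loading 0, i.e. with neither thread
// having observed the other's store.
int count_both_zero(bool use_fence, int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0};
        std::atomic<int> y{0};
        int r1 = -1;
        int r2 = -1;
        std::thread a([&] {
            x.store(1, std::memory_order_relaxed);
            if (use_fence) {
                // Stand-in for MEMBAR between the store and the load.
                std::atomic_thread_fence(std::memory_order_seq_cst);
            }
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread b([&] {
            y.store(1, std::memory_order_relaxed);
            if (use_fence) {
                std::atomic_thread_fence(std::memory_order_seq_cst);
            }
            r2 = x.load(std::memory_order_relaxed);
        });
        a.join();
        b.join();
        if (r1 == 0 && r2 == 0) {
            ++both_zero;
        }
    }
    return both_zero;
}
```

With the fence in place, the both-zero outcome is ruled out, so count_both_zero(true, n) returns 0. Without the fence, the both-zero outcome is permitted but is not guaranteed to appear in any particular run.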
It should be understood that the effect of executing a MEMBAR instruction is restricted to the executing processor. That is, executing a MEMBAR instruction does not cause actions on any remote processors. Additionally, for some processors, if the processor supports out-of-order or speculative execution, speculation typically is not allowed to proceed past the MEMBAR instruction. That is, when such a processor encounters a MEMBAR instruction, the processor typically cancels the effects of any subsequent instructions that have started to execute speculatively in order to ensure serialization up to the MEMBAR instruction.
With reference back to the earlier-provided simplified Dekker mechanism, software developers place MEMBAR instructions in the code of the first and second threads to avoid exclusion failures. In particular, the developers position MEMBAR instructions between the store and load instructions in each thread, thus forcing the processors to make the executed store and load operations visible in the correct order for proper Dekker mechanism operation. That is, when the first thread is ready to access the critical section, the first thread stores a non-zero value into the memory variable T1, performs a MEMBAR operation and loads the value of the memory variable T2. The MEMBAR operation in the first thread ensures that the executed store operation is visible to the second thread prior to the executed load operation. Similarly, when the second thread is ready to access the critical section, the second thread stores a non-zero value into the memory variable T2, performs a MEMBAR operation and loads the value of the memory variable T1. Again, the MEMBAR operation in the second thread ensures that the executed store operation is visible to the first thread prior to the executed load operation (i.e., in the correct order).
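The store, barrier, load sequence described above can be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: std::atomic_thread_fence with sequentially consistent ordering stands in for the MEMBAR instruction, the function names try_enter_fenced, leave_fenced and run_fenced_demo are hypothetical, and the acquire and release orderings on the load and on the clearing stores are additions needed by the C++ memory model so that data written in one critical section is visible in the next.

```cpp
#include <atomic>
#include <thread>

// One entry attempt with a barrier between the store and the load.
// `mine` is the caller's variable (T1 or T2), `other` is the peer's.
bool try_enter_fenced(std::atomic<int>& mine, std::atomic<int>& other) {
    mine.store(1, std::memory_order_relaxed);              // store non-zero
    std::atomic_thread_fence(std::memory_order_seq_cst);   // MEMBAR stand-in
    if (other.load(std::memory_order_acquire) != 0) {      // load peer's variable
        mine.store(0, std::memory_order_release);          // blocked: clear, retry
        return false;
    }
    return true;
}

void leave_fenced(std::atomic<int>& mine) {
    mine.store(0, std::memory_order_release);
}

// Hypothetical demo: two threads increment a plain counter inside the
// critical section; the final count is returned.
long run_fenced_demo(int iterations) {
    std::atomic<int> T1{0};
    std::atomic<int> T2{0};
    long counter = 0;
    auto worker = [&](std::atomic<int>& mine, std::atomic<int>& other) {
        for (int i = 0; i < iterations; ++i) {
            while (!try_enter_fenced(mine, other)) {
                std::this_thread::yield();   // back off before retrying
            }
            ++counter;                       // critical section
            leave_fenced(mine);
        }
    };
    std::thread a([&] { worker(T1, T2); });
    std::thread b([&] { worker(T2, T1); });
    a.join();
    b.join();
    return counter;
}
```

Since the barrier prevents each thread's load from becoming visible ahead of its own store, at least one of two contending threads observes the other's non-zero value and backs off. The sketch retains the simplified mechanism's retry behavior and therefore relies on the yield call, rather than on turn-taking, to make progress under contention.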