Field of the Invention
Embodiments of this invention relate to processors. More particularly, one embodiment relates to performing compare and exchange operations using a sleep-wakeup mechanism.
Description of Related Art
Typically, a multithreaded processor or a multi-processor system is capable of processing multiple instruction sequences concurrently. A primary motivating factor driving execution of multiple instruction streams within a single processor is the resulting improvement in processor utilization. Multithreaded processors allow multiple instruction streams to execute concurrently in different execution resources in an attempt to better utilize those resources. Furthermore, multithreaded processors can be used for programs that encounter high-latency delays or that often wait for events to occur.
Typically, computer systems have a single set of resources that is shared by all threads or processors. Inadequate resources may result in significant contention between processors (or threads) because, for example, the processors share bus and memory bandwidth. This contention is particularly evident when one or more processors wait for a semaphore or lock (the data structure often used to allow a single processor exclusive access to other data structures) to become available. Such waiting causes bottlenecking of resources and wastes memory bandwidth, compute bandwidth, microarchitectural resources, and power. The “busy waiting” of processors can also have an adverse effect on the performance of other processors in the system.
FIG. 1 is a block diagram illustrating an exemplary computer system 100 having processors 102-106 accessing a shared memory space 114. The semaphore (lock) 110 is a particular location in memory 108 that is assigned to contain a value associated with obtaining access 112 to the shared space 114. To access the shared space 114, one of the processors 102-106 first accesses the lock 110 and tests the state (value) of the data stored in the lock location 110. In the simplest format, one of two values is assigned to the lock 110. The first value indicates that the shared space 114 is available for access; the second value indicates that the shared space 114 is currently in use and thus is not available for access. For example, bit states 1 and 0 can be used for the locked and unlocked states of the lock 110.
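The lock test described above can be illustrated with a minimal, single-threaded sketch. The names `SharedMemory`, `try_access`, and `release`, and the 0/1 encoding, are assumptions made for illustration only and do not come from FIG. 1. The test and the assignment are deliberately written as two separate steps, which is why a real system needs the test to be combined with the update atomically.

```python
UNLOCKED, LOCKED = 0, 1  # assumed encoding of the two lock values


class SharedMemory:
    """Hypothetical model of memory 108: a lock word plus a shared space."""

    def __init__(self):
        self.lock = UNLOCKED      # stands in for lock 110
        self.shared_space = []    # stands in for shared space 114


def try_access(memory, processor_id):
    """Test the lock value; enter the shared space only if it is available."""
    if memory.lock == LOCKED:
        return False              # second value: shared space is in use
    memory.lock = LOCKED          # not atomic with the test in this sketch
    memory.shared_space.append(processor_id)
    return True


def release(memory):
    """Mark the shared space as available again."""
    memory.lock = UNLOCKED
```

For instance, a call by one processor succeeds, a call by a second processor is refused until `release` is called, and the second processor then succeeds.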
The accessing of the memory 108 by the processors 102-106 for data transfer typically involves the use of load and store operations. The load operation transfers memory content from a location accessed in the memory 108, while the store operation transfers data to a memory location accessed in the memory 108. Thus, load/store operations are used to access the memory 108 and the lock 110 for data transfer between the processors 102-106 and the memory 108. The load and store accesses are also referred to as read and write accesses, respectively. To perform a read, the cache line must be present in the processor's cache in the “shared unmodified,” “exclusive,” or “modified” state according to a protocol, such as the Modified, Exclusive, Shared, Invalid (MESI) protocol. If the cache line is not present in one of these states (e.g., it is invalid), the processor 102-106 retrieves the line from the memory 108 and places it into the “shared unmodified” or “exclusive” state. To perform a write, the processor 102-106 must have the line in its cache in the “exclusive” or “modified” state, or it retrieves the line and places it into its cache in the “exclusive” state. A line in the “shared” state is available for concurrent reading, but only one processor 102-106 at a time can have the line in the “exclusive” state for reading or writing.
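The read and write rules above can be sketched as a small state model of a single cache line. This is a simplified illustration under stated assumptions, not a full MESI implementation: the class name `CacheLineModel` is hypothetical, bus transactions and data values are not modeled, and a read miss is assumed to yield “exclusive” only when no other cache holds the line.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class CacheLineModel:
    """Tracks the state of one cache line in each processor's cache."""

    def __init__(self, num_processors):
        self.states = [State.INVALID] * num_processors

    def read(self, p):
        if self.states[p] is not State.INVALID:
            return  # hit: shared, exclusive, and modified all permit a read
        holders = [q for q, s in enumerate(self.states)
                   if q != p and s is not State.INVALID]
        for q in holders:
            # A modified holder would write back here; all holders downgrade.
            self.states[q] = State.SHARED
        self.states[p] = State.SHARED if holders else State.EXCLUSIVE

    def write(self, p):
        if self.states[p] not in (State.EXCLUSIVE, State.MODIFIED):
            # Obtain exclusive ownership, forcing the line out of other caches.
            for q in range(len(self.states)):
                if q != p:
                    self.states[q] = State.INVALID
            self.states[p] = State.EXCLUSIVE
        self.states[p] = State.MODIFIED  # the write dirties the line
```

In this model, two successive reads by different processors leave both copies shared, and a later write by a third processor invalidates both of them.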
An example of a technique for examining the availability of the semaphore and making it busy is an atomic read-modify-write sequence (e.g., the “test & set” (TS) mechanism). One mechanism for implementing synchronization is the “compare and exchange” instruction, which is relatively efficient but still costly because it requires exclusive ownership of the cache line containing the memory location. This prevents other processors from reading the memory location concurrently.
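The semantics of these atomic operations can be sketched in software. In the sketch below, a `threading.Lock` stands in for the atomicity that hardware would provide, so this models behavior only and says nothing about cache-line ownership. The class `AtomicWord` and the helpers `acquire`, `release`, and `run_demo` are hypothetical names introduced for illustration.

```python
import threading


class AtomicWord:
    """A memory word whose read-modify-write operations are indivisible."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()  # stands in for hardware atomicity

    def store(self, value):
        with self._guard:
            self._value = value

    def test_and_set(self):
        """Atomically set the word to 1 and return its previous value."""
        with self._guard:
            old, self._value = self._value, 1
            return old

    def compare_and_exchange(self, expected, new):
        """Atomically store `new` if the word equals `expected`; return the old value."""
        with self._guard:
            old = self._value
            if old == expected:
                self._value = new
            return old


def acquire(lock):
    # Spin until the exchange observes 0 (free) and swaps in 1 (busy).
    while lock.compare_and_exchange(0, 1) != 0:
        pass


def release(lock):
    lock.store(0)


def run_demo(num_threads=4, increments=50):
    """Increment a shared counter under the spin lock; return the final count."""
    lock, counter = AtomicWord(0), [0]

    def worker():
        for _ in range(increments):
            acquire(lock)
            counter[0] += 1
            release(lock)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Every failed attempt in `acquire` is itself a read-modify-write, which corresponds to the exclusive-ownership cost noted above.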
Another example is the “test & test & set” (TTS) mechanism. The TTS mechanism is relatively more efficient because, while the semaphore is not free, the processor performs the first test on a local cache copy of the variable in the shared state. However, when one processor has acquired the lock and other processors are contending for it (e.g., simultaneously attempting to read the lock to check whether the semaphore is free), the TTS mechanism fails to prevent the blocking or bottlenecking of other processors. The lock-acquiring processor obtains the cache line of the lock in the “exclusive” state, forcing it out of all other caches. When it is done writing the lock, the other processors attempt a read, which causes the acquiring processor to write its modified lock value back to memory and forward the now-shared data to the other processors in a sequence of bus transactions.
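The difference between the first test and the test & set step can be illustrated by counting each kind of access. The following is a minimal sketch under stated assumptions: `CountingLock` and `try_acquire_tts` are hypothetical names, the counters merely label which accesses would be served from a shared copy and which would need exclusive ownership, and the invalidation and write-back traffic described above is not modeled.

```python
import threading


class CountingLock:
    """Lock word that counts plain reads and atomic read-modify-writes."""

    def __init__(self):
        self.value = 0
        self.shared_reads = 0   # accesses a shared cache copy could serve
        self.rmw_ops = 0        # accesses that would need exclusive ownership
        self._guard = threading.Lock()  # stands in for hardware atomicity

    def read(self):
        with self._guard:
            self.shared_reads += 1
            return self.value

    def test_and_set(self):
        with self._guard:
            self.rmw_ops += 1
            old, self.value = self.value, 1
            return old

    def release(self):
        # A store; not counted in this sketch.
        with self._guard:
            self.value = 0


def try_acquire_tts(lock):
    """One TTS attempt: a plain read first, then test & set only if free."""
    if lock.read() == 1:
        return False                   # busy: no read-modify-write issued
    return lock.test_and_set() == 0    # free: try to take the lock
```

In this sketch, repeated attempts against a held lock increase only `shared_reads`, and `rmw_ops` grows only when the lock is observed to be free.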