The present invention relates in general to data processing and more specifically to store-conditional operations within a data processing system. Still more particularly, the present invention relates to accelerating a store-conditional operation by bypassing the store-conditional operation around a store queue.
In a multiprocessor (MP) computer system, processors often need to update certain shared memory locations of the MP system in a synchronized fashion. Traditionally, this synchronization has been achieved by a thread of a processor core updating a shared memory location utilizing an atomic “read-modify-write” operation that reads, modifies, and then writes the specific memory location in an atomic fashion. Examples of such operations are the well-known “compare-and-swap” and “test-and-set” operations.
In some conventional processors, a read-modify-write operation is implemented using a pair of instructions rather than a single instruction, where such instructions are referred to herein as load-and-reserve (LARX) and store-conditional (STCX) instructions. LARX and STCX instructions, while not atomic primitives in themselves, implement an atomic read-modify-write of memory by monitoring for any possible updates to the shared memory location in question between performance of the LARX and STCX operations. In effect, the STCX operation only succeeds when the execution of LARX and STCX instructions produces an atomic read-modify-write update of memory.
The processing of a LARX/STCX instruction pair begins with a thread of execution executing a LARX instruction. A LARX instruction is a special load instruction that returns load data for the target memory address and further instructs the memory coherence mechanism in the MP system to establish a reservation for a “reservation granule” (e.g., cache line) containing the target memory address. Once the reservation is established, the memory coherence mechanism monitors for write operations that target the reservation granule.
Once the load data is returned by the LARX instruction, the thread of execution typically, but not always, modifies the returned load data within the registers of the processor core utilizing some sequence of arithmetic, test, and branch instructions corresponding to the particular type of atomic update desired (e.g., fetch-and-increment, fetch-and-decrement, compare-and-swap, etc.).
Next, the thread of execution typically issues a STCX instruction to attempt to store the modified value back to the target memory address. The STCX instruction will succeed (and update the target memory address) only if the memory coherence mechanism has not detected any write operations to the reservation granule between the LARX operation and the STCX operation. A pass/fail indication is returned to the processor core indicating whether or not the update indicated by the STCX instruction was successful.
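The reservation behavior described above can be illustrated with a small software model. The following is a minimal sketch only, not the hardware mechanism of any particular processor. The class name ToyCoherentMemory, its method names, the 128-byte granule size, and the rule that only writes by other threads cancel a reservation are all assumptions made for illustration.

```python
GRANULE_BYTES = 128  # assumed reservation granule size (e.g., one cache line)


class ToyCoherentMemory:
    """Toy model of memory plus per-thread reservations (illustrative only)."""

    def __init__(self):
        self.data = {}          # address -> value
        self.reservations = {}  # thread id -> index of the reserved granule

    @staticmethod
    def granule(addr):
        """Map a byte address to the index of its reservation granule."""
        return addr // GRANULE_BYTES

    def _write(self, writer, addr, value):
        self.data[addr] = value
        g = self.granule(addr)
        # Assumption: a write cancels every OTHER thread's reservation
        # on the granule that was written.
        stale = [t for t, rg in self.reservations.items()
                 if rg == g and t != writer]
        for tid in stale:
            del self.reservations[tid]

    def larx(self, tid, addr):
        """Return the load data and establish a reservation for the granule."""
        self.reservations[tid] = self.granule(addr)
        return self.data.get(addr, 0)

    def store(self, tid, addr, value):
        """Ordinary store: always performed."""
        self._write(tid, addr, value)

    def stcx(self, tid, addr, value):
        """Store only if the reservation survived; return pass (True) or fail (False)."""
        # The thread's reservation is consumed whether or not the store passes.
        held = self.reservations.pop(tid, None) == self.granule(addr)
        if held:
            self._write(tid, addr, value)
        return held
```

In this model, a STCX issued immediately after a LARX passes, while a STCX issued after another thread has written anywhere in the same granule fails and leaves memory unchanged.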
The thread of execution is usually stalled at the STCX instruction until the pass/fail indication for the STCX instruction is returned. Even in those cores that can execute instructions beyond a STCX that is waiting for its pass/fail indication, it is usually not possible to execute another LARX and STCX sequence because the coherence mechanism usually cannot easily track more than one reservation address per thread of execution at a time. Finally, the thread of execution typically examines the pass/fail indication of the STCX instruction and loops back to execute the LARX instruction if the pass/fail indication indicates the memory update requested by the STCX instruction failed.
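The retry loop described above can be sketched in software as follows. This is an illustrative sketch under stated assumptions: atomic_update and FlakyCell are hypothetical names, and FlakyCell merely forces a fixed number of STCX failures to stand in for interfering writes by other threads.

```python
def atomic_update(larx, stcx, modify, max_attempts=1000):
    """Repeat LARX / modify / STCX until the STCX passes.

    Returns (value observed by the successful LARX, number of attempts).
    """
    for attempt in range(1, max_attempts + 1):
        old = larx()               # load and establish the reservation
        if stcx(modify(old)):      # conditional store; True means "pass"
            return old, attempt
        # On "fail", loop back and execute the LARX again.
    raise RuntimeError("STCX did not succeed within max_attempts")


class FlakyCell:
    """Hypothetical single memory word whose first `fail_count` STCX attempts fail."""

    def __init__(self, value, fail_count):
        self.value = value
        self.fail_count = fail_count
        self.reserved = False

    def larx(self):
        self.reserved = True
        return self.value

    def stcx(self, new_value):
        if self.fail_count > 0:
            # Simulate an interfering write that cancels the reservation.
            self.fail_count -= 1
            self.reserved = False
        passed = self.reserved
        self.reserved = False      # the reservation is consumed either way
        if passed:
            self.value = new_value
        return passed


# Fetch-and-increment that is forced to retry twice before passing.
cell = FlakyCell(value=41, fail_count=2)
old, attempts = atomic_update(cell.larx, cell.stcx, lambda v: v + 1)
```

The modify argument selects the type of atomic update; passing a function that adds one gives fetch-and-increment, and other updates follow the same loop structure.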
In a typical implementation, a store queue is disposed between a processor core and the level of supporting cache memory at which coherence determinations are made (e.g., a store-in level two (L2) cache). The store queue includes a number of entries that are used to buffer regular store requests generated by the various threads of the processor core through execution of store instructions, as well as STCX requests generated by the processor core through execution of STCX instructions. The present disclosure recognizes that, in general, the probability that any given STCX request will fail increases the longer the STCX request remains in the store queue. Further, the present disclosure recognizes that, in general, the duration of pendency of a STCX request increases as the number of threads of execution supported by a common store queue (and hence the potential number of store and STCX operations in-flight) increases.
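The relationship between time spent in the store queue and the likelihood of failure can be illustrated with a simple probabilistic sketch. The model below is not taken from the present disclosure. It assumes, purely for illustration, that the queue drains in first-in, first-out order at a fixed rate and that a conflicting write to the reservation granule arrives independently in each cycle with a fixed probability. The function name and its parameters are hypothetical.

```python
def stcx_fail_probability(entries_ahead, p_conflict_per_cycle, cycles_per_entry=1):
    """Probability that at least one conflicting write arrives while a STCX
    request waits behind `entries_ahead` older store queue entries.

    Assumptions (illustrative only): the queue drains in FIFO order, each
    entry takes `cycles_per_entry` cycles, and conflicts are independent
    from cycle to cycle.
    """
    wait_cycles = entries_ahead * cycles_per_entry
    return 1.0 - (1.0 - p_conflict_per_cycle) ** wait_cycles


# More threads sharing one store queue generally means more entries ahead of
# a given STCX request, and therefore a longer wait.
probabilities = [stcx_fail_probability(n, 0.01) for n in (0, 4, 16, 64)]
```

Under these assumptions the failure probability is zero for a STCX request that does not wait at all and rises monotonically as the number of entries ahead of it grows.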