The present invention relates to locked memory instructions, and more specifically to a system and method for the high performance execution of locked memory instructions in a system with distributed memory and a restrictive memory model.
Most instruction set architectures (ISAs) provide some mechanism to perform an atomic (locked) read-modify-write sequence which guarantees that one process has exclusive access to a memory location when there are other processes that may also be contending for access to that location. Some ISAs, for example, the Intel® Architecture 32-bit ISA (IA-32) from Intel Corporation of Santa Clara, Calif., can place additional restrictions on these locked-memory instructions which give the instructions memory barrier semantics. The use of memory barrier semantics creates a more restrictive memory model. This means that memory instructions younger than the locked-memory instruction cannot become visible before the locked-memory instruction safely completes execution and retires to update the architectural state of the machine. Processor chip manufacturers have generally implemented this effect by delaying execution of the locked-memory instruction until it becomes the oldest, non-speculative instruction in the execution window. This delay, which also affects all instructions younger than the locked-memory instruction, can be costly to system performance. Furthermore, as modern processors continue to extend the size of instruction execution windows, the effect of this delay becomes increasingly costly to system performance. Therefore, it is desirable to replace this outdated locked-memory instruction execution paradigm with one that does not impose this delay.
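By way of illustration only, the atomic read-modify-write sequence described above can be sketched in C++ using the standard std::atomic facility. The helper name run_contended_increments and its parameters are hypothetical and do not appear in the present description. On IA-32 targets, compilers commonly lower a sequentially consistent fetch_add to a LOCK-prefixed instruction, although the exact instruction selected depends on the compiler and is assumed here rather than guaranteed.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: `threads` workers each perform `iters` atomic
// increments on one shared counter and the final value is returned.
long run_contended_increments(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    std::atomic<long> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, iters] {
            for (int i = 0; i < iters; ++i) {
                // Atomic read-modify-write. The seq_cst ordering also gives
                // the operation memory barrier semantics: younger memory
                // operations may not become visible before it.
                counter.fetch_add(1, std::memory_order_seq_cst);
            }
        });
    }
    for (auto& worker : pool) {
        worker.join();
    }
    return counter.load();
}
```

Because each increment is atomic, no update is lost under contention, so the returned value equals the number of threads multiplied by the number of iterations per thread.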
Numerous processors have implemented locked-memory instructions. Some ISAs specify memory models that are so weak that a high performance implementation of the locked-memory instructions falls out as a natural consequence of the weak memory model. IA-32, however, specifies a much more restrictive memory model for locked-memory instructions, and, as a result, it is difficult to implement a high performance solution.
Prior processors have implemented locked-memory instructions in a manner that serializes their execution. For example, in Intel's Pentium® III, the following algorithm for locked-memory instructions was implemented:
1. When the locked-memory instruction is detected, stop issuing instructions.
2. When all instructions older than the locked-memory instruction have completed execution, wait for all outstanding store operations to become globally observed.
3. Execute the locked-memory instruction.
4. Continue program execution.
The Pentium® III microarchitecture refers to this sequence as “at-retirement execution.” While this implementation can easily achieve the correct result for locked-memory instruction execution, the implementation can reduce performance because it can cause many of the resources available in a modern super-scalar, out-of-order execution processor to be under-utilized when a locked-memory instruction is decoded in the program flow.
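The four steps of the at-retirement sequence can be illustrated with a minimal software sketch. The sketch is a toy model under stated assumptions and is not a description of the actual Pentium® III pipeline: instructions are processed strictly in program order, so all older instructions have trivially completed when a locked-memory instruction is reached, and "globally observed" is modeled by a simple count of pending stores. The names Op and at_retirement_trace are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical instruction kinds for the toy model.
enum class Op { Alu, Store, Locked };

// Returns an ordered log of the actions taken for `program`.
std::vector<std::string> at_retirement_trace(const std::vector<Op>& program) {
    std::vector<std::string> log;
    std::size_t pending_stores = 0;  // executed but not yet globally observed
    for (Op op : program) {
        switch (op) {
        case Op::Alu:
            log.push_back("execute alu");
            break;
        case Op::Store:
            ++pending_stores;
            log.push_back("execute store");
            break;
        case Op::Locked:
            // Step 1: stop issuing younger instructions.
            log.push_back("stall issue");
            // Step 2: older instructions are complete (in-order model);
            // wait until every outstanding store is globally observed.
            while (pending_stores > 0) {
                --pending_stores;
                log.push_back("drain store");
            }
            assert(pending_stores == 0);
            // Step 3: execute the locked-memory instruction.
            log.push_back("execute locked");
            // Step 4: continue program execution.
            log.push_back("resume issue");
            break;
        }
    }
    return log;
}
```

In this model, every instruction younger than the locked-memory instruction waits behind the stall and drain entries in the log, which corresponds to the under-utilization of execution resources noted above.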