Modern microprocessors may support out-of-order execution in their architectures. Individual instructions may each be decoded into a set of corresponding micro-operations, which may then be stored in a re-order buffer prior to execution. A scheduler may determine which micro-operations are ready to execute, and may issue them other than in strict program order, or “out-of-order”. When the micro-operations are ready for retirement, they may be retired in program order and will hence have the appearance of having been executed in program order.
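The relationship between out-of-order issue and in-order retirement can be illustrated with a minimal sketch. The model below is a toy, not a description of any particular processor: the `MicroOp` class, the `simulate` function, the micro-operation names, and the latencies are all hypothetical, and the scheduler is assumed to have unlimited issue width.

```python
from dataclasses import dataclass

@dataclass
class MicroOp:
    name: str
    latency: int        # cycles from issue until the result is available (assumed)
    deps: tuple = ()    # names of micro-ops whose results this one needs
    issued_at: int = -1
    done_at: int = -1

def simulate(program):
    """Issue any micro-op whose inputs are complete; retire only from the buffer head."""
    rob = list(program)                 # toy re-order buffer, held in program order
    by_name = {u.name: u for u in rob}
    issue_order, retire_order = [], []
    head, cycle = 0, 0
    while head < len(rob):
        # Retirement: only the oldest unretired micro-op may leave, once it is complete.
        while head < len(rob) and 0 <= rob[head].done_at <= cycle:
            retire_order.append(rob[head].name)
            head += 1
        # Issue: any waiting micro-op whose dependencies have all completed.
        for u in rob[head:]:
            ready = all(0 <= by_name[d].done_at <= cycle for d in u.deps)
            if u.issued_at < 0 and ready:
                u.issued_at = cycle
                u.done_at = cycle + u.latency
                issue_order.append(u.name)
        cycle += 1
    return issue_order, retire_order

# Hypothetical four-micro-op program: add_b must wait for the slow load_a,
# while mul_c and use_c are independent of it.
program = [
    MicroOp("load_a", latency=3),
    MicroOp("add_b", latency=1, deps=("load_a",)),
    MicroOp("mul_c", latency=1),
    MicroOp("use_c", latency=1, deps=("mul_c",)),
]
issue_order, retire_order = simulate(program)
print("issued: ", issue_order)
print("retired:", retire_order)
```

In this sketch, `mul_c` and `use_c` issue ahead of `add_b` even though they come later in program order, yet the retirement order matches the program order.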
One family of instructions that has posed a problem in previous out-of-order processors is the lock instruction family. The lock instructions generally assert a signal or employ some procedure that performs an atomic memory transaction; that is, they lock a particular location in memory to prevent other processors, or other threads on the same processor, from accessing the memory location (or equivalent cache line) used during the constituent load and store micro-operations. In differing embodiments, the signal may include a bus signal or a cache-coherency protocol lock. Specific implementations of the lock instructions have required that all previous instructions (in program order) retire before the lock instructions start to execute. The load and store micro-operations of the lock instruction are generally delayed so that they may execute and retire as close together as possible, limiting the time the processor must protect the memory address or cache line used by the lock instruction. However, this prevents the load micro-operation and any other intervening micro-operations from speculatively executing, and therefore adds their latency to the critical path of the program. Specific implementations may also prevent subsequent load operations, or other subsequent operations, from speculatively executing, thus increasing the latency of the subsequent operations. In practice, this may mean that any re-order buffer used to support out-of-order processing may fill and stall the pipeline, causing application performance to degrade further.
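The latency cost described above can be made concrete with a small back-of-the-envelope model. This is a sketch under stated assumptions, not a measurement of any real processor: the function `lock_completion_cycle`, its parameters, and the cycle counts are hypothetical. It assumes every prior micro-operation is independent and issues at cycle 0, and that the store micro-operation of the lock instruction cannot commit until all prior micro-operations have retired.

```python
def lock_completion_cycle(prior_latencies, load_latency, store_latency, serialize):
    """Cycle at which the lock instruction's store micro-op completes (toy model).

    serialize=True  : the lock's load waits until every prior micro-op has retired.
    serialize=False : hypothetical baseline in which the load may issue at cycle 0.
    """
    # All prior micro-ops are assumed to issue at cycle 0, so the last one
    # retires when the slowest of them finishes.
    prior_retired = max(prior_latencies, default=0)
    load_issue = prior_retired if serialize else 0
    load_done = load_issue + load_latency
    # The store needs the loaded value and, under in-order retirement,
    # cannot commit before the prior micro-ops have retired.
    store_issue = max(load_done, prior_retired)
    return store_issue + store_latency

# Illustrative (made-up) latencies in cycles.
prior = [1, 4, 10]
serialized = lock_completion_cycle(prior, load_latency=4, store_latency=1, serialize=True)
overlapped = lock_completion_cycle(prior, load_latency=4, store_latency=1, serialize=False)
print("serialized:", serialized)
print("overlapped:", overlapped)
print("extra cycles on the critical path:", serialized - overlapped)
```

Under these assumptions, the serialized case finishes later by exactly the load latency, which is the latency the text describes as being added to the critical path. The model ignores the additional cost of stalling subsequent operations and of a full re-order buffer.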