This disclosure relates generally to processor performance improvement and, more specifically, to processor performance improvement for instruction sequences that include barrier instructions.
A conventional symmetric multiprocessor (SMP) computer system, such as a server computer system, includes multiple processing units all coupled to a system interconnect, which typically comprises one or more address, data, and control buses. Coupled to the system interconnect is a system memory, which represents the lowest level of volatile memory in the multiprocessor computer system and which generally is accessible for read and write access by all processing units. In order to reduce access latency to instructions and data residing in the system memory, each processing unit is typically further supported by a respective multi-level cache hierarchy, the lower level(s) of which may be shared by one or more processor cores.
Various processor designs have been proposed and/or implemented to improve data processing system performance. For example, in processor designs that implement weak memory models, instructions may be re-ordered for execution as long as the operations are not restricted from being executed out-of-order. One technique for restricting execution of certain instructions has employed barrier (or synchronization) instructions to prevent execution of subsequent load or store instructions (i.e., load or store instructions following a barrier instruction) until a prior load or store instruction or instructions (i.e., load or store instructions before the barrier instruction) are resolved.
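As a purely illustrative sketch of the restriction described above, the following Python model enumerates the execution orders that remain possible for a short instruction sequence with and without a barrier. The function name `allowed_orders` and the string encoding of instructions are hypothetical, and the model assumes that all operations target distinct addresses and carry no data dependencies, so that only barriers constrain ordering.

```python
from itertools import permutations


def allowed_orders(program):
    """Return every execution order of the memory operations in `program`
    that does not move an operation across a SYNC barrier.

    Assumption: operations are independent (distinct addresses, no data
    dependencies), so barriers are the only ordering constraint modeled.
    """
    # Tag each memory operation with the index of its barrier-delimited group.
    ops = []
    group = 0
    for instr in program:
        if instr == "SYNC":
            group += 1
        else:
            ops.append((instr, group))

    orders = []
    for perm in permutations(ops):
        groups = [g for _, g in perm]
        # An order is allowed only if no operation from a later group
        # executes ahead of an operation from an earlier group.
        if groups == sorted(groups):
            orders.append([name for name, _ in perm])
    return orders


without_barrier = allowed_orders(["ST A", "LD B"])
with_barrier = allowed_orders(["ST A", "SYNC", "LD B"])
```

In this model, the sequence without a barrier admits both orders of ST A and LD B, whereas the sequence containing SYNC admits only the program order.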
With reference to FIG. 8, four different cases are possible with respect to employing a barrier instruction (SYNC) to enforce ordering between store (ST) and load (LD) instructions. In a first store-to-store case, a store instruction to address ‘A’ (ST A) before a barrier instruction (SYNC) is followed by a store instruction to address ‘B’ (ST B) after the barrier instruction (SYNC). In a second store-to-load case, a store instruction to address ‘A’ (ST A) before a barrier instruction (SYNC) is followed by a load instruction to address ‘B’ (LD B) after the barrier instruction (SYNC). In a third load-to-load case, a load instruction to address ‘A’ (LD A) before a barrier instruction (SYNC) is followed by a load instruction to address ‘B’ (LD B) after the barrier instruction (SYNC). In a fourth load-to-store case, a load instruction to address ‘A’ (LD A) before a barrier instruction (SYNC) is followed by a store instruction to address ‘B’ (ST B) after the barrier instruction (SYNC).
In the first case, the barrier instruction (SYNC) has maintained order between the store instructions (ST A and ST B) as the instruction sequence has flowed through a processor pipeline. In the second case, execution of the load instruction (LD B) following the barrier instruction (SYNC) has been delayed until the store instruction (ST A) before the barrier instruction (SYNC) was complete (or the load instruction (LD B) following the barrier instruction (SYNC) has been issued early and tracked for invalidation until an acknowledgement (ACK) for the barrier instruction (SYNC) has been received, from a memory subsystem, at a processor core). In the third case, the load instruction (LD B) following the barrier instruction (SYNC) has waited until data for the load instruction (LD A) before the barrier instruction (SYNC) has been received (or the load instruction (LD B) following the barrier instruction (SYNC) has been launched early and tracked for invalidation until the data for the load instruction (LD A) before the barrier instruction (SYNC) has been received at the processor core). In the fourth case, execution of the store instruction (ST B) following the barrier instruction (SYNC) has been delayed until data for the load instruction (LD A) before the barrier instruction (SYNC) has been received at the processor core.
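The four cases and their conventional handling, as described above, can be summarized in a small lookup table. The following is a sketch only; the table layout, the name `describe_case`, and the field names are assumptions introduced for illustration and are not part of any processor interface.

```python
# Key: (kind of instruction before SYNC, kind of instruction after SYNC).
# Value: (case name, what the later instruction conventionally waits for,
#         whether early issue with invalidation tracking is described).
CONVENTIONAL_HANDLING = {
    ("ST", "ST"): ("store-to-store",
                   "order maintained as the stores flow through the pipeline",
                   False),
    ("ST", "LD"): ("store-to-load",
                   "ST before SYNC complete (ACK for SYNC received at the core)",
                   True),
    ("LD", "LD"): ("load-to-load",
                   "data for LD before SYNC received at the core",
                   True),
    ("LD", "ST"): ("load-to-store",
                   "data for LD before SYNC received at the core",
                   False),
}


def describe_case(before, after):
    """Classify a 'before SYNC after' pair such as ('ST A', 'LD B')."""
    key = (before.split()[0], after.split()[0])
    name, waits_for, early_issue = CONVENTIONAL_HANDLING[key]
    return {
        "case": name,
        "waits_for": waits_for,
        "early_issue_with_tracking": early_issue,
    }


second_case = describe_case("ST A", "LD B")
fourth_case = describe_case("LD A", "ST B")
```

Note that early issue with tracking for invalidation appears only in the two cases in which the instruction following the barrier is a load.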
In conventional implementations, a load instruction prior to a barrier instruction has been resolved when data (for the load instruction) has been returned to the processor core. In contrast, in conventional implementations, a store instruction prior to a barrier instruction has been resolved when an acknowledgment (ACK) is received (for a subsequent barrier instruction) at the processor core from a memory subsystem (e.g., an L2 cache). With reference to cases 10 of FIG. 8, conventional implementations have either speculatively executed and tracked load instructions that follow a barrier instruction for invalidation (with invalidated load instructions being re-executed) or (in the event that tracking logic is not available) delayed execution of load instructions following the barrier instruction until the instructions (ST A or LD A) before the barrier instruction have been resolved, as noted above. With reference to cases 12 of FIG. 8, conventional implementations have delayed execution of instructions (LD B or ST B) that follow a barrier instruction until data for the load instruction (LD A) is received by the processor core.
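The conventional resolution rule above can be sketched as a simple timing model. The function names, the use of abstract cycle numbers, and the rule that an invalidation arriving between early issue and resolution forces re-execution are simplifying assumptions made for illustration; actual tracking logic is implementation specific.

```python
def resolved_cycle(older, data_return_cycle=None, sync_ack_cycle=None):
    """Cycle at which the instruction before SYNC counts as resolved.

    A load is resolved when its data has been returned to the core; a store
    is resolved when the ACK for the subsequent SYNC reaches the core.
    """
    kind = older.split()[0]
    if kind == "LD":
        return data_return_cycle
    if kind == "ST":
        return sync_ack_cycle
    raise ValueError("expected an instruction of the form 'LD x' or 'ST x'")


def needs_reexecution(issue_cycle, resolve_cycle, invalidation_cycles):
    """Assumed tracking rule: a load issued early after SYNC is re-executed
    if an invalidation for its address arrives in the window between its
    early issue and the resolution of the older instruction."""
    return any(issue_cycle <= c < resolve_cycle for c in invalidation_cycles)


# Hypothetical cycle numbers chosen only to exercise the model.
load_resolved = resolved_cycle("LD A", data_return_cycle=12, sync_ack_cycle=30)
store_resolved = resolved_cycle("ST A", data_return_cycle=12, sync_ack_cycle=30)
hit = needs_reexecution(issue_cycle=5, resolve_cycle=store_resolved,
                        invalidation_cycles=[18])
miss = needs_reexecution(issue_cycle=5, resolve_cycle=store_resolved,
                         invalidation_cycles=[40])
```

Under the delay policy, the instruction following the barrier would simply begin execution no earlier than the cycle returned by `resolved_cycle`; under the speculative policy, it may issue earlier at the cost of possible re-execution.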