Embodiments of the present invention relate in general to an out-of-order (OoO) processor and more specifically to facilitating efficient store-forwarding with a partitioned FIFO store-reorder queue in the OoO processor.
In an OoO processor, an instruction sequencing unit (ISU) dispatches instructions to various issue queues, renames registers in support of OoO execution, issues instructions from the various issue queues to the execution pipelines, completes executed instructions, and handles exception conditions. Register renaming is typically performed by mapper logic in the ISU before the instructions are placed in their respective issue queues. The ISU includes one or more issue queues that contain dependency matrices for tracking dependencies between instructions. A dependency matrix typically includes one row and one column for each instruction in the issue queue.
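For illustration, the dependency-matrix bookkeeping described above can be sketched in software as follows. This is a simplified model, not the actual hardware structure: the class name, method names, and queue size are hypothetical, and a real issue queue tracks these bits with per-row wired vectors rather than nested lists.

```python
# Hypothetical software sketch of an issue-queue dependency matrix:
# one row and one column per instruction slot in the issue queue.
class DependencyMatrix:
    def __init__(self, size):
        # matrix[i][j] == True means instruction i depends on instruction j
        self.matrix = [[False] * size for _ in range(size)]

    def add_dependency(self, consumer, producer):
        self.matrix[consumer][producer] = True

    def clear_producer(self, producer):
        # When a producer issues, clear its column so that instructions
        # waiting on it can become ready.
        for row in self.matrix:
            row[producer] = False

    def is_ready(self, i):
        # An instruction is ready when its row has no outstanding dependencies.
        return not any(self.matrix[i])

dm = DependencyMatrix(4)
dm.add_dependency(2, 0)     # instruction 2 consumes the result of instruction 0
ready_before = dm.is_ready(2)   # False: dependency outstanding
dm.clear_producer(0)            # instruction 0 issues
ready_after = dm.is_ready(2)    # True: row 2 is now clear
```

A row-per-consumer, column-per-producer layout makes the readiness check a single row scan, which mirrors why hardware dependency matrices scale with one row and one column per issue-queue entry.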
Typically, OoO processors facilitate higher performance by executing memory access instructions (loads and stores) out of program order. For example, a program code may include a series of memory access instructions including loads (L1, L2, . . . ) and stores (S1, S2, . . . ) that are provided in a computer program to be executed in an order such as: S1, L1, S2, L2, . . . . However, the OoO processor may select the instructions in a different order, such as L1, L2, S1, S2, . . . . The memory operations may be strongly ordered if they are to occur in the program order specified. In such cases, when executing the instructions, the OoO processor has to respect the dependencies between the instructions, because executing a dependent load/store pair out of order can produce incorrect results. For example, if S1 stores data to the same physical address that L1 subsequently reads data from, the store S1 must be completed (its data written to memory) before L1 is performed, so that the correct data is available at the physical address for L1 to read. Violation of such a dependency leads to a hazard.
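The S1/L1 example above can be replayed with a minimal sketch in which a plain dictionary stands in for memory. The address and data values are placeholders chosen for illustration; this models only the program-visible effect of the reordering, not the hardware mechanism.

```python
# Illustrative model: a dependent store (S1) / load (L1) pair to the
# same physical address. A dict stands in for memory; 0x100 and 42
# are placeholder address/data values.
memory = {0x100: 0}

def S1():                      # store: write new data to address 0x100
    memory[0x100] = 42

def L1():                      # load: read from the same address
    return memory[0x100]

# Program order (S1 before L1): L1 reads the value S1 stored.
memory[0x100] = 0
S1()
in_order = L1()                # 42: correct result

# Out of order (L1 before S1): L1 reads stale data, violating the
# read-after-write dependency through memory.
memory[0x100] = 0
out_of_order = L1()            # 0: stale value, incorrect result
S1()
```

Running both orderings side by side shows why the processor must detect that S1 and L1 touch the same address before allowing the load to bypass the store.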
In the domain of central processing unit (CPU) design, and particularly for OoO processors, hazards pose technical challenges in the instruction pipeline of CPU microarchitectures: when a next instruction cannot safely execute in the following clock cycle, executing it anyway can lead to incorrect computation results. Typical types of hazards include data hazards, structural hazards, and control flow hazards (branching hazards). Data hazards occur when instructions that exhibit data dependence modify data in different stages of a pipeline, for example read after write (RAW), write after read (WAR), and write after write (WAW). A structural hazard occurs when a part of the processor's hardware is needed by two or more instructions at the same time, for example a memory unit being accessed both in the fetch stage, where an instruction is retrieved from memory, and in the memory stage, where data is written to and/or read from memory. Further, branching hazards (also termed control hazards) occur with branches in the computer program being executed by the processor.
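Of the hazard types listed above, the RAW data hazard can be demonstrated with a short register-level sketch. The register names and values below are hypothetical; the point is only that a consumer executed before its producer completes observes a stale value.

```python
# Sketch of a RAW (read-after-write) data hazard on registers.
# Register names r1..r4 and their values are placeholders.
#   I1: r1 = r2 + r3     (producer writes r1)
#   I2: r4 = r1 + 1      (consumer reads r1)
regs = {"r1": 0, "r2": 2, "r3": 3}

# Correct in-order execution: I2 sees I1's result.
regs["r1"] = regs["r2"] + regs["r3"]   # I1: r1 = 5
r4_correct = regs["r1"] + 1            # I2: r4 = 6

# Hazard: if I2 executes before I1's write is visible, it reads
# the stale r1 value.
regs["r1"] = 0                         # reset to the pre-I1 state
r4_hazard = regs["r1"] + 1             # I2 first: r4 = 1 (wrong)
regs["r1"] = regs["r2"] + regs["r3"]   # I1 completes too late
```

WAR and WAW hazards are the symmetric cases, where a later write must not clobber a value before an earlier read, or must not be overwritten by an earlier write landing late.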
Dependencies between instructions can also be violated when different instructions are performed by different processors and/or co-processors in OoO systems that implement multiple processors and/or co-processors. For example, memory ordering rules may be violated if a first processor performs a store to address Ad1 followed by a store to address Ad2, and a second processor performs a load from address Ad2 (which misses in the data cache of the second processor) followed by a load from address Ad1 (which hits in the data cache of the second processor). Memory ordering rules require, in the above example, that if the load from address Ad2 receives the store data from the store to address Ad2, then the load from address Ad1 must receive the store data from the store to address Ad1. However, if the load from address Ad1 is allowed to complete while the load from address Ad2 is being serviced, then the following scenario may occur: (1) the load from address Ad1 may receive data prior to the store to address Ad1; (2) the store to address Ad1 may complete; (3) the store to address Ad2 may complete; and (4) the load from address Ad2 may complete and receive the data provided by the store to address Ad2. Such an outcome would be incorrect because the load from address Ad1 completed before the store to address Ad1. Thus, the load from address Ad1 receives stale data in this case.
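The four-step interleaving described above can be replayed step by step in a minimal sketch. A dictionary stands in for shared memory, and the addresses Ad1/Ad2 and old/new values are placeholders; this models the globally observed order of completions, not the cache machinery itself.

```python
# Replay of the two-processor ordering violation: P1 stores to Ad1
# then Ad2; P2 loads from Ad2 (cache miss, serviced slowly) then
# from Ad1 (cache hit, completes early). Addresses and values are
# hypothetical placeholders.
mem = {"Ad1": "old1", "Ad2": "old2"}

# (1) P2's load from Ad1 hits in its cache and completes early,
#     before P1's store to Ad1:
p2_load_Ad1 = mem["Ad1"]        # receives "old1" (stale)
# (2) P1's store to Ad1 completes:
mem["Ad1"] = "new1"
# (3) P1's store to Ad2 completes:
mem["Ad2"] = "new2"
# (4) P2's load from Ad2, whose miss is now serviced, completes:
p2_load_Ad2 = mem["Ad2"]        # receives "new2"

# Ordering violation: the load from Ad2 observed the new store data,
# so the earlier-in-program-order load from Ad1 was required to
# observe "new1", yet it received stale "old1".
```

Preventing this outcome is exactly why a processor must hold back (or replay) the younger load from Ad1 until the older load from Ad2 has been ordered.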