1. Field of Invention
This invention relates in general to computers, and more particularly, to reducing the number of in-flight load instructions searched by loads, stores, or snoops executed by a processor.
2. Description of Background
In out-of-order processors, instructions may execute in an order other than what the program specified. For an instruction to execute on an out-of-order processor, only three conditions need normally be satisfied:
(1) the inputs to the instruction are available;
(2) a function unit is available on which to execute the instruction;
(3) there is a place to put the result.
For most instructions, these requirements are relatively straightforward. For load instructions, however, accurately determining condition (1) is difficult. Load instructions have two inputs: (a) registers, which specify the address from which data is to be loaded, and (b) the memory location(s) from which the load data will come. Determining the availability of register values in case (a) is relatively straightforward; determining the availability of memory locations in case (b) is not. The problem is that there may be stores earlier in program order than a particular load, and some of these stores may not yet have executed when the remaining conditions are satisfied, that is, when (1) all of the register inputs for the load instruction are ready, (2) there is a function unit available on which the load can be executed, and (3) there is a place (a register) in which to put the loaded value. Because these earlier stores have not yet executed, the data locations to which they write may include some of the data locations from which the load reads. In general, without executing the store instructions, it is not possible to determine whether the addresses (that is, the data locations) to which a store writes overlap the addresses from which a load reads.
As a result, most modern out-of-order processors execute load instructions when (1) all of the input register values are available, (2) there is a function unit available on which to execute the load, and (3) there is a register where the loaded value may be placed. Since dependences on previous store instructions are ignored, a load instruction may sometimes execute prematurely, and have to be squashed and re-executed so as to obtain the correct value produced by a prior store instruction.
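The premature execution described above can be illustrated with a minimal sketch. It assumes a toy model in which memory is a dictionary and each instruction is a function call. The names `memory`, `execute_load`, and `execute_store` are hypothetical and chosen only for this illustration.

```python
# Toy model (assumption): memory maps an address to a single value.
memory = {0x100: 0}

# Program order being modeled:
#   store 42 -> [0x100]   (older; its inputs are not ready yet)
#   load  r1 <- [0x100]   (younger; all three issue conditions are met)

def execute_load(addr):
    """Return the value currently held at addr."""
    return memory[addr]

def execute_store(addr, value):
    """Write value to addr."""
    memory[addr] = value

# Out-of-order execution: the load issues first, because dependences on
# earlier stores are ignored when deciding whether the load may execute.
loaded_early = execute_load(0x100)   # reads the stale value 0
execute_store(0x100, 42)             # the older store executes afterwards

# After being squashed, the load re-executes and obtains the value that
# program order requires.
loaded_correct = execute_load(0x100)
```

In this sketch the first load result is stale, and only the re-executed load observes the value written by the older store.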
To detect when a load instruction has executed prematurely, modern processors typically have a load reorder queue (LRQ), which keeps a list of all in-flight loads. In-flight loads are loads that have been fetched and decoded by the processor but have not yet completed, either because they have not finished executing or because they are waiting on older instructions in the program to finish executing. A load is completed when it has finished executing and can therefore be represented to the programmer, or to anyone else viewing execution of the program, as having completed its execution.
The LRQ is normally sorted by the order of loads in the program. Each entry in the LRQ has, among other information, the address(es) from which the load received data.
Each time a store executes, it checks the LRQ to determine whether any loads that are after the store in program order nonetheless executed before the store and, if so, whether any of those loads read data from a location to which the store writes. If so, the store signals the appropriate parts of the processor that the load has received a bad value and must re-execute.
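The store check described above can be sketched as a filter over the LRQ entries. This is only an illustration under stated assumptions: program order is modeled by a hypothetical sequence number `seq` (larger means younger), each load reads a single address, and a match is an exact address comparison rather than an overlap of byte ranges. The names `LrqEntry` and `store_check` are not taken from the text.

```python
from dataclasses import dataclass

@dataclass
class LrqEntry:
    seq: int                # program-order position (assumption: larger = younger)
    addr: int               # address from which the load received data
    executed: bool = False  # True once the load has executed

def store_check(lrq, store_seq, store_addr):
    """Return the loads that executed prematurely with respect to a store.

    A load is premature if it is younger than the store in program order,
    has already executed, and read from the address the store writes.
    """
    return [e for e in lrq
            if e.seq > store_seq and e.executed and e.addr == store_addr]

lrq = [
    LrqEntry(seq=1, addr=0x100, executed=True),   # older than the store
    LrqEntry(seq=3, addr=0x100, executed=True),   # younger, same address
    LrqEntry(seq=4, addr=0x200, executed=True),   # younger, different address
    LrqEntry(seq=5, addr=0x100, executed=False),  # younger, not yet executed
]

# A store at program position 2 writes to 0x100 and checks every LRQ entry.
to_reexecute = store_check(lrq, store_seq=2, store_addr=0x100)
```

Only the load at position 3 is flagged for re-execution. Note that the function examines every entry, which mirrors the fully associative comparison the hardware must perform.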
There may be many loads in-flight at any one time: modern processors allow 16, 32, 64 or more loads to be simultaneously in-flight. Thus, a store instruction must check 16, 32, 64 or more entries in the LRQ to see if those loads executed prematurely.
Since new store instructions may occur each cycle in a modern processor, these checks for premature load execution must take at most one cycle; that is, all 16, 32, 64 or more entries in the LRQ must be able to be checked every cycle. Such a fully associative comparison is known to be expensive (a) in terms of the area required to perform the comparison, (b) in terms of the amount of energy required to perform the comparison, and (c) in terms of the time required to perform the comparison, for example, a cycle may have to take longer than it otherwise would so as to allow time for the comparison to complete. All three of these factors (a), (b), and (c) are significant concerns in the design of modern processors.
Related problems arise when a processor is one of a plurality of processors in a multiprocessor (MP) system. Different MP systems have different rules for the ordering of load and store instructions executed on different processors. At a minimum, most MP systems require a condition known as sequential load consistency, which means that if processor X stores to a particular location A, then all loads from location A on processor Y must be consistent. In other words, if an older load in program order on processor Y sees the updated value at location A, then any younger load in program order on processor Y must also see that updated value.
If all of the loads on processor Y were executed in order, such sequential load consistency would happen naturally. However, on an out-of-order processor, the younger load in program order may execute earlier than the older load in program order. If processor X updates the location from which these two loads read after the younger load has executed but before the older load executes, the older load sees the updated value while the younger load does not, and sequential load consistency is violated.
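The violation can be made concrete with a small sketch. It assumes a toy shared memory modeled as a dictionary and a fixed interleaving of events. The names `shared`, `value_L`, and `value_M` are hypothetical.

```python
# Toy shared memory (assumption): location A is visible to processors X and Y.
shared = {"A": "old"}

# Processor Y, program order: older load L from A, then younger load M from A.
# Out-of-order execution on Y lets M execute before L.
value_M = shared["A"]   # younger load M executes early and sees "old"
shared["A"] = "new"     # processor X's store to A becomes visible
value_L = shared["A"]   # older load L executes late and sees "new"

# Sequential load consistency requires that if the older load sees the
# updated value, the younger load sees it too. Here it does not.
violated = (value_L == "new" and value_M == "old")
```

In this interleaving `violated` is true, so the younger load would have to re-execute to restore consistency.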
To avoid problems with sequential load consistency, each time a processor writes to a particular location, it conceptually informs every other processor that it has done so. In practice, most processor systems have mechanisms that avoid the need to inform every processor of every individual store performed by other processors. These mechanisms are beyond the scope of the proposed invention and apply equally well to the proposed invention as to the standard solution described herein.
However, even with these mechanisms there is some subset of stores about which other processors must be informed. When a processor Y receives notice (a snoop) that another processor X has written to a location, processor Y must ensure that all of the loads currently in-flight receive sequentially load consistent values. The check to ensure these conditions is similar to the check previously described for store instructions: each entry in the LRQ is checked to see if it matches the snoop address stored to by the other processor X.
All entries in the LRQ that match the snoop address have a snooped bit set to indicate that they match the snoop. All load instructions check this snooped bit when they execute. More precisely, when a load instruction (L) executes, it checks all entries in the LRQ to see if there are any load instructions (M) which satisfy all of the following conditions:
(1) load M is younger in program order than the current load L;
(2) load M is from the same address as the current load L;
(3) load M has already executed;
(4) load M has the snooped bit set.
Any load M in the LRQ meeting all of these conditions must re-execute so as to maintain sequential load consistency, that is, to ensure that the younger load M does not receive an older value than the older load L.
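The snoop handling and the per-load check of conditions (1) through (4) can be sketched as follows. The sketch assumes that program order is modeled by a hypothetical sequence number `seq` (larger means younger) and that addresses match by exact comparison. The names `LrqEntry`, `handle_snoop`, and `load_check` are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class LrqEntry:
    seq: int                # program-order position (assumption: larger = younger)
    addr: int               # address from which the load reads
    executed: bool = False  # True once the load has executed
    snooped: bool = False   # set when a snoop matches this entry's address

def handle_snoop(lrq, snoop_addr):
    """Set the snooped bit on every LRQ entry that matches the snoop address."""
    for entry in lrq:
        if entry.addr == snoop_addr:
            entry.snooped = True

def load_check(lrq, load_seq, load_addr):
    """Return the loads M that must re-execute when load L executes."""
    return [m for m in lrq
            if m.seq > load_seq        # (1) M is younger than L
            and m.addr == load_addr    # (2) M reads the same address as L
            and m.executed             # (3) M has already executed
            and m.snooped]             # (4) M has the snooped bit set

lrq = [
    LrqEntry(seq=1, addr=0xA0),                 # older load L, not yet executed
    LrqEntry(seq=2, addr=0xA0, executed=True),  # younger load M, executed early
    LrqEntry(seq=3, addr=0xB0, executed=True),  # unrelated younger load
]

handle_snoop(lrq, 0xA0)   # another processor has written to 0xA0
lrq[0].executed = True    # the older load L now executes
to_reexecute = load_check(lrq, load_seq=1, load_addr=0xA0)
```

Here only the load at position 2 is returned. Both `handle_snoop` and `load_check` visit every entry in the queue, which is the cost that the text identifies as problematic.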
Just as it is problematic for area, power and cycle time that store instructions must check the large number of entries in the LRQ, it is problematic that snoops must also check this large number of entries in the LRQ.
Thus, there is a need for a method to reduce the number of LRQ entries that are checked each cycle and still maintain fast and correct execution. Furthermore, there is a need for a method to reduce the number of LRQ entries that are checked at each snoop and still maintain fast and correct execution. Such a solution will contribute to improved performance in an out-of-order processor.