The primary function of most computer processors is to execute computer instructions. Most processors execute instructions in the programmed order in which they are received. However, some recent processors, such as the Pentium® II processor from Intel Corp., are “out-of-order” processors. An out-of-order processor can execute instructions in any order as the data and execution units required for each instruction become available. Therefore, with an out-of-order processor, execution units within the processor that otherwise may be idle can be more efficiently utilized.
With either type of processor, delays can occur when executing “dependent” instructions. A dependent instruction, in order to execute correctly, requires a value produced by another instruction that has executed correctly. For example, consider the following set of instructions:
        1) Load memory-1 into register-X;
        2) Add1 register-X register-Y into register-Z;
        3) Add2 register-Y register-Z into register-W.
The first instruction loads the content of memory-1 into register-X. The second instruction adds the content of register-X to the content of register-Y and stores the result in register-Z. The third instruction adds the content of register-Y to the content of register-Z and stores the result in register-W. In this set of instructions, instructions 2 and 3 are dependent instructions that are dependent on instruction 1 (instruction 3 is also dependent on instruction 2). In other words, if register-X is not loaded with the proper value in instruction 1 before instructions 2 and 3 are executed, instructions 2 and 3 will likely generate incorrect results. Dependent instructions can cause a delay in known processors because most known processors typically do not schedule a dependent instruction until they know that the instruction on which the dependent instruction depends will produce the correct result.
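The dependencies among the three instructions above can be illustrated with a minimal sketch. The `Instr` type and the function names `direct_dependencies` and `all_dependencies` are hypothetical, introduced only for illustration; the sketch assumes that an instruction depends on the most recent earlier instruction that wrote one of its source registers, and that dependence carries through intermediate instructions.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    """Hypothetical instruction record: a name, source operands, one destination."""
    name: str
    sources: tuple
    dest: str


# The three instructions from the example, in program order.
program = [
    Instr("Load", ("memory-1",), "register-X"),
    Instr("Add1", ("register-X", "register-Y"), "register-Z"),
    Instr("Add2", ("register-Y", "register-Z"), "register-W"),
]


def direct_dependencies(instrs):
    """Map each instruction to the earlier instructions that produce its sources."""
    last_writer = {}  # operand name -> name of the instruction that last wrote it
    deps = {}
    for ins in instrs:
        deps[ins.name] = sorted(
            {last_writer[s] for s in ins.sources if s in last_writer}
        )
        last_writer[ins.dest] = ins.name
    return deps


def all_dependencies(instrs):
    """Extend direct dependencies to include dependencies reached through producers."""
    direct = direct_dependencies(instrs)
    closure = {}
    for ins in instrs:
        seen = set()
        stack = list(direct[ins.name])
        while stack:
            producer = stack.pop()
            if producer not in seen:
                seen.add(producer)
                stack.extend(direct[producer])
        closure[ins.name] = sorted(seen)
    return closure


direct = direct_dependencies(program)
full = all_dependencies(program)
```

In this sketch, `direct["Add1"]` is `["Load"]` and `full["Add2"]` is `["Add1", "Load"]`, which matches the statement that instruction 3 depends on both instruction 1 and instruction 2. The dependence of instruction 3 on instruction 1 arises through register-Z.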
Referring now to the drawings, FIG. 1 is a block diagram of a processor pipeline and timing diagram illustrating the delay caused by dependent instructions in most known processors. In FIG. 1, a scheduler 105 schedules instructions. The instructions proceed through an execution unit pipeline that includes pipeline stages 110, 115, 120, 125, 130, 135 and 140. During each pipeline stage a processing step is executed. For example, at pipeline stage 110 the instruction is dispatched. At stage 115 the instruction is decoded and source registers are read. At stage 120 a memory address is generated (for a memory instruction) or an arithmetic logic unit (“ALU”) operation is executed (for an arithmetic or logic instruction). At stage 125 cache data is read and a lookup of the translation lookaside buffer (“TLB”) is performed. At stage 130 the cache Tag is read. At stage 135 a hit/miss signal is generated as a result of the Tag read. The hit/miss signal indicates whether the desired data was found in the cache (i.e., whether the data read from the cache at stage 125 was the correct data). As shown in FIG. 1, the hit/miss signal is typically generated after the data is read at stage 125, because generating the hit/miss signal requires the additional steps of TLB lookup and Tag read.
The timing diagram of FIG. 1 illustrates the pipeline flow of two instructions: a memory load instruction (“Ld”) and an add instruction (“Add”). The memory load instruction is a six-cycle instruction, the add instruction is a one-cycle instruction, and the add instruction is dependent on the load instruction. At time=0 (i.e., the first clock cycle) Ld is scheduled and dispatched (pipeline stage 110). At time=1, time=2 and time=3, Ld moves to pipeline stages 115, 120 and 125, respectively. At time=4, Ld is at pipeline stage 130. At time=5, Ld is at stage 135 and the hit/miss signal is generated. Scheduler 105 receives this signal. Finally, at time=6, assuming a hit signal is received indicating that the data was correct, scheduler 105 schedules Add to stage 110, while Ld continues to stage 140, which is an additional pipeline stage. The add operation is eventually performed when Add is at stage 120. However, if at time=6 a miss signal is received, scheduler 105 will wait an indefinite number of clock cycles until data is received by accessing the next levels of the cache hierarchy.
As shown in the timing diagram of FIG. 1, Add, because it is dependent on Ld, cannot be scheduled until time=6, at the earliest. The latency of an instruction may be defined as the time from when its input operands must be ready for it to execute until its result is ready to be used by another instruction. Therefore, the latency of Ld in the example of FIG. 1 is six. Further, as shown in FIG. 1, scheduler 105 cannot schedule Add until it receives the hit/miss signal. Therefore, even if the time required to read data from a cache decreases with improved cache technology, the latency of Ld will remain at six because it is dependent on the hit/miss signal.
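The timing of FIG. 1 can be reproduced with a minimal sketch, assuming that an instruction advances exactly one pipeline stage per clock cycle and that the scheduler can act on the hit/miss signal one cycle after the signal is generated. The names `STAGES`, `HIT_MISS_STAGE`, `stage_at` and `earliest_dependent_schedule` are hypothetical and are used only for illustration.

```python
# Pipeline stages of FIG. 1, in the order an instruction passes through them.
STAGES = [110, 115, 120, 125, 130, 135, 140]

# Stage at which the hit/miss signal is generated (per the description of FIG. 1).
HIT_MISS_STAGE = 135


def stage_at(schedule_time, t):
    """Return the stage occupied at time t by an instruction scheduled at
    schedule_time, or None if the instruction is not in the pipeline then.
    Assumes one stage per clock cycle with no stalls."""
    index = t - schedule_time
    if 0 <= index < len(STAGES):
        return STAGES[index]
    return None


def earliest_dependent_schedule(load_schedule_time):
    """Earliest time a dependent instruction can be scheduled, assuming a hit
    and assuming the scheduler reacts one cycle after the hit/miss signal."""
    hit_miss_time = load_schedule_time + STAGES.index(HIT_MISS_STAGE)
    return hit_miss_time + 1


add_schedule_time = earliest_dependent_schedule(0)  # Ld is scheduled at time=0
ld_latency = add_schedule_time - 0
```

Under these assumptions, `add_schedule_time` and `ld_latency` both evaluate to 6, in agreement with the timing diagram. Shortening the cache data read at stage 125 would leave both values unchanged in this model, because they are determined by the position of stage 135 rather than by the position of the data read.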
Reducing the latencies of instructions in a processor is sometimes necessary to increase the operating speed of a processor. For example, suppose that a part of a program contains a sequence of N instructions, I1, I2, I3 . . . IN. Suppose that In+1 requires, as part of its inputs, the result of In, for all n from 1 to N-1. This part of the program may also contain any other instructions. The program cannot be executed in less time than T=L1+L2+L3+ . . . +LN, where Ln is the latency of instruction In, for all n from 1 to N. In fact, even if the processor were capable of executing a very large number of instructions in parallel, T remains a lower bound for the time to execute this part of the program. Hence, to execute this program faster, it will ultimately be essential to shorten the latencies of the instructions.
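The lower bound T can be illustrated with a minimal sketch of an idealized processor that has unlimited execution units, so that an instruction starts as soon as all of its producers have finished. The function `finish_times`, the instruction names J1 and J2, and all latency values are hypothetical and chosen only for illustration.

```python
def finish_times(latency, deps):
    """Compute the finish time of each instruction on an idealized processor
    with unlimited parallelism.

    latency: dict mapping instruction name to latency in cycles, listed in
             program order (producers before consumers).
    deps:    dict mapping instruction name to the names of its producers.
    """
    finish = {}
    for name, cycles in latency.items():
        # An instruction starts once every producer has finished (time 0 if none).
        start = max((finish[p] for p in deps.get(name, [])), default=0)
        finish[name] = start + cycles
    return finish


# Hypothetical latencies: a dependent chain I1 -> I2 -> I3 plus two
# unrelated instructions J1 and J2 that can run in parallel with the chain.
latency = {"I1": 6, "I2": 1, "I3": 1, "J1": 2, "J2": 3}
deps = {"I2": ["I1"], "I3": ["I2"]}

finish = finish_times(latency, deps)
total_time = max(finish.values())
chain_bound = latency["I1"] + latency["I2"] + latency["I3"]  # T = L1 + L2 + L3
```

With these illustrative values, `total_time` equals `chain_bound` (8 cycles): the independent instructions J1 and J2 overlap with the chain, but the chain itself cannot finish sooner than the sum of its latencies. Reducing the latency of I1 from 6 to a smaller value is the only change in this model that lowers the bound appreciably.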
Based on the foregoing, there is a need for a computer processor that can schedule instructions, especially dependent instructions, faster than known processors, and thereby reduce the latencies of the instructions.