The primary function of most computer processors is to execute computer instructions. Most processors execute instructions in the programmed order in which they are received. However, some recent processors, such as the Pentium® II processor from Intel Corp., are "out-of-order" processors. An out-of-order processor can execute instructions in any order as the data and execution units required for each instruction become available. Therefore, with an out-of-order processor, execution units within the processor that otherwise may be idle can be more efficiently utilized.
With either type of processor, delays can occur when executing "dependent" instructions. A dependent instruction, in order to execute correctly, requires a value produced by another instruction that has executed correctly. For example, consider the following set of instructions:
1) Load memory-1 → register-X;
2) Add1 register-X register-Y → register-Z;
3) Add2 register-Y register-Z → register-W.
The first instruction loads the content of memory-1 into register-X. The second instruction adds the content of register-X to the content of register-Y and stores the result in register-Z. The third instruction adds the content of register-Y to the content of register-Z and stores the result in register-W. In this set of instructions, instructions 2 and 3 both depend on instruction 1 (instruction 3 is also dependent on instruction 2). In other words, if register-X is not loaded with the proper value in instruction 1 before instructions 2 and 3 are executed, instructions 2 and 3 will likely generate incorrect results. Dependent instructions can cause a delay in known processors because most known processors typically do not schedule a dependent instruction until they know that the instruction on which it depends will produce the correct result.
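The dependencies among the three instructions above can be illustrated with a short sketch. The sketch is not part of any known processor. The names `Instr`, `find_dependencies` and `all_producers` are hypothetical, and the model assumes that a dependency exists whenever an instruction reads a register written by an earlier instruction.

```python
from typing import Dict, List, NamedTuple, Set, Tuple


class Instr(NamedTuple):
    """Hypothetical record for one instruction of the example."""
    name: str
    sources: Tuple[str, ...]  # locations read by the instruction
    dest: str                 # register written by the instruction


def find_dependencies(program: List[Instr]) -> Dict[str, List[str]]:
    """Map each instruction to the instructions that directly produce its inputs."""
    last_writer: Dict[str, str] = {}  # register -> most recent instruction writing it
    deps: Dict[str, List[str]] = {}
    for instr in program:
        deps[instr.name] = [last_writer[s] for s in instr.sources if s in last_writer]
        last_writer[instr.dest] = instr.name
    return deps


def all_producers(deps: Dict[str, List[str]], name: str) -> Set[str]:
    """Collect direct and indirect producers of an instruction."""
    seen: Set[str] = set()
    stack = list(deps[name])
    while stack:
        producer = stack.pop()
        if producer not in seen:
            seen.add(producer)
            stack.extend(deps[producer])
    return seen


program = [
    Instr("Load", ("memory-1",), "register-X"),
    Instr("Add1", ("register-X", "register-Y"), "register-Z"),
    Instr("Add2", ("register-Y", "register-Z"), "register-W"),
]
deps = find_dependencies(program)
print(deps)                                 # direct dependencies
print(sorted(all_producers(deps, "Add2")))  # Add2 depends on both Add1 and Load
```

In this model Add2 depends on Load only indirectly, through register-Z, which is consistent with the statement that instruction 3 is dependent on both instruction 1 and instruction 2.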
Referring now in detail to the drawings, wherein like parts are designated by like reference numerals throughout, FIG. 1 is a block diagram of a processor pipeline and timing diagram illustrating the delay caused by dependent instructions in most known processors. In FIG. 1, a scheduler 10 schedules instructions. The instructions proceed through an execution unit pipeline that includes pipeline stages 12, 14, 16, 18, 20, 22 and 24. During each pipeline stage a processing step is executed. For example, at pipeline stage 12 the instruction is dispatched. At stage 14 the instruction is decoded and source registers are read. At stage 16 a memory address is generated (for a memory instruction) or an arithmetic logic unit ("ALU") operation is executed (for an arithmetic or logic instruction). At stage 18 cache data is read and a lookup of the translation lookaside buffer ("TLB") is performed. At stage 20 the cache Tag is read. At stage 22 a hit/miss signal is generated as a result of the Tag read. The hit/miss signal indicates whether the desired data was found in the cache (i.e., whether the data read from the cache at stage 18 was the correct data). As shown in FIG. 1, the hit/miss signal is typically generated after the data is read at stage 18, because generating the hit/miss signal requires the additional steps of TLB lookup and Tag read.
The timing diagram of FIG. 1 illustrates the pipeline flow of two instructions: a memory load instruction ("Ld") and an add instruction ("Add"). The memory load instruction is a two-cycle instruction, the add instruction is a one-cycle instruction, and the add instruction is dependent on the load instruction. At time=0 (i.e., the first clock cycle) Ld is scheduled and dispatched (pipeline stage 12). At time=1, time=2 and time=3, Ld moves to pipeline stages 14, 16 and 18, respectively. At time=4, Ld is at pipeline stage 20. At time=5, Ld is at stage 22 and the hit/miss signal is generated. Scheduler 10 receives this signal. Finally at time=6, assuming a hit signal is received indicating that the data was correct, scheduler 10 schedules Add to stage 12, while Ld continues to stage 24, which is an additional pipeline stage. The add operation is eventually performed when Add is at stage 16. However, if at time=6 a miss signal is received, scheduler 10 will wait an indefinite number of clock cycles until data is received by accessing the next levels of the cache hierarchy.
As shown in the timing diagram of FIG. 1, Add, because it is dependent on Ld, cannot be scheduled until time=6, at the earliest. A latency of an instruction may be defined as the time from when its input operands must be ready for it to execute until its result is ready to be used by another instruction. Therefore, the latency of Ld in the example of FIG. 1 is six. Further, as shown in FIG. 1, scheduler 10 cannot schedule Add until it receives the hit/miss signal. Therefore, even if the time required to read data from a cache decreases with improved cache technology, the latency of Ld will remain at six because it is dependent on the hit/miss signal.
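The timing of FIG. 1 can be reproduced with a minimal sketch, assuming that an instruction advances exactly one pipeline stage per clock cycle and that a hit signal is received. The names `STAGES`, `stage_at` and the other variables are hypothetical. The latency of Ld is modeled here as the number of cycles between the scheduling of Ld and the earliest scheduling of the dependent Add.

```python
STAGES = [12, 14, 16, 18, 20, 22, 24]  # reference numerals of the stages in FIG. 1
HIT_MISS_STAGE = 22                    # stage at which the hit/miss signal is generated
EXECUTE_STAGE = 16                     # stage at which the ALU operation is executed


def stage_at(dispatch_time, t):
    """Return the stage occupied at time t, or None if the instruction is not in the pipeline."""
    index = t - dispatch_time
    return STAGES[index] if 0 <= index < len(STAGES) else None


ld_dispatch = 0
hit_miss_time = ld_dispatch + STAGES.index(HIT_MISS_STAGE)  # Ld reaches stage 22
add_dispatch = hit_miss_time + 1                            # assumption: a hit is signaled
add_execute_time = add_dispatch + STAGES.index(EXECUTE_STAGE)
ld_latency = add_dispatch - ld_dispatch

# Print the stage occupied by each instruction at every clock cycle.
for t in range(add_execute_time + 1):
    print(f"time={t}: Ld at {stage_at(ld_dispatch, t)}, Add at {stage_at(add_dispatch, t)}")
print("latency of Ld:", ld_latency)  # 6 under these assumptions
```

Because `add_dispatch` is derived from `hit_miss_time` rather than from the time at which the cache data is read at stage 18, a faster cache read does not change the result in this model.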
Reducing the latencies of instructions in a processor is sometimes necessary to increase the operating speed of a processor. For example, suppose that a part of a program contains a sequence of $N$ instructions, $I_1, I_2, I_3, \ldots, I_N$. Suppose that $I_{n+1}$ requires, as part of its inputs, the result of $I_n$, for all $n$ from 1 to $N-1$. This part of the program may also contain any other instructions. The program cannot be executed in less time than $T = L_1 + L_2 + L_3 + \cdots + L_N$, where $L_n$ is the latency of instruction $I_n$, for all $n$ from 1 to $N$. In fact, even if the processor were capable of executing a very large number of instructions in parallel, $T$ remains a lower bound for the time to execute this part of this program. Hence to execute this program faster, it will ultimately be essential to shorten the latencies of the instructions.
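The lower bound can be checked with a short sketch, assuming an idealized processor with an unlimited number of execution units in which an instruction starts as soon as all of its inputs are ready. The function `earliest_finish` and the latency values are hypothetical and chosen only for illustration.

```python
def earliest_finish(latencies, deps):
    """Earliest completion time of each instruction with unlimited execution units.

    Assumes instructions are numbered in program order and that every
    dependency points to a lower-numbered instruction.
    """
    finish = {}
    for n in sorted(latencies):
        start = max((finish[d] for d in deps.get(n, [])), default=0)
        finish[n] = start + latencies[n]
    return finish


# Hypothetical latencies for a chain of four instructions, in clock cycles.
latencies = {1: 6, 2: 1, 3: 1, 4: 6}
chain = {n: [n - 1] for n in range(2, 5)}  # each instruction needs the previous result
T = sum(latencies.values())

finish = earliest_finish(latencies, chain)
print(finish[4], T)  # the chain completes at exactly T = 14

# An additional independent instruction runs in parallel and does not change the bound.
latencies_extra = dict(latencies)
latencies_extra[5] = 1
finish_extra = earliest_finish(latencies_extra, chain)
print(max(finish_extra.values()))  # still 14
```

In this sketch the only way to lower the completion time is to reduce one of the values in `latencies`; adding execution units has no effect on the chain.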
Based on the foregoing, there is a need for a computer processor that can schedule instructions, especially dependent instructions, faster than known processors, and thereby reduce the latencies of the instructions.