The present invention relates to a method and apparatus for executing instructions in a computer. More specifically, the present invention relates to a method and apparatus for predicting the latency required between the execution of two related instructions.
Pipelining is an implementation technique whereby multiple instructions are overlapped in execution. A pipeline (also known as a functional unit) completes each instruction in a series of steps called pipeline stages. Instructions "enter" at one end of the pipeline, are processed through the stages, and "exit" at the other end (i.e., their intended effects are carried out). The throughput of the pipeline is determined by how often instructions are completed in the pipeline. The time required to move an instruction one step down the pipeline is known as a machine cycle. The length of a machine cycle is determined by the time required by the slowest pipeline stage because all the stages must proceed at the same time. In this type of architecture, as in most, the chief means of increasing throughput is reducing the duration of the clock cycle.
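As a rough illustration of the relationship described above, the following Python sketch computes the machine cycle from a set of stage delays, and the number of cycles needed to complete a run of instructions in an ideal, stall-free pipeline. The stage delays, the five-stage depth, and the function names are hypothetical and chosen only for illustration.

```python
def machine_cycle_ns(stage_delays_ns):
    """The machine cycle must be long enough for the slowest stage."""
    return max(stage_delays_ns)


def total_cycles(num_instructions, num_stages):
    """Cycles to complete a run of instructions with no stalls.

    The first instruction needs num_stages cycles to exit the pipeline;
    each later instruction completes one cycle after its predecessor.
    """
    return num_stages + num_instructions - 1


# Hypothetical delays (in nanoseconds) for a five-stage pipeline.
stage_delays = [2.0, 1.5, 3.0, 2.5, 1.0]
cycle = machine_cycle_ns(stage_delays)
cycles = total_cycles(100, len(stage_delays))
print(f"cycle = {cycle} ns, cycles = {cycles}, total = {cycle * cycles} ns")
```

Under these assumed numbers, shortening only the slowest stage shortens the machine cycle, which is why that stage bounds the throughput.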
However, an alternative to increasing the clock frequency is to employ more than one pipeline. In systems employing multiple pipelines, instructions are dispatched by a scheduler or similar hardware construct. Instructions may be dispatched to the pipelines based on numerous factors, such as pipeline availability, op-code type, operand availability, data dependencies, and other considerations.
When using pipelines, conditions can exist that prevent the next instruction in the instruction stream from executing during its designated clock cycle. Known as pipeline hazards (or, simply, hazards), these conditions can impair the performance increases provided by pipelining. A hazard is created whenever a dependence exists between instructions that execute closely enough to change the order of access to an operand.
Broadly, three classes of hazards exist. The first class is structural hazards, which arise from resource conflicts when the hardware cannot support all possible combinations of instructions in simultaneous overlapped execution. The second class is control hazards, which arise from the pipelining of branches and other instructions that change the program counter. The third class is data hazards, which arise when an instruction depends on the results of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
In the case of data hazards, the pipeline changes the order of access to operands relative to the order encountered when the instructions are executed sequentially. For example, consider the pipelined execution of the instructions shown in Table 1.
TABLE 1
Exemplary code segment containing a data hazard.
______________________________________
Instruction   Destination   Source 1   Source 2
______________________________________
ADD           R1            R2         R3
SUB           R4            R1         R5
______________________________________
The SUB instruction has a source operand, R1, that is the destination of the ADD instruction. The ADD instruction writes the value of R1 in the write-back stage of the pipeline, but the SUB instruction reads the value while in the pipeline's instruction decode stage. This causes a data hazard because the SUB instruction may be decoded prior to the ADD instruction completing (i.e., writing its results back). Unless precautions are taken to prevent it, the SUB instruction will attempt to read and use the wrong value. In fact, the value used by the SUB instruction is not even deterministic: Although it might seem logical to assume that SUB would always use the value of R1 that was assigned by an instruction prior to ADD, this is not always the case. If an interrupt should occur between the ADD and SUB instructions, the write-back stage of the ADD will complete, and the value of R1 at that point will be the result of the ADD. This unpredictable behavior is unacceptable in an interlocking pipeline.
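The timing behind this hazard can be sketched in a few lines of Python. The sketch assumes a classic five-stage pipeline (the text names only the instruction decode and write-back stages; the other stage names are illustrative), no forwarding hardware, and a register file in which a read must occur in a later cycle than the write that produces the value. All function names are hypothetical.

```python
# Assumed five-stage pipeline; only ID and WB are named in the text.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def cycle_in_stage(stage, issue_cycle):
    """Cycle in which an unstalled instruction occupies the given stage."""
    return issue_cycle + STAGES.index(stage)


def raw_stall_cycles(producer_issue, consumer_issue):
    """Stall cycles needed so the consumer's ID read follows the producer's WB.

    Assumes no forwarding, and that a register read must occur in a later
    cycle than the write that produces the value.
    """
    write_cycle = cycle_in_stage("WB", producer_issue)
    read_cycle = cycle_in_stage("ID", consumer_issue)
    return max(0, write_cycle - read_cycle + 1)


add_issue, sub_issue = 1, 2  # SUB enters the pipeline one cycle after ADD
print("ADD writes R1 in cycle", cycle_in_stage("WB", add_issue))
print("SUB reads R1 in cycle", cycle_in_stage("ID", sub_issue))
print("stall cycles without forwarding:", raw_stall_cycles(add_issue, sub_issue))
```

Under these assumptions the SUB would read R1 two cycles before the ADD writes it, which is the stale read described above.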
Data hazards may be classified as one of three types, depending on the order of read and write accesses in the instructions. By convention, the hazards are named by the ordering in the program that must be preserved by the pipeline. Considering two instructions I and J, with I occurring before J, three types of data hazards may occur.
First is the read-after-write (RAW) hazard. In this case, J tries to read a source before I writes it, so J incorrectly gets the old value. This is the most common type of hazard, and is the type illustrated by Table 1.
Second is the write-after-read (WAR) hazard. In this case, J tries to write a destination before it is read by I, so I incorrectly gets the new value. This hazard occurs when some instructions write results early in the instruction pipeline, while other instructions read a source late in the pipeline, after a later instruction has already written it.
Third is the write-after-write (WAW) hazard. In this case, J tries to write an operand before it is written by I. The writes end up being performed in the wrong order, leaving the value written by I rather than the value written by J in the destination. This hazard is present only in pipelines that write in more than one pipeline stage (or allow an instruction to proceed even when a previous instruction is stalled). The read-after-read (RAR) case, of course, cannot create a hazard.
While the hazards presented above involve register operands, it is also possible for a pair of instructions to create a dependence by writing and reading the same memory location (assuming a register-memory microarchitecture is employed). Cache misses can cause memory references to get out of order if the processor is allowed to continue working on later instructions while an earlier instruction that missed the cache is still accessing memory.
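The three classifications can be expressed as simple comparisons of destination and source registers. The Python sketch below is illustrative only: the `Instr` record and the `data_hazards` function are hypothetical names, and the function reports the dependences that could become hazards. Whether a dependence actually causes a hazard depends on the timing of the particular pipeline.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    sources: Tuple[str, ...]


def data_hazards(i: Instr, j: Instr) -> Set[str]:
    """Classify potential data hazards between I and a later instruction J."""
    hazards = set()
    if i.dest in j.sources:
        hazards.add("RAW")  # J reads what I writes
    if j.dest in i.sources:
        hazards.add("WAR")  # J writes what I reads
    if i.dest == j.dest:
        hazards.add("WAW")  # J writes what I writes
    return hazards


# The code segment of Table 1.
add = Instr("ADD", "R1", ("R2", "R3"))
sub = Instr("SUB", "R4", ("R1", "R5"))
print(data_hazards(add, sub))
```

Two instructions that only read the same register (the RAR case) match none of the three comparisons, so the function returns an empty set for them.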
The data dependency problem in the example shown in Table 1 can be solved with a simple hardware technique called forwarding. Using this technique, the microarchitecture always feeds the arithmetic-logic unit (ALU) result back to the multiplexer at the ALU's input latches. If the forwarding hardware detects that the previous ALU operation has written to the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file, thus avoiding reading the stale data still contained therein. However, it is desirable to reduce the number of dependency cases that require forwarding, since each case requires the inclusion of additional hardware in the microarchitecture.
In this example, each such case requires a latch and a pair of comparators which examine whether the instructions having a dependency relationship share a destination and a source. Two ALU result buffers are needed to hold the ALU results to be stored because each instruction in the exemplary instruction set has two source operands, either of which might encounter a hazard. For ALU operations, the result is always forwarded when the instruction using the result as a source enters the pipeline. The results held in the buffers can be inputs into either port of the ALU, via a pair of multiplexers. The multiplexers can be controlled either by the processor's control unit (which must then track the destinations and sources of all operations in the pipeline) or locally by logic associated with the forwarding operations supported (in which case the bypass buffers will contain tags giving the register numbers for which the values are destined).
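The locally controlled variant, in which each bypass buffer carries a destination tag, can be modeled as follows. This Python sketch is a behavioral illustration rather than a description of any particular hardware: `select_operand`, the newest-first buffer ordering, and the register values are all assumptions made for the example.

```python
def select_operand(src_reg, register_file, bypass_buffers):
    """Return the value for src_reg, preferring a forwarded ALU result.

    bypass_buffers is assumed to be ordered newest first; each entry is a
    (dest_tag, value) pair for a result not yet written to the register file.
    """
    for dest_tag, value in bypass_buffers:
        if dest_tag == src_reg:    # comparator: tag matches the source
            return value           # multiplexer selects the forwarded result
    return register_file[src_reg]  # multiplexer selects the register file


# Hypothetical state: ADD R1,R2,R3 has produced 12, but R1 still holds 0.
register_file = {"R1": 0, "R2": 7, "R3": 5, "R5": 4}
bypass = [("R1", 12)]

# SUB R4,R1,R5 selects each of its two sources independently.
lhs = select_operand("R1", register_file, bypass)
rhs = select_operand("R5", register_file, bypass)
print(lhs - rhs)
```

Each source operand gets its own comparison and selection, which mirrors the pair of multiplexers feeding the two ALU ports.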
In either event, the logic must determine whether an uncompleted instruction wrote a register that is an input to the current instruction. If so, the multiplexer selects are set to choose from the appropriate result register rather than from the standard inputs. Forwarding can be generalized to include passing data from any appropriate architectural element to the functional unit that requires the data. In such a system, data may be forwarded from a stage of one functional unit or other element to the input of another functional unit, rather than just from the output of a functional unit to the input of the same unit. Forwarding allows the resolution of such data dependencies without incurring performance penalties. However, if the immediate forwarding of data is not supported for a given case, resolving the dependency may entail a significant amount of time compared to the normal latencies encountered in instruction processing. Immediate forwarding may be unsupported because of design constraints, infrequency of occurrence, or other reasons. Such situations are the focus of the present invention.
Not all data hazards can be handled without a performance penalty, however. For example, the existence of data hazards in pipelined microarchitectures can make it necessary to stall the pipeline(s) affected, even if forwarding hardware is provided, as there is no way to make the determination early enough (as the addressing cannot be performed in zero time). A stall in a pipelined microarchitecture often requires that some instructions be allowed to proceed, while others are delayed. Typically, when an instruction is stalled, all instructions later in the pipeline (i.e., younger) than the stalled instruction are also stalled. Instructions earlier (i.e., older) than the stalled instruction can continue, but no new instructions are fetched during the stall.
For example, load instructions have a delay (or latency) that cannot be determined prior to their execution because of the auxiliary processing entailed (e.g., endian swaps), unimplemented forwarding, cache misses, or similar reasons. The most common solution to this problem is a hardware construct called a pipeline interlock. In general, a pipeline interlock detects a hazard and stalls the pipeline until the hazard is cleared. In this case, the interlock stalls the pipeline beginning with the instruction that wants to use the data until the sourcing instruction produces it. This delay cycle, called a pipeline stall or bubble, gives the requisite data time to arrive from memory.
Many types of stalls can occur frequently, depending on the architecture involved. For example, in a single pipeline architecture, the typical code-generation pattern for a statement such as A=B+C produces a stall for the load of the second data value, because the addition cannot proceed until the second load has completed (the loads going through an extra stage to account for the accessing of memory). The store need not result in another stall, since the result of the addition can be forwarded to the memory data register. Machines in which the operands of arithmetic operations may come from memory will need to stall the pipeline in the middle of the instruction to wait for the memory access to complete.
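The A=B+C pattern and the interlock behavior can be illustrated with a small issue-time model. The Python sketch below assumes a single in-order pipeline, a load whose result is usable two cycles after issue, and an ALU result usable one cycle after issue through forwarding. These latencies, the register names, and the `issue_cycles` function are hypothetical.

```python
def issue_cycles(program, latency):
    """Return the issue cycle of each instruction in one in-order pipeline.

    program is a list of (op, dest, sources); latency maps an op to the
    number of cycles after issue at which its result can be consumed.
    """
    ready = {}  # register -> first cycle in which its value can be consumed
    cycle = 0
    issued = []
    for op, dest, sources in program:
        cycle += 1  # next issue slot; younger instructions wait behind a stall
        # Interlock: hold the instruction until every source is ready.
        cycle = max([cycle] + [ready.get(s, 0) for s in sources])
        issued.append(cycle)
        if dest is not None:
            ready[dest] = cycle + latency[op]
    return issued


# A = B + C, with B and C in memory.
program = [
    ("LOAD", "R1", ()),            # R1 <- B
    ("LOAD", "R2", ()),            # R2 <- C
    ("ADD", "R3", ("R1", "R2")),   # R3 <- R1 + R2
    ("STORE", None, ("R3",)),      # A <- R3
]
latency = {"LOAD": 2, "ADD": 1}    # assumed latencies
print(issue_cycles(program, latency))
```

With these assumed latencies the ADD issues one cycle late, waiting on the second load, while the STORE issues without a further stall because the ADD result is forwarded.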
What is therefore required is a means of dynamically determining the proper latency period between issuing a first instruction and issuing a second instruction dependent on the first instruction. However, the latency period thus selected should be minimized to maximize throughput because a longer latency period equates to a greater number of cycles per completed instruction.