This invention relates to pipelined computer processors. More particularly, this invention relates to pipelined computer processors that reduce data hazards to provide high processor utilization.
A processor, also known as a central processing unit, processes a set of instructions from a stored program. The processing of an instruction is typically divided into multiple stages, where each stage generally requires one clock cycle to complete and typically requires different hardware within the processor.
For example, the processing of an instruction can be divided into the following stages: fetch, decode, execute, and write-back. At the fetch stage, the processor retrieves an instruction from memory. The instruction is typically encoded as a string of bits that represent input information (e.g., operands), an operation code (“opcode”), and output information (e.g., a destination address). An opcode represents an arithmetic or logic function associated with the operands. Once the instruction is retrieved from memory, a program counter is either incremented for linear program execution or updated to show a branch destination. The program counter contains a pointer to an address in memory from which a next instruction is fetched. At the decode stage, the processor decodes the instruction into an opcode, operands, and a destination. The opcode can include one of the following: add, subtract, multiply, divide, shift, load, store, loop, branch, etc. The operands, depending on the opcode, can be constants, values stored at one or more memory addresses, or the contents of one or more registers. The destination can be a register or a memory address where a result produced from execution of the opcode is stored. At the execute stage, the processor executes the decoded opcode using the operands. For instructions such as add and subtract, the execute stage typically requires one clock cycle. For more complicated instructions, such as multiply and divide, the execute stage typically requires more than one clock cycle. At the write-back stage, the processor stores the result from the execute stage at the specified destination.
Pipelining is a known technique that improves processor performance by overlapping the execution of instructions such that a different instruction occupies each stage of the pipeline during the same clock cycle. For example, while a first instruction is in the write-back stage, a second instruction can be in the execute stage, a third instruction can be in the decode stage, and a fourth instruction can be in the fetch stage. In an ideal situation, one instruction completes processing each clock cycle, and processor utilization is 100%. Processor utilization can be determined by dividing the number of program instructions that complete processing by the number of clock cycles in which those instructions complete processing.
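The overlap described above can be illustrated with a minimal Python sketch. It assumes an ideal four-stage pipeline with no hazards in which each stage takes exactly one clock cycle; the names `STAGES`, `pipeline_schedule`, and `utilization` are hypothetical and chosen only for illustration.

```python
# Stage names follow the four-stage example in the text.
STAGES = ["fetch", "decode", "execute", "write-back"]


def pipeline_schedule(num_instructions):
    """Map each clock cycle to the instruction occupying each stage.

    Assumption: an ideal pipeline with no hazards, one cycle per stage,
    and instruction k (1-based) entering the fetch stage in cycle k.
    """
    schedule = {}
    for index in range(num_instructions):
        for offset, stage in enumerate(STAGES):
            cycle = index + offset + 1
            schedule.setdefault(cycle, {})[stage] = index + 1
    return schedule


def utilization(instructions_completed, clock_cycles):
    """Instructions completed divided by the clock cycles used to complete them."""
    return instructions_completed / clock_cycles


if __name__ == "__main__":
    for cycle, occupancy in sorted(pipeline_schedule(4).items()):
        print(cycle, occupancy)
```

In this sketch, cycle 4 holds instruction 1 in write-back, instruction 2 in execute, instruction 3 in decode, and instruction 4 in fetch, which matches the example in the preceding paragraph.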
Although pipelining can increase throughput (the number of instructions executed per unit time), it increases instruction latency (the time to completely process an instruction). Increases in throughput are restricted by data hazards. A data hazard is a dependence of one instruction on another instruction. An example is a load-use hazard, which occurs when the result of one instruction is needed as input for a subsequent instruction. Instructions (1) and (2) below illustrate a load-use hazard. R0, R1, R2, R3, and R4 represent register contents.
R0 <- R1 + R2  (1)
R3 <- R0 + R4  (2)
In the four-stage pipeline described above, the result of instruction (1) is stored in register R0 and is available at the end of the write-back stage. Data-dependent instruction (2) needs the contents of register R0 at the beginning of the decode stage. If instruction (2) is immediately subsequent to instruction (1) or is separated from instruction (1) by only one instruction, instruction (2) reaches its decode stage before instruction (1) has completed its write-back stage and will therefore retrieve an old value from register R0.
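The timing argument above can be checked with a short Python sketch. It assumes the producing instruction is fetched in cycle 1, that a value written during the write-back cycle becomes readable only in the following cycle, and that the consumer reads its operands at the start of its decode cycle. The function name `reads_stale_value` and its parameters are hypothetical.

```python
def reads_stale_value(distance, writeback_stage=4, read_stage=2):
    """Return True if a dependent instruction would read an old register value.

    distance: how many instruction slots the consumer follows the producer
              (1 means immediately subsequent, 2 means one instruction between).
    writeback_stage / read_stage: 1-based stage positions in the pipeline
              (write-back is stage 4 and decode is stage 2 in the text's example).
    Assumption: a result written in cycle c is visible starting in cycle c + 1.
    """
    producer_writeback_cycle = writeback_stage      # producer fetched in cycle 1
    consumer_read_cycle = distance + read_stage     # consumer fetched in cycle 1 + distance
    return consumer_read_cycle <= producer_writeback_cycle


if __name__ == "__main__":
    for d in (1, 2, 3):
        print(d, reads_stale_value(d))
```

Under these assumptions the hazard exists at distances 1 and 2 and disappears at distance 3, that is, once two other instructions separate the producer from its consumer.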
Software techniques that do not require hardware control for reducing such data hazards are known. One technique eliminates data hazards by exploiting instruction-level parallelism to reorder instructions. To eliminate a data hazard, an instruction and its associated data-dependent instruction are separated by sufficient independent instructions such that a result from the first instruction is available to the data-dependent instruction by the start of the data-dependent instruction's decode stage. However, there is a limit to the amount of instruction-level parallelism possible in a program and, therefore, a limit to the extent that data hazards can be eliminated by instruction reordering.
Data hazards that cannot be eliminated by instruction reordering can be eliminated by introducing one or more null (i.e., no-operation or nop) instructions immediately before the data-dependent instruction. Each nop instruction, which advances in the pipeline, simply delays the processing of the rest of the program by a clock cycle. The addition of nop instructions increases program size and total program execution time, which decreases utilization (since nop instructions do not process any data). For example, when each instruction is data-dependent on an immediately preceding instruction (such that the instructions cannot be reordered), two nop instructions should be inserted between each pair of consecutive program instructions (e.g., A..B..C, where each letter represents a program instruction and each “.” represents a nop instruction). Utilization then falls below 100% even for a processor running in steady state, that is, without taking priming or draining into account (e.g., for instructions A, B, and C, the utilization is 3/7 or 43%; for nine similar instructions, the utilization is 9/25 or 36%). Priming is the initial entry of instructions into the pipeline and draining is the clearing of instructions from the pipeline.
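The utilization figures quoted above can be reproduced with a small Python sketch. It assumes the worst case described in the text, in which every instruction depends on the one immediately before it, so two nop instructions are placed in every gap. The names `insert_nops` and `steady_state_utilization` are hypothetical.

```python
from fractions import Fraction


def insert_nops(program, nops_per_gap=2):
    """Pad a fully dependent instruction sequence with nops between neighbours."""
    padded = []
    for index, instruction in enumerate(program):
        if index > 0:
            padded.extend(["nop"] * nops_per_gap)
        padded.append(instruction)
    return padded


def steady_state_utilization(padded_program):
    """Useful instructions divided by issue slots, ignoring priming and draining."""
    useful = sum(1 for instruction in padded_program if instruction != "nop")
    return Fraction(useful, len(padded_program))


if __name__ == "__main__":
    three = insert_nops(["A", "B", "C"])
    nine = insert_nops(list("ABCDEFGHI"))
    print(three, steady_state_utilization(three))   # 3/7
    print(len(nine), steady_state_utilization(nine))  # 25 slots, 9/25
```

For n such instructions the padded program occupies n + 2(n - 1) slots, so the utilization approaches one third as n grows.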
In addition to software techniques, hardware techniques, such as “data forwarding,” are known. Without data forwarding, the result of an instruction, which is known at the end of the execute stage, is not available as input to another instruction until the end of the write-back stage. Data forwarding forwards that result one cycle earlier so that the result is available as input to another instruction at the end of the execute stage. With data forwarding, an instruction only needs to be separated from a data-dependent instruction by one independent instruction or one nop instruction. For example, in hardware, a state register R0 can hold a register value X. Without data forwarding, a new value Y can be written into R0 during a cycle n (e.g., a write-back stage) such that Y is available at a next cycle (n+1). Because the new value Y may be needed by an instruction in cycle n, control logic associated with data forwarding enables a multiplexer to output the new result Y, making Y available for another instruction one cycle earlier (cycle n). While data forwarding advantageously provides data one cycle earlier (which improves processor utilization), data forwarding hardware requires additional circuit area which increases cost. Data forwarding also increases hardware complexity, which increases design and verification time.
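The multiplexer behavior described above can be modeled in a few lines of Python. This is only a behavioral sketch under simplifying assumptions: the register file is a dictionary, and a write that is in flight during the current cycle is represented as a `(destination, value)` pair. The names `read_operand` and `pending_write` are hypothetical and do not correspond to any particular hardware design.

```python
def read_operand(register_file, register, pending_write=None, forwarding=True):
    """Return the operand value seen by an instruction in the current cycle.

    register_file: dict of register name -> value currently stored.
    pending_write: optional (destination, value) pair being written this cycle;
                   the stored value only changes in the next cycle.
    forwarding:    when True, a matching pending write is selected instead of
                   the stored value, modeling the forwarding multiplexer.
    """
    if forwarding and pending_write is not None:
        destination, new_value = pending_write
        if destination == register:
            return new_value  # multiplexer selects the forwarded result
    return register_file[register]  # multiplexer selects the stored value


if __name__ == "__main__":
    registers = {"R0": "X"}
    in_flight = ("R0", "Y")
    print(read_operand(registers, "R0", in_flight, forwarding=False))  # X
    print(read_operand(registers, "R0", in_flight, forwarding=True))   # Y
```

With forwarding disabled the reader sees the old value X in cycle n; with forwarding enabled it sees the new value Y in the same cycle, one cycle earlier than the register itself would provide it.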
Furthermore, many hazards cannot be resolved by data forwarding (e.g., cases in which a new value cannot be forwarded). In these instances, stalling the pipeline is an alternative hardware method. Stalling allows instructions ahead of a data-dependent instruction to proceed while the processing of that data-dependent instruction is stalled. Once the hazard is resolved, the stalled section of the pipeline is restarted. Stalling the pipeline is analogous to the software technique of introducing nop instructions, except that the hardware stalling technique is automatic and avoids increasing program size. However, stalling also reduces performance and thus utilization.
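The effect of stalling on issue timing can be sketched in Python as follows. The sketch assumes the four-stage pipeline without data forwarding discussed earlier, so a dependent instruction must issue at least three slots after its producer, and it models each instruction as a hypothetical `(destination, sources)` pair. The function name `issue_slots_with_stalls` is likewise an assumption made for illustration.

```python
def issue_slots_with_stalls(program, min_distance=3):
    """Return the issue slot of each instruction when hazards are resolved by stalling.

    program:      list of (destination_register, [source_registers]) tuples.
    min_distance: minimum number of slots between a producer and its consumer
                  (3 for the four-stage pipeline without forwarding).
    """
    last_writer_slot = {}  # register -> issue slot of its most recent producer
    slots = []
    slot = 0
    for destination, sources in program:
        slot += 1  # earliest slot if nothing forces a stall
        for source in sources:
            if source in last_writer_slot:
                # Stall until the producer's result can be read.
                slot = max(slot, last_writer_slot[source] + min_distance)
        slots.append(slot)
        last_writer_slot[destination] = slot
    return slots


if __name__ == "__main__":
    chain = [("R0", ["R1", "R2"]), ("R3", ["R0", "R4"]), ("R5", ["R3"])]
    print(issue_slots_with_stalls(chain))  # [1, 4, 7]
```

The three-instruction chain occupies seven issue slots, the same as the A..B..C example with nop instructions, but the stored program still contains only three instructions.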
In view of the foregoing, it would be desirable to provide a pipelined processor that reduces data hazards such that high processor utilization is attained.