A typical processor inputs (i.e., fetches or receives) instructions from an external memory, and executes the instructions. Most instructions specify operations to be performed using one or more operands (i.e., data values). In general, operands may be included in the instructions themselves (i.e., immediate operands), or may reside either in a register in the processor (i.e., register operands) or in an external memory coupled to the processor (i.e., memory operands).
Processor registers typically have unique addresses, and are often grouped together in register files. Some types of processors (e.g., digital signal processors) generally have multiple separate register files, e.g., address register files with address registers for storing address values, and data register files with data registers for storing data values.
Many modern processors employ a technique called pipelining to execute more software program instructions (instructions) per unit of time. In general, processor execution of an instruction involves fetching the instruction (e.g., from a memory system), decoding the instruction, obtaining needed operands, using the operands to perform an operation specified by the instruction, and saving a result. In a pipelined processor, the various steps of instruction execution are performed by independent units called pipeline stages. In the pipeline stages, corresponding steps of instruction execution are performed on different instructions independently, and intermediate results are passed to successive stages. By permitting the processor to overlap the executions of multiple instructions, pipelining allows the processor to execute more instructions per unit of time.
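The throughput benefit of overlapping can be illustrated with a simple cycle count. The following is a minimal sketch under idealized assumptions that are not stated in the text above: every pipeline stage takes exactly one clock cycle, and no stalls occur. The function names and the five-stage, 100-instruction figures are hypothetical and chosen only for illustration.

```python
def cycles_unpipelined(num_instructions, num_stages):
    """Each instruction finishes all of its steps before the next one starts."""
    return num_instructions * num_stages


def cycles_pipelined(num_instructions, num_stages):
    """Idealized pipeline: one instruction enters per cycle, with no stalls."""
    if num_instructions == 0:
        return 0
    # The first instruction needs num_stages cycles to pass through the
    # pipeline; each later instruction completes one cycle after its predecessor.
    return num_stages + (num_instructions - 1)


# Illustrative figures (assumed): 100 instructions, 5 steps per instruction.
unpipelined = cycles_unpipelined(100, 5)
pipelined = cycles_pipelined(100, 5)
```

Under these assumptions the pipelined processor needs 104 cycles where the unpipelined processor needs 500, which is the sense in which pipelining allows more instructions to be executed per unit of time.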
In general, a “scalar” processor issues instructions for execution one at a time, and a “superscalar” processor is capable of issuing multiple instructions for execution at the same time. A pipelined scalar processor concurrently executes multiple instructions in different pipeline stages; the executions of the multiple instructions are overlapped as described above. A pipelined superscalar processor, on the other hand, concurrently executes multiple instructions in different pipeline stages, and is also capable of concurrently executing multiple instructions in the same pipeline stage. Examples of pipelined superscalar processors include the popular Intel® Pentium® processors (Intel Corporation, Santa Clara, Calif.) and IBM® PowerPC® processors (IBM Corporation, White Plains, N.Y.).
Pipelining changes the relative timing of instructions by overlapping their executions. This deviation from completing an instruction before executing a subsequent instruction can result in certain problems. For example, instructions are often interdependent, and these dependencies often result in “pipeline hazards.” Pipeline hazards result in stalls that prevent instructions from continually entering a pipeline at a maximum possible rate.
A “data dependency” is said to exist between two instructions when one of the instructions requires a value or data produced by the other. A “data hazard” occurs in a pipeline when a first instruction in the pipeline requires a value produced by a second instruction in the pipeline, and the value is not yet available. In this situation, the pipeline is typically stalled until the operation specified by the second instruction is completed and the needed value is produced.
For example, consider execution of the following consecutive instructions in a pipelined processor:
ADD R1,R2,R3
SUB R4,R1,R5
where the ADD instruction specifies an addition operation R1←R2+R3, and the SUB instruction specifies a subtraction operation R4←R1−R5. The register R1 is both a destination of the ADD instruction and a source (i.e., operand) of the SUB instruction. Unless the subtraction operation is delayed until the addition operation is completed and the result is stored in the R1 register, the SUB instruction may read and use the wrong value of R1. In this situation, a data dependency exists between the ADD and SUB instructions, and a data hazard may occur in the processor as a result.
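The dependency between the ADD and SUB instructions can be detected mechanically by comparing the destination register of the earlier instruction with the source registers of the later instruction. The following is a minimal sketch, assuming the three-operand "OP dest,src1,src2" syntax used in the example; the Instr type and the parse and has_data_dependency helpers are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


def parse(text):
    """Parse 'OP dest,src1,src2' (assumed three-operand syntax)."""
    op, operands = text.split(None, 1)
    regs = [r.strip() for r in operands.split(",")]
    return Instr(op, regs[0], tuple(regs[1:]))


def has_data_dependency(earlier, later):
    """True if 'later' reads a register that 'earlier' writes."""
    return earlier.dest in later.srcs


add = parse("ADD R1,R2,R3")
sub = parse("SUB R4,R1,R5")
depends = has_data_dependency(add, sub)  # SUB reads R1, which ADD writes
```

The check covers only the case described above, in which a later instruction requires a value produced by an earlier instruction.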
For example, assume the processor includes a register file containing the registers R1-R5 and an arithmetic logic unit (ALU) for performing addition and subtraction operations, and implements an execution pipeline including an execution stage and a subsequent write back stage. During the execution stage, operations specified by instructions are carried out, and the results are produced. During the subsequent write back stage, the results produced during the execution stage are stored in the register file. When the ADD instruction is in the execution stage, the ALU performs the addition operation, thereby producing a result to be stored in the R1 register during the subsequent write back stage. In order to obtain the correct R1 value from the register file, execution of the SUB instruction must be delayed such that the SUB instruction does not enter the execution stage until after the ADD instruction has completed the write back stage. The pipeline may have to be stalled in this situation, and in any case, an upper performance limit of the processor is reduced due to the required delay.
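The required delay can be expressed as a cycle count. The following is a minimal sketch, assuming that each stage takes one clock cycle and that the write back stage immediately follows the execution stage; the constant and function names are hypothetical.

```python
# Assumed: write back occurs one cycle after execution.
EXECUTE_TO_WRITE_BACK = 1


def earliest_dependent_execute(producer_execute_cycle):
    """Earliest cycle in which a dependent instruction may enter the
    execution stage if it must read its operand from the register file."""
    write_back_cycle = producer_execute_cycle + EXECUTE_TO_WRITE_BACK
    # The dependent instruction waits until write back has completed.
    return write_back_cycle + 1


def stall_cycles(producer_execute_cycle):
    """Cycles lost relative to executing back-to-back with the producer."""
    ideal_cycle = producer_execute_cycle + 1
    return earliest_dependent_execute(producer_execute_cycle) - ideal_cycle
```

Under these assumptions, if the ADD instruction is in the execution stage during cycle 3, the SUB instruction cannot enter the execution stage before cycle 5, one cycle later than it otherwise could.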
A hardware technique sometimes called “forwarding” (or “bypassing” or “short-circuiting”) may be implemented in processors to avoid such data hazards. In the above-described processor, switching logic may be added between outputs of the register file and inputs of the ALU, and a bypass bus may be added between an output of the ALU and an input of the switching logic. The switching logic may be controlled to provide either an output of the register file, or a result previously computed by the ALU, to each input of the ALU. In the above example, the SUB instruction may enter the execution stage while the ADD instruction is in the write back stage. As a result, the required delay is reduced, and the upper performance limit of the processor is increased.
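The behavior of the switching logic can be modeled in a few lines. The following is a minimal sketch, not a description of any particular hardware: the select_operand function, the representation of the bypass bus as a (destination register, value) pair, and the register contents are all assumptions made for illustration.

```python
def select_operand(reg, register_file, bypass):
    """Model of the switching logic at one ALU input.

    'bypass' is None, or a (destination_register, value) pair holding the
    result the ALU produced in the previous cycle.
    """
    if bypass is not None and bypass[0] == reg:
        return bypass[1]        # take the value from the bypass bus
    return register_file[reg]   # otherwise read the register file


# Illustrative register contents (assumed values).
regs = {"R1": 0, "R2": 7, "R3": 5, "R4": 0, "R5": 2}

# Cycle 1: ADD R1,R2,R3 is in the execution stage.
add_result = regs["R2"] + regs["R3"]
bypass = ("R1", add_result)

# Cycle 2: ADD is in the write back stage, so R1 in the register file is
# still stale while SUB R4,R1,R5 is in the execution stage.
sub_with_forwarding = (select_operand("R1", regs, bypass)
                       - select_operand("R5", regs, bypass))
sub_without_forwarding = regs["R1"] - regs["R5"]
```

With the assumed values, the forwarded computation yields 10 (that is, 12 − 2), whereas reading the stale R1 value from the register file would yield −2.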
Many modern processors include multiple functional units, such as multiple ALUs. In processors including multiple functional units and implementing the above-described result forwarding technique, each bypass bus may be required to provide results produced by a functional unit to the switching logic of several functional units, as well as the register file. As the number of functional units is increased, and/or as the size of the register file is increased, the electrical load on each of the bypass buses can quickly become very large. While larger drivers could be used to drive the bypass buses, the amount of time required to accomplish result bypassing is almost certainly increased, and an upper performance limit of the processor is likely reduced.
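The growth in bypass bus loading can be approximated with a simple count. The following is a rough sketch under assumptions not stated above: every functional unit has the same number of operand inputs (two by default), every bypass bus is connected to the switching logic at every operand input of every functional unit, and the register file is counted as a single additional load regardless of its size. The function names are hypothetical.

```python
def loads_per_bypass_bus(num_units, inputs_per_unit=2):
    """Number of destinations driven by one bypass bus (assumed full bypass)."""
    switching_logic_inputs = num_units * inputs_per_unit
    register_file_load = 1  # simplification: one load for the register file
    return switching_logic_inputs + register_file_load


def total_bypass_connections(num_units, inputs_per_unit=2):
    """One bypass bus per functional unit, each driving every destination."""
    return num_units * loads_per_bypass_bus(num_units, inputs_per_unit)
```

In this model, a single functional unit has 3 loads on its bypass bus, whereas four functional units have 9 loads on each of four buses, for 36 connections in total. The load per bus grows linearly and the total wiring grows roughly quadratically with the number of functional units.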