The demand for high performance computers requires that state-of-the-art microprocessors execute instructions in the minimum amount of time. A number of different approaches have been taken to decrease instruction execution time, thereby increasing processor throughput. One way to increase processor throughput is to use a pipeline architecture in which the processor is divided into separate processing stages that form the pipeline. Instructions are broken down into elemental steps that are executed in different stages in an assembly line fashion.
A pipelined processor is capable of executing several different machine instructions concurrently. This is accomplished by breaking down the processing steps for each instruction into several discrete processing phases, each of which is executed by a separate pipeline stage. Hence, each instruction must pass sequentially through each pipeline stage in order to complete its execution. In general, a given instruction is processed by only one pipeline stage at a time, with one clock cycle being required for each stage. Since instructions use the pipeline stages in the same order and typically only stay in each stage for a single clock cycle, an N stage pipeline is capable of simultaneously processing N instructions. When filled with instructions, a processor with N pipeline stages completes one instruction each clock cycle.
The execution rate of an N-stage pipeline processor is theoretically N times faster than that of an equivalent non-pipelined processor. A non-pipelined processor is a processor that completes execution of one instruction before proceeding to the next instruction. In practice, pipeline overheads and other factors somewhat reduce the execution rate advantage that a pipelined processor has over a non-pipelined processor.
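The theoretical advantage described above can be checked with a short calculation. The following is a minimal sketch, not part of any described processor: it assumes every stage takes exactly one clock cycle, that a non-pipelined processor spends N cycles per instruction, and that no stalls occur. The function names are hypothetical.

```python
def pipelined_cycles(num_stages: int, num_instructions: int) -> int:
    """Cycles for an ideal pipeline: fill time plus one cycle per extra instruction."""
    if num_instructions < 1:
        raise ValueError("num_instructions must be at least 1")
    return num_stages + (num_instructions - 1)


def non_pipelined_cycles(num_stages: int, num_instructions: int) -> int:
    """Cycles when each instruction completes every step before the next one starts.

    Assumption: one instruction costs num_stages cycles on the non-pipelined machine.
    """
    if num_instructions < 1:
        raise ValueError("num_instructions must be at least 1")
    return num_stages * num_instructions


def speedup(num_stages: int, num_instructions: int) -> float:
    """Ratio of non-pipelined to pipelined cycle counts for the same batch."""
    return non_pipelined_cycles(num_stages, num_instructions) / pipelined_cycles(
        num_stages, num_instructions
    )
```

Under these assumptions, speedup(7, 1) is 1.0 and speedup(7, 1000) is roughly 6.96, so the ratio approaches N only for long instruction streams that keep the pipeline full.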
An exemplary seven stage processor pipeline may consist of an address generation stage, an instruction fetch stage, a decode stage, a read stage, a pair of execution (E1 and E2) stages, and a write (or write-back) stage. In addition, the processor may have an instruction cache that stores program instructions for execution, a data cache that temporarily stores data operands that otherwise are stored in processor memory, and a register file that also temporarily stores data operands.
The address generation stage generates the address of the next instruction to be fetched from the instruction cache. The instruction fetch stage fetches an instruction for execution from the instruction cache and stores the fetched instruction in an instruction buffer. The decode stage takes the instruction from the instruction buffer and decodes the instruction into a set of signals that can be used directly by subsequent pipeline stages to execute the instruction. The read stage fetches required operands from registers in the register file. The E1 and E2 stages perform the actual program operation (e.g., add, multiply, divide, and the like) on the operands fetched by the read stage and generate the result. The write stage then writes the result generated by the E1 and E2 stages back into the data cache or the register file.
Assuming that each pipeline stage completes its operation in one clock cycle, the exemplary seven stage processor pipeline takes seven clock cycles to process one instruction. As previously described, once the pipeline is full, an instruction can theoretically be completed every clock cycle.
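The fill behavior of the exemplary seven stage pipeline can be sketched as an occupancy table. This is an illustrative model only: the stage abbreviations (AG, IF, DE, RD, E1, E2, WR) and the function name are assumptions introduced here, and the model assumes one instruction enters the pipeline per clock cycle with no stalls.

```python
# Hypothetical abbreviations for the seven stages named in the text.
STAGES = ("AG", "IF", "DE", "RD", "E1", "E2", "WR")


def occupancy(num_instructions: int) -> list:
    """Return table[cycle][stage] = index of the instruction in that stage, or None.

    Assumes instruction i enters the first stage in cycle i and advances
    one stage per cycle without stalling.
    """
    if num_instructions < 1:
        return []
    total_cycles = len(STAGES) + num_instructions - 1
    table = []
    for cycle in range(total_cycles):
        row = []
        for stage_index in range(len(STAGES)):
            instruction = cycle - stage_index
            row.append(instruction if 0 <= instruction < num_instructions else None)
        table.append(row)
    return table
```

In this model, instruction 0 reaches the write stage in the seventh cycle (table index 6), and each following instruction reaches it one cycle later.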
The throughput of a processor also is affected by the size of the instruction set executed by the processor and the resulting complexity of the instruction decoder. Large instruction sets require large, complex decoders in order to maintain a high processor throughput. However, large complex decoders tend to increase power dissipation, die size and the cost of the processor. The throughput of a processor also may be affected by other factors, such as exception handling, data and instruction cache sizes, multiple parallel instruction pipelines, and the like. All of these factors increase or at least maintain processor throughput by means of complex and/or redundant circuitry that simultaneously increases power dissipation, die size and cost.
In many processor applications, the increased cost, increased power dissipation, and increased die size are tolerable, such as in personal computers and network servers that use x86-based processors. These types of processors include, for example, Intel Pentium™ processors and AMD Athlon™ processors.
However, in many applications it is essential to minimize the size, cost, and power requirements of a data processor. This has led to the development of processors that are optimized to meet particular size, cost and/or power limits. For example, the recently developed Transmeta Crusoe™ processor greatly reduces the amount of power consumed by the processor when executing most x86 based programs. This is particularly useful in laptop computer applications. Other types of data processors may be optimized for use in consumer appliances (e.g., televisions, video players, radios, digital music players, and the like) and office equipment (e.g., printers, copiers, fax machines, telephone systems, and other peripheral devices). The general design objectives for data processors used in consumer appliances and office equipment are the minimization of cost and complexity of the data processor.
The throughput of a processor may also be improved by reducing the latency between an instruction that writes a result to a register and a subsequent instruction that reads the result from the same register. If both instructions are in the pipeline at the same time, the read operation cannot be permitted to read the result until after the write operation is complete; otherwise, the data will not be valid. Unfortunately, a read operation (i.e., the read stage) occurs relatively early in the instruction pipeline, whereas the write operation (i.e., the write stage) occurs last. To prevent an invalid read operation from occurring, it is necessary to stall the read operation in the pipeline until after the write operation completes. Unfortunately, this increases latency and reduces the throughput of the data processor.
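The cost of such a stall can be estimated for the exemplary seven stage pipeline. The sketch below is a simplified model, not a description of any particular processor: it assumes one cycle per stage, that the reading instruction enters the pipeline a given number of cycles after the writing instruction, and (by default) that the register file can only be read in the cycle after the write stage. The stage abbreviations, function name, and the same_cycle_read option are hypothetical.

```python
# Hypothetical abbreviations for the seven stages named in the text.
STAGES = ("AG", "IF", "DE", "RD", "E1", "E2", "WR")


def stall_cycles_without_bypass(distance: int, same_cycle_read: bool = False) -> int:
    """Stall cycles for a reader that trails the writer by `distance` cycles.

    Cycle numbers are relative to the cycle in which the writer enters the
    first stage. Assumption: with same_cycle_read=False the new value is
    readable only in the cycle after the writer's write stage.
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    read_index = STAGES.index("RD")
    write_index = STAGES.index("WR")
    earliest_valid_read = write_index + (0 if same_cycle_read else 1)
    natural_read_cycle = distance + read_index  # when the reader would reach RD
    return max(0, earliest_valid_read - natural_read_cycle)
```

Under these assumptions, a reader issued immediately after the writer (distance 1) stalls for three cycles, and no stall is needed once the two instructions are at least four cycles apart.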
The problem of latency in write-followed-by-read operations can be minimized by using bypass circuitry that transfers a result that is still in the processor pipeline directly to the read stage, even before the result is written to the register file. It is recalled from above that in an exemplary data processor, the read stage fetches required operands from the register file, the E1 and E2 stages perform the actual program operation on the operands and generate the result, and the write stage then writes the result generated by the E1 and E2 stages back into the data cache or the register file. To reduce latency, bypass circuitry may be implemented to transfer a result from the E1 stage or the E2 stage (or both) directly to the read stage as soon as the result is available in the E1 or E2 stage. Bypass circuitry also may be implemented to transfer a result from the write stage directly to the read stage as soon as the result is available in the write stage. In this manner, the delay incurred by actually writing the result into the register file and then reading the result back out of the register file is avoided.
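The effect of bypassing on stall cycles can be illustrated with a small timing model. This is a sketch under stated assumptions rather than a description of real bypass hardware: it assumes one cycle per stage, and that a result forwarded from a given stage can be consumed by the read stage in the same cycle in which the writing instruction occupies that stage. The stage abbreviations and function name are hypothetical.

```python
# Hypothetical abbreviations for the seven stages named in the text.
STAGES = ("AG", "IF", "DE", "RD", "E1", "E2", "WR")
BYPASS_SOURCES = ("E1", "E2", "WR")


def stall_cycles_with_bypass(distance: int, source_stage: str) -> int:
    """Stall cycles when the result is forwarded from `source_stage` to the read stage.

    `distance` is how many cycles the reader trails the writer. Assumption:
    the forwarded value is usable by the read stage in the same cycle that
    the writer is in `source_stage`.
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    if source_stage not in BYPASS_SOURCES:
        raise ValueError("source_stage must be one of E1, E2, WR")
    result_cycle = STAGES.index(source_stage)  # cycle the writer is in that stage
    natural_read_cycle = distance + STAGES.index("RD")
    return max(0, result_cycle - natural_read_cycle)
```

In this model, a back-to-back reader (distance 1) needs no stall when the result is forwarded from E1, one stall cycle when it is forwarded from E2, and two when it is forwarded from the write stage. The earlier the stage from which a result can be bypassed, the fewer cycles are lost.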
Unfortunately, as the number of read ports and write ports in a multi-ported register file and the number of pipeline stages increase, the amount and complexity of the bypass circuitry become very great. For example, if the E1 pipeline stage comprises four identical execution units and each one reads two operands to produce one result, the register file needs twelve ports, including eight read ports and four write ports. If bypass circuitry is implemented between the eight operand data paths of the four execution units and the twelve result outputs from the three stages (four results from each of the E1, E2 and write stages), then 96 bypass data paths are required, one for each pairing of an operand data path with a result output. In a 32-bit register file architecture, each of these 96 bypass data paths is 32 bits wide. This results in a very large bypass network.
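The arithmetic in this example can be reproduced with a short calculation. The sketch below simply multiplies out the quantities given above, assuming a full crossbar in which every result output can be bypassed to every operand data path. The function and parameter names are hypothetical, and the total signal count is a derived figure not stated in the text.

```python
def bypass_network_size(
    execution_units: int,
    operands_per_unit: int,
    results_per_unit: int,
    bypass_stages: int,
    width_bits: int,
) -> dict:
    """Count ports and bypass paths for a fully connected bypass network.

    Assumption: every result output of every bypassed stage connects to
    every operand data path.
    """
    read_ports = execution_units * operands_per_unit
    write_ports = execution_units * results_per_unit
    result_outputs = write_ports * bypass_stages
    bypass_paths = read_ports * result_outputs
    return {
        "read_ports": read_ports,
        "write_ports": write_ports,
        "total_ports": read_ports + write_ports,
        "result_outputs": result_outputs,
        "bypass_paths": bypass_paths,
        "bypass_signal_bits": bypass_paths * width_bits,  # derived figure
    }
```

Calling bypass_network_size(4, 2, 1, 3, 32) reproduces the figures above: eight read ports, four write ports, twelve result outputs, and 96 bypass paths. At 32 bits per path, that amounts to 3072 individual signal lines, which shows how quickly the network grows with the number of execution units and bypassed stages.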
Therefore, there is a need in the art for data processors that implement pipeline architectures that minimize the latency in read-after-write operations. In particular, there is a need for pipelined data processors comprising improved bypass circuitry for transferring result data from late pipeline stages to earlier pipeline stages. More particularly, there is a need for an improved execution pipeline that minimizes the amount and complexity of bypass circuitry needed to bypass result data to be written to a multi-ported register file to earlier pipeline stages.