FIG. 1 shows a simplified four-stage pipeline architecture 10 illustrating parallel processing within a RISC microprocessor of the prior art. Architecture 10 has a series of pipeline stages 12 for each pipeline that process instructions i, i1, i2, i3, i4 (i1 is “younger” than i, and so on) by incremental clock cycles 16. As known to those skilled in the art, instructions i are acted upon by individual stages of the pipeline, such as the fetch stage F, the register read stage R, the execute stage E, and the write-back stage W. Within the CPU architecture 10, register files are typically written to, or “loaded,” at the write-back stage W. Other stages may be included within the pipeline, including a detect exception stage D, known in the art, between stages E and W.
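By way of illustration only, the stage-by-cycle progression of FIG. 1 may be sketched as follows. The sketch assumes one instruction enters the pipeline per clock cycle with no stalls; the function names `stage_at` and `occupancy` are hypothetical and appear nowhere in the figures.

```python
# Illustrative model of the four-stage pipeline of FIG. 1.
# Assumption: one instruction issues per cycle and nothing stalls.
STAGES = ["F", "R", "E", "W"]  # fetch, register read, execute, write-back


def stage_at(instr_index, cycle):
    """Return the stage holding instruction `instr_index` at `cycle`, or None."""
    pos = cycle - instr_index  # instruction k enters stage F at cycle k
    return STAGES[pos] if 0 <= pos < len(STAGES) else None


def occupancy(num_instrs, num_cycles):
    """Build a table: one row per instruction, one column per clock cycle."""
    return [[stage_at(i, c) for c in range(num_cycles)] for i in range(num_instrs)]


if __name__ == "__main__":
    names = ["i", "i1", "i2", "i3", "i4"]
    for name, row in zip(names, occupancy(len(names), 8)):
        # "." marks a cycle in which the instruction is not in the pipeline
        print(f"{name:>3}: " + " ".join(stage or "." for stage in row))
```

Each row of the printed table is shifted one cycle to the right of the row above it, reflecting that each younger instruction trails its predecessor by one stage.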
Those skilled in the art also understand that data hazards may occur within the pipeline. These hazards may derive from a number of sources, including data interdependencies. One prior art solution to such data hazards is called “bypassing” or “data forwarding,” as illustrated by the data forwarding logic 20 of FIG. 2. The purpose of data forwarding is to supply the “newest” data to the pipelines. Data forwarding logic 20 is essentially part of each CPU pipeline; it stores the output of the execution unit 22 (shown as an ALU) within temporary registers 24 for input to unit 22, generally through a multiplexer (“mux”) 25, as an operand in subsequent instructions. Once an instruction is finalized, the data is architected into the CPU's register file 26 at the write-back stage, illustrated by feedback line 28. Multiplexers 25 serve to couple data between register file 26, temporary registers 24 and unit 22, as shown. Data forwarding thus provides a performance boost to CPU architectures by reducing execution latency.
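The operand selection performed by data forwarding logic 20 may be sketched, purely for illustration, as follows. The sketch assumes a single ALU that performs only addition, and models temporary registers 24 as a newest-first list; the names `select_operand`, `execute_add`, and `write_back` are hypothetical and are not drawn from FIG. 2.

```python
# Illustrative model of data forwarding (bypassing) as in FIG. 2.
# Assumptions: one ALU, add-only; temporary registers are a newest-first list.

def select_operand(reg, register_file, temp_registers):
    """Model the mux: prefer the newest speculative value, else the register file."""
    for dest, value in temp_registers:  # ordered newest first
        if dest == reg:
            return value  # forwarded (bypassed) value
    return register_file[reg]  # no pending result, read the architected value


def execute_add(dest, src_a, src_b, register_file, temp_registers):
    """Execute dest = src_a + src_b and hold the result speculatively."""
    a = select_operand(src_a, register_file, temp_registers)
    b = select_operand(src_b, register_file, temp_registers)
    temp_registers.insert(0, (dest, a + b))  # newest result goes to the front


def write_back(register_file, temp_registers):
    """Commit the oldest speculative result into the register file."""
    dest, value = temp_registers.pop()  # oldest entry is at the end
    register_file[dest] = value


if __name__ == "__main__":
    rf = {"r1": 1, "r2": 2, "r3": 0, "r4": 0}
    temps = []
    execute_add("r3", "r1", "r2", rf, temps)  # r3 = r1 + r2
    execute_add("r4", "r3", "r1", rf, temps)  # reads r3 before it is written back
    print(temps, rf["r3"])
```

In this sketch the second instruction obtains the value of r3 from the temporary registers while the register file still holds the stale value, which is the latency reduction that data forwarding provides.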
Data within temporary registers 24 are sometimes denoted as “speculative” since the instruction is not committed until the write-back stage 28 to register file 26. FIG. 3 shows another prior art architecture 100 for bypassing through a high-performance RISC processor utilizing a register file 102 with 128 64-bit registers. Register file 102 has 12 read ports processed through a read mux 106, and 8 write ports processed through a write mux 104. In operation, an instruction unit 108 provides instructions to an execution unit 109 with an array of pipeline execution units 110 through a mux 112. Pipeline execution units 110 have execution stages 111a-111n so as to perform, for example, the F, R, E, and W stages described above. Pipeline stage 111n may, for example, architect any of the registers within register file 102 as a write-back stage W, through data bus 114 and write mux 104 (supporting 8 write ports). Individual stages 111 of pipelines 110 may transfer speculative data to other execution units through bypass logic 116 and mux 112; this speculative data may reduce hazards within other individual stages 111 in providing the data forwarding capability for architecture 100. Data may be read from register file 102 through read mux 106 (supporting 12 read ports) and data bus 120.
One difficulty of implementing the bypassing architectures and logic of FIG. 3 stems from the number of stages between register read (R) and register write (W) times the number of instructions in the execution stages (the “execution width”). For a 6-wide execution pipeline, for example, any one stage (e.g., stage 111b) will hold six instructions for the same cycle, plus two load return ports, for a total of eight sources. Accordingly, eight sources times three stages (from R to W) equals twenty-four bypass sources; adding the register file gives twenty-five, effectively requiring a 25-to-1 mux. Moreover, since each instruction has two operands, this relationship is doubled and then multiplied by the number of execution pipelines (6 in this example), resulting in twelve copies of the 25-to-1 mux. Such a design thus generates 25 sources per operand in the pipeline; the mux and bypass logic implementing this design consumes a significant fraction of the total cycles per instruction. The need exists to reduce (a) this time and (b) the size of the associated area used to implement the bypass logic.
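The arithmetic above may be restated, as a sketch only, in the following form. The function and parameter names are hypothetical; the numeric inputs (6-wide execution, two load return ports, three stages from R to W, two operands per instruction) are those of the example just given.

```python
# Illustrative restatement of the bypass mux sizing example.
# All names are hypothetical; the input values come from the example in the text.

def bypass_mux_inputs(exec_width, load_return_ports, stages_r_to_w):
    """Number of inputs each operand mux must select among."""
    sources_per_stage = exec_width + load_return_ports  # 6 + 2 = 8
    bypass_sources = sources_per_stage * stages_r_to_w  # 8 * 3 = 24
    return bypass_sources + 1  # plus the register file itself


def operand_mux_copies(exec_width, operands_per_instr=2):
    """Number of such muxes: one per operand per execution pipeline."""
    return exec_width * operands_per_instr  # 6 * 2 = 12


if __name__ == "__main__":
    inputs = bypass_mux_inputs(exec_width=6, load_return_ports=2, stages_r_to_w=3)
    copies = operand_mux_copies(exec_width=6)
    print(f"{copies} copies of a {inputs}-to-1 mux")  # 12 copies of a 25-to-1 mux
```

Under this model the mux fan-in grows with the product of execution width and pipeline depth, while the number of mux copies grows with execution width alone.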
It is, accordingly, one object of the invention to provide methods and systems for reducing the complexity of bypass logic in the CPU. Other objects of the invention are apparent within the description that follows.