Pipelining is a well-known processor implementation technique whereby multiple instructions are overlapped in execution. Conventional pipelining techniques are described in, for example, John L. Hennessy and David A. Patterson, “Computer Architecture: A Quantitative Approach,” Third Edition, Morgan Kaufmann Publishers, Inc., San Francisco, Calif., 2003.
FIG. 1A shows an example involving the execution of two instructions without any overlap. In this example, the two instructions are an integer add instruction addi r0, r2, 8, and an integer multiplication instruction muli r8, r3, 4. The first instruction, addi, performs an addition of the contents of register r2 and an immediate value 8, and stores the result in register r0. It is assumed for simplicity and clarity of illustration that each of the instructions includes the same four pipeline stages, denoted instruction fetch (IF), read (RD), execute (EX) and writeback (WB).
In the first stage (IF) the instruction is fetched from memory and decoded. In the second stage (RD) the operands are read from the register file. In the third stage (EX) the addition is performed. Finally, in the fourth stage (WB) the result is written back into the register file at location r0. When the addi instruction has completed, the next instruction, muli, is started. The muli instruction performs a multiplication of the contents of register r3 and an immediate value 4, and stores the result in register r8.
FIG. 1B shows the same two instructions but depicts how they may be overlapped using a conventional pipelining technique. Each of the pipeline stages (IF, RD, EX and WB) is generally executed on a clock boundary. The second instruction, muli, may be started on the second clock cycle without requiring additional hardware. The hardware associated with the IF, RD, EX and WB stages is shared between the two instructions, but the stages of one instruction are shifted in time relative to those of the other.
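The difference between the non-overlapped execution of FIG. 1A and the overlapped execution of FIG. 1B can be illustrated with a minimal sketch. It assumes one clock cycle per stage and numbers cycles from 1; the function names are hypothetical and are used only for illustration.

```python
STAGES = ("IF", "RD", "EX", "WB")

def sequential_schedule(instructions):
    """No overlap: each instruction starts only after the previous one finishes WB."""
    schedule = {}
    cycle = 1
    for name in instructions:
        schedule[name] = {}
        for stage in STAGES:
            schedule[name][stage] = cycle
            cycle += 1
    return schedule

def pipelined_schedule(instructions):
    """Overlap: each instruction starts one cycle after the previous one."""
    schedule = {}
    for i, name in enumerate(instructions):
        schedule[name] = {stage: i + 1 + s for s, stage in enumerate(STAGES)}
    return schedule

def total_cycles(schedule):
    """Cycle on which the last stage of the last instruction completes."""
    return max(max(stages.values()) for stages in schedule.values())

program = ["addi", "muli"]
print(total_cycles(sequential_schedule(program)))  # 8 cycles without overlap
print(total_cycles(pipelined_schedule(program)))   # 5 cycles with overlap
```

Under these assumptions, the two independent instructions complete in five cycles when overlapped rather than eight, with muli occupying each stage one cycle after addi.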
FIG. 2 illustrates a complication that may arise in a pipeline implementation. In this example, the muli instruction requires as an operand the contents of register r0, and thus cannot read r0 until the addi instruction has computed and written back the result of the addition operation to r0. Processing of the muli instruction begins on the next clock cycle following the start of the addi instruction, but this process must stall and wait for the execution and writeback stages of the addi instruction to complete. The empty cycles during which the muli instruction must wait for its operands to become available are typically called “bubbles” in the pipeline.
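The stall described above can be sketched as a small scheduling model. It is illustrative only and rests on several assumptions not stated in the text: one cycle per stage, no bypassing, a register becoming readable on the cycle after its writeback, and an operand form of muli r8, r0, 4 for the dependent instruction. The function name is hypothetical.

```python
def schedule_with_stalls(program):
    """Schedule (name, dest, sources) tuples on an in-order IF/RD/EX/WB
    pipeline without bypassing.

    Returns ({name: {stage: cycle}}, {name: bubble_count}).
    """
    readable_from = {}  # register -> first cycle on which its new value can be read
    schedule, bubbles = {}, {}
    prev_rd = 0
    for i, (name, dest, sources) in enumerate(program):
        fetch = i + 1  # one instruction is fetched per cycle
        # Earliest RD cycle if there were no data dependency (stays in order).
        unstalled_rd = max(fetch + 1, prev_rd + 1)
        # RD must also wait until every source register is readable.
        rd = max([unstalled_rd] + [readable_from.get(r, 0) for r in sources])
        schedule[name] = {"IF": fetch, "RD": rd, "EX": rd + 1, "WB": rd + 2}
        bubbles[name] = rd - unstalled_rd
        # Assumption: the result is readable on the cycle after WB (WB is rd + 2).
        readable_from[dest] = rd + 3
        prev_rd = rd
    return schedule, bubbles

program = [
    ("addi", "r0", ["r2"]),  # addi r0, r2, 8
    ("muli", "r8", ["r0"]),  # assumed form: muli r8, r0, 4
]
schedule, bubbles = schedule_with_stalls(program)
print(schedule["muli"])  # {'IF': 2, 'RD': 5, 'EX': 6, 'WB': 7}
print(bubbles["muli"])   # 2
```

In this model the muli instruction is fetched on cycle 2 but cannot read r0 until cycle 5, so two bubbles appear and the pair completes in seven cycles instead of five.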
In single-threaded processors, a common method for reducing pipeline bubbles is known as bypassing, whereby instead of writing the computed value back to the register file in the WB stage, the result is forwarded directly to the processor execution unit that requires it. This reduces but does not eliminate bubbles in deeply pipelined machines. Also, it generally requires dependency checking and bypassing hardware, which unduly increases processor cost and complexity.
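The effect of bypassing on a dependent pair of instructions can be estimated with a simple model. This is a sketch under stated assumptions rather than a description of any particular processor: the execute stage is assumed to take ex_latency cycles (standing in for a deeper pipeline), a forwarded result is assumed usable on the cycle after the producer finishes executing, and a written-back result is assumed readable on the cycle after writeback. The function name and parameters are hypothetical.

```python
def bubbles_for_dependent_pair(ex_latency, bypass):
    """Bubbles seen by a consumer issued one cycle after its producer.

    Producer: IF on cycle 1, RD on cycle 2, EX on cycles 3 .. 2 + ex_latency,
    WB on the following cycle.
    Consumer: IF on cycle 2; unimpeded, it would RD on cycle 3 and start EX on 4.
    """
    producer_ex_done = 2 + ex_latency
    producer_wb = producer_ex_done + 1
    consumer_rd_unstalled = 3
    consumer_ex_unstalled = consumer_rd_unstalled + 1

    if bypass:
        # Assumption: the forwarded value feeds an EX that starts on the
        # cycle after the producer's last EX cycle.
        consumer_ex = max(consumer_ex_unstalled, producer_ex_done + 1)
        return consumer_ex - consumer_ex_unstalled

    # Without bypassing, RD waits until the cycle after the producer's WB.
    consumer_rd = max(consumer_rd_unstalled, producer_wb + 1)
    return consumer_rd - consumer_rd_unstalled

for latency in (1, 3):
    print(latency,
          bubbles_for_dependent_pair(latency, bypass=False),
          bubbles_for_dependent_pair(latency, bypass=True))
# 1 2 0
# 3 4 2
```

With a single-cycle execute stage the model shows bypassing removing both bubbles, while with a three-cycle execute stage it reduces the bubbles from four to two, consistent with the observation that bypassing reduces but does not eliminate bubbles in deeper pipelines.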
It is also possible to reduce pipeline stalls through the use of multithreading. Multithreaded processors are processors that support simultaneous execution of multiple distinct instruction sequences or “threads.” Conventional threading techniques are described in, for example, M. J. Flynn, “Computer Architecture: Pipelined and Parallel Processor Design,” Jones and Bartlett Publishers, Boston, Mass., 1995, and G. A. Blaauw and Frederick P. Brooks, “Computer Architecture: Concepts and Evolution,” Addison-Wesley, Reading, Mass., 1997, both of which are incorporated by reference herein.
However, these and other conventional approaches generally do not allow multiple concurrent pipelines per thread, nor do they support pipeline shifting.
Accordingly, techniques are needed which can provide improved pipelining in a multithreaded digital data processor.