The demand for high performance computers requires that state-of-the-art microprocessors execute instructions in the minimum amount of time. A number of different approaches have been taken to decrease instruction execution time, thereby increasing processor throughput. One way to increase processor throughput is to use a pipeline architecture in which the processor is divided into separate processing stages that form the pipeline. Instructions are broken down into elemental steps that are executed in different stages in an assembly line fashion.
A pipelined processor is capable of executing several different machine instructions concurrently. This is accomplished by breaking down the processing steps for each instruction into several discrete processing phases, each of which is executed by a separate pipeline stage. Hence, each instruction must pass sequentially through each pipeline stage in order to complete its execution. In general, a given instruction is processed by only one pipeline stage at a time, with one clock cycle being required for each stage. Since instructions use the pipeline stages in the same order and typically only stay in each stage for a single clock cycle, an “n” stage pipeline is capable of simultaneously processing “n” instructions. When filled with instructions, a processor with “n” pipeline stages completes one instruction each clock cycle.
The execution rate of a pipelined processor is theoretically up to “n” times faster than that of an equivalent non-pipelined processor. A non-pipelined processor is a processor that completes execution of one instruction before proceeding to the next instruction. In practice, pipeline overheads and other factors somewhat reduce the execution rate advantage that a pipelined processor has over a non-pipelined processor.
A simplified five stage processor pipeline may consist of an instruction fetch (IF) stage, a decode (DCD) stage, an operand fetch (OF) stage, an execute (EXE) stage, and a write-back (WBK) stage. In addition, the processor may have an instruction cache that stores program instructions for execution, a data cache that temporarily stores data operands that otherwise are stored in processor memory, and a register stack that also temporarily stores data operands.
The instruction fetch stage fetches an instruction for execution from the instruction cache and stores it into an instruction fetch buffer (IFB). The decode stage takes the instruction from the instruction fetch buffer in the instruction cache and decodes the instruction into a set of signals that can be used directly by subsequent pipeline stages. The operand fetch stage fetches required operands from the data cache or registers in the register stack. The execute stage performs the actual programmed operation (e.g., add, multiply, divide, and the like) on the operands fetched by the operand fetch stage and generates the result. The write-back stage then writes the result generated by the execute stage back into the data cache or the register stack.
Assuming that each pipeline stage completes its operation in one clock cycle, the described five stage processor pipeline takes five clock cycles to process one instruction. As previously described, once the pipeline is full, an instruction can theoretically be completed every clock cycle.
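The cycle counts described above can be illustrated with a minimal sketch. It assumes an ideal pipeline with one clock cycle per stage and no stalls; the function names are hypothetical and chosen only for illustration.

```python
def pipelined_cycles(num_instructions: int, num_stages: int = 5) -> int:
    """Clock cycles needed by an ideal, stall-free pipeline (assumed model)."""
    if num_instructions < 1:
        raise ValueError("num_instructions must be at least 1")
    # The first instruction takes num_stages cycles to pass through the
    # pipeline; each later instruction completes one cycle after the
    # instruction ahead of it.
    return num_stages + (num_instructions - 1)


def non_pipelined_cycles(num_instructions: int, num_stages: int = 5) -> int:
    """Clock cycles needed when each instruction finishes before the next starts."""
    if num_instructions < 1:
        raise ValueError("num_instructions must be at least 1")
    return num_stages * num_instructions


def speedup(num_instructions: int, num_stages: int = 5) -> float:
    """Ratio of non-pipelined to pipelined cycle counts."""
    return non_pipelined_cycles(num_instructions, num_stages) / pipelined_cycles(
        num_instructions, num_stages
    )
```

For a single instruction the sketch yields five cycles in either case. For long instruction sequences the ratio approaches, but never reaches, the number of stages.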
A typical pipeline program instruction sequence may have several instructions with dependencies on the results and effects of preceding instructions. As a result, there is normally some degree of interaction between instructions which occupy different stages of the pipeline. For example, a given instruction may need to use the data result from an earlier instruction as its operands.
Consider the previously described five stage pipeline with three registers (A, B, C) in the register file and a hypothetical program sequence with three instructions. The first two instructions load one operand each from memory into registers A and B, respectively. The third instruction increments the contents of register A. The program may be represented as follows:
1) Ld Mem1, A (load contents of Mem. location 1 into Reg. A)
2) Ld Mem2, B (load contents of Mem. location 2 into Reg. B)
3) Inc A (increment the contents of Reg. A)
For this example, the increment instruction has a dependency on the first instruction. The increment instruction must wait until the first instruction loads register A with the content of memory location 1 before it can increment the content of register A.
In the prior art, there are generally three ways of dealing with such data dependencies. First, there may not be a problem and no special action may be required. It is possible that the dependent instruction (increment, for example) is not affected. This is the case if the pipeline has processed the first instruction to the point that the memory data is loaded into register A by the time the third instruction is ready to increment data in register A. This would be the case if two or more clock cycles provide enough time for the first instruction to proceed through the execute and write-back stages and to write the result into register A before the third instruction needs the new value of register A when it enters the operand fetch stage.
Second, it may be possible to delay processing of one or more stages so that data is available when required. For example, the increment instruction requires the result of the first load instruction by the time the increment instruction is in the operand fetch stage. By holding the increment instruction in the decode stage for two clock cycles instead of one, the first instruction can proceed through the execute and write-back stages, which results in register A being loaded with the required operand data by the time the increment instruction is in the operand fetch stage. Thus, a one clock cycle delay is incurred for the increment instruction.
Third, it may be possible to transfer data directly between pipeline stages by eliminating or bypassing selected operations. For example, the increment instruction requires the result of the first instruction when the increment instruction is in the operand fetch stage and the first instruction is in the write-back stage. Under these conditions, the load instruction is about to write the required data from memory into register A. However, the data from memory that is destined for register A is actually still in temporary registers in the write-back stage, ready for transfer to register A. At this point, the contents of the temporary registers may be sent directly to the operand fetch stage for use by the increment instruction, thereby bypassing register A. This bypass prevents the increment instruction from incurring any delay.
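The timing behind the delay and bypass cases can be checked with a small sketch. It assumes that one instruction enters the pipeline per clock cycle, that a register written in the write-back stage can be read by the operand fetch stage in the following cycle, and that bypass data is available only from the write-back stage. The names schedule and hazard_delay are hypothetical.

```python
STAGES = ["IF", "DCD", "OF", "EXE", "WBK"]


def schedule(issue_cycle: int, decode_stall: int = 0) -> dict:
    """Map each stage to the last clock cycle the instruction spends in it."""
    cycles = {}
    cycle = issue_cycle
    for stage in STAGES:
        if stage == "DCD":
            cycle += decode_stall  # extra cycles held in the decode stage
        cycles[stage] = cycle
        cycle += 1
    return cycles


def hazard_delay(producer_issue: int, consumer_issue: int) -> dict:
    """Stall cycles the consumer needs, with and without a bypass path."""
    producer = schedule(producer_issue)
    consumer = schedule(consumer_issue)
    # Positive gap: the consumer reaches operand fetch before the producer
    # reaches write-back. Zero gap: both happen in the same cycle.
    gap = producer["WBK"] - consumer["OF"]
    return {
        "stall_without_bypass": max(0, gap + 1),
        "stall_with_bypass": max(0, gap),
    }
```

With the first load entering the pipeline in cycle 1 and the increment in cycle 3, hazard_delay(1, 3) reports a one cycle stall without a bypass and no stall with a bypass, which agrees with the second and third cases described above.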
The described delay and bypass scenarios require logic that detects when one instruction depends upon the result of a previous one. For simple pipeline processors such as the one described, the detection of data dependence between instructions is performed via a mechanism generally known as a register scoreboard. A register scoreboard comprises logic that tracks which of the processor registers (A, B, C in the example) are about to be updated by instructions that are active in the pipeline. The register scoreboard also identifies which pipeline stage is occupied by the respective instructions during any given clock cycle. Thus, a subsequent instruction can check the register scoreboard to determine whether its required source operands (A, B, or C) are in the process of being generated by preceding instructions in the pipeline. If the source operands are being generated, the dependent instruction may then perform a stalling (delay) operation or a bypassing operation to obtain the required data.
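A register scoreboard of the kind described may be sketched as follows. This is a simplified illustration, not a description of any particular processor: the class and method names are hypothetical, at most one pending writer is tracked per register, and a bypass is assumed to be possible only when the producing instruction is in the write-back stage.

```python
class RegisterScoreboard:
    """Tracks which registers are about to be updated by in-flight instructions."""

    def __init__(self):
        # Register name -> pipeline stage currently occupied by the
        # instruction that will write that register.
        self.pending = {}

    def issue(self, dest: str, stage: str = "IF") -> None:
        """Record that an instruction entering the pipeline will write dest."""
        self.pending[dest] = stage

    def advance(self, dest: str, stage: str) -> None:
        """Record that the instruction writing dest has moved to a new stage."""
        self.pending[dest] = stage

    def retire(self, dest: str) -> None:
        """Clear the entry once the register has actually been written."""
        self.pending.pop(dest, None)

    def check(self, source: str) -> str:
        """Decide what an instruction that reads source should do."""
        stage = self.pending.get(source)
        if stage is None:
            return "proceed"  # no in-flight instruction writes this register
        if stage == "WBK":
            return "bypass"   # result is in write-back temporary registers
        return "stall"        # result is not yet available anywhere
```

In the three-instruction example, the increment instruction would call check("A") when it is ready to fetch its operand and would stall, bypass, or proceed according to the stage reached by the first load.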
This process works well in the most common forms of microprocessor architectures which have on-chip registers with fixed names that are referenced by fixed allocated codes embedded in the instructions. For these architectures, it is relatively simple to check whether a source register requested by an instruction in an early pipeline stage is also the destination (target) register of an instruction in a later pipeline stage. An instruction in a later pipeline stage precedes an instruction in an earlier pipeline stage. The processor may perform this analysis by examining source register and storage (target) register addresses of each instruction in the pipeline for a match. When a match is found, appropriate delay or bypass actions are taken.
Pipeline processors implemented with a register-stack based architecture have a more complex problem when addressing data dependencies between pipelined instructions. This problem is caused by the fact that the addresses of architectural registers which store the data results of particular operations (i.e., those registers seen by the program) are no longer fixed, but dynamically change in response to instruction execution.
For example, a register-stack based architecture is generally based on the premise that the architectural registers are organized as a last-in-first-out (LIFO) stack. Instructions requiring source operands in a register-stack architecture implicitly use the data from the top of the stack by first popping it from the top of the stack and then by pushing any result to the top of the stack when execution is completed. Thus, all stack entries move toward the top of the stack when operand data is removed from the stack and toward the bottom of the stack when result data is added to the stack.
To illustrate this problem, consider the previous instruction series as implemented for a register-stack based processor. For this example, registers A, B, and C are organized as a register-stack, with register A always being at the top of the stack, register B in the middle, and register C at the bottom.
1) Ld Mem1, A (load contents of Mem. location 1 into Reg. A)
2) Ld Mem2, A (push old contents of Reg. A into Reg. B, then store data from Mem. location 2 into Reg. A)
3) Inc B (explicitly increment the contents of Reg. B)
The fundamental difference between this example and the previous example is that, although the data in memory 1 was originally loaded into register A, it is in register B by the time the increment B instruction is executed. Therefore, the first and third instructions of both examples have the same dependency even though the explicitly-named source and destination registers are different.
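The movement of data between architectural registers in this example can be traced with a short sketch. It assumes a three-entry stack in which every load pushes its data onto the top of the stack (register A) and the entry pushed off the bottom is discarded. The class name RegisterStack and the tags "Mem1" and "Mem2" are hypothetical labels that stand in for the loaded data.

```python
class RegisterStack:
    """Three architectural registers organized as a stack, with A at the top."""

    NAMES = ["A", "B", "C"]

    def __init__(self):
        self.slots = [None, None, None]

    def push(self, tag) -> None:
        # Every existing entry moves one position toward the bottom of the
        # stack; the entry that was at the bottom is discarded.
        self.slots = [tag] + self.slots[:-1]

    def read(self, name: str):
        """Return the tag currently held by the named register."""
        return self.slots[self.NAMES.index(name)]

    def locate(self, tag):
        """Return the name of the register that currently holds tag, if any."""
        for name, slot in zip(self.NAMES, self.slots):
            if slot == tag:
                return name
        return None


stack = RegisterStack()
stack.push("Mem1")                       # 1) Ld Mem1, A
location_after_first = stack.locate("Mem1")
stack.push("Mem2")                       # 2) Ld Mem2, A
location_after_second = stack.locate("Mem1")
```

After the first load the data tagged "Mem1" is found in register A, and after the second load it is found in register B, so a dependency check that compares only the register names written in the instructions would miss the relationship between the first and third instructions.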
Thus, the register stack-based architecture adds complexity to the pipeline data dependency problem, since the architectural registers (i.e., those registers seen by the program) dynamically change as each subsequent instruction is executed in the pipeline. In fact, the architectural register in which the required data result resides can be changed by any or all intervening instructions between the instruction creating the data and the instruction requiring the data.
Therefore, there is a need in the art for improved pipelined processors that are less susceptible to data dependency problems between different stages of the instruction pipeline. In particular, there is a need for an improved register stack-based architecture for pipelined processors that does not incur a performance penalty when data dependencies exist between different stages of the instruction pipeline. More particularly, there is a need for a register stack-based, pipelined processor that is capable of handling data dependencies without incurring a performance penalty even though the addresses of its architectural registers may change dynamically.