Modern computer systems utilize a variety of different microprocessor architectures to perform program execution. Each microprocessor architecture is configured to execute programs made up of a number of macro instructions and micro instructions. Many macro instructions are translated or decoded into a sequence of micro instructions before processing. Micro instructions are simple machine instructions that can be executed directly by a microprocessor.
To increase processing power, most microprocessors use multiple pipelines, such as integer pipelines and load/store pipelines, to process the macro and micro instructions. Typically, an integer pipeline consists of multiple stages. Each stage in an integer pipeline operates in parallel with the other stages, but each stage operates on a different macro or micro instruction.
FIG. 1 shows an instruction fetch and issue unit, having an instruction fetch stage (I stage) 105 and a pre-decode stage (PD stage) 110, coupled to a typical four stage integer pipeline 120 for a microprocessor. Integer pipeline 120 comprises a decode stage (D stage) 130, an execute one stage (E1 stage) 140, an execute two stage (E2 stage) 150, and a write back stage (W stage) 160. Instruction fetch stage 105 fetches instructions to be processed. Pre-decode stage 110 groups and issues instructions to one or more pipelines. Ideally, instructions are issued into integer pipeline 120 every clock cycle. Each instruction passes through the pipeline and is processed by each stage as necessary. Thus, during ideal operating conditions, integer pipeline 120 simultaneously processes four instructions. However, many conditions, as explained below, may prevent the ideal operation of integer pipeline 120.
Decode stage 130 decodes the instruction and gathers the source operands needed by the instruction being processed in decode stage 130. Execute one stage 140 and execute two stage 150 perform the function of the instruction. Write back stage 160 writes the appropriate result value into the register file. Pipeline 120 can be enhanced by including forwarding paths between the various stages of integer pipeline 120 as well as forwarding paths between stages of other pipelines. For brevity and clarity, forwarding paths, which are well known in the art, are not described in detail herein.
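The ideal operation of integer pipeline 120 described above can be illustrated with a minimal sketch. The function name `ideal_occupancy`, the stage labels, and the convention that instruction 0 enters the decode stage at cycle 0 are assumptions made for illustration only; they do not describe any particular hardware implementation.

```python
# Stage labels for the four stage integer pipeline (D, E1, E2, W).
STAGES = ("D", "E1", "E2", "W")


def ideal_occupancy(num_instructions, stages=STAGES):
    """Return one {stage: instruction index} dict per clock cycle.

    Assumes one instruction is issued every clock cycle and that
    no stage ever stalls (the ideal case).
    """
    total_cycles = num_instructions + len(stages) - 1
    timeline = []
    for cycle in range(total_cycles):
        row = {}
        for depth, stage in enumerate(stages):
            # Instruction i reaches the stage at position `depth`
            # in cycle i + depth.
            instr = cycle - depth
            if 0 <= instr < num_instructions:
                row[stage] = instr
        timeline.append(row)
    return timeline


if __name__ == "__main__":
    for cycle, row in enumerate(ideal_occupancy(6)):
        print(cycle, row)
```

In this sketch, once the pipeline has filled (cycle 3 onward, while instructions remain), every stage holds a different instruction, which corresponds to the four instructions processed simultaneously under ideal conditions.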
FIG. 2 shows a typical four stage load/store pipeline 200 for a microprocessor coupled to instruction fetch stage 105 and pre-decode stage 110. Load/store pipeline 200 includes a decode stage (D stage) 230, an execute one stage (E1 stage) 240, an execute two stage (E2 stage) 250, and a write back stage (W stage) 260. Load/store pipeline 200 is specifically tailored to perform load and store instructions. By including both a load/store pipeline and an integer pipeline, overall performance of a microprocessor is enhanced because the load/store pipeline and the integer pipeline can operate in parallel. Many processors may even include multiple load/store pipelines, multiple integer pipelines, as well as other pipelines to further increase processing power. Decode stage 230 decodes the instruction and reads the register file for the needed information regarding the instruction. Execute one stage 240 calculates memory addresses for the load or store instructions. For store instructions, execute two stage 250 stores the appropriate value into memory. For load instructions, execute two stage 250 retrieves information from the appropriate memory location. For register load operations, write back stage 260 writes the appropriate value into a register file.
Ideally, integer pipeline 120 and load/store pipeline 200 can execute instructions every clock cycle. However, many situations may occur that cause parts of integer pipeline 120 or load/store pipeline 200 to stall, which degrades the performance of the microprocessor. FIGS. 3(a)-3(f) illustrate a load-use data dependency problem, which causes parts of integer pipeline 120 to stall. Load-use data dependency problems are caused by the issuance of a load instruction followed by an instruction that requires the data being loaded by the load instruction. Specifically, FIGS. 3(a)-3(f) illustrate an instruction “LD D0, [A0]” followed by an instruction “ADD D1, D0, #1”. “LD D0, [A0]” causes the value at address A0 to be loaded into data register D0. “ADD D1, D0, #1” adds one to the value in data register D0 and stores the result in data register D1. Thus, the add instruction requires the data (the new value of data register D0) from the load instruction to properly calculate the new value for data register D1. For clarity, FIGS. 3(a)-3(f) omit instruction fetch stage 105 and pre-decode stage 110. To avoid confusion, only two instructions are shown in FIGS. 3(a)-3(f). In actual use, other instructions would usually be processed simultaneously in other stages of the pipelines. As shown in FIG. 3(a), instruction “LD D0, [A0]” is first processed in decode stage 230 of load/store pipeline 200. Then, as shown in FIG. 3(b), instruction “LD D0, [A0]” is processed in execute one stage 240 of load/store pipeline 200 and instruction “ADD D1, D0, #1” is processed in decode stage 130 of integer pipeline 120. If instruction “ADD D1, D0, #1” were allowed to propagate through integer pipeline 120, the current value in data register D0 would be used in the add instruction rather than the new value to be loaded from address A0. Thus, decode stage 130, which is configured to detect load-use data dependency problems, holds instruction “ADD D1, D0, #1” in decode stage 130. 
Because decode stage 130 is full, a pipeline stall at decode stage 130 of integer pipeline 120 occurs. Thus, pre-decode stage 110 cannot issue additional instructions into integer pipeline 120.
Then, as shown in FIG. 3(c), instruction “LD D0, [A0]” is processed in execute two stage 250 of load/store pipeline 200. Because the new value of data register D0 is still not available, decode stage 130 continues to hold instruction “ADD D1, D0, #1”. The stall at decode stage 130 of integer pipeline 120 therefore persists, and pre-decode stage 110 still cannot issue additional instructions into integer pipeline 120.
In most microprocessors, timing constraints prohibit forwarding from execute two stage 250. Thus, as shown in FIG. 3(d), instruction “ADD D1, D0, #1” remains in decode stage 130 when instruction “LD D0, [A0]” proceeds to write back stage 260.
After instruction “LD D0, [A0]” is processed in write back stage 260, the new value for data register D0 becomes available and integer pipeline 120 can process instruction “ADD D1, D0, #1”. Therefore, as shown in FIGS. 3(e), 3(f) and 3(g), after the new data value for data register D0 is available, instruction “ADD D1, D0, #1” is processed through execute one stage 140, execute two stage 150 and write back stage 160.
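The stall sequence described above can be reproduced with a small cycle-level sketch. The function name `load_use_timeline` and its parameters are hypothetical. The sketch assumes that cycle 0 corresponds to FIG. 3(a), that there is no forwarding from execute two stage 250, and that the dependent instruction may enter its execute one stage only in the cycle after the load has been processed in the write back stage.

```python
# Stage order shared by the load/store pipeline and the integer pipeline.
STAGES = ("D", "E1", "E2", "W")


def load_use_timeline(load_decode_cycle=0, use_decode_cycle=1):
    """Model a load followed by a dependent integer instruction.

    Returns (load_schedule, use_schedule, stall_cycles), where each
    schedule maps a stage label to the cycle in which the instruction
    is first processed in that stage.
    """
    # The load never stalls: it advances one stage per cycle.
    load = {stage: load_decode_cycle + i for i, stage in enumerate(STAGES)}

    # Assumption: the loaded value is usable only after the load has
    # been processed in W, so the dependent instruction leaves D one
    # cycle later at the earliest.
    unstalled_e1 = use_decode_cycle + 1
    actual_e1 = max(unstalled_e1, load["W"] + 1)

    use = {
        "D": use_decode_cycle,  # held in D until actual_e1 - 1
        "E1": actual_e1,
        "E2": actual_e1 + 1,
        "W": actual_e1 + 2,
    }
    return load, use, actual_e1 - unstalled_e1


if __name__ == "__main__":
    load, use, stalls = load_use_timeline()
    print("LD :", load)
    print("ADD:", use)
    print("stall cycles:", stalls)
```

With the default arguments, the add instruction is held in the decode stage for two extra cycles, and during those cycles no further instruction can be issued into the integer pipeline.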
As explained above, stalling a pipeline or parts of a pipeline degrades the overall processing power of a microprocessor. Because load-use data dependency problems are quite common, integer pipelines have been modified to process load-use instructions without stalling. FIG. 4 illustrates an integer pipeline 400 coupled to instruction fetch stage 105 and pre-decode stage 110. Integer pipeline 400 can be used with load/store pipeline 200 to avoid stalling on load-use instructions. Integer pipeline 400, which is similar to integer pipeline 120, includes all the stages of integer pipeline 120 and adds two buffer stages. Specifically, integer pipeline 400 includes a buffer one stage (B1 stage) 425 and a buffer two stage (B2 stage) 427 preceding decode stage 130. Generally, no processing is performed in buffer one stage 425 and buffer two stage 427. However, in some microprocessors, some pre-decoding is performed in buffer one stage 425 and buffer two stage 427. For consistency, similar parts performing similar functions in different figures are given the same reference numerals. Thus, decode stage 130 is used in both integer pipeline 120 of FIG. 1 and integer pipeline 400 of FIG. 4.
FIGS. 5(a)-(f) illustrate the processing of instruction “LD D0, [A0]” followed by the instruction “ADD D1, D0, #1” using integer pipeline 400 and load/store pipeline 200. As shown in FIG. 5(a), instruction “LD D0, [A0]” is first processed in decode stage 230 of load/store pipeline 200. Then, as shown in FIG. 5(b), instruction “LD D0, [A0]” is processed in execute one stage 240 of load/store pipeline 200 and instruction “ADD D1, D0, #1” is stored in buffer one stage 425 of integer pipeline 400. Then, as shown in FIG. 5(c), instruction “LD D0, [A0]” is processed in execute two stage 250 of load/store pipeline 200 and instruction “ADD D1, D0, #1” is stored in buffer two stage 427 of integer pipeline 400. As shown in FIG. 5(d), instruction “LD D0, [A0]” is next processed in write back stage 260 and instruction “ADD D1, D0, #1” proceeds to decode stage 130. As explained above, the new value for data register D0 becomes available when instruction “LD D0, [A0]” is processed by write back stage 260. Thus, as shown in FIG. 5(e), instruction “ADD D1, D0, #1” can proceed to execute one stage 140 without stalling integer pipeline 400. As shown in FIGS. 5(f) and 5(g), instruction “ADD D1, D0, #1” is processed through execute two stage 150 and write back stage 160.
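The effect of the two buffer stages can be compared against the four stage pipeline with a minimal sketch. The function name `schedule_dependent`, the stage tuples, and the cycle numbering (cycle 0 corresponding to FIG. 5(a)) are assumptions for illustration. As before, the sketch assumes the dependent instruction may enter execute one stage only in the cycle after the load is processed in the write back stage.

```python
LOAD_STORE_STAGES = ("D", "E1", "E2", "W")
SHORT_INT_STAGES = ("D", "E1", "E2", "W")                # integer pipeline 120
BUFFERED_INT_STAGES = ("B1", "B2", "D", "E1", "E2", "W")  # integer pipeline 400


def schedule_dependent(int_stages, issue_cycle=1, load_issue_cycle=0):
    """Schedule an integer instruction that depends on an earlier load.

    Returns (schedule, stall_cycles). `schedule` maps each stage label
    to the cycle in which the instruction is first processed there.
    """
    # Cycle in which the load is processed in its write back stage.
    load_w = load_issue_cycle + LOAD_STORE_STAGES.index("W")

    e1_depth = int_stages.index("E1")
    unstalled_e1 = issue_cycle + e1_depth
    # Assumption: E1 may start only after the load has passed through W.
    actual_e1 = max(unstalled_e1, load_w + 1)
    stall = actual_e1 - unstalled_e1

    schedule = {}
    for depth, stage in enumerate(int_stages):
        cycle = issue_cycle + depth
        if depth >= e1_depth:
            cycle += stall  # stages from E1 onward are pushed back
        schedule[stage] = cycle
    return schedule, stall


if __name__ == "__main__":
    for name, stages in (("120", SHORT_INT_STAGES), ("400", BUFFERED_INT_STAGES)):
        schedule, stall = schedule_dependent(stages)
        print("pipeline", name, schedule, "stall cycles:", stall)
```

In this simplified model the add instruction reaches the write back stage in the same cycle in both pipelines. The difference is that the buffered pipeline never holds the instruction in place, so later instructions can still be issued behind it.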
Thus, including buffer one stage 425 and buffer two stage 427 allows integer pipeline 400 to avoid stalling on load-use data dependency problems. However, using buffer one stage 425 and buffer two stage 427 increases the latency of integer pipeline 400 because an instruction must pass through six stages of integer pipeline 400 rather than the four stages of integer pipeline 120. Furthermore, longer integer pipelines suffer delays on several other types of pipeline stalls. For example, speculative execution caused by conditional branch instructions may cause pipeline stalls. A typical conditional branch instruction has the form: If {branch condition is satisfied} then jump to the instruction at the {branch address}. For example, the macro instruction “JZ 100” can be interpreted as: if {the operand of the preceding instruction was zero} then jump to the instruction at {address 100}. To avoid pipeline stalls, most processors select an outcome for the conditional branch instruction and process instructions as if the selected outcome were correct. Actual determination of the conditional branch condition does not occur until execute one stage 140 or execute two stage 150. Thus, in longer pipelines, determination of the result of the conditional branch instruction is delayed as compared to shorter pipelines. Specifically, each buffer stage delays determination of the conditional branch instruction by one clock cycle. Thus, the performance of integer pipeline 400 on conditional branch instructions is worse than the performance of integer pipeline 120. In addition, a mixed register instruction, which uses the data register file of the integer pipeline and the execution stages of the load/store pipeline, would cause stalls due to the delays caused by buffer one stage 425 and buffer two stage 427. 
In particular, if the mixed register instruction has a data register as a destination and a following integer instruction uses this value, then the integer instruction must stall. Thus, the conventional solution for avoiding load-use data dependency problems degrades performance for other instruction types and may reduce the overall processing speed of the microprocessor. Hence, there is a need for an integer pipeline that can avoid load-use data dependency problems while minimizing the problems associated with long integer pipelines.
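The branch penalty of the longer pipeline can be quantified with a short sketch. The function name `branch_resolution_cycle` and the stage tuples are hypothetical. The sketch assumes an unstalled pipeline in which a conditional branch enters the first stage at cycle 0 and advances one stage per clock cycle.

```python
SHORT_INT_STAGES = ("D", "E1", "E2", "W")                # integer pipeline 120
BUFFERED_INT_STAGES = ("B1", "B2", "D", "E1", "E2", "W")  # integer pipeline 400


def branch_resolution_cycle(stages, resolve_stage="E1"):
    """Cycle (counted from issue at cycle 0) in which the branch
    condition is determined, assuming no stalls."""
    return stages.index(resolve_stage)


def extra_branch_delay(resolve_stage="E1"):
    """Additional cycles the buffered pipeline needs to determine
    a conditional branch compared with the four stage pipeline."""
    return (branch_resolution_cycle(BUFFERED_INT_STAGES, resolve_stage)
            - branch_resolution_cycle(SHORT_INT_STAGES, resolve_stage))


if __name__ == "__main__":
    for stage in ("E1", "E2"):
        print(stage, "extra delay:", extra_branch_delay(stage))
```

Whether the branch is determined in execute one stage or execute two stage, the extra delay in this model equals the number of buffer stages, which matches the one clock cycle per buffer stage stated above.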