1. Field of the Invention
The present invention generally relates to pipelined processors and, more particularly, to processors utilizing a cascaded arrangement of execution units that are delayed with respect to each other.
2. Description of the Related Art
Computer systems typically contain several integrated circuits (ICs), including one or more processors used to process information in the computer system. Modern processors often process instructions in a pipelined manner, executing each instruction as a series of steps. Each step is typically performed by a different stage (hardware circuit) in the pipeline, with each pipeline stage performing its step on a different instruction in the pipeline in a given clock cycle. As a result, if a pipeline is fully loaded, an instruction is processed each clock cycle, thereby increasing throughput.
As a simple example, a pipeline may include three stages: load (read instruction from memory), execute (execute the instruction), and store (store the results). In a first clock cycle, a first instruction enters the pipeline load stage. In a second clock cycle, the first instruction moves to the execution stage, freeing up the load stage to load a second instruction. In a third clock cycle, the results of executing the first instruction may be stored by the store stage, while the second instruction is executed and a third instruction is loaded.
Unfortunately, due to dependencies inherent in a typical instruction stream, conventional instruction pipelines suffer from stalls (with pipeline stages not executing) while an execution unit to execute one instruction waits for results generated by execution of a previous instruction. As an example, a load instruction may be dependent on a previous instruction (e.g., another load instruction or addition of an offset to a base address) to supply the address of the data to be loaded. As another example, a multiply instruction may rely on the results of one or more previous load instructions for one of its operands. In either case, a conventional instruction pipeline would stall until the results of the previous instruction are available. Stalls can be for several clock cycles, for example, if the previous instruction (on which the subsequent instruction is dependent) targets data that does not reside in an L1 cache (resulting in an L1 “cache miss”) and a relatively slow L2 cache must be accessed. As a result, such stalls may result in a substantial reduction in performance due to underutilization of the pipeline.
Accordingly, what is needed is an improved mechanism of pipelining instructions, preferably that reduces stalls.