Over the last several years, digital signal processors (DSPs) have become an important tool, particularly in the real-time modification of signal streams. They have found use in all manner of electronic devices and will continue to grow in power and popularity.
As time has passed, greater performance has been demanded of DSPs. In most cases, performance increases are realized by increases in speed. One approach to improving DSP performance is to increase the rate of the clock that drives the DSP. As the clock rate increases, however, the DSP's power consumption and temperature also increase. Increased power consumption is expensive, and intolerable in battery-powered applications. Further, high circuit temperatures may damage the DSP. Finally, the DSP clock rate cannot increase beyond a threshold set by the physical speed at which signals can traverse the DSP. Simply stated, there is a practical maximum to the clock rate that is acceptable to conventional DSPs.
An alternate approach to improve DSP performance is to increase the number of instructions executed per clock cycle by the DSP (“DSP throughput”). One technique for increasing DSP throughput is pipelining, which calls for the DSP to be divided into separate processing stages (collectively termed a “pipeline”). Instructions are processed in an “assembly line” fashion in the processing stages. Each processing stage is optimized to perform a particular processing function, thereby causing the DSP as a whole to become faster.
“Superpipelining” extends the pipelining concept further by allowing the simultaneous processing of multiple instructions in the pipeline. Consider, as an example, a DSP in which each instruction executes in six stages, each stage requiring a single clock cycle to perform its function. Six separate instructions can therefore be processed concurrently in the pipeline; i.e., the processing of one instruction is completed during each clock cycle. The instruction throughput of an n-stage pipelined architecture is therefore, in theory, n times greater than the throughput of a non-pipelined architecture capable of completing only one instruction every n clock cycles.
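The theoretical n-fold gain can be checked with a short cycle count. The following is a minimal sketch, assuming an ideal pipeline with no stalls and one clock cycle per stage; the function names are illustrative and do not come from any particular DSP.

```python
def unpipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles when each instruction must pass through every stage
    before the next instruction may start."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles for an ideal pipeline: the first instruction needs
    num_stages cycles to fill the pipeline, after which one
    instruction completes on every cycle."""
    if num_instructions == 0:
        return 0
    return num_stages + num_instructions - 1


def speedup(num_instructions: int, num_stages: int) -> float:
    """Ratio of unpipelined to pipelined cycle counts."""
    if num_instructions < 1:
        raise ValueError("need at least one instruction")
    return (unpipelined_cycles(num_instructions, num_stages)
            / pipelined_cycles(num_instructions, num_stages))


if __name__ == "__main__":
    # Six-stage example from the text, for growing instruction counts.
    for count in (6, 100, 10_000):
        print(count, round(speedup(count, 6), 3))
```

For a short run of instructions the pipeline fill time is significant, so the speedup only approaches n as the instruction count grows.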
Another technique for increasing overall DSP speed is “superscalar” processing. Superscalar processing calls for multiple instructions to be processed per clock cycle. Assuming that instructions are independent of one another (the execution of each instruction does not depend upon the execution of any other instruction), DSP throughput is increased in proportion to the number of instructions processed per clock cycle (“degree of scalability”). If, for example, a particular DSP architecture is superscalar to degree three (i.e., three instructions are processed during each clock cycle), the instruction throughput of the DSP is theoretically tripled.
These techniques are not mutually exclusive; DSPs may be both superpipelined and superscalar. However, operation of such DSPs in practice is often far from ideal, as instructions tend to depend upon one another and are also often not executed efficiently within the pipeline stages. In actual operation, instructions often require varying amounts of DSP resources, creating interruptions (“bubbles” or “stalls”) in the flow of instructions through the pipeline. Consequently, while superpipelining and superscalar techniques do increase throughput, the actual throughput of the DSP ultimately depends upon the particular instructions processed during a given period of time and the particular implementation of the DSP's architecture.
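The effect of instruction dependence on superscalar throughput can be illustrated with a toy issue model. This is a sketch under stated assumptions, not a model of any real DSP: instructions issue in program order, every instruction has a one-cycle latency, and an instruction may not issue in the same cycle as an instruction it depends on. The function name and the dependency encoding are hypothetical.

```python
def issue_cycles(deps: list[set[int]], degree: int) -> int:
    """Return the number of cycles needed to issue every instruction.

    deps[i] holds the indices of earlier instructions that
    instruction i depends on. At most `degree` instructions issue
    per cycle, in program order (assumed one-cycle latency).
    """
    issued_at: list[int] = []
    cycle = 0
    slots_used = 0
    for d in deps:
        # Earliest cycle allowed by the dependencies of this instruction.
        earliest = max((issued_at[j] + 1 for j in d), default=0)
        if earliest > cycle:
            # Dependency not yet satisfied: the unused slots are bubbles.
            cycle = earliest
            slots_used = 0
        elif slots_used == degree:
            # All issue slots of the current cycle are taken.
            cycle += 1
            slots_used = 0
        issued_at.append(cycle)
        slots_used += 1
    return cycle + 1 if deps else 0


if __name__ == "__main__":
    independent = [set() for _ in range(6)]
    chain = [set()] + [{i} for i in range(5)]  # each depends on the previous
    print(issue_cycles(independent, 3))  # six independent instructions
    print(issue_cycles(chain, 3))        # six fully dependent instructions
```

With six independent instructions and degree three, the model issues three instructions per cycle; with a fully dependent chain, the extra issue slots go unused and throughput falls back to one instruction per cycle.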
The speed at which a DSP can perform a desired task is also a function of the number of instructions required to code the task. A DSP may require one or many clock cycles to execute a particular instruction. Thus, in order to enhance the speed at which a DSP can perform a desired task, both the number of instructions used to code the task and the number of clock cycles required to execute each instruction should be minimized.
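The relationship among instruction count, cycles per instruction and clock rate can be made concrete as follows. This is a minimal sketch that ignores pipelining overlap; the cycle counts and the 100 MHz clock are made-up illustrative values.

```python
def execution_time_seconds(cycles_per_instruction: list[int],
                           clock_hz: float) -> float:
    """Time to run a task, given the cycle cost of each of its
    instructions and the clock rate (no overlap between instructions)."""
    total_cycles = sum(cycles_per_instruction)
    return total_cycles / clock_hz


if __name__ == "__main__":
    clock_hz = 100e6  # hypothetical 100 MHz clock
    original = [1, 1, 4, 2]    # four instructions, eight cycles in total
    fewer_instrs = [1, 4, 2]   # same task coded with one instruction fewer
    fewer_cycles = [1, 1, 2, 2]  # same count, cheaper third instruction
    for task in (original, fewer_instrs, fewer_cycles):
        print(execution_time_seconds(task, clock_hz))
```

Either removing an instruction or reducing the cycles a given instruction needs lowers the total, which is why both quantities matter.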
One conventional technique employed to increase DSP performance is speculative execution (now called “dynamic execution” to connote less risk). Speculative execution employs a technique called “branch prediction” whereby the outcomes of conditional branch instructions are predicted. Once outcomes are predicted, resulting instructions can be retrieved from memory and placed in the pipeline for execution.
Unfortunately, predictions are at best educated guesses; the outcome of conditional branches cannot truly be determined (“resolved”) until their instructions are late in the pipeline, typically in its execution stage. If the outcome of a conditional branch has not been predicted correctly (a “mispredict,” giving rise to a “mispredict condition”), the entire state of the DSP must be restored to the point at which it was when the misprediction occurred. This restoration involves, among other things, a flushing of the entire pipeline, a restoration of registers in the register file(s) and a resetting of the program counter (PC) such that it points to the instructions that should have been executed. All of this takes precious time.
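The cost of mispredictions can be estimated with a commonly used first-order model, in which each mispredicted branch adds a fixed recovery penalty to the average cycles per instruction. The sketch below assumes that model; the parameter names and the sample values are hypothetical and are not measurements of any DSP.

```python
def effective_cpi(base_cpi: float,
                  branch_fraction: float,
                  mispredict_rate: float,
                  penalty_cycles: float) -> float:
    """Average cycles per instruction including misprediction recovery.

    branch_fraction: fraction of instructions that are conditional branches
    mispredict_rate: fraction of those branches that are mispredicted
    penalty_cycles:  cycles lost to flushing and restoring per mispredict
    """
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


if __name__ == "__main__":
    # Hypothetical workload: 20% branches, 10% of them mispredicted.
    for penalty in (5, 10, 20):
        print(penalty, round(effective_cpi(1.0, 0.2, 0.1, penalty), 3))
```

Under this model the lost time scales directly with the recovery penalty, so shortening the restoration of the pipeline and the PC reduces the average cost of every instruction.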
Moreover, depending on the depth of the pipeline, several conditional branches may enter the pipeline before the first branch is resolved. Without a mispredict PC FIFO, the second branch must be stalled until the first branch is grouped in the GR stage. This degrades performance because the branch targets cannot be prefetched ahead of time.
What is needed in the art is a way to generate multiple branch predictions and mispredict PCs and store them for later use without waiting for branches to be grouped. This would allow the predictions to be used to prefetch instructions.
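One way to picture such storage is a small first-in, first-out queue holding one entry per unresolved branch. The following is a purely illustrative sketch and not a description of any actual design: it assumes that a "mispredict PC" is the address at which fetching should resume if the prediction proves wrong, and the class, method and field names are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class BranchEntry:
    branch_pc: int
    predicted_taken: bool
    mispredict_pc: int  # assumed meaning: resume address if prediction is wrong


class MispredictPcFifo:
    """Hypothetical FIFO of in-flight branch predictions."""

    def __init__(self, depth: int) -> None:
        self.depth = depth
        self.entries: deque = deque()

    def push_prediction(self, branch_pc: int, taken_target: int,
                        fallthrough_pc: int, predicted_taken: bool) -> int:
        """Record a prediction and return the predicted next PC,
        which can be used to prefetch without waiting for resolution."""
        if len(self.entries) >= self.depth:
            raise OverflowError("FIFO full: further branches must stall")
        mispredict_pc = fallthrough_pc if predicted_taken else taken_target
        self.entries.append(
            BranchEntry(branch_pc, predicted_taken, mispredict_pc))
        return taken_target if predicted_taken else fallthrough_pc

    def resolve_oldest(self, actually_taken: bool) -> Optional[int]:
        """Resolve the oldest branch. Return None if it was predicted
        correctly; otherwise discard younger entries and return the
        stored PC from which execution should resume."""
        entry = self.entries.popleft()
        if entry.predicted_taken == actually_taken:
            return None
        self.entries.clear()  # younger predictions are on the wrong path
        return entry.mispredict_pc
```

In this sketch, a second branch can be predicted and its target prefetched while the first is still unresolved, and on a misprediction the corrected PC is read directly from the oldest entry rather than recomputed.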
In addition, what is needed in the art is a way to increase the speed with which a processor can be restored from a mispredict condition. More specifically, what is needed in the art is a way to speed the process of restoring a PC in a DSP to its correct value upon the occurrence of a misprediction.