Over the last several years, DSPs have become an important tool, particularly in the real-time modification of signal streams. They have found use in all manner of electronic devices and will continue to grow in power and popularity.
As time has passed, greater performance has been demanded of DSPs. In most cases, performance increases are realized by increases in speed. One approach to improve DSP performance is to increase the rate of the clock that drives the DSP. As the clock rate increases, however, the DSP's power consumption and temperature also increase. Increased power consumption is expensive, and intolerable in battery-powered applications. Further, high circuit temperatures may damage the DSP. Moreover, the clock rate cannot rise beyond the physical limit imposed by the speed at which signals traverse the DSP's circuitry. Simply stated, there is a practical maximum clock rate for conventional DSPs.
An alternate approach to improve DSP performance is to increase the number of instructions executed per clock cycle by the DSP (“DSP throughput”). One technique for increasing DSP throughput is pipelining, which calls for the DSP to be divided into separate processing stages (collectively termed a “pipeline”). Instructions are processed in an “assembly line” fashion in the processing stages. Each processing stage is optimized to perform a particular processing function, thereby causing the DSP as a whole to become faster.
“Superpipelining” extends the pipelining concept further by allowing the simultaneous processing of multiple instructions in the pipeline. Consider, as an example, a DSP in which each instruction executes in six stages, each stage requiring a single clock cycle to perform its function. Six separate instructions can therefore be processed concurrently in the pipeline; i.e., the processing of one instruction is completed during each clock cycle. The instruction throughput of an n-stage pipelined architecture is therefore, in theory, n times greater than the throughput of a non-pipelined architecture capable of completing only one instruction every n clock cycles.
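The ideal throughput gain described above can be illustrated with a minimal sketch (the six-stage, 100-instruction figures are assumptions for illustration, not from the original text): once the pipeline fills, one instruction completes per clock cycle, so the speedup approaches the number of stages.

```python
# Ideal cycle counts for a non-pipelined versus a pipelined DSP.
# Assumption: every stage takes exactly one clock cycle and there
# are no stalls -- the idealized case the text describes.

def cycles_nonpipelined(num_instructions: int, stages: int) -> int:
    # Without pipelining, each instruction occupies all stages serially.
    return num_instructions * stages

def cycles_pipelined(num_instructions: int, stages: int) -> int:
    # The first instruction takes `stages` cycles to drain; thereafter
    # one instruction completes every clock cycle.
    return stages + (num_instructions - 1)

print(cycles_nonpipelined(100, 6))  # 600 cycles
print(cycles_pipelined(100, 6))     # 105 cycles
```

As the instruction count grows, the ratio 600/105 tends toward the theoretical factor of n = 6.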
Another technique for increasing overall DSP speed is “superscalar” processing. Superscalar processing calls for multiple instructions to be processed per clock cycle. Assuming that instructions are independent of one another (the execution of each instruction does not depend upon the execution of any other instruction), DSP throughput is increased in proportion to the number of instructions processed per clock cycle (“degree of scalability”). If, for example, a particular DSP architecture is superscalar to degree three (i.e., three instructions are processed during each clock cycle), the instruction throughput of the DSP is theoretically tripled.
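The degree-three example can be reduced to simple arithmetic, sketched below (the 300-instruction workload is an assumed figure for illustration): with k-way issue and fully independent instructions, the cycle count divides by k.

```python
# Ideal cycle count for a superscalar DSP of a given degree of
# scalability, assuming all instructions are mutually independent.

def cycles_superscalar(num_instructions: int, degree: int) -> int:
    # Ceiling division: the final cycle may issue fewer than
    # `degree` instructions.
    return -(-num_instructions // degree)

print(cycles_superscalar(300, 1))  # 300 cycles: scalar baseline
print(cycles_superscalar(300, 3))  # 100 cycles: throughput tripled
```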
These techniques are not mutually exclusive; DSPs may be both superpipelined and superscalar. However, operation of such DSPs in practice is often far from ideal, as instructions tend to depend upon one another and are also often not executed efficiently within the pipeline stages. In actual operation, instructions often require varying amounts of DSP resources, creating interruptions (“bubbles” or “stalls”) in the flow of instructions through the pipeline. Consequently, while superpipelining and superscalar techniques do increase throughput, the actual throughput of the DSP ultimately depends upon the particular instructions processed during a given period of time and the particular implementation of the DSP's architecture.
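The effect of such bubbles can be sketched with a toy model (the register names, six-stage depth, and one-cycle stall penalty are assumptions for illustration): whenever an instruction reads the register its immediate predecessor writes, the pipeline inserts a bubble and the ideal cycle count grows.

```python
# Toy model of pipeline bubbles caused by read-after-write dependences.
# Each instruction is a (dest, src) register pair; a one-cycle stall is
# charged when an instruction reads its predecessor's destination.

def cycles_with_stalls(instructions, stages=6, stall_penalty=1):
    cycles = stages + (len(instructions) - 1)  # ideal pipelined count
    prev_dest = None
    for dest, src in instructions:
        if src == prev_dest:          # read-after-write dependence
            cycles += stall_penalty   # pipeline inserts a bubble
        prev_dest = dest
    return cycles

# Independent stream: achieves the ideal count.
independent = [("r1", "r0"), ("r2", "r0"), ("r3", "r0")]
# Dependent chain: each instruction consumes its predecessor's result.
dependent = [("r1", "r0"), ("r2", "r1"), ("r3", "r2")]

print(cycles_with_stalls(independent))  # 8 cycles (ideal)
print(cycles_with_stalls(dependent))    # 10 cycles (two bubbles)
```

This illustrates the closing point of the paragraph: actual throughput depends on the particular instruction stream, not only on the pipeline depth or issue width.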
The speed at which a DSP can perform a desired task is also a function of the number of instructions required to code the task. A DSP may require one or many clock cycles to execute a particular instruction. Thus, in order to enhance the speed at which a DSP can perform a desired task, both the number of instructions used to code the task as well as the number of clock cycles required to execute each instruction should be minimized.
Fetching instructions from memory takes time and can therefore inhibit DSP performance. If, on the other hand, the DSP can be engaged in the execution of some instructions while other instructions are being fetched, DSP performance can remain high. Fetching instructions before they are actually needed for issuance into a pipeline is called “prefetching.”
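The benefit of overlapping fetch with execution can be sketched as follows (the three-cycle fetch latency is an assumed figure, and the model assumes fetches themselves are pipelined so the fetch unit keeps pace with execution):

```python
# Sketch of the latency-hiding effect of prefetching. Assumptions:
# fetch takes 3 cycles, execution takes 1 cycle per instruction, and
# prefetches overlap execution so only the first fetch is exposed.

def cycles_without_prefetch(n, fetch_latency=3, exec_latency=1):
    # Fetch and execute strictly alternate: every fetch stalls the DSP.
    return n * (fetch_latency + exec_latency)

def cycles_with_prefetch(n, fetch_latency=3, exec_latency=1):
    # Later fetches are hidden behind execution of earlier instructions.
    return fetch_latency + n * exec_latency

print(cycles_without_prefetch(100))  # 400 cycles
print(cycles_with_prefetch(100))     # 103 cycles
```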
One conventional programming technique is called “conditional execution.” Instructions in what is called a “conditional execution block,” or “CE block,” are only validly executed upon the occurrence of a condition established in a conditional branch instruction. If the condition does not occur, the instructions in the CE block are ignored.
As with all other instructions, prefetching of instructions in a CE block is advantageous. It is also advantageous to issue those instructions into the pipeline, even though the condition that determines whether they should be validly executed remains unresolved. Of course, if the condition is resolved in the affirmative (“true”), execution of the instructions in the CE block is already underway, and DSP performance remains intact. Unfortunately, if the condition is resolved in the negative (“false”), the instructions in the CE block must be flushed from the pipeline so that they do not corrupt valid data or further consume the DSP's processing resources.
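A minimal sketch of this behavior follows (a hypothetical model for illustration, not the mechanism the remainder of this document describes): CE-block instructions are issued speculatively with a tag, and if the condition resolves false, every tagged instruction is flushed from the pipeline.

```python
# Toy model of speculative issue and flush of a conditional-execution
# (CE) block. The "ce" tag, opcode names, and list-based pipeline are
# hypothetical simplifications.

def run(condition: bool):
    pipeline = [("cmp", None)]                  # establishes the condition
    ce_block = [("mul", "ce"), ("add", "ce")]   # instructions tagged as CE
    pipeline.extend(ce_block)                   # issued before resolution
    if not condition:
        # Condition resolved false: flush every CE-tagged instruction so
        # it cannot corrupt valid data or consume processing resources.
        pipeline = [op for op in pipeline if op[1] != "ce"]
    return [op[0] for op in pipeline]

print(run(True))   # ['cmp', 'mul', 'add'] -- CE block completes
print(run(False))  # ['cmp']               -- CE block flushed
```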
To allow such flushing to take place, instructions in the CE block must be identified and tracked as they traverse the DSP's pipeline. Unfortunately, conventional mechanisms for identifying and tracking CE blocks involve many registers and extensive movement of instructions and associated tags, all of which consumes time and electric power. In battery-powered DSPs, this power dissipation is particularly disadvantageous.
What is needed in the art is a more efficient way to identify and track CE instructions as they traverse a processor pipeline.