Over the last several years, DSPs have become an important tool, particularly in the real-time modification of signal streams. They have found use in all manner of electronic devices and will continue to grow in power and popularity.
As time has passed, greater performance has been demanded of DSPs. In most cases, performance increases are realized by increases in speed. One approach to improve DSP performance is to increase the rate of the clock that drives the DSP. As the clock rate increases, however, the DSP's power consumption and temperature also increase. Increased power consumption is expensive, and intolerable in battery-powered applications. Further, high circuit temperatures may damage the DSP. Nor can the DSP clock rate increase beyond the limit set by the physical speed at which signals traverse the DSP. Simply stated, there is a practical maximum to the clock rate that is acceptable to conventional DSPs.
An alternate approach to improve DSP performance is to increase the number of instructions executed per clock cycle by the DSP (“DSP throughput”). One technique for increasing DSP throughput is pipelining, which calls for the DSP to be divided into separate processing stages (collectively termed a “pipeline”). Instructions are processed in an “assembly line” fashion in the processing stages. Each processing stage is optimized to perform a particular processing function, thereby causing the DSP as a whole to become faster.
“Superpipelining” extends the pipelining concept further by allowing the simultaneous processing of multiple instructions in the pipeline. Consider, as an example, a DSP in which each instruction executes in six stages, each stage requiring a single clock cycle to perform its function. Six separate instructions can therefore be processed concurrently in the pipeline; i.e., the processing of one instruction is completed during each clock cycle. The instruction throughput of an n-stage pipelined architecture is therefore, in theory, n times greater than the throughput of a non-pipelined architecture capable of completing only one instruction every n clock cycles.
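The cycle counts behind this comparison can be sketched as follows. This is a minimal illustration under idealized assumptions (one clock cycle per stage, no stalls); the function names are hypothetical and do not describe any particular DSP.

```python
def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles needed to complete all instructions in an ideal pipeline."""
    if num_instructions == 0:
        return 0
    # The first instruction needs num_stages cycles to pass through the
    # pipeline; each later instruction completes one cycle after the
    # instruction ahead of it.
    return num_stages + (num_instructions - 1)


def unpipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles needed when each instruction must finish before the next starts."""
    return num_instructions * num_stages


def speedup(num_instructions: int, num_stages: int) -> float:
    """Ratio of non-pipelined to pipelined cycles (needs >= 1 instruction)."""
    return (unpipelined_cycles(num_instructions, num_stages)
            / pipelined_cycles(num_instructions, num_stages))
```

For six instructions in a six-stage pipeline, the sketch gives 11 cycles pipelined versus 36 non-pipelined. The speedup approaches the theoretical factor of n only as the number of instructions grows large relative to the pipeline depth.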
Another technique for increasing overall DSP speed is “superscalar” processing. Superscalar processing calls for multiple instructions to be processed per clock cycle. Assuming that instructions are independent of one another (the execution of each instruction does not depend upon the execution of any other instruction), DSP throughput is increased in proportion to the number of instructions processed per clock cycle (“degree of scalability”). If, for example, a particular DSP architecture is superscalar to degree three (i.e., three instructions are processed during each clock cycle), the instruction throughput of the DSP is theoretically tripled.
These techniques are not mutually exclusive; DSPs may be both superpipelined and superscalar. However, operation of such DSPs in practice is often far from ideal, as instructions tend to depend upon one another and are also often not executed efficiently within the pipeline stages. In actual operation, instructions often require varying amounts of DSP resources, creating interruptions (“bubbles” or “stalls”) in the flow of instructions through the pipeline. Consequently, while superpipelining and superscalar techniques do increase throughput, the actual throughput of the DSP ultimately depends upon the particular instructions processed during a given period of time and the particular implementation of the DSP's architecture.
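The effect of instruction dependence on a superscalar design can be illustrated with a toy issue model. This is a sketch under stated assumptions, not a model of any real DSP: instructions issue in order, at most `width` per cycle, and an instruction cannot issue in the same cycle as an earlier instruction that produces one of its source registers. The `(dest, sources)` encoding and the function name are hypothetical.

```python
def issue_cycles(program, width):
    """Count issue cycles for an in-order machine of the given issue width.

    program: list of (dest, sources) tuples, where dest is the register an
    instruction writes and sources is a tuple of registers it reads.
    """
    cycles = 0
    i = 0
    while i < len(program):
        written = set()  # registers written by instructions issued this cycle
        issued = 0
        while i < len(program) and issued < width:
            dest, sources = program[i]
            if any(src in written for src in sources):
                break  # needs a result produced this cycle: wait for the next
            written.add(dest)
            issued += 1
            i += 1
        cycles += 1
    return cycles
```

With six mutually independent instructions and an issue width of three, the sketch needs two cycles. With six instructions that each read the result of the one before, it needs six cycles, so the extra issue width buys nothing.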
The speed at which a DSP can perform a desired task is also a function of the number of instructions required to code the task. A DSP may require one or many clock cycles to execute a particular instruction. Thus, in order to enhance the speed at which a DSP can perform a desired task, both the number of instructions used to code the task and the number of clock cycles required to execute each instruction should be minimized.
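The relationship between instruction count, cycles per instruction and task time can be made concrete with a short sketch. The function name and the example figures are illustrative assumptions, and the model ignores pipelining overlap and stalls.

```python
def execution_time_seconds(cycles_per_instruction, clock_hz):
    """Time to run a task, given the cycle cost of each of its instructions.

    cycles_per_instruction: one entry per instruction in the coded task.
    clock_hz: clock rate in cycles per second.
    """
    total_cycles = sum(cycles_per_instruction)
    return total_cycles / clock_hz
```

For example, a task coded in four instructions costing 1, 1, 2 and 4 cycles takes 8 cycles in total, or 80 nanoseconds at an assumed 100 MHz clock. Shortening the list (fewer instructions) or reducing its entries (fewer cycles per instruction) both reduce the total.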
It has long been a preferred practice to break computer programs down into separate routines and subroutines. From a conceptual standpoint, program functions are compartmentalized, and the structural integrity and comprehensibility of the program as a whole are increased. From a practical standpoint, subroutines can be reused without duplication, sometimes dramatically decreasing the overall size of the program.
Subroutines are invoked by a process termed “calling.” A routine may therefore “call” a subroutine to have it perform its particular function; when the subroutine has finished, it “returns” back to the routine that called it. It is apparent that a hierarchy of routines and subroutines could be advantageous for certain kinds of programs. For example, a main routine could call a first subroutine, which itself could call a second subroutine, and so on. This hierarchy of multiple subroutine levels is called “nested subroutines.”
A DSP, and a processor in general, handles subroutines by manipulating its program counter (PC). A program counter simply contains the address of the instruction that is being executed. To call a subroutine, the contents of the PC are stored in a separate memory location, the address of the first instruction in the subroutine is loaded into the PC, and the subroutine is executed. When it is time to return, the original contents of the PC are retrieved from the separate memory location and incremented to point to the next instruction in the routine that called the subroutine.
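The call and return sequence just described can be sketched with a toy machine model. The class and attribute names are hypothetical, and addresses are assumed to advance by one per instruction.

```python
class SingleLevelMachine:
    """Toy model of call/return using one saved-PC location."""

    def __init__(self):
        self.pc = 0           # address of the instruction being executed
        self.saved_pc = None  # the "separate memory location"

    def call(self, subroutine_address):
        # Store the PC (the address of the call instruction itself),
        # then load the PC with the subroutine's first instruction.
        self.saved_pc = self.pc
        self.pc = subroutine_address

    def ret(self):
        # Retrieve the stored PC and increment it so that it points to
        # the instruction after the call in the calling routine.
        self.pc = self.saved_pc + 1
        self.saved_pc = None
```

A call made at address 10 to a subroutine at address 100 sets the PC to 100, and the matching return sets it to 11. Because this model has only one saved-PC location, a second call made before the first return would overwrite the stored address.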
Nested subroutines are handled by establishing a last-in, first-out (LIFO) buffer, called a “stack,” in memory. Each time a subroutine is called, the contents of the PC are “pushed” into the stack. Each time a subroutine ends (a return), the contents that were earlier pushed into the stack are “popped” from the stack and reloaded into the PC.
Unfortunately, pushing into, and popping from, a stack require accesses to memory, which are time-consuming. They are also power-consuming, which is highly disadvantageous in a battery-powered environment. It is therefore advantageous to avoid these memory accesses whenever possible.
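The stack mechanism and its memory cost can be sketched as follows. This is an illustrative model only: the class name is hypothetical, addresses are assumed to advance by one per instruction, and the counter simply tallies one memory access per push or pop.

```python
class StackMachine:
    """Toy model of nested call/return using a LIFO stack in memory."""

    def __init__(self):
        self.pc = 0               # address of the instruction being executed
        self.stack = []           # LIFO buffer of saved PC values
        self.memory_accesses = 0  # one access counted per push or pop

    def call(self, subroutine_address):
        self.stack.append(self.pc)  # push the caller's PC
        self.memory_accesses += 1
        self.pc = subroutine_address

    def ret(self):
        saved_pc = self.stack.pop()  # pop the most recently pushed PC
        self.memory_accesses += 1
        self.pc = saved_pc + 1       # resume after the call instruction
```

If a main routine calls a first subroutine from address 5, and that subroutine calls a second one from address 103, the two returns restore the PC to 104 and then to 6, in the reverse order of the calls. In this model the two-level nest costs four memory accesses, two pushes and two pops.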
It is further advantageous to provide a mechanism to support early execution of nested call instructions, thereby allowing prefetching of instructions in nested subroutines. Prefetching at least some of the instructions in nested subroutines would avoid the undue latency that would otherwise be encountered.
What is needed in the art is a way to support nested subroutines without having to resort to memory accesses. What is further needed in the art is a way to support prefetching and early execution of nested subroutine calls in a pipelined processor architecture.