Recently, much attention has been focused on designing low-cost, low-power, high-performance processors for mid-to-low-end embedded applications, such as pagers and cellular phones. Many of these embedded applications require the data processing system to perform highly repetitive functions, such as digital signal processing (DSP) functions, in which a large amount of Instruction Level Parallelism (ILP) can be exploited, while also requiring the system to perform control-intensive functions.
To address these needs, some systems use dual-core solutions, in which one core performs all the control-intensive functions and the other core performs the specialized DSP functions. In this approach, the processor cores communicate with each other through communication channels implemented within the system, such as a shared memory. These systems often employ dual instruction streams, one for each execution core. As a result, dual-core systems typically have higher hardware and development costs.
In addition, in many embedded applications, some loops are highly vectorizable, while other loops are more difficult to vectorize. Highly vectorizable loops can be processed efficiently using the traditional vector processing paradigm, such as that described in “Cray-1 Computer System Hardware Reference Manual”, Cray Research, Inc., Bloomington, Minn., publication number 2240004, 1977. This paradigm works well for vectorizable loops, but it does not extend to loops that are difficult to vectorize.
For loops that are difficult to vectorize, a DSP style of processing paradigm, which focuses on optimizing loop execution, is more suitable. The SHARC product described in the ADSP-2106x SHARC User's Manual, Analog Devices Inc., 1997, is an example of a system employing loop optimization. While this approach executes loops that are difficult to vectorize efficiently, it is not as efficient for highly vectorizable loops.
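The distinction between the two kinds of loops can be illustrated with a minimal sketch in C. The function names vec_add and running_sum_until are hypothetical and do not come from either reference manual; they are assumptions chosen only to contrast a loop whose iterations are independent with a loop that carries a dependence from one iteration to the next and may exit early depending on the data.

```c
#include <assert.h>
#include <stddef.h>

/* Highly vectorizable (illustrative): every iteration is independent,
   so a vector unit could process many elements per instruction. */
void vec_add(const int *a, const int *b, int *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Harder to vectorize (illustrative): each iteration needs the
   accumulator from the previous one, and the loop may stop early
   depending on the data. Returns the number of outputs written. */
size_t running_sum_until(const int *x, int *y, size_t n, int limit)
{
    assert(x != NULL && y != NULL);
    int acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += x[i];          /* loop-carried dependence on acc */
        y[i] = acc;
        if (acc > limit)      /* data-dependent early exit */
            return i + 1;
    }
    return n;
}
```

In the first loop, the order in which elements are processed does not matter, which is the property a vector processing paradigm relies on. In the second, a loop-optimized style of execution is likely the better fit, since each iteration must wait for the previous result and the trip count is not known in advance.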
Many embedded applications spend most of their execution time executing a handful of critical program loops. These critical loops often constitute only a small fraction of the static code size. In such systems, an optimum tradeoff between performance and system cost (code size) can often be achieved if a dense instruction-encoding scheme is used for the entire program except for the few critical program loops. From the above discussion, it is apparent that an improved method of instruction encoding is needed.