A microprocessor is a circuit that combines the instruction-handling, arithmetic, and logical operations of a computer on a single chip. A digital signal processor (DSP) is a microprocessor optimized to handle large volumes of data efficiently. Such processors are central to the operation of many of today's electronic products, such as high-speed modems, high-density disk drives, digital cellular phones, and complex automotive systems, and will enable a wide variety of other digital systems in the future. The demands placed upon DSPs in these environments continue to grow as consumers seek increased performance from their digital products.
Designers have succeeded in increasing the performance of DSPs and microprocessors in general by increasing clock speeds, by removing architectural bottlenecks in circuit designs, by incorporating multiple execution units on a single processor circuit, and by developing optimizing compilers that schedule operations to be executed by the processor in an efficient manner. As further increases in clock frequency become more difficult to achieve, designers have embraced the multiple execution unit processor as a means of achieving enhanced DSP performance. For example, FIG. 2 shows a block diagram of the CPU data paths of a DSP having eight execution units, L1, S1, M1, D1, L2, S2, M2, and D2. These execution units operate in parallel to perform multiple operations, such as addition, multiplication, addressing, logic functions, and data storage and retrieval, simultaneously.
In theory, the peak performance of a multiple execution unit processor is proportional to the number of execution units available. In practice, realizing this advantage depends on scheduling operations efficiently, so that most of the execution units have a task to perform in each clock cycle. Efficient scheduling is particularly important for looped instructions, since in a typical runtime application the processor spends the majority of its time executing loops.
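As a rough illustration of why scheduling governs utilization, the following Python sketch greedily packs operations into cycles for a hypothetical machine with two units of each of four kinds, mirroring the L, S, M, and D naming of FIG. 2. The function names, the assumption that every operation is independent and completes in one cycle, and the tagging of each operation with a single unit kind are illustrative assumptions, not details of the processor described here.

```python
from collections import Counter

# Two units of each kind, following the unit names in the FIG. 2 example.
UNITS_PER_KIND = {"L": 2, "S": 2, "M": 2, "D": 2}
TOTAL_UNITS = sum(UNITS_PER_KIND.values())  # 8 execution units


def pack_cycles(op_kinds):
    """Greedily pack operations into cycles.

    op_kinds: list of unit kinds, one entry per operation, e.g. ["M", "L"].
    Assumption (hypothetical): operations have no data dependencies and
    each one occupies a single unit for a single cycle.
    Returns a list of Counters, one per cycle, giving units used per kind.
    """
    unknown = set(op_kinds) - set(UNITS_PER_KIND)
    if unknown:
        raise ValueError(f"unknown unit kinds: {sorted(unknown)}")

    remaining = Counter(op_kinds)
    cycles = []
    while sum(remaining.values()) > 0:
        issued = Counter()
        for kind, capacity in UNITS_PER_KIND.items():
            # Issue as many operations of this kind as there are units for it.
            count = min(capacity, remaining[kind])
            if count:
                issued[kind] = count
                remaining[kind] -= count
        cycles.append(issued)
    return cycles


def utilization(cycles):
    """Fraction of unit slots that did useful work across all cycles."""
    if not cycles:
        return 0.0
    busy = sum(sum(cycle.values()) for cycle in cycles)
    return busy / (TOTAL_UNITS * len(cycles))


balanced = ["L", "L", "S", "S", "M", "M", "D", "D"]  # one op per unit
skewed = ["M"] * 8                                   # all ops need an M unit

balanced_cycles = pack_cycles(balanced)
skewed_cycles = pack_cycles(skewed)
```

In this sketch, eight operations spread evenly across the unit kinds fill one cycle at full utilization, while eight operations that all need the same kind of unit take four cycles and leave three quarters of the unit slots idle.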
Unfortunately, the inclusion of multiple execution units also creates new architectural bottlenecks. Increased functionality translates into longer instructions, such as may be found in very long instruction word (VLIW) architectures. For example, the eight-execution-unit VLIW processor described above may require a 256-bit instruction every clock cycle in order to perform tasks on all execution units. Because it is generally neither practical nor desirable to provide, e.g., a 256-bit-wide parallel data path external to the processor merely for instruction retrieval, the data rate available for loading instructions may become the overall limiting factor in many applications. An object of the present invention is to resolve this bottleneck.
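The scale of the instruction-fetch problem described above can be made concrete with a short Python calculation. This is a back-of-the-envelope sketch only: the 32-bit per-unit instruction width is inferred from dividing 256 bits across eight units, and the 200 MHz clock and 32-bit external bus are hypothetical values chosen for illustration, not figures from this description.

```python
EXECUTION_UNITS = 8
BITS_PER_UNIT_INSTRUCTION = 32  # assumption: 256 bits / 8 units


def instruction_word_bits(units=EXECUTION_UNITS,
                          bits_per_unit=BITS_PER_UNIT_INSTRUCTION):
    """Width of one instruction word that feeds every execution unit."""
    return units * bits_per_unit


def fetch_bandwidth_bytes_per_second(clock_hz, word_bits):
    """Bytes per second needed to supply one instruction word per cycle."""
    return clock_hz * word_bits // 8


def cycles_to_fetch(word_bits, bus_bits):
    """Bus transfers needed to load one instruction word (ceiling division)."""
    return -(-word_bits // bus_bits)


word_bits = instruction_word_bits()

# Hypothetical 200 MHz clock, used only to give the bandwidth a scale.
required_bandwidth = fetch_bandwidth_bytes_per_second(200_000_000, word_bits)

# Hypothetical 32-bit external bus versus a full-width 256-bit path.
narrow_bus_cycles = cycles_to_fetch(word_bits, 32)
wide_bus_cycles = cycles_to_fetch(word_bits, 256)
```

Under these assumed values, the processor would need 6.4 billion bytes of instructions per second to keep all eight units busy, and a 32-bit external bus would need eight transfers to deliver a single 256-bit instruction word.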