1. Field of the Invention
This invention pertains in general to microprocessors and in particular to a processor for performing media operations. More particularly, this invention pertains to a processor having a modified Harvard memory architecture and performing streaming ALU operations.
2. Description of Background Art
There is a general desire to increase the speed of computer processors. This desire is especially acute in the field of media processing, including digital signal processing (DSP), graphics processing, audio processing, and video processing. In this field, a typical algorithm will process a large number of elements while using each element only once. For example, a typical DSP filter loop will process a large number of different input samples and apply a different operand from a large coefficient array to each input sample. Thus, a typical cycle of operation in such a processing environment requires 1) determining the addresses of the sample and coefficient; 2) retrieving the sample and coefficient from memory; 3) operating on the sample and coefficient; and 4) storing the result of the operation.
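By way of illustration only, the four-step cycle described above may be sketched in C as follows. The function name `apply_coefficients`, the 16-bit sample width, and the choice of multiplication as the operation are assumptions made for this sketch and are not taken from any particular processor or algorithm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: applies one coefficient to each input sample.
   Every element is read once, used once, and never revisited. */
void apply_coefficients(const int16_t *samples, const int16_t *coeffs,
                        int32_t *out, size_t n)
{
    assert(samples && coeffs && out);
    for (size_t i = 0; i < n; i++) {
        /* Step 1: determine the addresses of the sample and coefficient. */
        const int16_t *sample_addr = &samples[i];
        const int16_t *coeff_addr  = &coeffs[i];

        /* Step 2: retrieve the sample and coefficient from memory. */
        int16_t sample = *sample_addr;
        int16_t coeff  = *coeff_addr;

        /* Step 3: operate on the sample and coefficient
           (multiplication is an assumed example operation). */
        int32_t result = (int32_t)sample * coeff;

        /* Step 4: store the result of the operation. */
        out[i] = result;
    }
}
```

Each pass through the loop body touches memory three times (two reads and one write) for a single arithmetic operation, which is why memory access dominates the cost of such loops.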
To meet the demands of media processing, various techniques have been employed to either increase the clock speed of the processor or improve the efficiency of instruction processing by the processor. One such technique is the reduced instruction set computer (RISC) architecture. A RISC architecture uses a low-complexity design adapted to handle a small set of simple instructions in order to obtain high speed and high performance. In addition, a RISC architecture uses fixed-length instructions with very few instruction formats and fixed positions for certain operand fields, such as register indices, within the instruction format. This architecture allows for low-complexity instruction decoders and control logic, and this lower complexity can be leveraged into increased performance from other parts of the processor.
Another way to reduce complexity in a RISC-based processor is to decouple arithmetic logic unit (ALU) functions from the operand movement between the register file and memory. This decoupling results in a load-store architecture wherein memory accesses are allowed only by explicit loads and stores. Consequently, this load-store architecture results in an expansion of code size because an instruction must explicitly call for each memory access. In order to perform the DSP steps discussed above, therefore, the RISC processor must issue two loads and a store, in addition to the ALU instruction itself, for a single iteration of a DSP filter loop.
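The instruction count of such a load-store sequence can be made concrete with a toy interpreter, sketched below in C. The three-opcode instruction set, the `insn_t` layout, and the function names `run` and `run_filter_iteration` are hypothetical and invented for this sketch; they do not correspond to any real RISC instruction set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy instruction set; not taken from any real processor. */
typedef enum { OP_LOAD, OP_MUL, OP_STORE } opcode_t;

typedef struct {
    opcode_t op;
    int rd;       /* destination register (LOAD, MUL) or source register (STORE) */
    int ra, rb;   /* source registers for MUL; unused otherwise */
    size_t addr;  /* memory address for LOAD and STORE; unused by MUL */
} insn_t;

/* Executes n instructions against a register file and a flat memory.
   Returns the number of instructions that accessed memory. */
size_t run(const insn_t *prog, size_t n, int32_t *regs, int32_t *mem)
{
    assert(prog && regs && mem);
    size_t mem_accesses = 0;
    for (size_t i = 0; i < n; i++) {
        const insn_t *in = &prog[i];
        switch (in->op) {
        case OP_LOAD:   /* only explicit loads may read memory */
            regs[in->rd] = mem[in->addr];
            mem_accesses++;
            break;
        case OP_MUL:    /* the ALU sees registers only */
            regs[in->rd] = regs[in->ra] * regs[in->rb];
            break;
        case OP_STORE:  /* only explicit stores may write memory */
            mem[in->addr] = regs[in->rd];
            mem_accesses++;
            break;
        }
    }
    return mem_accesses;
}

/* Builds and runs one filter iteration: mem[2] = mem[0] * mem[1]. */
size_t run_filter_iteration(int32_t *mem)
{
    int32_t regs[3] = {0};
    const insn_t prog[] = {
        { .op = OP_LOAD,  .rd = 0, .addr = 0 },         /* load the sample */
        { .op = OP_LOAD,  .rd = 1, .addr = 1 },         /* load the coefficient */
        { .op = OP_MUL,   .rd = 2, .ra = 0, .rb = 1 },  /* operate */
        { .op = OP_STORE, .rd = 2, .addr = 2 },         /* store the result */
    };
    return run(prog, sizeof prog / sizeof prog[0], regs, mem);
}
```

In this sketch, one iteration occupies four instruction slots, three of which exist solely to move operands between memory and the register file.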
Unlike RISC processors, many complex instruction set computer (CISC) processors have instructions allowing simultaneous memory accesses from two different memory locations. Using these instructions, a programmer can retrieve both an input sample and a coefficient in a single instruction cycle. In addition, such processors allow simultaneous ALU operation. Thus, the number of instructions necessary to perform a DSP filter loop is reduced.
To achieve this functional parallelism, however, CISC processors generally have complex instruction decoding schemes in which several operands are implicit for the instruction and the load/store mechanism is coupled to the ALU operation encoding. Accordingly, few operations allow parallel load-stores. Even those operations, moreover, are limited to combinations to and from a small register file and support only a limited subset of parallel ALU operations. In addition, such processors require complex instruction decoders and, therefore, have lower clock speeds.
Another technique for increasing processor efficiency is superscalar instruction scheduling. Processors supporting superscalar instruction scheduling dynamically extract instruction-level parallelism from the instruction stream and then group loads and stores with ALU operations. In this manner, the instructions can utilize parallel functional units in the processor. However, such processors are highly complex in terms of design and size.
Yet another approach used to increase processing efficiency is the very long instruction word (VLIW) format. The VLIW format explicitly encodes instruction-level parallelism into a very long instruction word. The VLIW typically has fields for frequently performed operations, such as ALU operations and memory accesses. By using VLIW, the instructions required for a DSP filter loop can be incorporated into a single instruction word. Moreover, the VLIW format allows use of a low-complexity decoder and has the potential for high performance by parallelizing the use of multiple functional units within the processor.
The VLIW format, however, essentially demands that parallelism in the instruction stream be determined when the program is compiled. This demand results in an extremely complex programming model and, accordingly, makes program compilation difficult. Thus, the gains made in processing efficiency by using the VLIW format are offset by the compile-time difficulties.