The present invention relates to digital signal processors. More specifically, the present invention relates to digital signal processing using highly parallel, highly pipelined processing techniques.
Digital Signal Processors (DSPs) are generally used for real-time processing of digital signals. A digital signal is typically a series of numbers, or digital values, used to represent a corresponding analog signal. DSPs are used in a wide variety of applications, including audio systems such as compact disc players and wireless communication systems such as cellular telephones.
A DSP is often considered to be a specialized form of microprocessor. Like a microprocessor, a DSP is typically implemented on a silicon-based semiconductor integrated circuit. Additionally, as with microprocessors, the computing power of DSPs is enhanced by using reduced instruction set computing (RISC) techniques. RISC techniques include using a smaller number of like-sized instructions to control the operation of the DSP, where each instruction is executed in the same amount of time. The use of RISC techniques increases the rate at which instructions are performed, or the clock rate, as well as the amount of instruction pipelining within the DSP. This increases the overall computing power of the DSP.
Configuring a DSP using RISC techniques also creates undesirable characteristics. In particular, RISC-based DSPs execute a greater number of instructions to perform a given task. Executing additional instructions increases the power consumption of the DSP, even though the time to execute those instructions decreases due to the higher clock speed of a RISC-based DSP. Additionally, using a greater number of instructions increases the size of the on-chip instruction memory within the DSP. Memory structures occupy a substantial portion of the circuit area within a DSP (often more than 50% of the total), which increases the size and cost of the DSP. Thus, RISC-based DSPs are less than ideal for low-cost, low-power applications such as digital cellular telephony or other types of battery-operated wireless communication systems.
FIG. 1 is a highly simplified block diagram of a digital signal processor configured in accordance with the prior art. Arithmetic logic unit (ALU) 16 is coupled to ALU register bank 17, and multiply-accumulate (MAC) circuit 26 is coupled to MAC register bank 27. Data bus 20 couples MAC register bank 27, ALU register bank 17, and on-chip data memory 10. Instruction bus 22 couples MAC register bank 27, ALU register bank 17, and on-chip instruction memory 12. Instruction decode 18 is coupled to MAC 26 and ALU 16, and in some prior art systems instruction decode 18 is coupled directly to instruction memory 12. Data memory 10 is also coupled to data interface 11, and instruction memory 12 is also coupled to instruction interface 13. Data interface 11 and instruction interface 13 exchange data and instructions with off-chip memory 6.
During operation, the instructions in instruction memory 12 are decoded by instruction decode 18. In response, instruction decode 18 generates internal control signals that are applied to ALU 16 and MAC 26. The control signals typically cause data to be exchanged between ALU register bank 17 and either data memory 10 or instruction memory 12. The control signals also cause data to be exchanged between MAC register bank 27 and either instruction memory 12 or data memory 10. Additionally, the control signals cause ALU 16 and MAC 26 to perform various operations on the data stored in ALU register bank 17 and MAC register bank 27, respectively.
In an exemplary operation, instruction memory 12 may contain coefficient data for use by ALU 16 and MAC 26, and data memory 10 may contain the data to be processed (signal data). The coefficient data may, for example, implement a frequency filter, which is a common use of a DSP. As the filtering is performed, both the signal data from data memory 10 and the coefficient data from instruction memory 12 are read into MAC register bank 27. Additional instruction data within instruction memory 12 is also applied to instruction decode 18, either through instruction bus 22 or through a direct connection. This additional instruction data specifies the operation to be performed by MAC 26. The results generated by MAC 26 are typically written back to data memory 10.
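The filtering described above reduces to one multiply-accumulate per coefficient for each output sample, y[n] = c[0]x[n] + c[1]x[n-1] + ... + c[K-1]x[n-K+1]. The following C sketch is illustrative only: it assumes 16-bit fixed-point signal and coefficient data and uses a 64-bit accumulator to stand in for the MAC accumulator, and the function name fir_filter is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Computes len output samples of an FIR filter:
 *   y[n] = sum over k of coef[k] * x[n - k],
 * treating input samples before x[0] as zero.
 * Assumption: 16-bit fixed-point operands; the 64-bit accumulator models a
 * MAC accumulator with enough guard bits that the sum does not overflow for
 * realistic tap counts. */
void fir_filter(const int16_t *coef, size_t taps,
                const int16_t *x, int64_t *y, size_t len)
{
    assert(taps > 0);
    for (size_t n = 0; n < len; ++n) {
        int64_t acc = 0;                       /* clear the accumulator */
        for (size_t k = 0; k < taps && k <= n; ++k) {
            /* one multiply-accumulate: coefficient operand times signal operand */
            acc += (int64_t)coef[k] * x[n - k];
        }
        y[n] = acc;                            /* result written back */
    }
}
```

Each pass of the inner loop needs one coefficient operand and one signal operand, which is why, in the arrangement of FIG. 1, every filter tap draws on both instruction memory 12 and data memory 10.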
Many processing inefficiencies result from this prior art arrangement. These inefficiencies include bus, or access, contention for instruction memory 12, which must supply instruction data to both MAC register bank 27 and instruction decode 18, as well as bus, or access, contention for data memory 10, which must both read out signal data and write in output data. Additionally, in many instances, further processing of the output data must be performed by ALU 16. This aggravates contention for data memory 10, and therefore for data bus 20, because the output data must be written from MAC register bank 27 into data memory 10 and then read out to ALU register bank 17. These read and write operations are performed over data bus 20 and therefore consume additional bus cycles. Such inefficiencies reduce the processing performance of the DSP.
The present invention seeks to improve the performance and usefulness of a DSP by addressing the problems and inefficiencies listed above, as well as by providing other features and improvements described throughout the application.
The present invention is a novel and improved method and circuit for digital signal processing. One aspect of the invention calls for the use of a variable-length instruction set. Some of the variable-length instructions may be stored in adjacent locations within memory space, with the beginning and end of an instruction allowed to fall in different memory words, so that instructions span memory word boundaries. Additional aspects of the invention are realized by having instructions contain variable numbers of instruction fragments. Each instruction fragment causes a particular operation, or operations, to be performed, so that multiple operations are performed during each clock cycle, reducing the total number of clock cycles necessary to perform a task.
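To illustrate how an instruction with a variable number of fragments might be decoded, the sketch below uses a made-up encoding that is not taken from this description: a 16-bit header whose low two bits hold the fragment count minus one, followed by one 16-bit unit per fragment whose upper four bits select a functional unit. The names decode_instruction, operation and the UNIT_ constants are hypothetical. All fragments of one instruction are treated as issuing in the same clock cycle, and the decoder returns the instruction length so that the next instruction can be located.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding used only for this illustration:
 *   unit 0       header; bits 1..0 hold (fragment count - 1), so 1 to 4 fragments
 *   units 1..n   one fragment each; bits 15..12 select the functional unit,
 *                bits 11..0 carry operation-specific fields */
enum func_unit { UNIT_ALU = 1, UNIT_MAC = 2, UNIT_LOAD = 3 };

typedef struct {
    unsigned unit;     /* functional unit that performs the operation */
    unsigned payload;  /* remaining fields, left undecoded here */
} operation;

/* Decodes the instruction starting at code[0] into ops[] (all of which issue
 * in the same clock cycle), stores the operation count in *n_ops, and returns
 * the instruction length in 16-bit units so the caller can find the next one. */
size_t decode_instruction(const uint16_t *code, operation ops[4], size_t *n_ops)
{
    size_t count = (size_t)(code[0] & 0x3u) + 1u;
    for (size_t i = 0; i < count; ++i) {
        uint16_t frag = code[1 + i];
        ops[i].unit = (unsigned)(frag >> 12);
        ops[i].payload = (unsigned)(frag & 0x0FFFu);
    }
    *n_ops = count;
    return 1u + count;  /* header plus fragments: the length varies */
}
```

Because the length depends on the fragment count, a two-operation instruction occupies three units while a one-operation instruction occupies two, and no space is spent on unused operation slots.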
The exemplary DSP includes a set of three data buses over which data may be exchanged between a register bank and three data memories. The use of more than two data buses, and especially three data buses, realizes another aspect of the invention: significantly reduced bus contention. One embodiment of the invention calls for the data buses to include one wide bus and two narrow buses. The wide bus is coupled to a wide data memory, and the two narrow buses are coupled to two narrow data memories.
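The effect of additional buses on contention can be estimated with an idealized model, sketched below in C. The model is an assumption made for illustration, not a description of the exemplary DSP's timing: each bus moves one word per cycle, buses operate fully in parallel, and an FIR filter generates three transfer streams (coefficient reads, signal-sample reads, and result writes). The names fir_traffic, bus_cycles and MAX_BUSES are hypothetical.

```c
#include <stddef.h>

enum { STREAM_COEF, STREAM_SAMPLE, STREAM_RESULT, N_STREAMS };
#define MAX_BUSES 8

/* Word transfers needed by an FIR filter: one coefficient read and one sample
 * read per tap per output, plus one result write per output. */
void fir_traffic(size_t taps, size_t outputs, size_t count[N_STREAMS])
{
    count[STREAM_COEF] = taps * outputs;
    count[STREAM_SAMPLE] = taps * outputs;
    count[STREAM_RESULT] = outputs;
}

/* Idealized cycle count. bus_of[s] is the bus (0 .. n_buses-1) that carries
 * stream s. Each bus moves one word per cycle and buses run fully in parallel,
 * so the most heavily loaded bus sets the total. Requires n_buses <= MAX_BUSES. */
size_t bus_cycles(const size_t count[N_STREAMS],
                  const unsigned bus_of[N_STREAMS], unsigned n_buses)
{
    size_t load[MAX_BUSES] = {0};
    size_t busiest = 0;
    for (int s = 0; s < N_STREAMS; ++s)
        load[bus_of[s]] += count[s];
    for (unsigned b = 0; b < n_buses; ++b)
        if (load[b] > busiest)
            busiest = load[b];
    return busiest;
}
```

Under this model, a 16-tap filter producing 100 outputs needs 3300 cycles when all three streams share one bus, but 1600 cycles when each stream has its own bus, because the operand reads and result writes no longer wait for one another.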
Another aspect of the invention is realized by the use of a register bank that has registers accessible by at least two processing units. This allows multiple operations to be performed on a particular set of data by the multiple processing units without reading the data from, and writing it to, a memory. The processing units in the exemplary embodiment of the invention include an arithmetic logic unit (ALU) and a multiply-accumulate (MAC) unit. When the shared register bank is combined with the multiple-bus architecture, the highly parallel instructions, or both, an additional aspect of the invention is realized, in which highly pipelined, multi-operation processing is performed.
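The benefit of a register bank shared by the MAC and the ALU can be illustrated by counting the bus transfers spent handing MAC results to the ALU. In the C sketch below, the ALU step (scaling a sum of 16-bit fixed-point products back to 16 bits and saturating) and the names postprocess and alu_scale_saturate are hypothetical choices made for illustration. With separate register banks, as in FIG. 1, each result takes a write to data memory and a read back; with a shared bank, the ALU reads the register the MAC wrote.

```c
#include <stddef.h>
#include <stdint.h>

/* ALU post-processing chosen for illustration: scale a MAC sum of Q15 x Q15
 * products back to Q15 (truncating toward zero) and saturate to 16 bits. */
static int16_t alu_scale_saturate(int64_t acc)
{
    int64_t v = acc / 32768;
    if (v > INT16_MAX) v = INT16_MAX;
    if (v < INT16_MIN) v = INT16_MIN;
    return (int16_t)v;
}

/* Applies the ALU step to n MAC results and returns the number of data-bus
 * transfers spent moving those intermediate results between units.
 *   shared_regs == 0: separate MAC and ALU register banks, so each result is
 *                     written to data memory and read back.
 *   shared_regs != 0: the ALU reads the register the MAC wrote; no bus cycle. */
size_t postprocess(const int64_t *mac_results, int16_t *out, size_t n,
                   int shared_regs, int64_t *data_memory)
{
    size_t transfers = 0;
    for (size_t i = 0; i < n; ++i) {
        int64_t operand;
        if (shared_regs) {
            operand = mac_results[i];          /* same register, both units */
        } else {
            data_memory[i] = mac_results[i];   /* MAC bank -> data memory */
            operand = data_memory[i];          /* data memory -> ALU bank */
            transfers += 2;
        }
        out[i] = alu_scale_saturate(operand);
    }
    return transfers;
}
```

The computed outputs are identical in both modes; only the two intermediate transfers per result disappear when the register bank is shared.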
Other aspects of the invention are realized by including an instruction fetch unit that receives variable-length instructions stored in an instruction memory. Still another aspect of the invention is realized by an instruction memory that is separate from the set of three data memories. An instruction decoder decodes the instructions from the instruction memory and generates control signals that cause data to be exchanged among the various registers, data memories, and functional units, allowing multiple operations to be performed during each clock cycle.
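One way a fetch unit could extract variable-length instructions that are stored back to back, including instructions that straddle a memory word boundary, is sketched below in C. The 32-bit word width, the most-significant-bit-first ordering, and the 2-bit length code at the start of each instruction (16, 24, or 32 bits) are assumptions for illustration only; fetch_bits, instruction_length and fetch_instruction are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Reads nbits (1..32) bits starting at bit offset pos of a packed
 * instruction stream, where bit 0 is the most significant bit of words[0].
 * The field may straddle a 32-bit word boundary. */
uint32_t fetch_bits(const uint32_t *words, size_t pos, unsigned nbits)
{
    size_t w = pos / 32;
    unsigned off = (unsigned)(pos % 32);
    /* Place the current word, and the next one if needed, in a 64-bit window. */
    uint64_t window = (uint64_t)words[w] << 32;
    if (off + nbits > 32)
        window |= words[w + 1];
    /* Drop the bits before the field, then right-align the field. */
    return (uint32_t)((window << off) >> (64 - nbits));
}

/* Hypothetical length code in the first 2 bits of every instruction:
 * 00 -> 16 bits, 01 -> 24 bits, 10 -> 32 bits (11 treated as 32 here). */
static unsigned instruction_length(uint32_t code)
{
    static const unsigned len[4] = {16, 24, 32, 32};
    return len[code & 3u];
}

/* Fetches the instruction at bit offset *pc, advances *pc past it, stores its
 * length in *len_out, and returns its bits right-aligned. */
uint32_t fetch_instruction(const uint32_t *words, size_t *pc, unsigned *len_out)
{
    unsigned len = instruction_length(fetch_bits(words, *pc, 2));
    uint32_t insn = fetch_bits(words, *pc, len);
    *pc += len;
    *len_out = len;
    return insn;
}
```

For example, a 16-bit, a 24-bit, and a 32-bit instruction packed into the words 0x12344ABC, 0xDE89ABCD, 0xEF000000 are fetched as 0x1234, 0x4ABCDE, and 0x89ABCDEF; the second and third each span two memory words.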
Additionally, the various aspects of the invention combine synergistically to provide unexpected and desirable results. For example, the use of variable-length instructions that are stored consecutively within memory reduces the size of the instruction memory and therefore the necessary circuit area of the DSP. This reduction facilitates adding multiple data buses to the DSP, as well as adding registers that are accessible by multiple processing units, increasing the overall performance of the DSP. Other synergistic benefits provided by the combination of the various aspects of the invention are apparent, and are described in greater detail below.
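The memory saving from consecutive storage can be estimated by comparing the number of instruction memory words needed when every instruction is padded to the widest format with the number needed when instructions are packed back to back. The C sketch below performs that arithmetic; the word width and instruction lengths passed to it are hypothetical inputs, and the function names are illustrative.

```c
#include <stddef.h>

/* Memory words needed if every instruction occupies a fixed slot of
 * slot_bits bits (the widest instruction format), rounded up to whole words. */
size_t words_fixed(size_t n_insns, unsigned slot_bits, unsigned word_bits)
{
    size_t total = n_insns * slot_bits;
    return (total + word_bits - 1) / word_bits;
}

/* Memory words needed when instructions of the given lengths are stored
 * consecutively, allowed to cross word boundaries, rounded up to whole words. */
size_t words_packed(const unsigned *lengths_bits, size_t n_insns, unsigned word_bits)
{
    size_t total = 0;
    for (size_t i = 0; i < n_insns; ++i)
        total += lengths_bits[i];
    return (total + word_bits - 1) / word_bits;
}
```

For instance, four instructions of 16, 24, 32, and 16 bits need four 32-bit words when each is padded to 32 bits, but three words (88 bits rounded up) when packed; the actual saving depends on the mix of instruction lengths in a given program.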