This invention relates generally to the instruction set architectures (ISA) of processors. More particularly, the invention relates to instruction set architectures for the execution of operations within a signal processing integrated circuit.
Single chip digital signal processing devices (DSP) are relatively well known. DSPs generally are distinguished from general purpose microprocessors in that DSPs typically support accelerated arithmetic operations by including a dedicated multiplier and accumulator (MAC) for performing multiplication of digital numbers. The instruction set for a typical DSP device usually includes a MAC instruction for performing multiplication of new operands and addition with a prior accumulated value stored within an accumulator register. A MAC instruction is typically the only instruction provided in prior art digital signal processors where two DSP operations, multiply followed by add, are performed by the execution of one instruction. However, when performing signal processing functions on data it is often desirable to perform other DSP operations in varying combinations.
An area where DSPs may be utilized is in telecommunication systems. One use of DSPs in telecommunication systems is digital filtering. In this case a DSP is typically programmed with instructions to implement some filter function in the digital or time domain. The mathematical algorithm for a typical finite impulse response (FIR) filter may look like the equation Yn=h0X0+h1X1+h2X2+ . . . +hNXN where hn are fixed filter coefficients numbering from 1 to N and Xn are the data samples. The equation Yn may be evaluated by using a software program. However in some applications, it is necessary that the equation be evaluated as fast as possible. One way to do this is to perform the computations using hardware components such as a DSP device programmed to compute the equation Yn. In order to further speed the process, it is desirable to vectorize the equation and distribute the computation amongst multiple DSPs such that the final result is obtained more quickly. The multiple DSPs operate in parallel to speed the computation process. In this case, the multiplication of terms is spread across the multipliers of the DSPs equally for simultaneous computations of terms. The adding of terms is similarly spread equally across the adders of the DSPs for simultaneous computations. In vectorized processing, the order of processing terms is unimportant since the combination is associative. If the processing order of the terms is altered, it has no effect on the final result expected in a vectorized processing of a function.
In typical micro processors, a MAC operation would require a multiply instruction and an add instruction to perform both multiplication and addition. To perform these two instructions would require two processing cycles. Additionally, a program written for the typical micro processor would require a larger program memory in order to store the extra instructions necessary to perform the MAC operation. In prior art DSP devices, if a DSP operation other than a MAC DSP instruction needs to be performed, the operation requires separate arithmetic instructions programmed into program memory. These separate arithmetic instructions in prior art DSPs similarly require increased program memory space and processing cycles to perform the operation when compared to a single MAC instruction. It is desirable to reduce the number of processing cycles when performing DSP operations. It is desirable to reduce program memory requirements as well.
DSPs are often programmed in a loop to continuously perform accelerated arithmetic functions including a MAC instruction using different operands. Often times, multiple arithmetic instructions are programmed in a loop to operate on the same data set. The same arithmetic instruction is often executed over and over in a loop using different operands. Additionally, each time one instruction is completed, another instruction is fetched from the program stored in memory during a fetch cycle. Fetch cycles require one or more cycle times to access a memory before instruction execution occurs. Because circuits change state during a fetch cycle, power is consumed and thus it is desirable to reduce the number of fetch cycles. Typically, approximately twenty percent of power consumption may be utilized in the set up and clean up operations of a loop in order to execute DSP instructions. Typically, the loop execution where signal processing of data is performed consumes approximately eighty percent of power consumption with a significant portion being due to instruction fetching. Additionally, because data sets that a DSP device processes are usually large, it is also desirable to speed instruction execution by avoiding frequent fetch cycles to memory.
Additionally, the quality of service over a telephone system often relates to the processing speed of signals. That is particularly the case when a DSP is to provide voice processing, such as voice compression, voice decompression, and echo cancellation for multiple channels. More recently, processing speed has become even more important because of the desire to transmit voice aggregated with data in a packetized form for communication over packetized networks. Delays in processing the packetized voice signal tend to result in the degradation of signal quality on receiving ends.
It is desirable to provide improved processing of voice and data signals to enhance the quality of voice and data communication over packetized networks. It is desirable to improve the efficiency of using computing resources when performing signal processing functions.
Briefly, the present invention includes an apparatus, method, instruction set architecture, and system as described in the claims. Multiple Application Specific Signal Processors (ASSPs) having the instruction set architecture of the present invention are provided within gateways in communication systems to provide improved voice and data communication over a packetized network. Each ASSP includes a serial interface, a buffer memory, and multiple core processors each to simultaneously process multiple channels of voice or data. Each core processor preferably includes at least one reduced instruction set computer (RISC) processor and multiple signal processing units (SPs). Each SP includes multiple arithmetic blocks to simultaneously process multiple voice and data communication signal samples for communication over IP, ATM, Frame Relay or other packetized network. The multiple signal processing units can execute the digital signal processing algorithms in parallel. Each ASSP is flexible and can be programmed to perform many network functions or data/voice processing functions, including voice and data compression/decompression in telecommunications systems (such as CODECs) particularly packetized telecommunication networks, simply by altering the software program controlling the commands executed by the ASSP.
An instruction set architecture for the ASSP is tailored to digital signal processing applications including audio and speech processing such as compression/decompression and echo cancellation. The instruction set architecture implemented with the ASSP, is adapted to DSP algorithmic structures. This adaptation of the ISA of the present invention to DSP algorithmic structures balances the ease of implementation, processing efficiency, and programmability of DSP algorithms. The instruction set architecture may be viewed as being two component parts, one (RISC ISA) corresponding to the RISC control unit and another (DSP ISA) to the DSP datapaths of the signal processing units. The RISC ISA is a register based architecture including 16-registers within the register file, while the DSP ISA is a memory based architecture with efficient digital signal processing instructions.
The instruction word for the ASSP can be 20 bits, or can be expanded to 40 bits. The 40-bit instruction word can be used to control two instructions to be executed in series or parallel, such as two RISC control instructions, extended DSP instructions, or two 20-bit DSP instructions. The instruction set architecture of the ASSP has five distinct types of instructions to optimize the DSP operational mix. These are (1) a 20-bit DSP instruction that uses mode bits in control registers (i.e. mode registers), (2) a 40-bit DSP instruction having control extensions that can override mode registers, (3) a 20-bit dyadic DSP instruction, (4) a 40-bit DSP instruction that extends the capabilities of a 20-bit dyadic DSP instruction by providing powerful bit manipulation, and 5) a single 40-bit DSP instruction that includes a pair of 20-bit dyadic sub-instructions: a primary DSP sub-instruction and a shadow DSP sub-instruction that are executed simultaneously according to one embodiment of the present invention.
These instructions are for accelerating calculations within the core processor of the type where D=[(A op1 B) op2 C] and each of xe2x80x9cop1xe2x80x9d and xe2x80x9cop2xe2x80x9d can be a multiply, add, extremum (min/max) or other primitive DSP class of operation on the three operands A, B, and C. The ISA of the ASSP which accelerates these calculations allows efficient chaining of different combinations of operations.
All DSP instructions of the instruction set architecture of the ASSP are dyadic DSP instructions to execute two operations in one instruction with one cycle throughput. A dyadic DSP instruction is a combination of two basic DSP operations in one instruction and includes a main DSP operation (MAIN OP) and a sub DSP operation (SUB OP). Generally, the instruction set architecture of the present invention can be generalized to combining any pair of basic DSP operations to provide very powerful dyadic instruction combinations. The DSP arithmetic instructions or operations in the preferred embodiment include a multiply instruction (MULT), an addition instruction (ADD), a minimize/maximize instruction (MIN/MAX) also referred to as an extrema instruction, and a no operation instruction (NOP) each having an associated operation code (xe2x80x9copcodexe2x80x9d).
The present invention efficiently executes these dyadic DSP instructions by means of the instruction set architecture and the hardware architecture of the application specific signal processor. For example, the DSP instructions can process vector data or scalar data automatically using a single instruction and provide the appropriate vector or scalar output results.
In one embodiment of the present invention, a single DSP instruction includes a pair of sub-instructions: a primary DSP sub-instruction and a shadow DSP sub-instruction. Both the primary and the shadow DSP sub-instructions are dyadic DSP instructions performing two operations in one instruction cycle. These DSP operations include the MULT, ADD, MIN/MAX, and NOP operations as previously described.
According to one embodiment, each core processor has a plurality of signal processing units (SPs). Each of the SPs includes a primary stage to execute a primary DSP sub-instruction based upon current data and a shadow stage to simultaneously execute a shadow DSP sub-instruction based upon delayed data stored locally within delayed data registers of each SP. Each SP includes a shadow selector coupled to the delayed data registers of its own associated SP. Control logic is utilized to control the shadow selectors of each SP to select delayed data (specified by the shadow DSP sub-instruction) from the delayed data registers for use by the shadows stages of the SPs.
In this way, the present invention efficiently executes DSP instructions by simultaneously executing primary DSP sub-instructions (based upon current data) and shadow DSP sub-instructions (based upon delayed locally stored data) with a single DSP instruction thereby performing four operations per single instruction cycle per SP, without increasing the memory bandwidth.