The present invention relates to programmable processor systems, such as digital signal processor systems, and more particularly, to methods and apparatus for achieving the high processing rates required for certain algorithms, rates that are currently achieved only by dedicated hardware.
Currently available digital signal processors are highly programmable, but they do not provide sufficient performance for many applications, since the digital signal processor is optimized for a data width of 16 bits or higher precision. Thus, to achieve the higher processing rates required for certain algorithms, which can exceed the capabilities of commercially available digital signal processors by more than an order of magnitude, a number of digital signal processor systems, such as receivers in a wireless local area network (LAN) or a wideband CDMA network, have implemented such algorithms in dedicated application specific logic or in dedicated coprocessors. Specifically, algorithms requiring low precision and relatively high data rates, such as certain types of finite impulse response (FIR), correlation and Viterbi computations, have been implemented in such application specific integrated circuits (ASICs) or coprocessors.
For example, in a typical wireless LAN channel matched filter performing FIR computations, approximately 500 million multiply-add calculations (MACs) per second are required. Meanwhile, the required input and output precision for such FIR computations is only five bits and nine bits, respectively. Likewise, in a wireless LAN correlator, the incoming bit stream must be correlated with the original Barker code sequence, in a well-known manner. Such correlation computations require about 900 million multiply-add calculations (MACs) per second. Since the Barker code is only a one-bit sequence (with each value being either +1 or -1), the multipliers implement relatively simple operations. Finally, Viterbi decoders in wideband CDMA or IS-95 receivers have increasingly high bit rates and an increased constraint length of the convolutional code. Meanwhile, a branch metric in such a Viterbi decoder can be represented by less than eight bits (even for soft decision decoding) and no more than 32 branch metrics need to be stored for a complete update of the required 256 states.
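By way of illustration only, the following sketch shows why correlation against a one-bit code needs no true multipliers: each "multiplication" reduces to adding or subtracting the input sample. The function name `correlate` and the particular 11-chip sequence in `BARKER_11` are assumptions made for this example; the text above does not specify the code length or chip values.

```python
# Assumed 11-chip Barker sequence, used here for illustration only.
BARKER_11 = [1, -1, 1, 1, -1, 1, 1, 1, -1, -1, -1]

def correlate(samples, code=BARKER_11):
    """Slide the code over the samples and return one sum per offset.

    Because every chip is +1 or -1, each product is formed by either
    adding or subtracting the sample; no multiplier is involved.
    """
    n = len(code)
    out = []
    for start in range(len(samples) - n + 1):
        acc = 0
        for chip, x in zip(code, samples[start:start + n]):
            acc += x if chip > 0 else -x
        out.append(acc)
    return out

# Usage: the correlation peaks where the code is aligned with the input.
padded = [0, 0, 0] + BARKER_11 + [0, 0, 0]
peaks = correlate(padded)
```

In this sketch the peak value equals the code length, and the entries at the other offsets are small, which is the property that makes such a code useful for synchronization.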
While application specific integrated circuit (ASIC) and coprocessor implementations efficiently (with low power dissipation) perform such operations at the required data rates, they typically perform only a single function. In addition, since the design and verification of such application specific integrated circuits is often an expensive and time-consuming process, any modifications to an application specific integrated circuit implementation will require a significant amount of time and expense.
As is apparent from the above-described deficiencies with current techniques for achieving processing rates required for certain digital signal processor algorithms, a need exists for a programmable and low power accelerator that achieves required processing rates for a number of different algorithms.
Generally, a programmable multi-mode accelerator is disclosed for use with a digital signal processor, microcontroller or microprocessor. The term "programmable processor" is used herein to collectively refer to a digital signal processor, a microcontroller or microprocessor. The programmable multi-mode accelerator allows a programmable processor to execute specific algorithms that require low-precision operations at an extremely high rate, such as certain types of finite impulse response, correlation and Viterbi computations. The disclosed programmable multi-mode accelerator replaces the ASIC implementations that have typically been used in digital signal processor systems and allows for a more programmable and more cost-effective solution. The accelerator extends the digital signal processor's performance into the required range for low-precision computations.
In one implementation, the accelerator begins executing its program after the main decode and dispatch unit of the programmable processor has issued a special start instruction. In such an implementation, the accelerator is coupled with the main data path of a programmable processor. The accelerator optionally has direct access to the register files of the programmable processor. In an illustrative implementation, the accelerator data path obtains its input values (source operands) directly from a set of registers in the programmable processor and writes results back into a second set of registers.
According to an aspect of the invention, the accelerator allows a plurality of algorithms, such as certain types of finite impulse response, correlation and Viterbi computations, to utilize the same adder cells thereby saving silicon area. In particular, the present invention allows low-precision algorithms requiring primarily addition or multiply-add computations to be implemented using a programmable accelerator. Thus, although an illustrative finite impulse response computation requires sixteen eight bit by eight bit multipliers and an adder tree to add the 16 products, and an illustrative Viterbi computation requires eight 16-bit additions and compare-select operations, the present invention allows these computations to be performed using the same adder cells. Thus, in accordance with the present invention, the accelerator includes a multi-mode adder that can be programmatically reconfigured to perform the various operations discussed above.
The multi-mode adder is controlled by the instructions of the accelerator. In a first mode, referred to as the "single-add mode," the adder operates as a 17-input 16-bit adder. In the single-add mode, the adder has 17 16-bit inputs that are all summed to form one 16-bit output. One input is a feedback path and the other 16 inputs come from a multiplexer and a multiplier bank. The single-add mode can be utilized to perform finite impulse response and correlation computations.
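By way of illustration only, the single-add mode can be modeled in software as a sum of sixteen product inputs plus one feedback input, reduced to a 16-bit result. The names `to_int16` and `single_add` are hypothetical, and two's complement wraparound on overflow is an assumption of this sketch; the overflow behavior of the adder is not stated above.

```python
def to_int16(value):
    """Wrap an integer to a signed 16-bit value (assumed two's complement)."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def single_add(products, feedback=0):
    """Sum 16 multiplier-bank outputs and one feedback input into 16 bits."""
    if len(products) != 16:
        raise ValueError("single-add mode expects exactly 16 product inputs")
    return to_int16(sum(products) + feedback)

# Usage: 16 inputs of value 1 plus a feedback value of 100.
total = single_add([1] * 16, feedback=100)
```

Leaving `feedback` at zero models a cycle in which the feedback path is not used.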
In the single-add mode, the illustrative accelerator can implement FIR filters with a delay line having delays of z^-1 or z^-2 and with up to 16 taps. In this implementation of the FIR filter, the throughput is one output sample per cycle. In addition, the accelerator can implement a finite impulse response filter with a z^-1 delay line and with between 17 and 32 taps. In this implementation of the FIR filter, the throughput is one output for each two cycles.
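By way of illustration only, one plausible reading of the two-cycle case is that a filter with 17 to 32 taps is split into two passes of 16 multiply-adds each, with the partial sum of the first pass returned through the feedback input during the second pass. The sketch below follows that reading. The names `pass16` and `fir_two_pass` are hypothetical, the split into halves is an assumption, and 16-bit precision limits are ignored for clarity.

```python
def pass16(coeffs, window, feedback=0):
    """One adder pass: 16 multiply-adds plus an optional feedback term."""
    if len(coeffs) != 16 or len(window) != 16:
        raise ValueError("each pass handles exactly 16 taps")
    return feedback + sum(c * x for c, x in zip(coeffs, window))

def fir_two_pass(coeffs, samples):
    """Compute one output of a FIR filter with up to 32 taps in two passes.

    samples[0] is the newest input sample. Unused taps are padded with zeros.
    """
    if len(coeffs) > 32:
        raise ValueError("at most 32 taps are supported")
    coeffs = list(coeffs) + [0] * (32 - len(coeffs))
    window = list(samples[:32])
    window += [0] * (32 - len(window))
    partial = pass16(coeffs[:16], window[:16])                     # first cycle
    return pass16(coeffs[16:], window[16:], feedback=partial)      # second cycle

# Usage: a 24-tap filter applied to a window of 32 samples.
taps = list(range(1, 25))
data = [(-1) ** i * (i % 5) for i in range(32)]
y = fir_two_pass(taps, data)
```

The result matches a direct sum of products over all taps, which is why the two-pass arrangement trades throughput for tap count without changing the filter.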
In the single-add mode, the accelerator initially advances the registers in the delay chain by one, reads a new value from the main register file, and writes the value into the first register of the delay chain. In the next cycle, the eight accelerator registers are read and are applied to the inputs of the multipliers in the multiplier bank. In addition, the delay chain values are applied to the inputs of the multipliers in the multiplier bank, and the values are multiplied. Thereafter, the outputs of the multipliers in the multiplier bank are summed by the adder, with or without the feedback input. Finally, the output of the adder is written back to the main register file.
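By way of illustration only, the sequence of steps described above (advance the delay chain, load a new value, multiply by the coefficients, sum, and write back) can be sketched as a small behavioral model. The class name `SingleAddFir` and its method `step` are hypothetical; the model uses a z^-1 delay line with up to 16 taps, holds coefficients in a plain list rather than in accelerator registers, and ignores 16-bit precision limits.

```python
class SingleAddFir:
    """Behavioral model of a FIR filter with up to 16 taps (z^-1 delay line)."""

    def __init__(self, coeffs):
        if len(coeffs) > 16:
            raise ValueError("at most 16 taps in this sketch")
        self.coeffs = list(coeffs) + [0] * (16 - len(coeffs))
        self.delay = [0] * 16  # delay chain, newest value first

    def step(self, new_sample, feedback=0):
        # Advance the delay chain by one and write the new value
        # into the first register of the chain.
        self.delay = [new_sample] + self.delay[:-1]
        # Apply coefficients and delay-chain values to the multipliers.
        products = [c * x for c, x in zip(self.coeffs, self.delay)]
        # Sum the products, with or without the feedback input.
        return sum(products) + feedback

# Usage: feeding a unit impulse reproduces the coefficients in order.
fir = SingleAddFir([1, 2, 3])
response = [fir.step(x) for x in [1, 0, 0, 0]]
```

The value returned by `step` stands in for the result that the accelerator would write back to the main register file.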
In a second mode, referred to as the "four-state add-compare-select mode" (or "ACS mode"), the feedback path is inactive. The other 16 inputs are divided into 8 groups of two inputs each. The two inputs of each group are summed to form eight intermediate 16-bit outputs. The eight intermediate 16-bit outputs are paired and a maximum or minimum from each pair is selected, based on the current operating mode, to produce four values. These four values are concatenated into two 32-bit values and sent back to the register file where results are stored. The ACS mode can be utilized to perform Viterbi computations.
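By way of illustration only, the data flow of the ACS mode (sixteen inputs, eight sums, four selected values, two packed 32-bit words) can be sketched as follows. The function names, the grouping of adjacent inputs, the signed comparison, and the packing order of the 32-bit words are all assumptions of this sketch and are not specified above.

```python
def to_int16(value):
    """Wrap an integer to a signed 16-bit value (assumed two's complement)."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def acs_mode(inputs, select_max=True):
    """Add, compare, and select across 16 inputs; return two 32-bit words."""
    if len(inputs) != 16:
        raise ValueError("ACS mode expects exactly 16 inputs")
    # Eight groups of two inputs each are summed (adjacent grouping assumed).
    sums = [to_int16(inputs[i] + inputs[i + 1]) for i in range(0, 16, 2)]
    # The eight sums are paired and a maximum or minimum of each pair is kept.
    pick = max if select_max else min
    chosen = [pick(sums[i], sums[i + 1]) for i in range(0, 8, 2)]
    # The four results are concatenated into two 32-bit values
    # (first value in the upper half-word is an assumption).
    word0 = ((chosen[0] & 0xFFFF) << 16) | (chosen[1] & 0xFFFF)
    word1 = ((chosen[2] & 0xFFFF) << 16) | (chosen[3] & 0xFFFF)
    return word0, word1

# Usage: inputs 1 through 16, selecting the maximum of each pair of sums.
words = acs_mode(list(range(1, 17)))
```

Passing `select_max=False` models the operating mode in which the minimum of each pair is selected instead.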
In the ACS mode, the accelerator initially reads two values from the accelerator registers and sign-extends them to an appropriate length. In addition, two of the registers from the main register file where inputs are stored are read and the values are added. The two values are then compared and a maximum or minimum is selected. Thereafter, the results of the adder are written to the main register file and the accelerator register pointer is updated.
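By way of illustration only, the sign extension and the add, compare, and select steps for a single pair of candidates can be sketched as follows. The names `sign_extend` and `acs_step`, the default field width of eight bits, and the pairing of the first stored value with the first register value are assumptions made for this example.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value

def acs_step(raw_metrics, register_values, bits=8, select_max=False):
    """Sign-extend two narrow values, add each to a register value, select one.

    raw_metrics: two narrow values as stored in the accelerator registers.
    register_values: two values read from the main register file.
    """
    m0, m1 = (sign_extend(m, bits) for m in raw_metrics)
    candidate0 = register_values[0] + m0
    candidate1 = register_values[1] + m1
    if select_max:
        return max(candidate0, candidate1)
    return min(candidate0, candidate1)

# Usage: 0xFE is -2 when treated as an 8-bit two's complement value.
survivor = acs_step((0x03, 0xFE), (10, 20))
```

In a Viterbi computation the narrow values would typically be branch metrics and the register values accumulated path metrics, although that mapping is an interpretation rather than something stated above.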
The accelerator has a small instruction set and instruction memory and, once started by the main data path, the accelerator executes its own instruction stream. The main processor and accelerator are always synchronized (i.e., in lock step) and no synchronization overhead, such as semaphores or hardware flags, is required, thereby maximizing data throughput.
A more complete understanding of the present invention, as well as further features and advantages of the present invention, will be obtained by reference to the following detailed description and drawings.