A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.
1. Field of the Invention
This invention relates generally to data processing, and more particularly to the processing of algorithms in software that benefit from the efficient implementation of forward and backward butterfly operations used, for example, in Maximum a posteriori (MAP) decoding. Such exemplary MAP decoding is used in the processing of parallel concatenated codes (Turbo codes) and serial concatenated codes.
2. Description of Related Technology
Parallel and serial concatenated codes are formed from a data sequence that is concatenated with a sequence of output bits from two or more constituent encoders, e.g., convolutional encoders. Turbo codes correspond to a specific type of parallel concatenated code. However, within this application, it is to be understood that where applicable, discussions referring to “Turbo codes” can be extended more generally to both parallel and serial concatenated codes. Embodiments involving parallel concatenated codes and more specifically Turbo codes are developed herein by way of example only.
The use of Turbo codes for transmission of data over a noisy channel was first introduced in C. Berrou, A. Glavieux, and P. Thitimajshima, “Near Shannon limit error-correcting coding and decoding: Turbo codes”, Proc. of 1993 Int. Conf. Comm., pp. 1064-1070. This reference is referred to as the “Berrou reference” hereinafter. Turbo codes provide bit error rates near Shannon's theoretical limit but add significant complexity to the receiver's decoder. Turbo codes are used for forward error correction in several important communication standards such as, inter alia, third-generation partnership project (hereafter, 3GPP) cellular communications standards. Consequently, much effort has been applied to develop efficient Turbo decoder implementations.
MAP (maximum a posteriori) based decoders are widely used within Turbo decoder implementations and require significant data processing. A MAP decoder determines a sequence that minimizes a symbol error rate as opposed to finding a maximum-likelihood sequence as determined using the more common Viterbi algorithm. The MAP decoder algorithm is described in Bahl, L. R. et al., “Optimal Decoding of Linear Codes for Minimizing Symbol Error Rate,” IEEE Transactions on Information Theory, March 1974, pp. 284-287, hereinafter called the “Bahl reference.” The MAP decoder described in the Bahl reference is often called the “BCJR algorithm” in recognition of its authors. While the MAP decoder is more costly than the Viterbi algorithm, it provides an information sequence known as an extrinsic sequence that is needed by Turbo decoders. Two MAP decoders configured in a feedback configuration are employed within a Turbo decoder. The processing associated with MAP decoders accounts for the bulk of the computational load in Turbo decoding.
Most practical implementations perform computations using logarithmic representations of probability information and are known as Log-MAP decoders. A decoder known as the Max-Log-MAP decoder uses a mathematical approximation to simplify the calculations involved and to thereby reduce the overall complexity of the decoder. Max-Log-MAP decoders are discussed in “Efficient Software Implementation of the Max-Log-MAP Turbo decoder on the StarCore SC140 DSP”, A. Chass, A. Gubeskys, and G. Kutz, ICSPAT 2000 and Motorola Application Note, hereinafter referred to as the “Chass reference.” The Max-Log-MAP decoder performance is slightly reduced compared to the Log-MAP but is more commonly implemented due to its decreased computational complexity. The Max-Log-MAP decoder performance can be improved by the addition of a correction term. A Max-Log-MAP decoder that makes use of this correction is known as a Max*-Log-MAP decoder. Max*-Log-MAP decoders are discussed in Michel, H. and Wehn, N., “Turbo-Decoder Quantization for UMTS,” IEEE Communications Letters, Vol. 5, Number 2, February 2001, hereinafter called the “Michel reference.” The exemplary embodiment of the invention performs efficient Max*-Log-MAP decoding in software using a customized processor designed to efficiently implement operations involved in various MAP decoding algorithms. Most of the computational operations required to perform MAP based decoding involve forward (alpha) metric updates, backward (beta) metric updates and the Log Likelihood Ratio (hereafter, LLR) calculations.
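By way of illustration only, the difference between the Max-Log-MAP approximation and the corrected Max* operation can be sketched as follows. The function names max_log and max_star are hypothetical and do not appear in the cited references, and the exact logarithmic form of the correction term is used here purely for clarity; practical decoders typically approximate the correction with a small lookup table.

```python
import math


def max_log(a, b):
    """Max-Log-MAP approximation: keep only the larger log-domain term."""
    return max(a, b)


def max_star(a, b):
    """Max* operation: the maximum plus a correction term.

    The correction log(1 + exp(-|a - b|)) makes the result equal to
    log(exp(a) + exp(b)). Hypothetical helper, for illustration only.
    """
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))
```

The correction term is largest (log 2) when the two inputs are equal and decays toward zero as they move apart, which is why a short table indexed by |a - b| is usually sufficient in practice.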
FIG. 1 illustrates a prior art block diagram of a rate ⅓ Turbo encoder 100 as used in a transmitting device. An input data sequence u(k) 101 (typically binary valued) is directly coupled to an output coupling 103 to produce a systematic data subsequence x(k) (i.e., x(k)=u(k)). The input sequence u(k) is also coupled to a first convolutional encoder 105 to produce a first parity information subsequence y1(k) 107. The input sequence u(k) is also coupled to a pseudo random interleaver 109 whose output is coupled to a second convolutional encoder 111 to produce a second parity information subsequence y2(k) 113. The output of the rate ⅓ Turbo encoder 100 is a sequence containing the three subsequences x(k), y1(k), and y2(k).
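The encoder structure of FIG. 1 may be sketched in software as shown below. This is a minimal sketch under stated assumptions: the constituent code (an 8-state recursive systematic code with feedback polynomial 1 + D^2 + D^3 and feedforward polynomial 1 + D + D^3), the function names rsc_parity and turbo_encode, and the representation of the interleaver as an index list perm are all chosen for illustration and are not specified by the text above. Trellis termination is omitted.

```python
def rsc_parity(bits):
    """Parity stream of one recursive systematic convolutional encoder.

    ASSUMPTION: feedback taps 1 + D^2 + D^3, feedforward taps 1 + D + D^3.
    """
    s1 = s2 = s3 = 0                      # shift register, s1 is the newest
    parity = []
    for u in bits:
        fb = u ^ s2 ^ s3                  # feedback from delays D^2 and D^3
        parity.append(fb ^ s1 ^ s3)       # feedforward taps 1, D and D^3
        s1, s2, s3 = fb, s1, s2           # shift the register
    return parity


def turbo_encode(u, perm):
    """Return the three subsequences x(k), y1(k), y2(k) of a rate 1/3 code.

    perm is a hypothetical interleaver: position i reads input index perm[i].
    """
    x = list(u)                                   # systematic output 103
    y1 = rsc_parity(u)                            # first encoder 105
    y2 = rsc_parity([u[p] for p in perm])         # interleaver 109, encoder 111
    return x, y1, y2
```

For example, turbo_encode([1, 0, 0, 0], [3, 2, 1, 0]) yields three subsequences of four elements each, so that three output bits are produced for every input bit.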
The Turbo encoder of FIG. 1 involves relatively simple logic processing and is usually implemented using Finite State Machine (FSM) controlled hardware. The encoded data stream is transmitted over a noisy channel and is received at a receiving device as an error-prone data stream comprising error-prone systematic and parity information subsequences. A Turbo decoder is used to operate on the received error-prone subsequences in order to produce an error-corrected estimate of the input data sequence u(k).
In many embodiments a rate ½ Turbo encoder is used instead of the aforementioned rate ⅓ Turbo encoder. The rate ½ Turbo encoder discards every other element of the subsequences y1(k) 107 and y2(k) 113, so that the encoder's output sequence contains one parity bit for each systematic bit. This process of decimating the parity sequences is known to those skilled in the art as “puncturing.”
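Puncturing of this kind may be sketched as follows. The function name puncture_rate_half is hypothetical, and the particular pattern (the first parity subsequence retained at even indices, the second at odd indices) is an assumption made for illustration; the text above states only that every other element of each parity subsequence is discarded.

```python
def puncture_rate_half(x, y1, y2):
    """Pair each systematic bit with exactly one surviving parity bit.

    ASSUMPTION: y1(k) is kept for even k and y2(k) for odd k.
    """
    out = []
    for k in range(len(x)):
        parity = y1[k] if k % 2 == 0 else y2[k]   # alternate parity sources
        out.append((x[k], parity))
    return out
```

Each output pair carries one systematic bit and one parity bit, so two bits are transmitted per input bit rather than three.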
A Turbo decoder 200 designed according to the most commonly employed Turbo decoding scheme is shown in FIG. 2. At the Turbo decoder 200, the input data subsequences correspond to error-prone versions of the transmitted subsequences. This is because the Turbo decoder generally only has access to the transmitted information after it has been received through a noisy channel. The received error-prone data subsequence x(k) 202, and the received error-prone first parity subsequence y1(k) 204 are coupled into a first Soft Input Soft Output (SISO) MAP decoder 206. Also coupled into the first MAP decoder 206 is a feedback sequence involving a priori log likelihood information, λin(k), output from a deinterleaver 208. The output from the first SISO MAP decoder 206, λout(k) 207, is coupled to an interleaver 210 which generates a set of a priori information that is coupled to a second SISO MAP decoder 212. The second SISO MAP decoder 212 also takes as input an error-prone parity data subsequence y2(k) 214 and the error-prone systematic data x(k) 202 after passing through an interleaver 216. As is known in the art, the deinterleaver 208, and the interleavers 210 and 216 use the same interleaving function as used in the encoder 100. The output of the second SISO MAP decoder 212 is a second log likelihood data output sequence, λout(k) 213. The sequence λout(k) 213, like the other data sequences, includes a corresponding element for each bit index k into the input data block. The number k preferably ranges from 0 to N−1, so that there are N elements in each data block. After the data block is operated upon via several iterations through the decoder 200, a hard decision output data element 218 can be produced with low Bit Error Rate (BER).
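The data flow of FIG. 2 may be sketched as the following iteration loop. This is a structural sketch only, under several assumptions not taken from the text: the inputs are log likelihood values in which a positive number favours a 1 bit; siso is a caller-supplied routine (hypothetical signature) that returns a pair of lists, the extrinsic values and the posterior log likelihood values; and perm is an index list describing the interleaving function. The SISO MAP computation itself is not shown.

```python
def turbo_decode(x, y1, y2, perm, siso, iterations=4):
    """Iterate two SISO decoders as in FIG. 2 and return hard decisions.

    ASSUMPTIONS: siso(sys, par, apriori) -> (extrinsic, posterior);
    interleaver position i reads index perm[i]; positive values favour 1.
    """
    n = len(x)
    x_int = [x[p] for p in perm]                  # interleaver 216
    apriori = [0.0] * n                           # no prior information yet
    post2_int = [0.0] * n
    for _ in range(iterations):
        ext1, _post1 = siso(x, y1, apriori)       # first SISO MAP decoder 206
        ext1_int = [ext1[p] for p in perm]        # interleaver 210
        ext2_int, post2_int = siso(x_int, y2, ext1_int)   # second decoder 212
        apriori = [0.0] * n
        for i, p in enumerate(perm):              # deinterleaver 208
            apriori[p] = ext2_int[i]
    posterior = [0.0] * n
    for i, p in enumerate(perm):                  # restore natural bit order
        posterior[p] = post2_int[i]
    return [1 if v > 0 else 0 for v in posterior]  # hard decision 218
```

Passing the SISO routine as a parameter keeps the sketch independent of any particular MAP variant; the two calls per iteration are where the bulk of the computation would occur.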
A summary of the calculations involved in a SISO MAP decoder for a version of the popular Max*-Log-MAP algorithm is provided in the detailed description of the invention. Also refer to the Berrou, Michel and Chass references for further details regarding the Turbo decoder and its implementation. The Turbo decoder of FIG. 2 is well known to involve a significant computational load. When turbo decoding is performed using logarithmic values, the computational load involves accessing the many data values required, data selection, add-compare-select operations, correction factor computations and nontrivial pointer arithmetic.
The combination of computational complexity and the need for power efficient solutions has led to prior art solutions involving one or more processors coupled to a hardware Turbo decoder. An exemplary prior art communications device 300 is shown in FIG. 3. The communications device 300 may represent, for example, a cellular phone, a wireless basestation, a modem or any other communications device that applies error correction processing to a received signal. The communications device 300 includes a Turbo decoder hardware module 302 and a private memory 304 coupled thereto. The Turbo decoder 302 is coupled to receive information from a communication interface 306. The communication interface 306 generally corresponds to a receiver that provides a demodulated bit stream received from a communication channel 308. The communication channel 308 may be a wireless, wireline, optical, or other type of communication channel.
The Turbo decoder 302 is coupled to a digital signal processor (DSP) 310. The DSP 310 typically is coupled to a private memory 312, for example, on-board memory associated with the DSP 310. The communication device 300 also typically includes a microcontroller 314. While the DSP 310 handles physical layer processing tasks, the microcontroller 314 typically handles link layer and other upper layer processing. In this exemplary prior art system, the DSP 310, the microcontroller 314, and the Turbo decoder 302 are coupled together via a system bus 316. Also coupled to the system bus 316 are a memory module 318, a memory module 320, and an input/output device 322. In some systems, the memories 318 and 320 are merged into a single memory module.
In operation, a communication signal is received from the communication channel 308. The communication signal is then converted by the interface circuitry 306 into a digital data sequence. The received digital data sequence consists of error-prone systematic and parity data. The microcontroller 314 is typically used to write this information to the memory 318. The Turbo decoder 302 then reads a block of the data sequence from the memory 318 and performs Turbo decoding to convert the error-prone data block into an error-corrected data sequence. At the end of the iterative decode process the data is written by the Turbo decoder into the memory 320.
In some embodiments, the DSP 310 performs signal conditioning such as equalization prior to sending the data block to the Turbo decoder. The DSP 310 may also perform baseband processing such as Viterbi Algorithm decoding and speech codec functions. The decoded data from the Turbo decoder will typically be further processed by the microcontroller 314 with its associated memory subsystem 320 before being passed to the data Input/Output logic 322 of the system.
Prior art systems use a dedicated hardware Turbo decoder 302 because it is generally costly and inefficient to implement such a high complexity algorithm in software on a general purpose DSP. For example, each SISO MAP decoder involves branch metric calculations (gamma metrics), a forward recursion through the trellis (alpha metric calculations), a backward recursion through the trellis (beta metric calculations), a soft output calculation and an extrinsic information (LLR) calculation. The Chass reference reports a DSP software implementation of the decoder, but the implementation results in a costly and power consuming solution. This is because general purpose DSPs require many instruction cycles to implement all of the aforementioned operations and the supporting pointer arithmetic to control memory accessing.
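The sequence of calculations listed above may be illustrated with the following minimal Max-Log-MAP sketch. It is not the implementation of the Chass reference or of the present disclosure. The assumptions are: a toy 2-state recursive code (next state = state XOR input, parity = next state) stands in for a real constituent code; inputs are log likelihood values in which a positive number favours a 1 bit; the trellis starts in state 0 and is not terminated; and all function names are hypothetical.

```python
NEG = float("-inf")


def trellis_2state():
    """Transitions (state, input bit, parity bit, next state).

    ASSUMPTION: toy code with next = state ^ u and parity = next.
    """
    return [(s, u, s ^ u, s ^ u) for s in (0, 1) for u in (0, 1)]


def max_log_map(l_sys, l_par, l_apr, trellis, n_states):
    """Return (posterior LLRs, extrinsic values) for one SISO pass."""
    n = len(l_sys)

    def gamma(k, u, p):
        # Branch metric; terms common to every branch are dropped.
        return u * (l_sys[k] + l_apr[k]) + p * l_par[k]

    # Forward recursion (alpha metrics), encoder assumed to start in state 0.
    alpha = [[NEG] * n_states for _ in range(n + 1)]
    alpha[0][0] = 0.0
    for k in range(n):
        for s, u, p, t in trellis:
            alpha[k + 1][t] = max(alpha[k + 1][t], alpha[k][s] + gamma(k, u, p))

    # Backward recursion (beta metrics), all end states equally likely.
    beta = [[NEG] * n_states for _ in range(n + 1)]
    beta[n] = [0.0] * n_states
    for k in range(n - 1, -1, -1):
        for s, u, p, t in trellis:
            beta[k][s] = max(beta[k][s], beta[k + 1][t] + gamma(k, u, p))

    # Soft output and extrinsic information.
    llr, ext = [], []
    for k in range(n):
        best = [NEG, NEG]                 # best path metric for u = 0 and u = 1
        for s, u, p, t in trellis:
            m = alpha[k][s] + gamma(k, u, p) + beta[k + 1][t]
            best[u] = max(best[u], m)
        llr.append(best[1] - best[0])
        ext.append(llr[k] - l_sys[k] - l_apr[k])
    return llr, ext
```

Even in this reduced form, each trellis step requires a branch metric evaluation plus an add-compare-select for every transition in three separate passes, which indicates why the per-bit instruction count on a general purpose DSP is high. A Max*-Log-MAP variant would replace each max with a corrected maximum.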
While prior art Turbo decoding solutions have been proposed, they have some limiting problems that need to be overcome. For example, hardware decoders lack flexibility. A change in a standard, a new standard, or any other change in a specification or requirements is difficult to handle when the controlling algorithms are not software programmable. Also, hardware decoders lack advanced programmable features. Because of this limitation, hardware decoders tend not to have certain features that would be easy to add to a software programmable decoder. Another problem is that hardware decoders consume gates and memory that will not be reused by other functions. The silicon area consumed by a hardware decoder will not be used for other functions whereas the silicon area used to support a software decoder in a DSP can be reused for functions such as speech and audio decompression/decoding and speech recognition. As discussed above, DSP software based implementations are inefficient. To implement a Turbo decoder in DSP software is overly costly in both instructions per second and power consumption. Hence there is a trade-off in the prior art between efficient but fixed hardware decoders and inefficient but flexible software decoders.
Based on the foregoing, there is a need for an improved decoding architecture that provides efficiency similar to that of a hardware decoder while still providing the flexibility of a software-implemented decoder. It would be desirable for such a decoder to be reprogrammable and thereby able to deal with new requirements and/or to accommodate a new standard. There is also a need for an improved decoder architecture that could be readily programmed to support advanced features. It would be desirable to have a decoder architecture that could be reused for other functions such as speech and audio encoding/decoding and speech recognition. It would also be desirable to have a programmable and reusable decoder architecture that is tightly coupled to a processor such as a DSP and allows Turbo decoding to be performed using far fewer processor cycles and/or much less power than prior art DSP software-based approaches. There is a need to eliminate the trade-off in the prior art between efficiency and programmability of Turbo decoding structures.
The present invention satisfies the aforementioned needs by providing an improved method and apparatus for implementing a data processor adapted for turbo decoding.
In a first aspect of the invention, an improved processor adapted for decoding is disclosed. In one exemplary embodiment, the processor comprises: a memory that holds a set of state values; an arithmetic unit that supports forward and reverse butterfly update operations; at least one instruction that causes the arithmetic unit to perform a butterfly update operation; and at least one addressing mode that causes a pair of butterfly output state values to be written to a set of locations in the memory, such that the written output states are reordered to be ready for subsequent sequential pair-wise reading as input states in a subsequent butterfly operation. In a second exemplary embodiment, the processor comprises: an ALU adapted to perform forward and reverse trellis butterfly update calculations; at least one instruction operative to cause the ALU to perform at least one of a forward and a reverse trellis update operation; a memory for storing a set of alpha metrics; at least one addressing mode adapted to automatically write at least a pair of first metrics to a permuted set of locations relative to an output address pointer, the permutation arranging the pair for subsequent sequential reading as input state values to a subsequent butterfly operation; a local register file for storing a set of second metrics; and a local register file for storing a set of third metrics; whereby when the instruction executes, the ALU selectively couples at least some of the first, second, and third metrics into the ALU to selectively perform one of a first metric update and a second metric update.
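The effect of the reordering addressing mode may be illustrated with the following sketch of one forward trellis step. It is offered under stated assumptions and is not the disclosed instruction or hardware: the trellis is assumed to be of the shift-register type in which old states 2j and 2j+1 both lead to new states j and j + N/2; a single branch metric gamma[j] per butterfly is added on one pair of branches and subtracted on the other; the Max-Log form of the update is used; and the function name is hypothetical.

```python
def forward_butterfly_pass(old, gamma):
    """One forward (alpha) update over N states as N/2 butterflies.

    ASSUMPTIONS: old states 2j and 2j+1 feed new states j and j + N/2;
    gamma[j] is the (symmetric) branch metric of butterfly j.
    """
    n = len(old)
    half = n // 2
    new = [0.0] * n
    for j in range(half):
        a, b = old[2 * j], old[2 * j + 1]     # sequential pair-wise read
        g = gamma[j]
        new[j] = max(a + g, b - g)            # add-compare-select, first output
        new[j + half] = max(a - g, b + g)     # second output, N/2 locations away
    return new
```

Because the output pair is written to locations j and j + N/2 rather than to adjacent locations, the resulting array is already in natural state order, and the next pass can again read its inputs as sequential pairs without any separate reordering step.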
In a second aspect of the invention, an improved arithmetic logic unit (ALU) apparatus for use in, inter alia, a data processor, is disclosed, the ALU generally comprising: at least one control bus adapted to carry at least one control signal thereon; local first and second memory areas and memory busses, respectively; and a partitioned memory and a partitioned memory bus; a plurality of selection units, each of the units having a plurality of inputs and at least one output, the plurality of inputs comprising a first input coupled to the partitioned memory bus, a second input coupled to the first memory bus, and a third input coupled to the second memory bus, the selection units being adapted to select one or more of the data inputs, the control of the selection function being related at least in part to the control signal present on the control bus. In one exemplary embodiment, a plurality of arithmetic units are also provided, each of the arithmetic units having at least two inputs corresponding to the outputs of at least two of the selection units operative to arithmetically combine the at least two input values. At least one compare unit is also provided, having as an input the output of at least one of the arithmetic units, and at least one result multiplexer having the same inputs as the respective one of the one compare unit and being controlled by the output of the compare unit(s).
In a third aspect of the invention, an improved communication system incorporating the aforementioned processor is disclosed. The system generally comprises a processor with an ALU capable of selectively performing forward and reverse MAP butterfly update operations in response to at least one instruction, and at least one MAP decoder software routine comprising the at least one butterfly update instruction. In one exemplary embodiment, the system further comprises a first MAP decoder module adapted to execute the at least one MAP decoder software routine, whereby the MAP decoder executes a forward trellis update and a reverse trellis update recursion and computes a first likelihood output sequence. An interleaver coupled to receive this first likelihood sequence is also provided. A second MAP decoder module adapted to execute at least one of the MAP decoder software routines is also provided, the second MAP decoder executing a forward and a reverse trellis update recursion to compute a second likelihood output sequence. A deinterleaver receives this second likelihood sequence and provides feedback to the first MAP decoder in the form of a permuted likelihood sequence.
In a fourth aspect of the invention, an improved method for performing a butterfly operation for implementing a decoder in a digital processor having a memory and arithmetic logic unit (ALU) associated therewith is disclosed. The method generally comprises: disposing old path metrics in the memory; disposing a set of first metrics in a local storage device associated with the ALU; providing a butterfly update instruction within the instruction set of the processor; providing the old path metrics as inputs to the ALU; providing the first metrics as inputs to said extension ALU; and providing at least one addressing mode for the memory which causes a set of new path metrics to be reordered and written back to memory subsequent to execution of the butterfly update instruction. In one exemplary embodiment, the processor comprises an extended processor having an XY memory, and the reordering of new path metrics occurs in such a way that the written values can be subsequently sequentially accessed as old path metric inputs to a subsequent butterfly update instruction.