This invention relates in general to Digital Signal Processor (DSP) Cores and more specifically to a DSP Instruction for Turbo Decoding.
Turbo coders are one type of forward error correction (FEC) used in today's communication systems. They are becoming widely used in many applications, such as wireless handsets, wireless base stations, hard disk drives, wireless LANs, satellites, and digital television. Their bit error rate (BER) performance is closer to the Shannon limit than that of other types of FEC, as is illustrated in FIG. 2. Turbo coders work on blocks of data called frames. There are two main wireless standards, 3GPP and CDMA2000. The frame size for 3GPP ranges from 40 to 5114 bits and the frame size for CDMA2000 ranges from 378 to 20,730 bits. One implementation of a turbo decoder was designed to run in parallel with the TI C64X DSP, with the DSP executing all of the communication algorithm in software except for turbo decoding. The DSP downloads a frame of data to the turbo decoder and starts the decode. After the decode is complete, the decoder interrupts the DSP with either a hard or soft interrupt. Next, the DSP retrieves the corrected frame and continues executing the remaining parts of the communication algorithm. The data entering the decoder is soft and is assumed to be quantized to 8 bits in this example. The data exiting the decoder is hard and is binary. The turbo decoder attempts to find and fix as many errors as possible. Turbo decoders achieve good results by iteratively decoding the data in the frame many times. The number of iterations typically ranges from 1 to 100. Typically, the results get better at each iteration until the optimum solution is obtained.
An illustrative example of a wireless handset or base station is shown in FIG. 1, wherein digital hard data is modulated and transmitted from the transmitter portion of the wireless base station and soft data is received at the receiver portion of the wireless base station. As illustrated, noise is introduced between the transmitter and receiver and, as a result, errors make the received data soft, e.g., “0.9” or “−0.2”, instead of the transmitted modulated data “1” or “−1”. The encoder is typically located in the transmitter of a wireless base station, for example, whereas the decoder is typically located in the receiver base station.
An example of a rate ⅓ parallel concatenated encoder is shown in FIG. 3. The encoder illustrated in FIG. 3 is a rate “⅓” parallel concatenated encoder because it has one input stream and three output streams. The “I” block in FIG. 3 is an interleaver which randomly scrambles the information bits to decorrelate the noise for the decoder. Included in FIG. 3 are two Recursive Systematic Convolutional (RSC) encoders running in parallel. The interleaver located in the encoder scrambles the information in the same way that the interleaver located in the decoder (illustrated in FIG. 6) must unscramble the information. Therefore the scrambling can take any form or use any algorithm as long as both the encoder and the decoder use the same scrambling method.
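The requirement that encoder and decoder share one scrambling method can be illustrated with a minimal Python sketch. The helper names (`make_interleaver`, `interleave`, `deinterleave`) and the seeded pseudo-random permutation are assumptions made for illustration only; the 3GPP and CDMA2000 standards define their own interleaver patterns, which are not reproduced here.

```python
import random

def make_interleaver(n, seed=0):
    """Return a pseudo-random permutation of range(n) (hypothetical helper)."""
    perm = list(range(n))
    random.Random(seed).shuffle(perm)
    return perm

def interleave(data, perm):
    # Output position k takes the input element at index perm[k].
    return [data[p] for p in perm]

def deinterleave(data, perm):
    # Inverse mapping: element k of the scrambled data goes back to index perm[k].
    out = [None] * len(data)
    for k, p in enumerate(perm):
        out[p] = data[k]
    return out

perm = make_interleaver(40, seed=1)      # 40 bits, the smallest 3GPP frame size
frame = [i % 2 for i in range(40)]
scrambled = interleave(frame, perm)
assert deinterleave(scrambled, perm) == frame   # same permutation on both sides
```

Any permutation works in this sketch, provided the deinterleaver is built from the same one that the interleaver used.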
The functional block diagram of the turbo decoder is shown in FIG. 6. The frame entering the decoder contains systematic and parity symbols x′, p′0 and p′1, respectively. These symbols are scaled once by the DSP and stored in separate memories within the turbo decoder. Turbo decoding achieves an error performance close to the Shannon limit. The performance is achieved through multiple decoding iterations. Each iteration results in additional performance and additional computational delay. Turbo codes consist of a concatenation of convolutional codes, connected by an interleaver, with an iterative decoding algorithm. The iterative decoder generates soft decisions from a maximum-a-posteriori (MAP) block. Each iteration requires the execution of two MAP decodes to generate two sets of extrinsic information. The first MAP decoder uses the non-interleaved data as its input and the second MAP decoder uses the interleaved data. The frame of input data entering the decoder contains systematic symbols x′ and parity symbols p′0 and p′1. There are N of these symbols and they are soft (not binary). The symbols are scaled once and stored in memory. The scaled symbols are labeled Λ(x), Λ(p0), and Λ(p1) in FIG. 6. These inputs are constant for the entire decode of that block of data.
The inputs to the upper MAP decoder are Λ(x), Λ(p0) and A2. A2 is the a priori information from the lower MAP decoder. The output of the upper MAP decoder is the first extrinsic, W1. W1 is interleaved to make A1. The inputs to the lower MAP decoder are Λ(x), Λ(p1) and A1. The output of the lower MAP decoder is the second extrinsic, W2. W2 is deinterleaved to make A2. This completes one iteration.
The function of the MAP decoder is to determine the logarithm of the likelihood ratio (LLR). This is commonly called the extrinsic and is labeled W1 and W2 in FIG. 6. The extrinsic associated with each decoded bit $x_n$ is

$$W_n = \log \frac{\Pr(x_n = 1 \mid R_1^n)}{\Pr(x_n = 0 \mid R_1^n)}$$
where $R_1^n = (R_0, R_1, \ldots, R_{n-1})$ denotes the received symbols as received by the decoder. The MAP decoder computes the a posteriori probabilities:

$$\Pr(x_n = i \mid R_1^n) = \frac{1}{\Pr(R_1^n)} \sum \Pr(x_n = i, S_n = m, S_{n-1} = m')$$
Here $S_n$ refers to the state at time n in the trellis of the constituent convolutional code. The trellis of the rate ⅓ encoder of FIG. 4 is shown in FIG. 5. FIG. 3 shows the two encoders of FIG. 4, in which the second parity is punctured (or not used), connected in parallel. FIG. 4 illustrates the logic gates and registers. The encoder has a code rate of “⅓” because there are three outputs for one input. The 3-bit representations to the far left of the trellis represent the values stored within the three registers, one bit for each register, respectively. The numbers just to the right of those 3-bit representations are the states, $S_n$, within the trellis, of which there are 8 possible, as there are three registers, each of which can hold either a ‘1’ or a ‘0’ bit, i.e. $2^3 = 8$. The trellis depicts the output of the RSC encoder in dependence upon the initial state, $S_n$, of the encoder, the values which are stored in the registers, and the input bit. For example, if the initial state of the encoder is state ‘0’, so that all the registers hold a “0”, and the input bit is a “0”, then as illustrated in the trellis of FIG. 5, the output will be “000”, representing the systematic bit and the two parity bits, respectively. As another example, if the encoder is in state “5”, the registers store “101” respectively, and the input bit is a 1, the output is “100”.
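One trellis transition of an 8-state RSC encoder can be sketched in Python as follows. The generator polynomials are an assumption: the text does not state the taps of FIG. 4, so the sketch uses CDMA2000-style polynomials (feedback 1 + D^2 + D^3, first parity 1 + D + D^3, second parity 1 + D + D^2 + D^3). The function name `rsc_step` and the register ordering are likewise hypothetical. Under these assumptions the two transitions described above come out as “000” and “100”.

```python
def rsc_step(state, u):
    """One trellis transition of an 8-state RSC encoder.

    state: tuple (r1, r2, r3) of register bits, r1 being the newest.
    u:     input bit.
    Returns ((systematic, parity0, parity1), next_state).
    The polynomials below are ASSUMED (CDMA2000-style), not read from FIG. 4.
    """
    r1, r2, r3 = state
    a = u ^ r2 ^ r3          # feedback bit entering the first register
    p0 = a ^ r1 ^ r3         # first parity:  1 + D + D^3
    p1 = a ^ r1 ^ r2 ^ r3    # second parity: 1 + D + D^2 + D^3
    return (u, p0, p1), (a, r1, r2)

# Enumerate all 8 states x 2 input bits, i.e. every branch of the trellis.
trellis = {}
for s in range(8):
    regs = ((s >> 2) & 1, (s >> 1) & 1, s & 1)
    for u in (0, 1):
        trellis[(regs, u)] = rsc_step(regs, u)

print(trellis[((0, 0, 0), 0)][0])   # state 0, input 0
print(trellis[((1, 0, 1), 1)][0])   # state 5 (registers 101), input 1
```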
The terms in the summation can be expressed in the form
$$\Pr(x_n = i, S_n = m, S_{n-1} = m') = \alpha_{n-1}(m')\,\gamma_n^i(m', m)\,\beta_n(m)$$
The following simplified equation is used to calculate α, β and the a posteriori probability (APP) of the bit $x_k$:

$$F = \ln\left[e^{A} + e^{B}\right]$$
This equation will be called the exponent logarithm equation. For an eight-state code, the exponent logarithm equation is executed 8(N+3) times in the generation of both alpha and beta. The 3 in the (N+3) is the extra processing associated with the 3 extra tail bits. The exponent logarithm equation is executed 8N times in the generation of the extrinsic. Table 1 lists the number of exponent logarithm equations which are required for several different sizes of N. These numbers are for a non-sliding-block implementation and are 10% to 15% greater for a sliding-block implementation of the MAP decoder.
The exponent logarithm equation requires two exponent functions, one addition, and one logarithm function. The exponent and logarithm functions are usually not performed on a DSP without the use of lookup tables, and these tables can be quite large. One way to rewrite the exponent logarithm equation is as follows:

$$\ln\left[e^{A} + e^{B}\right] = \max(A, B) + \ln\left[1 + e^{-|A - B|}\right] = \max(A, B) + f(|A - B|)$$
The above equation consists of the MAX function and a small table lookup. This equation is commonly called MAX*, MAX star, or MAX with a table lookup. A subtraction followed by an absolute value is required to generate the index for the table lookup. The MAX, subtraction, and absolute value functions are commonly implemented by DSPs, but the table lookup part of this equation is not. Currently, DSPs allocate a block of memory for the table lookup portion. DSPs can execute the MAX, subtraction, addition, and absolute value functions in 1 cycle each, whereas the table lookup requires several cycles. The C6x family takes 5 cycles to load an element of the table, and other DSPs take a similar number of cycles to perform a load. On DSPs that have only 1 functional block, the MAX star equation would require 4+5=9 cycles. For DSPs that can execute more than one function at a time, the MAX star equation would take 8 cycles.
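The equivalence between the exponent logarithm equation and the MAX star form can be checked numerically. The Python sketch below is illustrative only; the function names `exp_log` and `max_star` are hypothetical, and the correction term is evaluated in floating point in place of a table lookup.

```python
import math

def exp_log(a, b):
    """Direct evaluation of ln(e^a + e^b): two exponents, one add, one log."""
    return math.log(math.exp(a) + math.exp(b))

def max_star(a, b):
    """max(a, b) plus the correction term f(|a - b|) = ln(1 + e^-|a - b|)."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

for a, b in [(0.0, 0.0), (1.5, -2.0), (-3.0, 4.25)]:
    assert math.isclose(exp_log(a, b), max_star(a, b), rel_tol=1e-12)
```

The correction term is largest when A equals B, where it is ln 2 (about 0.693), and it decays toward zero as |A − B| grows, which is why a small table can cover it.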
The C6x could execute several MAX stars in a rolled loop in which the individual functions are done in parallel. This could possibly reduce the average number of clock cycles to 2 or 3. FIG. 8 shows an example of 8 MAX star functions in a rolled loop. It takes 16 cycles to execute the 8 MAX star functions. The average number of clock cycles per MAX star has been reduced to 2, but the DSP's 8 functional blocks are kept busy most of the time.
For the above reasons, turbo decoder algorithms are currently implemented on a dedicated chip outside of the DSP. This is due to the high number of MIPS required by the MAX star function. This extra chip increases the system cost of the entire communication design.
The addition of a specialized instruction to perform the MAX star function provides a very efficient, low-cost way to get better performance on the DSP. The instruction would have two inputs and one output, as shown in FIG. 9, and is designed to fit in a standard 3-operand format. This simple instruction is implemented on a DSP so as to take 1 cycle. FIG. 10 shows one possible implementation of the MAX star function. Signal m1 is the difference between inputs A and B. The sign of m1 controls the multiplexer of the max function. Signal m2 is the output of this multiplexer, i.e. the larger of A and B. Signal m1 is also applied to a lookup table. The lookup table is built to handle both positive and negative results. Its output is m3, which is summed with m2 to form the MAX star result.
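The signal flow described for FIG. 10 can be modeled in software as a rough Python sketch. The fixed-point format (`FRAC_BITS`), the table range (`LIMIT`), the round-to-nearest table entries, and the function name `max_star_fixed` are all assumptions for illustration; they are not taken from the figures.

```python
import math

FRAC_BITS = 2                 # assumed resolution: 2 fractional bits
SCALE = 1 << FRAC_BITS
LIMIT = 8 * SCALE             # beyond |m1| = 8.0 the rounded correction is 0

# The table spans m1 in [-LIMIT, LIMIT], so a negative difference can index it
# directly without a separate absolute-value step.
TABLE = [int(math.log1p(math.exp(-abs(d) / SCALE)) * SCALE + 0.5)
         for d in range(-LIMIT, LIMIT + 1)]

def max_star_fixed(a, b):
    """MAX star on integers scaled by SCALE, using the m1/m2/m3 signal names."""
    m1 = a - b                                   # difference of the two inputs
    m2 = b if m1 < 0 else a                      # sign of m1 selects the larger input
    m3 = TABLE[m1 + LIMIT] if -LIMIT <= m1 <= LIMIT else 0
    return m2 + m3                               # MAX star result

print(max_star_fixed(5, 3) / SCALE)   # inputs 1.25 and 0.75 in this format
```

In this model the table is the only part that depends on the chosen resolution; the subtract, select, and add steps are the same for any fixed-point format.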
The size of the lookup table depends on the resolution required. FIGS. 11-15 illustrate a few examples of different fixed-point sizes for the lookup table. Implementing the MAX star circuit on a DSP will allow the DSP to execute the MAX star function in 1 clock cycle. This reduction will allow the turbo decoder to run more efficiently on the DSP, thereby reducing the price of the system by eliminating the dedicated turbo decoder chip.
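How the table size grows with resolution can be estimated with a short Python sketch. It assumes round-to-nearest entries and a table that stops at the first entry that rounds to zero; the function name `correction_table` is hypothetical, and the resulting sizes are not claimed to match the tables of FIGS. 11-15.

```python
import math

def correction_table(frac_bits):
    """Non-zero entries of f(d) = ln(1 + e^-d), with d stepped by 2**-frac_bits."""
    scale = 1 << frac_bits
    table = []
    d = 0
    while True:
        entry = int(math.log1p(math.exp(-d / scale)) * scale + 0.5)
        if entry == 0:        # f is decreasing, so every later entry is 0 as well
            return table
        table.append(entry)
        d += 1

for bits in range(4):
    print(bits, len(correction_table(bits)))
```

Under these assumptions each added fractional bit lengthens the table, since finer steps are needed and the correction stays above half a least-significant bit over a wider range of |A − B|.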