This invention relates generally to decoder circuitry. More particularly, this invention relates to a high speed add-compare-select (ACS) circuit useful in Viterbi and “turbo” decoders.
A Viterbi decoder performs optimum decoding of convolutionally encoded digital sequences. It is widely used in digital communication systems with data rates ranging from a few kbps in narrowband applications to several hundred Mbps in broadband applications such as wireless LAN.
As shown in FIG. 1, a Viterbi decoder 100 comprises three units: a branch-metric computation unit (BMU) 102, an add-compare-select unit (ACSU) 104 and a survivor path memory unit (SMU) 106. The input data is used in the BMU 102 to calculate the set of branch metrics for each new time step. These metrics are then fed to the ACSU 104, which accumulates the branch metrics recursively as path metrics according to the trellis determined by a convolutional encoder polynomial. The SMU 106 processes the decisions made in the ACSU 104 and outputs an estimated path, with a latency equal to the trace-back depth.
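By way of illustration only, the three-unit structure described above can be sketched in software as follows. The sketch assumes a hypothetical rate-1/2, 4-state code with generator polynomials 7 and 5 (octal), hard-decision inputs, Hamming-distance branch metrics, and a trace-back over the whole block rather than a fixed trace-back depth; none of these choices, nor the function names `bmu`, `acsu` and `smu`, is taken from FIG. 1.

```python
G = (0b111, 0b101)  # assumed generator polynomials (octal 7 and 5)

def parity(x):
    return bin(x).count("1") & 1

def branch_output(state, bit):
    """Code symbol emitted when `bit` enters the encoder in `state`."""
    reg = (state << 1) | bit  # 3-bit shift register, newest bit in the LSB
    return tuple(parity(reg & g) for g in G)

def encode(bits):
    state, symbols = 0, []
    for bit in bits:
        symbols.append(branch_output(state, bit))
        state = ((state << 1) | bit) & 3
    return symbols

def bmu(rx):
    """Branch-metric unit: Hamming distance from rx to each possible symbol."""
    return {sym: (rx[0] != sym[0]) + (rx[1] != sym[1])
            for sym in ((0, 0), (0, 1), (1, 0), (1, 1))}

def acsu(pm, bm):
    """Add-compare-select unit: one radix-2 trellis step (smaller is better)."""
    new_pm, decisions = [], []
    for n in range(4):
        preds = (n >> 1, (n >> 1) | 2)  # the two states with a branch into n
        cands = [pm[p] + bm[branch_output(p, n & 1)] for p in preds]  # add
        sel = 0 if cands[0] <= cands[1] else 1                        # compare
        new_pm.append(cands[sel])                                     # select
        decisions.append(sel)
    return new_pm, decisions

def smu(decisions, final_state):
    """Survivor-memory unit: trace the stored decisions back to the start."""
    state, bits = final_state, []
    for dec in reversed(decisions):
        bits.append(state & 1)                    # newest input bit of the state
        state = (state >> 1) | (dec[state] << 1)  # step to the chosen predecessor
    return bits[::-1]

def viterbi_decode(received):
    inf = float("inf")
    pm = [0, inf, inf, inf]  # the encoder is assumed to start in state 0
    history = []
    for rx in received:
        pm, dec = acsu(pm, bmu(rx))  # the path metrics feed back into the ACSU
        history.append(dec)
    best = min(range(4), key=lambda s: pm[s])
    return smu(history, best)
```

In this sketch `bmu` and `smu` have no dependence on their own previous outputs, whereas `acsu` consumes the path metrics it produced in the preceding step.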
It is clear that the ACSU 104 and SMU 106 architectures depend only on the trellis, and hence these two units are independent of the application for which a Viterbi decoder is being used. The application specific computations are done in the BMU 102 according to the soft input definition, and the interpretation of the decoded path into data at the output of the SMU 106 likewise depends upon the output format definition. Since the application specific parts of a Viterbi decoder are mainly found at the input and output, a high speed architecture of the ACSU 104 can be generally applicable.
If a high speed Viterbi decoder needs to be implemented for broadband applications with greater than 100 Mbps data rates, the critical path of a Viterbi decoder must be minimized. By looking at the block diagram of a Viterbi decoder 100 in FIG. 1, it is obvious that the BMU 102 as well as the SMU 106 are purely feedforward and the throughput can easily be increased by massive pipelining. However, this does not hold for the ACSU 104.
One way to improve the throughput of ACSU 104 is to apply a look-ahead scheme (radix-4 architecture) to the trellis 200 as shown in FIG. 2. A radix-4 architecture processes two trellis steps per clock cycle, and therefore achieves double the data rate of a radix-2 architecture without increasing the clock rate, provided that it can run at the clock rate employed by the radix-2 architecture. The circuit complexity associated with a conventional radix-4 architecture is greater, however, as can be seen by comparing FIG. 3a and FIG. 4: a conventional radix-4 ACSU 400 requires a 2-stage comparison circuit, including 4 more adders and 2 more multiplexers than are required by the conventional radix-2 ACSU 300 shown in FIG. 3a.
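The look-ahead relationship can be illustrated with a short sketch. It assumes a hypothetical 4-state trellis in which each state has two predecessors, metrics for which smaller is better, and branch metrics supplied as dictionaries keyed by (previous state, next state); these names and conventions are illustrative and are not taken from FIG. 2 through FIG. 4. The sketch shows that one radix-4 step, with its two-stage comparison over four candidates, yields the same path metrics as two successive radix-2 steps.

```python
import random

N_STATES = 4  # hypothetical 4-state trellis

def predecessors(n):
    """States p having a branch p -> n, where next = ((p << 1) & 3) | input."""
    return [n >> 1, (n >> 1) | 2]

def acs_radix2(pm, bm):
    """One trellis step: two candidates per state, one comparison."""
    new_pm = []
    for n in range(N_STATES):
        p0, p1 = predecessors(n)
        new_pm.append(min(pm[p0] + bm[(p0, n)], pm[p1] + bm[(p1, n)]))
    return new_pm

def acs_radix4(pm, bm_a, bm_b):
    """Two trellis steps at once: four candidates, two-stage comparison."""
    new_pm = []
    for n in range(N_STATES):
        cands = []
        for p in range(N_STATES):
            m = ((p & 1) << 1) | (n >> 1)  # the unique intermediate state
            cands.append(pm[p] + bm_a[(p, m)] + bm_b[(m, n)])
        stage1 = (min(cands[0], cands[1]), min(cands[2], cands[3]))
        new_pm.append(min(stage1))  # second comparison stage
    return new_pm

def random_branch_metrics(rng):
    return {(p, n): rng.randint(0, 15)
            for n in range(N_STATES) for p in predecessors(n)}
```

For any starting metrics `pm` and branch metrics `bm_a`, `bm_b` of two consecutive steps, `acs_radix4(pm, bm_a, bm_b)` equals `acs_radix2(acs_radix2(pm, bm_a), bm_b)`. The radix-4 form computes more candidate sums and needs a second comparison stage for each pair of steps, which is the added complexity noted above.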
In view of the foregoing, it is both advantageous and desirable to provide an ACS circuit for a radix-4 Viterbi decoder that has a lower critical path delay than that achievable using a traditional ACS circuit suitable for use with a radix-4 Viterbi decoder. It is also advantageous and desirable to implement such a lower critical path delay without increasing the clock rate beyond that required by a radix-2 ACS circuit.
By way of further background, the term “turbo coding” is well-known in the art as referring to a technique of coding in which two or more convolutional codes are applied to the payload data. Turbo coding is commonly used in many communications applications, including cellular wireless communications according to the well-known GSM, CDMA, WCDMA, and Long-Term Evolution (LTE) protocols. As known in the art, conventional turbo encoding uses one convolutional encoder to encode the datastream in its original form, while the other encodes a pseudo-randomly interleaved version of the datastream. The results from the two encoders are interwoven (concatenated), either serially or in parallel, to produce an output encoded datastream that is then transmitted or otherwise communicated to the destination. Turbo coding involving parallel concatenation is often referred to as a parallel concatenated convolutional code (PCCC), while serial concatenation results in a serial concatenated convolutional code (SCCC). Upon receipt, turbo decoding involves first decoding the received sequence according to one of the convolutional codes, de-interleaving the result, then applying a second decoding according to the other convolutional code, and repeating this process multiple times.
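A minimal sketch of parallel concatenated encoding follows, for illustration only. It assumes a hypothetical 4-state recursive systematic constituent code (feedback polynomial 1 + D + D², feedforward polynomial 1 + D²), a seeded pseudo-random permutation as the interleaver, and no puncturing or trellis termination; these choices and the function names are assumptions and do not correspond to any particular standard.

```python
import random

def rsc_parity(bits):
    """Parity stream of the assumed recursive systematic constituent code."""
    s1 = s2 = 0  # two-bit shift register, s1 holds the newer value
    parity = []
    for u in bits:
        a = u ^ s1 ^ s2        # feedback: 1 + D + D^2
        parity.append(a ^ s2)  # feedforward: 1 + D^2
        s1, s2 = a, s1
    return parity

def make_interleaver(n, seed=0):
    """Hypothetical pseudo-random interleaver: a seeded permutation of 0..n-1."""
    perm = list(range(n))
    random.Random(seed).shuffle(perm)
    return perm

def pccc_encode(bits, perm):
    """Return the systematic bits and the two parity streams."""
    parity1 = rsc_parity(bits)                     # original order
    parity2 = rsc_parity([bits[i] for i in perm])  # interleaved order
    return list(bits), parity1, parity2
```

The three returned streams correspond to the system bits and the two parity sequences whose LLR values a PCCC decoder would receive.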
FIG. 3b illustrates a conventional arrangement of a turbo decoder for a parallel concatenated convolutional code (PCCC), in which a first maximum a posteriori (MAP) decoder 350₁ receives the communicated data in the form of log-likelihood ratio (LLR) values for the system, or data, bits L(S), and also LLR values for the parity bits L(P1) from the first of the two convolutional codes. The system bits L(S) are de-interleaved by interleaver 360, according to the interleaving applied in encoding, and the result L(S′) is applied to a second MAP decoder 350₂, which decodes de-interleaved system bits L(S′) and the LLR values of the parity bits L(P2) derived from the second convolutional code. As known in the art, MAP decoders 350₁, 350₂ operate iteratively with respect to one another, effectively providing “extrinsic” values applied to the inputs of each other for the next iteration. The decoded output Y is generated by decoder 350₁ after a desired number of iterations.
Maximum a posteriori (MAP) decoders 350 are conventionally arranged to generate LLR values by way of a “forward-backward” algorithm, generally implemented by way of a trellis structure, and typically operating in the log domain. FIG. 3c illustrates the construction of such a conventional decoder 350, which follows the MAP decoder architectures described in Lee et al., “Area-Efficient High-Throughput MAP Decoder Architectures”, Transactions on VLSI Systems, Vol. 13, No. 8 (IEEE, August 2005), pp. 921-33, incorporated herein by this reference. In this architecture, decoder 350 includes γ unit 311, α unit 315, β unit 317, and Λ MPU (Metric Processing Unit) 309. γ unit 311 includes γ MPU units 308, which produce branch metric values, referred to in the art as the γ values, based on the received input values and versions of those values delayed by α delay line 313a and β delay line 313b; multiplexers 314 determine whether to apply the input value or the β-delayed value to the γ units 308 generating those γ values that are applied to β unit 317. α unit 315 includes α MPU 310, which computes a “forward” recursion value for an α(k) term at a sample point sₖ, by applying a maximum function (“max*”) to α(k−1) and γ(k−1) values from a sample point sₖ₋₁. The γ values are also applied to β unit 317, which includes two β MPUs 312, in this example, and which performs “backward” recursion of a value for a β(k) term at the sample point sₖ, from application of the max* function to β(k+1) and γ(k+1) values from a sample point sₖ₊₁; the results of this backward recursion are stored in metric memory 318, as sequenced by multiplexer 316. Λ MPU 309 derives the LLR output values from the outputs of α unit 315 and β unit 317, as well as the original γ values from α delay line 313a.
In the arrangement of FIG. 3c, as well as in conventional turbo decoders, the trellis stages in the forward and backward recursion paths (α MPU 310 and β MPUs 312 of FIG. 3c), respectively, operate in a similar manner as in the Viterbi decoder, and as such may be implemented by add-compare-select stages. FIG. 3d illustrates construction of an add-compare-select “kernel” as used in α MPU 310 implemented in radix-2 fashion. A pair of adders 302₀, 302₁ receive the γ₀ and γ₁ values for the two paths into a particular trellis node (which correspond to the “branch metrics” of Viterbi decoding), along with respective α₀ and α₁ values for those two paths (which correspond to the “path metrics” of Viterbi decoding, in this case of forward metric “α” computation). Add-compare-select stage 310 also includes adder 304, which computes a difference between the sums output from adders 302₀, 302₁; those sums are also applied to multiplexer 306. The sign bit of the difference derived by adder 304, which in this two's complement implementation is merely the most-significant-bit of the difference, controls the selection made by multiplexer 306; as known in the art, for turbo decoding, the maximum of the two log sums generated by adders 302₀, 302₁ is selected. The difference generated by adder 304 is also applied to look-up table 305, which generates a correction factor for the max* operation, as known in the art. The sum selected by multiplexer 306 and this correction factor from LUT 305 are added by adder 307 to produce the new α value in this example.
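The data flow of such a radix-2 max* kernel can be sketched as follows. The sketch uses floating-point values rather than two's complement words, and the look-up table size (32 entries) and step (0.25) are assumptions chosen for illustration; the correction term is the commonly used log(1 + exp(−|d|)), where d is the difference between the two sums.

```python
import math

LUT_STEP = 0.25  # assumed quantization step of |difference|
LUT = [math.log1p(math.exp(-i * LUT_STEP)) for i in range(32)]  # assumed size

def max_star_acs(alpha0, gamma0, alpha1, gamma1):
    """Radix-2 add-compare-select kernel with max* correction."""
    sum0 = alpha0 + gamma0  # first adder
    sum1 = alpha1 + gamma1  # second adder
    diff = sum0 - sum1      # difference; its sign drives the multiplexer
    selected = sum1 if diff < 0 else sum0  # select the larger sum
    index = int(abs(diff) / LUT_STEP)      # address into the look-up table
    correction = LUT[index] if index < len(LUT) else 0.0
    return selected + correction           # final adder
```

With these assumed table parameters the result stays within about 0.13 of the exact value log(exp(sum0) + exp(sum1)); a finer step or rounding of the table address would reduce that error at the cost of a larger table.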
As discussed above for the case of the Viterbi decoder, the critical path delay of the turbo decoder must be minimized in order to implement a high-speed turbo decoder for broadband applications requiring greater than 100 Mbps data rates. And as mentioned above for the Viterbi decoder, the throughput of the forward-backward trellis stages cannot be increased by massive pipelining of add-compare-select functions 310, 312, because these computations are not purely feedforward.
By way of further background, the “carry-select” adder is a well-known adder architecture, in which the addends are broken into sectors, and added on a sector-by-sector basis. The adder for each sector (other than the least significant sector) produces two sums—one sum assuming a carry-in of zero, and the other sum assuming a carry-in of one. The carry bit output from the next least significant sector controls a multiplexer to select the correct one of the two possible sums, based on the actual carry-in value produced by the next least significant sector.
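A behavioral sketch of the carry-select principle is given below, for illustration only. The word width, the sector width and the function name are assumptions, and the width is assumed to be a multiple of the sector width. For simplicity the sketch forms both candidate sums in every sector, including the least significant one, where only the carry-in-zero sum is actually needed.

```python
def carry_select_add(a, b, width=16, sector=4):
    """Add two unsigned `width`-bit values sector by sector.

    Returns (sum modulo 2**width, carry out).
    """
    mask = (1 << sector) - 1
    result, carry = 0, 0
    for shift in range(0, width, sector):
        x = (a >> shift) & mask
        y = (b >> shift) & mask
        sum0 = x + y      # candidate sum assuming a carry-in of zero
        sum1 = x + y + 1  # candidate sum assuming a carry-in of one
        chosen = sum1 if carry else sum0  # multiplexer driven by the lower carry
        result |= (chosen & mask) << shift
        carry = chosen >> sector          # carry passed to the next sector
    return result, carry
```

In hardware the two candidate sums of all sectors are formed concurrently, so that only the multiplexer selections depend on the carries from the lower sectors; the loop above models the selection logic, not that timing.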