Convolutional encoding is used in many communication standards, including, for example, Wireless Local Area Network (WLAN) standards such as IEEE 802.11a/b/g/n. Other examples are possible as well. In convolutional encoding, as in other error correction mechanisms, redundancy is added to the data so that the data can be recovered in the event it is corrupted by noise, channel conditions, and/or receiver non-idealities.
In a convolutional encoder, an input bit stream is applied to a shift register. Input bits are combined, using single-bit binary addition (XOR), with the outputs of several shift register cells. The bit streams obtained at the output form a representation of the encoded input bit stream. Each bit at the input of the convolutional encoder results in n output bits. The coding rate is thus 1/n (or k/n if k input bits are processed at a time). These output bits are a function of the current input bit and the K−1 previous input bits, where K is called the constraint length.
In general, a convolutional code is identified by the following characteristics: the constraint length K, the number n of output branches, and the polynomial Gx for each output branch. The constraint length K determines the number of memory elements in the shift register. It is defined as the shift register length plus one. Each of the n output branches outputs one bit. The polynomial Gx for each output branch defines the relation of the output bit to the current input bit and the K−1 previous input bits. Each output bit is a modulo-2 addition (XOR operation) of some of the input bits. The polynomial Gx indicates which bits in the input stream have to be added to form the output.
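As an illustration of the encoder structure described above, the following Python sketch implements a rate-1/n encoder. It assumes a shift register of K−1 cells (so that the constraint length K is the register length plus one) and represents each generator polynomial as an integer whose most significant of K bits taps the current input bit. The function name conv_encode and the default generator pair (133 and 171 in octal, with K = 7) are illustrative choices, not taken from the text above.

```python
def conv_encode(bits, polys=(0o133, 0o171), K=7):
    """Rate-1/n convolutional encoder sketch.

    bits  : iterable of input bits (0 or 1)
    polys : one generator polynomial per output branch, as an integer;
            bit K-1 taps the current input bit, bit 0 the oldest stored bit
    K     : constraint length (shift register length plus one)
    """
    state = 0  # the K-1 previous input bits, newest in the top position
    out = []
    for b in bits:
        # K-bit window: current input bit followed by the K-1 stored bits
        reg = (b << (K - 1)) | state
        for g in polys:
            # modulo-2 sum (XOR) of the bits selected by the polynomial
            out.append(bin(reg & g).count("1") & 1)
        # shift: the oldest bit drops out, the current bit is stored
        state = reg >> 1
    return out


if __name__ == "__main__":
    # Small K = 3 code with generators 7 and 5 (octal), chosen for readability
    print(conv_encode([1, 0, 1, 1], polys=(0o7, 0o5), K=3))
```

With the small K = 3 code used in the usage line, the input 1 0 1 1 yields the output pairs 11 10 00 01, that is, n = 2 output bits per input bit.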
An encoder is completely characterised by n polynomials of degree at most K−1. The encoder can be in different states, each represented by the K−1 input bits held in the shift register. Every new input bit processed by the encoder leads to a state transition. The state diagram can be unfolded in time to represent the transitions at each stage in time. Such a representation is called a trellis diagram.
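The state transitions that make up one stage of the trellis can be tabulated directly. The sketch below assumes that a state is the integer formed by the K−1 stored input bits (newest bit in the top position) and uses a small K = 3 code with generators 7 and 5 (octal) as an example; the function name build_trellis and this state numbering are assumptions made for illustration.

```python
def build_trellis(K, polys):
    """Return {(state, input_bit): (next_state, output_bits)} for one trellis stage."""
    n_states = 1 << (K - 1)  # one state per content of the K-1 register cells
    table = {}
    for state in range(n_states):
        for bit in (0, 1):
            # K-bit window: current input bit followed by the stored bits
            reg = (bit << (K - 1)) | state
            outputs = tuple(bin(reg & g).count("1") & 1 for g in polys)
            # shifting the register gives the next state
            table[(state, bit)] = (reg >> 1, outputs)
    return table


if __name__ == "__main__":
    trellis = build_trellis(K=3, polys=(0o7, 0o5))
    for (state, bit), (nxt, out) in sorted(trellis.items()):
        print(f"state {state:02b} --{bit}/{out}--> state {nxt:02b}")
```

Repeating this table once per input bit, stage after stage, gives the trellis diagram; every state has exactly two outgoing and two incoming transitions.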
In a convolutional encoder, data bits are fed into a delay line (of length K) from which certain branches are XOR-ed and fed to the output. Considering WLAN as an example, throughput requirements are pushed towards decoder output rates of 600 Mbps (in the IEEE 802.11n standard) while energy efficiency must be kept as high as possible. In many cases, there is additionally a desire to keep the area footprint as low as possible. A Viterbi decoder implemented in a handheld device typically has to satisfy these requirements.
Viterbi decoding is a well-known method for decoding convolutional codes. It is a near-optimal method of decoding convolutionally encoded data. Compared to optimal decoding, however, it has greatly reduced complexity and memory requirements. In general, during decoding the most probable path through the trellis diagram is reconstructed using the received (soft) bits, and the original data is determined from that path. Specifically, in Viterbi decoding, a window (with a so-called trace-back length) is considered before taking a decision on the most probable path and the corresponding decoded bit. Constraining the decision to a window, rather than the complete data sequence, considerably reduces complexity without sacrificing decoding performance significantly. A high-level view of the Viterbi decoding operation is depicted in FIG. 1.
Starting from input Log Likelihood Ratios (LLRs), path metrics are calculated for each of the S = 2^(K−1) paths. One of these paths is selected as optimal and the result of this decision is stored in the trace-back memory. Once a number of path metrics equal to the trace-back depth has been calculated, an output bit can be produced for every incoming pair of input LLRs.
Viterbi decoding is typically performed in a streaming fashion and the main bottleneck is situated in the state memory update. In order to boost the throughput, this iterative loop needs to be avoided or optimized. Breaking down iterative loops into parallel computations is a known technique, and the higher-level concept behind it has been applied in other domains since the 1980s. This work has mainly targeted digital signal processing algorithms, but some iterative control algorithm kernels have also been treated this way. The idea of parallelizing Viterbi decoding has been described in the art. The principle of Viterbi decoding parallelization is sometimes also referred to as radix-2^Z or Z-level look-ahead (LAH) decoding. Look-ahead techniques combine several consecutive trellis steps into one trellis step through parallel computation. The number of combined trellis steps defines the look-ahead factor Z.
Based on the techniques explained above, many contributions have been made to offer high-speed Viterbi decoding. Some of these contributions only address solutions for a limited number of states and have a clear focus on boosting performance without taking into account a possible trade-off with area and energy. Others exploit look-ahead techniques to allow extra pipelining inside the decoding loop, resulting in throughputs that are equal to or lower than a single bit per clock cycle.
The paper “Design Space Exploration of Hard-Decision Viterbi Decoding: Algorithm and VLSI Implementation” (Irfan Habib et al., IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol. 18, no. 5, May 2010) presents an extensive design space exploration for performing Viterbi decoding, taking into account area, throughput, and power. At a top level, a typical Viterbi decoder consists of three units, namely the branch metric unit (BMU), the path metric unit (PMU), and the survivor memory unit (SMU). The paper explores the design space for each unit.
The BMU calculates the distances from the received (noisy) symbols to all code words. The measure calculated by the BMU can be, for example, the Hamming distance in the case of hard-input decoding, or the Manhattan or Euclidean distance in the case of soft-input decoding (e.g., when every incoming symbol is represented using several bits).
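The two kinds of distance measure mentioned above can be sketched as follows. The function names are hypothetical, and the soft-input variant assumes that a transmitted bit 0 is mapped to +1 and a bit 1 to −1; that mapping is an assumption made for this example, not something stated in the text.

```python
from itertools import product


def hamming_metric(received, codeword):
    """Hard-input distance: number of positions where the bits differ."""
    return sum(r != c for r, c in zip(received, codeword))


def euclidean_metric(received, codeword):
    """Soft-input distance: squared Euclidean distance to the ideal symbols.

    Assumed mapping: bit 0 -> +1.0, bit 1 -> -1.0.
    """
    return sum((r - (1 - 2 * c)) ** 2 for r, c in zip(received, codeword))


def branch_metrics(received, metric=hamming_metric):
    """Distance from one received symbol group to every possible code word."""
    n = len(received)
    return {cw: metric(received, cw) for cw in product((0, 1), repeat=n)}


if __name__ == "__main__":
    print(branch_metrics((1, 0)))                       # hard input
    print(branch_metrics((0.9, -1.1), euclidean_metric))  # soft input
```

For a rate-1/2 code there are only four possible code words per trellis stage, so the BMU output for one received pair is a table of four distances that all states can share.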
The PMU accumulates the distances of the single code word metrics produced by the BMU for every state. Under the assumption that zero or one was transmitted, corresponding branch metrics are added to the previously stored path metrics which are initialized with zero values. The resulting values are compared with each other and the smaller value is selected and stored as the new path metric for each state. In parallel, the corresponding bit decision (zero or one) is transferred to the SMU while the inverse decision is discarded.
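A single add-compare-select update over all states might look like the sketch below. It assumes a K = 3 code with generators 7 and 5 (octal), states numbered by the K−1 stored input bits with the newest bit in the top position, and ties resolved in favour of the first candidate; the function name acs_step and these conventions are illustrative assumptions rather than the design of any particular decoder.

```python
def acs_step(path_metrics, branch_metric, K=3, polys=(0o7, 0o5)):
    """One add-compare-select update for every state.

    path_metrics  : list of current path metrics, one per state
    branch_metric : dict mapping each code word (tuple of bits) to its distance
    Returns (new_path_metrics, decisions); decisions[s] is the bit that
    identifies which of the two predecessors of state s survived.
    """
    n_states = 1 << (K - 1)
    new_metrics = [0] * n_states
    decisions = [0] * n_states
    for nxt in range(n_states):
        bit = nxt >> (K - 2)  # input bit that leads into state nxt
        candidates = []
        for d in (0, 1):  # d is the oldest bit of the predecessor state
            prev = ((nxt << 1) & (n_states - 1)) | d
            reg = (bit << (K - 1)) | prev
            out = tuple(bin(reg & g).count("1") & 1 for g in polys)
            candidates.append(path_metrics[prev] + branch_metric[out])  # add
        decisions[nxt] = 0 if candidates[0] <= candidates[1] else 1     # compare
        new_metrics[nxt] = candidates[decisions[nxt]]                   # select
    return new_metrics, decisions


if __name__ == "__main__":
    # Hamming distances to the received hard-decision pair (1, 1)
    bm = {(0, 0): 2, (0, 1): 1, (1, 0): 1, (1, 1): 0}
    print(acs_step([0, 0, 0, 0], bm))
```

Because the new path metrics depend on the metrics produced by the previous call, successive calls cannot simply be overlapped in time; the decisions list is what would be handed to the SMU.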
Finally, the SMU stores the bit decisions produced by the PMU for a defined number of clock cycles (referred to as the trace-back depth (TBD)) and processes them in reverse order, a procedure called backtracking. Starting from an arbitrary state, all state transitions in the trellis will merge to the same state after TBD (or fewer) clock cycles. From this point on, the decoded output sequence can be reconstructed.
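The backtracking step can be sketched as follows. The sketch assumes that each stored decision is the oldest bit of the surviving predecessor state, that states are numbered by the K−1 stored input bits with the newest bit in the top position, and that the decoded bit for a stage is the top bit of the state reached in that stage. The function name trace_back and the hand-built decision table in the usage part are hypothetical.

```python
def trace_back(decisions, K=3, start_state=0):
    """Walk the stored decisions backwards and return the decoded bits.

    decisions : list with one entry per trellis stage (oldest first); each
                entry is a list holding one decision bit per state
    """
    n_states = 1 << (K - 1)
    state = start_state
    bits = []
    for stage in reversed(decisions):
        # the newest bit of the state is the input bit that led into it
        bits.append(state >> (K - 2))
        # step to the surviving predecessor selected by the stored decision
        state = ((state << 1) & (n_states - 1)) | stage[state]
    bits.reverse()  # bits were collected newest first
    return bits


if __name__ == "__main__":
    # Decisions for the path 0 -> 2 -> 1 -> 2 -> 3 -> 1 -> 0 of a K = 3 code;
    # entries that do not lie on this path are simply set to 0 here.
    decisions = [
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0],
        [0, 1, 0, 0],
        [1, 0, 0, 0],
    ]
    print(trace_back(decisions))
```

In a windowed decoder only the oldest bit recovered by each backtracking pass over TBD stages would be released, since that is the part of the path on which the survivors are expected to have merged.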
The Habib paper mentions that the PMU is a critical block both in terms of area and throughput. The key problem of the PMU design is the recursive nature of the add-compare-select (ACS) operation (in which path metrics calculated in the previous clock cycle are used in the current clock cycle). In order to increase the throughput or to reduce the area, optimizations can be introduced at algorithmic, word, or bit level. Word level optimizations work on folding (serialization) or unfolding (parallelization) the ACS recursion loop.
In the folding technique, the same ACS unit is shared among a certain set of states. This technique trades off throughput for area. It is an area-efficient approach for low-throughput decoders, although with folding the routing of the path metrics becomes quite complex.
In the unfolding technique, two or more trellis stages are processed in a single recursion (i.e., look-ahead, as described above). If the look-ahead is short, the area penalty is not high. Radix-4 look-ahead (i.e., processing two bits at a time, Z=2) is a commonly used technique to increase the decoder's throughput.
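The equivalence that unfolding relies on can be checked with a small sketch: one radix-4 update (Z = 2) selects among the four two-step paths into each state and gives the same path metrics as two consecutive radix-2 updates. The code assumes a K = 3 code with generators 7 and 5 (octal) and states numbered by the K−1 stored input bits with the newest bit in the top position; all function names are hypothetical.

```python
def code_word(reg, polys):
    """Output bits for a K-bit register window."""
    return tuple(bin(reg & g).count("1") & 1 for g in polys)


def radix2_step(pm, bm, K=3, polys=(0o7, 0o5)):
    """Conventional update: one trellis stage, 2-way compare per state."""
    n_states = 1 << (K - 1)
    new = []
    for nxt in range(n_states):
        bit = nxt >> (K - 2)
        cands = []
        for d in (0, 1):
            prev = ((nxt << 1) & (n_states - 1)) | d
            cands.append(pm[prev] + bm[code_word((bit << (K - 1)) | prev, polys)])
        new.append(min(cands))
    return new


def radix4_step(pm, bm1, bm2, K=3, polys=(0o7, 0o5)):
    """Look-ahead update: two trellis stages, one 4-way compare per state."""
    n_states = 1 << (K - 1)
    new = []
    for nxt in range(n_states):
        bit2 = nxt >> (K - 2)
        cands = []
        for d2 in (0, 1):  # choice of intermediate state
            mid = ((nxt << 1) & (n_states - 1)) | d2
            m2 = bm2[code_word((bit2 << (K - 1)) | mid, polys)]
            bit1 = mid >> (K - 2)
            for d1 in (0, 1):  # choice of state two stages back
                prev = ((mid << 1) & (n_states - 1)) | d1
                m1 = bm1[code_word((bit1 << (K - 1)) | prev, polys)]
                # m1 + m2 does not depend on pm and can be computed ahead
                cands.append(pm[prev] + (m1 + m2))
        new.append(min(cands))
    return new


if __name__ == "__main__":
    bm1 = {(0, 0): 1, (0, 1): 2, (1, 0): 0, (1, 1): 1}  # received (1, 0)
    bm2 = {(0, 0): 1, (0, 1): 0, (1, 0): 2, (1, 1): 1}  # received (0, 1)
    pm = [0, 3, 1, 2]
    print(radix4_step(pm, bm1, bm2) == radix2_step(radix2_step(pm, bm1), bm2))
```

The combined branch metrics m1 + m2 lie outside the recursion, so only one addition and one 4-way comparison remain inside the loop per two decoded bits; the price is a larger number of candidates per state, which is the area cost mentioned above.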
Although the Habib paper mentions that look-ahead can be used to enhance throughput, it states in section IV.F that the use of look-ahead is to be discouraged, as the authors consider look-ahead techniques extremely expensive in terms of area and power consumption. Therefore, the design space exploration results do not consider the look-ahead option as an optimal trade-off point in the area versus power trade-off dimension. Moreover, the Habib paper only considers maximal power consumption and not the energy consumed in executing the Viterbi decoding task.