The Viterbi algorithm is used for maximum-likelihood sequence decoding of convolutional codes, which are employed for data transmission in many digital communications standards.
It is known in the field of digital communications that Viterbi decoding operations can be visualized by a directed graph known as a trellis, which can be partitioned into several sections or stages, each of which corresponds to a unit of time. A trellis section has J input nodes and J output nodes, where $J = 2^m$ denotes the number of states of the underlying convolutional encoder, in which m is the encoder memory in bits, typically an integer between 4 and 9. The state of the convolutional encoder can be defined through equation

$$s_t \triangleq 2^{m-1} \cdot b_{t-1} + \ldots + 2^{1} \cdot b_{t-m+1} + 2^{0} \cdot b_{t-m}, \qquad \text{(Equ. 1)}$$

where $b_t$ denotes an input bit of the convolutional encoder.
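The state definition of Equ. 1 can be illustrated by a minimal sketch, in which the function name `next_state`, the value m = 4, and the input bits are assumptions chosen for illustration only. The newest input bit enters at the most significant bit of the m-bit state, and the oldest bit is shifted out on the right.

```python
def next_state(state: int, bit: int, m: int) -> int:
    """Advance an m-bit encoder state by one input bit.

    Per Equ. 1 the newest bit b_{t-1} has weight 2**(m-1) and the oldest
    bit b_{t-m} has weight 2**0, so a new bit enters at the MSB while
    the oldest bit drops out of the LSB.
    """
    return (bit << (m - 1)) | (state >> 1)


m = 4                      # assumed encoder memory in bits
J = 1 << m                 # number of states, J = 2**m
state = 0
for b in (1, 0, 1, 1):     # arbitrary example input bits
    state = next_state(state, b, m)
```

After the four example bits the state holds the bits 1, 1, 0, 1 from most significant to least significant, i.e., the most recent input bit first.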
The Viterbi algorithm essentially comprises three steps. In a first step, branch metrics β(c) are computed, for all binary code n-tuples c. A branch metric β(c) represents a distance between a received n-tuple of channel outputs and an ideal n-tuple of channel outputs for the given code n-tuple c. The computation of branch metrics is known to those skilled in the art.
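By way of example only, the first step may be sketched as follows. The sketch assumes antipodal signalling (code bit 0 mapped to +1, code bit 1 mapped to −1) and uses a correlation value as the branch metric, so that a larger value indicates a better match. Under these assumptions the correlation orders the n-tuples in the same way as a negated squared Euclidean distance. The function name `branch_metrics` and the received values are hypothetical.

```python
from itertools import product


def branch_metrics(received, n):
    """Return a dict mapping each binary code n-tuple c to beta(c)."""
    metrics = {}
    for c in product((0, 1), repeat=n):
        # Assumed mapping: code bit 0 -> +1, code bit 1 -> -1.
        metrics[c] = sum(r * (1 - 2 * bit) for r, bit in zip(received, c))
    return metrics


beta = branch_metrics([0.9, -1.1], n=2)   # hypothetical soft channel outputs
best = max(beta, key=beta.get)            # n-tuple closest to the received pair
```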
In a second step, one implementation of the Viterbi algorithm performs a maximization of a log-likelihood function through a sequence of add-compare-select (ACS) operations, which compute an array of J new state metrics $M_{t+1}(0), M_{t+1}(1), \ldots, M_{t+1}(J-1)$ based on J current state metrics $M_t(0), M_t(1), \ldots, M_t(J-1)$ and the set of previously computed branch metrics. The ACS operations can be expressed by the equation

$$M_{t+1}(s_{t+1}) = \max\left[ M_t(s'_t) + \beta(G(s'_t, s_{t+1})),\; M_t(s''_t) + \beta(G(s''_t, s_{t+1})) \right] \qquad \text{(Equ. 2)}$$

where $s'_t$ and $s''_t$ are those states at time t which are connected through trellis branches with state $s_{t+1}$ at time t+1, and where $G(s_t, s_{t+1})$ denotes the code n-tuple corresponding to the branch from state $s_t$ to state $s_{t+1}$. For each new state $s_{t+1}$, the maximum operation effectively determines a survivor branch and an associated survivor metric. Moreover, information identifying the survivor branch is recorded during the ACS operations. Generally, the add-compare-select (ACS) operations calculate the best path through the trellis for each state.
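The ACS recursion of Equ. 2 may be sketched for one trellis section as follows. The encoder used here (rate 1/2, m = 2, generator polynomials 7 and 5 in octal), the helper `code_tuple`, the branch metric values, and the one-bit survivor encoding are assumptions for illustration and are not taken from the cited art.

```python
def code_tuple(s_prev, bit, m, gens):
    """G(s_t, s_{t+1}): encoder output for input `bit` in state `s_prev`."""
    reg = (bit << m) | s_prev          # b_t followed by the m state bits
    return tuple(bin(reg & g).count("1") & 1 for g in gens)


def acs_section(M, beta, m, gens):
    """One trellis section of Equ. 2; returns (new metrics, decisions)."""
    J = 1 << m
    new_M, decisions = [0.0] * J, [0] * J
    for s_next in range(J):
        bit = s_next >> (m - 1)                  # input bit leading to s_next
        cands = []
        for lsb in (0, 1):                       # the two predecessor states
            s_prev = ((s_next << 1) & (J - 1)) | lsb
            cands.append(M[s_prev] + beta[code_tuple(s_prev, bit, m, gens)])
        decisions[s_next] = int(cands[1] > cands[0])   # LSB of the survivor
        new_M[s_next] = max(cands)
    return new_M, decisions


gens = (0b111, 0b101)                            # assumed generators (7, 5)
beta = {(0, 0): 2.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): -2.0}
new_M, decisions = acs_section([0.0] * 4, beta, 2, gens)
```

Each entry of `decisions` records which of the two predecessor states survived, which is the information a subsequent traceback requires.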
In a third step, a traceback operation is performed along survivor branches in the trellis, starting either from the state with maximum metric at a particular instant of time or from state 0, if a zero tail has been used by the transmitter to drive the encoder to state 0. While performing a traceback operation, a most likely path through the trellis is identified and decoded data corresponding to the branches on that path are output.
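The traceback of the third step may be sketched as follows, assuming that one decision bit per state and per section has been recorded and that this bit equals the least significant bit of the surviving predecessor state. This storage format, the function name `traceback`, and the example decision values are assumptions for illustration.

```python
def traceback(decisions, m, final_state=0):
    """Follow survivor branches backwards and return the decoded bits.

    decisions[t][s] is the assumed survivor bit recorded for state s at
    the output of section t. The default final_state=0 corresponds to a
    zero tail that has driven the encoder to state 0.
    """
    J = 1 << m
    state, bits = final_state, []
    for dec in reversed(decisions):
        bits.append(state >> (m - 1))            # MSB of a state is its input bit
        state = ((state << 1) & (J - 1)) | dec[state]
    bits.reverse()
    return bits


# Hypothetical survivor data for m = 2 and the state path 0 -> 2 -> 1 -> 0.
decisions = [[0, 0, 0, 0], [0, 0, 0, 0], [1, 0, 0, 0]]
decoded = traceback(decisions, m=2)
```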
Furthermore, it is known that a trellis section for a rate-1/n binary convolutional code can be decomposed into subgraphs called Viterbi butterflies. The corresponding ACS operations will be referred to as butterfly operations. FIG. 1 shows a Viterbi butterfly, where the nodes on the left and right represent a pair of states at times t and t+1, respectively, and the branches represent the possible state transitions from state pair (2j, 2j+1) to state pair (j, j+J/2). The branches are labeled with code n-tuples c, which are generated by the encoder in corresponding state transitions and which in turn select the appropriate branch metric β(c).
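A single butterfly operation may be sketched as follows. The function names and the numeric values are hypothetical, and the four branch metrics are passed in directly rather than being selected by code n-tuples.

```python
def butterfly_states(j, m):
    """Return the input pair (2j, 2j+1) and output pair (j, j + J/2)."""
    J = 1 << m
    return (2 * j, 2 * j + 1), (j, j + J // 2)


def butterfly(m_even, m_odd, beta_e0, beta_o0, beta_e1, beta_o1):
    """ACS for one butterfly.

    beta_e0 / beta_o0: branch metrics from states 2j / 2j+1 to state j.
    beta_e1 / beta_o1: branch metrics from states 2j / 2j+1 to state j + J/2.
    """
    new_lo = max(m_even + beta_e0, m_odd + beta_o0)    # metric of state j
    new_hi = max(m_even + beta_e1, m_odd + beta_o1)    # metric of state j + J/2
    return new_lo, new_hi


pairs = butterfly_states(3, m=4)
result = butterfly(1.0, 2.0, 0.5, -0.5, -0.5, 0.5)
```

Since j ranges over 0 to J/2 − 1, the J/2 butterflies of a section together read every current state metric exactly once and write every new state metric exactly once.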
Straightforward implementations of ACS operations on programmable DSPs, whose data register file size is typically limited by instruction encoding constraints, require two memory buffers for holding metrics and other state-dependent data. The two buffers are used in a ping-pong fashion, as described, for example, in Motorola, Inc., and Agere Systems, “How to Implement a Viterbi Decoder on the StarCore SC140”, Application Note ANSC140VIT/D, Jul. 18, 2000. In such Viterbi decoder implementations, all state metrics must be read from a first memory buffer into registers, updated, and written back from registers to a second memory buffer, for every stage of the Viterbi decoder. These frequent memory transfers are expensive in terms of power consumption.
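The ping-pong use of two metric buffers may be sketched as follows. This is not the code of the cited application note. The function name `run_ping_pong` and the table layout `beta[s_prev][bit]`, which holds the branch metric for leaving state `s_prev` with a given input bit, are assumptions for illustration.

```python
def run_ping_pong(M0, stage_betas, m):
    """Update state metrics over several stages using two buffers."""
    J = 1 << m
    half = J // 2
    buf = [list(M0), [0] * J]        # the two ping-pong buffers
    src = 0                          # index of the buffer that is read
    for beta in stage_betas:
        cur, nxt = buf[src], buf[1 - src]
        for j in range(half):
            a, b = cur[2 * j], cur[2 * j + 1]            # read from first buffer
            nxt[j] = max(a + beta[2 * j][0], b + beta[2 * j + 1][0])
            nxt[j + half] = max(a + beta[2 * j][1], b + beta[2 * j + 1][1])
        src = 1 - src                # swap buffer roles for the next stage
    return buf[src]


zero_beta = [[0, 0] for _ in range(4)]                   # hypothetical metrics
after_one = run_ping_pong([0, 1, 2, 3], [zero_beta], m=2)
after_two = run_ping_pong([0, 1, 2, 3], [zero_beta, zero_beta], m=2)
```

Every stage reads all J metrics from one buffer and writes all J metrics to the other, which is the source of the memory traffic noted above.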
A technique that is based on rotated metric indexing and which avoids the overhead associated with ping-pong memory buffers is described by Marc Biver, Hubert Kaeslin, and Carlo Tommasini in “In-place updating of path metrics in Viterbi decoders”, IEEE J. of Solid-State Circuits, Vol. 24, No. 4, pp. 1158-1160, August 1989. This technique can be described as follows. Let rotl(j, a, m) (rotr(j, a, m)) denote the result of rotating the m LSBs of an integer j towards the left (right) by a bits and leaving all other bits of j unchanged, and define

$$\begin{aligned} s'_{t+1} &\triangleq \mathrm{rotl}(s_{t+1}, 1, m) = \mathrm{rotl}\left(2^{m-1} \cdot b_t + 2^{m-2} \cdot b_{t-1} + \ldots + 2^{0} \cdot b_{t-m+1},\; 1,\; m\right) \\ &= 2^{m-1} \cdot b_{t-1} + \ldots + 2^{1} \cdot b_{t-m+1} + 2^{0} \cdot b_t. \end{aligned} \qquad \text{(Equ. 3)}$$

By relabeling the nodes on the right in FIG. 1 with $s'_{t+1}$, the modified graph in FIG. 2 is obtained. By defining

$$M'_{t+1}(s'_{t+1}) \triangleq M_{t+1}(s_{t+1}) \qquad \text{(Equ. 4)}$$

it is implied by Biver et al. that the pair of state metrics $(M_t(2j), M_t(2j+1))$ can be replaced by $(M'_{t+1}(2j), M'_{t+1}(2j+1))$ through in-place butterfly operations. After performing these in-place operations, the metric $M'_{t+1}(s'_{t+1})$ is stored at index $s_{t+1} = \mathrm{rotr}(s'_{t+1}, 1, m)$, for $0 \le s'_{t+1} < J$, which is a permutation of the metric array.
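The relationship between an ordinary trellis section and an in-place section followed by a one-bit right rotation of the array indices may be checked numerically with the following sketch. The function names, the table layout `beta[s_prev][bit]`, and the random integer test data are assumptions for illustration and do not reproduce the implementation of Biver et al.

```python
import random


def rotr(j, a, m):
    """Rotate the m LSBs of j right by a bits; higher bits are unchanged."""
    mask = (1 << m) - 1
    low, a = j & mask, a % m
    return (j & ~mask) | (((low >> a) | (low << (m - a))) & mask)


def section_reference(M, beta, m):
    """Out-of-place section: states (2j, 2j+1) feed states (j, j + J/2)."""
    half = (1 << m) // 2
    new = [0] * (2 * half)
    for j in range(half):
        a, b = M[2 * j], M[2 * j + 1]
        new[j] = max(a + beta[2 * j][0], b + beta[2 * j + 1][0])
        new[j + half] = max(a + beta[2 * j][1], b + beta[2 * j + 1][1])
    return new


def section_in_place(M, beta, m):
    """In-place section: each butterfly overwrites its own input pair."""
    for j in range((1 << m) // 2):
        a, b = M[2 * j], M[2 * j + 1]
        M[2 * j] = max(a + beta[2 * j][0], b + beta[2 * j + 1][0])        # metric of state j
        M[2 * j + 1] = max(a + beta[2 * j][1], b + beta[2 * j + 1][1])    # metric of state j + J/2


def permute_right(M, m):
    """The metric held at rotated index s' moves to index rotr(s', 1, m)."""
    out = [0] * len(M)
    for s_rot in range(len(M)):
        out[rotr(s_rot, 1, m)] = M[s_rot]
    return out


m = 3
rng = random.Random(0)
M = [rng.randint(0, 9) for _ in range(1 << m)]
beta = [[rng.randint(-5, 5) for _ in (0, 1)] for _ in range(1 << m)]

expected = section_reference(M, beta, m)
work = list(M)
section_in_place(work, beta, m)
matches = permute_right(work, m) == expected
```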
It thus follows from Biver et al. that a section S of the original trellis is equivalent to an in-place trellis section T followed by a permutation R that rotates array indices towards the right by one bit, which is expressed by equation

$$S = T \cdot R, \qquad \text{(Equ. 5)}$$

and illustrated in FIG. 2. Furthermore, the concatenation of m original trellis sections $S_0 \cdot S_1 \cdot \ldots \cdot S_{m-1}$, where m denotes the encoder memory in bits, is equivalent to the concatenation $(T_0 \cdot R) \cdot (T_1 \cdot R) \cdot \ldots \cdot (T_{m-1} \cdot R)$, where $T_i = T$. Due to the associative law, the latter concatenation is in turn equivalent to the concatenation of in-place trellis sections $\tilde{T}_0 \cdot \tilde{T}_1 \cdot \ldots \cdot \tilde{T}_{m-1}$, where $\tilde{T}_i$ is defined through equation

$$\tilde{T}_i \triangleq R^i \cdot T \cdot L^i, \qquad \text{(Equ. 6)}$$

in which $R^i$ and $L^i$ denote permutations that rotate array indices by i bits towards the right and left, respectively.
FIG. 3 shows m sections of the trellis for in-place metric updating according to the prior art, for a time instant t that is wholly divisible by the encoder memory m. The structure of the in-place trellis section $\tilde{T}_i$ repeats after every m sections. For in-place metric updating according to FIG. 3, the elements of the array of J metrics are referenced by the column indices $\tilde{s}_t$ in FIG. 3, which are given by equation

$$\tilde{s}_t \triangleq \mathrm{rotl}(s_t, \mathrm{mod}(t, m), m), \qquad \text{(Equ. 7)}$$

i.e., the metrics are stored in a permuted order that depends on the current time index t.
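The time-dependent storage order of Equ. 7 may be illustrated by the following sketch, which performs in-place updates over several sections and compares the result with an out-of-place reference. The function names, the table layout `beta[s_prev][bit]`, and the random integer test data are assumptions for illustration.

```python
import random


def rotl(j, a, m):
    """Rotate the m LSBs of j left by a bits; higher bits are unchanged."""
    mask = (1 << m) - 1
    low, a = j & mask, a % m
    return (j & ~mask) | (((low << a) | (low >> (m - a))) & mask)


def section_reference(M, beta, m):
    """Out-of-place section: states (2j, 2j+1) feed states (j, j + J/2)."""
    half = (1 << m) // 2
    new = [0] * (2 * half)
    for j in range(half):
        a, b = M[2 * j], M[2 * j + 1]
        new[j] = max(a + beta[2 * j][0], b + beta[2 * j + 1][0])
        new[j + half] = max(a + beta[2 * j][1], b + beta[2 * j + 1][1])
    return new


def butterfly_pairs(i, m):
    """Array index pairs used by the butterflies of in-place section i."""
    return [(rotl(2 * j, i, m), rotl(2 * j + 1, i, m))
            for j in range((1 << m) // 2)]


def section_rotated_in_place(M, beta, t, m):
    """In-place update at time t with metrics stored per Equ. 7."""
    for j, (p, q) in enumerate(butterfly_pairs(t % m, m)):
        a, b = M[p], M[q]
        M[p] = max(a + beta[2 * j][0], b + beta[2 * j + 1][0])     # metric of state j
        M[q] = max(a + beta[2 * j][1], b + beta[2 * j + 1][1])     # metric of state j + J/2


def read_metric(M, s, t, m):
    """Metric of state s at time t, located through Equ. 7."""
    return M[rotl(s, t % m, m)]


m, stages = 3, 5
J = 1 << m
rng = random.Random(1)
ref = [rng.randint(0, 9) for _ in range(J)]
work = list(ref)                                   # natural order at t = 0
ok = True
for t in range(stages):
    beta = [[rng.randint(-5, 5) for _ in (0, 1)] for _ in range(J)]
    ref = section_reference(ref, beta, m)
    section_rotated_in_place(work, beta, t, m)
    ok = ok and all(read_metric(work, s, t + 1, m) == ref[s] for s in range(J))
```

In this sketch `butterfly_pairs` returns a different set of array index pairs for each of the m section types, so the array locations touched by a given butterfly change from section to section and repeat only after m sections.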
Attempts were made in recent programmable DSPs to accelerate Viterbi decoding by instruction-level parallelism and/or data-level parallelism.
FIG. 4 shows an exemplary DSP according to the prior art, which is based on memory-to-register load instructions, register-to-register compute instructions, and register-to-memory store instructions. Generally, an instruction or program instruction is herein contemplated as a sequence of stages comprising instruction fetch, instruction decoding, operand read, execute, and result writeback. Program instructions are decoded by an instruction decoder 2 and executed by either moving data between memory 6 and registers in a data register file 8 or by performing arithmetic/logic operations on the contents of such registers in arithmetic/logic units (ALUs) 10 and saving the results in such registers. The source address of memory-to-register load instructions and the target address of register-to-memory store instructions are usually provided by an address generation unit 4. The DSP in FIG. 4 uses multiple arithmetic/logic units (ALUs) 10 and one common data register file 8.
In order to implement a Viterbi decoder on a programmable DSP as shown in FIG. 4, metric data are typically placed in the data memory 6 and only temporarily moved from/to registers in the data register file 8 for performing ACS operations since the number of metrics, J, typically exceeds the number of available registers in the data register file 8. Those skilled in the art will recognize that repeatedly moving metric data from/to memory incurs a significant power consumption overhead. In general, in-place data processing is a means for reducing the number of memory transfers. However, even if the J metrics fit into the data register file 8 so that each metric corresponds to a register in the data register file 8, the in-place metric updating according to FIG. 3 is not beneficial since different sets of registers correspond to the input/output metrics of a butterfly in each section $\tilde{T}_i$ of the trellis. This implies different and non-uniform program code for each of m trellis sections and therefore a significant growth of program code so that rotated metric indexing cannot be used efficiently on programmable DSPs as shown in FIG. 4.
On programmable, pipelined DSPs, Viterbi ACS operations are advantageously broken up into a set of metric add instructions, followed by a set of metric compare/select instructions. Furthermore, it is desirable to use software pipelining, i.e., metric add instructions that perform operations for one set of butterflies are executed concurrently with metric compare/select instructions that perform operations for another set of butterflies. For a DSP as shown in FIG. 4, such concurrency demands that all source operands of the metric add instruction and the metric compare/select instruction be read simultaneously from the data register file 8. Similarly, all target operands of said instructions must be written simultaneously to the data register file 8. It will be apparent to those skilled in the art that the correspondingly large number of register file read ports and write ports is costly in terms of chip area and power consumption.
Another approach is disclosed in U.S. Pat. No. 5,987,490, which describes a dual-MAC processor that performs up to two ACS operations every two machine cycles or, equivalently, achieves a performance of 2.0 cycles per Viterbi butterfly. This configuration has the disadvantages arising from the need for a ping-pong buffer in memory, as outlined above.
Yet another approach is used in the StarCore SC140 DSP architecture, which has four functional units and exploits subword parallelism to process two Viterbi butterflies in parallel, achieving a highest performance of 1.25 cycles per Viterbi butterfly as described in Motorola, Inc., and Agere Systems, “How to Implement a Viterbi Decoder on the StarCore SC140”, Application Note ANSC140VIT/D, Jul. 18, 2000. This configuration has the disadvantages arising from the need for a ping-pong buffer in memory, as outlined above.
A further disadvantage of these prior-art techniques is that they do not scale for the parallel processing of four or more Viterbi butterflies. A primary reason for this problem is the limited number of data registers available in DSP architectures which do not support indirect access to registers and whose register file size is thus limited by instruction encoding constraints.
From the above it follows that there is still a need in the art for an improved programmable digital signal processing device for implementing a Viterbi algorithm. Moreover, the add-compare-select (ACS) operations should be performed more efficiently in order to reduce the power-consuming memory transfers to a minimum.