In modern audio/speech digital signal communication systems, a digital signal is compressed at an encoder and the compressed information is packetized and sent to a decoder through a communication channel, frame by frame, in real time. A system made of an encoder and decoder together is called a CODEC.
Most communication channels cannot guarantee that every information packet sent by the encoder reaches the decoder in real time without loss, or without being delayed to the point where it becomes unusable. Generally, the packet loss rate varies with the channel quality. To compensate for the loss of sound quality caused by packet loss, some audio decoders implement a Frame Erasure Concealment (FEC) algorithm, also known as a Packet Loss Concealment (PLC) algorithm. Different types of decoders usually employ different FEC algorithms.
G.729.1 is a scalable codec having multiple layers working at different bit rates. The lowest core layers of 8 kbps and 12 kbps implement a Code-Excited Linear Prediction (CELP) algorithm. These two core layers encode and decode a narrowband signal from 0 to 4 kHz. At the bit rate of 14 kbps, a Band-Width Extension (BWE) algorithm called Time Domain Band-Width Extension (TDBWE) encodes/decodes a high band from 4 kHz to 7 kHz by using an extra 2 kbps added to the 12 kbps bit rate to enhance audio quality. BWE usually includes frequency and time envelope coding and fine spectral structure generation. Since frequency and time envelope coding may take most of the bit budget, the fine spectral structure is often generated by spending very little or no bit budget. The time-domain signal corresponding to the fine spectral structure is called the excitation. The frequency domain can be defined as a Modified Discrete Cosine Transform (MDCT) domain, a Fast Fourier Transform (FFT) domain, or another domain. The TDBWE algorithm in G.729.1 is a BWE that generates an excitation signal in the time domain and applies temporal shaping to the excitation signal. The time domain excitation signal is then transformed into the frequency domain with an FFT, and the spectral envelope is applied in the FFT domain.
In the ITU G.729.1 standard, which is incorporated herein by reference, at a 16 kbps layer or greater layers, the high frequency band from 4 kHz to 7 kHz is encoded/decoded with an MDCT algorithm when no information (bitstream packets) is lost in the channel. When packet loss occurs, however, the FEC algorithm is based on a TDBWE algorithm.
ITU-T Rec. G.729.1 is also called G.729EV, which is an 8-32 kbit/s scalable wideband (50-7000 Hz) extension of ITU-T Rec. G.729. By default, the encoder input and decoder output are sampled at 16 kHz. The bitstream produced by the encoder is scalable and has 12 embedded layers, which will be referred to as Layers 1 to 12. Layer 1 is the core layer corresponding to a bit rate of 8 kbit/s. This layer is compliant with a G.729 bitstream, which makes G.729EV interoperable with G.729. Layer 2 is a narrowband enhancement layer adding 4 kbit/s, while Layers 3 to 12 are wideband enhancement layers adding 20 kbit/s with steps of 2 kbit/s.
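The layer structure described above can be summarized in a short Python sketch. The function name `layer_bitrate_kbps` is introduced here for illustration only; the bit rates follow directly from the 8 kbit/s core, the 4 kbit/s narrowband enhancement, and the 2 kbit/s steps stated above.

```python
def layer_bitrate_kbps(layer: int) -> int:
    """Cumulative bit rate (kbit/s) when Layers 1..layer are received."""
    if not 1 <= layer <= 12:
        raise ValueError("G.729EV defines Layers 1 to 12")
    if layer == 1:
        return 8                   # core layer, G.729-compatible
    if layer == 2:
        return 12                  # narrowband enhancement: +4 kbit/s
    return 14 + 2 * (layer - 3)    # wideband layers: steps of 2 kbit/s


rates = [layer_bitrate_kbps(k) for k in range(1, 13)]
print(rates)  # [8, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]
```

Layers 3 to 12 together add 32 − 12 = 20 kbit/s, which matches the total stated for the wideband enhancement layers.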
A G.729EV coder operates with a digital signal sampled at 16 kHz in a 16-bit linear pulse code modulated (PCM) format as an encoder input. However, an 8 kHz input sampling frequency is also supported. Similarly, the format of the decoder output is 16-bit linear PCM with a sampling frequency of 8 or 16 kHz. Other input/output characteristics are converted to 16-bit linear PCM with 8 or 16 kHz sampling before encoding, or from 16-bit linear PCM to the appropriate format after decoding.
The G.729EV coder is built upon a three-stage structure using embedded CELP coding, TDBWE, and predictive transform coding that will be referred to as Time-Domain Aliasing Cancellation (TDAC). A TDAC algorithm can be viewed as a specific type of MDCT algorithm. The embedded CELP stage generates Layers 1 and 2, which yield a narrowband synthesis (50-4000 Hz) at 8 kbit/s and 12 kbit/s. The TDBWE stage generates Layer 3 and allows the production of a wideband output (50-7000 Hz) at 14 kbit/s. The TDAC stage operates in the MDCT domain and generates Layers 4 to 12 to improve quality from 16 to 32 kbit/s. The TDAC module jointly encodes the weighted CELP coding error signal in the 50-4000 Hz band and the input signal in the 4000-7000 Hz band for Layers 4 to 12. The FEC algorithm for Layers 4 to 12, however, is still based on the TDBWE algorithm.
The G.729EV coder operates using 20 ms frames. However, the embedded CELP coding stage operates on 10 ms frames, like G.729. As a result two 10 ms CELP frames are processed per 20 ms frame. To be consistent with the text of ITU-T Rec. G.729, which is incorporated herein by reference, the 20 ms frames used by G.729EV will be referred to as superframes, whereas the 10 ms frames and the 5 ms subframes involved in the CELP processing will be respectively called frames and subframes.
As illustrated in FIG. 1, the TDBWE (Layer 3) encoder extracts a fairly coarse parametric description from the pre-processed and downsampled higher-band signal 101, sHB(n). This parametric description includes time envelope 102 and frequency envelope 103 parameters. The 20 ms input speech superframe 101, sHB(n), is subdivided into 16 segments of length 1.25 ms each, i.e., each segment has 10 samples. The 16 time envelope parameters 102, Tenv(i), i=0, . . . , 15, are computed as logarithmic subframe energies:
$$T_{env}(i) = \frac{1}{2}\log_2\!\left(\frac{1}{10}\sum_{n=0}^{9} s_{HB}^{2}(n + i\cdot 10)\right), \quad i = 0, \ldots, 15. \tag{1}$$
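As a minimal sketch of Equation (1), the following Python function computes the 16 time envelope parameters for one 160-sample superframe of the higher-band signal. The function name and the `floor` argument (which guards against taking the logarithm of zero) are assumptions made for this illustration and are not part of the description above.

```python
import math

SEGMENTS = 16   # segments per 20 ms superframe
SEG_LEN = 10    # samples per 1.25 ms segment


def time_envelope(s_hb, floor=1e-10):
    """Return T_env(i), i = 0..15, following Equation (1).

    `floor` is an assumed safeguard against log2(0) for silent segments.
    """
    if len(s_hb) != SEGMENTS * SEG_LEN:
        raise ValueError("expected one superframe of 160 samples")
    t_env = []
    for i in range(SEGMENTS):
        segment = s_hb[i * SEG_LEN:(i + 1) * SEG_LEN]
        mean_energy = sum(x * x for x in segment) / SEG_LEN
        t_env.append(0.5 * math.log2(max(mean_energy, floor)))
    return t_env


# A constant amplitude of 4 has mean energy 16, so T_env = 0.5 * log2(16) = 2.
print(time_envelope([4.0] * 160)[:2])  # [2.0, 2.0]
```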
TDBWE parameters Tenv(i), i=0, . . . , 15, are quantized by mean-removed split vector quantization. First, mean time envelope 104 is calculated:
$$M_T = \frac{1}{16}\sum_{i=0}^{15} T_{env}(i). \tag{2}$$

The mean value 104, MT, is then scalar quantized with 5 bits using uniform 3 dB steps in log domain. This quantization produces the quantized value 105, M̂T. The quantized mean is then subtracted:

$$T_{env}^{M}(i) = T_{env}(i) - \hat{M}_T, \quad i = 0, \ldots, 15. \tag{3}$$

The mean-removed time envelope parameter set is then split into two vectors of dimension 8:

$$T_{env,1} = \left(T_{env}^{M}(0), T_{env}^{M}(1), \ldots, T_{env}^{M}(7)\right) \quad \text{and} \quad T_{env,2} = \left(T_{env}^{M}(8), T_{env}^{M}(9), \ldots, T_{env}^{M}(15)\right). \tag{4}$$
Finally, vector quantization using pre-trained quantization tables is applied. Note that the vectors Tenv,1 and Tenv,2 share the same vector quantization codebooks to reduce storage requirements. The codebooks (or quantization tables) for Tenv,1/Tenv,2 are generated by modifying generalized Lloyd-Max centroids such that a minimal distance between two centroids is verified. The codebook modification procedure includes rounding Lloyd-Max centroids on a rectangular grid with a step size of 6 dB in log domain.
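The mean-removed split vector quantization of Equations (2) to (4) can be sketched in Python as follows. This is an illustration under stated assumptions: the bottom of the scalar quantizer range (`lowest`) and the toy codebook are hypothetical, since the actual tables are not given here. Because Tenv(i) is half the base-2 logarithm of an energy, a step of 0.5 in that domain corresponds to an amplitude ratio of √2, i.e., about 3 dB.

```python
STEP_3DB = 0.5  # 0.5 in the (1/2)*log2(energy) domain is about 3 dB


def quantize_mean(t_env, n_bits=5, lowest=0.0):
    """Uniform scalar quantization of the mean M_T (Equation (2)).

    `lowest` is a hypothetical lower edge of the quantizer range.
    Python's round() resolves exact ties to the even index.
    """
    m_t = sum(t_env) / len(t_env)
    index = round((m_t - lowest) / STEP_3DB)
    index = min(max(index, 0), 2 ** n_bits - 1)   # clamp to 5-bit range
    return index, lowest + index * STEP_3DB


def split_mean_removed(t_env, m_hat):
    """Equations (3) and (4): subtract the quantized mean, then split."""
    residual = [t - m_hat for t in t_env]
    return residual[:8], residual[8:]


def nearest_codevector(vec, codebook):
    """Index of the codebook entry with minimum squared error."""
    def sq_err(k):
        return sum((a - b) ** 2 for a, b in zip(vec, codebook[k]))
    return min(range(len(codebook)), key=sq_err)


t_env = [3.2] * 16
index, m_hat = quantize_mean(t_env)
v1, v2 = split_mean_removed(t_env, m_hat)
toy_codebook = [[0.0] * 8, [1.0] * 8]   # hypothetical, shared by both halves
print(index, m_hat, nearest_codevector(v1, toy_codebook))
```

Using a single codebook for both halves mirrors the storage saving mentioned above.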
For the computation of the 12 frequency envelope parameters 103, Fenv(j), j=0, . . . , 11, the signal 101, sHB(n), is windowed by a slightly asymmetric analysis window wF(n). The maximum of the window wF(n) is centered on the second 10 ms frame of the current superframe. The window wF(n) is constructed such that the frequency envelope computation has a lookahead of 16 samples (2 ms) and a lookback of 32 samples (4 ms). The windowed signal sHBw(n) is transformed by FFT. Finally, the frequency envelope parameter set is calculated as logarithmic weighted sub-band energies for 12 evenly spaced, equally wide, overlapping sub-bands in the FFT domain. The j-th sub-band starts at the FFT bin of index 2j and spans a bandwidth of 3 FFT bins.
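The sub-band layout described above (sub-band j covers FFT bins 2j, 2j+1 and 2j+2, so that neighboring sub-bands share one bin) can be illustrated with the following Python sketch. The weights `(0.5, 1.0, 0.5)`, the `floor` value, and the use of half the base-2 logarithm (chosen to match the time envelope definition) are assumptions for this example; the text above only states that the energies are weighted and logarithmic.

```python
import math

NUM_SUBBANDS = 12
BINS_PER_SUBBAND = 3


def frequency_envelope(power_spectrum, weights=(0.5, 1.0, 0.5), floor=1e-10):
    """Logarithmic weighted sub-band energies F_env(j), j = 0..11.

    power_spectrum[k] is |FFT bin k|^2 of the windowed signal.
    `weights` and `floor` are assumed values, not taken from the text.
    """
    needed = 2 * (NUM_SUBBANDS - 1) + BINS_PER_SUBBAND   # 25 bins
    if len(power_spectrum) < needed:
        raise ValueError("need at least 25 FFT bins")
    f_env = []
    for j in range(NUM_SUBBANDS):
        start = 2 * j                                     # first bin of sub-band j
        band = power_spectrum[start:start + BINS_PER_SUBBAND]
        energy = sum(w * p for w, p in zip(weights, band))
        f_env.append(0.5 * math.log2(max(energy, floor)))
    return f_env


# Flat power of 2 per bin: weighted energy 0.5*2 + 1*2 + 0.5*2 = 4.
print(frequency_envelope([2.0] * 33)[:3])  # [1.0, 1.0, 1.0]
```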
FIG. 2 illustrates the concept of the TDBWE decoder module. The received TDBWE parameters are used to shape an artificially generated excitation signal 202, ŝHBexc(n), according to the desired time and frequency envelopes 208, T̂env(i), and 209, F̂env(j). This shaping is followed by a time-domain post-processing procedure.
The quantized parameter set includes the value M̂T and the following vectors: T̂env,1, T̂env,2, F̂env,1, F̂env,2 and F̂env,3. The split vectors are defined by Equation (4). The quantized mean time envelope M̂T is used to reconstruct the time envelope and the frequency envelope parameters from the individual vector components, i.e.:

$$\hat{T}_{env}(i) = \hat{T}_{env}^{M}(i) + \hat{M}_T, \quad i = 0, \ldots, 15 \tag{5}$$

and

$$\hat{F}_{env}(j) = \hat{F}_{env}^{M}(j) + \hat{M}_T, \quad j = 0, \ldots, 11. \tag{6}$$
TDBWE excitation signal 201, exc(n), is generated on a 5 ms subframe basis from parameters that are transmitted in Layers 1 and 2 of the bitstream. Specifically, the following parameters are used: the integer pitch lag T0=int(T1) or int(T2) depending on the subframe, the fractional pitch lag frac, the energy of the fixed codebook contributions

$$E_c = \sum_{n=0}^{39}\left(\hat{g}_c\cdot c(n) + \hat{g}_{enh}\cdot c'(n)\right)^2,$$

and the energy of the adaptive codebook contribution

$$E_p = \sum_{n=0}^{39}\left(\hat{g}_p\cdot v(n)\right)^2.$$

The parameters of the excitation generation are computed for every 5 ms subframe. The excitation signal generation includes the following steps:
estimation of two gains gv and guv for the voiced and unvoiced contributions to the final excitation signal 201, exc(n);
pitch lag post-processing;
generation of the voiced contribution;
generation of the unvoiced contribution; and
low-pass filtering.
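The two energies defined above can be computed per 40-sample (5 ms) subframe as in the following Python sketch. The function and argument names are chosen for this illustration; `c`, `c_enh` and `v` stand for the two fixed codebook vectors c(n), c′(n) and the adaptive codebook vector v(n), and the gains are the corresponding decoded gains.

```python
SUBFRAME_LEN = 40  # 5 ms subframe, n = 0..39


def codebook_energies(c, c_enh, v, g_c, g_enh, g_p):
    """Return (E_c, E_p) for one subframe.

    E_c: energy of the summed, gain-scaled fixed codebook contributions.
    E_p: energy of the gain-scaled adaptive codebook contribution.
    """
    if not (len(c) == len(c_enh) == len(v) == SUBFRAME_LEN):
        raise ValueError("each vector must hold 40 samples")
    e_c = sum((g_c * a + g_enh * b) ** 2 for a, b in zip(c, c_enh))
    e_p = sum((g_p * x) ** 2 for x in v)
    return e_c, e_p


e_c, e_p = codebook_energies(
    c=[1.0] * 40, c_enh=[0.0] * 40, v=[2.0] * 40,
    g_c=0.5, g_enh=1.0, g_p=0.5,
)
print(e_c, e_p)  # 10.0 40.0
```

The relation between these energies and the gains gv and guv is not specified in this section, so the sketch stops at the energies themselves.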
The shaping of the time envelope of the excitation signal 202, sHBexc(n), utilizes decoded time envelope parameters 208, T̂env(i), with i=0, . . . , 15, to obtain a signal 203, ŝHBT(n), with a time envelope that is nearly identical to the time envelope of the encoder side higher-band signal 101, sHB(n). This is achieved by scalar multiplication:

$$\hat{s}_{HB}^{T}(n) = g_T(n)\cdot s_{HB}^{exc}(n), \quad n = 0, \ldots, 159. \tag{7}$$
In order to determine the gain function gT(n), the excitation signal 202, sHBexc(n), is segmented and analyzed in the same manner as the parameter extraction in the encoder. The obtained analysis results are, again, time envelope parameters T̃env(i) with i=0, . . . , 15. They describe the observed time envelope of sHBexc(n). Then a preliminary gain factor is calculated:
$$g'_T(i) = 2^{\hat{T}_{env}(i) - \tilde{T}_{env}(i)}, \quad i = 0, \ldots, 15. \tag{8}$$
For each signal segment with index i=0, . . . , 15, these gain factors are interpolated using a “flat-top” Hanning window wt( ). This interpolation procedure finally yields the gain function:
$$g_T(n + i\cdot 10) = \begin{cases} w_t(n)\cdot g'_T(i) + w_t(n+10)\cdot g'_T(i-1), & n = 0, \ldots, 4 \\ w_t(n)\cdot g'_T(i), & n = 5, \ldots, 9, \end{cases} \tag{9}$$

where g′T(−1) is defined as the memorized gain factor g′T(15) from the last 1.25 ms segment of the preceding superframe.
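Equations (7) to (9) can be sketched in Python as follows. The exact shape of the flat-top window is not given above, so `flat_top_window` is an assumed 15-point window whose rising and falling edges sum to one where they overlap; the function names are likewise illustrative.

```python
import math

SEG_LEN, SEGMENTS = 10, 16


def flat_top_window(n):
    """Assumed flat-top Hanning window w_t(n) for n = 0..14."""
    if 0 <= n <= 4:
        return 0.5 * (1.0 - math.cos((n + 1) * math.pi / 6.0))   # rising edge
    if 5 <= n <= 9:
        return 1.0                                                # flat top
    if 10 <= n <= 14:
        return 0.5 * (1.0 - math.cos((n + 9) * math.pi / 6.0))   # falling edge
    raise ValueError("window is defined for n = 0..14")


def gain_function(t_hat, t_obs, prev_gain):
    """Return (g_T(0..159), memory) from decoded and observed envelopes.

    prev_gain is g'_T(15) memorized from the preceding superframe.
    """
    g_prime = [2.0 ** (a - b) for a, b in zip(t_hat, t_obs)]      # Equation (8)
    g = []
    for i in range(SEGMENTS):
        g_before = g_prime[i - 1] if i > 0 else prev_gain
        for n in range(SEG_LEN):
            value = flat_top_window(n) * g_prime[i]
            if n < 5:                                             # overlap region
                value += flat_top_window(n + 10) * g_before
            g.append(value)                                       # Equation (9)
    return g, g_prime[-1]


def shape_time_envelope(exc, g):
    """Equation (7): sample-wise multiplication by the gain function."""
    return [gn * x for gn, x in zip(g, exc)]


g, memory = gain_function([3.0] * 16, [2.0] * 16, prev_gain=2.0)
shaped = shape_time_envelope([1.0] * 160, g)
```

With a constant preliminary gain of 2 in every segment, the interpolated gain stays at 2 throughout, which is the behavior expected from complementary window edges.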
Signal 204, ŝHBF(n), is obtained by shaping the excitation signal sHBexc(n) (generated from parameters estimated in the lower band by the CELP decoder) according to the desired time and frequency envelopes. Generally, there is no coupling between this excitation and the related envelope shapes T̂env(i) and F̂env(j). As a result, some clicks may be present in the signal ŝHBF(n). To attenuate these artifacts, an adaptive amplitude compression is applied to ŝHBF(n). Each sample of ŝHBF(n) in the i-th 1.25 ms segment is compared to the decoded time envelope T̂env(i), and the amplitude of ŝHBF(n) is compressed in order to attenuate large deviations from this envelope. The TDBWE synthesis 205, ŝHBbwe(n), is transformed to ŜHBbwe(k) by MDCT. This spectrum is used by the TDAC decoder to extrapolate missing sub-bands.
In case of packet loss, the G.729.1 decoder employs the TDBWE algorithm to compensate for the HB part by estimating the current spectral envelope and the temporal envelope using information from the previous frame. The excitation signal is still constructed by extracting information from the low band (Narrowband) CELP parameters. As can be seen from the above description, such an FEC process is quite complicated.
As mentioned above, G.729.1 employs a TDAC/MDCT based codec algorithm to encode and decode the high band part for bit rates higher than 14 kbps. The TDAC encoder illustrated in FIG. 3 jointly represents two split MDCT spectra 301, DLBw(k), and 302, SHB(k), by gain-shape vector quantization. Joint spectrum 303, Y(k), is divided into sub-bands, where the sub-band gains define the spectral envelope. The sub-bands are represented in the log domain by 304, log_rms(j). After quantization, the spectral envelope is represented by the index 305, rms_index(j). The spectral envelope information is also used to allocate a proper number of bits 306, nbit(j), for each sub-band to code the MDCT coefficients. The shape of each sub-band's coefficients is encoded by embedded spherical vector quantization using trained permutation codes.
Lower-band CELP weighted error signal dLBw(n) and higher-band signal sHB(n) are transformed into the frequency domain by MDCT with a superframe length of 20 ms and a window length of 40 ms. DLBw(k) represents the MDCT coefficients of the windowed signal dLBw(n) with 40 ms sinusoidal windowing. MDCT coefficients, Y(k), in the 0-7000 Hz band are split into 18 sub-bands. The j-th sub-band comprises nb_coef(j) coefficients Y(k) with sb_bound(j)≤k<sb_bound(j+1). Each of the first 17 sub-bands includes 16 coefficients (400 Hz bandwidth), and the last sub-band includes 8 coefficients (200 Hz bandwidth). The spectral envelope is defined as the root mean square (rms) in log domain of the 18 sub-bands, which is then quantized in the encoder.
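The sub-band partition described above can be written out in a few lines of Python. The names `NB_COEF` and `SB_BOUND` mirror nb_coef(j) and sb_bound(j); the `floor` argument and the use of the base-2 logarithm are assumptions for this sketch.

```python
import math

NB_COEF = [16] * 17 + [8]          # coefficients per sub-band, j = 0..17
SB_BOUND = [0]
for width in NB_COEF:
    SB_BOUND.append(SB_BOUND[-1] + width)   # sb_bound(j) <= k < sb_bound(j+1)


def log_rms(y, floor=1e-10):
    """Log-domain rms of each of the 18 sub-bands of the MDCT spectrum Y(k).

    `floor` is an assumed safeguard against log2(0).
    """
    if len(y) < SB_BOUND[-1]:
        raise ValueError("expected 280 MDCT coefficients")
    envelope = []
    for j in range(len(NB_COEF)):
        band = y[SB_BOUND[j]:SB_BOUND[j + 1]]
        mean_square = sum(coef * coef for coef in band) / NB_COEF[j]
        envelope.append(0.5 * math.log2(max(mean_square, floor)))
    return envelope


print(SB_BOUND[10], SB_BOUND[-1])  # 160 280
```

The 280 coefficients cover 0-7000 Hz at 25 Hz per coefficient, so index 160 falls on the 4000 Hz boundary between the lower and higher bands.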
The perceptual importance 307, ip(j),j=0 . . . 17, of each sub-band is defined as:
$$ip(j) = \frac{1}{2}\log_2\!\left(\mathrm{rms\_q}(j)^2 \times \mathrm{nb\_coef}(j)\right) + \mathrm{offset}, \tag{10}$$

where $\mathrm{rms\_q}(j) = 2^{\frac{1}{2}\mathrm{rms\_index}(j)}$ is the quantized rms and $\mathrm{rms\_q}(j)^2 \times \mathrm{nb\_coef}(j)$ corresponds to the quantized sub-band energy. Consequently, the perceptual importance is equivalent to the sub-band log-energy. This information is related to the quantized spectral envelope as follows:
$$ip(j) = \frac{1}{2}\left[\mathrm{rms\_index}(j) + \log_2\!\left(\mathrm{nb\_coef}(j)\right)\right] + \mathrm{offset}. \tag{11}$$
The offset value is introduced to simplify further the expression of ip(j). The sub-bands are then sorted by decreasing perceptual importance. This perceptual importance ordering is used for bit allocation and multiplexing of vector quantization indices.
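The following Python sketch evaluates the perceptual importance in both forms, Equation (10) and Equation (11), and derives the sub-band ordering. The default `offset=0.0` is a placeholder because no offset value is stated here, and the tie-breaking rule (lower sub-band index first) is an assumption.

```python
import math


def importance_from_index(rms_index, nb_coef, offset=0.0):
    """Equation (11): ip(j) from the quantized envelope indices."""
    return [0.5 * (idx + math.log2(n)) + offset
            for idx, n in zip(rms_index, nb_coef)]


def importance_from_rms(rms_index, nb_coef, offset=0.0):
    """Equation (10): ip(j) from the quantized rms, rms_q = 2**(index/2)."""
    result = []
    for idx, n in zip(rms_index, nb_coef):
        rms_q = 2.0 ** (0.5 * idx)
        result.append(0.5 * math.log2(rms_q ** 2 * n) + offset)
    return result


def importance_order(ip):
    """Sub-band indices sorted by decreasing perceptual importance.

    sorted() is stable, so ties keep the lower index first (assumed rule).
    """
    return sorted(range(len(ip)), key=lambda j: -ip[j])


ip = importance_from_index([4, 10, 4], [16, 16, 8])
print(ip, importance_order(ip))  # [4.0, 7.0, 3.5] [1, 0, 2]
```

Because the offset is added to every sub-band, it does not change the ordering.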
Each sub-band j=0, . . . , 17 of dimension nb_coef(j) is encoded with nbit(j) bits by spherical vector quantization. This operation is divided into two steps: search for a best code vector and indexing of the selected code vector.
The bits associated with the HB spectral envelope coding are multiplexed before the bits associated with the lower-band spectral envelope coding. Furthermore, sub-band quantization indices are multiplexed by order of decreasing perceptual importance. The sub-bands that are perceptually more important (i.e., with the largest perceptual importance ip(j)) are written first in the bitstream. As a result, if just part of the coded spectral envelope is received at the decoder, the higher-band envelope can be decoded before that of the lower band. This property is used at the TDAC decoder to perform a partial level-adjustment of the higher-band MDCT spectrum.
The TDAC decoder pertaining to Layers 4 to 12 is depicted in FIG. 4. A received normalization factor (called norm_MDCT), transmitted by the encoder with 4 bits, is used in the TDAC decoder to normalize MDCT coefficients 401, Ŷnorm(k). The factor is used to scale the signal reconstructed by two inverse MDCTs. The higher-band spectral envelope 407, rms_q(j), is decoded first, then index rms_index(j), j=11, . . . , 17, is reconstructed. If the number of bits is insufficient to decode the higher-band spectral envelope completely, the decoded indices rms_index(j) are kept to allow partial level-adjustment of the decoded HB spectrum. The bits related to the lower band, i.e., rms_index(j), j=0, . . . , 9, are decoded in a similar way as in the higher band. The decoded indices are combined into a single vector [rms_index(0) rms_index(1) . . . rms_index(17)], which represents the reconstructed spectral envelope in log domain. The vector quantization indices are read from the TDAC bitstream according to their perceptual importance ip(j).
In sub-band j of dimension nb_coef(j) and non-zero bit allocation nbit(j), the vector quantization index identifies a code vector which constructs the sub-band j of Ŷnorm(k). The missing sub-bands are filled by the generated coefficients 408 from the transform of the TDBWE signal. After the missing sub-bands are filled, the complete set of MDCT coefficients is denoted 402, Ŷext(k), which is subject to level adjustment using the spectral envelope information. Level-adjusted coefficients 403, Ŷ(k), are the input to the post-processing module. The post-processing of MDCT coefficients is only applied to the higher band, because the lower band is post-processed with a traditional time-domain approach. For the high band, there are no Linear Prediction Coding (LPC) coefficients transmitted to the decoder. The TDAC post-processing is performed on the available MDCT coefficients at the decoder side. Reconstructed spectrum 404, Ŷpost(k), is split into a lower-band spectrum 406, D̂LBw(k), and a higher-band spectrum 405, ŜHB(k). Both bands are transformed to the time domain using inverse MDCT transforms.
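The filling of missing sub-bands can be illustrated with the following simplified Python sketch. It assumes that a sub-band counts as missing when its bit allocation nbit(j) is zero and that the TDBWE-derived coefficients are supplied on the same index grid as Ŷnorm(k); both points, as well as the function name, are assumptions for this example rather than details given above.

```python
def fill_missing_subbands(y_norm, y_bwe, nbit, sb_bound):
    """Return Y_ext(k): decoded sub-bands kept, missing ones taken from TDBWE.

    y_norm   : decoded (normalized) MDCT coefficients
    y_bwe    : MDCT coefficients of the TDBWE synthesis, same indexing (assumed)
    nbit     : bits allocated to each sub-band; 0 is treated as missing (assumed)
    sb_bound : sub-band boundaries, sb_bound[j] <= k < sb_bound[j + 1]
    """
    y_ext = list(y_norm)
    for j, bits in enumerate(nbit):
        if bits == 0:
            lo, hi = sb_bound[j], sb_bound[j + 1]
            y_ext[lo:hi] = y_bwe[lo:hi]
    return y_ext


# Toy layout with three 2-coefficient sub-bands; the middle one has no bits.
y_ext = fill_missing_subbands(
    y_norm=[1.0, 1.0, 0.0, 0.0, 2.0, 2.0],
    y_bwe=[9.0, 9.0, 7.0, 7.0, 9.0, 9.0],
    nbit=[3, 0, 2],
    sb_bound=[0, 2, 4, 6],
)
print(y_ext)  # [1.0, 1.0, 7.0, 7.0, 2.0, 2.0]
```

Level adjustment with the decoded spectral envelope would then be applied to the resulting coefficients, as described above.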
Narrowband (NB) signal encoding is performed mainly by the CELP algorithm, and its concealment strategy is disclosed in the ITU-T G.729.1 standard. Here, the concealment strategy includes replacing the parameters of the erased frame based on the parameters from past frames and the transmitted extra FEC parameters. Erased frames are synthesized while controlling the energy. This concealment strategy depends on the class of the erased superframe, and makes use of other transmitted parameters that include phase information and gain information.