A spectral envelope is described by the energy levels of spectral sub-bands in the frequency domain. In modern audio/speech transform coding, when an audio/speech signal is coded in the frequency domain, the encoding/decoding system often includes spectral envelope coding and spectral fine structure coding. In the case of BandWidth Extension (BWE), High Band Extension (HBE), or Spectral Band Replication (SBR), the spectral fine structure is simply generated with zero bits or a very small number of bits. Temporal envelope coding is optional, and most bits are used to quantize the spectral envelope. Precise envelope coding is the first step toward good quality. However, precise envelope coding can require too many bits for low bit rate coding.
The frequency domain can be defined as the FFT-transformed domain. It can also be the Modified Discrete Cosine Transform (MDCT) domain. A well-known example that includes spectral envelope coding can be found in the standard ITU-T G.729.1. A BWE algorithm named Time Domain Bandwidth Extension (TDBWE) in ITU-T G.729.1 also uses spectral envelope coding.
G.729.1 Encoder
A functional diagram of the encoder part is presented in FIG. 1. The encoder operates on 20 ms input superframes. By default, the input signal 101, sWB(n), is sampled at 16,000 Hz. Therefore, the input superframes are 320 samples long. The input signal sWB(n) is first split into two sub-bands using a QMF filter bank defined by the filters H1(z) and H2(z). The lower-band input signal 102, sLBqmf(n), obtained after decimation is pre-processed by a high-pass filter Hh1(z) with a 50 Hz cut-off frequency. The resulting signal 103, sLB(n), is coded by the 8-12 kbit/s narrowband embedded CELP encoder. To be consistent with ITU-T Rec. G.729, the signal sLB(n) will also be denoted s(n). The difference 104, dLB(n), between s(n) and the local synthesis 105, ŝenh(n), of the CELP encoder at 12 kbit/s is processed by the perceptual weighting filter WLB(z). The parameters of WLB(z) are derived from the quantized LP coefficients of the CELP encoder. Furthermore, the filter WLB(z) includes a gain compensation which guarantees the spectral continuity between the output 106, dLBw(n), of WLB(z) and the higher-band input signal 107, sHB(n). The weighted difference dLBw(n) is then transformed into the frequency domain by MDCT. The higher-band input signal 108, sHBfold(n), obtained after decimation and spectral folding by (−1)^n, is pre-processed by a low-pass filter Hh2(z) with a 3,000 Hz cut-off frequency. The resulting signal sHB(n) is coded by the TDBWE encoder. The signal sHB(n) is also transformed into the frequency domain by MDCT. The two sets of MDCT coefficients, 109, DLBw(k), and 110, SHB(k), are finally coded by the TDAC encoder. In addition, some parameters are transmitted by the frame erasure concealment (FEC) encoder in order to introduce a parameter-level redundancy in the bitstream. This redundancy allows for an improved quality in the presence of erased superframes.
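The superframe length and the spectral folding step described above can be illustrated with a minimal Python sketch. The constant names and the function `spectral_fold` are hypothetical and chosen for illustration only. Multiplying a real sequence by (−1)^n mirrors its spectrum around one quarter of the sampling rate, which is why the decimated higher band ends up in natural frequency order.

```python
# Illustrative constants (names are assumptions, not taken from the Recommendation).
SAMPLE_RATE_HZ = 16000
SUPERFRAME_MS = 20
SUPERFRAME_LEN = SAMPLE_RATE_HZ * SUPERFRAME_MS // 1000  # 320 samples


def spectral_fold(x):
    """Multiply sample n by (-1)**n, i.e. negate every odd-indexed sample."""
    return [-v if n % 2 else v for n, v in enumerate(x)]
```

Applying `spectral_fold` twice returns the original sequence, since (−1)^n · (−1)^n = 1.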
TDBWE Encoder
The TDBWE encoder is illustrated in FIG. 2. The TDBWE encoder extracts a fairly coarse parametric description from the pre-processed and down-sampled higher-band signal 201, sHB(n). This parametric description comprises time envelope 202 and frequency envelope 203 parameters. A summarized description of envelope computations and the parameter quantization scheme will be given later.
The 20 ms input speech superframe sHB(n) (with an 8 kHz sampling frequency) is subdivided into 16 segments of length 1.25 ms each, i.e., with each segment comprising 10 samples. The 16 time envelope parameters 202, Tenv(i), i=0, . . . , 15, are computed as logarithmic subframe energies before the quantization. For the computation of the 12 frequency envelope parameters 203, Fenv(j), j=0, . . . , 11, the signal 201, sHB(n), is windowed by a slightly asymmetric analysis window. The maximum of the window wF(n) is centered on the second 10 ms frame of the current superframe. The window wF(n) is constructed such that the frequency envelope computation has a lookahead of 16 samples (2 ms) and a lookback of 32 samples (4 ms). The windowed signal sHBw(n) is transformed by FFT. Finally, the frequency envelope parameter set is calculated as logarithmic weighted sub-band energies for 12 evenly spaced and equally wide overlapping sub-bands in the FFT domain. The j-th sub-band starts at the FFT bin of index 2j and spans a bandwidth of 3 FFT bins.
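The segmentation into 16 time envelope segments and the layout of the 12 overlapping frequency sub-bands can be sketched as follows. This is a minimal illustration, not the Recommendation's exact computation: the text only states that the parameters are logarithmic energies, so the base-2 logarithm, the factor 1/2, and the small `eps` offset are assumptions, and the window wF(n) and the sub-band weighting are omitted. All function names are hypothetical.

```python
import math

SEGMENTS = 16      # time envelope parameters per 20 ms superframe
SEGMENT_LEN = 10   # samples per 1.25 ms segment at 8 kHz


def time_envelope(s_hb, eps=1e-12):
    """Logarithmic energy of each 10-sample segment.

    ASSUMPTION: 0.5 * log2(mean energy) is used as the log-energy measure;
    the source text does not specify the exact scaling.
    """
    if len(s_hb) != SEGMENTS * SEGMENT_LEN:
        raise ValueError("expected a 160-sample superframe")
    env = []
    for i in range(SEGMENTS):
        seg = s_hb[i * SEGMENT_LEN:(i + 1) * SEGMENT_LEN]
        energy = sum(v * v for v in seg) / SEGMENT_LEN
        env.append(0.5 * math.log2(energy + eps))
    return env


def frequency_subband_bins(j, width=3):
    """FFT bin indices of sub-band j: it starts at bin 2j and spans 3 bins."""
    return list(range(2 * j, 2 * j + width))
```

Because each sub-band starts 2 bins after the previous one but spans 3 bins, adjacent sub-bands share exactly one FFT bin, which is the overlap mentioned in the text.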
TDAC Encoder
The Time Domain Aliasing Cancellation (TDAC) encoder is illustrated in FIG. 3. The TDAC encoder jointly represents the two split MDCT spectra 301, DLBw(k), and 302, SHB(k), by a gain-shape vector quantization. In other words, the joint spectrum 303, Y(k), is constructed by combining the two split MDCT spectra 301, DLBw(k), and 302, SHB(k). The joint spectrum is divided into many sub-bands. The gains in each sub-band define the spectral envelope. The shape of each sub-band is encoded by embedded spherical vector quantization using trained permutation codes. The gain-shape of SHB(k) represents a true spectral envelope in a second band.
The MDCT coefficients of Y(k) in the 0-7,000 Hz band are split into 18 sub-bands. The j-th sub-band comprises nb_coef(j) coefficients of Y(k) with sb_bound(j)≦k<sb_bound(j+1). The first 17 sub-bands comprise 16 coefficients (400 Hz) each, and the last sub-band comprises 8 coefficients (200 Hz). The spectral envelope is defined as the root mean square (rms) 304 in log domain of the 18 sub-bands:
$$\mathrm{log\_rms}(j) = \frac{1}{2}\log_2\left[\frac{1}{\mathrm{nb\_coef}(j)} \sum_{k=\mathrm{sb\_bound}(j)}^{\mathrm{sb\_bound}(j+1)-1} Y(k)^2 + \varepsilon_{rms}\right], \quad j = 0, \ldots, 17 \qquad (1)$$
where εrms = 2^−24. The gain-shape defined by equation (1) in the upper portion of the 18 sub-bands represents the true spectral envelope of SHB(k). Each spectral envelope gain is quantized with 5 bits by uniform scalar quantization, and the resulting quantization indices are coded using a two-mode binary encoder. The 5-bit quantization consists of computing the indices 305, rms_index(j), j=0, . . . , 17, as follows:
$$\mathrm{rms\_index}(j) = \mathrm{round}\left(2 \cdot \mathrm{log\_rms}(j)\right) \qquad (2)$$
with the restriction: −11 ≦ rms_index(j) ≦ +20
That is, the indices are limited to the range from −11 to +20 inclusive (32 possible values). The resulting quantized full-band envelope is then divided into two subvectors:
a lower-band spectral envelope: (rms_index(0), rms_index(1), . . . , rms_index(9)), and
a higher-band spectral envelope: (rms_index(10), rms_index(11), . . . , rms_index(17)).
These two subvectors are coded separately using a two-mode lossless encoder, which switches adaptively between differential Huffman coding (mode 0) and direct natural binary coding (mode 1). Differential Huffman coding is used to minimize the average number of bits, whereas direct natural binary coding is used to limit the worst-case number of bits and to correctly encode the envelope of signals that would be saturated by differential Huffman coding (e.g., sinusoids). One bit is used to indicate the selected mode to the spectral envelope decoder.
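The sub-band layout, the log-domain rms of equation (1), the clamped 5-bit index, and the split into two subvectors can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the reference implementation: all function names are hypothetical, and the pre-rounding scale factor is exposed as the hypothetical parameter `scale`. Its default of 2 is an assumption, chosen so that the linear conversion rms_q(j) = 2^(rms_index(j)/2) used at the decoder inverts the quantizer. The Huffman tables are not given in the text, so the lossless coding stage is not shown.

```python
import math

# 17 sub-bands of 16 MDCT coefficients followed by one sub-band of 8.
NB_COEF = [16] * 17 + [8]
SB_BOUND = [0]
for n in NB_COEF:
    SB_BOUND.append(SB_BOUND[-1] + n)   # last boundary is 280 (7,000 Hz)

EPS_RMS = 2.0 ** -24


def log_rms(y, j):
    """Equation (1): half the base-2 log of the mean energy in sub-band j."""
    band = y[SB_BOUND[j]:SB_BOUND[j + 1]]
    mean_energy = sum(v * v for v in band) / NB_COEF[j]
    return 0.5 * math.log2(mean_energy + EPS_RMS)


def rms_index(value, scale=2.0):
    """Round to the nearest integer and clamp to [-11, +20] (32 levels).

    ASSUMPTION: scale=2 is used so that 2**(index / 2) recovers the rms.
    floor(x + 0.5) is used because Python's round() rounds halves to even.
    """
    idx = math.floor(scale * value + 0.5)
    return max(-11, min(20, idx))


def split_envelope(indices):
    """Split 18 indices into lower-band (0..9) and higher-band (10..17) parts."""
    return indices[:10], indices[10:]
```

For a constant spectrum Y(k) = 4, equation (1) gives log_rms(j) ≈ 2, and the index with the assumed scale is 4. In mode 1, the worst-case cost is 10×5 bits for the lower band and 8×5 bits for the higher band, plus one mode bit each.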
TDBWE Decoder
FIG. 4 illustrates the concept of the TDBWE decoder module. The TDBWE receives parameters, which are computed by the parameter extraction procedure, and are used to shape an artificially generated excitation signal 402, ŝHBexc(n), according to desired time and frequency envelopes 408, T̂env(i), and 409, F̂env(j). This is followed by a time-domain post-processing procedure. The quantized parameter set consists of the value M̂T and the following vectors: T̂env,1, T̂env,2, F̂env,1, F̂env,2, and F̂env,3. The quantized mean time envelope M̂T is used to reconstruct the time envelope and the frequency envelope parameters from the individual vector components, i.e.:
$$\hat{T}_{env}(i) = \hat{T}_{env}^{M}(i) + \hat{M}_T, \quad i = 0, \ldots, 15 \qquad (3)$$
and
$$\hat{F}_{env}(j) = \hat{F}_{env}^{M}(j) + \hat{M}_T, \quad j = 0, \ldots, 11 \qquad (4)$$
The decoded frequency envelope parameters F̂env(j) with j=0, . . . , 11 are representative for the second 10 ms frame within the 20 ms superframe. The first 10 ms frame is covered by parameter interpolation between the current parameter set and the parameter set F̂env,old(j) from the preceding superframe:
$$\hat{F}_{env,int}(j) = \frac{1}{2}\left(\hat{F}_{env,old}(j) + \hat{F}_{env}(j)\right), \quad j = 0, \ldots, 11 \qquad (5)$$
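Equations (3) to (5) amount to adding the quantized mean back to each mean-removed component and averaging the previous and current frequency envelopes. A minimal Python sketch follows; the function names are hypothetical, and the inputs are assumed to be already dequantized and concatenated into full-length lists.

```python
def add_mean(mean_removed, m_t):
    """Equations (3) and (4): add the quantized mean time envelope M_T
    back to each mean-removed envelope component."""
    return [v + m_t for v in mean_removed]


def interpolate_frequency_envelope(f_env_old, f_env):
    """Equation (5): element-wise average of the previous and current
    frequency envelopes, used for the first 10 ms frame."""
    if len(f_env_old) != len(f_env):
        raise ValueError("envelopes must have the same length")
    return [0.5 * (a + b) for a, b in zip(f_env_old, f_env)]
```

The interpolated set is used for the first 10 ms frame, while the decoded set itself applies to the second 10 ms frame of the superframe.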
The superframe of 403, ŝHBT(n), is analyzed twice per superframe. A filter-bank equalizer is designed such that its individual channels match the sub-band division to realize the frequency envelope shaping with proper gain for each channel. The respective frequency responses for the filter-bank design are depicted in FIG. 5.
TDAC Decoder
The TDAC decoder (depicted in FIG. 6) is simply the inverse operation of the TDAC encoder. The higher-band spectral envelope is decoded first. The bit indicating the selected coding mode at the encoder may be: 0→differential Huffman coding, 1→natural binary coding. If mode 0 is selected, 5 bits are decoded to obtain an index rms_index(10) in [−11, +20]. Then, the Huffman codes associated with the differential indices diff_index(j), j=11, . . . , 17, are decoded. The index 601, rms_index(j), j=11, . . . , 17, is reconstructed as follows:
$$\mathrm{rms\_index}(j) = \mathrm{rms\_index}(j-1) + \mathrm{diff\_index}(j) \qquad (6)$$
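The recursion of equation (6) is a running sum that starts from the directly coded first index. A minimal Python sketch follows, assuming the differential indices have already been Huffman-decoded (the Huffman tables are not given in the text); the function name is hypothetical.

```python
def decode_differential(first_index, diff_indices):
    """Equation (6): rebuild rms_index(j) from rms_index(j-1) + diff_index(j).

    first_index  -- the directly decoded 5-bit index, e.g. rms_index(10)
    diff_indices -- the decoded differences, e.g. diff_index(11..17)
    """
    indices = [first_index]
    for d in diff_indices:
        indices.append(indices[-1] + d)
    return indices
```

For the higher band, one first index and seven differences yield the eight indices rms_index(10), . . . , rms_index(17).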
If mode 1 is selected, rms_index(j), j=10, . . . , 17, is obtained in [−11, +20] by decoding 8×5 bits. If the number of bits is not sufficient to decode the higher-band spectral envelope completely, the decoded indices rms_index(j) are kept to allow partial level-adjustment of the decoded higher-band spectrum. The bits related to the lower band, i.e., rms_index(j), j=0, . . . , 9, are decoded in a similar way as in the higher band, including one bit to select mode 0 or 1. The decoded indices are combined into a single vector [rms_index(0) rms_index(1) . . . rms_index(17)], which represents the reconstructed spectral envelope in log domain. The envelope 602 is converted into the linear domain as follows:
$$\mathrm{rms\_q}(j) = 2^{\frac{1}{2}\,\mathrm{rms\_index}(j)} \qquad (7)$$
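The combination of the two decoded subvectors and the conversion of equation (7) can be sketched as follows. The function names are hypothetical, and the sketch assumes that both subvectors were decoded completely.

```python
def rms_q(index):
    """Equation (7): convert a log-domain index to a linear rms gain."""
    return 2.0 ** (0.5 * index)


def envelope_to_linear(lower, higher):
    """Concatenate the lower-band (10) and higher-band (8) indices and
    convert all 18 of them to linear gains."""
    indices = list(lower) + list(higher)
    if len(indices) != 18:
        raise ValueError("expected 10 lower-band and 8 higher-band indices")
    return [rms_q(i) for i in indices]
```

With the index range [−11, +20], the linear gains span 2^−5.5 to 2^10 = 1024, and each index step corresponds to a factor of √2 (about 3 dB) in rms.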