In modern audio and speech signal compression, frequency-domain coding is widely used in ITU-T, MPEG, and 3GPP standards. At very low bit rates, BandWidth Extension (BWE) is likely to be used. Whichever spectral coding approach is chosen, spectral envelope coding is often needed. BWE is sometimes also called High Band Extension (HBE) or Spectral Band Replication (SBR). Although the names differ, they share the same meaning: encoding and decoding some frequency sub-bands (usually the high bands) with a small bit budget, or at a significantly lower bit rate than a normal encoding/decoding approach. BWE typically spends its bits on perceptually critical information and generates the remaining information with a very limited bit budget or with no bits at all; it usually comprises frequency envelope coding, temporal envelope coding (optional), and spectral fine structure generation. A precise description of the spectral fine structure requires many bits, which is unrealistic for any BWE algorithm. A realistic approach is to generate the spectral fine structure artificially and to spend only a limited bit budget on coding the fine spectral envelope. Spectral envelope coding is therefore the most important first step toward a successful BWE algorithm; it is also important to any other spectral coding algorithm.
The frequency domain can be defined as the FFT-transformed domain; it can also be the MDCT (Modified Discrete Cosine Transform) domain. One prior-art BWE algorithm can be found in the ITU-T G.729.1 standard, where it is named TDBWE (Time Domain Bandwidth Extension).
General Description of ITU-T G.729.1
ITU-T G.729.1, also called the G.729EV coder, is an 8-32 kbit/s scalable wideband (50-7000 Hz) extension of ITU-T Rec. G.729. By default, the encoder input and decoder output are sampled at 16000 Hz. The bitstream produced by the encoder is scalable and consists of 12 embedded layers, referred to as Layers 1 to 12. Layer 1 is the core layer, corresponding to a bit rate of 8 kbit/s. This layer is compliant with the G.729 bitstream, which makes G.729EV interoperable with G.729. Layer 2 is a narrowband enhancement layer adding 4 kbit/s, while Layers 3 to 12 are wideband enhancement layers adding a total of 20 kbit/s in steps of 2 kbit/s.
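The layer structure described above can be checked with a few lines of arithmetic. The following is a minimal sketch only; the function name `cumulative_bitrates_kbps` is a hypothetical helper introduced for illustration and is not part of the Recommendation.

```python
def cumulative_bitrates_kbps():
    """Cumulative bit rate (kbit/s) after receiving Layers 1..12.

    Layer 1 is the 8 kbit/s core, Layer 2 adds 4 kbit/s, and each of
    Layers 3 to 12 adds 2 kbit/s, as stated in the text.
    """
    increments = [8, 4] + [2] * 10  # one increment per layer
    rates = []
    total = 0
    for inc in increments:
        total += inc
        rates.append(total)
    return rates


rates = cumulative_bitrates_kbps()
for layer, rate in enumerate(rates, start=1):
    print(f"Layers 1..{layer}: {rate} kbit/s")
```

The resulting list runs from 8 kbit/s (Layer 1 only) to 32 kbit/s (all 12 layers), matching the 8-32 kbit/s range quoted for the coder.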
This coder is designed to operate with a digital signal sampled at 16000 Hz followed by conversion to 16-bit linear PCM for the input to the encoder. However, the 8000 Hz input sampling frequency is also supported. Similarly, the format of the decoder output is 16-bit linear PCM with a sampling frequency of 8000 or 16000 Hz. Other input/output characteristics should be converted to 16-bit linear PCM with 8000 or 16000 Hz sampling before encoding, or from 16-bit linear PCM to the appropriate format after decoding. The bitstream from the encoder to the decoder is defined within this Recommendation. The G.729EV coder is built upon a three-stage structure: embedded Code-Excited Linear-Prediction (CELP) coding, Time-Domain Bandwidth Extension (TDBWE) and predictive transform coding that will be referred to as Time-Domain Aliasing Cancellation (TDAC). The embedded CELP stage generates Layers 1 and 2 which yield a narrowband synthesis (50-4000 Hz) at 8 and 12 kbit/s. The TDBWE stage generates Layer 3 and allows producing a wideband output (50-7000 Hz) at 14 kbit/s. The TDAC stage operates in the Modified Discrete Cosine Transform (MDCT) domain and generates Layers 4 to 12 to improve quality from 14 to 32 kbit/s. TDAC coding represents jointly the weighted CELP coding error signal in the 50-4000 Hz band and the input signal in the 4000-7000 Hz band.
The G.729EV coder operates on 20 ms frames. However, the embedded CELP coding stage operates on 10 ms frames, like G.729. As a result, two 10 ms CELP frames are processed per 20 ms frame. In the following, to be consistent with the text of ITU-T Rec. G.729, the 20 ms frames used by G.729EV will be referred to as superframes, whereas the 10 ms frames and the 5 ms subframes involved in the CELP processing will be called frames and subframes, respectively. Within G.729EV, the TDBWE algorithm is the part most relevant to the present topic.
G.729.1 Encoder
A functional diagram of the encoder is presented in FIG. 1. The encoder operates on 20 ms input superframes. By default, the input signal 101, sWB(n), is sampled at 16000 Hz, so the input superframes are 320 samples long. The input signal sWB(n) is first split into two sub-bands using a QMF filter bank defined by the filters H1(z) and H2(z). The lower-band input signal 102, sLBqmf(n), obtained after decimation, is pre-processed by a high-pass filter Hh1(z) with a 50 Hz cut-off frequency. The resulting signal 103, sLB(n), is coded by the 8-12 kbit/s narrowband embedded CELP encoder. To be consistent with ITU-T Rec. G.729, the signal sLB(n) will also be denoted s(n). The difference 104, dLB(n), between s(n) and the local synthesis 105, ŝenh(n), of the CELP encoder at 12 kbit/s is processed by the perceptual weighting filter WLB(z). The parameters of WLB(z) are derived from the quantized LP coefficients of the CELP encoder. Furthermore, the filter WLB(z) includes a gain compensation that guarantees spectral continuity between the output 106, dLBw(n), of WLB(z) and the higher-band input signal 107, sHB(n). The weighted difference dLBw(n) is then transformed into the frequency domain by MDCT. The higher-band input signal 108, sHBfold(n), obtained after decimation and spectral folding by (−1)^n, is pre-processed by a low-pass filter Hh2(z) with a 3000 Hz cut-off frequency. The resulting signal sHB(n) is coded by the TDBWE encoder. The signal sHB(n) is also transformed into the frequency domain by MDCT. The two sets of MDCT coefficients 109, DLBw(k), and 110, SHB(k), are finally coded by the TDAC encoder. In addition, some parameters are transmitted by the frame erasure concealment (FEC) encoder in order to introduce parameter-level redundancy in the bitstream. This redundancy improves quality in the presence of erased superframes.
TDBWE Encoder
The TDBWE encoder is illustrated in FIG. 2. The TDBWE encoder extracts a fairly coarse parametric description from the pre-processed and down-sampled higher-band signal 201, sHB(n). This parametric description comprises time envelope 202 and frequency envelope 203 parameters. The 20 ms input speech superframe sHB(n) (8 kHz sampling frequency) is subdivided into 16 segments of length 1.25 ms each, i.e., each segment comprises 10 samples. The 16 time envelope parameters 202, Tenv(i), i=0, . . . , 15, are computed as logarithmic subframe energies before quantization. For the computation of the 12 frequency envelope parameters 203, Fenv(j), j=0, . . . , 11, the signal 201, sHB(n), is windowed by a slightly asymmetric analysis window. This window is 128 taps long (16 ms) and is constructed from the rising slope of a 144-tap Hanning window followed by the falling slope of a 112-tap Hanning window. The maximum of the window is centered on the second 10 ms frame of the current superframe. The window is constructed such that the frequency envelope computation has a lookahead of 16 samples (2 ms) and a lookback of 32 samples (4 ms). The windowed signal is transformed by FFT. The even bins of the full-length 128-tap FFT are computed using a polyphase structure. Finally, the frequency envelope parameter set is calculated as logarithmic weighted sub-band energies for 12 evenly spaced and equally wide overlapping sub-bands in the FFT domain.
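The segment-wise time envelope and the asymmetric analysis window described above can be illustrated as follows. This is a rough sketch under stated assumptions, not the normative computation: the text only says "logarithmic subframe energies", so the half-log2 of the mean segment energy, the small `eps` floor, and the symmetric Hann formula used for each slope are all assumptions, and the function names are hypothetical.

```python
import math

SEGMENTS = 16      # time envelope parameters per 20 ms superframe
SEGMENT_LEN = 10   # 1.25 ms at 8 kHz


def time_envelope(s_hb, eps=1e-12):
    """Log-domain energy of each 10-sample segment (assumed: 0.5*log2 of mean energy)."""
    if len(s_hb) != SEGMENTS * SEGMENT_LEN:
        raise ValueError("expected one 160-sample higher-band superframe")
    env = []
    for i in range(SEGMENTS):
        seg = s_hb[i * SEGMENT_LEN:(i + 1) * SEGMENT_LEN]
        energy = sum(x * x for x in seg) / SEGMENT_LEN
        env.append(0.5 * math.log2(energy + eps))  # eps avoids log2(0); assumption
    return env


def hann(n_taps):
    """Symmetric Hann window (assumed definition for the 'Hanning' slopes)."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / (n_taps - 1)))
            for n in range(n_taps)]


def analysis_window():
    """Rising half of a 144-tap window followed by falling half of a 112-tap window."""
    rising = hann(144)[:72]    # 72 samples
    falling = hann(112)[56:]   # 56 samples
    return rising + falling    # 128 taps in total


window = analysis_window()
envelope = time_envelope([4.0] * 160)  # constant test signal
```

With these assumptions the two slopes add up to the 128 taps quoted in the text (72 + 56), and the window peak lies off-center, which is what makes the window slightly asymmetric.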
G.729.1 TDAC Encoder (Layers 4 to 12)
The Time Domain Aliasing Cancellation (TDAC) encoder is illustrated in FIG. 3. The TDAC encoder jointly represents two split MDCT spectra 301, DLBw(k), and 302, SHB(k), by gain-shape vector quantization. DLBw(k) represents the CELP coding error in the weighted spectrum domain of [0, 4 kHz]; SHB(k) is the unquantized weighted spectrum of [4 kHz, 8 kHz]. The joint spectrum is divided into sub-bands. The gains in each sub-band define the spectral envelope. The shape in each sub-band is encoded by embedded spherical vector quantization using trained permutation codes. The gain-shape of SHB(k) represents a true spectral envelope in the second band.
Each spectral envelope gain is quantized with 5 bits by uniform scalar quantization, and the resulting quantization indices are coded using a two-mode binary encoder. The 5-bit quantization consists of computing the indices 305, rms_index(j), j=0, . . . , 17, as follows:
$$\mathrm{rms\_index}(j) = \mathrm{round}\left(\tfrac{1}{2}\,\mathrm{log\_rms}(j)\right) \qquad (1)$$
with the restriction
$$-11 \le \mathrm{rms\_index}(j) \le +20 \qquad (2)$$
i.e., the indices are limited to the range −11 to +20 (32 possible values).
The resulting quantized full-band envelope is then divided into two subvectors:
lower-band spectral envelope: (rms_index(0), rms_index(1), . . . , rms_index(9))
and
higher-band spectral envelope: (rms_index(10), rms_index(11), . . . , rms_index(17)).
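The rounding, range restriction, and split into two subvectors can be sketched as below. This is an illustrative sketch only: the input is assumed to be the already-scaled argument of round() in Eq. (1) (the definition of log_rms is not given in this text), rounding half up is an assumption, and the function name `quantize_envelope` is hypothetical.

```python
import math

RMS_INDEX_MIN = -11
RMS_INDEX_MAX = 20
NUM_SUBBANDS = 18
NUM_LOWER = 10  # rms_index(0..9); the remaining 8 belong to the higher band


def quantize_envelope(scaled_log_values):
    """Round and clamp 18 values to [-11, +20], then split into lower/higher band.

    `scaled_log_values` stands for the argument of round() in Eq. (1);
    computing it from the MDCT spectrum is outside this sketch.
    """
    if len(scaled_log_values) != NUM_SUBBANDS:
        raise ValueError("expected 18 sub-band values")
    indices = []
    for v in scaled_log_values:
        idx = math.floor(v + 0.5)  # round half up (assumed rounding rule)
        idx = max(RMS_INDEX_MIN, min(RMS_INDEX_MAX, idx))  # restriction of Eq. (2)
        indices.append(idx)
    return indices[:NUM_LOWER], indices[NUM_LOWER:]


lower, higher = quantize_envelope([-20.0, 3.4, 25.0] + [0.0] * 15)
```

The clamped range contains 20 − (−11) + 1 = 32 values, which is why 5 bits suffice for each index.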
These two subvectors are coded separately using a two-mode lossless encoder which switches adaptively between differential Huffman coding (mode 0) and direct natural binary coding (mode 1). Differential Huffman coding is used to minimize the average number of bits, whereas direct natural binary coding is used to limit the worst-case number of bits as well as to correctly encode the envelope of signals which are saturated by differential Huffman coding (e.g., sinusoids). One bit is used to indicate the selected mode to the spectral envelope decoder.
G.729.1 Decoder
A functional diagram of the decoder is presented in FIG. 4. The specific case of frame erasure concealment is not considered in this figure. The decoding depends on the actual number of received layers or equivalently on the received bit rate.
If the received bit rate is:
8 kbit/s (Layer 1): The core layer is decoded by the embedded CELP decoder to obtain 401, ŝLB(n)=ŝ(n). Then ŝLB(n) is postfiltered into 402, ŝLBpost(n), and postprocessed by a high-pass filter (HPF) into 403, ŝLBqmf(n)=ŝLBhpf(n). The QMF synthesis filterbank defined by the filters G1(z) and G2(z) generates the output with a high-frequency synthesis 404, ŝHBqmf(n), set to zero.
12 kbit/s (Layers 1 and 2): The core layer and the narrowband enhancement layer are decoded by the embedded CELP decoder to obtain 401, ŝLB(n)=ŝenh(n). Then ŝLB(n) is postfiltered into 402, ŝLBpost(n), and high-pass filtered to obtain 403, ŝLBqmf(n)=ŝLBhpf(n). The QMF synthesis filterbank generates the output with a high-frequency synthesis 404, ŝHBqmf(n), set to zero.
14 kbit/s (Layers 1 to 3): In addition to the narrowband CELP decoding and lower-band adaptive postfiltering, the TDBWE decoder produces a high-frequency synthesis 405, ŝHBbwe(n), which is then transformed into the frequency domain by MDCT so as to zero the frequency band above 3000 Hz in the higher-band spectrum 406, ŝHBbwe(k). The resulting spectrum 407, ŝHB(k), is transformed into the time domain by inverse MDCT and overlap-add before spectral folding by (−1)^n. In the QMF synthesis filterbank, the reconstructed higher-band signal 404, ŝHBqmf(n), is combined with the respective lower-band signal 402, ŝLBqmf(n)=ŝLBpost(n), reconstructed at 12 kbit/s without high-pass filtering.
Above 14 kbit/s (Layers 1 to 4+): In addition to the narrowband CELP and TDBWE decoding, the TDAC decoder reconstructs MDCT coefficients 408, D̂LBw(k), and 407, ŝHB(k), which correspond to the reconstructed weighted difference in the lower band (0-4000 Hz) and the reconstructed signal in the higher band (4000-7000 Hz). Note that in the higher band, the non-received sub-bands and the sub-bands with zero bit allocation in TDAC decoding are replaced by the level-adjusted sub-bands of ŝHBbwe(k). Both D̂LBw(k) and ŝHB(k) are transformed into the time domain by inverse MDCT and overlap-add. The lower-band signal 409, d̂LBw(n), is then processed by the inverse perceptual weighting filter WLB(z)^−1. To attenuate transform coding artefacts, pre/post-echoes are detected and reduced in both the lower- and higher-band signals 410, d̂LB(n), and 411, ŝHB(n). The lower-band synthesis ŝLB(n) is postfiltered, while the higher-band synthesis 412, ŝHBfold(n), is spectrally folded by (−1)^n. The signals ŝLBqmf(n)=ŝLBpost(n) and ŝHBqmf(n) are then combined and upsampled in the QMF synthesis filterbank.
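The dependence of the decoding path on the number of received layers can be summarized in a small dispatch sketch. The stage labels and the function name `active_decoder_stages` are hypothetical names chosen for illustration; the actual decoder of FIG. 4 performs considerably more processing per stage.

```python
def active_decoder_stages(num_layers):
    """Return the decoding stages used when Layers 1..num_layers are received."""
    if not 1 <= num_layers <= 12:
        raise ValueError("the bitstream consists of Layers 1 to 12")
    stages = ["celp_core"]                 # Layer 1, 8 kbit/s
    if num_layers >= 2:
        stages.append("celp_enhancement")  # Layer 2, 12 kbit/s
    if num_layers >= 3:
        stages.append("tdbwe")             # Layer 3, 14 kbit/s
    if num_layers >= 4:
        stages.append("tdac")              # Layers 4 to 12, above 14 kbit/s
    return stages


def higher_band_is_zero(num_layers):
    """At 8 and 12 kbit/s the high-frequency synthesis is set to zero."""
    return "tdbwe" not in active_decoder_stages(num_layers)


for n in (1, 2, 3, 12):
    print(n, active_decoder_stages(n), higher_band_is_zero(n))
```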
TDBWE Decoder
FIG. 5 illustrates the concept of the TDBWE decoder module. The TDBWE received parameters, which are computed by a parameter extraction procedure, are used to shape an artificially generated excitation signal 502, ŝHBexc(n), according to the desired time and frequency envelopes 508, T̂env(i), and 509, F̂env(j). This is followed by a time-domain post-processing procedure.
The quantized parameter set consists of the value M̂T and of the following vectors: T̂env,1, T̂env,2, F̂env,1, F̂env,2, and F̂env,3. The quantized mean time envelope M̂T is used to reconstruct the time envelope and the frequency envelope parameters from the individual vector components, i.e.:
$$\hat{T}_{env}(i) = \hat{T}_{env}^{M}(i) + \hat{M}_T, \quad i = 0, \ldots, 15 \qquad (3)$$
and
$$\hat{F}_{env}(j) = \hat{F}_{env}^{M}(j) + \hat{M}_T, \quad j = 0, \ldots, 11 \qquad (4)$$
The decoded frequency envelope parameters F̂env(j), j=0, . . . , 11, are representative of the second 10 ms frame within the 20 ms superframe. The first 10 ms frame is covered by parameter interpolation between the current parameter set and the parameter set F̂env,old(j) from the preceding superframe:
$$\hat{F}_{env,int}(j) = \tfrac{1}{2}\left(\hat{F}_{env,old}(j) + \hat{F}_{env}(j)\right), \quad j = 0, \ldots, 11 \qquad (5)$$
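Equations (3) to (5) amount to adding the quantized mean back to the mean-removed vectors and averaging the frequency envelope with that of the previous superframe. A minimal sketch follows; the function and argument names are hypothetical, and the mean-removed vectors are assumed to be already assembled from their decoded sub-vectors.

```python
def reconstruct_envelopes(t_env_m, f_env_m, m_t, f_env_old):
    """Apply Eqs. (3)-(5) to decoded, mean-removed envelope vectors.

    t_env_m   : 16 mean-removed time envelope values
    f_env_m   : 12 mean-removed frequency envelope values
    m_t       : quantized mean time envelope
    f_env_old : 12 frequency envelope values of the preceding superframe
    """
    if len(t_env_m) != 16 or len(f_env_m) != 12 or len(f_env_old) != 12:
        raise ValueError("unexpected envelope vector length")
    t_env = [t + m_t for t in t_env_m]                              # Eq. (3)
    f_env = [f + m_t for f in f_env_m]                              # Eq. (4)
    f_env_int = [0.5 * (old + cur)                                  # Eq. (5)
                 for old, cur in zip(f_env_old, f_env)]
    return t_env, f_env, f_env_int


t_env, f_env, f_env_int = reconstruct_envelopes(
    [0.5] * 16, [1.0] * 12, 3.0, [2.0] * 12)
```

In this toy input, the interpolated envelope for the first 10 ms frame lies halfway between the previous value (2.0) and the current value (4.0).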
The superframe of 503, ŝHBT(n), is analyzed twice per superframe. A filterbank equalizer is designed such that its individual channels match the sub-band division to realize the frequency envelope shaping with proper gain for each channel.
The TDBWE excitation signal 501, exc(n), is generated on a 5 ms subframe basis from parameters that are transmitted in Layers 1 and 2 of the bitstream. Specifically, the following parameters are used: the integer pitch lag T0=int(T1) or int(T2), depending on the subframe; the fractional pitch lag frac; the energy Ec of the fixed codebook contributions; and the energy Ep of the adaptive codebook contribution. The parameters of the excitation generation are computed every 5 ms subframe. The excitation signal generation consists of the following steps:
estimation of two gains gv and guv for the voiced and unvoiced contributions to the final excitation signal exc(n);
pitch lag post-processing;
generation of the voiced contribution;
generation of the unvoiced contribution; and
low-pass filtering.
TDAC Decoder
The TDAC decoder is depicted in FIG. 6. The higher-band spectral envelope is decoded first. The bit indicating the coding mode selected at the encoder may be: 0 → differential Huffman coding, 1 → natural binary coding. If mode 0 is selected, 5 bits are decoded to obtain an index rms_index(10) in [−11, +20]. Then the Huffman codes associated with the differential indices diff_index(j), j=11, . . . , 17, are decoded. The index 601, rms_index(j), j=11, . . . , 17, is reconstructed as follows:
$$\mathrm{rms\_index}(j) = \mathrm{rms\_index}(j-1) + \mathrm{diff\_index}(j) \qquad (6)$$
If mode 1 is selected, rms_index(j), j=10, . . . , 17, is obtained in [−11, +20] by decoding 8 × 5 bits. If the number of bits is not sufficient to decode the higher-band spectral envelope completely, the decoded indices 601, rms_index(j), are kept to allow partial level-adjustment of the decoded higher-band spectrum. The bits related to the lower band, i.e., rms_index(j), j=0, . . . , 9, are decoded in a similar way as in the higher band, including one bit to select mode 0 or 1. The decoded indices are combined into a single vector [rms_index(0) rms_index(1) . . . rms_index(17)], which represents the reconstructed spectral envelope in the log domain. This envelope is converted into the linear domain 402 as follows:
$$\mathrm{rms\_q}(j) = 2^{\frac{1}{2}\,\mathrm{rms\_index}(j)} \qquad (7)$$
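The mode-0 index reconstruction of Eq. (6) and the log-to-linear conversion of Eq. (7) can be sketched as follows. Huffman decoding itself is not shown; the differential indices are assumed to be already decoded, and the function names are hypothetical.

```python
def reconstruct_indices(first_index, diff_indices):
    """Eq. (6): accumulate differential indices starting from the first index."""
    indices = [first_index]
    for diff in diff_indices:
        indices.append(indices[-1] + diff)
    return indices


def envelope_to_linear(rms_indices):
    """Eq. (7): rms_q(j) = 2 ** (rms_index(j) / 2)."""
    return [2.0 ** (0.5 * idx) for idx in rms_indices]


# Higher band in mode 0: rms_index(10) is read directly, followed by the
# seven differential indices for j = 11..17 (values here are made up).
higher_indices = reconstruct_indices(4, [1, -2, 0, 0, 1, -1, -3])
higher_linear = envelope_to_linear(higher_indices)
```

Because each index step corresponds to a factor of the square root of 2 in amplitude, an index change of 2 doubles or halves the linear envelope value.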