G.729.1 is a new-generation embedded speech encoding and decoding standard released by the International Telecommunication Union (ITU). Its defining feature is layered encoding, which provides audio quality ranging from narrowband to broadband over a rate range of 8 kb/s to 32 kb/s. During transmission, outer-layer code streams may be discarded depending on the channel condition, so that the code stream adapts well to the channel.
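As a rough illustration of discarding outer layers to match the channel, the following minimal sketch picks the highest cumulative rate that fits a given capacity. The function name and the list of intermediate layer rates are assumptions made for illustration; the text above states only the 8 kb/s to 32 kb/s range.

```python
# Hypothetical cumulative layer rates in kb/s (an assumption for illustration;
# the description only states the overall 8 to 32 kb/s range).
LAYER_RATES_KBPS = [8, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]


def truncate_to_capacity(capacity_kbps):
    """Return the highest cumulative layer rate that fits the channel.

    Layers above the returned rate are the ones that would be discarded.
    """
    usable = [rate for rate in LAYER_RATES_KBPS if rate <= capacity_kbps]
    if not usable:
        raise ValueError("channel cannot carry even the innermost layer")
    return usable[-1]
```

For example, under these assumed rates a channel that can carry 13 kb/s would keep only the layers up to 12 kb/s.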
In the G.729.1 standard, layering is achieved by organizing the code stream into an embedded layered structure, which requires an embedded layered multi-rate speech codec. The input is a 20 ms super-frame; at a sampling rate of 16000 Hz, the frame length is 320 samples. FIG. 1 is a block diagram of a G.729.1 system with encoders at each layer. The encoding process of the speech codec is as follows. First, an input signal sWB(n) is divided by a Quadrature Mirror Filterbank (QMF) (H1(z), H2(z)) into two sub-bands. The lower sub-band signal sLBqmf(n) is pre-processed by a high-pass filter having a cut-off frequency of 50 Hz. The output signal sLB(n) is encoded by an 8 kb/s to 12 kb/s narrowband embedded Code-Excited Linear-Prediction (CELP) encoder. The difference signal dLB(n) between sLB(n) and the local synthesis signal ŝenh(n) of the CELP encoder at the rate of 12 kb/s passes through a perceptual weighting filter WLB(z) to obtain a signal dLBw(n). The weighting filter WLB(z) includes gain compensation to maintain spectral continuity between the filter output dLBw(n) and the higher sub-band input signal sHB(n). The weighted difference signal dLBw(n) is then transformed to the frequency domain by a Modified Discrete Cosine Transform (MDCT).
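The frame length quoted above follows directly from the frame duration and the sampling rate. A minimal sketch of the arithmetic is given below; the helper name is hypothetical, and the 8000 Hz sub-band rate assumes that each QMF sub-band is decimated by two, which the text does not state explicitly.

```python
def samples_per_frame(frame_ms, sample_rate_hz):
    """Number of samples in one frame of the given duration."""
    return frame_ms * sample_rate_hz // 1000


# 20 ms super-frame at the 16000 Hz input sampling rate.
FULL_BAND_FRAME = samples_per_frame(20, 16000)

# Assumption: each QMF sub-band runs at half the input rate (8000 Hz).
SUB_BAND_FRAME = samples_per_frame(20, 8000)
```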
The higher sub-band component is multiplied by (−1)^n to obtain a spectrally inverted signal sHBfold(n). The spectrally inverted signal sHBfold(n) is pre-processed by a low-pass filter having a cut-off frequency of 3000 Hz. The filtered signal sHB(n) is encoded by a Time-Domain BandWidth Extension (TDBWE) encoder. An MDCT is also performed on sHB(n) to transform it to the frequency domain before it enters the Time-Domain Aliasing Cancellation (TDAC) encoding module.
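The effect of multiplying by (−1)^n can be checked numerically: a tone at frequency f in a signal sampled at rate fs reappears at fs/2 − f. The sketch below is illustrative only; the function names are hypothetical, and the 8000 Hz sub-band sampling rate and the 3500 Hz test tone are assumptions chosen for the demonstration.

```python
import math


def spectral_fold(x):
    """Multiply sample n by (-1)^n, mirroring the spectrum about fs/4."""
    return [s if n % 2 == 0 else -s for n, s in enumerate(x)]


def dft_magnitude(x, k):
    """Magnitude of DFT bin k, evaluated directly (illustration only)."""
    size = len(x)
    re = sum(s * math.cos(2 * math.pi * k * n / size) for n, s in enumerate(x))
    im = -sum(s * math.sin(2 * math.pi * k * n / size) for n, s in enumerate(x))
    return math.hypot(re, im)


FS = 8000        # assumed sub-band sampling rate in Hz
N = 160          # 20 ms at the assumed sub-band rate
TONE_HZ = 3500   # arbitrary test tone near the top of the sub-band

x = [math.sin(2 * math.pi * TONE_HZ * n / FS) for n in range(N)]
y = spectral_fold(x)

# Each DFT bin spans FS / N = 50 Hz, so 3500 Hz is bin 70 and 500 Hz is bin 10.
before = dft_magnitude(x, 70)
after = dft_magnitude(y, 10)
```

With these values the energy found in the 3500 Hz bin of `x` moves to the 500 Hz bin of `y`, which is the inversion described above.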
Finally, the two sets of MDCT coefficients DLBw(k) and SHB(k) are encoded with the TDAC encoding algorithm. In addition, some other parameters are transmitted by the Frame Erasure Concealment (FEC) encoder to reduce the errors caused by frame loss during transmission.
FIG. 2 is a block diagram of a G.729.1 system having decoders at each layer. The operation mode of the decoder is determined by the number of layers in the received code stream or, equivalently, by the receiving rate. The cases for the different receiving rates at the receiving side are described below.
1. If the receiving rate is 8 kb/s or 12 kb/s (i.e., only the first layer or the first two layers are received), an embedded CELP decoder decodes the code stream of the first layer or the first two layers to obtain a decoded signal ŝLB(n), which is post-filtered to obtain ŝLBpost(n). This signal passes through a high-pass filter to the QMF filter bank, which synthesizes a 16 kHz broadband signal whose higher-band component is set to 0.
2. If the receiving rate is 14 kb/s (i.e., the first three layers are received), the CELP decoder decodes the narrowband component and, in addition, the TDBWE decoder decodes the higher-band signal component ŝHBbwe(n). An MDCT is performed on ŝHBbwe(n), the frequency components above 3000 Hz in the higher sub-band spectrum (corresponding to components above 7000 Hz at the 16 kHz sampling rate) are set to 0, and an inverse MDCT is then performed. After overlap-add (superimposition) and spectral inversion, the processed higher-band component is combined in the QMF filter bank with the lower-band component ŝLBpost(n) decoded by the CELP decoder, to obtain a broadband signal having a sampling rate of 16 kHz.
3. If the received code stream has a rate higher than 14 kb/s (i.e., the first four or more layers are received), the CELP decoder decodes the lower sub-band component ŝLBpost(n), the TDBWE decoder decodes the higher sub-band component ŝHBbwe(n), and, in addition, the TDAC decoder decodes a lower sub-band weighted difference signal and a higher sub-band enhancement signal. The full-band signal is enhanced, and a broadband signal having a sampling rate of 16 kHz is finally synthesized in the QMF filter bank.
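The three cases above amount to selecting decoder modules by receiving rate. This can be sketched as follows; the function names and the string labels are hypothetical and serve only to summarize the cases, not to reproduce any reference decoder.

```python
def active_decoders(rate_kbps):
    """List the decoder modules used at a given receiving rate (in kb/s)."""
    if rate_kbps < 8:
        raise ValueError("rate is below the first layer (8 kb/s)")
    modules = ["CELP"]            # cases 1 to 3: lower sub-band
    if rate_kbps >= 14:
        modules.append("TDBWE")   # cases 2 and 3: higher sub-band
    if rate_kbps > 14:
        modules.append("TDAC")    # case 3 only: enhancement signals
    return modules


def higher_band_is_zeroed(rate_kbps):
    """True when the synthesized 16 kHz output has no higher-band content."""
    return "TDBWE" not in active_decoders(rate_kbps)
```

For instance, `active_decoders(12)` yields only the CELP decoder, while any rate above 14 kb/s engages all three modules.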
Conventional systems have at least the following deficiencies.
A G.729.1 code stream has a layered structure. During transmission, outer-layer code streams may be discarded from the outermost layer inward depending on the channel transmission capability, so that the code stream adapts to the channel condition. From the description of the encoding and decoding algorithms, it can be seen that when the channel capacity changes rapidly over time, the decoder may receive a narrowband code stream (12 kb/s or lower) at one moment, in which case the decoded signal contains only components below 4000 Hz, and a broadband code stream (14 kb/s or higher) at another moment, in which case the decoded signal may contain a broadband signal of 0 to 7000 Hz. Such a sudden change in bandwidth is referred to herein as a bandwidth switch. Since the higher and lower bands contribute differently to the listening experience, frequent switches may cause noticeable discomfort. In particular, with frequent broadband-to-narrowband switches, the listener repeatedly perceives the voice jumping from clear to dull. Therefore, there is a need for a technique to mitigate the listening discomfort caused by frequent bandwidth switches.
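To make the notion of a bandwidth switch concrete, the following minimal sketch counts switches in a sequence of per-frame receiving rates, using the 12 kb/s and 14 kb/s boundaries given above. The function name and the example rate sequence are hypothetical; the sketch only identifies the problem and is not a mitigation technique.

```python
BROADBAND_MIN_KBPS = 14  # rates of 14 kb/s or higher decode to broadband


def count_bandwidth_switches(rates_kbps):
    """Count (broadband-to-narrowband, narrowband-to-broadband) switches."""
    wide_to_narrow = 0
    narrow_to_wide = 0
    previous = None
    for rate in rates_kbps:
        current = rate >= BROADBAND_MIN_KBPS
        if previous is not None and current != previous:
            if previous:
                wide_to_narrow += 1
            else:
                narrow_to_wide += 1
        previous = current
    return wide_to_narrow, narrow_to_wide


# Hypothetical per-frame receiving rates on a rapidly varying channel.
example_rates = [32, 32, 12, 8, 14, 12]
switches = count_bandwidth_switches(example_rates)
```

In this made-up sequence the bandwidth changes three times within six frames, two of them from broadband to narrowband, which is the kind of rapid alternation described above.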