The performance of digital speech systems using low bit rates has become increasingly important in current and foreseeable digital communications. Both dedicated-channel and packetized voice-over-internet protocol (VoIP) transmission benefit from compression of speech signals. Linear prediction (LP) digital speech coding is one of the widely used techniques for parameter quantization in speech coding applications. This predictive coding method removes the correlation between the parameters in adjacent frames and thus allows more accurate quantization at the same bit rate than non-predictive quantization methods. Predictive coding is especially useful for stationary voiced segments, as the parameters of adjacent frames have large correlations. In addition, the human ear is more sensitive to small changes in stationary signals, and predictive coding allows more efficient encoding of these small changes.
The predictive coding approach to speech compression models the vocal tract as a time-varying filter and a time-varying excitation of the filter to mimic human speech. Linear prediction analysis determines LP coefficients a(j), j = 1, 2, . . . , M, for an input frame of digital speech samples {s(n)} by setting

r(n) = s(n) − Σ_{j=1}^{M} a(j) s(n−j)  (0)

and minimizing Σ_frame r(n)² with respect to the a(j). Typically, M, the order of the linear prediction filter, is taken to be about 8-16; the sampling rate to form the samples s(n) is typically 8 or 16 kHz; and the number of samples {s(n)} in a frame is often 80 or 160 at 8 kHz, or 160 or 320 at 16 kHz. Various windowing operations may be applied to the samples of the input speech frame. The name "linear prediction" arises from the interpretation of the residual r(n) = s(n) − Σ_{j=1}^{M} a(j) s(n−j) as the error in predicting s(n) by a linear combination of preceding speech samples Σ_{j=1}^{M} a(j) s(n−j), i.e., a linear autoregression. Thus, minimizing Σ_frame r(n)² yields the {a(j)} which furnish the best linear prediction. The coefficients {a(j)} may be converted to line spectral frequencies (LSFs) or immittance spectrum pairs (ISPs) for vector quantization plus transmission and/or storage.
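The minimization above can be sketched in code. The following illustrative Python (not from any standard; the function names and the zero-history handling at the frame boundary are assumptions) computes the a(j) by the autocorrelation method using the Levinson-Durbin recursion and then forms the residual of equation (0):

```python
# Sketch of LP analysis via the autocorrelation method (Levinson-Durbin).
# Illustrative only; real coders add windowing, lag windowing, bandwidth
# expansion, etc.

def autocorr(s, max_lag):
    """Autocorrelation r[k] = sum_n s[n]*s[n-k] over the frame."""
    n = len(s)
    return [sum(s[i] * s[i - k] for i in range(k, n)) for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations for the LP coefficients a(1..M)."""
    a = [0.0] * (order + 1)          # a[0] unused; predictor is sum_j a[j]*s(n-j)
    err = r[0]                       # prediction-error energy
    for m in range(1, order + 1):
        acc = r[m] - sum(a[j] * r[m - j] for j in range(1, m))
        k = acc / err                # reflection coefficient for stage m
        a_new = a[:]
        a_new[m] = k
        for j in range(1, m):
            a_new[j] = a[j] - k * a[m - j]
        a = a_new
        err *= (1.0 - k * k)
    return a[1:], err                # a(1)..a(M) and residual energy

def residual(s, a):
    """r(n) = s(n) - sum_j a(j)*s(n-j), per equation (0); samples before the
    frame are treated as zero (an assumption of this sketch)."""
    M = len(a)
    return [s[n] - sum(a[j - 1] * s[n - j] for j in range(1, M + 1) if n - j >= 0)
            for n in range(len(s))]
```

For a first-order autoregressive input such as s(n) = 0.9ⁿ, the recursion recovers a(1) ≈ 0.9 and the residual is near zero after the first sample.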
The {r(n)} form the LP residual for the frame, and ideally the LP residual would be the excitation for the synthesis filter 1/A(z), where A(z) = 1 − Σ_{j=1}^{M} a(j) z^{−j} is the transfer function corresponding to equation (0); that is, equation (0) is a convolution, which corresponds to multiplication in the z-domain: R(z) = A(z)S(z), so S(z) = R(z)/A(z). Of course, the LP residual is not available at the decoder; thus the task of the encoder is to represent the LP residual so that the decoder can generate an excitation for the LP synthesis filter. Indeed, from the input encoded (quantized) parameters, the decoder generates a filter estimate, Â(z), plus an estimate of the residual to use as an excitation, E(z), and thereby estimates the speech frame as Ŝ(z) = E(z)/Â(z). Physiologically, for voiced frames the excitation roughly has the form of a series of pulses at the pitch frequency, and for unvoiced frames the excitation roughly has the form of white noise.
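As a sketch of the relationship R(z) = A(z)S(z), the following hypothetical Python pair shows that feeding the true residual back through the synthesis filter 1/A(z) recovers the frame exactly (zero initial filter state is an assumption of the sketch):

```python
# Sketch: the synthesis filter 1/A(z) inverts the analysis filter A(z).
# Feeding the true LP residual through synthesis recovers the frame.

def analysis(s, a):
    """r(n) = s(n) - sum_j a(j)*s(n-j)  --  A(z) applied to s."""
    return [s[n] - sum(a[j - 1] * s[n - j] for j in range(1, len(a) + 1) if n - j >= 0)
            for n in range(len(s))]

def synthesis(r, a):
    """s(n) = r(n) + sum_j a(j)*s(n-j)  --  1/A(z) applied to r."""
    s = []
    for n in range(len(r)):
        s.append(r[n] + sum(a[j - 1] * s[n - j]
                            for j in range(1, len(a) + 1) if n - j >= 0))
    return s
```

In the coder the decoder never sees the true residual, which is why the excitation E(z) must be built from transmitted parameters instead.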
For speech compression, the predictive coding approach basically quantizes various parameters with respect to their values in the previous frame and only transmits/stores updates or codebook entries for these quantized parameters. A receiver regenerates the speech with the same perceptual characteristics as the input speech. Periodic updating of the quantized items requires fewer bits than direct representation of the speech signal, so a reasonable LP encoder can operate at bit rates as low as 2-3 kb/s (kilobits per second).
For example, the Adaptive Multirate Wideband (AMR-WB) encoding standard with available bit rates ranging from 6.6 kb/s up to 23.85 kb/s uses LP analysis with codebook excitation (CELP) to compress speech. An adaptive-codebook contribution provides periodicity in the excitation and is the product of a gain, gP, multiplied by v(n), the excitation of the prior frame translated by the pitch lag of the current frame and interpolated to fit the current frame. An algebraic codebook contribution approximates the difference between the actual residual and the adaptive codebook contribution with a multiple-pulse vector (also known as an innovation sequence), c(n), multiplied by a gain, gC. The number of pulses depends on the bit rate. That is, the excitation is u(n)=gP v(n)+gC c(n) where v(n) comes from the prior (decoded) frame, and gP, gC, and c(n) come from the transmitted parameters for the current frame. The speech synthesized from the excitation is then postfiltered to mask noise. Postfiltering essentially involves three successive filters: a short-term filter, a long-term filter, and a tilt compensation filter. The short-term filter emphasizes formants; the long-term filter emphasizes periodicity, and the tilt compensation filter compensates for the spectral tilt typical of the short-term filter.
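A minimal sketch of forming the excitation u(n) = gP v(n) + gC c(n) follows. The integer-lag handling is a simplification (AMR-WB uses fractional pitch lags with interpolation), and all function names are illustrative:

```python
# Sketch of the CELP excitation u(n) = gP*v(n) + gC*c(n).
# v: adaptive-codebook (pitch) contribution built from the prior excitation;
# c: algebraic-codebook innovation (a sparse multiple-pulse vector).

def adaptive_contribution(prev_exc, pitch_lag, frame_len):
    """Repeat the prior excitation delayed by an integer pitch lag.
    Real coders use fractional lags with interpolation; this is a sketch."""
    v = []
    for n in range(frame_len):
        idx = n - pitch_lag
        # read pitch_lag samples back: from prev_exc, then from v itself
        v.append(prev_exc[idx] if idx < 0 else v[idx])
    return v

def excitation(v, c, g_p, g_c):
    """u(n) = gP*v(n) + gC*c(n)."""
    return [g_p * vn + g_c * cn for vn, cn in zip(v, c)]
```

With a pitch lag shorter than the frame, the adaptive contribution is periodic at the lag, which is exactly the periodicity the adaptive codebook is meant to supply.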
Predictive quantization can be applied to almost all parameters in speech coding applications, including linear prediction coefficients (LPC), gain, pitch, speech/residual harmonics, etc. In this technique, the mean of the parameter vector, μ_x, is first subtracted from the quantized parameter vector of the prior frame (the k−1st frame), x̂_{k−1}, and then the current frame (the kth frame) is predicted from the prior frame as

x̌_k = A(x̂_{k−1} − μ_x),  (1)

where A is the prediction matrix and x̌_k is the mean-removed predicted vector of the current frame. When the correlation among the elements of the parameter vector is zero, as with line spectral frequencies (LSFs) or immittance spectral frequencies (ISFs), A is a diagonal matrix. After this step, the difference vector, d_k, between the predicted vector and the mean-removed unquantized parameter vector, x_k, is calculated as

d_k = (x_k − μ_x) − x̌_k.  (2)

This difference vector is then quantized and sent to the decoder.
In the decoder, the current frame's parameter vector is first predicted using (1), and then the quantized difference vector and the mean vector are added to find the quantized parameter vector, x̂_k:

x̂_k = x̌_k + d̂_k + μ_x,  (3)

where d̂_k is the quantized version of the difference vector calculated with (2).
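Equations (1)-(3) can be illustrated with a toy encoder/decoder pair. The diagonal A is stored as a vector of per-element coefficients, and a uniform scalar quantizer with an arbitrary step size STEP stands in for the difference-vector codebook (both are assumptions of this sketch):

```python
# Sketch of predictive quantization per equations (1)-(3).
STEP = 0.05  # hypothetical quantizer step size

def quantize_diff(d):
    """Trivial uniform scalar quantizer standing in for the codebook."""
    return [round(di / STEP) * STEP for di in d]

def encode(x_k, xq_prev, A, mu):
    pred = [a * (xp - m) for a, xp, m in zip(A, xq_prev, mu)]   # eq. (1)
    d = [(xi - m) - p for xi, m, p in zip(x_k, mu, pred)]       # eq. (2)
    return quantize_diff(d)                                     # sent to decoder

def decode(d_q, xq_prev, A, mu):
    pred = [a * (xp - m) for a, xp, m in zip(A, xq_prev, mu)]   # same eq. (1)
    return [p + dq + m for p, dq, m in zip(pred, d_q, mu)]      # eq. (3)
```

Note that both sides compute the prediction from the same quantized prior vector, so the decoder's output differs from the true x_k only by the quantization error of the difference vector.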
In a typical quantization system, A and μ_x are obtained by a training procedure using a set of vectors. μ_x is obtained as the mean of the vectors in this set, and A is chosen to minimize the sum of the squared difference vectors d_k over all frames. The difference vector, d_k, may be coded with any quantization technique (e.g., scalar or vector quantization) that is designed to optimally quantize difference vectors.
Further, in a typical quantization system, the vector quantization is essentially a lookup process, where a lookup table is referred to as a "codebook." A codebook lists each quantization level, and each level has an associated "code-vector." The quantization process compares an input vector to the code-vectors and determines the best code-vector in terms of minimum distortion. Some quantization systems implement multi-stage vector quantization (MSVQ) in which multiple codebooks are used. In MSVQ, a central quantized vector (i.e., the output vector) is obtained by adding a number of quantized vectors. The output vector is sometimes referred to as a "reconstructed" vector. Each vector used in the reconstruction is from a different codebook, and each codebook corresponds to a "stage" of the quantization process. Each codebook is designed especially for a stage of the search. An input vector is quantized with the first codebook, the resulting error vector (i.e., difference vector) is quantized with the second codebook, and so on. The set of vectors used in the reconstruction may be expressed as

y(j_0, j_1, . . . , j_{s−1}) = y_0(j_0) + y_1(j_1) + . . . + y_{s−1}(j_{s−1}),  (4)

where s is the number of stages and y_i is the codebook for the ith stage. For example, for a three-dimensional input vector, such as x = (2,3,4), the reconstruction vectors for a two-stage search might be y_0 = (1,2,3) and y_1 = (1,1,1) (a perfect quantization, which is not always the case).
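A two-stage version of this search and the reconstruction of equation (4) might be sketched as follows; the codebook contents are made up for illustration, and they include the example vectors from the text:

```python
# Sketch of sequential MSVQ: quantize with the first codebook, then quantize
# the resulting error vector with the second; the output is the sum, eq. (4).

def nearest(codebook, x):
    """Index of the code-vector minimizing squared-error distortion."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], x)))

def msvq_encode(x, codebooks):
    indices, target = [], list(x)
    for cb in codebooks:
        j = nearest(cb, target)
        indices.append(j)
        target = [t - c for t, c in zip(target, cb[j])]  # error for next stage
    return indices

def msvq_decode(indices, codebooks):
    dim = len(codebooks[0][0])
    y = [0.0] * dim
    for j, cb in zip(indices, codebooks):
        y = [a + b for a, b in zip(y, cb[j])]            # y0(j0) + y1(j1) + ...
    return y
```

For the text's example x = (2,3,4), codebooks containing (1,2,3) and (1,1,1) reconstruct the input perfectly.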
During MSVQ, the codebooks may be searched using a sub-optimal tree search algorithm, also known as an M-algorithm. At each stage, the M best code-vectors, selected in terms of minimum distortion, are passed on to the next stage. The search continues until the final stage, where only the single best code-vector is determined. One example of an MSVQ quantizer is described in U.S. Pat. No. 6,122,608, filed on Aug. 15, 1998, entitled "Method for Switched Predictive Quantization".
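The M-algorithm might be sketched as follows, keeping at each stage the M partial reconstructions closest to the input. This is a simplification of practical implementations, which typically use perceptually weighted distortion measures:

```python
# Sketch of the M-best (M-algorithm) tree search over MSVQ stages.

def m_algorithm(x, codebooks, M=2):
    dim = len(x)
    # each candidate: (distortion, index list, accumulated reconstruction)
    cands = [(0.0, [], [0.0] * dim)]
    for cb in codebooks:
        nxt = []
        for _, idxs, acc in cands:
            for j, cv in enumerate(cb):
                rec = [a + c for a, c in zip(acc, cv)]
                dist = sum((xi - ri) ** 2 for xi, ri in zip(x, rec))
                nxt.append((dist, idxs + [j], rec))
        cands = sorted(nxt, key=lambda t: t[0])[:M]  # keep the M best survivors
    return cands[0][1]                               # index path of the best one
```

Unlike the purely sequential search, the M survivors let a stage-one choice that is not locally best still win overall.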
While predictive coding is one of the widely used techniques for parameter quantization in speech coding applications, any error that occurs in one frame propagates into subsequent frames. In particular, for VoIP, the loss or delay of packets or other corruption can lead to erased frames. There are a number of techniques to combat error propagation, including: (1) replacing the recursive (IIR) predictor with a moving average (MA) filter that approximates it, which limits error propagation to only a small number of frames (equal to the MA filter order); (2) artificially reducing the prediction coefficient and designing the quantizer accordingly, so that an error decays faster in subsequent frames; and (3) using switched-predictive quantization (or safety-net quantization) techniques, in which two different codebooks with two different predictors (i.e., prediction matrices) are used and one of the predictors is chosen small (or zero, in the case of safety-net quantization) so that error propagation is limited to the frames that are encoded with strong prediction.
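The benefit of a small prediction coefficient can be seen directly: under the recursion of equation (1), a decoding error e in one frame contributes roughly Aᵐ·e after m frames (scalar case shown in this illustrative snippet):

```python
# Illustration of error propagation under prediction: if the decoder's frame k
# is off by e, frame k+m inherits approximately a**m * e (scalar case).

def propagated_error(e, a, m):
    """Residual error m frames after a decoding error e, coefficient a."""
    return e * (a ** m)
```

A coefficient near one keeps most of the error alive for many frames, a small coefficient drives it toward zero quickly, and a zero (safety-net) coefficient removes it after a single frame.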
Switched-predictive quantization (or safety-net quantization) is often used to encode speech parameters that have multiple classes of unique statistical characteristics; a speech signal has both stationary segments, in which the parameter vectors of the frames have large correlations from one frame to the next, and transition segments, in which the parameter vectors change rapidly between successive frames and thus have low correlations from one frame to the next. Typically, when switched-predictive quantization is used for speech, two predictor/codebook pairs are used: one weakly-predictive codebook with a small prediction coefficient (i.e., prediction matrix) that is close to zero, and one strongly-predictive codebook with a large prediction coefficient that is close to one. In the encoder, the parameter vector of a frame is quantized with both predictor/codebook pairs, and the pair providing the lesser quantization distortion is chosen. One example of a switched-predictive quantizer is the MSVQ quantizer described in the previously mentioned U.S. Pat. No. 6,122,608.
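The selection between the two pairs might be sketched as follows in the scalar case. The coefficients and step sizes are illustrative assumptions, and the mean vector of equations (1)-(3) is omitted for brevity:

```python
# Sketch of switched-predictive selection: quantize with both predictor/codebook
# pairs and keep the one with the lesser distortion. A uniform scalar quantizer
# stands in for each codebook.

def try_pair(x, xq_prev, a, step):
    """Quantize x with one (coefficient, step) pair; return (distortion, value)."""
    pred = a * xq_prev
    d = x - pred
    dq = round(d / step) * step
    xq = pred + dq
    return (x - xq) ** 2, xq

def switched_encode(x, xq_prev, pairs):
    """pairs: list of (prediction coefficient, quantizer step)."""
    results = [try_pair(x, xq_prev, a, step) for a, step in pairs]
    best = min(range(len(results)), key=lambda i: results[i][0])
    return best, results[best][1]    # chosen pair index + quantized value
```

For a stationary value close to its predecessor, the strongly-predictive pair leaves only a small difference to quantize and typically wins the comparison.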
As previously mentioned, switched-predictive quantization may provide additional encoding robustness in the presence of frame erasures. Because the prediction coefficient associated with a weakly-predictive codebook is small, the propagated error due to a prior erased frame decays much faster when a weakly-predictive codebook is used. For this reason, use of the weakly-predictive codebook is desired whenever possible. Further, if a safety-net codebook (with a zero predictor) is used instead of a weakly-predictive codebook, the propagated error vanishes entirely. Accordingly, use of a safety-net codebook is also desired whenever possible.
However, if a transition frame is lost to a frame erasure and is reconstructed with a frame-erasure concealment technique in the decoder, it is highly probable that the reconstructed frame differs significantly from the actual one, and many of the following stationary frames encoded with the strongly-predictive codebook will carry that large error, as the error does not decay rapidly when strong prediction is used. One approach to decreasing the error propagation in such cases is described in the cross-referenced U.S. Pat. No. 7,295,974. The cross-referenced patent describes a technique for decreasing the error propagation due to frame erasure in which the first stationary frame following a transition frame is also encoded with a weakly-predictive codebook. More specifically, this technique causes the first stationary frame occurring after a transition frame (transition frames themselves being encoded with the weakly-predictive codebook) always to be encoded with the weakly-predictive codebook, even if the quantization distortion of the weakly-predictive codebook is not smaller than the quantization distortion of the strongly-predictive codebook. Thus, even if the transition frame is erased, the error decays faster because of the low prediction coefficient of the weakly-predictive codebook. As a result, a large error does not propagate into the subsequent frames encoded with the strongly-predictive codebook.
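The resulting selection rule might be sketched as follows; the frame classification ("T" for transition, "S" for stationary) is assumed to be available from elsewhere in the coder, and the function name is illustrative:

```python
# Sketch of the codebook-selection rule described above: the first stationary
# frame after a transition frame is forced to the weakly-predictive codebook;
# otherwise the pair with the lesser quantization distortion wins.

def choose_codebook(frame_class, prev_class, dist_weak, dist_strong):
    if frame_class == "T":
        return "weak"                  # transitions use weak prediction
    if frame_class == "S" and prev_class == "T":
        return "weak"                  # forced: limits error propagation
    return "weak" if dist_weak < dist_strong else "strong"
```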
When this technique is used, the parameters of the first stationary frame may, under some circumstances, be quantized with a large quantization distortion. As discussed above, the weakly-predictive codebook is trained for transition frames. Therefore, if the weakly-predictive codebook is used for a stationary frame, the quantization distortion could be significantly larger than it would be with the strongly-predictive codebook. In addition, because the human ear is more sensitive to small changes in stationary frames, the increased quantization distortion may result in slight speech quality loss even when there are no frame erasures in the decoder.