1. Field of the Invention
The present invention relates to digital communications. More particularly, the present invention relates to the enhancement of speech quality when frames of a compressed bit stream representing a speech signal are lost within the context of a digital communications system.
2. Related Art
In speech coding, sometimes called voice compression, a coder encodes an input speech or audio signal into a digital bit stream for transmission. A decoder decodes the bit stream into an output speech signal. The combination of the coder and the decoder is called a codec. The transmitted bit stream is usually partitioned into frames. In wireless or packet networks, sometimes frames of transmitted bits are lost, erased, or corrupted. This condition is called frame erasure in wireless communications. The same condition of erased frames can happen in packet networks due to packet loss. When frame erasure happens, the decoder cannot perform normal decoding operations since there are no bits to decode in the lost frame. During erased frames, the decoder needs to perform frame erasure concealment (FEC) operations to try to conceal the quality-degrading effects of the frame erasure.
One of the earliest FEC techniques is waveform substitution based on pattern matching, as proposed by Goodman, et al. in “Waveform Substitution Techniques for Recovering Missing Speech Segments in Packet Voice Communications”, IEEE Transaction on Acoustics, Speech and Signal Processing, December 1986, pp. 1440–1448. This scheme was applied to Pulse Code Modulation (PCM) speech codec that performs sample-by-sample instantaneous quantization of speech waveform directly. This FEC scheme uses a piece of decoded speech waveform immediately before the lost frame as the template, and slides this template back in time to find a suitable piece of decoded speech waveform that maximizes some sort of waveform similarity measure (or minimizes a waveform difference measure).
Goodman's FEC scheme then uses the section of waveform immediately following a best-matching waveform segment as the substitute waveform for the lost frame. To eliminate discontinuities at frame boundaries, the scheme also uses a raised cosine window to perform an overlap-add technique between the correctly decoded waveform and the substitute waveform. This overlap-add technique increases the coding delay. The delay occurs because at the end of each frame, there are many speech samples that need to be overlap-added to obtain the final values, and thus cannot be played out until the next frame of speech is decoded.
Based on the work of Goodman above, David Kapilow developed a more sophisticated version of an FEC scheme for G.711 PCM codec. This FEC scheme is described in Appendix I of the ITU-T Recommendation G.711. However, both the FEC of Goodman and the FEC scheme of Kapilow are limited to PCM codecs with instantaneous quantization.
For speech coding, the most popular type of speech codec is based on predictive coding. Perhaps the first publicized FEC scheme for a predictive codec is a “bad frame masking” scheme in the original TIA IS-54 VSELP standard for North American digital cellular radio (rescinded in September 1996). Here, upon detection of a bad frame, the scheme repeats the linear prediction parameters of the last frame. This scheme derives the speech energy parameter for the current frame by either repeating or attenuating the speech energy parameter of last frame, depending on how many consecutive bad frames have been counted. For the excitation signal (or quantized prediction residual), this scheme does not perform any special operation. It merely decodes the excitation bits, even though they might contain a large number of bit errors.
The first FEC scheme for a predictive codec that performs waveform substitution in the excitation domain is probably the FEC system developed by Chen for the ITU-T Recommendation G.728 Low-Delay Code Excited Linear Predictor (CELP) codec, as described in U.S. Pat. No. 5,615,298 issued to Chen, titled “Excitation Signal Synthesis During Frame Erasure or Packet Loss.” In this approach, during erased frames, the speech excitation signal is extrapolated depending on whether the last frame is a voiced or an unvoiced frame. If it is voiced, the excitation signal is extrapolated by periodic repetition. If it is unvoiced, the excitation signal is extrapolated by randomly repeating small segments of speech waveform in the previous frame, while ensuring the average speech power is roughly maintained.
What is needed therefore is an FEC technique that avoids the noted deficiencies associated with the conventional decoders. For example, what is needed is an FEC technique that avoids the increased delay created in the overlap-add operation of Goodman's approach. What is also needed is an FEC technique that can ensure the smooth reproduction of a speech or audio waveform when the next good frame is received.