1. Field of the Invention
The present invention relates to digital communication systems. More particularly, the present invention relates to the enhancement of speech quality when portions of a bit stream representing a speech signal are lost within the context of a digital communications system.
2. Background Art
In speech coding (sometimes called “voice compression”), a coder encodes an input speech or audio signal into a digital bit stream for transmission. A decoder decodes the bit stream into an output speech signal. The combination of the coder and the decoder is called a codec. The transmitted bit stream is usually partitioned into frames, and in packet transmission networks, each transmitted packet may contain one or more frames of a compressed bit stream. In wireless or packet networks, sometimes the transmitted frames or packets are erased or lost. This condition is called frame erasure in wireless networks and packet loss in packet networks. When this condition occurs, to avoid substantial degradation in output speech quality, the decoder needs to perform frame erasure concealment (FEC) or packet loss concealment (PLC) to try to conceal the quality-degrading effects of the lost frames. Because the terms FEC and PLC generally refer to the same kind of technique, they can be used interchangeably. Thus, for the sake of convenience, the term “frame erasure concealment”, or FEC, is used herein to refer to both.
One of the earliest FEC techniques is waveform substitution based on pattern matching, as proposed by Goodman, et al. in “Waveform Substitution Techniques for Recovering Missing Speech Segments in Packet Voice Communications”, IEEE Transaction on Acoustics, Speech and Signal Processing, December 1986, pp. 1440-1448. This scheme was applied to a Pulse Code Modulation (PCM) speech codec that performs sample-by-sample instantaneous quantization of a speech waveform directly. This FEC scheme uses a piece of decoded speech waveform that immediately precedes the lost frame as a template, and then slides this template back in time to find a suitable piece of decoded speech waveform that maximizes some sort of waveform similarity measure (or minimizes a waveform difference measure).
Goodman's FEC scheme then uses the section of waveform immediately following a best-matching waveform segment as the substitute waveform for the lost frame. To eliminate discontinuities at frame boundaries, the scheme also uses a raised cosine window to perform an overlap-add operation between the correctly decoded waveform and the substitute waveform. This overlap-add technique increases the coding delay. The delay occurs because at the end of each frame, there are many speech samples that need to be overlap-added, and thus final values cannot be determined until the next frame of speech is decoded.
Based on the work of Goodman as described above, David Kapilow developed a more sophisticated version of an FEC scheme for the G.711 PCM codec. This FEC scheme is described in Appendix I of the ITU-T Recommendation G.711.
The FEC scheme of Goodman and the FEC scheme of Kapilow are both limited to PCM codecs that use instantaneous quantization. Such PCM codecs are block-independent; that is, there is no inter-frame or inter-block codec memory, so the decoding operation for one block of speech samples does not depend on the decoded speech signal or speech parameters in any other block.
All PCM codecs are block-independent codecs, but a block-independent codec does not have to be a PCM codec. For example, a codec may have a frame size of 20 ms, and within this 20 ms frame there may be some codec memory that makes the decoding of certain speech samples in the frame dependent on decoded speech samples or speech parameters from other parts of the frame. However, as long as the decoding operation of each 20 ms frame does not depend on decoded speech samples or speech parameters from any other frame, then the codec is still block-independent.
One advantage of a block-independent codec is that there is no error propagation from frame to frame. After a frame erasure, the decoding operation of the very next good frame of transmitted speech data is completely unaffected by the erasure of the immediately preceding frame. In other words, the first good frame after a frame erasure can be immediately decoded into a good frame of output speech samples.
For speech coding, the most popular type of speech codec is based on predictive coding. Perhaps the first publicized FEC scheme for a predictive codec is a “bad frame masking” scheme in the original TIA IS-54 VSELP standard for North American digital cellular radio (rescinded in September 1996). The first FEC scheme for a predictive codec that performs waveform extrapolation in the excitation domain is probably the FEC system developed by Chen for the ITU-T Recommendation G.728 Low-Delay Code Excited Linear Predictor (CELP) codec, as described in U.S. Pat. No. 5,615,298 issued to Chen, entitled “Excitation Signal Synthesis During Frame Erasure or Packet Loss.” After the publication of these early FEC schemes for predictive codecs, many, many other FEC schemes have been proposed for predictive codecs, some of which are quite sophisticated.
Despite the fact that most of the speech codecs standardized in the last 15 years are predictive codecs, there are still some applications, such as Voice over Internet Protocol (VoIP), where the G.711 (8-bit logarithmic PCM) codec, or even the 16-bit linear PCM codec, is still used in order to ensure a very high signal fidelity. In such applications, none of the advanced FEC schemes developed for predictive codecs can be used, and typically G.711 Appendix I (Kapilow's FEC scheme) is used instead. However, G.711 Appendix I has the following drawbacks: (1) it requires an additional delay of 3.75 ms due to overlap-add, (2) it has a fairly large state memory requirement due to the use of a long history buffer with a length of three and a half times the maximum pitch period, (3) its performance is not as good as it can be.
What is needed therefore is an FEC technique for block-independent speech codecs that avoids the noted deficiencies associated with G.711 Appendix I. In particular, it is desirable for the FEC not to add additional delay. It is also desirable to have a state memory that is as small as possible. It is further desirable to achieve speech quality better than that produced by G.711 Appendix I.