1. Field of the Invention
The invention generally relates to systems and methods for concealing the quality-degrading effects of packet loss in a speech coder.
2. Background
In speech coding (sometimes called “voice compression”), a coder encodes an input speech signal into a digital bit stream for transmission. A decoder decodes the bit stream into an output speech signal. The combination of the coder and the decoder is called a codec. The transmitted bit stream is usually partitioned into segments called frames, and in packet transmission networks, each transmitted packet may contain one or more frames of a compressed bit stream. In wireless or packet networks, the transmitted frames or packets are sometimes erased or lost. This condition is typically called frame erasure in wireless networks and packet loss in packet networks. When this condition occurs, to avoid substantial degradation in output speech quality, the decoder needs to perform frame erasure concealment (FEC) or packet loss concealment (PLC) to try to conceal the quality-degrading effects of the lost frames. Because the terms FEC and PLC generally refer to the same kind of technique, they can be used interchangeably. Thus, for the sake of convenience, the term “packet loss concealment,” or PLC, is used herein to refer to both.
Most PLC algorithms utilize a technique referred to as periodic waveform extrapolation (PWE). In accordance with this technique, the missing speech waveform is extrapolated from the past speech by periodic repetition. The period of the repetition is based on an estimated pitch derived by analyzing the past speech. This technique assumes the speech signal is stationary over both the analyzed past speech and the missing segment. Most speech segments can be modeled as stationary for about 20 milliseconds (ms). Beyond this point, the signal has deviated too much and the stationarity model no longer holds. As a result, most PWE-based PLC schemes begin to attenuate the synthesized speech signal beyond about 20 ms.
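The PWE technique described above can be illustrated with the following minimal sketch. It is not the implementation of any particular codec; the function names, the autocorrelation-based pitch search, the sample rate, and the linear ramp-down after 20 ms are all illustrative assumptions chosen to make the idea concrete.

```python
# Illustrative sketch of PWE-based packet loss concealment.
# All names and parameters are hypothetical, not from any standard codec.
import numpy as np

def estimate_pitch(past, fs=8000, min_f0=60, max_f0=400):
    """Estimate the pitch period (in samples) of the past waveform
    using a normalized autocorrelation search over candidate lags."""
    min_lag = int(fs / max_f0)          # shortest candidate period
    max_lag = int(fs / min_f0)          # longest candidate period
    window = past[-2 * max_lag:]        # analyze only recent history
    best_lag, best_corr = min_lag, -1.0
    for lag in range(min_lag, max_lag + 1):
        a = window[-lag:]               # most recent period candidate
        b = window[-2 * lag:-lag]       # the period before it
        denom = np.sqrt(np.dot(a, a) * np.dot(b, b)) + 1e-12
        corr = np.dot(a, b) / denom     # normalized similarity
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    return best_lag

def conceal_frame(past, frame_len, lost_ms_so_far, fs=8000):
    """Extrapolate one lost frame by periodically repeating the last
    pitch period, attenuating once the erasure exceeds ~20 ms (the
    point where the stationarity assumption breaks down)."""
    period = estimate_pitch(past, fs)
    reps = int(np.ceil(frame_len / period))
    synth = np.tile(past[-period:], reps)[:frame_len]
    # Time (ms) of each concealed sample since the start of the loss.
    t = lost_ms_so_far + np.arange(frame_len) / fs * 1000.0
    # Unity gain up to 20 ms, then an assumed linear ramp to silence.
    gain = np.clip(1.0 - np.maximum(t - 20.0, 0.0) / 40.0, 0.0, 1.0)
    return synth * gain
```

In practice a codec's PLC module would also smooth the boundary between the last good frame and the extrapolated signal to avoid waveform discontinuities; that step is omitted here for brevity.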
Because it misses the larger overall statistical trends of speech-related parameters such as formants, pitch, voicing and energy, conventional PWE-based PLC is limited by the validity of the assumed stationarity of the speech signal. It would be beneficial if the PLC technique could focus on the larger context of speech signal statistical evolution, thereby providing a superior model of how speech-related parameters vary over time. For example, depending upon the language, the average length of a phoneme (the smallest segmental unit of sound employed to form meaningful contrasts between utterances in a given language) may be around 100 ms, which is significantly longer than 20 ms. This phoneme length may provide a better context within which to model the evolution of speech-related parameters.
For example, in English, each of the phonemes can be classified as either a continuant or a non-continuant sound. Continuant sounds are produced by a fixed (non-time-varying) vocal tract excited by the appropriate source. The class of continuant sounds includes the vowels, the fricatives (both voiced and unvoiced), and the nasals. The remaining sounds (diphthongs, semivowels, stops and affricates) are produced by a changing vocal tract configuration and are classified as non-continuants. This results in essentially stationary formants and spectral envelope for continuant sounds, and evolving formants and spectral envelope for non-continuant sounds. Similar correlations can be found in the time variation of other speech-related parameters such as pitch, voicing, gain, etc., for the larger speech signal context of phonemes or similar segmental units of sound.
Different methods for capturing the speech context have been proposed, such as Gaussian Mixture Models (GMMs), Hidden Markov Models (HMMs) and n-grams. These methods are promising but are plagued by high complexity and storage requirements.