1. Field of the Invention
The present invention relates to digital communications systems. More particularly, the present invention relates to the enhancement of audio quality when portions of an encoded bit stream representing an audio signal, such as a speech signal, are lost within the context of a digital communications system.
2. Background
In speech coding (sometimes called “voice compression”), a coder encodes an input speech signal into a digital bit stream for transmission. A decoder decodes the bit stream into an output speech signal. The combination of the coder and the decoder is called a codec. The transmitted bit stream is usually partitioned into segments called frames, and in packet transmission networks, each transmitted packet may contain one or more frames of a compressed bit stream. In wireless or packet networks, sometimes the transmitted frames or packets are erased or lost. This condition is typically called frame erasure in wireless networks and packet loss in packet networks. When this condition occurs, to avoid substantial degradation in output speech quality, the decoder needs to perform frame erasure concealment (FEC) or packet loss concealment (PLC) to try to conceal or otherwise mitigate the quality-degrading effects of the lost frames. Because the terms FEC and PLC generally refer to the same kind of technique, they can be used interchangeably. Thus, for the sake of convenience, the term “packet loss concealment,” or PLC, will be used herein to refer to both.
A number of PLC techniques have been developed. These techniques can be broadly classified into sender-based or receiver-based approaches. (See, C. Perkins, et al., “A Survey of Packet Loss Recovery Techniques for Streaming Audio,” IEEE Network Magazine, pp. 40-48, September/October 1998). Some PLC schemes may consist of varying mixtures of the two classes. Sender-based PLC schemes require modifications to a transmitter and are generally based on the transmission of redundant information or the use of interleaving. Receiver-based PLC schemes are confined to a receiver and attempt to mitigate the effects of a lost frame by utilizing the speech signal in neighboring received frames.
At the receiver, the mitigation problem is either one of prediction or estimation. In the case of prediction, the PLC scheme uses only portions of a speech signal that precede one or more lost frames (also referred to herein as “past speech” or “past frames”) to “predict” the speech signal in the lost frame(s). Portions of the speech signal that follow the lost frame(s) (also referred to herein as “future speech” or “future frames”) are not used. In the case of estimation, however, both the past speech and future speech are available and are used to “estimate” the speech signal in the lost frame(s). In certain cases, future frames are obtained through the use of a jitter buffer. Rather than directly playing out the speech samples carried by packets as they arrive at the receiver, a jitter buffer holds the speech samples for a period of time. The amount of delay added by the jitter buffer is often based on the monitored arrival time of packets from the transmitter. A PLC scheme that uses a jitter buffer may employ some form of time-scale modification in the playback of the speech signal in order to increase or reduce the amount of data in the jitter buffer and to adapt to dynamic network delay conditions.
A popular method for PLC is based on periodic waveform extrapolation (PWE). In PWE, the missing data is concealed by repeating a pitch signal based on the pitch period of a neighboring speech signal. PWE may be performed in either the excitation domain (see, e.g., C. R. Watkins and J.-H. Chen, “Improving 16 kb/s G.728 LD-CELP Speech Coder for Frame Erasure Channels,” ICASSP, pp. 241-244, May 1995; R. Salami, et al., “Design and Description of CS-ACELP: a Toll Quality 8 kb/s Speech Coder,” IEEE Trans. Speech and Audio Processing, Vol. 6, No. 2, pp. 116-130, March 1998) or the speech domain (see, e.g., J.-H. Chen, “Packet Loss Concealment for Predictive Speech Coding Based on Extrapolation of Speech Waveform,” ACSSC 2007, pp. 2088-2092, November 2007; J.-H. Chen, “Packet loss concealment based on extrapolation of speech waveform,” ICASSP 2009, pp. 4129-4132, April 2009). A major challenge associated with PWE is avoiding signal discontinuity in the transition between the concealment waveform and the received speech signal. In excitation domain PWE, any signal discontinuity is mostly smoothed out by synthesis filtering. In speech domain PWE, an overlap-add is typically used to perform smoothing. In particular, in the first good frame after frame loss, the extrapolated signal is extended into a first portion of the received signal and used in the overlap-add operation. In the transition from concealment waveform to received speech, a delay may be used to enable the overlap-add. (See, ITU-T, “G.711, Appendix I: A High Quality Low-complexity Algorithm for Packet Loss Concealment with G.711,” 1999). The additional delay associated with this scheme may be circumvented by utilizing the “ringing” of a synthesis filter. (See, J.-H. Chen, “Packet Loss Concealment for Predictive Speech Coding Based on Extrapolation of Speech Waveform,” ACSSC 2007, pp. 2088-2092, November 2007).
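By way of illustration only, speech-domain PWE followed by an overlap-add into the first good frame might be sketched as follows. The sketch is not taken from any of the cited references. The function names, the linear cross-fade window, and the assumption of a fixed, known pitch period expressed in samples are all hypothetical simplifications.

```python
import math

def extrapolate_periodic(history, pitch_period, n_samples):
    """Extend `history` by n_samples, repeating the waveform one pitch period back.

    Assumes 0 < pitch_period <= len(history) (hypothetical simplification).
    """
    out = []
    for n in range(n_samples):
        idx = len(history) + n - pitch_period
        if idx < len(history):
            out.append(history[idx])                 # copy from received past speech
        else:
            out.append(out[idx - len(history)])      # copy from already-extrapolated samples
    return out

def overlap_add(fade_out, fade_in):
    """Cross-fade two equal-length segments with complementary linear windows."""
    length = len(fade_out)
    assert len(fade_in) == length
    merged = []
    for n in range(length):
        w = (n + 1) / (length + 1)                   # weight ramps up for the received signal
        merged.append((1.0 - w) * fade_out[n] + w * fade_in[n])
    return merged

# Illustrative values: a perfectly periodic signal with a 40-sample pitch period,
# one lost 80-sample frame, and a 20-sample overlap into the first good frame.
period, frame, overlap = 40, 80, 20
past = [math.sin(2 * math.pi * n / period) for n in range(2 * frame)]
concealment = extrapolate_periodic(past, period, frame + overlap)
good_frame = [math.sin(2 * math.pi * n / period) for n in range(3 * frame, 4 * frame)]
transition = overlap_add(concealment[frame:], good_frame[:overlap])
```

In this idealized case the pitch period is constant, so the extended concealment waveform and the first good frame are in phase and the overlap-add is transparent. Real speech does not in general satisfy this assumption.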
It has been reported that most of the distortion associated with PLC is not from the lost frames, but from the frames after packet loss, often due to misalignment between the extrapolated waveform and the received signal. (See J.-H. Chen, “Packet loss concealment based on extrapolation of speech waveform,” ICASSP 2009, pp. 4129-4132, April 2009). As discussed above, to avoid discontinuity, the PWE waveform can be extended beyond the end of the lost frame and an overlap-add operation with the first good frame after packet loss can then be performed. However, the true pitch period of the lost frame(s) in general does not follow the pitch track used during the waveform extrapolation. As a result, the extrapolated signal and the speech signal in the first good frame may be out of phase and destructive interference can occur in the overlap-add region causing an audible distortion.
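The effect of phase misalignment on the overlap-add region can be illustrated numerically. The following minimal sketch assumes pure sinusoids in place of speech, a linear cross-fade window, and a hypothetical helper name. It compares the energy of the merged signal when the extrapolated and received waveforms are in phase with the energy when they are 180 degrees out of phase.

```python
import math

def overlap_add_energy(phase_offset, period=40, length=40):
    """Energy of a linear cross-fade from an extrapolated sinusoid to a
    received sinusoid that is shifted by `phase_offset` radians."""
    energy = 0.0
    for n in range(length):
        w = (n + 1) / (length + 1)                   # weight of the received signal
        extrapolated = math.sin(2 * math.pi * n / period)
        received = math.sin(2 * math.pi * n / period + phase_offset)
        y = (1.0 - w) * extrapolated + w * received
        energy += y * y
    return energy

in_phase = overlap_add_energy(0.0)
out_of_phase = overlap_add_energy(math.pi)
print(f"in phase: {in_phase:.2f}, out of phase: {out_of_phase:.2f}")
```

With the waveforms 180 degrees out of phase, the two windowed signals partially cancel, and the merged signal passes through a null near the middle of the overlap region. This loss of energy corresponds to the destructive interference described above.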
Different estimation techniques have been proposed in the literature to combat the issue of phase alignment of the extrapolated signal and the received speech signal. For example, one technique performs interpolation between the previous good frame(s) and future good frame(s) on either side of the packet loss. (See N. Aoki, et al., “Development of a VoIP System Implementing a High Quality Packet Loss Concealment Technique,” Canadian Conference on Electrical and Computer Engineering, pp. 308-311, May 2005). However, doing so requires the extraction of the pitch period of the speech segment after the packet loss, which in turn requires a long segment of decoded speech after the packet loss to be available. Typically, 25 to 35 milliseconds (ms) of decoded speech must be buffered. In another technique, the PLC algorithm uses the decoded speech waveform associated with a future frame to guide the pitch contour of waveform extrapolation during the lost frame such that the extrapolated waveform is phase-aligned with the decoded speech waveform after the packet loss. (See J.-H. Chen, “Packet loss concealment based on extrapolation of speech waveform,” ICASSP 2009, pp. 4129-4132, April 2009). This technique also requires future frame(s) to be buffered, but since the pitch period is not explicitly estimated in the future speech, the delay requirement is reduced.
The estimation methods above introduce delay, requiring speech to be buffered at the receiver. In R. Zopf, J. Thyssen, and J.-H. Chen, “Time-Warping and Re-Phasing in Packet Loss Concealment,” Proc. Interspeech 2007—Eurospeech, pp. 1677-1680, Antwerp, Belgium, Aug. 27-31, 2007, time-warping is used to stretch or shrink the time axis of the signal received in the first good frame after frame loss to align it with the extrapolated signal used to conceal the lost frame. This prediction technique avoids the introduction of additional delay by modifying the received signal after packet loss as opposed to modifying the extrapolation signal during packet loss.
The above techniques have drawbacks and limitations. The estimation techniques require frame(s) to be buffered at the decoder, thus introducing additional delay. This is a fixed delay introduced into the system regardless of network conditions. Even in perfect network conditions with no packet loss, the additional delay must be introduced. The two-sided estimation technique presented in the reference by N. Aoki, et al. does not work when the pitch variation in the missing speech segment is not linear. This is illustrated in FIGS. 1A and 1B. In particular, FIG. 1A shows the pitch cycle phase associated with three frames of a speech signal as a function of time, wherein the second frame is lost. The three frames are designated “last good frame,” “current bad frame” and “next good frame,” respectively. The various pitch periods associated with the speech signal across the three frames are shown as p0, p1 and p2, wherein p2>p1>p0. As shown in FIG. 1A, during the lost frame, the pitch period slowly increases and decreases. FIG. 1B shows that when the two-sided estimation technique is applied to replace the lost frame shown in FIG. 1A, the result is the creation of two out-of-phase waveforms. In particular, the technique results in the extrapolation of a first waveform 102 based on the last good frame and the extrapolation of a second waveform 104 based on the next good frame, wherein first waveform 102 and second waveform 104 are out of phase. In further accordance with the two-sided estimation technique, the two out-of-phase waveforms are combined using an overlap-add operation, which results in destructive interference.
All of the techniques described above have a limited amount of time for the phase adjustment. For estimation approaches that provide a one-frame look-ahead, the phase adjustment must be achieved within the length of the lost frame. In the case of the approach presented in the aforementioned reference entitled “Time-Warping and Re-Phasing in Packet Loss Concealment,” the time-warping is applied only within the length of the first good frame. Hence, in these approaches, the phase adjustment must be achieved within a single frame. This should be sufficient in the case of isolated frame loss where only a single frame is missing. However, for consecutive frame loss, the natural phase evolution that has occurred over the period of multiple frames must now be applied in a single frame. In fact, it was noted in the aforementioned reference entitled “Time-Warping and Re-Phasing in Packet Loss Concealment” that the amount of time-warping was tuned to be constrained to ±1.75 milliseconds (ms) for 10 ms frames. Time-warping by more than this may remove the destructive interference, but often introduces some other audible distortion.
The foregoing problem is illustrated in FIG. 2. In particular, FIG. 2 shows the pitch cycle phase associated with three frames of a speech signal 202 as a function of time, wherein the first and second frames are lost and the third frame represents the first good frame after the lost frames. The three frames are designated “first bad frame,” “second bad frame” and “first good frame,” respectively. In accordance with this scenario, an estimation solution that provides a one-frame look-ahead becomes one of prediction because both the first and second frames are lost. Since the speech signal is not known in the second bad frame, the first bad frame must be extrapolated using the pitch from only the last good frame. If the third frame is also lost, the second bad frame must be extrapolated again using the same pitch.
As shown in FIG. 2, the pitch period associated with speech signal 202 slowly increases during the three frames. In contrast, during the lost frames, an extrapolated waveform 204 generated to replace the lost frames has a fixed pitch period that is based on a previous good frame. Consequently, the phases of speech signal 202 and extrapolated waveform 204 diverge. In particular, by the end of the second bad frame, extrapolated waveform 204 and speech signal 202 are 180 degrees out of phase. This phase misalignment must be corrected in the first good frame by generating a waveform 206 exhibiting unnatural phase evolution. Adjustment of the phase by this amount in a limited amount of time may introduce an audible distortion.
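The phase divergence described above can be illustrated with a short calculation. In the following sketch, the function name, the assumed 8 kHz sampling rate, the linear pitch trajectory, and the specific pitch values are all hypothetical. They are chosen for illustration and are deliberately exaggerated; they are not taken from FIG. 2. The sketch accumulates the pitch cycle phase of a signal whose pitch period grows over two lost 10 ms frames and compares it with the phase of an extrapolation that holds the pitch period fixed.

```python
def phase_drift_cycles(p_start, p_end, n_samples):
    """Phase lead, in pitch cycles, of a fixed-pitch extrapolation over a signal
    whose pitch period changes linearly from p_start to p_end samples."""
    true_phase = 0.0
    extrapolated_phase = 0.0
    for n in range(n_samples):
        p = p_start + (p_end - p_start) * n / n_samples
        true_phase += 1.0 / p                        # actual signal slows as pitch period grows
        extrapolated_phase += 1.0 / p_start          # extrapolation keeps the last known pitch
    return extrapolated_phase - true_phase

# Assumed values: 8 kHz sampling, two lost 10 ms frames (160 samples),
# pitch period rising from 40 to 52 samples across the loss.
drift = phase_drift_cycles(p_start=40, p_end=52, n_samples=160)
correction_ms = drift * 52 / 8000 * 1000             # time shift needed at the final pitch period
print(f"drift: {drift:.3f} cycles, correction: {correction_ms:.2f} ms")
```

Under these assumed values the drift is close to half a pitch cycle, that is, approximately 180 degrees, and removing it within a single 10 ms good frame would call for a time shift of roughly 3 ms, which is well beyond the ±1.75 ms constraint noted above.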
What is needed, then, is an approach to performing PLC that operates to merge an extrapolated signal generated to replace one or more lost frames of an audio signal with a received signal representing one or more subsequent good frames of the audio signal in a manner that avoids signal discontinuity and audible artifacts resulting therefrom. The desired approach should operate to align the phase of the extrapolated signal and the received signal in a manner that does not require the introduction of a fixed delay as required by estimation-based PLC schemes. The desired approach should also overcome the constraints associated with prediction-based PLC schemes that utilize time-warping and require the entirety of the phase adjustment to be achieved within the first good frame.
Another major source of distortion associated with PLC is the loss of one or more frames that include transitions, such as transitions from unvoiced to voiced sounds, from voiced to unvoiced sounds, and from one voiced sound to another voiced sound. Loss of the frame(s) containing the transition region will often result in an audible artifact during PLC if the transition is not handled carefully. For estimation-based PLC, where the future frames are buffered before playback, the frames before and after the packet loss can be classified and the transition can be detected and estimated accordingly. The problem occurs in prediction-based PLC, when only the past speech is available. In this case, the upcoming transition is not known or is very difficult to predict accurately. The prediction-based PLC scheme may conceal the transition with the previous signal type and then perform an overlap-add of the different signals in the first good frame. Unfortunately, the overlap-add of these different signals does not accurately reproduce the transition region and an audible artifact often results. What is also needed, then, is an approach to performing prediction-based PLC that can conceal the loss of one or more frames containing a transition region in a manner that will not result in an audible artifact.