Traditionally, voice and video conferencing systems have communicated predominantly over reliable networks such as the Plain Old Telephone Service (POTS), the Integrated Services Digital Network (ISDN), or custom intranets. Increasingly, as people set up remote and home offices, voice and video conferencing systems connect over unreliable networks such as wireless networks or the public Internet. Such networks suffer packet loss and delay, sometimes at substantial levels, with the result that some audio packets never reach their destination conferencing system or arrive too late to be played. To keep the listener from hearing an audio dropout, a conferencing system typically uses some form of packet loss concealment (PLC).
PLC algorithms, also known as frame erasure concealment algorithms, hide transmission losses in an audio system in which the input signal is encoded and packetized at a transmitter, sent over a network, and decoded at a receiver, which then plays out the result. Many standard CELP-based speech coders, such as International Telecommunication Union Telecommunication Standardization Sector (ITU-T) Recommendations G.723.1, G.728, and G.729, have PLC algorithms built into their standards. ITU-T Recommendation G.711, Appendix I, describes a PLC algorithm for G.711 audio transmissions. G.711-encoded audio is sampled at 8 kHz and is typically partitioned into 10 ms frames of 80 samples each. Other encodings, packet sizes, and sampling rates may be used.
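The frame size follows directly from the sampling rate and frame duration: 8,000 samples per second × 0.010 s = 80 samples. The short Python helper below, whose name is hypothetical and used only for illustration, performs the same arithmetic for other rate and duration combinations.

```python
def samples_per_frame(sample_rate_hz: int, frame_ms: int) -> int:
    """Return the number of PCM samples in one frame of frame_ms milliseconds."""
    samples, remainder = divmod(sample_rate_hz * frame_ms, 1000)
    if remainder:
        # e.g. 11025 Hz with 10 ms frames does not give a whole number of samples
        raise ValueError("frame duration does not map to a whole number of samples")
    return samples


# G.711 at 8 kHz with 10 ms frames
print(samples_per_frame(8000, 10))  # 80
```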
The objective of PLC is to generate a synthetic speech signal to cover missing data (erasures) in a received bit stream. Ideally, the synthesized signal will have the same timbre and spectral characteristics as the missing signal and will not create unnatural artifacts. Since speech signals are often locally stationary, the signal's history can be used to generate a reasonable approximation of the missing segment. If an erasure is not too long and does not land in a region where the signal is changing rapidly, it may be inaudible after concealment.
The most popular PLC algorithms extrapolate from earlier pulse-code modulation (PCM) audio samples to synthesize a replacement for a lost audio packet. Two types of extrapolation are common: periodic extrapolation (PE) and non-periodic extrapolation (NPE). The two techniques can also be combined as a weighted sum.
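A minimal sketch of the weighted-sum combination in Python, assuming PE and NPE segments of equal length and a single constant weight per segment. The function name and the choice of a fixed weight are illustrative assumptions; a practical system might vary the weight, for example according to how periodic the recent history is.

```python
from typing import Sequence


def mix_extrapolations(pe: Sequence[float], npe: Sequence[float], weight: float) -> list[float]:
    """Combine periodic (PE) and non-periodic (NPE) extrapolations sample by sample.

    weight is the (hypothetical, constant) share given to PE; 1 - weight goes to NPE.
    """
    if len(pe) != len(npe):
        raise ValueError("PE and NPE segments must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [weight * p + (1.0 - weight) * n for p, n in zip(pe, npe)]


print(mix_extrapolations([1.0, 1.0], [0.0, 2.0], 0.75))  # [0.75, 1.25]
```

A weight of 1.0 reduces to pure PE and 0.0 to pure NPE.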
FIG. 1 depicts one technique 100 for periodic extrapolation according to the prior art. This technique is often used to extrapolate audio segments that have periodic elements. During normal operation, the receiver decodes each good packet or frame it receives and sends the output to the audio port. To support PLC, a circular history buffer is typically provided to save a copy of the decoded output. The waveforms used to perform the PLC are extracted from this buffer.
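The circular history buffer can be sketched as follows. This is a minimal Python illustration, not the buffer of any particular standard: the class name, its methods, and the capacity are hypothetical, and a real receiver would size the buffer to hold at least the longest span of history its extrapolation needs (for example, several pitch periods).

```python
class HistoryBuffer:
    """Fixed-capacity circular buffer holding the most recent decoded PCM samples."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._data = [0.0] * capacity
        self._capacity = capacity
        self._write = 0  # index where the next sample will be written
        self._count = 0  # number of valid samples stored (never exceeds capacity)

    def push(self, samples) -> None:
        """Store a copy of decoded samples, overwriting the oldest ones when full."""
        for s in samples:
            self._data[self._write] = float(s)
            self._write = (self._write + 1) % self._capacity
            self._count = min(self._count + 1, self._capacity)

    def last(self, n: int) -> list[float]:
        """Return the n most recent samples, oldest first."""
        if n > self._count:
            raise ValueError("not enough history stored")
        start = (self._write - n) % self._capacity
        return [self._data[(start + i) % self._capacity] for i in range(n)]


buf = HistoryBuffer(4)
buf.push([1, 2, 3, 4, 5, 6])  # 1 and 2 are overwritten
print(buf.last(3))  # [4.0, 5.0, 6.0]
```

In normal operation, each decoded good frame would be sent to the audio port and also passed to `push`; during a loss, `last` supplies the waveform to extrapolate from.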
A common PLC technique is to extrapolate new audio from the old audio for a fixed period. If the packet loss continues beyond that period, the extrapolated audio is attenuated to silence, because holding certain types of sounds too long without attenuation may create unnatural artifacts, even if the synthesized signal segment sounds natural in isolation. The extrapolated audio, the attenuated audio, and silence thus form the output of the PLC technique.
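The hold-then-attenuate behavior can be sketched as a per-sample gain applied to the extrapolated signal. In this minimal Python sketch, the function name, the linear shape of the ramp, and the example durations (80 samples of hold, which is 10 ms at 8 kHz, followed by a 400-sample ramp) are illustrative assumptions rather than values taken from any standard.

```python
def attenuate(extrapolated, hold_samples: int, ramp_samples: int) -> list[float]:
    """Pass the first hold_samples through unchanged, then ramp linearly to silence.

    After hold_samples + ramp_samples samples the gain stays at 0.0 (silence).
    """
    if hold_samples < 0 or ramp_samples <= 0:
        raise ValueError("hold_samples must be >= 0 and ramp_samples > 0")
    out = []
    for i, x in enumerate(extrapolated):
        if i < hold_samples:
            gain = 1.0  # fixed period of unattenuated extrapolation
        else:
            # linear decay; clamped so the gain never goes negative
            gain = max(0.0, 1.0 - (i - hold_samples + 1) / ramp_samples)
        out.append(gain * x)
    return out


# 75 ms of constant-level extrapolated audio at 8 kHz (hypothetical durations)
faded = attenuate([1.0] * 600, hold_samples=80, ramp_samples=400)
```

A per-sample ramp is used here instead of a per-frame step so that the level falls smoothly rather than in audible jumps.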
The simplest way to conceal packet losses by extrapolating from good audio is to take the last cycle or frame of periodic audio from the circular buffer and repeat it, as shown in box 110. Repeating a single cycle works well for short losses, but on long erasures it eventually sounds artificial and may introduce unnatural harmonic artifacts (beeps), particularly if the erasure occurs in an unvoiced region of speech or in a region of rapid transition such as a stop. Therefore, a PLC technique typically repeats one cycle for a fixed length of time, such as 10 ms, and then starts to repeat the last two cycles of audio from the last audio frame, as shown in box 120. After another fixed length of time, such as another 10 ms, the PLC algorithm may switch to repeating three cycles, as shown in box 130. Although the cycles are not played in the order in which they occurred in the original signal, the resulting output generally still sounds natural. The length of time used for each of the one-cycle, two-cycle, and three-cycle repetitions is represented as the switch rate 140 in FIG. 1 and is always fixed in the prior art.
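The one-, two-, and three-cycle repetition with a fixed switch rate can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the pitch period is taken as already known (pitch estimation is not shown), the function name and parameters are hypothetical, and the smoothing that practical implementations usually apply at cycle boundaries (for example, overlap-add) is omitted. When the number of cycles grows, playback restarts in the oldest cycle at the same phase within the pitch period, so the waveform stays phase-continuous.

```python
def periodic_extrapolation(history, pitch_period: int, num_samples: int,
                           switch_rate: int, max_cycles: int = 3) -> list[float]:
    """Repeat the last 1, then 2, then 3 pitch cycles of history.

    history: most recent decoded samples, oldest first.
    pitch_period: assumed-known pitch period, in samples.
    switch_rate: fixed number of output samples before one more cycle is added.
    """
    if pitch_period <= 0 or switch_rate <= 0:
        raise ValueError("pitch_period and switch_rate must be positive")
    if len(history) < max_cycles * pitch_period:
        raise ValueError("history too short for the requested number of cycles")

    out = []
    cycles = 0
    pattern = []
    pos = 0
    for t in range(num_samples):
        wanted = min(1 + t // switch_rate, max_cycles)
        if wanted != cycles:
            cycles = wanted
            pattern = list(history[-cycles * pitch_period:])
            pos %= pitch_period  # same phase, now starting in the oldest cycle
        out.append(pattern[pos])
        pos = (pos + 1) % len(pattern)
    return out


# Toy history of 12 samples, pitch period 4, switching every 8 samples
print(periodic_extrapolation(list(range(12)), 4, 24, 8))
# [8, 9, 10, 11, 8, 9, 10, 11, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7]
```

With 8 kHz audio, a `switch_rate` of 80 samples would correspond to the 10 ms example above.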
The output of the technique of FIG. 1 is the PE. The total extrapolation output of PLC is typically generated as a weighted sum of the PE and NPE components. One prior art technique for generating NPE is shown in FIG. 2. In this technique, a noise generator 210 generates noise that is shaped by a shaping filter 220 to produce the NPE. This extrapolation technique works reasonably well on audio segments that have non-periodic elements.
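The noise-generator and shaping-filter structure can be sketched in Python as white Gaussian noise passed through an all-pole filter. The filter type is an assumption: an all-pole filter of the form gain / A(z) is one common choice of shaping filter, and its coefficients would typically be estimated from the history buffer (for example, by linear prediction), which is not shown. The function name, coefficient values, and gain below are hypothetical.

```python
import random


def shaped_noise(num_samples: int, lpc, gain: float, seed: int = 0) -> list[float]:
    """White Gaussian noise shaped by the all-pole filter gain / A(z).

    A(z) = 1 + sum_{k=1..p} lpc[k-1] * z^-k, so the recursion is
    y[n] = gain * e[n] - sum_{k=1..p} lpc[k-1] * y[n-k].
    """
    rng = random.Random(seed)  # noise generator (seeded for repeatability)
    order = len(lpc)
    y = []
    for n in range(num_samples):
        acc = gain * rng.gauss(0.0, 1.0)
        for k in range(1, order + 1):
            if n - k >= 0:
                acc -= lpc[k - 1] * y[n - k]
        y.append(acc)
    return y


# One-pole low-pass shaping (hypothetical coefficient): y[n] = 0.1*e[n] + 0.9*y[n-1]
npe = shaped_noise(160, lpc=[-0.9], gain=0.1)
```

The coefficient magnitude must keep the filter stable; for the one-pole example, any value with absolute value below 1 does.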
Ideally, PLC would produce audio so natural that the listener is unaware of the packet losses. In practice, however, PLC often introduces audible artifacts. The dominant artifact may be described as buzziness; another commonly heard artifact could subjectively be described as choppiness. As the network packet loss rate increases, these artifacts become increasingly objectionable.