This application is directed generally to digitally encoded speech and in particular to enhancing the quality of digitally encoded speech transmitted over media susceptible to packet loss.
The use of digital systems to transmit human speech has become commonplace. Wireless telephony, VOIP, CDMA, GSM, WiFi, and ethernet are just a few examples of such applications. Typically, speech in analog form is converted into digital data, i.e. digitally encoded, at its source by a digital encoder. The digitally encoded speech is then divided into manageable data groups, or “packets” for transmission over a communications medium.
Unfortunately, known communications media often experience “packet loss”, in which data groups are lost during transmission. Packet Loss can occur for a variety of reasons including link failure, high levels of congestion that lead to buffer overflow in routers, Random Early Detection (RED), Ethernet problems, and the occasional misrouted packet. The missing data occurring as a result of packet loss can produce pops, random noise, or silence at the receiving end. In such instances, the end user of the system receives garbled, often unintelligible speech.
Packet Loss Concealment (“PLC”) is a technique used to mask the effects of missing sound data due to lost or discarded packets. PLC is generally effective only for small numbers of consecutive lost packets, for example a total of 20-30 milliseconds of speech, and for low packet loss rates. Packet loss can be bursty in nature—with periods of several seconds during which packet loss may be 20-30 percent. The average packet loss rate for a sound transmission session may be low. However, even short periods of high loss rate can cause noticeable degradation in the quality of transmitted sound. PLC algorithms can be implemented simply by inserting silence or “white noise” in place of missing packets. Other PLC algorithms involve either replaying the last packet received (“replay”) or some more sophisticated algorithm that uses previous speech samples to generate speech. Simple replay algorithms tend to lead to “robotic” sounding speech when multiple consecutive packets are lost. More sophisticated algorithms can provide reasonable quality at 20% packet loss rates. Unfortunately, sophisticated algorithms can consume DSP bandwidth and hence reduce the number of channels that can be supported in, for example, a high density gateway.
Turning next to speech itself, linguists classify the speech sounds used in a language into a number of abstract categories called phonemes. American English, for example, has about 41 phonemes, although the number varies according to the dialect of the speaker and the system employed by the linguist doing the classification. Phonemes are abstract categories which allow us to group together subsets of speech sounds. Even though no two speech sounds, or phones, are identical, all of the phones classified into one phoneme category are similar enough so that they convey the same meaning. The phoneme can be defined as “the smallest meaningful psychological unit of sound.” The phoneme has mental, physiological, and physical substance: our brains process the sounds; the sounds are produced by the human speech organs; and the sounds are physical entities that can be recorded and measured.