1. Field of the Invention
Embodiments of the present invention generally relate to improving the intelligibility of voice calls, in particular for voice calls that may be subjected to one or more transcodings.
2. Description of Related Art
ITU-T Recommendation G.711 at 64 kbps and G.729 at 8 kbps are two codecs widely used in packet-switched telephony applications. ITU-T G.711 wideband extension (“G.711 WBE”) is an embedded wideband codec based on a narrowband core interoperable with ITU-T Recommendation G.711 (both μ-law and A-law) at 64 kbps.
ITU-T Recommendation G.711, also known as companded pulse code modulation (PCM), quantizes each input sample using 8 bits. The amplitude of the input signal is first compressed using a logarithmic law, uniformly quantized with 7 bits (plus 1 bit for the sign), and then expanded to bring it back to the linear domain. The G.711 standard defines two compression laws, the μ-law and the A-law. ITU-T Recommendation G.711 was designed specifically for narrowband input signals in the telephony bandwidth, i.e., 200-3400 Hz.
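The compress, quantize, and expand steps described above can be illustrated with a minimal Python sketch. It uses the continuous μ-law curve with μ = 255 and a plain uniform 7-bit quantizer; this is an assumption made for illustration and is not the bit-exact segmented encoding specified by G.711. The function names are hypothetical.

```python
import math

MU = 255.0  # mu-law parameter (assumed; continuous-curve approximation)

def mu_compress(x):
    """Map a linear sample in [-1, 1] onto the logarithmic domain [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y):
    """Inverse of mu_compress: return to the linear domain."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def encode(x):
    """Return (sign bit, 7-bit magnitude level) for a linear sample."""
    x = max(-1.0, min(1.0, x))                         # clip to full scale
    sign = 1 if x < 0 else 0
    level = min(127, int(abs(mu_compress(x)) * 128))   # uniform 7-bit quantizer
    return sign, level

def decode(sign, level):
    """Reconstruct a linear sample from the midpoint of the quantizer cell."""
    x = mu_expand((level + 0.5) / 128)
    return -x if sign else x
```

Because the quantizer is uniform in the compressed domain, the reconstruction error in this sketch scales roughly with the sample amplitude, so quiet and loud samples see a similar relative error.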
ITU-T Recommendation G.729, which uses conjugate-structure algebraic code-excited linear prediction (CELP), is based on a model of human speech in which the throat and mouth act as a linear filter driven by an excitation vector. For each frame in G.729, an encoder analyzes the input data and extracts the parameters of the CELP model, such as the linear prediction filter coefficients and the excitation vectors. The encoder searches through its parameter space, carries out the decode operation in each loop of the search, and compares the output signal of the decode operation (i.e., the synthesized signal) with the original speech signal.
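The search loop described above (decode each candidate, compare with the original) can be sketched as a toy analysis-by-synthesis search. The first-order filter, the four-entry codebook, and the gain set below are hypothetical values chosen for brevity; an actual G.729 encoder uses a higher-order filter, adaptive and fixed codebooks, and perceptual weighting, none of which are modeled here.

```python
def synthesize(excitation, a):
    """Toy all-pole synthesis filter: y[n] = e[n] + a * y[n-1]."""
    out, prev = [], 0.0
    for e in excitation:
        prev = e + a * prev
        out.append(prev)
    return out

def search(target, codebook, gains, a):
    """Try every (codebook entry, gain) pair, decode it, and keep the pair
    whose synthesized signal has the smallest squared error to the target."""
    best = None
    for idx, vec in enumerate(codebook):
        for g in gains:
            synth = synthesize([g * v for v in vec], a)
            err = sum((t - s) ** 2 for t, s in zip(target, synth))
            if best is None or err < best[2]:
                best = (idx, g, err)
    return best

# Hypothetical codebook and gains, for illustration only.
CODEBOOK = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 0, -1, 0],
    [0, 0, 1, 1],
]
GAINS = [0.5, 1.0, 2.0]

# A target that the model can represent exactly (entry 2 with gain 2.0).
target = synthesize([2 * v for v in CODEBOOK[2]], 0.5)
best_index, best_gain, best_error = search(target, CODEBOOK, GAINS, 0.5)
```

Only the winning index and gain would need to be transmitted in such a scheme, which is how parametric coders of this kind reach low bit rates.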
G.722 is an ITU standard codec that provides 7 kHz wideband audio at data rates of 48, 56 and 64 kbps. This is useful for VoIP applications, such as on a local area network where network bandwidth is readily available, and offers a significant improvement in speech quality over older narrowband codecs such as G.711, without an excessive increase in implementation complexity.
G.723.1 is an ITU standard codec that provides compressed voice audio at 5.3 kbps and 6.3 kbps. G.723.1 is mostly used in Voice over IP (“VoIP”) applications due to its low bandwidth requirement. G.723.1 is designed to represent speech with a high quality at the above rates using a limited amount of complexity. It encodes speech or other audio signals in frames using linear predictive analysis-by-synthesis coding. The excitation signal for the high rate coder is Multipulse Maximum Likelihood Quantization (MP-MLQ) and for the low rate coder is Algebraic-Code-Excited Linear Prediction (ACELP). The frame size is 30 ms and there is an additional look-ahead of 7.5 ms, resulting in a total algorithmic delay of 37.5 ms. All additional delays in this coder are due to processing delays of the implementation, transmission delays in the communication link and buffering delays of the multiplexing protocol.
Internet Low Bitrate Codec (“iLBC”) is an open-source narrowband speech codec described by RFC 3951. iLBC uses a block-independent linear-predictive coding (LPC) algorithm and supports frame lengths of 20 ms at 15.2 kbit/s and 30 ms at 13.33 kbit/s.
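As a worked check on the two iLBC modes, the payload carried by each frame follows from the bit rate multiplied by the frame length. The helper name below is hypothetical, and the result is rounded because 13.33 kbit/s is itself a rounded figure.

```python
def payload_bits(bit_rate_kbps, frame_ms):
    """Bits per frame: (kbit/s) * (ms) gives bits directly."""
    return round(bit_rate_kbps * frame_ms)

bits_20ms = payload_bits(15.2, 20)    # 304 bits
bits_30ms = payload_bits(13.33, 30)   # 400 bits
bytes_20ms = bits_20ms // 8           # 38 bytes
bytes_30ms = bits_30ms // 8           # 50 bytes
```

The longer frame carries more bits but spreads them over more time, which is why the 30 ms mode has the lower bit rate.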
SILK™ is an audio compression format and audio codec used by Skype™. SILK is usable with a sampling frequency of 8, 12, 16 or 24 kHz and a bit rate from 6 to 40 kbps. SILK is described in further detail in IETF document “draft-vos-silk-02.”
Filtering of an audio signal is integral to common speech codec operation. By the Nyquist theorem, a signal must be sampled at a rate of at least twice the highest frequency present in the source signal in order to avoid aliasing artifacts in the decoded audio signal. The required sampling rate can be reduced by using a low-pass filter to remove high-frequency components from the source signal, thereby substantially limiting the spectral content to a desired low-pass bandwidth. The roll-off characteristics of the low-pass filter result in some attenuation of higher-frequency spectral components that are still within the desired low-pass bandwidth.
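The aliasing that the low-pass filter is meant to prevent can be illustrated with a short sketch. It assumes an 8 kHz sampling rate, as is typical for narrowband telephony, and shows that an unfiltered 5 kHz tone produces exactly the same samples as a 3 kHz tone. The function names are hypothetical.

```python
import math

def alias_frequency(f_hz, fs_hz):
    """Apparent frequency of a real tone at f_hz after sampling at fs_hz."""
    f = f_hz % fs_hz
    return min(f, fs_hz - f)

def sample_tone(f_hz, fs_hz, count):
    """Return `count` samples of a cosine at f_hz taken at rate fs_hz."""
    return [math.cos(2 * math.pi * f_hz * n / fs_hz) for n in range(count)]

FS = 8000                                   # assumed sampling rate in Hz
apparent = alias_frequency(5000, FS)        # 3000 Hz, since 5000 > FS / 2
above_nyquist = sample_tone(5000, FS, 16)
in_band = sample_tone(3000, FS, 16)         # matches above_nyquist sample for sample
```

Once the two tones have been sampled they cannot be told apart, so any component above half the sampling rate has to be removed before sampling rather than afterwards.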
Some speech encoders such as G.711 and G.722 at 64 kbps use a relatively high bit rate in order to encode the raw audio waveform with relatively little encoding loss within the bandwidth of interest. Because such encoders encode the raw audio waveform more directly, no assumptions are made about the source of the raw audio waveform and the encoding is relatively high quality for non-speech sounds, within the available bandwidth and resolution limits.
In contrast, some lower bit rate speech encoders such as G.729 and G.723.1 operate on the principle of linear predictive coding (“LPC”), such that a lower bit rate is achieved by fitting the raw audio waveform to a parametric model of the human vocal tract, and then encoding the parameters of the model that upon decoding would produce a close approximation to the raw audio waveform. However, a drawback of such encoders is that if the raw audio waveform includes non-speech components (e.g., spectral levels or temporal dynamics not ordinarily found in human speech), the encoder produces a relatively lower quality encoding. That is, upon decoding, the decoded audio waveform would not be a good approximation to the raw audio waveform. Furthermore, in order to achieve a low bit rate encoding, high-frequency components of the raw audio waveform may be attenuated more than lower-frequency components.
Calls subjected to multiple transcodings by lower bit rate encoders may suffer from excessive high-frequency attenuation and, potentially, intelligibility problems. Hands-free calls in particular may experience higher attenuation, depending on the acoustic environment in which the speakerphone is positioned. A problem of the known art is that many speech codecs, such as narrowband voice codecs and in particular the G.729 codec, attenuate high-frequency speech components (i.e., greater than around 1500 Hz) with each encoding. As a rule of thumb, each G.729 encoding attenuates frequencies above 1500 Hz by around 3 dB for a clean input signal, such as a noise-free handset/headset recording.
A loss of high-frequency components is known to have a negative impact on speech intelligibility, in particular when dealing with fricative sounds such as the sound of the letter “f” versus the sound of the letter “s”. For example, consider a conference call with participants from different locations of a corporation. Participants call into a conferencing system using a single telephone number plus an ID code to identify the conference, and the conferencing system bridges the calls together. Voice signals to and from participants may be transmitted as a Voice over Internet Protocol (“VoIP”) call over a wide area network (“WAN”) linking the different corporate locations. Corporate policy may dictate that all calls crossing the WAN be established using the G.729 codec to conserve bandwidth. However, the conference bridge may only accept data encoded using G.711. Hence, media gateways situated immediately in front of the bridge transcode the audio stream from G.729 to G.711 and back to G.729. As a result, each call has to undergo two G.729 encoding steps (i.e., one in the endpoint and one in the gateway), resulting in an attenuation of the high frequencies in the audio stream of at least 6 dB.
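The 6 dB figure follows from adding the per-encoding losses in decibels. The sketch below takes the rule-of-thumb value of about 3 dB per G.729 encoding stated above as its default and converts the total into linear amplitude and power ratios. The function names are hypothetical.

```python
def tandem_attenuation_db(num_encodings, db_per_encoding=3.0):
    """Total high-frequency loss in dB; losses in dB add across stages."""
    return num_encodings * db_per_encoding

def amplitude_ratio(attenuation_db):
    """Remaining amplitude as a fraction of the original."""
    return 10 ** (-attenuation_db / 20)

def power_ratio(attenuation_db):
    """Remaining power as a fraction of the original."""
    return 10 ** (-attenuation_db / 10)

total_db = tandem_attenuation_db(2)       # endpoint + gateway: 6.0 dB
remaining_amp = amplitude_ratio(total_db) # about 0.50 of original amplitude
remaining_pow = power_ratio(total_db)     # about 0.25 of original power
```

Under this rule of thumb, two encodings leave the components above 1500 Hz at roughly half their original amplitude, and every further transcoding on the path adds another 3 dB.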
Therefore, a need exists to compensate for multiple encoding conversions and/or filtering, in order to provide improved speech intelligibility.