1. Field of the Invention
The field of the invention relates generally to the cancellation of an echo signal in a voice communication system. More particularly, the invention relates to an echo cancellation system that uses pitch information and/or other speech characteristics.
2. Related Art
The perception of speech is a complex process. It is not yet clear how the human auditory system processes the speech signal. However, it is known that both temporal and spectral analyses of the speech signal are performed. This can be used as a justification for analyzing the speech signal in terms of its frequency-domain as well as its time-domain characteristics.
For most speech sounds, the envelope of the power spectrum is the main factor determining their linguistic interpretation. In fact, in common classifications of speech sounds it is possible to provide a typical power spectrum for each particular speech sound. For voiced segments of speech (e.g., vowels), the fine structure of the power spectrum displays a harmonic structure. That is, sharp peaks in the power spectrum occur at regularly spaced frequency intervals of 75 to 400 Hz, the interval being dependent on the speaker and the utterance. The spacing between the harmonics is called the fundamental frequency. According to basic signal-processing theory, it follows that a harmonic structure in the speech spectrum corresponds to a periodic time-domain signal. Therefore, voiced speech segments have a nearly harmonic frequency-domain structure and a nearly periodic time-domain structure.
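The correspondence between a harmonic spectrum and a periodic waveform can be written out explicitly. The following is an illustrative idealization, not a formula taken from this description: the symbols $a_k$, $\varphi_k$, $K$, $f_0$ and $T_0$ are assumed here, and real voiced speech satisfies the relation only approximately.

```latex
x(t) = \sum_{k=1}^{K} a_k \cos\left(2\pi k f_0 t + \varphi_k\right),
\qquad
x\left(t + T_0\right) = x(t),
\qquad
T_0 = \frac{1}{f_0}
```

Each term has a frequency that is an integer multiple of the fundamental frequency $f_0$, so shifting $t$ by $T_0$ adds a multiple of $2\pi$ to every cosine argument and leaves the sum unchanged. For the 75 to 400 Hz range mentioned above, $T_0$ lies between roughly 13.3 ms and 2.5 ms.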
When the harmonic structure does not exist in the power spectrum, then the speech segment is called unvoiced. In the time-domain such signal segments display noise-like structure (periodicity is not apparent). Fricatives such as "f" are examples of unvoiced sounds. Whispered speech is completely unvoiced.
To derive its properties, the speech signal is analyzed over short time intervals (frames) of about 20 to 30 ms each. The speech signal is considered to be stationary during each frame. Because of the non-stationary nature of the speech signal, the analysis must be performed over many frames. Two common metrics are associated with the analysis of speech frames: pitch lag and pitch gain. Pitch lag is an estimate of the speech frame's fundamental period, that is, the reciprocal of its fundamental frequency. Pitch lag measurements are only valid for voiced speech frames. The pitch gain is a measure of how well the frame matches the pitch lag estimate. The pitch gain could be derived in a variety of ways including, for instance, a normalized pitch correlation or the gain of the adaptive codebook in an analysis-by-synthesis codec such as CELP. Large pitch gains indicate voiced frames and valid pitch lags. Small pitch gains indicate unvoiced frames and invalid pitch lags.
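As an illustration of the normalized pitch correlation mentioned above, the following minimal sketch searches a range of candidate lags (in samples) and returns the lag at which the frame best matches a shifted copy of itself, together with the correlation value at that lag, which serves as the pitch gain. The function name, the search range, and the open-loop search strategy are assumptions made for this example; they are not taken from this description or from any particular codec.

```python
import math


def pitch_lag_and_gain(frame, min_lag, max_lag):
    """Return (lag, gain) for one frame of samples.

    Hypothetical helper: `lag` is the candidate lag in samples with the
    largest normalized correlation, and `gain` is that correlation value.
    `min_lag` must be at least 1 and `max_lag` smaller than len(frame).
    """
    best_lag, best_gain = min_lag, 0.0
    for lag in range(min_lag, max_lag + 1):
        current = frame[lag:]    # samples n = lag .. end
        delayed = frame[:-lag]   # the same samples, `lag` samples earlier
        num = sum(a * b for a, b in zip(current, delayed))
        den = math.sqrt(sum(a * a for a in current) *
                        sum(b * b for b in delayed))
        gain = num / den if den > 0.0 else 0.0
        if gain > best_gain:
            best_lag, best_gain = lag, gain
    return best_lag, best_gain
```

For a synthetic 30 ms frame sampled at 8000 Hz whose fundamental frequency is 100 Hz, a search over lags 20 to 107 (400 Hz down to about 75 Hz) should return a lag of 80 samples with a gain close to 1. A noise-like frame would instead give a small gain, marking the lag as unreliable.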
Impedance mismatches are inevitable in speech communication systems. Connecting a handset that has four wires to the phone lines having two wires creates an impedance mismatch. An impedance mismatch creates an echo signal from the outgoing speech signal of a talker. This echo signal is a reflection of the original speech signal. A person listening to the original speech signal may hear the undesired echo signal. Speech communication systems also generate a delay between the original speech signal and when the listener hears the echo signal. In other words, the echo signal arrives at a certain time after the original speech signal. The greater the delay, the greater the annoyance to the listener. For this reason, designers of communication systems have tried to eliminate this echo with echo cancellers.
In order to cancel the echo signal on the communication line, the echo canceller must analyze an unknown signal and determine whether it is solely an echo signal or also contains the speech of a second person on the line. By convention, if two people are talking over a communication network or system, one person is referred to as "talker 1" or the "near talker," while the other person is referred to as "talker 2" or the "far talker." After talker 1 speaks, a signal may return to talker 1. That incoming signal may be an echo of talker 1's speech signal, or a combination of an echo signal and the speech signal of talker 2. This combination is referred to as "double talk." An echo canceller is placed in the communication line and must be able to differentiate between an echo signal and double talk because the echo canceller must cancel only the echo signal, not the double talk.
To determine whether the unknown incoming signal contains an echo signal component without double talk, the echo canceller must estimate the characteristics of an echo signal based on the outgoing signal. Since the outgoing speech signal changes (due to talker 1 voicing different speech patterns over time), the echo canceller must be able to analyze the outgoing speech signal and adapt its estimate of the expected echo signal so that it can look for and eliminate the echo signal. To model the echo and its delay, a transversal filter with adjustable taps is often used. The taps are spaced one sample time apart, and each tap receives a coefficient that specifies the magnitude of the corresponding output signal sample. The better the echo canceller can estimate what the echo signal will look like, the better it can eliminate the echo. To improve performance, it may be desirable to vary the adaptation rate at which the transversal filter tap coefficients are adjusted. For instance, if the echo canceller is sure that the unknown incoming signal is an echo, it is preferable for the echo canceller to adapt quickly so that it estimates and eliminates the echo signal as fast as possible. On the other hand, if the echo canceller is sure that the unknown incoming signal is not just an echo but double talk, it is preferable to decline to adapt at all. If the echo canceller errs in determining whether the unknown incoming signal is an echo signal, fast adaptation would cause rapid divergence and a failure to eliminate the echo. Thus, besides determining whether the unknown incoming signal is an echo or double talk, there is a need to know the level of confidence in the decision.
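The transversal filter and its adjustable adaptation rate can be sketched as follows. This example uses the normalized least-mean-squares (NLMS) update, which is one common way to adjust tap coefficients; the description above does not name a specific update rule, so NLMS, the function name, and the parameters `num_taps`, `step_size` and `eps` are all assumptions made for illustration. Setting `step_size` to zero corresponds to declining to adapt, as would be preferred during double talk.

```python
def nlms_echo_canceller(outgoing, incoming, num_taps, step_size, eps=1e-8):
    """Subtract an estimated echo of `outgoing` from `incoming`.

    Hypothetical sketch: returns (residual, taps), where `residual` is the
    incoming signal after echo removal and `taps` are the final coefficients.
    """
    taps = [0.0] * num_taps
    history = [0.0] * num_taps  # recent outgoing samples, newest first
    residual = []
    for x, d in zip(outgoing, incoming):
        history = [x] + history[:-1]
        # Each tap weights the outgoing sample at one particular delay.
        echo_estimate = sum(w * h for w, h in zip(taps, history))
        error = d - echo_estimate
        residual.append(error)
        if step_size > 0.0:
            # NLMS update, normalized by the energy of the recent samples.
            energy = sum(h * h for h in history) + eps
            taps = [w + step_size * error * h / energy
                    for w, h in zip(taps, history)]
    return residual, taps
```

With an echo path that simply delays the outgoing signal by three samples and halves its amplitude, the coefficient at index 3 should converge toward 0.5 while the residual shrinks toward zero. A larger `step_size` adapts faster but, as noted above, also diverges faster if the incoming signal is not purely an echo.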
Prior approaches to detecting double talk are cumbersome and computationally intensive. In fact, they may require a dedicated digital signal processing (DSP) chip just to perform the echo cancellation function.
Prior art echo cancellers compare the unknown signal with the far-end talker's speech signal on a sample-by-sample basis in the time domain. Because they do not know the delays of the speech communication system, they perform this comparison over a wide range of samples. In other words, the prior art echo cancellers account for delays between the far-end talker's speech signal and its echo by comparing a sample of the unknown signal with many samples of the far-end talker's speech signal to see whether any of the comparisons match. Because of the unknown delay, the prior art had to perform this comparison many times, which made the detection of double talk computationally intensive. Note that the detection of an echo means that double talk was not detected, and vice versa. To demonstrate the inefficiency of the prior art approach, assume that the window of a possible match between a signal and its echo is 1 second. Thus, if the sample rate is 8000 samples per second, a sample of the unknown signal must be compared against 8000 samples (1 second's worth) of the far-end talker's speech signal. This cumbersome approach slowed the detection of double talk and decreased the efficiency of echo cancellers. The intensive process sometimes required the prior art to dedicate a processor to double talk detection.
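The arithmetic behind this example can be made concrete. The short sketch below counts the comparisons for the stated 1 second window at 8000 samples per second. The per-second total assumes that every incoming sample is compared against the full window, which is an extrapolation made for this illustration and not a figure given in the text. The function name is hypothetical.

```python
def sample_comparisons(sample_rate_hz, window_seconds):
    """Return (comparisons per incoming sample, comparisons per second).

    Assumes each incoming sample is checked against every outgoing sample
    inside the match window.
    """
    per_sample = int(sample_rate_hz * window_seconds)
    per_second = per_sample * sample_rate_hz
    return per_sample, per_second


per_sample, per_second = sample_comparisons(8000, 1)
print(per_sample, per_second)  # prints: 8000 64000000
```

Under that assumption, one second of incoming signal costs 64 million sample comparisons, which gives a sense of why a dedicated processor was sometimes needed.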
One solution to the problems presented in prior systems is a simpler double talk detection algorithm, so that a separate DSP is not required or so that fewer computational resources are required. However, it is also important that the double talk algorithm be robust and not fail readily.
This invention provides a system for detecting an echo signal in a voice communication system. In particular, the echo detection and/or cancellation system uses one or more speech characteristics of the outgoing speech and the unknown signal to determine if the unknown signal is an echoed version of the outgoing speech or also contains a speech signal from a second talker (double talk). For example, the echo detection system may compare the pitch lags, pitch gains, energies, and/or other characteristics of the outgoing speech signal with those of the unknown incoming signal to determine whether the unknown signal is an echo signal. Additionally, a certain number of frames of these characteristics of the outgoing speech signal and the unknown incoming signal may be buffered so that the analysis and comparison can be made more efficiently and quickly in the frame domain as opposed to the time domain.
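A frame-domain comparison of this kind might look like the following sketch. It is a hypothetical illustration only: the decision rule, the thresholds `lag_tolerance` and `min_gain`, the labels returned, and the buffer length of 50 frames (1 second of 20 ms frames) are assumptions chosen for the example and are not the claimed method. Each frame is represented by a `(pitch_lag, pitch_gain)` pair.

```python
from collections import deque


def classify_incoming_frame(outgoing_buffer, incoming,
                            lag_tolerance=2, min_gain=0.5):
    """Label one incoming frame by comparing it with buffered outgoing frames.

    Hypothetical rule: a voiced incoming frame whose pitch lag matches a
    voiced outgoing frame in the buffer is treated as a likely echo.
    """
    lag, gain = incoming
    if gain < min_gain:
        return "undetermined"  # unvoiced frame: the pitch lag is not valid
    for out_lag, out_gain in outgoing_buffer:
        if out_gain >= min_gain and abs(out_lag - lag) <= lag_tolerance:
            return "echo"
    return "double talk"


# Buffer of recent outgoing frames; old frames drop off automatically.
buffer = deque(maxlen=50)
buffer.extend([(80, 0.9), (82, 0.8), (60, 0.2)])

print(classify_incoming_frame(buffer, (81, 0.85)))  # prints: echo
print(classify_incoming_frame(buffer, (40, 0.90)))  # prints: double talk
```

Because each incoming frame is checked against at most 50 buffered frames rather than thousands of samples, the search over the unknown delay is far smaller than a sample-by-sample comparison over the same window.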
Other systems, methods, features and advantages of the invention will be or will become apparent to one with skill in the art upon examination of the following figures and detailed description. It is intended that all such additional systems, methods, features and advantages be included within this description, be within the scope of the invention, and be protected by the accompanying claims.