Echo in a communication system is commonly characterized as the return of a part of a transmitted signal from an end user back to the originator of the transmitted signal after a delay period. As is known in the art, a near-end user transmits an uplink signal to a far-end user. Conversely, the near-end user receives a downlink signal from the far-end user. For example, echo at the near-end occurs when the near-end user originates an uplink signal on an uplink path, and a part of the transmitted signal is reflected back as an echo signal on a downlink path to the near-end. Echo at the far-end occurs when the far-end user originates a downlink signal on the downlink path, and a part of the transmitted signal is reflected back as an echo signal on the uplink path to the far-end. The reflection of the transmitted signal may occur for a number of reasons, the two primary reasons being (1) impedance mismatch at the four/two wire hybrids of a public switched telephone network (PSTN) exchange, resulting in the so-called network or line echo, and (2) acoustic coupling between the loudspeaker and microphone of a hands-free telephone, resulting in the so-called acoustic echo. An echo signal corresponding to the delayed transmitted signal is perceived as annoying to the end user and, in some cases, can result in an unstable condition known as “howling.”
Echo cancellers are employed at echo-generating sources to eliminate or reduce the transmission of echo signals. Echo cancellers may be employed in wireless devices, such as voice-capable personal data assistants (PDAs), cellular telephones, two-way radios, car-kits for cellular telephones, car phones and other suitable devices that can move throughout a geographic area. Additionally, echo cancellers may be employed in wireline devices, such as hands-free speaker phones, video and audio conference phones and telephones otherwise commonly referred to in the telecommunications industry as plain old telephone system (POTS) devices. Hands-free speaker phones typically comprise a microphone to produce the uplink signal, a speaker to acoustically produce the downlink signal, an echo canceller to cancel the echo signal and a telephone circuit.
Echo cancellers in a hands-free environment attempt to cancel the echo signals produced at the near-end when the far-end is transmitting by generating echo estimation data corresponding to a portion of a downlink audio signal traveling through the acoustic coupling channel between the speaker and the microphone. The echo canceller models the acoustic coupling channel and in response generates the echo estimation data through the use of an echo canceller adaptive filter. The echo canceller adaptive filter employs modeling techniques using, for example, a finite impulse response (FIR) filter having a set of weighting coefficients adapted using a least mean squares (LMS) algorithm to model the acoustic coupling channel, or other similar modeling techniques known in the art. The echo canceller adaptive filter subtracts the echo estimation data from pre-echo canceller uplink data received by the microphone in order to produce post-echo canceller uplink data. The post-echo canceller uplink data is used by the echo canceller adaptive filter to dynamically update the weighting coefficients of the finite impulse response filter.
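By way of illustration only, the adaptive filtering described above may be sketched as follows. The sketch assumes a textbook LMS update on a short FIR filter; the function names, the four-tap echo path, the step size and the synthetic white-noise downlink signal are hypothetical choices made for the example and are not taken from any particular echo canceller.

```python
import random

def lms_echo_canceller(downlink, mic, num_taps=4, mu=0.05):
    """Textbook LMS sketch: returns post-echo-canceller data and final weights."""
    weights = [0.0] * num_taps   # FIR weighting coefficients (channel model)
    history = [0.0] * num_taps   # most recent downlink samples, newest first
    post_ec = []
    for x, d in zip(downlink, mic):
        history = [x] + history[:-1]
        # Echo estimation data: FIR filter applied to the downlink signal.
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        # Post-echo-canceller uplink data: microphone minus echo estimate.
        error = d - echo_estimate
        # LMS update driven by the post-echo-canceller data.
        weights = [w + mu * error * h for w, h in zip(weights, history)]
        post_ec.append(error)
    return post_ec, weights

def simulate_echo(downlink, echo_path):
    """Hypothetical acoustic coupling channel modeled as a short FIR filter."""
    return [
        sum(echo_path[k] * downlink[n - k] for k in range(len(echo_path)) if n >= k)
        for n in range(len(downlink))
    ]

random.seed(0)
echo_path = [0.5, -0.3, 0.2, 0.1]                        # assumed channel
downlink = [random.uniform(-1.0, 1.0) for _ in range(5000)]
mic = simulate_echo(downlink, echo_path)                 # echo only, no near-end speech
post_ec, weights = lms_echo_canceller(downlink, mic)
residual_power = sum(e * e for e in post_ec[-500:]) / 500
```

In this noise-free, echo-only setting the weights settle close to the assumed echo path and the residual power in the post-echo canceller data falls toward zero. Practical systems often use a normalized step size and far longer filters.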
A hands-free speaker phone may be integrated into an in-vehicle audio system. The vehicle may be any suitable vehicle, such as an automobile, boat or airplane. The in-vehicle audio system may comprise an amplifier, speakers and an audio source, such as a CD/DVD player, a tape player, a hard drive playback system, a satellite radio, etc.
Typically, the downlink audio signal received from the far-end through the downlink path is played through at least one speaker in the in-vehicle audio system. The hands-free speaker phone installed in the vehicle, however, may experience significant coupling between the speakers and the microphone. As a result, an amplified downlink audio signal transmitted through the speakers will be partially received by the microphone as an echo signal.
Regarding hands-free telephony systems, as mentioned above, such systems interface to the user at the near-end by means of a loudspeaker and a microphone with minimal or no acoustic isolation between them. The acoustic coupling between the loudspeaker and microphone causes part of the signal received from the far-end, being reproduced by the loudspeaker, to be picked up by the microphone. Left unprocessed, this signal picked up at the microphone would be transmitted to the far-end of the communication system, producing an undesirable echo effect.
Introducing an echo canceller circuit at the near-end of the communication system can eliminate or at least reduce the echo signal before it is transmitted to the far-end. As shown in FIG. 1, known echo cancellers use adaptive filters to estimate the transfer function between the signal reproduced at the loudspeaker and the echo received at the microphone. Once an adaptive filter has an accurate estimate of the transfer function, it is used to filter the far-end signal sent to the loudspeaker to obtain an estimate of the echo signal that is picked up at the microphone. To remove the echo, the estimated echo from the adaptive filter is subtracted from the microphone signal to obtain an echo-cancelled signal.
Because the echo path between the loudspeaker and microphone can change frequently, the adaptive filter on an echo canceller must be able to track this changing transfer function continuously. The presence of a near-end signal, however, can affect the adaptation of the filter and cause its estimate to diverge from the transfer function that it is estimating. This causes imperfect cancellation of the echo signal leading to poor performance of the communication system.
To avoid the divergence of the filter coefficients from the optimal values and to improve the performance of the echo canceller, a double-talk detector is often employed. The purpose of the double-talk detector is to determine when the microphone signal comprises not only echo signal from the loudspeaker but also near-end speech. The output of the double-talk detector is then used to slow down or stop the adaptation of the adaptive filter of the echo canceller. Additionally, the output of a double-talk detector can be used in a post-processing stage of the echo canceller, which is used to suppress any residual echo present after the adaptive filter output is subtracted from the microphone signal.
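The effect of halting adaptation during double-talk may be illustrated with the following sketch. It assumes an ideal (oracle) double-talk decision supplied as a list of flags, a textbook LMS update and synthetic white-noise signals; all names and parameter values are hypothetical and chosen only for the example.

```python
import random

def gated_lms(far_end, mic, adapt_flags, num_taps=4, mu=0.05):
    """LMS sketch whose coefficient update is skipped when adapt is False."""
    weights = [0.0] * num_taps
    history = [0.0] * num_taps
    for x, d, adapt in zip(far_end, mic, adapt_flags):
        history = [x] + history[:-1]
        error = d - sum(w * h for w, h in zip(weights, history))
        if adapt:  # adaptation is halted while double-talk is flagged
            weights = [w + mu * error * h for w, h in zip(weights, history)]
    return weights

def misalignment(weights, echo_path):
    """Squared distance between the filter weights and the true echo path."""
    return sum((w - h) ** 2 for w, h in zip(weights, echo_path))

random.seed(1)
echo_path = [0.5, -0.3, 0.2, 0.1]                      # assumed echo path
total, talk_start = 4000, 3000
far_end = [random.uniform(-1.0, 1.0) for _ in range(total)]
# Near-end talker is silent at first, then active for the last 1000 samples.
near_end = [0.0] * talk_start + [
    random.uniform(-1.0, 1.0) for _ in range(total - talk_start)
]
mic = []
for n in range(total):
    echo = sum(echo_path[k] * far_end[n - k] for k in range(len(echo_path)) if n >= k)
    mic.append(echo + near_end[n])

always_adapt = [True] * total
oracle_gate = [n < talk_start for n in range(total)]   # ideal detector output

ungated_error = misalignment(gated_lms(far_end, mic, always_adapt), echo_path)
gated_error = misalignment(gated_lms(far_end, mic, oracle_gate), echo_path)
```

With the oracle gate the weights remain at their converged values, whereas the ungated filter is perturbed by the near-end signal and ends with a larger misalignment. A real detector is imperfect, so the benefit obtained in practice depends on its miss and false-alarm rates.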
Typical post-processing comprises a non-linear processor (NLP), e.g., a center clipper, to completely remove those parts of a communication signal containing the residual echo. Consequently, when both far-end and near-end speakers are active, i.e., during double-talk, the NLP either passes the residual echo through along with the near-end speech or removes both the residual echo and the near-end speech. Because divergence in the adaptive filter coefficients and incorrect operation of the post-processing functions can have substantial impact on the quality of the echo canceller output, a double-talk detector is a critical part of the echo cancellation system.
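The double-talk dilemma of a center clipper may be seen in a minimal sketch. It assumes one common form of center clipper, in which samples at or below a threshold magnitude are zeroed and larger samples pass unchanged; the threshold and sample values are hypothetical.

```python
def center_clip(samples, threshold):
    """Zero every sample whose magnitude does not exceed the threshold."""
    return [0.0 if abs(s) <= threshold else s for s in samples]

# Residual echo left after the adaptive filter (small amplitudes).
residual_echo = [0.01, -0.02, 0.015, -0.01]
# Near-end speech is present only in the last two samples (double-talk).
near_end = [0.0, 0.0, 0.4, -0.5]
combined = [r + s for r, s in zip(residual_echo, near_end)]

clipped = center_clip(combined, threshold=0.05)
```

The first two samples, which contain residual echo only, are removed. The last two samples exceed the threshold, so the near-end speech passes together with the residual echo riding on it. Raising the threshold enough to remove those samples would also remove the near-end speech.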
Several methods for double-talk detection have been used in the past. Some are based on computing the ratio of power levels of various signals in the communication system. Others are based on computing cross-correlations between various signals in the system. For a description of such methods, please refer to “Acoustic Signal Processing for Telecommunication,” edited by Steven L. Gay and Jacob Benesty, Kluwer Academic Publishers (2000), Ch. 5 titled “Double-Talk Detection Schemes for Acoustic Echo Cancellation.” While these methods work for network echo cancellation to suppress echoes caused by impedance mismatch in a hybrid circuit, their performance is inadequate for acoustic echo cancellation in a hands-free telephony environment. One reason is that in a hands-free environment, the echo signal level is often much stronger than that of the near-end speech.
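As one illustration of the level-comparison family, the following sketch implements a Geigel-style detector, which declares double-talk when the microphone magnitude exceeds a fixed fraction of the recent far-end peak. The Geigel detector is not named in the text above and is used here only as a well-known example; the window length, threshold and signal values are hypothetical.

```python
def geigel_double_talk(far_end, mic, window=4, threshold=0.5):
    """Flag double-talk when |mic| exceeds threshold * recent far-end peak."""
    flags = []
    for n in range(len(mic)):
        start = max(0, n - window + 1)
        recent_peak = max(abs(x) for x in far_end[start:n + 1])
        flags.append(abs(mic[n]) > threshold * recent_peak)
    return flags

far_end = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
near_end = [0.0, 0.0, 0.0, 0.9, 0.9, 0.0]

# Network-like case: the echo is attenuated well below the far-end level.
weak_echo_mic = [0.3 * x + s for x, s in zip(far_end, near_end)]
weak_flags = geigel_double_talk(far_end, weak_echo_mic)

# Hands-free-like case: the echo arrives louder than the far-end reference,
# and no near-end speech is present at all.
strong_echo_mic = [1.5 * x for x in far_end]
strong_flags = geigel_double_talk(far_end, strong_echo_mic)
```

In the weak-echo case only the two samples containing near-end speech are flagged. In the strong-echo case every sample is flagged even though no near-end speech is present, which is consistent with the inadequacy of such methods in a hands-free environment noted above.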
Some methods that alter or remove some frequency content from the signal received from the far-end have been introduced, e.g., see U.S. Pat. Nos. 6,052,462 and 6,141,415. To avoid excessive distortion of the received signal, these methods remove the signal energy from only a small region of the frequency spectrum before the signal is reproduced by the loudspeaker. At the microphone, the presence or absence of a signal component in the region where the signal energy was removed indicates the presence or absence of near-end speech. Because the frequency region used for detecting near-end speech is usually narrow, some segments of near-end speech may not trigger the detector, which causes the double-talk detector to fail under some conditions.
Another method, disclosed in U.S. Pat. No. 6,049,606, exploits the fact that most modern telephony systems carry signals with energy only in the 250-3500 Hz band, but speech signals can contain energy in a wider band, e.g., 0-8000 Hz. By detecting signal energy outside of the standard telephony band, this method can determine the presence of near-end speech without the need to distort the received signal. The method filters the signal from the microphone to remove the telephony band components. The method then compares an estimate of the energy in the out-of-telephony band region with a predetermined threshold to detect the presence or absence of near-end speech. Although more robust than the other methods, this method suffers from a number of drawbacks. First, the use of a predetermined threshold to detect the presence or absence of near-end speech can limit performance, especially when the nature and level of the background noise can vary substantially and continuously in a particular environment, e.g., an automobile environment. Second, while useful, the 0-250 Hz portion of the out-of-telephony band has the following limitations: a) the fundamental frequency of some higher-pitched speakers falls outside this region, thus rendering this portion practically useless for detection purposes, and b) several background noise types, e.g., car noise, have the highest energy in this frequency range, thus lowering signal-to-noise ratios (SNRs), which makes signal detection difficult. Third, several types of speech sounds (phones/phonemes) do not have sufficient energy in the upper band region, i.e., the 4-8 kHz portion of the out-of-telephony band, which in combination with a higher-pitched speaker, as discussed above, would cause the detection method to perform poorly. Accordingly, there exists a need for improved double-talk detection performance.
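The out-of-band approach and its fixed-threshold drawback may be sketched as follows. This is not the implementation of the cited patent: for brevity, the sketch sums discrete Fourier transform bins outside the 250-3500 Hz band instead of filtering, and it assumes a 16 kHz sample rate, 10 ms frames, pure-tone test signals and an arbitrary threshold, all of which are hypothetical.

```python
import cmath
import math

SAMPLE_RATE = 16000   # assumed wideband microphone sampling rate
FRAME_LEN = 160       # 10 ms frame, giving 100 Hz DFT bin spacing

def tone(freq_hz, amplitude):
    """One frame of a sine wave, standing in for a signal component."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE)
            for t in range(FRAME_LEN)]

def mix(*signals):
    return [sum(values) for values in zip(*signals)]

def out_of_band_energy(frame, band=(250.0, 3500.0)):
    """Sum normalized DFT bin energy outside the telephony band."""
    n = len(frame)
    energy = 0.0
    for k in range(n // 2 + 1):
        freq = k * SAMPLE_RATE / n
        if band[0] <= freq <= band[1]:
            continue  # skip telephony-band bins, where the echo lives
        coeff = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        energy += abs(coeff) ** 2
    return energy / (n * n)

def near_end_detected(frame, threshold=0.01):
    """Fixed (predetermined) threshold on the out-of-band energy."""
    return out_of_band_energy(frame) > threshold

echo_only = tone(1000, 0.8)                               # band-limited echo
double_talk = mix(echo_only, tone(800, 0.3), tone(5000, 0.3))
echo_plus_car_noise = mix(echo_only, tone(100, 0.5))      # no speech present
```

Here `near_end_detected(echo_only)` is false and `near_end_detected(double_talk)` is true, as intended. The frame containing only echo and low-frequency noise is also flagged, because the noise energy below 250 Hz exceeds the fixed threshold. This illustrates the first and second drawbacks listed above.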