The goal of a teleconferencing system is to bring the participants at the ends of the communication as "close together" as possible. Ideally, the effect obtained in good communication should be one of "being there" (see, e.g., U.S. Pat. No. 4,890,314, describing a teleconferencing system that includes a teleconferencing station which utilizes a high resolution display).
A teleconferencing system comprises two or more remotely located stations which are interconnected by a transmission system. Two teleconference participants located at the two remote stations are in audio and video communication with each other. To accomplish the audio and video communication, each station includes a microphone for generating an audio signal for transmission to the other station, a speaker for receiving an audio signal from the other station, a video camera for generating a video signal for transmission to the other station and a display apparatus for displaying a video signal generated at the other station. Each station also includes a codec for coding the video signal generated at the station for transmission in a compressed fashion to the other station and for decoding a coded video signal received from the other station.
The present invention relates to the audio processing portion of the teleconferencing system. The audio processing portion may be viewed as comprising a first microphone and a first speaker located at a first station and a second microphone and a second speaker located at a second station. A first channel is established in a transmission system for transmitting an audio signal from the first microphone at the first station to the second speaker at the second station. A second channel is established in the transmission system for transmitting an audio signal from the second microphone at the second station to the first speaker at the first station.
A problem with this type of audio system is acoustic coupling between the microphone and the speaker at each station. In particular, there is a round-trip feedback loop which, for example, is formed by: 1) the first microphone at the first station, 2) the channel connecting the first microphone to the second speaker at the second station, 3) the acoustic coupling path at the second station between the second speaker and the second microphone, 4) the channel connecting the second microphone and the first speaker at the first station, and 5) the acoustic coupling path at the first station between the first speaker and the first microphone. If, at any time, the net loop gain is greater than unity, the loop becomes unstable and may oscillate. The result of this instability is the well-known "howling" sound. In such loops, even when the overall gain is low, there is still the problem of far-end talker echo, which stems from a talker's voice returning to his own ear at a reduced but audible level after traveling around the loop. The acoustic echo problem worsens in teleconferencing systems as the transmission delay increases. Incompletely suppressed echoes which are not distinguishable to a teleconference participant at short transmission delays become more distinguishable with longer transmission delays.
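The stability condition above can be illustrated with a minimal sketch. It assumes each of the four gain-bearing segments of the loop (two transmission channels and two acoustic coupling paths) is described by a single broadband gain in decibels, so that "greater than unity" corresponds to a net gain above 0 dB. The function names and the numeric gains are hypothetical and chosen only for illustration; a real room has a frequency-dependent coupling path.

```python
def net_loop_gain_db(channel_1_to_2_db, coupling_station_2_db,
                     channel_2_to_1_db, coupling_station_1_db):
    """Sum the gains (in dB) of the segments around the round-trip loop."""
    return (channel_1_to_2_db + coupling_station_2_db
            + channel_2_to_1_db + coupling_station_1_db)


def is_acoustically_stable(loop_gain_db):
    """The loop is stable only while the net loop gain stays below unity (0 dB)."""
    return loop_gain_db < 0.0


# Illustrative values: each channel amplifies, each room attenuates.
quiet_setup = net_loop_gain_db(6.0, -10.0, 6.0, -10.0)   # net -8 dB
loud_setup = net_loop_gain_db(10.0, -8.0, 10.0, -8.0)    # net +4 dB

print(quiet_setup, is_acoustically_stable(quiet_setup))
print(loud_setup, is_acoustically_stable(loud_setup))
```

Even in the stable case, the negative net gain is also the level at which a talker hears his own voice return after one trip around the loop, which is why a low loop gain alone does not remove far-end talker echo.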
A variety of solutions have been proposed in the prior art for the problems of acoustic instability and acoustic echoes (see, e.g., G. Hill, "Improving Audio Quality Echo Control in Video Conferencing", Teleconference, Vol. 10, No. 2, March-April 1991; and W. Armbruster, "High Quality Hands-Free Telephony Using Voice Switching Optimized With Echo Cancellation", Signal Processing IV, J. L. Lacoume, et al, editors, Elsevier Science Publishers, B. V., 1988, pp. 495-498).
One approach to solving the echo problem in the audio processing loop of a teleconferencing system is to use an echo canceller. An echo canceller is a circuit which produces a synthetic replica of an actual echo contained in an incoming signal. The synthetic replica is subtracted from the incoming signal to cancel out the actual echo contained in the incoming signal. The echo canceller may be implemented by an adaptive transversal filter whose tap values are continuously updated using, for example, a least mean square algorithm to mimic the transfer function of the actual echo path. This type of echo canceller suffers from a number of disadvantages. First, the echo canceller is computationally complex, i.e., it requires the use of a significant number of specialized digital signal processors for implementation. Second, for wideband speech (7 kHz), in rooms with a large reverberation time, the echo canceller requires a long transversal filter with about 4000 or more taps. Such long filters have a low convergence rate and poorly track the transfer function of the actual echo path. In addition, some echo cancellers implemented using an adaptive transversal filter must be trained with a white noise training sequence at the beginning of each teleconference. Retraining may be required during the teleconference.
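By way of illustration only, the adaptive transversal filter described above may be sketched as follows. The sketch assumes a noise-free, three-tap echo path and a white noise training sequence; the names lms_echo_canceller and simulate_echo, the step size, and the echo path coefficients are hypothetical and are not taken from any cited reference. A practical wideband canceller would need thousands of taps, which is the source of the complexity and slow convergence noted above.

```python
import random


def simulate_echo(far_end, echo_path):
    """Convolve the far-end signal with an assumed echo path impulse response."""
    echo = []
    for n in range(len(far_end)):
        echo.append(sum(echo_path[k] * far_end[n - k]
                        for k in range(len(echo_path)) if n - k >= 0))
    return echo


def lms_echo_canceller(far_end, mic, num_taps, step_size):
    """Subtract a synthetic echo replica from the microphone signal.

    The tap values are updated on every sample with the least mean square rule.
    Returns the residual (echo-cancelled) signal and the final tap values.
    """
    taps = [0.0] * num_taps
    history = [0.0] * num_taps          # most recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        replica = sum(w * h for w, h in zip(taps, history))
        error = d - replica             # what remains after cancellation
        taps = [w + step_size * error * h for w, h in zip(taps, history)]
        residual.append(error)
    return residual, taps


random.seed(0)
far_end = [random.uniform(-1.0, 1.0) for _ in range(2000)]  # white noise training
echo_path = [0.5, -0.3, 0.1]                                # assumed room response
mic = simulate_echo(far_end, echo_path)
residual, taps = lms_echo_canceller(far_end, mic, num_taps=3, step_size=0.1)
print([round(w, 4) for w in taps])
```

After training, the tap values approximate the assumed echo path and the residual falls toward zero. If the echo path changes (for example, a participant moves), the taps must re-converge, which corresponds to the retraining mentioned above.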
Another technique for solving the echo problem is to place an echo suppressor at the output of the microphone at each teleconferencing station. Typically, the echo suppressor comprises a level-activated switch which controls a gate and a variable attenuation device. When the signal level at the output of the microphone is below a threshold level, the gate is closed to block the communication channel leading away from the microphone. When the signal level at the output of the microphone is above the threshold level, the gate is open to place the communication channel leading away from the microphone into a pass state. Illustratively, the threshold level of the echo suppressor may be set to the maximum level of the return echo. For this system, when one teleconference participant is talking, his local echo suppressor opens the local gate so that the channel to the remote station is open. If the other teleconference participant at the remote station is not talking, the echo suppressor at the remote station closes the gate at the remote station so that the echo return path is blocked. Some echo suppressors open or close the gate to the communication channel by detecting the presence or absence of local speech rather than by simply determining if a microphone output signal is above or below a threshold.
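The level-activated gate described above can be sketched as follows. The sketch assumes the microphone output is processed in short frames and that the signal level is measured as the root-mean-square value of a frame; the frame-based processing, the function names, and the threshold value are assumptions made for illustration and are not specified in the description above.

```python
import math


def frame_level(frame):
    """Root-mean-square level of one frame of microphone samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def echo_suppressor_gate(frame, threshold):
    """Pass the frame when its level exceeds the threshold, otherwise block it.

    A blocked frame is replaced by silence, so a return echo whose level is
    below the threshold never enters the channel leading away from the microphone.
    """
    if frame_level(frame) > threshold:
        return list(frame)              # pass state
    return [0.0] * len(frame)           # blocked state


threshold = 0.1                          # assumed maximum level of the return echo
local_speech = [0.5, -0.5, 0.5, -0.5]
return_echo = [0.05, -0.05, 0.05, -0.05]

print(echo_suppressor_gate(local_speech, threshold))
print(echo_suppressor_gate(return_echo, threshold))
```

The first frame passes unchanged and the second is blocked. A speech-detecting suppressor, as mentioned above, would replace the level comparison with a decision on whether local speech is present.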
When the participants at both ends of the teleconference try to speak at the same time, a condition known as double talk exists. Under the double talk condition, the echo suppressor gates at both ends of the teleconference are open, and there is the possibility of acoustic echo being returned to both participants as well as the possibility of acoustic instability. In this case, each echo suppressor utilizes its variable attenuation device to introduce the amount of attenuation necessary to suppress the acoustic echo and ensure acoustic stability. Thus, the echo is reduced, but so is the audio signal generated by the speech of the teleconference participants. In many cases, the amount of attenuation which has to be introduced at the output of each microphone for echo suppression may be too great to maintain fully interactive two-way communication between participants. Thus, this type of echo suppressor is not entirely satisfactory for use in a teleconferencing system.
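The trade-off under double talk can be made concrete with a rough sketch. It assumes, purely for illustration, that the net loop gain is known in decibels, that a fixed stability margin below 0 dB is desired, and that the required attenuation is split equally between the two stations. None of these choices is stated in the description above, and the function name and the figures are hypothetical.

```python
def double_talk_attenuation_db(loop_gain_db, margin_db):
    """Attenuation (dB) each station inserts so that the loop gain
    ends up margin_db below unity when both gates are open.

    Assumes the total attenuation is shared equally by the two stations.
    """
    total_needed = loop_gain_db + margin_db
    return max(0.0, total_needed / 2.0)


# A loop that would howl (+4 dB) needs 5 dB at each microphone output
# to reach an assumed 6 dB margin; the talkers' speech loses the same 5 dB.
print(double_talk_attenuation_db(loop_gain_db=4.0, margin_db=6.0))

# A loop already 10 dB below unity needs no extra attenuation.
print(double_talk_attenuation_db(loop_gain_db=-10.0, margin_db=6.0))
```

Because the attenuation is applied to the microphone output, it reduces the wanted speech by exactly the same amount as the echo, which is the limitation described above.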
In addition to the use of echo suppressors and echo cancellers, frequency shifters or special filters may be utilized in the audio processing system of a teleconferencing system. For example, a frequency shifter may be utilized to increase the margin of acoustic stability (see, e.g., U.S. Pat. No. 3,183,304, and F. K. Harvey et al, "Some Aspects of Stereophony Applicable to Conference Use", Journal Audio Engineering Society, Vol. 11, pp. 212-217, July 1963).
Alternatively, comb filters with complementary pass and stop bands may be placed in the two audio channels connecting the two stations of a teleconference (see, e.g., U.S. Pat. No. 3,622,714 and U.S. Pat. No. 4,991,167). The use of the complementary comb filters mitigates the effect of acoustic coupling between the speaker and microphone at each station. The reason is that any signal going around the feedback loop is processed by both comb filters and will be attenuated across its entire spectrum, because the stop bands of the two comb filters are complementary. This improves the margin of acoustic stability to some extent and reduces far-end talker echo. On the other hand, a speech signal which travels from one station to the other is only processed by one comb filter and is not attenuated appreciably across its entire spectrum. In comparison to echo cancellers, comb filters have the advantage of simplicity. However, comb filters introduce some degradation in perceived speech quality and do not always provide a sufficient margin of acoustic stability. The reason is that the frequency response of a room in which the microphone and speaker of a station are located is characterized by a large number of resonant peaks. The band transitions in the comb filter transfer functions are often not sharp enough to suppress the resonant peaks, and they cannot simply be made sharper, because if the transitions are too sharp the quality of the transmitted audio signal is adversely affected.
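The complementary behavior described above can be illustrated with a minimal sketch. It assumes the simplest possible pair of comb filters, the two-tap feedforward filters (1 + z^-D)/2 and (1 - z^-D)/2, whose pass bands and stop bands interleave. These are chosen only for illustration and are not the filters of the cited patents; the function names and the delay of 8 samples are hypothetical.

```python
import cmath
import math


def comb_response(omega, delay, sign):
    """Frequency response of the comb filter (1 + sign * z^-delay) / 2."""
    return (1 + sign * cmath.exp(-1j * omega * delay)) / 2


def comb_filter(signal, delay, sign):
    """Apply the same comb filter to a list of samples in the time domain."""
    return [(x + sign * (signal[n - delay] if n >= delay else 0.0)) / 2
            for n, x in enumerate(signal)]


delay = 8
omegas = [math.pi * k / 512 for k in range(513)]   # 0 .. pi rad/sample

# One-way path: only one comb filter is in the channel.
one_way = [abs(comb_response(w, delay, +1)) for w in omegas]

# Round trip: the signal passes through both complementary filters.
round_trip = [abs(comb_response(w, delay, +1) * comb_response(w, delay, -1))
              for w in omegas]

print(max(one_way))      # pass-band centers reach unity gain
print(max(round_trip))   # never exceeds one half, i.e. at least 6 dB down
```

In this simple pair the round-trip magnitude equals |sin(omega * D)| / 2, so every frequency is attenuated by at least about 6 dB around the loop, while one-way speech passes at full level at the pass-band centers. The gentle, sinusoidal band transitions of these filters also show why narrow room resonances near a band edge are only weakly suppressed.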
In view of the foregoing, it is an object of the present invention to provide an audio processing system for use in a teleconferencing system. Specifically, it is an object of the present invention to provide an audio processing system which permits two-way fully interactive audio communications in a teleconferencing system, while at the same time suppressing far-end talker echoes and providing a satisfactory margin of acoustic stability. Finally, it is an object of the present invention to provide an audio processing system for use in a teleconferencing system which utilizes complementary comb filters, but provides a satisfactory margin of acoustic stability and mitigates the degradation in perceived speech quality caused by the comb filters.