Meetings between multiple individuals conducted in two or more separate locations can be facilitated using audio conferencing systems. An audio conferencing system typically includes some number of microphones, at least one loudspeaker, and a base station which is connected to a public network. In such a system, microphones can operate to pick up near end (N.E.) acoustic audio signals (speech) from an individual and transmit the audio signals to a base station, which generally operates to provide session control and to process the audio signals in a number of ways before sending them to a far end (F.E.) communication device to be played by a loudspeaker. Among other things, the base station can be configured with functionality to amplify audio signals, regulate microphone signal gain (automatic gain control or AGC), suppress noise, and remove acoustic echo present in an audio signal transmitted to a F.E. system.
FIG. 1 is a diagram showing functional elements comprising a typical audio conference system 100. The system 100 can be comprised of one or more microphones 11, one or more loudspeakers 13, and a base station 15. The base station 15 generally includes complex digital signal processing and audio signal control functionality. The audio signal control can include functionality to automatically control near side audio signal gain, functionality to control microphone sensitivity, and system mode control (duplex/half duplex modes), to name only a few. The digital signal processing can include acoustic echo cancellation (AEC) functionality, residual echo suppression functionality or other non-linear processing, noise cancellation functionality, and double talk detection and mitigation.
AEC is an essential function performed by audio conferencing systems, and it generally operates to remove acoustic echo from a N.E. audio signal prior to the signal being transmitted to a remote or F.E. system. Specifically, acoustic echo occurs when a F.E. audio signal, received and played by a N.E. system loudspeaker, is picked up by a microphone proximate to the loudspeaker. An audio signal captured by the near side microphone will include at least some of the F.E. audio signal information, and this audio information can be transmitted back to the F.E. system where it can be heard as an echo. This acoustic echo is distracting and can severely degrade the quality of an audio conferencing session if it is not cancelled.
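By way of illustration only, the formation of acoustic echo described above can be sketched as a convolution of the F.E. signal with an echo path, added to the N.E. speech. The function names, the use of a finite impulse response for the echo path, and the sample values are assumptions made for this sketch and do not appear in the description.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += x * h
    return out


def microphone_signal(far_end, near_end, echo_path):
    """Model what the N.E. microphone captures.

    far_end   -- F.E. samples played by the N.E. loudspeaker
    near_end  -- N.E. speech samples (same length as far_end)
    echo_path -- hypothetical impulse response, loudspeaker to microphone
    """
    # Echo is the F.E. signal as transformed by the echo path.
    echo = convolve(far_end, echo_path)[:len(far_end)]
    # The microphone picks up the echo together with N.E. speech.
    return [e + v for e, v in zip(echo, near_end)]


if __name__ == "__main__":
    # A single F.E. impulse with no N.E. speech: the microphone
    # signal is simply the (truncated) echo path.
    print(microphone_signal([1.0, 0.0, 0.0, 0.0],
                            [0.0, 0.0, 0.0, 0.0],
                            [0.0, 0.5, 0.25]))
```

If this microphone signal were transmitted unmodified, the F.E. listener would hear a delayed, attenuated copy of the F.E. speech.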
Continuing to refer to FIG. 1, the base station 15 is comprised of an adaptive filter 20 and a summation function 22. In operation, a F.E. audio signal XN is received at the system 100 and sent to both the loudspeaker 13 and to the adaptive filter 20, which operates to, among other things, use the input signal XN and an error signal E to calculate an estimated echo signal ĥ which is sent to the summation function 22. The F.E. audio signal XN sent to the loudspeaker is played, and the microphone proximate to the loudspeaker can receive an audio signal h that includes XN signal audio information transformed by an echo path that exists between the loudspeaker and the microphone. This echo path can be modeled as a room impulse response, which in this case is represented by an acoustic signal hN. A N.E. audio signal VN (generated by one or more individuals speaking into a N.E. microphone), and signals played by the loudspeaker and reflected to the microphone, referred to here as a reflected signal SN, are combined into a microphone signal YN and sent to the summation function 22. The summation function 22 generally operates to subtract the estimated echo signal ĥ from the microphone signal YN, which results in an error signal E that serves as an input to the adaptive filter 20 and which can be transmitted to a F.E. audio system. More specifically, the error signal is an input to an adaptive algorithm comprising the adaptive filter 20 and is employed by the adaptive algorithm to calculate a set of coefficients W. The coefficients calculated by the adaptive algorithm are used by the filter 20 to operate on the F.E. signal XN to generate the estimated echo signal ĥ. The objective of the adaptive algorithm is to calculate or update filter coefficients such that the adaptive filter is able to minimize the error signal value, which in an ideal case is zero.
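The loop formed by the adaptive filter 20 and the summation function 22 can be sketched as follows. The description does not name a particular adaptive algorithm; this sketch assumes a normalized least-mean-squares (NLMS) update purely for illustration, and the function name, step size mu, and regularization constant eps are hypothetical.

```python
def nlms_echo_canceller(far_end, mic, num_taps=8, mu=0.5, eps=1e-8):
    """Time-domain echo canceller sketch (NLMS is an assumed algorithm).

    far_end -- F.E. samples XN sent to the loudspeaker and the filter
    mic     -- microphone samples YN
    Returns (errors, weights): the error signal E and coefficients W.
    """
    w = [0.0] * num_taps          # filter coefficients W
    x_buf = [0.0] * num_taps      # recent F.E. samples, newest first
    errors = []
    for x_n, y_n in zip(far_end, mic):
        x_buf = [x_n] + x_buf[:-1]
        # Estimated echo: the filter operating on the F.E. signal.
        echo_est = sum(wi * xi for wi, xi in zip(w, x_buf))
        # Summation function: subtract the estimate from the mic signal.
        e_n = y_n - echo_est
        # NLMS coefficient update, normalized by F.E. signal energy.
        norm = sum(xi * xi for xi in x_buf) + eps
        step = mu * e_n / norm
        w = [wi + step * xi for wi, xi in zip(w, x_buf)]
        errors.append(e_n)
    return errors, w
```

With a stationary echo path shorter than num_taps and no N.E. speech, the coefficients in this sketch approach the echo path and the error signal approaches zero, which corresponds to the ideal case described above.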
An adaptive filter suitable for operating to cancel acoustic echo in an audio system is typically designed to have a fixed number of filter elements or taps so that it is able to converge to a solution (which is the minimization of an error signal) within a reasonable period of time. Such an adaptive filter is shown with reference to FIG. 2. In an ideal acoustic environment (i.e., one in which there is no reflected signal energy), a microphone signal YN may only capture energy received directly (non-reflected) from a loudspeaker. In this case, such a fixed length adaptive filter can operate effectively to cancel substantially all of an acoustic echo component in a microphone signal. However, if a microphone signal includes a reflected acoustic energy component (SN) in addition to the non-reflected acoustic energy, an adaptive filter may not be able to converge to a solution within a reasonable period of time, if ever. The period of time that audio signal reflections (SN) linger is largely dependent upon the characteristics of the environment in which an audio system is operating. So, for instance, if a room in which an audio system is operating is relatively large, or the surfaces comprising the room are composed of materials that readily reflect acoustic signals, a signal XN played by a loudspeaker can reflect from more than one room surface before being received by a microphone. Given a large room with many reflective surfaces, a reflected acoustic signal can be received at a microphone up to several hundred milliseconds after the signal from which it originates is played by a loudspeaker. So, for example, an audio signal XN played by a loudspeaker at time T.1 can be received by a microphone in an unreflected form at time T.2, and the microphone can receive one or more reflected audio signals corresponding to the signal XN at times T.3 to T.n until the amplitude of the reflected signal is less than a threshold sensitivity level of the microphone.
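The arrival of direct and reflected signals at times T.2 to T.n can be illustrated with a simple geometric sketch. The amplitude model (a fixed loss per reflection combined with 1/distance spreading), the reflection coefficient, the threshold value, and the function name are all assumptions chosen for illustration; only the approximate speed of sound in air is a physical constant.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # approximate value for air at room temperature


def audible_arrivals(paths, reflection_coeff=0.5, threshold=0.01):
    """List (arrival_time_ms, amplitude) for paths the microphone can detect.

    paths            -- iterable of (path_length_m, num_reflections)
    reflection_coeff -- hypothetical fraction of amplitude kept per reflection
    threshold        -- hypothetical microphone sensitivity threshold
    """
    arrivals = []
    for length_m, num_reflections in paths:
        time_ms = 1000.0 * length_m / SPEED_OF_SOUND_M_PER_S
        # Assumed model: loss at each surface plus 1/distance spreading.
        amplitude = (reflection_coeff ** num_reflections) / length_m
        if amplitude >= threshold:
            arrivals.append((time_ms, amplitude))
    return sorted(arrivals)


if __name__ == "__main__":
    # One direct path and two reflected paths of increasing length.
    for t_ms, amp in audible_arrivals([(3.43, 0), (6.86, 1), (34.3, 3)]):
        print(f"arrival at {t_ms:.1f} ms, relative amplitude {amp:.4f}")
```

Under these assumptions, longer paths with more reflections arrive later and weaker, and those falling below the threshold are no longer captured, which is why more reflective or larger rooms produce longer echo tails.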
Given the wide variety of environments in which an audio system can operate, designing one adaptive filter to effectively cancel acoustic echo in all environments is a daunting task. As such, adaptive filters are typically designed for particular environments such that they are able to converge to a solution within a reasonable period of time. One factor contributing to an adaptive filter's convergence time is filter length. Filter length according to this description can mean the number of filter elements comprising an adaptive filter in the frequency domain, or the number of filter taps comprising an adaptive filter in the time domain. For instance, in FIG. 2, the adaptive filter 200 has ĥN filter elements or taps and so can be said to have a length of “N”, where N is an integer value.
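As a rough worked example of how filter length relates to the environment, the number of time-domain taps needed to span an echo tail of a given duration is the duration multiplied by the sampling rate. The function name and the sampling rates used here are assumptions for illustration; the description does not specify a sampling rate.

```python
import math


def taps_for_echo_tail(tail_ms, sample_rate_hz):
    """Number of time-domain taps needed to span an echo tail.

    tail_ms        -- how long reflections remain detectable, in ms
    sample_rate_hz -- assumed audio sampling rate
    """
    return math.ceil(tail_ms * sample_rate_hz / 1000)


if __name__ == "__main__":
    # A short tail in a small, absorbent room versus a long tail
    # in a large, reflective room (both values are illustrative).
    print(taps_for_echo_tail(50, 16000))
    print(taps_for_echo_tail(300, 16000))
```

The longer the tail, the more coefficients the adaptive algorithm must estimate, which is one reason a longer filter generally takes longer to converge.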
Adaptive filters can be designed with a greater number of taps for use in audio systems operating in relatively large rooms and/or in rooms with highly reflective surfaces, and adaptive filters can be designed with fewer taps for use in audio systems operating in relatively small rooms and/or in rooms with surfaces that do not readily reflect acoustic signal energy. As it is difficult to design an adaptive filter that is able to effectively remove acoustic echo in a wide range of environments, the audio signal resulting from the adaptive filtration process can be subjected to various forms of non-linear signal processing (NLP), such as the NLP 310 shown with reference to FIG. 3. Typically, this non-linear filtering process operates to suppress the audio level/energy comprising an audio signal in selected frequency spectra prior to transmitting the audio signal to a F.E. system.
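Suppression of audio energy in selected frequency spectra can be sketched as attenuating chosen bins of a short-time spectrum. This is a minimal sketch and not the NLP 310 itself: the description does not state how spectra are selected or how much they are attenuated, so the bin list, the gain value, and the function names are assumptions supplied by the caller. A direct DFT is used only to keep the example self-contained.

```python
import cmath


def dft(frame):
    """Direct discrete Fourier transform of a real or complex frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of each output sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real for t in range(n)]


def suppress_bins(frame, bins, gain=0.1):
    """Attenuate the selected frequency bins of one frame.

    bins -- hypothetical list of bin indices judged to hold residual echo
    gain -- hypothetical linear gain applied to those bins (0.0 removes them)
    """
    n = len(frame)
    spectrum = dft(frame)
    # Include each bin's mirror image so the output stays real-valued.
    targets = {k % n for k in bins} | {(n - k) % n for k in bins}
    for k in targets:
        spectrum[k] *= gain
    return idft(spectrum)
```

In a practical system the bins and gains would typically be chosen frame by frame from the signals themselves, which is what makes the overall process non-linear; here they are fixed inputs so the effect on the spectrum is easy to verify.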