Audio systems are generally developed under certain generic assumptions about the acoustic environment in which they are used and about the properties of the equipment involved. However, the actual environments in which they are used and in many cases the characteristics of the equipment may vary substantially. Accordingly, many audio systems and applications comprise functionality for adapting to the current operating characteristics. Specifically, many audio systems comprise functionality for calibrating and adapting the system e.g. to the specific acoustic environment in which they are used. Such adaptation may be performed regularly in order to account for variations with time.
Indeed, in many applications, and in particular those related to speech enhancement systems for voice communication, parameters of an algorithm are adapted to the characteristics of a specific device and its hardware, e.g. the characteristics of the microphone(s), loudspeaker(s), etc. While adaptive signal processing techniques exist to perform such adaptation during a device's normal operation, in many cases certain parameters (especially those on which these adaptive techniques rely) have to be estimated during production in a special calibration session, which is usually performed in a controlled (e.g. quiet) environment with only relevant signals being present.
Such calibration can be performed under close to ideal conditions. However, the resulting system performance can degrade when this adaptation is performed in the use environment. In such environments local interference such as speech and noise can often be present.
For example, a communication accessory containing one or more microphones which can be attached to a television, and which further is arranged to use the television's loudspeakers and onboard processing, cannot be tuned/adapted/calibrated during production since the related hardware depends on the specific television with which it is used. Therefore, adaptation must be performed by the user in his or her own home where noise conditions may result in a poorly adapted system.
As a specific example, many communication systems are used in conjunction with other devices, or in a range of different acoustic environments. An example of one such device is a hands-free communication accessory with built-in microphones for a television-based Internet telephone service. Such a device may be mounted on or near a television and can also include a video camera and a digital signal processing unit, allowing one to use software directly via a television in order to connect to other devices and conduct two-way or multi-party communication. A challenge when developing such an accessory is the wide range of televisions that it may be used with as well as the variations in the acoustic environments in which it should be capable of delivering satisfactory performance.
The audio reproduction chain in television sets and the environments in which they are used affect the acoustic characteristics of the produced sound. For example, some televisions use higher fidelity components in the audio chain, such as better loudspeakers capable of linear operation over a wide dynamic input range, while others apply nonlinear processing to the received audio signals, such as simulated surround sound and bass boost, or dynamic range compression. Furthermore, the audio output of a television may be fed into a home audio system with the loudspeakers of the television muted.
Speech enhancement systems apply signal processing algorithms, such as acoustic echo cancellation, noise suppression, and de-reverberation, to the captured (microphone) signal(s) in order to transmit a clean speech signal to the far-end call participant. The speech enhancement seeks to improve sound quality, e.g. in order to reduce listener fatigue associated with long conversations. The performance of such speech enhancement may depend on various characteristics of the involved equipment and the audio environment.
The fact that such devices are used in such a wide range of situations makes it difficult to deliver a speech enhancement system that performs consistently well. Therefore, speech enhancement systems are usually adapted/tuned during device initialization and/or runtime when the system detects poor speech enhancement performance. Most adaptation routines employ a test signal which is played back by the sound reproduction system of the connected device and recorded by the capturing device to estimate and set acoustic parameter values for the speech enhancement system.
As a simple example of a tuning routine, the measuring of the acoustic impulse response of a room may be considered. Listening environments, such as living rooms, are characterized by their reverberation time, which is defined as the time it takes an acoustic impulse response of a room to decay by a certain amount. For example, T60 denotes the amount of time for the acoustic impulse response tail of a room to decay by 60 dB.
A test signal, such as white noise, can be rendered by a device's loudspeaker and the resulting sound signal can be recorded with a microphone. An adaptive filter is then used to estimate the linear acoustic impulse response. From this impulse response, various parameters, such as T60, can be estimated and used to improve the performance of the speech enhancement system, e.g. by performing de-reverberation based on the reverberation time. As a specific example, reverberation time is often measured using an energy decay curve given as:
$$\mathrm{EDC}(t) = \int_{t}^{\infty} h^{2}(\tau)\, d\tau$$

where h(t) is the acoustic impulse response. An acoustic impulse response and its corresponding energy decay curve are shown in FIG. 1.
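As an illustration only, the energy decay curve above can be approximated for a sampled impulse response by a backward cumulative sum, and T60 can then be read off a straight-line fit to the curve in dB. The following is a minimal sketch under stated assumptions: the function names, the -5 dB to -25 dB fitting range, and the synthetic exponentially decaying impulse response are hypothetical choices made for this example and are not taken from the text.

```python
import math


def synthetic_ir(t60, fs, duration):
    """Hypothetical test input: an exponentially decaying impulse response
    whose energy falls by 60 dB in t60 seconds."""
    a = 3.0 * math.log(10.0) / t60  # so that 10*log10(h(t)^2) = -60 * t / t60
    return [math.exp(-a * n / fs) for n in range(int(duration * fs))]


def energy_decay_curve(h, fs):
    """Discrete approximation of EDC(t): backward sum of h^2 scaled by 1/fs."""
    edc = [0.0] * len(h)
    acc = 0.0
    for n in range(len(h) - 1, -1, -1):
        acc += h[n] * h[n] / fs
        edc[n] = acc
    return edc


def estimate_t60(h, fs, upper_db=-5.0, lower_db=-25.0):
    """Fit a line to the EDC (in dB, relative to its start) between upper_db
    and lower_db, then extrapolate the slope to a 60 dB decay.
    The fitting range is an assumption of this sketch."""
    edc = energy_decay_curve(h, fs)
    ref = edc[0]
    points = []
    for n, e in enumerate(edc):
        if e <= 0.0:
            break
        level = 10.0 * math.log10(e / ref)
        if lower_db <= level <= upper_db:
            points.append((n / fs, level))
    if len(points) < 2:
        raise ValueError("not enough EDC samples in the fitting range")
    mean_t = sum(t for t, _ in points) / len(points)
    mean_l = sum(l for _, l in points) / len(points)
    num = sum((t - mean_t) * (l - mean_l) for t, l in points)
    den = sum((t - mean_t) ** 2 for t, _ in points)
    slope = num / den  # dB per second, negative for a decaying response
    return -60.0 / slope
```

For instance, calling `estimate_t60(synthetic_ir(0.5, 8000, 1.0), 8000)` should return a value close to 0.5 seconds. With a measured impulse response, the result depends on the chosen fitting range and on the noise floor of the measurement.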
However, a significant problem associated with adaptation procedures based on audio test signals is that they tend to be affected by the presence of interfering sound. Specifically, if there is an interfering sound source, this will cause the captured signal to be distorted relative to the rendered audio signal thereby degrading the adaptation process.
For example, when determining an acoustic impulse response of a room, the signal captured by the microphone can be contaminated by interfering sound sources that may result in errors in the impulse response estimate, or which may even result in the impulse response estimation failing to generate any estimate (e.g. due to an adaptive filter emulating the estimated impulse response failing to converge).
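The effect described above can be reproduced in a small simulation. The sketch below identifies a short impulse response from a white-noise test signal with a normalized least-mean-squares (NLMS) adaptive filter, once with a clean microphone signal and once with added interference. NLMS is only one possible choice of adaptive filter and is an assumption of this example, as are the function names, the six-tap "room" response, the step size, and the interference level.

```python
import math
import random


def nlms_estimate(x, d, taps, mu=0.5, eps=1e-8):
    """Estimate a linear impulse response from test signal x and captured
    signal d using an NLMS adaptive filter (an assumed algorithm choice)."""
    w = [0.0] * taps
    buf = [0.0] * taps  # most recent input sample first
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))  # filter output
        e = dn - y                                   # estimation error
        g = mu * e / (sum(b * b for b in buf) + eps)
        w = [wi + g * bi for wi, bi in zip(w, buf)]
    return w


def misalignment_db(h, w):
    """Normalized squared error between true and estimated responses, in dB."""
    num = sum((a - b) ** 2 for a, b in zip(h, w))
    den = sum(a * a for a in h)
    return 10.0 * math.log10(num / den + 1e-300)


def run_demo(interference_std=0.5, n_samples=4000, seed=0):
    """Return (clean_db, interfered_db) misalignment for a hypothetical
    six-tap response; all numeric values are illustrative assumptions."""
    rng = random.Random(seed)
    h_true = [0.8, -0.4, 0.2, 0.1, -0.05, 0.02]
    x = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]  # white-noise test signal
    # Loudspeaker signal as captured by the microphone (echo only).
    echo = [
        sum(h_true[k] * x[n - k] for k in range(len(h_true)) if n - k >= 0)
        for n in range(n_samples)
    ]
    # Same capture with a local interfering source added.
    noisy = [e + rng.gauss(0.0, interference_std) for e in echo]
    w_clean = nlms_estimate(x, echo, len(h_true))
    w_noisy = nlms_estimate(x, noisy, len(h_true))
    return misalignment_db(h_true, w_clean), misalignment_db(h_true, w_noisy)
```

Calling `run_demo()` returns the misalignment for both cases. In this toy setting the clean estimate converges essentially to the true response, while the estimate obtained in the presence of interference retains a substantial error that depends on the interference level and the step size.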
Adaptation routines for audio processing, e.g. for speech enhancement systems, usually assume that only known and appropriate sound sources are present, specifically the test sounds that are used for the adaptation. For example, to tune an acoustic echo cancellation system, the signal captured by the microphone should only contain the signal produced by the loudspeaker (echo). Any local interference such as noise sources or near-end speakers in the local environment will only deteriorate the resulting performance.
As it is typically impossible to guarantee that no sound sources other than those used in the adaptation are present, it is often critical to be able to estimate whether interference is present, and if so, it is often advantageous to estimate how strong the interference is. Therefore, an interference estimate is often critical for adaptation of audio processing, and it is especially desirable if a relatively accurate interference estimate can be generated without overly complex processing. Indeed, interference estimates may be suitable for many audio processing algorithms and approaches, and accordingly there is a desire for improved approaches for determining an audio interference estimate.
Hence, an improved approach for generating an audio interference measure would be advantageous and in particular an approach allowing increased flexibility, reduced complexity, reduced resource usage, facilitated operation, improved accuracy, increased reliability and/or improved performance would be advantageous.