Background noise, reverberation and echo signals are typical causes of problems in systems for personal communication and in systems involving automated recognition of voice commands. Background noise and room reverberation can seriously degrade the sound quality and intelligibility of the desired speech signal. In a voice recognition system, background noise and reverberation increase the error rate. Furthermore, in some communication systems a loudspeaker system delivers a known audio signal to the environment, which is picked up by a microphone array. For example, for a voice-controlled TV set it may be desired to disregard the echo of the television sound signal delivered to the loudspeakers when capturing voice commands. Similarly, in a telephone/voice communication setup, the far-end speech signal is delivered to one or more local loudspeakers, which produce an audio signal that is picked up by the local microphones as an undesirable echo. This echo should be removed before the near-end speech signal is transmitted to the far end. A voice control system likewise benefits from the removal of echo components.
Traditional methods for addressing background noise include beamforming and single-channel noise reduction. Beamforming allows sound sources to be differentiated by employing a spatial filter, i.e. a filter whose gain depends on the spatial direction of the sound relative to the array of microphones. Multi-microphone enhancement methods can be seen as a concatenation of a beamformer algorithm and a single-channel noise reduction algorithm; multi-microphone methods can therefore perform spatial filtering in addition to the spectro-temporal filtering offered by stand-alone single-channel systems.
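As an illustration of spatial filtering, the following minimal sketch evaluates the gain of a delay-and-sum beamformer as a function of the arrival direction. The uniform linear array, the far-field plane-wave model, and all names and parameter values (steering_vector, beam_gain, four microphones at 4 cm spacing, 2 kHz) are assumptions chosen for illustration only; they are not taken from any of the methods discussed here.

```python
import numpy as np

def steering_vector(theta, n_mics, spacing, freq, c=343.0):
    """Far-field steering vector of a uniform linear array (assumed geometry).

    theta is the arrival angle in radians, measured from broadside.
    """
    positions = np.arange(n_mics) * spacing       # microphone positions in metres
    delays = positions * np.sin(theta) / c        # relative propagation delays in seconds
    return np.exp(-2j * np.pi * freq * delays)

def beam_gain(look_theta, source_theta, n_mics=4, spacing=0.04, freq=2000.0):
    """Magnitude response |w^H d| of a delay-and-sum beamformer.

    The weights w are steered towards look_theta; d is the steering vector
    of a source arriving from source_theta.
    """
    w = steering_vector(look_theta, n_mics, spacing, freq) / n_mics
    d = steering_vector(source_theta, n_mics, spacing, freq)
    return abs(np.vdot(w, d))                     # np.vdot conjugates its first argument

if __name__ == "__main__":
    for deg in (0, 30, 60, 90):
        print(deg, beam_gain(0.0, np.deg2rad(deg)))
```

Under these assumptions a source in the look direction passes with unit gain, whereas a source arriving from 90 degrees off broadside is attenuated by more than a factor of ten.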
The traditional method for echo cancellation is based on adaptively estimating the transfer functions from each loudspeaker signal to each of the microphone signals and subtracting an estimate of the echo from the microphone signals. However, certain components of the echo signals cannot be attenuated sufficiently by such methods, in particular in rooms with a long reverberation time. The part of the echo signal associated with late reverberation is often similar to ambient noise in that both sound fields are typically diffuse in nature. This is the primary reason that a multi-microphone spectral noise reduction system is also usable for removing the residual reverberant part of the echo signal.
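The adaptive estimate-and-subtract principle can be sketched for a single loudspeaker and a single microphone as follows. The normalized least-mean-squares (NLMS) update used here is one common choice of adaptive algorithm and is an assumption, since the text above does not name a specific one; the function name nlms_echo_canceller and the parameters n_taps, step and eps are likewise hypothetical.

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, n_taps=32, step=0.5, eps=1e-8):
    """Adaptively estimate the loudspeaker-to-microphone path and subtract the echo.

    far_end : known loudspeaker signal
    mic     : microphone signal containing the echo
    Returns the echo-reduced signal and the final filter estimate.
    """
    w = np.zeros(n_taps)                 # estimate of the echo path impulse response
    buf = np.zeros(n_taps)               # most recent far-end samples, newest first
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = far_end[n]
        echo_est = w @ buf               # estimated echo at the microphone
        e = mic[n] - echo_est            # residual after echo subtraction
        out[n] = e
        w += step * e * buf / (buf @ buf + eps)   # normalized LMS update
    return out, w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(4000)                        # synthetic far-end signal
    h = np.array([0.5, -0.3, 0.2, 0.1])                  # short synthetic echo path
    y = np.convolve(x, h)[:len(x)]                       # echo only, no near-end speech
    residual, w = nlms_echo_canceller(x, y)
    print(np.round(w[:4], 3))
```

A filter of finite length such as this one can only model the early part of the echo path, which is consistent with the observation above that late reverberation in the echo is not attenuated sufficiently by this approach.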
The multi-channel Wiener filter (MWF) for speech enhancement (see e.g. [3] Chapter 3.2) is an optimal linear estimator, in the mean-squared-error sense, of a target signal, given that the microphone signal consists of the target signal with additive uncorrelated noise. The MWF can be decomposed into a concatenation of a Minimum Variance Distortionless Response (MVDR) beamformer and a single-channel Wiener post-filter. While the two implementations are theoretically equivalent, the decomposed system is advantageous in practice over a brute-force implementation of the MWF. Specifically, one can exploit that the spatial signal statistics, which need to be estimated to implement the MVDR beamformer, change across time at a different (often slower) rate than the signal statistics that need to be estimated to implement the post-filter.
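The equivalence of the two forms can be checked numerically for a single frequency bin. The sketch below assumes the rank-one signal model x = d s + v, with target PSD phi_s, transfer vector d and noise covariance matrix Phi_v, and the output convention y = w^H x; these symbols and the function names are assumptions introduced for illustration.

```python
import numpy as np

def mwf_weights(phi_s, d, Phi_v):
    """Brute-force MWF: w = Phi_x^{-1} E[x s*], with Phi_x = phi_s d d^H + Phi_v."""
    Phi_x = phi_s * np.outer(d, d.conj()) + Phi_v
    return np.linalg.solve(Phi_x, phi_s * d)

def mvdr_weights(d, Phi_v):
    """MVDR beamformer: w = Phi_v^{-1} d / (d^H Phi_v^{-1} d)."""
    a = np.linalg.solve(Phi_v, d)
    return a / np.vdot(d, a)

def wiener_gain(phi_s, d, Phi_v):
    """Single-channel Wiener post-filter gain at the MVDR output."""
    # Residual noise PSD at the MVDR output is 1 / (d^H Phi_v^{-1} d).
    phi_n = 1.0 / np.vdot(d, np.linalg.solve(Phi_v, d)).real
    return phi_s / (phi_s + phi_n)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m = 4                                                  # number of microphones
    d = rng.standard_normal(m) + 1j * rng.standard_normal(m)
    A = rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m))
    Phi_v = A @ A.conj().T + np.eye(m)                     # Hermitian positive definite
    phi_s = 2.0
    direct = mwf_weights(phi_s, d, Phi_v)
    decomposed = wiener_gain(phi_s, d, Phi_v) * mvdr_weights(d, Phi_v)
    print(np.allclose(direct, decomposed))
```

In the decomposed form, mvdr_weights depends only on the spatial quantities d and Phi_v, while wiener_gain additionally depends on phi_s, so the two parts can be re-estimated at different rates.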
Most, if not all, post-filters rely on an estimate of the power spectral density (PSD) of the noise and undesired reverberation signal entering the post-filter. Considering a multi-microphone noise reduction system as a concatenation of a beamformer and a post-filter, it is obviously possible to estimate the noise PSD directly from the output signal of the beamformer, using well-known single-channel noise tracking algorithms (see e.g. [4] Section II, Eq. (1)-(3)). However, generally speaking, better performance can be obtained by taking advantage of having multiple microphone signals available when estimating the PSD of the noise entering the post-filter.
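For orientation, a single-channel noise tracker operating on the beamformer output can be sketched as recursive smoothing of the frame power followed by a sliding-window minimum search. This is a deliberately simplified stand-in for the well-known minimum-tracking family of algorithms and is not claimed to be the method of [4]; it omits bias compensation, handles a single frequency bin, and the function name track_noise_psd and the parameters alpha and window are assumptions.

```python
import numpy as np

def track_noise_psd(frame_power, alpha=0.9, window=20):
    """Track the noise PSD of one frequency bin from per-frame power values.

    frame_power : sequence of |Y(l)|^2 values at the beamformer output
    alpha       : smoothing constant of the recursive average (assumed value)
    window      : number of frames searched for the minimum (assumed value)
    """
    frame_power = np.asarray(frame_power, dtype=float)
    smoothed = np.empty_like(frame_power)
    noise = np.empty_like(frame_power)
    p = frame_power[0]                               # initialise with the first frame
    for l, y in enumerate(frame_power):
        p = alpha * p + (1.0 - alpha) * y            # recursively smoothed periodogram
        smoothed[l] = p
        start = max(0, l - window + 1)
        noise[l] = smoothed[start:l + 1].min()       # minimum over the recent frames
    return noise

if __name__ == "__main__":
    # Noise floor of 1.0 with a short burst of higher power (e.g. speech).
    power = np.concatenate([np.ones(50), 10.0 * np.ones(5), np.ones(50)])
    print(track_noise_psd(power)[45:60])
```

In this synthetic example the estimate stays at the noise floor throughout the short burst, because the search window still contains frames from before the burst.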
The idea of using multiple microphone signals for estimating the PSD of the noise that enters the post-filter is not new. In [10] (FIG. 1), Zelinski used multiple microphone signals to estimate the noise PSD observed at the microphones under the assumption that the noise sequences were uncorrelated between microphones, i.e., the inter-microphone noise covariance matrix was diagonal. McCowan [11] (FIG. 1) and Lefkimmiatis [12] (FIG. 1) replaced this often unrealistic model with a diffuse (homogeneous, isotropic) model of the noise field. More recently, Wolff [9] (FIG. 1) considered the beamformer in a generalized sidelobe canceller (GSC) structure, and used the output of a blocking matrix, combined with a voice activity detection (VAD) algorithm, to compute an estimate of the PSD of the noise entering the post-filter.
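The difference between the uncorrelated and the diffuse noise models can be made concrete through the inter-microphone coherence matrix. The sketch below uses the standard expression for a spherically isotropic field between omnidirectional microphones, sin(2 pi f r / c) / (2 pi f r / c) for microphone distance r; the function name diffuse_coherence and the example geometry are assumptions for illustration and are not taken from [10], [11] or [12].

```python
import numpy as np

def diffuse_coherence(mic_positions, freq, c=343.0):
    """Coherence matrix of a spherically isotropic (diffuse) noise field.

    mic_positions : array of shape (n_mics, 3), coordinates in metres
    freq          : frequency in Hz
    """
    pos = np.asarray(mic_positions, dtype=float)
    # Pairwise microphone distances, shape (n_mics, n_mics).
    dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    # np.sinc(x) is sin(pi x) / (pi x), hence the argument 2 f r / c.
    return np.sinc(2.0 * freq * dist / c)

if __name__ == "__main__":
    mics = [[0.0, 0.0, 0.0], [0.05, 0.0, 0.0]]      # two microphones 5 cm apart
    for f in (200.0, 1000.0, 3430.0):
        print(f, diffuse_coherence(mics, f)[0, 1])
```

Under the diffuse model the off-diagonal coherence is close to one at low frequencies for closely spaced microphones, which illustrates why a diagonal covariance matrix is often an unrealistic description of the noise field.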