Capturing audio, and in particularly speech, has become increasingly important in the last decades. Indeed, capturing speech has become increasingly important for a variety of applications including telecommunication, teleconferencing, gaming, audio user interfaces, etc. However, a problem in many scenarios and applications is that the desired speech source is typically not the only audio source in the environment. Rather, in typical audio environments there are many other audio/noise sources which are being captured by the microphone. One of the critical problems facing many speech capturing applications is that of how to best extract speech in a noisy environment. In order to address this problem a number of different approaches for noise suppression have been proposed.
Indeed, research in e.g. hands-free speech communication systems is a topic that has received much interest for decades. The first commercial systems available focused on professional (video) conferencing systems in environments with low background noise and low reverberation time. A particularly advantageous approach for identifying and extracting desired audio sources, such as e.g. a desired speaker, was found to be the use of beamforming based on signals from a microphone array. Initially, microphone arrays were often used with a focused fixed beam but later the use of adaptive beams became more popular.
In the late 1990's, hands-free systems for mobiles started to be introduced. These were intended to be used in many different environments, including reverberant rooms and at high(er) background noise levels. Such audio environments provide substantially more difficult challenges, and in particular may complicate or degrade the adaptation of the formed beam.
Initially, research in audio capture for such environments focused on echo cancellation, and later on noise suppression. An example of an audio capture system based on beamforming is illustrated in FIG. 1. In the example, an array of a plurality of microphones 101 are coupled to a beamformer 103 which generates an audio source signal z(n) and one or more noise reference signal(s) x(n).
The microphone array 101 may in some embodiments comprise only two microphones but will typically comprise a higher number.
The beamformer 103 may specifically be an adaptive beamformer in which one beam can be directed towards the speech source using a suitable adaptation algorithm.
For example, U.S. Pat. Nos. 7,146,012 and 7,602,926 discloses examples of adaptive beamformers that focus on the speech but also provides a reference signal that contains (almost) no speech.
Alternatively, U.S. 2014/278394 discloses beams that can be controlled and modified depending on various parameters including speech recognition results. The parameters used to control and modify the beams are all based or derived from output signals of the beams.
The beamformer creates an enhanced output signal, z(n), by adding the desired part of the microphone signals coherently by filtering the received signals in forward matching filters and adding the filtered outputs. Also, the output signal is filtered in backward adaptive filters having conjugate filter responses to the forward filters (in the frequency domain corresponding to time inversed impulse responses in the time domain). Error signals are generated as the difference between the input signals and the outputs of the backward adaptive filters, and the coefficients of the filters are adapted to minimize the error signals thereby resulting in the audio beam being steered towards the dominant signal. The generated error signals x(n) can be considered as noise reference signals which are particularly suitable for performing additional noise reduction on the enhanced output signal z(n).
The primary signal z(n) and the reference signal x(n) are typically both contaminated by noise. In case the noise in the two signals is coherent (for example when there is an interfering point noise source), an adaptive filter 105 can be used to reduce the coherent noise.
For this purpose, the noise reference signal x(n) is coupled to the input of the adaptive filter 105 with the output being subtracted from the audio source signal z(n) to generate a compensated signal r(n). The adaptive filter 105 is adapted to minimize the power of the compensated signal r(n), typically when the desired audio source is not active (e.g. when there is no speech) and this results in the suppression of coherent noise.
The compensated signal is fed to a post-processor 107 which performs noise reduction on the compensated signal r(n) based on the noise reference signal x(n). Specifically, the post-processor 107 transforms the compensated signal r(n) and the noise reference signal x(n) to the frequency domain using a short-time Fourier transform. It then, for each frequency bin, modifies the amplitude of R(ω) by subtracting a scaled version of the amplitude spectrum of X(ω). The resulting complex spectrum is transformed back to the time domain to yield the output signal q(n) in which noise has been suppressed. This technique of spectral subtraction was first described in S. F. Boll, “Suppression of Acoustic Noise in Speech using Spectral Subtraction,” IEEE Trans. Acoustics, Speech and Signal Processing, vol. 27, pp. 113-120, April 1979.
Although the system of FIG. 1 provides very efficient operation and advantageous performance in many scenarios, it is not optimum in all scenarios. Indeed, whereas many conventional systems, including the example of FIG. 1, provide very good performance when the desired audio source/speaker is within the reverberation radius of the microphone array, i.e. for applications where the direct energy of the desired audio source is (preferably significantly) stronger than the energy of the reflections of the desired audio source, it tends to provide less optimum results when this is not the case. In typical environments, it has been found that a speaker typically should be within 1-1.5 meter of the microphone array.
However, there is a strong desire for audio based hands-free solutions, applications, and systems where the user may be at further distances from the microphone array. This is for example desired both for many communication and for many voice control systems and applications. Systems providing speech enhancement including dereverberation and noise suppression for such situations are in the field referred to as super hands-free systems.
In more detail, when dealing with additional diffuse noise and a desired speaker outside the reverberation radius the following problems may occur:                The beamformer may often have problems distinguishing between echoes of the desired speech and diffuse background noise, resulting in speech distortion.        The adaptive beamformer may converge slower towards the desired speaker. During the time when the adaptive beam has not yet converged, there will be speech leakage in the reference signal, resulting in speech distortion in case this reference signal is used for non-stationary noise suppression and cancellation. The problem increases when there are more desired sources that talk after each other.        
A solution to deal with slower converging adaptive filters (due to the background noise) is to supplement this with a number of fixed beams being aimed in different directions as illustrated in FIG. 2. However, this approach is particularly developed for scenarios wherein a desired audio source is present within the reverberation radius. It may be less efficient for audio sources outside the reverberation radius and may often lead to non-robust solutions in such cases, especially if there is also acoustic diffuse background noise.
This can be understood as follows: in case the desired audio source is outside the reverberation radius, the energy of the direct sound field is small when compared to the energy of the diffuse sound field created from reflections. The direct sound field to diffuse sound field ratio will further degrade if there is also diffuse background noise. The energies of the different beams will be approximately the same and accordingly this does not provide a suitable parameter for controlling the beamformers. For the same reason, a system based on measuring the Direction Of Arrival (DOA) will not be robust: due to the low energy of the direct field, cross-correlating the signals will not give a sharp distinct peak and will result in large errors. Making the detectors more robust will often result in no detections of desired audio source leading to non-focused beams. The typical result is speech leakage in the noise reference, and severe distortion will occur if it is attempted to reduce the noise in the primary signal based on the noise reference signal.
Hence, an improved audio capture approach would be advantageous, and in particular an approach allowing reduced complexity, increased flexibility, facilitated implementation, reduced cost, improved audio capture, improved suitability for capturing audio outside the reverberation radius, reduced noise sensitivity, improved speech capture, and/or improved performance would be advantageous.