The present application relates to signal processing, and more specifically, but not exclusively, relates to the recovery of speech in noisy environments.
In many multi-sensor, single-source applications, noise interferes with recovering a desired speech signal from its source. Various approaches have been designed to recover sources in interference, but most of them require prior knowledge or assumptions that limit their applicability to real-world environments. Single-channel noise reduction techniques have been applied to the speech enhancement problem, one of the most common being spectral subtraction. See J. Lim and A. Oppenheim, Enhancement and bandwidth compression of noisy speech, PROC. OF THE IEEE 67, 1586-1604 (1979). Spectral subtraction reduces noise levels given estimates of the noise power spectrum and the assumption that the speech is uncorrelated with the noise; it can be effective in reducing listener fatigue, but it has not been shown to increase intelligibility. Single-source de-noising methods rely on the existence of a basis where thresholds can be used to discard or modify noisy basis elements. See D. Donoho, De-noising by soft-thresholding, IEEE TRANS. INFO. THEORY 41, 613-627 (1995).
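As a rough illustration of the spectral subtraction idea described above, the following minimal sketch applies power-spectral subtraction to a single frame of frequency bins. It assumes the noise power per bin has already been estimated (for example, from speech-free frames), and it reuses the phase of the noisy signal. The function name `spectral_subtract` and the `floor` parameter are hypothetical and are not taken from the cited reference.

```python
import cmath
import math


def spectral_subtract(noisy_bins, noise_power, floor=0.0):
    """Power-spectral subtraction on one frame (illustrative sketch).

    noisy_bins  -- complex frequency-domain samples of the noisy frame
    noise_power -- estimated noise power for each bin (assumed known)
    floor       -- fraction of the noise power kept as a lower bound,
                   so the estimated speech power never goes negative
    """
    cleaned = []
    for x, n_pow in zip(noisy_bins, noise_power):
        # Speech and noise are assumed uncorrelated, so powers subtract.
        speech_power = abs(x) ** 2 - n_pow
        speech_power = max(speech_power, floor * n_pow)
        # Keep the noisy phase; only the magnitude is modified.
        cleaned.append(cmath.rect(math.sqrt(speech_power), cmath.phase(x)))
    return cleaned
```

Bins whose power falls below the noise estimate are clamped to the floor. That clamping is one reason this kind of processing can lower the perceived noise level without improving intelligibility.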
Multiple-microphone approaches can offer speech-enhancement advantages over single-microphone methods. One such category of approaches to speech recovery in noise is beamforming. See S. Haykin, Adaptive Filter Theory, Third Edition (PRENTICE HALL, Upper Saddle River, N.J.) (1996). Fixed beamforming requires many microphones and prior knowledge or estimation of the desired source location. Beamformers such as the Minimum Variance Distortionless Response (MVDR) [See J. Capon, High-resolution frequency-wavenumber spectrum analysis, PROC. OF THE IEEE 57, 1408-1418 (1969)] beamformer require knowledge of the desired source-to-microphone channel response or a parametric representation of the response, which is often impractical in real-world applications, especially in reverberant environments. If minimum mean-squared error is desired, then the Wiener beamformer can be computed. However, the Wiener beamformer requires knowledge of the time-varying, cross-spectral densities of the speech and interference. An adaptive frequency-domain MVDR technique that accounts for non-stationarity of typical sources can also be applied, resulting in performance superior to standard beamforming approaches for such sources. See Capon. However, this adaptive beamformer requires the same prior channel knowledge as the standard MVDR beamformer.
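The dependence of the MVDR beamformer on prior channel knowledge can be seen from its standard textbook form, given here for a single frequency f. The symbols are assumptions introduced for illustration: R(f) denotes the spatial correlation matrix of the microphone signals, d(f) the source-to-microphone channel response, and w(f) the beamformer weight vector.

```latex
\mathbf{w}_{\mathrm{MVDR}}(f)
  = \arg\min_{\mathbf{w}} \; \mathbf{w}^{H}\mathbf{R}(f)\,\mathbf{w}
  \quad \text{subject to} \quad \mathbf{w}^{H}\mathbf{d}(f) = 1,
\qquad
\mathbf{w}_{\mathrm{MVDR}}(f)
  = \frac{\mathbf{R}^{-1}(f)\,\mathbf{d}(f)}
         {\mathbf{d}^{H}(f)\,\mathbf{R}^{-1}(f)\,\mathbf{d}(f)} .
```

R(f) can be estimated from the received signals, but d(f) appears explicitly in both the constraint and the solution, so it must be known or modeled in advance.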
Blind source separation (BSS) techniques offer recovery of L sources from R sensor signals (with L typically less than or equal to R) with few known parameters. A well-researched class of approaches that relies on higher-order statistics to separate the mixtures is Independent Component Analysis (ICA) [See M. Lockwood, D. Jones, R. Bilger, C. Lansing, J. W. D. O'Brien, B. Wheeler, and A. Feng, Performance of time-and frequency-domain binaural beamformers based on recorded signals from real rooms, JRNL. ACOUST. SOC. AMER. 115, 379-391 (2004)]. ICA is especially well-suited when the sources are stationary and instantaneously mixed. Convolutional mixtures can be handled in the frequency domain by applying ICA individually in each frequency bin. This approach can be used in most applications if the noise is modeled as a few distinct sources. However, recovery of the noise sources is not required in most applications, and parameters that are usually unknown are required to construct the recovery filter; a complex scale factor is required in each bin to construct the recovery filter for each source, and a permutation matrix is required to assign separated signals in each bin to a particular source.
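The scale and permutation ambiguities noted above can be written compactly for the idealized case of perfect separation in every bin. The notation is assumed for illustration only: x(f,t) is the vector of sensor signals in bin f, s(f,t) the vector of sources, W(f) the unmixing matrix found by ICA in that bin, D(f) an unknown diagonal matrix of complex scale factors, and P(f) an unknown permutation matrix.

```latex
\mathbf{y}(f,t) = \mathbf{W}(f)\,\mathbf{x}(f,t)
               = \mathbf{P}(f)\,\mathbf{D}(f)\,\mathbf{s}(f,t)
```

Because P(f) and D(f) can differ from bin to bin, the separated outputs cannot be recombined into broadband source estimates until both are resolved.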
The permutation problem has been approached by making bin-by-bin signal-to-source assignments based on local inter-frequency correlations. See T. Lee, Independent Component Analysis (KLUWER ACADEMIC PUBLISHERS, Boston, Mass.) (1998). However, errors can accumulate because decisions are made locally. Nonstationarity and second-order statistics are used in a broadband method that circumvents the scaling and permutation problem [See H. Sawada, R. Mukai, S. Araki, and S. Makino, Robust and precise method for solving the permutation problem of frequency-domain blind-source separation, IEEE TRANS. SPEECH AND AUDIO PROC. 12, 530-538 (2004)], but this method is computationally expensive. Independent vector analysis (IVA) solves the permutation problem by extending ICA to directly model and exploit the dependencies among frequency components within each source. See T. Kim, H. T. Attias, S.-Y. Lee, and T.-W. Lee, Blind source separation exploiting higher-order frequency dependencies, IEEE TRANS. AUDIO, SPEECH, AND LANGUAGE PROC. 15, 70-79 (2007). See also I. Lee and T.-W. Lee, On the assumption of spherical symmetry and sparseness for the frequency-domain speech model, IEEE TRANS. AUDIO, SPEECH, AND LANGUAGE PROC. 15, 1521-1528 (2007). However, all of these methods require the number of sources to be less than or equal to the number of microphones, which is impractical as noise often cannot be modeled as a small number of distinct sources.
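The bin-by-bin assignment based on local inter-frequency correlations, mentioned at the start of the preceding paragraph, might be sketched as follows for the two-source case. This is an illustrative simplification, not the method of any cited reference. It assumes each frequency bin already provides two separated outputs, represented by their amplitude envelopes over time, and it aligns each bin against the bin just below it. The function names are hypothetical.

```python
def correlation(a, b):
    """Pearson correlation of two equal-length sequences.

    Returns 0.0 if either sequence is constant.
    """
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5


def align_two_sources(envelopes):
    """Resolve the permutation in each bin using the neighbouring bin.

    envelopes -- list over frequency bins; each entry is a pair
                 (env_a, env_b) of amplitude envelopes over time
    Returns a list of booleans, True where the bin's outputs were swapped.
    """
    swapped = [False]  # the first bin defines the reference ordering
    prev_a, prev_b = envelopes[0]
    for env_a, env_b in envelopes[1:]:
        keep = correlation(prev_a, env_a) + correlation(prev_b, env_b)
        swap = correlation(prev_a, env_b) + correlation(prev_b, env_a)
        do_swap = swap > keep
        if do_swap:
            env_a, env_b = env_b, env_a
        swapped.append(do_swap)
        # The next decision depends only on this (possibly wrong) one.
        prev_a, prev_b = env_a, env_b
    return swapped
```

Because each decision is made relative to the previous bin only, a single wrong assignment carries over to every higher bin. This is the error accumulation noted above.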
None of these methods explicitly account for more noise sources than microphones. A combination of ICA and time-frequency masking can be used with two microphones to recover up to six sources. See M. Pederson, D. Wang, J. Larsen, and U. Kjems, Overcomplete blind source separation by combining ICA and binary time-frequency masking, (IEEE WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROC.) 15-20 (2005). However, this approach is typically not practical when the sources are mixed instantaneously, and sparse source distribution in time or frequency is needed for good reconstruction.
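Binary time-frequency masking, as referred to above, can be illustrated with a minimal sketch. It assumes two magnitude spectrograms are available (for example, two ICA outputs), stored as lists of rows indexed by time with one value per frequency bin. The function names and the `ratio` threshold are hypothetical, and the cited method involves additional steps that are not shown here.

```python
def binary_mask(target_mag, other_mag, ratio=1.0):
    """Return a 0/1 mask that is 1 where the target output dominates.

    target_mag, other_mag -- magnitude spectrograms of equal shape
    ratio                 -- how strongly the target must dominate
    """
    return [
        [1 if t > ratio * o else 0 for t, o in zip(t_row, o_row)]
        for t_row, o_row in zip(target_mag, other_mag)
    ]


def apply_mask(mask, mixture):
    """Keep only the time-frequency cells of the mixture selected by the mask."""
    return [
        [m * x for m, x in zip(m_row, x_row)]
        for m_row, x_row in zip(mask, mixture)
    ]
```

Each time-frequency cell is assigned wholly to one source, so reconstruction quality depends on the sources rarely overlapping in the same cell. This is the sparsity requirement mentioned above.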
Another way for ICA methods to recover speech in noise is to model the noise separately from the sources. Convolutive BSS for noisy mixtures was shown in H. Buchner, R. Aichner, and W. Kellermann, Convolutive blind source separation for noisy mixtures, (PROC. JOINT MTG. GERMAN FRENCH ACOUST. SOC. (CFA/DAGA) 583-584, Strasbourg, France) (2004). While this approach may be viable for one or two speech sources in noise, it is computationally expensive and relies on sparsity in time to estimate the noise correlation matrix and remove the bias caused by the noise.
Thus, while a number of advances have been made, there remains a demand for further contributions in this area of technology.