This invention relates to separating source signals, and in particular relates to separating multiple audio sources in a multiple-microphone system.
Multiple sound sources may be present in an environment in which audio signals are received by multiple microphones. Localizing, separating, and/or tracking the sources can be useful in a number of applications. For example, in a multiple-microphone hearing aid, one of multiple sources may be selected as the desired source whose signal is provided to the user of the hearing aid. The better the desired source is isolated in the microphone signals, the better the user's perception of the desired signal, potentially providing higher intelligibility, lower fatigue, etc.
One broad approach to separating a signal from a source of interest using multiple microphone signals is beamforming, which uses multiple microphones separated by distances on the order of a wavelength or more to provide directional sensitivity to the microphone system. However, beamforming approaches may be limited, for example, by inadequate separation of the microphones.
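The dependence of directional sensitivity on microphone spacing can be illustrated with a minimal sketch of a delay-and-sum beamformer on a uniform linear array. The function name `array_response`, the far-field plane-wave model, and the speed of sound of 343 m/s are assumptions made for illustration only; they are not taken from any particular beamforming system.

```python
import numpy as np

def array_response(n_mics, spacing_m, freq_hz, steer_deg, look_deg, c=343.0):
    """Magnitude response of a delay-and-sum beamformer (hypothetical helper).

    The array is steered toward steer_deg and probed with a far-field
    plane wave arriving from look_deg (angles measured from broadside).
    """
    k = 2 * np.pi * freq_hz / c              # wavenumber in rad/m
    pos = np.arange(n_mics) * spacing_m      # microphone positions on a line

    def steering(deg):
        # Relative phase of the plane wave at each microphone.
        return np.exp(-1j * k * pos * np.sin(np.deg2rad(deg)))

    w = steering(steer_deg) / n_mics         # delay-and-sum weights
    return np.abs(np.vdot(w, steering(look_deg)))  # |w^H a(look)|

# Two microphones at 1 kHz (wavelength about 0.343 m), steered to broadside.
wide = array_response(2, 0.1715, 1000.0, 0.0, 90.0)   # half-wavelength spacing
close = array_response(2, 0.01, 1000.0, 0.0, 90.0)    # closely spaced pair
```

Under these assumptions, the half-wavelength pair strongly attenuates a source arriving from the end-fire direction, while the closely spaced pair passes it almost unattenuated, which is one way inadequate microphone separation limits a beamformer.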
Interaural (including inter-microphone) phase differences (IPDs) have been used for source separation from a collection of acquired signals. It has been shown that blind source separation is possible using just IPDs and interaural level differences (ILDs) with the Degenerate Unmixing Estimation Technique (DUET). DUET relies on the condition that the sources to be separated exhibit W-disjoint orthogonality, meaning that the energy in each time-frequency bin of the mixture's Short-Time Fourier Transform (STFT) is assumed to be dominated by a single source. Under this condition, the mixture STFT can be partitioned into disjoint sets of bins, one set per source, such that only the bins assigned to the jth source are used to reconstruct that source. In theory, as long as the sources are W-disjoint orthogonal, perfect separation can be achieved. In practice, good separation can be achieved even though speech signals are only approximately W-disjoint orthogonal.
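The partitioning of time-frequency bins can be sketched as follows. This is a simplified illustration in the spirit of DUET, not the published algorithm: it assumes the per-source attenuation and delay parameters are already known, whereas DUET estimates them from the mixtures. The function name `duet_style_masks` and all parameter names are hypothetical.

```python
import numpy as np

def duet_style_masks(X1, X2, omega, atten, delays):
    """Assign each time-frequency bin to one source from level/phase cues.

    X1, X2 : complex STFTs of the two microphone signals, shape (F, T)
    omega  : angular frequency of each STFT row, shape (F,)
    atten  : assumed relative attenuation of each source at mic 2, shape (J,)
    delays : assumed relative delay of each source at mic 2, shape (J,)
    Returns boolean masks of shape (J, F, T), one disjoint mask per source.
    """
    eps = 1e-12
    ratio = X2 / (X1 + eps)  # magnitude carries the ILD, angle carries the IPD
    # Expected ratio if source j alone occupied the bin: a_j * exp(-i * w * d_j)
    model = atten[:, None, None] * np.exp(
        -1j * omega[None, :, None] * delays[:, None, None])
    dist = np.abs(ratio[None, :, :] - model)      # shape (J, F, T)
    labels = np.argmin(dist, axis=0)              # closest source per bin
    return labels[None, :, :] == np.arange(len(atten))[:, None, None]
```

Multiplying a mask by `X1` and inverting the STFT would then give an estimate of the corresponding source. If the sources are exactly W-disjoint orthogonal and the parameters are correct, each bin's ratio matches exactly one model value, so the masks recover the true partition.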
Source separation from a single acquired signal (i.e., from a single microphone), for instance an audio signal, has been addressed using the structure of a desired signal by decomposing a time versus frequency representation of the signal. One such approach uses a non-negative matrix factorization of the non-negative entries of a time versus frequency matrix representation (e.g., an energy distribution) of the signal. One product of such an analysis can be a time versus frequency mask (e.g., a binary mask) which can be used to extract a signal that approximates a source signal of interest (i.e., a signal from a desired source). Similar approaches have been developed based on modeling of a desired source using a mixture model where the frequency distribution of a source's signal is modeled as a mixture of a set of prototypical spectral characteristics (e.g., distribution of energy over frequency).
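A minimal sketch of the non-negative matrix factorization approach follows. It uses the well-known multiplicative update rules for the squared-error objective; the function names `nmf` and `component_mask`, the iteration count, and the 0.5 threshold for the binary mask are illustrative assumptions rather than details of any specific prior system.

```python
import numpy as np

def nmf(V, rank, n_iter=200, seed=0):
    """Factor a non-negative matrix V (F x T) as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + 1e-3   # columns: prototypical spectra
    H = rng.random((rank, T)) + 1e-3   # rows: activations over time
    eps = 1e-12
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def component_mask(W, H, components, binary=False):
    """Time-frequency mask for the chosen components (hypothetical helper)."""
    eps = 1e-12
    target = W[:, components] @ H[components, :]
    mask = target / (W @ H + eps)      # share of modeled energy, in [0, 1]
    return mask > 0.5 if binary else mask
```

Here `V` would be, for example, the magnitude or energy of an STFT. Applying the returned mask elementwise to the mixture's time-frequency representation yields an approximation of the source modeled by the selected components. Which components belong to the desired source must be decided separately, for instance from prototypes learned on clean examples.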
In some techniques, “clean” examples of a source's signal are used to determine characteristics (e.g., estimates of the prototypical spectral characteristics), which are then used in identifying the source's signal in a degraded (e.g., noisy) signal. In other techniques, “unsupervised” approaches estimate the prototypical characteristics from the degraded signal itself, and “semi-supervised” approaches adapt previously determined prototypes using the degraded signal.
Approaches to separation of sources from a single acquired signal where two or more sources are present have used similar decomposition techniques. In some such approaches, each source is associated with a different set of prototypical spectral characteristics. A multiple-source signal is then analyzed to determine which time/frequency components are associated with a source of interest, and that portion of the signal is extracted as the desired signal.
As with separation of a single source from a single acquired signal, some approaches to multiple-source separation using prototypical spectral characteristics make use of unsupervised analysis of a signal (e.g., using the Expectation-Maximization (EM) Algorithm, or variants including joint Hidden Markov Model training for multiple sources), for instance to fit a parametric probabilistic model to one or more of the signals.
Other approaches to forming time-frequency masks have also been used for upmixing audio and for selection of desired sources using “audio scene analysis” and/or prior knowledge of the characteristics of the desired sources.