Source separation has attracted attention in a variety of applications where it may be desirable to extract a set of original source signals from a set of mixed signal observations.
Source separation may find use in a wide variety of signal processing applications, such as audio signal processing, optical signal processing, speech separation, neural imaging, stock market prediction, telecommunication systems, facial recognition, and more. Where the mixing process that produces the mixed signals from the original signals is not known, the problem has commonly been referred to as blind source separation (BSS).
Independent component analysis (ICA) is an approach to the source separation problem that models the mixing process as linear mixtures of original source signals, and applies a de-mixing operation that attempts to reverse the mixing process to produce a set of estimated signals corresponding to the original source signals. Basic ICA assumes linear instantaneous mixtures of non-Gaussian source signals, with the number of mixtures equal to the number of source signals. Because the original source signals are assumed to be independent, ICA estimates the original source signals by using statistical methods to extract a set of independent (or at least maximally independent) signals from the mixtures.
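By way of illustration only, the basic ICA model described above can be sketched in Python with NumPy for the two-source case. The mixing matrix, the uniformly distributed sources, and the kurtosis-based rotation search are assumptions chosen for brevity; this is a toy sketch of the general idea and not a reproduction of any particular ICA implementation.

```python
import numpy as np

def rotation(theta):
    """2x2 rotation matrix for angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def excess_kurtosis(y):
    """Excess kurtosis of each row; rows are assumed zero-mean, unit-variance."""
    return np.mean(y ** 4, axis=1) - 3.0

def mix_and_separate(n_samples=5000, seed=0):
    rng = np.random.default_rng(seed)
    # Two independent, non-Gaussian (uniform) source signals.
    sources = rng.uniform(-1.0, 1.0, size=(2, n_samples))
    # Hypothetical instantaneous mixing matrix (an assumption for this sketch).
    mixing = np.array([[1.0, 0.6], [0.4, 1.0]])
    mixtures = mixing @ sources

    # Center and whiten the mixtures so the remaining unknown is an
    # orthogonal transform.
    centered = mixtures - mixtures.mean(axis=1, keepdims=True)
    eigval, eigvec = np.linalg.eigh(np.cov(centered))
    whitener = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    white = whitener @ centered

    # Pick the rotation whose outputs are the most non-Gaussian, using the
    # summed absolute excess kurtosis as a simple independence proxy.
    angles = np.linspace(0.0, np.pi / 2, 181)
    scores = [np.sum(np.abs(excess_kurtosis(rotation(t) @ white))) for t in angles]
    best = angles[int(np.argmax(scores))]
    estimates = rotation(best) @ white
    return sources, estimates
```

In this sketch the estimates match the sources only up to ordering, sign, and scale, which are the usual ambiguities of ICA.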
While conventional ICA approaches for simplified, instantaneous mixtures in the absence of noise can give very good results, real world source separation applications often need to account for a more complex mixing process created by real world environments. A common example of the source separation problem as it applies to speech separation is demonstrated by the well-known “cocktail party problem,” in which several persons are speaking in a room and an array of microphones is used to detect speech signals from the separate speakers. The goal of ICA would be to extract the individual speech signals of the speakers from the mixed observations detected by the microphones; however, the mixing process may be complicated by a variety of factors, including noise, music, moving sources, room reverberations, echoes, and the like. In this manner, each microphone in the array may detect a unique mixed signal that contains a mixture of the original source signals (i.e., the mixed signal that is detected by each microphone in the array includes a mixture of the separate speakers' speech), but the mixed signals may not be simple instantaneous mixtures of just the sources. Rather, the mixtures can be convolutive mixtures, resulting from room reverberations and echoes (e.g., speech signals bouncing off room walls), and may include any of the complications to the mixing process mentioned above.
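As a rough sketch of what a convolutive mixture looks like in practice, the following Python example filters each source with a per-microphone impulse response before summing. The function name `convolutive_mix` and the array layout are hypothetical, and real room responses would be far longer than anything used in a toy example.

```python
import numpy as np

def convolutive_mix(sources, impulse_responses):
    """Simulate microphone observations of convolutively mixed sources.

    sources:           array of shape (n_sources, n_samples)
    impulse_responses: array of shape (n_mics, n_sources, n_taps), where
                       entry [i, j] is the (assumed) response from source j
                       to microphone i
    returns:           array of shape (n_mics, n_samples + n_taps - 1)
    """
    n_mics, n_sources, n_taps = impulse_responses.shape
    n_out = sources.shape[1] + n_taps - 1
    mixed = np.zeros((n_mics, n_out))
    for i in range(n_mics):
        for j in range(n_sources):
            # Each microphone hears every source filtered by its own path.
            mixed[i] += np.convolve(sources[j], impulse_responses[i, j])
    return mixed
```

With single-tap impulse responses this reduces to the instantaneous mixing assumed by basic ICA; additional taps model delayed and attenuated reflections.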
Mixed signals to be used for source separation can initially be time domain representations of the mixed observations (e.g., in the cocktail party problem mentioned above, they would be mixed audio signals as functions of time). ICA processes have been developed to perform the source separation on time-domain signals from convolutive mixed signals and can give good results; however, the separation of convolutive mixtures of time domain signals can be very computationally intensive, requiring substantial time and processing resources and thus prohibiting its effective utilization in many common real world ICA applications.
A much more computationally efficient algorithm can be implemented by extracting frequency data from the observed time domain signals. In doing this, the convolution operation in the time domain is replaced by a more computationally efficient multiplication operation in the frequency domain. A Fourier-related transform, such as a short-time Fourier transform (STFT), can be performed on the time-domain data in order to generate frequency representations of the observed mixed signals and populate frequency bins, whereby the STFT converts the time domain signals into the time-frequency domain. An STFT can generate a spectrogram for each time segment analyzed, providing information about the intensity of each frequency bin at each time instant in a given time segment.
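The two ideas in the preceding paragraph can be sketched in Python with NumPy. The first function checks that convolution in the time domain corresponds to multiplication in the frequency domain; the second is a minimal STFT. The function names, the Hann window, and the frame and hop sizes are assumptions made for illustration only.

```python
import numpy as np

def fft_convolve(signal, impulse_response):
    """Linear convolution computed as a product of spectra."""
    n = len(signal) + len(impulse_response) - 1  # zero-pad to avoid wrap-around
    spectrum = np.fft.rfft(signal, n) * np.fft.rfft(impulse_response, n)
    return np.fft.irfft(spectrum, n)

def stft(signal, frame_len=256, hop=128):
    """Minimal STFT: returns an array of shape (frame_len // 2 + 1, n_frames).

    Rows are frequency bins and columns are time frames; the magnitude of
    this array is a spectrogram.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    return np.fft.rfft(frames, axis=1).T
```

For example, `np.abs(stft(x))` gives the intensity of each frequency bin in each frame of a one-dimensional signal `x`. The product form in `fft_convolve` is exact only with sufficient zero padding; within short STFT frames it holds approximately when the mixing filters are short relative to the frame length.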
Although the STFT is referred to herein as an example of a Fourier-related transform, the term “Fourier-related transform” is not so limited. In general, the term “Fourier-related transform” refers to a linear transform of functions related to Fourier analysis. Such transformations map a function to a set of coefficients of basis functions, which are typically sinusoidal and are therefore strongly localized in the frequency spectrum. Examples of Fourier-related transforms applied to continuous arguments include the Laplace transform, the two-sided Laplace transform, the Mellin transform, Fourier transforms including Fourier series and sine and cosine transforms, the short-time Fourier transform (STFT), the fractional Fourier transform, the Hartley transform, the Chirplet transform and the Hankel transform. Examples of Fourier-related transforms applied to discrete arguments include the discrete Fourier transform (DFT), the discrete time Fourier transform (DTFT), the discrete sine transform (DST), the discrete cosine transform (DCT), regressive discrete Fourier series, discrete Chebyshev transforms, the generalized discrete Fourier transform (GDFT), the Z-transform, the modified discrete cosine transform, the discrete Hartley transform, the discretized STFT, and the Hadamard transform (or Walsh function). The transformation of a time domain signal to a spectrum domain representation can also be done by means of wavelet analysis or functional analysis applied to a single-dimension time domain speech signal; for simplicity, such a transformation is still referred to herein as a Fourier-related transform.
Traditional approaches to frequency domain ICA involve performing the independent component analysis at each frequency bin (i.e., independence of the same frequency bin between different signals will be maximized).
Unfortunately, this approach inherently suffers from a well-known permutation problem, which can cause estimated frequency bin data of the source signals to be grouped with incorrect sources. As such, when resulting time domain signals are reproduced from the frequency domain signals (such as by an inverse STFT), each estimated time domain signal that is produced from the separation process may contain frequency data from incorrect sources.
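The effect of the permutation problem can be illustrated with a small Python simulation. No ICA is actually run here; instead, the sketch assumes each bin has been perfectly separated and simply assigns the outputs of each bin in an arbitrary order, which is the ambiguity that independent per-bin processing leaves unresolved. The function name and array layout are hypothetical.

```python
import numpy as np

def simulate_bin_permutation(spectrograms, seed=0):
    """Scramble source order independently in every frequency bin.

    spectrograms: array of shape (n_sources, n_bins, n_frames)
    returns:      (scrambled, orders), where orders[f] lists which true
                  source ended up in each output slot at bin f
    """
    rng = np.random.default_rng(seed)
    n_sources, n_bins, _ = spectrograms.shape
    scrambled = np.empty_like(spectrograms)
    orders = np.empty((n_bins, n_sources), dtype=int)
    for f in range(n_bins):
        # Per-bin separation has no way to know which output is which source.
        orders[f] = rng.permutation(n_sources)
        scrambled[:, f, :] = spectrograms[orders[f], f, :]
    return scrambled, orders
```

After scrambling, each output spectrogram contains bins drawn from several true sources, so an inverse transform of any single output would yield a signal that blends frequency content from different sources.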
Various approaches to solving the misalignment of frequency bins in source separation by frequency domain ICA have been proposed. However, to date none of these approaches achieve high enough performance in real world noisy environments to make them an attractive solution for acoustic source separation applications.
Conventional approaches include performing frequency domain ICA at each frequency bin as described above and applying post-processing that involves correcting the alignment of frequency bins by various methods. However, these approaches can suffer from inaccuracies and poor performance in the correcting step. Additionally, because these processes require an additional processing step after the initial ICA separation, processing time and computing resources required to produce the estimated source signals are greatly increased.
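As a hedged illustration of what such a post-processing step might look like, the sketch below realigns bins for the two-source case by comparing amplitude envelopes across bins. This envelope-matching heuristic is only one commonly described idea and is an assumption of this example, not a method attributed to any specific prior approach; the function name is hypothetical.

```python
import numpy as np

def align_bins_by_envelope(separated):
    """Greedy bin alignment for two sources using amplitude envelopes.

    separated: array of shape (2, n_bins, n_frames) holding per-bin
               separated outputs whose source order may differ per bin
    returns:   a copy in which each bin is swapped, if needed, so that its
               envelopes best match the running sum of earlier bins
    """
    if separated.shape[0] != 2:
        raise ValueError("this sketch only handles the two-source case")
    aligned = np.array(separated, copy=True)
    reference = np.abs(aligned[:, 0, :]).astype(float)
    for f in range(1, aligned.shape[1]):
        envelope = np.abs(aligned[:, f, :])
        keep_score = np.sum(reference * envelope)
        swap_score = np.sum(reference * envelope[::-1])
        if swap_score > keep_score:
            # Swapped order matches the reference better, so swap this bin.
            aligned[:, f, :] = aligned[::-1, f, :].copy()
            envelope = envelope[::-1]
        reference = reference + envelope
    return aligned
```

Even in this simplified form, the alignment is an extra pass over every frequency bin after separation, and it can fail when the sources have similar activity patterns, which reflects the accuracy and cost concerns noted above.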
Other approaches attempt to address the permutation problem more directly by performing the ICA at all frequency bins collectively. One such approach is disclosed in Hiroe, U.S. Pat. No. 7,797,153 (hereinafter Hiroe), the entire disclosure of which is herein incorporated by reference. Hiroe discloses a method in which the ICA calculations are performed on entire spectrograms as opposed to individual frequency bins, thereby attempting to prevent the permutation problem that occurs when ICA is performed at each frequency bin. Hiroe sets up a score function that uses a multivariate probability density function (PDF) to account for the relationship between frequency bins in the separation process.
However, because the approaches of Hiroe above model the relationship between frequency bins with a single multivariate PDF, they fail to account for the different statistical properties of different sources as well as a change in the statistical properties of a source signal over time. As a result, they suffer from poor performance when attempting to analyze a wide time frame. Furthermore, the approaches are generally unable to effectively analyze multi-source speech signals (i.e., multiple speakers in the same location at the same time), because the underlying single PDF cannot adequately model both sources.
To date, known approaches to frequency domain ICA suffer from one or more of the following drawbacks: inability to accurately align frequency bins with the appropriate source, the need for a post-processing step that requires extra time and processing resources, poor performance (i.e., poor signal to noise ratio), inability to efficiently analyze multi-source speech, the need for position information for the microphones, and the need to limit the time frame to be analyzed.
For the foregoing reasons, there is a need for methods and apparatus that can efficiently implement frequency domain independent component analysis to produce estimated source signals from a set of mixed signals without the aforementioned drawbacks. It is within this context that a need for the present invention arises.