Source separation addresses the issue of recovering source signals from the observation of distinct mixtures of these sources. Conventional approaches to source separation typically assume that the sources are linearly mixed. Conventional approaches are also usually blind, in the sense that they assume that no detailed information (or nearly no detailed information in a semi-blind approach) about the statistical properties of the sources is known and can be explicitly taken advantage of in the separation process. The approach disclosed in J. F. Cardoso, “Blind Signal Separation: Statistical Principles,” Proceedings of the IEEE, vol. 86, no. 10, pp. 2009–2025, Oct. 1998, the disclosure of which is incorporated by reference herein, is an example of a source separation approach that assumes a linear mixture and that is blind.
An approach disclosed in A. Acero et al., “Speech/Noise Separation Using Two Microphones and a VQ Model of Speech Signals,” Proceedings of ICSLP 2000, the disclosure of which is incorporated by reference herein, proposes a source separation technique that uses a priori information about the probability density function (pdf) of the sources. However, since the technique operates in the Linear Predictive Coefficient (LPC) domain, which results from a linear transformation of the waveform domain, the technique assumes that the observed mixture is linear. Therefore, the technique cannot be used in the case of non-linear mixtures.
However, there are cases where the observed mixtures are not linear and where a priori information about the statistical properties of the sources is reliably available. This is the case, for example, in speech applications requiring the separation of mixed audio sources. Examples of such speech applications may be speech recognition in the presence of competing speech, interfering music or specific noise sources, e.g., car or street noise.
Even though the audio sources can be assumed to be linearly mixed in the waveform domain, the linear mixtures of waveforms result in non-linear mixtures in the cepstral domain, which is the domain where speech applications usually operate. As is known, a cepstrum is a vector that is computed by the front end of a speech recognition system from the log-spectrum of a segment of speech waveform, see, e.g., L. Rabiner et al., “Fundamentals of Speech Recognition,” chapter 3, Prentice Hall Signal Processing Series, 1993, the disclosure of which is incorporated by reference herein.
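The effect of the log-spectrum on a linear mixture can be illustrated with a minimal Python sketch. It is not the front end of any particular recognizer: the `cepstrum` function below is a simplified real cepstrum (inverse DFT of the log power spectrum, with no mel filter bank), and the two 16-sample toy sources, the function names, and the small numerical floor are all assumptions made for illustration.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of a short real-valued frame."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]


def cepstrum(frame, floor=1e-12):
    """Simplified real cepstrum: inverse DFT of the log power spectrum.

    Assumption: no mel filter bank or DCT, unlike a real recognizer front end.
    The small floor only guards against log(0).
    """
    log_power = [math.log(abs(s) ** 2 + floor) for s in dft(frame)]
    n = len(log_power)
    return [
        sum(v * cmath.exp(2j * math.pi * k * q / n) for k, v in enumerate(log_power)).real / n
        for q in range(n)
    ]


def additivity_gap(source_a, source_b):
    """Largest |cepstrum(a + b) - (cepstrum(a) + cepstrum(b))| over all coefficients."""
    mixture = [a + b for a, b in zip(source_a, source_b)]
    c_mix = cepstrum(mixture)
    c_a = cepstrum(source_a)
    c_b = cepstrum(source_b)
    return max(abs(m - (a + b)) for m, a, b in zip(c_mix, c_a, c_b))


# Two toy 16-sample "sources": a unit impulse and a delayed, attenuated impulse.
source_a = [1.0] + [0.0] * 15
source_b = [0.0, 0.5] + [0.0] * 14

gap = additivity_gap(source_a, source_b)
print(f"largest deviation from additivity: {gap:.3f}")
```

The spectrum of the mixture is the sum of the source spectra, but the cepstrum of the mixture is not the sum of the source cepstra, so the printed deviation is clearly non-zero.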
Because of this log-transformation, a linear mixture of waveform signals results in a non-linear mixture of cepstral signals. However, it is computationally advantageous in speech applications to perform source separation in the cepstral domain, rather than in the waveform domain. Indeed, the stream of cepstra corresponding to a speech utterance is computed from successive overlapping segments of the speech waveform. Segments are usually about 100 milliseconds (ms) long, and the shift between two adjacent segments is about 10 ms. Therefore, a separation process operating in the cepstral domain on 11 kilohertz (kHz) speech data only needs to be applied once every 110 samples, as compared with the waveform domain, where the separation process must be applied at every sample.
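The figure of 110 samples follows directly from the sampling rate and the segment shift. The short Python sketch below reproduces that arithmetic. The function names are hypothetical, and the one-second utterance is an assumed example rather than a value taken from the text.

```python
def samples_per_shift(sample_rate_hz, shift_ms):
    """Number of waveform samples between the starts of two adjacent segments."""
    return round(sample_rate_hz * shift_ms / 1000)


def num_segments(num_samples, sample_rate_hz, segment_ms, shift_ms):
    """Number of complete overlapping segments that fit in a waveform."""
    segment_len = round(sample_rate_hz * segment_ms / 1000)
    shift = samples_per_shift(sample_rate_hz, shift_ms)
    if num_samples < segment_len:
        return 0
    return 1 + (num_samples - segment_len) // shift


# 11 kHz speech with a 10 ms shift: one cepstral vector every 110 samples.
print(samples_per_shift(11000, 10))          # 110

# Assumed example: one second of 11 kHz speech, 100 ms segments, 10 ms shift.
print(num_segments(11000, 11000, 100, 10))   # 91
```

In this assumed example, a separation process in the cepstral domain would run 91 times for one second of speech, whereas a sample-by-sample process in the waveform domain would run 11,000 times.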
Further, the pdf of speech, as well as the pdf of many possible interfering audio signals (e.g., competing speech, music, specific noise sources, etc.), can be reliably modeled in the cepstral domain and integrated in the separation process. The pdf of speech in the cepstral domain is estimated for recognition purposes, and the pdf of the interfering sources can be estimated off-line on representative sets of data collected from similar sources.
An approach disclosed in S. Deligne and R. Gopinath, “Robust Speech Recognition with Multi-channel Codebook Dependent Cepstral Normalization (MCDCN),” Proceedings of ASRU2001, 2001, the disclosure of which is incorporated by reference herein, proposes a source separation technique that integrates a priori information about the pdf of at least one of the sources, and that does not assume a linear mixture. In this approach, unwanted source signals interfere with a desired source signal. It is assumed that a mixture of the desired signal and of the interfering signals is recorded in one channel, while the interfering signals alone (i.e., without the desired signal) are recorded in a second channel, forming a so-called reference signal. In many cases, however, a reference signal is not available. For example, in the context of an automotive speech recognition application with competing speech from the car passengers, it is not possible to separately capture the speech of the user of the speech recognition system (e.g., the driver) and the competing speech of the other passengers in the car.
Accordingly, there is a need for source separation techniques which overcome the shortcomings and disadvantages associated with conventional source separation techniques.