Robustness in the presence of noise and, more generally, of interfering signals is a crucial issue in speech recognition, especially where performance in a real-world environment is concerned. In cases where the signal interfering with the speech is stationary and where its characteristics are known in advance, robustness issues can, to a certain extent, be addressed during the training of the system. In particular, the acoustic model of the speech recognition system can be trained on a representative collection of noisy data; this approach is known as “multi-style training” and has been shown to reduce the degradation of recognition accuracy in the presence of noise.
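By way of illustration only, the data preparation behind multi-style training can be sketched as follows. The sketch assumes that the noise is available as a separate recording and that noisy training data is produced by adding it to clean speech at chosen signal-to-noise ratios (SNRs); the function names `power`, `mix_at_snr` and `make_multistyle_set`, and the choice of SNR levels, are hypothetical and are not taken from any particular recognition system.

```python
import math
import random

def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)

def mix_at_snr(clean, noise, snr_db):
    """Return clean + scaled noise so that the mixture has the requested SNR (dB)."""
    if len(noise) < len(clean):
        raise ValueError("noise must be at least as long as the clean signal")
    noise = noise[:len(clean)]
    # Gain g such that power(clean) / power(g * noise) == 10 ** (snr_db / 10).
    gain = math.sqrt(power(clean) / (power(noise) * 10 ** (snr_db / 10)))
    return [c + gain * n for c, n in zip(clean, noise)]

def make_multistyle_set(clean_utterances, noise, snr_levels_db):
    """Collect every clean utterance plus one noisy copy per SNR level."""
    training_set = []
    for utt in clean_utterances:
        training_set.append(utt)  # keep the clean version as well
        for snr_db in snr_levels_db:
            training_set.append(mix_at_snr(utt, noise, snr_db))
    return training_set
```

Such a scheme only helps to the extent that the noise used in training resembles the noise met at recognition time, which is why it presupposes stationary noise with known characteristics.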
However, in most applications, the signal corrupting the speech is neither known in advance nor stationary (for example, music or speech from competing speakers). Such cases typically cannot be handled by devising special training schemes, and they tend to require the use of on-line adaptive algorithms.
Particular needs have been recognized in connection with separating a speech signal from an interfering signal (e.g., non-stationary noise, music, competing speech) in the case where a recording of the interfering signal is available in a second channel. The signal contained in this second channel is called the reference signal. This situation arises in a variety of contexts, such as:
    when the speech signal is corrupted by the sound emitted by a radio or a CD player (the reference signal is recorded at the output of the radio or CD player),
    in telephony applications where the speech prompt synthesized by the speech server interferes with the speech of the user (the reference signal is the recording of the prompt), or
    when the speech signal is mixed with the speech of a competing speaker (the reference signal is recorded from the microphone of the competing speaker).
To date, various efforts have been made in the contexts just described, yet various shortcomings and disadvantages have been observed.
Conventionally, the problem of separating a desired signal and an interfering signal with a known reference signal is often addressed by using decorrelation filtering techniques (see Ehud Weinstein, Meir Feder and Alan V. Oppenheim, “Multi-channel signal separation by decorrelation”, IEEE Transactions on Speech and Audio Processing, volume 1, number 4, October 1993). The model underlying the conventional decorrelation filtering approach is illustrated in FIG. 1. Referring to FIG. 1, the cross-coupling effect between two channels is modeled with a 2×2 linear system, where:
    the two input channels are: s1, the waveform of the desired signal, and s2, the waveform of the interfering signal; and
    the two output channels are: o1, the observed waveform of the mixture of the desired and interfering signals, and o2, the observed waveform of the reference signal.
The transfer function within each channel (from s1 to o1, and from s2 to o2) is assumed to be an identity system. In addition, it is assumed that there is no leakage of the desired signal s1 into the reference sensor, i.e., the cross-coupling function from the input channel of s1 to the output channel of o2 is zero. Under these assumptions, the linear system reduces to the cross-coupling between the input channel of s2 and the output channel of o1. In decorrelation filtering techniques, the linear system is estimated with an iterative algorithm so that, by performing inverse filtering, the reconstructed signals s1 and s2 in the input channels are statistically uncorrelated. It can be shown that, under the above assumptions, the linear system can be identified unambiguously. Once the linear system is identified, it is used to cancel the interfering signal component in the observed mixture.
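The simplified model above (o1 equals s1 plus a filtered copy of s2, and o2 equals s2) can be sketched in a few lines. Note that the estimator shown is a plain least-mean-squares (LMS) adaptive filter, used here as a simple stand-in that drives the cleaned output towards being uncorrelated with the reference; it is not the iterative algorithm of the cited paper. The function names, the finite impulse response form of the coupling, the filter length `num_taps` and the step size are all assumptions made for illustration.

```python
import math
import random

def simulate_mixture(s1, s2, coupling):
    """Apply the simplified model: o1 = s1 + (coupling filter applied to s2), o2 = s2."""
    o1 = []
    for n in range(len(s1)):
        leak = sum(h * s2[n - k] for k, h in enumerate(coupling) if n - k >= 0)
        o1.append(s1[n] + leak)
    return o1, list(s2)

def lms_cancel(o1, o2, num_taps, step=0.02):
    """Estimate the coupling filter sample by sample and subtract the interference."""
    weights = [0.0] * num_taps
    cleaned = []
    for n in range(len(o1)):
        # Most recent reference samples, newest first, zero-padded at the start.
        x = [o2[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]
        estimate = sum(w * xi for w, xi in zip(weights, x))
        error = o1[n] - estimate  # current estimate of the desired signal s1[n]
        # LMS update: move the weights so the error decorrelates from the reference.
        weights = [w + step * error * xi for w, xi in zip(weights, x)]
        cleaned.append(error)
    return cleaned, weights
```

The loop runs once per waveform sample and needs a number of samples before the weights settle, and `num_taps` has to be chosen by the caller before any data is seen.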
The decorrelation filtering approach does suffer from some limitations in the context of a speech recognition application, such as:
    it operates in the waveform domain, on a sample-by-sample basis, thus leading to a high computation rate,
    it might take some time before the iterative decorrelation algorithm converges towards an accurate estimate of the linear system, and
    the length of the decorrelating filter in the linear system is unknown and needs to be hypothesized a priori.
Another conventional approach, the Codeword-Dependent Cepstral Normalization (CDCN) approach, is a mono-channel technique which is used during speech recognition to compensate for the combined effect of stationary noise and channel mismatch. (See Alejandro Acero, “Acoustical and Environmental Robustness in Automatic Speech Recognition”, PhD thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, Pa. 15213, September 1990.) CDCN does not operate in the waveform domain but, instead, in the cepstral domain, which is the domain where speech recognition is usually performed. A cepstrum (see chapter 3 in L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice Hall Signal Processing Series, 1993) is a vector that is computed by the front end of a speech recognition system from the log-spectrum of a segment of speech waveform samples (usually this segment is about 100 ms long). The stream of cepstra corresponding to a speech utterance is typically computed from successive overlapping segments of the speech waveform samples (usually the shift between two adjacent segments is about 10 ms). In the CDCN framework, the cepstrum of the noise is estimated by minimizing the difference between the cepstral space of the current utterance and the cepstral space of the clean speech (“clean speech” meaning non-noisy speech) characterized by a codebook of cepstral vectors. As the sources of mismatch are assumed to be stationary, the estimation is performed by averaging over the whole utterance.
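For illustration, the computation of a stream of cepstra from overlapping segments can be sketched as below. The sketch computes the plain real cepstrum (inverse discrete Fourier transform of the log-magnitude spectrum) with a naive transform; it is not the front end of any particular recognizer, which would typically add windowing and further processing. The function names, the spectral floor and the truncation to `num_coeffs` coefficients are assumptions.

```python
import cmath
import math

def frames(samples, frame_len, shift):
    """Split samples into overlapping segments of frame_len, advancing by shift."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, shift)]

def real_cepstrum(frame, floor=1e-10):
    """Inverse DFT of the log-magnitude spectrum of one frame (naive O(N^2) DFT)."""
    n = len(frame)
    log_mag = []
    for k in range(n):
        bin_k = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        log_mag.append(math.log(max(abs(bin_k), floor)))  # floor avoids log(0)
    # The log-magnitude spectrum of a real frame is real and symmetric,
    # so its inverse DFT is real and reduces to a cosine sum.
    return [sum(v * math.cos(2 * math.pi * k * q / n)
                for k, v in enumerate(log_mag)) / n
            for q in range(n)]

def cepstra_stream(samples, frame_len, shift, num_coeffs):
    """One truncated cepstral vector per overlapping frame."""
    return [real_cepstrum(f)[:num_coeffs]
            for f in frames(samples, frame_len, shift)]
```

Because the cepstrum is taken from a log-spectrum, multiplying a frame by a constant gain only adds the logarithm of that gain to the first coefficient. This additive behavior is what makes compensation by cepstral offset vectors possible.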
Among the limitations of the mono-channel CDCN approach, though, is that non-stationary noise is not handled as accurately or effectively as it could be. In particular, a fundamental assumption of the mono-channel CDCN approach is that the noise is relatively stationary over periods of at least one or even a few seconds. The shorter the period during which the noise can be considered stationary, the more poorly conventional CDCN will perform. In the case of highly non-stationary noises, such as music, the mono-channel CDCN framework may even degrade the speech recognition accuracy instead of improving it.
Also included among conventional techniques are two-channel compensation techniques that operate in the cepstral domain (see Acero, supra). Such techniques can be characterized as follows:
    one channel contains speech recorded in the environment matching the recognition system, and the other channel contains speech recorded in a mismatching environment (the usual source of mismatch is the use of a different microphone);
    the two-channel data are used in a training scheme for the purpose of learning compensation vectors between the matching and the mismatching environments (the compensation vectors are looked up in a table during the recognition process); and
    the source of mismatch in the second channel is assumed to be stationary: a predefined number of (SNR-dependent or codeword-dependent) compensation vectors are estimated by averaging over all the frames of the two-channel data; the problem of non-stationary noise is not addressed.
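A minimal sketch of how such a table of compensation vectors might be learned and looked up is given below. It assumes that the two channels are time-aligned frame by frame and that each frame already carries a label, such as a quantized SNR or a codeword index; the function names and the labeling scheme are hypothetical, and the estimation procedures of the cited work are more elaborate than a plain per-label average.

```python
from collections import defaultdict

def learn_compensation_vectors(matched, mismatched, labels):
    """Average the (matched - mismatched) cepstral difference per label.

    matched, mismatched: time-aligned lists of cepstral vectors, one per frame.
    labels: one hashable label per frame (e.g. quantized SNR or codeword index).
    """
    sums = {}
    counts = defaultdict(int)
    for ref, obs, label in zip(matched, mismatched, labels):
        diff = [r - o for r, o in zip(ref, obs)]
        if label not in sums:
            sums[label] = [0.0] * len(diff)
        sums[label] = [s + d for s, d in zip(sums[label], diff)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}

def compensate(observed, label, table):
    """Add the looked-up compensation vector to one observed cepstral vector."""
    return [o + c for o, c in zip(observed, table[label])]
```

Because each vector is an average over all training frames sharing a label, the table is fixed once training is done and cannot follow a mismatch that changes over time.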
Accordingly, these two-channel techniques suffer from disadvantages similar to those of the other conventional techniques described above.
Consequently, and in brief recapitulation, various needs have been recognized in connection with overcoming the shortcomings and disadvantages observed in connection with conventional techniques.