The present invention relates to estimating the feature vectors corresponding to different sources that were combined to produce input feature vectors.
A pattern recognition system, such as a speech recognition system, takes an input signal and attempts to decode the signal to find a pattern represented by the signal. For example, in a speech recognition system, a speech signal is received by the recognition system and is decoded to identify a string of words represented by the speech signal.
To decode the incoming signal, most recognition systems utilize one or more models that describe the likelihood that a portion of the incoming signal represents a particular pattern. Typically, these models do not operate directly on the incoming signal, but instead operate on a feature vector representation of the incoming signal. In speech recognition, such feature vectors can be produced through techniques such as linear predictive coding (LPC), LPC-derived cepstrum, perceptual linear prediction (PLP), and mel-frequency cepstral coefficient (MFCC) feature extraction.
The incoming signal is often a combination of signals from different sources, each modified by a channel. For example, the incoming signal may be a mixture of an original signal, which contains the pattern to be recognized, and one or more obscuring signals, such as additive noise and channel distortion. In speech recognition, the incoming signal may be a combination of the speech signal to be fed into a speech recognizer, additive noise, and channel distortion such as telephone channel distortion or reverberations generated by the speech signal bouncing off walls in a room. Alternatively, the incoming signal may be a combination of a speech signal with a channel signal (the impulse response of the channel), where the channel signal is to be fed into a system that recognizes channel types. As a further alternative, the incoming signal may be a mixture of the speech signals from two different speakers, each modified by a different channel, and each of which is to be fed into a speech recognizer.
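For illustration only, the combination described above can be sketched in the time domain as an original signal passed through a channel impulse response with noise added to the result. The convolution-plus-additive-noise form, the function names `convolve` and `observe`, and the sample values are assumptions made for this sketch and are not taken from the description.

```python
def convolve(signal, impulse_response):
    """Full discrete convolution of a signal with a channel impulse response."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, g in enumerate(impulse_response):
            out[i + j] += s * g
    return out


def observe(clean, impulse_response, noise):
    """Hypothetical observation model: channel-filtered signal plus additive noise.

    Assumes `noise` has the same length as the filtered signal.
    """
    filtered = convolve(clean, impulse_response)
    if len(noise) != len(filtered):
        raise ValueError("noise must match the length of the filtered signal")
    return [f + n for f, n in zip(filtered, noise)]


# Illustrative values only.
clean = [1.0, 0.0, -1.0]
channel = [1.0, 0.5]            # assumed two-tap impulse response
noise = [0.1, 0.1, 0.1, 0.1]    # assumed constant additive noise
incoming = observe(clean, channel, noise)
print(incoming)
```

In this sketch the recognizer would receive only `incoming`; the original signal, the channel, and the noise are not observed separately.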
Because noise and channel distortion make it more difficult to recognize a pattern in the incoming signal, it is often desirable to remove the noise and the channel distortion before performing pattern recognition. However, removing noise and channel distortion from the incoming signal itself is computationally difficult because of the large amount of data that has to be processed. To overcome this problem, some prior art techniques have tried to remove noise from the feature vector representation of the incoming signal instead of the incoming signal itself because the feature vector representation is more compact than the incoming signal.
However, past techniques for removing noise from feature vectors have relied on point models for the noise and the channel distortion. In other words, the noise reduction techniques have assumed that one single feature vector can represent the noise and another single feature vector can represent the channel distortion. The point models may be adapted to a sequence of input features, but they are held constant across the sequence. Because the noise and channel distortion vary across the sequence of input features, techniques that rely on such point models do not accurately remove the noise or channel distortion.
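The limitation of a point model can be illustrated with a minimal sketch. Here the point model is taken to be the per-dimension mean of the noise feature vectors over the sequence, which is an assumption made for the example; the names `point_model` and `per_frame_error` and the frame values are hypothetical.

```python
from statistics import fmean


def point_model(noise_frames):
    """Return one feature vector (per-dimension mean) for the whole sequence."""
    dims = len(noise_frames[0])
    return [fmean(frame[d] for frame in noise_frames) for d in range(dims)]


def per_frame_error(noise_frames, point):
    """Euclidean distance between each frame's noise vector and the fixed point."""
    return [
        sum((a - b) ** 2 for a, b in zip(frame, point)) ** 0.5
        for frame in noise_frames
    ]


# Illustrative noise feature vectors that vary from frame to frame.
noise_frames = [[1.0, 2.0], [1.5, 2.5], [3.0, 4.0], [0.5, 1.0]]
point = point_model(noise_frames)
errors = per_frame_error(noise_frames, point)
print(point)
print(errors)
```

Because the single vector is held constant while the noise changes from frame to frame, every frame in this example carries a nonzero mismatch, and the frame that departs most from the average carries the largest one.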
Some prior art techniques for removing noise from feature vectors attempt to identify the most likely combination of a noise feature vector, a channel distortion feature vector, and an original signal feature vector that would have produced the noisy feature vector. To make this determination, the prior art relies on an approximation of the relationship between noise, channel distortion, original signals, and incoming signals.
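The description does not state the form of the approximation. As one hypothetical example, a commonly used log-power-domain relationship is y ≈ x + h + ln(1 + exp(n − x − h)), where x, h, n, and y denote the log power of the original signal, the channel, the noise, and the incoming signal in a single frequency bin. This form ignores the cross term that depends on the relative phase between the filtered signal and the noise. The sketch below, with assumed function names and values, shows the error that such an approximation leaves.

```python
import math


def approx_log_power(x, h, n):
    """Assumed approximation: y ~ x + h + ln(1 + exp(n - x - h))."""
    return x + h + math.log1p(math.exp(n - x - h))


def exact_log_power(x, h, n, theta):
    """Log power of the sum, including the phase-dependent cross term.

    theta is the relative phase between the filtered signal and the noise.
    """
    filtered = math.exp(x + h)   # power of the channel-filtered signal
    noise = math.exp(n)          # power of the noise
    cross = 2.0 * math.sqrt(filtered * noise) * math.cos(theta)
    return math.log(filtered + noise + cross)


# Illustrative log-power values for a single frequency bin.
x, h, n = 2.0, -0.5, 1.0
for theta in (math.pi / 2, math.pi / 3, 2 * math.pi / 3):
    err = exact_log_power(x, h, n, theta) - approx_log_power(x, h, n)
    print(f"theta={theta:.3f}  approximation error={err:+.4f}")
```

The error vanishes only when the cross term is zero and otherwise changes sign and size with the relative phase, which is not available from the feature vectors alone.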
However, prior art systems do not take the error present in the approximation into account when identifying possible combinations of noise, channel distortion, and original signals based on the incoming signal. In addition, the form of the approximation is typically set once and then used to identify the best combination. If the form of the approximation is not accurate, the resulting identified combination of noise, channel distortion, and original signal will be inaccurate. The prior art nevertheless provides no means for adjusting the form of the approximation to improve the resulting identified combination.