The present invention relates to speech recognition. In particular, the present invention relates to the use of features to perform speech recognition.
In speech recognition systems, an input speech signal is converted into words that represent the verbal content of the speech signal. This conversion begins by converting the analog speech signal into a series of digital values. The digital values are then passed through a feature extraction unit, which computes a sequence of feature vectors based on the digital values. Each feature vector represents a section of the speech signal.
The feature vectors can represent any of a number of available features extracted through known feature extraction methods such as Linear Predictive Coding (LPC), LPC-derived cepstrum, Perceptual Linear Prediction (PLP), auditory model, and Mel-Frequency Cepstrum Coefficients (MFCC).
The feature vectors are then applied to an acoustic model that describes the probability that a feature vector was produced by a particular word, phoneme, or senone. Based on a sequence of these probabilities, a decoder identifies a most likely word sequence for the input speech signal.
Although many features may be extracted from the speech signal, most prior art systems only produce feature vectors associated with a single "best" feature. When speech recognition systems were first developed, filter-banks were used to extract the single feature used in recognition. Later, linear predictive coding was viewed as providing the best feature for speech recognition. In recent years, many speech systems have used Mel-Frequency Cepstrum Coefficients to provide the "best" feature for speech recognition.
Although a single feature can provide fairly good speech recognition results, systems that use a single feature for all speech recognition implicitly compromise some aspects of their performance. In particular, a single feature cannot be the best feature for recognizing each possible phone. Instead, the selected feature is generally designed to provide the best average performance across all phones. For some phones, other features would provide better speech recognition results than the selected feature.
To address this problem, some prior art systems have tried to use multiple features during recognition. In one system, this involved assigning a feature to a class of phones. For example, vowel sounds would be associated with one feature and fricatives would be associated with a different feature. However, this combination of features is less than desirable because it forces a feature on a phone regardless of the location of the phone in the speech signal. Just as a single feature does not provide optimum performance for all classes of phones, a single feature does not provide optimum performance for all locations of a phone. In addition, the feature associated with each class is chosen by the designer of the system and thus may not always be the best choice for the class.
Other systems have tried to use multiple features by combining probability scores associated with different features. In such systems, separate scores are calculated based on each feature. Thus, if three features are being used, three probability scores will be determined for each segment of the speech signal.
In one system, these probability scores are combined using a voting technique. Under the voting technique, each feature is used to identify a sub-word unit for each segment of the speech signal. The sub-word units are then compared to each other. If one sub-word unit is found more often than others, that sub-word unit is selected for the speech segment. If there is a tie between sub-word units, the sub-word unit associated with a particular feature is selected based on a ranking of the features.
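The voting technique described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular prior art system: the function name `vote`, the feature names, and the rule that a tie goes to the unit proposed by the highest-ranked feature are assumptions made for the example.

```python
from collections import Counter


def vote(units_by_feature, feature_ranking):
    """Pick one sub-word unit for a segment by majority vote.

    units_by_feature: dict mapping a feature name to the sub-word unit
        that feature identified for the segment (hypothetical names).
    feature_ranking: feature names ordered from most to least trusted,
        used only to break ties (an assumed tie-breaking rule).
    """
    counts = Counter(units_by_feature.values())
    top_count = max(counts.values())
    tied = {unit for unit, count in counts.items() if count == top_count}
    if len(tied) == 1:
        # One unit was proposed more often than any other.
        return next(iter(tied))
    # Tie: take the tied unit proposed by the highest-ranked feature.
    for feature in feature_ranking:
        unit = units_by_feature.get(feature)
        if unit in tied:
            return unit
    raise ValueError("feature_ranking does not cover any tied unit")
```

For example, with three features of which two agree, the agreed unit wins outright; when all three disagree, the ranking decides.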
In another prior art speech recognition system, the probability scores are combined by taking the weighted sum of the scores produced by each feature. This weighted sum then represents the probability that the segment of the speech signal represents a particular sub-word unit. Other prior art speech recognition systems combine the probability scores by multiplying the scores from each individual feature together. The product then represents the probability that the segment of the speech signal represents a particular sub-word unit.
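The two score-combination rules above can be written out directly. The sketch below is illustrative only: the function names and the example weights are assumptions, and real systems often work with log-probabilities rather than raw probabilities.

```python
import math


def weighted_sum_score(scores, weights):
    """Combine per-feature probability scores for one sub-word unit
    by a weighted sum (the weights are hypothetical, chosen by the designer)."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(w * s for s, w in zip(scores, weights))


def product_score(scores):
    """Combine per-feature probability scores by multiplying them together."""
    return math.prod(scores)


# Three features score the same sub-word unit for one segment.
scores = [0.9, 0.2, 0.1]
combined_sum = weighted_sum_score(scores, [0.5, 0.3, 0.2])
combined_product = product_score(scores)
```

In this example the first feature is confident in the unit (0.9), but both combined values are pulled well below 0.9 by the two weaker features.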
Such combination systems are not ideal because the scores associated with an optimum feature for a phone are obscured by the addition of scores associated with less than optimum features.
A method and apparatus are provided for using multiple feature streams in speech recognition. In the method and apparatus, a feature extractor generates at least two feature vectors for a segment of an input signal. A decoder then generates a path score that is indicative of the probability that a word is represented by the input signal. The path score is generated by selecting the best feature vector to use for each segment. For each segment, the corresponding part in the path score for that segment is based in part on a chosen segment score that is selected from a group of at least two segment scores. The segment scores each represent a separate probability that a particular segment unit (e.g., senone, phoneme, diphone, triphone, or word) appears in that segment of the input signal. Although each segment score in the group relates to the same segment unit, the scores are based on different feature vectors for the segment.
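The per-segment selection can be illustrated with a short sketch. It assumes, purely for the example, that segment scores are log-probabilities, that the path score is their sum, and that transition and language model contributions are left out; the function name and feature names are hypothetical.

```python
def path_score_with_selection(segment_scores):
    """Build a path score by choosing, for each segment, the best of
    several segment scores that all relate to the same segment unit.

    segment_scores: one dict per segment, mapping a feature name to the
        log-probability score of the hypothesized unit under that feature.
    Returns the summed score and the feature chosen for each segment.
    """
    total = 0.0
    chosen_features = []
    for scores in segment_scores:
        # Select the feature whose vector gives the highest segment score.
        best_feature = max(scores, key=scores.get)
        chosen_features.append(best_feature)
        total += scores[best_feature]
    return total, chosen_features
```

Unlike a weighted sum or product, only the chosen score for each segment contributes, so a strong score from one feature is not diluted by weaker scores from the others.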
In one embodiment, an iterative method is used to select the feature to use for each segment and to conduct decoding using the chosen segment scores. In the iterative method, a first-pass feature is used to decode a possible segment unit for each segment. A set of segment scores is then determined for the possible segment unit using a set of different feature vectors for the segment. The feature vector that provides the highest score is then associated with the segment.
A second decoding pass is then performed using the individual feature vectors that have been assigned to each segment. This second decoding can produce new segmentation and can provide a revised segment unit for each segment. A group of segment scores is then determined for the revised segment unit in each segment, using the set of feature vectors for the segment. The feature vector associated with the highest score for each segment is selected to use for decoding. Under one embodiment, the process of assigning features to segments and decoding the speech signal using the assigned features is repeated until the recognizer output does not change between iterations.
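The iterative procedure can be sketched as a loop that alternates between decoding and feature assignment. This is a simplified illustration under stated assumptions: the segmentation is held fixed (the text notes that a real second pass may re-segment), `decode` and `score` are hypothetical callables supplied by the caller, and the toy score table exists only to make the example runnable.

```python
def iterative_decode(num_segments, features, decode, score,
                     first_pass_feature, max_iters=10):
    """Alternate between decoding and per-segment feature selection.

    decode(assignment) -> list of segment units, given one feature per segment.
    score(t, unit, feature) -> score of `unit` in segment t under `feature`.
    Both callables are assumptions of this sketch.
    """
    assignment = [first_pass_feature] * num_segments
    units = decode(assignment)  # first pass uses a single feature
    for _ in range(max_iters):
        # Assign to each segment the feature that scores its unit highest.
        new_assignment = [
            max(features, key=lambda f: score(t, units[t], f))
            for t in range(num_segments)
        ]
        new_units = decode(new_assignment)
        if new_units == units:
            # Recognizer output did not change between iterations.
            return new_units, new_assignment
        units, assignment = new_units, new_assignment
    return units, assignment


# Toy data: TABLE[feature][segment][unit] is a made-up score.
TABLE = {
    "x": [{"a": 0.6, "b": 0.4}, {"a": 0.4, "b": 0.3}],
    "y": [{"a": 0.5, "b": 0.5}, {"a": 0.45, "b": 0.55}],
}


def toy_decode(assignment):
    # Independent per-segment argmax stands in for a real decoder.
    return [max(TABLE[f][t], key=TABLE[f][t].get)
            for t, f in enumerate(assignment)]


def toy_score(t, unit, feature):
    return TABLE[feature][t].get(unit, 0.0)


units, assignment = iterative_decode(2, ["x", "y"], toy_decode, toy_score, "x")
```

In the toy run, the first pass with feature `x` decodes unit `a` for both segments; feature `y` scores `a` higher in the second segment, so it is assigned there, and the second pass revises that segment to `b`, after which the output no longer changes.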
In other embodiments, the selection of a best feature for a segment is built into the selection of a best path score for a sequence of words, namely, the decoding process. In some of these embodiments, this is accomplished by forming segment unit/feature pairs for each segment and selecting a sequence of pairs that provide the best overall path score. In most embodiments, the selection is based on the optimal score for the whole utterance. All possibilities are considered and a decision on the best path is generally not made until the end of the utterance. It is believed that this technique provides the best score for the whole utterance.
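One way to picture decoding over segment unit/feature pairs is a Viterbi-style search in which each state is a (unit, feature) pair. The sketch below is an assumption-laden illustration rather than the decoder of the invention: it uses a fixed segmentation, log-domain scores, and transition scores that depend only on the units; all names and numbers are hypothetical.

```python
def best_pair_path(emissions, log_trans, log_init):
    """Find the best-scoring sequence of (unit, feature) pairs.

    emissions: one dict per segment mapping (unit, feature) to a log score.
    log_trans: dict mapping (prev_unit, unit) to a log transition score.
    log_init: dict mapping a unit to its initial log score.
    The decision is deferred until the last segment has been processed.
    """
    # best[pair] = (score of best path ending in pair, that path)
    best = {pair: (log_init[pair[0]] + e, [pair])
            for pair, e in emissions[0].items()}
    for frame in emissions[1:]:
        new_best = {}
        for pair, e in frame.items():
            unit = pair[0]
            prev_pair, (prev_score, prev_path) = max(
                best.items(),
                key=lambda kv: kv[1][0] + log_trans[(kv[0][0], unit)],
            )
            step = log_trans[(prev_pair[0], unit)] + e
            new_best[pair] = (prev_score + step, prev_path + [pair])
        best = new_best
    # Only now, at the end of the utterance, pick the overall best path.
    return max(best.values(), key=lambda v: v[0])


log_init = {"a": -1.0, "b": -1.0}
log_trans = {("a", "a"): -0.5, ("a", "b"): -1.0,
             ("b", "a"): -1.0, ("b", "b"): -0.5}
emissions = [
    {("a", "x"): -1.0, ("a", "y"): -2.0, ("b", "x"): -3.0, ("b", "y"): -3.0},
    {("a", "x"): -4.0, ("a", "y"): -4.0, ("b", "x"): -3.0, ("b", "y"): -0.5},
]
score, path = best_pair_path(emissions, log_trans, log_init)
```

Here the best path uses feature `x` for unit `a` in the first segment and feature `y` for unit `b` in the second, so the feature choice falls out of the same search that picks the units.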