1. Field of the Invention
This invention relates to speech recognition and training systems. More specifically, this invention relates to Speaker-Dependent (SD) speech recognition and training systems which include means for identifying confusingly similar words during training and means for increasing discrimination between such confusingly similar words during recognition.
2. Related Art
An SD system offers flexibility to the user by permitting the introduction of new words into the vocabulary. It also allows vocabulary words from different languages to be included. However, the advantages of user-defined vocabulary and language independence can cause performance degradation if not implemented properly. Allowing a user-defined vocabulary introduces problems due to the flexibility in selecting the vocabulary words. One of the major problems encountered in allowing a user-defined vocabulary is the acoustical similarity of vocabulary words. For example, if “Rob” and “Bob” were selected as vocabulary words, the reliability of the recognition system would decrease.
When the user is given the freedom to choose any vocabulary words, the tendency is to select short words, which are convenient to train but produce unreliable models. Due to the limited training data (one token), the longer the word is, the more reliable the model will be. Finally, when the user enters a multiple-word phrase for a vocabulary item, the variation in the length of silence or pause between the words is critical to the success of the recognition system. In unsupervised training, there is no feedback from the system to the user during the training phase. Hence, the models created from such training do not avoid the above-identified problems.
To alleviate these problems, a smart/supervised training system needs to be introduced into an SD recognition system, particularly if it uses word-based models.
Many methods of SD speech training are present in the related art. For example, U.S. Pat. No. 5,452,397 to Ittycheriah, et al., incorporated herein by reference, assumes multiple-token training and describes a method of preventing the entry of confusingly similar phrases in a vocabulary list of a speaker-dependent voice recognition system. The first token of the word/phrase to be added to the vocabulary list is used to build a model for that word/phrase. Then, the second token (a repetition of the same word/phrase) is compared with the new model added to the vocabulary and also with previously existing models in the vocabulary list. The scores of the existing models are weighted slightly higher than that of the new model. If the second token compares more closely with an existing model than with the new model, the new word/phrase is declared to be confusingly similar to one of the existing vocabulary items and the new model is removed. The user is then asked to select another word/phrase for training. Since this method requires multiple tokens, it is not suitable for an SD system which requires only a single token for training.
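By way of illustration only, the two-token check described above might be sketched as follows. The function name, the weighting factor of 1.05, and the assumption that scores are positive similarity values (higher meaning a closer match) are all hypothetical and are not taken from the cited patent.

```python
def find_confusable_item(new_model_score, existing_scores, existing_weight=1.05):
    """Return the existing vocabulary item the second token matches best,
    or None if the new model still wins.

    new_model_score: similarity of the second token to the new model.
    existing_scores: dict mapping item name -> similarity of the second
        token to that existing model.
    existing_weight: hypothetical factor (> 1) that favors existing models,
        mirroring the "weighted slightly higher" step described in the text.
    Scores are assumed to be positive similarities (higher is closer).
    """
    best_name = None
    best_score = float("-inf")
    for name, score in existing_scores.items():
        weighted = score * existing_weight
        if weighted > best_score:
            best_name, best_score = name, weighted
    if best_name is not None and best_score > new_model_score:
        return best_name  # new entry would be rejected as confusable
    return None  # new entry is accepted


# Example: the second token of "Rob" scores almost as well against "Bob".
print(find_confusable_item(0.90, {"Bob": 0.88, "Alice": 0.40}))
```

Note that the check depends on a second token being available, which is the limitation identified above for single-token training.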
U.S. Pat. No. 5,754,977 to Gardner, et al., incorporated herein by reference, uses a distance value to measure the closeness of the word/phrase to be added to any of the existing vocabulary items. All the vocabulary items are sorted in order of closeness to the new pattern/model. Then, a Euclidean distance value is computed between the new model and the top entry in the sorted list. If the distance falls below a certain predetermined threshold, the user is warned about the acoustic similarity of the word/phrase to be added to one of the existing vocabulary items and is requested to make another entry. Although this approach can be used in an SD system with one-token training, the method is not very reliable. Since the distribution of the distance values will change significantly from user to user, it is very difficult to determine a reliable threshold value. Even when there is an ability to adjust or change the threshold value from user to user, a priori information on the distance/score distribution, such as utterance magnitude, is still required for changing the threshold to a meaningful value.
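The following minimal sketch illustrates a distance-threshold check of this general kind. It assumes, purely for illustration, that each model can be summarized by a fixed-length vector and that the vocabulary is ranked by the same Euclidean distance; the names and the threshold values are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similar_item_warning(new_model, vocabulary, threshold):
    """Return the closest existing item if it lies within `threshold`,
    otherwise None.

    vocabulary: dict mapping item name -> model vector (assumed non-empty).
    threshold: absolute distance threshold (hypothetical value).
    """
    ranked = sorted(
        vocabulary.items(),
        key=lambda item: euclidean_distance(new_model, item[1]),
    )
    name, model = ranked[0]
    if euclidean_distance(new_model, model) < threshold:
        return name
    return None


vocabulary = {"Bob": [1.0, 2.0], "Alice": [5.0, 5.0]}
print(similar_item_warning([1.0, 2.5], vocabulary, threshold=1.0))
print(similar_item_warning([1.0, 2.5], vocabulary, threshold=0.1))
```

The two calls differ only in the threshold, yet one produces a warning and the other does not, which illustrates why a single absolute threshold is hard to choose for all users.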
U.S. Pat. No. 5,664,058 to Vysotsky, incorporated herein by reference, is a speech recognizer training system using one or a few isolated words which are converted to a token. Vysotsky performs multiple tests to determine whether a training utterance is to be accepted or rejected. These tests prevent the user from adding a new voice message that is similar to a voice message the recognizer has previously been trained to recognize, and ensure a consistent pronunciation for all templates corresponding to the same voice message. This approach also requires two or more training tokens to perform these tests. The tests use a distance measure as a criterion for determining the closeness of the new token to the previously stored templates. Even though this approach is more robust than the other two methods, it requires more tokens and more tests than the other methods described above. This technique also uses absolute thresholds, which may not necessarily be uniform across different speakers. Unlike most of the current SD systems, the matching in this approach is performed by Dynamic Time Warping (DTW), which is used to match utterances of a different length than the test speech pattern. Hence the criteria used in this approach are not directly applicable to systems that use HMM for modeling the speech.
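For readers unfamiliar with DTW, the textbook dynamic-programming form of the technique is sketched below. It is a generic illustration rather than the procedure of the cited patent, and it assumes one scalar feature per frame, whereas a practical system would compare feature vectors.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two sequences of scalars.

    The sequences may have different lengths; each frame of one sequence
    may be aligned with one or more frames of the other.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b frame is reused
                cost[i][j - 1],      # b advances, a frame is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# The same pattern spoken with one frame held longer still matches exactly.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because the result is a distance between templates rather than a model likelihood, thresholds tuned for it do not carry over directly to HMM scores.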
Most of the solutions proposed in the related art assume that more than one token is available during the training phase for building the models for the vocabulary words. The SD speech recognition system of the present invention requires only one token per vocabulary item for training. Since the models built from one-token training are not very robust, performance is improved significantly by identifying the problem words and indicating them to the user during the training phase, i.e., smart training.
Also, some of the previous solutions rely on absolute score thresholds to determine the closeness of words. Unfortunately, the same threshold cannot be used for every user. Hence, the training cannot be completely unsupervised.
Finally, the previous solutions only avoid adding acoustically similar words to the vocabulary. None of the above systems presents a solution for resolving the entry of confusable words, that is, words which are acoustically similar. They also fail to address several other problems encountered in training.
The present invention describes a solution for each of the problems described above that degrade the performance of SD speech recognition systems, by using a confidence-measure-based smart training system which avoids or compensates for similar sounding words in the vocabulary. Using duration information, the training process cautions the user about entries to the vocabulary that are likely sources of frequent errors. Finally, based on the output of smart training, a smart scoring procedure is included in the method described herein to improve the recognition performance in the event the user chooses to include similar sounding words in the vocabulary.
The invention improves the performance and reliability of the SD speech recognition system over the related art systems by avoiding similar sounding entries to the vocabulary during training, avoiding very short words and other utterances that are likely to cause recognition errors, and suggesting alternative solutions. In the event the user insists on including similar sounding words in the vocabulary, the invention augments the recognition of such words by using a confidence measure, instead of absolute scores, to determine the acoustic similarity of the vocabulary items, and by modifying the scoring algorithm during recognition. The present invention also uses additional information such as the duration of the utterance and the number of words in a vocabulary item. The smart training process described herein can be applied either to single-token training or to multiple-token training.
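The distinction between a confidence measure and an absolute score can be illustrated with the sketch below. The particular formula, a normalized margin between the best and second-best scores, is an assumption chosen for illustration; this section does not define the confidence measure actually used, and the 0.05 cutoff is likewise hypothetical.

```python
def relative_confidence(scores):
    """Normalized margin between the best and second-best scores.

    scores: log-likelihood style scores, one per vocabulary item
        (higher is better). At least two scores are required and the
        best score must be nonzero.
    This formula is a hypothetical stand-in for a confidence measure.
    """
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    ranked = sorted(scores, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if best == 0:
        raise ValueError("best score must be nonzero")
    return (best - runner_up) / abs(best)


# Two speakers whose scores differ in overall scale by a factor of two.
speaker_a = [-100.0, -110.0, -150.0]
speaker_b = [-200.0, -220.0, -300.0]
print(relative_confidence(speaker_a), relative_confidence(speaker_b))

# A small margin flags a potentially confusable pair (cutoff is hypothetical).
print(relative_confidence([-100.0, -101.0]) < 0.05)
```

Both speakers yield the same confidence even though their absolute scores differ, which is the property an absolute threshold lacks.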
A complete SD speech recognition system includes training as well as recognition components. Since the user is required to train every vocabulary item, the training process should be simple and user-friendly. As a result, the user is asked to say each vocabulary item only a few (one, two or three) times. Training in the present invention requires only one token per vocabulary item. Several approaches have been proposed for SD speech recognition in which the available training data is severely limited. The present invention uses a statistical approach known as Hidden Markov Modeling (HMM). In a statistical approach, it is assumed that the speech signal can be characterized by a random process in the given feature space, which in this case is the spectral domain or space of cepstral vectors. The training process can be viewed as estimating all the parameters describing this random process for each word in the vocabulary, and the recognition or matching process can be viewed as identifying which of these random processes is most likely to produce the test token. A probabilistic measure is used to determine this closeness.
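The view of recognition as selecting the most likely random process can be sketched with a small discrete HMM and the standard forward algorithm. The two word models and the use of discrete symbols (0 and 1) in place of cepstral vectors are simplifying assumptions made only for this example.

```python
def forward_likelihood(observations, start, trans, emit):
    """P(observations | model) computed with the forward algorithm.

    start[s]: probability of starting in state s.
    trans[p][s]: probability of moving from state p to state s.
    emit[s][o]: probability that state s emits discrete symbol o.
    """
    n_states = len(start)
    alpha = [start[s] * emit[s][observations[0]] for s in range(n_states)]
    for symbol in observations[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][symbol]
            for s in range(n_states)
        ]
    return sum(alpha)


def recognize(observations, models):
    """Return the word whose model gives the test token the highest likelihood."""
    return max(models, key=lambda w: forward_likelihood(observations, *models[w]))


# Hypothetical two-state left-to-right models over symbols {0, 1}.
left_to_right = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "word_a": ([1.0, 0.0], left_to_right, [[0.9, 0.1], [0.1, 0.9]]),
    "word_b": ([1.0, 0.0], left_to_right, [[0.1, 0.9], [0.9, 0.1]]),
}
print(recognize([0, 0, 1, 1], models))
print(recognize([1, 1, 0, 0], models))
```

In this sketch the model parameters are written by hand; in training they would instead be estimated from the user's token or tokens.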
A general approach to isolated-word speech recognition using statistical methods is depicted in the flow diagram of FIG. 1. As can be noted from the block diagram of FIG. 1, the basic components of a speech recognition system include a front-end processor 1, a buffer 1a′ for storing processed speech, a training module 2 and a recognition module 3. The front-end processor includes a pre-processing module 1a which produces processed speech and a feature extraction module 1b for producing a feature vector 1c for digital speech input. The feature vector 1c is a common input to both the training module 2 and the recognition module 3. The training module 2 has an estimating module 2a for estimating model parameters and a storage medium 2c for storing such model parameters on a storage medium 2b for subsequent retrieval and evaluation. The recognition module 3 includes a similarity evaluation module 3a, which computes a score measurement, and decision logic 3b, which uses the score to recognize a word I.D. 3c. The representation of speech by a concise set of parameters is the most crucial step of the speech recognition process. Though many such representations exist, a technique well known to those skilled in the art, known as Linear Prediction Coding (LPC), is used in the present invention.
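As an illustration of how LPC yields a concise set of parameters, the sketch below computes predictor coefficients for one frame using the autocorrelation method and the Levinson-Durbin recursion. This is one standard way to compute LPC coefficients; the text above does not state which method the invention uses, and the function names are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation values r[0..max_lag] of one frame of samples."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve for predictor coefficients a[1..order] such that
    x[n] is approximated by sum_k a[k] * x[n - k].

    Returns (coefficients, prediction_error).
    """
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0:
            break  # silent or degenerate frame; leave remaining terms at zero
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error  # reflection coefficient for this order
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
    return a[1:], error


def lpc_coefficients(frame, order):
    """LPC coefficients of a frame via the autocorrelation method."""
    return levinson_durbin(autocorrelation(frame, order), order)[0]


# Autocorrelation of a first-order process with coefficient 0.5.
coefficients, error = levinson_durbin([1.0, 0.5, 0.25], order=2)
print(coefficients, error)
```

In a front end of the kind shown in FIG. 1, such coefficients would typically be converted to cepstral vectors before being passed on as the feature vector; that conversion is omitted here.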
It should be noted that the generalized system described above comprises unsupervised training and recognition modules. The introduction of smart training in the SD system improves the recognition performance by eliminating the problems introduced by unsupervised training. One aspect of the current invention is to detect and warn the user about similar sounding entries to the vocabulary. Another aspect of the current invention is to use a modified scoring algorithm to improve the recognition performance in the case where confusing entries were made to the vocabulary despite the warning. Yet another aspect of the current invention is to detect and warn the user about potential problems with new entries, such as short words and two- to three-word entries with long silence periods between words. Finally, the current invention also includes alerting the user about the dissimilarity of multiple tokens of the same vocabulary item in the case of multiple-token training.
Thus the present invention permits confusingly similar words to be entered into the vocabulary and uses a refined detection algorithm to distinguish between such similar words. In addition, the present invention detects long pauses between words and alerts the user. If the words are added to the vocabulary, the pause is normalized.
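One simple way a pause could be detected and normalized is sketched below, assuming per-frame energy values are available. The energy threshold, the cap on pause length, and the function names are hypothetical; this passage does not specify how the invention performs the normalization.

```python
def longest_pause(energies, silence_threshold):
    """Length, in frames, of the longest run of low-energy frames."""
    longest = run = 0
    for energy in energies:
        run = run + 1 if energy < silence_threshold else 0
        longest = max(longest, run)
    return longest


def normalize_pauses(energies, silence_threshold, max_pause_frames):
    """Shorten every run of silent frames to at most `max_pause_frames`.

    Frames at or above the threshold are kept unchanged.
    """
    output = []
    run = 0
    for energy in energies:
        if energy < silence_threshold:
            run += 1
            if run <= max_pause_frames:
                output.append(energy)
        else:
            run = 0
            output.append(energy)
    return output


# Two words separated by four silent frames, capped to two.
frames = [5, 0, 0, 0, 0, 6]
print(longest_pause(frames, silence_threshold=1))
print(normalize_pauses(frames, silence_threshold=1, max_pause_frames=2))
```

A training system could use the first function to decide whether to alert the user and the second to store a token with a consistent pause length.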