In a speaker dependent speech recognition system, users need to enroll the vocabulary words that they wish to have available when using the system. A vocabulary "word" can be a single spoken word or a short phrase, and the vocabulary words chosen depend on the particular application. For example, a speech recognition implementation for a portable radiotelephone might require the user to provide the names and locations of frequently called people (e.g., "Fred's office"), or commands for frequently used features usually available in a user interface (e.g., "battery meter", "messages", or "phone lock"). The choice of vocabulary words is unsupervised and left to the user, allowing the entry of easily remembered words or phrases.
Unfortunately, the choice of these vocabulary words can have a significant impact on the performance of a speech recognition system. Allowing too much flexibility to an inexperienced user may result in a number of potential problems. If the user inadvertently selects two acoustically similar vocabulary words to identify two distinct entries, poor recognition performance may result, especially when the recognition task is performed in a noisy environment. As an example, for the aforementioned portable radiotelephone speech recognition application, this could happen if the user decides to enroll "Fred's Office" and "Ted's Office." Similarly, since most speaker dependent systems allow the user to incrementally enroll words into their vocabulary, there is a danger of mistakenly re-enrolling the same word with a different association. Again, in the context of a radiotelephone speech recognition application, the user could enroll the phrase "Fred's Office" for two separate people named Fred. In such cases, it would be beneficial to detect similarities between these voice-tags at the time of the second word's enrollment and to warn the user of the similarity. The user should then be encouraged to choose a different, more distinctive entry or voice-tag.
A similar problem also exists when selecting speaker independent vocabularies. For example, in command and control applications, a number of words are "active" during recognition, and the user makes a selection by saying one of these active words. The ability of the speech recognizer to accurately discriminate between these words depends, to some extent, on the words themselves and their similarity to one another. A designer of such a command and control vocabulary who has several possible alternatives for a given word would want to reject any alternative that is too similar to other vocabulary words, in the interest of improving system performance. Therefore, such a similarity detection technique would be useful during the design of speaker independent systems as well.
Ensuring that the enrolled vocabulary words are distinct and not easily confused becomes even more critical during recognition in acoustically noisy environments, such as in an automobile or where background conversation is present. In such environments, the recognizer's ability to distinguish between acoustically similar tags can be greatly reduced. Therefore, failing to encourage the enrollment of acoustically distinct vocabulary words could severely limit the performance achievable in such environments. By testing for word similarity during vocabulary enrollment, one can reduce the likelihood of certain types of recognition errors and minimize user annoyance.
Prior art methods have been proposed for preventing the enrollment of acoustically similar vocabulary words, but these methods rely on collecting additional repetitions of the word being trained solely for use in similarity testing. According to these conventional methods, the user is prompted during enrollment to say the utterance he or she wishes to enroll at least two times. The first repetition is used to create a model of the new word. A model is a representation used by a speech recognition system for an enrolled word or phrase. This representation can take many forms, including a conventional template as used in recognition systems based on dynamic time warping (DTW), a statistical representation as used by systems based on hidden Markov model (HMM) techniques, or a set of weights used to characterize a multilayer artificial neural network. This new model and all previously enrolled word models are then pooled together to form the recognition vocabulary that needs to be evaluated for similarity. The second repetition is then used only as a test utterance, evaluating the new model just trained against any words already enrolled into the vocabulary to identify a potential acoustic similarity. If this test is passed, the new word is enrolled into the vocabulary. If the test is failed, the new word is rejected. Accordingly, this type of method is limited by the fact that each utterance is designated as either a training repetition or a similarity test repetition. No additional benefit is derived from the similarity test utterance when training the new model, and the training utterance is not made available during similarity analysis for further testing. Prior art methods also compare the newly enrolled word with all other words in the vocabulary by performing a time synchronous Viterbi-type search over the whole vocabulary, which is time consuming and computationally intensive.
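As a rough illustration of the conventional two-repetition scheme just described, the following Python sketch uses a DTW template as the word model. It is a minimal sketch under stated assumptions, not an implementation taken from any particular prior art system: the function names (`dtw_distance`, `enroll_two_pass`), the representation of an utterance as a list of feature frames, the path-length normalization, and the `margin` acceptance rule are all hypothetical choices made for illustration.

```python
import math

def frame_dist(a, b):
    """Euclidean distance between two feature frames (lists of floats)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two frame sequences.

    Assumption: the accumulated cost is normalized by len(seq_a) + len(seq_b)
    so that utterances of different lengths are roughly comparable.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stay on frame j of seq_b
                                 cost[i][j - 1],      # stay on frame i of seq_a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m] / (n + m)

def enroll_two_pass(vocabulary, label, train_rep, test_rep, margin=0.5):
    """Two-repetition enrollment with a similarity test.

    train_rep is used only to build the new model (here, it simply becomes
    the template). test_rep is used only as a test utterance. The word is
    rejected if test_rep does not match the new template better than every
    existing template by at least `margin` (a hypothetical rule).

    Returns (accepted, conflicting_label).
    """
    new_score = dtw_distance(test_rep, train_rep)
    for other_label, template in vocabulary.items():
        if dtw_distance(test_rep, template) - new_score < margin:
            return False, other_label  # too close to an existing word
    vocabulary[label] = train_rep
    return True, None
```

The sketch makes the stated limitation visible: `test_rep` never contributes to the stored template, `train_rep` is never used as a second test utterance, and the loop scores the test utterance against every template in the vocabulary.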
Accordingly, there is a need for a method that can detect when new vocabulary words are acoustically similar to previously enrolled words, taking full advantage of all available data and completing any analysis with as little delay as possible.