This invention relates to voice recognition and, more particularly, to a method and system for improving the performance of a speech recognition system that allows multiple speech attempts by a user.
Voice recognition is a broad term that includes both recognition of the speech content, referred to as speech recognition, and recognition of the speaker, referred to as speaker recognition. Voice recognition technology can be applied to communication-based information processing in tasks such as bank-by-phone, access to information databases, access to voice mail, etc. Telephone systems are primary components of such processes.
An essential step in voice recognition processes is the comparison of some representation of test speech, referred to as a test speech representation, to reference representations of the speech, referred to generically as reference speech representations. In the context of voice recognition, reference speech representations are referred to as word models. Test speech is the speech uttered by the user which is to be recognized. If the two representations match, the test speech is said to correspond to the matched reference speech representation. These various representations are generally some transformation of the speech. For example, the word models for the reference speech representations may be parametric representations in the form of Linear Predictive Coding (LPC) coefficients.
Voice recognition generally includes a training procedure and a test procedure. In the training procedure, a training database made up of known speech is analyzed to produce the reference speech representations. The training database typically comprises predetermined speech from one or many different speakers. From the training database, a suitable method is used to produce the reference speech representations. During the voice recognition process, a spoken test speech is analyzed to produce a representation of the test speech compatible with the reference speech representation. The test speech representation for voice recognition is then compared to the reference speech representations to find the highest ranking and most closely matched reference speech representations.
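The comparison step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each representation is a short fixed-length vector of coefficients and that closeness is measured by Euclidean distance. The names `distance`, `rank_references`, and `word_models`, along with the sample values, are hypothetical and do not come from any particular recognizer.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_references(test_repr, reference_reprs):
    """Return (word, distance) pairs sorted from closest to farthest match."""
    scored = [(word, distance(test_repr, ref)) for word, ref in reference_reprs.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical word models: short vectors standing in for LPC-style parameters
# that a training procedure would have produced from known speech.
word_models = {
    "yes": [0.9, 0.1, 0.3],
    "no": [0.1, 0.8, 0.5],
    "help": [0.4, 0.4, 0.9],
}

# Hypothetical test speech representation, compatible with the word models.
test_speech = [0.85, 0.15, 0.35]

ranking = rank_references(test_speech, word_models)
best_word = ranking[0][0]  # the most closely matched reference representation
```

A practical system would typically compare sequences of feature frames rather than single vectors, so this sketch only shows the ranking idea.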
In a typical voice recognition system, the reference speech representations are limited to a set of acceptable word models. That is, a user's speech input is classified (i.e., accepted or rejected) according to the predefined word models. Furthermore, speech representations may be defined for non-words representing the environment of the user (e.g., noisy background) and/or expected extraneous sounds made by the user (e.g., lip smacks). The non-word speech representations are often used in keyword-spotting, which enables the recognition of specific vocabulary words that are embedded in a stream of speech and/or noise.
A voice recognition system typically prompts the user for speech input. If the speech input cannot be classified according to the word models, a rejection of the utterance takes place. Upon rejection of the speech input, the voice recognition system may prompt the speaker multiple times in an attempt to eventually accept (i.e., recognize) the speech input, with some limit on how many attempts are requested of the speaker (e.g., two or three).
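The prompting behavior described above can be sketched as a bounded retry loop. This is an illustrative sketch only: `recognize_with_retries`, `get_utterance`, and `classify` are hypothetical names, and the classifier is a stub that returns a word on acceptance or `None` on rejection. Each attempt here is classified on its own, with no information carried over from earlier attempts.

```python
def recognize_with_retries(get_utterance, classify, max_attempts=3):
    """Prompt until an utterance is accepted or the attempt limit is reached.

    Returns (word, attempts_used); word is None if every attempt was rejected.
    """
    for attempt in range(1, max_attempts + 1):
        utterance = get_utterance(attempt)  # prompt the user for speech input
        word = classify(utterance)          # None means the utterance was rejected
        if word is not None:
            return word, attempt
    return None, max_attempts


# Stubbed inputs standing in for real audio capture and recognition.
responses = ["mumble", "noise", "yes"]


def get_utterance(attempt):
    return responses[attempt - 1]


def classify(utterance):
    return utterance if utterance in {"yes", "no"} else None


result = recognize_with_retries(get_utterance, classify, max_attempts=3)
limited = recognize_with_retries(get_utterance, classify, max_attempts=2)
```

With the stubbed responses, the third attempt is the first one accepted, and a limit of two attempts ends in rejection.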
The known prior art includes voice recognition systems that classify the speech input of each attempt independently of the other attempts. Thus, the results of earlier attempts are not available as a reference when the speech input of a successive attempt is classified.
It is thus a general object of the present invention to provide a method and system for improving the performance of a voice recognition system when multiple speech attempts by a user are allowed.
It is another object of the present invention to provide a method and system for improving the performance of a voice recognition system when multiple speech attempts by a user are allowed by using the results from all the speech attempts to determine what the spoken utterance may be.
In carrying out the above objects and other objects, features, and advantages of the present invention, a method is provided for improving the performance of a voice recognition system when multiple speech attempts by a user are allowed. The method includes the steps of determining at least one best word and corresponding word and non-word scores for each speech attempt by the user and determining at least one common best word among all the speech attempts. The method also includes the step of determining an objective measure for each of the at least one common best word for each speech attempt by the user. The objective measure represents a confidence level of the corresponding word and non-word scores. The method further includes the step of comparing each of the objective measures to a predetermined threshold. Finally, the method includes the step of classifying the multiple speech attempts by the user based on the comparison.
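The steps of the method can be sketched as follows. This is a hedged illustration under stated assumptions, not the claimed implementation: the text does not give a formula for the objective measure or a specific classification rule, so the sketch assumes the measure is the word score minus the non-word score, and assumes a common best word is accepted only if its measure meets the threshold in every attempt. All function names and sample scores are hypothetical.

```python
def common_best_words(attempts):
    """Words that appear among the best words of every speech attempt."""
    word_sets = [{word for word, _, _ in attempt} for attempt in attempts]
    return set.intersection(*word_sets) if word_sets else set()


def objective_measure(word_score, non_word_score):
    # ASSUMPTION: confidence is modeled as the margin between the word score
    # and the non-word score. The source does not specify the formula.
    return word_score - non_word_score


def classify_attempts(attempts, threshold):
    """Return the accepted word, or None if the attempts are rejected.

    ASSUMPTION: a common best word is accepted only if its objective measure
    meets the threshold in every attempt; ties are broken by mean measure.
    """
    accepted = {}
    for word in common_best_words(attempts):
        measures = [
            objective_measure(w_score, nw_score)
            for attempt in attempts
            for cand, w_score, nw_score in attempt
            if cand == word
        ]
        if all(m >= threshold for m in measures):
            accepted[word] = sum(measures) / len(measures)
    return max(accepted, key=accepted.get) if accepted else None


# Each attempt lists (best word, word score, non-word score); values are invented.
attempts = [
    [("Boston", 0.70, 0.50), ("Austin", 0.65, 0.50)],
    [("Boston", 0.75, 0.45), ("Houston", 0.60, 0.45)],
]

accepted_word = classify_attempts(attempts, threshold=0.15)
rejected_word = classify_attempts(attempts, threshold=0.25)
```

In this example "Boston" is the only word common to both attempts. It is accepted at the lower threshold and rejected at the higher one, because its margin in the first attempt is only about 0.20.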
In further carrying out the above objects and other objects, features, and advantages of the present invention, a system is also provided for carrying out the steps of the above-described method. The system includes means for determining at least one best word and corresponding word and non-word scores for each speech attempt by the user. The system also includes means for determining at least one common best word among all the speech attempts. The system further includes means for determining an objective measure for each of the at least one common best word for each speech attempt by the user. Still further, the system includes means for comparing each of the objective measures to a predetermined threshold. Finally, the system includes means for classifying the multiple speech attempts by the user based on the comparison.
The above objects and other objects, features and advantages of the present invention are readily apparent from the following detailed description of the best mode for carrying out the invention when taken in connection with the accompanying drawings.