A typical speech recognition system uses one or more speech models developed from a large vocabulary stored in a speech recognition adaptation database. The vocabulary includes most common words and attempts to cover the wide variation within a single language that arises from voice characteristics, dialects, education, noisy environments, etc. When the speech recognition system is first installed, performance is often very poor because the one or more speech models need to be trained for the speakers in the region. Over a long period of time, retraining the speech models will improve the speech recognition system performance. Even after the speech models are trained, the speech recognition system typically recognizes an average speaker's verbal response. However, the speech recognizer may still be unable to correctly transcribe the verbal responses of all speakers for the reasons listed previously. Additionally, technical terms and proper names that have not entered common usage may not be recognized. Hence, while the system undergoes this retraining process, which could take a significant period of time, the customer will continue to receive poor performance.
Typical speech recognition systems use unsupervised automatic adaptation, i.e., mathematical algorithms and/or confidence scores, to determine whether a recognized word or utterance and its transcript should be used to update the vocabulary in the adaptation database. Mathematical algorithms determine the probability that the transcription, i.e., the text of the utterance or word, is correct. A high probability, such as 90%, would indicate that the correct speech model was used to recognize the utterance or word. When the probability is high, the recognized utterance or word and its transcript are likely to be used to retrain one or more speech models.
The speech recognition system may assign a confidence score to each recognized utterance or word to provide a measure of the accuracy of the recognition. For example, a confidence score of 30 or below would indicate that the speech recognition system has little confidence that the utterance or word was correctly recognized, so the utterance or word should not be used to retrain one or more speech models. By contrast, a confidence score of 90 or above would indicate that the utterance or word was correctly recognized and can be used to retrain one or more speech models.
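The confidence-score gating described above can be sketched as follows. This is only an illustrative sketch, not an implementation taken from any particular recognizer: the `Recognition` type, the `adaptation_decision` function, and the 0 to 100 score scale are assumptions made for the example. The text gives only the two end points (30 or below, 90 or above), so the "undecided" outcome for scores in between is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass
class Recognition:
    """Hypothetical container for one recognition result."""
    transcript: str   # text produced by the recognizer
    confidence: int   # assumed scale: 0 (no confidence) to 100 (certain)


# Thresholds taken from the example values in the text.
REJECT_AT_OR_BELOW = 30
ACCEPT_AT_OR_ABOVE = 90


def adaptation_decision(result: Recognition) -> str:
    """Decide whether a result may be used to retrain the speech models.

    Returns "use", "discard", or "undecided". The middle band is an
    assumption; the text does not say how scores between 30 and 90
    are handled.
    """
    if result.confidence >= ACCEPT_AT_OR_ABOVE:
        return "use"
    if result.confidence <= REJECT_AT_OR_BELOW:
        return "discard"
    return "undecided"


if __name__ == "__main__":
    # Confidence values below are invented for illustration.
    for r in (Recognition("Avaya", 95),
              Recognition("Avalon Labs", 60),
              Recognition("Papaya Limited", 25)):
        print(r.transcript, "->", adaptation_decision(r))
```

Note that the decision depends only on the score; nothing in this gate checks whether the transcript actually matches what the speaker said.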
One of the problems faced by current speech recognition systems using unsupervised automatic adaptation is that the speech recognizer has no way of determining whether it correctly recognized the word or utterance it will use to retrain one or more speech models. For example, if the confidence score or probability of correctness is low, the utterance or word is not used to adapt a speech model even if it was recognized correctly. Conversely, if the confidence score or probability is high but the utterance or word was incorrectly recognized, it will be used to adapt one or more speech models. Unfortunately, adapting one or more speech models with incorrectly recognized utterances or words decreases, rather than increases, the number of utterances or words the speech recognition system recognizes correctly.
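The two failure cases can be made concrete with a small sketch. It assumes a ground-truth field (`spoken`) that an unsupervised recognizer does not actually have, which is precisely the problem: only an outside observer could run this audit. The `Sample` type, the `audit_adaptation_set` function, the single accept threshold of 90, and all confidence values are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical record pairing a recognition result with ground truth."""
    spoken: str       # what the speaker actually said (unknown to the recognizer)
    recognized: str   # what the recognizer transcribed
    confidence: int   # assumed scale: 0 to 100


def audit_adaptation_set(samples, accept_threshold=90):
    """Split out the two failure cases of confidence-only gating.

    polluting: misrecognized, yet confident enough to be used for adaptation.
    wasted:    recognized correctly, yet rejected because confidence was low.
    """
    polluting = [s for s in samples
                 if s.confidence >= accept_threshold and s.recognized != s.spoken]
    wasted = [s for s in samples
              if s.confidence < accept_threshold and s.recognized == s.spoken]
    return polluting, wasted


if __name__ == "__main__":
    # Invented data for illustration only.
    samples = [
        Sample("Avaya", "Papaya Limited", 92),  # wrong but accepted
        Sample("Avaya", "Avaya", 25),           # right but rejected
        Sample("Avaya", "Avaya", 95),           # right and accepted
        Sample("Avaya", "Avalon Labs", 20),     # wrong and rejected
    ]
    polluting, wasted = audit_adaptation_set(samples)
    print("polluting:", [s.recognized for s in polluting])
    print("wasted:", [s.recognized for s in wasted])
```

Only the third and fourth samples are handled as intended; the first degrades the speech models and the second discards usable training data.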
In this unsupervised mode, a dialog needs to request confirmation from the speaker that it correctly recognized the verbal response, i.e., utterance or word, such as “Did you mean X?”, where X is the recognized verbal response, i.e., the transcription or text of the utterance or word, converted to speech by a text-to-speech resource. Typically, confirmation is requested for complicated dialogs, such as when a customer requests to transfer money between bank accounts and the dialog requests confirmation of the bank account numbers and the amount of the transfer. Asking for confirmation after every verbal response can annoy the customer and lengthen the time the customer spends using the speech recognition system.
Additionally, while the speech recognition system is undergoing the improvement process using unsupervised automatic adaptation of one or more speech models, the speaker will experience frustration and hang up if the speech recognition system misrecognizes too many words or utterances.
The following is an example of a speech recognition system where multiple misrecognitions have occurred and the customer hangs up in frustration:
IVR dialog: “Please state the name of the company you wish to find.”
Speaker: “Avaya.”
IVR dialog: “Was that Papaya Limited?”
Speaker: “No.”
IVR dialog: “Please state the name of the company you wish to find.”
Speaker: “Avaya.” (spoken in a louder tone)
IVR dialog: “Was that Avalon Labs?”
Speaker: “No.”
IVR dialog: “Please state the name of the company you wish to find.”
Speaker: “Avaya.” (spoken in a frustrated voice)
IVR dialog: “Was that Papaya Limited?”
Speaker hangs up.
Another mode, supervised monitoring and intervention, provides better input data to adapt one or more speech models. However, supervised monitoring and intervention has not been performed in real time; that is, monitoring a speaker's voice inputs has not been used to automatically adapt one or more speech models.