An automatic speech recognition (ASR) system takes an audio signal as input and typically compares it with the known sounds (phones) and sequences of sounds (paths) of an acoustic model (AM) to identify words that appear to match the spoken sequence of sounds. After a word or words corresponding to the input audio signal have been identified, a text or other machine-readable representation of the matching words may be returned by the ASR system to an application program such as an interactive voice response (IVR) telephony application. A confidence score may be returned with each apparently matching word, the confidence score being based on the closeness of an incoming acoustic segment to the mean of a probability distribution associated with a phone within the ASR system's acoustic model. A number of possible words and their respective confidence scores may be returned for selection or further analysis.
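By way of illustration only, the following minimal sketch shows one way a confidence score could be derived from the closeness of an acoustic segment to the mean of a phone's probability distribution. It assumes, purely for the example, that each phone is modelled by a single Gaussian with diagonal covariance and that the segment is a fixed-length feature vector; the function name `phone_confidence` and its parameters are hypothetical and are not taken from any particular ASR system.

```python
import math
from typing import Sequence


def phone_confidence(
    segment: Sequence[float],
    mean: Sequence[float],
    variance: Sequence[float],
) -> float:
    """Return a score in (0, 1] that is 1.0 when the segment equals the mean.

    Assumption: the phone is modelled by one diagonal-covariance Gaussian.
    Real acoustic models are usually more elaborate than this.
    """
    if not (len(segment) == len(mean) == len(variance)):
        raise ValueError("segment, mean and variance must have equal length")
    if any(v <= 0 for v in variance):
        raise ValueError("variances must be positive")

    # Squared distance from the mean, scaled per dimension by the variance.
    dist_sq = sum((x - m) ** 2 / v for x, m, v in zip(segment, mean, variance))

    # Map the distance to a score: the closer to the mean, the higher the score.
    return math.exp(-0.5 * dist_sq)


if __name__ == "__main__":
    mean = [1.0, 2.0]
    variance = [1.0, 1.0]
    print(phone_confidence([1.0, 2.0], mean, variance))  # segment at the mean
    print(phone_confidence([2.0, 2.0], mean, variance))  # segment further away
```

In this sketch a segment lying exactly on the mean scores 1.0, and the score falls towards zero as the segment moves away from the mean, so the scores of competing candidate words can be compared directly.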
Typical ASR systems require a considerable amount of training data from a single user (speaker dependent) or from multiple users (speaker independent) to enable the recognition engine to learn to associate the acoustic input with the corresponding sounds (‘phone labels’) of the language. When deployed in a real application, such as an automated telephony service, the sequence of sounds that the ASR system identifies must also be matched against an application-specific grammar, which predefines the words and phrases that are expected. If the ASR system is trained on enough data and the grammar covers all possible words and phrases, recognition accuracy can be very high. However, individual sounds within a given language may be easily confused, such as “F” and “S” in English, and such sounds may well appear within the words of the application grammar. In such cases, recognition accuracy will tend to decrease.
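The grammar matching and the confusability problem described above can be sketched as follows. This is an illustrative simplification only: letters stand in for phones, the grammar is a flat set of words, and the confusion table contains just the “F”/“S” pair named in the text. The names `CONFUSABLE`, `confusable_pairs` and `best_in_grammar` are hypothetical.

```python
from itertools import combinations
from typing import Iterable, List, Optional, Set, Tuple

# Assumed confusion table: "s" is treated as indistinguishable from "f".
# Letters are used here as a crude stand-in for phones.
CONFUSABLE = {"s": "f"}


def normalize(word: str) -> str:
    """Collapse confusable letters so that confusable words compare equal."""
    return "".join(CONFUSABLE.get(ch, ch) for ch in word.lower())


def confusable_pairs(grammar: Iterable[str]) -> List[Tuple[str, str]]:
    """List pairs of grammar entries that differ only in confusable sounds."""
    return [
        (a, b)
        for a, b in combinations(sorted(grammar), 2)
        if normalize(a) == normalize(b)
    ]


def best_in_grammar(
    n_best: Iterable[Tuple[str, float]], grammar: Set[str]
) -> Optional[Tuple[str, float]]:
    """Pick the highest-confidence hypothesis that the grammar allows."""
    allowed = {g.lower() for g in grammar}
    in_grammar = [(w, c) for w, c in n_best if w.lower() in allowed]
    return max(in_grammar, key=lambda wc: wc[1], default=None)


if __name__ == "__main__":
    grammar = {"fix", "six", "seven"}
    print(confusable_pairs(grammar))
    print(best_in_grammar([("sticks", 0.9), ("six", 0.7), ("fix", 0.6)], grammar))
```

Here the out-of-grammar hypothesis is discarded despite its higher confidence, while the two in-grammar hypotheses that differ only in a confusable sound remain close competitors, which is the situation in which recognition accuracy tends to decrease.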
It is common practice in most automated services using ASR to ask the user to confirm whether or not the ASR result is correct. If no result is returned, callers may be asked to repeat words that were not recognized. For example, a caller may speak the name of a person they wish to contact, “Stephen James”. If the synthesized voice response includes a different name, such as “Did you say ‘Peter Jones’?”, the caller is unlikely to be impressed. Having to repeat their input may also annoy callers. Even if the expected confirmation is just “Yes” or “No”, the ASR system may confuse the two, in particular because a user prompt such as “Did you say Stephen James?” could be answered with “yeah”, “OK”, “correct”, or “nope”, “nah” and so forth.
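As a hedged illustration of why the confirmation step is itself a recognition task, the following sketch maps the varied responses listed above onto a yes/no decision and signals when the caller would have to be prompted again. The word lists contain only the example responses given in the text, and the function name `interpret_confirmation` is hypothetical; a deployed grammar would need to cover many more variants.

```python
from typing import Optional

# Example responses taken from the text; not an exhaustive grammar.
AFFIRMATIVE = {"yes", "yeah", "ok", "correct"}
NEGATIVE = {"no", "nope", "nah"}


def interpret_confirmation(utterance: str) -> Optional[str]:
    """Return "yes", "no", or None when the response is not understood."""
    # Trim whitespace and trailing punctuation, then ignore case.
    token = utterance.strip().rstrip(".!?").lower()
    if token in AFFIRMATIVE:
        return "yes"
    if token in NEGATIVE:
        return "no"
    # Unrecognized response: the caller would have to be asked again.
    return None


if __name__ == "__main__":
    for reply in ["Yeah!", "nah", " OK ", "maybe"]:
        print(repr(reply), "->", interpret_confirmation(reply))
```

Any response outside the two lists yields `None`, which corresponds to the case in which the caller must repeat their input.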