In general, speech signal processing involves performing operations on electrical or electronic signals that represent speech. In one example, automatic speech recognition (ASR) technologies enable microphone-equipped computing devices to interpret speech and thereby provide an alternative to conventional human-to-computer input devices such as keyboards or keypads. An ASR system detects the presence of discrete speech, such as spoken commands, nametags, and numbers, and is programmed with a predefined acceptable vocabulary, known as in-vocabulary speech, that the system expects to hear from a user at any given time. For example, during voice dialing, the ASR system may expect to hear command vocabulary (e.g., Call, Dial, Cancel, Help, Repeat, Go Back, and Goodbye), nametag vocabulary (e.g., Home, School, and Office), and digit or number vocabulary (e.g., Zero through Nine, Pound, and Star).
An ASR system typically uses one or more types of confidence thresholds. For instance, a recognition confidence threshold establishes the minimum acceptable confidence that an utterance corresponds to some stored vocabulary in the ASR system. In another instance, a confusability confidence threshold establishes the maximum permissible confidence that an utterance is confusable with some stored vocabulary in the ASR system. For example, an ASR system may not allow a user to store a nametag utterance if confusability confidence values for hypotheses of the utterance are greater than the confusability confidence threshold.
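The two kinds of threshold can be illustrated with a minimal sketch. The names used here (Hypothesis, accept_recognition, can_store_nametag) and the representation of confidence as a fraction between 0 and 1 are assumptions made for illustration only; they are not taken from any particular ASR system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hypothesis:
    """A candidate match against stored vocabulary (hypothetical structure)."""
    word: str
    confidence: float  # assumed to lie in the range 0.0 to 1.0


def accept_recognition(hypotheses: List[Hypothesis],
                       recognition_threshold: float) -> Optional[Hypothesis]:
    """Return the best hypothesis if it meets the minimum acceptable level."""
    if not hypotheses:
        return None
    best = max(hypotheses, key=lambda h: h.confidence)
    return best if best.confidence >= recognition_threshold else None


def can_store_nametag(hypotheses: List[Hypothesis],
                      confusability_threshold: float) -> bool:
    """Allow storage only if no hypothesis exceeds the maximum permissible level."""
    return all(h.confidence <= confusability_threshold for h in hypotheses)


# Usage: a new nametag that sounds too much like the stored nametag "Home".
candidates = [Hypothesis("Home", 0.62), Hypothesis("Office", 0.10)]
print(can_store_nametag(candidates, confusability_threshold=0.50))
print(accept_recognition(candidates, recognition_threshold=0.60))
```

In this sketch the nametag is refused because one hypothesis has a confusability confidence above the threshold, while the same scores would pass a recognition check, which shows that the two thresholds bound the confidence value from opposite directions.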
Typically, such confidence thresholds are defined during ASR system training, wherein utterances from many people with different dialects are collected under different noise level conditions and analyzed to obtain statistically significant values. In one example, receiver operating characteristic (ROC) techniques can be used to select a confusability confidence threshold at an intersection of an out-of-vocabulary ROC curve and an in-vocabulary ROC curve. For instance, a common confusability confidence threshold is 50% for speakers of the Chinese language and 45% for speakers of the English language. Although different confidence threshold values may be set for different vocabularies, any given vocabulary confidence threshold value is applied uniformly to all speakers or users of the ASR system for any given language.
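One way to picture threshold selection at a curve intersection is sketched below. The text does not specify how the two ROC curves are constructed, so this sketch assumes a common reading: for each candidate threshold, compute the fraction of in-vocabulary utterances that would be rejected and the fraction of out-of-vocabulary utterances that would be accepted, then pick the threshold where the two rates are closest. The function name select_threshold, the score lists, and the step count are hypothetical.

```python
from typing import Sequence


def select_threshold(in_vocab_scores: Sequence[float],
                     out_vocab_scores: Sequence[float],
                     steps: int = 100) -> float:
    """Sweep thresholds in [0, 1] and return the one where the two error rates cross.

    Scores are assumed to be confidence values between 0.0 and 1.0 collected
    from training utterances; both sequences must be non-empty.
    """
    best_threshold = 0.0
    best_gap = float("inf")
    for i in range(steps + 1):
        t = i / steps
        # In-vocabulary utterances scoring below the threshold are falsely rejected.
        false_reject = sum(s < t for s in in_vocab_scores) / len(in_vocab_scores)
        # Out-of-vocabulary utterances scoring at or above it are falsely accepted.
        false_accept = sum(s >= t for s in out_vocab_scores) / len(out_vocab_scores)
        gap = abs(false_reject - false_accept)
        if gap < best_gap:  # keep the first threshold with the smallest gap
            best_threshold, best_gap = t, gap
    return best_threshold


# Usage with made-up training scores.
in_vocab = [0.6, 0.7, 0.8, 0.9]
out_vocab = [0.1, 0.2, 0.3, 0.4]
print(select_threshold(in_vocab, out_vocab))
```

Because the scores come from a pooled population of training speakers, the value returned is a single statistical compromise across that population rather than a value tuned to any individual speaker.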
But because such confidence threshold values are static and broadly statistical, they may not meet the needs of all speakers in all conditions. Accordingly, one confidence threshold value may be sufficient for some speakers but not others.