Speech recognition systems try to determine the semantic meaning of a speech input. One common example is an automated dialog system in which the system prompts a user to provide a speech input indicating what action to take next. A speech recognition component analyzes the resulting speech input to try to determine its semantic meaning. Typically, statistical speech models are used to determine a sequence of words that best corresponds to the speech input.
Confidence scores can be used to estimate the reliability of the output word sequence for the speech input. FIG. 1 shows a scale of confidence scores along a vertical axis ranging from a high of 1000 to a low of 0. Typically, speech recognition outputs having a confidence score above a given accept threshold are automatically accepted as probably correctly recognized. Conversely, speech recognition outputs having a confidence score below a given reject threshold are automatically rejected as probably not correctly recognized. Speech recognition outputs between the two confidence score thresholds may or may not be correctly recognized and usually require some form of user confirmation.
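The three-way decision described above can be illustrated with a minimal sketch. The function name `decide`, the example threshold values, and the treatment of a score exactly equal to a threshold are assumptions made for illustration only; the text above specifies only the 0 to 1000 scale and the accept, confirm, and reject regions.

```python
def decide(score, accept_threshold, reject_threshold):
    """Map a confidence score (0-1000 scale) to one of three actions.

    Assumption: a score equal to the accept threshold is accepted and a
    score equal to the reject threshold falls into the confirmation band.
    """
    if reject_threshold > accept_threshold:
        raise ValueError("reject threshold must not exceed accept threshold")
    if score >= accept_threshold:
        return "accept"   # probably correctly recognized
    if score < reject_threshold:
        return "reject"   # probably not correctly recognized
    return "confirm"      # uncertain: ask the user to confirm


# Hypothetical thresholds, chosen only to exercise the three regions.
for s in (900, 500, 100):
    print(s, decide(s, accept_threshold=800, reject_threshold=300))
```

Keeping the two thresholds as parameters rather than constants reflects the fact that they are tuned per application, as discussed in the following paragraphs.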
Various system performance measurements are used to set the confidence score thresholds. Inputs above the acceptance threshold that are automatically accepted contribute to a Correct Accepted (CA) rate when the identification is correct, and to a False Accepted (FA) rate when incorrect. Similarly, inputs below the rejection threshold that are automatically rejected contribute to a Correct Rejected (CR) rate when the rejection is correct (e.g., the speech input is out of the recognition vocabulary), and to a False Rejected (FR) rate when the rejection is incorrect (e.g., the speech input is within the recognition vocabulary, but the utterance was rejected). Inputs between the thresholds that require user confirmation contribute to Correct Confirmed (CC) and False Confirmed (FC) rates.
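As a rough illustration of how these six rates might be tallied over a labeled test set, consider the sketch below. The `Result` record, its field names, and the choice to express each rate as a fraction of all test inputs are assumptions, not part of the description above. In particular, the sketch judges a rejection as correct only when the input is out of vocabulary, which follows the examples given above but is not the only possible correctness criterion.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Result:
    score: int           # confidence score on the 0-1000 scale
    correct: bool        # recognized word sequence matches the reference
    in_vocabulary: bool  # utterance is covered by the recognition vocabulary


def outcome(r, accept_threshold, reject_threshold):
    """Classify one result as CA, FA, CR, FR, CC, or FC."""
    if r.score >= accept_threshold:          # automatically accepted
        return "CA" if r.correct else "FA"
    if r.score < reject_threshold:           # automatically rejected
        # Assumption: rejecting an in-vocabulary utterance is a false reject.
        return "FR" if r.in_vocabulary else "CR"
    return "CC" if r.correct else "FC"       # sent to user confirmation


def outcome_rates(results, accept_threshold, reject_threshold):
    """Return each outcome as a fraction of all test inputs."""
    if not results:
        raise ValueError("test set is empty")
    counts = Counter(outcome(r, accept_threshold, reject_threshold)
                     for r in results)
    return {k: counts[k] / len(results)
            for k in ("CA", "FA", "CR", "FR", "CC", "FC")}
```

Because every input falls into exactly one of the six categories, the returned rates sum to 1 under this normalization.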
Ideally, the CA and CR rates should be as high as possible and the FA and FR rates as low as possible, while user confirmation (CC and FC) should be required as seldom as possible. In practice, this requires compromise and balancing of competing factors. Typically, various operating point criteria are established, such as some x % FA, y % FC, z % CA, etc. System performance data is then collected for one or more test sets, which requires that some recognition correctness criteria be established. Recognition of the test set is then performed with the final recognition grammar package, and each recognition result is labeled as correct or incorrect. From these results, a Receiver Operating Characteristic (ROC) curve (FA versus CA) can be determined. The defined system operating points are located on the ROC curve and are then used to set the corresponding confidence score thresholds.
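The procedure of deriving an ROC curve from labeled results and reading a threshold off an operating point can be sketched as follows. This is only one plausible realization: the function names, the representation of the test set as `(score, is_correct)` pairs, the use of each distinct observed score as a candidate threshold, and the normalization of FA and CA by the total number of test inputs are all assumptions.

```python
def roc_points(results):
    """Compute (threshold, fa_rate, ca_rate) for each candidate threshold.

    results: list of (score, is_correct) pairs from a labeled test set.
    An input counts as accepted when its score is >= the threshold.
    Points are returned from the highest threshold to the lowest.
    """
    total = len(results)
    points = []
    for threshold in sorted({score for score, _ in results}, reverse=True):
        ca = sum(1 for s, ok in results if s >= threshold and ok)
        fa = sum(1 for s, ok in results if s >= threshold and not ok)
        points.append((threshold, fa / total, ca / total))
    return points


def accept_threshold_for(results, max_fa_rate):
    """Pick the lowest threshold whose FA rate stays within max_fa_rate.

    Returns (threshold, fa_rate, ca_rate), or None if even the highest
    candidate threshold exceeds the FA target.
    """
    best = None
    for point in roc_points(results):
        if point[1] > max_fa_rate:
            break          # FA rate only grows as the threshold drops
        best = point       # lower threshold, still within the FA target
    return best


# Tiny made-up test set, used only to show the calling convention.
labeled = [(950, True), (900, True), (850, False), (800, True), (700, False)]
print(accept_threshold_for(labeled, max_fa_rate=0.2))
```

Choosing the lowest threshold that still meets the FA target maximizes the CA rate at that operating point, since lowering the threshold can only admit more accepted inputs.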
FIG. 2 shows one specific example of using a Receiver Operating Characteristic (ROC) curve to set confidence score thresholds. The horizontal axis is FA rate and the vertical axis is CA rate. In the example shown, the dark curve plots confidence scores for an in-vocabulary test set and the light curve plots confidence scores for a more realistic test with some out-of-vocabulary (OOV) data. Setting an accept threshold to meet a 1% FA operating point would correspond to a confidence score of 835 (out of 1000) and a 69% CA rate in the in-vocabulary data set, but in the more realistic right hand data set with some OOVs, it would require a confidence score of 920 and achieve just a 36% CA rate.