Speech recognition applications try to determine the semantic meaning of a speech input. One common example is an automated dialog system in which the system prompts a user to provide a speech input indicating what action to take next. A speech recognition component analyzes the resulting speech input to try to determine its semantic meaning. Typically, statistical speech models are used to determine a sequence of words that best corresponds to the speech input.
Confidence scores can be used to characterize the degree of correspondence between a model sequence and the speech input. FIG. 1 shows a scale of confidence scores along a vertical axis ranging from a high of 1000 to a low of 0. Typically, speech inputs having a confidence score above a given accept threshold are automatically accepted as probably correctly recognized. Conversely, speech inputs having a confidence score below a given reject threshold are automatically rejected as probably not correctly recognized. Speech inputs with scores between the two thresholds may or may not be correctly recognized and usually require confirmation from the user.
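The three-way decision described above can be sketched as follows. This is only an illustrative sketch: the names `Decision` and `classify` are hypothetical, the reject threshold of 300 used in the usage note is an invented value, and the treatment of a score exactly equal to a threshold (here, routed to confirmation) is an assumption that the text does not specify.

```python
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"    # automatically accepted
    CONFIRM = "confirm"  # user confirmation required
    REJECT = "reject"    # automatically rejected


def classify(score: int, accept_threshold: int, reject_threshold: int) -> Decision:
    """Map a confidence score (0 to 1000) to one of the three outcomes."""
    if reject_threshold > accept_threshold:
        raise ValueError("reject threshold must not exceed accept threshold")
    if score > accept_threshold:
        return Decision.ACCEPT
    if score < reject_threshold:
        return Decision.REJECT
    # Assumption: scores on or between the thresholds go to confirmation.
    return Decision.CONFIRM
```

For example, with an accept threshold of 835 and a hypothetical reject threshold of 300, a score of 900 is accepted, a score of 500 is sent to confirmation, and a score of 100 is rejected.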
Various system performance measurements are used to set the confidence score thresholds. Inputs above the accept threshold, which are automatically accepted, contribute to a Correct Accepted (CA) rate when the identification is correct, and to a False Accepted (FA) rate when it is incorrect. Similarly, inputs below the reject threshold, which are automatically rejected, contribute to a Correct Rejected (CR) rate when the rejection is correct (i.e., the speech input is out of the recognition vocabulary), and to a False Rejected (FR) rate when the rejection is incorrect (i.e., the speech input is within the recognition vocabulary, but not correctly recognized). Inputs between the thresholds, which require user confirmation, contribute to Correct Confirmed (CC) and False Confirmed (FC) rates.
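As a rough sketch, the six rates could be tallied from labeled test utterances as shown below. Several things here are assumptions rather than part of the description above: the function name `tally_outcomes`, the tuple layout `(score, hypothesis_correct, in_vocabulary)`, the choice to express each rate as a fraction of all utterances, and the simplification that any in-vocabulary input below the reject threshold counts as FR.

```python
from collections import Counter


def tally_outcomes(utterances, accept_threshold, reject_threshold):
    """Return CA, FA, CR, FR, CC and FC as fractions of all utterances.

    utterances: iterable of (score, hypothesis_correct, in_vocabulary).
    """
    counts = Counter()
    total = 0
    for score, correct, in_vocab in utterances:
        total += 1
        if score > accept_threshold:
            # Automatically accepted.
            counts["CA" if correct else "FA"] += 1
        elif score < reject_threshold:
            # Automatically rejected: correct only for out-of-vocabulary input.
            counts["FR" if in_vocab else "CR"] += 1
        else:
            # Between the thresholds: user confirmation is requested.
            counts["CC" if correct else "FC"] += 1
    if total == 0:
        raise ValueError("at least one utterance is required")
    return {k: counts[k] / total for k in ("CA", "FA", "CR", "FR", "CC", "FC")}
```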
Ideally, the CA and CR rates should be as high as possible, while the FA and FR rates should be as low as possible, and at the same time, user confirmation (CC and FC) should be required as seldom as possible. In practice, this requires compromise and balancing of competing forces. Typically, various operating point criteria are established, such as some x% FA, y% FC, z% CA, etc. Then system performance data is collected for one or more test sets. This requires that some criteria be established for recognition correctness. Recognition of the test set is then performed with the final recognition grammar package, and each recognition result is labeled as correct or incorrect. From these results, a Receiver Operating Characteristic (ROC) curve can be determined (FA versus CA). The defined operating points are then located on the ROC curve and used to set the corresponding thresholds.
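The procedure above, sweeping a threshold over labeled test results to obtain FA versus CA points and then locating an operating point, might look like the following minimal sketch. The function names `roc_points` and `accept_threshold_for` are hypothetical, and two choices are assumptions: rates are expressed as fractions of the whole test set, and the lowest threshold that satisfies the FA limit is selected because it keeps CA as high as possible.

```python
def roc_points(results, thresholds):
    """Return (threshold, fa_rate, ca_rate) for each candidate threshold.

    results: list of (score, correct) pairs from a labeled test set.
    """
    n = len(results)
    if n == 0:
        raise ValueError("at least one labeled result is required")
    points = []
    for t in thresholds:
        fa = sum(1 for s, ok in results if s > t and not ok) / n
        ca = sum(1 for s, ok in results if s > t and ok) / n
        points.append((t, fa, ca))
    return points


def accept_threshold_for(results, max_fa, thresholds=range(0, 1001)):
    """Return the lowest threshold whose FA rate does not exceed max_fa.

    Thresholds must be in ascending order; FA can only fall as the
    threshold rises, so the first match keeps the most correct accepts.
    """
    for t, fa, ca in roc_points(results, thresholds):
        if fa <= max_fa:
            return t, fa, ca
    return None  # no candidate threshold meets the operating point
```

The exhaustive sweep is quadratic in the worst case; sorting the scores once would be more efficient for large test sets, but the simple form mirrors the description more directly.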
FIG. 2 shows an example of using a Receiver Operating Characteristic (ROC) curve to set confidence score thresholds. The horizontal axis is FA rate and the vertical axis is CA rate. In the example shown, the left-hand curve plots confidence scores for an in-vocabulary test set and the right-hand curve plots confidence scores for a more realistic test set with some out-of-vocabulary (OOV) data. Setting an accept threshold to meet a 1% FA operating point corresponds to a confidence score of 835 (out of 1000) and a 69% CA rate in the in-vocabulary data set, but in the more realistic right-hand data set with some OOV data, it would require a confidence score of 920 and achieve just a 36% CA rate.
The existing threshold setting approach has various disadvantages. For example, speech recognition applications typically use at least one confidence threshold, and most have several such thresholds, all of which need to be set. Setting these thresholds requires data sets that are specific to each given application. Usually this means live transcribed data, which is rather expensive and time-consuming to obtain. If the recognition engine, the acoustic models, or the grammar are changed, then the thresholds need to be retuned.
Moreover, if the tuning set used to set the thresholds is too small, the results may not be very robust. Suppose a 1% FA is required. A tuning set containing 100 or more errors is desired; if FAs form 1% of the set, then at least 10,000 utterances are necessary for the set to contain 100 FAs. In addition, for product applications (as opposed to custom one-off applications), different sites have different properties, and there is no guarantee that any one site has the same FA performance as another, or that any given site actually has 1% FA (or whatever the constraint is).
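The sizing arithmetic above can be checked with a short sketch. The function name `min_utterances` is hypothetical, and exact fractions are used only to avoid floating-point rounding; it assumes, as in the example, that the errors of interest make up a fixed fraction of the whole set.

```python
import math
from fractions import Fraction


def min_utterances(min_errors: int, error_rate: Fraction) -> int:
    """Smallest set size for which error_rate of the set is at least min_errors."""
    if error_rate <= 0:
        raise ValueError("error_rate must be positive")
    return math.ceil(Fraction(min_errors) / error_rate)
```

With 100 desired errors and a 1% FA rate, `min_utterances(100, Fraction(1, 100))` gives 10,000 utterances, matching the figure in the text; a stricter rate requires a proportionally larger set.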