Natural language systems try to determine the semantic meaning of a text input, such as a text sequence output from automatic speech recognition (ASR). One common natural language application is an automated dialog system, such as call steering, in which the system prompts a user to provide a speech input indicating what action to take next. A speech recognition component analyzes the resulting speech input to try to determine its semantic meaning. Typically, statistical speech models are used to determine a sequence of words that best corresponds to the speech input.
Using the specific example of a call steering application, the system is evaluated on a test set of utterances, each of which is annotated with:
1) Caller intent: One of a finite set of intents assigned to the caller's utterance by a human expert. For example, in a technical support application, “my printer is not working, I need help” might be assigned to the intent “PRINTER PROBLEM”. Note that in some cases the caller's intent may actually be out-of-domain, meaning that it doesn't match any of the finite set available.
2) Semantic interpretation: The interpretation that is automatically determined by the system. This interpretation is typically drawn from the same set of intents as the caller intents.
3) Correctness: An utterance is deemed correct if the caller's intent is identical to the semantic interpretation.
4) Confidence score: The confidence the system has in the interpretation: a number between 0 and 100.
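The four annotations above can be pictured as one record per test utterance. The following is only a minimal sketch: the class name `AnnotatedUtterance`, its field names, the use of `None` to mark an out-of-domain caller intent, and the sample score of 87 are all illustrative assumptions, not part of any described system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AnnotatedUtterance:
    """One test-set utterance with its annotations (hypothetical layout)."""
    text: str
    caller_intent: Optional[str]  # assigned by a human expert; None = out-of-domain (assumption)
    interpretation: str           # intent chosen automatically by the system
    confidence: float             # system confidence in the interpretation, 0 to 100

    def __post_init__(self) -> None:
        if not 0 <= self.confidence <= 100:
            raise ValueError("confidence must be between 0 and 100")

    @property
    def correct(self) -> bool:
        # Correct only when the caller's intent is identical to the interpretation.
        # An out-of-domain intent (None) can never equal a system interpretation.
        return self.caller_intent == self.interpretation


utterance = AnnotatedUtterance(
    text="my printer is not working, I need help",
    caller_intent="PRINTER PROBLEM",
    interpretation="PRINTER PROBLEM",
    confidence=87.0,  # made-up value for illustration
)
```

Deriving correctness from the other two fields, rather than storing it separately, keeps the three annotations from drifting out of agreement.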
Looking more closely at confidence scores, these characterize the degree of correspondence between a given word sequence and a speech input. FIG. 1 shows a scale of confidence scores along a vertical axis ranging from a high of 1000 to a low of 0. Typically, speech recognition outputs with a confidence score above a given high confidence threshold are automatically accepted as probably correctly recognized, and outputs with a confidence score below a given low confidence threshold are automatically rejected as probably not correctly recognized. Outputs that fall between the two thresholds may or may not be correctly recognized and usually require some form of user confirmation.
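The two-threshold scheme amounts to a three-way decision on each recognition output. The sketch below assumes the 0 to 1000 scale of FIG. 1; the function name `decide`, the threshold values 300 and 700, and the choice to route a score exactly equal to a threshold to confirmation are illustrative assumptions.

```python
def decide(confidence: float, low_threshold: float, high_threshold: float) -> str:
    """Return 'accept', 'reject', or 'confirm' for one recognition output."""
    if low_threshold > high_threshold:
        raise ValueError("low_threshold must not exceed high_threshold")
    if confidence > high_threshold:
        return "accept"   # probably correctly recognized
    if confidence < low_threshold:
        return "reject"   # probably not correctly recognized
    return "confirm"      # uncertain: ask the user to confirm


# Hypothetical thresholds on a 0..1000 scale.
LOW, HIGH = 300, 700
decisions = [decide(score, LOW, HIGH) for score in (950, 500, 120)]
```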
Various system performance measurements can be used to set the confidence score thresholds. Inputs above the higher threshold which are automatically accepted contribute to a Correct Accepted (CA) rate when the identification is correct, and to a False Accepted (FA) rate when incorrect. Similarly, inputs below the lower threshold which are automatically rejected contribute to a Correct Rejected (CR) rate when the rejection is correct (i.e., the speech input is out of the recognition vocabulary), and to a False Rejected (FR) rate when the rejection is incorrect (i.e., the speech input is within the recognition vocabulary, but not correctly recognized). Inputs between the two thresholds which require user confirmation contribute to Correct Confirmed (CC) and False Confirmed (FC) rates.
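As a rough illustration of how these six rates might be tallied over a labeled test set, consider the sketch below. The `Result` record, the `in_vocabulary` flag, the threshold values, and the choice to express each rate as a fraction of all test inputs are assumptions made for the example; a real evaluation may normalize differently.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple

OUTCOMES = ("CA", "FA", "CR", "FR", "CC", "FC")


class Result(NamedTuple):
    confidence: float    # confidence score of the recognition output
    correct: bool        # recognition result labeled correct or incorrect
    in_vocabulary: bool  # whether the speech input is within the recognition vocabulary


def outcome(result: Result, low: float, high: float) -> str:
    """Classify one result into CA, FA, CR, FR, CC, or FC."""
    if result.confidence > high:   # automatically accepted
        return "CA" if result.correct else "FA"
    if result.confidence < low:    # automatically rejected
        # Rejection is correct only when the input is out of vocabulary.
        return "FR" if result.in_vocabulary else "CR"
    return "CC" if result.correct else "FC"  # sent for user confirmation


def rates(results: Iterable[Result], low: float, high: float) -> Dict[str, float]:
    """Return each outcome as a fraction of all test inputs (assumed normalization)."""
    results = list(results)
    if not results:
        raise ValueError("at least one result is required")
    counts = Counter(outcome(r, low, high) for r in results)
    return {name: counts[name] / len(results) for name in OUTCOMES}


# Made-up test data, with hypothetical thresholds of 300 and 700.
sample = [
    Result(900, True, True),    # CA
    Result(850, False, True),   # FA
    Result(100, False, False),  # CR
    Result(150, True, True),    # FR
    Result(500, True, True),    # CC
    Result(500, False, True),   # FC
]
sample_rates = rates(sample, low=300, high=700)
```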
Ideally, the CA and CR rates should be as high as possible and the FA and FR rates as low as possible, while user confirmation (CC and FC) should be required as seldom as possible. In practice, this requires a speech recognition engineer to compromise among and balance competing factors. Typically, various operating point criteria are established, such as some x % FA, y % FC, z % CA, etc. System performance data is then collected for one or more test sets, which requires that some recognition correctness criteria be established. Recognition of the test set is performed with the final recognition grammar package, and each recognition result is labeled as correct or incorrect. From these results, a Receiver Operating Characteristic (ROC) curve (FA versus CA) can be determined. The defined system operating points are located on the ROC curve and are then used to set the corresponding confidence score thresholds. FIG. 2 shows one specific example of a typical ROC curve used to set confidence score thresholds.
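One way to picture the ROC-based procedure is to sweep a candidate acceptance threshold over the labeled test results, record the FA and CA rates at each candidate, and then pick the threshold that meets an FA criterion. The sketch below is only an illustration under stated assumptions: the function names, the made-up scores, the use of observed scores as candidate thresholds, and the normalization of both rates by the test-set size are not taken from any described system.

```python
from typing import List, Sequence, Tuple

Scored = Tuple[float, bool]               # (confidence score, labeled correct?)
RocPoint = Tuple[float, float, float]     # (threshold, FA rate, CA rate)


def roc_points(scored: Sequence[Scored]) -> List[RocPoint]:
    """Compute (threshold, FA rate, CA rate) for each distinct observed score.

    A result is accepted when its confidence is strictly above the threshold.
    """
    if not scored:
        raise ValueError("at least one scored result is required")
    total = len(scored)
    points = []
    for threshold in sorted({confidence for confidence, _ in scored}):
        accepted = [ok for confidence, ok in scored if confidence > threshold]
        correct_accepted = sum(accepted)
        false_accepted = len(accepted) - correct_accepted
        points.append((threshold, false_accepted / total, correct_accepted / total))
    return points


def pick_high_threshold(points: Sequence[RocPoint], max_fa: float) -> float:
    """Return the lowest threshold whose FA rate stays within max_fa.

    Lowering the threshold can only keep or raise the CA rate, so the lowest
    qualifying threshold gives the highest CA rate allowed by the FA criterion.
    """
    qualifying = [point for point in points if point[1] <= max_fa]
    if not qualifying:
        raise ValueError("no threshold satisfies the FA criterion")
    return min(qualifying, key=lambda point: point[0])[0]


# Made-up labeled test results on a 0..1000 scale.
test_results = [(900, True), (800, False), (700, True),
                (600, True), (400, False), (200, False)]
curve = roc_points(test_results)
high_threshold = pick_high_threshold(curve, max_fa=0.2)
```

The low confidence threshold could be chosen in a similar way from a criterion on the rejection rates, and plotting the FA rate against the CA rate for each point in `curve` gives the kind of curve described above.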