A telephone-based speech recognition application, such as a spoken dialog system, can be modeled as a sequence of recognition states. At each state, a prompt is played, the caller responds, and the spoken response is sent to the recognizer. The recognized utterance is returned with a confidence value that reflects how certain the system is that the utterance was assigned to the correct class. Depending on the confidence value, the system may take one of several actions, based on thresholds set by a speech recognition engineer.
Often there are two confidence score thresholds, a low-confidence threshold (LCT) and a high-confidence threshold (HCT), which divide confidence scores into three regions (reject, confirm, accept):
If the confidence score is below the LCT, the utterance is rejected and typically the caller is asked to repeat the answer to the prompt.
If the confidence score is between the LCT and the HCT, the caller is asked to confirm the response, e.g. “Did you say your number was 1234?”
If the confidence score is above the HCT, the utterance is accepted, and the dialog continues to the next state, assuming the recognizer was correct.
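The three regions can be sketched as a small decision function. This is a minimal illustration only: the function name `choose_action`, the 0-to-1 score scale, and the handling of scores that fall exactly on a threshold are assumptions, since the text above does not specify them.

```python
def choose_action(confidence: float, lct: float, hct: float) -> str:
    """Map a confidence score to one of the three regions.

    Assumption: a score exactly equal to the LCT is treated as "confirm"
    and a score exactly equal to the HCT as "accept"; the text does not
    say how boundary values are handled.
    """
    if lct > hct:
        raise ValueError("LCT must not exceed HCT")
    if confidence < lct:
        return "reject"   # caller is asked to repeat the answer
    if confidence < hct:
        return "confirm"  # caller is asked to confirm the response
    return "accept"       # dialog continues to the next state


# Example with hypothetical thresholds LCT = 0.3 and HCT = 0.7.
for score in (0.10, 0.50, 0.90):
    print(score, choose_action(score, lct=0.3, hct=0.7))
```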
Depending on how the confidence score thresholds are set, the following recognition outcomes can occur:
Correct acceptance (CA): the utterance was recognized correctly and accepted. This is generally considered the best outcome.
False acceptance (FA): the utterance was recognized incorrectly and accepted. This is generally the worst outcome.
Correct confirmation (CC): the utterance was recognized correctly and the caller was asked to confirm. The caller will typically say “yes” and the call will continue.
False confirmation (FC): the utterance was recognized incorrectly and the caller was asked to confirm. The caller will typically say “no” and will be asked to repeat the original response.
Rejection (R): the utterance was rejected and typically the caller will be asked to repeat the original response. One can further divide rejection into “correct” and “false” rejection, depending on whether or not rejection was the best action to take.
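The five outcomes follow from two inputs: whether the recognition was correct, and which region the confidence score falls in. The sketch below shows that mapping; the function name `classify_outcome` and the boundary handling are assumptions, and rejection is not split into correct and false rejection because the text gives no rule for deciding when rejection was the best action.

```python
def classify_outcome(correct: bool, confidence: float,
                     lct: float, hct: float) -> str:
    """Return CA, FA, CC, FC, or R for a single utterance.

    Assumption: scores equal to a threshold fall into the higher region.
    """
    if confidence < lct:
        return "R"                        # rejected regardless of correctness
    if confidence < hct:
        return "CC" if correct else "FC"  # caller is asked to confirm
    return "CA" if correct else "FA"      # accepted without confirmation


# Example with hypothetical thresholds LCT = 0.3 and HCT = 0.7.
print(classify_outcome(True, 0.9, 0.3, 0.7))   # correct and accepted
print(classify_outcome(False, 0.5, 0.3, 0.7))  # incorrect and confirmed
```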
Whether a recognition was correct, and therefore which of the above outcomes occurred, is determined by comparing the annotation of a human transcriber with the recognizer output, with some allowance for “filler words.” For example, if the caller says “account balances” or “account balances, please” and the recognizer returns “account balances,” this is deemed correct. FIG. 1 shows the relationship between recognition correctness, the confidence score thresholds, and the various recognition outcomes. Varying the thresholds will vary the relative occurrence of the five different outcomes. As the LCT is increased, there are more rejections and fewer confirmations, while as the HCT is increased, there are more confirmations and fewer acceptances. However, without an underlying idea of what makes the “best” application, it is unclear how to set these thresholds to achieve an optimal distribution of the various recognition outcomes.
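The effect of moving the thresholds can be illustrated by tallying outcomes over a set of annotated utterances. The sketch below is illustrative only: the filler-word list, the sample utterances and scores, and the names `is_correct` and `outcome_counts` are all hypothetical, and the simple word-level comparison stands in for whatever matching rule a real annotation process uses.

```python
from collections import Counter

# Hypothetical filler words that are ignored when comparing texts.
FILLER_WORDS = {"please", "um", "uh"}


def is_correct(transcription: str, recognized: str) -> bool:
    """Compare transcriber annotation and recognizer output, ignoring fillers."""
    def normalize(text: str) -> list:
        words = text.lower().replace(",", " ").split()
        return [w for w in words if w not in FILLER_WORDS]
    return normalize(transcription) == normalize(recognized)


def outcome_counts(samples, lct: float, hct: float) -> Counter:
    """Tally CA, FA, CC, FC, and R over (transcription, recognized, score) tuples."""
    counts = Counter()
    for transcription, recognized, confidence in samples:
        correct = is_correct(transcription, recognized)
        if confidence < lct:
            counts["R"] += 1
        elif confidence < hct:
            counts["CC" if correct else "FC"] += 1
        else:
            counts["CA" if correct else "FA"] += 1
    return counts


# Made-up sample data: (human transcription, recognizer output, confidence).
SAMPLES = [
    ("account balances, please", "account balances", 0.92),
    ("account balances", "account balances", 0.55),
    ("transfer funds", "account balances", 0.81),
    ("transfer funds", "account balances", 0.40),
    ("transfer funds please", "transfer funds", 0.25),
]

baseline = outcome_counts(SAMPLES, lct=0.3, hct=0.7)
higher_lct = outcome_counts(SAMPLES, lct=0.5, hct=0.7)
higher_hct = outcome_counts(SAMPLES, lct=0.3, hct=0.9)
print(baseline, higher_lct, higher_hct, sep="\n")
```

On this toy data, raising the LCT moves one utterance from confirmation to rejection, and raising the HCT moves one from acceptance to confirmation, which matches the trend described above.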