The discussion below is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.
A speech server can be utilized to combine Internet technologies, speech-processing services, and telephony capabilities into a single, integrated system. The server can enable companies to unify their Internet and telephony infrastructure, and extend existing or new applications for speech-enabled access from telephones, mobile phones, pocket PCs and smart phones.
Applications from a broad variety of industries can be speech-enabled using a speech server. For example, such applications include contact center self-service applications such as call routing and customer account/personal information access. Other contact center speech-enabled applications are possible, including travel reservations, financial and stock applications and customer relationship management. Additionally, information technology groups can benefit from speech-enabled applications in the areas of sales and field-service automation, e-commerce, auto-attendants, help desk password reset applications and speech-enabled network management, for example.
In speech recognition, a speech recognizer receives an acoustic signal input from a speech utterance and produces a recognition result. Several parameters are used in the recognition process. For example, a confidence classifier estimates how likely it is that the recognition result is correct. The confidence classifier typically assigns the result a confidence score between 0 and 1. In general, the higher the score, the more likely the result is correct. The score is compared to a threshold to determine one or more tasks to perform. Other parameters can include a structure of a speech application and grammars used for recognition.
In a simple dialog scenario, the speech application interacts with a user through a series of dialog turns to perform one or more transactions that the user requests. A transaction can be one or more tasks or actions that are performed by the speech application. The application does not use the absolute value of the confidence score directly. Instead, one or more confidence thresholds are usually employed. In one example, a confidence threshold pair is used: TH1 and TH2, where 0<TH1<TH2<1. For a recognition result, if its confidence score is higher than TH2, the application is confident the recognition result is correct and accepts the result directly. If the score is lower than TH1, the application treats the result as incorrect and rejects the result directly. If the score is between TH1 and TH2, the application confirms the result with the user. Complex speech applications include multiple grammars and multiple dialog turns to perform various tasks. Such applications can be viewed as a combination of simple applications wherein each application has one or more confidence thresholds.
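By way of illustration only, the threshold-pair decision described above can be sketched as follows. The names `Action` and `decide` and the example threshold values are hypothetical and do not appear in the discussion above. The handling of a score exactly equal to TH1 or TH2 is likewise an assumption, since the discussion above does not address that case.

```python
from enum import Enum


class Action(Enum):
    """Possible outcomes for a recognition result (hypothetical names)."""
    ACCEPT = "accept"
    CONFIRM = "confirm"
    REJECT = "reject"


def decide(score: float, th1: float, th2: float) -> Action:
    """Map a confidence score to an action using a threshold pair.

    Requires 0 < th1 < th2 < 1, as stated for TH1 and TH2.
    """
    if not 0.0 < th1 < th2 < 1.0:
        raise ValueError("thresholds must satisfy 0 < TH1 < TH2 < 1")
    if score > th2:
        # Score higher than TH2: accept the result directly.
        return Action.ACCEPT
    if score < th1:
        # Score lower than TH1: reject the result directly.
        return Action.REJECT
    # Score between TH1 and TH2: confirm with the user.
    # Assumption: a score equal to TH1 or TH2 also lands here.
    return Action.CONFIRM


if __name__ == "__main__":
    # Example thresholds chosen arbitrarily for illustration.
    for s in (0.15, 0.50, 0.92):
        print(s, decide(s, th1=0.3, th2=0.7).value)
```

A complex application with several grammars or dialog turns could, under the same assumptions, call `decide` with a different threshold pair for each turn.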
In a name-dialer application, a user may wish to connect to a person at an organization. For example, the application may ask the user “Who would you like to call?” and, based on the user's response, produce a recognition result for a name in a directory together with an associated confidence score. If the confidence score of the recognition result is higher than TH2, the result is treated as correct and the application transfers the call to a person associated with the name. If the score is lower than TH1, the result is likely to be incorrect and the application will ask for a name again or confirm the recognized name with the user. Other thresholds and scenarios can further be used.
Parameters for a speech application such as the thresholds, structure and grammars can be time-consuming and expensive to establish. Previously, confidence thresholds were set heuristically. Typically, expensive consultants need to spend large amounts of time establishing thresholds for applications after obtaining large amounts of training data. As a result, establishing confidence thresholds incurs a large expense.