A semantic categorizer accepts text phrases or sentences as input, analyzes them, and places each input in a specific category. In some cases, an input text phrase may be placed in more than one category, with a confidence score for each placement. Semantic categorization is a key component of most dialog systems. For example, Interactive Voice Response (IVR) systems must interpret a user's spoken response to a prompt in order to complete an action based on that response.
Currently, in fixed-grammar directed-dialog systems, semantic categorization is performed using a set of manually defined rules. A dialog developer pre-defines the utterances that the system should be capable of “understanding”. These pre-defined utterances are called “grammars”. Each pre-defined utterance is assigned to a semantic category, and that category is indicated by including a semantic tag with the grammar definition. Semantic categorization is therefore labor intensive, requiring significant manual effort to develop grammars and define semantic tags for each new application or prompt. Under existing approaches, dialogs are fairly restrictive, since they must always remain within the scope of the pre-defined responses.
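Such a fixed grammar can be thought of as an explicit table pairing each pre-defined utterance with its semantic tag; any utterance not enumerated in advance is simply not understood. The following is a minimal sketch of that idea; the utterances, tag names, and exact-match strategy are illustrative assumptions, not taken from any particular grammar format:

```python
# A hand-built fixed grammar: every utterance the system can "understand"
# must be enumerated in advance and paired with a semantic tag.
# Utterances and tags below are illustrative examples only.
GRAMMAR = {
    "check my balance": "bank_balance",
    "what is my balance": "bank_balance",
    "transfer funds": "funds_transfer",
    "move money between accounts": "funds_transfer",
}

def categorize(utterance):
    """Return the semantic tag for an exact grammar match, else None."""
    return GRAMMAR.get(utterance.strip().lower())

print(categorize("Check my balance"))    # a pre-defined utterance
print(categorize("how much do I have"))  # out of grammar: not understood
```

The sketch makes the restriction visible: a paraphrase as natural as "how much do I have" falls outside the grammar and yields no category at all.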
In open-ended (non-directed) applications, which use prompts such as “How may I help you?”, users speak utterances intended to select one of the tasks available in the application. Often these task choices are not pre-identified (directed) to the speaker, so a user can say almost anything in response to the prompt. Automatic speech recognizers (ASRs) use Statistical Language Models (SLMs) to transcribe the user's utterance into text. This transcribed text is then passed to a categorization engine to extract the semantic choice that the user is requesting. The above-identified patent application is directed to the automatic generation of SLMs, for example, for use with an ASR to generate text transcriptions of a user's utterance.
After a text transcription is available, the next task is to make that text understood by a machine. For example, if the user says, “I want my bank balance”, the ASR in the IVR would use the SLM created as described in the above-identified patent application to generate the text, “I want my bank balance”. That text then needs to be understood by the machine and mapped to a semantic category, “bank_balance”.
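The mapping from free text to a semantic category can be illustrated with a toy keyword-scoring categorizer. This is a hedged stand-in for exposition only, not the method of the present application; the category names and keyword sets are hypothetical:

```python
# Toy keyword-based categorizer: maps transcribed text to the semantic
# category whose keyword set it overlaps most. Categories and keywords
# are illustrative assumptions, not from any real application.
CATEGORY_KEYWORDS = {
    "bank_balance": {"balance", "much", "money", "have"},
    "funds_transfer": {"transfer", "move", "send"},
}

def categorize(transcription):
    words = set(transcription.lower().replace("?", "").split())
    scores = {cat: len(words & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(categorize("I want my bank balance"))  # -> bank_balance
```

A real categorization engine would of course use a far richer model than keyword overlap, but the input/output contract is the same: transcribed text in, semantic category out.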
By restricting the scope of a dialog to a specific domain such as “banking”, the accuracy and speed of generating the text transcription of spoken utterances is greatly improved. For this reason, many IVR applications assume that all user utterances will fall within the domain of that application. Utterances that have nothing to do with the application will not be transcribed accurately and will be assigned a low confidence score. For example, if a user calls a bank and says, “I want flight information to California,” an SLM system will transcribe that to some nonsensical sentence with a very low confidence level, because the question falls outside the domain of a banking application and the SLM cannot handle words outside its domain. The low confidence score indicates that the utterance is probably not transcribed correctly, and further clarification is required. Therefore, normally, the proper domain must be known by the user or selected as a starting point. In a typical application, the overall domain is known, since if the user is calling, for example, a bank, it would be a banking domain.
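The clarification behavior described above amounts to thresholding the recognizer's confidence score. The sketch below shows that control flow; the threshold value, function names, and recognizer outputs are illustrative assumptions:

```python
# Sketch of confidence-based rejection: an out-of-domain utterance is
# transcribed nonsensically with a low ASR confidence score, which
# triggers a clarification prompt instead of an action.
# The threshold and example scores are illustrative assumptions.
CONFIDENCE_THRESHOLD = 0.5

def handle_recognition(transcription, confidence):
    if confidence < CONFIDENCE_THRESHOLD:
        # Probably mis-transcribed (e.g., out-of-domain speech);
        # ask the caller to clarify rather than act on the transcription.
        return "Sorry, I didn't catch that. Could you rephrase?"
    return transcription

# In-domain utterance, recognized with high confidence:
print(handle_recognition("I want my bank balance", 0.92))
# Out-of-domain utterance, garbled with low confidence:
print(handle_recognition("I wand might bang ballots", 0.12))
```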
Within a specific domain there are a number of category sets, or available tasks, that can be performed by the application. There are many ways a user can invoke a task. A task can be requested by a command, “Tell me how much I have in my checking account,” or by a question, “How much money do I have in my account?” There are typically a large number of utterances that a user can use to invoke any specific task in an application.
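The two example requests above, one a command and one a question, must both resolve to the same balance-inquiry task. A small sketch using regular expressions shows several surface forms collapsing to one task; the patterns are illustrative assumptions, not an enumeration from any real application:

```python
import re

# Illustrative patterns covering a few surface forms -- commands and
# questions -- that all invoke the same balance-inquiry task.
BALANCE_PATTERNS = [
    r"\btell me how much\b.*\baccount\b",
    r"\bhow much (money )?do i have\b",
    r"\bbalance\b",
]

def is_balance_request(utterance):
    text = utterance.lower()
    return any(re.search(p, text) for p in BALANCE_PATTERNS)

print(is_balance_request("Tell me how much I have in my checking account"))
print(is_balance_request("How much money do I have in my account?"))
```

Enumerating patterns by hand scales poorly, which is precisely why an automatic categorizer that handles arbitrary phrasings is needed.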
The job of a semantic categorizer is to discover the specific task that a user is requesting, no matter how it is requested. This process is typically done in two steps, with the first step transcribing the user's utterance into text. An improved method for this transcription process is described in the above-identified application.
Once the user's utterance is successfully transcribed, the text transcription must be analyzed to determine the user's intentions. One aspect of this process is discussed in a paper published in 2005 in the AAAI SLU workshop entitled “Higher Level Phonetic and Linguistic Knowledge to Improve ASR Accuracy and its Relevance in Interactive Voice Response System,” which is incorporated by reference herein.