A conversation (dialogue) between two entities is a series of exchanges in which each participant listens to at least some part of what the other participant says and reacts by speaking or by performing some action. Creating speech applications, that is, computer applications that engage in such dialogues with people, is a complex task.
A speech application typically proceeds in accordance with a call flow that defines the dialogue between a user and the computer on which the speech application executes. The call flow of a speech application typically comprises a series of “states,” which correspond to different stages in the dialogue (e.g., an initial state, a get-identity-of-speaker state, a take-first-item-in-order state, etc.). Each of these states is typically associated with a “prompt,” which the speech application may use to prompt the user; a set of expected “responses,” which the speech application can expect from the user; and logic that processes the prompt given, the response received, and any other external data to perform an action or to move to another state.
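The state-prompt-response structure described above can be sketched in a few lines of code. This is a minimal illustration only; the class and function names are hypothetical and do not come from any particular speech platform.

```python
# Hypothetical sketch of a call-flow state: each state pairs a prompt
# with a set of expected responses and a transition function that maps
# a recognized response to the name of the next state.

class DialogueState:
    def __init__(self, name, prompt, expected_responses, transition):
        self.name = name
        self.prompt = prompt                          # what the application says
        self.expected_responses = expected_responses  # what it listens for
        self.transition = transition                  # response -> next state name

def next_state_name(state, response):
    """Return the next state for a recognized response; reprompt otherwise."""
    if response not in state.expected_responses:
        return state.name  # unrecognized: stay in the same state and reprompt
    return state.transition(response)

get_identity = DialogueState(
    name="get-identity-of-speaker",
    prompt="Who am I speaking with?",
    expected_responses={"a customer", "an employee"},
    transition=lambda response: "take-first-item-in-order",
)
```

In a real call flow, the transition function would also consult external data (e.g., a customer database) before choosing the next state.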
A speech application must be able to detect an utterance (i.e., a “response”) spoken by a user and convert it into some non-audio representation, for example, into text. A speech application typically relies on an automatic speech recognizer (ASR) to perform this task. Once the ASR determines what a speaker has said, the ASR itself, or in some cases another component such as a natural language interpreter, may receive the non-audio representation of the utterance and determine its meaning based on that utterance, the state of the conversation to that point, and any external factors that need to be considered.
ASRs are available commercially from a variety of different vendors. Examples of commercially available ASRs include the Nuance product commercially available from Nuance Communications, Inc., the SpeechPearl product commercially available from Philips Electronics N.V., and the OpenSpeech Recognizer commercially available from SpeechWorks International, Inc.
With advances in speech recognition technology and computing power, speech recognizers for use with many speech applications today are of the speaker independent, continuous speech variety. An ASR is “speaker independent” if it does not need to have heard the speaker's voice before in order to recognize the speaker's utterance. An ASR is “continuous” if it does not require the speaker to pause between words.
For anything but the most basic applications, a speech application cannot know for certain what the user will say or how the user will say it. A useful speech application should be constructed to be ready for all reasonable contingencies. In order to allow for complex responses, when the speech application “listens” via an ASR for one response from a set of responses, it does so using a “grammar” for those responses. That is, the speech application “loads” a particular grammar into the ASR for a given set of expected responses. This grammar specifies everything the ASR will listen for when it is listening for a given response.
As an example, a grammar for the expected reply to the prompt “What method of shipping would you like to use?” might be represented as ((“I want to use”| “I'd like to use”| “Please use”) (“Regular mail”| “Express shipping”|“Next day mail”)). Under this set of rules, the expected replies would be “I want to use regular mail”; “I'd like to use regular mail”; “Please use regular mail”; “Regular mail”; “I want to use express shipping”; “I'd like to use express shipping”; “Please use express shipping”; “Express shipping”; “I want to use next day mail”; “I'd like to use next day mail”; “Please use next day mail”; or “Next day mail”. This notation is referred to as Backus-Naur Form (BNF). Other formats can be used. One such format is the XML format promulgated by the W3C organization.
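The expansion of the example grammar into its twelve expected replies can be reproduced mechanically, which is essentially what an ASR's grammar compiler does. The following is an illustrative sketch only, using an empty string to represent the optional (absent) carrier phrase:

```python
from itertools import product

# Expand the example shipping grammar: an optional carrier phrase
# followed by a required shipping method.
carriers = ["I want to use", "I'd like to use", "Please use", ""]  # "" = bare method
methods = ["regular mail", "express shipping", "next day mail"]

def expand(prefixes, suffixes):
    """Enumerate every phrase the grammar accepts."""
    return [" ".join(part for part in (pre, suf) if part)
            for pre, suf in product(prefixes, suffixes)]

replies = expand(carriers, methods)
# 4 carrier options (including the empty one) x 3 methods = 12 expected replies
```

Even this tiny grammar yields a dozen surface phrases, which is why grammar size grows quickly as alternatives are added.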
Commercially available ASRs typically have their own grammar file format specified by the vendor. A speech application developer is required to adhere to the grammar format of the specific ASR being used. Development tools are available to aid a developer in generating the necessary grammars for a given speech application. One such tool is the Natural Language Speech Assistant (NLSA) developed by Unisys Corporation, assignee of the present invention. Further information concerning this tool is provided in U.S. Pat. No. 5,995,918, issued Nov. 30, 1999, entitled “System and Method for Creating a Language Grammar Using a Spreadsheet or Table Interface.”
There are written languages, for example Russian, Ukrainian, and Polish, where certain parts of speech reflect the gender (male or female) of the speaker. In Russian, Ukrainian and Polish, for example, this phenomenon occurs with past tense verbs. As shown in the following example, in the Russian language, male and female speakers will utter different words to express the same phrase.
English Phrase            Male would say      Female would say
“I opened the window”     Я открыл окно       Я открыла окно
“I completed the exam”    Я сдал экзамен      Я сдала экзамен
“I was afraid”            Я боялся            Я боялась
“I came”                  Я пришёл            Я пришла
“I lost it”               Я потерял           Я потеряла

Consequently, the designer of an ASR grammar to be used to recognize the speech of a Russian speaker may have to include representations of both the female and male versions of a given spoken phrase in order for the grammar to remain speaker independent, i.e., gender neutral.
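The consequence for grammar construction can be sketched as follows. This is an illustrative example only (the helper name is hypothetical), using transliterated Russian forms of “I opened the window”:

```python
# A speaker-independent (gender-neutral) grammar must accept both the
# male and female surface forms of each gendered phrase.
male_forms = ["ya otkryl okno"]      # male speaker: "I opened the window"
female_forms = ["ya otkryla okno"]   # female speaker: same meaning

def gender_neutral(male, female):
    """Union of male and female variants: what the grammar must list."""
    return sorted(set(male) | set(female))

alternatives = gender_neutral(male_forms, female_forms)
# The number of alternatives roughly doubles for every gendered phrase,
# which is the grammar-size problem this technique must contend with.
```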
For such languages, a gender-neutral grammar can become quite large, as it must be capable of handling both the male and female versions of various phrases. Unfortunately, the larger a grammar becomes, the less accurately an ASR will perform, as there are more opportunities for mistakes and misrecognitions. The speed of recognition is also affected when grammars become large. Consequently, there is a need for systems and methods for improving the speech recognition accuracy in, and overall dialogue design of, speech applications intended to be used with speakers whose written languages exhibit these kinds of gender-specific characteristics. The present invention addresses this need.