An increasing trend in communication technology is the combination of different communication modalities into a single multi-modal communication system. One example is a live chat between a first person using text messaging (e.g., at a computer terminal) and a second person who prefers speaking (e.g., while driving a car). Text typed by the first person using a text input device is converted to audible speech by a text-to-speech (TTS) converter. The second person hears this speech on a speaker (e.g., the ear piece of a cellular telephone). The second person then speaks words or letters into a microphone (e.g., the mouthpiece of the cellular telephone). An automatic speech recognition (ASR) engine converts the spoken words to text, which is then displayed to the first person.
However, multi-modal communication is difficult to implement. For example, it is difficult for some TTS systems to convert written text to correct-sounding speech. This problem is especially prevalent when converting proper names and other words which are not in the vocabulary of the TTS conversion system. While some TTS systems can hypothesize how such a word may be pronounced, they frequently fail to approximate its proper pronunciation. Additionally, when attempting to pronounce foreign words, the TTS system may fail to account for cultural differences in pronouncing various letter combinations and in the accenting and enunciation of the word.
Currently, much of the research in the field of ASR is still directed toward improving the recognition of a single user's speech. One adaptation is directed toward compensating for environmental noise, which can degrade the effectiveness of the ASR system in recognizing the user's speech. Other research in the field of ASR is directed toward recognizing the speech of non-native speakers of a language, to improve the probability that their speech is recognized correctly.
Another adaptation in ASR is to determine what subjects are being discussed and to access dictionaries appropriate to the subject matter. Typically, recognition of the user's speech is based upon predicting what the user is going to say. By accessing a dictionary which is specific to a particular subject matter, the ASR system can increase the probability value associated with each word in that dictionary. This increases the probability that, when the user speaks, the ASR system will accurately recognize the user's words. For example, if a user is speaking about accounting, the ASR system accesses a dictionary comprising words about accounting, banking, money, etc. The ASR system then increases the probability value associated with each word in this dictionary, as it is likely, based upon the user's prior behavior, that the user will continue speaking about financial matters. Thus, if the user speaks the word “tax,” the ASR system will be more likely to interpret the spoken word as “tax” rather than “tacks.”
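The dictionary-based adjustment described above can be illustrated with a minimal sketch. It is not taken from any particular ASR system: the names `BASE_PRIOR`, `FINANCE_DICTIONARY`, `boost_priors`, and `recognize`, the boost factor, and all probability values are assumptions chosen only to show how raising the probability of in-dictionary words can tip a decision between two homophones.

```python
from typing import Dict, Set

# Hypothetical baseline word probabilities (illustrative values, not measured data).
BASE_PRIOR: Dict[str, float] = {"tax": 0.4, "tacks": 0.6}

# Hypothetical subject-matter dictionary for financial topics.
FINANCE_DICTIONARY: Set[str] = {"tax", "account", "bank", "money", "audit"}


def boost_priors(priors: Dict[str, float], dictionary: Set[str],
                 boost: float = 3.0) -> Dict[str, float]:
    """Scale the prior of every word found in the dictionary, then renormalize."""
    scaled = {w: p * (boost if w in dictionary else 1.0) for w, p in priors.items()}
    total = sum(scaled.values())
    return {w: p / total for w, p in scaled.items()}


def recognize(acoustic_scores: Dict[str, float], priors: Dict[str, float]) -> str:
    """Return the candidate word with the highest acoustic score times prior."""
    return max(acoustic_scores, key=lambda w: acoustic_scores[w] * priors.get(w, 0.0))


# Homophones are acoustically indistinguishable, so both receive the same score.
acoustic = {"tax": 0.5, "tacks": 0.5}

without_topic = recognize(acoustic, BASE_PRIOR)
with_topic = recognize(acoustic, boost_priors(BASE_PRIOR, FINANCE_DICTIONARY))
print(without_topic, with_topic)
```

With these assumed values, the unadjusted priors favor “tacks,” while the boosted priors favor “tax.” A real system would apply such an adjustment inside its language model rather than to a two-word table.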
ASR systems are increasingly being used in commercial applications such as voice mail systems. Often, the ASR system is configured to utilize a carefully worded hierarchy of questions which present the user with a narrow set of options from which to choose. Because the ASR system “knows” the likely answers in advance due to the wording of the questions, it can increase the probabilities of words which it expects to hear in response to the question asked. However, these systems often require lengthy configuration and training prior to implementation to minimize the error rate in recognizing the speech of a variety of users. Thus, these systems are expensive to set up, and are not readily adaptable to situations in which a carefully worded hierarchy of questions cannot be implemented.
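The question-hierarchy approach can be sketched in a similar way. The following is a hypothetical illustration, not the configuration of any actual voice mail product: the `Prompt` class, the `recognize_answer` function, the menu wording, the boost factor, and the hypothesis scores are all assumptions. Each prompt lists the answers it expects, and the recognizer raises the weight of those answers before choosing a word.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Prompt:
    """One node in a hypothetical question hierarchy."""
    question: str
    # Each expected answer maps to the follow-up prompt (None ends the dialog).
    answers: Dict[str, Optional["Prompt"]] = field(default_factory=dict)


def recognize_answer(hypotheses: Dict[str, float], prompt: Prompt,
                     boost: float = 2.0) -> str:
    """Pick the best hypothesis after boosting words the prompt expects."""
    def weighted(word: str) -> float:
        return hypotheses[word] * (boost if word in prompt.answers else 1.0)
    return max(hypotheses, key=weighted)


# Illustrative two-level voice mail menu.
message_menu = Prompt("Say 'play' or 'delete'.", {"play": None, "delete": None})
main_menu = Prompt("Say 'messages' or 'greeting'.",
                   {"messages": message_menu, "greeting": None})

# Made-up recognizer scores: "pray" is slightly ahead of "play" acoustically.
hypotheses = {"pray": 0.50, "play": 0.45, "delay": 0.05}
choice = recognize_answer(hypotheses, message_menu)
print(choice)
```

In this sketch the boost lets the expected answer “play” overtake the acoustically stronger “pray.” The sketch also suggests why such systems are hard to adapt: every prompt and its expected answers must be authored in advance.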