A speech interface (also referred to as a voice user interface) is a user interface that is operated by human voice. A speech interface can enable a human to interact with computers and other devices in order to initiate an automated service or process. For example, in an automotive environment, a speech interface is sometimes provided as a less distracting way by which a user can interact with a car head unit and/or surrounding mobile devices.
Most speech interfaces include a speech recognition engine that operates to recognize words spoken by a user. A speech recognition engine typically employs two models for performing this function: an acoustic model, which is a statistical representation of the individual sound units that make up words (e.g., phonemes), and a language model, which models the properties of a language and tries to predict the next word in a speech sequence. Speech recognition engines may be prone to misrecognition, which may include both the failure to recognize a word and the recognition of the wrong word. Such misrecognition may occur due to acoustical differences between individuals as well as the presence of noise in the usage environment. Such misrecognition may also occur due to poor coverage by the language model, such that the speech recognition engine does not expect what the user says. Misrecognition may further be due to the inability of the speech recognition engine to generate a correct acoustic representation of a word because the word is pronounced in an unusual way. By way of example, unusual pronunciations may be encountered when dealing with certain foreign contact names, media artist names, and media track names.
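By way of illustration only, the role of a language model in predicting the next word, and the effect of poor coverage, can be sketched with a simple bigram counter. The function names, the tiny training corpus, and the choice of a bigram model are assumptions made for this sketch; they are not taken from any particular speech recognition engine, which would typically use a far larger and more sophisticated model.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev_word, next_word in zip(words, words[1:]):
            counts[prev_word][next_word] += 1
    return counts

def predict_next(counts, prev_word):
    """Return the most frequent follower of prev_word, or None if unseen."""
    followers = counts.get(prev_word.lower())
    if not followers:
        # Poor coverage: the model has no expectation for this context.
        return None
    return followers.most_common(1)[0][0]

# Hypothetical training phrases, for illustration only.
corpus = [
    "call john smith",
    "call john at home",
    "call mary",
    "play some music",
]
model = train_bigram_model(corpus)
```

In this sketch, `predict_next(model, "call")` returns `"john"` because that word follows "call" most often in the training phrases, whereas `predict_next(model, "dial")` returns `None` because "dial" never appears in them. The latter case corresponds to the poor-coverage situation described above, in which the engine is not expecting what the user says.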
Conventional solutions for reducing the misrecognition rate of a speech recognition engine include both static and dynamic speaker adaptation. In accordance with static speaker adaptation, a user reads a number of preset phrases while in a normal usage environment, and a speaker adaptation algorithm is applied to extract the phonetic characteristics of the user and apply them to the acoustic model. In accordance with dynamic speaker adaptation, such acoustic model adaptation is performed “on the fly” as the user uses the speech interface, based on the acceptance and rejection behavior of the user. Both of these solutions can lead to an overall increase in recognition accuracy. However, even with this general increase in accuracy, the speech recognition engine may still have difficulty understanding certain words that have specific and unusual pronunciations, such as proper names.
One conventional speaker-dependent solution for ensuring a higher recognition rate for uniquely-pronounced words involves the use of voice tags. For example, voice tags have been employed to perform contact dialing on mobile phones. In particular, some mobile phones allow a user to generate a voice tag by creating multiple recordings of a spoken word or words and associating the recordings with a particular contact name and telephone number. Then, to dial the telephone number associated with the contact name, the user need only repeat the word(s) in the voice tag. Typically, such a solution allows for only a limited number of recordings due to the storage limitations of the mobile phone. Such a solution can also be burdensome for the user, since he or she will have to record a voice tag for each contact.
Another conventional solution for ensuring a higher recognition rate for uniquely-pronounced words involves the use of a custom lexicon. A custom lexicon provides an association between words and their phonetic sequences. At software build time, a developer can identify a list of words that he or she wishes to add to the custom lexicon, and provide the proper phonetic description for each of them. At software runtime, if the speech recognition engine determines that a phonetic description of a speech signal obtained from a user matches a phonetic description stored in the custom lexicon, then it will recognize the word associated with the matching phonetic description in the custom lexicon. Thus, by providing the custom pronunciations beforehand, the developer can ensure that the speech recognition engine will recognize certain uniquely-pronounced words. However, new uniquely-pronounced words will continue to become popular over time. While a developer can provide ongoing updates to the custom lexicon, this requires significant work by the development team to regularly identify important uniquely-pronounced words as they become popular.
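By way of illustration only, the build-time association and runtime lookup described above can be sketched as a mapping from phonetic sequences to words. The dictionary name, the function name, the ARPAbet-style phoneme symbols, and the example entries are all hypothetical and are only rough approximations of how such names might be transcribed. The exact-match lookup is likewise a simplifying assumption; an actual engine would be expected to score candidate matches rather than require identical sequences.

```python
# Hypothetical custom lexicon prepared at build time:
# each phonetic sequence (a tuple of phoneme symbols) maps to a word.
CUSTOM_LEXICON = {
    ("SH", "AA", "D", "EY"): "Sade",
    ("W", "AA", "K", "IY", "N"): "Joaquin",
}

def recognize(phonemes, lexicon=CUSTOM_LEXICON):
    """Return the word whose stored phonetic sequence matches, else None.

    `phonemes` stands in for the phonetic description that the engine
    derives from the user's speech signal at runtime.
    """
    return lexicon.get(tuple(phonemes))
```

In this sketch, `recognize(["SH", "AA", "D", "EY"])` returns `"Sade"`, while a sequence that was never added at build time, such as `["S", "EY", "D"]`, returns `None`. The second case reflects the limitation noted above: a word is recognized by this mechanism only if a developer anticipated it and supplied its pronunciation in advance.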
Another conventional solution for dealing with out-of-vocabulary words is mentioned in U.S. Pat. No. 7,676,365 to Hwang et al., entitled “Method and Apparatus for Constructing and Using Syllable-Like Unit Language Models.” As described in that reference, a speech signal is decoded into syllable-like units (SLUs) which include more contextual information than phonemes. As further described in that reference, to add a new word to a custom user lexicon, a user must manually type a text representation of the word into an edit box of a user interface, and then pronounce the word into a microphone. An update unit receives the text of the new word and an SLU feature vector associated therewith from a decoding process, and uses both to update the custom user lexicon. A significant drawback to this approach is the cumbersome training interface, which requires the user to manually type in each new word and then pronounce it. If the number of new words to be added is large, this process can become unduly burdensome for the user. Furthermore, where the speech interface is being used with certain embedded devices, an interface for typing in words may be difficult to use or simply unavailable. Finally, such training must be initiated by the user outside of the context of normal usage of the speech interface.