1. Field of the Invention
This invention relates generally to speech recognition systems, and more particularly to methods for selection of an acoustic model for use in a voice command platform.
2. Description of Related Art
Speech recognition is the process by which an acoustic signal received by a microphone is converted to a set of words or phonemes by a computer. These recognized words may then be used in a variety of computer software applications for purposes such as document preparation, data entry, and command and control. Speech recognition is also used in voice command platforms, which are computer systems that provide a user with the ability to access and use software applications over a telephone network, e.g., a wireless network.
Speech recognition is generally a difficult problem due to the wide variety of pronunciations, individual accents and speech characteristics of individual speakers. Speech recognition systems use acoustic models that model speech for a defined set of the population. Acoustic models are stored representations of word pronunciations that a speech recognition application uses to help identify words spoken by a user.
U.S. Pat. No. 6,577,999 issued Jun. 10, 2003, is directed to a method of automatically managing a plurality of acoustic models in a speech recognition application. Other prior art of interest includes U.S. Pat. No. 6,526,380, issued Feb. 25, 2003, which is directed to a speech recognition system having parallel large vocabulary recognition engines.
As taught in the '999 patent, there are several ways that acoustic models can be inserted into the vocabulary of a speech recognition application. For example, developers of speech recognition systems commonly provide an initial set of acoustic models or base forms for a basic vocabulary set and possibly for auxiliary vocabularies. In some cases, multiple acoustic models are provided for words with more than one pronunciation.
Since each particular user will tend to have their own style of speaking, it is important that the speech recognition system have the capability to recognize a particular user's pronunciation of certain spoken words. By permitting the user to update the acoustic models used for word recognition, it is possible to improve the overall accuracy of the speech recognition process for that user and thereby permit greater efficiencies.
Conventional speech recognition products that allow additional acoustic models for alternative pronunciations of words typically require the system administrator to decide when such acoustic models are to be added to those already existing. Significantly, however, this tends to be an extremely difficult decision for system administrators to make since system administrators often do not understand the basis upon which such a decision is to be made. Moreover, the task of managing multiple sets of acoustic models to account for variations in pronunciation can be a problem in a speech recognition application. For example, it is not desirable to maintain and store in memory large numbers of alternative acoustic models that do not truly reflect a user's word pronunciations. Also, acoustic models that are inappropriate for a particular user's pronunciations can cause repeated undesirable errors in otherwise unrelated words in the speech recognition process.
The speech recognition engine in a voice command platform typically uses “grammars” and phoneme dictionaries, in addition to an acoustic model. The term “grammars” refers to a set of words or utterances that a voice command application will specify at a given state in the application, for example in response to a prompt. The speech recognition engine will typically include or have access to a dictionary database of “phonemes,” which are small units of speech that distinguish one utterance from another.
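By way of illustration only, the relationship between a grammar and a phoneme dictionary may be sketched as follows. The names Grammar, PHONEME_DICTIONARY and phonemes_for, the state name "confirm_call" and the phoneme symbols are hypothetical and are chosen for this sketch; they do not describe any particular speech recognition engine.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grammar:
    """Hypothetical grammar: the utterances an application accepts at one state."""
    state: str
    utterances: frozenset

    def accepts(self, utterance: str) -> bool:
        # Normalize case and surrounding whitespace before checking membership.
        return utterance.strip().lower() in self.utterances

# Hypothetical phoneme dictionary mapping each word to a phoneme sequence.
# The symbols are illustrative only.
PHONEME_DICTIONARY = {
    "yes": ("Y", "EH", "S"),
    "no": ("N", "OW"),
}

def phonemes_for(grammar: Grammar) -> dict:
    """Return the phoneme sequences for the grammar's utterances found in the dictionary."""
    return {
        word: PHONEME_DICTIONARY[word]
        for word in sorted(grammar.utterances)
        if word in PHONEME_DICTIONARY
    }

# A grammar the application might specify in response to a yes/no prompt.
confirm = Grammar(state="confirm_call", utterances=frozenset({"yes", "no"}))
```

In this sketch the grammar limits which utterances are valid at a given state, while the dictionary supplies the phonemes by which those utterances are distinguished from one another.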
Voice command platforms typically host a variety of voice applications, including voice activated dialing, call center applications and others. It is important that the applications are tuned for grammars, pronunciation dictionaries and acoustic models in order to optimize the user experience. In a voice command platform, multiple acoustic models may be made available. These acoustic models may be tuned for different segments of the population that have different speech inflection (Latino, Southern, etc.), or the acoustic models may be particularly tuned for types of voice responses that are expected for a particular application, e.g., numbers, names, sports or sports teams, cities, etc.
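As a minimal sketch, a set of acoustic models tuned along these two dimensions might be represented as a simple catalog. The names AcousticModel, CATALOG and find_models, and the example entries, are assumptions made for illustration and do not correspond to any vendor's interface.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class AcousticModel:
    """Hypothetical record describing what an acoustic model is tuned for."""
    name: str
    speech_segment: Optional[str] = None  # e.g. "southern"; None means not segment-specific
    response_type: Optional[str] = None   # e.g. "numbers"; None means not response-specific

# Hypothetical catalog of models available on one platform.
CATALOG = [
    AcousticModel("general"),
    AcousticModel("southern-general", speech_segment="southern"),
    AcousticModel("digits", response_type="numbers"),
]

def find_models(
    catalog: List[AcousticModel],
    speech_segment: Optional[str] = None,
    response_type: Optional[str] = None,
) -> List[AcousticModel]:
    """Return models matching every criterion that was supplied (None means no filter)."""
    return [
        model
        for model in catalog
        if (speech_segment is None or model.speech_segment == speech_segment)
        and (response_type is None or model.response_type == response_type)
    ]
```

For example, find_models(CATALOG, response_type="numbers") would return only the model tuned for numeric responses, and a request for a segment with no tuned model would return an empty list.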
Currently, in voice command applications, speech recognition engine vendors allow multiple acoustic models to exist within the same speech recognition engine. However, the vendors allow only their engine and the voice browser to control which applications get to choose between the acoustic models. They do not allow application developers themselves to specify particular acoustic models for their particular applications.
Co-pending patent application Ser. No. 09/964,140 filed Sep. 26, 2001, assigned to the same assignee as this patent application, describes a voice command system in which the platform includes enhanced system logic that enables an application to specify various voice processing mechanisms the platform should use during execution of the application. In particular, the application can specify which of multiple text to speech engines, voice prompt stores, and/or secondary phoneme dictionaries to use. The content of the application Ser. No. 09/964,140 is incorporated by reference herein. Application Ser. No. 09/964,140 is not admitted as prior art, in view of 35 U.S.C. § 103(c).