The present invention relates generally to a system and method for producing an optimal language model for performing speech recognition.
Today's speech recognition technology enables a computer to transcribe spoken words into computer-recognized text equivalents. Speech recognition is the process of converting an acoustic signal, captured by a transducive element, such as a microphone or a telephone, to a set of text words in a document. These words can be used for numerous applications including data entry and word processing. The development of speech recognition technology is primarily focused on accurate speech recognition, which is a formidable task due to the wide variety of pronunciations, accents, and speech characteristics of native and non-native speakers of a particular language.
The key to speech recognition technology is the language model. A language model describes the type of text the dictator will speak about. For example, speech recognition technology designed for the medical profession will utilize different language models for different specialties in medicine. In this example, a language model is created by collecting text from doctors in each specialty area, such as radiology, oncology, etc. The type of text collected would include language and words associated with that practice, such as diagnoses and prescriptions. Most importantly, these language models may be developed for a regional or native language.
Today's state-of-the-art speech recognition tools utilize a factory (or out-of-the-box) language model, which is often customized to produce a site-specific language model. A site-specific language model might include, for example, the names of doctors or hospital departments of a specific site using speech recognition technology. Unfortunately, it has been found that many factory language models and site-specific language models do not adequately address the problem of accented speech by a group. An example of such a group would include United Kingdom physicians dictating in United States hospitals using speech recognition technology.
Accented speech presents especially challenging conditions for speech recognition technology, as the accented pronunciation of a language can result in misidentification and failed recognition of words. For example, a United Kingdom accented speaker or an Indian accented speaker in the United States will pronounce an English word, even after living in the United States for an extended period of time, dramatically differently from a United States speaker, so much so that a speech recognition engine using a United States language model will misidentify or fail to recognize the English word.
Previous efforts to solve this problem included acoustic adaptation during individual speaker enrollment and factory language models that were created with alternate pronunciations for some commonly used words for a particular application. These techniques are used to handle the pronunciation differences among varieties of speakers within the same region, such as southern accents and New York accents in the United States. Individual pronunciation idiosyncrasies that are subphonemic are typically addressed through speaker enrollment and adaptation of the acoustic model before the speaker starts using the speech recognition product. Some pervasive regional differences that are phonemic in nature are represented in the language model with alternative pronunciations for the same word. This situation applies to classical differences such as “You say ‘tuh-may-toh’ and I say ‘tuh-mah-toh’”.
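By way of illustration only, the representation of alternative pronunciations for the same word might be sketched as follows. The lexicon contents, the phoneme symbols, and the function names `build_reverse_index` and `lookup_words` are assumptions made for this example and are not taken from any particular speech recognition product.

```python
from collections import defaultdict

# Hypothetical lexicon: each word maps to one or more phoneme sequences.
# The phoneme symbols are illustrative only.
LEXICON = {
    "tomato": [
        ("T", "AH", "M", "EY", "T", "OW"),  # 'tuh-may-toh'
        ("T", "AH", "M", "AA", "T", "OW"),  # 'tuh-mah-toh'
    ],
    "potato": [
        ("P", "AH", "T", "EY", "T", "OW"),
    ],
}


def build_reverse_index(lexicon):
    """Map every phoneme sequence to the set of words it can spell."""
    index = defaultdict(set)
    for word, pronunciations in lexicon.items():
        for pronunciation in pronunciations:
            index[pronunciation].add(word)
    return index


def lookup_words(phonemes, index):
    """Return the words whose listed pronunciations match exactly."""
    return sorted(index.get(tuple(phonemes), ()))


INDEX = build_reverse_index(LEXICON)
# Both pronunciations resolve to the same spelling.
may_form = lookup_words(["T", "AH", "M", "EY", "T", "OW"], INDEX)
mah_form = lookup_words(["T", "AH", "M", "AA", "T", "OW"], INDEX)
```

In this sketch, either pronunciation yields the single text word "tomato", while a phoneme sequence that is not listed for any word yields no match.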
Unfortunately, these techniques are only successful in providing recognition of a limited number of alternative phonemic pronunciations and require substantial time to personalize the acoustic model to an individual. Using these techniques to control for the ubiquitous pronunciation differences between accented speech and native speech would become costly and time-consuming.
Another approach includes replacing the native acoustic models with distinct acoustic models for a class of speakers who share pronunciation features, and replacing native language models with dialect-specific language models. These distinct acoustic models and dialect-specific language models address the differences between US English and United Kingdom English; they can be developed for any language or dialect. Not only are the distinct acoustic models and the dialect-specific language models large and cumbersome, but they also exhibit other undesirable results when used to accommodate accented speech. For example, United Kingdom English acoustic models and language models have different spellings such as ‘colour’, ‘centre’, and ‘oesophagus’. Further, United Kingdom English employs different speech patterns and different vocabulary, such as different brand names for medical drugs.
Therefore, while speaker enrollment acoustic adaptation and alternate pronunciation factory language models can accommodate some level of accented speech, the expected recognition accuracy is significantly poorer than if distinct acoustic models and dialect-specific language models are used. Conversely, speech recognition using distinct acoustic models and dialect-specific language models may transcribe accented speech more accurately, but it also creates transcriptions which fail to conform to the native region's conventions of spelling, vocabulary and speech patterns. Furthermore, it is impractical and expensive to employ a completely different set of language models for a handful of individuals, such as a few United Kingdom physicians working in a US hospital.
Therefore, there exists a need for a speech recognition technology that automatically updates a factory or site-specific language model upon use by an accented speaker with words and pronunciations corresponding to the accented speech.
It may also be desirable to provide a speech recognition technology that allows language models for a particular language to be customized through the addition of alternate pronunciations that are specific to the accent of a dictator, for a subset of the words in the language model.
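By way of illustration only, such a customization might be sketched as follows, assuming the language model exposes its vocabulary as a mapping from each word to a list of phoneme sequences. The function name `add_accent_pronunciations`, the sample words, and the phoneme symbols are hypothetical, and skipping words absent from the model is a choice made for this example only.

```python
def add_accent_pronunciations(lexicon, accent_pronunciations):
    """Return a copy of the lexicon with accent-specific alternates appended.

    Spellings are left untouched; only pronunciations are added, and only
    for words already present in the lexicon (an assumption of this sketch).
    """
    updated = {word: list(prons) for word, prons in lexicon.items()}
    for word, alternates in accent_pronunciations.items():
        if word not in updated:
            continue  # this sketch does not add new vocabulary
        for pronunciation in alternates:
            if pronunciation not in updated[word]:
                updated[word].append(pronunciation)
    return updated


# Hypothetical factory lexicon with native-region spellings.
FACTORY_LEXICON = {
    "schedule": [("S", "K", "EH", "JH", "UH", "L")],
    "color": [("K", "AH", "L", "ER")],
}

# Hypothetical alternates for a subset of words, specific to one accent.
UK_ALTERNATES = {
    "schedule": [("SH", "EH", "JH", "UH", "L")],
}

CUSTOMIZED_LEXICON = add_accent_pronunciations(FACTORY_LEXICON, UK_ALTERNATES)
```

In this sketch the customized lexicon lists two pronunciations for "schedule" while "color" keeps its native-region spelling and single pronunciation, and the factory lexicon itself is left unmodified.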