The present invention relates generally to a system and method for producing an optimal language model for performing speech recognition.
Today's speech recognition technology enables a computer to transcribe spoken words into computer-recognized text equivalents. Speech recognition is the process of converting an acoustic signal, captured by a transducive element such as a microphone or a telephone, into a set of text words in a document. This process can be used for numerous applications, including transcription, data entry, and word processing. The development of speech recognition technology is primarily focused on accurate speech recognition, which is a formidable task due to the wide variety of pronunciations, phrases, accents, and speech characteristics. In particular, previous attempts to transcribe phrases accurately have met with limited success.
The key to speech recognition technology is the language model. Today's state-of-the-art speech recognition tools utilize a factory (or out-of-the-box) language model, which is often customized to produce a site-specific language model. Users at a given site customize the factory language model by adding site-specific names and phrases. A site-specific language model might include, for example, the names of doctors, hospitals, or medical departments of a specific site using speech recognition technology. Unfortunately, factory language models include few names and phrases, and previous attempts to provide phrase customization did not produce customized language models that accurately recognize phrases during speech recognition.
Previous efforts to solve this problem involved customizing a language model by adding phrases and corresponding phrase pronunciations to the language model. The phrase pronunciations for the added phrase were created as a combination of pronunciations of the components or elements of the phrase. As such, a phrase to be added to the language model would be initially broken down into components. For each component, the language model would be searched for a matching component and corresponding pronunciation. If all components were found in the language model, the corresponding pronunciations for each component of the phrase would be concatenated to form pronunciations of the new multi-word phrase. The new phrase was then added, together with its corresponding pronunciations, to the language model.
If any components were not found in the language model, a background dictionary was searched for the components. Any component tokens still not found in either the language model or the background dictionary were sent to a pronunciation guesser module, where component pronunciations were guessed based on their orthography (spelling). Phrase pronunciations were then formed for that phrase by combining all pronunciations from the language model, background dictionary, or guesser module. The new phrase was then added, together with its corresponding pronunciations, to the language model.
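The prior approach described above can be illustrated with a minimal sketch. Everything here is hypothetical: the toy dictionaries, the function names, and the letter-by-letter guesser are stand-ins chosen for illustration, and pronunciations are written as plain words rather than phonemes. The sketch only shows the lookup order (language model, then background dictionary, then guesser) and the concatenation of component pronunciations.

```python
from itertools import product

# Hypothetical toy data; a real language model stores phonetic pronunciations.
LANGUAGE_MODEL = {
    "brigham": ["brigham"],
    "&": ["ampersand"],
    "hospital": ["hospital"],
}
BACKGROUND_DICTIONARY = {
    "women's": ["women's"],
}


def guess_pronunciation(component):
    """Stand-in for a pronunciation guesser: spell the token letter by letter."""
    return [" ".join(component)]


def component_pronunciations(component):
    """Look up a component in the language model, then the background
    dictionary, and fall back to the guesser if neither contains it."""
    key = component.lower()
    if key in LANGUAGE_MODEL:
        return LANGUAGE_MODEL[key]
    if key in BACKGROUND_DICTIONARY:
        return BACKGROUND_DICTIONARY[key]
    return guess_pronunciation(key)


def phrase_pronunciations(phrase):
    """Concatenate every combination of component pronunciations."""
    per_component = [component_pronunciations(c) for c in phrase.split()]
    return [" ".join(combo) for combo in product(*per_component)]


print(phrase_pronunciations("Brigham & Women's Hospital"))
```

With this toy data the only pronunciation generated for the phrase contains 'ampersand', because the components are looked up independently of the phrase they appear in.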
However, problems occur when a component is pronounced differently within a phrase than it is in isolation. For example, the ampersand sign is pronounced as ‘and’ in a phrase but as ‘ampersand’ in the language model. Some previous systems attempted to solve this problem by adding additional pronunciations to problematic words instead of adding phrase pronunciations. Unfortunately, if “&” in the language model is given an additional pronunciation of ‘and’, then when an ordinary phrase such as “bacon and eggs” is dictated, it may be transcribed with an ampersand instead of an “and”. Conversely, if “&” is not given an additional pronunciation of ‘and’, then when the phrase “Brigham & Women's Hospital” is added to the language model, it receives the pronunciation ‘Brigham ampersand women's hospital’. This is a problem because “Brigham & Women's Hospital” is actually pronounced ‘Brigham and women's hospital’.
Additional problems occur when elements of a dictated phrase are not pronounced, that is, are silent. Previous systems failed to transcribe any silent or unspoken element of a phrase. For instance, a slash appears in many written phrases but is silent when the phrase is spoken. For example, “OB/GYN” is pronounced ‘OBGYN’. However, under traditional systems, the slash would not be recognized or transcribed unless the dictator actually spoke ‘slash’, despite the fact that doctors and hospitals expect the transcribed text of a medical report to include the slash in “OB/GYN”.
A related problem with silent elements involves well-known formatting or terms of the trade that are shortened or abbreviated for convenience when spoken. For example, in the medical field the phrase “WISC (Revised)” is dictated as ‘WISC Revised’, without speaking the parentheses around ‘Revised’. Traditional systems would require that the phrase in the language model have a pronunciation including the parentheses, so the parentheses would have to be awkwardly dictated for the automatic transcription to include them.
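The trade-off with the ampersand can be illustrated with a small sketch. The two lexicons and the helper `written_forms` are hypothetical and exist only to show the ambiguity; pronunciations are again written as plain words rather than phonemes.

```python
def written_forms(spoken_word, lexicon):
    """Return every written token whose pronunciations include the spoken word."""
    return sorted(token for token, prons in lexicon.items() if spoken_word in prons)


# Hypothetical lexicons: one leaves "&" as is, the other adds 'and' to it.
without_extra = {"and": ["and"], "&": ["ampersand"]}
with_extra = {"and": ["and"], "&": ["ampersand", "and"]}

print(written_forms("and", without_extra))  # ['and']
print(written_forms("and", with_extra))     # ['&', 'and']
```

Once "&" carries the extra pronunciation, a spoken 'and' maps to two written forms, which is how a phrase such as "bacon and eggs" could come out with an ampersand.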
Additionally, traditional systems produced prohibitively large numbers of possible pronunciations for many phrases. This is the result of each phrase component having multiple pronunciations in the language model. When the pronunciations of the components are combined, the number of possible combinations is the product of the number of pronunciations of each component, so it grows rapidly with phrase length. Previous systems therefore added a huge number of pronunciations for a long phrase where one or two would be sufficient for automatic recognition.
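The growth in the number of phrase pronunciations can be checked with a short calculation. The figures below are invented for illustration (six components with three variant pronunciations each), and the function name is hypothetical.

```python
from itertools import product
from math import prod


def count_phrase_pronunciations(variants_per_component):
    """Number of concatenated pronunciations: the product of the variant counts."""
    return prod(len(variants) for variants in variants_per_component)


# Hypothetical phrase: six components, each with three variant pronunciations.
variants = [[f"w{i}_v{j}" for j in range(3)] for i in range(6)]

count = count_phrase_pronunciations(variants)
# Cross-check against an explicit enumeration of all combinations.
assert count == len(list(product(*variants)))
print(count)  # 729
```

Even with only three variants per component, a six-component phrase yields 729 candidate pronunciations, most of which no speaker would ever use.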
Previous systems also failed to identify context-based pronunciations in a phrase. For example, the phrases “St. Mulbery” and “Mulbery St.” both contain the component ‘St.’, but in the first phrase it refers to a saint and in the second to a street. A typical language model includes both ‘street’ and ‘saint’ pronunciations for the component ‘St.’. Therefore, in previous systems, when the phrase “St. Mulbery” was added to the language model, the system would inefficiently provide both the ‘saint Mulbery’ and ‘street Mulbery’ pronunciations, although only the first is appropriate.
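A brief sketch shows why context-free lookup yields the unwanted pronunciation. The lexicon and the function `naive_pronunciations` are hypothetical, and pronunciations are written as plain words.

```python
from itertools import product

# Hypothetical lexicon in which "St." carries both readings.
LEXICON = {"st.": ["saint", "street"], "mulbery": ["mulbery"]}


def naive_pronunciations(phrase):
    """Combine component pronunciations without regard to position or context."""
    parts = [LEXICON[component.lower()] for component in phrase.split()]
    return [" ".join(combo) for combo in product(*parts)]


print(naive_pronunciations("St. Mulbery"))  # ['saint mulbery', 'street mulbery']
print(naive_pronunciations("Mulbery St."))  # ['mulbery saint', 'mulbery street']
```

In both cases one of the two generated pronunciations is spurious, since the lookup has no way to tell a leading 'St.' from a trailing one.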
Therefore, there exists a need for a speech recognition technology that updates a language model with phrases that can be accurately recognized and transcribed.