1. Field of the Present Invention
The present invention relates generally to using information relating to a spoken communication or to other spoken communications to obtain a language model to transcribe the communication, and, in particular, to transcribing a spoken communication using a language model obtained based on information obtained about a different communication.
2. Background
Automatic Speech Recognition (“ASR”) systems convert speech into text. As used herein, the term “speech recognition” refers to the process of converting a speech (audio) signal to a sequence of words or a representation thereof (text). Speech recognition applications that have emerged over the last few years include voice dialing (e.g., “Call home”), call routing (e.g., “I would like to make a collect call”), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g., a radiology report), and content-based spoken audio searching (e.g., finding a podcast where particular words were spoken).
As their accuracy has improved, ASR systems have become commonplace in recent years. For example, ASR systems have found wide application in the customer service centers of companies, which rely on middleware and other contact center solutions to answer and route calls and thereby decrease costs for airlines, banks, etc. In order to accomplish this, companies such as IBM and Nuance create assets known as IVR (Interactive Voice Response) systems that answer the calls and then use an ASR system paired with text-to-speech software to decode what the caller is saying and communicate back to the caller. ASR may also be used in desktop applications to allow users to dictate instead of type, to transcribe medical reports dictated by doctors, to dictate emails and text messages from mobile devices, and to transcribe voicemails so that recipients may read them instead of listening to them.
In converting audio to text, ASR systems may employ models, such as an acoustic model and a language model. The acoustic model may be used to convert speech into a sequence of phonemes most likely spoken by a user. A language model may be used to find the words that most likely correspond to the phonemes. In some applications, the acoustic model and language model may be used together to transcribe speech.
One method of creating a language model is to compute probabilities that n-grams of words will occur in speech. For example, an n-gram language model may include a probability that a single word appears, that a pair of words appears, that a triple of words appears, and so forth. Other types of language models known to one of ordinary skill in the art include a structured language model and a maximum entropy language model.
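By way of illustration only, the n-gram probabilities described above can be estimated from counts over a training text. The following is a minimal sketch, assuming a whitespace tokenizer and unsmoothed maximum-likelihood estimates; the function names `train_ngram_counts` and `ngram_probability` and the example sentences are hypothetical, and a practical system would typically add smoothing for unseen n-grams.

```python
from collections import Counter


def train_ngram_counts(sentences, max_n=3):
    """Count every n-gram of length 1..max_n in the training sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()  # assumption: whitespace tokenization
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts


def ngram_probability(counts, ngram):
    """Maximum-likelihood estimate of P(last word | preceding words).

    For a single word this is its relative frequency among all unigrams.
    For longer n-grams the denominator is the total count of n-grams of
    the same length that share the same history.
    """
    ngram = tuple(ngram)
    history = ngram[:-1]
    denominator = sum(
        count
        for gram, count in counts.items()
        if len(gram) == len(ngram) and gram[:-1] == history
    )
    if denominator == 0:
        return 0.0  # unseen history; a real system would back off or smooth
    return counts[ngram] / denominator


# Hypothetical training data for illustration.
example_sentences = ["call home now", "call home", "call the office"]
example_counts = train_ngram_counts(example_sentences)
p_call = ngram_probability(example_counts, ["call"])
p_home_given_call = ngram_probability(example_counts, ["call", "home"])
p_now_given_call_home = ngram_probability(example_counts, ["call", "home", "now"])
```

In this sketch, `p_call` is the probability that a single word appears, while `p_home_given_call` and `p_now_given_call_home` are the conditional probabilities associated with a pair and a triple of words, respectively.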
Language models may be adapted to particular applications. For example, a language model for medical transcription may contain medical terms that may not appear in a language model for legal dictation. By adapting a language model to a particular application, the accuracy of the ASR system may be improved.
Language models may be combined to create a new language model. For example, a first language model might contain words that are most commonly used in everyday conversation, and a second language model might contain terms particular to a specific application, such as medical phrases. By combining the general language model with a language model specific to medical terms, one may create a new language model that is suitable for transcribing medical reports, which may contain both everyday language and medical phrases.
One method for combining language models is to interpolate between two existing language models. Another method for combining language models is to create an exponential language model (e.g., through maximum entropy) from existing language models to boost certain words or phrases.
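As one possible illustration of interpolation, two language models may be combined as a weighted sum of their probabilities. The sketch below assumes, for simplicity, that each model is a unigram model represented as a dictionary from words to probabilities; the function name `interpolate`, the weight of 0.5, and the example probabilities are hypothetical and chosen only for illustration.

```python
def interpolate(general_lm, domain_lm, weight):
    """Linearly interpolate two unigram language models.

    Each model maps a word to its probability. The result assigns
    weight * P_general(w) + (1 - weight) * P_domain(w) to every word
    that appears in either model.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    vocabulary = set(general_lm) | set(domain_lm)
    return {
        word: weight * general_lm.get(word, 0.0)
        + (1.0 - weight) * domain_lm.get(word, 0.0)
        for word in vocabulary
    }


# Hypothetical models: everyday language and medical terms.
general = {"the": 0.5, "patient": 0.1, "home": 0.4}
medical = {"the": 0.3, "patient": 0.4, "stenosis": 0.3}
combined = interpolate(general, medical, weight=0.5)
```

Because each input model sums to one and the weights sum to one, the combined model also sums to one, and it covers both the everyday words and the medical terms.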
With some ASR systems, the language of the speech may be unknown. In order to perform ASR when the language of the speech is unknown, a language identification (“LID”) system may be applied to estimate which language is being spoken. After the language has been determined, an ASR system may be applied that uses models for that language. A LID system may use a variety of techniques, including, for example, statistical models that classify the speech into categories. A LID system can be applied in several ways, for example, to determine whether speech is or is not a given language (e.g., English), to determine whether speech is one of two languages (e.g., English or Spanish), or to determine whether speech is one of a greater number of languages.
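The several ways of applying a LID system can be sketched as decision rules over per-language scores. The example below assumes that a statistical model has already produced a score for each language; the scores, the threshold of 0.5, and the names `identify_language`, `is_language`, and `asr_models` are hypothetical and do not describe any particular LID system.

```python
def identify_language(scores, candidates=None):
    """Return the highest-scoring language among the candidates.

    With two candidates this is a two-way decision; with more, it is a
    choice among a greater number of languages.
    """
    pool = scores if candidates is None else {lang: scores[lang] for lang in candidates}
    return max(pool, key=pool.get)


def is_language(scores, language, threshold=0.5):
    """Decide whether the speech is or is not the given language."""
    return scores.get(language, 0.0) >= threshold


# Hypothetical scores from a statistical LID model for one utterance.
lid_scores = {"en": 0.7, "es": 0.2, "fr": 0.1}

# Hypothetical registry of per-language ASR models.
asr_models = {"en": "english_models", "es": "spanish_models", "fr": "french_models"}

detected = identify_language(lid_scores)
selected_models = asr_models[detected]  # models used to transcribe the speech
```

Once a language has been selected, the ASR system would apply the models registered for that language, as in the `selected_models` lookup above.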