1. Technical Field
The present invention relates to the field of speech recognition software and more particularly to a method of augmenting a language model for a speech recognition vocabulary.
2. Description of the Related Art
Speech recognition is the process by which acoustic signals, received via a microphone, are “recognized” and converted into words by a computer. These recognized words may then be used in a variety of computer software applications. For example, speech recognition may be used to input data, prepare documents and control the operation of software applications. Speech recognition systems programmed or trained to the diction and inflection of a single person can successfully recognize the vast majority of words spoken by that person.
In operation, speech recognition systems can model and classify acoustic signals to form acoustic models, which are representations of basic linguistic units referred to as phonemes. Upon receipt of the acoustic signal, the speech recognition system can analyze the acoustic signal, identify a series of acoustic models within the acoustic signal and derive a list of potential word candidates for the given series of acoustic models. Subsequently, the speech recognition system can contextually analyze the potential word candidates using a language model as a guide.
The task of the language model is to express restrictions imposed on the manner in which words can be combined to form sentences. The language model can express the likelihood of a word appearing immediately adjacent to another word or words. Language models used within speech recognition systems typically are statistical models. Examples of well-known language models suitable for use in speech recognition systems include uniform language models, finite state language models, grammar based language models, and m-gram language models. Statistically, in an m-gram language model, all word sequences are deemed possible. As a result, in an m-gram language model, the probability of a word having been uttered by a speaker can be based only upon the (m−1) immediate predecessor words. Typical m-gram language models can include the unigram (m=1), bigram (m=2) and trigram (m=3) language models.
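The dependence on only the (m−1) immediate predecessor words can be illustrated with a minimal sketch. It assumes a plain maximum-likelihood estimate taken from raw counts over a toy token list; the function name `mgram_probability` and the sample sentence are hypothetical and are not drawn from any particular speech recognition system.

```python
from collections import Counter

def mgram_probability(tokens, word, history, m):
    """Maximum-likelihood estimate of P(word | last m-1 words of history).

    Hypothetical helper for illustration only: no smoothing is applied.
    """
    # Only the (m-1) most recent words of the history are consulted.
    context = tuple(history[-(m - 1):]) if m > 1 else ()
    positions = range(len(tokens) - m + 1)
    ngram_counts = Counter(tuple(tokens[i:i + m]) for i in positions)
    context_counts = Counter(tuple(tokens[i:i + m - 1]) for i in positions)
    if context_counts[context] == 0:
        return 0.0
    return ngram_counts[context + (word,)] / context_counts[context]

# Toy corpus (an assumption made for this example).
tokens = "the flight to Heathrow left the gate and the flight landed".split()

bigram_p = mgram_probability(tokens, "flight", ["the"], m=2)
trigram_p = mgram_probability(tokens, "to", ["the", "flight"], m=3)
# Words earlier than the (m-1) predecessors do not change the estimate.
same_p = mgram_probability(tokens, "flight", ["gate", "and", "the"], m=2)
```

In this sketch a longer history passed to a bigram model yields the same estimate as the single preceding word, which is the restriction described above.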
Trigram language models are formed by constructing all possible three-word permutations for each word in a large corpus of text typically referred to as a training corpus. Subsequently, the frequency of each trigram appearing in the training corpus can be observed. Unigrams, bigrams, and trigrams appearing in the training corpus can be assigned the corresponding frequency values, appropriately discounted to leave some probability space for unseen bigrams and trigrams. The resulting collection of unigrams, bigrams and trigrams and their corresponding frequency values (language model statistics) forms the trigram language model.
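The counting and discounting steps can be sketched as follows. This is an illustration under stated assumptions rather than a description of any specific system: trigrams are taken to be consecutive three-word sequences in the corpus, the discount is a fixed absolute value of 0.5, and the names `count_ngrams` and `discounted_estimates` are hypothetical.

```python
from collections import Counter

def count_ngrams(tokens, max_order=3):
    """Count every unigram, bigram and trigram observed in the token list."""
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts

def discounted_estimates(counts, history, discount=0.5):
    """Return (probabilities of seen followers, mass reserved for unseen ones).

    Assumes simple absolute discounting: each seen count is reduced by
    `discount`, and the removed mass is set aside for unseen n-grams.
    """
    history = tuple(history)
    order = len(history) + 1
    followers = {g[-1]: c for g, c in counts[order].items() if g[:-1] == history}
    total = sum(followers.values())
    if total == 0:
        return {}, 1.0  # nothing observed: all mass is left for unseen words
    probs = {w: (c - discount) / total for w, c in followers.items()}
    reserved = discount * len(followers) / total
    return probs, reserved

# Toy training corpus (an assumption made for this example).
corpus = "flights to Heathrow in England and flights to Heathrow in London".split()
counts = count_ngrams(corpus)
probs, reserved = discounted_estimates(counts, ["Heathrow", "in"])
```

The seen probabilities and the reserved mass sum to one, so the space left for unseen bigrams and trigrams is explicit.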
After a speech recognition vocabulary with its associated language model statistics has been created, there will arise a need to add new words. A language model developer might need to add new words when refining the speech recognition vocabulary or when building an extension to the vocabulary. An end-user of a speech recognition system might need to add his or her own personal words to the vocabulary. Hence, the needed language model statistics must be generated for each additional new word prior to adding the additional words to the speech recognition system vocabulary. Conventionally, in order to add a new word lacking language model statistics to a speech recognition system, a new training corpus containing the additional words must be analyzed to develop unigrams, bigrams, trigrams, and frequency data for the additional words.
Alternatively, a language model developer might edit a speech-dictated document to include the additional words by manually inserting each additional new word in a context-relevant location of the speech-dictated document. Although this alternative approach can produce adequate results when editing a small file or a small number of files, the process can become cumbersome when developing specialized speech recognition vocabularies for specialized topics such as medicine, law and travel. Such specialized topics can require the modification of thousands of files. Moreover, those files typically exceed the maximum capacity of a conventional text editor.
It is sometimes possible to obtain language model statistics for a new word from contextually-related words or classes of words in the existing speech recognition vocabulary. For example, if the word “Midway”, a reference to the airport located in Chicago, Ill., is to be added to the speech recognition vocabulary, language model statistics must be developed for this additional new word. However, rather than developing completely new statistical information for the additional word, the language model statistics for “Midway” can be based upon existing language model statistics for the existing word “Heathrow” in reference to the airport located in London, England.
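One way to picture basing the statistics for “Midway” on those for “Heathrow” is shown below. The text does not specify how the existing statistics are reused, so this sketch assumes the simplest possibility: every n-gram containing the existing word is copied with the new word substituted. The function name `clone_statistics` and the counts are hypothetical.

```python
def clone_statistics(ngram_counts, existing_word, new_word):
    """Copy the counts of every n-gram containing existing_word to new_word.

    Assumption for illustration: statistics are reused by plain copying.
    """
    cloned = dict(ngram_counts)
    for ngram, count in ngram_counts.items():
        if existing_word in ngram:
            new_ngram = tuple(new_word if w == existing_word else w for w in ngram)
            cloned[new_ngram] = cloned.get(new_ngram, 0) + count
    return cloned

# Invented counts, used only to exercise the function.
existing_counts = {
    ("Heathrow",): 4,
    ("to", "Heathrow"): 3,
    ("Heathrow", "in", "England"): 2,
    ("in", "England"): 5,
}
augmented = clone_statistics(existing_counts, "Heathrow", "Midway")
```

Under this copying assumption the new word also inherits contexts specific to the existing word, such as the trigram ending in “in England”.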
Present methods of adding new words to a speech recognition system by an end-user include (1) correction in a speech-dictated document or (2) analysis of user-supplied sample documents. The language model statistics generated in these two methods are limited. Adding a new word during correction will only yield one sample context for the new word. The contextual coverage attained by adding new words from sample documents depends on the amount of text present in the user-supplied documents. The number of documents typically supplied for analysis tends to be small and, therefore, leads to very few sample contexts for the new words. Finally, users might well want to simply add new words to the vocabulary in isolation without any accompanying context, especially if that user is a specialist in a field for which there are no specific language models (or topics) to purchase to extend the vocabulary.
Present methods of adding additional words to speech recognition systems based upon existing language model statistics utilize class files. Class files allow a language model developer to generate a file containing words having similar contextual properties. An example of a class file includes a list of airport names. Once created, the class file itself can be referred to in the language model in lieu of each component word contained in the class file. For example, if the class file “airport.cls” contained as constituent components, “O'Hare”, “Heathrow”, and “Midway”, all instances of those specific airport names in the language model can be substituted with a generic reference to the class file “airport.cls”. As such, the trigram “Heathrow in England” would be modified to “[airport.cls] in England”.
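The substitution of a class reference for its member words can be sketched as below. The class members and the label “[airport.cls]” come from the example above; the function names and the counts are hypothetical, and the in-memory set merely stands in for whatever format an actual class file uses.

```python
from collections import Counter

# Members of the airport class, as listed in the example above.
AIRPORT_CLASS = {"O'Hare", "Heathrow", "Midway"}
CLASS_LABEL = "[airport.cls]"

def substitute_class(ngram, members=AIRPORT_CLASS, label=CLASS_LABEL):
    """Replace every class member in an n-gram with the class reference."""
    return tuple(label if word in members else word for word in ngram)

def collapse_counts(ngram_counts, members=AIRPORT_CLASS, label=CLASS_LABEL):
    """Merge the counts of n-grams that become identical after substitution."""
    collapsed = Counter()
    for ngram, count in ngram_counts.items():
        collapsed[substitute_class(ngram, members, label)] += count
    return dict(collapsed)

# Invented counts, used only to exercise the functions.
word_counts = {
    ("Heathrow", "in", "England"): 2,
    ("O'Hare", "in", "Chicago"): 3,
    ("Midway", "in", "Chicago"): 1,
}
class_counts = collapse_counts(word_counts)
```

After the collapse the language model holds one entry per class-level context, and the individual airport names no longer appear in the statistics.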
Developers of speech recognition vocabularies, developers of speech vocabulary extensions (e.g., specialized topics) and end-users can benefit from methods that use class files to generate statistics for new words. However, new words cannot be blindly added to classes because this will often lead to contextual inaccuracies. For example, if ‘Midway’ were added to the airport class, from the perspective of the language model, ‘Midway’ in combination with ‘in Chicago’ can remain as likely a word sequence as ‘Midway’ in combination with ‘in England’—an absurdity. Thus, there has arisen a need for a better way to ensure contextual accuracy when adding additional new vocabulary words to a speech recognition system.
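The contextual inaccuracy described above can be reproduced in a few lines. The sketch assumes that every class member shares the class-level trigram statistics unchanged; the counts are invented for illustration and the function name `continuation_probs` is hypothetical.

```python
CLASS_LABEL = "[airport.cls]"
AIRPORTS = {"O'Hare", "Heathrow", "Midway"}

# Invented class-level trigram counts, for illustration only.
class_trigram_counts = {
    (CLASS_LABEL, "in", "England"): 3,
    (CLASS_LABEL, "in", "Chicago"): 3,
}

def continuation_probs(word, members, class_counts, label=CLASS_LABEL):
    """Relative frequency of each two-word continuation following `word`.

    A class member is looked up through the class label, so all members
    receive exactly the same distribution.
    """
    key = label if word in members else word
    matching = {g[1:]: c for g, c in class_counts.items() if g[0] == key}
    total = sum(matching.values())
    return {cont: c / total for cont, c in matching.items()} if total else {}

midway = continuation_probs("Midway", AIRPORTS, class_trigram_counts)
heathrow = continuation_probs("Heathrow", AIRPORTS, class_trigram_counts)
```

Because the lookup goes through the class label, “Midway” receives the same likelihood for “in England” as for “in Chicago”, and the same distribution as “Heathrow”.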