1. Technical Field
This invention relates to the field of speech recognition software and more particularly to an improved method of adding vocabulary to a speech recognition system.
2. Description of the Related Art
Speech recognition is the process by which an acoustic signal received by a microphone is converted to a set of text words by a computer. These recognized words may then be used in a variety of computer software applications for purposes such as document preparation, data entry, and command and control. Improvements to speech dictation systems provide an important way to enhance user productivity.
Currently within the art, speech recognition systems possess a finite set of recognizable vocabulary words. These systems model and classify acoustic signals to form acoustic models, which are representations of basic linguistic units such as phonemes. From an acoustic analysis, speech recognition systems derive a list of potential word candidates for a given series of acoustic models. The potential word candidates are ordered from the most likely user intended word to the least likely. Next, the speech recognition system performs a contextual analysis between a language model, each potential word candidate, and the most recent words derived by the speech recognition system. The system may determine that although the first word candidate is the closest acoustic match to the user utterance, it does not fit the context of the text being dictated. The second word candidate, though not a perfect acoustic match to the user utterance, may more closely match the context of the text being dictated by the user. The system then makes a determination as to which word candidate is the correct user intended word.
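The selection step described above can be illustrated with a minimal sketch. It assumes that each candidate carries an acoustic score and that the contextual analysis multiplies that score by a trigram probability conditioned on the two most recent words. The function name pick_word, the scores, and the probabilities are hypothetical values chosen for illustration and are not taken from any particular speech recognition system.

```python
def pick_word(candidates, history, trigram_prob):
    """Return the candidate word with the best combined acoustic and context score.

    candidates   -- list of (word, acoustic_score), best acoustic match first
    history      -- words already recognized, most recent last
    trigram_prob -- dict mapping (w1, w2, w3) to a probability (hypothetical data)
    """
    w1, w2 = history[-2:]

    def combined(candidate):
        word, acoustic_score = candidate
        # Assumption: scores are combined by simple multiplication.
        return acoustic_score * trigram_prob.get((w1, w2, word), 0.0)

    return max(candidates, key=combined)[0]


# Hypothetical example: "whether" is the closer acoustic match,
# but "weather" fits the context "check the ..." far better.
candidates = [("whether", 0.6), ("weather", 0.4)]
history = ["check", "the"]
trigram_prob = {
    ("check", "the", "weather"): 0.05,
    ("check", "the", "whether"): 0.0001,
}
chosen = pick_word(candidates, history, trigram_prob)
```

In this sketch the second acoustic candidate is selected because its contextual score outweighs the small acoustic advantage of the first candidate.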
The language model used within the speech recognition system comprises statistical models. Such statistical models, or language model statistics, are one-, two-, and three-word groupings called unigrams, bigrams, and trigrams respectively, wherein each unigram, bigram, and trigram has an associated frequency. For example, trigrams can be formed by taking each word in a large corpus of text, called a training corpus, and constructing all possible three-word permutations. The system can observe the frequency of each trigram that appears in the training corpus. This observed frequency is a measure of trigram probability. Trigrams that do not appear in the training corpus result in a trigram probability of zero. Unigrams, bigrams, and trigrams that do appear in the training corpus can be assigned corresponding frequency values.
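The frequency data described above can be sketched as follows. This is a minimal illustration, assuming whitespace-separated words and a toy corpus invented for the example. Rather than enumerating every possible three-word permutation, it counts only the word sequences that actually occur in the corpus; any other trigram then has a frequency of zero, which matches the outcome described above. The names count_ngrams and trigram_frequency are hypothetical.

```python
from collections import Counter


def count_ngrams(words, n):
    """Count every run of n consecutive words in the list of words."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


# Toy training corpus (hypothetical text, for illustration only).
corpus = "the flight to Heathrow in England leaves from Heathrow in England".split()

unigrams = count_ngrams(corpus, 1)
bigrams = count_ngrams(corpus, 2)
trigrams = count_ngrams(corpus, 3)


def trigram_frequency(trigram):
    """Relative frequency of a trigram among all observed trigrams; 0.0 if unseen."""
    total = sum(trigrams.values())
    return trigrams[trigram] / total if total else 0.0
```

With this toy corpus, trigram_frequency(("Heathrow", "in", "England")) is nonzero, whereas trigram_frequency(("Laguardia", "in", "England")) is zero because "Laguardia" never appears in the corpus.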
For a user to add a word that has no language model statistics to a speech recognition system, the user can analyze another training corpus to develop unigrams, bigrams, trigrams, and frequency data for the word. This situation occurs when a word has been left out of the training corpus. The user must develop the needed language model statistics for the word before adding it to the speech recognition system vocabulary. Alternatively, the user can edit each document that will contain the word by manually inserting the word in the document. Although this process can function relatively well when editing a small file or a small number of files, the process is cumbersome for persons who build specialized speech recognition vocabularies for different topics such as medical, legal, and travel. Such users deal with thousands of files. Moreover, the files can be too large for conventional editors.
The disadvantage is further compounded when the word to be added to the system behaves in the same or similar manner as another word recognizable to the system. In this case, developing language model statistics wastes time because the resulting information will differ only slightly from the language model statistics corresponding to the recognizable word. For example, if a user wants to add the word "Laguardia" to reference the airport located in New York, the user must develop language model statistics for "Laguardia". In this case, rather than developing completely new statistical information, the language model statistics for "Laguardia" can be based upon existing language model statistics for the word "Heathrow" in reference to the airport located in London.
Currently, a method of adding new words to speech recognition systems utilizing class files exists in the art. Class files allow the user to generate a file of words with similar properties. An example of a class file is a list of airport names. After the class file is created, the speech recognition system removes each word of the class file from the language model, replacing it with a reference to the class file. For example, if a class file called "airport" contained "O'Hare", "Heathrow", and "Laguardia", the system would remove all occurrences of those specific airport names contained in the class file "airport" from the language model. Each occurrence of a member of the class file would be replaced with the reference "[airport]". As a result, the trigram "Heathrow in England" would be changed to "[airport] in England".
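The class file substitution described above can be sketched as follows. This is a minimal illustration, assuming the language model is held as a table of trigram counts; the function name apply_class_file and the counts are hypothetical and are not drawn from any existing system.

```python
from collections import Counter


def apply_class_file(trigram_counts, class_name, members):
    """Replace every class member in the trigram table with a class reference.

    Trigrams that become identical after substitution have their counts merged,
    so the identity of the individual member is lost.
    """
    tag = f"[{class_name}]"
    rewritten = Counter()
    for trigram, count in trigram_counts.items():
        replaced = tuple(tag if word in members else word for word in trigram)
        rewritten[replaced] += count
    return rewritten


# Class file "airport" and hypothetical trigram counts.
airport = {"O'Hare", "Heathrow", "Laguardia"}
counts = Counter({
    ("Heathrow", "in", "England"): 3,
    ("flew", "to", "Heathrow"): 4,
    ("flew", "to", "Laguardia"): 1,
})
rewritten = apply_class_file(counts, "airport", airport)
```

After the substitution, the table holds "[airport] in England" in place of "Heathrow in England", and the two "flew to ..." entries are merged into a single "flew to [airport]" entry.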
Although words can be added to the speech recognition system vocabulary in this manner, class files neither incorporate frequency data, nor ensure contextual accuracy. Consequently, although the context of a trigram may clearly indicate an airport in England, the airport "Laguardia" located in New York is as likely a candidate as "Heathrow" to the speech recognition system. The lack of word frequency data and the lack of a method of ensuring contextual accuracy within class files can result in nonsensical trigrams such as "Laguardia in England". The user has no way of avoiding such a nonsensical outcome and no way to check for contextual accuracy. As a result, there has arisen a need for a more efficient way to add new vocabulary words to speech recognition systems.
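The shortcoming described above can be made concrete with a short sketch. It assumes that a class reference such as "[airport]" may be filled by any member of the class file with no per-member frequency, so every expansion is treated as equally plausible. The function name expand_class is hypothetical.

```python
def expand_class(trigram, class_name, members):
    """List every trigram obtained by filling the class reference with a member.

    No frequency or context check is applied, mirroring the limitation of
    class files described above.
    """
    tag = f"[{class_name}]"
    return [
        tuple(member if word == tag else word for word in trigram)
        for member in sorted(members)
    ]


airport = {"O'Hare", "Heathrow", "Laguardia"}
expansions = expand_class(("[airport]", "in", "England"), "airport", airport)
```

The resulting list contains "Laguardia in England" and "O'Hare in England" alongside "Heathrow in England", and nothing in the class file distinguishes the sensible trigram from the nonsensical ones.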