In the past, speech recognition was concerned with identifying single commands for voice-operated devices. Over time, it has evolved to identify speech for many purposes, including natural language commands and word processing. However, even now, speech recognition systems are far from perfect.
In continuous speech recognition, language modeling has been shown to add valuable information to the recognition process, substantially increasing recognition accuracy. The purpose of language modeling is to capture the structure of a language, that is, how sentences are built out of words.
Among the several language modeling systems developed to date, stochastic language modeling based on N-grams has performed best in large vocabulary continuous speech recognition. N-grams are based on estimating the probability that a word follows a given history of words. Such probabilities are estimated from a training database that contains sufficient examples of words in different contexts and histories. It is understood that the longer the history considered, the better the context in which a word occurs will be modeled. However, the longer the history, the larger the number of learning examples required to capture all the possible word combinations that precede a given word.
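The estimation described above can be illustrated by relative-frequency counting over a training text. The following is only a minimal sketch under stated assumptions: the function name `trigram_mle`, the toy corpus, and the choice of a plain maximum-likelihood estimate are hypothetical and are not taken from any particular system.

```python
from collections import Counter

def trigram_mle(tokens):
    """Estimate P(w3 | w1, w2) by relative frequency (hypothetical helper).

    Only trigrams that actually occur in `tokens` receive a probability;
    every other trigram is simply absent from the result.
    """
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    # Count how often each two-word history is followed by some word.
    history_counts = Counter()
    for (w1, w2, _w3), count in trigram_counts.items():
        history_counts[(w1, w2)] += count
    return {
        (w1, w2, w3): count / history_counts[(w1, w2)]
        for (w1, w2, w3), count in trigram_counts.items()
    }

# Toy training text, assumed purely for illustration.
corpus = "the cat sat on the mat the cat ran on the mat".split()
probs = trigram_mle(corpus)
```

In this toy text the history "the cat" is followed once by "sat" and once by "ran", so each continuation is estimated at 0.5, while any continuation that never appeared has no estimate at all.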
This shortage of learning examples is a major obstacle for all language modeling systems and has heretofore prevented them from achieving even a portion of their full potential.
Third order N-grams, or sets of three words called “tri-grams”, are used in most large vocabulary continuous speech applications because they provide a good trade-off between history reach and the number of examples needed to represent such history.
Nevertheless, even for tri-grams it is still difficult to provide sufficient training examples. Many actual word histories are inevitably absent from the training data and are left unseen in the probability estimation. Those skilled in the art have tried, without complete success, to alleviate this data sparsity by using discounting and smoothing strategies that allow an estimate of the probabilities of unseen histories or events to be included in the language model.
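As an illustration of what such a strategy does, the sketch below uses add-one (Laplace) smoothing, one of the simplest members of this family. It is chosen here only for brevity; the text above does not say which discounting or smoothing strategies were used, and the name `make_add_one_model` and the toy corpus are hypothetical.

```python
from collections import Counter

def make_add_one_model(tokens, vocab):
    """Return a function giving add-one smoothed P(w3 | w1, w2).

    Hypothetical helper: every trigram count is incremented by one, so
    unseen trigrams and unseen histories still get non-zero probability.
    """
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    history_counts = Counter()
    for (w1, w2, _w3), count in trigram_counts.items():
        history_counts[(w1, w2)] += count
    vocab_size = len(vocab)

    def prob(w1, w2, w3):
        # Counter returns 0 for keys it has never seen.
        return (trigram_counts[(w1, w2, w3)] + 1) / (
            history_counts[(w1, w2)] + vocab_size
        )

    return prob

# Toy training text, assumed purely for illustration.
corpus = "the cat sat on the mat the cat ran on the mat".split()
vocab = sorted(set(corpus))
prob = make_add_one_model(corpus, vocab)
```

With this six-word vocabulary, the unseen tri-gram "the cat mat" receives probability 1/8 instead of zero, at the cost of lowering the estimates for the continuations that were actually observed.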
However, the existing data sparsity problem still continues to cause major problems. For example, given a vocabulary of 100 words, the total number of possible tri-grams is 100×100×100=1,000,000. This means that, in order to be able to model all the 3-word histories and contexts of 100 words, at least one million examples are needed. Further, the number of learning examples needed grows rapidly as words are added to the vocabulary (as the cube of the vocabulary size for tri-grams), even though not all words occur in all possible contexts and not all histories are valid. Limited training examples are only capable of adding information to a language model up to their capacity limit. Thereafter, statistical techniques for estimation from sparse data need to be used. This presents a major roadblock to providing adequate training to improve the language models and, as a result, the accuracy of speech recognition systems.
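The arithmetic behind the 100-word example can be checked directly. The sketch below is illustrative only: the function names `possible_ngrams` and `observed_fraction` and the toy corpus are hypothetical, and the count assumes every ordered combination of vocabulary words is a candidate N-gram.

```python
def possible_ngrams(vocab_size, order=3):
    """Number of distinct word sequences of length `order` (vocab_size ** order)."""
    return vocab_size ** order

def observed_fraction(tokens, vocab_size, order=3):
    """Fraction of all possible N-grams that appear at least once in `tokens`."""
    seen = set(zip(*(tokens[i:] for i in range(order))))
    return len(seen) / possible_ngrams(vocab_size, order)

# Toy training text, assumed purely for illustration.
corpus = "the cat sat on the mat the cat ran on the mat".split()
coverage = observed_fraction(corpus, vocab_size=len(set(corpus)))

for size in (100, 1_000, 10_000):
    print(size, possible_ngrams(size))
```

Raising the vocabulary from 100 to 1,000 words multiplies the number of possible tri-grams by a thousand, which is why even a large text database covers only a small fraction of them.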
For large vocabularies, the number of learning examples needed to train an N-gram model is enormous. It is usually next to impossible to find enough text examples to train all the N-grams, even in large, but limited, text databases. The text of the training database has to be directly related to the kind of N-grams that are to be learned. N-grams are commonly trained for a given environment or application, such as medical, legal, or sports. Hence, the training text needs to be related to that application in order to be effective.
Those having expert skill in the art have long struggled with this problem of data sparsity, and for just as long have been unsuccessful in achieving a definitive solution.