Language models are employed in various automatic systems, such as speech recognition systems, handwriting recognition systems, spelling correction systems, and other word-oriented pattern recognition systems. A language model represents word sequences and the probability of each sequence occurring in a given context. Although the systems and methods of the present invention are applicable to any word-oriented pattern recognition problem, the invention will be discussed herein with respect to speech recognition, as that is a common application of language models.
Speech recognition systems employ models of typical acoustic patterns and of typical word patterns in order to determine a word-by-word transcript of a given acoustic utterance. The word patterns used by a speech recognition system are collectively referred to as a language model. The acoustic patterns are referred to as an acoustic model.
Many current speech recognition systems use language models that are statistical in nature. Such language models are typically constructed using known techniques from a large amount of textual training data that is presented to a language model builder. An n-gram language model may use known statistical “smoothing” techniques for assigning probabilities to n-grams that were not seen in the construction/training process. In using these techniques, the language models estimate the probability that a word w_n will follow a sequence of words w_1, w_2, . . . , w_{n-1}. These probability values collectively form the n-gram language model.
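By way of illustration only, the estimation described above can be sketched for the simplest case, n = 2. The sketch below assumes add-one (Laplace) smoothing, chosen here only because it is easy to follow; it is not a technique this document prescribes. The function names and the toy corpus are hypothetical.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Collect history counts, bigram counts, and the vocabulary."""
    history_counts = Counter()  # how often each word occurs as a history
    bigram_counts = Counter()   # how often each (previous, word) pair occurs
    vocab = set()
    for tokens in sentences:
        # Sentence boundary markers (an assumption of this sketch).
        padded = ["<s>"] + tokens + ["</s>"]
        vocab.update(padded)
        for prev, word in zip(padded, padded[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return history_counts, bigram_counts, vocab

def bigram_prob(prev, word, history_counts, bigram_counts, vocab):
    """Add-one smoothed estimate of P(word | prev)."""
    return (bigram_counts[(prev, word)] + 1) / (history_counts[prev] + len(vocab))

# Toy corpus invented for illustration.
corpus = [["i", "read", "books"], ["i", "read", "papers"]]
model = train_bigram_model(corpus)
seen = bigram_prob("i", "read", *model)     # bigram observed in training
unseen = bigram_prob("i", "books", *model)  # bigram never observed
```

In this sketch the unseen bigram still receives a nonzero probability, which is the purpose of smoothing, while the probabilities of all words following a given history continue to sum to one.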
There are many known methods that can be used to estimate these probability values from a large text corpus presented to the language model builder, and the exact methods for computing these probabilities are not of importance to the present invention. Suffice it to say that the language model plays an important role in improving the accuracy and speed of the recognition process by allowing the recognizer to use information about the likelihood, grammatical permissibility, or meaningfulness of sequences of words in the language. In addition, language models that capture more information about the language lead to faster and more accurate speech recognition systems.
Current approaches to language modeling consider words to be equivalent to their orthographic (written) form. However, in many cases, the orthographic form is not sufficient for drawing distinctions that have an impact on the way the word is spoken. Often, the meaning of a word, including its syntactic role, determines its pronunciation. The pronunciations used in the following examples employ a phonetic notation known as the “ARPABET.” The numbers attached to vocalic phonemes indicate syllabic stress. A favorite example is the word “object”. The syntactic role (in this case, part of speech) for “object” can be noun or verb:
OBJECT/N/AA1 B JH EH0 K T/
OBJECT/V/AH0 B JH EH1 K T/
Accordingly, the pronunciation of the word depends on the syntactic role. In the case of the noun “object,” the stress is on the first syllable, and for the verb “object,” the stress is on the second syllable.
Another favorite example is the word “wind”. Again, the syntactic role (here too, part of speech) determines the pronunciation:
WIND/N/W IH N D/
WIND/V/W AH IY N D/
A final favorite example is the word “read”. Here the syntactic role that affects pronunciation is the tense of the verb (present or past):
READ/V+PRES/R IY D/
READ/V+PAST/R EH D/
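The dependence of pronunciation on syntactic role in the examples above can be sketched as a lexicon keyed by both orthography and tag. This is an illustrative data structure only; the names LEXICON and pronunciations are hypothetical, and the entries are copied from the examples as given.

```python
# Pronunciation lexicon keyed by (orthographic form, syntactic tag).
LEXICON = {
    ("object", "N"): "AA1 B JH EH0 K T",
    ("object", "V"): "AH0 B JH EH1 K T",
    ("wind", "N"): "W IH N D",
    ("wind", "V"): "W AH IY N D",
    ("read", "V+PRES"): "R IY D",
    ("read", "V+PAST"): "R EH D",
}

def pronunciations(word, tag=None):
    """Return the pronunciations of a word, optionally restricted by tag."""
    word = word.lower()
    if tag is not None:
        entry = LEXICON.get((word, tag))
        return [entry] if entry is not None else []
    # Without a tag, every pronunciation of the spelling is a candidate.
    return [pron for (w, _), pron in LEXICON.items() if w == word]
```

With only the spelling, a lookup such as pronunciations("read") returns two candidates; supplying the tag, as in pronunciations("read", "V+PAST"), narrows the result to one.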
Words with different syntactic properties, such as those in the above examples, tend to appear in different contexts. Thus, statistical language models that do not distinguish between words with identical orthography but different senses or syntactic roles will model those words and their contexts poorly.
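The effect described above can be made concrete with a small sketch that counts the words immediately preceding “object”, first by orthography alone and then with the syntactic role kept distinct. The hand-tagged sentences, the tag names, and the function left_contexts are all invented for illustration and are not drawn from any real corpus.

```python
from collections import Counter, defaultdict

# Toy, hand-tagged sentences (hypothetical data).
TAGGED = [
    [("the", "DET"), ("object", "N"), ("fell", "V")],
    [("they", "PRON"), ("object", "V"), ("loudly", "ADV")],
    [("an", "DET"), ("object", "N"), ("moved", "V")],
    [("we", "PRON"), ("object", "V"), ("strongly", "ADV")],
]

def left_contexts(sentences, token_of):
    """Count, for each token, the tokens that immediately precede it."""
    contexts = defaultdict(Counter)
    for sent in sentences:
        tokens = [token_of(word, tag) for word, tag in sent]
        for prev, cur in zip(tokens, tokens[1:]):
            contexts[cur][prev] += 1
    return contexts

# Orthography only: noun and verb uses share one set of counts.
plain = left_contexts(TAGGED, lambda word, tag: word)
# Role kept distinct: each use gets its own set of counts.
by_role = left_contexts(TAGGED, lambda word, tag: f"{word}/{tag}")
```

In this toy data, the orthography-only counts for “object” mix determiners and pronouns, whereas the role-specific counts show that the noun follows only determiners and the verb follows only pronouns.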
Class-based language models deal with training data sparseness by first grouping words into classes and then using these classes as the basis for computing n-gram probabilities. Classes can be determined either by automatic clustering, or they can be domain-specific semantic categories or syntactic categories (e.g., parts of speech (POS)). Although the latter approach has the advantage of capturing some linguistic information in the language model, using syntactic classes in traditional formulations has a major drawback: the POS tags hide too much of the specific lexical information needed for predicting the next word.
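For illustration, one common class-based formulation approximates P(w_i | w_{i-1}) by P(c_i | c_{i-1}) * P(w_i | c_i), where c_i is the class of w_i. The sketch below assumes that formulation with unsmoothed relative-frequency estimates and POS tags as classes; the function names and toy data are hypothetical, and other class-based formulations exist.

```python
from collections import Counter

def train_class_bigram(tagged_sentences):
    """Collect the counts needed for P(c | c_prev) and P(word | c)."""
    class_history = Counter()  # class occurrences as a bigram history
    class_bigrams = Counter()  # (previous class, class) occurrences
    class_totals = Counter()   # all occurrences of each class
    word_in_class = Counter()  # (word, class) occurrences
    for sent in tagged_sentences:
        classes = [c for _, c in sent]
        for word, c in sent:
            class_totals[c] += 1
            word_in_class[(word, c)] += 1
        for prev_c, c in zip(classes, classes[1:]):
            class_history[prev_c] += 1
            class_bigrams[(prev_c, c)] += 1
    return class_history, class_bigrams, class_totals, word_in_class

def class_bigram_prob(word, c, prev_c, model):
    """Estimate P(c | prev_c) * P(word | c) by relative frequency."""
    class_history, class_bigrams, class_totals, word_in_class = model
    if class_history[prev_c] == 0 or class_totals[c] == 0:
        return 0.0
    p_class = class_bigrams[(prev_c, c)] / class_history[prev_c]
    p_word = word_in_class[(word, c)] / class_totals[c]
    return p_class * p_word

# Toy tagged corpus invented for illustration.
sentences = [
    [("the", "DET"), ("dog", "N"), ("runs", "V")],
    [("a", "DET"), ("cat", "N"), ("sleeps", "V")],
]
model = train_class_bigram(sentences)
p_cat = class_bigram_prob("cat", "N", "DET", model)
p_dog = class_bigram_prob("dog", "N", "DET", model)
```

The word pair “the cat” never occurs in the toy corpus, yet “cat” after a determiner receives a nonzero estimate, which is how classes mitigate sparseness. The drawback noted above is also visible: the estimate depends only on the class of the preceding word, so it is the same whether that word was “the” or “a”.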
An alternative approach has been proposed in which part-of-speech (POS) tags are viewed as part of the output of the speech recognizer, rather than intermediate objects, as in class-based approaches. However, in this approach the words and tags are viewed as being produced by separate processes.
The present invention addresses these and other problems and offers other advantages over the prior art.