The present invention relates generally to language models used in automatic systems to represent word sequences and their probability of occurrence. More particularly, the present invention relates to a language model that includes augmented words that are augmented with lexical (for example, linguistic) information regarding the corresponding word.
Language models are employed in various automatic systems, such as speech recognition systems, handwriting recognition systems, spelling correction systems, and other word-oriented pattern recognition systems. A language model represents word sequences and the probability of each sequence occurring in a given context. Although the systems and methods of the present invention are applicable to any word-oriented pattern recognition problem, the invention will be discussed herein with respect to speech recognition, as that is a common application of language models.
Speech recognition systems employ models of typical acoustic patterns and of typical word patterns in order to determine a word-by-word transcript of a given acoustic utterance. The word-patterns used by a speech recognition system are collectively referred to as a language model. The acoustic patterns are referred to as an acoustic model.
Many current speech recognition systems use language models that are statistical in nature. Such language models are typically constructed using known techniques from a large amount of textual training data that is presented to a language model builder. An n-gram language model may use known statistical “smoothing” techniques for assigning probabilities to n-grams that were not seen in the construction/training process. In using these techniques, the language models estimate the probability that a word w_n will follow a sequence of words w_1, w_2, . . . , w_{n−1}. These probability values collectively form the n-gram language model.
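By way of illustration only, the following minimal sketch estimates bigram (n = 2) probabilities from a toy word sequence. Add-k smoothing is assumed here purely because it is simple; it is one of many known smoothing techniques and is not prescribed by this description. The function name `train_bigram` and the sample text are hypothetical.

```python
from collections import Counter

def train_bigram(tokens, k=1.0):
    """Return prob(word, prev) estimated with add-k smoothing (an assumed choice)."""
    vocab_size = len(set(tokens))
    bigram_counts = Counter(zip(tokens, tokens[1:]))  # counts of (prev, word) pairs
    context_counts = Counter(tokens[:-1])             # counts of each word used as a history

    def prob(word, prev):
        # Unseen bigrams still receive k / (count(prev) + k * V) > 0.
        return (bigram_counts[(prev, word)] + k) / (context_counts[prev] + k * vocab_size)

    return prob

tokens = "the wind blows and the wind howls".split()
p = train_bigram(tokens)
seen = p("wind", "the")     # bigram observed twice in the toy text
unseen = p("howls", "the")  # bigram never observed, yet nonzero after smoothing
```

In this sketch the smoothed probabilities for a given history still sum to one over the vocabulary, and a bigram absent from the training text receives a small nonzero probability.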
There are many known methods that can be used to estimate these probability values from a large text corpus presented to the language model builder, and the exact methods for computing these probabilities are not of importance to the present invention. Suffice it to say that the language model plays an important role in improving the accuracy and speed of the recognition process by allowing the recognizer to use information about the likelihood, grammatical permissibility, or meaningfulness of sequences of words in the language. In addition, language models that capture more information about the language lead to faster and more accurate speech recognition systems.
Current approaches to language modeling consider words to be equivalent to their orthographic (written) form. However, in many cases, the orthographic form is not sufficient for drawing distinctions that have an impact on the way the word is spoken. Often, the meaning of a word, including its syntactic role, determines its pronunciation. The pronunciations used in the following examples employ a phonetic notation known as the “ARPABET.” The numbers attached to vocalic phonemes indicate syllabic stress. A favorite example is the word “object”. The syntactic role (in this case, part of speech) for “object” can be noun or verb:
Accordingly, the pronunciation of the word depends on the syntactic role. In the case of the noun “object,” the stress is on the first syllable, and for the verb “object,” the stress is on the second syllable.
Another favorite example is the word “wind”. Again, the syntactic role (part of speech again here) determines the pronunciation:
A final favorite example is the word “read”. Here the syntactic role that affects pronunciation is the tense of the verb (present or past):
Words with different syntactic properties, such as those in the above examples, tend to appear in different contexts. Thus, statistical language models that do not distinguish between words with identical orthography but different senses or syntactic roles will model those words and their contexts poorly.
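The effect can be seen in a small, purely illustrative sketch. The toy sentences, the tag names (`NOUN`, `VERB`, and so on), and the helper `followers` are all hypothetical; the sketch simply counts which words follow “wind” when the word is keyed by orthography alone versus by orthography plus tag.

```python
from collections import Counter, defaultdict

# Toy pre-tagged sentences; tags are illustrative, not a prescribed tag set.
tagged = [
    [("the", "DET"), ("wind", "NOUN"), ("blows", "VERB")],
    [("the", "DET"), ("wind", "NOUN"), ("howls", "VERB")],
    [("please", "ADV"), ("wind", "VERB"), ("the", "DET"), ("clock", "NOUN")],
    [("please", "ADV"), ("wind", "VERB"), ("the", "DET"), ("watch", "NOUN")],
]

def followers(sentences, key):
    """Count the orthography of the next word, grouped by key(current word)."""
    table = defaultdict(Counter)
    for sent in sentences:
        for cur, nxt in zip(sent, sent[1:]):
            table[key(cur)][nxt[0]] += 1
    return table

plain = followers(tagged, key=lambda wt: wt[0])  # orthography only
augmented = followers(tagged, key=lambda wt: wt)  # orthography plus tag
```

With orthography alone, the contexts of the noun and the verb are pooled under a single entry for “wind”; with the tag attached, each sense keeps its own, sharper distribution over following words.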
Class-based language models deal with training data sparseness by first grouping words into classes and then using these classes as the basis for computing n-gram probabilities. Classes can be determined either by automatic clustering, or they can be domain-specific semantic categories or syntactic categories (e.g., parts of speech (POS)). Although the latter approach has the advantage of capturing some linguistic information in the language model, using syntactic classes in traditional formulations has a major drawback: the POS tags hide too much of the specific lexical information needed for predicting the next word.
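As an illustration of this drawback, one commonly used class-based bigram formulation can be written as follows. This particular factorization is given here as an assumption for illustration; the symbol c_i, denoting the class (for example, the POS tag) assigned to word w_i, is notation introduced for this sketch.

```latex
P(w_n \mid w_{n-1}) \approx P(w_n \mid c_n)\, P(c_n \mid c_{n-1})
```

In such a formulation the preceding word w_{n-1} influences the prediction only through its class c_{n-1}, so any two words sharing a class yield the same prediction for the next word.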
An alternative approach has been proposed in which part-of-speech (POS) tags are viewed as part of the output of the speech recognizer, rather than intermediate objects, as in class-based approaches. However, in this approach the words and tags are viewed as being produced by separate processes.
The present invention addresses these and other problems and offers other advantages over the prior art.
The present invention relates to a speech recognition system (or any other word-oriented pattern recognition system) that employs a language model that includes augmented words that are augmented with lexical information regarding the corresponding word.
One embodiment of the present invention is directed to a computer-readable medium having stored thereon a data structure that includes a first data field, optional previous-word data fields, and a probability data field. The first data field contains data representing a first word and includes an orthography subfield and a tag subfield. The orthography subfield contains data representing the orthographic representation (written form) of the word. The tag subfield contains data representing a tag that encodes lexical information regarding the word. Each of the previous-word data fields contains data representing a potentially preceding word and includes an orthography subfield and a tag subfield. The orthography subfield contains data representing the orthographic representation of the word. The tag subfield contains data representing a tag that encodes lexical information regarding the word. The probability data field contains data representing the probability of the first word and tag occurring (possibly after the optional preceding words and accompanying tags) in a word sequence, which may comprise a sentence or a conversational utterance.
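A minimal sketch of such a data structure is shown below, assuming an in-memory representation. The class names `AugmentedWord` and `NGramEntry`, the tag strings, and the probability value are hypothetical and chosen only for illustration; no particular storage layout is implied.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class AugmentedWord:
    orthography: str  # orthography subfield: the written form of the word
    tag: str          # tag subfield: encodes lexical information (e.g., part of speech)

@dataclass(frozen=True)
class NGramEntry:
    word: AugmentedWord                  # first data field
    previous: Tuple[AugmentedWord, ...]  # optional previous-word data fields (may be empty)
    probability: float                   # probability data field

entry = NGramEntry(
    word=AugmentedWord("object", "VERB"),
    previous=(AugmentedWord("I", "PRON"),),
    probability=0.02,  # illustrative value only, not derived from any corpus
)
```

Because the tag is part of each word's identity in this sketch, the noun “object” and the verb “object” are distinct keys even though their orthography is identical.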
Another embodiment of the present invention is directed to a method of building a language model. Pursuant to this embodiment, a training corpus comprising a body of text is received. Words in the training corpus are each augmented with a tag encoding lexical information regarding the corresponding word. A plurality of sequences of n augmented words are selected, n being a positive integer. Each selected sequence includes a sub-sequence made up of the first n−1 augmented words of the selected sequence. For each selected sequence of n augmented words, the method computes the probability that, given an occurrence of the sub-sequence in a block of text, the immediately following word will be the nth augmented word of the selected sequence.
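The building method can be sketched as follows, under stated assumptions: the corpus is taken to be already augmented (each word paired with a tag by some tagger that is outside this sketch), and probabilities are computed as simple relative frequencies without smoothing. The function name `build_model`, the tag names, and the toy corpus are hypothetical.

```python
from collections import Counter

def build_model(tagged_corpus, n):
    """Map each observed n-gram of (word, tag) pairs to the relative-frequency
    estimate of P(nth augmented word | first n-1 augmented words)."""
    ngram_counts = Counter()
    history_counts = Counter()
    for i in range(len(tagged_corpus) - n + 1):
        gram = tuple(tagged_corpus[i:i + n])
        ngram_counts[gram] += 1
        history_counts[gram[:-1]] += 1  # the sub-sequence of the first n-1 augmented words
    return {g: c / history_counts[g[:-1]] for g, c in ngram_counts.items()}

corpus = [
    ("I", "PRON"), ("object", "VERB"), ("to", "PREP"), ("the", "DET"),
    ("object", "NOUN"), ("I", "PRON"), ("object", "VERB"), ("strongly", "ADV"),
]
model = build_model(corpus, n=2)
p_to = model[(("object", "VERB"), ("to", "PREP"))]
```

In this toy corpus the verb “object” is followed once by “to” and once by “strongly”, so each continuation is estimated at one half, while the noun “object” keeps a separate history of its own.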
Another embodiment of the invention is directed to a method of automatically recognizing speech. Pursuant to this embodiment, a language model having a plurality of n-grams is provided. Each n-gram includes a sequence of n augmented words. Each augmented word includes a word and a tag encoding lexical information regarding the word. The language model further includes a probability indicator for each n-gram. Each probability indicator is indicative of a probability that, given an occurrence of the first n−1 words of the corresponding n-gram in a block of text, the immediately following word in the block of text will be the nth word of the n-gram. The speech recognition process hypothesizes many sequences of n−1 augmented words. The frontier (final n−1 augmented words) of each hypothesized sequence is compared to the first n−1 augmented words of a selected n-gram. If the frontier of the hypothesis matches the first n−1 augmented words of the selected n-gram, then the probability indicator corresponding to the selected n-gram is accessed to determine the probability that the nth augmented word of the selected n-gram is the augmented word immediately following the hypothesized sequence covering the incoming acoustic utterance.
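The frontier comparison can be illustrated with the following sketch. It assumes the language model is held as a dictionary mapping each n-gram (a tuple of (word, tag) pairs) to its probability indicator; the function name `next_word_probabilities`, the tag names, and the probability values are hypothetical. A linear scan over the n-grams is used only for clarity; a practical decoder would more likely index n-grams by their first n−1 augmented words.

```python
def next_word_probabilities(hypothesis, model, n):
    """Return {augmented word: probability} for every n-gram whose first n-1
    augmented words match the frontier of the hypothesized sequence."""
    # The frontier is the final n-1 augmented words (empty for unigrams).
    frontier = tuple(hypothesis[-(n - 1):]) if n > 1 else ()
    return {
        gram[-1]: prob
        for gram, prob in model.items()
        if len(gram) == n and gram[:-1] == frontier
    }

# Illustrative bigram model; the probabilities are made up for the example.
model = {
    (("I", "PRON"), ("object", "VERB")): 0.9,
    (("I", "PRON"), ("read", "VERB_PRES")): 0.1,
    (("the", "DET"), ("object", "NOUN")): 1.0,
}
hypothesis = [("now", "ADV"), ("I", "PRON")]
candidates = next_word_probabilities(hypothesis, model, n=2)
```

Here only the n-grams beginning with the augmented word (“I”, `PRON`) match the frontier, so the noun reading of “object” is not proposed as a continuation of this hypothesis.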
Another embodiment of the present invention is directed to a speech recognition system that includes a decoder and a computer-readable storage medium having data representing a language model stored thereon. The language model includes a plurality of n-grams, each made up of a sequence of n augmented words. Each augmented word includes a word and a tag encoding lexical information regarding the word. The language model further includes a probability indicator corresponding to each n-gram. Each probability indicator is indicative of a probability that, given an occurrence of the first n−1 augmented words of the corresponding n-gram in a block of text, the immediately following word in the block of text will be the nth augmented word of the n-gram. The decoder is adapted to hypothesize a sequence of words. The decoder is further adapted to access the storage medium and to compare the frontier (last n−1 augmented words) of the hypothesized sequence to the first n−1 augmented words of a selected n-gram. If the frontier of the hypothesized sequence matches the first n−1 augmented words of the selected n-gram, the probability indicator corresponding to the selected n-gram is accessed to determine the probability that the word immediately following the hypothesized sequence will be the nth augmented word of the selected n-gram.
These and various other features as well as advantages which characterize the present invention will be apparent upon reading of the following detailed description and review of the associated drawings.