As is well known, a language model represents the language that an automatic speech recognition (ASR) system is intended to recognize or decode. One of the most popular types of language model is the probabilistic n-gram language model. An n-gram is a contiguous sequence of n items, e.g., words (although the items could alternatively be phonemes, syllables, letters or base pairs), from a given sequence of text or speech. An n-gram language model defines the probability that a word w_n follows a sequence of words w_1, w_2, ..., w_(n-1). However, the number of possible n-grams grows exponentially with n: a vocabulary of V words admits up to V^n distinct n-grams. Depending on the selected size of n and the number of words in the vocabulary of the given language, the number of n-grams that must be defined in the language model can therefore be prohibitive.
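By way of illustration only, the following Python sketch extracts n-grams from a short word sequence, estimates a conditional probability from relative counts, and computes the upper bound on the number of distinct n-grams. The sample sentence, the function names and the use of a maximum-likelihood estimate are assumptions made for this example and are not taken from any particular ASR system.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous n-item tuple in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical toy corpus, used only to make the example concrete.
tokens = "the cat sat on the mat".split()
bigrams = ngrams(tokens, 2)

# Relative-frequency (maximum-likelihood) estimate:
# P(w | h) = count(h, w) / count(h used as a history)
bigram_counts = Counter(bigrams)
history_counts = Counter(h for (h, _) in bigrams)

def mle_prob(history, word):
    """Estimate P(word | history) for a one-word history."""
    return bigram_counts[(history, word)] / history_counts[history]

def max_ngrams(vocab_size, n):
    """Upper bound on the number of distinct n-grams over a vocabulary."""
    return vocab_size ** n

p_cat_given_the = mle_prob("the", "cat")   # "the" is followed by "cat" once out of twice
trigram_bound = max_ngrams(10_000, 3)      # bound for a 10,000-word vocabulary
```

Under these assumptions, a vocabulary of only 10,000 words already admits up to 10^12 distinct trigrams, which shows why an exhaustive table of n-gram probabilities quickly becomes impractical.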
To address this issue, existing decoders in ASR systems utilize n-gram back-off language models in the decoding process. A back-off language model stores the conditional probability, P(w|h), explicitly for only a finite set of pairs of a word, w, and a history, h, and backs off to lower-order n-gram probabilities for all other pairs. In this way, an n-gram language model is represented efficiently using a more moderate number of n-grams.
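The back-off lookup can be sketched as follows. This is a minimal illustration under stated assumptions: the tables LOGPROB and BACKOFF, their values and the function name backoff_logprob are hypothetical, probabilities are held as base-10 logarithms, and the history is shortened by dropping its oldest word, which is the usual convention for back-off models.

```python
# Hypothetical toy model: log10 probabilities for the stored n-grams.
LOGPROB = {
    ("the",): -1.0,
    ("cat",): -2.0,
    ("mat",): -2.5,
    ("the", "cat"): -0.5,
}

# Hypothetical log10 back-off weights, keyed by history.
BACKOFF = {
    ("the",): -0.3,
}

def backoff_logprob(word, history):
    """Return log10 P(word | history), backing off to shorter histories."""
    history = tuple(history)
    ngram = history + (word,)
    if ngram in LOGPROB:
        # The (history, word) pair is stored explicitly.
        return LOGPROB[ngram]
    if not history:
        # Word is not in the model at all (no unknown-word handling here).
        return float("-inf")
    # Apply the back-off weight of the full history (log10 weight 0.0,
    # i.e. a factor of 1, if none is stored), then retry with the
    # oldest history word dropped.
    return BACKOFF.get(history, 0.0) + backoff_logprob(word, history[1:])

seen = backoff_logprob("cat", ["the"])     # stored bigram, used directly
unseen = backoff_logprob("mat", ["the"])   # backs off to the unigram "mat"
```

In this sketch the pair ("the", "mat") is not stored, so its score is the back-off weight of "the" added to the unigram log-probability of "mat"; only the pairs actually listed need to be held in memory.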
The so-called ARPA (Advanced Research Projects Agency) back-off format is commonly used to represent an n-gram back-off language model. In this format, each line contains an n-gram language model probability, the n-gram itself and the back-off weight corresponding to that n-gram.
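For illustration, a small file in the ARPA back-off format and a minimal reader for it might look as follows. The sample entries and the names ARPA_LINES and parse_arpa are hypothetical. The sketch assumes the customary layout in which a "\data\" header is followed by "\N-grams:" sections, each entry holds a base-10 log probability, the n-gram and an optional base-10 log back-off weight, and the file ends with "\end\".

```python
import re

# Hypothetical ARPA-format content, written as a list of lines.
ARPA_LINES = [
    "\\data\\",
    "ngram 1=3",
    "ngram 2=1",
    "",
    "\\1-grams:",
    "-1.0\tthe\t-0.3",
    "-2.0\tcat",
    "-2.5\tmat",
    "",
    "\\2-grams:",
    "-0.5\tthe cat",
    "",
    "\\end\\",
]

def parse_arpa(lines):
    """Read ARPA-style lines into (log-probability, back-off weight) tables."""
    logprob, backoff = {}, {}
    order = 0  # n of the section currently being read; 0 before any section
    for line in lines:
        line = line.strip()
        section = re.fullmatch(r"\\(\d+)-grams:", line)
        if section:
            order = int(section.group(1))
            continue
        # Skip blank lines, the \data\ and \end\ markers, and the count header.
        if not line or line.startswith("\\") or order == 0:
            continue
        fields = line.split()
        ngram = tuple(fields[1:1 + order])
        logprob[ngram] = float(fields[0])
        if len(fields) > 1 + order:
            # A trailing field, when present, is the back-off weight.
            backoff[ngram] = float(fields[1 + order])
    return logprob, backoff

logprob, backoff = parse_arpa(ARPA_LINES)
```

Here the unigram "the" carries a back-off weight because it serves as the history of a stored bigram, whereas the highest-order entries carry none.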
Note that named entities can also be used in place of words in language modeling. In this context, a named entity is defined as a sequence of words that names an entity such as a person (e.g., John Smith), an organization (e.g., United Nations) or a location (e.g., New York).