Conventional language models are typically used in language processing applications such as speech recognition, machine translation, part-of-speech tagging, parsing, and information retrieval. A conventional language model can be used to estimate the likelihood that a particular word or sequence of words was uttered in speech, scanned from a document, and so on.
The term “class-based language model” refers to a family of language models that use word classes to improve their performance. Creating a class-based language model can include analyzing one or more documents to derive classes of words. Each class in a class-based language model represents a set of words that are commonly used in the same context.
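To make this concrete, the following is a minimal sketch of a class-based bigram model. The corpus, the class names (DAY, CITY, PREP), and the word-to-class mapping are all hypothetical, chosen only to illustrate how the word-level probability factors into a class transition probability times a word emission probability.

```python
from collections import defaultdict

# Hypothetical hand-assigned word classes; in practice these would be
# derived automatically from documents, as described above.
word_class = {
    "monday": "DAY", "tuesday": "DAY",
    "paris": "CITY", "london": "CITY",
    "in": "PREP", "on": "PREP",
}

# Toy training corpus.
corpus = [
    ["on", "monday"],
    ["on", "tuesday"],
    ["in", "paris"],
    ["in", "london"],
]

# Count class bigrams, class occurrences, and word emissions per class.
class_bigram = defaultdict(int)
class_count = defaultdict(int)
emit_count = defaultdict(int)
for sent in corpus:
    classes = [word_class[w] for w in sent]
    for i, (w, c) in enumerate(zip(sent, classes)):
        class_count[c] += 1
        emit_count[(c, w)] += 1
        if i > 0:
            class_bigram[(classes[i - 1], c)] += 1

def bigram_prob(prev_word, word):
    """P(word | prev_word) ~= P(class | prev_class) * P(word | class).

    class_count is used as the transition denominator, a slight
    approximation since it also counts sentence-final occurrences.
    """
    pc, c = word_class[prev_word], word_class[word]
    p_class = class_bigram[(pc, c)] / class_count[pc]
    p_word = emit_count[(c, word)] / class_count[c]
    return p_class * p_word
```

Note that `bigram_prob("in", "monday")` is nonzero even though the pair “in monday” never occurs in the corpus: because “monday” shares the DAY class with “tuesday”, the model generalizes across words in the same class.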
Conventional class-based language models can be used to determine the probability of a word sequence under test. For example, using a class-based language model, a probability can be derived for a sequence of J words by means of a probability distribution over classes and the words within them. Estimating the probability of different word sequences or sentences can be difficult because phrases or sentences can be arbitrarily long, and hence some sequences may never be observed during training of the language model. For this reason, class-based language models are often combined with the technique of smoothing, which assigns a small nonzero probability to a sequence that is not observed during training.
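The role of smoothing can be illustrated with a simple sketch using add-one (Laplace) smoothing, one common smoothing technique among several; the three-sentence corpus is hypothetical. An unsmoothed bigram estimate would assign probability zero to any word pair absent from training, whereas the smoothed estimate stays nonzero.

```python
from collections import Counter

# Toy corpus; bigrams such as "the ran" never occur in it.
corpus = ["the cat sat", "the dog sat", "the cat ran"]

bigrams = Counter()
unigrams = Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words)
    for a, b in zip(words, words[1:]):
        bigrams[(a, b)] += 1

V = len(set(unigrams))  # vocabulary size

def smoothed_prob(prev, word):
    """Add-one (Laplace) smoothed bigram probability:
    (count(prev, word) + 1) / (count(prev) + V)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)
```

With this corpus, `smoothed_prob("the", "ran")` evaluates to 1/8: the unseen bigram still receives probability mass, at the cost of slightly discounting the seen bigrams.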
In certain applications, such as speech recognition and data compression, a language model can be used to capture the properties of a language and to predict the next word or words in a sequence, as discussed above. When used in information retrieval, a language model is typically associated with each document in a collection.