A statistical language model (SLM) has many applications in natural language processing. Some examples of these applications are information retrieval, speech recognition, and natural language translation. A typical SLM assigns probabilities to N-grams. An N-gram is a sequence of N words, where N is some fixed number: e.g., a 3-gram (sometimes written “trigram”) is a sequence of three consecutive words. An SLM can use any value for N. In the example where the SLM uses 3-grams, the SLM assigns probabilities to specific sequences of three words.
The probabilities that the SLM assigns to each N-gram describe the likelihood that the N-gram will appear in some corpus of natural language material. For example, the phrase “motor vehicle department” is a trigram. It may be determined from an analysis of some large body of English-language text that 0.018% of all trigrams are the phrase “motor vehicle department.” In that case, an SLM may assign the probability 0.00018 to that trigram. What this probability implies is that, if one were to choose a random trigram from English text, there is a probability of 0.00018 that the randomly-selected trigram would be the phrase “motor vehicle department.”
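As a rough illustration of the relative-frequency estimate described above, the following minimal sketch counts trigrams in a small token list and divides each count by the total number of trigrams. The function name `trigram_probabilities` and the toy sentence are assumptions made for illustration only. An actual SLM would be built from a far larger corpus and would typically apply smoothing, which is omitted here.

```python
from collections import Counter


def trigram_probabilities(tokens):
    """Map each trigram in `tokens` to its relative frequency.

    Hypothetical helper for illustration; it applies no smoothing.
    """
    # Slide a window of three consecutive words over the token list.
    trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
    total = len(trigrams)
    if total == 0:
        return {}
    counts = Counter(trigrams)
    # Probability of a trigram = its count / total number of trigrams.
    return {gram: count / total for gram, count in counts.items()}


# Toy corpus (an assumption, not data from the text above).
tokens = "the motor vehicle department is near the motor vehicle depot".split()
probs = trigram_probabilities(tokens)
print(probs[("motor", "vehicle", "department")])
```

In this toy corpus there are eight trigrams in total, so a trigram that occurs once receives probability 1/8. Note that the sketch holds the whole token list in memory, which corresponds to the assumption that the entire corpus can be examined at once.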
SLMs are often built to model web documents. Such SLMs can be used in various search applications. However, two issues arise in building an SLM from the web. First, the volume of web documents is large. Building an SLM normally involves counting how many times each trigram appears in a corpus of documents, and calculating the ratio of each trigram's count to the total number of trigrams. This process assumes that one can examine the entire corpus at once. But due to the size of the web, it is infeasible to examine all web documents at once. Second, web content is constantly changing, so an SLM that is built from the web may quickly become obsolete.