1. Technical Field
The present invention relates generally to language models and, in particular, to correcting N-gram probabilities by page view information.
2. Description of the Related Art
An N-gram based language model is a construct/method for predicting the probabilities of sentences on the basis of the occurrence probabilities of N-word sequences, and is widely used in speech recognition, machine translation, and information retrieval. Since a large amount of training data is required to estimate the probabilities accurately, it is usual to crawl web sites to collect the training data. Each N-gram probability is calculated from the frequency of the corresponding event in the training data. Consequently, the amount of text available on a topic is sharply reflected in the N-gram probabilities for that topic.
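As an illustration of how N-gram probabilities follow from event frequencies, the following is a minimal sketch of maximum-likelihood bigram estimation (N=2). The function name bigram_probabilities, the sentence boundary markers, and the toy corpus are assumptions made for illustration only; practical language models typically add smoothing, which is omitted here.

```python
from collections import Counter


def bigram_probabilities(sentences):
    """Estimate P(w2 | w1) = count(w1, w2) / count(w1) from raw text.

    Hypothetical helper for illustration; no smoothing is applied.
    """
    history_counts = Counter()  # how often w1 occurs as a history word
    bigram_counts = Counter()   # how often the pair (w1, w2) occurs
    for sentence in sentences:
        # "<s>" and "</s>" are assumed sentence boundary markers.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            history_counts[w1] += 1
            bigram_counts[(w1, w2)] += 1
    return {
        (w1, w2): count / history_counts[w1]
        for (w1, w2), count in bigram_counts.items()
    }


# Toy corpus: two sentences about trains, one about games.
corpus = [
    "the train departs",
    "the train arrives",
    "the game starts",
]
probs = bigram_probabilities(corpus)
print(probs[("the", "train")])
print(probs[("the", "game")])
```

In this toy corpus, "train" follows "the" twice as often as "game" does, so its estimated probability is twice as large. The estimate depends only on how much text was collected on each topic, not on how often the topic is actually mentioned by users.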
Typically, the number of web sites and the total size of documents related to a topic are considered to be correlated with the frequency with which Internet users mention the topic. However, this is not always true. For example, sometimes a very small number of eager contributors write many articles on a topic. Taking WIKIPEDIA® as an example, the documents on specific topics (e.g., trains, games) or persons (e.g., entertainers) are significantly larger than others. However, those topics are not necessarily mentioned with high frequencies.
In “suggest functions” provided by search engines (e.g., GOOGLE®), candidate words and phrases are suggested on the basis of the frequencies of users' inputs. Probabilities derived from such frequencies should be almost optimal, since they reflect what users actually enter. However, such frequencies are not available to entities other than the search engine providers.