1. Technical Field
The present disclosure relates to generating language models (LMs) and more specifically to generating language models based on data gathered by crawling web pages.
2. Introduction
The world wide web is an invaluable data repository. The text data on the web can be harnessed for tasks as diverse as named entity recognition, word sense disambiguation, and machine translation in natural language processing; search and question answering in information retrieval; and pronunciation modeling and language modeling in speech recognition.
Text on the web is attractive for these applications for several reasons. Apart from the sheer size of the textual repository, web text is compelling because it is diverse and not limited to a particular domain. This diversity matters as language technologies begin to handle open domain input in tasks such as search, question answering, and language translation. Further, web text, such as that found on news websites, blogs, microblogs, and other sources, is dynamic and tracks current news and popular events. For these reasons, recent research has exploited the textual content of the web to create models for natural language tools, in particular language models.
Typically, language models are built on a training corpus of sentences with the assumption that the distribution of n-grams in the training set is the same as the distribution of n-grams in the task context where the language model will be used. This assumption, also called the independent and identically distributed (IID) assumption, is reasonable for tasks that are domain limited and where the target data does not change over time. However, in open domain applications such as question answering and broadcast news speech recognition, where the input to the models changes based on current events, the IID assumption results in a mismatch between the training and target contexts. This mismatch can be interpreted as holes or gaps in the training data. To address this issue, language models are typically transformed to match the target distributions using adaptation techniques.
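The mismatch described above can be illustrated with a minimal sketch. It assumes a bigram model with add-one smoothing and whitespace tokenization, neither of which is specified in this disclosure. The function names `train_bigram` and `perplexity` and the example sentences are hypothetical. Perplexity is lower when the test text follows the training distribution and higher when it does not.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count bigrams and their histories over whitespace-tokenized sentences."""
    histories, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.lower().split() + ["</s>"]
        histories.update(tokens[:-1])                  # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    # Words that can be predicted: training words, end marker, unknown-word marker.
    vocab = {w for s in sentences for w in s.lower().split()} | {"</s>", "<unk>"}
    return histories, bigrams, vocab

def perplexity(model, sentences):
    """Perplexity under add-one smoothing (an assumption made for illustration)."""
    histories, bigrams, vocab = model
    v = len(vocab)
    log_prob, n = 0.0, 0
    for s in sentences:
        words = [w if w in vocab else "<unk>" for w in s.lower().split()]
        tokens = ["<s>"] + words + ["</s>"]
        for prev, cur in zip(tokens[:-1], tokens[1:]):
            p = (bigrams[(prev, cur)] + 1) / (histories[prev] + v)
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / n)

# Hypothetical training data from a single, static domain.
model = train_bigram(["the market rose today", "the market fell today"])
matched = perplexity(model, ["the market rose today"])
mismatched = perplexity(model, ["the storm hit the coast"])
```

In this sketch, `mismatched` exceeds `matched` because the second sentence contains n-grams that never occurred in training, which is the kind of gap that adaptation techniques aim to close.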
One approach is focused crawling. Focused crawlers collect web pages on a well-defined topic. For instance, a focused crawler can look for web pages in domains such as astronomy, Linux, or cancer. Another focused crawler tries to locate web forms in domains such as airfare, hotels, or cars. The more pages or forms such a crawler collects in its target domain, the better its crawling policy is considered to be.
Another approach is language modeling. Language modeling can be applied to three particular problems: query spelling, query bracketing, and query segmentation. A language model built from anchor text is more similar to the queries (lower perplexity) than one built from the body of the page, and it obtained the best performance in almost all of the presented scenarios for these three tasks. One query-based method collects web data to build a language model for spoken dialog domains, used in combination with an in-domain language model created from dialogs in the domain. The queries are generated from utterances of dialogs, and the resulting pages are cleaned by selecting the sentences in these pages that are most similar to sentences in the domain. Experiments in the financial transaction domain showed a large reduction in the word error rate when the web language model was added.
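The sentence-selection step of the query-based method can be sketched as follows. The similarity measure used by that method is not stated here, so this sketch assumes cosine similarity over bag-of-words counts and a fixed threshold. The names `bow`, `cosine`, and `select_similar`, the threshold value, and the example sentences are all hypothetical.

```python
import math
from collections import Counter

def bow(sentence):
    """Bag-of-words counts for a whitespace-tokenized, lowercased sentence."""
    return Counter(sentence.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def select_similar(web_sentences, domain_sentences, threshold=0.3):
    """Keep web sentences whose best match against any in-domain sentence meets the threshold."""
    domain_vecs = [bow(s) for s in domain_sentences]
    kept = []
    for s in web_sentences:
        v = bow(s)
        best = max((cosine(v, d) for d in domain_vecs), default=0.0)
        if best >= threshold:
            kept.append(s)
    return kept

# Hypothetical in-domain dialog utterances and crawled web sentences.
domain = ["transfer money to my savings account", "what is my account balance"]
web = ["check your account balance online", "click here and subscribe for our newsletter"]
cleaned = select_similar(web, domain)
```

Here the first web sentence shares vocabulary with the in-domain utterances and is kept, while the second shares none and is discarded. A practical system would likely use a stronger similarity measure than raw word overlap.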
Still other methods focus more on the process of building the language model from web data and/or time-sensitive data. For instance, one method adapts the language model as chunks of data become available, as in the scenario of web crawling. Another method builds general-purpose language models by partitioning the data into mixture components, giving different weights to these components, and taking into account the recency of the words, with recent words having a higher probability of appearing in the future. However, each of these approaches has significant drawbacks and does not crawl web pages or generate language models in a sufficiently efficient manner.
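The idea of weighting mixture components by recency can be sketched as follows. This is an illustration only: it assumes unigram components, linear interpolation, and exponentially decaying weights with a hypothetical half-life, none of which are taken from the methods described above. The names `unigram_model`, `recency_weights`, and `mixture_prob` are likewise hypothetical.

```python
from collections import Counter

def unigram_model(text):
    """Maximum-likelihood unigram probabilities for one partition of the data."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def recency_weights(ages_in_days, half_life_days=7.0):
    """Normalized weights that halve every `half_life_days` (an assumed decay schedule)."""
    raw = [0.5 ** (age / half_life_days) for age in ages_in_days]
    z = sum(raw)
    return [r / z for r in raw]

def mixture_prob(word, components, weights):
    """Linearly interpolated probability of `word` across the mixture components."""
    return sum(w * m.get(word, 0.0) for m, w in zip(components, weights))

# Hypothetical partitions: one crawled 30 days ago, one crawled today.
components = [unigram_model("election results election"),
              unigram_model("storm warning storm")]
weights = recency_weights([30, 0])
p_recent = mixture_prob("storm", components, weights)
p_older = mixture_prob("election", components, weights)
```

Because the newer partition receives the larger weight, `p_recent` is greater than `p_older` even though both words have the same relative frequency within their own partitions.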