In the field of document or web search, a basic goal is to rank documents matching a query according to their computed relevancy to the query. A common technique for ranking documents is to use large sets of training data to train a statistical model offline, and then use the statistical model online to help determine the relevancy of documents to queries submitted to a search engine.
FIG. 1 shows an example of a search engine 100 using a trained model 102. The model 102 is trained offline by passing labeled training data 104 to a trainer 106. The labeled training data is typically a set of documents with respective labels entered to represent human analysis of the documents. For example, labels may range in value from 0 to 4 (or 0 to 1, or −1 to 1, etc.), to indicate the level of relevance that a human perceived a given document to have for a given query. Often the training will involve computing a vector of features for each document-query pair, and passing the vectors and their respective relevance labels to the trainer 106 (each vector representing the same set of feature types, but with values varying from vector to vector according to the content of the respective document-query pairs).
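As an illustrative sketch only, the pairing of feature vectors with relevance labels might look like the following Python. The three features and the names extract_features, LabeledExample, and build_training_set are hypothetical choices made for this example; the description above does not specify which features are computed.

```python
from dataclasses import dataclass


@dataclass
class LabeledExample:
    """One document-query pair: its feature vector and its human label."""
    query: str
    document: str
    features: list[float]
    label: int  # e.g., 0 (irrelevant) to 4 (highly relevant)


def extract_features(query: str, document: str) -> list[float]:
    """Compute a fixed-length feature vector (hypothetical features).

    Every vector has the same three feature types; only the values
    differ from one document-query pair to the next.
    """
    q_terms = set(query.lower().split())
    d_terms = document.lower().split()
    if not q_terms or not d_terms:
        return [0.0, 0.0, float(len(d_terms))]
    # Feature 0: query-term occurrences as a fraction of document length.
    density = sum(d_terms.count(t) for t in q_terms) / len(d_terms)
    # Feature 1: fraction of distinct query terms present in the document.
    coverage = sum(1 for t in q_terms if t in d_terms) / len(q_terms)
    # Feature 2: document length in terms.
    return [density, coverage, float(len(d_terms))]


def build_training_set(judgments: list[tuple[str, str, int]]) -> list[LabeledExample]:
    """Turn (query, document, label) judgments into vector-label pairings."""
    return [
        LabeledExample(q, d, extract_features(q, d), label)
        for q, d, label in judgments
    ]
```

A list produced by build_training_set corresponds to what would be handed to a trainer such as the trainer 106.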
The trainer 106 analyzes the labeled training data 104, perhaps in the form of many vector-label pairings, and builds a statistical model 102 that puts the collective knowledge of the training data 104 into a form that can be used to rank documents. There are many types of models that may be used, for example learning machines such as support vector machines and neural networks. In the example of FIG. 2, the model 102 is simply a linear model with weights w_i for the respective features. The model 102 is then used by the search engine 100, along with a document index 108, to find and rank documents 110 that match queries 112. The scores of the documents 110 are wholly or partly derived from the model 102. In some cases, multiple models may be combined. However, there are several problems created by this general offline training approach.
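Before turning to those problems, the offline/online split described above can be sketched in Python for the linear case. This is a minimal illustration under stated assumptions, not the trainer of FIG. 1: the function names, the squared-error objective, and the gradient-descent settings are all hypothetical, since the text does not say how the weights are fitted.

```python
def score(weights, features):
    """Linear model: the score is the weighted sum of the feature values."""
    return sum(w * x for w, x in zip(weights, features))


def train_linear_model(examples, epochs=500, learning_rate=0.05):
    """Offline step: fit one weight per feature from (features, label) pairs.

    Assumption: stochastic gradient descent on squared error, chosen
    only to keep the sketch short.
    """
    n = len(examples[0][0])
    weights = [0.0] * n
    for _ in range(epochs):
        for features, label in examples:
            error = score(weights, features) - label
            for i in range(n):
                weights[i] -= learning_rate * error * features[i]
    return weights


def rank(weights, candidates):
    """Online step: order (doc_id, features) candidates by descending score."""
    ordered = sorted(candidates, key=lambda c: score(weights, c[1]), reverse=True)
    return [doc_id for doc_id, _ in ordered]


# Usage: train once offline, then reuse the weights for each query.
training = [([1.0, 0.0], 4), ([0.0, 1.0], 0), ([0.5, 0.5], 2)]
weights = train_linear_model(training)
ranking = rank(weights, [("doc_a", [0.1, 0.9]), ("doc_b", [0.9, 0.1])])
```

Note that train_linear_model starts from zero weights and revisits every example, which mirrors the retraining cost discussed next: adding a few new labeled examples means running the whole fit again.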
Often, a model must be completely retrained when the training data is augmented or altered, even if most of the original training data is unchanged. Also, it may be difficult to derive a model that is tailored for special search domains, such as language-specific searches or searches for particular types of information (e.g., proper names).
Techniques to improve model training are discussed below.