Search engines are now commonplace in software applications. They are used to find text strings in applications such as word processors, to find help topics in sophisticated software as varied as spreadsheets and operating systems, and to find uniform resource locators (URLs), references and other documents in web-based search engines. The effectiveness of a search may be judged, in the abstract, by whether the top few returned documents are the documents the user actually sought. The returned list should preferably be sorted by relevance to the user, given the search terms present in the user query and possibly the state associated with the user query. Such an ordering makes it easier for a user to select the document that he or she believes has the greatest relevance to the search.
A search engine generally returns a list of documents that are related to the search terms. Because sets of documents can be extremely large, and because any one search engine may have access to multiple document sets, a single search can retrieve a very large number of documents. Ranking the documents according to relevance criteria is one way to help the user find the preferred document(s).
Recently, search engines have been augmented with machine learning classifiers that help return documents of high relevance. Such classifiers are generally trained on user feedback data: click patterns (i.e. “click-throughs”) and/or explicit user satisfaction ratings (i.e. “explicit feedback”), which indicate which documents are most relevant for a user query (and the state associated with the user query). User feedback data may also include, without limitation, the user's previous search history or the entry point of the search. Mappings between user-generated queries and the documents the user visits and/or marks as relevant are recorded. These mappings are then used to train a machine learning classifier model that takes as input the user query (and the state associated with the user query) and produces as output a list of documents (the “classes”) with associated relevance scores. Classifiers are evaluated with “test sets,” generally collected from click-through and/or explicit user feedback data distinct from the data used for the training set.
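By way of illustration only, the training and ranking steps described above can be sketched as follows. The sketch assumes a multinomial naive Bayes model with add-one smoothing, which is one possible classifier and not necessarily the one used in any particular system. The function names `train` and `rank`, the document identifiers, and the sample click log are hypothetical, and the state associated with the user query is omitted for brevity.

```python
import math
from collections import Counter, defaultdict


def train(click_log):
    """Build a model from recorded (query, clicked document) mappings."""
    doc_counts = Counter()              # number of clicks recorded per document
    term_counts = defaultdict(Counter)  # query-term counts per document
    vocab = set()
    for query, doc_id in click_log:
        doc_counts[doc_id] += 1
        for term in query.lower().split():
            term_counts[doc_id][term] += 1
            vocab.add(term)
    return doc_counts, term_counts, vocab


def rank(model, query):
    """Return (document, relevance score) pairs, most relevant first."""
    doc_counts, term_counts, vocab = model
    if not doc_counts:
        return []
    total_clicks = sum(doc_counts.values())
    # Terms never seen in training carry no evidence and are ignored.
    terms = [t for t in query.lower().split() if t in vocab]
    log_scores = {}
    for doc_id, clicks in doc_counts.items():
        denom = sum(term_counts[doc_id].values()) + len(vocab)
        score = math.log(clicks / total_clicks)  # prior from click frequency
        for t in terms:
            # Add-one smoothing keeps unseen (term, document) pairs non-zero.
            score += math.log((term_counts[doc_id][t] + 1) / denom)
        log_scores[doc_id] = score
    # Normalize the log scores so the relevance scores sum to 1.
    peak = max(log_scores.values())
    weights = {d: math.exp(s - peak) for d, s in log_scores.items()}
    norm = sum(weights.values())
    return sorted(((d, w / norm) for d, w in weights.items()),
                  key=lambda pair: pair[1], reverse=True)


# Hypothetical click-through data: (user query, document the user visited).
click_log = [
    ("print document", "doc_print"),
    ("print page", "doc_print"),
    ("print settings", "doc_print"),
    ("save file", "doc_save"),
    ("save document", "doc_save"),
]
model = train(click_log)
ranked = rank(model, "how to print")
```

In this sketch the output classes are exactly the documents that appear in the click log, which mirrors the description above: each document is a class, and the score attached to it serves as its relevance for the query.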
While this approach represents a significant improvement in the field of information retrieval, it has one limitation. Namely, when a new document is added to the collection or corpus, it initially has neither user feedback nor click-through data associated with it. Accordingly, the machine learning classifier will not select the new document as having any relevance to a user's search. The search will therefore either fail to return the new document or return it at the bottom of a vast list of search results. While the machine learning classifier could be retrained using manual methods to better recognize the new document, such methods become prohibitively labor intensive as the number of new documents grows beyond a trivial number.
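The limitation can be illustrated with a deliberately simplified sketch. It assumes a scorer that credits a document with the number of terms its logged queries share with the current query, which stands in for a trained classifier and is not taken from any particular system. The function `rank_corpus`, the document identifiers, and the click log are hypothetical.

```python
from collections import Counter


def rank_corpus(corpus, click_log, query):
    """Score every document in the corpus using only recorded feedback."""
    terms = set(query.lower().split())
    scores = Counter({doc_id: 0 for doc_id in corpus})
    for logged_query, doc_id in click_log:
        if doc_id in scores:
            # Credit the document for each term shared with a logged query.
            scores[doc_id] += len(terms & set(logged_query.lower().split()))
    # Highest score first; ties keep their corpus order (sorted is stable).
    return sorted(((d, scores[d]) for d in corpus),
                  key=lambda pair: pair[1], reverse=True)


# Hypothetical feedback recorded before the new document was added.
click_log = [
    ("print document", "doc_print"),
    ("print page", "doc_print"),
    ("save file", "doc_save"),
]
# "doc_new_print_guide" was just added, so no feedback refers to it.
corpus = ["doc_print", "doc_save", "doc_new_print_guide"]
ranked = rank_corpus(corpus, click_log, "print document")
```

Although the new document may be highly relevant to the query, it receives a score of zero because no recorded feedback refers to it, so it falls to the bottom of the list alongside documents that are genuinely unrelated.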
Thus, there is a continuing need for information retrieval systems whose machine learning classifiers can automatically recognize, and be trained for, new documents with minimal manual intervention.