Search engines are a commonly used tool for identifying desired documents from large electronic document collections, including the world-wide internet and internal corporate networks. Conventional search methods often involve keyword searching. After receiving a search query containing keywords, the search engine uses a document ranker or a ranking algorithm to evaluate the relevance of documents based on a number of document features. In most conventional search engines, the simple presence of a keyword in the document is only one of many document features considered in determining the ranking or relevance of a document. Other document features can include, for example, the presence of a keyword in a special location in the document (such as the title) or the frequency of occurrence of a keyword. Still other document features considered in determining a document ranking may be unrelated to the keywords in a search query, such as the number of web page links contained in the document, or the frequency with which the document is accessed.
The document features used by a search engine to identify a relevant document do not have to be given equal weight. For example, the presence of a keyword in the title of a document may be a better indicator of relevance than the presence of the same keyword in the body of the document. To reflect this, the presence of a keyword in the title can be given proportionally greater weight.
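The unequal weighting described above can be sketched as a weighted sum of feature values. This is only an illustration: the feature names, the weights, and the `score` helper are hypothetical and do not come from any particular search engine.

```python
# Illustrative only: feature names and weight values are assumptions.
def score(features, weights):
    """Return the weighted sum of a document's feature values."""
    return sum(weights[name] * value for name, value in features.items())

# The title feature is weighted more heavily than the body feature.
weights = {"keyword_in_title": 3.0, "keyword_in_body": 1.0}

doc_a = {"keyword_in_title": 1.0, "keyword_in_body": 0.0}  # keyword only in title
doc_b = {"keyword_in_title": 0.0, "keyword_in_body": 1.0}  # keyword only in body

# Rank documents from highest to lowest score.
ranked = sorted(
    [("doc_a", doc_a), ("doc_b", doc_b)],
    key=lambda item: score(item[1], weights),
    reverse=True,
)
```

Under these assumed weights, a document with the keyword only in its title ranks above one with the keyword only in its body.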
While the weighting for various document features can be assigned by any convenient method, it is desirable to optimize the weights associated with the document features to further improve search engine performance. Unfortunately, optimization of document feature weights poses significant challenges. A search engine can employ hundreds of document features to evaluate the relevance of a document. To optimize the relevance performance of the search engine, the weight assigned to each document feature needs to be optimized. However, optimized weights for the document features cannot be easily found by a simple sequential optimization of individual weights, as the optimized weight for any one document feature is likely to be correlated with or otherwise depend on the weights of some or all of the other document features. For example, a search engine may use both the presence of a keyword in the title and the presence of a keyword generally in the document as document features. If a keyword is present in the title of a document, it is also likely to appear in the body of the document. Thus, there is a direct correlation between the presence of a keyword in the title and the presence of the same keyword in the body of a document. Additionally, document feature weights may be indirectly correlated for a variety of reasons, such as the use of normalized document feature weights. If a search engine employs normalized document feature weights, an increase in one document feature weight requires a decrease in some or all of the other document feature weights, leading to an indirect correlation of the document features. The large number of directly and indirectly correlated document features makes optimization of the document feature weights expensive in both time and resources.
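The indirect correlation introduced by normalization can be seen in a small sketch. It assumes, purely for illustration, that normalization means scaling the weights so they sum to one; the feature names and values are hypothetical.

```python
# Illustrative only: assumes normalization means "weights sum to 1".
def normalize(weights):
    """Scale weights so that they sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

raw = {"title": 2.0, "body": 1.0, "links": 1.0}
before = normalize(raw)

# Raise only the raw title weight, then renormalize.
raw_increased = dict(raw, title=4.0)
after = normalize(raw_increased)

# Every other normalized weight shrinks, even though only "title" was changed.
changed = {name: after[name] - before[name] for name in raw}
```

Because the normalized weights must still sum to one, raising the title weight lowers the body and links weights, so no single weight can be tuned in isolation.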
Conventional techniques for determining an optimized set of feature weights for a pattern recognition engine include neural net techniques and other algorithms that attempt to replicate a value function. In these types of techniques, the feature weights are optimized by assigning a target value to one or more documents or patterns. The neural net (or other value function algorithm) then attempts to find a set of feature weights that comes closest to replicating all of the target values. At this time, such conventional methods have not been fully effective at producing optimized feature weights.
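As a rough illustration of the value-function approach, the sketch below fits feature weights so that a linear score reproduces assigned target values as closely as possible. It substitutes plain gradient descent on a squared-error loss for a neural net; the `fit_weights` helper, the learning rate, the step count, and the data are all assumptions made for the example.

```python
# Illustrative only: a linear least-squares fit stands in for a neural net.
def fit_weights(feature_rows, targets, lr=0.1, steps=2000):
    """Fit weights by gradient descent on the mean squared error."""
    n_features = len(feature_rows[0])
    weights = [0.0] * n_features
    for _ in range(steps):
        grad = [0.0] * n_features
        for row, target in zip(feature_rows, targets):
            # Error between the current weighted score and the target value.
            error = sum(w * x for w, x in zip(weights, row)) - target
            for j, x in enumerate(row):
                grad[j] += 2 * error * x / len(feature_rows)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights

# Each row is one document: [keyword_in_title, keyword_in_body] (hypothetical).
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [3.0, 1.0, 4.0]  # assigned target values, one per document

fitted = fit_weights(rows, targets)
```

In this toy case the targets are exactly reproducible by a weighted sum, so the fit recovers them. With hundreds of correlated features and hand-assigned targets, such an exact fit is generally not available.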
What is needed is a system and method for optimizing or tuning document feature weights assigned to the document features considered by a search engine. The system and method should allow for tuning of the document feature weights based on a selected set of training documents. The system and method should also allow for tuning of both independent and correlated parameters. Performing an optimization using the system and method should not require excessive time or resources. Additionally, the system and method should allow for simultaneous optimization of multiple parameters. The method should further allow optimization of the weights for relevance parameters in existing search engines.