The use of search engines to locate relevant documents within a database, enterprise intranet, or the Internet has become commonplace. At a very high level, most search engines function by performing three distinct steps: identifying all documents which match the search criteria (the “candidate documents”); ranking the candidate documents based on a predicted relevance; and presenting the results to the user beginning with the most relevant.
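The three steps above can be illustrated with a minimal sketch. It is not drawn from any particular search engine: the function name `search`, the in-memory `documents` mapping, and the use of a raw term count as the predicted relevance are all assumptions made purely for illustration.

```python
def search(query, documents, top_n=10):
    """Toy three-step search over a {doc_id: text} mapping (hypothetical layout)."""
    terms = query.lower().split()

    # Step 1: identify the candidate documents, here those containing every query term.
    candidates = {}
    for doc_id, text in documents.items():
        words = text.lower().split()
        if all(term in words for term in terms):
            candidates[doc_id] = words

    # Step 2: rank the candidates by a predicted relevance.
    # Assumption: relevance is simply the total number of query-term occurrences.
    scored = [
        (sum(words.count(term) for term in terms), doc_id)
        for doc_id, words in candidates.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))

    # Step 3: present the results, most relevant first.
    return [doc_id for _, doc_id in scored[:top_n]]


documents = {
    "a": "ranking ranking functions",
    "b": "ranking functions",
    "c": "neural networks",
}
results = search("ranking", documents)
```

In this sketch, document `c` is never ranked at all because it fails the matching step, which is why the quality of the ranking step only matters for the candidate set.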
The quality of the relevance ranking function is very important to the user's satisfaction with the search engine because the user is not expected to review the entire set of matching documents and, in many cases, cannot realistically do so. In most cases, the user will review only a relatively small number of those documents, so the most relevant candidates must appear within that small subset for the search to be successful.
For purposes of comparing the performance of different ranking functions, it is convenient to approximate the overall user satisfaction by a single metric or set of metrics. Typically, the metric is computed over a representative set of queries that are selected by random sampling from the search domain. The metric can be as simple as the average count of relevant documents in the top N (1, 5, or 10) results, often referred to as Precision@1, Precision@5, or Precision@10, or a slightly more complicated measure such as Normalized Discounted Cumulative Gain (NDCG).
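As a rough illustration of these metrics, the sketch below computes Precision@N and NDCG@N for one ranked result list and averages a metric over a set of queries. Several choices are assumptions rather than anything stated above: precision is expressed as a fraction of N rather than a raw count, the NDCG gain is taken as 2^rel - 1 with a log2 position discount (one common formulation among several), and the ideal ordering is derived from the same judged list.

```python
import math


def precision_at_n(relevance, n):
    """Fraction of the top-n results judged relevant (any non-zero label)."""
    return sum(1 for label in relevance[:n] if label) / n


def dcg_at_n(relevance, n):
    """Discounted cumulative gain; gain 2**label - 1 is an assumed convention."""
    return sum(
        (2 ** label - 1) / math.log2(rank + 1)
        for rank, label in enumerate(relevance[:n], start=1)
    )


def ndcg_at_n(relevance, n):
    """DCG normalized by the DCG of the best possible ordering of the same labels."""
    ideal = dcg_at_n(sorted(relevance, reverse=True), n)
    return dcg_at_n(relevance, n) / ideal if ideal > 0 else 0.0


def mean_metric(metric, per_query_relevance, n):
    """Average a per-query metric over a sampled set of queries."""
    return sum(metric(labels, n) for labels in per_query_relevance) / len(per_query_relevance)


# Relevance labels of the returned results, in ranked order, for two hypothetical queries.
judged = [[1, 0, 1, 0, 0], [0, 0, 2, 0, 1]]
mean_p5 = mean_metric(precision_at_n, judged, 5)
mean_ndcg5 = mean_metric(ndcg_at_n, judged, 5)
```

Unlike Precision@N, NDCG rewards placing the most relevant documents nearest the top, so two rankings with the same count of relevant results can still receive different NDCG values.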
The quality of the ranking function depends primarily on two characteristics: the set of features on which the ranking is based, and the specific algorithm applied to those features. The ranking features are attributes of the candidate documents that help identify the relevance of each document. The ranking algorithm determines how these features are combined into a single number that can be used to rank the documents. Typical search engines use an algorithm that relies upon a linear combination of the ranking features. Neural networks have also been applied in the area of Internet searching.
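A linear combination of ranking features can be sketched as a weighted sum. The feature names (`term_match`, `url_depth`) and the weight values below are hypothetical and chosen only to make the example concrete; they are not taken from any specific search engine.

```python
def linear_score(features, weights):
    """Combine a document's ranking features into a single number via a weighted sum."""
    return sum(weights[name] * value for name, value in features.items())


def rank_documents(candidates, weights):
    """Order candidate document ids from highest to lowest score."""
    return sorted(
        candidates,
        key=lambda doc_id: linear_score(candidates[doc_id], weights),
        reverse=True,
    )


# Hypothetical weights: reward term matches, penalize deeply nested URLs.
weights = {"term_match": 2.0, "url_depth": -0.5}
candidates = {
    "a": {"term_match": 1.0, "url_depth": 4.0},
    "b": {"term_match": 0.8, "url_depth": 1.0},
}
ranking = rank_documents(candidates, weights)
```

Here document `a` matches the query terms better, but its URL-depth penalty offsets that advantage, which shows how the choice of features and weights, not only the matching itself, determines the final order.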
The preferred set of ranking features varies depending on the search domain. Much of the emphasis in search engine development is on Internet searches. However, enterprise searching of an intranet or document library is also in high demand and requires a different, tailored set of features for optimal results. This is driven primarily by the different characteristics of the domain and of the documents themselves.