Due to advances in computer technology and the growing popularity of computers, large numbers of people have recently started to frequently search huge databases. For example, internet search engines are frequently used to search the entire World Wide Web. Information retrieval systems are traditionally judged by their precision and recall. What is often neglected, however, is the quality of the results produced by these search engines. Large databases of documents such as the web contain many low-quality documents. As a result, searches typically return hundreds of irrelevant or unwanted documents which camouflage the few relevant ones. In order to improve the selectivity of the results, common techniques allow the user to constrain the scope of the search to a specified subset of the database, or to provide additional search terms. These techniques are most effective in cases where the database is homogeneous and already classified into subsets, or in cases where the user is searching for well-known and specific information. In other cases, however, these techniques are often not effective, because each constraint introduced by the user increases the chance that the desired information will be inadvertently eliminated from the search results.
Search engines presently use various techniques in an attempt to present more relevant documents. Typically, documents are ranked according to variations of a standard vector space model. These variations may include (a) how recently the document was updated, and/or (b) how close the search terms are to the beginning of the document. Although this strategy provides search results that are better than no ranking at all, the results still have relatively low quality. Moreover, when searching the highly competitive web, this measure of relevance is vulnerable to “spamming” techniques that authors can use to artificially inflate their document's apparent relevance in order to draw attention to it or its advertisements. For this reason, search results often contain commercial appeals that should not be considered a match to the query. Although search engines are designed to avoid such ruses, poorly conceived mechanisms can result in disappointing failures to retrieve desired information.
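By way of illustration only, the ranking variations described above can be sketched as follows. This is a toy sketch, not the implementation of any particular search engine; the function names, the recency decay, and the proximity weighting are assumptions chosen purely for clarity:

```python
import math

def vector_space_score(query_terms, doc_terms):
    """Cosine-style overlap between query and document term vectors.
    (Toy version: raw term-frequency vectors, no IDF weighting.)"""
    doc_tf = {}
    for t in doc_terms:
        doc_tf[t] = doc_tf.get(t, 0) + 1
    dot = sum(doc_tf.get(t, 0) for t in query_terms)
    doc_norm = math.sqrt(sum(v * v for v in doc_tf.values()))
    query_norm = math.sqrt(len(query_terms))
    return dot / (doc_norm * query_norm) if doc_norm else 0.0

def ranked_score(query_terms, doc_terms, last_updated, now):
    """Variation (a): decay the score with document age (timestamps
    in seconds). Variation (b): boost documents whose first matching
    term appears near the beginning."""
    base = vector_space_score(query_terms, doc_terms)
    age_days = (now - last_updated) / 86400
    recency = 1.0 / (1.0 + age_days / 365)        # newer documents score higher
    first_hit = next((i for i, t in enumerate(doc_terms)
                      if t in query_terms), len(doc_terms))
    proximity = 1.0 / (1.0 + first_hit / 10)      # earlier matches score higher
    return base * recency * proximity
```

Under this sketch, two documents with identical content are ordered by freshness, and a match in the opening words outranks the same match buried deep in the text.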
Some search engines use backlink information (i.e., information from pages that contain links to the current page) to assist in identifying relevant web documents. Rather than using the content of a document to determine relevance, this technique uses the anchor text of links pointing to the document to characterize its relevance. In particular, search query terms are compared to a collection of anchor text descriptions in pages that point to the document, rather than to a keyword index of the document's content. A rank is then assigned to a document based on the degree to which the search terms match the anchor descriptions in its backlink documents.
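A minimal sketch of such anchor-text ranking might look as follows; the data layout and the scoring rule (counting query-term occurrences in backlink anchors) are illustrative assumptions, not the method of any particular engine:

```python
from collections import defaultdict

def build_anchor_index(links):
    """links: iterable of (source_url, target_url, anchor_text) triples.
    Each target page is indexed by the anchor text of its backlinks,
    rather than by its own content."""
    index = defaultdict(list)
    for _source, target, anchor in links:
        index[target].extend(anchor.lower().split())
    return index

def anchor_rank(query, anchor_index):
    """Score each page by how often the query terms appear in the
    anchor text of links pointing to it; return pages best-first."""
    terms = query.lower().split()
    scores = {}
    for page, anchor_terms in anchor_index.items():
        score = sum(anchor_terms.count(t) for t in terms)
        if score:
            scores[page] = score
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

Note that a page can rank highly for a query term it never contains, so long as other pages describe it with that term in their link anchors.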
Another method of determining the importance of a document is citation counting, whereby the importance of a document is determined by counting its number of citations, or backlinks. The citation rank r(A) of a document that has n backlink pages is simply r(A)=n. U.S. Pat. No. 6,285,999 describes a scheme based on citation counting and further refines it. Instead of simply counting the number of citations to a given document, the scheme of U.S. Pat. No. 6,285,999 also assigns a weight to each citation, indicative of the relative importance of that citation. In this scheme, the importance of a page, and hence the rank assigned to it, depends not just on the number of citations it has, but on the importance of the citing documents as well. U.S. Pat. No. 6,285,999 therefore describes an iterative scheme whereby the importance ranking of documents is calculated based on the number of other documents that cite each document and the importance rankings of those citing documents. The importance rankings so calculated are then used in the next iteration of importance ranking calculations, and so on. This scheme is sometimes referred to as a page rank scheme.
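The iterative scheme described above can be sketched as follows. This is a simplified illustration of the general page rank idea, not the precise method of U.S. Pat. No. 6,285,999; for simplicity it assumes every page has at least one outgoing link (dangling pages are not handled), and the damping factor and iteration count are conventional assumed defaults:

```python
def page_rank(backlinks, damping=0.85, iterations=50):
    """Iterative importance ranking: a page's rank is divided among
    the pages it links to, so a citation from an important page
    counts for more than a citation from an unimportant one.

    backlinks: dict mapping each page to the list of pages that
    link TO it (its backlink documents).
    """
    # Collect every page and derive out-degrees from the backlink map.
    pages = set(backlinks)
    for sources in backlinks.values():
        pages.update(sources)
    out_degree = {p: 0 for p in pages}
    for sources in backlinks.values():
        for s in sources:
            out_degree[s] += 1

    n = len(pages)
    rank = {p: 1.0 / n for p in pages}      # start from a uniform ranking
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Each citing page contributes its current rank, split
            # evenly across all of its outgoing links.
            incoming = sum(rank[s] / out_degree[s]
                           for s in backlinks.get(p, []))
            new_rank[p] = (1 - damping) / n + damping * incoming
        rank = new_rank                     # feed results into the next iteration
    return rank
```

Each pass recomputes every page's rank from the ranks of its citing pages, and those results feed the next pass, mirroring the iterative calculation described above.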
No search engine currently uses the amount of time a user spends on a given page in its calculation of relevance, page value, or link value. Further limitations and disadvantages of conventional and traditional approaches will become apparent to one of skill in the art through comparison of such systems with the present invention as set forth in the remainder of the present application with reference to the drawings.