Hypertext systems now enjoy wide use. One particular hypertext system, the World Wide Web (“Web”), provides global access over public packet-switched networks to a large number of hypertext documents. The Web has grown to contain a staggering number of documents, and the number of documents continues to increase. The Web has been estimated to contain at least 450 million pages and is expected to expand exponentially over the foreseeable future.
The number of documents available through the Web is so large that to use the Web in a practical way almost always requires a search service, search engine, or similar service. The search engines use “spider” programs that “crawl” to Web servers around the world, locate documents, index the documents, and follow hyperlinks in those documents to yet other documents. When a search query is entered, the search engine locates relevant documents and displays a set of search results that satisfy the query.
There is a practical upper limit to the number of documents that a user is willing or able to review before fatigue or frustration results. For example, most people are unwilling to review more than 20–100 documents in a set of search results. Accordingly, most search engines now use relevance criteria when selecting electronic documents to be placed in the search results. Using the relevance criteria, the search engine attempts to determine which of the documents in its index are most relevant to the search query. Normally the search results are presented in a list that is ranked by relevance. Use of relevance criteria is critical to enable the search engine to return to the user a reasonable number of electronic documents that are reasonably related to the query. Otherwise, the user would be unable to locate anything relevant among the millions of documents available online. Unfortunately, current technology does not provide a very sophisticated way to determine relevance.
In contrast, Web documents can be classified into a taxonomy of categories and presented in a browsable directory structure. Such a structure is particularly well-suited to easy navigation by novice users. In the past, classification of documents into categories has been carried out manually by large staffs of human classifiers. An example of a directory that uses such an approach is the Yahoo! search system. Clearly, there is a need for a way of leveraging human inputs to automatically classify large numbers of online electronic documents into a taxonomy of categories.
Extensive research has been done in the use of text analysis to classify text documents into categories. In the past few years, however, the number of text documents available online has grown sufficiently large that traditional text analysis approaches are inadequate. Many parties have worked with varied success at analyzing the text contents of electronic documents in order to classify them.
New approaches exploiting the hyperlink structure of the Web are addressing this problem. For example, the CLEVER project of the IBM Almaden Research Center, San Jose, Calif., is developing a search engine and directory engine. This work is based on the Hypertext Induced Topic Search process developed by Jon Kleinberg. Generally, in this process, a standard text search engine generates a Training Set of electronic documents that match a query subject or category. The process extends the Training Set to include all documents pointing to or pointed to by each document in the Training Set. Using information that describes the links between the documents, the process seeks the best Authorities and Hubs that match the query or category. Mathematically, the Authorities and Hubs are the principal Eigenvectors of matrices representing the link relations between the documents.
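The Authority and Hub computation described above can be illustrated with a minimal sketch. It assumes the extended Training Set is already available as an in-memory mapping from each document to the documents it points to; the names `hits`, `normalize`, and `links`, the fixed iteration count, and the sample graph are illustrative assumptions and are not taken from the CLEVER project.

```python
import math

def normalize(scores):
    """Scale a score table to unit Euclidean length (no-op for all zeros)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {doc: v / norm for doc, v in scores.items()}

def hits(links, iterations=50):
    """Return (authority, hub) scores for a link graph.

    links: dict mapping each document to a list of documents it points to
    (a hypothetical stand-in for the extended Training Set).
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    hub = {doc: 1.0 for doc in nodes}
    auth = {doc: 1.0 for doc in nodes}
    for _ in range(iterations):
        # A document's authority is the sum of the hub scores pointing to it.
        new_auth = {doc: 0.0 for doc in nodes}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # A document's hub score is the sum of the authorities it points to.
        new_hub = {doc: sum(new_auth[d] for d in links.get(doc, []))
                   for doc in nodes}
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub

# Assumed toy graph: "a" and "b" both point to "c", which points to "d".
authority, hub_score = hits({"a": ["c"], "b": ["c"], "c": ["d"]})
print(max(authority, key=authority.get))
```

Repeating the two update steps is a power iteration, which is why the resulting scores approach the principal Eigenvectors of the link matrices when the iteration converges.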
In another approach, the GOOGLE project uses a process of generating PageRanks. PageRanks are iteratively updated based on linked hypertext structures. The resulting PageRanks measure the general connectedness of documents on the Web, without regard to a particular category or query. The assumption is that more connected documents will tend to be of general interest to most users.
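A minimal sketch of such an iterative update follows. It is a generic power-iteration formulation rather than the GOOGLE project's actual implementation: the damping factor of 0.85, the even redistribution of rank from documents without outgoing links, and the names `pagerank` and `links` are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return a query-independent rank for every document in a link graph.

    links: dict mapping each document to a list of documents it points to.
    damping: assumed probability of following a link instead of jumping
    to a random document.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {doc: 1.0 / n for doc in nodes}
    for _ in range(iterations):
        # Rank held by documents with no outgoing links is spread evenly.
        dangling = sum(rank[doc] for doc in nodes if not links.get(doc))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {doc: base for doc in nodes}
        # Every other document splits its rank among the documents it links to.
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank

# Assumed toy graph: "a" is linked from both "c" and "d".
ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]})
print(sorted(ranks, key=ranks.get, reverse=True))
```

Note that the function takes no query or category argument, which mirrors the point that the resulting ranks measure general connectedness only.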
Both of these approaches rely mathematically on the convergence of a similarity value to the principal Eigenvectors of the link matrices. The speed of convergence depends on the ratio between the Eigenvalue of the principal Eigenvector and the Eigenvalues of the non-principal Eigenvectors. In the worst case, in which the absolute value of the ratio is close to “1”, iterations of the process can lead to oscillations between different Eigenstates. In that case, the interpretation of Authorities, Hubs, and PageRanks becomes indefinite, or the process is at least slow to converge.
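The dependence on the Eigenvalue ratio can be illustrated with a small power-iteration experiment. The sketch below uses hypothetical 2x2 diagonal matrices whose Eigenvalues are 1 and r, so the ratio is simply r; the function name `steps_to_converge`, the tolerance, and the step limit are assumptions chosen for the demonstration.

```python
import math

def _unit(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def steps_to_converge(matrix, start, tol=1e-9, max_steps=5000):
    """Count power-iteration steps until the vector stops changing.

    Returns None if the vector is still changing after max_steps.
    """
    vec = _unit(start)
    for step in range(1, max_steps + 1):
        product = [sum(a * x for a, x in zip(row, vec)) for row in matrix]
        nxt = _unit(product)
        change = max(abs(a - b) for a, b in zip(nxt, vec))
        vec = nxt
        if change < tol:
            return step
    return None

# Eigenvalue ratios of 0.5, 0.99, and -1 (absolute value exactly 1).
for ratio in (0.5, 0.99, -1.0):
    steps = steps_to_converge([[1.0, 0.0], [0.0, ratio]], [1.0, 1.0])
    print(ratio, steps)
```

Under these assumptions, the ratio near 1 needs far more steps than the ratio of 0.5, and the ratio of -1 never settles because the vector flips between two states on every step.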
Accordingly, in this field there is a need for a system or mechanism that can iteratively improve the relevance scores of a result set of electronic documents using generalized similarities among electronic documents, without necessitating convergence to Eigenvectors.
There is also a general need for a system that can automatically determine whether one electronic document is similar to another electronic document, and that can create and store a numeric value that identifies the relative similarity of the electronic documents.
There is also a need for a way to automatically classify an electronic document in a taxonomy of categories or classifications that are maintained by a document indexing system or search system.
There is a particular need for a way to combine multiple data sources to result in a more meaningful measure of the similarity of electronic documents.