1. Field of the Invention
The present invention generally relates to search programs and more particularly to an improved search method and system which clusters hypertext documents.
2. Description of the Related Art
The World Wide Web has attained a gargantuan size (Lawrence, S., and Giles, C. L. Searching the World Wide Web. Science 280, 5360 (1998), 98, incorporated herein by reference) and a central place in the information economy of today. Hypertext is the lingua franca of the web. Moreover, scientific literature, patents, and law cases may be thought of as logically hyperlinked. Consequently, searching and organizing unstructured collections of hypertext documents is a major contemporary scientific and technological challenge.
Given a “broad-topic query” (Kleinberg, J. Authoritative sources in a hyperlinked environment, in ACM-SIAM SODA (1998), incorporated herein by reference), a typical search engine may return a large number of relevant documents. Without effective summarization, it is a hopeless and enervating task to sort through all the returned documents in search of high-quality, representative information resources. Therefore, there is a need for an automated system that summarizes the large volume of hypertext documents returned during internet searches.
It is, therefore, an object of the present invention to provide a structure and method for searching a database of documents comprising: performing a search of the database using a query to produce query result documents; constructing a word dictionary of words within the query result documents; pruning function words from the word dictionary; forming first vectors for words remaining in the word dictionary; constructing an out-link dictionary of documents within the database that are pointed to by the query result documents; adding the query result documents to the out-link dictionary; pruning documents from the out-link dictionary that are pointed to by fewer than a first predetermined number of the query result documents; forming second vectors for documents remaining in the out-link dictionary; constructing an in-link dictionary of documents within the database that point to the query result documents; adding the query result documents to the in-link dictionary; pruning documents from the in-link dictionary that point to fewer than a second predetermined number of the query result documents; forming third vectors for documents remaining in the in-link dictionary; normalizing the first vectors, the second vectors, and the third vectors to create vector triplets for documents remaining in the in-link dictionary and the out-link dictionary; and clustering the vector triplets using the following four-step toric k-means process:
(a) arbitrarily segregating the vector triplets into clusters,
(b) for each cluster, computing a set of concept triplets describing the cluster,
(c) re-segregating the vector triplets into a more coherent set of clusters by putting each vector triplet into the cluster corresponding to the concept triplet that is closest to, that is, most similar to, the given vector triplet, and
(d) repeating steps (b)-(c) until the coherence of the obtained clusters no longer significantly increases,
and annotating the clusters using nuggets of information, the nuggets including summary, breakthrough, review, keyword, citation, and reference.
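Steps (a) through (d) may be illustrated by the following minimal Python sketch. It is an illustration under stated assumptions and not the claimed implementation: each document is assumed to be a triplet of unit-length lists in the order (word, out-link, in-link); one concept triplet is assumed per cluster, computed as the normalized component-wise sum of the cluster's members; similarity is assumed to be a weighted sum of the three inner products; and coherence is assumed to be the total similarity of every triplet to its own concept triplet. The names toric_kmeans, weights, tol, max_iter, and seed are hypothetical.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def similarity(triplet, concept, weights):
    """Assumed measure: weighted sum of the word, out-link, and in-link inner products."""
    return sum(w * sum(a * b for a, b in zip(u, v))
               for w, u, v in zip(weights, triplet, concept))


def concept_triplet(members):
    """Step (b): normalized component-wise sum of the triplets in one cluster."""
    concept = []
    for part in range(3):
        dim = len(members[0][part])
        total = [sum(m[part][j] for m in members) for j in range(dim)]
        concept.append(normalize(total))
    return concept


def toric_kmeans(triplets, k, weights=(1 / 3, 1 / 3, 1 / 3),
                 tol=1e-6, max_iter=100, seed=0):
    """Return (labels, concepts) for a list of (word, out-link, in-link) triplets."""
    if not 1 <= k <= len(triplets):
        raise ValueError("k must be between 1 and the number of triplets")

    # Step (a): arbitrary initial partition. Dealing shuffled documents
    # round-robin guarantees that every cluster starts non-empty.
    order = list(range(len(triplets)))
    random.Random(seed).shuffle(order)
    labels = [0] * len(triplets)
    for pos, idx in enumerate(order):
        labels[idx] = pos % k

    concepts = [None] * k
    previous = -math.inf
    for _ in range(max_iter):
        # Step (b): recompute the concept triplet of each non-empty cluster;
        # a cluster that has become empty keeps its previous concept triplet.
        for c in range(k):
            members = [t for t, lab in zip(triplets, labels) if lab == c]
            if members:
                concepts[c] = concept_triplet(members)

        # Step (c): move each triplet to the cluster with the most similar concept.
        labels = [max(range(k), key=lambda c: similarity(t, concepts[c], weights))
                  for t in triplets]

        # Step (d): stop once coherence no longer increases by more than tol.
        coherence = sum(similarity(t, concepts[lab], weights)
                        for t, lab in zip(triplets, labels))
        if coherence - previous <= tol:
            break
        previous = coherence
    return labels, concepts
```

In this sketch the weights argument controls how much the word, out-link, and in-link components each contribute to the similarity; equal weights are only a placeholder choice.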
The summary comprises a document in a cluster having a most typical word feature vector amongst all the documents in the cluster. The breakthrough comprises a document in a cluster having a most typical in-link feature vector amongst all the documents in the cluster. The review comprises a document in a cluster having a most typical out-link feature vector amongst all the documents in the cluster. The keyword comprises a word in a word dictionary for the cluster that has the largest weight. The citation comprises a document in a cluster representing a most typical in-link into a cluster. The reference comprises a document in a cluster representing a most typical out-link out of a cluster.
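The selection of the six nuggets for a single cluster may be sketched as follows. This is a hedged illustration only. It assumes each document is a (word, out-link, in-link) triplet of unit vectors, that "most typical" means the largest inner product with the corresponding component of the cluster's concept triplet, that the summary is chosen by the word feature vector, and that the keyword, reference, and citation are the dictionary entries carrying the largest weight in the word, out-link, and in-link components of the concept triplet, respectively. The function names and the dictionary arguments are hypothetical.

```python
WORD, OUT, IN = 0, 1, 2  # assumed order of the components within a triplet


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def most_typical(members, concept_part, part):
    """Index of the member whose `part` component is most similar to the concept."""
    return max(range(len(members)),
               key=lambda i: dot(members[i][part], concept_part))


def largest_weight(concept_part, dictionary):
    """Dictionary entry with the largest weight in one concept component."""
    return dictionary[max(range(len(concept_part)), key=lambda j: concept_part[j])]


def annotate(doc_ids, members, concept, word_dict, out_dict, in_dict):
    """Return the six nuggets for one cluster as a dict."""
    return {
        "summary": doc_ids[most_typical(members, concept[WORD], WORD)],
        "review": doc_ids[most_typical(members, concept[OUT], OUT)],
        "breakthrough": doc_ids[most_typical(members, concept[IN], IN)],
        "keyword": largest_weight(concept[WORD], word_dict),
        "reference": largest_weight(concept[OUT], out_dict),
        "citation": largest_weight(concept[IN], in_dict),
    }
```

Here doc_ids[i] names the document whose triplet is members[i], and each dictionary lists its entries in the same order as the coordinates of the matching vector component.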