The present invention is directed to a method for clustering electronic documents in response to a search query. More specifically, the invention is directed to a method in which a cluster of documents is provided as a search result when the search query does not completely match any single document but portions of the query are found to match a number of documents.
With the proliferation of electronic information sources it has become necessary to provide searching capabilities that enable users to look for information of interest in large collections of documents. It is well known to provide search engines for searching for pages on the World Wide Web. These pages are commonly referred to as unstructured documents. Examples of such search engines include Yahoo, Infoseek and others. It is also known to conduct searches across structured documents, which may be found in databases, and several tools exist for this purpose. Such searching often involves a forms-based interface for specifying attribute/value pairs (e.g., in a database such as white pages the attribute/value pair could be name/phone number).
In connection with searches of unstructured documents such as on the World Wide Web, the search engines do an effective job of finding many possible matches. However, the number of matches is often quite large and it is difficult to retrieve each of the documents to locate the few of particular interest. As an example, a search engine like Altavista or Lycos returns a ranked list of documents in response to a keyword-based query, and the score of each document is based on the “similarity” of the document to the query keywords. Consider an example query of “rosehips cancer” where the user wants to discover whether rosehips (the tiny fruits left after rose petals fall) can help in the cure for cancer. The term counts of rosehips and cancer range in the tens of thousands. With a result set of this size, it is difficult for the user to search among the documents to find other information which might appear infrequently in the documents. For example, it is difficult to obtain from this set documents that deal with using rosehips in cancer treatment unless the terms are found in the same document. It would be very useful if such sets of related documents could be automatically clustered and returned in response to the query, that is, if a split match of the query (multiple documents that together satisfy the query) could be provided.
In a similar vein, in connection with sets of structured and unstructured documents it is possible that information is present partially in a structured document and partially in an unstructured document. Presently, there are no search mechanisms to locate such information.
The present invention provides a method for clustering documents in answer to a query, joining those documents that share infrequently occurring terms. More specifically, in accordance with the present invention, the search engine provides a ranked list of document clusters rather than individual documents in response to a query. Each document returned by the search as part of the answer list is required to match some or all of the query words and hence would have been part of the list of documents returned by the traditional approach. However, the present invention further computes an inter-document similarity, in addition to the similarity of each document to the query keywords. This enables the creation of document clusters.
In accordance with the method of the present invention, a universe of documents is first searched using an inverted index to locate documents that match the query keywords. Second, the similarity of document pairs is computed based on the occurrence of infrequently occurring words in the vicinity of query keywords in documents. Documents are clustered and assigned scores based on the diversity of matches of documents in the cluster to the query keywords and the similarity between pairs of documents in the cluster.
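By way of illustration only, the three steps just described (inverted-index lookup, pairwise similarity from infrequent words near the query keywords, and cluster scoring) might be sketched as follows. This is a minimal sketch under stated assumptions and not the claimed implementation: the function names, the `window` and `max_doc_freq` parameters, the use of two-document clusters, the Jaccard-style similarity, and the score formed as diversity times similarity are all hypothetical choices made for the example.

```python
from collections import defaultdict
from itertools import combinations


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; no tokenizer is specified)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def context_terms(text, query_terms, window):
    """Collect non-query terms within `window` positions of any query keyword."""
    tokens = tokenize(text)
    nearby = set()
    for i, tok in enumerate(tokens):
        if tok in query_terms:
            nearby.update(tokens[max(0, i - window): i + window + 1])
    return nearby - query_terms


def split_match_clusters(docs, query, window=5, max_doc_freq=2):
    """Return two-document clusters as (pair, score, shared_terms), best first."""
    query_terms = set(tokenize(query))
    index = build_inverted_index(docs)

    # Step 1: find documents that match at least one query keyword.
    matched = {t: index.get(t, set()) for t in query_terms}
    candidates = set().union(*matched.values())

    # Step 2: keep only infrequent terms found near the query keywords.
    # "Infrequent" here means a document frequency of at most max_doc_freq
    # (a hypothetical threshold chosen for this tiny corpus).
    contexts = {
        d: {t for t in context_terms(docs[d], query_terms, window)
            if len(index[t]) <= max_doc_freq}
        for d in candidates
    }

    # Step 3: score each pair by query coverage (diversity) times the
    # Jaccard overlap of their infrequent context terms (similarity).
    clusters = []
    for a, b in combinations(sorted(candidates), 2):
        shared = contexts[a] & contexts[b]
        if not shared:
            continue
        covered = {t for t in query_terms if a in matched[t] or b in matched[t]}
        diversity = len(covered) / len(query_terms)
        similarity = len(shared) / len(contexts[a] | contexts[b])
        clusters.append(((a, b), diversity * similarity, shared))
    clusters.sort(key=lambda c: c[1], reverse=True)
    return clusters


# Invented example documents, used only to exercise the sketch.
DOCS = {
    "d1": "rosehips are rich in quercetin and vitamins",
    "d2": "studies link quercetin to slower cancer growth",
    "d3": "cancer screening guidelines were updated this year",
    "d4": "gardeners prune roses in early spring",
}
clusters = split_match_clusters(DOCS, "rosehips cancer")
for pair, score, shared in clusters:
    print(pair, round(score, 3), sorted(shared))
```

In this invented corpus no single document contains both query keywords, yet d1 and d2 are joined because each mentions the infrequent term "quercetin" near a different keyword, which is the split match the method aims to surface.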
In a further embodiment of the present invention, the capability of finding split matches across structured and unstructured documents is also provided. In this embodiment the clusters constitute pairings of unstructured documents and structured documents which are compared to one another and scored in a manner similar to that described above. The paired documents are then ranked in order again relying on the concept of the diversity of matches of documents in the cluster to the query keywords and the similarity between pairs of documents in the cluster.
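The pairing of structured and unstructured documents in this further embodiment might be sketched in a similar way. Again, this is only an illustrative sketch under assumptions: a structured document is modeled as a dictionary of attribute/value pairs, the names `record_terms` and `pair_split_matches` are hypothetical, the rarity filter on shared terms is omitted for brevity, and the diversity-times-similarity score is an assumed formula rather than the one used by the invention.

```python
from itertools import product


def terms(text):
    """Lowercase whitespace tokenizer returning a set of terms (an assumption)."""
    return set(text.lower().split())


def record_terms(record):
    """Flatten the attribute values of a structured record into a set of terms."""
    out = set()
    for value in record.values():
        out |= terms(str(value))
    return out


def pair_split_matches(records, texts, query):
    """Pair structured records with unstructured texts that jointly match the query.

    Returns (record_id, text_id, score) tuples sorted by descending score.
    """
    q = terms(query)
    results = []
    for (rid, record), (tid, text) in product(records.items(), texts.items()):
        r_terms, t_terms = record_terms(record), terms(text)
        r_hit, t_hit = q & r_terms, q & t_terms
        # Each side of the pair must match at least one query keyword.
        if not r_hit or not t_hit:
            continue
        # Non-query terms common to both sides link the two documents.
        shared = (r_terms & t_terms) - q
        if not shared:
            continue
        diversity = len(r_hit | t_hit) / len(q)
        similarity = len(shared) / len((r_terms | t_terms) - q)
        results.append((rid, tid, diversity * similarity))
    results.sort(key=lambda r: r[2], reverse=True)
    return results


# Invented example data, used only to exercise the sketch.
RECORDS = {
    "r1": {"name": "Alice Chen", "department": "oncology"},
    "r2": {"name": "Bob Ruiz", "department": "cardiology"},
}
TEXTS = {
    "t1": "alice chen published a rosehips study",
    "t2": "bob ruiz joined the hospital in march",
}
pairs = pair_split_matches(RECORDS, TEXTS, "oncology rosehips")
print(pairs)
```

Here the query keyword "oncology" is found only in the structured record r1 and "rosehips" only in the unstructured text t1; the pair is returned because the two share the name terms, so the information sought is split across the two kinds of document.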