The present invention is directed to a method for clustering electronic documents in response to a search query. More specifically, the invention is directed to a method in which a cluster of documents is provided as a search result when the search query does not completely match any single document but portions of the query are found to match a number of documents.
With the proliferation of electronic information sources, it has become necessary to provide searching capabilities that enable users to look for information of interest in large collections of documents. It is well known to provide search engines for searching for pages on the World Wide Web. These pages are commonly referred to as unstructured documents. Examples of such search engines include Yahoo, Infoseek and others. It is also known to conduct searches across structured documents, which may be found in databases, and several tools exist for this purpose. Such searching often involves a forms-based interface for specifying attribute/value pairs (e.g., in a database such as white pages, the attribute/value pair could be name/phone number).
In connection with searches of unstructured documents such as those on the World Wide Web, the search engines do an effective job of finding many possible matches. However, the number of matches is often quite large, and it is difficult to retrieve each of the documents to locate the few of particular interest. As an example, a search engine like Altavista or Lycos returns a ranked list of documents in response to a keyword-based query, and the score of each document is based on the "similarity" of the document to the query keywords. Consider an example query of "rosehips cancer" where the user wants to discover whether rosehips (the tiny fruits left after rose petals fall) can help in the cure for cancer. The term counts for rosehips and for cancer each range in the tens of thousands. With a result set of this size, it is difficult for the user to search among the documents to find information that appears only infrequently in them. For example, it is difficult to obtain from this set the documents that deal with using rosehips in cancer treatment unless both terms are found in the same document. It would be very useful if such sets of related documents could be automatically clustered and returned in response to the query, that is, if a split match of the query (multiple documents that together satisfy the query) could be provided.
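By way of illustration only, the notion of a split match can be sketched in a few lines of Python. This sketch is not the method of the invention. The function name `split_match`, the in-memory dictionary of documents, and the simple word tokenization are all assumptions made for the example. It reports a full match when a single document contains every query term, and otherwise reports a split match when each term is found in at least one document.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def split_match(query, documents):
    """Classify a keyword query against documents (hypothetical helper).

    documents maps a document id to its text. Returns a dict whose
    "kind" is "full", "split", or "none".
    """
    terms = sorted(tokenize(query))
    if not terms:
        return {"kind": "none", "documents": []}

    # For each query term, collect the ids of the documents containing it.
    doc_words = {doc_id: tokenize(text) for doc_id, text in documents.items()}
    hits = {
        term: {doc_id for doc_id, words in doc_words.items() if term in words}
        for term in terms
    }

    # Full match: at least one document contains every query term.
    full = set.intersection(*hits.values())
    if full:
        return {"kind": "full", "documents": sorted(full)}

    # Split match: no single document suffices, but every term is
    # covered by some document, so the documents together satisfy the query.
    if all(hits.values()):
        cluster = set.union(*hits.values())
        return {
            "kind": "split",
            "documents": sorted(cluster),
            "by_term": {term: sorted(ids) for term, ids in hits.items()},
        }

    return {"kind": "none", "documents": []}


if __name__ == "__main__":
    docs = {
        "d1": "Rosehips are rich in vitamin C.",
        "d2": "Vitamin C and its role in cancer treatment.",
        "d3": "Pruning roses in early spring.",
    }
    print(split_match("rosehips cancer", docs))
```

In this sketch the cluster is simply the union of all partially matching documents. A practical system would presumably also need to establish that the clustered documents are related to one another, which the sketch does not attempt.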
In a similar vein, where a collection contains both structured and unstructured documents, it is possible that the information sought is present partially in a structured document and partially in an unstructured document. Presently, there are no search mechanisms to locate such information.