The present invention relates generally to document clustering techniques. More specifically, the present invention relates to document clustering techniques incorporating both content-based and log-based methods to produce clusters that incorporate users' perspective.
Information retrieval systems are concerned with locating documents relevant to a user's information need from a collection of documents. The user describes his information need using a query consisting of a number of words. The information retrieval systems compare the query with the documents in the collection and return the documents that are likely to satisfy the information need.
Document clustering is often used to increase the efficiency and effectiveness of information retrieval systems. Clustering involves the grouping of similar or otherwise related documents. In the context of information retrieval, document clustering identifies groups of similar documents, usually on the basis of terms that the documents have in common. Closely associated documents tend to be relevant to the same queries or requests. Therefore, clustering of documents increases the efficiency of information retrieval systems. Further, clustering of documents also aids in browsing of the document collection, because related documents can be co-located.
Cluster analysis methods are usually based on measurements of similarity between objects, these objects being either individual documents or clusters of documents. Traditionally, interdocument similarity was determined by analyzing the contents of the documents. The content-based clustering method assumes that documents are represented by lists of manually or automatically assigned terms, keywords, phrases, indices, or thesaural terms that describe the content of the documents.
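By way of illustration only, a content-based interdocument similarity can be sketched as follows. The text above does not name a particular similarity measure; cosine similarity over term counts is assumed here, and the function names `term_vector` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a document as a bag of lowercase, whitespace-separated terms."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors (assumed measure)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


doc_1 = term_vector("document clustering groups similar documents")
doc_2 = term_vector("clustering groups related documents")
print(cosine_similarity(doc_1, doc_2))
```

Documents sharing many terms score close to 1, and documents sharing no terms score 0.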
Because the content-based clustering approach analyzes each and every document to be clustered, the result is complete and stable. Using the content-based clustering approach, the entire collection of the documents can be clustered, and the clusters do not change as long as the document collection and the keywords do not change.
The content-based clustering method is widely used on the Internet as a method of organizing information. An ever-increasing amount of information is becoming available via the Internet and the World Wide Web (the “Web”). However, because of the decentralized nature of the information presented, it is becoming increasingly difficult for a user to find relevant information regarding a particular subject. To assist the user in locating relevant information on the Web, many portal sites maintain directories built upon content-based clustering of the web pages.
Portals are Internet sites that organize, or categorize, web pages into various topics and offer topic-based or keyword-based organization of the web pages to the user. However, because the portals' topics and the keywords are determined by the portal providers, the topics, the keywords, or the assignment of the web pages to these topics or keywords do not reflect the perspectives and the interests of the users. In fact, the users may find the portals' organization or clustering of the web pages to be stifling and non-sensible.
Additionally, the organization of the web pages into the portals' topics and categories cannot account for differences between different demographic groups of users. For example, people of different ages, genders, or occupations are likely to prefer different categorization and clustering of the web pages. Unfortunately, regardless of the users' preferences, the portals offer the same categorization of the web pages as generated by the portal providers. Some portals offer facilities for the user to “customize” the portal. However, these facilities typically provide limited functions for the user to select, from the already-determined topics and categories, which topics and categories to display when the user links to the portal. And, typically, these customization facilities do not allow users to create customized topics or categories, or to assign web pages to certain categories for customized clustering of the web pages.
Further, the content-based clustering method, because of its static nature, cannot adapt to changing preferences of the users and the addition of new topics, categories, or areas of interest.
To overcome some of the shortcomings of the content-based clustering method, a log-based clustering technique has been proposed. Recently, it has been shown that documents can be clustered based on retrieval system logs maintained by an information retrieval system, such as web server access logs. Using web server access logs, it has been shown that similar pages tend to be accessed together by users. Under the log-based clustering method, the interdocument similarity can be based upon whether the documents were accessed together during retrieval sessions by the user.
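As a rough sketch of how co-access during retrieval sessions might be turned into an interdocument similarity, the example below scores two documents by the overlap of the sessions in which each was accessed. The Jaccard ratio, the session format (a list of document identifiers per session), and the function names are all assumptions made for illustration; the text above does not specify them.

```python
from collections import defaultdict


def sessions_by_document(sessions):
    """Map each document ID to the set of session indices in which it was accessed."""
    index = defaultdict(set)
    for session_id, accessed_docs in enumerate(sessions):
        for doc_id in accessed_docs:
            index[doc_id].add(session_id)
    return index


def co_access_similarity(index, doc_a, doc_b):
    """Jaccard overlap of the sessions that accessed each document (assumed measure)."""
    sessions_a = index.get(doc_a, set())
    sessions_b = index.get(doc_b, set())
    union = sessions_a | sessions_b
    if not union:
        return 0.0  # neither document appears in the logs
    return len(sessions_a & sessions_b) / len(union)


# Hypothetical retrieval sessions, each listing the documents opened.
sessions = [["a", "b"], ["a", "b", "c"], ["c", "d"]]
index = sessions_by_document(sessions)
print(co_access_similarity(index, "a", "b"))
```

Documents that never appear in the logs receive a similarity of zero to everything, which is the incompleteness that log-based clustering is known to suffer from.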
Since the clustering of documents for each user is based on retrieval system logs, documents (e.g., Web pages) that users found to be similar fall into the same cluster, thereby reflecting the “similarity notion” of users. As user access patterns change, the clusters will also change, giving the clusters a “dynamic” nature. And, since the log-based clustering method can be based on recent retrieval system logs for each user, the clustering can adapt to the changing tastes and perspective of the user.
However, the log-based clustering method produces document clusters which are inherently incomplete. This is because the log-based clustering method clusters only those documents that are accessed by some users. In an environment like the Internet, where millions upon millions of web pages exist, only a tiny portion would be clustered under the log-based clustering method. The remaining web pages are not clustered at all.
Accordingly, there remains a need for a document clustering method that incorporates users' perspective while accounting for documents not accessed by the user and that overcomes the disadvantages set forth previously.
According to one aspect of the present invention, a method for clustering documents is disclosed. The documents are represented in a hybrid matrix, and the hybrid matrix is clustered by a content-based clustering algorithm. There is one vector per document in the hybrid matrix. For those documents that are accessed in the session logs, a log-based document clustering vector is constructed in the hybrid matrix. For all other documents, a vector based on keywords is constructed.
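The row-selection rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the log-based vectors are taken as already computed and supplied in a hypothetical `log_vectors` mapping, rows are stored as sparse term counts, and the names `keyword_vector` and `build_hybrid_matrix` are invented for the example.

```python
from collections import Counter


def keyword_vector(text):
    """Keyword-based vector: term counts of the document's own text (assumed form)."""
    return Counter(text.lower().split())


def build_hybrid_matrix(documents, log_vectors):
    """Return one row per document.

    documents:   dict mapping document ID to its text
    log_vectors: dict mapping document ID to a precomputed log-based vector,
                 present only for documents accessed in the session logs
    """
    matrix = {}
    for doc_id, text in documents.items():
        if doc_id in log_vectors:
            matrix[doc_id] = log_vectors[doc_id]    # accessed: log-based vector
        else:
            matrix[doc_id] = keyword_vector(text)   # not accessed: keyword vector
    return matrix


documents = {"d1": "apple pie recipe", "d2": "car engine repair"}
log_vectors = {"d1": Counter({"fruit": 2, "dessert": 1})}
hybrid = build_hybrid_matrix(documents, log_vectors)
print(hybrid["d1"], hybrid["d2"])
```

Because every document receives exactly one row, a content-based clustering algorithm can then be applied to the whole collection, including documents that never appear in the logs.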
To form the log-based cluster document vector, a corresponding log-based cluster document must be generated. The log-based cluster document is generated by accessing retrieval session logs and clustering them into session clusters. Then, the log-based cluster document is generated for each session cluster by concatenating the documents that were opened during the sessions in that session cluster.
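The concatenation step can be illustrated with a short sketch. It assumes that the session clusters have already been formed by some clustering step not shown here, that each session is a list of opened document identifiers, and that a document opened in several sessions of the same cluster is concatenated only once. The deduplication, the data layout, and the name `log_based_cluster_documents` are assumptions for illustration.

```python
def log_based_cluster_documents(session_clusters, documents):
    """Build one concatenated document per session cluster.

    session_clusters: list of clusters; each cluster is a list of sessions;
                      each session is a list of opened document IDs
    documents:        dict mapping document ID to its text
    """
    cluster_documents = []
    for cluster in session_clusters:
        opened = []  # document IDs in first-opened order, without repeats
        for session in cluster:
            for doc_id in session:
                if doc_id not in opened:
                    opened.append(doc_id)
        # Concatenate the text of every document opened in this cluster.
        cluster_documents.append(" ".join(documents[d] for d in opened))
    return cluster_documents


documents = {"a": "alpha", "b": "beta", "c": "gamma"}
session_clusters = [[["a", "b"], ["b"]], [["c"]]]
print(log_based_cluster_documents(session_clusters, documents))
```

A vector for each log-based cluster document could then be derived from the concatenated text in the same way as for any other document.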
According to another aspect of the present invention, an apparatus for clustering documents includes storage for storing retrieval session logs and a processor, connected to the storage, for performing the steps of the present invention. The apparatus may further include memory, connected to the processor, for storing intermediate results including the hybrid matrix. The storage and the memory are preferably machine-readable memory devices encoded with data structures for clustering documents, including the hybrid matrix, the retrieval session logs, and the instructions for the processor.
Other aspects and advantages of the present invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, illustrating by way of example the principles of the present invention.