Information retrieval systems are concerned with locating documents relevant to a user's information need from a collection of documents. The user describes the information need using a query consisting of a number of words. The information retrieval system compares the query with the documents in the collection and returns the documents that are likely to satisfy the information need.
Document clustering is often used to increase the efficiency and effectiveness of information retrieval systems. Clustering involves the grouping of similar or otherwise related documents. In the context of information retrieval, document clustering identifies groups of similar documents, usually on the basis of terms that the documents have in common. Closely associated documents tend to be relevant to the same queries or requests. Therefore, clustering of documents increases the efficiency of information retrieval systems. Further, clustering of documents also aids in browsing of the document collection. Related documents can be co-located to enhance browsing.
Cluster analysis methods are usually based on measurements of similarity between objects, these objects being either individual documents or clusters of documents. Traditionally, interdocument similarity was determined by analyzing the contents of the documents. The content-based clustering method assumes that documents are represented by lists of manually or automatically assigned terms, keywords, phrases, indices, or thesaural terms that describe the content of the documents.
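By way of illustration only, the content-based notion of interdocument similarity described above can be sketched as follows. The sketch assumes each document is represented by a list of assigned terms and uses the Jaccard coefficient (shared terms divided by all distinct terms) as the similarity measure. The choice of measure, the function name `jaccard_similarity`, and the sample documents are assumptions made for this example and are not prescribed by the text.

```python
def jaccard_similarity(terms_a, terms_b):
    """Return the fraction of distinct terms that two documents share."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        # Two documents with no terms at all: treat as not similar.
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical documents, each represented by its assigned terms.
docs = {
    "d1": ["clustering", "retrieval", "documents"],
    "d2": ["clustering", "documents", "logs"],
    "d3": ["cooking", "recipes"],
}

sim_d1_d2 = jaccard_similarity(docs["d1"], docs["d2"])  # 2 shared of 4 distinct
sim_d1_d3 = jaccard_similarity(docs["d1"], docs["d3"])  # no shared terms
```

A clustering procedure could then group documents whose pairwise similarity exceeds a chosen threshold; other measures, such as cosine similarity over weighted term vectors, could be substituted.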
Because the content-based clustering approach analyzes each and every document to be clustered, the result is complete and stable. Using the content-based clustering approach, the entire collection of the documents can be clustered, and the clusters do not change as long as the document collection and the keywords do not change.
The content-based clustering method is widely used on the Internet as a method of organizing information. An ever-increasing amount of information is becoming available via the Internet and the World Wide Web (the “Web”). However, because of the decentralized nature of the information presented, it is becoming increasingly difficult for a user to find relevant information regarding a particular subject. To assist the user in locating relevant information on the Web, many portal sites maintain directories built upon content-based clustering of the web pages.
Portals are Internet sites that organize or categorize web pages into various topics and offer topic-based or keyword-based organization of the web pages to the user. However, because the portals' topics and keywords are determined by the portal providers, the topics, the keywords, and the assignment of the web pages to these topics or keywords do not reflect the perspectives and the interests of the users. In fact, the users may find the portals' organization or clustering of the web pages to be stifling and nonsensical.
Additionally, the organization of the web pages into the portals' topics and categories cannot account for differences between different demographic groups of users. For example, people of different ages, genders, or occupations are likely to prefer different categorization and clustering of the web pages. Unfortunately, regardless of the users' preferences, the portals offer the same categorization of the web pages as generated by the portal providers. Some portals offer facilities for the user to “customize” the portal. However, these facilities typically provide only limited functions that let the user select, from the already-determined topics and categories, which topics and categories to display when the user links to the portal. Moreover, these customization facilities typically do not allow users to create customized topics or categories, or to assign web pages to certain categories for customized clustering of the web pages.
Further, the content-based clustering method, because of its static nature, cannot adapt to changing preferences of the users and the addition of new topics, categories, or areas of interest.
To overcome some of the shortcomings of the content-based clustering method, a log-based clustering technique has been proposed. Recently, it has been shown that documents can be clustered based on retrieval system logs maintained by an information retrieval system, such as web server access logs. Using web server access logs, it has been shown that similar pages tend to be accessed together by users. Under the log-based clustering method, the interdocument similarity can be based upon whether the documents were accessed together during retrieval sessions by the user.
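By way of illustration only, a similarity based on co-access can be sketched as follows. The sketch assumes the access log has already been split into sessions, each a list of the pages one user retrieved together, and scores a pair of pages by the number of sessions containing both divided by the number of sessions containing either. The session format, the scoring formula, and the names `co_access_counts` and `co_access_similarity` are assumptions made for this example; the text does not specify a particular formula.

```python
from collections import Counter
from itertools import combinations


def co_access_counts(sessions):
    """Count, for each pair of pages, the sessions in which both appear."""
    counts = Counter()
    for pages in sessions:
        # De-duplicate within a session and sort so each pair has one key.
        for a, b in combinations(sorted(set(pages)), 2):
            counts[(a, b)] += 1
    return counts


def co_access_similarity(sessions):
    """Score each co-accessed pair: sessions with both / sessions with either."""
    pair_counts = co_access_counts(sessions)
    page_counts = Counter(p for pages in sessions for p in set(pages))
    return {
        (a, b): n / (page_counts[a] + page_counts[b] - n)
        for (a, b), n in pair_counts.items()
    }


# Hypothetical sessions extracted from a web server access log.
sessions = [["/a", "/b"], ["/a", "/b", "/c"], ["/c"]]
similarity = co_access_similarity(sessions)
```

In this sketch, pages that never appear in any session receive no score at all, and pairs that are never accessed together are simply absent from the result.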
Since the clustering of documents for each user is based on retrieval system logs, documents (e.g., Web pages) that users found to be similar fall into the same cluster, thereby reflecting the “similarity notion” of users. As user access patterns change, the clusters will also change, giving the clusters a “dynamic” nature. And, since the log-based clustering method can be based on recent retrieval system logs for each user, the clustering can adapt to the changing tastes and perspectives of the user.
However, the log-based clustering method produces document clusters that are inherently incomplete. This is because the log-based clustering method clusters only those documents that have been accessed by at least one user. In an environment like the Internet, where millions upon millions of web pages exist, only a tiny portion would be clustered under the log-based clustering method. The remaining web pages are not clustered at all.
Accordingly, there remains a need for a document clustering method that incorporates users' perspective while accounting for documents not accessed by the user and that overcomes the disadvantages set forth previously.