The present invention relates to techniques for classifying documents and, in particular, techniques which employ implicit user feedback obtained from search engine queries to classify documents.
As the content on the World Wide Web grows, it has become increasingly difficult to manage and classify all of the documents and other content resources. Good organization is especially important for web sites, where classifying documents into cohesive and relevant topics is essential to making a site easier to navigate and more intuitive to its visitors. The high level of competition on the Web makes it necessary for web sites to improve their organization in ways that are both automatic and effective, so that users can find the resources they are looking for. Web page organization has other important applications. Search engine results can be enhanced by grouping documents into significant topics. These topics can allow users to disambiguate or narrow their searches more effectively. Moreover, search engines can personalize results by ranking more highly those results that match topics relevant to a user's profile. Other applications that can benefit from automatic topic discovery and classification are human-edited directories, such as DMOZ or Yahoo!. These directories are increasingly hard to maintain as the content of the Web grows. Automatic organization of web documents is also of interest as a means of discovering new topics.
The task of automatically clustering, labeling, and classifying documents in a web site is not an easy one. These problems are usually approached in the same way for web documents as for plain text documents, even though web documents are known to typically carry a richer set of associated information. According to such conventional approaches, documents are typically represented based on their text and, in some cases, on some kind of structural information associated with the web documents.
Generally speaking, there are two main types of structural information that can be found in web documents. One type is HTML formatting, which sometimes allows identification of important parts of a document such as, for example, the title and headings. The other type is link information between pages. The information provided by HTML formatting is not always reliable because tags are more often used for styling purposes than for content structuring. And the information given by links, although useful for general web documents, is not of much value when working with documents from a particular web site, because in such cases we cannot assume that these data are objective. That is, any information extracted from a site's structure about that same site is a reflection of the particular webmaster's criteria, which provide no guarantee of being thorough or accurate and, in some cases, might be completely arbitrary. A clear example of this is a web site that has a large amount of content and employs some kind of content management system and/or templates that give virtually the same structure to all pages and links within the site.
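By way of illustration only, the two types of structural information described above can be sketched as follows. This is a minimal example that assumes a single page supplied as an HTML string and uses only the html.parser module of the Python standard library; the names StructureExtractor and extract_structure are hypothetical and do not appear in any conventional system discussed here.

```python
from html.parser import HTMLParser


class StructureExtractor(HTMLParser):
    """Hypothetical helper: collects title/heading text and link targets."""

    # Tags whose text is treated as a structurally "important" part of the page.
    TEXT_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.fields = []       # (tag, text) pairs for the title and headings
        self.links = []        # href values of <a> elements
        self._open_tag = None  # title/heading tag currently being read, if any
        self._buffer = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag in self.TEXT_TAGS:
            self._open_tag = tag
            self._buffer = []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if self._open_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            # Collapse runs of whitespace so the text is comparable across pages.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.fields.append((tag, text))
            self._open_tag = None


def extract_structure(html):
    """Return (title_and_heading_fields, link_targets) for one HTML page."""
    parser = StructureExtractor()
    parser.feed(html)
    parser.close()
    return parser.fields, parser.links
```

As noted above, the output of such an extraction is at best a hint: heading tags are frequently used for styling, and on a templated site every page may yield nearly the same fields and links.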