This invention relates generally to organizing content, and, more particularly, to methods and systems for organizing content by validating clustered documents based on a topic purity of a common taxonomy parent category.
Organizations and businesses can receive a large number of messages from customers, potential customers, users and/or other people. For example, a business and/or organization can receive messages from its customers and potential customers, such as email messages, messages from online forums, e.g., support forums or message boards, and other types of messages. These messages can be related to a variety of different topics or issues. For example, the messages can be related to problems experienced by a user and can include a request for assistance to solve the problem. Oftentimes, these request messages are directed to a support center at the organization/business.
In addition, the Internet provides these organizations and businesses with access to a wide variety of resources, including web pages for particular topics, reviews of products and/or services, news articles, editorials and blogs. The authors of these resources can express their opinions and/or views related to a myriad of topics such as a product and/or service, politics, political candidates, fashion, design, etc. For example, an author can create a blog entry supporting a political candidate and express their praise for the candidate's position regarding fiscal matters or social issues. As another example, authors can create a restaurant review on a blog or on an online review website and provide their opinions of the restaurant using a numerical rating (e.g., three out of five stars), a letter grade (e.g., A+) and/or a description of their dining experience to indicate their satisfaction with the restaurant.
Such a large volume of documents (i.e., different types of electronic documents including text files, e-mails, images, metadata files, audio files, presentations, etc.) can be very difficult for organizations and/or businesses to manage. Entities may try to use clustering techniques to manage such a large volume of documents. Various algorithms can be used on a corpus of documents to produce different clusters of documents such that the documents within a given cluster share a common characteristic. These known clustering algorithms can be very time consuming to implement, and oftentimes provide poor results such as clusters having many unrelated documents.
In addition, businesses have been known to label a cluster based on a common characteristic shared by the documents in the cluster. A label can identify various types of information such as a subject or theme of a given cluster and therefore facilitate classification. In many of these known cases, document clusters are labeled by manual inspection where an operator retrieves samples from different clusters and labels the clusters based on information from the samples. Labeling of clusters using manual inspection is very time consuming and expensive.
Accordingly, it would be desirable to provide a computer system for organizing large volumes of electronic documents within clusters wherein the documents within each cluster relate to a particular topic, and for automatically determining a label for each created cluster.