Portals (e.g., Yahoo) arrange Web sites into a topic hierarchy to help a user find Web sites of interest. FIG. 6 illustrates a portion of an exemplary topic hierarchy. In this topic hierarchy, there is a topic entitled “Health” and a sibling topic entitled “Entertainment”. The “Health” topic has two sub-topics (or child nodes): “Diseases” and “Doctors”. The “Entertainment” topic has two sub-topics: “Soccer” and “Chess”.
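By way of illustration only, the portion of the topic hierarchy described above may be represented as a simple tree. The following is a minimal sketch; the `Topic` class, its methods, and the "Root" node are hypothetical names introduced for this example and do not appear in FIG. 6.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Topic:
    """One node of a topic hierarchy (hypothetical helper for illustration)."""
    name: str
    children: List["Topic"] = field(default_factory=list)

    def add(self, name):
        """Create a sub-topic under this topic and return it."""
        child = Topic(name)
        self.children.append(child)
        return child

    def paths(self, prefix=()):
        """Yield the path from the root to every topic in this subtree."""
        here = prefix + (self.name,)
        yield here
        for child in self.children:
            yield from child.paths(here)


# "Root" is an assumed placeholder; the text describes only a portion
# of a larger hierarchy.
root = Topic("Root")
health = root.add("Health")
entertainment = root.add("Entertainment")
health.add("Diseases")
health.add("Doctors")
entertainment.add("Soccer")
entertainment.add("Chess")

for path in root.paths():
    print(" > ".join(path))
```

In this representation, "Health" and "Entertainment" are siblings because they share the same parent, and each has two child nodes.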
Another use of a topic hierarchy is to organize content on a particular Web site. For example, HP (the assignee of the present patent application) organizes its technical notes and publications in hierarchies for ease of browsing.
Hierarchies are typically designed in the following manner. First, a user generates topics or categories into which the content may be filed, including their hierarchical relationships to one another. Second, content (e.g., web sites or technical articles) is placed under appropriate topics in the hierarchy. For example, each document is filed under one of the topics. As new documents become available, these new documents must also be filed under one of the topics. When a document does not appear to fit into any of the current topics, the user can then add new topics to the hierarchy. Similarly, the user can delete topics or modify current topics in the hierarchy or their arrangement. It is noted that whenever topics are added, deleted, or otherwise modified, the user must then evaluate whether any of the documents in the hierarchy need to be re-classified to a different topic.
As can be appreciated, this process of placing new content into the hierarchy and of maintaining the topics in a hierarchy is labor intensive. One can envision cases where it is not practical for human agents to perform the categorization of new content into the hierarchy because of the sheer volume of the documents or web sites that require categorization.
Some have proposed, and attempted to use, automated categorization programs based on text categorization technology from the field of artificial intelligence to automate the process of placing new content into the hierarchy.
Automated categorization programs that are based on machine learning operate in the following manner. First, a hierarchy of topics is provided to the automated categorization program. Second, training examples are provided to the automated categorization program. These training examples, each of which has already been classified into one of the predetermined topics, train the program to classify new content in a similar manner. Some examples of such automated categorization programs include the well-known Naïve Bayes and C4.5 algorithms, as well as commercial offerings by companies such as Autonomy Inc.
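The two steps above (providing topics, then providing training examples) may be sketched with a small multinomial Naïve Bayes classifier using add-one smoothing. This is an illustrative sketch only, not the implementation of any particular product; the `NaiveBayes` class, the `tokenize` helper, and the training sentences are all assumptions made up for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase words (a deliberately naive tokenizer)."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # topic -> word -> count
        self.doc_counts = Counter()              # topic -> training documents
        self.vocab = set()

    def train(self, examples):
        """Learn word statistics from (text, topic) training examples."""
        for text, topic in examples:
            self.doc_counts[topic] += 1
            for word in tokenize(text):
                self.word_counts[topic][word] += 1
                self.vocab.add(word)

    def classify(self, text):
        """Return the topic with the highest log-probability for the text."""
        total_docs = sum(self.doc_counts.values())
        best_topic, best_score = None, -math.inf
        for topic in self.doc_counts:
            score = math.log(self.doc_counts[topic] / total_docs)
            total_words = sum(self.word_counts[topic].values())
            for word in tokenize(text):
                count = self.word_counts[topic][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_topic, best_score = topic, score
        return best_topic


# Invented training examples, each already filed under a predetermined topic.
training_examples = [
    ("flu symptoms fever treatment", "Diseases"),
    ("virus infection fever vaccine", "Diseases"),
    ("physician clinic appointment surgeon", "Doctors"),
    ("surgeon physician referral hospital", "Doctors"),
    ("goal penalty striker league", "Soccer"),
    ("checkmate opening bishop pawn", "Chess"),
]

classifier = NaiveBayes()
classifier.train(training_examples)
print(classifier.classify("fever and infection"))
print(classifier.classify("pawn bishop"))
```

The classifier sees only word statistics drawn from the training examples; it has no notion of what a topic means beyond the words filed under it.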
Unfortunately, the quality of the categorization generated by automated categorization programs depends on how well the automated categorization programs can “interpret” the hierarchy. For example, topics or categories that are sensible to a human user may confuse an automated categorization computer program. The topics “Chess” and “Soccer” can reasonably be grouped under the parent topic “Entertainment.” However, it may be difficult, if not impossible, for an automated categorization computer program to find common words or other text that would suggest that both sub-topics “Chess” and “Soccer” should be under the topic “Entertainment.”
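The difficulty may be illustrated by comparing the vocabulary of sibling topics. The sketch below computes a simple Jaccard overlap between the word sets of two topics. It is offered only to make the "common words" problem concrete; it is not a coherence measure taken from this document, and the sample sentences, the `STOP_WORDS` list, and the function names are invented for the example.

```python
# Words assumed to carry no topical meaning (illustrative list only).
STOP_WORDS = {"the", "a", "with", "is"}


def vocabulary(docs):
    """Return the set of non-stop words appearing in a list of documents."""
    return {w for d in docs for w in d.lower().split() if w not in STOP_WORDS}


def jaccard(a, b):
    """Fraction of words shared by two vocabularies (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Invented sample documents for each sub-topic.
samples = {
    "Chess": ["white opens with the queen pawn",
              "the endgame favors the bishop pair"],
    "Soccer": ["the striker scored a late penalty",
               "the league title race is close"],
    "Diseases": ["the patient shows flu symptoms",
                 "treatment reduces the fever"],
    "Doctors": ["the physician reviewed the patient symptoms",
                "the surgeon planned the treatment"],
}

vocab = {topic: vocabulary(docs) for topic, docs in samples.items()}
entertainment_overlap = jaccard(vocab["Chess"], vocab["Soccer"])
health_overlap = jaccard(vocab["Diseases"], vocab["Doctors"])
print("Chess/Soccer overlap:", entertainment_overlap)
print("Diseases/Doctors overlap:", health_overlap)
```

With these invented samples, the sub-topics of “Health” share several words, whereas “Chess” and “Soccer” share none, even though a human user readily groups both under “Entertainment.”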
In this regard, it is desirable to have a mechanism that analyzes hierarchies and determines the quality of the arrangement of topics and corresponding documents for each place (e.g., particular topic subtree) in the hierarchy. Such a mechanism facilitates the design of hierarchies that are tailored so that automated categorization programs can place content therein in an efficient and accurate manner.
Based on the foregoing, there remains a need for a mechanism to determine a measure of coherence for the arrangement of hierarchically organized topics at each place in the hierarchy.