Text classification is increasingly used to facilitate the browsing and maintenance of large collections of web-based documents. The classifications or categories are typically defined using a hierarchical taxonomy. A taxonomy is organized into a tree-like structure that defines sub-categories within categories. Because web-based documents cover virtually any topic, taxonomies may contain thousands or even hundreds of thousands of categories. For example, the Yahoo! Directory contains approximately 300,000 categories.
The classification of web-based documents into categories can facilitate browsing by allowing search results to be organized by category or by allowing a category to be specified as a search criterion. Because it would be impractical to manually categorize millions of web-based documents, automatic document classifiers have been developed. For example, a document classifier may have a support vector machine classifier for each category. A support vector machine classifier for a category can be trained using documents that are labeled as being within the category or not within the category. To classify a document, the document is submitted to each support vector machine classifier. The document is then considered to be in the categories of those support vector machine classifiers that indicated a positive result. An example document classifier implements the “Hieron” classification technique as described in Dekel, O., Keshet, J., and Singer, Y., “Large Margin Hierarchical Classification,” Proc. of 21st Int'l Conf. on Machine Learning, Banff, Canada, 2004, which is hereby incorporated by reference. The Hieron classification technique defines a classifier for each category in terms of the classifiers of its ancestor categories. The ancestor categories of a category are the categories on the path from that category to the root category. The Hieron classification technique attempts to ensure that the margin between each correct category and each incorrect category is at least the square root of the path length between the categories. If the categories are represented as nodes of a taxonomy tree and parent-child relationships are represented by edges, then the path length is the number of edges in the shortest path between the categories. The path length serves as an indication of the correlation between two categories. Other classification techniques also use a path-length-based distance when training classifiers.
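The one-classifier-per-category scheme described above can be illustrated with a minimal sketch. It is not the implementation of any particular classifier: the trained support vector machines are replaced by hypothetical linear scorers, and the names `CLASSIFIERS`, `score`, and `classify`, as well as all weights and biases, are assumptions made up for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-ins for trained per-category SVMs: each category maps to
# (term weights, bias). All numbers are invented for illustration.
CLASSIFIERS: Dict[str, Tuple[Dict[str, float], float]] = {
    "sport": ({"match": 1.0, "team": 0.8}, -0.5),
    "wrestling": ({"wrestler": 1.5, "match": 0.4}, -1.0),
    "finance": ({"stock": 1.2, "market": 0.9}, -0.5),
}


def score(weights: Dict[str, float], bias: float, tokens: List[str]) -> float:
    """Linear decision value: bias plus the weight of every token occurrence."""
    return bias + sum(weights.get(token, 0.0) for token in tokens)


def classify(tokens: List[str]) -> List[str]:
    """Return every category whose classifier gives a positive result."""
    return [
        category
        for category, (weights, bias) in CLASSIFIERS.items()
        if score(weights, bias, tokens) > 0
    ]


if __name__ == "__main__":
    print(classify("the wrestler won the match".split()))
```

Each category is decided independently, so a document may fall into several categories or into none.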
The accuracy of classifiers that use a path-length-based distance depends in part on how well path length represents the correlation between categories. Although path length is easy to calculate, it fails to adequately represent the correlation between categories in many instances. For example, a “sport” category may have child categories of “water ballet” and “wrestling,” which are separated from each other by a path length of 2. The “wrestling” category may have a grandchild category of “Sumo wrestling” that is also separated from it by a path length of 2. Intuitively, the “wrestling” category is more similar to, or more highly correlated with, the “Sumo wrestling” category than the “water ballet” category, but the identical path lengths suggest equal correlation.
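The example above can be checked with a short sketch that computes path lengths in a small taxonomy tree. The tree is an assumption built from the categories named in the text; the intermediate category “wrestling styles” is hypothetical and is included only so that “Sumo wrestling” is a grandchild of “wrestling.” The function names are likewise illustrative, and `required_margin` simply takes the square root of the path length as described for the Hieron technique.

```python
from math import sqrt
from typing import Dict, List, Optional

# Child -> parent map for a toy taxonomy; the root has no parent.
PARENT: Dict[str, Optional[str]] = {
    "sport": None,
    "water ballet": "sport",
    "wrestling": "sport",
    "wrestling styles": "wrestling",  # hypothetical intermediate category
    "Sumo wrestling": "wrestling styles",
}


def path_to_root(category: str) -> List[str]:
    """List the category followed by each of its ancestors up to the root."""
    path = []
    node: Optional[str] = category
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def path_length(a: str, b: str) -> int:
    """Number of edges on the shortest path between two categories."""
    steps_from_a = {node: i for i, node in enumerate(path_to_root(a))}
    for steps_from_b, node in enumerate(path_to_root(b)):
        if node in steps_from_a:  # first shared node is the lowest common ancestor
            return steps_from_a[node] + steps_from_b
    raise ValueError("categories are not in the same tree")


def required_margin(a: str, b: str) -> float:
    """Square root of the path length between the two categories."""
    return sqrt(path_length(a, b))


if __name__ == "__main__":
    print(path_length("wrestling", "water ballet"))
    print(path_length("wrestling", "Sumo wrestling"))
```

Both pairs are separated by two edges, so a path-length-based distance treats them identically even though the second pair is intuitively much more closely related.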