Categorization involves assigning items (e.g., documents, products, or patients) to categories based on features of the items (e.g., which words appear in a document), possibly with an associated degree of confidence. For example, vehicle X, which has the features
number of seats = 55
color = yellow
belongs to the category “school buses” with probability 95%.
Hierarchical categorization is the problem of categorizing where the categories are organized in a hierarchy. The field's terminology has a number of common synonyms, such as:
categorization = classification, prediction
features = attributes, properties
categories = classes, subtopics
confidence = degree of belief, certainty
items = cases, examples
machine learning = supervised learning, induction
Many different systems have been developed for categorizing different types of items. The earliest systems relied on manual assignment of documents to categories, for example, by human experts. Manual assignment is still the dominant method; it is used in libraries as well as by popular Internet search engine companies.
Manual assignment is labor-intensive and therefore requires a large amount of human resources. It is also somewhat error-prone and may lead to inconsistencies when people assign documents to categories using different criteria, different interpretations of the same criteria, or different levels of expertise.
To reduce this subjectivity, rule-based assignment of documents to categories, including rules based on keywords, has been developed for use with computer systems. This approach uses rules such as “if the document contains the words ‘football’, ‘goal’, and ‘umpire’ and not the word ‘national’, then assign it to the category ‘local football.’”
Human domain experts author most of these rules, possibly with the aid of keyword identification tools (such as word counters). The rules usually consist of Boolean combinations of keyword occurrences, possibly modified by counts (such as “if the term ‘national’ is used at least 5 times then assign to ‘national baseball’”). Because the rules can be executed automatically, this approach can be used to assign documents to categories without human intervention. Examples of human-authored rule classifier systems include a topics search engine by Verity Corp. and email routing software by Kana Communications Inc.
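The two example rules quoted above can be expressed as a minimal sketch in Python. The rule representation (required words with minimum counts, plus forbidden words), the names `RULES`, `tokenize`, and `assign_categories`, and the tokenization are all illustrative assumptions; they are not taken from any of the commercial systems mentioned.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Hypothetical rule format: (category, {required word: minimum count}, forbidden words).
RULES = [
    ("local football", {"football": 1, "goal": 1, "umpire": 1}, {"national"}),
    ("national baseball", {"national": 5}, set()),
]

def assign_categories(text, rules=RULES):
    """Return every category whose rule fires for the given document text."""
    counts = Counter(tokenize(text))
    assigned = []
    for category, required, forbidden in rules:
        # Boolean AND over the required keywords, each with a minimum count.
        has_required = all(counts[word] >= n for word, n in required.items())
        # Boolean NOT over the forbidden keywords.
        has_forbidden = any(counts[word] > 0 for word in forbidden)
        if has_required and not has_forbidden:
            assigned.append(category)
    return assigned

print(assign_categories("The umpire allowed the goal in the football match."))
```

Because every rule is evaluated independently, a document may be assigned to several categories or to none, which is the behavior that makes such rule sets hard to tune.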
One disadvantage of rule-based assignment is that the accuracy of the rules is often very poor. Depending on how the rules are authored, a document is either assigned to too many categories, including many wrong ones, or to too few categories, in which case it does not appear in the categories where it belongs. Another disadvantage is that the rules are difficult to author and maintain, and the interaction of the rules (so-called “chaining”) is difficult to understand and debug, so that unexpected assignments of documents to categories may occur.
Categorizers may be built manually by people authoring rules/heuristics, or else built automatically via machine learning, wherein categorizers are induced based on a large training set of items. Each item in the training set is typically labeled with its correct category assignment. The use of predefined categories implies a supervised learning approach to categorization, where already-categorized items are used as training data to build a model for categorizing new items. Appropriate labels can then be assigned automatically by the model to new, unlabeled items depending on which category they fall into. Typically, the larger the training set, the better the categorization accuracy. However, it typically costs something (e.g., human labeling effort) to prepare the training set.
Examples of machine learning algorithms include the well-known Naïve Bayes and C4.5 algorithms, support vector machines, and commercial offerings such as those of Autonomy Inc. and Moho Mine Inc.
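As an illustration of inducing a categorizer from a labeled training set, the following is a minimal sketch of a multinomial Naïve Bayes text categorizer with add-one smoothing. The toy training set, the function names, and the choice to ignore words never seen in training are assumptions made for this example; a practical system would use a far larger training set.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Count documents and words per category from (words, category) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, category in labeled_docs:
        doc_counts[category] += 1
        word_counts[category].update(words)
        vocabulary.update(words)
    return doc_counts, word_counts, vocabulary

def categorize(words, model):
    """Return (most probable category, its posterior probability)."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)  # log prior
        for word in words:
            if word not in vocabulary:
                continue  # assumption: skip words unseen during training
            # Add-one (Laplace) smoothed word likelihood.
            likelihood = (word_counts[category][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        log_scores[category] = score
    # Normalize the log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(weights.values())
    best = max(weights, key=weights.get)
    return best, weights[best] / norm

# Hypothetical labeled training set: each item carries its correct category.
TRAINING_SET = [
    (["football", "goal", "umpire"], "sports"),
    (["goal", "match", "team"], "sports"),
    (["election", "vote", "senate"], "politics"),
    (["vote", "policy", "senate"], "politics"),
]

model = train_naive_bayes(TRAINING_SET)
category, confidence = categorize(["goal", "team"], model)
print(category, round(confidence, 3))
```

The returned probability plays the role of the degree of confidence described earlier, although Naïve Bayes probabilities are known to be poorly calibrated and are better treated as a ranking score.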
One type of categorizer that can be induced by such machine learning algorithms is a top-down hierarchical categorizer (also referred to as a Pachinko classifier). A top-down hierarchical categorizer typically considers a topic hierarchy one level at a time. At each level, there are typically one or more categorizers that, when assigned a document, pick a category at the next level based on features of the document.
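The level-by-level descent can be sketched as follows. The `Node` class, the example hierarchy, and the keyword-overlap function `pick_child` are hypothetical; `pick_child` merely stands in for whatever induced categorizer would sit at each node of the topic hierarchy.

```python
class Node:
    """A topic in the hierarchy, with illustrative keywords and child topics."""
    def __init__(self, name, keywords=(), children=()):
        self.name = name
        self.keywords = set(keywords)
        self.children = list(children)

def pick_child(node, words):
    """Stand-in for a learned local categorizer: pick the child with most keyword overlap."""
    return max(node.children, key=lambda child: len(child.keywords & words))

def categorize_top_down(root, words):
    """Descend one level at a time until a leaf is reached; return the path of topics."""
    words = set(words)
    path = [root.name]
    node = root
    while node.children:
        node = pick_child(node, words)
        path.append(node.name)
    return path

# Hypothetical two-level topic hierarchy.
root = Node("root", children=[
    Node("sports", ["team", "goal", "umpire", "football", "baseball", "pitcher"], [
        Node("football", ["football", "goal"]),
        Node("baseball", ["baseball", "pitcher", "inning"]),
    ]),
    Node("politics", ["election", "vote", "senate", "bill"], [
        Node("elections", ["election", "vote"]),
        Node("legislation", ["senate", "bill"]),
    ]),
])

print(categorize_top_down(root, ["football", "team", "goal"]))
```

Note that a wrong choice at an upper level cannot be corrected lower down in this sketch, since each decision commits the document to one subtree.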
A major barrier to using machine-learning categorization technology is that it requires a significant amount of training data, the gathering of which involves significant costs, delays and/or human labor.