Categorization is the problem of assigning items (e.g. documents, products, clients, etc.) into categories based on features of the items (e.g. which words appear in a document), and possibly subject to a degree of confidence. For example: vehicle X which has the features
        number of seats=55
        color=yellow
belongs to the category “school buses” with probability 95%.
The field's terminology has a number of common synonyms:
        categorization=classification, prediction
        features=attributes, properties
        (sub)categories=(sub)classes, (sub)topics
        confidence=degree of belief, certainty, probability
        items=cases, examples
        machine learning=induction.
Categorizers may be built manually by people authoring rules/heuristics, or else built automatically via machine learning, which induces a categorizer based on a large training dataset of items, where each item is labeled with its correct category assignment. Typically, the larger the training dataset, the better the classification accuracy; however, gathering the training set usually has a cost (human labeling effort). In the earliest stages of collecting a training set, human-authored rules will typically have better accuracy than machine learning; however, as more training data becomes available, the accuracy of machine-learning algorithms improves (since they learn from that additional training data), and eventually may surpass what is practical with human-authored rules.
Examples of machine-learning algorithms include the well-known naïve Bayes and C4.5 algorithms (or a so-called “stacked” combination of two or more such algorithms), and commercial offerings such as those of Autonomy Inc. and Moho Mine Inc. A major barrier to using machine-learning algorithms is that they require a significant amount of training data in order to achieve optimal performance, and gathering that data can be costly and/or labor-intensive.
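By way of illustration only, the following minimal sketch shows how a naïve Bayes categorizer of the kind mentioned above might be induced from labeled training items and then report a category together with a degree of confidence. The function names, the add-one smoothing, and the four-item training set are assumptions made for this example and are not taken from any particular product or publication.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_items):
    """Induce a model from (list_of_words, category) training pairs."""
    class_counts = Counter()            # number of training items per category
    word_counts = defaultdict(Counter)  # word frequencies per category
    vocabulary = set()
    for words, category in labeled_items:
        class_counts[category] += 1
        word_counts[category].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def classify(model, words):
    """Return (most probable category, its posterior probability)."""
    class_counts, word_counts, vocabulary = model
    total_items = sum(class_counts.values())
    log_scores = {}
    for category, n_items in class_counts.items():
        total_words = sum(word_counts[category].values())
        score = math.log(n_items / total_items)  # class prior
        for word in words:
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            # Add-one smoothing (an assumption of this sketch).
            score += math.log((word_counts[category][word] + 1)
                              / (total_words + len(vocabulary)))
        log_scores[category] = score
    # Normalize the log scores into probabilities.
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnormalized.values())
    probabilities = {c: v / z for c, v in unnormalized.items()}
    best = max(probabilities, key=probabilities.get)
    return best, probabilities[best]

# Hypothetical training data, far smaller than any practical training set.
TRAINING = [
    (["yellow", "bus", "seats"], "school buses"),
    (["yellow", "bus", "children"], "school buses"),
    (["red", "coupe", "fast"], "sports cars"),
    (["red", "convertible", "fast"], "sports cars"),
]
model = train_naive_bayes(TRAINING)
category, confidence = classify(model, ["yellow", "bus"])
```

With so few training items the reported confidence is only as reliable as the smoothed word counts behind it, which is the training-data barrier noted above.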
Examples of human-authored rule classifier systems include the topics search engine by Verity Corp., and email routing software by Kana Communications Inc. In principle, human-authored rule-based algorithms can be applied to classification problems where no training data are available, but they may have high maintenance costs and sometimes inferior performance compared to machine-learning approaches, because they do not learn to improve themselves and do not take advantage of available training data. Construction and maintenance of such human-authored rules requires substantial domain knowledge and is labor-intensive. A particularly simple example of a rule-based classifier is a list of distinctive keywords for each class, with the first matching keyword in an item being used to classify that item. Alternatively, one may prefer the category for which the largest number of keywords match the item.
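The two keyword-list variants just described can be sketched as follows. The keyword lists, the function names, and the sample item are hypothetical. The sketch also assumes that “first matching keyword” means the first word of the item, in reading order, that appears on any keyword list, and that the second variant counts distinct keywords rather than repeated occurrences.

```python
# Hypothetical human-authored keyword lists, one per category.
KEYWORDS = {
    "sports": ["goal", "match", "league"],
    "finance": ["stock", "market", "bond"],
}

def classify_first_match(words, keywords=KEYWORDS):
    """Category of the first word in the item that is a listed keyword."""
    lookup = {kw: cat for cat, kws in keywords.items() for kw in kws}
    for word in words:
        if word in lookup:
            return lookup[word]
    return None  # no keyword matched; the item stays uncategorized

def classify_most_matches(words, keywords=KEYWORDS):
    """Category whose keyword list has the most distinct keywords in the item."""
    present = set(words)
    counts = {cat: sum(1 for kw in kws if kw in present)
              for cat, kws in keywords.items()}
    best = max(counts, key=counts.get)  # ties go to the category listed first
    return best if counts[best] > 0 else None

item = "stock fell after the league final goal and the match".split()
first = classify_first_match(item)   # decided by "stock" alone
most = classify_most_matches(item)   # decided by three sports keywords
```

The sample item shows how the two variants can disagree: a single early keyword decides the first, while the overall weight of matching keywords decides the second.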
Hierarchical categorization involves a hierarchy of categories wherein at least some of the items to be categorized are to be assigned not only to certain categories, but also to certain subcategories within those categories. Well-known examples of hierarchical categorization are the Dewey Decimal system and the Library of Congress subject headings used by many libraries to organize their book collections. By utilizing a hierarchical structure, a complex classification problem can be decomposed into a number of simpler sub-problems. A top-down approach to hierarchical classification starts with a few classes which are further refined into a number of subclasses. Further details of known hierarchical classification methodology may be found in the article “Hierarchical Classification of Web Content” by Susan Dumais and Hao Chen, which was presented Jul. 24-28, 2000 in Athens, Greece and published in SIGIR 2000: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval and which is hereby incorporated by reference in its entirety. A plurality of categorization methods can be applied to categorization sub-problems in a top-down (also known as “Pachinko”) approach, using a sub-classifier at each internal node to select which child branch to follow. The overall hierarchical categorization mechanism in effect combines the results of the local categorization methods on the sub-problems. The local categorization algorithms output their results through a standard interface, so that these various intermediate results can be combined by an overall categorization processor that does not need to know which local categorization method produced them. The overall categorization processor takes an item, delegates it to one or more local categorization methods (possibly a series of these, depending on the results of each), and combines their results.
It can then report for the item and for zero or more classes whether the item belongs to the class, possibly including a degree of confidence (such as a probability). An optional extension is that it can report, for an item and for a class X with zero or more subclasses, whether the item belongs to the set consisting of the class X and all of the subclasses of X, again possibly subject to a degree of confidence such as a probability. Such a hierarchical structure can potentially be used to advantage in trainable classifiers, by using the larger volume and hence greater accuracy of statistical training data (in particular, relative frequency of particular words and phrases) at a parent node to smooth and extrapolate the more limited and less accurate data of the same kind available at a given child node.
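The top-down (“Pachinko”) arrangement described above can be sketched as follows. This is an illustrative assumption rather than a definitive implementation: the standard interface is taken to be a function that maps an item to a probability for each child class, and the names `Node`, `categorize`, `by_seats`, and `by_color`, together with all of the probabilities shown, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Assumed standard interface: a local categorizer maps an item to
# {child class name: probability}. How it computes them is its own business.
LocalCategorizer = Callable[[dict], Dict[str, float]]

@dataclass
class Node:
    name: str
    children: Dict[str, "Node"] = field(default_factory=dict)
    local: Optional[LocalCategorizer] = None  # None at leaf nodes

def categorize(root: Node, item: dict) -> List[Tuple[str, float]]:
    """Follow the most probable child at each internal node.

    Returns (class name, confidence) for every class on the chosen path.
    Each confidence is the product of the local probabilities so far, i.e.
    an estimate that the item belongs to that class or one of its subclasses.
    """
    path = []
    node, confidence = root, 1.0
    while node.children and node.local is not None:
        scores = node.local(item)            # delegate to the local method
        child_name = max(scores, key=scores.get)
        confidence *= scores[child_name]     # combine the local results
        node = node.children[child_name]
        path.append((node.name, confidence))
    return path

def by_seats(item):
    """A hand-authored rule used at the root (illustrative numbers)."""
    if item["seats"] > 9:
        return {"buses": 0.9, "cars": 0.1}
    return {"buses": 0.05, "cars": 0.95}

def by_color(item):
    """Stand-in for a trained categorizer at the 'buses' node."""
    if item["color"] == "yellow":
        return {"school buses": 0.95, "city buses": 0.05}
    return {"school buses": 0.1, "city buses": 0.9}

root = Node("vehicles", {
    "buses": Node("buses", {
        "school buses": Node("school buses"),
        "city buses": Node("city buses"),
    }, by_color),
    "cars": Node("cars"),
}, by_seats)

path = categorize(root, {"seats": 55, "color": "yellow"})
```

Because `categorize` relies only on the assumed interface, a rule-based method and a trained method can sit at different nodes of the same hierarchy without the overall processor distinguishing between them.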
A “bootstrap” methodology may sometimes be used to improve the performance of a simple keyword-based categorizer by using the simple categorizer's output as training data for machine learning. In the particular case of a hierarchical categorizer in which the nodes at the upper levels of the hierarchy will each process more training data than the nodes at the lower levels and thus will tend to make more reliable decisions, a statistical technique known as “shrinkage” may be used to refine the statistically less reliable results at a lower level by combining them with the more reliable probabilities associated with decisions at a higher level. Further details of a known hierarchical bootstrap methodology may be found in the article “Text Classification by Bootstrapping with Keywords, EM and Shrinkage” by Andrew McCallum and Kamal Nigam, which was presented in 1999 at the ACL '99 Workshop for Unsupervised Learning in Natural Language Processing, and which is hereby incorporated by reference in its entirety.
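As a rough illustration of shrinkage, the sketch below interpolates a word-frequency estimate at a sparsely populated child node with the estimate at its parent, whose counts pool the data of all its children. The function names and word counts are hypothetical, and the interpolation weight is simply fixed by hand here as a simplifying assumption; in practice such weights would ordinarily be estimated from data.

```python
from collections import Counter

def word_probability(word, counts):
    """Relative frequency of a word among the counts seen at one node."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

def shrunk_probability(word, child_counts, parent_counts, weight_child=0.5):
    """Linear interpolation of the child estimate toward the parent estimate.

    weight_child is a hand-picked constant in this sketch (an assumption).
    """
    return (weight_child * word_probability(word, child_counts)
            + (1.0 - weight_child) * word_probability(word, parent_counts))

# Hypothetical word counts: the child has seen very little training data.
child = Counter({"yellow": 1, "bus": 1})
sibling = Counter({"bus": 6, "route": 2})
parent = child + sibling  # the parent node pools its children's data

raw_route = word_probability("route", child)               # never seen at child
shrunk_route = shrunk_probability("route", child, parent)  # borrowed from parent
shrunk_bus = shrunk_probability("bus", child, parent)
```

The word “route” never occurs in the child's own data, so its raw estimate is zero; after shrinkage it receives a small non-zero probability drawn from the parent's larger sample.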
“Stacking” is a known technique for combining statistical results from multiple machine-learning algorithms to make a particular classification decision. In other machine-learning applications, the selection of an appropriate algorithm is based on a priori knowledge of the data being categorized, or is determined experimentally using known training data.
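One simple way to picture stacking is sketched below: the predictions of several base categorizers on held-out labeled items become the input to a second-level combiner, which learns which combinations of base predictions correspond to which true category. The lookup-table combiner, the two base categorizers, and the held-out data are illustrative assumptions; practical stacking systems typically train a statistical model over the base outputs instead of a table.

```python
from collections import Counter, defaultdict

def train_stacker(base_classifiers, held_out):
    """Learn, for each tuple of base predictions, the most common true label."""
    table = defaultdict(Counter)
    for item, true_label in held_out:
        key = tuple(clf(item) for clf in base_classifiers)
        table[key][true_label] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in table.items()}

def stacked_classify(base_classifiers, meta_table, item):
    """Combine the base predictions using the learned table."""
    key = tuple(clf(item) for clf in base_classifiers)
    # For an unseen combination, fall back to the first base categorizer.
    return meta_table.get(key, key[0])

# Two hypothetical base categorizers that disagree on some items.
def by_seats(item):
    return "bus" if item["seats"] > 9 else "car"

def by_color(item):
    return "bus" if item["color"] == "yellow" else "car"

BASE = [by_seats, by_color]

# Hypothetical held-out items labeled with their correct category.
HELD_OUT = [
    ({"seats": 55, "color": "yellow"}, "bus"),
    ({"seats": 4, "color": "yellow"}, "car"),
    ({"seats": 5, "color": "yellow"}, "car"),
    ({"seats": 40, "color": "blue"}, "bus"),
    ({"seats": 2, "color": "red"}, "car"),
]

meta_table = train_stacker(BASE, HELD_OUT)
yellow_taxi = stacked_classify(BASE, meta_table, {"seats": 4, "color": "yellow"})
green_coach = stacked_classify(BASE, meta_table, {"seats": 60, "color": "green"})
```

On this data the combiner learns that when the two base categorizers disagree, the seat-count prediction is the one that matches the true label, so it effectively selects between algorithms from held-out evidence rather than from a priori knowledge.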