The following relates to the information processing arts, object categorization arts, and related arts.
Multi-class categorization of objects, such as documents, images, or so forth, is a task that is advantageously automated. In a typical approach, a training set of objects is collected, and each object of the training set is assigned to one or more classes to which the object belongs. In a multi-class task, the classes are selected from a set of classes. The class assignments are typically done manually for the training set (but they can come from other sources as well). The labeled training set is used to train a categorizer. A categorizer typically includes a set of statistical tables (encoding manual input labeling) plus a runtime algorithm for interpreting those tables, with both the tables and the algorithm suitably embodied by a computer or other processing device, and/or by memory or other storage. The categorizer is designed to extract, or receive as input, features of an input object, and the categorizer is trained or otherwise configured based on the labeled training objects to assign the input object to one or more classes based on the extracted or received features of the input object. In soft categorization, a given input object may be assigned a degree of membership in different classes, with the degree of membership being in the range [0,1]. In hard categorization, a given object is either wholly assigned to a given class or wholly excluded from a given class. In other words, the output of a hard classifier for a given input object and a given class is binary, e.g. “0” or “1”, or “yes” or “no”, or so forth. A hard categorizer may be derived from a soft categorizer by adding a layer of processing that receives the soft classification and makes a binary “yes/no”-type membership decision for each class.
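By way of illustration only, the derivation of a hard categorizer from a soft categorizer may be sketched as follows. The function name `binarize`, the class labels, the membership values, and the default threshold of 0.5 are all hypothetical and are chosen solely for this example; a single fixed threshold is only one possible binary decision mechanism.

```python
from typing import Dict


def binarize(soft: Dict[str, float], threshold: float = 0.5) -> Dict[str, bool]:
    """Turn soft membership degrees in [0, 1] into yes/no decisions per class.

    The threshold value is an assumption made for illustration; it is not
    prescribed by the text above.
    """
    return {label: degree >= threshold for label, degree in soft.items()}


# Hypothetical soft categorizer output for one input object.
soft_scores = {"invoice": 0.82, "contract": 0.40, "memo": 0.05}

# Hard categorization: each class is either wholly assigned or wholly excluded.
hard = binarize(soft_scores)
```

In this sketch the soft categorizer itself is left untouched; only the added decision layer determines which classes are finally assigned.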
The training process has a substantial effect on performance of the multi-class categorizer. Indeed, to a large degree the training defines the categorizer. Multi-class categorizer performance is determined or affected by numerous factors that are established during training. These factors include the quality or characteristics of the training set, as well as the constraints, objectives, trade-offs, or other considerations embodied in the objective function or other decisional mechanism that is optimized or configured by the training, such as a recall/precision performance trade-off that may be applied or optimized during the training. The training process is also computationally complex and time consuming. Accordingly, one typically would like to train the categorizer once, using a substantial training set of objects that are representative of characteristics of input objects that are expected to be encountered by the categorizer.
In practice, however, the constraints, objectives, trade-offs, or other considerations under which the trained multi-class categorizer is applied may differ from the constraints, objectives, trade-offs, or other considerations employed during the training. For example, if the training set is constructed from objects obtained from the global Internet, but the trained categorizer is applied to input objects taken from a corporate database having statistical characteristics differing from those of the global Internet, one can expect the applied categorizer performance to be less than ideal. Similarly, if assumptions or constraints applied during training are different from those the end-user wants to apply, performance will likely suffer. For example, if the training included a constraint that the precision be greater than a certain threshold, then the precision exhibited by the trained categorizer in categorizing input objects that are statistically similar to the training set is likely to be similar to the constraint threshold used in the training. However, the end-user may want a higher precision than the threshold value used in the training, or conversely may be willing to accept a lower precision in return for improvements elsewhere, such as in recall.
One solution to such problems is to re-train the categorizer using a new set of training objects derived from the source of interest (e.g., the corporate database in the previous example), and/or using parameter constraints that comport with the performance the end-user wishes to obtain. However, such re-training is computationally intensive and time-consuming. Re-training also fails to make use of the (possibly extensive) training that the categorizer initially underwent. Moreover, an end-user who is not experienced in, or does not adequately understand, the categorization training process may make mistakes in the re-training (for example, by using a training set of insufficient size or insufficient diversity of characteristics) that result in the re-trained categorizer having degraded performance as compared with the initially trained categorizer.
Still further, in some circumstances the end-user may not receive or have access to the software or programmed processor used to perform the training process. For example, in one business model the end-user provides a categorizer manufacturer with a set of training data, and the manufacturer performs the training and delivers only the trained categorizer to the end-user, but not the training system components. In such a circumstance, the end-user does not have the requisite tools to perform re-training, and may be unwilling or unable to go back to the manufacturer to have the re-training performed by the manufacturer.
As yet another consideration, in soft categorization a given input object is associated with a probability of belonging to each class. These soft probabilities are optionally transformed into hard categorization by adding a layer of processing that receives the soft classification probabilities for each class and “binarizes” the results by using thresholding or another binary decision mechanism to assign a single class (or optionally several classes) to the input object. This optional assignment process is dependent upon the constraints of precision and recall that the user may expect. Conversely, the assignment process impacts the final precision and recall of the categorizer, making it advantageous to have a mechanism for controlling and driving this interdependency between recall and precision so as to tune the categorizer performance.
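The interdependency between the binarization step and the resulting precision and recall may be illustrated, for a single class, by the following sketch. The function name `precision_recall`, the scores, the ground-truth labels, and the two threshold values are hypothetical and serve only as an example. Reporting a value of 1.0 when a denominator is zero is a convention assumed here, not one taken from the text above.

```python
from typing import List, Tuple


def precision_recall(
    scores: List[float], labels: List[bool], threshold: float
) -> Tuple[float, float]:
    """Precision and recall for one class after thresholding soft scores."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))        # true positives
    fp = sum(p and not y for p, y in zip(predicted, labels))    # false positives
    fn = sum((not p) and y for p, y in zip(predicted, labels))  # false negatives
    # Assumed convention: an empty denominator yields 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


# Hypothetical soft probabilities for one class and the true memberships.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, False]

low = precision_recall(scores, labels, threshold=0.35)   # permissive threshold
high = precision_recall(scores, labels, threshold=0.70)  # strict threshold
```

With these illustrative data, the permissive threshold gives full recall at reduced precision, whereas the strict threshold gives full precision at reduced recall. The same soft categorizer thus exhibits different precision and recall depending only on the binary decision mechanism applied to its output.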