The exemplary embodiment relates to the classification arts. It finds particular application in connection with the classification of samples where fewer than all features that are available at training time are used for the classification task.
Automated methods have been developed for classifying samples, such as scanned document images, based on features extracted from the samples. Parameters of a classifier model are learned during a training stage, using a set of training samples which have been labeled, often manually, according to their class. For each training sample, a representation of the sample, such as a multidimensional vector, is generated based on features extracted from the sample. The representation, together with the respective label, forms the training data for the classifier model. At test time, when a test sample is to be labeled, a representation of the test sample is input to the trained classifier model, or to a set of classifier models, and the most probable class label is assigned to the test sample, based on the output of the classifier model(s), or labels are assigned probabilistically over all classes.
Various types of features have been used for this task. In the case of text documents, for example, features which have been used include word frequency features, layout features, run-length histograms, and the like. Such features generally entail different costs, which can be computational costs or monetary costs. Optical character recognition (OCR), for example, which is used to identify the words in the sample for computing word frequency or other textual features, can be both computationally and financially costly. The computational cost of performing OCR on a single document page can range from a few hundred milliseconds to a few seconds, depending on the number of words/characters on the page as well as on the quality of the document. The computational cost of extracting run-length histograms is much lower, on the order of a few tens of milliseconds per page. In terms of monetary cost, there may be license fees or equipment running costs for the feature extraction technology.
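By way of illustration only, the training and test stages described above can be sketched in Python, using a simplified run-length histogram as the extracted feature. The function names (`run_length_histogram`, `train`, `classify`), the `max_run` parameter, the toy binary images, and the choice of a nearest-centroid classifier are all assumptions made for this sketch; they are not drawn from any particular classification system.

```python
from collections import defaultdict
from math import dist


def run_length_histogram(rows, max_run=4):
    """Normalized histogram of horizontal runs of black pixels (value 1).

    Assumption for this sketch: runs longer than max_run fall in the last bin.
    """
    counts = [0] * max_run
    for row in rows:
        run = 0
        for pixel in list(row) + [0]:  # trailing 0 closes any open run
            if pixel == 1:
                run += 1
            elif run > 0:
                counts[min(run, max_run) - 1] += 1
                run = 0
    total = sum(counts) or 1  # avoid division by zero on blank images
    return [c / total for c in counts]


def train(samples, labels):
    """Training stage: learn one mean feature vector (centroid) per class."""
    grouped = defaultdict(list)
    for sample, label in zip(samples, labels):
        grouped[label].append(run_length_histogram(sample))
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def classify(model, sample):
    """Test time: assign the label of the nearest class centroid."""
    features = run_length_histogram(sample)
    return min(model, key=lambda label: dist(model[label], features))


# Toy labeled training samples (hypothetical binary images).
dotted = [[1, 0, 1, 0, 1, 0], [0, 1, 0, 1, 0, 1]]
barred = [[1, 1, 1, 1, 0, 0], [0, 1, 1, 1, 1, 1]]
model = train([dotted, barred], ["dotted", "barred"])

print(classify(model, [[1, 0, 0, 1, 0, 1]]))
print(classify(model, [[1, 1, 1, 1, 1, 0]]))
```

In this sketch the same inexpensive feature is extracted at training time and at test time. A textual feature would slot into the same pipeline but would require an OCR step before the representation could be computed.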
A problem which arises is that there is often a trade-off between the cost of extracting features and the classification accuracy they provide. Sometimes, for practical reasons, those features which lead to the best accuracy cannot be included in production because their cost is too high. This can result in a loss of information and in a degradation of the classification accuracy. For example, the computational cost associated with OCR may make the use of textual features impractical in document workflows where the scanning rate is high, even though such features may provide higher accuracy than available non-textual classification methods.
The present embodiment provides a system and method which enable a more costly feature or features to be employed during training while only a less costly feature or features are used at test time. This can yield more accurate results than when only the less costly feature(s) are used during training.