In recent years, machine learning applications, in which a computer learns from a set of examples to perform a recognition task, have become increasingly popular. A task typically performed by these applications is classification, such as automatically classifying documents under one or more topic categories. This technology is used in filtering, routing, and filing information, such as news articles or web pages, into topical directories or e-mail inboxes. For example, text documents may be represented using a fixed set of attributes, each representing the number of times a particular keyword appears in the document. Using an induction algorithm, also referred to as a classifier learning algorithm, that examines the input training set, the computer ‘learns’ or generates a classifier, which is able to classify a new document under one or more categories. In other words, the machine learns to predict whether a text document input into the machine, usually in the form of a vector of predetermined attributes describing the text document, belongs to a category. When a classifier is being trained, the classifier parameters are determined by examining a training set of objects that have been assigned labels indicating the category to which each object belongs. After the classifier is trained, its goal is to predict the category to which a new object provided for classification belongs.
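The representation and training steps described above can be illustrated with a minimal sketch. The keyword list, the training documents, and the nearest-centroid learner are all assumptions chosen for illustration; the text does not name a particular induction algorithm, and any classifier learning algorithm could take the place of the one shown here.

```python
import re
from collections import Counter

# Hypothetical keyword attributes; the source does not name specific keywords.
KEYWORDS = ["merger", "earnings", "goal", "match"]

# Hypothetical labeled training set of (text, category) pairs.
TRAINING_SET = [
    ("The merger lifted earnings", "business"),
    ("Earnings beat forecasts after the merger", "business"),
    ("A late goal won the match", "sports"),
    ("The match ended with a goal", "sports"),
]

def to_feature_vector(text, keywords=KEYWORDS):
    """Represent a document as the number of times each keyword appears."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    return [tokens[k] for k in keywords]

def train_centroids(training_set):
    """Learn one mean feature vector (centroid) per category.

    This nearest-centroid learner is a stand-in for the induction algorithm.
    """
    sums, counts = {}, Counter()
    for text, label in training_set:
        vec = to_feature_vector(text)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def classify(text, centroids):
    """Predict the category whose centroid is closest to the document vector."""
    vec = to_feature_vector(text)

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))
```

Under these assumptions, calling `train_centroids(TRAINING_SET)` plays the role of training, and `classify("Quarterly earnings rose", centroids)` predicts the category of a new document from its keyword counts.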
In the field of machine learning, trained classifiers may be used to obtain a count of the number of unlabeled objects that are classified in a particular category. In such applications, the counts themselves are of interest rather than the individual classification of each item. As an example, an automated classifier may be used to estimate how many documents in a business news wire are related to a particular company of interest. Another example is where a news company uses a classifier to determine under which major topic each incoming news article should be filed. In order to determine the percentage of articles filed under one particular category each month, one could count how many articles are predicted by the classifier to belong in this category. This allows the relative level of interest in a particular topic to be tracked.
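The counting use described above may be sketched as follows. The function names and the `toy_classifier` rule are hypothetical; the rule stands in for any trained classifier that maps a document to a category.

```python
from collections import Counter

def count_by_category(documents, classify):
    """Tally the predicted category of each unlabeled document."""
    return Counter(classify(doc) for doc in documents)

def category_share(documents, classify, category):
    """Fraction of documents that the classifier files under `category`."""
    counts = count_by_category(documents, classify)
    total = sum(counts.values())
    return counts[category] / total if total else 0.0

def toy_classifier(doc):
    """Hypothetical stand-in for a trained classifier (assumption)."""
    return "business" if "earnings" in doc.lower() else "other"
```

Applied to one month of articles, `category_share(articles, toy_classifier, "business")` gives the observed percentage for that category. Note that this figure reflects the classifier's predictions, not necessarily the true category membership.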
A problem with present automated classifiers is that, in practice, they make mistakes when assigning objects to categories, and these mistakes do not always cancel one another out. For example, so-called false positives, instances of mistakenly assigning an object to a category, are not always offset by so-called false negatives, instances of mistakenly failing to assign an object to a category. Instead, classification errors tend to be biased in one direction or the other, so it is difficult to obtain an accurate count of the number of objects that should be classified under a particular category.
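The effect of unbalanced errors on a count can be made concrete with a small sketch. The function name and the example labels are hypothetical. The sketch relies only on the fact that the predicted count equals the true count plus the false positives minus the false negatives, so the count is accurate only when the two kinds of error happen to be equal in number.

```python
def count_error(true_labels, predicted_labels, category):
    """Compare the predicted count for `category` with the true count."""
    pairs = list(zip(true_labels, predicted_labels))
    # False positives: assigned to the category but not truly in it.
    false_pos = sum(1 for t, p in pairs if p == category and t != category)
    # False negatives: truly in the category but not assigned to it.
    false_neg = sum(1 for t, p in pairs if p != category and t == category)
    true_count = sum(1 for t, _ in pairs if t == category)
    predicted_count = sum(1 for _, p in pairs if p == category)
    return {
        "true_count": true_count,
        "predicted_count": predicted_count,
        "false_positives": false_pos,
        "false_negatives": false_neg,
        "bias": predicted_count - true_count,  # equals false_pos - false_neg
    }

# Hypothetical example: 3 of 10 objects truly belong to category "A".
TRUE = ["A", "A", "A", "B", "B", "B", "B", "B", "B", "B"]
PRED = ["A", "A", "B", "A", "A", "A", "B", "B", "B", "B"]
```

In this made-up example, one false negative and three false positives lead `count_error(TRUE, PRED, "A")` to report a predicted count of 5 against a true count of 3.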