To estimate the size of subsets of data within a dataset, an automated classifier may be used to predict which data items belong to a particular subset of interest. As an example, an automated classifier may be used to estimate how many documents in a business news wire relate to a particular company of interest. As another example, a news company may use a classifier to determine under which major topic each incoming news article should be filed. To determine the percentage of articles filed under one particular category last year, one could count how many articles the classifier predicted to belong in that category. Such counts are advantageous because they allow the relative level of interest in a particular topic to be tracked. Once the classifier has determined the topic under which each article should be filed, the results of the automated categorization are aggregated to give overall estimates of the number of articles in each category. These estimates are then used in reports to media relations teams.
Another application of automated classifiers is estimating how many genes in a database are predicted to exhibit some property. It can be extremely important to scientists and business analysts to obtain the best possible estimates.
In the field of machine learning, trained classifiers may be used for the purpose of counting how many items in a new (unlabeled) batch fall into each of several classes. In such applications the actual counts are of particular interest, rather than the individual classification of each item.
A problem with present automated classifiers is that, in practice, they make mistakes when assigning items or subsets of data to categories. Of primary concern in this invention is the actual number of items in a particular category, rather than the content of each item assigned to it. In other words, it is advantageous to know how frequent a particular category is without necessarily knowing the category of any particular record. The mistakes made by the classifier do not always cancel one another out (that is, false positives are not always offset by false negatives). Depending on the calibration and training of the classifier, its misclassifications may skew the observed frequency of items assigned to a category in either direction, away from what the frequency would be if all items were assigned correctly. This results in bias in the estimate of the size of a category of interest.
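The direction of this bias can be illustrated with a small numerical sketch. It is offered only as an illustration under stated assumptions, not as a description of any particular classifier: the function name and the true positive and false positive rates below are hypothetical, and the rates are assumed to be the same whatever the true frequency of the category.

```python
def expected_observed_fraction(true_fraction, tpr, fpr):
    """Expected fraction of items a classifier assigns to a category.

    true_fraction: actual share of items that belong to the category
    tpr: true positive rate (assumed value, hypothetical classifier)
    fpr: false positive rate (assumed value, hypothetical classifier)
    """
    # Correctly detected members plus non-members wrongly assigned.
    return tpr * true_fraction + fpr * (1.0 - true_fraction)


# Hypothetical classifier: detects 90% of true members and wrongly
# assigns 5% of non-members to the category.
for p in (0.05, 0.30, 0.80):
    observed = expected_observed_fraction(p, tpr=0.90, fpr=0.05)
    print(f"true fraction {p:.2f} -> expected observed fraction {observed:.4f}")
```

With these assumed rates, a category that truly makes up 5% of the items would be reported at about 9.25%, an overestimate, while a category that truly makes up 80% would be reported at about 73%, an underestimate. The false positives and false negatives balance only at one particular true frequency (here one third), which is why simple counting of classifier outputs is generally biased.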