An automated classifier configured to predict which data items of a data set belong to one or more conceptual category subsets has been used to extract estimates of the subset sizes. An example of an automated classifier is one that is used to predict, or classify, whether a given document in a business news wire is related to a particular company of interest. Another example is an automated classifier used to determine under which topic each incoming news article should be filed.
In order to determine the percentage of articles filed under one particular category, the number of articles predicted by the classifier to belong in that category could be counted, and then divided by the total number of documents considered to obtain the percentage. Once the classifier has determined in which topic a particular article should reside, the results of the automated categorization are then aggregated to give overall estimates of the number of articles in each category. The number of articles belonging to a particular category is counted to, for instance, track the relative level of interest in a particular topic.
Automated classifiers have also been used by scientists and business analysts to obtain estimates of the number of items belonging to various categories. For instance, automated classifiers have been used to estimate how many genes in a database are predicted to exhibit some property.
Automated classifiers are typically trained with a training set of labeled cases (that is, cases for which the true category is known) to provide the classifier with the ability to determine whether a new item belongs to a particular category. Generally, providing relatively larger numbers of labeled training cases improves the accuracy of the classifiers. It is often difficult, however, to provide classifiers with a relatively larger number of labeled training cases because acquiring labeled training cases is usually associated with relatively high costs and typically requires a large amount of human effort. In addition, for relatively difficult classification problems, no amount of labeled training cases yields a perfectly accurate classifier.
As such, there often arise situations when the training set is substantially unbalanced, that is, training sets containing many more cases in some categories than in others. This often happens when cases belonging in one category of the overall population of cases are rare relative to the size of the overall population of cases. Unfortunately, automated classifiers often have difficulty when training on substantially unbalanced data. Such classifiers are prone to providing inaccurate results, which introduce biases in the estimates of the size of the category of interest.