Automated pattern classification is well known per se. It has been applied, for example, to the automatic classification of electronic documents, object recognition, and the detection of abnormal situations in manufacturing processes. It is known to use a scoring module in a pattern classification apparatus, typically implemented by means of a computer program, that receives information measured from the object to be classified and computes a score for the object from the measured information. The score is a quasi-continuous value indicative of the likelihood that the object belongs to a class. Scoring modules may be optimized for specific pattern recognition tasks using machine learning techniques applied to examples of patterns in combination with the classes that have to be assigned to those patterns.
Such a score is not yet a classification. Typically, the score for an object has to be compared to a threshold to determine whether the object belongs to a class. The use of a threshold introduces two types of errors: false positive errors, in which an object is assigned to a class to which it does not belong, and false negative errors, in which an object is not assigned to a class to which it does belong. The rate of false positive errors increases when the threshold is lowered, whereas the rate of false negative errors increases when the threshold is raised. An optimal selection of the threshold value balances these effects.
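The trade-off between the two error rates may be illustrated by the following minimal sketch. The function name `error_rates`, the example scores and labels, and the convention that a score equal to the threshold counts as belonging to the class are assumptions made for illustration only.

```python
def error_rates(scores, labels, threshold):
    """Return (false_positive_rate, false_negative_rate) for one threshold.

    An object is assigned to the class when its score is >= threshold
    (an assumed convention).
    """
    negatives = [s for s, y in zip(scores, labels) if not y]
    positives = [s for s, y in zip(scores, labels) if y]
    # False positive: object outside the class scored at or above the threshold.
    false_pos = sum(s >= threshold for s in negatives)
    # False negative: object inside the class scored below the threshold.
    false_neg = sum(s < threshold for s in positives)
    return false_pos / len(negatives), false_neg / len(positives)


# Hypothetical scores with known class membership.
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
labels = [False, False, True, False, True, True]

for t in (0.2, 0.5, 0.8):
    fp, fn = error_rates(scores, labels, t)
    print(f"threshold={t}: false positive rate={fp:.2f}, false negative rate={fn:.2f}")
```

With these example values, raising the threshold from 0.2 to 0.8 lowers the false positive rate from 2/3 to 0 while raising the false negative rate from 0 to 2/3.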
In another solution two thresholds may be used for a class: a first threshold to distinguish between scores of objects that will definitely be classified as belonging to the class and other objects, and a second threshold to distinguish between scores of objects that will definitely not be classified as belonging to the class and other objects. This results in a category of objects that is neither definitely assigned to the class nor definitely not assigned to the class. Such objects may be indicated for further inspection by a human inspector to assign the object to the class or not, or to a more refined but more expensive automated classifier for doing so.
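A two-threshold scheme of this kind may be sketched as follows. The function name `classify`, the class names, and the boundary conventions (a score equal to the upper threshold counts as definitely in the class) are assumptions made for illustration.

```python
def classify(score, lower, upper):
    """Three-way classification of a score using two thresholds.

    score >= upper          -> definitely assigned to the class
    score <  lower          -> definitely not assigned to the class
    lower <= score < upper  -> left for a human inspector or a more
                               refined, more expensive classifier
    """
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if score >= upper:
        return "positive"
    if score < lower:
        return "negative"
    return "uncertain"


# Hypothetical thresholds and scores.
for s in (0.9, 0.5, 0.1):
    print(s, classify(s, lower=0.3, upper=0.7))
```

When the two thresholds coincide, the uncertain category is empty and the scheme reduces to the single-threshold case.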
One problem with this type of classification is the selection of the threshold(s). User input is indispensable at this point, because only the context in which the classification is used can determine how the costs of false positive errors, false negative errors and human inspection should be balanced. However, users typically cannot foresee the consequences of selecting a particular threshold value, especially if a plurality of thresholds has to be selected. This makes the selection of thresholds a cumbersome process that often results in suboptimal threshold selection.
A statistically based text classification system is mentioned in an article by David B. Aronow et al., titled “Automated Identification of Episodes of Asthma Exacerbation for Quality Measurement in a Computer-Based Medical Record” and published in the Proceedings of the 9th Annual Symposium on Computer Applications in Medical Care. Toward Cost-Effective Clinical Computing, Hanley & Belfus, Philadelphia, Pa., 1995, pages 309-313 (EPO reference XP002521603).
Aronow et al. classify texts about patients to determine whether or not the patients suffer from exacerbated asthma. Each text is assigned to one of three classes: positive, negative and uncertain. This was done by assigning a weight to each document, computed from features detected in the document and feature weights associated with these features. The document weights were compared with a positive bin cut off and a negative bin cut off to assign the texts to the classes. The texts that were classified as uncertain had to be scored by hand. This burden was reported to be reduced by 45%.
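By way of illustration only, a document weight of this kind may be sketched as below. The additive combination of feature weights, the substring-based feature detection, and the feature names, weights and cut off values are all hypothetical; they are not taken from the article.

```python
# Hypothetical features and weights, for illustration only.
FEATURE_WEIGHTS = {"wheezing": 1.5, "nebulizer": 2.0, "no distress": -1.0}


def document_weight(text, feature_weights):
    """Sum the weights of all features detected in the text.

    Detection by case-insensitive substring match and combination by
    summation are both assumptions.
    """
    lowered = text.lower()
    return sum(w for feature, w in feature_weights.items() if feature in lowered)


def assign_bin(weight, negative_cut_off, positive_cut_off):
    """Compare a document weight with the two (hypothetical) cut offs."""
    if weight >= positive_cut_off:
        return "positive"
    if weight < negative_cut_off:
        return "negative"
    return "uncertain"


texts = [
    "Patient presents with wheezing; nebulizer treatment given.",
    "Mild wheezing noted.",
    "Routine visit, no distress.",
]
for text in texts:
    w = document_weight(text, FEATURE_WEIGHTS)
    print(w, assign_bin(w, negative_cut_off=0.0, positive_cut_off=3.0))
```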
Aronow et al. mention that the document weights were determined from a training set of texts that were known to be positive and negative so that no more than a predetermined percentage of negative texts were classified as positive and no more than a predetermined percentage of positive texts were classified as negative. A target percentage of 10% is mentioned.
Aronow et al. do not consider the percentage of texts that are classified as uncertain in the selection of the weights: only the percentages of false positive and false negative classifications are used. The percentage of positive texts in the training set that were not classified as positive is not used to determine the weights, nor is the percentage of negative texts that were not classified as negative. By using only the percentages of false positives and false negatives, the positive bin cut off and the negative bin cut off can easily be set. However, if the percentage of texts that are classified as uncertain were also to be used to select the cut offs, no unambiguous way of selecting the cut offs would exist. Nor do Aronow et al. suggest how this could be done.
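The ease of setting the cut offs from error percentages alone may be illustrated by the following sketch, in which each cut off is fixed independently by one target error rate and the uncertain fraction merely follows as a by-product. The function names, the search over observed scores, and the training scores are assumptions for illustration and do not reproduce the procedure of the article.

```python
import math


def select_cut_offs(pos_scores, neg_scores, max_error=0.10):
    """Pick (negative_cut_off, positive_cut_off) from labelled training scores.

    positive cut off: lowest observed score such that at most max_error of
                      the negative texts score at or above it.
    negative cut off: highest observed score such that at most max_error of
                      the positive texts score below it.
    """
    candidates = sorted(set(pos_scores) | set(neg_scores))

    def false_pos_rate(c):
        return sum(s >= c for s in neg_scores) / len(neg_scores)

    def false_neg_rate(c):
        return sum(s < c for s in pos_scores) / len(pos_scores)

    upper = next((c for c in candidates if false_pos_rate(c) <= max_error), math.inf)
    lower = max(c for c in candidates if false_neg_rate(c) <= max_error)
    # If the classes separate well, the two cut offs may cross; clamping the
    # lower one keeps both error targets satisfied and leaves no uncertain zone.
    return min(lower, upper), upper


def uncertain_fraction(scores, lower, upper):
    """Fraction of scores falling between the two cut offs."""
    return sum(lower <= s < upper for s in scores) / len(scores)


# Hypothetical training scores for texts known to be negative and positive.
neg = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.45, 0.55, 0.75]
pos = [0.25, 0.40, 0.50, 0.60, 0.65, 0.70, 0.80, 0.85, 0.90, 0.95]

lower, upper = select_cut_offs(pos, neg, max_error=0.10)
print(lower, upper, uncertain_fraction(neg + pos, lower, upper))
```

In this sketch the uncertain fraction is not an input to the selection. Once it is also to be constrained or traded off against the two error rates, the two cut offs can no longer be chosen independently of each other, which is the difficulty noted above.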