The present application is directed to the classification of objects via multiple categorizers. It finds particular application in conjunction with the classification of documents via two orthogonal categorizers with reduced quality standards, and will be described with particular reference thereto. It is to be appreciated, however, that the present exemplary embodiments are also amenable to other like applications.
In statistics, logistic regression is used to predict the probability that an event occurs by fitting data to a logistic curve. It is a generalized linear model used for binomial regression. Like many forms of regression analysis, it makes use of several predictor variables, which may be either numerical or categorical. For example, the probability that a person has a heart attack within a specified time period might be predicted from knowledge of the person's age, sex and body mass index. Logistic regression is used extensively in the medical and social sciences, as well as in marketing applications such as predicting a customer's propensity to purchase a product or cancel a subscription. For instance, logistic regression is a useful way to describe the relationship between one or more factors (e.g., age, sex, etc.) and an outcome that has only two possible values, such as death (e.g., “dead” or “not dead”).
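The logistic-curve idea above can be sketched in a few lines of Python. The coefficients below are hypothetical, illustrative values rather than parameters fitted to real medical data:

```python
import numpy as np

def sigmoid(z):
    """Logistic curve mapping a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients relating age and body mass index (BMI) to
# heart-attack risk -- illustrative values only, not fitted to real data.
intercept, w_age, w_bmi = -10.0, 0.08, 0.15

def predicted_risk(age, bmi):
    # Linear combination of predictors, squashed through the logistic curve.
    return sigmoid(intercept + w_age * age + w_bmi * bmi)

# An older patient with a higher BMI receives a higher predicted probability.
low_risk = predicted_risk(30, 22)
high_risk = predicted_risk(70, 32)
```

In a fitted model the coefficients would be estimated from training data (e.g., by maximum likelihood); here they merely illustrate how numerical predictors combine into a probability.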
In the field of machine learning, the goal of classification is to use an object's characteristics to identify which class (or group) it belongs to. In a statistical classification task, the precision for a class is the number of true positives (e.g., the number of items correctly labeled as belonging to the positive class) divided by the total number of elements labeled as belonging to the positive class (e.g., the sum of true positives and false positives, which are items incorrectly labeled as belonging to the class). In contrast, recall is defined as the number of true positives divided by the total number of elements that actually belong to the positive class (e.g., the sum of true positives and false negatives, which are items which were not labeled as belonging to the positive class but should have been). A linear classifier identifies an appropriate class for an object, wherein a classification decision is based on the value of a linear combination of the characteristics. An object's characteristics are also known as feature values and are typically presented to the machine in a vector called a feature vector. Sparse Logistic Regression (SLR) is a discriminative classifier, whose aim is to learn the differences between classes.
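The precision and recall definitions above follow directly from the counts of true positives, false positives, and false negatives; the counts used in this sketch are hypothetical:

```python
def precision(tp, fp):
    # true positives / all items labeled as belonging to the positive class
    return tp / (tp + fp)

def recall(tp, fn):
    # true positives / all items that actually belong to the positive class
    return tp / (tp + fn)

# Hypothetical counts: 90 correct positive labels, 10 false alarms,
# and 30 positive items the classifier missed.
p = precision(90, 10)   # 90 / (90 + 10)
r = recall(90, 30)      # 90 / (90 + 30)
```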
Prior art FIG. 1 illustrates a distribution 100 of shape objects classified via SLR. In this example, a dividing line 110 is representative of a dichotomy between exemplary shape object classes: circles and squares. The distance between the objects and the dividing line 110 is commensurate with a confidence value that an object has been correctly classified. Thus, objects 120, 130 are associated with a high confidence value whereas objects 122, 132 are associated with a relatively low confidence value. To reduce false positive classifications, a threshold confidence level can be set as indicated by dashed lines 150, 152. Thus, objects within the lines 150, 152 are not classified as either a circle or a square. Such a methodology can eliminate many incorrect, as well as many correct, object classifications.
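The confidence-thresholding behavior of FIG. 1 can be sketched as follows, where the weight vector, feature vectors, and threshold are hypothetical stand-ins for the dividing line 110 and the dashed lines 150, 152:

```python
import numpy as np

# Hypothetical linear classifier: w and b define the dividing line,
# and THRESHOLD plays the role of the dashed lines on either side.
w = np.array([1.0, -0.5])
b = 0.0
THRESHOLD = 1.0

def classify(x):
    # The signed distance from the dividing line serves as a confidence value.
    score = float(np.dot(w, x) + b)
    if abs(score) < THRESHOLD:
        return "rejected"            # inside the dashed band: left unclassified
    return "square" if score > 0 else "circle"

far_object = classify(np.array([3.0, 0.0]))    # far from the line
near_object = classify(np.array([0.4, 0.2]))   # low confidence
```

Objects whose scores fall inside the band are left unclassified, which removes low-confidence (often incorrect) assignments at the cost of also discarding some correct ones.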
Instead of learning what features constitute each object, SLR learns only the characteristics that distinguish one class from another. For example, SLR does not learn that squares are made of four lines connected at right angles to each other. Instead, given an initial training set, SLR might distinguish squares from circles by observing that squares do not include curved lines. In addition, as a discriminative classifier, SLR is not designed to detect outliers and/or to address novel objects not included in the original training set.
Prior art FIG. 2 illustrates this deficiency when a novel object 220 (e.g., a square with rounded corners) is introduced for classification. The only observation that SLR can make about the novel object 220 is that it contains curved lines and is thus far closer to the circles class than to the squares class. Accordingly, based on the initial training set provided to the SLR, the novel object 220 is assigned to the circles class with a high confidence score. This classification, however, is an example of a false positive, as the novel object 220 is in fact a square.
If a distance metric, such as a threshold, is relied upon to remove this false positive, many true positives can also be eliminated. Accordingly, in this case, increasing the precision will prove costly to the recall. This problem points out the limitation of an SLR classifier: to detect novelties, information is needed with regard to the classes' proper characteristics, which is information that discriminative classifiers are simply unable to provide. Conventionally, methodologies simply ignore this shortcoming and assert that, for a large number of classes (e.g., a highly multi-dimensional problem), the limitation of a linear decision boundary is counterbalanced. See, e.g., Logistic Regression for Binary Classification, Paul Komarek, 2005 (online at http://komarix.org/ac/lr/). This may work when a training set is a sufficient representation of the universe of all the objects to be encountered. It fails, however, to account for novel objects introduced subsequently, since the classes are relevant categories only at a given point in time and do not include classes to accommodate novel objects.
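The precision/recall trade-off described above can be sketched with a hypothetical set of scored predictions: raising the confidence threshold discards the false positives, but discards true positives along with them, so precision rises while recall falls.

```python
# Hypothetical (confidence, actually_positive) predictions for one class.
preds = [(0.95, True), (0.90, True), (0.85, False), (0.80, True),
         (0.60, True), (0.55, False), (0.40, True)]
total_positives = sum(1 for _, actual in preds if actual)

def precision_recall(threshold):
    # Only predictions at or above the confidence threshold are labeled.
    accepted = [actual for conf, actual in preds if conf >= threshold]
    if not accepted:
        return 1.0, 0.0          # nothing labeled: vacuous precision, zero recall
    tp = sum(accepted)
    return tp / len(accepted), tp / total_positives

p_low, r_low = precision_recall(0.5)     # permissive threshold
p_high, r_high = precision_recall(0.9)   # strict threshold
# The strict threshold yields perfect precision but loses half the recall.
```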
This problem can be further exacerbated in view of stringent precision targets (e.g., 99% or greater), as conventional solutions do not provide an efficient means of reaching such high precision without lowering recall. Thus, systems and methods are needed to overcome the above-referenced problems with conventional classification algorithms used to categorize objects.