This invention relates to a method of classification through machine learning methodology that improves upon previous endeavors employing logistic regression. Automated classification systems are an integral part of the fields of data mining and machine learning. Since the pioneering work of Luce (1959), there has been widespread use of logistic regression as an analytical engine to make classifying decisions. Examples of its application include predicting the probability that a person will get cancer given a list of their environmental risk factors and genes, predicting the probability that a person will vote Republican, Democratic or Green given their demographic profile, and predicting the probability that a high school student will rank in the bottom third, the middle third, or the upper third of their class given their socioeconomic and demographic profile. Logistic Regression has potential application to any scientific, engineering, social, medical, or business classification problem where target variable values can be formalized as binary, nominal, rank-ordered, or interval-categorized outcomes. More generally, Logistic Regression is a machine learning method that can be programmed into computers, robots and artificial intelligence agents for the same types of applications as Neural Networks, Random Forests, Support Vector Machines and other such machine learning methods, many of which are the subject of issued U.S. patents. Logistic Regression is an important and widely used analytical method in today's economy. Unlike most other machine learning methods, Logistic Regression is not a “black box” method. Instead, Logistic Regression is a transparent method that does not require elaborate visualization methods to understand how the classification works.
Prior art forms of logistic regression, along with other machine classification methods, have significant limitations in their ability to handle errors such as sampling error in small training samples, overfitting error, and multicollinearity error related to highly correlated input variables. Many prior art logistic regression methods also do not effectively handle high-dimensional problems with very large numbers of variables.
The present machine learning method is fundamentally different from prior art methods such as the one described in U.S. Pat. No. 7,222,127 (the '127 patent). For example, the method of the present invention is based upon reduced error logistic regression that can be shown to yield less error in machine learning applications with large numbers of multicollinear variables and a small number of observations, and the method of the present invention employs backward selection in its variable selection processing. The backward selection in the method of the present invention is based upon the magnitude of t values; this backward selection process starts with a very large number of variables and eliminates the least important variables based upon the magnitude of the t values until the best model is discovered as defined through a log likelihood function based upon probability of error that results from reduced error logistic regression. In contrast, the '127 patent employs additive or forward selection in its variable selection and further employs an arbitrary cost function in its objective function that is fundamentally different from that obtained through appropriately scaled symmetrical error modeling in the reduced error logistic regression method.
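The backward selection loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the RELR implementation: the t-value is taken to be the textbook Pearson-correlation t-value between each input and the target, the names `pearson_t`, `backward_select`, and `toy_score` are hypothetical, and `toy_score` is only a placeholder for the log likelihood function of reduced error logistic regression, which is not reproduced here.

```python
import numpy as np

def pearson_t(x, y):
    """Textbook t-value for a Pearson correlation (n - 2 degrees of freedom).
    Assumption: stands in for the t-value measure of the method."""
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    return r * np.sqrt((n - 2) / (1.0 - r * r))

def toy_score(Xs, y):
    """Placeholder score (NOT the RELR log likelihood): least-squares fit
    quality minus a penalty of 2 per retained variable."""
    A = np.column_stack([np.ones(len(y)), Xs])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return -len(y) * np.log(np.mean(resid ** 2)) - 2.0 * Xs.shape[1]

def backward_select(X, y, score_fn, min_vars=1):
    """Start with all variables, repeatedly drop the one with the smallest
    |t|, and return the column subset with the highest score seen."""
    active = list(range(X.shape[1]))
    best_cols, best_score = list(active), score_fn(X[:, active], y)
    while len(active) > min_vars:
        t_abs = [abs(pearson_t(X[:, j], y)) for j in active]
        active.pop(int(np.argmin(t_abs)))          # least important variable
        score = score_fn(X[:, active], y)
        if score > best_score:
            best_cols, best_score = list(active), score
    return best_cols, best_score

# Synthetic data: only columns 0 and 1 carry signal; the rest are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + 0.8 * X[:, 1] + 0.5 * rng.normal(size=200) > 0).astype(float)
cols, score = backward_select(X, y, toy_score)
print("selected columns:", cols)
```

Passing the scoring function as an argument keeps the elimination loop separate from the model-quality criterion, so a different criterion can be substituted without changing the loop.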
The present disclosure is directed to improvements in a method for Generalized Reduced Error Logistic Regression which overcomes significant limitations in prior art logistic regression. This prior art includes work by Golan et al. (1996) and early theoretical work disclosed by the Applicant in 2005 and 2006 that preceded what is currently known as Reduced Error Logistic Regression (RELR). The method of the present invention is applicable to all current applications of logistic regression. The present method effectively deals with error and dimensionality problems in logistic regression, and therefore has significantly greater reliability and validity using smaller sample sizes and potentially very high dimensionality in numbers of input variables. These are major advantages over prior art logistic regression methods. In addition, this improved method has an effective variable selection method and scales the model with an appropriate scale factor Ω that adjusts for total variable importance across variables to calculate more reliable and valid logit coefficient parameters generally. The present Generalized RELR method was not obvious or even possible given the Golan et al. (1996) work and the prior art theoretical work of the Applicant disclosed in 2005 and 2006. For example, the Golan et al. (1996) work does not include t values as measures that are inversely related to expected extreme error. In addition, the Golan et al. (1996) work and all such related work also have the same machine learning deficiencies as the early prior art theoretical work disclosed by the Applicant in 2005 and 2006. Once again, this prior art work of the Applicant had significant limitations and problems including lack of appropriate scaling and lack of variable selection that rendered it useless as a generalized machine learning method.
The present method is a continuation-in-part of U.S. patent application Ser. No. 11/904,542 also titled Generalized Reduced Error Logistic Regression Method. The major changes from the method described in this previous Ser. No. 11/904,542 application are:
1. The present application clears up ambiguities and errors in some of the formulas presented in application Ser. No. 11/904,542.
2. Application Ser. No. 11/904,542 employed a t-value that required categorizing interval-category and ordinal dependent variables into binary dependent variables. The method of the present invention now uses a slightly different t-value measure than defined in application Ser. No. 11/904,542 that appears appropriate for all types of target variables and can give greater accuracy with interval or ordinal target variables. This is defined through Equations (5a) and (5b) and the description that follows these equations. This present t-value is analogous to the t-value used to test whether a Pearson correlation across independent observations is significantly different from zero. It differs from the previous one-sample t-value only in having a slightly different denominator that reflects a different estimate of the degrees of freedom. We have found no evidence that this slightly different denominator, in comparison to the one-sample t-value used in Ser. No. 11/904,542, has an effect on the accuracy, reliability or validity of models that do not have interval-category or ordinal dependent variables at the sample sizes that we typically employ, which have at least 40-50 target observations.
3. In contrast to the t-value formula in Ser. No. 11/904,542, the current t-value formula is affected by the number of independent observations. This differential treatment of completely independent vs. not completely independent observations in this current measure related to a t-value in Equations (5a) and (5b) was not adequately addressed in the Ser. No. 11/904,542 application, but is now handled with these changes.
4. The relative magnitude scaling factor Ω now is multiplied by 2 as shown in Equation (5b). This is because there are both positive and negative measures of expected error for each moment in the model. This doubling of the scale factor Ω reflects both positive and negative largest expected error measures proportional to the inverse of the t-value. The doubling of the scale makes mathematical sense because this is a relative magnitude scale, so it should be a sum of all t-values in the model, including those relating to both positive and negative expected error.
5. The method of the present invention now allows a user to control the speed of the variable selection phase through batch processing of the t-values. For models with very large numbers of observations and very large numbers of variables, this may yield an enormous savings in computational time, but will always give the exact same solutions as would be obtained without employing batch processing.
6. The method of the present invention now adds the ability to model patterns of missing vs. non-missing observations on input variables that have a structural relationship to the target variable.
7. The method of the present invention now allows for hierarchical rules for interactions to govern variable selection as is sometimes done in genomic models. This option is under user control, but is not arbitrary and is instead determined by unique features of an application.
8. The results charts now change slightly for RELR because of the use of appropriate McNemar statistics for dependent proportions and because of software updates that reflect the method changes. In addition, we found that we could get a better SVM model for the large sample political polling model (FIG. 3), so SVM fares better in these comparisons. Finally, in other comparative modeling approaches such as Neural Networks, we found that we were able to get slightly better models by changing certain parameters, so in these cases we report the best possible model for comparison. In other cases, the results change slightly because we discovered errors in our data processing. The results now clearly show that RELR specifically improves classification performance in a dataset with a relatively small sample size where sampling and multicollinearity error would be expected to be most pronounced as shown in FIG. 2.
9. Finally, the method of the present invention now specifies a measure of model performance that is used as a measure of the best model; such a measure was unspecified in the Ser. No. 11/904,542 application. This measure is the maximal Log Likelihood across the training observations. This measure of the best model is described through Equation (1a).
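For orientation, the textbook forms of the two statistics compared in item 2 are given below. These are the standard definitions only, offered on the assumption that they are what "analogous" refers to; they are not a reproduction of Equations (5a) and (5b). Here r is the sample Pearson correlation, n is the number of independent observations, and x̄ and s are a sample mean and standard deviation.

```latex
t_{\text{Pearson}} = \frac{r\,\sqrt{n-2}}{\sqrt{1-r^{2}}}
\quad (n-2 \text{ degrees of freedom}),
\qquad
t_{\text{one-sample}} = \frac{\bar{x}}{s/\sqrt{n}}
\quad (n-1 \text{ degrees of freedom})
```

Both statistics grow with the number of independent observations, and they differ in the degrees of freedom assumed, which is consistent with the difference in denominators noted in item 2.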