Machine learning is a form of artificial intelligence whereby information learned from a computer-assisted analysis of data can be used to generate a prediction rule that describes dependencies in data. The prediction rule can be embodied within a computer-implemented model that performs a specific task. Computer-implemented models can be used in a wide variety of applications such as, for example, search engines (e.g., determining whether search results are primarily informational or commercial in content), stock market analysis (e.g., predicting movements in the prices of stocks), and handwriting and image recognition (e.g., determining whether or not a handwriting sample or image matches another sample or image). As another example, computer-implemented models can be used to diagnose medical conditions (e.g., disease such as cancer), predict the time-to-occurrence (e.g., recurrence) of medical conditions, and/or predict the responses of patients to medical treatments.
A computer-implemented model processes data for one or more input features of an “instance” (e.g., a search result, a stock, a handwriting sample, image, or a medical patient) according to the prediction rule in order to provide an output that represents a given outcome for that instance. A feature is a characteristic of the instance. For example, in the medical context, gender is a clinical feature that can take the values of “male” and “female.” An outcome is a prediction or other determination for the instance (e.g., time to disease recurrence) that is produced by the prediction rule based on the input data. With respect to linear prediction rules, the relative importance of a given feature (i.e., the degree to which that feature affects the determination of outcome) is characterized by the numeric “weight” of that feature within the prediction rule. A linear prediction rule can determine an outcome as follows:
Outcome = w1*f1 + w2*f2 + . . . + wn*fn + b  (1)
where f1 to fn are measurements for the instance of the n features in the prediction rule, w1 to wn are the respective weights of the features in the prediction rule, and b is a constant term.
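As a minimal sketch of equation (1), the following evaluates a linear prediction rule for a single instance. The function name `linear_outcome` and the three-feature weights and measurements are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def linear_outcome(weights, features, bias):
    """Evaluate Outcome = w1*f1 + w2*f2 + ... + wn*fn + b for one instance."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * f for w, f in zip(weights, features)) + bias


# Hypothetical three-feature rule; the numbers are illustrative only.
weights = [0.5, -2.0, 1.0]   # w1..w3
features = [4.0, 1.0, 3.0]   # f1..f3 measured for one instance
outcome = linear_outcome(weights, features, bias=1.0)
print(outcome)
```

A feature whose weight has a larger magnitude moves the outcome more per unit of measurement, which is the sense in which the weight characterizes the feature's relative importance.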
Determining the weights of the features within the linear prediction rule involves applying a machine learning method such as a support vector machine (“SVM”) having a linear kernel to data for a cohort of instances (a “training” dataset). The training dataset typically includes measurements of the features for each of the instances, and the known outcomes of those instances. A machine learning tool capable of performing Support Vector Regression for censored data (“SVRc”) may be used that can generate the feature weights based on “non-censored” data (i.e., data for instances with known outcomes) and/or “right-censored” data (i.e., data for subjects with outcomes that are at least partially unknown), as is described in commonly-owned U.S. patent application Ser. No. 10/991,240, filed Nov. 17, 2004 (U.S. Pub. No. 20050108753). The predictive ability of the prediction rule can be tested (validated) by applying the prediction rule to one or more instances (e.g., one or more instances from the training cohort or an independent “test cohort”). The outcome(s) predicted by the prediction rule can be compared to at least partially known outcome(s) for the instances through the use of statistical metrics. An example of such a statistical metric is the concordance index (CI). Additional examples of statistical metrics include sensitivity and specificity, which traditionally have been evaluated for prediction rules with binary outcomes.
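The text above does not define how the concordance index is computed. The sketch below assumes one common pairwise formulation (often attributed to Harrell) in which the prediction is a predicted time, a pair of instances is usable only when the instance with the shorter time has an observed (non-censored) outcome, and tied predictions count as half-concordant. The function name and the sample data are hypothetical.

```python
def concordance_index(times, events, predictions):
    """Pairwise concordance between predicted and observed times.

    times:       observed time for each instance (event or censoring time)
    events:      True if the outcome was observed, False if right-censored
    predictions: predicted time for each instance (larger means later)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is usable only if instance i had an observed event
            # strictly before instance j's observed or censoring time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if predictions[i] < predictions[j]:
                    concordant += 1.0
                elif predictions[i] == predictions[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs in the data")
    return concordant / comparable


# Hypothetical cohort: the third instance is right-censored at time 6.
times = [2, 4, 6, 8]
events = [True, True, False, True]
predictions = [1, 3, 2, 4]
ci = concordance_index(times, events, predictions)
print(ci)
```

Under this formulation a value of 1.0 means every usable pair is ordered correctly, and 0.5 is what random predictions would be expected to achieve.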
Various approaches have been provided for selecting the features for inclusion within a prediction rule. Feature selection is not required in order to create a prediction rule (e.g., a rule could be created based on all features believed to be relevant to a specific task); however, it may improve the quality of the prediction rule by (for example) determining the features that are the most important predictors for a specific task, eliminating excessive features, and reducing the number of features for which data must be collected for an instance to be evaluated by the prediction rule. In one approach, features can be selected for a prediction rule based on domain expertise only, such as by a physician selecting n features for the rule based solely on that physician's personal knowledge and experience. However, this approach may cause features that do not improve (or that even decrease) the predictive ability of the prediction rule to be included in the rule. This approach also may prevent the discovery of new features that may be relevant to the task, because the relevancy of these new features may not be discernible without the aid of statistical evaluation.
In another approach, feature filtering may be used for feature selection, whereby each feature under consideration for potential inclusion in a prediction rule is evaluated independently in order to determine its predictive ability. The features may be ranked according to their predictive abilities and then some fixed number of the “best” features in the rank may be selected for inclusion in the rule.
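Feature filtering can be sketched as scoring each feature on its own and keeping a fixed number of the top-ranked features. The text does not specify the scoring metric; the sketch below assumes the absolute Pearson correlation between a feature and the known outcome, and the feature names and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def filter_features(columns, outcomes, k):
    """Rank features independently by |correlation| and keep the best k."""
    scores = {name: abs(pearson(values, outcomes))
              for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Hypothetical training data: four instances, four candidate features.
outcomes = [1, 2, 3, 4]
columns = {
    "a": [1, 2, 3, 4],
    "b": [1, 3, 2, 4],
    "c": [2, 2, 2, 2],
    "d": [1, -1, 1, -1],
}
selected = filter_features(columns, outcomes, k=2)
print(selected)
```

Because each feature is scored in isolation, a filter of this kind requires only one evaluation per feature, but it cannot account for how features behave in combination.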
In other approaches, greedy forward and/or greedy backward procedures can be used alone or in combination with domain expertise to select features for a prediction rule. The greedy forward procedure increases, one feature at a time, the number of features that are considered within a final prediction rule (i.e., the prediction rule resulting from the procedure), where the set of n features eligible for consideration within the prediction rule may be defined based on, for example, domain expertise. However, significant processing resources (e.g., number of processes) are required to implement the greedy forward procedure. Particularly, the first feature selected for inclusion in the final prediction rule according to the greedy forward procedure is the feature that, by itself, forms the one-feature prediction rule that is most predictive of the event under consideration. Thus, in a first stage, the greedy forward procedure involves generating n one-feature prediction rules and then evaluating the predictive abilities of those rules according to a statistical metric such as the CI. The second feature selected for inclusion in the final prediction rule is the feature that, when coupled with the first feature, causes the greatest increase in the predictive ability. This second feature is determined by generating and evaluating the predictive abilities of n−1 two-feature prediction rules (i.e., each rule including the first feature and a respective one of the n−1 features remaining in the set of features eligible for consideration). The third feature selected for inclusion in the final prediction rule is determined by generating and evaluating n−2 three-feature prediction rules, the fourth feature is determined by generating and evaluating n−3 four-feature prediction rules, and so on. 
This procedure ends when the set of features eligible for inclusion within the final prediction rule lacks any single feature that, when coupled with the currently selected features, would cause an increase in predictive ability. Thus, starting with a set of n features, the greedy forward procedure can require the generation of as many as n+(n*(n−1))/2 prediction rules in order to produce the final prediction rule. For example, starting with a set of 50 features, the greedy forward procedure can require the generation of as many as 50+50*49/2=1275 prediction rules in order to select the features for the final prediction rule. Starting with a set of 500 features, the generation of as many as 500+500*499/2=125,250 prediction rules can be required.
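The greedy forward procedure and its worst-case cost can be sketched as follows. The `score` callable is an assumption standing in for "generate a prediction rule for this feature subset and evaluate it with a metric such as the CI"; the toy scoring function and feature names are hypothetical.

```python
def greedy_forward(candidates, score):
    """Add one feature per stage until no remaining feature raises the score.

    Returns the selected features, their score, and the number of
    prediction rules that had to be generated and evaluated.
    """
    selected = []
    remaining = list(candidates)
    best = float("-inf")
    rules_built = 0
    while remaining:
        # One trial rule per remaining feature, each added to the current set.
        trials = []
        for feature in remaining:
            trials.append((score(selected + [feature]), feature))
            rules_built += 1
        top_score, top_feature = max(trials, key=lambda t: t[0])
        if top_score <= best:
            break  # no single feature increases predictive ability
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best, rules_built


def max_rules(n):
    """Worst-case rule count: n + (n-1) + ... + 1 = n + n*(n-1)/2."""
    return n + n * (n - 1) // 2


# Toy stand-in for a real metric: each feature adds a fixed gain.
gains = {"a": 3, "b": 2, "c": -1}
selected, best, rules_built = greedy_forward(
    gains, lambda subset: sum(gains[f] for f in subset))
print(selected, best, rules_built)
print(max_rules(50), max_rules(500))
```

In this toy run the procedure reaches the worst case for three features (3 + 2 + 1 rules) because the final stage must still be evaluated before the stopping condition can be confirmed.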
The greedy backward procedure removes features one at a time from a set of features selected for inclusion in a prediction rule, where the features included in the rule at the start of the procedure can be selected based on domain expertise and/or the greedy forward procedure. Particularly, starting with a prediction rule that includes n features, n different (n−1)-feature prediction rules are generated (e.g., by applying SVM or SVRc) and evaluated for their predictive abilities according to a statistical metric such as the CI, with each of the rules leaving out a respective one of the n features. The (n−1)-feature prediction rule, if any, that shows the greatest increase in predictive ability compared to the n-feature prediction rule, or that has the same predictive ability as the n-feature rule when no (n−1)-feature rule has an increased predictive ability, is selected as the new prediction rule. The greedy backward procedure ends when it is determined that the predictive ability of the current prediction rule would decrease with the removal of any single feature. Thus, the greedy backward procedure does not consider that, even when the removal of the first feature causes the predictive ability of a prediction rule to decrease, the predictive ability of the rule could increase overall upon the removal of two or more features.
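The stopping behavior described above, including the limitation noted in the last sentence, can be illustrated with a small sketch. As before, `score` is an assumed stand-in for generating and evaluating a prediction rule on a feature subset, and the score table is hypothetical, constructed so that removing any one feature hurts while removing two helps.

```python
def greedy_backward(features, score):
    """Drop one feature per stage while some removal does not lower the score."""
    current = list(features)
    current_score = score(current)
    while len(current) > 1:
        # One trial rule per feature, each leaving that feature out.
        trials = [(score([g for g in current if g != f]), f) for f in current]
        top_score, dropped = max(trials, key=lambda t: t[0])
        if top_score < current_score:
            break  # every single-feature removal decreases predictive ability
        current.remove(dropped)
        current_score = top_score
    return current, current_score


# Hypothetical scores: every two-feature rule is worse than the full rule,
# yet the one-feature rule {"a"} is better than all of them.
table = {
    frozenset("abc"): 5,
    frozenset("ab"): 4, frozenset("ac"): 4, frozenset("bc"): 4,
    frozenset("a"): 7, frozenset("b"): 3, frozenset("c"): 3,
}
result = greedy_backward(["a", "b", "c"], lambda s: table[frozenset(s)])
print(result)
```

Here the procedure stops immediately with all three features, because each single removal lowers the score, and so never reaches the better one-feature rule.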
In view of the foregoing, it would be desirable to provide sound alternatives to the traditional approaches for feature selection in machine learning.