To build classifiers and to identify biologically relevant biomarkers, it is often necessary to identify the most relevant expression measurements (i.e., features) in a dataset containing many, possibly thousands, of variables measured for tens or hundreds of samples, each of which is associated with clinical data. Such features can take the form of the intensities of peaks in mass spectral data, genomic data such as mRNA transcript expression levels for hundreds or thousands of genes, or proteomic data such as expression levels for a multitude of proteins.
One approach to classifier development, which we have termed “combination of mini-classifiers with dropout” or “CMC/D”, is described in our prior U.S. patent application Ser. No. 14/486,442 filed Sep. 15, 2014, the content of which is incorporated by reference herein. Generally speaking, classifiers developed in accordance with the '442 application are able to work well without the need to select only a few most useful features from a classifier development data set. However, at some point the performance of even the methods of the '442 application can degrade if too many useless or noisy features are included in the set of features used to develop the classifier. Hence, in many classifier development and biomarker identification situations it is essential to be able to either select relevant features or deselect irrelevant features.
It should be noted that feature selection and feature deselection are not simply complements of each other. Feature selection involves selecting a few features that are statistically significantly correlated with a clinical state or clinical outcome. Feature deselection removes noisy features or features that show no indication of power to classify into the clinical groups. Hence, deselection does not depend on an established level of statistical significance of the correlation of a feature with clinical state or outcome.
Many methods for feature selection or deselection have been proposed and used in practice. Student t-tests, Wilcoxon rank-sum (Mann-Whitney) tests, significance analysis of microarrays (“SAM”) (see Tusher et al., “Significance analysis of microarrays applied to the ionizing radiation response” Proc. Natl. Acad. Sci. USA 2001 98(9):5116), adaptations of logistic regression, including lasso and elastic net (see Witten et al., “Testing significance of features by lassoed principal components” Ann. Appl. Stat. 2008 2(3):986; Zou et al., “Regularization and variable selection via the elastic net” J. R. Statist. Soc. Ser. B 2005 67:301), mutual information, and combinations of these methods (see Samee et al., “Detection of biomarkers for hepatocellular carcinoma using hybrid univariate selection methods” Theoretical Biol. and Med. Modelling 2012 9:34.4) have been used to identify features showing differential expression between sample sets representing two known classes. To identify features relevant for the prediction of time-to-event outcomes, or for classification into groups of patients with better or worse time-to-event outcomes, Cox regression of features to outcome data can be used. See Zhu et al., “Prognostic and Predictive Gene Signature for Adjuvant Chemotherapy in Resected Non-Small-Cell Lung Cancer” J. Clin. Oncol. 2010 28(29):4417.
Often these methods are applied to data from a development set of samples. As used in this document, a “development set of samples”, or simply “development set”, refers to a set of samples (and the associated data, such as feature values and a class label for each of the samples) that is used in an exercise of developing a classifier. Often, there are many features (typically several hundred to tens of thousands) measured for each sample in the development set, many more than the number of samples in the development set (typically on the order of 10 to 100 samples). Applying these methods to the development set of samples can lead to overfitting: features are selected which demonstrate significant expression differences between classes in the particular development set but which do not generalize to other sample sets, where a different set of features would demonstrate significant expression differences. This is a recognized problem in feature selection. Ensemble methods (bagging) have been suggested to deal with this issue in feature selection and biomarker identification. See Saeys et al., “Robust Feature Selection Using Ensemble Feature Selection Techniques” Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science Volume 5212 (2008) p. 313; Abeel et al., “Robust biomarker identification for cancer diagnosis with ensemble feature selection methods” Bioinformatics 2010 26(3):392.
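The overfitting problem described above can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not a method taken from the cited references: the data are synthetic Gaussian noise, the feature and sample counts are arbitrary, and the helper name welch_t is hypothetical. Even though no feature carries any real class information, a univariate test applied to many features and few samples flags some of them as apparently significant.

```python
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent groups of measurements."""
    return (mean(a) - mean(b)) / (variance(a) / len(a) + variance(b) / len(b)) ** 0.5


rng = random.Random(0)                 # fixed seed so the run is repeatable
n_features, n_per_class = 1000, 10     # assumed sizes: many features, few samples

# Pure noise: no feature has any true relationship with the class label.
class_a = [[rng.gauss(0, 1) for _ in range(n_per_class)] for _ in range(n_features)]
class_b = [[rng.gauss(0, 1) for _ in range(n_per_class)] for _ in range(n_features)]

# 2.10 is roughly the two-sided 5% critical value of Student's t with 18 degrees
# of freedom; it is used here as an approximate cutoff for Welch's statistic.
false_hits = [
    j for j in range(n_features)
    if abs(welch_t(class_a[j], class_b[j])) > 2.10
]
print(f"{len(false_hits)} of {n_features} pure-noise features look 'significant'")
```

Features flagged in this way reflect chance fluctuations in this particular sample set, so a second set drawn in the same manner would be expected to flag a largely different list.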
Some methods of feature selection are quite limited in the way that the relevance of features is assessed. One may be interested in identifying features which can be used to classify samples accurately into the defined classes or groups with significantly different outcomes, but determining features where the mean or median is significantly different between classes does not necessarily mean that the feature is particularly useful for accurate classification. Differences in mean may be due to isolated outlier samples. The significance of differences in median may not give a reliable guide to how well the feature can split the development set into the required classes. In addition, typically it is difficult to use any one method, ensemble-based or other, across multiple biomarker settings (e.g. in classification into two clinical states and to find prognostic classifications based on time-to-event data).
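The point that a difference in means need not indicate classification power can be made concrete with a toy example. This is a minimal sketch using invented numbers; the single-cutoff classifier and the function name best_threshold_accuracy are assumptions chosen for illustration only.

```python
from statistics import mean


def best_threshold_accuracy(group_a, group_b):
    """Best accuracy of a single-cutoff rule on one feature, in either direction."""
    values = sorted(set(group_a) | set(group_b))
    # Candidate cutoffs: midpoints between neighbouring distinct values.
    cutoffs = [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]
    n = len(group_a) + len(group_b)
    best = 0.5
    for c in cutoffs:
        # Rule: values below the cutoff are called group A, the rest group B.
        correct = sum(v < c for v in group_a) + sum(v >= c for v in group_b)
        best = max(best, correct / n, (n - correct) / n)
    return best


group_a = [1.0, 1.1, 0.9, 1.2, 0.8, 1.0, 1.1, 0.9, 1.2, 0.8]
# Same values as group A except for one isolated outlier sample.
outlier_b = [1.0, 1.1, 0.9, 1.2, 0.8, 1.0, 1.1, 0.9, 1.2, 25.0]
# Every value shifted by a constant: a smaller mean difference, but a clean split.
shifted_b = [v + 1.0 for v in group_a]

for name, group_b in (("outlier", outlier_b), ("shifted", shifted_b)):
    diff = mean(group_b) - mean(group_a)
    acc = best_threshold_accuracy(group_a, group_b)
    print(f"{name}: mean difference {diff:.2f}, best cutoff accuracy {acc:.2f}")
```

In this toy data the outlier-driven feature has the larger difference in means (about 2.42 versus 1.00), yet its best cutoff classifies only 55% of the samples correctly, whereas the shifted feature separates the two groups completely.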
This document describes a method of feature selection or deselection that makes use of an ensemble average (“bagging” in machine learning terms) of the filtering of a classification performance estimate. This method has the usual advantages of an ensemble approach (increased robustness of the (de)selected features and avoidance of overfitting) and is flexible enough that it can easily be used both for classification into clinical states (e.g. cancer or no cancer) and for classification according to groups based on continuous clinical variables (e.g. % weight loss) and censored time-to-event data, such as overall survival. Most importantly, this method allows feature selection and deselection to be tailored to the specific problem to be solved and provides a simple way to deal with known or suspected confounding factors.
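The general idea of averaging a filtered classification performance estimate over an ensemble can be sketched as follows. This is an illustrative sketch only and should not be read as the implementation described in this document. The following are all assumptions introduced for the example: the stratified half-and-half splits, the single-cutoff classifier, accuracy as the performance estimate, the filter threshold min_accuracy, the 0.5 pass-fraction cutoff, the synthetic data, and the names fit_cutoff, accuracy and bagged_filter.

```python
import random


def fit_cutoff(values, labels):
    """Fit a one-feature cutoff rule on training data; returns (cutoff, high_class)."""
    distinct = sorted(set(values))
    best = (distinct[0], 1, 0.0)
    for lo, hi in zip(distinct, distinct[1:]):
        cut = (lo + hi) / 2
        # Training accuracy of the rule "predict class 1 when value >= cut".
        acc = sum((v >= cut) == (label == 1) for v, label in zip(values, labels)) / len(values)
        for high_class, a in ((1, acc), (0, 1 - acc)):
            if a > best[2]:
                best = (cut, high_class, a)
    return best[0], best[1]


def accuracy(values, labels, cutoff, high_class):
    """Accuracy of a fitted cutoff rule on held-out samples."""
    preds = [high_class if v >= cutoff else 1 - high_class for v in values]
    return sum(p == label for p, label in zip(preds, labels)) / len(labels)


def bagged_filter(X, y, n_realizations=100, min_accuracy=0.8, seed=0):
    """Fraction of random split realizations in which each feature passes the filter."""
    rng = random.Random(seed)
    n_features = len(X[0])
    passes = [0] * n_features
    by_class = {c: [i for i, label in enumerate(y) if label == c] for c in (0, 1)}
    for _ in range(n_realizations):
        train, test = [], []
        for idx in by_class.values():          # stratified half/half split
            shuffled = rng.sample(idx, len(idx))
            half = len(idx) // 2
            train += shuffled[:half]
            test += shuffled[half:]
        for j in range(n_features):
            cutoff, high = fit_cutoff([X[i][j] for i in train], [y[i] for i in train])
            acc = accuracy([X[i][j] for i in test], [y[i] for i in test], cutoff, high)
            if acc >= min_accuracy:            # the filter on the performance estimate
                passes[j] += 1
    return [p / n_realizations for p in passes]


# Synthetic data: feature 0 separates the classes; features 1 to 4 are noise.
data_rng = random.Random(1)
y = [0] * 10 + [1] * 10
X = [[2 * label + data_rng.random()] + [data_rng.random() for _ in range(4)] for label in y]

fractions = bagged_filter(X, y)
selected = [j for j, f in enumerate(fractions) if f >= 0.5]
print(fractions, selected)
```

Because feature 0 separates the two classes in every split, it passes the filter in every realization and is retained, while a noise feature would be expected to pass only occasionally. Swapping in a different performance estimate or filter range is how a sketch of this kind could be adapted to other problems, such as groups defined by time-to-event outcomes.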