1. Field of the Invention
Methods for Markov boundary discovery constitute one of the most important recent developments in pattern recognition and applied statistics, primarily because they offer a principled solution to the variable/feature selection problem and give insight into local causal structure. The present invention is a novel method to discover all Markov boundaries of the response variable. The power of the inventive method is first demonstrated by its application to the bioinformatics and biotechnology domain, where the invention can identify the set of maximally accurate and non-redundant predictive models (molecular signatures) of the phenotype/response variable and the predictor variable sets that participate in those models. The broad applicability of the invention is subsequently demonstrated with 13 real datasets from a variety of application domains, where the invention can efficiently identify multiple Markov boundaries of the response variable.
2. Description of Related Art
The problem of variable/feature selection is of fundamental importance in pattern recognition and applied statistics, especially when it comes to analysis, modeling, and discovery from high-dimensional data (Guyon and Elisseeff, 2003; Kohavi and John, 1997). In addition to the promise of cost-effectiveness, two major goals of variable selection are to improve the prediction performance of the predictors and to provide a better understanding of the data-generative process (Guyon and Elisseeff, 2003). An emerging class of methods proposes a principled solution to the variable selection problem by identifying a Markov blanket of the response variable of interest (Aliferis et al., 2009a; Aliferis et al., 2003; Tsamardinos and Aliferis, 2003; Tsamardinos et al., 2003). A Markov blanket is a set of variables conditioned on which all the remaining variables, excluding the response variable, are statistically independent of the response variable. Under certain assumptions about the learner and loss function, the Markov blanket is the solution to the variable selection problem (Tsamardinos and Aliferis, 2003). A related useful concept is the Markov boundary, or non-redundant Markov blanket (abbreviated as “MB” in this patent document), which is a Markov blanket such that no proper subset of it is a Markov blanket.
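The two definitions above can be written compactly as follows. This is an illustrative restatement only; the symbols V (the set of all variables), T (the response variable), and M (a candidate variable set) are notation assumed here and do not appear in the source.

```latex
% Notation assumed for illustration (not from the source):
% V = set of all variables, T = response variable, M = candidate set.
% Requires amsmath (\text) and amssymb (\subsetneq).
\[
M \subseteq V \setminus \{T\} \text{ is a Markov blanket of } T
\iff
T \perp \bigl( V \setminus (M \cup \{T\}) \bigr) \mid M
\]
\[
M \text{ is a Markov boundary of } T
\iff
M \text{ is a Markov blanket of } T
\text{ and no } M' \subsetneq M \text{ is a Markov blanket of } T
\]
```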
An important theoretical result states that if the distribution satisfies the intersection property of probability theory, then it is guaranteed to have a unique Markov boundary of the response variable (Pearl, 1988). However, many real-life distributions contain multiple Markov boundaries and violate the intersection property. For example, a phenomenon ubiquitous in the analysis of high-throughput molecular data, known as the “multiplicity” of predictive models (molecular signatures) and of the predictor variable sets in those models (i.e., several different signatures and markers perform equally well in terms of predictive accuracy of phenotypes) (Azuaje and Dopazo, 2005; Somorjai et al., 2003), suggests the existence of multiple Markov boundaries in these distributions.
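A toy example may help show how a distribution can have more than one Markov boundary. The sketch below is not the inventive method. It builds a hypothetical three-variable distribution in which B is an exact copy of A and the response T is a noisy copy of A, and then enumerates every candidate set by brute force using exact conditional independence checks. All names (`build_joint`, `marginal`, `is_markov_blanket`, `markov_boundaries`) and all probabilities are assumptions made for illustration.

```python
from itertools import combinations, product


def build_joint():
    """Hypothetical joint distribution over (A, B, T) at indices (0, 1, 2).

    A is a fair coin, B is an exact copy of A, and T equals A with
    probability 0.9 (illustrative numbers, not taken from the source).
    """
    joint = {}
    for a, t in product([0, 1], repeat=2):
        p_t_given_a = 0.9 if t == a else 0.1
        joint[(a, a, t)] = 0.5 * p_t_given_a
    return joint


def marginal(joint, idxs):
    """Marginal distribution over the variables at positions idxs."""
    out = {}
    for values, p in joint.items():
        key = tuple(values[i] for i in idxs)
        out[key] = out.get(key, 0.0) + p
    return out


def is_markov_blanket(joint, target, candidate, n_vars, tol=1e-12):
    """Exact check that target is independent of all remaining variables
    given candidate, i.e. P(t, r, m) * P(m) == P(t, m) * P(r, m)."""
    cand = sorted(candidate)
    rest = [i for i in range(n_vars) if i != target and i not in cand]
    p_m = marginal(joint, cand)
    p_tm = marginal(joint, [target] + cand)
    p_rm = marginal(joint, rest + cand)
    p_trm = marginal(joint, [target] + rest + cand)
    for tm, p1 in p_tm.items():
        t, m = tm[:1], tm[1:]
        for rm, p2 in p_rm.items():
            r, m2 = rm[:len(rest)], rm[len(rest):]
            if m2 != m:
                continue  # only compare entries with the same conditioning values
            lhs = p_trm.get(t + r + m, 0.0) * p_m[m]
            if abs(lhs - p1 * p2) > tol:
                return False
    return True


def markov_boundaries(joint, target, n_vars):
    """All Markov blankets of target that have no proper subset that is
    also a Markov blanket (brute force; only viable for tiny examples)."""
    others = [i for i in range(n_vars) if i != target]
    blankets = [
        set(c)
        for k in range(len(others) + 1)
        for c in combinations(others, k)
        if is_markov_blanket(joint, target, c, n_vars)
    ]
    return [b for b in blankets if not any(other < b for other in blankets)]


if __name__ == "__main__":
    # Indices: 0 = A, 1 = B, 2 = T
    print(markov_boundaries(build_joint(), target=2, n_vars=3))
```

In this toy distribution both {A} and {B} are Markov boundaries of T. T is independent of A given B and of B given A, yet T is not independent of {A, B} jointly, so the intersection property fails.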
There are at least two practical benefits of a method that could systematically discover all Markov boundaries of the response variable of interest. First, the method improves discovery of the underlying mechanisms because causative variables are not missed. Second, the method sheds light on the predictive model (molecular signature) multiplicity phenomenon and how it affects the reproducibility of predictive models.
Even though there are several well-developed methods for discovering a single Markov boundary (Aliferis et al., 2009a; Aliferis et al., 2003; Tsamardinos et al., 2003), little research has been done on developing methods that identify multiple Markov boundaries. The most notable advances in the field are described in the next paragraphs. In summary, there are currently no computationally efficient methods that can correctly identify all Markov boundaries from the data.
The following paragraphs provide a description of the related art with specific reference to application of the invention to the molecular signature multiplicity problem.
A “molecular signature” is a model that predicts a response variable of interest (e.g., diagnosis or outcome of treatment in human patients or biological models of disease) from microarray gene expression or other high-throughput assay data inputs (Golub et al., 1999; Ramaswamy et al., 2003). “Multiplicity” is the phenomenon in which different analysis methods used on the same data, or different samples from the same population, lead to different but apparently maximally accurate predictive models (signatures) (Azuaje and Dopazo, 2005; Somorjai et al., 2003). This phenomenon has far-reaching implications for biological discovery and development of next generation patient diagnostics and personalized treatments. In the best case, multiplicity implies that generation of biological hypotheses (e.g., discovery of potential drug targets) is very hard even when predictive models (signatures) are maximally accurate for the prediction of phenotype, since thousands of completely different predictive models are equally consistent with the data. In the worst case, this phenomenon may be due in large part to statistical instability and thus entail that the produced predictive models (signatures) are not statistically generalizable to new cases and/or not reliable enough for translation to clinical practice.
Indeed, some authors, motivated by classical statistical considerations, attribute predictive model (signature) multiplicity solely to the small sample size of typical microarray gene expression studies (Ein-Dor et al., 2006) and have conjectured that it leads to non-reproducible predictive accuracy when the signatures are applied to independent data (Michiels et al., 2005). Related to the above, it has been suggested that building reproducible predictive models (signatures) requires thousands of observations (Ioannidis, 2005). Other authors have counter-observed that the phenomenon of predictive model (signature) multiplicity is a byproduct of the complex regulatory connectivity of the underlying biological system, which leads to the existence of highly predictively redundant biomarker sets (Dougherty and Brun, 2006). The specifics of what types of connectivity or regulatory relationships may lead to multiplicity have not been concretely identified. Another possible explanation of predictive model (signature) multiplicity is implicit in previously described artifacts of data pre-processing. For example, normalization may inflate correlations between genes, making some of them interchangeable for prediction of the phenotype (Gold et al., 2005; Ploner et al., 2005; Qiu et al., 2005).
Critical to the ability to study the phenomenon empirically is the availability of methods capable of discovering from the data multiple predictive models (signatures) and predictor variable sets that participate in those models. Several methods have been introduced with this intent. The available methods encompass four method families. The first method family is “resampling-based predictive model (signature) discovery.” This method family operates by repeated application of a predictive model (signature) discovery method to resampled data (e.g., via bootstrapping) (Ein-Dor et al., 2005; Michiels et al., 2005; Roepman et al., 2006). This family of methods is based on the assumption that multiplicity is strictly a small sample phenomenon. The second method family is “iterative removal”, that is, repeating predictive model (signature) discovery after removing from the data all variables that have been found in the previously discovered predictive models (molecular signatures) (Natsoulis et al., 2005). This approach is agnostic as to what causes multiplicity and heuristic since it does not propose a theory of causes of multiplicity. The third method family is “stochastic variable selection” techniques (Li et al., 2001; Peña et al., 2007). The underlying premise of the method of (Peña et al., 2007) is that in a specific class of distributions every maximally accurate and non-redundant predictive model (signature) will be output by a randomized method with non-zero probability; thus all such predictive models (signatures) will be output when the method is applied an infinite number of times. Similarly, the method of (Li et al., 2001) will output all predictive models (signatures) discoverable by a genetic algorithm when it is allowed to evolve an infinite number of populations. The fourth family is “brute force exhaustive search” (Grate, 2005). 
This approach is also agnostic as to what causes multiplicity, requires time exponential in the total number of variables, and is thus computationally infeasible for predictive models (signatures) with more than 2-3 variables (which almost all maximally accurate predictive models have in practice).
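The growth in search cost can be made concrete with a short calculation. The sketch below counts how many candidate variable sets an exhaustive search would have to evaluate when signatures are limited to a given maximum size. The function name `count_candidate_sets` and the choice of 1,000 variables are illustrative assumptions and do not describe any particular implementation of the cited method.

```python
from math import comb


def count_candidate_sets(n_vars, max_size):
    """Number of non-empty variable subsets with at most max_size members."""
    return sum(comb(n_vars, k) for k in range(1, max_size + 1))


if __name__ == "__main__":
    # Illustrative setting: 1,000 candidate variables.
    for max_size in (1, 2, 3, 4, 5):
        print(max_size, count_candidate_sets(1000, max_size))
```

With 1,000 variables, allowing up to 3 variables per signature already gives 166,667,500 candidate sets, and each additional variable multiplies the count by roughly two orders of magnitude. With no size limit the count is 2^n - 1 for n variables.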
The above methods are either heuristic or computationally intractable. They are based on currently unvalidated conjectures about what causes multiplicity, and they output incomplete sets of predictive models (signatures) with currently unknown generalizability. In the fields of bioinformatics and biotechnology, the practical benefits of a method that could systematically discover the set of truly maximally accurate and non-redundant predictive models (signatures) include: (i) a deeper understanding of the predictive model (signature) multiplicity phenomenon and how it affects reproducibility of predictive models; (ii) improving discovery of the underlying mechanisms by not missing variables that are implicated mechanistically in the disease processes; and (iii) facilitating the process of regulatory approval by establishing “in-silico equivalence” to previously validated predictive models (signatures) in a manner similar to bioequivalence of drugs, at the time of developing predictive models (molecular signatures) and predictor variable sets (biomarker panels) from one or more datasets.
Table 1 shows results of applying the TIE* method to 13 additional real datasets. TIE* was executed with the INDEPENDENCE criterion to verify Markov boundaries. The parameters of HITON-PC were selected to optimize the AUC of the SVM classifier by cross-validation as described in (Aliferis et al., 2009a; Aliferis et al., 2009b). The statistical comparison of the AUC estimates was performed using DeLong's test with alpha=0.05 (DeLong et al., 1988).
Tables 2A and 2B show gene expression microarray datasets used in the experiments reported in this patent document.
Table 3 shows results of experiments with an artificial dataset with 30 variables. 72 true maximally accurate and non-redundant predictive models (signatures) exist in this dataset.
Table 4 shows results of experiments with an artificial dataset with 1,000 variables. 72 true maximally accurate and non-redundant predictive models (signatures) exist in this dataset.
Tables 5A, 5B, and 5C show results for the number of output predictive models (signatures) (total/unique/unique and non-reducible), the number of genes in a signature, and classification performance in discovery and validation microarray datasets. The length of highlighting corresponds to the magnitude of the value (number of genes in a signature or classification performance) relative to other multiple signature discovery methods. The 95% intervals correspond to the observed [2.5-97.5] percentile interval over multiple signatures discovered by the method. Uniqueness and non-reducibility of each signature are assessed relative to the corresponding signature discovery method.
Tables 6A, 6B, and 6C show a list of 13 real datasets used in experiments to evaluate TIE*.