1. Field of the Invention
In many areas, recent developments have generated very large datasets from which it is desired to extract meaningful relationships among the dataset elements. To date, however, finding such relationships using prior art methods has proved extremely difficult, especially in the biomedically related arts. For those fields, as well as others, the need for new, more powerful methods is critical. Methods for local causal learning and Markov blanket discovery are important recent developments in pattern recognition and applied statistics, primarily because they offer a principled solution to the variable/feature selection problem and give insight into local causal structure. As presented in a large body of literature, discovery of local causal relationships is significant because it plays a central role in causal discovery and classification, because of its scalability benefits, and because, by naturally bridging causation with predictivity, it provides significant benefits in feature selection for classification (Aliferis et al., 2010a; Aliferis et al., 2010b). More specifically, solving the local causal induction problem helps in understanding how natural and artificial systems work; it helps identify what interventions to pursue in order for these systems to exhibit desired behaviors; under certain assumptions, it provides minimal feature sets required for classification of a chosen response/target variable with maximum predictivity; and, finally, local causal discovery can form the basis of efficient methods for learning the global causal structure of all variables in the data. The present invention is a novel method to discover a Markov boundary and the set of direct causes and direct effects of a response variable from data. This method has been applied to create classification computer systems in many areas, including medical diagnosis, text categorization, and others (Aliferis et al., 2010a; Aliferis et al., 2010b).
In general, the inventive method can serve two purposes. First, it can transform a dataset with many variables into a minimal reduced dataset in which all variables are needed for optimal prediction of the response variable. For example, medical researchers have been trying to identify genes useful for early diagnosis of complex human diseases by analyzing samples from patients and controls with gene expression microarrays. However, they have been frustrated in their attempts to identify the critical elements by the highly complex pattern of expression results obtained, often with thousands of genes associated with the phenotype. The invention allows transformation of a gene expression microarray dataset with thousands of genes into a much smaller dataset containing only the genes that are necessary for optimal prediction of the phenotypic response variable. Second, the inventive method described herein can transform a dataset with many thousands of variables into a reduced dataset in which all variables are direct causes and direct effects of the response variable. For example, epidemiologists are trying to identify what causes infants to die within the first year of life by analyzing electronic medical record and census data. However, it is challenging to determine which of the many associated variables in these data are causing infant deaths. The invention allows transformation of such an infant mortality dataset with many dozens of variables into a set of a half-dozen or fewer putative causes of infant death.
The broad applicability of the invention is first demonstrated with 13 real datasets from a diversity of application domains, where the inventive method identifies Markov blankets of the response variable with higher classification accuracy and fewer features than baseline comparison methods of the prior art. The power of the invention is subsequently demonstrated on data simulated from Bayesian networks from several problem domains, where the inventive method identifies the sets of direct causes and direct effects and the Markov blankets more accurately than the prior art baseline comparison methods.
2. Description of Related Art
Firm theoretical foundations of Bayesian networks were laid down by Pearl and his co-authors (Pearl, 1988). Furthermore, all local learning methods exploit either the constraint-based framework for causal discovery developed by Spirtes, Glymour, Scheines, Pearl, and Verma and their co-authors (Pearl, 2000; Pearl and Verma, 1991; Spirtes et al., 2000) or the search-and-score Bayesian network learning framework introduced by Cooper and Herskovits (Cooper and Herskovits, 1992).
While the above foundations were introduced and developed over the span of at least the last 30 years, local learning is no more than 10 years old. Specialized Markov blanket learning methods were first introduced in 1996 by Koller and Sahami (Koller and Sahami, 1996), incomplete causal methods in 1997 (Cooper, 1997), and local causal discovery methods (for targeted complete induction of direct causes and effects) were first introduced in 2002 and 2003 (Aliferis and Tsamardinos, 2002a; Tsamardinos et al., 2003b). In 1996 Koller and Sahami introduced a heuristic method for inducing the Markov blanket from data and tested the method on simulated, real text, and other types of data from the UCI repository (Koller and Sahami, 1996). In 1997 Cooper and colleagues introduced and applied the heuristic method K2MB for finding the Markov blanket of a target variable in the task of predicting pneumonia mortality (Cooper et al., 1997). In 1997 Cooper introduced an incomplete method for causal discovery (Cooper, 1997). The method was able to circumvent the lack of scalability of global methods by returning a subset of arcs from the full network. To avoid notational confusion, it is important to point out that Cooper's method was termed LCD (local causal discovery) despite being an incomplete rather than local method as local methods are defined in the present patent document (i.e., focused on some user-specified target variable or localized region of the network). A revision of the method, termed LCD2, was presented by Mani and Cooper (Mani and Cooper, 1999). A Bayesian local causal discovery method that is truly local and finds the direct causal effects of a target variable was introduced by Mani and Cooper (Mani and Cooper, 2004).
In 1999 Margaritis and Thrun introduced the GS method with the intent of inducing the Markov blanket for the purpose of speeding up global network learning (i.e., not for feature selection) (Margaritis and Thrun, 1999). GS was the first published sound Markov blanket induction method. The weak heuristic used by GS, combined with the need to condition on at least as many variables simultaneously as the Markov blanket size, makes it impractical for many typical datasets, since the required sample grows exponentially with the size of the Markov blanket. This in turn forces the method to stop its execution prematurely (before it identifies the complete Markov blanket) because it cannot grow the conditioning set while performing reliable tests of independence. Evaluations of GS by its inventors were performed on datasets with a few dozen variables, leaving its potential for scalability largely unexplored.
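The grow-shrink scheme underlying GS can be sketched as follows. The tiny network, the variable names, and the use of an oracle d-separation test in place of statistical independence tests on data are illustrative assumptions for this sketch, not part of GS as published:

```python
from collections import deque

# Illustrative DAG (assumed): X1 -> T -> X2 <- X3, with X4 isolated.
# The Markov blanket of T is {X1, X2, X3} (parent, child, spouse).
PARENTS = {"T": ["X1"], "X2": ["T", "X3"], "X1": [], "X3": [], "X4": []}

def ancestors(nodes, parents):
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(x, y, z, parents):
    """Oracle conditional independence: moral-graph reachability on the
    ancestral subgraph of {x, y} union z (Lauritzen's criterion)."""
    keep = ancestors({x, y} | set(z), parents)
    adj = {n: set() for n in keep}
    for n in keep:
        ps = [p for p in parents[n] if p in keep]
        for p in ps:                      # undirected parent-child edges
            adj[n].add(p); adj[p].add(n)
        for i in range(len(ps)):          # "marry" co-parents
            for j in range(i + 1, len(ps)):
                adj[ps[i]].add(ps[j]); adj[ps[j]].add(ps[i])
    blocked, frontier, seen = set(z), deque([x]), {x}
    while frontier:
        n = frontier.popleft()
        if n == y:
            return False  # reachable => d-connected => dependent
        for m in adj[n]:
            if m not in seen and m not in blocked:
                seen.add(m)
                frontier.append(m)
    return True

def grow_shrink_mb(t, variables, indep):
    """GS-style Markov blanket induction given an independence test."""
    s, changed = [], True
    while changed:  # grow: admit any variable dependent on t given s
        changed = False
        for x in variables:
            if x != t and x not in s and not indep(t, x, s):
                s.append(x)
                changed = True
    for x in list(s):  # shrink: remove variables rendered independent
        if indep(t, x, [v for v in s if v != x]):
            s.remove(x)
    return set(s)

indep = lambda a, b, z: d_separated(a, b, z, PARENTS)
print(sorted(grow_shrink_mb("T", list(PARENTS), indep)))  # -> ['X1', 'X2', 'X3']
```

Note how the shrink phase conditions on nearly the entire tentative blanket at once; with finite data rather than an oracle, this is precisely the conditioning-set growth that drives the exponential sample requirement described above.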
The use of such methods has grown in importance in the field of biology as researchers have tried to sort out causes of disease and understand cellular processes. In 2001 Cheng et al. applied the TPDA method (a global Bayesian network learner) (Cheng et al., 2002a) to learn the Markov blanket of the target variable in the Thrombin dataset in order to solve a prediction problem of drug effectiveness on the basis of molecular characteristics (Cheng et al., 2002b). Because TPDA could not be run efficiently with more than a few hundred variables, they pre-selected 200 variables (out of 139,351 total) using univariate filtering. Although this procedure will not in general find the true Markov blanket (because otherwise-unconnected spouses can be missed, many true parents and children may not be among the first 200 variables, and many non-Markov blanket members cannot be eliminated), the resulting classifier performed very well, winning the 2001 KDD Cup competition.
Friedman et al. proposed a simple bootstrap procedure for determining membership in the Markov blanket in small-sample situations (Friedman et al., 1999a). The Markov blanket in Friedman's method is extracted from the full Bayesian network learned by the SCA learner (Friedman et al., 1999b).
In 2002 and 2003 Tsamardinos, Aliferis, et al. presented a modified version of GS, termed IAMB, and several variants of it that, through the use of a better inclusion heuristic than GS and optional post-processing of the tentative and final output of the local method with global learners, achieve true scalability to datasets with many thousands of variables and applicability in modest (but not very small) samples (Aliferis et al., 2002; Tsamardinos et al., 2003a). IAMB and several variants were tested both on the high-dimensional Thrombin dataset (Aliferis et al., 2002) and on datasets simulated from both existing and random Bayesian networks (Tsamardinos et al., 2003a). The former study found that IAMB scales to high-dimensional datasets. The latter study compared IAMB and its variants to GS, Koller-Sahami, and PC and concluded that IAMB variants on average yield the best results among the datasets tested.
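The inclusion heuristic that distinguishes IAMB from GS (admitting, at each step, the single candidate with the strongest conditional association with the target, rather than any dependent candidate) can be sketched as follows. The linear-Gaussian simulated data, the partial-correlation association measure, and the fixed threshold are illustrative assumptions for this sketch, not the specific tests used in the published method:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
# Illustrative linear-Gaussian network: X1 -> T -> X2 <- X3, X4 irrelevant.
# The Markov blanket of T is therefore {X1, X2, X3}.
X1 = rng.standard_normal(n)
X3 = rng.standard_normal(n)
X4 = rng.standard_normal(n)
T  = 0.9 * X1 + rng.standard_normal(n)
X2 = 0.8 * T + 0.8 * X3 + rng.standard_normal(n)
data = {"X1": X1, "X2": X2, "X3": X3, "X4": X4, "T": T}

def assoc(a, b, cond):
    """|partial correlation| of a and b after regressing out cond."""
    x, y = data[a], data[b]
    if cond:
        Z = np.column_stack([data[c] for c in cond] + [np.ones(n)])
        x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
        y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return abs(np.corrcoef(x, y)[0, 1])

def iamb(t, variables, threshold=0.05):
    mb = []
    while True:  # grow: admit the single strongest associated candidate
        cands = [x for x in variables if x != t and x not in mb]
        if not cands:
            break
        best, x = max((assoc(t, x, mb), x) for x in cands)
        if best < threshold:
            break
        mb.append(x)
    for x in list(mb):  # shrink: remove variables rendered independent
        if assoc(t, x, [v for v in mb if v != x]) < threshold:
            mb.remove(x)
    return set(mb)

print(sorted(iamb("T", ["X1", "X2", "X3", "X4"])))  # -> ['X1', 'X2', 'X3']
```

Because only the strongest candidate enters the blanket at each step, the tentative blanket stays closer to the true one throughout the grow phase, which is what reduces the size of the conditioning sets relative to GS and improves sample efficiency.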
In 2003 Tsamardinos and Aliferis presented a full theoretical analysis explaining relevance as defined by Kohavi and John (Kohavi and John, 1997) in terms of the Markov blanket and causal connectivity (Tsamardinos and Aliferis, 2003). They also provided theoretical results about the strengths and weaknesses of filter versus wrapper methods, the impossibility of a universal definition of relevance, and the optimality of the Markov blanket as a solution to the feature selection problem in formal terms.
The extension of SCA to create a local-to-global learning strategy was first introduced in (Aliferis and Tsamardinos, 2002b) and led to the MMHC method introduced and evaluated in (Tsamardinos et al., 2006). MMHC was shown in (Tsamardinos et al., 2006) to achieve best-of-class performance in quality and scalability compared to most state-of-the-art full causal probabilistic network learning methods.
In 2002 Aliferis et al. also introduced parallel and distributed versions of the IAMB family of methods (Aliferis et al., 2002). An early attempt to develop the generative local learning methods described in this specification was made by Aliferis and Tsamardinos in 2002 for the explicit purpose of reducing the sample size requirements of IAMB-style methods (Aliferis and Tsamardinos, 2002a).
In 2003 Aliferis et al. introduced the method HITON (Aliferis et al., 2003a), and Tsamardinos et al. introduced the methods MMPC and MMMB (Tsamardinos et al., 2003b). These were the first concrete methods to find sets of direct causes or direct effects and Markov blankets in a scalable and efficient manner. HITON was tested on 5 biomedical datasets spanning clinical, text, genomic, structural, and proteomic data and was compared against several feature selection methods, with excellent results in parsimony and classification accuracy (Aliferis et al., 2003a). MMPC was tested on data simulated from human-derived Bayesian networks, with excellent results in quality and scalability. MMMB was tested on the same datasets and compared to prior methods such as the Koller-Sahami method and IAMB variants, with superior results in the quality of Markov blankets. These benchmarking and comparative evaluation experiments provided evidence that the local learning approach held not only theoretical but also practical potential.
The HITON-PC, HITON-MB, MMPC, and MMMB methods lacked a so-called “symmetry correction” (Tsamardinos et al., 2006); however, HITON used a wrapping post-processing step that, at least in principle, removed this type of false positive. The symmetry correction was introduced in 2005 and 2006 by Tsamardinos et al. in the context of the introduction of MMHC (Tsamardinos et al., 2005; Tsamardinos et al., 2006). Peña et al. also published work pointing to the need for a symmetry correction in MMPC (Peña et al., 2005b).
In 2003 Frey et al. explored the idea of using decision tree induction to indirectly approximate the Markov blanket (Frey et al., 2003). The results were promising; however, a main problem with the method was that it required a threshold parameter that could not be optimized easily.
In 2004 Mani et al. introduced BLCD-MB, which resembles IAMB but uses a Bayesian scoring metric rather than conditional independence testing (Mani and Cooper, 2004). The method was applied with promising results in infant mortality data (Mani and Cooper, 2004).
A method for learning regions around target variables by recursive application of MMPC or other local learning methods was introduced in (Tsamardinos et al., 2003c). Peña et al. applied interleaved MMPC for learning regions in the domain of bioinformatics (Peña et al., 2005a).
In 2006 Gevaert et al. applied K2MB for the purpose of learning classifiers that could be used for prognosis of breast cancer from microarray and clinical data (Gevaert et al., 2006). Univariate filtering was used to select 232 genes before applying K2MB.
Other recent efforts in learning Markov blankets include the following methods: PCX, which post-processes the output of PC (Bai et al., 2004); KIAMB, which addresses some violations of faithfulness using a stochastic extension to IAMB (Peña et al., 2007); FAST-IAMB, which speeds up IAMB (Yaramakala and Margaritis, 2005); and MBFS, which is a PC-style method that returns a graph over Markov blanket members (Ramsey, 2006).
Use of Prior Methods to Understand Biological Systems:
HITON was applied in 2005 to understand physician decisions and guideline compliance in the diagnosis of melanomas (Sboner and Aliferis, 2005). HITON has been applied to the discovery of biomarkers in human cancer data using microarrays and mass spectrometry, and is also implemented in the GEMS and FAST-AIMS systems for the automated analysis of microarray and mass spectrometry data, respectively (Fananapazir et al., 2005; Statnikov et al., 2005b). In a recent extensive comparison of biomarker selection methods (Aliferis et al., 2006a; Aliferis et al., 2006b), it was found that HITON outperforms 16 state-of-the-art representatives of all major prior art biomarker method families in terms of combined classification performance and feature set parsimony. This evaluation utilized 9 human cancer datasets (gene expression microarray and mass spectrometry) in 10 diagnostic and outcome (i.e., survival) prediction classification tasks. In addition to the above real data, resimulation was also used to create two gold-standard network structures, one re-engineered from human lung cancer data and one from yeast data. Several applications of HITON in text categorization have been published, in which the method was used to understand complex “black box” SVM models and convert complex models into Boolean queries usable by Boolean interfaces of Medline (Aphinyanaphongs and Aliferis, 2004), to examine the consistency of editorial policies in published journals (Aphinyanaphongs et al., 2006), and to predict drug-drug interactions (Duda et al., 2005). HITON was also compared, with excellent results, to manual and machine feature selection in the domain of early graft failure in patients with liver transplantation (Hoot et al., 2005).