1. Technical Field
The present invention relates to data processing and analysis and, more particularly, to causal discovery and feature selection for classification.
2. Description of the Related Art
Advances in computing technology have allowed researchers across many fields of endeavor to collect and maintain vast amounts of observational statistical data such as clinical data, biological patient data, data regarding access of web sites, financial data, and the like. Using computational methods, these data can be processed in order to learn to classify a particular target variable of interest. For example, a patient can be classified into a high or low risk group given his or her demographic data, historical data, lab tests, etc.
As only recently proved, under certain conditions the observational statistical data also can be processed to induce causal relations among the observed quantities, also referred to as variables or features. For example, by processing such observational statistical data, it can be induced that smoking probabilistically causes lung cancer, that the placement of an item in a grocery store increases sales, or that increasing the amount of vacation time provided to factory workers increases productivity. Notably, the aforementioned causal relationships are induced without controlled experiments. See Spirtes, P., C. Glymour, and R. Scheines, Causation, Prediction, and Search, Cambridge, Mass., London, England: The MIT Press (Second ed. 2000).
While some of these data sets may include only 10 to 100 or so variables, others can include upwards of hundreds of thousands of variables. Within data sets of thousands of variables, variable selection, that is, the identification of a minimal, or close to minimal, subset of variables that best predicts the target variable of interest, can be difficult. The identification of the relevant variables that best predict a target variable, however, has become an essential component of quantitative modeling, data-driven construction of decision support models, and computer-assisted discovery.
Similarly, the identification of the variables that directly cause or are directly caused by the target variable is difficult. In this context, A is said to directly cause B if no other variable from the set of observed variables in the data causally intervenes between A and B. The problem of identifying a causal neighborhood around the target variable, that is, the variables that directly cause or are directly caused by the target variable, is referred to as local causal discovery. The identification of all direct causal relations among all variables is referred to as global causal discovery. The identification of the direct causes and direct effects of the target variable is extremely important with respect to manipulating various systems. For example, in order to build a drug to treat a disease, one needs to know what causes the disease, not simply how to predict or diagnose the occurrence of the disease (i.e., classify the patient as having the disease or not).
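The notion of a causal neighborhood can be illustrated with a minimal sketch. It assumes the causal graph is already known, which is precisely what local causal discovery must infer from data; the graph, the variable names, and the helper functions below are hypothetical and chosen only for illustration.

```python
# Hypothetical causal graph: each key maps a variable to the variables it
# directly causes. The variable names are illustrative assumptions only.
causal_graph = {
    "Smoking": ["LungCancer", "YellowFingers"],
    "Genetics": ["LungCancer"],
    "LungCancer": ["Cough", "Fatigue"],
    "YellowFingers": [],
    "Cough": [],
    "Fatigue": [],
}


def direct_causes(graph, target):
    """Variables with an arc pointing into the target."""
    return sorted(v for v, effects in graph.items() if target in effects)


def direct_effects(graph, target):
    """Variables that the target points to with an arc."""
    return sorted(graph.get(target, []))


def causal_neighborhood(graph, target):
    """Union of the direct causes and direct effects of the target."""
    causes = set(direct_causes(graph, target))
    effects = set(direct_effects(graph, target))
    return sorted(causes | effects)


if __name__ == "__main__":
    print(direct_causes(causal_graph, "LungCancer"))
    print(direct_effects(causal_graph, "LungCancer"))
    print(causal_neighborhood(causal_graph, "LungCancer"))
```

In this toy graph, YellowFingers shares a cause with LungCancer but is neither a direct cause nor a direct effect of it, so it falls outside the causal neighborhood even though it may be statistically associated with the target.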
The problem of variable selection for classification in biomedicine has become more pressing than ever with the emergence of such extremely large data sets. The number of variables in these data sets, as noted, can run into the hundreds of thousands, while the sample-to-variable ratio of the data sets remains small in comparison. Such data sets are common in gene-expression array studies, proteomics, and computational biology, for example, where one attempts to diagnose a patient given, as variables, the expression levels of tens of thousands of genes or the concentration levels of hundreds of thousands of protein fragments. With regard to medical diagnosis, the identification of relevant variables, for example as derived from lab tests, for a particular condition can help to eliminate redundant tests from consideration, thereby reducing risks to patients and lowering healthcare costs. Similarly, in biomedical discovery tasks, identification of causal relations can allow researchers to conduct focused and efficient experiments to verify these relations.
Other domains such as text-categorization, information retrieval, data mining of electronic medical records, consumer profile analysis, and temporal modeling share characteristics with the aforementioned domains. In particular, these domains also have a very large number of variables and a small sample-to-variable ratio. Identifying a reduced but still predictive variable set can significantly benefit computer, statistical, and paper-based decision support models pertaining to these fields in terms of understandability, user acceptance, speed of computation, and smoother integration into clinical practice and application in general.
The theory of causal discovery from observational data is based on Bayesian Networks, specifically a special class of Bayesian Networks known as Causal Bayesian Networks, or Causal Probabilistic Networks (CPNs). In a Causal Bayesian Network, an arc (edge) from a variable A to a variable B means that A causes B. The same theory also can be used for solving variable selection problems. Bayesian Networks are mathematical objects that capture probabilistic properties of the statistical data. One component of a Bayesian Network is a graph of causal relations, in which an arc from A to B depicts that A directly causes B. Further details regarding Bayesian Network theory are provided in Neapolitan, R. E., Probabilistic Reasoning in Expert Systems: Theory and Algorithms, John Wiley and Sons (1990).
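As a minimal sketch of how a Bayesian Network pairs a graph with probabilities, consider a hypothetical two-variable network with a single arc from A to B. The standard factorization for such a network writes the joint distribution as P(A, B) = P(A) P(B | A). All probability values and function names below are invented for illustration and are not drawn from any data set.

```python
# Hypothetical two-variable causal Bayesian Network with a single arc A -> B.
# All probabilities are invented for illustration only.
p_a = {True: 0.3, False: 0.7}

# Conditional probability table for B given A: p_b_given_a[a][b] = P(B=b | A=a).
p_b_given_a = {
    True: {True: 0.2, False: 0.8},
    False: {True: 0.01, False: 0.99},
}


def joint(a, b):
    """P(A=a, B=b), factorized along the graph as P(A=a) * P(B=b | A=a)."""
    return p_a[a] * p_b_given_a[a][b]


def marginal_b(b):
    """P(B=b), obtained by summing the joint over both values of A."""
    return sum(joint(a, b) for a in (True, False))


def posterior_a_given_b(a, b):
    """P(A=a | B=b), obtained from the joint by Bayes' rule."""
    return joint(a, b) / marginal_b(b)


if __name__ == "__main__":
    print(marginal_b(True))
    print(posterior_a_given_b(True, True))
```

With these invented numbers, marginal_b(True) evaluates to approximately 0.067. The same factorization extends to larger graphs, where each variable is conditioned on its direct causes in the graph.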
Bayesian Networks provide a useful tool for analyzing large data sets derived, for example, from the aforementioned domains. More particularly, Bayesian Networks provide a conceptual framework for problem solving within areas such as prediction, classification, diagnosis, modeling, decision making under uncertainty, and causal discovery. As such, Bayesian Networks have been the subject of a significant amount of research, which has yielded a variety of analysis tools and techniques.
Still, known techniques for inducing a Bayesian Network, and thus the causal relations, from statistical data and determining the relevant variables for the prediction of a target variable are limited. That is, known techniques for determining which variables influence a selected target variable are limited to operating upon data sets having only several hundred variables at most. Such conventional techniques are unable to scale upward to effectively process data sets having more than several hundred variables. Accordingly, conventional Bayesian Network analysis techniques are not able to work effectively upon larger data sets such as those derived from gene-expression array studies, proteomics, computational biology, text-categorization, information retrieval, data mining of electronic medical and financial records, consumer profile analysis, temporal modeling, or other domains.