1. Field of the Invention
The present invention relates to the field of data analysis, and more specifically to identifying and/or compensating for influencers that can have an impact on a statistical outcome.
2. Background
Databases are often created as a by-product of normal operations such as health care, retail sales, and loan processing, and it is often possible to extract highly useful information by properly analyzing these databases. For example, medical researchers may obtain valuable insights into disease progression, adverse side effects of medications, or typical patient characteristics from hospital databases; buyers can identify important purchasing patterns from inventory or point-of-sale data; and bank analysts can develop fair and accurate criteria for screening loan applicants by examining the payment histories of previous borrowers.
Conventional techniques for analyzing such databases, however, are susceptible to errors that may be introduced by “confounders.” Confounders are factors whose significance has been overlooked by the data analyst, but nevertheless influence the outcome of interest. An excellent example of a confounder's impact on an outcome can be found in the following data, taken from “The Effectiveness of Adjustment by Subclassification in Removing Bias in Observational Studies for Causal Effects” by W. G. Cochran, Biometrics, v. 24, pp. 295–313 (1968).
Non-CigaretteSmoking statussmokersmokerMortality rates20.220.5per 1000 person-years
Taking this data at face value could lead one to the incorrect conclusion that cigarette smoking is not harmful. A more in-depth analysis reveals, however, that the above results were confounded by age—it turns out that the nonsmokers represented in the database were significantly older than the cigarette smokers, with average ages of 54.9 years and 50.5 years, respectively. When the above mortality rates are adjusted for age, the results are as follows:
Non-CigaretteSmoking statussmokersmokerAge-adjusted mortality rates20.229.5per 1000 person-years
Analysts with an in-depth understanding of a particular subject matter may be able to recognize the impact of a confounder, and eventually track down the source of error. But analysts who do not recognize the presence of a confounder may reach an incorrect conclusion.
One prior art approach for avoiding the effects of confounders is to carefully design an experiment or scientific trial using a control group. A particular factor (e.g., receiving a particular drug) is then randomly varied among the participants, and the results are observed. This approach is commonly used in medical and scientific research to verify a hypothesis. Unfortunately, this approach is very expensive to implement, because it requires performing new experiments and data analysis to test each and every hypothesis, or risking that a flawed hypothesis will be accepted and perhaps acted on.
Another prior art approach for avoiding the effects of confounders is by using preexisting data (e.g., from an existing database), and obtaining the participation of an expert in the relevant domain (e.g., a medical doctor) and a statistician to compensate for confounders for each hypothesis proposed by a data analyst. This approach can provide high quality results, but makes inefficient use of the domain expert's time and the statistician's time, who may be asked similar questions by multiple data analysts. And due to the heavy involvement of the domain expert and statistician, this prior art approach is also expensive to implement.
The inventors have recognized the need to improve the existing situation, and to enable researchers to form and verify their hypotheses more easily, without relying so heavily on freshly obtained experimental data to verify each hypothesis, and without relying on close cooperation with domain experts and statisticians.