One of the important tasks in medicine is to create useful statistical models for the prediction of disease. Research in this area typically involves conducting follow-up studies among populations in which the potential disease predictors are measured before the subjects experience a disease event, that is, a morbidity or even a mortality event. After the disease event, statistical procedures are used to quantify the relation between the predictors and the onset of the disease. Because of the great variation among human subjects, a study of this kind usually requires compiling a large sample size in order to create a meaningful prediction model. Studies of this nature also require careful design, well-controlled follow up and standardized predictor and disease outcome assessment. In brief, in addition to the extended time periods, considerable scientific and economic resources are required to conduct such studies.
Furthermore, it has long been known that more than one factor typically contributes to a given disease, and that these factors are often correlated. For example, both obesity and fat consumption contribute to heart disease, and obesity is itself correlated with fat consumption. Thus, since diseases are typically caused by multiple factors, a meaningful prediction model needs to reflect clearly and accurately the contribution of each of these factors. In the past, however, disease factors have typically been studied one at a time to determine how each independently contributes to the probability of getting a certain disease.
If the combined effect of various factors that contribute to disease risk is desired, then a study can be organized to concurrently measure each of these independent factors. A suitable multivariate regression equation, or its equivalent, can then be developed to combine these independent factors into an equation of the form
$$Y = a + \sum_i b_i X_i \qquad \text{(I)}$$
where Y represents disease outcome (e.g., the probability of getting coronary heart disease); the constant "a" represents the disease outcome level when all disease prediction factors are equal to zero; $X_i$ represents the disease prediction factor (e.g., smoking, drinking, blood pressure, cholesterol levels, etc.); and $b_i$, the partial regression coefficient, represents how much each factor contributes to disease outcome. The partial regression coefficient may be viewed as a weighting factor. This process may be performed to diagnose the existence of a current disease as well as to predict future disease onset.
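By way of illustration only, the evaluation of an equation of the form of equation I for a single subject may be sketched as follows. The function name `predict_outcome`, the factor names, and all numerical values are hypothetical and were chosen solely to make the arithmetic easy to follow; they are not taken from any study.

```python
def predict_outcome(intercept, coefficients, factors):
    """Evaluate Y = a + sum_i b_i * X_i for one subject.

    intercept    -- the constant "a"
    coefficients -- mapping of factor name to partial regression coefficient b_i
    factors      -- mapping of factor name to the subject's measured value X_i
    """
    missing = set(coefficients) - set(factors)
    if missing:
        # Every factor in the equation must have been measured for the subject.
        raise ValueError(f"missing factor values: {sorted(missing)}")
    return intercept + sum(b * factors[name] for name, b in coefficients.items())


# Hypothetical weights and measurements (illustrative units only).
a = 0.5
b = {"smoking": 2.0, "systolic_bp": 0.25}
x = {"smoking": 1, "systolic_bp": 120}

y = predict_outcome(a, b, x)
print(y)  # 0.5 + 2.0*1 + 0.25*120 = 32.5
```

The sketch makes the weighting-factor interpretation explicit: each measured value is multiplied by its coefficient, and the products are added to the constant "a".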
Many studies of this kind have been carried out in the last decade or so. For example, the Framingham Heart Disease Study, which started in the 1960's and is still on-going, involves two generations of study participants that total roughly 6000 subjects. One of the publications of this study is reported in Keaven Anderson et al., "An Updated Coronary Risk Profile--A Statement for Health Professionals", Circulation 83:356-62 (1991), and is incorporated by reference in its entirety. These types of studies have provided some helpful disease prediction tools. For example, the Framingham study produced a prediction equation for coronary heart disease (CHD) that has been widely used by physicians. This study is generally believed to be one of the best available prediction models. The disease prediction factors in the equation included age, blood pressure, smoking, cholesterol level, diabetes and ECG-left ventricular hypertrophy. The prediction equation has been estimated to account for about 60-70% of CHD among the general population. There have been, however, many other studies reporting risk factors for CHD that were not included in the Framingham prediction equation. Examples of such risk factors that are not included are family history, plasma fibrinogen, serum C-reactive protein, serum albumin, leukocyte count, serum homocysteine and physical exercise. One study reported that a single homocysteine measurement might be able to account for 10% of CHD risk.
Although the ongoing Framingham-type study could start collecting data on the newly identified risk factors for use in the prediction equation, it could take another 5 to 10 years to obtain a useful updated equation. This is because, with conventional statistical methods, the dependent variable Y and all independent variables X must be measured in the same study in order to estimate the partial regression coefficients. Thus, the Framingham prediction equation is slowly becoming outdated, while new studies showing the association of CHD with still other individual risk factors continue to appear. Furthermore, as each currently unknown risk factor is identified in the future, new studies including the additional risk factors would need to be undertaken once again, since the ultimate goal is to create an equation of the form of equation I in which all known risk factors that provide an independent contribution to disease risk are included.
It would, therefore, be desirable to conduct studies in which data are collected on a comprehensive list of all known disease prediction factors, since such studies, in addition to determining the independent contribution of each known risk factor, could also detect the synergistic contribution of multiple risk factors. Moreover, it would also be desirable to collect data, such as disclosed in co-pending Ser. No. 08/800,314, on as many other potentially significant risk factors as possible and then include the data in the same database, so that new risk factors could be identified and included in ever more powerful prediction models. In addition, it would be desirable to conduct these studies longitudinally, that is, with periodic data collection for each risk factor from the same individuals in a test population, over a long period of time. Then, also as disclosed in Ser. No. 08/800,314, the data on each risk factor for each individual in the study could be retained in the database so that the database would have the capability of including changes in an individual's disease prediction factors to develop the overall disease prediction equation.
However, because of the huge cost and large amount of time required to conduct and complete each new study involving the newly discovered risk factors, and because substantial amounts of meaningful data are already available, it would be desirable to have disease prediction models that make more effective use of the currently available data even while awaiting the results from the more comprehensive prediction models such as disclosed in Ser. No. 08/800,314. Unfortunately, for the data already available from separate studies, which each involve a limited and different subset of the known risk factors, but which in combination may include all currently known risk factors, there seems to be no method available for incorporating all the data of the comprehensive set of known risk factors into a single equation of the form of equation I.
The difficulties with the traditional methodologies may be illustrated in terms of a pair of very simple examples using hypothetical data. In one case, there is a study that compares systolic blood pressure as a function of age, body-mass-index (BMI) and cholesterol level. The problem is to determine how systolic blood pressure can be predicted as a simultaneous function of all three factors. If a study is undertaken that measures systolic blood pressure as a function of all three factors, then a prediction model of the form
$$\text{(systolic blood pressure)} = a + b_1\,(\text{age}) + b_2\,(\text{BMI}) + b_3\,(\text{cholesterol})$$
can be created by solving for each $b_i$ ($i$ = 1 to 3).
To create this model, a study is performed on a large population of N subjects (typically greater than 1,000 subjects). For each subject in the study, systolic blood pressure is measured and tabulated along with that subject's age, BMI and cholesterol level. The results of this hypothetical study are tabulated in a matrix, such as can be seen in Table 1.
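TABLE 1
_____________________________________________________________________________
                                        CHOLESTEROL      SYSTOLIC BLOOD
SUBJECT    AGE (years)   BMI (kg/m^2)   LEVEL (mg/dl)    PRESSURE (mm Hg)
_____________________________________________________________________________
1          35            27             150              120
2          42            26             212              150
--         --            --             --               --
N          --            --             --               --
_____________________________________________________________________________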
In this matrix, subject number 1 is 35 years old with a BMI of 27 (kg/m$^2$), a cholesterol level of 150 (mg/dl) and a systolic blood pressure of 120 (mm Hg). Subject number 2 is 42 years old with a BMI of 26, a cholesterol level of 212, and a systolic blood pressure of 150. These measurements are taken for all N subjects in the population. Once the matrix is complete, the following equation is solved using general linear regression:
$$b = (X'X)^{-1} X'Y \qquad \text{(II)}$$
where X is the N by 4 matrix of disease prediction factors (in this case, a column of ones, which corresponds to the intercept "a", plus the columns of age, BMI and cholesterol level in Table 1), Y is an N-dimensional outcome vector (in this case, the right-most column in Table 1), and b is the 4-dimensional regression-coefficient vector ($a$, $b_1$, $b_2$, $b_3$). It is clear from the above that all values for the X and Y matrices are needed in order to calculate the b vector. Thus, to use this traditional methodology, one study must be performed that measures the correlation of all risk factors with a particular disease or medical condition.
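The computation of equation II may be sketched, under stated assumptions, as follows. The sketch uses only the Python standard library and exact rational arithmetic; the helper names (`transpose`, `matmul`, `solve`, `fit_ols`) are hypothetical, and the five-subject data set is invented and noise-free (generated from a = 50, b1 = 1, b2 = 1, b3 = 1/10) so that the recovered coefficients can be checked by hand. It is not data from Table 1 or from any actual study, and a practical implementation would ordinarily rely on a numerical library rather than on hand-written elimination.

```python
from fractions import Fraction


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, rhs):
    """Solve a @ x = rhs by Gauss-Jordan elimination (a must be square)."""
    n = len(a)
    aug = [list(row) + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("X'X is singular; the factors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def fit_ols(rows, y):
    """Return b = (X'X)^-1 X'Y, where X is `rows` with a leading column of ones."""
    x = [[Fraction(1)] + [Fraction(v) for v in row] for row in rows]
    xt = transpose(x)
    xtx = matmul(xt, x)
    xty = [sum(xi * Fraction(yi) for xi, yi in zip(col, y)) for col in xt]
    # Solving (X'X) b = X'Y is equivalent to applying the inverse in equation II.
    return solve(xtx, xty)


# Hypothetical subjects: (age, BMI, cholesterol level) and systolic blood pressure.
factors = [(35, 27, 150), (42, 26, 210), (50, 30, 200), (61, 24, 180), (29, 22, 170)]
pressure = [127, 139, 150, 153, 118]

coefficients = fit_ols(factors, pressure)  # [a, b1, b2, b3]
print(coefficients)
```

Every row of `factors` must be complete for the matrix products to be formed, which is the practical meaning of the requirement that all X and Y values be measured in the same study.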
Consider, as another example, the case in which one study shows that the odds of a smoker getting lung cancer are 15 times higher than for a nonsmoker, and another study shows that the odds of getting lung cancer for a person who does not consume adequate quantities of yellow vegetables are 10 times higher than for a person who does. Based on these raw results standing alone, no known way exists to determine the relative contributions of both factors, and neither of these studies allows for estimating the contribution to disease risk simultaneously from both independent factors. This is because, absent a study that accounts for and measures every disease risk factor for a given disease, there is no way to know how the individual factors correlate with one another.
Since most diseases are typically correlated with a continuously growing list of risk factors, the cost and time required for conducting such studies rapidly become prohibitive. The net result is that such studies, though large in number, tend to be limited to an incomplete list of known risk factors for a specific disease or medical condition.
As a simple example to illustrate another aspect of the problem, assume that there is a study that measures the effects of age on coronary heart disease, and there is another study that measures the effects of cholesterol level and BMI on coronary heart disease. Additionally, assume for simplicity, although this is clearly not the case, that these are the only three known factors that contribute to coronary heart disease. This leads to two equations of the following form:
$$Y = a_1 + b_{\text{age}} X_1$$
and
$$Y = a_2 + b_{\text{chol}} X_2 + b_{\text{BMI}} X_3$$
where $X_1$ is age, $X_2$ is cholesterol level and $X_3$ is BMI. Each individual b represents how much that factor (e.g., age) contributes to disease onset, as measured by that study.
It is very difficult to combine these equations in any meaningful way to get an equation of the form:
$$Y = a + b_1 X_1 + b_2 X_2 + b_3 X_3$$
(where $b_1$, $b_2$ and $b_3$ do not necessarily equal $b_{\text{age}}$, $b_{\text{chol}}$ and $b_{\text{BMI}}$, respectively), because these studies, standing alone, provide no data on the correlations among the $X_i$. In other words, no methods appear to have been disclosed that combine the results of the above two equations so as to quantify how age, cholesterol level and BMI jointly relate to coronary heart disease. Thus, there are few comprehensive models for predicting future disease onset and diagnosing disease status based on all known risk factors. Additionally, the existing models are not as accurate as they could be in predicting disease onset or disease status since they typically include only a limited number of the known risk factors.
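The dependence of the joint coefficients on the correlation between factors may be illustrated with a simplified two-factor sketch. It assumes a linear model with two standardized factors (mean zero, unit variance), each studied alone in a separate study, so that the normal equations reduce to b1 + r*b2 = m1 and r*b1 + b2 = m2, where m1 and m2 are the single-factor slopes and r is the correlation between the factors. The function name `joint_coefficients` and all numerical values are hypothetical; the sketch is a textbook illustration of the difficulty described above and is not the method of the present invention.

```python
from fractions import Fraction


def joint_coefficients(m1, m2, r):
    """Joint slopes (b1, b2) for two standardized factors.

    m1, m2 -- slopes reported by two separate single-factor studies
    r      -- correlation between the two factors (reported by neither study)
    """
    if abs(r) >= 1:
        raise ValueError("factors must not be perfectly correlated")
    denom = 1 - r * r
    return (m1 - r * m2) / denom, (m2 - r * m1) / denom


# The same two single-factor slopes under two assumed correlations.
m1, m2 = Fraction(1, 2), Fraction(1, 2)
uncorrelated = joint_coefficients(m1, m2, Fraction(0))
correlated = joint_coefficients(m1, m2, Fraction(1, 2))

print(uncorrelated)  # joint slopes equal the single-factor slopes: 1/2 and 1/2
print(correlated)    # joint slopes shrink to 1/3 and 1/3
```

Identical single-factor results thus correspond to different joint equations depending on r, which is why the two studies, standing alone, do not determine the combined equation.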
The present invention is directed toward the problem of making more effective use of the currently available data, as well as providing a means for integrating newly acquired data in future studies of newly discovered risk factors, into a single comprehensive multivariate disease prediction equation.