1. Field of the Invention
The present invention is directed toward the field of data mining. More specifically, the invention provides a method of selecting particular variables from a large data set containing a plurality of variables to be used as nodes in a binary decision tree. The invention is particularly useful with large data sets in which some variables are partly co-linear. An example of this type of large data set is genomic data taken from human chromosomes which can be used to associate genotypic data with disease status.
2. Description of the Related Art
Binary decision trees are known in the field of data mining. Generally, the decision tree utilizes a search method to choose the best variable on which to base a particular decision making process. The best variable is chosen from a set of input variables in an input data set, where the outcome measure is known for each set of input variables. A hierarchy of decisions is built into the decision tree using a “yes/no” structure in order to arrive at an outcome from a set of known possible outcomes. At each node of the decision tree, the input data set is split into two subsets based on the value of the best variable at that point in the binary tree structure. The best variable is thus defined as the “node variable” because it is the variable that the decision tree branches from at that point in the path of the decision making process. The tree continues to branch to the next best available variable until some minimum statistical threshold is met, or until some maximum number of branches is formed. A subsequent set of input data values for each of the variables can then return a predicted outcome.
Using a binary decision tree is particularly useful in the study of genomic mapping. In such a study, a binary tree is constructed to match genetic markers to a phenotypic trait. One such phenotypic trait could be disease status. In this example, the binary tree categorizes whether or not a subject is likely to have a particular disease by selecting a path through the tree based on the values of the specific markers that form the nodes of the tree. The input data set can then be categorized into one of the disease outcomes, either affected or not affected.
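By way of illustration only, the traversal of such a tree can be sketched in Python. The tree layout, the variable names (marker_3, sex) and the outcome labels below are hypothetical and are not taken from any figure; the sketch assumes each node stores the node variable and the set of values that select the “yes” branch.

```python
def predict(node, subject):
    """Walk a binary decision tree until a leaf (a plain outcome label) is reached.

    Assumed (hypothetical) node layout: a dict with the node variable, the set
    of values that answer "yes", and the two child branches.
    """
    while isinstance(node, dict):
        answer = "yes" if subject[node["variable"]] in node["yes_values"] else "no"
        node = node[answer]
    return node


# Hypothetical two-level tree: branch first on a marker, then on a clinical variable.
tree = {
    "variable": "marker_3",
    "yes_values": {"AA", "Aa"},
    "yes": "not affected",
    "no": {
        "variable": "sex",
        "yes_values": {"male"},
        "yes": "affected",
        "no": "not affected",
    },
}

outcome = predict(tree, {"marker_3": "aa", "sex": "male"})
```

Each subject follows exactly one path from the root to a leaf, so the predicted outcome depends only on the node variables along that path.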
A known method for selecting the node variable that forms a node of the tree branch for the example genomic application is shown in FIG. 1. An input data set 10 includes a plurality of rows 12, each row defining a subject. Each column 14 describes a particular variable. The first variable is typically a patient identifier. Clinical variables 16, such as age and weight, are shown in columns 3 to N+1 where N is the number of clinical variables 16. Clinical variables 16 are variables that can generally be taken by an examiner or through a set of simple questions asked of the patient. In the columns after the clinical variables 16, a plurality of genomic markers (“marker variables”) 18, taken from the DNA of a cell of the patient, are recorded. In this example, twenty-five genetic markers 18 are recorded from each patient. The recording of the markers 18 requires utilizing at least one specialized instrument to take a sample and record the values of each of the twenty-five markers 18. The disease state 20 is the final column in the data set, and it is the outcome measure of the data set 10, i.e. whether the particular patient is affected or not. The disease state 20 is known for each subject in the input data set 10.
For each variable (clinical and marker), the values are binned into two groups. For instance, the clinical variable “sex” is binned into a male group and a female group. Other variables, such as the clinical variable “age”, are considered interval variables. An interval variable is a variable that has a continuous distribution over a particular range. The interval variable is initially separated into a user-defined number of bins. These bins are then grouped to form two bins. For example, the clinical variable age might first be reduced to 10 levels of 10 years each. The 10 levels will be grouped into 2 bins, based on the results of a statistical test described below. The process of reducing the variable to two bins will first measure the first level against the second through the tenth levels. The process continues by measuring the first and second levels against the third through the tenth, until eventually the first nine levels are measured against the tenth level. The best statistical result will define the delineation point for the variable.
The marker variables 18 are categorized by a bi-allelic genotype. Generally, these genotypes are referred to as AA, Aa, or aa. AA is the homozygous genotype for allele A, Aa is the heterozygous genotype, and aa is the homozygous genotype for allele a. Since three bi-allelic genotypes exist, the two bins are separated 30 into a pair of two genotypes and a single genotype for each marker 18. This binning is accomplished by a statistical step similar to that used for binning the clinical variables. Once the binning is completed, a statistical measure of correlation is calculated for each marker. An example of such a statistical calculation is the chi-squared statistic as referenced in “Principles and Procedures of Statistics: A Biometrical Approach”, pages 502–526, which is incorporated by reference herein. A plot 40 of one set of the chi-squared statistic is shown in FIG. 1. A large chi-squared statistic suggests a marker that is highly associated with the disease state. The most highly associated marker is selected for the first node in the binary tree by selecting the largest chi-squared statistic.
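For concreteness, the chi-squared statistic of a two-bin variable against a two-state outcome can be sketched as follows. This is the standard Pearson statistic for a 2×2 table without continuity correction; whether the known method applies a correction is not stated, so that choice is an assumption, as is the function name chi_squared_2x2.

```python
def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction) for the table

                     affected   not affected
        bin 1           a            b
        bin 2           c            d
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # An empty row or column gives no evidence of association.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical counts: bin 1 holds only affected subjects, bin 2 only unaffected.
perfect = chi_squared_2x2(10, 0, 0, 10)
# Hypothetical counts: the outcome is evenly spread across both bins.
none = chi_squared_2x2(5, 5, 5, 5)
```

A larger value indicates a stronger association between the two-bin variable and the disease state, which is why the largest statistic is used to choose the node variable.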
More specifically, the steps of building a binary decision tree for analyzing this type of data set are shown in FIGS. 2 and 3. FIG. 2 shows the method of building the decision tree. FIGS. 3A and 3B show the steps of creating the two bins for each variable. FIG. 3A shows the steps of creating the two bins for an interval variable, and FIG. 3B shows the steps of forming the two bins for variables other than interval variables.
Turning now to FIG. 2, the input data set 10 is provided to the method in step 50. The method is generally implemented as a software program operating on a general purpose computer system. At step 52, the user enters a number of algorithmic parameters, such as the number of passes through which the user wishes the tree to branch, a minimum value for the least significant chi-squared statistic, and the number of inputs. An input counter, “i”, and a maximum value, “MAXSOFAR”, are initialized at step 52. The first variable is then retrieved from the input data set for all subjects. Step 54 determines if the first variable is an interval variable. If the first variable is an interval variable, then it is passed to step 56 where the steps of FIG. 3A return a TEST value of the best chi-squared statistic from the two bin structure of the particular variable. If, however, the first variable is not an interval variable, then it is passed to the steps of FIG. 3B in step 58, which also returns a TEST value indicating the best chi-squared statistic for the particular variable.
Step 60 determines if the TEST value from step 56 or step 58 is greater than the MAXSOFAR value, i.e., is the chi-squared statistic for the current variable larger than the chi-squared values for all the previously analyzed variables. If the TEST value is greater, then the TEST value is stored 62 as the MAXSOFAR value and the input counter is updated 64. If the TEST value is not larger than MAXSOFAR, then the input counter is updated 64 without storing the TEST result. Step 66 determines if the input counter (i) is less than the number of input variables in the data set. If the input counter (i) is less than the number of inputs, control returns to step 54 using the next variable for the determining step 54. Once the input counter (i) is equal to the number of input variables, step 68 determines if MAXSOFAR is less than a user-defined parameter, MINCHISQ, which is the minimum significant chi-squared statistic the user wants to retain in the binary tree. If MAXSOFAR is less than MINCHISQ, the binary tree is output in step 70. If MAXSOFAR is greater than MINCHISQ, then step 72 determines if the number of passes is less than the maximum number of passes the user has set. If the number of passes is greater than the maximum number of passes, then the variables chosen as node variables are passed to the output step 70. If, however, the maximum number of passes has not been reached, then at step 76 the data set is divided into two data sets based on the two bins that were used to determine the best chi-squared statistic and control reverts back to step 52, where the counter variables are reset and another pass through the algorithm is executed.
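The selection loop of steps 60 through 68 can be sketched as shown below. The sketch assumes the TEST value of each variable has already been computed and is supplied as a mapping from variable name to statistic; the function name, the variable names and the numeric values are hypothetical.

```python
def select_node_variable(test_values, min_chisq):
    """Return (node_variable, MAXSOFAR) for one pass, or (None, MAXSOFAR)
    when the best statistic falls below MINCHISQ and branching should stop.

    test_values: dict mapping variable name -> TEST (best chi-squared value).
    """
    max_so_far = 0.0
    node_variable = None
    for name, test in test_values.items():
        # Step 60: keep the variable only if it beats everything seen so far.
        if test > max_so_far:
            max_so_far = test
            node_variable = name
    # Step 68: stop if even the best statistic is below the user's threshold.
    if max_so_far < min_chisq:
        return None, max_so_far
    return node_variable, max_so_far


# Hypothetical TEST values for three input variables.
tests = {"age": 3.1, "marker_7": 12.4, "marker_8": 9.0}
chosen = select_node_variable(tests, min_chisq=3.84)
stopped = select_node_variable(tests, min_chisq=20.0)
```

Note that the loop retains only the single largest statistic; it takes no account of the values at neighboring variables.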
FIG. 3A generates the chi-squared statistic for interval variables. Step 78 starts the process. Step 80 queries the user for the number of levels the variable will be split into, and defines it as k. The interval range (maximum value of the variable minus the minimum value of the variable) is divided 82 into k bins. The k bins are then collapsed in step 84 into two bins. A 2×2 contingency table is formed in step 86 and then the chi-squared statistic is calculated at step 88. Step 90 determines if more combinations of splitting the k bins into 2 bins can be accomplished. If another combination that has not been tested exists, the process returns to step 84. If no more combinations exist, step 92 finds the maximum value of the chi-squared statistic from the combinations tested, and returns this value as the TEST value in step 94.
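A minimal sketch of the FIG. 3A steps follows. It assumes equal-width levels, an outcome coded as 1 (affected) or 0 (not affected), and the uncorrected Pearson statistic; only the k-1 ordered cut points described earlier (level 1 against levels 2 through k, and so on) are tested. All names and data values are hypothetical.

```python
def chi_squared_2x2(a, b, c, d):
    """Uncorrected Pearson chi-squared statistic for a 2x2 table."""
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else (a + b + c + d) * (a * d - b * c) ** 2 / denom


def best_interval_split(values, affected, k):
    """Return (TEST, cut): the best statistic and the number of low levels
    placed in the first bin. cut is None if no split separates the data."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return 0.0, None
    # Step 82: assign each subject to one of k equal-width levels (0 .. k-1).
    levels = [min(int((v - lo) / width), k - 1) for v in values]
    best_stat, best_cut = 0.0, None
    for cut in range(1, k):                      # steps 84-90
        a = b = c = d = 0
        for level, y in zip(levels, affected):
            if level < cut:
                a, b = a + (y == 1), b + (y == 0)
            else:
                c, d = c + (y == 1), d + (y == 0)
        stat = chi_squared_2x2(a, b, c, d)       # steps 86-88
        if stat > best_stat:                     # step 92
            best_stat, best_cut = stat, cut
    return best_stat, best_cut


# Hypothetical ages; the five oldest subjects are affected.
ages = [5, 15, 25, 35, 45, 55, 65, 75, 85, 95]
status = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
test_value, cut = best_interval_split(ages, status, k=10)
```

With these values the best delineation point places the first five levels in one bin and the remaining five in the other.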
FIG. 3B generates the chi-squared statistic for non-interval variables. Step 96 starts the process. The variable is collapsed 98 into two bins. The 2×2 contingency table is formed 100 and then the chi-squared statistic is calculated 102. Step 104 determines if more combinations of splitting the variable into 2 bins can be accomplished. If another combination that has not been tested exists, the process returns to step 98. If no more combinations exist, step 106 finds the maximum value of the chi-squared statistic from the combinations, and returns this value as the TEST value in step 108. For example, a marker variable has three possible genotypes, AA, Aa or aa. These genotypes can be combined into three different bin combinations (AA and Aa, Aa and aa, AA and aa).
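The three bin combinations for a marker variable can be enumerated as in the sketch below, which assumes the uncorrected Pearson statistic and an outcome coded as 1 (affected) or 0 (not affected). The function names and the sample genotypes are hypothetical.

```python
def chi_squared_2x2(a, b, c, d):
    """Uncorrected Pearson chi-squared statistic for a 2x2 table."""
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else (a + b + c + d) * (a * d - b * c) ** 2 / denom


# Each combination pairs two genotypes in the first bin and leaves one alone.
GROUPINGS = [
    ({"AA", "Aa"}, {"aa"}),
    ({"Aa", "aa"}, {"AA"}),
    ({"AA", "aa"}, {"Aa"}),
]


def best_genotype_split(genotypes, affected):
    """Return (TEST, grouping) for the bin combination with the largest statistic."""
    best_stat, best_grouping = 0.0, None
    for pair_bin, single_bin in GROUPINGS:           # steps 98-104
        a = b = c = d = 0
        for g, y in zip(genotypes, affected):
            if g in pair_bin:
                a, b = a + (y == 1), b + (y == 0)
            else:
                c, d = c + (y == 1), d + (y == 0)
        stat = chi_squared_2x2(a, b, c, d)           # steps 100-102
        if stat > best_stat:                         # step 106
            best_stat, best_grouping = stat, (pair_bin, single_bin)
    return best_stat, best_grouping


# Hypothetical marker: every aa subject is affected, no others are.
genos = ["AA", "AA", "Aa", "Aa", "aa", "aa", "aa", "aa"]
status = [0, 0, 0, 0, 1, 1, 1, 1]
test_value, grouping = best_genotype_split(genos, status)
```

Here the grouping of AA and Aa against aa yields the largest statistic and would be returned as the TEST value for this marker.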
Using this process (FIG. 2) for genomic data, if the largest chi-squared statistic was generated, for example, at marker 18, the data set would be split into two subsets based on the value of marker 18. As shown in FIG. 4, the two data sets would be split as Data Set 1 and Data Set 2. Data Set 1 includes patients whose value for marker 18 is the bi-allelic genotype AA or Aa. Data Set 2 includes patients whose bi-allelic genotype is aa. Each data set would then be passed back to step 52 where the process of determining the best bins for each variable, calculating the chi-squared statistic for each remaining variable, and then recording the variable with the largest chi-squared statistic (the node variable) would be repeated. This process of determining the node variables at each pass through the algorithm is repeated until one of the parameter thresholds is met (either MINCHISQ or the maximum number of passes). Each pass through the algorithm builds another layer into the decision tree.
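The division of the data set at a node can be sketched as follows. Rows are assumed to be dictionaries keyed by variable name; the key marker_18, the subject identifiers and the genotype values are hypothetical.

```python
def split_on_marker(rows, marker, first_bin):
    """Divide the subjects into two data sets according to the two bins
    that produced the best chi-squared statistic for the node variable."""
    data_set_1 = [row for row in rows if row[marker] in first_bin]
    data_set_2 = [row for row in rows if row[marker] not in first_bin]
    return data_set_1, data_set_2


# Hypothetical subjects with a genotype recorded for one marker.
subjects = [
    {"id": 1, "marker_18": "AA", "affected": 0},
    {"id": 2, "marker_18": "Aa", "affected": 0},
    {"id": 3, "marker_18": "aa", "affected": 1},
    {"id": 4, "marker_18": "aa", "affected": 1},
]

# Data Set 1 holds genotypes AA or Aa; Data Set 2 holds genotype aa.
ds1, ds2 = split_on_marker(subjects, "marker_18", {"AA", "Aa"})
```

Each of the two resulting data sets would then be analyzed independently on the next pass.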
Genomic data, such as the marker genotypes, can be correlated with nearby neighbors. One problem of the prior art method is that it is possible for a false identifier to have a large chi-squared statistic in such a large data set. A false identifier will branch a binary tree at a marker location that is not a good predictor of the disease state. Further branching of the tree after a false identifier could produce results of little predictive value. An example of a false identifier 120 is shown in the plot 40 of FIG. 1. The peak of the false identifier 120 is the largest peak in the chi-squared data, but the peak does not represent the best choice for the node variable because the secondary peak 122 shows a more pronounced region of large chi-squared values. Since it is known that adjacent neighbors are correlated, the secondary peak 122 is a better choice for the node variable. The prior art method cannot identify a false identifier and therefore would use the false identifier as the node variable at that level of the decision tree.
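The shortcoming can be illustrated with a short sketch. The chi-squared values below are invented for illustration and are not the data of plot 40; they contain one isolated spike and one broader region of elevated values.

```python
# Hypothetical chi-squared statistics for fourteen adjacent markers.
chi_squared = [1.0, 1.2, 0.8, 14.0, 1.1, 0.9, 1.0,
               6.0, 9.0, 11.5, 12.0, 10.5, 7.0, 1.3]

# The prior art rule: take the single largest statistic as the node variable.
selected = max(range(len(chi_squared)), key=chi_squared.__getitem__)

# Index of the largest value inside the broad region (positions 7 through 12).
region_peak = max(range(7, 13), key=chi_squared.__getitem__)
```

Selecting by the maximum alone returns the isolated spike at position 3, even though the neighbors of position 10 are consistently elevated, because the rule never examines adjacent markers.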