Data mining, knowledge discovery, and other forms of data analysis involve the extraction of useful information from vast amounts of accumulated data. For example, pharmaceutical companies are creating large databases listing drug compounds and their features, such as which diseases are effectively treated by which drug compound and what side effects each drug compound has. Given the large number of different drug compounds, it is difficult to manually analyze this data to ascertain useful patterns, such as determining which groups of drugs are more or less effective in treating each of a group of diseases, especially when the desired groupings of drugs and diseases are not identified beforehand.
Conventional data mining techniques use pattern recognition and probabilistic analyses to generate decision trees. A decision tree is a data structure that contains a hierarchical arrangement of rules that successively indicates how to classify an object into a plurality of classes. More specifically, each object is characterized by a number of attributes, and each rule in the decision tree tests the value of one of the attributes. Decision trees separate out data into sets of rules that are likely to have a different effect on a target variable. For example, one might want to find the characteristics of a drug compound and its method of administration that are likely to be effective in treating a particular disease. These characteristics can be translated into a set of rules.
FIG. 5 depicts an exemplary decision tree 500 that represents how to treat a hypothetical medical condition for a patient. The exemplary decision tree 500 comprises two branch nodes 510 and 530, three leaf nodes 520, 540, and 550, and arcs 512, 514, 532, and 534.
Each of the branch nodes 510 and 530 represents a “rule” or condition that indicates how to choose between a number of possible values for a particular attribute of the patient. The possible values that the attribute may take are indicated by the arcs 512, 514, 532, and 534. When a choice among the possible values is made, the corresponding arc is taken to reach a leaf node or another branch node. One of the branch nodes, branch node 510, is designated as the “root” node, which is the starting point of the decision tree.
In the example, root branch node 510 is labeled “AGE?” and indicates that the age of the patient is tested. Arc 512, which connects branch node 510 to leaf node 520, is labeled “<=12?”, indicating that leaf node 520 is to be reached if the age of the patient is less than or equal to 12. On the other hand, arc 514 connects branch node 510 to branch node 530 and is labeled “>12?”, which indicates that branch node 530 is to be reached if the age of the patient is greater than 12. Branch node 530 is labeled “TEMP?” to indicate that the body temperature of the patient is tested. If the body temperature of the patient is less than or equal to 102° (indicated by arc 532), then leaf node 540 is reached; otherwise, if the body temperature of the patient is greater than 102° (indicated by arc 534), then leaf node 550 is reached.
The leaf nodes 520, 540, and 550 represent a “decision” or classification of the object. In this example, the decision is the treatment to be administered to the patient. At leaf node 520, the decision is to use 20 mg of drug X; at leaf node 540, the decision is to use 40 mg of drug X; and at leaf node 550, the decision is to use 10 mg of drug Y.
The exemplary decision tree 500 may be used to determine which treatment is to be administered to a patient by starting at the “root” node, testing the attribute of the patient to select an arc, and following the arc until a leaf node is reached. In the example, suppose a 10-year-old child with a temperature of 98.6° is to be treated. Starting at root branch node 510, the age of the patient is tested. Since the 10-year-old's age is less than or equal to 12, arc 512 is followed to reach leaf node 520. Therefore, 20 mg of drug X is prescribed to the 10-year-old. As another example, suppose the patient is a 32-year-old with a 105° fever. Starting at root branch node 510, the age of the patient is tested. Since the 32-year-old's age is greater than 12, arc 514 is followed to branch node 530, where the body temperature of the patient is tested. Since the patient has a 105° fever, arc 534 is followed to reach leaf node 550, which indicates that 10 mg of drug Y is to be administered.
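The traversal just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names Leaf and Branch, the function classify, and the dictionary keys "age" and "temp" are hypothetical and are not part of FIG. 5; only the reference numerals, thresholds, and treatments come from the example.

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    decision: str  # the classification reached at this leaf


@dataclass
class Branch:
    attribute: str  # name of the patient attribute tested at this node
    arcs: list      # list of (test, child) pairs, one per outgoing arc


def classify(node, patient):
    """Follow arcs from the given node until a leaf is reached."""
    while isinstance(node, Branch):
        value = patient[node.attribute]
        # Take the first arc whose test accepts the attribute value.
        node = next(child for test, child in node.arcs if test(value))
    return node.decision


# Nodes named after the reference numerals of FIG. 5.
leaf_520 = Leaf("20 mg of drug X")
leaf_540 = Leaf("40 mg of drug X")
leaf_550 = Leaf("10 mg of drug Y")
node_530 = Branch("temp", [(lambda t: t <= 102, leaf_540),   # arc 532
                           (lambda t: t > 102, leaf_550)])   # arc 534
node_510 = Branch("age", [(lambda a: a <= 12, leaf_520),     # arc 512
                          (lambda a: a > 12, node_530)])     # arc 514

print(classify(node_510, {"age": 10, "temp": 98.6}))
print(classify(node_510, {"age": 32, "temp": 105}))
```

The two calls correspond to the 10-year-old and the 32-year-old patients of the example.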
Decision tree induction refers to the process of determining how to build the decision tree from a set of training data. In particular, a decision tree is built by successively identifying which attributes to test first and which attributes to test later, if at all. A common conventional approach to building decision trees is known as “Induction of Decision Trees” or ID3. ID3 is a recursive algorithm that starts with a set of training objects that belong to a set of predefined classes. If all the objects belong to a single class, then there is no decision to make, and a leaf node is created and labeled with the class. Otherwise, a branch node is created, and the attribute that would yield the highest “information gain” if it were used to discriminate objects at the branch node is selected as the attribute to test. The information gain of an attribute is calculated from the average entropy of the subsets that result from splitting the objects on that attribute; the lower that average entropy, the higher the gain.
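The following Python sketch illustrates the ID3 recursion in its commonly published form, in which the gain of an attribute is the entropy of the class labels before the split minus the weighted average entropy after it. The training rows, attribute values, and class labels are hypothetical, and the majority-class fallback used when no attributes remain is an added assumption rather than something stated above.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy before the split minus the weighted average entropy after it."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    average = sum(len(part) / len(labels) * entropy(part)
                  for part in partitions.values())
    return entropy(labels) - average


def id3(rows, labels, attributes):
    if len(set(labels)) == 1:
        return labels[0]  # single class: create a leaf labeled with it
    if not attributes:
        # Assumption: fall back to the majority class when attributes run out.
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in subset]
        sub_labels = [l for _, l in subset]
        tree[best][value] = id3(sub_rows, sub_labels, remaining)
    return tree


# Hypothetical training data with categorical attribute values.
rows = [
    {"age": "child", "temp": "normal"},
    {"age": "child", "temp": "high"},
    {"age": "adult", "temp": "normal"},
    {"age": "adult", "temp": "high"},
]
labels = ["X20", "X20", "X40", "Y10"]

print(id3(rows, labels, ["age", "temp"]))
```

With this toy data, splitting on age leaves less average entropy than splitting on temp, so age is tested at the root and temp is tested only on the adult branch.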
A problem with conventional decision trees such as those produced by ID3 is that such decision trees are rigid, inflexible, and brittle. In the drug effectiveness example, conventional decision trees impose an “either-or” or binary approach to the data, even though different drugs have varying degrees of effectiveness. For example, data values close to the border of a crisp range in a decision tree are apt to be misclassified due to the imprecision of real-world data. Accordingly, there have been a number of attempts to apply the concepts of “fuzzy logic” to decision trees.
Fuzzy logic was introduced in the 1960s as a means for modeling the uncertainty of the real world. Rather than classifying an object as either a full member of one class or not a member at all, fuzzy logic employs a “membership function” with values between 0.0 and 1.0 to represent the degree to which the object belongs to the class. For example, rather than categorize a patient's age as “twelve years and below” and “above twelve years,” two fuzzy sets, Young and Old, can be employed, such that a two-year-old may have a membership function in the Young fuzzy set μYoung(2)=0.99 but a membership function in the Old fuzzy set μOld(2)=0.01. Conversely, a retired person at 65 years of age may have a Young membership function of μYoung(65)=0.13 and an Old membership function of μOld(65)=0.87. For a teenager, however, the membership functions are not so extreme; for example, a 13-year-old may have membership functions of μYoung(13)=0.45 and μOld(13)=0.55.
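One possible pair of membership functions consistent with these sample values is sketched below in Python. The shape of the functions is an assumption: the text gives only three sample points, so the sketch interpolates linearly between them and takes μOld as the complement of μYoung, which happens to hold for the quoted values but is not stated as a general rule. The names mu_young, mu_old, and YOUNG_POINTS are hypothetical.

```python
# (age, membership in Young) sample points quoted in the text.
YOUNG_POINTS = [(2, 0.99), (13, 0.45), (65, 0.13)]


def mu_young(age):
    """Degree of membership in the Young fuzzy set (assumed piecewise linear)."""
    points = YOUNG_POINTS
    if age <= points[0][0]:
        return points[0][1]   # clamp below the youngest sample point
    if age >= points[-1][0]:
        return points[-1][1]  # clamp above the oldest sample point
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= age <= x1:
            return y0 + (y1 - y0) * (age - x0) / (x1 - x0)


def mu_old(age):
    """Assumed complement of mu_young."""
    return 1.0 - mu_young(age)


for age in (2, 13, 65):
    print(age, round(mu_young(age), 2), round(mu_old(age), 2))
```

Unlike the crisp test at age 12, a 13-year-old here belongs partly to both sets rather than wholly to one.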
One attempt to combine fuzzy logic with classical, crisp decision trees is known as FID3, in which the user defines the membership functions in each of the predefined classes for all of the training data. Each membership function can serve as an arc label of a fuzzy decision. As in ID3, FID3 generates its decision tree by maximizing information gains. The decision of the fuzzy decision tree is also a fuzzy variable, indicating the memberships of a tested object in each of the possible classifications. In the example of FIG. 5, the arcs 512 and 514 emanating from branch node 510 could be fuzzified by a membership function on a Young fuzzy set and an Old fuzzy set, respectively. For example, arc 512 could be the test μYoung(Xi)<0.5 or other value that maximizes the information gain. For the arcs 532 and 534, the fuzzy sets could be Normal and Feverish, respectively. A result with a 0.20 membership in the class at leaf node 520 and 0.80 membership in the class at leaf node 540, for example, might suggest using 36 mg of drug X.
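The 36 mg figure is consistent with a membership-weighted average of the two leaf dosages, 0.20 × 20 mg + 0.80 × 40 mg. The Python sketch below shows that arithmetic; treating the fuzzy decision as a weighted average is an assumption made for illustration, and the function name blended_dose and the dictionary keys are hypothetical.

```python
def blended_dose(memberships, doses_mg):
    """Membership-weighted average of the dosages at the reached leaves."""
    total = sum(memberships.values())
    weighted = sum(memberships[leaf] * doses_mg[leaf] for leaf in memberships)
    return weighted / total


doses_mg = {"leaf_520": 20.0, "leaf_540": 40.0}     # dosages of drug X in FIG. 5
memberships = {"leaf_520": 0.20, "leaf_540": 0.80}  # fuzzy decision from the text

print(round(blended_dose(memberships, doses_mg), 2))
```

Dividing by the total membership keeps the result meaningful even when the memberships do not sum exactly to 1.0.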
One disadvantage of FID3 is that the membership functions in each of the attributes for all of the training data must be specified beforehand by the user. For data with a high number of attributes or dimensions, however, determining the membership functions is typically a difficult task, requiring intensive involvement by experts. In addition, the fuzzy sets themselves may not even be known beforehand and may require further investigation.
Therefore, there is a need for a data analysis technique that is capable of handling real-world or “fuzzy” data in a flexible manner. There is also a need for a technique in which the groupings of the data or other a priori information, such as fuzzy membership functions, need not be supplied beforehand.