a) Technical Field
The present invention relates generally to a system and method for partitioning a real-value attribute having values associated with a first class and a second class into ranges, and more specifically to a system and method for partitioning an attribute into at least three ranges, wherein the values within a lower range and an upper range generally correspond to a first class of results and the values within a middle range generally correspond to a second class of results.
b) Background Art
Expert systems are used to mimic the tasks of an expert within a particular field of knowledge or domain, or to generate a set of rules applicable within the domain. In these applications, expert systems must operate on objects associated with the domain, which may be physical entities, processes or even abstract ideas. Objects are defined by a set of attributes or features, the values of which uniquely characterize the object. Object attributes may be discrete or continuous.
Typically, each object within a domain also belongs to or is associated with one of a number of mutually exclusive classes having particular importance within the context of the domain. Expert systems that classify objects from the values of the attributes for those objects must either develop or be provided with a set of classification rules that guide the system in the classification task. Some expert systems use classification rules that are directly ascertained from a domain expert. These systems require a "knowledge engineer" to interact directly with a domain expert in an attempt to extract rules used by the expert in the performance of his or her classification task.
Unfortunately, this technique usually requires a lengthy interview process that can span many man-hours of the expert's time. Furthermore, experts are not generally good at articulating classification rules, that is, expressing knowledge at the right level of abstraction and degree of precision, organizing knowledge and ensuring the consistency and completeness of the expressed knowledge. As a result, rules which are identified may be incomplete while important rules may be overlooked.
Still further, this technique assumes that an expert actually exists in the particular field of interest. Furthermore, even if an expert does exist, the expert is usually one of a few and is, therefore, in high demand. As a result, the expert's time and, consequently, the rule extraction process can be quite expensive.
It is known to use artificial intelligence within expert systems for the purpose of generating classification rules applicable to a domain. For example, an article by Bruce W. Porter et al., Concept Learning and Heuristic Classification in Weak-Theory Domains, 45 Artificial Intelligence 229-263 (1990), describes an exemplar-based expert system for use in medical diagnosis that removes the knowledge engineer from the rule extraction process and, in effect, interviews the expert directly to determine relevant classification rules.
In this system, training examples (data sets which include values for each of a plurality of attributes generally relevant to medical diagnosis) are presented to the system for classification within one of a predetermined number of classes. The system compares a training example with one or more exemplars stored for each of the classes and uses a set of classification rules developed by the system to determine the class to which the training example most likely belongs. A domain expert, such as a doctor, either verifies the classification choice or instructs the system that the chosen classification is incorrect. In the latter case, the expert identifies the correct classification choice and the relevant attributes, or values thereof, which distinguish the training example from the class initially chosen by the system. The system builds the classification rules from this information, or, if no rules can be identified, stores the misclassified training example as an exemplar of the correct class. This process is repeated for training examples until the system is capable of correctly classifying a predetermined percentage of new examples using the stored exemplars and the developed classification rules.
Other artificial intelligence methods that have been used in expert systems rely on machine induction in which a set of induction rules is developed or induced directly from a set of records, each of which includes values for a number of attributes of an object and an indication of the class of the object. An expert then reviews the induced rules to identify which rules are most useful or applicable to the classification task being performed. This method has the advantage of using the expert in a way that the expert is accustomed to working, that is, identifying whether particular rules are relevant or useful in the classification task. It should be noted, however, that all of the relevant attributes of the objects being classified must be identified and data for those attributes must be provided within the records in order for the system to induce accurate and complete classification rules.
A book chapter written by W. Buntine and D. Stirling, Interactive Induction, in Machine Intelligence, Vol. 12, pp. 121-137 (Hayes-Michie et al. eds., 1990), discloses that expert systems which use machine induction can be operated with greater accuracy if a domain expert interacts with the system by supplying additional subjective knowledge before classification rules are induced or by incrementally evaluating and validating the rules that are induced. Specifically, the domain expert can develop domain grammar which can be used to elicit relevant classification rules, suggest potential rules and identify whether particular induced rules are strong or weak in the domain context.
A classic example of a pure machine induction technique is described in an article by J. R. Quinlan, Induction of Decision Trees, 1 Machine Learning 81-106 (1986), the disclosure of which is hereby incorporated by reference herein. This technique searches through relations between combinations of attribute values and classes of objects to build an induction tree which is then used to generate precise classification rules. Referring to FIG. 1 herein, an exemplary Quinlan-type induction tree is constructed for a set of 100 records, each associated with an object having one of two classes C1 or C2 and attribute values A_1 or A_2, B_1 or B_2, and C_1 or C_2 for three attributes A, B and C, respectively.
During operation, the Quinlan method calculates a statistical measurement, referred to as an information gain value, for each of the attributes A, B and C and chooses the attribute with the highest information gain value as the root of the tree. The attribute values associated with the chosen attribute are then identified as nodes of the tree and are examined. If all of the data records associated with a node are of the same class, the node is labeled as a leaf or endpoint of the induction tree. Otherwise, the node is labeled as a branching point of the induction tree. The method then chooses a branching point, calculates the information gain value for each of the remaining attributes based on the data from the records associated with the chosen branching point, chooses the attribute with the highest information gain value and identifies the attribute values of the chosen attribute as nodes which are examined for leaves and branching points. This process is repeated until only leaves remain within the induction tree or until, at any existing branching point, there are no attributes remaining upon which to branch.
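The information gain calculation described above can be illustrated with a minimal Python sketch. It assumes records are dictionaries and that the class is stored under a key named "cls". The function names and the four sample records are hypothetical and are not taken from the Quinlan article or from FIG. 1.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(records, attribute, class_key="cls"):
    """Class entropy before the split minus the weighted entropy after it."""
    labels = [r[class_key] for r in records]
    groups = {}
    for r in records:
        groups.setdefault(r[attribute], []).append(r[class_key])
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical records: attribute A separates the classes, attribute B does not.
records = [
    {"A": "A_1", "B": "B_1", "cls": "C1"},
    {"A": "A_1", "B": "B_2", "cls": "C1"},
    {"A": "A_2", "B": "B_1", "cls": "C2"},
    {"A": "A_2", "B": "B_2", "cls": "C2"},
]

# The attribute with the highest gain would be placed at the root.
best = max(["A", "B"], key=lambda a: information_gain(records, a))
```

In this toy data, attribute A yields a gain of one bit and attribute B yields none, so A would be chosen as the root.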
Referring to FIG. 1, the attribute A is chosen at the root of the induction tree and the attribute values A_1 and A_2, which are nodes of the induction tree, are then examined. Attribute value A_1 is a leaf of the induction tree because all of the records associated therewith are associated with the class C1. The attribute value A_2 is a branching point BP1 and the attribute B branches therefrom. Likewise the attribute C branches from the attribute value B_1, which is labeled as branching point BP2. The attribute values B_2 and C_1 are leaves of the induction tree. The induction tree stops branching from a branching point BP3 because there are no remaining attributes upon which to branch at that node.
After an induction tree is constructed, classification rules are generated therefrom by tracing a path from a particular leaf of the induction tree to the root of the induction tree or vice versa. Thus, for example, the induction tree of FIG. 1 produces the following classification rules:
(1) C_1 and B_1 and A_2 results in C1;
(2) B_2 and A_2 results in C2;
(3) A_1 results in C1.
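As an illustration only, the tracing of paths from the root to each leaf may be sketched in Python as follows. The nested-tuple tree encoding is an assumption made for this sketch. The node at attribute value C_2 (branching point BP3) has no single class, so it is represented here as None and produces no rule.

```python
def rules(node, path=()):
    """Yield (attribute values along the path, class) for every leaf."""
    if isinstance(node, str):      # a leaf is stored as its class label
        yield path, node
    elif node is not None:         # None marks an unresolved branching point
        _attribute, branches = node
        for value, child in branches.items():
            yield from rules(child, path + (value,))

# Hypothetical encoding of the tree of FIG. 1: (attribute, {value: subtree}).
tree = ("A", {
    "A_1": "C1",
    "A_2": ("B", {
        "B_1": ("C", {"C_1": "C1", "C_2": None}),
        "B_2": "C2",
    }),
})

extracted = list(rules(tree))
for path, cls in extracted:
    print(" and ".join(reversed(path)), "results in", cls)
```

Reversing each path prints the rules leaf-first, matching the order in which rules (1) to (3) are written.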
Although the Quinlan method is useful in identifying classification rules for a particular domain, the method is limited to attributes which have discrete values. However, techniques have been developed for discretizing numeric or real-valued attributes within a Quinlan-type induction tree. In fact, a simple method of discretizing a real-valued attribute is to choose generally known break points throughout the range of the attribute. For example, the real-valued attribute of age may be divided according to the generally accepted break points of child (0-10), adolescent (11-17), adult (18-45), middle-age (46-70), and elderly (over 70). Such a predetermined discretization method, however, is only possible if generally accepted break points exist within the domain. Furthermore, such a predetermined discretization method is only as accurate as the actual break points chosen and fails to account for concepts not identified by the predetermined break points. Thus, in the age example given above, any concept dealing with legal drinking age is unascertainable. For this reason, this method is a particularly poor technique of discretizing real-valued attributes in a domain about which a system designer has little a priori information.
Other simple discretization methods include dividing the continuous range of an attribute into intervals of equal size or dividing the continuous range of an attribute into intervals which include an approximately equal number of attribute values. However, with both of these methods, it is difficult or impossible for the system to learn unknown concepts because these methods ignore the distribution of the classes associated with the attribute values. Furthermore, it is very unlikely that the interval boundaries will be established in the places that best facilitate accurate classification using any of these methods.
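The two simple methods just described can be sketched as follows. The function names and the sample ages are hypothetical. Note that neither function receives the class labels, which is the limitation identified above.

```python
def equal_width_cuts(values, k):
    """Return k-1 cut points that split [min, max] into k intervals of equal size."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + step * i for i in range(1, k)]

def equal_frequency_cuts(values, k):
    """Return k-1 cut points so that each interval holds roughly len(values)/k values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // k] for i in range(1, k)]

# Hypothetical ages, skewed toward younger values.
ages = [2, 5, 9, 14, 19, 21, 22, 23, 40, 80]
width_cuts = equal_width_cuts(ages, 3)      # cuts at 28.0 and 54.0
freq_cuts = equal_frequency_cuts(ages, 3)   # cuts at 14 and 22
```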
A paper by Randy Kerber, ChiMerge: Discretization of Numeric Attributes, in Proceedings of the Tenth National Conference on Artificial Intelligence, 123-127 (1992), describes a method of discretizing real-valued attributes which takes into account the class associated with each attribute value. This method constructs an initial discretization having one attribute value per interval. The method then computes the chi-squared value for each pair of adjacent intervals and merges the pair of adjacent intervals with the lowest chi-squared value. The steps of computing the chi-squared values and merging continue in this manner until all pairs of intervals have chi-squared values which exceed a predetermined threshold.
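A simplified sketch of this merging loop is given below. It is not Kerber's reference implementation: the function names are hypothetical, classes that are absent from both intervals are simply skipped when computing the statistic, and the threshold of 2.7 is an arbitrary value chosen for the example.

```python
from collections import Counter

def chi2(a, b, classes):
    """Chi-squared statistic for the class counts of two adjacent intervals."""
    n = sum(a.values()) + sum(b.values())
    stat = 0.0
    for row in (a, b):
        row_total = sum(row.values())
        for c in classes:
            expected = row_total * (a[c] + b[c]) / n
            if expected > 0:  # simplification: skip classes absent from both intervals
                stat += (row[c] - expected) ** 2 / expected
    return stat

def chimerge(values, labels, threshold):
    """Merge adjacent intervals until every adjacent pair exceeds `threshold`.

    Returns the lower bound of each remaining interval.
    """
    classes = sorted(set(labels))
    # Initial discretization: one interval per distinct attribute value.
    counts = {}
    for v, c in zip(values, labels):
        counts.setdefault(v, Counter())[c] += 1
    intervals = [(v, counts[v]) for v in sorted(counts)]
    while len(intervals) > 1:
        stats = [chi2(intervals[i][1], intervals[i + 1][1], classes)
                 for i in range(len(intervals) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)
        if stats[i] > threshold:
            break  # every adjacent pair now exceeds the threshold
        merged = intervals[i][1] + intervals[i + 1][1]
        intervals[i:i + 2] = [(intervals[i][0], merged)]
    return [lower for lower, _ in intervals]

values = [1, 2, 3, 10, 11, 12]
labels = ["x", "x", "x", "y", "y", "y"]
bounds = chimerge(values, labels, threshold=2.7)  # two intervals remain
```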
Other discretization methods divide continuous-valued attributes into only two intervals separated at a cut point. One way to determine such a cut point is to evaluate every possible cut point, calculate the class entropy (i.e., the information gain value) of the attribute from the resulting partition and choose the cut point which results in the best class entropy measurement.
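An exhaustive cut-point search of this kind may be sketched as follows, assuming that the candidate cut points are the midpoints between consecutive distinct values and that the score is the weighted class entropy of the two resulting intervals (lower is better). The function names and sample data are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def best_cut_point(values, labels):
    """Evaluate every candidate cut and return (cut, weighted class entropy)."""
    pairs = sorted(zip(values, labels))
    best_cut, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no cut is possible between equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_cut, best_score = cut, score
    return best_cut, best_score

values = [1, 2, 3, 8, 9]
labels = ["C1", "C1", "C1", "C2", "C2"]
cut, score = best_cut_point(values, labels)
```

Each candidate requires a pass over the records, which reflects the computational expense of evaluating every possible cut point.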
A problem associated with these discretization methods, however, is that they are computationally expensive, especially when a great number of attributes have been identified for a particular object or when a large number of records are used to induce classification rules. As a result, expert systems employing these discretization methods may be limited to discretizing the values of each real-valued attribute once during the tree building process.
U.S. Pat. No. 5,694,524 issued to Bob Evans, the disclosure of which is hereby expressly incorporated by reference herein, discloses a method of dividing a real-valued attribute into three separate ranges. With reference to FIG. 2, the Evans method determines a first range (Range #1) wherein the data values generally correspond to a first class of results, class C1, and a second range (Range #2) wherein the data values generally correspond to a second class of results, class C2. In addition, the Evans method identifies a third range (Range #3) that lies between the first and second ranges where it is unknown whether the data values generally correspond with the first class C1 or the second class C2.
For some attributes, however, data values generally corresponding to a particular class of results lie in both a low range and a high range, while data values corresponding to a second class of results generally lie in a range between the low and high ranges of the first class. An example of an attribute exhibiting this characteristic, known as a "windowed attribute," is illustrated in FIG. 3, where C1 represents the data values associated with the first class of results and C2 represents the data values associated with the second class of results.
It is desirable to have a system and method that partitions a windowed attribute into discrete ranges, wherein each range includes values generally corresponding to a particular class of results. Furthermore, it is desirable to have a system and method that creates an induction tree, wherein at least one of the attributes is a windowed attribute and is partitioned into ranges such that each range generally corresponds to a particular class of results.
According to one aspect of the present invention, a system and method for dividing a real-valued attribute having values associated with a first class generally windowed by values associated with a second class into ranges includes the steps of separating the values of the real-valued attribute into first and second sets based on the class associated with each of the values, calculating a statistical property of the second set and defining a first subset as the values in the first set below (or above) the statistical property of the second set and a second subset to include values of the second set. The system and method then determine a range breakpoint between the first subset and the second subset by repeating the steps of (a) calculating a statistical property of the first subset and a statistical property of the second subset and (b) removing values from the first subset and the second subset based on the calculated statistical properties of the first and second subsets. The system and method determine the range breakpoint from one or more of the statistical properties of the first subset and the second subset calculated in step (a).
The system and method may also define a third subset to include values of the first set above the statistical property of the second set and a fourth subset to include values of the second set and then identify a second range breakpoint between the third subset and the fourth subset by repeating the steps of (c) calculating a statistical property of the third subset and a statistical property of the fourth subset and (d) removing values from the third subset and the fourth subset based on the calculated statistical properties of the third and fourth subsets. The system and method determine the second range breakpoint from one or more of the statistical properties of the third subset and the fourth subset calculated in step (c).
If desired, the system and method may repeat steps (a) and (b) until the statistical property of the first subset is greater than or equal to the statistical property of the second subset and may repeat steps (c) and (d) until the statistical property of the fourth subset is less than or equal to the statistical property of the third subset.
Likewise, the system and method may determine the range breakpoints by setting the first breakpoint equal to the statistical property of the first or second subset (or some combination, such as an average, thereof) and by setting the second range breakpoint equal to the statistical property of the third or fourth subset (or some combination, such as an average, thereof).
According to another aspect of the present invention, a system and method divides a real-valued windowed attribute into ranges which are generally associated with a particular result or class. For example, the runs of a process during which the particular result occurred may be in a first class, and the runs of a process during which the particular result did not occur may be in a second class. The system and method create a first data set that contains all of the values for the attribute for the runs of the process corresponding to the first class and create a second data set that contains all of the values of the attribute for the runs of the process corresponding to the second class. A statistical property, such as a median or a mean, is then calculated for the second data set. Thereafter, a first subset is created to contain all of the values in the first data set that are less than the statistical property of the second data set, and a second subset is created to contain all of the values of the second data set. Statistical properties of the first and second subsets are then calculated and the value of a first temporary breakpoint is set equal to the value of the statistical property of the first subset, while the value of a second temporary breakpoint is set equal to the statistical property of the second subset.
Next, the data values within the first subset that are less than the first temporary breakpoint are eliminated or removed from the first subset, and the data values in the second subset that are higher than the second temporary breakpoint are eliminated or removed from the second subset. The statistical properties of the first subset and the second subset are recalculated and the first temporary breakpoint is set equal to the value of the statistical property of the first subset and the second temporary breakpoint is set equal to the value of the statistical property of the second subset if the statistical property of the first subset is less than the statistical property of the second subset. These steps are repeated (removing values from the first and second subsets, recalculating a statistical property for each subset, and setting the temporary breakpoints equal to the statistical properties of the first and second subsets) until the statistical property of the first subset is greater than the statistical property of the second subset. Once the statistical property of the first subset is greater than the statistical property of the second subset, the first breakpoint is set equal to the value of the first temporary breakpoint (of the previous iteration) and the second breakpoint is set equal to the value of the second temporary breakpoint (of the previous iteration).
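The iteration for the lower side of the window may be sketched as follows, using the median as the statistical property. This is an illustrative sketch and not a definitive implementation. The function name and the sample data are hypothetical, the loop also stops when the two statistics are equal, and a guard that stops when no further values can be removed has been added as an assumption to guarantee termination.

```python
from statistics import median

def lower_breakpoints(class1_values, class2_values, stat=median):
    """Return (first breakpoint, second breakpoint) for the lower window edge."""
    center = stat(class2_values)
    first = [v for v in class1_values if v < center]   # first subset
    second = list(class2_values)                       # second subset
    if not first:
        raise ValueError("no first-class values below the second-class statistic")
    t1, t2 = stat(first), stat(second)                 # temporary breakpoints
    while True:
        kept1 = [v for v in first if v >= t1]          # drop values below t1
        kept2 = [v for v in second if v <= t2]         # drop values above t2
        if (len(kept1), len(kept2)) == (len(first), len(second)):
            break  # nothing left to remove (termination guard, an assumption)
        first, second = kept1, kept2
        p1, p2 = stat(first), stat(second)
        if p1 >= p2:
            break  # statistics crossed: keep the previous iteration's values
        t1, t2 = p1, p2
    return t1, t2

# Hypothetical windowed attribute: class 1 lies low and high, class 2 in between.
class1 = [1, 2, 3, 4, 5, 6, 18, 20, 21, 22, 23]
class2 = [5, 8, 9, 10, 11, 12, 13, 14, 19]
first_bp, second_bp = lower_breakpoints(class1, class2)
```

With this data the temporary breakpoints move from (3.5, 11) to (5, 9), (5.5, 8) and (6, 6.5), after which the statistics cross and the last pair is retained.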
Next, a third subset is created to contain the values within the second data set and a fourth subset is created to contain the values in the first data set that are greater than the statistical property of the second data set. Thereafter, statistical properties (such as means or medians) of the third and fourth subsets are calculated. A third temporary breakpoint is then set equal to the statistical property of the third subset and a fourth temporary breakpoint is set equal to the statistical property of the fourth subset.
The values in the third subset that are lower than the statistical property of the third subset and the values in the fourth subset that are higher than the statistical property of the fourth subset are then removed from these subsets. The statistical properties of the third and fourth subsets are then recalculated and the third and fourth temporary breakpoints are set equal to the statistical properties of the third and fourth subsets, respectively, if the statistical property of the third subset is not greater than the statistical property of the fourth subset. These steps (truncating the third and fourth subsets, calculating a statistical property for each subset, and setting a temporary breakpoint equal to the statistical properties) are repeated until the statistical property of the third subset is greater than the statistical property of the fourth subset. When the statistical property of the third subset is greater than the statistical property of the fourth subset, the third breakpoint is set equal to the third temporary breakpoint (of the previous iteration) and the fourth breakpoint is set equal to the fourth temporary breakpoint (of the previous iteration). The first, second, third and/or fourth breakpoints are then used to partition the attribute into a set of ranges.
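The corresponding iteration for the upper side of the window may be sketched in the same way, again using the median. The function name and sample data are hypothetical, and the guard that stops when no further values can be removed is an assumption added to guarantee termination.

```python
from statistics import median

def upper_breakpoints(class1_values, class2_values, stat=median):
    """Return (third breakpoint, fourth breakpoint) for the upper window edge."""
    center = stat(class2_values)
    third = list(class2_values)                        # third subset
    fourth = [v for v in class1_values if v > center]  # fourth subset
    if not fourth:
        raise ValueError("no first-class values above the second-class statistic")
    t3, t4 = stat(third), stat(fourth)                 # temporary breakpoints
    while True:
        kept3 = [v for v in third if v >= t3]          # drop values below t3
        kept4 = [v for v in fourth if v <= t4]         # drop values above t4
        if (len(kept3), len(kept4)) == (len(third), len(fourth)):
            break  # nothing left to remove (termination guard, an assumption)
        third, fourth = kept3, kept4
        p3, p4 = stat(third), stat(fourth)
        if p3 > p4:
            break  # statistics crossed: keep the previous iteration's values
        t3, t4 = p3, p4
    return t3, t4

# Hypothetical windowed attribute: class 1 lies low and high, class 2 in between.
class1 = [1, 2, 3, 4, 5, 6, 18, 20, 21, 22, 23]
class2 = [5, 8, 9, 10, 11, 12, 13, 14, 19]
third_bp, fourth_bp = upper_breakpoints(class1, class2)
```

With this data the temporary breakpoints move from (11, 21) to (13, 20), (14, 19) and (16.5, 18), after which the statistics cross and the last pair is retained.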
In the invention, one or more of the statistical properties may comprise a mean or a median. Also, the values of the attribute lie between a minimum value and a maximum value and the set of ranges may include a first range having the values between the minimum value and the first breakpoint, a second range having values between the first breakpoint and the fourth breakpoint, and a third range having the values between the fourth breakpoint and the maximum value.
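Once the first and fourth breakpoints are known, assigning a value to one of the three ranges is a simple lookup, sketched below. The function name and the breakpoint values 6 and 18 are hypothetical. The treatment of a value exactly equal to a breakpoint (placed in the higher range) is an assumption, since the description does not specify it.

```python
from bisect import bisect_right

def assign_range(value, first_breakpoint, fourth_breakpoint):
    """Return 1, 2 or 3 for the lower, middle or upper range.

    Assumption: a value equal to a breakpoint falls in the higher range.
    """
    return 1 + bisect_right([first_breakpoint, fourth_breakpoint], value)

# Hypothetical breakpoints for a windowed attribute.
low = assign_range(3, 6, 18)      # lower range, generally first class
middle = assign_range(10, 6, 18)  # middle range, generally second class
high = assign_range(22, 6, 18)    # upper range, generally first class
```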