A generic problem in data analysis involves the prediction of high values of some output variable based on a set of input variables. For example, a hospital might want to determine from collected patient data what combinations of patient age and length of stay are associated with high cost hospital stays. Adding to the complexity of this data analysis problem is the fact that the variables of interest can be of different types, including:
continuous/numerical (ordered data, such as age, cost, length of stay);
categorical (non-ordered data, such as doctor); and
discrete (ordered, non-continuous data, such as procedure risk).
There are many prior art strategies for solving data analysis problems similar to the one described above, including the SLIQ classification algorithm (M. Mehta, R. Agrawal and J. Rissanen, SLIQ: A Fast Scalable Classifier for Data Mining, Proc. of the Fifth Int'l Conference on Extending Database Technology, Avignon, France, March 1996), SPEC (Anurag Srivastava, Vineet Singh, Eui-Hong (Sam) Han and Vipin Kumar, An Efficient, Scalable, Parallel Classifier for Data Mining, Technical Report 96-040, Department of Computer Science, University of Minnesota, Minneapolis, 1996), and the Patient Rule Induction Method (PRIM) (Friedman, J. H. and Fisher, N. I., Bump Hunting in High-Dimensional Data, October 1997, to appear in Statistics and Computing). In each of these strategies, the data to be analyzed is stored initially in a database of any type and the data analysis system copies the stored data onto disk using specific data structures that facilitate analysis of the data.
The SLIQ classification algorithm finds local groups of the input data that have a high output value. It works better with categorical data than with continuous data. That is, the SLIQ algorithm is more likely to find contiguous groups of multi-dimensional input data with high output values for categorical rather than continuous data. Because of this, the SLIQ algorithm is typically paired with linear regression analysis when continuous data needs to be analyzed.
The PRIM algorithm, on which the present invention is based, specifically addresses the issue of determining regions in a multidimensional space of a set of input variables that correspond to a high average value of an output variable. Moreover, the regions generated by PRIM are easily interpretable as they are rectangles for two input variables or (more generally) hyper-rectangles for more than two input variables. (Note, for the remainder of this application, the terms “hyper-rectangles” and “boxes” are used interchangeably.) A comparison of the results given by the SLIQ and PRIM algorithms for the same input is now described with reference to FIG. 1.
Referring to FIG. 1, there is shown a plot where the input variables are Length Of Stay and Age for hospital patients and the output variable is Charge for the patient's stay, where points that are filled in are high value points and points that are not filled in are low value points. Given the goal of finding those regions of the Length of Stay/Age space with highest Charge, the PRIM algorithm would return the single hyper-rectangle 102, whereas the SLIQ algorithm would return the smaller, disjoint regions 104a, 104b. The PRIM hyper-rectangle region 102 description (e.g., 50<Age<70 and 2<=Stay Length<=4) can be used for prediction applications where the input variables are known and the output variable is unknown. In particular, for a hospital application, the identification of high charge (or high cost) patients may be used to reduce costs internally or to negotiate more profitable contracts with payors (or insurance companies) externally. This application is facilitated by the simpler output of PRIM as compared to that from the SLIQ algorithm. As the present invention is based on PRIM, this method is now described in greater detail.
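For illustration only, the example box description quoted above (50<Age<70 and 2<=Stay Length<=4) can be written as a membership test that flags a patient as a likely high charge case from the input variables alone. The function name and argument names below are hypothetical and are not part of PRIM or of the described system.

```python
def in_high_charge_box(age, stay_length):
    """Return True when a patient falls inside the example box
    50 < Age < 70 and 2 <= Stay Length <= 4 (bounds taken from the text)."""
    return 50 < age < 70 and 2 <= stay_length <= 4


# Hypothetical patients given as (age, stay_length) pairs; charge is unknown.
patients = [(60, 3), (45, 3), (69, 2), (60, 5)]
flagged = [p for p in patients if in_high_charge_box(*p)]
```

Because the region is a single box, the prediction rule is one conjunction of interval tests per input variable, which is what makes the PRIM output straightforward to apply.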
The idea behind PRIM is to enclose all input data points within a single hyper-rectangle and then successively peel away contiguous, small strips of the hyper-rectangle enclosing low value points until a user-defined percentage of the total points remain inside the hyper-rectangle. These points are presumed to be predominantly high value points. PRIM can be applied to a small data set that can be copied in its entirety into main memory (i.e., RAM accessible to executing programs), but is not able to handle larger data sets. PRIM includes the following steps for continuous data:
1) copy the entire data set into main memory;
2) define a peeling fraction (α), which sets the width of the contiguous, small strips of the hyper-rectangle. Each strip is formed by taking a given percentage of the initial points, preferably between 1% and 5%, along the perimeter of the hyper-rectangle.
3) form the strips along edges of the hyper-rectangle by taking α% of the points along the perimeter of the hyper-rectangle.
4) calculate an average value of the cost attribute for each strip of the hyper-rectangle. For each attribute, sort the data and find the average value of the cost attribute for α% of the top points and α% of the bottom points. This is repeated for each attribute.
5) throw away the strip with the lowest average cost value and return to step 3), repeating until the number of points remaining enclosed by the hyper-rectangle equals a user-defined percentage of the initial points (the remaining points are presumed to be the user-defined percentage of points with the highest average output value).
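The five steps above can be sketched in Python as follows. This is a minimal in-memory illustration for continuous attributes only, not the prior art implementation itself. The name prim_peel, its parameters, the choice to size each strip from the initial point count (capped so that peeling stops exactly at the target), and the rule that ties go to the first candidate strip examined are all assumptions made for the example.

```python
def prim_peel(points, costs, alpha=0.05, target_fraction=0.25):
    """Peel low-cost strips until about target_fraction of the points remain.

    points: list of tuples of continuous input values (held in memory, step 1)
    costs:  list of output values, one per point
    Returns (bounds, kept), where bounds[d] = (low, high) for attribute d
    and kept is the list of indices of the points still inside the box.
    """
    n_total = len(points)
    n_dims = len(points[0])
    inside = list(range(n_total))
    stop_count = max(1, round(target_fraction * n_total))
    strip_size = max(1, round(alpha * n_total))        # step 2 (assumed sizing)

    while len(inside) > stop_count:
        # Never peel past the requested number of remaining points.
        k = min(strip_size, len(inside) - stop_count)
        best = None                                    # (mean cost, strip)
        for d in range(n_dims):
            # Step 4: sort the remaining points on this attribute.
            ordered = sorted(inside, key=lambda i: points[i][d])
            # Step 3: candidate strips at the bottom and top of the box.
            for strip in (ordered[:k], ordered[-k:]):
                mean_cost = sum(costs[i] for i in strip) / k
                if best is None or mean_cost < best[0]:
                    best = (mean_cost, strip)
        # Step 5: discard the strip with the lowest average cost.
        removed = set(best[1])
        inside = [i for i in inside if i not in removed]

    bounds = [
        (min(points[i][d] for i in inside), max(points[i][d] for i in inside))
        for d in range(n_dims)
    ]
    return bounds, inside


# Synthetic 10 x 10 grid of (age, stay) pairs; the nine points with
# 4 <= age <= 6 and 2 <= stay <= 4 are given a high cost.
grid = [(age, stay) for age in range(10) for stay in range(10)]
grid_costs = [100.0 if 4 <= a <= 6 and 2 <= s <= 4 else 1.0 for a, s in grid]
box, kept = prim_peel(grid, grid_costs, alpha=0.05, target_fraction=0.09)
```

Note that every pass of the loop sorts the remaining points once per attribute, which is practical only while the whole data set fits in main memory.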
The steps for discrete data are similar to the previously described steps, except, instead of multi-dimensional strips, histograms are used. For example, if one of the input variables were Doctor ID, each strip would correspond to a histogram bin for a given Doctor ID. The way PRIM is implemented for small data sets is now described with reference to FIG. 2.
Referring to FIG. 2, there is shown an illustration of a data structure used by the prior art PRIM implementation for discrete attributes and a relational table of data. The relational table of data contains both continuous attributes and discrete attributes. To process the discrete attribute Doctor ID, a histogram is generated containing all distinct Doctor IDs and an expected cost for each. The expected cost for a given Doctor ID is calculated by summing the cost over all records with that Doctor ID and dividing the sum by the number of such records. As depicted in FIG. 2, the Doctor IDs A, B, C and D are shown along with the expected cost. The Doctor ID A has the lowest expected cost ($100.00) and is therefore chosen as the discrete attribute value to be peeled.
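The histogram calculation described above can be sketched as follows. The function name, the record layout, and the sample records are hypothetical; the records are invented so that Doctor ID A comes out with an expected cost of $100.00 as in the text, and the other values are not taken from FIG. 2.

```python
from collections import defaultdict


def expected_cost_histogram(records, attribute, cost_key="cost"):
    """Map each distinct value of a discrete attribute to its expected cost,
    i.e. the sum of the costs of the matching records divided by their count."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        value = record[attribute]
        sums[value] += record[cost_key]
        counts[value] += 1
    return {value: sums[value] / counts[value] for value in sums}


# Invented sample rows of the relational table (illustrative values only).
records = [
    {"doctor_id": "A", "cost": 90.0},
    {"doctor_id": "A", "cost": 110.0},
    {"doctor_id": "B", "cost": 300.0},
    {"doctor_id": "C", "cost": 150.0},
    {"doctor_id": "C", "cost": 250.0},
    {"doctor_id": "D", "cost": 400.0},
]
histogram = expected_cost_histogram(records, "doctor_id")
# The bin with the lowest expected cost is the candidate to peel.
peel_value = min(histogram, key=histogram.get)
```

Each call scans the full table once, so recomputing the histogram at every peeling step repeats that scan for every discrete attribute.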
Consequently, the PRIM calculation for discrete attributes is exactly as described above for continuous attributes. However, for discrete attributes, step 4) would include forming a histogram for each discrete attribute containing the distinct discrete attribute values and expected cost. Step 5) would include comparing the continuous attribute with the lowest average value to the discrete attribute with the lowest average value to determine the peeling attribute to be removed.
The PRIM implementation described above must sort the relation table of data for each attribute one at a time for each peeling step. In addition, the relational table of data must be queried to determine the expected cost for each discrete attribute for each peeling step. These tasks require that the relational table of data be stored in main memory. Unfortunately, memory limitations of conventional systems prevent large disk resident data sets from being entirely sorted in main memory. Therefore, there is a need for an implementation of PRIM that is applicable to large, disk-resident data sets. There is also a need for an implementation of PRIM that can be parallelized for execution on parallel processors.