The present invention is related to the general field of database processing and in particular is directed to analysis, interpretation and comparison of data of large databases or data warehouse environments. Specifically, a major objective of the invention is to provide a method for automatically and efficiently interpreting such data.
Recent years have seen a proliferation of data collection. Due in large part to the growing use of computers in every facet of the business world, massive amounts of data can now be assembled and maintained in very large databases and data warehouses. (A data warehouse is a collection of data designed to support management decision making, together with a database system for maintaining the data and systems for extracting desired portions of the data.) Unfortunately, as capable as today’s computing technology may be at gathering, organizing, and maintaining such large accumulations of data, it falls short when it comes to processing and analyzing such massive amounts of data. This inability to unearth relevant business insights, interesting patterns, and like observations from large quantities of data makes collecting such data less useful. Even searching such databases and data warehouses for relevant relationships in an effort to gain some insight into, and observations about, buying patterns, for example, can be a daunting task due not only to the huge amounts of data available for review and analysis, but also to the limited capability of today’s search technology.
There are available today tools and techniques, e.g., association rules designed to find relationships between millions of database records, to help business and data analysts gain a better understanding of their data. But it is not intuitive or obvious how an analysis of such a large amount of data should be focused, what new knowledge should be extracted from the database, or how to then interpret and evaluate this new knowledge.
Among the tools available are those employing various “data mining” algorithms that operate to sift through a large database or data warehouse in an effort to locate patterns. Data mining techniques typically classify and/or cluster data, find associations between data items, or perform time series analysis. (See, for example, Agrawal, R. et al., In Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, Ch 12:307-328 (1996); Ashok Sarasere et al., In 21st Int’l Conf on Very Large Databases (VLDB), Zurich, Switzerland (September 1995).) For example, a data mining technique may be used to sift through a database maintained by a retail outlet for records relating to expensive purchases in an effort to market to a specific customer segment. However, use of data mining tools requires careful selection of search variables in order to obtain meaningful data and/or data relationships. Lack of a key variable in a search can result in an output that may be incorrectly interpreted, or just undecipherable.
One data mining technique, termed the “Patient Rule Induction Method” or “PRIM,” is structured to find a high average region within a very large collection of data records. Typically, a data record will consist of variables. To employ PRIM, a user selects certain of the variables to form a set of input variables and one output variable. The user will also select a minimum size of the desired region. The selected variables and region size are then input to PRIM. PRIM then finds regions where the output variable has a high average value compared to the average value for the entire set of records. PRIM could also be used to find regions with minimum average value by maximizing the negative values of the output variable. The region found by PRIM is defined by a subset of attribute values. For an analytical description of PRIM and the algorithms it employs, see Friedman, J. et al., Statistics and Computing, 9:2, pp. 123-143 (April 1999).
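By way of illustration only, the peeling idea behind a high average region can be sketched as follows. This is a deliberately simplified sketch and not the PRIM algorithm as published: it handles numeric input variables only, performs greedy top-down peeling without any pasting or cross-validation, and stops as soon as no peel raises the average. The function name `prim_peel`, the peeling fraction `alpha`, and the sample field names are assumptions made for the example.

```python
from statistics import mean

def prim_peel(records, inputs, output, min_size, alpha=0.1):
    """Greedy peeling sketch: shrink a box while the output average keeps rising."""
    # Start with a box that spans every record on every input variable.
    box = {v: (min(r[v] for r in records), max(r[v] for r in records)) for v in inputs}
    inside = list(records)
    while True:
        current = mean(r[output] for r in inside)
        k = max(1, int(alpha * len(inside)))  # roughly how many records each peel removes
        best = None
        for v in inputs:
            ordered = sorted(r[v] for r in inside)
            low_cut, high_cut = ordered[k - 1], ordered[-k]
            candidates = (
                ([r for r in inside if r[v] > low_cut], "low"),    # peel the low edge
                ([r for r in inside if r[v] < high_cut], "high"),  # peel the high edge
            )
            for remaining, side in candidates:
                if len(remaining) < max(1, min_size):
                    continue  # region would fall below the user-selected minimum size
                m = mean(r[output] for r in remaining)
                if m > current and (best is None or m > best[0]):
                    best = (m, v, side, remaining)
        if best is None:
            return box, inside
        _, v, side, inside = best
        values = [r[v] for r in inside]
        box[v] = (min(values), box[v][1]) if side == "low" else (box[v][0], max(values))

# Hypothetical data: "spend" is high only where x >= 7 and y >= 6.
grid = [{"x": x, "y": y, "spend": 10 if (x >= 7 and y >= 6) else 1}
        for x in range(10) for y in range(10)]
box, region = prim_peel(grid, ["x", "y"], "spend", min_size=10)
```

The returned `box` maps each input variable to the range of values that bounds the region, which corresponds to the "subset of attribute values" by which a region is defined.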
Another data mining tool is Weighted Item Sets (WIS), a type of association rule. This tool finds relationships between various attributes in a database; some of the attributes can be derived measures. The relationships are defined in terms of if-then rules that show the frequency of records appearing in the database that satisfy the rule. An example of WIS can be found in U.S. Pat. No. 5,173,280.
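As a hedged illustration of what the frequency of an if-then rule means, the sketch below computes the two measures conventionally reported for association rules, support and confidence. It is not the WIS algorithm itself, whose weighting is not described here; the function name `rule_stats` and the sample attributes are assumptions made for the example.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) for the rule: IF antecedent THEN consequent."""
    def matches(record, conditions):
        return all(record.get(attr) == value for attr, value in conditions.items())

    matches_if = [r for r in records if matches(r, antecedent)]
    matches_both = [r for r in matches_if if matches(r, consequent)]
    # Support: fraction of all records that satisfy the whole rule.
    support = len(matches_both) / len(records)
    # Confidence: fraction of records satisfying the IF part that also satisfy the THEN part.
    confidence = len(matches_both) / len(matches_if) if matches_if else 0.0
    return support, confidence

# Hypothetical retail records.
purchases = [
    {"region": "West", "card": "gold", "big_ticket": True},
    {"region": "West", "card": "gold", "big_ticket": False},
    {"region": "East", "card": "gold", "big_ticket": True},
    {"region": "East", "card": "basic", "big_ticket": False},
]
support, confidence = rule_stats(
    purchases, {"region": "West", "card": "gold"}, {"big_ticket": True}
)
```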
Another analysis tool for databases or data warehouses with massive amounts of data items or records is the On-line Analytical Processing (“OLAP”) technique. A number of commercially available products have been built to employ this technique, e.g., Cognos’ Enterprise OLAP and PowerPlay, Business Objects Inc.’s Business Objects, Informix’s MetaCube, Platinum’s InfoBeacon, MicroStrategy’s DSS Agent, Oracle’s Express, etc. All of these products offer similar functionality.
OLAP typically includes the following kinds of analyses: simple (view one or more measures which can be sorted and totaled), comparison or cross-tab (view one measure and sort or total based upon two dimensions), trend (view a measure over time), variance (compare one measure at different times such as sales and sales a year ago), and ranking (top 10 or bottom 10 products sold) [Peterson, T. et al., SAMS Publishing (1999)]. OLAP enables users to drill down within a dimension to see more detailed data at various levels of aggregation.
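Two of the analyses listed above, the cross-tab and the ranking, can be sketched in a few lines. This is an illustrative sketch over in-memory records, not the interface of any of the commercial OLAP products named; the function names and the sample dimensions are assumptions made for the example.

```python
from collections import defaultdict

def cross_tab(records, row_dim, col_dim, measure):
    """Total one measure over every combination of two dimensions."""
    totals = defaultdict(float)
    for r in records:
        totals[(r[row_dim], r[col_dim])] += r[measure]
    return dict(totals)

def top_n(records, dim, measure, n):
    """Rank the members of one dimension by their total for a measure."""
    totals = defaultdict(float)
    for r in records:
        totals[r[dim]] += r[measure]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]

# Hypothetical sales records.
sales = [
    {"region": "West", "year": 1998, "product": "A", "sales": 10.0},
    {"region": "West", "year": 1999, "product": "A", "sales": 20.0},
    {"region": "West", "year": 1999, "product": "B", "sales": 5.0},
    {"region": "East", "year": 1999, "product": "B", "sales": 40.0},
]
by_region_and_year = cross_tab(sales, "region", "year", "sales")
best_products = top_n(sales, "product", "sales", 1)
```

Drilling down within a dimension amounts to repeating the same aggregation on a finer-grained attribute, e.g., store within region.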
Users can also filter data with the OLAP technique, i.e., focus their analysis on a subset of records in the database. For example, if a user is interacting with a retail chain store database then he/she may only be interested in “West Coast” stores. Users need to know the attribute or attributes on which they want to set up filter conditions. Users also need to know how to define the filtering conditions; OLAP enables users to filter records based only upon arithmetic conditions on one or more database attributes or a “where” clause in a SQL statement.
In addition to the analysis tools and techniques described above, there is also what is known as the Knowledge Discovery in Databases (KDD) Process. KDD and data mining conferences have been held since 1989. This new field has produced a widely followed and accepted KDD process, capable of selecting data, pre-processing or editing data, transforming data, performing data mining, and evaluating/interpreting the findings. See Fayyad, U. et al., “The KDD Process for Extracting Useful Knowledge from Volumes of Data,” Communications of the ACM 39, 11, pp. 27-34 (November 1996). The KDD process is “The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.” (Fayyad, U. et al., supra.) This process assumes that a knowledge engineer or domain expert will evaluate and interpret the findings.
There is a recent trend in the industry to integrate data mining techniques with OLAP tools. See, for example, “OLAP Vendors Increasingly See Data Mining Integration as Potent Differentiator”, http://www.idagroup.com/v2n0701.html; “OLAP and Data Mining: Bridging the Gap”, http://www.dbpd.com/vault/parsfeb.html. The purpose of the integration of these tools is to give analysts the flexibility to choose whether to use OLAP to view and aggregate data, or data mining techniques to better analyze attributes. Users can use these tools in any combination.
There are limitations to this integration approach. Typically, these tools are packaged in a software product, but little or no guidance is given to users on how they should use the tools in conjunction with one another. These tools solve different types of problems, so it is difficult to use them to support one another. Also, these tools do not always present results in an easy-to-understand manner. For example, a user can look at a WIS rule or PRIM region definition and understand the attributes and values. However, users may miss the meaning of the pattern or an explanation for its occurrence. That is, a user cannot easily look at a SQL statement describing a PRIM region and intuitively understand the differences between the high average region and the other data points or outer region.
Some combinations of these tools do not make sense. For example, OLAP tools cannot discover high average regions or find new patterns in data.
The method of the present invention integrates two technologies for massive database analysis: one for finding an optimized region, consisting of a subset of records in a database, and a second for analyzing, interpreting, and evaluating the subset of records and/or newly identified facts. (Hereinafter, “database” will be understood to include both database structures, as well as data warehouse implementations.)
Broadly, the invention employs a data mining technique to first search the entire database for a region containing database records in which a pre-selected attribute has a high average value. This region will comprise a subset of database records that are characterized by ranges of values for some feature or set of features, i.e., columns in a relational table, often referred to as attributes. These characteristics about database records are sometimes called patterns, i.e., patterns that are common to the subset of records found. Next, the database records of the region found are further examined by, for example, aggregating, sorting, comparing, and computing new measures, e.g., variance between two columns. The subset of records found by the data-mining step can be compared to the remaining records, i.e., those records not satisfying the newly found pattern(s). Also, the region’s subset of records can be compared to the entire set of records. One embodiment of the invention makes use of on-line analytical processing (OLAP), although other tools for making such analysis may also be used.
In a specific embodiment of the invention, the Patient Rule Induction Method (“PRIM”) process is used to develop the region, using predetermined attributes of the database records as inputs and a selected one of the attributes as the output. PRIM can be structured to produce an SQL statement that describes the region found by its operation. This SQL statement can then be applied to an OLAP or OLAP-like application for focused analysis.
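The hand-off from a region to an SQL statement might look like the following. This is a minimal sketch under stated assumptions: the region is taken to be a set of numeric ranges, one per attribute, the function name `region_to_sql`, the table name `purchases`, and its columns are hypothetical, and an in-memory SQLite database stands in for whatever store an OLAP application would actually query.

```python
import sqlite3

def region_to_sql(table, box):
    """Build a parameterized SELECT whose WHERE clause describes the region.

    `box` maps attribute name -> (low, high). Table and column names are
    interpolated directly, so they must come from a trusted source.
    """
    columns = sorted(box)
    where = " AND ".join(f"{col} BETWEEN ? AND ?" for col in columns)
    params = [bound for col in columns for bound in box[col]]
    return f"SELECT * FROM {table} WHERE {where}", params

# Hypothetical region: customers aged 30-45 with income 50,000-90,000.
sql, params = region_to_sql("purchases", {"income": (50000, 90000), "age": (30, 45)})

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (age INTEGER, income INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [(35, 60000, 900.0), (25, 60000, 50.0), (40, 95000, 70.0)],
)
region_rows = conn.execute(sql, params).fetchall()  # records inside the region
```

Negating the same WHERE clause selects the outer region, so a single region description can drive both sides of a comparison.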
The unique combination of PRIM and a comparison tool, such as OLAP, enables efficient searching of records in the database to find patterns, then a detailed analysis on the subset of records defined by the patterns.
In a further embodiment of the invention, the product of the data-mining operation, i.e., the region found, can then be applied to an association rule tool to identify particular associations between the records of the region. In this embodiment the Weighted Item Sets (WIS) tool is utilized to find relationships between various attributes in the records of the region.
The method of this invention can be thought of as a three-step approach for applying these techniques, where each technique’s strength is utilized for a specific purpose.
The first step of the method is to use PRIM to find regions in the database. PRIM finds regions with high average values for an output variable. The boundaries of the region define a subset of database records that satisfy algorithmically computed criteria.
The second and third steps of the method use the results of PRIM (and, in the alternate embodiment, WIS), i.e., the subset of records, as input for a more detailed analysis of the various database record attributes. That is, we start with a region discovered by PRIM and possibly a subregion discovered by WIS. OLAP can be used to interpret, evaluate, and compare the subset of data. In the second step, OLAP is used to compare data points in the region to data points outside of the region. In the third step, OLAP is used to compare the region to the entire data population.
In the alternate embodiment of the invention, the first step of using PRIM may be followed by a second step of running WIS on the PRIM region, i.e., finding associations between attributes for the records in the high average region. Then, the OLAP application operates on the sub-region found by WIS, forming steps three and four of this alternate method.
PRIM and OLAP complement one another. PRIM finds an optimized region, i.e., a subset of data points, and, once PRIM has been run, OLAP can graphically display aggregated values for various dimensions for both the region and the points outside of the region, i.e., the outer region.
Based upon the data mining results, OLAP compares data points in a region to data points outside of a region. An OLAP report could be run on all dimensions, e.g., all input variables used in PRIM. Even new measures could be defined for the OLAP analysis. Based upon the OLAP results, further OLAP reports can be run to drill down on interesting dimensions.
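The comparison of the region with the outer region and with the entire population can be sketched as a single aggregation per dimension. This is an illustrative sketch only: the function name `compare_region`, the use of the average as the aggregate, and the sample fields are assumptions, and the region test is passed in as a plain predicate rather than as the SQL statement an OLAP tool would receive.

```python
from statistics import mean

def compare_region(records, in_region, dimension, measure):
    """Average a measure per dimension value for the region, the outer region, and all records."""
    groups = {}
    for r in records:
        parts = groups.setdefault(r[dimension], {"region": [], "outer": [], "all": []})
        parts["region" if in_region(r) else "outer"].append(r[measure])
        parts["all"].append(r[measure])
    # None marks a dimension value with no records on that side of the boundary.
    return {
        value: {side: (mean(vals) if vals else None) for side, vals in parts.items()}
        for value, parts in groups.items()
    }

# Hypothetical records; the region is assumed to be "x >= 7".
stores = [
    {"store": "West", "x": 8, "sales": 100},
    {"store": "West", "x": 2, "sales": 20},
    {"store": "East", "x": 9, "sales": 60},
    {"store": "East", "x": 1, "sales": 40},
    {"store": "East", "x": 3, "sales": 20},
]
report = compare_region(stores, lambda r: r["x"] >= 7, "store", "sales")
```

Running the same report on a second dimension, or on a finer level of the same dimension, corresponds to the drill-down described above.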
In the alternative use, WIS unearths patterns, i.e., associations between attributes. In the case of WIS, the region is not an optimized region but a region made of records satisfying the criteria in the rule. For WIS, the patterns are represented by rules, where each rule describes a region that consists of data points satisfying the rule’s conditions.
The present invention differs from earlier attempts to combine data mining techniques, because the data mining technique selected for this invention, PRIM, has not been applied in this manner, i.e., no other tools have been used in conjunction with PRIM. The invention employs a specific series of steps for applying data mining and OLAP. In the first step, PRIM is specifically used to find a region for analysis. Any other technique that finds an interesting region, i.e., a region with records that can be compared to records outside of the region or to the entire population, can be applied. WIS may also be used to identify a subregion, i.e., a subset of records, on the region found by PRIM.
The present invention has advantages not heretofore achieved by the prior art, because the output of the first step of this method helps users determine which subset of records to analyze further and focus on. OLAP can filter records, but not based upon statistical evidence or algorithmic computations. For example, OLAP is capable of finding a particular percentage of records having a selected attribute with a high value, but it cannot find a percentage of records with that attribute and also having some other common attribute values in a multi-dimensional space.
The invention provides an improvement over prior art analysis methods in that it partially automates the KDD process to perform data mining and evaluate the results using a specific three-step (or, alternatively, a four-step) method. The partial automation is that the region from the first step can be represented by a SQL statement that can be used by OLAP to retrieve data. Then OLAP can be used to evaluate the rules. The improvement is that subsets of records are further analyzed and compared to the remaining records, i.e., those not identified by the rules.