With the growth and proliferation of databases (and data warehouses) into almost any type of organization, there is an ever growing need for automated tools to help analyze, visualize, and summarize the contents of large databases.
Decision support queries can be well-specified (e.g. give me a report of sales by store over an exactly specified period of time) and not so well specified (e.g. find me records in the data that are "similar" to records in table A but "dissimilar" from records in table B). For example, in a fraud detection setting, table A can represent transactions known to be associated with fraudulent activity while table B contains transactions known to be legal. Such queries in general require a method for modeling the contents of the data. Methods for constructing models from databases are examples of data mining methods. These methods include a wide class of predictive and descriptive modeling techniques.
Examples of data-driven decision support situations abound. For example, a credit card or telecommunications company that would like to query its database of usage data for records representing fraudulent cases. Another example would be a scientist dealing with a large body of data who would like to request a catalog of events of interest appearing in the data. Such patterns, while recognizable by human analysts on a case by case basis, are typically very difficult to describe in a SQL query or even as a computer program in a stored procedure. Often the domain expert does not even know what variables determine the decision.
An alternative effective means of interacting with the database is to state the query by example. In this case, we assume one of the fields in the database represents values of a target variable (often called a class variable). The values of the class column could have been obtained by asking a domain expert to label cases (a subset of the records called a training set) with appropriate values of the class variable. The data mining system builds a model to automatically label records (i.e. distinguish one class value from another for any given record) based on the training data. The system can then apply the extracted classifier to search the full database for events of interest or to classify future cases as they arrive. This approach is typically easier than detailed modeling of the causal mechanisms underlying the phenomenon of interest because examples are usually readily available, and humans find it natural to interact at the level of labeling cases. Also, it is often only possible to obtain a label (class) for a case in retrospect (e.g. fraud).
The difficulty in building classification models manually derives from the fact that humans find it particularly difficult to visualize and understand a large data set. Data can grow along two dimensions: the number of fields (also called dimensions or attributes) and the number of cases or records. Human analysis and visualization abilities do not scale to high-dimensions and massive volumes of data. A standard approach to dealing with high-dimensional data is to project it down to a very low-dimensional space and attempt to build models in this simplified subspace. As the number of dimensions grows, the number of choice combinations for dimensionality reduction explodes. Furthermore, a projection to lower dimensions could easily transform an otherwise solvable discrimination problem (classification problem) into one that is impossible to solve. If the human analyst is trying to explore models, then it becomes infeasible to go through the various ways of projecting the dimensions or selecting the right subsamples (reduction along columns and rows of the target database). An effective means to visualize data would be to employ data mining methods (e.g. classification) to perform the appropriate reductions. Because data growth rates are higher than humans can keep up with using traditional manual analysis techniques, automated data mining techniques become necessary if the goal is to utilize the data in a timely manner. In any competitive environment where others who do utilize data resources better will gain a distinct advantage, decision support operations are very important.
For the reasons cited above, using standard database query languages (such as SQL) to build classification models can be very slow and ineffective. Queries aimed at projecting data down to lower dimensions often leave the analyst without any clue of where to proceed. Data mining methods can explore such projections automatically and assess which attributes are likely to be important. For a wide class of such mining algorithms (including classification), and important step is deriving counts of correlations of values of attributes. For example, in a classification problem, correlation counts of combinations of values of attributes with values of the class variable are of particular interest.
Table 1 depicted below illustrates the makeup of the data records of a database. Each of the records has a number of attributes or fields, Attr.sub.-- 1, Attr.sub.-- 2 . . . Attr.sub.-- m and includes an attribute or field that is labeled `class`. `Class ` is the target attribute whose value is to be predicted by a classifier based on other attributes.
TABLE 1 ______________________________________ Data.sub.-- Table CaseID Attr.sub.-- 1 Attr.sub.-- 2 . . . Attr.sub.-- m Class ______________________________________ ICase 1 V.sub.-- 1j V.sub.-- 1k . . . V.sub.-- m1 Ci ICase 2 V.sub.-- 1i V.sub.-- 2l . . . V.sub.-- mk Cj . . . . . . . . . . . . . . . . . . ICase N V.sub.-- 1q V.sub.-- 2r . . . V.sub.-- ms Ck ______________________________________
The first attribute in the table under the heading `CaseID` is a key indexed field for the records in the table.
A Counts Table that is based on the Table 1 data has the following form:
TABLE 2 ______________________________________ Counts Table Attr.sub.-- ID Attr.sub.-- val ClassVAl Count ______________________________________ Attr.sub.-- 1 v.sub.-- 11 C1 # Attr.sub.-- 1 v.sub.-- 11 C2 # . . . . . . . . . . . . Attr.sub.-- 1 v.sub.-- 11 Ck # Attr.sub.-- 1 v.sub.-- 12 C1 # . . . . . . . . . Attr.sub.-- 1 v.sub.-- 1r.sub.1 Ck . . . . . . . . . . . . . . . Attr.sub.-- m v.sub.-- m1 C1 # Attr.sub.-- m v.sub.-- m1 C2 # . . . . . . . . . . . . Attr.sub.-- m v.sub.-- mr.sub.m Ck # ______________________________________
This table gives a count of the number of occurrences of every attribute-value and class-value combination in the database. Note that this table is much smaller than the data table. It has #class-values.times.#attributes.times.#vals/attribute count cells. In most applications this is much smaller than N. the number of records in the database, typically by several order magnitudes. Table 2 tabulates counts for only one class variable with values C1 . . . Ck. In principle, the table can accommodate as many class variables as desired. This is done by adding a ClassID column to table 2. Stated another way, each class variable gets a column in the correlation counts table.
An illustrative example of a database for building a classifier contains data records describing operations of a large number of retail stores that are located throughout the country. Representative sample data records for such a database are listed below in Table 3.
TABLE 3 __________________________________________________________________________ (Class) CaseID StoreType Region . . . Locationtype Season Profitable __________________________________________________________________________ record1 Shoes Midwest . . . Urban Summer Average record2 Elecs West . . . Mall Christmas VeryProf record3 Misc Nwest . . . Mall Fall/School BreakEven . . . . . . . . . . . . . . . . . . . . . recordN Shoes Midwest . . . StripMall Summer Losing __________________________________________________________________________
The goal of the data mining is to derive a model (classifier) to predict class values based on the other attributes of table 3. Such a classifier may be used to identify a set of factors that can be used in locating a store of a particular type that will produce high profits without necessarily actually set up a store to gain operating data.
An example counts table associated with the data shown in Table 3 is given in Table 4:
TABLE 4 ______________________________________ Attribute Value Profitability count ______________________________________ Attribute1 v11 Ave C.sub.11A Attribute1 v11 Vprof C.sub.11V Attribute1 v11 Losing C.sub.11L Attribute1 v11 BEven C.sub.11B Attribute1 v12 . . . v13 . . . . . . . . . Attribute2 v21 . . . v22 . . . v23 . . . LocationType StripMall Ave C.sub.SMA Vprof C.sub.SMV Losing C.sub.SML Beven C.sub.SMB Rural Ave C.sub.RA Vprof C.sub.RV Losing C.sub.RL Beven C.sub.RB Urban Ave C.sub.UA Vprof C.sub.UV Losing C.sub.UL BEven C.sub.UB Mall Ave C.sub.MA Vprof C.sub.MV Losing C.sub.ML Beven C.sub.MB StoreType Shoes Ave 25,000 Vprof 15,000 Losing 25,000 Beven 30,000 Electr Ave 25,000 Vprofit 20,000 Losing 30,000 Beven 40,000 Misc Ave 250 Vprofit 250 Losing 0 Beven 30000 AttributeM vM1 . . . vM2 . . . ______________________________________
Table 4 is organized with the class variable as a column heading and the attribute values of the various attributes that make up the records of the database as row designators. In the Illustrated counts table the `Location Type` has only four discrete values, i.e. StripMall, Rural, Urban, and Mall. The store location attribute `StoreType` has the three discrete values, i.e. Electronics (Electr) , Misc, and Shoes.
Each record has one class value and a combination of attribute values and therefore contributes a count to the counts table for each of its M attributes. For each value of the attribute LocationType and for each value of the Class variable (Profitable) a count of the number of cases in the data subset under consideration (node in the tree) is recorded. A case or record contributes to the count of LocationType=Mall and Profitable=VProfitable if it has those two values for the two attributes. The total number of entries in the counts table for a given attribute, LocationType, for example is equal to the total number of entries in the counts table for every other attribute (assuming there are no null entries in the original table.)
A wide class of efficient data mining algorithms for building, classification models from data depend on a core set of sufficient statistics (referred to in the present application as correlation counts) that need to be computed from the database. Examples of the use of the correlation counts in building a classifier are found in Appendix A to the application.
These aggregates (correlation counts) can be obtained by executing a series of Group By queries in ANSI standard SQL. However, the structure of these queries can be exploited so that the set of queries can be executed efficiently in a single scan of the data table. By using a new operator, the present invention enables the collection of such aggregates efficiently and with minimal modification to standard database server software.