The present invention relates to relational database management systems. More specifically, the invention relates to data mining applications for data stored in a relational database table.
Data mining is the process of discovering unexpected patterns or trends in existing data so that a database user can use the discovered knowledge to manage a business more effectively. For example, a typical application might study the demographic information known about a list of customers to create a profile of people most likely to buy a given product, respond to a direct mail campaign, or default on a loan.
Despite its name, xe2x80x9cdata miningxe2x80x9d has little to do with xe2x80x9cdrill-downxe2x80x9d queries; these types of queries are feasible in a data warehouse that has no data mining component. The difference between a data mining exercise and standard decision support queries lies in the analyst""s approach to the data: standard queries are assumption-driven, whereas data mining is discovery-based.
For example, an analyst who writes a query that compares sales last year to sales this year is looking for an accurate result to a routine business question, while an analyst who mines sales data is looking for patterns or trends that might be interesting and useful to understand. If the sales comparison query reveals that sales this year are up 200% from last year in some stores but only 100% in others, for example, the analyst might want to mine the data to discover any unexpected reasons for this discrepancy.
In other words, the data warehouse application helps analysts understand what happened, while data mining applications attempt to tell analysts why it happened. It is useful for a retailer to know, for example, what factor had the greatest impact on sales of a given product, and whether the retailer can control that factor or take advantage of that knowledge. Any number of factors can influence salesxe2x80x94price, style, packaging, promotions, placement in the store, season, weather, day of week, time of day, and so on. Without requiring analysts to ask explicit questions about each possible factor, a data mining application can read very large volumes of data and discover both obvious and hidden trends.
A data mining exercise consists of two main phases: building a model and predicting future results. A model defines the influencing factors (input data) is and the potential outcomes (output values). If the data must be obtained from a large pool of data that already exists (e.g., a relational database), the model also defines how these factors and outcomes are selected and mapped from the larger pool of data. The set of inputs, which might be very large (100 or more), is a list of factors that might influence the output. The result of the data mining exercise is a much smaller list of inputs that individually or in combination with other inputs does influence the output.
In turn, the model provides the ability to predict an output given a set of input characteristics. For example, if the model accurately reflects the tendency of customers with certain attributes (age, gender, income, and so on) to purchase a luxury automobile, the discovered results can be compared with a list of prospective buyers to determine those most likely to make a purchase in the near future. Because models can be used independently or in conjunction with query analysis, a warehouse query might be issued to generate a promotional mailing list from a Customer table.
Categorization analysis (a subset of data mining) has useful applications in various industriesxe2x80x94financial services, direct mail marketing, telecommunications, insurance, healthcare, retail salesxe2x80x94as a means of solving business problems by leveraging a deeper knowledge of the targeted consumers. Some of these applications are as follows:
Promotion analysisxe2x80x94In retail sales, understanding which products customers often purchase in combination with other products can lead to changes and improvements to in-store displays, seasonal promotions, advertising campaigns, and general advertising strategies.
Churn analysisxe2x80x94In telecommunications, discovering the profile of individuals likely to switch (or not to switch) long-distance carriers helps find ways to attract long-term customers and discourage short-term customers from switching carriers.
Claims analysisxe2x80x94In insurance, isolating the characteristics of a good risk can save both time and money. In healthcare, cost-cutting measures can be discovered by analyzing the history of a claim and the profile of the claimant.
Customer profilingxe2x80x94In almost any business, discovering the demographic profile of consumers most likely to buy or use a given product is critical to maintaining a competitive edge. The ability to compile effective target mailing lists is a common result of good profiling.
Rate and usage analysisxe2x80x94In telecommunications, studying the details of customer calls helps analysts find ways to better serve customers and keep their business, as well as improve the day-to-day service available to those customers.
Fraud detectionxe2x80x94In insurance, healthcare, and credit-card businesses, discovering the typical characteristics of a fraudulent claim or application helps analysts predict and prevent future fraudulent activity.
Various data mining products are available including xe2x80x9cDataCruncherxe2x80x9d from DataMind Corporation of San Mateo, Calif., xe2x80x9cIBM Intelligent Minerxe2x80x9d from IBM Corporation of Armonk, N.Y., and xe2x80x9cMineSetxe2x80x9d available from Silicon Graphics, Inc., of Redwood City, Calif. To use large data sets, all of these products must first retrieve data from a relational database (via SQL) and then convert the retrieved data to a flat file, and finally send that file to the data mining engine. Unfortunately, even if the data mining functionality can quickly and efficiently handle the large data sets it receives, the associated systems can convert and send the data only relatively slowly. First, the step of converting the database data to a flat file entails significant effort, including converting data types. Second, transporting the data to the data mining functionality typically involves sending the file over an ODBC connection. Unfortunately, transporting the large data sets necessary for many data mining projects over such connections is too slow for many applications. A related problem exists in providing the mined results back into a database. Specifically, other tools need to transfer the mined results back into the database, further increasing the overhead.
Further, some data mining products employ spreadsheets to provide the data used in the data mining operation and to display results of the operation. Without a mechanism for rapidly porting large data sets (from a large database for example), the size of the data sets employed with such products are limited. In addition to limiting application of data mining to relatively small data sets, this can limit the validity of models generated from the data.
In view of these limitations, a data mining system with improved performance would be desirable.
The present invention fills this need by providing an integrated data mining and relational database system. This is accomplished by eliminating the need to convert data to a flat file and export it from the relational database management system to a data mining engine. In addition, the invention makes patterns uncovered during data mining (e.g., xe2x80x9cunderstandxe2x80x9d and xe2x80x9cpredictxe2x80x9d information) available in virtual relational database tables that can be queried.
Preferably, the relational database management system integrated with the data mining engine is integrated on a server. The data mining engine determines characteristics of relationships between input data values and an output data value that are obtained from a relational database (managed by the relational database management system). The integration allows direct conversion of data values from the relational database to data mining identifiers used for data mining operations by the data mining engine. It also allows identifiers output by the data mining engine to be converted back to the data values. As mentioned, prior data mining systems did not integrate data mining engines with relational database management systems. Therefore, they had to first convert data from the relational table to a flat file or other format before transferring that data to the data mining engine, which then converted the raw data to data mining identifiers for the process of data mining.
The integrated data mining engine and the relational database system of this invention allows users to create a data mining model that includes a model table specifying input columns and an output column for rows of values from the relational database. Note that the rows specify the relationships between the input values and the output value. Preferably, the model table includes a definition for the model but is not populated with data values. However, the model itself may include one or more data model look up indexes specifying translations between values in the relational database and data mining identifiers used by the data mining engine to represent the relational database values. One look up index may be a forward translation index configured to translate data values from the relational database to data mining identifiers. Another look up index may be a reverse translation index configured to translate data mining identifiers to corresponding data values from the relational database. The indexes allow data to be inserted for calculations by the data mining and rows to be constructed in response to user queries.
Another aspect of the invention relates to data mining models including a definition for a relational table providing multiple relationships between input values and output values and indicia of the xe2x80x9cimportancexe2x80x9d of each relationship between an output and one or more inputs. Such relational tables are sometimes referred to as UNDERSTAND sub-tables. The importance of the relationship may be determined by the data mining model to be any one or more criteria indicating how strong the tie between the input values and the output value might be. Often the importance indicates the strength of the correlation between the input values and the output value.
Preferably, the relational table is not actually populated with data, but is tied to a data mining model so that the computer system can construct rows corresponding to the relationships in response to user queries. These rows can be displayed as views or otherwise presented to the user for review.
While the relational table may not be populated with data values, it does include a definition that may specify one or more columns for user specified outputs and inputs from a relational database. The rows of the table may specify relationships between input data values and an output data value. The definition may also specify one or more columns that indicate the importance of the relationships. For example, one column might be defined to contain values indicating occurrence counts for the relationships, another to contain occurrence frequencies for the relationships, and another to specify correlations for the relationships. Preferably, the importance of the relationships (however it is represented) is calculated by a data mining engine.
Yet another aspect of the invention provides a data mining model including a definition for a relational xe2x80x9cPREDICTxe2x80x9d table providing multiple relationships between input values and output values and configured to return a predicted output value in response to a query containing one or more input values. Preferably, the relational predict table is further configured to return a xe2x80x9cconfidencexe2x80x9d value for the predicted output value it returns in response to the query. The confidence in the prediction may take various forms. For example, it may take the form of a discrimination value calculated by a data mining engine. The discrimination indicates the relative likelihood of one output occurring over another output in response to the same set of inputs.
The PREDICT relational table preferable exists only as a definition, unpopulated by actual data values. In response to user queries requesting a prediction, the computer system constructs rows corresponding to the relationships that include a column for an output and one or more columns for user specified inputs from a relational database. The rows will also include a column for confidence in the predicted output (indicated as a discrimination value for example).
Another aspect of the invention provides a method of evaluating a query on a predict table in a data mining model. The method may be characterized as follows: (a) identifying an expression or sub-expression in a predicate of the query; (b) identify a start/stop condition from the expression or sub-expression; (c) binding the start/stop condition to a column of the predict table; and (d) passing column values constrained by the start/stop condition to a data mining engine. The method may require that expression meet certain criteria before its start/stop conditions will be bound to the column. If the expression or sub-expression does not meet these criteria, the DBMS may defer application of start/stop conditions (or even evaluation of start/stop conditions) until after receiving results of a data mining operation from the data mining engine. An example of a condition used to determine whether a start/stop condition should be applied before or after data mining is the condition that the expression or sub-expression be atomic and conjunctive. Further, the expression or sub-expression may have to compare a bindable value to a model input.