Progress in database technology has made massive warehouses of business data ubiquitous. There is increasing commercial interest mining the information in such warehouses. Data mining is used to extract predictive models from data that can be used for a variety of business tasks. For example, based on a customer's profile information, a model can be used for predicting if a customer is likely to buy sports items. The result of such a prediction can be leveraged in the context of many applications, e.g., a mail campaign or an on-line targeted advertisement. Typical data mining models include decision tree, clustering, and naïve Bayes classifiers.
The traditional way of integrating mining with querying is to pose a standard database query to a relational backend. The mining model is subsequently applied in the client/middleware on the result of the database query. Thus a mining query such as “Find customers who visited the MSNBC site last week and who are predicted to belong to the category of baseball fans”, would be evaluated in the following phases: (a) execute a SQL query at the database server to obtain all the customers who visited MSNBC last week, and (b) for each customer fetched into the client/middleware, apply the mining model to determine if the customer is predicted to be a baseball “fan”. While this approach might provide adequate results, if the number of customers predicted to be “baseball fans” is significantly lower than the number of customers who visited MSNBC last week, this may not be the most efficient way to process the query.
Recently, several database vendors have made it possible to integrate data mining techniques with relational databases by applying predictive models on relational data using SQL extensions. The predictive models can either be built natively or imported, using Predictive Model Markup Language (PMML) or other interchange format. FIG. 6 depicts an overview of the Microsoft Analysis server product (part of SQL Server 2000) in which mining models are explicitly recognized as first-class table-like objects. Creation of a mining model corresponds to the schematic definition of a mining model. The following example shows creation of a mining model that predicts risk level of customers based on source columns gender, purchases and age using decision trees:
CREATE MINING MODEL Risk//Name of Model(Customer_ID LONG KEY,//Source ColumnGender TEXT DISCRETE,//Source ColumnRisk TEXT DISCRETE PREDICT,//Prediction ColumnPurchases DOUBLE DISCRETIZED,//Source ColumnAge DOUBLE DISCRETIZED,//Source Column)USING [Decision_Trees_101]//Mining Algorithm
The model is trained using the INSERT INTO statement that inserts training data into the model. Predictions are obtained from a model M on a dataset D using a prediction join between D and M. A prediction join is different from a traditional equi-join on tables since the model does not actually contain data details. The following example illustrates prediction join:
SELECT D.CustomerID, M.RiskFROM [Risk]MPREDICTION JOIN(SELECT Customer_ID, Gender, Age, sum(Purchases) as SPFROM Customers D Group BY Customer_ID, Gender, Age) as Dand M.Age = D.Ageand M.Purchases = t.SPWHERE M.Risk = “low”
In this example, the value of “Risk” for each customer is not known. Joining rows in the Customers table to the model M returns a predicted “Risk” for each customer. The WHERE clause specifies which predicted values should be extracted and returned in the result set of the query. Specifically, the above example has the mining predicate Risk=“low”.
IBM's Intelligent Miner (IM) Scoring product integrates the model application functionality of IBM Intelligent Miner for Data with the DB2 Universal Database. Trained mining models in flat file, SML, or PMML format can be imported into the database. An example of importing a classification model for predicting the risk level of a customer into a database using a UDF called ID-MMX.DM_impClasFile( ) follows:                INSERT INTO IDMMX.ClassifModels values (‘Risk’,        IDMMX.DM_impClasFile(‘/tmp/myclassifier.x’))        
Once the model is loaded into a database, it can be applied to compatible records in the database by invoking another set of User Defined Functions (UDFs). An example of applying the above classification mining model (“Risk”) on a data table called Customers is shown below:
SELECT Customer_ID, RiskFROM (SELECT Customer_ID IDMMX.DM_getPredClass(IDMMX.DM_applyClasModel(c.model,IDMMX.DM_applData(IDMMX.DM_applData(‘AG', s.age),‘PURCHASE’,s.purchase)))as RiskFROM ClassifModels c, Customer_list sWHERE c.modelname=‘Risk’ and s.Salary<40000)WHERE Risk = ‘low’
The UDF IDMMX.DM_applData is used to map the fields s.Salary and s.age of the Customer_list table into the corresponding fields for the model for use during prediction. The UDF IDMMX.DM_applyClasModel ( ) applies the model on the mapped data and returns a composite result object that has along with the predicted class other associated statistics like confidence of prediction. A second UDF ID-MMX.DM_getPredClass extracts the predicted class from this result object. The mining predicate in this query is: Risk=‘low’.
Because existing systems handle queries containing mining predicates by applying the mining model as a filter on the intermediate results from the traditional predicates of the SQL query, they do not exploit the mining predicates for better access path selection. The main challenge in exploiting mining predicates for access path selection is that each mining model has its own specific method of predicting classes as a function of the input attributes. Some of these methods are too complex to be directly usable by traditional database engines.