1. Field of the Invention
The present invention relates to a technique for executing a data mining algorithm using structured query language (SQL) in a relational database. More particularly, the present invention relates to a technique by which a data mining algorithm may be implemented relative to a relational database without significantly degrading database performance.
2. Description of the Related Art
The sheer size of databases has been growing in recent years. For example, it is relatively common for businesses to have databases that are measured in terabytes. These databases may comprise customer information, employee information, stockholder information, etc.
Using customer information as an example, even before the computer revolution, customer information and lists and the like have long been recognized as extremely valuable corporate assets. In theory, it should follow that computers should be able to exploit databases of customer information for enhanced marketing purposes, more concise customer mailings (preventing duplicate mailings), etc. However, this can be extremely complex. A single entry can include a large number of distinct data entries or fields. This number can easily exceed 100. The huge volumes of data overwhelm traditional methods of data analysis (spreadsheets, ad hoc queries in relational DBMSs, multidimensional analysis tools, statistical analysis packages).
Relatively recently, data mining was introduced as a technique that can intelligently and automatically transform data into information. Data mining is the search for relationships and global patterns that exist in large databases, but are hidden among vast amounts of data. Data mining extracts previously unknown and potentially useful information (e.g., rules, constraints, correlations, patterns, signatures and irregularities), focusing on automated methods for extracting patterns and/or models from data.
The data mining community has focused mainly on automated methods for extracting patterns and/or models from data. The state of the art in automated data mining methods is still at a fairly early stage of development. The primary goals of data mining in practice are prediction and description. Prediction involves using some variables or fields in the database to predict unknown or future values of other variables of interest. Description focuses on finding interpretable patterns that describe the data. The relative importance of prediction and description for particular data mining applications can vary considerably. In business, a successful data mining method is known as "Market Basket Research." Market Basket Research analyzes customer transactions for patterns or "association rules" which help make business decisions (e.g., choose sale items, design coupons, arrange shelves, etc.); this is also known as association rules mining.
For example, data mining can be performed by a company relative to its customer database to determine, based on customer data stored in the database, which customers are most likely to be good candidates for a new product, and to focus marketing efforts on these customers. In data mining, an algorithm is often created which defines the desired mining. In practice, this algorithm can be quite complex. Commonly, the algorithm goes through each customer record and creates a score for each customer, which is utilized to determine whether to market the product to the customer.
Typically, the data mining algorithm is embodied in an application which is external to the database. One data mining product which adopts this method is the Intelligent Miner product from International Business Machines (IBM). The external application 'scores' the database from an existing model. These applications utilize an SQL cursor and fetch each record or tuple to be scored sequentially.
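The cursor-based approach described above can be illustrated with a minimal sketch. It is not taken from any particular product: the in-memory SQLite database, the `customers` table, its columns, and the toy `score_customer` formula are all assumptions made purely for illustration. The point it shows is that the scoring logic lives outside the database and receives tuples one at a time through a cursor.

```python
import sqlite3


def score_customer(age, purchases):
    """Hypothetical scoring model (a toy formula, assumed for illustration)."""
    return 0.5 * purchases + (1.0 if age < 40 else 0.0)


# Assumed example data: a small in-memory table standing in for a customer database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, age INTEGER, purchases INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 30, 4), (2, 55, 10), (3, 42, 0)],
)

# The external application opens an SQL cursor and fetches each tuple
# sequentially; every score is computed outside the database engine.
cursor = conn.execute("SELECT id, age, purchases FROM customers ORDER BY id")
external_scores = {}
for cust_id, age, purchases in cursor:
    external_scores[cust_id] = score_customer(age, purchases)

print(external_scores)
```

Because each tuple passes through the single cursor in turn, the scoring work in this arrangement is serialized in the external application regardless of how the database itself is partitioned.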
However, a number of problems and limitations are associated with current data mining techniques. Databases are getting larger and larger. It is not uncommon to find databases that include terabytes of data. When a data mining operation is attempted relative to a database of such magnitude, a number of things occur. With known techniques, the performance of the target database is seriously degraded. Normal business use is degraded, and the amount of time it takes for a data mining operation to complete is relatively long. One solution to the performance problem has been to write the data from the database out to a flat file and then perform the data mining operation relative to the flat file. However, this approach has a number of drawbacks. It requires a great amount of processor and memory capacity. Further, with the added levels of complexity involved with creating the flat file, moving the results back to the database, etc., the possibility of operator error, and thus of contaminated results, increases significantly.
Further, the approach whereby an SQL cursor fetches each tuple to be scored sequentially severely impacts performance in a multiple CPU relational database environment, as each record can only be retrieved sequentially.
Accordingly, a need exists for a database mining technique which does not degrade performance and simplifies the application of a data mining algorithm relative to a database.
An object of the present invention is to provide a technique for enabling data mining of large databases without degrading database performance.
Another object of the invention is to provide a simplified technique for implementing data mining of large databases.
Other objects and advantages of the present invention will be set forth in part in the description and the drawings which follow, and, in part, will be obvious from the description or may be learned by practice of the invention.
To achieve the foregoing objects, and in accordance with the purpose of the invention as broadly described herein, the present invention provides a system, method and computer readable code for performing an enhanced data mining operation of a database distributed over a plurality of nodes, the database comprising a managing node and a plurality of nodes which control access to data in the database, comprising first subprocesses for registering a user defined function with the managing node; second subprocesses for distributing the user defined function from the managing node to each of the plurality of nodes of the database; third subprocesses for initiating the user defined function at each of the plurality of nodes based on a command input to the managing node; fourth subprocesses for adding a data field to each tuple which is to be scored by the user defined function; fifth subprocesses for scoring each tuple targeted by the user defined function; and sixth subprocesses for storing each tuple's score in the newly defined data field. The system may also comprise seventh subprocesses for analyzing each tuple's score and performing an action relative to data contained in each tuple should the tuple's score fit predetermined score criteria. Each tuple may represent a customer, and if the tuple's score is within a certain range of values, the customer is selected for participation in a marketing plan. Alternatively, each tuple may represent a customer, and the customer is selected for participation in a different marketing plan based on the tuple's score. Further, each tuple may represent a customer, a different advertisement may be associated with defined ranges of scores, and the customer may be sent an advertisement based on the tuple's score.
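As a rough illustration of the subprocesses summarized above, the following sketch registers a user defined function with a database, adds a score field, scores every tuple with a single SQL statement, stores the result in the new field, and then selects a marketing plan by score range. It is only a sketch under stated assumptions: a single in-process SQLite database stands in for the managing node and the plurality of nodes, so the distribution step is not modelled, and the table, the `score_customer` formula, the `plan_for` thresholds, and the plan names are all hypothetical.

```python
import sqlite3


def score_customer(age, purchases):
    """Hypothetical user defined scoring function (toy formula, assumed)."""
    return 0.5 * purchases + (1.0 if age < 40 else 0.0)


def plan_for(score):
    """Hypothetical mapping from score ranges to marketing plans (assumed)."""
    if score >= 4.0:
        return "premium"
    if score >= 1.0:
        return "standard"
    return None  # score fits no range: customer is not selected


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, age INTEGER, purchases INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 30, 4), (2, 55, 10), (3, 42, 0)],
)

# Register the user defined function so it can be called from SQL.
conn.create_function("score_customer", 2, score_customer)

# Add a data field to hold each tuple's score.
conn.execute("ALTER TABLE customers ADD COLUMN score REAL")

# Score every targeted tuple and store the result in the new field.
# One statement replaces the tuple-by-tuple cursor loop of an external application.
conn.execute("UPDATE customers SET score = score_customer(age, purchases)")
conn.commit()

# Analyze the stored scores and act on those that fit a score range.
rows = conn.execute("SELECT id, score FROM customers ORDER BY id").fetchall()
selected = {
    cust_id: plan_for(score) for cust_id, score in rows if plan_for(score) is not None
}
print(rows)
print(selected)
```

In a database partitioned across multiple nodes, the same `UPDATE` statement is the kind of operation that each node could apply to its own portion of the data, which is the property the summary relies on; the single-node sketch here does not demonstrate that parallelism.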
The present invention will now be described with reference to the following drawings, in which like reference numbers denote the same element throughout.