1. Field of the Invention
The present invention relates to a system, method, and computer program product for performing data-centric automatic data mining.
2. Description of the Related Art
Data mining is a technique by which hidden patterns may be found in a group of data. True data mining does not merely change the presentation of data, but actually discovers previously unknown relationships among the data. Data mining is typically implemented as software in, or in association with, database systems. There are two main areas in which the effectiveness of data mining software may be improved. First, the specific techniques and processes by which the data mining software discovers relationships among data may be improved. Such improvements may include faster operation, more accurate determination of relationships, and discovery of new types of relationships among the data. Second, given effective data mining techniques and processes, the results of data mining are improved by obtaining more data. Additional data may be obtained in several ways: new sources of data may be obtained, additional types of data may be obtained from existing sources of data, and additional data of existing types may be obtained from existing sources.
Data mining is difficult to perform. To be successful, it requires complex methodology, data preparation, and tuning by the user. This makes data mining more of an art than a science and has limited the acceptance and dissemination of the technology. The concepts and methodologies used in data mining are also foreign to database users in general. Database users work from a data-centric query paradigm against a data source. Supervised modeling, an important subset of data mining that includes classification and regression modeling, in most cases requires two sources: a training data set, from which a model is built, and an apply data set, to which the model is applied. The conceptual hurdle posed by data mining has been handled by providing users with templates. These templates encapsulate a complex methodology, usually suited to a narrow problem domain, into a series of steps that can be tuned by the user. In some cases, templates also provide defaults based on heuristics, thus minimizing the need for tuning and user input. Templates can be parameterized and deployed for further ease of use.
Previous solutions based on templates are not general enough to support different data source schemas, do not automatically identify the type of predictive technique (classification or regression) to use for different target types in supervised problems, and do not work from a single data source in supervised cases. Another limitation of templates is their inability to update the results they produce seamlessly, without user intervention and without requiring the user to perform multiple operations (e.g., building and then deploying a solution for scoring).
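By way of illustration only, the automatic identification of a predictive technique from the target type, which the template-based solutions described above lack, might be sketched as follows. The function name `choose_technique`, the `max_classes` threshold, and the selection rule itself are hypothetical assumptions made for this example and are not taken from any particular data mining product.

```python
from numbers import Real


def choose_technique(target_values, max_classes=10):
    """Return 'classification' or 'regression' for a target column.

    Hypothetical heuristic (an assumption of this sketch): a numeric target
    with more than `max_classes` distinct known values is treated as a
    regression target; any other target is treated as a classification target.
    """
    # Ignore rows whose target value is unknown.
    observed = [v for v in target_values if v is not None]
    if not observed:
        raise ValueError("target column has no known values")

    # bool is a subclass of int in Python, so exclude it explicitly.
    is_numeric = all(
        isinstance(v, Real) and not isinstance(v, bool) for v in observed
    )
    if is_numeric and len(set(observed)) > max_classes:
        return "regression"
    return "classification"
```

Under these assumptions, a text-valued target such as `["yes", "no"]` selects classification, while a numeric target with many distinct values selects regression, without the user having to name the technique.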
A need arises for a data-centric data mining technique that provides greater ease of use and flexibility, yet provides high quality data mining results.