Data mining is a technique by which hidden patterns may be found in a group of data. True data mining does not merely change the presentation of data, but actually discovers previously unknown relationships among the data. Data mining is typically implemented as software in, or in association with, database systems. Data mining includes several major steps. First, data mining models are “trained” by processing training data and generating information that defines the model. The generated information is then deployed for use in data mining, for example, by providing predictions of future behavior based on specific past behavior.
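By way of illustration only, the two steps described above (training a model, then deploying the generated information to make predictions) might be sketched as follows. The function names `train` and `predict`, the record layout, and the sample data are hypothetical and are not taken from any particular data mining system; the "model" here is simply a lookup table of the most frequently observed outcome for each past behavior.

```python
from collections import Counter, defaultdict


def train(records):
    """Process training data and return the information that defines the model.

    `records` is an iterable of (past_behavior, outcome) pairs (a hypothetical layout).
    """
    counts = defaultdict(Counter)
    for behavior, outcome in records:
        counts[behavior][outcome] += 1
    # The generated information: the most frequent outcome for each behavior.
    return {behavior: c.most_common(1)[0][0] for behavior, c in counts.items()}


def predict(model, behavior, default=None):
    """Deploy the model: predict an outcome from a specific past behavior."""
    return model.get(behavior, default)


# Illustrative training data (invented for this sketch).
training_data = [
    ("viewed_product", "purchase"),
    ("viewed_product", "purchase"),
    ("viewed_product", "no_purchase"),
    ("abandoned_cart", "no_purchase"),
    ("abandoned_cart", "no_purchase"),
]

model = train(training_data)
prediction = predict(model, "viewed_product")
```

In this sketch the trained model is an ordinary dictionary, so it could be stored and later deployed separately from the training data that produced it.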
Clustering, along with classification, regression, and market basket analysis, is one of the major data mining tasks. Clustering is a useful technique for exploring and visualizing data. It is particularly helpful in situations where one has many records of data and no idea what natural groupings might be present in the data. Ideally, one would like the data mining software to find whatever natural groupings may exist. Clustering also serves as a useful data-preprocessing step to identify homogeneous groups on which to build predictive models such as trees or neural networks. A clustering model is different from predictive models in that the outcome of the process is not guided by a known result, that is, there is no target variable. Predictive models predict values for a target variable, and an error rate between the target and predicted values can be calculated to guide model building. With clustering models, the data density itself drives the process to a final solution.
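As a minimal sketch of clustering without a target variable, the following uses k-means, a common centroid-based algorithm chosen here only for illustration; it is not necessarily the technique contemplated above. The function name `kmeans`, the sample points, and the choice of initial centroids are assumptions. Note that no known result guides the process: the groupings emerge solely from the distances among the records.

```python
import math


def kmeans(points, initial_centroids, max_iter=100):
    """Group points around centroids; there is no target variable to guide the process."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # the solution has stabilized
        centroids = new_centroids
    return labels, centroids


# Illustrative records with two natural groupings (invented for this sketch).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[3]])
# labels -> [0, 0, 0, 1, 1, 1]
```

The resulting labels could then serve as the homogeneous groups on which separate predictive models are built.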
Typically, conventional data mining systems work in conjunction with a database management system, transferring data to be mined from the database management system to the data mining system for processing. Because of this data transfer, these systems tend to provide inadequate performance for large datasets. In addition, a wide variety of models typically must be generated to meet specific but widely differing needs throughout an enterprise. A typical enterprise has a variety of different databases from which data is drawn in order to build the models. Current systems do not provide adequate integration with the various databases throughout the enterprise. Likewise, current systems provide limited flexibility in specifying and adjusting the data mining to be performed to meet specific needs. Furthermore, a high level of expertise is typically required of a data mining user in order to perform useful data mining work. This high expertise requirement has led to a slow rate of adoption of data mining technology, as well as increased development times and costs for those who have adopted it.
A need arises for a technique by which cluster analysis may be performed that provides improved performance in model building and data mining, good integration with the various databases throughout the enterprise, and flexible specification and adjustment of the models being built, while also providing data mining functionality that is accessible to users having limited data mining expertise and reducing development times and costs for data mining projects.