Data mining is a technique by which hidden patterns may be found in a group of data. True data mining does not merely change the presentation of data, but actually discovers previously unknown relationships among the data. Data mining is typically implemented as software in, or in association with, database systems. Data mining includes several major steps. First, data mining models are generated based on one or more data analysis algorithms. Initially, the models are “untrained”; they are then “trained” by processing training data and generating information that defines the model. The generated information is then deployed for use in data mining, for example, to provide predictions of future behavior based on specific past behavior.
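By way of illustration only, the train-then-deploy sequence described above might be sketched as follows. The sketch assumes a deliberately simple model, namely a lookup table of the most frequent outcome observed for each past behavior. The function names `build_model` and `apply_model` and the sample records are hypothetical and are not taken from any particular data mining system.

```python
from collections import Counter, defaultdict


def build_model(training_data):
    """Train the model from (past_behavior, outcome) records."""
    counts = defaultdict(Counter)
    for behavior, outcome in training_data:
        counts[behavior][outcome] += 1
    # The returned table is the "generated information" that defines the
    # trained model: the most frequent outcome seen for each behavior.
    return {
        behavior: outcomes.most_common(1)[0][0]
        for behavior, outcomes in counts.items()
    }


def apply_model(model, behavior, default=None):
    """Deploy the model: predict an outcome for a given past behavior."""
    return model.get(behavior, default)


# Hypothetical training records, for illustration only.
training_data = [
    ("viewed_pricing", "purchased"),
    ("viewed_pricing", "purchased"),
    ("viewed_pricing", "left"),
    ("viewed_faq", "left"),
]

model = build_model(training_data)
print(apply_model(model, "viewed_pricing"))  # prints "purchased"
```

The model is untrained until `build_model` has processed the training data. After that, only the returned table is needed to make predictions, which is why the trained information can be deployed separately from the data used to build it.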
Data mining is a compute-intensive and complex task. Enterprise data mining, that is, data mining that is performed using all or substantial portions of the data generated by an enterprise, requires the mining of very large datasets. Such datasets may include millions of records, and it may take hours or even days to build a single model based on such a dataset. Clustering models are an important family of machine learning models that are computationally expensive to build when large datasets are used. Clustering is the process of grouping data into classes or clusters. A cluster is a collection of data objects that are similar to one another and dissimilar to the objects in other clusters. In many applications, a cluster of data objects can be treated collectively as one group.
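As a minimal sketch of what grouping data into clusters can involve, the following example uses k-means (Lloyd's algorithm). This algorithm is chosen here purely for illustration; the text above does not name a specific clustering algorithm. The function name `kmeans`, the seeding of centroids from the first `k` points, and the sample points are all assumptions made for the example.

```python
import math


def kmeans(points, k, iterations=10):
    """Group points into k clusters using Lloyd's algorithm (illustrative)."""
    # Assumption: seed the centroids with the first k points so that the
    # example is deterministic. Practical systems usually seed more carefully.
    centroids = list(points[:k])
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, assignments


# Hypothetical two-dimensional records, for illustration only.
points = [
    (1.0, 1.0), (9.0, 9.0), (1.5, 2.0),
    (8.0, 8.5), (1.0, 0.5), (9.5, 8.0),
]

centroids, assignments = kmeans(points, k=2)
print(assignments)  # prints [0, 1, 0, 1, 0, 1]
```

Points that share a cluster index are similar to one another and can be handled as one group. Each iteration of this sketch compares every record against every centroid, so the work per pass grows with the number of records, which suggests why building such models on datasets of millions of records can be expensive.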
Problems arise when attempts are made to utilize current data mining systems to perform enterprise data mining. Current systems that perform clustering analysis tend to provide inadequate performance for large datasets and, in particular, do not provide scalable performance. As a result, building a single model can take hours or even days. In the context of enterprise data mining, a wide variety of models must be generated to meet specific but widely differing needs throughout the enterprise. A typical enterprise has a variety of different databases from which data is drawn in order to build the models. Current systems do not provide adequate integration with the various databases throughout the enterprise. In addition, current systems provide limited flexibility in terms of specifying and adjusting the model being built to meet specific needs. Finally, the various models that are built must be arranged so as to operate properly on the particular system within the enterprise for which the models were built, yet current systems provide limited model arrangement and export capability.
A need arises for a technique for performing cluster analysis that provides improved model-building performance, good integration with the various databases throughout the enterprise, flexible specification and adjustment of the models being built, and flexible model arrangement and export capability.