Various types of database technologies exist, including relational database technologies, hierarchical database technologies, and other types of database technologies. A relational database includes a set of inter-related tables that contain rows and columns. One application of database systems is data warehousing, in which data from various sources is collected and stored in a data warehouse. The amount of data that can be stored in a data warehouse can be immense.
To better understand the data contained in a data warehouse or other database, data mining is performed with respect to the data warehouse or database. As part of data mining, automated statistical analysis is often performed. One of the tasks performed in statistical analysis is clustering, which is used for segmentation, classification, and anomaly detection of data in a data warehouse or other database.
During clustering, a data set is partitioned into disjoint groups such that points in the same group are similar to each other according to some similarity metric. A widely used clustering technique is K-means clustering. Clustering can be performed on numeric data or categorical data. Numeric data refers to data that can be assigned a metric measure, such as height, scale, volume, and so forth. Categorical data is data that takes one of a finite number of values that are not represented by a measure. Examples of categorical data include city, state, gender, and so forth.
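For illustration only, the partitioning described above can be sketched in a general-purpose language outside of any database system. The following is a minimal sketch of the standard K-means iteration (often called Lloyd's algorithm) for numeric data, assuming Euclidean distance as the similarity metric and random selection of initial centroids. The function name kmeans and the parameters iterations and seed are hypothetical and are introduced here only for the example.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Partition numeric points into k disjoint clusters.

    Illustrative sketch only: Euclidean distance is assumed as the
    similarity metric, and initial centroids are sampled at random.
    """
    rng = random.Random(seed)
    # Assumption: start from k distinct input points chosen at random.
    centroids = rng.sample(points, k)
    assignments = [None] * len(points)

    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stop once no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments

        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))

    return assignments, centroids


# Two well-separated groups of two-dimensional points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignments, centroids = kmeans(points, k=2)
```

In this example, the first three points are expected to fall into one cluster and the last three into the other, because every point is far closer to the members of its own group than to the members of the other group. Which group receives which cluster index depends on the random initialization.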
Normally, K-means clustering algorithms are relatively difficult to implement in database management systems. A programmer who develops code for clustering algorithms typically has to address issues such as storage management, concurrent access, memory leaks, false alarms, security concerns, and so forth. Such complexity results in lengthy development times for clustering algorithm code. Also, it is usually quite difficult to implement K-means clustering in a database system using generic programming languages such as C++ or Java.