A database is a collection of logically related data arranged in a predetermined format, such as tables that contain rows and columns. As storage devices and database software have improved, the capacity of database systems has increased dramatically. One application of database systems is data warehousing, in which data from various sources is collected and stored in a data warehouse. The amount of data that can be stored in a data warehouse can be immense.
Data can be input into a database system on a substantially continuous basis, that is, the input data arrives at the database system as a substantially continuous stream. One technique performed on input data to be stored in a database system is clustering. Clustering involves partitioning a data set into disjoint groups such that points in the same group are similar to each other according to some similarity metric. Clustering can be performed on numeric data or categorical data. Numeric data is data that can be assigned a metric measure, such as height, scale, volume, and so forth. Categorical data is data that takes one of a finite number of values that are not represented by a measure. Examples of categorical data include city, state, gender, and so forth.
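As a minimal sketch of the partitioning idea described above, the following Python assigns each point to the group of its closest representative, so that the resulting groups are disjoint. The function names `partition` and `simple_matching`, the sample records, and the chosen representatives are all hypothetical and serve only as illustration. Euclidean distance is assumed for numeric data and a simple count of mismatched attributes for categorical data; other similarity metrics could be substituted.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def simple_matching(a, b):
    """Count the attribute positions at which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def partition(points, representatives, dissimilarity):
    """Place each point in exactly one group: that of its closest representative."""
    groups = [[] for _ in representatives]
    for p in points:
        best = min(
            range(len(representatives)),
            key=lambda i: dissimilarity(p, representatives[i]),
        )
        groups[best].append(p)
    return groups


# Numeric data: similarity measured by Euclidean distance (an assumption).
numeric = [(1.0, 2.0), (1.5, 1.8), (8.0, 8.0), (9.0, 9.5)]
numeric_groups = partition(numeric, [(1.0, 2.0), (9.0, 9.0)], dist)

# Categorical data: similarity measured by mismatched attributes (an assumption).
categorical = [("Austin", "TX"), ("Dallas", "TX"), ("Boston", "MA"), ("Salem", "MA")]
categorical_groups = partition(
    categorical, [("Austin", "TX"), ("Boston", "MA")], simple_matching
)
```

In this sketch every point lands in exactly one group, which is the disjointness property; how the representatives are chosen is left open here.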
Clustering techniques may not always produce accurate results. For example, conventional clustering techniques may not properly handle data sets that have skewed data distributions or large numbers of “outliers,” which are data values of the data set that do not belong to any specific cluster. Conventional clustering techniques may also depend heavily on initialization, which means that improper initialization may cause the clustering technique to produce inaccurate results. In addition, conventional clustering techniques may converge to poor solutions.
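The dependence on initialization can be illustrated with a small sketch. The text above does not name a specific conventional technique, so plain k-means (Lloyd's iteration) is assumed here as a stand-in; the function name `lloyd_kmeans`, the four sample points, and both sets of starting centers are hypothetical. The quality of each result is measured by the sum of squared distances from each point to its nearest center.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def lloyd_kmeans(points, centers, max_iter=100):
    """Run plain Lloyd iterations from the given starting centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins the group of its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        new_centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break  # no center moved, so the iteration has converged
        centers = new_centers
    # Sum of squared distances from each point to its nearest center.
    sse = sum(min(dist(p, c) for c in centers) ** 2 for p in points)
    return centers, sse


# Four points at the corners of a 4-by-1 rectangle (hypothetical data).
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Starting centers near the left and right sides.
good_centers, good_sse = lloyd_kmeans(points, [(0, 0), (4, 1)])

# Starting centers stacked in the middle of the rectangle.
poor_centers, poor_sse = lloyd_kmeans(points, [(2, 0), (2, 1)])
```

With the first starting centers the iteration separates the left pair from the right pair, giving a sum of squared distances of 1.0. With the second it converges immediately to a top/bottom split with a sum of 16.0, a stable but poor solution reached from the same data.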