Database management systems are used to store data for convenient access by users. An application of database management systems is data warehousing, where data from various sources are collected and stored in a data warehouse.
To better understand the data stored in a database (such as a data warehouse), data analysis is performed on that data. Examples of data analysis include data mining, machine learning, and statistical analysis. Statistical analysis usually involves the analysis of various statistics associated with the data, such as database size, table sizes, and data distribution. A common task performed in statistical analysis is clustering, which supports tasks such as segmentation, classification, and anomaly detection of data in the database.
In performing clustering, a data set is partitioned into disjoint groups such that points in the same group are similar to each other according to some similarity metric. A widely used clustering technique is K-means clustering. Clustering can be performed on numeric data or categorical data. Numeric data refers to data that is assigned a metric measure, such as height, scale, volume, and so forth. Categorical data is data that takes one of a finite number of values not represented by a measure. Examples of categorical data include city, state, gender, and so forth.
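As an illustration of the partitioning described above, the following is a minimal sketch of K-means on numeric points. It assumes squared Euclidean distance as the similarity metric and seeds the centroids with the first k points; the function names, the seeding rule, and the sample data are illustrative assumptions, not taken from any particular system.

```python
def squared_distance(p, c):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, c))

def kmeans(points, k, iterations=10):
    """Basic K-means: alternate assignment and centroid-update steps."""
    # Assumption: seed the centroids with the first k points (deterministic).
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k), key=lambda j: squared_distance(p, centroids[j])
            )
        # Update step: each centroid becomes the mean of its member points.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical two-dimensional points forming two tight pairs.
points = [[1.0, 1.0], [5.0, 5.0], [1.2, 0.8], [5.2, 4.8]]
assignment, centroids = kmeans(points, k=2)
```

Every point receives exactly one cluster index, so the resulting groups are disjoint. The outcome depends on the seeding: a different choice of starting centroids can produce a different partition.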
Clustering involves assigning data points to respective clusters, with each point typically having multiple dimensions (or attributes). For example, a point may represent a visit by a customer to a store, with the dimensions corresponding to the items purchased by the customer during the visit.
Conventionally, clustering algorithms use input data sets that are organized as a plain file or table. Each line in the file or table contains a data point, and all points have exactly the same number of dimensions. Conventional formats for data sets used by clustering algorithms are typically inefficient for several reasons. First, data sets are usually originally stored in relational tables, so these tables must first be converted into the plain file or table format before a conventional clustering algorithm can process the data set. Such conversion can be time-consuming and wasteful of system resources.
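The conversion step can be sketched as follows, using the store-visit example. The relational layout assumed here, one row per (visit, item, quantity), and the name `to_flat_table` are hypothetical and chosen only for illustration.

```python
# Hypothetical relational rows: (visit_id, item, quantity).
rows = [
    (1, "milk", 2),
    (1, "bread", 1),
    (2, "eggs", 12),
    (3, "milk", 1),
    (3, "eggs", 6),
]

def to_flat_table(rows):
    """Pivot relational rows into one fixed-width point per visit."""
    # Every distinct item becomes a dimension shared by all points.
    items = sorted({item for _, item, _ in rows})
    column = {item: i for i, item in enumerate(items)}
    table = {}
    for visit_id, item, quantity in rows:
        # A visit starts as an all-zero point; purchased items fill in values.
        point = table.setdefault(visit_id, [0] * len(items))
        point[column[item]] += quantity
    return items, table

items, table = to_flat_table(rows)
for visit_id in sorted(table):
    print(visit_id, table[visit_id])
```

The whole relational table has to be scanned to discover the set of dimensions and again to build the points, which is the extra work the paragraph above refers to.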
Second, some dimensions of the points contained in a data set have a value of zero. Storing these zero-value dimensions in a plain file or table is inefficient in terms of both storage space and processing resources. Moreover, for a data set that has points with a relatively large number of dimensions, or for clustering that involves a large number of clusters, storage requirements can be relatively large. In such scenarios, the clustering algorithm may not work efficiently in a system that has a limited amount of memory, since the memory may not have sufficient space to store the clustering results as well as any intermediate data structures employed during clustering.
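A rough sense of these costs can be obtained by counting stored values. The sketch below uses hypothetical sample points, and the estimate of k times d centroid values assumes a K-means style algorithm that keeps one centroid per cluster; other intermediate structures would add to that figure.

```python
def dense_cells(points):
    """Number of values a flat table stores, zeros included."""
    return sum(len(p) for p in points)

def nonzero_cells(points):
    """Number of values that actually carry information."""
    return sum(1 for p in points for v in p if v != 0)

def centroid_values(k, d):
    """Values needed for k centroids of d dimensions each (assumed K-means)."""
    return k * d

# Hypothetical points in a flat, fixed-width layout.
points = [[1, 0, 2], [0, 12, 0], [0, 6, 1]]
total = dense_cells(points)
used = nonzero_cells(points)
print(f"{total - used} of {total} stored values are zeros")
print(f"centroid values for k=50, d=1000: {centroid_values(50, 1000)}")
```

The share of zeros tends to grow with the number of dimensions when each point touches only a few of them, as in the store-visit example where a customer buys a small subset of the available items.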