Data clustering partitions a dataset into subsets of similar data. Typically, an operation in the data clustering process, such as a distance calculation, computes a metric of similarity between one dataset or element and another. Unsupervised data clustering allows certain kinds of parallelizable problems involving large datasets to be solved using computing clusters without the associated complexities of data dependency, mutual exclusion, replication, and reliability. Data clustering techniques can be applied to many problems. For example, clustering patient records can identify health care trends, clustering address lists can identify duplicate entries, and clustering documents can identify hierarchical organizations of information.
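As an illustration of the distance calculation mentioned above, the following is a minimal sketch, assuming that elements are represented as numeric vectors and that Euclidean distance serves as the similarity metric. Neither assumption comes from the text, and the function names `euclidean_distance` and `pairwise_distances` are hypothetical.

```python
import math
from itertools import combinations


def euclidean_distance(a, b):
    """Return the Euclidean distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_distances(points):
    """Map each unordered index pair (i, j), i < j, to the distance between the two points."""
    return {
        (i, j): euclidean_distance(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
    }


# Illustrative data only: three points on a line in the plane.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
distances = pairwise_distances(points)
```

A smaller distance indicates greater similarity under this metric. Comparing every pair in this way requires n(n-1)/2 distance calculations for n elements, so the work grows quadratically with the size of the dataset.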
However, data clustering can be computationally expensive, and with the continuing dramatic increase in accessible data, its computational requirements are expected to become increasingly challenging. Even with an aggressive application of parallel computing and distributed storage, the sheer volume of data submitted to data clustering processes can be prohibitive.