This description relates to data clustering, segmentation, and parallelization.
Data clustering is a method whereby information that is substantially similar is labeled with a shared identifier so that it may later be processed as if the information had been grouped together in a common location. The information can be of various types, such as financial data or health care records. Each cluster (among a set of multiple clusters) includes units of data (e.g., documents, database records, or other data objects) that have been determined to meet some similarity criterion. Some techniques are “off-line” techniques that process units of data as a batch to generate clusters or add to existing clusters. Other techniques are “on-line” techniques that process units of data incrementally as they are received. Clusters can be hierarchical, where a given cluster at one level is itself divided into multiple clusters at another level. In some cases, the clusters correspond to a partitioning of the data units in which each data unit is in exactly one of the clusters; in other cases, the clusters may overlap, with a data unit being a member of more than one cluster.
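As an illustration only, the labeling of similar data units with a shared identifier can be sketched as follows. This is a minimal sketch of an “on-line” style technique under stated assumptions, not a description of any particular method: the names `similarity`, `OnlineClusterer`, and `label_records`, the use of string records, the choice of Python's `difflib.SequenceMatcher` as the similarity criterion, and the 0.8 threshold are all hypothetical choices made for the example. Each data unit is placed in exactly one cluster, so the result is a partitioning rather than a set of overlapping clusters.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a score in [0, 1]; an assumed, case-insensitive string criterion."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


class OnlineClusterer:
    """Assigns each incoming record a cluster identifier as it is received."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold      # assumed similarity cutoff
        self.representatives = []       # one record per cluster; index = cluster id

    def add(self, record):
        # Find the existing cluster whose representative is most similar.
        best_id, best_score = -1, 0.0
        for cluster_id, representative in enumerate(self.representatives):
            score = similarity(record, representative)
            if score > best_score:
                best_id, best_score = cluster_id, score
        # Join that cluster if the criterion is met; otherwise start a new one.
        if best_score >= self.threshold:
            return best_id
        self.representatives.append(record)
        return len(self.representatives) - 1


def label_records(records, threshold=0.8):
    """Feed records through the clusterer one at a time; return (id, record) pairs."""
    clusterer = OnlineClusterer(threshold)
    return [(clusterer.add(record), record) for record in records]


if __name__ == "__main__":
    for cluster_id, record in label_records(
        ["John Smith", "Jon Smith", "Acme Corp", "ACME Corp.", "John Smith"]
    ):
        print(cluster_id, record)
```

Because the records are only labeled and never moved, downstream processing can later select all records that share an identifier and treat them as a group. An overlapping variant could instead return every cluster identifier whose score meets the threshold.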