Data clustering refers to the act of creating relationships between like data elements. When clustering data elements, such as text documents, the subject matter of the documents may be the basis for grouping decisions. Creating a cluster of like documents is helpful in many data management scenarios, such as document production or data mining.
Data clustering is often performed on large, high-dimensional datasets, which require significant processing time to accurately cluster data elements. Within conventional data clustering systems, each data element is converted into a numerical value that uniquely identifies the data element.
According to a conventional data clustering system, such as k-Medoid Clustering, the data elements are grouped based on the relative distances between each numerical value. In such a clustering system, a plurality of medoids, or cluster points, are selected and each of the data elements is associated with the nearest medoid. A distance metric (such as cosine, Euclidean or Hamming distance) is used to determine the distance between a data element and each medoid. Conventional data clustering systems may optimize the data cluster by adjusting the location of the medoid to determine if an alternative location could create a more efficient data cluster. However, the process of calculating the distance between a data element's numerical value and relevant medoids requires significant processing resources and results in delays when clustering high-dimensional datasets. In particular, conventional data clustering systems experience delays when clustering high-dimensional datasets that include text documents, audio files, video files, or image files.
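The conventional approach described above can be illustrated with a minimal sketch, assuming a small in-memory dataset and Euclidean distance. The function names (`euclidean`, `assign`, `total_cost`, `k_medoids`) and the parameters `seed` and `max_iter` are hypothetical and chosen only for illustration; the greedy swap search shown here is one simple way to adjust medoid locations and is not asserted to be the procedure of any particular system. Note that every candidate swap recomputes the distance from every data element to every medoid, which is the cost that becomes prohibitive for large high-dimensional datasets.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(points, medoids, dist):
    """For each point, return the index of its nearest medoid."""
    return [min(medoids, key=lambda m: dist(p, points[m])) for p in points]


def total_cost(points, medoids, dist):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def k_medoids(points, k, dist=euclidean, seed=0, max_iter=100):
    """Illustrative k-medoid clustering using a greedy swap search.

    Medoids are stored as indices into `points`. `seed` and `max_iter`
    are assumptions added for reproducibility and termination.
    """
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)
    cost = total_cost(points, medoids, dist)
    for _ in range(max_iter):
        improved = False
        for pos in range(k):
            for cand in range(len(points)):
                if cand in medoids:
                    continue
                # Try moving one medoid to an alternative location.
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                trial_cost = total_cost(points, trial, dist)
                if trial_cost < cost:
                    medoids, cost, improved = trial, trial_cost, True
        if not improved:
            break  # no single swap lowers the total distance
    return medoids, assign(points, medoids, dist), cost


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    medoids, labels, cost = k_medoids(data, k=2)
    print("medoid indices:", sorted(medoids))
    print("nearest medoid per point:", labels)
    print("total distance:", cost)
```

Because `dist` is passed as a parameter, a cosine or Hamming distance function could be substituted for `euclidean` without changing the rest of the sketch.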
For example, a conventional data clustering system may be used to cluster text documents in support of a document production request within the discovery phase of litigation. Such a document production request could require the review of hundreds of thousands of documents. Clustering documents based on their subject matter could help identify groups of likely relevant documents. However, given the large number of documents at issue in many document production requests, conventional data clustering systems cannot effectively cluster the documents and, as a result, document clustering is often not utilized as a tool when responding to a document production request.
As a result, there is a need in the art for a method and system to more efficiently cluster high-dimensional data.