The challenge of data clustering, that is, constructing semantically meaningful groups of data instances, has been a focus of the information technology (IT) field for some time. Accordingly, a number of methods for data clustering have been developed. One dilemma surrounding existing methods is the tradeoff between effectiveness and efficiency, or scalability. The enormous amount and dimensionality of data processed by modern data mining tools call for unsupervised learning techniques that are both effective and scalable. However, most clustering algorithms in the art are either effective or scalable, but not both. In other words, these methods either provide fairly powerful learning capabilities but are too resource-intensive for large or high-dimensional datasets, or they are usable on large datasets but produce low-quality results.
Modern resources for the generation, accumulation, and storage of data have made gigabyte- and terabyte-scale datasets increasingly common. Because mining datasets of this magnitude can consume considerable time and processing power, data mining practitioners often resort to simpler methods in the interest of feasibility. However, such an approach sacrifices mining power and may provide unsatisfactory results. Furthermore, for very large and/or complex datasets, even simple methods may not be feasible. Consider, for example, the problem of clustering one million data instances using a simple online clustering algorithm: first initialize n clusters with one data point each, then iteratively assign each of the remaining points to its closest cluster (in Euclidean space). Even for small values of n (e.g., n = 1000), such an algorithm may run for hours on a modern personal computer (PC). The results, however, would be quite unsatisfactory, especially if the data points are 100,000-dimensional vectors.
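The simple online clustering algorithm described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not a definitive implementation: the function name `online_cluster` is hypothetical, the first n points are assumed to serve as the initial clusters, and each cluster is assumed to be represented by the running mean of its members (the description above does not specify how a cluster's position is maintained).

```python
import math
import random


def online_cluster(points, n_clusters):
    """Seed n_clusters with the first points, then assign the rest one by one."""
    if not 0 < n_clusters <= len(points):
        raise ValueError("n_clusters must be between 1 and len(points)")

    # Each cluster starts with exactly one data point, which is also its centroid.
    centroids = [list(p) for p in points[:n_clusters]]
    sizes = [1] * n_clusters
    labels = list(range(n_clusters))

    for point in points[n_clusters:]:
        # Closest centroid in Euclidean space: one distance per cluster,
        # each costing time proportional to the dimensionality.
        best = min(range(n_clusters), key=lambda k: math.dist(point, centroids[k]))
        labels.append(best)

        # Assumption: the centroid is kept as the running mean of its members.
        sizes[best] += 1
        centroids[best] = [
            c + (x - c) / sizes[best] for c, x in zip(centroids[best], point)
        ]

    return labels, centroids


if __name__ == "__main__":
    rng = random.Random(0)
    # Two well-separated 2-D groups, interleaved so that each of the two
    # initial clusters is seeded from a different group.
    data = []
    for _ in range(50):
        data.append((rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)))
        data.append((rng.gauss(5.0, 0.1), rng.gauss(5.0, 0.1)))
    labels, centroids = online_cluster(data, n_clusters=2)
    print(labels[:6], centroids)
```

Every assignment compares the incoming point against all n centroids across every dimension. For the figures given above (one million points, n = 1000, 100,000 dimensions), and assuming densely stored vectors, that amounts to roughly 10^6 x 10^3 x 10^5 = 10^14 coordinate operations, which illustrates why even this simple scheme becomes impractical at scale.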
Therefore, a number of IT fields could benefit from methods and systems of data clustering that combine a powerful learning algorithm with scalability sufficient to meet the demands of modern datasets.