Clustering is a class of data analysis techniques widely used in the field of computational data science, with application to problems in news search, genomics, epidemiology, web analytics, business, econometrics, demographics, ecological dynamics, seismology, meteorology, astronomy, particle physics, and other domains (see Jain A K (2010), “Data clustering: 50 years beyond K-Means,” Pattern Recog. Lett. 31(8):651-666). With increasing data capacities and speeds in computing, technologists seek to perform clustering on ever-larger “big data” sets.
Clustering refers to assigning data items into groups (“clusters”) based on factors such as data value similarity, data set divisibility, data set density, and application-specific requirements (see Xu D, Tian Y (2015), “A comprehensive survey of clustering algorithms,” Annals of Data Science, 2(2):165-193). In addition, clustering typically involves retrieval of the assigned groupings—given a data item, output the other data items with which it is grouped.
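As an illustration of the retrieval step described above, the following minimal Python sketch stores cluster assignments in two dictionaries and returns the other members of a given item's cluster. The names `ClusterIndex`, `assign`, and `co_members` are hypothetical, chosen only for this example, and are not drawn from any cited work.

```python
from collections import defaultdict


class ClusterIndex:
    """Hypothetical container mapping items to clusters and back."""

    def __init__(self):
        self._cluster_of = {}              # item -> cluster id
        self._members = defaultdict(set)   # cluster id -> set of items

    def assign(self, item, cluster_id):
        # Move the item out of any previous cluster before re-assigning it.
        old = self._cluster_of.get(item)
        if old is not None:
            self._members[old].discard(item)
        self._cluster_of[item] = cluster_id
        self._members[cluster_id].add(item)

    def co_members(self, item):
        # Retrieval: given a data item, return the other items in its cluster.
        cluster_id = self._cluster_of[item]
        return self._members[cluster_id] - {item}
```

With this layout, both assignment and lookup of the cluster identifier are dictionary operations; the cost of deciding which cluster an item belongs to is a separate matter and is not modeled here.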
Similarity clustering entails comparing data items to each other along one or more dimensions, and possibly assigning similar data items to the same group. It is impractical for individuals to perform clustering manually on data sets with more than a few hundred items; beyond that number, computers are de facto required. Clustering has become necessarily rooted in computer technology.
With large data sets, similarity computations can become slow and expensive, as each data item is compared to a large number of other data items. The time complexity of similarity clustering has been viewed as fundamentally O(n^2) (quadratic in the number of data items) in methods where the number of clusters may grow. Other methods (e.g., k-means clustering) cap the number of clusters at a constant, k, which leads to O(nk) time complexity, but at the cost of generally inferior clustering (see Steinbach M, Karypis G, Kumar V (2000), “A comparison of document clustering techniques,” Proc. Workshop Text Mining, 6th ACM SIGKDD Int. Conf. Data Mining, KDD-2000).
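The difference between the two growth rates can be made concrete by counting comparisons. The Python sketch below is illustrative only: it assumes items are sets compared by Jaccard similarity (an assumption not made in the text above), and the function names are hypothetical. The second function performs a single assignment pass against k fixed centers, not a full k-means procedure.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure for this sketch)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def all_pairs_similar(items, threshold):
    """Compare every pair of items: n*(n-1)/2 comparisons, i.e. quadratic."""
    similar, comparisons = [], 0
    for (i, a), (j, b) in combinations(enumerate(items), 2):
        comparisons += 1
        if jaccard(a, b) >= threshold:
            similar.append((i, j))
    return similar, comparisons


def assign_to_fixed_centers(items, centers):
    """Compare each item to k fixed centers: n*k comparisons."""
    labels, comparisons = [], 0
    for item in items:
        scores = []
        for center in centers:
            comparisons += 1
            scores.append(jaccard(item, center))
        labels.append(scores.index(max(scores)))  # index of the closest center
    return labels, comparisons
```

For n items, the first function always performs n(n-1)/2 comparisons, while the second performs n times k, which is linear in n only because k is held constant.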
Throughout the computer era, improving the time efficiency of clustering has been a subject of intensive and voluminous research. The earliest computational algorithms for clustering date from the late 1950s and early 1960s (e.g., Ward J H (1963), “Hierarchical grouping to optimize an objective function,” J. Amer. Statistical Assoc. 58(301):236-244). Many methods for data clustering are currently in use and are well known in the art. To reduce or to work around the high computational cost of clustering, methods have been developed that use partitioning, filtering, probabilistic calculations, hierarchical calculations, parallel processing, and other approaches (see Jain, 2010). Research and development on clustering is active and ongoing (e.g., Deolalikar V, Laffitte H (2015), “Adaptive hierarchical clustering algorithm,” U.S. Pat. No. 9,020,271; Dykstra A J, Chakravarthy D, Dai S (2016), “Centroid detection for clustering,” U.S. Pat. No. 9,280,593; Heit J, Dey S, Srinivasan S (2015), “System and method for clustering data in input and output spaces,” U.S. Pat. No. 9,116,974).
Current similarity clustering methods have the characteristic that the amount of computational work required for each additional data item increases as the data set grows. Even with aggressive techniques such as parallelization, measuring similarity between the items in a large data set can require a prohibitive amount of computation. This technical problem limits the quality and applicability of similarity clustering.
It would be ideal to find a similarity clustering method with O(n) (linear in the number of data items) time complexity—i.e., constant time per item, irrespective of the number of items or number of clusters. Such a method would expand the benefits of similarity clustering to much larger data sets.
Despite their utility, current clustering techniques remain subject to performance tradeoffs. Similarity clustering in linear or near-linear time can be obtained via probabilistic clustering algorithms, but at the cost of admitting errors in retrieval, such as false negatives, in which the algorithm may (with small probability) erroneously omit certain cluster members during cluster retrieval. Probabilistic clustering algorithms can also produce false-positive errors; these can be screened out by a post-clustering check of actual similarity between each item and one or more members of its purported cluster.
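The post-clustering check for false positives can be sketched as follows. This is a minimal Python illustration that assumes set-valued items and Jaccard similarity; the function name `screen_false_positives` is hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure for this sketch)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def screen_false_positives(query, candidates, threshold):
    """Keep only candidates whose actual similarity to the query meets the threshold.

    `candidates` stands for the purported cluster members returned by some
    probabilistic method; that method itself is not modeled here.
    """
    return [c for c in candidates if jaccard(query, c) >= threshold]
```

A check of this kind removes false positives only. It cannot restore false negatives, because an omitted cluster member is never presented as a candidate in the first place.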
Some applications require or prefer an error-free, or exact, clustering method rather than a probabilistic, or approximate, one. If the cost of a false negative or false positive error is high, it may be impossible or infeasible to raise the approximation tolerance threshold of a probabilistic clustering algorithm sufficiently, within the performance requirements of the application.
For example, using a similarity threshold of 0.2, the probabilistic locality sensitive hashing algorithm for MinHash signatures (Wang J, Shen H T, Song J, Ji J (2014), “Hashing for similarity search: A survey,” ArXiv 1408.2927 v1:1-29) would require over 14,000 hash computations per data item to obtain a false-negative error rate of 1%. Reducing the false-negative error rate to 0.000001% (which admits approximately one false negative in every 100,000,000 data items) would require over 57,000 hash computations per data item (see Leskovec J, Rajaraman A, Ullman J (2014), “Finding similar items,” Mining of Massive Data Sets, 2nd Edition, chapter 3, Cambridge University Press). Probabilistic clustering algorithms provide no guarantee of freedom from retrieval errors.
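The order of magnitude of these figures can be checked with the standard banding analysis of locality sensitive hashing for MinHash, in which a pair of items with similarity s becomes a candidate with probability 1 - (1 - s^r)^b for b bands of r rows each. The Python sketch below rests on assumptions that the text above does not state: r = 5 rows per band, and one hash computation counted per band. The function names are hypothetical.

```python
import math


def candidate_probability(s, rows, bands):
    """Probability that a pair with similarity s collides in at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands


def bands_needed(s, rows, max_false_negative):
    """Smallest number of bands b such that (1 - s**rows)**b <= max_false_negative."""
    # log1p(-x) computes log(1 - x) accurately for small x.
    return math.ceil(math.log(max_false_negative) / math.log1p(-(s ** rows)))


# Assumed parameters: similarity 0.2 and r = 5 rows per band.
bands_1_percent = bands_needed(0.2, 5, 0.01)
bands_1_in_100m = bands_needed(0.2, 5, 1e-8)
```

Under these assumptions, the band count comes out a little over 14,000 for a 1% false-negative rate and a little over 57,000 for a rate of one in 100,000,000. The false-negative rate shrinks only geometrically in the number of bands, so no finite number of bands reduces it to zero.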
Previous efforts by a large, global community of skilled data scientists, statisticians, and computer scientists have produced clustering algorithms that have supra-linear time complexity, or are probabilistic rather than error-free, but have failed to yield an O(n) clustering method that is guaranteed to be free of retrieval errors. Indeed, key disclosures at the forefront of research and development on this problem and on related problems teach away from O(n) clustering with error-free retrieval, commonly supposing supra-linear growth in memory usage (see Zhang X, Qin J, Wang W, Sun Y, Lu J (2013), “HmSearch: An efficient Hamming distance query processing algorithm,” Proc. 25th Int. Conf. Sci. and Stat. Database Management 19:1-12), potential limitations on scalability (see Arasu A, Ganti V, Shriraghav K (2006), “Efficient exact-set similarity joins,” Proc. 32nd Int. Conf. Very Large Databases, 918-929), restrictions on the degree of similarity within a cluster (see Xiao C, Wang W, Lin X, Yu J X, Wang G (2011), “Efficient similarity joins for near duplicate detection,” ACM Trans. Database Systems 36(3):15.1-15.41), or restrictions on the number of symbol elements in the data universe (see Zhang et al., 2013).
A linear-time clustering method would imply that the time to cluster each data item is upper-bounded by a constant, and hence does not grow with the number of clustered data items. On the surface, a linear-time clustering method with error-free retrieval would appear to be logically impossible, as it seemingly could not perform enough similarity comparisons on each data item.
In summary, there have been more than 50 years (see Jain, 2010; Jain A K, Murty M N, Flynn P J (1999), “Data clustering: A review,” ACM Computing Surveys, 31(3):264-323; Xu & Tian, 2015; Xu R, Wunsch D (2005), “Survey of clustering algorithms,” IEEE Trans. Neural Networks, 16(3):645-678) of active research and development on clustering methods by experts in many disciplines. A linear-time clustering method with error-free retrieval would be highly desirable and useful. The seeming illogic of such a method (no growth in comparison time per element) has created a bias in the art, leading researchers to avoid deeply investigating the possibility of linear-time clustering with error-free retrieval.