As computer usage continues to proliferate, the volume of data collected also grows. For example, social activities (e.g., telephone, email, web browsing) and economic activities (e.g., shopping, stocks, bank transactions) generate enormous datasets that potentially contain latent information of significance to economics, sociology, business, and national security. The World Wide Web is an example of a dataset whose very existence creates significant business opportunities.
The analysis of very large datasets is becoming a central problem in computing. Storage and analysis of large datasets drive a large and growing segment of the computer hardware industry. To analyze large datasets, the data is often arranged in very large graphs, and features of these graphs are then isolated. For example, clusters of tightly connected vertices that are somewhat isolated from the remainder of the graph may be found. In general, clustering is the problem of grouping similar objects together while keeping dissimilar objects apart. Clustering is a fundamental tool for finding useful information latent in very large datasets.
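As an illustration of what "tightly connected yet somewhat isolated" can mean in a graph, the following minimal Python sketch counts the edges inside a candidate vertex set and the edges crossing its boundary. The function name `edge_counts`, the edge-list representation, and the toy graph are assumptions made for illustration only and are not taken from this description.

```python
def edge_counts(edges, candidate):
    """Return (internal, external) edge counts for a candidate vertex set.

    internal: edges with both endpoints inside the candidate set
    external: edges with exactly one endpoint inside the candidate set
    """
    candidate = set(candidate)
    internal = 0
    external = 0
    for u, v in edges:
        u_in = u in candidate
        v_in = v in candidate
        if u_in and v_in:
            internal += 1
        elif u_in or v_in:
            external += 1
    return internal, external


# Hypothetical toy graph: a triangle {a, b, c} joined by one edge to a path d - e.
EDGES = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]

internal, external = edge_counts(EDGES, {"a", "b", "c"})
print(internal, external)
```

In this toy graph the triangle has many internal edges relative to its size and few edges leaving it, which is the intuitive signature of a cluster described above.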
Many prior solutions provide ways of clustering large datasets. However, these prior solutions suffer from various drawbacks. For instance, one could examine every subset of the set of objects and check whether the examined subset is a cluster according to some clustering criteria, but this would be prohibitively expensive except for very small datasets. In particular, a set of n objects has 2^n subsets, so for large datasets the number of operations grows exponentially and is far too computationally intensive to be practical. Another prior solution approaches clustering by separating the graph into multiple parts and cutting the edges that cross between the parts. This solution does not determine the number of clusters that exist; instead, it requires a user to input the number of clusters desired, thereby potentially distorting the results. Moreover, this solution assumes that every object is in exactly one cluster, which may not be a reasonable assumption because some objects may be in multiple clusters and some objects may not be in any cluster. Yet another solution considers only the density of internal connectivity of an identified cluster without considering the sparseness of external connectivity, thereby unnecessarily and potentially detrimentally limiting the identification of clusters.
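The cost of the exhaustive approach mentioned above can be made concrete with a short Python sketch. The clustering criterion used here (a minimum internal edge density together with a maximum number of external edges) and the names `is_cluster`, `brute_force_clusters`, `min_density`, and `max_external` are hypothetical and chosen only for illustration; they are not the criteria of any particular prior solution.

```python
from itertools import combinations


def is_cluster(edges, subset, min_density=1.0, max_external=1):
    """Hypothetical criterion: dense inside, sparse across the boundary."""
    members = set(subset)
    internal = sum(1 for u, v in edges if u in members and v in members)
    external = sum(1 for u, v in edges if (u in members) != (v in members))
    possible = len(members) * (len(members) - 1) // 2  # maximum internal edges
    return internal / possible >= min_density and external <= max_external


def brute_force_clusters(vertices, edges):
    """Examine every subset of size >= 2 and keep those passing is_cluster."""
    found = []
    examined = 0
    ordered = sorted(vertices)  # fixed order so results are reproducible
    for size in range(2, len(ordered) + 1):
        for subset in combinations(ordered, size):
            examined += 1
            if is_cluster(edges, subset):
                found.append(set(subset))
    return found, examined


# Hypothetical toy graph: a triangle {a, b, c} joined by one edge to a path d - e.
VERTICES = ["a", "b", "c", "d", "e"]
EDGES = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]

found, examined = brute_force_clusters(VERTICES, EDGES)
print(found, examined)
```

For n vertices this loop examines 2^n - n - 1 subsets, which is 26 for the five-vertex toy graph but already more than 10^12 for n = 40, which illustrates why exhaustive enumeration is impractical for large datasets.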
The drawings referred to in this description should not be understood as being drawn to scale except if specifically noted.