Most businesses now rely on efficient and accurate storage, retrieval, processing, and analysis of datasets. These datasets represent information about customers, business opportunities, business risks, liabilities, transactions, employees, locations, phone calls, emails, text messages, social networks, or any other information about person(s), place(s), thing(s), or event(s) of concern to the business.
Datasets may be represented in a graph as items or nodes of information. Some of these nodes may be related to other nodes, and these relationships between nodes may be represented as connections or edges between the related nodes. Datasets that are represented by graphs may be stored in any data structure, including but not limited to tables, arrays, linked lists, feature vectors, trees or other hierarchies, matrices, structured or unstructured documents, or other data objects.
An example dataset may be a log of phone calls or emails over a given period of time, and the dataset may be represented in a graph as nodes of phone numbers or email addresses that are connected via graph edges to each other. For phone companies with millions of customers, the number of nodes and edges in this graph may be massive. Similarly, logs of posts or messages between friends in social networks may be represented in a graph as nodes of contacts that are connected via graph edges to each other. For large social networks, the number of nodes and edges in this graph may likewise be massive.
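As an illustration of the representation described above, the following minimal sketch builds an undirected graph from a small call log. The function name `build_graph`, the (caller, callee) record layout, and the sample phone numbers are hypothetical and chosen only for this example; an adjacency map is just one of the many data structures that could hold such a graph.

```python
from collections import defaultdict


def build_graph(call_log):
    """Return an undirected adjacency map: node -> set of neighboring nodes.

    Assumes each log record is a (caller, callee) pair; this layout is
    hypothetical and used only for illustration.
    """
    graph = defaultdict(set)
    for caller, callee in call_log:
        # Each call relates two phone numbers, so add the edge in both directions.
        graph[caller].add(callee)
        graph[callee].add(caller)
    return graph


# Hypothetical sample log; the third record repeats an earlier call.
call_log = [
    ("555-0100", "555-0101"),
    ("555-0101", "555-0102"),
    ("555-0100", "555-0101"),
    ("555-0103", "555-0104"),
]

graph = build_graph(call_log)
num_nodes = len(graph)
# Every undirected edge is stored twice, once per endpoint.
num_edges = sum(len(neighbors) for neighbors in graph.values()) // 2
```

Because neighbors are kept in sets, repeated calls between the same two numbers collapse into a single edge. A real call log might instead keep a count or weight per edge.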
Although simple datasets having few items may be visualized and readily understood by human analysts, complex datasets having many items often require processing and computational analysis before such datasets are meaningful to human analysts or even to many software applications. Clustering techniques may simplify complex datasets into clusters to support analysis of the datasets. Clusters are subsets of nodes in the graph that are related to each other. In some examples, a cluster is a network of nodes that are connected to each other, directly or indirectly, by edges. Many clustering techniques attempt to evaluate entire datasets to find optimal partitions based on global criteria without the ability to break up this evaluation into smaller manageable operations. Such techniques may consider all edges and all nodes in a graph before making any clustering determinations and, accordingly, may provide excellent results for small datasets. However, such techniques are not practical for massive datasets, such as datasets where the number of desired clusters is in the millions. The computational and storage requirements for evaluating an entire dataset to find partitions based on global criteria are highly dependent on the size of the dataset, and such techniques therefore scale poorly.
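The notion of a cluster as a network of directly or indirectly connected nodes can be sketched as a connected-components search. The following is a minimal illustration under stated assumptions, not a technique taken from any particular system: the function name `connected_clusters` and the sample graph are hypothetical, and the graph is assumed to be an in-memory adjacency map.

```python
from collections import deque


def connected_clusters(graph):
    """Group nodes into clusters of directly or indirectly connected nodes.

    `graph` is assumed to be an adjacency map (node -> set of neighbors)
    describing an undirected graph that fits in memory.
    """
    seen = set()
    clusters = []
    for start in graph:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        seen.add(start)
        queue = deque([start])
        cluster = set()
        while queue:
            node = queue.popleft()
            cluster.add(node)
            for neighbor in graph.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        clusters.append(cluster)
    return clusters


# Hypothetical graph: two connected groups and one isolated node.
example_graph = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B"},
    "D": {"E"},
    "E": {"D"},
    "F": set(),
}

clusters = connected_clusters(example_graph)
```

This sketch visits every node and every edge and keeps the whole graph and the visited set in memory, so its time and storage costs grow with the size of the dataset. That is manageable for small graphs but reflects the scaling concern described above for massive ones.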
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.