Graphs are a basic modeling tool to model social, communication, and information networks. A graph G(V, E) consists of a set of nodes V, and a set of edges EεV2 where each edge connects two nodes in the graph. In many applications, analysis is performed on large graphs that do not fit on one machine. Consequently, the graph is stored in several machines and mined in a distributed manner, for example by applying distributed programming tools like Map-Reduce or Hadoop. A basic analysis tool for graphs is to compute connected components of the graph. A connected component of a graph G(V,E) is a maximal set of nodes that can be reached from each other via sequences of edges of the graph. Computing connected components of graph G results in a partitioning of the nodes V into one of several clusters, where each cluster is a connected component. For example, FIG. 2 illustrates a graph G with three connected components. Connected component 205 includes nodes A, B, C, and D, connected component 210 includes nodes F, G, I, and H, and connected component 215 includes nodes J, K, L, and M. The connected components may also be referred to as a cluster of nodes.
Computing connected components in graphs is a basic tool for computing coherent clusters of nodes and also to perform hierarchical clustering. But computing clusters of nodes distributed across multiple machines can be time and cost prohibitive as the running time of the hashing functions are dependent on the size of the graph, the number of messages sent between machines during the rounds of Map-Reduce, and the number of rounds of Map-Reduce performed. It is a challenge is to compute connected components for a large graph in a small number of rounds of Map-Reduce.