In many scientific and business applications, the underlying data can be represented as a graph data structure G that includes nodes or vertices V[1 . . . n] connected by edges E[1 . . . m]. For example, an application that analyzes a corpus of web pages may represent each web page as a node and each link between web pages as an edge. The objective of the application may be to identify groups of web pages that are related, which may be solved by identifying groups of nodes that are connected, often referred to as finding “connected components.” A group of nodes is connected if there exists a path of edges from each node in the group to every other node in the group and there is no edge from a node in the group to a node that is not in the group.
Several algorithms have been proposed for finding the connected components of a graph. These algorithms assign a label to each node of the graph such that two nodes are connected (i.e., by a path of edges) if and only if the two nodes have the same label. These algorithms include traversal algorithms that “walk” the edges of the graph to identify connected nodes. The traversal algorithms include depth-first search algorithms and breadth-first search algorithms. Such traversal algorithms can, however, be computationally expensive. In particular, as a graph increases in size to hundreds of thousands or millions of nodes, the time spent finding the connected components can become prohibitive.
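The labeling property described above can be illustrated with a minimal breadth-first traversal sketch in Python. The function name label_components_bfs, the edge-list input format, and the choice of the starting node's number as the label are assumptions made for illustration only; they are not taken from any particular prior algorithm.

```python
from collections import deque

def label_components_bfs(num_nodes, edges):
    """Label each node so that two nodes share a label iff they are connected.

    Assumes nodes are numbered 0..num_nodes-1 and edges are undirected
    (i, j) pairs. The label of a component is the number of the node
    from which its traversal started (an illustrative choice).
    """
    # Build an adjacency list so each edge can be walked in both directions.
    adjacency = [[] for _ in range(num_nodes)]
    for i, j in edges:
        adjacency[i].append(j)
        adjacency[j].append(i)

    labels = [None] * num_nodes
    for start in range(num_nodes):
        if labels[start] is not None:
            continue  # already reached from an earlier starting node
        labels[start] = start
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if labels[neighbor] is None:
                    labels[neighbor] = start
                    queue.append(neighbor)
    return labels
```

For a six-node graph with edges (0, 1), (1, 2), and (3, 4), this sketch gives nodes 0, 1, and 2 one label, nodes 3 and 4 another, and the isolated node 5 a label of its own.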
To help reduce the time it takes to find connected components, various algorithms have been adapted for execution on a parallel computer. A parallel computer typically has multiple processors that access a shared memory. The processors can execute instructions of an algorithm in parallel. Although the use of a parallel computer can help reduce the time needed to find connected components, in many cases adapting a serial algorithm into an efficient parallel algorithm can be difficult if not impossible.
One well-known parallel algorithm for finding connected components of a graph is referred to in the computer science literature as a “hook-and-compress” or “hook-and-jump” algorithm. See Cormen, T., Leiserson, C., and Rivest, R., “Introduction to Algorithms,” The MIT Press, 1991, pp. 727-728. Although there are many variations of the hook-and-compress algorithm, these algorithms generally operate by repeatedly performing a hook pass followed by a compress pass until the labels of the nodes do not change during a pass. Each label points to another node, such that upon completion, connected nodes point to the same node. Each node is initially assigned a label that points to itself. Each hook pass visits each edge and, for the node of the edge with the higher label, sets the label of that node's pointed-to node to the label of the other node connected to the edge. Each compress pass visits each node and sets the label of the node to the label of its pointed-to node. The hook-and-compress algorithm can generally be represented by the following pseudo-code, where each node is assigned a unique number, C[i] contains the label of node i, and edges are identified by the numbers of the nodes they connect.
hook-and-compress (G):
    for all nodes i
        C[i] = i
    repeat
        hook (G)
        compress (G)
    until C equals last C

hook (G):
    for all edges (i,j) of G
        if (C[i] > C[j] and C[i] == C[C[i]])
            C[C[i]] = C[j]

compress (G):
    for all nodes i of G
        C[i] = C[C[i]]
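The pseudo-code above can be sketched as a sequential Python simulation. This is only an illustration under stated assumptions: nodes are numbered from 0, each undirected edge is examined in both directions during the hook pass (the pseudo-code does not say how edge direction is handled), and the function name hook_and_compress is hypothetical. A parallel implementation would distribute the loop iterations across processors, which this sketch does not attempt.

```python
def hook_and_compress(num_nodes, edges):
    """Sequential sketch of the hook-and-compress labeling.

    Assumes nodes are numbered 0..num_nodes-1 and edges are undirected
    (i, j) pairs. Returns the list C, where C[i] is the label of node i.
    """
    # Each node initially points to itself.
    C = list(range(num_nodes))
    while True:
        last_C = C[:]  # snapshot used for the termination test

        # Hook pass: if the label of a is higher and a's pointed-to node
        # is a root (points to itself), hook that root onto b's label.
        for i, j in edges:
            for a, b in ((i, j), (j, i)):  # assumption: try both directions
                if C[a] > C[b] and C[a] == C[C[a]]:
                    C[C[a]] = C[b]

        # Compress pass: each node jumps to the label of its pointed-to node.
        for i in range(num_nodes):
            C[i] = C[C[i]]

        # Stop when a full hook-and-compress round changes no label.
        if C == last_C:
            return C
```

Because labels in this sketch only ever decrease, the loop terminates, and on exit every node of a component carries the number of the lowest-numbered node in that component. For edges (0, 1), (1, 2), and (3, 4) on six nodes, the result is [0, 0, 0, 3, 3, 5].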
In both the hook and compress steps, the iterations may execute in parallel. In particular, for the hook step, multiple processors may be executing the hook algorithm on the graph that is stored in the shared memory (and similarly for the compress step). The parallel hook-and-compress algorithm, however, encounters “hot spots” as the number of distinct labels decreases. A hot spot is a memory location that is repeatedly written and read. For example, as the hook-and-compress algorithm proceeds, more and more nodes tend to point to the same node. Because the accesses to the label of that pointed-to node are serialized, they reduce the speed of the algorithm. Also, during the compress steps, each node is visited a number of times that is proportional to the logarithm of the longest of the shortest paths between any two nodes. Thus, the hook-and-compress algorithm can be less efficient for large graphs than a sequential depth-first search, which visits each node only twice (once in each direction).
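One rough way to see the hot-spot effect is to count, for a given label array, how many indirect reads C[C[i]] of a single compress pass land on each memory location. The Python sketch below does this; the function name compress_read_counts is hypothetical, and the count is only a proxy for contention, since actual serialization depends on the memory system of the parallel computer.

```python
from collections import Counter

def compress_read_counts(C):
    """Count the indirect reads of each label location in one compress pass.

    In the compress step C[i] = C[C[i]], node i reads location C[i].
    The result maps each location to the number of nodes that read it,
    computed from the labels as they stand before the pass.
    """
    return dict(Counter(C))
```

For the labels [0, 0, 0, 3, 3, 5], location 0 is read three times, location 3 twice, and location 5 once. As components merge and more nodes point to the same node, a single location accounts for a growing share of the reads.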