Graph analytics is a field of data analysis where the underlying dataset is represented as a graph of vertices interconnected by edges. Some applications analyze huge graphs of millions or billions of vertices and edges. In order to process huge data sets that do not fit in the memory of a single machine, systems that support distributed graph processing are actively pursued by academia and industry. In these systems, graph data is spread over many machines that are connected through a network fabric.
The performance of distributed graph analyses, however, may be significantly affected by how the graph data is partitioned across computers, where each computer processes one graph partition. The number of edges that cross between partitions, thereby crossing machine boundaries, may determine the amount of communication between machines. Therefore, it is desirable to partition a graph in a way that minimizes the total number of partition-crossing edges.
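As an illustration of the quantity being minimized, the following minimal sketch counts partition-crossing edges for a given assignment of vertices to partitions. The function name `count_crossing_edges`, the six-vertex graph, and both assignments are hypothetical and chosen only for this example.

```python
from typing import Dict, Iterable, Tuple

Edge = Tuple[int, int]


def count_crossing_edges(edges: Iterable[Edge], partition_of: Dict[int, int]) -> int:
    """Count edges whose two endpoints are assigned to different partitions."""
    return sum(1 for src, dst in edges if partition_of[src] != partition_of[dst])


# Hypothetical graph: two triangles (0-1-2 and 3-4-5) joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]

# Assignment that keeps each triangle on one machine.
clustered = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
# Assignment that alternates vertices between the two machines.
alternating = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1}

clustered_cut = count_crossing_edges(edges, clustered)
alternating_cut = count_crossing_edges(edges, alternating)
print(clustered_cut, alternating_cut)
```

Both assignments place three vertices on each machine, yet the clustered assignment leaves one crossing edge while the alternating assignment leaves five, so the second would be expected to incur more inter-machine communication.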
The number of edges within a partition typically determines the amount of work done by a machine. Moreover, a local edge, which connects vertices within the same partition, requires a different amount of processing than a remote edge, which joins partitions. It is desirable to partition a graph so that each machine has an equal workload. Thus, there may be a tension between dividing all edges equally and minimizing remote edges. This tension may be treated as an optimization problem.
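The tension described above may be made concrete with a small cost model. The sketch below is an assumption made for illustration, not a method stated in this document: each edge is charged to the partition that owns its source vertex, and the costs `local_cost` and `remote_cost` are hypothetical values.

```python
from typing import Dict, Iterable, Tuple

Edge = Tuple[int, int]


def partition_workloads(
    edges: Iterable[Edge],
    partition_of: Dict[int, int],
    local_cost: float = 1.0,   # assumed cost of processing a local edge
    remote_cost: float = 3.0,  # assumed cost of processing a remote edge
) -> Dict[int, float]:
    """Estimate per-partition work, charging each edge to its source's partition."""
    load = {p: 0.0 for p in set(partition_of.values())}
    for src, dst in edges:
        owner = partition_of[src]
        is_local = partition_of[dst] == owner
        load[owner] += local_cost if is_local else remote_cost
    return load


def imbalance(load: Dict[int, float]) -> float:
    """Ratio of the heaviest partition's work to the mean; 1.0 is perfectly balanced."""
    mean = sum(load.values()) / len(load)
    return max(load.values()) / mean


# Hypothetical graph: two triangles joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
partition_of = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

loads = partition_workloads(edges, partition_of)
ratio = imbalance(loads)
print(loads, ratio)
```

Under these assumed costs, the single remote edge makes partition 0 twice as loaded as partition 1, even though the assignment has few crossing edges. An optimizer might combine such an imbalance ratio with the crossing-edge count into one objective.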
Equal division of all edges may have a big impact on the performance of distributed graph processing. If the workload is not equally distributed, then overloaded machines may become a bottleneck for system throughput. Although an equal division of all edges may be straightforward, such partitioning may often be suboptimal.
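One straightforward way to divide edges equally may be sketched as follows. This is an assumed scheme for illustration only: vertices are assigned in contiguous ranges so that each partition receives roughly the same number of outgoing edges. The function name `balance_by_edge_count` and the degree list are hypothetical.

```python
from typing import List


def balance_by_edge_count(out_degree: List[int], num_partitions: int) -> List[int]:
    """Assign contiguous vertex ranges so partitions hold roughly equal out-edge counts.

    out_degree[v] is the number of outgoing edges of vertex v.
    Returns a list in which entry v is the partition assigned to vertex v.
    """
    target = sum(out_degree) / num_partitions  # ideal edges per partition
    assignment = []
    part = 0
    edges_seen = 0
    for degree in out_degree:
        # Advance once the edges assigned so far fill the current partition's share.
        if part < num_partitions - 1 and edges_seen >= target * (part + 1):
            part += 1
        assignment.append(part)
        edges_seen += degree
    return assignment


# Hypothetical skewed graph: vertex 0 has four out-edges, the others one each.
out_degree = [4, 1, 1, 1, 1]
assignment = balance_by_edge_count(out_degree, 2)
print(assignment)
```

In this example each partition receives four edges, but the scheme never inspects where edges lead. It may therefore leave many edges crossing partitions, which is one reason such a division may be suboptimal.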
Various approaches may instead focus primarily on minimizing remote edges, which is an NP-hard problem. These systems often rely on external partitioning tools, such as ParMetis, or use heuristic techniques. However, due to the complexity and scale of large real-world graphs, these approaches may fail to achieve their goal of minimizing remote edges.
Another problem with various approaches is that they may solve the partitioning optimization problem on a single computer and use distributed computing only after partitioning is complete. For example, GraphLab has a sophisticated partitioning algorithm, but no ability to exploit multiple computers while performing that algorithm. As a result, partition optimization itself may become a bottleneck, even before actual graph analytics can begin. Furthermore, when a central computer partitions an entire graph, the graph is unlikely to fit within the physical memory of that computer, which may cause virtual memory thrashing and decrease throughput.