Many commercial and government applications use graph algorithms to perform various tasks. Examples of such tasks include finding a shortest or fastest path on a map, routing robots, analyzing DNA, scheduling activities, processing transactions, allocating resources, analyzing social networks, and optimizing networks for communication, transportation, water supply, electricity, and traffic. Some of the graph algorithm applications involve analyzing large databases. Examples of information in these large databases include, but are not limited to, consumer purchasing patterns, financial transactions, social networking patterns, financial market data, and internet data.
In the execution of these large database applications, computation hardware often has difficulty achieving the throughput requirements of the graph algorithm computations. For example, most conventional processors employ cache-based memory systems in order to take advantage of the highly localized access patterns involved in many conventional processing tasks. However, memory access patterns for graph processing are often random in practice, and cache miss rates tend to be high, significantly degrading performance. In addition, graph algorithms require many operations involving indices of vertices and edges, and the complexity associated with these index-related operations can significantly degrade processor performance. Consequently, on conventional processors, a typical large graph algorithm can run hundreds or even a thousand times slower than conventional processing tasks.
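As a rough illustration of the access pattern described above, the following Python sketch assumes a compressed sparse row (CSR) layout, which is one common way to store a graph but is not prescribed by this description. The function names `build_csr` and `neighbor_sums` and the sample data are hypothetical. The sketch shows that every edge visit requires an index lookup (`col_idx[k]`) followed by a data-dependent read (`values[...]`) whose address is effectively arbitrary, which is the kind of access that tends to defeat a cache.

```python
def build_csr(num_vertices, edges):
    """Build CSR arrays (row_ptr, col_idx) for a directed graph.

    edges is a list of (source, destination) vertex index pairs.
    """
    # Count the out-degree of each vertex, offset by one position.
    row_ptr = [0] * (num_vertices + 1)
    for src, _ in edges:
        row_ptr[src + 1] += 1
    # Prefix sum turns the counts into starting offsets.
    for v in range(num_vertices):
        row_ptr[v + 1] += row_ptr[v]
    # Scatter each destination into its source vertex's slot range.
    col_idx = [0] * len(edges)
    next_slot = list(row_ptr[:-1])
    for src, dst in edges:
        col_idx[next_slot[src]] = dst
        next_slot[src] += 1
    return row_ptr, col_idx


def neighbor_sums(row_ptr, col_idx, values):
    """For each vertex, sum the values stored at its out-neighbors."""
    sums = [0] * (len(row_ptr) - 1)
    for v in range(len(sums)):
        # The edge list itself is read sequentially ...
        for k in range(row_ptr[v], row_ptr[v + 1]):
            # ... but values[] is read at a data-dependent index,
            # so consecutive reads may land far apart in memory.
            sums[v] += values[col_idx[k]]
    return sums


if __name__ == "__main__":
    # Hypothetical 4-vertex graph used only for illustration.
    edges = [(0, 3), (0, 1), (1, 2), (3, 0), (3, 2)]
    row_ptr, col_idx = build_csr(4, edges)
    print(neighbor_sums(row_ptr, col_idx, [10, 20, 30, 40]))
```

In this sketch the inner loop performs one index read and one indirect data read for each addition, so index handling and memory latency, rather than arithmetic, account for most of the work.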
To increase computational throughput, some manufacturers have added more cores to the processor die of conventional processors. This increase in processing capacity can translate to an increase in graph algorithm throughput when processing throughput is the predominant limitation on performance. However, when memory access bandwidth is the limiting factor, having many cores on a single semiconductor die does not necessarily translate to significant acceleration. Because current commercial multi-core processors tend to rely on cache-based memory architectures, graph algorithms with random memory access patterns still tend to run slowly on these multi-core processors. In addition, the power efficiency and die area efficiency of conventional processors are not much improved by using multi-core processors, because the same total number of processor operations is required to perform a given graph algorithm computation.
Another technique for increasing throughput for many conventional processing tasks entails networking multiple processors together. A communication network interconnects the multiple processors so that different processors can handle different portions of the computations. However, the individual processors of these multiprocessor networks tend to be cache-based, and encounter the previously described limitations.
Increasingly, industry is producing individual processors with multiple processor cores. For many scientific computing applications, multiprocessor communication networks with a large number of these multi-core processors produce significant acceleration when compared to a single processor, provided the multiprocessor communication network can keep pace with the computations. Because conventional computing, including scientific computing, requires relatively low communication bandwidth in comparison to its computational throughput, conventional multiprocessor networks can provide significant acceleration for conventional computing. However, graph algorithms tend to require much higher communication bandwidth relative to their computational throughput. Hence, large graph algorithms often run inefficiently in conventional multiprocessor networks that have limited communication bandwidth. The acceleration achieved by having multiple processors typically levels off after only a small number of processors. This leveling off occurs because the computing patterns for graph algorithms require much more communication between processors than conventional, highly localized processing does.