Field of the Invention
The present invention relates to graph processing, and more specifically, to a method and device for realizing graph processing based on the MapReduce architecture.
Description of Related Art
MapReduce is a software architecture proposed by Google Inc. for large-scale parallel programming. It is mainly used for parallel computing on large-scale data sets (larger than 1 TB). The concepts of “Map” and “Reduce”, and the main idea behind them, are borrowed from functional programming languages. Current MapReduce middleware implementations require application developers to specify a Map function that maps a set of key-value pairs to a new set of key-value pairs, called intermediate key-value pairs; application developers are further required to specify a Reduce function that processes the intermediate key-value pairs output by the Map function. Scalability is achieved by distributing large-scale operations on a data set to multiple nodes on a network for parallel computation. The architecture has been widely adopted in web access log analysis, document clustering, machine learning, data statistics, statistical machine translation, and other fields. For example, Hadoop is one implementation of MapReduce. More and more cloud computing service providers have deployed the MapReduce architecture in their cloud computing systems.
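By way of illustration only, the division of work between a Map function and a Reduce function can be sketched in a single process as follows. The names `map_fn`, `reduce_fn`, and `run_mapreduce` and the in-degree example are assumptions introduced for this sketch; they are not part of any particular MapReduce middleware, and a real deployment would distribute the map, grouping, and reduce steps across multiple nodes.

```python
from collections import defaultdict


def map_fn(vertex, targets):
    """Map: turn one input pair (vertex, outgoing targets) into new
    key-value pairs of the form (target, 1)."""
    for target in targets:
        yield target, 1


def reduce_fn(vertex, counts):
    """Reduce: combine all values emitted by the Map function for one key."""
    return vertex, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical single-process driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for new_key, new_value in map_fn(key, value):
            groups[new_key].append(new_value)  # grouping ("shuffle") step
    return dict(reduce_fn(key, values) for key, values in groups.items())


# Adjacency-list input: each record is (vertex, list of outgoing targets).
adjacency = [("a", ["b", "c"]), ("b", ["c"]), ("c", ["a"])]
in_degree = run_mapreduce(adjacency, map_fn, reduce_fn)
```

Here `in_degree` maps each vertex to the number of edges pointing at it; the Reduce function sees only the values that the Map function emitted under the same key.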
Graph processing is an important class of problems in large-scale data processing. A large number of graph processing algorithms are involved in the relationship analysis of various entities, data mining, and various optimization problems in social networks. The MapReduce implementation of a graph processing algorithm usually consists of several iterations, each of which is formed by multiple steps of Map tasks and Reduce tasks. In general, a graph processing algorithm needs multiple iterations to converge on a stable solution.
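As a non-limiting sketch of such an iterative job, the following example labels connected components by repeatedly propagating the smallest vertex label to neighbors until no label changes. The choice of algorithm, the function names, and the simplification of each iteration to one Map step and one Reduce step are assumptions made for illustration; the text above does not prescribe them.

```python
from collections import defaultdict


def map_labels(vertex, state):
    """Map: re-emit the vertex structure and send its label to each neighbor."""
    label, neighbors = state
    yield vertex, ("graph", neighbors)  # carry the adjacency list forward
    yield vertex, ("label", label)      # keep the vertex's own label in play
    for neighbor in neighbors:
        yield neighbor, ("label", label)


def reduce_labels(vertex, values):
    """Reduce: keep the smallest label received and restore the neighbors."""
    neighbors, best = [], None
    for kind, payload in values:
        if kind == "graph":
            neighbors = payload
        elif best is None or payload < best:
            best = payload
    return vertex, (best, neighbors)


def run_iteration(graph):
    """One Map step followed by one Reduce step over the whole graph."""
    groups = defaultdict(list)
    for vertex, state in graph.items():
        for key, value in map_labels(vertex, state):
            groups[key].append(value)
    return dict(reduce_labels(key, values) for key, values in groups.items())


def connected_components(adjacency, max_iterations=100):
    """Iterate until the labels stop changing, i.e. a stable solution."""
    graph = {v: (v, list(ns)) for v, ns in adjacency.items()}
    for iteration in range(1, max_iterations + 1):
        new_graph = run_iteration(graph)
        if new_graph == graph:  # converged: this pass changed nothing
            labels = {v: label for v, (label, _) in graph.items()}
            return labels, iteration
        graph = new_graph
    raise RuntimeError("labels did not converge within max_iterations")


# Undirected graph given as a symmetric adjacency list.
adjacency = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
labels, iterations = connected_components(adjacency)
```

Because every iteration must finish before the next one starts, the slowest Map or Reduce task in an iteration determines when the whole job can proceed.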
In existing graph processing problems, the input data sets are so large that the computation can hardly be completed on a single node. Thus, in the implementation of a MapReduce-based graph processing algorithm, a large graph formed by multiple nodes needs to be divided into several sub-graphs. Because nodes and edges are unevenly distributed in a graph, the computing loads of the Map or Reduce tasks are also unbalanced across the sub-graphs, the fundamental reason being that there is a linear relationship between their computing complexities and the storage complexities of the data structures they use (for example, an adjacency list). A common dividing criterion is to divide the data set input to a graph processing job according to a fixed data size. As a result, a “long tail” phenomenon usually occurs in current MapReduce implementations of graph algorithms, wherein some sub-computing tasks (Map or Reduce tasks) have an especially long running time, while the sub-tasks that have already finished must wait until all sub-tasks have finished before the computation of the next iteration can be performed. The essential cause of this phenomenon is that the relationship between the computing complexity and the data length of a graph processing algorithm is not necessarily linear; for example, it may be O(n^2) or O(n^3).
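The effect can be illustrated with a small, purely hypothetical example. The sketch below packs adjacency-list records into splits of a fixed storage size and then compares the splits under an assumed quadratic cost model, in which processing a vertex of degree d costs d^2 units. The cost model, the function names, and the sample graph are assumptions chosen only to make the imbalance visible.

```python
def split_by_fixed_size(adjacency, chunk_size):
    """Pack vertices, in input order, into splits holding at most
    `chunk_size` stored entries (one per vertex plus one per neighbor)."""
    splits, current, used = [], [], 0
    for vertex, neighbors in adjacency.items():
        size = 1 + len(neighbors)
        if current and used + size > chunk_size:
            splits.append(current)
            current, used = [], 0
        current.append(vertex)
        used += size
    if current:
        splits.append(current)
    return splits


def storage_size(adjacency, split):
    """Stored entries in a split: linear in vertices plus edges."""
    return sum(1 + len(adjacency[v]) for v in split)


def quadratic_cost(adjacency, split):
    """Assumed cost model: work per vertex grows with the square of its degree."""
    return sum(len(adjacency[v]) ** 2 for v in split)


# One hub connected to nine leaves; each leaf stores only the hub.
adjacency = {"hub": [f"v{i}" for i in range(9)]}
for i in range(9):
    adjacency[f"v{i}"] = ["hub"]

splits = split_by_fixed_size(adjacency, chunk_size=10)
sizes = [storage_size(adjacency, s) for s in splits]
costs = [quadratic_cost(adjacency, s) for s in splits]
```

Under these assumptions the three splits store 10, 10, and 8 entries, yet their modeled costs are 81, 5, and 4 units, so the task holding the hub runs far longer than the others and becomes the long tail of the iteration.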
Therefore, a graph processing method for balancing computation loads of Map and Reduce tasks is desired.