The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also be inventions.
As social networks have gained in popularity, maintaining and processing social network graph information with graph algorithms has become an essential means of discovering potential features of the graph. In general, a graph is a mathematical structure comprising an ordered pair G=(V, E), where V is the set of vertices, or nodes, which represent objects, and the elements in set E are edges, or lines, which represent relationships among different objects. Many real-world problems, such as social networks and traffic networks, can be abstracted into graph problems. The great increase in the size and scope of social networks and other similar applications has made it virtually impossible to process huge graphs on a single machine at a “real-time” level of execution.
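As a minimal illustration of the definition above, the ordered pair G=(V, E) can be represented directly, with an adjacency-list view derived from it. The vertices and edges here are hypothetical examples, not data from any particular system:

```python
# G = (V, E): V is a set of vertices (objects), E is a set of edges
# (relationships between pairs of objects). Hypothetical example data.
V = {"alice", "bob", "carol"}
E = {("alice", "bob"), ("bob", "carol")}

# An adjacency-list view of the same graph, a common in-memory layout.
adjacency = {v: set() for v in V}
for u, w in E:
    adjacency[u].add(w)
    adjacency[w].add(u)  # undirected edge: record both directions

print(sorted(adjacency["bob"]))  # neighbors of "bob"
```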
Many different graph-based algorithms have been proposed, covering graph construction, ranking, clustering, path problems, and so on. Most graph-based algorithms can be categorized into two classes: vertex-oriented and edge-oriented. Vertex-oriented algorithms, such as the vertex filter, focus on the value of each vertex, and the data of each vertex is usually processed separately with no message passing from one vertex to another. If the main part of an algorithm is to compute the states of edges or to perform message transmission (e.g., PageRank), the algorithm is considered to be edge-oriented. Most edge-oriented algorithms can be solved in a vertex-centric way. However, in distributed computing environments, the high volume of network traffic can become a serious problem when performing edge-oriented algorithms on a vertex-oriented infrastructure. The cause of this problem is that the data of a graph is stored in a vertex-oriented manner, and if the state of an edge is modified by one of its associated vertices (which occurs often in an edge-oriented algorithm), the other vertex must be notified to share the new state of the edge. If the two vertices are not located on the same machine, network traffic will be generated.
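The source of the network traffic described above can be sketched with a toy partitioning of vertices onto machines. The vertex names, partition assignment, and helper function below are hypothetical, introduced only to show when an edge-state update must cross machine boundaries:

```python
# Hypothetical vertex-to-machine assignment: the graph's data is stored
# in a vertex-oriented manner, so each vertex lives on one machine.
partition = {"u": 0, "v": 1, "w": 0}  # vertex -> machine ID

def count_remote_updates(edge_updates):
    """Count edge-state updates whose two endpoint vertices live on
    different machines; each such update generates network traffic,
    because the other vertex must be notified of the new edge state."""
    remote = 0
    for u, v in edge_updates:
        if partition[u] != partition[v]:
            remote += 1  # new edge state must be sent over the network
    return remote

# Updating edge (u, v) crosses machines 0 -> 1; edge (u, w) stays local.
print(count_remote_updates([("u", "v"), ("u", "w")]))
```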
The scale of the graphs in many practical problems, such as social networks, web graphs, and document similarity graphs, can be on the order of millions to billions of vertices and trillions of edges. Distributed computing techniques have been applied to graph computations in order to more efficiently process graph data. One example is Map-Reduce, a distributed computing model introduced by Google® that processes large data sets on clusters of computers in parallel using the principles of the map and reduce functions commonly used in functional programming. One iteration of the map and reduce functions is called a Map-Reduce job. A job is submitted to the master node of a machine cluster, and the master node divides the input data into several parts and arranges a number of slave machines to process these input data partitions. In an example implementation, a graph is split into blocks that are taken as the input of the map function. In the map function, the value of each vertex is divided by the edge number of that vertex, and the result is stored as a key/value pair {neighbor ID, result}. Before the reduce function, each machine fetches a certain range of key/value pairs onto its local storage and performs a reduce function on each key value. In this example, the reduce function reads all of the values under the same key (vertex ID), sums them up, and writes the result back as the new value of that vertex. Hadoop, by Apache, is an open-source implementation of the Map-Reduce model that is considered a good platform for graph-related processing. Besides the Map-Reduce function, it also provides the Hadoop Distributed File System (HDFS) and has become a popular infrastructure for cloud computing. However, the Hadoop project is still in development and exhibits shortcomings in areas such as job management and robustness.
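The example implementation above (a PageRank-style iteration) can be sketched with an in-memory stand-in for the distributed map, shuffle, and reduce phases. The graph and the vertex values are hypothetical; in a real Map-Reduce job, the map and reduce steps would run on slave machines and the grouping by key would happen during the shuffle:

```python
from collections import defaultdict

graph = {1: [2, 3], 2: [3], 3: [1]}   # vertex ID -> neighbor IDs (hypothetical)
value = {1: 0.3, 2: 0.3, 3: 0.4}      # current vertex values (hypothetical)

# Map: divide each vertex's value by its edge number and emit
# a {neighbor ID, result} key/value pair for each neighbor.
pairs = []
for v, neighbors in graph.items():
    share = value[v] / len(neighbors)
    for n in neighbors:
        pairs.append((n, share))

# Shuffle: group the emitted pairs by key (vertex ID), as each machine
# would do before running its reduce function.
grouped = defaultdict(list)
for key, val in pairs:
    grouped[key].append(val)

# Reduce: read all of the values under the same key, sum them up, and
# write the result back as the new value of that vertex.
new_value = {key: sum(vals) for key, vals in grouped.items()}
```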
Vertex-oriented algorithms, which have a flat data model, fit well on the Map-Reduce model, but edge-oriented ones do not fit as well. This is because edge-oriented algorithms usually need to share the states of edges among multiple vertices, and Map-Reduce is a “share nothing” model that is inherently weak on edge-oriented algorithms. Algorithms that can be presented as matrix problems can be implemented on Hadoop. However, because of the locality problem and the overhead caused by Hadoop itself, the performance is not guaranteed to be high compared to other solutions. Moreover, Hadoop does not guarantee data locality: it will try to process each file block locally, but when local processing slots are occupied, local file blocks may be processed by other machines.
Although many real-world problems can be modeled using Map-Reduce, there are still many that cannot be presented very well using this framework. Furthermore, the Map-Reduce model has certain weaknesses that limit its effectiveness with regard to certain important applications, such as cloud computing and social network environments. For example, Map-Reduce cannot share information among different slave machines when running map or reduce functions, and not all graph-based algorithms can be mapped onto Map-Reduce. For certain graph-related problems that can be solved by Map-Reduce, the solutions may not be optimal for certain applications (e.g., cloud computing). Increased scalability is another key concern in the development and application of graph processing systems.
What is needed is an effective and efficient way to reformulate the Markov clustering technique so that it can be solved efficiently on Map-Reduce platforms.