1. Field
The present invention relates generally to analytical techniques, and more particularly to analytical techniques for identifying a k-core subgraph in a graph and maintaining the materialized k-core subgraph over dynamic updates to the graph, for example where the graph data is stored in a distributed cluster.
2. Description of the Related Art
Large-scale graph data arises in many problems across scientific and engineering disciplines. For example, the problem of identifying k-core subgraphs appears in the context of finding close-knit communities in a social network, analyzing protein interactions, understanding the nucleus of Internet Autonomous Systems, and the like. In graph theory, the k-core is a key metric used to identify subgraphs of high cohesion, also known as the “dense” regions of a graph. A k-core is defined as a maximal connected subgraph in which all vertices have degree at least k (Reference: http://en.wikipedia.org/wiki/Degeneracy_(graph_theory)#k-Cores). Equivalently, the k-core subgraph can be found by repeatedly deleting from the original graph all vertices of degree less than k, until no such vertex remains.
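By way of illustration only, the repeated-deletion procedure described above can be sketched in Python as follows. The function name k_core and the adjacency-dictionary input (each vertex mapped to the set of its neighbors, with every undirected edge listed in both directions) are assumptions made for this sketch and do not appear in the cited references. The sketch returns every vertex that survives the deletions, which may span more than one connected component.

```python
from collections import deque


def k_core(adj, k):
    """Return the set of vertices left after repeatedly deleting
    every vertex whose current degree is less than k.

    adj is assumed to be a dict mapping each vertex to the set of its
    neighbors, with each undirected edge present in both directions.
    """
    # Current degree of every vertex in the shrinking graph.
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    removed = set()
    # Start with all vertices that already violate the degree bound.
    queue = deque(v for v, d in degree.items() if d < k)
    while queue:
        v = queue.popleft()
        if v in removed:
            continue  # a vertex may have been queued more than once
        removed.add(v)
        # Deleting v lowers the degree of each surviving neighbor.
        for u in adj[v]:
            if u not in removed:
                degree[u] -= 1
                if degree[u] < k:
                    queue.append(u)
    return set(adj) - removed


# Hypothetical example: a triangle (1, 2, 3) with a pendant vertex 4.
example = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
two_core = k_core(example, 2)
```

In this hypothetical example, vertex 4 has degree 1 and is deleted when k is 2, while the triangle remains because each of its vertices keeps degree 2.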
Previously, Batagelj and Zaversnik (BZ) proposed a linear-time algorithm to compute the k-core (Reference: Vladimir Batagelj and Matjaz Zaversnik. An O(m) Algorithm for Cores Decomposition of Networks, Advances in Data Analysis and Classification, 2011. Volume 5, Number 2, 129-145). The BZ algorithm first sorts the vertices in increasing order of degree and then starts deleting the vertices with degree less than k. At each iteration, the algorithm re-sorts the vertices by their current degrees to keep them ordered. Because of the high number of random accesses to the graph, the algorithm runs efficiently only when the entire graph fits into the main memory of a single machine.
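The degree-ordered deletion described above can be illustrated with the following simplified Python sketch. It is not the published BZ implementation: it keeps the vertices ordered by grouping them into per-degree buckets (Python sets) rather than a sorted array, and the function name core_numbers and the adjacency-dictionary input are assumptions made for this sketch. It computes, for every vertex, the largest k for which that vertex survives the deletions, and it shows why the approach depends on random access to the whole graph in memory.

```python
def core_numbers(adj):
    """Return a dict mapping each vertex to its core number, i.e. the
    largest k for which the vertex survives repeated deletion of
    vertices with degree less than k.

    adj is assumed to be a dict mapping each vertex to the set of its
    neighbors, with each undirected edge present in both directions.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    max_deg = max(degree.values(), default=0)
    # buckets[d] holds the not-yet-processed vertices of current degree d.
    buckets = [set() for _ in range(max_deg + 1)]
    for v, d in degree.items():
        buckets[d].add(v)

    core = {}
    d = 0
    while d <= max_deg:
        if not buckets[d]:
            d += 1  # nothing left at this degree; move to the next one
            continue
        v = buckets[d].pop()
        core[v] = d  # v is deleted at level d
        for u in adj[v]:
            # Lower the degree of unprocessed neighbors, but never below d,
            # so that a neighbor is always moved to a bucket at level >= d.
            if u not in core and degree[u] > d:
                buckets[degree[u]].discard(u)
                degree[u] -= 1
                buckets[degree[u]].add(u)
    return core


# Hypothetical example: a triangle (1, 2, 3) with a pendant vertex 4.
example = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
cores = core_numbers(example)
```

Under these assumptions, the vertices whose core number is at least k are the vertices that remain after the deletions for that k. Each neighbor lookup and bucket move touches an arbitrary part of the graph, which is the random-access pattern noted above.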
To go beyond the limit of main memory, Cheng et al. proposed an external-memory algorithm, which can spill to disk when the graph is too large to fit into main memory (Reference: J. Cheng, Y. Ke, S. Chu, and M. T. Özsu, “Efficient core decomposition in massive networks,” in ICDE, 2011, pp. 51-62). This algorithm, however, does not consider a distributed scenario in which the graph resides on a large cluster of machines.
In addition to computing the k-core, another challenge is to maintain the k-core subgraph as successive edge insertions and/or deletions occur. Li et al. addressed dynamic updates by determining a minimal region in the graph impacted by the updates (Reference: R. Li and J. Yu, “Efficient core maintenance in large dynamic graphs,” arXiv preprint arXiv:1207.4567, 2012). The Li et al. algorithm, however, works only in memory on a single server.