Many systems, such as proteins, chemical compounds, and the Internet, can be modeled as a graph to understand local and global characteristics of the system. In many cases, the system under investigation is very large and the corresponding graph has a large number of nodes and edges, so advanced processing approaches are required to derive information from the graph efficiently. Several graph mining techniques have been developed to extract information from the graph representation and analyze various features of complex networks.
Finding connected components, that is, disjoint subgraphs in which any two vertices are connected to each other by paths, is a very common way of extracting information from a graph. Application areas include analysis of coherent cliques in social networks, density-based clustering, image segmentation, database queries, and many more.
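As an illustration of the definition above, the following is a minimal single-machine sketch that finds connected components with a breadth-first search. The function name `connected_components` and the edge-list input format are assumptions made for this example only and are not taken from any of the approaches discussed here.

```python
from collections import defaultdict, deque

def connected_components(edges):
    """Return a list of components, each a sorted list of node ids."""
    # Build an undirected adjacency structure from the edge list.
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))
    return components

# Hypothetical example graph: a path 1-2-3, an edge 4-5, and a self-loop on 6.
example_edges = [(1, 2), (2, 3), (4, 5), (6, 6)]
components = connected_components(example_edges)
print(components)
```

A traversal of this kind assumes the whole graph fits in the memory of one machine, which is the assumption that breaks down for the massive graphs discussed below.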
Record linkage, the task of identifying which records in a database refer to the same entity, is also one of the major application areas of connected components. Finding connected components within a graph is a well-known problem with a long research history. However, the scale of the data has grown tremendously in recent years. Many online networks, such as Facebook, LinkedIn, and Twitter, have hundreds of millions of users and many more connections among these users. Similarly, several online people search engines collect billions of records about people and try to cluster these records after computing similarity scores between them. Analysis of such massive graphs requires new technology.
Recently, several MapReduce approaches have been developed to find the connected components in a graph. Although the basic ideas behind these approaches are similar, for example representing each connected component by its smallest node id, the approaches differ in how they implement these ideas.
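The shared idea of labeling each component with its smallest node id can be sketched as a repeated map and reduce step. The code below is a single-process simulation written for illustration only; it does not reproduce any particular published algorithm, and the names `map_phase`, `reduce_phase`, and `min_id_components` are hypothetical.

```python
from collections import defaultdict

def map_phase(labels, edges):
    """Emit (node, candidate_label) pairs: each node's own label and its neighbors' labels."""
    for node, label in labels.items():
        yield node, label
    for u, v in edges:
        yield u, labels[v]
        yield v, labels[u]

def reduce_phase(pairs):
    """Group candidates by node and keep the smallest label for each node."""
    grouped = defaultdict(list)
    for node, candidate in pairs:
        grouped[node].append(candidate)
    return {node: min(candidates) for node, candidates in grouped.items()}

def min_id_components(edges):
    """Repeat map and reduce until no label changes; return (labels, rounds)."""
    nodes = {n for edge in edges for n in edge}
    labels = {n: n for n in nodes}  # every node starts as its own component
    rounds = 0
    while True:
        new_labels = reduce_phase(map_phase(labels, edges))
        rounds += 1
        if new_labels == labels:
            return labels, rounds
        labels = new_labels

# Hypothetical example: a path 1-2-3-4-5 plus a separate edge 7-8.
example_edges = [(5, 4), (4, 3), (3, 2), (2, 1), (7, 8)]
labels, rounds = min_id_components(example_edges)
print(labels, rounds)
```

In this naive form the smallest id travels only one hop per round, so the number of rounds grows with the length of the longest path the label has to cross. Here the path component needs four rounds of changes plus one final round that detects convergence.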
PEGASUS is a graph mining system in which several graph algorithms, including connected component computation, are represented and implemented as repeated matrix-vector multiplications. Other approaches have an O(d) bound on the number of MapReduce iterations needed, where d is the diameter of the largest connected component. Still other approaches focus on reducing the bound on the number of MapReduce iterations and provide algorithms with smaller bounds (e.g., 3 log d). On the other hand, other work analyzes several real networks and shows that real networks generally have small diameters. Such improvements therefore might not help much in real networks, where the diameters are small.
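To make the last point concrete, the following sketch compares a linear iteration count with a logarithmic one for a few diameters. It rests on assumptions that are not in the text above: the O(d) approach is treated as needing exactly d iterations, the logarithm in 3 log d is taken to be base 2, and the result is rounded up. The function names are hypothetical.

```python
import math

def linear_rounds(d):
    """Iteration count for a hypothetical approach needing exactly d rounds (assumption)."""
    return d

def log_rounds(d):
    """Iteration count for a hypothetical approach needing 3 * log2(d) rounds, rounded up (assumption)."""
    return math.ceil(3 * math.log2(d))

# Compare both counts for a few example diameters.
comparison = {d: (linear_rounds(d), log_rounds(d)) for d in (4, 8, 16, 64, 1024)}
for d, (linear, logarithmic) in comparison.items():
    print(f"d={d:5d}  linear={linear:5d}  3*log2(d)={logarithmic:3d}")
```

Under these assumptions the logarithmic count is the larger of the two for d = 4 and d = 8 and becomes smaller only at the larger diameters in the table, which is consistent with the observation that such bounds offer little benefit on small-diameter networks.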
The disclosed non-limiting embodiments herein provide a connected component computation strategy used in the record linkage process of a major commercial People Search Engine to deploy a massive database of personal information.