This invention generally relates to the analysis of large volumes of data to identify and analyze groups of data elements that are related, and more particularly to characterizing the data in a large data set using graph and connected components data analytical approaches to partition the data into subsets of data elements that are related.
There are classes of data processing problems where it is desirable to analyze a data set to characterize subsets of the data according to relations between data elements. As an example, a telephone company (“Telco”) that has a large group of subscribers, e.g., a million, may wish to map out the patterns in which its subscribers call one another in order to better understand their behaviors and to optimize the Telco's service and profits. In order to do this, the Telco needs to identify subsets of subscribers that call one another to construct the mapping patterns. As another example, a candidate for political office with limited resources may wish to decide how best to allocate these resources during a campaign. Assume that the campaign organization has determined that people vote in peer groups, and wants to focus on swing voters, but does not have sufficient resources to telephone, visit or otherwise contact every prospective voter in each swing voter peer group. The campaign organization may decide to target the peer groups in order of size, from largest to smallest, and in any event may want only one representative from each peer group to be its evangelist to influence the other voters in the peer group.
The problem in each case is how to identify the subsets of related data elements (i.e., subscribers or voters) efficiently in a much larger set of data elements. Additionally, in the voter example, it is also necessary to characterize peer groups according to their sizes as well as to identify for each peer group a representative voter. One approach to analyzing such data to obtain the desired information is to use well-known graph theory and connectivity components data analytics. A graph is an object that describes a relation between pairs of data elements (“vertices”) in a set. The pairs exhibiting the relation are referred to as “edges”. Each pair of data elements that belongs to the underlying set either exhibits or does not exhibit the relation. For example, the data elements in both of the foregoing examples are “persons”, and the relationship may be “friendship”. Thus, the persons of each pair are either friends or not. Two data elements (“vertices”) in a graph are “connected” if there is a path of “edges” (relations) linking them. A connectivity component is a subset of data elements of the graph that are pair-wise connected and to which no additional element can be added that is connected to any of the data elements of the subset, e.g., the subscribers or voters of a subgroup or peer group of the larger group who are linked to one another, directly or through a chain of “friends”. Subsets of persons can be identified in the foregoing examples by using graph theory to characterize the data elements (subscribers or voters) as being within connectivity components.
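By way of illustration only, the definitions above may be sketched in Python as follows. The sketch assumes a small in-memory graph and uses an ordinary breadth-first search; the function names `connected_components` and `summarize`, the sample names, and the choice of the alphabetically first member as each group's representative are hypothetical and are not taken from any method described herein.

```python
from collections import defaultdict, deque

def connected_components(vertices, edges):
    """Return the connectivity components, each as a sorted list of vertices."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in vertices:
        if start in seen:
            continue
        # Breadth-first search collects every vertex connected to `start`.
        seen.add(start)
        queue = deque([start])
        members = []
        while queue:
            v = queue.popleft()
            members.append(v)
            for w in adjacency[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(sorted(members))
    return components

def summarize(components):
    """Order groups largest first; report (representative, size) for each.

    Assumption: the representative is simply the first member in sorted order.
    """
    ordered = sorted(components, key=lambda c: (-len(c), c[0]))
    return [(c[0], len(c)) for c in ordered]

# Hypothetical sample data: six persons and three "friendship" edges.
people = ["ann", "bob", "cy", "dee", "eve", "fay"]
friendships = [("ann", "bob"), ("bob", "cy"), ("dee", "eve")]

groups = connected_components(people, friendships)
summary = summarize(groups)
```

In this sample, "ann" and "cy" are not friends with each other but fall in the same component because a path of edges links them through "bob", while "fay", who has no edges, forms a component alone.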
The connected components problem for a graph is the problem of partitioning the larger set of vertices (data elements) of the graph into connectivity components, i.e., identifying subsets of data elements that are related. It has been handled in different ways that are not practical for real-world mass data analysis. A common approach for finding connectivity components is to use the well-known “Union-Find” algorithm for disjoint-set data structures. This algorithm involves a “find” operation to determine in which of a plurality of subsets a particular data element is located, and a “union” or join operation to combine two subsets into a single subset. However, this approach is not practical with large data sets. As the size of the data set increases, storage and retrieval quickly become increasingly slow and very inefficient. The Union-Find algorithm also requires access to many distant and hard-to-anticipate data items in every operation. Accordingly, even though a computer may be able to access a limited number of data items quickly, because of the large number of accesses required, the operations are exceedingly slow.
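For illustration, a minimal Python sketch of the “find” and “union” operations described above is given below. The class name `DisjointSet` is hypothetical, and the path compression and union-by-size refinements are common textbook choices assumed here rather than details stated above. Each call to `find` follows a chain of parent pointers whose locations depend on the order of earlier unions, which is the kind of hard-to-anticipate access pattern noted above.

```python
class DisjointSet:
    """Textbook disjoint-set structure (illustrative sketch only)."""

    def __init__(self, elements):
        # Every element starts as the root of its own one-element subset.
        self.parent = {e: e for e in elements}
        self.size = {e: 1 for e in elements}

    def find(self, x):
        """Return the root that identifies the subset containing x."""
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression (assumed refinement): point visited nodes at the root.
        while self.parent[x] != root:
            next_x = self.parent[x]
            self.parent[x] = root
            x = next_x
        return root

    def union(self, a, b):
        """Merge the subsets containing a and b; return the surviving root."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        # Union by size (assumed refinement): attach the smaller tree to the larger.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return ra

# Hypothetical sample: six elements, three edges.
ds = DisjointSet(range(6))
for a, b in [(0, 1), (1, 2), (3, 4)]:
    ds.union(a, b)

same_group = ds.find(0) == ds.find(2)
different_group = ds.find(0) != ds.find(3)
group_size = ds.size[ds.find(0)]
```

After the three unions, elements 0, 1 and 2 share one root, elements 3 and 4 share another, and element 5 remains alone.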
A different approach to finding connectivity components in a graph is one that requires the computer to make random choices, as described by Karger, David R., et al. in “Fast Connected Components Algorithms for the EREW PRAM”, Department of Computer Science, Stanford University, NSF Grant CCR-9010517, Jul. 1, 1977, available at people.csail.mit.edu/karger/Papers/conn-components.pdf. This algorithm requires the use of an exclusive-read, exclusive-write (EREW) PRAM, which is a theoretical computational model that is far more powerful than any real computer. As such, it is only a mathematical curiosity and is impractical to implement. For practical connectivity component analysis, randomness has so far not been utilized.
Moreover, large data graphs are stored in large data stores (databases), for which data access is allowed only through a database language interface, e.g., Structured Query Language (SQL). For solving the connected components problem, present methods of using an SQL interface are impractical. One such method, for example, would be to use SQL JOINs in order to calculate first the connectivity of each vertex to all vertices that are two edges away from it, then those that are three edges away from it, and so on. However, for a graph that has a very long path comprising, e.g., a million data elements where element x0 is connected to x1 which is connected to x2 which is connected to x3, etc., up to x999999, to ascertain that two elements xi and xj both belong to the same connectivity component would require a prohibitively large number of JOIN operations over large tables, and would be exceedingly slow. Another SQL approach would be to first map out all pairs of data elements that are at most two relations apart, then those that are at most four relations apart, etc. While this requires fewer SQL passes over the data, the intermediate data that must pass between stages is exceedingly large, many times the size of the original data, rendering it impractical.
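The cost of the first SQL approach can be seen on a small scale in the following Python sketch, which is offered only as an illustration of the impractical method described above. It assumes an in-memory SQLite database as a stand-in for a large data store; the table names `edge` and `reach` and the function name `reach_by_repeated_joins` are hypothetical. Each pass performs one JOIN that extends every known path by a single edge, and the loop stops when a pass adds no new pairs.

```python
import sqlite3

def reach_by_repeated_joins(edges):
    """Grow a table of connected pairs one edge per pass.

    Returns (number of JOIN passes executed, final row count of `reach`).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edge (a INTEGER, b INTEGER, PRIMARY KEY (a, b))")
    conn.execute("CREATE TABLE reach (a INTEGER, b INTEGER, PRIMARY KEY (a, b))")
    # Store each undirected edge in both directions.
    both = [(a, b) for a, b in edges] + [(b, a) for a, b in edges]
    conn.executemany("INSERT OR IGNORE INTO edge VALUES (?, ?)", both)
    conn.execute("INSERT INTO reach SELECT a, b FROM edge")

    passes = 0
    while True:
        before = conn.execute("SELECT COUNT(*) FROM reach").fetchone()[0]
        # One JOIN pass: pairs known to be connected are extended by one edge.
        conn.execute(
            "INSERT OR IGNORE INTO reach "
            "SELECT r.a, e.b FROM reach r JOIN edge e ON r.b = e.a"
        )
        passes += 1
        after = conn.execute("SELECT COUNT(*) FROM reach").fetchone()[0]
        if after == before:
            break
    conn.close()
    return passes, after

# A path x0 - x1 - ... - x7 of eight elements (a tiny stand-in for a million).
path_edges = [(i, i + 1) for i in range(7)]
passes, rows = reach_by_repeated_joins(path_edges)
```

For this eight-element path the loop executes 7 JOIN passes and the `reach` table grows from 14 rows to 64, i.e., one row for every ordered pair of elements. Both figures grow with the length of the path, which is why the approach does not scale to a path of a million elements.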
It is desirable to provide analytical approaches for partitioning large data sets in a database into connectivity components that avoid the foregoing and other problems with known approaches, and it is to these ends that the present invention is directed.