Field
The present disclosure relates to graph search. More specifically, this disclosure relates to a method and system for parallel processing of graphs.
Related Art
Graphs emerge in many analytics applications. An important class of graphs is bipartite graphs, in which the set of vertices can be divided into two disjoint sets U and V such that every edge connects a vertex in U to a vertex in V. Because there is no edge between two vertices in U or between two vertices in V, a bipartite graph does not contain any odd-length cycles.
Formally, a bipartite graph G is (U, V, E), where every vertex is in either U or V, and U∩V=∅. E is the set of edges, where an edge e∈E of the form (u, v) exists if and only if there is a directed edge from vertex u to vertex v in G. In this case, u is the source vertex of e and also a predecessor of v, and v is the destination vertex of e and also a successor of u. If G is undirected, then ∀(u, v)∈E→(v, u)∈E. If |U|=|V|, then G is called a balanced bipartite graph. FIG. 1A illustrates an example of a directed bipartite graph 10 in which only vertices 12, 14, 16, 18, and 20 in U can be the source vertex of an edge, and only vertices 22, 24, 26, and 28 in V can be the destination vertex of an edge.
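As a minimal illustrative sketch (not part of the disclosed method), the definitions above can be checked mechanically. The function names and the edge list below are hypothetical; only the vertex labels are taken from FIG. 1A, because the text does not enumerate the edges of graph 10.

```python
def is_bipartite_partition(U, V, E):
    """Return True if U and V are disjoint and every edge joins U and V."""
    U, V = set(U), set(V)
    if U & V:
        return False
    return all((u in U and v in V) or (u in V and v in U) for u, v in E)


def is_undirected(E):
    """Return True if every edge (u, v) has a matching reverse edge (v, u)."""
    E = set(E)
    return all((v, u) in E for u, v in E)


def is_balanced(U, V):
    """Return True if |U| = |V|."""
    return len(set(U)) == len(set(V))


# Vertex labels follow FIG. 1A; the edges are assumed for illustration only.
U = [12, 14, 16, 18, 20]
V = [22, 24, 26, 28]
E = [(12, 22), (14, 22), (16, 24), (18, 26), (20, 28)]

valid = is_bipartite_partition(U, V, E)   # every edge goes from U to V
undirected = is_undirected(E)             # no reverse edges are present
balanced = is_balanced(U, V)              # |U| = 5 and |V| = 4
```

Under these assumed edges, the partition is valid, while the graph is neither undirected nor balanced.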
Bipartite graphs are natural models of many real-world phenomena. In one example, the set of vertices in U of FIG. 1A can model a set of customers, and the set of vertices in V can model a set of products. An edge between u∈U and v∈V can represent that customer u bought product v. One can further analyze such bipartite graphs to derive valuable insights, such as finding the right product(s) to recommend based on the purchase history of the customer(s).
A significant challenge to efficient analytics on bipartite graphs is search, which becomes harder as the number of vertices and edges increases. Fortunately, bipartite graphs usually contain a great deal of structure that can be leveraged to speed up the computation. For example, only even-length cycles can exist in a bipartite graph. Furthermore, a vertex u∈U can only lead to a vertex v∈V, and vice versa. However, the structure of a bipartite graph can also lead to computational inefficiencies if the search algorithm does not exploit it properly.
As an example, consider a parallel search application that divides the vertices of a bipartite graph into a number of regions such that each region contains roughly (|U|+|V|)/P vertices (where P is the number of parallel processors), and assigns vertices in the same region to the same processor. Although the goal is to keep all the processors busy during search, such a static vertex-to-processor assignment strategy may not work well for bipartite graphs. For example, if the search starts with a vertex or vertices in either U or V (but not both), then at any single traversal step it can traverse either edges of the form (u, v) or edges of the form (v, u), where u∈U and v∈V, but not both. Note that a traversal step is an operation in which an application or system determines the successor vertex v of a predecessor vertex u by analyzing the edge leading from u to v. This implies that one of the following two conditions must hold in a single traversal step:
    1. All edges with a source vertex u∈U are not eligible for traversal, or
    2. All edges with a source vertex v∈V are not eligible for traversal.
In other words, no matter how the set of vertices in U∪V is divided and subsequently assigned to processors, there is bound to be a subset of vertices that is guaranteed not to generate any useful computation in a single traversal step, which reduces the parallel efficiency of search on a bipartite graph. Vertices that do not have any successors in a given traversal step are called idle vertices. Note that whether a vertex is idle usually depends on the direction of traversal, although a vertex without any neighbors (i.e., no predecessors or successors) must be idle regardless of the traversal direction.
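The notions of a traversal step and an idle vertex can be illustrated with a short sketch. The graph, the dictionary-based successor maps, and the function names below are assumptions made for illustration and do not appear in the disclosure.

```python
def traversal_step(frontier, successors):
    """Return the set of successors reached from the frontier in one step."""
    next_frontier = set()
    for u in frontier:
        next_frontier.update(successors.get(u, ()))
    return next_frontier


def idle_vertices(vertices, successors):
    """Return the vertices that have no successors in this traversal direction."""
    return {x for x in vertices if not successors.get(x)}


# Hypothetical directed bipartite graph with U = {0, 1, 2, 3} and V = {4, 5}.
U = {0, 1, 2, 3}
V = {4, 5}
forward = {0: [4], 1: [4, 5], 2: [5]}   # U -> V edges; vertex 3 has no neighbors
backward = {4: [0, 1], 5: [1, 2]}       # the same edges reversed, V -> U

idle_forward = idle_vertices(U | V, forward)     # all of V, plus isolated vertex 3
idle_backward = idle_vertices(U | V, backward)   # all of U
reached = traversal_step({0, 1}, forward)        # lands entirely in V
```

In this sketch every vertex in V is idle in the U-to-V direction and every vertex in U is idle in the V-to-U direction, while the isolated vertex 3 is idle in both.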
To see how idle vertices can affect parallel search, consider the case where U represents 1,000,000 customers, and V represents 1,000 products. For simplicity, assume the customer vertices are numbered #0 to #999,999, and the product vertices #1,000,000 to #1,000,999. Suppose there are 1,000 processors available, and the task is to identify the customers who have bought at least one product in the past. If the bipartite structure of the graph is ignored, then a parallel search application will divide the entire set of 1,000,000 customer vertices plus 1,000 product vertices into 1,000 regions, each of which contains (1,000,000+1,000)/1,000=1,001 vertices. This means the application assigns the first processor to process vertices #0 through #1,000, the second to process vertices #1,001 through #2,001, and the last (the 1,000th processor) to process vertices #999,999 through #1,000,999. But only the last processor would do useful computation in this case, because the application assigns all the other 999 processors to idle vertices that represent customers, yet only product vertices can generate successors in the traversal direction from V (products) to U (customers). Ironically, the application even assigns to the last processor an idle vertex (#999,999) that represents the last customer, which does not need to be included in any product-to-customer traversal. Since only 1 out of 1,000 processors is doing useful work, no speed-up is achieved and the parallel efficiency is only 1/1,000=0.1%.
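The arithmetic in this example can be reproduced with a brief sketch. It assumes contiguous, equal-sized regions and counts a processor as busy if its region contains at least one product vertex; the constant and function names are hypothetical.

```python
NUM_CUSTOMERS = 1_000_000   # vertex IDs #0 .. #999,999
NUM_PRODUCTS = 1_000        # vertex IDs #1,000,000 .. #1,000,999
NUM_PROCESSORS = 1_000


def static_regions(num_vertices, num_processors):
    """Split IDs 0..num_vertices-1 into contiguous regions of near-equal size."""
    size = -(-num_vertices // num_processors)  # ceiling division
    return [
        range(min(p * size, num_vertices), min((p + 1) * size, num_vertices))
        for p in range(num_processors)
    ]


def has_product(region):
    """True if the region holds at least one product (non-idle) vertex."""
    return len(region) > 0 and region[-1] >= NUM_CUSTOMERS


regions = static_regions(NUM_CUSTOMERS + NUM_PRODUCTS, NUM_PROCESSORS)

# In a product-to-customer traversal, only product vertices generate successors.
busy = sum(1 for r in regions if has_product(r))
efficiency = busy / NUM_PROCESSORS

# Customer (idle) vertices that end up in the last processor's region.
last = regions[-1]
idle_in_last = sum(1 for v in last if v < NUM_CUSTOMERS)
```

Under these assumptions, `busy` is 1, `efficiency` is 0.001 (0.1%), and `idle_in_last` is 1, matching the figures in the text.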
The above example shows that the structure of a bipartite graph can be a problem if it is not leveraged properly by the parallel search application. In one approach, if the parallel search application mixes the product vertices with customer vertices in a single unified range between #0 and #1,000,999, then the parallel efficiency could be higher. However, the parallel efficiency is still not 100%, unless it so happens that there is exactly one product vertex mixed with every 1,000 customers. That is, the first 1,000 vertices (#0 to #999) are customers, followed by a product vertex (#1,000), then another 1,000 customers (#1,001 to #2,000) followed by another product vertex (#2,001), and so on. Moreover, mixing the IDs of one type of vertices with the IDs of another type of vertices may compromise the original structure of the graph and cause additional time and space overhead in managing the mapping from vertex IDs to types. In light of these drawbacks, a better approach is desired.