1. Field of the Invention
This invention relates in general to reachability between two vertices in a graph, and more particularly to generation and efficient storage of information pertaining to the existence of a path between a set of vertices, but not the exact path.
2. Description of the Related Art
Knowing whether a path exists that connects one node in a network to a second node in the network is fundamental to a wide range of applications, including XML indexing; geographic navigation; Internet routing; ontology queries based on the Resource Description Framework (RDF), a family of specifications for a metadata model that is often implemented as an application of XML; ontology queries based on the Web Ontology Language (OWL), a markup language for publishing and sharing data using ontologies on the Internet; and many others. For example, for XML documents, reachability queries are the most basic operation in performing joins and other advanced queries, so fast processing is essential. It is therefore of great importance that reachability queries be carried out efficiently.
Given an n-vertex, m-edge directed graph, there are currently two basic approaches to handling reachability queries. One is to use a single-source shortest path algorithm; that is, for any two vertices, the shortest path algorithm is used to determine whether the second vertex is reachable from the first. This approach may take O(m) query time, but requires no extra data structure besides the graph itself for answering reachability queries. In this description, the notation O(x) denotes the order of growth of a quantity, such as processing time or storage, relative to the parameter “x”. The other approach is to compute and store the transitive closure of the graph. It answers a reachability query in constant time but needs O(n^2) space to store the transitive closure of an n-vertex graph. Many applications involve massive graphs, yet require fast answering of reachability queries. Such considerations make both basic approaches unattractive.
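By way of illustration only, the two basic approaches can be sketched as follows. The sketch assumes an unweighted graph stored as an adjacency dictionary and uses breadth-first search as the single-source algorithm; the function names and the small example graph are hypothetical and do not appear in the prior art discussed here.

```python
from collections import deque


def reachable_by_search(adj, source, target):
    """Answer one query by breadth-first search: O(m) time, no extra index."""
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return True
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return False


def transitive_closure(adj, vertices):
    """Precompute, for every vertex, the set of vertices it reaches.

    The stored sets can hold up to n * n entries in total.
    """
    closure = {u: {u} for u in vertices}  # each vertex reaches itself
    for u in vertices:
        stack = [u]
        while stack:
            x = stack.pop()
            for v in adj.get(x, ()):
                if v not in closure[u]:
                    closure[u].add(v)
                    stack.append(v)
    return closure


# Hypothetical example graph: d -> a -> b -> c
adj = {"a": ["b"], "b": ["c"], "c": [], "d": ["a"]}
vertices = ["a", "b", "c", "d"]
closure = transitive_closure(adj, vertices)

# Query by search (no index) versus query by lookup (precomputed index).
search_answer = reachable_by_search(adj, "d", "c")
lookup_answer = "c" in closure["d"]
```

The search answers each query from scratch, while the lookup is an expected constant-time set membership test paid for by storing every reachable pair in advance.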
Several approaches have been proposed to encode graph reachability information using vertex labeling schemes. A labeling scheme assigns labels to vertices in the graph, and it answers reachability queries by comparing the labels of the vertices.
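As a simple illustration of a labeling scheme, the following sketch assigns interval labels to the nodes of a tree and answers a query by comparing two labels. It assumes the tree is given as a dictionary of child lists; the function names and the example tree are hypothetical.

```python
def interval_labels(children, root):
    """Label each tree node with (start, end) from a depth-first traversal.

    start is the node's own visit number; end is the largest visit
    number found in its subtree.
    """
    labels = {}
    counter = 0

    def visit(node):
        nonlocal counter
        start = counter
        counter += 1
        for child in children.get(node, ()):
            visit(child)
        labels[node] = (start, counter - 1)

    visit(root)
    return labels


def reaches(labels, u, v):
    """u reaches v exactly when v's interval lies inside u's interval."""
    (u_start, u_end), (v_start, v_end) = labels[u], labels[v]
    return u_start <= v_start and v_end <= u_end


# Hypothetical example tree: root has children a and b; a has child c.
children = {"root": ["a", "b"], "a": ["c"]}
labels = interval_labels(children, "root")
```

Each query costs two comparisons regardless of tree size, which is why the approach suits tree structures.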
Although interval-based labeling is best for tree structures, reachability queries may take O(m) time using the interval-based approach for graphs. One known result shows that, for sparse graphs, a sophisticated graph labeling method, called 2-hop, can answer reachability queries efficiently (although not in constant time) using much less storage. This result is significant because massive graphs typically are sparse. However, 2-hop labeling itself may incur a tremendous computation cost. For instance, XML documents are actually a form of graph, as they contain reference links. The 2-hop labeling approach cannot efficiently handle XML graphs, as they require exponential label sizes as the graph size increases. Each 2-hop label has an average length of O(m^(1/2)), which means that answering a reachability query requires O(m^(1/2)) comparisons. In at least one instance, it took a 64-bit processor, 80-Gb memory Sun server more than 45 hours to label the well-known DBLP dataset using the 2-hop method. Clearly, the labeling process is often too time-consuming for such methods to be used in practice on massive graphs.
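For illustration only, a 2-hop query can be sketched as below. The sketch assumes the commonly described formulation in which each vertex carries two label sets, here called l_out (vertices it reaches) and l_in (vertices that reach it), and a query succeeds when the two sets share a vertex. The names and the hand-built labels for a three-vertex chain are hypothetical, and the costly step of computing compact labels is not shown.

```python
def two_hop_reaches(l_out, l_in, u, v):
    """u reaches v if some vertex appears in both l_out[u] and l_in[v]."""
    return not l_out[u].isdisjoint(l_in[v])


# Hypothetical hand-built labels for the chain a -> b -> c,
# with b serving as the intermediate hop between a and c.
l_out = {"a": {"a", "b"}, "b": {"b"}, "c": {"c"}}
l_in = {"a": {"a"}, "b": {"b"}, "c": {"b", "c"}}

answer_a_to_c = two_hop_reaches(l_out, l_in, "a", "c")
answer_c_to_a = two_hop_reaches(l_out, l_in, "c", "a")
```

The work per query grows with the size of the label sets being compared, which is consistent with the comparison count stated above.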
In general, labeling can be a costly process in terms of time and is impractical for massive graphs. Accordingly, a need exists to overcome the difficulties with determining reachability between two given nodes in a sparse graph of large size.