Data may commonly be modeled as a graph. A graph typically represents a number of entities as a set of respective vertices. Vertices may represent any of a variety of entities, e.g., people, objects, or geographical locations. A graph includes edges connecting at least some of the vertices, where the edges represent relationships between the entities represented by the vertices. An edge may be assigned a value or weight, further modeling the relationship between the entities. In an example, a graph may represent a geographical region, where vertices represent cities or other locations in the region. An edge between two vertices may represent physical connectivity between the represented locations, e.g., the fact that one can drive from one of the locations to the other. Such an edge may be assigned a value corresponding to driving distance or travel time, for example. In another example, a graph may be used to model a social network, where each person is represented by a vertex. An edge between two vertices represents a relationship between the represented people, e.g., a friendship, a familial relationship, or a business relationship. Such an edge may have an associated weight. For example, the weight may represent a degree of friendship or a level of trust between the persons represented by the connected vertices.
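By way of illustration only, a weighted graph of the kind described above might be held in memory as a mapping from each vertex to its neighbors and the associated edge weights. The following is a minimal sketch; the function name build_graph, the vertex names, and the distances are hypothetical and are not taken from any particular system.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected weighted graph as a dict of dicts.

    `edges` is an iterable of (vertex_a, vertex_b, weight) triples.
    graph[a][b] holds the weight of the edge between a and b.
    """
    graph = defaultdict(dict)
    for a, b, weight in edges:
        graph[a][b] = weight
        graph[b][a] = weight  # undirected: store both directions
    return graph


# Hypothetical road network: vertices are locations, weights are distances.
roads = [("A", "B", 5), ("B", "C", 7), ("A", "C", 15)]
graph = build_graph(roads)
```

In a social-network setting the same structure could be used with persons as vertices and, for example, a trust level as the weight.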
A common problem is the discovery of relationships and patterns in data that may be represented by a graph. Such a relationship or pattern may be modeled as a path in the graph. A query as to the presence of such a path is sometimes called a regular path query (RPQ). The significance of regular path queries for data modeled as graphs has grown steadily over the past decade. RPQs often appear in restricted forms as part of graph-oriented query languages such as XQuery/XPath and SPARQL, and have applications in areas such as semantic, social, and biomedical networks.
No method has yet been developed that is capable of efficiently evaluating general RPQs on large graphs, i.e., graphs with millions of nodes or edges. Existing systems for evaluating RPQs are restricted in the type of graph (e.g., only trees), the type of regular expression (e.g., only single steps), and/or the size of the graphs they can handle.
Only limited solutions to this problem currently exist. One solution is to transform both the query and the graph into automata and compute their intersection. This approach has many disadvantages, mostly in scalability and concurrency. The transformation can be time-consuming, and it is very hard to reuse a transformation of the graph once the graph has changed.
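The intersection idea described above can be sketched as a search over pairs of (graph vertex, automaton state). The example below is a simplified illustration and not the implementation of any existing system: it explores the product lazily with a breadth-first search rather than materializing both automata up front, and it assumes the query has already been compiled into a nondeterministic finite automaton. The names evaluate_rpq, graph, and delta, as well as the sample data, are hypothetical.

```python
from collections import deque


def evaluate_rpq(graph, nfa_start, nfa_accept, nfa_delta, source):
    """Return vertices reachable from `source` along a path whose
    sequence of edge labels is accepted by the query automaton.

    graph:      dict mapping vertex -> list of (label, target) edges
    nfa_delta:  dict mapping (state, label) -> set of next states
    """
    start = (source, nfa_start)
    seen = {start}
    queue = deque([start])
    answers = set()
    while queue:
        vertex, state = queue.popleft()
        if state in nfa_accept:
            answers.add(vertex)
        # Advance the graph and the automaton in lockstep on each label.
        for label, target in graph.get(vertex, []):
            for next_state in nfa_delta.get((state, label), ()):
                pair = (target, next_state)
                if pair not in seen:
                    seen.add(pair)
                    queue.append(pair)
    return answers


# Hypothetical edge-labeled social graph.
graph = {
    "ann": [("friend", "bob")],
    "bob": [("friend", "cat"), ("colleague", "dan")],
    "cat": [("colleague", "eve")],
}
# Hand-built automaton for the expression: friend+ colleague
delta = {(0, "friend"): {1}, (1, "friend"): {1}, (1, "colleague"): {2}}
result = evaluate_rpq(graph, 0, {2}, delta, "ann")
```

The set of visited pairs can grow to the number of vertices multiplied by the number of automaton states, which hints at the scalability concern noted above.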
Another approach exploits the fact that not all labels in a graph are equally frequent. The approach consists of an algorithm that decomposes an RPQ into a series of smaller RPQs using rare labels, i.e., elements of the query with few matches, as way-points. A search is thereby decomposed into a set of smaller search problems. These smaller searches can be parallelized, but the approach is nonetheless infeasible when there are few labels.
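The way-point idea can be illustrated for a deliberately restricted case in which the query is a fixed sequence of edge labels rather than a general regular expression. The sketch below is an assumption-laden simplification and not the published algorithm: it picks the least frequent label in the query as the way-point, then searches backward and forward from each edge carrying that label. The names follow and waypoint_search and the sample edges are hypothetical.

```python
from collections import Counter


def follow(adjacency, vertices, labels):
    """Advance a set of vertices along a fixed sequence of edge labels."""
    frontier = set(vertices)
    for label in labels:
        frontier = {
            target
            for vertex in frontier
            for edge_label, target in adjacency.get(vertex, [])
            if edge_label == label
        }
    return frontier


def waypoint_search(edges, query):
    """Return (start, end) pairs joined by a path spelling `query`.

    edges: list of (source, label, target) triples
    query: non-empty list of labels (a restricted RPQ)
    """
    forward, backward = {}, {}
    for u, label, v in edges:
        forward.setdefault(u, []).append((label, v))
        backward.setdefault(v, []).append((label, u))
    counts = Counter(label for _, label, _ in edges)
    # Position of the rarest query label, used as the way-point.
    i = min(range(len(query)), key=lambda k: counts[query[k]])
    prefix, suffix = query[:i], query[i + 1:]
    pairs = set()
    for u, label, v in edges:
        if label != query[i]:
            continue
        # Each way-point edge is an independent sub-search.
        starts = follow(backward, {u}, list(reversed(prefix)))
        ends = follow(forward, {v}, suffix)
        pairs.update((s, t) for s in starts for t in ends)
    return pairs


edges = [
    ("a", "knows", "b"), ("b", "knows", "c"), ("a", "knows", "c"),
    ("c", "cites", "d"), ("d", "knows", "e"),
]
pairs = waypoint_search(edges, ["knows", "cites", "knows"])
```

Because each way-point edge is processed independently, the loop over way-point edges is the part that could be distributed across workers. When every label in the query is frequent, the number of way-point edges is large and the decomposition yields little benefit.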
In the drawings, the leftmost digit(s) of a reference number may identify the drawing in which the reference number first appears.