Graphs have become increasingly important in modeling complicated structures such as proteins (in bioinformatics), chemical compounds (in chemical informatics), circuits, and schema-less data (e.g., data relating to images and Extensible Markup Language (XML) documents). Essentially, any kind of data can be represented by a graph. For example, graphs are used in computer vision applications to represent complex relationships, such as the organization of entities in an image, in order to identify objects and scenes.
In chemical informatics, graphs are used to represent chemical compounds. For example, Daylight 4.82, a product of Daylight Chemical Information Systems, Inc., uses graphs to perform screening, design, and knowledge discovery on individual chemical compounds or on molecular databases.
Efficiency in processing graph queries and retrieving related graphs is critical to the success of many graph-related applications. For example, a query may center on finding all graphs in database D={g1, g2, . . . , gn} which contain a subgraph, q. Performing a sequential scan on D and checking each graph for q would be inefficient, because every check is a subgraph isomorphism test, a problem that is NP-complete in general, and the scan repeats that test once for each of the n graphs.
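The sequential scan described above can be sketched as follows. This is a minimal illustration only, not a method taken from any cited work: the graph representation (a vertex-label dictionary plus a set of undirected edges), the brute-force containment test, and the names make_graph, contains_subgraph, and sequential_scan are all assumptions introduced for this example.

```python
from itertools import permutations

def make_graph(labels, edge_list):
    """Hypothetical representation: (vertex -> label, set of undirected edges)."""
    return labels, {frozenset(e) for e in edge_list}

def contains_subgraph(g, q):
    """Brute-force test of whether g contains q as a (non-induced) subgraph."""
    g_labels, g_edges = g
    q_labels, q_edges = q
    q_vertices = list(q_labels)
    # Try every injective assignment of query vertices to graph vertices.
    for image in permutations(g_labels, len(q_vertices)):
        mapping = dict(zip(q_vertices, image))
        # Vertex labels must agree under the mapping.
        if any(q_labels[v] != g_labels[mapping[v]] for v in q_vertices):
            continue
        # Every query edge must map onto an edge of g.
        if all(frozenset(mapping[v] for v in e) in g_edges for e in q_edges):
            return True
    return False

def sequential_scan(database, q):
    """Check every graph in the database against q, one by one."""
    return [name for name, g in database.items() if contains_subgraph(g, q)]

# Toy database D = {g1, g2, g3} with atom-like vertex labels.
D = {
    "g1": make_graph({1: "C", 2: "C", 3: "C"}, [(1, 2), (2, 3), (1, 3)]),
    "g2": make_graph({1: "C", 2: "C", 3: "O"}, [(1, 2), (2, 3)]),
    "g3": make_graph({1: "C", 2: "O"}, [(1, 2)]),
}
q = make_graph({"a": "C", "b": "O"}, [("a", "b")])  # a single C-O edge

matches = sequential_scan(D, q)
print(matches)  # ['g2', 'g3']
```

The containment test enumerates vertex assignments, so its cost grows factorially with the size of the graphs, and the scan pays that cost for every graph in D whether or not the graph could possibly match.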
Several techniques have been developed to eliminate the need for performing a sequential scan. For example, various indexing methods have been developed to process XML queries, a simple kind of graph query built around path expressions. See, for example, R. Goldman et al., Dataguides: Enabling Query Formulation and Optimization in Semistructured Data, VLDB'97, Proceedings of 23rd International Conference on Very Large Data Bases; T. Milo et al., Index Structures for Path Expressions, LECTURE NOTES IN COMPUTER SCIENCE, 1540:277-295 (1999); B. Cooper et al., A Fast Index for Semistructured Data, VLDB Conference (2001); R. Kaushik et al., Exploiting Local Similarity for Efficient Indexing of Paths in Graph Structured Data, ICDE (2002); J. Min et al., Apex: An Adaptive Path Index for XML Data, SIGMOD (2002); D. Shasha et al., Algorithmics and Applications of Tree and Graph Searching, PODS (2002) (hereinafter “Shasha”); Q. Chen et al., D(k)-index: An Adaptive Structural Summary for Graph-Structured Data, SIGMOD (2003).
The above indexing techniques break each query down into paths, look up each path separately to find the graphs containing it, and join the results. These methods take the path as the basic indexing unit and are therefore suited to path expressions and tree-structured data. Path-based indexing methods, however, have notable disadvantages. First, structural information may be lost when a query is broken down into paths: a graph can contain every path of the query without containing the query itself. As a result, numerous false positive answers are likely to be returned, which makes path-based indexing methods unsuitable for complex graph queries. Second, because these methods index individual paths, the sheer number of paths present in a typical database would make path-based indexing largely impractical (paths may be compressed, but compression typically increases the number of false positives).
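The loss of structural information can be seen in a small sketch of path-based filtering. The code below is an illustration under stated assumptions, not the algorithm of any cited reference: it indexes each graph by the label sequences of its simple paths up to a hypothetical length bound max_edges, and the names label_paths, adjacency, build_path_index, and path_filter are invented for this example.

```python
from collections import defaultdict

def adjacency(labels, edge_list):
    """Build an undirected adjacency map from an edge list."""
    adj = {v: set() for v in labels}
    for u, v in edge_list:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def label_paths(labels, adj, max_edges):
    """Label sequences of all simple paths with at most max_edges edges."""
    found = set()

    def extend(path):
        found.add(tuple(labels[v] for v in path))
        if len(path) - 1 == max_edges:
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:  # keep the path simple
                extend(path + [nxt])

    for start in labels:
        extend([start])
    return found

def build_path_index(database, max_edges):
    """Map each label path to the names of the graphs that contain it."""
    index = defaultdict(set)
    for name, (labels, edge_list) in database.items():
        for p in label_paths(labels, adjacency(labels, edge_list), max_edges):
            index[p].add(name)
    return index

def path_filter(index, query, max_edges):
    """Break the query into paths, look each up, and join by intersection."""
    labels, edge_list = query
    candidates = None
    for p in label_paths(labels, adjacency(labels, edge_list), max_edges):
        hits = index.get(p, set())
        candidates = hits if candidates is None else candidates & hits
    return sorted(candidates or [])

triangle = ({1: "A", 2: "A", 3: "A"}, [(1, 2), (2, 3), (1, 3)])
chain = ({1: "A", 2: "A", 3: "A"}, [(1, 2), (2, 3)])
edge = ({1: "A", 2: "A"}, [(1, 2)])
database = {"triangle": triangle, "chain": chain, "edge": edge}

index = build_path_index(database, max_edges=2)
candidates = path_filter(index, triangle, max_edges=2)
print(candidates)  # ['chain', 'triangle']
```

The query here is the triangle, yet the three-vertex chain survives the filter because it contains every label path of the triangle (A, A-A, and A-A-A). The chain has only two edges and so cannot contain a three-edge cycle; it is a false positive that would have to be eliminated by a separate verification step.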
Therefore, indexing techniques are needed for performing accurate and efficient graph queries.