The invention relates in general to database systems, and in particular, to a method and apparatus for indexing and efficiently querying relations referencing semistructured data in a database system.
Overview of the Related Art
Semistructured data is described using basic graph theory. Atomic or object values are referred to as nodes, and the structure is represented as a graph or as a function mapping each node to a subset of nodes. The term semistructured data is misleading in many cases, but nevertheless appears to be accepted. On the one hand, it refers to data that is easily imported into a traditional relational database. On the other hand, the schema used to store it is usually not very efficient or intuitive when analyzing its content. For example, a text column storing program code does not reveal much of the functionality, in other words the structure, of the programs stored in the column.
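The graph view described above, in which the structure is a function mapping each node to a subset of nodes, can be illustrated with a minimal sketch. The node names, the `Graph` type alias, and the helper functions `successors` and `reachable` are hypothetical and chosen only for illustration; they are not taken from any particular database system.

```python
from typing import Dict, Set

# Assumption: nodes are identified by strings; the structure is a mapping
# from each node to the subset of nodes it points to.
Graph = Dict[str, Set[str]]

# Hypothetical example data, not drawn from any real dataset.
graph: Graph = {
    "a": {"b", "c"},
    "b": {"d"},
    "c": {"d"},
    "d": set(),
}


def successors(g: Graph, node: str) -> Set[str]:
    """Return the subset of nodes that `node` maps to (empty if unknown)."""
    return g.get(node, set())


def reachable(g: Graph, start: str) -> Set[str]:
    """Return every node that can be reached from `start` by following edges."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        current = stack.pop()
        for nxt in successors(g, current):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen
```

In this sketch, a query such as "all nodes below a given node" reduces to computing `reachable(graph, node)`, which is the kind of traversal that a flat relational schema does not express directly.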
Semistructured data, such as cyclic and acyclic digraphs, is frequently used in the natural and life sciences. Large sets of measurements, many generated by automated processes and robots, reference some of these digraphs. In particular, this is the case in research relating to genomics, proteomics and biology in general. The graphs describe, for example, enzyme, gene and protein interactions, gene relations, gene locations, molecular functions, biological processes and cellular components. Most of the graphs are neither regular nor hierarchical tree structures and are not adequately supported in current database systems.
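To make the distinction between cyclic digraphs, acyclic digraphs and hierarchical trees concrete, the following sketch checks two properties of a small digraph: whether it contains a cycle, and whether any node has more than one parent (which rules out a tree). The example graphs and the function names `has_cycle` and `max_in_degree` are hypothetical illustrations, not part of any system described here.

```python
from typing import Dict, Set

# Assumption: a digraph is a mapping from each node to its set of successors.
Graph = Dict[str, Set[str]]


def has_cycle(g: Graph) -> bool:
    """Detect a directed cycle using depth-first search with three colors."""
    WHITE, GREY, BLACK = 0, 1, 2
    color: Dict[str, int] = {}

    def visit(node: str) -> bool:
        color[node] = GREY  # node is on the current DFS path
        for nxt in g.get(node, set()):
            state = color.get(nxt, WHITE)
            if state == GREY:  # back edge to the current path: cycle found
                return True
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK  # node and all its descendants are finished
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in g)


def max_in_degree(g: Graph) -> int:
    """Return the largest number of parents of any node (0 if no edges)."""
    in_degree: Dict[str, int] = {}
    for successors in g.values():
        for nxt in successors:
            in_degree[nxt] = in_degree.get(nxt, 0) + 1
    return max(in_degree.values(), default=0)


# Hypothetical acyclic digraph in which "leaf" has two parents, so it is
# not a hierarchical tree.
dag: Graph = {
    "process": {"binding", "transport"},
    "binding": {"leaf"},
    "transport": {"leaf"},
    "leaf": set(),
}

# Hypothetical cyclic digraph, e.g. two entities that reference each other.
cyclic: Graph = {"x": {"y"}, "y": {"x"}}
```

Here `dag` is acyclic, yet one of its nodes has two parents, so it cannot be stored as a simple parent-child hierarchy; `cyclic` cannot be ordered hierarchically at all.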
Semistructured data of another kind includes trees in the form of XML documents. XML documents are sometimes mapped to structured relational schemas in relational databases or kept in a format representing the trees directly in native XML database systems. Semistructured data is also evident on the internet where web pages reference each other in different ways.
Scientific, governmental and industry consortiums generate standards in the form of digraphs such as the Gene Ontology digraph, the ICD-9 and ICD-10 medical naming conventions, SNOMED and so on. Data is then associated with these classifications and a complex semistructured dataset emerges. Genealogy records may be considered semistructured, and moreover scientific work relating to the exploration of the human and other genomes has produced massive amounts of data that cross-reference complex graphs and structures.
Indexing of semistructured tree data is being addressed by all the major database vendors in one form or another, as is evident both in the DB2 database system from IBM and in Oracle's database system. Particular emphasis is placed on efficiently indexing XML documents and on efficiently accessing heterogeneous datasets with little or no schema structure. Many research projects have also addressed indexing of semistructured data, and some are described in the book “Data on the Web: From Relations to Semistructured Data and XML” by Serge Abiteboul, Peter Buneman and Dan Suciu, published by Morgan Kaufmann Publishers, 2000. The book also contains numerous references to projects involving semistructured data.
The patent by Chang et al. (U.S. Pat. No. 6,240,407 B1, Method and Apparatus for Creating an Index in a Database System) describes document abstractions and summarization. The patent by Cheng et al. (U.S. Pat. No. 6,421,656 B1, Method and Apparatus for Creating Structure Indexes for a Data Base Extender) describes methods for storing and querying structured documents internally as large objects or externally as files. The patent by Srinivasan et al. (U.S. Pat. No. 5,893,104, Method and System for Processing Queries in a Database System using Index Structures that are not Native to the Database System) describes registering and generating routines for managing non-native index structures. The patent application by Shadmon et al. (U.S. 2002/0120598 A1, Encoding Semi-Structured Data for Efficient Search and Browse) describes indexing techniques used to encode XML tree data into strings that enable indexing of the XML data. The patent by Bello et al. (U.S. Pat. No. 6,477,525 B1, Rewriting a Query in Terms of a Summary Based on One-to-One and One-to-Many Losslessness of Joins) describes query rewriting methods for utilizing materialized views for aggregation.