1. Field of the Invention
The present invention generally relates to the field of database management and, in particular, to a method and a device for optimizing queries over a vertically stored database.
2. Description of the Related Art
RDF (Resource Description Framework) was proposed as a language for describing metadata, particularly metadata about resources on the World Wide Web, where each resource can be identified by a URI (Uniform Resource Identifier). With the progress of semantic Web technologies, RDF was recommended as a standard by the W3C (World Wide Web Consortium) for exchanging information among multiple resources in semantic Web applications.
FIG. 1 illustrates an example of an RDF data storage table. As illustrated in FIG. 1, RDF data is generally stored as a set of triples, each of which is composed of a subject, a property, and an object. For example, the triple in the second row of the storage table illustrated in FIG. 1 represents the fact “SXZ is a professor,” wherein “SXZ” is the subject, “typeOf” is the property, and “Professor” is the object. The triple in the fifth row of the storage table illustrated in FIG. 1 represents the fact “SXZ teaches course 1,” wherein “SXZ” is the subject, “Teach” is the property, and “Course1” is the object. In current RDF, the values of subjects and properties must ultimately be resolved to uniform resource identifiers (URIs). The values of objects may be either uniform resource identifiers or literal values such as numbers or character strings.
Such an RDF data storage structure is essentially a vertical data storage structure. That is, each item represents only a single simple fact, such as the item “SXZ is a professor” and the item “SXZ teaches course 1.” By contrast, a legacy relational database is a horizontal data storage structure. That is, each item represents all relationships of the subject, such as the item “SXZ is a professor and teaches course 1.”
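The contrast between the two layouts can be sketched in Python as follows. This is a minimal illustration, assuming only the two triples quoted above; the names `triples` and `to_horizontal` are hypothetical and are not part of any described system.

```python
from collections import defaultdict

# Vertical layout: one (subject, property, object) triple per simple fact.
triples = [
    ("SXZ", "typeOf", "Professor"),
    ("SXZ", "Teach", "Course1"),
]


def to_horizontal(rows):
    """Group triples by subject so each record holds all facts about it."""
    records = defaultdict(dict)
    for subject, prop, obj in rows:
        # A subject may have several objects per property, so keep a list.
        records[subject].setdefault(prop, []).append(obj)
    return dict(records)


# Horizontal layout: one record per subject, one field per property.
horizontal = to_horizontal(triples)
print(horizontal)
```

In the vertical form, adding a new property requires only a new row; in the horizontal form it requires a new field on the record.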
RDF triples may also be presented as a graph, in which the subject may be represented by the source node in the graph, the object indicating a URI may be represented by the sink node in the graph, and the property may be represented by a directed edge connecting the source node to the sink node. A subject may of course be related to multiple objects. For example, the triples in FIG. 1 may be represented by a directed graph as illustrated in FIG. 2. For a complete description of RDF, see Frank Manola and Eric Miller, “RDF primer. W3C recommendation, February 2004,” available at http://www.w3.org/TR/rdf-primer. The entire contents of the RDF Primer are hereby incorporated by reference.
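The graph view described above can be sketched as an adjacency list. This is a minimal sketch, assuming the two triples quoted earlier in the text; `build_graph` is a hypothetical helper name, and for simplicity the sketch treats every object as a sink node.

```python
from collections import defaultdict

# The two triples quoted in the text (subject, property, object).
triples = [
    ("SXZ", "typeOf", "Professor"),
    ("SXZ", "Teach", "Course1"),
]


def build_graph(rows):
    """Map each source node to its outgoing (property, sink node) edges."""
    graph = defaultdict(list)
    for subject, prop, obj in rows:
        # The property labels the directed edge from subject to object.
        graph[subject].append((prop, obj))
    return dict(graph)


graph = build_graph(triples)
for prop, sink in graph["SXZ"]:
    print(f"SXZ --{prop}--> {sink}")
```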
Undoubtedly, RDF is becoming a cornerstone of the semantic Web. In order to perform semantic queries over enterprise-wide heterogeneous data sources, the existing data needs to be transformed into RDF triples. Meanwhile, for RDF triple data, semantic Web query languages such as SPARQL have been developed to describe the query conditions of users. For a detailed description of SPARQL, see E. Prud'hommeaux and A. Seaborne, “SPARQL query language for RDF. W3C candidate recommendation, April 2006,” which is hereby incorporated by reference.
Over the past decades, mature RDB (Relational Database) technologies have been widely accepted for managing various application data. Currently, there are two general ways to manage RDF data using mature relational database technologies. One is to migrate the existing data into a particular RDB-based RDF triple store using ETL (Extract-Transform-Load) approaches, over which users can query and manage the data stored in triple form. The other is to create a virtual RDF view over the legacy relational database, by which RDF queries are translated into SQL (Structured Query Language) statements that are executable in the legacy system. The present invention focuses on the first way, in which the RDF data is managed in an RDF triple store built on top of a relational database.
According to the storage design of RDB-based RDF triple stores, triple stores can be further divided into three categories:
1) generic RDF triple store,
2) improved RDF triple store, and
3) horizontal/binary table store.
For the generic RDF triple store, triples are directly stored in a generic table with three columns (subject, property, and object), on which a few composite indices are created to improve query performance. For example, both Oracle 10gR2 Spatial Database and Jena2 employ typical generic triple stores.
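A generic triple table of this kind can be sketched with Python's built-in `sqlite3` module. This is an illustrative sketch only: the table name `triple`, the index names, and the choice of index column orders are assumptions, and the schema does not reproduce that of Oracle or Jena2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One generic table with three columns holds every triple.
conn.execute("CREATE TABLE triple (subject TEXT, property TEXT, object TEXT)")

# Composite indices (column orders assumed here) support common lookups.
conn.execute("CREATE INDEX idx_spo ON triple (subject, property, object)")
conn.execute("CREATE INDEX idx_pos ON triple (property, object, subject)")

conn.executemany(
    "INSERT INTO triple VALUES (?, ?, ?)",
    [("SXZ", "typeOf", "Professor"), ("SXZ", "Teach", "Course1")],
)

# "Which professors teach Course1?" needs one self-join per extra condition.
rows = conn.execute(
    """
    SELECT t1.subject
    FROM triple AS t1
    JOIN triple AS t2 ON t1.subject = t2.subject
    WHERE t1.property = 'typeOf' AND t1.object = 'Professor'
      AND t2.property = 'Teach' AND t2.object = 'Course1'
    """
).fetchall()
print(rows)
```

Because every fact lives in the same table, a query with several conditions on one subject turns into repeated self-joins, which is one reason query performance is a concern for vertical stores.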
Compared with the generic triple store, the improved triple store is able to manage much more expressive RDFS (RDF Schema)/OWL (Web Ontology Language) ontologies and their corresponding instances. Like RDF, RDFS and OWL are recommended by the W3C to support ontology representation. By extending RDF, RDFS provides the ability to define class inheritance and the basic facility of domain/range. OWL allows richer properties and relationships to be defined and provides many more restrictions than RDFS. Besides the triple table, the improved triple store also provides additional schema tables, e.g., a “class/property table” and a “someValueForm/allValueForm table,” to keep the expressive RDFS/OWL ontologies, against which the instances stored in the triple table can be populated accordingly. These additional tables are intended to further facilitate ontological inference on the RDF data. Examples of the improved triple store include IBM Webify Triple Store, IBM SOR, and Sesame on MySQL.
Unlike the two kinds of triple stores above, the horizontal/binary table store shreds the RDF data into multiple horizontal/binary tables, where each property serves as a table name and the columns hold the subject and the object of the triples, respectively. Representative examples are DLDB-OWL and Sesame on PostgreSQL. The schema of a horizontal/binary table store is always consistent with the ontology model stored. Once the ontology evolves, the schema has to change accordingly, which is very costly. Therefore, most research interest in RDF stores has moved to the triple stores. The present invention is applicable to both generic triple stores and improved triple stores.
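The binary table layout, and the schema change it forces when a new property appears, can be sketched as follows. This is a hedged sketch using Python's `sqlite3`; the helpers `quote_identifier` and `store_binary` are hypothetical names, and the layout is not that of DLDB-OWL or Sesame.

```python
import sqlite3

triples = [
    ("SXZ", "typeOf", "Professor"),
    ("SXZ", "Teach", "Course1"),
]

conn = sqlite3.connect(":memory:")


def quote_identifier(name):
    """Quote a property name so it can be used safely as a table name."""
    return '"' + name.replace('"', '""') + '"'


def store_binary(connection, rows):
    """Store each triple in a two-column table named after its property."""
    for subject, prop, obj in rows:
        table = quote_identifier(prop)
        # A previously unseen property forces a schema change (new table).
        connection.execute(
            f"CREATE TABLE IF NOT EXISTS {table} (subject TEXT, object TEXT)"
        )
        connection.execute(f"INSERT INTO {table} VALUES (?, ?)", (subject, obj))


store_binary(conn, triples)

tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
teach_rows = conn.execute('SELECT subject, object FROM "Teach"').fetchall()
print(tables, teach_rows)
```

Each distinct property produces its own table, so the number of tables tracks the ontology; this is the coupling between schema and ontology that the paragraph above describes as costly.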
Currently, many optimizations have been applied to triple stores to improve query performance. For example, in order to save storage space, most triple stores internally assign unique IDs (unique identifiers) to URIs and literal values and store them separately in a mapping table, which is then referenced from the triple table. To improve the efficiency of indices built on literal values, some triple stores physically divide the literal value mapping table into several tables, one for each data type. To narrow the query scope, some triple stores distinguish relationship properties from datatype properties and keep them in separate tables. Furthermore, by utilizing specific database features such as the MDC (Multi-Dimensional Clustering) table supported in IBM DB2, some triple stores cluster the triples by property, which helps to quickly locate and cache the triples satisfying the query conditions.
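The ID mapping optimization mentioned above can be sketched as a simple dictionary encoding. This is a minimal sketch under stated assumptions: the class name `IdMapper` and the sequential integer IDs are illustrative choices, not the scheme of any particular triple store.

```python
class IdMapper:
    """Mapping table that assigns a unique integer ID to each distinct value."""

    def __init__(self):
        self.value_to_id = {}
        self.id_to_value = []

    def encode(self, value):
        # Assign the next free ID the first time a value is seen.
        if value not in self.value_to_id:
            self.value_to_id[value] = len(self.id_to_value)
            self.id_to_value.append(value)
        return self.value_to_id[value]

    def decode(self, ident):
        return self.id_to_value[ident]


triples = [
    ("SXZ", "typeOf", "Professor"),
    ("SXZ", "Teach", "Course1"),
]

mapper = IdMapper()

# The triple table stores compact IDs instead of repeated strings.
encoded = [tuple(mapper.encode(value) for value in triple) for triple in triples]

# Decoding through the mapping table recovers the original triples.
decoded = [tuple(mapper.decode(ident) for ident in triple) for triple in encoded]
print(encoded)
```

The repeated subject "SXZ" is stored as a string only once in the mapping table, while the triple table holds fixed-size integers.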
Although many optimizations have been applied to triple stores, their query performance remains unsatisfactory. Accordingly, there is a particular need in the art for a schema capable of improving query performance over a vertically stored database.