The semantic web is a significant technology that has been developed for knowledge representation, discovery, and integration of data available on the World Wide Web. To model knowledge in a flexible and extensible way, the World Wide Web Consortium (W3C) has standardized the Resource Description Framework (RDF) to capture the semantics of data. RDF has now become a widely used language for representing information (metadata) about resources on the World Wide Web. When information has been specified using the generic RDF format, it may be consumed automatically by a diverse set of applications.
There are two standard vocabularies defined on RDF: RDF Schema (RDFS) and the Web Ontology Language (OWL). These vocabularies introduce RDF terms that have special semantics in those vocabularies. For simplicity, in the rest of the document, our use of the term RDF will also implicitly include RDFS and OWL. For more information and for a specification of RDF, see RDF Vocabulary Description Language 1.0: RDF Schema, available at www.w3.org/TR/rdf-schema/, OWL Web Ontology Language Overview, available at www.w3.org/TR/owl-features/, and Frank Manola and Eric Miller, RDF Primer, published by W3C and available in September, 2004 at www.w3.org/TR/rdf-primer/. The RDF Vocabulary Description Language 1.0: RDF Schema, OWL Web Ontology Language Overview, and RDF Primer are hereby incorporated by reference into the present patent application.
Facts in RDF are represented by RDF triples. Each RDF triple represents a fact and is made up of three parts, a subject, a predicate (sometimes termed a property), and an object. For example, the fact represented by the English sentence “John is 24 years old” can be represented in RDF by the subject, predicate, object triple <‘John’, ‘age’, ‘24’>, with ‘John’ being the subject, ‘age’ being the predicate, and ‘24’ being the object. In the following discussion, the values in RDF triples are termed lexical values.
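By way of illustration only, the subject, predicate, object structure described above can be sketched in a few lines of Python. The type name Triple and its field names are hypothetical and are chosen here only for readability; they are not part of RDF or of any particular library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF fact; the class and field names are illustrative assumptions."""
    subject: str
    predicate: str
    obj: str  # "object" is avoided because it shadows a Python builtin


# The fact "John is 24 years old" from the text, held as lexical values.
john_age = Triple(subject="John", predicate="age", obj="24")

# A triple unpacks positionally into its three parts.
subject, predicate, obj = john_age
```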
With RDF, the values of predicates must ultimately resolve to lexical values termed uniform resource identifiers (URIs), and the values of subjects must ultimately resolve to lexical values termed URIs or blank nodes. A URI is a standardized format for representing resources on the Internet, as described in RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax, www.ietf.org/rfc/rfc2396.txt. RFC 2396 is hereby incorporated by reference into the present patent application. In the triples, the lexical values for the object parts may be literal values. In RDF, literal values are strings of characters, and can be either plain literals (such as “Immune Disorder”) or typed literals (such as “2.4”^^xsd:decimal). The interpretations given to the lexical values in the members of the triple are determined by the application that is consuming it. For a complete description of RDF, see Frank Manola and Eric Miller, RDF Primer, published by W3C and available in September 2004 at www.w3.org/TR/rdf-primer/. The RDF Primer is hereby incorporated by reference into the present patent application.
Various approaches have been developed to efficiently store RDF data in database-accessible formats. In some known approaches, the RDF data is represented using integer data values, which can thereafter be translated into a lexical form. For example, U.S. Patent Publication 2008/0126397 describes an approach for representing RDF data using two tables, a first “Links” Table that uses ID values to capture relationships between the objects in the RDF data, and a second “Values” Table which includes the lexical values corresponding to those ID values. The Links Table allows for representations of graph relationships between objects using connectivity, where the subjects and objects of RDF triples are mapped to nodes, and the predicates are mapped to links that have subject start-nodes and object end-nodes. A link in the Links Table, therefore, represents a complete RDF triple. To improve the efficiency of processing and storage of the Links Table, this table includes only ID values and no lexical values. The Values Table is checked in order to obtain the lexical values of the RDF triples.
FIG. 3 illustrates one possible approach to implement the Values table, Links table, and their connections. In the example of FIG. 3, the Values table 300 stores three records 301, 302, 303 for the different text values. For example, the text value ValueName1 is associated with the unique ID of ValueID1. The text value ValueName2 is associated with the unique ID of ValueID2. The text value ValueName3 is associated with the unique ID of ValueID3. A record 351 exists in the Links table 350 for a triple associated with a unique Link ID. In this record 351, the Start Node ID is ValueID1, the P Value ID is ValueID2, and the End Node ID is ValueID3. A complete triple is thus obtainable through its Link ID (LinkID1), as the text values of the subject, predicate, and object of the triple can be accessed through the Start Node ID, P Value ID, and End Node ID associated with the Link ID. If a second record 352 in the Links table has the same subject text value of ValueName1, then ValueID1 would be stored as the Start Node ID and associated with the Link ID (LinkID2) of the second record 352. In this manner, two triples can reference the same text value without redundantly storing the text value.
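By way of illustration only, the arrangement of FIG. 3 can be sketched with two in-memory Python dictionaries standing in for the Values table and the Links table. The predicate and object of record 352 are not specified above, so the ValueID4 entry and the function name resolve are hypothetical additions made only to complete the example.

```python
# Values table 300: unique ID -> text value (records 301, 302, 303).
values = {
    "ValueID1": "ValueName1",
    "ValueID2": "ValueName2",
    "ValueID3": "ValueName3",
    "ValueID4": "ValueName4",  # hypothetical extra value, not in FIG. 3
}

# Links table 350: Link ID -> (Start Node ID, P Value ID, End Node ID).
links = {
    "LinkID1": ("ValueID1", "ValueID2", "ValueID3"),  # record 351
    # Record 352 shares the subject ValueID1; its predicate and object
    # are assumed here purely for illustration.
    "LinkID2": ("ValueID1", "ValueID2", "ValueID4"),
}


def resolve(link_id):
    """Return the (subject, predicate, object) text values for a Link ID."""
    return tuple(values[value_id] for value_id in links[link_id])


triple_1 = resolve("LinkID1")
triple_2 = resolve("LinkID2")
```

Both links carry ValueID1 as the Start Node ID, so the text value ValueName1 is stored once in the Values table and referenced twice.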
Therefore, this type of approach for storing RDF triples hashes the lexical forms of RDF resources into numerical IDs (e.g., 64-bit numerical Value IDs) and stores the mappings of resource ID to resource value in the separate Values table 300. This approach is advantageous for a number of reasons. First, when performing joins, it is much faster to compare and join the numeric IDs than to use the original lexical values of the RDF resources. In addition, an RDF resource usually occurs multiple times in a set of triples. Storing it as a 64-bit numeric ID will produce significant storage savings. Also, B-Tree indexes on numeric IDs produce storage savings as well.
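By way of illustration only, hashing lexical forms into 64-bit numeric IDs might be sketched as follows. The particular hash function is not identified above; BLAKE2b truncated to 8 bytes is an assumption made for this sketch, as are the names value_id and store_triple and the sample triple about a name. Hash collisions are ignored here for brevity.

```python
import hashlib

values_table = {}  # 64-bit Value ID -> lexical value
links_table = []   # rows of (start node ID, predicate ID, end node ID)


def value_id(lexical):
    """Map a lexical value to a 64-bit integer (assumed hash: BLAKE2b)."""
    digest = hashlib.blake2b(lexical.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def store_triple(subject, predicate, obj):
    """Store one triple as IDs, recording each lexical value only once."""
    terms = (subject, predicate, obj)
    ids = tuple(value_id(term) for term in terms)
    for vid, term in zip(ids, terms):
        values_table.setdefault(vid, term)  # no-op if already stored
    links_table.append(ids)
    return ids


store_triple("John", "age", "24")
store_triple("John", "name", "John Smith")  # hypothetical second fact
```

The resource "John" appears in both triples but occupies a single row of the Values table; each occurrence in the Links table costs only one 64-bit integer.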
However, this approach has some disadvantages, because it requires access to the Values table during a query. After all of the Links table joins have been performed, there is still a need to join with the Values table to retrieve the original resource values. Depending on how many variables need to be selected, this introduces additional joins with the Values table. The overhead of the Values table is more noticeable for queries that sort their results, because the bindings of all of the selected columns are joined with the Values table and involved in the sort, even though only the sorting column needs to be considered. In addition, there are excessive costs with queries that produce a significant number of matches, since using a nested loop join against the Values table is not efficient and using a hash join may also be inefficient because the database has to hash the whole Values table.
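By way of illustration only, the extra joins described above can be seen in a small SQLite sketch. The table names rdf_values and rdf_links, the column names, and the sample data are all hypothetical; the sketch is not the schema of any particular product. The query selects two variables and sorts on one of them, yet it joins the Values table once for the predicate lookup and once more for each selected variable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rdf_values (value_id INTEGER PRIMARY KEY, value_name TEXT);
    CREATE TABLE rdf_links (
        link_id INTEGER PRIMARY KEY,
        start_node_id INTEGER,
        p_value_id INTEGER,
        end_node_id INTEGER
    );
""")
conn.executemany(
    "INSERT INTO rdf_values VALUES (?, ?)",
    [(1, "John"), (2, "age"), (3, "24"), (4, "Mary"), (5, "31")],
)
conn.executemany(
    "INSERT INTO rdf_links VALUES (?, ?, ?, ?)",
    [(1, 1, 2, 3), (2, 4, 2, 5)],
)

# Find every subject and object linked by 'age', sorted by the object.
# Only o.value_name drives the sort, but s.value_name must also be
# fetched through its own join with rdf_values.
query = """
    SELECT s.value_name, o.value_name
    FROM rdf_links AS l
    JOIN rdf_values AS p ON p.value_id = l.p_value_id
    JOIN rdf_values AS s ON s.value_id = l.start_node_id
    JOIN rdf_values AS o ON o.value_id = l.end_node_id
    WHERE p.value_name = 'age'
    ORDER BY o.value_name
"""
rows = conn.execute(query).fetchall()
```

Each additional selected variable adds another join against rdf_values, which is the overhead the preceding paragraph describes.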
Therefore, there is a need for an improved approach for implementing queries and access to the lexical form of RDF data.