The Resource Description Framework (RDF) is a data model and one of the core technologies of the Semantic Web. RDF data represents a labeled directed graph, with labels applied to both nodes and arcs. RDF data is currently stored in triple stores, where the unit of storage is a single subject-predicate-object triple. An index of triples is used to satisfy queries written in the SPARQL Protocol and RDF Query Language (SPARQL) through a series of “joins” in which triples that match each part of the query are assembled into an overall answer.
A SPARQL query, in its most basic form, is a graph pattern—a small labeled directed graph, with some of the nodes and arcs labeled with variables. The query seeks one, some, or all subgraphs of the RDF graph that match the pattern. Matching means that the subgraph is isomorphic to the pattern as a directed graph, and that the pattern labels (other than variables) match the corresponding graph labels. For each match, the pattern variables take on the value of the corresponding RDF graph label, thus providing one answer.
Each arc in RDF is labeled with a Uniform Resource Identifier (URI). Each node is labeled with either a URI or a value from one of a set of standard types. Nodes can also be “blank,” i.e., unlabeled. URIs are used to unambiguously name resources, which can be almost anything. Each arc in the RDF graph is uniquely specified by its three labels: the label at the tail, the label on the arc, and the label at the head. These are called the subject, the predicate, and the object, respectively (sometimes abbreviated S, P, and O, and always given in that order). These three labels together form a triple. An RDF graph may be uniquely represented by the set of its triples. Each triple (S, P, O) is a proposition, stating that S stands in relation P to O.
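As an illustration of the set-of-triples representation, the following minimal Python sketch models a triple as a named tuple and a graph as a set. The `ex:` labels and the `Triple` and `nodes` names are hypothetical and chosen only for this example.

```python
from typing import NamedTuple, Set


class Triple(NamedTuple):
    """One arc of the graph: subject (tail), predicate (arc label), object (head)."""
    subject: str
    predicate: str
    obj: str


# A graph is simply a set of triples, so a repeated triple is stored only once.
graph: Set[Triple] = {
    Triple("ex:alice", "ex:knows", "ex:bob"),
    Triple("ex:bob", "ex:knows", "ex:carol"),
    Triple("ex:alice", "ex:knows", "ex:bob"),  # duplicate of the first triple
}


def nodes(g: Set[Triple]) -> Set[str]:
    """Collect every label that appears at the tail or head of some arc."""
    return {t.subject for t in g} | {t.obj for t in g}
```

Because the graph is a set, the duplicated triple collapses into a single element, which reflects the statement that the graph is uniquely represented by the set of its triples.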
Two RDF graphs may be merged to form a new graph. The merged graph consists of (roughly) the union of the triples from the two constituent graphs. If each of the constituent graphs contains a node with the same URI label, then that node appears just once in the merged graph. This makes sense, since the URI is supposed to have the same unambiguous meaning wherever it appears.
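A rough sketch of merging, assuming graphs are plain Python sets of (S, P, O) tuples and ignoring blank nodes (which would need renaming before the union and are the reason for the word “roughly” above). The labels and the `merge` helper are hypothetical.

```python
def merge(g1, g2):
    """Merge two graphs by taking the union of their triples.

    Assumption: neither graph contains blank nodes, so no renaming is needed.
    """
    return g1 | g2


g1 = {("ex:alice", "ex:knows", "ex:bob")}
g2 = {
    ("ex:bob", "ex:knows", "ex:carol"),
    ("ex:alice", "ex:knows", "ex:bob"),  # also present in g1
}

merged = merge(g1, g2)

# The node labeled ex:bob occurs in both inputs but is a single node in the result.
merged_nodes = {s for (s, _, _) in merged} | {o for (_, _, o) in merged}
```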
RDF is associated with a theory of types, subtypes, and containers called RDF Schema (RDFS). Using RDFS, nodes can be assigned a type, and types can be given subtypes. Predicates (the labels of arcs) are also typed by assigning a domain and a range, which are the types of the subject and the object, respectively. Sometimes, when only some of the types of nodes are given, it is possible to infer the missing types. This is called RDFS inference. Many SPARQL query engines can be set to perform RDFS inference automatically, in effect behaving as if some additional type-asserting triples were present in the RDF graph.
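The kind of type inference described here can be sketched as follows. This is a simplified illustration that covers only the domain, range, and subclass rules, not the full RDFS rule set; the `ex:` labels and the `rdfs_closure` function are hypothetical.

```python
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"
RDFS_RANGE = "rdfs:range"
RDFS_SUBCLASS = "rdfs:subClassOf"


def rdfs_closure(graph):
    """Return the graph plus the type triples implied by three RDFS rules.

    The rules are applied repeatedly until no new triple appears.
    """
    inferred = set(graph)
    while True:
        new = set()
        for (s, p, o) in inferred:
            # Domain and range: the predicate p types its subject and object.
            for (p2, rel, cls) in inferred:
                if p2 != p:
                    continue
                if rel == RDFS_DOMAIN:
                    new.add((s, RDF_TYPE, cls))
                elif rel == RDFS_RANGE:
                    new.add((o, RDF_TYPE, cls))
            # Subclass: an instance of a type is an instance of its supertypes.
            if p == RDF_TYPE:
                for (sub, rel, sup) in inferred:
                    if rel == RDFS_SUBCLASS and sub == o:
                        new.add((s, RDF_TYPE, sup))
        if new <= inferred:
            return inferred
        inferred |= new


graph = {
    ("ex:teaches", RDFS_DOMAIN, "ex:Teacher"),
    ("ex:teaches", RDFS_RANGE, "ex:Course"),
    ("ex:Teacher", RDFS_SUBCLASS, "ex:Person"),
    ("ex:alice", "ex:teaches", "ex:math101"),
}
closure = rdfs_closure(graph)
```

In this example no type is asserted for `ex:alice` or `ex:math101`, yet the closure contains type triples for both, which is the behavior an engine with RDFS inference enabled would emulate at query time.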
OWL (Web Ontology Language) is a more elaborate language for describing RDF, in which a much richer set of things may be expressed. Again, many SPARQL engines can be set to perform OWL inference by behaving as if the missing but inferred OWL triples are actually present in the RDF graph.
Conventional RDF database systems (often called “triple stores”) treat the RDF graph (data) as a large set of triples. These triples are individually indexed. The index is divided into three parts, with each triple represented by an entry in each part of the index. In each of the three parts, the subject, predicate, and object labels of each triple are sorted lexicographically and stored in a B+ tree or a similar data structure that supports range queries. The three parts differ only in the order in which the subject, predicate, and object are considered for lexicographic sorting. Although there are six possible permutations of S, P, and O, only three of them are needed for full functionality.
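To see why three orderings can suffice, consider the sketch below. It assumes, purely for illustration, the three cyclic orderings SPO, POS, and OSP (the text does not say which three a given system keeps), and checks that for every combination of bound positions in a triple pattern, some ordering has exactly those positions as its leading prefix. The `choose_ordering` helper is hypothetical.

```python
from itertools import combinations

# Assumed choice of three orderings; other choices of three also work.
ORDERINGS = ("SPO", "POS", "OSP")


def choose_ordering(bound_positions):
    """Pick an ordering whose leading positions are exactly the bound ones.

    bound_positions: the subset of {'S', 'P', 'O'} holding constants
    (as opposed to variables) in the triple pattern.
    """
    bound = set(bound_positions)
    for ordering in ORDERINGS:
        if set(ordering[:len(bound)]) == bound:
            return ordering
    raise ValueError(f"no ordering covers {sorted(bound)}")


# Enumerate all 8 combinations of bound positions and record the chosen ordering.
all_patterns = [set(c) for k in range(4) for c in combinations("SPO", k)]
chosen = {"".join(sorted(b)): choose_ordering(b) for b in all_patterns}
```

Since the bound labels always form a prefix of the chosen ordering, every single-triple pattern can be answered with one range scan.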
To look up a pattern for a single triple, such as (a, ?x, b) where a and b are labels and ?x is a variable, the system will consult the part of the index in which the lexicographic ordering is (S, O, P) or (O, S, P), only one of which will be present. Suppose the (S, O, P) lexicographic ordering is present. The system will consult that index to find the first entry beginning with (a, b). All of the entries beginning with (a, b) will be adjacent in the index, and can be rapidly returned. Each of those represents a triple of the form (a, ?x, b). To find matches for patterns with two variables, such as (a, ?x, ?y), the same (S, O, P) part of the index would be used, but now searching for the first entry with prefix (a) and following entries.
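The prefix lookup just described can be sketched with a sorted Python list standing in for the B+ tree and `bisect` standing in for the tree descent. The data and the helper names `build_sop_index` and `prefix_scan` are hypothetical.

```python
from bisect import bisect_left


def build_sop_index(triples):
    """Sort the triples lexicographically in (S, O, P) order."""
    return sorted((s, o, p) for (s, p, o) in triples)


def prefix_scan(index, prefix):
    """Return all index entries that begin with the given prefix tuple.

    A shorter tuple sorts before any longer tuple it is a prefix of, so
    bisect_left lands on the first matching entry; the matches are adjacent.
    """
    start = bisect_left(index, prefix)
    results = []
    for entry in index[start:]:
        if entry[:len(prefix)] != prefix:
            break
        results.append(entry)
    return results


triples = {
    ("a", "s", "b"),
    ("a", "t", "b"),
    ("a", "t", "c"),
    ("d", "s", "b"),
}
index = build_sop_index(triples)

# Pattern (a, ?x, b): prefix (a, b); the third component of each entry is ?x.
predicates = [p for (_, _, p) in prefix_scan(index, ("a", "b"))]

# Pattern (a, ?x, ?y): prefix (a); each entry yields a (?y, ?x) pair.
from_a = prefix_scan(index, ("a",))
```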
More complex patterns (called Basic Graph Patterns or BGPs in the SPARQL literature) consist of multiple triples. To satisfy these, the SPARQL query engine begins by looking up one of the triples in the pattern. Each match induces a set of values (labels) for the variables occurring in that triple. The SPARQL engine then processes a second pattern triple, applying each valuation to that triple, and then searching for the resulting pattern. Here is an example pattern with two triples:
        (a, ?x, ?y) (?y, r, ?z).
Suppose there are two matches for the first of these triples:
        (a, s, b) and (a, t, c).
This establishes two valuations:
        ?x=s, ?y=b, and
        ?x=t, ?y=c.
Applying the first of these valuations to the second triple yields:
        (b, r, ?z).
Looking this up in the index, suppose there are two answers:
        (b, r, z1) and (b, r, z2).
There are therefore two answers for the overall query:
        (a, s, b) (b, r, z1) and (a, s, b) (b, r, z2).
The typical RDF database system repeats the process using the second valuation, possibly finding some additional matches, each of which yields an answer to the overall query. In this small example, the RDF database system consults the index three times: Once to find matches for the first triple, and then once for each of the two matches found.
The typical RDF database system effectively computes a “join” between the sets of matches for each of the triples in the pattern. The strategy typically used for this join is known as “indexed nested loop.” For more complex patterns, the loop nesting will be deeper, and the total number of iterations can be quite large. The number of iterations, and thus the number of times the index must be consulted, can be quite sensitive to the order in which the pattern triples are nested. For complex queries, these joins are prohibitively expensive, especially as the size of the database grows.
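The indexed nested loop strategy can be sketched as a recursive evaluation that counts index consultations. In this illustration the index lookup is a linear scan standing in for a real sorted index, and the data is a hypothetical instance of the two-triple example pattern (a, ?x, ?y) (?y, r, ?z). The class and function names are assumptions made for the sketch.

```python
def is_var(term):
    return term.startswith("?")


def substitute(pattern, binding):
    """Replace already-bound variables in a pattern triple with their values."""
    return tuple(binding.get(t, t) if is_var(t) else t for t in pattern)


class TripleIndex:
    """Stand-in for the triple index; counts how often it is consulted."""

    def __init__(self, triples):
        self.triples = list(triples)
        self.lookups = 0

    def lookup(self, pattern):
        """Return one variable binding per stored triple matching the pattern."""
        self.lookups += 1
        matches = []
        for triple in self.triples:
            binding = {}
            for p_term, t_term in zip(pattern, triple):
                if is_var(p_term):
                    # A variable repeated in the pattern must bind consistently.
                    if binding.setdefault(p_term, t_term) != t_term:
                        break
                elif p_term != t_term:
                    break
            else:
                matches.append(binding)
        return matches


def evaluate(index, patterns, binding=None):
    """Outer loop over matches of the first triple, inner loops over the rest."""
    binding = binding or {}
    if not patterns:
        return [binding]
    answers = []
    for new in index.lookup(substitute(patterns[0], binding)):
        answers.extend(evaluate(index, patterns[1:], {**binding, **new}))
    return answers


index = TripleIndex([
    ("a", "s", "b"), ("a", "t", "c"),
    ("b", "r", "z1"), ("b", "r", "z2"),
])
answers = evaluate(index, [("a", "?x", "?y"), ("?y", "r", "?z")])
```

With this data the index is consulted three times: once for the first pattern triple and once for each of its two matches. Only the valuation ?y=b leads to answers, since no stored triple begins with (c, r).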
In some cases, the index is divided into parts, each of which is assigned to a different host. The results from each outer-loop index lookup are passed, one at a time, to drive the iterations of the inner loop at each level of nesting. Even if the outer-loop results are all adjacent to one another in the index, and thus likely on the same host, each iteration of the inner loop may require consulting the index on a distinct host. Thus, for N outer-loop results, as many as N+1 index hosts may be involved, with N inter-host communication events. The cost of such communication between distributed hosts, particularly for deeply nested queries, can dwarf the other costs in this query handling strategy.
Enterprises are facing a deluge of data as the rate of production from sensors and other sources increases exponentially. Timely processing and analysis of this data will be essential for the success of future enterprise operations. Much of this data is heterogeneous, semi-structured, and incomplete or non-standard, making its storage and handling awkward and inefficient. The Resource Description Framework (RDF), the primary technology of the Semantic Web, is an ideal tool for representing such data in a uniform, tractable way, with the elegant SPARQL query language providing a powerful means of retrieving and processing the data. Unfortunately, current RDF technology does not provide the necessary performance for storage and query at very large scales.