Resource Description Framework (RDF) is the de facto standard for graph representation and the primary vehicle for data exchange over the Internet or World Wide Web. RDF is flexible and uses simple primitives for data representation, e.g., nodes and edges. In addition, RDF facilitates the integration of heterogeneous sources on the Web. The query language of choice for RDF is SPARQL. SPARQL queries can be complex and can contain a large number of triples and several layers of nesting. Optimization of SPARQL queries involves defining the order and methods with which to access the triples and building a hierarchical plan tree for query evaluation based on cost. A number of works have already studied how to efficiently evaluate semantic web (SPARQL) queries. Typical existing approaches perform bottom-up SPARQL query optimization, i.e., individual triples or conjunctive patterns in the SPARQL query are independently optimized and then the optimizer attempts to piece together and order these individual plans into one global plan. These approaches are similar to typical relational database optimizers in that they rely on statistics to assign costs to query plans, in contrast to less effective approaches whose SPARQL query optimization heuristics ignore statistics.
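The statistics-driven, bottom-up ordering described above can be illustrated with a minimal sketch. It is not the method of any particular optimizer. The toy graph, the helper names (is_var, matches, estimated_cardinality, order_patterns), and the convention that terms beginning with "?" are variables are all assumptions made for illustration, and an exact count over the toy data stands in for the statistics a real optimizer would consult.

```python
# Hypothetical sketch: RDF data as (subject, predicate, object) triples and a
# greedy, statistics-based ordering of triple patterns. All names are illustrative.
Triple = tuple[str, str, str]

GRAPH: list[Triple] = [
    ("alice", "knows", "bob"),
    ("alice", "knows", "carol"),
    ("bob", "knows", "carol"),
    ("alice", "worksFor", "acme"),
]

def is_var(term: str) -> bool:
    """Assumed convention: terms starting with '?' are variables."""
    return term.startswith("?")

def matches(pattern: Triple, triple: Triple) -> bool:
    """A pattern matches when every constant term equals the triple's term."""
    return all(is_var(p) or p == t for p, t in zip(pattern, triple))

def estimated_cardinality(pattern: Triple, graph: list[Triple]) -> int:
    """Stand-in for optimizer statistics: an exact count over the toy graph."""
    return sum(1 for triple in graph if matches(pattern, triple))

def order_patterns(patterns: list[Triple], graph: list[Triple]) -> list[Triple]:
    """Order patterns independently, most selective (fewest matches) first."""
    return sorted(patterns, key=lambda p: estimated_cardinality(p, graph))

patterns = [("?x", "knows", "?y"), ("?x", "worksFor", "acme")]
plan = order_patterns(patterns, GRAPH)
```

In this sketch each pattern is costed in isolation, which is the defining trait of the bottom-up style: the ordering never looks at how the patterns relate to one another.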
Simple SPARQL queries resemble Structured Query Language (SQL) conjunctive queries, and, therefore, one expects existing techniques to be sufficient. However, even a brief survey of real and benchmark SPARQL queries shows that SPARQL queries encountered in practice are far from simple. To a large extent due to the nature of RDF, these SPARQL queries are often arbitrarily complex, e.g., with deep nestings, and often quite big, e.g., one exemplary SPARQL query involves a union of 100 queries. To make matters worse, typical operators in SPARQL often correspond to more exotic operators in the relational world that are less commonly considered by optimizers. For example, the common OPTIONAL operator in SPARQL corresponds to a left outer join. All these observations lead to the conclusion that there is potential for novel optimization techniques in this space.
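The correspondence between OPTIONAL and a left outer join can be sketched as follows. This is a simplified illustration that assumes solutions are represented as dictionaries mapping variable names to values; the function names (compatible, left_outer_join) are hypothetical, and the filter conditions that SPARQL permits inside OPTIONAL are deliberately left out.

```python
# Hypothetical sketch of OPTIONAL as a left outer join over variable bindings.
Binding = dict[str, str]

def compatible(a: Binding, b: Binding) -> bool:
    """Two bindings are compatible when they agree on every shared variable."""
    return all(b[var] == value for var, value in a.items() if var in b)

def left_outer_join(left: list[Binding], right: list[Binding]) -> list[Binding]:
    """Keep every left binding; extend it with each compatible right binding."""
    result: list[Binding] = []
    for l in left:
        extended = [{**l, **r} for r in right if compatible(l, r)]
        # With no compatible partner, the left binding survives unextended.
        result.extend(extended if extended else [l])
    return result

people = [{"?p": "alice"}, {"?p": "bob"}]
emails = [{"?p": "alice", "?e": "alice@example.org"}]
joined = left_outer_join(people, emails)
```

Here the binding for "bob" is retained even though it has no email, which is the behavior that distinguishes OPTIONAL from an ordinary conjunctive join.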
Although attempts have been made to provide query optimization both in SPARQL and beyond, important challenges remain for SPARQL query optimization and for translation of SPARQL queries to equivalent SQL queries over a relational database or store. Typical approaches perform bottom-up SPARQL query optimization, i.e., individual triples or conjunctive SPARQL patterns are independently optimized and then the optimizer orders and merges these individual plans into one global plan. These approaches are similar to typical relational optimizers in that they rely on statistics to assign costs to query plans. While these approaches are adequate for simple SPARQL queries, they are not as effective for more complicated, but still common, SPARQL queries. Such queries often have deep, nested sub-queries whose inter-relationships are lost when optimizations are limited by the scope of a single triple or an individual conjunctive pattern.
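To make the translation of a conjunctive SPARQL pattern into SQL concrete, the following minimal sketch may help. It assumes a hypothetical relational layout, a single table triples(s, p, o), and a hypothetical helper bgp_to_sql; actual RDF stores use a variety of schemas, and this is not presented as the translation used by any specific system. It also assumes the pattern contains at least one variable.

```python
import sqlite3

def bgp_to_sql(patterns):
    """Translate triple patterns into a self-join over an assumed triples(s, p, o) table.

    Terms starting with '?' are treated as variables (an assumed convention).
    Returns the SQL text and the positional parameters for the constants.
    """
    where, params, first_ref = [], [], {}
    for i, pattern in enumerate(patterns):
        for col, term in zip(("s", "p", "o"), pattern):
            ref = f"t{i}.{col}"
            if term.startswith("?"):
                if term in first_ref:
                    # A repeated variable becomes an equality (join) condition.
                    where.append(f"{ref} = {first_ref[term]}")
                else:
                    first_ref[term] = ref
            else:
                # A constant becomes a parameterized selection condition.
                where.append(f"{ref} = ?")
                params.append(term)
    columns = ", ".join(f'{ref} AS "{var[1:]}"' for var, ref in first_ref.items())
    tables = ", ".join(f"triples AS t{i}" for i in range(len(patterns)))
    sql = f"SELECT {columns} FROM {tables}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql, params

# Usage on an in-memory database with illustrative data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [("alice", "knows", "bob"), ("alice", "knows", "carol"), ("bob", "worksFor", "acme")],
)
sql, params = bgp_to_sql([("?x", "knows", "?y"), ("?y", "worksFor", "acme")])
rows = conn.execute(sql, params).fetchall()
```

The sketch handles only a flat conjunctive pattern. Nested sub-queries, unions, and OPTIONAL blocks are out of scope here, and those are the constructs for which a pattern-at-a-time translation loses the inter-relationships noted above.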