The Resource Description Framework (RDF) is the de facto standard for data representation on the World Wide Web. The amount of RDF data from disparate domains is growing rapidly. For instance, the Linked Open Data (LOD) initiative integrates billions of entities from hundreds of sources. Just one of these sources, the DBpedia dataset, describes more than 3.64 million things using more than 1 billion RDF triples, of which 385 million are extracted from the English edition of Wikipedia.
With the proliferation of RDF data, considerable effort has been devoted to building RDF stores that efficiently answer graph pattern queries expressed in SPARQL. This effort includes migrating the schema-relaxed RDF data to relational stores, e.g., Virtuoso, Jena SDB, Sesame and 3store, among others, and building generic RDF stores from scratch, e.g., Jena TDB, RDF-3X, 4store and Sesame Native. Because RDF data are schema-relaxed and SPARQL graph pattern queries are characterized by many joins, a full spectrum of techniques, from the physical design of storage to query evaluation, has been proposed to achieve better scalability and efficiency. These techniques include vertical partitioning for relational backends, sideways information passing for scalable join processing, and various compression and indexing techniques for a smaller memory footprint.
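As a rough illustration of the vertical partitioning idea mentioned above, the following minimal sketch groups triples into one two-column (subject, object) table per predicate. The function name `vertically_partition` and the sample triples are hypothetical and chosen only for illustration; the sketch does not reflect the storage layout of any particular RDF store.

```python
from collections import defaultdict


def vertically_partition(triples):
    """Group (subject, predicate, object) triples into one table per predicate.

    Each table holds (subject, object) rows. This is an illustrative sketch,
    not the layout of any specific RDF store.
    """
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    # Sort each table by subject, so that subject-subject joins between two
    # predicate tables could be evaluated with a merge join.
    return {p: sorted(rows) for p, rows in tables.items()}


# Hypothetical sample data.
triples = [
    ("ex:Alice", "foaf:knows", "ex:Bob"),
    ("ex:Bob", "foaf:name", "Bob"),
    ("ex:Alice", "foaf:name", "Alice"),
]
tables = vertically_partition(triples)
```

Under this layout, a query that touches only `foaf:name` reads a single narrow table rather than scanning all triples.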
With this infrastructure in place, more advanced applications are being developed. These applications include integrating and harvesting knowledge on the Web as well as rewriting queries for fine-grained access control and inference. In such applications, a SPARQL query is often rewritten into a batch of equivalent SPARQL queries for evaluation. Because the rewritten SPARQL queries in a common batch overlap semantically, the issue of multi-query optimization (MQO) arises in the context of RDF and SPARQL. MQO for SPARQL queries is NP-hard, given that MQO for relational queries is NP-hard and that SPARQL and relational algebra have been shown to be equivalent. Indeed, the MQO techniques developed for relational systems can be applied to address MQO in SPARQL. For example, query plans can be represented as AND-OR directed acyclic graphs (DAGs), and heuristics can be used to partially materialize intermediate results, which could yield promising query throughput. Similar themes can be seen in a variety of contexts, including relational queries, XQueries, aggregation queries and full-reducer tree queries.
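To make the notion of overlapping queries in a batch concrete, the sketch below treats each query as a list of triple patterns and reports the patterns that occur in more than one query. It assumes, for simplicity, that variables are named consistently across the batch; detecting overlap in general requires matching patterns up to variable renaming, which this sketch does not attempt. The function `shared_patterns` and the two sample queries are hypothetical.

```python
from collections import Counter


def shared_patterns(batch):
    """Return the triple patterns that appear in more than one query.

    Assumption: variables are named consistently across queries, so equal
    tuples denote the same pattern. This is a simplification for illustration.
    """
    # Count each pattern at most once per query.
    counts = Counter(p for query in batch for p in set(query))
    return {p for p, n in counts.items() if n > 1}


# Two hypothetical queries, as might be produced by rewriting one query.
q1 = [("?x", "rdf:type", "ex:Person"), ("?x", "foaf:name", "?n")]
q2 = [("?x", "rdf:type", "ex:Person"), ("?x", "foaf:mbox", "?m")]
common = shared_patterns([q1, q2])
```

Here the pattern `?x rdf:type ex:Person` is shared, so its intermediate result is a candidate for being evaluated once and reused by both queries.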
These solutions, however, are hard to engineer into RDF query engines in practice. First, complexity stems from the physical design of RDF data itself. While relational data are commonly indexed and stored according to a carefully calibrated relational schema, many variants exist for RDF data, e.g., the giant triple table adopted in 3store and RDF-3X, the property table in Jena, and vertical partitioning. When combined with the disparate indexing techniques, cost estimation for an individual query operator, the cornerstone of any MQO technique, is highly error-prone and store-dependent. Second, SPARQL queries feature more joins than typical SQL queries. While existing techniques commonly rely on exhaustively enumerating query plans and picking the best among them, comparing the costs of alternative plans becomes impractical in the context of SPARQL, as the error in selectivity estimation inevitably grows with the number of joins. Third, RDF is a very general data model, and knowledge and facts can be seamlessly harvested and integrated from various SPARQL endpoints on the Web. While a specialized MQO solution may serve inside the optimizer of a particular RDF store, a generic MQO framework is desired that fits smoothly into any SPARQL endpoint and is coherent with the design principles of the RDF data model.
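The growth of estimation error with the number of joins can be illustrated with a back-of-the-envelope model. The sketch below assumes that each join's selectivity estimate is off by the same multiplicative factor and that these errors compound independently; both are simplifying assumptions made here for illustration, not measurements, and the function `compounded_error` is hypothetical.

```python
def compounded_error(per_join_factor, num_joins):
    """Worst-case multiplicative error of a cardinality estimate.

    Assumption: every join's selectivity estimate is off by the same factor
    and the errors multiply across joins.
    """
    return per_join_factor ** num_joins


# Error factor after 1, 3 and 6 joins if each estimate is off by 1.5x.
errors = {k: compounded_error(1.5, k) for k in (1, 3, 6)}
```

Under this model, a per-join error of 1.5x already exceeds an order of magnitude after six joins, which suggests why cost comparisons between alternative plans become unreliable for join-heavy SPARQL queries.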