Obtaining good performance for declarative query languages requires a system that is optimized as a whole, with an efficient data layout, good data statistics, and careful query optimization. One key piece of such systems is a query planner that translates a declarative query into a concrete execution plan with minimal cost. In graph databases such as resource description framework (RDF) stores (e.g., IBM® DB2® RDF store), a given complex graph query, for example, a complex SPARQL Protocol and RDF Query Language (SPARQL) query, can be executed in a large variety of semantically equivalent ways. Each such execution plan produces the same results, but at a potentially different computation cost. A query planning objective is to find, for a given query, an execution plan with the minimum cost. Methods for determining the execution plan with the minimum cost have been studied. One known solution builds a cost-model that, based on data statistics, is able to estimate the cost of a given query execution plan. However, since the number of execution plans can be large, typically only a small subset of all valid plans is constructed using, for example, heuristics and/or greedy approaches that consider plans likely to have a low cost. The costs of the selected candidate plans are then estimated using the cost-model, and the cheapest plan is selected for execution. Because only a small subset of all valid plans is constructed, the chosen plan is not guaranteed to be optimal. In other words, the chosen plan is a locally optimal solution, but is not guaranteed to be a globally optimal solution.
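By way of non-limiting illustration, the difference between a greedily selected plan and a globally optimal plan can be sketched in a few lines of Python. Everything in the sketch is an assumption made for illustration only and is not taken from any particular RDF store: the pattern names "a", "b", and "c", the cardinality and selectivity figures, the left-deep plan shape, and the toy cost-model that sums intermediate result sizes.

```python
from itertools import permutations
from math import prod

# Hypothetical statistics: estimated matches per triple pattern.
CARD = {"a": 10, "b": 100, "c": 100}
# Hypothetical pairwise join selectivities (symmetric).
SEL = {
    frozenset(("a", "b")): 1.0,
    frozenset(("a", "c")): 1.0,
    frozenset(("b", "c")): 0.0001,
}


def join_size(size, joined, p):
    """Estimated result size after joining pattern p to the patterns in joined."""
    return size * CARD[p] * prod(SEL[frozenset((p, q))] for q in joined)


def plan_cost(order):
    """Toy cost-model: sum of all intermediate result sizes of a left-deep plan."""
    size = CARD[order[0]]
    cost = size
    for i in range(1, len(order)):
        size = join_size(size, order[:i], order[i])
        cost += size
    return cost


def greedy_plan():
    """Start with the smallest pattern, then always add the cheapest next join."""
    remaining = set(CARD)
    first = min(sorted(remaining), key=CARD.get)
    order = [first]
    remaining.remove(first)
    size = CARD[first]
    while remaining:
        nxt = min(sorted(remaining), key=lambda p: join_size(size, order, p))
        size = join_size(size, order, nxt)
        order.append(nxt)
        remaining.remove(nxt)
    return tuple(order)


def exhaustive_plan():
    """Enumerate every join order and keep the cheapest one under the cost-model."""
    return min(permutations(sorted(CARD)), key=plan_cost)


greedy = greedy_plan()
best = exhaustive_plan()
# Ratio of the greedy plan's estimated cost to the optimal plan's estimated cost.
gap = plan_cost(greedy) / plan_cost(best)
```

With these assumed statistics, the greedy heuristic starts from the smallest pattern "a" and is forced into an expensive first join, whereas exhaustive enumeration joins the highly selective pair "b" and "c" first. The ratio `gap` is one possible way to express how far the greedy plan is from the optimum under the same cost-model. Exhaustive enumeration is factorial in the number of patterns and is shown here only because the example is small.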
Some graph databases (e.g., IBM® DB2® RDF store or Oracle®) are built on top of highly optimized relational database management systems (RDBMS). Evaluation of complex graph queries in an RDBMS has been performed by translating the complex graph queries into structured query language (SQL) queries that are then evaluated by the underlying RDBMS. Relational systems are known to perform query optimization, so one might suppose that a naive translation from a graph query language, such as SPARQL, to SQL would be sufficient, since a relational optimizer can optimize the SQL query once the translation has occurred. However, in practice, important performance gains can occur when the SPARQL query and the SPARQL-to-SQL translation are independently optimized. As with the query planning issue discussed above, a given SPARQL query, for example, can be translated into a multitude of semantically equivalent SQL queries with vastly different execution costs.
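As a minimal sketch of such a naive translation, the following Python example maps a list of triple patterns onto a SQL self-join and evaluates two differently ordered translations of the same query using the standard-library sqlite3 module. The single `triples(s, p, o)` table, the `translate` helper, and the sample data are hypothetical and are chosen only for illustration; they do not reflect the storage schema of any actual RDF store. The sketch also assumes that the query contains at least one variable.

```python
import sqlite3


def translate(patterns):
    """Naively translate (s, p, o) triple patterns into one SQL self-join.

    Terms starting with '?' are variables; all other terms are constants.
    Assumes a hypothetical table triples(s, p, o) and at least one variable.
    """
    where, params, seen = [], [], {}
    for i, pattern in enumerate(patterns):
        for col, term in zip("spo", pattern):
            ref = f"t{i}.{col}"
            if not term.startswith("?"):
                where.append(f"{ref} = ?")       # constant: bind as a parameter
                params.append(term)
            elif term in seen:
                where.append(f"{ref} = {seen[term]}")  # repeated variable: join
            else:
                seen[term] = ref                 # first occurrence: project it
    # Project variables in name order so column order is independent of
    # the order in which the patterns were supplied.
    select = ", ".join(f"{seen[v]} AS {v[1:]}" for v in sorted(seen))
    tables = ", ".join(f"triples AS t{i}" for i in range(len(patterns)))
    sql = f"SELECT {select} FROM {tables}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql, params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("alice", "knows", "bob"),
        ("bob", "worksAt", "acme"),
        ("carol", "knows", "dave"),
    ],
)

# The same basic graph pattern, translated with its patterns in two orders.
query = [("?x", "knows", "?y"), ("?y", "worksAt", "?z")]
sql_a, params_a = translate(query)
sql_b, params_b = translate(query[::-1])
rows_a = sorted(conn.execute(sql_a, params_a).fetchall())
rows_b = sorted(conn.execute(sql_b, params_b).fetchall())
```

The two generated SQL strings differ in their table aliases and join conditions, yet return the same rows. The sketch demonstrates only the semantic equivalence of the translations; it does not measure cost, and on data this small an SQL optimizer may well choose the same physical plan for both.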
Known graph databases either: 1) mostly ignore this graph query planning issue, simply performing a naive translation to SQL and relying on the RDBMS SQL optimizer, or 2) partially address the issue in a suboptimal way by using heuristics and/or greedy approaches, and considering the cost (e.g., based on a cost-model and data statistics) of only a very small subset of potential translations. In both cases, the resulting translation is suboptimal, and it is not clear how far it is from the translation having the minimal cost.
Even with suboptimal plans, the performance of an optimizer may still be considered satisfactory if it performs better (e.g., in terms of evaluation times) than competing optimizers. Yet there is an alternative metric for measuring how well an optimizer performs: how far its locally optimal plans are from globally optimal plans. However, no mechanism exists for assessing whether these optimizers produce optimal plans given the data layout and statistics available.
Accordingly, there is a need for systems and methods for producing optimal search query plans and for accurately assessing how close a given query solution is to an optimal solution.