Relational database management systems are well-known in the art. In a relational database, information is structured in a collection of tables in which data values are stored in rows under various column headings. The Structured Query Language ("SQL") allows users to access databases maintained under any number of relational database management systems and has become the standard for relational database access.
Data is retrieved from the relational database by means of a SQL query, such as, in particular, a so-called SQL "SELECT" statement. A simple SQL SELECT statement may be of the form
    SELECT specified field(s)
    FROM specified table(s)
    WHERE specified condition(s) is true.
For example, the query
    SELECT name
    FROM employees
    WHERE sal=100
results in a list of the names of those employees earning $100, where "employees" is a table defined to include information about all employees of a particular company.
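The example query can be run against a small in-memory database as a minimal sketch. Only the table name "employees" and the columns "name" and "sal" come from the text above; the sample rows, the use of SQLite through Python's sqlite3 module, and the added ORDER BY clause (included only to make the output order deterministic) are assumptions made for illustration.

```python
import sqlite3

# Hypothetical sample data; only the table and column names come from the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, sal INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ann", 100), ("Bob", 250), ("Cy", 100)],
)

# The SELECT statement from the text, with ORDER BY added for a stable order.
rows = conn.execute(
    "SELECT name FROM employees WHERE sal = 100 ORDER BY name"
).fetchall()
names = [row[0] for row in rows]
print(names)
```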
Other operations may be specified in, or result from, a SQL query. Some examples are as follows. Data from two or more tables may be combined in a "join" operation. "Views" can be derived from one or more so-called "base tables." Aggregates, e.g., such operators as SUM and COUNT, specify operations to be performed on the collection of values in some column of a table. The GROUP BY operator allows for tables to be grouped by any combination of their fields. Finally, SELECT statements may also be nested, thereby forming different types of subqueries.
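The operations listed above may be illustrated together in a short sketch. The "departments" table, the "dept_id" column, the view name "emp_dept", and all sample rows are hypothetical and are introduced only for this example, which again assumes SQLite through Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (name TEXT, dept_id INTEGER, sal INTEGER);
CREATE TABLE departments (dept_id INTEGER, dept_name TEXT);
INSERT INTO employees VALUES ('Ann', 1, 100), ('Bob', 1, 300), ('Cy', 2, 200);
INSERT INTO departments VALUES (1, 'Sales'), (2, 'Ops');

-- A view derived from two base tables by means of a join.
CREATE VIEW emp_dept AS
  SELECT e.name, e.sal, d.dept_name
  FROM employees e JOIN departments d ON e.dept_id = d.dept_id;
""")

# Aggregates (COUNT, SUM) applied per group formed by GROUP BY.
totals = conn.execute(
    "SELECT dept_name, COUNT(*), SUM(sal) FROM emp_dept "
    "GROUP BY dept_name ORDER BY dept_name"
).fetchall()

# A nested SELECT: employees paid more than the overall average salary.
above_avg = conn.execute(
    "SELECT name FROM employees "
    "WHERE sal > (SELECT AVG(sal) FROM employees) ORDER BY name"
).fetchall()

print(totals)
print(above_avg)
```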
Since any combination of the SQL operations described above may be found in one SQL query, a SQL query may become quite complex, and, in fact, this complexity has increased as SQL queries have evolved over time. In particular, simple queries are typically "one block" queries, that is, they can be expressed with one SELECT statement having single FROM, WHERE, HAVING, and/or GROUP BY clauses. Simple queries have no subqueries or views. In contrast, a complex SQL query is composed of multiple blocks. Examples of complex SQL queries are the so-called "decision-support" queries. Organizations have come to base decisions on the results of these queries, which are often defined using grouping/aggregation view relations and correlated subqueries (i.e., subqueries that depend on one or more variables whose values are determined in an "outer" query).
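By way of illustration only, a multi-block query of the kind described may be written in two ways: with a correlated subquery, and with a grouping/aggregation view. The schema, the view name "dept_avg", and the sample rows below are hypothetical, and SQLite (through Python's sqlite3 module) is assumed merely as a convenient engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (name TEXT, dept TEXT, sal INTEGER);
INSERT INTO employees VALUES
  ('Ann', 'Sales', 100), ('Bob', 'Sales', 300),
  ('Cy', 'Ops', 200), ('Di', 'Ops', 400), ('Ed', 'Ops', 300);

-- Grouping/aggregation view: one row per department.
CREATE VIEW dept_avg AS
  SELECT dept, AVG(sal) AS avg_sal FROM employees GROUP BY dept;
""")

# Correlated subquery: the inner block depends on e.dept from the outer block.
correlated = conn.execute("""
  SELECT e.name FROM employees e
  WHERE e.sal > (SELECT AVG(e2.sal) FROM employees e2 WHERE e2.dept = e.dept)
  ORDER BY e.name
""").fetchall()

# The same question expressed through the grouping/aggregation view.
via_view = conn.execute("""
  SELECT e.name FROM employees e JOIN dept_avg v ON e.dept = v.dept
  WHERE e.sal > v.avg_sal
  ORDER BY e.name
""").fetchall()

print(correlated, via_view)
```

Both forms contain more than one block, which is what distinguishes them from the simple queries described above.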
SQL queries express what results are requested but do not state how the results should be obtained. In other words, the query itself does not tell how the query should be evaluated by the relational database management system. Rather, a component called the optimizer determines the "plan," that is, the best method (for example, in terms of I/O and CPU processing costs) of accessing the data to implement the SQL query.
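The separation between the query and its plan can be observed in a sketch that uses SQLite's EXPLAIN QUERY PLAN statement. SQLite is assumed here only as an example engine, and the index name "idx_sal" and the helper "plan_text" are hypothetical. The query text is identical in both calls, yet the plan reported by the optimizer changes once an index becomes available. The exact wording of the plan differs between SQLite versions, so none is reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, sal INTEGER)")

query = "SELECT name FROM employees WHERE sal = 100"

def plan_text(connection, sql):
    """Return the optimizer's plan description for sql as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable description is the fourth column of each row.
    return " | ".join(row[3] for row in rows)

plan_before = plan_text(conn, query)   # no index exists yet
conn.execute("CREATE INDEX idx_sal ON employees (sal)")
plan_after = plan_text(conn, query)    # same query, index now available

print(plan_before)
print(plan_after)
```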
Because of the potential complexity of SQL queries, query optimization, especially with respect to decision-support queries, has become very important. Different approaches to decision-support query optimization include the use of relational algebra and "magic sets rewriting".
Relational Algebra
Translating simple SQL queries into relational algebraic expressions is a well-known optimization technique. Generally speaking, a query is received by a database management system either interactively from a user or from a program in which the query is embedded. The optimizer or optimizing portion of the database management system either translates the query into a relational algebraic expression or receives the already-translated relational algebraic expression from another component of the database management system. In either case, once the SQL query is in the form of a relational algebraic expression, so-called "equivalence rules" transform the expression into other equivalent algebraic expressions, thereby generating a "search space" or "space", i.e., the set of different alternative implementations that an optimizer will consider.
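The generation of a search space by equivalence rules can be sketched as follows. The sketch assumes only two rules, join commutativity and join associativity, and represents a join expression as a nested pair of relation names. The function names and the relations "R", "S", and "T" are hypothetical, and an actual optimizer would apply a much richer rule set.

```python
def rewrites(expr):
    """Yield every expression reachable from expr by one rule application."""
    if isinstance(expr, str):      # a base relation has no rewrites
        return
    left, right = expr
    # Commutativity: A JOIN B  ==  B JOIN A
    yield (right, left)
    # Associativity: (A JOIN B) JOIN C  ==  A JOIN (B JOIN C)
    if not isinstance(left, str):
        a, b = left
        yield (a, (b, right))
    # Apply the rules inside either operand as well.
    for new_left in rewrites(left):
        yield (new_left, right)
    for new_right in rewrites(right):
        yield (left, new_right)

def search_space(expr):
    """Return the closure of expr under the rules above."""
    seen = {expr}
    frontier = [expr]
    while frontier:
        current = frontier.pop()
        for candidate in rewrites(current):
            if candidate not in seen:
                seen.add(candidate)
                frontier.append(candidate)
    return seen

space = search_space((("R", "S"), "T"))
print(len(space))
```

For three relations these two rules reach all twelve ordered binary join trees, which suggests how quickly the space grows as relations are added.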
Once the search space is generated, cost estimates for each algebraic expression can be generated by utilizing the cost formulas for the relational algebraic operators and the different ways of evaluating these operators. The estimated least costly alternative is then chosen as the plan. For example, a join of two relations (or tables) may be implemented by choosing one relation to be the "outer" relation and, for each tuple (or row) of that outer relation, finding all matching tuples of the other relation (called the "inner" relation). These matching tuples are then concatenated to the tuple of the outer relation. Although the actual cost for the join depends on the particular database system, determining the outer and inner relations, or using other methods to implement the join, may affect the estimated cost of performing the join.
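The join method described above, together with a cost comparison between the two choices of outer relation, can be sketched as follows. The cost formula (pages of the outer relation plus one scan of the inner relation per outer tuple) is a simplified textbook-style estimate assumed for illustration; it is not the formula of any particular database system. The statistics, the sample rows, and the names "Stats", "nested_loop_cost", and "choose_outer" are likewise hypothetical.

```python
from dataclasses import dataclass

def nested_loop_join(outer, inner, outer_key, inner_key):
    """For each outer tuple, concatenate every matching inner tuple to it."""
    result = []
    for o in outer:
        for i in inner:
            if o[outer_key] == i[inner_key]:
                result.append(o + i)
    return result

@dataclass
class Stats:
    name: str
    tuples: int
    pages: int

def nested_loop_cost(outer, inner):
    # Assumed estimate: read the outer once, scan the inner once per outer tuple.
    return outer.pages + outer.tuples * inner.pages

def choose_outer(r, s):
    """Return (estimated cost, outer name, inner name) of the cheaper order."""
    return min(
        (nested_loop_cost(r, s), r.name, s.name),
        (nested_loop_cost(s, r), s.name, r.name),
    )

employees = [("Ann", 1), ("Bob", 2), ("Cy", 1)]
departments = [(1, "Sales"), (2, "Ops")]
joined = nested_loop_join(departments, employees, 0, 1)

best = choose_outer(
    Stats("employees", tuples=1000, pages=100),
    Stats("departments", tuples=10, pages=1),
)
print(joined)
print(best)
```

Under the assumed statistics, making the smaller relation the outer relation gives the lower estimate (1001 against 1100 page reads).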
Variations of the above technique can be used for the optimization of complex queries. For example, in one variation, a complex SQL query is broken into smaller blocks. These blocks are then translated to relational algebraic expressions to which the equivalence rules and the above procedure are applied. The result is that, for each block, the "optimal" alternative is determined. This is referred to as "local" optimization. However, the optimization of the interaction between the blocks, the so-called "global" optimization, is performed on an ad-hoc basis outside of the relational algebra framework.
The relational operator called the semijoin operator has been used in the prior art to optimize simple distributed queries for set semantics (i.e., queries whose results include no duplicate values). In particular, it is used to optimize joins of database relations in distributed database systems. Joins in distributed database systems are potentially costly operations because in such systems the data to be joined is resident at different sites, thereby incurring communication costs as well as processing costs. When a semijoin is performed first, the processing site sends its join information (the values of the join attributes) to a receiving site, and the receiving site determines which of its data would in fact join. Only that data is transmitted from the receiving site back to the processing site, so the costs of communicating the data are reduced.
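The semijoin reduction can be sketched with two in-memory lists standing in for the two sites. The row contents, the attribute name "dept_id", and the function name "semijoin_reduce" are hypothetical, and no actual network transfer is modeled. The sketch only counts how many rows would be shipped back.

```python
def semijoin_reduce(remote_rows, join_values, key):
    """At the receiving site: keep only rows whose key can actually join."""
    return [row for row in remote_rows if row[key] in join_values]

# Data held at the processing site.
local_rows = [
    {"emp": "Ann", "dept_id": 1},
    {"emp": "Bob", "dept_id": 1},
]
# Data held at the receiving site.
remote_rows = [
    {"dept_id": 1, "dept_name": "Sales"},
    {"dept_id": 2, "dept_name": "Ops"},
    {"dept_id": 3, "dept_name": "Lab"},
]

# Step 1: the processing site ships only its distinct join values.
join_values = {row["dept_id"] for row in local_rows}

# Step 2: the receiving site filters its rows and ships back the survivors.
reduced = semijoin_reduce(remote_rows, join_values, "dept_id")

# Step 3: the processing site completes the join on the reduced data.
joined = [
    {**l, **r}
    for l in local_rows
    for r in reduced
    if l["dept_id"] == r["dept_id"]
]

print(len(remote_rows), "rows held remotely;", len(reduced), "shipped back")
```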
Magic Sets Rewriting
To further improve the optimization process, the technique called "magic sets rewriting" is used to increase the search space. Magic sets rewriting optimizes complex SQL queries, such as those involving view definitions and nested subqueries, by rewriting the queries into forms that can be evaluated more efficiently. Generally, the magic sets rewriting approach is to define a set of auxiliary "magic" (or "filter") relations that are used to filter out irrelevant or repetitive data that does not contribute to the results of the queries, for example, data which would not be used by subqueries. The most generalized form of magic sets rewriting, called Constraint Magic rewriting, can handle non-equality conditions or predicates in queries, as well as equality predicates.
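The role of a filter relation can be sketched as follows. This is a hand-written illustration of the idea under an assumed schema, not the output of any particular rewriting algorithm. The tables, the view names "dept_avg", "magic_dept", and "dept_avg_magic", and the sample rows are all hypothetical, and SQLite (through Python's sqlite3 module) is assumed as the engine. The outer query asks only about employees under 30, so the filter relation restricts the aggregation view to the departments that contain such employees.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emp (name TEXT, dept TEXT, sal INTEGER, age INTEGER);
INSERT INTO emp VALUES
  ('Ann', 'Sales', 100, 25), ('Bob', 'Sales', 300, 28),
  ('Cy', 'Ops', 200, 45), ('Di', 'Ops', 400, 50),
  ('Ed', 'Lab', 150, 29), ('Flo', 'Lab', 50, 60);

-- Original form: the view aggregates every department.
CREATE VIEW dept_avg AS
  SELECT dept, AVG(sal) AS avg_sal FROM emp GROUP BY dept;

-- Filter ("magic") relation: departments the outer query can ask about.
CREATE VIEW magic_dept AS
  SELECT DISTINCT dept FROM emp WHERE age < 30;

-- Rewritten view: aggregation restricted by the filter relation.
CREATE VIEW dept_avg_magic AS
  SELECT e.dept, AVG(e.sal) AS avg_sal
  FROM emp e JOIN magic_dept m ON e.dept = m.dept
  GROUP BY e.dept;
""")

outer_query = """
  SELECT e.name FROM emp e JOIN {view} v ON e.dept = v.dept
  WHERE e.age < 30 AND e.sal > v.avg_sal
  ORDER BY e.name
"""
original = conn.execute(outer_query.format(view="dept_avg")).fetchall()
rewritten = conn.execute(outer_query.format(view="dept_avg_magic")).fetchall()

groups_all = conn.execute("SELECT COUNT(*) FROM dept_avg").fetchone()[0]
groups_magic = conn.execute("SELECT COUNT(*) FROM dept_avg_magic").fetchone()[0]
print(original, rewritten, groups_all, groups_magic)
```

Both forms return the same answer, while the rewritten view computes an average for fewer departments. Whether that saving outweighs the cost of computing the filter relation depends on the data.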
For any one query, there may be many different alternative rewritings. Generally, one or more of the rewritings are selected heuristically as those likely to have lower processing costs. The cost of processing the selected rewritings is compared with the cost of processing the query without magic sets rewriting, and the least costly alternative is ultimately chosen. Although there have been recent efforts to provide cost-based techniques for selecting the most cost-effective rewriting (e.g., modeling magic sets rewriting as a special join method), magic sets rewriting generally remains a heuristic technique, with only a minimal cost-based component.