Modern database management systems (DBMS) process ever-increasing amounts of data. These database systems can store millions of data records. When accessing large databases, it is important to minimize the amount of time the database access takes, to ensure efficient and speedy operation of the database management system. Most large-scale database management systems use some form of optimization scheme, to ensure that operations executed on the DBMS, such as database queries, are executed as efficiently as possible.
In a database management system, optimization is the process of choosing an efficient way to execute a database query or manipulation action. Examples of such query or manipulation actions include searching, retrieving, modifying, organizing, adding, and/or deleting information from the database. These database query/manipulation actions are normally initiated by submitting statements to a database server using a language adapted to facilitate information storage and retrieval. One popular such language is known as Structured Query Language (“SQL”). For the purposes of explanation only, and not by way of limitation, the following description is made with reference to SQL statements.
To execute a SQL statement, the database system may have to perform steps involving the retrieval or manipulation of data from various database structures, such as tables and indexes. Often, there exists many alternate ways to execute the SQL statement and many alternative execution orders for the steps. For example, a single SQL statement can be executed in different ways by varying the order in which tables and indexes are accessed to execute the statement. The exact combination and execution order of steps taken to execute the SQL statement can drastically change the efficiency or speed of execution for the statement. The combination and execution order of steps that are used to execute a SQL statement is referred to as an “execution plan.”
As an example, consider the following SQL statement, which queries for the name of all employees having a salary equal to 100 from a database table “emp_able”:
SELECT employee_name
FROM emp_table
WHERE salary=100
A first execution plan could include the step of performing a full table scan of emp_table to execute the query. This first execution plan would retrieve every row from emp_table to identify particular rows that match the WHERE clause. Alternatively, if an index exists for the “salary” column of emp_table, then a second execution plan could involve accessing the index to identify rows that match the WHERE clause, and thereafter retrieving only those identified rows from the table. The index is considered an alternate access path to the data sought by the SQL statement.
An “optimizer” is used by a database system to choose what is believed to be the most efficient execution plan for a SQL statement. The optimizer typically generates a set of potential execution plans for the SQL statement based upon available access paths for the data sought to be operated upon by that statement. The optimizer then chooses the optimal execution plan from amongst the generated set.
When optimizing a query referencing more than one table, one type of execution order the optimizer considers in searching for the optimal execution plan is connection ordering. Connection ordering is the sequence in which tables are connected together to generate the composite table on which the query will be executed. Using SQL, tables can be connected using a variety of different connection conditions, including joins and sub-queries.
The general format of a connection condition includes a left-hand element, a connection operator, and a right-hand element. The connection operator connects the left-hand element to the right-hand element, based on values contained in the two elements. The elements can include table columns, constants, SELECT statements or other expressions useful to connect tables or other database components together.
Connection operators can be operators provided as primitive operations in SQL, such as the join (“=”), EXISTS, ANY, and IN operators. Connection operators can also be operators used internally by the DBMS in evaluating SQL primitives, such as the semi-join (“S=”) operator, or the filter operator. One common use of semi-joins and filters is in evaluating SQL sub-queries.
Among the basic types of connection operators are commutative connection operators and non-commutative connection operators. Commutative connection operators such as produce the same composite table regardless of whether the first connection column appears on the left-hand side of the connection operator or the right-hand side. For example, the “=” join operator instructs the query to return a row in the composite table for each instance where the value in the left-hand join column equals the value in the right-hand join column.
Turning to FIGS. 1A–1C, a first table T1 includes a first join column T1.x. The first join column T1.x contains two rows, having the values “1” and “2”. A second table T2 includes a second join column T2.x. The second join column T2.x contains four rows, having the values “1”, “1”, “2” and “2”. As shown in FIG. 1B, using the example join columns of FIG. 1A, the join condition T1.x T2.x will produce the same composite table as the join condition T2.x=T1.x. Both join conditions return four rows, as shown by the arrows joining the rows of T1.x and T2.x.
Non-commutative connection operators such as EXISTS, IN, ANY, S= and filters, however, produce different composite tables depending on the ordering of the connection columns. For example, the “S=” semi-join operator instructs the query to return a row in the composite table if there is at least one instance where a value for the right-hand semi-join column matches the value in the left-hand semi-join column. Note that “S=” is used here as a shorthand notation for the semi-join operator, and is not a standard SQL operator notation. This operator will only return one row, no matter how many matches are found in the right-hand semi-join column. As shown in FIG. 1C, using the example semi-join columns of FIG. 1A, the semi-join condition T1.x S=T2.x produces a different result set than the semi-join condition T2.x S=T1.x. The two semi-join conditions return two and four rows respectively, as shown by the arrows joining the rows of T1.x and T2.x.
For an example of connection ordering using joins, assume a query having three tables contains the following join conditions:T1.x=T2.x (join rows in T1 with rows in T2 where column T1.x equals column T2.x)T2.y=T3.y (join rows in T2 with rows in T3 where column T2.y equals column T3.y)
When executing a query with these join conditions, the DBMS has several options as to the order in which it will join these tables. For example, the DBMS could start with T1, and join in the rows from T2 that satisfy the join condition T1.x=T2.x. The DBMS would then join the rows from T3 into the T1, T2 composite table according to the join condition T2.y=T3.y. This join order is expressed as T1 |x| T2 |x| T3. There are other connection orders that are possible, a shown in Table 1.
TABLE 13-table connection ordersT1 |x| T2 |x| T3T1 |x| T3 |x| T2T2 |x| T1 |x| T3T2 |x| T3 |x| T1T3 |x| T2 |x| T1T3 |x| T1 |x| T2
Each of these connection orders carries a cost, possibly different from the other connection orders. When the optimizer generates the set of potential execution plans, the optimizer considers each connection order, and selects the connection order that yields the optimal execution plan.
In the above example, since the join operators are all commutative, the optimizer is able to consider all of the possible connection orders for the tables in the query. However, if a query contains a non-commutative connection operator, then some of the possible connection orders are unavailable, because they violate the ordering requirements of the non-commutative connection operator.
For example, consider the following connection conditions:T1.x S=T2.xT1.y=T3.y
Since the semi-join condition T1.x S=T2.x is non-commutative, the optimizer cannot consider any connection orders that would connect T2 into the composite table before T1, because this would violate the ordering requirements of the S=semi-join operator. The decreased set of available connection orders is shown in Table 2 below, with the unavailable connection orders crossed out:
TABLE 23-table connection ordersT1 |x| T2 |x| T3T1 |x| T3 |x| T2T3 |x| T1 |x| T2
Queries containing subquery filters also limit the connection orders available for consideration by the optimizer. Using the following query as an example:
SELECT T1.n
FROM T1, T2
WHERE T1.y=ANY (SELECT T3.y FROM T3 RE T2.z=T3.z)
AND T1.k=T2.k;
This query contains a subquery. If the subquery is evaluated as a filter, it is applied to each row in the composite table created by the join of T1 and T2. In order to computer whether the subquery evaluates to TRUE or FALSE for that row, the values for both T1.y and T2.z need to be available. Therefore, the evaluation of the subquery as a filter cannot take place until after T1 and T2 have been joined. Hence, evaluating subqueries as filters restricts the possible orderings of operations in a multi-table query.
Note that the above query references a total of three tables, T1 and T2 in the main query, and T3 in the subquery. Thus, the theoretical maximum number of connection orders is 3!, or six, as discussed above. However, with the subquery requiring the values T1.y and T2.z prior to being evaluated, the optimizer is limited in how it can consider connections on T3. While the optimizer can connect T1 and T2 in either order, since the join condition T1.k=T2.k is commutative, it must connect T1 and T2 prior to evaluating the subquery. This yields the two available connection orders shown in Table 3 below.
TABLE 3subquery connection ordersT1 |x| T2 |x| subqueryT2 |x| T1 |x| subquery
This is fewer connection orders than the theoretical maximum of six connection orders discussed above.
The presence of a non-commutative connection condition prevents the optimizer from considering all possible connection orders for the tables in the query, and thus limits the efficiency of the optimization of the query. Additionally, the presence of a filter in the query prevents the optimizer from considering all possible connection orders for the tables in the query. Therefore, systems and methods are needed to enhance the efficiency of the optimizer by allowing for consideration of more potential execution plans for queries having non-commutative connection conditions and/or filters.