1. Field of the Invention
The present invention relates generally to information processing environments and, more particularly, to a database management system (DBMS) and methodology for use of generalized order properties in a query optimizer.
2. Description of the Background Art
Computers are very powerful tools for storing and providing access to vast amounts of information. Computer databases are a common mechanism for storing information on computer systems while providing easy access to users. A typical database is an organized collection of related information stored as “records” having “fields” of information. As an example, a database of employees may have a record for each employee where each record contains fields designating specifics about the employee, such as name, home address, salary, and the like.
Between the actual physical database itself (i.e., the data actually stored on a storage device) and the users of the system, a database management system or DBMS is typically provided as a software cushion or layer. In essence, the DBMS shields the database user from knowing or even caring about the underlying hardware-level details. Typically, all requests from users for access to the data are processed by the DBMS. For example, information may be added or removed from data files, information retrieved from or updated in such files, and so forth, all without user knowledge of the underlying system implementation. In this manner, the DBMS provides users with a conceptual view of the database that is removed from the hardware level. The general construction and operation of database management systems is well known in the art. See e.g., Date, C., “An Introduction to Database Systems, Seventh Edition”, Part I (especially Chapters 1-4), Addison-Wesley, 2000.
One purpose of a database system is to answer decision support queries. A query may be defined as a logical expression over the data and the data relationships set forth in the database, and results in the identification of a subset of the database. Consider, for instance, the execution of a request for information from a relational DBMS. In operation, this request is typically issued by a client system as one or more Structured Query Language or “SQL” queries for retrieving particular data (e.g., a list of all employees earning more than the average salary of all employees) from database tables on a server. In response to this request, the database system typically returns the names of those employees earning more than the average salary, where “employees” is a table defined to include information about employees of a particular organization. The syntax of SQL is well documented, see e.g., “Information Technology—Database languages—SQL”, published by the American National Standards Institute as American National Standard ANSI/ISO/IEC 9075: 1992, the disclosure of which is hereby incorporated by reference.
SQL queries express what results are requested but do not state how the results should be obtained. In other words, the query itself does not tell how the query should be evaluated by the DBMS (i.e., the query is declarative). Rather, a component called the optimizer determines the “plan” or the best method of accessing the data to return the result required by the SQL query. The query optimizer is responsible for transforming an SQL request into an access plan composed of specific implementations of the algebraic operator selection, projection, join, and so forth. The role of a query optimizer in a relational DBMS system is to find an adequate execution plan from a search space of many semantically equivalent alternatives.
A query optimizer transforms an SQL query into an access plan by generating different join strategies and, based on cost, choosing the best strategy. The process of generating a subspace of the space of all the join strategies is called join enumeration. Since relational databases typically only provide physical operators that can join two tables at a time, a join of a number of different tables (n-way join) must be executed as a sequence of two-way joins, and there are many possible such sequences. The optimizer must typically enumerate some or all of these sequences and choose one based on estimates of their relative execution costs.
In order to optimize a query, a query optimizer must be able to concurrently solve several different problems including: choosing the access method (e.g., sequential scan, index scan) for each base table used in the query; choosing the order in which to join the tables (i.e., the join order), and choosing the join method to be used for each join operation. In a complex query plan, order can be exploited by join, group by, distinct, and partitioning operators. Operators that exploit existing orders tend to be cheaper than their non-ordered counterparts, but creating order (e.g., with a sort or an index scan) can be expensive. However, some operators can preserve the order of their inputs, or pass it on in a modified form. Therefore, the possibility exists for an optimizer to construct plans that use a single order more than once, or that take advantage of an order that exists anyway (e.g., because of an ORDER BY clause) to make another operation such as a join or GROUP BY cheaper.
A framework for using order during query optimization is described by Selinger, et al. in “Access Path Selection in a Relational Database Management System”, Proceedings of ACM SIGMOD Conference on Management of Data, pp. 23-34, May 1979 (hereinafter “Selinger”). Selinger provides for creating a list of “interesting orders” for a query, and then associating an “order property” with each partial plan as it is considered during optimization. An order property, in the sense used by Selinger, is a physical property of a stream of tuples. One way to represent an order property or an interesting order is to use an array O where each entry O[i] is a tuple (expr, seq), expr is an expression from the schema of the tuple stream, and seq is either “asc” or “desc” (short for ascending and descending, respectively). In this document, the term “simple order” is used for this representation. If a simple order has only one entry (e.g., (x, asc)), then values of x appear in the stream of tuples in increasing order (or non-decreasing if there are any tuples with equal values of x). Similarly, if the simple order is (x, desc), then the values of x appear in decreasing order (or non-increasing if there are any tuples with equal values of x). The extension to an arbitrary number of entries in a simple order is defined recursively: a simple order with k entries describes a stream of tuples where the tuples appear in the order specified by the first k-1 entries. Any tuples with equal values of the first k-1 expressions are together and form a slice of the stream. Each such slice is ordered according to the kth entry of the array.
Selinger does not describe any particular representation for interesting orders or order properties, but one obvious representation is the simple order described above. Every partial plan constructed during optimization has an order property associated with it. The order properties of the partial plans are compared against the interesting orders at appropriate points while building partial plans, and the costing is adjusted accordingly in cases where they match. Note that matching is a simple operation when both interesting orders and order properties are represented as simple orders: if the interesting order is a prefix of the order property, then the match is positive. Also note that the search algorithm used to generate and compare the partial plans is independent of the strategy for keeping track of order properties. The idea of creating interesting orders and attaching order properties to the physical properties of the partial plans can be used either for the join enumeration algorithm Selinger describes, or for other join enumeration algorithms.
The techniques described above are extended by Simmen, et al. in “Fundamental Techniques for Order Optimization”, Proceedings of ACM SIGMOD International Conference on Management of Data, pp. 57-67, June 1996, and U.S. Pat. No. 5,619,692 (hereinafter referred to together as “Simmen”). Simmen extends the above techniques by allowing an interesting order or order property to be represented by a set of simple orders rather than just one simple order. The set representation that Simmen describes is a tuple (O, FD) where O is the same array used for a simple order and FD is a set of functional dependencies. Consider, for example, that:x—>y is a functional dependency in FDO[i]=(x,asc)i<j
If this is the case, then the set includes any simple order that can be formed by inserting (y,asc) or (y,desc) at O[j]. This rule can be applied an arbitrary number of times, using different functional dependencies or the same one over and over again and all such constructed orderings are part of the set. Note that constants and equivalence classes (which Simmen also mentions) can be represented as functional dependencies for the sake of this discussion.
To simplify matching, Simmen describes an operation called “reduce” that converts the array O to a canonical form. This is essentially the inverse of the expansion described above. Assume the following:x—>yi<jO[i]=(x,asc)O[j]=(y,desc)
In this case one can remove O[j] from the array. Once O is reduced, matching is exactly the same as with a simple order representation: an interesting order must be a prefix of an order property.
Unfortunately, the above-described techniques have several shortcomings. Simmen points out that an interesting order generated for a GROUP BY clause can match several different order properties. For example, “GROUP BY x,y” matches any of the following simple orders:[(x,asc), (y,asc)][(x,asc), (y,desc)][(x,desc), (y,asc)][(x,desc), (y,desc)][(y,asc), (x,asc)][(y,asc), (x,desc)][(y,desc), (x,asc)][(y,desc), (x,desc)]
The only way to represent this type of interesting order using the techniques described above is to generate a separate interesting order for each of these possibilities. It would be more beneficial if they could all be represented simultaneously in a single data structure. Lacking a way to do this, the only simple solution is to choose just one of the orderings to use as the interesting order, and fail to match an order property that corresponds to any of the others. Simmen mentions this issue, but provides no solution to the problem.
Another way to improve the representation is to generalize order properties to describe tuple streams that are grouped but not totally ordered. Wang, et al. describe some techniques for doing this in “Avoiding Sorting and Grouping in Processing Queries”, in Proceedings of the 29th Very Large Database (VLDB) Conference, Berlin, Germany, 2003. However, this type of generalization does not represent sets of simple orders.
A variation of the techniques of Simmen is described by Neumann et al. in “An Efficient Framework for Order Optimization” in Proceedings of the 20th International Conference on Data Engineering, ICDE 2004, 30 March-2 April 2004, Boston, Mass., pp. 461-472 (hereinafter “Neumann”). This variation is intended to improve the time efficiency of the basic operations like matching and reduction, and the space efficiency of the representation. The basic approach of Neumann is to construct an automaton that encodes the result of precomputing all the operations. A limitation of the approach of Simmen with respect to join enumeration is illustrated by the following example. Consider the query:select * fromT join S on (T.x=S.x)order by T.x, T.y
If one chooses to use a sort-merge join of S to T, then the sort of T will be on T.x. However, it should be possible to sort on T.x, T.y and avoid the final sort needed to satisfy the ORDER BY clause. If the optimizer operates by building partial plans bottom-up, then the order properties are calculated based only on the prefix of the plan. The initial set of indexes to consider is chosen based on interesting orders, but once the order property for a partial plan is set it cannot be changed based on operators added later in the plan.
When using dynamic programming for join enumeration, a partial solution to this kind of problem is to “push down” interesting orders (e.g., as described by Simmen). A more general solution is possible by using a top-down, goal-oriented search such as the one described by Graefe et al. in “The Volcano Optimizer Generator: Extensibility and Efficient Search”, in Proceedings of Ninth International Conference on Data Engineering (ICDE), Apr. 19-23, 1993, Vienna, Austria, pp. 209-218 (hereinafter “Graefe”). A problem, however, is that the join enumeration algorithms of Simmen and Graefe both use significant amounts of memory. This is particularly problematic in an environment in which memory is limited.
What is needed is an improved approach that provides for efficiently handling cases like the above while using less memory. The solution should provide some of the benefits of either pushing down interesting orders or using a goal-directed search, but should do so in a memory-efficient manner. The present invention provides a solution for these and other needs.