1. Field of the Invention
The present invention relates generally to information processing environments and, more particularly, to a database management system (DBMS) having a methodology for generating bushy trees using a left-deep tree join enumeration algorithm.
2. Description of the Background Art
Computers are very powerful tools for storing and providing access to vast amounts of information. Computer databases are a common mechanism for storing information on computer systems while providing easy access to users. A typical database is an organized collection of related information stored as “records” having “fields” of information. As an example, a database of employees may have a record for each employee where each record contains fields designating specifics about the employee, such as name, home address, salary, and the like.
Between the actual physical database itself (i.e., the data actually stored on a storage device) and the users of the system, a database management system or DBMS is typically provided as a software cushion or layer. In essence, the DBMS shields the database user from knowing or even caring about underlying hardware-level details. Typically, all requests from users for access to the data are processed by the DBMS. For example, information may be added or removed from data files, information may be retrieved from, or updated in, such files, and so forth, all without user knowledge of underlying system implementation. In this manner, the DBMS provides users with a conceptual view of the database that is removed from the hardware level. The general construction and operation of a database management system is known in the art. See e.g., Date, C., “An Introduction to Database Systems, Volume I and II”, Addison Wesley, 1990; the disclosure of which is hereby incorporated by reference.
DBMS systems have long since moved from a centralized mainframe environment to a decentralized or distributed environment. One or more PC “client” systems, for instance, may be connected via a network to one or more server-based database systems (SQL database server). Commercial examples of these “client/server” systems include Powersoft® clients connected to one or more Sybase® SQL Anywhere® Studio (Adaptive Server® Anywhere) database servers. Both Powersoft and Sybase SQL Anywhere Studio (Adaptive Server Anywhere) are available from Sybase, Inc. of Dublin, Calif.
In today's computing environment, database technology can be found on virtually any device, from traditional mainframe computers to cellular phones. Sophisticated applications, whether human resources information systems or sales force automation systems, can “push” much of their complexity into the database itself. Indeed, this represents one of the main benefits of database technology. The challenge, however, is to support these applications, and the complex queries they generate, on small computing devices. At the same time, users expect the productivity and reliability advantages of using a relational DBMS.
One purpose of a database system is to answer decision support queries. A query may be defined as a logical expression over the data and the data relationships set forth in the database, and results in the identification of a subset of the database. Consider, for instance, the execution of a request for information from a relational DBMS. In operation, this request is typically issued by a client system as one or more Structured Query Language or “SQL” queries for retrieving particular data (e.g., a list of all employees earning $10,000 or more) from database tables on a server. In response to this request, the database system typically returns the names of those employees earning $10,000, where “employees” is a table defined to include information about employees of a particular organization. The syntax of SQL is well documented, see e.g., “Information Technology—Database languages—SQL”, published by the American National Standards Institute as American National Standard ANSI/ISO/IEC 9075: 1992, the disclosure of which is hereby incorporated by reference.
SQL queries express what results are requested but do not state how the results should be obtained. In other words, the query itself does not tell how the query should be evaluated by the DBMS. Rather, a component called the optimizer determines the “plan” or the best method of accessing the data to implement the SQL query. The query optimizer is responsible for transforming an SQL request into an access plan composed of specific implementations of the algebraic operator selection, projection, join, and so forth. The role of a query optimizer in a relational DBMS system is to find an adequate execution plan from a search space of many semantically equivalent alternatives.
A query optimizer transforms an SQL query into an access plan by generating different join strategies and, based on cost, choosing the best strategy. The process of generating a subspace of the space of all the join strategies is called join enumeration. Since relational databases typically only provide physical operators that can join two tables at a time, a join of a number of different tables (n-way join) must be executed as a sequence of two-way joins, and there are many possible such sequences. The optimizer must typically enumerate some or all of these sequences and choose one based on estimates of their relative execution costs.
In order to optimize a query, a query optimizer must be able to concurrently solve several different problems including: choosing the access method (e.g., sequential scan, index scan) for each base table used in the query; choosing the order in which to join the tables (i.e., the join order), and choosing the join method to be used for each join operation. A complete access plan comprises a join order for joining the relations (tables), join methods for each join operation, and an access method for each base table used in the query.
There are a number of different optimization techniques that are described in the literature. Several of these techniques have been implemented in relational database management system products. Steinebrunn, M. et al in “Heuristic and Randomized Optimization for the Join Ordering Problem”, VLDB Journal, (1997) 6: 191–208, divides these optimization techniques into four classes: “deterministic”, “randomized”, “genetic”, and “hybrid”. Each class is characterized by a distinct technique for constructing an access plan in the specific search space.
“Deterministic” techniques involve constructing an access plan step-by-step, usually applying some heuristics and pruning techniques to search the solution space. The “Dynamic Programming Algorithm” is a deterministic technique and it is one of the optimization techniques most frequently implemented in database management systems. It was first used in System-R (see e.g., Selinger, P. G. et al, “Access Path Selection in a Relational Database Management System”, Proceedings of ACM SIGMOD Conference of Management of Data, pp 23–34, May 1979). Its search space is the space of left-deep processing trees. A left-deep tree is a processing tree that has only base tables or derived tables as the right child for any join. Left-deep trees are desirable because they reduce the need to materialize intermediate results and for several types of join implementations they result in more efficient execution plans.
The Dynamic Programming Algorithm uses an iterative method of constructing the join tree to join a number (n) of relations. It starts by finding all the access methods for a single relation (e.g., index scans and sequential scan). At each iteration, it eliminates all subtrees with cheaper alternatives based on cost and interesting properties such as the ordering property. The algorithm constructs at the kth iteration all solutions that join k tables based on the solutions kept in the (k−1)th iteration. However, one of the major disadvantages of this algorithm is the memory required to store the partial solutions at each iteration. The worst-case complexity of this enumeration process is O(4n), while O(n22) is the complexity of the number of joins enumerated; the number of partial solutions that must be stored in the worst-case is O(p2n) where p is the number of interesting orderings (see Ono, K. et al, “Measuring the Complexity of Join Enumeration in Query Optimization”, in Proceedings of the 16th International Conference on Very Large Data Bases, pp. 314–325, August 1990). Moreover, dynamic programming approaches for join enumeration are notorious for memory utilization as they require the retention of partial access plans. Experiments have shown that the time and memory required by the Dynamic Programming Algorithm are unacceptable for queries with large join degrees.
Several techniques deriving from the Dynamic Programming Algorithm have been proposed. A light-weight implementation of the Dynamic Programming Algorithm which considers the “bushy tree” search space is described by Vance, B. et al in “Rapid Bushy Join-Order Optimization with Cartesian Products”, in Proceedings of ACM SIGMOD Conference of Management of Data, pp 35–46, June 1996. As described in greater detail below, a processing tree is called a “bushy tree” if it has composite (i.e., not necessarily base relations) for the left and right children of the join nodes. The lightweight implementation of Vance extends the Dynamic Programming Algorithm to bushy trees by efficiently eliminating the number of partial solutions saved at each enumeration.
A “Minimum Selectivity Algorithm” is a deterministic technique which applies the following heuristic while building the solution trees: a left-deep tree is built step-by-step such that each intermediate result size is the smallest possible. The INGRES database management system implements a version of this algorithm in its decomposition strategy (see e.g., Wong, E. et al, “Decomposition—a Strategy for Query Processing”, in ACM Transaction on Database Systems, 1(3): pp 223–241, September 1976, and Kooi, R. P., “The Optimization of Queries in Relational Databases”, Ph.D. thesis, Case Western Reserve University, Cleveland, Ohio, September 1980). The Minimum Selectivity Algorithm generates only left-deep trees; at each step, a new relation, which is not already placed in the prefix of the plan, is chosen such that the selectivity of the resulting intermediate relation is the smallest.
A “KBZ Algorithm” is another deterministic technique introduced by Krishnamurthy, R. et al in “Optimization of Non-recursive Queries”, in Proceedings of the 12th International Conference on Very Large Data Bases, pp. 128–137, August 1986. The KBZ Algorithm is based on the previous work of Ibaraki, T. et al, “Optimal Nesting for Computing N-relational Joins”, in ACM Transaction on Database Systems, 9: pp. 482–502. The KBZ Algorithm computes the optimal solution for a query with no cycle, in O(n2) time. Hence, the KBZ Algorithm is a polynomial time algorithm. However it has several restrictions and limitations including limitations to the non-cyclic join trees, the cost model of a certain class, and the join methods that can be considered, namely only the join-nested loop joins. The work of Swami, A. et al in “A Polynomial Time Algorithm for Optimizing Join Queries”, in Proceedings of the IEEE Conference of Data Engineering, pp 345–354, April 1993, attempts to remove some of the restrictions imposed by the KBZ Algorithm and still retain the complexity polynomial. The technique of Swami can be applied to more than one join method, namely it provides for both sort-merge and join-nested loop joins to be considered.
The “randomized” technique derives join trees by means of so called “moves” that define how a solution tree is obtained from another solution tree. The solution trees are seen as points in the search space and the techniques in this category define different random walks from one solution tree to another solution tree. The “Iterative Improvement Algorithm” is a randomized technique which is based on the fact that a hill-climbing technique will most certainly arrive at a solution that is a local minima in the search space. This technique is presented in the work of Swami, A. et al, “Optimization of Large Join Queries”, in Proceedings of ACM SIGMOD Conference of Management of Data, pp 8–17, May 1988; Swami, A., “Optimization of Large Join Queries: Combining Heuristics and Combinatorial Techniques”, in Proceedings of ACM SIGMOD Conference of Management of Data, pp 367–376, May 1989; and Ioannidis, Y. E. et al, “Randomized Algorithms for Optimizing Large Join Queries”, in Proceedings of ACM SIGMOD Conference of Management of Data, pp 312–321, April 1990.
A “Simulated Annealing” technique discussed in the work of Ioannidis, Y. E. et al, “Query Optimization by Simulated Annealing”, in Proceedings of ACM SIGMOD Conference of Management of Data, pp 9–22, May 1987; and Swami, A. et al, “Optimization of Large Join Queries”, in Proceedings of ACM SIGMOD Conference of Management of Data, pp 8–17, May 1988, tries to fix the drawback of the iterative improvement algorithm, namely, that the best solution found is a high-cost local minima. The Simulated Annealing techniques allows a move to a neighbor even if the cost of the neighboring solution is higher than the current solution. Heuristics are used to define when such moves are to be considered and how the algorithm ends.
Other randomized techniques have been proposed that randomly probe the search space and stop when a reasonable access plan is found. This class of technique does not use transformations to move from one solution to another, and, therefore, may require less memory than other randomized algorithms (see e.g., Galindo-Legaria, C. et al, “Fast, Randomized Join-order Selection—Why Use Transformations”, in Proceedings of the International Conference on Very Large Data Bases, pp. 85–95, September 1994; and Galindo-Legaria, C. “Uniformly-distributed Random Generation of Join Orders”, in Proceedings of the International Conference on Database Theory, pp. 280–293, January 1995).
A major challenge in database systems is to significantly reduce the amount of memory required to enable a database system to be run on small computing devices. More particularly, what is required is a solution which enables queries to be effectively optimized while requiring a minimal amount of memory for generating the search space and for storing the information required for query optimization. Ideally, the solution will require that the database system only store enough information to be able to restore the best plans that have been found and the current plan that is being considered during the optimization process. The present invention satisfies these and other needs.