A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
This application includes a transmittal under 37 C.F.R. Sec. 1.52(e) of a Computer Program Listing Appendix. The Appendix comprises text files that are IBM-PC machine and Microsoft Windows Operating System compatible. The files include the following list of files.
Object Description: File 1; Object ID: qog_subplannode.txt, created: May 6, 2003, 10:47 am, size 3.62 KB; Object Contents: Source Code.
Object Description: File. 2; Object ID: qog_optgov.txt, created: May 6, 2003, 10:48 am, size 6.07 KB; Object Contents: Source Code.
All of the material disclosed in the Computer Program Listing Appendix can be found at the U.S. Patent and Trademark Office archives and is hereby incorporated by reference into the present application.
1. Field of the Invention
The present invention relates generally to information processing environments and, more particularly, to a database management system (DBMS) having a methodology for distributing query optimization effort over large search spaces.
2. Description of the Background Art
Computers are very powerful tools for storing and providing access to vast amounts of information. Computer databases are a common mechanism for storing information on computer systems while providing easy access to users. A typical database is an organized collection of related information stored as xe2x80x9crecordsxe2x80x9d having xe2x80x9cfieldsxe2x80x9d of information. As an example, a database of employees may have a record for each employee where each record contains fields designating specifics about the employee, such as name, home address, salary, and the like.
Between the actual physical database itself (i.e., the data actually stored on a storage device) and the users of the system, a database management system or DBMS is typically provided as a software cushion or layer. In essence, the DBMS shields the database user from knowing or even caring about underlying hardware-level details. Typically, all requests from users for access to the data are processed by the DBMS. For example, information may be added or removed from data files, information may be retrieved from, or updated in, such files; and so forth, all without user knowledge of underlying system implementation. In this manner, the DBMS provides users with a conceptual view of the database that is removed from the hardware level. The general construction and operation of a database management system is known in the art. See e.g., Date, C., xe2x80x9cAn Introduction to Database Systems, Volume I and IIxe2x80x9d, Addison Wesley, 1990; the disclosure of which is hereby incorporated by reference.
DBMS systems have long since moved from a centralized mainframe environment to a decentralized or distributed environment. One or more PC xe2x80x9cclientxe2x80x9d systems, for instance, may be connected via a network to one or more server-based database systems (SQL database server). Commercial examples of these xe2x80x9cclient/serverxe2x80x9d systems include Powersoft (r) clients connected to one or more Sybase (r) SQL Anywhere (r) Studio (Adaptive Server (r) Anywhere) database servers. Both Powersoft and Sybase SQL Anywhere Studio (Adaptive Server Anywhere) are available from Sybase, Inc. of Dublin, Calif.
In today""s computing environment, database technology can be found on virtually any device, from traditional mainframe computers to cellular phones. Sophisticated applications, whether human resources information systems or sales force automation systems, can xe2x80x9cpushxe2x80x9d much of their complexity into the database itself. Indeed, this represents one of the main benefits of database technology. The challenge, however, is to support these applications, and the complex queries they generate, on small computing devices. At the same time, users expect the productivity and reliability advantages of using a relational DBMS.
One purpose of a database system is to answer decision support queries. A query may be defined as a logical expression over the data and the data relationships set forth in the database, and results in the identification of a subset of the database. Consider, for instance, the execution of a request for information from a relational DBMS. In operation, this request is typically issued by a client system as one or more Structured Query Language or xe2x80x9cSQLxe2x80x9d queries for retrieving particular data (e.g., a list of all employees earning $10,000 or more) from database tables on a server. In response to this request, the database system typically returns the names of those employees earning $10,000, where xe2x80x9cemployeesxe2x80x9d is a table defined to include information about employees of a particular organization. The syntax of SQL is well documented, see e.g., xe2x80x9cInformation Technologyxe2x80x94Database languagesxe2x80x94SQLxe2x80x9d, published by the American National Standards Institute as American National Standard ANSI/ISO/IEC 9075: 1992, the disclosure of which is hereby incorporated by reference.
SQL queries express what results are requested but do not state how the results should be obtained. In other words, the query itself does not tell how the query should be evaluated by the DBMS. Rather, a component called the optimizer determines the xe2x80x9cplanxe2x80x9d or the best method of accessing the data to implement the SQL query. The query optimizer is responsible for transforming an SQL request into an access plan composed of specific implementations of the algebraic operator selection, projection, join, and so forth. The role of a query optimizer in a relational DBMS system is to find an adequate execution plan from a search space of many semantically equivalent alternatives. One component of this task is join enumeration. Since relational databases typically only provide physical operators that can join two tables at a time, an n-way join must be executed as a sequence of two-way joins, and there are many possible such sequences. The optimizer must enumerate some or all of these sequences and choose one based on estimates of their relative execution costs. In general, this problem is NP-complete. (See e.g., Ibaraki, T. et al xe2x80x9cOn the Optimal Nesting Order for Computing N-Relational Joinsxe2x80x9d, ACM Transactions on Database Systems, 9(3): 482-502, September 1984. See also e.g., Ono, K. et al xe2x80x9cMeasuring the Complexity of Join Enumeration in Query Optimizationxe2x80x9d, in Proceedings of the 16th International Conference on Very Large Data Bases, pp. 314-325, Brisbane, Australia, August 1990, Morgan Kaufmann; and Steinbrunn, M. et al xe2x80x9cHeuristic and Randomized Optimization for the Join Ordering Problemxe2x80x9d, The VLDB Journal, 6(3): 191-208, August 1997.) An NP-complete problem is any one of a class of computational problems for which no efficient solution has been found.
In practice, query optimizers restrict the sequences or plans that are considered so that an adequate plan can be found in a reasonable amount of time. Examples of such limitations include: restricting the search to left-deep trees where the inner operand of each join is a single table (see e.g., Cluet, S. et al xe2x80x9cOn the Complexity of Generating Optimal Left-deep Processing Trees with Cross Productsxe2x80x9d, in Proceedings of the Fifth International Conference on Database Theory (ICDT ""95), pp. 54-67, Prague, Czech Republic, January 1995, Springer-Verlag); requiring each join to have at least one equi-join predicate of the form (column1=column2); considering only a subset of the available physical join methods (e.g., only nested loop joins); considering only a subset of the possible table access methods (e.g., only index scans); and deferring Cartesian products as late in the plan as possible. (See e.g., Morishita, S. xe2x80x9cAvoiding Cartesian Products for Multiple Joinsxe2x80x9d, Journal of the ACM, 44(1): 57-85, January 1997. Also see e.g., Selinger, P. G. et al xe2x80x9cAccess Path Selection in a Relational Database Management Systemxe2x80x9d, in ACM SIGMOD International Conference on Management of Data, pp. 23-34, Boston, Mass., May 1979.) Choosing a set of restrictions for a given query defines a search space of possible plans that may be considered by a search operation.
Deciding how to restrict a search space for a particular query is not straightforward. On one hand, a larger space improves the possibility of finding a better plan. On the other, it also guarantees an increase in the cost of performing the search. If a query is to be optimized once and executed repeatedly, a longer optimization time may be justified. For interactive queries, however, one should optimize the total time spent on execution plus the time spent on the optimization process itself.
A difficulty with choosing search space restrictions is that there is not always a direct, linear relationship between the size of the search space and the optimization time. This is the case because most search operations prune (i.e., do not consider) parts of the space that provably cannot contain an optimal plan. The amount of such pruning that is possible can vary considerably depending on the cost distribution of the plans in the space and the order in which plans are visited.
Manual control of the parameters that restrict search space size may sometimes be useful. It is usually better, however, if a query optimizer makes such choices automatically. A technique where a series of (not necessarily disjoint) search spaces are defined and searched in sequence is described in U.S. Pat. No. 5,301,317 by Lohman, G. M. et al entitled xe2x80x9cSystem for Adapting Query Optimization Effort to Expected Execution Timexe2x80x9d. In the system described by Lohman, when the search of one space is finished, the cost of searching the next space is estimated and compared to the estimated execution cost of the best plan that has been found. The overall search is halted if the estimated cost of searching the next space exceeds the expected benefit. It is difficult to predict the benefit, but a heuristic is to assume that it will be some fixed fraction (e.g., ten percent) of the estimated cost for the best plan that has been identified. It is also difficult to estimate the cost of searching a space. An upper bound can be obtained by multiplying the cost of enumerating a single plan by the total size of the space. However, this really is just an upper bound since, as noted above, the amount of pruning performed by a search operation may vary considerably. Overall, this technique can be seen as one possible way of automatically choosing search space parameters. However, an undesirable characteristic of this approach is that it may enumerate some plans twice if they appear in more than one search space. Another problem is that the decision to stop the search is only considered after each complete space is finished. For large join degrees, every one of these spaces may be very large. As such, the technique does not allow fine-grained control over how much total effort is spent upon enumeration.
A join enumeration operation based on depth-first search of a space of left-deep trees is described by Bowman, I. T. and Paulley, G. N. in xe2x80x9cJoin Enumeration in a Memory Constrained Environmentxe2x80x9d, in Proceedings, Sixteenth IEEE International Conference on Data Engineering, pp. 645-654, San Diego, Calif., IEEE Computer Society Press, March 2000. This depth-first join enumeration search operation is also described in commonly-owned U.S. Pat. No. 6,516,310 titled xe2x80x9cSystem and Methodology for Join Enumeration in a Memory-Constrained Environmentxe2x80x9d, the disclosure of which is hereby incorporated by reference in its entirety, including any appendices or attachments thereof, for all purposes. One advantage of the approach described by Bowman and Paulley is that it uses very little memory relative to the widely used technique of dynamic programming. (See e.g., Selinger, P. G. et al xe2x80x9cAccess Path Selection in a Relational Database Management Systemxe2x80x9d, above; Kabra, N. et al xe2x80x9cOPT++: An object-oriented implementation for extensible database query optimizationxe2x80x9d, The VLDB Journal, 8 (1): 55-78, May 1999; Pellenkoft, A. xe2x80x9cProbabilistic and Transformation-based Query Optimizationxe2x80x9d, PhD thesis, Wiskunde en Informatica, CWI, Amsterdam, The Netherlands, November 1997; and Scheufele, W. et al xe2x80x9cEfficient Dynamic Programming Algorithms for Ordering Expensive Joins and Selectionsxe2x80x9d, in Advances in Database Technology-Proceedings of the 6th International Conference on Extending Database Technology, pp. 201-215, Valencia, Spain, Springer-Verlag, March 1998.) Another advantage of the approach described by Bowman and Paulley is that complete plans are generated continuously during the search. This means that it is possible to interrupt the search at any time after the first plan is found and simply keep the best plan found at the time the search is halted. Therefore, fine-grained control over the amount of enumeration effort is possible.
Early halting is a simple way of limiting the computational effort spent on join enumeration. However, a problem with simply stopping the search early is that the search effort is not very well distributed over the search space. If only a small fraction of the search space is visited, then most of the plans considered are typically very similar. An improved solution for limiting the computational effort spent on join enumeration is required that enables the search effort to be limited in a more effective way. The present invention provides a solution for these and other needs.
In a database system, a method for optimization of a query is described. When a query is received which requests data from a database, a plurality of plans which can be used for obtaining data requested by the query are enumerated. A search tree is created based upon these plans, with nodes of the search tree representing segments of the plans. A limited number of nodes of the search tree are selected for evaluation to limit the effort spent on query optimization. A complete plan for execution of the query is generated by evaluating the selected nodes of the search tree and, if the evaluation determines that a given node is more favorable than comparable nodes previously evaluated, retaining the given node as part of the complete plan.