1. Field of Invention
The present invention relates generally to the field of database management systems. More specifically, the present invention is related to optimizing complex database queries.
2. Discussion of Prior Art
Queries in a relational database management system (DBMS) typically specify what tables are to be accessed, rather than how those tables are to be accessed, using the standard non-procedural query language, simple query language (SQL). The SQL Query Compiler, and specifically the Query Optimizer (or just Optimizer) component of the Query Compiler, automatically determines the appropriate way to access and join the tables referenced in a single SQL query. This is done by considering many possible access plans, evaluating the estimated cost of each plan, and choosing the cheapest plan in estimated cost. Since the (inner) join operation is both commutative and associative, many possible join sequences need to be evaluated by the Optimizer. In the worst case, the number of different possible join sequences can be exponential in the number of tables. The complexity of this so-called “join (sequence) enumeration” is the major driver of the time to optimize SQL queries.
Currently, DB2® Universal Data Base (UDB) on the z/OS® platform and DB2 UDB® for the Linux®, Unix®, and Windows® platforms use the well-known technique of dynamic programming to determine the optimal join ordering. This technique finds the (provably) optimal solution, but in the worst case can have execution times that are exponential in the number of tables referenced by the query. This can become a problem for optimization time for very complex queries, i.e., those referencing more than about 10 tables, depending upon the connectivity and shape of the query graph, as documented in the reference to Ono et al. entitled, “Measuring the Complexity of Join Enumeration in Query Optimization.” Such complex queries are becoming increasingly common due to the use of nested views. For example, many queries in Siebel's CRM suite™ and in PeopleSoft® applications have more than 20 tables. For such queries, optimization using dynamic programming consumes a prohibitive amount of time and space. For example, a query having 10 tables could have as many as 2,300 join sequences, but a query having 20 tables could have as many as 4.98 million such sequences. For each such sequence, many alternative ways to implement that sequence are considered, so the number of plans actually considered may be 10 or more times these numbers.
As a result, virtually all DBMS products have devised a heuristic method for rapidly determining a join sequence when the query is very complex, i.e., has more than a handful of tables. For optimizing complex queries with a large number of tables, DB2 UDB for Linux, Unix, and Windows has an alternative join enumeration algorithm called “greedy” that is far more efficient at optimizing than dynamic programming, but may choose plans with sub-optimal execution times. The greedy join enumeration algorithm was disclosed in a paper to Lohman entitled, “Heuristic Method for Joining Relational Data Base Tables.” The greedy algorithm greedily chooses the cheapest join at each iteration of the algorithm, without regard for the consequences to later joins. In this respect, it is “myopic”, never thinking about the consequences of its choices on future joins; hence, it is not very robust. It is an extremely fast algorithm, being polynomial in the number of tables joined, but in certain circumstances it can produce plans that run slowly because it cannot foresee the consequences of its choices to future joins. For example, choosing the cheapest join first might produce a large number of rows in the intermediate result, making subsequent joins for that query more expensive (cost is roughly proportional to the number of rows to be processed). Sometimes more expensive joins are more selective, i.e., they filter out more unwanted rows, resulting in less work for future joins and therefore possibly a cheaper plan overall. In fact, the greedy algorithm was itself an improvement upon very simplistic algorithms that considered only the number of rows in each intermediate result but not cost, such as in papers by Palermo entitled, “A Data Base Search Problem” and the U.S. patent to Krishna. This was called the “least growth principle”. However, basing optimization decisions simply on the number of rows fails to take into consideration that the same number of rows can sometimes be accessed more efficiently if indexes exist. So, for complex queries, the greedy method performs optimization quite fast, but is not very robust because: (1) it myopically decides the join sequence without regard to the future, (2) no one metric is optimal for all possible queries, (3) it produces only one plan join sequence) per query, and (4) it does so without being able to backtrack from decisions later found to be sub-optimal.
Other products may use other heuristics, e.g., Informix® uses a heuristic using the cardinality (i.e., number of rows) and referential integrity constraints between primary keys and foreign keys. This suffers most of the same problems that the greedy heuristic suffers.
There have been many heuristics and algorithms proposed in the open literature on query optimization, such as: simulated annealing, genetic algorithms, iterative improvement, A*, random sampling, and many other variants. A paper by Jarke et al. entitled, “Query Optimization in Database Systems” provides for an overview of query optimization, including join sequence enumeration techniques.
The following patents describe query execution and optimization in general.
The U.S. patent to Shekita et al. (U.S. Pat. No. 5,671,403) provides for an iterative dynamic programming system for query optimization with bounded complexity. Disclosed within is a query optimizer for optimizing join queries in a relational database system by iterative application of dynamic programming to select optimal subgraph join execution plans.
The U.S. patent to Dalal et al. (U.S. Pat. No. 5,701,461) provides for a method and system for accessing a remote database using pass-through queries. The disclosed system includes a query processor to compile the database query, a remote engine to retrieve table structure information and fetch data as needed, and a temporary table manager to manage caching of the fetched data.
The U.S. patent to Krishna (U.S. Pat. No. 6,138,111) provides for a method and apparatus for optimizing the processing of join queries based on join cardinality, wherein specific embodiments implement the methods in relational database management systems.
Whatever the precise merits, features, and advantages of the above-cited references, none of them achieves or fulfills the purposes of the present invention.