The present invention relates to database management systems for use with data processors, and more particularly toward optimizing the computer-implemented process of retrieving data from databases.
Enterprises typically store large amounts of data in computer database systems using database management system (DBMS) software to access the data. A relational database management system (RDBMS) logically organizes the data into relations. A relation can be thought of as a table where each row is a tuple and each column is a component of the relation designing an attribute.
Data can be extracted by querying the database for all tuples in a relation that meet a certain criteria. A join operation is performed to connect data from two or more relations wherein tuples with matching attributes are joined to form a new tuple. A join order is the order in which join operations are performed. A join method is a technique for performing the join of two relations.
A query language provides a set of commands for storing, retrieving and deleting data. The query languages for relational database management systems do not require users to specify the manner in which data needs to be accessed. Rather, query optimization algorithms in the database management systems select the manner in which queries will be processed.
For large databases, the manner used by a database management system to process a query has to provide access to data in a reasonable period of time in order to accommodate database users needing data to be retrievable without excessive delay.
There are three problems solved concurrently by a query optimizer in a relational database system when processing a query for retrieving data. The first problem is the selection of the access method to access each table specified in the query. This problem can be solved in linear time as a function of the number of relations in the query since the access method selection only involves examination of each available access method for each table.
The other two problems to be solved by the query optimizer are the selection of the order in which to join relations or tables in the query, the join order, and the selection of the method to be used for each join operation joining relations. A join plan comprises a join order scheme for joining the relations and a join method scheme of join methods to be used for each join operation. The solutions for these two problems, i.e., providing a good join plan, are complicated and require polynomial time solutions.
There are two practical procedures for selecting optional join orders and methods that are currently used in relational database software products and there are other solutions that have been proposed in the literature. However, all of the known procedures have shortcomings of not finding a good joint plan or requiring too much programming time and too much space of memory.
The xe2x80x9cDynamic Programming Algorithmxe2x80x9d is currently used in many RDBMS products and is described in P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie and T. G. Price, xe2x80x9cAccess Path Selection in a Relational Database Management Systemxe2x80x9d, in Proceedings of the ACM-SIGMOID International Conference on Management of Data, 1979, at pages 23-24. The Dynamic Programming Algorithm uses the iterative construction of a search tree to join R relations. First, the best way is found to join every pair of relations that is linked by a join predicate. Next, the best way to join composites of three relations is found by considering for each relation the best way to join it with each pair of relations joined as saved earlier in storage. This process continues for the composites of four relations and so on. Finally, a plan to join the R relations is found which can be implemented in a minimal amount of time. At each stage, the best plans are saved for every interesting order. An interesting order is the order defined by any join column.
The Dynamic Programming Algorithm is very complex in terms of memory and time. The number of solutions that must be stored for the worst case is O(N2R), where N is the number of interesting orders. The worst case time complexity has the same order. For any particular query, the complexity depends on the connectivity of the query graph.
Experiments have shown that the time and memory used by the Dynamic Programming Algorithm are unacceptable for queries involving large numbers of tables, especially on processors with small memories. Database users generally prefer to pose larger join queries than permitted in current database systems; these system are often constrained by the limitations of the Dynamic Programming Algorithm. Therefore, while dynamic programming tends to be accurate in producing optimal join order and method selections, the time and memory constraints create a problem for most users. The cost of a given plan is based on the amount of time required to process a given query using a join plan. For some queries that will be performed hundreds of times, it is worthwhile to use the Dynamic Programming Algorithm to obtain a join plan that can be performed quickly. However, for other queries, it is not worth the time of using such an expensive algorithm.
The excessive time and space complexity of the Dynamic Programming Algorithm has motivated the development of algorithms that have polynomial time and memory complexity while still producing good join orders. These algorithms do not typically have the shortfalls of Dynamic Programming Algorithm in terms of time and memory complexity. However, these algorithms typically do not produce join plans that can be implemented as fast as the plans produced by the Dynamic Programming Algorithm.
A number of optimization algorithms have been proposed that use randomization, such as the Iterative Improvement and Simulated Annealing algorithms. However, these algorithms do not provide optimal join processing with non-complex time and memory requirements.
Accordingly, what is needed is a system and method for determining the lowest cost permutation for joining relational database tables. The present invention addresses such a need.
A method and system for calculating the lowest cost join permutation for queries which involve multiple database tables is disclosed. A method and system comprises building all initial table composites and determining the cost of building each member of the set of composites one table at a time. The method and system includes comparing the cost of building each member of the composites to a threshold value and skipping any member of the set of composites that has a value greater than the threshold value. The method and system includes calculating the cost of building all possible members of the next successively larger set of composites which can be built from members which do not exceed the threshold value and saving the lowest cost means for building each member of the next successively larger composites. The method and system includes determining if a final composite has been built and stepping to the next composites if the final composite has not been built. Finally, the method and system includes repeating the above until the final composite has been built.
In the present invention, the costs of building a given composite are compared to a threshold. The threshold is determined by choosing a minimum cost composite from all the composites which have N tables in them. The costs of adding each missing table, one at a time, to this composite are calculated until the full N-table composite has been xe2x80x9cbuiltxe2x80x9d. The minimum cost of building the N-table composite is the threshold. The threshold value is recalculated each time the successively larger set of composite tables has been built. If any new calculation results in a lower threshold value, the newer value replaces the current threshold value.