Enterprises typically store large amounts of data in computer database systems using database management system (DBMS) software to access the data. A relational database management system (RDBMS) logically organizes the data into relations. A relation can be thought of as a table where each row is a tuple and each column is a component of the relation designating an attribute.
Data can be extracted by querying the database for all tuples in a relation that meet a certain criteria. A join operation is performed to connect data from two or more relations wherein tuples with matching attributes are joined to form a new tuple. A join order is the order in which join operations are performed. A join method is a technique for performing the join of two relations.
A query language provides a set of commands for storing, retrieving and deleting data. The query languages for relational database management systems do not require users to specify the manner in which data needs to be accessed. Rather, query optimization algorithms in the database management systems select the manner in which queries will be processed.
For large databases, the manner used by a database management system to process a query has to provide access to data in a reasonable period of time in order to accommodate database users needing data to be retrievable without excessive delay.
There are three problems solved concurrently by a query optimizer in a relational database system when processing a query for retrieving data. The first problem is the selection of the access method to access each table specified in the query. This problem can be solved in linear time as a function of the number of relations in the query since the access method selection only involves examination of each available access method for each table.
The other two problems to be solved by the query optimizer are the selection of the order in which to join relations or tables in the query, the join order, and the selection of the method to be used for each join operation joining relations. A join plan comprises a join order scheme for joining the relations and a join method scheme of join methods to be used for each join operation. The solutions for these two problems, i.e., providing a good join plan, are complicated and require polynomial time solutions.
There are two practical procedures for selecting optional join orders and methods that are currently used in relational database software products and there are other solutions that have been proposed in the literature. However, all of the known procedures have shortcomings of not finding a good join plan or requiring too much programing time and too much space or memory.
The "Dynamic Programing Algorithm" is currently used in many RDBMS products and is described in P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie and T. G. Price, "Access Path Selection in a Relational Database Management System", in Proceedings of the ACM-SIGMOD International Conference on Management of Data, 1979, at pages 23-34. The Dynamic Programming Algorithm uses the iterative construction of a search tree to join R relations. First, the best way is found to join every pair of relations that is linked by a join predicate. Next, the best way to join composites of three relations is found by considering for each relation the best way to join it with each pair of relations joined as saved earlier in storage. This process continues for the composites of four relations and so on. Finally, a plan to join the R relations is found which can be implemented in a minimal amount of time. At each stage, the best plans are saved for every interesting order. An interesting order is the order defined by any join column.
The Dynamic Programming Algorithm is very complex in terms of memory and time. The number of solutions that must be stored for the worst case is O(N2.sup.R), where N is the number of interesting orders. The worst case time complexity has the same order. For any particular query, the complexity depends on the connectivity of the query graph.
Experiments have shown that the time and memory used by the Dynamic Programming Algorithm are unacceptable for queries involving large numbers of tables, especially on processors with small memories. Database users generally prefer to pose larger join queries than permitted in current database systems that are constrained by the limitations of the Dynamic Programming Algorithm. Therefore, while dynamic programming tends to be accurate in producing optimal join order and method selections, the time and memory constraints create a problem for most users. The cost of a given plan is based on the amount of time required to process a given query using a join plan. For some queries that will be performed hundreds of times, it is worthwhile to use the Dynamic Programming Algorithm to obtain a join plan that can be performed quickly. However, for other queries, it is not worth the time of using such an expensive algorithm.
The excessive time and space complexity of the Dynamic Programming Algorithm has motivated the development of algorithms that have polynomial time and memory complexity while still producing good join orders. One example of a polynomial time algorithm is the Greedy Algorithm.
The Greedy Algorithm is well-known to those skilled in the art. The algorithm uses every pair of relations in the query, even if there is no join predicate connecting them, to find the pair of relations that can be joined the quickest. There are O(R.sup.2) pairs to be considered, where R is the number of relations in the query. Next, the algorithm looks for the next join that can be performed the quickest. This process continues until all the relations have been joined. The time complexity for the algorithm is bounded by O(R.sup.2). The memory space required for the Greedy Algorithm is only that which is needed to store the query and its corresponding plan. Therefore, the Greedy Algorithm does not have the shortfalls of the Dynamic Programming Algorithm in terms of time and memory complexity. However, the Greedy Algorithm does not produce join plans that can be implemented as fast as the plans produced by the Dynamic Programming Algorithm.
Relational database query optimization systems have traditionally relied on the use of dynamic programming or a simple Greedy Algorithm for selecting join orders for executing queries. However, neither optimization process produces the plans that can be executed fast using reasonable amounts of time and memory. Dynamic Programming yields good solutions but has time and memory complexities that are exponential in the number of relations appearing in a query, becoming impractical when large numbers of relations are involved in the query. Conversely, the Greedy Algorithm is simple and fast to execute, but the quality of the solution for join orders and methods generated by the Greedy Algorithm is far less optimal than what is provided by the Dynamic Programming Algorithm.
Another query optimization algorithm was proposed by Ibaraki and Kameda which binds the optimal join order in polynomial time under some restrictive assumptions; T. Ibaraki and T. Kameda, "Optimal Nesting for Computing N-relational Joins", ACM Transactions on Database Systems, 9(3):482-502, October 1984. Krishanmurthy, Boral and Zaniolo developed an algorithm which improved the time complexity of the Ibaraki and Kameda algorithm which is described in R. Krishanmurthy, H. Boral and C. Zaniolo, "Optimization of Nonrecursive Queries", Proceedings of the Twelfth International Conference on Very Large Data Bases, pages 128-137, Kyoto, Japan, 1986. This algorithm is known as the KBZ Algorithm. The algorithm is stated in terms of a rooted join tree. The root is the first relation in the join sequence. By considering every relation in the join as the root and picking the best solution among the R choices, the problem is reduced to a problem of finding the best join order for a rooted join tree.
The KBZ Algorithm provides a join order optimization algorithm in polynomial time. However, the KBZ Algorithm includes restrictive assumptions and does not provide the join plans that are as good as the plans produced by the Dynamic Programming Algorithm.
The KBZ Algorithm has a restriction that the cost equations used to determine the effectiveness of the join methods must be of a particular multiplicative form. Specifically, according to the KBZ Algorithm, the cost of joining two tables R.sub.1 and R.sub.2, where R.sub.1 is the outer relation, must be of the form C(R.sub.1)+.parallel.R.sub.1 .parallel.C(R.sub.2) where the cardinality of relation R, .parallel.R.parallel., is taken after applying selection predicates. For R.sub.1, the outer relation, C(R.sub.1) is the cost of obtaining relation R.sub.1. For R.sub.2, the inner relation, C(R.sub.2) is the cost of joining relation R.sub.2 with each tuple of the outer relation R.sub.1. C(R.sub.2) takes into account a join predicate filter factor. The cost equations for the nested loops join method must satisfy this required multiplicative form.
Another restriction of the KBZ Algorithm is that interesting orders are considered not to exist. Interesting orders affect join orders. The best join order could be changed if the join column of one relation is exactly the clustering column of that relation. C(R), the time cost of joining a relation, changes if the sort order on relation R is exploited by the outer relation. For example, a nested loop join performed through a clustered index has a value for C(R) that is lower than the C(R) value for a nested loop join through a nonclustered index. The KBZ Algorithm requires C(R) to be independent of join order, and hence ignores interesting orders.
The KBZ Algorithm is also restrictive in that it assigns a rank function to each relation. Rank is defined as (.parallel.R.parallel.-1)/C(R). The cardinality of R, .parallel.R.parallel., has a multiplicative relationship with the number of tuples in the result of joins of relation R. C(R) relates to the cost of performing a join with relation R. KBZ requires that the relations with the smallest ranks be joined first. However intuitively, the joins with relations with larger cardinalities should be joined later because these require carrying a large number of tuples in the intermediate join results. Also, intuitively the relations with large C(R) should be joined when the number of rows in the composite is small rather than when it is large. Usually this is during the initial portion of the join order. However, KBZ requires joins involving relations with large .parallel.R.parallel. and/or small C(R) which have a large rank be postponed.
The foregoing restrictions limit the ability of the KBZ Algorithm to find optimal plan results within non-complex time and space constraints. The space complexity of the KBZ Algorithm is O(R) and the time complexity is O(R.sup.2). However, the quality of the solutions obtained by KBZ is not at the level of the dynamic programming.
An experiment was conducted using 100 random generated join queries containing R relations, for R ranging from 2 to 19. The queries were generated by varying various query parameters such as relation cardinalities, indexes on columns, clustering columns, shapes of drawing graphs and drawing selectivities. The time cost of each join order plan produced by KBZ was compared to plans produced by the Greedy Algorithm and the Dynamic Programming Algorithm. The cost ratio is defined as the ratio of the cost of the plan produced by an algorithm to the cost of the plan produced for the same query by the Dynamic Programming Algorithm.
For the Greedy Algorithm, two-thirds of the plans that were produced had time costs that were more than 20% of the plans produced by the same respective queries by the Dynamic Programming Algorithm, half of the plans cost more than twice that of dynamic programming, 22% were more expensive by at least one order of magnitude, and 3% were more expensive by at least two orders of magnitude. The average cost of the join order generated by the Greedy Algorithm was 37 times higher than the Dynamic Programming Algorithm solution cost. The quality of solutions provided by the KBZ Algorithm was comparable to those produced by the Greedy Algorithm.
Therefore, while KBZ improves on the Greedy Algorithm in terms of producing more optimal results and improves on the dynamic programming in terms of having smaller space and time complexity requirements, KBZ does not provide good solutions with a minimal space/time complexity.
A number of optimization algorithms have been proposed that use randomization, such as the Iterative Improvement and Simulated Annealing algorithms. However, these algorithms do not provide optimal join processing with non-complex time and memory requirements.
The randomization algorithms use transformations of join orders. An example of such a transformation is interchanging the position of two relations in one join order to get a new join order. The two join orders can be analyzed to determine if the transformation is profitable.
In the Iterative Improvement method, a large number of initial join orders are tried. For each initial order, the following procedure is performed. For any join order, a large number of transformations are attempted until some maximum number of transformations have been tried, or a transformation is found that leads to a better join order. If a better join order is not found, the next randomly generated initial order is used. At all times, the best join order encountered is stored.
Simulated Annealing is similar to Iterative Improvement except that transformations that lead to a higher cost join order are accepted with a certain probability. The probability threshold is decreased during the optimization process. The Simulated Annealing method only calls for a single run starting from an initial join order.
A hybrid of the Iterative Improvement method and Simulated Annealing method was proposed by Y. E. Ioannidis and Y. Kang in "Randomized Algorithms for Optimizing Large Join Queries", in Proceedings of ACM-SIGMOD International Conference on Management of Data, pages 312-321, 1990. In the hybrid algorithm, Iterative Improvement is performed using a small number of initial join orders. Simulated Annealing is performed starting with the best order found by Iterative Improvement.
Experimental results have shown that given enough time, Simulated Annealing obtains solutions comparable to those obtained by Iterative Improvement. However, the Iterative Improvement ordering plan was never better than the KBZ Algorithm ordering plan by more than 50% on the average. These randomized algorithms do not provide results that are equal to the optimization provided by the Dynamic Programing Algorithm without the space and time complexity. Therefore, the Simulated Annealing algorithm does not overcome the problem of providing the dynamic programming optimization results for join processing with polynomial space/time complexity.