A database is defined as a collection of items, organized according to a data model and accessed via queries. The present invention is concerned with relational databases. In a relational database or relation, the data are organized in columns, with each column comprising one attribute of the relation. Each column or attribute of a relation has a domain which comprises the data values in that column. One row of a relation, which includes one value from each attribute, is known as a tuple.
From its initial conception in the late 1960's until today, the relational database has evolved so that it is now utilized in a large number of commercially available database systems which execute on the entire range of computer systems from personal computers to mainframes.
A variety of operations may be performed on a relation database. (See e.g., C. T. Wu et al, "Distributed Query Processing", Computing Surveys, Vol. 16, N.Y. Dec. 1984). Some of these operations are the select, project, and join operations. When these operations are executed on one or more relations, other relations result. The select operation involves selecting those tuples of relation R whose A attribute value is equal to a specified constant C. This select operation is denoted by R.multidot.A=C or S.sub.R.multidot.A=C. Operators other than equal such as "greater than", "less than" or "not equal" are allowed.
The projection of relation R on a set of attributes T is denoted by R.multidot.T or P.sub.R.multidot.T and is obtained by discarding all columns of R that are not in T and eliminating duplicated rows.
The join of relation R.sub.2 having the attributes A, B, and C with relation R.sub.1 having attributes D and A on the common joining attribute A is denoted by R.sub.2 [ABC].times.R.sub.1 [DA] or by the symbol J. The join is obtained by concatenating each row of R.sub.1 with each row of R.sub.2 whenever the A attribute values of the two rows are equal. Since this operation results in two identical columns, one column may be eliminated. If there is no common joining attribute, the join operation involves forming a complete cartesian product of the two relations by concatenating all rows in one relation with all of the rows of the other relation.
In order for a user to access particular information from a relational database system, a query compiler comprising part of a database management system converts a user request into a query. A query is comprised of a set of operations performed on one or more input relations to yield a resulting relation which contains information desired by a user. The set of operations forms an access plan which enables the user to obtain particular information from a database system comprising one or more relations. The set of operations is typically arranged in a hierarchy known as a query tree. In general, queries are utilized to access particular information from one or more relational databases.
Queries may be carried out in a distributed processing system. The term distributed processing system includes multiprocessor and multicomputer systems. A multiprocessor system refers to a system comprising a plurality of tightly coupled processors which share the same operating system control and resources (e.g. disks, printers, CPU's, and memories). A multicomputer system comprises a plurality of independent computers interconnected via a network, wherein each computer controls its own resources. In such systems, a distributed operating system may be utilized to hide computer boundaries from the standard user.
As the volume of data that user applications store, retrieve and manipulate continues to grow, efficient processing of user requests, namely queries, becomes fundamental. Otherwise, the high computation time and inefficient use of resources associated with queries that access large databases becomes prohibitive for the user.
To reduce processing time and to utilize resources efficiently, the execution of queries must be optimized. Query optimization involves minimizing I/O, memory, and CPU requirements to provide a lower expected processing cost for carrying out the operations comprising a query. To accomplish this, query optimization is concerned with,
1. operation optimization, i.e., optimizing the execution of particular operations; PA0 2. site selection optimization, i.e., optimizing the distribution of one or more relations over particular processing sites in a distributed system to optimize execution of particular operations; PA0 3. order optimization, i.e, optimizing the order in which the operations of a query are performed.
Query optimization may be static or dynamic (see, e.g., Y. H. Lee et al, "Progressive Access Plan, Selection For Relational Data Base Systems", IBM Technical Report, Sep. 1, 1987; Goetz Graefe et al, "Dynamic Query Evaluation Plans", Association for Computing Machinery, 1989; M. Nakayama et al, "Has Partitioned Join Method Using Dynamic Destaging Strategy", 14th International Conference on Very Large Data Bases, Aug., 1988; R. Demolombe, "Estimation of the Number of Tuples Satisfying a Query Expressed in Predicate Calculus Language", IEEE, 1980; C. A. Lynch, "Selectivity Estimation and Query optimization in Large Databases With Highly Skewed Distributions of Column Values", Proceedings of the 14th VLDB Conference, Los Angles, Calif., 1988; B. Muthuswamy et al, "A Detailed Statistical Model for Relational Query optimization," Association For Computing Machinery, 1985).
To understand the difference between the static and dynamic query optimization strategies, it is useful to consider the processing of a query in three stages. The first stage is the user input stage wherein the user communicates to a database management system a request for information. The second stage involves use of a query compiler or query translator comprising part of the database management system to convert the user request into a set of database operations which form an access plan to obtain the information requested by the user. The third stage involves execution of the set of operations to obtain the information for the user.
The sequence of operations or access plan is optimized to the extent possible by the query compiler or query translator during the second stage mentioned above. This optimization is referred to as static query optimization. Static query optimization utilizes information about the relations to be operated on that is available prior to execution time to optimize operation ordering, site selection, and operation execution time. Such information includes for example data storage locations and attribute value distributions. However, static optimization may yield non-optimal execution schemes. The non-optimal schemes result from utilizing predetermined information, which may be the best information available at the time of query compilation, but may not actually reflect the state of certain relations during the actual execution. In particular, it is difficult to optimize an operation in the middle of a query tree during the compilation stage because the characteristics of the input relations to this operation are not fully known until the immediately preceding operations are executed. This is often the case when the immediately preceding operations are highly biased value dependent operations such as certain selection and join operations.
To overcome the limitations of static query optimization, dynamic query optimization may be utilized. Dynamic query optimization involves optimization of a query as part of its execution during the third (i.e., execution) stage mentioned above. When dynamic query optimization is utilized, the statically optimized access plan provided by the query compiler during the second (i.e. compilation) stage mentioned above is further optimized during the execution stage using currently available information which was not available during the compilation stage. In particular, during the execution stage, based on information which becomes available during the execution, the access plan is further optimized by optimizing the order in which certain operations are executed, or the site at which certain operations are executed in a distributed processing system, or the execution of an operation itself (e.g., by minimizing the number of comparisons utilized to execute an operation). The idea is to delay or refine the access plan selection decisions, which cannot be made meaningfully at compile time, until run time when the uncertainty on the information is reduced. This provides the access plan with some self correction capability to improve or change its course as more information becomes available.
In short, query optimization is crucial to relational database performance. The static query optimization approach selects an access plan with the minimum projected cost at compilation time based on then available estimated measures of the relations involved. Since these estimates can deviate substantially from their true values during actual query execution, the selected access plan may be far from optimal. In the dynamic query optimization approach, the compiler generates an approximate access plan which is further optimized at run time based on information which becomes available at run time. Thus, instead of using inaccurate precomputed information about the relations to optimize the access plan, accurate information obtained at run times is utilized.
A shortcoming of currently available dynamic query optimization approaches is that there is a significant processing overhead stemming from the large volume of data that must be characterized to carry out the optimization. Thus, the overhead required to carry out presently available dynamic query optimization schemes may outweigh the efficiencies gained in the actual query optimization.
In view of the foregoing, it is an object of the present invention to provide a technique to dynamically optimize queries to be performed on relational databases. More particularly, it is an object of the present invention to provide a dynamic query optimization technique which does not involve the analysis of large volumes of data but instead uses partial or statistical information obtained from the relations to be operated on at the time of optimization, which is at or just before execution time.