The present invention relates generally to the field of database systems, and more specifically to query optimization on a database management system (DBMS).
The advent of economical, yet powerful computers made possible by advances in processor, memory and data storage devices has made computers an integral part of modern companies. An important class of applications for these computers includes a DBMS where information is collected and organized according to a data model and searched for information via queries. The DBMS allows users to perform operations such as locating, adding, deleting and updating records stored in the computer without a detailed knowledge of how the information making up the records is actually stored in the computer.
One common type of DBMS is known as a relational DBMS in which stored information appears to the user as a set of tables, each of which is termed a relation. In each relation, the information is arranged in rows and columns, with columns of data being related to each other by one or more predetermined functions.
To access particular information in a relational DBMS, a query compiler converts a user request that is typically expressed in a Structured Query Language (SQL) into a sequence of operations to be performed on one or more relations to yield a solution responsive to the user's request. The user's request often includes one or more conditional statements that express the conditions that information in a solution must satisfy. These conditional statements may be expressed as multiple logical predicates (e.g., greater than, less than, etc.) that evaluate to true or false when applied to the information in the relational DBMS. Each time a conditional statement is met (returns true) by the information in a row, the row is returned as a result. To minimize processing time and utilize resources more efficiently, a query optimizer is typically deployed to minimize disk input/output, memory usage, and processor loading, thus providing a decreased expected processing time and cost for performing operations associated with the query.
The optimizer uses information about the relations including statistical information and other characteristics of the relations to translate the query into an efficient query execution plan. The query execution plan defines the methods and sequence of operations, including where predicates are applied in order to access the data and return the appropriate information sought.
In order for the optimizer to generate a good query plan (i.e., a sequence of operations that minimizes cost and execution time) accurate cost estimates of alternative plans are needed. Cost estimates typically are expressed as a combination of expected number of disk I/Os, number of rows that must be processed, and the CPU cost of evaluating the predicates and the various operations on the data processed to answer the query. If the estimates are inaccurate, the optimizer may produce an inefficient query execution plan, resulting in poor search performance.
An important metric in the optimization process, known as cardinality, is the number of rows in the result set or an interim stage within the query execution plan. In order for a database query optimizer to evaluate the cost of different access plans, it must accurately estimate the cardinality at each stage of an access plan. I/O and CPU costs used to compare the various alternative access plans are very strongly influenced by the cardinality estimates. These costs are used to decide the access method for each relation, order in which relations are accessed, and type of operation, for example, which relations should be joined first and what join methods should be used. For this reason, generating good cardinality estimations is important. The application of a predicate to a relation includes and excludes certain rows of data and reduces the cardinality by a factor called the selectivity. The selectivity of a predicate is a measure of the reduction effect that the predicate often has on the cardinality.
A predicate imposes one or more conditions on the data in the rows to be included in the result set, which often reduces the number of rows that are candidates for inclusion in the result set. Therefore, the estimated cardinality associated with steps in a plan is derived directly from the estimated selectivity of the predicates in the steps.
It is common to treat the predicates of a query independently when computing the total selectivity of multiple predicates. However, the information manipulated by the predicates can be statistically correlated (i.e., they are related in some, perhaps non-obvious, way), and thus the net selectivity of the query is not always a product of the individual selectivity of each of the predicates in the query.
In a technique often used to evaluate the cost of alternate query execution strategies, an optimizer calls a selectivity estimation routine that accesses statistics stored in system tables relating to selectivity. The process of accumulating and maintaining useful up-to-date statistics can be quite complex and computationally intensive. Therefore, the development of non-statistical methods to estimate relational DBMS query predicate selectivities, and especially the development of those methods that can accommodate statistically correlated predicates is a valued and an active area of research.