In a DBMS, data is stored in one or more data containers, each container contains records, and the data within each record is organized into one or more fields. In relational database systems, the data containers are referred to as tables, the records are referred to as rows, and the fields are referred to as columns. In object oriented databases, the data containers are referred to as object classes, the records are referred to as objects, and the fields are referred to as attributes. Other database architectures may use other terminology.
The present invention is not limited to any particular type of data container or database architecture. However, for the purpose of explanation, the examples and the terminology used herein shall be that typically associated with relational databases. Thus, the terms “table”, “row” and “column” shall be used herein to refer respectively to the data container, record, and field.
A DBMS retrieves and manipulates data in response to receiving a database statement. Typically the database statement conforms to a database language, such as Structured Query Language (SQL). A database statement can specify a query operation, a data manipulation operation, or a combination thereof. A database statement that specifies a query operation is referred to herein as a query. The present invention is not limited to database statements that specify a particular type of operation. However, for the purpose of explanation, embodiments of present invention are illustrated using queries.
When a DBMS receives a query, the DBMS may generate an execution plan. An execution plan is important—it defines the steps and operations performed by a DBMS to carry out a request. DBMSs generate execution plans that are optimized for efficiency. When determining what steps to include in an execution plan, and the order in which the steps are performed, a DBMS accounts for many factors that affect efficiency.
For example, a query may request rows that contain a first value in a first column and a second value in a second column. If the first column is indexed, then a DBMS may generate an execution plan that uses the index to access data more efficiently. Such an execution plan may include the following steps, executed in the order indicated below.
1. Traverse the index and generate row ids of rows that contain the first value in the first column.
2. Obtain the set of rows identified by the row ids.
3. Filter the set of rows for the second value in the second column.
To determine an efficient execution plan for a query, the query optimizer relies on persistently stored table statistics to estimate the costs of alternative execution plans, and chooses the plan with the lowest estimated cost. The table statistics are computed and stored before the query is received. Table statistics are used to estimate important optimizer cost parameters such as selectivities of predicates and predicate clauses (i.e., the fraction or percentage of rows in a table that match some condition). Examples of table statistics include table cardinalities (the number of rows in a table), the number of distinct values for a column, the minimum and maximum value in the column, and histograms, which is data that specifies the distribution of values in the columns, i.e., the number of rows that have particular column values for a column or the number of rows that have a column value that falls within a range.
For some database statements, table statistics needed by the query optimizer may not be available, as might be the case when an application creates a temporary table only for the duration of the application and fails to collect table statistics on the table. Table statistics may also be out of date, for example, if many rows have been updated, inserted or deleted since the table statistics were computed.
Finally, table statistics may not be sufficient to estimate optimizer cost parameters accurately because of the complexity of expressions in the query, or because the table statistics are used to compute selectivities in a way whose accuracy depends on assumptions which may be invalid, such as the assumption of the independence of values which are actually correlated. All of these situations can result in the optimizer making poor cost estimations based on inaccurately computed selectivities, causing a poorly performing access plan to be selected, as illustrated by the following cases.
The following Query B1 illustrates a case where the assumption of independence of values is not valid and can lead to a miscalculation of selectivities.                select*from vehicles where        make=“Toyota” and        model=“Corolla”        
For purposes of illustration, the table statistics maintained for table vehicles indicate that 10% of the rows in vehicles include the value Toyota in column make and 1% of the rows contain the value Corolla in column model. Thus, the selectivity of the predicate make=Toyota is 0.1 and the selectivity of the predicate model=“Corolla” is 0.01. A query optimizer that assumes the values in the columns make and model are independent, computes the selectivity predicate clause in query B1 by computing the product of the selectivities of the constituent predicates, which is 0.001 (i.e. 0.1×0.01). However, the values in columns make and model are not independent, but are instead correlated. All Corollas are Toyota. Consequently, using the predicate make=“Toyota” in a conjunction with the predicate model=“Corolla” adds nothing to the selectivity of the predicate clause in query B1. By assuming independence of values, the query optimizer has inaccurately calculated selectivity.
Query B2 illustrates a case where selectivity may not be accurately calculated because predicates in database statements are too complex to calculate using available table statistics.                select*from person where        name like “% SMITH %”        
For purposes of illustration, histograms for column name indicate the distribution of values that fall within ranges, specifically, indicate the number of rows that have a value in name that falls within a certain range. The predicate name like “% SMITH %” is true for all values in name that match the wild card expression “% SMITH %”, that is, contain the text SMITH anywhere between the first and last character of the value. Any number of values in any one of the ranges could match the expression. It may not be possible to determine any selectivity based on table statistics available for name.
Query B3 illustrates yet another case where selectivity may not be accurately calculated because predicates in database statements are too complex to calculate using available table statistics.                select*from product where        price>2*costFor purposes of illustration, a histogram for column price and cost indicate the distribution of rows for particular values in the columns, i.e., the number of rows in which a particular value occurs in a column. Whether a row satisfies the predicate price>2*cost depends on the values in the columns price and cost and whether they satisfy the stated condition. It is not possible to calculate the number of rows that satisfy the predicate based on data in the histogram.        
While it may be possible to generate and maintain table statistics for rows that meet conditions that depend on one or more columns, there are so many possible combinations of columns that theoretically could be of interest that precomputing statistical information about each such combination in most cases is unrealistic. A more realistic approach would be to precompute statistics for certain combination of columns that have appeared in actual queries issued by users based on, for example, some workload that has been recorded. However, even this approach has severe drawbacks:                1. It does not help the “out-of-the-box” behavior, i.e., when the system does not yet have a recorded history of queries.        2. It does not help in “ad-hoc” query environments where new queries that have not previously been used are generated frequently.        
An approach to estimating predicate selectivities when they cannot be estimated or estimated reliably based on available table statistics is referred to as dynamic sampling. This technique is based on the idea of efficiently retrieving a subset of the rows of a table—a sample—and, based on that sample, estimating properties of the table and/or query. Dynamic sampling is used by a query optimizer to dynamically estimate predicate selectivities and statistics based on the specific needs of the optimizer for the database statement currently being optimized. Dynamic sampling undertaken to estimate the selectivity of a particular predicate or combination of predicates in a query is referred to herein as predicate sampling.
In dynamic sampling, when a database statement that contains predicates is received by a database server, the query optimizer generates and submits a query that randomly samples rows, evaluates the predicates against the rows, and computes sample statistics, such as the number of rows that satisfy the predicates and the number of rows sampled. The results returned by the sampling query are used by the optimizer to more accurately estimate predicate selectivities and statistics about the table, both of which allow the query optimizer to more accurately estimate execution plan costs and generate more efficient execution plans.
For example, while analyzing query B1, a query optimizer generates a query, issued to evaluate the predicate of query B1, make=“Toyota” and model=“Corolla”, for a randomly selected subset of rows. The results indicate that 1000 rows were sampled and of the sampled rows, 1% satisfied the predicate.
Unfortunately, executing a query to dynamically sample data requires additional resources such as CPU usage and possibly increased physical disk reads. For many queries, the benefits realized by using dynamic sampling to improve query optimization are outweighed by the overhead incurred to perform dynamic sampling.
Based on the foregoing, it is clearly desirable to provide a mechanism that reduces the overhead incurred to implement dynamic sampling for query optimization.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.