The present invention relates to the field of computer systems. More particularly, the invention relates to a method and system for database optimization.
A xe2x80x9cqueryxe2x80x9d is a statement or collection of statements that is used to access a database. Specialized query languages, such as the structured query language (xe2x80x9cSQLxe2x80x9d) are often used to interrogate and access a database. Many types of queries include at least the following. First, the identity of the database object(s) being accessed to execute the query (e.g., one or more named database tables). If the query accesses two or more database objects, what is the link between the objects (e.g., a join condition or column). The typical query also defines selection criteria, which is often referred to as a matching condition, filter, or predicate. Lastly, a query may define which fields in the database object are to be displayed or printed in the result.
Optimization is the process of choosing an efficient way to execute a query statement. Many different ways are often available to execute a query, e.g., by varying the order or procedure in which database objects and indexes are accessed to execute the query. The exact execution plan or access path that is employed to execute the query can greatly affect how quickly or efficiently the query statement executes.
Cost-based optimization is an approach in which the execution plan is selected by considering available access paths to determine the lowest cost approach to executing the query. In one approach, cost-based optimization consists of the following steps: (1) generating a set of potential execution plans for the database statement to be executed; (2) estimating the cost for each execution plan; and (3) comparing the costs of the execution plans to identify the execution plan having the lowest cost. Conceptually, the term xe2x80x9ccostxe2x80x9d relates to the amount of a given resource or set of resources needed to process an execution plan. Examples of such resources include I/O, CPU time, and memory. Various measures may be used to identify the execution plan having the lowest cost. For example, the cost-based approach may be used to identify the execution plan providing either the best throughput or the best response time.
Many database optimizers use statistics to calculate the xe2x80x9cselectivityxe2x80x9d of predicates and to estimate the cost of performing database operations. Statistics quantify characteristics of database and schema objects, such as the data distribution and storage characteristics of tables, columns, indexes, and partitions. Selectivity refers to the proportion or fraction of a database object corresponding to a query predicate. An optimizer uses the selectivity of a predicate to estimate the cost of a particular access method and to determine optimal join order.
Statistics should be gathered on a regular basis to provide the optimizer with needed information about schema objects. Significant costs may be incurred to collect and maintain statistics for database objects. To reduce this collection cost and improve performance, many database systems use data sampling to reduce the amount of data that must be collected to provide statistics used by the optimizer. With data sampling, only a portion of the rows within a database table is accessed to generate a set of statistics for the entire table or column. The results of the data sampling is thereafter scaled upward to extrapolate the statistics values for the entire population. However, different data distributions may require different sample sizes in order to obtain accurate statistics. If a too-small sample size is selected, then the statistics may be inaccurate, which could lead to sub-optimal execution plans and poor query performance. If a too-large sample size is selected, then resources are wasted to collect more data than is needed to provide accurate statistics. Consequently, it is desirable to use only the minimal sample size needed for accurate statistics collection.
In addition to statistics, optimizers often use data value histograms to select an optimal execution plan. A data value histogram is a structure that provides estimates of the distribution of data values in a database object. A histogram partitions the data object values in a set of individual xe2x80x9cbucketsxe2x80x9d, so that all values corresponding to a given range fall within the same histogram bucket. The histogram provides information that is helpful in determining the selectivity of a predicate that appears in a query.
In a height-balanced histogram, each bucket of the histogram corresponds to an equal number of rows in a table. The boundaries of the buckets shrink or grow so that all buckets maintain the same number of entries. The useful information provided by the histogram is the range of values that corresponds to each bucket, e.g., the endpoints for each bucket of the histogram. Consider a column C with values between 1 and 100 in which the column data is uniformly distributed. FIG. 1a shows a height-balanced histogram plotted for this column having ten buckets. The number of rows in each bucket of the histogram is one-tenth the total number of rows in the table. Since the data values are evenly distributed, the endpoints of the buckets are also evenly spaced.
Now consider a second column having 100 rows for which column data values are not evenly spaced, in which ninety rows contain the value xe2x80x9c1xe2x80x9d and the other ten rows contain a column value between 2 and 100. FIG. 1b shows this column plotted in a height balanced histogram of ten buckets. Since ninety percent of the rows have the value xe2x80x9c1xe2x80x9d, nine of the ten buckets in the histogram of FIG. 1b also correspond to the value xe2x80x9c1xe2x80x9d. Thus, it can be seen that nine of the ten buckets in the histogram of FIG. 1b have endpoints that end in the number xe2x80x9c1xe2x80x9d. The last bucket corresponds to the ten rows in the column having data values between xe2x80x9c2xe2x80x9d and xe2x80x9c100xe2x80x9d. In operation, such a histogram provides an optimizer with instant knowledge of the selectivity of particular values of a column. This selectivity information can be used, for example, to determine whether either a full table scan or an index access provides the most efficient path to satisfying a query against the database table corresponding to the histogram.
Other types of histograms also exist. For example, another histogram used by optimizers is the width-balanced histogram, in which column data is divided into a number of fixed, equal-width ranges and the histogram is organized to count the number of values falling within each range.
A histogram may not always provide an appreciable benefit. For example, a histogram may not be useful for a data set having uniform data distribution, since it can be assumed that all data within that set are equally distributed and therefore the histogram will not provide any additional useful information. If a histogram is desired, a significant amount of resources may be needed to collect, maintain, and use histograms. Therefore, it makes sense to only create, store, and/or use a histogram when such a histogram provides benefits greater than the expense of the histogram. However, conventional database systems typically rely upon the skill and knowledge of individual database administrators to manually decide whether histograms should or should not be collected for columns in the database. While guidelines may be provided to assist this decision-making, this manual process by administrators often leads to inconsistent and erroneous decisions resulting in the collection and storage of unneeded histograms, or the failure to collect histograms that could provide more efficient query processing.
The present invention provides a method and system for determining when to collect histograms. In an embodiment, the invention provides a mechanism for automatically deciding when to collect histograms upon request from the user. This decision is based on the columns the user is interested in, the role these columns play in the queries as submitted to the system, and the underlying distribution for these columns, e.g., as seen in a random sample. The user specifies which columns are of interest, and the database is configured to collect column usage information that describes how each column is being used in the workload. This column usage information could be stored in memory and periodically flushed to disk. Given a set of potential columns, the distribution of those columns is viewed in combination with the usage information to determine which columns should have histograms.
The invention also provides a system and method for determining an adequate sample size for statistics collection. In one embodiment, the invention provides a mechanism for automatically determining an adequate sample size for both statistics and histograms. This is accomplished via an iterative approach where the process starts with a small sample, and for each attribute which may need more data, the sample size is increased while restricting the information collected to only those attributes that require the larger sample.