The present invention relates generally to the field of database systems. More particularly, the present invention relates to the field of histogram construction for database systems.
Computer database systems manage the storage and retrieval of data in a database. A database comprises a set of tables of data along with information about relations between the tables. Tables represent relations over the data. Each table comprises a set of records of data stored in one or more data fields. The records of a table are also referred to as rows, and the data fields of records in a table are also referred to as columns.
A database server processes data manipulation statements or queries, for example, to retrieve, insert, delete, and modify data in a database. Queries are defined by a query language supported by the database system. To enhance performance in processing queries, database servers use information about the data distribution to help access data in a database more efficiently. Typical servers comprise a query optimizer which estimate the selectivity of queries and generate efficient execution plans for queries. Query optimizers generate execution plans based on the query and in doing so exploits statistical information on the column(s) of the table(s) referenced in the queries.
Database servers may create histograms on the columns of tables to represent the distribution of a data. A histogram is one means of representing the distribution of data in a database. A histogram on a data attribute consists generally of a set of partitions or boundaries which divide the range of data on the attribute into a set of segments or buckets. Therefore, each bucket has an upper bucket boundary and a lower bucket boundary. Also associated with each bucket is a frequency which corresponds to the number of data tuples which fall within the boundaries of the bucket. The frequency associated with a bucket, or bucket frequency, is an indication of the density of data within the bucket""s boundaries, and should not be confused with the absolute value of the data within the bucket. To construct a histogram, traditional methods must scan at least a portion of the data. The data is scanned and sorted into ascending or descending order. This ordered list of data is then chopped into a desired number of buckets each having a set of boundaries which describe it. The number of values in each bucket is stored along with the bucket boundaries, while the data itself may then be discarded. In order to update a histogram as data changes, traditional methods rescan at least a portion of the data.
The accuracy of the estimations of the query optimizer are enhanced by the availability of histograms, however, creating and maintaining histograms can incur significant costs, particularly for large databases. This problem is particularly striking for multi-dimensional histograms that capture joint distributions of correlated data attributes. Although multi-dimensional histograms can be highly valuable, the relatively high cost of building and maintaining them often prevents their use. An alternative to multi-dimensional histograms is to assume that the attributes are independent, which enables using combinations of one-dimensional histograms. This approach is efficient but very inaccurate. The inaccuracy results in poor choice of execution plans by the query optimizer.
Building histograms by using feedback information about the execution of query workload rather than by examining the data helps reduce the cost of building and maintaining histograms.
A histogram may be defined such that the histogram comprises a predetermined number of buckets for storing data values. A bucket may comprise an upper bucket bound, a lower bucket bound, and a bucket frequency associated with the number of data tuples that fall within the bounds of the bucket.
In accordance with the present invention, a method is used to maintain a histogram in a database system comprising a database. The method may be implemented in the form of program modules or computer-executable instructions stored on a computer readable medium or in the form of an apparatus having means to perform the steps of the method.
For the method, a user query is examined and an estimated result is generated by a query optimizer in generating a query execution plan using an identified histogram. The actual result returned by the user query is compared to the estimated result and an estimation error is calculated. The estimation error is used to modify the identified histogram. The result used to calculate the estimation error may be the result size returned by the query.
The method may modify an existing histogram as data is updated or initialize a histogram if none exists on a given attribute. An initial histogram for a data attribute or a correlation of multiple data attributes may be constructed by assuming a uniform distribution of data attributes in the database. For multi-column histograms, multiple single dimension histograms may be combined to construct an initial multi-dimensional histogram.
The bucket frequencies of the histogram buckets used to generate the estimated result size may be refined by distributing the error amongst the buckets. The error may be distributed amongst the buckets in proportion to the relative frequencies of the buckets. The frequency of a bucket may be adjusted by the amount of error allocated to the bucket. A damping factor may be applied to smooth the effect of such modifications to the histogram arising out of differences between the estimated and observed numbers of tuples at the onset of a user query.
The upper and lower bounds of the histogram buckets may be restructured as the frequencies are refined. Adjacent buckets having similar frequencies may be selected for merging into a single bucket. After merging buckets, freed buckets may used to split into other buckets. Buckets having the highest frequencies may be selected for splitting. The freed buckets may be assigned to the high frequency buckets in proportion to their relative frequencies.
The updating of the histogram may be performed on-line after each user query or off-line by accessing a workload log that records the series of queries executed against the system. The histogram may be selectively updated when the estimation error of query exceeds a threshold error.