A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever. The following notice shall apply to this document: Copyright(copyright) 2000, Microsoft, Inc.
The present invention pertains generally to computer-implemented databases, and more particularly to generating query plans for such databases.
Many computer software applications use database management systems (DBMS) to provide for persistent storage of data required by the applications. Database management systems provide interfaces for the storage and retrieval of the data. As the storage capacity of persistent storage units such as disks increase, applications have taken advantage of the increased capacity by using the DBMS to manage ever-increasing amounts of data.
As the volume of data increases, it becomes important for the DBMS to optimize access to the data. Accesses to the data typically take the form of a query, which returns a results set of data that meet the criteria set forth in the query. One technique used by many DBMS systems is to provide a query optimizer that analyzes query conditions and the organization of the data and determines an optimal method of retrieving the data.
In order for such query planning to be successful, the query optimizer must have access to accurate information regarding the distribution of data values that are relevant to the query. In order to obtain such distribution data, previous systems have used two mechanisms, random sampling and histograms.
In random sampling, randomly selected sections of the database are sampled, and the distribution of data values obtained in the samples is used to extrapolate the distribution of values for the database as a whole. Random sampling has the advantage that the entire database is not scanned, thereby providing performance benefits. However, random sampling can lead to inaccurate estimates of the distribution of data values in the database. This is because indexing mechanisms used to improve retrieval of result sets of queries tend to cluster like values together in the same section of the database. Thus, random sampling will tend to either overestimate the distribution of frequently occurring values while missing outlying values, or will overestimate outlying values while missing frequently occurring values.
Histograms provide another way of tracking and estimating the distribution of data in a database. The attributes of interest are selected, and the database is scanned to create a histogram representing the distribution of values for the attribute. Many systems use what is referred to as an equi-depth histogram. A disadvantage of the equi-depth histogram is that it assumes a uniform distribution of values for the attribute. This is often not the case. For example, consider an attribute comprising the age of an employee. Many of the values will be within the range of 25-65, with some outlying values above and below this range. Such a distribution is clearly not uniform, and will be poorly represented by an equi-depth histogram.
In either case, the distribution data used by the query optimizer to determine optimal plans for retrieving the data will be flawed, resulting in poorly executed queries. Thus there is a need for a system that provides an accurate representation of the distribution of attribute values in a database. The system should maintain accuracy without consuming large amounts of memory or other computing resources.
The above-mentioned shortcomings, disadvantages and problems are addressed by the present invention, which will be understood by reading and studying the following specification.
The systems and methods described herein create and maintain a maxdiff histogram for use in determining query costs. One aspect of the system is a data structure that provides fields that can be used to accurately represent a distribution of data regardless of the uniformity or lack thereof of the data. The fields of the data structure represent buckets in the histogram. The fields include a range_Hi_Key field indicating the upper bound for values represented by the bucket. The range_Hi_Key field is also the most frequently occurring value in the bucket. In addition, the fields include a cardEQ field representing the count of the most frequently occurring value, a cardLT field, which is the count of the values in the bucket that are less than the range_Hi_Key field, a LTDistinct field, which is a count of the number of distinct values represented by the bucket, and an LTDensity field, which is an average count for each of the attribute values in the bucket that are not the range_Hi_Key value.
A further aspect is a method that creates and maintains the maxdiff histogram data structure. The method starts by creating a list of unused buckets. An input stream of attribute values is sorted and the following acts are performed for each value. If the new value is the same as the previous value, the cardEQ field is incremented. If not, the method checks to see if the histogram is full. If a bucket is available, it is allocated and the bucket fields are initialized. If a bucket is not available, the two buckets that have the least variance between them are merged into one bucket, and the freed bucket is made available for the newly read input value. Whenever a new bucket is created or altered, the variance between the bucket and its neighbors is recalculated.
A still further aspect is that a system comprises a database management system (DBMS) having a query optimizer that includes a costing component and a selectivity estimator. The DBMS is operative to create a maxdiff histogram for use by the costing component and the selectivity estimator.